Introduction to Internationalization & Localization

Globalization of software applications

Updated 2011.03.04 20:34 +0100

By Michael Suodenjoki, michael@suodenjoki.dk.

Version 1.1, March 2011 - Minor update in Localization Tools section.
Version 1.0, October 2001

Terminology
A small introduction
The degree of an application's internationalization
Localization Tools
Bibliography (Books & Periodicals)
Links
Appendix A: File Formats

Introduction

I have been interested in internationalization and localization for some years now. My entry point into the field has been from the technical side. I work in a company that for over a decade has targeted the international software market, but for some reason the software (mainly C++ based Windows programs) has not yet met the real challenges of some of the more complex regions of the world, e.g. Asian or Arabic countries. The programs must therefore be ready to cope with these special problems when they hopefully occur (character sets, scripts, translation, input method editors (IMEs), Unicode, resource files, multilingual user interfaces etc.).

On this page I have collected most of the information I have gathered from the Internet. It includes a large collection of links to relevant companies, organizations and articles. I hope that you - just as I - can use them.

For a newcomer the localization and translation industries can be quite confusing. The industry is divided into a lot of different organizations and companies selling a wide range of products that are often difficult to understand and work with. A lot of special terminology is used and a lot of different (file) formats are available. Luckily there are tendencies towards more standardization. Some organizations have been trying to standardize file interchange formats. Typically these organizations are supported by a group of companies providing the real tools and products, which you often must pay a lot of money for. And that doesn't necessarily guarantee that your requirements are met.

For an introduction to the localization subject you may read my "A small introduction" section below; however, I suggest browsing through the terminology section first.

1 Terminology

Character: A character in software development is an abstraction. The natural understanding of a character is that of a written character; one intuitively associates a certain graphic representation with a given character. This is what is called a glyph: the actual shape of a character image. A glyph appears on a display or is produced by a printer. Naturally, there can be many such representations. You can, for example, render the characters ABC in several different typefaces. A set of glyphs is called a font. So indeed, one aspect of a character is its graphic representation. However, for the purpose of data processing in software development, a character also needs to have a data representation as a sequence of bits. This is called a code.
Character Code: A character code is a sequence of bits representing a character. Again, there are many such representations. The character a, for instance, can be represented as 0x61 in ASCII, as 0x81 in EBCDIC or as 0x0061 in Unicode (in its 16-bit encoding form). From this example you can see that not only the bit pattern but also the number of bits used for representing a character can vary; the bit pattern representing the character a has 16 bits in Unicode but only 8 bits in ASCII and EBCDIC.
Codeset: A character codeset is a 1:1 mapping between characters and character codes.
Encoding: A character encoding scheme is a set of rules for translating a byte sequence into a sequence of character codes.
g11n: The abbreviation for globalization - 11 characters between g and n.
Globalize, Globalization: The term is used for the internationalization and the localization process together or the concept to produce software that works globally. The Localization Industry Standards Association (LISA) defines globalization as follows:

"Globalization addresses the business issues associated with taking a product global. In the globalization of high-tech products this involves integrating localization throughout a company, after proper internationalization and product design, as well as marketing, sales, and support in the world market."
i18n: The abbreviation for internationalization - 18 characters between i and n.
Internationalize, Internationalization: The process of enabling your source for the international market. Internationalization is the design and development of software in a way that allows it to be localized (translated) to other locales (languages) without the need to alter the source code. Common errors are due to both cultural and locale differences. The Localization Industry Standards Association (LISA) defines internationalization as follows:

"Internationalization is the process of generalizing a product so that it can handle multiple languages and cultural conventions without the need for re-design. Internationalization takes place at the level of program design and document development."
Language Engineering: The process of converting human knowledge of a language into a computer model, so that computer programs can utilize this knowledge, e.g. for automatic translation. The Euromap Report, published on behalf of the EUROMAP Consortium in 1998, defined language engineering as follows:

"Language engineering is the application of knowledge of written and spoken language to the development of information, transaction and communication systems, so that they can recognize, understand, interpret, and generate human language. Language technologies include, for example, automatic or computer assisted translation (CAT), speech recognition and synthesis, speaker verification, semantic searches and information retrieval, text mining and fact extraction."
Locale: The features of the user's environment that are dependent on language, country, and cultural conventions. The locale determines conventions such as sort order, keyboard layout, and date, time, number and currency formats. In Windows, locales usually provide more information about cultural conventions than about languages. Simply put, a locale is a collection of user preference information related to the user's language and sub-language.
l10n: The abbreviation for localization - 10 characters between l and n.
Localize, Localization: The process of adapting, translating and customizing a product for a specific market (for a specific locale). The Localization Industry Standards Association (LISA) defines localization as follows:

"Localization involves taking a product and making it linguistically and culturally appropriate to the target locale (country/region and language) where it will be used and sold."
Multilingual, Multilanguage: Supporting more than one language simultaneously. Often implies the ability to handle more than one script or character set.
Script: A system of characters used to write one or several languages.
Translation: The process of translating a piece of text from one language (the source language) to another language (the target language). The translation process is one of the major parts of localization. A localization project, however, also involves many other tasks in addition to translation, such as project management, software engineering, testing, and desktop publishing.
Translation Memory: A collection of translations of words, terms, phrases or sentences from a source language to a target language. A translation memory can be used by tools to automate translation more or less intelligently, to provide a framework in which the tools suggest translations, or to ensure that the same words, terms, phrases or sentences of the source language are always translated into the same words, terms, phrases or sentences of the target language. Translation memories are typically stored in standard file formats (e.g. TMX, or TBX for terminology) or in databases. A simple example is the Microsoft Glossaries, which provide the translations of the most commonly used Windows terms (e.g. for menu item texts). Note that a file representing a translation memory is not necessarily fit for communicating localization translations (in a localization package) or a particular translation of a particular application. It merely represents typical translations, translations for a particular industry (glossaries), or translations for a suite of products, which the translator can use to help (more or less automatically) with the translation.
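
The Character Code entry above can be checked with a few lines of Python. This is only a small sketch: the helper name code_of is made up for the illustration, and "cp037" is assumed here as one common EBCDIC code page (other EBCDIC variants exist).

```python
def code_of(char: str, encoding: str) -> str:
    """Return the character code of `char` in `encoding` as a hex string."""
    return "0x" + char.encode(encoding).hex().upper()


# The same character has different codes, and different code lengths,
# depending on the codeset that is used.
print(code_of("a", "ascii"))      # ASCII
print(code_of("a", "cp037"))      # EBCDIC (code page 037, an assumption)
print(code_of("a", "utf-16-be"))  # 16-bit Unicode encoding form
print(code_of("æ", "utf-8"))      # a non-ASCII letter needs two bytes in UTF-8
```

The four calls give 0x61, 0x81, 0x0061 and 0xC3A6 respectively, which matches the values quoted in the Character Code entry for the first three.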

2 A small introduction

When you begin to internationalize your software product there are several steps that must be considered and handled. In this small introduction I will try to describe these steps. I'm not a professional localizer, so there may be issues that I have described incorrectly or too simply. But let's start:

Internationalization. First you must internationalize your source code. This means that the code should be programmed/authored in such a manner that it doesn't depend on any specific locale - that is, on a language or on cultural conventions. All the locale-dependent material should preferably be put into separate files - the files that can later be localized (translated) into a new locale (language). This is not a simple process, and it involves many considerations that depend on the nature of your source code (whether it is e.g. C++ source files or HTML pages). You can buy thick books describing these considerations. It's important that all involved personnel (programmers and authors) are told what to do to implement internationalization (e.g. via standards).
Preparation (pre-processing). When the source code has been internationalized, the locale-dependent part (the localization source) is ready to be localized. The hardest work of localization is often the basic translation of the text. The text to translate is provided in a source language (e.g. English), and the language to translate into is called the target language (e.g. Danish). Often you cannot translate the text yourself because you don't speak or understand the target language, so you must engage a third party to do the translation. This could be a person (e.g. a freelance translator), a translation company or maybe even your customer. In any case you must supply the source to be translated to the third party in some form. Often you need to prepare the localization source first, either because it must be packed (for convenient transportation) or because extra information is needed during translation. Often both are necessary. The extra information could be such things as the source language id, the target language id, instructions on how to translate, the tone of language, context information, your standard glossaries (terms), the current status of the translation and so on. It can be very complex, and it often is. You can spend (a lot of) money on tools that help you prepare your locale-dependent source, but it is not easy to find a tool that suits your specific needs. Hopefully standardization will help in the future.
Localization. When you have prepared your localization package you can send it to localization (translation). A couple of weeks later you hopefully receive the result of the localization from the third party (the localization target). The ultimate goal is that the localization package contains everything needed for localization, but it is typically not that simple. Translation, for example, is often highly dependent on the context, or the localizer must be an expert in the product to be able to localize it satisfactorily. Such things are difficult to capture in a software file.
Merging. When you have received the result of the localization you extract the localization target from the localization package and merge it into your new localization source. If your version of the localization source hasn't changed since you last sent it to localization, the merge process is simple. Otherwise the merge process can be quite complex and will most likely need human intervention.
Post-processing. To finalize, you sometimes need to post-process the localization source (e.g. compile or link it) to produce the final localized product. This step should preferably be as independent of the locale-independent source as possible. The best situation occurs when you can distribute the localization source (or the post-processed localization source) by itself to the customers, thereby providing them with a system that miraculously supports a new locale.
Quality Assurance. When you have post-processed you can start checking whether everything is localized correctly. Again this is difficult. Often there is a huge amount of text that must be checked, and the product itself must be tested rigorously. And as a result you may have to start a new preparation for localization.
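
The internationalization step above, where locale-dependent text is moved out of the code, might look roughly like the following Python sketch. The CATALOGS dictionary stands in for separate per-locale resource files, and the names CATALOGS, SOURCE_LOCALE and translate are invented for this illustration; they are not taken from any particular tool or library.

```python
# Locale-independent code: no user-visible text is hard coded in the logic.
# Each inner dictionary plays the role of one localizable resource file.
CATALOGS = {
    "en": {"greeting": "Welcome, {name}!", "quit": "Quit"},
    "da": {"greeting": "Velkommen, {name}!"},  # "quit" is not yet translated
}
SOURCE_LOCALE = "en"


def translate(key: str, locale: str, **params: str) -> str:
    """Look up `key` for `locale`, falling back to the source language."""
    catalog = CATALOGS.get(locale, {})
    template = catalog.get(key, CATALOGS[SOURCE_LOCALE][key])
    return template.format(**params)


print(translate("greeting", "da", name="Anna"))  # Danish text
print(translate("quit", "da"))                   # falls back to English
```

Adding a new language then only means adding a new catalog; the function itself does not change. Falling back to the source language is one possible design choice, and some projects prefer to fail loudly on a missing translation instead.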

It is clear that localization by the third party is most efficient if it is possible to actually see the result of the localization in place (directly in the product). It is much easier to see whether the localization has succeeded in the real, live product. Thus the post-processing should preferably be available to the third-party localizer. Often, though, this is difficult due to the complexity of the post-processing (e.g. compilation steps or third-party tools that need to be licensed).
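
As an illustration of the merging step described earlier in this section, here is a minimal sketch that assumes the localization source and target are simple key-to-text mappings. Real resource formats are richer, and the function name merge and the sample keys are hypothetical.

```python
def merge(new_source, old_source, old_target):
    """Merge a returned translation into a possibly changed source.

    Returns (merged, needs_review): `merged` maps each key of the new
    source to its translation, or to None when no valid translation exists.
    """
    merged, needs_review = {}, []
    for key, text in new_source.items():
        unchanged = old_source.get(key) == text
        if unchanged and key in old_target:
            merged[key] = old_target[key]  # translation is still valid
        else:
            merged[key] = None             # new or changed source text
            needs_review.append(key)
    return merged, needs_review


old_source = {"open": "Open", "save": "Save"}         # what was sent out
old_target = {"open": "Åbn", "save": "Gem"}           # what came back
new_source = {"open": "Open", "save": "Save file", "print": "Print"}

merged, needs_review = merge(new_source, old_source, old_target)
print(merged)
print(needs_review)
```

Only "open" keeps its translation here. The changed "save" text and the new "print" text end up in needs_review, which is the part that typically needs human intervention.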

3 The degree of an application's internationalization

The degree of an application's internationalization can be divided into four major levels:

No international support: The application works in one language. If that language is not English, it probably works on only specific language versions of Windows.
Locale-dependent source: Different code bases must be written and maintained for European, Far Eastern and Middle Eastern versions.
Single-source, locale-dependent binary: A single code base is written but separate compilations must be made for different languages or different Windows versions.
Single-source, single-binary: A single code base and a single compilation satisfy all language and platform versions.

Top tips to ensure your code is internationalized (courtesy of http://www.alchemysoftware.ie/work/workzone.html)

Eliminate UI length restrictions. Translated strings are typically longer than the English.
Ensure support for accented characters, including double-byte characters.
Check for hard coded strings
Enable support for foreign keyboard layouts
Avoid fixed date, time, currency or number formats
Avoid country specific language or jargon
Avoid text in bitmaps as they are hard to edit

Note that this list of tips is very small - there are many more things to consider.
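
To illustrate the tip about avoiding fixed date and number formats, the following sketch keeps the formats in a table instead of in the code. The CONVENTIONS table is hand-written and hypothetical; a real application would obtain these conventions from the operating system or from a library rather than maintain them itself.

```python
from datetime import date

# Hypothetical, hand-written convention table, for illustration only.
CONVENTIONS = {
    "en-US": {"date": "{m:02d}/{d:02d}/{y}", "decimal": ".", "group": ","},
    "da-DK": {"date": "{d:02d}-{m:02d}-{y}", "decimal": ",", "group": "."},
}


def format_date(value: date, locale: str) -> str:
    """Format a date using the pattern registered for `locale`."""
    pattern = CONVENTIONS[locale]["date"]
    return pattern.format(y=value.year, m=value.month, d=value.day)


def format_number(value: float, locale: str) -> str:
    """Format a number with two decimals and locale-specific separators."""
    conv = CONVENTIONS[locale]
    text = f"{value:,.2f}"  # uses "," and "." as neutral placeholders
    # Replace both placeholders in one pass so they cannot overwrite each other.
    table = str.maketrans({",": conv["group"], ".": conv["decimal"]})
    return text.translate(table)


print(format_date(date(2011, 3, 4), "en-US"), format_number(1234567.891, "en-US"))
print(format_date(date(2011, 3, 4), "da-DK"), format_number(1234567.891, "da-DK"))
```

The same date and number come out as 03/04/2011 and 1,234,567.89 for en-US but as 04-03-2011 and 1.234.567,89 for da-DK, which is why a format hard coded in the program logic breaks as soon as a second locale is added.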

4 Localization Tools

List of some localization/translation memory tools available:

AppLocalize from Sbuilders: www.sbuilders.com.
CATALYST™ from Alchemy Software at www.alchemysoftware.ie.
Déjà Vu from Atril at www.atril.com
ForeignDesk® Open Source Initiative from L10NBRIDGE. www.foreigndesk.net and www.lionbridge.com
Visual Localize at www.visloc.com
RL Tools from Microsoft Corporation. Search for article ID: Q110894 at msdn.microsoft.com. This toolset is rather old.
RWS Tools from RWS Group at www.translate.com
TRADOS Translation Suite, TagEditor, Translators Workbench from TRADOS at www.trados.com
Transit from STAR at www.star-transit.com
TransSolution from Hexadigm Systems at www.hexadigm.com provides a Visual Studio add-in to work with .resx files.
SDLX from SDL International at www.sdlintl.com

4.1 Internationalization Examination Tools

These kinds of tools examine your programs for possible problems with respect to using them in different locales (e.g. with respect to language, type of operating system, character sets and so on).

i18n Expeditor from OneRealm www.onerealm.com. Handles C++ and Java.

4.2 Internationalization Programming Libraries

Programming libraries that help you internationalize your source code.

International Components for Unicode (ICU). Provides an open source set of components in C++ or Java on a variety of platforms (including Windows).
The Dinkum CoreX library. Provides a set of character set converters (std::codecvt) built for Standard C++'s locale/facet framework.
GNU's libiconv library provides a set of C functions that support (character) conversion between a wide set of encodings.
GNOME libunicode library. It covers character set conversion, character properties, decomposition etc. The GNU libiconv library is probably a better choice - it is more complete and more actively maintained. Note that there are a few different libraries named libunicode out there. I know of one libunicode at SourceForge which implements a set of C functions handling Unicode strings. Essentially these mirror the normal string functions available in C, e.g. strcpy, strlen etc. Unicode versions of these are already available in most of today's C++ compilers.
Rosette Core Library for Unicode. Rosette Core Library for Unicode enables software engineers to quickly add support for over 150 of the world’s languages to their applications. Rosette Core Library for Unicode is a Unicode development library built based on Basis Technology’s experience implementing multilingual compliance into mission-critical systems in many different environments. Developers deploying Rosette Core Library for Unicode achieve multilingual support in their applications efficiently and economically.
Free recode library by François Pinard. I do not know much about this library but it seems to support more encodings than GNU's libiconv library. However, the API does not follow any standards.
Microsoft Layer for Unicode (MSLU). Provides a layer for running Unicode enabled applications on Windows 9x, ME.
RapidSolution (German) has RapidTranslation.

Note that C++ has its own standard that contains support for locales - the Standard C++ Library (previously STL). However, it does not include a standard way of converting between encodings.
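
What a character set converter such as those listed above does can be sketched in a few lines. The example uses Python's built-in codecs rather than any of the libraries mentioned, so it shows the idea only, not their APIs; the helper name convert is made up.

```python
def convert(data: bytes, source_encoding: str, target_encoding: str) -> bytes:
    """Convert encoded text by decoding to Unicode and re-encoding."""
    return data.decode(source_encoding).encode(target_encoding)


latin1 = b"S\xf8ren"  # the name "Søren" encoded in ISO 8859-1
utf8 = convert(latin1, "iso-8859-1", "utf-8")

print(latin1.hex(), "->", utf8.hex())
```

The single byte 0xF8 becomes the two-byte sequence 0xC3 0xB8 in UTF-8. Converting in the opposite direction can fail when the target encoding has no code for a character, so a real converter also needs a policy for unmappable characters.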

5 Bibliography

5.1 Books

[Dr.Intl. 2002]: Developing International Software by Dr. International, Microsoft Press, October 2002. An updated version of Nadine Kano's book from 1995 [Kano 1995]. More info at http://www.microsoft.com/globaldev/DIS_v2/disv2.asp
[Deitsch & Czarnecki 2001]: JAVA Internationalization by Andrew Deitsch and D. Czarnecki, O'Reilly, 2001. ISBN 0-596-00019-7.
[Esselink 2000]: A Practical Guide for Localization (2nd Edition) by Bert Esselink, John Benjamins Pub. Co., 2000. ISBN 1588110060. For more information see www.locguide.com
[Kaplan 2000]: Internationalization with Visual Basic by Michael S. Kaplan, Sams, 2000. ISBN 0-672-31977-2. More information at www.i18nwithvb.com
[Langer & Kreft 2000]: Standard C++ IOStreams and Locales by Angelika Langer & Klaus Kreft, Addison-Wesley, 2000. ISBN 0-201-18395-1. Langer & Kreft have also published an article The Locale Framework in the magazine C++ Report, September 1997, pp. 58-66(69). More info at http://home.camelot.de/langer/Articles/Internationalization/I18N.htm
[Schmidt 2000]: International Programming for Microsoft Windows by David A. Schmidt, Microsoft Press, 2000. ISBN 1-57231-956-9. Essential guidelines for globalizing and localizing your software with examples in Microsoft Visual C++ 6.0. Covers features for Windows 2000.
[Unicode 2000]: The Unicode Standard: Version 3.0 by The Unicode Consortium, Addison-Wesley, 2000. ISBN 0-201-61633-5. See also www.unicode.org.
[Lunde 1999]: CJKV Information Processing by Ken Lunde, O'Reilly, 1999. ISBN 1-56592-224-7. More information at www.oreilly.com/catalog/cjkvinfo
[Ott 1999]: Global Solutions for Multilingual Applications by Chris Ott, Wiley, 1999. ISBN 0-471-34827-9
[Kano 1995]: Developing International Software by Nadine Kano. Microsoft Press, 1995. ISBN 1-55615-840-8. For Windows 95 and Windows NT. A handbook for international software design. Can be read online at http://www.microsoft.com/globaldev/dis_v1/disv1.asp or via MSDN at http://msdn.microsoft.com/library/books/devintl/S24AA.htm.

In his book [Esselink 2000], Bert Esselink has a further reading section that is also available on the Internet at www.locguide.com/references/publications/books.htm. It's a huge collection of reference material - though many of the books are outdated.

Furthermore Sybase has a list of books on internationalization and localization. Again many of them are rather outdated, but the list covers a broader variety of operating systems such as Macintosh, internationalization in libraries such as X Windows, and more general usability books.

5.2 Periodicals

Language International. More information at www.language-international.com.
Multilingual Computing & Technology: The Magazine for Language Technology. More information at www.multilingual.com

6 Links

6.1 Organizations

Localization Industry Standards Association (LISA) at www.lisa.org
OpenTag at www.opentag.com
The Unicode Consortium at www.unicode.org
The International Organization for Standardization (ISO) at www.iso.ch

6.2 Localization Information Portals

www.microsoft.com/globaldev - Microsoft's global development portal
www.translation.net - a portal with all kinds of information, software and links regarding translation.
www.i18ngurus.com
www.i18n.com

6.3 Newsgroups

6.4 Mailing Lists

http://www.microsoft.com/globaldev/subscription/

6.5 Miscellaneous / Not yet grouped

Isys Information Architects: www.iarchitect.com e.g. the http://www.iarchitect.com/global.htm page.
Redcape Software Inc: www.redcape.com e.g. the http://www.redcape.com/i18n/index.htm page.
Lingscape: www.lingscape.com
www.ile.com/borneo
UniScape: www.uni-scape.com with the www.uni-scape.com/html/il8nview.htm page.
Internationalization standards and tools at ftp://dkuug.dk/i18n/index.htm
http://www.transarc.com/library/documentation/txseries/4.2/windows/erzhad/erzhad29.htm
The ISO8859 series at http://www.cs.tu-berlin.de/~czyborra/charsets
Progress Globalization Program at http://www.progress.com/services/partners/globalization/index.htm
Overview of many languages found at Berlitz GlobalNet: http://www.berlitzglobalnet.com/english/services/interpretation_languages.asp
"Options for Presentation of Multilingual Text: Use of the Unicode Standard" by Janet C. Erickson. http://dns.hti.umich.edu/htistaff/pubs/1997/janete.01/
www.linqualizer.net

6.6 Private persons interested in localization

Michael Gschwind with his Programming for Internationalization FAQ at www.vlsivie.tuwien.ac.at/mike/i18n.html
Peter Madsen with his International Country Codes page www.image.dk/~petermad/forms/ccode-uk.htm
Asmus Freytag
Nadine Kano
Sadius Eiva
Tiziana Perinotti

Appendix A: File Formats

This section contains a compiled list of file formats used by the localization and translation industry. Some of the formats are openly standardized whereas others are company proprietary. The list contains only a limited set of the file formats of source files - the original files that contain the text to be translated.

If you know of other formats please contact me.

Extension	Base format	Description	Sample Tool	Proprietor
.rc	ANSI Text	Resource Script files containing definitions of Windows resources such as dialogs, icons, bitmaps, menus and textual strings.	Microsoft Resource Editor, Visual Studio or normal text editor	Microsoft www.microsoft.com
.resx	XML	.NET Resource Script files (new .rc format for the .NET platform)		Microsoft www.microsoft.com
.res	Binary	Resource Files - contains compiled Windows Resource Script files. Resource files are stored directly in executables (exe/dll) as Unicode.	Microsoft Resource compiler (rc.exe)	Microsoft www.microsoft.com
.resources	Binary	.NET Resource Files - contains compiled .NET Resource Script files. Resource files are stored directly in executables (exe/dll).		Microsoft www.microsoft.com
.xlf	XML	XLIFF - XML Localization Interchange File Format. Provides an open standard for transporting text to be translated.		OASIS http://www.oasis-open.org/committees/xliff/
.xml	XML	OpenTag - Another format for translation memory exchange.
.tmx	XML	Translation Memory eXchange format. A format containing information about already translated words, phrases or sentences.		LISA www.lisa.org
.ttk	? (Binary)	Translation Tool Kits.	CATALYST™	Alchemy Software www.alchemysoftware.ie
.skl	Binary	?	RWS Tools (Rainbow)	RWS Group www.translate.com
.itd	?	Intermediate Translation Document	SDLX Tools	SDL International www.sdlintl.com
.tdb	?	Terminology Database. A kind of translation memory with a special focus on terminology (words and phrases).	SDLX Tools
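
To give an impression of what one of the XML-based formats in the table looks like, the sketch below reads a very small XLIFF 1.2 fragment with Python's standard library. The fragment is simplified and written for this illustration (the file name menu.rc and the unit ids are invented); real XLIFF files carry much more information, such as notes, states and inline markup.

```python
import xml.etree.ElementTree as ET

# Namespace of XLIFF 1.2 documents.
NS = {"x": "urn:oasis:names:tc:xliff:document:1.2"}

# Simplified, hand-written sample; names and ids are illustrative only.
SAMPLE = """<xliff version="1.2" xmlns="urn:oasis:names:tc:xliff:document:1.2">
  <file original="menu.rc" source-language="en" target-language="da" datatype="plaintext">
    <body>
      <trans-unit id="IDM_OPEN"><source>Open</source><target>Åbn</target></trans-unit>
      <trans-unit id="IDM_SAVE"><source>Save</source></trans-unit>
    </body>
  </file>
</xliff>"""


def read_units(xliff_text: str):
    """Return (id, source, target) tuples; target is None if untranslated."""
    root = ET.fromstring(xliff_text)
    units = []
    for unit in root.iterfind(".//x:trans-unit", NS):
        source = unit.findtext("x:source", default="", namespaces=NS)
        target = unit.findtext("x:target", default=None, namespaces=NS)
        units.append((unit.get("id"), source, target))
    return units


for unit_id, source, target in read_units(SAMPLE):
    print(unit_id, source, target if target is not None else "(not translated)")
```

Each trans-unit pairs a source text with an optional target text, which is what makes the format suitable for sending text out to translation and getting it back in the same file.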