I'm Michael Suodenjoki - a software engineer living in Kgs. Lyngby, north of Copenhagen, Denmark. This is my personal site containing my blog, photos, articles and main interests.

Introduction to Internationalization & Localization

Globalization of software applications

Updated 2011.03.04 20:34 +0100

By Michael Suodenjoki, michael@suodenjoki.dk.

Version 1.1, March 2011 - Minor update in Localization Tools section.
Version 1.0, October 2001

Introduction

I have been interested in internationalization and localization for some years now. My entry point into the field has been from the technical side. I work in a company that has targeted the international software market for over a decade, but for some reason the software (mainly C++-based Windows programs) has not yet met its real challenges in some of the more complex regions of the world, e.g. Asian or Arabic countries. The programs must therefore be ready to cope with these special problems when they hopefully occur (character sets, scripts, translation, keyboard input method editors (IMEs), Unicode, resource files, multilingual user interfaces etc.).

On this page I have collected most of the information I have gathered from the Internet. It includes a large collection of links to relevant companies, organizations and articles. I hope that you, just as I, can use them.

For a newcomer the localization and translation industries can be quite confusing. The industry is divided into many different organizations and companies selling a wide range of products that are often difficult to understand and work with. A lot of special terminology is used and many different (file) formats are available. Luckily there are tendencies towards more standardization. Some organizations have been trying to standardize file interchange formats. Typically these organizations are supported by a group of companies providing the real tools and products, which you often must pay a lot of money for. And that doesn't necessarily guarantee that your requirements are met.

For an introduction to the localization subject you may read my "A small introduction" section below; however, I suggest browsing through the terminology section below first.

1 Terminology

Character
A character in software development is an abstraction. The natural understanding of a character is that of a written character; one intuitively associates a certain graphic representation with a given character. This is what is called a glyph: the actual shape of a character image. A glyph appears on a display or is produced by a printer. Naturally, there can be many such representations: the characters ABC can be drawn in several different typefaces and styles. A set of glyphs is called a font. So indeed, one aspect of a character is its graphic representation. However, for the purpose of data processing in software development, a character also needs to have a data representation as a sequence of bits. This is called a code.
Character Code
A character code is a sequence of bits representing a character. Again, there are many such representations. The character a, for instance, can be represented as 0x61 in ASCII, as 0x81 in EBCDIC or as 0x0061 in Unicode (in its 16-bit encoding form). From this example you can see that not only the bit pattern but also the number of bits used for representing a character can vary; the bit pattern representing the character a has 16 bits in that Unicode form but only 8 bits in ASCII and EBCDIC.
Codeset
A character codeset is a 1:1 mapping between characters and character codes.
Encoding
A character encoding scheme is a set of rules for translating a byte sequence into a sequence of character codes.
g11n
The abbreviation for globalization - 11 characters between g and n.
Globalize, Globalization
The term is used for the internationalization and localization processes together, or for the concept of producing software that works globally. The Localization Industry Standards Association (LISA) defines globalization as follows:
"Globalization addresses the business issues associated with taking a product global. In the globalization of high-tech products this involves integrating localization throughout a company, after proper internationalization and product design, as well as marketing, sales, and support in the world market."
i18n
The abbreviation for internationalization - 18 characters between i and n.
Internationalize, Internationalization
The process of enabling your source for the international market. Internationalization is the design and development of software in a way that allows it to be localized (translated) to other locales (languages) without the need to alter the source code. Common errors are due to both cultural and locale differences. The Localization Industry Standards Association (LISA) defines internationalization as follows:
"Internationalization is the process of generalizing a product so that it can handle multiple languages and cultural conventions without the need for re-design. Internationalization takes place at the level of program design and document development."
Language Engineering
The process of converting human knowledge of a language into a computer model, so that computer programs can utilize this knowledge, e.g. for automatic translation. The Euromap Report, published on behalf of the EUROMAP Consortium in 1998, defined language engineering as follows:
"Language engineering is the application of knowledge of written and spoken language to the development of information, transaction and communication systems, so that they can recognize, understand, interpret, and generate human language. Language technologies include, for example, automatic or computer-assisted translation (CAT), speech recognition and synthesis, speaker verification, semantic searches and information retrieval, text mining and fact extraction."
Locale
The features of the user's environment that are dependent on language, country, and cultural conventions. The locale determines conventions such as sort order; keyboard layout; and date, time, number and currency formats. In Windows, locales usually provide more information about cultural conventions than about languages. Simply put, a locale is a collection of user preference information related to the user's language and sub-language.
l10n
The abbreviation for localization - 10 characters between l and n.
Localize, Localization
The process of adapting, translating and customizing a product for a specific market (for a specific locale). The Localization Industry Standards Association (LISA) defines localization as follows:
"Localization involves taking a product and making it linguistically and culturally appropriate to the target locale (country/region and language) where it will be used and sold."
Multilingual, Multilanguage
Supporting more than one language simultaneously. Often implies the ability to handle more than one script or character set.
Script
A system of characters used to write one or several languages.
Translation
The process of translating a piece of text from one language (the source language) to another language (the target language). The translation process is one of the major parts of localization. However, a localization project also involves many other tasks in addition to translation, such as project management, software engineering, testing, and desktop publishing.
Translation Memory
A collection of translations of words, terms, phrases or sentences from a source language to a target language. Tools can use a translation memory to automate translation more or less intelligently, to suggest translations to the translator, or to ensure that the same words, terms, phrases or sentences of the source language are always translated into the same words, terms, phrases or sentences of the target language. Translation memories are typically stored in standard file formats (e.g. TMX or TBX) or in databases. A simple example is the Microsoft Glossaries, which provide the translations of the most commonly used Windows terms (e.g. for menu item texts). Note that a file representing a translation memory is not necessarily fit for exchanging the translations of a particular application (in a localization package). It merely represents typical translations, translations for a particular industry (glossaries), or translations for a suite of products, which the translator can use to help (more or less automatically) with the translation.
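
The Character Code and Encoding entries above can be made concrete with a few lines of Python. This is only an illustrative sketch: the helper name `encoded_forms` is made up for this example, and `cp500` is just one of several EBCDIC code pages that Python ships with. It shows the code values quoted above for the character a, and then how the same byte sequence yields different characters when it is read under two different encoding schemes.

```python
def encoded_forms(ch):
    """Return the code point of ch and its bytes under a few encodings."""
    return {
        "code point": ord(ch),
        "ascii": ch.encode("ascii"),
        "ebcdic": ch.encode("cp500"),        # cp500 is one EBCDIC code page
        "utf-16": ch.encode("utf-16-be"),    # big-endian, no byte order mark
    }


forms = encoded_forms("a")
print(hex(forms["code point"]))    # 0x61
print(forms["ascii"].hex())        # 61
print(forms["ebcdic"].hex())       # 81
print(forms["utf-16"].hex())       # 0061

# The same two bytes read under two different encoding schemes:
raw = "\u00e6".encode("utf-8")     # the Danish letter ae (U+00E6) as UTF-8
print(raw.hex())                   # c3a6
print(len(raw.decode("utf-8")))    # 1 (one character, as intended)
print(len(raw.decode("latin-1")))  # 2 (two unrelated characters)
```

The last two lines are the typical source of garbled text: the bytes are unchanged, only the assumed encoding differs.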

2 A small introduction

When you begin to internationalize your software product there are several steps that must be considered and handled. In this small introduction I will try to describe these steps. I'm not a professional localizer, so there may be issues that I have described incorrectly or too simply. But let's start:

  1. Internationalization. First you must internationalize your source code. This means that the code should be programmed/authored in such a manner that it doesn't depend on any specific locale, that is, on any language or cultural conventions. All the locale-dependent material should preferably be put into separate files, which can later be localized (translated) into a new locale (language). This is not a simple process and it involves many considerations that depend on the nature of your source code (e.g. whether it is C++ source files or HTML pages). You can buy thick books describing these considerations. It's important that all involved personnel (programmers and authors) are told what to do to implement internationalization (e.g. via standards).

    Figure illustrating sample software system with locale independent source (c++ files) and locale dependent source (resource script files).

  2. Preparation (pre-processing). When the source code has been internationalized, the locale-dependent part (the localization source) is ready to be localized. The hardest work of the localization is often the basic translation of the text. The text to translate is provided in a source language (e.g. English) and the language to translate to is called the target language (e.g. Danish). Often you cannot translate the text yourself because you don't speak or understand the target language. Therefore you must engage a third party to do the translation. This could be a person (e.g. a freelance translator), a translation company or maybe even your customer. In any case you must supply the source to be translated to the third party in some form. Often you need to prepare the localization source, either because it must be packed (for convenient transportation) or because it needs extra information during translation. Often both apply. The extra information could be such things as the source language id, the target language id, information about how to translate, tone of language, context information, your standard glossaries (terms), the current status of the translation etc. It can be very complex, and it often is. You can spend (a lot of) money buying tools that help you prepare your locale-dependent source, but it is not easy to find a tool that suits your specific needs. Hopefully standardization will help in the future.

    Preparation

  3. Localization. When you have prepared your localization package you can send it to localization (translation). A couple of weeks later you hopefully receive the result of the localization from the third party (the localization target). The ultimate goal is that the localization package contains everything needed for localization, but it is typically not that simple. Translation, for example, is often highly dependent on the context, or the localizer must be an expert in the product to be able to localize it satisfactorily. Such things are difficult to capture in a software file.
  4. Merging. When you have received the result of the localization you extract the localization target from the localization package and merge it into your new localization source. If your version of the localization source hasn't changed since you last sent it to localization, the merge process is simple. Otherwise the merge process can be quite complex and will most likely need human intervention.
  5. Post-processing. To finalize you sometimes need to post-process the localization source (e.g. compile or link) to produce the final localized product. This step should preferably be as independent of the locale-independent source as possible. The best situation occurs when you can distribute the localization source (or the post-processed localization source) by itself to the customers, thereby providing them with a system that miraculously supports a new locale.
  6. Quality Assurance. When you have post-processed you can start checking whether everything is localized correctly. Again this is difficult. Often there is a huge amount of text that must be checked, and the product itself must be tested rigorously. And as a result, you may have to start a new preparation for localization.
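
Step 1 above, keeping locale-dependent text out of the code, can be sketched in a few lines of Python. Everything here is hypothetical and only for illustration: the `CATALOGS` dictionary stands in for separate, translatable files (such as resource scripts), and the keys and the `translate` helper are invented names. The point is that the program refers to texts by key only, and falls back to the source language when a translation is missing.

```python
# Hypothetical message catalogs. In a real product these would live in
# separate files that can be sent to translation, not in the code itself.
CATALOGS = {
    "en": {
        "file.save": "Save",
        "file.print": "Print",
        "greeting": "Hello, {name}!",
    },
    "da": {
        "file.save": "Gem",
        "greeting": "Hej, {name}!",
        # "file.print" has not been translated yet
    },
}
SOURCE_LANGUAGE = "en"


def translate(key, language, **params):
    """Look up key for language, falling back to the source language."""
    catalog = CATALOGS.get(language, {})
    template = catalog.get(key, CATALOGS[SOURCE_LANGUAGE][key])
    return template.format(**params)


print(translate("greeting", "da", name="Michael"))  # Hej, Michael!
print(translate("file.print", "da"))                # Print (fallback)
```

Note that the greeting is stored as one whole sentence with a placeholder instead of being assembled from fragments in code, since word order differs between languages.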

It is clear that the localization by the third party can be most efficient if it is possible to actually see the result of the localization in place (directly in the product). It is much easier to see whether localization has succeeded in the real live product. Thus the post-processing should preferably be available to the third-party localizer. Often, though, this is difficult due to the complexity of the post-processing (e.g. compilations or third-party tools that need to be licensed).
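
The merge in step 4 above can be sketched as follows. This is a simplified model under stated assumptions, not how any particular tool works: texts are identified by keys, `sent_source` is the source text as it looked when the package was sent, and the function name `merge_translations` is invented for the example. A returned translation is accepted only if its source text is unchanged; everything else is flagged for the human intervention mentioned above.

```python
def merge_translations(current_source, sent_source, returned_target):
    """Merge returned translations into the current localization source.

    Returns (merged, needs_review, obsolete):
      merged       - key -> text to use in the localized product
      needs_review - keys that are new or whose source text has changed
      obsolete     - translated keys that no longer exist in the source
    """
    merged = {}
    needs_review = []
    for key, text in current_source.items():
        unchanged = sent_source.get(key) == text
        if unchanged and key in returned_target:
            merged[key] = returned_target[key]
        else:
            merged[key] = text          # keep source text until retranslated
            needs_review.append(key)
    obsolete = [key for key in returned_target if key not in current_source]
    return merged, needs_review, obsolete


sent = {"ok": "OK", "cancel": "Cancel", "help": "Help"}
returned = {"ok": "OK", "cancel": "Annuller", "help": "Hjelp"}
# Meanwhile the source changed: "help" was reworded, "close" added,
# and "cancel" removed.
current = {"ok": "OK", "help": "Help and support", "close": "Close"}

merged, needs_review, obsolete = merge_translations(current, sent, returned)
print(merged)        # {'ok': 'OK', 'help': 'Help and support', 'close': 'Close'}
print(needs_review)  # ['help', 'close']
print(obsolete)      # ['cancel']
```

If the source has not changed since it was sent, `needs_review` and `obsolete` are both empty, which corresponds to the simple case described in step 4.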

3 The degree of an application's internationalization

The degree of an application's internationalization can be divided into four major levels:

  1. No international support: The application works in one language. If that language is not English, it probably works on only specific language versions of Windows.
  2. Locale-dependent source: Different code bases must be written and maintained for European, Far Eastern and Middle Eastern versions.
  3. Single-source, locale-dependent binary: A single code base is written but separate compilations must be made for different languages or different Windows versions.
  4. Single-source, single-binary: A single code base and single compilation satisfies all language and platform versions.
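
One common way to approach level 4 is to let the single binary choose its resources at run time from the user's locale. The Python sketch below illustrates only that selection step, under assumptions of my own: locale identifiers of the form `da-DK`, a default language of `en`, and the invented helper names `fallback_chain` and `pick_resources`.

```python
def fallback_chain(locale_id, default="en"):
    """Return the locale ids to try, most specific first."""
    parts = locale_id.split("-")
    chain = ["-".join(parts[:n]) for n in range(len(parts), 0, -1)]
    if default not in chain:
        chain.append(default)
    return chain


def pick_resources(locale_id, available):
    """Return the best available resource set for locale_id."""
    for candidate in fallback_chain(locale_id):
        if candidate in available:
            return candidate
    raise LookupError("no resources found for " + locale_id)


installed = {"en", "da", "de"}                # resource sets shipped
print(fallback_chain("da-DK"))                # ['da-DK', 'da', 'en']
print(pick_resources("da-DK", installed))     # da
print(pick_resources("ja-JP", installed))     # en
```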

Top tips to ensure your code is internationalized (courtesy of http://www.alchemysoftware.ie/work/workzone.html)

Note that this list of tips is very small - there are many more things to consider.

4 Localization Tools

List of some localization/translation memory tools available:

4.1 Internationalization Examination Tools

These kinds of tools examine your programs for possible problems with using them in different locales (e.g. with respect to language, type of operating system, character sets and so on).

4.2 Internationalization Programming Libraries

These programming libraries help you internationalize your source code.

Note that C++ has its own standardized support for locales in the Standard C++ Library (of which the former STL is a part). However, it does not include a standard way of converting between encodings.

5 Bibliography

5.1 Books

Developing International Software, 2002. [Dr. Intl. 2002]
Developing International Software, 1995. [Kano 1995]
A Practical Guide for Localization (2nd Edition), 2000. [Esselink 2000]
Standard C++ IOStreams and Locales, 2000. [Langer & Kreft 2000]
International Programming for Microsoft Windows, 2000. [Schmidt 2000]

[Dr. Intl. 2002]
Developing International Software by Dr. International, Microsoft Press, October 2002. An updated version of Nadine Kano's book from 1995 [Kano 1995]. More info at http://www.microsoft.com/globaldev/DIS_v2/disv2.asp
[Deitsch & Czarnecki 2001]
JAVA Internationalization by Andrew Deitsch and D. Czarnecki, O'Reilly, 2001. ISBN 0-596-00019-7.
[Esselink 2000]
A Practical Guide for Localization (2nd Edition) by Bert Esselink, John Benjamins Pub. Co., 2000. ISBN 1588110060. For more information see www.locguide.com
[Kaplan 2000]
Internationalization with Visual Basic by Michael S. Kaplan, Sams, 2000. ISBN 0-672-31977-2. More information at www.i18nwithvb.com
[Langer & Kreft 2000]
Standard C++ IOStreams and Locales by Angelika Langer & Klaus Kreft, Addison-Wesley, 2000. ISBN 0-201-18395-1. Langer & Kreft have also published an article The Locale Framework in the magazine C++ Report, September 1997, pp. 58-66(69). More info at http://home.camelot.de/langer/Articles/Internationalization/I18N.htm
[Schmidt 2000]
International Programming for Microsoft Windows by David A. Schmidt, Microsoft Press, 2000. ISBN 1-57231-956-9. Essential guidelines for globalizing and localizing your software with examples in Microsoft Visual C++ 6.0. Covers features for Windows 2000.
[Unicode 2000]
The Unicode Standard: Version 3.0 by The Unicode Consortium, Addison-Wesley, 2000. ISBN 0-201-61633-5. See also www.unicode.org.
[Lunde 1999]
CJKV Information Processing by Ken Lunde, O'Reilly, 1999. ISBN 1-56592-224-7. More information at www.oreilly.com/catalog/cjkvinfo
[Ott 1999]
Global Solutions for Multilingual Applications by Chris Ott, Wiley, 1999. ISBN 0-471-34827-9
[Kano 1995]
Developing International Software by Nadine Kano. Microsoft Press, 1995. ISBN 1-55615-840-8. For Windows 95 and Windows NT. A handbook for international software design. Can be read online at http://www.microsoft.com/globaldev/dis_v1/disv1.asp or via MSDN at http://msdn.microsoft.com/library/books/devintl/S24AA.htm.

Bert Esselink has in his book [Esselink 2000] a further reading section that is also available on the Internet at www.locguide.com/references/publications/books.htm. It's a huge collection of reference material, though many of the books are outdated.

Furthermore, Sybase has a list of books on internationalization and localization. Again many of them are rather outdated, but the list covers a broader variety of operating systems such as Macintosh, internationalization in libraries such as X Windows, and more general usability books.

5.2 Periodicals

6 Links

6.1 Organizations

6.2 Localization Information Portals

6.3 Newsgroups

6.4 Mailing Lists

6.5 Miscellaneous / Not yet grouped

6.6 Private persons interested in localization

Appendix A: File Formats

This section contains a compiled list of file formats used by the localization and translation industry. Some of the formats are openly standardized whereas others are proprietary. The list contains only a limited set of source file formats, i.e. formats of the original files that contain the text to be translated.

If you know of other formats please contact me.

| Extension | Base format | Description | Sample Tool | Proprietor |
|---|---|---|---|---|
| .rc | ANSI Text | Resource Script files containing definitions of Windows resources such as dialogs, icons, bitmaps, menus and textual strings. | Microsoft Resource Editor, Visual Studio or normal text editor | Microsoft, www.microsoft.com |
| .resx | XML | .NET Resource Script files (new .rc format for the .NET platform) | | Microsoft, www.microsoft.com |
| .res | Binary | Resource Files - contains compiled Windows Resource Script files. Resource files are stored directly in executables (exe/dll) as Unicode. | Microsoft Resource compiler (rc.exe) | Microsoft, www.microsoft.com |
| .resources | Binary | .NET Resource Files - contains compiled .NET Resource Script files. Resource files are stored directly in executables (exe/dll). | | Microsoft, www.microsoft.com |
| .xlf | XML | XLIFF - XML Localization Interchange File Format. Provides an open standard for transporting text to be translated. | | OASIS, http://www.oasis-open.org/committees/xliff/ |
| .xml | XML | OpenTag - Another format for translation memory exchange. | | |
| .tmx | XML | Translation Memory eXchange format. A format containing information about already translated words, phrases or sentences. | | LISA, www.lisa.org |
| .ttk | ? (Binary) | Translation Tool Kits. | CATALYST™ | Alchemy Software, www.alchemysoftware.ie |
| .skl | Binary | ? | RWS Tools (Rainbow) | RWS Group, www.translate.com |
| .itd | ? | Intermediate Translation Document | SDLX Tools | SDL International, www.sdlintl.com |
| .tdb | ? | Terminology Database. A kind of translation memory with special focus on terminology (words and phrases). | SDLX Tools | |
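
To give an impression of what one of these formats looks like in practice, here is a small Python sketch that reads a TMX document into a dictionary. The sample document is hand-written for this example (it is not output from any real tool) and uses only the basic TMX elements `tu` (translation unit), `tuv` (one language variant) and `seg` (the text). The function name `load_tmx` is invented, and real TMX files may contain inline markup and attributes that this sketch ignores.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hand-written sample, for illustration only.
SAMPLE_TMX = """
<tmx version="1.4">
  <header creationtool="example" creationtoolversion="1.0"
          segtype="sentence" o-tmf="none" adminlang="en"
          srclang="en" datatype="plaintext"/>
  <body>
    <tu>
      <tuv xml:lang="en"><seg>File</seg></tuv>
      <tuv xml:lang="da"><seg>Filer</seg></tuv>
    </tu>
    <tu>
      <tuv xml:lang="en"><seg>Edit</seg></tuv>
      <tuv xml:lang="da"><seg>Rediger</seg></tuv>
    </tu>
  </body>
</tmx>
"""


def load_tmx(tmx_text, source_lang, target_lang):
    """Return a dict mapping source segments to target segments."""
    root = ET.fromstring(tmx_text.strip())
    memory = {}
    for unit in root.iter("tu"):
        segments = {
            variant.get(XML_LANG): variant.findtext("seg")
            for variant in unit.findall("tuv")
        }
        if source_lang in segments and target_lang in segments:
            memory[segments[source_lang]] = segments[target_lang]
    return memory


memory = load_tmx(SAMPLE_TMX, "en", "da")
print(memory)  # {'File': 'Filer', 'Edit': 'Rediger'}
```

A tool could use such a dictionary for exact-match suggestions; matching of similar but not identical sentences is a considerably harder problem.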