Archive-name: internationalization/programming-faq
Posting-Frequency: monthly
Version: 1.93
Programming for Internationalization
Michael K. Gschwind
DISCLAIMER: THE AUTHOR MAKES NO WARRANTY OF ANY KIND WITH REGARD TO
THIS MATERIAL, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE.
Note: Most of this was tested on a Sun 10 running SunOS 4.1.*; other
systems might differ slightly.
This FAQ discusses topics related to internationalization. Simple i18n
support for Europe, Latin America, and the Middle East might make use
of the ISO 8859-X based 8 bit character sets. For wider portability, a
standard such as Unicode is in order.
This FAQ discusses how to program applications which support the use
of European (and Latin American) national character sets on UNIX-based
systems and standard C environments, and discusses some choices with
respect to character sets.
INTRODUCTION
Most of the information given here is independent of the character
encoding used (e.g. DEC MCS, ISO Latin-X, etc.) and can be applied to
any character set, provided the programming environment has
provisions for that character set.
1. Which coding should I use for accented characters?
Use the internationally standardized ISO-8859-1 character set to type
accented characters. This character set contains all characters
necessary to type (West) European languages. This encoding is also the
preferred encoding on the Internet. ISO 8859-X character sets use the
characters 0xa0 through 0xff to represent national characters, while
the characters in the 0x20-0x7f range are those used in the US-ASCII
(ISO 646) character set. Thus, ASCII text is a proper subset of all
ISO 8859-X character sets.
The characters 0x80 through 0x9f are earmarked as extended control
characters, and are not used for encoding printable characters. These
codes are not currently used to specify anything. A practical reason
for this is interoperability with 7 bit devices (or with faulty
software that strips the 8th bit): if printable characters were
encoded there, a device receiving one with the 8th bit stripped would
interpret it as a control character, which could put the device in an
undefined state.
(When the 8th bit gets stripped from the characters at 0xa0 to 0xff, a
wrong character is represented, but this cannot change the state of a
terminal or other device.)
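The layout described above can be sketched in C. This is a minimal
illustration only; the enum and function names are hypothetical and
not part of any standard library.

```c
/* Hypothetical names for the three regions of an ISO 8859-X code. */
enum iso8859_region {
    REGION_ASCII,       /* 0x00-0x7f: US-ASCII, incl. its control codes */
    REGION_C1_CONTROL,  /* 0x80-0x9f: extended control characters       */
    REGION_NATIONAL     /* 0xa0-0xff: national characters               */
};

enum iso8859_region iso8859_classify(unsigned char byte)
{
    if (byte < 0x80)
        return REGION_ASCII;
    if (byte < 0xa0)
        return REGION_C1_CONTROL;
    return REGION_NATIONAL;
}

/* What a 7 bit channel does to a byte: the 8th bit is dropped. */
unsigned char strip_eighth_bit(unsigned char byte)
{
    return (unsigned char)(byte & 0x7f);
}

/* Nonzero if stripping the 8th bit turns the byte into an ASCII
   control code (0x00-0x1f), which could upset a terminal. */
int becomes_control_when_stripped(unsigned char byte)
{
    return byte >= 0x80 && strip_eighth_bit(byte) < 0x20;
}
```

For example, 0xe9 (e acute in ISO 8859-1) is stripped to 0x69 ('i'):
a wrong character, but a harmless one. Only the codes 0x80 through
0x9f are stripped to control codes.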
This character set is also used by AmigaDOS, MS-Windows, VMS (DEC MCS
is practically equivalent to ISO 8859-1) and (practically all) UNIX
implementations. MS-DOS normally uses a different character set and
is not compatible with this character set. (It can, however, be
translated to this format with various tools. See below.)
Footnote: Supposedly, IBM code page 819 is fully ISO 8859-1 compliant.
ISO 8859-1 supports the following languages:
Afrikaans, Basque, Catalan, Danish, Dutch, English, Faeroese, Finnish,
French, Galician, German, Icelandic, Irish, Italian, Norwegian,
Portuguese, Spanish and Swedish.
(It has been called to my attention that Albanian can be written with
ISO 8859-1 also. However, from a standards point of view, ISO 8859-2
is the appropriate character set for Balkan countries.)
ISO 8859-1 is just one part of the ISO-8859 standard, which specifies
several character sets:
8859-1 Europe, Latin America
8859-2 Eastern Europe
8859-3 SE Europe/miscellaneous (Esperanto, Maltese, etc.)
8859-4 Scandinavia/Baltic (mostly covered by 8859-1 also)
8859-5 Cyrillic
8859-6 Arabic
8859-7 Greek
8859-8 Hebrew
8859-9 Latin5, same as 8859-1 except for Turkish instead of Icelandic
8859-10 Latin6, for Lappish/Nordic/Eskimo languages
Another nascent standard is UNICODE (ISO 10646). UNICODE is an
extension of ISO 8859-1 (which itself is an extension of US-ASCII) to
wide characters. Thus, most of the world's languages (including
Japanese, Korean, Chinese...) can be covered.
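Because the first 256 Unicode code points coincide with ISO 8859-1,
widening ISO 8859-1 text needs no lookup table. The following is a
minimal sketch; the function name is hypothetical, and code points
are held in unsigned long rather than wchar_t, since the encoding of
wchar_t is implementation-defined.

```c
#include <stddef.h>

/* Widen n ISO 8859-1 bytes to Unicode code points. Each byte value
   is its own code point, so the conversion is a plain zero-extension.
   The caller must provide room for n entries in out. Returns n. */
size_t latin1_to_codepoints(const unsigned char *in, size_t n,
                            unsigned long *out)
{
    size_t i;

    for (i = 0; i < n; i++)
        out[i] = in[i];
    return n;
}
```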
Unicode is advantageous because one character set suffices to encode
all the world's languages. The degree of Unicode support available
depends on the operating system and on application availability.
However, very few programs support wide characters. Thus, a `cheap'
upgrade from 7 bit US-ASCII might be to character sets that are only
8 bits wide (such as the ISO 8859-X). Unfortunately, some programmers
still insist on using the `spare' eighth bit for clever tricks, which
will make conversion more difficult.
Footnote: Some people have complained about missing characters,
e.g. French users about a missing 'oe'. Note that oe is
not a character, but a ligature (a combination of two
characters for typographical purposes). Ligatures are not
part of the ISO 8859-X standard. (Although 'oe' used to
be in the draft 8859-1 standard before it was unmasked as
a `mere' ligature.)
2. Choosing the character set encoding
Depending on your needs, you will probably want to choose different
solutions. A quick-shot i18n of US programs might simply consist of
going to 8 bit and using one of the ISO 8859-X character sets.
If you have a choice and start from scratch, you might want to
consider Unicode. There are several aspects to choosing a particular
character set (and you may want to decide on different character sets
for different purposes):
1) what codeset should the application run in?
2) what codeset should files be saved in?
3) what codeset is used for output (to screens etc.)? and
4) should wide characters or multi-byte characters be used? (this
choice may be different for each of points 1-3)
For example, if portability of your files across cultural borders is
an objective, you might want to use some form of Unicode encoding to
achieve this. If interaction with other tools in your environment is
the main objective, and these tools use an encoding different from
Unicode, this character set might be used instead.
Using Unicode internally but writing a different format to files may
sound funny (esp. if the output file format is only a subset of
Unicode), but you would only have to adapt the file write and read
functions, and the same program would be able to execute in all
countries where your product might be used.
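As an illustration of adapting only the write function, the sketch
below keeps text as Unicode code points internally and narrows it to
ISO 8859-1 on output. The function names and the choice of '?' as
the replacement character are assumptions made for this example.

```c
#include <stdio.h>
#include <stddef.h>

/* Hypothetical helper: narrow one Unicode code point to an
   ISO 8859-1 byte. Code points above 0xff have no ISO 8859-1
   equivalent and are replaced by '?'. */
int codepoint_to_latin1(unsigned long cp)
{
    return cp <= 0xff ? (int)cp : '?';
}

/* Hypothetical file write function: writes n code points as
   ISO 8859-1 bytes. Returns the number of code points that had to
   be replaced, or -1 on an I/O error. */
long write_latin1(FILE *fp, const unsigned long *text, size_t n)
{
    long replaced = 0;
    size_t i;

    for (i = 0; i < n; i++) {
        if (text[i] > 0xff)
            replaced++;
        if (fputc(codepoint_to_latin1(text[i]), fp) == EOF)
            return -1;
    }
    return replaced;
}
```

Returning the replacement count lets the caller warn the user when
the chosen file format could not represent everything.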
Also, terminals and/or other devices which process Unicode may not be
available (or you might have to support legacy hardware), so you
might need to adapt the output format to a third standard.
2. Getting your environment right for ISO 8859-X
To configure your environment such that you can enter, process and
display 8 bit ISO characters, check out the ISO-8859-1 FAQ available
via anonymous ftp from ftp.vlsivie.tuwien.ac.at in
/pub/8bit/FAQ-ISO-8859-1.
If you use a different encoding, you will probably also have to
configure your system to fully support that encoding.
3. Setting your environment for ISO-C (ANSI-C) programs
The ISO C Standard (ANSI C Standard 4.4) defines several functions for
supporting localization. To set your international environment on
program startup, you should make one or several calls to the setlocale
function. Calls to this function determine the behavior of the other
localization functions according to your language/country
environment.
To configure a particular aspect of your environment, say the number
representation, you would call
--
setlocale (LC_NUMERIC, "Germany");
--
This call would set all number representation functions defined in the
localization set to return numbers in the format used in Germany. If
the call was successful, setlocale will return the name of your
locale. A NULL return value indicates failure. Note that the
environments are predetermined outside your C program by the system
you run on. (So the example given here is likely to fail on all but a
few systems.) Check the setlocale manual page or your system
documentation to find out about the environments available.
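Since a locale name may not exist on a given system, the return value
should be checked. The following is a minimal sketch; the function
name and the fallback to the "C" locale are choices made for this
example, not requirements of the standard.

```c
#include <locale.h>
#include <stdio.h>

/* Hypothetical helper: try to select the LC_NUMERIC locale `name`.
   If the system does not provide it, report that on stderr and fall
   back to the standard "C" locale, which is always available.
   Returns the name of the locale now in effect for LC_NUMERIC. */
const char *select_numeric_locale(const char *name)
{
    const char *result = setlocale(LC_NUMERIC, name);

    if (result == NULL) {
        fprintf(stderr, "locale \"%s\" not available, using \"C\"\n",
                name);
        result = setlocale(LC_NUMERIC, "C");
    }
    return result;
}
```

A call such as select_numeric_locale("Germany") will then work on a
system that knows this name and degrade gracefully elsewhere.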
There are several LOCALE types available for different localization
aspects (currency sign, number representation, character sets). The
values they can take are highly system dependent. Also, it should be
up to the user to define the locale environment he needs.
A C program inherits its locale environment variables when it starts up.
This happens automatically. However, these variables do not
automatically control the locale used by the library functions, because
ISO/ANSI C says that all programs start by default in the standard C
locale. To use the locales specified by the environment, the POSIX
standard defines the following call:
-----
setlocale (LC_ALL, "");
-----
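A minimal sketch of how a program might adopt the user's locale and
then observe the effect through localeconv, which is part of
standard C. The two wrapper function names are hypothetical.

```c
#include <locale.h>

/* Adopt the locale named by the user's environment. Returns 1 on
   success, 0 if the environment names a locale the system does not
   provide (the program then stays in its previous locale). */
int adopt_user_locale(void)
{
    return setlocale(LC_ALL, "") != NULL;
}

/* The decimal point string of the LC_NUMERIC locale currently in
   effect, e.g. "." in the standard C locale. */
const char *current_decimal_point(void)
{
    return localeconv()->decimal_point;
}
```

Before adopt_user_locale is called, current_decimal_point returns
"." because every program starts in the C locale; afterwards the
result depends on the user's environment.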
Of course, you can also set only part of your environment, by calling, say:
----
setlocale (LC_CTYPE, "");
----
This only affects the behavior of the character classification macros
(defined in ctype.h).
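The effect of LC_CTYPE can be observed with a small helper such as
the following sketch. The function name is hypothetical; isalpha is
the standard classification function from ctype.h.

```c
#include <ctype.h>
#include <stddef.h>

/* Count the bytes of s that the current LC_CTYPE locale classifies
   as letters. The cast to unsigned char matters: passing a negative
   char value to isalpha is undefined, and 8 bit characters are
   negative on systems where char is signed. */
size_t count_letters(const char *s)
{
    size_t n = 0;

    for (; *s != '\0'; s++)
        if (isalpha((unsigned char)*s))
            n++;
    return n;
}
```

In the default C locale, count_letters("caf\xe9") yields 3, since
0xe9 is not a letter there. After a successful setlocale (LC_CTYPE,
"") in an ISO 8859-1 environment, the same call would be expected to
yield 4.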
This is a list of locale categories:
                Effect of Specifying       Environment Variable
category        the Value                  Affected
__________________________________________________________
LC_ALL          Sets or queries            LANG
                entire environment
LC_COLLATE      Changes or queries         LC_COLLATE
                collation sequences
LC_CTYPE        Changes or queries         LC_CTYPE
                character classification
LC_NUMERIC      Changes or queries         LC_NUMERIC
                number format information
LC_TIME         Changes or queries         LC_TIME
                time conversion
                parameters
LC_MONETARY     Changes or queries         LC_MONETARY
                monetary information
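To query a category rather than set it, setlocale is passed a NULL
pointer as its second argument. The sketch below prints the current
setting of each category listed above; the table and function names
are hypothetical.

```c
#include <locale.h>
#include <stdio.h>

struct category_name {
    int category;
    const char *name;
};

static const struct category_name categories[] = {
    { LC_ALL,      "LC_ALL"      },
    { LC_COLLATE,  "LC_COLLATE"  },
    { LC_CTYPE,    "LC_CTYPE"    },
    { LC_NUMERIC,  "LC_NUMERIC"  },
    { LC_TIME,     "LC_TIME"     },
    { LC_MONETARY, "LC_MONETARY" }
};

/* Print the current setting of every category to `out`. A NULL
   second argument makes setlocale query the locale without changing
   it. Returns the number of categories printed. */
int print_locale_settings(FILE *out)
{
    int count = (int)(sizeof categories / sizeof categories[0]);
    int i;

    for (i = 0; i < count; i++) {
        const char *value = setlocale(categories[i].category, NULL);

        fprintf(out, "%-12s %s\n", categories[i].name,
                value != NULL ? value : "(unknown)");
    }
    return count;
}
```

The format of the strings returned by setlocale is system dependent,
so they are best treated as opaque values.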
4. Using the locale information for character classification
If you write a program which supports international use, you should
use the available standardized functions, as only these will be
influenced by the setlocale call. Thus, if you want to convert a
capital letter in c to a lower case letter in l, _don't_ write:
l = c - 'A' + 'a';
While this will work for characters in the US-ASCII character set, it
will not work with many other character sets. The following,
standard-conformant
...
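A minimal sketch of a locale-aware conversion, assuming the standard
tolower function from ctype.h. The helper names are hypothetical.
8 bit letters are only converted once a suitable locale has been
selected with setlocale; in the default C locale only A-Z are
affected.

```c
#include <ctype.h>

/* Locale-aware replacement for  l = c - 'A' + 'a'.  Characters that
   have no lower case form in the current locale are returned
   unchanged. The cast avoids passing a negative value to tolower. */
int to_lower_char(char c)
{
    return tolower((unsigned char)c);
}

/* Convert a whole string to lower case in place. */
void lowercase_string(char *s)
{
    for (; *s != '\0'; s++)
        *s = (char)tolower((unsigned char)*s);
}
```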