Programming for Internationalization FAQ

Post by m.. » Mon, 26 Jul 1999 04:00:00



Archive-name: internationalization/programming-faq
Posting-Frequency: monthly
Version: 1.93

                  Programming for Internationalization

                          Michael K. Gschwind

DISCLAIMER: THE AUTHOR MAKES NO WARRANTY OF ANY KIND WITH REGARD TO
THIS MATERIAL, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE.

Note: Most of this was tested on a Sun 10 running SunOS 4.1.*; other
systems might differ slightly.

This FAQ discusses topics related to internationalization. Simple i18n
support for Europe, Latin America, and the Middle East might make use of
the ISO 8859-X based 8 bit character sets.  For wider portability, a
standard such as Unicode is in order.

This FAQ discusses how to program applications which support the use
of European (Latin American) national character sets on UNIX-based
systems and standard C environments, and discusses some choices with
respect to character sets.

INTRODUCTION

Most of the information given here is independent of the character
encoding used (e.g. DEC MCS, ISO Latin-X, etc.) and can be applied to
any character set, provided the programming environment has
provisions for that character set.

1. Which coding should I use for accented characters?
Use the internationally standardized ISO-8859-1 character set to type
accented characters. This character set contains all characters
necessary to type (West) European languages. This encoding is also the
preferred encoding on the Internet.  ISO 8859-X character sets use the
characters 0xa0 through 0xff to represent national characters, while
the characters in the 0x20-0x7f range are those used in the US-ASCII
(ISO 646) character set.  Thus, ASCII text is a proper subset of all
ISO 8859-X character sets.  

The characters 0x80 through 0x9f are earmarked as extended control
characters and are not used for encoding printable characters.  A
practical reason for this is interoperability with 7 bit devices (or
with faulty software that strips the 8th bit): such a device would
interpret the stripped character as a control character, which could
put the device in an undefined state.
(When the 8th bit gets stripped from the characters at 0xa0 to 0xff, a
wrong character is represented, but this cannot change the state of a
terminal or other device.)
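
The byte ranges described above can be sketched in C as follows.  This
is only an illustration: the enum and the function names are made up
for this example and are not part of any standard library.

```c
/* Rough classification of one byte under the ISO 8859-X layout.
   The names below are illustrative, not a standard API. */
enum byte_class { C0_CONTROL, ASCII_RANGE, C1_CONTROL, NATIONAL };

enum byte_class classify_byte(unsigned char b)
{
    if (b < 0x20) return C0_CONTROL;   /* 0x00-0x1f: ASCII control characters */
    if (b < 0x80) return ASCII_RANGE;  /* 0x20-0x7f: shared with US-ASCII     */
    if (b < 0xa0) return C1_CONTROL;   /* 0x80-0x9f: extended control range   */
    return NATIONAL;                   /* 0xa0-0xff: national characters      */
}

/* What a 7 bit channel does to a byte: the 8th bit is dropped. */
unsigned char strip_eighth_bit(unsigned char b)
{
    return (unsigned char)(b & 0x7f);
}
```

For instance, stripping the 8th bit from 0xe9 (e with acute accent in
ISO 8859-1) yields 0x69, a plain 'i': wrong, but harmless.  Stripping it
from 0x9b yields 0x1b (ESC), which a terminal may well act upon.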

This character set is also used by AmigaDOS, MS-Windows, VMS (DEC MCS
is practically equivalent to ISO 8859-1) and (practically all) UNIX
implementations.  MS-DOS normally uses a different character set and
is not compatible with this character set. (It can, however, be
translated to this format with various tools. See below.)

Footnote: Supposedly, IBM code page 819 is fully ISO 8859-1 compliant.

ISO 8859-1 supports the following languages:
Afrikaans, Basque, Catalan, Danish, Dutch, English, Faeroese, Finnish,
French, Galician, German, Icelandic, Irish, Italian, Norwegian,
Portuguese, Spanish and Swedish.

(It has been called to my attention that Albanian can be written with
ISO 8859-1 also.  However, from a standards point of view, ISO 8859-2
is the appropriate character set for Balkan countries.)

ISO 8859-1 is just one part of the ISO-8859 standard, which specifies
several character sets:
8859-1  Europe, Latin America
8859-2  Eastern Europe
8859-3  SE Europe/miscellaneous (Esperanto, Maltese, etc.)
8859-4  Scandinavia/Baltic (mostly covered by 8859-1 also)
8859-5  Cyrillic
8859-6  Arabic
8859-7  Greek
8859-8  Hebrew
8859-9  Latin5, same as 8859-1 except for Turkish instead of Icelandic
8859-10 Latin6, for Lappish/Nordic/Eskimo languages

Another nascent standard is UNICODE (ISO 10646).  UNICODE is an
extension of ISO 8859-1 (which itself is an extension of US-ASCII) to
wide characters.  Thus, most of the world's languages (including
Japanese, Korean, Chinese...) can be covered.
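
Because the first 256 Unicode code points coincide with ISO 8859-1,
widening 8 bit text is a plain copy of the byte values.  A minimal
sketch, assuming code points are held in unsigned long (the function
name is hypothetical):

```c
#include <stddef.h>

/* Widen n ISO 8859-1 bytes to Unicode code points.  The numeric value
   of each byte is already its code point, so no table is needed.
   Returns the number of code points stored in dst. */
size_t latin1_to_ucs(const unsigned char *src, size_t n, unsigned long *dst)
{
    size_t i;

    for (i = 0; i < n; i++)
        dst[i] = src[i];
    return n;
}
```

Note that this shortcut holds for ISO 8859-1 only; the other ISO 8859-X
sets would need a translation table for the 0xa0-0xff range.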

Unicode is advantageous because one character set suffices to encode
all the world's languages.  The degree of Unicode support available
depends on the operating system and on application availability.
However, very few programs support wide characters. Thus, a `cheap'
upgrade from 7 bit US-ASCII might be to go only to 8-bit-wide character
sets (such as the ISO 8859-X).  Unfortunately, some programmers still
insist on using the `spare' eighth bit for clever tricks, which will
make conversion more difficult.

Footnote: Some people have complained about missing characters,
          e.g. French users about a missing 'oe'.  Note that oe is
          not a character, but a ligature (a combination of two
          characters for typographical purposes).  Ligatures are not
          part of the ISO 8859-X standard.  (Although 'oe' used to
          be in the draft 8859-1 standard before it was unmasked as
          a `mere' ligature.)

2. Choosing the character set encoding

Depending on your needs, you will probably want to choose different
solutions.  A quick-shot i18n of US programs might simply go to
8 bit and use one of the ISO 8859-X character sets.

If you have a choice and start from scratch, you might want to
consider Unicode.  There are several aspects to choosing a particular
character set (and you may want to decide on different character sets
for different purposes):
1) what codeset should the application run in?
2) what codeset should files be saved in?
3) what codeset is used for output (to screens etc.)? and
4) should wide characters or multi-byte characters be used? (This
   choice may be different for each of points 1-3.)
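
To illustrate point 4, standard C can convert between the two
representations with mbstowcs() and wcstombs().  The wrapper below is a
sketch only (its name and return convention are assumptions of this
example); the multi-byte encoding it expects is whatever the current
LC_CTYPE locale defines.

```c
#include <stdlib.h>
#include <wchar.h>

/* Convert a multi-byte string to wide characters.  At most outlen - 1
   wide characters are stored, followed by a terminating L'\0'; longer
   input is silently truncated.  Returns the number of wide characters
   stored, or -1 if outlen is 0 or the input contains an invalid
   multi-byte sequence. */
long mb_to_wide(const char *mb, wchar_t *out, size_t outlen)
{
    size_t n;

    if (outlen == 0)
        return -1;
    n = mbstowcs(out, mb, outlen - 1);
    if (n == (size_t)-1)
        return -1;
    out[n] = L'\0';
    return (long)n;
}
```

A program could keep wide characters internally (point 1) and still
read and write multi-byte text (points 2 and 3) by converting at the
boundaries.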

For example, if portability of your files across cultural borders is
an objective, you might want to use some form of Unicode encoding to
achieve this.  If interaction with other tools in your environment is
the main objective, and these tools use an encoding different from
Unicode, this character set might be used instead.  

Using Unicode internally but writing a different format to files may
sound funny (esp. if the output file format is only a subset of
Unicode), but you would only have to adapt the file write and read
functions, and the same program would be able to execute in all
countries where your product might be used.
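
As a sketch of such an adapted write function, assume the program holds
its text as Unicode code points in unsigned long and the file format is
ISO 8859-1.  The function names and the choice of '?' as a substitute
for unrepresentable characters are assumptions of this example.

```c
#include <stdio.h>
#include <stddef.h>

/* Narrow one Unicode code point to ISO 8859-1.  Code points above 0xff
   have no ISO 8859-1 equivalent and are replaced by '?'. */
unsigned char ucs_to_latin1(unsigned long cp)
{
    return cp <= 0xff ? (unsigned char)cp : (unsigned char)'?';
}

/* Write n code points to fp in ISO 8859-1.  Returns the number of
   bytes successfully written. */
size_t write_latin1(FILE *fp, const unsigned long *text, size_t n)
{
    size_t i;

    for (i = 0; i < n; i++)
        if (putc(ucs_to_latin1(text[i]), fp) == EOF)
            break;
    return i;
}
```

Supporting another file encoding then means replacing only
ucs_to_latin1(); the rest of the program keeps working on code points.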

Also, terminals which can process Unicode may not be available (or
you might have to support legacy hardware), so you might need to adapt
the output format to a third standard.

2. Getting your environment right for ISO 8859-X
To configure your environment such that you can enter, process and
display 8 bit ISO characters, check out the ISO-8859-1 FAQ available
via anonymous ftp from ftp.vlsivie.tuwien.ac.at in
/pub/8bit/FAQ-ISO-8859-1.  

If you use a different encoding, you will probably also have to
configure your system to fully support that encoding.

3. Setting your environment for ISO-C (ANSI-C) programs
The ISO C Standard (ANSI C Standard 4.4) defines several functions for
supporting localization. To set your international environment on
program startup, you should make one or several calls to the setlocale
function.  Calls to this function will predetermine the reaction of
other localization functions according to your language/country
environment.

To configure a particular aspect of your environment, say the number
representation, you would call
--
setlocale (LC_NUMERIC, "Germany");
--

This call would set all number representation functions defined in the
localization set to return numbers in the format used in Germany.  If
the call was successful, setlocale will return the name of your
locale.  A NULL return value indicates failure.  Note that the
environments are predetermined outside your C program by the system
you run on. (So the example given here is likely to fail on all but a
few systems.) Check the setlocale manual page or your system
documentation to find out about the environments available.
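
Since locale names are system dependent, a program should always check
the return value.  The helper below is a sketch (its name is made up);
it tries a list of candidate names and reports the first one the system
accepts.

```c
#include <locale.h>
#include <stddef.h>

/* Try each candidate locale name in turn for the given category.
   Returns the string setlocale() returned for the first name that was
   accepted, or NULL if none was. */
const char *set_first_available(int category,
                                const char *const names[], size_t count)
{
    size_t i;

    for (i = 0; i < count; i++) {
        const char *got = setlocale(category, names[i]);
        if (got != NULL)
            return got;
    }
    return NULL;
}
```

A caller might pass a list such as "Germany", "de_DE", "C".  Apart from
"C", which every ISO C implementation provides, these names are only
guesses at what a given system might call its German locale.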

There are several LOCALE types available for different localization
aspects (currency sign, number representation, character sets). The
value they can take is highly system dependent. Also, it should be up
to the user to define the locale environment he needs.

A C program inherits its locale environment variables when it starts up.
This happens automatically.  However, these variables do not
automatically control the locale used by the library functions, because
ISO/ANSI C says that all programs start by default in the standard C
locale.  To use the locales specified by the environment, the POSIX
standard defines the following call:
-----
setlocale (LC_ALL, "");
-----

Of course, you can also set only part of your environment, by calling, say:
----
setlocale (LC_CTYPE, "");
----
This only affects the character classification macros (defined in
ctype.h).
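
The difference between querying and setting can be sketched as follows.
Passing NULL as the second argument only queries the current locale;
passing "" adopts whatever the user's environment specifies.  The two
function names are hypothetical, and the code assumes that the default
locale reports its name as "C", as common implementations do.

```c
#include <locale.h>
#include <string.h>

/* Returns 1 if the LC_CTYPE category is currently the "C" locale.
   A NULL second argument queries without changing anything. */
int ctype_is_c_locale(void)
{
    const char *cur = setlocale(LC_CTYPE, NULL);

    return cur != NULL && strcmp(cur, "C") == 0;
}

/* Adopt the user's character classification from the environment while
   leaving every other category untouched.  Returns 1 on success and 0
   if the environment names a locale the system does not know. */
int adopt_user_ctype(void)
{
    return setlocale(LC_CTYPE, "") != NULL;
}
```

Called first thing in main(), ctype_is_c_locale() should return 1
whatever the environment variables say, because the program starts in
the C locale until it calls setlocale itself.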

This is a list of locale categories:

                   Effect of Specifying   Environment Variable
     category      the Value              Affected
     __________________________________________________________

     LC_ALL        Sets or queries        LANG
                   entire environment
     LC_COLLATE    Changes or queries     LC_COLLATE
                   collation sequences
     LC_CTYPE      Changes or queries     LC_CTYPE
                   character classifi-
                   cation
     LC_NUMERIC    Changes or queries     LC_NUMERIC
                   number format infor-
                   mation
     LC_TIME       Changes or queries     LC_TIME
                   time conversion
                   parameters
     LC_MONETARY   Changes or queries     LC_MONETARY
                   monetary information
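
Each category influences a particular group of library functions.  The
small wrappers below (their names are made up for illustration) show
one standard function per category: localeconv() for LC_NUMERIC,
strcoll() for LC_COLLATE and strftime() for LC_TIME.

```c
#include <locale.h>
#include <string.h>
#include <time.h>

/* LC_NUMERIC: the decimal point string of the current locale. */
const char *decimal_point(void)
{
    return localeconv()->decimal_point;
}

/* LC_COLLATE: returns 1 if a sorts before b in the current locale. */
int collates_before(const char *a, const char *b)
{
    return strcoll(a, b) < 0;
}

/* LC_TIME: store the full weekday name for *when in buf.  Returns the
   number of characters stored, or 0 if buf is too small. */
size_t weekday_name(char *buf, size_t len, const struct tm *when)
{
    return strftime(buf, len, "%A", when);
}
```

In the default C locale, decimal_point() yields "." and weekday names
are in English.  After a successful setlocale call for a German locale,
one would expect "," and German weekday names instead, but the exact
results depend on the locale data installed on the system.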

4. Using the locale information for character classification
If you write a program which supports international use, you should
use the available standardized functions, as only these will be
influenced by the setlocale call. Thus, if you want to convert a
capital letter in c to a lower case letter in l, _don't_ write:

l = c - 'A' + 'a';

While this will work for characters in the US-ASCII character set, it
will not work with many other character sets. The following,
standard-conformant ...
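
As a sketch of a locale-aware alternative (not necessarily the code
this FAQ originally gave at this point), the conversion can be left to
tolower() from ctype.h.  The wrapper names are hypothetical.

```c
#include <ctype.h>

/* Locale-aware lower-casing of one character held in a plain char.
   The cast to unsigned char matters: where char is signed, 8 bit
   characters are negative, and passing a negative value other than
   EOF to tolower() is undefined. */
char to_lower_char(char c)
{
    return (char)tolower((unsigned char)c);
}

/* Lower-case a whole string in place. */
void lower_string(char *s)
{
    for (; *s != '\0'; s++)
        *s = to_lower_char(*s);
}
```

In the default C locale this changes only A-Z.  Whether accented
capitals are converted as well depends on a prior successful call such
as setlocale (LC_CTYPE, "") and on the locale selected by the user.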


Post by Markus Ku » Mon, 26 Jul 1999 04:00:00


Something much more up-to-date than the "Programming for
Internationalization FAQ" that teaches you how to use Unicode
and UTF-8 instead of all this 8-bit legacy stuff is available from

  http://www.cl.cam.ac.uk/~mgk25/unicode.html

It tells you how to install the very latest xterm version with
UTF-8 support and how to install Unicode upgrades of the standard
X11 fonts.

Enter a new era of Unix character set handling, forget about
the boring old ASCII and ISO 8859, and read the

  UTF-8 and Unicode FAQ
  http://www.cl.cam.ac.uk/~mgk25/unicode.html

You will not only get character support for numerous languages for
which there exists no other ISO character set standard, you will
also get hundreds of new mathematical, phonetic, and technical
symbols on your system in one single Unicode font, including all
TeX symbols and more. Unicode is NOT just about
internationalization, it is also about proper typography and
scientific writing! Check it out now and make sure your software
is fit for it.

Comments and suggestions for improvement of the FAQ welcome!

Markus

--
Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK
Email: mkuhn at acm.org,  WWW: <http://www.cl.cam.ac.uk/~mgk25/>

 
 
 

Post by Jerry Coff » Tue, 27 Jul 1999 04:00:00



> Something much more up-to-date than the "Programming for
> Internationalization FAQ" that teaches you how to use Unicode
> and UTF-8 instead of all this 8-bit legacy stuff is available from

>   http://www.cl.cam.ac.uk/~mgk25/unicode.html

> It tells you how to install the very latest xterm version with
> UTF-8 support and how to install Unicode upgrades of the standard
> X11 fonts.

This is a fine link, but it's NOT anything like a replacement for the
programming for internationalization FAQ.  First of all, while
programming is mentioned in passing, the majority of the page is
related to UNIX administration.  Since this is widely cross-posted
(including to some UNIX-specific newsgroups) it may be perfectly
reasonable to recommend that where you're posting.  At the same time,
this message is being cross-posted to newsgroups such as comp.std.c,
where it's marginal at best -- in particular, it's related primarily
to making Unicode work on legacy systems that weren't designed for it.

People using more up-to-date systems that were designed for Unicode
(e.g. Windows NT) will find little or nothing in this page that's even
marginally useful.  People who are concerned with C standardization
and need some basic background in Unicode-related terminology may find
a few useful bits, but little beyond that.

Summary: this is likely to be of a great deal of use to people trying
to set Linux up to use Unicode.  It may be of limited use to people
doing the same with UNIX other than Linux.  To the rest of the world,
especially those using systems that already fully support Unicode, it
contains little that's likely to be of any interest at all.

 
 
 

Post by Harald Kirs » Thu, 29 Jul 1999 04:00:00



> Something much more up-to-date than the "Programming for
> Internationalization FAQ" that teaches you how to use Unicode
> and UTF-8 instead of all this 8-bit legacy stuff is available from

>   http://www.cl.cam.ac.uk/~mgk25/unicode.html

Informative and interesting article. It tells me that we'll be having
a hard time in the near future when UTF-8 aware and older applications
do their best to mess things up for each other.

However, it is certainly a good thing that glyphs get a name and a
defined number. Is there any hope that this puts an end to the problems
of font encoding?

Harald Kirsch
--
P.S.: Never ever mail me copies of your posts.


 
 
 

Post by Harald Kirs » Thu, 29 Jul 1999 04:00:00



> doing the same with UNIX other than Linux.  To the rest of the world,
> especially those using systems that already fully support Unicode, it

Would you care to let me know of such systems?

Many thanks,
        Harald Kirsch

--
P.S.: Never ever mail me copies of your posts.


 
 
 

Post by Peter Wyzl » Sun, 01 Aug 1999 04:00:00






>> doing the same with UNIX other than Linux.  To the rest of the
>> world, especially those using systems that already fully
>> support Unicode, it

>Would you care to let me know of such systems.

How about WinNT and Plan9?

Peter
--
"A great many people think they are thinking when they are merely
rearranging their prejudices." -- William James

 
 
 

Post by Markus Ku » Tue, 03 Aug 1999 04:00:00


|> >
|> >   http://www.cl.cam.ac.uk/~mgk25/unicode.html
|> >
|> Informative and interesting article. It tells me that we'll be having
|> a hard time in the near future when UTF-8 aware and older applications
|> do their best to mess things up for each other.

The migration to UTF-8 will certainly not be painless during some
transition period (just as the migration from the various ISO 646
7-bit ASCII variants to ISO 8859-1 wasn't without problems), but
I am sure that it will definitely be worth it.

If Unix developers get familiar with UTF-8 and the ISO 10646-1
and Unicode standards as early and as widely as possible, then this
will certainly help to simplify things.

By the way for those looking for Unicode X fonts: We have just added
over a thousand new glyphs this weekend to the "-misc-fixed-*" fonts:

  http://www.cl.cam.ac.uk/~mgk25/ucs-fonts.html

Markus

--
Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK
Email: mkuhn at acm.org,  WWW: <http://www.cl.cam.ac.uk/~mgk25/>