UCS-2

Post by wael » Tue, 29 Jul 2003 02:03:06



hi
How can I convert a string to UCS-2?
For example, the UCS-2 code of the char 'A' is 0x0041 (decimal 65).

Can anyone help, please?

thank you
best regards
wael

      [ See http://www.gotw.ca/resources/clcm.htm for info about ]
      [ comp.lang.c++.moderated.    First time posters: Do this! ]

 
 
 

Post by Richard Smith » Wed, 30 Jul 2003 06:31:03


> How to convert any string to UCS-2, for example
> char 'A': the UCS-2 code of 'A' is 0x0041 (decimal 65)

Depends what you're starting with, really.  If you're
starting with ISO-8859-1 (aka Latin-1) or ISO-646 (aka
ASCII), then it's really easy: the first 256 UCS-2 codes
coincide with Latin-1, so you just need to zero-extend each
byte to fit into a 2-byte 'character'.  For the purposes of
this, I'm assuming that sizeof(wchar_t) == 2 and that a
std::wstring is UCS-2 encoded.

  std::string src("This string is ASCII");
  std::wstring dst( src.begin(), src.end() );
  // dst is a UCS-2 encoded string.

One subtlety to be aware of.  UCS-2 comes in two flavours:
big-endian and little-endian.  The above will choose the
host endianness.  (On Intel x86 machines, this is
little-endian.)
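
To make the byte-order point concrete, here is a minimal sketch that
writes 16-bit code units out as bytes in an explicitly chosen order, so
the result does not depend on the host.  It assumes a C++11 compiler
(char16_t and std::u16string did not exist when this thread was
written), and the names ByteOrder and to_bytes are made up for the
example.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Byte order to use when writing 16-bit code units out as bytes.
// (Hypothetical name, introduced for this sketch only.)
enum class ByteOrder { Big, Little };

// Serialize 16-bit code units to bytes in an explicit byte order,
// independent of the endianness of the machine running the code.
std::vector<std::uint8_t> to_bytes(const std::u16string& units, ByteOrder order)
{
    std::vector<std::uint8_t> out;
    out.reserve(units.size() * 2);
    for (char16_t u : units) {
        const std::uint8_t hi = static_cast<std::uint8_t>((u >> 8) & 0xFF);
        const std::uint8_t lo = static_cast<std::uint8_t>(u & 0xFF);
        if (order == ByteOrder::Big) {
            out.push_back(hi);
            out.push_back(lo);
        } else {
            out.push_back(lo);
            out.push_back(hi);
        }
    }
    return out;
}
```

With this, 'A' (0x0041) comes out as the bytes 00 41 in big-endian
order and 41 00 in little-endian order.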

If your compiler has sizeof(wchar_t) == 4, then you'll find
that std::wstring is UTF-32 (aka UCS-4) encoded.  One
solution in this situation is to find a 16-bit integer type
(assuming your compiler has one -- almost all do), and
specialise std::char_traits for this type, and provide a
typedef

  typedef std::basic_string< uint16_t > ucs2_string;

If you are starting with some other encoding (e.g. UTF-8)
then you should probably look at some third-party library
for doing this.  I can highly recommend the Dinkumware CoreX
library [1], which does this by way of C++ standard code
conversion facets.

Other libraries that do such code conversions include the
GNU project's iconv library [2] and IBM's ICU library [3].
Both of these provide C-style interfaces, so don't integrate
as nicely with std::string and/or IOStreams.

 [1] http://www.dinkumware.com/refxcorex.html
 [2] http://www.gnu.org/software/libiconv/
 [3] http://oss.software.ibm.com/icu/

--
Richard Smith

Post by Ben Hutching » Thu, 31 Jul 2003 07:38:42



<snip>
> Other libraries that do such code conversions include the GNU project's
> iconv library [2]

<snip>

iconv is a standard Unix feature, though comparatively recent.  GNU libc
has an excellent implementation of it, but it's also available in other
Unix C libraries.

Windows has code conversion too, but it only converts complete strings
between UCS-2 or UTF-16 and single or multibyte encodings that use up
to 2 bytes per character.  It ignores many error conditions.  The MLang
facility which was introduced with IE 4 appears to be better, but is
actually inconsistent and buggy, so don't waste your time with it like
I did.

Post by ka.. » Thu, 31 Jul 2003 08:22:22



> > How to convert any string to UCS-2, for example
> > char 'A': the UCS-2 code of 'A' is 0x0041 (decimal 65)
> Depends what you're starting with, really. If you're starting with
> ISO-8859-1 (aka Latin-1) or ISO-646 (aka ASCII), then it's really
> easy: you just need to pad with zeros each byte to fit into a 2-byte
> 'character'. For the purposes of this, I'm assuming that
> sizeof(wchar_t) == 2 and that a std::wstring is UCS-2 encoded.
>   std::string src("This string is ASCII");
>   std::wstring dst( src.begin(), src.end() );
>   // dst is a UCS-2 encoded string.
> One subtlety to be aware of. UCS-2 comes in two flavours: big-endian
> and little-endian. The above will choose the host endianness. (On
> Intel x86 machines, this is little-endian.)

Another subtlety to be aware of.  The type char may be signed or
unsigned.  If it is signed and 8 bits wide, and the text is ISO-8859-1
(which is the case for Windows and Linux on PC, and Solaris on Sparc),
then the characters above 0x7F have negative values; they are sign
extended during the conversion, and the results in dst will be wrong.
You need to declare an empty dst, then use something like:

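    // Needs <algorithm>, <iterator> and <string>.
    // Goes through unsigned char so that bytes 0x80..0xFF are
    // zero-extended rather than sign-extended.
    struct LowBits
    {
        wchar_t operator()( char ch ) const
        {
            return static_cast< unsigned char >( ch ) ;
        }
    } ;
    std::wstring dst ;
    std::transform( src.begin(), src.end(),
                    std::back_inserter( dst ),
                    LowBits() ) ;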
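
For reference, the same idea wrapped up as a complete function.  This is
only a sketch: it assumes a C++11 compiler and uses std::u16string in
place of std::wstring, since char16_t is at least 16 bits wide on every
implementation while wchar_t varies.  The function name latin1_to_ucs2
is made up for the example.

```cpp
#include <string>

// Widen ISO-8859-1 bytes to 16-bit code units.  Going through
// unsigned char keeps the bytes 0x80..0xFF in the range
// 0x0080..0x00FF even where plain char is signed.
std::u16string latin1_to_ucs2(const std::string& src)
{
    std::u16string dst;
    dst.reserve(src.size());
    for (char ch : src) {
        dst.push_back(static_cast<char16_t>(static_cast<unsigned char>(ch)));
    }
    return dst;
}
```

Without the cast to unsigned char, a signed char holding 0xE9 would
typically be sign-extended to 0xFFE9 instead of 0x00E9.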

(And out of curiosity, does anyone know why we have things like
std::plus and std::minus, but nothing for the bitwise operators?)

> If your compiler has sizeof(wchar_t) == 4, then you'll find that
> std::wstring is UTF-32 (aka UCS-4) encoded. One solution in this
> situation is to find a 16-bit integer type (assuming your compiler has
> one -- almost all do), and specialise std::char_traits for this type,

The specialization is illegal.  You can only specialize on user defined
types.

> and provide a typedef
>   typedef std::basic_string< uint16_t > ucs2_string;
> If you are starting with some other encoding (e.g. UTF-8) then you
> should probably look at some third-party library for doing this. I can
> highly recommend the Dinkumware CoreX library [1], which does this by
> way of C++ standard code conversion facets.

There's really nothing wrong with the classical solution: a simple
table, indexed by the character.  Again, watch out for negative
characters.  (On the other hand, if you actually want to do something
with the characters, like output them, you probably will want the
library anyway.)

--

Conseils en informatique orientée objet/     http://www.gabi-soft.fr
                    Beratung in objektorientierter Datenverarbeitung
11 rue de Rambouillet, 78460 Chevreuse, France, +33 (0)1 30 23 45 16

Post by Richard Smith » Fri, 01 Aug 2003 09:16:52



> > If your compiler has sizeof(wchar_t) == 4, then you'll find that
> > std::wstring is UTF-32 (aka UCS-4) encoded. One solution in this
> > situation is to find a 16-bit integer type (assuming your compiler has
> > one -- almost all do), and specialise std::char_traits for this type,

> The specialization is illegal.  You can only specialize on user defined
> types.

I'd overlooked that.  And the fact that std::string must be
instantiated on a POD type [21/1], that a POD-struct must be
an aggregate [9/4] and that an aggregate cannot have
user-declared constructors [8.5.1/1] means that you can't
wrap the 16-bit integer up into a class that feels like a
wchar_t and instantiate std::string on that.  (You can get
most of the way, though.  It's only really conversions from
char that you can't get.)

> > If you are starting with some other encoding (e.g. UTF-8) then you
> > should probably look at some third-party library for doing this. I can
> > highly recommend the Dinkumware CoreX library [1], which does this by
> > way of C++ standard code conversion facets.

> There's really nothing wrong with the classical solution: a simple
> table, indexed by the character.

It's fine if you're coming from something like Latin-2 where
the table will only be 128 (perhaps 256) entries long.  But
if your source character set is large or complicated (e.g.
Big Five), I wouldn't want to maintain such a look-up table.
And if you're dealing with multiple source character sets,
you really do benefit from using a library that others
maintain.

--
Richard Smith

Post by ka.. » Sat, 02 Aug 2003 10:08:38




> > > If your compiler has sizeof(wchar_t) == 4, then you'll find that
> > > std::wstring is UTF-32 (aka UCS-4) encoded. One solution in this
> > > situation is to find a 16-bit integer type (assuming your compiler
> > > has one -- almost all do), and specialise std::char_traits for
> > > this type,
> > The specialization is illegal.  You can only specialize on user defined
> > types.
> I'd overlooked that.  And the fact that std::string must be
> instantiated on a POD type [21/1], that a POD-struct must be an
> aggregate [9/4] and that an aggregate cannot have user-declared
> constructors [8.5.1/1] means that you can't wrap the 16-bit integer up
> into a class that feels like a wchar_t and instantiate std::string on
> that.  (You can get most of the way, though.  It's only really
> conversions from char that you can't get.)

This is the route I took -- I wrapped my ISO10646::Character type in a
struct.

Conversions to and from char aren't a problem, since you can't convert
without specifying the character code for the char anyway.

Still, if I were doing it again, I think I would simply use the traits
as the user defined type, rather than providing a specialization for
std::char_traits.  (It is sufficient if one type is user defined.)
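
A rough sketch of what "using the traits as the user defined type" can
look like.  This is not the code from my library; the name Ucs2Traits
and the choice of a 32-bit int_type are assumptions made for the
example, and it assumes a compiler that provides <cstdint>.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <cwchar>
#include <ios>
#include <string>

// Hypothetical traits class for 16-bit code units.  Because the traits
// type itself is user defined, std::char_traits is never specialized.
struct Ucs2Traits
{
    typedef std::uint16_t  char_type;
    typedef std::uint32_t  int_type;   // wide enough to hold eof() as well
    typedef std::streamoff off_type;
    typedef std::streampos pos_type;
    typedef std::mbstate_t state_type;

    static void assign(char_type& dst, const char_type& src) { dst = src; }
    static bool eq(char_type a, char_type b) { return a == b; }
    static bool lt(char_type a, char_type b) { return a < b; }

    static int compare(const char_type* a, const char_type* b, std::size_t n)
    {
        for (std::size_t i = 0; i < n; ++i) {
            if (lt(a[i], b[i])) return -1;
            if (lt(b[i], a[i])) return 1;
        }
        return 0;
    }
    static std::size_t length(const char_type* s)
    {
        std::size_t n = 0;
        while (s[n] != 0) ++n;
        return n;
    }
    static const char_type* find(const char_type* s, std::size_t n,
                                 const char_type& c)
    {
        for (std::size_t i = 0; i < n; ++i) {
            if (eq(s[i], c)) return s + i;
        }
        return nullptr;
    }
    static char_type* move(char_type* dst, const char_type* src, std::size_t n)
    {
        if (n != 0) std::memmove(dst, src, n * sizeof(char_type));
        return dst;
    }
    static char_type* copy(char_type* dst, const char_type* src, std::size_t n)
    {
        if (n != 0) std::memcpy(dst, src, n * sizeof(char_type));
        return dst;
    }
    static char_type* assign(char_type* dst, std::size_t n, char_type c)
    {
        for (std::size_t i = 0; i < n; ++i) dst[i] = c;
        return dst;
    }

    static int_type  to_int_type(char_type c) { return c; }
    static char_type to_char_type(int_type i) { return static_cast<char_type>(i); }
    static bool eq_int_type(int_type a, int_type b) { return a == b; }
    static int_type eof() { return static_cast<int_type>(-1); }
    static int_type not_eof(int_type i) { return eq_int_type(i, eof()) ? 0 : i; }
};

typedef std::basic_string<std::uint16_t, Ucs2Traits> ucs2_string;
```

A ucs2_string then supports the usual std::basic_string operations
(size, push_back, find and so on) on 16-bit code units, regardless of
sizeof(wchar_t).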

> > > If you are starting with some other encoding (e.g. UTF-8) then you
> > > should probably look at some third-party library for doing this. I
> > > can highly recommend the Dinkumware CoreX library [1], which does
> > > this by way of C++ standard code conversion facets.
> > There's really nothing wrong with the classical solution: a simple
> > table, indexed by the character.
> It's fine if you're coming from something like Latin-2 where the table
> will only be 128 (perhaps 256) entries long.

If you're coming from a single byte character set (like any of the 8859
codes), the table will have at most (and at least) 256 entries.  If you
are coming from a multibyte character set, you always need some code.
(Of course, that code can be table driven -- in my UTF-8 to 10646
converter, I use a table indexed by the first byte to determine the
number of bytes in the character, for example.)
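
To illustrate the table indexed by the first byte, here is a minimal
sketch (not the code from my converter).  It assumes C++11 and UTF-8 as
it is defined today, with sequences of at most four bytes; the function
names are made up for the example.  The table is filled in by a loop
rather than written out, to keep it short.

```cpp
#include <array>
#include <cstddef>

// Build a 256-entry table mapping a UTF-8 lead byte to the length of
// the sequence it starts.  0 marks bytes that cannot start a sequence
// (continuation bytes, overlong lead bytes, out-of-range lead bytes).
std::array<unsigned char, 256> make_utf8_length_table()
{
    std::array<unsigned char, 256> table{};                 // all entries 0
    for (std::size_t b = 0x00; b <= 0x7F; ++b) table[b] = 1; // ASCII
    for (std::size_t b = 0xC2; b <= 0xDF; ++b) table[b] = 2; // 0xC0, 0xC1 are overlong
    for (std::size_t b = 0xE0; b <= 0xEF; ++b) table[b] = 3;
    for (std::size_t b = 0xF0; b <= 0xF4; ++b) table[b] = 4; // above 0xF4 exceeds U+10FFFF
    return table;
}

// Number of bytes in the UTF-8 sequence starting with 'lead',
// or 0 if 'lead' is not a valid first byte.
std::size_t utf8_sequence_length(char lead)
{
    static const std::array<unsigned char, 256> table = make_utf8_length_table();
    // Index through unsigned char: plain char may be signed.
    return table[static_cast<unsigned char>(lead)];
}
```

Note that a four-byte sequence encodes a character above U+FFFF, which
has no UCS-2 representation, so a converter targeting UCS-2 has to
treat that case as an error or substitute a replacement character.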

> But if your source character set is large or complicated (e.g.  Big
> Five), I wouldn't want to maintain such a look-up table.

I don't know Big Five, so I can't say.  The only case I can see where
just a table would be an alternative, but you might not want to use it,
is when the multibyte characters have a fixed length -- all characters
are exactly two bytes, for example.

With regard to the maintenance, of course, the Unicode site has a
number of files you can download with the data, and it isn't hard to
write a bit of AWK which will convert them into the C++ code you need.
To take a simple example, my fromISO8859_10 converter is based on the
following table:

    GB_ISO10646::BasicCharacter const table[] =
    {
        0x0000, 0x0001, 0x0002, 0x0003, 0x0004, 0x0005, 0x0006, 0x0007,
        0x0008, 0x0009, 0x000A, 0x000B, 0x000C, 0x000D, 0x000E, 0x000F,
        0x0010, 0x0011, 0x0012, 0x0013, 0x0014, 0x0015, 0x0016, 0x0017,
        0x0018, 0x0019, 0x001A, 0x001B, 0x001C, 0x001D, 0x001E, 0x001F,
        0x0020, 0x0021, 0x0022, 0x0023, 0x0024, 0x0025, 0x0026, 0x0027,
        0x0028, 0x0029, 0x002A, 0x002B, 0x002C, 0x002D, 0x002E, 0x002F,
        0x0030, 0x0031, 0x0032, 0x0033, 0x0034, 0x0035, 0x0036, 0x0037,
        0x0038, 0x0039, 0x003A, 0x003B, 0x003C, 0x003D, 0x003E, 0x003F,
        0x0040, 0x0041, 0x0042, 0x0043, 0x0044, 0x0045, 0x0046, 0x0047,
        0x0048, 0x0049, 0x004A, 0x004B, 0x004C, 0x004D, 0x004E, 0x004F,
        0x0050, 0x0051, 0x0052, 0x0053, 0x0054, 0x0055, 0x0056, 0x0057,
        0x0058, 0x0059, 0x005A, 0x005B, 0x005C, 0x005D, 0x005E, 0x005F,
        0x0060, 0x0061, 0x0062, 0x0063, 0x0064, 0x0065, 0x0066, 0x0067,
        0x0068, 0x0069, 0x006A, 0x006B, 0x006C, 0x006D, 0x006E, 0x006F,
        0x0070, 0x0071, 0x0072, 0x0073, 0x0074, 0x0075, 0x0076, 0x0077,
        0x0078, 0x0079, 0x007A, 0x007B, 0x007C, 0x007D, 0x007E, 0x007F,
        0x0080, 0x0081, 0x0082, 0x0083, 0x0084, 0x0085, 0x0086, 0x0087,
        0x0088, 0x0089, 0x008A, 0x008B, 0x008C, 0x008D, 0x008E, 0x008F,
        0x0090, 0x0091, 0x0092, 0x0093, 0x0094, 0x0095, 0x0096, 0x0097,
        0x0098, 0x0099, 0x009A, 0x009B, 0x009C, 0x009D, 0x009E, 0x009F,
        0x00A0, 0x0104, 0x0112, 0x0122, 0x012A, 0x0128, 0x0136, 0x00A7,
        0x013B, 0x0110, 0x0160, 0x0166, 0x017D, 0x00AD, 0x016A, 0x014A,
        0x00B0, 0x0105, 0x0113, 0x0123, 0x012B, 0x0129, 0x0137, 0x00B7,
        0x013C, 0x0111, 0x0161, 0x0167, 0x017E, 0x2015, 0x016B, 0x014B,
        0x0100, 0x00C1, 0x00C2, 0x00C3, 0x00C4, 0x00C5, 0x00C6, 0x012E,
        0x010C, 0x00C9, 0x0118, 0x00CB, 0x0116, 0x00CD, 0x00CE, 0x00CF,
        0x00D0, 0x0145, 0x014C, 0x00D3, 0x00D4, 0x00D5, 0x00D6, 0x0168,
        0x00D8, 0x0172, 0x00DA, 0x00DB, 0x00DC, 0x00DD, 0x00DE, 0x00DF,
        0x0101, 0x00E1, 0x00E2, 0x00E3, 0x00E4, 0x00E5, 0x00E6, 0x012F,
        0x010D, 0x00E9, 0x0119, 0x00EB, 0x0117, 0x00ED, 0x00EE, 0x00EF,
        0x00F0, 0x0146, 0x014D, 0x00F3, 0x00F4, 0x00F5, 0x00F6, 0x0169,
        0x00F8, 0x0173, 0x00FA, 0x00FB, 0x00FC, 0x00FD, 0x00FE, 0x0138,
    } ;

You don't really think I'm going to type something like that in by hand,
do you?
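
Applying such a table takes only a few lines.  The following is a
sketch, not the actual converter: it assumes C++11, the names
ConversionTable, make_latin1_table and convert_with_table are made up
for the example, and an identity table (correct for ISO-8859-1) stands
in for a generated one.

```cpp
#include <array>
#include <cstddef>
#include <string>

// One 16-bit code per possible byte value of the source encoding.
typedef std::array<char16_t, 256> ConversionTable;

// Identity table: correct for ISO-8859-1, whose 256 codes coincide
// with the first 256 code points of ISO 10646.  A generated table for
// another 8859 part would replace this.
ConversionTable make_latin1_table()
{
    ConversionTable table{};
    for (std::size_t i = 0; i < table.size(); ++i) {
        table[i] = static_cast<char16_t>(i);
    }
    return table;
}

// Convert a single-byte encoded string by looking each byte up in the table.
std::u16string convert_with_table(const std::string& src,
                                  const ConversionTable& table)
{
    std::u16string dst;
    dst.reserve(src.size());
    for (char ch : src) {
        // Index through unsigned char so that bytes >= 0x80 are not negative.
        dst.push_back(table[static_cast<unsigned char>(ch)]);
    }
    return dst;
}
```

With the ISO 8859-10 table above, for instance, the byte 0xA1 would map
to 0x0104.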

> And if you're dealing with multiple source character sets, you really
> do benefit from using a library that others maintain.

Agreed.  Even better is when you design it so that the code is
mechanically generated from files provided by the Unicode consortium:-).

If you're interested in seeing what I have done in this respect, my code
is available at my site (www.gabi-soft.fr), in the Experimental
section.  And it is *very* experimental.

--

Conseils en informatique orientée objet/     http://www.gabi-soft.fr
                    Beratung in objektorientierter Datenverarbeitung
11 rue de Rambouillet, 78460 Chevreuse, France, +33 (0)1 30 23 45 16
