> > > If your compiler has sizeof(wchar_t) == 4, then you'll find that
> > > std::wstring is UTF-32 (aka UCS-4) encoded. One solution in this
> > > situation is to find a 16-bit integer type (assuming your compiler
> > > has one -- almost all do), and specialise std::char_traits for
> > > this type,
> > The specialization is illegal. You can only specialize on user defined
> > types.
> I'd overlooked that. And the fact that std::string must be
> instantiated on a POD type [21/1], that a POD-struct must be an
> aggregate [9/4] and that an aggregate cannot have user-declared
> constructors [8.5.1/1] means that you can't wrap the 16-bit integer up
> into a class that feels like a wchar_t and instantiate std::string on
> that. (You can get most of the way, though. It's only really
> conversions from char that you can't get.)
This is the route I took -- I instantiated std::basic_string on my
ISO10646::Character type, with a specialization of std::char_traits.
Conversions to and from char aren't a problem, since you can't convert
without specifying the character code for the char anyway.
Still, if I were doing it again, I think I would simply use the traits
as the user defined type, rather than providing a specialization for
std::char_traits. (It is sufficient if one type is user defined.)
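A minimal sketch of that alternative (my own reconstruction, with hypothetical names, not the poster's actual code): a user defined traits class for a 16-bit code unit, passed as basic_string's second template parameter, so that no specialization of std::char_traits is needed at all.

```cpp
#include <cstdint>
#include <cstring>
#include <cwchar>   // std::mbstate_t
#include <ios>      // std::streamoff, std::streampos
#include <string>

// User defined traits class for a 16-bit code unit. Because the
// traits type is user defined, std::basic_string<std::uint16_t,
// Utf16Traits> is a legitimate instantiation.
struct Utf16Traits
{
    typedef std::uint16_t  char_type;
    typedef std::uint32_t  int_type;
    typedef std::streamoff off_type;
    typedef std::streampos pos_type;
    typedef std::mbstate_t state_type;

    static void assign(char_type& d, const char_type& s) { d = s; }
    static bool eq(char_type a, char_type b) { return a == b; }
    static bool lt(char_type a, char_type b) { return a < b; }

    static int compare(const char_type* a, const char_type* b, std::size_t n)
    {
        for (std::size_t i = 0; i != n; ++i) {
            if (a[i] < b[i]) return -1;
            if (b[i] < a[i]) return  1;
        }
        return 0;
    }
    static std::size_t length(const char_type* s)
    {
        std::size_t n = 0;
        while (s[n] != 0) ++n;
        return n;
    }
    static const char_type* find(const char_type* s, std::size_t n,
                                 const char_type& c)
    {
        for (std::size_t i = 0; i != n; ++i)
            if (s[i] == c) return s + i;
        return 0;
    }
    static char_type* move(char_type* d, const char_type* s, std::size_t n)
    {
        std::memmove(d, s, n * sizeof(char_type));
        return d;
    }
    static char_type* copy(char_type* d, const char_type* s, std::size_t n)
    {
        std::memcpy(d, s, n * sizeof(char_type));
        return d;
    }
    static char_type* assign(char_type* d, std::size_t n, char_type c)
    {
        for (std::size_t i = 0; i != n; ++i) d[i] = c;
        return d;
    }
    static int_type  to_int_type(char_type c) { return c; }
    static char_type to_char_type(int_type i)
    { return static_cast<char_type>(i); }
    static bool eq_int_type(int_type a, int_type b) { return a == b; }
    static int_type eof() { return 0xFFFFFFFFu; }
    static int_type not_eof(int_type i) { return i == eof() ? 0 : i; }
};

typedef std::basic_string<std::uint16_t, Utf16Traits> Utf16String;
```

The full traits interface has to be provided, since the standard library will call all of it, but none of the operations are difficult for a simple integer code unit.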
> > > If you are starting with some other encoding (e.g. UTF-8) then you
> > > should probably look at some third-party library for doing this. I
> > > can highly recommend the Dinkumware CoreX library, which does
> > > this by way of C++ standard code conversion facets.
> > There's really nothing wrong with the classical solution: a simple
> > table, indexed by the character.
> It's fine if you're coming from something like Latin-2, where the
> table will only be 128 (perhaps 256) entries long.
If you're coming from a single byte character set (like any of the 8859
codes), the table will have at most (and at least) 256 entries. If you
are coming from a multibyte character set, you always need some code.
(Of course, that code can be table driven -- in my UTF-8 to 10646
converter, I use a table indexed by the first byte to determine the
number of bytes in the character, for example.)
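A sketch of that first-byte table (my own reconstruction, not the poster's actual converter): a 256-entry table, indexed by the lead byte, giving the total number of bytes in the UTF-8 sequence, with 0 marking bytes that cannot start a sequence.

```cpp
#include <cstddef>

// 256-entry lookup table, built once, indexed by the first byte of a
// UTF-8 sequence; the entry is the sequence length in bytes, or 0
// for a byte that cannot begin a sequence.
struct Utf8LengthTable
{
    unsigned char length[256];
    Utf8LengthTable()
    {
        for (int b = 0; b != 256; ++b) {
            if      (b < 0x80) length[b] = 1;  // 0xxxxxxx: ASCII
            else if (b < 0xC0) length[b] = 0;  // 10xxxxxx: continuation
            else if (b < 0xE0) length[b] = 2;  // 110xxxxx
            else if (b < 0xF0) length[b] = 3;  // 1110xxxx
            else if (b < 0xF8) length[b] = 4;  // 11110xxx
            else               length[b] = 0;  // invalid lead byte
        }
    }
};

inline std::size_t utf8SequenceLength(unsigned char firstByte)
{
    static Utf8LengthTable const table;
    return table.length[firstByte];
}
```

The converter still needs code to validate and assemble the continuation bytes, but the table disposes of the first decision cheaply.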
> But if your source character set is large or complicated (e.g. Big
> Five), I wouldn't want to maintain such a look-up table.
I don't know Big Five, so I can't say. The only case I can see where
just a table would still be an alternative, though perhaps not one
you'd want to use, is when the multibyte characters have a fixed
length -- all characters are exactly two bytes, for example.
With regards to the maintenance, of course, the Unicode site has a
number of files you can download with the data, and it isn't hard to
write a bit of AWK which will convert them into the C++ code you need.
(To take a simple example, my fromISO8859_10 converter is based on the
following table:
GB_ISO10646::BasicCharacter const table[] =
{
0x0000, 0x0001, 0x0002, 0x0003, 0x0004, 0x0005, 0x0006, 0x0007,
0x0008, 0x0009, 0x000A, 0x000B, 0x000C, 0x000D, 0x000E, 0x000F,
0x0010, 0x0011, 0x0012, 0x0013, 0x0014, 0x0015, 0x0016, 0x0017,
0x0018, 0x0019, 0x001A, 0x001B, 0x001C, 0x001D, 0x001E, 0x001F,
0x0020, 0x0021, 0x0022, 0x0023, 0x0024, 0x0025, 0x0026, 0x0027,
0x0028, 0x0029, 0x002A, 0x002B, 0x002C, 0x002D, 0x002E, 0x002F,
0x0030, 0x0031, 0x0032, 0x0033, 0x0034, 0x0035, 0x0036, 0x0037,
0x0038, 0x0039, 0x003A, 0x003B, 0x003C, 0x003D, 0x003E, 0x003F,
0x0040, 0x0041, 0x0042, 0x0043, 0x0044, 0x0045, 0x0046, 0x0047,
0x0048, 0x0049, 0x004A, 0x004B, 0x004C, 0x004D, 0x004E, 0x004F,
0x0050, 0x0051, 0x0052, 0x0053, 0x0054, 0x0055, 0x0056, 0x0057,
0x0058, 0x0059, 0x005A, 0x005B, 0x005C, 0x005D, 0x005E, 0x005F,
0x0060, 0x0061, 0x0062, 0x0063, 0x0064, 0x0065, 0x0066, 0x0067,
0x0068, 0x0069, 0x006A, 0x006B, 0x006C, 0x006D, 0x006E, 0x006F,
0x0070, 0x0071, 0x0072, 0x0073, 0x0074, 0x0075, 0x0076, 0x0077,
0x0078, 0x0079, 0x007A, 0x007B, 0x007C, 0x007D, 0x007E, 0x007F,
0x0080, 0x0081, 0x0082, 0x0083, 0x0084, 0x0085, 0x0086, 0x0087,
0x0088, 0x0089, 0x008A, 0x008B, 0x008C, 0x008D, 0x008E, 0x008F,
0x0090, 0x0091, 0x0092, 0x0093, 0x0094, 0x0095, 0x0096, 0x0097,
0x0098, 0x0099, 0x009A, 0x009B, 0x009C, 0x009D, 0x009E, 0x009F,
0x00A0, 0x0104, 0x0112, 0x0122, 0x012A, 0x0128, 0x0136, 0x00A7,
0x013B, 0x0110, 0x0160, 0x0166, 0x017D, 0x00AD, 0x016A, 0x014A,
0x00B0, 0x0105, 0x0113, 0x0123, 0x012B, 0x0129, 0x0137, 0x00B7,
0x013C, 0x0111, 0x0161, 0x0167, 0x017E, 0x2015, 0x016B, 0x014B,
0x0100, 0x00C1, 0x00C2, 0x00C3, 0x00C4, 0x00C5, 0x00C6, 0x012E,
0x010C, 0x00C9, 0x0118, 0x00CB, 0x0116, 0x00CD, 0x00CE, 0x00CF,
0x00D0, 0x0145, 0x014C, 0x00D3, 0x00D4, 0x00D5, 0x00D6, 0x0168,
0x00D8, 0x0172, 0x00DA, 0x00DB, 0x00DC, 0x00DD, 0x00DE, 0x00DF,
0x0101, 0x00E1, 0x00E2, 0x00E3, 0x00E4, 0x00E5, 0x00E6, 0x012F,
0x010D, 0x00E9, 0x0119, 0x00EB, 0x0117, 0x00ED, 0x00EE, 0x00EF,
0x00F0, 0x0146, 0x014D, 0x00F3, 0x00F4, 0x00F5, 0x00F6, 0x0169,
0x00F8, 0x0173, 0x00FA, 0x00FB, 0x00FC, 0x00FD, 0x00FE, 0x0138,
};
You don't really think I'm going to type something like that in by
hand, do you?)
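A small generator along those lines, written in C++ rather than AWK (a hypothetical sketch, not the poster's tool): it reads mapping lines of the form "0xA1 0x0104", as found in the Unicode consortium's 8859 mapping files, and emits the initializers for a 256-entry translation table. The identity default for unmapped bytes is my own assumption.

```cpp
#include <iomanip>
#include <iostream>
#include <sstream>
#include <string>

// Read "0xNN <tab> 0xNNNN" mapping lines (comments start with '#')
// and emit the C++ initializers for a 256-entry translation table,
// eight entries per line, in the style of the table above.
void emitTable(std::istream& in, std::ostream& out)
{
    unsigned int table[256];
    for (int i = 0; i != 256; ++i) table[i] = i;  // assume identity default
    std::string line;
    while (std::getline(in, line)) {
        if (line.empty() || line[0] == '#') continue;  // skip comments
        std::istringstream fields(line);
        unsigned int from, to;
        if (fields >> std::hex >> from >> to && from < 256)
            table[from] = to;
    }
    for (int i = 0; i != 256; ++i) {
        out << "0x" << std::hex << std::uppercase
            << std::setw(4) << std::setfill('0') << table[i]
            << ((i % 8 == 7) ? ",\n" : ", ");
    }
}
```

Regenerating the table is then just a matter of rerunning the program whenever the consortium updates the mapping file.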
> And if you're dealing with multiple source character sets, you really
> do benefit from using a library that others maintain.
Agreed. Even better is when you design it so that the code is
mechanically generated from files provided by the Unicode consortium:-).
If you're interested in seeing what I have done in this respect, my code
is available at my site (www.gabi-soft.fr), in the Experimental
section. And it is *very* experimental.
Consulting in object-oriented computing / http://www.gabi-soft.fr
11 rue de Rambouillet, 78460 Chevreuse, France, +33 (0)1 30 23 45 16
[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]