Preserving literal hex encoding of entities for diacritic symbols

Post by Toine de Greef » Mon, 22 Jul 2002 18:55:45



I'm using msxml3 for instance like this:

   // IXMLDOMNode* attribute_node;
   // VARIANT varAttributeValue;
   attribute_node->get_nodeValue(&varAttributeValue);
   BSTR bstrAttributeValue = V_BSTR(&varAttributeValue);

Even if the original XML file has either UTF-8 or ISO-8859-1 encoding, and
contains diacritic symbols encoded as entities ("&#xE9;"), this BSTR will
contain these entities as "normal" ASCII values. Of course, one could
translate these values back by parsing the attribute value, but this would
cause a performance hit *twice*.

Is there a more elegant solution to this? Would this require _bstr_t, or a
stream with a specified encoding or such? I'm really looking forward to
hearing about this.

Regards,

Toine de Greef
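
For reference, the "translate these values back" step mentioned above can be sketched in a few lines. This is only an illustration, not part of msxml3: `EscapeNonAscii` is a hypothetical helper name, and the input is assumed to be the UTF-16 contents of the BSTR copied into a `std::u16string`. Markup characters such as `&`, `<` and `"` are left alone and would still need the usual escaping.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <string>

// Hypothetical helper (not an MSXML API): rewrite every non-ASCII
// character of a UTF-16 string as a hexadecimal character reference.
std::string EscapeNonAscii(const std::u16string& text) {
    std::string out;
    for (std::size_t i = 0; i < text.size(); ++i) {
        char32_t cp = text[i];
        // Combine a surrogate pair into a single code point.
        if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < text.size() &&
            text[i + 1] >= 0xDC00 && text[i + 1] <= 0xDFFF) {
            cp = 0x10000 + ((cp - 0xD800) << 10) + (text[i + 1] - 0xDC00);
            ++i;
        }
        if (cp < 0x80) {
            out += static_cast<char>(cp);  // plain ASCII passes through
        } else {
            char buf[16];
            std::snprintf(buf, sizeof buf, "&#x%X;", static_cast<unsigned>(cp));
            out += buf;
        }
    }
    return out;
}

// Usage sketch: the e-acute in "café" becomes a hex character reference.
inline void EscapeNonAsciiExample() {
    assert(EscapeNonAscii(u"caf\u00E9") == "caf&#xE9;");
}
```

The output is pure ASCII, so it can be written into a document declared with any ASCII-compatible encoding.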

 
 
 

Post by Julian F. Reschk » Mon, 22 Jul 2002 18:56:52




> I'm using msxml3 for instance like this:

>    // IXMLDOMNode* attribute_node;
>    // VARIANT varAttributeValue;
>    attribute_node->get_nodeValue(&varAttributeValue);
>    BSTR bstrAttributeValue = V_BSTR(&varAttributeValue);

> Even if the original XML file has either UTF-8 or ISO-8859-1 encoding, and
> contains diacritic symbols encoded as entities ("&#xE9;"), this BSTR will
> contain these entities as "normal" ASCII values. Of course, one could

Yes. That's a (required) feature of the XML processors.

> translate these values back by parsing the attribute value, but this would
> cause a performance hit *twice*.

Why would you want to translate them back?


 
 
 

Post by Toine de Greef » Mon, 22 Jul 2002 20:18:55


Julian,

I think you helped me by asking the right question: "Why would you want to
translate them back?". I was going to answer: "Because after doing some
other stuff with it, no XSL(T), I want to write it back into an XML
document." And I thought that other document would dislike these literal
diacritical entries (being UTF-8). Everything will work fine, though, when
using ISO-8859-1 for both documents.

Thanks again,

Toine.





 
 
 

Post by Julian F. Reschk » Mon, 22 Jul 2002 20:37:57




> Julian,

> I think you helped me by asking the right question: "Why would you want to
> translate them back?". I was going to answer: "Because after doing some
> other stuff with it, no XSL(T), I want to write it back into an XML
> document." And I thought that other document would dislike these literal
> diacritical entries (being UTF-8). Everything will work fine, though, when
> using ISO-8859-1 for both documents.

Well. XML processors are required to accept UTF-8 and UTF-16 encoded
documents. If you can find an XML processor which rejects a UTF-8-encoded
document, you should report that as a bug.
 
 
 

Post by Toine de Greef » Mon, 22 Jul 2002 21:30:57


Hopefully we're not stretching this out of proportion, but what I was saying
is the following. ISO-8859-1 diacritic symbols not written as entities (i.e.
"é"), in an XML file with UTF-8 encoding, violated encoding constraints. I
assume this is in accordance with XML processor requirements.

As a workaround, I used entity notation for them. When using
get_nodeValue(), to my surprise these values were translated, thus breaking
my workaround. I even tried setting encoding="ISO-8859-1", but that did not
change the behaviour of get_nodeValue().

My conclusion: no bug.





 
 
 

Post by Julian F. Reschk » Mon, 22 Jul 2002 21:53:11




> Hopefully we're not stretching this out of proportion, but what I was saying
> is the following. ISO-8859-1 diacritic symbols not written as entities (i.e.
> "é"), in an XML file with UTF-8 encoding, violated encoding constraints. I
> assume this is in accordance with XML processor requirements.

Sure. If the file is declared to be UTF-8, it must not contain characters
encoded as a non-UTF-8 octet sequence.
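
To make that constraint concrete, here is a minimal sketch of a UTF-8 well-formedness check. `IsValidUtf8` is a hypothetical name used only for illustration; it is not how msxml3 or any particular parser implements the check. It shows why a lone ISO-8859-1 byte 0xE9 is rejected in a document declared as UTF-8, while the two-byte sequence 0xC3 0xA9 for the same character is accepted.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: returns true if `bytes` is well-formed UTF-8.
bool IsValidUtf8(const std::string& bytes) {
    const std::size_t n = bytes.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char b0 = static_cast<unsigned char>(bytes[i]);
        std::size_t len = 0;
        char32_t cp = 0;
        char32_t min = 0;  // smallest code point allowed for this length
        if (b0 < 0x80)                { len = 1; cp = b0;        min = 0; }
        else if ((b0 & 0xE0) == 0xC0) { len = 2; cp = b0 & 0x1F; min = 0x80; }
        else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; min = 0x800; }
        else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; min = 0x10000; }
        else return false;              // stray continuation or invalid lead byte
        if (i + len > n) return false;  // sequence is cut off
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char b = static_cast<unsigned char>(bytes[i + k]);
            if ((b & 0xC0) != 0x80) return false;  // not a continuation byte
            cp = (cp << 6) | (b & 0x3F);
        }
        // Reject overlong forms, surrogates and values above U+10FFFF.
        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
            return false;
        }
        i += len;
    }
    return true;
}
```

With this sketch, `IsValidUtf8("caf\xE9")` is false and `IsValidUtf8("caf\xC3\xA9")` is true.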

> As a workaround, I used entity notation for them. When using

Why not just use UTF-8 encoding, if this is what the file is declared to
use?

> get_nodeValue(), to my surprise these values were translated, thus breaking
> my workaround. I even tried setting encoding="ISO-8859-1", but that did not
> change the behaviour of get_nodeValue().

> My conclusion: no bug.





 
 
 

Post by Toine de Greef » Sun, 28 Jul 2002 16:59:30




[...]

> Why not just use UTF-8 encoding, if this is what the file is declared to
> use?

I think both ISO-8859-1 and references to entities are more intuitive/human
readable (especially when interfacing non-Unicode aware applications) than
pure UTF-8 encoding, for instance:

ë in ISO-8859-1: ë
ë as entity reference: &euml; or &#235; or &#xEB;
ë in UTF-8: Ã« (the two bytes 0xC3 0xAB, shown as ISO-8859-1)
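
The UTF-8 line of that comparison can be reproduced with a short sketch. `Utf8Encode` is a hypothetical helper written for illustration only, and it assumes the caller passes a valid Unicode scalar value (no surrogates, nothing above U+10FFFF).

```cpp
#include <string>

// Hypothetical helper: encode one Unicode code point as UTF-8 bytes.
// Assumes `cp` is a valid scalar value; no validation is performed.
std::string Utf8Encode(char32_t cp) {
    std::string out;
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

For U+00EB (ë) this yields the bytes 0xC3 0xAB, which an ISO-8859-1 viewer displays as two separate characters.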

 
 
 
