Automatic URL encoding

Post by Andrzej Mierzw » Thu, 04 Jul 2002 20:51:00



Suppose an XML file contains some Czech characters (it doesn't matter in
which encoding), e.g. 1??y. Transform that file with an XSL
transformation to produce HTML. Everything works, but if these characters
appear inside an <a> tag, the URLs are escaped using UTF-8. This causes
big problems if the whole page is in some other encoding, e.g.
windows-1250 (Windows Central European). When a user clicks a link
generated this way, the server decodes the parameters passed via the link
with the encoding used to generate the page. But they are encoded in UTF-8!

(I tried this with the .NET XML classes and with MSXML 3.0.)

Any ideas?

Thx
Andrzej Mierzwa
WEBCOM a.s.

 
 
 

Post by Stuart Celarie » Thu, 11 Jul 2002 00:46:04


Andrzej,

URIs, such as URLs, are restricted to a small set of characters [1], so
Czech characters cannot appear in a URL without being escaped using octet
encoded sequences. In that case, why is the character encoding of URLs an
issue? And what does that have to do with XSLT? Wouldn't the same issues
apply to statically-produced HTML encoded (e.g.) with Windows-1250?

Cheers,
Stuart
--
Stuart Celarier, Fern Creek, www.ferncrk.com
Consultant on .NET, Win32, C#, C++, COM, XML, XSLT and more.

[1] http://www.ietf.org/rfc/rfc2396.txt, Section 2 URI Characters and Escape
Sequences
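
For readers following along, here is a minimal Python sketch of the octet
escaping described above. The sample character "ý" is an assumption (the
Czech characters in the original post did not survive), and Python's
urllib.parse.quote stands in for whatever escaping routine a given
processor uses. It shows that the escape sequence depends on which
character encoding supplies the octets:

```python
from urllib.parse import quote

# Hypothetical sample value; the characters in the original post were lost.
value = "ý"

# The same character escaped from two different octet representations.
utf8_form = quote(value, safe="", encoding="utf-8")     # octets C3 BD
cp1250_form = quote(value, safe="", encoding="cp1250")  # octet FD

# ASCII characters outside the allowed URI set must be escaped as well.
space_form = quote(" ", safe="")

print(utf8_form, cp1250_form, space_form)
```

The same character yields %C3%BD from UTF-8 octets and %FD from
windows-1250 octets, and an ASCII space still has to be written as %20.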

 
 
 

Post by Andrzej Mierzw » Thu, 11 Jul 2002 16:59:53


Dear Stuart

Imagine these files:
XML datafile (test1.xml):
<?xml version="1.0" encoding="windows-1250" ?>
<?xml-stylesheet href="test1.xsl" type="Text/xsl"?>
<data>
  <link>
    <url>http://something.somewhere/page.aspx?data=11??y</url>
    <desc>Link11?y?</desc>
  </link>
</data>

XSL transformation (test1.xsl):
<?xml version="1.0" encoding="windows-1250"?>
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

<xsl:output indent="yes" method="html"/>
<xsl:template match="/">
<html>
<head><title>test</title></head>
<body><xsl:apply-templates/></body></html>
</xsl:template>
<xsl:template match="link">
<xsl:element name="a">
<xsl:attribute name="href"><xsl:value-of select="url"/>
</xsl:attribute>
<xsl:value-of select="desc"/></xsl:element>
</xsl:template>
</xsl:stylesheet>

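Take this C# code:
using System.IO;
using System.Text;
using System.Xml.XPath;
using System.Xml.Xsl;

// Write the transformation result to out.html through a windows-1250 writer.
Encoding enc = Encoding.GetEncoding("windows-1250");
StreamWriter sw = new StreamWriter("out.html", false, enc);
XPathDocument xpath = new XPathDocument("test1.xml");
XslTransform xslt = new XslTransform();
xslt.Load("test1.xsl");
xslt.Transform(xpath, null, sw);
sw.Close();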

IMHO this code should produce HTML that is encoded in the windows-1250
code page. That's true for the content of the <a> tag in the output HTML.
But if you look at the href attribute, the URL is escaped as UTF-8. That's
the main problem with ASPX: if you specify
<globalization responseEncoding="windows-1250"
requestEncoding="windows-1250"/> in web.config, you have a problem, because
URLs are escaped as UTF-8 but the ASP.NET framework decodes them with the
windows-1250 code page. Something is wrong IMHO...
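
The mismatch can be reproduced outside ASP.NET. This is only an
illustrative Python sketch: the sample character is hypothetical, and
urllib.parse stands in for the escaping done by the XSLT processor and the
decoding done by the server.

```python
from urllib.parse import quote, unquote

original = "ý"  # hypothetical sample character, not taken from the post

# Escape the way the generated href does it: UTF-8 octets, then %HH.
escaped = quote(original, safe="")  # quote() uses UTF-8 by default

# Decode the same escape sequence under two different assumptions.
as_utf8 = unquote(escaped, encoding="utf-8")
as_cp1250 = unquote(escaped, encoding="cp1250")

print(escaped, as_utf8 == original, as_cp1250 == original)
```

Decoding with UTF-8 returns the original character, while decoding with
windows-1250 turns the two UTF-8 octets into two unrelated characters.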



 
 
 

Post by Julian F. Reschk » Thu, 11 Jul 2002 17:33:42




Quote:> Dear Stuart

> Imagine these files:
> XML datafile (test1.xml):
> <?xml version="1.0" encoding="windows-1250" ?>
> <?xml-stylesheet href="test1.xsl" type="Text/xsl"?>
> <data>
>   <link>
>     <url>http://something.somewhere/page.aspx?data=11??y</url>
>     <desc>Link11?y?</desc>
>   </link>
> </data>

> XSL transformation (test1.xsl):
> <?xml version="1.0" encoding="windows-1250"?>
> <xsl:stylesheet version="1.0"
> xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

> <xsl:output indent="yes" method="html"/>
> <xsl:template match="/">
> <html>
> <head><title>test</title></head>
> <body><xsl:apply-templates/></body></html>
> </xsl:template>
> <xsl:template match="link">
> <xsl:element name="a">
> <xsl:attribute name="href"><xsl:value-of select="url"/>
> </xsl:attribute>
> <xsl:value-of select="desc"/></xsl:element>
> </xsl:template>
> </xsl:stylesheet>

> Take this C# code:
> Encoding enc = Encoding.GetEncoding("windows-1250");
> StreamWriter sw = new StreamWriter("out.html", false, enc);
> XPathDocument xpath = new XPathDocument("test1.xml");
> XslTransform xslt = new XslTransform();
> xslt.Load("test1.xsl");
> xslt.Transform(xpath, null, sw);
> sw.Close();

> IMHO this code should produce HTML that is encoded in the windows-1250
> code page. That's true for the content of the <a> tag in the output HTML.
> But if you look at the href attribute, the URL is escaped as UTF-8. That's
> the main problem with ASPX: if you specify
> <globalization responseEncoding="windows-1250"
> requestEncoding="windows-1250"/> in web.config, you have a problem, because
> URLs are escaped as UTF-8 but the ASP.NET framework decodes them with the
> windows-1250 code page. Something is wrong IMHO...

But that would be a bug in ASP.NET, right? The only portable way to
transport non-ASCII characters in URLs is to UTF-8-encode and then
URL-escape them.
 
 
 

Post by Stuart Celarie » Thu, 11 Jul 2002 23:25:58


Andrzej,

The URL

Quote:> >     http://something.somewhere/page.aspx?data=11??y

is not a valid URL. That is a fundamental issue which you must solve first.
See the reference on URIs that I previously cited. It doesn't matter what
character encoding you use.

What's more, you've already declared that the output method is HTML

Quote:> > <xsl:output indent="yes" method="html"/>

and an XSLT processor is required to be aware of what constitutes valid
HTML.

Quote:> > ... That's the main problem with ASPX: if you specify
> > <globalization responseEncoding="windows-1250"
> > requestEncoding="windows-1250"/> in web.config, you have a problem,
> > because URLs are escaped as UTF-8 but the ASP.NET framework decodes them
> > with the windows-1250 code page.

I don't think so. Try this at home: create a static HTML page by hand using
the URL above, with any character encoding you might desire. Can you get it
to work? (Hint: it should not.) You could also try creating static HTML
content in an ASPX page with the same URL. I think you will discover this is
not an XSLT problem, it is not an ASP.NET problem, it is not even an HTML
problem. The URL you've written is invalid, and that's a problem.

As Julian pointed out

Quote:> The only portable way to
> transport non-ASCII characters in URLs is to UTF-8-encode and then
> URL-escape them.

Well, we have to refine what Julian meant by non-ASCII: 0x20, for example,
is the ASCII space character, yet it still has to be octet-encoded as %20.
But he's close to being correct. Again, I refer you to the specification of
what constitutes a valid URI.

Cheers,
Stuart
--
Stuart Celarier, Fern Creek, www.ferncrk.com
Consultant on .NET, Win32, C#, C++, COM, XML, XSLT and more.

 
 
 

Post by Andrzej Mierzw » Fri, 12 Jul 2002 22:30:14


Dear Stuart :-)

Quote:> > >     http://something.somewhere/page.aspx?data=11??y

> is not a valid URL. That is a fundamental issue which you must solve first.
> See the reference on URIs that I previously cited. It doesn't matter what
> character encoding you use.

I know that this URL is not valid. The XSLT processor translated it to a
valid form by escaping it according to RFC 1738. But why is this URL
encoded as UTF-8 while everything else is in windows-1250? That's what I'm
pointing at. Why are two different code pages used to encode a single
document? I think that's a bug in the .NET XSLT processor.

--
Andrzej Mierzwa
WEBCOM a.s.
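
For what it's worth, the XSLT 1.0 Recommendation (section 16.2, HTML output
method) says the html output method should escape non-ASCII characters in
URI attribute values using the method recommended in section B.2.1 of
HTML 4.0, which is to convert each character to UTF-8 and escape the
resulting octets as %HH. That would explain why href ends up in UTF-8
regardless of the output encoding. A rough Python sketch of that rule, with
a hypothetical helper name escape_non_ascii and a made-up sample character:

```python
def escape_non_ascii(uri: str) -> str:
    """Percent-escape only the non-ASCII characters of uri, via UTF-8."""
    out = []
    for ch in uri:
        if ord(ch) < 128:
            out.append(ch)  # ASCII characters are passed through untouched
        else:
            # Each UTF-8 octet of the character becomes one %HH sequence.
            out.extend(f"%{byte:02X}" for byte in ch.encode("utf-8"))
    return "".join(out)


# Hypothetical URL; the query value in the original post was garbled.
href = escape_non_ascii("http://something.somewhere/page.aspx?data=ý")
print(href)
```

Note that the page's own encoding never enters into this rule; only the
attribute value is treated this way, so the link text can still be written
in windows-1250.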

 
 
 

Post by Julian F. Reschk » Fri, 12 Jul 2002 22:35:36




Quote:> Dear Stuart :-)

> > > >     http://something.somewhere/page.aspx?data=11??y

> > is not a valid URL. That is a fundamental issue which you must solve
> > first. See the reference on URIs that I previously cited. It doesn't
> > matter what character encoding you use.

> I know that this URL is not valid. The XSLT processor translated it to a
> valid form by escaping it according to RFC 1738. But why is this URL
> encoded as UTF-8 while everything else is in windows-1250? That's what
> I'm pointing at. Why are two different code pages used to encode a single
> document? I think that's a bug in the .NET XSLT processor.

When decoding an arbitrary URL into characters (not octets!), you need to
know the encoding. There's no way to specify the encoding *in* the URL and
there's no out-of-band information you can use, so there MUST be a
single encoding used by everybody. AFAIK, there's currently no official spec
defining this, but one is in the works (and it *will* specify UTF-8).

So if you write software that assumes that characters in URLs are encoded in
something other than UTF-8, you'll have a problem in the future.
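
Since the URL carries no encoding label, a server that has to cope with
both kinds of links can only guess. One possible heuristic, sketched in
Python under the assumption that the legacy links use windows-1250 (the
function name decode_query_value is made up for illustration): try strict
UTF-8 first and fall back to the legacy code page only when the octets are
not valid UTF-8.

```python
from urllib.parse import unquote_to_bytes


def decode_query_value(escaped: str, legacy_encoding: str = "cp1250") -> str:
    """Turn %HH escapes into text, preferring UTF-8 over a legacy code page."""
    raw = unquote_to_bytes(escaped)  # undo the %HH escaping, keep raw octets
    try:
        return raw.decode("utf-8")  # strict: raises on invalid UTF-8
    except UnicodeDecodeError:
        # Assumption: anything that is not valid UTF-8 came from a legacy page.
        return raw.decode(legacy_encoding, errors="replace")


from_utf8_link = decode_query_value("%C3%BD")
from_legacy_link = decode_query_value("%FD")
print(from_utf8_link == from_legacy_link)
```

This is a heuristic, not a guarantee: an octet sequence that happens to be
valid UTF-8 is always read as UTF-8.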

 
 
 

Post by Andrzej Mierzw » Sat, 13 Jul 2002 16:26:38


It's not my software, it's Microsoft's ASP.NET, and this is a configurable
option. BTW, if you look at my previous post, you will see that one can
specify the encoding when transforming XML data with an XSL transformation.




 
 
 

Post by Julian F. Reschk » Sun, 14 Jul 2002 04:55:53




Quote:> It's not my software, it's Microsoft's ASP.NET, and this is a
> configurable option. BTW, if you look at my previous post, you will see
> that one can specify the encoding when transforming XML data with an XSL
> transformation.

Yes, but that doesn't affect *URL* encoding.
 
 
 

1. Probs with automatic HTML encoding using Text property of node

Hi guys

I'm writing a small content management tool for a website's news section. On
one screen I have a textarea where users can enter news updates containing
HTML tags. The problem is that the HTML tags are automatically escaped: < is
converted to &lt; and so on. I'm desperate here, guys, as I'm facing a
deadline. I'm not an expert in XML at all.

Anyway, here's the code. If you have any questions, please send me an
email. The variables 'Title' and 'Details' may contain HTML tags:

var Title = String(Request.Form("title"));
var Details = String(Request.Form("details"));

var xmlDoc = Server.CreateObject("Microsoft.XMLDOM");

// async and validateOnParse are boolean properties, so assign booleans.
xmlDoc.async = false;
xmlDoc.validateOnParse = true;
xmlDoc.load(Server.MapPath("news.xml"));

if (xmlDoc.parseError.errorCode == 0)
{
 var xsl = new ActiveXObject("Microsoft.XMLDOM");
 xsl.async = false;
 // Resolve the stylesheet path the same way as news.xml.
 xsl.load(Server.MapPath("news.xsl"));

 var nodes = xmlDoc.documentElement.childNodes;

 // Update the first NEWS element: its DATE attribute, TITLE and DETAILS.
 nodes.item(0).setAttribute("DATE", Date());
 var news = nodes.item(0).childNodes;
 // Assigning to .text stores the string as character data,
 // so any markup in it is escaped when the document is saved.
 news.item(0).text = Title;
 news.item(1).text = Details;

 xmlDoc.save(Server.MapPath("news.xml"));
 Response.Redirect("main.asp#news");
...
...

News.xml looks like this:
<?xml version="1.0"?>
<!DOCTYPE NEWSES SYSTEM "news.dtd">
<NEWSES>
 <NEWS DATE="Thu May 31 17:33:39 2001">
  <TITLE>NEWS FLASH!!</TITLE>
  <DETAILS>Dit is een test op Homer</DETAILS>
 </NEWS>
</NEWSES>

News.dtd looks like:
<!ELEMENT NEWSES (NEWS)>
<!ELEMENT NEWS (TITLE,DETAILS)>

<!ELEMENT TITLE (#PCDATA)>
<!ELEMENT DETAILS (#PCDATA)>

<!ATTLIST NEWS DATE CDATA #REQUIRED>

I appreciate any help.

Gabriel Lozano-Morn
Software Engineer
Harte-Hanks CRM Services Belgium
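
The escaping described here is what a DOM does whenever a string is stored
as text, so that the saved file stays well-formed. A small Python sketch
using xml.dom.minidom (an assumption for illustration; the post itself uses
MSXML from JScript) contrasts a text node with a CDATA section, which the
DOM standard provides through createCDATASection:

```python
from xml.dom.minidom import parseString

# Simplified stand-in for news.xml, reduced to a single DETAILS element.
doc = parseString("<NEWS><DETAILS/></NEWS>")
details = doc.getElementsByTagName("DETAILS")[0]

# Stored as a text node: markup characters are escaped on serialization.
details.appendChild(doc.createTextNode("<b>Hello</b>"))
as_text = details.toxml()

# Stored as a CDATA section: the characters are written out verbatim.
details.removeChild(details.firstChild)
details.appendChild(doc.createCDATASection("<b>Hello</b>"))
as_cdata = details.toxml()

print(as_text)
print(as_cdata)
```

Whether a CDATA section suits the rest of the pipeline (for example the
stylesheet that renders news.xml) is not something the post settles.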
