Requirement for N on Unicode literals

Post by Tim » Mon, 21 Jan 2002 07:40:00



Article INF: Unicode String Constants in SQL Server Require N Prefix
(Q239530) argues "This notation is necessary to provide backward
compatibility with existing applications. For example, "SELECT
'Hello'" must continue to return a non-Unicode string because many
applications will expect the behavior of SQL Server 6.5, which did not
support Unicode data; the new syntax "SELECT N'Hello'" has been added
to allow the passing of Unicode strings to and from SQL Server 7.0. "

What was the use of NCHAR, NVARCHAR and NTEXT columns in 6.5?

The SQL standard's requirement of the N prefix seems to be completely
redundant. The database process knows the character set of the client
(source) and of the database column, and can convert as required.

- Tim

 
 
 

Requirement for N on Unicode literals

Post by Erland Sommarskog » Tue, 22 Jan 2002 08:19:08



> Article INF: Unicode String Constants in SQL Server Require N Prefix
> (Q239530) argues "This notation is necessary to provide backward
> compatibility with existing applications. For example, "SELECT
> 'Hello'" must continue to return a non-Unicode string because many
> applications will expect the behavior of SQL Server 6.5, which did not
> support Unicode data; the new syntax "SELECT N'Hello'" has been added
> to allow the passing of Unicode strings to and from SQL Server 7.0. "

> What was the use of NCHAR, NVARCHAR and NTEXT columns in 6.5?

> The SQL standards requirement of the N prefix seems to be completely
> redundant. The database process knows the character set of the client
> (source) and the database column and can convert as required.

I believe that N'' is ANSI.

As for the RDBMS knowing the client character set, that is quite a broad
statement to make: it would have to hold for Sybase, SQL Server, Oracle,
Ocelot and so on alike.

But let's constrain ourselves to SQL Server. Does SQL Server always know
what the client is capable of handling? I don't think so. Sure, if a
client is using DB-Library, or ODBC < 3.70, SQL Server knows that this
client is not able to understand Unicode, so SQL Server will convert.

But the client could be running on a machine where MDAC 2.6 is installed,
while the code that is calling ODBC/OLE DB is legacy code which does not
handle Unicode. If SQL Server passed 'Hello' as 0x480065006C006C006F00 to
the client library, the application would probably stop at the first NUL
byte.
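
As a quick illustration of the byte sequences involved (assuming a
Unicode-aware tool such as Query Analyzer to run it from):

    -- N'Hello' is UCS-2: every other byte is 0x00, which a char-based
    -- client may treat as a string terminator
    SELECT CONVERT(varbinary(20), N'Hello')   -- 0x480065006C006C006F00
    SELECT CONVERT(varbinary(20), 'Hello')    -- 0x48656C6C6F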

--
Erland Sommarskog, Abaris AB

SQL Server MVP

 
 
 

Requirement for N on Unicode literals

Post by Bart Duncan [MSFT] » Wed, 23 Jan 2002 02:23:11


For what it's worth, the use of the N prefix to designate a literal as
type nchar/nvarchar is part of the ANSI spec, and other database vendors
handle nchar/nvarchar types in a similar way.  I agree that it is not
immediately obvious why the N prefix is needed but Erland's comments are
right on target.  Without the N prefix the server has no way of knowing
whether the client is expecting a wide char (Unicode) or single-byte
output.  Suppose a developer allocated a 5 byte buffer to hold the output
of "SELECT 'Hello'".  If this query suddenly started returning a 10-byte
WCHAR string instead of a 5-byte char string, the buffer would be overrun
and the likely result would be access violations or heap corruption in
the client application.  
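
You can see the size difference directly; this is just an illustrative
query, with nothing special about the literal chosen:

    -- DATALENGTH reports storage size in bytes, not characters
    SELECT DATALENGTH('Hello')    -- 5  (char, one byte per character)
    SELECT DATALENGTH(N'Hello')   -- 10 (nchar, two bytes per character)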

And there are some other reasons why this syntax is necessary.  A set of
string sorting/comparison rules for a non-Unicode code page that has, in
most cases, fewer than 255 characters isn't very useful when you need to
compare two Unicode strings which may contain any of tens of thousands of
different characters.  For this reason, different sets of rules are
needed for Unicode string comparisons vs non-Unicode string comparisons.  
For example, for backwards compatibility a DBA might configure his server
to provide a legacy SQL sort order for non-Unicode strings but use a more
robust Windows collation for Unicode data.  SQL needs to know whether a
string literal should be internally typed as Unicode or non-Unicode so it
knows what string comparison rules should come into play.  Consider, for
example:

    IF 'abc' = 'ABC' PRINT 'equal' ELSE PRINT 'not equal'

Suppose you had a SQL 7.0 instance with a case-sensitive Unicode
collation but a case-insensitive non-Unicode sort order.  In this case,
if SQL decides to treat these strings as Unicode, the result would be
"not equal", but if they are handled as non-Unicode strings the output
would be "equal".  So without the N prefix to disambiguate Unicode string
literals from non-Unicode string literals the result of a simple
expression like 'abc'='ABC' may be unknown.  
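
You can simulate the ambiguity on a single instance by forcing the
comparison rules explicitly (a sketch for SQL Server 2000, where the
COLLATE clause is available; the collation names below are just one
case-sensitive/case-insensitive pair):

    -- case-sensitive rules: prints 'not equal'
    IF 'abc' = 'ABC' COLLATE Latin1_General_CS_AS
        PRINT 'equal' ELSE PRINT 'not equal'

    -- case-insensitive rules: prints 'equal'
    IF 'abc' = 'ABC' COLLATE Latin1_General_CI_AS
        PRINT 'equal' ELSE PRINT 'not equal'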

SQL 6.5 didn't support Unicode data types or nchar/nvarchar/ntext.  

HTH -
Bart
------------
Bart Duncan
Microsoft SQL Server Support
Please reply to the newsgroup only - thanks.
This posting is provided "AS IS" with no warranties, and confers no
rights.


Requirement for N on Unicode literals

Post by Tim » Wed, 23 Jan 2002 12:54:38


Bart and Erland, thanks for your comments.

I know that the N prefix is part of the ANSI standard. My argument was
that it was unnecessary there, but our differences perhaps come from
different programming models.

My background is more with Unix/Oracle/Sybase and general I18N rather
than specifics of the Windows platform and SQL Server. On those
platforms the client has a character set with which it connects to the
server. Each char/varchar or text column on the server has a character
set that may or may not be the same as the database default set. For
example the client may be in Japanese SJIS and the server EUC-JP -
same repertoire with different encodings. Or the client could be SJIS
and the server UTF-8, a superset of the client's repertoire. The server
does have to know the client character set - both Oracle and Sybase do
this. On both commands and results returned the server makes the
appropriate character set translation. If a character in one set does
not exist in the other, a fallback character, usually '?', is substituted.

So if a column is defined as
CHARACTER (length) CHARACTER SET IS UTF-8
then it can be accessed, updated and queried by a client of any
character set, with the server making all the conversions. Updates and
queries can of course only involve the common subset of the two repertoires.

In the ANSI SQL standard NATIONAL CHARACTER (a/k/a NCHAR) is defined
to be absolutely nothing other than:

CHARACTER (length) CHARACTER SET IS implementation-defined-charset

where the implementation-defined charset is the National Character
Set, fixed to UCS-2 in SQL Server 7 and configurable in Oracle.
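
Spelled out, that definition looks roughly like this ANSI-style sketch
(it is not T-SQL, and the character set name is only a placeholder for
whatever the implementation defines as its national set):

    CREATE TABLE t (
        a NATIONAL CHARACTER(10),                     -- i.e. NCHAR(10)
        b CHARACTER(10) CHARACTER SET NATIONAL_SET    -- the same thing, spelled out
    )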

After discussions with colleagues today I believe that the Windows
conceptual model is different. Instead of two, there are three character
sets in play. Instead of the client simply operating in the connection
character set, there is both the actual client character set and
the connection character set. On Windows 2000 and NT the actual client
character set is Unicode using UCS-2. This is converted to the
connection character set (or at least the system acts as though this
conversion occurs). In this scenario the N prefix tells the server to
not convert the literal to the connection character set, but to pass
the bits through unchanged.

This jibes with Bart's response, with Windows treating Unicode
characters quite separately from other characters. You are saying that
it is not just at the encoding-form level that a Unicode 'A' differs
from an ASCII 'A', but also at the abstract character level. I may not
agree with this, but I can accept it.

One question - if this analysis is reasonable, is the connection
character set established by the collation sequence defined for a
column, or by the locale of the Windows system?

 
 
 

Requirement for N on Unicode literals

Post by Bart Duncan [MSFT] » Fri, 25 Jan 2002 07:39:04


The N prefix is required for any strings of type NCHAR ("national
language character set") in Oracle (in fact I believe Oracle won't even
do an implicit datatype conversion between the two types -- for example
if my memory serves, you will get an error if you try to insert a literal
into an NCHAR column without the N prefix).  However, one key difference
between Oracle and SQL is (as you mention) that Oracle allows you to set
the "database character set" that applies to CHAR datatypes to Unicode
(UTF-8 encoding).  If you set up Oracle with UTF-8 as the database
character set you can store and retrieve Unicode data without the N
prefix on literals because the Unicode string can be internally
represented as CHAR without data loss.  In contrast, SQL Server doesn't
allow you to use a Unicode encoding as the character set that underlies
CHAR datatypes -- you must use NCHAR to store Unicode data.  And for all
NCHAR literals an N prefix is required or the string is internally typed
as a CHAR literal, which causes a conversion that results in the loss of
any characters that don't exist in the non-Unicode SQL Server code page.  
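
A minimal sketch of that data loss (it assumes a server whose default
collation uses a Latin code page, and should be run from a Unicode-aware
tool such as Query Analyzer):

    CREATE TABLE #t (c nvarchar(10))
    INSERT #t VALUES (N'Ω')   -- N prefix: the Greek letter is stored intact
    INSERT #t VALUES ('Ω')    -- no prefix: typed as char first, so on a Latin
                              -- code page it degrades to '?'
    SELECT c FROM #t          -- returns 'Ω' and '?'
    DROP TABLE #t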

So if SQL also allowed you to use a Unicode encoding for CHAR datatypes
it would not be necessary to use NCHAR datatypes at all, and therefore
the N prefix wouldn't be necessary.  Is this what you are saying?  This
is true.  I'm not sure that I understand your comments about the client's
code page -- as far as I can see the client code page doesn't really come
into play in any of the things discussed above.  

> The server does have to know the client character set - both
> Oracle and Sybase do this. On both commands and results
> returned the server makes the appropriate character set translation

SQL Server does not convert CHAR data to the client's code page before
sending.  If the client app tells ODBC/OLEDB that it should place the
results in a buffer that is of SQL_C_CHAR type (non-WCHAR), it is the
ODBC driver or OLEDB provider that does this conversion, using a
mechanism called Autotranslation to convert the server's data from the
source code page to Unicode, then from Unicode to the client's ANSI code
page before filling the client app's buffer.  If the app provides a
SQL_C_WCHAR buffer the driver or provider will convert from the source
code page to Unicode UTF-16.  In both cases the "source code page" is the
code page determined by whatever expression produced the string.  This
could be the SQL instance's default code page in the case of a query like
"select 'hello'" where no other collation applies, it could be the code
page of a particular column for a query like "select mycol from mytable",
or it could be an arbitrary code page in the event of some expression
that forced a different code page (for example "SELECT 'hello' COLLATE
Japanese_CI_AS").

In the case of data sent from the client to the server the situation is a
bit different.  In some cases the procedure is the same as above: the
driver/provider converts from the client's ANSI code page to the SQL
Server's code page via Unicode before sending.  In other cases (namely
language events, or ad hoc non-prepared queries) the driver always
converts the entire T-SQL batch to Unicode before sending it to the server.
And finally, with certain versions of the driver and SQL the data is
simply tagged with the client's ANSI code page and sent to the server,
where SQL Server may convert it to either Unicode or to a different
non-Unicode code page if necessary (for example if the data must be
inserted into a column with a different collation).  

The specifics of all the cases described above are very confusing, but
the outcome of all three cases is the same and I don't see how the path
the data takes on the client is relevant to whether an N prefix is
necessary to disambiguate NCHAR string literals in a language event.  The
client's collation shouldn't dictate the server's string comparison
semantics.  The bottom line is that there are two different string data types
in the server with different encodings and comparison semantics, so the
server needs a different syntax to differentiate one of the two types so
that the encoding/semantics are known and the server can provide
consistent, well-defined behavior.  

HTH -
Bart
------------
Bart Duncan
Microsoft SQL Server Support

Please reply to the newsgroup only - thanks.

This posting is provided "AS IS" with no warranties, and confers no
rights.


JDBC Driver Converting Character String Literals to Unicode

I'm working with a client who has developed a Java application hitting a SQL
Server 2000 database.  The database tables use char() and varchar() fields.
The SQL statements that are used in the application use character string
literals to search the database.  For example:

select lastname, firstname from customers where state = 'GA'

The JDBC driver (they have tried the DataDirect, BEA, and Microsoft drivers)
appears to be converting the character string literals to Unicode before the
SQL statement gets to the database.  Thus, SQL Server is receiving the
following:

select lastname, firstname from customers where state = N'GA'

Since indices on the tables are based on char() and varchar() fields, SQL
Server does not appear to be able to make optimal use of the indices, since
the search criterion arrives as a Unicode value instead of a normal character
value.  If we use Query Analyzer to look at the query plans generated for
various examples of queries without and with the Unicode character string
literals, we see definite differences in how SQL Server is using indices on
various tables (one way to capture this is sketched below).
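
For anyone who wants to reproduce the comparison, this is roughly how we
captured the plans in Query Analyzer (SHOWPLAN_TEXT is just one way; the
graphical showplan works too):

    SET SHOWPLAN_TEXT ON
    GO
    -- char literal: can seek the index on the char(2) state column directly
    SELECT lastname, firstname FROM customers WHERE state = 'GA'
    -- nchar literal: nchar has higher datatype precedence, so the char column
    -- is implicitly converted to Unicode, which can prevent the same index seek
    SELECT lastname, firstname FROM customers WHERE state = N'GA'
    GO
    SET SHOWPLAN_TEXT OFF
    GO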

So, the question:  is there any way to keep the JDBC driver from converting
character string literals to Unicode values?  Although we could convert all
of the underlying char() and varchar() fields to nchar() and nvarchar()
fields, it would be a massive job at this point.  In researching this
problem, we came across some information regarding the ODBC driver for SQL
Server that indicates that it has a connection string setting that allows
you to turn on and turn off this "translation" mechanism.  Anyone know of a
similar feature for any JDBC driver out there (DataDirect, BEA, Microsoft,
or otherwise)?

Thanks for any thoughts...

Dave Fackler
