The N prefix is required for any strings of type NCHAR ("national
language character set") in Oracle (in fact I believe Oracle won't even
do an implicit datatype conversion between the two types -- for example,
if my memory serves, you will get an error if you try to insert a literal
without the N prefix into an NCHAR column). However, one key difference
between Oracle and SQL Server is (as you mention) that Oracle allows you
to set the "database character set" that applies to CHAR datatypes to
Unicode (UTF-8 encoding). If you set up Oracle with UTF-8 as the database
character set you can store and retrieve Unicode data without the N
prefix on literals because the Unicode string can be internally
represented as CHAR without data loss. In contrast, SQL Server doesn't
allow you to use a Unicode encoding as the character set that underlies
CHAR datatypes -- you must use NCHAR to store Unicode data. And for all
NCHAR literals an N prefix is required or the string is internally typed
as a CHAR literal, which causes a conversion that results in the loss of
any characters that don't exist in the non-Unicode SQL Server code page.
So if SQL Server also allowed you to use a Unicode encoding for CHAR
datatypes it would not be necessary to use NCHAR datatypes at all, and
therefore the N prefix wouldn't be necessary. Is this what you are
saying? If so, that is true. I'm not sure that I understand your
comments about the client's
code page -- as far as I can see the client code page doesn't really come
into play in any of the things discussed above.
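The character loss described above is ordinary lossy code-page conversion. A quick Python sketch of the same effect, using cp1252 as a stand-in for a non-Unicode SQL Server code page (the actual page depends on the server collation):

```python
# A literal without the N prefix is typed as CHAR, so characters outside
# the server's code page are lost in the implicit conversion.
text = "Ω resistance"          # U+03A9 GREEK CAPITAL OMEGA is not in cp1252

# What survives if the literal is treated as CHAR data:
as_char = text.encode("cp1252", errors="replace").decode("cp1252")
print(as_char)                  # "? resistance" -- the Omega is gone

# With the N prefix the literal stays Unicode end to end:
as_nchar = text                 # no round trip through the code page
print(as_nchar)                 # "Ω resistance"
```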
Quote:>The server does have to know the client character set - both
>Oracle and Sybase do this. On both commands and results
>returned the server makes the appropriate character set translation
SQL Server does not convert CHAR data to the client's code page before
sending. If the client app tells ODBC/OLEDB that it should place the
results in a buffer that is of SQL_C_CHAR type (non-WCHAR), it is the
ODBC driver or OLEDB provider that does this conversion, using a
mechanism called Autotranslation to convert the server's data from the
source code page to Unicode, then from Unicode to the client's ANSI code
page before filling the client app's buffer. If the app provides a
SQL_C_WCHAR buffer the driver or provider will convert from the source
code page to Unicode (UTF-16). In both cases the "source code page" is the
code page determined by whatever expression produced the string. This
could be the SQL instance's default code page in the case of a query like
"select 'hello'" where no other collation applies, it could be the code
page of a particular column for a query like "select mycol from mytable",
or it could be an arbitrary code page in the event of some expression
that forces a different code page (for example a "SELECT 'hello'" with an
explicit COLLATE clause).
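The two-hop Autotranslation path can be sketched in Python, with the codecs module standing in for the driver's conversion layer (assumed pages: cp850 as the server-side source code page, cp1252 as the client ANSI code page; the real pages depend on the column collation and the client OS):

```python
# Data as stored under the source code page on the server side:
server_bytes = "séance".encode("cp850")

# Hop 1: source code page -> Unicode
unicode_text = server_bytes.decode("cp850")

# Hop 2: Unicode -> client ANSI code page (fills a SQL_C_CHAR buffer)
client_bytes = unicode_text.encode("cp1252", errors="replace")
print(client_bytes.decode("cp1252"))    # "séance"

# A SQL_C_WCHAR buffer instead receives UTF-16 after hop 1:
wchar_buffer = unicode_text.encode("utf-16-le")
print(len(wchar_buffer))                # 12 -- two bytes per character
```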
In the case of data sent from the client to the server the situation is a
bit different. In some cases the procedure is the same as above: the
driver/provider converts from the client's ANSI code page to the SQL
Server's code page via Unicode before sending. In other cases (namely
language events, or ad hoc non-prepared queries) the driver always
converts the entire T-SQL batch to Unicode before sending it to the server.
And finally, with certain versions of the driver and SQL the data is
simply tagged with the client's ANSI code page and sent to the server,
where SQL Server may convert it to either Unicode or to a different
non-Unicode code page if necessary (for example if the data must be
inserted into a column with a different collation).
The specifics of all the cases described above are very confusing, but
the outcome of all three cases is the same and I don't see how the path
the data takes on the client is relevant to whether an N prefix is
necessary to disambiguate NCHAR string literals in a language event. The
client's collation shouldn't dictate the server's string comparison
semantics. Bottom line is that there are two different string data types
in the server with different encodings and comparison semantics, so the
server needs a different syntax to differentiate one of the two types so
that the encoding/semantics are known and the server can provide
consistent, well-defined behavior.
Microsoft SQL Server Support
Please reply to the newsgroup only - thanks.
This posting is provided "AS IS" with no warranties, and confers no
rights.
Subject: Re: Requirement for N on Unicode literals
Date: 21 Jan 2002 19:54:38 -0800
Bart and Erland, thanks for your comments.
I know that the N prefix is part of the ANSI standard. My argument was
that it is unnecessary there, but our differences perhaps come from
different programming models.
My background is more with Unix/Oracle/Sybase and general I18N rather
than specifics of the Windows platform and SQL Server. On those
platforms the client has a character set with which it connects to the
server. Each char/varchar or text column on the server has a character
set that may or may not be the same as the database default character
set. For
example the client may be in Japanese SJIS and the server EUC-JP -
same repertoire with different encodings. Or the client could be SJIS
and the server UTF-8, a superset of the client's repertoire. The server
does have to know the client character set - both Oracle and Sybase do
this. On both commands and results returned the server makes the
appropriate character set translation. If a character in one set does
not exist in the other, a fallback character, usually '?', is substituted.
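This translation model can be sketched in Python, with the built-in codecs standing in for the server's conversion layer (the SJIS and EUC-JP pairing is from the example above):

```python
# Shift-JIS and EUC-JP share the Japanese repertoire with different encodings:
client_sjis = "日本語".encode("shift_jis")
server_eucjp = client_sjis.decode("shift_jis").encode("euc_jp")
print(client_sjis != server_eucjp)      # True -- different bytes, same characters

# Round-tripping through the server loses nothing when repertoires match:
print(server_eucjp.decode("euc_jp"))    # "日本語"

# When the target repertoire lacks a character, the fallback is substituted:
fallback = "日本語".encode("ascii", errors="replace").decode("ascii")
print(fallback)                          # "???"
```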
So if a column is defined as
CHARACTER (length) CHARACTER SET IS UTF-8
then it can be accessed, updated and queried by a client of any
character set with the server making all the conversions. Updates and
queries of course can only be made with the common repertoire subset.
In the ANSI SQL standard NATIONAL CHARACTER (a/k/a NCHAR) is defined
to be absolutely nothing other than:
CHARACTER (length) CHARACTER SET IS implementation-defined-charset
Where the implementation defined charset is the National Character
Set, fixed to UCS-2 in SQL Server 7 and configurable in Oracle.
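The fixed-width property of UCS-2 is easy to see in Python (UTF-16LE without surrogate pairs is byte-identical to UCS-2 for the Basic Multilingual Plane):

```python
# Every BMP character occupies exactly two bytes in UCS-2 storage:
for ch in ("A", "日"):
    encoded = ch.encode("utf-16-le")
    print(ch, len(encoded))     # each prints a length of 2
```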
After discussions with colleagues today I believe that the Windows
conceptual model is different. Instead of two there are three
character sets in play. Instead of the client operating in the
connection character set there is the actual client character set and
the connection character set. On Windows 2000 and NT the actual client
character set is Unicode using UCS-2. This is converted to the
connection character set (or at least the system acts as though this
conversion occurs). In this scenario the N prefix tells the server to
not convert the literal to the connection character set, but to pass
the bits through unchanged.
This jibes with Bart's response, with Windows treating Unicode
characters quite separately from other characters. You are saying that
it is not just at the encoding-form level that a Unicode 'A' differs
from an ASCII 'A', but also at the abstract character level. I may not
agree with this, but I can accept it.
One question - if this analysis is reasonable, is the connection
character set established by the collation sequence defined for a
column, or by the locale of the Windows system?