I have an Oracle database that stores some data values in Simplified Chinese. I have created an ASP.NET MVC C# webpage that is supposed to display this information. I am using an OdbcConnection to retrieve the data; however, when I run my da.Fill(t) command the values come back as "?"
OdbcCommand cmd = new OdbcCommand();
cmd.CommandText = select;
OdbcConnection SqlConn = new OdbcConnection("Driver={Oracle in instantclient_11_2};Dbq=Database;Uid=Username;pwd=password;");
DataTable t = new DataTable();
cmd.Connection = SqlConn;
SqlConn.Open();
OdbcDataAdapter da = new OdbcDataAdapter(cmd);
SqlConn.Close();
da.Fill(t);
return t;
t has the data, but everything that is supposed to be Chinese characters is just a series of "?????".
Problems with character sets are quite common, so let me give some general notes.
In principle you have to consider four different character set settings.
1 and 2: NLS_CHARACTERSET and NLS_NCHAR_CHARACTERSET
Example: AL32UTF8
They are defined only in your database; you can interrogate them with:
SELECT *
FROM V$NLS_PARAMETERS
WHERE PARAMETER IN ('NLS_CHARACTERSET', 'NLS_NCHAR_CHARACTERSET');
These settings define which characters (in which format) can be stored in your database - no more, no less. It requires some effort (see Character Set Migration and/or Oracle Database Migration Assistant for Unicode) if you have to change it on an existing database.
You can find the Oracle-supported character sets at Character Sets.
3: NLS_LANG
Example: AMERICAN_AMERICA.AL32UTF8
This value is defined only on your client. NLS_LANG has nothing to do with the ability to store characters in a database. It is used to let Oracle know what character set you are using on the client side. When you set the NLS_LANG value (for example to AL32UTF8) you just tell the Oracle database "my client uses character set AL32UTF8" - it does not necessarily mean that your client really is using AL32UTF8! (see #4 below)
NLS_LANG can be defined by the environment variable NLS_LANG or by the Windows Registry at HKLM\SOFTWARE\Wow6432Node\ORACLE\KEY_%ORACLE_HOME_NAME%\NLS_LANG (for a 32-bit client) or HKLM\SOFTWARE\ORACLE\KEY_%ORACLE_HOME_NAME%\NLS_LANG (for a 64-bit client). Depending on your application there might be other ways to specify NLS_LANG, but let's stick to the basics. If no NLS_LANG value is provided, Oracle defaults it to AMERICAN_AMERICA.US7ASCII.
The format of NLS_LANG is NLS_LANG=language_territory.charset. The {charset} part of NLS_LANG is not shown in any system table or view. All components of the NLS_LANG definition are optional, so the following definitions are all valid: NLS_LANG=.WE8ISO8859P1, NLS_LANG=_GERMANY, NLS_LANG=AMERICAN, NLS_LANG=ITALIAN_.WE8MSWIN1252, NLS_LANG=_BELGIUM.US7ASCII.
As stated above, the {charset} part of NLS_LANG is not available in the database in any system table/view or any function. Strictly speaking this is true; however, you can run this query:
SELECT DISTINCT CLIENT_CHARSET
FROM V$SESSION_CONNECT_INFO
WHERE (SID, SERIAL#) = (SELECT SID, SERIAL# FROM v$SESSION WHERE AUDSID = USERENV('SESSIONID'));
It should return the character set from your current NLS_LANG setting - however, in my experience the value is often NULL or Unknown, i.e. not reliable.
Find more very useful information here: NLS_LANG FAQ
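For a .NET client like the one in the question (ODBC with the Oracle driver), one practical way to supply NLS_LANG is through the process environment before the Oracle client libraries are loaded. The following is only a sketch: it assumes the driver picks NLS_LANG up from the process environment; if that does not hold in your setup, set NLS_LANG in the registry key mentioned above or as a system environment variable instead.
// Sketch: tell the Oracle client which character set this process uses,
// then let the data adapter manage opening and closing the connection.
using System;
using System.Data;
using System.Data.Odbc;

class OracleOdbcExample
{
    static DataTable GetData(string select)
    {
        // Must happen before the Oracle ODBC/OCI libraries are loaded.
        Environment.SetEnvironmentVariable("NLS_LANG", "AMERICAN_AMERICA.AL32UTF8");

        using (var conn = new OdbcConnection(
            "Driver={Oracle in instantclient_11_2};Dbq=Database;Uid=Username;Pwd=password;"))
        using (var cmd = new OdbcCommand(select, conn))
        using (var da = new OdbcDataAdapter(cmd))
        {
            var t = new DataTable();
            da.Fill(t);   // Fill opens and closes the connection itself
            return t;
        }
    }
}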
Note that some technologies do not use NLS_LANG at all, so setting it has no effect there, for example:
ODP.NET Managed Driver is not NLS_LANG sensitive. It is only .NET locale sensitive. (see Data Provider for .NET Developer's Guide)
OraOLEDB (from Oracle) always uses UTF-16 (see OraOLEDB Provider Specific Features)
Java-based JDBC (for example SQL Developer) has its own methods to deal with character sets (see Database JDBC Developer's Guide - Globalization Support for further details)
4: The "real" character set of your terminal, your application or the encoding of .sql files
Example: UTF-8
If you work with a terminal program (e.g. SQL*Plus or isql) you can interrogate the code page with the command chcp; on Unix/Linux the equivalent is locale charmap or echo $LANG. You can get a list of all Windows code page identifiers here: Code Page Identifiers. Note that UTF-8 (chcp 65001) has some issues; see this discussion.
If you work with .sql files and an editor like TOAD or SQL-Developer you have to check the save options. Usually you can choose values like UTF-8, ANSI, ISO-8859-1, etc.
ANSI means the Windows ANSI code page, typically CP1252. You can check it in your Registry at HKLM\SYSTEM\ControlSet001\Control\Nls\CodePage\ACP or here: National Language Support (NLS) API Reference (Microsoft removed this reference; take it from the web archive: National Language Support (NLS) API Reference).
How to set all these values?
The most important point is that NLS_LANG matches the "real" character set of your terminal or application, or the encoding of your .sql files.
Some common pairs are:
CP850 -> WE8PC850
CP1252 or ANSI (in case of "Western" PC) -> WE8MSWIN1252
ISO-8859-1 -> WE8ISO8859P1
ISO-8859-15 -> WE8ISO8859P15
UTF-8 -> AL32UTF8
Or run this query to get some more:
SELECT VALUE AS ORACLE_CHARSET, UTL_I18N.MAP_CHARSET(VALUE) AS IANA_NAME
FROM V$NLS_VALID_VALUES
WHERE PARAMETER = 'CHARACTERSET';
Some technologies make your life easier, e.g. ODP.NET (unmanaged driver) and the ODBC driver from Oracle automatically inherit the character set from the NLS_LANG value, so the condition above is always met.
Is it required to set client NLS_LANG value equal to database NLS_CHARACTERSET value?
No, not necessarily! For example, if you have the database character set NLS_CHARACTERSET=AL32UTF8 and the client character set NLS_LANG=.ZHS32GB18030 then it will work without any problem (provided your client really uses GB18030), although these character sets are completely different. GB18030 is a character set commonly used for Chinese; like UTF-8, it supports all Unicode characters.
If you have, for example, NLS_CHARACTERSET=AL32UTF8 and NLS_LANG=.WE8ISO8859P1 it will also work (again, provided your client really uses ISO-8859-1). However, the database may store characters which your client is not able to display; instead, the client will display a placeholder (e.g. ¿).
Anyway, it is beneficial to have matching NLS_LANG and NLS_CHARACTERSET values, if suitable. If they are equal you can be sure that any character which may be stored in the database can also be displayed, and that any character you enter in your terminal or write in your .sql file can also be stored in the database rather than being substituted by a placeholder.
Supplement
You can often read advice like "The NLS_LANG character set must be the same as your database character set" (also here on SO). This is simply not true; it is a popular myth!
See also Should the NLS_LANG Setting Match the Database Character Set?
The NLS_LANG character set should reflect the setting of the operating system character set of the client. For example, if the database character set is AL32UTF8 and the client is running on a Windows operating system, then you should not set AL32UTF8 as the client character set in the NLS_LANG parameter because there are no UTF-8 WIN32 clients. Instead, the NLS_LANG setting should reflect the code page of the client. For example, on an English Windows client, the code page is 1252. An appropriate setting for NLS_LANG is AMERICAN_AMERICA.WE8MSWIN1252.
Setting NLS_LANG correctly enables proper conversion from the client operating system character set to the database character set. When these settings are the same, Oracle Database assumes that the data being sent or received is encoded in the same character set as the database character set, so character set validation or conversion may not be performed. This can lead to corrupt data if the client code page and the database character set are different and conversions are necessary.
However, the statement "there are no UTF-8 WIN32 clients" is certainly outdated nowadays!
Here is the proof:
C:\>set NLS_LANG=.AL32UTF8
C:\>sqlplus ...
SQL> SET SERVEROUTPUT ON
SQL> DECLARE
2 CharSet VARCHAR2(20);
3 BEGIN
4 SELECT VALUE INTO Charset FROM nls_database_parameters WHERE parameter = 'NLS_CHARACTERSET';
5 DBMS_OUTPUT.PUT_LINE('Database NLS_CHARACTERSET is '||Charset);
6 IF UNISTR('\20AC') = '€' THEN
7 DBMS_OUTPUT.PUT_LINE ( '"€" is equal to U+20AC' );
8 ELSE
9 DBMS_OUTPUT.PUT_LINE ( '"€" is not the same as U+20AC' );
10 END IF;
11 END;
12 /
Database NLS_CHARACTERSET is AL32UTF8
"€" is not the same as U+20AC
PL/SQL procedure successfully completed.
Both the client and database character sets are AL32UTF8, yet the characters do not match. The reason is that my cmd.exe, and thus also SQL*Plus, uses Windows CP1252. Therefore I must set NLS_LANG accordingly:
C:\>chcp
Active code page: 1252
C:\>set NLS_LANG=.WE8MSWIN1252
C:\>sqlplus ...
SQL> SET SERVEROUTPUT ON
SQL> DECLARE
2 CharSet VARCHAR2(20);
3 BEGIN
4 SELECT VALUE INTO Charset FROM nls_database_parameters WHERE parameter = 'NLS_CHARACTERSET';
5 DBMS_OUTPUT.PUT_LINE('Database NLS_CHARACTERSET is '||Charset);
6 IF UNISTR('\20AC') = '€' THEN
7 DBMS_OUTPUT.PUT_LINE ( '"€" is equal to U+20AC' );
8 ELSE
9 DBMS_OUTPUT.PUT_LINE ( '"€" is not the same as U+20AC' );
10 END IF;
11 END;
12 /
Database NLS_CHARACTERSET is AL32UTF8
"€" is equal to U+20AC
PL/SQL procedure successfully completed.
Also consider this example:
CREATE TABLE ARABIC_LANGUAGE (
LANG_CHAR VARCHAR2(20),
LANG_NCHAR NVARCHAR2(20));
INSERT INTO ARABIC_LANGUAGE VALUES ('العربية', 'العربية');
You would need to set two different values for NLS_LANG for a single statement - which is not possible.
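On the .NET side, one way to sidestep this limitation is ODP.NET, which (as noted for the managed driver under point 3) is not NLS_LANG sensitive and handles Unicode end-to-end. Below is only a sketch against the ARABIC_LANGUAGE table above, assuming the Oracle.ManagedDataAccess package; the connection string and parameter names are placeholders.
// Sketch: insert into both the VARCHAR2 and NVARCHAR2 columns from .NET
// without any NLS_LANG juggling, using the ODP.NET managed driver.
using Oracle.ManagedDataAccess.Client;

class ArabicInsertExample
{
    static void Insert(string connectionString)
    {
        using (var conn = new OracleConnection(connectionString))
        using (var cmd = new OracleCommand(
            "INSERT INTO ARABIC_LANGUAGE (LANG_CHAR, LANG_NCHAR) VALUES (:c, :n)", conn))
        {
            // ODP.NET binds by position by default, so the order matters here.
            cmd.Parameters.Add("c", OracleDbType.Varchar2).Value  = "العربية";
            cmd.Parameters.Add("n", OracleDbType.NVarchar2).Value = "العربية";
            conn.Open();
            cmd.ExecuteNonQuery();
        }
    }
}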
See also If we have US7ASCII characterset why does it let us store non-ascii characters? or difference between NLS_NCHAR_CHARACTERSET and NLS_CHARACTERSET for Oracle
Related
I am pulling data from SQL Server into my C# project. I have some textboxes that update from page to page. One textbox in particular is set up to only accept two characters, and the corresponding column is set up in the database as char(2). If I delete those two characters and click my button to update the database and go to the next page, it stores two spaces. I need it to just be empty, with no spaces. In my other textboxes this issue does not occur. The database allows the data to be null. I am able to manually enter NULL in the database, but I need this to happen when the two characters are erased and the record is updated.
A column declared as CHAR(2) may contain one of the following:
2 characters
NULL.
A column declared as CHAR(2) may not contain any of the following:
0 characters
1 character
3 characters
etc.
When you try to store anything other than either 2 characters or NULL in the column, your database will troll you in the name of some ill-conceived notion of convenience: instead of generating an error, it will store something other than what you gave it to store.
(Amusingly enough, receiving an error when doing something wrong is, and historically has been, regarded as an inconvenience by a surprisingly large portion of programmers. But that's okay, that's how we get stackoverflow questions to answer.)
Specifically, your database will pad the value you are storing with spaces, to match the length of the column. So, if you try to store just one character, it will add one space. If you try to store zero characters, it will add two spaces.
Possible Solutions:
If you have the freedom to change the type of the column:
Declare it as VARCHAR instead of CHAR(2), so that it will contain exactly what you store in it.
If you do not have the freedom to change the type of the column:
You always have to check manually whether you are about to store an empty string into it, and if so, store NULL instead.
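If the update goes through ADO.NET, that check can live right where the parameter is bound. A minimal sketch, assuming SqlClient and a hypothetical @StateCode parameter:
// Sketch: map an empty or whitespace-only textbox value to NULL instead of
// letting the CHAR(2) column pad it to two spaces. Names are hypothetical.
using System;
using System.Data;
using System.Data.SqlClient;

static class CharColumnHelper
{
    public static void BindStateCode(SqlCommand cmd, string textBoxValue)
    {
        cmd.Parameters.Add("@StateCode", SqlDbType.Char, 2).Value =
            string.IsNullOrWhiteSpace(textBoxValue)
                ? (object)DBNull.Value
                : textBoxValue;
    }
}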
Note about Oracle
The Oracle RDBMS before version 11g (and perhaps also in more recent versions; I am not sure, so if someone knows, please leave a comment) will do that last conversion for you: if you try to store an empty string, it will store NULL instead. This is extremely treacherous for the following reasons:
It is yet one more example of the database system trolling you by storing something very different from what you gave it to store.
They apply the same rule to all types of character columns, even VARCHAR, which means that you cannot have an empty string even in columns that could accommodate one; whether you store NULL or an empty string, you always get NULL back.
This behavior is completely different from the behavior of any other RDBMS.
The behavior is fixed, there is no way to configure Oracle to quit doing that.
I have written the following SQL CLR function in order to hash string values larger than 8000 bytes (the input limit of the built-in T-SQL HASHBYTES function):
[SqlFunction(DataAccess = DataAccessKind.None, IsDeterministic = true)]
public static SqlBinary HashBytes(SqlString algorithm, SqlString value)
{
HashAlgorithm algorithmType = HashAlgorithm.Create(algorithm.Value);
if (algorithmType == null || value.IsNull)
{
return new SqlBinary();
}
else
{
byte[] bytes = Encoding.UTF8.GetBytes(value.Value);
return new SqlBinary(algorithmType.ComputeHash(bytes));
}
}
It is working fine for Latin strings. For example, the following hashes are the same:
SELECT dbo.fn_Utils_GetHashBytes ('MD5', 'test'); -- 0x098F6BCD4621D373CADE4E832627B4F6
SELECT HASHBYTES ('MD5', 'test'); -- 0x098F6BCD4621D373CADE4E832627B4F6
The issue is it is not working with Cyrillic strings. For example:
SELECT dbo.fn_Utils_GetHashBytes ('MD5 ', N'даровете на влъхвите') -- NULL
SELECT HashBytes ('MD5 ',N'даровете на влъхвите') -- 0x838B1B625A6074B2BE55CDB7FCEA2832
SELECT dbo.fn_Utils_GetHashBytes ('SHA256', N'даровете на влъхвите') -- 0xA1D65374A0B954F8291E00BC3DD9DF655D8A4A6BF127CFB15BBE794D2A098844
SELECT HashBytes ('SHA2_256',N'даровете на влъхвите') -- 0x375F6993E0ECE1864336E565C8E14848F2A4BAFCF60BC0C8F5636101DD15B25A
I am getting NULL for MD5, although the code returns a value if it is executed as a console application. Could anyone tell me what I am doing wrong?
Also, I've got the function from here and one of the comments says that:
Careful with CLR SP parameters being silently truncated to 8000 bytes
- I had to tag the parameter with [SqlFacet(MaxSize = -1)] otherwise bytes after the 8000th would simply be ignored!
but I have tested this and it works fine. For example, if I generate a hash of an 8000-byte string and a second hash of the same string plus one more character, the hashes are different.
DECLARE @A VARCHAR(MAX) = '8000 bytes string...'
DECLARE @B VARCHAR(MAX) = @A + '1'
SELECT LEN(@A), LEN(@B)
SELECT IIF(dbo.fn_Utils_GetHashBytes ('MD5', @A) = dbo.fn_Utils_GetHashBytes ('MD5', @B), 1, 0) -- 0
Should I worry about this?
Encoding.UTF8.GetBytes(...)
SQL Server has no concept of UTF-8. Use UCS-2 (UTF-16) or ASCII. The encoding used must match what you'd pass to HASHBYTES. You can easily see that HASHBYTES hashes VARCHAR and NVARCHAR differently:
select HASHBYTES('MD5', 'Foo') -- 0x1356C67D7AD1638D816BFB822DD2C25D
select HASHBYTES('MD5', N'Foo') -- 0xB25FF0AD90D09D395090E8A29FF4C63C
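The same effect is easy to reproduce on the C# side. A small sketch showing that the encoding chosen before hashing determines which of the two HASHBYTES results above you match (Encoding.Unicode is UTF-16 LE, which is what NVARCHAR uses):
using System;
using System.Security.Cryptography;
using System.Text;

class EncodingVsHashBytes
{
    static string Md5Hex(byte[] bytes)
    {
        using (var md5 = MD5.Create())
            return "0x" + BitConverter.ToString(md5.ComputeHash(bytes)).Replace("-", "");
    }

    static void Main()
    {
        const string text = "Foo";
        // Matches HASHBYTES('MD5', 'Foo') (VARCHAR, single-byte encoding)
        Console.WriteLine(Md5Hex(Encoding.ASCII.GetBytes(text)));
        // Matches HASHBYTES('MD5', N'Foo') (NVARCHAR, UTF-16 LE)
        Console.WriteLine(Md5Hex(Encoding.Unicode.GetBytes(text)));
        // UTF-8: identical to ASCII for this input, but diverges for non-ASCII text
        Console.WriteLine(Md5Hex(Encoding.UTF8.GetBytes(text)));
    }
}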
Best would be to change the SQLCLR function to accept the bytes, not a string, and deal with the cast to VARBINARY in the caller.
SELECT dbo.fn_Utils_GetHashBytes ('MD5', CAST(N'даровете на влъхвите' AS VARBINARY(MAX)));
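A sketch of what such a bytes-in function could look like (the names and attribute values are illustrative):
// Sketch: hash raw bytes instead of a string, so the caller decides the
// encoding by casting to VARBINARY(MAX) before the call.
using System.Data.SqlTypes;
using System.Security.Cryptography;
using Microsoft.SqlServer.Server;

public partial class UserDefinedFunctions
{
    [SqlFunction(DataAccess = DataAccessKind.None, IsDeterministic = true)]
    public static SqlBinary HashBytesBinary(
        SqlString algorithm,
        [SqlFacet(MaxSize = -1)] SqlBinary value)
    {
        if (algorithm.IsNull || value.IsNull)
        {
            return SqlBinary.Null;
        }

        HashAlgorithm hasher = HashAlgorithm.Create(algorithm.Value);
        if (hasher == null)
        {
            return SqlBinary.Null;
        }

        return new SqlBinary(hasher.ComputeHash(value.Value));
    }
}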
FYI, SQL Server 2016 has lifted the 8000-byte restriction on HASHBYTES:
For SQL Server 2014 and earlier, allowed input values are limited to 8000 bytes.
For a detailed walk-through that explains why you are seeing the differences, please see my answer to the following Question:
TSQL md5 hash different to C# .NET md5
And for anyone who does not wish to compile and deploy this themselves, this function is available in the Free version of the SQL# library of SQLCLR functions, stored procedures, etc. (which I am the creator of, but Util_Hash and Util_HashBinary, among many others, are free). There is one difference between what is shown in the Question and the two Util_Hash* functions in SQL#: the function shown in the Question takes an NVARCHAR / SqlString input parameter, whereas the SQL# functions take VARBINARY / SqlBinary input. This difference has two consequences:
Accepting VARBINARY input also works for binary source data (files, images, encrypted values, etc)
While accepting VARBINARY input does require an extra step of doing a CONVERT(VARBINARY(MAX), source_string) in the call to the function, doing so preserves whatever Code Page is being used for VARCHAR data. While not used that often, this can be handy when working with non-Unicode data.
Regarding the warning from the other post of:
Careful with CLR SP parameters being silently truncated to 8000 bytes - I had to tag the parameter with [SqlFacet(MaxSize = -1)] otherwise bytes after the 8000th would simply be ignored!
and yet you are not experiencing the same thing: this is due to changes in how SSDT generates the T-SQL wrapper objects for SQLCLR objects. In earlier versions (especially those that came with Visual Studio prior to VS 2013), the default behavior was to use NVARCHAR(MAX) for SqlChars and NVARCHAR(4000) for SqlString. But then at some point (I don't want to say as of VS 2013, since Visual Studio and SSDT are independent products even though VS comes with SSDT) the default was changed to use NVARCHAR(MAX) for both SqlChars and SqlString. The person who posted the warning (on 2013-02-06) must have been using an earlier version of SSDT. Still, it doesn't hurt (and is even a good practice) to be explicit and use [SqlFacet(MaxSize = -1)].
Regarding the if (algorithmType == null || value.IsNull) logic: since either one being NULL should return a NULL, you might be better off removing that logic and using the WITH RETURNS NULL ON NULL INPUT option of the CREATE FUNCTION statement. This option, however, is unfortunately not supported via any SSDT construct (i.e. no SqlFacet for it). So in order to get this option enabled, you can create a Post-Deployment SQL script (which will automatically deploy after the main script) that issues an ALTER FUNCTION with the desired definition. And it wouldn't hurt to vote for my Connect suggestion to natively support this option: Implement OnNullCall property in SqlFunctionAttribute for RETURNS NULL ON NULL INPUT SQLCLR. On a practical level, the performance gain would mainly be seen in situations where you are passing in large values for the @value parameter yet @algorithm is somehow NULL, so you never end up using the value of @value. The reason to use the RETURNS NULL ON NULL INPUT option is that when you call a SQLCLR function passing in either SqlString or SqlBinary, the entire value is copied over to the App Domain's memory. That is time, memory, and CPU you don't need to waste if you know ahead of time that you won't be using it :-). You might also see a gain, even if passing in smaller values, on functions that are called very frequently.
Side note regarding the warning and your test: SQLCLR does not support VARCHAR, only NVARCHAR. Hence, there never was a limit of 8000, since the limit would have been 4000 characters had SSDT not automatically been using NVARCHAR(MAX). So if there had been a difference, it would have been seen when first testing with only 4000 and 4001 characters.
UPDATE: Starting in SQL Server 2019, it's now possible to use UTF-8 natively via the _UTF8 collations. However, you still won't be able to pass in a UTF-8 character string into a SQLCLR object because the SQLCLR API only handles NVARCHAR and not VARCHAR. So, attempting to pass in 'UTF-8 encoded string' will still come through as UTF-16 LE because it will be implicitly converted on the way in. The only way to get UTF-8 encoded characters into SQLCLR is to first convert them to VARBINARY and pass those bytes into the SQLCLR object (as VARBINARY -> SqlBinary / SqlBytes).
We use the standard System.Data classes, DbConnection and DbCommand, to connect to SQL Server from C#, and we have many stored procedures that take VARCHAR or NVARCHAR parameters as input. We found that neither SQL Server nor our C# application throws any kind of error or warning when a string longer than maximum length of a parameter is passed in as the value to that parameter. Instead, the value is silently truncated to the maximum length of the parameter.
So, for example, if the stored procedure input is of type VARCHAR(10) and we pass in 'U R PRETTY STUPID', the stored procedure receives the input as 'U R PRETTY', which is very nice but totally not what we meant to say.
What I've done in the past to detect these truncations, and what others have likewise suggested, is to make the parameter input length one character larger than required, and then check if the length of the input is equal to that new max length. So in the above example my input would become VARCHAR(11) and I would check for input of length 11. Any input of length 11 or more would be caught by this check. This works, but feels wrong. Ideally, the data access layer would detect these problems automatically.
Is there a better way to detect that the provided stored procedure input is longer than allowed? Shouldn't DbCommand already be aware of the input length limits?
Also, as a matter of curiosity, what is responsible for silently truncating our inputs?
Use VARCHAR(8000), NVARCHAR(4000) or even N/VARCHAR(MAX) for all the variables and parameters. This way you do not need to worry about truncation when assigning @variables and @parameters. Truncation may occur at the actual data write (insert or update), but that is not silent; it will trigger a hard error and you'll find out about it. You also get the added benefit that the stored procedure code does not have to change with schema changes (change a column length and the code is still valid). And you also get better plan cache behavior from using consistent parameter lengths, see How Data Access Code Affects Database Performance.
Be aware that there is a slight performance hit for using MAX types for @variables/@parameters, see Performance comparison of varchar(max) vs. varchar(N).
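If you additionally want the client side to fail fast, a small guard over the command's parameters before executing can catch over-long values. This is only a sketch and assumes you declare an explicit Size when adding string parameters; the helper name is made up:
// Sketch: reject string parameter values longer than the declared Size,
// so truncation cannot happen silently on the way to the procedure.
using System;
using System.Data.Common;

static class ParameterGuard
{
    public static void ThrowIfTruncating(DbCommand command)
    {
        foreach (DbParameter p in command.Parameters)
        {
            if (p.Value is string s && p.Size > 0 && s.Length > p.Size)
            {
                throw new ArgumentException(
                    $"Value for parameter '{p.ParameterName}' is {s.Length} characters, " +
                    $"but the parameter is declared with Size {p.Size}.");
            }
        }
    }
}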
I have an existing database with existing data whose structure and values I cannot change.
In that database there is an nvarchar column that contains values in the twilight Unicode zone, starting with F800 and upward.
When I select those values in SQL or use the SQL UNICODE function, I get the proper values.
When I select the same values in .NET, I get an error value: all the values in that twilight zone become 65533.
I need those values. How can I persuade .NET to give them to me? Something like changing the connection encoding to a custom one, or UCS-2, etc.
Here is sample code that demonstrates the problem:
c.CommandText = "select NCHAR(55297)";
using (var r = c.ExecuteReader())
{
r.Read();
var result = r[0]; //expected 55297 but got 65533
}
55297 is D801, which isn't defined? You probably want F801, which is 63489? But it appears as if that one isn't defined either. Which characters do you want?
If I try doing a "select NCHAR(55297)" in SQL Server Management Studio, I get back the diamond question mark, but if I do "select NCHAR(63489)" I get back a dot of some sort.
If what you want is the character values, you can ask for them directly:
select Unicode(NCHAR(63489))
This returns 63489 (as an integer)
If you want them as a byte array, you can ask for that:
select CONVERT(varbinary(MAX), FieldThatIsAnNvarchar) from ThatTable
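On the .NET side you can then inspect the raw UTF-16 code units yourself, so that lone surrogates never go through a string conversion that replaces them with 65533. A sketch, reusing the table and column names from the query above:
using System;
using System.Data.SqlClient;

static class RawCodeUnitReader
{
    // Read the NVARCHAR column as VARBINARY so .NET never builds a string
    // and therefore never substitutes U+FFFD (65533) for lone surrogates.
    public static void DumpCodeUnits(SqlCommand c)
    {
        c.CommandText = "select CONVERT(varbinary(MAX), FieldThatIsAnNvarchar) from ThatTable";
        using (var r = c.ExecuteReader())
        {
            while (r.Read())
            {
                var bytes = (byte[])r[0];
                for (int i = 0; i + 1 < bytes.Length; i += 2)
                {
                    // SQL Server stores NVARCHAR as UTF-16 little-endian.
                    ushort codeUnit = (ushort)(bytes[i] | (bytes[i + 1] << 8));
                    Console.WriteLine(codeUnit);   // e.g. 55297 instead of 65533
                }
            }
        }
    }
}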
After much investigation I failed to find any way around this. I couldn't find any two-way conversion that would work here.
It seems that some Unicode values are intended for some strange Unicode scenario that isn't supported by .NET, but is partially supported in a way that breaks what we need here.
I am using a MySQL DB and want to be able to read & write Unicode data values, for example French/Greek/Hebrew values.
My client program is C# (.NET framework 3.5).
How do I configure my DB to allow Unicode? And how do I use C# to read/write Unicode values from MySQL?
Update: 7 Sep. 09
OK, so my schema, table & columns are set to 'utf8' + collation 'utf8_general_ci'. I run 'set names utf8' when the connection is opened. So far so good... but the values are still saved as '???????'.
Any ideas?
The Solution!
OK, so for a C# client to read & write Unicode values, you must include charset=utf8 in the connection string,
for example: server=my_sql_server;user id=my_user;password=my_password;database=some_db123;charset=utf8;
Of course, you should also define the relevant table as utf8 + collation utf8_bin.
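For completeness, a minimal sketch of the C# side with Connector/NET; the table and column names are placeholders:
// Sketch using MySQL Connector/NET: charset=utf8 in the connection string
// plus a parameterised insert, so no manual SET NAMES is required.
using MySql.Data.MySqlClient;

class UnicodeRoundTrip
{
    static void Save(string value)
    {
        const string cs =
            "server=my_sql_server;user id=my_user;password=my_password;" +
            "database=some_db123;charset=utf8;";

        using (var conn = new MySqlConnection(cs))
        using (var cmd = new MySqlCommand(
            "INSERT INTO my_table (my_column) VALUES (@v)", conn))
        {
            cmd.Parameters.AddWithValue("@v", value);   // e.g. "Ελληνικά" or "עברית"
            conn.Open();
            cmd.ExecuteNonQuery();
        }
    }
}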
You have to set the collation for your MySQL schema, tables or even columns.
Most of the time, the utf8_general_ci collation is used because it performs case-insensitive and accent-insensitive comparisons.
On the other hand, utf8_unicode_ci is also case insensitive but uses more advanced sorting techniques (like sorting eszett ('ß') near 'ss'). This collation is a tiny bit slower than the other two.
Finally, utf8_bin compares strings using their binary values; it is therefore case sensitive.
If you're using MySQL's Connector/NET (which I recommend), everything should go smoothly.
Try using this query before any other fetch or send:
SET NAMES UTF8
You need to set the DB charset to UTF-8 (if you are using UTF-8), set the collation for the relevant tables/fields to utf8, execute SET NAMES 'utf8' before doing queries, and of course make sure you set the proper encoding in the HTML that is showing the output.