Table character encoding - exception in application - c#

I have a code:
CREATE TABLE IF NOT EXISTS Person
(
name varchar(24) ...
)
CHARACTER SET utf8 COLLATE utf8_polish_ci;
This works OK in my application, but I read if someone put in name field a string that contains character wchich code is greater than 127, database will use 2 bytes (or more) to store this character. So i think, i will change character set to utf16:
CHARACTER SET utf16 COLLATE utf16_polish_ci;
But now when I run my application, exception apears: KeyNotFoundException. It apears exactly at these instructions:
MySqlCommand komenda = baza.Połączenie.CreateCommand ();
komenda.CommandText = zapytanie;
MySqlDataReader dr = komenda.ExecuteReader (); // HERE, at execute reader method
if (dr.Read ()) ...
1) Anyone had similar problem? 2) Any idea how to use always 2 bytes/char in database field?

I'm not sure I understand why you're converting from UTF-8 to UTF-16. I'm assuming you're worried that any characters that require two bytes or more to store, won't fit in a UTF-8 encoding. This is not the case. In MySQL UTF-8 values can be stored with one, two, or three bytes. Unicode points U+0000 to U+007F take 1 byte and points U+0080 to U+07FF take 2 bytes--this range covers the Polish alphabet. Since the majority of characters in the Polish alphabet take 1 byte to store you should probably stick with UTF-8 and save some memory. However, if you want to always use 2 bytes, at the cost of wasted space, you could stick with UTF-16.
Here are some helpful links:
Unicode support in MySQL: http://dev.mysql.com/doc/refman/5.6/en/charset-unicode.html
Basic Unicode Overview : http://www.joelonsoftware.com/articles/Unicode.html
As for the exception, and this is purely a guess, it may have something to do with trying to read data that is UTF-8 encoded as if it were UTF-16 encoded. Did you change the character set after you already had UTF-8 encoded data in your table?

Documenation says:
[...] utf8 characters can require up to three bytes per character [...]
Read this link for more information.
My advice would be not to focus on how many bytes the DBMS is using, as one of its purposess is to abstract you from that. Just focus on coding according to the selected data types.

Related

Encoding problem on MySql Database from Blazor form on hosting environment [duplicate]

I tried to use UTF-8 and ran into trouble.
I have tried so many things; here are the results I have gotten:
???? instead of Asian characters. Even for European text, I got Se?or for Señor.
Strange gibberish (Mojibake?) such as Señor or 新浪新闻 for 新浪新闻.
Black diamonds, such as Se�or.
Finally, I got into a situation where the data was lost, or at least truncated: Se for Señor.
Even when I got text to look right, it did not sort correctly.
What am I doing wrong? How can I fix the code? Can I recover the data, if so, how?
This problem plagues the participants of this site, and many others.
You have listed the five main cases of CHARACTER SET troubles.
Best Practice
Going forward, it is best to use CHARACTER SET utf8mb4 and COLLATION utf8mb4_unicode_520_ci. (There is a newer version of the Unicode collation in the pipeline.)
utf8mb4 is a superset of utf8 in that it handles 4-byte utf8 codes, which are needed by Emoji and some of Chinese.
Outside of MySQL, "UTF-8" refers to all size encodings, hence effectively the same as MySQL's utf8mb4, not utf8.
I will try to use those spellings and capitalizations to distinguish inside versus outside MySQL in the following.
Overview of what you should do
Have your editor, etc. set to UTF-8.
HTML forms should start like <form accept-charset="UTF-8">.
Have your bytes encoded as UTF-8.
Establish UTF-8 as the encoding being used in the client.
Have the column/table declared CHARACTER SET utf8mb4 (Check with SHOW CREATE TABLE.)
<meta charset=UTF-8> at the beginning of HTML
Stored Routines acquire the current charset/collation. They may need rebuilding.
UTF-8 all the way through
More details for computer languages (and its following sections)
Test the data
Viewing the data with a tool or with SELECT cannot be trusted.
Too many such clients, especially browsers, try to compensate for incorrect encodings, and show you correct text even if the database is mangled.
So, pick a table and column that has some non-English text and do
SELECT col, HEX(col) FROM tbl WHERE ...
The HEX for correctly stored UTF-8 will be
For a blank space (in any language): 20
For English: 4x, 5x, 6x, or 7x
For most of Western Europe, accented letters should be Cxyy
Cyrillic, Hebrew, and Farsi/Arabic: Dxyy
Most of Asia: Exyyzz
Emoji and some of Chinese: F0yyzzww
More details
Specific causes and fixes of the problems seen
Truncated text (Se for Señor):
The bytes to be stored are not encoded as utf8mb4. Fix this.
Also, check that the connection during reading is UTF-8.
Black Diamonds with question marks (Se�or for Señor);
one of these cases exists:
Case 1 (original bytes were not UTF-8):
The bytes to be stored are not encoded as utf8. Fix this.
The connection (or SET NAMES) for the INSERT and the SELECT was not utf8/utf8mb4. Fix this.
Also, check that the column in the database is CHARACTER SET utf8 (or utf8mb4).
Case 2 (original bytes were UTF-8):
The connection (or SET NAMES) for the SELECT was not utf8/utf8mb4. Fix this.
Also, check that the column in the database is CHARACTER SET utf8 (or utf8mb4).
Black diamonds occur only when the browser is set to <meta charset=UTF-8>.
Question Marks (regular ones, not black diamonds) (Se?or for Señor):
The bytes to be stored are not encoded as utf8/utf8mb4. Fix this.
The column in the database is not CHARACTER SET utf8 (or utf8mb4). Fix this. (Use SHOW CREATE TABLE.)
Also, check that the connection during reading is UTF-8.
Mojibake (Señor for Señor):
(This discussion also applies to Double Encoding, which is not necessarily visible.)
The bytes to be stored need to be UTF-8-encoded. Fix this.
The connection when INSERTing and SELECTing text needs to specify utf8 or utf8mb4. Fix this.
The column needs to be declared CHARACTER SET utf8 (or utf8mb4). Fix this.
HTML should start with <meta charset=UTF-8>.
If the data looks correct, but won't sort correctly, then
either you have picked the wrong collation,
or there is no collation that suits your need,
or you have Double Encoding.
Double Encoding can be confirmed by doing the SELECT .. HEX .. described above.
é should come back C3A9, but instead shows C383C2A9
The Emoji 👽 should come back F09F91BD, but comes back C3B0C5B8E28098C2BD
That is, the hex is about twice as long as it should be.
This is caused by converting from latin1 (or whatever) to utf8, then treating those
bytes as if they were latin1 and repeating the conversion.
The sorting (and comparing) does not work correctly because it is, for example,
sorting as if the string were Señor.
Fixing the Data, where possible
For Truncation and Question Marks, the data is lost.
For Mojibake / Double Encoding, ...
For Black Diamonds, ...
The Fixes are listed here. (5 different fixes for 5 different situations; pick carefully): http://mysql.rjweb.org/doc.php/charcoll#fixes_for_various_cases
I had similar issues with two of my projects, after a server migration. After searching and trying a lot of solutions, I came across with this one:
mysqli_set_charset($con,"utf8mb4");
After adding this line to my configuration file, everything works fine!
I found this solution for MySQLi—PHP mysqli set_charset() Function—when I was looking to solve an insert from an HTML query.
I was also searching for the same issue. It took me nearly one month to find the appropriate solution.
First of all, you will have to update you database will all the recent CHARACTER and COLLATION to utf8mb4 or at least which support UTF-8 data.
For Java:
while making a JDBC connection, add this to the connection URL useUnicode=yes&characterEncoding=UTF-8 as parameters and it will work.
For Python:
Before querying into the database, try enforcing this over the cursor
cursor.execute('SET NAMES utf8mb4')
cursor.execute("SET CHARACTER SET utf8mb4")
cursor.execute("SET character_set_connection=utf8mb4")
If it does not work, happy hunting for the right solution.
Set your code IDE language to UTF-8
Add <meta charset="utf-8"> to your webpage header where you collect data form.
Check your MySQL table definition looks like this:
CREATE TABLE your_table (
...
) ENGINE=InnoDB DEFAULT CHARSET=utf8
If you are using PDO, make sure
$options = array(PDO::MYSQL_ATTR_INIT_COMMAND=>'SET NAMES utf8');
$dbL = new PDO($pdo, $user, $pass, $options);
If you already got a large database with above problem, you can try SIDU to export with correct charset, and import back with UTF-8.
Depending on how the server is setup, you have to change the encode accordingly. utf8 from what you said should work the best. However, if you're getting weird characters, it might help if you change the webpage encoding to ANSI.
This helped me when I was setting up a PHP MySQLi. This might help you understand more: ANSI to UTF-8 in Notepad++

ñ character in in put file being interpreted as ñ in C# console app

I've seen questions where the two characters are the same, but noting that relates to this specific question so here goes.
I'm running a C# console app that reads an input file that is variable length records. Each record is variable length fields. I've got everything working in terms of parsing out each individual field within each record, not a problem. Except that today I cam across the ñ character in the input file. Now I know this translates to ñ, so I'm ok with it. However, because I the input file sees ñ as 2 characters, the record length changes in the C# app because the app is interpreting those 2 characters as a single ñ. This is causing my record length to change from 154 characters to 153, and then during the parsing, messing up the individual fields.
I'm ok with the ñ character getting stored in my DB. But my question is this.
Prior to parsing the fields out of the record, how can I go about easily (with checking every single character) detecting that the ñ exists and trigger it to change the parsing logic? Should I simply do a IndexOf on the character and code it that way? I would think that would add a bit of overhead of I had to put that logic on every single field, although it seems like the easiest way. I would think there's a better way to handle it overall but I've not encountered this before. Most of the posts I have found are more for handling the ñ character in text as opposed to text being converted (properly) from ñ to ñ
Ideas?
the streamreader open I am using is as follows:
System.IO.StreamReader concatenatedFile = new System.IO.StreamReader("c:\Testing\test.txt",System.Text.Encoding.UTF8);
The record length changes from 154 characters on the input to 153 interpreted characters.
You must always read a text file in the encoding it was written. Of course, sometimes you don't which encoding that was...
Thing of the input file as a stream of bytes. Most are 1-byte-1-ASCII-character, but there are 2 bytes (probably) that can be interpreted differently depending on encoding:
UTF8 - 1 character, ñ
(some other encoding) - 2 characters, ñ
Since you say "the input file sees ñ as 2 characters", this would probably be the encoding intended by whoever produces the file.
So, you should find out which encoding was originally meant, and use that - it's probably some ANSI encoding. You could try System.Text.Encoding.Default, but beware that this changes on different machines, so your code will now depend on the machine's default encoding.
You should set the StreamReader you use to read your input file to UTF-8 encoding. I don't believe for a second the original input was meant to be ñ, so why do you care how many bytes the original input was - you care about character length, right?
Refer to this article to understand what's what in text encoding: http://www.joelonsoftware.com/articles/Unicode.html .

C# File Method not reading accented character [duplicate]

This question already has answers here:
Using .NET how to convert ISO 8859-1 encoded text files that contain Latin-1 accented characters to UTF-8
(2 answers)
Closed 8 years ago.
I'm using C# to automate an insert into command for a users table, and there's a user whose first name has an accented E, with a grave I believe?
Desirée
Whenever it makes it into the SQL Server table it appears as:
Desir?e
Which data type should I use on this column to ensure that it keeps the accented e?
I've tried varchar and nvarchar, neither seemed to matter.
Code for inserting:
var lines = File.ReadAllLines(users_feed_file);
I believe that there is an encoding issue occurring. When Visual Studio reads my file it reads the name as Desir?e.
So far I've tried to overload the File method, using:
Encoding enc = new UTF8Encoding(true, true);
var lines = File.ReadAllLines(users_feed_file,enc);
But this had no effect.
var lines = File.ReadAllLines(users_feed_file, Encoding.UTF8);
Doesn't work either.
Sql Server stores unicode text essentially as Unicode-2 or UTF-16. That is, it uses fixed, two-bytes for all characters. UTF-8 uses variable three-bytes for all characters, using one, two, or three bytes as needed. If the character in questions (it would be good to post the actual unicode value) is translated by UTF-8 into three bytes, then Sql Server reads that back as two two-byte characters, one of which probably is not a valid, displayable character, thus rendering a question mark. Note that Sql Server is not storing a question mark, that is just how whatever text editor you are using renders this garbled character.
Try changing your C# encoding to Encoding.Unicode and see if that helps round-trip the character in question.
The same reasoning applies to characters that ought to fit into one-byte, but are represented with two by UTF-8. So for example, the unicode hex value for small e with grave is xE8, which could be represented as 00 E8 in two bytes. But UTF-8 renders it as C3 E8. Now, look for that value in Unicode (UTF-16) - there is no such character. So in this case it is not two bytes represented as three, but one byte represented incorrectly as two. This resource is invaluable when trying to debug extended character issues.
Note that for the basic Latin ascii set, UTF-8 uses the same values as Unicode, and thus those characters round-trip just fine. It's when using extended character sets that compatibility for both encodings cannot be guaranteed.
Hi try with this code:
var lines = File.ReadAllLines(users_feed_file, Encoding.Unicode);
but in notepade++ you can view the file encoding, check this.

Repair bad character due to encoding problem

Recently we had an encoding problem in our system :
If we had the string "æ" in our db ,it became "æ" on our web pages.
Now this problem is solved, but the problem is that now we have a lot of "æ" in our database : users didn't see and validate pre-filled form with these characters.
I found that If you read in utf 8 C3A6 you'll get "æ", if you read it in ascii you'll get "æ".
It's strange because if I execute
"select convert(varbinary(40),N'æ'),convert(varbinary(40),'æ')"
I don't have the same result...
Do you have any idea on how I can fix my database (ie change all "æ" to "æ") ?
thx
As far as I know, the only means to fix is to use Replace:
Update Table
Set Column = Replace(Column, N'æ', N'æ')
In this case, I'm assuming that the column is now Unicode (i.e. nvarchar or nchar).
if you read it in ascii you'll get "æ".
ASCII only assigns characters to the bytes 00-7F. There are, however, several "extended ASCII" encodings in which C3 A6 represents "æ", including the popular Western European encodings ISO-8859-1 and windows-1252, and Turkish ISO-8859-9 and windows-1254.
To fix your encoding problem, simply:
Encode the string to a byte array using code page 1252 (or 1254 for Turkish). This should produce the UTF-8 bytes.
Decode the byte array to a string using UTF-8.

Conversion of a unicode character from byte

In our API, we use byte[] to send over data across the network. Everything worked fine, until the day our "foreign" clients decided to pass/receive Unicode characters.
As far as I know, Unicode characters occupy 2 bytes, however, we only allocate 1 byte in the byte array for them.
Here is how we read the character from the byte[] array:
// buffer is a byte[6553] and index is a current location in the buffer
char c = System.BitConverter.ToChar(buffer, m_index);
index += SIZEOF_BYTE;
return c;
So the current issue is the API is receiving a strange Unicode character, when I look at the Unicode hexadecimal. I found that the last significant byte is correct but the most significant byte has a value when it’s supposed to be 0. A quick workaround, thus far, has been to 0x00FF & c to filter the msb.
Please suggest the correct approach to deal with Unicode characters coming from the socket?
Thanks.
Solution:
Kudos to Jon:
char c = (char) buffer[m_index];
And as he mentioned, the reason it works, is because the client api receives a character occupying only one byte, and BitConverter.ToChar uses two, hence the issue in converting it. I am still startled as to why it worked for some set of characters and not the others, as it should have failed in all cases.
Thanks Guys, great responses!
You should use Encoding.GetString, using the most appropriate encoding.
I don't quite understand your situation fully, but the Encoding class is almost certain to be the way to handle it.
Who is in control of the data here? Your code, or that of your customers? Have you defined what the correct format is?
EDIT: Okay, I've had another look at your code: BitConverter.ToChar returns "A character formed by two bytes beginning at startIndex." If you only want to use one byte, just cast it:
char c = (char) buffer[m_index];
I'm surprised your code has been working at all, as it would be breaking any time the next byte was non-zero.
You should look at the System.Text.ASCIIEncoder.ASCII.GetString function which takes a byte[] array and converts it to a string (for ascii).
And System.Text.UTF8Encoder or System.Text.UTF16Encoder for Unicode strings in the UTF8 or UTF16 encodings.
There are also functions for converting Strings to Byte[] in the ASCIIEncoding, UTF8Encoding and UTF16Encoding classes: see the GetBytes(String) functions.
Unicode characters can take up to four bytes, but rarely are messages encoded on the wire using 4 bytes for each character. Rather, schemes like UTF8 or UTF16 are used that only bring in extra bytes when required.
Have a look at the Encoding class guidance.
Test streams should contain a byte-order marker that will allow you to determine how to treat the binary data.
It's unclear what exactly your goal is here. From what I can tell, there are 2 routes that you can take
Ignore all data sent in Unicode
Process both unicode and ASCII strings
IMHO, #1 is the way to go. But it sounds like your protocol is not necessarily setup to deal with a unicode string. You will have to do some detection logic to determine if the string coming in is a Unicode version. If it is you can use the Enconding.Unicode.GetString method to convert that particular byte array.
What encoding are your customers using? If some of your customers are still using ASCII, then you'll need your international customers to use something which maps the ASCII set (1-127) to itself, such as UTF8. After that, use the UTF8 encoding's GetString method.
My only solution is to fix the API. Either tell the users to use only ASCII string in the Byte[] or fix it to support ASCII and any other encoding you need to use.
Deciding what encoding is supplied by the foreign clients from just the byte[] can be a bit tricky.

Categories

Resources