I have an old MySQL database with encoding set to UTF-8. I am using Ado.Net Entity framework to connect to it.
The strings that I retrieve from it contain strange characters where characters like ë are expected.
For example: "ë" comes back as "Ã«".
I thought I could get this right by converting from UTF8 to UTF16.
// Utf8ToUtf16 is a placeholder name for the method
static string Utf8ToUtf16(string utf8)
{
    return Encoding.Unicode.GetString(
        Encoding.Convert(
            Encoding.UTF8,
            Encoding.Unicode,
            Encoding.UTF8.GetBytes(utf8)));
}
This however doesn't change a thing.
How could I get the data from this database in proper form?
There are two things that you need to do to support UTF-8 in the ADO.NET Entity Framework (or in general when using the MySQL .NET Connector):
Ensure that the collation of your database or table is a UTF-8 collation (i.e. utf8_general_ci or one of its relations)
Add Charset=utf8; to your connection string.
"Server=localhost;Database=test;Uid=test;Pwd=test;Charset=utf8;"
I'm not certain, but the encoding may be case sensitive; I found that CharSet=UTF8; did not work for me.
Even if the database is set to UTF8 you must do the following things to get Unicode fields to work correctly:
Ensure you are using a Unicode field type like NVARCHAR or TEXT CHARSET utf8
Whenever you insert anything into the field you must prefix it with the N character to indicate Unicode data as shown in the examples below
Whenever you select based on Unicode data ensure you use the N prefix again
MySqlCommand cmd = new MySqlCommand("INSERT INTO EXAMPLE (someField) VALUES (N'Unicode Data')");
MySqlCommand cmd2 = new MySqlCommand("SELECT * FROM EXAMPLE WHERE someField=N'Unicode Data'");
If the database wasn't configured correctly, or the data was inserted without using the N prefix, it won't be possible to get the correct data out, since it will have been downcast into the Latin 1/ASCII character set.
Try setting the encoding with a "set names utf8" query. You can also set this parameter in the MySQL config.
As others have said this could be a db issue, but it could also be caused by using an old version of the .net mysql connector.
What I actually wanted to comment on was the UTF-8 to UTF-16 conversion. The string you are trying to convert is already Unicode encoded, so your "Ã«" characters actually take up 4 bytes (or more) and are no longer, at the point of your conversion, a misrepresentation of the "ë" character. That is the reason why your conversion doesn't do anything.
If you want to do a conversion like that, I think you would have to encode your string as an old-style 1-byte-per-character string, using a code page where the byte values of Ã and « actually represent the UTF-8 byte sequence of ë, and then treat the bytes of this new string as a UTF-8 string. Fun stuff.
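Here is a small sketch of both points, assuming the garbled text came from UTF-8 bytes being read as ISO-8859-1 (if it was really Windows-1252, the code page would need to change). The class name is made up for illustration.

```csharp
using System;
using System.Text;

static class ReinterpretDemo
{
    static void Main()
    {
        string garbled = "\u00C3\u00AB"; // "Ã«", two ordinary Unicode characters

        // The conversion from the question is a round trip, so nothing changes.
        string roundTrip = Encoding.Unicode.GetString(
            Encoding.Convert(Encoding.UTF8, Encoding.Unicode,
                             Encoding.UTF8.GetBytes(garbled)));
        Console.WriteLine(roundTrip == garbled); // True

        // Encode with a 1-byte-per-character code page to get the raw bytes back...
        byte[] raw = Encoding.GetEncoding("iso-8859-1").GetBytes(garbled);
        Console.WriteLine(BitConverter.ToString(raw)); // C3-AB

        // ...and treat those bytes as the UTF-8 sequence they really are.
        Console.WriteLine(Encoding.UTF8.GetString(raw) == "\u00EB"); // True ("ë")
    }
}
```

This only patches strings after the fact; fixing the connection charset is still the better route if the data in the database itself is fine.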
Thank you, The Mouth of a Cow.
Your solution works, but we still need to convert characters.
I think this is your problem :)
For converting characters you can use this code:
System.Text.Encoding utf_8 = System.Text.Encoding.UTF8;
string s = "unicode";
//string to utf
byte[] utf = System.Text.Encoding.UTF8.GetBytes(s);
//utf to string
string s2= System.Text.Encoding.UTF8.GetString(utf);
"Server=localhost;Database=test;Uid=test;Pwd=test;Charset=utf8;"
It worked - PowerShell 7.2, MySQL Connector 8.0.29
Related
I have a MySQL database with utf8_general_ci collation.
I connect to the same database from PHP, using UTF-8 page and file encoding, with no problem,
but when I connect to MySQL from C# I get letters like this: ØºØ²Ø©
I edited the connection string to look like this:
server=localhost;password=root;User Id=root;Persist Security Info=True;database=mydatabase;Character Set=utf8
but the problem remains.
Server=myServerAddress;Database=myDataBase;Uid=myUsername;Pwd=myPassword; CharSet=utf8;
Note! Use lower case value utf8 and not upper case UTF8 as this will fail.
See http://www.connectionstrings.com/mysql
Could you try:
Server=localhost;Port=3306;Database=xxx;Uid=x xx;Pwd=xxxx;charset=utf8;
Edit: I got a new idea:
//To encode a string to UTF8 encoding
string source = "hello world";
byte [] UTF8encodes = UTF8Encoding.UTF8.GetBytes(source);
//get the string from UTF8 encoding
string plainText = UTF8Encoding.UTF8.GetString(UTF8encodes);
good luck
more info about this technique http://social.msdn.microsoft.com/forums/en-us/csharpgeneral/thread/BF68DDD8-3D95-4478-B84A-6570A2E20AE5
You might need to use the "utf8mb4" character set for the column in order to support 4 byte characters like this: "λ𝛌 "
The utf8 charset only supports 1-3 bytes per character and thus can't support all unicode characters.
See http://dev.mysql.com/doc/refman/5.5/en/charset-unicode-utf8mb4.html for more details.
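To check whether a given string would actually need utf8mb4, one option is to look for characters that take 4 bytes in UTF-8, which in a .NET string are exactly the ones stored as surrogate pairs. A rough sketch (the class and helper names are hypothetical):

```csharp
using System;
using System.Linq;
using System.Text;

static class Utf8Mb4Check
{
    // Hypothetical helper: true if any character lies outside the Basic
    // Multilingual Plane, i.e. needs 4 bytes in UTF-8 (utf8mb4 in MySQL).
    public static bool NeedsUtf8Mb4(string text)
    {
        return text.Any(c => char.IsSurrogate(c));
    }

    static void Main()
    {
        string lambda = "\u03BB";           // U+03BB, inside the BMP
        string boldLambda = "\uD835\uDECC"; // U+1D6CC, outside the BMP

        Console.WriteLine(Encoding.UTF8.GetByteCount(lambda));     // 2
        Console.WriteLine(Encoding.UTF8.GetByteCount(boldLambda)); // 4
        Console.WriteLine(NeedsUtf8Mb4(lambda));                   // False
        Console.WriteLine(NeedsUtf8Mb4(boldLambda));               // True
    }
}
```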
CHARSET should be uppercase
Server=localhost;Port=3306;Database=xxx;Uid=x xx;Pwd=xxxx;CHARSET=utf8;
Just in case some come here later.
I needed to create a Seed method using Mysql with EF6, to load a SQL file. After running it I got weird characters on database like ? replacing é, ó, á
SOLUTION:
Make sure to read the file using the right charset: UTF8 in my case.
var path = System.AppDomain.CurrentDomain.BaseDirectory;
var sql = System.IO.File.ReadAllText(path + "../../Migrations/SeedData/scripts/sectores.sql", Encoding.UTF8);
And then M.Shakeri's reminder:
CHARSET=utf8 in the connection string in web.config, with CHARSET in uppercase and utf8 in lowercase.
Hope it helps.
R.
One thing I found, but haven't had the opportunity to really browse is the collation charts available here: http://www.collation-charts.org/mysql60/
This will show you which characters are part of a given MySQL collation so you can pick the best option for your dataset.
Setting the charset in the connection string refers to the charset of the queries sent to the server. It does not affect the results returned from the server.
https://dev.mysql.com/doc/connectors/en/connector-net-connection-options.html
One way I have found to specify the charset from the client is to run this after opening the connection.
set character_set_results='utf8';
this worked for me:
"datasource=xxx;port=3306;username=xxx;password=xxx;database=xxx;charset=utf8mb4"
I have this code:
CREATE TABLE IF NOT EXISTS Person
(
name varchar(24) ...
)
CHARACTER SET utf8 COLLATE utf8_polish_ci;
This works OK in my application, but I read that if someone puts a string in the name field containing a character whose code is greater than 127, the database will use 2 bytes (or more) to store that character. So I thought I would change the character set to utf16:
CHARACTER SET utf16 COLLATE utf16_polish_ci;
But now when I run my application, an exception appears: KeyNotFoundException. It appears exactly at these instructions:
MySqlCommand komenda = baza.Połączenie.CreateCommand ();
komenda.CommandText = zapytanie;
MySqlDataReader dr = komenda.ExecuteReader (); // HERE, at execute reader method
if (dr.Read ()) ...
1) Has anyone had a similar problem? 2) Any idea how to always use 2 bytes per character in a database field?
I'm not sure I understand why you're converting from UTF-8 to UTF-16. I'm assuming you're worried that any characters that require two bytes or more to store, won't fit in a UTF-8 encoding. This is not the case. In MySQL UTF-8 values can be stored with one, two, or three bytes. Unicode points U+0000 to U+007F take 1 byte and points U+0080 to U+07FF take 2 bytes--this range covers the Polish alphabet. Since the majority of characters in the Polish alphabet take 1 byte to store you should probably stick with UTF-8 and save some memory. However, if you want to always use 2 bytes, at the cost of wasted space, you could stick with UTF-16.
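To make the size difference concrete, here is a quick check with the .NET encoders. This only illustrates the encodings themselves, not MySQL's on-disk storage, and the sample word and class name are my own choice:

```csharp
using System;
using System.Text;

static class PolishByteCounts
{
    static void Main()
    {
        string name = "Krak\u00F3w"; // "Kraków": five ASCII letters plus ó (U+00F3)

        // UTF-8: 1 byte per ASCII letter, 2 bytes for ó.
        Console.WriteLine(Encoding.UTF8.GetByteCount(name));             // 7

        // UTF-16: 2 bytes for every character in this string.
        Console.WriteLine(Encoding.BigEndianUnicode.GetByteCount(name)); // 12
    }
}
```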
Here are some helpful links:
Unicode support in MySQL: http://dev.mysql.com/doc/refman/5.6/en/charset-unicode.html
Basic Unicode Overview : http://www.joelonsoftware.com/articles/Unicode.html
As for the exception, and this is purely a guess, it may have something to do with trying to read data that is UTF-8 encoded as if it were UTF-16 encoded. Did you change the character set after you already had UTF-8 encoded data in your table?
Documentation says:
[...] utf8 characters can require up to three bytes per character [...]
Read this link for more information.
My advice would be not to focus on how many bytes the DBMS is using, as one of its purposes is to abstract you from that. Just focus on coding according to the selected data types.
I have an asp.net page connected to a MySql DB.
When I try to insert/update values from the webpage into the DB
the chars are shown in the DB as question marks (I am using SP).
If I write a query directly in the DB, it works and the chars
are displayed correctly.
The DB default charset is utf8, and the column collation is utf8_general_ci.
10x alot & Have a great weekend :)
Eventually what solved my problem is adding CharSet=utf8 to the connection string.
10x alot everyone :)
I believe your C# strings are being treated as unicode instead of UTF8
Some sample code from a snippet I had found some time ago:
System.Text.Encoding utf_8 = System.Text.Encoding.UTF8;
// This is our Unicode string:
string s_unicode = "abcéabc";
// Convert a string to utf-8 bytes.
byte[] utf8Bytes = System.Text.Encoding.UTF8.GetBytes(s_unicode);
// Convert utf-8 bytes to a string.
string s_unicode2 = System.Text.Encoding.UTF8.GetString(utf8Bytes);
MessageBox.Show(s_unicode2);
Recently we had an encoding problem in our system :
If we had the string "æ" in our db, it became "Ã¦" on our web pages.
Now this problem is solved, but the problem is that now we have a lot of "Ã¦" in our database: users didn't notice these characters in pre-filled forms and validated them anyway.
I found that if you read C3 A6 as UTF-8 you get "æ", and if you read it as ASCII you get "Ã¦".
It's strange because if I execute
"select convert(varbinary(40),N'æ'),convert(varbinary(40),'æ')"
I don't have the same result...
Do you have any idea how I can fix my database (i.e. change all "Ã¦" to "æ")?
thx
As far as I know, the only means to fix is to use Replace:
Update Table
Set Column = Replace(Column, N'Ã¦', N'æ')
In this case, I'm assuming that the column is now Unicode (i.e. nvarchar or nchar).
if you read it in ascii you'll get "Ã¦".
ASCII only assigns characters to the bytes 00-7F. There are, however, several "extended ASCII" encodings in which C3 A6 represents "Ã¦", including the popular Western European encodings ISO-8859-1 and windows-1252, and Turkish ISO-8859-9 and windows-1254.
To fix your encoding problem, simply:
Encode the string to a byte array using code page 1252 (or 1254 for Turkish). This should produce the UTF-8 bytes.
Decode the byte array to a string using UTF-8.
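The two steps above can be sketched as follows. Two things in this sketch are my own assumptions rather than part of the answer: it uses ISO-8859-1 instead of code page 1252 (on .NET Core and later, 1252 is only available after registering the code pages provider), and it uses strict encoders so that strings which are not misread UTF-8 are returned unchanged. The class and method names are hypothetical.

```csharp
using System;
using System.Text;

static class EncodingRepair
{
    // Hypothetical helper: returns the repaired string, or the input
    // unchanged when it does not look like misread UTF-8.
    public static string RepairIfMisdecoded(string text)
    {
        // Strict encodings throw instead of silently substituting '?'.
        Encoding latin1 = Encoding.GetEncoding(
            "iso-8859-1",
            EncoderFallback.ExceptionFallback,
            DecoderFallback.ExceptionFallback);
        Encoding strictUtf8 = new UTF8Encoding(false, true);

        try
        {
            byte[] raw = latin1.GetBytes(text); // step 1: back to the raw bytes
            return strictUtf8.GetString(raw);   // step 2: decode them as UTF-8
        }
        catch (ArgumentException)
        {
            // EncoderFallbackException and DecoderFallbackException both
            // derive from ArgumentException: the text was not misread UTF-8.
            return text;
        }
    }

    static void Main()
    {
        Console.WriteLine(RepairIfMisdecoded("\u00C3\u00A6") == "\u00E6"); // True: "Ã¦" becomes "æ"
        Console.WriteLine(RepairIfMisdecoded("\u00E6") == "\u00E6");       // True: a correct "æ" is left alone
    }
}
```

Leaving already-correct values alone matters when running this over a whole table, where good and bad rows are mixed.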
I am trying to convert a database that appears to be UTF-8 encoded into windows-1251 encoding (don't ask, but I need to do this). All of the Russian characters in the db show up as Ð°Ð±Ð²Ð³Ð´Ð. When I pull them out of the db into strings in my C# app, I still see Ð°Ð±Ð²Ð³Ð´Ð. No matter what I try in order to interpret this string as UTF-8, it seems to be interpreted as a latin1 single-byte string, and my text does not show up as Russian. What I basically need to do is convert this UTF-8 string that looks like latin1 into Unicode, so that I can convert it to 1251 later, but I have not been able to do this successfully. Anyone got any ideas?
Encoding.UTF8.GetString(Encoding.GetEncoding("iso-8859-1").GetBytes(s))
Now you have a normal Unicode string containing Cyrillic.
Note that it is possible that your ‘Latin-1’ misencoded string might actually be a ‘Windows codepage 1252’ misencoded string; I can't tell from the given example as it doesn't use any of the characters that are different between the two encodings. If this is the case use GetEncoding(1252) instead.
Also this is assuming that it's the contents of the database at fault. If the database is supposed to be storing UTF-8 strings but you're pulling them out as if they were Latin-1 (or codepage 1252 due to that being the system codepage) then really you need to reconfigure your data access layer to set the right encoding. If you're using SQL Server, better to start using NVARCHAR.
I am using SQL Server, and all columns are nvarchar. The data was imported with a MySQL dump from a db that was latin1, not utf8, so all the Unicode strings are simply latin1 encoded. In any case, I figured it out, and it's very similar to what you suggested. Here's what I did to convert the latin1-encoded UTF-8 into 1251.
// reinterpret the latin1 characters as the UTF-8 bytes they really are
str = Encoding.UTF8.GetString(Encoding.GetEncoding("iso-8859-1").GetBytes(str));
// convert from UTF-8 to 1251 (characters that 1251 cannot represent become '?')
str = Encoding.GetEncoding(1251).GetString(Encoding.Convert(Encoding.UTF8, Encoding.GetEncoding(1251), Encoding.UTF8.GetBytes(str)));