Recently we had an encoding problem in our system :
If we had the string "æ" in our db ,it became "æ" on our web pages.
Now this problem is solved, but the problem is that now we have a lot of "æ" in our database : users didn't see and validate pre-filled form with these characters.
I found that If you read in utf 8 C3A6 you'll get "æ", if you read it in ascii you'll get "æ".
It's strange because if I execute
"select convert(varbinary(40),N'æ'),convert(varbinary(40),'æ')"
I don't have the same result...
Do you have any idea on how I can fix my database (ie change all "æ" to "æ") ?
thx
As far as I know, the only means to fix is to use Replace:
Update Table
Set Column = Replace(Column, N'æ', N'æ')
In this case, I'm assuming that the column is now Unicode (i.e. nvarchar or nchar).
if you read it in ascii you'll get "æ".
ASCII only assigns characters to the bytes 00-7F. There are, however, several "extended ASCII" encodings in which C3 A6 represents "æ", including the popular Western European encodings ISO-8859-1 and windows-1252, and Turkish ISO-8859-9 and windows-1254.
To fix your encoding problem, simply:
Encode the string to a byte array using code page 1252 (or 1254 for Turkish). This should produce the UTF-8 bytes.
Decode the byte array to a string using UTF-8.
Related
I have some string that contains the following code/value:
"You won\u0092t find a ...."
It looks like that string contains the Right Apostrophe special character.
ref1: Unicode control 0092
ref2: ASCII chart (both 127 + extra extended ascii)
I'm not sure how to display this to the webbrowser. It keeps displaying the TOFU square-box character instead. I'm under the impression that the unicode (hex) value 00092 can be converted to unicode (html)
Is my understanding correct?
Update 1:
It was suggested by #sam-axe that I HtmlEncode the unicode. That didn't work. Here it is...
Note the ampersand got correctly encoded....
It looks like there's an encoding mix-up. In .NET, strings are normally encoded as UTF-16, and a right apostrophe should be represented as \u2019. But in your example, the right apostrophe is represented as \x92, which suggests the original encoding was Windows code page 1252. If you include your string in a Unicode document, the character \x92 won't be interpreted properly.
You can fix the problem by re-encoding your string as UTF-16. To do so, treat the string as an array of bytes, and then convert the bytes back to Unicode using the 1252 code page:
string title = "You won\u0092t find a cheaper apartment * Sauna & Spa";
byte[] bytes = title.Select(c => (byte)c).ToArray();
title = Encoding.GetEncoding(1252).GetString(bytes);
// Result: "You won’t find a cheaper apartment * Sauna & Spa"
Note: much of my answer is based on guessing and looking at the decompiled code of System.Web 4.0. The reference source looks very similar (identical?).
You're correct that "" (6 characters) can be displayed in the browser. Your output string, however, contains "\u0092" (1 character). This is a control character, not an HTML entity.
According to the reference code, WebUtility.HtmlEncode() doesn't transform characters between 128 and 160 - all characters in this range are control characters (ampersand is special-cased in the code as are a few other special HTML symbols).
My guess is that because these are control characters, they're output without transformation because transforming it would change the meaning of the string. (I tried running some examples using LinqPad, this character was not rendered.)
If you really want to transform these characters (or remove them), you'll probably have to write your own function before/after calling HtmlEncode() - there may be something that does this already but I don't know of any.
Hope this helps.
Edit: Michael Liu's answer seems correct. I'm leaving my answer here because it may be useful in cases when the input encoding of a string is not known.
This question already has answers here:
Using .NET how to convert ISO 8859-1 encoded text files that contain Latin-1 accented characters to UTF-8
(2 answers)
Closed 8 years ago.
I'm using C# to automate an insert into command for a users table, and there's a user whose first name has an accented E, with a grave I believe?
Desirée
Whenever it makes it into the SQL Server table it appears as:
Desir?e
Which data type should I use on this column to ensure that it keeps the accented e?
I've tried varchar and nvarchar, neither seemed to matter.
Code for inserting:
var lines = File.ReadAllLines(users_feed_file);
I believe that there is an encoding issue occurring. When Visual Studio reads my file it reads the name as Desir?e.
So far I've tried to overload the File method, using:
Encoding enc = new UTF8Encoding(true, true);
var lines = File.ReadAllLines(users_feed_file,enc);
But this had no effect.
var lines = File.ReadAllLines(users_feed_file, Encoding.UTF8);
Doesn't work either.
Sql Server stores unicode text essentially as Unicode-2 or UTF-16. That is, it uses fixed, two-bytes for all characters. UTF-8 uses variable three-bytes for all characters, using one, two, or three bytes as needed. If the character in questions (it would be good to post the actual unicode value) is translated by UTF-8 into three bytes, then Sql Server reads that back as two two-byte characters, one of which probably is not a valid, displayable character, thus rendering a question mark. Note that Sql Server is not storing a question mark, that is just how whatever text editor you are using renders this garbled character.
Try changing your C# encoding to Encoding.Unicode and see if that helps round-trip the character in question.
The same reasoning applies to characters that ought to fit into one-byte, but are represented with two by UTF-8. So for example, the unicode hex value for small e with grave is xE8, which could be represented as 00 E8 in two bytes. But UTF-8 renders it as C3 E8. Now, look for that value in Unicode (UTF-16) - there is no such character. So in this case it is not two bytes represented as three, but one byte represented incorrectly as two. This resource is invaluable when trying to debug extended character issues.
Note that for the basic Latin ascii set, UTF-8 uses the same values as Unicode, and thus those characters round-trip just fine. It's when using extended character sets that compatibility for both encodings cannot be guaranteed.
Hi try with this code:
var lines = File.ReadAllLines(users_feed_file, Encoding.Unicode);
but in notepade++ you can view the file encoding, check this.
I have a code:
CREATE TABLE IF NOT EXISTS Person
(
name varchar(24) ...
)
CHARACTER SET utf8 COLLATE utf8_polish_ci;
This works OK in my application, but I read if someone put in name field a string that contains character wchich code is greater than 127, database will use 2 bytes (or more) to store this character. So i think, i will change character set to utf16:
CHARACTER SET utf16 COLLATE utf16_polish_ci;
But now when I run my application, exception apears: KeyNotFoundException. It apears exactly at these instructions:
MySqlCommand komenda = baza.Połączenie.CreateCommand ();
komenda.CommandText = zapytanie;
MySqlDataReader dr = komenda.ExecuteReader (); // HERE, at execute reader method
if (dr.Read ()) ...
1) Anyone had similar problem? 2) Any idea how to use always 2 bytes/char in database field?
I'm not sure I understand why you're converting from UTF-8 to UTF-16. I'm assuming you're worried that any characters that require two bytes or more to store, won't fit in a UTF-8 encoding. This is not the case. In MySQL UTF-8 values can be stored with one, two, or three bytes. Unicode points U+0000 to U+007F take 1 byte and points U+0080 to U+07FF take 2 bytes--this range covers the Polish alphabet. Since the majority of characters in the Polish alphabet take 1 byte to store you should probably stick with UTF-8 and save some memory. However, if you want to always use 2 bytes, at the cost of wasted space, you could stick with UTF-16.
Here are some helpful links:
Unicode support in MySQL: http://dev.mysql.com/doc/refman/5.6/en/charset-unicode.html
Basic Unicode Overview : http://www.joelonsoftware.com/articles/Unicode.html
As for the exception, and this is purely a guess, it may have something to do with trying to read data that is UTF-8 encoded as if it were UTF-16 encoded. Did you change the character set after you already had UTF-8 encoded data in your table?
Documenation says:
[...] utf8 characters can require up to three bytes per character [...]
Read this link for more information.
My advice would be not to focus on how many bytes the DBMS is using, as one of its purposess is to abstract you from that. Just focus on coding according to the selected data types.
I came upon trying to convert a database that is encoded in UTF8 from what it looks like, into a windows 1251 encoding (dont ask, but I need to do this). All of the Russian, encoded characters in the db show up as абвгдÐ. When I pull them out of the db into my C# app, into strings, I still see абвгдÐ. No matter what I try to do to interpret this string as UTF8 encoded string, it seems to be interpreted as latin1 single byte string, and I do not see my text show up as russian. What I basically need to do is convert this latin1 looking-utf8 encoded string into Unicode, so that I can convert it later to 1251, but I have not been able to do this successfully. Anyone got any ideas?
Encoding.UTF8.GetString(Encoding.GetEncoding("iso-8859-1").GetBytes(s))
Now you have a normal Unicode string containing Cyrillic.
Note that it is possible that your ‘Latin-1’ misencoded string might actually be a ‘Windows codepage 1252’ misencoded string; I can't tell from the given example as it doesn't use any of the characters that are different between the two encodings. If this is the case use GetEncoding(1252) instead.
Also this is assuming that it's the contents of the database at fault. If the database is supposed to be storing UTF-8 strings but you're pulling them out as if they were Latin-1 (or codepage 1252 due to that being the system codepage) then really you need to reconfigure your data access layer to set the right encoding. If you're using SQL Server, better to start using NVARCHAR.
I am using sql server, and all columns are nvarchar. The data was imported with mysql dump from a db that was latin1, not utf8. So all the unicode strings are simply latin1 encoded. In any case, I figured it out, and its very similar to what you suggested. here's what I did to convert the latin1 encoded utf8 into 1251.
//re interpret latin1 in proper utf8 encoding
str = Encoding.UTF8.GetString(Encoding.GetEncoding("iso-8859-1").GetBytes(str));
//convert from utf8 to 1251
str = Encoding.GetEncoding(1251).GetString(Encoding.Convert(Encoding.UTF8, Encoding.GetEncoding(1251), Encoding.UTF8.GetBytes(str)));
I have an old MySQL database with encoding set to UTF-8. I am using Ado.Net Entity framework to connect to it.
The string that I retrieve from it have strange characters when ë like characters are expected.
For example: "ë" is "ë".
I thought I could get this right by converting from UTF8 to UTF16.
return Encoding.Unicode.GetString(
Encoding.Convert(
Encoding.UTF8,
Encoding.Unicode,
Encoding.UTF8.GetBytes(utf8)));
}
This however doesn't change a thing.
How could I get the data from this database in proper form?
There are two things that you need to do to support UTF-8 in the ADO.NET Entity frame work (or in general using the MySQL .NET Connector):
Ensure that the collation of your database of table is a UTF-8 collation (i.e. utf8_general_ci or one of its relations)
Add Charset=utf8; to your connection string.
"Server=localhost;Database=test;Uid=test;Pwd=test;Charset=utf8;"
I'm not certain, but the encoding may be case sensitive; I found that CharSet=UTF8; did not work for me.
Even if the database is set to UTF8 you must do the following things to get Unicode fields to work correctly:
Ensure you are using a Unicode field type like NVARCHAR or TEXT CHARSET utf8
Whenever you insert anything into the field you must prefix it with the N character to indicate Unicode data as shown in the examples below
Whenever you select based on Unicode data ensure you use the N prefix again
MySqlCommand cmd = new MySqlCommand("INSERT INTO EXAMPLE (someField) VALUES (N'Unicode Data')");
MySqlCommand cmd2 = new MySqlCommand("SELECT * FROM EXAMPLE WHERE someField=N'Unicode Data'");
If the database wasn't configured correctly or the data was inserted without using the N prefix it won't be possible to get the correct data out since it will have been downcast into the Latin 1/ASCII character set
Try set the encoding by "set names utf8" query. You can set this parameter in mysql config too.
As others have said this could be a db issue, but it could also be caused by using an old version of the .net mysql connector.
What I actually wanted to comment on was the utf8 to utf16 conversion. The string you are trying to convert is actually alreay unicode encoded, so your "ë" characters actually takes up 4 bytes (or more) and are no longer, at the point of your conversion, a misrepresentation of the "ë" character. That is the reason why your conversion doesn't do anything.
If you want to do a conversion like that I think you would have to encode your utf8 string as a old style 1 byte per character string, using a codepage where the byte values of à and « actually represent the utf8 byte sequence of ë and then treat the bytes of this new string as an utf8 string. Fun stuff.
thank you The Mouth of a Cow ,
your solution works but still we need converting characters.
i think this is your problem :)
and for converting characters you can use this code
System.Text.Encoding utf_8 = System.Text.Encoding.UTF8;
string s = "unicode";
//string to utf
byte[] utf = System.Text.Encoding.UTF8.GetBytes(s);
//utf to string
string s2= System.Text.Encoding.UTF8.GetString(utf);
"Server=localhost;Database=test;Uid=test;Pwd=test;Charset=utf8;"
It worked - PowerShell 7.2, MySQL Connector 8.0.29