Convert MySQL non-unicode characters in C# - c#

I have a PHP application that currently stores data in MySQL tables in non-conventional format(I assume this is because it is using a non-unicode mysql connection).
Example, this is one of the customer names as shown in PHP app UI:
DILORIO’S AUTO BODY SHOP
Notice there is a difference in apostrophe between it and the following.
DILORIO'S AUTO BODY SHOP
The latter one uses a standard latin apostrophe as oppose to unicode(i guess) style.
This name is stored in DB table like so:
DILORIO’S AUTO BODY SHOP
When it is being pulled from DB and displayed in UI it all looks correct, but the problem arised when I started to use MYSQL.Data C# connector to pull the same data.
At first I thought I should be able to just bull the value byte array and then convert it to latin1 (I assumed this is a default for PHP), however none of the existing encodings seemed to get me the result I wanted and this is what I get:
this is a DB collation for the field in mysql and how it looks:
Ideally I want to get rid of all corrupt data in DB and fix the PHP connection to unicode. But at this point it would be nice to just read whats already in there the same way as PHP is able to.
I also tried it with Encoding convert in all different combinations but no luck here either:

The text is encoded with Windows-1252, not Latin1, which is why your attempts to decode it above failed. Once you convert the string to Windows-1252 bytes, then decode that using UTF-8, you should have the correct value:
// note: on .NET 6.0, add 'System.Text.Encoding.CodePages' and call this line of code:
// Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
var windows1252 = Encoding.GetEncoding(1252);
var utf8Bytes = windows1252.GetBytes("DILORIO’S AUTO BODY SHOP");
var correct = Encoding.UTF8.GetString(utf8Bytes);
// correct == "DILORIO’S AUTO BODY SHOP"

Related

Encoding problem on MySql Database from Blazor form on hosting environment [duplicate]

I tried to use UTF-8 and ran into trouble.
I have tried so many things; here are the results I have gotten:
???? instead of Asian characters. Even for European text, I got Se?or for Señor.
Strange gibberish (Mojibake?) such as Señor or 新浪新闻 for 新浪新闻.
Black diamonds, such as Se�or.
Finally, I got into a situation where the data was lost, or at least truncated: Se for Señor.
Even when I got text to look right, it did not sort correctly.
What am I doing wrong? How can I fix the code? Can I recover the data, if so, how?
This problem plagues the participants of this site, and many others.
You have listed the five main cases of CHARACTER SET troubles.
Best Practice
Going forward, it is best to use CHARACTER SET utf8mb4 and COLLATION utf8mb4_unicode_520_ci. (There is a newer version of the Unicode collation in the pipeline.)
utf8mb4 is a superset of utf8 in that it handles 4-byte utf8 codes, which are needed by Emoji and some of Chinese.
Outside of MySQL, "UTF-8" refers to all size encodings, hence effectively the same as MySQL's utf8mb4, not utf8.
I will try to use those spellings and capitalizations to distinguish inside versus outside MySQL in the following.
Overview of what you should do
Have your editor, etc. set to UTF-8.
HTML forms should start like <form accept-charset="UTF-8">.
Have your bytes encoded as UTF-8.
Establish UTF-8 as the encoding being used in the client.
Have the column/table declared CHARACTER SET utf8mb4 (Check with SHOW CREATE TABLE.)
<meta charset=UTF-8> at the beginning of HTML
Stored Routines acquire the current charset/collation. They may need rebuilding.
UTF-8 all the way through
More details for computer languages (and its following sections)
Test the data
Viewing the data with a tool or with SELECT cannot be trusted.
Too many such clients, especially browsers, try to compensate for incorrect encodings, and show you correct text even if the database is mangled.
So, pick a table and column that has some non-English text and do
SELECT col, HEX(col) FROM tbl WHERE ...
The HEX for correctly stored UTF-8 will be
For a blank space (in any language): 20
For English: 4x, 5x, 6x, or 7x
For most of Western Europe, accented letters should be Cxyy
Cyrillic, Hebrew, and Farsi/Arabic: Dxyy
Most of Asia: Exyyzz
Emoji and some of Chinese: F0yyzzww
More details
Specific causes and fixes of the problems seen
Truncated text (Se for Señor):
The bytes to be stored are not encoded as utf8mb4. Fix this.
Also, check that the connection during reading is UTF-8.
Black Diamonds with question marks (Se�or for Señor);
one of these cases exists:
Case 1 (original bytes were not UTF-8):
The bytes to be stored are not encoded as utf8. Fix this.
The connection (or SET NAMES) for the INSERT and the SELECT was not utf8/utf8mb4. Fix this.
Also, check that the column in the database is CHARACTER SET utf8 (or utf8mb4).
Case 2 (original bytes were UTF-8):
The connection (or SET NAMES) for the SELECT was not utf8/utf8mb4. Fix this.
Also, check that the column in the database is CHARACTER SET utf8 (or utf8mb4).
Black diamonds occur only when the browser is set to <meta charset=UTF-8>.
Question Marks (regular ones, not black diamonds) (Se?or for Señor):
The bytes to be stored are not encoded as utf8/utf8mb4. Fix this.
The column in the database is not CHARACTER SET utf8 (or utf8mb4). Fix this. (Use SHOW CREATE TABLE.)
Also, check that the connection during reading is UTF-8.
Mojibake (Señor for Señor):
(This discussion also applies to Double Encoding, which is not necessarily visible.)
The bytes to be stored need to be UTF-8-encoded. Fix this.
The connection when INSERTing and SELECTing text needs to specify utf8 or utf8mb4. Fix this.
The column needs to be declared CHARACTER SET utf8 (or utf8mb4). Fix this.
HTML should start with <meta charset=UTF-8>.
If the data looks correct, but won't sort correctly, then
either you have picked the wrong collation,
or there is no collation that suits your need,
or you have Double Encoding.
Double Encoding can be confirmed by doing the SELECT .. HEX .. described above.
é should come back C3A9, but instead shows C383C2A9
The Emoji 👽 should come back F09F91BD, but comes back C3B0C5B8E28098C2BD
That is, the hex is about twice as long as it should be.
This is caused by converting from latin1 (or whatever) to utf8, then treating those
bytes as if they were latin1 and repeating the conversion.
The sorting (and comparing) does not work correctly because it is, for example,
sorting as if the string were Señor.
Fixing the Data, where possible
For Truncation and Question Marks, the data is lost.
For Mojibake / Double Encoding, ...
For Black Diamonds, ...
The Fixes are listed here. (5 different fixes for 5 different situations; pick carefully): http://mysql.rjweb.org/doc.php/charcoll#fixes_for_various_cases
I had similar issues with two of my projects, after a server migration. After searching and trying a lot of solutions, I came across with this one:
mysqli_set_charset($con,"utf8mb4");
After adding this line to my configuration file, everything works fine!
I found this solution for MySQLi—PHP mysqli set_charset() Function—when I was looking to solve an insert from an HTML query.
I was also searching for the same issue. It took me nearly one month to find the appropriate solution.
First of all, you will have to update you database will all the recent CHARACTER and COLLATION to utf8mb4 or at least which support UTF-8 data.
For Java:
while making a JDBC connection, add this to the connection URL useUnicode=yes&characterEncoding=UTF-8 as parameters and it will work.
For Python:
Before querying into the database, try enforcing this over the cursor
cursor.execute('SET NAMES utf8mb4')
cursor.execute("SET CHARACTER SET utf8mb4")
cursor.execute("SET character_set_connection=utf8mb4")
If it does not work, happy hunting for the right solution.
Set your code IDE language to UTF-8
Add <meta charset="utf-8"> to your webpage header where you collect data form.
Check your MySQL table definition looks like this:
CREATE TABLE your_table (
...
) ENGINE=InnoDB DEFAULT CHARSET=utf8
If you are using PDO, make sure
$options = array(PDO::MYSQL_ATTR_INIT_COMMAND=>'SET NAMES utf8');
$dbL = new PDO($pdo, $user, $pass, $options);
If you already got a large database with above problem, you can try SIDU to export with correct charset, and import back with UTF-8.
Depending on how the server is setup, you have to change the encode accordingly. utf8 from what you said should work the best. However, if you're getting weird characters, it might help if you change the webpage encoding to ANSI.
This helped me when I was setting up a PHP MySQLi. This might help you understand more: ANSI to UTF-8 in Notepad++

Incorrect Greek characters with MySQL connector/net and C#

I'm trying to make a c# project that reads from a MySQL database.
The data are inserted from a php page with utf-8 encoding. Both page and data is utf-8.
The data is self is greek words like "Λεπτομέρεια 3".
When fetching the data it looks like "ΛεπτομέÏεια 3".
I have set 'charset=utf8' in the connection string and also tried with 'set session character_set_results=latin1;' query.
When doing the same with mysql (linux), MySQL Workbench, MySQL native connector for OpenOffice with OpenOffice Base, the data are displayed correctly.
I'm I doing something wrong or what else can I do?
Running the query 'SELECT value, HEX(value), LENGTH(value), CHAR_LENGTH(value) FROM call_attribute;' from inside my program.
It returns :
Value:
ΛεπτομέÏεια 3
HEX(value) :
C38EE280BAC38EC2B5C38FE282ACC38FE2809EC38EC2BFC38EC2BCC38EC2ADC38FC281C38EC2B5C38EC2B9C38EC2B12033
LENGTH(value) :
49
CHAR_LENGTH(value) :
24
Any ideas???
You state that the first character of your data is capital lambda, Λ.
The UTF-8 represenation of this character is 0xCE 0x9B, whereas the HEX() value starts with C38E, which is indeed capital I with circumflex, as displayed in your question.
So I guess the original bug was not in the PHP configuration, and your impression that "data are displayed correctly" was wrong and due to an encoding problem.
Also note that the Greek alphabet only requires Latin-7, rather than Latin-1, when storing Greek data as single-byte characters rather than in Unicode.
Most likely, you have an encoding problem here, meaning different applications interpret the binary data as different character sets or encodings. (But lacking PHP and MySQL knowledge, I cannot really help you how to configure correctly).
You should try SET NAMES 'utf8' and have a look at this link
I've manage to solve my problem by setting the 'skip-character-set-client-handshake' in /etc/my.cnf'. After that everything was ok, the encoding of greek words was correct and the display was perfect.
One drawback was that I had to re-enter all the data into the database again.

Writing in Paradox tables in .Net with a specific Langdriver

I'm trying to add values in a paradox table with c#.
The point is that this table is containing localized strings, for which the Langdriver ANSII850 is required by the BDE.
I tried to use both OLEDB and Odbc drivers in .Net, but I cannot write correct values in my database. I always get encoding issues.
Example:
// ODBC Connection string (using string.Format for setting the path)
string connectionBase = #"Driver={{Microsoft Paradox Driver (*.db )}};DriverID=538;Fil=Paradox 5.X;DefaultDir={0};CollatingSequence=ASCII;";
// I tried to put the langdriver in the CollatingSequence parameter
string connectionBase = #"Driver={{Microsoft Paradox Driver (*.db )}};DriverID=538;Fil=Paradox 5.X;DefaultDir={0};CollatingSequence=ANSII850;";
// I tried the OleDb driver
string connectionBase = #"Provider=Microsoft.Jet.OLEDB.4.0;Extended Properties=Paradox 5.x;"Data Source={0};";
Then, I'm trying to insert the value "çã á çõ" in order to test. Depending on the driver I'm using, I get different results but the final string is never encoded correctly.
Edited:
Finally, I found a solution, but not ideal:
I'm able to switch from a langdriver to another by calling an external executable, written in delphi. In this case, I'm using ANSII850.
Then, I'm able to read data from my paradox tables. But I still don't get my data in a good format.
Strings from the tables are not encoded with the code page 850 either, trying to decode them with .Net tools just does not work
Instead, I'm manually tracking special chars (that are not correctly read) and replacing them by the correct utf8 chars.
For writing I'm doing the exact opposite.
It works, but it's still not ideal.
Are you sure you're using the BDE? Your examples refer to a lot of Microsoft parts.
The BDE used the higher codes for these "special characters" and a codepage to interpret them. Looks like 850 is what you think would be correct. If you can just send a string to the bde with the hex or decimal of the characters you want, you may be able to see if it will print that correctly.

C# storing text in SQL Server for full text search

Im writing an Outlook Add-in to file emails acdcording to certain parameters.
I am currently storing the Outlook.MailItem.Body property in a varbinary(max) field in SQL Server 2008R2. I have also enabled FTS on this column.
Currently I store the Body property of the email as a byte array in the database, and use ASCIIEncoder.GetBytes() function to convert this clear text. Currently I am experiencing some weird results, whereby I notice ? characters occasionally for apostrophes and new lines.
I have two questions:
Is this the best method to store text in a database? As a byte array? And is the ASCIIEncoder the best method to acheive this?
I want to handle Unicode strings correctly, is there anything I should be aware of?
I'm not sure whether FullTextSearch works best on VarBinary columns, though my instinct says "no", but I can answer the second half of your question.
The reason you're getting odd characters is that ASCIIEncoder.GetBytes() treats the text as ASCII, and can have exactly those sort of errors if the text you're encoding ISN'T ASCII-encoded. By default, strings in .NET are UTF8, so you're probably running into problems there. Use Encoding.UTF8.GetBytes() to get the bytes for a UTF8 string.
This also answers the second question - is this method useful for Unicode strings? Yes, since you're not storing strings at all. You're storing bytes, which your application happens to know are encoded Unicode strings. SQL won't do anything to them, because they're just bytes.
Since you have to support Unicode characters and handle only text you should store your data in a column of type nvarchar. That would address both of your problems:
1.) Text is saved as variable-length Unicode character data in the database, you don't need a byte encoder/decoder to retrieve the data
2.) See 1.)

MySQL C# Text Encoding Problems

I have an old MySQL database with encoding set to UTF-8. I am using Ado.Net Entity framework to connect to it.
The string that I retrieve from it have strange characters when ë like characters are expected.
For example: "ë" is "ë".
I thought I could get this right by converting from UTF8 to UTF16.
return Encoding.Unicode.GetString(
Encoding.Convert(
Encoding.UTF8,
Encoding.Unicode,
Encoding.UTF8.GetBytes(utf8)));
}
This however doesn't change a thing.
How could I get the data from this database in proper form?
There are two things that you need to do to support UTF-8 in the ADO.NET Entity frame work (or in general using the MySQL .NET Connector):
Ensure that the collation of your database of table is a UTF-8 collation (i.e. utf8_general_ci or one of its relations)
Add Charset=utf8; to your connection string.
"Server=localhost;Database=test;Uid=test;Pwd=test;Charset=utf8;"
I'm not certain, but the encoding may be case sensitive; I found that CharSet=UTF8; did not work for me.
Even if the database is set to UTF8 you must do the following things to get Unicode fields to work correctly:
Ensure you are using a Unicode field type like NVARCHAR or TEXT CHARSET utf8
Whenever you insert anything into the field you must prefix it with the N character to indicate Unicode data as shown in the examples below
Whenever you select based on Unicode data ensure you use the N prefix again
MySqlCommand cmd = new MySqlCommand("INSERT INTO EXAMPLE (someField) VALUES (N'Unicode Data')");
MySqlCommand cmd2 = new MySqlCommand("SELECT * FROM EXAMPLE WHERE someField=N'Unicode Data'");
If the database wasn't configured correctly or the data was inserted without using the N prefix it won't be possible to get the correct data out since it will have been downcast into the Latin 1/ASCII character set
Try set the encoding by "set names utf8" query. You can set this parameter in mysql config too.
As others have said this could be a db issue, but it could also be caused by using an old version of the .net mysql connector.
What I actually wanted to comment on was the utf8 to utf16 conversion. The string you are trying to convert is actually alreay unicode encoded, so your "ë" characters actually takes up 4 bytes (or more) and are no longer, at the point of your conversion, a misrepresentation of the "ë" character. That is the reason why your conversion doesn't do anything.
If you want to do a conversion like that I think you would have to encode your utf8 string as a old style 1 byte per character string, using a codepage where the byte values of à and « actually represent the utf8 byte sequence of ë and then treat the bytes of this new string as an utf8 string. Fun stuff.
thank you The Mouth of a Cow ,
your solution works but still we need converting characters.
i think this is your problem :)
and for converting characters you can use this code
System.Text.Encoding utf_8 = System.Text.Encoding.UTF8;
string s = "unicode";
//string to utf
byte[] utf = System.Text.Encoding.UTF8.GetBytes(s);
//utf to string
string s2= System.Text.Encoding.UTF8.GetString(utf);
"Server=localhost;Database=test;Uid=test;Pwd=test;Charset=utf8;"
It worked - PowerShell 7.2, MySQL Connector 8.0.29

Categories

Resources