Handling alphabets (Norwegian and Danish) in SQL Server - C#

I have an application that has a lot of products with special characters like é, è, ê, ó, ò, â, and ô.
These characters give me problems: when I store them in SQL Server, they get replaced by ?. I also run into problems during processing.
How can I handle these?
Should I keep using string to handle them, or use something else?
What should their data types be in SQL Server?
Any help is appreciated.

Have you tried using nvarchar as the data type? It is usually recommended for storing non-English text (the cost is more storage space). We use nvarchar for Finnish text (ä ö å) and have no problems or special processing. If you are writing to a stream, make sure to use the ISO-8859-1 encoding (at least for Scandinavian languages; Eastern European languages use a different one).
If it's not possible for you to change the data type, let me know and we can come up with a different solution.
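For the C# side, here is a minimal sketch under assumed names (the Products table, the connection string, and the sample value are all made up; the essentials are the NVARCHAR column and the NVarChar parameter type):

using System.Data;
using System.Data.SqlClient; // Microsoft.Data.SqlClient in newer projects

class SaveProduct
{
    static void Main()
    {
        // Hypothetical schema: CREATE TABLE Products (Name NVARCHAR(100));
        // In raw T-SQL, Unicode literals also need the N prefix: N'Crème brûlée'.
        var connectionString = "Server=.;Database=Shop;Integrated Security=true;";
        using var conn = new SqlConnection(connectionString);
        using var cmd = new SqlCommand("INSERT INTO Products (Name) VALUES (@name)", conn);

        // SqlDbType.NVarChar sends the string as Unicode, so é, è, ê, ó, ò,
        // â, and ô survive instead of degrading to '?'.
        cmd.Parameters.Add("@name", SqlDbType.NVarChar, 100).Value = "Crème brûlée";

        conn.Open();
        cmd.ExecuteNonQuery();
    }
}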

Encoding problem on MySql Database from Blazor form on hosting environment [duplicate]

I tried to use UTF-8 and ran into trouble.
I have tried so many things; here are the results I have gotten:
???? instead of Asian characters. Even for European text, I got Se?or for Señor.
Strange gibberish (mojibake?), such as SeÃ±or for Señor, or æ–°æµªæ–°é—» for 新浪新闻.
Black diamonds, such as Se�or.
Finally, I got into a situation where the data was lost, or at least truncated: Se for Señor.
Even when I got text to look right, it did not sort correctly.
What am I doing wrong? How can I fix the code? Can I recover the data, if so, how?
This problem plagues the participants of this site, and many others.
You have listed the five main cases of CHARACTER SET troubles.
Best Practice
Going forward, it is best to use CHARACTER SET utf8mb4 and COLLATION utf8mb4_unicode_520_ci. (There is a newer version of the Unicode collation in the pipeline.)
utf8mb4 is a superset of utf8 in that it handles 4-byte utf8 codes, which are needed by Emoji and some of Chinese.
Outside of MySQL, "UTF-8" refers to all size encodings, hence effectively the same as MySQL's utf8mb4, not utf8.
I will try to use those spellings and capitalizations to distinguish inside versus outside MySQL in the following.
Overview of what you should do
Have your editor, etc. set to UTF-8.
HTML forms should start like <form accept-charset="UTF-8">.
Have your bytes encoded as UTF-8.
Establish UTF-8 as the encoding being used in the client (see the connection-string sketch after this list).
Have the column/table declared CHARACTER SET utf8mb4 (check with SHOW CREATE TABLE).
<meta charset=UTF-8> at the beginning of HTML
Stored Routines acquire the current charset/collation. They may need rebuilding.
UTF-8 all the way through
More details for computer languages (and its following sections)
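For example, with the .NET Connector (MySql.Data) the client encoding can be set in the connection string; a minimal sketch with made-up server, database, and credentials:

using MySql.Data.MySqlClient; // Connector/NET assumed; adjust for your driver

class Utf8Connection
{
    static void Main()
    {
        // "CharSet=utf8mb4" asks the connector to run the connection in
        // utf8mb4, the equivalent of issuing SET NAMES utf8mb4.
        using var conn = new MySqlConnection(
            "Server=localhost;Database=shop;Uid=app;Pwd=secret;CharSet=utf8mb4");
        conn.Open();
        // INSERTs and SELECTs on this connection now travel as UTF-8.
    }
}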
Test the data
Viewing the data with a tool or with SELECT cannot be trusted.
Too many such clients, especially browsers, try to compensate for incorrect encodings, and show you correct text even if the database is mangled.
So, pick a table and column that has some non-English text and do
SELECT col, HEX(col) FROM tbl WHERE ...
The HEX for correctly stored UTF-8 will be
For a blank space (in any language): 20
For English: 4x, 5x, 6x, or 7x
For most of Western Europe, accented letters should be Cxyy
Cyrillic, Hebrew, and Farsi/Arabic: Dxyy
Most of Asia: Exyyzz
Emoji and some of Chinese: F0yyzzww
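If your client is .NET, you can compute the hex to expect for a known-good string and compare it against the HEX(col) output above; a small sketch:

using System;
using System.Text;

class ExpectedHex
{
    static void Main()
    {
        // Compare these against HEX(col) output from MySQL.
        Console.WriteLine(BitConverter.ToString(Encoding.UTF8.GetBytes("é")));  // C3-A9
        Console.WriteLine(BitConverter.ToString(Encoding.UTF8.GetBytes("新"))); // E6-96-B0
    }
}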
More details
Specific causes and fixes of the problems seen
Truncated text (Se for Señor):
The bytes to be stored are not encoded as utf8mb4. Fix this.
Also, check that the connection during reading is UTF-8.
Black Diamonds with question marks (Se�or for Señor);
one of these cases exists:
Case 1 (original bytes were not UTF-8):
The bytes to be stored are not encoded as utf8. Fix this.
The connection (or SET NAMES) for the INSERT and the SELECT was not utf8/utf8mb4. Fix this.
Also, check that the column in the database is CHARACTER SET utf8 (or utf8mb4).
Case 2 (original bytes were UTF-8):
The connection (or SET NAMES) for the SELECT was not utf8/utf8mb4. Fix this.
Also, check that the column in the database is CHARACTER SET utf8 (or utf8mb4).
Black diamonds occur only when the browser is set to <meta charset=UTF-8>.
Question Marks (regular ones, not black diamonds) (Se?or for Señor):
The bytes to be stored are not encoded as utf8/utf8mb4. Fix this.
The column in the database is not CHARACTER SET utf8 (or utf8mb4). Fix this. (Use SHOW CREATE TABLE.)
Also, check that the connection during reading is UTF-8.
Mojibake (Señor for Señor):
(This discussion also applies to Double Encoding, which is not necessarily visible.)
The bytes to be stored need to be UTF-8-encoded. Fix this.
The connection when INSERTing and SELECTing text needs to specify utf8 or utf8mb4. Fix this.
The column needs to be declared CHARACTER SET utf8 (or utf8mb4). Fix this.
HTML should start with <meta charset=UTF-8>.
If the data looks correct, but won't sort correctly, then
either you have picked the wrong collation,
or there is no collation that suits your need,
or you have Double Encoding.
Double Encoding can be confirmed by doing the SELECT .. HEX .. described above.
é should come back C3A9, but instead shows C383C2A9
The Emoji 👽 should come back F09F91BD, but comes back C3B0C5B8E28098C2BD
That is, the hex is about twice as long as it should be.
This is caused by converting from latin1 (or whatever) to utf8, then treating those bytes as if they were latin1 and repeating the conversion.
The sorting (and comparing) does not work correctly because it is, for example, sorting as if the string were SeÃ±or.
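The round trip is easy to reproduce in C# (a sketch in which ISO-8859-1 stands in for whatever single-byte charset was involved):

using System;
using System.Text;

class DoubleEncodingDemo
{
    static void Main()
    {
        Encoding latin1 = Encoding.GetEncoding("ISO-8859-1");

        byte[] once = Encoding.UTF8.GetBytes("é");                 // correct storage
        // Mis-read the UTF-8 bytes as latin1, then UTF-8-encode the result:
        byte[] twice = Encoding.UTF8.GetBytes(latin1.GetString(once));

        Console.WriteLine(BitConverter.ToString(once));  // C3-A9
        Console.WriteLine(BitConverter.ToString(twice)); // C3-83-C2-A9, twice as long
    }
}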
Fixing the Data, where possible
For Truncation and Question Marks, the data is lost.
For Mojibake / Double Encoding, ...
For Black Diamonds, ...
The Fixes are listed here. (5 different fixes for 5 different situations; pick carefully): http://mysql.rjweb.org/doc.php/charcoll#fixes_for_various_cases
I had similar issues with two of my projects after a server migration. After searching and trying a lot of solutions, I came across this one:
mysqli_set_charset($con,"utf8mb4");
After adding this line to my configuration file, everything works fine!
I found this solution for MySQLi (the PHP mysqli set_charset() function) when I was looking to fix an INSERT coming from an HTML form.
I was also searching for the same issue. It took me nearly a month to find the appropriate solution.
First of all, you will have to update your database so that the CHARACTER SET and COLLATION are utf8mb4, or at least ones that support UTF-8 data.
For Java:
while making a JDBC connection, add useUnicode=yes&characterEncoding=UTF-8 as parameters to the connection URL, and it will work.
For Python:
Before querying the database, try enforcing this on the cursor:
cursor.execute('SET NAMES utf8mb4')
cursor.execute("SET CHARACTER SET utf8mb4")
cursor.execute("SET character_set_connection=utf8mb4")
If it does not work, happy hunting for the right solution.
Set your IDE's file encoding to UTF-8.
Add <meta charset="utf-8"> to the header of the web page where your data-collection form lives.
Check that your MySQL table definition looks like this:
CREATE TABLE your_table (
...
) ENGINE=InnoDB DEFAULT CHARSET=utf8
If you are using PDO, make sure
$options = array(PDO::MYSQL_ATTR_INIT_COMMAND=>'SET NAMES utf8');
$dbL = new PDO($pdo, $user, $pass, $options);
If you already have a large database with the above problem, you can try SIDU to export with the correct charset and import back with UTF-8.
Depending on how the server is set up, you have to change the encoding accordingly. From what you said, utf8 should work best. However, if you're getting weird characters, it might help to change the web page encoding to ANSI.
This helped me when I was setting up a PHP MySQLi connection. This might help you understand more: ANSI to UTF-8 in Notepad++

Identify problematic characters in a string

I want to be able to identify problematic characters in a string saved in my SQL Server database, using LINQ to Entities.
Problematic characters are characters that were damaged during the encoding process.
This is an example of a problematic string: "testing�stringáאç".
In the above example, only the � character is considered problematic.
So, for example, the following string isn't considered problematic: "testingstringáאç".
How can I check this VARCHAR and identify that there are problematic characters in it?
Note that my preferred solution is to identify it via a LINQ to Entities query, but other solutions are also welcome, for example a stored procedure.
I tried to play with Regex and with the LIKE statement, but with no success...
Check out the Encoding class.
It has a DecoderFallback property and an EncoderFallback property that let you detect and substitute bad characters found during decoding.
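As a sketch of both directions (the example string is the one from the question; the strict decoder is only one way to surface bad input):

using System;
using System.Text;

class ProblemDetection
{
    static void Main()
    {
        // A string that was damaged during decoding usually carries U+FFFD
        // (the � replacement character), so scanning for it is often enough.
        string s = "testing\uFFFDstring\u00E1\u05D0\u00E7";
        Console.WriteLine(s.Contains("\uFFFD")); // True

        // While decoding, an exception fallback surfaces invalid bytes
        // instead of silently replacing them with U+FFFD.
        Encoding strictUtf8 = Encoding.GetEncoding(
            "utf-8", new EncoderExceptionFallback(), new DecoderExceptionFallback());
        try
        {
            strictUtf8.GetString(new byte[] { 0xC3 }); // truncated UTF-8 sequence
        }
        catch (DecoderFallbackException ex)
        {
            Console.WriteLine($"Bad byte 0x{ex.BytesUnknown[0]:X2} at index {ex.Index}.");
        }
    }
}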
.Net and NVARCHAR both use Unicode, so there is nothing inherently "problematic" (at least not for BMP characters).
So you first have to define what "problematic" is meant to mean:
Characters that are not mapped in the target code page.
Simply convert between encodings and check whether data is lost (a C# version of this check follows the PATINDEX example below):
CONVERT(NVARCHAR, CONVERT(VARCHAR, @originalNVarchar)) = @originalNVarchar
Note that you can pick a specific SQL Server collation using the COLLATE clause rather than using the default database collation.
Characters that cannot be displayed due to the fonts used.
This cannot be easily done in .Net.
You can do something like this:
DECLARE @StringWithProblem NVARCHAR(20) = N'This is '+NCHAR(8)+N'roblematic';
DECLARE @ProblemChars NVARCHAR(4000) = N'%['+NCHAR(0)+NCHAR(1)+NCHAR(8)+']%'; -- list all problematic characters here, wrapped in %[]%
SELECT PATINDEX(@ProblemChars, @StringWithProblem), @StringWithProblem;
That gives you the index of the first problematic character or 0 if none is found.
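For the first case above, the CONVERT round trip can also be driven from C#; a sketch with made-up table, column, and collation names, assuming the Microsoft.Data.SqlClient driver:

using System;
using Microsoft.Data.SqlClient;

class RoundTripCheck
{
    static void Main()
    {
        // Flags rows whose nvarchar value does not survive conversion to a
        // single-byte code page chosen via an explicit COLLATE clause.
        const string sql = @"
            SELECT Name
            FROM   Products
            WHERE  CONVERT(NVARCHAR(100),
                       CONVERT(VARCHAR(100), Name COLLATE Latin1_General_CI_AS)) <> Name;";

        using var conn = new SqlConnection("Server=.;Database=Shop;Integrated Security=true;");
        using var cmd = new SqlCommand(sql, conn);
        conn.Open();
        using var reader = cmd.ExecuteReader();
        while (reader.Read())
            Console.WriteLine(reader.GetString(0)); // values with unmappable characters
    }
}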

ગુજરાતી language in C#

I am creating a "Gujarati To English Dictionary" application using C#.
I want it so that if the user types "tajmhal" on the keyboard, "તાજમહાલ" is displayed in the TextBox at the same time.
I.e., if "koml" is typed, "કોમલ" is displayed, etc.
I have downloaded and set the font for the textbox to "Gujarati Saral-1" and it works.
But when I store the text of the textbox in the database, it is stored as "tajmhal", not "તાજમહાલ".
So, could you please suggest another solution?
From what I can tell after looking at this font info page, this font is NOT a Unicode font; rather, it displays Latin code points as Gujarati glyphs.
In order to make your idea work, you need to use a Gujarati Unicode-compliant font and replace each Latin character with its equivalent Gujarati character. See this table for the Gujarati code points.
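As a sketch of what such a replacement could look like (the mapping below is an illustrative fragment, not a complete transliteration scheme):

using System;
using System.Collections.Generic;
using System.Linq;

class Transliterator
{
    // Illustrative fragment of a Latin-to-Gujarati map; a real input scheme
    // needs full consonant/vowel-sign rules, not single-character lookups.
    static readonly Dictionary<char, string> Map = new Dictionary<char, string>
    {
        ['k'] = "\u0A95", // ક GUJARATI LETTER KA
        ['o'] = "\u0ACB", // ો GUJARATI VOWEL SIGN O
        ['m'] = "\u0AAE", // મ GUJARATI LETTER MA
        ['l'] = "\u0AB2", // લ GUJARATI LETTER LA
    };

    static string Transliterate(string latin) =>
        string.Concat(latin.Select(c => Map.TryGetValue(c, out var g) ? g : c.ToString()));

    static void Main()
    {
        // Real Unicode code points, so the value survives an nvarchar column.
        Console.WriteLine(Transliterate("koml")); // કોમલ
    }
}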
If the user types tajmhal from the keyboard, then that will be saved in the database. You can show it using whatever font you like - a nice Indian font, fancy calligraphy font, comic sans, barcode, wingdings - but it is still tajmhal.
You need to translate the word or characters from one language to another. That's not a font issue; it requires some mapping from one to the other. The characters will have different Unicode values. You can try mapping them yourself, character by character, but this is unlikely to work unless the two languages have the same number of letters and they map directly from one to another. So you need to translate.
The other answer suggests using Google. You're writing a desktop application, but you can still integrate with Google technology; if the network is down, you don't translate, and you try again later.
Question: if the text is displaying as you want, then why do you need to translate it? You're obviously using an English keyboard, so why not store the text as English characters?
Google has a transliteration API (unfortunately deprecated) that might solve your problem. If you need one, there are other paid services, such as Quillpad. These allow you to type in one script and get the phonetic equivalent in another script (transliteration). Once you store your data in the database, you should be able to display it again, unless you're storing the English string there by mistake.
OK, check your data type in SQL: is it varchar or nvarchar? nvarchar handles Unicode characters.

How to detect unicode strings with unprintable characters?

I have Unicode strings stored in a database. Some of the character encodings are wrong and instead of displaying actual characters for the language, it's now displaying characters that make no sense. How do I fix this issue? Is there a way to detect if strings have a wrong encoding?
The problem with mojibake (the Japanese slang term gets used in English because Japan's historical status as a non-Western country with heavy early computer use meant the issue was encountered there a lot) is that the characters will generally be valid in themselves, but nonsense, which is much harder to detect with 100% accuracy.
The first thing you need to do is identify the encoding that the data was really in, the encoding the data was read as being in, and write a converter to undo that.
For example, if UTF-8 had been mis-interpreted as ISO 8859-1, then you would want to read through the stream, and create the binary stream of encoding it back into ISO 8859-1, and then create the text stream of reading that binary stream as UTF-8, as should have been done in the first place.
Now for the hard part, finding the incorrect streams. If you can do this by some means that isn't heuristic, then this is the way to go (e.g. if you knew that every record added within a particular range of id numbers was invalid, just use that).
Failing that, your best bet is to do some heuristics as follows:
If a character in the text is not a graphical character, then it's probably caused by this mojibake issue.
Certain sequences will be common in the given case of mojibake. For example, é in UTF-8 mis-interpreted as ISO 8859-1 will become Ã©. Since Ã© is an extremely rare combination in real data (about the only time you'll see it deliberately is in a case like this, when someone is talking about how it can appear by mistake), any text containing it is almost certainly text that needs to be fixed. If you have some of the original data, you can find the sequences you need to look for by identifying those characters in the original data that differ between the two encodings, and producing the sequence necessary (e.g. if we find that ç appears in the original data, then it would produce the sequence Ã§, so we know that's a sequence to look for).
Note that we can compute such sequences if we have System.Text.Encoding objects that correspond to the mojibake. If, for example, you had read text as your system's default encoding when you should have read it as UTF-8, then you could use:
Encoding.Default.GetString(Encoding.UTF8.GetBytes(testString))
For example:
Encoding.Default.GetString(Encoding.UTF8.GetBytes("ç"))
returns "Ã§".
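The converse sequence undoes the damage. A sketch, assuming classic .NET Framework, where Encoding.Default is the system ANSI code page (on .NET Core and later, Encoding.Default is UTF-8, so you would name the code page explicitly):

using System;
using System.Text;

class UndoMojibake
{
    static void Main()
    {
        // Re-encode the mojibake with the encoding it was wrongly decoded as,
        // then decode the recovered bytes as UTF-8.
        string repaired = Encoding.UTF8.GetString(Encoding.Default.GetBytes("Ã§"));
        Console.WriteLine(repaired); // "ç"
    }
}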

How to decode a string that has been UTF-8 encoded twice to simple UTF-8?

I have a huge MySQL table which has its rows encoded in UTF-8 twice.
For example "Újratárgyalja" is stored as "Újratárgyalja".
The MySQL .Net connector downloads them this way. I tried lots of combinations with System.Text.Encoding.Convert(), but none of them worked.
Sending SET NAMES 'utf8' (or another charset) won't solve it.
How can I decode them from double UTF-8 to plain UTF-8?
Peculiar problem, but I think I can reproduce it by a suitably-unholy mix of UTF-8 and Latin-1 (not by just two uses of UTF-8 without an interspersed mis-step in Latin-1 though). Here's the whole weird round trip, "there and back again" (Python 2.* or IronPython should both be able to reproduce this):
# -*- coding: utf-8 -*-
uni = u'Újratárgyalja'
enc1 = uni.encode('utf-8')
enc2 = enc1.decode('latin-1').encode('utf-8')
dec3 = enc2.decode('utf-8')
dec4 = dec3.encode('latin-1').decode('utf-8')
for x in (uni, enc1, enc2, dec3, dec4):
    print repr(x), x
This is the interesting output...:
u'\xdajrat\xe1rgyalja' Újratárgyalja
'\xc3\x9ajrat\xc3\xa1rgyalja' Újratárgyalja
'\xc3\x83\xc2\x9ajrat\xc3\x83\xc2\xa1rgyalja' ÃjratÃ¡rgyalja
u'\xc3\x9ajrat\xc3\xa1rgyalja' ÃjratÃ¡rgyalja
u'\xdajrat\xe1rgyalja' Újratárgyalja
The weird string starting with Ã appears as enc2, i.e. two UTF-8 encodings WITH an interspersed Latin-1 decoding thrown into the mix. And as you can see, it can be undone by the exactly-converse sequence of operations: decode as UTF-8, re-encode as Latin-1, re-decode as UTF-8 again, and the original string is back (yay!).
I believe that the normal round-trip properties of both Latin-1 (aka ISO-8859-1) and UTF-8 should guarantee that this sequence will work (sorry, no C# around to try in that language right now, but I would expect that the encoding/decoding sequences should not depend on the specific programming language in use).
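Rendered in C#, the converse sequence might look like this (a sketch: ISO-8859-1 is used because it round-trips all 256 byte values, and "SeÃ±or" stands in for the mangled Hungarian text, some of whose intermediate characters are invisible control codes, as the repr output above shows):

using System;
using System.Text;

class DecodeTwiceEncoded
{
    static void Main()
    {
        Encoding latin1 = Encoding.GetEncoding("ISO-8859-1");

        // Reading the double-encoded column correctly as UTF-8 yields the
        // single-mojibake string; one encode-as-Latin-1 / decode-as-UTF-8
        // pass then recovers the original.
        string mangled  = "SeÃ±or";
        string repaired = Encoding.UTF8.GetString(latin1.GetBytes(mangled));
        Console.WriteLine(repaired); // Señor
    }
}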
When you write "The MySQL .Net connector downloads them this way." there's a good chance this means the MySQL .Net connector believes it is speaking Latin-1 to MySQL, while MySQL believes the conversation is in UTF-8. There's also a chance the column is declared as Latin-1, but actually contains UTF-8 data.
If it's the latter (column labelled Latin-1 but data is actually UTF-8) you will get mysterious collation problems and other bugs if you make use of MySQL's text processing functions, ORDER BY on the column, or other situations where the text "means something" rather than just being bytes sent over the wire.
In either case you should try to fix the underlying problem, not least because it is going to be a complete headache for whoever has to maintain the system otherwise.
You could try using
SELECT CONVERT(`your_column` USING ascii)
FROM `your_table`
at the MySQL query level. This is a stab in the dark, though.
