My website could possibly receive both Unicode and Non-Unicode characters - c#

A user on my website might add a comment on an item that contains both Arabic and English characters. Sometimes it may be just Arabic, sometimes just English, and sometimes French!
It's an international website, so I can't predict which characters will be stored by my application.
My website has nothing to do with Facebook, but I need my comment TextBoxes to be able to accept and show any characters from any language!
So how do you think I could achieve this?

All strings in .NET are Unicode strings (see the documentation of the String class for more information), so taking input from the user in a mix of languages shouldn't be a problem.
If you plan to store this information in the database you will need to make sure the database columns are of type nchar or nvarchar. As others pointed out, when you run queries against these columns out of SSMS you will need to prefix Unicode strings with N to make sure they are handled properly. When executing queries from code you should use parameterized queries (or an ORM, which would probably be better) and ADO.NET will take care of properly constructing queries for you.
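A minimal sketch of the parameterized-query idea, written in Python with the standard-library sqlite3 module as a stand-in for SQL Server and ADO.NET. The table name `comments` and column name `body` are hypothetical. The text is passed as a parameter rather than spliced into the SQL string, so mixed Arabic, English and French should survive the round trip:

```python
import sqlite3

# In-memory database used purely for illustration;
# "comments" and "body" are made-up names.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE comments (id INTEGER PRIMARY KEY, body TEXT)")

comment = "Hello, مرحبا, déjà vu"

# The "?" placeholder lets the driver handle quoting and encoding.
conn.execute("INSERT INTO comments (body) VALUES (?)", (comment,))
stored = conn.execute("SELECT body FROM comments").fetchone()[0]

print(stored == comment)
conn.close()
```

The same pattern applies with any driver that supports placeholders: the application never builds a string literal, so there is no N prefix to forget.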

There are two elements here:
Displaying the characters: this is handled on the user's side. If a font or encoding is missing there, the outcome will be gibberish, and you can't affect that.
Storing the characters: this you can affect. Save everything in your database as UTF-8 regardless of the input. Any popular browser is able to render UTF-8.

If you use Unicode as the charset on the web pages and in the database, then you don't have to worry about where your users are from, since they will all type Unicode into your textboxes.

First, make sure that the database fields that may store Unicode characters are declared as nvarchar, not varchar.
Be aware that nvarchar uses two bytes per character, double what varchar uses.
For example, a non-MAX column in SQL Server can hold at most 8000 bytes,
so varchar can be declared up to 8000 characters but nvarchar only up to 4000: an nvarchar(4000) column already uses the full 8000 bytes.
Second, set the language of the page according to the language the user selected,
which you can read from the URL, for example http://website.tld/ar/
<meta http-equiv="Content-Language" content="ar" >
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" >
When the language in the URL changes, change the Content-Language value to match.
Keep the charset as utf-8 for every language, because a single-byte charset such as windows-1252 cannot represent Arabic.
Regards
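As a rough illustration of the storage and charset points, the Python sketch below compares encodings directly. It assumes that nvarchar data is held as UTF-16 code units (two bytes each for the characters used here); the sample strings are arbitrary.

```python
arabic = "مرحبا"
english = "Hello"

# nvarchar-style storage: UTF-16, 2 bytes per character for these strings.
utf16_arabic = len(arabic.encode("utf-16-le"))
utf16_english = len(english.encode("utf-16-le"))

# A single-byte Western code page has no Arabic letters at all.
try:
    arabic.encode("cp1252")
    fits_in_cp1252 = True
except UnicodeEncodeError:
    fits_in_cp1252 = False

# UTF-8 can carry any language, so one charset serves every page.
utf8_arabic = arabic.encode("utf-8")

print(utf16_arabic, utf16_english, fits_in_cp1252, len(utf8_arabic))
```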

Related

Encoding problem on MySql Database from Blazor form on hosting environment [duplicate]

I tried to use UTF-8 and ran into trouble.
I have tried so many things; here are the results I have gotten:
???? instead of Asian characters. Even for European text, I got Se?or for Señor.
Strange gibberish (Mojibake?) such as SeÃ±or or æ–°æµªæ–°é—» for 新浪新闻.
Black diamonds, such as Se�or.
Finally, I got into a situation where the data was lost, or at least truncated: Se for Señor.
Even when I got text to look right, it did not sort correctly.
What am I doing wrong? How can I fix the code? Can I recover the data, if so, how?
This problem plagues the participants of this site, and many others.
You have listed the five main cases of CHARACTER SET troubles.
Best Practice
Going forward, it is best to use CHARACTER SET utf8mb4 and COLLATION utf8mb4_unicode_520_ci. (There is a newer version of the Unicode collation in the pipeline.)
utf8mb4 is a superset of utf8 in that it handles 4-byte utf8 codes, which are needed by Emoji and some of Chinese.
Outside of MySQL, "UTF-8" refers to all size encodings, hence effectively the same as MySQL's utf8mb4, not utf8.
I will try to use those spellings and capitalizations to distinguish inside versus outside MySQL in the following.
Overview of what you should do
Have your editor, etc. set to UTF-8.
HTML forms should start like <form accept-charset="UTF-8">.
Have your bytes encoded as UTF-8.
Establish UTF-8 as the encoding being used in the client.
Have the column/table declared CHARACTER SET utf8mb4 (Check with SHOW CREATE TABLE.)
<meta charset=UTF-8> at the beginning of HTML
Stored Routines acquire the current charset/collation. They may need rebuilding.
UTF-8 all the way through
More details for computer languages (and its following sections)
Test the data
Viewing the data with a tool or with SELECT cannot be trusted.
Too many such clients, especially browsers, try to compensate for incorrect encodings, and show you correct text even if the database is mangled.
So, pick a table and column that has some non-English text and do
SELECT col, HEX(col) FROM tbl WHERE ...
The HEX for correctly stored UTF-8 will be
For a blank space (in any language): 20
For English: 4x, 5x, 6x, or 7x
For most of Western Europe, accented letters should be Cxyy
Cyrillic, Hebrew, and Farsi/Arabic: Dxyy
Most of Asia: Exyyzz
Emoji and some of Chinese: F0yyzzww
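The byte patterns in this list can be checked outside MySQL. Here is a small Python sketch, where `utf8_hex` is a hypothetical helper that mimics what HEX() would show for a correctly stored utf8mb4 value:

```python
samples = {
    "space": " ",
    "English": "A",
    "Western Europe": "é",
    "Cyrillic": "Я",
    "Arabic": "ب",
    "Chinese": "新",
    "Emoji": "👽",
}

def utf8_hex(text: str) -> str:
    """Return the UTF-8 bytes of text as uppercase hex, like MySQL's HEX()."""
    return text.encode("utf-8").hex().upper()

for label, ch in samples.items():
    print(f"{label:15} {utf8_hex(ch)}")
```

If the hex you get from the database for the same character is longer than what this prints, the stored data is probably double-encoded.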
More details
Specific causes and fixes of the problems seen
Truncated text (Se for Señor):
The bytes to be stored are not encoded as utf8mb4. Fix this.
Also, check that the connection during reading is UTF-8.
Black Diamonds with question marks (Se�or for Señor);
one of these cases exists:
Case 1 (original bytes were not UTF-8):
The bytes to be stored are not encoded as utf8. Fix this.
The connection (or SET NAMES) for the INSERT and the SELECT was not utf8/utf8mb4. Fix this.
Also, check that the column in the database is CHARACTER SET utf8 (or utf8mb4).
Case 2 (original bytes were UTF-8):
The connection (or SET NAMES) for the SELECT was not utf8/utf8mb4. Fix this.
Also, check that the column in the database is CHARACTER SET utf8 (or utf8mb4).
Black diamonds occur only when the browser is set to <meta charset=UTF-8>.
Question Marks (regular ones, not black diamonds) (Se?or for Señor):
The bytes to be stored are not encoded as utf8/utf8mb4. Fix this.
The column in the database is not CHARACTER SET utf8 (or utf8mb4). Fix this. (Use SHOW CREATE TABLE.)
Also, check that the connection during reading is UTF-8.
Mojibake (SeÃ±or for Señor):
(This discussion also applies to Double Encoding, which is not necessarily visible.)
The bytes to be stored need to be UTF-8-encoded. Fix this.
The connection when INSERTing and SELECTing text needs to specify utf8 or utf8mb4. Fix this.
The column needs to be declared CHARACTER SET utf8 (or utf8mb4). Fix this.
HTML should start with <meta charset=UTF-8>.
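Three of these symptoms can be reproduced without a database, which may help in recognising which mismatch is in play. This is only a sketch in Python; cp1252 is used as an assumption for what MySQL calls latin1.

```python
text = "Se\u00f1or"  # Señor

# Question marks: the target charset cannot represent the character.
question_marks = text.encode("ascii", errors="replace").decode("ascii")

# Black diamond: latin1 bytes read as if they were UTF-8.
black_diamond = text.encode("latin-1").decode("utf-8", errors="replace")

# Mojibake: UTF-8 bytes read as if they were cp1252/latin1.
mojibake = text.encode("utf-8").decode("cp1252")

print(question_marks, black_diamond, mojibake)
```

Truncation is not shown because it depends on how MySQL handles an invalid byte sequence on INSERT.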
If the data looks correct, but won't sort correctly, then
either you have picked the wrong collation,
or there is no collation that suits your need,
or you have Double Encoding.
Double Encoding can be confirmed by doing the SELECT .. HEX .. described above.
é should come back C3A9, but instead shows C383C2A9
The Emoji 👽 should come back F09F91BD, but comes back C3B0C5B8E28098C2BD
That is, the hex is about twice as long as it should be.
This is caused by converting from latin1 (or whatever) to utf8, then treating those
bytes as if they were latin1 and repeating the conversion.
The sorting (and comparing) does not work correctly because it is, for example,
sorting as if the string were SeÃ±or.
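The double conversion described above can be simulated in a few lines of Python. The function names are hypothetical, and cp1252 stands in for MySQL's latin1; this is a sketch for understanding the hex values, not a repair tool for a live table.

```python
def double_encode(text: str) -> bytes:
    """Simulate the bug: UTF-8 bytes misread as cp1252, then encoded again."""
    return text.encode("utf-8").decode("cp1252").encode("utf-8")

def repair(stored: bytes) -> str:
    """Undo one round of double encoding.

    Raises an exception if the bytes were not double-encoded this way.
    """
    return stored.decode("utf-8").encode("cp1252").decode("utf-8")

print(double_encode("\u00e9").hex().upper())       # é
print(double_encode("\U0001f47d").hex().upper())   # alien emoji
print(repair(double_encode("Se\u00f1or")))
```

The first two lines print the same hex strings quoted above, which confirms that they are what one extra latin1-to-utf8 conversion produces.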
Fixing the Data, where possible
For Truncation and Question Marks, the data is lost.
For Mojibake / Double Encoding, ...
For Black Diamonds, ...
The Fixes are listed here. (5 different fixes for 5 different situations; pick carefully): http://mysql.rjweb.org/doc.php/charcoll#fixes_for_various_cases
I had similar issues with two of my projects after a server migration. After searching and trying a lot of solutions, I came across this one:
mysqli_set_charset($con,"utf8mb4");
After adding this line to my configuration file, everything works fine!
I found this solution for MySQLi (PHP mysqli set_charset() Function) when I was looking to solve an insert from an HTML query.
I was also searching for the same issue. It took me nearly one month to find the appropriate solution.
First of all, you will have to update your database so that every CHARACTER SET and COLLATION is utf8mb4, or at least one that supports UTF-8 data.
For Java:
When making a JDBC connection, add useUnicode=yes&characterEncoding=UTF-8 as parameters to the connection URL and it will work.
For Python:
Before querying into the database, try enforcing this over the cursor
cursor.execute('SET NAMES utf8mb4')
cursor.execute("SET CHARACTER SET utf8mb4")
cursor.execute("SET character_set_connection=utf8mb4")
If it does not work, happy hunting for the right solution.
Set your code IDE language to UTF-8
Add <meta charset="utf-8"> to the header of the webpage where you collect data from a form.
Check your MySQL table definition looks like this:
CREATE TABLE your_table (
...
) ENGINE=InnoDB DEFAULT CHARSET=utf8
If you are using PDO, make sure
$options = array(PDO::MYSQL_ATTR_INIT_COMMAND=>'SET NAMES utf8');
$dbL = new PDO($pdo, $user, $pass, $options);
If you already got a large database with above problem, you can try SIDU to export with correct charset, and import back with UTF-8.
Depending on how the server is set up, you have to change the encoding accordingly. utf8 from what you said should work the best. However, if you're getting weird characters, it might help if you change the webpage encoding to ANSI.
This helped me when I was setting up a PHP MySQLi. This might help you understand more: ANSI to UTF-8 in Notepad++

Incorrect Greek characters with MySQL connector/net and C#

I'm trying to make a C# project that reads from a MySQL database.
The data are inserted from a PHP page with UTF-8 encoding. Both page and data are UTF-8.
The data itself is Greek words like "Λεπτομέρεια 3".
When fetching the data it looks like "ΛεπτομέÏεια 3".
I have set 'charset=utf8' in the connection string and also tried with 'set session character_set_results=latin1;' query.
When doing the same with mysql (linux), MySQL Workbench, MySQL native connector for OpenOffice with OpenOffice Base, the data are displayed correctly.
Am I doing something wrong, or what else can I do?
Running the query 'SELECT value, HEX(value), LENGTH(value), CHAR_LENGTH(value) FROM call_attribute;' from inside my program.
It returns :
Value:
ΛεπτομέÏεια 3
HEX(value) :
C38EE280BAC38EC2B5C38FE282ACC38FE2809EC38EC2BFC38EC2BCC38EC2ADC38FC281C38EC2B5C38EC2B9C38EC2B12033
LENGTH(value) :
49
CHAR_LENGTH(value) :
24
Any ideas???
You state that the first character of your data is capital lambda, Λ.
The UTF-8 representation of this character is 0xCE 0x9B, whereas the HEX() value starts with C38E, which is indeed capital I with circumflex, as displayed in your question.
So I guess the original bug was not in the PHP configuration, and your impression that "data are displayed correctly" was wrong and due to an encoding problem.
Also note that the Greek alphabet only requires ISO 8859-7 (Latin/Greek), rather than Latin-1, when storing Greek data as single-byte characters rather than in Unicode.
Most likely, you have an encoding problem here, meaning different applications interpret the binary data as different character sets or encodings. (But lacking PHP and MySQL knowledge, I cannot really help you how to configure correctly).
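The HEX() output from the question can be analysed offline. The Python sketch below assumes the value was double-encoded through MySQL's latin1, which behaves like cp1252 except that undefined bytes such as 0x81 map straight to the same code point. The helper name is hypothetical.

```python
stored_hex = (
    "C38EE280BAC38EC2B5C38FE282ACC38FE2809EC38EC2BFC38EC2BC"
    "C38EC2ADC38FC281C38EC2B5C38EC2B9C38EC2B12033"
)

def mysql_latin1_bytes(text: str) -> bytes:
    """Encode like MySQL's latin1: cp1252 where defined, else the raw code point."""
    out = bytearray()
    for ch in text:
        try:
            out += ch.encode("cp1252")
        except UnicodeEncodeError:
            out += ch.encode("latin-1")
    return bytes(out)

stored = bytes.fromhex(stored_hex)
as_seen = stored.decode("utf-8")                       # what the program displays
recovered = mysql_latin1_bytes(as_seen).decode("utf-8")

print(len(stored), len(as_seen))
print(recovered)
```

The byte and character counts match the LENGTH and CHAR_LENGTH values reported in the question, and undoing one round of misinterpretation gives back readable Greek, which points to double encoding in the stored data rather than a display problem.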
You should try SET NAMES 'utf8' and have a look at this link
I've managed to solve my problem by setting 'skip-character-set-client-handshake' in /etc/my.cnf. After that everything was OK: the encoding of Greek words was correct and the display was perfect.
One drawback was that I had to re-enter all the data into the database again.

C# storing text in SQL Server for full text search

I'm writing an Outlook Add-in to file emails according to certain parameters.
I am currently storing the Outlook.MailItem.Body property in a varbinary(max) field in SQL Server 2008R2. I have also enabled FTS on this column.
Currently I store the Body property of the email as a byte array in the database, and use ASCIIEncoder.GetBytes() function to convert this clear text. Currently I am experiencing some weird results, whereby I notice ? characters occasionally for apostrophes and new lines.
I have two questions:
Is this the best method to store text in a database? As a byte array? And is the ASCIIEncoder the best method to achieve this?
I want to handle Unicode strings correctly, is there anything I should be aware of?
I'm not sure whether FullTextSearch works best on VarBinary columns, though my instinct says "no", but I can answer the second half of your question.
The reason you're getting odd characters is that ASCIIEncoder.GetBytes() can only represent ASCII characters and replaces anything else with ?, which is exactly the sort of error you describe when the text contains non-ASCII characters such as typographic apostrophes. Strings in .NET are UTF-16 internally, so nothing is lost until you encode them. Use Encoding.UTF8.GetBytes() to get UTF-8 bytes for the string.
This also answers the second question - is this method useful for Unicode strings? Yes, since you're not storing strings at all. You're storing bytes, which your application happens to know are encoded Unicode strings. SQL won't do anything to them, because they're just bytes.
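The same effect is easy to reproduce in other languages. The Python sketch below is an analogue of the C# calls rather than the .NET API itself; it shows an ASCII encoder discarding a typographic apostrophe while UTF-8 round-trips it.

```python
body = "It\u2019s here"  # contains a typographic apostrophe (U+2019)

ascii_bytes = body.encode("ascii", errors="replace")  # lossy: "?" substituted
utf8_bytes = body.encode("utf-8")                     # lossless

print(ascii_bytes)
print(utf8_bytes.decode("utf-8") == body)
```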
Since you have to support Unicode characters and handle only text you should store your data in a column of type nvarchar. That would address both of your problems:
1.) Text is saved as variable-length Unicode character data in the database, you don't need a byte encoder/decoder to retrieve the data
2.) See 1.)

ગુજરાતી language in C#

I am creating a "Gujarati To English Dictionary" application using C#.
In which I would want that if user types "tajmhal" from the keyboard "તાજમહાલ" would be displayed in the TEXTBOX at the same time.
ie. If "koml" , "કોમલ" is displayed. etc...
I have downloaded and set the font for the textbox to "Gujarati Saral-1" and it works.
But when I store the text of the textbox to the database it is stored as "tajmhal" not "તાજમહાલ".
so, could you please suggest to me another solution?
From what I can tell after looking at this font info page, this font is NOT a Unicode font, but rather displays Latin code-points as Gujarati glyphs.
In order to make your idea work, you need to use a Gujarati Unicode-compliant font, and replace each Latin character with its equivalent character. See this table for the Gujarati code-points.
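A toy sketch of such a replacement table, in Python. The table is hypothetical and deliberately tiny: it covers only the letters in the two example words, and it maps "a" and "o" to dependent vowel signs. Real transliteration needs context rules (independent vowels, inherent vowels, conjuncts) that a one-to-one table cannot express.

```python
# Hypothetical mapping from Latin keys to Gujarati code points.
LATIN_TO_GUJARATI = str.maketrans({
    "k": "\u0a95",  # letter ka
    "j": "\u0a9c",  # letter ja
    "t": "\u0aa4",  # letter ta
    "m": "\u0aae",  # letter ma
    "l": "\u0ab2",  # letter la
    "h": "\u0ab9",  # letter ha
    "a": "\u0abe",  # vowel sign aa
    "o": "\u0acb",  # vowel sign o
})

def to_gujarati(latin: str) -> str:
    """Replace each mapped Latin letter; unmapped characters pass through."""
    return latin.translate(LATIN_TO_GUJARATI)

print(to_gujarati("tajmhal"))
print(to_gujarati("koml"))
```

The result is real Unicode text, so it is what gets stored in the database, independent of the font used to display it.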
If the user types tajmhal from the keyboard, then that will be saved in the database. You can show it using whatever font you like - a nice Indian font, fancy calligraphy font, comic sans, barcode, wingdings - but it is still tajmhal.
You need to translate the word or characters from one language to another. That's not a font issue, that requires some mapping from one to the other. The characters will have different unicode values - you can try mapping them yourself, character by character, but this is unlikely to work unless the two languages have the same number of letters and they map directly from one to another. So you need to translate.
The other answer suggests using Google. You're writing a desktop application but you can still integrate with Google technology; if network is down then you don't translate and try again later.
Question: if the text is displaying as you want then why do you need to translate it? You're obviously using an english keyboard - why not store the text as english characters?
Google has a transliteration api (that is unfortunately deprecated), which might solve your problem. If you need there are other services like Quillpad that you might want to buy. These will allow you to type in one script and get the phonetic equivalent in another script - transliteration. Once you store your data in the database, you should be able to display it again, unless you're storing the English string there by mistake.
Ok, can you check your datatype in SQL, is it varchar or nvarchar. nvarchar deals with unicode characters.

linq search for French characters

I'm using EF and have a simple LINQ statement, and I want to search for words.
So there is Textbox search and submit button.
when searchtext contains "march" it finds eg. "des marchés", but If I search for "marché" it doesn't find. So it's the French character.
listAgendaItems = dc.agenda.Where(a =>
a.libelle_activite.Contains(searchText)
).ToList<agenda>();
The database and the table Agenda have extended properties -> Collation : French_CI_AS
So how can I make sure I get the French words too, like "é, à" etc.?
I also tried to search for "marche" but it doesn't find "marchés".
Your collation French_CI_AS is "Case-Insensitive", "Accent-Sensitive". If you want a query for "marches" to match "marchés", you need French_CI_AI as your collation. In most languages, that's actually NOT what native speakers want, because the accents are semantically important, but that may depend on the circumstances or context.
If, in fact, your users do always want accent insensitive searches, you should set that collation property to AI instead of AS on the table (or the specific fields). Otherwise, if the need is rare, you can apply collation to a table in MS Sql on a per-query basis; keep in mind that if there is no index on that collation there may be a substantial performance cost. That may be nearly immaterial when you're doing a %wildcard% query, however, since you'll generally have a full table scan in that case anyway.
The last I checked, it wasn't possible to specify a collation in a LINQ query directly, so if you need accent insensitivity on an ad-hoc basis, you'd need to use a direct SQL query through your data context.
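To make the accent-sensitive versus accent-insensitive distinction concrete, here is a sketch in Python of what an AI comparison does conceptually. It is not how SQL Server implements collations, and the function names are hypothetical; it strips combining marks after Unicode decomposition.

```python
import unicodedata

def strip_accents(text: str) -> str:
    """Decompose characters (NFD) and drop the combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def contains_ai(haystack: str, needle: str) -> bool:
    """Case-insensitive, accent-insensitive substring test."""
    return strip_accents(needle).casefold() in strip_accents(haystack).casefold()

title = "des march\u00e9s"  # "des marchés" with a precomposed é

print(contains_ai(title, "marche"))
print(contains_ai(title, "march\u00e9"))
print("marche" in title)  # plain, accent-sensitive comparison
```

Filtering like this in application code means loading the rows first, so for large tables a collation applied in the database is usually the better place for it.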
Edited:
Based on your comment, it sounds like you are allowing HTML content to be stored in your database. You have numeric character references in your table, which SQL Server knows nothing about, since they're a feature of HTML, XML and SGML. You can only make this searchable if those characters are string literals in a suitable encoding.
NVARCHAR will store the content in Unicode, specifically UTF-16, and VARCHAR will use Windows-1252 with French collation.
If you are accepting this input via web forms, make sure the page encoding is appropriate. If you are only supporting modern browsers (essentially anything IE5+), UTF-8 is well supported, so you should consider using UTF-8 for all of your requests and responses.
Make sure in your web.config, you have something like this:
<configuration>
<system.web>
<globalization
requestEncoding="utf-8"
responseEncoding="utf-8" />
</system.web>
</configuration>
If you already have data stored with those numeric character references in your database, you can unescape them by translating &#ddddd; into literal UTF-16 sequences, and store them again. Make sure you don't accidentally unescape semantically important NCRs like the greater-than, less-than or ampersand codepoints.
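A possible shape for that unescaping step, sketched in Python. The function name and the set of protected code points are assumptions for illustration; only decimal references are handled, and the references for ampersand, less-than and greater-than are left untouched.

```python
import re

# Code points for & < > stay escaped so the stored HTML keeps its meaning.
KEEP_ESCAPED = {38, 60, 62}

def unescape_ncrs(html: str) -> str:
    """Replace decimal numeric character references with literal characters."""
    def replace(match):
        code = int(match.group(1))
        if code in KEEP_ESCAPED or code > 0x10FFFF:
            return match.group(0)  # leave the reference as it is
        return chr(code)
    return re.sub(r"&#(\d+);", replace, html)

print(unescape_ncrs("march&#233;s &#60;b&#62; &#38; co"))
```

Run it against a copy of the data first, since the change cannot be told apart from original literal text once it has been stored.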
