I'm using C# to automate an insert into command for a users table, and there's a user whose first name has an accented E, with a grave I believe?
Desirée
Whenever it makes it into the SQL Server table it appears as:
Desir?e
Which data type should I use on this column to ensure that it keeps the accented e?
I've tried varchar and nvarchar, neither seemed to matter.
Code for inserting:
var lines = File.ReadAllLines(users_feed_file);
I believe that there is an encoding issue occurring. When Visual Studio reads my file it reads the name as Desir?e.
So far I've tried to overload the File method, using:
Encoding enc = new UTF8Encoding(true, true);
var lines = File.ReadAllLines(users_feed_file,enc);
But this had no effect.
var lines = File.ReadAllLines(users_feed_file, Encoding.UTF8);
Doesn't work either.
SQL Server stores Unicode text essentially as UCS-2 or UTF-16. That is, it uses two bytes per character (in UTF-16, characters outside the Basic Multilingual Plane take two such two-byte units). UTF-8 is a variable-length encoding, using one to four bytes per character as needed. If the character in question (it would be good to post the actual Unicode value) is translated by UTF-8 into three bytes, then SQL Server reads that back as two two-byte characters, one of which probably is not a valid, displayable character, thus rendering a question mark. Note that SQL Server is not storing a question mark; that is just how whatever text editor you are using renders this garbled character.
Try changing your C# encoding to Encoding.Unicode and see if that helps round-trip the character in question.
The same reasoning applies to characters that ought to fit into one byte but are represented with two by UTF-8. For example, the Unicode hex value for small e with grave is xE8, which could be represented as 00 E8 in two bytes. But UTF-8 renders it as C3 A8. Now look up that value in Unicode (UTF-16): C3A8 is a different, unrelated character. So in this case it is not two bytes represented as three, but one byte represented incorrectly as two. This resource is invaluable when trying to debug extended character issues.
Note that for the basic Latin ASCII set, UTF-8 uses the same values as Unicode, and thus those characters round-trip just fine. It's when using extended character sets that compatibility for both encodings cannot be guaranteed.
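To make the byte-level difference concrete, here is a minimal sketch that prints the bytes .NET produces for è (U+00E8) in UTF-8 and UTF-16. The class and method names are made up for illustration; only the `Encoding` and `BitConverter` calls are real framework APIs.

```csharp
using System;
using System.Text;

public static class EncodingBytesDemo
{
    // Returns the bytes of `text` in the given encoding as a hex string such as "C3-A8".
    public static string Hex(string text, Encoding encoding)
    {
        return BitConverter.ToString(encoding.GetBytes(text));
    }

    public static void Main()
    {
        string eGrave = "\u00E8"; // è

        Console.WriteLine(Hex(eGrave, Encoding.UTF8));             // C3-A8 (two bytes)
        Console.WriteLine(Hex(eGrave, Encoding.BigEndianUnicode)); // 00-E8 (UTF-16, big-endian)
        Console.WriteLine(Hex(eGrave, Encoding.Unicode));          // E8-00 (UTF-16, little-endian)
        Console.WriteLine(Hex("e", Encoding.UTF8));                // 65 (ASCII is unchanged in UTF-8)
    }
}
```

Note that `Encoding.Unicode` in .NET is little-endian UTF-16, so the two bytes appear in the opposite order to the 00 E8 notation used above.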
Hi, try this code:
var lines = File.ReadAllLines(users_feed_file, Encoding.Unicode);
Also, in Notepad++ you can view the file's encoding; check that.
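Why the file's real encoding matters can be sketched as follows. This assumes (it is only an assumption, the question does not confirm it) that the feed file is ISO 8859-1, where é is the single byte 0xE9. Decoding those bytes as UTF-8 yields the replacement character U+FFFD, which many tools display as "?"; decoding them as ISO 8859-1 yields the expected name. The class and method names are hypothetical.

```csharp
using System;
using System.Text;

public static class WrongEncodingDemo
{
    // Decodes raw file bytes with the given encoding.
    public static string Decode(byte[] bytes, Encoding encoding)
    {
        return encoding.GetString(bytes);
    }

    public static void Main()
    {
        // "Desirée" as an ISO 8859-1 file would store it (assumed encoding): é is 0xE9.
        byte[] latin1Bytes = { 0x44, 0x65, 0x73, 0x69, 0x72, 0xE9, 0x65 };

        string asUtf8 = Decode(latin1Bytes, Encoding.UTF8);
        string asLatin1 = Decode(latin1Bytes, Encoding.GetEncoding("iso-8859-1"));

        // 0xE9 followed by 'e' is not a valid UTF-8 sequence, so the decoder
        // substitutes U+FFFD for it.
        Console.WriteLine(asUtf8.IndexOf('\uFFFD') >= 0);  // True
        Console.WriteLine(asLatin1 == "Desir\u00E9e");     // True
    }
}
```

If that assumption holds, passing `Encoding.GetEncoding("iso-8859-1")` to `File.ReadAllLines` would be the corresponding change in the question's code.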
I have some string that contains the following code/value:
"You won\u0092t find a ...."
It looks like that string contains the Right Apostrophe special character.
ref1: Unicode control 0092
ref2: ASCII chart (both 127 + extra extended ascii)
I'm not sure how to display this to the webbrowser. It keeps displaying the TOFU square-box character instead. I'm under the impression that the unicode (hex) value 00092 can be converted to unicode (html)
Is my understanding correct?
Update 1:
It was suggested by @sam-axe that I HtmlEncode the unicode. That didn't work. Here it is...
Note the ampersand got correctly encoded....
It looks like there's an encoding mix-up. In .NET, strings are normally encoded as UTF-16, and a right apostrophe should be represented as \u2019. But in your example, the right apostrophe is represented as \x92, which suggests the original encoding was Windows code page 1252. If you include your string in a Unicode document, the character \x92 won't be interpreted properly.
You can fix the problem by re-encoding your string as UTF-16. To do so, treat the string as an array of bytes, and then convert the bytes back to Unicode using the 1252 code page:
string title = "You won\u0092t find a cheaper apartment * Sauna & Spa";
byte[] bytes = title.Select(c => (byte)c).ToArray();
title = Encoding.GetEncoding(1252).GetString(bytes);
// Result: "You won’t find a cheaper apartment * Sauna & Spa"
Note: much of my answer is based on guessing and looking at the decompiled code of System.Web 4.0. The reference source looks very similar (identical?).
You're correct that the HTML character reference (6 characters) can be displayed in the browser. Your output string, however, contains "\u0092" (1 character). This is a control character, not an HTML entity.
According to the reference code, WebUtility.HtmlEncode() doesn't transform characters between 128 and 160 - all characters in this range are control characters (ampersand is special-cased in the code as are a few other special HTML symbols).
My guess is that because these are control characters, they're output without transformation because transforming it would change the meaning of the string. (I tried running some examples using LinqPad, this character was not rendered.)
If you really want to transform these characters (or remove them), you'll probably have to write your own function before/after calling HtmlEncode() - there may be something that does this already but I don't know of any.
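A minimal sketch of such a post-processing function is shown below. It calls `WebUtility.HtmlEncode` first and then writes any remaining character in the C1 control range (U+0080 to U+009F) as a decimal numeric character reference such as `&#146;`. The class and method names are hypothetical, and the choice of decimal references is an assumption; browsers generally map references in this range to their Windows-1252 equivalents, but re-decoding the string with the correct code page is the cleaner fix when the source encoding is known.

```csharp
using System;
using System.Net;
using System.Text;

public static class C1Escaper
{
    // HTML-encodes the input, then replaces any leftover C1 control character
    // (U+0080..U+009F) with a decimal numeric character reference.
    public static string HtmlEncodeWithC1(string input)
    {
        string encoded = WebUtility.HtmlEncode(input);
        var sb = new StringBuilder(encoded.Length);

        foreach (char c in encoded)
        {
            if (c >= '\u0080' && c <= '\u009F')
            {
                sb.Append("&#").Append((int)c).Append(';');
            }
            else
            {
                sb.Append(c);
            }
        }

        return sb.ToString();
    }

    public static void Main()
    {
        string title = "You won\u0092t find a cheaper apartment * Sauna & Spa";
        Console.WriteLine(HtmlEncodeWithC1(title));
    }
}
```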
Hope this helps.
Edit: Michael Liu's answer seems correct. I'm leaving my answer here because it may be useful in cases when the input encoding of a string is not known.
Explanation:
I've come across an edge case when writing my web app. I accept UTF-8 files to be uploaded, and I've got a check in place to confirm it is UTF-8 encoded (or at least the best check possible, apparently there is no silver bullet, I'm aware there are many other questions on Stack Overflow for that specific issue).
As a test, I took an ANSI-encoded file and converted it to UTF-8 by both (in separate tests) converting it to UTF-8 in Notepad++, and also by just decoding it as UTF-8 (even though it is ANSI) on the fly in C# using Encoding.UTF8.GetBytes(inputStream).
Where The Problem Arises:
Later on, I place the raw data of the file as one of the elements in an XML file. This is where the problem arises. It appears that a character has persisted from the ANSI file which (I assume) is not valid in UTF-8. When I try to load the XML using the following command...
XDocument xmlSample = XDocument.Load(outputPath);
I get this exception...
{"Invalid character in the given encoding. Line 10, position 14."}
Which looks like this in Visual Studio...
And like this in Notepad++...
Below is the character copy and pasted.
From NPP: ¡ From Visual Studio String Viewer: �
Question:
How can I remove invalid characters from UTF-8 encoded file, or at least discover them in a sane way so I can reject the file?
First, as to your example, the word “Temperature” suggests that the offending character is in fact the “degree” sign (°, Unicode 176), so that the full text reads “Temperature(°C)”. In this case the character would be coded as a \260 byte in ANSI and as the two bytes \302\260 in UTF-8. A lone \260 (preceded by the left parenthesis in this case) is not valid UTF-8.
Second, if you are still interested after more than a year, could you clarify how you use Encoding.UTF8.GetBytes() to “decode a file as UTF-8”? GetBytes() reads characters, not bytes, and characters in C# do not have an encoding; the encoding has been applied when reading the file and converting it into characters. What Encoding.UTF8.GetBytes() does is encode (not decode) the characters into a UTF-8 byte sequence.
In order to check an incoming byte sequence you might use GetChars() on a UTF8Encoding instance to decode your byte sequence into characters. Depending on the constructor you use, you can get a “cleaned-up” character string (with data loss if problems occurred) or receive a DecoderFallbackException on offending byte sequences, so you can reject the input.
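The strict-decoding check can be sketched as follows, assuming the upload is available as a byte array. The class and method names are hypothetical; `UTF8Encoding(false, true)` and `DecoderFallbackException` are the framework pieces. `GetString` is used here instead of `GetChars` purely for brevity; both decode the bytes.

```csharp
using System;
using System.Text;

public static class Utf8Validator
{
    // A UTF-8 decoder that throws on invalid byte sequences instead of
    // silently substituting U+FFFD.
    private static readonly UTF8Encoding StrictUtf8 =
        new UTF8Encoding(encoderShouldEmitUTF8Identifier: false, throwOnInvalidBytes: true);

    // Returns true if the whole buffer decodes as well-formed UTF-8.
    public static bool IsValidUtf8(byte[] data)
    {
        try
        {
            StrictUtf8.GetString(data);
            return true;
        }
        catch (DecoderFallbackException)
        {
            return false;
        }
    }

    public static void Main()
    {
        byte[] ansi = { 0x28, 0xB0, 0x43, 0x29 };       // "(°C)" with ° as the single byte \260
        byte[] utf8 = { 0x28, 0xC2, 0xB0, 0x43, 0x29 }; // "(°C)" with ° as \302\260

        Console.WriteLine(IsValidUtf8(ansi)); // False
        Console.WriteLine(IsValidUtf8(utf8)); // True
    }
}
```

As the question already notes, passing this check only shows the bytes are well-formed UTF-8, not that UTF-8 was the encoding the author intended.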
I've seen questions where the two characters are the same, but nothing that relates to this specific question, so here goes.
I'm running a C# console app that reads an input file made up of variable-length records. Each record consists of variable-length fields. I've got everything working in terms of parsing out each individual field within each record, not a problem. Except that today I came across the Ã± character pair in the input file. Now I know this translates to ñ, so I'm OK with it. However, because the input file sees Ã± as 2 characters, the record length changes in the C# app because the app is interpreting those 2 characters as a single ñ. This is causing my record length to change from 154 characters to 153, and then during the parsing, messing up the individual fields.
I'm OK with the ñ character getting stored in my DB. But my question is this.
Prior to parsing the fields out of the record, how can I go about easily (without checking every single character) detecting that the ñ exists and trigger it to change the parsing logic? Should I simply do an IndexOf on the character and code it that way? I would think that would add a bit of overhead if I had to put that logic on every single field, although it seems like the easiest way. I would think there's a better way to handle it overall but I've not encountered this before. Most of the posts I have found are more for handling the Ã± characters in text as opposed to text being converted (properly) from Ã± to ñ.
Ideas?
the streamreader open I am using is as follows:
System.IO.StreamReader concatenatedFile = new System.IO.StreamReader(@"c:\Testing\test.txt", System.Text.Encoding.UTF8);
The record length changes from 154 characters on the input to 153 interpreted characters.
You must always read a text file in the encoding it was written in. Of course, sometimes you don't know which encoding that was...
Think of the input file as a stream of bytes. Most are one byte per ASCII character, but there are 2 bytes (probably) that can be interpreted differently depending on the encoding:
UTF8 - 1 character, ñ
(some other encoding) - 2 characters, Ã±
Since you say "the input file sees Ã± as 2 characters", this would probably be the encoding intended by whoever produces the file.
So, you should find out which encoding was originally meant, and use that - it's probably some ANSI encoding. You could try System.Text.Encoding.Default, but beware that this changes on different machines, so your code will now depend on the machine's default encoding.
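The length difference can be reproduced with a minimal sketch. The four bytes below stand in for a small piece of a record (made-up data), and ISO 8859-1 is used as an example of "some other encoding"; the class and method names are hypothetical.

```csharp
using System;
using System.Text;

public static class CharCountDemo
{
    // Number of characters the given bytes decode to under an encoding.
    public static int CountChars(byte[] bytes, Encoding encoding)
    {
        return encoding.GetCharCount(bytes);
    }

    public static void Main()
    {
        // 'A', then the two bytes C3 B1, then 'o'.
        byte[] field = { 0x41, 0xC3, 0xB1, 0x6F };

        // As UTF-8, C3 B1 is the single character ñ: "Año".
        Console.WriteLine(CountChars(field, Encoding.UTF8));                      // 3

        // As ISO 8859-1, every byte is its own character: "AÃ±o".
        Console.WriteLine(CountChars(field, Encoding.GetEncoding("iso-8859-1"))); // 4
    }
}
```

If the record layout is defined in bytes rather than characters, a single-byte encoding keeps character offsets and byte offsets identical, which is why the choice of encoding changes the parsed record length.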
You should set the StreamReader you use to read your input file to UTF-8 encoding. I don't believe for a second the original input was meant to be Ã±, so why do you care how many bytes the original input was? You care about character length, right?
Refer to this article to understand what's what in text encoding: http://www.joelonsoftware.com/articles/Unicode.html .
I have a code:
CREATE TABLE IF NOT EXISTS Person
(
name varchar(24) ...
)
CHARACTER SET utf8 COLLATE utf8_polish_ci;
This works OK in my application, but I read that if someone puts a string into the name field that contains a character whose code is greater than 127, the database will use 2 bytes (or more) to store this character. So I think I will change the character set to utf16:
CHARACTER SET utf16 COLLATE utf16_polish_ci;
But now when I run my application, an exception appears: KeyNotFoundException. It appears exactly at these instructions:
MySqlCommand komenda = baza.Połączenie.CreateCommand ();
komenda.CommandText = zapytanie;
MySqlDataReader dr = komenda.ExecuteReader (); // HERE, at execute reader method
if (dr.Read ()) ...
1) Anyone had similar problem? 2) Any idea how to use always 2 bytes/char in database field?
I'm not sure I understand why you're converting from UTF-8 to UTF-16. I'm assuming you're worried that any characters that require two bytes or more to store won't fit in a UTF-8 encoding. This is not the case. In MySQL, utf8 values can be stored with one, two, or three bytes. Unicode points U+0000 to U+007F take 1 byte and points U+0080 to U+07FF take 2 bytes; this range covers the Polish alphabet. Since the majority of characters in the Polish alphabet take 1 byte to store, you should probably stick with UTF-8 and save some memory. However, if you want to always use 2 bytes, at the cost of wasted space, you could switch to UTF-16 (note that MySQL's utf16 uses four bytes for supplementary characters; ucs2 is the fixed two-byte character set).
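The byte widths are easy to check from the application side. The sketch below uses .NET's UTF-8 encoder, which produces the same byte counts for these characters as the ranges quoted above; the class and method names are hypothetical.

```csharp
using System;
using System.Text;

public static class Utf8WidthDemo
{
    // Number of bytes the string occupies when encoded as UTF-8.
    public static int Utf8Bytes(string s)
    {
        return Encoding.UTF8.GetByteCount(s);
    }

    public static void Main()
    {
        Console.WriteLine(Utf8Bytes("a"));      // 1  (U+0061, ASCII)
        Console.WriteLine(Utf8Bytes("\u0105")); // 2  (ą, U+0105)
        Console.WriteLine(Utf8Bytes("\u0142")); // 2  (ł, U+0142)
        Console.WriteLine(Utf8Bytes("\u20AC")); // 3  (€, U+20AC)
    }
}
```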
Here are some helpful links:
Unicode support in MySQL: http://dev.mysql.com/doc/refman/5.6/en/charset-unicode.html
Basic Unicode Overview : http://www.joelonsoftware.com/articles/Unicode.html
As for the exception, and this is purely a guess, it may have something to do with trying to read data that is UTF-8 encoded as if it were UTF-16 encoded. Did you change the character set after you already had UTF-8 encoded data in your table?
Documentation says:
[...] utf8 characters can require up to three bytes per character [...]
Read this link for more information.
My advice would be not to focus on how many bytes the DBMS is using, as one of its purposes is to abstract you from that. Just focus on coding according to the selected data types.
I was trying to convert a file from utf-8 to Arabic-1265 encoding using the Encoding APIs in C#, but I faced a strange problem that some characters are not converted correctly such as "لا" in the following statement "ﻣﺣﻣد ﺻﻼ ح عادل" it appears as "ﻣﺣﻣد ﺻ? ح عادل". Some of my friends told me that this is because these characters are from the Arabic Presentation Forms B. I create the file using notepad++ and save it as utf-8.
here is the code I use
StreamReader sr = new StreamReader(@"C:\utf-8.txt", Encoding.UTF8);
string str = sr.ReadLine();
StreamWriter sw = new StreamWriter(@"C:\windows-1256.txt", false, Encoding.GetEncoding("windows-1256"));
sw.Write(str);
sw.Flush();
sw.Close();
But I don't know how to convert the file correctly using these presentation forms in C#.
Yes, your string contains lots of ligatures that cannot be represented in the 1256 code page. You'll have to decompose the string before writing it. Like this:
str = str.Normalize(NormalizationForm.FormKD);
sw.Write(str);
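What the normalization step does to a ligature can be seen in a minimal sketch. It takes the presentation-form ligature U+FEFB and shows that compatibility decomposition turns it into the two ordinary letters U+0644 and U+0627. The class and method names are hypothetical.

```csharp
using System;
using System.Text;

public static class LigatureDemo
{
    // Compatibility decomposition: replaces presentation-form characters
    // with their ordinary-letter equivalents where Unicode defines one.
    public static string Decompose(string s)
    {
        return s.Normalize(NormalizationForm.FormKD);
    }

    public static void Main()
    {
        string ligature = "\uFEFB"; // ARABIC LIGATURE LAM WITH ALEF ISOLATED FORM
        string decomposed = Decompose(ligature);

        Console.WriteLine(decomposed.Length);             // 2
        Console.WriteLine(decomposed == "\u0644\u0627");  // True (lam followed by alef)
    }
}
```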
To give a more general answer:
The Windows-1256 encoding is an obsolete 8-bit character encoding. It has only 256 characters, of which only 60 are Arabic letters.
Unicode has a much wider range of characters. In particular, it contains:
the “normal” Arabic characters, U+0600 to U+06FF. These are supposed to be used for normal Arabic text, including text written in other languages that use the Arabic script, such as Farsi. For example, “لا” is U+0644 (ل) followed by U+0627 (ا).
the “Presentation Form” characters, U+FB50 to U+FDFF (“Presentation Forms-A”) and U+FE70 to U+FEFF (“Presentation Forms-B”). These are not intended to be used for representing Arabic text. They are primarily intended for compatibility, especially with font-file formats that require separate code points for every different ligated form of every character and ligated character combination. The “لا” ligature is represented by a single codepoint (U+FEFB) despite being two characters.
When encoding into Windows-1256, the .NET encoding for Windows-1256 will automatically convert characters from the Presentation Forms block to “normal text” because it has no other choice (except of course to turn it all into question marks). For obvious reasons, it can only do that with characters that actually have an “equivalent”.
When decoding from Windows-1256, the .NET encoding for Windows-1256 will always generate characters from the “normal text” block.
As we’ve discovered, your input file contains characters that are not representable in Windows-1256. Such characters will turn into question marks (?). Furthermore, those Presentation-Form characters which do have a normal-text equivalent, will change their ligation behaviour, because that is what normal Arabic text does.
First of all, the two characters you quoted are not from the Arabic Presentation Forms block. They are \x0644 and \x0627, which are from the standard Arabic block. However, just to be sure I tried the character \xFEFB, which is the “equivalent” (not equivalent, but you know) character for لا from the Presentation Forms block, and it works fine even for that.
Secondly, I will assume you mean the encoding Windows-1256, which is for legacy 8-bit Arabic text.
So I tried the following:
var input = "لا";
var encoding = Encoding.GetEncoding("windows-1256");
var result = encoding.GetBytes(input);
Console.WriteLine(string.Join(", ", result));
The output I get is 225, 199. So let’s try to turn it back:
var bytes = new byte[] { 225, 199 };
var result2 = encoding.GetString(bytes);
Console.WriteLine(result2);
Fair enough, the Console does not display the result correctly — but the Watch window in the debugger tells me that the answer is correct (it says “لا”). I can also copy the output from the Console and it is correct in the clipboard.
Therefore, the Windows-1256 encoding is working just fine and it is not clear what your problem is.
My recommendation:
Write a short piece of code that exhibits the problem.
Post a new question with that piece of code.
In that question, describe exactly what result you get, and what result you expected instead.