I'm getting confused about C# UTF8 encoding...
Assuming those "facts" are right:
Unicode is the "protocol" which define each character.
UTF-8 define the "implementation" - how to store those characters.
Unicode define character range from 0x0000 to 0x10FFFF (source)
According to C# reference, the accepted ranges for each char is 0x0000 to 0xFFFF. I don't understand what about the other character, which above 0xFFFF, and defined in Unicode protocol?
In contrast to C#, when I using Python for writing UTF8 text - it's covering all the expected range (0x0000 to 0x10FFFF). For example:
u"\U00010000" #WORKING!!!
which isn't working in C#. What's more, when I write the string u"\U00010000" (a single character) from Python to a text file and then read it from C#, this single-character document becomes 2 characters in C#!
# Python (write):
import codecs

text = u"\U00010000"  # one character above the BMP
with codecs.open("file.txt", "w+", encoding="utf-8") as f:
    f.write(text)  # len(text) -> 1
// C# (read):
string text = File.ReadAllText("file.txt", Encoding.UTF8); // How I read this text from the file.
Console.WriteLine(text.Length); // 2
Why? How to fix?
According to the C# reference, the accepted range for a char is 0x0000 to 0xFFFF. I don't understand what happens with the other characters, above 0xFFFF, which are defined in the Unicode protocol?
Unfortunately, a C#/.NET char does not represent a Unicode character.
A char is a 16-bit value in the range 0x0000 to 0xFFFF which represents one “UTF-16 code unit”. Characters in the ranges U+0000–U+D7FF and U+E000–U+FFFF are represented by the code unit of the same number, so everything's fine there.
The less-often-used other characters, in the range U+010000 to U+10FFFF, are squashed into the remaining space 0xD800–0xDFFF by representing each such character as a pair of UTF-16 code units, so the equivalent of the Python string "\U00010000" is the C# string "\uD800\uDC00".
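For example, a minimal console sketch (the class name is mine) confirming that equivalence and showing why the question's single character reports a length of 2:

using System;

class SurrogateDemo
{
    static void Main()
    {
        // Build the string for code point U+10000 from its numeric value.
        string s = char.ConvertFromUtf32(0x10000);

        Console.WriteLine(s == "\uD800\uDC00"); // True: one code point, two UTF-16 code units
        Console.WriteLine(s.Length);            // 2: Length counts code units, not code points
    }
}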
Why?
The reason for this craziness is that the Windows NT series itself uses UTF-16LE as the native string encoding, so for interoperability convenience .NET chose the same. WinNT chose that encoding—at the time thought of as UCS-2 and without any of the pesky surrogate code unit pairs—because in the early days Unicode only had characters up to U+FFFF, and the thinking was that was going to be all anyone was going to need.
How to fix?
There isn't really a good fix. Some other languages that were unfortunate enough to have based their string type on UTF-16 code units (Java, JavaScript) are starting to add methods to their strings to do operations on them counting a code point at a time; but there is no such functionality in .NET at present.
Often you don't actually need to count/find/split/order/etc. strings using proper code point items and indexes. But when you really do, in .NET, you're in for a bad time. You end up having to re-implement each normally-trivial method by manually walking over each char and checking whether it is part of a two-char surrogate pair, or by converting the string to an array of code point ints and back. This isn't a lot of fun, either way.
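As a sketch of what that manual walking looks like (the helper name is mine, not a framework API):

using System;

static class CodePointUtil
{
    // Counts Unicode code points by walking the UTF-16 code units and
    // treating each high/low surrogate pair as a single code point.
    public static int CountCodePoints(string s)
    {
        int count = 0;
        for (int i = 0; i < s.Length; i++)
        {
            count++;
            if (char.IsHighSurrogate(s[i]) && i + 1 < s.Length && char.IsLowSurrogate(s[i + 1]))
                i++; // skip the low surrogate: the pair is one code point
        }
        return count;
    }
}

With this, CodePointUtil.CountCodePoints("\U00010000") returns 1 even though "\U00010000".Length is 2.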
A more elegant and altogether more practical option is to invent a time machine, so we can send the UTF-8 design back to 1988 and prevent UTF-16 from ever having existed.
Unicode has so-called planes (wiki).
As you can see, C#'s char type only supports the first plane, plane 0, the basic multilingual plane.
I know for a fact that C# uses UTF-16 encoding, so I'm a bit surprised to see that it doesn't support code points beyond the first plane in the char datatype. (I haven't run into this issue myself...)
This is an artificial restriction in char's implementation, but one that's understandable. The designers of .NET probably didn't want to tie the abstraction of their own character datatype to the abstraction that Unicode defines, in case that standard did not survive (it has already superseded others). This is just my guess, of course. It just "uses" UTF-16 for the in-memory representation.
UTF-16 uses a trick to squash code points higher than 0xFFFF into pairs of 16-bit code units, as you can read about here. Technically those code points consist of 2 "characters", the so-called surrogate pair. In that sense it breaks the "one code point = one character" abstraction.
You can definitely get around this by working with string and maybe arrays of char. If you have more specific problems, you can find plenty of information on StackOverflow and elsewhere about working with all of Unicode's code points in .NET.
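For instance, a hedged sketch of the framework helpers that move between code points and surrogate pairs when you work with string rather than individual chars:

using System;

class PlaneOneDemo
{
    static void Main()
    {
        // U+1D11E (musical G clef) lies outside the basic multilingual plane.
        string s = char.ConvertFromUtf32(0x1D11E);

        Console.WriteLine(s.Length);                                 // 2 chars (one surrogate pair)
        Console.WriteLine(char.IsSurrogatePair(s, 0));               // True
        Console.WriteLine(char.ConvertToUtf32(s, 0).ToString("X"));  // 1D11E
    }
}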
Related
So:
The C# compiler outputs the (line,column) style location.
The Roslyn API expects sequential text location
How to map the former to the latter?
The C# code could be UTF8 with or without the BOM or even UTF16. It could contain all kinds of characters in the form of comments or embedded strings.
Let us assume we know the encoding and have the respective Encoding object handy. I can convert the file bytes to char[]. The problem is that some chars may contribute zero to the final sequential position. I know that the BOM character does. I have no idea whether others do too.
Now, if we know for sure that the BOM is the only character that contributes 0 to the length, then I can skip it, count the characters, and my question becomes trivial. This is what I do today - I just assume that the BOM is the only "bad" player.
But maybe there is a better way? Maybe the Roslyn API contains some hidden gem that can accept a (line,column) pair and spit out the sequential position? Or maybe some of the Microsoft.Build libraries?
EDIT 1
As per the accepted answer the following gives the location:
var srcText = SourceText.From(File.ReadAllText(err.FilePath));
int location = srcText.Lines[err.Line - 1].Start + err.Column - 1;
You have uncovered the reason that the SourceText type exists in the Roslyn APIs. Its entire purpose is to handle the encoding of strings and perform calculations of lines, columns, and spans.
Due to the way .NET handles Unicode, and depending on which code pages are installed in your OS, there could be cases where SourceText does not do what you need. It has generally proven "good enough" for our purposes, though.
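A minimal sketch of that mapping done through SourceText itself, assuming the compiler's (line,column) pair is one-based as in the question's edit (the method name and file handling are illustrative):

using System.IO;
using Microsoft.CodeAnalysis.Text;

static class LocationMapper
{
    // Maps a one-based (line, column) pair to a zero-based sequential position.
    public static int ToPosition(string filePath, int line, int column)
    {
        SourceText srcText = SourceText.From(File.ReadAllText(filePath));

        // LinePosition is zero-based, hence the -1 adjustments.
        return srcText.Lines.GetPosition(new LinePosition(line - 1, column - 1));
    }
}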
I am facing a problem when searching on filenames with Unicode characters. Those files may have correct or altered names (with equivalent ASCII characters substituted).
I would like to write some code to find files using the same words, altered or not, with a possibly incoherent mix of cultures inside the same string.
To keep it simple, I only need to handle strings in European languages.
Equivalence examples:
Ɛpsilon <=> epsilon
København <=> Kobenhavn
Ångström <=> Angstrom
El Niño <=> El Nino
Tiếng Việt <=> Tieng Viet
Čeština <=> Cestina
encyklopædi <=> encyklopaedi
Expediția <=> Expeditia
øðrum <=> odrum
œuf <=> oeuf
μ (\u03bc) <=> µ (\u00b5)
Straße <=> Strasse
I already found some answers to similar questions, but they are based on simpler strings (where removing accents is enough, using Unicode normalization and dropping diacritics), or are "do it yourself" based.
How to compare Unicode characters that "look alike"?
How to convert a Unicode character to its ASCII equivalent
Replacing characters in C# (ascii)
Unfortunately, Unicode normalization (the automatic way) does not work, at least on the following characters:
Ɛ ø ð => missing equivalence
æ œ ß => missing expansion
Is there a function/library to achieve this in C#, other than manually converting each 'well known' character myself?
I don't think there is a simple way to do this. There is probably no universal normalisation (even when you limit it to the group of European languages).
All solutions involve manual work:
RegEx - It should be possible, but such a solution (a RegEx expression that would do the job) would be incredibly crazy.
There is (or at least was) a plug-in for Total Commander for transliteration. But the plug-in is/was buggy/unstable and you need to write the transliteration table manually.
"Manual transliteration".
I have a similar problem with file names. But in my case the file names contain Japanese characters. This translation/transliteration is a little bit harder.
To simplify your solution you can use the code page conversions in Windows.
It would be nice if the conversion to ASCII (7-bit) did the job, but no. It produces only '?' characters.
This example should handle some of the characters.
using System;
using System.Text;

// On .NET Core / .NET 5+, code pages 1250/1252 need Encoding.RegisterProvider(CodePagesEncodingProvider.Instance).
Encoding encoding;
string data = "Čeština, øðrum";

encoding = Encoding.GetEncoding(1250); // Windows-1250, Central European
data = encoding.GetString(encoding.GetBytes(data)); // "Čeština, o?rum"

encoding = Encoding.GetEncoding(1252); // Windows-1252, Western European
data = encoding.GetString(encoding.GetBytes(data)); // "Ceština, o?rum"

encoding = Encoding.ASCII;
data = encoding.GetString(encoding.GetBytes(data));

Console.WriteLine(data); // "Ce?tina, o?rum"
It is not perfect, but at least you have cleared some of the unwanted characters without the need for a substitution dictionary.
You can try adding other code pages (perhaps a Greek code page would fix the "μ" problem, but it would probably remove all the other characters).
After these initial conversions you can search the transformed text for '?' characters and check whether there was a '?' at the same position in the original/source. When there was not, you can then use a substitution dictionary for the given character.
In my project I use a substitution dictionary (updated manually at runtime by the user for unknown words). When all your transliterations are single characters, you do not need any special handling, but when there are cases like "ßs" --> "ss" (rather than 'ß' + 's' = "ss" + 's' = "sss"), you will need a sorted list of substitutions that is processed before the single-character substitutions, as in the sketch below. The list should be sorted by string length (longer first), not alphabetically.
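A rough sketch of that idea; the table entries and the helper name are only illustrative, not a complete mapping:

using System.Collections.Generic;
using System.Linq;

static class Transliterator
{
    // Substitutions processed longest-first, so a rule like "ßs" -> "ss"
    // wins over expanding the lone 'ß' and keeping the trailing 's'.
    static readonly Dictionary<string, string> Substitutions = new Dictionary<string, string>
    {
        { "ßs", "ss" },
        { "ß",  "ss" },
        { "æ",  "ae" },
        { "œ",  "oe" },
        { "ø",  "o"  },
        { "ð",  "d"  },
        { "Ɛ",  "E"  },
    };

    public static string Apply(string text)
    {
        foreach (var pair in Substitutions.OrderByDescending(p => p.Key.Length))
            text = text.Replace(pair.Key, pair.Value);
        return text;
    }
}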
Remarks:
In your case there is probably no problem of ambiguous transcription (明日 = "ashita" or "asu", or perhaps a different word depending on the surrounding characters), but you should consider whether that really is so.
In my project I found out that there are programs that store files with the wrong encoding. A downloader gets the correct file name in UTF-8, but the sequence of bytes is then interpreted as Encoding.Default (or "Encoding.DOS" [symbolic name], or another code page for zipped files). Therefore it would be good to test the file names for this type of error.
See how to test for an invalid file name encoding:
https://stackoverflow.com/a/19068371/2826535
Only to complete the answer:
Unicode normalisation based "remove accents" method:
https://stackoverflow.com/a/3288164/2826535
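For reference, a short sketch of that normalisation-based method; it covers the purely diacritic cases (e.g. "El Niño" -> "El Nino") but, as noted in the question, not expansions like 'æ' or 'ß':

using System.Globalization;
using System.Text;

static class Diacritics
{
    // Decomposes to FormD, drops combining marks, then recomposes.
    public static string Remove(string text)
    {
        var builder = new StringBuilder();
        foreach (char c in text.Normalize(NormalizationForm.FormD))
        {
            if (CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
                builder.Append(c);
        }
        return builder.ToString().Normalize(NormalizationForm.FormC);
    }
}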
Maybe I don't need 32-bit strings, but I do need to represent 32-bit characters:
http://www.fileformat.info/info/unicode/char/1f4a9/index.htm
Now I grabbed the Symbola font and can see the character when I paste it (into the URL bar or any text area), so I know I have font support for it.
But how do I support it in my C#/.NET app?
-edit- I'll add something. When I pasted the said character into my .NET WinForms app I do NOT see it correctly. When pasting it into Firefox I do see it correctly. How do I see these characters correctly in my WinForms apps?
I am not sure I understand your question:
Strings in .NET are UTF-16 encoded, and there is nothing you can do about this. If you want to get the UTF-32 version of a string, you will have to convert it into a byte array with the UTF32Encoding class.
Characters in .NET are thus 16 bits long, and there is nothing you can do about this either. A UTF-32 encoded character can only be represented by a byte array (with 4 items). You can use the UTF32Encoding class for this purpose.
Every UTF-32 character has an equivalent UTF-16 representation, and vice-versa. So in this context we could only speak of characters, and of their different representations (encodings), UTF-16 being the representation of choice on the .NET platform.
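For example, a minimal sketch of getting the UTF-32 bytes of a string with the UTF32Encoding class (exposed as Encoding.UTF32):

using System;
using System.Text;

class Utf32Demo
{
    static void Main()
    {
        string s = "\U0001F4A9";                    // one code point, two UTF-16 code units
        byte[] utf32 = Encoding.UTF32.GetBytes(s);  // 4 bytes per code point

        Console.WriteLine(s.Length);                      // 2
        Console.WriteLine(utf32.Length);                  // 4
        Console.WriteLine(BitConverter.ToString(utf32));  // A9-F4-01-00 (little-endian)
    }
}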
You didn't say what exactly you mean by “support”. But there is nothing special you need to do to work with characters that don't fit into one 16-bit char, unless you do string manipulation. They will just be represented as surrogate pairs, but you shouldn't need to know about that if you treat the string as a whole.
One exception is that some string manipulation methods won't work correctly. For example "\U0001F4A9".Substring(1) will return the second half of the surrogate pair, which is not a valid string.
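A small sketch of that pitfall, plus one way around it via the code point helpers (char.ConvertToUtf32 / char.ConvertFromUtf32):

using System;

class SubstringPitfall
{
    static void Main()
    {
        string s = "\U0001F4A9";

        string broken = s.Substring(1);                     // a lone low surrogate: not valid text
        Console.WriteLine(char.IsLowSurrogate(broken[0]));  // True

        // Work in code points instead of code units when slicing matters.
        int codePoint = char.ConvertToUtf32(s, 0);
        Console.WriteLine(codePoint.ToString("X"));               // 1F4A9
        Console.WriteLine(char.ConvertFromUtf32(codePoint) == s); // True
    }
}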
I'm working on an application in C# and need to read and write from a particular datafile format. The only issue at the moment is that the format uses strictly single byte characters, and C# keeps trying to throw in Unicode when I use a writer and a char array (which doubles filesize, among other serious issues). I've been working on modifying the code to use byte arrays instead, but that causes a few complaints when feeding them into a tree view and datagrid controls, and it involves conversions and whatnot.
I've spent a little time googling, and there doesn't seem to be a simple typedef I can use to force the char type to use byte for my program, at least not without causing extra complications.
Is there a simple way to force a C# .NET program to use ASCII-only and not touch Unicode?
Later, I got this almost working. Using ASCIIEncoding on the BinaryReader/Writers ended up fixing most of the problems (a few issues with an extra character being prepended to strings occurred, but I fixed those up). I'm having one last issue, which is very small but could be big: in the file, a particular character (printed as the Euro sign) gets converted to a '?' when I load/save the files. That's not much of an issue in text, but if it occurred in a record length, it could change the size by kilobytes (not good, obviously). I think it's caused by the encoding, but if it came from the file, why won't it go back?
The precise problem/results are such:
Original file: 0x80 (euro)
Encodings:
ASCII: 0x3F (?)
UTF-8: 0xC2 0x80 ("Â" followed by the euro byte)
Neither of those results will work, since the character can occur anywhere in the file (if an 0x80 changed to 0x3F in a record-length int, it could be a difference of 65*(256^3)). Not good. I tried using a UTF-8 encoding, figuring that would fix the issue pretty well, but it's now adding that second byte, which is even worse.
C# (.NET) will always use Unicode for strings. This is by design.
When you read or write to your file, you can, however, use a StreamReader/StreamWriter set to force ASCII Encoding, like so:
StreamReader reader = new StreamReader(fileStream, new ASCIIEncoding());
Then just read using StreamReader.
Writing is the same, just use StreamWriter.
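A minimal round-trip sketch (the file name is illustrative); note that with ASCIIEncoding any character above 0x7F is written out as '?', which is exactly the Euro-sign issue described in the question's edit:

using System;
using System.IO;
using System.Text;

class AsciiIoDemo
{
    static void Main()
    {
        using (var writer = new StreamWriter("data.txt", false, Encoding.ASCII))
        {
            writer.Write("price: €5"); // the '€' cannot be represented in ASCII
        }

        using (var reader = new StreamReader("data.txt", Encoding.ASCII))
        {
            Console.WriteLine(reader.ReadToEnd()); // "price: ?5"
        }
    }
}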
Internally, strings in .NET are always Unicode, but that really shouldn't be of much interest to you. If you have a particular format that you need to adhere to, then the route you went down (reading it as bytes) was correct. You simply need to use the System.Text.Encoding.ASCII class to do your conversions from string -> byte[] and byte[] -> string.
If you have a file format that mixes text in single-byte characters with binary values such as lengths and control characters, a good encoding to use is code page 28591, aka Latin1, aka ISO-8859-1.
You can get this encoding by using whichever of the following is the most readable:
Encoding.GetEncoding(28591)
Encoding.GetEncoding("Latin1")
Encoding.GetEncoding("ISO-8859-1")
This encoding has the useful characteristic that every byte value up to 255 is converted unchanged to the Unicode character with the same value (e.g. the byte 0x80 becomes the character 0x0080).
In your scenario, this may be more useful than the ASCII encoding (which converts values in the range 0x80 to 0xFF to '?') or any of the other usual encodings, which will also convert some of the characters in this range.
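A quick sketch of that round-trip behaviour, including the 0x80 byte from the question:

using System;
using System.Text;

class Latin1Demo
{
    static void Main()
    {
        Encoding latin1 = Encoding.GetEncoding(28591);

        byte[] original = { 0x41, 0x80, 0xFF };    // arbitrary bytes, including 0x80
        string text = latin1.GetString(original);  // 'A', U+0080, U+00FF
        byte[] roundTripped = latin1.GetBytes(text);

        Console.WriteLine(BitConverter.ToString(roundTripped)); // 41-80-FF: nothing lost
    }
}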
If you want this in .NET, you could use F# to make a library supporting this. F# supports ASCII strings, with a byte array as the underlying type, see Literals (F#) (MSDN):
let asciiString = "This is a string"B
In our API, we use byte[] to send over data across the network. Everything worked fine, until the day our "foreign" clients decided to pass/receive Unicode characters.
As far as I know, Unicode characters occupy 2 bytes, however, we only allocate 1 byte in the byte array for them.
Here is how we read the character from the byte[] array:
// buffer is a byte[6553] and m_index is the current location in the buffer
char c = System.BitConverter.ToChar(buffer, m_index);
m_index += SIZEOF_BYTE;
return c;
So the current issue is that the API is receiving a strange Unicode character: when I look at the Unicode hexadecimal, I find that the least significant byte is correct but the most significant byte has a value when it's supposed to be 0. A quick workaround, thus far, has been to apply 0x00FF & c to filter out the most significant byte.
Please suggest the correct approach to deal with Unicode characters coming from the socket.
Thanks.
Solution:
Kudos to Jon:
char c = (char) buffer[m_index];
And as he mentioned, the reason it works is that the client API receives a character occupying only one byte, while BitConverter.ToChar uses two, hence the issue in converting it. I am still startled as to why it worked for some sets of characters and not others, as it should have failed in all cases.
Thanks Guys, great responses!
You should use Encoding.GetString, using the most appropriate encoding.
I don't quite understand your situation fully, but the Encoding class is almost certain to be the way to handle it.
Who is in control of the data here? Your code, or that of your customers? Have you defined what the correct format is?
EDIT: Okay, I've had another look at your code: BitConverter.ToChar returns "A character formed by two bytes beginning at startIndex." If you only want to use one byte, just cast it:
char c = (char) buffer[m_index];
I'm surprised your code has been working at all, as it would be breaking any time the next byte was non-zero.
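A small sketch of why the two-byte read only appears to work while the byte after the character happens to be zero (the buffer contents are illustrative):

using System;

class BitConverterPitfall
{
    static void Main()
    {
        byte[] buffer = { 0x41, 0x00, 0x41, 0x42 };

        Console.WriteLine(BitConverter.ToChar(buffer, 0)); // 'A': the next byte is 0x00, so it looks fine
        Console.WriteLine(BitConverter.ToChar(buffer, 2)); // U+4241: the following byte leaks into the char
        Console.WriteLine((char)buffer[2]);                // 'A': the single-byte cast suggested above
    }
}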
You should look at the System.Text.Encoding.ASCII.GetString method, which takes a byte[] array and converts it to a string (for ASCII).
And Encoding.UTF8 or Encoding.Unicode for Unicode strings in the UTF-8 or UTF-16 encodings.
There are also methods for converting strings to byte[] in the ASCIIEncoding, UTF8Encoding and UnicodeEncoding classes: see the GetBytes(String) overloads.
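For example, a minimal sketch of the string <-> byte[] round trip those classes provide:

using System;
using System.Text;

class EncodingRoundTrip
{
    static void Main()
    {
        string original = "héllo";

        byte[] utf8Bytes = Encoding.UTF8.GetBytes(original);      // 6 bytes: 'é' takes two
        byte[] utf16Bytes = Encoding.Unicode.GetBytes(original);  // 10 bytes: 2 per char

        Console.WriteLine(Encoding.UTF8.GetString(utf8Bytes));     // héllo
        Console.WriteLine(Encoding.Unicode.GetString(utf16Bytes)); // héllo
    }
}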
Unicode characters can take up to four bytes, but messages are rarely encoded on the wire using four bytes per character. Rather, schemes like UTF-8 or UTF-16 are used that only bring in extra bytes when required.
Have a look at the Encoding class guidance.
Text streams should contain a byte-order mark that will allow you to determine how to treat the binary data.
It's unclear what exactly your goal is here. From what I can tell, there are two routes you can take:
Ignore all data sent in Unicode
Process both unicode and ASCII strings
IMHO, #1 is the way to go. But it sounds like your protocol is not necessarily set up to deal with Unicode strings. You will have to do some detection logic to determine if the incoming string is a Unicode version. If it is, you can use the Encoding.Unicode.GetString method to convert that particular byte array.
What encoding are your customers using? If some of your customers are still using ASCII, then you'll need your international customers to use something which maps the ASCII set (1-127) to itself, such as UTF8. After that, use the UTF8 encoding's GetString method.
My only solution is to fix the API. Either tell the users to use only ASCII strings in the byte[], or fix it to support ASCII and any other encoding you need to use.
Deciding which encoding is supplied by the foreign clients from just the byte[] can be a bit tricky.