Understanding Text Encoding (In .Net)

I have done very little with encoding of Text. Truthfully, I don't really even know what it means exactly.
For example, if I have something like:
Dim myStr as String = "Hello"
Is that 'encoded' in memory in a particular format? Does that format depend on what language I'm using?
If I were in another country, like China, for example, and I had a string of Chinese (mandarin? My apologies if I'm using the wrong words here) would the following code (that I've used fine on English strings) still work the same?
System.Text.UTF8Encoding encoding=new System.Text.UTF8Encoding();
return encoding.GetBytes(str);
Or would it lose all meaning when you convert that .NET string to UTF-8 bytes, if that conversion isn't valid?
Finally, I've worked with .Net for a few years now and I've never seen, heard, or had to do anything with Encoding. Am I the exception, or is it not a common thing to do?

The .NET string class stores strings as UTF-16. That means 2 bytes per character for most characters (although it allows special combinations of two 16-bit values, so-called "surrogate pairs", to form a single 4-byte character).
UTF-8, on the other hand, uses a variable number of bytes to represent a particular Unicode character, i.e. only one byte for regular ASCII characters, but typically 3 bytes for a Chinese character. Both encodings can represent all Unicode characters, so there is always a mapping between them: both are different binary representations (i.e. for storing in memory or on disk) of the same (Unicode) character set.
Since not all Unicode characters fit into the 2 bytes originally reserved by UTF-16, the format also allows a combination of two 16-bit values to form a 4-byte character. Such a combination is called a surrogate pair: a pair of 16-bit Unicode encoding values that, together, represent a single character.
UTF-8 does not have this problem, since the number of bytes per Unicode character is not fixed. A good general overview over UTF-8, UTF-16 and BOMs can be gathered here.
An excellent overview / introduction to Unicode character encoding is The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets
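To make the question's GetBytes call concrete, here is a minimal sketch (written with C# 9+ top-level statements; the variable names are only for illustration) showing that UTF-8 encoding works the same for English and Chinese text and that nothing is lost on the way back:

```csharp
using System;
using System.Text;

// .NET strings are UTF-16 in memory; GetBytes re-encodes them as UTF-8.
string english = "Hello";
string chinese = "\u4F60\u597D"; // two Chinese characters, written as escapes

byte[] englishBytes = Encoding.UTF8.GetBytes(english);
byte[] chineseBytes = Encoding.UTF8.GetBytes(chinese);

Console.WriteLine(englishBytes.Length); // 5: one byte per ASCII character
Console.WriteLine(chineseBytes.Length); // 6: three bytes per character here

// Decoding the bytes gives back the original string.
string roundTripped = Encoding.UTF8.GetString(chineseBytes);
Console.WriteLine(roundTripped == chinese); // True
```

Only the byte count changes between the two strings; the mapping itself is always valid because both UTF-16 and UTF-8 cover the whole Unicode character set.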

First and foremost: do not despair, you are not alone. Awareness of character encoding and text representation in general is unfortunately uncommon, but there is no better time to start learning than right now!
In modern systems, including .NET, text strings are represented in memory by some encoding of Unicode code points. These are just numbers. The code point for the character A is 65. The code point for the copyright sign (©) is 169. The code point for the Thai digit six is 3670.
The term "encoding" refers to how these numbers are represented in memory. There are a number of standard encodings that are used so that textual representation can remain consistent as data is transmitted from one system to another.
A simple encoding standard is UCS-2, whereby the code point is stored as-is in a 16-bit word. This is limited because it can only represent code points 0000-FFFF, and that range does not cover the full breadth of Unicode code points.
UTF-16 is the encoding used internally by the .NET String class. Most characters fit into a single 16-bit word here, but values larger than FFFF are encoded using surrogate pairs (see the Wiki). Because of this encoding scheme, code points D800-DFFF cannot be encoded by UTF-16.
UTF-8 is perhaps the most popular encoding used today, for a number of reasons which are outlined in the Wiki article.
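The code point numbers mentioned above can be checked directly. This is a small sketch (C# 9+ top-level statements; 0x1F600 is just an example of a code point above FFFF that I picked):

```csharp
using System;

// A char converts implicitly to its numeric value.
int codePointA = 'A';              // 65
int codePointCopyright = '\u00A9'; // 169, the copyright sign
int codePointThaiSix = '\u0E56';   // 3670, the Thai digit six

Console.WriteLine($"{codePointA} {codePointCopyright} {codePointThaiSix}");

// A code point above FFFF needs two chars (a surrogate pair) in a .NET string.
string emoji = char.ConvertFromUtf32(0x1F600);
Console.WriteLine(emoji.Length);                             // 2
Console.WriteLine(char.IsHighSurrogate(emoji[0]));           // True
Console.WriteLine(char.ConvertToUtf32(emoji, 0) == 0x1F600); // True
```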

UTF (Unicode Transformation Format) is a family of encodings with different code unit sizes. An encoding defines how much memory each character takes and how it is represented in that memory.
Generally we work with Unicode and ASCII.
In .NET, the "Unicode" encoding (Encoding.Unicode) is UTF-16: 2 bytes per character for most characters, 4 bytes for those that need a surrogate pair.
ASCII is 1 byte per character (only 7 bits are used).
Every ASCII character can be represented in Unicode; however, Unicode text cannot be represented in ASCII without being encoded in some way.
URLs are one example. URL encoding (percent-encoding, which is separate from UTF itself) uses the special character '%' to tell you that the following two hex digits are the value of an encoded byte.
%20 for instance is byte 32, which is a space.
http://www.google.com?q=space%20character
Placing that URL in a browser would percent-decode the string and interpret the resulting bytes as UTF-8, so q= would be read as "space character"; notice the %20 is now a space.
UTF-16 uses 2 bytes for that same character, so in the same notation it would look like this:
http://www.google.com?q=space%0020character
This example would actually fail, as URIs are supposed to use UTF-8 (a browser would read %00 followed by the literal text "20"), but it demonstrates the point.
The UTF-16 code unit would be 0020, or two bytes with values 0 and 32 respectively.
Mandarin text consists of Unicode characters outside the ASCII range, so it has to be encoded (as UTF-8 and then percent-encoded, in the URL case) to be representable in ASCII.
Here is a wiki article explaining a little more in depth
http://en.wikipedia.org/wiki/UTF-8
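For the URL case, .NET has built-in helpers, so there is no need to produce the %XX sequences by hand. A minimal sketch using Uri.EscapeDataString and Uri.UnescapeDataString (C# 9+ top-level statements; the sample strings are mine):

```csharp
using System;

// Percent-encoding: each byte that is not allowed in a URL becomes %XX.
string escaped = Uri.EscapeDataString("space character");
Console.WriteLine(escaped); // space%20character

// Non-ASCII text is first encoded as UTF-8, then each byte is percent-encoded.
string escapedChinese = Uri.EscapeDataString("\u4E2D");
Console.WriteLine(escapedChinese); // %E4%B8%AD

// Decoding reverses both steps.
string unescaped = Uri.UnescapeDataString(escaped);
Console.WriteLine(unescaped); // space character
```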

Related

How can C# 16 bit chars encode all Unicode characters?

I'm reading that C# stores unicode characters in char (aka System.Char) variables, which have fixed length of 16bits. However, 16 bits are not enough to store all Unicode characters! How, in this case, do C#'s char variables support Unicode?

In short: Surrogates
This is a good question. Unicode is more complicated than most people think, because it introduces multiple new concepts (character set, code point, encoding, code unit), but I will try to give an almost complete answer.
Intro:
Unicode is a character set. A character set is just a list of character and code point pairs. A code point is just a number that identifies the paired character. UTF-8, UTF-16 and UTF-32 are encodings. Encodings define how the numbers (code points) are represented in binary form (as code units). Code units can be made of one or more bytes. (Actually the original ASCII code units are just 7 bits long, but that's another story 😝)
Remember: character sets are made of code points and encodings are made of code units.
The C# char type represents a UTF-16 code unit. UTF-16 is a variable-length / multibyte encoding for the Unicode character set, meaning characters are represented by one or two 16-bit code units. Unicode code points beyond 16 bits are represented by two UTF-16 code units, which equals four bytes.
Now to answering your question: How?
The original idea of Unicode was 1 character = 1 code point. But the original encoding, UCS-2 (now obsolete), uses two bytes (16 bits) and could only encode 65,536 code points. After a short time this was not enough for the growing Unicode character set. (In hindsight, two bytes were obviously never going to be enough.) To fix this problem, Unicode had to step back from the original idea and introduce surrogates.
Therefore UTF-16 was born, a variable-length/multibyte encoding (with 16-bit code units) which implements surrogates. Surrogates are special 16-bit code units whose values correspond to code points (D800-DFFF) that Unicode explicitly reserves and never assigns to characters. Finding a high surrogate while parsing your text simply means that you also have to read the next 16 bits and interpret both 16-bit units (the high surrogate and the low surrogate that follows it) as one combined Unicode code point above FFFF.
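The way two surrogates combine into one code point is plain arithmetic. Here is a sketch of the standard UTF-16 calculation, checked against char.ConvertFromUtf32 (C# 9+ top-level statements; the code point 0x1F600 is an arbitrary example of mine):

```csharp
using System;

int codePoint = 0x1F600;          // example code point above FFFF
int offset = codePoint - 0x10000; // leaves a 20-bit value

// Split the 20 bits into two halves of 10 bits each.
char high = (char)(0xD800 + (offset >> 10));   // high surrogate: top 10 bits
char low = (char)(0xDC00 + (offset & 0x3FF));  // low surrogate: bottom 10 bits
Console.WriteLine($"{(int)high:X4} {(int)low:X4}"); // D83D DE00

// The framework produces the same pair.
string builtIn = char.ConvertFromUtf32(codePoint);
bool sameAsBuiltIn = builtIn[0] == high && builtIn[1] == low;
Console.WriteLine(sameAsBuiltIn); // True

// Decoding reverses the calculation.
int decoded = 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);
Console.WriteLine(decoded == codePoint); // True
```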
UTF-32 is a fixed four-byte encoding, which is big enough to avoid space problems and maps 1 code point onto 1 code unit. UTF-32 still has to take surrogates into account, though: all UTF encodings are based on the Unicode standard and surrogates are part of the definition of the Unicode character set, so the surrogate code points D800-DFFF are not valid values in UTF-32.
UTF-8 is also a variable-length/multibyte encoding, but with another interesting encoding technique. In short: the number of leading one bits in the first byte gives the total number of bytes (up to four) that have to be combined into one Unicode code point; a first byte with a leading zero bit is a single-byte (ASCII) character.
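In UTF-8 the leading bits of the first byte tell a decoder how long the sequence is. The helper below is a hypothetical illustration (Utf8SequenceLength is my own name, not a framework method), compared against what Encoding.UTF8 actually produces:

```csharp
using System;
using System.Text;

// Hypothetical helper: total sequence length implied by the first byte.
static int Utf8SequenceLength(byte lead)
{
    if ((lead & 0x80) == 0x00) return 1; // 0xxxxxxx: ASCII
    if ((lead & 0xE0) == 0xC0) return 2; // 110xxxxx
    if ((lead & 0xF0) == 0xE0) return 3; // 1110xxxx
    if ((lead & 0xF8) == 0xF0) return 4; // 11110xxx
    return 0; // continuation byte (10xxxxxx) or invalid lead byte
}

string[] samples =
{
    "A",                             // U+0041
    "\u00E9",                        // U+00E9, e with acute accent
    "\u4E2D",                        // U+4E2D, a Chinese character
    char.ConvertFromUtf32(0x1F600),  // a code point above FFFF
};

bool allMatch = true;
foreach (string sample in samples)
{
    byte[] bytes = Encoding.UTF8.GetBytes(sample);
    int predicted = Utf8SequenceLength(bytes[0]);
    Console.WriteLine($"first byte {bytes[0]:X2}: predicted {predicted}, actual {bytes.Length}");
    allMatch &= predicted == bytes.Length;
}
```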

c# character encoding that treats accented characters like simple ones

Is there an encoding where an accented character like á or ä is treated as a single character?
and if not then what would be the most commonly used encoding today? I'm using UTF7 currently, how compatible is that with other types of encoding?
Thanks

You might think about what you're asking for. You're asking for an encoding that will recognize 'á' and turn it into 'a'. That's a converter, not an encoding. It would have to know what encoding the source is in so that it can convert to whatever encoding you're using.
Wait, maybe that's not what you're asking. There are encodings that treat those as single bytes. For example, the ISO-8859-1 encoding (also called Latin-1) treats many accented characters as a single byte.
(The following paragraph was struck out in the original answer because it describes ASCII, not UTF-7 ... long day.)
UTF-7 isn't particularly compatible with many other encodings. It has 128 possible values: just enough space for the 52 letters (upper and lower case, combined) used in the Latin alphabet, the 10 numerals, the control characters, and various punctuation marks. But it's not sufficient for Spanish, for example, which has upside-down question marks and exclamation points as well as other things.
UTF-7 is "compatible" with other encodings in that it can represent the entire Unicode character set. But only some characters (known as the "direct characters") and a few control characters can be directly encoded as single ASCII bytes. Those characters will be the same as in UTF-8 and in many single-byte character sets. All other characters are represented by sequences, and will be different from any other encoding.
The most commonly used encoding today? On the Web, UTF-8 is used a lot. It's also the default encoding used when you create a StreamWriter. For the work I do (mostly English, and Western European character sets), it works better than anything else.
Now, it's possible that what you're looking for is something that will treat 'á' and 'a' as the same in comparisons. That's a different question. See Performing Culture-Insensitive String Comparisons for information on that.
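If accent-insensitive comparison is what is wanted, one option is CompareOptions.IgnoreNonSpace. This is a sketch rather than a recommendation (C# 9+ top-level statements), and it assumes the runtime has its normal culture data available (not invariant-globalization mode):

```csharp
using System;
using System.Globalization;

string accented = "\u00E1"; // a with acute accent
string plain = "a";

// Ordinal comparison looks at the raw UTF-16 values, so these differ.
bool sameOrdinal = string.Equals(accented, plain, StringComparison.Ordinal);
Console.WriteLine(sameOrdinal); // False

// IgnoreNonSpace asks the comparer to ignore diacritics (non-spacing marks).
int result = string.Compare(accented, plain,
    CultureInfo.InvariantCulture, CompareOptions.IgnoreNonSpace);
Console.WriteLine(result == 0);
```

Note that this changes how strings are compared, not how they are stored; the text itself keeps its accents in whatever encoding it is written out with.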

This doesn't seem to have anything to do with encodings. In C# it doesn't matter what encoding you use for storage and transmission: strings are always UTF-16 internally, and ä is always 1 char long in composed form.
If "ä".Length is giving you 2, your string is in decomposed form and all you need to do is
string str = "a\u0308"; // a + U+0308 (combining diaeresis), .Length == 2
str = str.Normalize(NormalizationForm.FormC); // just ä (U+00E4) now, with .Length == 1

Sorry for the confusion over this issue. I finally found what I was looking for: I needed my text to use the Windows-1250 (Central European (Windows)) code page, because that is what a lot of other programs use, and it correctly supports characters like €đłŁ¤ etc.
Thanks for all the help i got, it was a useful learning experience.
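For anyone landing here with the same need, a minimal sketch of getting the Windows-1250 encoding follows (C# 9+ top-level statements). It assumes a runtime where CodePagesEncodingProvider is available (modern .NET, or .NET Framework 4.6+ with the System.Text.Encoding.CodePages package); on the classic .NET Framework, Encoding.GetEncoding(1250) works without the registration line.

```csharp
using System;
using System.Text;

// On .NET Core / .NET 5+ legacy code pages must be registered before use.
Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
Encoding windows1250 = Encoding.GetEncoding(1250);

// The characters from the post, written as escapes: euro sign, d with stroke,
// l with stroke (lower and upper case) and the generic currency sign.
string text = "\u20AC\u0111\u0142\u0141\u00A4";

byte[] bytes = windows1250.GetBytes(text);
Console.WriteLine(bytes.Length); // 5: one byte per character in this code page

string decoded = windows1250.GetString(bytes);
Console.WriteLine(decoded == text); // True
```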

why do char takes 2 bytes as it can be stored in one byte

Can anybody tell me why char takes two bytes in C# although it can be stored in one byte? Isn't that a waste of memory? If not, how is the extra byte used?
In simple words, please make clear to me what the extra 8 bits are for.

although it can be stored in one byte
What makes you think that?
It only takes one byte to represent every character in the English language, but other languages use other characters. Consider the number of different alphabets (Latin, Chinese, Arabic, Cyrillic...), and the number of symbols in each of these alphabets (not only letters or digits, but also punctuation marks and other special symbols)... there are tens of thousands of different symbols in use in the world ! So one byte is never going to be enough to represent them all, that's why the Unicode standard was created.
Unicode has several representations (UTF-8, UTF-16, UTF-32...). .NET strings use UTF-16, which takes two bytes per character (per code unit, actually). Of course, two bytes is still not enough to represent all the different symbols in the world; surrogate pairs are used to represent characters above U+FFFF.
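The sizes involved are easy to verify. A small sketch (C# 9+ top-level statements; the sample strings are mine) comparing the in-memory UTF-16 size with the UTF-8 size of the same text:

```csharp
using System;
using System.Text;

Console.WriteLine(sizeof(char)); // 2: a char is one 16-bit UTF-16 code unit

string ascii = "Hello";
int utf16Bytes = Encoding.Unicode.GetByteCount(ascii); // Encoding.Unicode is UTF-16
int utf8Bytes = Encoding.UTF8.GetByteCount(ascii);
Console.WriteLine($"{utf16Bytes} {utf8Bytes}"); // 10 5

// A character above U+FFFF takes two chars, i.e. four bytes, in UTF-16.
string beyondBmp = char.ConvertFromUtf32(0x1F600);
int beyondBmpChars = beyondBmp.Length;
int beyondBmpBytes = Encoding.Unicode.GetByteCount(beyondBmp);
Console.WriteLine($"{beyondBmpChars} {beyondBmpBytes}"); // 2 4
```

So the "extra" byte is only wasted for text that happens to be pure ASCII; for everything else it is what lets a single char hold the character at all.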

The char keyword is used to declare a Unicode character in the range indicated in the following table. Unicode characters are 16-bit characters used to represent most of the known written languages throughout the world.
http://msdn.microsoft.com/en-us/library/x9h8tsay%28v=vs.80%29.aspx

Unicode characters. True, we have enough room in 8bits for the English alphabet, but when it comes to Chinese and such, it takes a lot more characters.

In C#, chars are always 16-bit Unicode characters (UTF-16 code units). Unicode supports a much larger character set than can be supported by ASCII.
If memory really is a concern, here is a good discussion on SO regarding how you might work with 8-bit chars: Is there a string type with 8 BIT chars?
References:
On C#'s char datatype: http://msdn.microsoft.com/en-us/library/x9h8tsay(v=vs.80).aspx
On Unicode: http://en.wikipedia.org/wiki/Unicode

Because UTF-8 was probably still too young for Microsoft to consider using it.

To which character encoding (Unicode version) set does a char object correspond?

What Unicode character encoding does a char object correspond to in:
C#
Java
JavaScript (I know there is not actually a char type but I am assuming that the String type is still implemented as an array of Unicode characters)
In general, is there a common convention among programming languages to use a specific character encoding?
Update
I have tried to clarify my question. The changes I made are discussed in the comments below.
Re: "What problem are you trying to solve?", I am interested in code generation from language independent expressions, and the particular encoding of the file is relevant.

In C# and Java it's UTF-16.

I'm not sure that I am answering your question, but let me make a few remarks that hopefully shed some light.
At the core, general-purpose programming languages like the ones we are talking about (C, C++, C#, Java, PHP) do not have a notion of "text", merely of "data". Data consists of sequences of integral values (i.e. numbers). There is no inherent meaning behind those numbers.
The process of turning a stream of numbers into a text is one of semantics, and it is usually left to the consumer to assign the relevant semantics to a data stream.
Warning: I will now use the word "encoding", which unfortunately has multiple inequivalent meanings. The first meaning of "encoding" is the assignment of meaning to a number. The semantic interpretation of a number is also called a "character". For example, in the ASCII encoding, 32 means "space" and 65 means "capital A". ASCII only assigns meanings to 128 numbers, so every ASCII character can be conveniently represented by a single 8-bit byte (with the top bit always 0). There are many encodings which assign characters to 256 numbers, thus all using one byte per character. In these fixed-width encodings, a text string has as many characters as it takes bytes to represent. There are also other encodings in which characters take a variable number of bytes to represent.
Now, Unicode is also an encoding, i.e. an assignment of meaning to numbers. On the first 128 numbers it is the same as ASCII, but it assigns meanings to (theoretically) 2^21 numbers. Because there are lots of meanings which aren't strictly "characters" in the sense of writing (such as zero-width joiners or diacritic modifiers), the term "codepoint" is preferred over "character". Nonetheless, any integral data type that is at least 21 bits wide can represent one codepoint. Typically one picks a 32-bit type, and this encoding, in which every element stands for one codepoint, is called UTF-32 or UCS-4.
Now we have a second meaning of "encoding": I can take a string of Unicode codepoints and transform it into a string of 8-bit or 16-bit values, thus further "encoding" the information. In this new, transformed form (called "unicode transformation format", or "UTF"), we now have strings of 8-bit or 16-bit values (called "code units"), but each individual value does not in general correspond to anything meaningful -- it first has to be decoded into a sequence of Unicode codepoints.
Thus, from a programming perspective, if you want to modify text (not bytes), then you should store your text as a sequence of Unicode codepoints. Practically that means that you need a 32-bit data type. The char data type in C and C++ is usually 8 bits wide (though that's only a minimum), while on C# and Java it is always 16 bits wide. An 8-bit char could conceivably be used to store a transformed UTF-8 string, and a 16-bit char could store a transformed UTF-16 string, but in order to get at the raw, meaningful Unicode codepoints (and in particular at the length of the string in codepoints) you will always have to perform decoding.
Typically your text processing libraries will be able to do the decoding and encoding for you, so they will happily accept UTF8 and UTF16 strings (but at a price), but if you want to spare yourself this extra indirection, store your strings as raw Unicode codepoints in a sufficiently wide type.
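In C# the decoding step from UTF-16 code units to code points can be done with char.ConvertToUtf32. The following is a sketch only (ToCodePoints is a hypothetical helper name of mine, C# 9+ top-level statements), and it assumes well-formed input, since ConvertToUtf32 throws on an unpaired surrogate:

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: decode a UTF-16 string into a list of code points.
static List<int> ToCodePoints(string s)
{
    var result = new List<int>();
    for (int i = 0; i < s.Length; i++)
    {
        // Reads one char, or two if a surrogate pair starts at index i.
        result.Add(char.ConvertToUtf32(s, i));
        if (char.IsHighSurrogate(s[i])) i++; // skip the low surrogate just consumed
    }
    return result;
}

string text = "A" + char.ConvertFromUtf32(0x1F600) + "\u4E2D";
List<int> codePoints = ToCodePoints(text);

Console.WriteLine(text.Length);      // 4 UTF-16 code units
Console.WriteLine(codePoints.Count); // 3 code points
Console.WriteLine(string.Join(" ", codePoints.ConvertAll(cp => cp.ToString("X"))));
// 41 1F600 4E2D
```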

In Java:
The char data type is a single 16-bit Unicode character.
Taken from http://download.oracle.com/javase/tutorial/java/nutsandbolts/datatypes.html
In C#:
A single Unicode character
Taken from http://msdn.microsoft.com/en-us/library/ms228360(v=vs.80).aspx

What does the .NET String.Length property return? Surrogate neutral length or complete character length

The documentation and language varies between VS 2008 and 2010:
VS 2008 Documentation
Internally, the text is stored as a readonly collection of Char objects, each of which represents one Unicode character encoded in UTF-16. ... The length of a string represents the number of characters regardless of whether the characters are formed from Unicode surrogate pairs or not. To access the individual Unicode code points in a string, use the StringInfo object. - http://msdn.microsoft.com/en-us/library/ms228362%28v=vs.90%29.aspx
VS 2010 Documentation
Internally, the text is stored as a sequential read-only collection of Char objects. ... The Length property of a string represents the number of Char objects it contains, not the number of Unicode characters. To access the individual Unicode code points in a string, use the StringInfo object. - http://msdn.microsoft.com/en-us/library/ms228362%28v=VS.100%29.aspx
The language used in both cases doesn't clearly differentiate between "character", "Unicode character", "Char class", "Unicode surrogate pair", and "Unicode code point".
The language in the VS2008 documentation stating that a "string represents the number of characters regardless of whether the characters are formed from Unicode surrogate pairs or not" seems to be defining "character" as an object that may be the result of a Unicode surrogate pair, which suggests that it may represent a 4-byte sequence rather than a 2-byte sequence. It also specifically states at the beginning that a "char" object is encoded in UTF-16, which suggests that it could represent a surrogate pair (being 4 bytes instead of 2). I'm fairly certain that is wrong though.
The VS2010 documentation is a little more precise. It draws a distinction between "char" and "Unicode character", but not between "Unicode character" and "Unicode code point". If a code point refers to half a surrogate pair, and a "Unicode character" represents a full pair, then the "Char" class is named incorrectly, and does not refer to a "Unicode character" at all (which they state it does not), and it's really a Unicode code point.
So are both of the following statements true? (Yes, I think.)
String.Length represents the Unicode code-point length, and
String.Length represents neither the Unicode character length nor what we would consider to be a true character length (number of characters that would be displayed), but rather the number of "Char" objects, which each represent a Unicode code point (not a Unicode character).

String.Length does not account for surrogate pairs; however, the StringInfo.LengthInTextElements property does.
StringInfo.SubstringByTextElements is similar to String.Substring, but it operates on "Text Elements", such as surrogate pairs and combining characters, as well as normal characters. The functionality of both of these members is based on the StringInfo.ParseCombiningCharacters method, which extracts the starting index of each text element and stores them in a private array.
"The .NET Framework defines a text element as a unit of text that is displayed as a single character, that is, a grapheme. A text element can be a base character, a surrogate pair, or a combining character sequence." - http://msdn.microsoft.com/en-us/library/system.globalization.stringinfo.aspx
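The difference between String.Length and StringInfo.LengthInTextElements shows up with exactly the two cases the quote mentions. A short sketch (C# 9+ top-level statements; the sample strings are my own):

```csharp
using System;
using System.Globalization;

string decomposed = "a\u0308";                 // 'a' + combining diaeresis
string emoji = char.ConvertFromUtf32(0x1F600); // one code point, stored as a surrogate pair

// Length counts Char objects (UTF-16 code units).
Console.WriteLine(decomposed.Length); // 2
Console.WriteLine(emoji.Length);      // 2

// LengthInTextElements counts what is displayed as a single character.
int decomposedElements = new StringInfo(decomposed).LengthInTextElements;
int emojiElements = new StringInfo(emoji).LengthInTextElements;
Console.WriteLine(decomposedElements); // 1
Console.WriteLine(emojiElements);      // 1
```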

String.Length does not account for surrogate pairs, it only counts UTF-16 chars (i.e. chars are always 2 bytes) - surrogate pairs are counted as 2 chars.

I would consider both false. The second statement would be true if you were asking about the count of Unicode codepoints, but you asked about "length". The String's Length is the count of its elements, which are 16-bit words. Only if the string contains nothing but Unicode codepoints from the BMP (Basic Multilingual Plane) is the length equal to the number of Unicode characters/codepoints. If there are codepoints from beyond the BMP or orphaned surrogates (high or low surrogates that do not appear as an ordered pair), the length is NOT equal to the number of characters/codepoints.
First of all, the String is a bunch of words: a word list, word array or word stream. Its content is 16-bit words and that's it. Naming an element "char" or "wchar" is a sin with regard to Unicode characters. Because a Unicode character can have a codepoint greater than 0xFFFF, it cannot be stored in a type that is 16 bits wide, and calling this type char or wchar makes it even worse, because it can only ever hold codepoints up to 0xFFFF, which corresponds to the Unicode 1.0 standard, which is some 20 years old by now. In order to store even the highest possible Unicode codepoint in a single datatype, this type would need 21 bits, but there is no such type, so we'd use a 32-bit type. In fact there is a static method (of the char class!) named ConvertToUtf32() which does just this: it can return a low ASCII codepoint or even the highest Unicode codepoint, which implies that this method can detect a surrogate pair at a given position of a String.
