How can C#'s 16-bit chars encode all Unicode characters?

I'm reading that C# stores Unicode characters in char (aka System.Char) variables, which have a fixed length of 16 bits. However, 16 bits are not enough to store all Unicode characters! How, in this case, do C#'s char variables support Unicode?

In short: Surrogates
This is a good question. Unicode is more complicated than most people think, because it introduces multiple new concepts (character set, code point, encoding, code unit), but I will try to give an almost complete answer.
Intro:
Unicode is a character set. A character set is just a list of character and code point pairs. A code point is just a number that identifies the paired character. UTF-8, UTF-16 and UTF-32 are encodings. Encodings define how the numbers (code points) are represented in binary form (as code units). Code units can be made of one or more bytes. (Actually the original ASCII code units are even just 7 bits long, but that's another story 😝)
Remember: character sets are made of code points and encodings are made of code units.
The C# char type represents a UTF-16 code unit. UTF-16 is a variable-length/multibyte encoding for the Unicode character set, meaning characters can be represented by one or two 16-bit code units. Unicode code points beyond the 16-bit range are represented by two UTF-16 code units, which equals four bytes.
Now to answering your question: How?
The original idea of Unicode was 1 character = 1 code point. But the original encoding, UCS-2 (now obsolete), uses two bytes (16 bits) and could only encode 65,536 code points. After a short time this was not enough for the growing Unicode character set. Oh really, what the f did they think? Two bytes are obviously not enough. To fix this problem, Unicode had to step back from the original idea and introduce surrogates.
Therefore UTF-16 was born: a variable-length/multibyte encoding (with 16-bit code units) which implements surrogates. These surrogates are special 16-bit code units drawn from a range of code points (U+D800 through U+DFFF) that Unicode explicitly never assigns to characters. Finding a surrogate while parsing your text simply means you also have to read the next 16 bits and interpret both 16-bit units (the high surrogate and the following low surrogate) as one combined 32-bit Unicode code point.
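A short sketch of this (the emoji is just an example of a character outside the 16-bit range; the comments show the output of a typical console app):
string s = "😀";                                             // U+1F600, above U+FFFF

Console.WriteLine(s.Length);                                 // 2 (two 16-bit code units, not two characters)
Console.WriteLine($"{(int)s[0]:X4} {(int)s[1]:X4}");         // D83D DE00 (high surrogate, low surrogate)
Console.WriteLine(char.IsHighSurrogate(s[0]));               // True
Console.WriteLine(char.IsLowSurrogate(s[1]));                // True
Console.WriteLine(char.ConvertToUtf32(s, 0).ToString("X"));  // 1F600 (the combined code point)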
UTF-32 is a fixed four-byte encoding, which is big enough to avoid space problems and could map 1 character to 1 code point, but UTF-32 still has to know about surrogates, since the UTF encodings are based on the Unicode standard and the surrogate range is part of the definition of the Unicode character set (those code points must never appear as encoded characters on their own).
UTF-8 is also a variable-length/multibyte encoding, but with another interesting encoding technique. In short: the number of leading one-bits in the first byte of a sequence tells you how many bytes (up to four in total) have to be combined into one Unicode code point.
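To see the variable length in practice, here is a small sketch (the sample characters are arbitrary) that counts the UTF-8 bytes of single characters:
using System.Text;

Console.WriteLine(Encoding.UTF8.GetByteCount("A"));    // 1 (ASCII range)
Console.WriteLine(Encoding.UTF8.GetByteCount("é"));    // 2
Console.WriteLine(Encoding.UTF8.GetByteCount("€"));    // 3
Console.WriteLine(Encoding.UTF8.GetByteCount("😀"));   // 4 (outside the BMP)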

Related

Why is the length of this string longer than the number of characters in it?

This code:
string a = "abc";
string b = "A𠈓C";
Console.WriteLine("Length a = {0}", a.Length);
Console.WriteLine("Length b = {0}", b.Length);
outputs:
Length a = 3
Length b = 4
Why? The only thing I could imagine is that the Chinese character is 2 bytes long and that the .Length method returns the byte count.
Everyone else is giving the surface answer, but there's a deeper rationale too: how many "characters" a string contains is a difficult question to define and can be surprisingly expensive to compute, whereas a length property should be fast.
Why is it difficult to define? Well, there are a few options and none is really more valid than another:
The number of code units (bytes or other fixed size data chunk; C# and Windows typically use UTF-16 so it returns the number of two-byte pieces) is certainly relevant, as the computer still needs to deal with the data in that form for many purposes (writing to a file, for example, cares about bytes rather than characters)
The number of Unicode codepoints is fairly easy to compute (although O(n), because you have to scan the string for surrogate pairs) and might matter to a text editor.... but isn't actually the same thing as the number of characters printed on screen (called graphemes). For example, some accented letters can be represented in two forms: a single codepoint, or two codepoints paired together, one representing the letter, and one saying "add an accent to my partner letter". Would the pair be two characters or one? You can normalize strings to help with this, but not all valid letters have a single-codepoint representation.
Even the number of graphemes isn't the same as the length of a printed string, which depends on the font among other factors, and since some characters are printed with some overlap in many fonts (kerning), the length of a string on screen is not necessarily equal to the sum of the length of graphemes anyway!
Some Unicode code points aren't even characters in the traditional sense, but rather some kind of control marker, like a byte order mark or a right-to-left indicator. Do these count?
In short, the length of a string is actually a ridiculously complex question and calculating it can take a lot of CPU time as well as data tables.
Moreover, what's the point? Why do these metrics matter? Well, only you can answer that for your case, but personally, I find they are generally irrelevant. Limiting data entry is more logically done by byte limits, as that's what needs to be transferred or stored anyway. Limiting display size is better done by the display-side software - if you have 100 pixels for the message, how many characters you can fit depends on the font, etc., which isn't known by the data-layer software anyway. Finally, given the complexity of the Unicode standard, you're probably going to have bugs at the edge cases anyway if you try anything else.
So it is a hard question with not a lot of general purpose use. Number of code units is trivial to calculate - it is just the length of the underlying data array - and the most meaningful/useful as a general rule, with a simple definition.
That's why b has length 4 beyond the surface explanation of "because the documentation says so".
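To make the distinction concrete, here is a rough sketch (using the string from the question) of the three measures discussed above; the code-point count is a simple hand-rolled loop, not a standard library call:
using System.Globalization;

string b = "A𠈓C";

// 1. UTF-16 code units: what String.Length returns.
Console.WriteLine(b.Length);                                // 4

// 2. Unicode code points: skip the second half of each surrogate pair.
int codePoints = 0;
for (int i = 0; i < b.Length; i++)
{
    codePoints++;
    if (char.IsSurrogatePair(b, i)) i++;                    // the pair is one code point
}
Console.WriteLine(codePoints);                              // 3

// 3. Text elements (roughly, graphemes): what StringInfo reports.
Console.WriteLine(new StringInfo(b).LengthInTextElements);  // 3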
From the documentation of the String.Length property:
The Length property returns the number of Char objects in this instance, not the number of Unicode characters. The reason is that a Unicode character might be represented by more than one Char. Use the System.Globalization.StringInfo class to work with each Unicode character instead of each Char.
The character at index 1 in "A𠈓C" is the start of a surrogate pair.
The key point to remember is that a surrogate pair represents a single 32-bit character.
You can try this code and it will return True
Console.WriteLine(char.IsSurrogatePair("A𠈓C", 1));
Char.IsSurrogatePair Method (String, Int32)
true if the s parameter includes adjacent characters at positions index and index + 1, and the numeric value of the character at position index ranges from U+D800 through U+DBFF, and the numeric value of the character at position index + 1 ranges from U+DC00 through U+DFFF; otherwise, false.
This is further explained in String.Length property:
The Length property returns the number of Char objects in this instance, not the number of Unicode characters. The reason is that a Unicode character might be represented by more than one Char. Use the System.Globalization.StringInfo class to work with each Unicode character instead of each Char.
As the other answers have pointed out, even though there are 3 visible characters, they are represented by 4 Char objects, which is why the Length is 4 and not 3.
MSDN states that
The Length property returns the number of Char objects in this instance, not the number of Unicode characters.
However if what you really want to know is the number of "text elements" and not the number of Char objects you can use the StringInfo class.
var si = new StringInfo("A𠈓C");
Console.WriteLine(si.LengthInTextElements); // 3
You can also enumerate each text element like this
var enumerator = StringInfo.GetTextElementEnumerator("A𠈓C");
while (enumerator.MoveNext())
{
    Console.WriteLine(enumerator.Current);
}
Using foreach on the string will split the middle "letter" into two char objects, and the printed result won't correspond to the string.
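For example, a small sketch of that behaviour (the comments show the UTF-16 value of each char):
string b = "A𠈓C";
foreach (char c in b)
{
    Console.WriteLine($"U+{(int)c:X4}");
}
// Output: U+0041, U+D840, U+DE13, U+0043 - the middle "letter" comes out
// as its two surrogate halves, which do not print as 𠈓 on their own.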
That is because the Length property returns the number of char objects, not the number of Unicode characters. In your case, one of the Unicode characters is represented by more than one char object (a surrogate pair).
The Length property returns the number of Char objects in this instance, not the number of Unicode characters. The reason is that a Unicode character might be represented by more than one Char. Use the System.Globalization.StringInfo class to work with each Unicode character instead of each Char.
As others said, it's not the number of characters in the string but the number of Char objects. The character 𠈓 is code point U+20213. Since that value is outside the 16-bit char type's range, it's encoded in UTF-16 as the surrogate pair D840 DE13.
The way to get the length in characters was mentioned in the other answers. However, it should be used with care, as there can be more than one way to represent a character in Unicode: "à" may be one composed character or two (a + a combining diacritic). Normalization may be needed, as in the case of Twitter's character counting.
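As a small illustration of that caveat (a minimal sketch; the composed/decomposed forms of "à" are just an example), normalization can bring the two representations back together:
using System.Text;

string composed   = "\u00E0";      // "à" as one precomposed code point
string decomposed = "a\u0300";     // "a" followed by a combining grave accent

Console.WriteLine(composed.Length);                          // 1
Console.WriteLine(decomposed.Length);                        // 2
Console.WriteLine(composed == decomposed);                   // False (ordinal comparison)
Console.WriteLine(composed == decomposed.Normalize(NormalizationForm.FormC)); // True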
You should read this
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
This is because the Length property counts 16-bit Char values, and a single Char can only hold Unicode code points no larger than U+FFFF. This set of code points is known as the Basic Multilingual Plane (BMP) and uses only 2 bytes.
Unicode code points outside of the BMP are represented in UTF-16 using 4-byte surrogate pairs.
To correctly count the number of characters (3), use StringInfo
StringInfo b = new StringInfo("A𠈓C");
Console.WriteLine(string.Format("Length 2 = {0}", b.LengthInTextElements));
Okay, in .Net and C# all strings are encoded as UTF-16LE. A string is stored as a sequence of chars. Each char encapsulates the storage of 2 bytes or 16 bits.
What we see "on paper or screen" as a single letter, character, glyph, symbol, or punctuation mark can be thought of as a single Text Element. As described in Unicode Standard Annex #29, UNICODE TEXT SEGMENTATION, each Text Element is represented by one or more Code Points. An exhaustive list of code points can be found in the Unicode code charts.
Each Code Point needs to encoded into binary for internal representation by a computer. As stated, each char stores 2 bytes. Code Points at or below U+FFFF can be stored in a single char. Code Points above U+FFFF are stored as a surrogate pair, using two chars to represent a single Code Point.
Given what we now know, we can deduce that a Text Element can be stored as one char, as a Surrogate Pair of two chars or, if the Text Element is represented by multiple Code Points, as some combination of single chars and Surrogate Pairs. As if that weren't complicated enough, some Text Elements can be represented by different combinations of Code Points, as described in Unicode Standard Annex #15, UNICODE NORMALIZATION FORMS.
Interlude
So, strings that look the same when rendered can actually be made up of a different combination of chars. An ordinal (byte by byte) comparison of two such strings would detect a difference; this may be unexpected or undesirable.
You can re-encode .Net strings so that they use the same Normalization Form. Once normalized, two strings with the same Text Elements will be encoded the same way. To do this, use the string.Normalize function. However, remember, some different Text Elements look similar to each other. :-s
So, what does this all mean in relation to the question? The Text Element '𠈓' is represented by the single Code Point U+20213, CJK UNIFIED IDEOGRAPHS EXTENSION B. This means it cannot be encoded as a single char and must be encoded as a Surrogate Pair, using two chars. This is why string b is one char longer than string a.
If you need to reliably (see caveat) count the number of Text Elements in a string you should use the
System.Globalization.StringInfo class like this.
using System.Globalization;
string a = "abc";
string b = "A𠈓C";
Console.WriteLine("Length a = {0}", new StringInfo(a).LengthInTextElements);
Console.WriteLine("Length b = {0}", new StringInfo(b).LengthInTextElements);
giving the output,
"Length a = 3"
"Length b = 3"
as expected.
Caveat
The .Net implementation of Unicode Text Segmentation in the StringInfo and TextElementEnumerator classes should be generally useful and, in most cases, will yield a response that the caller expects. However, as stated in Unicode Standard Annex #29, "The goal of matching user perceptions cannot always be met exactly because the text alone does not always contain enough information to unambiguously decide boundaries."

Why does char take 2 bytes when it can be stored in one byte

Can anybody tell me why, in C#, a char takes two bytes even though it can be stored in one byte? Don't you think it is a waste of memory? If not, then how is the extra byte used?
In simple words: please make it clear to me what the use of the extra 8 bits is!
although it can be stored in one byte
What makes you think that?
It only takes one byte to represent every character in the English language, but other languages use other characters. Consider the number of different writing systems (Latin, Chinese, Arabic, Cyrillic...), and the number of symbols in each of them (not only letters or digits, but also punctuation marks and other special symbols)... there are tens of thousands of different symbols in use in the world! So one byte is never going to be enough to represent them all; that's why the Unicode standard was created.
Unicode has several representations (UTF-8, UTF-16, UTF-32...). .NET strings use UTF-16, which takes two bytes per code unit. Of course, two bytes are still not enough to represent all the different symbols in the world; surrogate pairs are used to represent characters above U+FFFF.
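To make the trade-off concrete, here is a small sketch (the sample strings are arbitrary): in memory a .NET string always uses 2 bytes per char (UTF-16), but you can convert it to UTF-8 bytes whenever size matters:
using System.Text;

string english = "Hello";
string russian = "Привет";

// In memory (UTF-16): 2 bytes per char for both.
Console.WriteLine(Encoding.Unicode.GetByteCount(english));   // 10
Console.WriteLine(Encoding.Unicode.GetByteCount(russian));   // 12

// As UTF-8: ASCII text shrinks to 1 byte per character,
// while Cyrillic still needs 2 bytes per character.
Console.WriteLine(Encoding.UTF8.GetByteCount(english));      // 5
Console.WriteLine(Encoding.UTF8.GetByteCount(russian));      // 12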
The char keyword is used to declare a Unicode character in the range indicated in the following table. Unicode characters are 16-bit characters used to represent most of the known written languages throughout the world.
http://msdn.microsoft.com/en-us/library/x9h8tsay%28v=vs.80%29.aspx
Unicode characters. True, we have enough room in 8 bits for the English alphabet, but when it comes to Chinese and the like, it takes a lot more characters.
In C#, chars are 16-bit Unicode characters. Unicode supports a much larger character set than can be supported by ASCII.
If memory really is a concern, here is a good discussion on SO regarding how you might work with 8-bit chars: Is there a string type with 8 BIT chars?
References:
On C#'s char datatype: http://msdn.microsoft.com/en-us/library/x9h8tsay(v=vs.80).aspx
On Unicode: http://en.wikipedia.org/wiki/Unicode
Because UTF-8 was probably still too young for Microsoft to consider using it.

To which character encoding (Unicode version) set does a char object correspond?

What Unicode character encoding does a char object correspond to in:
C#
Java
JavaScript (I know there is not actually a char type but I am assuming that the String type is still implemented as an array of Unicode characters)
In general, is there a common convention among programming languages to use a specific character encoding?
Update
I have tried to clarify my question. The changes I made are discussed in the comments below.
Re: "What problem are you trying to solve?", I am interested in code generation from language independent expressions, and the particular encoding of the file is relevant.
In C# and Java it's UTF-16.
I'm not sure that I am answering your question, but let me make a few remarks that hopefully shed some light.
At the core, general-purpose programming languages like the ones we are talking about (C, C++, C#, Java, PHP) do not have a notion of "text", merely of "data". Data consists of sequences of integral values (i.e. numbers). There is no inherent meaning behind those numbers.
The process of turning a stream of numbers into a text is one of semantics, and it is usually left to the consumer to assign the relevant semantics to a data stream.
Warning: I will now use the word "encoding", which unfortunately has multiple inequivalent meanings. The first meaning of "encoding" is the assignment of meaning to a number. The semantic interpretation of a number is also called a "character". For example, in the ASCII encoding, 32 means "space" and 65 means "capital A". ASCII only assigns meanings to 128 numbers, so every ASCII character can be conveniently represented by a single 8-bit byte (with the top bit always 0). There are many encodings which assign characters to 256 numbers, thus all using one byte per character. In these fixed-width encodings, a text string has as many characters as it takes bytes to represent. There are also other encodings in which characters take a variable number of bytes to represent.
Now, Unicode is also an encoding, i.e. an assignment of meaning to numbers. On the first 128 numbers it is the same as ASCII, but it assigns meanings to code points up to U+10FFFF (a little over a million numbers, which need 21 bits to represent). Because there are lots of meanings which aren't strictly "characters" in the sense of writing (such as zero-width joiners or diacritic modifiers), the term "codepoint" is preferred over "character". Nonetheless, any integral data type that is at least 21 bits wide can represent one codepoint. Typically one picks a 32-bit type, and this encoding, in which every element stands for one codepoint, is called UTF-32 or UCS-4.
Now we have a second meaning of "encoding": I can take a string of Unicode codepoints and transform it into a string of 8-bit or 16-bit values, thus further "encoding" the information. In this new, transformed form (called "unicode transformation format", or "UTF"), we now have strings of 8-bit or 16-bit values (called "code units"), but each individual value does not in general correspond to anything meaningful -- it first has to be decoded into a sequence of Unicode codepoints.
Thus, from a programming perspective, if you want to modify text (not bytes), then you should store your text as a sequence of Unicode codepoints. Practically that means that you need a 32-bit data type. The char data type in C and C++ is usually 8 bits wide (though that's only a minimum), while on C# and Java it is always 16 bits wide. An 8-bit char could conceivably be used to store a transformed UTF-8 string, and a 16-bit char could store a transformed UTF-16 string, but in order to get at the raw, meaningful Unicode codepoints (and in particular at the length of the string in codepoints) you will always have to perform decoding.
Typically your text processing libraries will be able to do the decoding and encoding for you, so they will happily accept UTF8 and UTF16 strings (but at a price), but if you want to spare yourself this extra indirection, store your strings as raw Unicode codepoints in a sufficiently wide type.
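To illustrate the two senses of "encoding" side by side, here is a minimal sketch (the string is just an example) that transforms a .NET UTF-16 string into UTF-32, where every 4-byte code unit is exactly one code point:
using System.Text;

string s = "A𠈓C";                                       // 3 code points, 4 UTF-16 code units

byte[] utf32 = Encoding.UTF32.GetBytes(s);
Console.WriteLine(s.Length);                             // 4 (16-bit code units)
Console.WriteLine(utf32.Length / 4);                     // 3 (32-bit code units = code points)

// Decoding goes back through the same transformation without loss.
Console.WriteLine(Encoding.UTF32.GetString(utf32) == s); // True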
In Java:
The char data type is a single 16-bit Unicode character.
Taken from http://download.oracle.com/javase/tutorial/java/nutsandbolts/datatypes.html
In C#:
A single Unicode character
Taken from http://msdn.microsoft.com/en-us/library/ms228360(v=vs.80).aspx

What does the .NET String.Length property return? Surrogate neutral length or complete character length

The documentation and language varies between VS 2008 and 2010:
VS 2008 Documentation
Internally, the text is stored as a readonly collection of Char objects, each of which represents one Unicode character encoded in UTF-16. ... The length of a string represents the number of characters regardless of whether the characters are formed from Unicode surrogate pairs or not. To access the individual Unicode code points in a string, use the StringInfo object. - http://msdn.microsoft.com/en-us/library/ms228362%28v=vs.90%29.aspx
VS 2010 Documentation
Internally, the text is stored as a sequential read-only collection of Char objects. ... The Length property of a string represents the number of Char objects it contains, not the number of Unicode characters. To access the individual Unicode code points in a string, use the StringInfo object. - http://msdn.microsoft.com/en-us/library/ms228362%28v=VS.100%29.aspx
The language used in both cases doesn't clearly differentiate between "character", "Unicode character", "Char class", "Unicode surrogate pair", and "Unicode code point".
The language in the VS2008 documentation stating that a "string represents the number of characters regardless of whether the characters are formed from Unicode surrogate pairs or not" seems to be defining "character" as an object that may be the result of a Unicode surrogate pair, which suggests that it may represent a 4-byte sequence rather than a 2-byte sequence. It also specifically states at the beginning that a "char" object is encoded in UTF-16, which suggests that it could represent a surrogate pair (being 4 bytes instead of 2). I'm fairly certain that is wrong though.
The VS2010 documentation is a little more precise. It draws a distinction between "char" and "Unicode character", but not between "Unicode character" and "Unicode code point". If a code point refers to half a surrogate pair, and a "Unicode character" represents a full pair, then the "Char" class is named incorrectly, and does not refer to a "Unicode character" at all (which they state it does not), and it's really a Unicode code point.
So are both of the following statements true? (Yes, I think.)
String.Length represents the Unicode code-point length, and
String.Length represents neither the Unicode character length nor what we would consider to be a true character length (number of characters that would be displayed), but rather the number of "Char" objects, which each represent a Unicode code point (not a Unicode character).
String.Length does not account for surrogate pairs; however, the StringInfo.LengthInTextElements method does.
StringInfo.SubstringByTextElements is similar to String.Substring, but it operates on "Text Elements", such as surrogate pairs and combining characters, as well as normal characters. The functionality of both these methods are based on the StringInfo.ParseCombiningCharacters method, which extracts the starting index of each text element and stores them in a private array.
"The .NET Framework defines a text element as a unit of text that is
displayed as a single character, that is, a grapheme. A text element
can be a base character, a surrogate pair, or a combining character
sequence." - http://msdn.microsoft.com/en-us/library/system.globalization.stringinfo.aspx
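For example, a quick sketch (reusing the string from the earlier questions) of both members mentioned above:
using System.Globalization;

string s = "A𠈓C";
var si = new StringInfo(s);

// Substring by text elements rather than by char index.
Console.WriteLine(si.SubstringByTextElements(1, 1));           // 𠈓

// Starting char index of each text element.
int[] starts = StringInfo.ParseCombiningCharacters(s);
Console.WriteLine(string.Join(", ", starts));                  // 0, 1, 3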
String.Length does not account for surrogate pairs, it only counts UTF-16 chars (i.e. chars are always 2 bytes) - surrogate pairs are counted as 2 chars.
Both I would consider false. The second statement would be true if you asked about the count of Unicode code points, but you asked about "length". The String's Length is the count of its elements, which are 16-bit words. Only when the string contains nothing but Unicode code points from the BMP (Basic Multilingual Plane) is the length equal to the number of Unicode characters/code points. If there are code points from beyond the BMP, or orphaned surrogates (high or low surrogates that do not appear as an ordered pair), the length is NOT equal to the number of characters/code points.
First of all, the String is a bunch of words: a word list, word array or word stream. Its contents are 16-bit words, and that's it. To name an element "char" or "wchar" is a sin where Unicode characters are concerned. Because a Unicode character can have a code point greater than 0xFFFF, it cannot be stored in a type that is 16 bits wide, and if this type is called char or wchar it's even worse, because it can only ever hold code points up to 0xFFFF, which corresponds to the Unicode 1.0 standard, now over 20 years old. In order to store even the highest possible Unicode code point in a single datatype, the type would need 21 bits, but there is no such type, so we'd use a 32-bit type. In fact there is a static method (of the char class!) named ConvertToUtf32() which does just this: it can return a low ASCII code point or even the highest Unicode code point, the latter implying that this method can detect a surrogate pair at a given position in a String.
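For instance, a one-off check (using the string from the earlier questions) of the String overload of that method:
string s = "A𠈓C";
Console.WriteLine(char.ConvertToUtf32(s, 0).ToString("X"));  // 41 (a plain 'A')
Console.WriteLine(char.ConvertToUtf32(s, 1).ToString("X"));  // 20213 (combined from the pair at indexes 1 and 2)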

Understanding Text Encoding (In .Net)

I have done very little with encoding of Text. Truthfully, I don't really even know what it means exactly.
For example, if I have something like:
Dim myStr as String = "Hello"
Is that 'encoded' in memory in a particular format? Does that format depend on what language I'm using?
If I were in another country, like China, for example, and I had a string of Chinese (mandarin? My apologies if I'm using the wrong words here) would the following code (that I've used fine on English strings) still work the same?
System.Text.UTF8Encoding encoding=new System.Text.UTF8Encoding();
return encoding.GetBytes(str);
Or would it lose all meaning when you convert that .Net string to a UTF8Encoding when that conversion isn't valid?
Finally, I've worked with .Net for a few years now and I've never seen, heard, or had to do anything with Encoding. Am I the exception, or is it not a common thing to do?
The .NET string class encodes strings using UTF-16 - that means 2 bytes per character (although it allows special combinations of two 16-bit values to form a single 4-byte character, so-called "surrogate pairs").
UTF-8 on the other hand will use the variable number of bytes necessary to represent a particular Unicode character, i.e. only one byte for regular ASCII characters, but maybe 3 bytes for a Chinese character. Both encodings can represent all Unicode characters, so there is always a mapping between them - both are different binary representations (i.e. for storing in memory or on disk) of the same (Unicode) character set.
Since not all Unicode characters fit into the 2 bytes of a single UTF-16 code unit, the format also allows a combination of two 16-bit code units to form a 4-byte character - such a combination is called a "surrogate pair" and is a pair of 16-bit Unicode encoding values that, together, represent a single character.
UTF-8 does not have this problem, since the number of bytes per Unicode character is not fixed. A good general overview over UTF-8, UTF-16 and BOMs can be gathered here.
An excellent overview / introduction to Unicode character encoding is The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets
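To answer the original worry directly, here is a minimal sketch (the Chinese sample string is arbitrary) showing that GetBytes/GetString round-trips non-English text without losing meaning; it just uses more bytes per character than it would for ASCII:
using System.Text;

string chinese = "你好";                                        // two CJK characters

byte[] utf8 = Encoding.UTF8.GetBytes(chinese);
Console.WriteLine(utf8.Length);                                 // 6 (3 bytes per character)
Console.WriteLine(Encoding.UTF8.GetString(utf8) == chinese);    // True - nothing is lost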
First and foremost: do not despair, you are not alone. Awareness of the treatment of character encoding and text representation in general is unfortunately uncommon, but there is no better time to start learning than right now!
In modern systems, including .NET, text strings are represented in memory by some encoding of Unicode code points. These are just numbers. The code point for the character A is 65. The code point for the copyright (c) is 169. The code point for the Thai digit six is 3670.
The term "encoding" refers to how these numbers are represented in memory. There are a number of standard encodings that are used so that textual representation can remain consistent as data is transmitted from one system to another.
A simple encoding standard is UCS-2, whereby the code point is stored in the raw as a 16-bit word. This is limited due to the fact that it can only represent code points 0000-FFFF and such a range does not cover the full breadth of Unicode code points.
UTF-16 is the encoding used internally by the .NET String class. Most characters fit into a single 16-bit word here, but values larger than FFFF are encoded using surrogate pairs (see the Wiki). Because of this encoding scheme, code points D800-DFFF cannot be encoded by UTF-16.
UTF-8 is perhaps the most popular encoding used today, for a number of reasons which are outlined in the Wiki article.
UTF is not a single encoding but a family of encodings (UTF-8, UTF-16, UTF-32) with different code unit sizes. Each encoding defines how much memory, and in what representation, the characters will take.
Generally we work with Unicode and ASCII.
What .NET calls the "Unicode" encoding (UTF-16) uses 2 bytes per code unit.
ASCII is 1 byte per character.
ASCII can be represented in Unicode; Unicode, however, cannot be represented in ASCII without being encoded in some way.
URLs are a good example of that: they have to stay within ASCII, so non-ASCII text is first encoded as UTF-8 bytes and then percent-encoded, where the special character '%' tells you that the following two hex digits are the value of one encoded byte.
%20, for instance, is the byte 32, which is actually a space.
http://www.google.com?q=space%20character
Placing that URL in a browser would decode the string, and q= would actually be interpreted as "space character"; notice the %20 is now a space.
In UTF-16, a space is the 16-bit code unit 0x0020, i.e. two bytes with the values 0 and 32. URIs, however, are defined in terms of UTF-8 bytes, so there is no percent-encoded UTF-16 form like %0020 - such a URL would actually fail.
Mandarin would be some set of Unicode code points like any other text, and encoded as UTF-8 and then percent-encoded it, too, can be carried in an ASCII-only URL.
Here is a wiki article explaining a little more in depth
http://en.wikipedia.org/wiki/UTF-8
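As a small sketch of that last point (the text is just an example), the percent escapes in a URL are simply the hex values of the UTF-8 bytes, and Uri.EscapeDataString is one built-in way to produce them:
using System.Text;

string text = "space character";
Console.WriteLine(Uri.EscapeDataString(text));      // space%20character

byte[] spaceBytes = Encoding.UTF8.GetBytes(" ");
Console.WriteLine(spaceBytes[0]);                   // 32 (0x20)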
