Force .NET UTF-8 encoder to output 3-byte encoded characters

I am working on a foreign file format that was apparently developed in Japan. Most of their strings are stored with UTF-8 encoding in the 3-byte format (i.e. the capital A is represented as 0xEF,0xBC,0xA1). While it is no problem to decode such strings in .NET, I could not find a way to force the framework to output in the same format, as it will default to the abbreviated form (makes sense, but really I do need the 3-byte form).
Is there any standard functionality that will take care of this? Me being lazy I do not want to implement it myself :)

That's not the letter 'A'. It's a different character, U+FF21 FULLWIDTH LATIN CAPITAL LETTER A. Notice the extra spacing in 'Ａ'.
This isn't a different UTF-8 format, it's a different character. Whoever produced this kind of file either made a mistake or intentionally used those characters for layout purposes.
If you want to produce similar text, you'll have to find out how those characters are used in the first place, e.g. for some words, every word, or specific sections. Then you'll have to modify your own text to match, e.g. by replacing normal letters with their full-width equivalents.
Going the other way, you can convert such strings back to ordinary letters with String.Normalize, using the KC or KD normalisation forms. For example, the following expression:
"'ＡA'".Normalize(System.Text.NormalizationForm.FormKC)
Returns:
'AA'
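For the direction the question asks about (ordinary letters to full-width ones), a minimal sketch follows. It relies on the fact that the full-width forms U+FF01..U+FF5E sit at a fixed offset of 0xFEE0 from printable ASCII. ToFullWidth is a hypothetical helper name, and mapping the space to U+3000 is an assumption about what the file format expects. The sketch uses top-level statements (C# 9 or later).

```csharp
using System;
using System.Text;

// Hypothetical helper: maps printable ASCII (U+0021..U+007E) to the
// full-width forms (U+FF01..U+FF5E), which sit at a fixed offset of 0xFEE0.
// Assumption: the ASCII space should become U+3000 IDEOGRAPHIC SPACE.
static string ToFullWidth(string input)
{
    var sb = new StringBuilder(input.Length);
    foreach (char c in input)
    {
        if (c == ' ')
            sb.Append('\u3000');
        else if (c >= '!' && c <= '~')
            sb.Append((char)(c + 0xFEE0));
        else
            sb.Append(c); // leave everything else untouched
    }
    return sb.ToString();
}

string wide = ToFullWidth("A");
byte[] bytes = Encoding.UTF8.GetBytes(wide);
Console.WriteLine(BitConverter.ToString(bytes)); // EF-BC-A1
```

The standard UTF-8 encoder then produces the 3-byte sequences on its own, because the string now really contains the full-width characters.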

Related

Suggestions needed to Apply SuperScript to C# string For Xsl transformation

I want to apply superscript to a string for display.
It works fine with digits in superscript, but it doesn't work for letters.
Suggestions needed.
Works fine for :
var o2 = "O₂"; // or "O\x2082"
var unit2 = "unit²"; // or "unit\xB2"
Does not work for :
var xyz = "ABC365\xBTM"
I cannot get TM superscripted after the string ABC365.
Suggestions appreciated.

You seem to have completely misunderstood what is going on here, so I'll try a very basic explanation.
Unicode defines a large number of characters: the code space has room for 1,114,112 code points, although far fewer are actually assigned. These came from a large number of historic sources, and there's no great rhyme or reason about which characters made it in and which didn't. The available characters include some subscript and superscript digits, for example x2082 is subscript 2, and x00B2 is superscript 2. It also includes some special symbols such as the trademark symbol x2122 which are traditionally rendered with a superscript appearance.
But there's no general mechanism in Unicode to render any character in superscript or subscript rendition. If you want to write X<sub>n</sub> (an X followed by a subscript n), Unicode won't help you: to achieve that I had to resort to mechanisms outside Unicode, specifically HTML tagging. HTML allows you to render anything as subscript or superscript; Unicode only handles a few select cases.
C# recognizes the escape sequences \uHHHH (exactly four hex digits) and \x followed by one to four hex digits, where H is any hex digit, to represent special characters by their Unicode code point value. So if there's a codepoint x2082 meaning subscript 2, you can write it as \x2082 or \u2082 in a string literal. But there's no codepoint for subscript-lowercase-italic N, so there's no way of representing that.
Now when you write \xBTM it should be clear that it cannot mean what you want. \x consumes as many hex digits as it can find (up to four), so \xBTM is read as \xB (U+000B, a vertical tab) followed by the plain letters T and M. If you want the trademark symbol, you can use \x2122 or \u2122. If you want the two characters "T" and "M" in superscript rendition, you're out of luck; if you need to pass that sort of thing around in your application, you will need to pass strings containing HTML markup, rather than just plain Unicode.
You indicate that you're trying to create strings that will be used as input to an XSLT transformation. My suggestion would be to pass XML documents rather than plain strings, but I would need to understand the requirement in more detail before saying that's definitively the right solution.
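As a small illustration of the escape sequences discussed above, here is a sketch using the strings from the question. The variable names are made up for the example, and the sketch uses top-level statements (C# 9 or later).

```csharp
using System;

// \u takes exactly four hex digits, so it cannot swallow letters that follow it.
string product = "ABC365\u2122"; // ABC365 followed by U+2122 TRADE MARK SIGN
string o2 = "O\u2082";           // U+2082 SUBSCRIPT TWO
string unit2 = "unit\u00B2";     // U+00B2 SUPERSCRIPT TWO

// \x is greedy (one to four hex digits): "\xBTM" is U+000B followed by "TM".
string surprising = "ABC365\xBTM";

Console.WriteLine(product.Length);      // 7
Console.WriteLine((int)product[6]);     // 8482, i.e. 0x2122
Console.WriteLine((int)surprising[6]);  // 11, a vertical tab
```

Whether U+2122 is actually drawn raised depends on the font; Unicode only supplies the character.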

c# character encoding that treats accented characters like simple ones

Is there an encoding where an accented character like á or ä is treated as a single character?
and if not then what would be the most commonly used encoding today? I'm using UTF7 currently, how compatible is that with other types of encoding?
Thanks

You might think about what you're asking for. You're asking for an encoding that will recognize 'á' and turn it into 'a'. That's a converter, not an encoding. It would have to know what encoding the source is in so that it can convert to whatever encoding you're using.
Wait, maybe that's not what you're asking. There are encodings that treat those as single bytes. For example, the ISO-8859-1 encoding (also called Latin-1) treats many accented characters as a single byte.
(The following paragraph was struck out in the original answer because I was talking about ASCII, not UTF-7 ... long day.)
[Struck out:] UTF-7 isn't particularly compatible with many other encodings. It has 128 possible values: just enough space for the 52 letters (upper and lower case, combined) used in the Latin alphabet, the 10 numerals, 32 control characters, and various punctuation marks. But it's not sufficient for Spanish, for example, which has upside-down question marks and exclamation points as well as other things.
UTF-7 is "compatible" with other encodings in that it can represent the entire Unicode character set. But only some characters (known as the "direct characters") and a few control characters can be directly encoded as single ASCII bytes. Those characters will be the same as in UTF-8 and in many single-byte character sets. All other characters are represented by sequences, and will be different from any other encoding.
The most commonly used encoding today? On the Web, UTF-8 is used a lot. It's also the default encoding used when you create a StreamWriter. For the work I do (mostly English, and Western European character sets), it works better than anything else.
Now, it's possible that what you're looking for is something that will treat 'á' and 'a' as the same in comparisons. That's a different question. See Performing Culture-Insensitive String Comparisons for information on that.
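Both readings of the question can be sketched in a few lines: comparing while ignoring accents, and actually converting 'á' to 'a'. The comparison uses CompareOptions.IgnoreNonSpace; RemoveDiacritics is a hypothetical helper name, and the approach (decompose, then drop combining marks) only handles accents that decompose into a base letter plus a mark. Top-level statements (C# 9 or later) are assumed.

```csharp
using System;
using System.Globalization;
using System.Text;

// Reading 1: compare while ignoring diacritics (non-spacing marks).
// "\u00E1" is the precomposed 'a with acute'.
CompareInfo ci = CultureInfo.InvariantCulture.CompareInfo;
bool sameIgnoringAccents =
    ci.Compare("\u00E1", "a", CompareOptions.IgnoreNonSpace) == 0;
Console.WriteLine(sameIgnoringAccents);

// Reading 2: a converter. Decompose the text, then drop the combining marks.
static string RemoveDiacritics(string s)
{
    var sb = new StringBuilder();
    foreach (char c in s.Normalize(NormalizationForm.FormD))
    {
        if (CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
            sb.Append(c);
    }
    return sb.ToString().Normalize(NormalizationForm.FormC);
}

Console.WriteLine(RemoveDiacritics("\u00E1\u00E4")); // aa
```

Neither approach depends on the encoding used to store or transmit the text; both operate on the decoded string.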

This doesn't seem to have anything to do with encodings. In C# it doesn't matter what encoding you use for storage and transmission, the strings of characters are always internally in UTF-16 and ä is always 1 char long in composed form.
If "ä".Length is giving you 2, your string is in decomposed form and all you need to do is
string str = "a\u0308"; // a + U+0308 COMBINING DIAERESIS, .Length == 2
str = str.Normalize(NormalizationForm.FormC); // just ä (U+00E4) now, with Length == 1

Sorry for the confusion over this issue. I finally found what I was looking for: I needed my text to use the Windows-1250 (Central European (Windows)) code page, because that is what a lot of other programs use, and it correctly supports characters like €đłŁ¤ etc.
Thanks for all the help I got, it was a useful learning experience.

Select Unicode character subset by culture

I'd like to iterate over a character list. The character list should be a subset of the Unicode character set. Unicode has a lot of code points, and AFAIK Unicode includes all characters for all cultures.
What I want is to select a certain subset of Unicode characters depending on a specific culture, since a specific culture doesn't use all Unicode characters.
Is this possible?
I'm trying to draw a set of characters for OpenGL texture generation. In this way I can render font glyphs with OpenGL using a texture (very simple, fast enough). I'm already supporting the ASCII character set, since there are fewer than 256 displayable characters, but with Unicode I need to select a subset of characters, otherwise the resulting OpenGL texture will be unmanageable.
What I'm trying is to select a subset of Unicode characters depending on the requested culture. I cannot think about another top level filter except the culture.

I think it would be fair to say that a specific culture does not use most Unicode characters.
Check out the current standard. I don't think there is a direct correlation between cultures and scripts; this previous question touches on the problem.
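Since there is no built-in mapping from cultures to scripts, one workaround is to maintain such a table by hand. The sketch below is only an illustration: the block ranges are real Unicode blocks, but which blocks a given culture needs is an assumption, and rangesByCulture and GlyphSet are hypothetical names. It uses top-level statements (C# 9 or later).

```csharp
using System;
using System.Collections.Generic;

// Hypothetical, hand-maintained table: culture name -> Unicode block ranges.
// Neither .NET nor Unicode supplies this mapping; the entries are illustrative.
var rangesByCulture = new Dictionary<string, (int First, int Last)[]>
{
    ["en-US"] = new[] { (0x0020, 0x007E) },                   // Basic Latin (printable)
    ["de-DE"] = new[] { (0x0020, 0x007E), (0x00A0, 0x00FF) }, // + Latin-1 Supplement
    ["ru-RU"] = new[] { (0x0020, 0x007E), (0x0400, 0x04FF) }, // + Cyrillic
};

// Enumerates every character in the given ranges.
// All ranges used here lie inside the BMP, so one char per code point suffices.
static IEnumerable<char> GlyphSet((int First, int Last)[] ranges)
{
    foreach (var (first, last) in ranges)
        for (int cp = first; cp <= last; cp++)
            yield return (char)cp;
}

int count = 0;
foreach (char c in GlyphSet(rangesByCulture["ru-RU"]))
    count++;
Console.WriteLine(count); // 351 glyphs to place in the texture
```

A block may contain unassigned code points or characters the font lacks, so a real glyph atlas would still need to skip whatever the font cannot render.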

To which character encoding (Unicode version) set does a char object correspond?

What Unicode character encoding does a char object correspond to in:
C#
Java
JavaScript (I know there is not actually a char type but I am assuming that the String type is still implemented as an array of Unicode characters)
In general, is there a common convention among programming languages to use a specific character encoding?
Update
I have tried to clarify my question. The changes I made are discussed in the comments below.
Re: "What problem are you trying to solve?", I am interested in code generation from language independent expressions, and the particular encoding of the file is relevant.

In C# and Java it's UTF-16.

I'm not sure that I am answering your question, but let me make a few remarks that hopefully shed some light.
At the core, general-purpose programming languages like the ones we are talking about (C, C++, C#, Java, PHP) do not have a notion of "text", merely of "data". Data consists of sequences of integral values (i.e. numbers). There is no inherent meaning behind those numbers.
The process of turning a stream of numbers into a text is one of semantics, and it is usually left to the consumer to assign the relevant semantics to a data stream.
Warning: I will now use the word "encoding", which unfortunately has multiple inequivalent meanings. The first meaning of "encoding" is the assignment of meaning to a number. The semantic interpretation of a number is also called a "character". For example, in the ASCII encoding, 32 means "space" and 65 means "capital A". ASCII only assigns meanings to 128 numbers, so every ASCII character can be conveniently represented by a single 8-bit byte (with the top bit always 0). There are many encodings which assign characters to 256 numbers, thus all using one byte per character. In these fixed-width encodings, a text string has as many characters as it takes bytes to represent. There are also other encodings in which characters take a variable number of bytes to represent.
Now, Unicode is also an encoding, i.e. an assignment of meaning to numbers. On the first 128 numbers it is the same as ASCII, but it assigns meanings to (theoretically) 2^21 numbers. Because there are lots of meanings which aren't strictly "characters" in the sense of writing (such as zero-width joiners or diacritic modifiers), the term "codepoint" is preferred over "character". Nonetheless, any integral data type that is at least 21 bits wide can represent one codepoint. Typically one picks a 32-bit type, and this encoding, in which every element stands for one codepoint, is called UTF-32 or UCS-4.
Now we have a second meaning of "encoding": I can take a string of Unicode codepoints and transform it into a string of 8-bit or 16-bit values, thus further "encoding" the information. In this new, transformed form (called "unicode transformation format", or "UTF"), we now have strings of 8-bit or 16-bit values (called "code units"), but each individual value does not in general correspond to anything meaningful -- it first has to be decoded into a sequence of Unicode codepoints.
Thus, from a programming perspective, if you want to modify text (not bytes), then you should store your text as a sequence of Unicode codepoints. Practically that means that you need a 32-bit data type. The char data type in C and C++ is usually 8 bits wide (though that's only a minimum), while in C# and Java it is always 16 bits wide. An 8-bit char could conceivably be used to store a transformed UTF-8 string, and a 16-bit char could store a transformed UTF-16 string, but in order to get at the raw, meaningful Unicode codepoints (and in particular at the length of the string in codepoints) you will always have to perform decoding.
Typically your text processing libraries will be able to do the decoding and encoding for you, so they will happily accept UTF8 and UTF16 strings (but at a price), but if you want to spare yourself this extra indirection, store your strings as raw Unicode codepoints in a sufficiently wide type.
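In C#, that decoding step can be sketched as follows. CodePoints is a hypothetical helper name, and the sketch assumes the string is well-formed UTF-16 (char.ConvertToUtf32 throws on an unpaired surrogate). Top-level statements (C# 9 or later) are used.

```csharp
using System;
using System.Collections.Generic;

// Decode a UTF-16 string (16-bit code units) into Unicode code points.
static List<int> CodePoints(string s)
{
    var result = new List<int>();
    for (int i = 0; i < s.Length; i++)
    {
        result.Add(char.ConvertToUtf32(s, i));
        if (char.IsHighSurrogate(s[i]))
            i++; // skip the low surrogate that completes this pair
    }
    return result;
}

string text = "a\U0001F600"; // 'a' followed by U+1F600, which lies outside the BMP
Console.WriteLine(text.Length);            // 3 UTF-16 code units
Console.WriteLine(CodePoints(text).Count); // 2 code points
```

The difference between the two counts is exactly the extra indirection described above: the length in code units is free, the length in code points requires a pass over the data.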

In Java:
The char data type is a single 16-bit Unicode character.
Taken from http://download.oracle.com/javase/tutorial/java/nutsandbolts/datatypes.html
In C#:
A single Unicode character
Taken from http://msdn.microsoft.com/en-us/library/ms228360(v=vs.80).aspx

Understanding Text Encoding (In .Net)

I have done very little with encoding of Text. Truthfully, I don't really even know what it means exactly.
For example, if I have something like:
Dim myStr as String = "Hello"
Is that 'encoded' in memory in a particular format? Does that format depend on what language I'm using?
If I were in another country, like China, for example, and I had a string of Chinese (mandarin? My apologies if I'm using the wrong words here) would the following code (that I've used fine on English strings) still work the same?
System.Text.UTF8Encoding encoding=new System.Text.UTF8Encoding();
return encoding.GetBytes(str);
Or would it lose all meaning when you convert that .Net string to a UTF8Encoding when that conversion isn't valid?
Finally, I've worked with .Net for a few years now and I've never seen, heard, or had to do anything with Encoding. Am I the exception, or is it not a common thing to do?

The .NET String class stores strings as UTF-16, which means 2 bytes per character (although it allows special combinations of two 16-bit values, so-called "surrogate pairs", to form a single 4-byte character).
UTF-8, on the other hand, will use a variable number of bytes to represent a particular Unicode character, i.e. only one byte for regular ASCII characters, but maybe 3 bytes for a Chinese character. Both encodings can represent all Unicode characters, so there is always a mapping between them: both are different binary representations (i.e. for storing in memory or on disk) of the same (Unicode) character set.
Since not all Unicode characters fit into the 2 bytes of a single UTF-16 code unit, the format also allows a combination of two 16-bit values to form a 4-byte character. Such a combination is called a surrogate pair: a pair of 16-bit Unicode encoding values that, together, represent a single character.
UTF-8 does not have this problem, since the number of bytes per Unicode character is not fixed. A good general overview over UTF-8, UTF-16 and BOMs can be gathered here.
An excellent overview / introduction to Unicode character encoding is The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets
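To address the Chinese-text part of the question directly, here is a minimal sketch of the same GetBytes call on ASCII and Chinese input. The sample strings are chosen for illustration, and top-level statements (C# 9 or later) are assumed.

```csharp
using System;
using System.Text;

string hello = "Hello";
string chinese = "\u4E2D\u6587"; // the two characters of "中文"

byte[] helloBytes = Encoding.UTF8.GetBytes(hello);
byte[] chineseBytes = Encoding.UTF8.GetBytes(chinese);

Console.WriteLine(helloBytes.Length);                      // 5: one byte per ASCII character
Console.WriteLine(chineseBytes.Length);                    // 6: three bytes per character here
Console.WriteLine(Encoding.Unicode.GetByteCount(chinese)); // 4: two bytes each in UTF-16

// Decoding with the same encoding restores the original string.
bool roundTrips = Encoding.UTF8.GetString(chineseBytes) == chinese;
Console.WriteLine(roundTrips); // True
```

Nothing is lost in the conversion; only the number of bytes differs between the encodings.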

First and foremost: do not despair, you are not alone. Awareness of the treatment of character encoding and text representation in general is an unfortunately uncommon thing, but there is no better time to start learning than right now!
In modern systems, including .NET, text strings are represented in memory by some encoding of Unicode code points. These are just numbers. The code point for the character A is 65. The code point for the copyright sign (©) is 169. The code point for the Thai digit six is 3670.
The term "encoding" refers to how these numbers are represented in memory. There are a number of standard encodings that are used so that textual representation can remain consistent as data is transmitted from one system to another.
A simple encoding standard is UCS-2, whereby the code point is stored in the raw as a 16-bit word. This is limited due to the fact that it can only represent code points 0000-FFFF and such a range does not cover the full breadth of Unicode code points.
UTF-16 is the encoding used internally by the .NET String class. Most characters fit into a single 16-bit word here, but values larger than FFFF are encoded using surrogate pairs (see the Wiki). Because of this encoding scheme, code points D800-DFFF are reserved for surrogates and cannot be encoded by UTF-16.
UTF-8 is perhaps the most popular encoding used today, for a number of reasons which are outlined in the Wiki article.
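The numbers quoted above can be checked from C#. This is a small sketch; U+1F600 is just an example of a code point above FFFF, and top-level statements (C# 9 or later) are assumed.

```csharp
using System;
using System.Text;

Console.WriteLine((int)'A');      // 65
Console.WriteLine((int)'\u00A9'); // 169, COPYRIGHT SIGN
Console.WriteLine((int)'\u0E56'); // 3670, THAI DIGIT SIX

// A code point above FFFF needs a surrogate pair in UTF-16.
string s = char.ConvertFromUtf32(0x1F600);
Console.WriteLine(s.Length);                      // 2
Console.WriteLine(((int)s[0]).ToString("X4"));    // D83D
Console.WriteLine(((int)s[1]).ToString("X4"));    // DE00
Console.WriteLine(Encoding.UTF8.GetByteCount(s)); // 4
```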

UTF (Unicode Transformation Format) is a family of encodings with different code unit sizes. Each encoding determines how much memory the characters take and how they are represented in that memory.
Generally we work with Unicode and ASCII.
Unicode text in .NET is stored as UTF-16, which is 2 bytes per character for most characters (4 bytes for those outside the Basic Multilingual Plane).
ASCII is 1 byte per character.
ASCII can be represented in Unicode; however, Unicode text in general cannot be represented in ASCII without some additional escaping scheme.
URL percent-encoding is one such scheme (it is not part of UTF itself): the special character '%' tells you that the following two hex digits are the value of an encoded byte.
%20, for instance, is byte value 32, which is a space.
http://www.google.com?q=space%20character
Placing that URL in a browser would percent-decode the string, and q= would be interpreted as "space character"; notice the %20 is now a space.
UTF-16 uses 2-byte code units, so one might imagine writing the same thing as:
http://www.google.com?q=space%0020character
This example would actually fail, because percent-encoding always takes exactly two hex digits per byte and URIs are supposed to use UTF-8, but it demonstrates the point.
In UTF-16 the space character is the code unit 0020, i.e. two bytes with values 0 and 32 respectively.
Mandarin text consists of Unicode characters; to put it into a URL, you would encode it as UTF-8 and percent-encode the resulting bytes so that it is representable in ASCII.
Here is a wiki article explaining a little more in depth
http://en.wikipedia.org/wiki/UTF-8
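The %20 mechanism in the URL example is percent-encoding, which .NET exposes through System.Uri. A minimal sketch, using the query text from the example and one Chinese character chosen for illustration, with top-level statements (C# 9 or later):

```csharp
using System;
using System.Text;

string escaped = Uri.EscapeDataString("space character");
Console.WriteLine(escaped);                         // space%20character
Console.WriteLine(Uri.UnescapeDataString(escaped)); // space character

// Non-ASCII text is first encoded as UTF-8, then each byte is percent-encoded.
string zhong = "\u4E2D"; // U+4E2D
Console.WriteLine(BitConverter.ToString(Encoding.UTF8.GetBytes(zhong))); // E4-B8-AD
Console.WriteLine(Uri.EscapeDataString(zhong));                          // %E4%B8%AD
```

The '%' escapes belong to the URI syntax, not to UTF-8 or UTF-16; the UTF encoding only decides which bytes get escaped.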
