Select Unicode character subset by culture - c#

I'd like to iterate over a character list. The character list should be a subset of a Unicode character set. Unicode has a lot of code points, and AFAIK it includes all characters for all cultures.
What I want is to select a certain subset of Unicode characters depending on a specific culture, since a specific culture doesn't use all Unicode characters.
Is this possible?
I'm trying to draw a set of characters for OpenGL texture generation. In this way I can render font glyphs with OpenGL using a texture (very simple, fast enough). I already support the ASCII character set, since there are fewer than 256 displayable characters, but with Unicode I need to select a subset of characters, otherwise the resulting OpenGL texture will be unmanageable.
What I'm trying to do is select a subset of Unicode characters depending on the requested culture. I cannot think of another top-level filter except the culture.

I think it would be fair to say that a specific culture does not use most Unicode characters.
Check out the current standard. I don't think there is a direct correlation between cultures and scripts; this previous question touches on the problem.
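Since there is no built-in culture-to-script mapping, one workable approach is to keep your own table of Unicode blocks per language and enumerate those. The following is only a sketch: the table `blocksByLanguage`, its three entries and the helper `GlyphsFor` are hypothetical names and choices made for illustration, not part of .NET.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical table: which Unicode blocks to rasterise for a language.
// The entries are assumptions; extend them for the languages you support.
var blocksByLanguage = new Dictionary<string, (int First, int Last)[]>
{
    ["en"] = new (int First, int Last)[] { (0x0020, 0x007E) },                   // Basic Latin, printable part
    ["fr"] = new (int First, int Last)[] { (0x0020, 0x007E), (0x00A0, 0x00FF) }, // + Latin-1 Supplement
    ["ru"] = new (int First, int Last)[] { (0x0020, 0x007E), (0x0400, 0x04FF) }, // + Cyrillic
};

// Returns the characters to draw for a two-letter language code,
// falling back to Basic Latin when the language is not in the table.
IEnumerable<char> GlyphsFor(string language)
{
    if (!blocksByLanguage.TryGetValue(language, out var blocks))
        blocks = blocksByLanguage["en"];
    return blocks
        .SelectMany(b => Enumerable.Range(b.First, b.Last - b.First + 1))
        .Select(codePoint => (char)codePoint)   // safe: every block here lies below U+FFFF
        .Where(c => !char.IsControl(c));
}

string atlas = new string(GlyphsFor("fr").ToArray());
Console.WriteLine(atlas.Length); // 191
```

The language code can come from `CultureInfo.TwoLetterISOLanguageName`, so a culture such as fr-FR selects the "fr" entry.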

Related

Force .NET UTF-8 encoder to output 3-byte encoded characters

I am working on a foreign file format that was apparently developed in Japan. Most of their strings are stored with UTF-8 encoding in the 3-byte format (i.e. the capital A is represented as 0xEF,0xBC,0xA1). While it is no problem to decode such strings in .NET, I could not find a way to force the framework to output in the same format, as it will default to the abbreviated form (makes sense, but really I do need the 3-byte form).
Is there any standard functionality that will take care of this? Me being lazy I do not want to implement it myself :)
That's not the letter 'A'. It's a different rune, the FULLWIDTH LATIN CAPITAL LETTER A (U+FF21). Notice the extra spacing in 'Ａ'.
This isn't a different UTF8 format, it's a different character. Whoever produced this kind of file either made a mistake, or intentionally used those glyphs for layout purposes.
If you want to produce a similar text, you'll have to find how those characters are used in the first place, eg. for some words, every word, specific sections? Then you'll have to modify your own text to match this, eg by replacing normal letters with the full-width equivalents.
You can convert such strings with String.Normalize, using the KC or KD normalisation forms. For example, the following expression:
"'ＡＡ'".Normalize(System.Text.NormalizationForm.FormKC)
Returns:
'AA'
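Going the other way (plain text to full-width) has no framework call that I know of, but the Fullwidth Forms block mirrors printable ASCII at a fixed offset of 0xFEE0, so a small helper is enough. This is a sketch: `ToFullwidth` is a hypothetical name, and mapping the space to U+3000 (ideographic space) is an assumption about what the file format expects.

```csharp
using System;
using System.Linq;
using System.Text;

// Maps printable ASCII (U+0021..U+007E) to Fullwidth Forms (U+FF01..U+FF5E).
// Assumption: a space becomes U+3000; every other character is left untouched.
static string ToFullwidth(string text) =>
    new string(text.Select(c =>
        c >= '!' && c <= '~' ? (char)(c + 0xFEE0)
        : c == ' ' ? '\u3000'
        : c).ToArray());

byte[] bytes = Encoding.UTF8.GetBytes(ToFullwidth("A"));
Console.WriteLine(BitConverter.ToString(bytes)); // EF-BC-A1
```

The UTF-8 encoder itself needs no special setting; the three bytes appear because the string now contains U+FF21 instead of U+0041.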

Suggestions needed to Apply SuperScript to C# string For Xsl transformation

I want to apply superscript to a string for display.
It works fine with digits in superscript, but doesn't work for letters.
Suggestions needed.
Works fine for :
var o2 = "O₂"; // or "O\x2082"
var unit2 = "unit²"; // or "unit\xB2"
Does not work for :
var xyz = "ABC365\xBTM"
Can not get TM superscripted over string ABC365.
Suggestions appreciated.
You seem to have completely misunderstood what is going on here, so I'll try a very basic explanation.
Unicode defines a large number of characters (the code space holds 1,114,112 code points, of which well over 100,000 are assigned to characters). These came from a large number of historic sources, and there's no great rhyme or reason about which characters made it in and which didn't. The available characters include some subscript and superscript digits, for example x2082 is subscript 2, and x00B2 is superscript 2. It also includes some special symbols such as the trademark symbol x2122 which are traditionally rendered with a superscript appearance.
But there's no general mechanism in Unicode to render any character in superscript or subscript rendition. If you want to write X with an arbitrary superscript, Unicode won't help you: to achieve that you have to resort to mechanisms outside Unicode, such as HTML tagging. HTML allows you to render anything as subscript or superscript; Unicode only handles a few select cases.
C# recognizes the escape sequences \x followed by one to four hex digits and \uHHHH with exactly four, where H is any hex digit, to represent special characters by their Unicode code point value. So if there's a codepoint x2082 meaning subscript 2, you can write it as \x2082 or \u2082 in a string literal. But there's no codepoint for subscript-lowercase-italic N, so there's no way of representing that.
Now when you write \xBTM it should be clear that it cannot do what you want. \x consumes only hex digits, and T is not one, so C# reads \xB as the control character x000B (vertical tab) followed by the plain letters T and M. If you want the trademark symbol, you can use \x2122 or, less ambiguously, \u2122. If you want the two characters "T" and "M" in superscript rendition, you're out of luck; if you need to pass that sort of thing around in your application, you will need to pass strings containing HTML markup, rather than just plain Unicode.
You indicate that you're trying to create strings that will be used as input to an XSLT transformation. My suggestion would be to pass XML documents rather than plain strings, but I would need to understand the requirement in better detail before saying that's definitively the right solution.
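To see what the compiler actually makes of the literal from the question, compare it with one that uses the trademark code point. A minimal sketch; the variable names are mine:

```csharp
using System;

string broken = "ABC365\xBTM";          // \xB is U+000B (vertical tab), then the letters T and M
string withTrademark = "ABC365\u2122";  // U+2122 TRADE MARK SIGN, one character
string o2 = "O\u2082";                  // U+2082 SUBSCRIPT TWO

Console.WriteLine(broken.Length);        // 9
Console.WriteLine((int)broken[6]);       // 11
Console.WriteLine(withTrademark.Length); // 7
Console.WriteLine(o2.Length);            // 2
```

Because \x takes a variable number of digits, \u with its fixed four digits is the safer escape when ordinary text follows.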

French/Portuguese extended ASCII symbols in regex

I need to write an edit control mask that should accept [a-zA-Z] letters as well as extended French and Portuguese symbols like [ùàçéèçǵ]. The mask should accept both uppercase and lowercase symbols.
I found two suggestions:
[\p{L}]
and
[a-zA-Z0-9\u0080-\u009F]
What is the correct way to write such a regular expression?
Update:
My question is about forming a regexp that should match (not filter) French and Portuguese characters in order to display it in the edit control. Case insensitive solution won't help me.
[\p{L}] seems to be a Unicode character class, I need an ASCII regexp.
Digits are allowed, but special characters such as !##$%^&*)_+}{|"?>< are disallowed (should be filtered).
I found the most working variant is [a-zA-Z0-9\u00B5-\u00FF]
https://regex101.com/r/EPF1rg/2
The question is why the range for [ùàçéèçǵ] is \u00B5-\u00FF and not \u0080-\u009F ?
As I see from CP860 (Portuguese code page) and from CP863 (French code page) it should be in range \u0080-\u009F.
https://www.ascii-codes.com/cp860.html
Can anyone explain it?
The characters [µùàçéèçÇ] are in the range \u00B5-\u00FF because the Unicode standard says so; in Unicode, \u0080-\u009F are control characters, not letters. The "old" range (\u0080-\u009F as in the 860 Portuguese code page) was just one of many possible mappings of the available 128 extended characters, where you would sometimes find the same character at different code points depending on the code page.
C# strings are Unicode, and so are its regex features:
https://stackoverflow.com/a/20641460/1132334
If you really must specify a fixed range of characters, in C# you can just as well include them literally:
[a-zA-Z0-9µùàçéèçÇ]
Or, as others have suggested already, use the "letter" matching. That way it won't be up to you to define what a letter is in each alphabet, and you don't need to keep up with future changes of that definition yourself:
\p{L}
A third valid option could be to invert the specification and name only the punctuation characters and control characters that you would not allow.
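The difference between the two classes shows up on characters that sit inside \u00B5-\u00FF without being letters, such as the multiplication sign U+00D7. A small sketch, with anchors added so the whole input must match; the variable names are illustrative:

```csharp
using System;
using System.Text.RegularExpressions;

var letterOrDigit = new Regex(@"^[\p{L}0-9]+$");               // any Unicode letter or ASCII digit
var latin1Range   = new Regex(@"^[a-zA-Z0-9\u00B5-\u00FF]+$"); // the range from the question

Console.WriteLine(letterOrDigit.IsMatch("Fran\u00E7ais")); // True  (Français)
Console.WriteLine(letterOrDigit.IsMatch("a!b"));           // False
Console.WriteLine(latin1Range.IsMatch("\u00D7"));          // True  (× is inside the range)
Console.WriteLine(letterOrDigit.IsMatch("\u00D7"));        // False (× is a symbol, not a letter)
```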

How to display non-printable Ascii characters?

I came across this simple code to output ascii to the console:
Console.Write((char)1); //Output ☺
The thing is, it only works when I change the fonts of the console to RasterFonts, and it's ugly. I mean, look at those old text-based games, how did they draw some ascii art like this?
The Amazing Adventures of ANSI Dude, Snipes
How can I draw nice Ascii on that console?
Unless for some reason you are restricted to ASCII characters, you should use proper Unicode characters. This avoids potential conflicts from mapping control characters (0-31) to printable characters, and lets you use lines and borders directly with the .NET String type without going through encodings (line and border characters are part of "extended ASCII" code pages and, unlike the regular 7-bit ASCII codes 0-127, do not map directly to the Unicode code points with the same values).
Unicode "\u263a" would produce the face you are looking for. For drawing borders and lines, use characters from the Unicode Box Drawing range; for more characters see the overall table at http://unicode.org/charts/.
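As a rough illustration, a frame can be assembled from the Box Drawing block (U+2500..U+257F). `Frame` is a hypothetical helper, and whether the glyphs show up still depends on the console font, which the code cannot control.

```csharp
using System;
using System.Text;

// Builds a three-line frame around the text using Box Drawing characters.
static string Frame(string text)
{
    string bar = new string('\u2500', text.Length + 2);  // horizontal line
    return "\u250C" + bar + "\u2510\n"                   // top-left corner, bar, top-right corner
         + "\u2502 " + text + " \u2502\n"                // vertical lines around the padded text
         + "\u2514" + bar + "\u2518";                    // bottom-left corner, bar, bottom-right corner
}

// Without this the console code page may replace the glyphs with '?'.
Console.OutputEncoding = Encoding.UTF8;
Console.WriteLine(Frame("\u263A Hi"));
```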

Understanding Text Encoding (In .Net)

I have done very little with encoding of Text. Truthfully, I don't really even know what it means exactly.
For example, if I have something like:
Dim myStr as String = "Hello"
Is that 'encoded' in memory in a particular format? Does that format depend on what language I'm using?
If I were in another country, like China, for example, and I had a string of Chinese (mandarin? My apologies if I'm using the wrong words here) would the following code (that I've used fine on English strings) still work the same?
System.Text.UTF8Encoding encoding=new System.Text.UTF8Encoding();
return encoding.GetBytes(str);
Or would it lose all meaning when you convert that .Net string to a UTF8Encoding when that conversion isn't valid?
Finally, I've worked with .Net for a few years now and I've never seen, heard, or had to do anything with Encoding. Am I the exception, or is it not a common thing to do?
The .NET String class stores strings as UTF-16: that means 2 bytes per code unit, and most characters take a single code unit (although two code units can combine to form a single 4-byte character, a so-called "surrogate pair").
UTF-8 on the other hand will use a variable number of bytes necessary to represent a particular Unicode character, i.e. only one byte for regular ASCII characters, but maybe 3 bytes for a Chinese character. Both encodings allow representing all Unicode characters, so there is always a mapping between them: both are different binary representations (i.e. for storing in memory or on disk) of the same (Unicode) character set.
Since not all Unicode characters fit into the 2 bytes originally reserved by UTF-16, the format also allows combining two UTF-16 code units into a 4-byte character. Such a combination is called a surrogate pair: a pair of 16-bit values that, together, represent a single character.
UTF-8 does not have this problem, since the number of bytes per Unicode character is not fixed. A good general overview of UTF-8, UTF-16 and BOMs can be gathered here.
An excellent overview / introduction to Unicode character encoding is The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets
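The byte counts are easy to check for yourself. A minimal sketch with three sample strings of my own choosing, including the Chinese case from the question:

```csharp
using System;
using System.Text;

string ascii = "Hello";
string chinese = "\u4F60\u597D";                 // 你好
string emoji = char.ConvertFromUtf32(0x1F600);   // a code point above U+FFFF

Console.WriteLine(Encoding.UTF8.GetByteCount(ascii));      // 5
Console.WriteLine(Encoding.Unicode.GetByteCount(ascii));   // 10 (Encoding.Unicode is UTF-16)
Console.WriteLine(Encoding.UTF8.GetByteCount(chinese));    // 6
Console.WriteLine(Encoding.Unicode.GetByteCount(chinese)); // 4
Console.WriteLine(emoji.Length);                           // 2 (a surrogate pair)
Console.WriteLine(Encoding.UTF8.GetByteCount(emoji));      // 4

// Converting to UTF-8 and back loses nothing.
byte[] utf8 = Encoding.UTF8.GetBytes(chinese);
Console.WriteLine(Encoding.UTF8.GetString(utf8) == chinese); // True
```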
First and foremost: do not despair, you are not alone. Awareness of the treatment of character encoding and text representation in general is an unfortunately uncommon thing, but there is no better time to start learning than right now!
In modern systems, including .NET, text strings are represented in memory by some encoding of Unicode code points. These are just numbers. The code point for the character A is 65. The code point for the copyright sign © is 169. The code point for the Thai digit six is 3670.
The term "encoding" refers to how these numbers are represented in memory. There are a number of standard encodings that are used so that textual representation can remain consistent as data is transmitted from one system to another.
A simple encoding standard is UCS-2, whereby the code point is stored in the raw as a 16-bit word. This is limited due to the fact that it can only represent code points 0000-FFFF and such a range does not cover the full breadth of Unicode code points.
UTF-16 is the encoding used internally by the .NET String class. Most characters fit into a single 16-bit word here, but values larger than FFFF are encoded using surrogate pairs (see the Wiki). Because of this encoding scheme, code points D800-DFFF cannot be encoded by UTF-16.
UTF-8 is perhaps the most popular encoding used today, for a number of reasons which are outlined in the Wiki article.
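To make the surrogate mechanism concrete, here is the arithmetic done by hand for one code point above FFFF and checked against the framework. It is a sketch: the sample code point U+1D11E (musical G clef) is my own choice.

```csharp
using System;

Console.WriteLine((int)'A');       // 65
Console.WriteLine((int)'\u00A9');  // 169
Console.WriteLine((int)'\u0E56');  // 3670

// UTF-16 splits a code point above FFFF into two 16-bit code units.
int codePoint = 0x1D11E;
int offset = codePoint - 0x10000;
char high = (char)(0xD800 + (offset >> 10));    // top 10 bits
char low  = (char)(0xDC00 + (offset & 0x3FF));  // bottom 10 bits
string manual = new string(new[] { high, low });

Console.WriteLine($"{(int)high:X4} {(int)low:X4}");                 // D834 DD1E
Console.WriteLine(manual == char.ConvertFromUtf32(codePoint));      // True
Console.WriteLine(char.ConvertToUtf32(manual, 0) == codePoint);     // True
```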
UTF is a family of encodings (UTF-8, UTF-16, UTF-32) that differ in the size of their code units. Each encoding defines how much memory a character takes and how it is represented in that memory.
Generally we work with Unicode and ASCII.
Unicode itself is a character set, not a byte format; in UTF-16 most characters take 2 bytes.
ASCII is 1 byte per character.
Every ASCII character exists in Unicode. However, most Unicode characters cannot be represented in ASCII without some escaping scheme.
URLs use one such scheme, percent-encoding: the special character '%' tells you that the following two hex digits are the value of one encoded byte.
%20 for instance is byte 32, which is a space.
http://www.google.com?q=space%20character
Placing that URL in a browser would percent-decode the string and read the bytes as UTF-8, so q= would actually be interpreted as "space character"; notice the %20 is now a space.
UTF-16 uses 2 bytes for the same character, which one might try to write like this:
http://www.google.com?q=space%0020character
This example would actually fail, as a percent escape always covers exactly one byte and the URI is supposed to use UTF-8, but it demonstrates the point.
The UTF-16 code unit would be 0020, or two bytes with values 0 and 32 respectively.
Mandarin text consists of Unicode characters as well, and percent-encoding its UTF-8 bytes makes it representable in ASCII.
Here is a wiki article explaining a little more in depth
http://en.wikipedia.org/wiki/UTF-8
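The %20 example can be reproduced with Uri.EscapeDataString, which percent-encodes the UTF-8 bytes of a string. A short sketch; the sample strings are mine:

```csharp
using System;

string escaped = Uri.EscapeDataString("space character");
string chinese = Uri.EscapeDataString("\u4E2D");         // 中, three bytes in UTF-8

Console.WriteLine(escaped);                              // space%20character
Console.WriteLine(chinese);                              // %E4%B8%AD
Console.WriteLine(Uri.UnescapeDataString(escaped));      // space character
```

Each %XX stands for one byte, which is why the single Chinese character turns into three escapes.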
