So, I saw this question here on Stack Overflow (the question), and it says:
Update 2: This note on Roslyn confirms that the underlying platform defines the Unicode support for the compiler, and in the link to the code it explains that C# 6.0 supports Unicode 6.0 and up (with a breaking change for C# identifiers as a result).
So I am now wondering if I can, for example, read a file that contains Unicode 13.0 characters, or am I missing something?
There are three things at play here:
The compiler, which is only relevant for source file handling. If you try to compile code that includes characters the compiler is unaware of, I would expect the compiler to treat those characters as "unknown" in terms of their Unicode category. (So you wouldn't be able to use them in identifiers, they wouldn't count as whitespace etc.)
The framework, which is relevant when you use methods that operate on strings, or things like char.GetUnicodeCategory() - but which will let you load data from files even if it doesn't "understand" some characters.
Whatever applications do with the data - often data is just propagated from system to system in an opaque way, but often there are also other operations and checks performed on it.
If you need to store some text in a database, and then display it on a user's screen, it's entirely possible for that text to go through various systems that don't understand some characters. That can be a problem, in terms of areas such as:
Equality and ordering: if two strings should be equal in a case-insensitive comparison, but the system doesn't know about some of the characters within those strings, it might get the wrong answer
Validation: if a string is only meant to contain characters within certain Unicode categories, but the system doesn't know what category a character is in, it logically doesn't know for sure whether the string is valid.
Combining and normalization: again in terms of validation, if your system is meant to validate that a string is only (say) 5 characters long, but that's in a particular normalization form, then you need to be able to perform that normalization in order to get the right answer.
(There are no doubt lots of similar other areas.)
The compiler is basically the least important part of this. It does matter what level of support the framework has, but whether it's actually a problem to be a bit out of date will depend on what's happening with the data.
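To see the framework layer in action, here's a small sketch. The exact category reported for the newer character depends on which Unicode tables your framework version ships, so that line's output is deliberately left uncommented:

```csharp
using System;
using System.Globalization;

// A character the framework has known about since Unicode 1.0:
Console.WriteLine(char.GetUnicodeCategory('a'));  // LowercaseLetter

// U+1FAE0 (MELTING FACE) was added in Unicode 14.0. An older framework may
// report OtherNotAssigned here, a newer one OtherSymbol - but either way
// the character survives in the string unharmed.
string s = "\U0001FAE0";
Console.WriteLine(CharUnicodeInfo.GetUnicodeCategory(s, 0));

// The data itself is preserved regardless of category knowledge:
Console.WriteLine(s.Length);  // 2 UTF-16 code units (a surrogate pair)
```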
Related
I'm trying to program an application for learning foreign characters. If, for example, you want to learn Japanese, you'd have to memorize all the Hiragana, Katakana and Kanji characters (e.g. あ、い、か... = Hiragana; カ、サ、ケ... = Katakana; 本、学... = Kanji).
Example:
Some user is trying to learn Japanese. So he has to learn:
か = ka
本 = hon, meaning: basis/ book/ this
And he also has to learn the pronunciation.
My first question is whether there is any library or similar to do this easily in .NET. I also looked at Microsoft IME, but I couldn't really figure out how to use it in my project.
I also looked at the Unicode database, and it's basically possible to do it that way. I also managed to write a project that converts か to ka. The only thing missing is the meanings (for example 本 = basis/book/this), which are also provided by the Unicode database; unfortunately, I couldn't find them in the .XML file from which I get the UCD data. It works when I enter the character on the UCD website.
Another approach was to use the CLDR library, which also seems to be related to the UCD. Unfortunately, I couldn't figure out which of the two (UCD or CLDR) I should use.
CLDR: http://cldr.unicode.org/
My question is whether the UCD is the best way to do this, and whether I could also use the CLDR.
I don't really want to work with normal lists where I just type in all the characters myself. It would take too much time, especially for all the Kanji letters (more than 10,000).
Thanks
EDIT: I solved it; I extract the information from the Unicode Character Database (UCD). You can download the whole database as an .XML file. I just needed to learn how to handle it and find the correct attributes.
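A minimal sketch of that kind of extraction, using a tiny in-memory fragment shaped like the UCD's flat XML dump (ucd.all.flat.xml). The attribute names kDefinition, kJapaneseOn and kJapaneseKun follow the Unihan property names; verify them against the file you actually download:

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

// A tiny stand-in for ucd.all.flat.xml - the real file uses the same
// namespace and <char> elements carrying Unihan attributes.
string xml = @"<ucd xmlns='http://www.unicode.org/ns/2003/ucd/1.0'>
  <repertoire>
    <char cp='672C' kDefinition='root, origin, source; basis'
          kJapaneseOn='HON' kJapaneseKun='MOTO'/>
  </repertoire>
</ucd>";

XNamespace ns = "http://www.unicode.org/ns/2003/ucd/1.0";
var doc = XDocument.Parse(xml);

// Look up 本 (U+672C) by its code point attribute.
var hon = doc.Descendants(ns + "char")
             .First(c => (string)c.Attribute("cp") == "672C");

Console.WriteLine((string)hon.Attribute("kDefinition"));   // meanings
Console.WriteLine((string)hon.Attribute("kJapaneseOn"));   // on-reading: HON
Console.WriteLine((string)hon.Attribute("kJapaneseKun"));  // kun-reading: MOTO
```

For the full database, replace `XDocument.Parse(xml)` with `XDocument.Load("ucd.all.flat.xml")`.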
Both Google and Microsoft offer APIs you can call to translate text.
eg http://www.microsoft.com/en-us/translator/translatorapi.aspx
Depending on the type of service you choose, a small fee might be required.
They also offer audio for the translations.
No need to re-invent this wheel. :-)
If this was a code-page type question, this blog post is an amusing place to start: http://www.joelonsoftware.com/articles/Unicode.html
EDIT: in response to the comment about options: Google can supply several possible translations, e.g. for 本.
What you are looking for is a Transliteration API or library.
Well, actually, what you want is a romanization library, which is not quite the same thing, but you'd better forget I said that; you'll find out soon enough, and I don't want to shatter your daydreams.
You might want to look at this https://bitbucket.org/Dimps/unidecodesharpfork
or this http://unidecode.codeplex.com/
or this http://transliterator.codeplex.com/
I used unidecodesharpfork to transliterate Russian, and it's somewhat unsatisfactory: it only transliterates character by character, and doesn't properly romanize according to the ISO standard.
Unfortunately, "transliteration" (what you actually need is romanization, so by transliteration I really mean romanization) isn't quite as simple as having a list of characters in one alphabet and substituting each character with the corresponding character in another alphabet, which seems to be the basic belief of the unidecodesharpfork author.
There are rules, because sometimes transliteration depends on the preceding or following character, and there is also an ISO Standard on Romanization, e.g. for Russian (see http://en.wikipedia.org/wiki/Romanization_of_Russian).
Also, transliteration isn't culture-independent. For example, a German speaker transliterates Russian differently than an English speaker does.
Therefore, for serious usage, I would have used the Google Transliterate API (which provides the English speaker's standpoint only), but I just noticed it has been deprecated.
https://developers.google.com/transliterate/
Maybe it's high time to start typing in the transliterations for those 10,000 characters :)
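To make the "rules, not tables" point concrete, here's a toy sketch. This is explicitly not a real romanization library; the rules below are illustrative simplifications, not the ISO standard:

```csharp
using System;
using System.Text;

// A toy romanizer showing why per-character substitution isn't enough:
// the output for one Cyrillic letter can depend both on its position in
// the word (context rule) and on the target audience's language.
static string RomanizeRu(string word, bool forGermanReaders = false)
{
    var sb = new StringBuilder();
    for (int i = 0; i < word.Length; i++)
    {
        switch (word[i])
        {
            case 'е': sb.Append(i == 0 ? "ye" : "e"); break;      // position-dependent
            case 'ж': sb.Append(forGermanReaders ? "sch" : "zh"); break; // culture-dependent
            case 'д': sb.Append('d'); break;
            case 'а': sb.Append('a'); break;
            case 'р': sb.Append('r'); break;
            default:  sb.Append(word[i]); break;
        }
    }
    return sb.ToString();
}

Console.WriteLine(RomanizeRu("еда"));         // "yeda"  (word-initial е)
Console.WriteLine(RomanizeRu("жар"));         // "zhar"  (for English readers)
Console.WriteLine(RomanizeRu("жар", true));   // "schar" (for German readers)
```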
I'm interested in language-specific validators via regex. I know that I can validate a person's name, in any language, with a pattern like this:
"[\p{L}\p{M}]"
But what if I want validation to be for a specific language? It would be nice if my thread's CurrentUICulture or CurrentCulture setting would simply convert the meaning of "[\w]" to something appropriate for German, Spanish, English, and especially Chinese. Does it work that way? If yes, then this is likely my answer.
If not, then my next interest would be to use a regex script annotation. However, I notice that:
The list given in that link does not include simplified "Chinese", which I am particularly interested in.
I don't think .NET regex capabilities support script-based matching. Yes? No?
So my final option, if I can't get the prior two options to work, is to turn to named blocks. At least the list of .NET-supported named blocks includes several entries for CJK. I suppose I can simply combine the several CJK blocks and call that (simplified) "Chinese".
Thoughts?
I have concluded that, in a .NET setting, there is no such thing as a regex that is sensitive to the CurrentUICulture. I have also concluded that the most permissive reasonable scenario is to perform a validation, applicable to all languages simultaneously, that simply rejects all forms of non-printable characters, "dingbats", angle brackets (to prevent markup injection), and math symbols:
@"^[^\p{C}<>\p{Sm}\p{So}]*$"
The mid-permissive approach is to use a string that expressly captures both Western and Eastern character sets (including diacritics and "combining characters"):
@"^[\p{L}\p{M}\p{Pd}\p{Pi}\p{Pf}\s]*$"
The least-permissive approach, if I want only Western characters, is this:
@"^[\p{IsBasicLatin}\p{IsLatin-1Supplement}\p{Pd}\p{Pi}\p{Pf}-[\p{N}]]*$"
The above still allows all forms of quote marks, which is usually needed for names like O'Toole.
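A quick sketch exercising the three patterns with .NET's Regex engine, which supports both the named blocks and the character-class subtraction used above:

```csharp
using System;
using System.Text.RegularExpressions;

// The three validation patterns, from most to least permissive:
var anyPrintable = new Regex(@"^[^\p{C}<>\p{Sm}\p{So}]*$");
var eastAndWest  = new Regex(@"^[\p{L}\p{M}\p{Pd}\p{Pi}\p{Pf}\s]*$");
var westernOnly  = new Regex(@"^[\p{IsBasicLatin}\p{IsLatin-1Supplement}\p{Pd}\p{Pi}\p{Pf}-[\p{N}]]*$");

Console.WriteLine(anyPrintable.IsMatch("O'Toole"));    // True  - apostrophe is fine
Console.WriteLine(anyPrintable.IsMatch("<b>x</b>"));   // False - angle brackets rejected
Console.WriteLine(eastAndWest.IsMatch("Müller 王"));    // True  - letters plus whitespace
Console.WriteLine(westernOnly.IsMatch("Anne-Marie"));  // True  - Basic Latin plus hyphen
Console.WriteLine(westernOnly.IsMatch("王"));           // False - CJK excluded
```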
My desktop C# application receives various documents from users, possibly in different encodings.
I need to show users existing documents, let them manipulate the documents in my UI, and store the documents for future use.
Adding the notion of "encoding" to each of these steps seems complex to me. I was thinking of always converting the user's input documents to UTF-8 internally, so that my UI and data store would not need to worry about it. Then, when the user wants the document back as a file, I would ask which encoding to use.
Does this make sense? Are encodings interoperable? What if I only support Unicode?
In your application you should use the native Unicode support (whatever the platform uses for storing Unicode). On Windows and OS X this is a form of UTF-16; on Linux it is UTF-8.
When it comes to saving/loading files or communicating with external systems, go for UTF-8.
Also, do not confuse code-pages with encodings.
Regarding code pages, I don't think supporting them is so important anymore; at least it should not be a priority for you. Because ANSI encodings have no BOM, it is really hard to guess the encoding of such files (in fact, it is impossible to do perfectly).
Encodings are not interoperable, since some have characters that others don't have.
An internal Unicode representation is a good idea, since it has the widest character set, but I'd advise saving the document back in its original encoding if the added characters still fit in that encoding. If not, prompt the user that you'll save in Unicode in order to encode those characters correctly.
Just decode all the documents to String. Strings in .NET are always Unicode (UTF-16). Only use encodings when you are reading or writing a file.
When you get ANSI files, you need to know the code page before converting to Unicode (i.e. creating a UTF-16 string); otherwise the bytes from 128 to 255 could map to the wrong Unicode code points. You can also get into trouble when you want to store a Unicode string in an ANSI file, because code points up to 0x10FFFF cannot fit into a single byte.
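A sketch of "decode on the way in, choose an encoding on the way out". Windows-1252 here is an assumption standing in for whatever ANSI code page the user's file actually uses - you have to know or ask:

```csharp
using System;
using System.IO;
using System.Text;

// On .NET Core / .NET 5+, legacy code pages need this provider registered
// (from the System.Text.Encoding.CodePages assembly).
Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
Encoding ansi = Encoding.GetEncoding(1252);  // Windows-1252, as an example

// Stand-in for a user-supplied ANSI file:
string path = Path.Combine(Path.GetTempPath(), "input.txt");
File.WriteAllText(path, "café – naïve", ansi);

// In: decode to a plain .NET string (UTF-16 in memory). From here on the
// UI and data store deal only with strings, not encodings.
string text = File.ReadAllText(path, ansi);

// Out: UTF-8 round-trips everything; writing back to an ANSI code page
// can silently lose characters the code page doesn't contain.
File.WriteAllText(path + ".utf8", text, Encoding.UTF8);
```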
There are only two reasons to ever use UTF-16 in an interchange format (that is, one that gets sent from A to B):
You didn't design the document type, and have to interoperate with something that already uses it.
Your content is such that, with some languages, UTF-16 is shorter. This is relatively rare, as even with those languages there is often a high proportion of characters from the ASCII range in the mix, so UTF-8 ends up being more concise.
Barring that case, there are only two reasons to ever use anything other than UTF-8 in an interchange format:
You didn't design the document type, and have to interoperate with something that already uses legacy character sets.
You hate people.
Number 2 is particularly pressing if you particularly hate foreigners and people who don't use your own language, but if you just hate people generally, you'll cause enough headaches to enough people that you should find the exercise satisfying.
Now, extending from that: if a given document format designed by someone else allows UTF-8, and you can expect all modern software dealing with it to handle UTF-8, then there are only two reasons not to use UTF-8:
There is some sort of security check done on the data to make sure it hasn't been changed (note: if you in any way edit or alter the document, this inherently doesn't apply).
You hate people. Again with a bonus for xenophobes.
For your internal storage, it's just a matter of whatever is most useful to you. As a rule, .NET tends to default to UTF-16 in memory (char and string work with that) and UTF-8 when writing to and reading from streams. If your backing store is SQL Server, then UTF-16 is your friend (the 'nchar', 'nvarchar' and 'ntext' variants of 'char', 'varchar' and 'text', which avoid issues if the character set was set to anything other than UTF-8), and other databases either have their own way of dealing with modern characters, or can use UTF-8.
In general though, use UTF-8 unless someone forces you to do otherwise (because either they were forced to deal with code from the 1990s or earlier, or because they hate people).
I am trying to compile a one-off "script", an autogenerated C# program. This program contains 120,000 different string literals. The C# compiler can't build this, saying:
Unexpected error writing metadata to file '<removed>' -- 'No logical space left to create more user strings.'
Is there a hard limit in .NET on the number of string literals one can have in a module? What is this limit? Is there any way around it?
There is a limit on the number of strings in an assembly, just as there are limits on the number of classes, fields, etc. Each of these is identified by a 32-bit metadata token, where the upper-most byte is a meta-type code and the lower bits identify the individual record. For strings, the token actually encodes an offset into the user-string heap, so you can have at most 2**24 bytes of string data, i.e. 16 MiB. (The user-string heap stores strings as UTF-16, per ECMA-335.)
I have no idea what the limits are in .NET, but if it's a resource limitation, I'd solve it the same way we solved running out of space in the bad old 64K-segment days.
Externalize the strings: put them in an external file and store only the offsets (and lengths, if they're not null-terminated). When you need a string, load it from the file and use it.
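A minimal sketch of that idea: pack all literals into one blob, remember (offset, length) pairs, and load strings on demand, so the assembly's user-string heap stays small. In a real build you'd generate the index at the same time as the autogenerated program:

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

var literals = new[] { "first string", "second string", "third string" };

// Build the blob and its (offset, length) index.
var blob = new MemoryStream();
var index = new List<(int Offset, int Length)>();
foreach (var s in literals)
{
    byte[] bytes = Encoding.UTF8.GetBytes(s);
    index.Add(((int)blob.Position, bytes.Length));
    blob.Write(bytes, 0, bytes.Length);
}
string path = Path.Combine(Path.GetTempPath(), "strings.bin");
File.WriteAllBytes(path, blob.ToArray());

// At run time, fetch string #1 by offset instead of embedding a literal:
byte[] all = File.ReadAllBytes(path);
var (off, len) = index[1];
Console.WriteLine(Encoding.UTF8.GetString(all, off, len));  // "second string"
```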
I've never encountered this before myself, but there is some information and suggested solutions on MSDN
However, unless you have a good reason for using hard coded string literals, you should probably consider using resource files rather than literals for most strings.
C# and Java allow almost any character in class names, method names, local variables, etc. Is it bad practice to use non-ASCII characters, testing the boundaries of poor editors and analysis tools and making the code difficult for some people to read, or is American arrogance the only argument against it?
I would stick to English, simply because you usually never know who will be working on that code, and because some third-party tools used in the build/testing/bug-tracking process may have problems. Typing äöüß on a non-German keyboard is simply a PITA, and I believe anyone involved in software development should be able to speak English - but maybe that's just my arrogance as a non-native English speaker.
What you call "American arrogance" is not whether or not your program uses international variable names, it's when your program thinks "Währung" and "Wahrung" are the same words.
I'd say it entirely depends on who's working on the codebase.
If you have a small group of developers who all share a common language and you don't ever plan needing anyone who doesn't speak the language to work on the code then go ahead and use whatever characters you want.
If you need to have people of varying cultures and languages working on the code then it's probably best to stick with English since it's the common denominator for just about everyone in the world.
If your business users are non-English speakers and you think Domain-Driven Design has something to it, then there is another aspect: how do we, as developers, use the same domain language as our business without any translation overhead?
That does not only mean translation between languages, say English and Norwegian, but also between different words. We should use the exact same words as our business does for our entity classes and services.
I have found it easier to just give in and use my native language. Now that my code uses the same words, it's easier to have a conversation with my domain experts. And after a while you get used to it, just as you got used to code without Hungarian notation.
I used to work on a development team that happily wiped their asses with any naming (and, for that matter, any other coding) conventions. Believe it or not, having to cope with ä's and ö's in the code was a contributing factor in my resigning. Though I'm Finnish, I prefer writing code with US keyboard settings, because curly and square brackets are a pain to type on a Finnish keyboard (try Right Alt + 7 and Right Alt + 0 for curlies).
So I say stick with the ASCII characters.
Here's an example of where I've used non-ASCII identifiers, because I found it more readable than replacing the Greek letters with their English names, even though I don't have θ or φ on my keyboard (I relied on copy and paste).
However these are all local variables. I would keep non-ASCII identifiers out of public interfaces.
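For instance, locals like these compile fine in C# (whether using them is wise is the question at hand). A toy spherical-to-Cartesian conversion where the Greek letters mirror the usual math notation:

```csharp
using System;

// C# permits Unicode letters in identifiers, so formula-heavy locals can
// match the textbook symbols directly.
double r = 2.0;
double θ = Math.PI / 3;  // polar angle
double φ = Math.PI / 4;  // azimuthal angle

double x = r * Math.Sin(θ) * Math.Cos(φ);
double y = r * Math.Sin(θ) * Math.Sin(φ);
double z = r * Math.Cos(θ);

Console.WriteLine($"({x:F3}, {y:F3}, {z:F3})");  // (1.225, 1.225, 1.000)
```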
It depends:
Does your team conform to any existing standards that require your using ASCII?
Is your code ever going to be feasibly reused or read by someone who doesn't speak your native language?
Do you envision a scenario where you'll need to ask for help online, and your code sample might not survive being copy-pasted in as-is?
Is there any tool in your suite that might not support your code's encoding?
If you answered 'yes' to any of the above, stay ASCII only. If not, go forward at your own risk.
Part of the problem is that the Java and C# languages and their libraries are based on English words like if and toString(). I personally would not like to switch between a non-English language and English while reading code.
However, if your database, UI and business logic (including metaphors) are already in some non-English language, there's no need to translate every method name and variable into English.
If you get past the other prerequisites, you then have one extra (and IMHO more important) consideration: how difficult is the symbol to type?
On my regular en-US keyboard, the only ways I know to type the letter ç are to hold Alt and hit 0231 on the numeric keypad, or to copy and paste.
That would be a huge roadblock to typing quickly. You don't want to slow your coding down with trivial stuff like this if you aren't forced to. International keyboards may alleviate the problem, but then what happens when you have to code on a laptop that doesn't have an international keyboard?
I would stick to ASCII characters, because if anyone on your development team uses an SDK that only supports ASCII, or you want to make your code open source, a lot of problems could arise. Personally, I would not do it even if you are not planning on bringing anyone who doesn't speak the language onto the project: you are running a business, and presumably you want that business to expand, which in this day and age means transcending national borders. English is the language of the realm, and even if you name your variables in a different language, there is little to no point in using non-ASCII characters in your code. Leave it to the language to deal with UTF-8 data: my iPhone program (which involves tons of user data going between the phone and server) has full UTF-8 support, but no UTF-8 in the source code. It just seems to open such a large can of worms for almost no benefit.
There is another hazard to using non-ASCII characters, though it will probably only bite in obscure cases. In Java, the allowed identifier characters are defined in terms of the methods Character.isJavaIdentifierStart(int) and Character.isJavaIdentifierPart(int), which are defined in terms of Unicode. However, the exact Unicode version used depends on the version of the Java platform, as specified in the documentation for java.lang.Character.
Since character properties change slightly from one Unicode version to the next, it's possible (but probably very unlikely) you could have identifiers that are valid in one version of Java, but not in the next.
As already pointed out, unless method names mostly match the language, it is a bit weird to constantly switch languages while reading.
For the Scandinavian languages and German, which I can speak and can thus speak for, I would at least recommend using the standard substitutions, i.e.
ä/æ -> ae, ö/ø -> oe, å -> aa, ü -> ue
etc., just in case, since others may find it difficult to type the original letters without keyboard/keymap changes. Imagine if you suddenly had to work with a codebase where the developers used a third language (for instance one including the French ç) and didn't do this. Switching between more than two keymaps to type efficiently would be painful in my experience.
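Applying those standard substitutions mechanically is straightforward; a small sketch (the mapping table is exactly the list above, plus ß, which is my addition):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// The standard German/Scandinavian ASCII substitutions:
var subs = new Dictionary<char, string>
{
    ['ä'] = "ae", ['æ'] = "ae", ['ö'] = "oe", ['ø'] = "oe",
    ['å'] = "aa", ['ü'] = "ue", ['ß'] = "ss",
};

string Asciify(string name) =>
    string.Concat(name.Select(c => subs.TryGetValue(c, out var r) ? r : c.ToString()));

Console.WriteLine(Asciify("Währung"));   // "Waehrung"
Console.WriteLine(Asciify("Færøerne")); // "Faeroeerne"
```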