Make my application multilingual - c#

What is the best way to make my C#/WPF application support different languages?
I want to be able to give my users the choice to choose a language.
Thanks

There is a lot of information to digest, but the .NET Framework has built-in support for internationalization.
I wish I could give you an easy example, but it is not a "drag and drop" solution. You will need to put a lot of thought into how you design your application for this.

Pick up the following excellent book: .NET Internationalization by Smith-Ferrier. One of my next projects will be to internationalize our applications. This is going to be my guide.

How about tokenising the strings you are going to use in your application and then have separate language files which are then loaded in at runtime?
E.g. on my form I want to show 'First Name' in different languages, so the label is stubbed with the token "firstnamestring", and when the form loads I replace the text with whatever is configured in the language file for the current language.
The language files can be simple XML with key value pairs of 'token' and 'display text'.
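The token approach above could be sketched roughly as follows. This is only an illustration: the XML layout (a `strings` root with `string` elements carrying `token` and `text` attributes) and the `LanguageFile` helper are assumptions made for the example, not an existing API.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

// Hypothetical helper: maps tokens to display text for one language.
public static class LanguageFile
{
    // Parses XML of the assumed form:
    // <strings><string token="firstnamestring" text="First Name" /></strings>
    public static Dictionary<string, string> Parse(string xml)
    {
        return XDocument.Parse(xml)
            .Root
            .Elements("string")
            .ToDictionary(
                e => (string)e.Attribute("token"),
                e => (string)e.Attribute("text"));
    }

    // Returns the display text, or the token itself when no translation exists,
    // so a missing entry shows up on screen instead of throwing.
    public static string Lookup(IReadOnlyDictionary<string, string> strings, string token)
    {
        return strings.TryGetValue(token, out var text) ? text : token;
    }
}

public static class Program
{
    public static void Main()
    {
        // In a real application this would come from File.ReadAllText
        // on the file for the selected language.
        const string spanish =
            "<strings><string token=\"firstnamestring\" text=\"Nombre\" /></strings>";

        var strings = LanguageFile.Parse(spanish);
        Console.WriteLine(LanguageFile.Lookup(strings, "firstnamestring"));
        Console.WriteLine(LanguageFile.Lookup(strings, "lastnamestring"));
    }
}
```

Falling back to the token itself is a design choice for the sketch; it makes untranslated strings easy to spot during testing.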

How much text are we talking, here? You could always just use case statements that check a language variable and output the appropriate text, but this will turn into a mess if you've got a ton to replace.
In response to your comment:
Realize that this may not be the best solution, as I'm not familiar with the built-in support in .NET.
You can simply keep a variable that contains the language, for example, strLang, and when you're placing text just have your program run a case statement to output the proper language.
switch (strLang)
{
    case "EN":
        // Output English text here
        break;
    case "SP":
        // Output Spanish text here
        break;
}
As you can see, it can really clutter your code depending on the number of languages and the amount of text, so you may want to check out the book Randy suggested.

Related

Determining if text is english

I currently have a lot of comments and text in my database, mainly in English. However, if a piece of text isn't in English, I want to translate it to English.
I know I can call a translation API to determine the language, but I don't want to make millions of translation API calls for text that most likely won't need translating.
I am looking for a way to determine if the text is English or not. I don't need to know what language it is, just that it isn't English, then if it isn't English I will send it to a translation service API.
The Chromium project (including its most popular implementation, Google Chrome) solves this problem with https://github.com/google/cld3.
If your only need is to detect whether or not something is English, then in theory you can use something even more compact.
Most good language detectors use trigram frequency (a gram being a single character) or trigram frequency overlaid with word frequency. For your application it seems that you could use a hybrid approach: the first pass is local, of low accuracy, and tuned to be a bit aggressive so that it does not miss any potential English, and the second pass hits an API like Google Translate.
The popularity of English and the amount of English data usually help when applying NLP solutions to it, but in this case you will unfortunately often find false positives for English, because sources of data that are listed as English contain other languages or non-language content such as garbage characters or URLs.
Note also that for many queries there is no single correct answer. Good systems will return a weighted list of possibilities, but for a query like [dan], [a], [example#example.com] or [hi! como estas? i'm in class ahorita] the most correct answer will depend on your application and may not exist.
You can use NTextCat to determine input language.
Research (by a certain Zipf) determined that, for the most part, there are some words which are used very frequently and a lot of words which are rarely used.
If I was given this problem, I'd probably put down a list of the top X used words. Then for each comment I would see if there's a match.
It's not perfect (and if the text is very particular, or misspelt, you've got an issue) - but I think it's an acceptable heuristic.
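A minimal sketch of the common-word heuristic described above. The word list, the 0.2 threshold, and the `EnglishHeuristic` name are all assumptions for illustration; a real list would hold the top few hundred words and the threshold would need tuning on your own data.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class EnglishHeuristic
{
    // Tiny illustrative list of very frequent English words (an assumption, not a curated list).
    private static readonly HashSet<string> CommonWords = new HashSet<string>(
        new[] { "the", "be", "to", "of", "and", "a", "in", "that", "have", "i",
                "it", "for", "not", "on", "with", "is", "this", "you", "are", "was" },
        StringComparer.OrdinalIgnoreCase);

    // Fraction of the words in the text that appear in the common-word list.
    public static double CommonWordRatio(string text)
    {
        var words = text.Split(
            new[] { ' ', '\t', '\r', '\n', '.', ',', ';', ':', '!', '?', '"', '(', ')' },
            StringSplitOptions.RemoveEmptyEntries);
        if (words.Length == 0) return 0.0;
        return (double)words.Count(w => CommonWords.Contains(w)) / words.Length;
    }

    // Text at or above the threshold is treated as English and skips the translation API.
    public static bool LooksEnglish(string text, double threshold = 0.2)
    {
        return CommonWordRatio(text) >= threshold;
    }
}

public static class Program
{
    public static void Main()
    {
        Console.WriteLine(EnglishHeuristic.LooksEnglish("This is a sentence that was written for you"));
        Console.WriteLine(EnglishHeuristic.LooksEnglish("Esto es una sentencia"));
    }
}
```

Short words such as "a" also occur in other languages, so very short comments can be misclassified; anything the local check rejects would still go to the translation API.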
See this post
More specifically, take a look at trigrams.

Detect language of text [duplicate]

Is there any C# library which can detect the language of a particular piece of text? i.e. for an input text "This is a sentence", it should detect the language as "English". Or for "Esto es una sentencia" it should detect the language as "Spanish".
I understand that language detection from text is not a deterministic problem. But both Google Translate and Bing Translator have an "Auto detect" option, which best-guesses the input language. Is there something similar available publicly, preferably in C#?
Yes indeed, TextCat is very good for language identification. And it has a lot of implementations in different languages.
There were no ports for .NET, so I have written one: NTextCat (NuGet, Online Demo).
It is a pure .NET Standard 2.0 DLL plus a command-line interface to it. By default, it uses a profile of 14 languages.
Any feedback is much appreciated! New ideas and feature requests are welcome too :)
Language detection is a pretty hard thing to do.
Some languages are much easier to detect than others simply due to the diacritics and digraphs/trigraphs used. For example, double-acute accents are used almost exclusively in Hungarian. The dotless i ‘ı’, is used exclusively [I think] in Turkish, t-comma (not t-cedilla) is used only in Romanian, and the eszett ‘ß’ occurs only in German.
Some digraphs, trigraphs and tetragraphs are also a good give-away. For example, you'll most likely find ‘eeuw’ and ‘ieuw’ primarily in Dutch, and ‘tsch’ and ‘dsch’ primarily in German etc.
More giveaways would include common words or common prefixes/suffixes used in a particular language. Sometimes even the punctuation that is used can help determine a language (quote-style and use, etc).
If such a library exists I would like to know about it, since I'm working on one myself.
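The character and letter-group give-aways mentioned above could be turned into a first rough check along these lines. The tables only contain the examples named in the answer, and `CharacterHints` is a hypothetical helper; these are hints rather than proof, since loanwords and names can contain any of them.

```csharp
using System;
using System.Collections.Generic;

public static class CharacterHints
{
    // Characters named above as strong hints for one language.
    private static readonly Dictionary<char, string> CharHints = new Dictionary<char, string>
    {
        { '\u0151', "Hungarian" }, // o with double acute
        { '\u0171', "Hungarian" }, // u with double acute
        { '\u0131', "Turkish" },   // dotless i
        { '\u021B', "Romanian" },  // t with comma below
        { '\u00DF', "German" },    // eszett
    };

    // Letter groups named above as typical for one language.
    private static readonly Dictionary<string, string> GroupHints = new Dictionary<string, string>
    {
        { "eeuw", "Dutch" }, { "ieuw", "Dutch" },
        { "tsch", "German" }, { "dsch", "German" },
    };

    // Returns the first language hinted at, or null when nothing matches.
    public static string Guess(string text)
    {
        foreach (var c in text)
        {
            if (CharHints.TryGetValue(c, out var language)) return language;
        }

        var lower = text.ToLowerInvariant();
        foreach (var pair in GroupHints)
        {
            if (lower.Contains(pair.Key)) return pair.Value;
        }

        return null;
    }
}

public static class Program
{
    public static void Main()
    {
        Console.WriteLine(CharacterHints.Guess("Stra\u00DFe") ?? "(no hint)");
        Console.WriteLine(CharacterHints.Guess("nieuw") ?? "(no hint)");
        Console.WriteLine(CharacterHints.Guess("hello") ?? "(no hint)");
    }
}
```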
Please find a C# implementation based on 3-gram analysis here:
http://idsyst.hu/development/language_detector.html
Here you have a simple detector based on bigram statistics (basically, it learns from a big set which bigrams occur most frequently in each language, then counts those in a piece of text and compares the counts to the previously learned values):
http://allantech.blogspot.com/2007/07/automatic-language-detection.html
This is probably good enough for many (most?) applications and doesn't require Internet access.
Of course it will perform worse than Google's or Bing's algorithm (which themselves aren't great). If you need excellent detection performance, you would have to do a lot of hard work over huge amounts of data.
The other option would be to leverage Google's or Bing APIs if your app has Internet access.
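The bigram-statistics idea described above can be sketched as follows. This is not the code behind either link; `BigramDetector` and the scoring rule (average learned frequency of the text's bigrams) are assumptions chosen to keep the example short, and the training samples in `Main` are far too small for real use.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class BigramDetector
{
    // Learns the relative frequency of each two-character sequence in a sample.
    public static Dictionary<string, double> BuildProfile(string sample)
    {
        var text = sample.ToLowerInvariant();
        var counts = new Dictionary<string, int>();
        for (int i = 0; i + 1 < text.Length; i++)
        {
            var bigram = text.Substring(i, 2);
            counts.TryGetValue(bigram, out var n);
            counts[bigram] = n + 1;
        }

        double total = counts.Values.Sum();
        return counts.ToDictionary(kv => kv.Key, kv => kv.Value / total);
    }

    // Average learned frequency of the text's bigrams; unseen bigrams contribute 0.
    public static double Score(string text, IReadOnlyDictionary<string, double> profile)
    {
        var lower = text.ToLowerInvariant();
        if (lower.Length < 2) return 0.0;

        double sum = 0.0;
        for (int i = 0; i + 1 < lower.Length; i++)
        {
            profile.TryGetValue(lower.Substring(i, 2), out var frequency);
            sum += frequency;
        }
        return sum / (lower.Length - 1);
    }

    // Returns the name of the profile with the highest score.
    public static string Detect(string text, Dictionary<string, Dictionary<string, double>> profiles)
    {
        return profiles.OrderByDescending(p => Score(text, p.Value)).First().Key;
    }
}

public static class Program
{
    public static void Main()
    {
        // Toy training data; a usable detector needs large samples per language.
        var profiles = new Dictionary<string, Dictionary<string, double>>
        {
            { "English", BigramDetector.BuildProfile("the quick brown fox jumps over the lazy dog and then the other thing") },
            { "Spanish", BigramDetector.BuildProfile("el rapido zorro marron salta sobre el perro perezoso y luego la otra cosa") },
        };

        Console.WriteLine(BigramDetector.Detect("this is a sentence", profiles));
        Console.WriteLine(BigramDetector.Detect("esto es una sentencia", profiles));
    }
}
```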
You'll want a machine learning algorithm based on hidden Markov chains, trained on a bunch of texts in different languages.
Then, when it gets to the unidentified text, the language with the closest 'score' is the winner.
There is a simple tool to identify text language:
http://www.detectlanguage.com/
I've found that "textcat" is very useful for this. I've used a PHP implementation, PHP Text Cat, based on this original implementation, and found it reliable. If you have a look at the sources, you'll find it's not a terrifyingly difficult thing to implement in the language of your choice. The hard work -- the letter combinations that are relevant to a particular language -- is all in there as data.

best way to store/use multiple languages

If I wanted to create a C# application that supports multiple languages, how should I store them?
I'd probably use constants in the application as value holders.
Such as:
Console.Write(FILE_NOT_FOUND);
When compiled, it would change into the string determined by the language.
I'll probably stick to 3 languages (Danish, English, German), not that I think it matters though.
It seems to be a waste to have a class file for each language, all of which are processed when the application is compiled. It would also mean that you'd have to re-compile and re-distribute the whole program every time you want to change a string.
As far as I know, hardcoded strings are a bad thing.
Maybe a text file?
English.txt
Line1: FILE_NOT_FOUND=File Not Found. Try Again
Line2
Line3
etc.
Danish.txt
Line1: FILE_NOT_FOUND=Filen blev ikke fundet. Prøv igen
Line2
Line3
etc.
and so on.
If the user selects English, it reads the text file and sets the different constant values.
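For illustration, reading such a KEY=Text file could look like the sketch below. `StringTable` is a hypothetical helper, and since C# constants cannot be assigned at runtime, the values end up in a dictionary rather than in real constants.

```csharp
using System;
using System.Collections.Generic;

public static class StringTable
{
    // Parses lines of the form KEY=Text; blank lines and lines without '=' are skipped.
    public static Dictionary<string, string> Parse(IEnumerable<string> lines)
    {
        var table = new Dictionary<string, string>();
        foreach (var line in lines)
        {
            int separator = line.IndexOf('=');
            if (separator <= 0) continue;

            var key = line.Substring(0, separator).Trim();
            var text = line.Substring(separator + 1).Trim();
            table[key] = text;
        }
        return table;
    }
}

public static class Program
{
    public static void Main()
    {
        // In the application these lines would come from File.ReadAllLines("English.txt").
        var english = StringTable.Parse(new[] { "FILE_NOT_FOUND=File Not Found. Try Again" });
        Console.Write(english["FILE_NOT_FOUND"]);
    }
}
```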
The last one I can think of is placing it in a SQL database.
Could you give me some input? :)
Also, I tried writing FILE _ NOT _ FOUND (without spaces), but the text editor wouldn't let me.
Use a resource file. That's the standard way to handle localization.
For details, see this tutorial.
--- EDIT ---
An alternative tutorial is available here. This one uses much better naming, so it may be more clear how it works.
I think your best option is to use the built in localization resources of the .NET Framework. You can read more about the mechanics of that here.
As for using a database to store your localised elements (text, images and the like), this is certainly a common option, but I think it's mostly because developers understand getting data from a database better than working with satellite assemblies and the like. There are a number of problems with using a database, so I'll name only a few: 1) added complexity of deployment of the application, 2) additional load on the database server, 3) where do you store the localized messages to say that the database is down :)
Using some sort of text file (likely XML) also carries with it some deployment issues, but more importantly the perceived flexibility of making text changes 'on the fly' is somewhat overrated. Apart from spelling mistakes and awkward wording, you'll almost always be shipping a new build as the text of your app changes.
Check out the Localization/Internationalization samples here:
http://msdn.microsoft.com/en-us/goglobal/bb688096.aspx
I've also heard good things about this book.
This topic is far too large for a reply.
The process of making a program ready for new languages is normally called "internationalization" or "i18n", and the process of taking that and actually making it run is "localization" or "l10n".
Briefly, you want to have hardcoded strings replaced by string resources, as you say, and then typically create different resource files for different languages. Assuming you're working in .NET (a fairly good assumption for C#), there's a lot of stuff Microsoft does to make it easier.
Remember that there are other localization issues than language. For example, the Danish currency symbol is not '$' but 'kr.' (Denmark uses the krone, not the euro), dates are almost certainly abbreviated differently, and many places use ',' as the decimal separator and '.' as the thousands separator, the opposite of the English practice.
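To see these non-language differences in practice, the same values can be formatted with two cultures. A minimal sketch, assuming the `en-US` and `da-DK` cultures are available on the machine; `CultureFormatting` is a hypothetical helper name, while `CultureInfo` and the format strings are standard .NET.

```csharp
using System;
using System.Globalization;

public static class CultureFormatting
{
    // Formats a number with two decimals using the named culture's separators.
    public static string FormatNumber(decimal value, string cultureName)
    {
        return value.ToString("N2", CultureInfo.GetCultureInfo(cultureName));
    }
}

public static class Program
{
    public static void Main()
    {
        decimal amount = 1234.56m;
        var date = new DateTime(2020, 3, 31);

        foreach (var name in new[] { "en-US", "da-DK" })
        {
            var culture = CultureInfo.GetCultureInfo(name);
            Console.WriteLine(name);
            Console.WriteLine("  number:   " + CultureFormatting.FormatNumber(amount, name));
            Console.WriteLine("  currency: " + amount.ToString("C", culture));
            Console.WriteLine("  date:     " + date.ToString("d", culture));
        }
    }
}
```

The exact currency and date output depends on the culture data shipped with the operating system, so it is best checked on the target platform.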

How do I best localize an entire app to many different languages?

I'm using Visual Studio (2005 and up). I am looking into making an application where the user can change the language for all menus, input formats and such. How would I go about doing this? I suppose there is some complete feature within .NET that can help me with this.
I need to take the following into account (and fill me in if I miss some obvious stuff)
Strings (menus, texts)
Input data (parsing floats, dates, etc..)
Should be easy to add support for another language
I'm not an expert with .NET by any means but Localization is never just as simple as "swapping out String values" or "changing date formats". There is much more to be taken into consideration such as layout, proper text placement.
Take Chinese, for example: it is traditionally written top to bottom rather than left to right. A properly localized app should take that into account.
http://msdn.microsoft.com/en-us/library/y99d1cd3(VS.80).aspx seems to be a good start though if you're dealing with Windows Forms.
The classic recipe is: design the app with no native language but a localization facility, and develop an initialization into one language (e.g., English). So you build the app and localize it into English every night; without the localization step it would not be usable. Do that well, and the resources for the initial sample localization can be replaced with those for any other language. Take into account non-roman scripts from the beginning. It's much cleaner to have a no-language app that always requires localization rather than a language-specific app that needs to have its native language subtracted and a replacement added.
For strings you should just separate your strings from your code (having an XML/DLL that will transform string IDs to real strings is one way to go). However you do need to make sure that you are supporting double byte characters for some languages (this is relevant if you use C/C++).
For input data, what you want is to have different locales. In Java this is relatively easy, and if you use C# it is probably quite easy too. In C/C++ I don't really know. The basic idea is that the input parsers should differ based on the locale selected at that time. So each field (textfield, textbox, etc.) must have an abstract parser that is then implemented by a different class depending on the locale (right to left, double byte, etc.).
Check the Java implementation for details on how they did it. It is quite functional.
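In C#, locale-dependent parsing is mostly a matter of passing the right `CultureInfo` to the parse call. A minimal sketch, where `LocaleParser` is a hypothetical helper and the culture names are assumed to be available on the machine:

```csharp
using System;
using System.Globalization;

public static class LocaleParser
{
    // Parses user input with the separators of the named culture.
    public static bool TryParseNumber(string input, string cultureName, out double value)
    {
        return double.TryParse(
            input,
            NumberStyles.Float | NumberStyles.AllowThousands,
            CultureInfo.GetCultureInfo(cultureName),
            out value);
    }
}

public static class Program
{
    public static void Main()
    {
        // The same quantity typed by a Danish and an American user.
        if (LocaleParser.TryParseNumber("1,5", "da-DK", out var danish))
            Console.WriteLine(danish.ToString(CultureInfo.InvariantCulture));

        if (LocaleParser.TryParseNumber("1.5", "en-US", out var american))
            Console.WriteLine(american.ToString(CultureInfo.InvariantCulture));
    }
}
```

Parsing with the wrong culture does not necessarily fail; the separators may simply be interpreted differently, so the culture should come from the user's selection rather than from a guess.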
You definitely need to be using the .NET ResourceManager and the resx file xml format, however there are a number of approaches to using this.
It really depends on what you are wanting to achieve. For me I wanted a single xml resource file (for each supported language) that could be modified by anyone. I created a helper class that loaded the global resource file into ResourceManager (once only) and I had a helper function that gives me the required resource for a given name. The only disadvantage in this approach was that I could not leverage dynamic binding of resources to properties.
I found this better and easier to manage than multiple or embedded resource files for every form. Additionally, exactly the same approach can be used in an ASP.NET application. I also found that this approach makes outsourcing the translation of resources and shipping language packs to customers much more manageable.
Microsoft's recommended approach is to use satellite assemblies, as described in Packaging and Deploying Resources. If you're using a ResourceManager to load resources, .NET will load the correct resources for the CurrentUICulture. This defaults to the user's current UI language setting in Windows.
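When no resources exist for the exact culture, the resource lookup falls back through the culture's parents down to the neutral resources. The sketch below only lists that parent chain for a given culture and does not load any resources; `CultureFallback` is a hypothetical helper, while `CultureInfo.Parent` and `CurrentUICulture` are standard .NET.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Linq;
using System.Threading;

public static class CultureFallback
{
    // Lists culture names from most to least specific, ending with the
    // invariant culture, whose name is the empty string.
    public static List<string> Chain(string cultureName)
    {
        var chain = new List<string>();
        var culture = CultureInfo.GetCultureInfo(cultureName);
        while (true)
        {
            chain.Add(culture.Name);
            if (culture.Equals(CultureInfo.InvariantCulture)) break;
            culture = culture.Parent;
        }
        return chain;
    }
}

public static class Program
{
    public static void Main()
    {
        // Overrides the UI culture for the current thread, e.g. after the
        // user picks a language in the application's settings.
        Thread.CurrentThread.CurrentUICulture = CultureInfo.GetCultureInfo("da-DK");

        var names = CultureFallback
            .Chain(Thread.CurrentThread.CurrentUICulture.Name)
            .Select(n => n.Length == 0 ? "(invariant)" : n);
        Console.WriteLine(string.Join(" -> ", names));
    }
}
```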
It is possible to localize Windows Forms either through Visual Studio or an external tool, WinRes.exe. This article describes WinRes and how to use Visual Studio to localize the form.

Should you use international identifiers in Java/C#?

C# and Java allow almost any character in class names, method names, local variables, etc. Is it bad practice to use non-ASCII characters, testing the boundaries of poor editors and analysis tools and making it difficult for some people to read, or is American arrogance the only argument against it?
I would stick to English, simply because you usually never know who is working on that code, and because some third-party tools used in the build/testing/bugtracking process may have problems. Typing äöüß on a non-German keyboard is simply a PITA, and I simply believe that anyone involved in software development should speak English, but maybe that's just my arrogance as a non-native English speaker.
What you call "American arrogance" is not whether or not your program uses international variable names, it's when your program thinks "Währung" and "Wahrung" are the same words.
I'd say it entirely depends on who's working on the codebase.
If you have a small group of developers who all share a common language and you don't ever plan needing anyone who doesn't speak the language to work on the code then go ahead and use whatever characters you want.
If you need to have people of varying cultures and languages working on the code then it's probably best to stick with English since it's the common denominator for just about everyone in the world.
If your business people are non-English speakers, and you think Domain Driven Design has something to it, then there is another aspect: How do we, as developers, use the same domain language as our business without any translation overhead?
That does not only mean translations between languages, say English and Norwegian, but also between different words. We should use the exact same words as our business for our entity classes and services.
I have found it easier to just give in and use my native language. Now that my code uses the same words, it's easier to have a conversation with my domain experts. And after a while you get used to it, just like how you got used to code without Hungarian notation.
I used to work in a development team that happily wiped their asses with any naming (and for that matter any other coding) conventions. Believe it or not, having to cope with ä's and ö's in the code was a contributing factor in my resigning. Though I'm Finnish, I prefer writing code with US keyboard settings because curly and square brackets are a pain to type on a Finnish keyboard (try right alt and 7 and 0 for curlies).
So I say stick with the ASCII characters.
Here's an example of where I've used non-ASCII identifiers, because I found it more readable than replacing the Greek letters with their English names, even though I don't have θ or φ on my keyboard (I relied on copy-and-paste).
However these are all local variables. I would keep non-ASCII identifiers out of public interfaces.
It depends:
Does your team conform to any existing standards that require your using ASCII?
Is your code ever going to be feasibly reused or read by someone who doesn't speak your native language?
Do you envision a scenario where you'll need to ask for help online and will therefore not be able to copy-paste your code sample in as-is?
Are you uncertain whether your entire suite of tools supports your source file encoding?
If you answered 'yes' to any of the above, stay ASCII only. If not, go forward at your own risk.
Part of the problem is that the Java/C# language and its libraries are based on English words like if and toString(). I personally would not like to switch between non-English language and English while reading code.
However, if your database, UI, and business logic (including metaphors) are already in some non-English language, there's no need to translate every method name and variable into English.
If you get past the other prerequisites, you then have one extra (IMHO more important) one: how difficult is the symbol to type?
On my regular en-us keyboard, the only way I know of to type the letter ç is to hold alt and hit 0231 on the numeric keypad, or copy and paste.
This would be a HUGE big roadblock in the way of typing quickly. You don't want to slow your coding down with trivial stuff like this if you aren't forced to. International keyboards may alleviate this, but then what happens if you have to code on your laptop which doesn't have an international keyboard, etc?
I would stick to ASCII characters because if anyone in your development team is using an SDK that only supports ASCII, or you wanted to make your code open source, a lot of problems could arise. Personally, I would not do it even if you are not planning on bringing anyone who doesn't speak the language in on the project, because you are running a business, and it seems to me that one running a business would want his business to expand, which in this day and age means transcending national borders. My opinion is that English is the language of the realm, and even if you name your variables in a different language, there is little to no point in using any non-ASCII characters in your programming. Leave it up to the language to deal with it if you are handling data that is UTF8: my iPhone program (which involves tons of user data going between the phone and server) has full UTF8 support, but has no UTF8 in the source code. It just seems to open such a large can of worms for almost no benefit.
There is another hazard to using non-ASCII characters, though it will probably only bite in obscure cases. The allowed characters are defined in terms of the methods Character.isJavaIdentifierStart(int) and Character.isJavaIdentifierPart(int), which are defined in terms of Unicode. However, the exact version of Unicode used depends on the version of the Java platform, as specified in the documentation for java.lang.Character.
Since character properties change slightly from one Unicode version to the next, it's possible (but probably very unlikely) you could have identifiers that are valid in one version of Java, but not in the next.
As already pointed out, unless method names mostly match the language, it is a bit weird to constantly switch languages while reading.
For the Scandinavian languages & German, which I can speak and thus speak for, I would at least recommend using standard substitutions, i.e.
ä/æ -> ae, ö/ø -> oe, å -> aa, ü -> ue
etc., just in case, as others may find it difficult to type the original letters without keyboard/keymap changes. Think if you suddenly had to work with a codebase where the developers used a third language (for instance including the French ç) and didn't do this. Switching between more than 2 keymaps to type efficiently would be painful in my experience.
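If names already exist with the original letters, the substitutions listed above can be applied mechanically. A small sketch, where `AsciiNames` is a hypothetical helper and the table contains only the pairs listed above plus their capitalized forms, which are an assumption:

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class AsciiNames
{
    // Substitutions from the list above; the capitalized forms are an assumption.
    private static readonly Dictionary<char, string> Substitutions = new Dictionary<char, string>
    {
        { 'ä', "ae" }, { 'æ', "ae" }, { 'ö', "oe" }, { 'ø', "oe" },
        { 'å', "aa" }, { 'ü', "ue" },
        { 'Ä', "Ae" }, { 'Æ', "Ae" }, { 'Ö', "Oe" }, { 'Ø', "Oe" },
        { 'Å', "Aa" }, { 'Ü', "Ue" },
    };

    // Replaces each listed letter with its ASCII substitution and keeps everything else.
    public static string ToAscii(string name)
    {
        var result = new StringBuilder(name.Length);
        foreach (var c in name)
        {
            if (Substitutions.TryGetValue(c, out var replacement))
                result.Append(replacement);
            else
                result.Append(c);
        }
        return result.ToString();
    }
}

public static class Program
{
    public static void Main()
    {
        Console.WriteLine(AsciiNames.ToAscii("Währung"));
        Console.WriteLine(AsciiNames.ToAscii("blåbær"));
    }
}
```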
