How to manage localized strings that look alike in C#/WPF?

In my application (and probably in lots of other applications) I want to use localization.
However, I don't know the best way to manage strings that are almost equal.
Some questions:
When 'Edit' is used both as a menu name and as a menu item name, should I make two resource strings (with the same value "Edit") or only one?
When there is an Edit button on the screen and I want to use the Alt key, the string needs to be "_Edit". Should I make a separate resource string for this?
And what if there is also a button that shows a new page, and thus (by my convention) the value should be "_Edit..."?

Localization is not a quick-answer topic, but I'll try to answer at least your questions:
1) Yes. Even if it's not mandatory, it's better to keep each text separate. If you change one of them you won't risk changing both (and in other languages they may need different translations because of their context).
2) Absolutely yes. Imagine you have these strings: _Edit and _Mark as read. In Italian, for example, they'll be translated to _Modifica and Segna come già letto. You can't assume the "_" goes in front of the same letter in every language; choosing it is an issue translators should take care of.
3) I suggest not. As for 1), try to keep different strings separate (for the same reason). If you want to save money on translation (who doesn't?) you can write a program that pre-parses the strings and produces a "normalized" output to give to translators (it removes duplicates and merges similar strings when possible). But you should keep your program unaware of these details.
To summarize: no, don't try to merge (similar) strings inside your program. If you really need it, do it with an external program (it can even take into account strings from different modules, so it'll do a better job) and only when applicable (for case 3 it's possible, for cases 1 and 2 absolutely not).
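
A minimal sketch of the "one key per context" idea from the answer above. It uses plain dictionaries as a stand-in for .resx resources so that it runs on its own; the key names (Menu_Edit, Button_Edit, and so on), the UiStrings class, and every Italian value other than "_Modifica" are illustrative assumptions, not something from the original posts.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for .resx resources: one entry per UI context,
// even when the English values happen to be identical.
public static class UiStrings
{
    private static readonly Dictionary<string, Dictionary<string, string>> Tables =
        new Dictionary<string, Dictionary<string, string>>
        {
            ["en"] = new Dictionary<string, string>
            {
                ["Menu_Edit"] = "_Edit",
                ["Button_Edit"] = "_Edit",
                ["Button_EditDialog"] = "_Edit...",
                ["Button_MarkAsRead"] = "_Mark as read",
            },
            ["it"] = new Dictionary<string, string>
            {
                ["Menu_Edit"] = "_Modifica",
                ["Button_Edit"] = "_Modifica",
                ["Button_EditDialog"] = "_Modifica...",
                // The translator picks the access key; it need not match English.
                ["Button_MarkAsRead"] = "_Segna come già letto",
            },
        };

    // Looks up a key for a language, falling back to English when the
    // language or the key is missing.
    public static string Get(string language, string key)
    {
        if (Tables.TryGetValue(language, out var table) &&
            table.TryGetValue(key, out var value))
        {
            return value;
        }
        return Tables["en"][key];
    }
}

public static class Program
{
    public static void Main()
    {
        Console.WriteLine(UiStrings.Get("it", "Button_MarkAsRead"));
        Console.WriteLine(UiStrings.Get("da", "Button_EditDialog")); // falls back to English
    }
}
```

Because every context has its own key, a translator can later change the menu text without touching the button text, and the program never needs to know which values happen to coincide.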

Yes, you should make separate resource strings. In English you are using the word "edit" to describe different actions in your application. In another language, however, these actions might be described by different words (one might be "edit" while the other might be "change").
So you'll need the ability to assign different strings for your menus and buttons.

1) You should favor abstraction and decoupling: you have two concepts, a category "edition" and an action "edit", so it makes sense to have two separate strings, because in languages other than English "edit/edit" might not be acceptable.
2) Here you have only one concept with two different formats, so only one string and a transformation rule ("_" + myString) makes sense.
3) It depends: is the action triggered by this button exactly the same concept as the one above? If yes, use the same string; if no, use two.
All these are general considerations and you should of course adapt them to your needs:
if you really have to support n languages upfront, then follow these guidelines; they should appear naturally during the localization process anyway
if you only anticipate that somewhere in the future you may have to support a language other than English, and you have a lot of identical strings now, then don't spend too much time trying to do things perfectly, as there is a good chance it will be for nothing and the time could be better spent enhancing or adding core features.

Related

Determining if text is english

I currently have a lot of comments and text in my database that is mainly in English. However, if it isn't in English I want to translate it to English.
I know I can call a translation API to determine the language, but I don't want to make millions of translation API calls for text that most likely won't need translating.
I am looking for a way to determine if the text is English or not. I don't need to know what language it is, just that it isn't English, then if it isn't English I will send it to a translation service API.

The Chromium project (including its most popular implementation, Google Chrome) solves this problem with https://github.com/google/cld3.
If your only need is to detect whether or not something is English, then in theory you can use something even more compact.
Most good language detectors use trigram frequency (a gram being a single character) or trigram frequency overlaid with word frequency. For your application it seems that you could use a hybrid approach: a first pass that is local, of low accuracy, and tuned to be a bit aggressive so that it does not miss any potential English, and a second pass that does hit an API like Google Translate.
The popularity of English and the amount of English data are usually helpful when applying NLP solutions to it, but in this case you will unfortunately often find false positives for English, because data sources that are listed as English contain other languages or non-language garbage such as stray characters or URLs.
Note also that for many queries there is no single correct answer. Good systems will return a weighted list of possibilities, but for a query like [dan], [a], [example#example.com] or [hi! como estas? i'm in class ahorita] the most correct answer will depend on your application and may not exist.

You can use NTextCat to determine input language.

Research (by a certain Zipf) determined that, for the most part, there are some words which are used very frequently and a lot of words which are rarely used.
If I were given this problem, I'd probably put together a list of the top X most used words. Then for each comment I would see if there's a match.
It's not perfect (and if the text is very specialized, or misspelt, you've got an issue), but I think it's an acceptable heuristic.
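
The common-word heuristic described above could be sketched roughly as follows. The word list, the 0.2 threshold, and the names EnglishHeuristic, CommonWordRatio and LooksEnglish are all assumptions made for illustration; a real list would be longer and taken from a corpus, and the threshold would need tuning against your own data.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class EnglishHeuristic
{
    // A tiny hand-picked sample of very frequent English words (assumption:
    // a real list would be longer and derived from a corpus).
    private static readonly HashSet<string> CommonWords = new HashSet<string>(
        new[] { "the", "of", "and", "a", "to", "in", "is", "you", "that", "it",
                "he", "was", "for", "on", "are", "as", "with", "his", "they", "i",
                "at", "be", "this", "have", "from", "or", "not", "but", "what", "we" },
        StringComparer.OrdinalIgnoreCase);

    private static readonly char[] Separators =
        { ' ', '\t', '\r', '\n', '.', ',', '!', '?', ';', ':', '"', '(', ')' };

    // Returns the fraction of words in the text that appear in the common-word list.
    public static double CommonWordRatio(string text)
    {
        var words = text.Split(Separators, StringSplitOptions.RemoveEmptyEntries);
        if (words.Length == 0) return 0.0;
        return words.Count(w => CommonWords.Contains(w)) / (double)words.Length;
    }

    // The threshold is a guess; keep it low so that real English is rarely
    // sent on to the translation API.
    public static bool LooksEnglish(string text, double threshold = 0.2)
        => CommonWordRatio(text) >= threshold;
}

public static class Program
{
    public static void Main()
    {
        Console.WriteLine(EnglishHeuristic.LooksEnglish("This is the best way to do it and I like it"));
        Console.WriteLine(EnglishHeuristic.LooksEnglish("Filen blev ikke fundet. Prøv igen"));
    }
}
```

Only the comments for which LooksEnglish returns false would then be passed to the translation service, which fits the two-pass approach suggested in the earlier answer.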

See this post
More specifically, take a look at trigrams.

Reductions in programming

Sometimes, to make a variable/method/class name descriptive, I need to make it longer. But I don't want to; I'd like to have short names that are easy to read. So I thought of a special addin for an IDE like Visual Studio that lets you write short names for classes, methods and fields but attach long names to them. If you need to, you can show them all long, or you can show a single name long. If you want to reduce it, use the reduction, like two views of the same code. I'd like to know what others think about it. Do you think it is useful? Would anybody use this kind of addin?

Why not just use the standard XML commenting system built into Visual Studio.
If you type /// above the Class/Method/variable etc, it creates the comment stub.
These comments pop up through IntelliSense/code completion with extra info.
This way you keep your naming conventions short and descriptive whilst commenting your code.
You can run a process to then create documentation for your code using these comments.
See: http://msdn.microsoft.com/en-us/magazine/cc302121.aspx
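
As a small illustration of the XML comments mentioned above, here is roughly what the filled-in stub looks like on a method. The Account class and its Withdraw method are hypothetical and exist only to carry the comment.

```csharp
using System;

public static class Account
{
    /// <summary>
    /// Calculates the balance left after a withdrawal, rejecting overdrafts.
    /// </summary>
    /// <param name="balance">Current balance.</param>
    /// <param name="amount">Amount to withdraw; must not exceed <paramref name="balance"/>.</param>
    /// <returns>The remaining balance.</returns>
    /// <exception cref="ArgumentOutOfRangeException">
    /// Thrown when <paramref name="amount"/> is negative or larger than the balance.
    /// </exception>
    public static decimal Withdraw(decimal balance, decimal amount)
    {
        if (amount < 0 || amount > balance)
            throw new ArgumentOutOfRangeException(nameof(amount));
        return balance - amount;
    }
}
```

The method name stays short, while the summary and parameter descriptions show up in the tooltip wherever Withdraw is used.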

A variable name should be as long as required to make it identifiable, does it matter if it's a bit longer than you would prefer? As long as the code is readable and understandable, surely this makes no difference?
Use comments for names that would be far too long to use as a variable/class name. This would be a lot more appropriate.
If a method name is too long, then it shouldn't be a single method...
I wouldn't use an addin like that.

I never worry about long names. If a method name becomes too long, it may also indicate that the method does too much (unless it happens to include a really long word). On the other hand, I also try to avoid repeating myself. I would not have Account.AccountId for instance, but rather Account.Id. I also lean back on the namespace; if the namespace is clear about what domain I am in, I usually try to not repeat that in class- or member names.
Bottom line; I can't see myself using such an addin.

Other programmers without this addin would find themselves in trouble: if you give names that are too short they will not fully understand the code, and if you give long names they will lose time reading and eventually get angry because long names are difficult to remember :P
One has to find the best name for everything one writes, imho there is no need for a switch to turn on and off verbosity for identifiers.
I would not use that addin.

Nor I. The fact is you are talking about Visual Studio. IntelliSense takes on the heavy load of remembering most variable names (long and short). As Power said, as long as the code is readable and understandable, that's all that matters.

With ReSharper 4 and above, you can get automatic expansion of type and variable names that are camel or Pascal cased (source: jetbrains.com).
So you could call your variable myExtremelyLongAndDescriptiveVariableName but then just type mELADVN to use it.

I don't think I'd want it.
The overhead of switching between different views would be as much work as hitting F12 and reading the comment for the function, which will always be more descriptive than the long name.

I won't.
Long function names can be handy in some cases, for instance if you have a special case.
Some examples:
What would you favor for multiplication, mul or multiply? multiply is my choice.
Choosing function names is a matter of making your code clear to use. If your names are too short and you have to read the comment to know what the function does, then you're doing it wrong.

IDEs, text editors and compilers already support a limited form of the described functionality: source code comments. I think comments do very well, and I don't see any need for the described addin. If comments are too long they can be folded. If you need source code with no comments, you can easily strip them with a regex or something similar.

"I'd like to have short names that are easy to read."
That is often a contradiction in terms.
Take for example a name like oScBf, if you don't already know what it's for it's practically unreadable. Is it outputScreenBuffer, onlineSourceBitflag, openScannerBrowsefile, outdoorSpecialBikinifavorites...?
Longer identifier names are usually preferable. Even though it's more to read, it's still easier to understand.
Reading code is in some ways similar to reading text. You expect it to follow a certain pattern to be easy to read, if you start to add a lot of abbrev. and non-std words in da text u hav 2 stop n think what it means, and u lose da flow. :)

It's a bad idea. Variable names don't usually need to be long to be adequately descriptive, you'll waste a lot of time writing two versions of every name, and many programmers will probably find it rather confusing to have multiple names for the same thing.
With XMLDoc and IntelliSense help, you can add any extra detail required to fully describe a code element. The name doesn't need to describe the minutiae, only give a clear and distinctive idea of what the code element's purpose is.
With name auto-completion readily available, there is no longer any reason to complain about long names requiring lots of typing.
Also, good coding style is all about making code easy to read, understand and maintain, not about packing more code into a smaller space.
OO design should help to break functionality down hierarchically into namespaces and classes, reducing the need for such long names at the class/method level.
Lastly, if you really must shorten names, most languages provide easy ways to strip off namespaces and/or add completely new aliases for names (e.g. 'typedef' and 'using' in C++, 'using' in C#), so in a localised region you can easily refer to a long name via a shortened variant or alias if you wish.

I like the idea. It's really good, and I congratulate you and hope you're successful in developing it, although I would never use such an add-on myself.

best way to store/use multiple languages

If I wanted to create a C# application that supports multiple languages, how should I store the strings?
I'd probably use constants in the application as value holders.
Such as:
Console.Write(FILE_NOT_FOUND);
When compiled, it would change into the string determined by the language.
I'll probably stick to 3 languages (Danish, English, German), not that I think it matters though.
It seems to be a waste to have a class file for each language, all of which are processed when the application is compiled. It would also mean that you'd have to recompile and redistribute the whole program every time you want to change a string.
As far as I know, hardcoded strings are a bad thing.
Maybe a text file?
English.txt
Line1: FILE_NOT_FOUND=File Not Found. Try Again
Line2
Line3
etc.
Danish.txt
Line1: FILE_NOT_FOUND=Filen blev ikke fundet. Prøv igen
Line2
Line3
etc.
and so on.
If the user selects English, it reads the text file and sets the different constant values.
The last one I can think of is placing it in a SQL database.
Could you give me some input? :)
Also, I tried writing FILE _ NOT _ FOUND (without spaces), but the text editor wouldn't let me.

Use a resource file. That's the standard way to handle localization.
For details, see this tutorial.
--- EDIT ---
An alternative tutorial is available here. This one uses much better naming, so it may be more clear how it works.

I think your best option is to use the built in localization resources of the .NET Framework. You can read more about the mechanics of that here.
As for using a database to store your localised elements (text, images and the like), this is certainly a common option, but I think it's mostly because developers understand getting data from a database better than working with satellite assemblies and the like. There are a number of problems with using a database, so I'll name only a few: 1) added complexity in deploying the application, 2) additional load on the database server, 3) where do you store the localized message that says the database is down? :)
Using some sort of text file (likely XML) also carries with it some deployment issues, but more importantly the perceived flexibility of making text changes 'on the fly' is somewhat overrated. Apart from spelling mistakes and awkward wording, you'll almost always be shipping a new build when the text of your app changes.

Check out the Localization/Internationalization samples here:
http://msdn.microsoft.com/en-us/goglobal/bb688096.aspx
I've also heard good things about this book.

This topic is far too large for a reply.
The process of making a program ready for new languages is normally called "internationalization" or "i18n", and the process of taking that and actually making it run is "localization" or "l10n".
Briefly, you want to have hardcoded strings replaced by string resources, as you say, and then typically create different resource files for different languages. Assuming you're working in .NET (a fairly good assumption for C#), there's a lot of stuff Microsoft does to make it easier.
Remember that there are other localization issues than language. For example, the Danish currency symbol is not '$' but 'kr.' (Denmark uses the krone, not the euro), dates are almost certainly abbreviated differently, and many places use ',' for the decimal point and '.' for the thousands separator, the opposite of the English practice.
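
To make the point about non-language differences concrete, here is a minimal sketch that formats the same number and date under two cultures using the standard System.Globalization API. The CultureDemo class and Describe method are hypothetical names, and the exact output depends on the culture data available on the machine (for example, it will differ if .NET runs in invariant-globalization mode).

```csharp
using System;
using System.Globalization;

public static class CultureDemo
{
    // Formats the same values under a given culture name such as "en-US" or "da-DK":
    // a plain number (N2), a currency amount (C) and a short date (d).
    public static string Describe(string cultureName, decimal amount, DateTime date)
    {
        var culture = CultureInfo.GetCultureInfo(cultureName);
        return string.Format(culture, "{0:N2} | {0:C} | {1:d}", amount, date);
    }

    public static void Main()
    {
        var date = new DateTime(2024, 3, 31);
        Console.WriteLine(Describe("en-US", 1234.56m, date));
        Console.WriteLine(Describe("da-DK", 1234.56m, date));
    }
}
```

Passing the culture explicitly, instead of concatenating symbols and separators by hand, leaves decimal separators, currency symbols and date order to the framework.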

Shorter naming convention for types

I am developing a framework, and some of the objects have reaaally long names. I don't really like this, but I don't like acronyms either. I am trying to come up with a shorter name for "EventModelSocket", basically a wrapper around the .Net socket class that implements various events, and methods to send files, objects, etc. Some of the objects have really long names due to this, such as "EventModelSocketObjectReceivedEventArgs" for example.
I've tried everything from a thesaurus, to a dictionary to sitting here for hours thinking.
When you come upon situations like this, what is the best way to name something?

Push some of it into the namespace.
For example:
EventModelSocketObjectReceivedEventArgs
becomes
EventModel.Sockets.ReceivedEventArgs

Well, are the long names hurting something?
(edit) two other thoughts:
use var in C# 3.0 - that'll save half the width
if you are using the type multiple times in a file, consider a type alias if it is annoying you:
using Fred = Namespace.VeryLongNameThatIsBeingAnnoying;
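
A self-contained sketch of the alias and var suggestions, applied to the type name from the question. The Payload property, the ReceivedArgs alias and the Wrap helper are made up for the example; only the long class name comes from the original post.

```csharp
using System;
// Alias: the long name is written once here, the short one everywhere else in this file.
using ReceivedArgs = EventModel.Sockets.EventModelSocketObjectReceivedEventArgs;

namespace EventModel.Sockets
{
    // The long name from the question; the payload property is hypothetical.
    public class EventModelSocketObjectReceivedEventArgs : EventArgs
    {
        public EventModelSocketObjectReceivedEventArgs(object payload) { Payload = payload; }
        public object Payload { get; }
    }
}

public static class Program
{
    public static ReceivedArgs Wrap(object payload)
    {
        // 'var' avoids repeating the type name on the left-hand side.
        var args = new ReceivedArgs(payload);
        return args;
    }

    public static void Main()
    {
        Console.WriteLine(Wrap("hello").Payload);
    }
}
```

The alias is only visible in the file that declares it, so the public name of the type stays fully descriptive for everyone else.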

I would just suggest using the most concise naming that describes the object.
If EventModelSocketObjectReceivedEventArgs does that, move on.
My 2 cents.

Years ago when I was in a programming class, the prof quoted the statistic that a piece of code is typically read 600 times for each single time it gets modified. Nowadays, I would assume that this is no longer true, particularly in TDD environments where there's lots of refactoring going on. Nevertheless, I think a given piece of code is still read many more times than it gets written. Therefore, I think the maxim that we should write for readability is still valid. The full form of a word in a name is more readable, since the brain doesn't have to do the conversion. Comprehension is faster and more accurate.
The tools we have today make this so easy with autocompletion and the like. Because of this, I use full words in variable names now, and I think it's a good way to go.

If you need to go through that much effort to find an alternative name, you already have the correct name. Object/method/property names should be self documenting. If they do not describe their exact purpose they are misnamed. There is nothing wrong with long names if they give the most clear understanding of the purpose of that object.
In this age of intellisense and large monitors there really is no excuse to not be as descriptive as possible in naming.

Don't remove the vowels or something crazy like that.
I'm with the "stick with the long name" people.
One thought is that if the names are that awkward, maybe some deeper rethinking of the system is needed.

I for one use the long name. With intellisense typing out the name isn't that important, unless you are using a 15 inch monitor.
If I had to reduce the name I might go with EvtMdlSck, just to make the word shorter but still understood, even though that is not my preference.

Some criticisms on your naming...
Why DOES your component have the word "model" in its name? Isn't that a bit redundant?
Since your component seems to be a messaging hub of some sort, why not include Message in its name? What about MessageSender?
To solve your problem I would create an interface and give it a generic name like MessageSender, and an implementation whose name includes the technology, like RandomFailingSocketMessageSender.
If you want a good example of this, take a look at the Java or .NET libraries. From Java:
interface - class/implementations...
Map - HashMap, LinkedHashMap.
List - LinkedList
Details regarding the technology or framework used, e.g. words like "Socket" or, to use a contrived example, "MQSeries", shouldn't be part of the interface name at all.
MessageSender seems IMHO to sum up the purpose of your component. It seems strange that your thing which sends "files" and "events" doesn't include those two descriptive words. The words you're using in your naming are superfluous and IMHO don't match your description of the component.
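
The interface-plus-implementation naming could look roughly like this in C#. Note that the answer writes the interface name Java-style as MessageSender; the I prefix below follows the usual .NET convention. InMemoryMessageSender is a hypothetical stand-in so the example runs without a network; a socket-based implementation would carry "Socket" in its name instead.

```csharp
using System;
using System.Collections.Generic;

// Generic name for the abstraction: it says what it does, not how.
public interface IMessageSender
{
    void Send(string message);
}

// The technology goes into the implementation name. This in-memory version is a
// hypothetical stand-in; a socket-based one might be called SocketMessageSender.
public class InMemoryMessageSender : IMessageSender
{
    public List<string> Sent { get; } = new List<string>();

    public void Send(string message) => Sent.Add(message);
}

public static class Program
{
    public static void Main()
    {
        IMessageSender sender = new InMemoryMessageSender();
        sender.Send("hello");
        Console.WriteLine(((InMemoryMessageSender)sender).Sent.Count);
    }
}
```

Callers depend only on the short interface name, so the longer, technology-specific names appear just where an implementation is constructed.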

In general I believe in class names that accurately describe their function, and that it's OK to have long names. If you think the names are really getting long, what I would suggest is finding a concept that is well known to your programming team and abbreviating that. So if "Event Model Sockets" are a concept that everybody knows about, then abbreviate them to EMS. If you've got a package that is entirely about Event Model Sockets, then abbreviate them to EMS in all the classes internal to that package. The key here is to make sure the name is in full for anyone who might not be familiar with the concept and abbreviated for anyone who is.

Should you use international identifiers in Java/C#?

C# and Java allow almost any character in class names, method names, local variables, etc.. Is it bad practice to use non-ASCII characters, testing the boundaries of poor editors and analysis tools and making it difficult for some people to read, or is American arrogance the only argument against?

I would stick to English, simply because you usually never know who is working on that code, and because some third-party tools used in the build/testing/bugtracking process may have problems. Typing äöüß on a non-German keyboard is simply a PITA, and I simply believe that anyone involved in software development should speak English, but maybe that's just my arrogance as a non-native English speaker.
What you call "American arrogance" is not whether or not your program uses international variable names, it's when your program thinks "Währung" and "Wahrung" are the same words.

I'd say it entirely depends on who's working on the codebase.
If you have a small group of developers who all share a common language and you don't ever plan needing anyone who doesn't speak the language to work on the code then go ahead and use whatever characters you want.
If you need to have people of varying cultures and languages working on the code then it's probably best to stick with English since it's the common denominator for just about everyone in the world.

If your business are non-English speakers, and you think Domain Driven Design has something to it, then there is another aspect: How do we, as developers, use the same domain language as our business without any translation overhead?
That does not only mean translations between languages, say English and Norwegian, but also between different words. We should use the exact same words as our business for our entity classes and services.
I have found it easier to just give in and use my native language. Now that my code uses the same words, it's easier to have a conversation with my domain experts. And after a while you get used to it, just like you got used to code without Hungarian notation.

I used to work in a development team that happily wiped their asses with any naming (and for that matter any other coding) conventions. Believe it or not, having to cope with ä's and ö's in the code was a contributing factor in my resigning. Though I'm Finnish, I prefer writing code with US keyboard settings because curly and square brackets are a pain to type on a Finnish keyboard (try right Alt and 7 and 0 for curlies).
So I say stick with the ascii characters.

Here's an example of where I've used non-ASCII identifiers, because I found it more readable than replacing the Greek letters with their English names, even though I don't have θ or φ on my keyboard (I relied on copy-and-paste).
However these are all local variables. I would keep non-ASCII identifiers out of public interfaces.
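
The example linked in that answer is not reproduced here, so the following is only a guess at the kind of code meant: a spherical-to-Cartesian conversion where the Greek letters appear as local variables and the public signature stays ASCII. The Spherical class and its parameter names are assumptions.

```csharp
using System;

public static class Spherical
{
    // Converts spherical coordinates (radius, polar angle, azimuth; angles in
    // radians) to Cartesian coordinates. The public signature is ASCII only.
    public static (double X, double Y, double Z) ToCartesian(
        double radius, double polar, double azimuth)
    {
        // Local names mirror the usual mathematical notation.
        double r = radius, θ = polar, φ = azimuth;
        double sinθ = Math.Sin(θ);
        return (r * sinθ * Math.Cos(φ), r * sinθ * Math.Sin(φ), r * Math.Cos(θ));
    }

    public static void Main()
    {
        Console.WriteLine(ToCartesian(1.0, 0.0, 0.0));
    }
}
```

Anyone calling ToCartesian never has to type θ or φ; only someone editing the method body does.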

It depends:
Does your team conform to any existing standards that require your using ASCII?
Is your code ever going to be feasibly reused or read by someone who doesn't speak your native language?
Do you envision a scenario where you'll need to ask for help online and will therefore not be able to copy-paste your code sample in as-is?
Are you certain your entire suite of tools support code encoding?
If you answered 'yes' to any of the above, stay ASCII only. If not, go forward at your own risk.

Part of the problem is that the Java/C# language and its libraries are based on English words like if and toString(). I personally would not like to switch between non-English language and English while reading code.
However, if your database, UI, business logics (including metaphors) are already in some non-English language, there's no need to translate every method names and variables into English.

If you get past the other prerequisites you then have one extra (IMHO more important) one: how difficult is the symbol to type?
On my regular en-US keyboard, the only way I know of to type the letter ç is to hold Alt and type 0231 on the numeric keypad, or to copy and paste.
This would be a HUGE roadblock in the way of typing quickly. You don't want to slow your coding down with trivial stuff like this if you aren't forced to. International keyboards may alleviate this, but then what happens if you have to code on your laptop, which doesn't have an international keyboard?

I would stick to ASCII characters because if anyone in your development team is using an SDK that only supports ASCII, or you wanted to make your code open source, a lot of problems could arise. Personally, I would not do it even if you are not planning on bringing anyone who doesn't speak the language in on the project, because you are running a business and it seems to me that someone running a business would want it to expand, which in this day and age means transcending national borders. My opinion is that English is the language of the realm, and even if you name your variables in a different language, there is little to no point in using any non-ASCII characters in your programming. Leave it up to the language to deal with it if you are handling data that is UTF8: my iPhone program (which involves tons of user data going between the phone and server) has full UTF8 support, but has no UTF8 in the source code. It just seems to open such a large can of worms for almost no benefit.

There is another hazard to using non-ASCII characters, though it will probably only bite in obscure cases. In Java, the allowed characters are defined in terms of the methods Character.isJavaIdentifierStart(int) and Character.isJavaIdentifierPart(int), which are in turn defined in terms of Unicode. However, the exact version of Unicode used depends on the version of the Java platform, as specified in the documentation for java.lang.Character.
Since character properties change slightly from one Unicode version to the next, it's possible (but probably very unlikely) you could have identifiers that are valid in one version of Java, but not in the next.

As already pointed out, unless method names mostly match the language, it is a bit weird to constantly switch languages while reading.
For the Scandinavian languages & German, which I can speak and thus speak for, I would at least recommend using standard substitutions, ie.
ä/æ -> ae, ö/ø -> oe, å -> aa, ü -> ue
etc. just in case as others may find it difficult to type the original letters without keyboard/keymap changes. Think if you suddenly had to work with a codebase where the developers used a third language (for instance including the French ç) and didn't do this.. Switching between more than 2 keymaps to type efficiently would be painful in my experience.
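
The substitution table above is small enough to apply mechanically, for example when deriving ASCII identifiers from domain terms. A minimal sketch follows; the IdentifierAscii class is hypothetical, and the upper-case mappings are an extension that the answer does not list.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class IdentifierAscii
{
    // Lower-case pairs come from the answer above; the upper-case ones are an
    // assumption added for completeness.
    private static readonly Dictionary<char, string> Map = new Dictionary<char, string>
    {
        ['ä'] = "ae", ['æ'] = "ae", ['ö'] = "oe", ['ø'] = "oe", ['å'] = "aa", ['ü'] = "ue",
        ['Ä'] = "Ae", ['Æ'] = "Ae", ['Ö'] = "Oe", ['Ø'] = "Oe", ['Å'] = "Aa", ['Ü'] = "Ue",
    };

    // Replaces each mapped letter with its ASCII substitution and leaves
    // every other character untouched.
    public static string Transliterate(string name)
    {
        var sb = new StringBuilder(name.Length);
        foreach (char c in name)
            sb.Append(Map.TryGetValue(c, out var ascii) ? ascii : c.ToString());
        return sb.ToString();
    }

    public static void Main()
    {
        Console.WriteLine(Transliterate("Währung"));
        Console.WriteLine(Transliterate("beløb"));
    }
}
```

The mapping assumes precomposed characters (a single code point per letter); text in decomposed Unicode form would need normalizing first.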
