In my career I have seen literally dozens of ways that people choose to handle implementing state and country data (as in NY or USA). I've seen enumerations, multi-dimensional arrays, data driven classes from XML docs or databases, and good old-fashioned strings.
So, I'm wondering, what do people consider the "best way" to handle this very common implementation? I know that most of the answers will be based primarily on opinion and preference; but, I'm curious to hear arguments as a recent discussion for an app I'm working on has resulted in a small debate.
If you consider the number of states in the United States to be unchanging, then what you have is a small, fixed, finite set of values that can easily be associated with numbers. For that type of data an enumeration works great. My choice then would be an enumeration, as enumerations have good compile-time validation, and typing something like the following reads nicely.
var state = GetStateInfo(States.Nebraska);
However, if you consider the number of states to be a changing value, then an enumeration may not be the best choice. The Framework Design Guidelines recommend that you do not use an enumeration for a set of values that is considered open (changing). The reason is that if the set does change, you risk breaking already shipped code. Instead, I would go with a static class and string constants for the names.
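A minimal sketch of the two options described above, side by side. The names `States`, `StateNames`, and `StateDescriber` are hypothetical, and the lists are cut down to three entries purely for illustration.

```csharp
using System;

// Option 1: a closed set modelled as an enum (checked at compile time).
public enum States
{
    Nebraska,
    NewYork,
    Texas
}

// Option 2: an open set modelled as a static class of string constants.
// Adding a constant later does not alter an enum that shipped code may depend on.
public static class StateNames
{
    public const string Nebraska = "Nebraska";
    public const string NewYork = "New York";
    public const string Texas = "Texas";
}

public static class StateDescriber
{
    // Hypothetical helper: maps an enum member to its display name.
    public static string DisplayName(States state)
    {
        switch (state)
        {
            case States.Nebraska: return StateNames.Nebraska;
            case States.NewYork: return StateNames.NewYork;
            case States.Texas: return StateNames.Texas;
            default: throw new ArgumentOutOfRangeException(nameof(state));
        }
    }
}
```

The `default` branch is where the "open set" risk shows up: code compiled against the old enum hits it as soon as a new member appears.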
I would pick a standard for enumerating the geographical areas and then associate keys in a table. In the past I have used FIPS 5-2 place codes. It's a federal standard, and the numbers and abbreviations are reasonably well known. For international country codes we use FIPS 10-4 for encoding places. There are some ISO standards, such as ISO 3166, that may also be appealing, but this is more a matter of preference and taste.
I think enums or XML are fine if all you want is a simple list of states. If you have other dependent data, like, say, state tax rates or postcode data, then I think I'd store the states in the database.
Same with countries. You may want to control languages, currencies etc. based on country. I think that's just easier to deal with if it's in a database.
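As a rough illustration of the table-driven approach, here is a sketch in which an in-memory dictionary stands in for the database table. `CountryRow` and `CountryTable` are hypothetical names; the keys are ISO 3166-1 alpha-2 codes and the dependent column holds ISO 4217 currency codes, but any of the standards mentioned above could serve as the key.

```csharp
using System;
using System.Collections.Generic;

// One row of a hypothetical "countries" table.
public sealed class CountryRow
{
    public string Code { get; }          // ISO 3166-1 alpha-2 code
    public string Name { get; }
    public string CurrencyCode { get; }  // ISO 4217 currency code

    public CountryRow(string code, string name, string currencyCode)
    {
        Code = code;
        Name = name;
        CurrencyCode = currencyCode;
    }
}

public static class CountryTable
{
    // In-memory stand-in for the database table; key lookup ignores case.
    private static readonly Dictionary<string, CountryRow> Rows =
        new Dictionary<string, CountryRow>(StringComparer.OrdinalIgnoreCase)
        {
            { "US", new CountryRow("US", "United States", "USD") },
            { "IT", new CountryRow("IT", "Italy", "EUR") },
            { "JP", new CountryRow("JP", "Japan", "JPY") },
        };

    // Returns true and the row when the code is known, false otherwise.
    public static bool TryGet(string code, out CountryRow row)
    {
        return Rows.TryGetValue(code, out row);
    }
}
```

Because the dependent data hangs off the key, adding a language or tax column later means changing the row type and the table, not every call site.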
This question may be a bit too generic and abstract because I do not know what I'm looking for yet. I do not have much experience with patterns. I need to know what pattern/technique I can use to categorize patients in a medical app.
Let's say the hospital has a documentation app with 10 data fields. Dates, numbers, selects, multi-selects.
Every patient that visits the hospital will have its own specific information.
After input and analysis, each patient must be placed in a category.
Each category is determined by a set of rules. Those rules are created based on some or all fields defined above and their individual value.
In reality I'm speaking about hundreds of patients and hundreds of input fields. So I'm trying to find out whether there is some traditional way of doing this (something more generic) or if I'm stuck with writing tens of "IF and Switch" statements.
PS: this is not a machine learning task
This sounds like a task for a type of algorithm called a rules engine. At their simplest the coding style of rules engines does appear to be a collection of IF ... THEN ... (ELSE...) but rules engines also usually have features such as the elimination of redundant branches, cycle and contradiction detection and so on.
Examples of software packages that provide this are Drools and the BizTalk Business Rules Engine.
In addition to Tom W's answer, on a lower level you may find the Specification pattern useful.
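A minimal sketch of the Specification pattern applied to the patient example. Everything here is hypothetical: the `Patient` fields, the rule names, and especially the threshold of 65, which is an arbitrary illustration and not a clinical criterion.

```csharp
using System;

// Hypothetical patient record with two of the many input fields.
public sealed class Patient
{
    public int Age { get; set; }
    public bool IsSmoker { get; set; }
}

// A specification is a reusable yes/no rule over a candidate object.
public interface ISpecification<T>
{
    bool IsSatisfiedBy(T candidate);
}

// Wraps a predicate so that simple rules need no dedicated class.
public sealed class Spec<T> : ISpecification<T>
{
    private readonly Func<T, bool> _predicate;

    public Spec(Func<T, bool> predicate) { _predicate = predicate; }

    public bool IsSatisfiedBy(T candidate) { return _predicate(candidate); }
}

// Combinators that build larger rules out of smaller ones.
public static class SpecExtensions
{
    public static ISpecification<T> And<T>(this ISpecification<T> left, ISpecification<T> right)
    {
        return new Spec<T>(c => left.IsSatisfiedBy(c) && right.IsSatisfiedBy(c));
    }

    public static ISpecification<T> Or<T>(this ISpecification<T> left, ISpecification<T> right)
    {
        return new Spec<T>(c => left.IsSatisfiedBy(c) || right.IsSatisfiedBy(c));
    }

    public static ISpecification<T> Not<T>(this ISpecification<T> inner)
    {
        return new Spec<T>(c => !inner.IsSatisfiedBy(c));
    }
}

public static class PatientRules
{
    // Illustrative threshold only.
    public static readonly ISpecification<Patient> IsSenior =
        new Spec<Patient>(p => p.Age >= 65);

    public static readonly ISpecification<Patient> Smokes =
        new Spec<Patient>(p => p.IsSmoker);

    // A category is a composition of the smaller rules above.
    public static readonly ISpecification<Patient> SeniorSmoker =
        IsSenior.And(Smokes);
}
```

Each category then becomes one composed specification instead of a nested IF/switch block, and the small rules can be tested on their own.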
Is there any C# algorithm by which personal and place names can be extracted from text?
e.g., given the following text:
St. Mark died at Alexandria, in Egypt. He was martyred, I think.
However, that has nothing to do with my legend. About the founding of
the city of Venice--
(taken from "The Innocents Abroad" by Mark Twain)
...is there any way to extract:
St. Mark
Alexandria (or better yet, "Alexandria, Egypt")
Venice
?
I realize that there is no way to get 100% accuracy (where all place names and personal names are captured, and no "false positives" are added), but 80% accuracy could be very valuable.
I understand that each word could be compared with an encyclopedia or some such, but there must be a better way. Also, how could the algorithm know to combine "St." and "Mark" and to see "Alexandria, in Egypt" as "Alexandria, Egypt"?
I noticed that the links provided here are a bit dated. One project that is still active (and free [correction: GPL, so free for non-commercial]) is the Stanford Natural Language Processing (NLP) libraries (https://nlp.stanford.edu/software/). You can demo their Named Entity Recognition (NER) here. It even has a .NET wrapper (http://sergey-tihon.github.io/Stanford.NLP.NET/StanfordNER.html).
Microsoft also offers many similar algorithms through Azure Cognitive Services. You would be most interested in Entity Linking (https://azure.microsoft.com/en-us/services/cognitive-services/entity-linking-intelligence-service/)
I hope this helps future viewers.
You are best off using some kind of API that will be able to perform this kind of entity matching, as what you are asking is potentially very complex and requires some degree of semantic textual analysis backed up by a large database. I'd recommend looking at APIs such as:
OpenCalais - English Semantic Metadata: Entity/Fact/Event Definitions and Descriptions web-service
Calais supports a rich set of semantic metadata, including entities, events and facts.
Alchemy API - Entity Extraction API
AlchemyAPI is capable of identifying people, companies, organizations, cities, geographic features, and other typed entities within your HTML, text, or web-based content. We employ sophisticated statistical algorithms and natural language processing technology to analyze your information, extracting the semantic richness embedded within.
I currently have a lot of comments and text in my database that is mainly in English. However, if it isn't in English, I want to translate it to English.
I know I can call a translation API to determine the language, but I don't want to make millions of translation API calls for text that most likely won't need translating.
I am looking for a way to determine if the text is English or not. I don't need to know what language it is, just that it isn't English, then if it isn't English I will send it to a translation service API.
The Chromium project (including its most popular implementation, Google Chrome) solves this problem with https://github.com/google/cld3.
If your only need is to detect whether or not something is English, then in theory you can use something even more compact.
Most good language detectors use trigram frequency (a gram being a single character) or trigram frequency overlaid with word frequency. For your application it seems that you could use a hybrid approach in which the first pass is local, of low accuracy, and tuned to be a bit aggressive so that it does not miss any potential English, and the second pass hits an API like Google Translate.
The popularity of English and the amount of English data are usually helpful when applying NLP solutions to it, but in this case you will unfortunately often find false positives for English, because sources of data that are listed as English contain other languages or non-language content such as garbage characters or URLs.
Note also that for many queries there is no single correct answer. Good systems will return a weighted list of possibilities, but for a query like [dan], [a], [example#example.com] or [hi! como estas? i'm in class ahorita] the most correct answer will depend on your application and may not exist.
You can use NTextCat to determine input language.
Research (by a certain Zipf) determined that, for the most part, there are some words which are used very frequently, and a lot of words which are rarely used.
If I was given this problem, I'd probably put down a list of the top X used words. Then for each comment I would see if there's a match.
It's not perfect (and if the text is very specialized, or misspelt, you've got an issue) - but I think it's an acceptable heuristic.
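The heuristic above can be sketched roughly as follows. The word list is a small illustrative sample rather than a measured "top X" list, and the class name `EnglishGuess`, the separator set, and the default threshold of 0.2 are all assumptions that would need tuning against real data.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class EnglishGuess
{
    // A small, illustrative sample of very frequent English words.
    private static readonly HashSet<string> CommonWords =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase)
        {
            "the", "be", "to", "of", "and", "a", "in", "that", "have", "i",
            "it", "for", "not", "on", "with", "is", "was", "you", "this", "are"
        };

    // Characters treated as word boundaries (an assumption; extend as needed).
    private static readonly char[] Separators =
        { ' ', '\t', '\r', '\n', '.', ',', ';', ':', '!', '?', '"', '(', ')' };

    // Fraction of word tokens found in the common-word list (0.0 for empty text).
    public static double CommonWordRatio(string text)
    {
        if (string.IsNullOrWhiteSpace(text)) return 0.0;

        string[] words = text.Split(Separators, StringSplitOptions.RemoveEmptyEntries);
        if (words.Length == 0) return 0.0;

        int hits = words.Count(w => CommonWords.Contains(w));
        return (double)hits / words.Length;
    }

    // True when enough tokens are common English words to skip the translation call.
    public static bool LooksEnglish(string text, double threshold = 0.2)
    {
        return CommonWordRatio(text) >= threshold;
    }
}
```

Text for which `LooksEnglish` returns false would then be sent on to the translation API. Short words such as "a" and "in" also occur in other languages, so a low threshold will pass some non-English text.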
See this post
More specifically, take a look at trigrams.
In my application (and probably in lots of other applications) I want to use localization.
However, I don't know the best way to manage almost identical strings.
Some questions:
When 'Edit' is used both as a menu name and as a menu item name, should I make two resource strings (with the same value "Edit") or only one?
When there is an edit button on the screen and I want to use the Alt key, the string needs to be "_Edit". Should I make a separate resource string for this?
And what if there is also a button that shows a new page, and thus (for my convention), the value should be "_Edit..."?
It's not a quick answer topic (localization) but I'll try to answer at least to your questions:
1) Yes, even if it's not mandatory, it's better to keep each text separate. If you change one of them, you won't risk changing both (and in other languages they may need different translations because of their context).
2) Absolutely yes. Imagine you have these strings: _Edit and _Mark as read. In Italian, for example, they'll be translated to _Modifica and Segna come già letto. The two strings can't both use "_" on the same letter, so choosing the access key is something the translators should take care of.
3) I suggest no. As for 1), try to keep different strings separate (for the same reason I gave before). If you want to save money on translation (who doesn't?), you may write a program that pre-parses the strings to produce a "normalized" output to give to translators (it'll remove duplicates and merge similar strings, when possible). But you should keep your program unaware of these details.
To summarize: no, don't try to merge (similar) strings inside your program. If you really need it then it'll be done with an external program (it can even take into account strings from different modules so it'll do a better job) and only when applicable (for case 3 it's possible, for case 1 and 2 absolutely not).
Yes, you should make separate resource strings. You are using the word "edit" to describe different actions in your application in English. However, in another language these actions might be described by different words (one might be "edit" while the other one might be "change").
So you'll need the ability to assign different strings for your menus and buttons.
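A small sketch of what "separate resource strings" can look like in practice. Nested dictionaries stand in for per-culture resource files; the class name `UiStrings`, the key names, and the English fallback are assumptions for illustration, and the Italian entries are examples only.

```csharp
using System;
using System.Collections.Generic;

public static class UiStrings
{
    // Stand-in for per-culture resource files; keys name the context, not the text.
    private static readonly Dictionary<string, Dictionary<string, string>> Tables =
        new Dictionary<string, Dictionary<string, string>>
        {
            {
                "en", new Dictionary<string, string>
                {
                    { "Menu.Edit", "Edit" },
                    { "Button.Edit", "_Edit" },
                    { "Button.EditPage", "_Edit..." },
                }
            },
            {
                "it", new Dictionary<string, string>
                {
                    { "Menu.Edit", "Modifica" },
                    { "Button.Edit", "_Modifica" },
                }
            },
        };

    // Looks up a key for a language, falling back to English when it is missing.
    public static string Get(string language, string key)
    {
        Dictionary<string, string> table;
        string value;
        if (Tables.TryGetValue(language, out table) && table.TryGetValue(key, out value))
        {
            return value;
        }
        return Tables["en"][key];
    }
}
```

Because each context has its own key, a translator can give the menu and the button different words or different access keys without any change to the calling code.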
1) You should favor abstraction and decoupling: you have two concepts, a category "edition" and an action "edit", so it makes sense to have two separate strings, because in languages other than English "edit/edit" might not be acceptable.
2) Here you have only one concept with two different formats, so only one string and a transformation rule ("_" + myString) make sense.
3) It depends: is the action triggered by this button exactly the same concept as the one above? If yes, use the same string; if no, use two.
These are general considerations, and you should of course adapt them to your needs:
if you really have to support n languages up front, then follow these guidelines; they should emerge naturally during the localization process
if you anticipate that somewhere in the future you may have to support a language other than English, and you have a lot of identical strings now, then you might not want to spend too much time trying to do things perfectly, as there is a good chance it will be for nothing and the time could be better spent enhancing or adding core features.
What libraries are there to write C# internationalized applications?
Typical functionalities that should be contained in the library:
Validation of country specific data (e.g. VAT numbers, phone numbers, addresses,...)
Validation of bank and financial coordinates (e.g. Credit Card numbers, IBAN,...)
Language-specific functionalities (e.g. numbers to words to numbers, summarize,...)
Language specific content filtering (e.g. swearword filtering...)
An example of such libraries in Perl would be the Internationalization/Locale section of CPAN.
What C# solutions are available?
Note: I am not looking for an introduction to the System.Globalization namespace :)
Note 2: Should I deduce that there are no options available? Is anyone interested in joining forces to create one?
Note 3: Edit to make the question appear on front page in hope of more answers. This isn't such a hard question, how is it possible that Stackers don't ever do i18n?
One project that is working towards a database of globalization, internationalization and localization knowledge is the Unicode Common Locale Data Repository, based on the old ICU project at IBM.
As it is a database of XML data it doesn't contain any .NET-specific code, but as a body of knowledge it is very good.
Only a smallish subset is in the .NET framework. Microsoft hasn't gone near any of the supplemental stuff, like postcode formats, number spelling (for check/cheque amounts), etc. Standard time zone names (from the Olson/tz distribution), etc. are also included, with mappings to the Windows-specific names. Some of the hierarchical locale-specific behaviours also have better support.
I wouldn't say that no one does i18n, but I don't know of any generic tools that can be used for every project. Maintaining a database with all of the information you are looking for would be an epic project. It sounds like what you're looking for isn't a specific C# library, but more a collection of information online that you can draw from. If you were able to find a repository of swear words in various languages (for example), it would be trivial for you to use this in C#. I think that a solution that wraps up all of your requirements into an easy-to-use assembly is going to be impossible to find.
Have a look at
http://www.microsoft.com/globaldev/getwr/dotneti18n.mspx
and
http://www.dotneti18n.com/
String to number and vice versa can be done as follows:
// Requires: using System; using System.Globalization;
CultureInfo culture = new CultureInfo(locale);
// Parse and format using the number conventions of the given culture.
int number = Convert.ToInt32(myString, culture.NumberFormat);
string str = Convert.ToString(myNumber, culture.NumberFormat);
As for checking VAT numbers and addresses, I'm interested in that too, but I haven't found anything useful so far.
Not exactly a "library", per se, but I've actually run into a great service (for pay), by a company called E4X (former client of mine).
What they provide is complete localization of your ecommerce site, including language translations, currency exchanges, local billing and handling of financial transactions including region-specific taxes etc., and more. They even deal with the logistics of physical shipping...
Worth looking into, for an ecommerce business. Let 'em know I sent you... ;-)
That's a huge endeavor. Let's start with one simple problem: phone numbers. Libphonenumber Google library at http://code.google.com/p/libphonenumber/ has a C# port at https://bitbucket.org/pmezard/libphonenumber-csharp with notes at http://blog.thekieners.com/2011/06/06/using-googles-libphonenumber-in-microsoft-net-with-c/. Appears to be a good library for handling both US and int'l numbers.