Detect language of text [duplicate] - c#

Is there any C# library which can detect the language of a particular piece of text? i.e. for an input text "This is a sentence", it should detect the language as "English". Or for "Esto es una sentencia" it should detect the language as "Spanish".
I understand that language detection from text is not a deterministic problem. But both Google Translate and Bing Translator have an "Auto detect" option, which best-guesses the input language. Is there something similar available publicly, preferably in C#?

Yes indeed, TextCat is very good for language identification, and it has many implementations in different languages.
There were no ports for .NET, so I wrote one: NTextCat (NuGet, Online Demo).
It is a pure .NET Standard 2.0 DLL plus a command-line interface to it. By default, it uses a profile of 14 languages.
Any feedback is much appreciated! New ideas and feature requests are welcome too :)

Language detection is a pretty hard thing to do.
Some languages are much easier to detect than others simply due to the diacritics and digraphs/trigraphs used. For example, double-acute accents are used almost exclusively in Hungarian. The dotless i ‘ı’ is used exclusively [I think] in Turkish, the t-comma (not t-cedilla) is used only in Romanian, and the eszett ‘ß’ occurs only in German.
Some digraphs, trigraphs and tetragraphs are also a good give-away. For example, you'll most likely find ‘eeuw’ and ‘ieuw’ primarily in Dutch, and ‘tsch’ and ‘dsch’ primarily in German etc.
More giveaways would include common words or common prefixes/suffixes used in a particular language. Sometimes even the punctuation that is used can help determine a language (quote-style and use, etc).
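As a rough illustration of the character and letter-group giveaways described above, here is a minimal Python sketch. The hint tables contain only the examples mentioned in this answer, and the function name `language_hints` is made up for this example; a real detector would need far more evidence than a handful of characters.

```python
from collections import Counter

# Hypothetical hint tables, built only from the examples in the text above.
CHAR_HINTS = {
    "ő": "Hungarian", "ű": "Hungarian",  # double-acute accents
    "ı": "Turkish",                      # dotless i
    "ț": "Romanian",                     # t-comma (U+021B), not t-cedilla
    "ß": "German",                       # eszett
}
NGRAM_HINTS = {
    "eeuw": "Dutch", "ieuw": "Dutch",
    "tsch": "German", "dsch": "German",
}

def language_hints(text):
    """Count how many hint characters and letter groups point to each language."""
    lowered = text.lower()
    votes = Counter()
    for ch in lowered:
        if ch in CHAR_HINTS:
            votes[CHAR_HINTS[ch]] += 1
    for gram, lang in NGRAM_HINTS.items():
        votes[lang] += lowered.count(gram)
    return +votes  # unary plus drops languages with a zero count
```

The result is only a set of votes, not a decision: text without any of these markers yields an empty counter, so this would at best complement a statistical method.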
If such a library exists I would like to know about it, since I'm working on one myself.

Please find a C# implementation based on 3-gram analysis here:
http://idsyst.hu/development/language_detector.html

Here is a simple detector based on bigram statistics. In essence, it learns from a large training set which bigrams occur most frequently in each language, then counts the bigrams in a piece of text and compares them against those learned frequencies:
http://allantech.blogspot.com/2007/07/automatic-language-detection.html
This is probably good enough for many (most?) applications and doesn't require Internet access.
Of course it will perform worse than Google's or Bing's algorithm (which themselves aren't great). If you need excellent detection performance, you would have to do a lot of hard work over huge amounts of data.
The other option would be to leverage Google's or Bing APIs if your app has Internet access.
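The bigram approach described in this answer can be sketched in a few lines of Python. This is not the code from the linked blog post; the function names, the scoring rule (summing learned relative frequencies) and the tiny training sentences are all assumptions made for illustration, and real profiles would need large corpora.

```python
from collections import Counter

def bigrams(text):
    """Return the character bigrams of the lowercased letters and spaces in text."""
    cleaned = "".join(c for c in text.lower() if c.isalpha() or c == " ")
    return [cleaned[i:i + 2] for i in range(len(cleaned) - 1)]

def train(samples):
    """Build relative bigram frequencies per language from {language: sample text}."""
    profiles = {}
    for lang, sample in samples.items():
        counts = Counter(bigrams(sample))
        total = sum(counts.values())
        profiles[lang] = {gram: n / total for gram, n in counts.items()}
    return profiles

def detect(text, profiles):
    """Pick the language whose profile gives the text's bigrams the highest total weight."""
    grams = bigrams(text)
    scores = {lang: sum(p.get(g, 0.0) for g in grams) for lang, p in profiles.items()}
    return max(scores, key=scores.get)

# Toy training data (an assumption for illustration only).
SAMPLES = {
    "English": "this is a sentence and the quick brown fox jumps over the lazy dog",
    "Spanish": "esto es una frase y el rápido zorro marrón salta sobre el perro perezoso",
}
PROFILES = train(SAMPLES)
```

With profiles this small the result is easily fooled by cognates and short inputs, which is the accuracy trade-off the answer mentions.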

You'll want a machine learning algorithm based on hidden Markov models, trained on a bunch of texts in different languages.
Then, when it gets an unidentified text, the language with the closest 'score' is the winner.

There is a simple tool to identify text language:
http://www.detectlanguage.com/

I've found that "textcat" is very useful for this. I've used a PHP implementation, PHP Text Cat, based on this original implementation, and found it reliable. If you have a look at the sources, you'll find it's not a terrifyingly difficult thing to implement in the language of your choice. The hard work -- the letter combinations that are relevant to a particular language -- is all in there as data.
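To give a feel for why it is not terrifyingly difficult, here is a minimal Python sketch of the rank-order ("out-of-place") comparison that TextCat-style classifiers are generally described as using. It is not taken from any TextCat source: the function names, the n-gram length limit of 3 and the 300-entry profile size are assumptions chosen for brevity.

```python
from collections import Counter

def ngram_ranks(text, max_n=3, top=300):
    """Rank the most frequent character n-grams (length 1..max_n); rank 0 is most frequent."""
    padded = "_" + "_".join(text.lower().split()) + "_"
    counts = Counter(
        padded[i:i + n]
        for n in range(1, max_n + 1)
        for i in range(len(padded) - n + 1)
    )
    # Sort by count descending, then alphabetically so ties are deterministic.
    ordered = sorted(counts, key=lambda g: (-counts[g], g))[:top]
    return {gram: rank for rank, gram in enumerate(ordered)}

def out_of_place(doc_ranks, lang_ranks):
    """Sum of rank differences; n-grams missing from the language profile get a fixed penalty."""
    penalty = len(lang_ranks)
    return sum(
        abs(rank - lang_ranks[gram]) if gram in lang_ranks else penalty
        for gram, rank in doc_ranks.items()
    )

def classify(text, profiles):
    """Return the language whose profile is closest to the text's n-gram ranking."""
    doc = ngram_ranks(text)
    return min(profiles, key=lambda lang: out_of_place(doc, profiles[lang]))
```

Here `profiles` is a dictionary mapping a language name to the output of `ngram_ranks` on training text for that language, which corresponds to the per-language data files the answer refers to.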

Related

Determining if text is english

I currently have a lot of comments and text in my database that are mainly in English. However, if a text isn't in English, I want to translate it to English.
I know I can call a translation api to determine the language but I don't want to make millions of translation API calls for text that most likely won't need translating.
I am looking for a way to determine if the text is English or not. I don't need to know what language it is, just that it isn't English, then if it isn't English I will send it to a translation service API.
The Chromium project (including its most popular implementation, Google Chrome) solves this problem with https://github.com/google/cld3.
If your only need is to detect whether or not something is English, then in theory you can use something even more compact.
Most good language detectors use trigram frequency (a gram being a single character) or trigram frequency overlaid with word frequency. For your application, it seems you could use a hybrid approach: a first pass that is local, of low accuracy, and tuned to be a bit aggressive so as not to miss any potential English, and a second pass that hits an API like Google Translate.
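The two-pass idea might be wired together roughly as follows. This is only a sketch: `looks_english` and `remote_detect` are hypothetical callables that you would supply (a cheap local check and a wrapper around whichever detection API you use), and the `"en"` language code is an assumption about what that wrapper returns.

```python
def needs_translation(text, looks_english, remote_detect):
    """Return True if text should be sent on to the translation service.

    looks_english: cheap local check, tuned to answer True whenever English is plausible.
    remote_detect: callable returning a language code; stands in for a real API client.
    """
    if looks_english(text):
        return False                      # first pass: local, no network call
    return remote_detect(text) != "en"    # second pass: only for the leftovers
```

The point of the design is that the remote call is made only for text the local pass could not accept, so the number of API calls scales with the non-English minority rather than with the whole database.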
The popularity of English and the amount of English data usually help when applying NLP solutions to it, but in this case you will unfortunately often find false positives for English, because data sources that are listed as English contain other languages or non-language garbage such as stray characters or URLs.
Note also that for many queries there is no single correct answer. Good systems will return a weighted list of possibilities, but for a query like [dan], [a], [example#example.com] or [hi! como estas? i'm in class ahorita] the most correct answer will depend on your application and may not exist.
You can use NTextCat to determine input language.
Research (by a certain Zipf) determined that, for the most part, a few words are used very frequently and a lot of words are used rarely.
If I were given this problem, I'd probably put together a list of the top X most-used words. Then for each comment I would see if there's a match.
It's not perfect (and if the text is very specialised, or misspelt, you've got an issue), but I think it's an acceptable heuristic.
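A minimal Python sketch of that heuristic follows. The word list is a small hand-picked sample rather than a real frequency table, and the 0.2 threshold is an arbitrary assumption you would tune against your own data.

```python
import re

# A small hand-picked sample of very frequent English words (an assumption;
# a real list would come from a frequency table for the language).
COMMON_ENGLISH = {"the", "be", "to", "of", "and", "a", "in", "that", "have", "i",
                  "it", "for", "not", "on", "with", "is", "this", "you", "are"}

def english_ratio(text):
    """Fraction of the words in text that appear in the common-word list."""
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return 0.0
    return sum(w in COMMON_ENGLISH for w in words) / len(words)

def looks_english(text, threshold=0.2):
    """Treat text as English when enough of its words are very common English words."""
    return english_ratio(text) >= threshold
```

For the examples from the first question on this page, `looks_english("This is a sentence")` is true (three of four words match) and `looks_english("Esto es una sentencia")` is false.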
See this post
More specifically, take a look at trigrams.

Why doesn't a phone number datatype/library exist

Just wondering if anyone is aware of any low level phone number storage constructs for C#. I have been surprised to find that all of my searches have proved fruitless for such a library.
Essentially I am hoping for something that can take a string phone number input (in all its varied goodness) and both validate and segment the given string into its various sections (IE: country code, area code, number) along with providing a common format to store this data.
Does any such library exist? If not, any idea why something like this hasn't been attempted? (Is it really that hard a problem?)
I think that the main reason is that between people themselves there seems to be no real standard way of writing the phone number. For instance, people living on small islands tend to not have regional codes and there is no need for the country code when calling residents of the same island.
This changes when you move to larger places. Also, I have seen certain numbers written as (XXXX)XXXXX or XXXXX-XXXXX or XXXXXXXXXX.
The standard way of dealing with this seems to be with regular expressions. The developer usually takes a few possible input formats and uses regular expressions to validate and transform the format of the number.
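As an illustration of that regular-expression approach, here is a minimal Python sketch. The accepted character set, the `default_country` fallback and the 7-digit lower bound are assumptions made for this example; only the 15-digit upper bound comes from the E.164 numbering standard.

```python
import re

def normalize_phone(raw, default_country="1"):
    """Reduce a loosely formatted number to '+<digits>', or return None if implausible.

    default_country is a hypothetical fallback for numbers written without a leading '+'.
    """
    if not re.fullmatch(r"[\d\s().+-]+", raw):
        return None                       # contains characters we do not expect
    has_plus = raw.strip().startswith("+")
    digits = re.sub(r"\D", "", raw)       # keep digits only
    if not has_plus:
        digits = default_country + digits
    # E.164 allows at most 15 digits; the lower bound is an arbitrary sanity check.
    return "+" + digits if 7 <= len(digits) <= 15 else None
```

This handles inputs such as `(555) 123-4567` or `+44 20 7946 0958`, but it knows nothing about per-country numbering plans, which is where a dedicated library earns its keep.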
So it turns out there is actually a pretty awesome library available after all.
Original Java version: http://code.google.com/p/libphonenumber/
C# version: https://bitbucket.org/pmezard/libphonenumber-csharp/wiki/Home
Me discussing the library on my blog.

Is there any DECAPTCHA library in .NET?

I'm looking for some sample projects to read CAPTCHA images. Is there any in C# or VB ?
pseudo code:
String captchaText = CaptchaDecoder(Image captchaImage);
Take a look at:
Text-based CAPTCHA Strengths and Weaknesses. ACM Conference on Computer and Communications Security 2011 (CCS’2011).
The authors present a CAPTCHA breaker and explain a generic algorithm to crack standard CAPTCHAs
In this section we present our captcha breaker, Decaptcha, which is
able to break many popular captchas including eBay, Wikipedia and Digg
[...] Decaptcha implements a refined version of the three stage
approach in 15,000 lines of code in C# [...]
This is easier said than done.
This involves either brute-forcing the captcha or running OCR algorithms on it to try and detect what is written in the captcha.
You might want to check this related question: Has reCaptcha been cracked / hacked / OCR'd / defeated / broken?
It also depends on what techniques were used to produce the CAPTCHA. Some distort the text and some squeeze the text.
Your question is a little vague.
Additional reading here: http://en.wikipedia.org/wiki/CAPTCHA
There are so many types of Captchas out there that you won't find a single library to read them all. If you are only interested in one type though, you might have more luck. Even then, there are lots of variations on Captchas, and the engines frequently produce (whether on purpose or incidentally) tricky ones which even humans can't figure out. Humans can click the little icon to get a new one; your program might not be able to.

Recommended data format for describing the rules of chess

I'm going to be writing a chess server and one or more clients for chess, and I want to describe the rules of chess (e.g. allowable moves based on game state, rules for when a game is complete) in a programming-language-independent way. This is a bit tricky since some of the chess rules (e.g. castling, en passant, draws based on 3 or more repeated moves) depend not only on the board layout but also on the history of moves.
I would prefer the format to be:
- textual
- human readable
- based on a standard (e.g. YAML, XML)
- easily parsable in a variety of languages
But I am willing to sacrifice any of these for a suitable solution.
My main question is: How can I build algorithms of such a complexity that operate on such complex state from a data format?
A follow-up question is: Can you provide an example of a similar problem solved in a similar manner that can act as a starting point?
Edit: In response to a request for clarity -- consider that I will have a server written in Python, one client written in C# and another client written in Java. I would like to avoid specifying the rules (e.g. for allowable piece movement, circumstances for check, etc.) in each place. I would prefer to specify these rules once in a language-independent manner.
Let's think. We're describing objects (locations and pieces) with states and behaviors. We need to note a current state and an ever-changing set of allowed state changes from a current state.
This is programming. You don't want some "meta-language" that you can then parse in a regular programming language. Just use a programming language.
Start with ordinary class definitions in an ordinary language. Get it all to work. Then, those class definitions are the definition of chess.
With only minuscule exceptions, all programming languages are:
- Textual
- Human readable
- Reasonably standardized
- Easily parsed by their respective compilers or interpreters.
Just pick a language, and you're done. Since it will take a while to work out the nuances, you'll probably be happier with a dynamic language like Python or Ruby than with a static language like Java or C#.
If you want portability, pick a portable language. If you want the language embedded in a "larger" application, then pick the language for your "larger" application.
Since the original requirements were incomplete, a secondary minor issue is how to have code that runs in conjunction with multiple clients.
Don't have clients in multiple languages. Pick one. Java, for example, and stick with it.
If you must have clients in multiple languages, then you need a language you can embed in all three language run-time environments. You have two choices.
Embed an interpreter. For example Python, Tcl and JavaScript are lightweight interpreters that you can call from C or C# programs. This approach works for browsers, it can work for you. Java, via JNI can make use of this, also. There are BPEL rules engines that you can try this with.
Spawn an interpreter as a separate subprocess. Open a named pipe or socket or something between your app and your spawned interpreter. Your Java and C# clients can talk with a Python subprocess. Your Python server can simply use this code.
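The subprocess option can be sketched as follows. For brevity both sides are Python here, whereas in the scenario above the caller would be the Java or C# client. The one-JSON-object-per-line protocol, the `RULES_PROGRAM` stand-in (which only checks that a move stays on the board) and the function name are all assumptions for illustration. The sketch also runs the child once per batch instead of keeping a long-lived pipe open.

```python
import json
import subprocess
import sys

# Stand-in for the shared rules program. It only checks that the destination
# is on the board; the JSON-lines protocol is an assumption for illustration.
RULES_PROGRAM = r'''
import json, sys
for line in sys.stdin:
    move = json.loads(line)
    ok = all(0 <= move[k] <= 7 for k in ("to_file", "to_rank"))
    print(json.dumps({"legal": ok}), flush=True)
'''

def ask_rules_process(moves):
    """Send moves to the rules subprocess, one JSON object per line, and collect the replies."""
    payload = "".join(json.dumps(m) + "\n" for m in moves)
    result = subprocess.run(
        [sys.executable, "-c", RULES_PROGRAM],
        input=payload, capture_output=True, text=True, check=True,
    )
    return [json.loads(line) for line in result.stdout.splitlines()]
```

Because the exchange is plain text over standard input and output, a client in any language that can spawn a process and read lines could talk to the same rules program.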
There's already a widely used format specific to chess called Portable Game Notation. There's also Smart Game Format, which is adaptable to many different games.
I would suggest Prolog for describing the rules.
This is answering the followup question :-)
I can point out that one of the most popular chess servers around documents its protocol here (Warning, FTP link, and does not support passive FTP), but only to write interfaces to it, not for any other purpose. You could start writing a client for this server as a learning experience.
One thing that's relevant is that good chess servers offer many more features than just a move relay.
That said, there is a more basic protocol used to interface to chess engines, documented here.
Oh, and by the way: Board Representation at Wikipedia
Anything beyond board representation belongs to the program itself, as many have already pointed out.
Edit: Overly wordy answer deleted.
The short answer is: write the rules in Python. Use IronPython to interface that to the C# client, and Jython for the Java client.
What I've gathered from the responses so far:
For chess board data representations:
See the Wikipedia article on [chess board representations](http://en.wikipedia.org/wiki/Board_representation_(chess)).
For chess move data representations:
See the Wikipedia articles on Portable Game Notation and Algebraic Chess Notation
For chess rules representations:
This must be done using a programming language. If one wants to reduce the amount of code written when the rules will be implemented in more than one language, there are a few options:
- Use a language for which an embeddable interpreter exists in the target languages (e.g. Lua, Python).
- Use a language implementation that targets each client's virtual machine (e.g. IronPython for C#, Jython for Java).
- Use a background daemon or sub-process for the rules with which the target languages can communicate.
- Reimplement the rules algorithms in each target language.
Although I would have liked a declarative syntax that could have been interpreted by multiple languages to enforce the rules of chess, my research has led me to no likely candidate. I have a suspicion that Constraint Based Programming may be a possible route, given that solvers exist for many languages, but I am not sure they would truly fulfill this requirement. Thanks for all the attention, and perhaps in the future an answer will appear.
Drools has a modern human readable rules implementation -- https://www.jboss.org/drools/.
They have a way users can enter their rules in Excel. A lot more users can understand what is in Excel than in other tools.
To represent the current state of a board (including castling possibilities etc.) you can use Forsyth-Edwards Notation, which will give you a short ASCII representation, e.g.:
rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1
Would be the opening board position.
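To show how little is needed to read such a string, here is a minimal Python sketch of a parser for the six space-separated FEN fields. The function name and the dictionary layout it returns are choices made for this example, and it does only basic shape validation.

```python
def parse_fen(fen):
    """Split a FEN string into its six fields and expand the board into 8 rows of 8 squares."""
    placement, side, castling, en_passant, halfmove, fullmove = fen.split()
    board = []
    for rank in placement.split("/"):       # ranks are listed from 8 down to 1
        row = []
        for ch in rank:
            if ch.isdigit():
                row.extend("." * int(ch))   # a digit means that many empty squares
            else:
                row.append(ch)              # uppercase = white, lowercase = black
        if len(row) != 8:
            raise ValueError(f"bad rank in FEN: {rank!r}")
        board.append(row)
    if len(board) != 8:
        raise ValueError("FEN must describe 8 ranks")
    return {
        "board": board,
        "white_to_move": side == "w",
        "castling": "" if castling == "-" else castling,
        "en_passant": None if en_passant == "-" else en_passant,
        "halfmove_clock": int(halfmove),
        "fullmove_number": int(fullmove),
    }
```

Note that the castling rights, en passant square and move counters travel with the position, which is how FEN captures the parts of the state that the bare piece layout cannot.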
Then to represent a particular move from a position you could use numeric move notation (as used in correspondence chess), which gives you a short (4-5 digits) representation of a move on the board.
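In numeric notation each square is written as two digits, file (a to h as 1 to 8) followed by rank. A minimal Python sketch of the four-digit case is below; the function name is made up for this example, and the optional fifth digit used for pawn promotion is deliberately left out.

```python
def to_numeric(move):
    """Convert a coordinate move such as 'e2e4' to four-digit numeric notation ('5254')."""
    if len(move) != 4:
        raise ValueError("expected a move like 'e2e4'")
    out = []
    for file_char, rank_char in (move[0:2], move[2:4]):
        if file_char not in "abcdefgh" or rank_char not in "12345678":
            raise ValueError(f"bad square in {move!r}")
        # File letter becomes its 1-based index; the rank digit is kept as-is.
        out.append(str("abcdefgh".index(file_char) + 1) + rank_char)
    return "".join(out)
```

For example, `to_numeric("e2e4")` gives `"5254"`.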
As to represent the rules - I'd love to know myself. Currently the rules for my chess engine are just written in Python and probably aren't as declarative as I'd like.
I would agree with the comment left by ΤΖΩΤΖΙΟΥ, viz. just let the server do the validation and let the clients submit a potential move. If that's not the way you want to take the design, then just write the rules in Python as suggested by S. Lott and others.
It really shouldn't be that hard. You can break the rules down into three major categories:
- Rules that rely on the state of the board (castling, en passant, draws, check, checkmate, passing through check, is it even this player's turn, etc.)
- Rules that apply to all pieces (can't occupy the same square as another piece of your own colour, moving to a square w/ opponent's piece == capture, can't move off the board)
- Rules that apply to each individual piece (pawns can't move backwards, rooks can't move diagonally, etc.)
Each rule can be implemented as a function, and then for each half-move, validity is determined by seeing if it passes all of the validations.
For each potential move submitted, you would just need to check the rules in the following order:
is the proposed move potentially valid? (the right "shape" for the piece)
does it fit the restraints of the board? (is the piece blocked, would it move off the edge)
does the move violate state requirements? (am I in check after this move? do I move through check? is this en passant capture legal?)
If all of those are ok, then the server should accept the move as legal…
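The "each rule is a function" idea might look like the following Python sketch. It covers only the move shape for knights and rooks and an on-board check; blocking pieces and the state-dependent rules (check, castling, en passant) are omitted, and all names and the `(file, rank)` tuple representation are assumptions for illustration.

```python
def shape_ok(piece, src, dst):
    """Step 1: does the move have the right 'shape' for the piece? (knight and rook only)"""
    dx, dy = abs(dst[0] - src[0]), abs(dst[1] - src[1])
    if piece == "N":
        return {dx, dy} == {1, 2}
    if piece == "R":
        return (dx == 0) != (dy == 0)   # along exactly one axis, and not a null move
    return False                        # other pieces are omitted from this sketch

def on_board(piece, src, dst):
    """Step 2 (in part): the destination must lie on the 8x8 board."""
    return all(0 <= c <= 7 for c in dst)

# Validators run in the order given in the text; state checks (step 3) would be appended.
VALIDATORS = [shape_ok, on_board]

def is_legal(piece, src, dst):
    """A move is accepted only if every validator passes."""
    return all(check(piece, src, dst) for check in VALIDATORS)
```

Adding a rule then means appending one more function with the same signature to `VALIDATORS`, although the state-dependent checks would also need the board and move history passed in.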

Should you use international identifiers in Java/C#?

C# and Java allow almost any character in class names, method names, local variables, etc. Is it bad practice to use non-ASCII characters, testing the boundaries of poor editors and analysis tools and making it difficult for some people to read, or is American arrogance the only argument against?
I would stick to English, simply because you usually never know who is working on that code, and because some third-party tools used in the build/testing/bugtracking process may have problems. Typing äöüß on a non-German keyboard is simply a PITA, and I simply believe that anyone involved in software development should speak English, but maybe that's just my arrogance as a non-native English speaker.
What you call "American arrogance" is not whether or not your program uses international variable names, it's when your program thinks "Währung" and "Wahrung" are the same words.
I'd say it entirely depends on who's working on the codebase.
If you have a small group of developers who all share a common language, and you never plan on needing anyone who doesn't speak the language to work on the code, then go ahead and use whatever characters you want.
If you need to have people of varying cultures and languages working on the code then it's probably best to stick with English since it's the common denominator for just about everyone in the world.
If your business people are non-English speakers, and you think Domain Driven Design has something to it, then there is another aspect: How do we, as developers, use the same domain language as our business without any translation overhead?
That does not only mean translations between languages, say English and Norwegian, but also between different words. We should use the exact same words as our business for our entity classes and services.
I have found it easier to just give in and use my native language. Now that my code uses the same words, it's easier to have a conversation with my domain experts. And after a while you get used to it, just like how you got used to coding without Hungarian notation.
I used to work in a development team that happily wiped their asses with any naming (and for that matter any other coding) conventions. Believe it or not, having to cope with ä's and ö's in the code was a contributing factor in my resigning. Though I'm Finnish, I prefer writing code with US keyboard settings because curly and square brackets are a pain to write on a Finnish keyboard (try right alt and 7 and 0 for curlies).
So I say stick with the ASCII characters.
Here's an example of where I've used non-ASCII identifiers, because I found it more readable than replacing the Greek letters with their English names, even though I don't have θ or φ on my keyboard (I relied on copy-and-paste).
However these are all local variables. I would keep non-ASCII identifiers out of public interfaces.
It depends:
Does your team conform to any existing standards that require your using ASCII?
Is your code ever going to be feasibly reused or read by someone who doesn't speak your native language?
Do you envision a scenario where you'll need to ask for help online and will therefore not be able to copy-paste your code sample in as-is?
Are you unsure whether your entire suite of tools supports your source code encoding?
If you answered 'yes' to any of the above, stay ASCII only. If not, go forward at your own risk.
Part of the problem is that the Java and C# languages and their libraries are based on English words like if and toString(). I personally would not like to switch between a non-English language and English while reading code.
However, if your database, UI, and business logic (including metaphors) are already in some non-English language, there's no need to translate every method name and variable into English.
If you get past the other prerequisites, you then have one extra (IMHO more important) one: how difficult is the symbol to type?
On my regular en-US keyboard, the only ways I know of to type the letter ç are to hold Alt and type 0231 on the numeric keypad, or to copy and paste.
This would be a HUGE roadblock in the way of typing quickly. You don't want to slow your coding down with trivial stuff like this if you aren't forced to. International keyboards may alleviate this, but then what happens if you have to code on your laptop, which doesn't have an international keyboard?
I would stick to ASCII characters, because if anyone in your development team is using an SDK that only supports ASCII, or you want to make your code open source, a lot of problems could arise. Personally, I would not do it even if you are not planning on bringing anyone who doesn't speak the language in on the project, because you are running a business, and it seems to me that anyone running a business would want it to expand, which in this day and age means transcending national borders. My opinion is that English is the language of the realm, and even if you name your variables in a different language, there is little to no point in using any non-ASCII characters in your programming. Leave it up to the language to deal with it if you are handling data that is UTF-8: my iPhone program (which involves tons of user data going between the phone and server) has full UTF-8 support, but has no UTF-8 in the source code. It just seems to open such a large can of worms for almost no benefit.
There is another hazard to using non-ASCII characters, though it will probably only bite in obscure cases. The allowed characters are defined in terms of the methods Character.isJavaIdentifierStart(int) and Character.isJavaIdentifierPart(int), which are defined in terms of Unicode. However, the exact version of Unicode used depends on the version of the Java platform, as specified in the documentation for java.lang.Character.
Since character properties change slightly from one Unicode version to the next, it's possible (but probably very unlikely) you could have identifiers that are valid in one version of Java, but not in the next.
As already pointed out, unless method names mostly match the language, it is a bit weird to constantly switch languages while reading.
For the Scandinavian languages & German, which I can speak and thus speak for, I would at least recommend using standard substitutions, i.e.
ä/æ -> ae, ö/ø -> oe, å -> aa, ü -> ue
etc., just in case, as others may find it difficult to type the original letters without keyboard/keymap changes. Imagine if you suddenly had to work with a codebase where the developers used a third language (for instance including the French ç) and didn't do this. Switching between more than 2 keymaps to type efficiently would be painful in my experience.
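If you wanted to apply the substitutions listed above mechanically, for instance when generating identifiers from domain terms, a minimal Python sketch could look like this. The uppercase variants and the ß to ss mapping are additions of mine, not part of the list above, and the function name is made up for this example.

```python
# The substitutions from the answer above, plus uppercase variants and ß -> ss
# (the last two groups are assumptions added for this sketch).
SUBSTITUTIONS = str.maketrans({
    "ä": "ae", "æ": "ae", "ö": "oe", "ø": "oe", "å": "aa", "ü": "ue",
    "Ä": "Ae", "Æ": "Ae", "Ö": "Oe", "Ø": "Oe", "Å": "Aa", "Ü": "Ue",
    "ß": "ss",
})

def ascii_identifier(name):
    """Replace Scandinavian and German letters with their conventional ASCII spellings."""
    return name.translate(SUBSTITUTIONS)
```

The mapping assumes the input uses precomposed characters (Unicode NFC form); text where ä is stored as a plus a combining diaeresis would pass through unchanged.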
