How can I add a spellchecker to a richtextbox in my application?
You can purchase a spell checker control, integrate the Microsoft Office Spell Checker, write your own (which isn't too hard, actually, once you get the Soundex function figured out), or get a good free one. Here's a (relatively) good free one:
http://www.codeproject.com/KB/recipes/spellchecker_mg.aspx
For commercial products, I'd suggest Googling "spell check WinForms".
If you're interested in rolling your own, I wrote one for ASP.NET back when I was in my beginner phase, and even then it only took me about a week to research and a day or so to code. It's a fun pet project. I'd start by looking at Soundex functions and comparing Soundex values for similarity.
Start by comparing all of the words in the TextBox against a known dictionary, and use the Soundex function to come up with similar words for anything that isn't found.
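If you want to experiment with that, here's a minimal sketch of a simplified Soundex encoder in C# (the full algorithm has a few extra rules around H and W that are omitted here for brevity):

```csharp
using System;
using System.Text;

static class Soundex
{
    // Simplified American Soundex: keep the first letter, map the remaining
    // letters to digit codes, collapse adjacent duplicates, pad to 4 chars.
    public static string Encode(string word)
    {
        if (string.IsNullOrEmpty(word)) return string.Empty;

        const string codes = "01230120022455012623010202"; // codes for a..z
        var result = new StringBuilder();
        char first = char.ToUpperInvariant(word[0]);
        result.Append(first);

        char lastCode = codes[first - 'A'];
        foreach (char c in word.Substring(1).ToUpperInvariant())
        {
            if (c < 'A' || c > 'Z') continue;   // skip punctuation etc.
            char code = codes[c - 'A'];
            if (code != '0' && code != lastCode) result.Append(code);
            lastCode = code;
        }
        return result.ToString().PadRight(4, '0').Substring(0, 4);
    }
}

// Two words are "similar" when their codes match, e.g.:
// Soundex.Encode("Robert") == Soundex.Encode("Rupert")   // both "R163"
```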
From there, you can go on to creating a table of "popular replacements" (for example, you can track that the word "teh" was replaced by "the" n times) and move the more popular replacements to the top of the suggestion list.
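One hypothetical way to store that "popular replacements" table, using nothing but dictionaries:

```csharp
using System.Collections.Generic;
using System.Linq;

// Tracks how often each misspelling was corrected to each replacement,
// so frequent corrections can be ranked first in the suggestion list.
class ReplacementTracker
{
    private readonly Dictionary<string, Dictionary<string, int>> _counts = new();

    public void RecordReplacement(string misspelling, string replacement)
    {
        if (!_counts.TryGetValue(misspelling, out var perWord))
            _counts[misspelling] = perWord = new Dictionary<string, int>();
        perWord[replacement] = perWord.GetValueOrDefault(replacement) + 1;
    }

    // Suggestions for a misspelling, most popular first.
    public IEnumerable<string> RankedSuggestions(string misspelling) =>
        _counts.TryGetValue(misspelling, out var perWord)
            ? perWord.OrderByDescending(kv => kv.Value).Select(kv => kv.Key)
            : Enumerable.Empty<string>();
}
```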
I found a better solution. Hunspell is the spellcheck library Mozilla and OpenOffice.org use. It's been ported to .NET as NHunspell, which is really easy to implement and has samples for you to use.
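For reference, basic NHunspell usage looks roughly like this (it assumes you've placed the en_US .aff/.dic dictionary files, which ship with the OpenOffice dictionaries, next to your binary):

```csharp
using System;
using NHunspell;

using (var hunspell = new Hunspell("en_US.aff", "en_US.dic"))
{
    // Check whether a word is spelled correctly.
    Console.WriteLine(hunspell.Spell("recommendation")); // True

    // Get suggestions for a misspelled word.
    if (!hunspell.Spell("recommendatio"))
        foreach (string suggestion in hunspell.Suggest("recommendatio"))
            Console.WriteLine(suggestion); // e.g. "recommendation"
}
```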
Related
Working in C#, I have an array of strings. Some of these strings are real words; others are complete nonsense. My goal is to come up with a way of deciding which of these words are real and which are not.
I had planned to find some kind of word list online that I could bring into my project, turn into a list, and compare against, but of course typing in "C# dictionary" comes up with an unrelated topic! I don't need a 100% accuracy rate.
To formalize the question:
In C#, what is the recommended way to establish whether or not a string is a real word?
Advice and guidance is very much appreciated!
Solution
Thanks for the great answers, they were all very useful. As it happens, the thing to do was to ask the same question in different wording. Searching for "C# spellcheck" brought up some great links, and I ended up using NHunspell, which you can get through NuGet and which is very easy to use.
The problem is that "Dictionary" is a type within the framework, so searching with that word will turn up all sorts of unrelated results. What you basically want to do is spell checking, which will determine whether or not a word is valid.
Searching for "C# spell check" yielded some promising results. Searching for "open source spell check" also turns up some.
I have previously implemented one of the open source ones within a VB6 project; I think it was ASpell. I haven't had to use a spell check library within C#, but I'm sure there is one, or at least one with a .NET wrapper to make implementation easier.
If you have special case words that do not exist in the dictionary/word file for a spell check solution, you can add them.
To do this I would use a freely available dictionary for Linux (googling "linux dictionaries" should get you on the right track), read and parse the file, and store the words in a C# System.Collections.Generic.HashSet collection. I would probably store everything as .ToUpper() or as .ToLower(), but this depends on your requirements.
You can then check if any arbitrary string is in the HashSet efficiently.
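A minimal sketch of that approach, assuming a one-word-per-line file called words.txt (a placeholder for whatever word list you download):

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Load the word list into a case-insensitive HashSet; this avoids the need
// to normalize with ToUpper()/ToLower() manually.
var words = new HashSet<string>(
    File.ReadLines("words.txt"),
    StringComparer.OrdinalIgnoreCase);

// Membership tests are then O(1) per string.
Console.WriteLine(words.Contains("xylophone"));  // True if it's in the file
Console.WriteLine(words.Contains("xyzzyplugh")); // Presumably False
```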
I don't know of any word list file included by default on Windows, but most Unix-like operating systems include a words file for this purpose. Someone has also posted a words file on GitHub suggested for use in Windows projects. These files are simple lists of words, one per line.
I have a text file with around 300,000 words. Each word is 5 letters.
I'd like to be able to determine how unique each word is on the internet.
An idea I had was to Google the word and see how many results it yielded. Unfortunately, this is against their TOS.
I was trying to think of any other way but it would have to involve querying some website a lot and I doubt they would appreciate that much.
Anyone have any other ideas? Programming language doesn't matter that much but I would prefer C#.
To look up the frequency 'in books' you could use the Google Ngram dataset, but that's not 'for the internet'. If this is for academic purposes, the Bing alternative might work as well, and it is based on internet frequencies.
If your words do not contain slang, I would recommend looking at public domain books. The issue here is that most of these books will be older, so you really will be getting a snapshot in time of how popular a word is (or I guess was). The plus side is that these books are freely available in text file format allowing you to easily mine them for data.
One thing to note: if you're in the US and plan on using Project Gutenberg to get the books, they have a rule that the website is intended only for human users. There is a page that tells you how to get the same data via a mirror.
I have a Lucene index with a lot of text data, and each item has a description. I want to extract the most common words from the description and generate tags to classify each item based on its description. Is there a Lucene.NET library for doing this, or any other library for text classification?
No. Lucene.NET can do search, indexing, text normalization, and "find more like this" functionality, but not text classification.
What to suggest to you depends on your requirements, so maybe more description is needed.
But generally, the easiest way is to try external services. All of these services have REST APIs, and it's very easy to interact with them from C# (a rough sketch follows the list below).
From external services:
Open Calais
uClassify
Google Prediction API
Text Classify
Alchemy API
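As a rough, hypothetical sketch of what calling such a service looks like from C#. The URL, API-key header, and JSON shape here are placeholders; each service (Open Calais, uClassify, etc.) defines its own request format:

```csharp
using System;
using System.Net.Http;
using System.Text;
using System.Text.Json;
using System.Threading.Tasks;

class TaggingClient
{
    private static readonly HttpClient Http = new HttpClient();

    public static async Task<string> TagAsync(string description)
    {
        var request = new HttpRequestMessage(
            HttpMethod.Post, "https://api.example.com/v1/classify"); // placeholder URL
        request.Headers.Add("X-Api-Key", "YOUR_API_KEY");            // placeholder auth
        request.Content = new StringContent(
            JsonSerializer.Serialize(new { text = description }),
            Encoding.UTF8, "application/json");

        HttpResponseMessage response = await Http.SendAsync(request);
        response.EnsureSuccessStatusCode();
        return await response.Content.ReadAsStringAsync(); // raw JSON tags
    }
}
```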
There are also good Java SDKs like Mahout. As I remember, interaction with Mahout can also be done as a service, so integrating it is not a problem at all.
I had a similar "auto tagging" task in C#, and I used Open Calais for it. It's free for up to 50,000 transactions per day, which was enough for me. uClassify also has good pricing; for example, the "Indie" license is $99 per year.
But maybe external services and Mahout are not your way. Then take a look at the DBpedia project and RDF.
And lastly, you can use an implementation of the Naive Bayes algorithm. It's easy, and everything will be under your control.
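If you go that route, here's a minimal sketch of a multinomial Naive Bayes classifier with add-one (Laplace) smoothing; the tokenizer is deliberately crude and is only meant to show the shape of the algorithm:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class NaiveBayes
{
    private readonly Dictionary<string, Dictionary<string, int>> _wordCounts = new();
    private readonly Dictionary<string, int> _docCounts = new();
    private readonly HashSet<string> _vocabulary = new();
    private int _totalDocs;

    private static IEnumerable<string> Tokenize(string text) =>
        text.ToLowerInvariant()
            .Split(' ', ',', '.', ';', ':', '\n', '\t')
            .Where(t => t.Length > 0);

    public void Train(string label, string text)
    {
        _totalDocs++;
        _docCounts[label] = _docCounts.GetValueOrDefault(label) + 1;
        if (!_wordCounts.TryGetValue(label, out var counts))
            _wordCounts[label] = counts = new Dictionary<string, int>();
        foreach (var token in Tokenize(text))
        {
            counts[token] = counts.GetValueOrDefault(token) + 1;
            _vocabulary.Add(token);
        }
    }

    // Returns the label with the highest log prior + log likelihood.
    public string Classify(string text)
    {
        var tokens = Tokenize(text).ToList();
        return _docCounts.Keys.OrderByDescending(label =>
        {
            var counts = _wordCounts[label];
            double total = counts.Values.Sum();
            double score = Math.Log((double)_docCounts[label] / _totalDocs);
            foreach (var t in tokens)
                score += Math.Log((counts.GetValueOrDefault(t) + 1.0)
                                  / (total + _vocabulary.Count));
            return score;
        }).First();
    }
}
```

Train it with (label, description) pairs for each of your tags, then call Classify on new descriptions.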
This is a very hard problem, but if you don't want to spend much time on it you can take all words that have between 5% and 10% frequency in the whole document. Or you can simply take the five most common words.
Doing tag extraction well is very, very hard. It is so hard that whole companies live off web services exposing such an API.
You can also do stopword removal (using a fixed stopword list obtained from the internet).
And you can find common N-grams (for example, pairs), which you can use to find multi-word tags.
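A minimal sketch combining those three ideas: frequency counting, stopword removal, and bigram extraction. The stopword list here is a tiny placeholder; use a full list from the internet in practice:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class TagExtractor
{
    // Placeholder stopword list; real lists have hundreds of entries.
    private static readonly HashSet<string> Stopwords =
        new(new[] { "the", "a", "an", "and", "or", "of", "to", "in", "is" });

    public static List<string> Extract(string text, int topN = 5)
    {
        var words = text.ToLowerInvariant()
            .Split(new[] { ' ', '.', ',', ';', ':', '\n' },
                   StringSplitOptions.RemoveEmptyEntries)
            .Where(w => !Stopwords.Contains(w))
            .ToList();

        // Most common single words as candidate tags.
        var topWords = words.GroupBy(w => w)
            .OrderByDescending(g => g.Count())
            .Take(topN)
            .Select(g => g.Key);

        // Most common adjacent pairs as candidate multi-word tags.
        var topBigrams = words.Zip(words.Skip(1), (a, b) => a + " " + b)
            .GroupBy(bg => bg)
            .OrderByDescending(g => g.Count())
            .Take(topN)
            .Select(g => g.Key);

        return topWords.Concat(topBigrams).ToList();
    }
}
```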
The Problem:
I need a good free library or algorithm to determine whether a text is related to a search pattern or not. The search pattern can be an ordered or unordered list of words.
For some searches the order is relevant, for some it is not. Additionally I need the ability to define aliases for searched words (e.g. "(C#|C sharp) code").
I doubt that there is a free C# library meeting all my requirements.
Which libraries/algorithms would you use to implement that functionality?
I'm grateful for any tips.
EDIT:
I need this to filter search results from multiple specialized search services. The resulting program must be VERY strict, so false negatives are no problem. False positives should be avoided (as far as possible).
For free, start here with the built-in Regex class:
http://msdn.microsoft.com/en-us/library/system.text.regularexpressions.regex.aspx
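For instance, the alias requirement maps directly onto regex alternation. A minimal sketch covering both the ordered and unordered cases:

```csharp
using System;
using System.Text.RegularExpressions;

// Ordered match: "(C#|C sharp)" must appear somewhere before "code".
var ordered = new Regex(@"(C#|C\s+sharp).*?\bcode\b",
                        RegexOptions.IgnoreCase | RegexOptions.Singleline);

// Unordered match: lookaheads require each term anywhere in the text.
var unordered = new Regex(@"^(?=.*(C#|C\s+sharp))(?=.*\bcode\b)",
                          RegexOptions.IgnoreCase | RegexOptions.Singleline);

Console.WriteLine(ordered.IsMatch("some C# code sample"));       // True
Console.WriteLine(unordered.IsMatch("code written in C sharp")); // True
```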
More sophisticated search is unlikely to come for free (cf. Google Search Appliance or similar).
I found this very cool C++ sample, literally the "Hello World!" of genetic algorithms.
So I decided to re-code the whole thing in C#, and this is the result.
Now I am asking myself: is there any practical application along the lines of generating a target string starting from a population of random strings?
EDIT: a buddy of mine on Twitter just tweeted that it "is useful for transcription type things such as translation. Does not have to be Monkey's". I wish I had a clue.
Is there any practical application along the lines of generating a target string starting from a population of random strings?
Sure. Imagine any scenario in which you know how to evaluate the fitness of a particular string, and in which the choices are discrete and constrained in some way (a minimal sketch follows the list):
Picking pronounceable names ("Xhjkxc" has low fitness; "Artekzo" has high fitness)
Trying out a series of chess moves
Guessing the combination to a safe, assuming you can tell how close you are to unlocking each tumbler
Picking phone numbers that evaluate to words (e.g. "843-2378" has high fitness because it spells "THE-BEST")
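For illustration, here is a bare-bones C# sketch of the "target string" evolver: fitness is the number of matching characters, and each step mutates one character and keeps the child if it is at least as fit. A real GA would keep a whole population and use crossover, but the fitness-driven loop is the same idea:

```csharp
using System;
using System.Linq;

class StringEvolver
{
    const string Target = "HELLO WORLD";
    const string Genes = "ABCDEFGHIJKLMNOPQRSTUVWXYZ ";
    static readonly Random Rng = new Random();

    // Fitness = number of characters that match the target.
    static int Fitness(string s) => s.Zip(Target, (a, b) => a == b ? 1 : 0).Sum();

    // Mutate one random character to a random gene.
    static string Mutate(string s)
    {
        var chars = s.ToCharArray();
        chars[Rng.Next(chars.Length)] = Genes[Rng.Next(Genes.Length)];
        return new string(chars);
    }

    static void Main()
    {
        // Start from a completely random candidate.
        string best = new string(Enumerable.Range(0, Target.Length)
            .Select(_ => Genes[Rng.Next(Genes.Length)]).ToArray());

        while (Fitness(best) < Target.Length)
        {
            string child = Mutate(best);
            if (Fitness(child) >= Fitness(best)) best = child;
        }
        Console.WriteLine(best); // HELLO WORLD
    }
}
```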
No. Each time you run the GA, you are giving it the eventual answer. This is great for showing how a GA works and how powerful it can be, but it does not have any purpose beyond that.
You could write an EA that writes code in a dynamic language like IronPython with the goal of creating code that a) executes without crashing and b) analyzes the stock market and intelligently buys and sells stock.
That's a very simplistic take on what would be necessary, but it's possible. You would need a host that provides a lot of methods for the IronPython code (technical indicators, etc) and a database of ticks.
It would also be smart not to just generate any old random code, lest you format your own hard drive. You need a sandbox, you need to limit the namespaces that are accessible, and you would need to enforce a time limit to avoid infinite loops. You could also provide semantic guidelines that let it choose appropriate approved keywords instead of just stringing random letters together; this would greatly speed up evolution.
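A rough sketch of running one evolved candidate with IronPython's hosting API (the IronPython and Microsoft.Scripting NuGet packages) and a time limit. Note that this is NOT a real sandbox: a timed-out thread keeps running in the background, and namespace restriction and process isolation are left out for brevity:

```csharp
using System;
using System.Threading;
using IronPython.Hosting;
using Microsoft.Scripting.Hosting;

static class CandidateRunner
{
    // Returns the candidate's fitness value, or null if it crashed or hung.
    public static double? Run(string code, TimeSpan timeout)
    {
        ScriptEngine engine = Python.CreateEngine();
        ScriptScope scope = engine.CreateScope();
        scope.SetVariable("ticks", new double[] { 1.0, 1.1, 0.9 }); // stub tick data

        double? result = null;
        var worker = new Thread(() =>
        {
            try { result = engine.Execute<double>(code, scope); }
            catch { /* a crashing candidate simply gets no fitness */ }
        }) { IsBackground = true };

        worker.Start();
        return worker.Join(timeout) ? result : null; // timeout -> no fitness
    }
}
```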
So, I was involved with a project that did everything but the EA. We had a satellite dish that got real-time stock ticks from the NASDAQ, a service for trading that had an API, and a primitive decision making "brain" that made decisions as the ticks came in.
Sadly, one of the partners flipped out, quit his job, forked the project (got his own dish, etc), and started trading with logic that wasn't ready. He lost a bunch of money. It turns out that for some people this type of project is only a step away from common gambling. But anyway, the project kind of fizzled out after that. Evolving the logic part is the missing link though. And I know there are people out there doing this type of thing.
I have used GAs in two real-life research problems.
One was a power optimization problem (maximizing the number of appliances turned on while meeting the available power constraint and the service guarantee for each appliance).
The other was radio network optimization, maximizing the coverage area given a fixed equipment budget.
GAs have one main disadvantage: they usually work at "genetic speed", i.e. they converge slowly, so using them in seriously time-dependent projects is quite risky.