The Problem:
I need a good free library or algorithm to determine whether a text is related to a search pattern or not. The search pattern can be an ordered or unordered list of words.
For some searches the order is relevant, for some it is not. Additionally I need the ability to define aliases for searched words (e.g. "(C#|C sharp) code").
I doubt that there is a free or cheap C# library that meets all my requirements.
Which libraries/algorithms would you use to implement that functionality?
I'm grateful for any tips.
EDIT:
I need this to filter search results from multiple specialized search services. The resulting program must be VERY strict, so false negatives are no problem. False positives should be avoided as far as possible.
For free, start here with the built-in Regex namespace/class:
http://msdn.microsoft.com/en-us/library/system.text.regularexpressions.regex.aspx
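For example, the alias requirement from the question ("(C#|C sharp) code") maps directly onto regex alternation. A minimal sketch (the pattern and sample strings here are just illustrations):

```csharp
using System;
using System.Text.RegularExpressions;

class AliasSearchDemo
{
    static void Main()
    {
        // Alternation handles the alias; if alias text could contain regex
        // metacharacters, run each alternative through Regex.Escape first.
        var pattern = new Regex(@"\b(C#|C sharp) code\b", RegexOptions.IgnoreCase);

        Console.WriteLine(pattern.IsMatch("Some C sharp code samples")); // True
        Console.WriteLine(pattern.IsMatch("Some Java code samples"));    // False
    }
}
```

For unordered word lists you would build the pattern differently (e.g. one lookahead per required word), but the alternation idea stays the same.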
More sophisticated search is unlikely to come for free (cf. Google Search Appliance or similar).
Working in C# I have an array of strings. Some of these strings are real words, others are complete nonsense. My goal is to come up with a way of deciding which of these words are real and which are not.
I had planned to find some kind of word list online that I could bring into my project, turn into a list, and compare against, but of course typing in "C# dictionary" comes up with an unrelated topic! I don't need a 100% accuracy rate.
To formalize the question:
In C#, what is the recommended way to establish whether or not a string is a real word?
Advice and guidance is very much appreciated!
Solution
Thanks for the great answers, they were all very useful. As it happens, the thing to do was to ask the same question in different wording. Searching for C# spellcheck brought up some great links, and I ended up using NHunspell, which you can get through NuGet and is very easy to use.
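For reference, a minimal sketch of how NHunspell is typically used, assuming the en_US.aff and en_US.dic dictionary files (available from the OpenOffice/LibreOffice projects) sit next to the executable; the words tested are just examples:

```csharp
using System;
using NHunspell;

class SpellCheckDemo
{
    static void Main()
    {
        // Hunspell loads an affix file and a dictionary file.
        using (var hunspell = new Hunspell("en_US.aff", "en_US.dic"))
        {
            Console.WriteLine(hunspell.Spell("recommendation")); // True: real word
            Console.WriteLine(hunspell.Spell("xqzwv"));          // False: nonsense

            // Suggestions for a misspelled word:
            foreach (string suggestion in hunspell.Suggest("recomendation"))
                Console.WriteLine(suggestion);
        }
    }
}
```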
The problem is that "Dictionary" is a type within the framework, so searching with that word will turn up all sorts of unrelated results. What you basically want to do is spell checking, which will determine whether a word is valid or not.
Searching for C# spell check yields some promising results. Searching for open source spell check also turns up some.
I have previously implemented one of the open source ones within a VB6 project. I think it was ASpell. I haven't had to use a spell check library within C#, but I'm sure there is one, or at least one with a .NET wrapper to make implementation easier.
If you have special case words that do not exist in the dictionary/word file for a spell check solution, you can add them.
To do this I would use a freely available dictionary for Linux (googling "linux dictionaries" should get you on the right track), read and parse the file, and store it in a C# System.Collections.Generic.HashSet<string> collection. I would probably store everything as .ToUpper() or as .ToLower(), but this depends on your requirements.
You can then check if any arbitrary string is in the HashSet efficiently.
I don't know of any word list file included by default on Windows, but most Unix-like operating systems include a words file for this purpose. Someone has also posted a words file on github suggested for use in Windows projects. These files are simple lists of words, one per line.
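Putting the two ideas together, here is a minimal sketch, assuming a one-word-per-line file called words.txt (the filename is a placeholder for whichever list you download):

```csharp
using System;
using System.Collections.Generic;
using System.IO;

class WordListDemo
{
    static void Main()
    {
        // Load the word list; the case-insensitive comparer avoids having to
        // normalize everything with ToUpper()/ToLower() up front.
        var dictionary = new HashSet<string>(
            File.ReadLines("words.txt"),
            StringComparer.OrdinalIgnoreCase);

        Console.WriteLine(dictionary.Contains("hello"));  // True for a real word
        Console.WriteLine(dictionary.Contains("hxllqz")); // False for nonsense
    }
}
```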
I'd like to give users the ability to search through a large list of businesses, but still find near matches.
Does anyone have any recommendations on how best to go about this when you're not targeting simple dictionary words, but instead complex names like ABC Business Name?
Regards.
Check out the wikipedia article on Levenshtein distance. It's a fairly simple concept to wrap your head around and pretty easy to implement an algorithm in whichever language you are using, in your case, C#.
I found an example in C# for you here.
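Since the algorithm really is easy to implement, here is a minimal sketch of the standard dynamic-programming version in C#:

```csharp
using System;

static class Levenshtein
{
    // Classic dynamic-programming edit distance: O(n*m) time, O(m) memory.
    public static int Distance(string a, string b)
    {
        int[] prev = new int[b.Length + 1];
        int[] curr = new int[b.Length + 1];

        for (int j = 0; j <= b.Length; j++)
            prev[j] = j; // distance from the empty string: j insertions

        for (int i = 1; i <= a.Length; i++)
        {
            curr[0] = i;
            for (int j = 1; j <= b.Length; j++)
            {
                int cost = a[i - 1] == b[j - 1] ? 0 : 1;
                curr[j] = Math.Min(
                    Math.Min(curr[j - 1] + 1,   // insertion
                             prev[j] + 1),      // deletion
                    prev[j - 1] + cost);        // substitution
            }
            int[] temp = prev; prev = curr; curr = temp;
        }
        return prev[b.Length];
    }
}
```

For example, Distance("ABC Business Name", "ABC Busness Name") returns 1, so you could rank candidate businesses by their distance to the query (perhaps normalized by length) and treat small distances as near matches.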
Also, here is an example of a spelling corrector from Peter Norvig of Google. It was said on the SO podcast a few episodes ago that Jon Skeet attempted a rewrite of this same algorithm in C#. Not sure if he completed it and/or made it publicly available though.
Consider using keyword matching combined with edit-distance-based similarity. You might also combine this with a log mapping what users originally searched for to what they actually clicked.
This is probably a crazy solution, but could you split the business name by space and then search on either all the tokens or maybe just the first couple? So you might search on 'ABC' and 'Business' but leave out 'Name', as searching everything might take too long.
You might even check whether the string is of a certain length, then trim it and just search on the first, say, 5 letters.
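A rough sketch of that idea, assuming the businesses are just a List<string> (the names and query here are illustrative):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class TokenSearchDemo
{
    static void Main()
    {
        var businesses = new List<string> { "ABC Business Name", "XYZ Holdings" };
        string query = "ABC Busness"; // user input, possibly misspelled

        // Only look at the first couple of query tokens, trimmed to 5 letters.
        var queryTokens = query.Split(' ').Take(2)
            .Select(q => q.Length > 5 ? q.Substring(0, 5) : q);

        // Match if any query token is a prefix of any token of the name.
        var matches = businesses.Where(name =>
            queryTokens.Any(q => name.Split(' ')
                .Any(t => t.StartsWith(q, StringComparison.OrdinalIgnoreCase))));

        foreach (string match in matches)
            Console.WriteLine(match); // prints "ABC Business Name"
    }
}
```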
Have you had a look at "soundex" as a way of searching through your businesses? Again, I think you'd need to split the name by space.
You might check out the SQL Server SOUNDEX and DIFFERENCE functions. SOUNDEX converts a sequence of characters (such as a word) into a 4-character code which will be the same for similar-sounding words. DIFFERENCE gives a number which represents how "different" two strings are based on sound.
You could, for example, create a computed column based on the SOUNDEX function and match on that column later. Or you could use DIFFERENCE in a WHERE clause.
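As a sketch of the WHERE-clause approach, assuming a hypothetical Businesses table with a Name column (DIFFERENCE returns 0 to 4, where 4 means the two strings sound most alike):

```csharp
using System;
using System.Data.SqlClient;

class SoundexSearchDemo
{
    static void Search(string connectionString, string searchTerm)
    {
        // DIFFERENCE(a, b) compares the SOUNDEX codes of the two strings.
        const string sql =
            "SELECT Name FROM Businesses WHERE DIFFERENCE(Name, @search) >= 3";

        using (var conn = new SqlConnection(connectionString))
        using (var cmd = new SqlCommand(sql, conn))
        {
            cmd.Parameters.AddWithValue("@search", searchTerm);
            conn.Open();
            using (var reader = cmd.ExecuteReader())
                while (reader.Read())
                    Console.WriteLine(reader.GetString(0));
        }
    }
}
```

Since SOUNDEX keys off the leading sounds of a string, splitting multi-word names into tokens first (as suggested above) would likely improve the results.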
I have a text file with around 300,000 words. Each word is 5 letters.
I'd like to be able to determine how unique each word is on the internet.
An idea I had was to Google the word and see how many results it yielded. Unfortunately, this is against their TOS.
I was trying to think of another way, but it would have to involve querying some website a lot, and I doubt they would appreciate that.
Anyone have any other ideas? Programming language doesn't matter that much but I would prefer C#.
To look up the frequency 'in books' you could use the Google Ngram dataset, but that's not 'for the internet'. If this is for academic purposes, the Bing alternative might also work, and it is based on internet frequencies.
If your words do not contain slang, I would recommend looking at public domain books. The issue here is that most of these books will be older, so you really will be getting a snapshot in time of how popular a word is (or I guess was). The plus side is that these books are freely available in text file format allowing you to easily mine them for data.
One thing to note: if you're in the US and plan on using Project Gutenberg to get the books, they have a rule that the website is intended only for human users. There is a page that tells you how to get the same data via a mirror.
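A rough sketch of the counting step, assuming the downloaded books sit as plain-text files in a books/ directory and your 300,000 words are in words.txt (both paths are placeholders):

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text.RegularExpressions;

class CorpusFrequencyDemo
{
    static void Main()
    {
        // Seed the counter with the words we care about.
        var counts = new Dictionary<string, int>(StringComparer.OrdinalIgnoreCase);
        foreach (string word in File.ReadLines("words.txt"))
            counts[word] = 0;

        // Tokenize each book and count only the seeded words.
        foreach (string book in Directory.EnumerateFiles("books", "*.txt"))
            foreach (Match m in Regex.Matches(File.ReadAllText(book), @"[A-Za-z]+"))
                if (counts.ContainsKey(m.Value))
                    counts[m.Value]++;

        foreach (var pair in counts)
            Console.WriteLine($"{pair.Key}\t{pair.Value}");
    }
}
```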
I have a Lucene index with a lot of text data. Each item has a description, and I want to extract the most common words from the description and generate tags to classify each item based on it. Is there a Lucene.Net library for doing this, or any other library for text classification?
No. Lucene.Net can do search, indexing, text normalization, and "find more like this" functionality, but not text classification.
What to suggest depends on your requirements, so a more detailed description might be needed. Generally, though, the easiest way is to use external services. They all have REST APIs, which are very easy to interact with from C#.
From external services:
Open Calais
uClassify
Google Prediction API
Text Classify
Alchemy API
There are also good Java SDKs like Mahout. As I remember, interaction with Mahout can also be done service-style, so integrating it is not a problem at all.
I had a similar "auto tagging" task in C#, and I used Open Calais for it. It's free for up to 50,000 transactions per day, which was enough for me. uClassify also has good pricing; for example, the "Indie" license is $99 per year.
But maybe external services and Mahout are not your way. Then take a look at the DBpedia project and RDF.
And lastly, you can at least use an implementation of the Naive Bayes algorithm. It's easy, and everything will be under your control.
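For illustration, here is a minimal sketch of a multinomial Naive Bayes classifier with Laplace smoothing; it is a toy (no stemming, stopword removal, or feature selection), not a production implementation:

```csharp
using System;
using System.Collections.Generic;

class NaiveBayesClassifier
{
    private readonly Dictionary<string, Dictionary<string, int>> _wordCounts =
        new Dictionary<string, Dictionary<string, int>>();
    private readonly Dictionary<string, int> _docCounts = new Dictionary<string, int>();
    private readonly Dictionary<string, int> _wordTotals = new Dictionary<string, int>();
    private readonly HashSet<string> _vocabulary = new HashSet<string>();
    private int _totalDocs;

    private static IEnumerable<string> Tokenize(string text)
    {
        return text.ToLowerInvariant()
                   .Split(new[] { ' ', ',', '.', '!', '?', ';', ':' },
                          StringSplitOptions.RemoveEmptyEntries);
    }

    public void Train(string label, string text)
    {
        if (!_wordCounts.ContainsKey(label))
        {
            _wordCounts[label] = new Dictionary<string, int>();
            _docCounts[label] = 0;
            _wordTotals[label] = 0;
        }
        _docCounts[label]++;
        _totalDocs++;
        foreach (string token in Tokenize(text))
        {
            int count;
            _wordCounts[label].TryGetValue(token, out count);
            _wordCounts[label][token] = count + 1;
            _wordTotals[label]++;
            _vocabulary.Add(token);
        }
    }

    public string Classify(string text)
    {
        string best = null;
        double bestScore = double.NegativeInfinity;
        foreach (string label in _wordCounts.Keys)
        {
            // Log prior + Laplace-smoothed log likelihood of each token.
            double score = Math.Log((double)_docCounts[label] / _totalDocs);
            foreach (string token in Tokenize(text))
            {
                int count;
                _wordCounts[label].TryGetValue(token, out count);
                score += Math.Log((count + 1.0) /
                                  (_wordTotals[label] + _vocabulary.Count));
            }
            if (score > bestScore) { bestScore = score; best = label; }
        }
        return best;
    }
}
```

Training it with a handful of labeled descriptions per tag (nb.Train("database", "...") etc.) and then calling nb.Classify(description) is enough to see the idea in action.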
This is a very hard problem, but if you don't want to spend much time on it, you can take all words that have between 5% and 10% frequency in the whole document, or simply take the 5 most common words.
Doing tag extraction well is very, very hard. It is so hard that whole companies live off web services exposing such an API.
You can also do stopword removal (using a fixed stopword list obtained from the internet).
And you can find common N-grams (for example pairs) which you can use to find multi-word tags.
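As a sketch of that simple approach, assuming plain-text descriptions (the tiny stopword list here is illustrative; real lists from the internet have a few hundred entries):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

class TagExtractionDemo
{
    static readonly HashSet<string> Stopwords = new HashSet<string>(
        new[] { "the", "a", "an", "and", "or", "of", "to", "in", "is", "for" },
        StringComparer.OrdinalIgnoreCase);

    // Returns the `count` most common non-stopwords as candidate tags.
    static IEnumerable<string> TopTags(string description, int count)
    {
        return Regex.Matches(description, @"[A-Za-z]+")
            .Cast<Match>()
            .Select(m => m.Value.ToLowerInvariant())
            .Where(w => !Stopwords.Contains(w))
            .GroupBy(w => w)
            .OrderByDescending(g => g.Count())
            .Take(count)
            .Select(g => g.Key);
    }

    static void Main()
    {
        string description = "The quick brown fox jumps over the lazy dog. " +
                             "The fox is quick and the dog is lazy.";
        Console.WriteLine(string.Join(", ", TopTags(description, 5)));
        // prints: quick, fox, dog, lazy, brown
    }
}
```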
How can I add a spell checker to a RichTextBox in my application?
You can purchase a spell checker control, integrate the Microsoft Office spell checker, write your own (which isn't too hard, actually, once you get the Soundex function figured out), or get a free one. Here's a (relatively) good free one:
http://www.codeproject.com/KB/recipes/spellchecker_mg.aspx
For commercial products, I'd say to Google "Spell check WinForms"
If you're interested in rolling your own: I wrote one for ASP.NET back when I was in my beginner phase, and even then it only took me about a week to research and then a day or so to code. It's a fun pet project. I'd start by looking at Soundex functions and comparing Soundex values for similarity.
You start by comparing all of the words in the TextBox to a known dictionary and using the Soundex function to come up with similar words.
From there, you can go on to creating a table of "popular replacements" (for example, you can track that the word "teh" was replaced by "the" n times and move the more popular replacements to the top of the list).
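If you do go the Soundex route, here is a sketch of the classic algorithm; it is a simplified version and not guaranteed to match SQL Server's SOUNDEX in every edge case:

```csharp
using System;
using System.Linq;
using System.Text;

static class SoundexDemo
{
    static string Soundex(string word)
    {
        if (string.IsNullOrEmpty(word)) return "0000";
        const string codes = "01230120022455012623010202"; // digit codes for A..Z

        // Keep only ASCII letters, uppercased.
        string upper = new string(word.ToUpperInvariant()
            .Where(c => c >= 'A' && c <= 'Z').ToArray());
        if (upper.Length == 0) return "0000";

        var result = new StringBuilder().Append(upper[0]);
        char lastCode = codes[upper[0] - 'A'];
        foreach (char c in upper.Skip(1))
        {
            char code = codes[c - 'A'];
            if (code != '0' && code != lastCode)
                result.Append(code); // new consonant sound
            if (c != 'H' && c != 'W') // H and W do not separate duplicate codes
                lastCode = code;
            if (result.Length == 4) break;
        }
        return result.ToString().PadRight(4, '0');
    }

    static void Main()
    {
        Console.WriteLine(Soundex("teh")); // T000
        Console.WriteLine(Soundex("the")); // T000: same code, candidate match
    }
}
```

Two words with the same code (like "teh" and "the" above) become replacement candidates, which you can then rank with your popular-replacements table.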
I found a better solution. Hunspell is the spellcheck library Mozilla and OpenOffice.org use. It's been ported to .NET as NHunspell, which is really easy to implement and has samples for you to use.