Consider a program that askes you questions, like "what is the last site you visited?" and the answer would be "stackoverflow". The user is asked this question and gives the answer "stakovervlow" or "overflowstack". I still need the program to count it as a correct answer.
To compare normal strings I would use StringCompare class, but this wouldn't work in this case. I've searched the internet and found some articles about SOUNDEX and some algorithms to compare every char in the string and calculate the similarity percentage (like the damerau levenshtein distance), but i don't really know what is best.
Anyone knows if there is a class in .net to accomplish this or what the best way is to compare the user answer with the correct answer?
From the docs there is the SpellCheck class. You can add customized dictionaries as well for words like "StackOverflow", that are not in the dictionary.
What you are trying to do is quite difficult. The easy but tedious way is to create a dictionary or a table in your database that lists common misspellings.
The difficult way is to try to write some code to do natural language processing. The 2 most successful endeavors into this are the semantic search by Google and IBM's Watson supercomputer. I gather you won't be duplicating their methodology anytime soon.
Related
Working in C# I have an array of strings. Some of these strings are real words, others are complete nonsense. My goal is to come up with a way of deciding which of these words are real and which are false.
I had planned to find some kind of word list online that I could bring into my project, turn into a list, and compare against, but of course typing in "C# dictionary" comes up with an unrelated topic! I don't need a 100% accuracy rate.
To formalize the question:
In C#, what is the recommended way to establish whether or not a string is a real word?
Advice and guidance is very much appreciated!
Solution
Thanks for the great answers, they were all very useful. As it happens the thing to do was ask the same question in a different wording. Searching for C# spellcheck brought up some great links and I ended up using Nhunspell which you can get through NuGet, and is very easy to use.
The problem is that "Dictionary" is a type within the framework. So, searching with that word will end up with all sorts of results. What you are basically wanting to do is Spell Check. This will determine if a word is valid or not.
Searching for C# spell check yielded some promising results. Searching for open source spell check also has some.
I have previously implemented one of the open source ones within a VB6 project. I think it was ASpell. I haven't had to use spell check library within C#, but I'm sure there is one, or at least one with a .NET wrapper to make implementation easier.
If you have special case words that do not exist in the dictionary/word file for a spell check solution, you can add them.
To do this I would use a freely available dictionary for linux (googling "linux dictionaries" should get you on the right track), read and parse the file, and store it in a C# System.Collections.Generic.HashSet collection. I would probably store everything as .ToUpper() or as .ToLower() but this depends on your requirements.
You can then check if any arbitrary string is in the HashSet efficiently.
I don't know of any word list file included by default on Windows, but most Unix-like operating systems include a words file for this purpose. Someone has also posted a words file on github suggested for use in Windows projects. These files are simple lists of words, one per line.
I'd like to give users the ability to search through a large list of businesses, but still find near matches.
Does anyone have any recommendations on how best to go about this when you're not targeting simple dictionary words, but instead complex names like ABC Business Name?
Regards.
Check out the wikipedia article on Levenshtein distance. It's a fairly simple concept to wrap your head around and pretty easy to implement an algorithm in whichever language you are using, in your case, C#.
I found an example in C# for you here.
Also, here is an example of a spelling corrector from Peter Norvig of Google. It was said on the SO podcast a few episodes ago that Jon Skeet attempted a rewrite of this same algorithm in C#. Not sure if he completed it and/or made it publicly available though.
Consider using Keyword match and edit distance based similarity. Might combine with 'original searched' to 'actually clicked'.
This is probably a crazy solution but could you split the business name by space and then search either all the items or maybe the first couple.
So you might search on 'ABC' and 'Business' but leave out 'Name' as this might take too long.
You might even check to see if the string is of a certain length, then trim and just search on the first say 5 letters.
Have you had a look at "soundex" as a way of searching through your businesses. Again, I think you'd need to split the name by space.
You might check out the SQL Server SOUNDEX and DIFFERENCE functions. SOUNDEX converts a sequence of characters (such as a word) into a 4-character code which will be the same for similar-sounding words. DIFFERENCE gives a number which represents how "different" two strings are based on sound.
You could, for example, create a computed column based on the SOUNDEX function and match on that column later. Or you could use DIFFERENCE in a WHERE clause.
I want to restart with data structure ( and Ai + I want to clear all my misconceptions too. ;P )
For now I want to know how would I put given pictorial info into algorithm using C# structure. Image processing is not required here. Just need to feed the data in.
Here I need this question to be modified too if not clear. :|
Say Arad is a city in Romania from where I have to go to another city Bucharest.
This map also has info of how far all connecting cities are from any city.
How would I use these info in program to start with any searching or sorting algo?
Any pointer will be helpful. Say if this can be done using anything else than struct. Something like node or something. I don't know.
Please consider I want to learn things. So using C# for ease in use not to use its inbuilt searching and sorting functions. Later to confirm I might use.
The way you typically solve this problem is to create a node class and an edge class. Each node has a set of edges that have "lengths", and each edge connects two nodes. You then write a shortest-path algorithm that determines the least-total-length set of edges that connects two nodes.
For a brief tutorial on how to do that, see my series of articles on the A* algorithm:
http://blogs.msdn.com/b/ericlippert/archive/tags/astar/
Although it's not exactly what you're looking for, Eric Lippert's series on graph colouring is an excellent step-by-step example of designing data structures and implementing algorithms (efficiently!) in C#. It has helped me a lot; I highly recommend reading it. Once you've worked your way through that, you will know much more about C# and you will understand some of the specific design tradeoffs that you may encounter based on your specific problem, including what data structure to use for a particular problem.
If you just want to look at raw algorithms, the shortest path problem has many algorithms defined for it over the years. I would recommend implementing the common Dijkstra's algorithm first. The Wikipedia article has pseudocode; with what you get out of Eric Lippert's series, you should be in good shape to develop an implementation of this. If you still want more step-by-step guidance, try a search for "Dijkstra's algorithm in C#".
Hope that helps!
I'm writing a bot that will analyse posts and reply with a vaguely related strings from a database. I'm not aiming for coherence, just for vague similarity that could pass as someone ignorant to the topic (but knowledgeable enough to try to reply). What are some methods that would help me to choose the right reply?
One thing I've come up with is to create a vocabulary list, check which elements of the list are in the post, and get a reply from the database based on these results. This crude method has been successful about 10% of the time (based on 100 replies to random posts). I might expand the list by more words, but this method has its limit. Any better ones?
(P. S. The database is sizeable -- about 500 000 replies)
First of all, I think the best you can hope for will be about a 50% answer rate, unless you're prepared to write a lot of code.
If you're willing to get your hands dirty with some statistics, check out term frequency–inverse document frequency. Basically, you will use the frequency of uncommon words to determine what keywords are critical to the document, and use this as the input into the tf-idf algorithm to pull out other replies with those same keywords.
You can then combine this further with whitelisting and blacklisting techniques to ignore common words and prioritize certain keywords. You can then keep tuning those lists to enhance the algorithm as you see it work.
There are also simpler string metrics you can use to test basic similarity. Take a look at this list of string metrics.
You might want to look into vector-space mapping and resemblance. The "vaguely related" problem could be handled by resemblance statistical analysis most likely.
Check out this novel use of resemblance:
http://www.cromwell-intl.com/security/attack-study/
There is a PHP function called "similar_text()", (e.g.:
$percent_similar = similar_text($str1, $str2);) This works fairly well but I didn't come up with anything similar in C#. If you could get hold of the source for the PHP function you might try to translate it. I think there may be a Java version also.
I'm looking for a way to determine the differences between two strings, and highlight them in both strings.
I would suspect that most 'diff' libraries won't work since they show differences in lines (I believe).
Either an algorithm or library will work.
Thanks,
Mark
DiffPlex can handle many different kinds of "intra-line" diffs, including character and word diffs. I think it should be able to do everything you're asking for here.
From your question, you seem to have rejected using an existing program and decided to write your own because you believe existing programs cannot show differences within lines.
However WinMerge can show intra-line diffs.
Does that meet your needs? Or do you need this to be a .NET component for some reason?
You'll probably want to look into using the Levenshtein distance, or some similar algorithm. For a C# implementation of the Levenshtein algorithm, see here (if you're really keen on writing this yourself).
This question asks something similar, with the accepted answer pointing to a bunch of diff related projects. There's a lot of good code that's been written that's definitely worth taking a look into.
String.Compare would work. If you want to compare words then just split the initial string into an array of strings and loop through it.