I'm creating a program that reads a scanned handwritten document and converts it to text. The recognized words must come from a dictionary of about 300 words that I create. As an example, if the handwritten word is recognized as "heilo", but my dictionary only contains "hello" and "world", it should convert it to "hello". However, if it is recognized as "planet", it shouldn't match anything. I think a possible approach would be to compute a score of how closely the recognized word matches each word in the dictionary. If no dictionary word reaches a minimum score, then no match is found.
I'm writing the application in C#. Are there any libraries/examples available that can do something like this, or would I have to code everything from scratch?
Thanks
There is nothing in the standard libraries to compute the distance between words, but there are plenty of examples you can find on the internet: look up "edit distance" or "Levenshtein distance". The idea is to measure similarity as the number of changes needed to turn the first string into the second. The distance between "heil" and "hello" is 2, because you need to replace "i" with "l" (the first edit) and then append an "o" (the second edit).
When looking for an implementation or implementing your own, avoid the trivial version with a full 2D array, because it is not memory-efficient. Use the variant that keeps only two rows at a time, which needs O(min(m,n)) memory instead of the "naive" O(m*n).
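For illustration, a minimal C# sketch of that two-row variant, combined with the minimum-score idea from the question (the dictionary contents and the threshold of 2 edits are just assumptions for the example):

using System;

class WordMatcher
{
    // Levenshtein distance using two rows instead of a full m*n matrix,
    // i.e. O(min(m,n)) memory instead of O(m*n).
    static int Distance(string a, string b)
    {
        if (a.Length < b.Length) { var t = a; a = b; b = t; } // size rows by the shorter string
        int[] prev = new int[b.Length + 1];
        int[] curr = new int[b.Length + 1];
        for (int j = 0; j <= b.Length; j++) prev[j] = j;

        for (int i = 1; i <= a.Length; i++)
        {
            curr[0] = i;
            for (int j = 1; j <= b.Length; j++)
            {
                int cost = a[i - 1] == b[j - 1] ? 0 : 1;
                curr[j] = Math.Min(Math.Min(curr[j - 1] + 1,   // insertion
                                            prev[j] + 1),      // deletion
                                   prev[j - 1] + cost);        // substitution
            }
            var tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.Length];
    }

    // Return the closest dictionary word, or null if even the best
    // match needs more than maxDistance edits (the "minimum score").
    static string BestMatch(string input, string[] dictionary, int maxDistance)
    {
        string best = null;
        int bestDist = int.MaxValue;
        foreach (string word in dictionary)
        {
            int d = Distance(input, word);
            if (d < bestDist) { bestDist = d; best = word; }
        }
        return bestDist <= maxDistance ? best : null;
    }

    static void Main()
    {
        string[] dict = { "hello", "world" };
        Console.WriteLine(BestMatch("heilo", dict, 2) ?? "(no match)");  // hello
        Console.WriteLine(BestMatch("planet", dict, 2) ?? "(no match)"); // (no match)
    }
}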
I have no library at hand that does what you need, but searching the web knowing that you want to calculate the Levenshtein distance should help you in your search.
Perhaps you should start with a spell checker - there are a number of libraries available that do this.
There are a few C# snippets online that will get the ball rolling:
Levenshtein:
http://www.dotnetperls.com/levenshtein
Boyer-Moore:
http://www-igm.univ-mlv.fr/~lecroq/string/node15.html#SECTION00150
Based on those, you can easily implement your own Word Matcher module.
First of all I'd like to mention that I'm new to programming and to this site, so I'm still an infant in this world. However, I have a problem.
I have to write code that can compare two strings, where the second string (from a file) will have unique identifiers within it. For example:
first string:
I have 10 cats and their fur is #000000
Second string from a file:
I have <d> cats and their fur is <h>
Although I probably don't need to explain: 'd' is for numbers (decimal) and 'h' is for hex. There are also 's' and 'a', associated with ASCII.
What's supposed to happen is that the first string can contain any number (of any length) and/or hex value when the data comes in, but the rest of the message stays the same, e.g.:
I have 1500 cats and their fur is #000000
the code will still match the two strings as true matches, as it will effectively ignore anything that is an int or hex. (These identifiers are user-defined, so they can appear anywhere in any string.)
The end goal is that if it finds a relative match, the code will change the colour of the text in the app, among other things. It's basically to highlight errors in a log file.
I've searched high and low on Stack Overflow and looked into regex and string comparisons. I'm about to make a start on the code, but would like some input/help.
Obviously I'm not asking for something to be written for me, just to be pointed in the right direction so I can learn.
Many thanks in advance! And apologies if there is a similar post out there; alas, I couldn't find one.
If I understand it correctly, I would solve this by replacing the <d> etc. with a regex pattern, then using that regex to replace the values with an empty string. That way you can compare the strings without the values.
Hope that makes sense. Since you asked for just some directions, the sketch below stays minimal.
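A minimal sketch of that idea in C# (the marker-to-pattern mapping below is an assumption based on the <d> and <h> examples above):

using System;
using System.Text.RegularExpressions;

class TemplateMatcher
{
    // Turn a template like "I have <d> cats and their fur is <h>" into a
    // regex: escape the literal text, then swap the markers for patterns.
    static Regex BuildPattern(string template)
    {
        // Regex.Escape leaves "<d>" and "<h>" untouched (no special chars),
        // so the markers can be replaced directly afterwards.
        string pattern = Regex.Escape(template)
            .Replace("<d>", @"\d+")              // decimal numbers
            .Replace("<h>", @"#?[0-9a-fA-F]+");  // hex values, optional leading '#'
        return new Regex("^" + pattern + "$");
    }

    static void Main()
    {
        Regex rx = BuildPattern("I have <d> cats and their fur is <h>");
        Console.WriteLine(rx.IsMatch("I have 1500 cats and their fur is #000000")); // True
        Console.WriteLine(rx.IsMatch("I have ten cats and their fur is #000000"));  // False
    }
}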
I'm going to write a program that takes a URL and counts the occurrences of EVERY single 1-word, 2-word, and 3-word phrase in the webpage (and possibly x-word phrases).
Here's the best algorithm I could come up with:
1) strip HTML tags
2) make everything lowercase
3) split the text on spaces and put the words into an array
4) iterate over each word, and for each word put the 1-, 2-, and 3-word phrases starting at word[i] into a hashtable; every time you have a collision, increase the count for that word or 2- or 3-word phrase (see the sketch after this list).
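As a sketch, steps 2-4 might look like this (splitting only on whitespace; step 1 is left out):

using System;
using System.Collections.Generic;

class PhraseCounter
{
    // Count every 1-, 2- and 3-word phrase in the given (already tag-stripped) text.
    static Dictionary<string, int> CountPhrases(string text)
    {
        var counts = new Dictionary<string, int>();
        string[] words = text.ToLowerInvariant()
            .Split(new[] { ' ', '\t', '\r', '\n' }, StringSplitOptions.RemoveEmptyEntries);

        for (int i = 0; i < words.Length; i++)
        {
            for (int n = 1; n <= 3 && i + n <= words.Length; n++)
            {
                string phrase = string.Join(" ", words, i, n);
                counts.TryGetValue(phrase, out int count);
                counts[phrase] = count + 1;
            }
        }
        return counts;
    }

    static void Main()
    {
        foreach (var pair in CountPhrases("the cat and the dog"))
            Console.WriteLine(pair.Key + ": " + pair.Value);
    }
}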
My questions are:
1) Can anyone provide any more efficient solutions in terms of space and runtime?
2) Are there any easy ways to do #1 in C#?
I can probably use a DOM parser and parse out all the inner text, maybe.
Depending on your case, you might be oversimplifying the problem, and/or you may end up putting a lot of effort into implementing functionality that already exists in some libraries. So this will not be a direct answer, but a suggestion on what path to take in tackling this problem.
The process you want to implement is called information retrieval. It is very broad and complex, but luckily there is a lot of research in this area. Part of it is extracting word n-grams (an n-gram is a sequence of consecutive letters or words).
Let me show you some additional problems you should think about ahead of time:
is the capitalization of letters in a word important?
is a period the only character you want to treat as the end of a sentence?
do you want to exclude stop words? Stop words are words you don't want to include in a phrase, like 'a', 'the', 'I', 'my' and so on (see the sketch after this list).
do you want to stem words? Stemming converts words from their original form to their root form, e.g. plural to singular: basketballs -> basketball.
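To make the stop-word point concrete, here is a minimal sketch (the word list is a tiny illustrative sample; real stop-word lists are much longer):

using System;
using System.Collections.Generic;
using System.Linq;

class Preprocess
{
    // Tiny illustrative stop-word list.
    static readonly HashSet<string> StopWords =
        new HashSet<string> { "a", "an", "the", "i", "my", "and", "of" };

    static string[] RemoveStopWords(string[] words) =>
        words.Where(w => !StopWords.Contains(w)).ToArray();

    static void Main()
    {
        var words = "the cat and my dog".Split(' ');
        Console.WriteLine(string.Join(" ", RemoveStopWords(words))); // cat dog
    }
}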
And for extracting plain text from HTML:
extract only the text shown on the page?
extract tooltips as well? (like those shown when hovering the mouse over a picture)
any other non-visible text (meta tags and so on)?
There are libraries that perform searching and extract information from raw material. "Raw material" means you have to process a document (HTML, DOC, PDF, image, ...) and turn it into text in order for the search engine to index it (extract phrases, for instance). Once a document is indexed, it can be searched. One such library for .NET is Lucene.NET. It supports different stemmers, analyzers, and filters.
I am not sure, but I believe there are libraries for extracting text from HTML as well.
Basically, your approach may work in simpler scenarios where a not-so-small error level is acceptable. I recently gained an interest in information retrieval and found it really complex and interesting. You may benefit from researching this topic, depending on your goals. There is a lot of info here on Stack Overflow as well as on the rest of the Internet.
And if you decide to go this way, there is much more info on Lucene (the original Java version; Lucene.NET is a port to .NET) than on Lucene.NET. So if you don't find an answer for Lucene.NET immediately, do a search in the Lucene discussions.
To answer your question #2, using the WinForms WebBrowser control (after the page has loaded):
HtmlDocument doc = webBrowser1.Document;
string text = doc.Body.InnerText; // the page's visible text
If you want to make it more efficient, use a suffix trie (you may have to write your own):
http://en.wikipedia.org/wiki/Suffix_trie
A suffix trie basically makes searching through a string depend on the length of the search pattern instead of the length of the text being searched. It's the sort of thing they use in search engines.
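A minimal C# sketch of the idea (naive O(n^2) construction; real implementations use suffix trees built with Ukkonen's algorithm in O(n)):

using System;
using System.Collections.Generic;

// Indexes every suffix of a text so that substring lookups cost
// O(length of the query) instead of O(length of the text).
class SuffixTrie
{
    private class Node
    {
        public Dictionary<char, Node> Children = new Dictionary<char, Node>();
    }

    private readonly Node root = new Node();

    public SuffixTrie(string text)
    {
        // Insert every suffix of the text into the trie.
        for (int i = 0; i < text.Length; i++)
        {
            Node node = root;
            for (int j = i; j < text.Length; j++)
            {
                if (!node.Children.TryGetValue(text[j], out Node child))
                {
                    child = new Node();
                    node.Children[text[j]] = child;
                }
                node = child;
            }
        }
    }

    // True if 'pattern' occurs anywhere in the indexed text.
    public bool Contains(string pattern)
    {
        Node node = root;
        foreach (char c in pattern)
            if (!node.Children.TryGetValue(c, out node))
                return false;
        return true;
    }

    static void Main()
    {
        var trie = new SuffixTrie("the quick brown fox");
        Console.WriteLine(trie.Contains("quick bro")); // True
        Console.WriteLine(trie.Contains("lazy"));      // False
    }
}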
I need to introduce some text macros, for example:
"Some text here, some text here #from_file[a.txt,2,N] and here and here"
The #from_file[a.txt,2,N] macro should get 2 random lines from a.txt and join them with the newline character; another, #from_file[a.txt,5,S], should take 5 random lines and join them with a space.
Of course I also need some other macros: #random[0-9] - a random number; #random[A-B,5] - a random string with 5 characters.
Macros could also be in another format, e.g. {from_file:a.txt,2,N}.
My first idea was to use regular expressions - but maybe there is another solution to my problem?
It sounds like you want to create some sort of "general purpose" text-macro system. While I'm sure this can be done with regexps, what you want basically boils down to what it needs to be capable of, and how extensible and flexible it needs to be.
You basically need to define your grammar and constraints. Can the file name contain the macro-block terminator character '}'? If so, does it need to be escaped? Should escaping be supported? Are spaces within a macro block allowed?
Basically, find out how you want things to work, preferably as constrained as possible, as this means you can implement a simpler solution, and there might not be any need for a full-blown parser and similar ilk.
Maybe a regex-based solution will be sufficient (although most certainly not very good). But before you can tell, you need a better spec ;)
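For instance, if you constrain macro names to word characters and forbid ']' inside arguments, a regex-based extractor can stay this simple (a sketch; the group names are arbitrary):

using System;
using System.Text.RegularExpressions;

class MacroDemo
{
    // Matches macros like #from_file[a.txt,2,N] or #random[0-9].
    static readonly Regex MacroPattern =
        new Regex(@"#(?<name>\w+)\[(?<args>[^\]]*)\]");

    static void Main()
    {
        string text = "Some text #from_file[a.txt,2,N] and #random[0-9] here";
        foreach (Match m in MacroPattern.Matches(text))
        {
            string name = m.Groups["name"].Value;
            string[] args = m.Groups["args"].Value.Split(',');
            Console.WriteLine(name + " -> " + string.Join(" | ", args));
        }
    }
}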
I need to analyze text for banned words. Suppose the blacklist contains the word "forbid". That word has many forms; in the text it can appear as, for example, "forbidding", "forbidden", or "forbad". To reduce a word to its initial form, I use lemmatization. Your suggestions?
What about typos?
For example: "F0rb1d". I'm thinking of using Damerau-Levenshtein or another distance. Your suggestions?
And what if the text is written as follows:
"ForbiddenInformation.Privatecorrespondenceofthecompany." OR
"F0rb1dden1nformation.Privatecorresp0ndenceofthec0mpany." (yes, without whitespace)
How to solve this problem?
Preferably a fast algorithm, because the text is processed in real time.
And maybe some tips to improve performance (how to store the word list, etc.)?
There are two possible solutions, as far as I know the algorithms.
You could try dynamic programming with LCS (longest common subsequence). It will search the original text for the desired word as a pattern; I believe it's O(mn) (a sketch follows the links below):
http://en.wikipedia.org/wiki/Longest_common_subsequence_problem
http://www.ics.uci.edu/~eppstein/161/960229.html
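For reference, a sketch of the classic O(m*n) dynamic-programming recurrence:

using System;

class Lcs
{
    // dp[i, j] = length of the LCS of the first i chars of a
    // and the first j chars of b.
    static int LcsLength(string a, string b)
    {
        int[,] dp = new int[a.Length + 1, b.Length + 1];
        for (int i = 1; i <= a.Length; i++)
            for (int j = 1; j <= b.Length; j++)
                dp[i, j] = a[i - 1] == b[j - 1]
                    ? dp[i - 1, j - 1] + 1
                    : Math.Max(dp[i - 1, j], dp[i, j - 1]);
        return dp[a.Length, b.Length];
    }

    static void Main()
    {
        // "forbid" and "forbad" share the subsequence "forbd".
        Console.WriteLine(LcsLength("forbid", "forbad")); // 5
    }
}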
The easier option, though, would be a text-search algorithm. The best I know is KMP, and it's O(n). For character comparison you could group characters into sets like {i I l(L) 1}, {o O 0} and so on (a normalization sketch follows the link below). You could also modify the matching so that not all letters have to match (forbid -> forbad).
http://en.wikipedia.org/wiki/Knuth-Morris-Pratt_algorithm
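The character-set idea amounts to a normalization pass before running the search; a sketch (the mapping table is an assumption, and note that '1' is ambiguous between 'i' and 'l'):

using System;
using System.Collections.Generic;
using System.Text;

class LeetNormalizer
{
    // Map visually similar characters onto one canonical letter,
    // so "F0rb1d" normalizes to "forbid" before matching.
    static readonly Dictionary<char, char> Canonical = new Dictionary<char, char>
    {
        ['0'] = 'o', ['1'] = 'i', ['3'] = 'e', ['4'] = 'a',
        ['5'] = 's', ['@'] = 'a', ['$'] = 's'
    };

    static string Normalize(string input)
    {
        var sb = new StringBuilder(input.Length);
        foreach (char c in input.ToLowerInvariant())
            sb.Append(Canonical.TryGetValue(c, out char mapped) ? mapped : c);
        return sb.ToString();
    }

    static void Main()
    {
        Console.WriteLine(Normalize("F0rb1dden1nformation")); // forbiddeninformation
    }
}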
So now you can compare the benefits of these two approaches and your own suggestion.
You could also use RegEx Matches to check for words.
http://www.c-sharpcorner.com/uploadfile/prasad_1/regexppsd12062005021717am/regexppsd.aspx
I'd like to give users the ability to search through a large list of businesses, but still find near matches.
Does anyone have any recommendations on how best to go about this when you're not targeting simple dictionary words, but instead complex names like ABC Business Name?
Regards.
Check out the wikipedia article on Levenshtein distance. It's a fairly simple concept to wrap your head around and pretty easy to implement an algorithm in whichever language you are using, in your case, C#.
I found an example in C# for you here.
Also, here is an example of a spelling corrector from Peter Norvig of Google. It was said on the SO podcast a few episodes ago that Jon Skeet attempted a rewrite of this same algorithm in C#. Not sure if he completed it and/or made it publicly available though.
Consider using keyword matching plus edit-distance-based similarity. You might combine this with data mapping what users originally searched for to what they actually clicked.
This is probably a crazy solution, but could you split the business name on spaces and then search on either all the items or maybe just the first couple?
So you might search on 'ABC' and 'Business' but leave out 'Name', as this might take too long.
You might even check whether the string is of a certain length, then trim it and just search on, say, the first 5 letters.
Have you had a look at "soundex" as a way of searching through your businesses? Again, I think you'd need to split the name on spaces.
You might check out the SQL Server SOUNDEX and DIFFERENCE functions. SOUNDEX converts a sequence of characters (such as a word) into a 4-character code which will be the same for similar-sounding words. DIFFERENCE gives a number which represents how "different" two strings are based on sound.
You could, for example, create a computed column based on the SOUNDEX function and match on that column later. Or you could use DIFFERENCE in a WHERE clause.
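For example, a sketch in C# (the table, column, and connection string are hypothetical; DIFFERENCE returns 0-4, with 4 meaning the two strings sound most alike):

using System;
using System.Data.SqlClient;

class SoundexSearch
{
    static void Main()
    {
        // Hypothetical database and schema, for illustration only.
        using (var conn = new SqlConnection("Server=.;Database=MyDb;Integrated Security=true"))
        using (var cmd = new SqlCommand(
            @"SELECT Name FROM Businesses
              WHERE DIFFERENCE(Name, @search) >= 3", conn))
        {
            cmd.Parameters.AddWithValue("@search", "ABC Busines Nmae"); // misspelled input
            conn.Open();
            using (SqlDataReader reader = cmd.ExecuteReader())
                while (reader.Read())
                    Console.WriteLine(reader.GetString(0)); // similar-sounding names
        }
    }
}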