Analyze text (lemmatization, edit distance) - C#

I need to analyze text to check whether it contains any banned words. Suppose the black list contains the word "Forbid". The word has many forms; in the text it could appear as, for example, "forbidding", "forbidden" or "forbad". To reduce a word to its initial form I use lemmatization. Your suggestions?
What about typos?
For example: "F0rb1d". I am thinking of using Damerau-Levenshtein distance or something similar. Your suggestions?
And what if the text is written as follows:
"ForbiddenInformation.Privatecorrespondenceofthecompany." OR
"F0rb1dden1nformation.Privatecorresp0ndenceofthec0mpany." (yes, without whitespace)
How can I solve this problem?
Preferably a fast algorithm, because the text is processed in real time.
And maybe some tips to improve performance (how to store the data, etc.)?

There are two possible solutions, as far as I know the algorithms.
You could try dynamic programming with LCS (longest common subsequence). It searches the original text for the desired word as a pattern; I believe it's O(mn):
http://en.wikipedia.org/wiki/Longest_common_subsequence_problem
http://www.ics.uci.edu/~eppstein/161/960229.html
An easier option would be to use a text-search algorithm. The best I know is KMP, which runs in O(n). For character comparison you could group characters into equivalence sets like {i, I, l, 1}, {o, O, 0} and so on. You could also relax the matching so that not every letter has to match (forbid -> forbad).
http://en.wikipedia.org/wiki/Knuth-Morris-Pratt_algorithm
So now you can compare the benefits of these two against your own suggestion.
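For the character-grouping idea, here is a minimal normalization sketch in C#. The mapping table and class name are illustrative assumptions only ('1' could just as well map to 'l'), so tune the table to the substitutions you actually see:

using System.Collections.Generic;
using System.Text;

static class LeetNormalizer
{
    // Hypothetical look-alike table; extend it to whatever substitutions you expect.
    private static readonly Dictionary<char, char> Map = new Dictionary<char, char>
    {
        { '0', 'o' }, { '1', 'i' }, { '3', 'e' }, { '4', 'a' },
        { '5', 's' }, { '7', 't' }, { '@', 'a' }, { '$', 's' }
    };

    public static string Normalize(string text)
    {
        var sb = new StringBuilder(text.Length);
        foreach (char c in text)
        {
            char lower = char.ToLowerInvariant(c);
            sb.Append(Map.TryGetValue(lower, out char mapped) ? mapped : lower);
        }
        return sb.ToString();
    }
}

After normalization, "F0rb1d" becomes "forbid", which a plain substring search, KMP, or the lemmatizer can then handle.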

You could also use Regex.Matches to check for words.
http://www.c-sharpcorner.com/uploadfile/prasad_1/regexppsd12062005021717am/regexppsd.aspx

Related

Find best matched word in dictionary

I'm creating a program that reads a scanned handwritten document and converts it to text. The recognized words must come from a dictionary of about 300 words that I create. As an example, if the handwritten word is recognized as "heilo", but my dictionary only contains "hello" and "world", it should convert it to "hello". However, if it is recognized as "planet", it shouldn't match anything. I think a possible approach would be to compute a score of how closely the recognized word matches each word in the dictionary; if no word reaches a minimum score, then no match is found.
I'm writing the application in C#. Are there any libraries/examples available that can do something like this, or would I have to code everything from scratch?
Thanks
There is nothing in the standard libraries to compute the distance between words, but there are plenty of examples you can find on the internet: look up "edit distance" or "Levenshtein distance". The idea is to measure similarity as the number of changes needed to turn the first string into the second. The distance between "heil" and "hello" is 2, because you need to replace "i" with "l" (first edit) and then append an "o" (second edit).
When looking for an implementation, or implementing your own, avoid the trivial version with a full 2D array, because it's not memory-efficient. Use the two-row modification with O(min(m,n)) memory instead of the "naive" O(m*n).
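For reference, a minimal sketch of that two-row variant in C# (the class and method names are just placeholders):

using System;

static class EditDistance
{
    // Levenshtein distance using only two rows of the DP table,
    // so memory is O(min(m, n)) instead of O(m*n).
    public static int Levenshtein(string a, string b)
    {
        // Keep the shorter string as the row dimension to minimize memory.
        if (a.Length > b.Length) { var t = a; a = b; b = t; }

        int[] prev = new int[a.Length + 1];
        int[] curr = new int[a.Length + 1];
        for (int i = 0; i <= a.Length; i++) prev[i] = i;

        for (int j = 1; j <= b.Length; j++)
        {
            curr[0] = j;
            for (int i = 1; i <= a.Length; i++)
            {
                int cost = a[i - 1] == b[j - 1] ? 0 : 1;
                curr[i] = Math.Min(Math.Min(curr[i - 1] + 1,   // insertion
                                            prev[i] + 1),      // deletion
                                   prev[i - 1] + cost);        // substitution
            }
            var tmp = prev; prev = curr; curr = tmp;
        }
        return prev[a.Length];
    }
}

EditDistance.Levenshtein("heil", "hello") returns 2, matching the example above. To pick the best dictionary match, compute this against each of your ~300 words, keep the smallest distance, and reject the match if it exceeds your threshold.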
I have no library at hand that does what you need, but knowing that you want to calculate the Levenshtein distance should help your search.
Perhaps you should start with a spell checker - there are a number of libraries available that do this.
There are a few C# snippets online that will get the ball rolling:
Levenshtein:
http://www.dotnetperls.com/levenshtein
Boyer-Moore:
http://www-igm.univ-mlv.fr/~lecroq/string/node15.html#SECTION00150
Based on those, you can easily implement your own Word Matcher module.

Counting 1-word, 2-word and 3-word phrase occurrences in a webpage in C#/.NET

I'm going to write a program that takes a URL and counts the occurrences of every single 1-word, 2-word, and 3-word phrase in the webpage (and possibly x-word phrases).
Here's the best algorithm I could come up with:
1) Strip HTML tags.
2) Make everything lowercase.
3) Split the text on whitespace and put the words into an array.
4) Iterate over the words and, for each word, put word[i], word[i+1], word[i+2] into a hashtable.
Every time you have a collision you increase the count for that word or 2- or 3-word phrase (a minimal sketch of steps 2-4 follows below).
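A minimal sketch of steps 2-4 in C#, assuming the HTML has already been stripped; the class name and the whitespace-only tokenization are simplifying assumptions:

using System;
using System.Collections.Generic;

static class PhraseCounter
{
    // Counts 1-, 2- and 3-word phrases in already-stripped plain text.
    public static Dictionary<string, int> Count(string plainText, int maxPhraseLength = 3)
    {
        var counts = new Dictionary<string, int>();
        string[] words = plainText.ToLowerInvariant()
            .Split(new[] { ' ', '\t', '\r', '\n' }, StringSplitOptions.RemoveEmptyEntries);

        for (int i = 0; i < words.Length; i++)
        {
            string phrase = null;
            for (int len = 1; len <= maxPhraseLength && i + len <= words.Length; len++)
            {
                // Extend the phrase starting at word i by one word at a time.
                phrase = phrase == null ? words[i] : phrase + " " + words[i + len - 1];
                counts.TryGetValue(phrase, out int current);
                counts[phrase] = current + 1;
            }
        }
        return counts;
    }
}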
My questions are:
1) Can anyone suggest a more efficient solution in terms of space and runtime?
2) Are there any easy ways to do #1 in C#?
I can probably use a DOM parser and extract all the inner text.
Depending on your case, you might be oversimplifying the problem and/or you may end up putting a lot of effort into implementing functionality that already exists in libraries. So this will not be a direct answer, but a suggestion on which path to take in tackling this problem.
The process you want to implement is called information retrieval. It is very broad and complex, but luckily there is a lot of research in this area. Part of it is extracting word n-grams (an n-gram is a sequence of consecutive letters or words).
Let me show you some additional problems you should think about ahead of time:
is the capitalization of letters in a word important?
is a dot the only sign you want to use to mark the end of a sentence?
do you want to exclude stop words? Stop words are words you don't want to include in a phrase, like 'a', 'the', 'I', 'my' and so on.
do you want to stem words, i.e. convert words from their original form to their root form, like plural to singular: basketballs -> basketball?
And for extracting pure text from HTML:
extract only the text shown on the page?
extract hints as well (like those shown when hovering the mouse over a picture)?
any other non-visible text (meta tags and so on)?
There are libraries that perform searching and extract information from raw material. "Raw material" means that you have to process the document (HTML, DOC, PDF, image, ...) and turn it into text in order for the search engine to index it (extract phrases, for instance). Once a document is indexed it can be searched. One such library for .NET is Lucene.NET. It supports different stemmers, analyzers and filters.
I am not sure, but I believe there are libraries for extracting text from HTML as well.
Basically, your approach may work in simpler scenarios where a non-trivial error level is acceptable. I recently gained an interest in information retrieval and found it really complex and interesting. You may benefit from researching this topic, depending on your goals. There is a lot of info here on Stack Overflow as well as on the rest of the Internet.
And if you decide to go this way, there is much more info on Lucene (the original Lucene is the Java version; Lucene.NET is the port to .NET) than on Lucene.NET. So if you don't find an answer for Lucene.NET immediately, search the Lucene discussions.
To answer your question #2.
// Assumes a WinForms WebBrowser control that has finished loading the page.
HtmlDocument doc = WebBrowser1.Document;
string text = doc.Body.InnerText;
If you want to make it more efficient, use a suffix trie (you may have to write your own):
http://en.wikipedia.org/wiki/Suffix_trie
A suffix trie basically makes searching for a string depend on the length of the pattern instead of the length of the text. It's the sort of thing search engines use.
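For illustration only, a naive (uncompressed) suffix trie sketch in C#. It demonstrates the pattern-length lookup cost, but its memory use is roughly O(n^2) in the worst case, so a real suffix tree (e.g. built with Ukkonen's algorithm) or a suffix array would be the practical choice:

using System.Collections.Generic;

class SuffixTrie
{
    private sealed class Node
    {
        public readonly Dictionary<char, Node> Children = new Dictionary<char, Node>();
    }

    private readonly Node _root = new Node();

    public SuffixTrie(string text)
    {
        // Insert every suffix of the text into the trie.
        for (int start = 0; start < text.Length; start++)
        {
            Node node = _root;
            for (int i = start; i < text.Length; i++)
            {
                if (!node.Children.TryGetValue(text[i], out Node child))
                {
                    child = new Node();
                    node.Children[text[i]] = child;
                }
                node = child;
            }
        }
    }

    // True if the pattern occurs anywhere in the original text;
    // the walk costs one step per pattern character.
    public bool Contains(string pattern)
    {
        Node node = _root;
        foreach (char c in pattern)
        {
            if (!node.Children.TryGetValue(c, out node)) return false;
        }
        return true;
    }
}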

Anomaly in text

Let me explain with an example.
We have the following text:
"Comme Il Faut was founded in 1927. The tobacco company is most well known for its reputation of producing customized private label brands for its partners worldwide".
This is normal text. But the following text:
"CommeIlFautwasfounded in 1927. The tobacco companyi most wellknown foritsreputation of producing customizedprivatelabelbrands foritspartners worldwide"
This text contains anomalies: typos, words run together without spaces, maybe something else.
How can I search for such anomalies?
What (statistical) algorithms are there for this?
Ideally the result would be a percentage: for example, the text is 80% anomalous.
Thanks.
Construct a trie with all the known words in the dictionary.
Take each word that appears in your text and try to find it in the trie. If you don't find it, try to match a prefix of length k. If you find a match, apply the same procedure to the remaining characters. It's recursive and it can catch more than two concatenated words.
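A sketch of that recursive idea in C#. For brevity it uses a HashSet of dictionary words with a maximum word length instead of a real trie, but the longest-prefix-first recursion is the same (and without memoization it can be slow on pathological inputs):

using System;
using System.Collections.Generic;

static class Segmenter
{
    // Tries to split a run of letters (no spaces) into known dictionary words.
    // Returns null if no full segmentation exists.
    public static List<string> Segment(string input, HashSet<string> dictionary, int maxWordLength = 20)
    {
        if (input.Length == 0) return new List<string>();

        // Try the longest prefix first, then recurse on the remainder.
        int limit = Math.Min(maxWordLength, input.Length);
        for (int len = limit; len >= 1; len--)
        {
            string prefix = input.Substring(0, len);
            if (!dictionary.Contains(prefix)) continue;

            List<string> rest = Segment(input.Substring(len), dictionary, maxWordLength);
            if (rest != null)
            {
                rest.Insert(0, prefix);
                return rest;
            }
        }
        return null;
    }
}

// Example:
// var dict = new HashSet<string>(StringComparer.OrdinalIgnoreCase) { "comme", "il", "faut", "was", "founded" };
// Segmenter.Segment("CommeIlFautwasfounded", dict)  ->  ["Comme", "Il", "Faut", "was", "founded"]

The fraction of characters that cannot be segmented into known words can then serve as the anomaly percentage the question asks for.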
Another simple method is to use the edit distance algorithm. It calculates the minimum number of edit operations (insert, delete or replace) needed to transform one string into the other. With some additional logic you can easily get the algorithm to output the operations as well.
This, however, assumes you have both the correct and the broken string. If you only have the broken string, this gets a lot harder. In that case I would suggest you either try the trie approach mentioned before, or use an external library like ispell to handle this logic. You could look at the code of ispell or its variants to see how complicated such a task can get.
A couple of links that could be helpful:
http://www.codeproject.com/KB/cs/spellcheckdemo.aspx
http://www.codeproject.com/KB/recipes/spellcheckparser.aspx

How to cut specified words from a string

There is a list of banned words (or strings, to be more general) and another list with, let's say, users' mails. I would like to excise all banned words from all mails.
Trivial example:
foreach (string word in wordsList)
{
    for (int i = 0; i < mailList.Count; i++)
    {
        // string.Replace returns a new string, so the result has to be stored back.
        mailList[i] = mailList[i].Replace(word, string.Empty);
    }
}
How can I improve this algorithm?
Thanks for the advice. I voted a few answers up, but I didn't mark any as the answer since it was more of a discussion than a solution. Some people confused banned words with bad words. In my case I don't have to bother with recognizing 'sh1t' or anything like that.
Simple approaches to profanity filtering won't work, and complex approaches don't work either, for the most part.
What happens when you get a word like 'password' and you want to filter out 'ass'? What happens when some clever person writes 'a$$' instead? The intent is still clear, right?
See How do you implement a good profanity filter? for extensive discussion.
You could use a regex to make things a little cleaner:
var bannedWords = @"\b(this|is|the|list|of|banned|words)\b";
foreach (string mail in mailList)
{
    var clean = Regex.Replace(mail, bannedWords, "", RegexOptions.IgnoreCase);
    // ... do something with 'clean' ...
}
Even that, though, is far from perfect since people will always figure out a way around any type of filter.
You'll get the best performance by drawing up a finite state machine (FSM), or generating one, and then parsing your input one character at a time, walking through the states.
You can do this pretty easily with a function that takes the next input character and the current state and returns the next state; you also produce output as you walk through the message's characters. You can draw the FSM on paper first.
Alternatively you could look into the Windows Workflow Foundation: State Machine Workflows.
That way you only need to walk each message a single time.
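A sketch of that single-pass idea in C#, written as an Aho-Corasick automaton (a trie of the banned words plus failure links), which is the standard way to get one state machine for many words. All names here are illustrative; note it reports substring hits, so add word-boundary checks if you only want whole words:

using System.Collections.Generic;

class BannedWordAutomaton
{
    private sealed class Node
    {
        public readonly Dictionary<char, Node> Next = new Dictionary<char, Node>();
        public Node Fail;
        public readonly List<string> Matches = new List<string>();
    }

    private readonly Node _root = new Node();

    public BannedWordAutomaton(IEnumerable<string> words)
    {
        // Build the trie of banned words (lowercased for case-insensitive matching).
        foreach (string word in words)
        {
            Node node = _root;
            foreach (char c in word.ToLowerInvariant())
            {
                if (!node.Next.TryGetValue(c, out Node child))
                    node.Next[c] = child = new Node();
                node = child;
            }
            node.Matches.Add(word);
        }

        // Breadth-first construction of the failure links.
        var queue = new Queue<Node>();
        foreach (Node child in _root.Next.Values) { child.Fail = _root; queue.Enqueue(child); }
        while (queue.Count > 0)
        {
            Node node = queue.Dequeue();
            foreach (var pair in node.Next)
            {
                Node fail = node.Fail;
                while (fail != null && !fail.Next.ContainsKey(pair.Key)) fail = fail.Fail;
                pair.Value.Fail = fail == null ? _root : fail.Next[pair.Key];
                pair.Value.Matches.AddRange(pair.Value.Fail.Matches);
                queue.Enqueue(pair.Value);
            }
        }
    }

    // Yields every (position, word) hit in a single left-to-right pass over the text.
    public IEnumerable<(int Position, string Word)> Scan(string text)
    {
        Node state = _root;
        for (int i = 0; i < text.Length; i++)
        {
            char c = char.ToLowerInvariant(text[i]);
            while (state != _root && !state.Next.ContainsKey(c)) state = state.Fail;
            if (state.Next.TryGetValue(c, out Node next)) state = next;
            foreach (string word in state.Matches)
                yield return (i - word.Length + 1, word);
        }
    }
}

// var fsm = new BannedWordAutomaton(new[] { "forbid", "banned" });
// foreach (var hit in fsm.Scan("Some forbidden text")) ...   // reports "forbid" at position 5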
Constructing a regular expression from the words ((word1|word2|word3|...)) and using it instead of the outer loop might be faster, since every e-mail then only needs to be scanned once. In addition, regular expressions let you remove only complete words by using word-boundary markers (\b(word1|word2|word3|...)\b).
In general, I don't think you will find a solution that is orders of magnitude faster than your current one: you will have to loop through all mails and you will have to search for all the words; there's no easy way around that.
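A small self-contained sketch of that combined pattern; the sample data is made up, and Regex.Escape is there in case a banned term contains regex metacharacters:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

class Program
{
    static void Main()
    {
        var mailList = new List<string> { "the banned words in this mail", "nothing to see here" };
        string[] bannedWords = { "this", "is", "the", "list", "of", "banned", "words" };

        // One alternation pattern, so each mail is scanned a single time.
        string pattern = @"\b(" + string.Join("|", bannedWords.Select(Regex.Escape)) + @")\b";
        var banned = new Regex(pattern, RegexOptions.IgnoreCase | RegexOptions.Compiled);

        foreach (string mail in mailList)
            Console.WriteLine(banned.Replace(mail, string.Empty));
    }
}

Note that removing words this way leaves doubled spaces behind; the MatchEvaluator approach shown in the next answer gives more control over what goes in their place.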
A general algorithm would be to:
Generate a list of tokens from the input string (e.g. by treating whitespace as a token separator)
Compare each token against a list of banned words
Replace matched tokens
A regular expression is convenient for identifying tokens, and a HashSet provides quick lookups for your list of banned words. There is an overload of Regex.Replace that takes a MatchEvaluator delegate, where you can control the replacement behavior based on the lookup.
HashSet<string> BannedWords = new HashSet<string>(StringComparer.InvariantCultureIgnoreCase)
{
    "bad",
};

string Input = "this is some bad text.";
string Output = Regex.Replace(Input, @"\b\w+\b",
    (Match m) => BannedWords.Contains(m.Value) ? new string('x', m.Value.Length) : m.Value);
Replacing it with * is annoying, but less annoying than something that removes the context of your intention by removing the word and leaving a malformed sentence. In discussing the Battle of Hastings, I'd be irritated if I saw William given the title "Grand ******* of Normandy", but at least I'd know I was playing in the small-kids playground, while his having the title of "Grand of Normandy" just looks like a mistake, or (worse) I might think that was actually his title.
Don't try replacing words with more innocuous words unless it's funny. People get the joke on 4chan, but Yahoo groups about history had confused people because the "medireview" and "mediareview" periods were being discussed: "eval" (not profanity, but used in some XSS attacks that Yahoo had been hit by) was being replaced with "review" inside "medieval" and "mediaeval" (apparently, "medireview" is the American spelling of "mediareview"!).
In some circumstances it is possible to improve on it.
Just for fun:
You can use a SortedList. If your mailing-list items really are lists of addresses with a delimiter like ";", you can do the following.
First work out the running time of your current algorithm:
Words: n items (each item has O(1) length).
Mailing list: K items.
Each item in the mailing list has an average length of Z.
Each sub-item in a mailing-list item has an average length of Y, so the average number of sub-items per mailing-list item is m = Z/Y.
Your algorithm takes O(n*K*Z), even with a good string-search algorithm such as Knuth-Morris-Pratt.
1. Sort the word list: O(n log n).
2.1 Split each mailing-list item with mailingListItem.Split(";".ToCharArray()): O(Z).
2.2 Sort the sub-items of each mailing-list item: O(m log m).
The total sorting takes O(K*Z) in the worst case, given that m log m << Z.
3. Use a merge pass to match the sorted bad-word list against each sorted mailing-list item: O((m+n)*K).
The total time is O((m+n)*K + m*Z + n^2); given that m << n, the total running time is O(n^2 + Z*K) in the worst case, which is smaller than O(n*K*Z) if n < K*Z (I think so).
So if performance is very, very, very important, you can do this.
You might consider using a regex instead of simple string matches, to avoid replacing partial content within words. A regex allows you to make sure you only match whole words. You could use a pattern like this:
@"\bBADWORD\b"
Also, you may want to iterate over mailList in the outer loop and the word list in the inner loop.
Wouldn't it be easier (and more efficient) to simply redact them by changing all their characters to * or something? That way no large string needs to be resized or moved around, and the recipients are made more aware of what happened, rather than getting nonsensical sentences with missing words.
Well, you certainly don't want to make the clbuttic mistake of doing it with a naive string.Replace(). The regex solution could work, although you'd either be iterating or using the pipe alternation (and I don't know if, or by how much, that would slow your operation down, particularly for a large list of banned words). You could always just... not do it, since it's entirely futile no matter what: there are ways to make the intended word quite clear even without using the exact letters.
That, and it's ridiculous to have a list of words that "people find offensive" in the first place. There's someone who will be offended by pretty much any word.
/censorship is bullshit rant
I assume that you want to detect only complete words (separated by non-letter characters) and ignore words that merely contain a filter word as a substring (like the p[ass]word example). In that case you should build a HashSet of filter words, scan the text for words, and check each word for membership in the HashSet. If it is a filter word, build the resulting string with a StringBuilder, leaving the word out (or replacing it with an equal number of asterisks); a sketch follows below.
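A minimal sketch of that scan; class and method names are illustrative, and it treats any non-letter as a separator, so digit tricks like 'sh1t' are out of scope (as the asker said they can be):

using System.Collections.Generic;
using System.Text;

static class Redactor
{
    // Replaces each whole filter word with asterisks; anything that is merely a
    // substring of a longer word (the p[ass]word case) is left untouched.
    public static string Redact(string text, HashSet<string> filterWords)
    {
        var result = new StringBuilder(text.Length);
        int i = 0;
        while (i < text.Length)
        {
            if (!char.IsLetter(text[i])) { result.Append(text[i]); i++; continue; }

            int start = i;
            while (i < text.Length && char.IsLetter(text[i])) i++;
            string word = text.Substring(start, i - start);

            result.Append(filterWords.Contains(word) ? new string('*', word.Length) : word);
        }
        return result.ToString();
    }
}

// var filter = new HashSet<string>(StringComparer.OrdinalIgnoreCase) { "bad" };
// Redactor.Redact("this is some bad text, but not badly.", filter)
//   -> "this is some *** text, but not badly."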
I had great results using this algorithm on codeproject.com; it was better than brute-force text replacement.

C# Code/Algorithm to Search Text for Terms

We have 5 MB of typical text (just plain words). We have 1000 words/phrases to use as terms to search for in this text.
What's the most efficient way to do this in .NET (ideally C#)?
Our ideas include regexes (a single one, or lots of them) plus even String.Contains.
The input is a 2 MB to 5 MB text string, all plain text. Multiple hits are good: for each of the 1000 terms that matches, we want to know about it. Performance means total time to execute; we don't care about memory footprint. The current algorithm takes about 60+ seconds using naive string.Contains. We don't want 'cat' to match 'category' or even 'cats' (i.e. the entire term must match as a whole word, no stemming).
We expect a <5% hit ratio in the text. Ideally the results would just be the terms that matched (we don't need position or frequency yet). We get a new 2-5 MB string every 10 seconds, so we can't assume we can index the input. The 1000 terms are dynamic, although they change at a rate of about one change per hour.
A naive string.Contains with 762 words (the final page) of War and Peace (3.13 MB) runs in about 10 s for me. Switching to 1000 GUIDs runs in about 5.5 s.
Regex.IsMatch found the 762 words (many of which probably also appear on earlier pages) in about 0.5 seconds, and ruled out the GUIDs in 2.5 seconds.
I'd suggest your problem lies elsewhere... or you just need some decent hardware.
Why reinvent the wheel? Why not just leverage something like Lucene.NET?
Have you considered the following?
Do you care about substrings? Let's say I am looking for the word "cat", nothing more and nothing less. Now consider the Knuth-Morris-Pratt algorithm, or string.Contains, applied to "concatenate": both of these will return true (or an index). Is this OK?
You will also have to look into the idea of the stemmed or "finite" form of a word. Let's look for "diary"; the test sentence is "there are many kinds of diaries". To you and me the word "diaries" is clearly there, but does it count? If so, we need to preprocess the sentence, converting the words to their root form (diaries -> diary), so the sentence becomes "there are many kinds of diary". Now we can say that "diary" is in the sentence (see the Porter stemmer algorithm).
Also, when it comes to processing text (a.k.a. natural language processing), you can remove some words as noise. Take, for example, "a, have, you, I, me, some, to" <- these could be considered useless words and can be removed before any processing takes place. For example:
"I have written some C# today": if I have 10,000 keywords to look for, I would have to scan the entire sentence 10,000 times, multiplied by the number of words in the sentence. Removing noise beforehand shortens the processing time.
"written C# today" <- noise removed; now there is a lot less to look through (a minimal sketch of such a filter follows below).
A great article on NLP and sentence comparison can be found here.
HTH
Bones
A modified suffix tree would be very fast, though it would take up a lot of memory and I don't know how fast it would be to build. After that, however, every search would take time proportional to the length of the search term rather than the size of the text.
Here's another idea: Make a class something like this:
class Word
{
    // A field cannot share its enclosing type's name, hence "Text" instead of "Word".
    public string Text;          // the word itself
    public List<int> Positions;  // offsets (counted in words, not characters) where it occurs
}
For every unique word in your text you create an instance of this class. The Positions list stores the positions (counted in words, not characters) from the start of the text where this word was found.
Then make another two lists which will serve as indexes. One stores all these objects sorted by their text, the other by their positions in the text. In essence, the text index would probably be a SortedDictionary, while the position index would be a simple List<Word>.
Then, to search for a phrase, you split the phrase into words. Look up the first word in the dictionary (that's O(log n)). From there you know which words can follow it in the text (you have their positions from the Positions list). Look at those words (use the position index to find them in O(1)) and continue until you've found one or more full matches. A simplified sketch of this index follows below.
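A simplified sketch of that index, assuming a plain Dictionary<string, List<int>> in place of the two sorted indexes described (positions are appended in increasing order, so BinarySearch works) and a naive whitespace tokenizer:

using System;
using System.Collections.Generic;

class PhraseIndex
{
    // Each word maps to the list of word-offsets where it occurs in the text.
    private readonly Dictionary<string, List<int>> _positions =
        new Dictionary<string, List<int>>(StringComparer.OrdinalIgnoreCase);

    public PhraseIndex(string text)
    {
        string[] words = text.Split(new[] { ' ', '\t', '\r', '\n' },
                                    StringSplitOptions.RemoveEmptyEntries);
        for (int i = 0; i < words.Length; i++)
        {
            if (!_positions.TryGetValue(words[i], out List<int> list))
                _positions[words[i]] = list = new List<int>();
            list.Add(i);
        }
    }

    // A phrase matches where its words occur at consecutive positions.
    public bool ContainsPhrase(string phrase)
    {
        string[] terms = phrase.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
        if (terms.Length == 0 || !_positions.TryGetValue(terms[0], out List<int> starts))
            return false;

        foreach (int start in starts)
        {
            bool match = true;
            for (int k = 1; k < terms.Length && match; k++)
            {
                // The k-th term must occur exactly k words after the phrase start.
                match = _positions.TryGetValue(terms[k], out List<int> positions) &&
                        positions.BinarySearch(start + k) >= 0;
            }
            if (match) return true;
        }
        return false;
    }
}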
Are you trying to get a list of matched words, or are you trying to highlight them in the text by getting the start and length of each match? If all you're trying to do is find out whether the words exist, you could use set operations to do this fairly efficiently.
However, I expect you're trying to find each match's start position in the text... in which case this approach wouldn't work.
The most efficient approach I can think of is to dynamically build a match pattern from a list and then use a regex. It's far easier to maintain a list of 1000 items than a regex pattern based on those same 1000 items.
It is my understanding that Regex uses the same kind of algorithm as the KMP suggestion to efficiently process large amounts of data, so unless you really need to dig into the minutiae of how it works (which might be beneficial for personal growth), perhaps regex would be OK.
There's quite an interesting paper on search algorithms for many patterns in large files here: http://webglimpse.net/pubs/TR94-17.pdf
Is this a bottleneck? How long does it take? 5 MiB isn't actually a lot of data to search. Regular expressions might do just fine, especially if you encode all the search strings into one pattern using alternation. This basically amortizes the overall cost of the search to O(n + m), where n is the length of your text and m is the combined length of all patterns. Notice that this is very good performance.
An alternative that's well suited to many patterns is the Wu-Manber algorithm. I've already posted a very simplistic C++ implementation of the algorithm.
OK, the current rework shows this as fastest (pseudocode):
foreach (var term in allTerms)
{
    string pattern = term.ToWord(); // wraps the term in \b word-boundary markers
    Regex regex = new Regex(pattern, RegexOptions.IgnoreCase);
    if (regex.IsMatch(bigTextToSearchForTerms))
    {
        result.Add(term);
    }
}
What was surprising (to me at least!) is that running the regex 1000 times was faster than a single regex with 1000 alternatives, i.e. "\bterm1\b|\bterm2\b|\btermN\b", and then using regex.Matches.Count.
How does this perform in comparison? It uses LINQ, so it may be a little slower, not sure...
List<string> allTerms = new List<string> { "string1", "string2", "string3", "string4" };
List<string> matches = allTerms.Where(item => Regex.IsMatch(bigTextToSearchForTerms, item, RegexOptions.IgnoreCase)).ToList();
This uses a classic predicate with List<T>.FindAll, so it should be quicker than LINQ:
static bool Match(string checkItem)
{
    return Regex.IsMatch(bigTextToSearchForTerms, checkItem, RegexOptions.IgnoreCase);
}

static void Main(string[] args)
{
    List<string> allTerms = new List<string> { "string1", "string2", "string3", "string4" };
    List<string> matches = allTerms.FindAll(Match);
}
Or this uses lambda syntax for the same predicate, which again should be faster than LINQ but is more readable than the previous syntax:
List<string> allTerms = new List<string> { "string1", "string2", "string3", "string4" };
List<string> matches = allTerms.FindAll(checkItem => Regex.IsMatch(bigTextToSearchForTerms, checkItem, RegexOptions.IgnoreCase));
I haven't tested any of them for performance, but they all implement your idea of iterating through the search list with a regex. They're just different ways of implementing it.
