Hello, I am trying to create a very fast algorithm to detect keywords, or lists of keywords, in a collection.
Before anything else: I have read a lot of Stack Overflow (and other) posts without being able to improve the performance to the level I expect.
My current solution is able to analyze an input of 200 chars against a collection of 400 lists in 0.1825 ms (about 5 inputs analyzed per ms), but this is way too long and I am hoping to improve this performance by at least 5 times (which is the requirement I have).
Solutions tested:
Manual search
A highly complex regex (groups, backreferences...)
A simple regex called multiple times (once per keyword)
A simple regex to match the input words, followed by an intersect with the tracked keywords (current solution)
Multi-threading (a huge negative impact on performance, around 100x slower, so I am not sure that would be the best solution for this problem)
Current solution:
input (string): the string to parse and analyze, to determine which keyword lists it contains.
Example: "hello world! How are you Mr #piloupe?".
tracks (string[]): array of strings that we want to match (a space means AND). Example: "hello world" matches any string that contains both 'hello' and 'world', wherever they appear.
keywordList (string[][]): the lists of keywords to match against the input.
Example: { { "hello" }, { "#piloupe" }, { "hello", "world" } }
uniqueKeywords (string[]): array of all the unique keywords across the keywordList. With the previous keywordList that would be: { "hello", "#piloupe", "world" }
None of this preliminary data needs any performance improvement, as it is constructed only once for all inputs.
The algorithm that finds the matching tracks:
// Stored in the class performing the queries
readonly Regex _regexToGetAllInputWords = new Regex(@"\#\w+|\w+", RegexOptions.Compiled);

List<string> GetInputMatches(string input)
{
    // Extract all the words from the input
    var inputWordsMatchCollection = _regexToGetAllInputWords.Matches(input.ToLower()).OfType<Match>().Select(x => x.Value).ToArray();
    // Keep only the input words that are tracked keywords
    var matchingKeywords = uniqueKeywords.Intersect(inputWordsMatchCollection).ToArray();
    List<string> result = new List<string>();
    // For each track, check whether all of its keywords were found
    // (tracksKeywords is the keywordList described above)
    for (int i = 0; i < tracksKeywords.Length; ++i)
    {
        bool trackIsMatching = true;
        // For every keyword of the track, check whether it exists in the matched set
        for (int j = 0; j < tracksKeywords[i].Length && trackIsMatching; ++j)
        {
            trackIsMatching = matchingKeywords.Contains(tracksKeywords[i][j]);
        }
        if (trackIsMatching)
        {
            result.Add(tracks[i]);
        }
    }
    return result;
}
Any help will be greatly appreciated.
The short answer is to parse every word, and store it into a binary tree-like collection. SortedList or SortedDictionary would be your friend here.
With very little code, you can add your words to a sorted collection and look them up with a binary search: List<T>.BinarySearch on a sorted List<string>, or SortedList's IndexOfKey, which performs a binary search internally. This is an O(log n) lookup, so you can search through thousands or millions of words in a handful of iterations (about 20 comparisons for a million entries). With SortedList the performance cost is in the inserts (since it keeps itself sorted while inserting), but that is the price of being able to binary search.
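Here is a minimal sketch of the idea, using a plain sorted List<string>; the keyword values are just placeholders:

using System;
using System.Collections.Generic;

var keywords = new List<string> { "hello", "world", "#piloupe" };
keywords.Sort(StringComparer.Ordinal);          // pay the sort cost once, up front

// Each lookup is then an O(log n) binary search.
bool ContainsKeyword(string word) =>
    keywords.BinarySearch(word, StringComparer.Ordinal) >= 0;

Console.WriteLine(ContainsKeyword("world"));    // True
Console.WriteLine(ContainsKeyword("missing"));  // False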
I wouldn't bother with threading since you need results in less than 1ms.
The long answer is to look at something like Lucene, which can be especially helpful if you're doing an autocomplete-style search. RavenDB uses Lucene under the covers and can do background indexing for you, it will search through millions of records in a few milliseconds.
I would like to suggest using a hash table.
With hashing, a string is converted to an integer that indexes directly into the table.
It's much faster than a sequential search.
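A minimal sketch of that idea with .NET's built-in HashSet<string> (the keyword values are placeholders); each Contains call is an O(1) average-time hash lookup:

using System;
using System.Collections.Generic;

var uniqueKeywords = new HashSet<string>(StringComparer.OrdinalIgnoreCase)
{
    "hello", "world", "#piloupe"
};

Console.WriteLine(uniqueKeywords.Contains("WORLD"));    // True (case-insensitive comparer)
Console.WriteLine(uniqueKeywords.Contains("missing"));  // False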
The ultimate solution is the Elastic binary tree data structure. It is used in HAProxy to match rules against URLs in the proxied HTTP requests (and for many other purposes as well).
An ebtree is a data structure built from your 'keyword' patterns which allows faster matching than either a SortedList or hashing. It can be faster than hashing because hashing reads the input string once (or at least several characters of it) to compute the hash code, and then again to evaluate .Equals(); so hashing reads all characters of the input at least once. An ebtree reads each character at most once and finds the match, or, if there is no match, reports that after examining O(log n) characters, where n is the number of patterns.
I'm not aware of an existing C# implementation of ebtree, but surely many people would be pleased if someone wrote one.
Related
I have a list of strings. I want to find all of the strings that start or end with another string. At its simplest, an example is:
List<string> allWords = new List<string>();
for (int index = 0; index < 1000000; index++)
    allWords.Add(index.ToString());

List<string> result = allWords.FindAll(x => x.StartsWith("10") || x.EndsWith("10"));
This algorithm scans the list from beginning to end. I need to perform this operation very quickly and O(n) is too slow.
What data structures (if any) are available to me to solve this problem faster than O(n)?
If you have an unsorted List<string>, there is no way to do it in less than O(n). However, you could use a different data structure. A trie (also called a prefix tree) is particularly well suited to your need, as it has O(m) search complexity (where m is the length of the searched prefix).
I have a C# implementation here: Trie.cs (actually, it's a trie-based dictionary, which associates a value with each key, but for your use case you can just ignore the value; or, if you prefer, you can adapt the implementation to your needs).
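For flavour, here is a minimal prefix-tree sketch (not the linked Trie.cs, and without the value payload it provides): each node keys its children by character, and StartsWith walks one node per prefix character before enumerating the subtree.

using System;
using System.Collections.Generic;
using System.Linq;

class TrieNode
{
    public Dictionary<char, TrieNode> Children = new Dictionary<char, TrieNode>();
    public bool IsWord;
}

class Trie
{
    readonly TrieNode _root = new TrieNode();

    public void Add(string word)
    {
        var node = _root;
        foreach (var c in word)
        {
            if (!node.Children.TryGetValue(c, out var child))
                node.Children[c] = child = new TrieNode();
            node = child;
        }
        node.IsWord = true;
    }

    // O(m) walk down to the prefix node, then enumerate everything below it.
    public IEnumerable<string> StartsWith(string prefix)
    {
        var node = _root;
        foreach (var c in prefix)
        {
            if (!node.Children.TryGetValue(c, out node))
                return Enumerable.Empty<string>();
        }
        return Collect(node, prefix);
    }

    static IEnumerable<string> Collect(TrieNode node, string current)
    {
        if (node.IsWord)
            yield return current;
        foreach (var pair in node.Children)
            foreach (var word in Collect(pair.Value, current + pair.Key))
                yield return word;
    }
}

// Usage:
// var trie = new Trie();
// foreach (var w in allWords) trie.Add(w);
// var matches = trie.StartsWith("10");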
To find strings starting with a given substring, sort the list, do a binary search to find the closest match, then scan the adjacent strings to find the others that also match at the beginning. That's O(log n) for the search (plus the matches you scan).
To find strings ending with a given substring, create a list of reversed strings, and sort that list. Then to find a string that ends in a given pattern, reverse the pattern and look for reversed strings that start with the reversed pattern, as in step 1.
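A rough sketch of both steps under the question's setup; the helper names StartingWith and Reverse, and the Ordinal comparison, are my own choices:

using System;
using System.Collections.Generic;
using System.Linq;

class PrefixSuffixSearch
{
    static IEnumerable<string> StartingWith(List<string> sorted, string prefix)
    {
        // Binary search gives the insertion point for the prefix; everything that
        // starts with the prefix sits contiguously right after that point.
        int index = sorted.BinarySearch(prefix, StringComparer.Ordinal);
        if (index < 0) index = ~index;
        while (index < sorted.Count && sorted[index].StartsWith(prefix, StringComparison.Ordinal))
            yield return sorted[index++];
    }

    static string Reverse(string s) => new string(s.Reverse().ToArray());

    static void Main()
    {
        var allWords = Enumerable.Range(0, 1000000).Select(i => i.ToString()).ToList();

        // Built once, reused for every query.
        var sortedWords = allWords.OrderBy(w => w, StringComparer.Ordinal).ToList();
        var sortedReversed = allWords.Select(Reverse).OrderBy(w => w, StringComparer.Ordinal).ToList();

        var startsWith10 = StartingWith(sortedWords, "10");
        // EndsWith("10") becomes a StartsWith on the reversed list, reversed back.
        var endsWith10 = StartingWith(sortedReversed, Reverse("10")).Select(Reverse);

        Console.WriteLine(startsWith10.Count() + endsWith10.Count());
    }
}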
I need to implement a process, wherein a text file of roughly 50/150kb is uploaded, and matched against a large number of phrases (~10k).
I need to know which phrases match specifically.
A phrase could be "blah blah blah" or just "blah" - meaning I need to take word-boundaries into account, as I don't wish to include infix matches.
My first attempt was to just create a large pre-compiled list of regular expressions that look like @"\b{0}\b" (as the 10k phrases are constant, I can cache and re-use this same list against multiple documents);
On my brand-new & very fast PC - this matching is taking 10 seconds+, which I would like to be able to reduce a great deal.
Any advice on how I may be able to achieve this would be greatly appreciated!
Cheers,
Dave
You could use Lucene.NET and its ShingleFilter, as long as you don't mind having a cap on the number of words a phrase can have.
public class MyAnalyzer : Analyzer
{
    public override TokenStream TokenStream(string fieldName, TextReader reader)
    {
        return new ShingleFilter(new LowerCaseFilter(new StandardTokenizer(Lucene.Net.Util.Version.LUCENE_29, reader)), 6);
    }
}
You can run the analyzer using this utility method.
public static IEnumerable<string> GetTerms(Analyzer analyzer, string keywords)
{
    var tokenStream = analyzer.TokenStream("content", new StringReader(keywords));
    var termAttribute = tokenStream.AddAttribute<ITermAttribute>();
    var terms = new HashSet<string>();
    while (tokenStream.IncrementToken())
    {
        var term = termAttribute.Term;
        if (!terms.Contains(term))
        {
            terms.Add(term);
        }
    }
    return terms;
}
Once you've retrieved all the terms, do an intersect with your phrase list.
var matchingShingles = GetTerms(new MyAnalyzer(), "Here's my stuff I want to match");
var matchingPhrases = phrasesToMatch.Intersect(matchingShingles, StringComparer.OrdinalIgnoreCase);
I think you will find this method is much faster than regex matching and respects word boundaries.
You can use Lucene.Net
This will create an index of your text, so that you can run really quick queries against it. This is a "full-text index".
This article explains what it's all about:
Lucene.net
This library was originally written in Java (Lucene), but there is a port to .NET (Lucene.Net).
You must take special care when choosing the stemmer. A stemmer takes the "root" of a word, so that several similar words can match (e.g. book and books will match). If you need exact matches, then you should use (or implement) a stemmer which returns the original words unchanged.
The same stemmer must be used for creating the index and for searching the results.
You should also have a look at the query syntax, because it's very powerful and allows for partial matches, exact matches, and so on.
You can also have a look at this blog.
string[] words = System.IO.File.ReadAllLines("word.txt");

var query = from word in words
            where word.Length > "abe".Length && word.StartsWith("abe")
            select word;

foreach (var w in query.AsParallel())
{
    Console.WriteLine(w);
}
Basically, word.txt contains 170,000 English words. Is there a collection class in C# that is faster than an array of strings for the above query? There will be no inserts or deletes, just searching whether a string starts with "abe" or "abdi".
Each word in the file is unique.
EDIT 1: This search will potentially be performed millions of times in my application. Also, I want to stick with LINQ for querying the collection because I might need to use aggregate functions.
EDIT 2: The words in the file are already sorted, and the file will not change.
Myself, I'd create a Dictionary<char, List<string>>, grouping the words by their first letter. This will substantially reduce the lookup time for the needed words; a sketch follows below.
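A rough sketch of that idea for the question's setup; the bucket-building is done once, and StartingWith is an illustrative helper name:

using System;
using System.Collections.Generic;
using System.Linq;

string[] words = System.IO.File.ReadAllLines("word.txt");

// Built once: one bucket of words per first letter.
Dictionary<char, List<string>> byFirstLetter = words
    .Where(w => w.Length > 0)
    .GroupBy(w => w[0])
    .ToDictionary(g => g.Key, g => g.ToList());

// Each query only scans the single relevant bucket.
IEnumerable<string> StartingWith(string prefix) =>
    byFirstLetter.TryGetValue(prefix[0], out var bucket)
        ? bucket.Where(w => w.Length > prefix.Length && w.StartsWith(prefix))
        : Enumerable.Empty<string>();

foreach (var w in StartingWith("abe"))
    Console.WriteLine(w);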
If you only need to search once, there is nothing better than a linear search; an array is perfectly fine for it.
If you need to perform repeated searches, consider sorting the array (O(n log n)) so that a search by any prefix becomes fast (O(log n)). Depending on the type of search, a dictionary of string lists indexed by prefix may be another good option.
If you search much more often than you change the word file, you can sort the words in the file every time you change the list. After this you can use binary search, so you will only need up to about 20 comparisons to find any word that matches your key, plus a few additional comparisons of its neighbourhood.
There is a list of banned words (or strings, to be more general) and another list with, let's say, users' mails. I would like to excise all banned words from all mails.
Trivial example:
foreach (string word in wordsList)
{
    for (int i = 0; i < mailList.Count; i++)
    {
        // string.Replace returns a new string, so the result has to be stored back
        mailList[i] = mailList[i].Replace(word, String.Empty);
    }
}
How I can improve this algorithm?
Thanks for the advice. I voted a few answers up, but I didn't mark any as the answer since it was more of a discussion than a solution. Some people mixed up banned words with bad words. In my case I don't have to bother with recognizing 'sh1t' or anything like that.
Simple approaches to profanity filtering won't work - complex approaches don't work, for the most part, either.
What happens when you get a word like 'password' and you want to filter out 'ass'? What happens when some clever person writes 'a$$' instead? The intent is still clear, right?
See How do you implement a good profanity filter? for extensive discussion.
You could use RegEx to make things a little cleaner:
var bannedWords = @"\b(this|is|the|list|of|banned|words)\b";
foreach (var mail in mailList)
{
    var clean = Regex.Replace(mail, bannedWords, "", RegexOptions.IgnoreCase);
}
Even that, though, is far from perfect since people will always figure out a way around any type of filter.
You'll get the best performance by drawing up a finite state machine (FSM) (or generating one) and then parsing your input one character at a time, walking through the states.
You can do this pretty easily with a function that takes your next input char and your current state and returns the next state; you also produce output as you walk through the mail message's characters. You can draw the FSM out on paper first.
Alternatively you could look into the Windows Workflow Foundation: State Machine Workflows.
In that way you only need to walk each message a single time.
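For what it's worth, the standard single-pass realization of that idea is an Aho-Corasick automaton (a trie of the banned words plus failure links), rather than the workflow-based state machine mentioned above. The sketch below only reports which banned words occur; the names AhoCorasick and Scan are mine, and the replacement/output step is left out.

using System.Collections.Generic;

class AhoCorasick
{
    class Node
    {
        public Dictionary<char, Node> Next = new Dictionary<char, Node>();
        public Node Fail;
        public List<string> Outputs = new List<string>();
    }

    readonly Node _root = new Node();

    public AhoCorasick(IEnumerable<string> words)
    {
        // Build the trie of banned words.
        foreach (var word in words)
        {
            var node = _root;
            foreach (var c in word)
            {
                if (!node.Next.TryGetValue(c, out var child))
                    node.Next[c] = child = new Node();
                node = child;
            }
            node.Outputs.Add(word);
        }

        // Breadth-first pass to wire up the failure links.
        var queue = new Queue<Node>();
        foreach (var child in _root.Next.Values)
        {
            child.Fail = _root;
            queue.Enqueue(child);
        }
        while (queue.Count > 0)
        {
            var node = queue.Dequeue();
            foreach (var pair in node.Next)
            {
                var child = pair.Value;
                var fail = node.Fail;
                while (fail != null && !fail.Next.ContainsKey(pair.Key))
                    fail = fail.Fail;
                child.Fail = fail == null ? _root : fail.Next[pair.Key];
                child.Outputs.AddRange(child.Fail.Outputs);
                queue.Enqueue(child);
            }
        }
    }

    // Walks the text exactly once and reports every banned word it contains.
    public IEnumerable<string> Scan(string text)
    {
        var node = _root;
        foreach (var c in text)
        {
            while (node != _root && !node.Next.ContainsKey(c))
                node = node.Fail;
            if (node.Next.TryGetValue(c, out var next))
                node = next;
            foreach (var match in node.Outputs)
                yield return match;
        }
    }
}

// Usage:
// var matcher = new AhoCorasick(wordsList);
// foreach (var hit in matcher.Scan(mail)) Console.WriteLine(hit);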
Constructing a regular expression from the words (word1|word2|word3|...) and using this instead of the outer loop might be faster, since then, every e-mail only needs to be parsed once. In addition, using regular expressions would enable you to remove only "complete words" by using the word boundary markers (\b(word1|word2|word3|...)\b).
In general, I don't think you will find a solution which is orders of magnitude faster than your current one: You will have to loop through all mails and you will have to search for all the words, there's no easy way around that.
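A minimal sketch of that single alternation, with Regex.Escape guarding against metacharacters in the banned words; the sample words and mail text are placeholders:

using System;
using System.Linq;
using System.Text.RegularExpressions;

string[] bannedWords = { "banned", "words", "list" };

// Build \b(word1|word2|...)\b once and reuse it for every mail.
string pattern = @"\b(" + string.Join("|", bannedWords.Select(Regex.Escape)) + @")\b";
var bannedRegex = new Regex(pattern, RegexOptions.IgnoreCase | RegexOptions.Compiled);

string mail = "This banned mail mentions words from the list.";
string clean = bannedRegex.Replace(mail, string.Empty);
Console.WriteLine(clean);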
A general algorithm would be to:
Generate a list of tokens based on the input string (i.e. by treating whitespace as a token separator)
Compare each token against a list of banned words
Replace matched tokens
A regular expression is convenient for identifying tokens, and a HashSet would provide quick lookups for your list of banned words. There is an overloaded Replace method on the Regex class that takes a function, where you could control the replace behavior based on your lookup.
HashSet<string> BannedWords = new HashSet<string>(StringComparer.InvariantCultureIgnoreCase)
{
    "bad",
};

string Input = "this is some bad text.";
string Output = Regex.Replace(Input, @"\b\w+\b", (Match m) => BannedWords.Contains(m.Value) ? new string('x', m.Value.Length) : m.Value);
Replacing it with * is annoying, but less annoying than something that removes the context of your intention by removing the word and leaving a malformed sentence. In discussing the Battle of Hastings, I'd be irritated if I saw William given the title "Grand ******* of Normandy", but at least I'd know I was playing in the small-kids playground, while his having the title of "Grand of Normandy" just looks like a mistake, or (worse) I might think that was actually his title.
Don't try replacing words with more innocuous words unless it's funny. People get the joke on 4chan, but Yahoo groups about history ended up with confused people because the 'medireview' and 'mediareview' periods were being discussed: 'eval' (not profanity, but used in some XSS attacks that Yahoo had been hit by) was replaced with 'review' inside 'medieval' and 'mediaeval' (apparently, 'medireview' is the American spelling of 'mediareview'!).
In some circumstances it is possible to improve it:
Just for fun:
You can use a SortedList. If your mailing list really is a mailing list (i.e. you have a delimiter like ";"), you can do as below.
First, work out the running time of the current algorithm:
Words: n items (each item has O(1) length).
Mailing list: K items.
Each item in the mailing list has an average length of Z.
Each sub-item of a mailing-list item has an average length of Y, so the average number of sub-items per mailing-list item is m = Z/Y.
The current algorithm takes O(n*K*Z), even with the best string matching (e.g. Knuth-Morris-Pratt) as the inner step.
1. Sort the words list: O(n log n).
2.1. Use mailingListItem.Split(";".ToCharArray()) on each mailing-list item: O(Z).
2.2. Sort the sub-items of each mailing-list item: O(m log m).
The total sorting of the mailing list takes O(K * Z) in the worst case, since m log m << Z.
3. Use a merge algorithm to intersect the sorted bad-word list with each sorted mailing-list item: O((m + n) * K).
The total time is therefore O(n log n + K*Z + (m+n)*K). Since m << n, this is roughly O(n log n + K*Z + n*K), which is smaller than O(n*K*Z) (I think so).
So if performance is very, very important, you can do this.
You might consider using Regex instead of simple string matches, to avoid replacing partial content within words. A Regex would allow you to assure you are only getting full words that match. You could use a pattern like this:
"\bBADWORD\b"
Also, you may want to iterate over the mailList on the outside, and the word list on the inner loop.
Wouldn't it be easier (and more efficient) to simply redact them by changing all their characters to * or something? That way no large string needs to be resized or moved around, and the recipients are made more aware of what happened, rather than getting nonsensical sentences with missing words.
Well, you certainly don't want to make the clbuttic mistake of doing it with a naive string.Replace(). The regex solution could work, although you'd either be iterating or using the pipe alternator (and I don't know if/how much that would slow your operation down, particularly for a large list of banned words). You could always just... not do it, since it's entirely futile no matter what: there are ways to make your intended words quite clear even without using the exact letters.
That, and it's ridiculous to have a list of words that "people find offensive" in the first place. There's someone who will be offended by pretty much any word.
/censorship is bullshit rant
I assume that you want to detect only complete words (separated by non-letter characters) and ignore words that merely contain a filter word as a substring (like the p[ass]word example). In that case you should build yourself a HashSet of filter words, scan the text for words, and check each word's existence in the HashSet. If it's a filter word, then build the resulting StringBuilder output without it (or with an equal number of asterisks), as sketched below.
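A rough sketch of that approach; the tokenizing regex, the asterisk policy and the Censor name are my own choices, not the poster's:

using System;
using System.Collections.Generic;
using System.Text;
using System.Text.RegularExpressions;

var filterWords = new HashSet<string>(StringComparer.OrdinalIgnoreCase) { "bad" };

string Censor(string text)
{
    var result = new StringBuilder(text.Length);
    // \w+|\W+ splits the text into alternating word and non-word runs.
    foreach (Match token in Regex.Matches(text, @"\w+|\W+"))
    {
        // Non-word runs (spaces, punctuation) are copied through untouched.
        result.Append(filterWords.Contains(token.Value)
            ? new string('*', token.Value.Length)
            : token.Value);
    }
    return result.ToString();
}

Console.WriteLine(Censor("this is some bad text."));  // "this is some *** text."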
I had great results using this algorithm on codeproject.com, better than brute-force text replacements.
We have 5mb of typical text (just plain words). We have 1000 words/phrases to use as terms to search for in this text.
What's the most efficient way to do this in .NET (ideally C#)?
Our ideas include regex's (a single one, lots of them) plus even the String.Contains stuff.
The input is a 2 MB to 5 MB text string, all text. Multiple hits are fine: for each term (of the 1000) that matches, we want to know about it. Performance means the entire time to execute; we don't care about memory footprint. The current algorithm takes 60+ seconds using naive string.Contains. We don't want 'cat' to match 'category' or even 'cats' (i.e. the entire term word must hit, no stemming).
We expect a <5% hit ratio in the text. The results would ideally just be the terms that matched (we don't need position or frequency just yet). We get a new 2-5 MB string every 10 seconds, so we can't assume we can index the input. The 1000 terms are dynamic, although they change at a rate of about 1 change per hour.
A naive string.Contains with 762 words (the final page) of War and Peace (3.13MB) runs in about 10s for me. Switching to 1000 GUIDs runs in about 5.5 secs.
Regex.IsMatch found the 762 words (much of which were probably in earlier pages as well) in about .5 seconds, and ruled out the GUIDs in 2.5 seconds.
I'd suggest your problem lies elsewhere...Or you just need some decent hardware.
Why reinvent the wheel? Why not just leverage something like Lucene.NET?
have you considered the following:
Do you care about substrings? Let's say I am looking for the word "cat", nothing more and nothing less. Now consider the Knuth-Morris-Pratt algorithm, or string.Contains, applied to "concatenate". Both of these will return true (or an index). Is this OK?
Also, you will have to look into the idea of the stemmed or "finite" state of a word. Let's look for "diary": the test sentence is "there are many kinds of diaries". To you and me the word "diaries" is there, but does it count? If so, we will need to preprocess the sentence, converting the words to their stemmed form (diaries -> diary), so the sentence becomes "there are many kind of diary". Now we can say that "diary" is in the sentence (please look at the Porter Stemmer algorithm).
Also, when it comes to processing text (aka Natural Language Processing), you can remove some words as noise. Take for example "a, have, you, I, me, some, to" <- these could be considered useless words and removed before any processing takes place. For example:
"I have written some C# today": if I have 10,000 keywords to look for, I would have to scan the entire sentence 10,000 times the number of words in the sentence. Removing noise beforehand will shorten the processing time:
"written C# today" <- noise removed; now there is a lot less to look through.
A great article on NLP (sentence comparison) can be found here.
HTH
Bones
A modified suffix tree would be very fast, though it would take up a lot of memory and I don't know how fast it would be to build. After that, however, each search would only take time proportional to the length of the search term.
Here's another idea: Make a class something like this:
class Word
{
    public string Text;          // the word itself (a field cannot share the name of its class)
    public List<int> Positions;  // positions of every occurrence, counted in words
}
For every unique word in your text you create an instance of this class. The Positions list stores the positions (counted in words, not characters) from the start of the text where this word was found.
Then make another two lists which will serve as indexes. One will store all these classes sorted by their text, the other by their positions in the text. In essence, the text index would probably be a SortedDictionary, while the position index would be a simple List<Word>.
Then, to search for a phrase, you split the phrase into words. Look up the first word in the dictionary (that's O(log n)). From there you know which words can follow it in the text (you have their positions from the Positions list). Look at those words (use the position index to find them in O(1)) and go on until you've found one or more full matches.
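A rough sketch of that scheme, reduced to a single dictionary from word to positions (enough for yes/no phrase matching); the tokenization, the class name and the consecutive-position check are my own simplifications:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

class PositionIndex
{
    readonly Dictionary<string, List<int>> _positions =
        new Dictionary<string, List<int>>(StringComparer.OrdinalIgnoreCase);

    public PositionIndex(string text)
    {
        int position = 0;
        foreach (Match m in Regex.Matches(text, @"\w+"))
        {
            if (!_positions.TryGetValue(m.Value, out var list))
                _positions[m.Value] = list = new List<int>();
            list.Add(position++);
        }
    }

    public bool ContainsPhrase(string phrase)
    {
        var words = Regex.Matches(phrase, @"\w+").Cast<Match>().Select(m => m.Value).ToArray();
        if (words.Length == 0 || !_positions.TryGetValue(words[0], out var starts))
            return false;

        // For each position of the first word, the remaining words must appear
        // at the immediately following word positions.
        return starts.Any(start => words.Skip(1)
            .Select((w, i) => _positions.TryGetValue(w, out var p) && p.Contains(start + i + 1))
            .All(ok => ok));
    }
}

// Usage:
// var index = new PositionIndex(bigTextToSearchForTerms);
// bool found = index.ContainsPhrase("blah blah blah");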
Are you trying to get a list of matched words, or are you trying to highlight them in the text, getting the start and length of each match? If all you're trying to do is find out whether the words exist, then you could use set operations to perform this fairly efficiently.
However, I expect you're trying to find each match's start position in the text... in which case this approach wouldn't work.
The most efficient approach I can think of is to dynamically build a match pattern from the list and then use regex. It's far easier to maintain a list of 1000 items than it is to maintain a regex pattern based on those same 1000 items.
It is my understanding that Regex uses the same KMP algorithm suggested to efficiently process large amounts of data - so unless you really need to dig through and understand the minutiae of how it works (which might be beneficial for personal growth), then perhaps regex would be ok.
There's quite an interesting paper on search algorithms for many patterns in large files here: http://webglimpse.net/pubs/TR94-17.pdf
Is this a bottleneck? How long does it take? 5 MiB isn't actually a lot of data to search in. Regular expressions might do just fine, especially if you encode all the search strings into one pattern using alternations. This basically amortizes the overall cost of the search to O(n + m) where n is the length of your text and m is the length of all patterns, combined. Notice that this is a very good performance.
An alternative that's well suited for many patterns is the Wu-Manber algorithm. I've already posted a very simplistic C++ implementation of the algorithm.
OK, the current rework shows this as fastest (pseudocode):
foreach (var term in allTerms)
{
    string pattern = term.ToWord(); // wraps the term in \b word-boundary anchors
    Regex regex = new Regex(pattern, RegexOptions.IgnoreCase);
    if (regex.IsMatch(bigTextToSearchForTerms))
    {
        result.Add(term);
    }
}
What was surprising (to me at least!) is that running the regex 1000 times was faster than a single regex with 1000 alternatives, i.e. @"\bterm1\b|\bterm2\b|...|\btermN\b", and then checking regex.Matches.Count.
How does this perform in comparison? It uses LINQ, so it may be a little slower, not sure...
List<string> allTerms = new List<string> { "string1", "string2", "string3", "string4" };
List<string> matches = allTerms
    .Where(item => Regex.IsMatch(bigTextToSearchForTerms, item, RegexOptions.IgnoreCase))
    .ToList();
This uses classic predicates to implement the FIND method, so it should be quicker than LINQ:
static bool Match(string checkItem)
{
    return Regex.IsMatch(bigTextToSearchForTerms, checkItem, RegexOptions.IgnoreCase);
}

static void Main(string[] args)
{
    List<string> allTerms = new List<string> { "string1", "string2", "string3", "string4" };
    List<string> matches = allTerms.FindAll(Match); // FindAll returns every match, Find only the first
}
Or this uses the lambda syntax to implement the classic predicate, which again should be faster than the LINQ, but is more readable than the previous syntax:
List<string> allTerms = new List<string> { "string1", "string2", "string3", "string4" };
List<string> matches = allTerms.FindAll(checkItem => Regex.IsMatch(bigTextToSearchForTerms, checkItem, RegexOptions.IgnoreCase));
I haven't tested any of them for performance, but they all implement your idea of iteration through the search list using the regex. It's just different methods of implementing it.