Matching a large number of strings/phrases - c#

I need to implement a process wherein a text file of roughly 50-150KB is uploaded and matched against a large number of phrases (~10k).
I need to know which phrases match specifically.
A phrase could be "blah blah blah" or just "blah" - meaning I need to take word boundaries into account, as I don't wish to include infix matches.
My first attempt was to just create a large pre-compiled list of regular expressions that look like @"\b{0}\b" (as the 10k phrases are constant, I can cache and re-use this same list against multiple documents);
On my brand-new and very fast PC, this matching is taking 10+ seconds, which I would like to reduce a great deal.
Any advice on how I may be able to achieve this would be greatly appreciated!
Cheers,
Dave

You could use Lucene.NET and its ShingleFilter, as long as you don't mind having a cap on the number of words a phrase can have.
public class MyAnalyzer : Analyzer
{
    public override TokenStream TokenStream(string fieldName, TextReader reader)
    {
        return new ShingleFilter(new LowerCaseFilter(new StandardTokenizer(Lucene.Net.Util.Version.LUCENE_29, reader)), 6);
    }
}
You can run the analyzer using this utility method.
public static IEnumerable<string> GetTerms(Analyzer analyzer, string keywords)
{
    var tokenStream = analyzer.TokenStream("content", new StringReader(keywords));
    var termAttribute = tokenStream.AddAttribute<ITermAttribute>();
    var terms = new HashSet<string>();
    while (tokenStream.IncrementToken())
    {
        terms.Add(termAttribute.Term); // HashSet ignores duplicates
    }
    return terms;
}
Once you've retrieved all the terms, do an intersect with your phrase list.
var matchingShingles = GetTerms(new MyAnalyzer(), "Here's my stuff I want to match");
var matchingPhrases = phrasesToMatch.Intersect(matchingShingles, StringComparer.OrdinalIgnoreCase);
I think you will find this method is much faster than regex matching, and it respects word boundaries.

You can use Lucene.Net
This will create an index of your text so that you can run really quick queries against it. This is a "full-text index".
This article explains what it's all about:
Lucene.net
This library was originally written in Java (Lucene), but there is a port to .NET (Lucene.Net).
You must take special care when choosing the stemmer. A stemmer reduces a word to its root, so that several similar words can match (e.g. book and books will match). If you need exact matches, you should choose (or implement) a stemmer that returns the original words unchanged.
The same stemmer must be used for creating the index and for searching the results.
You must also have a look at the query syntax, because it's very powerful and allows for partial matches, exact matches, and so on.
You can also have a look at this blog.
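For a rough idea of the moving parts, here is a minimal sketch in the Lucene.Net 3.0 style (documentText stands in for your uploaded text, and the field name "content" is illustrative): index the document with a non-stemming analyzer, then run an exact phrase query against it.

var dir = new RAMDirectory();
var analyzer = new WhitespaceAnalyzer(); // no stemming: terms are indexed exactly as written
using (var writer = new IndexWriter(dir, analyzer, IndexWriter.MaxFieldLength.UNLIMITED))
{
    var doc = new Document();
    doc.Add(new Field("content", documentText, Field.Store.NO, Field.Index.ANALYZED));
    writer.AddDocument(doc);
}
using (var searcher = new IndexSearcher(dir, true))
{
    // An exact phrase query: every word must appear in sequence.
    var phrase = new PhraseQuery();
    foreach (var word in "blah blah blah".Split(' '))
        phrase.Add(new Term("content", word));
    bool matches = searcher.Search(phrase, 1).TotalHits > 0;
}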

Related

Find keyword fastest algorithm in C#

Hello, I am trying to create a very fast algorithm to detect keywords or lists of keywords in a collection.
Before anything, I have read a lot of Stack Overflow (and other) posts without being able to improve the performance to the level I expect.
My current solution is able to analyze an input of 200 chars against a collection of 400 lists in 0.1825 ms (5 inputs analyzed per ms), but this is way too slow and I am hoping to improve this performance by at least 5 times (which is the requirement I have).
Solutions tested:
Manual search
Highly complex regex (groups, backreferences...)
Simple regex called multiple times (to match each of the keywords)
Simple regex to match input keywords followed by an intersect with the tracked keywords (current solution)
Multi-threading (huge impact on performance (~100x), so I am not sure that would be the best solution for this problem)
Current solution:
input (string): the string to parse and analyze to verify which keyword lists it contains.
Example: "hello world! How are you Mr #piloupe?"
tracks (string[]): array of strings that we want to match (a space means AND). Example: "hello world" matches a string that contains both 'hello' and 'world', whatever their locations.
keywordList (string[][]): list of the keyword lists to match from the input (referred to as tracksKeywords in the code below).
Example: { { "hello" }, { "#piloupe" }, { "hello", "world" } }
uniqueKeywords (string[]): array of strings representing all the unique keywords of the keywordList. With the previous keywordList that would be: { "hello", "#piloupe", "world" }
None of this data needs any performance improvement, as it is constructed only once and reused for every input.
Find the tracks algorithm:
// Store in the class performing the queries
readonly Regex _regexToGetAllInputWords = new Regex(@"\#\w+|\w+", RegexOptions.Compiled);

List<string> GetInputMatches(string input)
{
    // Extract all the words from the input
    var inputWordsMatchCollection = _regexToGetAllInputWords.Matches(input.ToLower()).OfType<Match>().Select(x => x.Value).ToArray();
    // Get all the words from the input matching the tracked keywords
    var matchingKeywords = uniqueKeywords.Intersect(inputWordsMatchCollection).ToArray();
    List<string> result = new List<string>();
    // For all the tracks check whether they match
    for (int i = 0; i < tracksKeywords.Length; ++i)
    {
        bool trackIsMatching = true;
        // For all the keywords of the track check whether they exist
        for (int j = 0; j < tracksKeywords[i].Length && trackIsMatching; ++j)
        {
            trackIsMatching = matchingKeywords.Contains(tracksKeywords[i][j]);
        }
        if (trackIsMatching)
        {
            result.Add(tracks[i]);
        }
    }
    return result;
}
Any help will be greatly appreciated.
The short answer is to parse every word and store it in a binary-tree-like collection. SortedList or SortedDictionary would be your friends here.
With very little code, you can add your words to a sorted collection and then do a binary search on it (List<T>.BinarySearch, or the equivalent SortedList.IndexOfKey). This is an O(log n) lookup, so you should be able to search through thousands or millions of words in a few iterations. With SortedList, the performance cost is on the inserts (since it sorts while inserting), but that is what makes the binary search possible.
I wouldn't bother with threading since you need results in less than 1ms.
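A minimal sketch of the short answer, reusing the question's names (input, uniqueKeywords) and assuming per-word matching:

// Tokenize the input once, sort the unique words, then binary-search
// that sorted list for each tracked keyword.
var inputWords = Regex.Matches(input.ToLower(), @"\#\w+|\w+")
    .Cast<Match>().Select(m => m.Value).Distinct().ToList();
inputWords.Sort(StringComparer.Ordinal);

var foundKeywords = uniqueKeywords
    .Where(k => inputWords.BinarySearch(k, StringComparer.Ordinal) >= 0)
    .ToList();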
The long answer is to look at something like Lucene, which can be especially helpful if you're doing an autocomplete-style search. RavenDB uses Lucene under the covers and can do background indexing for you; it will search through millions of records in a few milliseconds.
I would like to suggest using a hash table.
With hashing you can convert a string to an integer that represents the index of that string in the hash table.
It's much faster than a sequential search.
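A minimal sketch of that idea, reusing the question's names; .NET's HashSet<string> already gives O(1) average lookups, so there is no need to roll your own table:

// Hash the input's words once; each keyword lookup is then O(1) on
// average instead of a sequential scan.
var inputWords = new HashSet<string>(
    Regex.Matches(input.ToLower(), @"\#\w+|\w+").Cast<Match>().Select(m => m.Value));
var foundKeywords = uniqueKeywords.Where(inputWords.Contains).ToList();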
The ultimate solution is the Elastic binary tree data structure. It is used in HAProxy to match rules against URLs in the proxied HTTP requests (and for many other purposes as well).
ebtree is a data structure built from your 'keyword' patterns which allows faster matching than either a SortedList or hashing. It can be faster than hashing because hashing reads the input string once (or at least several characters of it) to generate the hash code, and then again to evaluate .Equals(); hashing therefore reads all characters of the input at least once. ebtree reads each character at most once and either finds the match or, if there is no match, reports that after O(log(n)) characters, where n is the number of patterns.
I'm not aware of an existing C# implementation of ebtree, but surely many people would be pleased if someone provided one.

Need help for code analysis Lucene.Net search results asp.net

I was looking for good code for searching an index using Lucene.Net. I found one that looks promising, but I have some confusion about it. If anyone familiar with Lucene.Net could have a look at the code and tell me why it was constructed this way, I'd appreciate it.
Here is the URL where I got the code:
http://www.codeproject.com/Articles/320219/Lucene-Net-ultra-fast-search-for-MVC-or-WebForms
Here is the code:
private static IEnumerable<SampleData> _search(string searchQuery, string searchField = "")
{
    // validation
    if (string.IsNullOrEmpty(searchQuery.Replace("*", "").Replace("?", "")))
        return new List<SampleData>();

    // set up lucene searcher
    using (var searcher = new IndexSearcher(_directory, false))
    {
        var hits_limit = 1000;
        var analyzer = new StandardAnalyzer(Version.LUCENE_29);

        // search by single field
        if (!string.IsNullOrEmpty(searchField))
        {
            var parser = new QueryParser(Version.LUCENE_29, searchField, analyzer);
            var query = parseQuery(searchQuery, parser);
            var hits = searcher.Search(query, hits_limit).ScoreDocs;
            var results = _mapLuceneSearchResultsToDataList(hits, searcher);
            analyzer.Close();
            searcher.Close();
            searcher.Dispose();
            return results;
        }
        // search by multiple fields (ordered by RELEVANCE)
        else
        {
            var parser = new MultiFieldQueryParser(Version.LUCENE_29, new[] { "Id", "Name", "Description" }, analyzer);
            var query = parseQuery(searchQuery, parser);
            var hits = searcher.Search(query, null, hits_limit, Sort.RELEVANCE).ScoreDocs;
            var results = _mapLuceneSearchResultsToDataList(hits, searcher);
            analyzer.Close();
            searcher.Close();
            searcher.Dispose();
            return results;
        }
    }
}
I have a couple of questions about the above routine:
1) Why does the developer of this code replace all * and ? characters with empty strings in the search term?
2) Why search once with QueryParser and again with MultiFieldQueryParser?
3) How does the developer detect whether the search term has one word or many words separated by spaces?
4) How can wildcard search be done using this code? Where should the code be changed to handle wildcards?
5) How do I handle searches for similar words, e.g. if anyone searches for helo then hello-related results should come back?
var hits = searcher.Search(query, 1000).ScoreDocs;
6) When my search returns 5000 records and I limit it to 1000, how can I show the other 4000 in a paginated fashion? What is the purpose of the limit? I assume speed, but if I specify a limit, how can I show the remaining results? What would be the logic?
I will be glad if someone discusses all my points. Thanks.
1) Why does the developer of this code replace all * and ? characters with empty strings in the search term?
Because those are special characters for wildcard searches. The author checks whether the search query contains something else along with the wildcards - you don't usually want to search for just "*", for example.
2) Why search once with QueryParser and again with MultiFieldQueryParser?
He doesn't search with the QueryParsers per se; he parses the search query (a string) into Query objects, which are then consumed by a Searcher object that performs the actual search.
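For instance (a sketch reusing the variables from the code above; the field name is illustrative):

// The parser turns the query string into a Query object tree (e.g. a
// BooleanQuery of TermQueries); the searcher consumes that object.
var parser = new QueryParser(Version.LUCENE_29, "Name", analyzer);
Query query = parser.Parse("hello world");
var hits = searcher.Search(query, 1000).ScoreDocs;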
3) How does the developer detect whether the search term has one word or many words separated by spaces?
That's something the Parser object takes care of, not the developer.
4) How can wildcard search be done using this code? Where should the code be changed to handle wildcards?
The wildcards are specified in the searchQuery parameter. Specifying "test*" will count as a wildcard search, for example. Details are here.
5) How do I handle searches for similar words, e.g. if anyone searches for helo then hello-related results should come back?
I think you want a fuzzy search.
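In query syntax that's the ~ operator ("helo~"); programmatically, a sketch might look like this (the field name is illustrative):

// FuzzyQuery matches terms within an edit distance, so "helo" can hit
// documents containing "hello".
var fuzzy = new FuzzyQuery(new Term("Name", "helo"));
var fuzzyHits = searcher.Search(fuzzy, 1000).ScoreDocs;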
6) When my search returns 5000 records and I limit it to 1000, how can I show the other 4000 in a paginated fashion? What is the purpose of the limit? I assume speed, but if I specify a limit, how can I show the remaining results? What would be the logic?
Here's an article about that.
UPD: About multiple fields, the logic is the following:
If searchField is specified, then a simple parser is used, producing a query like searchField: value1 searchField: value2 ... etc.
If, however, that parameter isn't there, it is assumed that the passed searchQuery itself specifies fields and values, like "field1: value1 field2: value2". Examples are on the same syntax page I mentioned previously.
UPD2: Don't hesitate to look at the Java documentation and examples for Lucene, as this was initially a Java project (hence there are a lot of Java examples and tutorials). Lucene.NET is a port, and both projects share a lot of functionality and classes.
UPD3: About fuzzy search: you might also want to implement your own analyzer for synonym search (we used that technique in a commercial project I worked on, to handle common typos along with synonyms).

Trouble searching for acronyms in Lucene.NET

I'm currently working on a Lucene.NET full-text search implementation. For the most part it's going quite well but I'm having a few issues revolving around acronyms in the data...
As an example of what's going on: if I had "N.A.S.A." in the field I indexed, I'm able to match it with n.a.s.a. or nasa, but n.a.s.a doesn't match it, not even if I use a fuzzy search (n.a.s.a~).
The first thought that comes to mind for me is to rip out all the .'s before indexing/searching, but it seems a bit more like a workaround than a solution and I was hoping to get a cleaner solution.
Can anyone suggest any changes or a different analyzer (using StandardAnalyzer currently) that may be more suited to matching this kind of data?
The StandardAnalyzer uses the StandardTokenizer, which tokenizes 'N.A.S.A.' as 'nasa' but won't do this to 'N.A.S.A'. That's why your original queries match: the input 'N.A.S.A.' is processed into 'nasa', and the input 'nasa' matches the already-tokenized value. This also explains why 'N.A.S.A' won't match anything, since the index only contains the token 'nasa'.
This can be seen when outputting the value from the token stream directly.
public static void Main(string[] args)
{
    var analyzer = new StandardAnalyzer(Version.LUCENE_30);
    var stream = analyzer.TokenStream("f", new StringReader("N.A.S.A. N.A.S.A"));
    var termAttr = stream.GetAttribute<ITermAttribute>();
    while (stream.IncrementToken())
    {
        Console.WriteLine(termAttr.Term);
    }
    Console.ReadLine();
}
Outputs:
nasa
n.a.s.a
You would probably need to write a custom analyzer to handle this scenario. One solution would be to keep the original token alongside the processed one, so a query like n.a* would work, but you would also need to build in better detection of acronyms.
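For illustration, here is a rough, untested sketch of such a filter using Lucene.Net 3.0-style attributes (the class name and details are mine, not a known library type): it passes every token through and, for tokens containing periods, also emits a dot-stripped variant at the same position, so both 'n.a.s.a' and 'nasa' end up in the index.

public sealed class AcronymFilter : TokenFilter
{
    private readonly ITermAttribute _termAttr;
    private readonly IPositionIncrementAttribute _posAttr;
    private string _pending;

    public AcronymFilter(TokenStream input) : base(input)
    {
        _termAttr = AddAttribute<ITermAttribute>();
        _posAttr = AddAttribute<IPositionIncrementAttribute>();
    }

    public override bool IncrementToken()
    {
        if (_pending != null)
        {
            // Emit the dot-stripped variant at the same position as the original.
            _termAttr.SetTermBuffer(_pending);
            _posAttr.PositionIncrement = 0;
            _pending = null;
            return true;
        }
        if (!input.IncrementToken())
            return false;

        var term = _termAttr.Term;
        var stripped = term.Replace(".", "");
        if (stripped.Length > 0 && stripped != term)
            _pending = stripped; // queue the acronym form for the next call
        return true;
    }
}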

C# Code/Algorithm to Search Text for Terms

We have 5MB of typical text (just plain words). We have 1000 words/phrases to use as terms to search for in this text.
What's the most efficient way to do this in .NET (ideally C#)?
Our ideas include regexes (a single one, or lots of them) and even the String.Contains stuff.
The input is a 2MB to 5MB text string - all text. Multiple hits are good: for each term (of the 1000) that matches, we want to know about it. Performance is measured as total time to execute; we don't care about memory footprint. The current algorithm takes 60+ seconds using naive String.Contains. We don't want 'cat' to match 'category' or even 'cats' (i.e. the entire term word must hit; no stemming).
We expect a <5% hit ratio in the text. The results would ideally just be the terms that matched (we don't need position or frequency just yet). We get a new 2-5MB string every 10 seconds, so we can't assume we can index the input. The 1000 terms are dynamic, although they have a change rate of about one change an hour.
A naive String.Contains with 762 words (the final page) of War and Peace (3.13MB) runs in about 10 s for me; switching to 1000 GUIDs runs in about 5.5 s.
Regex.IsMatch found the 762 words (many of which were probably in earlier pages as well) in about 0.5 seconds and ruled out the GUIDs in 2.5 seconds.
I'd suggest your problem lies elsewhere... or you just need some decent hardware.
Why reinvent the wheel? Why not just leverage something like Lucene.NET?
Have you considered the following:
Do you care about substrings? Say I am looking for the word "cat", nothing more and nothing less. Both the Knuth-Morris-Pratt algorithm and String.Contains will report a hit for "concatenate" (they return true, or an index). Is this OK?
You will also have to look into the idea of the stemmed or "finite" state of a word. Say we look for "diary" and the test sentence is "there are many kinds of diaries". To you and me the word "diaries" probably counts. If so, we need to preprocess the sentence, converting the words to their root forms (diaries -> diary), so the sentence becomes "there are many kind of diary". Now we can say that "diary" is in the sentence (see the Porter stemmer algorithm).
Also, when it comes to processing text (aka natural language processing), you can remove some words as noise: for example "a, have, you, I, me, some, to" could be considered useless words and removed before any processing takes place. For example, take "I have written some C# today": if I have 10,000 keywords to look for, I would have to scan the entire sentence 10,000 times the number of words in the sentence. Removing the noise beforehand shortens the processing time:
"written C# today" <- with the noise removed, there is much less to look through, as in the sketch below.
A great article on NLP can be found here: Sentence comparing.
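A quick sketch of the noise-word filtering mentioned above (the stop-word list is just the example given):

// Drop stop words before matching to shrink the search space.
var stopWords = new HashSet<string> { "a", "have", "you", "i", "me", "some", "to" };
string filtered = string.Join(" ",
    "I have written some C# today".ToLower().Split(' ')
        .Where(w => !stopWords.Contains(w)));
// filtered == "written c# today"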
HTH
Bones
A modified suffix tree would be very fast, though it would take up a lot of memory, and I don't know how fast it would be to build. After that, however, every search would take O(1) time with respect to the text length.
Here's another idea: make a class something like this:
class Word
{
    public string Text;          // the word itself
    public List<int> Positions;  // word positions at which it occurs
}
For every unique word in your text you create an instance of this class. The Positions list stores the positions (counted in words, not characters) from the start of the text at which this word was found.
Then make another two lists that will serve as indexes: one stores these objects sorted by their text, the other by their positions in the text. In essence, the text index would probably be a SortedDictionary, while the position index would be a simple List<Word>.
Then, to search for a phrase, you split the phrase into words, look up the first word in the dictionary (that's O(log n)), and from there you know which words can possibly follow it in the text (you have them from the Positions list). Look at those words (using the position index to find them in O(1)) and go on until you've found one or more full matches, as in the sketch below.
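A compact sketch of that scheme (text is an assumed input; it uses a plain Dictionary for the text index and naive whitespace splitting, collapsing the two sorted indexes described above into one map for brevity):

class PhraseIndex
{
    private readonly string[] _words;
    private readonly Dictionary<string, List<int>> _positions = new Dictionary<string, List<int>>();

    public PhraseIndex(string text)
    {
        // Record every word position at which each word occurs.
        _words = text.ToLower().Split(' ');
        for (int i = 0; i < _words.Length; i++)
        {
            List<int> list;
            if (!_positions.TryGetValue(_words[i], out list))
                _positions[_words[i]] = list = new List<int>();
            list.Add(i);
        }
    }

    public bool Contains(string phrase)
    {
        var parts = phrase.ToLower().Split(' ');
        List<int> starts;
        if (!_positions.TryGetValue(parts[0], out starts))
            return false;
        // For each occurrence of the first word, check the words that follow.
        foreach (int s in starts)
        {
            bool match = true;
            for (int k = 1; k < parts.Length && match; k++)
                match = s + k < _words.Length && _words[s + k] == parts[k];
            if (match)
                return true;
        }
        return false;
    }
}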
Are you trying to produce a list of matched words, or are you trying to highlight them in the text by getting the start and length of each match position? If all you're trying to do is find out whether the words exist, you could use subset theory to perform this fairly efficiently.
However, I expect you're trying to get each match's start position in the text... in which case this approach wouldn't work.
The most efficient approach I can think of is to dynamically build a match pattern from a list and then use regex. It's far easier to maintain a list of 1000 items than a regex pattern based on those same 1000 items.
It is my understanding that Regex uses the same KMP algorithm suggested elsewhere to efficiently process large amounts of data - so unless you really need to dig through and understand the minutiae of how it works (which might be beneficial for personal growth), then perhaps regex would be OK.
There's quite an interesting paper on search algorithms for many patterns in large files here: http://webglimpse.net/pubs/TR94-17.pdf
Is this a bottleneck? How long does it take? 5 MiB isn't actually a lot of data to search. Regular expressions might do just fine, especially if you encode all the search strings into one pattern using alternations. This basically amortizes the overall cost of the search to O(n + m), where n is the length of your text and m is the combined length of all patterns. Notice that this is very good performance.
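A sketch of that single-pattern approach (terms and bigText are assumed names; Regex.Escape guards terms that contain metacharacters). Whether it actually beats per-term regexes is worth measuring - the asker reports the opposite finding below.

// One pattern with every term as a word-bounded alternative.
string pattern = @"\b(?:" + string.Join("|", terms.Select(Regex.Escape)) + @")\b";
var all = new Regex(pattern, RegexOptions.Compiled | RegexOptions.IgnoreCase);
var found = all.Matches(bigText).Cast<Match>()
    .Select(m => m.Value)
    .Distinct(StringComparer.OrdinalIgnoreCase)
    .ToList();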
An alternative that's well suited to many patterns is the Wu-Manber algorithm. I've already posted a very simplistic C++ implementation of it.
OK, the current rework shows this as fastest (pseudocode):
foreach (var term in allTerms)
{
    string pattern = term.ToWord(); // wraps the term in \b word-boundary anchors
    Regex regex = new Regex(pattern, RegexOptions.IgnoreCase);
    if (regex.IsMatch(bigTextToSearchForTerms))
    {
        result.Add(term);
    }
}
What was surprising (to me at least!) is that running the regex 1000 times was faster than a single regex with 1000 alternatives, i.e. @"\bterm1\b|\bterm2\b|...|\btermN\b", and then using regex.Matches.Count.
How does this perform in comparison? It uses LINQ, so it may be a little slower, not sure...
List<string> allTerms = new List<string> { "string1", "string2", "string3", "string4" };
List<string> matches = allTerms
    .Where(item => Regex.IsMatch(bigTextToSearchForTerms, item, RegexOptions.IgnoreCase))
    .ToList();
This uses a classic predicate to implement a find-all, so it should be quicker than LINQ:
static bool Match(string checkItem)
{
    return Regex.IsMatch(bigTextToSearchForTerms, checkItem, RegexOptions.IgnoreCase);
}

static void Main(string[] args)
{
    List<string> allTerms = new List<string> { "string1", "string2", "string3", "string4" };
    List<string> matches = allTerms.FindAll(Match); // FindAll returns every match, Find only the first
}
Or this uses lambda syntax to implement the classic predicate, which again should be faster than LINQ but is more readable than the previous syntax:
List<string> allTerms = new List<string> { "string1", "string2", "string3", "string4" };
List<string> matches = allTerms.FindAll(checkItem =>
    Regex.IsMatch(bigTextToSearchForTerms, checkItem, RegexOptions.IgnoreCase));
I haven't tested any of them for performance, but they all implement your idea of iteration through the search list using the regex. It's just different methods of implementing it.

Best way to replace tokens in a large text template

I have a large text template which needs tokenized sections replaced by other text. The tokens look something like this: ##USERNAME##. My first instinct is just to use String.Replace(), but is there a better, more efficient way or is Replace() already optimized for this?
System.Text.RegularExpressions.Regex.Replace() is what you seek - IF your tokens are odd enough that you need a regex to find them.
Some kind soul did some performance testing, and between Regex.Replace(), String.Replace(), and StringBuilder.Replace(), String.Replace() actually came out on top.
The only situation in which I've had to do this is sending a templated e-mail. In .NET this is provided out of the box by the MailDefinition class. So this is how you create a templated message:
MailDefinition md = new MailDefinition();
md.BodyFileName = pathToTemplate;
md.From = "test@somedomain.com";
ListDictionary replacements = new ListDictionary();
replacements.Add("<%To%>", someValue);
// continue adding replacements
MailMessage msg = md.CreateMailMessage("test@someotherdomain.com", replacements, this);
After this, msg.Body would be created by substituting the values in the template. I guess you can take a look at MailDefinition.CreateMailMessage() with Reflector :). Sorry for being a little off-topic, but if this is your scenario I think it's the easiest way.
Well, depending on how many variables you have in your template, how many templates you have, etc. this might be a work for a full template processor. The only one I've ever used for .NET is NVelocity, but I'm sure there must be scores of others out there, most of them linked to some web framework or another.
string.Replace is fine. I'd prefer using a Regex, but I'm *** for regular expressions.
The thing to keep in mind is how big these templates are. If they're really big, and memory is an issue, you might want to create a custom tokenizer that acts on a stream; that way you only hold a small part of the file in memory while you manipulate it.
But for the naive implementation, string.Replace should be fine.
If you are doing multiple replaces on large strings, it might be better to use StringBuilder.Replace(), as the usual performance issues with immutable strings will appear.
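For example, a minimal sketch (template and replacements are assumed names):

// Chain replacements on one StringBuilder rather than allocating a new
// string for every token.
var sb = new StringBuilder(template);
foreach (var pair in replacements) // e.g. a Dictionary<string, string>
    sb.Replace(pair.Key, pair.Value);
string result = sb.ToString();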
Regular expressions would be the quickest solution to code up, but if you have many different tokens it will get slower. If performance is not an issue, use this option.
A better approach would be to define a single marker, like your "##", that you can scan for in the text, and then select what to replace from a hash table keyed by the text that follows the marker.
If this is part of a build script, then nAnt has a great feature for doing this called Filter Chains. The code for that is open source, so you could look at how it's done for a fast implementation.
Had to do something similar recently. What I did was:
make a method that takes a dictionary (key = token name, value = the text to insert);
get all matches of your token format (##.+?## in your case, I guess; not that good at regular expressions :P) using Regex.Matches(input, pattern);
foreach over the results, using the dictionary to find the insert value for each token;
return the result.
Done ;-)
If you want to test your regexes, I can suggest The Regulator.
FastReplacer implements token replacement in O(n*log(n) + m) time and uses 3x the memory of the original string.
FastReplacer is good for executing many Replace operations on a large string when performance is important.
The main idea is to avoid modifying existing text or allocating new memory every time a string is replaced.
We have designed FastReplacer to help us on a project where we had to generate a large text with a large number of append and replace operations. The first version of the application took 20 seconds to generate the text using StringBuilder. The second improved version that used the String class took 10 seconds. Then we implemented FastReplacer and the duration dropped to 0.1 seconds.
If your template is large and you have lots of tokens, you probably don't want to walk it and replace the tokens one by one, as that would be an O(N * M) operation, where N is the size of the template and M is the number of tokens to replace.
The following method accepts a template and a dictionary of the key/value pairs you wish to replace. By initializing the StringBuilder to slightly larger than the size of the template, it should result in an O(N) operation (i.e. it shouldn't have to grow itself log N times).
Finally, you can move the building of the tokens into a Singleton as it only needs to be generated once.
static string SimpleTemplate(string template, Dictionary<string, string> replacements)
{
    // parse the message into an array of tokens
    Regex regex = new Regex("(##[^#]+##)");
    string[] tokens = regex.Split(template);

    // build the new message from the tokens
    var sb = new StringBuilder((int)((double)template.Length * 1.1));
    foreach (string token in tokens)
        sb.Append(replacements.ContainsKey(token) ? replacements[token] : token);

    return sb.ToString();
}
This is an ideal use of regular expressions. Check out this helpful website, the .NET Regular Expressions class, and the very helpful book Mastering Regular Expressions.
