I'm using Contains() to check whether a string contains a word, and the method is returning true even though the exact word I'm searching for isn't in the text.
Text:
“Under the guise of Medicare for All and a Green New Deal, Democrats are embracing the same tired economic theories that have impoverished nations and stifled the liberties of millions over the past century,” Pence said to applause. “That system is socialism.
“And the only thing green about the so-called Green New Deal is how much green it’s going to cost taxpayers if we do it: $90 million,” he said. Democrats have said the price tag would be lower than the figure Pence quoted.
His comments to the Conservative Political Action Conference outside Washington continued a White House and Republican National Committee push to paint the opposition party as hellbent on making America’s economy one that is centrally planned from Washington and intent on taking money out of Americans’ pockets to finance a myriad social programs."
Searching word: "nation"
Do you know another way to do that search?
Your search is returning true because the text contains "nations", which includes the string "nation".
If you want to search for the word "nation" and not include similar words like "nations", the easiest way is probably using regex and the \b metacharacter, which matches word boundaries.
bool found = Regex.IsMatch(text, @"\bnation\b");
If you want to generalize this, you can write:
string search = "nation";
bool found = Regex.IsMatch(text, $@"\b{Regex.Escape(search)}\b");
As @Flydog57 helpfully pointed out in the comments, you can also do a case-insensitive search if that's what you're after:
string search = "nation";
bool found = Regex.IsMatch(text, $@"\b{Regex.Escape(search)}\b", RegexOptions.IgnoreCase);
Regex has its drawbacks: you need a fairly deep understanding of how its engine works, and the potential for subtle mistakes or performance problems is real.
What I usually do instead is break the text into small chunks and work with those.
Feel free to add more separators to the Split() call. Enjoy:
static bool findWord()
{
var text = @"“Under the guise of Medicare for All and a Green New Deal, Democrats are embracing the same tired economic theories that have impoverished nations and stifled the liberties of millions over the past century,” Pence said to applause. “That system is socialism.
“And the only thing green about the so-called Green New Deal is how much green it’s going to cost taxpayers if we do it: $90 million,” he said. Democrats have said the price tag would be lower than the figure Pence quoted.
His comments to the Conservative Political Action Conference outside Washington continued a White House and Republican National Committee push to paint the opposition party as hellbent on making America’s economy one that is centrally planned from Washington and intent on taking money out of Americans’ pockets to finance a myriad social programs.";
var stringList = text.Split(' ', ',', ':', '.', '?', '“', '-'); // split the text into pieces and make a list
foreach (var word in stringList) // go through all items of that list
{
if (word == "nation") return true;
}
return false;
}
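For what it's worth, here is a slightly more general sketch of the same idea; the helper name, the separator list and the case-insensitive comparison are my own choices, not part of the original suggestion:
using System;
using System.Linq;

static bool ContainsWholeWord(string text, string word)
{
    // Same split-into-chunks idea, but parameterized and case-insensitive.
    var separators = new[] { ' ', ',', ':', '.', '?', '!', ';', '“', '”', '-', '\r', '\n' };
    return text.Split(separators, StringSplitOptions.RemoveEmptyEntries)
               .Any(w => string.Equals(w, word, StringComparison.OrdinalIgnoreCase));
}
With the text above, ContainsWholeWord(text, "nation") returns false, because only the separate chunk "nations" appears.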
Related
I want to get a regex that will split text into sentences, leaving in the punctuation (breaking on the space after the punctuation, but not breaking on titles).
I'm almost there.
#"(?<=[\.!;\?])\s+"
splits on the space, but also splits on the title.
#"(?<!Mr|Mrs|Dr|Ms|St|a|p|m|K)\.|;"
won't split on titles but wipes out the punctuation.
Any suggestions on combining the two expressions so that the regex will split on space after the punctuation but not split on titles?
Example Text:
Shirking and sharking in all their many varieties have been sown broadcast by the
ill-fated cause; and even those who have contemplated its history
from the outermost circle of such evil have been insensibly tempted
into a loose way of letting bad things alone to take their own bad
course, and a loose belief that if the world go wrong it was in some
off-hand manner never meant to go right.
Thus, in the midst of the mud and at the heart of the fog, sits the
Lord High Chancellor in his High Court of Chancery.
"Mr. Tangle," says the Lord High Chancellor, latterly something
restless under the eloquence of that learned gentleman.
"Mlud," says Mr. Tangle. Mr. Tangle knows more of Jarndyce and
Jarndyce than anybody. He is famous for it--supposed never to have
read anything else since he left school.
"Have you nearly concluded your argument?"
"Mlud, no--variety of points--feel it my duty tsubmit--ludship," is
the reply that slides out of Mr. Tangle.
"Several members of the bar are still to be heard, I believe?" says
the Chancellor with a slight smile.
This effectively combines what you're looking for:
#"(?<!(?:Mr|Mr.|Dr|Ms|St|a|p|m|K)\.)(?<=[.!;\?])\s+"
However, I don't think it's reliable. What if a sentence ended with something like "abracadabra."?
Okay, this works:
(?<=[\.!;\?])(?<!Mr\.|Mrs\.|Dr\.|Ms\.|St\.)\s+
There is a list of banned words (or strings, to be more general) and another list with, let's say, users' mails. I would like to excise all banned words from all mails.
Trivial example:
foreach (string word in wordsList)
{
    for (int i = 0; i < mailList.Count; i++)
    {
        mailList[i] = mailList[i].Replace(word, String.Empty);
    }
}
How I can improve this algorithm?
Thanks for the advice. I voted a few answers up but didn't mark any as the answer, since this was more of a discussion than a single solution. Some people mixed up banned words with bad words; in my case I don't have to bother with recognizing 'sh1t' or anything like that.
Simple approaches to profanity filtering won't work - complex approaches don't work, for the most part, either.
What happens when you get a word like 'password' and you want to filter out 'ass'? What happens when some clever person writes 'a$$' instead - the intent is still clear, right?
See How do you implement a good profanity filter? for extensive discussion.
You could use RegEx to make things a little cleaner:
var bannedWords = @"\b(this|is|the|list|of|banned|words)\b";
foreach (var mail in mailList)
var clean = Regex.Replace(mail, bannedWords, "", RegexOptions.IgnoreCase);
Even that, though, is far from perfect since people will always figure out a way around any type of filter.
You'll get best performance by drawing up a finite state machine (FSM) (or generate one) and then parsing your input 1 character at a time and walking through the states.
You can do this pretty easily with a function that takes the next input char and the current state and returns the next state; you build the output as you walk through the mail message's characters. You can draw the FSM out on paper first.
Alternatively you could look into the Windows Workflow Foundation: State Machine Workflows.
In that way you only need to walk each message a single time.
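For illustration, here is a minimal sketch of that idea using a trie of banned words as the state machine; the class and method names are mine (not any particular library's API), and a real filter would need more than this:
using System;
using System.Collections.Generic;
using System.Text;

// Each trie node is a "state"; each input character moves us to the next state.
// Reaching a terminal node at a word boundary means a banned word was matched.
class TrieNode
{
    public Dictionary<char, TrieNode> Next = new Dictionary<char, TrieNode>();
    public bool IsWordEnd;
}

static class BannedWordScanner
{
    public static TrieNode Build(IEnumerable<string> bannedWords)
    {
        var root = new TrieNode();
        foreach (var word in bannedWords)
        {
            var node = root;
            foreach (var c in word.ToLowerInvariant())
            {
                if (!node.Next.TryGetValue(c, out var child))
                    node.Next[c] = child = new TrieNode();
                node = child;
            }
            node.IsWordEnd = true;
        }
        return root;
    }

    // Walks the message exactly once, replacing complete banned words with '*'.
    public static string Censor(string message, TrieNode root)
    {
        var output = new StringBuilder(message.Length);
        int i = 0;
        while (i < message.Length)
        {
            if (!char.IsLetter(message[i])) { output.Append(message[i]); i++; continue; }

            // Try to match a banned word starting at this word boundary.
            var state = root;
            int j = i, matchEnd = -1;
            while (j < message.Length && state.Next.TryGetValue(char.ToLowerInvariant(message[j]), out state))
            {
                j++;
                if (state.IsWordEnd && (j == message.Length || !char.IsLetter(message[j])))
                    matchEnd = j; // only accept matches ending on a word boundary
            }

            if (matchEnd > i)
            {
                output.Append('*', matchEnd - i);
                i = matchEnd;
            }
            else
            {
                // No banned word here: copy the whole word unchanged.
                while (i < message.Length && char.IsLetter(message[i]))
                    output.Append(message[i++]);
            }
        }
        return output.ToString();
    }
}
For example, Censor("my password is bad", Build(new[] { "bad", "ass" })) yields "my password is ***" and leaves "password" alone.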
Constructing a regular expression from the words (word1|word2|word3|...) and using this instead of the outer loop might be faster, since then, every e-mail only needs to be parsed once. In addition, using regular expressions would enable you to remove only "complete words" by using the word boundary markers (\b(word1|word2|word3|...)\b).
In general, I don't think you will find a solution which is orders of magnitude faster than your current one: You will have to loop through all mails and you will have to search for all the words, there's no easy way around that.
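Something along those lines might look like this sketch (wordsList and mailList are the lists from the question; nothing here beyond the Regex class itself is a specific API you must use):
using System.Linq;
using System.Text.RegularExpressions;

// Build one pattern from all banned words so each mail is scanned only once.
string pattern = @"\b(?:" + string.Join("|", wordsList.Select(Regex.Escape)) + @")\b";
var banned = new Regex(pattern, RegexOptions.IgnoreCase | RegexOptions.Compiled);

for (int i = 0; i < mailList.Count; i++)
{
    mailList[i] = banned.Replace(mailList[i], string.Empty);
}
Regex.Escape keeps any special characters in the banned strings from being treated as regex syntax.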
A general algorithm would be to:
Generate a list of tokens based on the input string (ie. by treating whitespace as token separators)
Compare each token against a list of banned words
Replace matched tokens
A regular expression is convenient for identifying tokens, and a HashSet would provide quick lookups for your list of banned words. There is an overloaded Replace method on the Regex class that takes a function, where you could control the replace behavior based on your lookup.
HashSet<string> BannedWords = new HashSet<string>(StringComparer.InvariantCultureIgnoreCase)
{
"bad",
};
string Input = "this is some bad text.";
string Output = Regex.Replace(Input, @"\b\w+\b", (Match m) => BannedWords.Contains(m.Value) ? new string('x', m.Value.Length) : m.Value);
Replacing it with * is annoying, but less annoying than something that removes the context of your intention by removing the word and leaving a malformed sentence. In discussing the Battle of Hastings, I'd be irritated if I saw William given the title "Grand ******* of Normandy", but at least I'd know I was playing in the small-kids playground, while his having the title of "Grand of Normandy" just looks like a mistake, or (worse) I might think that was actually his title.
Don't try replacing words with more innocuous words unless it's funny. People get the joke on 4chan, but yahoo groups about history had confused people because the medireview and mediareview periods were being discussed when eval (not profanity, but used in some XSS attacks that yahoo had been hit by) was replaced with review in medieval and mediaeval (apparently, medireview is the American spelling of mediareview!).
In some circumstances it's possible to improve on it.
Just for fun:
You can use a SortedList. If your mailing list items are delimited (for example by ";"), you can do the following.
First, work out the running time of your current algorithm:
Words: n items (each item has O(1) length).
Mailing list: K items.
Each mailing list item has an average length of Z.
Each sub-item in a mailing list item has an average length of Y, so the average number of sub-items per mailing list item is m = Z/Y.
Your algorithm takes O(n*K*Z), even using the best string search (e.g. the Knuth-Morris-Pratt algorithm).
1. Now sort the words list: O(n log n).
2.1. Use mailingListItem.Split(";".ToCharArray()) on each mailing list item: O(Z).
2.2. Sort the items in each mailing list item: O(m * log m).
Total sorting takes O(K * Z) in the worst case, given that m log m << Z.
3. Use a merge algorithm to match the sorted bad words against each sorted mailing list item: O((m + n) * K).
Total time is O((m+n)*K + m*Z + n^2); given that m << n, the total running time is O(n^2 + Z*K) in the worst case, which is smaller than O(n*K*Z) if n < K * Z (I think so).
So if performance is very, very, very important, you can do this.
You might consider using Regex instead of simple string matches, to avoid replacing partial content within words. A Regex would allow you to assure you are only getting full words that match. You could use a pattern like this:
"\bBADWORD\b"
Also, you may want to iterate over the mailList on the outside, and the word list on the inner loop.
Wouldn't it be easier (and more efficient) to simply redact them by changing all their characters to * or something? That way no large string needs to be resized or moved around, and the recipients are made more aware of what happened, rather than getting nonsensical sentences with missing words.
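As a rough sketch of that (assuming a combined bannedWords pattern like the one in the regex answer above, and mail being one message):
using System.Text.RegularExpressions;

// Overwrite each matched word with '*' in a char buffer; the string length,
// and therefore the surrounding sentence layout, never changes.
var buffer = mail.ToCharArray();
foreach (Match m in Regex.Matches(mail, bannedWords, RegexOptions.IgnoreCase))
{
    for (int i = m.Index; i < m.Index + m.Length; i++)
        buffer[i] = '*';
}
string redacted = new string(buffer);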
Well, you certainly don't want to make the clbuttic mistake of a naive string.Replace() to do it. The regex solution could work, although you'd either be iterating or using the pipe alternator (and I don't know if/how much that would slow your operation down, particularly for a large list of banned words). You could always just...not do it, since it's entirely futile no matter what--there are ways to make your intended words quite clear even without using the exact letters.
That, and it's ridiculous to have a list of words that "people find offensive" in the first place. There's someone who will be offended by pretty much any word.
/censorship is bullshit rant
I assume that you want to detect only complete words (separated by non-letter characters) and ignore words that merely contain a filter word as a substring (like the p[ass]word example above). In that case you should build yourself a HashSet of filter words, scan the text for words, and check each word for existence in the HashSet. If it's a filter word, then build the resulting StringBuilder output without it (or with an equal number of asterisks).
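A rough sketch of that approach (the method name and the letter-only word detection are my own simplifications, not a complete tokenizer):
using System.Collections.Generic;
using System.Text;

static string Filter(string text, HashSet<string> filterWords)
{
    var result = new StringBuilder(text.Length);
    int i = 0;
    while (i < text.Length)
    {
        if (!char.IsLetter(text[i])) { result.Append(text[i++]); continue; }
        int start = i;
        while (i < text.Length && char.IsLetter(text[i])) i++; // scan one whole word
        string word = text.Substring(start, i - start);
        // Only complete words are checked, so the 'ass' inside 'password' is left alone.
        result.Append(filterWords.Contains(word) ? new string('*', word.Length) : word);
    }
    return result.ToString();
}
Create the HashSet with StringComparer.OrdinalIgnoreCase if the check should be case-insensitive.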
I had great results using this algorithm from codeproject.com; it was better than brute-force text replacements.
I looked through the related questions, there were quite a few but I don't think any answered this question. I am very new to Regex but I'm trying to get better so bear with me please. I am trying to match several groups in a string, but in any order. Is this something I should be using Regex for? If so, how? If it matters, I plan to use these in IronPython.
EDIT: Someone asked me to be more specific, so here:
I want to use re.match with a regex like:
\[image\s*(?(#alt:(?<alt>.*?);).*(#title:(?<title>.*?);))*.*\](?<arg>.*?)\[\/image\]
But it will only match the named groups when they are in the right order, and separated with a space. I would like to be able to match the named groups in any order, as long as they appear where they do now in the regex.
A typical string that will be applied to this might look like:
[image #alt:alien; #title:reddit alien;]http://www.reddit.com/alien.png[/image]
But I should have no problem matching:
[image #title:reddit alien; #alt:alien;]http://www.reddit.com/alien.png[/image]
So the 'attributes' (things that come between '#' and ';' in the first 'tag') should be matched in any order, as long as they both appear.
The answer to the question in your title is "no" -- to match N groups "in any order", the regex should have an "or" (the | feature in the regex pattern) among the N! (N factorial) possible permutations of the groups, the product of all integers from 1 to N. That's a number which grows extremely fast -- for N just equal 6, it's already 720, for 7, it's almost 5000, and so on at a dizzying pace -- so this approach is totally impractical for any N which isn't really tiny.
The solutions may be many, depending on what you want the groups to be separated with. Let's say, for example, that you don't care (if you DO care, edit your question with better specs).
In this case, if overlapping matches are impossible or are OK with you, make N separate regular expressions, one per group -- say these N compiled RE objects are in a list named grps, then
mos = [g.search(thestring) for g in grps]
is the list of match objects for the groups (None for a group which doesn't match). With the mos list you can do all sorts of checks and/or further manipulations, for example all(mos) is True if and only if all the groups matched, in which case [m.group() for m in mos] is the list of substrings that have been matched, and so on, and so forth.
If you need non-overlapping matches, it's a bit more complicated -- you may extract the boundaries of all possible matches for each group, then see whether there's a way to pick from these N lists a set of N intervals, one per list, so that no two of them intersect. This is a somewhat subtle algorithm (if you want reasonable speed for a large N, of course), so I think it's worth a separate question, and in any case it's not worth discussing right here when the very issue of whether it's needed or not depends on so incredibly many factors that you have not specified.
So, please edit your question with more precise specifications, first, and then things can perhaps be clarified to provide you with the code and/or algorithms you need.
Edit: I see the OP has now clarified the issue at least of the extent of providing an example -- although, confusingly, he offers a RE pattern example and a string example that should not match, regardless of ordering (the RE specifies the presence of a substring #title which the example string does not have -- puzzling!).
Anyway, if the number of groups in the example (two which appear to be interchangeable, one which appears to have to occur in a specific spot) is representative of the OP's actual problems, then the total number of permutations of interest is just two, so joining the "just two" permutations with a vertical bar | would of course be quite feasible. Is that the case in the OP's real problems, though...?
Edit: if the number of permutations of interest is tiny, here's an example of one way to avoid the problem of repeated group names in the pattern (syntax requires Python 2.7 or better, but that's just for the final "dict comprehension" -- the same functionality is available in many previous versions of Python, just with the less elegant dict(('a', ... syntax;-)...:
>>> r = re.compile(r'(?P<a1>a.*?a).*?(?P<b1>b.*?b)|(?P<b2>b.*?b).*?(?P<a2>a.*?a)')
>>> m = r.search('zzzakkkavvvbxxxbnnn')
>>> g = m.groupdict()
>>> d = {'a':(g.get('a1') or g.get('a2')), 'b':(g.get('b1') or g.get('b2'))}
>>> d
{'a': 'akkka', 'b': 'bxxxb'}
This is very similar to one of the key problems with using regular expressions to parse HTML - there is no requirement that attributes always be specified in the same order, and many tags have surprising attributes (like <br clear="all">). So it seems you are working with a very similar markup syntax.
Pyparsing addresses this problem in an indirect way - instead of trying to parse all different permutations, parse the general "#attrname:attribute value;" syntax, and keep track of the attributes keys and values in an attribute mapping data structure. The mapping makes it easy to get the "title" attribute, regardless of whether it came first or last in the image tag. This behavior is built into the pyparsing API methods, makeHTMLTags and makeXMLTags.
Of course, this markup is not XML, but a similar approach gives some pretty easy to work with results:
text = """[image #alt:alien; #title:reddit alien;]http://www.reddit.com/alien1.png[/image]
But I should have no problem matching:
[image #title:reddit alien; #alt:alien;]http://www.reddit.com/alien2.png[/image]
"""
from pyparsing import Suppress, Group, Word, alphas, SkipTo, Dict, ZeroOrMore
LBRACK,RBRACK,COLON,SEMI,AT = map(Suppress,"[]:;#")
tagAttribute = Group(AT + Word(alphas) + COLON + SkipTo(SEMI) + SEMI)
imageTag = LBRACK + "image" + Dict(ZeroOrMore(tagAttribute)) + RBRACK
imageLink = imageTag + SkipTo("[/image]")("text")
for taginfo in imageLink.searchString(text):
    print taginfo.alt
    print taginfo.title
    print taginfo.text
    print
Prints:
alien
reddit alien
http://www.reddit.com/alien1.png
alien
reddit alien
http://www.reddit.com/alien2.png
I have an address class that uses a regular expression to parse the house number, street name, and street type from the first line of an address. This code is generally working well, but I'm posting here to share with the community and to see if anyone has suggestions for improvement.
Note: The STREETTYPES and QUADRANT constants contain all of the relevant street types and quadrants respectively.
I've included a subset here:
private const string STREETTYPES = @"ALLEY|ALY|ANNEX|AX|ARCADE|ARC|AVENUE|AV|AVE|BAYOU|BYU|BEACH|...";
private const string QUADRANTS = "N|NORTH|S|SOUTH|E|EAST|W|WEST|NE|NORTHEAST|NW|NORTHWEST|SE|SOUTHEAST|SW|SOUTHWEST";
HouseNumber, Quadrant, StreetName, and StreetType are all properties on the class.
private void Parse(string line1)
{
HouseNumber = string.Empty;
Quadrant = string.Empty;
StreetName = string.Empty;
StreetType = string.Empty;
if (!String.IsNullOrEmpty(line1))
{
string noPeriodsLine1 = String.Copy(line1);
noPeriodsLine1 = noPeriodsLine1.Replace(".", "");
string addressParseRegEx =
#"(?ix)
^
\s*
(?:
(?<housenumber>\d+)
(?:(?:\s+|-)(?<quadrant>" +
QUADRANTS +
#"))?
(?:(?:\s+|-)(?<streetname>\S+(?:\s+\S+)*?))??
(?:(?:\s+|-)(?<quadrant>" +
QUADRANTS + #"))?
(?:(?:\s+|-)(?<streettype>" + STREETTYPES +
#"))?
(?:(?:\s+|-)(?<streettypequalifier>(?!(?:" +
QUADRANTS +
#"))(?:\d+|\S+)))?
(?:(?:\s+|-)(?<streettypequadrant>(" +
QUADRANTS + #")))??
(?:(?:\s+|-)(?<suffix>(?:ste|suite|po\sbox|apt)\s*\S*))?
|
(?:(?:po|postoffice|post\s+office)\s+box\s+(?<postofficebox>\S+))
)
\s*
$
";
Match match = Regex.Match(noPeriodsLine1, addressParseRegEx);
if (match.Success)
{
HouseNumber = match.Groups["housenumber"].Value;
Quadrant = (string.IsNullOrEmpty(match.Groups["quadrant"].Value)) ? match.Groups["streettypequadrant"].Value : match.Groups["quadrant"].Value;
if (match.Groups["streetname"].Captures.Count > 1)
{
foreach (Capture capture in match.Groups["streetname"].Captures)
{
StreetName += capture.Value + " ";
}
StreetName = StreetName.Trim();
}
else
{
StreetName = (string.IsNullOrEmpty(match.Groups["streetname"].Value)) ? match.Groups["streettypequalifier"].Value : match.Groups["streetname"].Value;
}
StreetType = match.Groups["streettype"].Value;
//if the matched street type is found
//use the abbreviated version...especially for credit bureau calls
string streetTypeAbbreviation;
if (StreetTypes.TryGetValue(StreetType.ToUpper(), out streetTypeAbbreviation))
{
StreetType = streetTypeAbbreviation;
}
}
}
}
Have fun with addresses and regexes; you're in for a long, horrible ride.
You're trying to lay order upon chaos.
For every "123 Simple Way", there's a "14 1/2 South".
Then, for extra laughs, there's Salt Lake City: "855 South 1300 East".
Have fun with that.
There are more exceptions than rules when it comes to street addresses.
I don't know what country you're in, but if you're in the USA and want to spend some money on address validation, you can buy related USPS products here. And here is a good place to find free word lists from the USPS for expected words and abbreviations. I'm sure similar pages are available for other countries.
I think you should clarify your usage scenario.
Unless you're in a very, very limited scenario where you know that the addresses were entered following a strict schema, parsing addresses for content is an extremely hard problem to solve and, usually, quite futile (unless it's the raison d'être of your application).
If you're limited to a particular country that has very specific conventions for writing addresses, then using these regex might get you 90% of the way.
However, as soon as you have to start accepting foreign addresses, you're screwed.
Even if you're a US-centric site, there is a good chance that you may have to accept addresses from US citizens living abroad, for instance.
Again, it may be OK in a very narrow field, but it's almost always a bad idea to validate or split addresses that were not strictly validated and constrained at the time the user entered them.
When you do enforce some strict rules for users to enter their addresses, these end up being inadequate in a small portion of cases, even in the best address validation components out there.
Just a few things that mess up address parsing:
postal codes (Zip codes) are sometimes placed before the city, sometimes after, and may not exist at all.
postal codes follow strict rules: a 10-digit Zip code is probably easy to spot as invalid, but what about one that simply doesn't exist? And what about other formats, such as those used in the UK, for instance?
What about a place like Hong Kong where you could write the address in either English, Traditional Chinese or Mandarin?
What if it's perfectly fine to split your address and write it out of sequence?
even if you're just parsing US addresses, there are at least a handful of ways to describe a PO box: you can also use poste restante or general delivery, and then you need to add a 4-digit code to the Zip code, which would normally probably not be present at all...
Bottom line is
If getting addresses in a parseable format is really important, be 100% sure that you can get all possible combinations right, or you're going to have a percentage of failures that will mean frustrated users and lost sales.
If you don't have 100% case coverage then don't enforce strict rules on the user.
I can't count the number of websites I gave up purchasing from because they would require a Zip/Postal Code when the place I live in has none.
Sorry for the rant, but I think it's important that people wanting to do address validation and parsing think hard about what they're getting themselves into.
This actually works pretty well, except that it doesn't pull apartment numbers. We're working on that. It also coughed a little when we had an address of 769 Branch Ave; of course, "branch" is one of the street types it's looking for. It all goes back to that making-order-out-of-chaos thing. We know that it's going to break here and there.
If someone runs into this problem in 2013/2014 :)
You can use the Google Geocoding API. It provides more functionality than just a regex: you can even get the lat/long for an address. And it's free.
For an address example-
http://maps.googleapis.com/maps/api/geocode/xml?address=2520%20Cohasset%20Rd%20-%20Chico%2C%20CA%2095973-1307%20530-893-1300%20%20&sensor=false
I tried to get this to work, but it seems as though you have a static member of a StreetTypes class that is not included. It seems to work except for that, but I cannot do much testing without it.
I'll agree that your strictness is going to be a problem. I'm writing an address parser designed to strip addresses from classified ads where the format could be just about anything. For instance, for your quadrant matches, you're ignoring punctuation altogether. I have to search data that could represent NE in all these different ways:
"NE", "N.E", "N E", "N.E.", "N. E", "North East", "Northeast"
so I am using the following pattern match which should catch all direction qualifiers no matter how they are expressed:
\b(?:(?:[nesw]\.? ?){0,2}|(?:north|no\.|east|south|so\.|west){0,2})\b
Of course, context is also important since "no" is going to be matched by this. But "NE" for Nebraska would be matched by either, so you really have to be careful about what's to the left and right in your larger expression. I'm having to compile lists of words that commonly appear interspersed in address texts which are not address components, such as "near, x-street, in, across", etc.
It is a very tough problem, and I agree Salt Lake City is a bitch. In addition to having the double direction/coordinate format, they also compound it by referring to stuff like "3700 North 5300 East Arborville Way" where the streets can be referenced by name, number, or both.
We have 5mb of typical text (just plain words). We have 1000 words/phrases to use as terms to search for in this text.
What's the most efficient way to do this in .NET (ideally C#)?
Our ideas include regexes (a single one, or lots of them) and even plain String.Contains.
The input is a 2 MB to 5 MB text string, all text. Multiple hits are good: for each term (of the 1000) that matches, we want to know about it. Performance means total time to execute; we don't care about memory footprint. The current algorithm takes about 60+ seconds using naive String.Contains. We don't want 'cat' to match 'category' or even 'cats' (i.e. the entire term word must hit, no stemming).
We expect a <5% hit ratio in the text. The results would ideally just be the terms that matched (dont need position or frequency just yet). We get a new 2-5mb string every 10 seconds, so can't assume we can index the input. The 1000 terms are dynamic, although have a change rate of about 1 change an hour.
A naive string.Contains with 762 words (the final page) of War and Peace (3.13MB) runs in about 10s for me. Switching to 1000 GUIDs runs in about 5.5 secs.
Regex.IsMatch found the 762 words (many of which were probably in earlier pages as well) in about .5 seconds, and ruled out the GUIDs in 2.5 seconds.
I'd suggest your problem lies elsewhere...Or you just need some decent hardware.
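For reference, a rough sketch of how such a comparison might be timed; the file names and the exact measurement setup here are placeholders, not the setup used above:
using System;
using System.Diagnostics;
using System.IO;
using System.Linq;
using System.Text.RegularExpressions;

var text = File.ReadAllText("war-and-peace.txt");   // a few MB of plain text
var terms = File.ReadAllLines("terms.txt");         // ~1000 search terms, one per line

var sw = Stopwatch.StartNew();
var containsHits = terms.Where(t => text.Contains(t)).ToList();
Console.WriteLine($"String.Contains: {containsHits.Count} hits in {sw.Elapsed}");

sw.Restart();
var regexHits = terms.Where(t => Regex.IsMatch(text, @"\b" + Regex.Escape(t) + @"\b")).ToList();
Console.WriteLine($"Regex.IsMatch:   {regexHits.Count} hits in {sw.Elapsed}");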
Why reinvent the wheel? Why not just leverage something like Lucene.NET?
Have you considered the following?
Do you care about substrings? Let's say I am looking for the word "cat", nothing more and nothing less. Now consider the Knuth-Morris-Pratt algorithm, or string.Contains, applied to "concatenate": both of these will return true (or an index). Is this OK?
You will also have to look into the idea of the stemmed form of the word. Let's look for "diary"; the test sentence is "there are many kinds of diaries". To you and me, that contains the word "diaries": does this count? If so, we need to preprocess the sentence, converting the words to their stemmed form (diaries -> diary), so the sentence becomes "there are many kinds of diary". Now we can say that "diary" is in the sentence (see the Porter Stemmer algorithm).
Also, when it comes to processing text (aka Natural Language Processing) you can remove some words as noise. Take, for example, "a, have, you, I, me, some, to": these could be considered useless words and removed before any processing takes place. For example:
"I have written some C# today". If I have 10,000 keywords to look for, I would have to scan the entire sentence 10,000 times the number of words in the sentence. Removing noise beforehand shortens the processing time:
"written C# today" <- noise removed; now there is a lot less to look through.
A great article on NLP can be found here: Sentence comparing.
HTH
Bones
A modified Suffix tree would be very fast, though it would take up a lot of memory and I don't know how fast it would be to build it. After that however every search would take O(1).
Here's another idea: Make a class something like this:
class Word
{
    public string Text;          // the word itself
    public List<int> Positions;  // positions counted in words, not characters
}
For every unique word in your text you create an instance of this class. Positions array will store positions (counted in words, not characters) from the start of the text where this word was found.
Then make another two lists which will serve as indexes. One will store all these classes sorted by their texts, the other - by their positions in the text. In essence, the text index would probably be a SortedDictionary, while the position index would be a simple List<Word>.
Then to search for a phrase, you split that phrase into words. Look up the first word in the Dictionary (that's O(log(n))). From there you know what are the possible words that follow it in the text (you have them from the Positions array). Look at those words (use the position index to find them in O(1)) and go on, until you've found one or more full matches.
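A rough sketch of building those two indexes (the names, and the tokenization by a fixed separator set, are illustrative only; bigText is the input text from the question):
using System;
using System.Collections.Generic;

var separators = new[] { ' ', '\t', '\r', '\n', '.', ',', ';', ':', '!', '?' };
var tokens = bigText.Split(separators, StringSplitOptions.RemoveEmptyEntries);

// Text index: word -> Word entry carrying all of its positions (counted in words).
var byText = new SortedDictionary<string, Word>(StringComparer.OrdinalIgnoreCase);
// Position index: position -> the Word entry sitting at that position.
var byPosition = new List<Word>(tokens.Length);

for (int pos = 0; pos < tokens.Length; pos++)
{
    if (!byText.TryGetValue(tokens[pos], out var entry))
        byText[tokens[pos]] = entry = new Word { Text = tokens[pos], Positions = new List<int>() };
    entry.Positions.Add(pos);
    byPosition.Add(entry);
}

// A single-word term is then one dictionary lookup:
bool found = byText.ContainsKey("cat");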
Are you trying to achieve a list of matched words or are you trying to highlight them in the text getting the start and length of the match position? If all you're trying to do is find out if the words exist, then you could use subset theory to fairly efficiently perform this.
However, I expect you're trying to find each match's start position in the text... in which case this approach wouldn't work.
The most efficient approach I can think is to dynamically build a match pattern using a list and then use regex. It's far easier to maintain a list of 1000 items than it is to maintain a regex pattern based on those same 1000 items.
It is my understanding that Regex uses the same KMP algorithm suggested to efficiently process large amounts of data - so unless you really need to dig through and understand the minutiae of how it works (which might be beneficial for personal growth), then perhaps regex would be ok.
There's quite an interesting paper on search algorithms for many patterns in large files here: http://webglimpse.net/pubs/TR94-17.pdf
Is this a bottleneck? How long does it take? 5 MiB isn't actually a lot of data to search in. Regular expressions might do just fine, especially if you encode all the search strings into one pattern using alternations. This basically amortizes the overall cost of the search to O(n + m) where n is the length of your text and m is the length of all patterns, combined. Notice that this is a very good performance.
An alternative that's well suited to many patterns is the Wu-Manber algorithm. I've already posted a very simplistic C++ implementation of it.
OK, the current rework shows this as fastest (pseudocode):
foreach (var term in allTerms)
{
string pattern = term.ToWord(); // wraps the term in \b word-boundary anchors
Regex regex = new Regex(pattern, RegexOptions.IgnoreCase);
if (regex.IsMatch(bigTextToSearchForTerms))
{
result.Add(term);
}
}
What was surprising (to me at least!) is that running the regex 1000 times was faster than a single regex with 1000 alternatives, i.e. "\bterm1\b|\bterm2\b|...|\btermN\b", and then trying to use regex.Matches.Count.
How does this perform in comparison? It uses LINQ, so it may be a little slower, not sure...
List<String> allTerms = new List<String> { "string1", "string2", "string3", "string4" };
List<String> matches = allTerms.Where(item => Regex.IsMatch(bigTextToSearchForTerms, item, RegexOptions.IgnoreCase)).ToList();
This uses a classic predicate with the FindAll method, so it should be quicker than LINQ:
static bool Match(string checkItem)
{
return Regex.IsMatch(bigTextToSearchForTerms, checkItem, RegexOptions.IgnoreCase);
}
static void Main(string[] args)
{
List<String> allTerms = new List<String> { "string1", "string2", "string3", "string4" };
List<String> matches = allTerms.FindAll(Match);
}
Or this uses the lambda syntax to implement the classic predicate, which again should be faster than the LINQ, but is more readable than the previous syntax:
List<String> allTerms = new List<String> { "string1", "string2", "string3", "string4" };
List<String> matches = allTerms.FindAll(checkItem => Regex.IsMatch(bigTextToSearchForTerms, checkItem, RegexOptions.IgnoreCase));
I haven't tested any of them for performance, but they all implement your idea of iteration through the search list using the regex. It's just different methods of implementing it.