Recursive string replacement in C#

Scenario
I have a Keyword object. It has a name and a value (string).
A keyword cannot contain itself, but it can contain other keywords. For example:
K1 = "%K2% %K3% %K4%"
where K2, K3 and K4 are keywords.
Just replacing them with their values works, but here is the catch I am facing.
Example:
K3 = "%K2%"
K2 = "%K4%"
K4 = "%K2%"
Now if I start replacing, there will be an infinite loop, as K2 gives K4 and vice versa.
I wish to avoid this problem.
However, I am required to let the user nest other keywords. How could I check, while a keyword is being added, whether such a deadlock occurs, so that I can report the keyword as invalid? Should I use a Hashtable, or something else? Some code direction would be nice.

From your comment:
I want to be able to take a Context Free Grammar and run an analyzer on it that determines if there is any "infinite cycle".
That is easily done. First off, let's clearly define a "context free grammar". A CFG is a substitution system which has "terminal" and "non-terminal" symbols. Terminals are things which are "done"; non-terminals are replaced with a sequence of terminal and non-terminal symbols.
In my examples, non-terminals will be in UPPERCASE and terminals will be in lowercase. Substitution rules will be written as "NONTERMINAL : substituted symbols". So an example of a CFG is:
SENTENCE : SUBJECT VERB OBJECT
SUBJECT : ARTICLE NOUN
ARTICLE : a
ARTICLE : the
NOUN : can
NOUN : man
VERB : saw
VERB : kicked
OBJECT : ARTICLE NOUN
So if we start with SENTENCE then we can make substitutions:
SENTENCE
SUBJECT VERB OBJECT
ARTICLE NOUN VERB OBJECT
the NOUN VERB OBJECT
the man VERB OBJECT
the man kicked OBJECT
the man kicked ARTICLE NOUN
the man kicked the NOUN
the man kicked the can
and we have no more non-terminals, so we're done.
CFGs can have cycles:
EQUATION : TERM = TERM
TERM : 1
TERM : ADDITION
ADDITION : TERM + TERM
And now we do productions:
EQUATION
TERM = TERM
1 = TERM
1 = ADDITION
1 = TERM + TERM
1 = 1 + TERM
1 = 1 + 1
This one can eventually stop but it can go forever also. Of course you can define CFGs that must go forever; if there had been no production "TERM : 1" then this one would go forever without finding a valid sequence of only terminals.
So how do you determine if there are any productions that can run forever?
What you do is make a directed graph data structure. Make all the non-terminals into nodes in the graph. Then add an edge for every production rule that has a non-terminal on the right-hand side. So for our first example, we'd have the graph:
SENTENCE --> SUBJECT, VERB, OBJECT
SUBJECT  --> ARTICLE, NOUN
OBJECT   --> ARTICLE, NOUN
In the second example we'd have the graph:
EQUATION --> TERM --> ADDITION
              ^          |
              +----------+
If the graph contains a cycle that is reachable from the start symbol then the grammar contains productions which can be expanded forever. If it does not, then it cannot.
So now all you have to do is build a cycle detector, and that's an easy problem in graph analysis. If there are no cycles, or if the only cycles are not reachable from the start symbol, then the grammar is good.
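A minimal sketch of such a cycle detector in C#, assuming the graph is held as a dictionary that maps each non-terminal to the non-terminals on the right-hand side of its rules. The class and method names (GrammarCycles, HasReachableCycle) are made up for illustration. It is a plain depth-first search that reports a cycle when it meets a node that is still on the current path:

```csharp
using System;
using System.Collections.Generic;

public static class GrammarCycles
{
    // edges: non-terminal -> non-terminals appearing on the right-hand side of its rules.
    // Returns true if some cycle can be reached from the start symbol.
    public static bool HasReachableCycle(
        IDictionary<string, List<string>> edges, string start)
    {
        var onPath = new HashSet<string>(); // nodes on the current DFS path
        var done = new HashSet<string>();   // nodes fully explored without finding a cycle
        return Visit(start, edges, onPath, done);
    }

    private static bool Visit(string node, IDictionary<string, List<string>> edges,
                              HashSet<string> onPath, HashSet<string> done)
    {
        if (onPath.Contains(node)) return true; // we came back to a node we are still expanding
        if (done.Contains(node)) return false;  // already known to be cycle-free

        onPath.Add(node);
        List<string> targets;
        if (edges.TryGetValue(node, out targets))
        {
            foreach (var next in targets)
            {
                if (Visit(next, edges, onPath, done)) return true;
            }
        }
        onPath.Remove(node);
        done.Add(node);
        return false;
    }
}
```

For the keyword problem, the nodes would be keyword names and the edges would be "this keyword's value mentions that keyword". The recursion depth is bounded by the number of distinct nodes, so it cannot run away the way the substitution itself can.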

For a start, I wouldn't actually do this recursively. There's a perfectly good iterative solution that won't blow out your stack:
def morph (string):
    while string has a substitution pattern in it:
        find first keyword in string
        replace it with value
    endwhile
    return string
That's the basic construct. No chance of blowing out stack space with infinite looping substitutions. However, it still has the same problem as the recursive solution in that it will loop infinitely where:
kw1="%kw2%"
kw2="%kw1%"
or even the simpler:
kw1="%kw1%"
The best way to stop that is to simply provide an arbitrary limit on the number of substitutions allowed (and preferably make it configurable in case there's a real need for a large number).
You can make this an arbitrarily large number since there's no risk of stack blowout, and the changes needed to the code above are as simple as:
def morph (string):
    sublimit = getConfig ("subLimit")
    while string has a substitution pattern in it:
        sublimit = sublimit - 1
        if sublimit < 0:
            return some sort of error
        find first keyword in string
        replace it with value
    endwhile
    return string
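The pseudocode above could be sketched in C# roughly as follows. The names KeywordExpander and Morph, the %NAME% pattern and the choice to throw on hitting the limit are assumptions for illustration; the limit is passed in as a parameter instead of being read from configuration:

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class KeywordExpander
{
    // Assumed keyword syntax: a name made of word characters between two % signs.
    private static readonly Regex Pattern = new Regex("%(\\w+)%");

    public static string Morph(string input, IDictionary<string, string> keywords, int subLimit)
    {
        var result = input;
        var searchFrom = 0;
        while (true)
        {
            var m = Pattern.Match(result, searchFrom);
            if (!m.Success) return result; // nothing left to substitute

            string value;
            if (!keywords.TryGetValue(m.Groups[1].Value, out value))
            {
                // Not a known keyword: leave the text alone and look further right.
                searchFrom = m.Index + m.Length;
                continue;
            }

            if (--subLimit < 0)
                throw new InvalidOperationException("Substitution limit exceeded; keywords may be cyclic.");

            result = result.Substring(0, m.Index) + value + result.Substring(m.Index + m.Length);
            searchFrom = m.Index; // rescan the inserted value so nested keywords are expanded too
        }
    }
}
```

With K1 = "%K2% and %K3%", K2 = "two" and K3 = "%K2%!", Morph("%K1%", keywords, 100) yields "two and two!", while a K2/K4 pair that refer to each other ends in the exception once the limit is used up.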

Each time you're going to replace a keyword, add it to a collection. If, at some point, that collection contains the keyword more than one time, it is recursive. Then you can throw an exception or just skip it.
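One way to read this suggestion, sketched in C# under a few assumptions: the collection holds only the keywords that are currently being expanded (so a keyword that is merely used twice, as in "%K2% %K2%", is not flagged), keywords use the %NAME% syntax, and the names SafeExpander and Expand are invented for the example.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class SafeExpander
{
    private static readonly Regex Pattern = new Regex("%(\\w+)%");

    public static string Expand(string text, IDictionary<string, string> keywords)
    {
        return Expand(text, keywords, new HashSet<string>());
    }

    private static string Expand(string text, IDictionary<string, string> keywords,
                                 HashSet<string> active)
    {
        return Pattern.Replace(text, m =>
        {
            var name = m.Groups[1].Value;
            string value;
            if (!keywords.TryGetValue(name, out value)) return m.Value; // unknown: keep as is

            // Add returns false if the keyword is already being expanded further up.
            if (!active.Add(name))
                throw new InvalidOperationException("Keyword '" + name + "' refers back to itself.");
            try
            {
                return Expand(value, keywords, active);
            }
            finally
            {
                active.Remove(name); // finished with this keyword on this path
            }
        });
    }
}
```

To validate a keyword while it is being added, one option is to add it tentatively, call Expand on "%NAME%", and reject the keyword if the exception is thrown.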

Enforce unique keywords. They could be nested but not equal.
Since it would be hard to have deeply nested strings, you could enforce a limit on recursion levels.

The idea is to implement something like a state machine:
1. Look for the first occurrence of any key string.
2. Make the replacement there.
3. Look for the first occurrence of any key string, starting just after the previous replacement (from step 2).
4. Make the replacement there.
5. And so on.
using System;
class Program
{
    static void Main(string[] args)
    {
        var s = "%K2% %K3% %K4%";
        var replaces = new[]
        {
            new[] {"%K3%", "%K2%"},
            new[] {"%K2%", "%K4%"},
            new[] {"%K4%", "%K2%"},
        };
        bool wasReplaces;
        var curPos = 0;
        do
        {
            wasReplaces = false;
            string[] curReplacement = null;
            var minIndex = int.MaxValue;
            // Find the key that occurs earliest at or after curPos.
            foreach (var replacement in replaces)
            {
                var index = s.IndexOf(replacement[0], curPos);
                if ((index < minIndex) && (index != -1))
                {
                    minIndex = index;
                    curReplacement = replacement;
                }
            }
            if (curReplacement != null)
            {
                s = s.Substring(0, minIndex) + curReplacement[1] + s.Substring(minIndex + curReplacement[0].Length);
                // Continue after the inserted value so that it is never expanded again.
                curPos = minIndex + curReplacement[1].Length;
                wasReplaces = true;
            }
        } while (wasReplaces && (curPos < s.Length));
        // Should be "%K4% %K2% %K2%"
        Console.WriteLine(s);
    }
}

Related

How to Interpret pseudocode in c#?

I have a data interpretation algorithm & actual data. Using this algorithm, I have to interpret the actual data and display it as a report.
For this, Firstly I need to create a form which will accept some variable values from user. The variables are defined in pseudocode as below. (one example given)
AGEYEARS {
Description: Age in Years
Type: Range;
MinVal: 0;
MaxVal: 124;
Default: 0;
ErrorAction: ERT1:=04 GRT4:=960Z;
}
I have several variables like this in my Variables.txt file. I don't wish to use StreamReader to read it line by line and interpret the variables.
Instead, I am looking for some logic which can read XXXX { } as one object and Type:Range as Attribute:Value. This way, I can skip one step of reading the file and converting it to understandable code.
Like this, I also have other files which has conditions to check. For ex,
IF SEX = '9' THEN
SEX:=U
ENDIF
Is there any way to interpret them easily and faster? Can someone help me with it?
I am using C# as my programming language.
So you need a parser for a DSL.
I can recommend ANTLR, which will let you build a grammar easily.
Here's a totally untested simple grammar for it:
grammar ConfigFile;
file: object+;
object: ID '{' property+ '}';
property: ID ':' value ';';
value: (ID|CHAR)+;
ID: [a-zA-Z][a-zA-Z0-9_]*;
WS: [ \t\r\n]+ -> channel(HIDDEN);
CHAR: .;
Alternate solution: You also could use regex:
(?<id>\w+)\s*\{\s*(?:(?<prop>\w+)\s*:\s*(?<value>.+?)\s*;\s*)*\}
Then extract the captured information. For each match, you'll have a group id with the name of the object. The groups prop and value will have multiple captures, each pair defining a property.
In C#:
var text = @"
AGEYEARS {
Description: Age in Years;
Type: Range;
MinVal: 0;
MaxVal: 124;
Default: 0;
ErrorAction: ERT1:=04 GRT4:=960Z;
}
OTHER {
Foo: Bar;
Bar: Baz;
}";
var re = new Regex(@"(?<id>\w+)\s*\{\s*(?:(?<prop>\w+)\s*:\s*(?<value>.+?)\s*;\s*)*\}");
foreach (Match match in re.Matches(text))
{
    Console.WriteLine("Object {0}:", match.Groups["id"].Value);
    var properties = match.Groups["prop"].Captures.Cast<Capture>();
    var values = match.Groups["value"].Captures.Cast<Capture>();
    foreach (var property in properties.Zip(values, (prop, value) => new {name = prop.Value, value = value.Value}))
    {
        Console.WriteLine(" {0} = {1}", property.name, property.value);
    }
    Console.WriteLine();
}
This solution is not as "pretty" as the parser one, but works without any external lib.
I advise against using regular expressions. They may work at the start, but if your task becomes a bit more complex, regex may not be able to solve your problem at all, because it is technically incapable of doing so.
The better choice (for the price of adding a library) is to use a parser. For C# there might not be as many as for other languages, but there are enough -- just take your pick :-). You have Irony, Coco/R, GOLD, ANTLR, LLLPG, Sprache, or my NLT.
If you sense that you will have mathematical precedence issues (i.e. you will have to evaluate expressions like "5+5*2", which should give 15, not 20), then first compare the syntax of top-down parsers (ANTLR is one of them) against that of bottom-up parsers (NLT, for example). In the former you usually have to write the rules in a quirky order (you have to nest the rules), while in the latter you just have to set their order (stating that * goes before +). In other words, the rules are separated from the precedence.

Is there a way to switch on partial strings in c#?

I have a list of different types of images I need to store in a database, they all have a type description such as Indoor or GardenSummer and things like that, but there are a lot of the descriptions that contain repeated words, like GardenSummer and AreaSummer1KM for example both contain "Summer", so is there a way for me to do something like this in c#:
open System
let strs = ["Kitchen"; "GardenSummer"; "GardenWinter"; "AreaSummer1KM"; "PoolIndoors"; "LivingRoom"; "BathRoom"]
let switch (x: string) =
    match x with
    | a when a.Contains "Summer" -> Some "Summer" // here
    | b when b.Contains "Winter" -> Some "Winter" // here
    | "Exterior" | "ParkFacilities" -> Some "Outdoors"
    | "Kitchen" | "Landing" -> Some "Indoors"
    | c when c.Contains "Room" -> Some "Indoors" // and here
    | _ -> None
let sorted = List.map switch strs
// part from here and down was just added to print the contents, and isn't a part of the issue
let printOption = function
    | Some v -> v.ToString () |> Console.WriteLine
    | None -> "No Match" |> Console.WriteLine
List.iter printOption sorted
is there a way for me to switch on str.Contains(str2) without making a bunch of else ifs?
The short answer is no.
The slightly longer answer is "no, for a good reason". The switch statement is actually quite a smart statement, that performs better than a chain of if-else if statements in many cases (a good example being the typical switch (MessageType) ...). To do this, however, it requires certain contracts to be held. In the end, it doesn't evaluate every possibility. It performs something similar to a binary search on the possible options.
In the end, your F# code probably does the equivalent of if-else if statements, rather than the equivalent of switch in C#.
Of course, nothing prevents you from creating your own method that would be syntactically similar to F#'s match. Anonymous delegates, generic functions, all those make it rather easy to write such syntax shorteners :)
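As a rough sketch of such a "syntax shortener", assuming an ordered table of (predicate, result) pairs where the first matching entry wins, like the arms of the F# match. The names ImageCategories, Rule and Classify are invented for the example:

```csharp
using System;
using System.Collections.Generic;

public static class ImageCategories
{
    // Ordered like the arms of the F# match: the first rule whose test passes wins.
    private static readonly KeyValuePair<Func<string, bool>, string>[] Rules =
    {
        Rule(s => s.Contains("Summer"), "Summer"),
        Rule(s => s.Contains("Winter"), "Winter"),
        Rule(s => s == "Exterior" || s == "ParkFacilities", "Outdoors"),
        Rule(s => s == "Kitchen" || s == "Landing", "Indoors"),
        Rule(s => s.Contains("Room"), "Indoors"),
    };

    private static KeyValuePair<Func<string, bool>, string> Rule(
        Func<string, bool> test, string result)
    {
        return new KeyValuePair<Func<string, bool>, string>(test, result);
    }

    // Returns the category of the first matching rule, or null when nothing matches.
    public static string Classify(string description)
    {
        foreach (var rule in Rules)
        {
            if (rule.Key(description)) return rule.Value;
        }
        return null;
    }
}
```

Here Classify("GardenSummer") gives "Summer", Classify("LivingRoom") gives "Indoors" and Classify("PoolIndoors") gives null. Later C# versions (7.0 and up) also accept case guards such as case string s when s.Contains("Summer"): inside a switch statement, which comes closer to the F# syntax without a helper.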
And of course, there's other options too, like using regular expressions or such. Calling Contains 10 times in a row is going to mean a significant performance penalty if the searched string is long.
Some sample regexes for your data and switches. The common code is as follows:
void Main()
{
    var data =
        new []
        {
            "Kitchen", "GardenSummer", "GardenWinter", "AreaSummer1KM",
            "PoolIndoors", "LivingRoom", "BathRoom", "Exterior", "ParkFacilities"
        };
    foreach (var str in data)
    {
        Matcher(str).Dump();
    }
}
Now the thing we're going to change is the Matcher method implementation.
First, to just simplify the whole thing and avoid multiple string matching (comparing strings isn't exactly free):
Regex matcherRegex = new Regex("(Summer)|(Winter)|(^(?:Exterior|ParkFacilities)$)",
                               RegexOptions.Compiled);
string Matcher(string input)
{
    var m = matcherRegex.Match(input);
    if (m.Success)
    {
        // Groups[0] is the whole match; the capture groups start at index 1.
        if (m.Groups[1].Success) return "Summer";
        else if (m.Groups[2].Success) return "Winter";
        else if (m.Groups[3].Success) return "Outdoors";
    }
    return null;
}
So, we still have an if-else chain, but we no longer traverse the strings multiple times. It also allows you to easily specify the conditions you want.
One way to improve this to be more "switchy" is by using LINQ. This is definitely not something you want to do for performance reasons, it's only about aesthetics:
var groupIndex = m.Groups.OfType<Group>()
    .Skip(1)
    .Select((i, idx) => new { Item = i, Index = idx + 1 })
    .Where(i => i.Item.Success)
    .Select(i => i.Index)
    .FirstOrDefault();
switch (groupIndex)
{
    case 0: return null;
    case 1: return "Summer";
    case 2: return "Winter";
    case 3: return "Outdoors";
    default: return null; // keeps the compiler happy: every path returns a value
}
Basically, I get the index of the matched group, and use a switch on that. As I said before, this is probably going to be slower than the first variant, at least due to the LINQ overhead.
You can also use named captures to get the matched groups by name, rather than by index, which is a bit more maintainable. Also, for simple cases, you could use the named group name to avoid the switch altogether:
Regex matcherRegex =
    new Regex("(?<Summer>Summer)"
              + "|(?<Winter>Winter)"
              + "|(?<Outdoors>^(Exterior|ParkFacilities)$)",
              RegexOptions.Compiled | RegexOptions.ExplicitCapture);
string Matcher(string input)
{
    return matcherRegex.Match(input)
        .Groups
        .OfType<Group>()
        .Select((i, idx) => new { Item = i, Index = idx })
        .Skip(1)
        .Where(i => i.Item.Success)
        .Select(i => matcherRegex.GroupNameFromNumber(i.Index))
        .FirstOrDefault();
}
All of those are just samples, you may want to change those for better edge case or exception handling, and performance, but it shows the ideas.
The last version in particular is handy in that there's nothing preventing you from using this as a common method that handles all the string "switches" that you can express as regular expressions. Sadly, group names allow a lot of Unicode characters, but not whitespace; it's nothing you couldn't work around, though.
You could even build the pattern matcher automatically, for example by passing Expression<Func<...>> to a helper method, but that's going into complicated territory :)

What do a & b stand for in .Aggregate( (a,b) => statement w/ a&b)

I am trying to think of the best way to word this to get to my exact question without having someone have to explain what Aggregate does because I know that's been covered in depth here and elsewhere on the internet. When calling Aggregate() and using a linq statement like
(a,b) => a+b
What is a and what is b? I know a is the current element, but what is b? I've seen examples where it seemed like b was simply one element ahead of a and other examples where it seemed like a was the result of the previous function and other examples where it seemed like b was the result of the previous function.
I've looked through the examples on the actual C# documentation pages here
http://msdn.microsoft.com/en-us/library/bb548744.aspx
and here
http://www.dotnetperls.com/aggregate
But I just need some clarification of the difference between the two arguments in the linq expression. If I'm missing some fundamental Linq knowledge that answers this, feel free to put me in my place.
Take a look at the example at http://msdn.microsoft.com/en-us/library/bb548651.aspx
string sentence = "the quick brown fox jumps over the lazy dog";
// Split the string into individual words.
string[] words = sentence.Split(' ');
// Prepend each word to the beginning of the
// new sentence to reverse the word order.
string reversed = words.Aggregate((workingSentence, next) =>
next + " " + workingSentence);
Console.WriteLine(reversed);
// This code produces the following output:
//
// dog lazy the over jumps fox brown quick the
In this example, the anonymous function passed to Aggregate is (workingSentence, next) => next + " " + workingSentence. a would be workingSentence, which contains the result of the aggregation up to the current element, and b would be the current element being added to the aggregation. Because this overload takes no seed, the first element seeds the accumulator: in the first call to the anonymous function, workingSentence = "the" and next = "quick". In the next call, workingSentence = "quick the" and next = "brown".
If you're calling the overload that takes a Func matching that description, you're most likely using this version:
Enumerable.Aggregate<TSource>(IEnumerable<TSource>, Func<TSource, TSource, TSource>)
That means that a would be your accumulator and b would be the next element to work with.
someEnumerable.Aggregate((a,b) => a & b);
If you were to expand that out to a regular loop, it might look something like:
Sometype a = null;
foreach(var b in someEnumerable)
{
    if(a == null)
    {
        a = b;
    }
    else
    {
        a = a & b;
    }
}
Each iteration after the first would perform a bitwise AND and store the result back into the accumulator.
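a isn't the current element - b is. The first time that the lambda expression is invoked, a will be equal to the seed argument you gave to Aggregate or, if you called the overload without a seed, to the first element of the sequence (in which case b starts at the second element). Each subsequent time it will be equal to the result of the previous invocation of the lambda expression.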
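To see this for yourself, a small sketch that records every (a, b) pair the lambda receives, once without a seed and once with one. The class and method names are made up for the illustration:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class AggregateTrace
{
    // Records the arguments of each lambda call for the overload without a seed.
    public static List<string> WithoutSeed(int[] numbers)
    {
        var calls = new List<string>();
        numbers.Aggregate((a, b) =>
        {
            calls.Add("a=" + a + ", b=" + b);
            return a + b;
        });
        return calls;
    }

    // Same, for the overload that takes an explicit seed.
    public static List<string> WithSeed(int[] numbers, int seed)
    {
        var calls = new List<string>();
        numbers.Aggregate(seed, (a, b) =>
        {
            calls.Add("a=" + a + ", b=" + b);
            return a + b;
        });
        return calls;
    }
}
```

For the input { 1, 2, 3, 4 }, WithoutSeed records "a=1, b=2", "a=3, b=3", "a=6, b=4" (three calls, because the first element becomes the initial accumulator), while WithSeed with a seed of 10 records "a=10, b=1", "a=11, b=2", "a=13, b=3", "a=16, b=4".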

Piglatin using Arrays

Last night I was messing around with Piglatin using Arrays and found out I could not reverse the process. How would I shift the phrase and take out the Char's "a" and "y" at the end of the word and return the original word in the phrase.
For instance if I entered "piggy" it would come out as "iggypay" shifting the word piggy so "p" is at the end of the word and "ay" is appended.
Here is the example code so you can try it as well.
public string ay;
public string PigLatin(string phrase)
{
    string[] pLatin;
    ArrayList pLatinPhrase = new ArrayList();
    int wordLength;
    pLatin = phrase.Split();
    foreach (string pl in pLatin)
    {
        wordLength = pl.Length;
        pLatinPhrase.Add(pl.Substring(1, wordLength - 1) + pl.Substring(0, 1) + "ay");
    }
    foreach (string p in pLatinPhrase)
    {
        ay += p;
    }
    return ay;
}
You will notice that this example is not programmed to find vowels and append them to the end along with "ay". It is simply a basic way of doing it.
If you were wondering how to reverse the above, try this example of uPigLatinify:
public string way;
public string uPigLatinify(string word)
{
    string[] latin;
    int wordLength;
    // Using ArrayList to store split words.
    ArrayList Phrase = new ArrayList();
    // Split string phrase into words.
    latin = word.Split(' ');
    foreach (string i in latin)
    {
        wordLength = i.Length;
        if (wordLength > 0)
        {
            // Grab 3rd letter from the end of word and append to front
            // of word chopping off "ay" as it was not included in the indexing.
            Phrase.Add(i.Substring(wordLength - 3, 1) + i.Substring(0, wordLength - 3) + " ");
        }
    }
    foreach (string _word in Phrase)
    {
        // Add words to string and return.
        way += _word;
    }
    return way;
}
Please don’t take this the wrong way, but although you can probably get people here to give you the C# code to implement the algorithm you want, I suspect this is not enough if you want to learn how it works. To learn the basics of programming, there are some good tutorials to delve into (whether websites or books). In particular, if you aspire to be a programmer, you will need to learn not just how to write code. In your example:
You should first write a specification of what your PigLatin function is supposed to do. Think about all the corner-cases: What if the first letter is a vowel? What if there are several consonants at the beginning? What if there are only consonants? What if the input starts with a number, a parenthesis, or a space? What if the input string is empty? Write down exactly what should happen in all of these cases — even if it’s “throw an exception”.
Only then can you implement the algorithm according to the specification (i.e. write the actual C# code). While doing this, you may find that the specification is incomplete, in which case you need to go back and correct it.
Once your code is finished, you need to test it. Run it on several testcases, especially the corner-cases you came up with above: For example, try PigLatin("air"), PigLatin("x"), PigLatin("1"), PigLatin(""), etc. In each case, make yourself aware first what behaviour you expect, and then see if the behaviour matches your expectation. If it doesn’t, you need to go back and fix the code.
Once you have implemented the forward PigLatin algorithm and it works (read: passes all your testcases), then you will already have the skills needed to write the reverse function yourself. I guarantee you that you will feel accomplished and excited then! Whereas, if you just copy the code from this website, you are setting yourself up for feeling dumb because you will think other people can do it and you can’t.
Of course, we are nonetheless happy to help you with specific technical questions, for example “What is the difference between ArrayList and List<string>?” or “What does the scope of a local variable mean?” (but search first — these may have already been asked before) — but you probably shouldn’t ask to have the code fully written and finished for you.
The work to split the phrase into words and recombine the words after transforming them is the same as in the original case. The difficulty is in un-pig-latin-ifying an individual word. With some error checking, I imagine you could do this:
string UnPigLatinify(string word)
{
    if ((word == null) || !Regex.IsMatch(word, @"^\w+ay$", RegexOptions.IgnoreCase))
        return word;
    return word[word.Length - 3] + word.Substring(0, word.Length - 3);
}
The regular expression just checks that the word is at least 3 characters long, is composed of word characters, and ends with "ay".
The actual transform takes the third to last letter (the original first letter) and appends the rest of the word minus the "ay" and the original letter.
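Putting the pieces together, a sketch of the whole round trip at phrase level, assuming words are separated by single spaces and using the same simple "move the first letter" rule as the question. The class name PigLatinPhrases and the Transform helper are invented for the example:

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class PigLatinPhrases
{
    public static string ToPigLatin(string phrase)
    {
        // Move the first letter to the end and append "ay"; empty tokens stay empty.
        return Transform(phrase, w => w.Length == 0 ? w : w.Substring(1) + w[0] + "ay");
    }

    public static string FromPigLatin(string phrase)
    {
        return Transform(phrase, UnPigLatinify);
    }

    // Split on spaces, transform each word, and join the words again.
    private static string Transform(string phrase, Func<string, string> perWord)
    {
        return string.Join(" ", phrase.Split(' ').Select(perWord).ToArray());
    }

    private static string UnPigLatinify(string word)
    {
        if (!Regex.IsMatch(word, @"^\w+ay$", RegexOptions.IgnoreCase))
            return word; // does not look like a pig-latin word, leave it alone
        return word[word.Length - 3] + word.Substring(0, word.Length - 3);
    }
}
```

ToPigLatin("piggy went home") gives "iggypay entway omehay", and FromPigLatin turns that back into "piggy went home". Unlike the versions in the question, nothing is accumulated in a field, so repeated calls do not leak into each other.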
Is this what you meant?

C# Efficient Substring with many inputs

Assuming I do not want to use external libraries or more than a dozen or so extra lines of code (i.e. clear code, not code golf code), can I do better than string.Contains to handle a collection of input strings and a collection of keywords to check for?
Obviously one can use objString.Contains(objString2) to do a simple substring check. However, there are many well-known algorithms which are able to do better than this under special circumstances, particularly if one is working with multiple strings. But sticking such an algorithm into my code would probably add length and complexity, so I'd rather use some sort of shortcut based on a built in function.
E.g. the input would be a collection of strings, a collection of positive keywords, and a collection of negative keywords. The output would be a subset of the first collection (the input strings), each of which has at least 1 positive keyword but 0 negative keywords.
Oh, and please don't mention regular expressions as a suggested solutions.
It may be that my requirements are mutually exclusive (not much extra code, no external libraries or regex, better than String.Contains), but I thought I'd ask.
Edit:
A lot of people are only offering silly improvements that won't beat an intelligently used call to contains by much, if anything. Some people are trying to call Contains more intelligently, which completely misses the point of my question. So here's an example of a problem to try solving. LBushkin's solution is an example of someone offering a solution that probably is asymptotically better than standard contains:
Suppose you have 10,000 positive keywords of length 5-15 characters, 0 negative keywords (this seems to confuse people), and one 1,000,000-character string. Check if the 1,000,000-character string contains at least 1 of the positive keywords.
I suppose one solution is to create an FSA. Another is delimit on spaces and use hashes.
Your discussion of "negative and positive" keywords is somewhat confusing - and could use some clarification to get more complete answers.
As with all performance related questions - you should first write the simple version and then profile it to determine where the bottlenecks are - these can be unintuitive and hard to predict. Having said that...
One way to optimize the search (if you are always searching for "words", and not phrases that could contain spaces) would be to build a search index from your string.
The search index could either be a sorted array (for binary search) or a dictionary. A dictionary would likely prove faster, both because dictionaries are hash maps internally with O(1) lookup, and because a dictionary's keys are unique, so duplicate words in the search source are stored only once, which reduces the number of comparisons you need to perform.
The general search algorithm is:
For each string you are searching against:
  1. Take the string you are searching within and tokenize it into individual words (delimited by whitespace).
  2. Populate the tokens into a search index (either a sorted array or a dictionary).
  3. Search the index for your "negative keywords"; if one is found, skip to the next search string.
  4. Search the index for your "positive keywords"; when one is found, add it to a dictionary as the key (you could also track a count of how often the word appears).
Here's an example using a sorted array and binary search in C# 2.0:
NOTE: You could switch from string[] to List<string> easily enough, I leave that to you.
string[] FindKeyWordOccurence( string[] stringsToSearch,
                               string[] positiveKeywords,
                               string[] negativeKeywords )
{
    Dictionary<string,int> foundKeywords = new Dictionary<string,int>();
    foreach( string searchIn in stringsToSearch )
    {
        // tokenize and sort the input to make searches faster
        string[] tokenizedList = searchIn.Split( ' ' );
        Array.Sort( tokenizedList );
        // if any negative keywords exist, skip to the next search string...
        // (a plain "continue" inside the inner loop would only continue that loop)
        bool hasNegative = false;
        foreach( string negKeyword in negativeKeywords )
        {
            if( Array.BinarySearch( tokenizedList, negKeyword ) >= 0 )
            {
                hasNegative = true;
                break;
            }
        }
        if( hasNegative )
            continue; // skip to next search string...
        // for each positive keyword, add to dictionary to keep track of it
        // we could have also used a SortedList, but the dictionary is easier
        foreach( string posKeyword in positiveKeywords )
            if( Array.BinarySearch( tokenizedList, posKeyword ) >= 0 )
                foundKeywords[posKeyword] = 1;
    }
    // convert the Keys in the dictionary (our found keywords) to an array...
    string[] foundKeywordsArray = new string[foundKeywords.Keys.Count];
    foundKeywords.Keys.CopyTo( foundKeywordsArray, 0 );
    return foundKeywordsArray;
}
Here's a version that uses a dictionary-based index and LINQ in C# 3.0:
NOTE: This is not the most LINQ-y way to do it, I could use Union() and SelectMany() to write the entire algorithm as a single big LINQ statement - but I find this to be easier to understand.
public IEnumerable<string> FindOccurences( IEnumerable<string> searchStrings,
                                           IEnumerable<string> positiveKeywords,
                                           IEnumerable<string> negativeKeywords )
{
    var foundKeywordsDict = new Dictionary<string, int>();
    foreach( var searchIn in searchStrings )
    {
        // tokenize the search string...
        // (Distinct() is needed because ToDictionary throws on duplicate keys)
        var tokenizedDictionary = searchIn.Split( ' ' ).Distinct().ToDictionary( x => x );
        // skip if any negative keywords exist...
        if( negativeKeywords.Any( tokenizedDictionary.ContainsKey ) )
            continue;
        // merge found positive keywords into dictionary...
        // an example of where Enumerable.ForEach() would be nice...
        var found = positiveKeywords.Where(tokenizedDictionary.ContainsKey);
        foreach (var keyword in found)
            foundKeywordsDict[keyword] = 1;
    }
    return foundKeywordsDict.Keys;
}
If you add this extension method:
public static bool ContainsAny(this string testString, IEnumerable<string> keywords)
{
    foreach (var keyword in keywords)
    {
        if (testString.Contains(keyword))
            return true;
    }
    return false;
}
Then this becomes a one line statement:
var results = testStrings.Where(t => !t.ContainsAny(badKeywordCollection)).Where(t => t.ContainsAny(goodKeywordCollection));
This isn't necessarily any faster than doing the contains checks, except that it will do them efficiently, due to LINQ's streaming of results preventing any unnecessary contains calls.... Plus, the resulting code being a one liner is nice.
If you're truly just looking for space-delimited words, this code would be a very simple implementation:
static void Main(string[] args)
{
    string sIn = "This is a string that isn't nearly as long as it should be " +
                 "but should still serve to prove an algorithm";
    string[] sFor = { "string", "as", "not" };
    Console.WriteLine(string.Join(", ", FindAny(sIn, sFor)));
}
private static string[] FindAny(string searchIn, string[] searchFor)
{
    HashSet<String> hsIn = new HashSet<string>(searchIn.Split());
    HashSet<String> hsFor = new HashSet<string>(searchFor);
    return hsIn.Intersect(hsFor).ToArray();
}
If you only wanted a yes/no answer (as I see now may have been the case) there's another method of hashset "Overlaps" that's probably better optimized for that:
private static bool FindAny(string searchIn, string[] searchFor)
{
    HashSet<String> hsIn = new HashSet<string>(searchIn.Split());
    HashSet<String> hsFor = new HashSet<string>(searchFor);
    return hsIn.Overlaps(hsFor);
}
Well, there is the Split() method you can call on a string. You could split your input strings into arrays of words using Split() then do a one-to-one check of words with keywords. I have no idea if or under what circumstances this would be faster than using Contains(), however.
First get rid of all the strings that contain negative words. I would suggest doing this using the Contains method. I would think that Contains() is faster than splitting, sorting, and searching.
Seems to me that the best way to do this is to take your match strings (both positive and negative) and compute a hash of each. Then march through your million-character string, computing n hashes at each position, one per distinct keyword length (in your case that's 11, for strings of length 5-15), and match them against the hashes of your match strings. If you get a hash match, you then do an actual string compare to rule out false positives. There are a number of good ways to optimize this, such as bucketing your match strings by length and creating hashes based on the string size for a particular bucket.
So you get something like:
IList<Bucket> buckets = BuildBuckets(matchStrings);
int shortestLength = buckets[0].Length;
// "<=" so that a match ending exactly at the end of the input is still checked
for (int i = 0; i <= inputString.Length - shortestLength; i++) {
    foreach (Bucket b in buckets) {
        if (i + b.Length > inputString.Length)
            continue;
        string candidate = inputString.Substring(i, b.Length);
        int hash = ComputeHash(candidate);
        foreach (MatchString match in b.MatchStrings) {
            if (hash != match.Hash)
                continue;
            if (candidate == match.String) {
                if (match.IsPositive) {
                    // positive case
                }
                else {
                    // negative case
                }
            }
        }
    }
}
To optimize Contains(), you need a tree (or trie) structure of your positive/negative words.
That should speed up everything (O(n) vs O(nm), n=size of string, m=avg word size) and the code is relatively small & easy.
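A minimal sketch of the trie idea, with invented names (KeywordTrie, ContainsAny) and handling only the "does the text contain at least one keyword" case. Note that this plain version restarts from every position in the text, so its cost is roughly the text length times the longest keyword length, independent of how many keywords there are; getting a single pass over the text would need the failure links of the Aho-Corasick algorithm, which are left out here:

```csharp
using System;
using System.Collections.Generic;

public sealed class KeywordTrie
{
    private sealed class Node
    {
        public readonly Dictionary<char, Node> Children = new Dictionary<char, Node>();
        public bool IsKeywordEnd;
    }

    private readonly Node root = new Node();

    public KeywordTrie(IEnumerable<string> keywords)
    {
        foreach (var keyword in keywords)
        {
            if (string.IsNullOrEmpty(keyword)) continue;
            var node = root;
            foreach (var c in keyword)
            {
                Node child;
                if (!node.Children.TryGetValue(c, out child))
                {
                    child = new Node();
                    node.Children.Add(c, child);
                }
                node = child;
            }
            node.IsKeywordEnd = true; // a keyword ends at this node
        }
    }

    // True if any keyword occurs anywhere in the text.
    public bool ContainsAny(string text)
    {
        for (int start = 0; start < text.Length; start++)
        {
            var node = root;
            for (int i = start; i < text.Length; i++)
            {
                // Follow the trie one character at a time; stop when no keyword continues this way.
                if (!node.Children.TryGetValue(text[i], out node)) break;
                if (node.IsKeywordEnd) return true;
            }
        }
        return false;
    }
}
```

For example, new KeywordTrie(new[] { "summer", "winter" }).ContainsAny("midwinter night") is true, while the same trie returns false for "spring". Negative keywords could be handled by building a second trie and checking it first.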
