String to dictionary using regex (want to optimize) - c#

I have string on the format "$0Option one$1$Option two$2$Option three" (etc) that I want to convert into a dictionary where each number corresponds to an option. I currently have a working solution for this problem, but since this method is called for every entry I'm importing (few thousand) I want it to be as optimized as possible.
public Dictionary<string, int> GetSelValsDictBySelValsString(string selectableValuesString)
{
// Get all numbers in the string.
var correspondingNumbersArray = Regex.Split(selectableValuesString, #"[^\d]+").Where(x => (!String.IsNullOrWhiteSpace(x))).ToArray();
List<int> correspondingNumbers = new List<int>();
int number;
foreach (string s in correspondingNumbersArray)
{
Int32.TryParse(s, out number);
correspondingNumbers.Add(number);
}
selectableValuesString = selectableValuesString.Replace("$", "");
var selectableStringValuesArray = Regex.Split(selectableValuesString, #"[\d]+").Where(x => (!String.IsNullOrWhiteSpace(x))).ToArray();
var selectableValues = new Dictionary<string, int>();
for (int i = 0; i < selectableStringValuesArray.Count(); i++)
{
selectableValues.Add(selectableStringValuesArray.ElementAt(i), correspondingNumbers.ElementAt(i));
}
return selectableValues;
}

The first thing that caught my attention in your code is that it processes the input string three times: twice with Split() and once with Replace(). The Matches() method is a much better tool than Split() for this job. With it, you can extract everything you need in a single pass. It makes the code a lot easier to read, too.
The second thing I noticed was all those loops and intermediate objects. You're using LINQ already; really use it, and you can eliminate all of that clutter and improve performance. Check it out:
public static Dictionary<int, string> GetSelectValuesDictionary(string inputString)
{
return Regex.Matches(inputString, #"(?<key>[0-9]+)\$*(?<value>[^$]+)")
.Cast<Match>()
.ToDictionary(
m => int.Parse(m.Groups["key"].Value),
m => m.Groups["value"].Value);
}
notes:
Cast<Match>() is necessary because MatchCollection only advertises itself as an IEnumerable, and we need it to be an IEnumerable<Match>.
I used [0-9] instead of \d on the off chance that your values might contain digits from non-Latin writing systems; in .NET, \d matches them all.
Static Regex methods like Matches() automatically cache the Regex objects, but if this method is going to be called a lot (especially if you're using a lot of other regexes, too), you might want to create a static Regex object anyway. If performance is really critical, you can specify the Compiled option while you're at it.
My code, like yours, makes no attempt to deal with malformed input. In particular, mine will throw an exception if the number turns out to be too large, while yours just converts it to zero. This probably isn't relevant to your real code, but I felt compelled to express my unease at seeing you call TryParse() without checking the return value. :/
You also don't make sure your keys are unique. Like #Gabe, I flipped it around used the numeric values as the keys, because they happened to be unique and the string values weren't. I trust that, too, is not a problem with your real data. ;)

Your selectableStringValuesArray is not actually an array! This means that every time you index into it (with ElementAt or count it with Count) it has to rerun the regex and walk through the list of results looking for non-whitespace. You need something like this instead:
var selectableStringValuesArray = Regex.Split(selectableValuesString, #"[\d]+").Where(x => (!String.IsNullOrWhiteSpace(x))).ToArray();
You should also fix your correspondingNumbersString because it has the same problem.
I see you're using C# 4, though, so you can use Zip to combine the lists and then you wouldn't have to create an array or use any loops. You could create your dictionary like this:
return correspondingNumbersString.Zip(selectableStringValuesArray,
(number, str) => new KeyValuePair<int, string>(int.Parse(number), str))
.ToDictionary(kvp => kvp.Key, kvp => kvp.Value);

Related

Trim HashSet<string>

values and _values are both HashSet<string>
I need to trim all the values.
What I need is this
values = _values.Trim();
But that is not valid syntax
Edit
This is what I am doing now but it seems messy
values.Clear();
foreach (string s in _values)
{
values.Add(s.Trim());
}
"Better" is arguable, but you could use the built-in method UnionWith and some LINQ for a nice one liner to replace your foreach. Think of it as the closest equivalent to an AddRange, except with the restrictions of a set of course.
values.Clear();
values.UnionWith(_values.Select(x => x.Trim()));

When to write code to anticipate looping and when to specifically do something 'X' times

If i know that a method must perform an action a certain amount of times, such as retrieve data, should I write code to specifically do this the number of times that is required, or should my code be able to anticipate later changes?
For instance, say I was told to write a method that retrieves 2 values from a dictionary (Ill call it Settings here) and return them using known keys that are provided
public Dictionary<string, string> GetSettings()
{
const string keyA = "address"; //I understand 'magic strings' are bad, bear with me
const string keyB = "time"
Dictionary<string, string> retrievedSettings = new Dictionary<string,string>();
//should I add the keys to a list and then iterate through the list?
List<string> listOfKeys = new List<string>(){keyA, keyB};
foreach( string key in listOfKeys)
{
if(Settings.ContainsKey(key)
{
string value = Setting[key];
retrieveSettings.Add(key, value);
}
}
//or should I just get the two values directly from the dictionary like so
if(Settings.ContainsKey(keyA)
{
retrievedSettings.Add(keyA , Setting[keyA]);
}
if(Settings.Contains(keyB)
{
retrievedSettings.Add(keyB , Setting[keyB]);
}
return retrievedSettings
}
The reason why I ask is that code repetition is always a bad thing ie DRY, but at the same time, more experienced programmers have told me that there is no need write the logic to anticipate larger looping if it the action only needs to be performed a known number of times
I would extract a method that takes the keys as parameter:
private Dictionary<string, string> GetSettings(params string[] keys)
{
var retrievedSettings = new Dictionary<string, string>();
foreach(string key in keys)
{
if(Settings.ContainsKey(key)
retrieveSettings.Add(key, Setting[key]);
}
return retrievedSettings;
}
You can now use this method like this:
public Dictionary<string, string> GetSettings()
{
return GetSettings(keyA, keyB);
}
I would choose this approach because it makes your main method trivial to understand: "Aha, it gets the settings for keyA and for keyB".
I would use this approach even when I am sure that I will never need to get more than these two keys. In other words, this approach has been chosen, not because it anticipates later changes but because it better communicates intent.
However, with LINQ, you wouldn't really need that extracted method. You could simply use this:
public Dictionary<string, string> GetSettings()
{
return new [] { keyA, keyB }.Where(x => Settings.ContainsKey(x))
.ToDictionary(x => x, Settings[x]);
}
The DRY principle does not necessarily mean that every line of code in your program should be unique. It simply means that you should not have large regions of code spread out throughout your program that do the same thing.
Option number 1 works well when you have a large number of items to search for, but has the downside of making the code slightly less trivial to read.
Option number 2 works well when you have a small number options. It is more straightforward and is actually more efficient.
Since you only have two settings, I would definitely go with option number 2. Making decisions such as these in expectation of future changes is a waste of effort. I have found this article to be quite helpful in illustrating the perils of being too concerned with non-existent requirements.

Calling a list of methods in a random sequence?

I have a list of 10 methods. Now I want to call this methods in a random sequence. The sequence should be generated at runtime. Whats the best way to do this?
It is always astonishing to me the number of incorrect and inefficient answers one sees whenever anyone asks how to shuffle a list of things on StackOverflow. Here we have several examples of code which is brittle (because it assumes that key collisions are impossible when in fact they are merely rare) or slow for large lists. (In this case the problem is stated to be only ten elements, but when possible surely it is better to give a solution that scales to thousands of elements if doing so is not difficult.)
This is not a hard problem to solve correctly. The correct, fast way to do this is to create an array of actions, and then shuffle that array in-place using a Fisher-Yates Shuffle.
http://en.wikipedia.org/wiki/Fisher-Yates_shuffle
Some things not to do:
Do not implement Fischer-Yates shuffle incorrectly. One sees more incorrect than correct implementations of this trivial algorithm. In particular, make sure you are choosing the random number from the correct range. Choosing it from the wrong range produces a biased shuffle.
If the shuffle algorithm must actually be unpredictable then use a source of randomness other than Random, which is only pseudo-random. Remember, Random only has 232 possible seeds, and therefore there are fewer than that many possible shuffles.
If you are going to be producing many shuffles in a short amount of time, do not create a new instance of Random every time. Save and re-use the old one, or use a different source of randomness entirely. Random chooses its seed based on the time; many Randoms created in close succession will produce the same sequence of "random" numbers.
Do not sort on a "random" GUID as your key. GUIDs are guaranteed to be unique. They are not guaranteed to be randomly ordered. It is perfectly legal for an implementation to spit out consecutive GUIDs.
Do not use a random function as a comparator and feed that to a sorting algorithm. Sort algorithms are permitted to do anything they please if the comparator is bad, including crashing, and including producing non-random results. As Microsoft recently found out, it is extremely embarrassing to get a simple algorithm like this wrong.
Do not use the input to random as the key to a dictionary, and then sort the dictionary. There is nothing stopping the randomness source from choosing the same key twice, and therefore either crashing your application with a duplicate key exception, or silently losing one of your methods.
Do not use the algorithm "Create two lists. Add the elements to the first list. Repeatedly move a random element from the first list to the second list, removing the element from the first list". If the list is O(n) to remove an item then this is an O(n2) algorithm.
Do not use the algorithm "Create two lists. Add the elements to the first list. Repeatedly move a random non-null element from the first list to the second list, setting the element in the first list to null." Also do not do this crazy equivalent of that algorithm.If there are lots of items in the list then this gets slower and slower as you start hitting more and more nulls.
New, short answer
Starting from where Ilya Kogan left off, totally correct after we had Eric Lippert find the bug:
var methods = new Action[10];
var rng = new Random();
var shuffled = methods.Select(m => Tuple.Create(rng.Next(), m))
.OrderBy(t => t.Item1).Select(t => t.Item2);
foreach (var action in shuffled) {
action();
}
Of course this is doing a lot behind the scenes. The method below should be much faster. But if LINQ is fast enough...
Old answer (much longer)
After stealing this code from here:
public static T[] RandomPermutation<T>(T[] array)
{
T[] retArray = new T[array.Length];
array.CopyTo(retArray, 0);
Random random = new Random();
for (int i = 0; i < array.Length; i += 1)
{
int swapIndex = random.Next(i, array.Length);
if (swapIndex != i)
{
T temp = retArray[i];
retArray[i] = retArray[swapIndex];
retArray[swapIndex] = temp;
}
}
return retArray;
}
the rest is easy:
var methods = new Action[10];
var perm = RandomPermutation(methods);
foreach (var method in perm)
{
// call the method
}
Have an array of delegates. Suppose you have this:
class YourClass {
public int YourFunction1(int x) { }
public int YourFunction2(int x) { }
public int YourFunction3(int x) { }
}
Now declare a delegate:
public delegate int MyDelegate(int x);
Now create an array of delegates:
MyDelegate delegates[] = new MyDelegate[10];
delegates[0] = new MyDelegate(YourClass.YourFunction1);
delegates[1] = new MyDelegate(YourClass.YourFunction2);
delegates[2] = new MyDelegate(YourClass.YourFunction3);
and now call it like this:
int result = delegates[randomIndex] (48);
You can create a shuffled collection of delegates, and then call all methods in the collection.
Here is an easy way of doing so using a dictionary. The keys of the dictionary are random numbers, and the values are delegates to your methods. When you iterate through the dictionary, it has the effect of shuffling.
var shuffledActions = actions.ToDictionary(
action => random.Next(),
action => action);
foreach (var pair in shuffledActions.OrderBy(item => item.Key))
{
pair.Value();
}
actions is an enumerable of your methods.
random is a of type Random.
Think that this is a list of objects and you want it to extract the objects randomly. You can get a random index using the Random.Next Method (always use current List.Count as parameter) and after that remove object from the list so it will not be drawn again.
When processing a list in a random order, the natural inclination is to shuffle a list.
Another approach is to just keep the list order, but randomly select and remove each item.
var actionList = new[]
{
new Action( () => CallMethodOne() ),
new Action( () => CallMethodTwo() ),
new Action( () => CallMethodThree() )
}.ToList();
var r = new Random();
while(actionList.Count() > 0) {
var index = r.Next(actionList.Count());
var action = actionList[index];
actionList.RemoveAt(index);
action();
}
I think:
Via reflection get Method Objects;
create an array of created Method Object;
generate random index (normalize range);
invoke method;
You can remove method from array to execute method one times.
Bye

C#-fu - finding top-used words in a functional style

This little program finds the top ten most used words in a file. How would you, or could you, optimize this to process the file via line-by-line streaming, but keep it in the functional style it is now?
static void Main(string[] args)
{
string path = #"C:\tools\copying.txt";
File.ReadAllText(path)
.Split(' ')
.Where(s => !string.IsNullOrEmpty(s))
.GroupBy(s => s)
.OrderByDescending(g => g.Count())
.Take(10)
.ToList()
.ForEach(g => Console.WriteLine("{0}\t{1}", g.Key, g.Count()));
Console.ReadLine();
}
Here is the line reader I'd like use:
static IEnumerable<string> ReadLinesFromFile(this string filename)
{
using (StreamReader reader = new StreamReader(filename))
{
while (true)
{
string s = reader.ReadLine();
if (s == null)
break;
yield return s;
}
}
}
Edit:
I realize that the implementation of top-words doesn't take into account punctuation and all the other little nuances, and I'm not too worried about that.
Clarification:
I'm interested in solution that doesn't load the entire file into memory at once. I suppose you'd need a data structure that could take a stream of words and "group" on the fly -- like a trie. And then somehow get it done in a lazy way so the line reader can go about it's business line-by-line. I'm now realizing that this is a lot to ask for and is a lot more complex than the simple example I gave above. Maybe I'll give it a shot and see if I can get the code as clear as above (with a bunch of new lib support).
So what you're saying is you want to go from:
full text -> sequence of words -> rest of query
to
sequence of lines -> sequence of words -> rest of query
yes?
that seems straightforward.
var words = from line in GetLines()
from word in line.Split(' ')
select word;
and then
words.Where( ... blah blah blah
Or, if you prefer using the "fluent" style throughout, the SelectMany() method is the one you want.
I personally would not do this all in one go. I'd make the query, and then write a foreach loop. That way, the query is built free of side effects, and the side effects are in the loop, where they belong. But some people seem to prefer putting their side effects into a ForEach method instead.
UPDATE: There's a question as to how "lazy" this query is.
You are correct in that what you end up with is an in-memory representation of every word in the file; however, with my minor reorganization of it, you at least do not have to create one big string that contains the entire text to begin with; you can do it line by line.
There are ways to cut down on how much duplication there is here, which we'll come to in a minute. However, I want to keep talking for a bit about how to reason about laziness.
A great way to think about these things is due to Jon Skeet, which I shall shamelessly steal from him.
Imagine a stage upon which there is a line of people. They are wearing shirts that say GetLines, Split, Where, GroupBy, OrderByDescending, Take, ToList and ForEach.
ToList pokes Take. Take does something and then hands toList a card with a list of words on it. ToList keeps on poking Take until Take says "I'm done". At that point, ToList makes a list out of all the cards it has been handed, and then hands the first one off to ForEach. The next time it is poked, it hands out the next card.
What does Take do? Every time it is poked it asks OrderByDescending for another card, and immediately hands that card to ToList. After handing out ten cards, it tells ToList "I'm done".
What does OrderByDescending do? When it is poked for the first time, it pokes GroupBy. GroupBy hands it a card. It keeps on poking GroupBy until GroupBy says "I'm done". Then OrderByDescending sorts the cards, and hands the first one to Take. Every subsequent time it is poked, it hands a new card to Take, until Take stops asking.
GetLines, Split, Where, GroupBy, OrderByDescending, Take, ToList and ForEach
And so on. You see how this goes. The query operators GetLines, Split, Where, GroupBy, OrderByDescending, Take are lazy, in that they do not act until poked. Some of them (OrderByDescending, ToList, GroupBy), need to poke their card provider many, many times before they can respond to the guy poking them. Some of them (GetLines, Split, Where, Take) only poke their provider once when they are themselves poked.
Once ToList is done, ForEach pokes ToList. ToList hands ForEach a card off its list. Foreach counts the words, and then writes a word and a count on the whiteboard. ForEach keeps on poking ToList until ToList says "no more".
(Notice that the ToList is completely unnecessary in your query; all it does is accumulate the results of the top ten into a list. ForEach could be talking directly to Take.)
Now, as for your question of whether you can reduce the memory footprint further: yes, you can. Suppose the file is "foo bar foo blah". Your code builds up the set of groups:
{
{ key: foo, contents: { foo, foo } },
{ key: bar, contents: { bar } },
{ key: blah, contents: { blah } }
}
and then orders those by the length of the contents list, and then takes the top ten. You don't have to store nearly that much in the contents list in order to compute the answer you want. What you really want to be storing is:
{
{ key: foo, value: 2 },
{ key: bar, value: 1 },
{ key: blah, value: 1 }
}
and then sort that by value.
Or, alternately, you could build up the backwards mapping
{
{ key: 2, value: { foo } },
{ key: 1, value: { bar, blah }}
}
sort that by key, and then do a select-many on the lists until you have extracted the top ten words.
The concept you want to look at in order to do either of these is the "accumulator". An accumulator is an object which efficiently "accumulates" information about a data structure while the data structure is being iterated over. "Sum" is an accumulator of a sequence of numbers. "StringBuilder" is often used as an accumulator on a sequence of strings. You could write an accumulator which accumulates counts of words as the list of words is walked over.
The function you want to study in order to understand how to do this is Aggregate:
http://msdn.microsoft.com/en-us/library/system.linq.enumerable.aggregate.aspx
Good luck!
First, let's abstract away our file into an IEnumerable<string> where the lines are yielded one at a time:
class LineReader : IEnumerable<string> {
Func<TextReader> _source;
public LineReader(Func<Stream> streamSource) {
_source = () => new StreamReader(streamSource());
}
public IEnumerator<string> GetEnumerator() {
using (var reader = _source()) {
string line;
while ((line = reader.ReadLine()) != null) {
yield return line;
}
}
}
IEnumerator IEnumerable.GetEnumerator() {
return GetEnumerator();
}
}
Next, let's make an extension method on IEnumerable<string> that will yield the words in each line:
static class IEnumerableStringExtensions {
public static IEnumerable<string> GetWords(this IEnumerable<string> lines) {
foreach (string line in lines) {
foreach (string word in line.Split(' ')) {
yield return word;
}
}
}
}
Then:
var lr = new LineReader(() => new FileStream("C:/test.txt", FileMode.Open));
var dict = lr.GetWords()
.GroupBy(w => w)
.ToDictionary(w => w.Key, w => w.Count());
foreach (var pair in dict.OrderByDescending(kvp => kvp.Value).Take(10)) {
Console.WriteLine("{0}: {1}", pair.Key, pair.Value);
}

C# Efficient Substring with many inputs

Assuming I do not want to use external libraries or more than a dozen or so extra lines of code (i.e. clear code, not code golf code), can I do better than string.Contains to handle a collection of input strings and a collection of keywords to check for?
Obviously one can use objString.Contains(objString2) to do a simple substring check. However, there are many well-known algorithms which are able to do better than this under special circumstances, particularly if one is working with multiple strings. But sticking such an algorithm into my code would probably add length and complexity, so I'd rather use some sort of shortcut based on a built in function.
E.g. an input would be a collection of strings, a collection of positive keywords, and a collection of negative keywords. Output would be a subset of the first collection of keywords, all of which had at least 1 positive keyword but 0 negative keywords.
Oh, and please don't mention regular expressions as a suggested solutions.
It may be that my requirements are mutually exclusive (not much extra code, no external libraries or regex, better than String.Contains), but I thought I'd ask.
Edit:
A lot of people are only offering silly improvements that won't beat an intelligently used call to contains by much, if anything. Some people are trying to call Contains more intelligently, which completely misses the point of my question. So here's an example of a problem to try solving. LBushkin's solution is an example of someone offering a solution that probably is asymptotically better than standard contains:
Suppose you have 10,000 positive keywords of length 5-15 characters, 0 negative keywords (this seems to confuse people), and 1 1,000,000 character string. Check if the 1,000,000 character string contains at least 1 of the positive keywords.
I suppose one solution is to create an FSA. Another is delimit on spaces and use hashes.
Your discussion of "negative and positive" keywords is somewhat confusing - and could use some clarification to get more complete answers.
As with all performance related questions - you should first write the simple version and then profile it to determine where the bottlenecks are - these can be unintuitive and hard to predict. Having said that...
One way to optimize the search may (if you are always searching for "words" - and not phrases that could contains spaces) would be to build a search index of from your string.
The search index could either be a sorted array (for binary search) or a dictionary. A dictionary would likely prove faster - both because dictionaries are hashmaps internally with O(1) lookup, and a dictionary will naturally eliminate duplicate values in the search source - thereby reducing the number of comparions you need to perform.
The general search algorithm is:
For each string you are searching against:
Take the string you are searching within and tokenize it into individual words (delimited by whitespace)
Populate the tokens into a search index (either a sorted array or dictionary)
Search the index for your "negative keywords", if one is found, skip to the next search string
Search the index for your "positive keywords", when one is found, add it to a dictionary as they (you could also track a count of how often the word appears)
Here's an example using a sorted array and binary search in C# 2.0:
NOTE: You could switch from string[] to List<string> easily enough, I leave that to you.
string[] FindKeyWordOccurence( string[] stringsToSearch,
string[] positiveKeywords,
string[] negativeKeywords )
{
Dictionary<string,int> foundKeywords = new Dictionary<string,int>();
foreach( string searchIn in stringsToSearch )
{
// tokenize and sort the input to make searches faster
string[] tokenizedList = searchIn.Split( ' ' );
Array.Sort( tokenizedList );
// if any negative keywords exist, skip to the next search string...
foreach( string negKeyword in negativeKeywords )
if( Array.BinarySearch( tokenizedList, negKeyword ) >= 0 )
continue; // skip to next search string...
// for each positive keyword, add to dictionary to keep track of it
// we could have also used a SortedList, but the dictionary is easier
foreach( string posKeyword in positiveKeyWords )
if( Array.BinarySearch( tokenizedList, posKeyword ) >= 0 )
foundKeywords[posKeyword] = 1;
}
// convert the Keys in the dictionary (our found keywords) to an array...
string[] foundKeywordsArray = new string[foundKeywords.Keys.Count];
foundKeywords.Keys.CopyTo( foundKeywordArray, 0 );
return foundKeywordsArray;
}
Here's a version that uses a dictionary-based index and LINQ in C# 3.0:
NOTE: This is not the most LINQ-y way to do it, I could use Union() and SelectMany() to write the entire algorithm as a single big LINQ statement - but I find this to be easier to understand.
public IEnumerable<string> FindOccurences( IEnumerable<string> searchStrings,
IEnumerable<string> positiveKeywords,
IEnumerable<string> negativeKeywords )
{
var foundKeywordsDict = new Dictionary<string, int>();
foreach( var searchIn in searchStrings )
{
// tokenize the search string...
var tokenizedDictionary = searchIn.Split( ' ' ).ToDictionary( x => x );
// skip if any negative keywords exist...
if( negativeKeywords.Any( tokenizedDictionary.ContainsKey ) )
continue;
// merge found positive keywords into dictionary...
// an example of where Enumerable.ForEach() would be nice...
var found = positiveKeywords.Where(tokenizedDictionary.ContainsKey)
foreach (var keyword in found)
foundKeywordsDict[keyword] = 1;
}
return foundKeywordsDict.Keys;
}
If you add this extension method:
public static bool ContainsAny(this string testString, IEnumerable<string> keywords)
{
foreach (var keyword in keywords)
{
if (testString.Contains(keyword))
return true;
}
return false;
}
Then this becomes a one line statement:
var results = testStrings.Where(t => !t.ContainsAny(badKeywordCollection)).Where(t => t.ContainsAny(goodKeywordCollection));
This isn't necessarily any faster than doing the contains checks, except that it will do them efficiently, due to LINQ's streaming of results preventing any unnecessary contains calls.... Plus, the resulting code being a one liner is nice.
If you're truly just looking for space-delimited words, this code would be a very simple implementation:
static void Main(string[] args)
{
string sIn = "This is a string that isn't nearly as long as it should be " +
"but should still serve to prove an algorithm";
string[] sFor = { "string", "as", "not" };
Console.WriteLine(string.Join(", ", FindAny(sIn, sFor)));
}
private static string[] FindAny(string searchIn, string[] searchFor)
{
HashSet<String> hsIn = new HashSet<string>(searchIn.Split());
HashSet<String> hsFor = new HashSet<string>(searchFor);
return hsIn.Intersect(hsFor).ToArray();
}
If you only wanted a yes/no answer (as I see now may have been the case) there's another method of hashset "Overlaps" that's probably better optimized for that:
private static bool FindAny(string searchIn, string[] searchFor)
{
HashSet<String> hsIn = new HashSet<string>(searchIn.Split());
HashSet<String> hsFor = new HashSet<string>(searchFor);
return hsIn.Overlaps(hsFor);
}
Well, there is the Split() method you can call on a string. You could split your input strings into arrays of words using Split() then do a one-to-one check of words with keywords. I have no idea if or under what circumstances this would be faster than using Contains(), however.
First get rid of all the strings that contain negative words. I would suggest doing this using the Contains method. I would think that Contains() is faster then splitting, sorting, and searching.
Seems to me that the best way to do this is take your match strings (both positive and negative) and compute a hash of them. Then march through your million string computing n hashes (in your case it's 10 for strings of length 5-15) and match against the hashes for your match strings. If you get hash matches, then you do an actual string compare to rule out the false positive. There are a number of good ways to optimize this by bucketing your match strings by length and creating hashes based on the string size for a particular bucket.
So you get something like:
IList<Buckets> buckets = BuildBuckets(matchStrings);
int shortestLength = buckets[0].Length;
for (int i = 0; i < inputString.Length - shortestLength; i++) {
foreach (Bucket b in buckets) {
if (i + b.Length >= inputString.Length)
continue;
string candidate = inputString.Substring(i, b.Length);
int hash = ComputeHash(candidate);
foreach (MatchString match in b.MatchStrings) {
if (hash != match.Hash)
continue;
if (candidate == match.String) {
if (match.IsPositive) {
// positive case
}
else {
// negative case
}
}
}
}
}
To optimize Contains(), you need a tree (or trie) structure of your positive/negative words.
That should speed up everything (O(n) vs O(nm), n=size of string, m=avg word size) and the code is relatively small & easy.

Categories

Resources