C#-fu - finding top-used words in a functional style

This little program finds the top ten most used words in a file. How would you, or could you, optimize this to process the file via line-by-line streaming, but keep it in the functional style it is now?
static void Main(string[] args)
{
string path = @"C:\tools\copying.txt";
File.ReadAllText(path)
.Split(' ')
.Where(s => !string.IsNullOrEmpty(s))
.GroupBy(s => s)
.OrderByDescending(g => g.Count())
.Take(10)
.ToList()
.ForEach(g => Console.WriteLine("{0}\t{1}", g.Key, g.Count()));
Console.ReadLine();
}
Here is the line reader I'd like to use:
static IEnumerable<string> ReadLinesFromFile(this string filename)
{
using (StreamReader reader = new StreamReader(filename))
{
while (true)
{
string s = reader.ReadLine();
if (s == null)
break;
yield return s;
}
}
}
Edit:
I realize that the implementation of top-words doesn't take into account punctuation and all the other little nuances, and I'm not too worried about that.
Clarification:
I'm interested in a solution that doesn't load the entire file into memory at once. I suppose you'd need a data structure that could take a stream of words and "group" on the fly -- like a trie. And then somehow get it done in a lazy way so the line reader can go about its business line-by-line. I'm now realizing that this is a lot to ask for and is a lot more complex than the simple example I gave above. Maybe I'll give it a shot and see if I can get the code as clear as above (with a bunch of new lib support).

So what you're saying is you want to go from:
full text -> sequence of words -> rest of query
to
sequence of lines -> sequence of words -> rest of query
yes?
That seems straightforward.
var words = from line in GetLines()
from word in line.Split(' ')
select word;
and then
words.Where( ... blah blah blah
Or, if you prefer using the "fluent" style throughout, the SelectMany() method is the one you want.
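For example, the word sequence in fluent style might look like this (a sketch, using the same GetLines() source as the query above):
var words = GetLines()
    .SelectMany(line => line.Split(' '));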
I personally would not do this all in one go. I'd make the query, and then write a foreach loop. That way, the query is built free of side effects, and the side effects are in the loop, where they belong. But some people seem to prefer putting their side effects into a ForEach method instead.
UPDATE: There's a question as to how "lazy" this query is.
You are correct in that what you end up with is an in-memory representation of every word in the file; however, with my minor reorganization of it, you at least do not have to create one big string that contains the entire text to begin with; you can do it line by line.
There are ways to cut down on how much duplication there is here, which we'll come to in a minute. However, I want to keep talking for a bit about how to reason about laziness.
A great way to think about these things is due to Jon Skeet, which I shall shamelessly steal from him.
Imagine a stage upon which there is a line of people. They are wearing shirts that say GetLines, Split, Where, GroupBy, OrderByDescending, Take, ToList and ForEach.
ToList pokes Take. Take does something and then hands ToList a card with a list of words on it. ToList keeps on poking Take until Take says "I'm done". At that point, ToList makes a list out of all the cards it has been handed, and then hands the first one off to ForEach. The next time it is poked, it hands out the next card.
What does Take do? Every time it is poked it asks OrderByDescending for another card, and immediately hands that card to ToList. After handing out ten cards, it tells ToList "I'm done".
What does OrderByDescending do? When it is poked for the first time, it pokes GroupBy. GroupBy hands it a card. It keeps on poking GroupBy until GroupBy says "I'm done". Then OrderByDescending sorts the cards, and hands the first one to Take. Every subsequent time it is poked, it hands a new card to Take, until Take stops asking.
And so on. You see how this goes. The query operators GetLines, Split, Where, GroupBy, OrderByDescending and Take are lazy, in that they do not act until poked. Some of them (OrderByDescending, ToList, GroupBy) need to poke their card provider many, many times before they can respond to the guy poking them. Some of them (GetLines, Split, Where, Take) only poke their provider once when they are themselves poked.
Once ToList is done, ForEach pokes ToList. ToList hands ForEach a card off its list. ForEach counts the words, and then writes a word and a count on the whiteboard. ForEach keeps on poking ToList until ToList says "no more".
(Notice that the ToList is completely unnecessary in your query; all it does is accumulate the results of the top ten into a list. ForEach could be talking directly to Take.)
Now, as for your question of whether you can reduce the memory footprint further: yes, you can. Suppose the file is "foo bar foo blah". Your code builds up the set of groups:
{
{ key: foo, contents: { foo, foo } },
{ key: bar, contents: { bar } },
{ key: blah, contents: { blah } }
}
and then orders those by the length of the contents list, and then takes the top ten. You don't have to store nearly that much in the contents list in order to compute the answer you want. What you really want to be storing is:
{
{ key: foo, value: 2 },
{ key: bar, value: 1 },
{ key: blah, value: 1 }
}
and then sort that by value.
Or, alternately, you could build up the backwards mapping
{
{ key: 2, value: { foo } },
{ key: 1, value: { bar, blah }}
}
sort that by key, and then do a select-many on the lists until you have extracted the top ten words.
The concept you want to look at in order to do either of these is the "accumulator". An accumulator is an object which efficiently "accumulates" information about a data structure while the data structure is being iterated over. "Sum" is an accumulator of a sequence of numbers. "StringBuilder" is often used as an accumulator on a sequence of strings. You could write an accumulator which accumulates counts of words as the list of words is walked over.
The function you want to study in order to understand how to do this is Aggregate:
http://msdn.microsoft.com/en-us/library/system.linq.enumerable.aggregate.aspx
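For example, here is a minimal sketch of such an accumulator built with Aggregate, assuming the ReadLinesFromFile extension from the question; the dictionary plays the role of the { key, value } pairs described above:
var topTen = @"C:\tools\copying.txt"
    .ReadLinesFromFile()
    .SelectMany(line => line.Split(' '))
    .Where(word => !string.IsNullOrEmpty(word))
    .Aggregate(
        new Dictionary<string, int>(),
        (counts, word) =>
        {
            int count;
            counts.TryGetValue(word, out count);
            counts[word] = count + 1; // accumulate this word's count
            return counts;            // thread the same dictionary through
        })
    .OrderByDescending(pair => pair.Value)
    .Take(10);
foreach (var pair in topTen)
    Console.WriteLine("{0}\t{1}", pair.Key, pair.Value);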
Good luck!

First, let's abstract away our file into an IEnumerable<string> where the lines are yielded one at a time:
class LineReader : IEnumerable<string> {
Func<TextReader> _source;
public LineReader(Func<Stream> streamSource) {
_source = () => new StreamReader(streamSource());
}
public IEnumerator<string> GetEnumerator() {
using (var reader = _source()) {
string line;
while ((line = reader.ReadLine()) != null) {
yield return line;
}
}
}
IEnumerator IEnumerable.GetEnumerator() {
return GetEnumerator();
}
}
Next, let's make an extension method on IEnumerable<string> that will yield the words in each line:
static class IEnumerableStringExtensions {
public static IEnumerable<string> GetWords(this IEnumerable<string> lines) {
foreach (string line in lines) {
foreach (string word in line.Split(' ')) {
yield return word;
}
}
}
}
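(As an aside, the same extension could be written as a single SelectMany call, if you prefer that form:)
public static IEnumerable<string> GetWords(this IEnumerable<string> lines) {
    return lines.SelectMany(line => line.Split(' '));
}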
Then:
var lr = new LineReader(() => new FileStream("C:/test.txt", FileMode.Open));
var dict = lr.GetWords()
.GroupBy(w => w)
.ToDictionary(w => w.Key, w => w.Count());
foreach (var pair in dict.OrderByDescending(kvp => kvp.Value).Take(10)) {
Console.WriteLine("{0}: {1}", pair.Key, pair.Value);
}


"Unzip" IEnumerable dynamically in C# or best alternative

Let's assume you have a function that returns a lazily-enumerated object:
struct AnimalCount
{
    public int Chickens;
    public int Goats;

    public AnimalCount(int chickens, int goats)
    {
        Chickens = chickens;
        Goats = goats;
    }
}
IEnumerable<AnimalCount> FarmsInEachPen()
{
    ....
    yield return new AnimalCount(x, y);
    ....
}
You also have two functions that consume two separate IEnumerables, for example:
ConsumeChicken(IEnumerable<int>);
ConsumeGoat(IEnumerable<int>);
How can you call ConsumeChicken and ConsumeGoat a) without converting FarmsInEachPen() to a List beforehand (it might have two zillion records), and b) without multi-threading?
Basically:
ConsumeChicken(FarmsInEachPen().Select(x => x.Chickens));
ConsumeGoats(FarmsInEachPen().Select(x => x.Goats));
But without forcing the double enumeration.
I can solve it with multithreading, but it gets unnecessarily complicated with a buffer queue for each list.
So I'm looking for a way to split the AnimalCount enumerator into two int enumerators without fully evaluating AnimalCount. There is no problem running ConsumeGoat and ConsumeChicken together in lock-step.
I can feel the solution just out of my grasp but I'm not quite there. I'm thinking along the lines of a helper function that returns an IEnumerable being fed into ConsumeChicken and, each time the iterator is used, it internally calls ConsumeGoat, thus executing the two functions in lock-step. Except, of course, I don't want to call ConsumeGoat more than once.
I don't think there is a way to do what you want: ConsumeChickens(IEnumerable<int>) and ConsumeGoats(IEnumerable<int>) are called sequentially, and each of them enumerates the list separately. How do you expect that to work without two separate enumerations of the list?
Depending on the situation, a better solution is to have ConsumeChicken(int) and ConsumeGoat(int) methods (which each consume a single item), and call them in alternation. Like this:
foreach(var animal in animals)
{
    ConsumeChicken(animal.Chickens);
    ConsumeGoat(animal.Goats);
}
This will enumerate the animals collection only once.
Also, a note: depending on your LINQ-provider and what exactly it is you're trying to do, there may be better options. For example, if you're trying to get the total sum of both chickens and goats from a database using linq-to-sql or linq-to-entities, the following query..
from a in animals
group a by 0 into g
select new
{
TotalChickens = g.Sum(x => x.Chickens),
TotalGoats = g.Sum(x => x.Goats)
}
will result in a single query, and do the summation on the database-end, which is greatly preferable to pulling the entire table over and doing the summation on the client end.
The way you have posed your problem, there is no way to do this. IEnumerable<T> is a pull enumerable - that is, you can GetEnumerator to the front of the sequence and then repeatedly ask "Give me the next item" (MoveNext/Current). You can't, on one thread, have two different things pulling from the animals.Select(a => a.Chickens) and animals.Select(a => a.Goats) at the same time. You would have to do one then the other (which would require materializing the second).
The suggestion BlueRaja made is one way to change the problem slightly. I would suggest going that route.
The other alternative is to utilize IObservable<T> from Microsoft's reactive extensions (Rx), a push enumerable. I won't go into the details of how you would do that, but it's something you could look into.
Edit:
The above is assuming that ConsumeChickens and ConsumeGoats are both returning void or are at least not returning IEnumerable<T> themselves - which seems like an obvious assumption. I'd appreciate it if the lame downvoter would actually comment.
Actually, the simplest way to achieve what you want is to convert the FarmsInEachPen() return value into a push collection, i.e. an IObservable, and work with it via the Reactive Extensions. Note that Do only describes a side effect and nothing runs until there is a subscription, so Subscribe is what wires up the two consumers:
var subject = new Subject<AnimalCount>();
subject.Subscribe(x => DoSomethingWithChicken(x.Chickens));
subject.Subscribe(x => DoSomethingWithGoat(x.Goats));
foreach (var item in FarmsInEachPen())
{
    subject.OnNext(item);
}
subject.OnCompleted();
I figured it out, thanks in large part to the path that @Lee put me on.
You need to share a single enumerator between the two zips, and use an adapter function to project the correct element into the sequence.
private static IEnumerable<object> ConsumeChickens(IEnumerable<int> xList)
{
foreach (var x in xList)
{
Console.WriteLine("X: " + x);
yield return null;
}
}
private static IEnumerable<object> ConsumeGoats(IEnumerable<int> yList)
{
foreach (var y in yList)
{
Console.WriteLine("Y: " + y);
yield return null;
}
}
private static IEnumerable<int> SelectHelper(IEnumerator<AnimalCount> enumerator, int i)
{
    // only the chicken side (i == 0) advances the shared enumerator;
    // the goat side just reads Current after each advance
    bool c = i != 0 || enumerator.MoveNext();
    while (c)
    {
        if (i == 0)
        {
            yield return enumerator.Current.Chickens;
            c = enumerator.MoveNext();
        }
        else
        {
            yield return enumerator.Current.Goats;
        }
    }
}
private static void Main(string[] args)
{
    var enumerator = GetAnimals().GetEnumerator();
    var chickensList = ConsumeChickens(SelectHelper(enumerator, 0));
    var goatsList = ConsumeGoats(SelectHelper(enumerator, 1));
    // Zip pulls the two sequences in strict alternation, so the shared
    // enumerator advances exactly once per AnimalCount
    var temp = chickensList.Zip(goatsList, (i, i1) => (object) null);
    temp.ToList(); // force evaluation
    // Console.WriteLine("Total iterations: " + iterations); // (iterations counter not shown here)
}

Filter an IEnumerable<string> for unwanted strings

Edit: I have received a few very good suggestions; I will try to work through them and accept an answer at some point.
I have a large list of strings (800k) that I would like to filter in the quickest time possible for a list of unwanted words (ultimately profanity, but it could be anything).
The result I would ultimately like to see would be a list such as
Hello,World,My,Name,Is,Yakyb,Shell
would become
World,My,Name,Is,Yakyb
after being checked against
Hell,Heaven.
My code so far is
var words = items
.Distinct()
.AsParallel()
.Where(x => !WordContains(x, WordsUnwanted));
public static bool WordContains(string word, List<string> words)
{
for (int i = 0; i < words.Count; i++)
{
if (word.Contains(words[i]))
{
return true;
}
}
return false;
}
This is currently taking about 2.3 seconds (9.5 without parallel) to process 800k words, which as a one-off is no big deal. However, as a learning exercise, is there a quicker way of processing?
The unwanted words list is 100 words long.
None of the words contain punctuation or spaces.
Step taken to remove duplicates in all lists.
Step taken to see if working with an array is quicker (it isn't); interestingly, changing the parameter words to a string[] makes it 25% slower.
Step adding AsParallel() has reduced the time to ~2.3 seconds.
Try the method called Except.
http://msdn.microsoft.com/en-AU/library/system.linq.enumerable.except.aspx
var words = new List<string>() {"Hello","Hey","Cat"};
var filter = new List<string>() {"Cat"};
var filtered = words.Except(filter);
Also how about:
var words = new List<string>() {"Hello","Hey","cat"};
var filter = new List<string>() {"Cat"};
// Perhaps a Except() here to match exact strings without substrings first?
var filtered = words.Where(i=> !ContainsAny(i,filter)).AsParallel();
// You could experiment with AsParallel() and see
// if running the query parallel yields faster results on larger string[]
// AsParallel probably not worth the cost unless list is large
public bool ContainsAny(string str, IEnumerable<string> values)
{
if (!string.IsNullOrEmpty(str) && values.Any())
{
foreach (string value in values)
{
// Ignore case comparison from #TimSchmelter
if (str.IndexOf(value, StringComparison.OrdinalIgnoreCase) != -1) return true;
//if(str.ToLowerInvariant().Contains(value.ToLowerInvariant()))
// return true;
}
}
return false;
}
A couple of things.
Alteration 1 (nice and simple):
I was able to speed up the run (fractionally) by using a HashSet instead of the Distinct method.
var words = new HashSet<string>(items) //this uses HashCodes
.AsParallel()...
Alteration 2 (bear with me ;)):
Regarding @Tim's comment, Contains alone may not be enough when searching for blacklisted words. For example, Takeshita is a street name.
You have already identified that you want the stemmed form of the word; for example, Apples would be treated as Apple. To do this we can use stemming algorithms such as the Porter Stemmer.
If we are to stem a word then we may not need to do Contains(x); we can use Equals(x) or, even better, compare the hash codes (the fastest way).
var filter = new HashSet<string>(
new[] {"hello", "of", "this", "and", "for", "is",
"bye", "the", "see", "in", "an",
"top", "v", "t", "e", "a" });
var list = new HashSet<string> (items)
.AsParallel()
.Where(x => !filter.Contains(new PorterStemmer().Stem(x)))
.ToList();
this will compare the words on their hash codes, int == int.
The use of the stemmer did not slow things down, as we complemented it with the HashSet (O(1) lookup for the filter list). And this returned a larger list of results.
I am using the Porter Stemmer located in the Lucene.Net code; it is not thread-safe, thus we new one up each time.
Issue with Alteration 2, Alteration 2a: as with most natural language processing, it's not simple. What happens when
the word is a combination of banned words, e.g. "GrrArgh" (where Grr and Argh are banned);
the word is spelt intentionally wrong, e.g. "Frack", but still has the same meaning as a banned word (sorry to the forum ppl);
the word is spelt with spaces, e.g. "G r r";
the banned "word" is not a word but a phrase; poor example: "son of a Barrel".
With forums, they use humans to fill these gaps.
Or a whitelist is introduced (given that you have mentioned big-O: we are checking 2 lists for every item, so the performance hit is 2n^2; drop the leading constants and, if I remember correctly, you are left with n^2, but I'm a little rusty on my big-O).
Change your WordContains method to use a single Aho-Corasick search instead of ~100 Contains calls (and of course initialize the Aho-Corasick search tree just once).
You can find an open-source implementation here: http://www.codeproject.com/script/Articles/ViewDownloads.aspx?aid=12383.
After initialization of the StringSearch class you will call the method public bool ContainsAny(string text) for each of your 800k strings.
A single call will take O(length of the string) time no matter how long your list of unwanted words is.
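Usage would look roughly like this (a sketch - check the exact constructor and member names against the article's code):
// build the Aho-Corasick search tree once, up front
StringSearch search = new StringSearch(WordsUnwanted.ToArray());
// each check is then a single pass over the word
List<string> words = items
    .Distinct()
    .Where(x => !search.ContainsAny(x))
    .ToList();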
I was interested to see if I could come up with a faster way of doing this - but I only managed one little optimization. That was to check the index of a string occurring within another, because it firstly seems to be slightly faster than Contains and secondly lets you specify case insensitivity (if that is useful to you).
Included below is a test class I wrote - I have used >1 million words and am searching using a case-sensitive test in all cases. It tests your method, and also a regular expression I am trying to build up on the fly. You can try it for yourself and see the timings; the regular expression doesn't work as fast as the method you provided, but then I could be building it incorrectly. I use (?i) before (word1|word2...) to specify case insensitivity in a regular expression (I would love to find out how that could be optimised - it's probably suffering from the classic backtracking problem!).
The searching methods (be it regular expressions or the original method provided) seem to get progressively slower as more 'unwanted' words are added.
Anyway - hope this simple test helps you out a bit:
class Program
{
static void Main(string[] args)
{
//Load your string here - I got War and Peace from Project Gutenberg (http://www.gutenberg.org/ebooks/2600.txt.utf-8) and loaded it twice to give 1.2 million words
List<string> loaded = File.ReadAllText(@"D:\Temp\2600.txt").Split(new string[] { " " }, StringSplitOptions.RemoveEmptyEntries).ToList();
List<string> items = new List<string>();
items.AddRange(loaded);
items.AddRange(loaded);
Console.WriteLine("Loaded {0} words", items.Count);
Stopwatch sw = new Stopwatch();
List<string> WordsUnwanted = new List<string> { "Hell", "Heaven", "and", "or", "big", "the", "when", "ur", "cat" };
StringBuilder regexBuilder = new StringBuilder("(?i)(");
foreach (string s in WordsUnwanted)
{
regexBuilder.Append(s);
regexBuilder.Append("|");
}
regexBuilder.Replace("|", ")", regexBuilder.Length - 1, 1);
string regularExpression = regexBuilder.ToString();
Console.WriteLine(regularExpression);
List<string> words = null;
bool loop = true;
while (loop)
{
Console.WriteLine("Enter test type - 1, 2, 3, 4 or Q to quit");
ConsoleKeyInfo testType = Console.ReadKey();
switch (testType.Key)
{
case ConsoleKey.D1:
sw.Reset();
sw.Start();
words = items
.Distinct()
.AsParallel()
.Where(x => !WordContains(x, WordsUnwanted)).ToList();
sw.Stop();
Console.WriteLine("Parallel (original) process took {0}ms and found {1} matching words", sw.ElapsedMilliseconds, words.Count);
words = null;
break;
case ConsoleKey.D2:
sw.Reset();
sw.Start();
words = items
.Distinct()
.Where(x => !WordContains(x, WordsUnwanted)).ToList();
sw.Stop();
Console.WriteLine("Non-Parallel (original) process took {0}ms and found {1} matching words", sw.ElapsedMilliseconds, words.Count);
words = null;
break;
case ConsoleKey.D3:
sw.Reset();
sw.Start();
words = items
.Distinct()
.AsParallel()
.Where(x => !Regex.IsMatch(x, regularExpression)).ToList();
sw.Stop();
Console.WriteLine("Non-Compiled regex (parallel) Process took {0}ms and found {1} matching words", sw.ElapsedMilliseconds, words.Count);
words = null;
break;
case ConsoleKey.D4:
sw.Reset();
sw.Start();
words = items
.Distinct()
.Where(x => !Regex.IsMatch(x, regularExpression)).ToList();
sw.Stop();
Console.WriteLine("Non-Compiled regex (non-parallel) Process took {0}ms and found {1} matching words", sw.ElapsedMilliseconds, words.Count);
words = null;
break;
case ConsoleKey.Q:
loop = false;
break;
default:
continue;
}
}
}
public static bool WordContains(string word, List<string> words)
{
for (int i = 0; i < words.Count; i++)
{
//Found that this was a bit faster and also lets you check the casing...!
//if (word.Contains(words[i]))
if (word.IndexOf(words[i], StringComparison.InvariantCultureIgnoreCase) >= 0)
return true;
}
return false;
}
}
Ah, filtering words based on matches from a "bad" list. This is a clbuttic problem that has tested the consbreastution of many programmers. My mate from Scunthorpe wrote a dissertation on it.
What you really want to avoid is a solution that tests a word in O(lm), where l is the length of the word to test and m is the number of bad words. To do that, you need a solution other than looping through the bad words. I had thought that a regular expression would solve this, but I forgot that typical implementations have an internal data structure that grows with every alternation. As one of the other answers says, Aho-Corasick is the algorithm that does this. The standard implementation finds all matches; yours would be more efficient, since you could bail out at the first match. I think this provides a theoretically optimal solution.

Which is faster for removing items from a List<object> - RemoveAll or a foreach loop?

For removing an object where a property equals a value, which is faster?
foreach(object o in objects)
{
if(o.name == "John Smith")
{
objects.Remove(o);
break;
}
}
or
objects.RemoveAll(o => o.Name == "John Smith");
Thanks!
EDIT:
I should have mentioned this is removing one object from the collection and then breaking out of the loop, which prevents the errors you have described, although using a for loop with the count is the better option!
If you really want to know if one thing is faster than another, benchmark it. In other words, measure, don't guess! This is probably my favorite mantra.
As well as the fact that you're breaking the rules in the first one (modifying the list during the processing of it, leading me to invoke my second mantra: You can't get any more unoptimised than "wrong"), the second is more readable and that's usually what I aim for first.
And, just to complete my unholy trinity of mantras: Optimise for readability first, then optimise for speed only where necessary :-)
From a List<string> of 10,000 items, the speeds are:
for loop: 110,000 ticks
lambda: 1,000 ticks
From this information, we can conclude that the lambda expression is faster.
The source code I used can be found here.
Note that I substituted your foreach with a for loop, since we aren't able to modify values within a foreach loop.
Assuming you meant something like
for(int i = 0; i < objects.Count; i++)
{
if(objects[i].name == "John Smith")
{
objects.Remove(objects[i--]);
}
}
RemoveAll would be faster in this case, since with Remove you are iterating over the list again (IndexOf) when you already have the position.
Here is List.Remove
public bool Remove(T item)
{
int index = this.IndexOf(item);
if (index >= 0)
{
this.RemoveAt(index);
return true;
}
return false;
}
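For comparison, a sketch of the earlier loop rewritten to use the index it already has - RemoveAt skips the IndexOf scan that Remove performs:
for (int i = 0; i < objects.Count; i++)
{
    if (objects[i].name == "John Smith")
    {
        objects.RemoveAt(i--); // step back so the next item isn't skipped
    }
}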

C# Efficient Substring with many inputs

Assuming I do not want to use external libraries or more than a dozen or so extra lines of code (i.e. clear code, not code golf code), can I do better than string.Contains to handle a collection of input strings and a collection of keywords to check for?
Obviously one can use objString.Contains(objString2) to do a simple substring check. However, there are many well-known algorithms which are able to do better than this under special circumstances, particularly if one is working with multiple strings. But sticking such an algorithm into my code would probably add length and complexity, so I'd rather use some sort of shortcut based on a built in function.
E.g. an input would be a collection of strings, a collection of positive keywords, and a collection of negative keywords. Output would be the subset of the first collection (the strings), each of which contains at least 1 positive keyword but 0 negative keywords.
Oh, and please don't mention regular expressions as a suggested solutions.
It may be that my requirements are mutually exclusive (not much extra code, no external libraries or regex, better than String.Contains), but I thought I'd ask.
Edit:
A lot of people are only offering silly improvements that won't beat an intelligently used call to contains by much, if anything. Some people are trying to call Contains more intelligently, which completely misses the point of my question. So here's an example of a problem to try solving. LBushkin's solution is an example of someone offering a solution that probably is asymptotically better than standard contains:
Suppose you have 10,000 positive keywords of length 5-15 characters, 0 negative keywords (this seems to confuse people), and one 1,000,000-character string. Check if the 1,000,000-character string contains at least 1 of the positive keywords.
I suppose one solution is to create an FSA. Another is delimit on spaces and use hashes.
Your discussion of "negative and positive" keywords is somewhat confusing - and could use some clarification to get more complete answers.
As with all performance related questions - you should first write the simple version and then profile it to determine where the bottlenecks are - these can be unintuitive and hard to predict. Having said that...
One way to optimize the search (if you are always searching for "words", and not phrases that could contain spaces) would be to build a search index from your string.
The search index could either be a sorted array (for binary search) or a dictionary. A dictionary would likely prove faster, both because dictionaries are hashmaps internally with O(1) lookup, and because a dictionary will naturally eliminate duplicate values in the search source, thereby reducing the number of comparisons you need to perform.
The general search algorithm is:
For each string you are searching against:
Take the string you are searching within and tokenize it into individual words (delimited by whitespace)
Populate the tokens into a search index (either a sorted array or dictionary)
Search the index for your "negative keywords", if one is found, skip to the next search string
Search the index for your "positive keywords", when one is found, add it to a dictionary as they (you could also track a count of how often the word appears)
Here's an example using a sorted array and binary search in C# 2.0:
NOTE: You could switch from string[] to List<string> easily enough, I leave that to you.
string[] FindKeyWordOccurence( string[] stringsToSearch,
                               string[] positiveKeywords,
                               string[] negativeKeywords )
{
    Dictionary<string,int> foundKeywords = new Dictionary<string,int>();
    foreach( string searchIn in stringsToSearch )
    {
        // tokenize and sort the input to make searches faster
        string[] tokenizedList = searchIn.Split( ' ' );
        Array.Sort( tokenizedList );
        // if any negative keywords exist, skip this search string
        // (a flag is needed here; a bare continue would only affect
        // the inner keyword loop, not the outer search-string loop)
        bool hasNegative = false;
        foreach( string negKeyword in negativeKeywords )
        {
            if( Array.BinarySearch( tokenizedList, negKeyword ) >= 0 )
            {
                hasNegative = true;
                break;
            }
        }
        if( hasNegative )
            continue; // skip to next search string...
        // for each positive keyword, add to dictionary to keep track of it
        // we could have also used a SortedList, but the dictionary is easier
        foreach( string posKeyword in positiveKeywords )
            if( Array.BinarySearch( tokenizedList, posKeyword ) >= 0 )
                foundKeywords[posKeyword] = 1;
    }
    // convert the Keys in the dictionary (our found keywords) to an array...
    string[] foundKeywordsArray = new string[foundKeywords.Keys.Count];
    foundKeywords.Keys.CopyTo( foundKeywordsArray, 0 );
    return foundKeywordsArray;
}
Here's a version that uses a dictionary-based index and LINQ in C# 3.0:
NOTE: This is not the most LINQ-y way to do it, I could use Union() and SelectMany() to write the entire algorithm as a single big LINQ statement - but I find this to be easier to understand.
public IEnumerable<string> FindOccurences( IEnumerable<string> searchStrings,
                                           IEnumerable<string> positiveKeywords,
                                           IEnumerable<string> negativeKeywords )
{
    var foundKeywordsDict = new Dictionary<string, int>();
    foreach( var searchIn in searchStrings )
    {
        // tokenize the search string; Distinct() avoids the duplicate-key
        // exception ToDictionary would throw on repeated words...
        var tokenizedDictionary = searchIn.Split( ' ' ).Distinct().ToDictionary( x => x );
        // skip if any negative keywords exist...
        if( negativeKeywords.Any( tokenizedDictionary.ContainsKey ) )
            continue;
        // merge found positive keywords into dictionary...
        // an example of where Enumerable.ForEach() would be nice...
        var found = positiveKeywords.Where( tokenizedDictionary.ContainsKey );
        foreach( var keyword in found )
            foundKeywordsDict[keyword] = 1;
    }
    return foundKeywordsDict.Keys;
}
If you add this extension method:
public static bool ContainsAny(this string testString, IEnumerable<string> keywords)
{
foreach (var keyword in keywords)
{
if (testString.Contains(keyword))
return true;
}
return false;
}
Then this becomes a one line statement:
var results = testStrings.Where(t => !t.ContainsAny(badKeywordCollection)).Where(t => t.ContainsAny(goodKeywordCollection));
This isn't necessarily any faster than doing the contains checks, except that it will do them efficiently, due to LINQ's streaming of results preventing any unnecessary Contains calls. Plus, the resulting code being a one-liner is nice.
If you're truly just looking for space-delimited words, this code would be a very simple implementation:
static void Main(string[] args)
{
string sIn = "This is a string that isn't nearly as long as it should be " +
"but should still serve to prove an algorithm";
string[] sFor = { "string", "as", "not" };
Console.WriteLine(string.Join(", ", FindAny(sIn, sFor)));
}
private static string[] FindAny(string searchIn, string[] searchFor)
{
HashSet<String> hsIn = new HashSet<string>(searchIn.Split());
HashSet<String> hsFor = new HashSet<string>(searchFor);
return hsIn.Intersect(hsFor).ToArray();
}
If you only wanted a yes/no answer (as I see now may have been the case) there's another method of hashset "Overlaps" that's probably better optimized for that:
private static bool FindAny(string searchIn, string[] searchFor)
{
HashSet<String> hsIn = new HashSet<string>(searchIn.Split());
HashSet<String> hsFor = new HashSet<string>(searchFor);
return hsIn.Overlaps(hsFor);
}
Well, there is the Split() method you can call on a string. You could split your input strings into arrays of words using Split() then do a one-to-one check of words with keywords. I have no idea if or under what circumstances this would be faster than using Contains(), however.
First get rid of all the strings that contain negative words. I would suggest doing this using the Contains method. I would think that Contains() is faster than splitting, sorting, and searching.
Seems to me that the best way to do this is take your match strings (both positive and negative) and compute a hash of them. Then march through your million string computing n hashes (in your case it's 10 for strings of length 5-15) and match against the hashes for your match strings. If you get hash matches, then you do an actual string compare to rule out the false positive. There are a number of good ways to optimize this by bucketing your match strings by length and creating hashes based on the string size for a particular bucket.
So you get something like:
IList<Bucket> buckets = BuildBuckets(matchStrings);
int shortestLength = buckets[0].Length;
for (int i = 0; i <= inputString.Length - shortestLength; i++) {
    foreach (Bucket b in buckets) {
        if (i + b.Length > inputString.Length)
            continue;
        string candidate = inputString.Substring(i, b.Length);
        int hash = ComputeHash(candidate);
        foreach (MatchString match in b.MatchStrings) {
            if (hash != match.Hash)
                continue;
            if (candidate == match.String) {
                if (match.IsPositive) {
                    // positive case
                }
                else {
                    // negative case
                }
            }
        }
    }
}
To optimize Contains(), you need a tree (or trie) structure of your positive/negative words.
That should speed up everything (O(n) vs O(nm), n=size of string, m=avg word size) and the code is relatively small & easy.
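For illustration, a minimal sketch of that idea (the names are mine, not a library's; note that a plain trie scan like this is O(n × longest word) per string - getting to true O(n) needs the failure links of a full Aho-Corasick automaton, as another answer mentions):
class TrieNode
{
    public Dictionary<char, TrieNode> Children = new Dictionary<char, TrieNode>();
    public bool IsWordEnd;
}

class KeywordTrie
{
    private readonly TrieNode root = new TrieNode();

    public void Add(string word)
    {
        TrieNode node = root;
        foreach (char c in word)
        {
            TrieNode next;
            if (!node.Children.TryGetValue(c, out next))
            {
                next = new TrieNode();
                node.Children[c] = next;
            }
            node = next;
        }
        node.IsWordEnd = true;
    }

    // At each start position, walk the trie until it falls off or hits a
    // stored keyword; one shared structure replaces a Contains call per keyword.
    public bool ContainsAny(string text)
    {
        for (int i = 0; i < text.Length; i++)
        {
            TrieNode node = root;
            for (int j = i; j < text.Length; j++)
            {
                if (!node.Children.TryGetValue(text[j], out node))
                    break;
                if (node.IsWordEnd)
                    return true;
            }
        }
        return false;
    }
}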

Design pattern for aggregating lazy lists

I'm writing a program as follows:
Find all files with the correct extension in a given directory
Foreach, find all occurrences of a given string in those files
Print each line
I'd like to write this in a functional way, as a series of generator functions (things that call yield return and only return one item at a time lazily-loaded), so my code would read like this:
IEnumerable<string> allFiles = GetAllFiles();
IEnumerable<string> matchingFiles = GetMatches( "*.txt", allFiles );
IEnumerable<string> contents = GetFileContents( matchingFiles );
IEnumerable<string> matchingLines = GetMatchingLines( contents );
foreach( var lineText in matchingLines )
Console.WriteLine( "Found: " + lineText );
This is all fine, but what I'd also like to do is print some statistics at the end. Something like this:
Found 233 matches in 150 matching files. Scanned 3,297 total files in 5.72s
The problem is, writing the code in a 'pure functional' style like above, each item is lazily loaded.
You only know how many files matched in total once the final foreach loop completes, and because only one item is ever yielded at a time, the code doesn't have any place to keep track of how many things it has found previously. If you invoke LINQ's matchingLines.Count() method, it will re-enumerate the collection!
I can think of many ways to solve this problem, but all of them seem to be somewhat ugly. It strikes me as something that people are bound to have done before, and I'm sure there'll be a nice design pattern which shows a best practice way of doing this.
Any ideas? Cheers
In a similar vein to other answers, but taking a slightly more generic approach ...
... why not create a Decorator class that can wrap an existing IEnumerable implementation and calculate the statistic as it passes the items through?
Here's a Counter class I just threw together - but you could create variations for other kinds of aggregation too.
public class Counter<T> : IEnumerable<T>
{
public int Count { get; private set; }
public Counter(IEnumerable<T> source)
{
mSource = source;
Count = 0;
}
public IEnumerator<T> GetEnumerator()
{
foreach (var item in mSource)
{
Count++;
yield return item;
}
}
IEnumerator IEnumerable.GetEnumerator()
{
// delegate to the generic enumerator so the counting logic lives in one place
return GetEnumerator();
}
private IEnumerable<T> mSource;
}
You could create three instances of Counter:
One to wrap GetAllFiles() counting the total number of files;
One to wrap GetMatches() counting the number of matching files; and
One to wrap GetMatchingLines() counting the number of matching lines.
The key with this approach is that you're not layering multiple responsibilities onto your existing classes/methods - the GetMatchingLines() method only handles the matching, you're not asking it to track stats as well.
Clarification in response to a comment by Mitcham:
The final code would look something like this:
var files = new Counter<string>( GetAllFiles());
var matchingFiles = new Counter<string>(GetMatches( "*.txt", files ));
var contents = GetFileContents( matchingFiles );
var linesFound = new Counter<string>(GetMatchingLines( contents ));
foreach( var lineText in linesFound )
Console.WriteLine( "Found: " + lineText );
string message
= String.Format(
"Found {0} matches in {1} matching files. Scanned {2} files",
linesFound.Count,
matchingFiles.Count,
files.Count);
Console.WriteLine(message);
Note that this is still a functional approach - the variables used are immutable (more like bindings than variables), and the overall function has no side-effects.
I would say that you need to encapsulate the process into a 'Matcher' class in which your methods capture statistics as they progress.
public class Matcher
{
    private int totalFileCount;
    private int matchedCount;
    private DateTime start;
    private int lineCount;
    private DateTime stop;

    public IEnumerable<string> Match(string pattern)
    {
        start = DateTime.Now;
        // yield the matches through, then report once enumeration finishes
        foreach (File file in GetMatchedFiles(pattern))
        {
            yield return file.FileName;
        }
        stop = DateTime.Now;
        System.Console.WriteLine(string.Format(
            "Found {0} matches in {1} matching files." +
            " {2} total files scanned in {3}.",
            lineCount, matchedCount,
            totalFileCount, (stop - start).ToString()));
    }

    private IEnumerable<File> GetMatchedFiles(string pattern)
    {
        foreach (File file in SomeFileRetrievalMethod())
        {
            totalFileCount++;
            if (MatchPattern(pattern, file.FileName))
            {
                matchedCount++;
                yield return file;
            }
        }
    }
}
I'll stop there since I'm supposed to be coding work stuff, but the general idea is there. The entire point of 'pure' functional programming is to not have side effects, and this type of statistics calculation is a side effect.
I can think of two ideas
Pass in a context object and return (string + context) from your enumerators - the purely functional solution
use thread-local storage for your statistics (CallContext); you can be fancy and support a stack of contexts. So you would have code like this:
using (var stats = DirStats.Create())
{
IEnumerable<string> allFiles = GetAllFiles();
IEnumerable<string> matchingFiles = GetMatches( "*.txt", allFiles );
IEnumerable<string> contents = GetFileContents( matchingFiles );
stats.Print();
IEnumerable<string> matchingLines = GetMatchingLines( contents );
stats.Print();
}
If you're happy to turn your code upside down, you might be interested in Push LINQ. The basic idea is to reverse the "pull" model of IEnumerable<T> and turn it into a "push" model with observers - each part of the pipeline effectively pushes its data past any number of observers (using event handlers) which typically form new parts of the pipeline. This gives a really easy way to hook up multiple aggregates to the same data.
See this blog entry for some more details. I gave a talk on it in London a while ago - my page of talks has a few links for sample code, the slide deck, video etc.
It's a fun little project, but it does take a bit of getting your head around.
I took Bevan's code and refactored it around until I was content. Fun stuff.
public class Counter
{
public int Count { get; set; }
}
public static class CounterExtensions
{
public static IEnumerable<T> ObserveCount<T>
(this IEnumerable<T> source, Counter count)
{
foreach (T t in source)
{
count.Count++;
yield return t;
}
}
public static IEnumerable<T> ObserveCount<T>
(this IEnumerable<T> source, IList<Counter> counters)
{
Counter c = new Counter();
counters.Add(c);
return source.ObserveCount(c);
}
}
public static class CounterTest
{
public static void Test1()
{
IList<Counter> counters = new List<Counter>();
//
IEnumerable<int> step1 =
Enumerable.Range(0, 100).ObserveCount(counters);
//
IEnumerable<int> step2 =
step1.Where(i => i % 10 == 0).ObserveCount(counters);
//
IEnumerable<int> step3 =
step2.Take(3).ObserveCount(counters);
//
step3.ToList();
foreach (Counter c in counters)
{
Console.WriteLine(c.Count);
}
}
}
Output as expected: 21, 3, 3
Assuming those functions are your own, the only thing I can think of is the Visitor pattern, passing in an abstract visitor function that calls you back when each thing happens. For example: pass an ILineVisitor into GetFileContents (which I'm assuming breaks up the file into lines). ILineVisitor would have a method like OnVisitLine(String line), you could then implement the ILineVisitor and make it keep the appropriate stats. Rinse and repeat with a ILineMatchVisitor, IFileVisitor etc. Or you could use a single IVisitor with an OnVisit() method which has a different semantic in each case.
Your functions would each need to take a Visitor, and call its OnVisit() at the appropriate time, which may seem annoying, but at least the visitor could be used to do lots of interesting things, other than just what you're doing here. In fact you could actually avoid writing GetMatchingLines by passing a visitor that checks for the match in OnVisitLine(String line) into GetFileContents.
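A rough sketch of that idea (the ILineVisitor interface is the one proposed here, not an existing library type, and File.ReadLines merely stands in for however the lines are actually produced):
interface ILineVisitor
{
    void OnVisitLine(string line);
}

class LineCounter : ILineVisitor
{
    public int Count { get; private set; }
    public void OnVisitLine(string line) { Count++; }
}

// yields lines as before, but reports each one to the visitor
// as a side channel for statistics
static IEnumerable<string> GetFileContents(IEnumerable<string> files, ILineVisitor visitor)
{
    foreach (string file in files)
    {
        foreach (string line in File.ReadLines(file))
        {
            visitor.OnVisitLine(line);
            yield return line;
        }
    }
}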
Is this one of the ugly things you'd already considered?
