Fast way to use String.Contains with a huge list in C#

I have something like this:

List<string> listUser = new List<string>();
listUser.Add("user1");
listUser.Add("user2");
listUser.Add("userhacker");
listUser.Add("user1other");

List<string> key_blacklist = new List<string>();
key_blacklist.Add("hacker");
key_blacklist.Add("other");

foreach (string user in listUser)
{
    foreach (string key in key_blacklist)
    {
        if (user.Contains(key))
        {
            // remove it from listUser
            // (note: removing from a list while foreach-ing over it throws,
            // so in practice the matches are collected and removed afterwards)
        }
    }
}
The desired result in listUser is: user1, user2.
The problem is that if I have a huge listUser (more than 10 million entries) and a huge key_blacklist (100,000 entries), that code is very, very slow.
Is there any way to make it faster?
UPDATE: I found a new solution here:
http://cc.davelozinski.com/c-sharp/fastest-way-to-check-if-a-string-occurs-within-a-string
Hope it helps someone who ends up here! :)

If you don't have much control over how the list of users is constructed, you can at least test each item in the list in parallel, which on modern machines with multiple cores will speed up the checking a fair bit.
var filtered = listUser.AsParallel().Where(
    s =>
    {
        foreach (var key in key_blacklist)
        {
            if (s.Contains(key))
            {
                return false; // not to be included
            }
        }
        return true; // to be included, as no match with the blacklist
    }).ToList(); // materialize the result, otherwise the lazy query never runs
Also - do you have to use .Contains? An exact .Equals test - or better, a hash-based lookup such as HashSet<string>.Contains - is going to be much, much quicker, because in almost all cases a non-match is determined when the hash codes differ, which costs only an integer comparison. Super quick.
If you do need .Contains, you may want to think about restructuring the app. What do these strings in the list really represent? Separate sub-groups of users? Could you test each string, at the time it's added, for whether it represents a user on the blacklist?
UPDATE: In response to @Rawling's comment below - if you know that there is a finite set of usernames which have, say, "hacker" as a substring, that set would have to be pretty large before running an .Equals test of each such username against a candidate would be slower than running .Contains on the candidate. This is because computing a HashCode is really quick.
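As a minimal sketch of that exact-match idea, assuming the blacklist holds complete usernames rather than substrings (the names here are made up for illustration):

using System;
using System.Collections.Generic;

class ExactMatchDemo
{
    static void Main()
    {
        // Assumption: blacklist entries are full usernames, not fragments.
        var blacklist = new HashSet<string> { "userhacker", "user1other" };
        var listUser = new List<string> { "user1", "user2", "userhacker", "user1other" };

        // One O(1) hash lookup per user, instead of
        // users x keys substring scans with .Contains.
        listUser.RemoveAll(u => blacklist.Contains(u));

        Console.WriteLine(string.Join(", ", listUser)); // user1, user2
    }
}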

If you are using Entity Framework or LINQ to SQL, then building this as a LINQ query and sending it to the server can improve the performance.
Instead of removing the items, you query for the items that fulfil the requirement, i.e. users whose names don't contain any banned expression:

listUser.Where(u => !key_blacklist.Any(u.Contains)).ToList();
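As a quick in-memory usage sketch with the question's sample data:

var listUser = new List<string> { "user1", "user2", "userhacker", "user1other" };
var key_blacklist = new List<string> { "hacker", "other" };

var clean = listUser.Where(u => !key_blacklist.Any(u.Contains)).ToList();
// clean now holds: user1, user2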

A possible solution is to use a tree-like data structure.
The basic idea is to have the blacklisted words organised like this:
+ h
| + ha
| + hac
| - hacker
| - [other words beginning with hac]
|
+ f
| + fu
| + fuk
| - fukoff
| - [other words beginning with fuk]
Then, when you check for blacklisted words, you avoid searching the whole list of words beginning with "hac" if you find out that your user string does not even contain "h".
In the example I provided with your sample data this of course makes no difference, but with real data sets it should significantly reduce the number of Contains calls, since you don't check against the full list of blacklisted words every time.
Here is a code example (please note that the code is quick and dirty; it is just there to illustrate the idea):
using System;
using System.Collections.Generic;
using System.Linq;

class Program {
    class Blacklist {
        public string Start;
        public int Level;
        const int MaxLevel = 3;
        public Dictionary<string, Blacklist> SubBlacklists = new Dictionary<string, Blacklist>();
        public List<string> BlacklistedWords = new List<string>();

        public Blacklist() {
            Start = string.Empty;
            Level = 0;
        }

        Blacklist(string start, int level) {
            Start = start;
            Level = level;
        }

        public void AddBlacklistedWord(string word) {
            if (word.Length > Level && Level < MaxLevel) {
                string index = word.Substring(0, Level + 1);
                Blacklist sublist = null;
                if (!SubBlacklists.TryGetValue(index, out sublist)) {
                    sublist = new Blacklist(index, Level + 1);
                    SubBlacklists[index] = sublist;
                }
                sublist.AddBlacklistedWord(word);
            } else {
                BlacklistedWords.Add(word);
            }
        }

        public bool ContainsBlacklistedWord(string wordToCheck) {
            if (wordToCheck.Length > Level && Level < MaxLevel) {
                // Check every sublist whose prefix occurs in the string;
                // returning on the first prefix hit would wrongly skip
                // blacklisted words reachable through other prefixes.
                foreach (var sublist in SubBlacklists.Values) {
                    if (wordToCheck.Contains(sublist.Start) &&
                        sublist.ContainsBlacklistedWord(wordToCheck)) {
                        return true;
                    }
                }
            }
            return BlacklistedWords.Any(x => wordToCheck.Contains(x));
        }
    }

    static void Main(string[] args) {
        List<string> listUser = new List<string>();
        listUser.Add("user1");
        listUser.Add("user2");
        listUser.Add("userhacker");
        listUser.Add("userfukoff1");

        Blacklist blacklist = new Blacklist();
        blacklist.AddBlacklistedWord("hacker");
        blacklist.AddBlacklistedWord("fukoff");

        foreach (string user in listUser) {
            if (blacklist.ContainsBlacklistedWord(user)) {
                Console.WriteLine("Contains blacklisted word: {0}", user);
            }
        }
    }
}

You are using the wrong thing. If you have a lot of data, you should be using either HashSet<T> or SortedSet<T>. If you don't need the data sorted, go with HashSet<T>. Here is a program I wrote to demonstrate the time differences:
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Text;

class Program
{
    private static readonly Random random = new Random((int)DateTime.Now.Ticks);

    static void Main(string[] args)
    {
        Console.WriteLine("Creating Lists...");

        var stringList = new List<string>();
        var hashList = new HashSet<string>();
        var sortedList = new SortedSet<string>();
        var searchWords1 = new string[3];
        int ndx = 0;

        for (int x = 0; x < 1000000; x++)
        {
            string str = RandomString(10);
            if (x == 5 || x == 500000 || x == 999999)
            {
                str = "Z" + str;
                searchWords1[ndx] = str;
                ndx++;
            }
            stringList.Add(str);
            hashList.Add(str);
            sortedList.Add(str);
        }

        Console.WriteLine("Lists created!");

        var sw = new Stopwatch();
        sw.Start();
        bool search1 = stringList.Contains(searchWords1[2]);
        sw.Stop();
        Console.WriteLine("List<T> {0} ==> {1}ms", search1, sw.ElapsedMilliseconds);

        sw.Reset();
        sw.Start();
        search1 = hashList.Contains(searchWords1[2]);
        sw.Stop();
        Console.WriteLine("HashSet<T> {0} ==> {1}ms", search1, sw.ElapsedMilliseconds);

        sw.Reset();
        sw.Start();
        search1 = sortedList.Contains(searchWords1[2]);
        sw.Stop();
        Console.WriteLine("SortedSet<T> {0} ==> {1}ms", search1, sw.ElapsedMilliseconds);
    }

    private static string RandomString(int size)
    {
        var builder = new StringBuilder();
        char ch;
        for (int i = 0; i < size; i++)
        {
            // Random uppercase letter A-Z (65 is 'A').
            ch = Convert.ToChar(Convert.ToInt32(Math.Floor(26 * random.NextDouble() + 65)));
            builder.Append(ch);
        }
        return builder.ToString();
    }
}
On my machine, I got the following results:
Creating Lists...
Lists created!
List<T> True ==> 15ms
HashSet<T> True ==> 0ms
SortedSet<T> True ==> 0ms
As you can see, List<T> was extremely slow compared to HashSet<T> and SortedSet<T>. Those were almost instantaneous.

Related

How to split a string into an array of two letter substrings with C#

Problem
Given a sample string abcdef, I am trying to split it into an array of two-character string elements that should result in ['ab','cd','ef'].
What I tried
I tried to iterate through the string while storing the substring at the current index in an array I declared inside the method, but I am getting this output:
['ab','bc','cd','de','ef']
Code I used
static string[] mymethod(string str)
{
    string[] r = new string[str.Length];
    for (int i = 0; i < str.Length - 1; i++)
    {
        r[i] = str.Substring(i, 2);
    }
    return r;
}
Any solution that corrects the code to return the expected output is really welcome. Thanks!
Your problem is that you incremented your index by 1 instead of 2 each time:
var res = new List<string>();
for (int i = 0; i < x.Length - 1; i += 2) // x is the input string
{
    res.Add(x.Substring(i, 2));
}

should work.
EDIT:
Because you asked for a default _ suffix in case of an odd number of characters, this is the change:
var testString = "odd";
string workOn = testString.Length % 2 != 0
    ? testString + "_"
    : testString;

var res = new List<string>();
for (int i = 0; i < workOn.Length - 1; i += 2)
{
    res.Add(workOn.Substring(i, 2));
}
Two things to note:
In .NET 6, Chunk() is available, so you can use that as suggested in other answers.
This solution might not be the best for very long inputs.
So it really depends on your inputs and expectations.
.NET 6 has an Enumerable.Chunk() extension method that you can use to do this, as follows:
public static void Main()
{
    string[] result =
        "abcdef"
            .Chunk(2)
            .Select(chunk => new string(chunk)).ToArray();

    Console.WriteLine(string.Join(", ", result)); // Prints "ab, cd, ef"
}
Before .NET 6, you can use MoreLinq.Batch() to do the same thing.
[EDIT] In response to the request below:
MoreLinq is a set of LINQ utilities originally written by Jon Skeet. You can find an implementation by going to Project | Manage NuGet Packages, then browsing for MoreLinq and installing it.
After installing it, add using MoreLinq.Extensions; and then you'll be able to use the MoreLinq.Batch extension like so:
public static void Main()
{
    string[] result = "abcdef"
        .Batch(2)
        .Select(chunk => new string(chunk.ToArray())).ToArray();

    Console.WriteLine(string.Join(", ", result)); // Prints "ab, cd, ef"
}
Note that there is no string constructor that accepts an IEnumerable<char>, hence the need for the chunk.ToArray() above.
I would say, though, that including the whole of MoreLinq just for one extension method is perhaps overkill. You could just write your own version of Enumerable.Chunk():
public static class MyBatch
{
    public static IEnumerable<T[]> Chunk<T>(this IEnumerable<T> self, int size)
    {
        T[] bucket = null;
        int count = 0;

        foreach (var item in self)
        {
            if (bucket == null)
                bucket = new T[size];

            bucket[count++] = item;

            if (count != size)
                continue;

            yield return bucket;
            bucket = null;
            count = 0;
        }

        // Emit the final, partially-filled bucket, if any.
        if (bucket != null && count > 0)
            yield return bucket.Take(count).ToArray();
    }
}
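Usage then has the same shape as the .NET 6 built-in; on frameworks earlier than .NET 6 (where it won't collide with the real Chunk) this compiles against the extension above:

string[] result = "abcdef"
    .Chunk(2)
    .Select(chunk => new string(chunk))
    .ToArray(); // ["ab", "cd", "ef"]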
If you are using the latest .NET version, i.e. .NET 6.0 RC 1, then you can try the Chunk() method:

var strChunks = "abcdef".Chunk(2); // [['a', 'b'], ['c', 'd'], ['e', 'f']]
var result = strChunks.Select(x => new string(x)).ToArray(); // ["ab", "cd", "ef"]

Note: I am unable to test this on a fiddle or my local machine because this version of .NET is so new.
With LINQ you can achieve it in the following way:
char[] word = "abcdefg".ToCharArray();

var evenCharacters = word.Where((_, idx) => idx % 2 == 0);
var oddCharacters = word.Where((_, idx) => idx % 2 == 1);

var twoCharacterLongSplits = evenCharacters
    .Zip(oddCharacters)
    .Select(pair => new char[] { pair.First, pair.Second });
The trick is the following; we create two collections:
one where we have only those characters where the original index was even (% 2 == 0)
one where we have only those characters where the original index was odd (% 2 == 1)
Then we zip them: we create a tuple by taking one item from the even collection and one item from the odd collection, then another tuple from the next item of each, and so on.
Lastly, we convert the tuples to arrays to have the desired output format.
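One caveat: Zip stops at the shorter sequence, so an odd-length input silently drops its last character. A small illustration of that behavior:

char[] word = "abcdefg".ToCharArray(); // 7 characters
// evenCharacters: a, c, e, g
// oddCharacters:  b, d, f
// Zip pairs items until oddCharacters runs out, so 'g' is lost:
// result: ['a','b'], ['c','d'], ['e','f']

If trailing characters matter, pad the input to an even length first (as the earlier answer does with a "_" suffix).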
You are on the right track, but you need to increment by 2, not by 1. You also need to check that the string has not ended before taking the second character, or you risk running into an index-out-of-bounds exception. Try the code I've written below; I've tried it and it works. Best!
public static List<string> splitstring(string str)
{
    List<string> result = new List<string>();
    int strlen = str.Length;

    for (int i = 0; i < strlen; i += 2)
    {
        string currentstr = str[i].ToString();
        if (i + 1 <= strlen - 1) // guard against odd-length input
        {
            currentstr += str[i + 1].ToString();
        }
        result.Add(currentstr);
    }
    return result;
}

For vs. Linq - Performance vs. Future

Very brief question. I have a randomly sorted large string array (100K+ entries) where I want to find the first occurrence of a desired string. I have two solutions.
From having read what I can, my guess is that the for loop currently gives slightly better performance (but that margin could always change), while I also find the LINQ version much more readable. On balance, which method is generally considered current best coding practice, and why?
string matchString = "dsf897sdf78";
int matchIndex = -1;
for (int i = 0; i < array.Length; i++)
{
    if (array[i] == matchString)
    {
        matchIndex = i;
        break;
    }
}
or
int matchIndex = array.Select((r, i) => new { value = r, index = i })
                      .Where(t => t.value == matchString)
                      .Select(s => s.index).First();
The best practice depends on what you need:
Development speed and maintainability: LINQ
Performance (according to profiling tools): manual code
LINQ really does slow things down with all the indirection. Don't worry about it as 99% of your code does not impact end user performance.
I started with C++ and really learnt how to optimize a piece of code. LINQ is not suited to get the most out of your CPU. So if you measure a LINQ query to be a problem just ditch it. But only then.
For your code sample I'd estimate a 3x slowdown. The allocations (and subsequent GC!) and indirections through the lambdas really hurt.
Slightly better performance? A loop will give SIGNIFICANTLY better performance!
Consider the code below. On my system for a RELEASE (not debug) build, it gives:
Found via loop at index 999999 in 00:00:00.2782047
Found via linq at index 999999 in 00:00:02.5864703
Loop was 9.29700432810805 times faster than linq.
The code is deliberately set up so that the item to be found is right at the end. If it was right at the start, things would be quite different.
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;

namespace Demo
{
    public static class Program
    {
        private static void Main(string[] args)
        {
            string[] a = new string[1000000];

            for (int i = 0; i < a.Length; ++i)
            {
                a[i] = "Won't be found";
            }

            string matchString = "Will be found";
            a[a.Length - 1] = "Will be found";

            const int COUNT = 100;

            var sw = Stopwatch.StartNew();
            int matchIndex = -1;

            for (int outer = 0; outer < COUNT; ++outer)
            {
                for (int i = 0; i < a.Length; i++)
                {
                    if (a[i] == matchString)
                    {
                        matchIndex = i;
                        break;
                    }
                }
            }

            sw.Stop();
            Console.WriteLine("Found via loop at index " + matchIndex + " in " + sw.Elapsed);
            double loopTime = sw.Elapsed.TotalSeconds;

            sw.Restart();

            for (int outer = 0; outer < COUNT; ++outer)
            {
                matchIndex = a.Select((r, i) => new { value = r, index = i })
                              .Where(t => t.value == matchString)
                              .Select(s => s.index).First();
            }

            sw.Stop();
            Console.WriteLine("Found via linq at index " + matchIndex + " in " + sw.Elapsed);
            double linqTime = sw.Elapsed.TotalSeconds;

            Console.WriteLine("Loop was {0} times faster than linq.", linqTime / loopTime);
        }
    }
}
LINQ, following the declarative paradigm, expresses the logic of a computation without describing its control flow. The query is goal-oriented and self-describing, and thus easy to analyse and understand. It is also concise. Moreover, using LINQ, one depends highly upon abstraction of the data structure. That brings a high rate of maintainability and reusability.
The iteration approach follows the imperative paradigm. It gives fine-grained control, making it easier to obtain higher performance. The code is also simpler to debug. Sometimes a well-constructed iteration is more readable than a query.
There is always a dilemma between performance and maintainability, and usually (if there are no specific requirements about performance) maintainability should win. Only if you have performance problems should you profile the application, find the source of the problem, and improve its performance (by reducing maintainability at the same time; yes, that's the world we live in).
About your sample: LINQ is not a very good solution here, because it does not add much maintainability to your code. Actually, for me, projecting, filtering, and projecting again looks even worse than a simple loop. What you need here is simply Array.IndexOf, which is more maintainable than a loop and has almost the same performance:
Array.IndexOf(array, matchString)
Well, you gave the answer to your question yourself.
Go with a for loop if you want the best performance, or go with LINQ if you want readability.
Also, perhaps keep in mind the possibility of using Parallel.ForEach(), which would benefit from inline lambda expressions (so, closer to LINQ) and is much more readable than doing the parallelization "manually".
I don't think either is considered best practice; some people prefer looking at LINQ and some don't.
If performance is an issue, I would profile both bits of code for your scenario, and if the difference is negligible, go with the one you feel more comfortable with; after all, it will most likely be you who maintains the code.
Also, have you thought about using PLINQ or making the loop run in parallel?
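For what it's worth, a hedged PLINQ sketch of the same search (the array setup here is illustrative; AsOrdered() keeps First-style operators returning the earliest match, at some cost to parallel throughput):

using System;
using System.Linq;

class ParallelSearchDemo
{
    static void Main()
    {
        string[] array = Enumerable.Repeat("Won't be found", 1000000).ToArray();
        array[array.Length - 1] = "Will be found";
        string matchString = "Will be found";

        var hit = array
            .AsParallel()
            .AsOrdered()
            .Select((value, index) => new { value, index })
            .FirstOrDefault(t => t.value == matchString);

        int matchIndex = hit == null ? -1 : hit.index;
        Console.WriteLine(matchIndex); // 999999
    }
}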
Just an interesting observation. LINQ lambda queries certainly add a penalty over LINQ where queries or a for loop. In the following code, a list is filled with 1,000,001 multi-parameter objects and then searched for a specific item, which in this test is always the last one, using a LINQ lambda, a LINQ where query, and a for loop. Each test iterates 100 times, and the times are then averaged to get the results.
LINQ Lambda Query Average Time: 0.3382 seconds
LINQ Where Query Average Time: 0.238 seconds
For Loop Average Time: 0.2266 seconds
I've run this test over and over, and even increased the iterations, and the spread is pretty much identical, statistically speaking. Sure, we are talking about 1/10 of a second for what is essentially a million-item search, so in the real world, unless something is that intensive, you probably wouldn't even notice. But the LINQ lambda vs. LINQ where query does show a difference in performance; the LINQ where query is nearly the same as the for loop.
private void RunTest()
{
    try
    {
        List<TestObject> mylist = new List<TestObject>();

        for (int i = 0; i <= 1000000; i++)
        {
            TestObject testO = new TestObject(string.Format("Item{0}", i), "1", Guid.NewGuid().ToString());
            mylist.Add(testO);
        }

        mylist.Add(new TestObject("test", "29863", Guid.NewGuid().ToString()));
        string searchtext = "test";
        int iterations = 100;

        // Linq Lambda Test
        List<int> list1 = new List<int>();
        for (int i = 1; i <= iterations; i++)
        {
            DateTime starttime = DateTime.Now;
            TestObject t = mylist.FirstOrDefault(q => q.Name == searchtext);
            int diff = (DateTime.Now - starttime).Milliseconds;
            list1.Add(diff);
        }

        // Linq Where Test
        List<int> list2 = new List<int>();
        for (int i = 1; i <= iterations; i++)
        {
            DateTime starttime = DateTime.Now;
            TestObject t = (from testO in mylist
                            where testO.Name == searchtext
                            select testO).FirstOrDefault();
            int diff = (DateTime.Now - starttime).Milliseconds;
            list2.Add(diff);
        }

        // For Loop Test
        List<int> list3 = new List<int>();
        for (int i = 1; i <= iterations; i++)
        {
            DateTime starttime = DateTime.Now;
            foreach (TestObject testO in mylist)
            {
                if (testO.Name == searchtext)
                {
                    TestObject t = testO;
                    break;
                }
            }
            int diff = (DateTime.Now - starttime).Milliseconds;
            list3.Add(diff);
        }

        double diff1 = list1.Average();
        Debug.WriteLine(string.Format("LINQ Lambda Query Average Time: {0} seconds", diff1 / (double)100));
        double diff2 = list2.Average();
        Debug.WriteLine(string.Format("LINQ Where Query Average Time: {0} seconds", diff2 / (double)100));
        double diff3 = list3.Average();
        Debug.WriteLine(string.Format("For Loop Average Time: {0} seconds", diff3 / (double)100));
    }
    catch (Exception ex)
    {
        Debug.WriteLine(ex.ToString());
    }
}
private class TestObject
{
    public TestObject(string _name, string _value, string _guid)
    {
        Name = _name;
        Value = _value;
        GUID = _guid;
    }

    public string Name;
    public string Value;
    public string GUID;
}
The best option is to use the IndexOf method of the Array class. Since it is specialized for arrays, it will be significantly faster than both LINQ and the for loop.
Improving on Matthew Watson's answer:
using System;
using System.Diagnostics;
using System.Linq;

namespace PerformanceConsoleApp
{
    public class LinqVsFor
    {
        private static void Main(string[] args)
        {
            string[] a = new string[1000000];
            for (int i = 0; i < a.Length; ++i)
            {
                a[i] = "Won't be found";
            }

            string matchString = "Will be found";
            a[a.Length - 1] = "Will be found";

            const int COUNT = 100;
            var sw = Stopwatch.StartNew();

            Loop(a, matchString, COUNT, sw);
            First(a, matchString, COUNT, sw);
            Where(a, matchString, COUNT, sw);
            IndexOf(a, sw, matchString, COUNT);

            Console.ReadLine();
        }

        private static void Loop(string[] a, string matchString, int COUNT, Stopwatch sw)
        {
            int matchIndex = -1;
            for (int outer = 0; outer < COUNT; ++outer)
            {
                for (int i = 0; i < a.Length; i++)
                {
                    if (a[i] == matchString)
                    {
                        matchIndex = i;
                        break;
                    }
                }
            }
            sw.Stop();
            Console.WriteLine("Found via loop at index " + matchIndex + " in " + sw.Elapsed);
        }

        private static void IndexOf(string[] a, Stopwatch sw, string matchString, int COUNT)
        {
            int matchIndex = -1;
            sw.Restart();
            for (int outer = 0; outer < COUNT; ++outer)
            {
                matchIndex = Array.IndexOf(a, matchString);
            }
            sw.Stop();
            Console.WriteLine("Found via IndexOf at index " + matchIndex + " in " + sw.Elapsed);
        }

        private static void First(string[] a, string matchString, int COUNT, Stopwatch sw)
        {
            sw.Restart();
            string str = "";
            for (int outer = 0; outer < COUNT; ++outer)
            {
                str = a.First(t => t == matchString);
            }
            sw.Stop();
            Console.WriteLine("Found via linq First at index " + Array.IndexOf(a, str) + " in " + sw.Elapsed);
        }

        private static void Where(string[] a, string matchString, int COUNT, Stopwatch sw)
        {
            sw.Restart();
            string str = "";
            for (int outer = 0; outer < COUNT; ++outer)
            {
                str = a.Where(t => t == matchString).First();
            }
            sw.Stop();
            Console.WriteLine("Found via linq Where at index " + Array.IndexOf(a, str) + " in " + sw.Elapsed);
        }
    }
}
Output:
Found via loop at index 999999 in 00:00:01.1528531
Found via linq First at index 999999 in 00:00:02.0876573
Found via linq Where at index 999999 in 00:00:01.3313111
Found via IndexOf at index 999999 in 00:00:00.7244812
A bit of a non-answer, and really just an extension to https://stackoverflow.com/a/14894589, but I have been working, on and off, on an API-compatible replacement for LINQ-to-Objects for a while now. It still doesn't provide the performance of a hand-coded loop, but it is faster for many (most?) LINQ scenarios. It does create more garbage and has slightly heavier up-front costs.
The code is available at https://github.com/manofstick/Cistern.Linq
A NuGet package is available at https://www.nuget.org/packages/Cistern.Linq/ (I can't claim this to be battle-hardened; use at your own risk).
Taking the code from Matthew Watson's answer (https://stackoverflow.com/a/14894589) with two slight tweaks, we get the time down to "only" ~3.5 times worse than the hand-coded loop. On my machine it takes about 1/3 of the time of the original System.Linq version.
The two changes, replacing:

using System.Linq;
...
matchIndex = a.Select((r, i) => new { value = r, index = i })
              .Where(t => t.value == matchString)
              .Select(s => s.index).First();

With the following:

// a complete replacement for System.Linq
using Cistern.Linq;
...
// use a value tuple rather than an anonymous type
matchIndex = a.Select((r, i) => (value: r, index: i))
              .Where(t => t.value == matchString)
              .Select(s => s.index).First();
So the library itself is a work in progress. It fails a couple of edge cases from corefx's System.Linq test suite. It also still needs a few functions converted over (they currently have the corefx System.Linq implementation, which is compatible from an API perspective, if not a performance perspective). But anyone who wants to help, comment, etc. would be appreciated...

Quickest way to filter a list of words

I have a list with several words. I want to filter out those which don't match a specific pattern. Is it quicker to add all the matches to a temporary list and copy this list to the main list afterwards? Or is it quicker to remove all the mismatches from the main list?
I have to filter 10,000 words as quickly as possible, so I'm grateful for every little speed increase.
Edit:
string characters = "aAbBcC";

// currentMatches contains all the words from the beginning
List<string> currentMatches = new List<string>();
List<string> newMatches = new List<string>();

foreach (string word in currentMatches)
{
    if (characters.IndexOf(word[0]) > -1) // word match
    {
        newMatches.Add(word);
    }
}
currentMatches = newMatches;
The foreach loop checks whether each word begins with one of the characters in characters. Here I copy every match to newMatches before copying all the new matches back to currentMatches.
Assuming a List<T>, you'll have to take the following into consideration:
If Count is less than Capacity, the Add method is an O(1) operation. If the capacity needs to be increased to accommodate the new element, it becomes an O(n) operation, where n is Count.
The RemoveAt method is an O(n) operation, where n is (Count - index).
If you create the list that holds the matches with an initial capacity set to the total word count, then Add will always be O(1) and faster. However, you need to take into consideration the overhead of creating this new list with a capacity set to the total word count.
Bottom line, you need to test it and see what works better for your specific scenario.
Here is an example I threw together on how to time the methods. There are many ways to do this, and I think you're going to have to try out a few. You can use information like that in João Angelo's post to help direct you towards good approaches, but here are a few. Also, if you're willing to spend the time, you could put this all in a loop that creates a new list, runs all of the tests, and puts the TimeSpan results into a collection instead of Console.WriteLine'ing them, and then reports the mean over however many iterations of the test you ran.
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

namespace Test
{
    public class Program
    {
        public static void Main(string[] args)
        {
            List<string> testList = CreateTestList();
            const string filter = "abc";

            TimeNewListMethod(FilterIntoNewListWithLinq, testList, filter);
            TimeInPlaceMethod(FilterInPlaceWithLinq, testList, filter);
            TimeNewListMethod(FilterIntoNewListWithForEach, testList, filter);
            TimeInPlaceMethod(FilterInPlaceWithRemoveAll, testList, filter);

            Console.Read();
        }

        public static void TimeInPlaceMethod(Action<List<string>, string> testMethod, List<string> toFilter, string filter)
        {
            List<string> toFilterCopy = new List<string>(toFilter);
            DateTime time = DateTime.Now;
            testMethod(toFilterCopy, filter);
            Console.WriteLine(DateTime.Now - time);
        }

        public static void TimeNewListMethod(Func<List<string>, string, List<string>> testMethod, List<string> toFilter, string filter)
        {
            List<string> toFilterCopy = new List<string>(toFilter);
            List<string> resultList;
            DateTime time = DateTime.Now;
            resultList = testMethod(toFilterCopy, filter);
            Console.WriteLine(DateTime.Now - time);
        }

        public static List<string> FilterIntoNewListWithLinq(List<string> toFilter, string filter)
        {
            return toFilter.Where(element => element.IndexOf(filter) > -1).ToList();
        }

        public static void FilterInPlaceWithLinq(List<string> toFilter, string filter)
        {
            // Note: reassigning the parameter does not change the caller's
            // list; this only measures the cost of the query itself.
            toFilter = toFilter.Where(element => element.IndexOf(filter) > -1).ToList();
        }

        public static List<string> FilterIntoNewListWithForEach(List<string> toFilter, string filter)
        {
            List<string> returnList = new List<string>(toFilter.Count);
            foreach (string word in toFilter)
            {
                if (word.IndexOf(filter) > -1)
                {
                    returnList.Add(word);
                }
            }
            return returnList;
        }

        public static void FilterInPlaceWithRemoveAll(List<string> toFilter, string filter)
        {
            toFilter.RemoveAll(element => element.IndexOf(filter) == -1);
        }

        public static List<string> CreateTestList(int elements = 10000, int wordLength = 6)
        {
            List<string> returnList = new List<string>();
            StringBuilder nextWord = new StringBuilder();

            for (int i = 0; i < elements; i++)
            {
                for (int j = 0; j < wordLength; j++)
                {
                    nextWord.Append(RandomCharacter());
                }
                returnList.Add(nextWord.ToString());
                nextWord.Clear();
            }
            return returnList;
        }

        public static char RandomCharacter()
        {
            return (char)('a' + rand.Next(0, 26)); // 'a' through 'z'
        }

        public static Random rand = new Random();
    }
}
The whole

characters.IndexOf(word[0]) > -1

was a little unfamiliar to me, so I would go for something more readable and maintainable for future programmers. It took me a minute to figure out that you are checking the first char of each string in the list, looking for a match in the set { a, A, b, B, c, C }. It works, but to me it was a little cryptic. Having taken the time to read through it, I would do it like this:

foreach (string word in currentMatches)
{
    if (Regex.IsMatch(word, "^([A-Ca-c])"))
    {
        newMatches.Add(word);
    }
}

I would not worry about copying the temp list back to the initial list. You've already defined it and filled it; go ahead and use it.

What's wrong with my implementation of the KMP algorithm?

static void Main(string[] args)
{
    string str = "ABC ABCDAB ABCDABCDABDE"; // We should add some text here
                                            // for the performance tests.
    string pattern = "ABCDABD";
    List<int> shifts = new List<int>();
    Stopwatch stopWatch = new Stopwatch();

    stopWatch.Start();
    NaiveStringMatcher(shifts, str, pattern);
    stopWatch.Stop();
    Trace.WriteLine(String.Format("Naive string matcher {0}", stopWatch.Elapsed));
    foreach (int s in shifts)
    {
        Trace.WriteLine(s);
    }

    shifts.Clear();
    stopWatch.Restart();
    int[] pi = new int[pattern.Length];
    Knuth_Morris_Pratt(shifts, str, pattern, pi);
    stopWatch.Stop();
    Trace.WriteLine(String.Format("Knuth_Morris_Pratt {0}", stopWatch.Elapsed));
    foreach (int s in shifts)
    {
        Trace.WriteLine(s);
    }

    Console.ReadKey();
}
static IList<int> NaiveStringMatcher(List<int> shifts, string text, string pattern)
{
    int lengthText = text.Length;
    int lengthPattern = pattern.Length;

    for (int s = 0; s < lengthText - lengthPattern + 1; s++)
    {
        if (text[s] == pattern[0])
        {
            int i = 0;
            while (i < lengthPattern)
            {
                if (text[s + i] == pattern[i])
                    i++;
                else break;
            }
            if (i == lengthPattern)
            {
                shifts.Add(s);
            }
        }
    }
    return shifts;
}
static IList<int> Knuth_Morris_Pratt(List<int> shifts, string text, string pattern, int[] pi)
{
    int patternLength = pattern.Length;
    int textLength = text.Length;

    //ComputePrefixFunction(pattern, pi);
    int j;
    for (int i = 1; i < pi.Length; i++)
    {
        j = 0;
        while ((i < pi.Length) && (pattern[i] == pattern[j]))
        {
            j++;
            pi[i++] = j;
        }
    }

    int matchedSymNum = 0;
    for (int i = 0; i < textLength; i++)
    {
        while (matchedSymNum > 0 && pattern[matchedSymNum] != text[i])
            matchedSymNum = pi[matchedSymNum - 1];
        if (pattern[matchedSymNum] == text[i])
            matchedSymNum++;
        if (matchedSymNum == patternLength)
        {
            shifts.Add(i - patternLength + 1);
            matchedSymNum = pi[matchedSymNum - 1];
        }
    }
    return shifts;
}
Why does my implementation of the KMP algorithm run more slowly than the naive string-matching algorithm?
The KMP algorithm has two phases: first it builds a table, and then it does a search, directed by the contents of the table.
The naive algorithm has one phase: it does a search. It does that search much less efficiently in the worst case than the KMP search phase.
If the KMP is slower than the naive algorithm then that is probably because building the table is taking you longer than it takes to simply search the string naively in the first place. Naive string matching is usually very fast on short strings. There is a reason why we don't use fancy-pants algorithms like KMP inside the BCL implementations of string searching. By the time you set up the table, you could have done half a dozen searches of short strings with the naive algorithm.
KMP is only a win if you have enormous strings and you are doing lots of searches that allow you to re-use an already-built table. You need to amortize away the huge cost of building the table by doing lots of searches using that table.
And also, the naive algorithm only has bad performance in bizarre and unlikely scenarios. Most people are searching for words like "London" in strings like "Buckingham Palace, London, England", and not searching for strings like "BANANANANANANA" in strings like "BANAN BANBAN BANBANANA BANAN BANAN BANANAN BANANANANANANANANAN...". The naive search algorithm is optimal for the first problem and highly sub-optimal for the latter problem; but it makes sense to optimize for the former, not the latter.
Another way to put it: if the searched-for string is of length w and the searched-in string is of length n, then KMP is O(n) + O(w). The Naive algorithm is worst case O(nw), best case O(n + w). But that says nothing about the "constant factor"! The constant factor of the KMP algorithm is much larger than the constant factor of the naive algorithm. The value of n has to be awfully big, and the number of sub-optimal partial matches has to be awfully large, for the KMP algorithm to win over the blazingly fast naive algorithm.
That deals with the algorithmic complexity issues. Your methodology is also not very good, and that might explain your results. Remember, the first time you run code, the jitter has to jit the IL into assembly code. That can take longer than running the method in some cases. You really should be running the code a few hundred thousand times in a loop, discarding the first result, and taking an average of the timings of the rest.
If you really want to know what is going on you should be using a profiler to determine what the hot spot is. Again, make sure you are measuring the post-jit run, not the run where the code is jitted, if you want to have results that are not skewed by the jit time.
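As a rough sketch of that methodology, using the identifiers from the question's Main (the warm-up call and iteration count are arbitrary choices, not prescriptions):

// Warm up so the jitted code, not the jitter, is what gets timed.
var shifts = new List<int>();
NaiveStringMatcher(shifts, str, pattern);

const int Runs = 100000;
var sw = Stopwatch.StartNew();
for (int run = 0; run < Runs; run++)
{
    shifts.Clear();
    NaiveStringMatcher(shifts, str, pattern);
}
sw.Stop();
Console.WriteLine("Average: {0} ms", sw.Elapsed.TotalMilliseconds / Runs);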
Your example is too small, and it does not have enough repetitions of the pattern for KMP's avoidance of backtracking to pay off.
KMP can be slower than the normal search in some cases.
A simple KMP substring-search implementation:
https://github.com/bharathkumarms/AlgorithmsMadeEasy/blob/master/AlgorithmsMadeEasy/KMPSubstringSearch.cs
using System;
using System.Collections.Generic;
using System.Linq;

namespace AlgorithmsMadeEasy
{
    class KMPSubstringSearch
    {
        public void KMPSubstringSearchMethod()
        {
            string text = System.Console.ReadLine();
            char[] sText = text.ToCharArray();

            string pattern = System.Console.ReadLine();
            char[] sPattern = pattern.ToCharArray();

            // Build the failure table: tempStorage[i] holds the length of the
            // longest proper prefix of the pattern that is also a suffix of
            // pattern[0..i].
            int forwardPointer = 1;
            int backwardPointer = 0;
            int[] tempStorage = new int[sPattern.Length];
            tempStorage[0] = 0;
            while (forwardPointer < sPattern.Length)
            {
                if (sPattern[forwardPointer].Equals(sPattern[backwardPointer]))
                {
                    tempStorage[forwardPointer] = backwardPointer + 1;
                    forwardPointer++;
                    backwardPointer++;
                }
                else
                {
                    if (backwardPointer == 0)
                    {
                        tempStorage[forwardPointer] = 0;
                        forwardPointer++;
                    }
                    else
                    {
                        // Fall back to the previous border; note the -1
                        // (falling back to tempStorage[backwardPointer]
                        // loops forever on patterns like "aab").
                        backwardPointer = tempStorage[backwardPointer - 1];
                    }
                }
            }

            int pointer = 0;
            int successPoints = sPattern.Length;
            bool success = false;
            for (int i = 0; i < sText.Length; i++)
            {
                if (sText[i].Equals(sPattern[pointer]))
                {
                    pointer++;
                }
                else
                {
                    if (pointer != 0)
                    {
                        int tempPointer = pointer - 1;
                        pointer = tempStorage[tempPointer];
                        i--;
                    }
                }
                if (successPoints == pointer)
                {
                    success = true;
                    // Reset via the failure table so the scan can continue
                    // without reading past the end of the pattern.
                    pointer = tempStorage[pointer - 1];
                }
            }

            if (success)
            {
                System.Console.WriteLine("TRUE");
            }
            else
            {
                System.Console.WriteLine("FALSE");
            }
            System.Console.Read();
        }
    }
}
/*
 * Sample Input:
 * abxabcabcaby
 * abcaby
 */

Performance ideas (in-memory C# hashset and contains too slow)

I have the following code
private void LoadIntoMemory()
{
    //Init large HashSet
    HashSet<document> hsAllDocuments = new HashSet<document>();

    //Get first rows from database
    List<document> docsList = document.GetAllAboveDocID(0, 500000);

    //Load objects into dictionary
    foreach (document d in docsList)
    {
        hsAllDocuments.Add(d);
    }

    Application["dicAllDocuments"] = hsAllDocuments;
}

private HashSet<document> documentHits(HashSet<document> hsRawHit, HashSet<document> hsAllDocuments, string query, string[] queryArray)
{
    int counter = 0;
    const int maxCount = 1000;

    foreach (document d in hsAllDocuments)
    {
        //Headline
        if (d.Headline.Contains(query))
        {
            if (counter >= maxCount)
                break;
            hsRawHit.Add(d);
            counter++;
        }

        //Description
        if (d.Description.Contains(query))
        {
            if (counter >= maxCount)
                break;
            hsRawHit.Add(d);
            counter++;
        }

        //splitted query word by word
        //string[] queryArray = query.Split(' ');
        if (queryArray.Count() > 1)
        {
            foreach (string q in queryArray)
            {
                if (d.Headline.Contains(q))
                {
                    if (counter >= maxCount)
                        break;
                    hsRawHit.Add(d);
                    counter++;
                }

                //Description
                if (d.Description.Contains(q))
                {
                    if (counter >= maxCount)
                        break;
                    hsRawHit.Add(d);
                    counter++;
                }
            }
        }
    }
    return hsRawHit;
}
First I load all the data into a HashSet (stored in Application for later use) - that runs fine, and it's totally OK for it to be slow for what I'm doing.
This will be running on the 4.0 framework in C# (I can't move to the newer release with the async additions).
The documentHits method runs fairly slowly on my current setup - considering that it's all in memory. What can I do to speed this method up?
Examples would be awesome - thanks.
I see that you are using a HashSet, but you are not using any of its advantages, so you should just use a List instead.
What's taking time is looping through all the documents and looking for strings within other strings, so you should try to eliminate as much of that as possible.
One possibility is to set up indexes of which documents contain which character pairs. If the query string contains Hello, you would look only at the documents that contain He, el, ll and lo.
You could set up a Dictionary<string, List<int>> where the dictionary key is the character combination and the list contains indexes into your document list. Setting up the dictionary will take some time, of course, but you can focus on the character combinations that are less common. If a character combination exists in 80% of the documents, it's pretty useless for eliminating documents, but if a character combination exists in only 2% of the documents, it has eliminated 98% of your work.
If you loop through the documents in the list and add occurrences to the lists in the dictionary, the lists of indexes will be sorted, so it will be very easy to join the lists later on. When you add indexes to a list, you can throw the list away when it gets too large and you see that it would not be useful for eliminating documents. That way you will only keep the shorter lists, and they will not consume so much memory.
Edit:
I put together a small example:
public class IndexElliminator<T> {
    private List<T> _items;
    private Dictionary<string, List<int>> _index;
    private Func<T, string> _getContent;

    private static HashSet<string> GetPairs(string value) {
        HashSet<string> pairs = new HashSet<string>();
        for (int i = 1; i < value.Length; i++) {
            pairs.Add(value.Substring(i - 1, 2));
        }
        return pairs;
    }

    public IndexElliminator(List<T> items, Func<T, string> getContent, int maxIndexSize) {
        _items = items;
        _getContent = getContent;
        _index = new Dictionary<string, List<int>>();
        for (int index = 0; index < _items.Count; index++) {
            T item = _items[index];
            foreach (string pair in GetPairs(_getContent(item))) {
                List<int> list;
                if (_index.TryGetValue(pair, out list)) {
                    if (list != null) {
                        if (list.Count == maxIndexSize) {
                            // Too common to be useful: drop the list but keep
                            // the key, marking the pair as "in many documents".
                            _index[pair] = null;
                        } else {
                            list.Add(index);
                        }
                    }
                } else {
                    list = new List<int>();
                    list.Add(index);
                    _index.Add(pair, list);
                }
            }
        }
    }

    private static List<int> JoinLists(List<int> list1, List<int> list2) {
        List<int> result = new List<int>();
        int i1 = 0, i2 = 0;
        while (i1 < list1.Count && i2 < list2.Count) {
            switch (Math.Sign(list1[i1].CompareTo(list2[i2]))) {
                case 0: result.Add(list1[i1]); i1++; i2++; break;
                case -1: i1++; break;
                case 1: i2++; break;
            }
        }
        return result;
    }

    public List<T> Find(string query) {
        HashSet<string> pairs = GetPairs(query);
        List<List<int>> indexes = new List<List<int>>();
        bool found = false;
        foreach (string pair in pairs) {
            List<int> list;
            if (_index.TryGetValue(pair, out list)) {
                found = true;
                if (list != null) {
                    indexes.Add(list);
                }
            }
        }

        List<T> result = new List<T>();
        if (found && indexes.Count == 0) {
            // All pairs were too common to index: scan everything.
            indexes.Add(Enumerable.Range(0, _items.Count).ToList());
        }
        if (indexes.Count > 0) {
            while (indexes.Count > 1) {
                indexes[indexes.Count - 2] = JoinLists(indexes[indexes.Count - 2], indexes[indexes.Count - 1]);
                indexes.RemoveAt(indexes.Count - 1);
            }
            foreach (int index in indexes[0]) {
                if (_getContent(_items[index]).Contains(query)) {
                    result.Add(_items[index]);
                }
            }
        }
        return result;
    }
}
Test:

List<string> items = new List<string> {
    "Hello world",
    "How are you",
    "What is this",
    "Can this be true",
    "Some phrases",
    "Words upon words",
    "What to do",
    "Where to go",
    "When is this",
    "How can this be",
    "Well above margin",
    "Close to the center"
};

IndexElliminator<string> index = new IndexElliminator<string>(items, s => s, items.Count / 2);

List<string> found = index.Find("this");
foreach (string s in found) Console.WriteLine(s);
Output:
What is this
Can this be true
When is this
How can this be
You are running linearly through all documents to find matches - this is O(n). You could do better if you solved the inverse problem, similarly to how a fulltext index works: start with the query terms and preprocess the set of documents that match each query term. Since this might get complicated, I would suggest just using a DB with fulltext capability; that will be much faster than your approach.
Also, you are abusing HashSet - instead just use a List, and don't put in duplicates; all the cases in documentHits() that produce a match should be exclusive.
If you have a whole lot of time at the start to create the database, you can look into using a trie.
A trie will make the string search much faster.
There's a little explanation and an implementation at the end here.
Another implementation: Trie class
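For illustration only - this is a generic prefix-trie sketch, not the implementation behind the links above:

using System;
using System.Collections.Generic;

// Minimal prefix trie: fast "does any indexed word start with X" checks.
class Trie
{
    private readonly Dictionary<char, Trie> _children = new Dictionary<char, Trie>();
    private bool _isWordEnd;

    public void Insert(string word)
    {
        Trie node = this;
        foreach (char c in word)
        {
            Trie child;
            if (!node._children.TryGetValue(c, out child))
            {
                child = new Trie();
                node._children[c] = child;
            }
            node = child;
        }
        node._isWordEnd = true;
    }

    public bool ContainsPrefix(string prefix)
    {
        Trie node = this;
        foreach (char c in prefix)
        {
            if (!node._children.TryGetValue(c, out node))
                return false;
        }
        return true;
    }
}

Note that a plain trie answers prefix queries; for the substring matching in the question you would have to index every suffix of every document, trading considerably more memory for lookup speed.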
You should not run each document through all the test steps!
Instead, you should move on to the next document after the first successful test.
After

hsRawHit.Add(d);
counter++;

you should add continue;:

hsRawHit.Add(d);
counter++;
continue;
