Very brief question. I have a randomly sorted large string array (100K+ entries) where I want to find the first occurance of a desired string. I have two solutions.
From having read what I can my guess is that the 'for loop' is going to currently give slightly better performance (but this margin could always change), but I also find the linq version much more readable. On balance which method is generally considered current best coding practice and why?
string matchString = "dsf897sdf78";
int matchIndex = -1;
for(int i=0; i<array.length; i++)
{
if(array[i]==matchString)
{
matchIndex = i;
break;
}
}
or
int matchIndex = array.Select((r, i) => new { value = r, index = i })
.Where(t => t.value == matchString)
.Select(s => s.index).First();
The best practice depends on what you need:
Development speed and maintainability: LINQ
Performance (according to profiling tools): manual code
LINQ really does slow things down with all the indirection. Don't worry about it as 99% of your code does not impact end user performance.
I started with C++ and really learnt how to optimize a piece of code. LINQ is not suited to get the most out of your CPU. So if you measure a LINQ query to be a problem just ditch it. But only then.
For your code sample I'd estimate a 3x slowdown. The allocations (and subsequent GC!) and indirections through the lambdas really hurt.
Slightly better performance? A loop will give SIGNIFICANTLY better performance!
Consider the code below. On my system for a RELEASE (not debug) build, it gives:
Found via loop at index 999999 in 00:00:00.2782047
Found via linq at index 999999 in 00:00:02.5864703
Loop was 9.29700432810805 times faster than linq.
The code is deliberately set up so that the item to be found is right at the end. If it was right at the start, things would be quite different.
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;
namespace Demo
{
public static class Program
{
private static void Main(string[] args)
{
string[] a = new string[1000000];
for (int i = 0; i < a.Length; ++i)
{
a[i] = "Won't be found";
}
string matchString = "Will be found";
a[a.Length - 1] = "Will be found";
const int COUNT = 100;
var sw = Stopwatch.StartNew();
int matchIndex = -1;
for (int outer = 0; outer < COUNT; ++outer)
{
for (int i = 0; i < a.Length; i++)
{
if (a[i] == matchString)
{
matchIndex = i;
break;
}
}
}
sw.Stop();
Console.WriteLine("Found via loop at index " + matchIndex + " in " + sw.Elapsed);
double loopTime = sw.Elapsed.TotalSeconds;
sw.Restart();
for (int outer = 0; outer < COUNT; ++outer)
{
matchIndex = a.Select((r, i) => new { value = r, index = i })
.Where(t => t.value == matchString)
.Select(s => s.index).First();
}
sw.Stop();
Console.WriteLine("Found via linq at index " + matchIndex + " in " + sw.Elapsed);
double linqTime = sw.Elapsed.TotalSeconds;
Console.WriteLine("Loop was {0} times faster than linq.", linqTime/loopTime);
}
}
}
LINQ, according to declarative paradigm, expresses the logic of a computation without describing its control flow. The query is goal oriented, selfdescribing and thus easy to analyse and understand. Is also concise. Moreover, using LINQ, one depends highly upon abstraction of data structure. That involves high rate of maintanability and reusability.
Iteration aproach addresses imperative paradigm. It gives fine-grained control, thus ease obtain higher performance. The code is also simpler to debug. Sometimes well contructed iteration is more readable than query.
There is always dilemma between performance and maintainability. And usually (if there is no specific requirements about performance) maintainability should win. Only if you have performance problems, then you should profile application, find problem source, and improve its performance (by reducing maintainability at same time, yes that's the world we live in).
About your sample. Linq is not very good solution here, because it do not add match maintainability into your code. Actually for me projecting, filtering, and projecting again looks even worse, than simple loop. What you need here is simple Array.IndexOf, which is more maintainable, than loop, and have almost same performance:
Array.IndexOf(array, matchString)
Well, you gave the answer to your question yourself.
Go with a For loop if you want the best performance, or go with Linq if you want readability.
Also perhaps keep in mind the possibility of using Parallel.Foreach() which would benefit from in-line lambda expressions (so, more closer to Linq), and that is much more readable then doing paralelization "manually".
I don't think either is considered best practice some people prefer looking at LINQ and some don't.
If performance is a issue the I would profile both bits of code for your scenario and if the difference is negligible then go with the one you feel more conformable with, after all it will most likely be you who maintains the code.
Also have you thought about using PLINQ or making the loop run in parallel?
Just an interesting observation. LINQ Lambda queries for sure add a penalty over LINQ Where queries or a For Loop. In the following code, it fills a list with 1000001 multi-parameter objects and then searches for a specific item that in this test will always be the last one, using a LINQ Lamba, a LINQ Where Query and a For Loop. Each test iterates 100 times and then averages the times to get the results.
LINQ Lambda Query Average Time: 0.3382 seconds
LINQ Where Query Average Time: 0.238 seconds
For Loop Average Time: 0.2266 seconds
I've run this test over and over, and even increase the iteration and the spread is pretty much identical statistically speaking. Sure we are talking 1/10 of a second for essentially that a million item search. So in the real world, unless something is that intensive, not sure you would even notice. But if you do the LINQ Lambda vs LINQ Where query does have a difference in performance. The LINQ Where is near the same as the For Loop.
private void RunTest()
{
try
{
List<TestObject> mylist = new List<TestObject>();
for (int i = 0; i <= 1000000; i++)
{
TestObject testO = new TestObject(string.Format("Item{0}", i), 1, Guid.NewGuid().ToString());
mylist.Add(testO);
}
mylist.Add(new TestObject("test", "29863", Guid.NewGuid().ToString()));
string searchtext = "test";
int iterations = 100;
// Linq Lambda Test
List<int> list1 = new List<int>();
for (int i = 1; i <= iterations; i++)
{
DateTime starttime = DateTime.Now;
TestObject t = mylist.FirstOrDefault(q => q.Name == searchtext);
int diff = (DateTime.Now - starttime).Milliseconds;
list1.Add(diff);
}
// Linq Where Test
List<int> list2 = new List<int>();
for (int i = 1; i <= iterations; i++)
{
DateTime starttime = DateTime.Now;
TestObject t = (from testO in mylist
where testO.Name == searchtext
select testO).FirstOrDefault();
int diff = (DateTime.Now - starttime).Milliseconds;
list2.Add(diff);
}
// For Loop Test
List<int> list3 = new List<int>();
for (int i = 1; i <= iterations; i++)
{
DateTime starttime = DateTime.Now;
foreach (TestObject testO in mylist)
{
if (testO.Name == searchtext)
{
TestObject t = testO;
break;
}
}
int diff = (DateTime.Now - starttime).Milliseconds;
list3.Add(diff);
}
float diff1 = list1.Average();
Debug.WriteLine(string.Format("LINQ Lambda Query Average Time: {0} seconds", diff1 / (double)100));
float diff2 = list2.Average();
Debug.WriteLine(string.Format("LINQ Where Query Average Time: {0} seconds", diff2 / (double)100));
float diff3 = list3.Average();
Debug.WriteLine(string.Format("For Loop Average Time: {0} seconds", diff3 / (double)100));
}
catch (Exception ex)
{
Debug.WriteLine(ex.ToString());
}
}
private class TestObject
{
public TestObject(string _name, string _value, string _guid)
{
Name = _name;
Value = _value;
GUID = _guid;
}
public string Name;
public string Value;
public string GUID;
}
The Best Option Is To Use IndexOf method of Array Class. Since it is specialized for arrays it will b significantly faster than both Linq and For Loop.
Improving on Matt Watsons Answer.
using System;
using System.Diagnostics;
using System.Linq;
namespace PerformanceConsoleApp
{
public class LinqVsFor
{
private static void Main(string[] args)
{
string[] a = new string[1000000];
for (int i = 0; i < a.Length; ++i)
{
a[i] = "Won't be found";
}
string matchString = "Will be found";
a[a.Length - 1] = "Will be found";
const int COUNT = 100;
var sw = Stopwatch.StartNew();
Loop(a, matchString, COUNT, sw);
First(a, matchString, COUNT, sw);
Where(a, matchString, COUNT, sw);
IndexOf(a, sw, matchString, COUNT);
Console.ReadLine();
}
private static void Loop(string[] a, string matchString, int COUNT, Stopwatch sw)
{
int matchIndex = -1;
for (int outer = 0; outer < COUNT; ++outer)
{
for (int i = 0; i < a.Length; i++)
{
if (a[i] == matchString)
{
matchIndex = i;
break;
}
}
}
sw.Stop();
Console.WriteLine("Found via loop at index " + matchIndex + " in " + sw.Elapsed);
}
private static void IndexOf(string[] a, Stopwatch sw, string matchString, int COUNT)
{
int matchIndex = -1;
sw.Restart();
for (int outer = 0; outer < COUNT; ++outer)
{
matchIndex = Array.IndexOf(a, matchString);
}
sw.Stop();
Console.WriteLine("Found via IndexOf at index " + matchIndex + " in " + sw.Elapsed);
}
private static void First(string[] a, string matchString, int COUNT, Stopwatch sw)
{
sw.Restart();
string str = "";
for (int outer = 0; outer < COUNT; ++outer)
{
str = a.First(t => t == matchString);
}
sw.Stop();
Console.WriteLine("Found via linq First at index " + Array.IndexOf(a, str) + " in " + sw.Elapsed);
}
private static void Where(string[] a, string matchString, int COUNT, Stopwatch sw)
{
sw.Restart();
string str = "";
for (int outer = 0; outer < COUNT; ++outer)
{
str = a.Where(t => t == matchString).First();
}
sw.Stop();
Console.WriteLine("Found via linq Where at index " + Array.IndexOf(a, str) + " in " + sw.Elapsed);
}
}
}
Output:
Found via loop at index 999999 in 00:00:01.1528531
Found via linq First at index 999999 in 00:00:02.0876573
Found via linq Where at index 999999 in 00:00:01.3313111
Found via IndexOf at index 999999 in 00:00:00.7244812
A bit of a non-answer, and really just an extension to https://stackoverflow.com/a/14894589, but I have, on and off, been working on an API-compatible replacement for Linq-to-Objects for a while now. It still doesn't provide the performance of a hand-coded loop, but it is faster for many (most?) linq scenarios. It does create more garbage, and has some slightly heavier up front costs.
The code is available https://github.com/manofstick/Cistern.Linq
A nuget package is available https://www.nuget.org/packages/Cistern.Linq/ (I can't claim this to be battle hardened, use at your own risk)
Taking the code from Matthew Watson's answer (https://stackoverflow.com/a/14894589) with two slight tweaks, and we get the time down to "only" ~3.5 time worse than the hand-coded loop. On my machine it take about 1/3 of the time of original System.Linq version.
The two changes to replace:
using System.Linq;
...
matchIndex = a.Select((r, i) => new { value = r, index = i })
.Where(t => t.value == matchString)
.Select(s => s.index).First();
With the following:
// a complete replacement for System.Linq
using Cistern.Linq;
...
// use a value tuple rather than anonymous type
matchIndex = a.Select((r, i) => (value: r, index: i))
.Where(t => t.value == matchString)
.Select(s => s.index).First();
So the library itself is a work in progress. It fails a couple of edge cases from the corefx's System.Linq test suite. It also still needs a few functions to be converted over (they currently have the corefx System.Linq implementation, which is compatible from an API perspective, if not a performance perspective). But anymore who wants to help, comment, etc would be appreciated....
Related
I have a list of strings where I need to count the number of list entries that have an occurances of a specific string inside of them (and the whole thing only for a subset of the list not the whole list).
The code below works quite well BUT its performance is.....sadly not in an acceptable niveau as I need to parse through 500k to 900k list entries.For these entries I need to run the code below about 10k times (as I have 10k parts of the list I need to analyse). For that it takes 177 seconds and even more. So my question is how can I do this...fast?
private int ExtraktNumbers(List<string> myList, int start, int end)
{
return myList.Where((x, index) => index >= start && index <= end
&& x.Contains("MYNUMBER:")).Count();
}
Well now we know you are calling the method 10,00 times here is my suggestion. I assume as you have hardcoded "Number:" that it means you are doing different ranges with each call? So if that's the case...
First, run an 'indexing' method and create a list of which indices are a match. Then you can easily count up the matches for the ranges you need.
NOTE: This is something quick, and you may even be able to further optimize this too:
List<int> matchIndex = new List<int>();
void RunIndex(List<string> myList)
{
for(int i = 0; i < myList.Count; i++)
{
if(myList[i].Contains("MYNUMBER:"))
{
matchIndex.Add(i);
}
}
}
int CountForRange(int start, int end)
{
return matchIndex.Count(x => x >= start && x <= end);
}
Then you can use like this, for example:
RunIndex(myList);
// I don't know what code you have here, this is just basic example.
for(int i = 0; i <= 10,000; i++)
{
int count = CountForRange(startOfRange, endOfRange);
// Do something with count.
}
In addition, if you have a lot of duplication in the ranges you check then you could consider caching range counts in a dictionary, but at this stage it's hard to tell if that will be worth doing anyway.
I am pretty sure a simple iterative solution will perform better:
private int ExtractNumbers(List<string> myList, int start, int end)
{
int count = 0;
for (int i = start; i <= end; i++)
{
if (myList[i].Contains("MYNUMBER:"))
{
count++;
}
}
return count;
}
Well for my test stand for 10 millions (10 times more than you have) lines
var data = Enumerable
.Range(1, 10000000)
.Select(item => "123456789 bla-bla-bla " + "MYNUMBER:" + item.ToString())
.ToList();
Stopwatch sw = new Stopwatch();
sw.Start();
int result = ExtraktNumbers(data, 0, 10000000);
sw.Stop();
I've got these results:
2.78 seconds - your initial implementtation
Naive loop (2.60 seconds):
private int ExtraktNumbers(List<string> myList, int start, int end) {
int result = 0;
for (int i = start; i < end; ++i)
if (myList[i].Contains("MYNUMBER:"))
result += 1;
return result;
}
PLinq (1.72 seconds):
private int ExtraktNumbers(List<string> myList, int start, int end) {
return myList
.AsParallel() // <- Do it in parallel
.Skip(start - 1)
.Take(end - start)
.Where(x => x.Contains("MYNUMBER:"))
.Count();
}
Explicit parallel implementation (1.66 seconds):
private int ExtraktNumbers(List<string> myList, int start, int end) {
long result = 0;
Parallel.For(start, end, (i) => {
if (myList[i].Contains("MYNUMBER:"))
Interlocked.Increment(ref result);
});
return (int) result;
}
I just cannot reproduce your 177 seconds
If you know from the beginning the intervals you want to consider, it's probably a good idea to loop the list once, as Dmytro and musefan proposed above, so I won't repeat the same idea again.
However I have a different suggestion for performance improvement. How do you create your list? Do you know the number of items in advance? Because for such a big list, you may gest a significant performance boost by using the List<T> constructor that takes the initial capacity.
So I am looking at this question and the general consensus is that uint cast version is more efficient than range check with 0. Since the code is also in MS's implementation of List I assume it is a real optimization. However I have failed to produce a code sample that results in better performance for the uint version. I have tried different tests and there is something missing or some other part of my code is dwarfing the time for the checks. My last attempt looks like this:
class TestType
{
public TestType(int size)
{
MaxSize = size;
Random rand = new Random(100);
for (int i = 0; i < MaxIterations; i++)
{
indexes[i] = rand.Next(0, MaxSize);
}
}
public const int MaxIterations = 10000000;
private int MaxSize;
private int[] indexes = new int[MaxIterations];
public void Test()
{
var timer = new Stopwatch();
int inRange = 0;
int outOfRange = 0;
timer.Start();
for (int i = 0; i < MaxIterations; i++)
{
int x = indexes[i];
if (x < 0 || x > MaxSize)
{
throw new Exception();
}
inRange += indexes[x];
}
timer.Stop();
Console.WriteLine("Comparision 1: " + inRange + "/" + outOfRange + ", elapsed: " + timer.ElapsedMilliseconds + "ms");
inRange = 0;
outOfRange = 0;
timer.Reset();
timer.Start();
for (int i = 0; i < MaxIterations; i++)
{
int x = indexes[i];
if ((uint)x > (uint)MaxSize)
{
throw new Exception();
}
inRange += indexes[x];
}
timer.Stop();
Console.WriteLine("Comparision 2: " + inRange + "/" + outOfRange + ", elapsed: " + timer.ElapsedMilliseconds + "ms");
}
}
class Program
{
static void Main()
{
TestType t = new TestType(TestType.MaxIterations);
t.Test();
TestType t2 = new TestType(TestType.MaxIterations);
t2.Test();
TestType t3 = new TestType(TestType.MaxIterations);
t3.Test();
}
}
The code is a bit of a mess because I tried many things to make uint check perform faster like moving the compared variable into a field of a class, generating random index access and so on but in every case the result seems to be the same for both versions. So is this change applicable on modern x86 processors and can someone demonstrate it somehow?
Note that I am not asking for someone to fix my sample or explain what is wrong with it. I just want to see the case where the optimization does work.
if (x < 0 || x > MaxSize)
The comparison is performed by the CMP processor instruction (Compare). You'll want to take a look at Agner Fog's instruction tables document (PDF), it list the cost of instructions. Find your processor back in the list, then locate the CMP instruction.
For mine, Haswell, CMP takes 1 cycle of latency and 0.25 cycles of throughput.
A fractional cost like that could use an explanation, Haswell has 4 integer execution units that can execute instructions at the same time. When a program contains enough integer operations, like CMP, without an interdependency then they can all execute at the same time. In effect making the program 4 times faster. You don't always manage to keep all 4 of them busy at the same time with your code, it is actually pretty rare. But you do keep 2 of them busy in this case. Or in other words, two comparisons take just as long as single one, 1 cycle.
There are other factors at play that make the execution time identical. One thing helps is that the processor can predict the branch very well, it can speculatively execute x > MaxSize in spite of the short-circuit evaluation. And it will in fact end up using the result since the branch is never taken.
And the true bottleneck in this code is the array indexing, accessing memory is one of the slowest thing the processor can do. So the "fast" version of the code isn't faster even though it provides more opportunity to allow the processor to concurrently execute instructions. It isn't much of an opportunity today anyway, a processor has too many execution units to keep busy. Otherwise the feature that makes HyperThreading work. In both cases the processor bogs down at the same rate.
On my machine, I have to write code that occupies more than 4 engines to make it slower. Silly code like this:
if (x < 0 || x > MaxSize || x > 10000000 || x > 20000000 || x > 3000000) {
outOfRange++;
}
else {
inRange++;
}
Using 5 compares, now I can a difference, 61 vs 47 msec. Or in other words, this is a way to count the number of integer engines in the processor. Hehe :)
So this is a micro-optimization that probably used to pay off a decade ago. It doesn't anymore. Scratch it off your list of things to worry about :)
I would suggest attempting code which does not throw an exception when the index is out of range. Exceptions are incredibly expensive and can completely throw off your bench results.
The code below does a timed-average bench for 1,000 iterations of 1,000,000 results.
using System;
using System.Diagnostics;
namespace BenchTest
{
class Program
{
const int LoopCount = 1000000;
const int AverageCount = 1000;
static void Main(string[] args)
{
Console.WriteLine("Starting Benchmark");
RunTest();
Console.WriteLine("Finished Benchmark");
Console.Write("Press any key to exit...");
Console.ReadKey();
}
static void RunTest()
{
int cursorRow = Console.CursorTop; int cursorCol = Console.CursorLeft;
long totalTime1 = 0; long totalTime2 = 0;
long invalidOperationCount1 = 0; long invalidOperationCount2 = 0;
for (int i = 0; i < AverageCount; i++)
{
Console.SetCursorPosition(cursorCol, cursorRow);
Console.WriteLine("Running iteration: {0}/{1}", i + 1, AverageCount);
int[] indexArgs = RandomFill(LoopCount, int.MinValue, int.MaxValue);
int[] sizeArgs = RandomFill(LoopCount, 0, int.MaxValue);
totalTime1 += RunLoop(TestMethod1, indexArgs, sizeArgs, ref invalidOperationCount1);
totalTime2 += RunLoop(TestMethod2, indexArgs, sizeArgs, ref invalidOperationCount2);
}
PrintResult("Test 1", TimeSpan.FromTicks(totalTime1 / AverageCount), invalidOperationCount1);
PrintResult("Test 2", TimeSpan.FromTicks(totalTime2 / AverageCount), invalidOperationCount2);
}
static void PrintResult(string testName, TimeSpan averageTime, long invalidOperationCount)
{
Console.WriteLine(testName);
Console.WriteLine(" Average Time: {0}", averageTime);
Console.WriteLine(" Invalid Operations: {0} ({1})", invalidOperationCount, (invalidOperationCount / (double)(AverageCount * LoopCount)).ToString("P3"));
}
static long RunLoop(Func<int, int, int> testMethod, int[] indexArgs, int[] sizeArgs, ref long invalidOperationCount)
{
Stopwatch sw = new Stopwatch();
Console.Write("Running {0} sub-iterations", LoopCount);
sw.Start();
long startTickCount = sw.ElapsedTicks;
for (int i = 0; i < LoopCount; i++)
{
invalidOperationCount += testMethod(indexArgs[i], sizeArgs[i]);
}
sw.Stop();
long stopTickCount = sw.ElapsedTicks;
long elapsedTickCount = stopTickCount - startTickCount;
Console.WriteLine(" - Time Taken: {0}", new TimeSpan(elapsedTickCount));
return elapsedTickCount;
}
static int[] RandomFill(int size, int minValue, int maxValue)
{
int[] randomArray = new int[size];
Random rng = new Random();
for (int i = 0; i < size; i++)
{
randomArray[i] = rng.Next(minValue, maxValue);
}
return randomArray;
}
static int TestMethod1(int index, int size)
{
return (index < 0 || index >= size) ? 1 : 0;
}
static int TestMethod2(int index, int size)
{
return ((uint)(index) >= (uint)(size)) ? 1 : 0;
}
}
}
You aren't comparing like with like.
The code you were talking about not only saved one branch by using the optimisation, but also 4 bytes of CIL in a small method.
In a small method 4 bytes can be the difference in being inlined and not being inlined.
And if the method calling that method is also written to be small, then that can mean two (or more) method calls are jitted as one piece of inline code.
And maybe some of it is then, because it is inline and available for analysis by the jitter, optimised further again.
The real difference is not between index < 0 || index >= _size and (uint)index >= (uint)_size, but between code that has repeated efforts to minimise the method body size and code that does not. Look for example at how another method is used to throw the exception if necessary, further shaving off a couple of bytes of CIL.
(And no, that's not to say that I think all methods should be written like that, but there certainly can be performance differences when one does).
I have somethings like this:
List<string> listUser;
listUser.Add("user1");
listUser.Add("user2");
listUser.Add("userhacker");
listUser.Add("user1other");
List<string> key_blacklist;
key_blacklist.Add("hacker");
key_blacklist.Add("other");
foreach (string user in listUser)
{
foreach (string key in key_blacklist)
{
if (user.Contains(key))
{
// remove it in listUser
}
}
}
The result of listUser is: user1, user2.
The problem is if i have a huge listUser (more than 10 million) and huge key_blacklist (100.000). That code is very very slow.
Is have anyway to get that faster?
UPDATE: I find new solution in there.
http://cc.davelozinski.com/c-sharp/fastest-way-to-check-if-a-string-occurs-within-a-string
Hope that will help someone when he got in there! :)
If you don't have much control over how the list of users is constructed, you can at least test each item in the list in parallel, which on modern machines with multiple cores will speed up the checking a fair bit.
listuser.AsParallel().Where(
s =>
{
foreach (var key in key_blacklist)
{
if (s.Contains(key))
{
return false; //Not to be included
}
}
return true; //To be included, as no match with the blacklist
});
Also - do you have to use .Contains? .Equals is going to be much much quicker, because in almost all cases a non-match will be determined when the HashCodes differ, which can be found only by an integer comparison. Super quick.
If you do need .Contains, you may want to think about restructuring the app. What do these strings in the list really represent? Separate sub-groups of users? Can I test each string, at the time it's added, for whether it represents a user on the blacklist?
UPDATE: In response to #Rawling's comment below - If you know that there is a finite set of usernames which have, say, "hacker" as a substring, that set would have to be pretty large before running a .Equals test of each username against a candidate would be slower than running .Contains on the candidate. This is because HashCode is really quick.
If you are using entity framework or linq to sql then using linq and sending the query to a server can improve the performance.
Then instead of removing the items you are actually querying for the items that fulfil the requirements, i.e. user where the name doesn't contain the banned expression:
listUser.Where(u => !key_blacklist.Any(u.Contains)).Select(u => u).ToList();
A possible solution is to use a tree-like data structure.
The basic idea is to have the blacklisted words organised like this:
+ h
| + ha
| + hac
| - hacker
| - [other words beginning with hac]
|
+ f
| + fu
| + fuk
| - fukoff
| - [other words beginning with fuk]
Then, when you check for blacklisted words, you avoid searching the whole list of words beginning with "hac" if you find out that your user string does not even contain "h".
In the example I provided, with your sample data, this does not of course make any difference, but with the real data sets this should reduce significantly the number of Contains, since you don't check against the full list of blacklisted words every time.
Here is a code example (please note that the code is pretty bad, this is just to illustrate my idea)
using System;
using System.Collections.Generic;
using System.Linq;
class Program {
class Blacklist {
public string Start;
public int Level;
const int MaxLevel = 3;
public Dictionary<string, Blacklist> SubBlacklists = new Dictionary<string, Blacklist>();
public List<string> BlacklistedWords = new List<string>();
public Blacklist() {
Start = string.Empty;
Level = 0;
}
Blacklist(string start, int level) {
Start = start;
Level = level;
}
public void AddBlacklistedWord(string word) {
if (word.Length > Level && Level < MaxLevel) {
string index = word.Substring(0, Level + 1);
Blacklist sublist = null;
if (!SubBlacklists.TryGetValue(index, out sublist)) {
sublist = new Blacklist(index, Level + 1);
SubBlacklists[index] = sublist;
}
sublist.AddBlacklistedWord(word);
} else {
BlacklistedWords.Add(word);
}
}
public bool ContainsBlacklistedWord(string wordToCheck) {
if (wordToCheck.Length > Level && Level < MaxLevel) {
foreach (var sublist in SubBlacklists.Values) {
if (wordToCheck.Contains(sublist.Start)) {
return sublist.ContainsBlacklistedWord(wordToCheck);
}
}
}
return BlacklistedWords.Any(x => wordToCheck.Contains(x));
}
}
static void Main(string[] args) {
List<string> listUser = new List<string>();
listUser.Add("user1");
listUser.Add("user2");
listUser.Add("userhacker");
listUser.Add("userfukoff1");
Blacklist blacklist = new Blacklist();
blacklist.AddBlacklistedWord("hacker");
blacklist.AddBlacklistedWord("fukoff");
foreach (string user in listUser) {
if (blacklist.ContainsBlacklistedWord(user)) {
Console.WriteLine("Contains blacklisted word: {0}", user);
}
}
}
}
You are using the wrong thing. If you have a lot of data, you should be using either HashSet<T> or SortedSet<T>. If you don't need the data sorted, go with HashSet<T>. Here is a program I wrote to demonstrate the time differences:
class Program
{
private static readonly Random random = new Random((int)DateTime.Now.Ticks);
static void Main(string[] args)
{
Console.WriteLine("Creating Lists...");
var stringList = new List<string>();
var hashList = new HashSet<string>();
var sortedList = new SortedSet<string>();
var searchWords1 = new string[3];
int ndx = 0;
for (int x = 0; x < 1000000; x++)
{
string str = RandomString(10);
if (x == 5 || x == 500000 || x == 999999)
{
str = "Z" + str;
searchWords1[ndx] = str;
ndx++;
}
stringList.Add(str);
hashList.Add(str);
sortedList.Add(str);
}
Console.WriteLine("Lists created!");
var sw = new Stopwatch();
sw.Start();
bool search1 = stringList.Contains(searchWords1[2]);
sw.Stop();
Console.WriteLine("List<T> {0} ==> {1}ms", search1, sw.ElapsedMilliseconds);
sw.Reset();
sw.Start();
search1 = hashList.Contains(searchWords1[2]);
sw.Stop();
Console.WriteLine("HashSet<T> {0} ==> {1}ms", search1, sw.ElapsedMilliseconds);
sw.Reset();
sw.Start();
search1 = sortedList.Contains(searchWords1[2]);
sw.Stop();
Console.WriteLine("SortedSet<T> {0} ==> {1}ms", search1, sw.ElapsedMilliseconds);
}
private static string RandomString(int size)
{
var builder = new StringBuilder();
char ch;
for (int i = 0; i < size; i++)
{
ch = Convert.ToChar(Convert.ToInt32(Math.Floor(26 * random.NextDouble() + 65)));
builder.Append(ch);
}
return builder.ToString();
}
}
On my machine, I got the following results:
Creating Lists...
Lists created!
List<T> True ==> 15ms
HashSet<T> True ==> 0ms
SortedSet<T> True ==> 0ms
As you can see, List<T> was extremely slow comparted to HashSet<T> and SortedSet<T>. Those were almost instantaneous.
static void Main(string[] args)
{
string str = "ABC ABCDAB ABCDABCDABDE";//We should add some text here for
//the performance tests.
string pattern = "ABCDABD";
List<int> shifts = new List<int>();
Stopwatch stopWatch = new Stopwatch();
stopWatch.Start();
NaiveStringMatcher(shifts, str, pattern);
stopWatch.Stop();
Trace.WriteLine(String.Format("Naive string matcher {0}", stopWatch.Elapsed));
foreach (int s in shifts)
{
Trace.WriteLine(s);
}
shifts.Clear();
stopWatch.Restart();
int[] pi = new int[pattern.Length];
Knuth_Morris_Pratt(shifts, str, pattern, pi);
stopWatch.Stop();
Trace.WriteLine(String.Format("Knuth_Morris_Pratt {0}", stopWatch.Elapsed));
foreach (int s in shifts)
{
Trace.WriteLine(s);
}
Console.ReadKey();
}
static IList<int> NaiveStringMatcher(List<int> shifts, string text, string pattern)
{
int lengthText = text.Length;
int lengthPattern = pattern.Length;
for (int s = 0; s < lengthText - lengthPattern + 1; s++ )
{
if (text[s] == pattern[0])
{
int i = 0;
while (i < lengthPattern)
{
if (text[s + i] == pattern[i])
i++;
else break;
}
if (i == lengthPattern)
{
shifts.Add(s);
}
}
}
return shifts;
}
static IList<int> Knuth_Morris_Pratt(List<int> shifts, string text, string pattern, int[] pi)
{
int patternLength = pattern.Length;
int textLength = text.Length;
//ComputePrefixFunction(pattern, pi);
int j;
for (int i = 1; i < pi.Length; i++)
{
j = 0;
while ((i < pi.Length) && (pattern[i] == pattern[j]))
{
j++;
pi[i++] = j;
}
}
int matchedSymNum = 0;
for (int i = 0; i < textLength; i++)
{
while (matchedSymNum > 0 && pattern[matchedSymNum] != text[i])
matchedSymNum = pi[matchedSymNum - 1];
if (pattern[matchedSymNum] == text[i])
matchedSymNum++;
if (matchedSymNum == patternLength)
{
shifts.Add(i - patternLength + 1);
matchedSymNum = pi[matchedSymNum - 1];
}
}
return shifts;
}
Why does my implemention of the KMP algorithm work slower than the Naive String Matching algorithm?
The KMP algorithm has two phases: first it builds a table, and then it does a search, directed by the contents of the table.
The naive algorithm has one phase: it does a search. It does that search much less efficiently in the worst case than the KMP search phase.
If the KMP is slower than the naive algorithm then that is probably because building the table is taking you longer than it takes to simply search the string naively in the first place. Naive string matching is usually very fast on short strings. There is a reason why we don't use fancy-pants algorithms like KMP inside the BCL implementations of string searching. By the time you set up the table, you could have done half a dozen searches of short strings with the naive algorithm.
KMP is only a win if you have enormous strings and you are doing lots of searches that allow you to re-use an already-built table. You need to amortize away the huge cost of building the table by doing lots of searches using that table.
And also, the naive algorithm only has bad performance in bizarre and unlikely scenarios. Most people are searching for words like "London" in strings like "Buckingham Palace, London, England", and not searching for strings like "BANANANANANANA" in strings like "BANAN BANBAN BANBANANA BANAN BANAN BANANAN BANANANANANANANANAN...". The naive search algorithm is optimal for the first problem and highly sub-optimal for the latter problem; but it makes sense to optimize for the former, not the latter.
Another way to put it: if the searched-for string is of length w and the searched-in string is of length n, then KMP is O(n) + O(w). The Naive algorithm is worst case O(nw), best case O(n + w). But that says nothing about the "constant factor"! The constant factor of the KMP algorithm is much larger than the constant factor of the naive algorithm. The value of n has to be awfully big, and the number of sub-optimal partial matches has to be awfully large, for the KMP algorithm to win over the blazingly fast naive algorithm.
That deals with the algorithmic complexity issues. Your methodology is also not very good, and that might explain your results. Remember, the first time you run code, the jitter has to jit the IL into assembly code. That can take longer than running the method in some cases. You really should be running the code a few hundred thousand times in a loop, discarding the first result, and taking an average of the timings of the rest.
If you really want to know what is going on you should be using a profiler to determine what the hot spot is. Again, make sure you are measuring the post-jit run, not the run where the code is jitted, if you want to have results that are not skewed by the jit time.
Your example is too small and it does not have enough repetitions of the pattern where KMP avoids backtracking.
KMP can be slower than the normal search in some cases.
A Simple KMPSubstringSearch Implementation.
https://github.com/bharathkumarms/AlgorithmsMadeEasy/blob/master/AlgorithmsMadeEasy/KMPSubstringSearch.cs
using System;
using System.Collections.Generic;
using System.Linq;
namespace AlgorithmsMadeEasy
{
class KMPSubstringSearch
{
public void KMPSubstringSearchMethod()
{
string text = System.Console.ReadLine();
char[] sText = text.ToCharArray();
string pattern = System.Console.ReadLine();
char[] sPattern = pattern.ToCharArray();
int forwardPointer = 1;
int backwardPointer = 0;
int[] tempStorage = new int[sPattern.Length];
tempStorage[0] = 0;
while (forwardPointer < sPattern.Length)
{
if (sPattern[forwardPointer].Equals(sPattern[backwardPointer]))
{
tempStorage[forwardPointer] = backwardPointer + 1;
forwardPointer++;
backwardPointer++;
}
else
{
if (backwardPointer == 0)
{
tempStorage[forwardPointer] = 0;
forwardPointer++;
}
else
{
int temp = tempStorage[backwardPointer];
backwardPointer = temp;
}
}
}
int pointer = 0;
int successPoints = sPattern.Length;
bool success = false;
for (int i = 0; i < sText.Length; i++)
{
if (sText[i].Equals(sPattern[pointer]))
{
pointer++;
}
else
{
if (pointer != 0)
{
int tempPointer = pointer - 1;
pointer = tempStorage[tempPointer];
i--;
}
}
if (successPoints == pointer)
{
success = true;
}
}
if (success)
{
System.Console.WriteLine("TRUE");
}
else
{
System.Console.WriteLine("FALSE");
}
System.Console.Read();
}
}
}
/*
* Sample Input
abxabcabcaby
abcaby
*/
I can't figure out a discrepancy between the time it takes for the Contains method to find an element in an ArrayList and the time it takes for a small function that I wrote to do the same thing. The documentation states that Contains performs a linear search, so it's supposed to be in O(n) and not any other faster method. However, while the exact values may not be relevant, the Contains method returns in 00:00:00.1087087 seconds while my function takes 00:00:00.1876165. It might not be much, but this difference becomes more evident when dealing with even larger arrays. What am I missing and how should I write my function to match Contains's performances?
I'm using C# on .NET 3.5.
public partial class Window1 : Window
{
public bool DoesContain(ArrayList list, object element)
{
for (int i = 0; i < list.Count; i++)
if (list[i].Equals(element)) return true;
return false;
}
public Window1()
{
InitializeComponent();
ArrayList list = new ArrayList();
for (int i = 0; i < 10000000; i++) list.Add("zzz " + i);
Stopwatch sw = new Stopwatch();
sw.Start();
//Console.Out.WriteLine(list.Contains("zzz 9000000") + " " + sw.Elapsed);
Console.Out.WriteLine(DoesContain(list, "zzz 9000000") + " " + sw.Elapsed);
}
}
EDIT:
Okay, now, lads, look:
public partial class Window1 : Window
{
public bool DoesContain(ArrayList list, object element)
{
int count = list.Count;
for (int i = count - 1; i >= 0; i--)
if (element.Equals(list[i])) return true;
return false;
}
public bool DoesContain1(ArrayList list, object element)
{
int count = list.Count;
for (int i = 0; i < count; i++)
if (element.Equals(list[i])) return true;
return false;
}
public Window1()
{
InitializeComponent();
ArrayList list = new ArrayList();
for (int i = 0; i < 10000000; i++) list.Add("zzz " + i);
Stopwatch sw = new Stopwatch();
long total = 0;
int nr = 100;
for (int i = 0; i < nr; i++)
{
sw.Reset();
sw.Start();
DoesContain(list,"zzz");
total += sw.ElapsedMilliseconds;
}
Console.Out.WriteLine(total / nr);
total = 0;
for (int i = 0; i < nr; i++)
{
sw.Reset();
sw.Start();
DoesContain1(list, "zzz");
total += sw.ElapsedMilliseconds;
}
Console.Out.WriteLine(total / nr);
total = 0;
for (int i = 0; i < nr; i++)
{
sw.Reset();
sw.Start();
list.Contains("zzz");
total += sw.ElapsedMilliseconds;
}
Console.Out.WriteLine(total / nr);
}
}
I made an average of 100 running times for two versions of my function(forward and backward loop) and for the default Contains function. The times I've got are 136 and
133 milliseconds for my functions and a distant winner of 87 for the Contains version. Well now, if before you could argue that the data was scarce and I based my conclusions on a first, isolated run, what do you say about this test? Not does only on average Contains perform better, but it achieves consistently better results in each run. So, is there some kind of disadvantage in here for 3rd party functions, or what?
First, you're not running it many times and comparing averages.
Second, your method isn't being jitted until it actually runs. So the just in time compile time is added into its execution time.
A true test would run each multiple times and average the results (any number of things could cause one or the other to be slower for run X out of a total of Y), and your assemblies should be pre-jitted using ngen.exe.
As you're using .NET 3.5, why are you using ArrayList to start with, rather than List<string>?
A few things to try:
You could see whether using foreach instead of a for loop helps
You could cache the count:
public bool DoesContain(ArrayList list, object element)
{
int count = list.Count;
for (int i = 0; i < count; i++)
{
if (list[i].Equals(element))
{
return true;
}
return false;
}
}
You could reverse the comparison:
if (element.Equals(list[i]))
While I don't expect any of these to make a significant (positive) difference, they're the next things I'd try.
Do you need to do this containment test more than once? If so, you might want to build a HashSet<T> and use that repeatedly.
I'm not sure if you're allowed to post Reflector code, but if you open the method using Reflector, you can see that's it's essentially the same (there are some optimizations for null values, but your test harness doesn't include nulls).
The only difference that I can see is that calling list[i] does bounds checking on i whereas the Contains method does not.
Using the code below I was able to get the following timings relatively consitently (within a few ms):
1: 190ms DoesContainRev
2: 198ms DoesContainRev1
3: 188ms DoesContainFwd
4: 203ms DoesContainFwd1
5: 199ms Contains
Several things to notice here.
This is run with release compiled code from the commandline. Many people make the mistake of benchmarking code inside the Visual Studio debugging environment, not to say anyone here did but something to be careful of.
The list[i].Equals(element) appears to be just a bit slower than element.Equals(list[i]).
using System;
using System.Diagnostics;
using System.Collections;
namespace ArrayListBenchmark
{
class Program
{
static void Main(string[] args)
{
Stopwatch sw = new Stopwatch();
const int arrayCount = 10000000;
ArrayList list = new ArrayList(arrayCount);
for (int i = 0; i < arrayCount; i++) list.Add("zzz " + i);
sw.Start();
DoesContainRev(list, "zzz");
sw.Stop();
Console.WriteLine(String.Format("1: {0}", sw.ElapsedMilliseconds));
sw.Reset();
sw.Start();
DoesContainRev1(list, "zzz");
sw.Stop();
Console.WriteLine(String.Format("2: {0}", sw.ElapsedMilliseconds));
sw.Reset();
sw.Start();
DoesContainFwd(list, "zzz");
sw.Stop();
Console.WriteLine(String.Format("3: {0}", sw.ElapsedMilliseconds));
sw.Reset();
sw.Start();
DoesContainFwd1(list, "zzz");
sw.Stop();
Console.WriteLine(String.Format("4: {0}", sw.ElapsedMilliseconds));
sw.Reset();
sw.Start();
list.Contains("zzz");
sw.Stop();
Console.WriteLine(String.Format("5: {0}", sw.ElapsedMilliseconds));
sw.Reset();
Console.ReadKey();
}
public static bool DoesContainRev(ArrayList list, object element)
{
int count = list.Count;
for (int i = count - 1; i >= 0; i--)
if (element.Equals(list[i])) return true;
return false;
}
public static bool DoesContainFwd(ArrayList list, object element)
{
int count = list.Count;
for (int i = 0; i < count; i++)
if (element.Equals(list[i])) return true;
return false;
}
public static bool DoesContainRev1(ArrayList list, object element)
{
int count = list.Count;
for (int i = count - 1; i >= 0; i--)
if (list[i].Equals(element)) return true;
return false;
}
public static bool DoesContainFwd1(ArrayList list, object element)
{
int count = list.Count;
for (int i = 0; i < count; i++)
if (list[i].Equals(element)) return true;
return false;
}
}
}
With a really good optimizer there should not be difference at all, because the semantics seems to be the same. However the existing optimizer can optimize your function not so good as the hardcoded Contains is optimized. Some of the points for optimization:
comparing to a property each time can be slower than counting downwards and comparing against 0
function call itself has its performance penalty
using iterators instead of explicit indexing can be faster (foreach loop instead of plain for)
First, if you are using types you know ahead of time, I'd suggest using generics. So List instead of ArrayList. Underneath the hood, ArrayList.Contains actually does a bit more than what you are doing. The following is from reflector:
public virtual bool Contains(object item)
{
if (item == null)
{
for (int j = 0; j < this._size; j++)
{
if (this._items[j] == null)
{
return true;
}
}
return false;
}
for (int i = 0; i < this._size; i++)
{
if ((this._items[i] != null) && this._items[i].Equals(item))
{
return true;
}
}
return false;
}
Notice that it forks itself on being passed a null value for item. However, since all the values in your example are not null, the additional check on null at the beginning and in the second loop should in theory take longer.
Are you positive you are dealing with fully compiled code? I.e., when your code runs the first time it gets JIT compiled where as the framework is obviously already compiled.
After your Edit, I copied the code and made a few improvements to it.
The difference was not reproducable, it turns out to be a measuring/rounding issue.
To see that, change your runs to this form:
sw.Reset();
sw.Start();
for (int i = 0; i < nr; i++)
{
DoesContain(list,"zzz");
}
total += sw.ElapsedMilliseconds;
Console.WriteLine(total / nr);
I just moved some lines. The JIT issue was insignificant with this numbr of repetitions.
My guess would be that ArrayList is written in C++ and could be taking advantage of some micro-optimizations (note: this is a guess).
For instance, in C++ you can use pointer arithmetic (specifically incrementing a pointer to iterate an array) to be faster than using an index.
using an array structure, you can't search faster than O(n) whithout any additional information.
if you know that the array is sorted, then you can use binary search algorithm and spent only o(log(n))
otherwise you should use a set.
Revised after reading comments:
It does not use some Hash-alogorithm to enable fast lookup.
Use SortedList<TKey,TValue>, Dictionary<TKey, TValue> or System.Collections.ObjectModel.KeyedCollection<TKey, TValue> for fast access based on a key.
var list = new List<myObject>(); // Search is sequential
var dictionary = new Dictionary<myObject, myObject>(); // key based lookup, but no sequential lookup, Contains fast
var sortedList = new SortedList<myObject, myObject>(); // key based and sequential lookup, Contains fast
KeyedCollection<TKey, TValue> is also fast and allows indexed lookup, however, it needs to be inherited as it is abstract. Therefore, you need a specific collection. However, with the following you can create a generic KeyedCollection.
public class GenericKeyedCollection<TKey, TValue> : KeyedCollection<TKey, TValue> {
public GenericKeyedCollection(Func<TValue, TKey> keyExtractor) {
this.keyExtractor = keyExtractor;
}
private Func<TValue, TKey> keyExtractor;
protected override TKey GetKeyForItem(TValue value) {
return this.keyExtractor(value);
}
}
The advantage of using the KeyedCollection is that the Add method does not require that a key is specified.