EDIT: it looks like this is normal behavior, so can anyone recommend a faster way to do these numerous intersections?
So my problem is this: I have 8000 lists (strings in each list). I compare each list (ranging in size from 50 to 400) to every other list and perform a calculation based on the size of the intersection. So I'll do
list1(intersect)list1= number
list1(intersect)list2= number
list1(intersect)list888= number
And I do this for every list. Previously I had a HashList, and my code was essentially this (well, I was actually searching through properties of an object, so I had to modify the code a bit, but it's basically this):
I have my two versions below, but if anyone knows anything faster, please let me know!
Loop through AllLists, getting each list, starting with list1, and then do this:
foreach (List<string> list in AllLists)
{
    if (list1_length < list_length) // just a check so I'm looping through the smaller list
    {
        foreach (string word in list1)
        {
            if (block.generator_list.Contains(word))
            {
                // simple integer count
            }
        }
    }
    // a little more code, but the same, just looping through the other list if it's smaller/bigger
}
Then I made the lists into regular lists, applied Sort(), and changed my code to:
foreach (List<string> list in AllLists)
{
    if (list1_length < list_length) // just a check so I'm looping through the smaller list
    {
        for (int i = 0; i < list1_length; i++)
        {
            var test = list.BinarySearch(list1[i]);
            if (test > -1)
            {
                // simple integer count
            }
        }
    }
}
The first version takes about 6 seconds; the second takes more than 20 (I just stop it there because otherwise it would take more than a minute), and this is for a smallish subset of the data. I'm sure there's a drastic mistake somewhere, but I can't find it.
Well I have tried three distinct methods for achieving this (assuming I understood the problem correctly). Please note I have used HashSet<int> in order to more easily generate random input.
setting up:
List<HashSet<int>> allSets = new List<HashSet<int>>();
Random rand = new Random();
for (int i = 0; i < 8000; ++i) {
    HashSet<int> ints = new HashSet<int>();
    for (int j = 0; j < rand.Next(50, 400); ++j) {
        ints.Add(rand.Next(0, 1000));
    }
    allSets.Add(ints);
}
the three methods I checked (code is what runs in the inner loop):
the loop:
note that you are getting duplicated results in your code (intersecting set A with set B and later intersecting set B with set A).
It won't affect your performance thanks to the list length check you are doing. But iterating this way is clearer.
for (int i = 0; i < allSets.Count; ++i) {
    for (int j = i + 1; j < allSets.Count; ++j) {
    }
}
first method:
used IEnumerable.Intersect() to get the intersection with the other list and checked IEnumerable.Count() to get the size of the intersection.
var intersect = allSets[i].Intersect(allSets[j]);
count = intersect.Count();
this was the slowest one averaging 177s
second method:
cloned the smaller of the two sets I was intersecting, then used ISet.IntersectWith() and checked the resulting set's Count.
HashSet<int> intersect;
HashSet<int> intersectWith;
if (allSets[i].Count < allSets[j].Count) {
    intersect = new HashSet<int>(allSets[i]);
    intersectWith = allSets[j];
} else {
    intersect = new HashSet<int>(allSets[j]);
    intersectWith = allSets[i];
}
intersect.IntersectWith(intersectWith);
count = intersect.Count;
this one was slightly faster, averaging 154s
third method:
did something very similar to what you did: iterated over the shorter set and checked ISet.Contains on the longer set.
for (int i = 0; i < allSets.Count; ++i) {
    for (int j = i + 1; j < allSets.Count; ++j) {
        count = 0;
        HashSet<int> loopingSet;
        HashSet<int> containsSet;
        if (allSets[i].Count < allSets[j].Count) {
            loopingSet = allSets[i];
            containsSet = allSets[j];
        } else {
            loopingSet = allSets[j];
            containsSet = allSets[i];
        }
        foreach (int k in loopingSet) {
            if (containsSet.Contains(k)) {
                ++count;
            }
        }
    }
}
this method was by far the fastest (as expected), averaging 66s
conclusion
the method you're using is the fastest of these three. I certainly can't think of a faster single-threaded way to do this. Perhaps there is a better concurrent solution.
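If you do want to try a concurrent approach, here is a minimal sketch (my own addition, not benchmarked) that parallelizes the outer loop of the third method with Parallel.For; it assumes each outer iteration writes its counts into its own row, so the iterations share no mutable state:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

class ParallelIntersectSketch
{
    static void Main()
    {
        // same random setup as above
        var allSets = new List<HashSet<int>>();
        var rand = new Random();
        for (int i = 0; i < 8000; ++i)
        {
            var ints = new HashSet<int>();
            for (int j = 0; j < rand.Next(50, 400); ++j)
                ints.Add(rand.Next(0, 1000));
            allSets.Add(ints);
        }

        // counts[i][j - i - 1] holds the intersection size of allSets[i] and allSets[j] for j > i
        var counts = new int[allSets.Count][];

        // parallelize the outer loop; each iteration i writes only to counts[i]
        Parallel.For(0, allSets.Count, i =>
        {
            counts[i] = new int[allSets.Count - i - 1];
            for (int j = i + 1; j < allSets.Count; ++j)
            {
                HashSet<int> loopingSet, containsSet;
                if (allSets[i].Count < allSets[j].Count) { loopingSet = allSets[i]; containsSet = allSets[j]; }
                else { loopingSet = allSets[j]; containsSet = allSets[i]; }

                int count = 0;
                foreach (int k in loopingSet)
                    if (containsSet.Contains(k))
                        ++count;
                counts[i][j - i - 1] = count;
            }
        });

        Console.WriteLine(counts.SelectMany(row => row).Sum()); // just to observe the result
    }
}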
I've found that one of the most important considerations in iterating/searching any kind of collection is to choose the collection type very carefully. Iterating through a normal collection for your purposes will not be optimal. Try using something like:
System.Collections.Generic.HashSet<T>
Using the Contains() method while iterating over the shorter of the two lists (as you mentioned you're already doing) should give close to O(1) performance, the same as key lookups in the generic Dictionary type.
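To illustrate, a minimal sketch of that idea (my own example, assuming the word lists have already been loaded into HashSet<string> instances):
// count the intersection by iterating the smaller set and probing the larger one;
// HashSet<string>.Contains is an O(1) hash lookup on average
static int IntersectionCount(HashSet<string> a, HashSet<string> b)
{
    HashSet<string> smaller = a.Count <= b.Count ? a : b;
    HashSet<string> larger = ReferenceEquals(smaller, a) ? b : a;
    int count = 0;
    foreach (string word in smaller)
        if (larger.Contains(word))
            count++;
    return count;
}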
Related
In C#, I create a list of lists and try to access/modify the elements. The problem is that an operation (e.g. adding a constant) seems to apply not only to the intended index but to all elements.
Here is the piece of code:
List<List<double>> D = new List<List<double>>();
List<double> ttmp = new List<double>(new double[256]);
for (int i = 0; i < 100; i++)
{
    D.Add(ttmp);
}
for (int i = 0; i < 100; i++)
{
    D[i][0] = D[i][0] + 1;
}
D is a list of 100 lists, each of size 256. It initially contains only zeroes. In the second loop, I ask that the first element of every of the 100 lists be incremented by one.
As a results, the entire "matrix" is filled with ones, i.e. not only D[0][0], D[1][0] ... D[99][0], but also D[0][1], D[0][2] , etc.
Why is that?
NB: the C++ equivalent with vector<vector<double>> works perfectly fine...
When the code is executed, the result is not that D[0][0], D[1][0] ... D[99][0] and also D[0][1], D[0][2], etc. are modified.
The result is that the [0] element of every inner list equals 100.
Why so? Because you created one list and added it to D 100 times. But it is the same list in every slot: when you modify it, you modify every "row" at once, because List<double> is a reference type.
Change it instead to:
List<List<double>> D = new List<List<double>>();
for (int i = 0; i < 100; i++)
{
    D.Add(new List<double>(new double[256]));
}
foreach (var innerList in D)
{
    innerList[0]++;
}
I have a huge list of strings, _wordList (a List<string>), containing about 100,000 values. The problem I'm having is that I also require multiple nested loops within this. The nested loop iterates over another list, a structure containing only variables, with about 0-100 values depending on what happens:
for (int y = 0; y < _wordList.Count; y++)
{
    string word = _wordList[y];
    for (int x = 0; x < _secondWordList.Count; x++)
    {
        if (!word.Contains(_secondWordList[x].Word) || word == _secondWordList[x].Word)
            continue;
        // do other stuff
    }
}
Here is part of the code; I won't post all of it since most of it is irrelevant, but within the second loop I have about 2 other short loops, and the whole function completes in 350-600 ms. What would be the best way to optimize the loops? The word.Contains call alone has an impact of about 100-150 ms on performance.
It seems that you're looking for a text search, and in this case, you can benefit from projects like LuceneNet.
If the call to string.Contains() didn't exist and you were looking for exact matches, I would suggest that you swap the list for a HashSet, as it would give you a great performance boost in your case, like below:
static void Main(string[] args)
{
    var wordList = new HashSet<string>(); // Assuming Initialized
    var secondWordList = new List<X>();   // Assuming Initialized
    for (var c = 0; c < secondWordList.Count; c++)
    {
        if (wordList.Contains(secondWordList[c].Word))
            continue;
        // do other stuff
    }
}
With this, you iterate over the smaller list and look each value up in the HashSet, which has a complexity of O(1), meaning it will be extremely fast.
I have a loop that is too slow in C#. I want to know if there is a faster way to process these arrays. I'm currently working in .NET 2.0, but I'm not opposed to upgrading this project. This is part of a theoretical image processing concept involving gray levels.
Pixel count (PixCnt = 21144402)
g_len = 4625
list_1d - 1-dimensional array of the image, with an upper bound of the above pixel count.
pg - gray level intensity holder.
This function creates an index of those values, hence pgidx.
int[] pgidx = new int[PixCnt];
sw = new Stopwatch();
sw.Start();
for (int i = 0; i < PixCnt; i++)
{
    int j = 0;
    pgidx[i] = 0;
    while (j < g_len && list_1d[i] != pg[j]) j++;
    if (j < g_len && list_1d[i] == pg[j])
        pgidx[i] = j;
}
sw.Stop();
Debug.WriteLine("PixCnt loop took " + sw.ElapsedMilliseconds + " ms");
I think using a dictionary to store what's in the pg array will speed it up. g_len is 4625 elements, so you will likely average around 2312 iterations of the inner while loop. Replacing that with a single hashed lookup in a dictionary should be faster. Since the outer loop executes 21 million times, speeding up the body of that loop should reap big rewards. I'm guessing the code below will be 100 to 1000 times faster.
var pgDict = new Dictionary<int, int>(g_len);
for (int i = 0; i < g_len; i++) pgDict.Add(pg[i], i);

int[] pgidx = new int[PixCnt];
int value = 0;
for (int i = 0; i < PixCnt; i++) {
    if (pgDict.TryGetValue(list_1d[i], out value)) pgidx[i] = value;
}
Note that setting pgidx[i] to zero when a match isn't found is not necessary, because all elements of the array are already initialized to zero when the array is created.
If there is the possibility for a value in pg to appear more than once, you would want to check first to see if that key has already been added, and skip adding it to the dictionary if it has. That would mimic your current behavior of finding the first match. To do that replace the line where the dictionary is built with this:
for (int i = 0; i < g_len; i++) if (!pgDict.ContainsKey(pg[i])) pgDict.Add(pg[i], i);
If the range of the pixel values in pg allows it (say 16 bpp = 65536 entries), you can create an auxiliary array that maps every possible gray level to its index value in pg. Filling this array is done with a single pass over pg (after initializing it to all zeroes).
Then convert list_1d to pgidx with straight table lookups.
If the table is too big (bigger than the image), then do as #hatchet answered.
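A minimal sketch of that lookup-table idea (my own illustration, assuming 16 bpp so the table has 65536 entries, and reusing the variable names from the question):
// lut[v] maps a gray value v to its index in pg (0 if v never occurs in pg)
int[] lut = new int[65536];
for (int j = g_len - 1; j >= 0; j--)
    lut[pg[j]] = j;                 // iterate backwards so the first occurrence wins

// converting list_1d to pgidx is then a straight table lookup per pixel
int[] pgidx = new int[PixCnt];
for (int i = 0; i < PixCnt; i++)
    pgidx[i] = lut[list_1d[i]];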
I need to create an array of boolean values, which could be on the scale of 100,000s or even millions of entries. It also needs to be super-fast, so every millisecond per iteration counts.
At the time of beginning the loop, I will already know how many entries there are going to be in the array. The question is, will it be faster to create a bool array up front and fill in the values by index (which is random access - could be slow?), or should I create a List<bool>, keep adding entries to the list, and at the end return .ToArray()?
In other words:
Option 1
var array = new bool[size];
for (var n = 0; n < size; n++)
    array[n] = GetValue(n);
return array;
Option 2
var list = new List<bool>();
for (var n = 0; n < size; n++)
    list.Add(GetValue(n));
return list.ToArray();
Or maybe there's a 3rd way that's even faster?
Use a System.Collections.BitArray and don't worry about speed.
What you are suggesting above will only waste your memory. This optimizes both for speed and size, and will pack your bool values nicely (8 per byte, as the gods intended :).
Reply to the comments below: if you use a BitArray, everything will be zero at first. Set only those bits for which GetValue == true.
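For illustration, a minimal sketch of that suggestion (my own example; size and GetValue are the hypothetical count and per-index predicate from the question):
using System.Collections; // BitArray lives here

// a BitArray packs 8 flags per byte and starts out all false,
// so only the true values need to be written
BitArray flags = new BitArray(size);
for (int n = 0; n < size; n++)
    if (GetValue(n))
        flags[n] = true;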
The following code seems to show (at least to me) that of the methods discussed on this page, the simple allocation to a bool[] using a loop is quickest.
The code also seems to show me that unless GetValue(n) is computationally trivial, the overhead of allocating the bytes is not the part of the process I would be hoping to optimise.
Hope this helps in some way.
edit: added the results from the run (on my machine)
-- 187ms BitArray
-- 171ms List<bool>().ToArray
-- 168ms bool[] set only if true
-- 130ms bool[] always set
-- 11460ms bool[] always set with 'complex' GetValue()
class Program
{
    static void Main(string[] args)
    {
        BitArray bitArray = new BitArray(10000000);
        bool[] boolArray = new bool[10000000];

        Stopwatch sw1 = new Stopwatch();
        sw1.Start();
        for (int i = 0; i < 10000000; i++)
        {
            bitArray[i] = GetMod2(i);
        }
        Console.WriteLine(sw1.ElapsedMilliseconds);

        sw1.Restart();
        var list = new List<bool>();
        for (int i = 0; i < 10000000; i++)
            list.Add(GetMod2(i));
        var boolArray2 = list.ToArray();
        Console.WriteLine(sw1.ElapsedMilliseconds);

        sw1.Restart();
        for (int i = 0; i < 10000000; i++)
        {
            bool nextVal = GetMod2(i);
            if (nextVal)
                bitArray[i] = true;
        }
        Console.WriteLine(sw1.ElapsedMilliseconds);

        sw1.Restart();
        for (int i = 0; i < 10000000; i++)
        {
            boolArray[i] = GetMod2(i);
        }
        Console.WriteLine(sw1.ElapsedMilliseconds);

        sw1.Restart();
        for (int i = 0; i < 10000000; i++)
        {
            boolArray[i] = GetRand(i);
        }
        Console.WriteLine(sw1.ElapsedMilliseconds);
        Console.ReadLine();
    }

    static bool GetMod2(int i)
    {
        return (i % 2) == 1;
    }

    static bool GetRand(int i)
    {
        return new Random().Next(2) == 1;
    }
}
Go with the first. The only reason it might be "slow" is if it keeps paging data from outside the processor cache.
The list will have exactly the same problem, except it will also need to perform several memory allocations and copies.
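As an aside (my own note, not part of this answer), if you do go with the List approach and know the size up front, passing the size to the constructor pre-allocates the backing array so the list never has to grow and copy:
// pre-sizing the capacity avoids the repeated grow-and-copy cycles
// an empty List<bool> would otherwise go through
var list = new List<bool>(size);
for (var n = 0; n < size; n++)
    list.Add(GetValue(n));
return list.ToArray();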
Now here's a funny old thing. Inspired by #paul, I ran these benchmark tests myself, on 10,000,000 booleans. The results (in milliseconds) are very surprising, given the discussion in the comments to this question:
BitArray: 517
BitArray + CopyTo(array): 536
List + ToArray(): 455
bool array: 483
And what a turnout for the books! Despite the fact that the List<bool> is inserting a new record every time, while the bool[] and BitArray are initialized to false for every record and I only updated them where the value should be true, the List<bool> comes out on top, consistently, even including the .ToArray() call.
Yet another case where practical application is better than textbook knowledge, it seems... :)
I have a scenario where I have a list of classes, and I want to mix up the order. For example:
private List<Question> myQuestions = new List<Question>();
So, given that this is now populated with a set of data, I want to mix up the order. My first thought was to create a collection of integers numbered from 1 to myQuestions.Count, assign one at random to each question and then loop through them in order; however, I can’t seem to find a suitable collection type to use for this. An example of what I mean would be something like this:
for (int i = 0; i <= myQuestions.Count -1; i++)
tempCollection[i] = myQuestions[rnd.Next(myQuestions.Count-1)];
But I’m not sure what tempCollection should be – it just needs to be a single value that I can remove as I use it. Does anyone have any suggestions as to which type to use, or of a better way to do this?
I suggest you copy the results into a new List<Question> and then shuffle that list.
However, I would use a Fisher-Yates shuffle rather than the one you've given here. There are plenty of C# examples of that on this site.
For example, you might do:
// Don't create a new instance of Random each time. That's a detail
// for another question though.
Random rng = GetAppropriateRandomInstanceForThread();
List<Question> shuffled = new List<Question>(myQuestions);
for (int i = shuffled.Count - 1; i > 0; i--)
{
    // Swap element "i" with a random earlier element (or itself)
    int swapIndex = rng.Next(i + 1);
    Question tmp = shuffled[i];
    shuffled[i] = shuffled[swapIndex];
    shuffled[swapIndex] = tmp;
}
You could use Linq and order by a random value:
List<string> items = new List<string>();
items.Add("Foo");
items.Add("Bar");
items.Add("Baz");
foreach (string item in items.OrderBy(c => Guid.NewGuid()))
{
    Console.WriteLine(item);
}
tempCollection should be the same type as myQuestions.
I would also suggest a change in your code:
for (int i = 0; i <= myQuestions.Count -1; i++)
to
for (int i = 0; i < myQuestions.Count; i++)
It does the same thing, but this is how most programmers write it, so it will make your code easier to read.