Looping through huge c# list with inner loops

I have a huge List<string> named _wordList containing about 100,000 values. The problem is that I also need several nested loops inside the loop over it. The inner loop runs over a second list whose elements are a simple structure holding only fields; it contains roughly 0-100 entries depending on what happens at runtime.
for (int y = 0; y < _wordList.Count; y++)
{
string word = _wordList[y];
for(int x = 0; x < _secondWordList.Count; x++)
{
if (!word.Contains(_secondWordList[x].Word) || word == _secondWordList[x].Word)
continue;
// do other stuff
}
}
Here is part of the code. I won't post all of it since most of it is irrelevant, but within the second loop I have about 2 other short loops, and the whole function completes in 350-600ms. What would be the best way to optimize the loops? The call to word.Contains alone accounts for about 100-150ms.

It seems that you're looking for a text search, and in this case you can benefit from projects like Lucene.NET.
If the call to string.Contains() didn't exist and you were looking for exact matches, I would suggest that you swap the list for a HashSet, as it would give you a great performance boost in your case. Like below.
static void Main(string[] args)
{
var wordList = new HashSet<string>(); //Assuming Initialized
var secondWordList = new List<X>(); //Assuming Initialized
for (var c = 0; c < secondWordList.Count; c++)
{
if(wordList.Contains(secondWordList[c].Word))
continue;
// do other stuff
}
}
With this, you iterate over the smaller list and look each value up in the HashSet, which has an average complexity of O(1), which means that it'll be extremely fast.

Related

How to replace values smaller than zero with zero in a collection using LINQ

I have a list of objects. This object has a field called val. This value shouldn't be smaller than zero but there are such objects in this list. I want to replace those less-than-zero values with zero. The easiest solution is
foreach(Obj item in list)
{
if (item.val < 0)
{
item.val = 0;
}
}
But I want to do this using LINQ. The important thing is I do not want a list of updated elements. I want the same list just with the necessary values replaced. Thanks in advance.

As I read the comments I realized what I wanted to do is less efficient and pointless. LINQ is for querying and creating new collections rather than updating collections. A possible solution I came across was this
list.Select(c => { if (c.val < 0 ) c.val= 0; return c;}).ToList();
But my initial foreach solution is more efficient than this. So don't make the same mistake I did and complicate things.
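If a LINQ flavour is still wanted, one common compromise is to let LINQ do only the filtering and keep the mutation in a plain foreach. The following is a minimal sketch, assuming a hypothetical Obj class with a public int field val as described in the question; the names ClampExample and ClampNegatives are illustrative only.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for the question's object type.
public class Obj
{
    public int val;
}

public static class ClampExample
{
    // Sets every negative val to zero in place; the list itself is not replaced.
    public static void ClampNegatives(List<Obj> list)
    {
        // Where() only selects the items; the assignment happens in the loop body.
        foreach (Obj item in list.Where(o => o.val < 0))
        {
            item.val = 0;
        }
    }

    public static void Main()
    {
        var list = new List<Obj>
        {
            new Obj { val = -3 },
            new Obj { val = 5 },
            new Obj { val = -1 }
        };
        ClampNegatives(list);
        Console.WriteLine(string.Join(", ", list.Select(o => o.val))); // prints "0, 5, 0"
    }
}
```

This is safe because only a field of each element changes; the list is not added to or removed from while it is being enumerated.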

You can try this one, which may be faster on large lists because the work is spread across cores:
Parallel.ForEach(list, item =>
{
item.val = item.val < 0 ? 0 : item.val;
});
Parallel.ForEach in C# provides a parallel version of the standard, sequential foreach loop. In a standard foreach loop, each iteration processes a single item from the collection, one item after another. The Parallel.ForEach method, however, can execute multiple iterations at the same time on different processors or processor cores. This opens the possibility of synchronization problems, so the loop is best suited to work where each iteration is independent of the others.

A for loop can be slightly faster than foreach, so you can use this one:
for (int i = 0; i < list.Count; i++)
{
if(list[i].val < 0)
{
list[i].val = 0;
}
}

C# Finding an element in a List

Let's say I have the following C# code
var my_list = new List<string>();
// Filling the list with tons of sentences.
string sentence = Console.ReadLine();
Is there any difference between doing either of the following ?
bool c1 = my_list.Contains(sentence);
bool c2 = my_list.Any(s => s == sentence);
I imagine the pure algorithmic behind isn't exactly the same. But what are the actual differences on my side? Is one way faster or more efficient than the other? Will one method sometime return true and the other false? What should I consider to pick one method or the other? Or is it purely up to me and both work in any situation?

The most upvoted answer isn't completely correct (and it's a reason big O doesn't always tell the whole story). Any will be slower than Contains in this scenario (by about double).
Any makes an extra call on every iteration, invoking the delegate you specified for every item in your list, which is something Contains does not have to do. That extra call slows it down substantially.
The results will be the same, but the speed will be very different.
Example benchmark:
Stopwatch watch = new Stopwatch();
List<string> stringList = new List<string>();
for (int i = 0; i < 10000000; i++)
{
stringList.Add(i.ToString());
}
int t = 0;
watch.Start();
for (int i = 0; i < 1000000; i++)
if (stringList.Any(x => x == "29"))
t = i;
watch.Stop();
("Any takes: " + watch.ElapsedMilliseconds).Dump();
GC.Collect();
watch.Restart();
for (int i = 0; i < 1000000; i++)
if (stringList.Contains("29"))
t = i;
watch.Stop();
("Contains takes: " + watch.ElapsedMilliseconds).Dump();
Results:
Any takes: 481
Contains takes: 235
The size of the list and the number of iterations will not affect the percentage difference; Any will always be slower.

Realistically, the two will operate in almost the same fashion: iterate the list's items and check whether sentence matches any list element, giving a complexity of about O(n). I would argue for List.Contains since it is a little easier and more natural, but it's entirely a matter of preference!
Now, if you're looking for something faster in terms of lookup complexity and speed, I'd suggest a HashSet<T>. HashSets have, generally speaking, a lookup of about O(1) since the hashing function, theoretically, should be a constant time operation. Again, just a suggestion :)

For string objects, there's no difference, since the == operator simply calls String.Equals.
However, for other objects, there could be differences between == and .Equals. Looking at the implementation of .Contains, it uses EqualityComparer<T>.Default, which hooks into Equals(T) as long as your class implements IEquatable<T> (where T is the class itself). Without overloading ==, most classes instead use referential comparison for ==, since that's what they inherit from Object.
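The difference can be sketched with a small class that implements IEquatable<T> but does not overload ==. The Point type and the method names below are hypothetical, chosen only for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical class: implements IEquatable<Point> but does NOT overload ==.
public sealed class Point : IEquatable<Point>
{
    public int X { get; }
    public int Y { get; }

    public Point(int x, int y) { X = x; Y = y; }

    public bool Equals(Point other) =>
        !ReferenceEquals(other, null) && X == other.X && Y == other.Y;

    public override bool Equals(object obj) => Equals(obj as Point);

    public override int GetHashCode() => (X * 397) ^ Y;
}

public static class EqualityExample
{
    // List<T>.Contains goes through EqualityComparer<T>.Default, i.e. Equals(Point).
    public static bool ViaContains(List<Point> points, Point target) =>
        points.Contains(target);

    // == is not overloaded on Point, so this compares references.
    public static bool ViaAnyOperator(List<Point> points, Point target) =>
        points.Any(p => p == target);

    public static void Main()
    {
        var points = new List<Point> { new Point(1, 2) };
        var probe = new Point(1, 2); // equal by value, but a different instance

        Console.WriteLine(ViaContains(points, probe));    // True
        Console.WriteLine(ViaAnyOperator(points, probe)); // False
    }
}
```

Writing the predicate as p => p.Equals(target) would make Any agree with Contains here.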

Binary search slower, what am I doing wrong?

EDIT: So it looks like this is normal behavior. Can anyone recommend a faster way to do these numerous intersections?
My problem is this. I have 8000 lists (of strings). For each list (ranging in size from 50 to 400), I compare it to every other list and perform a calculation based on the size of the intersection. So I'll do
list1(intersect)list1= number
list1(intersect)list2= number
list1(intersect)list888= number
And I do this for every list. Previously, I had HashList and my code was essentially this (well, I was actually searching through properties of an object, so I had to modify the code a bit, but it's basically this).
I have my two versions below, but if anyone knows anything faster, please let me know!
Loop through AllLists, getting each list, starting with list1, and then do this:
foreach (List list in AllLists)
{
if (list1_length < list_length) // just a check so I'm looping through the smaller list
{
foreach (string word in list1)
{
if (block.generator_list.Contains(word))
{
//simple integer count
}
}
}
// a little more code, but the same, but looping through the other list if it's smaller/bigger
Then I made the lists into regular lists and applied Sort(), which changed my code to
foreach (List list in AllLists)
{
if (list1_length < list_length) // just a check so I'm looping through the smaller list
{
for (int i = 0; i < list1_length; i++)
{
var test = list.BinarySearch(list1[i]);
if (test > -1)
{
//simple integer count
}
}
}
The first version takes about 6 seconds; the other one takes more than 20 (I just stopped it there because otherwise it would take more than a minute!), and this is for a smallish subset of the data.
I'm sure there's a drastic mistake somewhere, but I can't find it.

Well I have tried three distinct methods for achieving this (assuming I understood the problem correctly). Please note I have used HashSet<int> in order to more easily generate random input.
setting up:
List<HashSet<int>> allSets = new List<HashSet<int>>();
Random rand = new Random();
for(int i = 0; i < 8000; ++i) {
HashSet<int> ints = new HashSet<int>();
int size = rand.Next(50, 400); // pick the target size once instead of re-rolling it on every loop test
for(int j = 0; j < size; ++j) {
ints.Add(rand.Next(0, 1000));
}
allSets.Add(ints);
}
the three methods I checked (code is what runs in the inner loop):
the loop:
note that you are getting duplicated results in your code (intersecting set A with set B and later intersecting set B with set A).
It won't affect your performance thanks to the list length check you are doing. But iterating this way is clearer.
for(int i = 0; i < allSets.Count; ++i) {
for(int j = i + 1; j < allSets.Count; ++j) {
}
}
first method:
used Enumerable.Intersect() to get the intersection with the other list and checked Enumerable.Count() to get the size of the intersection.
var intersect = allSets[i].Intersect(allSets[j]);
count = intersect.Count();
this was the slowest one averaging 177s
second method:
cloned the smaller set of the two sets I was intersecting, then used ISet.IntersectWith() and checked the resulting sets Count.
HashSet<int> intersect;
HashSet<int> intersectWith;
if(allSets[i].Count < allSets[j].Count) {
intersect = new HashSet<int>(allSets[i]);
intersectWith = allSets[j];
} else {
intersect = new HashSet<int>(allSets[j]);
intersectWith = allSets[i];
}
intersect.IntersectWith(intersectWith);
count = intersect.Count;
this one was slightly faster, averaging 154s
third method:
did something very similar to what you did: iterated over the shorter set and checked ISet.Contains on the longer set.
for(int i = 0; i < allSets.Count; ++i) {
for(int j = i + 1; j < allSets.Count; ++j) {
int count = 0;
HashSet<int> loopingSet, containsSet;
if(allSets[i].Count < allSets[j].Count) {
loopingSet = allSets[i];
containsSet = allSets[j];
} else {
loopingSet = allSets[j];
containsSet = allSets[i];
}
foreach(int k in loopingSet) {
if(containsSet.Contains(k)) {
++count;
}
}
}
}
this method was by far the fastest (as expected), averaging 66s
conclusion
the method you're using is the fastest of these three. I certainly can't think of a faster single threaded way to do this. Perhaps there is a better concurrent solution.

I've found that one of the most important considerations in iterating/searching any kind of collection is to choose the collection type very carefully. Iterating through an ordinary list will not be optimal for your purposes. Try using something like:
System.Collections.Generic.HashSet<T>
Using the Contains() method while iterating over the shorter list of two (as you mentioned you're already doing) should give close to O(1) performance, the same as key lookups in the generic Dictionary type.

What is the difference between for and foreach?

What is the major difference between for and foreach loops?
In which scenarios can we use for and not foreach and vice versa.
Would it be possible to show with a simple program?
Both seem the same to me. I can't differentiate them.

A for loop is a construct that says "perform this operation n times".
A foreach loop is a construct that says "perform this operation against each value/object in this IEnumerable".

You can use foreach if the object you want to iterate over implements the IEnumerable interface. You need to use for if you can access the object only by index.

I'll try to answer this with a more general approach:
foreach is used to iterate over each element of a given set or list (anything implementing IEnumerable) in a predefined manner. You can't influence the exact order (other than skipping entries or canceling the whole loop), as that's determined by the container.
foreach (String line in document) { // iterate through all elements of "document" as String objects
Console.Write(line); // print the line
}
for is just another way to write a loop that has code executed before entering the loop and once after every iteration. It's usually used to loop through code a given number of times. Contrary to foreach here you're able to influence the current position.
for (int i = 0, j = 0; i < 100 && j < 10; ++i) { // set i and j to 0, then loop as long as i is less than 100 and j is less than 10, increasing i after each iteration
if (i % 8 == 0) { // skip all numbers that can be divided by 8 and count them in j
++j;
continue;
}
Console.Write(i);
}
Console.Write(j);
If possible and applicable, prefer foreach to a for loop with an index. Depending on internal data organisation, foreach can be a lot faster than using for with an index (esp. when using linked lists).
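The linked-list case can be sketched as follows. LinkedList<T> has no indexer, so this sketch assumes the "index" access goes through LINQ's ElementAt, which walks from the head on every call; the class and method names are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class LinkedListExample
{
    // Walks the list once: O(n) overall.
    public static long SumWithForeach(LinkedList<int> list)
    {
        long sum = 0;
        foreach (int value in list)
        {
            sum += value;
        }
        return sum;
    }

    // ElementAt restarts from the head for each index: O(n^2) overall.
    public static long SumWithIndex(LinkedList<int> list)
    {
        long sum = 0;
        for (int i = 0; i < list.Count; i++)
        {
            sum += list.ElementAt(i);
        }
        return sum;
    }

    public static void Main()
    {
        var list = new LinkedList<int>(Enumerable.Range(1, 1000));
        Console.WriteLine(SumWithForeach(list)); // 500500
        Console.WriteLine(SumWithIndex(list));   // 500500
    }
}
```

Both methods return the same sum; only the amount of traversal work differs. For List<T> or arrays, where indexing is O(1), this argument does not apply.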

Everybody gave you the right answer with regard to foreach, i.e. it's a way to loop through the elements of something implementing IEnumerable.
On the other hand, for is much more flexible than what is shown in the other answers. In fact, for is used to execute a block of statements for as long as a specified condition is true.
From Microsoft documentation:
for (initialization; test; increment)
statement
initialization
Required. An expression. This expression is executed only once, before the loop is executed.
test
Required. A Boolean expression. If test is true, statement is executed. If test is false, the loop is terminated.
increment
Required. An expression. The increment expression is executed at the end of every pass through the loop.
statement
Optional. Statement to be executed if test is true. Can be a compound statement.
This means that you can use it in many different ways. Classic school examples are the sum of the numbers from 1 to 10:
int sum = 0;
for (int i = 0; i <= 10; i++)
sum = sum + i;
But you can use it to sum the numbers in an Array, too:
int[] anArr = new int[] { 1, 1, 2, 3, 5, 8, 13, 21 };
int sum = 0;
for (int i = 0; i < anArr.Length; i++)
sum = sum + anArr[i];
(this could have been done with a foreach, too):
int[] anArr = new int[] { 1, 1, 2, 3, 5, 8, 13, 21 };
int sum = 0;
foreach (int anInt in anArr)
sum = sum + anInt;
But you can use it for the sum of the even numbers from 1 to 10:
int sum = 0;
for (int i = 0; i <= 10; i = i + 2)
sum = sum + i;
And you can even invent some crazy thing like this one:
int i = 65;
string s; // declared outside the loop so it is still in scope for WriteLine
for (s = string.Empty; s != "ABC"; s = s + Convert.ToChar(i++).ToString()) ;
Console.WriteLine(s);

for loop:
1) need to specify the loop bounds( minimum or maximum).
2) executes a statement or a block of statements repeatedly
until a specified expression evaluates to false.
Ex1:-
int k = 0;
for (int x = 1; x <= 9; x++){
k = k + x ;
}
foreach statement:
1)do not need to specify the loop bounds minimum or maximum.
2)repeats a group of embedded statements for
a)each element in an array
or b) an object collection.
Ex2:-
int k = 0;
int[] tempArr = new int[] { 0, 2, 3, 8, 17 };
foreach (int i in tempArr){
k = k + i ;
}

foreach is almost equivalent to :
var enumerator = list.GetEnumerator();
while (enumerator.MoveNext()) {
var element = enumerator.Current;
// loop body goes here
}
To implement a "foreach"-compliant pattern, a class needs to provide a GetEnumerator() method that returns an object with a MoveNext() method and a Current property (a Reset() method is part of IEnumerator but is not required by foreach).
Indeed, you need to implement neither IEnumerable nor IEnumerator.
Some derived points:
foreach does not need to know the collection length, so it can iterate through a "stream" or any kind of "element producer".
foreach calls virtual methods on the iterator (most of the time), so it can perform less well than for.
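The pattern-based binding can be sketched like this. Countdown and CountdownEnumerator are hypothetical types; neither implements IEnumerable or IEnumerator, and the sketch leaves out Reset() because the compiler does not need it for foreach.

```csharp
using System;

// foreach binds to the GetEnumerator/MoveNext/Current pattern, not to an interface.
public class Countdown
{
    private readonly int _start;

    public Countdown(int start) { _start = start; }

    public CountdownEnumerator GetEnumerator() => new CountdownEnumerator(_start);
}

public class CountdownEnumerator
{
    private int _current;

    // Start one above the first value; MoveNext steps down before each read.
    public CountdownEnumerator(int start) { _current = start + 1; }

    public int Current => _current;

    public bool MoveNext()
    {
        _current--;
        return _current > 0;
    }
}

public static class PatternExample
{
    public static int Sum(Countdown countdown)
    {
        int sum = 0;
        foreach (int n in countdown) // compiles without any interface
        {
            sum += n;
        }
        return sum;
    }

    public static void Main()
    {
        foreach (int n in new Countdown(3))
        {
            Console.WriteLine(n); // prints 3, then 2, then 1
        }
    }
}
```

In everyday code, implementing IEnumerable<T> (often with yield return) is usually simpler and also makes the type usable with LINQ.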

It depends on what you are doing, and what you need.
If you are iterating through a collection of items, and do not care about the index values then foreach is more convenient, easier to write and safer: you can't get the number of items wrong.
If you need to process every second item in a collection, for example, or process the items in reverse order, then a for loop is the only practical way.
The biggest differences are that a foreach loop processes an instance of each element in a collection in turn, while a for loop can work with any data and is not restricted to collection elements alone. This means that a for loop can modify a collection - which is illegal and will cause an error in a foreach loop.
For more detail, see MSDN : foreach and for

Difference Between For and For Each Loop in C#
A for loop executes a block of code until an expression returns false, while a foreach loop executes a block of code for each item in an object collection.
A for loop can run with or without an object collection, while a foreach loop can run only over an object collection.
The for loop is a general loop construct which can be used for multiple purposes, whereas foreach is designed to work only on collections or IEnumerable objects.

foreach is useful if you have an array or another IEnumerable collection of data, whereas for can be used to access the elements of an array by their index.

A for loop is useful when you have an indication or determination, in advance, of how many times you want a loop to run. As an example, if you need to perform a process for each day of the week, you know you want 7 loops.
A foreach loop is when you want to repeat a process for all pieces of a collection or array, but it is not important specifically how many times the loop runs. As an example, you are formatting a list of favorite books for users. Every user may have a different number of books, or none, and we don't really care how many it is, we just want the loop to act on all of them.

The for loop executes a statement or a block of statements repeatedly until a specified expression evaluates to false.
There is a need to specify the loop bounds (minimum or maximum). Following is a code example of a simple for loop that runs from 0 up to and including 5.
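The listing this answer refers to appears to be missing here. A minimal sketch of such a loop, with a hypothetical class and method name, might look like this:

```csharp
using System;

public static class ForLoopExample
{
    // Runs i from 0 up to and including 5 and reports how many passes were made.
    public static int CountIterations()
    {
        int iterations = 0;
        for (int i = 0; i <= 5; i++)
        {
            Console.WriteLine(i); // prints 0 through 5, one value per line
            iterations++;
        }
        return iterations; // 6, because both bounds are included
    }

    public static void Main()
    {
        CountIterations();
    }
}
```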
Now we look at foreach in detail. What looks like a simple loop on the outside is actually driven by a more complex data structure called an enumerator:
An enumerator is a data structure with a Current property, a MoveNext method, and a Reset method. The Current property holds the value of the current element, and every call to MoveNext advances the enumerator to the next item in the sequence.
Enumerators are great because they can handle any iterative data structure. In fact, they are so powerful that all of LINQ is built on top of enumerators.
But the disadvantage of enumerators is that they require calls to Current and MoveNext for every element in the sequence. All those method calls add up, especially in mission-critical code.
Conversely, the for-loop only has to call get_Item for every element in the list. That’s one method call less than the foreach-loop, and the difference really shows.
So when should you use a foreach-loop, and when should you use a for-loop?
Here’s what you need to do:
When you’re using LINQ, use foreach
When you’re working with very large computed sequences of values, use foreach
When performance isn’t an issue, use foreach
But if you want top performance, use a for-loop instead

The major difference between the for and foreach loops in C# is best understood by how they work:
The for loop:
The for loop's variable is typically an integer counter, although any type is allowed.
The for loop executes a statement or block of statements repeatedly until a specified expression evaluates to false.
In a for loop we have to specify the loop's bounds (maximum or minimum). We can say this is a limitation of the for loop.
The foreach loop:
In a foreach loop, the loop variable has the same type as the elements of the array or collection.
The foreach statement repeats a group of embedded statements for each element in an array or an object collection.
In a foreach loop, you do not need to specify the loop bounds (minimum or maximum). We can say this is an advantage of the foreach loop.

I prefer the FOR loop in terms of performance. FOREACH is a little slower when you iterate over a large number of items.
If you perform more business logic with each instance, then FOREACH performs faster.
Demonstration:
I created a list of 10000000 instances and looped over it with FOR and FOREACH.
Time taken to loop:
FOREACH -> 53.852ms
FOR -> 28.9232ms
Below is the sample code.
class Program
{
static void Main(string[] args)
{
List<TestClass> lst = new List<TestClass>();
for (int i = 1; i <= 10000000; i++)
{
TestClass obj = new TestClass() {
ID = i,
Name = "Name" + i.ToString()
};
lst.Add(obj);
}
DateTime start = DateTime.Now;
foreach (var obj in lst)
{
//obj.ID = obj.ID + 1;
//obj.Name = obj.Name + "1";
}
DateTime end = DateTime.Now;
var first = end.Subtract(start).TotalMilliseconds;
start = DateTime.Now;
for (int j = 0; j<lst.Count;j++)
{
//lst[j].ID = lst[j].ID + 1;
//lst[j].Name = lst[j].Name + "1";
}
end = DateTime.Now;
var second = end.Subtract(start).TotalMilliseconds;
}
}
public class TestClass
{
public long ID { get; set; }
public string Name { get; set; }
}
If I uncomment the code inside the loops, the time taken to loop is:
FOREACH -> 2564.1405ms
FOR -> 2753.0017ms
Conclusion
If you do more business logic with the instance, then FOREACH is recommended.
If you are not doing much logic with the instance, then FOR is recommended.

Many answers are already there, I just need to identify one difference which is not there.
A for loop is fail-safe while a foreach loop is fail-fast.
Fail-fast iteration throws ConcurrentModificationException (in Java, which the example below uses; C#'s List<T> throws InvalidOperationException instead) if the collection is modified while it is being iterated.
Fail-safe iteration, however, keeps going without failing, even if the modification turns the iteration into an infinite loop.
public class ConcurrentModification {
public static void main(String[] args) {
List<String> str = new ArrayList<>();
for(int i=0; i<1000; i++){
str.add(String.valueOf(i));
}
/**
* this for loop is fail-safe. It goes into infinite loop but does not fail.
*/
for(int i=0; i<str.size(); i++){
System.out.println(str.get(i));
str.add(i+ " " + "10");
}
/**
* throws ConcurrentModificationexception
for(String st: str){
System.out.println(st);
str.add("10");
}
*/
/* throws ConcurrentModificationException
Iterator<String> itr = str.iterator();
while(itr.hasNext()) {
System.out.println(itr.next());
str.add("10");
}*/
}
}
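The listing above is Java. A rough C# counterpart, assuming a plain List<string> and hypothetical method names, could be sketched as follows; note that the for version fixes its upper bound up front so that it terminates instead of looping forever.

```csharp
using System;
using System.Collections.Generic;

public static class ModifyWhileIterating
{
    // Returns true if adding to the list inside foreach throws.
    public static bool ForeachThrows()
    {
        var items = new List<string> { "a", "b", "c" };
        try
        {
            foreach (string item in items)
            {
                items.Add(item + "!"); // structural change invalidates the enumerator
            }
            return false;
        }
        catch (InvalidOperationException)
        {
            return true; // List<T>'s enumerator detects the modification
        }
    }

    // An index-based loop tolerates the same modification.
    public static int ForWithFixedBound()
    {
        var items = new List<string> { "a", "b", "c" };
        int originalCount = items.Count; // fixed bound, so the loop ends
        for (int i = 0; i < originalCount; i++)
        {
            items.Add(items[i] + "!");
        }
        return items.Count;
    }

    public static void Main()
    {
        Console.WriteLine(ForeachThrows());     // True
        Console.WriteLine(ForWithFixedBound()); // 6
    }
}
```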
Hope this helps to understand the difference between the for and foreach loops from a different angle.
I found a good blog to go through the differences between fail-safe and fail-fast, if anyone is interested:

You can use foreach for a simple array like
int[] test = { 0, 1, 2, 3, ...};
And you can use for when you have a 2D (jagged) array
int[][] test = { new[] {1,2,3,4},
new[] {5,2,6,5,8} };

foreach syntax is quick and easy. for syntax is a little more complex, but is also more flexible.
foreach is useful when iterating over all of the items in a collection. for is useful when iterating over all or a subset of the items.
The foreach iteration variable, which provides each collection item, is read-only, so we can't assign a new value to it (and therefore can't replace items) as they are iterated. Using the for syntax, we can replace the items as needed.
Bottom line: use foreach to quickly iterate all of the items in a collection. Use for to iterate a subset of the items of the collection or to replace the items as they are iterated.

simple difference between for and foreach
A for loop works with values. It must have an initialization, a condition and an increment, and you have to know how many times the loop will repeat.
A foreach loop works with objects and enumerators. There is no need to know how many times the loop will repeat.

The foreach statement repeats a group of embedded statements for each element in an array or an object collection that implements the System.Collections.IEnumerable or System.Collections.Generic.IEnumerable<T> interface. The foreach statement is used to iterate through the collection to get the information that you want, but it cannot be used to add or remove items from the source collection, to avoid unpredictable side effects. If you need to add or remove items from the source collection, use a for loop.

One important thing about foreach is that the iteration variable cannot be updated (assigned a new value) in the loop body.
for example :
List<string> myStrlist = new List<string>() { "Sachin", "Ganguly", "Dravid" };
foreach(string item in myStrlist)
{
item += " cricket"; // ***Not Possible***
}
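Two workarounds follow from this restriction, sketched below with hypothetical names. A for loop can assign through the indexer, and when the elements are reference types, foreach can still change their members because only the variable itself is read-only. The Player class is an assumption made for the second case.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical reference type used to show member mutation inside foreach.
public class Player
{
    public string Name;
}

public static class IterationVariableExample
{
    // Replaces each string through the indexer, which foreach cannot do.
    public static void AppendWithFor(List<string> names)
    {
        for (int i = 0; i < names.Count; i++)
        {
            names[i] += " cricket";
        }
    }

    // Allowed in foreach: the variable p is never reassigned, only p.Name changes.
    public static void RenameAll(List<Player> players)
    {
        foreach (Player p in players)
        {
            p.Name += " cricket";
        }
    }

    public static void Main()
    {
        var names = new List<string> { "Sachin", "Ganguly", "Dravid" };
        AppendWithFor(names);
        Console.WriteLine(names[0]); // Sachin cricket

        var players = new List<Player> { new Player { Name = "Sachin" } };
        RenameAll(players);
        Console.WriteLine(players[0].Name); // Sachin cricket
    }
}
```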

Simple Dictionary Lookup is Slow in .Net Compared to Flat Array

I found that dictionary lookup could be very slow if compared to flat array access. Any idea why? I'm using Ants Profiler for performance testing. Here's a sample function that reproduces the problem:
private static void NodeDisplace()
{
var nodeDisplacement = new Dictionary<double, double[]>();
var times = new List<double>();
for (int i = 0; i < 6000; i++)
{
times.Add(i * 0.02);
}
foreach (var time in times)
{
nodeDisplacement.Add(time, new double[6]);
}
var five = 5;
var six = 6;
int modes = 10;
var arrayList = new double[times.Count*6];
for (int i = 0; i < modes; i++)
{
int k=0;
foreach (var time in times)
{
for (int j = 0; j < 6; j++)
{
var simpelCompute = five * six; // 0.027 sec
nodeDisplacement[time][j] = simpelCompute; //0.403 sec
arrayList[6*k+j] = simpelCompute; //0.0278 sec
}
k++;
}
}
}
Notice the relative magnitude between flat array access and dictionary access? The flat array is about 14 times faster than the dictionary (0.403/0.0278), even after taking the array index manipulation (6*k+j) into account.
As weird as it sounds, dictionary lookup is taking a major portion of my time, and I have to optimize it.

Yes, I'm not surprised. The point of dictionaries is that they're used to look up arbitrary keys. Consider what has to happen for a single array dereference:
Check bounds
Multiply index by element size
Add index to pointer
Very, very fast. Now for a dictionary lookup (very rough; depends on implementation):
Potentially check key for nullity
Take hash code of key
Find the right slot for that hash code (probably a "mod prime" operation)
Probably dereference an array element to find the information for that slot
Compare hash codes
If the hash codes match, compare for equality (and potentially go on to the next hash code match)
If you've got "keys" which can very easily be used as array indexes instead (e.g. contiguous integers, or something which can easily be mapped to contiguous integers) then that will be very, very fast. That's not the primary use case for hash tables. They're good for situations which can't easily be mapped that way - for example looking up by string, or by arbitrary double value (rather than doubles which are evenly spaced, and can thus be mapped to integers easily).
I would say that your title is misleading - it's not that dictionary lookup is slow, it's that when arrays are a more suitable approach, they're ludicrously fast.
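For the evenly spaced times in the question, the mapping to contiguous integers could be sketched as below. The 0.02 step and the 6000 x 6 shape come from the question's code; the helper name ToIndex and the use of Math.Round to absorb floating-point error are assumptions of this sketch.

```csharp
using System;

public static class TimeIndexExample
{
    private const double Step = 0.02; // spacing between time values in the question

    // Maps an evenly spaced time value back to its integer slot.
    // Rounding absorbs the small error from floating-point multiplication.
    public static int ToIndex(double time) => (int)Math.Round(time / Step);

    public static void Main()
    {
        // One row of 6 values per time step, addressed by index instead of by key.
        var displacement = new double[6000][];
        for (int i = 0; i < displacement.Length; i++)
        {
            displacement[i] = new double[6];
        }

        double time = 150 * Step;
        displacement[ToIndex(time)][3] = 30.0;

        Console.WriteLine(ToIndex(time)); // 150
    }
}
```

This replaces hashing and equality comparison on a double key with one division and a rounding step, at the cost of only working when the keys really are evenly spaced.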

In addition to Jon's answer, I would like to add that your inner loop does not do very much. Normally you do at least some more work in the inner loop, and then the relative performance loss of the dictionary is somewhat lower.
If you look at the code for Double.GetHashCode() in Reflector you'll find that it executes 4 lines of code (assuming your double is not 0), and that alone is more than the body of your inner loop. Dictionary<TKey, TValue>.Insert() (called by the set indexer) is even more code, almost a screen full.
The thing with Dictionary compared to a flat array is that you don't waste too much memory when your keys are not dense (as they are in your case), and that reads and writes are ~O(1) like arrays (but with a higher constant).
As a side note you can use a multi dimensional array instead of the 6*k+j trick.
Declare it this way
var arrayList = new double[times.Count, 6];
and use it this way
arrayList[k ,j] = simpelCompute;
It won't be faster, but it is easier to read.
