When are immutable collections preferable to concurrent collections? - C#

I recently read about immutable collections.
They are recommended as thread-safe collections for reading, when read operations are performed more often than writes.
So I wanted to test the read performance of ImmutableDictionary vs ConcurrentDictionary. Here is a very simple test (in .NET Core 2.1):
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Collections.Immutable;
using System.Diagnostics;
using System.Linq;
using System.Threading.Tasks;
namespace ImmutableSpeedTests
{
class Program
{
public class ConcurrentVsImmutable
{
public int ValuesCount;
public int ThreadsCount;
private ImmutableDictionary<int, int> immutable = ImmutableDictionary<int, int>.Empty;
private ConcurrentDictionary<int, int> concurrent = new ConcurrentDictionary<int, int>();
public ConcurrentVsImmutable(int valuesCount, int threadsCount)
{
ValuesCount = valuesCount;
ThreadsCount = threadsCount;
}
public void Setup()
{
// Fill both collections. Filling isn't timed, because the immutable one obviously fills much more slowly.
for (var i = 0; i < ValuesCount; i++)
{
concurrent[i] = i;
immutable = immutable.Add(i, i);
}
}
public async Task<long> ImmutableSum() => await Sum(immutable);
public async Task<long> ConcurrentSum() => await Sum(concurrent);
private async Task<long> Sum(IReadOnlyDictionary<int, int> dic)
{
var tasks = new List<Task<long>>();
// main job. Run multiple tasks to sum all values.
for (var i = 0; i < ThreadsCount; i++)
tasks.Add(Task.Run(() =>
{
long x = 0;
foreach (var key in dic.Keys)
{
x += dic[key];
}
return x;
}));
var result = await Task.WhenAll(tasks.ToArray());
return result.Sum();
}
}
static void Main(string[] args)
{
var test = new ConcurrentVsImmutable(1000000, 4);
test.Setup();
var sw = new Stopwatch();
sw.Start();
var result = test.ConcurrentSum().Result;
sw.Stop();
// Confirm that the results of the two runs are the same
Console.WriteLine($"Concurrent. Result: {result}. Elapsed: {sw.ElapsedTicks}.");
sw.Reset();
sw.Start();
result = test.ImmutableSum().Result;
sw.Stop();
Console.WriteLine($" Immutable. Result: {result}. Elapsed: {sw.ElapsedTicks}.");
Console.ReadLine();
}
}
}
You can run this code. The elapsed time in ticks differs from run to run, but the time spent by ConcurrentDictionary is several times less than that spent by ImmutableDictionary.
This experiment puzzles me. Did I do it wrong? What is the reason to use immutable collections if we have concurrent ones? When are they preferable?

Immutable collections are not an alternative to concurrent collections. Because of the way they are designed to reduce memory consumption, they are bound to be slower; the trade-off is to use less memory, sharing structure instead of performing n copy operations for every change.
We usually copy collections to other collections to achieve immutability and persist state. Let's see what that means:
var s1 = ImmutableStack<int>.Empty;
var s2 = s1.Push(1);
// s2 = [1]
var s3 = s2.Push(2);
// s2 = [1]
// s3 = [1,2]
// notice that s2 has only one item, it is not modified..
var s4 = s3.Pop(out var i);
// s2 = [1];
// still s2 has one item...
Notice that s2 always keeps its single item, no matter what is pushed or popped on the stacks derived from it.
Internally, all the data is stored in a big tree, and your collection points to a branch whose descendants represent the current state of the collection.
I don't think the performance can match that of a concurrent collection; the goals are totally different.
In a concurrent collection, you have a single copy of the collection accessed by all threads.
In an immutable collection, you have a virtually isolated copy of a tree, and navigating that tree is always costly.
This is useful in transactional systems: if a transaction has to be rolled back, the state of the collection can be retained at commit points.
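For instance, here is a minimal sketch of that commit-point idea (the account names are illustrative): a snapshot is just a reference copy, and a rollback is just a reference swap.
using System.Collections.Immutable;
var accounts = ImmutableDictionary<string, decimal>.Empty
.Add("alice", 100m)
.Add("bob", 50m);
var commitPoint = accounts; // O(1) snapshot: just a reference
accounts = accounts.SetItem("alice", 40m); // tentative changes...
accounts = accounts.SetItem("bob", 110m);
bool transferIsValid = false; // suppose validation fails
if (!transferIsValid)
accounts = commitPoint; // rollback: swap the reference back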

This is a criticism that's been made before.
As Akash already said, ImmutableDictionary works with an internal tree instead of a hash table.
One aspect of this is that you can improve the performance slightly if you build the dictionary in one step instead of iteratively adding all the keys:
immutable = concurrent.ToImmutableDictionary();
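If the source data isn't already in another collection, the builder API gives a similar one-shot construction; here is a hedged sketch of what Setup() could do instead (the builder types ship with System.Collections.Immutable):
var builder = ImmutableDictionary.CreateBuilder<int, int>();
for (var i = 0; i < ValuesCount; i++)
builder[i] = i;
immutable = builder.ToImmutable(); // one freeze instead of one new dictionary per Add()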
Enumerating a hash table and a balanced tree are both O(n) operations. I took the average of a few runs on a single thread for varying container sizes and got results consistent with that.
I don't know why the immutable slope is 6x steeper; for now I'll just assume it's doing tricky non-blocking tree stuff. I assume this class would be optimized for random stores and reads rather than enumeration.
To identify what exact scenarios ImmutableDictionary wins at, we'd need to wrap a concurrent dictionary to provide some level of immutability, and test both classes in the face of levels of read/write contention.
Not a serious suggestion, but a counterpoint to your test is to use immutability to "cheat" over multiple iterations by comparing:
private ConcurrentDictionary<object, long> cache = new ConcurrentDictionary<object, long>();
public long ImmutableSum()
{
return cache.GetOrAdd(immutable, (obj) => (obj as ImmutableDictionary<int, int>).Sum(kvp => (long)kvp.Value));
}
public long ConcurrentSum() => concurrent.Sum(kvp => (long)kvp.Value);
This makes quite a difference on subsequent calls to sum an unchanged collection!

The two are not mutually exclusive. I use both.
If your dictionary is small, the read performance of ImmutableDictionary will be superior to ConcurrentDictionary, since K1*log(N) < K2 whenever log(N) < K2/K1, i.e. when the hash table's constant overhead costs more than the tree traversal.
I personally find the write semantics of the Immutable collections easier to understand than those of the concurrent collections as they tend to be more consistent, especially when dealing with AddOrUpdate() and GetOrAdd().
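To illustrate with a sketch of my own (not part of the original answer): the immutable counterpart of ConcurrentDictionary.AddOrUpdate() is ImmutableInterlocked.AddOrUpdate(), which retries a compare-and-swap loop until the updated dictionary is published:
using System.Collections.Immutable;
static class Counters
{
static ImmutableDictionary<string, int> counts = ImmutableDictionary<string, int>.Empty;
// Thread-safe increment: the update delegate is re-run if another
// thread swapped the dictionary in the meantime.
public static int Increment(string key) =>
ImmutableInterlocked.AddOrUpdate(ref counts, key, 1, (k, v) => v + 1);
}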
In practice, I find that there are many cases in which I have a good number of small (or empty) dictionaries that are more appropriate as ImmutableDictionary, and some larger ones that warrant the use of ConcurrentDictionary.
Having said that, if they are small then it doesn't make much of a difference what you use.
Regarding Peter Wishart's answer, the enumeration cost of ImmutableDictionary is higher than that of ConcurrentDictionary (for reasonable N) because tree traversal is brutal in terms of memory latency on modern cache architectures.


Iterate through a sequence and then call Count(), or create a List first and then use Count

The language I use is C#.
Suppose we want to iterate through the elements of a sequence called customers, which is a sequence of objects of a fictional type called Customer. In terms of code, we have the following:
IEnumerable<Customer> customers = module.GetCustomers();
where module is a service-layer class through whose method we can retrieve all the customers. That being said, the iteration through the elements of customers would be:
foreach(var customer in customers)
{
}
Now suppose that, after having iterated through the elements of customers, we want to get the number of customers. That could be done like below:
int numberOfCustomers = customers.Count();
My concern/question now is the following:
Using the Count() method, we iterate again through the elements of customers. However, if we had created an in-memory collection of these objects by calling, for instance, the ToList() method:
List<Customer> customers = module.GetCustomers()
.ToList();
we would have the number of customers in O(1), using the Count property of the list customers.
To find out which of these two options is better, I wrote a simple console app and used the Stopwatch class to profile them. However, I didn't come to a clear result.
Which of these two options is the better one?
UPDATE
I ran the following console app:
class Program
{
static void Main(string[] args)
{
IEnumerable<int> numbers = Enumerable.Range(0, 1000);
Stopwatch stopwatch = new Stopwatch();
stopwatch.Start();
foreach (var number in numbers)
Console.WriteLine(number);
Console.WriteLine(numbers.Count());
stopwatch.Stop();
// I got 175ms
Console.WriteLine(stopwatch.ElapsedMilliseconds);
Console.ReadKey();
stopwatch.Restart();
List<int> numbers2 = numbers.ToList();
foreach (var number in numbers2)
Console.WriteLine(number);
Console.WriteLine(numbers2.Count);
stopwatch.Stop();
// I got 86ms
Console.WriteLine(stopwatch.ElapsedMilliseconds);
Console.ReadKey();
}
}
Then I ran this:
class Program
{
static void Main(string[] args)
{
IEnumerable<int> numbers = Enumerable.Range(0, 1000);
Stopwatch stopwatch = new Stopwatch();
stopwatch.Start();
List<int> numbers2 = numbers.ToList();
foreach (var number in numbers2)
Console.WriteLine(number);
Console.WriteLine(numbers2.Count);
stopwatch.Stop();
// I got 167ms
Console.WriteLine(stopwatch.ElapsedMilliseconds);
Console.ReadKey();
stopwatch.Restart();
foreach (var number in numbers)
Console.WriteLine(number);
Console.WriteLine(numbers.Count());
stopwatch.Stop();
// I got 104ms
Console.WriteLine(stopwatch.ElapsedMilliseconds);
Console.ReadKey();
}
}
I usually prefer to make my repository methods return an IReadOnlyCollection<>, which helps callers to know that they can safely iterate it multiple times:
IReadOnlyCollection<Customer> customers = module.GetCustomers();
If I can't do that, and I know that I'm going to iterate over what I'm given multiple times, I'll typically use .ToList() to make sure I'm dealing with an in-memory collection:
var customers = module.GetCustomers().ToList();
In cases where customers was already an in-memory collection, this adds a little overhead by creating a list, but it helps to avoid the risk of creating an enormous amount of overhead by doing something like retrieving the data from the database multiple times.
Your benchmark is flawed for a few reasons, but one of the biggest reasons is that it's using Console.WriteLine(), which performs an I/O operation. That operation will take far, far longer than iterating the collections and counting the results, combined. In fact, the variance in the amount of time spent in Console.WriteLine() will outweigh the differences in the code you're testing.
But this actually illustrates my point well: I/O operations take vastly longer than CPU and memory operations, so it's often worthwhile to add .ToList(), which will probably add microseconds to the run time, in order to avoid the slightest possibility of adding I/O operations, which can add milliseconds.
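To make the multiple-enumeration point concrete, here is a minimal sketch; GetNumbers() is a hypothetical stand-in for a repository call like GetCustomers():
using System;
using System.Collections.Generic;
using System.Linq;
class Demo
{
static IEnumerable<int> GetNumbers()
{
Console.WriteLine("Source enumerated"); // imagine a database round trip here
for (int i = 0; i < 3; i++)
yield return i;
}
static void Main()
{
IEnumerable<int> numbers = GetNumbers();
foreach (var n in numbers) { } // prints "Source enumerated"
Console.WriteLine(numbers.Count()); // prints "Source enumerated" again
List<int> list = GetNumbers().ToList(); // the source is hit exactly once
foreach (var n in list) { }
Console.WriteLine(list.Count); // O(1), no re-enumeration
}
}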

Intersect and Any, or Contains and Any: which is more efficient to find at least one common element?

If I have two lists and I want to know if there is at least one common element, I have these two options:
lst1.Intersect(lst2).Any();
lst1.Any(x => lst2.Contains(x));
The two options give me the result that I expect; however, I don't know which is the better option. Which is more efficient? And why?
Thanks.
EDIT: When I created this post, apart from the solution, I was looking for the reason. I know that I can run tests, but I wouldn't know the reason for the result. Is one faster than the other? Is one solution always better than the other?
So for this reason, I have accepted the answer of Matthew, not only for the test code, but also because he explains when one is better than the other and why. I appreciate the contributions of Nicholas and Oren a lot too.
Thanks.
Oren's answer has an error in the way the stopwatch is being used. It isn't being reset at the end of the loop after the time taken by Any() has been measured.
Note how the code goes back to the start of the loop with the stopwatch never being Reset(), so the time added to intersect includes the time taken by Any().
Following is a corrected version.
A release build run outside any debugger gives this result on my PC:
Intersect: 1ms
Any: 6743ms
Note how I'm making two non-overlapping string lists for this test. Also note that this is a worst-case test.
Where there are many intersections (or intersections that happen to occur near the start of the data) then Oren is quite correct to say that Any() should be faster.
If the real data usually contains intersections then it's likely that it is better to use Any(). Otherwise, use Intersect(). It's very data dependent.
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;
namespace Demo
{
class Program
{
void run()
{
double intersect = 0;
double any = 0;
Stopwatch stopWatch = new Stopwatch();
List<string> L1 = Enumerable.Range(0, 10000).Select(x => x.ToString()).ToList();
List<string> L2 = Enumerable.Range(10000, 10000).Select(x => x.ToString()).ToList();
for (int i = 0; i < 10; i++)
{
stopWatch.Restart();
Intersect(L1, L2);
stopWatch.Stop();
intersect += stopWatch.ElapsedMilliseconds;
stopWatch.Restart();
Any(L1, L2);
stopWatch.Stop();
any += stopWatch.ElapsedMilliseconds;
}
Console.WriteLine("Intersect: " + intersect + "ms");
Console.WriteLine("Any: " + any + "ms");
}
private static bool Any(List<string> lst1, List<string> lst2)
{
return lst1.Any(lst2.Contains);
}
private static bool Intersect(List<string> lst1, List<string> lst2)
{
return lst1.Intersect(lst2).Any();
}
static void Main()
{
new Program().run();
}
}
}
For comparative purposes, I wrote my own test comparing int sequences:
intersect took 00:00:00.0065928
any took 00:00:08.6706195
The code:
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;
namespace Demo
{
class Program
{
void run()
{
var lst1 = Enumerable.Range(0, 10000);
var lst2 = Enumerable.Range(10000, 10000);
int count = 10;
DemoUtil.Time(() => lst1.Intersect(lst2).Any(), "intersect", count);
DemoUtil.Time(() => lst1.Any(lst2.Contains), "any", count);
}
static void Main()
{
new Program().run();
}
}
static class DemoUtil
{
public static void Print(this object self)
{
Console.WriteLine(self);
}
public static void Print(this string self)
{
Console.WriteLine(self);
}
public static void Print<T>(this IEnumerable<T> self)
{
foreach (var item in self)
Console.WriteLine(item);
}
public static void Time(Action action, string title, int count)
{
var sw = Stopwatch.StartNew();
for (int i = 0; i < count; ++i)
action();
(title + " took " + sw.Elapsed).Print();
}
}
}
If I also time this for overlapping ranges by changing the lists to this and upping the count to 10000:
var lst1 = Enumerable.Range(10000, 10000);
var lst2 = Enumerable.Range(10000, 10000);
I get these results:
intersect took 00:00:03.2607476
any took 00:00:00.0019170
In this case Any() is clearly much faster.
Conclusion
The worst-case performance is very bad for Any() but acceptable for Intersect().
The best-case performance is extremely good for Any() and bad for Intersect().
(And the best case for Any() is probably the worst case for Intersect()!)
The Any() approach is O(N^2) in the worst case and O(1) in the best case.
The Intersect() approach is always O(N), since it uses hashing rather than sorting (otherwise it would be O(N*log(N))).
You must also consider the memory usage: the Intersect() method needs to take a copy of one of the inputs, whereas Any() doesn't.
Therefore to make the best decision you really need to know the characteristics of the real data, and actually perform tests.
If you really don't want the Any() to turn into an O(N^2) in the worst case, then you should use Intersect(). However, the chances are that you will be best off using Any().
And of course, most of the time none of this matters!
Unless you've discovered this part of the code to be a bottleneck, this is of merely academic interest. You shouldn't waste your time with this kind of analysis if there's no problem. :)
It depends on the implementation of your IEnumerables.
Your first try (Intersect/Any) finds all the matches and then determines whether the set is empty or not. From the documentation, this looks to be something like an O(n) operation:
When the object returned by this method is enumerated, Intersect enumerates first,
collecting all distinct elements of that sequence. It then enumerates [the]
second, marking those elements that occur in both sequences. Finally, the marked
elements are yielded in the order in which they were collected.
Your second try (Any/Contains) enumerates over the first collection, an O(n) operation, and for each item in the first collection, enumerates over the second, another O(n) operation, to see if a matching element is found. This makes it something like an O(n^2) operation, does it not? Which do you think might be faster?
One thing to consider, though, is that the Contains() lookup for certain collection or set types (e.g., dictionaries, binary trees or ordered collections that allow a binary search or hashtable lookup) might be a cheap operation if the Contains() implementation is smart enough to take advantage of the semantics of the collection upon which it is operating.
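For instance (a sketch of my own, not from the answers above), copying one list into a HashSet<string> up front makes each Contains() check roughly O(1), so the whole test becomes O(n + m):
var set = new HashSet<string>(lst2); // one O(m) pass to build the set
bool anyCommon = lst1.Any(set.Contains); // each lookup is ~O(1), short-circuits on the first hit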
But you'll need to experiment with your collection types to find out which works better.
See Matthew's answer for a complete and accurate breakdown.
Relatively easy to mock up and try yourself:
static void Main() // enclosing method assumed; GenerateNumberStrings is the poster's elided helper
{
bool found;
double intersect = 0;
double any = 0;
for (int i = 0; i < 100; i++)
{
List<string> L1 = GenerateNumberStrings(200000);
List<string> L2 = GenerateNumberStrings(60000);
Stopwatch stopWatch = new Stopwatch();
stopWatch.Start();
found = Intersect(L1, L2);
stopWatch.Stop();
intersect += stopWatch.ElapsedMilliseconds;
stopWatch.Reset();
stopWatch.Start();
found = Any(L1, L2);
stopWatch.Stop();
any += stopWatch.ElapsedMilliseconds;
}
Console.WriteLine("Intersect: " + intersect + "ms");
Console.WriteLine("Any: " + any + "ms");
}
private static bool Any(List<string> lst1, List<string> lst2)
{
return lst1.Any(x => lst2.Contains(x));
}
private static bool Intersect(List<string> lst1, List<string> lst2)
{
return lst1.Intersect(lst2).Any();
}
You'll find that the Any method is significantly faster in the long run, likely because it does not require the memory allocation and setup that Intersect requires (Any stops and returns true as soon as it finds a match, whereas Intersect first needs to collect the distinct elements of one sequence into a new set).

Thread safe Increment in C#

I am trying to increment an element in a list in C#, but I need it to be thread-safe so the count does not get corrupted.
I know you can do this for integers:
Interlocked.Increment(ref sdmpobjectlist1Count);
but this does not work on a list. I have the following so far:
lock (padlock)
{
DifferenceList[diff[d].PropertyName] = DifferenceList[diff[d].PropertyName] + 1;
}
I know this works, but I'm not sure if there is another way to do this?
As David Heffernan said, ConcurrentDictionary should provide better performance. But the performance gain might be negligible, depending upon how frequently multiple threads try to access the cache.
using System;
using System.Collections.Concurrent;
using System.Threading;
namespace ConcurrentCollections
{
class Program
{
static void Main()
{
var cache = new ConcurrentDictionary<string, int>();
for (int threadId = 0; threadId < 2; threadId++)
{
new Thread(
() =>
{
while (true)
{
var newValue = cache.AddOrUpdate("key", 0, (key, value) => value + 1);
Console.WriteLine("Thread {0} incremented value to {1}",
Thread.CurrentThread.ManagedThreadId, newValue);
}
}).Start();
}
Thread.Sleep(TimeSpan.FromMinutes(2));
}
}
}
If you use a List<int[]> rather than a List<int>, and have each element in the list be a single-item array, you will be able to do Interlocked.Increment(ref myList[whatever][0]) and have it be atomic. One could improve storage efficiency slightly by defining
class ExposedFieldHolder<T> {public T Value;}
and then using a List<ExposedFieldHolder<int>> with the statement Interlocked.Increment(ref myList[whatever].Value) to perform the increment. Things could be more efficient still if the built-in types provided a means of exposing an item as a ref, or allowed derived classes sufficient access to their internals to provide such an ability themselves. They don't, however, so one must either define one's own collection types from scratch or encapsulate each item in its own class object (using an array or a wrapper class).
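A short sketch of the wrapper-class variant in use (the list itself is only read, never resized, while threads increment; Interlocked lives in System.Threading):
var counters = new List<ExposedFieldHolder<int>>();
for (int i = 0; i < 10; i++)
counters.Add(new ExposedFieldHolder<int>());
// Safe from any thread: Value is a field of a reference type,
// so Interlocked.Increment can take a ref to it directly.
Interlocked.Increment(ref counters[3].Value);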
Check the variable you locked on, padLock. Normally, you would define it as private static object padLock = new object(). If you do not define it as static, each object has its own copy, and thus the lock will not work.

What is the most efficient loop in c#

There are a number of different ways to accomplish the same simple loop through the items of a collection in C#.
This has made me wonder whether there is any reason, be it performance or ease of use, to use one over the others. Or is it just down to personal preference?
Take a simple object:
var myList = new List<MyObject>();
Let's assume the list is filled and we want to iterate over the items.
Method 1.
foreach(var item in myList)
{
//Do stuff
}
Method 2
myList.ForEach(ml =>
{
//Do stuff
});
Method 3
using (var enumerator = myList.GetEnumerator())
while (enumerator.MoveNext())
{
//Do stuff
}
Method 4
for (int i = 0; i < myList.Count; i++)
{
//Do stuff
}
What I was wondering is: does each of these compile down to the same thing? Is there a clear performance advantage to using one over the others?
Or is this just down to personal preference when coding?
Have I missed any?
The answer, the majority of the time, is that it does not matter. The number of items in the loop (even what one might consider a "large" number of items, say in the thousands) isn't going to have an impact on the code.
Of course, if you identify this as a bottleneck in your situation, by all means, address it, but you have to identify the bottleneck first.
That said, there are a number of things to take into consideration with each approach, which I'll outline here.
Let's define a few things first:
All of the tests were run on .NET 4.0 on a 32-bit processor.
TimeSpan.TicksPerSecond on my machine = 10,000,000
All tests were performed in separate unit test sessions, not in the same one (so as not to possibly interfere with garbage collections, etc.)
Here's some helpers that are needed for each test:
The MyObject class:
public class MyObject
{
public int IntValue { get; set; }
public double DoubleValue { get; set; }
}
A method to create a List<T> of any length of MyObject instances:
public static List<MyObject> CreateList(int items)
{
// Validate parameters.
if (items < 0)
throw new ArgumentOutOfRangeException("items", items,
"The items parameter must be a non-negative value.");
// Return the items in a list.
return Enumerable.Range(0, items).
Select(i => new MyObject { IntValue = i, DoubleValue = i }).
ToList();
}
An action to perform for each item in the list (needed because Method 2 uses a delegate, and a call needs to be made to something to measure impact):
public static void MyObjectAction(MyObject obj, TextWriter writer)
{
// Validate parameters.
Debug.Assert(obj != null);
Debug.Assert(writer != null);
// Write.
writer.WriteLine("MyObject.IntValue: {0}, MyObject.DoubleValue: {1}",
obj.IntValue, obj.DoubleValue);
}
A method to create a TextWriter which writes to a null Stream (basically a data sink):
public static TextWriter CreateNullTextWriter()
{
// Create a stream writer off a null stream.
return new StreamWriter(Stream.Null);
}
And let's fix the number of items at one million (1,000,000), which should be sufficiently high to ensure that, generally, these approaches all have about the same performance impact:
// The number of items to test.
public const int ItemsToTest = 1000000;
Let's get into the methods:
Method 1: foreach
The following code:
foreach(var item in myList)
{
//Do stuff
}
Compiles down into the following:
using (var enumerator = myList.GetEnumerator())
while (enumerator.MoveNext())
{
var item = enumerator.Current;
// Do stuff.
}
There's quite a bit going on there. You have the method calls (and it may or may not be against the IEnumerator<T> or IEnumerator interfaces, as the compiler respects duck-typing in this case) and your // Do stuff is hoisted into that while structure.
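As a side sketch of that duck-typing point (a made-up type, purely for illustration): foreach only requires the GetEnumerator()/MoveNext()/Current pattern, not the interfaces themselves.
class CountDown
{
private int current = 3;
public CountDown GetEnumerator() => this; // the pattern is enough; no IEnumerable needed
public int Current => current;
public bool MoveNext() => --current >= 0;
}
// foreach (var i in new CountDown()) { ... } compiles and yields 2, 1, 0.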
Here's the test to measure the performance:
[TestMethod]
public void TestForEachKeyword()
{
// Create the list.
List<MyObject> list = CreateList(ItemsToTest);
// Create the writer.
using (TextWriter writer = CreateNullTextWriter())
{
// Create the stopwatch.
Stopwatch s = Stopwatch.StartNew();
// Cycle through the items.
foreach (var item in list)
{
// Write the values.
MyObjectAction(item, writer);
}
// Write out the number of ticks.
Debug.WriteLine("Foreach loop ticks: {0}", s.ElapsedTicks);
}
}
The output:
Foreach loop ticks: 3210872841
Method 2: .ForEach method on List<T>
The code for the .ForEach method on List<T> looks something like this:
public void ForEach(Action<T> action)
{
// Error handling omitted
// Cycle through the items, perform action.
for (int index = 0; index < Count; ++index)
{
// Perform action.
action(this[index]);
}
}
Note that this is functionally equivalent to Method 4, with one exception: the code that is hoisted into the for loop is passed as a delegate. This requires a dereference to get to the code that needs to be executed. While the performance of delegates has improved from .NET 3.0 on, that overhead is there.
However, it's negligible. The test to measure the performance:
[TestMethod]
public void TestForEachMethod()
{
// Create the list.
List<MyObject> list = CreateList(ItemsToTest);
// Create the writer.
using (TextWriter writer = CreateNullTextWriter())
{
// Create the stopwatch.
Stopwatch s = Stopwatch.StartNew();
// Cycle through the items.
list.ForEach(i => MyObjectAction(i, writer));
// Write out the number of ticks.
Debug.WriteLine("ForEach method ticks: {0}", s.ElapsedTicks);
}
}
The output:
ForEach method ticks: 3135132204
That's actually ~7.5 seconds faster than using the foreach loop. Not completely surprising, given that it uses direct array access instead of using IEnumerable<T>.
Remember though, this translates to 0.0000075740637 seconds per item being saved. That's not worth it for small lists of items.
Method 3: while (enumerator.MoveNext())
As shown in Method 1, this is exactly what the compiler generates for foreach (the using statement is good practice, since the enumerator may be disposable). You're not gaining anything here by unwinding yourself the code that the compiler would otherwise generate.
For kicks, let's do it anyways:
[TestMethod]
public void TestEnumerator()
{
// Create the list.
List<MyObject> list = CreateList(ItemsToTest);
// Create the writer.
using (TextWriter writer = CreateNullTextWriter())
// Get the enumerator.
using (IEnumerator<MyObject> enumerator = list.GetEnumerator())
{
// Create the stopwatch.
Stopwatch s = Stopwatch.StartNew();
// Cycle through the items.
while (enumerator.MoveNext())
{
// Write.
MyObjectAction(enumerator.Current, writer);
}
// Write out the number of ticks.
Debug.WriteLine("Enumerator loop ticks: {0}", s.ElapsedTicks);
}
}
The output:
Enumerator loop ticks: 3241289895
Method 4: for
In this particular case, you're going to gain some speed, as the list indexer is going directly to the underlying array to perform the lookup (that's an implementation detail, BTW, there's nothing to say that it can't be a tree structure backing the List<T> up).
[TestMethod]
public void TestListIndexer()
{
// Create the list.
List<MyObject> list = CreateList(ItemsToTest);
// Create the writer.
using (TextWriter writer = CreateNullTextWriter())
{
// Create the stopwatch.
Stopwatch s = Stopwatch.StartNew();
// Cycle by index.
for (int i = 0; i < list.Count; ++i)
{
// Get the item.
MyObject item = list[i];
// Perform the action.
MyObjectAction(item, writer);
}
// Write out the number of ticks.
Debug.WriteLine("List indexer loop ticks: {0}", s.ElapsedTicks);
}
}
The output:
List indexer loop ticks: 3039649305
However, the place where this can make a difference is arrays. The compiler can unroll loops over arrays to process multiple items at a time.
Instead of doing ten iterations of one item each in a ten-item loop, the compiler can unroll this into five iterations of two items each.
However, I'm not positive here that this is actually happening (I have to look at the IL and the output of the compiled IL).
Here's the test:
[TestMethod]
public void TestArray()
{
// Create the list.
MyObject[] array = CreateList(ItemsToTest).ToArray();
// Create the writer.
using (TextWriter writer = CreateNullTextWriter())
{
// Create the stopwatch.
Stopwatch s = Stopwatch.StartNew();
// Cycle by index.
for (int i = 0; i < array.Length; ++i)
{
// Get the item.
MyObject item = array[i];
// Perform the action.
MyObjectAction(item, writer);
}
// Write out the number of ticks.
Debug.WriteLine("Enumerator loop ticks: {0}", s.ElapsedTicks);
}
}
The output:
Array loop ticks: 3102911316
It should be noted that, out of the box, ReSharper offers a suggestion with a refactoring to change the above for statements to foreach statements. That's not to say this is right, but the basis is to reduce the amount of technical debt in the code.
TL;DR
You really shouldn't be concerned with the performance of these things, unless testing in your situation shows that you have a real bottleneck (and you'll have to have massive numbers of items to have an impact).
Generally, you should go for what's most maintainable, in which case, Method 1 (foreach) is the way to go.
In regard to the final bit of the question, "Have I missed any?": yes, and I feel I would be remiss not to mention this even though the question is quite old. While those four ways of doing it will execute in relatively the same amount of time, there is a way not shown above that runs faster than all of them, quite significantly so as the number of items in the iterated list increases. It is exactly the same as the last method, but instead of getting .Count in the condition check of the loop, you assign this value to a variable before setting up the loop and use that instead. Which leaves you with something like this:
var countVar = list.Count;
for(int i = 0; i < countVar; i++)
{
//loop logic
}
By doing it this way, you're only looking up a variable value at each iteration, rather than resolving the Count or Length properties, which is considerably less efficient.
I would suggest an even better and lesser-known approach for faster loop iteration over a list. I recommend you first read about Span<T>. Note that it is available if you are using .NET Core.
// CollectionsMarshal lives in System.Runtime.InteropServices.
List<MyObject> list = new();
foreach (MyObject item in CollectionsMarshal.AsSpan(list))
{
// Do something
}
Be aware of the caveats:
The CollectionsMarshal.AsSpan method is unsafe and should be used only if you know what you're doing.
CollectionsMarshal.AsSpan returns a Span<T> over the private array backing the List<T>.
Iterating over a Span<T> is fast, as the JIT uses the same tricks as for optimizing arrays.
The list must not be modified during the enumeration; this method does not check for that.

LINQ Why is "Enumerable = Enumerable.Skip(N)" slow?

I am having an issue with the performance of a LINQ query, so I created a small simplified example to demonstrate the issue below. The code takes a random list of small integers and returns the list partitioned into several smaller lists, each of which totals 10 or less.
The problem is that (as I've written this) the code takes exponentially longer with N. This is only an O(N) problem. With N = 2500, the code takes over 10 seconds to run on my PC.
I would greatly appreciate it if someone could explain what is going on. Thanks, Mark.
int N = 250;
Random r = new Random();
var work = Enumerable.Range(1,N).Select(x => r.Next(0, 6)).ToList();
var chunks = new List<List<int>>();
// work.Dump("All the work."); // LINQPad Print
var workEnumerable = work.AsEnumerable();
Stopwatch sw = Stopwatch.StartNew();
while(workEnumerable.Any()) // or .FirstorDefault() != null
{
int soFar = 0;
var chunk = workEnumerable.TakeWhile( x =>
{
soFar += x;
return (soFar <= 10);
}).ToList();
chunks.Add(chunk); // Commented out makes no difference.
workEnumerable = workEnumerable.Skip(chunk.Count); // <== SUSPECT
}
sw.Stop();
// chunks.Dump("Work Chunks."); // LINQPad Print
sw.Elapsed.Dump("Time elapsed.");
What .Skip() does is create a new IEnumerable that loops over the source and only begins yielding results after the first N elements. You chain who knows how many of these after each other. Every time you call .Any(), you need to loop over all the previously skipped elements again.
Generally speaking, it's a bad idea to set up very complicated operator chains in LINQ and enumerating them repeatedly. Also, since LINQ is a querying API, methods like Skip() are a bad choice when what you're trying to achieve amounts to modifying a data structure.
You effectively keep chaining Skip() onto the same enumerable. In a list of 250, the last chunk will be created from a lazy enumerable with ~25 'Skip' enumerator classes on the front.
You would find things become a lot faster, already if you did
workEnumerable = workEnumerable.Skip(chunk.Count).ToList();
However, I think the whole approach could be altered.
How about using standard LINQ to achieve the same:
See it live on http://ideone.com/JIzpml
using System;
using System.Collections.Generic;
using System.Linq;
public class Program
{
private readonly static Random r = new Random();
public static void Main(string[] args)
{
int N = 250;
var work = Enumerable.Range(1,N).Select(x => r.Next(0, 6)).ToList();
var chunks = work.Select((o,i) => new { Index=i, Obj=o })
.GroupBy(e => e.Index / 10)
.Select(group => group.Select(e => e.Obj).ToList())
.ToList();
foreach(var chunk in chunks)
Console.WriteLine("Chunk: {0}", string.Join(", ", chunk.Select(i => i.ToString()).ToArray()));
}
}
The Skip() method and others like it basically create a placeholder object, implementing IEnumerable, that references its parent enumerable and contains the logic to perform the skipping. Skips in loops, therefore, are non-performant, because instead of throwing away elements of the enumerable, like you think they are, they add a new layer of logic that's lazily executed when you actually need the first element after all the ones you've skipped.
You can get around this by calling ToList() or ToArray(). This forces "eager" evaluation of the Skip() method, and really does get rid of the elements you're skipping from the new collection you will be enumerating. That comes at an increased memory cost, and requires all of the elements to be known (so if you're running this on an IEnumerable that represents an infinite series, good luck).
The second option is to not use Linq, and instead use the IEnumerable implementation itself, to get and control an IEnumerator. Then instead of Skip(), simply call MoveNext() the necessary number of times.
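As a hedged sketch of that second option, here is the question's chunking loop rebuilt around a single enumerator, so no element is ever re-scanned (same running-sum-of-10 rule, assuming non-negative values):
var chunks = new List<List<int>>();
using (var e = work.GetEnumerator())
{
var chunk = new List<int>();
int soFar = 0;
while (e.MoveNext()) // each element is visited exactly once
{
if (chunk.Count > 0 && soFar + e.Current > 10)
{
chunks.Add(chunk); // close the current chunk and start a new one
chunk = new List<int>();
soFar = 0;
}
chunk.Add(e.Current); // add the element to the (possibly new) chunk
soFar += e.Current;
}
if (chunk.Count > 0)
chunks.Add(chunk);
}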
