I am trying to run the following program from the book.
The author claims that the resulting output should be
1000
2000
....
10000
if you run the program on a single-core processor, but that on a multicore computer it could be
999
1998
...
9998
when using the normal increment (number += 1), and that using the interlocked increment shown in the program solves the problem (i.e. you get the first output).
Now I have got 3 questions.
First, why can't I use a normal increment in the inner loop [i++ instead of Interlocked.Increment(ref i)]? Why has the author chosen the other method?
Secondly, what purpose does Thread.Sleep(1000) serve in this context? When I comment out this line, I get the second output even if I am using the Interlocked method to increment number.
Thirdly, I get the correct output even with the normal increment [number += 1] if I don't comment out the Thread.Sleep(1000) line, and the second output if I do.
I am running the program on an Intel(R) Core(TM) i7 Q820 CPU, if it makes any difference.
static void Main(string[] args)
{
    MyNum n = new MyNum();
    for (int a = 0; a < 10; a++)
    {
        for (int i = 1; i <= 1000; Interlocked.Increment(ref i))
        {
            Thread t = new Thread(new ThreadStart(n.AddOne));
            t.Start();
        }
        Thread.Sleep(1000);
        Console.WriteLine(n.number);
    }
}

class MyNum
{
    public int number = 0;

    public void AddOne()
    {
        Interlocked.Increment(ref number);
    }
}
The sleep is easy--let the threads finish before you look at the result. It's not really a good answer, though--while they should finish in a second there is no guarantee they actually do.
The need for the interlocked increment in the MyNum class is clear--there are 1000 threads contending for the number. Without protection it would be quite possible for one thread to read the number, then a second to read it, then the first to write it back, and then the second to write it back, wiping out the change the first one made. Note that such errors are FAR more likely when there are multiple cores; otherwise they can only happen if a thread switch hits at the wrong time.
I can't see why i needs to be protected, though.
Edit: You are getting about the same result because the code executes too fast. Each thread finishes faster than the next one is created, so they aren't all running at once.
Try:
public void AddOne()
{
    // Slow the increment down so the read and the write-back are far apart in time.
    number = number + fibnocci(20) + 1 - fibnocci(20);
}

private int fibnocci(int n)
{
    if (n < 3) return 1;
    return fibnocci(n - 1) + fibnocci(n - 2);
}
(I hope the optimizer isn't good enough to kill this extra code)
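For what it's worth, here is a minimal sketch (my own, not from the book) of how the wait can be made deterministic: keep the threads in a list and Join them, and use a plain i++ since only the main thread touches the loop counter.

// Sketch only (not the book's code). Assumes using System,
// System.Collections.Generic and System.Threading.
MyNum n = new MyNum();
for (int a = 0; a < 10; a++)
{
    List<Thread> threads = new List<Thread>();
    for (int i = 1; i <= 1000; i++)                        // plain i++ is fine here
    {
        Thread t = new Thread(new ThreadStart(n.AddOne));  // AddOne still uses Interlocked
        threads.Add(t);
        t.Start();
    }
    foreach (Thread t in threads)
        t.Join();                                          // deterministic wait instead of Sleep(1000)
    Console.WriteLine(n.number);                           // prints 1000, 2000, ..., 10000
}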
The code is actually pretty strange. Since Thread t is declared locally on each iteration and no reference to it is kept, there is no way to wait on those threads later (a thread that has been started won't be garbage collected while it's running, though). Anyway...
To answer the first question, I don't see a need for Interlocked.Increment(ref i) to take place. The main thread is the only thread that will touch i. Using i++ is not a problem here.
For the second question, Thread.Sleep(1000) exists to give the program enough time to complete all the threads. Your i7 (quad core with hyper-threading) is probably finishing each item pretty fast.
For the third question, getting the correct result is not really guaranteed with number += 1. Two cores might read the same value and increment it to the same result (i.e., 1001 and 1001).
Lastly, I'm not sure whether you are running the program in debug or release mode. Building the program in release mode may give you different behavior and expose the side effects a multi-threaded program can have.
If you comment out the Thread.Sleep line, there is a good chance that the threads will not have finished before the Console.WriteLine call... in this case you will see a number smaller than the "correct" output, but not because the increment wasn't atomic.
On a true multicore system, it is possible for the non-atomic actions to collide. Perhaps you are doing too few iterations to see the collision.
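If you want to make the collision easy to observe, one option (a sketch of my own, not from the book) is to give each thread many unprotected increments, so the race window is hit constantly:

using System;
using System.Threading;

class RaceDemo
{
    static int number = 0;

    static void Main()
    {
        Thread[] threads = new Thread[4];
        for (int t = 0; t < threads.Length; t++)
        {
            threads[t] = new Thread(() =>
            {
                for (int i = 0; i < 1000000; i++)
                    number += 1;              // not atomic: read, add, write back
            });
            threads[t].Start();
        }
        foreach (Thread t in threads)
            t.Join();

        // Expected 4000000; on a multicore machine it is almost always less.
        Console.WriteLine(number);
    }
}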
I have a problem with my neural network library. It uses multithreading to speed up computations, but after about 30-60 seconds of runtime my program no longer utilizes 100% of my i7 3610QM (4 cores, 8 threads).
Basically my processing looks like this (C# with pseudocode):
for each training example t in training set
    for each layer l in neural network
        Parallel.For(0, N, (int i) => { l.processForward(l.regions[i]); })
    for each layer l in neural network (but in reversed order)
        Parallel.For(0, N, (int i) => { l.backPropagateError(l.regions[i]); })
Here regions is the layer's list of precalculated regions of neurons to process. Every region is the same size, 1/N of the current layer, so the tasks are equally sized to minimize the chance that other threads have to wait for the longest task to finish.
Like I said, this processing scheme consumes 100% of my processor only for a short time and then drops to about 80-85%. In my case I set N to Environment.ProcessorCount (= 8).
I can share whole code/repository if anyone is willing to help.
I tried to investigate, so I created a new console project and put an almost Hello World of Parallel.For() in it, and I simply can't tell what is going on. This might be a different Parallel.For() issue, but I also want you to address this problem. Here is the code:
class Program
{
    static void Main(string[] args)
    {
        const int n = 1;
        while (true)
        {
            //int counter = 0; for (int ii = 0; ii < 1000; ++ii) counter++;
            Parallel.For(0, n, (int i) => { int counter = 0; for (int ii = 0; ii < 1000; ++ii) counter++; });
        }
    }
}
In this code, I constantly (in a while loop) create one task (n = 1) that has some work to do (increase a counter one thousand times). As far as I know, Parallel.For blocks execution / waits for all parallel calls to finish. If that is true, it should be doing the same work as the commented-out section (provided n = 1). But on my computer, this program uses 100% of the CPU, as if there were work for more than one thread! How is that possible? When I switch to the commented-out version, the program uses less than 20% of the CPU, which is what I expected. Please help me understand this behaviour.
As @TaW said, there is a cost to going parallel. That's why f() and Parallel.For(0, n, _ => f()) are not equivalent: the parallel version incurs thread scheduling and context switching. In your case the execution time of f() is comparable to the thread-scheduling overhead, which is why you see a performance degradation with the parallel version. Parallel.For does wait until the operation completes, but it completes so fast that many threads get scheduled on different CPU cores in a very short period of time (remember that each time you invoke Parallel.For it may choose a different thread to run f() on).
As for the first part of the question, I guess the problem lies in the index range passed to Parallel.For. Instead of [0, number of CPU cores), it should span the index range of the data, as sketched below.
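Something along these lines, as a sketch only (the array and the Math.Tanh call are made-up stand-ins for one layer's data and for processForward):

using System;
using System.Threading.Tasks;

class Example
{
    static void Main()
    {
        double[] neurons = new double[100000];        // stand-in for one layer's data

        // Index range = the data range; Parallel.For partitions it itself.
        Parallel.For(0, neurons.Length, i =>
        {
            neurons[i] = Math.Tanh(neurons[i] + i);   // stand-in for processForward
        });
    }
}

Parallel.For chunks the index range internally, so the work is still batched per worker, but the degree of parallelism can adapt to however many cores are actually free.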
One day I was trying to get a better understanding of threading concepts, so I wrote a couple of test programs. One of them was:
using System;
using System.Threading.Tasks;
class Program
{
    static volatile int a = 0;

    static void Main(string[] args)
    {
        Task[] tasks = new Task[4];
        for (int h = 0; h < 20; h++)
        {
            a = 0;
            for (int i = 0; i < tasks.Length; i++)
            {
                tasks[i] = new Task(() => DoStuff());
                tasks[i].Start();
            }
            Task.WaitAll(tasks);
            Console.WriteLine(a);
        }
        Console.ReadKey();
    }

    static void DoStuff()
    {
        for (int i = 0; i < 500000; i++)
        {
            a++;
        }
    }
}
I hoped I would be able to see outputs less than 2000000. The model in my imagination was the following: several threads read variable a at the same time, so their local copies of a are the same; each thread increments it, the writes happen, and one or more increments are "lost" this way.
However, the output contradicts this reasoning. One sample output (from a Core i5 machine):
2000000
1497903
1026329
2000000
1281604
1395634
1417712
1397300
1396031
1285850
1092027
1068205
1091915
1300493
1357077
1133384
1485279
1290272
1048169
704754
If my reasoning were true, I would see 2000000 occasionally and sometimes numbers a bit less. But what I see is 2000000 occasionally and numbers way less than 2000000. This indicates that what happens behind the scenes is not just a couple of "increment losses" but that something more is going on. Could somebody explain the situation to me?
Edit:
When I was writing this test program I was fully aware of how I could make it thread safe, and I was expecting to see numbers less than 2000000. Let me explain why I was surprised by the output. First, let's assume that the reasoning above is correct. Second assumption (this may very well be the source of my confusion): if the conflicts happen (and they do), then these conflicts are random, and I expect a somewhat normal distribution for these random event occurrences. In this case the first line of the output says: out of 500000 experiments, the random event never occurred. The second line says: the random event occurred at least 167365 times. The difference between 0 and 167365 is just too big (almost impossible with a normal distribution). So the case boils down to the following:
One of the two assumptions (the "increment loss" model or the "somewhat normally distributed parallel conflicts" model) is incorrect. Which one, and why?
The behavior stems from the fact that you are using the volatile keyword while not locking access to the variable a when using the increment operator (++). (You still get a random distribution without volatile, but using volatile changes the nature of the distribution, which is explored below.)
When using the increment operator, it's the equivalent of:
a = a + 1;
In this case, you're actually doing three operations, not one:
1. Read the value of a
2. Add 1 to the value of a
3. Assign the result of step 2 back to a
While the volatile keyword serializes access, in the above case, it's serializing access to three separate operations, not serializing access to them collectively, as an atomic unit of work.
Because you're performing three operations when incrementing instead of one, you have additions that are being dropped.
Consider this:
Time    Thread 1                   Thread 2
----    --------                   --------
0       read a (1)                 read a (1)
1       evaluate a + 1 (2)         evaluate a + 1 (2)
2       write result to a (3)      write result to a (3)
Or even this:
Time    a    Thread 1              Thread 2              Thread 3
----    -    --------              --------              --------
0       1    read a                                      read a
1       1    evaluate a + 1 (2)
2       2    write back to a
3       2                          read a
4       2                          evaluate a + 1 (3)
5       3                          write back to a
6       3                                                evaluate a + 1 (2)
7       2                                                write back to a
Note in particular steps 5-7: thread 2 has written a value back to a, but because thread 3 still has an old, stale value, it overwrites the results that the previous threads have written, essentially wiping out any trace of those increments.
As you can see, as you add more threads, you have a greater potential to mix up the order in which the operations are being performed.
volatile will prevent you from corrupting the value of a due to two writes happening at the same time, or a corrupt read of a due to a write happening during a read, but it doesn't do anything to handle making the operations atomic in this case (since you're performing three operations).
In this case, volatile ensures that the distribution of the value of a is between 0 and 2,000,000 (four threads * 500,000 iterations per thread) because of this serialization of access to a. Without volatile, you run the risk of a being anything as you can run into corruption of the value a when reads and/or writes happen at the same time.
Because you haven't synchronized access to a for the entire increment operation, the results are unpredictable, as you have writes that are being overwritten (as seen in the previous example).
What's going on in your case?
For your specific case you have many writes that are being overwritten, not just a few; since you have four threads each looping 500,000 times (two million writes in total), theoretically almost all of the writes could be overwritten (expand the second example to four threads and then add a few million rows for the loop iterations).
While that extreme is not really probable, you shouldn't be surprised to drop a tremendous number of writes.
Additionally, Task is an abstraction. In reality (assuming you are using the default scheduler), it uses the ThreadPool class to get threads to process your requests. The ThreadPool is ultimately shared with other operations (some internal to the CLR, even in this case), and it does things like work stealing and using the current thread for operations, and at some point it drops down to the operating system to get a thread to perform the work.
Because of this, you can't assume that the overwrites follow a neat random distribution; there's always going to be a lot more going on that will throw whatever order you expect out the window. The order of processing is undefined, and the allocation of work will never be evenly distributed.
If you want to ensure that additions won't be overwritten, then you should use the Interlocked.Increment method in the DoStuff method, like so:
for (int i = 0; i < 500000; i++)
{
    Interlocked.Increment(ref a);
}
This will ensure that all writes will take place, and your output will be 2000000 twenty times (as per your loop).
It also invalidates the need for the volatile keyword, as you're making the operations you need atomic.
The volatile keyword is good when the operation that you need to make atomic is limited to a single read or write.
If you have to do anything more than a single read or a single write, then the volatile keyword is too granular; you need a coarser locking mechanism.
In this case, it's Interlocked.Increment, but if you have to do more than that, then the lock statement will more than likely be what you rely on.
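For illustration (my sketch, not part of the original answer), here is the kind of compound operation where Interlocked.Increment alone isn't enough and a lock is the natural tool: incrementing only while the counter is below a cap.

// Sketch: a compound check-then-increment needs a lock; Interlocked.Increment
// alone can't express the condition atomically.
private static readonly object sync = new object();
private static int a;

static bool TryIncrementUpTo(int max)
{
    lock (sync)
    {
        if (a >= max)
            return false;   // the check and the update happen as one atomic unit under the lock
        a++;
        return true;
    }
}

(The same thing can be written lock-free with an Interlocked.CompareExchange loop, but the lock version is far easier to get right.)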
I don't think anything else is happening - it's just happening a lot. If you add locking or some other synchronization technique (see "Best thread-safe way to increment an integer up to 65535") you'll reliably get the full 2,000,000 increments.
Each task is calling DoStuff() as you'd expect.
private static object locker = new object();

static void DoStuff()
{
    for (int i = 0; i < 500000; i++)
    {
        lock (locker)
        {
            a++;
        }
    }
}
Try increasing the amounts; the timespan is simply too short to draw any conclusions from. Remember that normal IO is in the range of milliseconds, and just one blocking IO operation here would render the results useless.
Something along these lines is better (or why not int.MaxValue?):
static void DoStuff()
{
    for (int i = 0; i < 50000000; i++) // 50 000 000
        a++;
}
My results ("correct" being 400 000 000):
63838940
60811151
70716761
62101690
61798372
64849158
68786233
67849788
69044365
68621685
86184950
77382352
74374061
58356697
70683366
71841576
62955710
70824563
63564392
71135381
Not really a normal distribution but we are getting there. Bear in mind that this is roughly 35% of the correct amount.
I can explain my results by the fact that I am running on 2 physical cores, although seen as 4 due to hyperthreading, which means that if it is optimal to do an "HT switch" during the actual addition, at least 50% of the additions will be "removed" (if I remember the implementation of HT correctly, it modifies one thread's data in the ALU while loading/saving the other thread's data). The remaining 15% is due to the program actually running on 2 cores in parallel.
My recommendations:
Post your hardware
Increase the loop count
Vary the TaskCount (see the harness sketched below)
Hardware matters!
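Here is such a harness, as a sketch of my own (not the original poster's code): it varies the task count and iteration count and reports how many increments survive.

using System;
using System.Threading.Tasks;

class Harness
{
    static int a;

    static void Main()
    {
        foreach (int taskCount in new[] { 2, 4, 8 })
        {
            foreach (int iterations in new[] { 1000000, 50000000 })
            {
                a = 0;
                Task[] tasks = new Task[taskCount];
                for (int t = 0; t < taskCount; t++)
                {
                    tasks[t] = Task.Run(() =>
                    {
                        for (int i = 0; i < iterations; i++)
                            a++;                        // deliberately unprotected
                    });
                }
                Task.WaitAll(tasks);

                long expected = (long)taskCount * iterations;
                Console.WriteLine("{0} tasks x {1}: {2} ({3:F1}% of expected)",
                    taskCount, iterations, a, 100.0 * a / expected);
            }
        }
    }
}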
OK, so I just started screwing around with threading. It's taking a bit of time to wrap my head around the concepts, so I wrote a pretty simple test to see how much faster (if faster at all) printing out 20000 lines would be (and I figured it would be faster since I have a quad core processor).
So first I wrote this (this is how I would normally do it):
System.DateTime startdate = DateTime.Now;
for (int i = 0; i < 10000; ++i)
{
    Console.WriteLine("Producing " + i);
    Console.WriteLine("\t\t\t\tConsuming " + i);
}
System.DateTime endtime = DateTime.Now;
Console.WriteLine(startdate.Second + ":" + startdate.Millisecond + " to " + endtime.Second + ":" + endtime.Millisecond);
And then with threading:
public class Test
{
    static ProducerConsumer queue;
    public System.DateTime startdate = DateTime.Now;

    static void Main()
    {
        queue = new ProducerConsumer();
        new Thread(new ThreadStart(ConsumerJob)).Start();

        for (int i = 0; i < 10000; i++)
        {
            Console.WriteLine("Producing {0}", i);
            queue.Produce(i);
        }
        Test a = new Test();
    }

    static void ConsumerJob()
    {
        Test a = new Test();
        for (int i = 0; i < 10000; i++)
        {
            object o = queue.Consume();
            Console.WriteLine("\t\t\t\tConsuming {0}", o);
        }
        System.DateTime endtime = DateTime.Now;
        Console.WriteLine(a.startdate.Second + ":" + a.startdate.Millisecond + " to " + endtime.Second + ":" + endtime.Millisecond);
    }
}

public class ProducerConsumer
{
    readonly object listLock = new object();
    Queue queue = new Queue();

    public void Produce(object o)
    {
        lock (listLock)
        {
            queue.Enqueue(o);
            Monitor.Pulse(listLock);
        }
    }

    public object Consume()
    {
        lock (listLock)
        {
            while (queue.Count == 0)
            {
                Monitor.Wait(listLock);
            }
            return queue.Dequeue();
        }
    }
}
Now, for some reason I assumed this would be faster, but after testing it 15 times, the median of the results is ... a few milliseconds different in favor of the non-threaded version.
Then I figured hey ... maybe I should try it with a million Console.WriteLines, but the results were similar.
Am I doing something wrong?
Writing to the console is internally synchronized. It is not parallel. It also causes cross-process communication.
In short: It is the worst possible benchmark I can think of ;-)
Try benchmarking something real, something that you actually would want to speed up. It needs to be CPU bound and not internally synchronized.
As far as I can see you have only got one thread servicing the queue, so why would this be any quicker?
I have an example for why your expectation of a big speedup through multi-threading is wrong:
Assume you want to upload 100 pictures. The single threaded variant loads the first, uploads it, loads the second, uploads it, etc.
The limiting part here is the bandwidth of your internet connection (assuming that every upload uses up all the upload bandwidth you have).
What happens if you create 100 threads to upload 1 picture only? Well, each thread reads its picture (this is the part that speeds things up a little, because reading the pictures is done in parallel instead of one after the other).
As the currently active thread uses 100% of the internet upload bandwidth to upload its picture, no other thread can upload a single byte while it is not active. Since the total amount of bytes that needs to be transmitted is the same, the time 100 threads need to upload one picture each is the same time one thread needs to upload 100 pictures one after the other.
You only get a speedup if a single upload were limited to, let's say, 50% of the available bandwidth. Then, 100 threads would be done in 50% of the time it would take one thread to upload 100 pictures.
"For some reason i assumed this would be faster"
If you don't know why you assumed it would be faster, why are you surprised that it's not? Simply starting up new threads is never guaranteed to make any operation run faster. There has to be some inefficiency in the original algorithm that a new thread can reduce (and that is sufficient to overcome the extra overhead of creating the thread).
All the advice given by others is good advice, especially the mention of the fact that the console is serialized, as well as the fact that adding threads does not guarantee speedup.
What I want to point out and what it seems the others missed is that in your original scenario you are printing everything in the main thread, while in the second scenario you are merely delegating the entire printing task to the secondary worker. This cannot be any faster than your original scenario because you simply traded one worker for another.
A scenario where you might see speedup is this one:
for (int i = 0; i < largeNumber; i++)
{
    // embarrassingly parallel task that takes some time to process
}
and then replacing that with:
Parallel.For(0, largeNumber,
    i =>
    {
        // embarrassingly parallel task that takes some time to process
    });
This will split the loop among the workers such that each worker processes a smaller chunk of the original data. If the task does not need synchronization you should see the expected speedup.
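To make that concrete, here is a self-contained sketch (the workload and the numbers are made up) that times the same embarrassingly parallel loop sequentially and with Parallel.For:

using System;
using System.Diagnostics;
using System.Threading.Tasks;

class SpeedupDemo
{
    // Arbitrary CPU-bound work so each iteration costs real time.
    static double Work(int i)
    {
        double x = i;
        for (int k = 1; k <= 5000; k++)
            x = Math.Sqrt(x + k);
        return x;
    }

    static void Main()
    {
        const int n = 200000;
        double[] results = new double[n];

        Stopwatch sw = Stopwatch.StartNew();
        for (int i = 0; i < n; i++)
            results[i] = Work(i);
        Console.WriteLine("Sequential:   {0} ms", sw.ElapsedMilliseconds);

        sw.Restart();
        Parallel.For(0, n, i => results[i] = Work(i));   // each i writes its own slot: no locking needed
        Console.WriteLine("Parallel.For: {0} ms", sw.ElapsedMilliseconds);
    }
}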
Cool test.
One thing to have in mind when dealing with threads is bottlenecks. Consider this:
You have a Restaurant. Your kitchen can make a new order every 10
minutes (your chef has a bladder problem so he's always in the
bathroom, but is your girlfriend's cousin), so he produces 6 orders an
hour.
You currently employ only one waiter, which can attend tables
immediately (he's probably on E, but you don't care as long as the
service is good).
During the first week of business everything is fine: you get
customers every ten minutes. Customers still wait for exactly ten
minutes for their meal, but that's fine.
However, after that week, you are getting as many as 2 customers every
ten minutes, and they have to wait as much as 20 minutes to get their
meal. They start complaining and making noises. And god, you have
noise. So what do you do?
Waiters are cheap, so you hire two more. Will the wait time change?
Not at all... waiters will get the order faster, sure (attend two
customers in parallel), but still some customers wait 20 minutes for
the chef to complete their orders. You need another chef, but as you
search, you discover they are lacking! Every one of them is on TV
doing some crazy reality show (except for your girlfriend's cousin who
actually, you discover, is a former drug dealer).
In your case, the waiters are the threads making calls to Console.WriteLine, but your chef is the Console itself. It can only service so many calls a second. Adding some threads might make things a bit faster, but the gains should be minimal.
You have multiple sources but only one output. In that case multithreading will not speed things up. It's like having a road where 4 lanes merge into 1 lane: the 4 lanes will move traffic faster, but at the end everything slows back down at the merge.
Hi
I am reading a “threading in C#” tutorial. One of the things it mentions is:
“The CLR assigns each thread its own memory stack so that local variables are kept separate”
And there’s this example:
namespace ConsoleApplication1
{
    class Program
    {
        static void Main(string[] args)
        {
            for (int i = 0; i < 20; i++)
            {
                Thread t = new Thread(() =>
                {
                    Console.WriteLine(i);
                });
                t.Start();
            }
            Console.ReadLine();
        }
    }
}
Output:
1
2
2
4
6
8
10
10
10
10
12
12
14
15
17
18
18
20
20
So the way I understand what’s happening here is:
The main thread starts executing the for loop.
A new thread is instantiated and defined such that it will receive the value of “i”
and print it to the console.
The thread instance is started meanwhile the main thread continues working.
Being “i” an integer my guess is that the new thread will have its own copy in its memory stack. And then print the value to the console. But as the results show, it’s skipping values jumping from 10 to 12 or 12 to 14.
So is the new thread is receiving a reference to i? But if “i” is an integer shouldn’t the new thread store a new value in its memory stack instead of what seems a reference to i.
Also why are there duplicate values? It’s printing several times 2,10, 12, 18, 20.
Thanks.
That sample is fatally flawed... because every thread is actually sharing the one i variable. It's being captured by the lambda expression.
This is a very common problem, but it's a real shame to see it in a threading tutorial. (I hope it's not one of my articles! Please tell us where you're reading this.) Eric Lippert has written about it very carefully in his blog posts, "closing over the loop variable considered harmful" - part 1; part 2.
It's worth distinguishing between threading behaviour and that of lambda expressions. Threads really do have their own stacks and their own local variables - but here, i is shared between all threads due to the lambda expression. It's not a local variable in the "normal" sense.
Here's an example which shows each thread having its own local variables:
using System;
using System.Threading;

public class Test
{
    static void Main()
    {
        for (int x = 0; x < 10; x++)
        {
            new Thread(Count).Start();
        }
    }

    static void Count()
    {
        int threadId = Thread.CurrentThread.ManagedThreadId;
        Console.WriteLine("Thread {0} starting", threadId);
        for (int i = 0; i < 5; i++)
        {
            Console.WriteLine("{0}: {1}", threadId, i);
        }
        Console.WriteLine("Thread {0} ending", threadId);
    }
}
Each thread will definitely print 0..4 along with its own thread ID. The i variable is genuinely local to each thread this time - there's no sharing.
When a variable is used in a lambda expression, like your variable i, it gets hoisted into a closure (in your case that of the Main method); the first hit on Google for "closure c#" happens to be Jon Skeet's article on the subject. Because of this, it is not a local variable and does not live on the thread's stack.
The problem is simple: since the threads take time to initialize and run, the value of i will have changed in the meantime. There is also the possibility that more than one loop iteration has completed by the time the other threads get a processor cycle, so a single number gets printed multiple times.
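The usual fix, shown here as a sketch of the standard idiom rather than anything from the tutorial, is to copy the loop variable into a local inside the loop body, so each lambda captures its own variable:

for (int i = 0; i < 20; i++)
{
    int copy = i;                    // a fresh variable per iteration
    Thread t = new Thread(() =>
    {
        Console.WriteLine(copy);     // captures the per-iteration copy, not the shared i
    });
    t.Start();
}
Console.ReadLine();

With that change every value from 0 to 19 is printed exactly once, although the order still depends on scheduling.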
While testing application performance, I came across some pretty strange GC behavior. In short, the GC runs even on an empty program without runtime allocations!
The following application demonstrates the issue:
using System;
using System.Collections.Generic;
public class Program
{
    // Preallocate strings to avoid runtime allocations.
    static readonly List<string> Integers = new List<string>();
    static int StartingCollections0, StartingCollections1, StartingCollections2;

    static Program()
    {
        for (int i = 0; i < 1000000; i++)
            Integers.Add(i.ToString());

        GC.Collect();
        GC.WaitForPendingFinalizers();
        GC.Collect();
    }

    static void Main(string[] args)
    {
        DateTime start = DateTime.Now;
        int i = 0;

        Console.WriteLine("Test 1");
        StartingCollections0 = GC.CollectionCount(0);
        StartingCollections1 = GC.CollectionCount(1);
        StartingCollections2 = GC.CollectionCount(2);

        while (true)
        {
            if (++i >= Integers.Count)
            {
                Console.WriteLine();
                break;
            }

            // 1st test - no collections!
            {
                if (i % 50000 == 0)
                {
                    PrintCollections();
                    Console.Write(" - ");
                    Console.WriteLine(Integers[i]);

                    //System.Threading.Thread.Sleep(100);
                    // or a busy wait (run in debug mode)
                    for (int j = 0; j < 50000000; j++)
                    { }
                }
            }
        }

        i = 0;
        Console.WriteLine("Test 2");
        StartingCollections0 = GC.CollectionCount(0);
        StartingCollections1 = GC.CollectionCount(1);
        StartingCollections2 = GC.CollectionCount(2);

        while (true)
        {
            if (++i >= Integers.Count)
            {
                Console.WriteLine("Press any key to continue...");
                Console.ReadKey(true);
                return;
            }

            DateTime now = DateTime.Now;
            TimeSpan span = now.Subtract(start);
            double seconds = span.TotalSeconds;

            // 2nd test - several collections
            if (seconds >= 0.1)
            {
                PrintCollections();
                Console.Write(" - ");
                Console.WriteLine(Integers[i]);
                start = now;
            }
        }
    }

    static void PrintCollections()
    {
        Console.Write(Integers[GC.CollectionCount(0) - StartingCollections0]);
        Console.Write("|");
        Console.Write(Integers[GC.CollectionCount(1) - StartingCollections1]);
        Console.Write("|");
        Console.Write(Integers[GC.CollectionCount(2) - StartingCollections2]);
    }
}
Can someone explain what is going on here? I was under the impression that the GC won't run unless memory pressure hits specific limits. However, it seems to run (and collect) all the time - is this normal?
Edit: I have modified the program to avoid all runtime allocations.
Edit 2: Ok, new iteration and it seems that DateTime is the culprit. One of the DateTime methods allocates memory (probably Subtract), which causes the GC to run. The first test now causes absolutely no collections - as expected - while the second causes several.
In short, the GC only runs when it needs to run - I was just generating memory pressure unwittingly (DateTime is a struct and I thought it wouldn't generate garbage).
GC.CollectionCount(0) returns the following:
The number of times garbage collection has occurred for the specified generation since the process was started.
Therefore you should see an increase in the numbers and that increase doesn't mean that memory is leaking but that the GC has run.
You can see this increase in the first case as well; it simply happens much more slowly because the very slow Console.WriteLine method is called much more often, slowing things down a lot.
Another thing that should be noted here is that GC.Collect() is not a synchronous function call. It triggers a garbage collection, but that garbage collection occurs on a background thread, and theoretically may not have finished running by the time you get around to checking your GC statistics.
There is a GC.WaitForPendingFinalizers call which you can make after GC.Collect to block until the garbage collection occurs.
If you really want to attempt to accurately track GC statistics in different situations, I would instead utilize the Windows Performance Monitor on your process, where you can create monitors on all sorts of things, including .NET Heap statistics.
If you just wait a few seconds, you see that the collection count also increases in the first test, but not as fast.
The differences between the codes is that the first test writes out the collection count all the time, as fast as it can, while the second test loops without writing anything out until the time limit is reached.
The first test spends most of the time waiting for text being written to the console, while the second test spends most of the time looping, waiting for the time limit. The second test will do a lot more iterations during the same time.
I counted the iterations, and printed out the number of iterations per garbage collection. On my computer the first test stabilises around 45000 iterations per GC, while the second test stabilises around 130000 iterations per GC.
So, the first test actually does more garbage collections than the second test, about three times as many.
Thanks everyone! Your suggestions helped reveal the culprit: DateTime is allocating heap memory.
The GC does not run all the time, but only when memory is allocated. If memory usage is flat, the GC never runs and the collection count never increases, as expected.
The latest iteration of the test showcases this behavior. The first test run does not allocate any heap memory (the GC.CollectionCount(0) delta remains 0), while the second allocates memory in a non-obvious fashion: through DateTime.Subtract() -> TimeSpan.
Now, both DateTime and TimeSpan are value types, which is why I found this behavior surprising. Still, there you have it: there were hidden heap allocations after all.
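If you want to confirm that kind of hidden allocation directly, one option (a sketch; GC.GetAllocatedBytesForCurrentThread is only available on newer runtimes such as .NET Core / modern .NET, so check your framework version) is to measure the bytes allocated around the suspect call:

// start is the DateTime variable from the program above.
long before = GC.GetAllocatedBytesForCurrentThread();

DateTime now = DateTime.Now;
TimeSpan span = now.Subtract(start);    // the code path under suspicion
double seconds = span.TotalSeconds;

long after = GC.GetAllocatedBytesForCurrentThread();
Console.WriteLine("Allocated: {0} bytes", after - before);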