Why is there a Thread.Sleep(1) in the .NET internal Hashtable? - c#

Recently I was reading the implementation of the .NET Hashtable and encountered a piece of code that I don't understand. Part of the code is:
int num3 = 0;
int num4;
do
{
    num4 = this.version;
    bucket = bucketArray[index];
    if (++num3 % 8 == 0)
        Thread.Sleep(1);
}
while (this.isWriterInProgress || num4 != this.version);
The whole code is within public virtual object this[object key] of System.Collections.Hashtable (mscorlib Version=4.0.0.0).
The question is:
What is the reason for having Thread.Sleep(1) there?

Sleep(1) is a documented way in Windows to yield the processor and allow other threads to run. You can find this code in the Reference Source with comments:
// Our memory model guarantee if we pick up the change in bucket from another processor,
// we will see the 'isWriterProgress' flag to be true or 'version' is changed in the reader.
//
int spinCount = 0;
do {
    // this is violate read, following memory accesses can not be moved ahead of it.
    currentversion = version;
    b = lbuckets[bucketNumber];

    // The contention between reader and writer shouldn't happen frequently.
    // But just in case this will burn CPU, yield the control of CPU if we spinned a few times.
    // 8 is just a random number I pick.
    if( (++spinCount) % 8 == 0 ) {
        Thread.Sleep(1);   // 1 means we are yeilding control to all threads, including low-priority ones.
    }
} while ( isWriterInProgress || (currentversion != version) );
The isWriterInProgress variable is a volatile bool. The author had some trouble with English: "violate read" should be "volatile read". The basic idea is to try to avoid yielding, since thread context switches are very expensive, in the hope that the writer gets done quickly. If that doesn't pan out, then explicitly yield to avoid burning CPU. This would probably have been written with SpinLock today, but Hashtable is very old. As are the assumptions about the memory model.
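For comparison, here is a minimal sketch of how the same reader loop might look today with SpinWait, which packages the spin-a-few-times-then-yield policy that the old code implements by hand. The class and field names are invented for illustration; this is not the actual Hashtable internals.

using System.Threading;

class VersionedBuckets
{
    private readonly object[] _buckets = new object[16];
    private volatile int _version;
    private volatile bool _isWriterInProgress;

    public object ReadBucket(int index)
    {
        var spinner = new SpinWait();
        while (true)
        {
            int version = _version;             // volatile read
            object bucket = _buckets[index];
            if (!_isWriterInProgress && version == _version)
                return bucket;                  // no writer raced us; snapshot is consistent
            spinner.SpinOnce();                 // spins first, then yields (like the % 8 Sleep(1))
        }
    }
}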

Without having access to the rest of the implementation code, I can only make an educated guess based on what you've posted.
That said, it looks like it's trying to update something in the Hashtable, either in memory or on disk, and looping while waiting for the write to finish (as seen by checking isWriterInProgress).
If it's a single-core processor, it can only run one thread at a time. Spinning in a continuous loop like this could easily mean the other thread doesn't get a chance to run, but the Thread.Sleep(1) gives the scheduler a chance to give time to the writer. Without the yield, the writer thread might never get a chance to run and never complete.

I haven't read the source, but it looks like a lockless concurrency thing. You're trying to read from the hashtable, but someone else might be writing to it, so you wait until isWriterInProgress is unset and the version that you've read hasn't changed.
This does not explain why e.g. we always wait at least once. EDIT: that's because we don't; thanks @Maciej for pointing that out. When there's no contention we proceed immediately. I don't know why 8 is the magic number instead of, e.g., 4 or 16, though.

Related

C# shared variable access from different threads

I am using static variables to share state between threads, but it is taking very long to read their values.
Context: I have a static class Results.cs, where I store the result variables of two running Process.cs instances.
public static int ResultsStation0 { get; set; }
public static int ResultsStation1 { get; set; }
Then, a function of the two process instances is called at the same time, with initial values of ResultsStation0/1 = -1.
Because the results will not be available at the same time, the function checks that both results are there. The fast instance sets its result and waits for the result of the slower instance.
void StationResult()
{
    Stopwatch sw = new Stopwatch();
    sw.Restart();
    switch (stationIndex) // Set the result of the station thread
    {
        case 0: Results.ResultsStation0 = 1; break;
        case 1: Results.ResultsStation1 = 1; break;
    }
    // Waits to get the results of both threads
    while (true)
    {
        if (Results.ResultsStation0 != -1 && Results.ResultsStation1 != -1)
        {
            break;
        }
    }
    Trace_Info("GOT RESULTS " + stationIndex + "Time: " + sw.ElapsedMilliseconds.ToString() + "ms");
    if (Results.ResultsStation0 == 1 && Results.ResultsStation1 == 1)
    {
        // Set OK if both results are OK
        Device.profinet.WritePorts(new Enum[] { NOK, OK },
                                   new int[] { 0, 1 });
    }
}
It works, but the problem is that the time sw measures in the waiting thread should be about 1 ms. I get 1 ms sometimes, but most of the time I get values of up to 80 ms.
My question is: why does it take that long if the threads share the same memory (I assume)?
Is this the right way to access a variable from different threads?
Don't use this method. Global mutable state is bad enough; mixing in multiple threads sounds like a recipe for unmaintainable code. Since there is no synchronization at all in sight, there is no real guarantee that your program will ever finish. On a single-CPU system your loop will prevent any real work from actually being done until the scheduler picks one of the worker threads to run, and even on a multi-core system you will waste a ton of CPU cycles.
If you really want global variables, they should be something that can signal the completion of the operation, i.e. a Task or a ManualResetEvent. That way you can get rid of the horrible spin-wait and actually wait for each task to complete, as sketched below.
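A hedged sketch of the signaling approach; the class and event names here are invented for illustration, not from the question:

using System.Threading;

static class Stations
{
    static readonly ManualResetEventSlim Station0Done = new ManualResetEventSlim();
    static readonly ManualResetEventSlim Station1Done = new ManualResetEventSlim();

    public static void StationResult(int stationIndex)
    {
        // Signal this station's result, then wait for both without burning CPU.
        if (stationIndex == 0) Station0Done.Set();
        else Station1Done.Set();

        Station0Done.Wait();
        Station1Done.Wait();
    }
}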
But I would highly recommend getting rid of the global variables and just using standard task-based programming:
var result1 = Task.Run(MyMethod1);
var result2 = Task.Run(MyMethod2);
await Task.WhenAll(new []{result1, result2});
Such code is much easier to reason about and understand.
Multi-threaded programming is difficult. There are a bunch of new ways your program can break, and the compiler will not help you. You are lucky if you even get an exception; in many cases you will just get an incorrect result. If you are unlucky you will only get incorrect results in production, not in development or testing. So you should read a fair amount about the topic so that you are at least familiar with the common dangers and the ways to mitigate them.
You are using flags as signaling; for this there is a class called AutoResetEvent.
There's a difference between safe access and synchronization.
For safe (atomic) access you can use the Interlocked class.
For synchronization you use mutex-based solutions: spinlocks, barriers, etc.
It looks like you need a synchronization mechanism, because you rely on atomic behavior to signal a process that it is done.
Furthermore, C# has the async way of doing things, and that is to use await.
It is Task-based, so if you can redesign your flow to use Tasks instead of Threads it will suit you better.
Just to be clear: atomicity means the operation is performed in one indivisible step.
So for example this is not atomic:
int a = 0;
int b = a; //not atomic - read 'a' and then assign to 'b'.
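By contrast, a minimal sketch of atomic access using Interlocked (the Counter class is my illustration, not from the answer above):

using System.Threading;

class Counter
{
    private int _value;

    // Atomic read-modify-write: no increment can be lost under concurrency.
    public void Increment() => Interlocked.Increment(ref _value);

    // Atomic read with a full fence; comparing and swapping with the same
    // value (0, 0) leaves the field unchanged and just returns its value.
    public int Read() => Interlocked.CompareExchange(ref _value, 0, 0);
}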
I won't teach you everything there is to know about threading in C# in one answer, so my advice is to read the MSDN articles about threading and tasks.

Use of forcefully calling garbage collection method [duplicate]

The general advice is that you should not call GC.Collect from your code, but what are the exceptions to this rule?
I can only think of a few very specific cases where it may make sense to force a garbage collection.
One example that springs to mind is a service, that wakes up at intervals, performs some task, and then sleeps for a long time. In this case, it may be a good idea to force a collect to prevent the soon-to-be-idle process from holding on to more memory than needed.
Are there any other cases where it is acceptable to call GC.Collect?
If you have good reason to believe that a significant set of objects - particularly those you suspect to be in generations 1 and 2 - are now eligible for garbage collection, and that now would be an appropriate time to collect in terms of the small performance hit.
A good example of this is if you've just closed a large form. You know that all the UI controls can now be garbage collected, and a very short pause as the form is closed probably won't be noticeable to the user.
UPDATE 2.7.2018
As of .NET 4.5 there are GCLatencyMode.LowLatency and GCLatencyMode.SustainedLowLatency. When entering and leaving either of these modes, it is recommended that you force a full GC with GC.Collect(2, GCCollectionMode.Forced).
As of .NET 4.6 there is the GC.TryStartNoGCRegion method (used to set the read-only value GCLatencyMode.NoGCRegion). This can itself perform a full blocking garbage collection in an attempt to free enough memory, but given we are disallowing GC for a period, I would argue it is also a good idea to perform a full GC before and after.
Source: Microsoft engineer Ben Watson's Writing High-Performance .NET Code, 2nd Ed., 2018.
See:
https://msdn.microsoft.com/en-us/library/system.runtime.gclatencymode(v=vs.110).aspx
https://msdn.microsoft.com/en-us/library/dn906204(v=vs.110).aspx
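A hedged sketch of both recommendations; the mode transitions and the 16 MB budget are illustrative choices, not values from the book:

using System;
using System.Runtime;

// Force a full GC when entering and leaving a low-latency mode.
GCSettings.LatencyMode = GCLatencyMode.SustainedLowLatency;
GC.Collect(2, GCCollectionMode.Forced);
// ... latency-sensitive work ...
GCSettings.LatencyMode = GCLatencyMode.Interactive;
GC.Collect(2, GCCollectionMode.Forced);

// .NET 4.6+: disallow GC entirely for a bounded allocation budget.
if (GC.TryStartNoGCRegion(16 * 1024 * 1024))
{
    try
    {
        // ... critical section; allocations must stay within the budget ...
    }
    finally
    {
        GC.EndNoGCRegion();
    }
}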
I use GC.Collect only when writing crude performance/profiler test rigs; i.e. I have two (or more) blocks of code to test - something like:
GC.Collect(GC.MaxGeneration, GCCollectionMode.Forced);
TestA(); // may allocate lots of transient objects
GC.Collect(GC.MaxGeneration, GCCollectionMode.Forced);
TestB(); // may allocate lots of transient objects
GC.Collect(GC.MaxGeneration, GCCollectionMode.Forced);
...
So that TestA() and TestB() run with as similar state as possible - i.e. TestB() doesn't get hammered just because TestA left it very close to the tipping point.
A classic example would be a simple console exe (a Main method short enough to be posted here, for example) that shows the difference between looped string concatenation and StringBuilder.
If I need something precise, then this would be two completely independent tests - but often this is enough if we just want to minimize (or normalize) the GC during the tests to get a rough feel for the behaviour.
During production code? I have yet to use it ;-p
The best practice is not to force a garbage collection in most cases. (Every system I have worked on that had forced garbage collections had underlying problems that, if solved, would have removed the need to force the garbage collection, and would have sped the system up greatly.)
There are a few cases when you know more about memory usage than the garbage collector does. This is unlikely to be true in a multi-user application, or a service that is responding to more than one request at a time.
However in some batch-type processing you do know more than the GC. E.g. consider an application that:
Is given a list of file names on the command line
Processes a single file and then writes the result out to a results file
While processing the file, creates a lot of interlinked objects that cannot be collected until the processing of the file is complete (e.g. a parse tree)
Does not keep much state between the files it has processed
You may be able to make a case (after careful testing) that you should force a full garbage collection after you have processed each file.
Another case is a service that wakes up every few minutes to process some items, and does not keep any state while it's asleep. Then forcing a full collection just before going to sleep may be worthwhile.
The only time I would consider forcing a collection is when I know that a lot of objects had been created recently and very few objects are currently referenced.
I would rather have a garbage collection API where I could give it hints about this type of thing without having to force a GC myself.
See also "Rico Mariani's Performance Tidbits"
These days I consider some of the above cases would be better served by a short-lived worker process doing each batch of work and letting the OS do the resource recovery.
One case is when you are trying to unit test code that uses WeakReference.
In large 24/7 or 24/6 systems -- systems that react to messages, RPC requests or that poll a database or process continuously -- it is useful to have a way to identify memory leaks. For this, I tend to add a mechanism to the application to temporarily suspend any processing and then perform full garbage collection. This puts the system into a quiescent state where the memory remaining is either legitimately long lived memory (caches, configuration, &c.) or else is 'leaked' (objects that are not expected or desired to be rooted but actually are).
Having this mechanism makes it a lot easier to profile memory usage as the reports will not be clouded with noise from active processing.
To be sure you get all of the garbage, you need to perform two collections:
GC.Collect();
GC.WaitForPendingFinalizers();
GC.Collect();
The first collection causes any objects with finalizers to be finalized (but does not actually collect these objects); the second GC collects these now-finalized objects.
You can call GC.Collect() when you know something about the nature of the app the garbage collector doesn't.
As the author, it's often tempting to think this is likely or normal. However, the truth is the GC amounts to a pretty well-written and tested expert system, and it's rare you'll know something about the low level code paths it doesn't.
The best example I can think of where you might have some extra information is an app that cycles between idle periods and very busy periods. You want the best performance possible for the busy periods and therefore want to use the idle time to do some clean up.
However, most of the time the GC is smart enough to do this anyway.
One instance where it is almost necessary to call GC.Collect() is when automating Microsoft Office through Interop. COM objects for Office don't like to automatically release and can result in the instances of the Office product taking up very large amounts of memory. I'm not sure if this is an issue or by design. There's lots of posts about this topic around the internet so I won't go into too much detail.
When programming using Interop, every single COM object should be manually released, usually through the use of Marshal.ReleaseComObject(). In addition, calling garbage collection manually can help "clean up" a bit. Calling the following code when you're done with Interop objects seems to help quite a bit:
GC.Collect();
GC.WaitForPendingFinalizers();
GC.Collect();
In my personal experience, using a combination of ReleaseComObject and manually calling garbage collection greatly reduces the memory usage of Office products, specifically Excel.
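For illustration, the overall cleanup pattern might look like the sketch below. It assumes a reference to the Excel interop assembly; the method name and structure are mine, not a prescribed API.

using System.Runtime.InteropServices;
using Excel = Microsoft.Office.Interop.Excel;

static void RunExcelJob()
{
    var app = new Excel.Application();
    try
    {
        // ... work with workbooks and worksheets, keeping a reference to
        // every COM object you touch so each one can be released ...
    }
    finally
    {
        app.Quit();
        Marshal.ReleaseComObject(app); // release the root COM object

        GC.Collect();
        GC.WaitForPendingFinalizers();
        GC.Collect();
    }
}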
As a memory fragmentation solution.
I was getting out-of-memory exceptions while writing a lot of data into a memory stream (reading from a network stream). The data was written in 8K chunks. After reaching 128M there was an exception even though there was a lot of memory available (but it was fragmented). Calling GC.Collect() solved the issue. I was able to handle over 1G after the fix.
Have a look at this article by Rico Mariani. He gives two rules when to call GC.Collect (rule 1 is: "Don't"):
When to call GC.Collect()
I was doing some performance testing on arrays and lists:
private static int count = 100000000;

private static List<int> GetSomeNumbers_List_int()
{
    var lstNumbers = new List<int>();
    for (var i = 1; i <= count; i++)
    {
        lstNumbers.Add(i);
    }
    return lstNumbers;
}

private static int[] GetSomeNumbers_Array()
{
    var lstNumbers = new int[count];
    for (var i = 1; i <= count; i++)
    {
        lstNumbers[i - 1] = i;
    }
    return lstNumbers;
}

private static int[] GetSomeNumbers_Enumerable_Range()
{
    return Enumerable.Range(1, count).ToArray();
}

static void performance_100_Million()
{
    var sw = new Stopwatch();
    sw.Start();
    var numbers1 = GetSomeNumbers_List_int();
    sw.Stop();
    //numbers1 = null;
    //GC.Collect();
    Console.WriteLine(String.Format("\"List<int>\" took {0} milliseconds", sw.ElapsedMilliseconds));

    sw.Reset();
    sw.Start();
    var numbers2 = GetSomeNumbers_Array();
    sw.Stop();
    //numbers2 = null;
    //GC.Collect();
    Console.WriteLine(String.Format("\"int[]\" took {0} milliseconds", sw.ElapsedMilliseconds));

    sw.Reset();
    sw.Start();
    // getting System.OutOfMemoryException in GetSomeNumbers_Enumerable_Range method
    var numbers3 = GetSomeNumbers_Enumerable_Range();
    sw.Stop();
    //numbers3 = null;
    //GC.Collect();
    Console.WriteLine(String.Format("\"int[]\" Enumerable.Range took {0} milliseconds", sw.ElapsedMilliseconds));
}
and I got an OutOfMemoryException in the GetSomeNumbers_Enumerable_Range method; the only workaround was to deallocate the memory with:
numbers = null;
GC.Collect();
You should try to avoid using GC.Collect() since it's very expensive. Here is an example:
public void ClearFrame(ulong timeStamp)
{
    if (RecordSet.Count <= 0) return;
    if (Limit == false)
    {
        var seconds = (timeStamp - RecordSet[0].TimeStamp) / 1000;
        if (seconds <= _preFramesTime) return;
        Limit = true;
        do
        {
            RecordSet.Remove(RecordSet[0]);
        } while (((timeStamp - RecordSet[0].TimeStamp) / 1000) > _preFramesTime);
    }
    else
    {
        RecordSet.Remove(RecordSet[0]);
    }
    GC.Collect(); // AVOID
}
TEST RESULT: CPU USAGE 12%
When you change to this:
public void ClearFrame(ulong timeStamp)
{
    if (RecordSet.Count <= 0) return;
    if (Limit == false)
    {
        var seconds = (timeStamp - RecordSet[0].TimeStamp) / 1000;
        if (seconds <= _preFramesTime) return;
        Limit = true;
        do
        {
            RecordSet[0].Dispose(); // Bitmap destroyed!
            RecordSet.Remove(RecordSet[0]);
        } while (((timeStamp - RecordSet[0].TimeStamp) / 1000) > _preFramesTime);
    }
    else
    {
        RecordSet[0].Dispose(); // Bitmap destroyed!
        RecordSet.Remove(RecordSet[0]);
    }
    //GC.Collect();
}
TEST RESULT: CPU USAGE 2-3%
In your example, I think that calling GC.Collect isn't the issue, but rather there is a design issue.
If you are going to wake up at intervals (set times), then your program should be crafted for a single execution (perform the task once) and then terminate. Then you set the program up as a scheduled task to run at the scheduled intervals.
This way, you don't have to concern yourself with calling GC.Collect (which you should rarely, if ever, have to do).
That being said, Rico Mariani has a great blog post on this subject, which can be found here:
http://blogs.msdn.com/ricom/archive/2004/11/29/271829.aspx
One useful place to call GC.Collect() is in a unit test when you want to verify that you are not creating a memory leak (e.g. if you are doing something with WeakReferences or ConditionalWeakTable, dynamically generated code, etc.).
For example, I have a few tests like:
WeakReference w = CodeThatShouldNotMemoryLeak();
Assert.IsTrue(w.IsAlive);
GC.Collect();
GC.WaitForPendingFinalizers();
Assert.IsFalse(w.IsAlive);
It could be argued that using WeakReferences is a problem in and of itself, but it seems that if you are creating a system that relies on such behavior then calling GC.Collect() is a good way to verify such code.
There are some situations where it is better safe than sorry.
Here is one situation.
It is possible to author an unmanaged DLL in C# using IL rewriting (because there are situations where this is necessary).
Now suppose, for example, the DLL creates an array of bytes at the class level, because many of the exported functions need access to it. What happens when the DLL is unloaded? Is the garbage collector automatically called at that point? I don't know, but being an unmanaged DLL it is entirely possible the GC isn't called, and it would be a big problem if it wasn't. When the DLL is unloaded, so too is the garbage collector; so who is going to be responsible for collecting any possible garbage, and how would they do it? Better to employ C#'s garbage collector: have a cleanup function (available to the DLL client) where the class-level variables are set to null and the garbage collector is called.
Better safe than sorry.
The short answer is: never!
using (var stream = new MemoryStream())
{
    bitmap.Save(stream, ImageFormat.Png);
    techObject.Last().Image = Image.FromStream(stream);
    bitmap.Dispose();

    // Without this code, I had an OutOfMemory exception.
    GC.Collect();
    GC.WaitForPendingFinalizers();
    //
}
Another reason is when you have a SerialPort opened on a USB COM port, and then the USB device is unplugged. Because the SerialPort was opened, the resource holds a reference to the previously connected port in the system's registry. The system's registry will then contain stale data, so the list of available ports will be wrong. Therefore the port must be closed.
Calling SerialPort.Close() on the port calls Dispose() on the object, but it remains in memory until garbage collection actually runs, causing the registry to remain stale until the garbage collector decides to release the resource.
From https://stackoverflow.com/a/58810699/8685342:
try
{
    if (port != null)
        port.Close(); // this will throw an exception if the port was unplugged
}
catch (Exception ex) // of type 'System.IO.IOException'
{
    System.GC.Collect();
    System.GC.WaitForPendingFinalizers();
}
port = null;
If you are creating a lot of new System.Drawing.Bitmap objects, the Garbage Collector doesn't clear them. Eventually GDI+ will think you are running out of memory and will throw a "The parameter is not valid" exception. Calling GC.Collect() every so often (not too often!) seems to resolve this issue.
I am still pretty unsure about this.
I have been working on an application server for seven years. Our bigger installations use 24 GB of RAM. It is highly multithreaded, and ALL calls to GC.Collect() ran into really terrible performance issues.
Many third-party components called GC.Collect() whenever they thought it was clever to do so right then.
So a simple batch of Excel reports blocked the app server for all threads several times a minute.
We had to refactor all the third-party components to remove the GC.Collect() calls, and everything worked fine after doing that.
But I am running servers on Win32 as well, and there I started to make heavy use of GC.Collect() after getting an OutOfMemoryException.
But I am also pretty unsure about this, because I often noticed that when I get an OOM on 32-bit and retry the same operation without calling GC.Collect(), it just works fine.
One thing I wonder about is the OOM exception itself...
If I had written the .NET Framework and couldn't allocate a memory block, I would use GC.Collect(), defragment memory (??), try again, and if I still couldn't find a free memory block, I would throw the OOM exception.
Or at least make this behavior a configurable option, given the drawbacks of the performance issue with GC.Collect().
Now I have lots of code like this in my app to "solve" the problem:
public static TResult ExecuteOOMAware<T1, T2, TResult>(Func<T1, T2, TResult> func, T1 a1, T2 a2)
{
    int oomCounter = 0;
    int maxOOMRetries = 10;
    do
    {
        try
        {
            return func(a1, a2);
        }
        catch (OutOfMemoryException)
        {
            oomCounter++;
            if (oomCounter >= maxOOMRetries)
            {
                throw;
            }
            else
            {
                Log.Info("OutOfMemory-Exception caught, Trying to fix. Counter: " + oomCounter.ToString());
                System.Threading.Thread.Sleep(TimeSpan.FromSeconds(oomCounter * 10));
                GC.Collect();
            }
        }
    } while (oomCounter < maxOOMRetries);
    // never gets hit: the rethrow above fires before the loop can exit
    return default(TResult);
}
(Note that the Thread.Sleep() behavior is really app-specific, because we are running an ORM caching service, and the service takes some time to release all the cached objects if RAM exceeds some predefined values; so it waits a few seconds the first time and increases the waiting time with each occurrence of OOM.)
One good reason for calling GC is on small ARM computers with little memory, like the Raspberry Pi (running Mono).
If unallocated memory fragments use too much of the system RAM, then the Linux OS can get unstable.
I have an application where I have to call GC every second (!) to get rid of memory overflow problems.
Another good solution is to dispose objects when they are no longer needed. Unfortunately this is not so easy in many cases.
This isn't that relevant to the question, but for XSLT transforms in .NET (XslCompiledTransform) you might have no choice. Another candidate is the MSHTML control.
If you are using a version of .NET earlier than 4.5, manual collection may be inevitable (especially if you are dealing with many large objects).
This link describes why:
https://blogs.msdn.microsoft.com/dotnet/2011/10/03/large-object-heap-improvements-in-net-4-5/
Since there is a small object heap (SOH) and a large object heap (LOH),
we can call GC.Collect() to collect dereferenced objects in the SOH and move live objects to the next generation.
In .NET 4.5.1 and later, we can also compact the LOH by using GCSettings.LargeObjectHeapCompactionMode.
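A short sketch of that setting; the compaction happens on the next full blocking collection, after which the mode resets automatically:

using System;
using System.Runtime;

GCSettings.LargeObjectHeapCompactionMode = GCLargeObjectHeapCompactionMode.CompactOnce;
GC.Collect(); // this full collection also compacts the large object heap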

Is it possible for a Dictionary in .Net to cause dead lock when reading and writing to it in parallel?

I was playing with TPL, and trying to find out how big a mess I could make by reading and writing to the same Dictionary in parallel.
So I had this code:
private static void HowCouldARegularDicionaryDeadLock()
{
    for (var i = 0; i < 20000; i++)
    {
        TryToReproduceProblem();
    }
}

private static void TryToReproduceProblem()
{
    try
    {
        var dictionary = new Dictionary<int, int>();
        Enumerable.Range(0, 1000000)
            .ToList()
            .AsParallel()
            .ForAll(n =>
            {
                if (!dictionary.ContainsKey(n))
                {
                    dictionary[n] = n; // write
                }
                var readValue = dictionary[n]; // read
            });
    }
    catch (AggregateException e)
    {
        e.Flatten()
            .InnerExceptions.ToList()
            .ForEach(i => Console.WriteLine(i.Message));
    }
}
It was pretty messed up indeed; there were a lot of exceptions thrown, mostly about a key not existing, and a few about an index being out of the bounds of an array.
But after running the app for a while, it hangs, and the CPU percentage stays at 25%; the machine has 8 cores.
So I assume that's 2 threads running at full capacity.
Then I ran dotTrace on it, and got this:
It matches my guess: two threads running at 100%.
Both running the FindEntry method of Dictionary.
Then I ran the app again with dotTrace; this time the result is slightly different:
This time, one thread is running FindEntry, the other Insert.
My first intuition was that it's deadlocked, but then I thought it could not be: there is only one shared resource, and it's not locked.
So how should this be explained?
PS: I am not looking to solve the problem; it could be fixed by using a ConcurrentDictionary, or by doing parallel aggregation. I am just looking for a reasonable explanation for this.
So your code is executing Dictionary.FindEntry. It's not a deadlock - a deadlock happens when two threads block in a way which makes them wait for one another to release a resource, but in your case you're getting two seemingly infinite loops. The threads aren't locked.
Let's take a look at this method in the reference source:
private int FindEntry(TKey key) {
    if (key == null) {
        ThrowHelper.ThrowArgumentNullException(ExceptionArgument.key);
    }

    if (buckets != null) {
        int hashCode = comparer.GetHashCode(key) & 0x7FFFFFFF;
        for (int i = buckets[hashCode % buckets.Length]; i >= 0; i = entries[i].next) {
            if (entries[i].hashCode == hashCode && comparer.Equals(entries[i].key, key)) return i;
        }
    }
    return -1;
}
Take a look at the for loop. The increment part is i = entries[i].next, and guess what: entries is a field which is updated in the Resize method. next is a field of the inner Entry struct:
public int next; // Index of next entry, -1 if last
If your code can't exit the FindEntry method, the most probable cause is that you've managed to mess up the entries in such a way that they produce an infinite cycle when you follow the indexes pointed to by the next field.
As for the Insert method, it has a very similar for loop:
for (int i = buckets[targetBucket]; i >= 0; i = entries[i].next)
As the Dictionary class is documented to be non-threadsafe, you're in the realm of undefined behavior anyway.
Using a ConcurrentDictionary or a locking pattern such as a ReaderWriterLockSlim (Dictionary is thread-safe for concurrent reads only) or a plain old lock nicely solves the problem.
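For completeness, a minimal sketch of the test body rewritten on ConcurrentDictionary (my adaptation of the question's code, not part of the original answer):

using System.Collections.Concurrent;
using System.Linq;

var dictionary = new ConcurrentDictionary<int, int>();
Enumerable.Range(0, 1000000)
    .AsParallel()
    .ForAll(n =>
    {
        dictionary.TryAdd(n, n);       // atomic check-and-insert
        var readValue = dictionary[n]; // safe concurrent read
    });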
Looks like a race condition (not a deadlock) - which, as you comment, causes the messed up internal state.
The dictionary is not thread safe, so concurrent reads and writes to the same container from separate threads (even if there are as few as one of each) are not safe.
Once the race condition is hit, it becomes undefined what will happen; in this case what appears to be an infinite loop of some sort.
In general, once write access is required, some form of synchronization is required.
Just to complement (and correlate) the previous 2 great answers:
Dictionary<TKey, TValue> is a hash map implementation, and like most hash map implementations it internally uses linked lists (to store multiple elements in case different keys end up in the same bucket position after being hashed and after taking the hash modulo the array size) and an internal array which grows as the number of elements in the dictionary grows.
Since the code starts with an empty dictionary and adds a lot of elements, the dictionary starts with a small internal array (probably size = 3) and resizes it frequently. Since there are multiple threads trying to add elements to the dictionary, there's a great chance of different threads trying to Resize() the dictionary at the same time. Since Dictionary is not a thread-safe class, if two threads try to modify the same linked list at the same time, this can leave the linked list in an inconsistent state (that's a race condition: two threads modifying the same data, causing unpredictable results).
As explained in the other answers, a race condition while modifying the linked list can put it into an invalid state, which would explain the infinite loop occurring in the methods which iterate through the linked list (both FindEntry and Insert). The high CPU (each thread using 100% of a core) is explained by this infinite loop; if it were a deadlock the threads would be in a low-CPU state waiting for some locked resource.
Since we know the number of elements, we could pre-initialize the dictionary with a larger size (like 1000000) to reduce the chances of a race condition. But that doesn't solve the issue; the best solution is still to use a thread-safe class (ConcurrentDictionary<TKey, TValue>).
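For illustration, the pre-sizing mitigation is a one-line change; it shrinks the race window by avoiding most internal Resize() calls, but does not make the class thread-safe:

var dictionary = new Dictionary<int, int>(1000000); // capacity hint, still NOT thread-safe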

Variable freshness guarantee in .NET (volatile vs. volatile read)

I have read a lot of contradicting information (MSDN, SO, etc.) about volatile and VolatileRead (ReadAcquireFence).
I understand the memory-access reordering restrictions they imply; what I'm still completely confused about is the freshness guarantee, which is very important for me.
msdn doc for volatile mentions:
(...) This ensures that the most up-to-date value is present in the field at all times.
msdn doc for volatile fields mentions:
A read of a volatile field is called a volatile read. A volatile read has "acquire semantics"; that is, it is guaranteed to occur prior to any references to memory that occur after it in the instruction sequence.
.NET code for VolatileRead is:
public static int VolatileRead(ref int address)
{
    int ret = address;
    MemoryBarrier(); // Call MemoryBarrier to ensure the proper semantic in a portable way.
    return ret;
}
According to the MSDN MemoryBarrier doc, a memory barrier prevents reordering. However this doesn't seem to have any implications for freshness - correct?
How then one can get freshness guarantee?
And is there a difference between marking a field volatile and accessing it with VolatileRead and VolatileWrite semantics? I'm currently doing the latter in my performance-critical code that needs to guarantee freshness, yet readers sometimes get a stale value. I'm wondering if marking the field volatile would make the situation different.
EDIT1:
What I'm trying to achieve: get the guarantee that reader threads will get as recent a value of the shared variable (written by multiple writers) as possible; ideally no older than the cost of a context switch or other operations that may postpone the immediate write of the state.
If volatile or a higher-level construct (e.g. lock) has this guarantee (does it?), then how does it achieve this?
EDIT2:
The very condensed question should have been: how do I get a guarantee of as fresh a value as possible during reads? Ideally without locking (as exclusive access is not needed and there is potential for high contention).
From what I learned here I'm wondering if this might be the solution (the solving(?) line is marked with a comment):
private SharedState _sharedState;
private SpinLock _spinLock = new SpinLock(false);

public void Update(SharedState newValue)
{
    bool lockTaken = false;
    _spinLock.Enter(ref lockTaken);
    _sharedState = newValue;
    if (lockTaken)
    {
        _spinLock.Exit();
    }
}

public SharedState GetFreshSharedState
{
    get
    {
        Thread.MemoryBarrier(); // <---- This is added to give readers freshness guarantee
        var value = _sharedState;
        Thread.MemoryBarrier();
        return value;
    }
}
The MemoryBarrier call was added to make sure both - reads and writes - are wrapped by full fences (same as lock code - as indicated here http://www.albahari.com/threading/part4.aspx#_The_volatile_keyword 'Memory barriers and locking' section)
Does this look correct or is it flawed?
EDIT3:
Thanks to the very interesting discussions here I learned quite a few things, and I was actually able to distill the simplified, unambiguous question that I have about this topic. It's quite different from the original one, so I posted it as a new question here: Memory barrier vs Interlocked impact on memory caches coherency timing
I think this is a good question. But it is also difficult to answer. I am not sure I can give you a definitive answer to your questions. It is not your fault really; it is just that the subject matter is complex and really requires knowing details that might not be feasible to enumerate. Honestly, it really seems like you have educated yourself on the subject quite well already. I have spent a lot of time studying the subject myself and I still do not fully understand everything. Nevertheless, I will still attempt some semblance of an answer here anyway.
So what does it mean for a thread to read a fresh value anyway? Does it mean the value returned by the read is guaranteed to be no older than 100ms, 50ms, or 1ms? Or does it mean the value is the absolute latest? Or does it mean that if two reads occur back-to-back then the second is guaranteed to get a newer value assuming the memory address changed after the first read? Or does it mean something else altogether?
I think you are going to have a hard time getting your readers to work correctly if you are thinking about things in terms of time intervals. Instead think of things in terms of what happens when you chain reads together. To illustrate my point consider how you would implement an interlocked-like operation using arbitrarily complex logic.
public static T InterlockedOperation<T>(ref T location, T operand) where T : class
{
    T initial, computed;
    do
    {
        initial = location;
        computed = op(initial, operand); // where op is replaced with a specific implementation
    }
    while (Interlocked.CompareExchange(ref location, computed, initial) != initial);
    return computed;
}
In the code above we can create any interlocked-like operation if we exploit the fact that the second read of location via Interlocked.CompareExchange will be guaranteed to return a newer value if the memory address received a write after the first read. This is because the Interlocked.CompareExchange method generates a memory barrier. If the value has changed between reads then the code spins around the loop repeatedly until location stops changing. This pattern does not require that the code use the latest or freshest value; just a newer value. The distinction is crucial.1
A lot of lock-free code I have seen works on this principle. That is, the operations are usually wrapped in loops such that the operation is continually retried until it succeeds. It does not assume that the first attempt uses the latest value, nor does it assume every use of the value is the latest. It only assumes that the value is newer after each read.
Try to rethink how your readers should behave. Try to make them more agnostic about the age of the value. If that is simply not possible and all writes must be captured and processed then you may be forced into a more deterministic approach like placing all writes into a queue and having the readers dequeue them one-by-one. I am sure the ConcurrentQueue class would help in that situation.
If you can reduce the meaning of "fresh" to only "newer" then placing a call to Thread.MemoryBarrier after each read, using Volatile.Read, using the volatile keyword, etc. will absolutely guarantee that one read in a sequence will return a newer value than a previous read.
1The ABA problem opens up a new can of worms.
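To make the acquire-read options above concrete, here is a minimal sketch; the class and field names are mine, invented for illustration:

using System.Threading;

class Reader
{
    private int _state; // written concurrently by other threads

    // Each of these guarantees that a read in a sequence returns a value
    // no older than the one returned by the previous read.
    public int ReadAcquire() => Volatile.Read(ref _state);

    public int ReadWithFullFence()
    {
        int value = _state;
        Thread.MemoryBarrier(); // full fence after the read
        return value;
    }
}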
A memory barrier does provide this guarantee. We can derive the "freshness" property that you are looking for from the reordering properties that a barrier guarantees.
By freshness you probably mean that a read returns the value of the most recent write.
Let's say we have these operations, each on a different thread:
x = 1
x = 2
print(x)
How could we possibly print a value other than 2? Without volatile the read can move one slot upwards and return 1. Volatile prevents reorderings, though. The write cannot move backwards in time.
In short, volatile guarantees you to see the most recent value.
Strictly speaking I'd need to differentiate between volatile and a memory barrier here. The latter one is a stronger guarantee. I have simplified this discussion because volatile is implemented using memory barriers, at least on x86/x64.

Locks vs Compare-and-swap

I've been reading about lock-free techniques, like Compare-and-swap and leveraging the Interlocked and SpinWait classes to achieve thread synchronization without locking.
I've run a few tests of my own, where I simply have many threads trying to append a character to a string. I tried using regular locks and compare-and-swap. Surprisingly (at least to me), locks showed much better results than using CAS.
Here's the CAS version of my code (based on this). It follows a copy->modify->swap pattern:
private string _str = "";
public void Append(char value)
{
var spin = new SpinWait();
while (true)
{
var original = Interlocked.CompareExchange(ref _str, null, null);
var newString = original + value;
if (Interlocked.CompareExchange(ref _str, newString, original) == original)
break;
spin.SpinOnce();
}
}
And the simpler (and more efficient) lock version:
private object lk = new object();

public void AppendLock(char value)
{
    lock (lk)
    {
        _str += value;
    }
}
If I try adding 50,000 characters, the CAS version takes 1.2 seconds and the lock version 700 ms (on average). For 100k characters, they take 7 seconds and 3.8 seconds, respectively.
This was run on a quad-core (i5 2500k).
I suspected the reason CAS was displaying these results was that it was failing the last "swap" step a lot. I was right. When I try adding 50k chars (50k successful swaps), I was able to count between 70k (best case scenario) and almost 200k (worst case scenario) failed attempts. In the worst case, 4 out of every 5 attempts failed.
So my questions are:
What am I missing? Shouldn't CAS give better results? Where's the benefit?
Why exactly and when is CAS a better option? (I know this has been asked, but I can't find any satisfying answer that also explains my specific scenario).
It is my understanding that solutions employing CAS, although hard to code, scale much better and perform better than locks as contention increases. In my example, the operations are very small and frequent, which means high contention and high frequency. So why do my tests show otherwise?
I assume that longer operations would make the case even worse: the "swap" failure rate would increase even more.
PS: this is the code I used to run the tests:
Stopwatch watch = Stopwatch.StartNew();
var cl = new Class1();
Parallel.For(0, 50000, i => cl.Append('a'));
var time = watch.Elapsed;
Debug.WriteLine(time.TotalMilliseconds);
The problem is a combination of the failure rate on the loop and the fact that strings are immutable. I did a couple of tests on my own using the following parameters.
Ran 8 different threads (I have an 8 core machine).
Each thread called Append 10,000 times.
What I observed was that the final length of the string was 80,000 (8 x 10,000) so that was perfect. The number of append attempts averaged ~300,000 for me. So that is a failure rate of ~73%. Only 27% of the CPU time resulted in useful work. Now because strings are immutable that means a new instance of the string is created on the heap and the original contents plus the one extra character is copied into it. By the way, this copy operation is O(n) so it gets longer and longer as the string's length increases. Because of the copy operation my hypothesis was that the failure rate would increase as the length of the string increases. The reason being that as the copy operation takes more and more time there is a higher probability of a collision as the threads spend more time competing to finalize the ICX. My tests confirmed this. You should try this same test yourself.
The biggest issue here is that sequential string concatenations do not lend themselves to parallelism very well. Since the result of operation Xn depends on Xn-1, it is going to be faster to take the full lock, especially if it means you avoid all of the failures and retries. A pessimistic strategy wins the battle against an optimistic one in this case. The low-lock techniques work better when you can partition the problem into independent chunks that really can run unimpeded in parallel.
As a side note the use of Interlocked.CompareExchange to do the initial read of _str is unnecessary. The reason is that a memory barrier is not required for the read in this case. This is because the Interlocked.CompareExchange call that actually performs work (the second one in your code) will create a full barrier. So the worst case scenario is that the first read is "stale", the ICX operation fails the test, and the loop spins back around to try again. This time, however, the previous ICX forced a "fresh" read.1
The following code is how I generalize a complex operation using low-lock mechanisms. In fact, the code presented below allows you to pass a delegate representing the operation, so it is very generalized. Would you want to use it in production? Probably not, because invoking the delegate is slow, but at least you get the idea. You could always hard-code the operation.
public static class InterlockedEx
{
    public static T Change<T>(ref T destination, Func<T, T> operation) where T : class
    {
        T original, value;
        do
        {
            original = destination;
            value = operation(original);
        }
        while (Interlocked.CompareExchange(ref destination, value, original) != original);
        return original;
    }
}
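For instance, the Append method from the question could be rewritten on top of this helper; a usage sketch:

private string _str = "";

public void Append(char value)
{
    // The helper retries internally until the CAS succeeds.
    InterlockedEx.Change(ref _str, s => s + value);
}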
1I actually dislike the terms "stale" and "fresh" when discussing memory barriers because that is not what they are about really. It is more of a side effect as opposed to actual guarantee. But, in this case it illustrates my point better.

Categories

Resources