I am exploring async/await and have found a curious scenario that I need guidance on resolving.
For reference, the code seen in this question can be found here:
https://github.com/Mike-EEE/Stash/tree/master/AwaitPerformance
I have provided two simple ways of awaiting a set of tasks. The first is simply creating a List<Task>, adding tasks to this list, and awaiting the entire result at once with a call to Task.WhenAll:
public async Task<uint> AwaitList()
{
var list = new List<Task>();
for (var i = 0u; i < 10; i++)
{
list.Add(Task.Delay(1));
}
await Task.WhenAll(list).ConfigureAwait(false);
return 123;
}
The second awaits each task as it occurs in the for loop:
public async Task<uint> AwaitEach()
{
for (var i = 0u; i < 10; i++)
{
await Task.Delay(1).ConfigureAwait(false);
}
return 123;
}
When running these two methods with Benchmark.NET, however, I get surprisingly conflicting results:
// * Summary *
BenchmarkDotNet=v0.11.5, OS=Windows 10.0.18362
AMD Ryzen 7 2700X, 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=3.0.100-preview5-011568
[Host] : .NET Core 3.0.0-preview5-27626-15 (CoreCLR 4.6.27622.75, CoreFX 4.700.19.22408), 64bit RyuJIT
DefaultJob : .NET Core 3.0.0-preview5-27626-15 (CoreCLR 4.6.27622.75, CoreFX 4.700.19.22408), 64bit RyuJIT
| Method | Mean | Error | StdDev | Gen 0 | Gen 1 | Gen 2 | Allocated |
|---------- |----------:|----------:|----------:|------:|------:|------:|----------:|
| AwaitList | 15.60 ms | 0.0274 ms | 0.0243 ms | - | - | - | 2416 B |
| AwaitEach | 155.62 ms | 0.9113 ms | 0.8524 ms | - | - | - | 352 B |
As you can see, awaiting the list of tasks is much faster, but generates a ton of allocations. Awaiting each item, however, is the inverse: it is slower but generates way less garbage.
Is there an obvious, ideal way that I am overlooking to get the best of both worlds here? That is, is there a way to await a set of Task elements that is both fast and results in a low amount of allocations?
Thank you in advance for any assistance.
You are not comparing apples to apples here.
In your example:
AwaitList creates a list of Tasks and then runs them all in parallel (async).
AwaitEach runs each Task one after another, hence making the asynchrony useless.
If, however, you build your list of Tasks first, so that each task can start, and then compare WhenAll with a loop, your comparison looks like this:
public async Task<uint> AwaitList()
{
var list = new List<Task>();
for (var i = 0u; i < 10; i++)
{
list.Add(Task.Delay(1));
}
await Task.WhenAll(list).ConfigureAwait(false);
return 123;
}
versus
public async Task<uint> AwaitEach()
{
var list = new List<Task>();
for (var i = 0; i < 10; i++)
{
list.Add(Task.Delay(1));
}
for (var i = 0; i < 10; i++)
{
await list[i].ConfigureAwait(false);
}
return 123;
}
Now compare the stats on these two functions and you will find they are in the same ballpark.
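If you want to re-run that comparison yourself, a minimal BenchmarkDotNet harness could look like the sketch below (the class name and Main are mine; the two benchmark bodies are the corrected methods from above):
```csharp
using System.Collections.Generic;
using System.Threading.Tasks;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

[MemoryDiagnoser]
public class AwaitBenchmarks
{
    [Benchmark]
    public async Task<uint> AwaitList()
    {
        var list = new List<Task>();
        for (var i = 0; i < 10; i++)
        {
            list.Add(Task.Delay(1));
        }
        await Task.WhenAll(list).ConfigureAwait(false);
        return 123;
    }

    [Benchmark]
    public async Task<uint> AwaitEach()
    {
        // Start all tasks first, then await them one by one.
        var list = new List<Task>();
        for (var i = 0; i < 10; i++)
        {
            list.Add(Task.Delay(1));
        }
        for (var i = 0; i < list.Count; i++)
        {
            await list[i].ConfigureAwait(false);
        }
        return 123;
    }

    public static void Main() => BenchmarkRunner.Run<AwaitBenchmarks>();
}
```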
Related
We have a piece of code that basically reads a template from a file, and then replaces a bunch of placeholders with real data. The template is something outside of the developer's control, and we have noticed that sometimes (for large template files) it can get quite CPU-intensive to perform the replaces.
At least, we believe that it is the string replaces that are intensive. Debugging locally shows that it only takes milliseconds to perform each replace, but every template is different and some templates can contain hundreds of these tags that need replacing.
I'll show a little bit of code first, before I continue.
This is a huge simplification of the real code. There could be hundreds of replacements happening on our real code.
string template = File.ReadAllText(@"path\to\file");
if (!string.IsNullOrEmpty(template))
{
if (template.Contains("[NUMBER]"))
template = template.Replace("[NUMBER]", myObject.Number);
if (template.Contains("[VAT_TABLE]"))
template = template.Replace("[VAT_TABLE]", ConstructVatTable(myObject));
// etc ... :
}
private string ConstructVatTable(Invoice myObject)
{
string vatTemplate = "this is a template with tags of its own";
StringBuilder builder = new StringBuilder();
foreach (var item in myObject.VatItems)
{
builder.Append(vatTemplate.Replace("[TAG1]", item.Tag1).Replace("[TAG2]", item.Tag2));
}
return builder.ToString();
}
Is this the most optimal way of replacing parts of a large string, or are there better ways? Are there ways that we could profile what we are doing in more detail to show us where the CPU intensive parts may lie? Any help or advice would be greatly appreciated.
You perhaps need to come up with some alternative strategies for your replacements and race your horses.
I did 4 here and benched them:
[MemoryDiagnoser]
[SimpleJob(RuntimeMoniker.Net50)]
public class Streplace
{
private string _doc;
private Dictionary<string,string> _tags;
[GlobalSetup]
public void GenerateDocAndTags()
{
var sb = new StringBuilder(50000);
_tags = new Dictionary<string, string>();
for (int i = 0; i < 20; i++)
{
_tags["TAG" + i] = new string((char)('a' + i), 1000);
}
for (int i = 0; i < 1000; i++)
{
sb.Append(Guid.NewGuid().ToString().ToUpper());
if (i % 50 == 0)
{
sb.Append("[TAG" + i / 50 + "]");
}
}
_doc = sb.ToString();
}
[Benchmark]
public void NaiveString()
{
var str = _doc;
foreach (var tag in _tags)
{
str = str.Replace("[" + tag.Key + "]", tag.Value);
}
}
[Benchmark]
public void NaiveStringBuilder()
{
var strB = new StringBuilder(_doc, _doc.Length * 2);
foreach (var tag in _tags)
{
strB.Replace("[" + tag.Key + "]", tag.Value);
}
var s = strB.ToString();
}
[Benchmark]
public void StringSplit()
{
var strs = _doc.Split('[',']');
for (int i = 1; i < strs.Length; i+= 2)
{
strs[i] = _tags[strs[i]];
}
var s = string.Concat(strs);
}
[Benchmark]
public void StringCrawl()
{
var strB = new StringBuilder(_doc.Length * 2);
var str = _doc;
var lastI = 0;
for (int i = str.IndexOf('['); i > -1; i = str.IndexOf('[', i))
{
strB.Append(str, lastI, i - lastI); //up to the [
i++;
var j = str.IndexOf(']', i);
var tag = str[i..j];
strB.Append(_tags[tag]);
lastI = j + 1;
}
strB.Append(str, lastI, str.Length - lastI);
var s = strB.ToString();
}
}
NaiveString - your replace replace replace approach
NaiveStringBuilder - Rafal's approach - I was quite surprised how badly this performed, but I haven't looked into why. If anyone notices a glaring error in my code, let me know
StringSplit - the approach I commented - split the string and then the odd indexes are what needs swapping out, then join the string again
StringCrawl - travel the string looking for [ and ] putting either the doc content, or a tag content into the StringBuilder, depending on whether we're inside or outside the brackets
The test document was generated: thousands of GUIDs with [TAGx] (x from 0 to 19) inserted at regular intervals. Each [TAGx] was replaced with a considerably longer string of repeated chars. The resulting document was 56,000 chars.
The results were:
BenchmarkDotNet=v0.13.1, OS=Windows 10.0.19044.1526 (21H2)
Intel Core i7-7820HQ CPU 2.90GHz (Kaby Lake), 1 CPU, 8 logical and 4 physical cores
.NET SDK=6.0.200
[Host] : .NET 5.0.12 (5.0.1221.52207), X64 RyuJIT [AttachedDebugger]
.NET 5.0 : .NET 5.0.12 (5.0.1221.52207), X64 RyuJIT
Job=.NET 5.0 Runtime=.NET 5.0
| Method | Mean | Error | StdDev | Gen 0 | Gen 1 | Gen 2 | Allocated |
|------------------- |-----------:|---------:|---------:|---------:|---------:|---------:|----------:|
| NaiveString | 1,266.9 us | 24.73 us | 39.93 us | 433.5938 | 433.5938 | 433.5938 | 1,820 KB |
| NaiveStringBuilder | 3,908.6 us | 73.79 us | 93.33 us | 78.1250 | 78.1250 | 78.1250 | 293 KB |
| StringSplit | 110.8 us | 2.15 us | 2.01 us | 34.4238 | 34.4238 | 34.4238 | 181 KB |
| StringCrawl | 101.5 us | 1.96 us | 2.40 us | 79.9561 | 79.9561 | 79.9561 | 251 KB |
NaiveString's memory usage is huge, NaiveStringBuilder's memory is better but is 3x slower. Crawl and Split are pretty good - about 12x faster than NaiveString, 40x faster than NaiveStringBuilder (and a bit less memory too, bonus).
I thought the Crawl would be better than the Split, but I'm not sure the 10% speedup is worth the extra memory/collections - that would be your call. Take the code, maybe add some more approaches, and race them. Install BenchmarkDotNet from NuGet and drive it with a Main like this:
public static async Task Main()
{
#if DEBUG
var sr = new Streplace();
sr.GenerateDocAndTags();
sr.NaiveString();
sr.NaiveStringBuilder();
sr.StringSplit();
sr.StringCrawl();
#else
var summary = BenchmarkRunner.Run<Streplace>();
#endif
}
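One more horse you could add to the race is a single Regex pass with a match evaluator. This is only a sketch of another [Benchmark] method to drop into the Streplace class above; it assumes the same _doc and _tags fields and the [TAGx] naming used in GenerateDocAndTags:
```csharp
using System.Text.RegularExpressions;

[Benchmark]
public void RegexReplace()
{
    // Walk the document once; every [TAGx] match is swapped for its
    // replacement via the evaluator, so no intermediate full-document copies.
    var s = Regex.Replace(_doc, @"\[(TAG\d+)\]", m => _tags[m.Groups[1].Value]);
}
```
Whether it beats the Split/Crawl approaches is exactly the kind of thing the benchmark will tell you.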
I was testing a .NET application on a RaspberryPi and whereas each iteration of that program took 500 milliseconds on a Windows laptop, the same took 5 seconds on a RaspberryPi. After some debugging, I found that majority of that time was being spent on a foreach loop concatenating strings.
Edit 1: To clarify, the 500 ms and 5 s times I mentioned are for the entire loop. I placed a timer before the loop and stopped it after the loop had finished. And the number of iterations is the same in both: 1000.
Edit 2: To time the loop, I used the answer mentioned here.
private static string ComposeRegs(List<list_of_bytes> registers)
{
string ret = string.Empty;
foreach (list_of_bytes register in registers)
{
ret += Convert.ToString(register.RegisterValue) + ",";
}
return ret;
}
Out of the blue I replaced the foreach with a for loop, and suddenly it started taking almost the same time as it did on the laptop: 500 to 600 milliseconds.
private static string ComposeRegs(List<list_of_bytes> registers)
{
string ret = string.Empty;
for (UInt16 i = 0; i < 1000; i++)
{
ret += Convert.ToString(registers[i].RegisterValue) + ",";
}
return ret;
}
Should I always use for loops instead of foreach? Or was this just a scenario in which a for loop is way faster than a foreach loop?
The actual problem is concatenating strings, not any difference between for and foreach. The reported timings are excruciatingly slow even on a Raspberry Pi. 1000 items is so little data that it fits in either machine's CPU cache. An RPi has a 1+ GHz CPU, which means each concatenation takes at least 1000 cycles.
The problem is the concatenation. Strings are immutable. Modifying or concatenating strings creates a new string. Your loops created 2000 temporary objects that need to be garbage collected. That process is expensive. Use a StringBuilder instead, preferably with a capacity roughly equal to the size of the expected string.
[Benchmark]
public string StringBuilder()
{
var sb = new StringBuilder(registers.Count * 3);
foreach (list_of_bytes register in registers)
{
sb.AppendFormat("{0}",register.RegisterValue);
}
return sb.ToString();
}
Simply measuring a single execution, or even averaging 10 executions, won't produce valid numbers. It's quite possible the GC ran to collect those 2000 objects during one of the tests. It's also quite possible that one of the tests was delayed by JIT compilation or any number of other reasons. A test should run long enough to produce stable numbers.
The de facto standard for .NET benchmarking is BenchmarkDotNet. That library will run each benchmark long enough to eliminate startup and cooldown effects and to account for memory allocations and GC collections. You'll see not only how long each test takes, but also how much RAM it uses and how many GCs it causes.
To actually measure your code, try this benchmark using BenchmarkDotNet:
[MemoryDiagnoser]
[MarkdownExporterAttribute.StackOverflow]
public class ConcatTest
{
private readonly List<list_of_bytes> registers;
public ConcatTest()
{
registers = Enumerable.Range(0,1000).Select(i=>new list_of_bytes(i)).ToList();
}
[Benchmark]
public string StringBuilder()
{
var sb = new StringBuilder(registers.Count*3);
foreach (var register in registers)
{
sb.AppendFormat("{0}",register.RegisterValue);
}
return sb.ToString();
}
[Benchmark]
public string ForEach()
{
string ret = string.Empty;
foreach (list_of_bytes register in registers)
{
ret += Convert.ToString(register.RegisterValue) + ",";
}
return ret;
}
[Benchmark]
public string For()
{
string ret = string.Empty;
for (UInt16 i = 0; i < registers.Count; i++)
{
ret += Convert.ToString(registers[i].RegisterValue) + ",";
}
return ret;
}
}
The tests are run by calling BenchmarkRunner.Run<ConcatTest>()
using System.Text;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System.Linq;
public class Program
{
public static void Main(string[] args)
{
var summary = BenchmarkRunner.Run<ConcatTest>();
Console.WriteLine(summary);
}
}
Results
Running this on a MacBook produced the following results. Note that BenchmarkDotNet produces results ready to paste into Stack Overflow, and the runtime information is included in the results:
BenchmarkDotNet=v0.13.1, OS=macOS Big Sur 11.5.2 (20G95) [Darwin 20.6.0]
Intel Core i7-8750H CPU 2.20GHz (Coffee Lake), 1 CPU, 12 logical and 6 physical cores
.NET SDK=6.0.100
[Host] : .NET 6.0.0 (6.0.21.52210), X64 RyuJIT
DefaultJob : .NET 6.0.0 (6.0.21.52210), X64 RyuJIT
Method | Mean | Error | StdDev | Gen 0 | Gen 1 | Allocated |
-------------- |----------:|---------:|---------:|---------:|--------:|----------:|
StringBuilder | 34.56 μs | 0.682 μs | 0.729 μs | 7.5684 | 0.3052 | 35 KB |
ForEach | 278.36 μs | 5.509 μs | 5.894 μs | 818.8477 | 24.4141 | 3,763 KB |
For | 268.72 μs | 3.611 μs | 3.015 μs | 818.8477 | 24.4141 | 3,763 KB |
Both For and ForEach took almost 10 times longer than StringBuilder and used 100 times as much RAM.
If a string changes repeatedly, like in your example, then using a StringBuilder is the better option and could help with the issue you're dealing with.
Modifying any string object results in the creation of a new string object, which makes heavy string manipulation costly. When you need repetitive operations on a string, that is where StringBuilder comes in: it provides an optimized way to deal with repetitive and multiple string manipulation operations. It represents a mutable string of characters, i.e. a string that can be changed. String objects are immutable, but StringBuilder is a mutable string type: it does not create a new modified instance of the current string object, it applies the modifications to the existing object.
So instead of creating many temporary objects that need to be garbage collected and meanwhile take a lot of memory, just use StringBuilder.
More about StringBuilder - https://learn.microsoft.com/en-us/dotnet/api/system.text.stringbuilder?view=net-6.0
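For a concrete example, the ComposeRegs method from the question could be rewritten with a StringBuilder roughly like this (a sketch; it assumes the list_of_bytes type with its RegisterValue property from the question):
```csharp
private static string ComposeRegs(List<list_of_bytes> registers)
{
    // Pre-sizing the builder avoids repeated growth of its internal buffer;
    // the multiplier is only a rough guess at the formatted length per value.
    var builder = new StringBuilder(registers.Count * 4);
    foreach (list_of_bytes register in registers)
    {
        builder.Append(register.RegisterValue).Append(',');
    }
    return builder.ToString();
}
```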
I have many tasks. Each task is defined by the first day I can start working on it and the last day it is still valid to do; each task takes exactly one day, no more, and I can do one task per day.
The tasks and their deadlines are described in the table below.
| task | valid from | valid until |
|------|------------|-------------|
| t01 | 1 | 3 |
| t02 | 2 | 2 |
| t03 | 1 | 1 |
| t04 | 2 | 3 |
| t05 | 2 | 3 |
The number of tasks may be huge.
I want to know which algorithm I can use to solve this problem to maximize the number of tasks that I can do.
Update
Based on the comments, I wrote this code. It works, but it still doesn't perform well with a huge number of tasks.
public static int countTodoTasks(int[] validFrom, int[] validUntil)
{
var tasks = new List<TaskTodo>();
for (int i = 0; i < validFrom.Length; i++)
{
tasks.Add(new TaskTodo { ValidFrom = validFrom[i], ValidUntil = validUntil[i] });
}
tasks = tasks.OrderBy(x => x.ValidUntil).ToList();
var lDay = 0;
var schedule = new Dictionary<int, TaskTodo>();
while (tasks.Count > 0)
{
lDay = findBiggestMinimumOf(lDay, tasks[0].ValidFrom, tasks[0].ValidUntil);
if (lDay != -1)
{
schedule[lDay] = tasks[0];
}
tasks.RemoveAt(0);
tasks.RemoveAll(x => lDay >= x.ValidUntil);
}
return schedule.Count;
}
static int findBiggestMinimumOf(int x, int start, int end)
{
if (start > x)
{
return start;
}
if ((x == start && start == end) || x == end || x > end)
{
return -1;
}
return x + 1;
}
If the tasks have the same duration, then use a greedy algorithm as described above.
If it's too slow, use indexes (= hashing) and incremental calculation to speed it up if you need to scale out.
Indexing: during setup, iterate through all tasks to create a map (= dictionary) that maps each due date to a list of tasks. Better yet, use a NavigableMap (TreeMap), so you can ask for a tail iterator (all tasks starting from a specific due date, in order). The greedy algorithm can then use that to scale better (think a better big-O).
Incremental calculation: only calculate the deltas for each task you're considering.
If the tasks have different duration, a greedy algorithm (aka construction heuristic) won't give you the optimal solution. Then it's NP-hard. After the Construction Heuristic (= greedy algorithm), run a Local Search (such as Tabu Search). Libraries such as OptaPlanner (Java, not C# unfortunately - look for alternatives there) can do both for you.
Also note there are multiple greedy algorithms (First Fit, First Fit Decreasing, ...).
I suppose you can apply a greedy algorithm for your purpose in this way:
1. Select the minimal "valid from", minday.
2. Add to Xcandidates all candidates with "valid from" = minday.
3. If there are no Xcandidates, go to 1.
4. Select the interval x from Xcandidates with the earliest "valid until".
5. Remove x, inserting it into your schedule.
6. Remove all Xcandidates with "valid until" = minday.
7. Increment minday and go to 2.
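A minimal C# sketch of that greedy idea (names are mine, unit-length tasks as in the question), using .NET 6's PriorityQueue so the candidate with the earliest "valid until" is always picked first:
```csharp
using System.Collections.Generic;
using System.Linq;

public static class GreedyScheduler
{
    public static int CountTodoTasks(int[] validFrom, int[] validUntil)
    {
        // Sort tasks by the first day they become available.
        var tasks = validFrom
            .Zip(validUntil, (start, end) => (From: start, Until: end))
            .OrderBy(t => t.From)
            .ToArray();

        // Min-heap of deadlines for the tasks that are currently available.
        var candidates = new PriorityQueue<int, int>();
        int done = 0, next = 0;
        int day = tasks.Length > 0 ? tasks[0].From : 0;

        while (next < tasks.Length || candidates.Count > 0)
        {
            // Everything that has become available by this day joins the candidates.
            while (next < tasks.Length && tasks[next].From <= day)
            {
                candidates.Enqueue(tasks[next].Until, tasks[next].Until);
                next++;
            }

            // Drop candidates whose deadline has already passed.
            while (candidates.Count > 0 && candidates.Peek() < day)
            {
                candidates.Dequeue();
            }

            if (candidates.Count > 0)
            {
                // Do the task with the earliest deadline today.
                candidates.Dequeue();
                done++;
                day++;
            }
            else if (next < tasks.Length)
            {
                // Nothing doable today; jump ahead to the next availability day.
                day = tasks[next].From;
            }
        }

        return done;
    }
}
```
Sorting once and keeping the candidates in a heap replaces the repeated RemoveAll scans in the code above, which is where the poor scaling on large inputs comes from.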
Tracking down a performance problem (micro, I know) I ended up with this test program. Compiled against Framework 4.5 in Release mode, it takes around 10 ms on my machine.
What bothers me is that if I remove this line
public int[] value1 = new int[80];
the time gets closer to 2 ms. It seems like there is some memory fragmentation problem, but I cannot explain why. I have tested the program with .NET Core 2.0 with the same results. Can anyone explain this behaviour?
using System;
using System.Collections.Generic;
using System.Diagnostics;
namespace ConsoleApp4
{
public class MyObject
{
public int value = 1;
public int[] value1 = new int[80];
}
class Program
{
static void Main(string[] args)
{
var list = new List<MyObject>();
for (int i = 0; i < 500000; i++)
{
list.Add(new MyObject());
}
long total = 0;
for (int i = 0; i < 200; i++)
{
int counter = 0;
Stopwatch timer = Stopwatch.StartNew();
foreach (var obj in list)
{
if (obj.value == 1)
counter++;
}
timer.Stop();
total += timer.ElapsedMilliseconds;
}
Console.WriteLine(total / 200);
Console.ReadKey();
}
}
}
UPDATE:
After some research I came to the conclusion that it's just processor cache access time. Using the VS profiler, the cache misses seem to be a lot higher with the array than without it (profiler screenshots omitted).
There are several implications involved.
When you have the line public int[] value1 = new int[80];, you have one extra allocation of memory: a new array is created on the heap to accommodate 80 integers (320 bytes) plus the overhead of the array object. You do 500,000 of these allocations.
These allocations total more than 160 MB of RAM, which may cause the GC to kick in and see whether there is memory to be released.
Further, when you allocate so much memory, it is likely that some of the objects from the list are not retained in the CPU cache. When you later enumerate your collection, the CPU may need to read the data from RAM, not from cache, which will induce a serious performance penalty.
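If you want to see that extra allocation directly, here is a quick probe (only a sketch: it relies on GC.GetAllocatedBytesForCurrentThread, which needs a reasonably recent runtime, and the exact byte counts vary with runtime and bitness):
```csharp
using System;

public class MyObject
{
    public int value = 1;
    public int[] value1 = new int[80];
}

public static class AllocationProbe
{
    public static void Main()
    {
        // One throwaway instance so JIT/type-initialization noise is out of the way.
        GC.KeepAlive(new MyObject());

        long before = GC.GetAllocatedBytesForCurrentThread();
        var obj = new MyObject();
        long after = GC.GetAllocatedBytesForCurrentThread();

        // Expect a few hundred bytes: 320 bytes of int data plus the object
        // and array headers.
        Console.WriteLine($"Bytes allocated per instance: {after - before}");
        GC.KeepAlive(obj);
    }
}
```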
I'm not able to reproduce a big difference between the two and I wouldn't expect it either. Below are the results I get on .NET Core 2.2.
Instances of MyObject will be allocated on the heap. In one case, you have an int and a reference to the int array. In the other you have just the int. In both cases, you need to do the additional work of following the reference from the list. That is the same in both cases and the compiled code shows this.
Branch prediction will affect how fast this runs, but since you're branching on the same condition every time I wouldn't expect this to change from run to run (unless you change the data).
BenchmarkDotNet=v0.11.3, OS=Windows 10.0.17134.376 (1803/April2018Update/Redstone4)
Intel Core i7-8650U CPU 1.90GHz (Kaby Lake R), 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=2.2.200-preview-009648
[Host] : .NET Core 2.2.0 (CoreCLR 4.6.27110.04, CoreFX 4.6.27110.04), 64bit RyuJIT
DefaultJob : .NET Core 2.2.0 (CoreCLR 4.6.27110.04, CoreFX 4.6.27110.04), 64bit RyuJIT
Method | size | Mean | Error | StdDev | Ratio |
------------- |------- |---------:|----------:|----------:|------:|
WithArray | 500000 | 8.167 ms | 0.0495 ms | 0.0463 ms | 1.00 |
WithoutArray | 500000 | 8.167 ms | 0.0454 ms | 0.0424 ms | 1.00 |
For reference:
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System.Collections.Generic;
namespace CoreSandbox
{
[DisassemblyDiagnoser(printAsm: true, printSource: false, printPrologAndEpilog: true, printIL: false, recursiveDepth: 1)]
//[MemoryDiagnoser]
public class Test
{
private List<MyObject> dataWithArray;
private List<MyObjectLight> dataWithoutArray;
[Params(500_000)]
public int size;
public class MyObject
{
public int value = 1;
public int[] value1 = new int[80];
}
public class MyObjectLight
{
public int value = 1;
}
static void Main(string[] args)
{
var summary = BenchmarkRunner.Run<Test>();
}
[GlobalSetup]
public void Setup()
{
dataWithArray = new List<MyObject>(size);
dataWithoutArray = new List<MyObjectLight>(size);
for (var i = 0; i < size; i++)
{
dataWithArray.Add(new MyObject());
dataWithoutArray.Add(new MyObjectLight());
}
}
[Benchmark(Baseline = true)]
public int WithArray()
{
var counter = 0;
foreach(var obj in dataWithArray)
{
if (obj.value == 1)
counter++;
}
return counter;
}
[Benchmark]
public int WithoutArray()
{
var counter = 0;
foreach (var obj in dataWithoutArray)
{
if (obj.value == 1)
counter++;
}
return counter;
}
}
}
Related brief info:
AFAIK, the concurrent stack, queue, and bag classes are implemented internally with linked lists.
And I know that there is much less contention because each thread is responsible for its own linked list.
Anyway, my question is about ConcurrentDictionary<,>.
But I was testing this code (single thread):
Stopwatch sw = new Stopwatch();
sw.Start();
var d = new ConcurrentDictionary<int, int>();
for(int i = 0; i < 1000000; i++) d[i] = 123;
for(int i = 1000000; i < 2000000; i++) d[i] = 123;
for(int i = 2000000; i < 3000000; i++) d[i] = 123;
Console.WriteLine("baseline = " + sw.Elapsed);
sw.Restart();
var d2 = new Dictionary<int, int>();
for(int i = 0; i < 1000000; i++) lock (d2) d2[i] = 123;
for(int i = 1000000; i < 2000000; i++) lock (d2) d2[i] = 123;
for(int i = 2000000; i < 3000000; i++) lock (d2) d2[i] = 123;
Console.WriteLine("baseline = " + sw.Elapsed);
sw.Stop();
Result (tested many times, same values (+/-)):
baseline = 00:00:01.2604656
baseline = 00:00:00.3229741
Question:
What makes ConcurrentDictionary<,> much slower in a single threaded environment ?
My first instinct is that lock(){} will always be slower, but apparently it is not.
Well, ConcurrentDictionary is allowing for the possibility that it can be used by multiple threads. It seems entirely reasonable to me that that requires more internal housekeeping than something which assumes it can get away without worrying about access from multiple threads. I'd have been very surprised if it had worked out the other way round - if the safer version were always faster too, why would you ever use the less safe version?
The most likely reason is that ConcurrentDictionary simply has more overhead than Dictionary for the same operation. This is demonstrably true if you dig into the sources:
It uses a lock for the indexer
It uses volatile writes
It has to do atomic writes of values which are not guaranteed to be atomic in .Net
It has extra branches in the core add routine (whether to take a lock, do atomic write)
All of these costs are incurred irrespective of the number of threads it's being used on. These costs may be individually small, but they aren't free and do add up over time.
Update for .NET 5: I'll leave the previous answer up as it is still relevant for older runtimes but .NET 5 appears to have further improved ConcurrentDictionary to the point where reads via TryGetValue() are actually faster than even the normal Dictionary, as seen in the results below (COW is my CopyOnWriteDictionary, detailed below). Make what you will of this :)
| Method | Mean | Error | StdDev | Gen 0 | Gen 1 | Gen 2 | Allocated |
|---------------- |------------:|----------:|----------:|---------:|---------:|---------:|----------:|
| ConcurrentWrite | 1,372.32 us | 12.752 us | 11.304 us | 226.5625 | 89.8438 | 44.9219 | 1398736 B |
| COWWrite | 1,077.39 us | 21.435 us | 31.419 us | 56.6406 | 19.5313 | 11.7188 | 868629 B |
| DictWrite | 347.19 us | 5.875 us | 5.208 us | 124.5117 | 124.5117 | 124.5117 | 673064 B |
| ConcurrentRead | 63.53 us | 0.486 us | 0.431 us | - | - | - | - |
| COWRead | 81.55 us | 0.908 us | 0.805 us | - | - | - | - |
| DictRead | 70.71 us | 0.471 us | 0.393 us | - | - | - | - |
Previous answer, still relevant for < .NET 5:
The latest versions of ConcurrentDictionary have improved significantly since I originally posted this answer. It no longer locks on reads and thus offers almost the same performance profile as my CopyOnWriteDictionary implementation, with more features, so I recommend you use ConcurrentDictionary instead in most cases. It still has 20-30% more overhead than Dictionary or CopyOnWriteDictionary, though, so performance-sensitive applications may still benefit from CopyOnWriteDictionary.
You can read about my lock-free thread-safe copy-on-write dictionary implementation here:
http://www.singulink.com/CodeIndex/post/fastest-thread-safe-lock-free-dictionary
It's currently append-only (with the ability to replace values) as it is intended for use as a permanent cache. If you need removal then I suggest using ConcurrentDictionary since adding that into CopyOnWriteDictionary would eliminate all performance gains due to the added locking.
CopyOnWriteDictionary is very fast for quick bursts of writes and lookups usually run at almost standard Dictionary speed without locking. If you write occasionally and read often, this is the fastest option available.
My implementation provides maximum read performance by removing the need for any read locks under normal circumstances while updates aren't being made to the dictionary. The trade-off is that the dictionary needs to be copied and swapped after updates are applied (which is done on a background thread) but if you don't write often or you only write once during initialization then the trade-off is definitely worth it.
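The general copy-on-write mechanism described here can be sketched as follows (my own minimal illustration, not the author's CopyOnWriteDictionary): reads go straight to the current snapshot without locking, while writers clone the snapshot and atomically publish the new one.
```csharp
using System.Collections.Generic;
using System.Threading;

public class CopyOnWriteMap<TKey, TValue> where TKey : notnull
{
    private readonly object _writeLock = new object();
    private Dictionary<TKey, TValue> _snapshot = new Dictionary<TKey, TValue>();

    // Lock-free read against whatever snapshot is currently published.
    public bool TryGetValue(TKey key, out TValue value)
        => Volatile.Read(ref _snapshot).TryGetValue(key, out value);

    // Writers serialize among themselves but never block readers.
    public void Set(TKey key, TValue value)
    {
        lock (_writeLock)
        {
            var copy = new Dictionary<TKey, TValue>(_snapshot) { [key] = value };
            Volatile.Write(ref _snapshot, copy);
        }
    }
}
```
The author's implementation additionally defers the copy-and-swap to a background thread to absorb bursts of writes; the sketch above only shows the publish-a-new-snapshot idea.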
ConcurrentDictionary vs. Dictionary
In general, use a System.Collections.Concurrent.ConcurrentDictionary in any scenario where you are adding and updating keys or values concurrently from multiple threads. In scenarios that involve frequent updates and relatively few reads, the ConcurrentDictionary generally offers modest benefits. In scenarios that involve many reads and many updates, the ConcurrentDictionary generally is significantly faster on computers that have any number of cores.
In scenarios that involve frequent updates, you can increase the degree of concurrency in the ConcurrentDictionary and then measure to see whether performance increases on computers that have more cores. If you change the concurrency level, avoid global operations as much as possible.
If you are only reading key or values, the Dictionary is faster because no synchronization is required if the dictionary is not being modified by any threads.
Link: https://msdn.microsoft.com/en-us/library/dd997373%28v=vs.110%29.aspx
The ConcurrentDictionary<> creates an internal set of locking objects at creation (this is determined by the concurrencyLevel, amongst other factors) - this set of locking objects is used to control access to the internal bucket structures in a series of fine-grained locks.
In a single threaded scenario, there would be no need for the locks, so the extra overhead of acquiring and releasing these locks is probably the source of the difference you're seeing.
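If you want to test how much of the difference comes from that lock set, one easy experiment (just a sketch, not a recommendation) is the constructor overload that takes an explicit concurrencyLevel and an initial capacity, which also avoids resizing during the loop:
```csharp
using System;
using System.Collections.Concurrent;
using System.Diagnostics;

class ConcurrencyLevelProbe
{
    static void Main()
    {
        // concurrencyLevel: 1 keeps the internal lock set minimal;
        // capacity: 3_000_000 avoids growing the bucket table mid-loop.
        var d = new ConcurrentDictionary<int, int>(concurrencyLevel: 1, capacity: 3_000_000);

        var sw = Stopwatch.StartNew();
        for (int i = 0; i < 3_000_000; i++) d[i] = 123;
        sw.Stop();

        Console.WriteLine($"ConcurrentDictionary(concurrencyLevel: 1) = {sw.Elapsed}");
    }
}
```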
There is no point in using ConcurrentDictionary on one thread, or in synchronizing access if everything is done on a single thread. Of course Dictionary will beat ConcurrentDictionary.
Much depends on the usage pattern and the number of threads. Here is a test that shows ConcurrentDictionary overtaking Dictionary-with-a-lock as the thread count increases.
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Diagnostics;
using System.Threading;
namespace ConsoleApp
{
class Program
{
static void Main(string[] args)
{
Run(1, 100000, 10);
Run(10, 100000, 10);
Run(100, 100000, 10);
Run(1000, 100000, 10);
Console.ReadKey();
}
static void Run(int threads, int count, int cycles)
{
Console.WriteLine("");
Console.WriteLine($"Threads: {threads}, items: {count}, cycles:{cycles}");
var semaphore = new SemaphoreSlim(0, threads);
var concurrentDictionary = new ConcurrentDictionary<int, string>();
for (int i = 0; i < threads; i++)
{
Thread t = new Thread(() => Run(concurrentDictionary, count, cycles, semaphore));
t.Start();
}
Thread.Sleep(1000);
var w = Stopwatch.StartNew();
semaphore.Release(threads);
for (int i = 0; i < threads; i++)
semaphore.Wait();
Console.WriteLine($"ConcurrentDictionary: {w.Elapsed}");
var dictionary = new Dictionary<int, string>();
for (int i = 0; i < threads; i++)
{
Thread t = new Thread(() => Run(dictionary, count, cycles, semaphore));
t.Start();
}
Thread.Sleep(1000);
w.Restart();
semaphore.Release(threads);
for (int i = 0; i < threads; i++)
semaphore.Wait();
Console.WriteLine($"Dictionary: {w.Elapsed}");
}
static void Run(ConcurrentDictionary<int, string> dic, int elements, int cycles, SemaphoreSlim semaphore)
{
semaphore.Wait();
try
{
for (int i = 0; i < cycles; i++)
for (int j = 0; j < elements; j++)
{
var x = dic.GetOrAdd(i, x => x.ToString());
}
}
finally
{
semaphore.Release();
}
}
static void Run(Dictionary<int, string> dic, int elements, int cycles, SemaphoreSlim semaphore)
{
semaphore.Wait();
try
{
for (int i = 0; i < cycles; i++)
for (int j = 0; j < elements; j++)
lock (dic)
{
if (!dic.TryGetValue(i, out string value))
dic[i] = i.ToString();
}
}
finally
{
semaphore.Release();
}
}
}
}
Threads: 1, items: 100000, cycles:10
ConcurrentDictionary: 00:00:00.0000499
Dictionary: 00:00:00.0000137
Threads: 10, items: 100000, cycles:10
ConcurrentDictionary: 00:00:00.0497413
Dictionary: 00:00:00.2638265
Threads: 100, items: 100000, cycles:10
ConcurrentDictionary: 00:00:00.2408781
Dictionary: 00:00:02.2257736
Threads: 1000, items: 100000, cycles:10
ConcurrentDictionary: 00:00:01.8196668
Dictionary: 00:00:25.5717232
What makes ConcurrentDictionary<,> much slower in a single threaded environment?
The overhead of the machinery required to make it much faster in multi-threaded environments.
My first instinct is that lock(){} will be always slower. but apparently it is not.
A lock is very cheap when uncontested. You can lock a million times per second and your CPU won't even notice, provided that you are doing it from a single thread. What kills performance in multi-threaded programs is contention for locks. When multiple threads are competing fiercely for the same lock, almost all of them have to wait for the lucky one that holds the lock to release it. This is where the ConcurrentDictionary, with its granular locking implementation, shines. And the more concurrency you have (the more processors/cores), the more it shines.
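As a rough illustration of that point (a throwaway single-threaded loop, not a proper benchmark), you can time a million uncontested lock/unlock pairs yourself:
```csharp
using System;
using System.Diagnostics;

class UncontestedLockDemo
{
    private static readonly object Gate = new object();

    static void Main()
    {
        int counter = 0;
        var sw = Stopwatch.StartNew();
        for (int i = 0; i < 1_000_000; i++)
        {
            lock (Gate) { counter++; } // never contended: only one thread exists
        }
        sw.Stop();
        Console.WriteLine($"{counter} locked increments took {sw.ElapsedMilliseconds} ms");
    }
}
```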
In .NET 4, ConcurrentDictionary used very poor lock management and contention resolution, which made it extremely slow. A Dictionary with custom locking, or even test-and-set usage to copy-on-write the whole dictionary, was faster.
Your test is wrong: you must stop the Stopwatch first!
Stopwatch sw = new Stopwatch();
sw.Start();
var d = new ConcurrentDictionary<int, int>();
for (int i = 0; i < 1000000; i++) d[i] = 123;
for (int i = 1000000; i < 2000000; i++) d[i] = 123;
for (int i = 2000000; i < 3000000; i++) d[i] = 123;
sw.Stop();
Console.WriteLine("baseline = " + sw.Elapsed);
sw.Restart();
var d2 = new Dictionary<int, int>();
for (int i = 0; i < 1000000; i++) lock (d2) d2[i] = 123;
for (int i = 1000000; i < 2000000; i++) lock (d2) d2[i] = 123;
for (int i = 2000000; i < 3000000; i++) lock (d2) d2[i] = 123;
sw.Stop();
Console.WriteLine("baseline = " + sw.Elapsed);
Output: