I'm running a simulation for certain number of iterations. After each iteration results are sent do the server and some time later server sends update data that is used in next iteration.
Code is (very roughly) looking like this:
for(var iteration=0;iteration<100;iteration++)
{
Parallel.For(0,50,i=>DoExpensiveWork(i));
SendResultsToServer(); //worker sends iteration results to server
//server gathers results from all workers,
//calculates update, then sends it back to workers
WaitForDataForNextIteration();
}
DoExpensiveWork is using ThreadLocal caches (mostly just big arrays that are reused every iteration to avoid GC work).
ThreadLocal<MyCache> _cache;
private void DoExpensiveWork(int i)
{
var cache = _cache.Value;
DoSomeWorkUsingCache(cache);
}
Here are problems I've encountered:
Over time _cache grows slowly (that is _cache.Values.Count) increases to value much higher than number of cores on my machine (I guess this means old threads are sometimes destroyed by ThreadPool and next iterations uses newly created threads leading to creation of additional entries in _cache.Values).
This leads to a 'memory leak' as old caches are still kept in memory, while being inaccessible.
When server takes lot of time to send update then _cache grows on every iteration (on 32 core machine (2x16) _cache.Values grows like 32,64,96 etc - adding 32 new entries every iteration). This only happens when server takes ~20 seconds to send update (if it takes 15 seconds or less, this does not happen).
Does this mean ThreadPool is tearing down threads when worker is stuck waiting for update?
How can I prevent this?
Setting min/max threads to 32 does not help
ThreadPool.SetMinThreads(32,1000);
ThreadPool.SetMaxThreads(32,1000);
What I want basically is to stop ThreadLocal cache creating new entries, so it does not grow above 32 in my case.
Edit: from MSDN: When demand is low, the actual number of thread pool threads can fall below the minimum values. This might explain why more and more ThreadLocal copies are created. Any workaround for this?
Related
I'm currently dealing with a queue that has a couple thousand entries in it. To save on RAM usage I'm currently using the TrimExcess() method built in to the queue datatype by Microsoft. As mentioned in the documentation
, this method is inefficient when it comes to large lists and results in a significant time loss whenever it is called.
Is there a more efficient way to remove items from a queue that actually deletes them in RAM aswell?
Edit: to clarify there are still elements in the queue, I just want to remove the elements that have been dequeued from RAM
The answer to your question is "Don't worry about that, and it's very likely that you should not do that." This answer is an elaboration of the comments from #madreflection and myself.
The Queue<T> class, like other collection classes uses an array to hold the items it collects. Like other collection classes, if the array runs out of slots, the Queue class creates a new, larger array and copies the data from old array.
I haven't looked at the Queue<T> source, but using the debugger, I can see that this array is the _array member of the class. It starts with an array of 0 length. When you enqueue one item, it gets replaced by an array of length 4. After that, the array doubles in size whenever it needs more space.
You say your queue "has a couple thousand entries in it". I'm going to use 2000 in this analysis as a rough guess.
As you enqueue more and more net entries into the queue, that array doubling will happen several times:
At First
After 5
After 9
After 17
After 33
4 Entries
Double to 8
Double to 16
Double to 32
Double to 64
It will keep doing this until it's doubled (and copied the array contents) 10 times - bringing it to 2048. At that point, you will have allocated 10 arrays, nine of which are garbage, and done about 3000 queued element copies.
Now think about it. I'm guessing you are enqueue reference type objects. A reference type object is represented by an object reference (in effect a pointer). If you have 2000 instances in a queue that will represent 8kb on a 32-bit machine (plus some one-time overhead for the members of the queue class). On a 64-bit machine, it's 16kb. That's nothing for a modern computer. The .NET garbage collector has two strategies for managing memory, a normal one and one for large objects. The boundary is 85kb; your queue will never be a large object
If you are enqueuing large value types, then more memory is needed (since the value type objects will be copied into the array elements that make up the queue entries). You'd need to be using very large value type objects before your queue becomes a large object.
The other thing that will happen is that as your queue grows in size, it will settle into the Garbage Collector's Gen2 memory area. Gen2 collections are expensive, but once an object becomes stable in Gen2, it doesn't bother the garbage collector at all.
But, think about what happens if you reduce your queue size way down to, say 100 entries and call TrimExcess. At that point, yet another new array will be created (this time, much smaller) and the entries in your queue will be copied to that new queue (that's what the notes in the Remarks section of the TrimExcess documentation is referring to when it talks about The cost of reallocating and copying a large Queue<T>). If your queue starts growing again, you will start doubling/copying that array over and over again - spinning off more garbage and spinning your wheels doing the copying.
A better strategy is to look at your estimated queue size, inflate it a bit, and pre-allocate the space for all of those entries at construction time. If you expect to have 2000 entries, allocate space for 2500 in the constructor:
var myQueue = new Queue<SomeType>(2500);
Now, you do one allocation, there should be no reallocation or array copying, and your memory will quickly migrate to Gen2, but will never be touched by the GC.
I need an iteration of a parallel for loop to use 7 cores(or stay away from 1 core) but another iteration to use 8(all) cores and tried below code:
Parallel.For(0,2,i=>{
if(i=0)
Process.GetCurrentProcess().ProcessorAffinity = (System.IntPtr)(255);
if(i==1)
Process.GetCurrentProcess().ProcessorAffinity = (System.IntPtr)(254);
Thread.Sleep(25);// to make sure both set their affinities
Console.WriteLine(Process.GetCurrentProcess().ProcessorAffinity);
});
this outputs 255 for both iterations. So either parallel.for loop is using single thread for them or one setting sets other iterations affinity too. Another problem is, this is from a latency sensitive application and all this affinity settings add 1 to 15 milliseconds latency.
Do I have to use threads explicitly and should I set affinities only once?
Edit: I tried threaded version, same thing happens. Even with explicit two threads, both writing 255 to console. Now it seems this command is for a process not a thread.
OpenCL context is using max cores for kernel execution on cpu in one iteration. Other iterations using 1-2 cores to copy buffers and send command to devices. When cpu is used by opencl, it uses all cores and devices cannot get enough time to copy buffers. Device fission seems to be harder than solving this issue I hink.
Different thread affinities in Parallel.For Iterations
Question is misleading, as it is based on assumption that Parallel API means multiple threads. Parallel API does refer to Data Parallel processing, but doesn't provide any guarantee for invoking multiple threads, especially for the code provided above, where there's hardly any work per thread.
For the Parallel API, you can set the Max degree of Parallelism, as follows:
ParallelOptions parallelOption = new ParallelOptions();
parallelOption.MaxDegreeOfParallelism = Environment.ProcessorCount;
Parallel.For(0, 20, parallelOption, i =>
But that never guarantee the number of threads that would be invoked to parallel processing, since Threads are used from the ThreadPool and CLR decides at run-time, based on amount of work to be processed, whether more than one thread is required for the processing.
In the same Parallel loop can you try the following, print Thread.Current.ManageThreadId, this would provide a clear idea, regarding the number of threads being invoked in the Parallel loop.
Do I have to use threads explicitly and should I set affinities only once?
Edit: I tried threaded version, same thing happens. Even with explicit two threads, both writing 255 to console. Now it seems this command is for a process not a thread.
Can you post the code, for multiple threads, can you try something like this.
Thread[] threadArray = new Thread[2];
threadArray[0] = new Thread(<ThreadDelegate>);
threadArray[1] = new Thread(<ThreadDelegate>);
threadArray[0]. ProcessorAffinity = <Set Processor Affinity>
threadArray[1]. ProcessorAffinity = <Set Processor Affinity>
Assuming you assign the affinity correctly, you can print them and find different values, check the following ProcessThread.ProcessorAffinity.
On another note as you could see in the link above, you can set the value in hexadecimal based on processor affinity, not sure what does values 254, 255 denotes , do you really have server with that many processors.
EDIT:
Try the following edit to your program, (based on the fact that two Thread ids are getting printed), now by the time both threads some in the picture, they both get same value of variable i, they need a local variable to avoid closure issue
Parallel.For(0,2,i=>{
int local = i;
if(local=0)
Process.GetCurrentProcess().ProcessorAffinity = (System.IntPtr)(255);
if(local==1)
Process.GetCurrentProcess().ProcessorAffinity = (System.IntPtr)(254);
Thread.Sleep(25);// to make sure both set their affinities
Console.WriteLine(Process.GetCurrentProcess().ProcessorAffinity);
});
EDIT 2: (Would mostly not work as both threads might increment, before actual logic execution)
int local = -1;
Parallel.For(0,2,i=>{
Interlocked.Increment(ref local);
if(local=0)
Process.GetCurrentProcess().ProcessorAffinity = (System.IntPtr)(255);
if(local==1)
Process.GetCurrentProcess().ProcessorAffinity = (System.IntPtr)(254);
Thread.Sleep(25);// to make sure both set their affinities
Console.WriteLine(Process.GetCurrentProcess().ProcessorAffinity);
});
I have the following issue :
I am using a parallel.foreach iteration for a pretty CPU intensive workload (applying a method on a number of items) & it works fine for about the first 80% of the items - using all cpu cores very nice.
As the iteration seems to come near to the end (around 80% i would say) i see that the number of threads begins to go down core by core, & at the end the last around 5% of the items are proceesed only by two cores. So insted to use all cores untill the end, it slows down pretty hard toward the end of the iteration.
Please note the the workload can be per item very different. One can last 1-2 seconds, the other item can take 2-3 minutes to finish.
Any ideea, suggestion is very welcome.
Code used:
var source = myList.ToArray();
var rangePartitioner = Partitioner.Create(0, source.Lenght);
using (SqlConnection connection =new SqlConnection(cnStr))
{
connection.Open();
try
(
Parallel.ForEach(rangePartitioner, (range, loopState) =>
{
for(int i = range.Item1; i<range.Item2; i++)
{
CPUIntensiveMethod(source[i]);
}
});
}
catch(AggretateException ae)
{ //Exception cachting}
}
This is an unavoidable consequence of the fact the parallelism is per computation. It is clear that the whole parallel batch cannot run any quicker than the time taken by the slowest single item in the work-set.
Imagine a batch of 100 items, 8 of which are slow (say 1000s to run) and the rest are quick (say 1s to run). You kick them off in a random order across 8 threads. Its clear that eventually each thread will be calculating one of your long running items, at this point you are seeing full utilisation. Eventually the one(s) that hit their long-op(s) first will finish up their long op(s) and quickly finish up any remaining short ops. At that time you ONLY have some of the long ops waiting to finish, so you will see the active utilisation drop off.. i.e. at some point there are only 3 ops left to finish, so only 3 cores are in use.
Mitigation Tactics
Your long running items might be amenable to 'internal parallelism' allowing them to have a faster minimum limit runtime.
Your long running items may be able to be identified and prioritised to start first (which will ensure you get full CPU utilisation for a long as possible)
(see update below) DONT use partitioning in cases where the body can be long running as this simply increases the 'hit' of this effect. (ie get rid of your rangePartitioner entirely). This will massively reduce the impact of this effect to your particular loop
either way your batch run-time is bound by the run-time of the slowest item in the batch.
Update I have also noticed you are using partitioning on your loop, which massively increases the scope of this effect, i.e. you are saying 'break this work-set down into N work-sets' and then parallelize the running of those N work-sets. In the example above this could mean that you get (say) 3 of the long ops into the same work-set and so those are going to process on that same thread. As such you should NOT be using partitioning if the inner body can be long running. For example the docs on partitioning here https://msdn.microsoft.com/en-us/library/dd560853(v=vs.110).aspx are saying this is aimed at short bodies
If you have multiple threads that process the same number of items each and each item takes varying amount of time, then of course you will have some threads that finish earlier.
If you use collection whose size is not known, then the items will be taken one by one:
var source = myList.AsEnumerable();
Another approach can be a Producer-Consumer pattern
https://msdn.microsoft.com/en-us/library/dd997371
I have coded a very simple "Word Count" program that reads a file and counts each word's occurrence in the file. Here is a part of the code:
class Alaki
{
private static List<string> input = new List<string>();
private static void exec(int threadcount)
{
ParallelOptions options = new ParallelOptions();
options.MaxDegreeOfParallelism = threadcount;
Parallel.ForEach(Partitioner.Create(0, input.Count),options, (range) =>
{
var dic = new Dictionary<string, List<int>>();
for (int i = range.Item1; i < range.Item2; i++)
{
//make some delay!
//for (int x = 0; x < 400000; x++) ;
var tokens = input[i].Split();
foreach (var token in tokens)
{
if (!dic.ContainsKey(token))
dic[token] = new List<int>();
dic[token].Add(1);
}
}
});
}
public static void Main(String[] args)
{
StreamReader reader=new StreamReader((#"c:\txt-set\agg.txt"));
while(true)
{
var line=reader.ReadLine();
if(line==null)
break;
input.Add(line);
}
DateTime t0 = DateTime.Now;
exec(Environment.ProcessorCount);
Console.WriteLine("Parallel: " + (DateTime.Now - t0));
t0 = DateTime.Now;
exec(1);
Console.WriteLine("Serial: " + (DateTime.Now - t0));
}
}
It is simple and straight forward. I use a dictionary to count each word's occurrence. The style is roughly based on the MapReduce programming model. As you can see, each task is using its own private dictionary. So, there is NO shared variables; just a bunch of tasks that count words by themselves. Here is the output when the code is run on a quad-core i7 CPU:
Parallel: 00:00:01.6220927
Serial: 00:00:02.0471171
The speedup is about 1.25 which means a tragedy! But when I add some delay when processing each line, I can reach speedup values about 4.
In the original parallel execution with no delay, CPU's utilization hardly reaches to 30% and therefore the speedup is not promising. But, when we add some delay, CPU's utilization reaches to 97%.
Firstly, I thought the cause is the IO-bound nature of the program (but I think inserting into a dictionary is to some extent CPU intensive) and it seems logical because all of the threads are reading data from a shared memory bus. However, The surprising point is when I run 4 instances of serial programs (with no delays) simultaneously, CPU's utilization reaches to about raises and all of the four instances finish in about 2.3 seconds!
This means that when the code is being run in a multiprocessing configuration, it reaches to a speedup value about 3.5 but when it is being run in multithreading config, the speedup is about 1.25.
What is your idea?
Is there anything wrong about my code? Because I think there is no shared data at all and I think the code shall not experience any contentions.
Is there a flaw in .NET's run-time?
Thanks in advance.
Parallel.For doesn't divide the input into n pieces (where n is the MaxDegreeOfParallelism); instead it creates many small batches and makes sure that at most n are being processed concurrently. (This is so that if one batch takes a very long time to process, Parallel.For can still be running work on other threads. See Parallelism in .NET - Part 5, Partioning of Work for more details.)
Due to this design, your code is creating and throwing away dozens of Dictionary objects, hundreds of List objects, and thousands of String objects. This is putting enormous pressure on the garbage collector.
Running PerfMonitor on my computer reports that 43% of the total run time is spent in GC. If you rewrite your code to use fewer temporary objects, you should see the desired 4x speedup. Some excerpts from the PerfMonitor report follow:
Over 10% of the total CPU time was spent in the garbage collector.
Most well tuned applications are in the 0-10% range. This is typically
caused by an allocation pattern that allows objects to live just long
enough to require an expensive Gen 2 collection.
This program had a peak GC heap allocation rate of over 10 MB/sec.
This is quite high. It is not uncommon that this is simply a
performance bug.
Edit: As per your comment, I will attempt to explain the timings you reported. On my computer, with PerfMonitor, I measured between 43% and 52% of time spent in GC. For simplicity, let's assume that 50% of the CPU time is work, and 50% is GC. Thus, if we make the work 4× faster (through multi-threading) but keep the amount of GC the same (this will happen because the number of batches being processed happened to be the same in the parallel and serial configurations), the best improvement we could get is 62.5% of the original time, or 1.6×.
However, we only see a 1.25× speedup because GC isn't multithreaded by default (in workstation GC). As per Fundamentals of Garbage Collection, all managed threads are paused during a Gen 0 or Gen 1 collection. (Concurrent and background GC, in .NET 4 and .NET 4.5, can collect Gen 2 on a background thread.) Your program experiences only a 1.25× speedup (and you see 30% CPU usage overall) because the threads spend most of their time being paused for GC (because the memory allocation pattern of this test program is very poor).
If you enable server GC, it will perform garbage collection on multiple threads. If I do this, the program runs 2× faster (with almost 100% CPU usage).
When you run four instances of the program simultaneously, each has its own managed heap, and the garbage collection for the four processes can execute in parallel. This is why you see 100% CPU usage (each process is using 100% of one CPU). The slightly longer overall time (2.3s for all vs 2.05s for one) is possibly due to inaccuracies in measurement, contention for the disk, time taken to load the file, having to initialise the threadpool, overhead of context switching, or some other environment factor.
An attempt to explain the results:
a quick run in the VS profiler shows it's barely reaching 40% CPU utilization.
String.Split is the main hotspot.
so a shared something must be blocking the the CPU.
that something is most likely memory allocation. Your bottlenecks are
var dic = new Dictionary<string, List<int>>();
...
dic[token].Add(1);
I replaced this with
var dic = new Dictionary<string, int>();
...
... else dic[token] += 1;
and the result is closer to a 2x speedup.
But my counter question would be: does it matter? Your code is very artificial and incomplete. The parallel version ends up creating multiple dictionaries without merging them. This is not even close to a real situation. And as you can see, little details do matter.
Your sample code is to complex to make broad statements about Parallel.ForEach().
It is too simple to solve/analyze a real problem.
Just for fun, here is a shorter PLINQ version:
File.ReadAllText("big.txt").Split().AsParallel().GroupBy(t => t)
.ToDictionary(g => g.Key, g => g.Count());
In my application, I have used the number of System.Threading.Timer and set this timer to fire every 1 second. My application execute the thread at every 1 second but it execution of the millisecond is different.
In my application i have used the OPC server & OPC group .one thread reading the data from the OPC server (like one variable changing it's value & i want to log this moment of the changes values into my application every 1 s)
then another thread to read this data read this data from the first thread every 1s & second thread used for store data into the MYSQL database .
in this process when i will read the data from the first thread then i will get the old data values like , read the data at 10:28:01.530 this second then i will get the information of 10:28:00.260 this second.so i want to mange these threads the first thread worked at 000 millisecond & second thread worked at 500 millisecond. using this first thread update the data at 000 second & second thread read the data at 500 millisecond.
My output is given below:
10:28:32.875
10:28:33.390
10:28:34.875
....
10:28:39.530
10:28:40.875
However, I want following results:
10:28:32.000
10:28:33.000
10:28:34.000
....
10:28:39.000
10:28:40.000
How can the timer be set so the callback is executed at "000 milliseconds"?
First of all, it's impossible. Even if you are to schedule your 'events' for a time that they are fired few milliseconds ahead of schedule, then compare millisecond component of the current time with zero in a loop, the flow control for your code could be taken away at the any given moment.
You will have to rethink your design a little, and not depend on when the event would fire, but think of the algorithm that will compensate for the milliseconds delayed.
Also, you won't have much help with the Threading.Timer, you would have better chance if you have used your own thread, periodically:
check for the current time, see what is the time until next full second
Sleep() for that amount minus the 'spice' factor
do the work you have to do.
You'll calculate your 'spice' factor depending on the results you are getting - does the sleep finishes ahead or behind the schedule.
If you are to give more information about your apparent need for having event at exactly zero ms, I could help you get rid of that requirement.
HTH
I would say that its impossible. You have to understand that switching context for cpu takes time (if other process is running you have to wait - cpu shelduler is working). Each CPU tick takes some time so synchronization to 0 milliseconds is impossible. Maybe with setting high priority of your process you can get closer to 0 but you won't achive it ever.
IMHO it will be impossible to really get a timer to fire exactly every 1sec (on the milisecond) - even in hardcore assembler this would be a very hard task on your normal windows-machine.
I think first what you need to do: is to set right dueTime for a timer. I do it so:
dueTime = 1000 - DateTime.Now.Milliseconds + X; where X - is serving for accuracy and you need select It by testing. Then Threading.Timer each time It ticks running on thread from CLR thread pool and, how tests show - this thread is different each time. Creating threads slows timer, because of this you can use WaitableTimer, which always will be running at the same thread. Instead of WaitableTimer you can using Thread.Sleep method in such way:
Thread.CurrentThread.Priority = Priority.High; //If time is really critical
Thread.Sleep (1000 - DateTime.Now + 50); //Make bound = 1s
while (SomeBoolCondition)
{
Thread.Sleep (980); //1000 ms = 1 second, but something ms will be spent on exit from Sleep
//Here you write your code
}
It will be work faster then a timer.