Why is C# Threading not working on multiple cores?

My code is a bit complex, but at its core it starts threads like this:
Thread task = new Thread(new ParameterizedThreadStart(x => { ThreadReturn = BuildChildNodes(x); }));
task.Start((NodeParameters)tasks[0]);
It should work, but when I check my CPU usage I get barely 10%, so I assume it's using just one core, and barely that.
ThreadReturn, by the way, is a property whose setter I use as a kind of event that fires when a thread is done:
public object ThreadReturn
{
set
{
lock (thisLock)
{
NodeReturn result = (NodeReturn)value;
if (result.states.Count == 0) return;
Content[result.level + 1].AddRange(result.states);
if (result.level + 1 >= MaxDepth) return;
for (int i = 0; i < result.states.Count; i++)
{
Thread newTask = new Thread(new ParameterizedThreadStart(x => ThreadReturn = BuildChildNodes(x)));
NodeParameters param = new NodeParameters()
{
level = result.level + 1,
node = Content[result.level + 1].Count - (i + 1),
turn = SkipOpponent ? StartTurn : !result.turn
};
if (tasks.Count > 100)
unstarted.Add(param);
else
{
newTask.Start(param);
tasks.Add(newTask);
}
}
}
}
}
I got some crazy error about a mark stack overflow, so I limited the maximum number of parallel tasks by putting the excess into a second list.
I'm not well versed in multithreading, so this code is a little messy. Maybe you can show me a better way that actually uses my cores.
By the way, it's not the lock's fault. I tried without it before and got the same result.
Edit: this is my code from before I switched to the Thread class. I find it more suitable:
Content.Clear();
Content.Add(new List<T> { Root });
for (var i = 0; i < maxDepth; i++)
Content.Add(new List<T>());
Task<object> firstTask = new Task<object>(x => BuildChildNodes(x), (new NodeParameters() { level = 0, node = 0, turn = Turn }));
firstTask.Start();
tasks.Add(firstTask);
while (tasks.Count > 0 && Content.Last().Count == 0)
{
Task.WaitAny(tasks.ToArray());
for (int task = tasks.Count - 1; task >= 0; task--)
{
if (tasks[task].IsCompleted)
{
NodeReturn result = (NodeReturn)tasks[task].Result;
tasks.RemoveAt(task);
Content[result.level + 1].AddRange(result.states);
if (result.level + 1 >= maxDepth) continue;
for (int i = 0; i < result.states.Count; i++)
{
Task<object> newTask = new Task<object>(x => BuildChildNodes(x), (object)new NodeParameters() { level = result.level + 1, node = Content[result.level + 1].Count - (i + 1), turn = SkipOpponent ? Turn : !result.turn });
newTask.Start();
}
}
}
}
In every state I'm calculating children, and in my main thread I put them into my state tree while waiting for the tasks to finish. Please assume I actually use the return value of WaitAny; I did a git reset and now that version is gone.
Edit:
Okay, I don't know exactly what I did wrong, but in general everything was a total mess. I have now implemented the depth-first construction method, and maybe because there's much less "traffic" now, my whole code runs in 200 ms. So thanks for this!
I don't know whether I should delete this question out of embarrassment or whether you want to post answers so I can upvote them. You really helped me a lot :)

Disregarding all the other issues you have here, essentially your lock ruins the show.
What you are saying is: hey, random thread, go and do some work, but make sure you don't do it at the same time as anyone else (the lock). You could have 1000 threads, but only one of them is going to be active at a time on one core, hence your results.
Here are some other thoughts.
Get the gunk out of the setter; this would fail any sane code review.
Use Tasks instead of Threads.
Think about what needs thread safety and lock only what needs it. Take a look at the Interlocked class for atomic manipulation of numeric values.
Take a look at the concurrent collections; you may get more mileage out of them.
Simplify your code.
I can't give any more advice, as it's just about impossible to know what you are trying to do.

Related

Parallel BFS rush hour solver

For a university assignment we have to turn a serial implementation of a Rush Hour solver into a parallel one. The solver uses a BFS implementation.
Here is a part of the default bfs implementation:
// Initialize empty queue
Queue<Tuple<byte[], Solution>> q = new Queue<Tuple<byte[], Solution>>();
// By default, the solution is "no solution"
foundSolution = new NoSolution();
// Place the starting position in the queue
q.Enqueue(Tuple.Create(vehicleStartPos, (Solution)new EmptySolution()));
AddNode(vehicleStartPos);
// Do BFS
while (q.Count > 0)
{
Tuple<byte[], Solution> currentState = q.Dequeue();
// Generate successors, and push them on to the queue if they haven't been seen before
foreach (Tuple<byte[], Solution> next in Sucessors(currentState))
{
// Did we reach the goal?
if (next.Item1[targetVehicle] == goal)
{
q.Clear();
foundSolution = next.Item2;
break;
}
// If we haven't seen this node before, add it to the Trie and Queue to be expanded
if(!AddNode(next.Item1))
q.Enqueue(next);
}
}
Console.WriteLine(foundSolution);
Console.ReadLine();
I managed to turn this into parallel like this:
ConcurrentQueue<Tuple<byte[], Solution>> q = new ConcurrentQueue<Tuple<byte[], Solution>>();
foundSolution = new NoSolution();
q.Enqueue(Tuple.Create(vehicleStartPos, (Solution)new EmptySolution()));
AddNode(vehicleStartPos);
while (q.Count > 0 && !solutionFound)
{
Tuple<byte[], Solution> currentState;
q.TryDequeue(out currentState);
Parallel.ForEach(Sucessors(currentState), (next) =>
{
// Did we reach the goal?
if (next.Item1[targetVehicle] == goal)
{
solutionFound = true;
foundSolution = next.Item2;
return;
}
// If we haven't seen this node before, add it to the Trie and Queue to be expanded
if (!AddNode(next.Item1))
q.Enqueue(next);
});
}
As you can see, I tried to implement a parallel foreach loop with a ConcurrentQueue. I get the feeling that the ConcurrentQueue works well, but it synchronizes automatically and thus costs too much time, making this parallel implementation way slower than the serial one.
I was thinking about using a wait-free or at least lock-free queue to save that bit of time, but I am not sure how to implement such a thing. Could you give some insight into whether this would be feasible and whether it would be faster than using a regular Queue? Or should I use a different concurrent data structure that suits the situation better? I'm not sure how well a ConcurrentBag and the like would fit in. Could you shed some light on this?
Also, after searching for parallel BFS implementations, I couldn't find any. What are some general tips and hints for people like me who want to implement BFS in parallel? What are some good thread-safe alternatives to the queue?
EDIT1:
I managed to implement tasks like this:
int taskNumbers = Environment.ProcessorCount;
Task[] tasks = new Task[taskNumbers];
// Set up the cancellation token
ctSource = new CancellationTokenSource();
for (int i = 0; i < taskNumbers; i++)
tasks[i] = new Task(() =>
{
try{ Traverse(); }
catch{ }
},
ctSource.Token);
for (int i = 0; i < taskNumbers; i++)
tasks[i].Start();
Task.WaitAll(tasks);
ctSource.Dispose();
They call a traverse method, which looks like this:
private static void Traverse()
{
ctSource.Token.ThrowIfCancellationRequested();
while (q.Count > 0)
{
Tuple<byte[], Solution> currentState;
if (q.TryDequeue(out currentState))
{
foreach (Tuple<byte[], Solution> next in Sucessors(currentState))
{
// Did we reach the goal?
if (next.Item1[targetVehicle] == goal)
{
ctSource.Cancel();
foundSolution = next.Item2;
return;
}
// If we haven't seen this node before, add it to the Trie and Queue to be expanded
if (!AddNode(next.Item1))
q.Enqueue(next);
}
}
if (ctSource.IsCancellationRequested)
ctSource.Token.ThrowIfCancellationRequested();
}
}
Yet I am having trouble figuring out the condition for the while loop in the Traverse method. The current condition allows tasks to exit the loop too early. As far as I know, I don't have a complete list of all nodes available, so I can't compare the visited tree to the list of all nodes. Besides that, I don't have any other ideas for how to keep tasks looping until I have found an answer or until there are no more new nodes. Could you help me out?
Thanks @Brian Malehorn for your help so far. I managed to get the performance of the parallel BFS version up to almost equal that of the serial version. I think all I need now is to make the tasks stay in the while loop.

The problem isn't your queue, the problem is you're parallelizing the wrong thing. You're parallelizing adding the successors to the queue, when you should be parallelizing the Sucessors() call.
That is, Sucessors() should only be called from a worker thread, never in the "main" thread.
For example, suppose Sucessors() takes 1 second to run, and you're searching this tree:
      o
     / \
    /   \
   o     o
  / \   / \
 o   o o   o
The fastest you can search this tree is 3 seconds. How long will your code take?

This program takes a long time to terminate. How can I optimise it?

I've been writing a program to perform a kind of pattern matching in XML and text files. When my program reaches this section of the code the CPU usage goes very high and the performance slows down to a point where the program appears to be frozen, but actually it is not. Depending on the input (number of text files and their content) it may take up to several hours to complete the task. I'm looking for a more efficient way to rewrite this section of the code :
List<string> CandidatesRet = new List<string>();
for (int indexCi = 0; indexCi < Ci.Count - 1; indexCi++)
{
// generate all sub itemset with length-1
string[] allItems = Ci[indexCi].Split(new char[] { ' ' });
for (int i = 0; i < allItems.Length; i++)
{
string tempStr = "";
for (int j = 0; j < allItems.Length; j++)
if (i != j)
tempStr += allItems[j] + " ";
tempStr = tempStr.Trim();
subItemset.Add(tempStr);
}
// THE PROBLEM BEGINS HERE
foreach (string subitem in subItemset)
{
int iFirtS;
for (int indexCommon = indexCi + 1; indexCommon < Ci.Count; indexCommon++)
if ((iFirtS = Ci[indexCommon].IndexOf(subitem)) >= 0)
{
string[] listTempCi = Ci[indexCommon].Split(new char[] { ' ' });
foreach (string itemCi in listTempCi)
if (!subitem.Contains(itemCi))
commonItem.Add(itemCi);
}
allCommonItems.Add(commonItem);
}
// generate candidates from the common items
foreach (string item in oldItemsetCi)
{
bool flagCi = true;
foreach (List<string> listCommItem in allCommonItems)
if (!listCommItem.Contains(item))
{
flagCi = false;
break;
}
if (flagCi)
CandidatesRet.Add((Ci[indexCi] + " " + item).Trim());
}
There are many nested loops, and I know this is the problem. What do you recommend to improve it?

Even if you rewrite your code to be more performant, there's still a chance that the work is CPU bound. If it doesn't yield often enough for the main thread to handle its UI event processing, you will always experience a so-called freeze in your application.
There are several techniques to counter this:
Use a BackgroundWorker to get the job done
Offload to a separate dedicated thread
Utilize the Task library
Utilize the Thread Pool directly
Use Application.DoEvents (better yet, DON'T EVER!)
(Most of these techniques are beyond the scope of this answer.) See this article on implementing this technique.
The core idea is that if you have CPU-bound or IO-bound work and your UI main thread doesn't get enough time to do its event processing, the freeze can't be avoided.

Java is scaling much worse than C# over many cores?

I am testing spawning off many threads running the same function on a 32 core server for Java and C#. I run the application with 1000 iterations of the function, which is batched across either 1,2,4,8, 16 or 32 threads using a threadpool.
At 1, 2, 4, 8 and 16 concurrent threads Java is at least twice as fast as C#. However, as the number of threads increases, the gap closes and by 32 threads C# has nearly the same average run-time, but Java occasionally takes 2000ms (whereas both languages are usually running about 400ms). Java is starting to get worse with massive spikes in the time taken per thread iteration.
EDIT This is Windows Server 2008
EDIT2 I have changed the code below to show using the Executor Service threadpool. I have also installed Java 7.
I have set the following optimisations in the hotspot VM:
-XX:+UseConcMarkSweepGC -Xmx 6000
but it still hasn't made things any better. The only difference between the code is that I'm using the thread pool below, while for the C# version we use:
http://www.codeproject.com/Articles/7933/Smart-Thread-Pool
Is there a way to make the Java version more optimised? Perhaps you could explain why I am seeing this massive degradation in performance?
Is there a more efficient Java threadpool?
(Please note, I do not mean by changing the test function)
import java.io.DataOutputStream;
import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.PrintStream;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadPoolExecutor;
public class PoolDemo {
static long FastestMemory = 2000000;
static long SlowestMemory = 0;
static long TotalTime;
static int[] FileArray;
static DataOutputStream outs;
static FileOutputStream fout;
static Byte myByte = 0;
public static void main(String[] args) throws InterruptedException, FileNotFoundException {
int Iterations = Integer.parseInt(args[0]);
int ThreadSize = Integer.parseInt(args[1]);
FileArray = new int[Iterations];
fout = new FileOutputStream("server_testing.csv");
// fixed pool, unlimited queue
ExecutorService service = Executors.newFixedThreadPool(ThreadSize);
ThreadPoolExecutor executor = (ThreadPoolExecutor) service;
for(int i = 0; i<Iterations; i++) {
Task t = new Task(i);
executor.execute(t);
}
for(int j=0; j<FileArray.length; j++){
new PrintStream(fout).println(FileArray[j] + ",");
}
}
private static class Task implements Runnable {
private int ID;
public Task(int index) {
this.ID = index;
}
public void run() {
long Start = System.currentTimeMillis();
int Size1 = 100000;
int Size2 = 2 * Size1;
int Size3 = Size1;
byte[] list1 = new byte[Size1];
byte[] list2 = new byte[Size2];
byte[] list3 = new byte[Size3];
for(int i=0; i<Size1; i++){
list1[i] = myByte;
}
for (int i = 0; i < Size2; i=i+2)
{
list2[i] = myByte;
}
for (int i = 0; i < Size3; i++)
{
byte temp = list1[i];
byte temp2 = list2[i];
list3[i] = temp;
list2[i] = temp;
list1[i] = temp2;
}
long Finish = System.currentTimeMillis();
long Duration = Finish - Start;
TotalTime += Duration;
FileArray[this.ID] = (int)Duration;
System.out.println("Individual Time " + this.ID + " \t: " + (Duration) + " ms");
if(Duration < FastestMemory){
FastestMemory = Duration;
}
if (Duration > SlowestMemory)
{
SlowestMemory = Duration;
}
}
}
}

Summary
Below are the original response, update 1, and update 2. Update 1 talks about dealing with the race conditions around the test statistic variables by using concurrency structures. Update 2 is a much simpler way of dealing with the race condition issue. Hopefully no more updates from me - sorry for the length of the response but multithreaded programming is complicated!
Original Response
"The only difference between the code is that I'm using the below threadpool"
I would say that is an absolutely huge difference. It's difficult to compare the performance of the two languages when their thread pool implementations are completely different blocks of code, written in user space. The thread pool implementation could have enormous impact on performance.
You should consider using Java's own built-in thread pools. See ThreadPoolExecutor and the entire java.util.concurrent package of which it is part. The Executors class has convenient static factory methods for pools and is a good higher level interface. All you need is JDK 1.5+, though the newer, the better. The fork/join solutions mentioned by other posters are also part of this package - as mentioned, they require 1.7+.
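As a rough illustration of the built-in pool recommended above, here is a minimal sketch using Executors.newFixedThreadPool and ExecutorService.invokeAll. The class name BuiltInPoolSketch, the method runAll, and the trivial work function are made up for this example and are not part of the question's code; the lambdas also assume Java 8 or newer, whereas the pool classes themselves exist since 1.5.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical demo class; the name and the trivial workload are illustrative only.
class BuiltInPoolSketch {

    // Runs `iterations` tasks on a fixed pool of `threads` workers and
    // returns each task's result, indexed by task id.
    static long[] runAll(int iterations, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Callable<Long>> jobs = new ArrayList<>();
            for (int i = 0; i < iterations; i++) {
                final int id = i;
                jobs.add(() -> work(id));
            }
            // invokeAll blocks until every submitted task has completed.
            List<Future<Long>> futures = pool.invokeAll(jobs);
            long[] results = new long[iterations];
            for (int i = 0; i < iterations; i++) {
                results[i] = futures.get(i).get();
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        } finally {
            // Without shutdown() the pool's non-daemon threads keep the JVM alive.
            pool.shutdown();
        }
    }

    // Stand-in workload: sums the integers 0..id.
    static long work(int id) {
        long sum = 0;
        for (int k = 0; k <= id; k++) {
            sum += k;
        }
        return sum;
    }

    public static void main(String[] args) {
        long[] results = runAll(1000, 4);
        System.out.println("tasks completed: " + results.length);
    }
}
```

Returning results through Future objects means the tasks share no mutable statistics, so the timing comparison between pool sizes is not distorted by contention on shared fields.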
Update 1 - Addressing race conditions by using concurrency structures
You have race conditions around the setting of FastestMemory, SlowestMemory, and TotalTime. For the first two, you are doing the < and > testing and then the setting in more than one step. This is not atomic; there is certainly the chance that another thread will update these values in between the testing and the setting. The += setting of TotalTime is also non-atomic: a test and set in disguise.
Here are some suggested fixes.
TotalTime
The goal here is a threadsafe, atomic += of TotalTime.
// At the top of everything
import java.util.concurrent.atomic.AtomicLong;
...
// In PoolDemo
static AtomicLong TotalTime = new AtomicLong();
...
// In Task, where you currently do the TotalTime += piece
TotalTime.addAndGet (Duration);
FastestMemory / SlowestMemory
The goal here is testing and updating FastestMemory and SlowestMemory each in an atomic step, so no thread can slip in between the test and update steps to cause a race condition.
Simplest approach:
Protect the testing and setting of the variables using the class itself as a monitor. We need a monitor that contains the variables in order to guarantee synchronized visibility (thanks @A.H. for catching this). We have to use the class itself because everything is static.
// In Task
synchronized (PoolDemo.class) {
if (Duration < FastestMemory) {
FastestMemory = Duration;
}
if (Duration > SlowestMemory) {
SlowestMemory = Duration;
}
}
Intermediate approach:
You may not like taking the whole class for the monitor, or exposing the monitor by using the class, etc. You could do a separate monitor that does not itself contain FastestMemory and SlowestMemory, but you will then run into synchronization visibility issues. You get around this by using the volatile keyword.
// In PoolDemo
static Integer _monitor = new Integer(1);
static volatile long FastestMemory = 2000000;
static volatile long SlowestMemory = 0;
...
// In Task
synchronized (PoolDemo._monitor) {
if (Duration < FastestMemory) {
FastestMemory = Duration;
}
if (Duration > SlowestMemory) {
SlowestMemory = Duration;
}
}
Advanced approach:
Here we use the java.util.concurrent.atomic classes instead of monitors. Under heavy contention, this should perform better than the synchronized approach. Try it and see.
// At the top of everything
import java.util.concurrent.atomic.AtomicLong;
. . . .
// In PoolDemo
static AtomicLong FastestMemory = new AtomicLong(2000000);
static AtomicLong SlowestMemory = new AtomicLong(0);
. . . . .
// In Task
long temp = FastestMemory.get();
while (Duration < temp) {
if (!FastestMemory.compareAndSet (temp, Duration)) {
temp = FastestMemory.get();
}
}
temp = SlowestMemory.get();
while (Duration > temp) {
if (!SlowestMemory.compareAndSet (temp, Duration)) {
temp = SlowestMemory.get();
}
}
Let me know what happens after this. It may not fix your problem, but the race condition around the very variables that track your performance is too dangerous to ignore.
I originally posted this update as a comment but moved it here so that I would have room to show code. This update has been through a few iterations - thanks to A.H. for catching a bug I had in an earlier version. Anything in this update supersedes anything in the comment.
Last but not least, an excellent source covering all this material is Java Concurrency in Practice, the best book on Java concurrency, and one of the best Java books overall.
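For the advanced approach in Update 1, a more compact variant is possible on Java 8 or newer, where AtomicLong.accumulateAndGet performs the compare-and-set retry loop internally. The sketch below is a swapped-in alternative, not the code from the answer; the class name AtomicStatsSketch and its field names are made up for illustration, and the statistics are instance fields here rather than statics.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical holder for the timing statistics tracked in the question.
class AtomicStatsSketch {
    final AtomicLong fastest = new AtomicLong(Long.MAX_VALUE);
    final AtomicLong slowest = new AtomicLong(0);
    final AtomicLong total = new AtomicLong();

    // Records one duration; each of the three updates is atomic on its own.
    void record(long duration) {
        total.addAndGet(duration);
        // accumulateAndGet (Java 8+) retries the compare-and-set internally.
        fastest.accumulateAndGet(duration, Math::min);
        slowest.accumulateAndGet(duration, Math::max);
    }

    public static void main(String[] args) throws InterruptedException {
        AtomicStatsSketch stats = new AtomicStatsSketch();
        Thread[] workers = new Thread[4];
        for (int t = 0; t < workers.length; t++) {
            final long base = (t + 1) * 100L;
            workers[t] = new Thread(() -> {
                for (int i = 0; i < 1000; i++) {
                    stats.record(base + i);
                }
            });
            workers[t].start();
        }
        for (Thread w : workers) {
            w.join();
        }
        System.out.println(stats.fastest.get() + " " + stats.slowest.get()
                + " " + stats.total.get());
    }
}
```

Note that the three fields are updated independently, so a reader may briefly see a total that already includes a duration not yet reflected in fastest or slowest. For end-of-run statistics that is usually acceptable.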
Update 2 - Addressing race conditions in a much simpler way
I recently noticed that your current code will never terminate unless you add executorService.shutdown(). That is, the non-daemon threads living in that pool must be terminated or else the main thread will never exit. This got me to thinking that since we have to wait for all threads to exit, why not compare their durations after they finished, and thus bypass the concurrent updating of FastestMemory, etc. altogether? This is simpler and could be faster; there's no more locking or CAS overhead, and you are already doing an iteration of FileArray at the end of things anyway.
The other thing we can take advantage of is that your concurrent updating of FileArray is perfectly safe, since each thread is writing to a separate cell, and since there is no reading of FileArray during the writing of it.
With that, you make the following changes:
// In PoolDemo
// This part is the same, just so you know where we are
for(int i = 0; i<Iterations; i++) {
Task t = new Task(i);
executor.execute(t);
}
// CHANGES BEGIN HERE
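// Stop accepting new tasks, then wait (here: at most 10 seconds) for the queued ones to finish. Required regardless.
// Requires: import java.util.concurrent.TimeUnit;
executor.shutdown();
executor.awaitTermination(10, TimeUnit.SECONDS);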
for(int j=0; j<FileArray.length; j++){
long duration = FileArray[j];
TotalTime += duration;
if (duration < FastestMemory) {
FastestMemory = duration;
}
if (duration > SlowestMemory) {
SlowestMemory = duration;
}
new PrintStream(fout).println(FileArray[j] + ",");
}
. . .
// In Task
// Ending of Task.run() now looks like this
long Finish = System.currentTimeMillis();
long Duration = Finish - Start;
FileArray[this.ID] = (int)Duration;
System.out.println("Individual Time " + this.ID + " \t: " + (Duration) + " ms");
Give this approach a shot as well.
You should definitely be checking your C# code for similar race conditions.

...but Java occasionally takes 2000ms...
And
byte[] list1 = new byte[Size1];
byte[] list2 = new byte[Size2];
byte[] list3 = new byte[Size3];
The hiccups will be the garbage collector cleaning up your arrays. If you really want to tune that, I suggest you use some kind of cache for the arrays.
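One possible shape for such an array cache is a per-thread buffer that each pool worker allocates once and then reuses. This is only a sketch under the assumption that allocation pressure really is the cause of the spikes; the class BufferCacheSketch and the method list1 are hypothetical names, and ThreadLocal.withInitial requires Java 8 or newer.

```java
import java.util.Arrays;

// Hypothetical helper: one reusable buffer per worker thread.
class BufferCacheSketch {
    static final int SIZE1 = 100000; // same length as Size1 in the question

    // Each thread lazily allocates its own buffer on first use and keeps it.
    private static final ThreadLocal<byte[]> LIST1 =
            ThreadLocal.withInitial(() -> new byte[SIZE1]);

    // Returns the calling thread's buffer, zeroed so that it behaves
    // like a freshly allocated array.
    static byte[] list1() {
        byte[] buffer = LIST1.get();
        Arrays.fill(buffer, (byte) 0);
        return buffer;
    }

    public static void main(String[] args) {
        byte[] first = list1();
        byte[] second = list1();
        System.out.println("same buffer reused: " + (first == second));
    }
}
```

The trade-off is that clearing the buffer costs time on every call, so this only helps if garbage collection pauses, rather than the fill itself, dominate the slow iterations.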
Edit
This one
System.out.println("Individual Time " + this.ID + " \t: " + (Duration) + " ms");
takes one or more synchronized locks internally, so your highly "concurrent" code will be serialized quite well at this point. Just remove it and retest.

While @sparc_spread's answer is great, another thing I've noticed is this:
I run the application with 1000 iterations of the function
Note that the HotSpot JVM runs a function in interpreted mode for its first 1.5k iterations in client mode, and for its first 10k iterations in server mode. Computers with that many cores are automatically considered "servers" by the HotSpot JVM.
That would mean that C# does its JIT compilation (and runs machine code) before Java does, and so has a chance at better performance during the measured runs. Try increasing the iterations to 20,000 and start timing from the 10,000th iteration.
The rationale here is that the JVM collects statistics to decide how best to JIT. It trusts that your function is going to be run many times, so it accepts a slow start in exchange for a faster runtime overall. Or in their words, "20% of the functions run 80% of the time", so why JIT them all?
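The warm-up suggestion can be sketched as a small timing harness that runs a number of untimed iterations before it starts measuring. The class WarmupSketch, the method measure, and the stand-in workload are hypothetical, and the iteration counts come from the answer above; the actual compilation thresholds depend on the JVM version and flags.

```java
// Hypothetical harness: discard warm-up iterations, then time the rest.
class WarmupSketch {

    // Stand-in for the function under test: copies one byte array into another.
    static long work() {
        byte[] src = new byte[100000];
        byte[] dst = new byte[100000];
        long checksum = 0;
        for (int i = 0; i < src.length; i++) {
            dst[i] = src[i];
            checksum += dst[i];
        }
        return checksum;
    }

    // Runs `warmup` untimed iterations so the JIT can compile the hot code,
    // then returns the duration in nanoseconds of each of `measured` iterations.
    static long[] measure(int warmup, int measured) {
        long sink = 0; // accumulates results so the work is not optimized away
        for (int i = 0; i < warmup; i++) {
            sink += work();
        }
        long[] times = new long[measured];
        for (int i = 0; i < measured; i++) {
            long start = System.nanoTime();
            sink += work();
            times[i] = System.nanoTime() - start;
        }
        if (sink != 0) {
            throw new IllegalStateException("unexpected checksum");
        }
        return times;
    }

    public static void main(String[] args) {
        long[] times = measure(10000, 10000);
        long total = 0;
        for (long t : times) {
            total += t;
        }
        System.out.println("mean ns per iteration: " + (total / times.length));
    }
}
```

System.nanoTime is used instead of System.currentTimeMillis because individual iterations may finish in well under a millisecond once the code is compiled.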

Are you using Java 6? Java 7 comes with features to improve performance in parallel programming:
http://www.oracle.com/technetwork/articles/java/fork-join-422606.html

Threadpool in C# too slow, is there a way to speed up it? Thread.Sleep(0) and QueueUserWorkItem issues

I am using the ThreadPool in a C# application that needs to do some CPU-intensive work. However, it seems too slow, even after changing MinThreads via the SetMinThreads method. (EDIT: it prints the debug string "Calculating on " + lSubArea.X + ":" + lSubArea.Y + " " + lSubArea.Width + ":" + lSubArea.Height only a few times every 10 seconds, while I expect to see it at least NUM_ROWS_GRID^2 = 16 times every few seconds.) I don't know whether I should switch to custom threads or whether there's a way to speed it up. Searching on Google returns some results, but nothing works; same situation with MSDN.
Old Code follows:
private void StreamerRoutine()
{
if (this._state.Area.Width == 0 && this._state.Area.Height == 0)
this._state.Area = new Rectangle(0, 0, Screen.PrimaryScreen.Bounds.Width, Screen.PrimaryScreen.Bounds.Height);
while (this._state.WorkEnd == false)
{
// Ends time slice if video is off
if (this._state.VideoOn == false)
Thread.Sleep(0);
else
{
lock(this._state.AreaSync)
{
Int32 lWidth = this._state.Area.Width / Constants.NUM_ROWS_GRID;
Int32 lHeight = this._state.Area.Height / Constants.NUM_ROWS_GRID;
for (Int32 lX = 0; lX + lWidth <= this._state.Area.Width; lX += lWidth)
for (Int32 lY = 0; lY + lHeight <= this._state.Area.Height; lY += lHeight)
ThreadPool.QueueUserWorkItem(CreateDiffFrame, (Object)new Rectangle(lX, lY, lWidth, lHeight));
}
}
}
}
private void CreateDiffFrame(Object pState)
{
Rectangle lSubArea = (Rectangle)pState;
SmartDebug.DWL("Calculating on "
+ lSubArea.X + ":" + lSubArea.Y + " "
+ lSubArea.Width + ":" + lSubArea.Height);
// TODO : calculate frame
Thread.Sleep(0);
}
EDIT: The CreateDiffFrame function is only a stub I used to find out how many times it is called per second. It will be replaced with CPU-intensive work once I have settled on the best way to use threads in this case.
EDIT: I removed all the Thread.Sleep(0) calls. I thought they could be a way to speed up the routine, but it seems they could be a bottleneck. New code follows.
EDIT: I made WorkEnd and VideoOn volatile in order to avoid cached values and thus an endless loop. I also added a semaphore so that every batch of work items starts only after the previous batch is done. Now it is working quite well:
private void StreamerRoutine()
{
if (this._state.Area.Width == 0 && this._state.Area.Height == 0)
this._state.Area = new Rectangle(0, 0, Screen.PrimaryScreen.Bounds.Width, Screen.PrimaryScreen.Bounds.Height);
this._state.StreamingSem = new Semaphore(Constants.NUM_ROWS_GRID * Constants.NUM_ROWS_GRID, Constants.NUM_ROWS_GRID * Constants.NUM_ROWS_GRID);
while (this._state.WorkEnd == false)
{
if (this._state.VideoOn == true)
{
for (int i = 0; i < Constants.NUM_ROWS_GRID * Constants.NUM_ROWS_GRID; i++)
this._state.StreamingSem.WaitOne();
lock(this._state.AreaSync)
{
Int32 lWidth = this._state.Area.Width / Constants.NUM_ROWS_GRID;
Int32 lHeight = this._state.Area.Height / Constants.NUM_ROWS_GRID;
for (Int32 lX = 0; lX + lWidth <= this._state.Area.Width; lX += lWidth)
for (Int32 lY = 0; lY + lHeight <= this._state.Area.Height; lY += lHeight)
ThreadPool.QueueUserWorkItem(CreateDiffFrame, (Object)new Rectangle(lX, lY, lWidth, lHeight));
}
}
}
}
private void CreateDiffFrame(Object pState)
{
Rectangle lSubArea = (Rectangle)pState;
SmartDebug.DWL("Calculating on " + lSubArea.X + ":" + lSubArea.Y + " " + lSubArea.Width + ":" + lSubArea.Height);
// TODO : calculate frame
this._state.StreamingSem.Release(1);
}

From what I see, there really isn't a good way to tell you exactly what's making your code slow, but a couple of things stand out:
Thread.Sleep(0). When you do this, you give up the rest of your timeslice from the OS and slow everything down, because CreateDiffFrame() can't actually return until the OS scheduler comes back to it.
The object cast of Rectangle, which is a struct. You incur the overhead of boxing when this happens, which you won't want for truly compute-intensive operations.
Your calls to lock(this._state.AreaSync). It could be that AreaSync is being locked somewhere else, too, and that could be slowing things down.
You may be queueing items too granularly. If you queue very small items of work, the overhead of putting them in the queue one at a time could be significant compared to the actual amount of work done. You could also consider putting the contents of the inner loop inside the queued work item to cut down this overhead.
If this is something you're trying to do for parallel computation, have you investigated using PLINQ or another such framework?

My guess would be that it's the Sleep at the end of CreateDiffFrame. It means each thread stays alive for at least another 10 ms, if I remember correctly. You can probably do the actual work in less than 10 ms. ThreadPool tries to optimize the usage of threads, but I think it has an upper limit on the total number of outstanding threads. So if you want to actually mimic your workload, use a tight loop that spins until the expected number of milliseconds has passed instead of a Sleep.
Anyway, I don't think ThreadPool is the actual bottleneck; using another threading mechanism will not speed up your code.

There is a known bug with the ThreadPool.SetMinThreads method, described in KB976898:
After you use the ThreadPool.SetMinThreads method in the Microsoft .NET Framework 3.5, threads maintained by the thread pool do not work as expected
You can download a fix to this behavior from here.

Why does this parallel code work sometimes?

I wanted to parallelize a piece of code, but the code actually got slower, probably because of the overhead of Barrier and BlockingCollection. There would be two threads, where the first finds pieces of work which the second operates on. Neither operation is much work, so the overhead of switching safely would quickly outweigh the benefit of the two threads.
So I thought I would try to write some code myself to be as lean as possible, without using Barrier etc. It does not behave consistently, however. Sometimes it works, sometimes it does not, and I can't figure out why.
This code is just the mechanism I use to try to synchronize the two threads. It doesn't do anything useful; it is just the minimum amount of code you need to reproduce the bug.
So here's the code:
// node in linkedlist of work elements
class WorkItem {
public int Value;
public WorkItem Next;
}
static void Test() {
WorkItem fst = null; // first element
Action create = () => {
WorkItem cur=null;
for (int i = 0; i < 1000; i++) {
WorkItem tmp = new WorkItem { Value = i }; // create new comm class
if (fst == null) fst = tmp; // if it's the first add it there
else cur.Next = tmp; // else add to back of list
cur = tmp; // this is the current one
}
cur.Next = new WorkItem { Value = -1 }; // -1 means stop element
#if VERBOSE
Console.WriteLine("Create is done");
#endif
};
Action consume = () => {
//Thread.Sleep(1); // this also seems to cure it
#if VERBOSE
Console.WriteLine("Consume starts"); // especially this one seems to matter
#endif
WorkItem cur = null;
int tot = 0;
while (fst == null) { } // busy wait for first one
cur = fst;
#if VERBOSE
Console.WriteLine("Consume found first");
#endif
while (true) {
if (cur.Value == -1) break; // if stop element break;
tot += cur.Value;
while (cur.Next == null) { } // busy wait for next to be set
cur = cur.Next; // move to next
}
Console.WriteLine(tot);
};
try { Parallel.Invoke(create, consume); }
catch (AggregateException e) {
Console.WriteLine(e.Message);
foreach (var ie in e.InnerExceptions) Console.WriteLine(ie.Message);
}
Console.WriteLine("Consume done..");
Console.ReadKey();
}
The idea is to have a linked list of work items. One thread adds items to the back of that list, and another thread reads them, does something, and polls the Next field to see if it is set. As soon as it is set, it moves to the new item and processes it. It polls the Next field in a tight busy loop because it should be set very quickly. Going to sleep, context switching etc. would kill the benefit of parallelizing the code.
The time it takes to create a work item is quite comparable to the time it takes to execute it, so the cycles wasted should be quite small.
When I run the code in release mode, sometimes it works and sometimes it does nothing. The problem seems to be in the 'Consumer' thread; the 'Create' thread always seems to finish. (You can check by fiddling with the Console.WriteLines.)
It has always worked in debug mode. In release mode it is about 50% hit and miss. Adding a few Console.WriteLines (the #define VERBOSE stuff) improves the success ratio, but even then it's not 100%.
When I add the Thread.Sleep(1) in the 'Consumer' thread, it also seems to fix it. But not being able to reproduce a bug is not the same thing as knowing for sure it's fixed.
Does anyone here have a clue as to what goes wrong? Is it some optimization that creates a local copy that does not get updated? Something like that?
There's no such thing as a partial update, right? Like a data race where one thread is half done writing and the other thread reads the partially written memory? Just checking.
Looking at it, I think it should just work. I guess once every few runs the threads arrive in a different order and that makes it fail, but I don't get how. And how could I fix this without slowing it down?
Thanks in advance for any tips,
Gert-Jan

I do my damn best to avoid the utter minefield of closure/stack interaction at all costs.
This is PROBABLY a (language-level) race condition, but without reflecting Parallel.Invoke I can't be sure. Basically, sometimes fst is being changed by create() and sometimes not. Ideally, it should NEVER be changed (if C# had good closure behaviour). It could be due to which thread Parallel.Invoke chooses to run create() and consume() on. If create() runs on the main thread, it might change fst before consume() takes a copy of it. Or create() might be running on a separate thread and taking a copy of fst. Basically, as much as I love C#, it is an utter pain in this regard, so just work around it and treat all variables involved in a closure as immutable.
To get it working:
//Replace
WorkItem fst = null
//with
WorkItem fst = WorkItem.GetSpecialBlankFirstItem();
//And
if (fst == null) fst = tmp;
//with
if (fst.Next == null) fst.Next = tmp;

A thread is allowed by the spec to cache a value indefinitely.
See "Can a C# thread really cache a value and ignore changes to that value on other threads?" and also http://www.yoda.arachsys.com/csharp/threads/volatility.shtml
