Introduction to threading - Process xml file

Introduction to threading - Process xml file - c#

I have never written multi-threaded code before (barring a few basic backgroundworker tricks) and am hoping for some guidance about how I would approach my problem.
I have an XML file which is a serialized List<Stock>. For each one of these stock items I need to perform a webservice call called UpdatePrice().
What I want to do is take each one of these items, create a threadpool (who's size depends on the amount of rows I will need to process) and begin making webservice calls.
I am not asking for a complete solution (obviously) but would really appreciate some guidance about how one would typically solve this problem.
The biggest issue that I see arising is how I would designate which threads would work on which objects. Do I simply take the list divide it by the number of threads I make and split the work? Or am I better off allowing each thread to arbitrarily pick an item from the list to process? (Then I have locking issues but as a plus can ensure no thread is idle)
As I said before I am not looking for a complete solution but just some basic guidance on where to start because honestly I am lost on this one and haven't written a single line of code.
PS: Also are autogenerated webservice proxies in .NET threadsafe?

I would suggest looking into TPL and PLINQ for a solution. A simple example solution using Parallel.ForEach() could look like this (parallel calls limited to 5 in the example).
List<Stock> stocks;
Parallel.ForEach(stocks,
new ParallelOptions() { MaxDegreeOfParallelism = 5 },
(stock) =>
{
float newPrice = UpdatePrice(stock.TickerSymbol); //web service call
stock.Price = newPrice;
});

i would:
First read the whole XML data synchronously.
Then, i would put each element to be processed in a single queue.
Then, you can spawn N processing threads, in which at the beginning of each one, it would "pop" an element of your queue, wrapping this specific piece of code in a mutex / semaphore (Google C# mutex, or concurrent access, or anything related). This is easily done in C# with the "lock" keyword on an arbitrary object.
Hope this helps.
Pierre.

There's no point in using threads here. A thread can only give you one resource: more cpu cycles, provided that you have a CPU with multiple cores. That is not the resource that you need to speed up your program. You need a faster Internet connection.
If you have an UI you don't want frozen then the BackgroundWorker tricks will work just fine.

Related

Design thoughts required about concurrent processing

I have a series of calculations that need to be processed - the calculations and the order they run are all defined by the user on the UI.
If they just ran one after each other, it wouldn't be too hard. However, some of the calculations need to be processed concurrently and all calculations must have the ability to be individually paused at any time. I also need to be able to re-arrange orders or add new calculations to be processed at any time. So whatever I do must be flexible enough to handle this.
On the UI, imagine a listbox (a queue, if you like) of usercontrols - with each usercontrol displaying the name of the calculation and a pause button. And I can add calculations to this list at any time during processing.
What is the best way to do this?
Should I be running each calculation in its own thread? If so, how should I store the list of running processes? How will I pass the queue to the calculation processor? How will I be able to ensure that every time the queue changes (new ordering or new calculation) the calculation processor will be made aware of this?
My initial thoughts were to have:
CalcProcessor class
CalcCalculation class
In CalcProcessor have 2 Lists of CalcCalculations. One being the "queue" as shown on the UI (perhaps a pointer to it? Or some other way to ensure it updates live), and the other being the list of currently running calculations.
Somehow I need to get the CalcCalculation to be running in its own thread to process the calculation, and be able to handle any pause events. So I need some way to transmit the info of the Pause button being pressed in the UI to the CalcProcessor object, and then into the correct CalcCalculation.
Edit in response to David Hope:
Thanks for your reply.
Yes, there are n calculations but this could change at any time due to being able to add more calculations to process on the UI.
They do not need to share data in anyway. There will be a setting in the application to specify how many should run concurrently (ie. 10 at any given time, the first 10 in the queue for example - and when 1 finishes the next calculation in the queue will start processing).
The calculation will involve taking data from some data source - it could be a database or a file, and analysing it and performing some calculations on that data. When I say the calculation needs to be paused, I don't mean pausing the thread... I just mean (for example, as I haven't written this part of the application yet) if it is reading row by row from a database and doing some live calculations pausing at the completion of processing the current row... and continuing on when the pause button is unclicked on the UI - which could be done with something as primitive as a while(notPaused) loop providing I can get the Pause information from the UI into the thread.

There are several questions here:
How to synchronize the UI and the model?
I think you got this one backwards. Your model shouldn't have a “pointer” to the queue you're showing in the UI. Instead, the queue should be in your model and you should use databinding together with INotifyPropertyChange and ObservableCollection to show the queue on the UI. (At least that's how it's done in WPF.)
This way, you can manipulate your queue directly from your model, and it will automatically show on the UI.
How to start and monitor calculations?
I think Tasks are ideal for this. You can start a Task using Task.Factory.StartNew(). Since it seems your Tasks will take long to execute, you might consider using TaskCreationOptions.LongRunning. You can also use the Task to find out when is the calculation complete (or if it failed with an exception).
How to pause running calculations?
You can use ManualReserEventSlim for that. Normally, it would be set, but if you wanted to pause a running Task, you would Reset() it. The calculation will need to periodically call Wait() on that event. It's not possible to reasonably pause a running thread without cooperation from the calculation on that thread.
If you were using C# 5.0, a better approach would be to use something like PauseToken.

In Framework 4.5, the answer here is the Async API, which removes the need to manage threads. For details, look at the async/await keywords.
From a broader perspective, a "CalcProcessor" class is a good idea, but I think the Task object will suffice to replace your "CalcCalculation" class. The Processor can simply have an Enumerable of Tasks. The Processor can expose methods for managing the queue, if needed, as well as returning information about its status. When your application finally reaches a state where it must have the results, you can use the AwaitAll method to block the CalcProcessor's thread until all of the tasks complete.
Without more information about the actual goal here, it's hard to give better advice.

You can use Observer Pattern to display results on UI and order changes back in to Processor. State and Command patterns will help you to start, pause, cancel the calculations. These patterns have great answers to your questions in design way. Concurrency is still a problem, they do not answer multi-threading problems but they open an easier road to manage threads.

I suggest that you haven't broken the problem down far enough, which is the reason you are frustrated.
You need to start small and build up from there. You mention, but don't define your actual requirements, but they seem to be...
Need to be able to run ?N? calculations
Some need to be run concurrently (does this imply that they share data, if so how are you going to share the data)
Must be able to pause the calculation (don't use Thread.Suspend, as it potentially leaves a thread in an unstable state, particularly bad if you are sharing data), so you will need to build pause points into each calculation. Also need to consider how you are going to communicate the pause/unpause to the calculation
As far as methods, there are several to consider...
Threads are an obvious choice, but require careful tending too (starting, pausing, stopping, etc...)
You could also use BackGroundWorker or possibly Parallel.ForEach
BackGroundWorker contains the framework for cancelling the worker and providing progress (which can be useful).
My recommendation to start would be to go with BackGroundWorker, potentially subclass it to add the Pause/Resume functionality you need. Determine how you are going to manage data sharing (at least use lock to protect against simultaneous access).
You may find BackGroundWorker too restrictive and need to go with Threads, but I'm usually able to avoid it.
If you post more clear requirements, or samples of what you've tried and didn't work, I'll be happy to help more.

For queue you can use heap data structure (priority queue). This will help prioritize yours tasks. Also you should use Thread Pool for effectively calculations. And try to split you tasks to little parts.

Parallel programming in games

in my game I have entities (about 2000-5000) that do very performance heavy calculations every frame (or every 2nd or 3rd).
I know that parallel programming will help in this scenario.
The question is how do I use multiple threads / cpu-cores in the best way possible.
I tried the static Parallel class in .NET, but the startup cost for every Parallel.For / Parallel.ForEach is simply to big.
Does it start a number of threads each time I call a function of the Parallel class ??
The game's target fps is 60 fps, so whatevery approach I use, it should never involve starting multiple threads every game-frame!
So my question is:
What other alternatives are there to parallel-process multiple entities?
Should I create the threads myself, and if so, what pitfalls are there?
Is the threadpool a good alternative?

By default the Parallel.For/Parallel.Foreach uses ThreadPool thread. Infact TPL api like Parallel class, Task instance etc uses ThreadPool thread.
The number of threads started by TPL will depend upon lots of factors like number of available ideal thread in threadpool etc. So assuming TPL will create lots of threads as soon as you will submit request will not be entirely correct.
Important point is that you can control, the number of threads used the TPL for executing the tasks via MaxDegreeOfParallelism flag. By setting this value to say number "X", you limit the number of concurrent items being processed to "X". Personally, I feel this is a very powerful feature of TPL library.
Also, if you do not want your task to block other requests in queue, then you can set the TaskCreationOptions to be of type "LongRunning", in that case, TPL will run the task on a dedicated thread.
In your case, you can try using Parallel.For/ForEach and set the MaxDegreeOfParalleleism value to some optimal value and see how that behaves.

You need some computers or servers to test your parallel programmed game. Other wise on a single computer, since it will work on thread, it will cost more.
Additionally, since the game is calculated and drawn 60 times in a second, working on threads is not logical, your game is already running on threads.
Maybe you have to change or optimize your algorithm. Do not let everything is done by CPU, use GPU as much as possible.

There isn't a simple answer.
When every entity of your game can be calculated without knowing any other entity then you can simply move that calcultion to a given number of threads. A threadpool would be a good start.
You could set this up when your programm starts.
But even when you do so, you have to check if switching threads performs better than having one single thread. Context switch is here the keyword you can search for.
But what happens when entity A on thread A does a calculation with affects entity B on thread B and forces entity B to be destroyed which forces entity C on thread C to be destroyed also?
And now to make it harder :P entity C was calculated before entity B, before entity A...
What should happen? Is it ok that those entities will be cleaned up in 2 or 3 frames in the future?
Does it affect or break your game logic?
If so then things are getting complicated.

C# multi threading query

I am trying to write a program in C# that will connect to around 400 computers and retrieve some information, lets say it retrieves the list of web services running on each computer.
I am assuming I need a well threaded application to be able to retrieve info from such a huge number of servers really quick. I am pretty blank on how to start working on this, can you guys give me a head start as to how to begin!
Thanks!

I see no reason why you should use threading in your main logic. Use asynchronous APIs and schedule their callback to the main thread. That way you get the benefits of asynchrony, but without most of the difficulty related to threading.
You'll only need multithreading in your logic code if the work you need to do on the data is that expensive. And even then you usually can get aways with parallelizing using side effect free functions.

Take a look at the Task Parallel Library.
Speficically Data Parallelism.
You could also use PLINQ if you wanted.

You should also execute the threads parallely on a multi-core CPU to enhance performance.
My favourite references on the topic are given below -
http://www.albahari.com/threading/
http://www.codeproject.com/KB/Parallel_Programming/NET4ParallelIntro.aspx

Where and how do you get the list of those 400 servers to query?
how often do you need to do this?
you could use a windows service or schedule a task which invoke your software and in it you could do a foreach element in the server list and start a call to such server in a different thread using thread queue/pool, but there is a maximum so you won't start 400 threads all together anyway.
describe a bit better your solution and we see what you can do :)

Take a look at this library: Task Parallel Library. You can make efficient use of your system resources and manage your work easier than managing your threads directly.

There might be considerable impact on the server side when you start query all 400 computers. But you can take a look at Parallel LINQ (PLINQ), where you can limit the degree of parallelism.
You can also use thread pooling for this matter, e.g. a Task class.
Createing manual threads may not be a good idea, as they are not highly reusable and take quite a lot of memory/CPU to be created

Using multithreading for loop

I'm new to threading and want to do something similar to this question:
Speed up loop using multithreading in C# (Question)
However, I'm not sure if that solution is the best one for me as I want them to keep running and never finish. (I'm also using .net 3.5 rather than 2.0 as for that question.)
I want to do something like this:
foreach (Agent agent in AgentList)
{
// I want to start a new thread for each of these
agent.DoProcessLoop();
}
---
public void DoProcessLoop()
{
while (true)
{
// do the processing
// this is things like check folder for new files, update database
// if new files found
}
}
Would a ThreadPool be the best solution or is there something that suits this better?
Update: Thanks for all the great answers! I thought I'd explain the use case in more detail. A number of agents can upload files to a folder. Each agent has their own folder which they can upload assets to (csv files, images, pdfs). Our service (it's meant to be a windows service running on the server they upload their assets to, rest assured I'll be coming back with questions about windows services sometime soon :)) will keep checking every agent's folder if any new assets are there, and if there are, the database will be updated and for some of them static html pages created. As it could take a while for them to upload everything and we want them to be able to see their uploaded changes pretty much straight away, we thought a thread per agent would be a good idea as no agent then needs to wait for someone else to finish (and we have multiple processors so wanted to use their full capacity). Hope this explains it!
Thanks,
Annelie

Given the specific usage your describe (watching for files), I'd suggest you use a FileSystemWatcher to determine when there are new files and then fire off a thread with the threadpool to process the files until there are no more to process -- at which point the thread exits.
This should reduce i/o (since you're not constantly polling the disk), reduce CPU usage (since the constant looping of multiple threads polling the disk would use cycles), and reduce the number of threads you have running at any one time (assuming there aren't constant modifications being made to the file system).
You might want to open and read the files only on the main thread and pass the data to the worker threads (if possible), to limit i/o to a single thread.

I believe that the Parallels Extensions make this possible:
Parallel.Foreach
http://msdn.microsoft.com/en-us/library/system.threading.tasks.parallel.foreach.aspx
http://blogs.msdn.com/pfxteam/

One issue with ThreadPool would be that if the pool happens to be smaller than the number of Agents you would like to have, the ones you try to start later may never execute. Some tasks may never begin to execute, and you could starve everything else in your app domain that uses the thread pool as well. You're probably better off not going down that route.

You definitely don't want to use the ThreadPool for this purpose. ThreadPool threads are not meant to be used for long-running tasks ("infinite" counts as long-running), since that would obviously tie up resources meant to be shared.
For your application, it would probably be better to create one thread (not from the ThreadPool) and in that thread execute your while loop, inside of which you iterate through your Agents collection and perform the processing for each one. In the while loop you should also use a Thread.Sleep call so you don't max out the processor (there are better ways of executing code periodically, but Thread.Sleep will work for your purposes).
Finally, you need to include some way for the while loop to exit when your program terminates.
Update: Finally finally, multi-threading does not automatically speed up slow-running code. Nine women can't make a baby in one month.

A thread pool is useful when you expect threads to be coming into and out of existence fairly regularly, not for a predefined set number of threads.

Hmm.. as Ragoczy points out, its better to use FileSystemWatcher to monitor the files. However, since you have additional operations, you may think in terms of multithreading.
But beware, no matter how many processers you have, there is a limit to it's capacity. You may not want to create as many threads as the number of concurrent users, for the simple reason that your number of agents can increase.

Until you upgrade to .NET 4, the ThreadPool might be your best option. You may also want to use a Semaphore and a AutoResetEvent to control the number of concurrent threads. If you're talking about long-running work then the overhead of starting up and managing your own threads is low and the solution is more elegant. That will allow you to use a WorkerThread.Join() so you can make sure all worker threads are complete before you resume execution.

C# Multithreading File IO (Reading)

We have a situation where our application needs to process a series of files and rather than perform this function synchronously, we would like to employ multi-threading to have the workload split amongst different threads.
Each item of work is:
1. Open a file for read only
2. Process the data in the file
3. Write the processed data to a Dictionary
We would like to perform each file's work on a new thread?
Is this possible and should be we better to use the ThreadPool or spawn new threads keeping in mind that each item of "work" only takes 30ms however its possible that hundreds of files will need to be processed.
Any ideas to make this more efficient is appreciated.
EDIT: At the moment we are making use of the ThreadPool to handle this. If we have 500 files to process we cycle through the files and allocate each "unit of processing work" to the threadpool using QueueUserWorkItem.
Is it suitable to make use of the threadpool for this?

I would suggest you to use ThreadPool.QueueUserWorkItem(...), in this, threads are managed by the system and the .net framework. The chances of you meshing up with your own threadpool is much higher. So I would recommend you to use Threadpool provided by .net .
It's very easy to use,
ThreadPool.QueueUserWorkItem(new WaitCallback(YourMethod), ParameterToBeUsedByMethod);
YourMethod(object o){
Your Code here...
}
For more reading please follow the link http://msdn.microsoft.com/en-us/library/3dasc8as%28VS.80%29.aspx
Hope, this helps

I suggest you have a finite number of threads (say 4) and then have 4 pools of work. I.e. If you have 400 files to process have 100 files per thread split evenly. You then spawn the threads, and pass to each their work and let them run until they have finished their specific work.
You only have a certain amount of I/O bandwidth so having too many threads will not provide any benefits, also remember that creating a thread also takes a small amount of time.

Instead of having to deal with threads or manage thread pools directly I would suggest using a higher-level library like Parallel Extensions (PEX):
var filesContent = from file in enumerableOfFilesToProcess
select new
{
File=file,
Content=File.ReadAllText(file)
};
var processedContent = from content in filesContent
select new
{
content.File,
ProcessedContent = ProcessContent(content.Content)
};
var dictionary = processedContent
.AsParallel()
.ToDictionary(c => c.File);
PEX will handle thread management according to available cores and load while you get to concentrate about the business logic at hand (wow, that sounded like a commercial!)
PEX is part of the .Net Framework 4.0 but a back-port to 3.5 is also available as part of the Reactive Framework.

I suggest using the CCR (Concurrency and Coordination Runtime) it will handle the low-level threading details for you. As for your strategy, one thread per work item may not be the best approach depending on how you attempt to write to the dictionary, because you may create heavy contention since dictionaries aren't thread safe.
Here's some sample code using the CCR, an Interleave would work nicely here:
Arbiter.Activate(dispatcherQueue, Arbiter.Interleave(
new TeardownReceiverGroup(Arbiter.Receive<bool>(
false, mainPort, new Handler<bool>(Teardown))),
new ExclusiveReceiverGroup(Arbiter.Receive<object>(
true, mainPort, new Handler<object>(WriteData))),
new ConcurrentReceiverGroup(Arbiter.Receive<string>(
true, mainPort, new Handler<string>(ReadAndProcessData)))));
public void WriteData(object data)
{
// write data to the dictionary
// this code is never executed in parallel so no synchronization code needed
}
public void ReadAndProcessData(string s)
{
// this code gets scheduled to be executed in parallel
// CCR take care of the task scheduling for you
}
public void Teardown(bool b)
{
// clean up when all tasks are done
}

In the long run, I think you'll be happier if you manage your own threads. This will let you control how many are running and make it easy to report status.
Build a worker class that does the processing and give it a callback routine to return results and status.
For each file, create a worker instance and a thread to run it. Put the thread in a Queue.
Peel threads off of the queue up to the maximum you want to run simultaneously. As each thread completes go get another one. Adjust the maximum and measure throughput. I prefer to use a Dictionary to hold running threads, keyed by their ManagedThreadId.
To stop early, just clear the queue.
Use locking around your thread collections to preserve your sanity.

Use ThreadPool.QueueUserWorkItem to execute each independent task. Definitely don't create hundreds of threads. That is likely to cause major headaches.

The general rule for using the ThreadPool is if you don't want to worry about when the threads finish (or use Mutexes to track them), or worry about stopping the threads.
So do you need to worry about when the work is done? If not, the ThreadPool is the best option. If you want to track the overall progress, stop threads then your own collection of threads is best.
ThreadPool is generally more efficient if you are re-using threads. This question will give you a more detailed discussion.
Hth

Using the ThreadPool for each individual task is definitely a bad idea. From my experience this tends to hurt performance more than helping it. The first reason is that a considerable amount of overhead is required just to allocate a task for the ThreadPool to execute. By default, each application is assigned it's own ThreadPool that is initialized with ~100 thread capacity. When you are executing 400 operations in a parallel, it does not take long to fill the queue with requests and now you have ~100 threads all competing for CPU cycles. Yes the .NET framework does a great job with throttling and prioritizing the queue, however, I have found that the ThreadPool is best left for long-running operations that probably won't occur very often (loading a configuration file, or random web requests). Using the ThreadPool to fire off a few operations at random is much more efficient than using it to execute hundreds of requests at once. Given the current information, the best course of action would be something similar to this:
Create a System.Threading.Thread (or use a SINGLE ThreadPool thread) with a queue that the application can post requests to
Use the FileStream's BeginRead and BeginWrite methods to perform the IO operations. This will cause the .NET framework to use native API's to thread and execute the IO (IOCP).
This will give you 2 leverages, one is that your requests will still get processed in parallel while allowing the operating system to manage file system access and threading. The second is that because the bottleneck of the vast majority of systems will be the HDD, you can implement a custom priority sort and throttling to your request thread to give greater control over resource usage.
Currently I have been writing a similar application and using this method is both efficient and fast... Without any threading or throttling my application was only using 10-15% CPU, which can be acceptable for some operations depending on the processing involved, however, it made my PC as slow as if an application was using 80%+ of the CPU. This was the file system access. The ThreadPool and IOCP functions do not care if they are bogging the PC down, so don't get confused, they are optimized for performance, even if that performance means your HDD is squeeling like a pig.
The only problem I have had is memory usage ran a little high (50+ mb) during the testing phaze with approximately 35 streams open at once. I am currently working on a solution similar to the MSDN recommendation for SocketAsyncEventArgs, using a pool to allow x number of requests to be operating simultaneously, which ultimately led me to this forum post.
Hope this helps somebody with their decision making in the future :)

Develop Reference

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.