IO Bound Operation and Task.Run() - C#

I am quite new to concurrency (and C#, actually). I have a bunch of CSV files in two separate directories to read, and I want to do some processing after reading each file. The processing is independent of the other read and process operations. After all the processing is done, I want to update the UI. The UI needs to stay responsive in the meantime, because I want to display a progress bar. Currently I have something like this:
private string _directoryA;
private string _directoryB;

// The user clicks the button
private void ButtonPressed()
{
    Task.Run(() => DoJob());
}

private void DoJob()
{
    var tasks = new List<Task>();
    var watch = Stopwatch.StartNew();
    tasks.Add(Task.Run(() => DoJobForDirectory(_directoryA)).ContinueWith(t => Console.WriteLine("First Half")));
    tasks.Add(Task.Run(() => DoJobForDirectory(_directoryB)).ContinueWith(t => Console.WriteLine("Second Half")));
    Task.WaitAll(tasks.ToArray());
    watch.Stop();
    Console.WriteLine($"Time Taken : {watch.ElapsedMilliseconds} ms.");
    UpdateUI();
}

private void DoJobForDirectory(string directory)
{
    var files = Directory.EnumerateFiles(directory, "*.csv");
    var tasks = new List<Task>();
    foreach (var file in files)
    {
        // Update the progress bar in the UI when a file has finished processing
        tasks.Add(Task.Run(() => DoJobForFile(file)).ContinueWith(t => UpdateCounter++));
    }
    Task.WaitAll(tasks.ToArray());
}

private void DoJobForFile(string filePath)
{
    ReadCSV();
    ProcessData();
    ...
}
I feel like I am missing something here. From my reading, this operation should be I/O bound, since the processing afterwards is pretty lightweight (some for loops and assignments). So I really should be using just async/await rather than Task.Run()...? However, I couldn't think of a better way to do this. The ReadCSV() comes from a library that does not have an async version. Using Parallel.ForEach does not boost the performance either. Is there a better way to do this (to be efficient with resources and also achieve better performance)?
Also, when I ran on only one directory, the elapsed time was nearly half the time required for both directories. Since the operations are all independent, I want to run them all in parallel, so processing both directories should take roughly the same (or only slightly more) time as processing a single directory, not twice as long. It seems like no matter how many Task.Run() calls I make, only a limited number of threads run at the same time (some bottleneck). I tried changing all the Task.Run() calls to new Thread(), and observed that many more threads were active at the same time, but that ended up with worse performance. Why is that?

Task.Run schedules work on the ThreadPool, which is conservative about how many threads it creates immediately on demand (as many as the machine has cores) and about how frequently it creates new threads when the demand for work is high (roughly one new thread per second). You could try experimenting with the ThreadPool.SetMinThreads method, which affects this behavior. For example:
ThreadPool.SetMinThreads(100, 100);
This way the ThreadPool will create up to 100 threads immediately on demand, before switching back to its conservative algorithm.
Chances are that you'll see no improvement in the performance of your directory-processing application. That's because your I/O-bound workload is throttled by the capabilities of your storage device. No matter what you do in code, the hardware has a limit on how much data it can store or retrieve per unit of time. When you hit this limit, the only way to boost performance is to upgrade your hardware.
Regarding the suitability of using Task.Run and synchronous APIs for I/O-bound work: surprisingly, in many cases it's the most performant way of getting the job done. The synchronous file-system APIs in particular are significantly faster than their asynchronous counterparts. What you lose with the synchronous APIs is memory efficiency. Each thread requires at least 1 MB of memory for its stack, so if you start 1,000 threads at once you'll deprive your system of 1 GB of memory or more, which can indirectly hurt the performance of your application.
Starting tasks manually with Task.Run for the purpose of parallelization is a low-level approach to parallelizing your work. The TPL offers higher-level task-based tools, like the Parallel class, the PLINQ library (.AsParallel), and the TPL Dataflow library.
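For example, a minimal sketch of what the Parallel class could look like here (reusing DoJobForFile and the directory fields from the question; the degree of parallelism is an arbitrary assumption, and as noted above this may not help if the storage device is the bottleneck):
var files = Directory.EnumerateFiles(_directoryA, "*.csv")
    .Concat(Directory.EnumerateFiles(_directoryB, "*.csv"));
var options = new ParallelOptions { MaxDegreeOfParallelism = 4 }; // arbitrary limit, tune for your hardware
Parallel.ForEach(files, options, file => DoJobForFile(file));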
For updating the UI with progress information during background work, the modern approach is the IProgress<T> interface and the Progress<T> class. You can find an example here, as part of a comparison between Task.Run and the BackgroundWorker class.
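A minimal sketch of that pattern, adapted to the code in the question (progressBar is a hypothetical WinForms control; Progress<T> captures the UI SynchronizationContext when constructed on the UI thread, so the handler runs on the UI thread):
// Construct on the UI thread so the handler is posted back to it.
IProgress<int> progress = new Progress<int>(count => progressBar.Value = count);
int processed = 0;
await Task.Run(() => Parallel.ForEach(files, file =>
{
    DoJobForFile(file);
    progress.Report(Interlocked.Increment(ref processed)); // thread-safe counter
}));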

Using Task.Run(() => DoJob()) together with Task.WaitAll() wastes a thread.
I would change it to this:
private string _directoryA;
private string _directoryB;

// The user clicks the button
private async void ButtonPressed()
{
    // disable UI controls
    try
    {
        await DoJob();
    }
    finally
    {
        // enable UI controls.
    }
}

private async Task DoJob()
{
    var tasks = new List<Task>();
    var watch = Stopwatch.StartNew();
    tasks.Add(DoJobForDirectory(_directoryA).ContinueWith(t => Console.WriteLine("First Half")));
    tasks.Add(DoJobForDirectory(_directoryB).ContinueWith(t => Console.WriteLine("Second Half")));
    await Task.WhenAll(tasks.ToArray());
    watch.Stop();
    Console.WriteLine($"Time Taken : {watch.ElapsedMilliseconds} ms.");
    UpdateUI();
}

private async Task DoJobForDirectory(string directory)
{
    var files = Directory.EnumerateFiles(directory, "*.csv");
    var tasks = new List<Task>();
    foreach (var file in files)
    {
        // Update the progress bar in the UI when a file has finished processing
        tasks.Add(Task.Run(() => DoJobForFile(file)).ContinueWith(t => UpdateCounter++));
    }
    await Task.WhenAll(tasks.ToArray());
}

private void DoJobForFile(string filePath)
{
    ReadCSV();
    ProcessData();
    ...
}
If you want to limit the number of concurrent tasks, you could use a SemaphoreSlim. There is a good example in the accepted answer of:
How to limit the Maximum number of parallel tasks in c#
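For reference, a minimal sketch of that technique applied to the code in the question (the limit of 4 concurrent files is an arbitrary assumption):
private readonly SemaphoreSlim _throttler = new SemaphoreSlim(4); // at most 4 files at a time

private async Task DoJobForFileThrottled(string filePath)
{
    await _throttler.WaitAsync();
    try
    {
        await Task.Run(() => DoJobForFile(filePath));
    }
    finally
    {
        _throttler.Release();
    }
}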

Related

Multithreading in C# increasing CPU usage

I am using multiple threads to invoke a function as below. At the table level there is a queue number, and MAX_Download_Thread is set according to that queue number, so that many threads are created and run continuously. When I set MAX_Download_Thread to 4 it consumes 30% of CPU on an 8-processor machine. When I make it 10 it consumes almost 70%. I just want to know whether there is any better method to reduce this, or whether this is normal.
protected override void OnStart(string[] args)
{
    for (int i = 1; i <= MAX_Download_THREAD; i++)
    {
        int j = i;
        var fileDownloadThread = new Thread(() =>
        {
            new Repo.FILE_DOWNLOAD().PIC_FILE_DOWNLOAD(j);
        });
        fileDownloadThread.Start();
    }
}
The point of using multiple threads is to achieve higher utilization of multicore CPUs and complete a task in a shorter amount of time. The higher usage is to be expected, nay desired.
If the work your threads are doing is I/O-bound (e.g. writing large amounts of data to a disk), it might end up taking longer than with fewer threads, as data can only be written to a disk serially.
Using multiple threads does use more CPU. In this case though, that's simply wasted. IO operations like file operations and downloads are carried out by the OS itself, not application code. The application threads just wait for the OS to retrieve data over the network.
To execute multiple downloads concurrently you need asynchronous operations. One way to do this is to use Parallel.ForEachAsync. The following code will download files in parallel from a list and save them locally:
HttpClient _client = new HttpClient();

async Task DownloadUrl(Uri url, DirectoryInfo folder)
{
    var fileName = url.Segments.Last();
    var newPath = Path.Combine(folder.FullName, fileName);
    var stream = await _client.GetStreamAsync(url);
    using var localStream = File.Create(newPath);
    await stream.CopyToAsync(localStream);
}

async Task DownloadConcurrently(IEnumerable<Uri> urls, DirectoryInfo folder)
{
    await Parallel.ForEachAsync(urls, async (url, _) =>
    {
        await DownloadUrl(url, folder);
    });
}
This will execute DownloadUrl multiple times in parallel. The default number of concurrent tasks is equal to Environment.ProcessorCount, but it can be adjusted:
ParallelOptions parallelOptions = new()
{
    MaxDegreeOfParallelism = 3
};

await Parallel.ForEachAsync(urls, parallelOptions, async (url, _) =>
{
    await DownloadUrl(url, folder);
});

How to scale an application with 50000 Simultaneous Tasks

I am working on a project which needs to be able to run (for example) 50,000 tasks simultaneously. Each task will run at some frequency (say 5 minutes) and will be either a URL ping or an HTTP GET request. My initial plan was to create a thread for each task. I ran a basic test to see if this was possible given available system resources. I ran the following code as a console app:
public class Program
{
    public static void Test1()
    {
        Thread.Sleep(1000000);
    }

    public static void Main(string[] args)
    {
        for (int i = 0; i < 50000; i++)
        {
            Thread t = new Thread(new ThreadStart(Test1));
            t.Start();
            Console.WriteLine(i);
        }
    }
}
Unfortunately, though it started very fast, performance dropped off greatly around the 2,000-thread mark. By 5,000, I could count faster than the program could create threads. This makes getting to 50,000 seem unlikely to be possible. Am I on the right track, or should I try something else? Thanks
Many people have the idea that you need to spawn n threads if you want to handle n tasks in parallel. Most of the time a computer is waiting, it is waiting on I/O such as network traffic, disk access, memory transfer for GPU compute, hardware device to complete an operation, etc.
Given this insight, we can see that a viable solution to handling as many tasks in parallel as possible for a given hardware platform is to pipeline work: place work in a queue and process it using as many threads as possible. Usually, this means 1-2 threads per virtual processor.
In C# we can accomplish this with the Task Parallel Library (TPL):
class Program
{
    static Task RunAsync(int x)
    {
        return Task.Delay(10000);
    }

    static async Task Main(string[] args)
    {
        var tasks = Enumerable.Range(0, 50000).Select(x => RunAsync(x));
        Console.WriteLine("Waiting for tasks to complete...");
        await Task.WhenAll(tasks);
        Console.WriteLine("Done");
    }
}
This queues 50,000 work items and waits until all 50,000 tasks are complete. The tasks execute on only as many threads as are needed. Behind the scenes, a task scheduler examines the pool of work and has threads steal work from the queue when they need a task to execute.
Additional Considerations
With a large upper bound (n=50000) you should be cognizant of memory pressure, garbage collector activity, and other task-related overhead. You should consider the following:
Consider using ValueTask<T> to minimize allocations, especially for synchronous operations
Use ConfigureAwait(false) where possible to reduce context switching
Use CancellationTokenSource and CancellationToken to cancel requests early (e.g. a timeout; see the sketch after this list)
Follow best practices
Avoid awaiting inside of a loop where possible
Avoid querying tasks too frequently for completion
Avoid accessing Task<T>.Result before a task is complete to prevent blocking
Avoid deadlocks by using synchronization primitives (mutex, semaphore, condition signal, synclock, etc) as appropriate
Avoid frequent use of Task.Run to create tasks to avoid exhausting the thread pool available to the default task scheduler (this method is usually reserved for compute-bound tasks)
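As a sketch of the early-cancellation point above (the 30-second timeout and the PingAsync helper are assumptions for illustration):
// Cancel all outstanding requests if the batch takes longer than 30 seconds.
using var cts = new CancellationTokenSource(TimeSpan.FromSeconds(30));
var client = new HttpClient();

async Task PingAsync(Uri url)
{
    try
    {
        using var response = await client.GetAsync(url, cts.Token);
        Console.WriteLine($"{url}: {(int)response.StatusCode}");
    }
    catch (OperationCanceledException)
    {
        Console.WriteLine($"{url}: cancelled or timed out");
    }
}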

Multithreading/Concurrent strategy for a network based task

I am not a pro at utilizing resources well, hence I am seeking the best way to do a task in parallel and efficiently.
We have a scenario wherein we have to ping millions of systems and receive a response. The response itself takes no time in computation, but the task is network-based.
My current implementation looks like this -
Parallel.ForEach(list, ip =>
{
    try
    {
        // var record = client.QueryAsync(ip);
        var record = client.Query(ip);
        results.Add(record);
    }
    catch (Exception)
    {
        failed.Add(ip);
    }
});
I tested this code for
100 items it takes about 4 secs
1k items it takes about 10 secs
10k items it takes about 80 secs
100k items it takes about 710 secs
I need to process close to 20M queries. What strategy should I use to speed this up further?
Here is the problem
Parallel.ForEach uses the thread pool. Moreover, IO bound operations will block those threads waiting for a device to respond and tie up resources.
If you have CPU bound code, Parallelism is appropriate;
Though if you have IO bound code, Asynchrony is appropriate.
In this case, client.Query is clearly I/O, so the ideal consuming code would be asynchronous.
Since you said there was an async version, you are best to use the async/await pattern and/or some type of limit on concurrent tasks. Another neat solution is to use the ActionBlock class in the TPL Dataflow library.
Dataflow example
public static async Task DoWorkLoads(List<IPAddress> addresses)
{
    var options = new ExecutionDataflowBlockOptions
    {
        MaxDegreeOfParallelism = 50
    };

    var block = new ActionBlock<IPAddress>(MyMethodAsync, options);

    foreach (var ip in addresses)
        block.Post(ip);

    block.Complete();
    await block.Completion;
}
...
public async Task MyMethodAsync(IPAddress ip)
{
    try
    {
        var record = await client.QueryAsync(ip);
        // note this is not thread safe, best to lock it
        results.Add(record);
    }
    catch (Exception)
    {
        // note this is not thread safe, best to lock it
        failed.Add(ip);
    }
}
This approach gives you asynchrony and MaxDegreeOfParallelism; it doesn't waste resources, and it lets IO be IO without chewing up unnecessary resources.
*Disclaimer: Dataflow may not be where you want to be; I just thought I'd give you some more information.
Demo here
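As an aside, the "not thread safe" caveats in the code above can be avoided without locking by using concurrent collections; a sketch, where Record stands in for whatever type client.QueryAsync returns:
// ConcurrentBag<T>.Add is thread safe, so no explicit lock is needed.
private readonly ConcurrentBag<Record> results = new ConcurrentBag<Record>();
private readonly ConcurrentBag<IPAddress> failed = new ConcurrentBag<IPAddress>();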
Update
I just did some benchmarking with Parallel.ForEach and Dataflow.
Run multiple times, 10,000 pings:
Parallel.ForEach = 30 seconds
Dataflow = 10 seconds

multithreading in winforms application

I'm writing a WinForms app that uses the report viewer for the creation of multiple PDF files. These PDF files are divided into 4 main parts, each part responsible for the creation of a specific report. These processes create from a minimum of 1 file up to the number of users (currently 50).
The program already exists and uses these 4 methods sequentially. For extra performance as the number of users grows, I want to separate these methods from the main process into 4 separate threads.
While I'm new to multithreading in C#, I have read a number of articles on how to achieve this. The only thing I'm not sure of is which way to start. As I read multiple blog posts, I'm not sure whether to use 4 separate threads, a thread pool, or multiple BackgroundWorkers (or should parallel programming be the best way?). Blog posts tell me to use a thread pool for more than 3 threads, but on the other hand they tell me to use a BackgroundWorker in WinForms. Which option is best (and why)?
At the end my main thread has to wait for all processes to end before continuing.
Can someone tell me what's the best solution to my problem.
* Extra information after edit *
Something I forgot to mention (after I read all your comments and possible solutions): the methods share one IEnumerable, for reading only. After firing the methods (which don't have to run sequentially), they trigger events to send status updates to the UI. I think triggering events is difficult, if not impossible, from separate threads, so there should be some kind of callback function to report status updates while running.
Some example in pseudocode:
main()
{
    private List<customclass> lcc = importCustomClass()
    export.CreatePDFKind1.create(lcc.First(), exportfolderpath, arg1)
    export.CreatePDFKind2.create(lcc, exportfolderpath)
    export.CreatePDFKind3.create(lcc.First(), exportfolderpath)
    export.CreatePDFKind4.create(customclass2, exportfolderpath)
}

namespace export
{
    class CreatePDFKind1
    {
        create(customclass cc, string folderpath)
        {
            do something;
            reportstatus(listviewItem, status, message)
        }
    }

    class CreatePDFKind2
    {
        create(IEnumerable<customclass> lcc, string folderpath)
        {
            foreach (var x in lcc)
            {
                do something;
                reportstatus(listviewItem, status, message)
            }
        }
    }
    etc.......
}
From the very basic picture you have described, I would use the Task Parallel Library (TPL), shipped with .NET Framework 4.0+.
You talk about the 'best' option being thread pools when spawning a large-to-medium number of threads. Despite this being correct [the most efficient way of managing the resources], the TPL does all of this for you, without you having to worry about a thing. The TPL also makes the use of multiple threads and waiting on their completion a doddle too...
To do what you require I would use the TPL and Continuations. A continuation not only allows you to create a flow of tasks but also handles your exceptions. This is a great introduction to the TPL. But to give you some idea...
You can start a TPL task using
Task task = Task.Factory.StartNew(() =>
{
    // Do some work here...
});
Now to start a second task when an antecedent task finishes (in error or successfully) you can use the ContinueWith method
Task task1 = Task.Factory.StartNew(() => Console.WriteLine("Antecedant Task"));
Task task2 = task1.ContinueWith(antTask => Console.WriteLine("Continuation..."));
So as soon as task1 completes, fails or is cancelled task2 'fires-up' and starts running. Note that if task1 had completed before reaching the second line of code task2 would be scheduled to execute immediately. The antTask argument passed to the second lambda is a reference to the antecedent task. See this link for more detailed examples...
You can also pass results from the antecedent task to its continuations:
Task.Factory.StartNew<int>(() => 1)
    .ContinueWith(antTask => antTask.Result * 4)
    .ContinueWith(antTask => antTask.Result * 4)
    .ContinueWith(antTask => Console.WriteLine(antTask.Result * 4)); // Prints 64.
Note. Be sure to read up on exception handling in the first link provided as this can lead a newcomer to TPL astray.
One last thing to look at in particular for what you want is child tasks. Child tasks are those which are created as AttachedToParent. In this case the continuation will not run until all child tasks have completed
TaskCreationOptions atp = TaskCreationOptions.AttachedToParent;
Task.Factory.StartNew(() =>
{
    Task.Factory.StartNew(() => { SomeMethod(); }, atp);
    Task.Factory.StartNew(() => { SomeOtherMethod(); }, atp);
}).ContinueWith(cont => { Console.WriteLine("Finished!"); });
So in your case you would start your four tasks, then wait on their completion on the main thread.
I hope this helps.
Using a BackgroundWorker is helpful if you need to interact with the UI with respect to your background process. If you don't, then I wouldn't bother with it. You can just start 4 Task objects directly:
tasks.Add(Task.Factory.StartNew(() => DoStuff()));
tasks.Add(Task.Factory.StartNew(() => DoStuff2()));
tasks.Add(Task.Factory.StartNew(() => DoStuff3()));
If you do need to interact with the UI, possibly by updating it to reflect when the tasks are finished, then I would suggest starting one BackgroundWorker and then using tasks again to process each individual unit of work. Since there is some additional overhead in using a BackgroundWorker, I would avoid starting lots of them if you can avoid it.
BackgroundWorker bgw = new BackgroundWorker();
bgw.DoWork += (_, args) =>
{
    List<Task> tasks = new List<Task>();
    tasks.Add(Task.Factory.StartNew(() => DoStuff()));
    tasks.Add(Task.Factory.StartNew(() => DoStuff2()));
    tasks.Add(Task.Factory.StartNew(() => DoStuff3()));
    Task.WaitAll(tasks.ToArray());
};
bgw.RunWorkerCompleted += (_, args) => updateUI();
bgw.RunWorkerAsync();
You could of course use just Task methods to do all of this, but I still find BackgroundWorkers a bit simpler to work with for the simpler cases. Using .NET 4.5 you could use Task.WhenAll to run a continuation on the UI thread when all 4 tasks finish, but doing that in 4.0 wouldn't be quite as simple.
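For reference, a .NET 4.0 sketch of that continuation, using the tasks list from the snippet above (this must be set up on the UI thread so the scheduler captures the UI context):
var uiScheduler = TaskScheduler.FromCurrentSynchronizationContext();
Task.Factory.ContinueWhenAll(tasks.ToArray(),
    _ => updateUI(), // runs on the UI thread
    CancellationToken.None,
    TaskContinuationOptions.None,
    uiScheduler);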
Without further information it's impossible to tell. The fact that they're in four separate methods doesn't make much of a difference if they're accessing the same resources. The PDF file for example. If you're having trouble understanding what I mean you should post some of the code for each method and I'll go into a little more detail.
Since the number of "parts" you have is fixed it won't make a big difference whether you use separate threads, background workers or use a thread pool. I'm not sure why people are recommending background workers. Most likely because it's a simpler approach to multithreading and more difficult to screw up.

C# queueing dependent tasks to be processed by a thread pool

I want to queue dependent tasks across several flows that need to be processed in order (within each flow). The flows can be processed in parallel.
To be specific, let's say I need two queues and I want the tasks in each queue to be processed in order. Here is sample pseudocode to illustrate the desired behavior:
Queue1_WorkItem wi1a=...;
enqueue wi1a;
... time passes ...
Queue1_WorkItem wi1b=...;
enqueue wi1b; // This must be processed after processing of item wi1a is complete
... time passes ...
Queue2_WorkItem wi2a=...;
enqueue wi2a; // This can be processed concurrently with the wi1a/wi1b
... time passes ...
Queue1_WorkItem wi1c=...;
enqueue wi1c; // This must be processed after processing of item wi1b is complete
[Diagram omitted: arrows illustrating dependencies between work items]
The question is how do I do this using C# 4.0/.NET 4.0? Right now I have two worker threads, one per queue and I use a BlockingCollection<> for each queue. I would like to instead leverage the .NET thread pool and have worker threads process items concurrently (across flows), but serially within a flow. In other words I would like to be able to indicate that for example wi1b depends on completion of wi1a, without having to track completion and remember wi1a, when wi1b arrives. In other words, I just want to say, "I want to submit a work item for queue1, which is to be processed serially with other items I have already submitted for queue1, but possibly in parallel with work items submitted to other queues".
I hope this description made sense. If not please feel free to ask questions in the comments and I will update this question accordingly.
Thanks for reading.
Update:
To summarize the "flawed" solutions so far, here are the solutions from the answers section that I cannot use, and the reasons why:
TPL tasks require specifying the antecedent task for a ContinueWith(). I do not want to maintain knowledge of each queue's antecedent task when submitting a new task.
TDF ActionBlocks looked promising, but it would appear that items posted to an ActionBlock are processed in parallel. I need for the items for a particular queue to be processed serially.
Update 2:
RE: ActionBlocks
It would appear that setting the MaxDegreeOfParallelism option to one prevents parallel processing of work items submitted to a single ActionBlock. Therefore it seems that having an ActionBlock per queue solves my problem with the only disadvantage being that this requires the installation and deployment of the TDF library from Microsoft and I was hoping for a pure .NET 4.0 solution. So far, this is the candidate accepted answer, unless someone can figure out a way to do this with a pure .NET 4.0 solution that doesn't degenerate to a worker thread per queue (which I am already using).
I understand you have many queues and don't want to tie up threads. You could have an ActionBlock per queue. The ActionBlock automates most of what you need: it processes work items serially, and only starts a Task when work is pending. When no work is pending, no Task/Thread is blocked.
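For illustration, a minimal sketch of that arrangement (WorkItem and ProcessWorkItem are hypothetical stand-ins for the work items in the question):
// One ActionBlock per queue. MaxDegreeOfParallelism = 1 (the default) makes each
// block process its items serially, while separate blocks run in parallel.
var serialOptions = new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 1 };
var queue1 = new ActionBlock<WorkItem>(wi => ProcessWorkItem(wi), serialOptions);
var queue2 = new ActionBlock<WorkItem>(wi => ProcessWorkItem(wi), serialOptions);

queue1.Post(wi1a); // processed first in queue1
queue1.Post(wi1b); // runs only after wi1a completes
queue2.Post(wi2a); // runs concurrently with queue1's items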
Have you had a look at the concurrent collections, in particular BlockingCollection<T>? In your case you might use something like:
public class TaskQueue : IDisposable
{
    BlockingCollection<Action> taskX = new BlockingCollection<Action>();

    public TaskQueue(int taskCount)
    {
        // Create and start a new Task for each consumer.
        for (int i = 0; i < taskCount; i++)
            Task.Factory.StartNew(Consumer);
    }

    public void Dispose() { taskX.CompleteAdding(); }

    public void EnqueueTask(Action action) { taskX.Add(action); }

    void Consumer()
    {
        // The sequence we are enumerating will BLOCK when no elements
        // are available and will end when CompleteAdding is called.
        foreach (Action action in taskX.GetConsumingEnumerable())
            action(); // Perform your task.
    }
}
A .NET 4.0 solution based on TPL is possible, while hiding away the fact that it needs to store the parent task somewhere. For example:
class QueuePool
{
    private readonly Task[] _queues;

    public QueuePool(int queueCount)
    {
        _queues = new Task[queueCount];
    }

    public void Enqueue(int queueIndex, Action action)
    {
        lock (_queues)
        {
            var parent = _queues[queueIndex];
            if (parent == null)
                _queues[queueIndex] = Task.Factory.StartNew(action);
            else
                _queues[queueIndex] = parent.ContinueWith(_ => action());
        }
    }
}
This is using a single lock for all queues, to illustrate the idea. In production code, however, I would use a lock per queue to reduce contention.
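A sketch of that per-queue-lock variant (the same QueuePool as above, but each queue index gets its own lock object):
class QueuePool
{
    private readonly Task[] _queues;
    private readonly object[] _locks;

    public QueuePool(int queueCount)
    {
        _queues = new Task[queueCount];
        _locks = Enumerable.Range(0, queueCount).Select(_ => new object()).ToArray();
    }

    public void Enqueue(int queueIndex, Action action)
    {
        lock (_locks[queueIndex]) // contention only within the same queue
        {
            var parent = _queues[queueIndex];
            _queues[queueIndex] = parent == null
                ? Task.Factory.StartNew(action)
                : parent.ContinueWith(_ => action());
        }
    }
}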
It looks like the design you already have is good and working. Your worker threads (one per queue) are long-running, so if you want to use Tasks instead, specify TaskCreationOptions.LongRunning so you get a dedicated worker thread.
But there isn't really a need to use the ThreadPool here. It doesn't offer many benefits for long-running work.
