Async file I/O overhead in C# - c#

I've got a problem where I have to process a large batch of large JSONL files (read, deserialize, do some transforms, DB lookups, etc., then write the transformed results) in a .NET Core console app.
I've gotten better throughput by putting the output in batches on a separate thread and was trying to improve the processing side by adding some parallelism but the overhead ended up being self defeating.
I had been doing:
using (var stream = new FileStream(_filePath, FileMode.Open))
using (var reader = new StreamReader(stream))
{
for (;;)
{
var l = reader.ReadLine();
if (l == null)
break;
// Deserialize
// Do some database lookups
// Do some transforms
// Pass result to output thread
}
}
And some diagnostic timings showed me that the ReadLine() call was taking more than the deserialization, etc. To put some numbers on that, a large file would have about:
11 seconds spent on ReadLine
7.8 seconds spent on deserialization
10 seconds spent on db lookups
I wanted to overlap that 11 seconds of file i/o with the other work so I tried
using (var stream = new FileStream(_filePath, FileMode.Open))
using (var reader = new StreamReader(stream))
{
var nextLine = reader.ReadLineAsync();
for (;;)
{
var l = nextLine.Result;
if (l == null)
break;
nextLine = reader.ReadLineAsync();
// Deserialize
// Do some database lookups
// Do some transforms
// Pass result to output thread
}
}
The idea was to get the next I/O going while I did the transform work. Only that ended up taking a lot longer than the regular sync version (about twice as long).
I've got requirements that they want predictability on the overall result (i.e. the same set of files have to be processed in name order and the output rows have to be predictably in the same order) so I can't just throw a file per thread and let them fight it out.
I was just trying to introduce enough parallelism to smooth the throughput over a large set of inputs, and I was surprised how counterproductive the above turned out to be.
Am I missing something here?

The built-in asynchronous filesystem APIs are currently broken, and you are advised to avoid them. Not only are they much slower than their synchronous counterparts, but they are not even truly asynchronous. .NET 6 will come with an improved FileStream implementation, so in a few months this may no longer be an issue.
What you are trying to achieve is called task parallelism, where two or more heterogeneous operations run concurrently and independently of each other. It's an advanced technique and it requires specialized tools. The most common type of parallelism is so-called data parallelism, where the same type of operation runs in parallel on a list of homogeneous data, and it's commonly implemented using the Parallel class or the PLINQ library.
To achieve task parallelism, the most readily available tool is the TPL Dataflow library, which is built into the .NET Core / .NET 5 platforms, so you only need to install a package if you are targeting .NET Framework. This library allows you to create a pipeline consisting of linked components that are called "blocks" (TransformBlock, ActionBlock, BatchBlock etc), where each block acts as an independent processor with its own input and output queues. You feed the pipeline with data, and the data flows from block to block through the pipeline, being processed along the way. You Complete the first block in the pipeline to signal that no more input data will ever be available, and then await the Completion of the last block to make your code wait until all the work has been done. Here is an example:
private async void Button1_Click(object sender, EventArgs e)
{
Button1.Enabled = false;
var fileBlock = new TransformManyBlock<string, IList<string>>(filePath =>
{
return File.ReadLines(filePath).Buffer(10);
});
var deserializeBlock = new TransformBlock<IList<string>, MyObject[]>(lines =>
{
return lines.Select(line => Deserialize(line)).ToArray();
}, new ExecutionDataflowBlockOptions()
{
MaxDegreeOfParallelism = 2 // Let's assume that Deserialize is parallelizable
});
var persistBlock = new TransformBlock<MyObject[], MyObject[]>(async objects =>
{
foreach (MyObject obj in objects) await PersistToDbAsync(obj);
return objects;
});
var displayBlock = new ActionBlock<MyObject[]>(objects =>
{
foreach (MyObject obj in objects) TextBox1.AppendText($"{obj}\r\n");
}, new ExecutionDataflowBlockOptions()
{
TaskScheduler = TaskScheduler.FromCurrentSynchronizationContext()
// Make sure that the delegate will be invoked on the UI thread
});
fileBlock.LinkTo(deserializeBlock,
new DataflowLinkOptions { PropagateCompletion = true });
deserializeBlock.LinkTo(persistBlock,
new DataflowLinkOptions { PropagateCompletion = true });
persistBlock.LinkTo(displayBlock,
new DataflowLinkOptions { PropagateCompletion = true });
foreach (var filePath in Directory.GetFiles(@"C:\Data"))
await fileBlock.SendAsync(filePath);
fileBlock.Complete();
await displayBlock.Completion;
MessageBox.Show("Done");
Button1.Enabled = true;
}
The data passed through the pipeline should be chunky. If each unit of work is too lightweight, you should batch the units in arrays or lists, otherwise the overhead of moving lots of tiny data around is going to outweigh the benefits of parallelism. That's the reason for using the Buffer LINQ operator (from the System.Interactive package) in the above example. .NET 6 will come with a new Chunk LINQ operator, offering the same functionality.
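To make the batching advice concrete, here is a minimal sketch that assumes .NET 6 or later, where Enumerable.Chunk is available. The LineBatcher helper, the batch size of 10 and the generated stand-in lines are illustrative assumptions, not part of the original pipeline.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class LineBatcher
{
    // Groups a lazy sequence of lines into arrays of at most batchSize items.
    // Enumerable.Chunk is only available from .NET 6 onwards.
    public static IEnumerable<string[]> Batch(IEnumerable<string> lines, int batchSize)
        => lines.Chunk(batchSize);
}

public static class BatchDemo
{
    public static void Main()
    {
        // Stand-in for File.ReadLines(path); the values are made up.
        IEnumerable<string> lines = Enumerable.Range(1, 25).Select(i => $"line {i}");

        foreach (string[] batch in LineBatcher.Batch(lines, 10))
            Console.WriteLine($"batch of {batch.Length}");
    }
}
```

With 25 input lines and a batch size of 10, the sequence is split into batches of 10, 10 and 5. The right batch size depends on how heavy each item is, so it is worth measuring rather than guessing.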

Theodor's suggestion looks like a really powerful and useful library that's worth checking out, but if you're looking for a smaller DIY solution this is how I would approach it:
using System;
using System.IO;
using System.Threading.Tasks;
using System.Collections.Generic;
namespace Parallelism
{
class Program
{
private static Queue<string> _queue = new Queue<string>();
private static Task _lastProcessTask;
static async Task Main(string[] args)
{
string path = "???";
await ReadAndProcessAsync(path);
}
private static async Task ReadAndProcessAsync(string path)
{
using (var str = File.OpenRead(path))
using (var sr = new StreamReader(str))
{
string line = null;
while (true)
{
line = await sr.ReadLineAsync();
if (line == null)
break;
lock (_queue)
{
_queue.Enqueue(line);
if (_queue.Count == 1)
// There was nothing in the queue before
// so initiate a new processing loop. Save
// but DON'T await the Task yet.
_lastProcessTask = ProcessQueueAsync();
}
}
}
// Now that file reading is completed, await
// _lastProcessTask to ensure we don't return
// before it's finished.
await _lastProcessTask;
}
// This will continue processing as long as lines are in the queue,
// including new lines entering the queue while processing earlier ones.
private static Task ProcessQueueAsync()
{
return Task.Run(async () =>
{
while (true)
{
string line;
lock (_queue)
{
// Only peek at first so the read loop doesn't think
// the queue is empty and initiate a second processing
// loop while we're processing this line.
if (!_queue.TryPeek(out line))
return;
}
await ProcessLineAsync(line);
lock (_queue)
{
// Dequeues the item we just processed. If it's the last
// one, this loop is done.
_queue.Dequeue();
if (_queue.Count == 0)
return;
}
}
});
}
private static async Task ProcessLineAsync(string line)
{
// do something
}
}
}
Note this approach has a processing loop that terminates when nothing is left in the queue, and is re-initiated if needed when new items are ready. Another approach would be to have a continuous processing loop that repeatedly re-checks and does a Task.Delay() for a small amount of time while the queue is empty. I like my approach better because it doesn't bog down the worker thread with periodic and unnecessary checks but performance would likely be unnoticeably different.
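For comparison, a continuous consumer does not have to poll with Task.Delay(). The sketch below assumes .NET Core 3.0 or later and uses System.Threading.Channels so the consumer waits asynchronously for the next line. ChannelPipeline, the process delegate and the capacity of 1000 are hypothetical names and values chosen for illustration. A single consumer keeps the output in input order.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Channels;
using System.Threading.Tasks;

public static class ChannelPipeline
{
    // Produces lines on the calling flow and processes them on one consumer,
    // so results come out in the same order as the input.
    public static async Task<List<string>> RunAsync(
        IEnumerable<string> lines, Func<string, string> process, int capacity = 1000)
    {
        var channel = Channel.CreateBounded<string>(new BoundedChannelOptions(capacity)
        {
            SingleReader = true,
            SingleWriter = true
        });
        var results = new List<string>();

        // Consumer: waits asynchronously for items instead of polling.
        Task consumer = Task.Run(async () =>
        {
            await foreach (string line in channel.Reader.ReadAllAsync())
                results.Add(process(line));
        });

        // Producer: WriteAsync waits when the channel is full (back-pressure).
        try
        {
            foreach (string line in lines)
                await channel.Writer.WriteAsync(line);
        }
        finally
        {
            channel.Writer.Complete(); // signal that no more items will arrive
        }

        await consumer;
        return results;
    }
}

public static class ChannelDemo
{
    public static async Task Main()
    {
        List<string> output = await ChannelPipeline.RunAsync(
            new[] { "a", "b", "c" }, s => s.ToUpperInvariant());
        Console.WriteLine(string.Join(",", output)); // prints "A,B,C"
    }
}
```

The bounded capacity keeps the reader from running arbitrarily far ahead of the processing side, which matters when the files are large.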
Also just to comment on Blindy's answer, I have to disagree with discouraging the use of parallelism here. First off, most CPUs these days are multi-core, so smart use of the .NET threadpool will in fact maximize your application's efficiency when run on multi-core CPUs and have pretty minimal downside in single-core scenarios.
More importantly, though, async does not equal multithreading. Asynchronous programming existed long before multithreading, I/O being the most notable example. I/O operations are in large part handled by hardware other than the CPU - the NIC, SATA controllers, etc. They use an ancient concept called the Hardware Interrupt that most coders today have probably never heard of and that predates multithreading by decades. It's basically just a way to give the CPU a callback to execute when an off-CPU operation is finished. So when you use a well-behaved asynchronous API (notwithstanding that .NET FileStream has issues, as Theodor mentioned), your CPU really shouldn't be doing that much work at all. And when you await such an API, the CPU is basically sitting idle until the other hardware in the machine has written the requested data to RAM.
I agree with Blindy that it would be better if computer science programs did a better job of teaching people how computer hardware actually works. Looking to take advantage of the fact that the CPU can be doing other things while waiting for data to be read off the disk, off a network, etc., is, in the words of Captain Kirk, "officer thinking".

11 seconds spent on ReadLine
More like, specifically, 11 seconds spent on file I/O, but you didn't measure that.
Replace your stream creation with this instead:
using var reader = new StreamReader(_filePath, Encoding.UTF8, false, 50 * 1024 * 1024);
That will cause it to read it to a buffer of 50MB (play with the size as needed) to avoid repeated I/O on what seems like an ancient hard drive.
I was just trying to introduce enough parallelism to smooth the throughput
Not only did you not introduce any parallelism at all, but you used ReadLineAsync wrong -- it returns a Task<string>, not a string.
It's completely overkill, the buffer size increase will most likely fix your issue, but if you want to actually do this you need two threads that communicate over a shared data structure, as Peter said.
Only that ended up taking a lot longer than the regular sync stuff
It baffles me that people think multi-threaded code should take less processing power than single-threaded code. There has to be some really basic understanding missing from present day education to lead to this. Multi-threading includes multiple extra context switches, mutex contention, your OS scheduler kicking in to replace one of your threads (leading to starvation or oversaturation), gathering, serializing and aggregating results after work is done etc. None of that is free or easy to implement.

Related

IO Bound Operation and Task.Run()

I am quite new to concurrency (and C#, actually). I have a bunch of CSV files in two separate directories to be read, and I want to do some processing after I read each file. The processing is independent of the other read and process operations. After all the processing is done, I want to update the UI. The UI needs to stay responsive in the meantime too, because I will need to display a progress bar. Currently I have something like this:
private string _directoryA;
private string _directoryB;
// The user clicks the button
private void ButtonPressed()
{
Task.Run(() => DoJob());
}
private void DoJob()
{
var tasks = new List<Task>();
var watch = Stopwatch.StartNew();
tasks.Add(Task.Run(() => DoJobForDirectory(_directoryA)).ContinueWith(t => Console.WriteLine("First Half")));
tasks.Add(Task.Run(() => DoJobForDirectory(_directoryB)).ContinueWith(t => Console.WriteLine("Second Half")));
Task.WaitAll(tasks.ToArray());
watch.Stop();
Console.WriteLine($"Time Taken : {watch.ElapsedMilliseconds} ms.");
UpdateUI();
}
private void DoJobForDirectory(string directory)
{
var files = Directory.EnumerateFiles(directory, "*.csv");
var tasks = new List<Task>();
foreach (var file in files)
{
// Update the progress bar in the UI when a file has finished processing
tasks.Add(Task.Run(() => DoJobForFile(file)).ContinueWith(t => UpdateCounter++));
}
Task.WaitAll(tasks.ToArray());
}
private void DoJobForFile(string filePath)
{
ReadCSV();
ProcessData();
...
}
I feel like I am missing something here. From my reading this operation should be I/O bound, as the processing afterwards is pretty lightweight (some for loops and assignments). So I really should be using just async await, but not Task.Run()...? However I couldn't think of a better way to do this. The ReadCSV() is from some library that does not have the async version. Using Parallel.ForEach does not boost the performance too. Is there a better way to do this (to be efficient on resources and also achieve better performance)?
Also, when I tried to run on only one directory, the elapsed time was nearly half of the time required for both directories. Since the operations are all independent, I want to run them all in parallel, so processing both directories should take roughly the same (or only slightly more) time as processing a single directory, not twice as long. It seems like no matter how many Task.Run() calls I make, only a limited number of threads run at the same time (some bottleneck). I tried changing all the Task.Run() calls to new Thread(), and observed many more threads active at the same time, but in the end that resulted in worse performance. Why is that?
The Task.Run schedules work on the ThreadPool, which is a conservative mechanism regarding how many threads it creates immediately on demand (it creates as many as the available cores of the machine), and on how frequently it creates new threads when the demand for work is high (one new thread every second). You could try experimenting with the ThreadPool.SetMinThreads method that affects the behavior of the ThreadPool. For example:
ThreadPool.SetMinThreads(100, 100);
This way the ThreadPool will create 100 threads immediately on demand, before switching to the conservative algorithm.
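As a small illustration of that experiment, the sketch below reads the current minimums first and only raises the worker-thread minimum, leaving the I/O completion minimum as it was. The ThreadPoolTuning helper is a hypothetical name, and the value 100 is just the figure used above.

```csharp
using System;
using System.Threading;

public static class ThreadPoolTuning
{
    // Raises the worker-thread minimum and keeps the current I/O completion
    // minimum. Returns false if the runtime rejects the requested value.
    public static bool RaiseMinWorkerThreads(int minWorkers)
    {
        ThreadPool.GetMinThreads(out int workers, out int ioThreads);
        if (minWorkers <= workers) return true; // already at least that high
        return ThreadPool.SetMinThreads(minWorkers, ioThreads);
    }
}

public static class ThreadPoolDemo
{
    public static void Main()
    {
        ThreadPool.GetMinThreads(out int before, out _);
        bool accepted = ThreadPoolTuning.RaiseMinWorkerThreads(100);
        ThreadPool.GetMinThreads(out int after, out _);
        Console.WriteLine($"min workers: {before} -> {after} (accepted: {accepted})");
    }
}
```

Checking the return value matters because SetMinThreads reports failure through it rather than by throwing, for example when the requested minimum exceeds the pool's maximum.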
Chances are that you'll see no improvement in the performance of your directory-processing application. That's because your I/O-bound workload is throttled by the capabilities of your storage device. No matter what you do with code, the hardware has a limit on how much data it can store or retrieve per unit of time. When you reach this limit, the only way to boost the performance is to upgrade your hardware.
Regarding the suitability of using Task.Run and synchronous APIs for doing I/O-bound work, surprisingly in many cases it's the most performant way of getting the job done. The synchronous file-system APIs in particular are significantly faster than their asynchronous counterparts. What you lose with the synchronous APIs is memory efficiency. Each thread requires at least 1 MB of memory for its stack, so if you start 1,000 threads at once you'll deprive your system of 1 GB of memory or more, which can indirectly hurt the performance of your application.
Starting tasks manually with Task.Run for the purpose of parallelization is a low-level approach to parallelizing your work. The TPL offers higher-level Task-based tools, like the Parallel class, the PLINQ library (.AsParallel) and the TPL Dataflow library.
For updating the UI with progress information during a background work, the modern approach is the IProgress<T> interface and the Progress<T> class. You can find an example here, as part of a comparison between the Task.Run and the BackgroundWorker class.
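A minimal sketch of the IProgress<T> pattern follows. FileBatchWorker, the processFile delegate and the file names are hypothetical. In a real UI you would create a Progress<int> on the UI thread so that its callback runs there. The SyncProgress<T> class exists only so that this console demo reports deterministically.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

public static class FileBatchWorker
{
    // Processes each file on the thread pool and reports the number of
    // files completed so far through the supplied IProgress<int>.
    public static Task ProcessAllAsync(
        IReadOnlyList<string> files, Action<string> processFile, IProgress<int> progress)
    {
        return Task.Run(() =>
        {
            for (int i = 0; i < files.Count; i++)
            {
                processFile(files[i]);
                progress?.Report(i + 1);
            }
        });
    }
}

// A synchronous IProgress<T>, used only for this console demo.
// In a UI application, new Progress<int>(handler) would be used instead.
public sealed class SyncProgress<T> : IProgress<T>
{
    private readonly Action<T> _handler;
    public SyncProgress(Action<T> handler) => _handler = handler;
    public void Report(T value) => _handler(value);
}

public static class ProgressDemo
{
    public static async Task Main()
    {
        var files = new[] { "a.csv", "b.csv", "c.csv" }; // hypothetical names
        var progress = new SyncProgress<int>(
            n => Console.WriteLine($"{n}/{files.Length} done"));
        await FileBatchWorker.ProcessAllAsync(files, _ => { }, progress);
    }
}
```

The worker only knows about IProgress<int>, so it never touches UI controls directly and can be reused unchanged from a console app or a test.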
Calling Task.Run(() => DoJob()) and then blocking in Task.WaitAll() wastes a thread.
I would change it to this:
private string _directoryA;
private string _directoryB;
// The user clicks the button
private async void ButtonPressed()
{
// disable UI controls
try
{
await DoJob();
}
finally
{
// enable UI controls.
}
}
private async Task DoJob()
{
var tasks = new List<Task>();
var watch = Stopwatch.StartNew();
tasks.Add(Task.Run(() => DoJobForDirectory(_directoryA)).ContinueWith(t => Console.WriteLine("First Half")));
tasks.Add(Task.Run(() => DoJobForDirectory(_directoryB)).ContinueWith(t => Console.WriteLine("Second Half")));
await Task.WhenAll(tasks.ToArray());
watch.Stop();
Console.WriteLine($"Time Taken : {watch.ElapsedMilliseconds} ms.");
UpdateUI();
}
private async Task DoJobForDirectory(string directory)
{
var files = Directory.EnumerateFiles(directory, "*.csv");
var tasks = new List<Task>();
foreach (var file in files)
{
// Update the progress bar in the UI when a file has finished processing
tasks.Add(Task.Run(() => DoJobForFile(file)).ContinueWith(t => UpdateCounter++));
}
await Task.WhenAll(tasks.ToArray());
}
private void DoJobForFile(string filePath)
{
ReadCSV();
ProcessData();
...
}
If you want to limit the threads, you could use a SemaphoreSlim. Here is a good example as accepted answer:
How to limit the Maximum number of parallel tasks in c#
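As a rough sketch of the SemaphoreSlim idea applied to this case, the helper below starts one task per item but lets at most maxConcurrency of them do work at once. ThrottledRunner and the work delegate are hypothetical names. The delegate stands in for a synchronous method such as DoJobForFile.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

public static class ThrottledRunner
{
    // Runs work(item) for every item on the thread pool, with at most
    // maxConcurrency invocations in flight at any moment.
    public static async Task ForEachAsync<T>(
        IEnumerable<T> items, int maxConcurrency, Action<T> work)
    {
        using var throttler = new SemaphoreSlim(maxConcurrency, maxConcurrency);

        List<Task> tasks = items.Select(async item =>
        {
            await throttler.WaitAsync();
            try { await Task.Run(() => work(item)); }
            finally { throttler.Release(); }
        }).ToList();

        await Task.WhenAll(tasks);
    }
}

public static class ThrottleDemo
{
    public static async Task Main()
    {
        // Hypothetical file names; the work is simulated with a short sleep.
        var files = Enumerable.Range(1, 10).Select(i => $"file{i}.csv");
        await ThrottledRunner.ForEachAsync(files, 3, file =>
        {
            Thread.Sleep(100);
            Console.WriteLine($"processed {file}");
        });
    }
}
```

Because the semaphore is acquired before Task.Run is called, no thread-pool thread is occupied by an item that is still waiting for its turn.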

Is there a way to limit the number of parallel Tasks globally in an ASP.NET Web API application?

I have an ASP.NET 5 Web API application which contains a method that takes objects from a List<T> and makes HTTP requests to a server, 5 at a time, until all requests have completed. This is accomplished using a SemaphoreSlim, a List<Task>(), and awaiting on Task.WhenAll(), similar to the example snippet below:
public async Task<ResponseObj[]> DoStuff(List<Input> inputData)
{
const int maxDegreeOfParallelism = 5;
var tasks = new List<Task<ResponseObj>>();
using var throttler = new SemaphoreSlim(maxDegreeOfParallelism);
foreach (var input in inputData)
{
tasks.Add(ExecHttpRequestAsync(input, throttler));
}
ResponseObj[] responses = await Task.WhenAll(tasks).ConfigureAwait(false);
return responses;
}
private async Task<ResponseObj> ExecHttpRequestAsync(Input input, SemaphoreSlim throttler)
{
await throttler.WaitAsync().ConfigureAwait(false);
try
{
using var request = new HttpRequestMessage(HttpMethod.Post, "https://foo.bar/api");
request.Content = new StringContent(JsonConvert.SerializeObject(input), Encoding.UTF8, "application/json");
var response = await HttpClientWrapper.SendAsync(request).ConfigureAwait(false);
var responseBody = await response.Content.ReadAsStringAsync().ConfigureAwait(false);
var responseObject = JsonConvert.DeserializeObject<ResponseObj>(responseBody);
return responseObject;
}
finally
{
throttler.Release();
}
}
This works well, however I am looking to limit the total number of Tasks that are being executed in parallel globally throughout the application, so as to allow scaling up of this application. For example, if 50 requests to my API came in at the same time, this would start at most 250 tasks running parallel. If I wanted to limit the total number of Tasks that are being executed at any given time to say 100, is it possible to accomplish this? Perhaps via a Queue<T>? Would the framework automatically prevent too many tasks from being executed? Or am I approaching this problem in the wrong way, and would I instead need to Queue the incoming requests to my application?
I'm going to assume the code is fixed, i.e., Task.Run is removed and the WaitAsync / Release are adjusted to throttle the HTTP calls instead of List<T>.Add.
I am looking to limit the total number of Tasks that are being executed in parallel globally throughout the application, so as to allow scaling up of this application.
This does not make sense to me. Limiting your tasks limits your scaling up.
For example, if 50 requests to my API came in at the same time, this would start at most 250 tasks running parallel.
Concurrently, sure, but not in parallel. It's important to note that these aren't 250 threads, and that they're not 250 CPU-bound operations waiting for free thread pool threads to run on, either. These are Promise Tasks, not Delegate Tasks, so they don't "run" on a thread at all. It's just 250 objects in memory.
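To illustrate the point that promise tasks are only objects, here is a small sketch built on TaskCompletionSource<T>. The class name and the count of 250 are illustrative. In a real application the I/O completion, not a loop, is what would complete such tasks.

```csharp
using System;
using System.Linq;
using System.Threading.Tasks;

public static class PromiseTaskDemo
{
    // Creates `count` pending promise-style tasks. No delegate is queued and
    // no thread is blocked; each task is just an object awaiting a result.
    public static TaskCompletionSource<int>[] CreatePending(int count)
        => Enumerable.Range(0, count)
                     .Select(_ => new TaskCompletionSource<int>(
                         TaskCreationOptions.RunContinuationsAsynchronously))
                     .ToArray();

    public static async Task Main()
    {
        TaskCompletionSource<int>[] sources = CreatePending(250);

        // All 250 tasks exist and are pending, yet nothing is running.
        Console.WriteLine(sources.Count(s => !s.Task.IsCompleted));

        // Completing them here plays the role of an I/O completion.
        for (int i = 0; i < sources.Length; i++)
            sources[i].SetResult(i);

        int[] results = await Task.WhenAll(sources.Select(s => s.Task));
        Console.WriteLine(results.Length);
    }
}
```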
If I wanted to limit the total number of Tasks that are being executed at any given time to say 100, is it possible to accomplish this?
Since (these kinds of) tasks are just in-memory objects, there should be no need to limit them, any more than you would need to limit the number of strings or List<T>s. Apply throttling where you do need it; e.g., number of HTTP calls done simultaneously per request. Or per host.
Would the framework automatically prevent too many tasks from being executed?
The framework has nothing like this built-in.
Perhaps via a Queue? Or am I approaching this problem in the wrong way, and would I instead need to Queue the incoming requests to my application?
There's already a queue of requests. It's handled by IIS (or whatever your host is). If your server gets too busy (or gets busy very suddenly), the requests will queue up without you having to do anything.
If I wanted to limit the total number of Tasks that are being executed at any given time to say 100, is it possible to accomplish this?
What you are looking for is to limit the MaximumConcurrencyLevel of what's called the Task Scheduler. You can create your own task scheduler that regulates the MaximumConcurrencyLevel of the tasks it manages. I would recommend implementing a queue-like object that tracks incoming requests and currently working requests and waits for the current requests to finish before consuming more. The information below may still be relevant.
The task scheduler is in charge of how Tasks are prioritized, and in charge of tracking the tasks and ensuring that their work is completed, at least eventually.
The way it does this is actually very similar to what you mentioned, in general the way the Task Scheduler handles tasks is in a FIFO (First in first out) model very similar to how a ConcurrentQueue<T> works (at least starting in .NET 4).
Would the framework automatically prevent too many tasks from being executed?
By default the TaskScheduler that is created with most applications appears to default to a MaximumConcurrencyLevel of int.MaxValue. So theoretically yes.
The fact that there is practically no limit to the number of tasks (at least with the default TaskScheduler) might not be that big of a deal for your scenario.
Tasks are separated into two types, at least when it comes to how they are assigned to the available thread pools. They're separated into Local and Global queues.
Without going too far into detail, the way it works is that if a task creates other tasks, those new tasks are part of the parent task's queue (a local queue). Tasks spawned by a parent task are limited to the parent's thread pool (unless the task scheduler takes it upon itself to move queues around).
If a task isn't created by another task, it's a top-level task and is placed into the Global Queue. These would normally be assigned their own thread (if available), and if one isn't available the task is treated in a FIFO model, as mentioned above, until its work can be completed.
This is important because although you can limit the amount of concurrency that happens with the TaskScheduler, it may not necessarily be important if, say, you have a top-level task that's marked as long-running and is in charge of processing your incoming requests. This would be helpful since all the tasks spawned by this top-level task will be part of that task's local queue and therefore won't spam all the available threads in your thread pool.
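If you want to try a scheduler with a limited MaximumConcurrencyLevel without writing one yourself, one option is the ConcurrentExclusiveSchedulerPair class, whose ConcurrentScheduler accepts a maximum concurrency level. The sketch below assumes the work consists of synchronous delegates. The LimitedScheduler helper, the level of 4 and the simulated work are illustrative assumptions. A scheduler limit applies only to delegates that run on it, so it does not throttle I/O that is awaited inside async methods.

```csharp
using System;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

public static class LimitedScheduler
{
    // Returns a TaskFactory whose delegate tasks run at most
    // maxConcurrency at a time on top of the default scheduler.
    public static TaskFactory Create(int maxConcurrency)
    {
        var pair = new ConcurrentExclusiveSchedulerPair(
            TaskScheduler.Default, maxConcurrency);
        return new TaskFactory(pair.ConcurrentScheduler);
    }
}

public static class SchedulerDemo
{
    public static void Main()
    {
        TaskFactory factory = LimitedScheduler.Create(4);

        // Twenty short synchronous jobs; the scheduler decides how many overlap.
        Task[] tasks = Enumerable.Range(0, 20)
            .Select(i => factory.StartNew(() => Thread.Sleep(50)))
            .ToArray();
        Task.WaitAll(tasks);

        Console.WriteLine(factory.Scheduler.MaximumConcurrencyLevel);
    }
}
```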
When you have a bunch of items and you want to process them asynchronously and with limited concurrency, the SemaphoreSlim is a great tool for this job. There are two ways that it can be used. One way is to create all the tasks immediately and have each task acquire the semaphore before doing its main work, and the other is to throttle the creation of the tasks while the source is enumerated. The first technique is eager, and so it consumes more RAM, but it's more maintainable because it is easier to understand and implement. The second technique is lazy, and it's more efficient if you have millions of items to process.
The technique that you have used in your sample code is the second (lazy) one.
Here is an example of using two SemaphoreSlims in order to impose two maximum concurrency policies, one per request and one globally. First the eager approach:
private const int maxConcurrencyGlobal = 100;
private static SemaphoreSlim globalThrottler
= new SemaphoreSlim(maxConcurrencyGlobal, maxConcurrencyGlobal);
public async Task<ResponseObj[]> DoStuffAsync(IEnumerable<Input> inputData)
{
const int maxConcurrencyPerRequest = 5;
var perRequestThrottler
= new SemaphoreSlim(maxConcurrencyPerRequest, maxConcurrencyPerRequest);
Task<ResponseObj>[] tasks = inputData.Select(async input =>
{
await perRequestThrottler.WaitAsync();
try
{
await globalThrottler.WaitAsync();
try
{
return await ExecHttpRequestAsync(input);
}
finally { globalThrottler.Release(); }
}
finally { perRequestThrottler.Release(); }
}).ToArray();
return await Task.WhenAll(tasks);
}
The Select LINQ operator provides an easy and intuitive way to project items to tasks.
And here is the lazy approach for doing exactly the same thing:
private const int maxConcurrencyGlobal = 100;
private static SemaphoreSlim globalThrottler
= new SemaphoreSlim(maxConcurrencyGlobal, maxConcurrencyGlobal);
public async Task<ResponseObj[]> DoStuffAsync(IEnumerable<Input> inputData)
{
const int maxConcurrencyPerRequest = 5;
var perRequestThrottler
= new SemaphoreSlim(maxConcurrencyPerRequest, maxConcurrencyPerRequest);
var tasks = new List<Task<ResponseObj>>();
foreach (var input in inputData)
{
await perRequestThrottler.WaitAsync();
await globalThrottler.WaitAsync();
Task<ResponseObj> task = Run(async () =>
{
try
{
return await ExecHttpRequestAsync(input);
}
finally
{
try { globalThrottler.Release(); }
finally { perRequestThrottler.Release(); }
}
});
tasks.Add(task);
}
return await Task.WhenAll(tasks);
static async Task<T> Run<T>(Func<Task<T>> action) => await action();
}
This implementation assumes that the await globalThrottler.WaitAsync() will never throw, which is a given according to the documentation. This will no longer be the case if you decide later to add support for cancellation, and you pass a CancellationToken to the method. In that case you would need one more try/finally wrapper around the task-creation logic. The first (eager) approach could be enhanced with cancellation support without such considerations. Its existing try/finally infrastructure is already sufficient.
It is also important that the internal helper Run method is implemented with async/await. Eliding the async/await would be an easy mistake to make, because in that case any exception thrown synchronously by the ExecHttpRequestAsync method would be rethrown immediately, and it would not be encapsulated in a Task<ResponseObj>. Then the task returned by the DoStuffAsync method would fail without releasing the acquired semaphores, and also without awaiting the completion of the already started operations. That's another argument for preferring the eager approach. The lazy approach has too many gotchas to watch for.

Multithreading/Concurrent strategy for a network based task

I am not a pro at using resources efficiently, so I am looking for the best way to do a task that needs to run in parallel and efficiently.
We have a scenario wherein we have to ping millions of systems and receive a response. The response itself takes no time to compute, but the task is network-bound.
My current implementation looks like this -
Parallel.ForEach(list, ip =>
{
try
{
// var record = client.QueryAsync(ip);
var record = client.Query(ip);
results.Add(record);
}
catch (Exception)
{
failed.Add(ip);
}
});
I tested this code for
100 items it takes about 4 secs
1k items it takes about 10 secs
10k items it takes about 80 secs
100k items it takes about 710 secs
I need to process close to 20M queries. What strategy should I use in order to speed this up further?
Here is the problem
Parallel.ForEach uses the thread pool. Moreover, IO bound operations will block those threads waiting for a device to respond and tie up resources.
If you have CPU bound code, Parallelism is appropriate;
Though if you have IO bound code, Asynchrony is appropriate.
In this case, client.Query is clearly I/O, so the ideal consuming code would be asynchronous.
Since you said there was an async version, you are best off using the async/await pattern and/or some type of limit on concurrent tasks. Another neat solution is to use the ActionBlock class in the TPL Dataflow library.
Dataflow example
public static async Task DoWorkLoads(List<IPAddress> addresses)
{
var options = new ExecutionDataflowBlockOptions
{
MaxDegreeOfParallelism = 50
};
var block = new ActionBlock<IPAddress>(MyMethodAsync, options);
foreach (var ip in addresses)
block.Post(ip);
block.Complete();
await block.Completion;
}
...
public async Task MyMethodAsync(IPAddress ip)
{
try
{
var record = await client.QueryAsync(ip);
// note this is not thread safe best to lock it
results.Add(record);
}
catch (Exception)
{
// note this is not thread safe best to lock it
failed.Add(ip);
}
}
This approach gives you Asynchrony, it also gives you MaxDegreeOfParallelism, it doesn't waste resources, and lets IO be IO without chewing up unnecessary resources
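The comments in the example above note that the result lists are not thread safe. One way to sidestep locking is to collect into ConcurrentBag<T>, sketched below with a SemaphoreSlim as the concurrency limit instead of a Dataflow block. BulkQuery, the queryAsync delegate (standing in for client.QueryAsync) and the sample inputs are hypothetical. This version creates one task per input up front, so for tens of millions of inputs you would probably feed it in slices.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

public static class BulkQuery
{
    // Runs queryAsync for every address with at most maxConcurrency requests
    // in flight, collecting successes and failures in thread-safe bags.
    public static async Task<(ConcurrentBag<TResult> Results, ConcurrentBag<TKey> Failed)>
        RunAsync<TKey, TResult>(
            IEnumerable<TKey> addresses,
            Func<TKey, Task<TResult>> queryAsync,
            int maxConcurrency)
    {
        var results = new ConcurrentBag<TResult>();
        var failed = new ConcurrentBag<TKey>();
        using var throttler = new SemaphoreSlim(maxConcurrency, maxConcurrency);

        List<Task> tasks = addresses.Select(async address =>
        {
            await throttler.WaitAsync();
            try { results.Add(await queryAsync(address)); }
            catch (Exception) { failed.Add(address); }
            finally { throttler.Release(); }
        }).ToList();

        await Task.WhenAll(tasks);
        return (results, failed);
    }
}

public static class BulkQueryDemo
{
    public static async Task Main()
    {
        // Simulated query: even inputs fail, odd inputs succeed.
        var (results, failed) = await BulkQuery.RunAsync(
            Enumerable.Range(1, 10),
            async n =>
            {
                await Task.Delay(10);
                if (n % 2 == 0) throw new InvalidOperationException("simulated failure");
                return n * 2;
            },
            maxConcurrency: 3);

        Console.WriteLine($"{results.Count} succeeded, {failed.Count} failed");
    }
}
```

ConcurrentBag<T> does not preserve insertion order, which is acceptable here because the pings are independent of each other.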
*Disclaimer: Dataflow may not be where you want to be, however I just thought I'd give you some more information.
Demo here
update
I just did some benchmarking with Parallel.ForEach and Dataflow
Run multiple times 10000 pings
Parallel.ForEach = 30 seconds
DataFlow = 10 seconds

C# parallel threading but control by each other

I want to write a program which will have 2 threads. One will download and the other will parse the downloaded file. The tricky part is that I cannot have 2 parsing threads at the same time, because parsing goes through a library. Please help with a suggestion. Thank you.
foreach (string filename in filenames)
{
//start downloading thread here;
readytoparse.Add(filename);
}
foreach (string filename in readytoparse)
{
//start parsing here
}
I ended up with the following logic
bool parserrunning = false;
List<string> readytoparse = new List<string>();
List<string> filenames= new List<string>();
//downloading method
foreach (string filename in filenames)
{
//start downloading thread here;
readytoparse.Add(filename);
if (parserrunning == false)
{
// start parser method
}
}
//parsing method
parserrunning = true;
List<string> _readytoparse = new List<string>(readytoparse);
foreach (string filename in _readytoparse)
{
//start parsing here
}
parserrunning = false;
Yousuf, your "question" is pretty vague. You could take an approach where your main thread downloads the files, then each time a file finishes downloading, spawns a worker thread to parse that file. There is the Task API or QueueUserWorkItem for this sort of thing. I suppose it's possible that you could end up with an awful lot of worker threads running concurrently this way, which isn't necessarily the key to getting the work done faster and could negatively impact other concurrent work on the computer.
If you want to limit this to two threads, you might consider having the download thread write the file name into a queue each time a download finishes. Then your parser thread monitors that queue (wake up every x seconds, check the queue to see if there's anything to do, do the work, check the queue again, if there's nothing to do, go back to sleep for x seconds, repeat).
If you want the parser to be resilient, make that queue persistent (a database, MSMQ, a running text file on disk--something persistent). That way, if there is an interruption (computer crashes, program crashes, power loss), the parser can start right back up where it left off.
Code synchronization comes into play in the sense that you obviously cannot have the parser trying to parse a file that the downloader is still downloading, and if you have two threads using a queue, then you obviously have to protect that queue from concurrent access.
Whether you use Monitors or Mutexes, or QueueUserWorkItem or the Task API is sort of academic. There is plenty of support in the .NET framework for synchronizing and parallelizing units of work.
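As one possible shape for the two-thread design described above, the sketch below uses BlockingCollection<T> so the parser blocks until a file name arrives instead of waking up every x seconds. DownloadThenParse, the download and parse delegates and the file names are hypothetical stand-ins. Only one parser task exists, so two parses never run at the same time.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading.Tasks;

public static class DownloadThenParse
{
    // download runs on the calling thread; parse runs on exactly one
    // background task, so two parses never overlap.
    public static List<string> Run(
        IEnumerable<string> fileNames, Action<string> download, Func<string, string> parse)
    {
        var parsed = new List<string>();
        using var ready = new BlockingCollection<string>();

        Task parser = Task.Run(() =>
        {
            // GetConsumingEnumerable blocks until an item arrives and ends
            // once CompleteAdding has been called and the queue is drained.
            foreach (string name in ready.GetConsumingEnumerable())
                parsed.Add(parse(name));
        });

        try
        {
            foreach (string name in fileNames)
            {
                download(name);
                ready.Add(name); // hand the finished download to the parser
            }
        }
        finally
        {
            ready.CompleteAdding(); // no more files will be queued
            parser.Wait();          // let the parser finish before disposing
        }

        return parsed;
    }
}

public static class DownloadParseDemo
{
    public static void Main()
    {
        List<string> result = DownloadThenParse.Run(
            new[] { "a.txt", "b.txt" },           // hypothetical file names
            name => Console.WriteLine($"downloaded {name}"),
            name => name + " (parsed)");
        Console.WriteLine(string.Join(", ", result));
    }
}
```

The download of the next file overlaps with the parsing of the previous one, and the collection handles the synchronization that would otherwise need a lock and a flag.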
I suggest avoiding all of the heartache of doing this yourself with low-level primitives and using a library designed for this kind of thing.
I recommend Microsoft's Reactive Framework (Rx).
Here's the code:
var query =
    from filename in filenames.ToObservable(Scheduler.Default)
    from file in Observable.Start(() => /* read file */, Scheduler.Default)
    from parsed in Observable.Start(() => /* parse file */, Scheduler.Default)
    select new
    {
        filename,
        parsed,
    };

query.Subscribe(fp =>
{
    /* Do something with finished file */
});
Very simple.
If your parsing library is single threaded only, then add this line:
var els = new EventLoopScheduler();
And then replace Scheduler.Default with els on the parsing line.
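For reference, here is a minimal sketch of what the query might look like with the placeholders filled in and the parsing step moved onto an EventLoopScheduler. It assumes the System.Reactive NuGet package is referenced. The Parse method, the RxPipeline class, and the use of File.ReadAllText are stand-ins chosen for illustration, not part of the original answer.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Reactive.Concurrency;
using System.Reactive.Linq;

public static class RxPipeline
{
    // Hypothetical parser; assume it is only safe to call from one thread.
    private static int Parse(string contents)
    {
        return contents.Length;
    }

    public static IList<KeyValuePair<string, int>> Run(IEnumerable<string> filenames)
    {
        // One dedicated thread on which all parsing work is scheduled.
        using (var els = new EventLoopScheduler())
        {
            var query =
                from filename in filenames.ToObservable(Scheduler.Default)
                from file in Observable.Start(() => File.ReadAllText(filename), Scheduler.Default)
                from parsed in Observable.Start(() => Parse(file), els)
                select new KeyValuePair<string, int>(filename, parsed);

            // Block until every file has been read and parsed.
            return query.ToList().Wait();
        }
    }

    public static void Main()
    {
        var path = Path.GetTempFileName();
        File.WriteAllText(path, "hello");

        foreach (var result in Run(new[] { path }))
        {
            Console.WriteLine(result.Key + ": " + result.Value);
        }

        File.Delete(path);
    }
}
```

Results arrive in completion order, not input order, so anything that depends on ordering would have to sort by filename afterwards.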

Optimal Implementation and Usage of Task Parallel Library?

I have a WCF service that is responsible for taking in an offer and reaching out to dynamically provide this offer to X potential buyers (typically 15-20), which are essentially external APIs.
Each of the buyers currently has 35 seconds to return a response, or they lose the ability to buy the offer.
In order to accomplish this, I have the following code which has been in production for 8 months and has worked and scaled rather well.
As we have recently been spending a lot of time on improvements so that we can scale further, I have been interested in whether there is a better way to accomplish this task. I am hesitant to make changes because it is working well right now; however, I may be able to squeeze additional performance out of it while I am able to focus on it.
The following code is responsible for creating the tasks which make the outbound requests to the buyers.
IBuyer[] Buyers = BuyerService.GetBuyers(); /*Obtain potential buyers for the offer*/
var tokenSource = new CancellationTokenSource();
var token = tokenSource.Token;
Tasks = new Task<IResponse>[Buyers.Length];
for (int i = 0; i < Buyers.Length; i++)
{
    IBuyer buyer = Buyers[i];
    Func<IResponse> makeOffer = () => buyer.MakeOffer();
    /*Note: with the (o) => overload, token is bound as the state object, not as a CancellationToken*/
    Tasks[i] = Task.Factory.StartNew<IResponse>((o) =>
    {
        try
        {
            var result = makeOffer();
            if (!token.IsCancellationRequested)
            {
                return result;
            }
        }
        catch (Exception exception)
        {
            /*Do Work For Handling Exception In Here*/
        }
        return null;
    }, token, TaskCreationOptions.LongRunning);
}
Task.WaitAll(Tasks, timeout, token); /*Give buyers fair amount of time to respond to offer*/
tokenSource.Cancel();
List<IResponse> results = new List<IResponse>(); /*List of Responses From Buyers*/
for (int i = 0; i < Tasks.Length; i++)
{
    if (Tasks[i].IsCompleted) /*Needed so it doesnt block on Result*/
    {
        if (Tasks[i].Result != null)
        {
            results.Add(Tasks[i].Result);
        }
        Tasks[i].Dispose();
    }
}
/*Continue Processing Buyers That Responded*/
On average, this service is called anywhere from 400K to 900K times per day, and sometimes up to 30-40 times per second.
We have made a lot of optimizations in an attempt to tune performance, but I want to make sure that this piece of code does not have any immediate glaring issues.
I have read a lot about the power of TaskScheduler, working with the SynchronizationContext, and going async, but I am not sure how I can make that fit or whether it would be a worthwhile improvement.
Right now, you're using threads for work that is effectively IO bound: each Task.Factory.StartNew call uses a thread pool thread or, as in your case because of the LongRunning hint, a full .NET thread. If you hadn't specified TaskCreationOptions.LongRunning, you'd have seen a problem very early on, and you'd be experiencing thread pool starvation. As is, you're likely using a very large number of threads, and creating and destroying them very quickly, which is a waste of resources.
If you were to make this fully asynchronous, and use the new async/await support, you could perform the same "work" asynchronously, without tying up a thread while each request is in flight. This would scale significantly better, as the number of threads used for a given number of requests would be significantly reduced.
As a general rule of thumb, Task.Factory.StartNew (or Task.Run in .NET 4.5, as well as the Parallel class) should only be used for CPU bound work, and async/await should be used for IO bound work, especially for server side operations.
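A minimal sketch of what the asynchronous version could look like follows. It assumes each buyer can expose an awaitable call; MakeOfferAsync, FakeBuyer, FakeResponse, and OfferBroker are hypothetical names invented here, and in a real service the buyer call would wrap an async HTTP or WCF client call in place of the Task.Delay stand-in.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical async counterparts of the IBuyer/IResponse types in the question.
public interface IResponse { string Buyer { get; } }
public interface IBuyer { Task<IResponse> MakeOfferAsync(CancellationToken token); }

public sealed class FakeResponse : IResponse
{
    public string Buyer { get; set; }
}

// Stand-in buyer: simulates network latency with Task.Delay, which holds no thread.
public sealed class FakeBuyer : IBuyer
{
    private readonly string _name;
    private readonly TimeSpan _latency;

    public FakeBuyer(string name, TimeSpan latency) { _name = name; _latency = latency; }

    public async Task<IResponse> MakeOfferAsync(CancellationToken token)
    {
        await Task.Delay(_latency, token);
        return new FakeResponse { Buyer = _name };
    }
}

public static class OfferBroker
{
    public static async Task<List<IResponse>> CollectAsync(IEnumerable<IBuyer> buyers, TimeSpan timeout)
    {
        using (var cts = new CancellationTokenSource())
        {
            Task<IResponse>[] tasks = buyers.Select(b => AskAsync(b, cts.Token)).ToArray();

            // Wait until every buyer has answered or the deadline passes, whichever comes first.
            await Task.WhenAny(Task.WhenAll(tasks), Task.Delay(timeout, cts.Token));
            cts.Cancel(); // signal the stragglers (and the timer) to stop

            // Keep only the buyers that produced a response in time.
            return tasks
                .Where(t => t.Status == TaskStatus.RanToCompletion && t.Result != null)
                .Select(t => t.Result)
                .ToList();
        }
    }

    private static async Task<IResponse> AskAsync(IBuyer buyer, CancellationToken token)
    {
        try { return await buyer.MakeOfferAsync(token); }
        catch (Exception) { return null; } // failures and cancellations count as "no response"
    }
}

public static class Program
{
    public static void Main()
    {
        var buyers = new IBuyer[]
        {
            new FakeBuyer("fast", TimeSpan.FromMilliseconds(50)),
            new FakeBuyer("slow", TimeSpan.FromSeconds(30)),
        };

        var responses = OfferBroker.CollectAsync(buyers, TimeSpan.FromSeconds(1)).GetAwaiter().GetResult();
        Console.WriteLine(string.Join(",", responses.Select(r => r.Buyer))); // prints "fast"
    }
}
```

The shape mirrors the original code (start all requests, wait with a timeout, cancel, collect whatever finished), but no thread is parked for the 35 seconds a slow buyer may take.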
