C# parallel threading but controlled by each other

I want to write a program with two threads: one will download and the other will parse the downloaded file. The tricky part is that I cannot have two parsing threads at the same time, because the library I use for parsing cannot run concurrently. Please help with a suggestion. Thank you.
foreach (string filename in filenames)
{
    // start downloading thread here
    readytoparse.Add(filename);
}
foreach (string filename in readytoparse)
{
    // start parsing here
}
I ended up with the following logic
bool parserrunning = false;
List<string> readytoparse = new List<string>();
List<string> filenames = new List<string>();

// downloading method
foreach (string filename in filenames)
{
    // start downloading thread here
    readytoparse.Add(filename);
    if (parserrunning == false)
    {
        // start parser method
    }
}

// parsing method
parserrunning = true;
List<string> _readytoparse = new List<string>(readytoparse);
foreach (string filename in _readytoparse)
{
    // start parsing here
}
parserrunning = false;

Yousuf, your "question" is pretty vague. You could take an approach where your main thread downloads the files, then each time a file finishes downloading, spawns a worker thread to parse that file. There is the Task API or QueueUserWorkItem for this sort of thing. I suppose it's possible that you could end up with an awful lot of worker threads running concurrently this way, which isn't necessarily the key to getting the work done faster and could negatively impact other concurrent work on the computer.
If you want to limit this to two threads, you might consider having the download thread write the file name into a queue each time a download finishes. Then your parser thread monitors that queue (wake up every x seconds, check the queue to see if there's anything to do, do the work, check the queue again, if there's nothing to do, go back to sleep for x seconds, repeat).
If you want the parser to be resilient, make that queue persistent (a database, MSMQ, a running text file on disk--something persistent). That way, if there is an interruption (computer crashes, program crashes, power loss), the parser can start right back up where it left off.
Code synchronization comes into play in the sense that you obviously cannot have the parser trying to parse a file that the downloader is still downloading, and if you have two threads using a queue, then you obviously have to protect that queue from concurrent access.
Whether you use Monitors or Mutexes, or QueueUserWorkItem or the Task API is sort of academic. There is plenty of support in the .NET framework for synchronizing and parallelizing units of work.
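The queue-plus-wakeup idea above can also be expressed with BlockingCollection<T>, which handles the locking and the "wake up when work arrives" logic for you. A rough sketch (DownloadFile and ParseFile are hypothetical placeholders for the asker's real work):

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

class DownloadParsePipeline
{
    // BlockingCollection protects the queue and blocks the consumer
    // until an item arrives, so no polling timer is needed.
    private static readonly BlockingCollection<string> _readyToParse =
        new BlockingCollection<string>();

    static void Main()
    {
        string[] filenames = { "a.dat", "b.dat", "c.dat" };

        var downloader = Task.Run(() =>
        {
            foreach (var filename in filenames)
            {
                DownloadFile(filename);       // placeholder
                _readyToParse.Add(filename);  // hand off to the parser
            }
            _readyToParse.CompleteAdding();   // no more work coming
        });

        // A single consumer ensures the parsing library is never used concurrently.
        var parser = Task.Run(() =>
        {
            foreach (var filename in _readyToParse.GetConsumingEnumerable())
                ParseFile(filename);          // placeholder
        });

        Task.WaitAll(downloader, parser);
    }

    static void DownloadFile(string f) { /* download here */ }
    static void ParseFile(string f) { /* parse here */ }
}
```

GetConsumingEnumerable blocks while the queue is empty and ends cleanly once CompleteAdding has been called and the queue drains.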

I suggest avoiding all of the heartache of doing this yourself with threading primitives and using a library designed for this kind of thing.
I recommend Microsoft's Reactive Framework (Rx).
Here's the code:
var query =
    from filename in filenames.ToObservable(Scheduler.Default)
    from file in Observable.Start(() => /* read file */, Scheduler.Default)
    from parsed in Observable.Start(() => /* parse file */, Scheduler.Default)
    select new
    {
        filename,
        parsed,
    };

query.Subscribe(fp =>
{
    /* Do something with finished file */
});
Very simple.
If your parsing library is single threaded only, then add this line:
var els = new EventLoopScheduler();
And then replace Scheduler.Default with els on the parsing line.
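Put together, the single-threaded-parser variant would look roughly like this (a sketch; ReadFile and ParseFile are hypothetical placeholders for the asker's real methods):

```csharp
using System;
using System.Reactive.Concurrency;
using System.Reactive.Linq;

// One dedicated thread that all parsing work is marshalled onto,
// so the non-thread-safe parsing library is never used concurrently.
var els = new EventLoopScheduler();

var query =
    from filename in filenames.ToObservable(Scheduler.Default)
    from file in Observable.Start(() => ReadFile(filename), Scheduler.Default)
    from parsed in Observable.Start(() => ParseFile(file), els)
    select new { filename, parsed };

query.Subscribe(fp =>
{
    /* Do something with finished file */
});
```

Downloads still overlap on the default (thread-pool) scheduler; only the parse step is serialized onto the event-loop thread.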


Async file I/O overhead in C#

I've got a problem where I have to process a large batch of large jsonl files (read, deserialize, do some transforms, db lookups, etc., then write the transformed results) in a .NET Core console app.
I've gotten better throughput by putting the output in batches on a separate thread and was trying to improve the processing side by adding some parallelism but the overhead ended up being self defeating.
I had been doing:
using (var stream = new FileStream(_filePath, FileMode.Open))
using (var reader = new StreamReader(stream))
{
for (;;)
{
var l = reader.ReadLine();
if (l == null)
break;
// Deserialize
// Do some database lookups
// Do some transforms
// Pass result to output thread
}
}
And some diagnostic timings showed me that the ReadLine() call was taking more than the deserialization, etc. To put some numbers on that, a large file would have about:
11 seconds spent on ReadLine
7.8 seconds spent on deserialization
10 seconds spent on db lookups
I wanted to overlap that 11 seconds of file i/o with the other work so I tried
using (var stream = new FileStream(_filePath, FileMode.Open))
using (var reader = new StreamReader(stream))
{
var nextLine = reader.ReadLineAsync();
for (;;)
{
var l = nextLine.Result;
if (l == null)
break;
nextLine = reader.ReadLineAsync();
// Deserialize
// Do some database lookups
// Do some transforms
// Pass result to output thread
}
}
To get the next I/O going while I did the transform work. Only that ended up taking a lot longer than the regular sync version (about twice as long).
I've got requirements that they want predictability on the overall result (i.e. the same set of files have to be processed in name order and the output rows have to be predictably in the same order) so I can't just throw a file per thread and let them fight it out.
I was just trying to introduce enough parallelism to smooth the throughput over a large set of inputs, and I was surprised how counterproductive the above turned out to be.
Am I missing something here?
The built-in asynchronous filesystem APIs are currently broken, and you are advised to avoid them. Not only are they much slower than their synchronous counterparts, but they are not even truly asynchronous. .NET 6 will come with an improved FileStream implementation, so in a few months this may no longer be an issue.
What you are trying to achieve is called task-parallelism, where two or more heterogeneous operations run concurrently and independently from each other. It's an advanced technique and it requires specialized tools. The most common type of parallelism is so-called data-parallelism, where the same type of operation runs in parallel on a list of homogeneous data; it's commonly implemented using the Parallel class or the PLINQ library.
To achieve task-parallelism the most readily available tool is the TPL Dataflow library, which is built into the .NET Core / .NET 5 platforms; you only need to install a package if you are targeting the .NET Framework. This library allows you to create a pipeline consisting of linked components called "blocks" (TransformBlock, ActionBlock, BatchBlock etc), where each block acts as an independent processor with its own input and output queues. You feed the pipeline with data, and the data flows from block to block through the pipeline, while being processed along the way. You Complete the first block in the pipeline to signal that no more input data will ever be available, and then await the Completion of the last block to make your code wait until all the work has been done. Here is an example:
private async void Button1_Click(object sender, EventArgs e)
{
    Button1.Enabled = false;

    var fileBlock = new TransformManyBlock<string, IList<string>>(filePath =>
    {
        return File.ReadLines(filePath).Buffer(10);
    });

    var deserializeBlock = new TransformBlock<IList<string>, MyObject[]>(lines =>
    {
        return lines.Select(line => Deserialize(line)).ToArray();
    }, new ExecutionDataflowBlockOptions()
    {
        MaxDegreeOfParallelism = 2 // Let's assume that Deserialize is parallelizable
    });

    var persistBlock = new TransformBlock<MyObject[], MyObject[]>(async objects =>
    {
        foreach (MyObject obj in objects) await PersistToDbAsync(obj);
        return objects;
    });

    var displayBlock = new ActionBlock<MyObject[]>(objects =>
    {
        foreach (MyObject obj in objects) TextBox1.AppendText($"{obj}\r\n");
    }, new ExecutionDataflowBlockOptions()
    {
        // Make sure that the delegate will be invoked on the UI thread
        TaskScheduler = TaskScheduler.FromCurrentSynchronizationContext()
    });

    fileBlock.LinkTo(deserializeBlock,
        new DataflowLinkOptions { PropagateCompletion = true });
    deserializeBlock.LinkTo(persistBlock,
        new DataflowLinkOptions { PropagateCompletion = true });
    persistBlock.LinkTo(displayBlock,
        new DataflowLinkOptions { PropagateCompletion = true });

    foreach (var filePath in Directory.GetFiles(@"C:\Data"))
        await fileBlock.SendAsync(filePath);
    fileBlock.Complete();

    await displayBlock.Completion;
    MessageBox.Show("Done");
    Button1.Enabled = true;
}
The data passed through the pipeline should be chunky. If each unit of work is too lightweight, you should batch them in arrays or lists, otherwise the overhead of moving lots of tiny data around is going to outweigh the benefits of parallelism. That's the reason for using the Buffer LINQ operator (from the System.Interactive package) in the above example. .NET 6 will come with a new Chunk LINQ operator, offering the same functionality.
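For reference, the two batching operators mentioned look like this (Buffer assumes the System.Interactive package is installed; Chunk assumes .NET 6 or later):

```csharp
// System.Interactive's Buffer groups a sequence into lists of up to n items:
IEnumerable<IList<string>> batches = File.ReadLines(filePath).Buffer(10);

// The built-in Chunk operator on .NET 6+ plays the same role, yielding arrays:
IEnumerable<string[]> chunks = File.ReadLines(filePath).Chunk(10);
```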
Theodor's suggestion looks like a really powerful and useful library that's worth checking out, but if you're looking for a smaller DIY solution this is how I would approach it:
using System;
using System.IO;
using System.Threading.Tasks;
using System.Collections.Generic;

namespace Parallelism
{
    class Program
    {
        private static Queue<string> _queue = new Queue<string>();
        private static Task _lastProcessTask;

        static async Task Main(string[] args)
        {
            string path = "???";
            await ReadAndProcessAsync(path);
        }

        private static async Task ReadAndProcessAsync(string path)
        {
            using (var str = File.OpenRead(path))
            using (var sr = new StreamReader(str))
            {
                string line = null;
                while (true)
                {
                    line = await sr.ReadLineAsync();
                    if (line == null)
                        break;
                    lock (_queue)
                    {
                        _queue.Enqueue(line);
                        if (_queue.Count == 1)
                            // There was nothing in the queue before
                            // so initiate a new processing loop. Save
                            // but DON'T await the Task yet.
                            _lastProcessTask = ProcessQueueAsync();
                    }
                }
            }
            // Now that file reading is completed, await
            // _lastProcessTask to ensure we don't return
            // before it's finished.
            await _lastProcessTask;
        }

        // This will continue processing as long as lines are in the queue,
        // including new lines entering the queue while processing earlier ones.
        private static Task ProcessQueueAsync()
        {
            return Task.Run(async () =>
            {
                while (true)
                {
                    string line;
                    lock (_queue)
                    {
                        // Only peek at first so the read loop doesn't think
                        // the queue is empty and initiate a second processing
                        // loop while we're processing this line.
                        if (!_queue.TryPeek(out line))
                            return;
                    }
                    await ProcessLineAsync(line);
                    lock (_queue)
                    {
                        // Dequeues the item we just processed. If it's the last
                        // one, this loop is done.
                        _queue.Dequeue();
                        if (_queue.Count == 0)
                            return;
                    }
                }
            });
        }

        private static async Task ProcessLineAsync(string line)
        {
            // do something
            await Task.CompletedTask; // placeholder so the method compiles without warnings
        }
    }
}
Note this approach has a processing loop that terminates when nothing is left in the queue, and is re-initiated if needed when new items are ready. Another approach would be to have a continuous processing loop that repeatedly re-checks and does a Task.Delay() for a small amount of time while the queue is empty. I like my approach better because it doesn't bog down the worker thread with periodic and unnecessary checks but performance would likely be unnoticeably different.
Also just to comment on Blindy's answer, I have to disagree with discouraging the use of parallelism here. First off, most CPUs these days are multi-core, so smart use of the .NET threadpool will in fact maximize your application's efficiency when run on multi-core CPUs and have pretty minimal downside in single-core scenarios.
More importantly, though, async does not equal multithreading. Asynchronous programming existed long before multithreading, I/O being the most notable example. I/O operations are in large part handled by hardware other than the CPU - the NIC, SATA controllers, etc. They use an ancient concept called the Hardware Interrupt that most coders today have probably never heard of and that predates multithreading by decades. It's basically just a way to give the CPU a callback to execute when an off-CPU operation is finished. So when you use a well-behaved asynchronous API (notwithstanding that .NET FileStream has issues as Theodor mentioned), your CPU really shouldn't be doing that much work at all. And when you await such an API, the CPU is basically sitting idle until the other hardware in the machine has written the requested data to RAM.
I agree with Blindy that it would be better if computer science programs did a better job of teaching people how computer hardware actually works. Looking to take advantage of the fact that the CPU can be doing other things while waiting for data to be read off the disk, off a network, etc., is, in the words of Captain Kirk, "officer thinking".
11 seconds spent on ReadLine
More like, specifically, 11 seconds spent on file I/O, but you didn't measure that.
Replace your stream creation with this instead:
using var reader = new StreamReader(_filePath, Encoding.UTF8, false, 50 * 1024 * 1024);
That will cause it to read into a 50 MB buffer (play with the size as needed), avoiding repeated I/O on what seems like an ancient hard drive.
I was just trying to introduce enough parallelism to smooth the throughput
Not only did you not introduce any parallelism at all, but you used ReadLineAsync wrong -- it returns a Task<string>, not a string.
It's complete overkill; the buffer size increase will most likely fix your issue, but if you want to actually do this, you need two threads that communicate over a shared data structure, as Peter said.
Only that ended up taking a lot longer than the regular sync stuff
It baffles me that people think multi-threaded code should take less processing power than single-threaded code. There has to be some really basic understanding missing from present day education to lead to this. Multi-threading includes multiple extra context switches, mutex contention, your OS scheduler kicking in to replace one of your threads (leading to starvation or oversaturation), gathering, serializing and aggregating results after work is done etc. None of that is free or easy to implement.

Downloading while processing

I have the following scenario, a timer every x minutes:
download an item to work from a rest service (made in php)
run a batch process to elaborate the item
Now the application is fully functional, but I want to speed up the entire process by downloading another item (if present in the rest service) while the application is processing one.
I think that I need a buffer/queue to accomplish this, like BlockingCollection, but I've no idea how to use it.
What's the right way to accomplish what I'm trying to do?
Thank you in advance!
What you can do is create a function which checks for new files to download. Have this function start as its own background thread that runs in an infinite loop, checking for new downloads in each iteration. If it finds any files that need downloading, call a separate function to download the file as a new thread. This new download function can then call the processing function as yet another thread once the file finishes downloading. With this approach you will be running all tasks in parallel for multiple files if needed.
Functions can be started as new threads by doing this
Thread thread = new Thread(FunctionName);
thread.Start();
Use Microsoft's Reactive Framework (NuGet "System.Reactive"). Then you can do this:
var x_minutes = 5;

var query =
    from t in Observable.Interval(TimeSpan.FromMinutes(x_minutes))
    from i in Observable.Start(() => DownloadAnItem())
    from e in Observable.Start(() => ElaborateItem(i))
    select new { i, e };

var subscription =
    query.Subscribe(x =>
    {
        // Do something with each `x.i` & `x.e`
    });
Multi-threaded and simple.
If you want to stop processing then just call subscription.Dispose().

Thread Queue Process

I'm building this program in visual studio 2010 using C# .Net4.0
The goal is to use thread and queue to improve performance.
I have a list of urls I need to process.
string[] urls = { url1, url2, url3, etc.} //up to 50 urls
I have a function that will take in each url and process them.
public void processUrl(string url) {
//some operation
}
Originally, I created a for-loop to go through each urls.
for (int i = 0; i < urls.Length; i++)
    processUrl(urls[i]);
The method works, but the program is slow as it was going through urls one after another.
So the idea is to use threading to reduce the time, but I'm not too sure how to approach that.
Say I want to create 5 threads to process at the same time.
When I start the program, it will start processing the first 5 urls. When one is done, the program start process the 6th url; when another one is done, the program starts processing the 7th url, and so on.
The problem is, I don't know how to actually create a 'queue' of urls and be able to go through the queue and process.
Can anyone help me with this?
-- EDIT at 1:42PM --
I ran into another issue when I was running 5 processes at the same time.
The processUrl function involves writing to a log file. And if multiple processes time out at the same time, they write to the same log file at the same time, and I think that's throwing an error.
I'm assuming that's the issue because the error message I got was "The process cannot access the file 'data.log' because it is being used by another process."
The simplest option would be to just use Parallel.ForEach. Provided processUrl is thread safe, you could write:
Parallel.ForEach(urls, processUrl);
I wouldn't suggest restricting it to 5 threads (the scheduler normally scales automatically), but this can be done via:
Parallel.ForEach(urls, new ParallelOptions { MaxDegreeOfParallelism = 5}, processUrl);
That being said, URL processing is, by its nature, typically IO bound, and not CPU bound. If you could use Visual Studio 2012, a better option would be to rework this to use the new async support in the language. This would require changing your method to something more like:
public async Task ProcessUrlAsync(string url)
{
// Use await with async methods in the implementation...
You could then use the new async support in the loop:
// Create an enumerable to Tasks - this will start all async operations..
var tasks = urls.Select(url => ProcessUrlAsync(url));
await Task.WhenAll(tasks); // "Await" until they all complete
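If a cap of roughly five concurrent operations is still wanted in the async version, a SemaphoreSlim can act as a throttle. A sketch, reusing the ProcessUrlAsync method from above:

```csharp
var throttle = new SemaphoreSlim(5); // at most 5 operations in flight

var tasks = urls.Select(async url =>
{
    await throttle.WaitAsync(); // wait for a free slot
    try
    {
        await ProcessUrlAsync(url);
    }
    finally
    {
        throttle.Release(); // free the slot for the next url
    }
});

await Task.WhenAll(tasks);
```

Unlike MaxDegreeOfParallelism, this limits concurrent operations rather than threads, which is the right unit for IO-bound work.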
Use a Parallel Foreach with the Max Degree of Parallelism set to the number of threads you want (or leave it empty and let .NET do the work for you)
ParallelOptions parallelOptions = new ParallelOptions();
parallelOptions.MaxDegreeOfParallelism = 5;
Parallel.ForEach(urls, parallelOptions, url =>
{
processUrl(url);
});
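Regarding the edit about the log file: when several URLs are processed at once, the writes to data.log have to be serialized. One sketch of how that might look (assuming processUrl currently opens the file itself; the Log helper here is hypothetical):

```csharp
private static readonly object _logLock = new object();

private static void Log(string message)
{
    // Only one thread at a time may open, append to, and close the log file,
    // so the "file is being used by another process" error cannot occur.
    lock (_logLock)
    {
        File.AppendAllText("data.log", message + Environment.NewLine);
    }
}
```

Every code path that touches data.log must go through this helper for the lock to be effective.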
If you really want to create threads to accomplish your task in place of using parallel execution:
Suppose that I want one thread for each URL:
string[] urls = {"url1", "url2", "url3"};
I just start a new Thread instance for each URL (or each 5 url's):
foreach (var thread in urls.Select(url => new Thread(() => DownloadUrl(url))))
thread.Start();
And the method to download your URL:
private static void DownloadUrl(string url)
{
Console.WriteLine(url);
}

UI not responding while copying a bulk of files

I need you to debug my idea about this project.
I've written a backup manager project: I give it a folder and it copies every file and folder in it to another location, and so on.
It works (does the copy job well), but during copying, which takes about 1 minute, the application UI does not respond. I've heard about threads, and I've seen the phrase parallel programming (just the phrase and no more); now I want some explanation, comparison and examples so I can switch my code over.
I have done very simple actions with threads before but it was a long time ago and I am not confident enough on threading. Here is my code :
private void CopyFiles(string path, string dest)
{
    System.IO.Directory.CreateDirectory(dest + "\\" + path.Split('\\')[path.Split('\\').Count() - 1]);
    dest = dest + "\\" + path.Split('\\')[path.Split('\\').Count() - 1];
    foreach (string file in System.IO.Directory.GetFiles(path))
    {
        System.IO.File.Copy(file, dest + "\\" + file.Split('\\')[file.Split('\\').Count() - 1]);
    }
    foreach (string folder in System.IO.Directory.GetDirectories(path))
    {
        CopyFiles(folder, dest);
    }
}
I run this in a timer based on a set interval; if I end up using threading, should I omit the timer? Lead me, I'm confused.
Since you are not confident with threading enough, I highly recommend you read Joe Albahari's Threading in C# Tutorial. Parallel programming is when you do multiple operations in 'parallel' or at the same time (mostly for spreading large amounts of calculations over several CPU or GPU cores). In this case you want threading to make your UI responsive while copying all the files. Essentially, you would have something set out like this: (After you read the threading in C# tutorial)
Thread copyFilesThread = new Thread(() =>
{
CopyFiles(path, dest);
});
copyFilesThread.Start();
The UI runs on its own thread. All of the code that is put into your application will run on the UI thread (unless you are explicitly using threading). Since your CopyFiles method takes a long time, it will stop the UI until the copying job is completed. Using threading will run the CopyFiles on a separate thread to the UI thread, therefore making the UI thread responsive.
Edit: As for your timer, how often does it run?
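If .NET 4.5 or later is an option, the same idea is usually written with async/await and Task.Run, which keeps the UI thread free without managing a Thread by hand. A sketch (the button name is hypothetical):

```csharp
private async void BackupButton_Click(object sender, EventArgs e)
{
    BackupButton.Enabled = false;
    // CopyFiles runs on a thread-pool thread; the UI thread keeps pumping messages.
    await Task.Run(() => CopyFiles(path, dest));
    BackupButton.Enabled = true; // back on the UI thread after the copy finishes
}
```

The await resumes on the UI thread, so it is safe to touch controls after it, with no Invoke calls needed.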
A simple way to perform an operation in a separate dedicated thread which allows you to know when the thread has completed is by using BackgroundWorker.
An example of usage is on the page I linked above.
If you want to copy a large or unknown number of files, you should use the ThreadPool:
ThreadPool.QueueUserWorkItem(delegate
{
    CopyFiles(folder, dest);
});
Background worker can be used to implement asynchronous execution.
This link may help
http://www.codeproject.com/Articles/20627/BackgroundWorker-Threads-and-Supporting-Cancel

How to know if all the Thread Pool's threads are already done with their tasks?

I have this application that will recurse all folders in a given directory and look for PDFs. If a PDF file is found, the application will count its pages using iTextSharp. I did this by using a thread to recursively scan all the folders for PDFs; when a PDF is found, it is queued to the thread pool. The code looks like this:
// spawn a thread to handle the processing of PDFs in each folder.
var th = new Thread(() =>
{
    pdfDirectories = Directory.GetDirectories(pdfPath);
    processDir(pdfDirectories);
});
th.Start();

private void processDir(string[] dirs)
{
    foreach (var dir in dirs)
    {
        pdfFiles = Directory.GetFiles(dir, "*.pdf");
        processFiles(pdfFiles);
        string[] newdir = Directory.GetDirectories(dir);
        processDir(newdir);
    }
}

private void processFiles(string[] files)
{
    foreach (var pdf in files)
    {
        ThreadPoolHelper.QueueUserWorkItem(
            new { path = pdf },
            (data) => { processPDF(data.path); }
        );
    }
}
My problem is, how do I know that the thread pool's threads have finished processing all the queued items, so I can tell the user that the application is done with its intended task?
Generally I would do something like this by having a counter variable.
For each work item you queue in the ThreadPool add one to the counter variable.
Then when it is processed you would decrease the counter variable.
Be sure that you do the incrementing and decrementing via the methods on the Interlocked class as this will ensure that things are done in a thread-safe manner.
Once the counter hits zero you could flag that the tasks are completed using a ManualResetEvent
If you have access to .NET 4 then you can use the new CountdownEvent class to do a similar thing.
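A minimal sketch of the CountdownEvent approach (the initial count of 1, released by the final Signal, prevents the count from racing to zero while items are still being queued):

```csharp
using System;
using System.Threading;

class Program
{
    static void Main()
    {
        string[] files = { "a.pdf", "b.pdf", "c.pdf" };
        using (var countdown = new CountdownEvent(1)) // 1 = the queuing phase itself
        {
            foreach (var pdf in files)
            {
                countdown.AddCount(); // check-in for this work item
                ThreadPool.QueueUserWorkItem(state =>
                {
                    try { ProcessPdf((string)state); }
                    finally { countdown.Signal(); } // check-out, even on exception
                }, pdf);
            }
            countdown.Signal(); // the queuing phase is done
            countdown.Wait();   // blocks until every queued item has signaled
        }
        Console.WriteLine("All PDFs processed");
    }

    static void ProcessPdf(string path) { /* count pages here */ }
}
```

This replaces both the Interlocked counter and the ManualResetEvent with one primitive, and avoids any Sleep-based polling loop.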
1) How to know if all threads are finished?
You will have to let the Threads do their own check-in/check-out, by bracketing your code between:
Interlocked.Increment(ref jobCounter);
// your code
Interlocked.Decrement(ref jobCounter);
If this messes up your anonymous delegates too much then just use wrapper methods. You are probably going to have to add exception handling too.
The Interlocked approach still leaves the problem of waiting for the counter to become 0; a loop with Sleep() is a weak but, in this case, viable solution.
2) You are starting threads in a recursive tree walker. Be careful: you are probably creating too many of them, and that will hurt performance.
