multithread read and process large text files

I have 10 lists of email addresses, each over 100 MB, and I want to process them with multiple threads as fast as possible without loading them fully into memory (something like reading line by line or in small blocks).
I have written one function that removes invalid addresses based on a regex and another that sorts the valid ones into separate lists by domain.
I managed to do it on a single thread with:
    while (reader.Peek() != -1)
but it takes too damn long.
How can I use multiple threads (around 100 - 200), and maybe a BackgroundWorker or something similar, so that the form stays usable while the lists are processed in parallel?
I'm new to csharp :P

Unless the data is on multiple physical discs, chances are that any more than a few threads will slow down, rather than speed up, the process.
What'll happen is that rather than reading consecutive data (pretty fast), you'll end up seeking to one place to read data for one thread, then seeking to somewhere else to read data for another thread, and so on. Seeking is relatively slow, so it ends up slower -- often quite a lot slower.
About the best you can do is dedicate one thread to reading data from each physical disc, then another to process the data -- but unless your processing is quite complex, or you have a lot of fast hard drives, one thread for processing may be entirely adequate.

There are multiple approaches to it:
1.) You can create threads explicitly, e.g. Thread t = new Thread(...), but creating and managing threads this way is expensive.
2.) You can use the .NET ThreadPool and pass a delegate for the method you want to run to the static ThreadPool.QueueUserWorkItem method. This approach needs some manual bookkeeping and synchronization primitives.
3.) You can create an array of System.Threading.Tasks.Task, each processing one list. The tasks run in parallel across the available processors, and you pass the array to Task.WaitAll(Task[]) to wait for their completion. This approach is known as task parallelism; detailed information is available on MSDN.
    // Allocate the array first; assigning into a null array would throw.
    Task[] tasks = new Task[10];
    for (int i = 0; i < tasks.Length; i++)
    {
        // Queue the work on a ThreadPool thread.
        // Replace the lambda body with your own per-list processing.
        tasks[i] = Task.Factory.StartNew(() => { /* process one list here */ });
    }
    try
    {
        // Wait for all tasks to complete.
        Task.WaitAll(tasks);
    }
    catch (AggregateException ae)
    {
        // Raised if one or more tasks throw; the exceptions from the
        // faulted tasks are collected in ae.InnerExceptions.
    }
    // Continue your processing here.

You will want to take a look at the Task Parallel Library (TPL).
This library is made for parallel work. It schedules your actions on the ThreadPool in what is typically the most efficient fashion. The one caution: if you run 100-200 threads at once, you will likely pay for a lot of context switching, unless you have 100-200 processors. A good rule of thumb is to run only as many tasks in parallel as you have processors.
Some other good resources to review how to use the TPL:
Why and how to use the TPL
How to start a task.

I would be inclined to use parallel linq (plinq).
Something along the lines of:
    Lists.AsParallel()
        .SelectMany(list => list)
        .Where(MyItemFilteringFunction)
        .GroupBy(DomainExtractionFunction);
AsParallel tells LINQ that it may run the query in parallel (which means the ordering of everything that follows is not maintained).
SelectMany takes your individual lists and unrolls them so that all items from all lists are effectively in a single enumerable.
Where filters the items using your predicate function.
GroupBy collects them by key, where DomainExtractionFunction is a function that gets a key (the domain name in your case) from an item (i.e. the email).
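Since the question asks for line-by-line reading, here is a minimal sketch of the same PLINQ shape fed from File.ReadLines, which streams a file lazily instead of loading it whole. The class name, the simplified regex, and the GetDomain helper are assumptions made for illustration, not the asker's real functions. Note that the grouped result still holds every valid address in memory.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text.RegularExpressions;

public static class EmailGrouping
{
    // Deliberately simple pattern, for illustration only (assumption).
    private static readonly Regex EmailPattern =
        new Regex(@"^[^@\s]+@[^@\s]+\.[^@\s]+$", RegexOptions.Compiled);

    public static bool IsValidEmail(string line) => EmailPattern.IsMatch(line);

    // Everything after the '@', lower-cased so "X.com" and "x.com" share a group.
    public static string GetDomain(string email) =>
        email.Substring(email.IndexOf('@') + 1).ToLowerInvariant();

    // Streams every file line by line and groups the valid addresses by domain.
    public static Dictionary<string, List<string>> GroupByDomain(IEnumerable<string> paths)
    {
        return paths
            .SelectMany(path => File.ReadLines(path))   // lazy: lines are read on demand
            .AsParallel()
            .Select(line => line.Trim())
            .Where(line => IsValidEmail(line))
            .GroupBy(email => GetDomain(email))
            .ToDictionary(g => g.Key, g => g.ToList());
    }
}
```

Calling EmailGrouping.GroupByDomain with the ten file paths returns one list per domain. Whether this beats the single-threaded loop depends on how expensive the regex is relative to the disk reads, so it is worth measuring.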

Related

Thread Contention on a ConcurrentDictionary in C#

I have a C# .NET program that uses an external API to process events for real-time stock market data. I use the API callback feature to populate a ConcurrentDictionary with the data it receives on a stock-by-stock basis.
I have a set of algorithms that each run in a constant loop until a terminal condition is met. They are called like this (but all from separate calling functions elsewhere in the code):
    Task.Run(() => ExecutionLoop1());
    Task.Run(() => ExecutionLoop2());
    ...
    Task.Run(() => ExecutionLoopN());
Each one of those functions calls SnapTotals():
    public void SnapTotals()
    {
        foreach (KeyValuePair<string, MarketData> kvpMarketData in
            new ConcurrentDictionary<string, MarketData>(Handler.MessageEventHandler.Realtime))
        {
            ...
The Handler.MessageEventHandler.Realtime object is the ConcurrentDictionary that is updated in real-time by the external API.
At a certain specific point in the day, there is an instant burst of data that comes in from the API. That is the precise time I want my ExecutionLoop() functions to do some work.
As I've grown the program and added more of those execution loop functions, and grown the number of elements in the ConcurrentDictionary, the performance of the program as a whole has seriously degraded. Specifically, those ExecutionLoop() functions all seem to freeze up and take much longer to meet their terminal condition than they should.
I added some logging to all of the functions above, and to the function that updates the ConcurrentDictionary. From what I can gather, the ExecutionLoop() functions appear to access the ConcurrentDictionary so often that they block the API from updating it with real-time data. The loops are dependent on that data to meet their terminal condition so they cannot complete.
I'm stuck trying to figure out a way to re-architect this. I would like for the thread that updates the ConcurrentDictionary to have a higher priority but the message events are handled from within the external API. I don't know if ConcurrentDictionary was the right type of data structure to use, or what the alternative could be, because obviously a regular Dictionary would not work here. Or is there a way to "pause" my execution loops for a few milliseconds to allow the market data feed to catch up? Or something else?

Your basic approach is sound except for one fatal flaw: they are all hitting the same dictionary at the same time via iterators, sets, and gets. So you must do one thing: in SnapTotals you must iterate over a copy of the concurrent dictionary.
When you iterate over Handler.MessageEventHandler.Realtime, or even over new ConcurrentDictionary<string, MarketData>(Handler.MessageEventHandler.Realtime), you are using the ConcurrentDictionary<>'s iterator. Even though it is thread-safe, it keeps using the dictionary for the entire period of iteration (including however long it takes to process each and every entry). That is most likely where the contention occurs.
Making a copy of the dictionary is much faster, so should lower contention.
Change SnapTotals to
    public void SnapTotals()
    {
        var copy = Handler.MessageEventHandler.Realtime.ToArray();
        foreach (var kvpMarketData in copy)
        {
            ...
Now, each ExecutionLoopX can execute in peace without write-side contention (your API updates) and without read-side contention from the other loops. The write-side can execute without read-side contention as well.
The only "contention" should be for the short duration needed to do each copy.
And by the way, the dictionary copy (an array) is not threadsafe; it's just a plain array, but that is ok because each task is executing in isolation on its own copy.
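A self-contained sketch of the snapshot idea, in case it helps to see it end to end. The Realtime field here is a stand-in for Handler.MessageEventHandler.Realtime, and decimal replaces the MarketData type, which the question does not show; both are assumptions.

```csharp
using System.Collections.Concurrent;
using System.Collections.Generic;

public static class SnapshotDemo
{
    // Stand-in for Handler.MessageEventHandler.Realtime (assumption);
    // decimal replaces the MarketData type, which is not shown in the question.
    public static readonly ConcurrentDictionary<string, decimal> Realtime =
        new ConcurrentDictionary<string, decimal>();

    public static decimal SnapTotal()
    {
        // ConcurrentDictionary.ToArray() copies the entries in one short operation.
        KeyValuePair<string, decimal>[] copy = Realtime.ToArray();

        decimal total = 0m;
        foreach (KeyValuePair<string, decimal> kvp in copy)
        {
            // Slow per-entry work happens on the private copy,
            // not on the shared dictionary.
            total += kvp.Value;
        }
        return total;
    }
}
```

Updates that arrive after the copy is taken are simply picked up by the next call, which matches a loop that re-snapshots on every pass.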

I think that your main problem is not related to the ConcurrentDictionary, but to the large number of ExecutionLoopX methods. Each of these methods saturates a CPU core, and since the methods are more than the cores of your machine, the whole CPU is saturated. My assumption is that if you find a way to limit the degree of parallelism of the ExecutionLoopX methods to a number smaller than the Environment.ProcessorCount, your program will behave and perform better. Below is my suggestion for implementing this limitation.
The main obstacle is that currently your ExecutionLoopX methods are monolithic: they can't be separated to pieces so that they can be parallelized. My suggestion is to change their return type from void to async Task, and place an await Task.Yield(); inside the outer loop. This way it will be possible to execute them in steps, with each step being the code from the one await to the next.
Then create a TaskScheduler with limited concurrency, and a TaskFactory that uses this scheduler:
    int maxDegreeOfParallelism = Environment.ProcessorCount - 1;
    TaskScheduler scheduler = new ConcurrentExclusiveSchedulerPair(
        TaskScheduler.Default, maxDegreeOfParallelism).ConcurrentScheduler;
    TaskFactory taskFactory = new TaskFactory(scheduler);
Now you can parallelize the execution of the methods, by starting the tasks with the taskFactory.StartNew method instead of the Task.Run:
    List<Task> tasks = new();
    tasks.Add(taskFactory.StartNew(() => ExecutionLoop1(data)).Unwrap());
    tasks.Add(taskFactory.StartNew(() => ExecutionLoop2(data)).Unwrap());
    tasks.Add(taskFactory.StartNew(() => ExecutionLoop3(data)).Unwrap());
    tasks.Add(taskFactory.StartNew(() => ExecutionLoop4(data)).Unwrap());
    //...
    Task.WaitAll(tasks.ToArray());
The .Unwrap() is needed because the taskFactory.StartNew returns a nested task (Task<Task>). The Task.Run method is also doing this unwrapping internally, when the action is asynchronous.
An online demo of this idea can be found here.
The Environment.ProcessorCount - 1 configuration means that one CPU core will be available for other work, like the communication with the external API and the updating of the ConcurrentDictionary.
A more cumbersome implementation of the same idea, using iterators and the Parallel.ForEach method instead of async/await, can be found in the first revision of this answer.
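To make the moving parts concrete, here is a small sketch of the limited-concurrency idea that also records how many loop steps ever ran at the same time. The ExecutionLoop body, the spin-wait standing in for real work, and the counters are all hypothetical; only the scheduler, TaskFactory, Unwrap and Task.Yield usage follow the description above.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

public static class LimitedLoops
{
    private static int _running;      // steps executing right now
    private static int _maxObserved;  // highest value _running ever reached

    public static int MaxObserved => Volatile.Read(ref _maxObserved);

    // Hypothetical loop: each iteration is one "step" that ends at the await.
    private static async Task ExecutionLoop(int iterations)
    {
        for (int i = 0; i < iterations; i++)
        {
            int now = Interlocked.Increment(ref _running);
            int seen;
            // Record a new maximum if this step pushed concurrency higher.
            while (now > (seen = Volatile.Read(ref _maxObserved)))
                Interlocked.CompareExchange(ref _maxObserved, now, seen);

            Thread.SpinWait(20000);               // stand-in for CPU-bound work
            Interlocked.Decrement(ref _running);

            // Give up the scheduler slot so another loop can run its next step.
            await Task.Yield();
        }
    }

    public static void RunAll(int loopCount, int maxDegreeOfParallelism, int iterations)
    {
        TaskScheduler scheduler = new ConcurrentExclusiveSchedulerPair(
            TaskScheduler.Default, maxDegreeOfParallelism).ConcurrentScheduler;
        var taskFactory = new TaskFactory(scheduler);

        var tasks = new List<Task>();
        for (int n = 0; n < loopCount; n++)
        {
            // StartNew returns Task<Task> for an async delegate, hence Unwrap().
            tasks.Add(taskFactory.StartNew(() => ExecutionLoop(iterations)).Unwrap());
        }
        Task.WaitAll(tasks.ToArray());
    }
}
```

For example, LimitedLoops.RunAll(8, 2, 50) starts eight loops but should never have more than two steps in flight, which MaxObserved lets you check. This relies on there being no SynchronizationContext (as in a console app), so that the continuation after Task.Yield is queued back to the limited scheduler.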

If you're not squeamish about mixing operations in a task, you could redesign such that instead of task A doing A things, B doing B things, C doing C things, etc. you can reduce the number of tasks to the number of processors, and thus run fewer concurrently, greatly easing contention.
So, for example, say you have just two processors. Make a "general purpose/pluggable" task wrapper that accepts delegates. So, wrapper 1 would accept delegates to do A and B work. Wrapper 2 would accept delegates to do C and D work. Then ask each wrapper to spin up a task that calls the delegates in a loop over the dictionary.
This would of course need to be measured. What I am proposing is, say, 4 tasks each doing 4 different types of processing. This is 4 units of work per loop over 4 loops. This is not the same as 16 tasks each doing 1 unit of work. In that case you have 16 loops.
16 loops intuitively would cause more contention than 4.
Again, this is a potential solution that should be measured. There is one drawback for sure: you will have to ensure that a piece of work within a task doesn't affect any of the others.

Parallel for each or any alternative for parallel loop?

I have this code
    Lines.ToList().ForEach(y =>
    {
        globalQueue.AddRange(GetTasks(y.LineCode).ToList());
    });
So for each line in my list of lines I get the tasks that I add to a global production queue. I can have 8 lines. Each get task request GetTasks(y.LineCode) take 1 minute. I would like to use parallelism to be sure I request my 8 calls together and not one by one.
What should I do?
Using another ForEach loop or using another extension method? Is there a ForEachAsync? Make the GetTasks request itself async?

Parallelism isn't concurrency. Concurrency isn't asynchrony. Running multiple slow queries in parallel won't make them run faster, quite the opposite. These are different problems and require very different solutions. Without a specific problem one can only give generic advice.
Parallelism - processing an 800K item array
Parallelism means processing a ton of data using multiple cores in parallel. To do that, you need to partition your data and feed each partition to a "worker" for processing. You need to minimize communication between workers and the need of synchronization to get the best performance, otherwise your workers will spend CPU time doing nothing. That means, no global queue updating.
If you have a lot of lines, or if line processing is CPU-bound, you can use PLINQ to process them:
    var query = from y in lines.AsParallel()
                from t in GetTasks(y.LineCode)
                select t;
    var theResults = query.ToList();
That's it. No need to synchronize access to a queue, either through locking or using a concurrent collection. This will use all available cores though. You can add WithDegreeOfParallelism() to reduce the number of cores used and avoid freezing the machine.
Concurrency - calling 2000 servers
Concurrency on the other hand means doing several different things at the same time. No partitioning is involved.
For example, if I had to query 8 or 2000 servers for monitoring data (true story) I wouldn't use Parallel or PLINQ. For one thing, Parallel and PLINQ use all available cores. In this case though they won't be doing anything, they'll just wait for responses. Parallelism classes can't handle async methods either because there's no point - they aren't meant to wait for responses.
A very quick and dirty solution would be to start multiple tasks and wait for them to return, e.g.:
    var tasks = lines.Select(y => Task.Run(() => GetTasks(y.LineCode)));
    // Array of individual results
    var resultsArray = await Task.WhenAll(tasks);
    // Flatten the results
    var resultList = resultsArray.SelectMany(r => r).ToList();
This will start all requests at once. Network Security didn't like the 2000 concurrent requests, since it looked like a hack attack and caused a bit of network flooding.
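If you want to keep the start-everything-and-await shape but cap how many requests are in flight, one common alternative (not what the rest of this answer uses) is a SemaphoreSlim gate. The sketch below is a generic helper; the class name, the RunAsync signature, and the assumption that a request returns Task<IEnumerable<TResult>> are mine, for illustration only.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

public static class ThrottledRequests
{
    // Runs `request` for every input, with at most `maxConcurrent` calls in flight.
    public static async Task<List<TResult>> RunAsync<TInput, TResult>(
        IEnumerable<TInput> inputs,
        Func<TInput, Task<IEnumerable<TResult>>> request,
        int maxConcurrent)
    {
        using (var gate = new SemaphoreSlim(maxConcurrent))
        {
            var tasks = inputs.Select(async input =>
            {
                await gate.WaitAsync();          // wait for a free slot
                try { return await request(input); }
                finally { gate.Release(); }      // free the slot even on failure
            }).ToList();                         // ToList starts each task exactly once

            // WhenAll returns the results in the same order as the inputs.
            IEnumerable<TResult>[] results = await Task.WhenAll(tasks);
            return results.SelectMany(r => r).ToList();
        }
    }
}
```

With 8 lines and maxConcurrent set to 8 this behaves like the quick and dirty version; with 2000 servers a smaller limit avoids flooding the network.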
Concurrency with Dataflow
We can use the TPL Dataflow library and eg ActionBlock or TransformBlock to make the requests with a controlled degree of parallelism :
    var options = new ExecutionDataflowBlockOptions {
        MaxDegreeOfParallelism = 4,
        BoundedCapacity = 10,
    };
    var spamBlock = new TransformManyBlock<Line, Result>(
        y => GetTasks(y.LineCode),
        options);
    var outputBlock = new BufferBlock<Result>();
    spamBlock.LinkTo(outputBlock);
    foreach (var line in lines)
    {
        await spamBlock.SendAsync(line);
    }
    spamBlock.Complete();
    // Wait for all 4 workers to finish
    await spamBlock.Completion;
Once the spamBlock completes, the results can be found in outputBlock. By setting a BoundedCapacity I ensure that the posting loop will wait if there are too many unprocessed messages in spamBlock's input queue.
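A TransformManyBlock can handle asynchronous methods too. Assuming GetTasksAsync returns a Task<Result[]>, we can use:
    var spamBlock = new TransformManyBlock<Line, Result>(
        // async/await lets the Result[] convert to the IEnumerable<Result> the block expects
        async y => await GetTasksAsync(y.LineCode),
        options);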

You can use Parallel.ForEach:
    Parallel.ForEach(Lines, line =>
    {
        var tasks = GetTasks(line.LineCode).ToList();
        // AddRange on a List<T> is not thread-safe, so serialize the update.
        lock (globalQueue) { globalQueue.AddRange(tasks); }
    });
A Parallel.ForEach loop works like a Parallel.For loop. The loop partitions the source collection and schedules the work on multiple threads based on the system environment. The more processors on the system, the faster the parallel method runs.
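One thing to watch with Parallel.ForEach is the shared collection: several threads add to it at once. A sketch of one way to handle that is to collect into a ConcurrentQueue<T>. The GetTasks stand-in below just fabricates three strings per line code; it is a placeholder assumption, not the asker's real method.

```csharp
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading.Tasks;

public static class ParallelCollect
{
    // Hypothetical stand-in for GetTasks(lineCode).
    public static IEnumerable<string> GetTasks(string lineCode)
    {
        for (int i = 1; i <= 3; i++)
            yield return lineCode + "-task" + i;
    }

    public static List<string> CollectAll(IEnumerable<string> lineCodes)
    {
        var globalQueue = new ConcurrentQueue<string>();

        Parallel.ForEach(lineCodes, lineCode =>
        {
            foreach (string task in GetTasks(lineCode))
                globalQueue.Enqueue(task);   // safe to call from several threads
        });

        // Copy into a plain list once all workers are done.
        return new List<string>(globalQueue);
    }
}
```

The order of items in the result is not guaranteed, since it depends on which worker finishes first.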

Thread management with ASP.NET async/await

I've got a database entity type Entity, a long list of Thingy and method
    private Task<Entity> MakeEntity(Thingy thingy) {
        ...
    }
MakeEntity does lots of stuff, and is CPU bound. I would like to convert all my thingies to entities, and save them in a db.context. Considering that
I don't want to finish as fast as possible
The number of entities is large, and I want to use the database effectively, so I want to start saving changes and waiting for the remote database to do its thing
how can I do this performantly? What I would really like is to loop while waiting for the database to do its thing, and offer all the newly made entities so far, until the database has processed them all. What's the best route there? I've run into SaveChanges throwing if it's called concurrently, so I can't do that. What I'd really like is to have a thread pool of eight threads (or rather, as many threads as I have cores) to do the CPU-bound work, and a single thread doing the SaveChanges()

This is a kind of "asynchronous stream", which is always a bit awkward.
In this case (assuming you really do want to multithread on ASP.NET, which is not recommended in general), I'd say TPL Dataflow is your best option. You can use a TransformBlock with MaxDegreeOfParallelism set to 8 (or unbounded, for that matter), and link it to an ActionBlock that does the SaveChanges.
Remember, use synchronous signatures (not async/await) for CPU-bound code, and asynchronous methods for I/O-bound code (i.e., SaveChangesAsync).

You could set up a pipeline of N CPU workers feeding into a database worker. The database worker could batch items up.
Since MakeEntity is CPU bound there is no need to use async and await there. await does not create tasks or threads (a common misconception).
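    var thingies = ...;
    // MakeEntity is assumed to be synchronous here (returning Entity, not Task<Entity>).
    var entities = thingies.AsParallel().WithDegreeOfParallelism(8).Select(MakeEntity);
    var batches = CreateBatches(entities, batchSize: 100);
    foreach (var batch in batches) {
        Insert(batch);
    }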
You need to provide a method that creates batches from an IEnumerable. This is available on the web.
If you don't need batching for the database part you can delete that code.
For the database part you probably don't need async IO because it seems to be a low-frequency operation.
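For reference, a batching helper along the lines of the CreateBatches call above could look like the sketch below. The name and signature are taken from the snippet, but the implementation is my own assumption of what is intended. On .NET 6 or later the built-in Enumerable.Chunk does much the same job.

```csharp
using System.Collections.Generic;
using System.Linq;

public static class Batching
{
    // Lazily splits `source` into lists of at most `batchSize` items.
    public static IEnumerable<List<T>> CreateBatches<T>(IEnumerable<T> source, int batchSize)
    {
        var batch = new List<T>(batchSize);
        foreach (T item in source)
        {
            batch.Add(item);
            if (batch.Count == batchSize)
            {
                yield return batch;
                batch = new List<T>(batchSize);   // start a fresh list for the next batch
            }
        }
        if (batch.Count > 0)
            yield return batch;                   // final, possibly smaller batch
    }
}
```

Because the helper is lazy, the foreach over the batches pulls entities from the PLINQ query as it goes, so inserting a batch overlaps with the CPU workers producing the next one.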

Parallelization of long running processes and performance optimization

I would like to parallelize the application that processes multiple video clips frame by frame. Sequence of each frame per clip is important (obviously).
I decided to go with TPL Dataflow since I believe this is a good example of dataflow (movie frames being data).
So I have one process that loads frames from the database (let's say in batches of 500, all bunched up)
Example sequence:
|mid:1 fr:1|mid:1 fr:2|mid:2 fr:1|mid:3 fr:1|mid:1 fr:3|mid:2 fr:2|mid:2 fr:3|mid:1 fr:4|
and posts them to BufferBlock. To this BufferBlock I have linked ActionBlocks with the filter to have one ActionBlock per MovieID so that I get some kind of data partitioning. Each ActionBlock is sequential, but ideally multiple ActionBlocks for multiple movies can run in parallel.
I do have the network described above working, and it does run in parallel, but from my calculations only eight to ten ActionBlocks are executing simultaneously. I timed each ActionBlock's running time and it's around 100-200ms.
What steps can I take to at least double concurrency?
I did try converting action delegates to async methods and make database access asynchronous within ActionBlock action delegate but it did not help.
EDIT: I implemented extra level of data partitioning: frames for Movies with Odd IDs are processed on ServerA, frames for Even movies are processed on ServerB. Both instances of the application hit the same database. If my problem was DB IO, then I would not see any improvement in total frames processed count (or very little, under 20%). But I do see it doubling. So this leads me to conclude that Threadpool is not spawning more threads to do more frames in parallel (both servers are quad-cores and profiler shows about 25-30 threads per application).

Some assumptions:
From your example data, you are receiving movie frames (and possibly the frames in the movies) out of order
Your ActionBlock<T> instances are generic; they all call the same method for processing, you just create a list of them based on each movie id (you have a list of movie ids beforehand) like so:
    // The movie IDs
    IEnumerable<int> movieIds = ...;
    // The actions.
    var actions = movieIds.Select(
        i => new { Id = i, Action = new ActionBlock<Frame>(MethodToProcessFrame) });
    // The buffer block.
    BufferBlock<Frame> bufferBlock = ...;
    // Link everything up.
    foreach (var action in actions)
    {
        // Not necessary in C# 5.0, but still, good practice.
        // The copy of the action.
        var actionCopy = action;
        // Link.
        bufferBlock.LinkTo(actionCopy.Action, f => f.MovieId == actionCopy.Id);
    }
If this is the case, you're creating too many ActionBlock<T> instances which aren't being given work; because your frames (and possibly movies) are out-of-order, you aren't guaranteed that all of the ActionBlock<T> instances will have work to do.
Additionally, when you create an ActionBlock<T> instance it's going to be created with a MaxDegreeOfParallelism of 1, meaning that it's thread safe because only one thread can access the block at the same time.
Additionally, the TPL Dataflow library ultimately relies on the Task<TResult> class, which schedules by default on the thread pool. The thread pool is going to do a few things here:
Make sure that all processor cores are saturated. This is very different from making sure that your ActionBlock<T> instances are saturated, and the latter is the metric you should be concerned with.
While the processor cores are saturated, make sure that the work is distributed evenly and that not too many concurrent tasks are executing (context switches are expensive).
It also looks like your method that processes your movies is generic, and it doesn't matter what frame from what movie is passed in (if it does matter, then you need to update your question with that, as it changes a lot of things). This would also mean that it's thread-safe.
Also, if it can be assumed that the processing of one frame doesn't rely on the processing of any previous frames (or, it looks like the frames of the movie come in order) you can use a single ActionBlock<T> but tweak up the MaxDegreeOfParallelism value, like so:
    // The buffer block.
    BufferBlock<Frame> bufferBlock = ...;
    // Have *one* ActionBlock<T>
    var action = new ActionBlock<Frame>(MethodToProcessFrame,
        // This is where you tweak the concurrency:
        new ExecutionDataflowBlockOptions {
            MaxDegreeOfParallelism = 4,
        }
    );
    // Link. No filter needed.
    bufferBlock.LinkTo(action);
Now, your ActionBlock<T> will always be saturated. Granted, any responsible task scheduler (the thread pool by default) is still going to limit the maximum amount of concurrency, but it's going to do as much as it can reasonably do at the same time.
To that end, if your action is truly thread safe, you can set the MaxDegreeOfParallelism to DataflowBlockOptions.Unbounded, like so:
    // Have *one* ActionBlock<T>
    var action = new ActionBlock<Frame>(MethodToProcessFrame,
        // This is where you tweak the concurrency:
        new ExecutionDataflowBlockOptions {
            // We're thread-safe, let the scheduler determine
            // how nuts we can go.
            MaxDegreeOfParallelism = DataflowBlockOptions.Unbounded,
        }
    );
Of course, all of this assumes that everything else is optimal (I/O reads/writes, etc.)

Odds are that's the optimal degree of parallelization. The thread pool is honestly pretty darn good at determining the optimal number of actual threads to have active. My guess is that your hardware can support about that many parallel processes actually working in parallel. If you added more you wouldn't actually be increasing throughput, you'd just be spending more time doing context switches between threads and less time actually working on them.
If you notice that, over an extended period of time, your CPU load, memory bus, network connection, disk access, etc. are all working below capacity, then you might have a problem, and you'd want to check what is actually the bottleneck. Chances are, though, that some resource somewhere is at its capacity, and the TPL has recognized that and ensured that it doesn't oversaturate that resource.

I suspect you are IO bound. The question is where: on the read or on the write? Are you writing more data than you are reading? The CPU may be under 50% because it cannot write out any faster.
I am not saying the ActionBlock is wrong, but I would consider a producer-consumer design with BlockingCollection. Optimize how you read and write data.
This is different, but I have an app where I read blocks of text, parse the text, and then write the words back to SQL. I read on a single thread, parse in parallel, and then write on a single thread. I write on a single thread so as not to fragment indexes. If you are IO bound, you need to figure out which IO is the slowest and then optimize that process.
Tell me more about that IO.
In the question you mention reading from database also.
I would give BlockingCollections a try.
BlockingCollection Class
And have a size limit for each so you don't blow memory.
Make it just big enough that it (almost) never goes empty.
The BlockingCollection after the slowest step will go empty.
If you can process in parallel, then do so.
What I have found is that parallel inserts into a table are not faster.
Let one process take a lock and hold it, and keep that hose open.
Look closely at how you insert.
One row at a time is slow.
I use TVPs and insert 10,000 rows at a time, but a lot of people like Dapper or BulkInsert.
If you drop indexes and triggers and insert sorted by the clustered index, it will be fastest.
Take a tablock and hold it.
I am getting inserts in the 10 ms range.
Right now the update is the slowest.
Look at that - are you doing just one row at a time?
Look at taking a tablock and doing it by video clip.
Unless it is an ugly update, it should not take longer than an insert.
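The read / parse / write pipeline I described can be sketched with two bounded BlockingCollections roughly like this. It is a toy version under stated assumptions: the input is a sequence of text lines, "parsing" is just splitting on spaces, and "writing" appends to a list where a real app would do its batched database insert.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

public static class Pipeline
{
    public static List<string> Run(IEnumerable<string> lines, int parserCount, int capacity)
    {
        // Bounded collections: Add blocks when full, so memory stays limited.
        var raw = new BlockingCollection<string>(capacity);
        var parsed = new BlockingCollection<string>(capacity);
        var written = new List<string>();

        // Single reader.
        Task reader = Task.Run(() =>
        {
            try { foreach (string line in lines) raw.Add(line); }
            finally { raw.CompleteAdding(); }    // lets the parsers finish
        });

        // Several parsers working in parallel.
        Task[] parsers = Enumerable.Range(0, parserCount).Select(_ => Task.Run(() =>
        {
            foreach (string line in raw.GetConsumingEnumerable())
                foreach (string word in line.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries))
                    parsed.Add(word);
        })).ToArray();

        // Single writer (stand-in for the batched database insert).
        Task writer = Task.Run(() =>
        {
            foreach (string word in parsed.GetConsumingEnumerable())
                written.Add(word);
        });

        reader.Wait();
        Task.WaitAll(parsers);
        parsed.CompleteAdding();                 // no more words are coming
        writer.Wait();
        return written;
    }
}
```

With more than one parser the output order is not preserved, so per-clip frame ordering would need its own handling (for example, one parser per clip).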

Parallel.ForEach not spinning up new threads

Hello all, we have a very IO-intensive operation that we wrote using Parallel.ForEach from Microsoft's Parallel Extensions for the .NET Framework. We need to delete a large number of files, and we represent the files to be deleted as a list of lists. Each nested list has 1000 messages in it, and we have 50 of these lists. The issue here is that when I look in the logs afterwards, I only see one thread executing inside of our Parallel.ForEach block.
Here's what the code looks like:
    List<List<Message>> expiredMessagesLists = GetNestedListOfMessages();
    foreach (List<Message> subList in expiredMessagesLists)
    {
        Parallel.ForEach(subList, msg =>
        {
            try
            {
                Logger.LogEvent(TraceEventType.Information, "Purging Message {0} on Thread {1}", msg.MessageID, msg.ExtensionID, Thread.CurrentThread.Name);
                DeleteMessageFiles(msg);
            }
            catch (Exception ex)
            {
                Logger.LogException(TraceEventType.Error, ex);
            }
        });
    }
I wrote some sample code with a simpler data structure and no IO logic, and I could see several different threads executing within the Parallel.ForEach block. Are we doing something incorrect with Parallel.ForEach in the code above? Could it be the list of lists that's tripping it up, or is there some sort of threading limitation for IO operations?

There are a couple of possibilities.
First off, in most cases, Parallel.ForEach will not spawn a new thread. It uses the .NET 4 ThreadPool (all of the TPL does), and will reuse ThreadPool threads.
That being said, Parallel.ForEach uses a partitioning strategy based on the size of the List being passed to it. My first guess is that your "outer" list has many messages, but the inner list only has one Message instance, so the ForEach partitioner is only using a single thread. With one element, Parallel is smart enough to just use the main thread, and not spin work onto a background thread.
Normally, in situations like this, it's better to parallelize the outer loop, not the inner loop. That will usually give you better performance (since you'll have larger work items), although it's difficult to know without having a good sense of the loop sizes plus the size of the Unit of Work. You could also, potentially, parallelize both the inner and outer loops, but without profiling, it'd be difficult to tell what would be the best option.
One other possibility:
Try using Thread.CurrentThread.ManagedThreadId instead of Thread.CurrentThread.Name for your logging. Since Parallel uses ThreadPool threads, the "Name" is often identical across multiple threads. You may think you're only using a single thread when you're in fact using more than one.
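Putting both suggestions together, a rough sketch that parallelizes the outer loop and records ManagedThreadId might look like this. ProcessAll and its parameters are hypothetical names; the process delegate stands in for DeleteMessageFiles.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

public static class OuterLoopDemo
{
    // Processes every item and returns the managed thread IDs that did the work.
    public static ICollection<int> ProcessAll<T>(IEnumerable<List<T>> lists, Action<T> process)
    {
        var threadIds = new ConcurrentDictionary<int, byte>();

        // Parallelize over the sublists; each sublist is handled sequentially.
        Parallel.ForEach(lists, subList =>
        {
            foreach (T item in subList)
            {
                // ManagedThreadId is unique per thread, unlike Name on pool threads.
                threadIds.TryAdd(Thread.CurrentThread.ManagedThreadId, 0);
                process(item);
            }
        });

        return threadIds.Keys;
    }
}
```

How many distinct IDs come back depends on the machine and on how long each item takes, so a small or fast workload may legitimately report a single thread.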

The assumption underlying your code is that it is possible to delete files in parallel. I'm not saying it isn't (I'm no expert on the matter), but I wouldn't be surprised if that is simply not possible for most hardware. You are, after all, performing an operation with a physical object (your hard disk) when you do this.
Suppose you had a class, Person, with a method called RaiseArm(). You could always try shooting off RaiseArm() on 100 different threads, but the Person is only ever going to be able to raise two at a time...
Like I said, I could be wrong. This is just my suspicion.
