Parallel for each or any alternative for parallel loop?

Parallel for each or any alternative for parallel loop? - c#

I have this code
Lines.ToList().ForEach(y =>
{
globalQueue.AddRange(GetTasks(y.LineCode).ToList());
});
So for each line in my list of lines I get the tasks that I add to a global production queue. I can have 8 lines. Each get task request GetTasks(y.LineCode) take 1 minute. I would like to use parallelism to be sure I request my 8 calls together and not one by one.
What should I do?
Using another ForEach loop or using another extension method? Is there a ForEachAsync? Make the GetTasks request itself async?

Parallelism isn't concurrency. Concurrency isn't asynchrony. Running multiple slow queries in parallel won't make them run faster, quite the opposite. These are different problems and require very different solutions. Without a specific problem one can only give generic advice.
Parallelism - processing an 800K item array
Parallelism means processing a ton of data using multiple cores in parallel. To do that, you need to partition your data and feed each partition to a "worker" for processing. You need to minimize communication between workers and the need of synchronization to get the best performance, otherwise your workers will spend CPU time doing nothing. That means, no global queue updating.
If you have a lot of lines, or if line processing is CPU-bound, you can use PLINQ to process it :
var query = from y in lines.AsParallel()
from t in GetTasks(y.LineCode)
select t;
var theResults=query.ToList();
That's it. No need to synchronize access to a queue, either through locking or using a concurrent collection. This will use all available cores though. You can add WithDegreeOfParallelism() to reduce the number of cores used to avoid freezing
Concurrency - calling 2000 servers
Concurrency on the other hand means doing several different things at the same time. No partitioning is involved.
For example, if I had to query 8 or 2000 servers for monitoring data (true story) I wouldn't use Parallel or PLINQ. For one thing, Parallel and PLINQ use all available cores. In this case though they won't be doing anything, they'll just wait for responses. Parallelism classes can't handle async methods either because there's no point - they aren't meant to wait for responses.
A very quick & dirty solution would be to start multiple tasks and wait for them to return, eg :
var tasks=lines.Select(y=>Task.Run(()=>GetTasks(y.LineCode));
//Array of individual results
var resultsArray=await Task.WhenAll(tasks);
//flatten the results
var resultList=resultsArray.SelectMany(r=>r).ToList();
This will start all requests at once. Network Security didn't like the 2000 concurrent requests, since it looked like a hack attack and caused a bit of network flooding.
Concurrency with Dataflow
We can use the TPL Dataflow library and eg ActionBlock or TransformBlock to make the requests with a controlled degree of parallelism :
var options=new ExecutionDataflowBlockOptions {
MaxDegreeOfParallelism = 4 ,
BoundedCapacity=10,
};
var spamBlock=new TransformManyBlock<Line,Result>(
y=>GetTasks(y.LineCode),
options);
var outputBlock=new BufferBlock<Result>();
spamBlock.LinkTo(outputBlock);
foreach(var line in lines)
{
await spamBlock.SendAsync(line);
}
spamBlock.Complete();
//Wait for all 4 workers to finish
await spamBlock.Completion;
Once the spamBlock completes, the results can be found in outputBlock. By setting a BoundedCapacity I ensure that the posting loop will wait if there are too many unprocessed messages in spamBlock's input queue.
An ActionBlock can handle asynchronous methods too. Assuming GetTasksAsync returns a Task<Result[]> we can use:
var spamBlock=new TransformManyBlock<Line,Result>(
y=>GetTasksAsync(y.LineCode),
options);

You can use Parallel Foreach:
Parallel.ForEach(Lines, (line) =>
{
globalQueue.AddRange(GetTasks(line.LineCode).ToList());
});
A Parallel.ForEach loop works like a Parallel.For loop. The loop
partitions the source collection and schedules the work on multiple
threads based on the system environment. The more processors on the
system, the faster the parallel method runs.

Related

Thread Contention on a ConcurrentDictionary in C#

I have a C# .NET program that uses an external API to process events for real-time stock market data. I use the API callback feature to populate a ConcurrentDictionary with the data it receives on a stock-by-stock basis.
I have a set of algorithms that each run in a constant loop until a terminal condition is met. They are called like this (but all from separate calling functions elsewhere in the code):
Task.Run(() => ExecutionLoop1());
Task.Run(() => ExecutionLoop2());
...
Task.Run(() => ExecutionLoopN());
Each one of those functions calls SnapTotals():
public void SnapTotals()
{
foreach (KeyValuePair<string, MarketData> kvpMarketData in
new ConcurrentDictionary<string, MarketData>(Handler.MessageEventHandler.Realtime))
{
...
The Handler.MessageEventHandler.Realtime object is the ConcurrentDictionary that is updated in real-time by the external API.
At a certain specific point in the day, there is an instant burst of data that comes in from the API. That is the precise time I want my ExecutionLoop() functions to do some work.
As I've grown the program and added more of those execution loop functions, and grown the number of elements in the ConcurrentDictionary, the performance of the program as a whole has seriously degraded. Specifically, those ExecutionLoop() functions all seem to freeze up and take much longer to meet their terminal condition than they should.
I added some logging to all of the functions above, and to the function that updates the ConcurrentDictionary. From what I can gather, the ExecutionLoop() functions appear to access the ConcurrentDictionary so often that they block the API from updating it with real-time data. The loops are dependent on that data to meet their terminal condition so they cannot complete.
I'm stuck trying to figure out a way to re-architect this. I would like for the thread that updates the ConcurrentDictionary to have a higher priority but the message events are handled from within the external API. I don't know if ConcurrentDictionary was the right type of data structure to use, or what the alternative could be, because obviously a regular Dictionary would not work here. Or is there a way to "pause" my execution loops for a few milliseconds to allow the market data feed to catch up? Or something else?

Your basic approach is sound except for one fatal flaw: they are all hitting the same dictionary at the same time via iterators, sets, and gets. So you must do one thing: in SnapTotals you must iterate over a copy of the concurrent dictionary.
When you iterate over Handler.MessageEventHandler.Realtime or even new ConcurrentDictionary<string, MarketData>(Handler.MessageEventHandler.Realtime) you are using the ConcurrentDictionary<>'s iterator, which even though is thread-safe, is going to be using the dictionary for the entire period of iteration (including however long it takes to do the processing for each and every entry in the dictionary). That is most likely where the contention occurs.
Making a copy of the dictionary is much faster, so should lower contention.
Change SnapTotals to
public void SnapTotals()
{
var copy = Handler.MessageEventHandler.Realtime.ToArray();
foreach (var kvpMarketData in copy)
{
...
Now, each ExecutionLoopX can execute in peace without write-side contention (your API updates) and without read-side contention from the other loops. The write-side can execute without read-side contention as well.
The only "contention" should be for the short duration needed to do each copy.
And by the way, the dictionary copy (an array) is not threadsafe; it's just a plain array, but that is ok because each task is executing in isolation on its own copy.

I think that your main problem is not related to the ConcurrentDictionary, but to the large number of ExecutionLoopX methods. Each of these methods saturates a CPU core, and since the methods are more than the cores of your machine, the whole CPU is saturated. My assumption is that if you find a way to limit the degree of parallelism of the ExecutionLoopX methods to a number smaller than the Environment.ProcessorCount, your program will behave and perform better. Below is my suggestion for implementing this limitation.
The main obstacle is that currently your ExecutionLoopX methods are monolithic: they can't be separated to pieces so that they can be parallelized. My suggestion is to change their return type from void to async Task, and place an await Task.Yield(); inside the outer loop. This way it will be possible to execute them in steps, with each step being the code from the one await to the next.
Then create a TaskScheduler with limited concurrency, and a TaskFactory that uses this scheduler:
int maxDegreeOfParallelism = Environment.ProcessorCount - 1;
TaskScheduler scheduler = new ConcurrentExclusiveSchedulerPair(
TaskScheduler.Default, maxDegreeOfParallelism).ConcurrentScheduler;
TaskFactory taskFactory = new TaskFactory(scheduler);
Now you can parallelize the execution of the methods, by starting the tasks with the taskFactory.StartNew method instead of the Task.Run:
List<Task> tasks = new();
tasks.Add(taskFactory.StartNew(() => ExecutionLoop1(data)).Unwrap());
tasks.Add(taskFactory.StartNew(() => ExecutionLoop2(data)).Unwrap());
tasks.Add(taskFactory.StartNew(() => ExecutionLoop3(data)).Unwrap());
tasks.Add(taskFactory.StartNew(() => ExecutionLoop4(data)).Unwrap());
//...
Task.WaitAll(tasks.ToArray());
The .Unwrap() is needed because the taskFactory.StartNew returns a nested task (Task<Task>). The Task.Run method is also doing this unwrapping internally, when the action is asynchronous.
An online demo of this idea can be found here.
The Environment.ProcessorCount - 1 configuration means that one CPU core will be available for other work, like the communication with the external API and the updating of the ConcurrentDictionary.
A more cumbersome implementation of the same idea, using iterators and the Parallel.ForEach method instead of async/await, can be found in the first revision of this answer.

If you're not squeamish about mixing operations in a task, you could redesign such that instead of task A doing A things, B doing B things, C doing C things, etc. you can reduce the number of tasks to the number of processors, and thus run fewer concurrently, greatly easing contention.
So, for example, say you have just two processors. Make a "general purpose/pluggable" task wrapper that accepts delegates. So, wrapper 1 would accept delegates to do A and B work. Wrapper 2 would accept delegates to do C and D work. Then ask each wrapper to spin up a task that calls the delegates in a loop over the dictionary.
This would of course need to be measured. What I am proposing is, say, 4 tasks each doing 4 different types of processing. This is 4 units of work per loop over 4 loops. This is not the same as 16 tasks each doing 1 unit of work. In that case you have 16 loops.
16 loops intuitively would cause more contention than 4.
Again, this is a potential solution that should be measured. There is one drawback for sure: you will have to ensure that a piece of work within a task doesn't affect any of the others.

Parallelizing execution with Task.Run

I am trying to improve performane of some code which does some shopping function calling number of different vendors. 3rd party vendor call is async and results are processed to generate a result. Strucure of the code is as follows.
public async Task<List<ShopResult>> DoShopping(IEnumerable<Vendor> vendors)
{
var res = vendors.Select(async s => await DoShopAndProcessResultAsync(s));
await Task.WhenAll(res); ....
}
Since DoShopAndProcessResultAsync is both IO bound and CPU bound, and each vendor iteration is independant I think Task.Run can be used to do something like below.
public async Task<List<ShopResult>> DoShopping(IEnumerable<Vendor> vendors)
{
var res = vendors.Select(s => Task.Run(() => DoShopAndProcessResultAsync(s)));
await Task.WhenAll(res); ...
}
Using Task.Run as is having a performance gain and I can see multiple threads are being involved here from the order of execution of the calls. And it is running without any issue locally on my machine.
However, it is a tasks of tasks scenario and wondering whether any pitfalls or this is deadlock prone in a high traffic prod environment.
What are your opinions on the approach of using Task.Run to parallelize async calls?

Tasks are .NET's low-level building blocks. .NET almost always has a better high-level abstraction for specific concurrency paradigms.
To paraphrase Rob Pike (slides) Concurrency is not parallelism is not asynchronous execution. What you ask is concurrent execution, with a specific degree-of-parallelism. NET already offers high-level classes that can do that, without resorting to low-level task handling.
At the end, I explain why these distinctions matter and how they're implemented using different .NET classes or libraries
Dataflow blocks
At the highest level, the Dataflow classes allow creating a pipeline of processing blocks similar to a Powershell or Bash pipeline, where each block can use one or more tasks to process input. Dataflow blocks preserve message order, ensuring results are emitted in the order the input messages were received.
You'll often see combinations of block called meshes, not pipelines. Dataflow grew out of the Microsoft Robotics Framework and can be used to create a network of independent processing blocks. Most programmers just use to build a pipeline of steps though.
In your case, you could use a TransformBlock to execute DoShopAndProcessResultAsync and feed the output either to another processing block, or a BufferBlock you can read after processing all results. You could even split Shop and Process into separate blocks, each with its own logic and degree of parallelism
Eg.
var buffer=new BufferBlock<ShopResult>();
var blockOptions=new ExecutionDataflowBlockOptions {
MaxDegreeOfParallelism=3,
BoundedCapacity=1
};
var shop=new TransformBlock<Vendor,ShopResult)(DoShopAndProcessResultAsync,
blockOptions);
var linkOptions=new DataflowLinkOptions{ PropagateCompletion=true;}
shop.LinkTo(buffer,linkOptions);
foreach(var v in vendors)
{
await shop.SendAsync(v);
}
shop.Complete();
await shop.Completion;
buffer.TryReceiveAll(out IList<ShopResult> results);
You can use two separate blocks to shop and process :
var shop=new TransformBlock<Vendor,ShopResponse>(DoShopAsync,shopOptions);
var process=new TransformBlock<ShopResponse,ShopResult>(DoProcessAsync,processOptions);
shop.LinkTo(process,linkOptions);
process.LinkTo(results,linkOptions);
foreach(var v in vendors)
{
await shop.SendAsync(v);
}
shop.Complete();
await process.Completion;
In this case we await the completion of the last block in the chain before reading the results.
Instead of reading from a buffer block, we could use an ActionBlock at the end to do whatever we want to do with the results, eg store them to a database. The results can be batched using a BatchBlock to reduce the number of storage operations
...
var batch=new BatchBlock<ShopResult>(100);
var store=new ActionBlock<ShopResult[]>(DoStoreAsync);
shop.LinkTo(process,linkOptions);
process.LinkTo(batch,linkOptions);
batch.LinkTo(store,linkOptions);
...
shop.Complete();
await store.Completion;
Why do names matter
Tasks are the lowest level building blocks used to implement multiple paradigms. In other languages you'd see them described as Futures or Promises (eg Javascript)
Parallelism in .NET means executing CPU-bound computations over a lot of data using all available cores. Parallel.ForEach will partition the input data into roughly as many partitions as there are cores and use one worker task per partition. PLINQ goes one step further, allowing the use of LINQ operators to specify the computation and let PLINQ to use algorithms optimized for parallel execution to map, filter, sort, group and collect results. That's why Parallel.ForEach can't be used for async work at all.
Concurrency means executing multiple independent and often IO-bound jobs. At the lowest level you can use Tasks but Dataflow, Rx.NET, Channels, IAsyncEnumerable etc allow the use of high-level patterns like CSP/Pipelines, event stream processing etc
Asynchronous execution means you don't have to block while waiting for I/O-bound work to complete.

What is alarming with the Task.Run approach in your question, is that it depletes the ThreadPool from available worker threads in a non-controlled manner. It doesn't offer any configuration option that would allow you to reduce the parallelism of each individual request, in favor of preserving the scalability of the whole service. That's something that might bite you in the long run.
Ideally you would like to control both the parallelism and the concurrency, and control them independently. For example you might want to limit the maximum concurrency of the I/O-bound work to 10, and the maximum parallelism of the CPU-bound work to 2. Regarding the former you could take a look at this question: How to limit the amount of concurrent async I/O operations?
Regarding the later, you could use a TaskScheduler with limited concurrency. The ConcurrentExclusiveSchedulerPair is a handy class for this purpose. Here is an example of how you could rewrite your DoShopping method in a way that limits the ThreadPool usage to two threads at maximum (per request), without limiting at all the concurrency of the I/O-bound work:
public async Task<ShopResult[]> DoShopping(IEnumerable<Vendor> vendors)
{
var scheduler = new ConcurrentExclusiveSchedulerPair(
TaskScheduler.Default, maxConcurrencyLevel: 2).ConcurrentScheduler;
var tasks = vendors.Select(vendor =>
{
return Task.Factory.StartNew(() => DoShopAndProcessResultAsync(vendor),
default, TaskCreationOptions.DenyChildAttach, scheduler).Unwrap();
});
return await Task.WhenAll(tasks);
}
Important: In order for this to work, the DoShopAndProcessResultAsync method should be implemented internally without .ConfigureAwait(false) at the await points. Otherwise the continuations after the await will not run on our preferred scheduler, and the goal of limiting the ThreadPool utilization will be defeated.
My personal preference though would be to use instead the new (.NET 6) Parallel.ForEachAsync API. Apart from making it easy to control the concurrency through the MaxDegreeOfParallelism option, it also comes with a better behavior in case of exceptions. Instead of launching invariably all the async operations, it stops launching new operations as soon as a previously launched operation has failed. This can make a big difference in the responsiveness of your service, in case for example that all individual async operations are failing with a timeout exception. You can find here a synopsis of the main differences between the Parallel.ForEachAsync and the Task.WhenAll APIs.
Unfortunately the Parallel.ForEachAsync has the disadvantage that it doesn't return the results of the async operations. Which means that you have to collect the results manually as a side-effect of each async operation. I've posted here a ForEachAsync variant that returns results, that combines the best aspects of the Parallel.ForEachAsync and the Task.WhenAll APIs. You could use it like this:
public async Task<ShopResult[]> DoShopping(IEnumerable<Vendor> vendors)
{
var scheduler = new ConcurrentExclusiveSchedulerPair(
TaskScheduler.Default, maxConcurrencyLevel: 2).ConcurrentScheduler;
ParallelOptions options = new() { MaxDegreeOfParallelism = 10 };
return await ForEachAsync(vendors, options, async (vendor, ct) =>
{
return await Task.Factory.StartNew(() => DoShopAndProcessResultAsync(vendor),
ct, TaskCreationOptions.DenyChildAttach, scheduler).Unwrap();
});
}
Note: In my initial answer (revision 1) I had suggested erroneously to pass the scheduler through the ParallelOptions.TaskScheduler property. I just found out that this doesn't work as I expected. The ParallelOptions class has an internal property EffectiveMaxConcurrencyLevel that represents the minimum of the MaxDegreeOfParallelism and the TaskScheduler.MaximumConcurrencyLevel. The implementation of the Parallel.ForEachAsync method uses this property, instead of reading directly the MaxDegreeOfParallelism. So the MaxDegreeOfParallelism, by being larger than the MaximumConcurrencyLevel, was effectively ignored.
You've probably also noticed by now that the names of these two settings are confusing. We use the MaximumConcurrencyLevel in order to control the number of threads (aka the parallelization), and we use the MaxDegreeOfParallelism in order to control the amount of concurrent async operations (aka the concurrency). The reason for this confusing terminology can be traced to the historic origins of these APIs. The ParallelOptions class was introduced before the async-await era, and the designers of the new Parallel.ForEachAsync API aimed at making it compatible with the older non-asynchronous members of the Parallel class.

Threads (Tasks) Limit in WebAPI (or in general)

We're developing WebAPI which has some logic of decryption of around 200 items (can be more). Each decryption takes around 20ms.
We've tried to parallel the tasks so we'll get it done as soon as possible, but it seems we're getting some kind of a limit as the threads are getting reused by waiting for the older threads to complete (and there are only few used) - overall action takes around 1-2 seconds to complete...
What we basically want to achieve is get x amount of threads start at the same time and finish after those ~20 ms.
We tried this:
Await multiple async Task while setting max running task at a time
But it seems this only describes setting a limit while we want to release it...
Here's a snippet:
var tasks = new List<Task>();
foreach (var element in Elements)
{
var task = new Task(() =>
{
element.Value = Cipher.Decrypt((string)element.Value);
}
});
task.Start();
tasks.Add(task);
}
Task.WaitAll(tasks.ToArray());
What are we missing here?
Thanks,
Nir.

I cannot recommend parallelism on ASP.NET. It will certainly impact the scalability of your service, particularly if it is public-facing. I have thought "oh, I'm smart enough to do this" a couple of times and added parallelism in an ASP.NET app, only to have to tear it right back out a week later.
However, if you really want to...
it seems we're getting some kind of a limit
Is it the limit of physical cores on your machine?
We tried this: Await multiple async Task while setting max running task at a time
That solution is specifically for asynchronous concurrent code (e.g., I/O-bound). What you want is parallel (threaded) concurrent code (e.g., CPU-bound). Completely different use cases and solutions.
What are we missing here?
Your current code is throwing a ton of simultaneous tasks at the thread pool, which will attempt to handle them as best as it can. You can make this more efficient by using a higher-level abstraction, e.g., Parallel:
Parallel.ForEach(Elements, element =>
{
element.Value = Cipher.Decrypt((string)element.Value);
});
Parallel is more intelligent in terms of its partitioning and (re-)use of threads (i.e., not exceeding number of cores). So you should see some speedup.
However, I would expect it only to be a minor speedup. You are likely being limited by your number of physical cores.

Asuming no hyper threading:
If it takes 20ms for 1 item , then you can look at it as if it takes 1 core 20ms. If you want 200 items to complete in 20 ms, then you need 200 cores all for you. If you don't have that many, it just can't be done...
Under normal surcumstances, as many Task Will be scheduled parallel as optimal for you system

Parallelization of long running processes and performance optimization

I would like to parallelize the application that processes multiple video clips frame by frame. Sequence of each frame per clip is important (obviously).
I decided to go with TPL Dataflow since I believe this is a good example of dataflow (movie frames being data).
So I have one process that loads frames from database (lets say in a batch of 500, all bunched up)
Example sequence:
|mid:1 fr:1|mid:1 fr:2|mid:2 fr:1|mid:3 fr:1|mid:1 fr:3|mid:2 fr:2|mid:2 fr:3|mid:1 fr:4|
and posts them to BufferBlock. To this BufferBlock I have linked ActionBlocks with the filter to have one ActionBlock per MovieID so that I get some kind of data partitioning. Each ActionBlock is sequential, but ideally multiple ActionBlocks for multiple movies can run in parallel.
I do have the above described network working and it does run in parallel, but from my calculations only eight to ten ActionBlocks are executing simultaneously. I timed each ActionBlock's running time and its around 100-200ms.
What steps can I take to at least double concurrency?
I did try converting action delegates to async methods and make database access asynchronous within ActionBlock action delegate but it did not help.
EDIT: I implemented extra level of data partitioning: frames for Movies with Odd IDs are processed on ServerA, frames for Even movies are processed on ServerB. Both instances of the application hit the same database. If my problem was DB IO, then I would not see any improvement in total frames processed count (or very little, under 20%). But I do see it doubling. So this leads me to conclude that Threadpool is not spawning more threads to do more frames in parallel (both servers are quad-cores and profiler shows about 25-30 threads per application).

Some assumptions:
From your example data, you are receiving movie frames (and possibly the frames in the movies) out of order
Your ActionBlock<T> instances are generic; they all call the same method for processing, you just create a list of them based on each movie id (you have a list of movie ids beforehand) like so:
// The movie IDs
IEnumerable<int> movieIds = ...;
// The actions.
var actions = movieIds.Select(
i => new { Id = i, Action = new ActionBlock<Frame>(MethodToProcessFrame) });
// The buffer block.
BufferBlock<Frame> buffer = ...;
// Link everything up.
foreach (var action in actions)
{
// Not necessary in C# 5.0, but still, good practice.
// The copy of the action.
var actionCopy = action;
// Link.
bufferBlock.LinkTo(actionCopy.Action, f => f.MovieId == actionCopy.Id);
}
If this is the case, you're creating too many ActionBlock<T> instances which aren't being given work; because your frames (and possibly movies) are out-of-order, you aren't guaranteed that all of the ActionBlock<T> instances will have work to do.
Additionally, when you create an ActionBlock<T> instance it's going to be created with a MaxDegreeOfParallelism of 1, meaning that it's thread safe because only one thread can access the block at the same time.
Additionally, the TPL DataFlow library ultimately relies on the Task<TResult> class, which schedules by default on the thread pool. The thread pool is going to do a few things here:
Make sure that all processor cores are saturated. This is very different from making sure that your ActionBlock<T> instances are saturated and this is the metric you should be concerned with
Make sure that while the processor cores are saturated, make sure that the work is distributed evenly, as well as make sure that not too many concurrent tasks are executing (context switches are expensive).
It also looks like your method that processes your movies is generic, and it doesn't matter what frame from what movie is passed in (if it does matter, then you need to update your question with that, as it changes a lot of things). This would also mean that it's thread-safe.
Also, if it can be assumed that the processing of one frame doesn't rely on the processing of any previous frames (or, it looks like the frames of the movie come in order) you can use a single ActionBlock<T> but tweak up the MaxDegreeOfParallelism value, like so:
// The buffer block.
BufferBlock<Frame> buffer = ...;
// Have *one* ActionBlock<T>
var action = new ActionBlock<Frame>(MethodToProcessFrame,
// This is where you tweak the concurrency:
new ExecutionDataflowBlockOptions {
MaxDegreeOfParallelism = 4,
}
);
// Link. No filter needed.
bufferBlock.LinkTo(action);
Now, your ActionBlock<T> will always be saturated. Granted, any responsible task scheduler (the thread pool by default) is still going to limit the maximum amount of concurrency, but it's going to do as much as it can reasonably do at the same time.
To that end, if your action is truly thread safe, you can set the MaxDegreeOfParallelism to DataflowBlockOptions.Unbounded, like so:
// Have *one* ActionBlock<T>
var action = new ActionBlock<Frame>(MethodToProcessFrame,
// This is where you tweak the concurrency:
new ExecutionDataflowBlockOptions {
// We're thread-safe, let the scheduler determine
// how nuts we can go.
MaxDegreeOfParallelism = DataflowBlockOptions.Unbounded,
}
);
Of course, all of this assumes that everything else is optimal (I/O reads/writes, etc.)

Odds are that's the optimal degree of parallelization. The thread pool is honestly pretty darn good at determining the optimal number of actual threads to have active. My guess is that your hardware can support about that many parallel processes actually working in parallel. If you added more you wouldn't actually be increasing throughput, you'd just be spending more time doing context switches between threads and less time actually working on them.
If you notice that, over an extended period of time, your CPU load, memory bus, network connection, disk access, etc. are all working below capacity then you might have a problem, and you'd want to check to see what is actually bottlenecking. Chances are though some resource somewhere is at it's capacity, and the TPL has recognized that and ensured that it doesn't over saturate that resource.

I suspect you are IO bound. The question is where? On the read or the write. Are you writing more data than reading. CPU may be under 50% because it cannot write out faster.
I am not saying the ActionBlock is wrong but I would consider a producer consumer with BlockingCollection. Optimize how you read and write data.
This different but I have an app where I read blocks of text. Parse the text and then write the words back to SQL. I read the on a single thread, then parallel the parse, and then write on a single thread. I write on a single thread so as not to fracture indexes. If you are IO bound you need to figure out what is the slowest IO then optimize that process.
Tell me more about that IO.
In the question you mention reading from database also.
I would give BlockingCollections a try.
BlockingCollection Class
And have size limit for each as so you don't blow memory.
Make it just big enough that it (almost) never goes empty.
The Blocking Collection after the slowest step will go empty.
If you can parallel process then do so.
What I have found is parallel inserts in a table are not faster.
Let one process take lock and hold it and keep that hose open.
Look close at how you insert.
One row at a time is slow.
I use TVP and insert 10,000 at a time but a lot of people like Drapper or BulkInsert.
If you drop indexes and triggers and insert sorted by clustered index will be fastest.
Take a tablock and hold it.
I am getting inserts in the 10 ms range.
Right now the update is the slowest.
Look at that - are you doing just one row at a time?
Look at taking tablock and doing by video clip.
Unless it is an ugly update it should not take longer than in insert.

multithread read and process large text files

I have 10 lists of over 100Mb each with emails and I wanna process them using multithreads as fast as possible and without loading them into memory (something like reading line by line or reading small blocks)
I have created a function which is removing invalid ones based on a regex and another one which is organizing them based on each domain to other lists.
I managed to do it using one thread with:
while (reader.Peek() != -1)
but it takes too damn long.
How can I use multithreads (around 100 - 200) and maybe a backgroundworker or something to be able to use the form while processing the lists in parallel?
I'm new to csharp :P

Unless the data is on multiple physical discs, chances are that any more than a few threads will slow down, rather than speed up, the process.
What'll happen is that rather than reading consecutive data (pretty fast), you'll end up seeking to one place to read data for one thread, then seeking to somewhere else to read data for another thread, and so on. Seeking is relatively slow, so it ends up slower -- often quite a lot slower.
About the best you can do is dedicate one thread to reading data from each physical disc, then another to process the data -- but unless your processing is quite complex, or you have a lot of fast hard drives, one thread for processing may be entirely adequate.

There are multiple approaches to it:
1.) You can create threads explicitly like Thread t = new Thread(), but this approach is expensive on creating and managing a thread.
2.) You can use .net ThreadPool and pass your executing function's address to QueueUserWorkItem static method of ThreadPool Class. This approach needs some manual code management and synchronization primitives.
3.) You can create an array of System.Threading.Tasks.Task each processing a list which are executed parallely using all your available processors on the machine and pass that array to task.WaitAll(Task[]) to wait for their completion. This approach is related to Task Parallelism and you can find detailed information on MSDN
Task[] tasks = null;
for(int i = 0 ; i < 10; i++)
{
//automatically create an async task and execute it using ThreadPool's thread
tasks[i] = Task.StartNew([address of function/lambda expression]);
}
try
{
//Wait for all task to complete
Task.WaitAll(tasks);
}
catch (AggregateException ae)
{
//handle aggregate exception here
//it will be raised if one or more task throws exception and all the exceptions from defaulting task get accumulated in this exception object
}
//continue your processing further

You will want to take a look at the Task Parallel Library (TPL).
This library is made for parallel work, in fact. It will perform your action on the Threadpool in whatever is the most efficient fashion (typically). The only thing that I would caution is that if you run 100-200 threads at one time, then you possibly run into having to deal with context switching. That is, unless you have 100-200 processors. A good rule of thumb is to only run as many tasks in parallel as you have processors.
Some other good resources to review how to use the TPL:
Why and how to use the TPL
How to start a task.

I would be inclined to use parallel linq (plinq).
Something along the lines of:
Lists.AsParallel()
.SelectMany(list => list)
.Where(MyItemFileringFunction)
.GroupBy(DomainExtractionFunction)
AsParallel tells linq it can do this in parallel (which will mean the ordering of everything following will not be maintained)
SelectMany takes your individual lists and unrolls them such that all all items from all lists are effectivly in a single Enumerable
Where filers the items using your predicate function
GroupBy collects them by key, where DomainExtractionFunction is a function which gets a key (the domain name in your case) from the items (ie, the email)

Develop Reference

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.