My application processes millions of pieces of data which vary in size. Small objects are processed quickly while others can take upwards of fifteen minutes.
My current code:
List<QueueRecords> queueRecords = Get500QueueRecords();
bool moreFiles = true;
while (moreFiles)
{
    Parallel.ForEach(queueRecords, parallelOptions, (record, loopState) =>
    {
        // do work
    });

    queueRecords = Get500QueueRecords();
    if (queueRecords.Count == 0)
    {
        moreFiles = false;
    }
}
The issue with this is that I often end up with one thread grinding through a long-running item while massive amounts of data are still waiting to be processed.
Which pattern should I look into to resolve this issue?
Issues:
1) Get500QueueRecords can itself take some time to execute, during which you aren't doing any processing;
2) if the last record in a batch takes 15 minutes, you are processing only one item at a time while it runs, because Parallel.ForEach waits for the whole batch to complete.
You really should look at TPL Dataflow (https://learn.microsoft.com/en-us/dotnet/standard/parallel-programming/dataflow-task-parallel-library), or at least create a reader Task that pumps data into a BlockingCollection<T> and then launch multiple consumer Tasks that pull from the blocking collection until it is fully consumed.
Using a producer and a consumer with a finite size BlockingCollection<T> between them allows you to control (i) how many items are buffered from the reader Task and (ii) how many Tasks you have consuming it.
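Here is a minimal sketch of that producer/consumer shape, assuming the Get500QueueRecords method from the question and a hypothetical DoWork standing in for //dowork (needs System.Collections.Concurrent, System.Linq and System.Threading.Tasks):
var queue = new BlockingCollection<QueueRecords>(boundedCapacity: 1000);

// Producer: keeps fetching batches so the workers never sit idle during I/O.
var producer = Task.Run(() =>
{
    List<QueueRecords> batch;
    while ((batch = Get500QueueRecords()).Count > 0)
    {
        foreach (var record in batch)
            queue.Add(record); // blocks when the buffer is full
    }
    queue.CompleteAdding();
});

// Consumers: each pulls the next record the moment it finishes the last one,
// so a single 15-minute item only ever ties up one worker.
var consumers = Enumerable.Range(0, Environment.ProcessorCount)
    .Select(_ => Task.Run(() =>
    {
        foreach (var record in queue.GetConsumingEnumerable())
            DoWork(record);
    }))
    .ToArray();

Task.WaitAll(new[] { producer }.Concat(consumers).ToArray());
The TPL Dataflow version (System.Threading.Tasks.Dataflow package) is shorter still; an ActionBlock gives you the bounded buffer and the worker pool in one object (again, DoWork is hypothetical):
var worker = new ActionBlock<QueueRecords>(
    record => DoWork(record),
    new ExecutionDataflowBlockOptions
    {
        MaxDegreeOfParallelism = Environment.ProcessorCount,
        BoundedCapacity = 1000 // applies back-pressure to the reader
    });

List<QueueRecords> batch;
while ((batch = Get500QueueRecords()).Count > 0)
{
    foreach (var record in batch)
        worker.SendAsync(record).Wait(); // waits while the buffer is full
}
worker.Complete();
worker.Completion.Wait();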
Related
I have a REST Web API service in IIS which takes a collection of request objects. The user can send more than 100 request objects.
I want to run these 100 requests concurrently and then aggregate the results and send them back. This involves both I/O operations (calling backend services for each request) and CPU-bound operations (computing a few response elements).
Code snippet -
using System.Threading.Tasks;
....
var taskArray = new Task<FlightInformation>[multiFlightStatusRequest.FlightRequests.Count];
for (int i = 0; i < multiFlightStatusRequest.FlightRequests.Count; i++)
{
var z = i;
taskArray[z] = Task.Run(() =>
PerformLogic(multiFlightStatusRequest.FlightRequests[z],lite, fetchRouteByAnyLeg)
);
}
Task.WaitAll(taskArray);
for (int i = 0; i < taskArray.Length; i++)
{
flightInformations.Add(taskArray[i].Result);
}
public Object PerformLogic(Request,...)
{
//multiple IO operations each depends on the outcome of the previous result
//Computations after getting the result from all I/O operations
}
If I run the PerformLogic operation individually (for 1 object) it takes 300 ms; my requirement is that running PerformLogic() for 100 objects in a single request should take around 2 secs.
PerformLogic() has the following steps:
1. Call a 3rd-party web service to get some details
2. Based on the details, call another 3rd-party web service
3. Collect the result from the web service and apply a few transformations
But with Task.Run() it takes around 7 secs. I would like to know the best approach to handle the concurrency and achieve the desired NFR of 2 secs.
I can see that at any point in time only 7-8 threads are working concurrently. I'm not sure whether spawning 100 threads or tasks would give better performance. Please suggest an approach to handle this efficiently.
Judging by this
public Object PerformLogic(Request,...)
{
//multiple IO operations each depends on the outcome of the previous result
//Computations after getting the result from all I/O operations
}
I'd wager that PerformLogic spends most of its time waiting on the I/O operations. If so, there's hope with async. You'll have to rewrite PerformLogic and maybe even the I/O operations - async needs to be present at all levels, from the top to the bottom. But if you can do it, the result should be a lot faster.
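A minimal sketch of what that rewrite could look like; CallFirstServiceAsync, CallSecondServiceAsync and Transform are hypothetical stand-ins for the three steps, FlightRequest stands in for the element type of FlightRequests, and the calling action is assumed to be (or be convertible to) async:
public async Task<FlightInformation> PerformLogicAsync(
    FlightRequest request, bool lite, bool fetchRouteByAnyLeg)
{
    // Step 1: first third-party call; the thread is released while waiting.
    var details = await CallFirstServiceAsync(request);

    // Step 2: the second call depends on the outcome of the first.
    var route = await CallSecondServiceAsync(details, fetchRouteByAnyLeg);

    // Step 3: CPU-bound transformations once all the I/O has completed.
    return Transform(details, route, lite);
}

// Fan out all 100 requests; none of them holds a thread while waiting on I/O.
var tasks = multiFlightStatusRequest.FlightRequests
    .Select(r => PerformLogicAsync(r, lite, fetchRouteByAnyLeg));
var flightInformations = (await Task.WhenAll(tasks)).ToList();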
Other than that - get faster hardware. If 8 cores take 7 seconds, then get 32 cores. It's pricey, but could still be cheaper than rewriting the code.
First, don't reinvent the wheel. PLINQ is perfectly capable of doing stuff in parallel; there is no need for manual task handling or result merging.
If you want 100 tasks, each taking 300 ms, done in 2 seconds, you need at least 15 parallel workers (100 x 300 ms = 30 s of sequential work; 30 s / 2 s = 15), ignoring the cost of the parallelization itself.
var results = multiFlightStatusRequest.FlightRequests
    .AsParallel()
    .WithDegreeOfParallelism(15)
    .Select(flightRequest => PerformLogic(flightRequest, lite, fetchRouteByAnyLeg))
    .ToList();
Now you have told PLINQ to use 15 concurrent workers to work through your queue of tasks. Are you sure your machine is up to it? You can put any number you want in there; that doesn't mean your computer magically gets the power to honour it.
Another option is to look at your PerformLogic method and optimize that. You call it 100 times; maybe it's worth optimizing.
I have an application that uses Hangfire to run long-running jobs for me (I know how long a job takes, and it is always roughly the same), and in my UI I want to give an estimate of when a certain job will be done. For that I need to query Hangfire for the position of the job in the queue and the number of servers working on it.
I know I can get the number of enqueued jobs (in the "DEFAULT" queue) by
public long JobsInQueue() {
var monitor = JobStorage.Current.GetMonitoringApi();
return monitor.EnqueuedCount("DEFAULT");
}
and the number of servers by
public int HealthyServers() {
var monitor = JobStorage.Current.GetMonitoringApi();
return monitor.Servers().Count(n => (n.Heartbeat != null) && (DateTime.Now - n.Heartbeat.Value).TotalMinutes < 5);
}
(BTW: I exclude older heartbeats because, if I turn servers off, they sometimes linger in the Hangfire database. Is there a better way?) But to give a proper estimate I need to know the position of the job in the queue. How do I get that?
The problem you have is that Hangfire is asynchronous, queued, parallel, exhibits an at-least-once durability semantic, and is basically non-deterministic.
Knowing with certainty the order in which an item will finish being processed in such a system is impossible. In fact, if the requirement were to enforce strict ordering, many of the benefits of Hangfire would go away.
There is a very good blog post by @odinserj (the author of Hangfire) where he outlines this point: http://odinserj.net/2014/05/10/are-your-methods-ready-to-run-in-background/
That said, it's not impossible to come up with a sensible estimation algorithm, but it would have to be one that approximates the order of execution in some way. I don't know exactly how you would arrive at such an algorithm, but something like this might work (though it probably won't):
Approximate seconds remaining until completion =
    (average duration of job in seconds * queue depth)
    / (the lower of: number of Hangfire worker threads OR queue depth)
    - number of seconds already spent in queue
    + average duration of job in seconds
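As a rough sketch in code, reusing the monitoring API from the question (averageJobSeconds, workerThreads and secondsAlreadyInQueue are assumed to be known or configured; none of them comes from Hangfire itself):
public double EstimateSecondsRemaining(
    double averageJobSeconds, int workerThreads, double secondsAlreadyInQueue)
{
    var monitor = JobStorage.Current.GetMonitoringApi();
    long queueDepth = Math.Max(monitor.EnqueuedCount("DEFAULT"), 1);

    // The queue is drained by at most min(worker threads, queue depth) workers.
    double effectiveWorkers = Math.Min(workerThreads, queueDepth);

    return (averageJobSeconds * queueDepth) / effectiveWorkers
           - secondsAlreadyInQueue
           + averageJobSeconds;
}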
I have the following issue:
I am using a Parallel.ForEach iteration for a pretty CPU-intensive workload (applying a method to a number of items), and it works fine for about the first 80% of the items - using all CPU cores very nicely.
As the iteration comes near the end (around 80%, I would say), I see the number of active threads going down core by core, and the last ~5% of the items are processed by only two cores. So instead of using all cores until the end, it slows down pretty hard toward the end of the iteration.
Please note that the workload can differ greatly per item: one can last 1-2 seconds, another can take 2-3 minutes to finish.
Any idea or suggestion is very welcome.
Code used:
var source = myList.ToArray();
var rangePartitioner = Partitioner.Create(0, source.Length);
using (SqlConnection connection = new SqlConnection(cnStr))
{
    connection.Open();
    try
    {
        Parallel.ForEach(rangePartitioner, (range, loopState) =>
        {
            for (int i = range.Item1; i < range.Item2; i++)
            {
                CPUIntensiveMethod(source[i]);
            }
        });
    }
    catch (AggregateException ae)
    {
        // exception handling
    }
}
This is an unavoidable consequence of the fact that the parallelism is per item. Clearly, the whole parallel batch cannot run any quicker than the time taken by the slowest single item in the work-set.
Imagine a batch of 100 items, 8 of which are slow (say 1000 s to run) and the rest quick (say 1 s to run). You kick them off in a random order across 8 threads. It's clear that eventually each thread will be working on one of your long-running items; at that point you are seeing full utilisation. Eventually the threads that hit their long ops first will finish them and quickly polish off any remaining short ops. After that, ONLY some of the long ops are left waiting to finish, so you see active utilisation drop off; e.g. at some point there are only 3 ops left to finish, so only 3 cores are in use.
Mitigation Tactics
Your long-running items might be amenable to 'internal parallelism', allowing each of them to have a faster minimum runtime.
Your long-running items may be identifiable, in which case you can prioritise them to start first (which will ensure you get full CPU utilisation for as long as possible).
DON'T use partitioning in cases where the body can be long-running, as this simply amplifies the effect; i.e. get rid of your rangePartitioner entirely (see the update below). This will massively reduce the impact of the effect on your particular loop.
Either way, your batch run-time is bound by the run-time of the slowest item in the batch.
Update: I also noticed you are using partitioning on your loop, which massively increases the scope of this effect: you are saying 'break this work-set down into N work-sets' and then parallelising the running of those N work-sets. In the example above this could mean that (say) 3 of the long ops land in the same work-set and are therefore processed on the same thread. As such, you should NOT use partitioning if the inner body can be long-running. The partitioning docs at https://msdn.microsoft.com/en-us/library/dd560853(v=vs.110).aspx say it is aimed at short bodies; a sketch of the loop without it follows.
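For example, a minimal sketch of the same loop without the range partitioner, reusing the source array and CPUIntensiveMethod from the question:
// Each worker takes one item at a time, so a 3-minute item never drags
// a chunk of short items along with it onto the same thread.
Parallel.ForEach(source, item =>
{
    CPUIntensiveMethod(item);
});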
If you have multiple threads that each process the same number of items, and each item takes a varying amount of time, then of course some threads will finish earlier than others.
If you use a collection whose size is not known up front, the items will be taken one by one:
var source = myList.AsEnumerable();
Another approach is the producer-consumer pattern:
https://msdn.microsoft.com/en-us/library/dd997371
I have a method that processes words in two lists, a priority list and a standard list.
ConcurrentBag<Word> PriorityWords = ...;
ConcurrentBag<Word> UnprocessedWords = ...;
public void ProcessAllWords()
{
while (true)
{
Word word = SelectWordToProcess();
if (word == null) break;
ProcessWord(word);
}
}
private Word SelectWordToProcess()
{
Word word;
if (PriorityWords.TryTake(out word) || UnprocessedWords.TryTake(out word))
return word;
else
return null;
}
public void ProcessWord(Word word) { ... }
I want to run this method on multiple cores. Currently, I simply open one thread per processor:
for (int i = 0; i < Environment.ProcessorCount; i++)
{
new Thread(ProcessAllWords).Start();
}
Is there a more suitable way that lets the system decide how many threads to open based on current system performance, similar to Parallel.ForEach()?
EDIT: More detail on the application.
The word list is prepopulated with ~180,000 words, and every word is to be permuted with every other word, so ProcessAllWords is an O(n²) operation. The threads all run flat-out until every word is processed, then terminate. While the threads are running, I can asynchronously give priority to specific words by adding them to the PriorityWords list. Initial tests show my system processes about 5 words/sec, so that's 10 hours of 100% CPU processing.
Your method of starting Environment.ProcessorCount threads is good. The Task Parallel Library will do the automatic scheduling you're looking for, but at the cost of oversubscribing your CPU, which will decrease the responsiveness of your application to priority words.
As for the various ways of using the TPL: Parallel.For and Task.Factory.StartNew will both queue up a bunch of words in advance, making them very unresponsive to priority. You can maintain your priority ordering with a generator method and PLINQ (see the sketch below), but then you get a fixed number of threads, just as you have now; you can either set the thread count yourself or use the default of 2 x Environment.ProcessorCount. All in all, since your task is CPU-bound, I would keep your current implementation.
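A minimal sketch of that generator + PLINQ combination, reusing SelectWordToProcess and ProcessWord from the question; the NoBuffering option is an assumption on my part, chosen so that each worker pulls one word at a time and newly prioritised words get picked up promptly:
// Generator: yields the next word by priority until both bags are empty.
IEnumerable<Word> PendingWords()
{
    Word word;
    while ((word = SelectWordToProcess()) != null)
        yield return word;
}

public void ProcessAllWordsParallel()
{
    // NoBuffering stops PLINQ from grabbing words in chunks ahead of time.
    Partitioner
        .Create(PendingWords(), EnumerablePartitionerOptions.NoBuffering)
        .AsParallel()
        .WithDegreeOfParallelism(Environment.ProcessorCount)
        .ForAll(ProcessWord);
}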
Google is giving me headaches with this search term.
I need a thread-safe mechanism to achieve the following:
A thread-safe list with insert priority over read.
I need to always be able to insert a message (let's say) into the queue (or whatever), and only occasionally be able to read.
So reading cannot, ever, interfere with inserting.
Thanks.
EDIT: Reading would also mean clearing the read part.
EDIT2: Maybe helpful, there is a single reader and a single writer.
EDIT3: Example scenario: 10 inserts per second for a period of 1 minute (or the maximum the hardware allows), then an insert pause of 1 minute; then 20 inserts in 2 seconds (or the maximum the hardware allows) for a period of 30 sec, then a pause of 30 sec. The pauses are then used for the maximum possible number of reads. I don't know if I am being clear enough. Obviously not. (PS: I don't know when the pauses will occur; that is the problem.) Maximum acceptable delay for an insert: the time for the Enqueue or Add method to finish.
ADDITIONAL: Could a ConcurrentDictionary with AddOrUpdate, TryGetValue and TryRemove be used?
Construct your queue as a linked list of objects and keep references to the head and the tail of the queue. The pseudo code below roughly shows the idea:
QueueEntity Head;
QueueEntity Tail;

class QueueEntity
{
    QueueEntity Prev;
    QueueEntity Next;
    ... // queue content
}
and then do this:
// Read
lock (Tail)
{
    // get the content
    Tail = Tail.Prev;
}

// Write
lock (Head)
{
    var newEntity = new QueueEntity();
    newEntity.Next = Head;
    Head.Prev = newEntity;
    Head = newEntity;
}
Here you have separate locks for reading and writing, so they won't block each other unless there is only one entry in the queue.