Multi Thread Processing with Parallel.ForEach (C# .NET 4.5.1)

I have an IEnumerable<customClass> object that has roughly 10-15 entries, so not a lot, but I'm running into a System.IO.FileNotFoundException when I try to do
Parallel.ForEach(..some linq query.., object => { ...stuff....});
with the enumerable. Here is the code I have that sometimes works, other times doesn't:
IEnumerable<UserIdentifier> userIds = script.Entries.Select(x => x.UserIdentifier).Distinct();
await Task.Factory.StartNew(() =>
{
    Parallel.ForEach(userIds, async userId =>
    {
        Stopwatch watch = new Stopwatch();
        watch.Start();
        _Log.InfoFormat("user identifier: {0}", userId);
        await Task.Factory.StartNew(() =>
        {
            foreach (ScriptEntry se in script.Entries.Where(x => x.UserIdentifier.Equals(userId)))
            {
                // Run the script
                _Log.InfoFormat("waiting {0}", se.Delay);
                Task.Delay(se.Delay);
                _Log.InfoFormat("running SelectionInformation{0}", se.SelectionInformation);
                ExecuteSingleEntry(se);
                _Log.InfoFormat("[====== SelectionInformation {1} ELAPSED TIME: {0} ======]", watch.Elapsed,
                    se.SelectionInformation.Verb);
            }
        });
        watch.Stop();
        _Log.InfoFormat("[====== TOTAL ELAPSED TIME: {0} ======]", watch.Elapsed);
    });
});
When the function ExecuteSingleEntry is run, a function a few calls deep within it creates a temp directory and files. It seems to me that when I run the Parallel.ForEach, that function is getting slammed at once by numerous calls (I'm testing 5 at once currently but need to handle about 10) and isn't creating some of the files I need. But if I hit a breakpoint in the file-creation function and just press F5 every time it gets hit, I don't have any problems with a FileNotFoundException being thrown.
So, my question is: how can I run a subset of my script.Entries in parallel, based on the user id within the script entries, with a delay of 1 second between starting the entries for each different user id?
and a script entry is like:
UserIdentifier: 141, SelectionInformation: class of stuff, Ids: list of EntryIds, Names: list of Entry Names
And each user identifier can appear one or more times in the array. I want to start all the different user identifiers, more or less, at once, and then task out the different SelectionInformations tied to each script entry.
scripts.Entries is an array of ScriptEntry, which is as follows:
[DataMember]
public TimeSpan Delay { get; set; }
[DataMember]
public SelectionInformation Selection { get; set; }
[DataMember]
public long[] Ids { get; set; }
[DataMember]
public string Names { get; set; }
[DataMember]
public long UserIdentifier { get; set; }
I referenced Parallel.ForEach vs Task.Factory.StartNew to obtain the
Task.Factory.StartNew(() => Parallel.ForEach(...)) pattern so my UI doesn't lock up on me.

There are a few principles to apply:
Prefer Task.Run over Task.Factory.StartNew. I describe on my blog why StartNew is dangerous; Run is a much safer, more modern alternative.
Don't pass an async lambda to Parallel.ForEach. It doesn't make sense, and it won't work right.
Task.Delay doesn't do anything by itself. You either have to await it or use the synchronous version (Thread.Sleep).
(In fact, in your case, the internal StartNew is meaningless; it's already parallel, and the code - running on a thread pool thread - is trying to start a new operation on a thread pool thread and immediately asynchronously await it???)
After applying these principles:
await Task.Run(() =>
{
    Parallel.ForEach(userIds, userId =>
    {
        Stopwatch watch = new Stopwatch();
        watch.Start();
        _Log.InfoFormat("user identifier: {0}", userId);
        foreach (ScriptEntry se in script.Entries.Where(x => x.UserIdentifier.Equals(userId)))
        {
            // Run the script
            _Log.InfoFormat("waiting {0}", se.Delay);
            Thread.Sleep(se.Delay);
            _Log.InfoFormat("running SelectionInformation{0}", se.SelectionInformation);
            ExecuteSingleEntry(se);
            _Log.InfoFormat("[====== SelectionInformation {1} ELAPSED TIME: {0} ======]", watch.Elapsed,
                se.SelectionInformation.Verb);
        }
        watch.Stop();
        _Log.InfoFormat("[====== TOTAL ELAPSED TIME: {0} ======]", watch.Elapsed);
    });
});
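If you want to avoid blocking thread pool threads with Thread.Sleep entirely, a fully asynchronous alternative is to group the entries by user and give each user its own task. This is only a sketch built on the names from the question (script, ExecuteSingleEntry and ScriptEntry are assumed to be as shown above), including the 1-second stagger between users that the question asks for:
var perUserTasks = new List<Task>();
foreach (var userId in userIds)
{
    var id = userId; // avoid closure surprises on older compilers
    perUserTasks.Add(Task.Run(async () =>
    {
        foreach (ScriptEntry se in script.Entries.Where(x => x.UserIdentifier.Equals(id)))
        {
            await Task.Delay(se.Delay); // asynchronous wait; no thread is blocked
            ExecuteSingleEntry(se);     // the CPU-bound work runs on a pool thread
        }
    }));
    await Task.Delay(1000); // 1-second stagger between starting each user's entries
}
await Task.WhenAll(perUserTasks);
Each user's entries still run in order, all users run concurrently, and the UI thread stays free without the outer Task.Run.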

Related

FIFO Queue improvement for C#

Hi, I am working on an assignment and I should implement a queue which handles jobs waiting to be processed (the producer-consumer problem). I have to develop a better queue that works more efficiently than a FIFO queue. There are parameters that describe the waiting time before starvation occurs and the processing time each job needs once its turn comes. Consumers arrive at a specified time, can wait for a specified time, and take some time to execute whatever they want to do when their turn has come. Can you help me with a better queue than the FIFO method?
First of all, you are trying to solve different problems at the same time. If you want to improve the performance of the regular queue, you can implement a queue based on a priority of the elements (a heap). If you want to preserve the arrival order of the regular queue, you can derive the priority from an integer, increasing that number every time you add an element to the heap.
Here I am attaching the first link that I found on Google for a priority queue. Insertion is O(log n) if you use a binary heap.
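For illustration, here is a minimal binary min-heap sketch (the names are mine, not from a library); the insertion counter described above would go into the priority to break ties:
public class BinaryHeap<T>
{
    private readonly List<KeyValuePair<int, T>> items = new List<KeyValuePair<int, T>>();

    public int Count { get { return items.Count; } }

    public void Enqueue(int priority, T value) // O(log n)
    {
        items.Add(new KeyValuePair<int, T>(priority, value));
        int i = items.Count - 1;
        while (i > 0) // sift the new item up towards the root
        {
            int parent = (i - 1) / 2;
            if (items[parent].Key <= items[i].Key) break;
            var tmp = items[parent]; items[parent] = items[i]; items[i] = tmp;
            i = parent;
        }
    }

    public T Dequeue() // O(log n); assumes Count > 0
    {
        var root = items[0];
        items[0] = items[items.Count - 1];
        items.RemoveAt(items.Count - 1);
        int i = 0;
        while (true) // sift the moved item down
        {
            int left = 2 * i + 1, right = left + 1, smallest = i;
            if (left < items.Count && items[left].Key < items[smallest].Key) smallest = left;
            if (right < items.Count && items[right].Key < items[smallest].Key) smallest = right;
            if (smallest == i) break;
            var tmp = items[smallest]; items[smallest] = items[i]; items[i] = tmp;
            i = smallest;
        }
        return root.Value;
    }
}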
Now if you want to implement that queue allowing concurrency, you need to isolate the common resource (for example, the underlying structure where the heap stores its elements).
Albahari is a good reference for how producer-consumer works with concurrency.
And here are all the classes that you can use to implement concurrency for producer-consumer: Concurrency sheet
I am adding an example with one of those types:
// BlockingCollection with a fixed capacity: at most 10 items in the collection.
class Program
{
    private static int counter = 0;
    private static BlockingCollection<Product> products =
        new BlockingCollection<Product>(10);

    static void Main(string[] args)
    {
        // Three producers, one consumer.
        Task.Run(() => Producer());
        Task.Run(() => Producer());
        Task.Run(() => Producer());
        Task.Run(() => Consumer());
        Console.ReadLine();
    }

    static void Producer()
    {
        while (true)
        {
            // Interlocked keeps the counter safe across the three producers.
            int number = Interlocked.Increment(ref counter);
            var product = new Product()
            {
                Number = number,
                Name = "Product " + number
            };
            // Add blocks when the collection already holds 10 items.
            Console.WriteLine("Producing: " + product);
            products.Add(product);
            Thread.Sleep(2000);
        }
    }

    static void Consumer()
    {
        while (true)
        {
            // Take blocks until an element is available.
            var product = products.Take();
            Console.WriteLine("Consuming: " + product);
            Thread.Sleep(2000);
        }
    }
}

public class Product
{
    public int Number { get; set; }
    public string Name { get; set; }

    public override string ToString()
    {
        return Name;
    }
}

How will Parallel.Foreach behave when Iterating over the results of a method call?

Scope:
I am currently implementing an application that uses the Amazon SQS Service as a provider of data for this program to process.
Since I need parallel processing of the messages dequeued from this queue, this is what I've done.
Parallel.ForEach(GetMessages(msgAttributes), new ParallelOptions { MaxDegreeOfParallelism = threadCount }, message =>
{
    // Processing Logic
});
Here's the header of the "GetMessages" method:
private static IEnumerable<Message> GetMessages(List<String> messageAttributes = null)
{
    // Dequeueing Logic... 10 At a Time
    // Yielding the messages to the Parallel Loop
    foreach (Message awsMessage in messages)
    {
        yield return awsMessage;
    }
}
How will this work ?:
My initial thought about how this would work was that the GetMessages method would be executed whenever the threads had no work (or a good number of threads had no work, something like an internal heuristic to measure this). That being said, to me, the GetMessages method would then distribute the messages to the Parallel.ForEach worker threads, which would process the messages and wait for the Parallel.ForEach handler to give them more messages to work on.
Problem? I was wrong...
The thing is, I was wrong, and I still have no idea what's happening in this situation.
The number of messages being dequeued is way too high, and it grows by powers of 2 every time a dequeue happens. The dequeue counts (messages) go as follows:
Dequeue is Called: Returns 80 Messages
Dequeue is Called: Returns 160 Messages
Dequeue is Called: Returns 320 Messages (and so forth)
After a certain point, the number of messages being dequeued, or, in this case, waiting to be processed by my application is too high and I end up running out of memory.
More Information:
I am using thread-safe Interlocked operations to increment the counters mentioned above.
The number of threads being used is 25 (for the Parallel.ForEach).
Each "GetMessages" call will return up to 10 messages (as an IEnumerable, yielded).
Question: What exactly is happening in this scenario?
I am having a hard time trying to figure out what exactly is going on. Is my GetMessages method being invoked by each thread once it finishes the "Processing Loop", hence leading to more and more messages being dequeued?
Is the call to "GetMessages" made by a single thread, or is it being called by multiple threads?
I think there's an issue with Parallel.ForEach partitioning... Your question is a typical producer/consumer scenario. For such a case, you should have independent logic for dequeuing on one side and processing on the other. That respects separation of concerns and will simplify debugging.
BlockingCollection<T> lets you separate the two: on one side you add items to be processed, and on the other you consume them. Here's an example of how to implement it.
Note that GetConsumingEnumerable() (used in the process method below) is built into BlockingCollection<T>; the ParallelExtensionsExtras NuGet package is only needed if you want its GetConsumingPartitioner() so that Parallel.ForEach doesn't buffer the items it takes.
public static class ProducerConsumer
{
    public static ConcurrentQueue<String> SqsQueue = new ConcurrentQueue<String>();
    public static BlockingCollection<String> Collection = new BlockingCollection<String>();
    public static ConcurrentBag<String> Result = new ConcurrentBag<String>();

    public static async Task TestMethod()
    {
        // Here we separate all the Tasks in distinct threads
        Task sqs = Task.Run(async () =>
        {
            Console.WriteLine("Amazon on thread " + Thread.CurrentThread.ManagedThreadId.ToString());
            while (true)
            {
                ProducerConsumer.BackgroundFakedAmazon(); // We produce 50 Strings each second
                await Task.Delay(1000);
            }
        });

        Task deq = Task.Run(async () =>
        {
            Console.WriteLine("Dequeue on thread " + Thread.CurrentThread.ManagedThreadId.ToString());
            while (true)
            {
                ProducerConsumer.DequeueData(); // Dequeue 20 Strings each 100ms
                await Task.Delay(100);
            }
        });

        Task process = Task.Run(() =>
        {
            Console.WriteLine("Process on thread " + Thread.CurrentThread.ManagedThreadId.ToString());
            ProducerConsumer.BackgroundParallelConsumer(); // Process all the Strings in the BlockingCollection
        });

        await Task.WhenAll(sqs, deq, process);
    }

    public static void DequeueData()
    {
        foreach (var i in Enumerable.Range(0, 20))
        {
            String dequeued = null;
            if (SqsQueue.TryDequeue(out dequeued))
            {
                Collection.Add(dequeued);
                Console.WriteLine("Dequeued : " + dequeued);
            }
        }
    }

    public static void BackgroundFakedAmazon()
    {
        Console.WriteLine(" ---------- Generate 50 items on amazon side ---------- ");
        foreach (var data in Enumerable.Range(0, 50).Select(i => Path.GetRandomFileName().Split('.').FirstOrDefault()))
            SqsQueue.Enqueue(data + " / ASQS");
    }

    public static void BackgroundParallelConsumer()
    {
        // Here we stay in Parallel.ForEach, waiting for data. Once processed, we are still waiting the next chunks
        Parallel.ForEach(Collection.GetConsumingEnumerable(), (i) =>
        {
            // Processing Logic
            String processedData = "Processed : " + i;
            Result.Add(processedData);
            Console.WriteLine(processedData);
        });
    }
}
You can try it from a console app like this :
static void Main(string[] args)
{
    ProducerConsumer.TestMethod().Wait();
}
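On the buffering itself: Parallel.ForEach's default chunk partitioner grabs ever-larger chunks from the source enumerable, which is exactly the doubling you observed. If you are on .NET 4.5 or later, another option (a sketch, reusing the names from the question) is, I believe, to turn buffering off with Partitioner.Create:
// NoBuffering makes each worker take one message at a time
// from the enumerable instead of caching growing chunks.
var partitioner = Partitioner.Create(GetMessages(msgAttributes),
                                     EnumerablePartitionerOptions.NoBuffering);

Parallel.ForEach(partitioner, new ParallelOptions { MaxDegreeOfParallelism = threadCount }, message =>
{
    // Processing Logic
});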

Task Inside Loop

I have a windows service with a thread that runs every 2 minutes.
while (true)
{
    try
    {
        repNeg.willExecuteLoopWithTasks(param1, param2, param3);
        Thread.Sleep(20000);
    }
Inside this I have a loop with tasks:
foreach (RepModel repModelo in listaRep)
{
    Task t = new Task(() => { this.coletaFunc(repModelo.EndIp, user, tipoFilial); });
    t.Start();
}
But I think this implementation is wrong. I need to run one task for every element in the list and, when a specific task finishes, wait a minute and start it again.
I need to point out two situations here:
1 - I can't wait for all tasks to finish, because one task can take more than 2 hours to finish while another takes only 27 seconds.
2 - My list of tasks can change. That's why I have a thread: every 2 minutes it gets a list of tasks to execute and then starts a loop.
But sometimes a task has not finished yet when the thread starts again, and then strange things show up in my log.
I tried to use a Dictionary to solve my problem, but after some time of execution, sometimes days, my log shows:
"System.IndexOutOfRangeException"
Here is what I would do...
Create a new class that stores the following (as properties):
a RepModel ID (something unique)
a DateTime for the last time ran
an int for the frequency (in seconds) at which the task should run
a bool to determine if the task is in progress or not
Then you need a global list of the class somewhere, say called "JobList".
Your main app should have a Timer, which runs every couple of minutes. The job of this timer is to check for new RepModel entries (assume these can change over time, i.e. a database list). When it ticks, it loops the list and adds any new ones (different ID) to JobList. You may also want to remove any that are no longer required (i.e. removed from the DB list).
Then you have a second timer, which runs every second. Its job is to check all items in the JobList and compare the last run time with the current time (and ensure they are not already in progress). If the duration has elapsed, kick off the task. Once the task is complete, update the last run time so it can work next time, ensuring to change the "in progress" flag as you go.
This is all theory and you will need to give it a try yourself, but I think it covers what you are actually trying to achieve.
Some sample code (may or may not compile/work):
class Job
{
    public int ID { get; set; }
    public DateTime? LastRun { get; set; }
    public int Frequency { get; set; }
    public bool InProgress { get; set; }
}

List<Job> JobList = new List<Job>();

// Every 2 minutes (or whatever).
void timerMain_Tick()
{
    foreach (RepModel repModelo in listaRep)
    {
        if (!JobList.Any(x => x.ID == repModelo.ID))
        {
            JobList.Add(new Job() { ID = repModelo.ID, Frequency = 120 });
        }
    }
}

// Every 10 seconds (or whatever).
void timerTask_Tick()
{
    foreach (var job in JobList.Where(x => !x.InProgress && (x.LastRun == null || DateTime.Compare(x.LastRun.Value.AddSeconds(x.Frequency), DateTime.Now) < 0)))
    {
        Task t = new Task(() => {
            // Do task.
        });
        t.ContinueWith(task => {
            job.LastRun = DateTime.Now;
            job.InProgress = false;
        }, TaskScheduler.FromCurrentSynchronizationContext());
        job.InProgress = true;
        t.Start();
    }
}
So what you really need here is a class that has two operations: it needs to be able to start processing one of your models, and it needs to be able to end processing of one of your models. Separating it from the list will make this easier.
When you start processing a model, you'll want to create a CancellationTokenSource to associate with it so that you can stop processing it later. Processing it, in your case, means having a loop that, while not cancelled, runs an operation and then waits a while. Ending the operation is as easy as cancelling the token source.
public class Foo
{
    private ConcurrentDictionary<RepModel, CancellationTokenSource> tokenLookup =
        new ConcurrentDictionary<RepModel, CancellationTokenSource>();

    public async Task Start(RepModel model)
    {
        var cts = new CancellationTokenSource();
        tokenLookup[model] = cts;
        while (!cts.IsCancellationRequested)
        {
            await Task.Run(() => model.DoWork());
            await Task.Delay(TimeSpan.FromMinutes(1));
        }
    }

    public void End(RepModel model)
    {
        CancellationTokenSource cts;
        if (tokenLookup.TryRemove(model, out cts))
            cts.Cancel();
    }
}
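A possible way to drive this from the 2-minute thread, as a sketch (listaRep and the running set are my assumptions on top of the class above, and RepModel is assumed to have sensible equality; otherwise key by its ID):
var runner = new Foo();
var running = new HashSet<RepModel>();

// Every 2 minutes: start models that appeared, end models that disappeared.
foreach (RepModel model in listaRep)
{
    if (running.Add(model))
    {
        var ignored = runner.Start(model); // fire-and-forget; loops until End is called
    }
}
foreach (RepModel gone in running.Except(listaRep).ToList())
{
    running.Remove(gone);
    runner.End(gone);
}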
If you are using framework 4.0 or later, you may try to benefit from
Parallel.ForEach
Executes a foreach operation in which iterations may run in parallel.
Parallel code may look like this:
Parallel.ForEach(listaRep, repModelo => {
    this.coletaFunc(repModelo.EndIp, user, tipoFilial);
});
This will run on multiple cores (if that is possible), and you don't need a specific task scheduler, as your code will wait until all parallel tasks inside the parallel loop are finished. Afterwards you can call the same function recursively, if the condition is met.

Is it safe to call the same method several times asynchronously in C# 5

I would be grateful if you could let me know your opinion about calling the same method asynchronously several times. I am more interested in the safety side of this activity. I have posted here console code that contains one method that I am calling asynchronously four times, and it is working well. So far I have not noticed any hiccups. But I want to be sure. Please see the following code:
public class OneMethodCalledServeralTimes
{
    protected async Task<bool> DoSomeWork(Values values, int whenItIsCalled, string description)
    {
        Stopwatch stopWatch = new Stopwatch();
        stopWatch.Start();
        await Task.Delay(values.Value);
        Console.WriteLine("{0} Completed in {1} and this was called: {2}", description, stopWatch.ElapsedMilliseconds, whenItIsCalled);
        return true;
    }

    public bool DoAllWork()
    {
        Console.WriteLine("=====================Started doing weekend Chores===================");
        var task1 = DoSomeWork(new Values { Value = 10000 }, 1, "First work to be done");
        var task2 = DoSomeWork(new Values { Value = 7000 }, 1, "First work to be done");
        var task3 = DoSomeWork(new Values { Value = 4000 }, 2, "second work to be done");
        var task4 = DoSomeWork(new Values { Value = 1000 }, 3, "third work to be done");
        Task.WaitAll(task1, task2, task3, task4);
        Console.WriteLine("=====================Completed doing weekend Chores===================");
        return true;
    }
}
The following is the console application calling the above class:
static void Main(string[] args)
{
    //var weekend = new HomeWork().DoAllWork();
    var stopwatch = Stopwatch.StartNew(); // the stopwatch referenced below was missing; assumed to start here
    Console.WriteLine("############################Using proper methods#############################");
    var workToBeDone = new OneMethodCalledServeralTimes().DoAllWork(); //Passing parameters and the most successful one
    Console.WriteLine("It took the entire four methods {0} seconds to finish", stopwatch.ElapsedMilliseconds / 1000.0);
    Console.ReadKey();
}
I would also highly welcome any concise views about the pros and cons of calling the same method asynchronously several times.
It depends on the method - but it's certainly not inherently unsafe. There's nothing within the state machine generated by the compiler that introduces problems. However, if your asynchronous method uses shared state, then the normal caveats apply. For example, you wouldn't want to call this from multiple threads concurrently:
static async Task DoBadThingsAsync(List<string> list)
{
    await Task.Delay(500);
    list.Add("Hello!");
}
... because List<T> isn't safe to use in a multi-threaded environment if any of the threads perform writes (without synchronization).
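If the method does have to touch shared state, the usual fix is to synchronize access, shown here with a plain lock as one possible sketch (a concurrent collection would work too):
static async Task DoSaferThingsAsync(List<string> list, object gate)
{
    await Task.Delay(500);
    lock (gate) // serialize writes to the shared list
    {
        list.Add("Hello!");
    }
}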
One point to note is that if you've got asynchronous methods which are expected to be used in a "single thread synchronization context" (e.g. the UI thread in a WPF or WinForms app) then you don't need to worry about thread safety - but you do need to worry about general concurrency, as both invocations of the method could be "live" at the same time.

Throttling producer/consumer pattern leads to deadlock

Got some help here on Stack Overflow earlier this week, which resulted in going forward with a producer/consumer pattern for loading, processing and importing large datasets into RavenDB.
Parallelization of CPU bound task continuing with IO bound
I'm now looking to throttle the number of work units that are prepared in advance by the producers in order to manage memory consumption. I've implemented the throttling using a basic semaphore, but I'm having trouble with the implementation deadlocking at a certain point.
I cannot figure out what could be causing the deadlocks. Below is an excerpt of the code:
private static void LoadData<TParsedData, TData>(IDataLoader<TParsedData> dataLoader, int batchSize, Action<IndexedBatch<TData>> importProceedure, Func<IEnumerable<TParsedData>, List<TData>> processProceedure)
    where TParsedData : class
    where TData : class
{
    Console.WriteLine(@"Loading {0}...", typeof(TData).ToString());
    var batchCounter = 0;
    var ist = Stopwatch.StartNew();
    var throttler = new SemaphoreSlim(10);
    var bc = new BlockingCollection<IndexedBatch<TData>>();

    var importTask = Task.Run(() =>
    {
        bc.GetConsumingEnumerable()
            .AsParallel()
            .WithExecutionMode(ParallelExecutionMode.ForceParallelism)
            //or
            //.WithDegreeOfParallelism(1)
            .WithMergeOptions(ParallelMergeOptions.NotBuffered)
            .ForAll(data =>
            {
                var st = Stopwatch.StartNew();
                importProceedure(data);
                Console.WriteLine(@"Batch imported {0} in {1} ms", data.Index, st.ElapsedMilliseconds);
                throttler.Release();
            });
    });

    var processTask = Task.Run(() =>
    {
        dataLoader.GetParsedItems()
            .Partition(batchSize)
            .AsParallel()
            .WithDegreeOfParallelism(Environment.ProcessorCount)
            //or
            //.WithDegreeOfParallelism(1)
            .WithMergeOptions(ParallelMergeOptions.NotBuffered)
            .ForAll(batch =>
            {
                throttler.Wait(); //.WaitAsync()
                var batchno = ++batchCounter;
                var st = Stopwatch.StartNew();
                bc.Add(new IndexedBatch<TData>(batchno, processProceedure(batch)));
                Console.WriteLine(@"Batch processed {0} in {1} ms", batchno, st.ElapsedMilliseconds);
            });
    });

    processTask.Wait();
    bc.CompleteAdding();
    importTask.Wait();
    Console.WriteLine(nl(1) + @"Loading {0} completed in {1} ms", typeof(TData).ToString(), ist.ElapsedMilliseconds);
}

public class IndexedBatch<TBatch>
    where TBatch : class
{
    public IndexedBatch(int index, List<TBatch> batch)
    {
        Index = index;
        Batch = batch ?? new List<TBatch>();
    }

    public int Index { get; set; }
    public List<TBatch> Batch { get; set; }
}
This is the call being made to LoadData:
LoadData<DataBase, Data>(
    DataLoaderFactory.Create<DataBase>(datafilePath),
    1024,
    (data) =>
    {
        using (var session = Store.OpenSession())
        {
            foreach (var i in data.Batch)
            {
                session.Store(i);
                d.TryAdd(i.LongId.GetHashCode(), int.Parse(i.Id.Substring(i.Id.LastIndexOf('/') + 1)));
            }
            session.SaveChanges();
        }
    },
    (batch) =>
    {
        return batch.Select(i => new Data()
        {
            ...
        }).ToList();
    }
);
Store is a RavenDB IDocumentStore. DataLoaderFactory constructs a custom parser for the given dataset.
It's hard to debug a deadlock without big arrows that say "blocks here!". Short of stepping through it in a debugger: BlockingCollection can already throttle. Use the constructor that takes the int boundedCapacity argument and eliminate the semaphore. Very high odds that that solves your deadlock.
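In other words, a sketch of the change:
// Bounded to 10 pending batches: bc.Add blocks the producer side
// when the buffer is full, so the SemaphoreSlim can be removed.
var bc = new BlockingCollection<IndexedBatch<TData>>(10);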
Can you check the number of threads you have? You have probably exhausted the thread pool due to blocking. The TPL injects more threads than ProcessorCount if it thinks your code would deadlock without them, but it can only do so up to a certain limit.
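A quick way to check (a sketch using standard BCL calls, not code from the question):
int workerThreads, ioThreads;
ThreadPool.GetAvailableThreads(out workerThreads, out ioThreads);
Console.WriteLine("Available pool threads: {0} worker / {1} IO", workerThreads, ioThreads);
Console.WriteLine("Threads in process: {0}",
    System.Diagnostics.Process.GetCurrentProcess().Threads.Count);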
Anyway, blocking inside of TPL tasks is generally a bad idea, as the built-in heuristics work best with non-blocking code.
