Task post-processing to start soon after 2 tasks are done - C#

I have a core task retrieving some core data and multiple other sub-tasks fetching extra data. I would like to run an enricher process on the core data as soon as the core task and any one of the sub-tasks are ready. Would you know how to do this?
I thought about something like this, but I'm not sure it's doing what I want:
// Starting the tasks
var coreDataTask = new Task(...);
var extraDataTask1 = new Task(...);
var extraDataTask2 = new Task(...);
coreDataTask.Start();
extraDataTask1.Start();
extraDataTask2.Start();
// Enriching the results
Task.WaitAll(coreDataTask, extraDataTask1);
EnrichCore(coreDataTask.Result, extraDataTask1.Result);
Task.WaitAll(coreDataTask, extraDataTask2);
EnrichCore(coreDataTask.Result, extraDataTask2.Result);
Also, given that the enrichment happens on the same core object, I guess I would need to lock it somewhere?
Thanks in advance!

Here is another idea taking advantage of Task.WhenAny() to detect when tasks are completing.
For this minimal example, I just assume that the core data and extra data are strings. But you can adjust for whatever your type is.
Also, I am not actually doing any processing. You would have to plug in your processing.
Also, an assumption I am making, that is not really clear, is that you are mostly trying to parallelize the gathering of your data because that's the expensive part, but that the enriching part is actually pretty fast. Based on that assumption, you'll notice that the tasks run in parallel to gather the core data and extra data. But as the data becomes available, the core data is enriched synchronously to avoid having to complicate the code with locking.
If you copy-paste the code below, you should be able to run it as is to see how it works.
public static void Main(string[] args)
{
    StartWork().Wait();
}

private async static Task StartWork()
{
    // start core and extra tasks
    Task<string> coreDataTask = Task.Run(() => "core data" /* do something more complicated here */);
    List<Task<string>> extraDataTaskList = new List<Task<string>>();
    for (int i = 0; i < 10; i++)
    {
        int x = i;
        extraDataTaskList.Add(Task.Run(() => "extra data " + x /* do something more complicated here */));
    }

    // wait for core data to be ready first.
    StringBuilder coreData = new StringBuilder(await coreDataTask);

    // enrich core as the extra data tasks complete.
    while (extraDataTaskList.Count != 0)
    {
        Task<string> completedExtraDataTask = await Task.WhenAny(extraDataTaskList);
        extraDataTaskList.Remove(completedExtraDataTask);
        EnrichCore(coreData, await completedExtraDataTask);
    }

    Console.WriteLine(coreData.ToString());
}

private static void EnrichCore(StringBuilder coreData, string extraData)
{
    coreData.Append(" enriched with ").Append(extraData);
}
EDIT: .NET 4.0 version
Here is how I would change it for .NET 4.0, while still retaining the same overall design:
Task.Run() becomes Task.Factory.StartNew()
Instead of doing await on tasks, I call Result, which is a blocking call that waits for the task to complete.
Use Task.WaitAny instead of Task.WhenAny; WaitAny is also a blocking call.
The design remains very similar. The one big difference between both versions of the code is that in the .NET 4.5 version, whenever there is an await, the current thread is free to do other work. In the .NET 4.0 version, whenever you call Task.Result or Task.WaitAny, the current thread blocks until the Task completes. It's possible that this difference is not really important to you. But if it is, just make sure to wrap and run the whole block of code in a background thread or task to free up your main thread.
The other difference is with the exception handling. With the .NET 4.5 version, if any of your tasks fails with an unhandled exception, the exception is automatically unwrapped and propagated in a very transparent manner. With the .NET 4.0 version, you'll be getting AggregateExceptions that you will have to unwrap and handle yourself. If this is a concern, make sure you test this beforehand so you know what to expect.
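To make the exception-handling difference concrete, here is a small self-contained sketch (not part of the code above, and the "boom" exception is just a stand-in) of how the blocking .NET 4.0 calls surface failures:

// Sketch only: Task.Result wraps failures in an AggregateException, whereas await
// rethrows the original exception directly.
Task<string> failing = Task.Factory.StartNew<string>(() => { throw new InvalidOperationException("boom"); });
try
{
    string value = failing.Result; // blocks, then throws AggregateException
}
catch (AggregateException ex)
{
    // The real exception(s) thrown inside the task are in InnerExceptions.
    foreach (Exception inner in ex.InnerExceptions)
    {
        Console.WriteLine("Task failed with: " + inner.Message); // prints "boom"
    }
}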
Personally, I try to avoid Task.ContinueWith whenever I can. It tends to make the code really ugly and hard to read.
public static void Main(string[] args)
{
    // start core and extra tasks
    Task<string> coreDataTask = Task.Factory.StartNew(() => "core data" /* do something more complicated here */);
    List<Task<string>> extraDataTaskList = new List<Task<string>>();
    for (int i = 0; i < 10; i++)
    {
        int x = i;
        extraDataTaskList.Add(Task.Factory.StartNew(() => "extra data " + x /* do something more complicated here */));
    }

    // wait for core data to be ready first.
    StringBuilder coreData = new StringBuilder(coreDataTask.Result);

    // enrich core as the extra data tasks complete.
    while (extraDataTaskList.Count != 0)
    {
        int indexOfCompletedTask = Task.WaitAny(extraDataTaskList.ToArray());
        Task<string> completedExtraDataTask = extraDataTaskList[indexOfCompletedTask];
        extraDataTaskList.Remove(completedExtraDataTask);
        EnrichCore(coreData, completedExtraDataTask.Result);
    }

    Console.WriteLine(coreData.ToString());
}

private static void EnrichCore(StringBuilder coreData, string extraData)
{
    coreData.Append(" enriched with ").Append(extraData);
}

I think what you probably want is ContinueWith (documentation here: https://msdn.microsoft.com/en-us/library/dd270696(v=vs.110).aspx), as long as your enriching doesn't need to be done in a specific order.
The code would look something like the following:
var coreTask = new Task<object>(() => { return null; });
var enrichTask1 = new Task<object>(() => { return null; });
var enrichTask2 = new Task<object>(() => { return null; });

coreTask.Start();
coreTask.Wait();

// Create your continuation tasks here with the data you want.
var continuation1 = enrichTask1.ContinueWith(task => { /* Do enriching here with task.Result */ });
var continuation2 = enrichTask2.ContinueWith(task => { /* Do enriching here with task.Result */ });

// Start all enricher tasks here.
enrichTask1.Start();
enrichTask2.Start();

// Wait for all the continuations to complete here.
Task.WaitAll(continuation1, continuation2);
You still need to run your core task first, as it has to finish before any of the enriching tasks. But from there you can start all the tasks and tell them, via ContinueWith, what to do once they are done.
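If you want each enrichment to kick off as soon as the core task plus that particular extra task are both done, which is what the question asked, Task.Factory.ContinueWhenAll can express that directly. This is only a sketch: coreTask, enrichTask1 and enrichTask2 are the tasks from the snippet above (assumed already started), and EnrichCore is a method you would have to supply yourself.

// Sketch: schedule one enrichment per extra task, each starting as soon as the core
// task and that extra task have both completed. EnrichCore is an assumed method.
var enriched1 = Task.Factory.ContinueWhenAll(
    new Task[] { coreTask, enrichTask1 },
    _ => EnrichCore(coreTask.Result, enrichTask1.Result));

var enriched2 = Task.Factory.ContinueWhenAll(
    new Task[] { coreTask, enrichTask2 },
    _ => EnrichCore(coreTask.Result, enrichTask2.Result));

// Both continuations may run at the same time, so if EnrichCore mutates the same
// core object, it needs to lock around that mutation (as the question suspected).
Task.WaitAll(enriched1, enriched2);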
You should also take a quick look at the "Enricher Pattern", which may help you in general with what you want to achieve (outside of threading). Examples here: http://www.enterpriseintegrationpatterns.com/DataEnricher.html

Related

Multiple HttpClients with proxies, trying to achieve maximum download speed

I need to use proxies to download a forum. The problem with my code is that it takes only 10% of my internet bandwidth. Also I have read that I need to use a single HttpClient instance, but with multiple proxies I don't know how to do it. Changing MaxDegreeOfParallelism doesn't change anything.
public static IAsyncEnumerable<IFetchResult> FetchInParallelAsync(
    this IEnumerable<Url> urls, FetchContext context)
{
    var fetchBlock = new TransformBlock<Url, IFetchResult>(
        transform: url => url.FetchAsync(context),
        dataflowBlockOptions: new ExecutionDataflowBlockOptions
        {
            MaxDegreeOfParallelism = 128
        }
    );
    foreach (var url in urls)
        fetchBlock.Post(url);
    fetchBlock.Complete();
    var result = fetchBlock.ToAsyncEnumerable();
    return result;
}
Every call to FetchAsync will create or reuse an HttpClient with a WebProxy.
public static async Task<IFetchResult> FetchAsync(this Url url, FetchContext context)
{
    var httpClient = context.ProxyPool.Rent();
    var result = await url.FetchAsync(httpClient, context.Observer, context.Delay,
        context.isReloadWithCookie);
    context.ProxyPool.Return(httpClient);
    return result;
}
public HttpClient Rent()
{
    lock (_lockObject)
    {
        if (_uninitiliazedDatacenterProxiesAddresses.Count != 0)
        {
            var proxyAddress = _uninitiliazedDatacenterProxiesAddresses.Pop();
            return proxyAddress.GetWebProxy(DataCenterProxiesCredentials).GetHttpClient();
        }
        return _proxiesQueue.Dequeue();
    }
}
I am a novice at software development, but downloading through hundreds or thousands of proxies asynchronously seems like a common task that many people must have faced and solved correctly. So far I have been unable to find any solutions to my problem on the internet. Any thoughts on how to achieve maximum download speed?
Let's take a look at what happens here:
var result = await url.FetchAsync(httpClient, context.Observer, context.Delay, context.isReloadWithCookie);
You are actually awaiting each fetch before you continue with the next item. That's why it is asynchronous but not parallel programming. From async in the Microsoft docs:
The await keyword is where the magic happens. It yields control to the caller of the method that performed await, and it ultimately allows a UI to be responsive or a service to be elastic.
In essence, it frees the calling thread to do other work, but the original calling code is suspended until the I/O operation is done.
Now to your problem:
You can either use this excellent solution here: foreach async.
Or you can use the Parallel class to execute your code on different threads.
Something like the following, using Parallel.For, for example:
Parallel.For(0, urls.Count,
    index => fetchBlock.Post(urls[index]));

How to improve the performance of async?

I'm working on a problem where I have to delete records using a service call. The issue is that I have a foreach loop in which I have multiple await operations. This is making the operation take a lot of time, and performance is lacking.
foreach (var a in b) // b is a List<long>
{
    await _serviceresolver().DeleteOperationAsync(id, a);
}
The issue is that I have a foreach loop in which I have multiple await operations.
This is making the operation take a lot of time, and performance is lacking.
The number one solution is to reduce the number of calls. This is often called "chunky" over "chatty". So if your service supports some kind of bulk-delete operation, expose it in your service type and then you can just do:
await _serviceresolver().BulkDeleteOperationAsync(id, b);
But if that isn't possible, then you can at least use asynchronous concurrency. This is quite different from parallelism; you don't want to use Parallel or PLINQ.
var service = _serviceresolver();
var tasks = b.Select(a => service.DeleteOperationAsync(id, a)).ToList();
await Task.WhenAll(tasks);
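If the service can't cope with all the deletes being fired off at once, the asynchronous concurrency can also be throttled with a SemaphoreSlim. This is a sketch under the same assumptions as the code above (id, b and _serviceresolver come from the question; the limit of 10 is arbitrary):

// Sketch: limit the number of in-flight delete calls to 10 at a time.
var service = _serviceresolver();
using (var throttle = new SemaphoreSlim(10))
{
    var tasks = b.Select(async a =>
    {
        await throttle.WaitAsync();
        try
        {
            await service.DeleteOperationAsync(id, a);
        }
        finally
        {
            throttle.Release();
        }
    }).ToList();
    await Task.WhenAll(tasks);
}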
I do not know what code is behind this DeleteOperationAsync, but async/await certainly isn't designed to speed things up. It was designed to "spare" threads (colloquially speaking).
The best option would be to change the method to take the whole list of ids as a parameter, instead of taking and sending just one id.
Then this async/await-heavy operation is performed only once, for all of the ids.
If that is not possible, you could just run it in parallel using the TPL (but that is really the worst-case scenario):
Parallel.ForEach(listOfIdsToDelete,
    async idToDelete => await _serviceresolver().DeleteOperationAsync(id, idToDelete)
);
You're waiting for each async operation to finish right now. If you can fire them all off concurrently, you can just call them without the await, or if you need to know when they finish, you can just fire them all off and then wait for them all to finish by tracking the tasks in a list:
List<Task> tasks = new List<Task>();
foreach (var a in b) // b is a List<long>
    tasks.Add(_serviceresolver().DeleteOperationAsync(id, a));
await Task.WhenAll(tasks);
You can use PLINQ (to leverage all the processors of your machine) and the Task.WhenAll method (so as not to freeze the calling thread). In code, it results in something like this:
class Program
{
    static async Task Main(string[] args)
    {
        var list = new List<long> { 4, 3, 2 };
        var service = new Service();
        var response =
            from item in list.AsParallel()
            select service.DeleteOperationAsync(item);
        await Task.WhenAll(response);
    }
}

public class Service
{
    public async Task DeleteOperationAsync(long value)
    {
        await Task.Delay(2000);
        Console.WriteLine($"Finished... {value}");
    }
}

Understand parallel programming in C# with async examples

I am trying to understand parallel programming and I would like my async methods to run on multiple threads. I have written something but it does not work like I thought it should.
Code
public static async Task Main(string[] args)
{
    var listAfterParallel = RunParallel(); // Running this function to return tasks
    await Task.WhenAll(listAfterParallel); // I want the program execution to stop until all tasks are returned or completed
    Console.WriteLine("After Parallel Loop"); // But currently when I run the program, "After Parallel Loop" is printed first
    Console.ReadLine();
}

public static async Task<ConcurrentBag<string>> RunParallel()
{
    var client = new System.Net.Http.HttpClient();
    client.DefaultRequestHeaders.Add("Accept", "application/json");
    client.BaseAddress = new Uri("https://jsonplaceholder.typicode.com");
    var list = new List<int>();
    var listResults = new ConcurrentBag<string>();
    for (int i = 1; i < 5; i++)
    {
        list.Add(i);
    }

    // Parallel.ForEach branch to run the await commands on multiple threads.
    Parallel.ForEach(list, new ParallelOptions() { MaxDegreeOfParallelism = 2 }, async (index) =>
    {
        var response = await client.GetAsync("posts/" + index);
        var contents = await response.Content.ReadAsStringAsync();
        listResults.Add(contents);
        Console.WriteLine(contents);
    });
    return listResults;
}
I would like the RunParallel function to complete before "After Parallel Loop" is printed. I also want my GET requests for the posts to run on multiple threads.
Any help would be appreciated!
What's happening here is that you're never waiting for the Parallel.ForEach block to complete - you're just returning the bag that it will eventually pump into. The reason for this is that because Parallel.ForEach expects Action delegates, you've created a lambda which returns void rather than Task. While async void methods are valid, they generally continue their work on a new thread and return to the caller as soon as they await a Task, and the Parallel.ForEach method therefore thinks the handler is done, even though it's kicked that remaining work off into a separate thread.
Instead, use a synchronous method here;
Parallel.ForEach(list, new ParallelOptions() { MaxDegreeOfParallelism = 2 }, index =>
{
    var response = client.GetAsync("posts/" + index).Result;
    var contents = response.Content.ReadAsStringAsync().Result;
    listResults.Add(contents);
    Console.WriteLine(contents);
});
If you absolutely must use await inside, wrap it in Task.Run(...).GetAwaiter().GetResult():
Parallel.ForEach(list, new ParallelOptions() { MaxDegreeOfParallelism = 2 }, index => Task.Run(async () =>
{
    var response = await client.GetAsync("posts/" + index);
    var contents = await response.Content.ReadAsStringAsync();
    listResults.Add(contents);
    Console.WriteLine(contents);
}).GetAwaiter().GetResult());
In this case, however, Task.Run generally goes to a new thread, so we've subverted most of the control of Parallel.ForEach; it's better to use async all the way down:
var tasks = list.Select(async (index) =>
{
    var response = await client.GetAsync("posts/" + index);
    var contents = await response.Content.ReadAsStringAsync();
    listResults.Add(contents);
    Console.WriteLine(contents);
});
await Task.WhenAll(tasks);
Since Select expects a Func<T, TResult>, it will interpret an async lambda with no return value as an async Task method instead of async void, and thus give us something we can explicitly await.
Take a look at this: There Is No Thread
When you are making multiple concurrent web requests it's not your CPU that is doing the hard work. It's the CPU of the web server that is serving your requests. Your CPU is doing nothing during this time. It's not in a special "Wait-state" or something. The hardware inside your box that is working is your network card, that writes data to your RAM. When the response is received then your CPU will be notified about the arrived data, so it can do something with them.
You need parallelism when you have heavy work to do inside your box, not when you want the heavy work to be done by the external world. From the point of view of your CPU, even your hard disk is part of the external world. So everything that applies to web requests, applies also to requests targeting filesystems and databases. These workloads are called I/O bound, to be distinguished from the so called CPU bound workloads.
For I/O bound workloads the tool offered by the .NET platform is the asynchronous Task. There are multiple APIs throughout the libraries that return Task objects. To achieve concurrency you typically start multiple tasks and then await them with Task.WhenAll. There are also more advanced tools like the TPL Dataflow library, which is built on top of Tasks. It offers capabilities like buffering, batching, configuring the maximum degree of concurrency, and much more.
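To make the Dataflow suggestion concrete, here is a small sketch (the URLs and the work inside the block are placeholders, not anything from the question) of an ActionBlock that caps the number of concurrent downloads:

// Sketch: process items with a capped degree of concurrency using TPL Dataflow.
// Requires the System.Threading.Tasks.Dataflow package.
var client = new HttpClient();
var downloadBlock = new ActionBlock<string>(
    async url =>
    {
        // Placeholder work: fetch the page and report its length.
        string body = await client.GetStringAsync(url);
        Console.WriteLine($"{url}: {body.Length} chars");
    },
    new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 4 });

foreach (var url in new[] { "https://example.com", "https://example.org" })
    downloadBlock.Post(url);

downloadBlock.Complete();
await downloadBlock.Completion;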

Running a long-running parallel task in the background, while allowing small async tasks to update the foreground

I have around 10 000 000 tasks that each takes from 1-10 seconds to complete. I am running those tasks on a powerful server, using 50 different threads, where each thread picks the first not-done task, runs it, and repeats.
Pseudo-code:
for i = 0 to 50:
    run a new thread:
        while True:
            task = first available task
            if no available tasks: exit thread
            run task
Using this code, I can run all the tasks in parallel on any given number of threads.
In reality, the code uses C#'s Task.WhenAll, and looks like this:
ServicePointManager.DefaultConnectionLimit = threadCount; // Allow more simultaneous HTTP requests

var currentIndex = -1;
var threads = new List<Task>(); // List of threads
for (int i = 0; i < threadCount; i++) // Generate the threads
{
    var wc = CreateWebClient();
    threads.Add(Task.Run(() =>
    {
        while (true) // Each thread should loop, picking the first available task, and executing it.
        {
            var index = Interlocked.Increment(ref currentIndex);
            if (index >= tasks.Count) break;
            var task = tasks[index];
            RunTask(conn, wc, task, port);
        }
    }));
}
await Task.WhenAll(threads);
This works just as I wanted it to, but I have a problem: since this code takes a lot of time to run, I want the user to see some progress. The progress is displayed in a colored bitmap (representing a matrix), and also takes some time to generate (a few seconds).
Therefore, I want to generate this visualization on a background thread. But this other background thread is never executed. My suspicion is that it is using the same thread pool as the parallel code, and is therefore enqueued, and will not be executed before the parallel code is actually finished. (And that's a bit too late.)
Here's an example of how I generate the progress visualization:
private async void Refresh_Button_Clicked(object sender, RoutedEventArgs e)
{
    var bitmap = await Task.Run(() => // <<< This task is never executed!
    {
        // bla, bla, various database calls, and generating a relatively large bitmap
    });

    // Convert the bitmap into a WPF image, and update the GUI
    VisualizationImage = BitmapToImageSource(bitmap);
}
So, how could I best solve this problem? I could create a list of Tasks, where each Task represents one of my tasks, run them with Parallel.Invoke, and pick another thread pool (I think). But then I would have to generate 10 million Task objects, instead of just 50 Task objects running through my array of stuff to do. That sounds like it uses much more RAM than necessary. Any clever solutions to this?
EDIT:
As Panagiotis Kanavos suggested in one of his comments, I tried replacing some of my loop logic with ActionBlock, like this:
// Create an ActionBlock<ZoneTask> that performs some work.
var workerBlock = new ActionBlock<ZoneTask>(
    t =>
    {
        var wc = CreateWebClient(); // This probably generates some unnecessary overhead, but that's a problem I can solve later.
        RunTask(conn, wc, t, port);
    },
    // Specify a maximum degree of parallelism.
    new ExecutionDataflowBlockOptions
    {
        MaxDegreeOfParallelism = threadCount
    });

foreach (var t in tasks) // Note: the objects in the tasks array are not Task objects
    workerBlock.Post(t);
workerBlock.Complete();
await workerBlock.Completion;
Note: RunTask just executes a web request using the WebClient and parses the results. There's nothing in there that can create a deadlock.
This seems to work as the old parallelism code, except that it needs a minute or two to do the initial foreach loop to post the tasks. Is this delay really worth it?
Nevertheless, my progress task still seems to be blocked. I'm ignoring the Progress<T> suggestion for now, since this reduced code still suffers from the same problem:
private async void Refresh_Button_Clicked(object sender, RoutedEventArgs e)
{
    Debug.WriteLine("This happens");
    var bitmap = await Task.Run(() =>
    {
        Debug.WriteLine("This does not!");
        // Still doing some work here, so it's not optimized away.
    });
    VisualizationImage = BitmapToImageSource(bitmap);
}
So it still looks like new tasks are not executed as long as the parallel task is running. I even reduced MaxDegreeOfParallelism from 50 to 5 (on a 24-core server) to see if Peter Ritchie's suggestion was right, but no change. Any other suggestions?
ANOTHER EDIT:
The issue seems to have been that I overloaded the thread pool with all my simultaneous blocking I/O calls. I replaced WebClient with HttpClient and its async-functions, and now everything seems to be working nicely.
Thanks to everyone for the great suggestions! Even though not all of them directly solved the problem, I'm sure they all improved my code. :)
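For reference, a rough sketch of what that WebClient-to-HttpClient change might look like inside the worker block. This is not the OP's actual code: RunTaskAsync is a hypothetical async rewrite of RunTask, while conn, port, threadCount and ZoneTask come from the snippets above.

// Sketch: an async ActionBlock so each in-flight request awaits the HTTP call instead
// of blocking a thread-pool thread. The Post/Complete/await Completion part is unchanged.
var httpClient = new HttpClient();
var workerBlock = new ActionBlock<ZoneTask>(
    async t => await RunTaskAsync(conn, httpClient, t, port),
    new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = threadCount });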
.NET already provides a mechanism to report progress with the IProgress<T> interface and the Progress<T> implementation.
The IProgress<T> interface allows clients to publish messages with the Report(T) method without having to worry about threading. The implementation ensures that the messages are processed on the appropriate thread, e.g. the UI thread. By using the simple IProgress<T> interface the background methods are decoupled from whoever processes the messages.
You can find more information in the Async in 4.5: Enabling Progress and Cancellation in Async APIs article. The cancellation and progress APIs aren't specific to the TPL. They can be used to simplify cancellation and reporting even for raw threads.
Progress<T> processes messages on the thread on which it was created. This can be done either by passing a processing delegate when the class is instantiated, or by subscribing to an event. Copying from the article:
private async void Start_Button_Click(object sender, RoutedEventArgs e)
{
    // construct Progress<T>, passing ReportProgress as the Action<T>
    var progressIndicator = new Progress<int>(ReportProgress);
    // call async method
    int uploads = await UploadPicturesAsync(GenerateTestImages(), progressIndicator);
}
where ReportProgress is a method that accepts a parameter of int. It could also accept a complex class that reported work done, messages etc.
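For completeness, here is a minimal sketch of what that callback could look like in WPF (ProgressLabel is a hypothetical control, not something from the article; Progress<T> invokes the callback on the UI thread it was created on):

// Sketch: the callback passed to new Progress<int>(ReportProgress) above.
// ProgressLabel is a hypothetical TextBlock in the window's XAML.
private void ReportProgress(int value)
{
    ProgressLabel.Text = value + "% of pictures uploaded";
}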
The asynchronous method only has to call IProgress<T>.Report, e.g.:
async Task<int> UploadPicturesAsync(List<Image> imageList, IProgress<int> progress)
{
    int totalCount = imageList.Count;
    int processCount = await Task.Run<int>(async () =>
    {
        int tempCount = 0;
        foreach (var image in imageList)
        {
            // await the processing and uploading logic here
            int processed = await UploadAndProcessAsync(image);
            if (progress != null)
            {
                progress.Report((tempCount * 100 / totalCount));
            }
            tempCount++;
        }
        return tempCount;
    });
    return processCount;
}
This decouples the background method from whoever receives and processes the progress messages.
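Applied to the question's ActionBlock-based loop, the wiring might look roughly like this sketch (StatusLabel is a hypothetical UI control; conn, port, tasks, threadCount, CreateWebClient and RunTask come from the question):

// Sketch: report progress from the worker block without touching the UI directly.
// The Progress<int> instance must be created on the UI thread.
IProgress<int> progress = new Progress<int>(done =>
    StatusLabel.Text = done + " of " + tasks.Count + " tasks finished");

int completedCount = 0;
var workerBlock = new ActionBlock<ZoneTask>(
    t =>
    {
        var wc = CreateWebClient();
        RunTask(conn, wc, t, port);
        progress.Report(Interlocked.Increment(ref completedCount));
    },
    new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = threadCount });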

Hangfire Background Job with Return Value

I'm switching from Task.Run to Hangfire. In .NET 4.5+, Task.Run can return Task<TResult>, which allows me to run tasks that return something other than void. I can normally wait for and get the result of my task by accessing the property MyReturnedTask.Result.
Example of my old code:
public void MyMainCode()
{
    List<string> listStr = new List<string>();
    listStr.Add("Bob");
    listStr.Add("Kate");
    listStr.Add("Yaz");

    List<Task<string>> listTasks = new List<Task<string>>();
    foreach (string str in listStr)
    {
        Task<string> returnedTask = Task.Run(() => GetMyString(str));
        listTasks.Add(returnedTask);
    }

    foreach (Task<string> task in listTasks)
    {
        // using task.Result will cause the code to wait for the task if not yet finished.
        // Alternatively, you can use Task.WaitAll(listTasks.ToArray()) to wait for all tasks in the list to finish.
        MyTextBox.Text += task.Result + Environment.NewLine;
    }
}

private string GetMyString(string str)
{
    // long execution in order to calculate the returned string
    return str + "_finished";
}
As far as I can see from the Quick Start page of Hangfire, the main entry point, BackgroundJob.Enqueue(() => Console.WriteLine("Fire-and-forget"));,
perfectly runs the code as a background job but apparently doesn't support jobs that have a return value (like the code I presented above). Is that right? If not, how can I tweak my code in order to use Hangfire?
P.S. I already looked at HostingEnvironment.QueueBackgroundWorkItem (here) but it apparently lacks the same functionality (background jobs have to be void)
EDIT
As @Dejan figured out, the main reason I want to switch to Hangfire is the same reason the .NET folks added QueueBackgroundWorkItem in .NET 4.5.2. That reason is well described in Scott Hanselman's great article about Background Tasks in ASP.NET, so I'm going to quote from the article:
QBWI (QueueBackgroundWorkItem) schedules a task which can run in the background, independent of any request. This differs from a normal ThreadPool work item in that ASP.NET automatically keeps track of how many work items registered through this API are currently running, and the ASP.NET runtime will try to delay AppDomain shutdown until these work items have finished executing.
One simple solution would be to poll the monitoring API until the job is finished like this:
public static Task Enqueue(Expression<Action> methodCall)
{
    string jobId = BackgroundJob.Enqueue(methodCall);
    Task checkJobState = Task.Factory.StartNew(() =>
    {
        while (true)
        {
            IMonitoringApi monitoringApi = JobStorage.Current.GetMonitoringApi();
            JobDetailsDto jobDetails = monitoringApi.JobDetails(jobId);
            string currentState = jobDetails.History[0].StateName;
            if (currentState != "Enqueued" && currentState != "Processing")
            {
                break;
            }
            Thread.Sleep(100); // adjust to a coarse enough value for your scenario
        }
    });
    return checkJobState;
}
Attention: Of course, in a Web-hosted scenario you cannot rely on continuation of the task (task.ContinueWith()) to do more things after the job has finished as the AppDomain might be shut down - for the same reasons you probably want to use Hangfire in the first place.
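As a hedged usage sketch of the wrapper above: because BackgroundJob.Enqueue takes an Expression<Action>, the job itself has to be void-returning, so it would need to persist its result (to a database, cache, etc.) rather than return it. StoreMyString below is a hypothetical method that does exactly that.

// Sketch: enqueue a void-returning job and await the polling wrapper until it finishes.
// StoreMyString is hypothetical; it would compute the string and save it somewhere the
// caller can read it back from afterwards.
Task pending = Enqueue(() => StoreMyString("Bob"));
await pending; // completes once the job is no longer in the Enqueued or Processing state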
