The scenario is something like this: I have 4 specific URLs in hand, and each URL's page contains many links to other web pages. I need to extract some information from those pages. I'm planning to use nested tasks for this job, that is, multiple child tasks inside one parent task, something like below.
var t1Actions = new List<Action>();
var t1 = Task.Factory.StartNew(() =>
{
foreach (var action in t1Actions)
{
Task.Factory.StartNew(action, TaskCreationOptions.AttachedToParent);
}
});
var t2Actions = new List<Action>();
var t2 = Task.Factory.StartNew(() =>
{
foreach (var action in t2Actions)
{
Task.Factory.StartNew(action, TaskCreationOptions.AttachedToParent);
}
});
var t3Actions = new List<Action>();
var t3 = Task.Factory.StartNew(() =>
{
foreach (var action in t3Actions)
{
Task.Factory.StartNew(action, TaskCreationOptions.AttachedToParent);
}
});
var t4Actions = new List<Action>();
var t4 = Task.Factory.StartNew(() =>
{
foreach (var action in t4Actions)
{
Task.Factory.StartNew(action, TaskCreationOptions.AttachedToParent);
}
});
Task.WhenAll(t1, t2, t3, t4);
Here are my questions:
Is this a good way to do jobs like the one I described above?
Which is more efficient: replacing the child tasks with Parallel.Invoke(action), or leaving the code as it is?
How should I notify (for example, the UI) when a nested task completes? Do I have control over nested tasks?
Any advice will be helpful.
The actual problem isn't how to handle child tasks. It's how to get a list of URLs from some directory pages, retrieve those pages and process them.
This can be done easily using .NET's Dataflow library. Each step can be implemented as a block that reads one URL and produces an output.
The first block can be a TransformManyBlock that accepts one page URL and returns a list of page URLs.
The second block can be a TransformBlock that accepts a single page URL and returns its contents.
The third block can be an ActionBlock that accepts the page and does whatever is needed with it.
For example:
var listBlock = new TransformManyBlock<Uri,Uri>(async uri=>
{
var content=await httpClient.GetStringAsync(uri);
var uris=ProcessThePage(content);
return uris;
});
var downloadBlock = new TransformBlock<Uri,(Uri,string)>(async uri=>
{
var content=await httpClient.GetStringAsync(uri);
return (uri,content);
});
var processingBlock = new ActionBlock<(Uri uri,string content)>(async msg=>
{
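//Do something with the page, eg save it to disk
var pathFromUri=GetPath(msg.uri); //GetPath is a hypothetical helper that maps a URI to a file path
await File.WriteAllTextAsync(pathFromUri,msg.content);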
});
var linkOptions=new DataflowLinkOptions{PropagateCompletion=true};
listBlock.LinkTo(downloadBlock,linkOptions);
downloadBlock.LinkTo(processingBlock,linkOptions);
Each block runs using its own Task. You can specify that a block may use more than one task, e.g. to download multiple pages concurrently.
Each block has an input and output buffer. You can specify a limit to the input buffer to avoid flooding a block with too many messages to process. If a block reaches the limit, upstream blocks will pause. This way, you could prevent e.g. the downloadBlock from flooding a slow processingBlock with thousands of pages.
Once you have a pipeline, you can post messages to the first block. When you're done, you can tell the block to Complete(). Each block in the pipeline will finish processing messages in its input buffer and propagate the completion call to the next linked block.
You can wait for all messages to be processed by awaiting the last block's Completion task.
var directoryPages=new Uri[]{..};
foreach(var uri in directoryPages)
{
listBlock.Post(uri);
}
listBlock.Complete();
await processingBlock.Completion;
The ExecutionDataflowBlockOptions class can be used to specify the use of multiple tasks and the input buffer limits, e.g.:
var options=new ExecutionDataflowBlockOptions
{
BoundedCapacity=10,
MaxDegreeOfParallelism=4,
};
var downloadBlock = new TransformBlock<Uri,(Uri,string)>(...,options);
This means that downloadBlock will accept up to 10 URIs before signalling the listBlock to pause. It will process up to 4 URIs concurrently.
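To make the back-pressure behaviour concrete, here is a minimal sketch that needs no network access. The class name, the gate and the numbers are made up for illustration; the only assumption is that System.Threading.Tasks.Dataflow is available. It shows that Post is declined when a bounded block is full, while SendAsync waits until there is room.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow;

public static class BackPressureDemo
{
    // Returns (thirdPostAccepted, sendAcceptedAfterRelease, processedCount).
    public static async Task<(bool, bool, int)> RunAsync()
    {
        // The gate keeps the block "busy" until we release it (illustrative stand-in for slow work).
        var gate = new TaskCompletionSource<bool>(TaskCreationOptions.RunContinuationsAsynchronously);
        int processed = 0;

        var slowBlock = new ActionBlock<int>(async _ =>
        {
            await gate.Task;
            Interlocked.Increment(ref processed);
        },
        new ExecutionDataflowBlockOptions { BoundedCapacity = 2 });

        slowBlock.Post(1);                        // accepted
        slowBlock.Post(2);                        // accepted, the block is now full
        bool third = slowBlock.Post(3);           // declined: Post never waits for room

        Task<bool> pending = slowBlock.SendAsync(4); // stays pending while the block is full
        gate.SetResult(true);                     // let the block make progress
        bool sent = await pending;                // completes once the item was accepted

        slowBlock.Complete();
        await slowBlock.Completion;
        return (third, sent, processed);
    }

    public static async Task Main()
    {
        Console.WriteLine(await RunAsync());
    }
}
```

This is why publishers that feed a bounded block usually prefer `await block.SendAsync(item)` over `block.Post(item)`: a declined Post silently drops the item unless the return value is checked.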
I have some tasks executing in a WhenAll(). I get a semantic error if a task returns an object and calls an async method inside their Run(). The async method fetches from Blob some string content, then constructs and returns an object.
Do you know how to solve this issue, while maintaining the batch download done by tasks?
I need a list with those FinalWrapperObjects.
Error message
Cannot convert async lambda expression to delegate type
'Func<FinalWrapperObject>'. An async lambda expression may return
void, Task or Task<T>, none of which are convertible to
'Func<FinalWrapperObject>'.
...
List<FinalWrapperObject> finalReturns = new List<FinalWrapperObject>();
List<Task<FinalWrapperObject>> tasks = new List<Task<FinalWrapperObject>>();
var resultsBatch = fetchedObjects.Skip(i).Take(10).ToList();
foreach (var resultBatchItem in resultsBatch)
{
tasks.Add(
new Task<FinalWrapperObject>(async () => //!! errors here on arrow
{
var blobContent = await azureBlobService.GetAsync(resultBatchItem.StoragePath);
return new FinalWrapperObject {
BlobContent = blobContent,
CreationDateTime = resultBatchItem.CreationDateTime
};
})
);
}
FinalWrapperObject[] listFinalWrapperObjects = await Task.WhenAll(tasks);
finalReturns.AddRange(listFinalWrapperObjects);
return finalReturns;
Your code never starts any tasks. Tasks aren't threads anyway. They're a promise that something will complete and maybe produce a value in the future. Some tasks require a thread to run. These are executed using threads that come from a threadpool. Others, eg async IO operations, don't require a thread. Downloading or uploading a blob is such an IO operation.
Your lambda is asynchronous and already returns a Task, so there's no reason to wrap it in new Task or Task.Run. You can call it once for each item, collect the Tasks in a list and await all of them. That's the bare-bones way:
async Task<FinalWrapperObject> UploadItemAsync(BatchItem resultBatchItem)
{
var blobContent = await azureBlobService.GetAsync(resultBatchItem.StoragePath);
return new FinalWrapperObject {
BlobContent = blobContent,
CreationDateTime = resultBatchItem.CreationDateTime
};
}
...
var tasks=resultsBatch.Select(UploadItemAsync);
var results=await Task.WhenAll(tasks);
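As a self-contained illustration of the same pattern, here is a minimal sketch in which Task.Delay stands in for the blob call. FetchAsync and the file names are hypothetical and exist only for this example. It also shows the difference between a cold task created with new Task, which stays in the Created state until someone starts it, and the hot task you get by simply calling an async method.

```csharp
using System;
using System.Linq;
using System.Threading.Tasks;

public static class WhenAllDemo
{
    // Hypothetical stand-in for an asynchronous I/O call such as a blob download.
    static async Task<string> FetchAsync(string path)
    {
        await Task.Delay(50);
        return "content of " + path;
    }

    public static async Task<string[]> RunAsync()
    {
        var paths = new[] { "a.txt", "b.txt", "c.txt" };

        // A task created with the constructor is "cold": nothing runs until Start() is called.
        var cold = new Task(() => { });
        Console.WriteLine(cold.Status);

        // Calling the async method starts the operation and returns a running task.
        // ToList() forces the lazy Select, so all calls are made right here.
        var tasks = paths.Select(FetchAsync).ToList();

        // Task.WhenAll returns the results in the same order as the tasks.
        return await Task.WhenAll(tasks);
    }

    public static async Task Main()
    {
        foreach (var result in await RunAsync())
            Console.WriteLine(result);
    }
}
```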
Using TPL Dataflow
A better option would be to use the TPL Dataflow classes to upload items concurrently and even construct a pipeline from processing blocks.
var options= new ExecutionDataflowBlockOptions
{
MaxDegreeOfParallelism = 10
};
var results=new BufferBlock<FinalWrapperObject>();
var uploader=new TransformBlock<BatchItem,FinalWrapperObject>(UploadItemAsync,options);
uploader.LinkTo(results);
foreach(var item in fetchedObjects)
{
await uploader.SendAsync(item);
}
uploader.Complete();
await uploader.Completion;
By default, a block only processes one message at a time. With MaxDegreeOfParallelism = 10 we're telling it to process up to 10 items concurrently, as long as there are items to post to the uploader block.
The results are forwarded to the results BufferBlock. The items can be extracted with TryReceiveAll :
IList<FinalWrapperObject> items;
results.TryReceiveAll(out items);
Dataflow blocks can be combined into a pipeline. You could have a block that loads items from disk, another to upload them and a final one that stores the response to another file or database :
var dop10= new ExecutionDataflowBlockOptions
{
MaxDegreeOfParallelism = 10,
BoundedCapacity=4
};
var bounded= new ExecutionDataflowBlockOptions
{
BoundedCapacity=4
};
var loader=new TransformBlock<FileInfo,BatchItem>(LoadFile,bounded);
var uploader=new TransformBlock<BatchItem,FinalWrapperObject>(UploadItemAsync,dop10);
var dbLogger=new ActionBlock<FinalWrapperObject>(StoreResult,bounded); //StoreResult is a hypothetical method that saves one result
var linkOptions=new DataflowLinkOptions {PropagateCompletion=true};
loader.LinkTo(uploader,linkOptions);
uploader.LinkTo(dbLogger,linkOptions);
var folder=new DirectoryInfo(rootPath);
foreach(var item in folder.EnumerateFiles())
{
await loader.SendAsync(item);
}
loader.Complete();
await dbLogger.Completion;
In this case, all files in a folder are posted to the loader block which loads files one by one and forwards a BatchItem. The uploader uploads the file and the results are stored by dbLogger. In the end, we tell loader we're finished and wait for all items to get processed all the way to the end with await dbLogger.Completion.
The BoundedCapacity is used to put a limit on how many items can be held at each block's input buffer. This prevents loading all files into memory.
I have a list of items to process, and I create a task for each one, and then await using Task.WhenAny(). I am following the pattern described here: Start Multiple Async Tasks and Process Them As They Complete.
I have changed one thing: I am using HashSet<Task> instead of List<Task>. But I notice that all the tasks end-up getting the same id, and thus the HashSet only adds one of them, and hence I end up waiting for only one task.
I have a working example here in dotnetfiddle: https://dotnetfiddle.net/KQN2ow
Also pasting the code below:
using System;
using System.Collections.Generic;
using System.Threading.Tasks;
namespace ReproTasksWithSameId
{
public class Program
{
public static async Task Main(string[] args)
{
List<int> itemIds = new List<int>() { 1, 2, 3, 4 };
await ProcessManyItems(itemIds);
}
private static async Task ProcessManyItems(List<int> itemIds)
{
//
// Create tasks for each item and then wait for them using Task.WhenAny
// Following Task.WhenAny() pattern described here: https://learn.microsoft.com/en-us/dotnet/csharp/programming-guide/concepts/async/start-multiple-async-tasks-and-process-them-as-they-complete
// But replaced List<Task> with HashSet<Task>.
//
HashSet<Task> tasks = new HashSet<Task>();
// We map the task ids to item ids so that we have enough info to log if a task throws an exception.
Dictionary<int, int> taskIdToItemId = new Dictionary<int, int>();
foreach (int itemId in itemIds)
{
Task task = ProcessOneItem(itemId);
Console.WriteLine("Created task with id: {0}", task.Id);
tasks.Add(task);
taskIdToItemId[task.Id] = itemId;
}
// Add a loop to process the tasks one at a time until none remain.
while (tasks.Count > 0)
{
// Identify the first task that completes.
Task task = await Task.WhenAny(tasks);
// Remove the selected task from the list so that we don't
// process it more than once.
tasks.Remove(task);
// Get the item id from our map, so that we can log rich information.
int itemId = taskIdToItemId[task.Id];
try
{
// Await the completed task.
await task; // unwrap exceptions.
Console.WriteLine("Successfully processed task with id: {0}, itemId: {1}", task.Id, itemId);
}
catch (Exception ex)
{
Console.WriteLine("Failed to process task with id: {0}, itemId: {1}. Just logging & eating the exception {2}", task.Id, itemId, ex);
}
}
}
private static async Task ProcessOneItem(int itemId)
{
// Assume this method awaits on some asynchronous IO.
Console.WriteLine("item: {0}", itemId);
}
}
}
The output I get is this:
item: 1
Created task with id: 1
item: 2
Created task with id: 1
item: 3
Created task with id: 1
item: 4
Created task with id: 1
Successfully processed task with id: 1, itemId: 4
So basically the program exits after awaiting just the first task.
Why do multiple short Tasks end up getting the same id? BTW I also tested with a method that returns Task<TResult> instead of Task, and in that case it works fine.
Is there a better approach I can use?
The question's code is synchronous, so there's only one, already completed, task going around: an async method that finishes without ever awaiting anything can hand back the same cached completed Task on every call. async doesn't make something run asynchronously. It's syntactic sugar that allows using await to wait for an already executing asynchronous operation to complete without blocking the calling thread.
As for the documentation example, that's what it is: a documentation example, not a pattern, and certainly not something that can be used in production except for simple cases.
What happens if you can only make 5 requests at a time to avoid flooding your network or CPU? You'd need to download only a fixed number of records for that. What if you need to process the downloaded data? What if the list of URLs comes from another thread?
Those issues are handled by concurrent containers, pub/sub patterns and the purpose-built Dataflow and Channel classes.
Dataflow
The older Dataflow classes take care of buffering input and output and handling worker tasks automatically. The entire download code can be replaced with an ActionBlock:
var client=new HttpClient(....);
//Cancel if the process takes longer than 30 minutes
var cts=new CancellationTokenSource(TimeSpan.FromMinutes(30));
var options=new ExecutionDataflowBlockOptions(){
MaxDegreeOfParallelism=10,
BoundedCapacity=5,
CancellationToken=cts.Token
};
var block=new ActionBlock<string>(url=>ProcessUrl(url,client,cts.Token),options);
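That's it. The block will use up to 10 concurrent tasks to perform up to 10 concurrent downloads. It will hold at most 5 urls at a time (it would buffer everything otherwise). Since that limit also counts the urls currently being processed, BoundedCapacity should be raised to at least MaxDegreeOfParallelism if all 10 workers are meant to be busy. If the input buffer becomes full, sending items to the block will await asynchronously, thus preventing slow downloads from flooding memory with URLs.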
On the same or a different thread, the "publisher" of urls can post as many URLs as it wants, for as long as it wants.
foreach(var url in urls)
{
await block.SendAsync(url);
}
//Tell the block we're done
block.Complete();
//Wait until all downloads are complete
await block.Completion;
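The following sketch checks the throttling claim without touching the network. Task.Delay stands in for ProcessUrl, and the class name and counters are made up for the example. BoundedCapacity is set equal to MaxDegreeOfParallelism here on the assumption that the capacity also counts the items that are currently being processed.

```csharp
using System;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow;

public static class ThrottleDemo
{
    // Returns (itemsProcessed, peakConcurrency).
    public static async Task<(int, int)> RunAsync(int items, int dop)
    {
        int running = 0, peak = 0, done = 0;
        var sync = new object();

        var block = new ActionBlock<int>(async _ =>
        {
            lock (sync) { running++; peak = Math.Max(peak, running); }
            await Task.Delay(20);                // stand-in for a download
            lock (sync) { running--; done++; }
        },
        new ExecutionDataflowBlockOptions
        {
            MaxDegreeOfParallelism = dop,
            BoundedCapacity = dop
        });

        for (int i = 0; i < items; i++)
            await block.SendAsync(i);            // waits while the block is full

        block.Complete();                        // tell the block we're done
        await block.Completion;                  // wait until every item was processed
        return (done, peak);
    }

    public static async Task Main()
    {
        var (done, peak) = await RunAsync(items: 20, dop: 3);
        Console.WriteLine($"processed {done} items, peak concurrency {peak}");
    }
}
```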
We can use other blocks like TransformBlock to produce output, pass it to another block and thus create a concurrent processing pipeline. Let's say we have two methods, DownloadUrlAsync and ParseResponse, instead of just ProcessUrl:
Task<string> DownloadUrlAsync(string url,HttpClient client)
{
return client.GetStringAsync(url);
}
void ParseResponse(string content)
{
var json=JObject.Parse(content);
DoSomethingWith(json);
}
We could create a separate block for each step in the pipeline, with different DOP and buffers :
var dlOptions=new ExecutionDataflowBlockOptions(){
MaxDegreeOfParallelism=5,
BoundedCapacity=5,
CancellationToken=cts.Token
};
var downloader=new TransformBlock<string,string>(
url=>DownloadUrlAsync(url,client),
dlOptions);
var parseOptions = new ExecutionDataflowBlockOptions(){
MaxDegreeOfParallelism=10,
BoundedCapacity=2,
CancellationToken=cts.Token
};
var parser=new ActionBlock<string>(ParseResponse,parseOptions);
downloader.LinkTo(parser, new DataflowLinkOptions{PropagateCompletion=true});
We can post URLs to the downloader now and wait until all of them are parsed. By using different DOP and capacities, we can balance the number of downloader and parser tasks to download as many URLs as we can parse and handle eg slow downloads or big responses.
foreach(var url in urls)
{
await downloader.SendAsync(url);
}
//Tell the block we're done
downloader.Complete();
//Wait until all urls are parsed
await parser.Completion;
Channels
System.Threading.Channels introduces Go-style channels. These are actually lower-level concepts than a Dataflow block. If Channels had been available back in 2012, the Dataflow blocks would have been written using channels.
An equivalent download method would look like this :
ChannelReader<string> Downloader(ChannelReader<string> urls,HttpClient client,
int capacity,CancellationToken token=default)
{
var channel=Channel.CreateBounded<string>(capacity);
var writer=channel.Writer;
_ = Task.Run(async ()=>{
await foreach(var url in urls.ReadAllAsync(token))
{
var response=await client.GetStringAsync(url);
await writer.WriteAsync(response);
}
}).ContinueWith(t=>writer.Complete(t.Exception));
return channel.Reader;
}
That's more verbose but it allows us to do things like create the HttpClient in the method and reuse it. Using a ChannelReader as both input and output may look weird, but now we can chain such methods simply by passing an output reader as input to another method.
The "magic" is that we create a worker task that waits to process messages and return a reader immediately. Whenever a result is produced, it's sent to the channel and the next step in the pipeline.
To use multiple worker tasks, we can use Enumerable.Range to start many of them and use Task.WhenAll to close the channel when all workers are done:
ChannelReader<string> Downloader(ChannelReader<string> urls,HttpClient client,
int capacity,int dop,CancellationToken token=default)
{
var channel=Channel.CreateBounded<string>(capacity);
var writer=channel.Writer;
var tasks = Enumerable
.Range(0,dop)
.Select(_=> Task.Run(async ()=>{
await foreach(var url in urls.ReadAllAsync(token))
{
var response=await client.GetStringAsync(url);
await writer.WriteAsync(response);
}
}));
_=Task.WhenAll(tasks)
.ContinueWith(t=>writer.Complete(t.Exception));
return channel.Reader;
}
Publishers can create their own channel and pass a reader to the Downloader method. They don't need to publish anything in advance either :
var channel=Channel.CreateUnbounded<string>();
var dlReader=Downloader(channel.Reader,client,5,5);
foreach(var url in someUrlList)
{
await channel.Writer.WriteAsync(url);
}
channel.Writer.Complete();
Fluent pipelines
This is so common that it's worth creating an extension method for it. E.g., to convert an IEnumerable<T> to a ChannelReader<T>, we don't need to wait, as all the items are already available:
static ChannelReader<T> Generate<T>(this IEnumerable<T> source)
{
var channel=Channel.CreateUnbounded<T>();
foreach(var item in source)
{
channel.Writer.TryWrite(item);
}
channel.Writer.Complete();
return channel.Reader;
}
If we convert the Downloader to an extension method too, we can use :
var pipeline= someUrls.Generate()
.Downloader(client,5,5);
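Putting the pieces together, a fluent pipeline could look like the sketch below. It is only an illustration under a few assumptions: the Transform stage is a generic, hypothetical variant of the Downloader method, and Task.Delay plus a string length replace the actual HTTP call so that the example runs without a network.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Channels;
using System.Threading.Tasks;

public static class ChannelPipeline
{
    // Source stage: copies an in-memory sequence into an unbounded channel.
    public static ChannelReader<T> Generate<T>(this IEnumerable<T> source)
    {
        var channel = Channel.CreateUnbounded<T>();
        foreach (var item in source) channel.Writer.TryWrite(item);
        channel.Writer.Complete();
        return channel.Reader;
    }

    // Generic stage (hypothetical): `dop` workers read from `input`, apply `work`
    // and write the results to a bounded output channel.
    public static ChannelReader<TOut> Transform<TIn, TOut>(
        this ChannelReader<TIn> input, Func<TIn, Task<TOut>> work,
        int capacity, int dop, CancellationToken token = default)
    {
        var channel = Channel.CreateBounded<TOut>(capacity);
        var writer = channel.Writer;
        var workers = Enumerable.Range(0, dop).Select(_ => Task.Run(async () =>
        {
            await foreach (var item in input.ReadAllAsync(token))
                await writer.WriteAsync(await work(item), token);
        })).ToArray();
        // Close the output once every worker has finished, passing on any failure.
        _ = Task.WhenAll(workers).ContinueWith(t => writer.Complete(t.Exception));
        return channel.Reader;
    }

    public static async Task<List<int>> RunAsync()
    {
        var lengths = new[] { "a", "bb", "ccc" }
            .Generate()
            .Transform(async s => { await Task.Delay(10); return s.Length; },
                       capacity: 2, dop: 2);

        var results = new List<int>();
        await foreach (var n in lengths.ReadAllAsync()) results.Add(n);
        results.Sort();   // the order in which workers finish is not deterministic
        return results;
    }

    public static async Task Main()
    {
        Console.WriteLine(string.Join(", ", await RunAsync()));
    }
}
```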
It's because ProcessOneItem is not actually asynchronous: it is marked async but never awaits anything.
You should see the following warning:
This async method lacks 'await' operators and will run synchronously. Consider using the 'await' operator to await non-blocking API calls, or 'await Task.Run(...)' to do CPU-bound work on a background thread.
Once you add await (...) to ProcessOneItem the return task will have a unique-ish id.
From the documentation of Task.Id property:
Task IDs are assigned on-demand and do not necessarily represent the order in which task instances are created. Note that although collisions are very rare, task identifiers are not guaranteed to be unique.
From what I understand this property is mainly there for debugging purposes. You should probably avoid depending on it for production code.
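One way to avoid Task.Id altogether is to let each task carry its own item id. The sketch below is one possible shape rather than the only one: ProcessAndReport is a hypothetical wrapper, and ProcessOneItem is a stand-in that deliberately fails for item 3 so the error path is visible.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

public static class ItemTracking
{
    // Stand-in for the real work; fails for item 3 to show error reporting.
    static async Task ProcessOneItem(int itemId)
    {
        await Task.Delay(10);
        if (itemId == 3) throw new InvalidOperationException("item 3 failed");
    }

    // Wraps the work so the task's own result says which item it belongs to.
    static async Task<(int ItemId, Exception Error)> ProcessAndReport(int itemId)
    {
        try { await ProcessOneItem(itemId); return (itemId, null); }
        catch (Exception ex) { return (itemId, ex); }
    }

    public static async Task<List<(int ItemId, Exception Error)>> RunAsync(IEnumerable<int> itemIds)
    {
        var pending = itemIds.Select(ProcessAndReport).ToList();
        var finished = new List<(int ItemId, Exception Error)>();
        while (pending.Count > 0)
        {
            var task = await Task.WhenAny(pending);  // first task to complete
            pending.Remove(task);
            finished.Add(await task);                // does not throw: errors travel in the tuple
        }
        return finished;
    }

    public static async Task Main()
    {
        foreach (var (itemId, error) in await RunAsync(new[] { 1, 2, 3, 4 }))
            Console.WriteLine(error == null
                ? $"item {itemId} succeeded"
                : $"item {itemId} failed: {error.Message}");
    }
}
```

No dictionary keyed by task id is needed, and the loop keeps working even if several calls happen to complete synchronously.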
I am trying to understand parallel programming and I would like my async methods to run on multiple threads. I have written something but it does not work like I thought it should.
Code
public static async Task Main(string[] args)
{
var listAfterParallel = RunParallel(); // Running this function to return tasks
await Task.WhenAll(listAfterParallel); // I want the program execution to stop until all tasks are returned or tasks are completed
Console.WriteLine("After Parallel Loop"); // But currently when I run program, after parallel loop command is printed first
Console.ReadLine();
}
public static async Task<ConcurrentBag<string>> RunParallel()
{
var client = new System.Net.Http.HttpClient();
client.DefaultRequestHeaders.Add("Accept", "application/json");
client.BaseAddress = new Uri("https://jsonplaceholder.typicode.com");
var list = new List<int>();
var listResults = new ConcurrentBag<string>();
for (int i = 1; i < 5; i++)
{
list.Add(i);
}
// Parallel for each branch to run await commands on multiple threads.
Parallel.ForEach(list, new ParallelOptions() { MaxDegreeOfParallelism = 2 }, async (index) =>
{
var response = await client.GetAsync("posts/" + index);
var contents = await response.Content.ReadAsStringAsync();
listResults.Add(contents);
Console.WriteLine(contents);
});
return listResults;
}
I would like RunParallel function to complete before "After parallel loop" is printed. Also I want my get posts method to run on multiple threads.
Any help would be appreciated!
What's happening here is that you're never waiting for the work started by Parallel.ForEach to complete; you're just returning the bag that it will eventually fill. The reason is that Parallel.ForEach expects an Action delegate, so your async lambda is compiled as async void rather than async Task. While async void methods are valid, they return to the caller as soon as they await an incomplete Task and finish their work later, typically on a thread pool thread. Parallel.ForEach therefore thinks the handler is done, even though the remaining work is still in flight.
Instead, use a synchronous method here;
Parallel.ForEach(list, new ParallelOptions() { MaxDegreeOfParallelism = 2 }, index =>
{
var response = client.GetAsync("posts/" + index).Result;
var contents = response.Content.ReadAsStringAsync().Result;
listResults.Add(contents);
Console.WriteLine(contents);
});
If you absolutely must use await inside, wrap it in Task.Run(...).GetAwaiter().GetResult():
Parallel.ForEach(list, new ParallelOptions() { MaxDegreeOfParallelism = 2 }, index => Task.Run(async () =>
{
var response = await client.GetAsync("posts/" + index);
var contents = await response.Content.ReadAsStringAsync();
listResults.Add(contents);
Console.WriteLine(contents);
}).GetAwaiter().GetResult());
In this case, however, Task.Run generally moves the work to another thread, so we've subverted most of the control Parallel.ForEach gives us; it's better to use async all the way down:
var tasks = list.Select(async (index) => {
var response = await client.GetAsync("posts/" + index);
var contents = await response.Content.ReadAsStringAsync();
listResults.Add(contents);
Console.WriteLine(contents);
});
await Task.WhenAll(tasks);
Since Select expects a Func<T, TResult>, it will interpret an async lambda with no return value as an async Task method instead of async void, and thus give us something we can explicitly await.
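The Select/WhenAll version starts every request at once, so the MaxDegreeOfParallelism = 2 limit from the question is lost. One common way to get it back is a SemaphoreSlim, which is a technique not used elsewhere in this answer. The helper below is a hypothetical sketch, with Task.Delay standing in for client.GetAsync in the usage part.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

public static class ThrottledRequests
{
    // Runs `work` for every item, with at most `maxConcurrency` operations in flight.
    public static async Task<TResult[]> RunAsync<T, TResult>(
        IEnumerable<T> items, int maxConcurrency, Func<T, Task<TResult>> work)
    {
        using var throttle = new SemaphoreSlim(maxConcurrency);
        var tasks = items.Select(async item =>
        {
            await throttle.WaitAsync();          // wait for a free slot without blocking a thread
            try { return await work(item); }
            finally { throttle.Release(); }      // free the slot even if `work` throws
        }).ToList();
        return await Task.WhenAll(tasks);        // results keep the order of `items`
    }

    public static async Task Main()
    {
        // Task.Delay stands in for client.GetAsync("posts/" + index).
        var results = await RunAsync(new[] { 1, 2, 3, 4 }, 2, async index =>
        {
            await Task.Delay(50);
            return "post " + index;
        });
        Console.WriteLine(string.Join(", ", results));
    }
}
```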
Take a look at this: There Is No Thread
When you are making multiple concurrent web requests it's not your CPU that is doing the hard work. It's the CPU of the web server that is serving your requests. Your CPU is doing nothing during this time. It's not in a special "Wait-state" or something. The hardware inside your box that is working is your network card, that writes data to your RAM. When the response is received then your CPU will be notified about the arrived data, so it can do something with them.
You need parallelism when you have heavy work to do inside your box, not when you want the heavy work to be done by the external world. From the point of view of your CPU, even your hard disk is part of the external world. So everything that applies to web requests, applies also to requests targeting filesystems and databases. These workloads are called I/O bound, to be distinguished from the so called CPU bound workloads.
For I/O bound workloads the tool offered by the .NET platform is the asynchronous Task. There are multiple APIs throughout the libraries that return Task objects. To achieve concurrency you typically start multiple tasks and then await them with Task.WhenAll. There are also more advanced tools like the TPL Dataflow library, which is built on top of Tasks. It offers capabilities like buffering, batching, configuring the maximum degree of concurrency, and much more.
I have some time consuming code in a foreach that uses task/await.
It includes pulling data from the database, generating HTML, POSTing that to an API, and saving the replies to the DB.
A mock-up looks like this
List<label> labels = db.labels.ToList();
foreach (var x in list)
{
var myLabels = labels.Where(q => !db.filter.Where(y => x.userid == y.userid)
.Select(y => y.ID)
.Contains(q.id));
//Render the HTML
//do some fast stuff with objects
List<response> res = await api.sendMessage(object); //POST
//put all the responses in the db
foreach (var r in res)
{
db.responses.add(r);
}
db.SaveChanges();
}
Time wise, generating the Html and posting it to the API seem to be taking most of the time.
Ideally it would be great if I could generate the HTML for the next item, and wait for the post to finish, before posting the next item.
Other ideas are also welcome.
How would one go about this?
I first thought of adding a Task above the foreach and wait for that to finish before making the next POST, but then how do I process the last loop... it feels messy...
You can do it in parallel, but you will need a different context in each Task.
Entity Framework contexts are not thread-safe, so you can't use one context in parallel tasks.
var tasks = myLabels.Select( async label=>{
using(var db = new MyDbContext ()){
// do processing...
var response = await api.getresponse();
db.Responses.Add(response);
await db.SaveChangesAsync();
}
});
await Task.WhenAll(tasks);
In this case, all tasks will appear to run in parallel, and each task will have its own context.
If you don't create a new context per task, you will get the error mentioned in this question: Does Entity Framework support parallel async queries?
It's more an architecture problem than a code issue here, imo.
You could split your work into two separate parts:
Get data from database and generate HTML
Send API request and save response to database
You could run them both in parallel, and use a queue to coordinate that: whenever your HTML is ready it's added to a queue and another worker proceeds from there, taking that HTML and sending to the API.
Both parts can be done in multithreaded way too, e.g. you can process multiple items from the queue at the same time by having a set of workers looking for items to be processed in the queue.
This screams for the producer / consumer pattern: one producer produces data at a speed different from the speed at which the consumer consumes it. Once the producer has nothing more to produce, it notifies the consumer that no more data is expected.
MSDN has a nice example of this pattern where several dataflow blocks are chained together: the output of one block is the input of another block.
Walkthrough: Creating a Dataflow Pipeline
The idea is as follows:
Create a class that will generate the HTML.
This class has an object of class System.Threading.Tasks.Dataflow.BufferBlock<T>
An async procedure creates all HTML output and uses await SendAsync to send the data to the bufferBlock
The buffer block implements interface ISourceBlock<T>. The class exposes this as a get property:
The code:
class MyProducer<T>
{
private System.Threading.Tasks.Dataflow.BufferBlock<T> bufferBlock = new BufferBlock<T>();
public ISourceBlock<T> Output {get {return this.bufferBlock;}}
public async Task ProcessAsync()
{
while (somethingToProduce)
{
T producedData = ProduceOutput(...);
await this.bufferBlock.SendAsync(producedData);
}
// no data to send anymore. Mark the output complete:
this.bufferBlock.Complete();
}
}
A second class takes this ISourceBlock. It will wait at this source block until data arrives and processes it.
do this in an async function
stop when no more data is available
The code:
public class MyConsumer<T>
{
public ISourceBlock<T> Source {get; set;}
public async Task ProcessAsync()
{
while (await this.Source.OutputAvailableAsync())
{ // there is input of type T, read it:
var input = await this.Source.ReceiveAsync();
// process input
}
// if here, no more input expected. finish.
}
}
Now put it together:
private async Task ProduceOutput<T>()
{
var producer = new MyProducer<T>();
var consumer = new MyConsumer<T>() {Source = producer.Output};
var producerTask = Task.Run( () => producer.ProcessAsync());
var consumerTask = Task.Run( () => consumer.ProcessAsync());
// while both tasks are working you can do other things.
// wait until both tasks are finished:
await Task.WhenAll(new Task[] {producerTask, consumerTask});
}
For simplicity I've left out exception handling and cancellation. Stack Overflow has articles about exception handling and cancellation of Tasks:
Keep UI responsive using Tasks, Handle AggregateException
Cancel an Async Task or a List of Tasks
This is what I ended up using: (https://stackoverflow.com/a/25877042/275990)
List<ToSend> sendToAPI = new List<ToSend>();
List<label> labels = db.labels.ToList();
foreach (var x in list) {
var myLabels = labels.Where(q => !db.filter.Where(y => x.userid == y.userid)
.Select(y => y.ID)
.Contains(q.id));
//Render the HTML
//do some fast stuff with objects
sendToAPI.add(the object with HTML);
}
int maxParallelPOSTs=5;
await TaskHelper.ForEachAsync(sendToAPI, maxParallelPOSTs, async i => {
using (NasContext db2 = new NasContext()) {
List<response> res = await api.sendMessage(i.object); //POST
//put all the responses in the db
foreach (var r in res)
{
db2.responses.add(r);
}
db2.SaveChanges();
}
});
public static Task ForEachAsync<T>(this IEnumerable<T> source, int dop, Func<T, Task> body) {
return Task.WhenAll(
from partition in Partitioner.Create(source).GetPartitions(dop)
select Task.Run(async delegate {
using (partition)
while (partition.MoveNext()) {
await body(partition.Current).ContinueWith(t => {
if (t.Exception != null) {
string problem = t.Exception.ToString();
}
//observe exceptions
});
}
}));
}
Basically, this lets me generate the HTML synchronously, which is fine, since it only takes a few seconds to generate thousands of items, while the posting and saving to the DB happen asynchronously, with as many concurrent workers as I predefine. In this case I'm posting to the Mandrill API, where parallel posts are no problem.
Attempting to write a HTML crawler using the Async CTP I have gotten stuck as to how to write a recursion free method for accomplishing this.
This is the code I have so far.
private readonly ConcurrentStack<LinkItem> _LinkStack;
private readonly Int32 _MaxStackSize;
private readonly WebClient client = new WebClient();
Func<string, string, Task<List<LinkItem>>> DownloadFromLink = async (BaseURL, uri) =>
{
string html = await client.DownloadStringTaskAsync(uri);
return LinkFinder.Find(html, BaseURL);
};
Action<LinkItem> DownloadAndPush = async (o) =>
{
List<LinkItem> result = await DownloadFromLink(o.BaseURL, o.Href);
if (this._LinkStack.Count() + result.Count <= this._MaxStackSize)
{
this._LinkStack.PushRange(result.ToArray());
o.Processed = true;
}
};
Parallel.ForEach(this._LinkStack, (o) =>
{
DownloadAndPush(o);
});
But obviously this doesn't work as I would hope, because at the time Parallel.ForEach executes its first (and only) iteration I have only 1 item. The simplest approach I can think of is to make the ForEach recursive, but I can't (I don't think) do this, as I would quickly run out of stack space.
Could anyone please guide me as to how I can restructure this code, to create what I would describe as a recursive continuation that adds items until either the MaxStackSize is reached or the system runs out of memory?
I think the best way to do something like this using C# 5/.Net 4.5 is to use TPL Dataflow. There even is a walkthrough on how to implement a web crawler using it.
Basically, you create one "block" that takes care of downloading one URL and getting the links from it:
var cts = new CancellationTokenSource();
Func<LinkItem, Task<IEnumerable<LinkItem>>> downloadFromLink =
async link =>
{
// WebClient is not guaranteed to be thread-safe,
// so we shouldn't use one shared instance
var client = new WebClient();
string html = await client.DownloadStringTaskAsync(link.Href);
return LinkFinder.Find(html, link.BaseURL);
};
var linkFinderBlock = new TransformManyBlock<LinkItem, LinkItem>(
downloadFromLink,
new ExecutionDataflowBlockOptions
{ MaxDegreeOfParallelism = 4, CancellationToken = cts.Token });
You can set MaxDegreeOfParallelism to any value you want. It specifies the maximum number of URLs that can be downloaded concurrently. If you don't want to limit it at all, you can set it to DataflowBlockOptions.Unbounded.
Then you create one block that processes all the downloaded links somehow, like storing them all in a list. It can also decide when to cancel downloading:
var links = new List<LinkItem>();
var storeBlock = new ActionBlock<LinkItem>(
linkItem =>
{
links.Add(linkItem);
if (links.Count == maxSize)
cts.Cancel();
});
Since we didn't set MaxDegreeOfParallelism, it defaults to 1. That means using a collection that is not thread-safe should be okay here.
We create one more block: it will take a link from linkFinderBlock, and pass it both to storeBlock and back to linkFinderBlock.
var broadcastBlock = new BroadcastBlock<LinkItem>(li => li);
The lambda in its constructor is a "cloning function". You can use it to create a clone of the item if you want to, but it shouldn't be necessary here, since we don't modify the LinkItem after creation.
Now we can connect the blocks together:
linkFinderBlock.LinkTo(broadcastBlock);
broadcastBlock.LinkTo(storeBlock);
broadcastBlock.LinkTo(linkFinderBlock);
Then we can start processing by giving the first item to linkFinderBlock (or broadcastBlock, if you want to also send it to storeBlock):
linkFinderBlock.Post(firstItem);
And finally wait until the processing is complete:
try
{
linkFinderBlock.Completion.Wait();
}
catch (AggregateException ex)
{
if (!(ex.InnerException is TaskCanceledException))
throw;
}