I have a list of 100 URLs and need to fetch the HTML content of each. Let's say I don't use the async version of DownloadString and instead do the following.
var task1 = Task.Factory.StartNew(() => new WebClient().DownloadString("url1"));
What I want to achieve is to get the HTML string for at most 4 URLs at a time.
I start 4 tasks for the first four URLs. Say the 2nd URL completes; I want to immediately start a 5th task for the 5th URL, and so on. This way at most 4 URLs are being downloaded at any time, and for all practical purposes there are always 4 downloads in flight until all 100 are processed.
I can't seem to visualize how I would actually achieve this. There must be an established pattern for doing this. Thoughts?
EDIT:
Following up on @Damien_The_Unbeliever's comment to use Parallel.ForEach, I wrote the following:
var urls = new List<string>();
var results = new Dictionary<string, string>();
var lockObj = new object();
Parallel.ForEach(urls,
new ParallelOptions { MaxDegreeOfParallelism = 4 },
url =>
{
var str = new WebClient().DownloadString(url);
lock (lockObj)
{
results[url] = str;
}
});
I think the above reads better than creating individual tasks and using a semaphore to limit concurrency. That said, having never used Parallel.ForEach, I am unsure whether this correctly does what I need.
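One way to check the behaviour is to count how many loop bodies run at the same moment. The following is only a sketch under assumptions: Thread.Sleep stands in for the real DownloadString call, MeasurePeak is a made-up helper name, and a ConcurrentDictionary is swapped in for the lock-protected Dictionary.

```csharp
using System;
using System.Collections.Concurrent;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

public static class ParallelForEachCheck
{
    // Runs urlCount simulated downloads and returns the highest number
    // that were in flight at the same time.
    public static int MeasurePeak(int urlCount, int maxParallel,
                                  ConcurrentDictionary<string, string> results)
    {
        int running = 0, peak = 0;
        var gate = new object();
        var urls = Enumerable.Range(1, urlCount).Select(i => "url" + i).ToList();

        Parallel.ForEach(urls,
            new ParallelOptions { MaxDegreeOfParallelism = maxParallel },
            url =>
            {
                lock (gate) { running++; if (running > peak) peak = running; }
                Thread.Sleep(20); // assumption: stand-in for new WebClient().DownloadString(url)
                results[url] = "<html>" + url + "</html>";
                lock (gate) { running--; }
            });

        return peak;
    }

    public static void Main()
    {
        var results = new ConcurrentDictionary<string, string>();
        int peak = MeasurePeak(100, 4, results);
        Console.WriteLine(results.Count + " results, peak concurrency " + peak);
    }
}
```

The peak should never exceed MaxDegreeOfParallelism, although Parallel.ForEach is free to use fewer workers than the maximum.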
SemaphoreSlim sem = new SemaphoreSlim(4);
foreach (var url in urls)
{
sem.Wait();
Task.Factory.StartNew(() => new WebClient().DownloadString(url))
.ContinueWith(t => sem.Release());
}
Actually, Task.WaitAny is much better for what you're trying to achieve than ContinueWith:
int tasksPerformedCount = 0;
Task[] tasks = //initial 4 tasks
while(tasksPerformedCount< 100)
{
//returns the index of the first task to complete, as soon as it completes
int index = Task.WaitAny(tasks);
tasksPerformedCount++;
//replace it with a new one
tasks[index] = //new task
}
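To make the replace-on-completion loop concrete, here is a minimal runnable sketch. It assumes a hypothetical Fetch method that simulates the download with Thread.Sleep, and it uses a list of running tasks rather than a fixed array so that the tail end (fewer than 4 URLs left) needs no special handling.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

public static class SlidingWindowDemo
{
    private static readonly object Gate = new object();
    private static int _running;                  // downloads in flight right now
    public static int MaxObserved { get; private set; }

    // Hypothetical stand-in for new WebClient().DownloadString(url).
    private static string Fetch(string url)
    {
        lock (Gate) { _running++; if (_running > MaxObserved) MaxObserved = _running; }
        Thread.Sleep(20);
        lock (Gate) { _running--; }
        return "<html>" + url + "</html>";
    }

    public static Dictionary<string, string> DownloadAll(IEnumerable<string> urls, int maxConcurrent)
    {
        var results = new Dictionary<string, string>();
        var pending = new Queue<string>(urls);
        var running = new List<Task<KeyValuePair<string, string>>>();

        while (pending.Count > 0 || running.Count > 0)
        {
            // Top up the window until maxConcurrent tasks are running.
            while (pending.Count > 0 && running.Count < maxConcurrent)
            {
                string url = pending.Dequeue();
                running.Add(Task.Run(() => new KeyValuePair<string, string>(url, Fetch(url))));
            }

            // Block until any one task finishes, then free its slot.
            int index = Task.WaitAny(running.ToArray());
            var finished = running[index];
            running.RemoveAt(index);
            results[finished.Result.Key] = finished.Result.Value;
        }
        return results;
    }

    public static void Main()
    {
        var urls = Enumerable.Range(1, 100).Select(i => "url" + i);
        var results = DownloadAll(urls, 4);
        Console.WriteLine(results.Count + " pages, peak concurrency " + MaxObserved);
    }
}
```

Because only the calling thread touches the results dictionary, no lock is needed around it.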
Edit:
Another example of Task.WaitAny from http://www.amazon.co.uk/Exam-Ref-70-483-Programming-In/dp/0735676828/ref=sr_1_1?ie=UTF8&qid=1378105711&sr=8-1&keywords=exam+ref+70-483+programming+in+c
namespace Chapter1 {
public static class Program {
public static void Main() {
Task<int>[] tasks = new Task<int>[3];
tasks[0] = Task.Run(() => { Thread.Sleep(2000); return 1; });
tasks[1] = Task.Run(() => { Thread.Sleep(1000); return 2; });
tasks[2] = Task.Run(() => { Thread.Sleep(3000); return 3; });
while (tasks.Length > 0)
{
int i = Task.WaitAny(tasks);
Task<int> completedTask = tasks[i];
Console.WriteLine(completedTask.Result);
var temp = tasks.ToList();
temp.RemoveAt(i);
tasks = temp.ToArray();
}
}
}
}
I am using the HttpClient in System.Net.Http to make requests against an API. The API is limited to 10 requests per second.
My code is roughly like so:
List<Task> taskList = items.Select(i => ProcessItem(i)).ToList();
try
{
await Task.WhenAll(taskList.ToArray());
}
catch (Exception ex)
{
}
The ProcessItem method does a few things but always calls the API using await SendRequestAsync(..blah), which looks like:
private async Task<Response> SendRequestAsync(HttpRequestMessage request, CancellationToken token)
{
token.ThrowIfCancellationRequested();
var response = await HttpClient
.SendAsync(request: request, cancellationToken: token).ConfigureAwait(continueOnCapturedContext: false);
token.ThrowIfCancellationRequested();
return await Response.BuildResponse(response);
}
Originally the code worked fine, but when I started using Task.WhenAll I started getting 'Rate Limit Exceeded' messages from the API. How can I limit the rate at which requests are made?
It's worth noting that ProcessItem can make between 1 and 4 API calls depending on the item.
The API is limited to 10 requests per second.
Then just have your code do a batch of 10 requests, ensuring they take at least one second:
Items[] items = ...;
int index = 0;
while (index < items.Length)
{
var timer = Task.Delay(TimeSpan.FromSeconds(1.2)); // ".2" to make sure
var tasks = items.Skip(index).Take(10).Select(i => ProcessItemsAsync(i));
var tasksAndTimer = tasks.Concat(new[] { timer });
await Task.WhenAll(tasksAndTimer);
index += 10;
}
Update
My ProcessItems method makes 1-4 API calls depending on the item.
In this case, batching is not an appropriate solution. You need to limit the number of calls to an asynchronous method, which implies a SemaphoreSlim. The tricky part is that you want to allow more calls over time.
I haven't tried this code, but the general idea I would go with is to have a periodic function that releases the semaphore up to 10 times. So, something like this:
// maxCount must be set to 10; without it Release never throws SemaphoreFullException.
private readonly SemaphoreSlim _semaphore = new SemaphoreSlim(10, 10);
private async Task<Response> ThrottledSendRequestAsync(HttpRequestMessage request, CancellationToken token)
{
await _semaphore.WaitAsync(token);
return await SendRequestAsync(request, token);
}
private async Task PeriodicallyReleaseAsync(Task stop)
{
while (true)
{
var timer = Task.Delay(TimeSpan.FromSeconds(1.2));
if (await Task.WhenAny(timer, stop) == stop)
return;
// Release the semaphore at most 10 times.
for (int i = 0; i != 10; ++i)
{
try
{
_semaphore.Release();
}
catch (SemaphoreFullException)
{
break;
}
}
}
}
Usage:
// Start the periodic task, with a signal that we can use to stop it.
var stop = new TaskCompletionSource<object>();
var periodicTask = PeriodicallyReleaseAsync(stop.Task);
// Wait for all item processing.
await Task.WhenAll(taskList);
// Stop the periodic task.
stop.SetResult(null);
await periodicTask;
The answer is similar to this one.
Instead of using a list of tasks and WhenAll, use Parallel.ForEach with ParallelOptions to limit the number of concurrent tasks to 10, and make sure each one takes at least 1 second. The loop body has to block synchronously, because Parallel.ForEach does not wait for an async lambda to finish:
Parallel.ForEach(
items,
new ParallelOptions { MaxDegreeOfParallelism = 10 },
item => {
ProcessItems(item);
Thread.Sleep(1000); // block this worker for the pacing delay
}
);
Or if you want to make sure each item takes as close to 1 second as possible:
Parallel.ForEach(
items,
new ParallelOptions { MaxDegreeOfParallelism = 10 },
item => {
var watch = Stopwatch.StartNew();
ProcessItems(item);
watch.Stop();
// Sleep only for whatever is left of the 1-second slot.
if (watch.ElapsedMilliseconds < 1000) Thread.Sleep((int)(1000 - watch.ElapsedMilliseconds));
}
);
Or:
Parallel.ForEach(
items,
new ParallelOptions { MaxDegreeOfParallelism = 10 },
item => {
// Returns once both the 1-second delay and the work have finished.
Task.WaitAll(
Task.Delay(1000),
Task.Run(() => { ProcessItems(item); })
);
}
);
UPDATED ANSWER
My ProcessItems method makes 1-4 API calls depending on the item. So with a batch size of 10 I still exceed the rate limit.
You need to implement a rolling window in SendRequestAsync. A queue containing timestamps of each request is a suitable data structure. You dequeue entries with a timestamp older than 10 seconds. As it so happens, there is an implementation as an answer to a similar question on SO.
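A rough sketch of the rolling-window idea, not the linked implementation: a queue holds the timestamp of each permitted request, entries older than the window are dropped, and a caller waits when the queue is full. RollingWindowLimiter and WaitAsync are names made up for this example, and the limit and window are parameters so they can be set to whatever the API actually enforces.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Threading;
using System.Threading.Tasks;

public sealed class RollingWindowLimiter
{
    private readonly int _limit;
    private readonly TimeSpan _window;
    private readonly Queue<TimeSpan> _stamps = new Queue<TimeSpan>();
    private readonly SemaphoreSlim _gate = new SemaphoreSlim(1, 1); // one caller at a time
    private readonly Stopwatch _clock = Stopwatch.StartNew();

    public RollingWindowLimiter(int limit, TimeSpan window)
    {
        _limit = limit;
        _window = window;
    }

    // Completes when the caller is allowed to send one request.
    public async Task WaitAsync(CancellationToken token = default(CancellationToken))
    {
        await _gate.WaitAsync(token).ConfigureAwait(false);
        try
        {
            while (true)
            {
                TimeSpan now = _clock.Elapsed;

                // Forget requests that have fallen out of the window.
                while (_stamps.Count > 0 && now - _stamps.Peek() >= _window)
                    _stamps.Dequeue();

                if (_stamps.Count < _limit)
                {
                    _stamps.Enqueue(now);
                    return;
                }

                // Window is full: wait until the oldest entry expires, then re-check.
                TimeSpan wait = _stamps.Peek() + _window - now;
                await Task.Delay(wait, token).ConfigureAwait(false);
            }
        }
        finally
        {
            _gate.Release();
        }
    }
}

public static class RollingWindowDemo
{
    public static async Task Main()
    {
        var limiter = new RollingWindowLimiter(10, TimeSpan.FromSeconds(1));
        var sw = Stopwatch.StartNew();
        for (int i = 0; i < 25; i++)
        {
            await limiter.WaitAsync();
            // The real request (e.g. SendRequestAsync) would be sent here.
        }
        Console.WriteLine("25 permits took " + sw.ElapsedMilliseconds + " ms");
    }
}
```

In the question's code, the natural place for the call would be the first line of SendRequestAsync, so that every API call is counted no matter how many calls a single item makes.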
ORIGINAL ANSWER
May still be useful to others
One straightforward way to handle this is to batch your requests in groups of 10, run those concurrently, and then wait until a total of 10 seconds has elapsed (if it hasn't already). This will bring you in right at the rate limit if the batch of requests can complete in 10 seconds, but is less than optimal if the batch of requests takes longer. Have a look at the .Batch() extension method in MoreLinq. Code would look approximately like
foreach (var taskList in tasks.Batch(10))
{
Stopwatch sw = Stopwatch.StartNew(); // From System.Diagnostics
await Task.WhenAll(taskList.ToArray());
if (sw.Elapsed.TotalSeconds < 10.0)
{
// Calculate how long you still have to wait and sleep that long
// You might want to wait 10.5 or 11 seconds just in case the rate
// limiting on the other side isn't perfectly implemented
}
}
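The wait left as a comment above could be filled in roughly as follows. This is a sketch that chunks with Skip/Take instead of MoreLinq's Batch; RunInBatchesAsync is a made-up helper name, Task.Delay(50) stands in for the real per-item work, and the batch size and minimum batch time are parameters because they depend on the API's actual limit.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;
using System.Threading.Tasks;

public static class BatchPacer
{
    // Runs the work in batches; each batch occupies at least minBatchTime.
    public static async Task RunInBatchesAsync<T>(
        IReadOnlyList<T> items, int batchSize, TimeSpan minBatchTime, Func<T, Task> work)
    {
        for (int start = 0; start < items.Count; start += batchSize)
        {
            Stopwatch sw = Stopwatch.StartNew();

            // Start the whole batch, then wait for every task in it.
            List<Task> batch = items.Skip(start).Take(batchSize).Select(work).ToList();
            await Task.WhenAll(batch);

            // Wait out whatever is left of the minimum batch time.
            TimeSpan remaining = minBatchTime - sw.Elapsed;
            if (remaining > TimeSpan.Zero)
                await Task.Delay(remaining);
        }
    }

    public static async Task Main()
    {
        var items = Enumerable.Range(1, 25).ToList();
        var total = Stopwatch.StartNew();
        await RunInBatchesAsync(items, 10, TimeSpan.FromSeconds(1),
            i => Task.Delay(50)); // assumption: stand-in for ProcessItem(i)
        Console.WriteLine("Finished in " + total.ElapsedMilliseconds + " ms");
    }
}
```

As the answer notes, this only holds the rate when each item makes a single API call.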
https://github.com/thomhurst/EnumerableAsyncProcessor
I've written a library to help with this sort of logic.
Usage would be:
var responses = await AsyncProcessorBuilder.WithItems(items) // Or Extension Method: items.ToAsyncProcessorBuilder()
.SelectAsync(item => ProcessItem(item), CancellationToken.None)
.ProcessInParallel(levelOfParallelism: 10, TimeSpan.FromSeconds(1));
The scenario is something like this: I have 4 specific URLs in hand, and each URL's page contains many links to other web pages from which I need to extract some information. I'm planning to use nested tasks for this job, that is, multiple tasks inside one task, something like the code below.
var t1Actions = new List<Action>();
var t1 = Task.Factory.StartNew(() =>
{
foreach (var action in t1Actions)
{
Task.Factory.StartNew(action, TaskCreationOptions.AttachedToParent);
}
});
var t2Actions = new List<Action>();
var t2 = Task.Factory.StartNew(() =>
{
foreach (var action in t2Actions)
{
Task.Factory.StartNew(action, TaskCreationOptions.AttachedToParent);
}
});
var t3Actions = new List<Action>();
var t3 = Task.Factory.StartNew(() =>
{
foreach (var action in t3Actions)
{
Task.Factory.StartNew(action, TaskCreationOptions.AttachedToParent);
}
});
var t4Actions = new List<Action>();
var t4 = Task.Factory.StartNew(() =>
{
foreach (var action in t4Actions)
{
Task.Factory.StartNew(action, TaskCreationOptions.AttachedToParent);
}
});
Task.WhenAll(t1, t2, t3, t4);
Here are my questions:
Is this a good way to do jobs like the one I described above?
Which is more efficient: replacing the child tasks with Parallel.Invoke(action), or leaving it as it is?
How should I notify (for example, the UI) when a nested task has completed? Do I have control over nested tasks?
Any advice will be helpful.
The actual problem isn't how to handle child tasks. It's how to get a list of URLs from some directory pages, retrieve those pages and process them.
This can be done easily using .NET's Dataflow library. Each step can be implemented as a block that reads one URL and produces an output.
The first block can be a TransformManyBlock that accepts one page URL and returns a list of page URLs.
The second block can be a TransformBlock that accepts a single page URL and returns its contents.
The third block can be an ActionBlock that accepts the page and does whatever is needed with it.
For example:
var listBlock = new TransformManyBlock<Uri,Uri>(async uri=>
{
var content=await httpClient.GetStringAsync(uri);
var uris=ProcessThePage(content);
return uris;
});
var downloadBlock = new TransformBlock<Uri,(Uri,string)>(async uri=>
{
var content=await httpClient.GetStringAsync(uri);
return (uri,content);
});
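var processingBlock = new ActionBlock<(Uri uri,string content)>(msg=>
{
//Do something with the page, e.g. save it to disk.
//PathFromUri is a placeholder for whatever maps a URI to a file path.
var path=PathFromUri(msg.uri);
File.WriteAllText(path,msg.content);
});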
var linkOptions=new DataflowLinkOptions{PropagateCompletion=true};
listBlock.LinkTo(downloadBlock,linkOptions);
downloadBlock.LinkTo(processingBlock,linkOptions);
Each block runs using its own Task. You can specify that a block may use more than one task, e.g. to download multiple pages concurrently.
Each block has an input and output buffer. You can specify a limit to the input buffer to avoid flooding a block with too many messages to process. If a block reaches the limit upstream blocks will pause. This way, you could prevent eg the downloadBlock from flooding a slow processingBlock with thousands of pages.
Once you have a pipeline, you can post messages to the first block. When you're done, you can tell the block to Complete(). Each block in the pipeline will finish processing messages in its input buffer and propagate the completion call to the next linked block.
You can wait for all messages to finish by awaiting the last block's Completion task.
var directoryPages=new Uri[]{..};
foreach(var uri in directoryPages)
{
listBlock.Post(uri);
}
listBlock.Complete();
await processingBlock.Completion;
ExecutionDataflowBlockOptions can be used to specify the use of multiple tasks and the input buffer limits, e.g.:
var options=new ExecutionDataflowBlockOptions
{
BoundedCapacity=10,
MaxDegreeOfParallelism=4,
};
var downloadBlock = new TransformBlock<Uri,(Uri,string)>(...,options);
This means that downloadBlock will accept up to 10 URIs before signalling the listBlock to pause. It will process up to 4 URIs concurrently.
I'm making an app that shows data collected from the web in a Windows Forms application. Today I have to wait for all the data to download sequentially before showing it. How can I download in parallel with a limited queue (a maximum number of concurrently executing tasks) and refresh a DataGridView as results arrive?
What I have today is this method:
internal async Task<string> RequestDataAsync(string uri)
{
var wb = new System.Net.WebClient(); //
var sourceAsync = wb.DownloadStringTaskAsync(uri);
string data = await sourceAsync;
return data;
}
which I call in a foreach(). After the loop ends, I parse the data into a list of custom objects, convert that list to a DataTable, and bind the DataGridView to it.
I'm not sure whether the best way is to use LimitedConcurrencyLevelTaskScheduler from the example at https://msdn.microsoft.com/library/system.threading.tasks.taskscheduler.aspx (I'm not sure how it could report to the grid each time a resource is downloaded) or whether there is a better way.
I don't want to start all the tasks at the same time, because sometimes I have to request 100 downloads at once, and I would like at most, say, 10 tasks to execute at the same time.
I know this question involves both controlling concurrent tasks and reporting progress during the download, but I'm not sure what the best approach is nowadays.
I don't often recommend my book, but I think it would help you.
Concurrent asynchrony is done via Task.WhenAll (recipe 2.4 in my book):
List<string> uris = ...;
var tasks = uris.Select(uri => RequestDataAsync(uri));
string[] results = await Task.WhenAll(tasks);
To limit concurrency, use a SemaphoreSlim (recipe 11.5 in my book):
List<string> uris = ...;
var semaphore = new SemaphoreSlim(10);
var tasks = uris.Select(async uri =>
{
await semaphore.WaitAsync();
try { return await RequestDataAsync(uri); }
finally { semaphore.Release(); }
});
string[] results = await Task.WhenAll(tasks);
To process data as it arrives, introduce another async method (recipe 2.6 in my book):
List<string> uris = ...;
var semaphore = new SemaphoreSlim(10);
var tasks = uris.Select(async uri =>
{
await semaphore.WaitAsync();
try { await RequestAndProcessDataAsync(uri); }
finally { semaphore.Release(); }
});
await Task.WhenAll(tasks);
async Task RequestAndProcessDataAsync(string uri)
{
var data = await RequestDataAsync(uri);
var myObject = Parse(data);
_listBoundToDataTable.Add(myObject);
}
I have an application where I have 1000+ small parts of one large file.
I have to upload a maximum of 16 parts at a time.
I used the Task Parallel Library (TPL) of .NET.
I used Parallel.For to divide the work into parts, assigned one method to be executed for each part, and set MaxDegreeOfParallelism to 16.
I need to execute one method with the checksum values generated by the individual part uploads, so I need some mechanism to wait for all part uploads (say 1000) to complete.
The issue I am facing with the TPL is that it executes the parts in what looks like random order, picking any 16 of the 1000.
I want a mechanism that runs the first 16 parts initially and, as soon as any of those 16 completes, starts the 17th part, and so on.
How can I achieve this?
One possible candidate for this is TPL Dataflow. This demonstration takes a stream of integers and prints them to the console. Set MaxDegreeOfParallelism to the number of items you want processed in parallel:
void Main()
{
var actionBlock = new ActionBlock<int>(
i => Console.WriteLine(i),
new ExecutionDataflowBlockOptions {MaxDegreeOfParallelism = 16});
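foreach (var i in Enumerable.Range(0, 200))
{
actionBlock.Post(i);
}
// Signal that no more items are coming, then wait for the block to drain.
actionBlock.Complete();
actionBlock.Completion.Wait();
}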
This can also scale well if you want to have multiple producer/consumers.
Here is the manual way of doing this.
You need a queue. The queue is a sequence of pending tasks. You dequeue items and put them into a list of working tasks. Whenever a task is done, remove it from the list of working tasks and take another from the queue. The main thread controls this process. Here is a sample of how to do this.
For the test I used a list of integers, but it should work for other types because it uses generics.
private static void Main()
{
Random r = new Random();
var items = Enumerable.Range(0, 100).Select(x => r.Next(100, 200)).ToList();
ParallelQueue(items, DoWork);
}
private static void ParallelQueue<T>(List<T> items, Action<T> action)
{
Queue<T> pending = new Queue<T>(items);
List<Task> working = new List<Task>();
while (pending.Count + working.Count != 0)
{
if (pending.Count != 0 && working.Count < 16) // Maximum tasks
{
var item = pending.Dequeue(); // get item from queue
working.Add(Task.Run(() => action((T)item))); // run task
}
else
{
Task.WaitAny(working.ToArray());
working.RemoveAll(x => x.IsCompleted); // remove finished tasks
}
}
}
private static void DoWork(int i) // do your work here.
{
// this is just an example
Task.Delay(i).Wait();
Console.WriteLine(i);
}
Please let me know if you have trouble implementing DoWork yourself, because if you change the method signature you may need to make some other changes.
Update
You can also do this with async await without blocking the main thread.
private static void Main()
{
Random r = new Random();
var items = Enumerable.Range(0, 100).Select(x => r.Next(100, 200)).ToList();
Task t = ParallelQueue(items, DoWork);
// able to do other things.
t.Wait();
}
private static async Task ParallelQueue<T>(List<T> items, Func<T, Task> func)
{
Queue<T> pending = new Queue<T>(items);
List<Task> working = new List<Task>();
while (pending.Count + working.Count != 0)
{
if (working.Count < 16 && pending.Count != 0)
{
var item = pending.Dequeue();
working.Add(Task.Run(async () => await func((T)item)));
}
else
{
await Task.WhenAny(working);
working.RemoveAll(x => x.IsCompleted);
}
}
}
private static async Task DoWork(int i)
{
await Task.Delay(i);
}
var workitems = ... /*e.g. Enumerable.Range(0, 1000000)*/;
SingleItemPartitioner.Create(workitems)
.AsParallel()
.AsOrdered()
.WithDegreeOfParallelism(16)
.WithMergeOptions(ParallelMergeOptions.NotBuffered)
.ForAll(i => { Thread.Sleep(1000); Console.WriteLine(i); });
This should be all you need. I forgot how the methods are named exactly... Look at the documentation.
Test this by printing to the console after sleeping for 1sec (which this sample code does).
Another option would be to use a BlockingCollection<T> as a queue between your file reader thread and your 16 uploader threads. Each uploader thread would just loop around consuming the blocking collection until it is complete.
And, if you want to limit memory consumption in the queue you can set an upper limit on the blocking collection such that the file-reader thread will pause when the buffer has reached capacity. This is particularly useful in a server environment where you may need to limit memory used per user/API call.
// Create a buffer of 4 chunks between the file reader and the senders
BlockingCollection<Chunk> queue = new BlockingCollection<Chunk>(4);
// Create a cancellation token source so you can stop this gracefully
CancellationTokenSource cts = ...
File reader thread
...
queue.Add(chunk, cts.Token);
...
queue.CompleteAdding();
Sending threads
for(int i = 0; i < 16; i++)
{
Task.Run(() => {
foreach (var chunk in queue.GetConsumingEnumerable(cts.Token))
{
// ... do the upload
}
});
}
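Put together, the fragments above might look like the following sketch. It makes several assumptions: plain integers stand in for the Chunk type, Thread.Sleep stands in for the upload, and the producer runs on the calling thread instead of a dedicated file reader thread. UploadPipeline and Run are names made up for this example.

```csharp
using System;
using System.Collections.Concurrent;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

public static class UploadPipeline
{
    // Pushes partCount part numbers through a bounded queue to uploaderCount
    // workers and returns how many parts were "uploaded".
    public static int Run(int partCount, int uploaderCount, int bufferSize)
    {
        int uploaded = 0;
        using (var cts = new CancellationTokenSource())
        using (var queue = new BlockingCollection<int>(bufferSize))
        {
            // Each uploader drains the queue until CompleteAdding has been
            // called and the queue is empty.
            Task[] uploaders = Enumerable.Range(0, uploaderCount)
                .Select(_ => Task.Factory.StartNew(() =>
                {
                    foreach (int part in queue.GetConsumingEnumerable(cts.Token))
                    {
                        Thread.Sleep(5); // assumption: stand-in for the real upload
                        Interlocked.Increment(ref uploaded);
                    }
                }, TaskCreationOptions.LongRunning))
                .ToArray();

            // Producer: Add blocks whenever the buffer already holds bufferSize items.
            for (int part = 0; part < partCount; part++)
                queue.Add(part, cts.Token);
            queue.CompleteAdding();

            // Every part has been handled once all uploaders have returned.
            Task.WaitAll(uploaders);
        }
        return uploaded;
    }

    public static void Main()
    {
        int count = Run(1000, 16, 4);
        Console.WriteLine(count + " parts uploaded");
    }
}
```

Task.WaitAll is the point where all uploads are known to be complete, so a step that combines the per-part checksums could run right after it.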
Attempting to write an HTML crawler using the Async CTP, I have gotten stuck on how to write a recursion-free method for accomplishing this.
This is the code I have so far.
private readonly ConcurrentStack<LinkItem> _LinkStack;
private readonly Int32 _MaxStackSize;
private readonly WebClient client = new WebClient();
Func<string, string, Task<List<LinkItem>>> DownloadFromLink = async (BaseURL, uri) =>
{
string html = await client.DownloadStringTaskAsync(uri);
return LinkFinder.Find(html, BaseURL);
};
Action<LinkItem> DownloadAndPush = async (o) =>
{
List<LinkItem> result = await DownloadFromLink(o.BaseURL, o.Href);
if (this._LinkStack.Count() + result.Count <= this._MaxStackSize)
{
this._LinkStack.PushRange(result.ToArray());
o.Processed = true;
}
};
Parallel.ForEach(this._LinkStack, (o) =>
{
DownloadAndPush(o);
});
But obviously this doesn't work as I would hope, because at the time Parallel.ForEach executes the first (and only) iteration I have only 1 item. The simplest approach I can think of is to make the ForEach recursive, but I can't (I don't think) do this as I would quickly run out of stack space.
Could anyone please guide me as to how I can restructure this code, to create what I would describe as a recursive continuation that adds items until either the MaxStackSize is reached or the system runs out of memory?
I think the best way to do something like this using C# 5/.NET 4.5 is to use TPL Dataflow. There is even a walkthrough on how to implement a web crawler using it.
Basically, you create one "block" that takes care of downloading one URL and getting the links from it:
var cts = new CancellationTokenSource();
Func<LinkItem, Task<IEnumerable<LinkItem>>> downloadFromLink =
async link =>
{
// WebClient is not guaranteed to be thread-safe,
// so we shouldn't use one shared instance
var client = new WebClient();
string html = await client.DownloadStringTaskAsync(link.Href);
return LinkFinder.Find(html, link.BaseURL);
};
var linkFinderBlock = new TransformManyBlock<LinkItem, LinkItem>(
downloadFromLink,
new ExecutionDataflowBlockOptions
{ MaxDegreeOfParallelism = 4, CancellationToken = cts.Token });
You can set MaxDegreeOfParallelism to any value you want. It says at most how many URLs can be downloaded concurrently. If you don't want to limit it at all, you can set it to DataflowBlockOptions.Unbounded.
Then you create one block that processes all the downloaded links somehow, like storing them all in a list. It can also decide when to cancel downloading:
var links = new List<LinkItem>();
var storeBlock = new ActionBlock<LinkItem>(
linkItem =>
{
links.Add(linkItem);
if (links.Count == maxSize)
cts.Cancel();
});
Since we didn't set MaxDegreeOfParallelism, it defaults to 1. That means using a collection that is not thread-safe should be okay here.
We create one more block: it will take a link from linkFinderBlock, and pass it both to storeBlock and back to linkFinderBlock.
var broadcastBlock = new BroadcastBlock<LinkItem>(li => li);
The lambda in its constructor is a "cloning function". You can use it to create a clone of the item if you want to, but it shouldn't be necessary here, since we don't modify the LinkItem after creation.
Now we can connect the blocks together:
linkFinderBlock.LinkTo(broadcastBlock);
broadcastBlock.LinkTo(storeBlock);
broadcastBlock.LinkTo(linkFinderBlock);
Then we can start processing by giving the first item to linkFinderBlock (or broadcastBlock, if you want to also send it to storeBlock):
linkFinderBlock.Post(firstItem);
And finally wait until the processing is complete:
try
{
linkFinderBlock.Completion.Wait();
}
catch (AggregateException ex)
{
if (!(ex.InnerException is TaskCanceledException))
throw;
}