How to make WebClient wait until previous download is finished? - c#

I'm using the DownloadFileTaskAsync method to download files. However, when I execute it in a loop I get an exception telling me that concurrent operations are not supported. I tried to fix it like this:
public async Task<string> Download(string uri, string path)
{
if (uri == null) return null;
//manually wait for previous task to complete
while (Client.IsBusy)
{
await Task.Delay(10);
}
await Client.DownloadFileTaskAsync(new Uri(uri), path);
return path;
}
Sometimes it works, when the number of iterations is small (1-5), but when it runs 10 or more times I get this error.
Client here is a WebClient that I create once. I don't create a new client on every iteration because of the overhead.
Back to what I was saying: how do I make WebClient wait until the previous download finishes? A related question is why the IsBusy check works for a small number of downloads.
The code I'm using:
public IEnumerable<Task<string>> GetPathById(IEnumerable<Photo> photos)
{
return photos?.Select(
async photo =>
{
var path = await Download(Uri, Path);
return path;
});
}
I want to download many files without blocking my UI thread. Maybe there is another way to do this?

A lot of the code needed to help you is missing, so I wrote this quick example to show what you might want to try. It's in .NET Core, but it's essentially the same; just swap HttpClient for WebClient.
static void Main(string[] args)
{
Task.Run(async () =>
{
var toDownload = new string[] { "http://google.com", "http://microsoft.com", "http://apple.com" };
var client = new HttpClient();
var downloadedItems = await DownloadItems(client, toDownload);
Console.WriteLine("This is async");
foreach (var item in downloadedItems)
{
Console.WriteLine(item);
}
Console.ReadLine();
}).Wait();
}
static async Task<IEnumerable<string>> DownloadItems(HttpClient client, string[] uris)
{
// This sets up each page to be downloaded using the same HttpClient.
var items = new List<string>();
foreach (var uri in uris)
{
var item = await Download(client, uri);
items.Add(item);
}
return items;
}
static async Task<string> Download(HttpClient client, string uri)
{
// This downloads the page and returns the content.
if (string.IsNullOrEmpty(uri)) return null;
var content = await client.GetStringAsync(uri);
return content;
}
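If the goal is specifically to keep one shared client and let each download wait for the previous one, a SemaphoreSlim with a single slot is one option instead of polling IsBusy. The following is only a minimal sketch: SerialGate and RunAsync are hypothetical names that do not appear in the question, and the operation passed in stands for whatever call must not overlap.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper (not part of the original code): lets only one
// asynchronous operation run at a time.
public sealed class SerialGate
{
    // One slot: a second caller waits asynchronously until the first releases it.
    private readonly SemaphoreSlim _gate = new SemaphoreSlim(1, 1);

    public async Task<T> RunAsync<T>(Func<Task<T>> operation)
    {
        await _gate.WaitAsync();
        try
        {
            return await operation();
        }
        finally
        {
            // Always release, even if the operation throws.
            _gate.Release();
        }
    }
}
```

With the question's names, the call site might look like `await gate.RunAsync(async () => { await Client.DownloadFileTaskAsync(new Uri(uri), path); return path; });`, so the IsBusy loop is no longer needed. Unlike a check-then-act loop, the wait and the start of the operation cannot be interleaved by another caller.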

Related

SemaphoreSlim vs Parallel Processing for API post calls in .NET

I have a cron job which queries a database table and gets about half a million records back. I need to loop through all of that data and send API POSTs to a third-party API. In general this works fine, but the processing takes forever (10 hours). I need a way to speed it up. I've been trying to use a list of Task with SemaphoreSlim, but I'm running into issues (it doesn't like that my API call returns a Task). I'm wondering if anyone has a solution to this that won't destroy the VM's memory?
Current code looks something like:
foreach(var data in dataList)
{
try
{
var response = await _apiService.PostData(data);
_logger.Trace(response.Message);
}
catch
{
// exception handling omitted in the original
}
}
But I'm trying to do this and getting the syntax wrong:
var tasks = new List<Task<DataObj>>();
var throttler = new SemaphoreSlim(10);
foreach(var data in dataList)
{
await throttler.WaitAsync();
tasks.Add(Task.Run(async () => {
try
{
var response = await _apiService.PostData(data);
_logger.Trace(response.Message);
}
finally
{
throttler.Release();
}
}));
}
Your list is of type Task<DataObj>, but your async lambda doesn't return anything, so its return type is Task. To fix the syntax, just return the value:
var response = await _apiService.PostData(data);
_logger.Trace(response.Message);
return response;
As others have noted in the comments, I also recommend not using Task.Run here. A local async method would work fine:
var tasks = new List<Task<DataObj>>();
var throttler = new SemaphoreSlim(10);
foreach(var data in dataList)
{
tasks.Add(ThrottledPostData(data));
}
var results = await Task.WhenAll(tasks);
async Task<DataObj> ThrottledPostData(Data data)
{
await throttler.WaitAsync();
try
{
var response = await _apiService.PostData(data);
_logger.Trace(response.Message);
return response;
}
finally
{
throttler.Release();
}
}
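Since the question also worries about memory with half a million records, another shape worth considering is a fixed set of worker loops that pull from a shared queue, so only a handful of tasks exist at any time instead of one per record. This is a sketch under assumptions: BoundedWorkers and ForEachAsync are hypothetical names, the body delegate stands in for the PostData call, and the input is copied into a queue up front.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

// Hypothetical helper: a small worker pool for asynchronous work.
public static class BoundedWorkers
{
    // Runs 'body' once per item using at most 'workerCount' concurrent loops.
    public static async Task ForEachAsync<T>(
        IEnumerable<T> items, int workerCount, Func<T, Task> body)
    {
        var queue = new ConcurrentQueue<T>(items);

        // Each worker keeps taking items until the queue is empty.
        var workers = Enumerable.Range(0, workerCount).Select(async _ =>
        {
            while (queue.TryDequeue(out var item))
            {
                await body(item);
            }
        }).ToList();

        await Task.WhenAll(workers);
    }
}
```

Note that an exception thrown by body ends that worker's loop, so in this sketch the try/catch and logging would belong inside the delegate passed as body.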

Asynchronous parallel web requests in ASP.NET MVC(C#)

I have a mini-project that requires to download html documents of multiple websites using C# and make it perform as fast as possible. In my scenario I might need to switch IP using proxies based on certain conditions. I want to take advantage of C# Asynchronous Tasks to make it execute as many requests as possible in order for the whole process to be fast and efficient.
Here's the code I have so far.
public class HTMLDownloader
{
public static List<string> URL_LIST = new List<string>();
public static List<string> HTML_DOCUMENTS = new List<string>();
public static void Main()
{
for (var i = 0; i < URL_LIST.Count; i++)
{
var html = Task.Run(() => Run(URL_LIST[i]));
HTML_DOCUMENTS.Add(html.Result);
}
}
public static async Task<string> Run(string url)
{
var client = new WebClient();
//Handle Proxy Credentials client.Proxy= new WebProxy();
string html = "";
try
{
html = await client.DownloadStringTaskAsync(new Uri(url));
//if(condition ==true)
//{
// return html;
//}
//else
//{
// Switch IP and try again
//}
}
catch (Exception e)
{
}
return html;
}
}
The problem here is that I'm not really taking advantage of sending multiple web requests because each request has to finish in order for the next one to begin. Is there a better approach to this? For example, send 10 web requests at a time and then send a new request when one of those requests is finished.
Thanks
"I want to take advantage of C# Asynchronous Tasks to make it execute as many requests as possible in order for the whole process to be fast and efficient."
You can use Task.WhenAll to get asynchronous concurrency.
"For example, send 10 web requests at a time and then send a new request when one of those requests is finished."
To throttle asynchronous concurrency, use SemaphoreSlim:
public static async Task Main()
{
using var limit = new SemaphoreSlim(10); // 10 at a time
var tasks = URL_LIST.Select(Process).ToList();
var results = await Task.WhenAll(tasks);
HTML_DOCUMENTS.AddRange(results);
async Task<string> Process(string url)
{
await limit.WaitAsync();
try { return await Run(url); }
finally { limit.Release(); }
}
}
One way is to use Task.WhenAll.
"Creates a task that will complete when all of the supplied tasks have completed."
The premise is: select all the tasks into a list, await the list of tasks with Task.WhenAll, then select the results.
public static async Task Main()
{
var tasks = URL_LIST.Select(Run).ToList(); // materialize so each task is started only once
await Task.WhenAll(tasks);
var results = tasks.Select(x => x.Result);
}
Note: Task.WhenAll also returns the collection of results.
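One thing to watch when the tasks come from a LINQ Select: the query is lazy, so every enumeration calls the selector again and starts the work again. A small sketch of the effect, using a hypothetical counter instead of real downloads (DeferredSelectDemo and CountStartsAsync are made-up names):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

public static class DeferredSelectDemo
{
    // Returns how many times the task-producing function was called.
    public static async Task<int> CountStartsAsync(bool materialize)
    {
        int starts = 0;
        Task<int> Start(int n) { starts++; return Task.FromResult(n); }

        IEnumerable<Task<int>> tasks = Enumerable.Range(0, 3).Select(Start);
        if (materialize) tasks = tasks.ToList(); // run the query once, keep the tasks

        await Task.WhenAll(tasks);                          // first enumeration
        var results = tasks.Select(t => t.Result).ToList(); // second enumeration
        return starts;
    }
}
```

CountStartsAsync(false) returns 6 because the three tasks are created twice, while CountStartsAsync(true) returns 3. With real downloads the lazy version would request every URL twice.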
First change your Main to be async.
Then you can use LINQ Select to run the Tasks in parallel.
public static async Task Main()
{
var tasks = URL_LIST.Select(Run);
string[] documents = await Task.WhenAll(tasks);
HTML_DOCUMENTS.AddRange(documents);
}
Task.WhenAll will unwrap the Task results into an array, once all the tasks are complete.
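It may also help to know that the array from Task.WhenAll follows the order of the input tasks, not the order in which they finish, so results line up with the URL list. A minimal sketch with simulated delays (WhenAllOrderDemo and Fake are hypothetical names, and Task.Delay stands in for a download):

```csharp
using System;
using System.Threading.Tasks;

public static class WhenAllOrderDemo
{
    public static async Task<string[]> RunAsync()
    {
        // Simulated download: waits, then returns its own name.
        async Task<string> Fake(string name, int delayMs)
        {
            await Task.Delay(delayMs);
            return name;
        }

        // "slow" finishes last but is still first in the result array.
        var tasks = new[] { Fake("slow", 50), Fake("fast", 1) };
        return await Task.WhenAll(tasks);
    }
}
```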

Parallel.For and httpclient crash the application C#

I want to avoid application crashing problem due to parallel for loop and httpclient but I am unable to apply solutions that are provided elsewhere on the web due to my limited knowledge of programming. My code is pasted below.
class Program
{
public static List<string> words = new List<string>();
public static int count = 0;
public static string output = "";
private static HttpClient Client = new HttpClient();
public static void Main(string[] args)
{
//input path strings...
List<string> links = new List<string>();
links.AddRange(File.ReadAllLines(input));
List<string> longList = new List<string>(File.ReadAllLines(@"a.txt"));
words.AddRange(File.ReadAllLines(output1));
System.Net.ServicePointManager.DefaultConnectionLimit = 8;
count = longList.Count;
//for (int i = 0; i < longList.Count; i++)
Task.Run(() => Parallel.For(0, longList.Count, new ParallelOptions { MaxDegreeOfParallelism = 5 }, (i, loopState) =>
{
Console.WriteLine(i);
string link = @"some link" + longList[i] + "/";
try
{
if (!links.Contains(link))
{
Task.Run(async () => { await Download(link); }).Wait();
}
}
catch (System.Exception e)
{
}
}));
//}
}
public static async Task Download(string link)
{
HtmlAgilityPack.HtmlDocument document = new HtmlDocument();
document.LoadHtml(await getURL(link));
//...stuff with html agility pack
}
public static async Task<string> getURL(string link)
{
string result = "";
HttpResponseMessage response = await Client.GetAsync(link);
Console.WriteLine(response.StatusCode);
if(response.IsSuccessStatusCode)
{
HttpContent content = response.Content;
var bytes = await response.Content.ReadAsByteArrayAsync();
result = Encoding.UTF8.GetString(bytes);
}
return result;
}
}
There are solutions, for example this one, but I don't know how to put the await keyword in my Main method, and currently the program simply exits because nothing waits on Task.Run(). As you can see, I have already applied a workaround to call the async Download() method from the Main method.
I also have doubts about using the same HttpClient instance from different parallel threads. Please advise me whether I should create a new HttpClient instance each time.
You're right that you have to block on tasks somewhere in a console application, otherwise the program will just exit before the work is complete. But you're doing this more than you need to. Aim for blocking only the main thread and delegating the rest to an async method. A good practice is to create a method with a signature like private static async Task MainAsync(string[] args), put the "guts" of your program logic there, and call it from Main like this:
MainAsync(args).Wait();
In your example, move everything from Main to MainAsync. Then you're free to use await as much as you want. Task.Run and Parallel.For are explicitly consuming new threads for I/O bound work, which is unnecessary in the async world. Use Task.WhenAll instead. The last part of your MainAsync method should end up looking something like this:
await Task.WhenAll(longList.Select(async s => {
Console.WriteLine(s);
string link = @"some link" + s + "/";
try
{
if (!links.Contains(link))
{
await Download(link);
}
}
catch (System.Exception e)
{
}
}));
There is one little wrinkle here though. Your example is throttling the parallelism at 5. If you find you still need this, TPL Dataflow is a great library for throttled parallelism in the async world. Here's a simple example.
Regarding HttpClient, using a single instance across threads is completely safe and highly encouraged.

Can this c# async process be more performant?

I am working on a program which makes multiple JSON calls to retrieve its data.
The data however is pretty big and when running it without async it takes 17 hours to fully process.
The fetching of the data goes as follows:
Call to a service with a page number (2000 pages in total to be processed), which returns 200 records per page.
For each record it returns, another service needs to be called to receive the data for the current record.
I'm new to the whole async functionality and I've made an attempt using async and await and already made a performance boost but was wondering if this is the correct way of using it and if there are any other ways to increase performance?
This is the code I currently have:
static void Main(string[] args)
{
MainAsyncCall().Wait();
Console.ReadKey();
}
public static async Task MainAsyncCall()
{
ServicePointManager.DefaultConnectionLimit = 999999;
List<Task> allPages = new List<Task>();
for (int i = 0; i <= 10; i++)
{
var page = i;
allPages.Add(Task.Factory.StartNew(() => processPage(page)));
}
Task.WaitAll(allPages.ToArray());
Console.WriteLine("Finished all pages");
}
public static async Task processPage(Int32 page)
{
List<Task> players = new List<Task>();
using (var client = new HttpClient())
{
string url = "<Request URL>";
var response = client.GetAsync(url).Result;
var content = response.Content.ReadAsStringAsync().Result;
dynamic item = Newtonsoft.Json.JsonConvert.DeserializeObject(content);
dynamic data = item.data;
var localPage = page;
Console.WriteLine($"Processing Page: {localPage}");
foreach (dynamic d in data)
{
players.Add(Task.Factory.StartNew(() => processPlayer(d, localPage)));
}
}
Task.WaitAll(players.ToArray());
Console.WriteLine($"Finished Page: {page}");
}
public static async Task processPlayer(dynamic player, int page)
{
using (var client = new HttpClient())
{
string url = "<Request URL>";
HttpResponseMessage response = null;
response = client.GetAsync(url).Result;
var content = await response.Content.ReadAsStringAsync();
dynamic item = Newtonsoft.Json.JsonConvert.DeserializeObject(content);
Console.WriteLine($"{page}: Processed {item.name}");
}
}
Any suggestion is welcome!
This is what it should look like to me:
static void Main(string[] args)
{
// it's okay here to use wait because we're at the root of the application
new AsyncServerCalls().MainAsyncCall().Wait();
Console.ReadKey();
}
public class AsyncServerCalls
{
// dont use static async methods
public async Task MainAsyncCall()
{
ServicePointManager.DefaultConnectionLimit = 999999;
List<Task> allPages = new List<Task>();
for (int i = 0; i <= 10; i++)
{
var page = i;
allPages.Add(processPage(page));
}
await Task.WhenAll(allPages.ToArray());
Console.WriteLine("Finished all pages");
}
public async Task processPage(Int32 page)
{
List<Task> players = new List<Task>();
using (var client = new HttpClient())
{
string url = "<Request URL>";
var response = await client.GetAsync(url); // await instead of .Result
var content = await response.Content.ReadAsStringAsync(); // again never use .Result;
dynamic item = Newtonsoft.Json.JsonConvert.DeserializeObject(content);
dynamic data = item.data;
var localPage = page;
Console.WriteLine($"Processing Page: {localPage}");
foreach (dynamic d in data)
{
players.Add(processPlayer(d, localPage)); // no need to put the task unnecessarily on a different thread, let the current SynchronisationContext deal with that
}
}
await Task.WhenAll(players.ToArray()); // always await a task in an async method
Console.WriteLine($"Finished Page: {page}");
}
public async Task processPlayer(dynamic player, int page)
{
using (var client = new HttpClient())
{
string url = "<Request URL>";
HttpResponseMessage response = null;
response = await client.GetAsync(url); // don't use .Result;
var content = await response.Content.ReadAsStringAsync();
dynamic item = Newtonsoft.Json.JsonConvert.DeserializeObject(content);
Console.WriteLine($"{page}: Processed {item.name}");
}
}
}
So basically the point here is to let the scheduler do its job. A console program has no SynchronizationContext, so continuations run on TaskScheduler.Default, which uses the thread pool. You can always force this by doing:
static void Main(string[] args)
{
Task.Run(() => new AsyncServerCalls().MainAsyncCall()).Wait();
Console.ReadKey();
}
Reference to Task.Run forcing Default
One thing to remember, which got me into trouble last week, is that you can fire-hose the thread pool, i.e. spawn so many tasks that your process just dies with insane CPU and memory usage. So you may need to use a semaphore to limit the number of tasks that are started at once.
I created a solution that processes a single file in multiple parts at the same time (Parallel Read). It is still being worked on, but it shows how the async pieces are used.
Just to clarify the parallelism.
When you take a reference to all those tasks:
allPages.Add(processPage(page));
They all will be started.
When you do:
await Task.WhenAll(allPages);
This will suspend the current method until all those page processes have completed (it won't block the current thread though; don't confuse the two).
Danger Zone
If you don't want to suspend the method on Task.WhenAll, you can run all the page processes in parallel for each page and add each Task to an overall List<Task>.
However, the danger with this is the fire-hosing. You will have to limit the number of concurrent operations at some point. Where you do that is up to you, but remember that it will be needed at some point.
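As a rough illustration of where such a limit could go, the sketch below throttles only the innermost request calls with one shared SemaphoreSlim. All names here (LeafThrottle, PageDemo, fakeRequest) are hypothetical stand-ins for processPage, processPlayer and the HTTP calls. The reason for wrapping only the leaf calls is that if a page held a slot while awaiting its players, and every slot were held by a page, nothing could proceed.

```csharp
using System;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper: caps how many leaf operations run at once.
public sealed class LeafThrottle
{
    private readonly SemaphoreSlim _slots;

    public LeafThrottle(int maxConcurrent) => _slots = new SemaphoreSlim(maxConcurrent);

    // Wrap only the innermost I/O call; never hold a slot while
    // awaiting other throttled work.
    public async Task RunAsync(Func<Task> leafCall)
    {
        await _slots.WaitAsync();
        try { await leafCall(); }
        finally { _slots.Release(); }
    }
}

public static class PageDemo
{
    // Starts every page; each page then starts its players.
    public static Task ProcessAllAsync(
        int pages, int playersPerPage, LeafThrottle throttle, Func<Task> fakeRequest)
    {
        return Task.WhenAll(Enumerable.Range(0, pages)
            .Select(page => ProcessPageAsync(playersPerPage, throttle, fakeRequest)));
    }

    private static async Task ProcessPageAsync(
        int players, LeafThrottle throttle, Func<Task> fakeRequest)
    {
        // The page request takes a slot and gives it back before the players start.
        await throttle.RunAsync(fakeRequest);

        // Each player request takes its own slot.
        await Task.WhenAll(Enumerable.Range(0, players)
            .Select(_ => throttle.RunAsync(fakeRequest)));
    }
}
```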

Run async method 8 times in parallel

How do I turn the following into a Parallel.ForEach?
public async void getThreadContents(String[] threads)
{
HttpClient client = new HttpClient();
List<String> usernames = new List<String>();
int i = 0;
foreach (String url in threads)
{
i++;
progressLabel.Text = "Scanning thread " + i.ToString() + "/" + threads.Count<String>();
HttpResponseMessage response = await client.GetAsync(url);
String content = await response.Content.ReadAsStringAsync();
String user;
Predicate<String> userPredicate;
foreach (Match match in regex.Matches(content))
{
user = match.Groups[1].ToString();
userPredicate = (String x) => x == user;
if (usernames.Find(userPredicate) != user)
{
usernames.Add(match.Groups[1].ToString());
}
}
progressBar1.PerformStep();
}
}
I coded it on the assumption that asynchronous and parallel processing would be the same, and I just realized they aren't. I took a look at all the questions I could find on this, and I really can't seem to find an example that does it for me. Most of them lack readable variable names. Using single-letter variable names which don't explain what they contain is a horrible way to state an example.
I normally have between 300 and 2000 entries in the array named threads (it contains URLs of forum threads), and it would seem that parallel processing would speed up the execution because of the many HTTP requests.
Do I have to remove all the asynchrony (I have nothing async outside the foreach, only variable definitions) before I can use Parallel.ForEach? How should I go about doing this? Can I do this without blocking the main thread?
I am using .NET 4.5 by the way.
"I coded it in the assumption that asynchronous and parallel processing would be the same"
Asynchronous processing and parallel processing are quite different. If you don't understand the difference, I think you should first read more about it (for example what is the relation between Asynchronous and parallel programming in c#?).
Now, what you want to do is actually not that simple, because you want to process a big collection asynchronously, with a specific degree of parallelism (8). With synchronous processing, you could use Parallel.ForEach() (along with ParallelOptions to configure the degree of parallelism), but there is no simple alternative that would work with async.
In your code, this is complicated by the fact that you expect everything to execute on the UI thread. (Though ideally, you shouldn't access the UI directly from your computation. Instead, you should use IProgress, which would mean the code no longer has to execute on the UI thread.)
Probably the best way to do this in .NET 4.5 is to use TPL Dataflow. Its ActionBlock does exactly what you want, but it can be quite verbose (because it's more flexible than what you need). So it makes sense to create a helper method:
public static Task AsyncParallelForEach<T>(
IEnumerable<T> source, Func<T, Task> body,
int maxDegreeOfParallelism = DataflowBlockOptions.Unbounded,
TaskScheduler scheduler = null)
{
var options = new ExecutionDataflowBlockOptions
{
MaxDegreeOfParallelism = maxDegreeOfParallelism
};
if (scheduler != null)
options.TaskScheduler = scheduler;
var block = new ActionBlock<T>(body, options);
foreach (var item in source)
block.Post(item);
block.Complete();
return block.Completion;
}
In your case, you would use it like this:
await AsyncParallelForEach(
threads, async url => await DownloadUrl(url), 8,
TaskScheduler.FromCurrentSynchronizationContext());
Here, DownloadUrl() is an async Task method that processes a single URL (the body of your loop), 8 is the degree of parallelism (probably shouldn't be a literal constant in real code) and FromCurrentSynchronizationContext() makes sure the code executes on the UI thread.
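The IProgress suggestion from earlier in this answer could look roughly like the sketch below. ProgressDemo and CallbackProgress are hypothetical names. CallbackProgress is a simple synchronous reporter included only so the sketch has no UI dependency.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

public static class ProgressDemo
{
    // Processes each item in turn and reports how many are done so far.
    public static async Task ProcessAllAsync<T>(
        IReadOnlyList<T> items, Func<T, Task> body, IProgress<int> progress)
    {
        for (int i = 0; i < items.Count; i++)
        {
            await body(items[i]);
            progress?.Report(i + 1); // the worker never touches UI controls
        }
    }
}

// Hypothetical synchronous reporter, handy for console use or tests.
public sealed class CallbackProgress<T> : IProgress<T>
{
    private readonly Action<T> _onReport;

    public CallbackProgress(Action<T> onReport) => _onReport = onReport;

    public void Report(T value) => _onReport(value);
}
```

In a WinForms or WPF application you would more likely pass a Progress<int> created on the UI thread, since Progress<T> captures the SynchronizationContext at construction and invokes its callback there. The callback would then be the only place that updates progressLabel or progressBar1.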
Stephen Toub has a good blog post on implementing a ForEachAsync. Svick's answer is quite good for platforms on which Dataflow is available.
Here's an alternative, using the partitioner from the TPL:
public static Task ForEachAsync<T>(this IEnumerable<T> source,
int degreeOfParallelism, Func<T, Task> body)
{
var partitions = Partitioner.Create(source).GetPartitions(degreeOfParallelism);
var tasks = partitions.Select(async partition =>
{
using (partition)
while (partition.MoveNext())
await body(partition.Current);
});
return Task.WhenAll(tasks);
}
You can then use this as such:
public async Task getThreadContentsAsync(String[] threads)
{
HttpClient client = new HttpClient();
ConcurrentDictionary<String, object> usernames = new ConcurrentDictionary<String, object>();
await threads.ForEachAsync(8, async url =>
{
HttpResponseMessage response = await client.GetAsync(url);
String content = await response.Content.ReadAsStringAsync();
String user;
foreach (Match match in regex.Matches(content))
{
user = match.Groups[1].ToString();
usernames.TryAdd(user, null);
}
progressBar1.PerformStep();
});
}
Yet another alternative is using SemaphoreSlim or AsyncSemaphore (which is included in my AsyncEx library and supports many more platforms than SemaphoreSlim):
public async Task getThreadContentsAsync(String[] threads)
{
SemaphoreSlim semaphore = new SemaphoreSlim(8);
HttpClient client = new HttpClient();
ConcurrentDictionary<String, object> usernames = new ConcurrentDictionary<String, object>();
await Task.WhenAll(threads.Select(async url =>
{
await semaphore.WaitAsync();
try
{
HttpResponseMessage response = await client.GetAsync(url);
String content = await response.Content.ReadAsStringAsync();
String user;
foreach (Match match in regex.Matches(content))
{
user = match.Groups[1].ToString();
usernames.TryAdd(user, null);
}
progressBar1.PerformStep();
}
finally
{
semaphore.Release();
}
}));
}
You can try the ParallelForEachAsync extension method from AsyncEnumerator NuGet Package:
using System.Collections.Async;
public async void getThreadContents(String[] threads)
{
HttpClient client = new HttpClient();
List<String> usernames = new List<String>();
int i = 0;
await threads.ParallelForEachAsync(async url =>
{
i++;
progressLabel.Text = "Scanning thread " + i.ToString() + "/" + threads.Count<String>();
HttpResponseMessage response = await client.GetAsync(url);
String content = await response.Content.ReadAsStringAsync();
String user;
Predicate<String> userPredicate;
foreach (Match match in regex.Matches(content))
{
user = match.Groups[1].ToString();
userPredicate = (String x) => x == user;
if (usernames.Find(userPredicate) != user)
{
usernames.Add(match.Groups[1].ToString());
}
}
// THIS CALL MUST BE THREAD-SAFE!
progressBar1.PerformStep();
},
maxDegreeOfParallelism: 8);
}
