I have a mini-project that requires downloading the HTML documents of multiple websites using C#, and it needs to perform as fast as possible. In my scenario I might need to switch IPs using proxies based on certain conditions. I want to take advantage of C# asynchronous tasks to execute as many requests as possible so that the whole process is fast and efficient.
Here's the code I have so far.
public class HTMLDownloader
{
    public static List<string> URL_LIST = new List<string>();
    public static List<string> HTML_DOCUMENTS = new List<string>();

    public static void Main()
    {
        for (var i = 0; i < URL_LIST.Count; i++)
        {
            var html = Task.Run(() => Run(URL_LIST[i]));
            HTML_DOCUMENTS.Add(html.Result);
        }
    }

    public static async Task<string> Run(string url)
    {
        var client = new WebClient();
        // Handle proxy credentials: client.Proxy = new WebProxy();
        string html = "";
        try
        {
            html = await client.DownloadStringTaskAsync(new Uri(url));
            //if (condition == true)
            //{
            //    return html;
            //}
            //else
            //{
            //    Switch IP and try again
            //}
        }
        catch (Exception e)
        {
        }
        return html;
    }
}
The problem here is that I'm not really taking advantage of sending multiple web requests because each request has to finish in order for the next one to begin. Is there a better approach to this? For example, send 10 web requests at a time and then send a new request when one of those requests is finished.
Thanks
I want to take advantage of C# Asynchronous Tasks to make it execute as many requests as possible in order for the whole process to be fast and efficient.
You can use Task.WhenAll to get asynchronous concurrency.
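For example, a minimal sketch that starts every download up front and awaits them together, reusing your existing Run method:

public static async Task Main()
{
    // Start all downloads at once; WhenAll completes when every task has finished.
    var tasks = URL_LIST.Select(Run).ToList();
    string[] results = await Task.WhenAll(tasks);
    HTML_DOCUMENTS.AddRange(results);
}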
For example, send 10 web requests at a time and then send a new request when one of those requests is finished.
To throttle asynchronous concurrency, use SemaphoreSlim:
public static async Task Main()
{
    using var limit = new SemaphoreSlim(10); // 10 at a time
    var tasks = URL_LIST.Select(Process).ToList();
    var results = await Task.WhenAll(tasks);
    HTML_DOCUMENTS.AddRange(results);

    async Task<string> Process(string url)
    {
        await limit.WaitAsync();
        try { return await Run(url); }
        finally { limit.Release(); }
    }
}
One way is to use Task.WhenAll.
Creates a task that will complete when all of the supplied tasks have completed.
The premise is: Select all the tasks into a list, await the list of tasks with Task.WhenAll, then select the results.
public static async Task Main()
{
    // Materialize the query so each Run is started exactly once.
    var tasks = URL_LIST.Select(Run).ToList();
    await Task.WhenAll(tasks);
    var results = tasks.Select(x => x.Result);
}
Note: the result of awaiting WhenAll is the collection of results as well.
First change your Main to be async.
Then you can use LINQ Select to run the Tasks in parallel.
public static async Task Main()
{
    var tasks = URL_LIST.Select(Run);
    string[] documents = await Task.WhenAll(tasks);
    HTML_DOCUMENTS.AddRange(documents);
}
Task.WhenAll will unwrap the Task results into an array, once all the tasks are complete.
Scenario 1 - For each website in the string list (_websites), the caller method wraps GetWebContent in a task, waits for all the tasks to finish and returns the results.
private async Task<string[]> AsyncGetUrlStringFromWebsites()
{
    List<Task<string>> tasks = new List<Task<string>>();
    foreach (var website in _websites)
    {
        tasks.Add(Task.Run(() => GetWebContent(website)));
    }
    var results = await Task.WhenAll(tasks);
    return results;
}

private string GetWebContent(string url)
{
    var client = new HttpClient();
    var content = client.GetStringAsync(url);
    return content.Result; // blocks this thread-pool thread until the download finishes
}
Scenario 2 - For each website in the string list (_websites), the caller method calls GetWebContent (which returns Task<string>), waits for all the tasks to finish and returns the results.
private async Task<string[]> AsyncGetUrlStringFromWebsites()
{
    List<Task<string>> tasks = new List<Task<string>>();
    foreach (var website in _websites)
    {
        tasks.Add(GetWebContent(website));
    }
    var results = await Task.WhenAll(tasks);
    return results;
}

private async Task<string> GetWebContent(string url)
{
    var client = new HttpClient();
    var content = await client.GetStringAsync(url);
    return content;
}
Questions - Which way is the correct approach and why? How does each approach impact achieving asynchronous processing?
With Task.Run() you occupy a thread from the thread pool and tell it to wait until the web content has been received.
Why would you want to do that? Do you pay someone to stand next to your mailbox to tell you when a letter arrives?
GetStringAsync already is asynchronous. The CPU has nothing to do (for this request) while the content comes in over the network.
So the second approach is correct, no need to use extra threads from the thread pool here.
Always interesting to read: Stephen Cleary's "There is no thread"
@René Vogt gave a great explanation.
Just a minor 5 cents from my side.
In the second example there is no need to use async/await in the GetWebContent method. You can simply return the Task<string> directly (this also reduces the async depth).
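For illustration, a sketch of what that could look like (the same method, minus the async state machine):

private Task<string> GetWebContent(string url)
{
    var client = new HttpClient();
    // Return the task directly; the caller's await still observes the result and any exception.
    return client.GetStringAsync(url);
}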
I'm writing a .NET Core Console App that needs to continuously read data from multiple WebSockets. My current approach is to create a new Task (via Task.Run) per WebSocket that runs an infinite while loop and blocks until it reads the data from the socket. However, since the data is pushed at a rather low frequency, the threads just block most of the time which seems quite inefficient.
From my understanding, the async/await pattern should be ideal for blocking I/O operations. However, I'm not sure how to apply it for my situation or even if async/await can improve this in any way - especially since it's a Console app.
I've put together a proof of concept (doing a HTTP GET instead of reading from WebSocket for simplicity). The only way I was able to achieve this was without actually awaiting. Code:
static void Main(string[] args)
{
    Console.WriteLine($"ThreadId={ThreadId}: Main");
    Task task = Task.Run(() => Process("https://duckduckgo.com", "https://stackoverflow.com/"));
    // Do main work.
    task.Wait();
}

private static void Process(params string[] urls)
{
    Dictionary<string, Task<string>> tasks = urls.ToDictionary(x => x, x => (Task<string>)null);
    HttpClient client = new HttpClient();
    while (true)
    {
        foreach (string url in urls)
        {
            Task<string> task = tasks[url];
            if (task == null || task.IsCompleted)
            {
                if (task != null)
                {
                    string result = task.Result;
                    Console.WriteLine($"ThreadId={ThreadId}: Length={result.Length}");
                }
                tasks[url] = ReadString(client, url);
            }
        }
        Thread.Yield();
    }
}

private static async Task<string> ReadString(HttpClient client, string url)
{
    var response = await client.GetAsync(url);
    Console.WriteLine($"ThreadId={ThreadId}: Url={url}");
    return await response.Content.ReadAsStringAsync();
}

private static int ThreadId => Thread.CurrentThread.ManagedThreadId;
This seems to be working and executes on various worker threads of the ThreadPool. However, it definitely doesn't look like typical async/await code, which makes me think there has to be a better way.
Is there a more proper / more elegant way of doing this?
You've basically written a version of Task.WhenAny that uses a CPU loop to check for completed tasks rather than... whatever magic the framework method uses behind the scenes.
A more idiomatic version might look like this. (Although it might not - I feel like there should be an easier method of "re-run the completed task" than the reverse dictionary I've used here.)
static void Main(string[] args)
{
    Console.WriteLine($"ThreadId={ThreadId}: Main");
    // No need for Task.Run here.
    var task = Process("https://duckduckgo.com", "https://stackoverflow.com/");
    task.Wait();
}

private static async Task Process(params string[] urls)
{
    // Set up initial dictionary mapping task (per URL) to the URL used.
    HttpClient client = new HttpClient();
    var tasks = urls.ToDictionary(u => client.GetAsync(u), u => u);
    while (true)
    {
        // Wait for any task to complete, get its URL and remove it from the current tasks.
        var firstCompletedTask = await Task.WhenAny(tasks.Keys);
        var firstCompletedUrl = tasks[firstCompletedTask];
        tasks.Remove(firstCompletedTask);
        // Do work with completed task.
        try
        {
            Console.WriteLine($"ThreadId={ThreadId}: URL={firstCompletedUrl}");
            using (var response = await firstCompletedTask)
            {
                var content = await response.Content.ReadAsStringAsync();
                Console.WriteLine($"ThreadId={ThreadId}: Length={content.Length}");
            }
        }
        catch (Exception ex)
        {
            Console.WriteLine($"ThreadId={ThreadId}: Ex={ex}");
        }
        // Queue the task again.
        tasks.Add(client.GetAsync(firstCompletedUrl), firstCompletedUrl);
    }
}

private static int ThreadId => Thread.CurrentThread.ManagedThreadId;
I've accepted Rawling's answer - I believe it is correct for the exact scenario I described. However, with a bit of inverted logic, I ended up with something way simpler - leaving it in case anyone needs something like this:
static void Main(string[] args)
{
    string[] urls = { "https://duckduckgo.com", "https://stackoverflow.com/" };
    HttpClient client = new HttpClient();
    var tasks = urls.Select(async url =>
    {
        while (true) await ReadString(client, url);
    });
    Task.WhenAll(tasks).Wait();
}

private static async Task<string> ReadString(HttpClient client, string url)
{
    var response = await client.GetAsync(url);
    string data = await response.Content.ReadAsStringAsync();
    Console.WriteLine($"Fetched data from url={url}. Length={data.Length}");
    return data;
}
Maybe a better question is: do you really need a thread per socket in this case? You should think of threads as a system-wide resource and take this into consideration when spawning them, especially if you don't know in advance how many threads your application will use. This is a good read: What's the maximum number of threads in Windows Server 2003?
A few years ago the .NET team introduced asynchronous sockets.
...The client is built with an asynchronous socket, so execution of the client application is not suspended while the server returns a response. The application sends a string to the server and then displays the string returned by the server on the console.
Asynchronous Client Socket Example
There are a lot more examples out there showcasing this approach. While it is a bit more complicated and "low level", it lets you stay in control.
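For a rough idea of the shape, here is a minimal sketch using the task-based extension methods on System.Net.Sockets.Socket (available on .NET Core; the host, port and buffer size are placeholders):

private static async Task ReceiveLoopAsync(string host, int port)
{
    using (var socket = new Socket(SocketType.Stream, ProtocolType.Tcp))
    {
        await socket.ConnectAsync(host, port); // no thread is blocked while connecting
        var buffer = new byte[4096];
        while (true)
        {
            // Completes when data arrives; returns 0 when the peer closes the connection.
            int read = await socket.ReceiveAsync(new ArraySegment<byte>(buffer), SocketFlags.None);
            if (read == 0) break;
            Console.WriteLine(Encoding.UTF8.GetString(buffer, 0, read));
        }
    }
}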
I'm making an app that shows some data collected from the web in a Windows Forms UI. Today I have to wait sequentially for all the data to download before showing it. How can I download in parallel with a limited queue (a maximum number of concurrently executing tasks) and refresh a DataGridView with the results as they are downloaded?
What I have today is this method:
internal async Task<string> RequestDataAsync(string uri)
{
    var wb = new System.Net.WebClient();
    var sourceAsync = wb.DownloadStringTaskAsync(uri);
    string data = await sourceAsync;
    return data;
}
which I call in a foreach(); after it ends I parse the data into a list of custom objects, convert that list to a DataTable, and bind the DataGridView to it.
I'm not sure whether the best way is to use the LimitedConcurrencyLevelTaskScheduler from the example at https://msdn.microsoft.com/library/system.threading.tasks.taskscheduler.aspx (and I'm not sure how it could report to the grid each time a resource is downloaded), or whether there is a better way to do this.
I'd rather not start all the tasks at the same time, because sometimes I have to request 100 downloads at once, and I'd like at most, for example, 10 tasks executing at the same time.
I know this is a question about controlling concurrent tasks and reporting progress while downloading, but I'm not sure what the best approach is nowadays.
I don't often recommend my book, but I think it would help you.
Concurrent asynchrony is done via Task.WhenAll (recipe 2.4 in my book):
List<string> uris = ...;
var tasks = uris.Select(uri => RequestDataAsync(uri));
string[] results = await Task.WhenAll(tasks);
To limit concurrency, use a SemaphoreSlim (recipe 11.5 in my book):
List<string> uris = ...;
var semaphore = new SemaphoreSlim(10);
var tasks = uris.Select(async uri =>
{
    await semaphore.WaitAsync();
    try { return await RequestDataAsync(uri); }
    finally { semaphore.Release(); }
});
string[] results = await Task.WhenAll(tasks);
To process data as it arrives, introduce another async method (recipe 2.6 in my book):
List<string> uris = ...;
var semaphore = new SemaphoreSlim(10);
var tasks = uris.Select(async uri =>
{
    await semaphore.WaitAsync();
    try { await RequestAndProcessDataAsync(uri); }
    finally { semaphore.Release(); }
});
await Task.WhenAll(tasks);

async Task RequestAndProcessDataAsync(string uri)
{
    var data = await RequestDataAsync(uri);
    var myObject = Parse(data);
    _listBoundToDataTable.Add(myObject);
}
I have to call 100,000 URLs and I don't need to wait for the responses. I have searched a lot; some people say there is no way to make an HTTP request without waiting for the response, and a few answers to questions like mine say to set Method = "POST".
The following is the source code based on all of them. I have tried to call the URLs asynchronously using WhenAll.
The problem is that when I look at CPU usage in Task Manager, it is fully busy for 140 seconds, and during that time the system is almost unusable.
protected async void btnStartCallWhenAll_Click(object sender, EventArgs e)
{
    // Make a list of web addresses.
    List<string> urlList = SetUpURLList(Convert.ToInt32(txtNoRecordsToAdd.Text));
    // One-step async call.
    await ProcessAllURLSAsync(urlList);
}

private async Task ProcessAllURLSAsync(List<string> urlList)
{
    // Create a query.
    IEnumerable<Task<int>> CallingTasksQuery =
        from url in urlList select ProcessURLAsync(url);
    // Use ToArray to execute the query and start the calling tasks.
    Task<int>[] CallingTasks = CallingTasksQuery.ToArray();
    // Await the completion of all the running tasks.
    int[] lengths = await Task.WhenAll(CallingTasks);
    int total = lengths.Sum();
}

private async Task<int> ProcessURLAsync(string url)
{
    await CallURLAsync(url);
    return 1;
}

private async Task CallURLAsync(string url)
{
    // Initialize an HttpWebRequest for the current URL.
    var webReq = (HttpWebRequest)WebRequest.Create(url);
    webReq.Method = "POST";
    // Send the request to the Internet resource without waiting for the response.
    Task<WebResponse> responseTask = webReq.GetResponseAsync();
}

private List<string> SetUpURLList(int No)
{
    List<string> urls = new List<string>();
    for (int i = 1; i <= No; i++)
        urls.Add("http://msdn.microsoft.com/library/windows/apps/br211380.aspx");
    return urls;
}
By the way, the compiler hints that "this async method lacks 'await' operators and will run synchronously; consider using the await operator to await non-blocking API calls, or await Task.Run(...) to do CPU-bound work on a background thread" for this line:
private async Task CallURLAsync(string url)
I don't know whether this affects my problem, but after finding the same problem in other questions, the advice was to suppress the compiler message for this line.
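For what it's worth, one way to keep the fire-and-forget semantics and make the warning go away is to drop the async keyword and discard the task explicitly. A sketch against the CallURLAsync above (it silences the warning, but on its own it won't fix the CPU usage):

private Task CallURLAsync(string url)
{
    var webReq = (HttpWebRequest)WebRequest.Create(url);
    webReq.Method = "POST";
    // Fire-and-forget: explicitly discard the response task instead of marking the method async.
    // Note that exceptions on the discarded task go unobserved.
    _ = webReq.GetResponseAsync();
    return Task.CompletedTask;
}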
I am updating my concurrency skillset. My problem seems to be fairly common: read from multiple Uris, parse and work with the result, etc. I have Concurrency in C# Cookbook. There are a few examples using GetStringAsync, such as
static async Task<string> DownloadAllAsync(IEnumerable<string> urls)
{
    var httpClient = new HttpClient();
    var downloads = urls.Select(url => httpClient.GetStringAsync(url));
    Task<string>[] downloadTasks = downloads.ToArray();
    string[] htmlPages = await Task.WhenAll(downloadTasks);
    return string.Concat(htmlPages);
}
What I need is the asynchronous pattern for running multiple async tasks, capturing full or partial success.
Url 1 succeeds
Url 2 succeeds
Url 3 fails (timeout, bad Uri format, 401, etc)
Url 4 succeeds
... 20 more with mixed success
Awaiting the DownloadAllAsync task will throw a single exception if any download fails, dropping the accumulated results. From my limited research, WhenAll and WaitAll behave the same here. I want to catch the exceptions, log the failures, but continue with the remaining tasks, even if they all fail.
I could process them one by one, but wouldn't that defeat the purpose of letting the TPL manage the whole process? Is there a link to a pattern which would accomplish this in a pure TPL way? Perhaps I'm using the wrong tool?
I want to catch the exceptions, log the failures, but continue with the remaining tasks, even if they all fail.
In this case, the cleanest solution is to change what your code does for each element. I.e., this current code:
var downloads = urls.Select(url => httpClient.GetStringAsync(url));
says "for each url, download a string". What you want it to say is "for each url, download a string and then log and ignore any errors":
static async Task<string> DownloadAllAsync(IEnumerable<string> urls)
{
    var httpClient = new HttpClient();
    var downloads = urls.Select(url => TryDownloadAsync(httpClient, url));
    Task<string>[] downloadTasks = downloads.ToArray();
    string[] htmlPages = await Task.WhenAll(downloadTasks);
    return string.Concat(htmlPages);
}

static async Task<string> TryDownloadAsync(HttpClient client, string url)
{
    try
    {
        return await client.GetStringAsync(url);
    }
    catch (Exception ex)
    {
        Log(ex);
        return string.Empty; // or whatever you prefer
    }
}
You can attach a continuation to each of your tasks and wait for those continuations instead of waiting directly on the tasks.
static async Task<string> DownloadAllAsync(IEnumerable<string> urls)
{
    var httpClient = new HttpClient();
    IEnumerable<Task<Task<string>>> downloads = urls.Select(
        url => httpClient.GetStringAsync(url)
                         .ContinueWith(p => p, TaskContinuationOptions.ExecuteSynchronously));
    Task<Task<string>>[] downloadTasks = downloads.ToArray();
    Task<string>[] completedTasks = await Task.WhenAll(downloadTasks);
    foreach (var task in completedTasks)
    {
        if (task.IsFaulted) // or task.IsCanceled
        {
            // Handle it
        }
    }
    var htmlPages = completedTasks.Where(x => x.Status == TaskStatus.RanToCompletion)
                                  .Select(x => x.Result);
    return string.Concat(htmlPages);
}
This will not stop as soon as one task fails; rather, it will wait for all the tasks to complete and then handle the successes and failures separately.