I am working on a program which makes multiple json calls to retrieve it's data.
The data however is pretty big and when running it without async it takes 17 hours to fully process.
The fetching of the data goes as follows:
Call to a service with a page number (2000 pages in total to be processed), which returns 200 records per page.
For each record it returns, an other service needs to be called to receive the data for the current record.
I'm new to the whole async functionality and I've made an attempt using async and await and already made a performance boost but was wondering if this is the correct way of using it and if there are any other ways to increase performance?
This is the code I currently have:
static void Main(string[] args)
{
MainAsyncCall().Wait();
Console.ReadKey();
}
public static async Task MainAsyncCall()
{
ServicePointManager.DefaultConnectionLimit = 999999;
List<Task> allPages = new List<Task>();
for (int i = 0; i <= 10; i++)
{
var page = i;
allPages.Add(Task.Factory.StartNew(() => processPage(page)));
}
Task.WaitAll(allPages.ToArray());
Console.WriteLine("Finished all pages");
}
public static async Task processPage(Int32 page)
{
List<Task> players = new List<Task>();
using (var client = new HttpClient())
{
string url = "<Request URL>";
var response = client.GetAsync(url).Result;
var content = response.Content.ReadAsStringAsync().Result;
dynamic item = Newtonsoft.Json.JsonConvert.DeserializeObject(content);
dynamic data = item.data;
var localPage = page;
Console.WriteLine($"Processing Page: {localPage}");
foreach (dynamic d in data)
{
players.Add(Task.Factory.StartNew(() => processPlayer(d, localPage)));
}
}
Task.WaitAll(players.ToArray());
Console.WriteLine($"Finished Page: {page}");
}
public static async Task processPlayer(dynamic player, int page)
{
using (var client = new HttpClient())
{
string url = "<Request URL>";
HttpResponseMessage response = null;
response = client.GetAsync(url).Result;
var content = await response.Content.ReadAsStringAsync();
dynamic item = Newtonsoft.Json.JsonConvert.DeserializeObject(content);
Console.WriteLine($"{page}: Processed {item.name}");
}
}
Any suggestion is welcome!
This is what it should look like to me:
static void Main(string[] args)
{
// it's okay here to use wait because we're at the root of the application
new AsyncServerCalls().MainAsyncCall().Wait();
Console.ReadKey();
}
public class AsyncServerCalls
{
// dont use static async methods
public async Task MainAsyncCall()
{
ServicePointManager.DefaultConnectionLimit = 999999;
List<Task> allPages = new List<Task>();
for (int i = 0; i <= 10; i++)
{
var page = i;
allPages.Add(processPage(page));
}
await Task.WhenAll(allPages.ToArray());
Console.WriteLine("Finished all pages");
}
public async Task processPage(Int32 page)
{
List<Task> players = new List<Task>();
using (var client = new HttpClient())
{
string url = "<Request URL>";
var response = await client.GetAsync(url)// nope .Result;
var content = await response.Content.ReadAsStringAsync(); // again never use .Result;
dynamic item = Newtonsoft.Json.JsonConvert.DeserializeObject(content);
dynamic data = item.data;
var localPage = page;
Console.WriteLine($"Processing Page: {localPage}");
foreach (dynamic d in data)
{
players.Add(processPlayer(d, localPage)); // no need to put the task unnecessarily on a different thread, let the current SynchronisationContext deal with that
}
}
await Task.WhenAll(players.ToArray()); // always await a task in an async method
Console.WriteLine($"Finished Page: {page}");
}
public async Task processPlayer(dynamic player, int page)
{
using (var client = new HttpClient())
{
string url = "<Request URL>";
HttpResponseMessage response = null;
response = await client.GetAsync(url); // don't use .Result;
var content = await response.Content.ReadAsStringAsync();
dynamic item = Newtonsoft.Json.JsonConvert.DeserializeObject(content);
Console.WriteLine($"{page}: Processed {item.name}");
}
}
}
So basially the points here are to make sure you let the SynchronisationContext do it's job. Inside a console program it should use the TaskSchedular.Default which is a ThreadPool SynchronisationContext. You can always force this by doing:
static void Main(string[] args)
{
Task.Run(() => new AsyncServerCalls().MainAsyncCall()).Wait();
Console.ReadKey();
}
Reference to Task.Run forcing Default
One thing you need to remember, which I got into trouble with last week is that you can fire hose the thread pool, i.e. spawn so many tasks that the your process just dies with insane CPU and Memory usage. So you may need to use a Semaphore to just limit the number of threads that going to be created.
I created a solution that processes a single file in multiple parts all at the same time Parallel Read it is still being worked on, but shows the uses of async stuff
Just to clarify the parallelism.
When you take a reference to all those tasks:
allPages.Add(processPage(page));
They all will be started.
When you do:
await Task.WhenAll(allPages);
This will block the current method execution until all those page processes have been executed (it won't block the current thread though, don't get these confused)
Danger Zone
If you don't want to block method execution on
Task.WhenAll
So, you can parallel all page processes for each page, then you can add that Task to an overall List<Task>.
However, the danger with this is the fire hosing... You are going to limit the number of threads you execute at some point, so where.... well that is up to you but just remember, it will happen at some point.
Related
I have a mini-project that requires to download html documents of multiple websites using C# and make it perform as fast as possible. In my scenario I might need to switch IP using proxies based on certain conditions. I want to take advantage of C# Asynchronous Tasks to make it execute as many requests as possible in order for the whole process to be fast and efficient.
Here's the code I have so far.
public class HTMLDownloader
{
public static List<string> URL_LIST = new List<string>();
public static List<string> HTML_DOCUMENTS = new List<string>();
public static void Main()
{
for (var i = 0; i < URL_LIST.Count; i++)
{
var html = Task.Run(() => Run(URL_LIST[i]));
HTML_DOCUMENTS.Add(html.Result);
}
}
public static async Task<string> Run(string url)
{
var client = new WebClient();
//Handle Proxy Credentials client.Proxy= new WebProxy();
string html = "";
try
{
html = await client.DownloadStringTaskAsync(new Uri(url));
//if(condition ==true)
//{
// return html;
//}
//else
//{
// Switch IP and try again
//}
}
catch (Exception e)
{
}
return html;
}
The problem here is that I'm not really taking advantage of sending multiple web requests because each request has to finish in order for the next one to begin. Is there a better approach to this? For example, send 10 web requests at a time and then send a new request when one of those requests is finished.
Thanks
I want to take advantage of C# Asynchronous Tasks to make it execute as many requests as possible in order for the whole process to be fast and efficient.
You can use Task.WhenAll to get asynchronous concurrency.
For example, send 10 web requests at a time and then send a new request when one of those requests is finished.
To throttle asynchronous concurrency, use SemaphoreSlim:
public static async Task Main()
{
using var limit = new SemaphoreSlim(10); // 10 at a time
var tasks = URL_LIST.Select(Process).ToList();
var results = await Task.WhenAll(tasks);
HTML_DOCUMENTS.AddRange(results);
async Task<string> Process(string url)
{
await limit.WaitAsync();
try { return await Run(url); }
finally { limit.Release(); }
}
}
One way is to use Task.WhenAll.
Creates a task that will complete when all of the supplied tasks have
completed.
The premise is, Select all the tasks into a List, await the list of task with Task.WhenAll, Select the results
public static async Task Main()
{
var tasks = URL_LIST.Select(Run);
await Task.WhenAll(tasks);
var results = tasks.Select(x => x.Result);
}
Note : The result of WhenAll will be the collection of results as well
First change your Main to be async.
Then you can use LINQ Select to run the Tasks in parallel.
public static async Task Main()
{
var tasks = URL_LIST.Select(Run);
string[] documents = await Task.WhenAll(tasks);
HTML_DOCUMENTS.AddRange(documents);
}
Task.WhenAll will unwrap the Task results into an array, once all the tasks are complete.
I want to avoid application crashing problem due to parallel for loop and httpclient but I am unable to apply solutions that are provided elsewhere on the web due to my limited knowledge of programming. My code is pasted below.
class Program
{
public static List<string> words = new List<string>();
public static int count = 0;
public static string output = "";
private static HttpClient Client = new HttpClient();
public static void Main(string[] args)
{
//input path strings...
List<string> links = new List<string>();
links.AddRange(File.ReadAllLines(input));
List<string> longList = new List<string>(File.ReadAllLines(#"a.txt"));
words.AddRange(File.ReadAllLines(output1));
System.Net.ServicePointManager.DefaultConnectionLimit = 8;
count = longList.Count;
//for (int i = 0; i < longList.Count; i++)
Task.Run(() => Parallel.For(0, longList.Count, new ParallelOptions { MaxDegreeOfParallelism = 5 }, (i, loopState) =>
{
Console.WriteLine(i);
string link = #"some link" + longList[i] + "/";
try
{
if (!links.Contains(link))
{
Task.Run(async () => { await Download(link); }).Wait();
}
}
catch (System.Exception e)
{
}
}));
//}
}
public static async Task Download(string link)
{
HtmlAgilityPack.HtmlDocument document = new HtmlDocument();
document.LoadHtml(await getURL(link));
//...stuff with html agility pack
}
public static async Task<string> getURL(string link)
{
string result = "";
HttpResponseMessage response = await Client.GetAsync(link);
Console.WriteLine(response.StatusCode);
if(response.IsSuccessStatusCode)
{
HttpContent content = response.Content;
var bytes = await response.Content.ReadAsByteArrayAsync();
result = Encoding.UTF8.GetString(bytes);
}
return result;
}
}
There are solutions for example this one, but I don't know how to put await keyword in my main method, and currently the program simply exits due to its absence before Task.Run(). As you can see I have already applied a workaround regarding async Download() method to call it in main method.
I have also doubts regarding the use of same instance of httpclient in different parallel threads. Please advise me whether I should create new instance of httpclient each time.
You're right that you have to block tasks somewhere in a console application, otherwise the program will just exit before it's complete. But you're doing this more than you need to. Aim for just blocking the main thread and delegating the rest to an async method. A good practice is to create a method with a signature like private async Task MainAsyc(args), put the "guts" of your program logic there, call it from Main like this:
MainAsync(args).Wait();
In your example, move everything from Main to MainAsync. Then you're free to use await as much as you want. Task.Run and Parallel.For are explicitly consuming new threads for I/O bound work, which is unnecessary in the async world. Use Task.WhenAll instead. The last part of your MainAsync method should end up looking something like this:
await Task.WhenAll(longList.Select(async s => {
Console.WriteLine(i);
string link = #"some link" + s + "/";
try
{
if (!links.Contains(link))
{
await Download(link);
}
}
catch (System.Exception e)
{
}
}));
There is one little wrinkle here though. Your example is throttling the parallelism at 5. If you find you still need this, TPL Dataflow is a great library for throttled parallelism in the async world. Here's a simple example.
Regarding HttpClient, using a single instance across threads is completely safe and highly encouraged.
I'm using DownloadFileAsyncTask method to download files. However, when i execute it in a loop i get an exception, which tells me concurrent operations are not supported. I tried to fix it like this:
public async Task<string> Download(string uri, string path)
{
if (uri == null) return;
//manually wait for previous task to complete
while (Client.IsBusy)
{
await Task.Delay(10);
}
await Client.DownloadFileTaskAsync(new Uri(absoluteUri), path);
return path;
}
Sometimes it works, when a number of iterations isn't big(1-5), and when it runs 10 or more times i'm getting this error.
Client here is a WebClient and i create it once. I don't produce a new Clients on every iteration because it makes an overhead.
Back to i was saying, how to make WebClient wait before previous download finishes? Also a question here is why IsBusy works for small amount of downloads.
The code i'm using:
public IEnumerable<Task<string>> GetPathById(IEnumerable<Photo> photos)
{
return photos?.Select(
async photo =>
{
var path = await Download(Uri, Path);
return path;
});
}
I want to download many files and don't block my Ui thread. Maybe there is other way to do this?
You're missing a lot of code that is necessary to help you out so I wrote this quick example to show you what I'm thinking you might want to try. Its in .NET Core but its essentially the same, just swap HttpClient for WebClient.
static void Main(string[] args)
{
Task.Run(async () =>
{
var toDownload = new string[] { "http://google.com", "http://microsoft.com", "http://apple.com" };
var client = new HttpClient();
var downloadedItems = await DownloadItems(client, toDownload);
Console.WriteLine("This is async");
foreach (var item in downloadedItems)
{
Console.WriteLine(item);
}
Console.ReadLine();
}).Wait();
}
static async Task<IEnumerable<string>> DownloadItems(HttpClient client, string[] uris)
{
// This sets up each page to be downloaded using the same HttpClient.
var items = new List<string>();
foreach (var uri in uris)
{
var item = await Download(client, uri);
items.Add(item);
}
return items;
}
static async Task<string> Download(HttpClient client, string uri)
{
// This download the page and returns the content.
if (string.IsNullOrEmpty(uri)) return null;
var content = await client.GetStringAsync(uri);
return content;
}
Originally I wrote my C# program using Threads and ThreadPooling, but most of the users on here were telling me Async was a better approach for more efficiency. My program sends JSON objects to a server until status code of 200 is returned and then moves on to the next task.
The problem is once one of my Tasks retrieves a status code of 200, it waits for the other Tasks to get 200 code and then move onto the next Task. I want each task to continue it's next task without waiting for other Tasks to finish (or catch up) by receiving a 200 response.
Below is my main class to run my tasks in parallel.
public static void Main (string[] args)
{
var tasks = new List<Task> ();
for (int i = 0; i < 10; i++) {
tasks.Add (getItemAsync (i));
}
Task.WhenAny (tasks);
}
Here is the getItemAsync() method that does the actual sending of information to another server. Before this method works, it needs a key. The problem also lies here, let's say I run 100 tasks, all 100 tasks will wait until every single one has a key.
public static async Task getItemAsync (int i)
{
if (key == "") {
await getProductKey (i).ConfigureAwait(false);
}
Uri url = new Uri ("http://www.website.com/");
var content = new FormUrlEncodedContent (new[] {
...
});
while (!success) {
using (HttpResponseMessage result = await client.PostAsync (url, content)) {
if (result.IsSuccessStatusCode) {
string resultContent = await result.Content.ReadAsStringAsync ().ConfigureAwait(false);
Console.WriteLine(resultContent);
success=true;
}
}
}
await doMoreAsync (i);
}
Here's the function that retrieves the keys, it uses HttpClient and gets parsed.
public static async Task getProductKey (int i)
{
string url = "http://www.website.com/key";
var stream = await client.GetStringAsync (url).ConfigureAwait(false);
var doc = new HtmlDocument ();
doc.LoadHtml (stream);
HtmlNode node = doc.DocumentNode.SelectNodes ("//input[#name='key']") [0];
try {
key = node.Attributes ["value"].Value;
} catch (Exception e) {
Console.WriteLine ("Exception: " + e);
}
}
After every task receives a key and has a 200 status code, it runs doMoreAsync(). I want any individual Task that retrieved the 200 code to run doMoreAsync() without waiting for other tasks to catch up. How do I go about this?
Your Main method is not waiting for a Task to complete, it justs fires off a bunch of async tasks and returns.
If you want to await an async task you can only do that from an async method. Workaround is to kick off an async task from the Main method and wait for its completion using the blocking Task.Wait():
public static void Main(string[] args) {
Task.Run(async () =>
{
var tasks = new List<Task> ();
for (int i = 0; i < 10; i++) {
tasks.Add (getItemAsync (i));
}
var finishedTask = await Task.WhenAny(tasks); // This awaits one task
}).Wait();
}
When you are invoking getItemAsync() from an async method, you can remove the ConfigureAwait(false) as well. ConfigureAwait(false) just makes sure some code does not execute on the UI thread.
If you want to append another task to a task, you can also append it to the previous Task directly by using ContinueWith():
getItemAsync(i).ContinueWith(anotherTask);
Your main problem seems to be that you're sharing the key field across all your concurrently-running asynchronous operations. This will inevitably lead to race hazards. Instead, you should alter your getProductKey method to return each retrieved key as the result of its asynchronous operation:
// ↓ result type of asynchronous operation
public static async Task<string> getProductKey(int i)
{
// retrieval logic here
// return key as result of asynchronous operation
return node.Attributes["value"].Value;
}
Then, you consume it like so:
public static async Task getItemAsync(int i)
{
string key;
try
{
key = await getProductKey(i).ConfigureAwait(false);
}
catch
{
// handle exceptions
}
// use local variable 'key' in further asynchronous operations
}
Finally, in your main logic, use a WaitAll to prevent the console application from terminating before all your tasks complete. Although WaitAll is a blocking call, it is acceptable in this case since you want the main thread to block.
public static void Main (string[] args)
{
var tasks = new List<Task> ();
for (int i = 0; i < 10; i++) {
tasks.Add(getItemAsync(i));
}
Task.WaitAll(tasks.ToArray());
}
How do I turn the following into a Parallel.ForEach?
public async void getThreadContents(String[] threads)
{
HttpClient client = new HttpClient();
List<String> usernames = new List<String>();
int i = 0;
foreach (String url in threads)
{
i++;
progressLabel.Text = "Scanning thread " + i.ToString() + "/" + threads.Count<String>();
HttpResponseMessage response = await client.GetAsync(url);
String content = await response.Content.ReadAsStringAsync();
String user;
Predicate<String> userPredicate;
foreach (Match match in regex.Matches(content))
{
user = match.Groups[1].ToString();
userPredicate = (String x) => x == user;
if (usernames.Find(userPredicate) != user)
{
usernames.Add(match.Groups[1].ToString());
}
}
progressBar1.PerformStep();
}
}
I coded it in the assumption that asynchronous and parallel processing would be the same, and I just realized it isn't. I took a look at all the questions I could find on this, and I really can't seem to find an example that does it for me. Most of them lack readable variable names. Using single-letter variable names which don't explain what they contain is a horrible way to state an example.
I normally have between 300 and 2000 entries in the array named threads (Contains URL's to forum threads) and it would seem that parallel processing (Due to the many HTTP requests) would speed up the execution).
Do I have to remove all the asynchrony (I got nothing async outside the foreach, only variable definitions) before I can use Parallel.ForEach? How should I go about doing this? Can I do this without blocking the main thread?
I am using .NET 4.5 by the way.
I coded it in the assumption that asynchronous and parallel processing would be the same
Asynchronous processing and parallel processing are quite different. If you don't understand the difference, I think you should first read more about it (for example what is the relation between Asynchronous and parallel programming in c#?).
Now, what you want to do is actually not that simple, because you want to process a big collection asynchronously, with a specific degree of parallelism (8). With synchronous processing, you could use Parallel.ForEach() (along with ParallelOptions to configure the degree of parallelism), but there is no simple alternative that would work with async.
In your code, this is complicated by the fact that you expect everything to execute on the UI thread. (Though ideally, you shouldn't access the UI directly from your computation. Instead, you should use IProgress, which would mean the code no longer has to execute on the UI thread.)
Probably the best way to do this in .Net 4.5 is to use TPL Dataflow. Its ActionBlock does exactly what you want, but it can be quite verbose (because it's more flexible than what you need). So it makes sense to create a helper method:
public static Task AsyncParallelForEach<T>(
IEnumerable<T> source, Func<T, Task> body,
int maxDegreeOfParallelism = DataflowBlockOptions.Unbounded,
TaskScheduler scheduler = null)
{
var options = new ExecutionDataflowBlockOptions
{
MaxDegreeOfParallelism = maxDegreeOfParallelism
};
if (scheduler != null)
options.TaskScheduler = scheduler;
var block = new ActionBlock<T>(body, options);
foreach (var item in source)
block.Post(item);
block.Complete();
return block.Completion;
}
In your case, you would use it like this:
await AsyncParallelForEach(
threads, async url => await DownloadUrl(url), 8,
TaskScheduler.FromCurrentSynchronizationContext());
Here, DownloadUrl() is an async Task method that processes a single URL (the body of your loop), 8 is the degree of parallelism (probably shouldn't be a literal constant in real code) and FromCurrentSynchronizationContext() makes sure the code executes on the UI thread.
Stephen Toub has a good blog post on implementing a ForEachAsync. Svick's answer is quite good for platforms on which Dataflow is available.
Here's an alternative, using the partitioner from the TPL:
public static Task ForEachAsync<T>(this IEnumerable<T> source,
int degreeOfParallelism, Func<T, Task> body)
{
var partitions = Partitioner.Create(source).GetPartitions(degreeOfParallelism);
var tasks = partitions.Select(async partition =>
{
using (partition)
while (partition.MoveNext())
await body(partition.Current);
});
return Task.WhenAll(tasks);
}
You can then use this as such:
public async Task getThreadContentsAsync(String[] threads)
{
HttpClient client = new HttpClient();
ConcurrentDictionary<String, object> usernames = new ConcurrentDictionary<String, object>();
await threads.ForEachAsync(8, async url =>
{
HttpResponseMessage response = await client.GetAsync(url);
String content = await response.Content.ReadAsStringAsync();
String user;
foreach (Match match in regex.Matches(content))
{
user = match.Groups[1].ToString();
usernames.TryAdd(user, null);
}
progressBar1.PerformStep();
});
}
Yet another alternative is using SemaphoreSlim or AsyncSemaphore (which is included in my AsyncEx library and supports many more platforms than SemaphoreSlim):
public async Task getThreadContentsAsync(String[] threads)
{
SemaphoreSlim semaphore = new SemaphoreSlim(8);
HttpClient client = new HttpClient();
ConcurrentDictionary<String, object> usernames = new ConcurrentDictionary<String, object>();
await Task.WhenAll(threads.Select(async url =>
{
await semaphore.WaitAsync();
try
{
HttpResponseMessage response = await client.GetAsync(url);
String content = await response.Content.ReadAsStringAsync();
String user;
foreach (Match match in regex.Matches(content))
{
user = match.Groups[1].ToString();
usernames.TryAdd(user, null);
}
progressBar1.PerformStep();
}
finally
{
semaphore.Release();
}
}));
}
You can try the ParallelForEachAsync extension method from AsyncEnumerator NuGet Package:
using System.Collections.Async;
public async void getThreadContents(String[] threads)
{
HttpClient client = new HttpClient();
List<String> usernames = new List<String>();
int i = 0;
await threads.ParallelForEachAsync(async url =>
{
i++;
progressLabel.Text = "Scanning thread " + i.ToString() + "/" + threads.Count<String>();
HttpResponseMessage response = await client.GetAsync(url);
String content = await response.Content.ReadAsStringAsync();
String user;
Predicate<String> userPredicate;
foreach (Match match in regex.Matches(content))
{
user = match.Groups[1].ToString();
userPredicate = (String x) => x == user;
if (usernames.Find(userPredicate) != user)
{
usernames.Add(match.Groups[1].ToString());
}
}
// THIS CALL MUST BE THREAD-SAFE!
progressBar1.PerformStep();
},
maxDegreeOfParallelism: 8);
}