I want to avoid application crashing problem due to parallel for loop and httpclient but I am unable to apply solutions that are provided elsewhere on the web due to my limited knowledge of programming. My code is pasted below.
class Program
{
public static List<string> words = new List<string>();
public static int count = 0;
public static string output = "";
private static HttpClient Client = new HttpClient();
public static void Main(string[] args)
{
//input path strings...
List<string> links = new List<string>();
links.AddRange(File.ReadAllLines(input));
List<string> longList = new List<string>(File.ReadAllLines(#"a.txt"));
words.AddRange(File.ReadAllLines(output1));
System.Net.ServicePointManager.DefaultConnectionLimit = 8;
count = longList.Count;
//for (int i = 0; i < longList.Count; i++)
Task.Run(() => Parallel.For(0, longList.Count, new ParallelOptions { MaxDegreeOfParallelism = 5 }, (i, loopState) =>
{
Console.WriteLine(i);
string link = #"some link" + longList[i] + "/";
try
{
if (!links.Contains(link))
{
Task.Run(async () => { await Download(link); }).Wait();
}
}
catch (System.Exception e)
{
}
}));
//}
}
public static async Task Download(string link)
{
HtmlAgilityPack.HtmlDocument document = new HtmlDocument();
document.LoadHtml(await getURL(link));
//...stuff with html agility pack
}
public static async Task<string> getURL(string link)
{
string result = "";
HttpResponseMessage response = await Client.GetAsync(link);
Console.WriteLine(response.StatusCode);
if(response.IsSuccessStatusCode)
{
HttpContent content = response.Content;
var bytes = await response.Content.ReadAsByteArrayAsync();
result = Encoding.UTF8.GetString(bytes);
}
return result;
}
}
There are solutions for example this one, but I don't know how to put await keyword in my main method, and currently the program simply exits due to its absence before Task.Run(). As you can see I have already applied a workaround regarding async Download() method to call it in main method.
I have also doubts regarding the use of same instance of httpclient in different parallel threads. Please advise me whether I should create new instance of httpclient each time.
You're right that you have to block tasks somewhere in a console application, otherwise the program will just exit before it's complete. But you're doing this more than you need to. Aim for just blocking the main thread and delegating the rest to an async method. A good practice is to create a method with a signature like private async Task MainAsyc(args), put the "guts" of your program logic there, call it from Main like this:
MainAsync(args).Wait();
In your example, move everything from Main to MainAsync. Then you're free to use await as much as you want. Task.Run and Parallel.For are explicitly consuming new threads for I/O bound work, which is unnecessary in the async world. Use Task.WhenAll instead. The last part of your MainAsync method should end up looking something like this:
await Task.WhenAll(longList.Select(async s => {
Console.WriteLine(i);
string link = #"some link" + s + "/";
try
{
if (!links.Contains(link))
{
await Download(link);
}
}
catch (System.Exception e)
{
}
}));
There is one little wrinkle here though. Your example is throttling the parallelism at 5. If you find you still need this, TPL Dataflow is a great library for throttled parallelism in the async world. Here's a simple example.
Regarding HttpClient, using a single instance across threads is completely safe and highly encouraged.
Related
I have a mini-project that requires to download html documents of multiple websites using C# and make it perform as fast as possible. In my scenario I might need to switch IP using proxies based on certain conditions. I want to take advantage of C# Asynchronous Tasks to make it execute as many requests as possible in order for the whole process to be fast and efficient.
Here's the code I have so far.
public class HTMLDownloader
{
public static List<string> URL_LIST = new List<string>();
public static List<string> HTML_DOCUMENTS = new List<string>();
public static void Main()
{
for (var i = 0; i < URL_LIST.Count; i++)
{
var html = Task.Run(() => Run(URL_LIST[i]));
HTML_DOCUMENTS.Add(html.Result);
}
}
public static async Task<string> Run(string url)
{
var client = new WebClient();
//Handle Proxy Credentials client.Proxy= new WebProxy();
string html = "";
try
{
html = await client.DownloadStringTaskAsync(new Uri(url));
//if(condition ==true)
//{
// return html;
//}
//else
//{
// Switch IP and try again
//}
}
catch (Exception e)
{
}
return html;
}
The problem here is that I'm not really taking advantage of sending multiple web requests because each request has to finish in order for the next one to begin. Is there a better approach to this? For example, send 10 web requests at a time and then send a new request when one of those requests is finished.
Thanks
I want to take advantage of C# Asynchronous Tasks to make it execute as many requests as possible in order for the whole process to be fast and efficient.
You can use Task.WhenAll to get asynchronous concurrency.
For example, send 10 web requests at a time and then send a new request when one of those requests is finished.
To throttle asynchronous concurrency, use SemaphoreSlim:
public static async Task Main()
{
using var limit = new SemaphoreSlim(10); // 10 at a time
var tasks = URL_LIST.Select(Process).ToList();
var results = await Task.WhenAll(tasks);
HTML_DOCUMENTS.AddRange(results);
async Task<string> Process(string url)
{
await limit.WaitAsync();
try { return await Run(url); }
finally { limit.Release(); }
}
}
One way is to use Task.WhenAll.
Creates a task that will complete when all of the supplied tasks have
completed.
The premise is, Select all the tasks into a List, await the list of task with Task.WhenAll, Select the results
public static async Task Main()
{
var tasks = URL_LIST.Select(Run);
await Task.WhenAll(tasks);
var results = tasks.Select(x => x.Result);
}
Note : The result of WhenAll will be the collection of results as well
First change your Main to be async.
Then you can use LINQ Select to run the Tasks in parallel.
public static async Task Main()
{
var tasks = URL_LIST.Select(Run);
string[] documents = await Task.WhenAll(tasks);
HTML_DOCUMENTS.AddRange(documents);
}
Task.WhenAll will unwrap the Task results into an array, once all the tasks are complete.
Reading Stephen Cleary take on not blocking on Async code I write something like this
public static async Task<JObject> GetJsonAsync(Uri uri)
{
using (var client = new HttpClient())
{
var jsonString = await client.GetStringAsync(uri).ConfigureAwait(false);
return JObject.Parse(jsonString);
}
}
public async void Button1_Click(...)
{
var json = await GetJsonAsync(...);
textBox1.Text=json;
}
so far so good, I understand that after the ConfigureAwait the method is going to continue running on a different context after GetStringAsync returns.
but what about if I want to use something like MessageBox (which is UI) like this
public static async Task<JObject> GetJsonAsync(Uri uri)
{
if(someValue<MAXVALUE)
{
using (var client = new HttpClient())
{
//var jsonString = await client.GetStringAsync(uri); //starts the REST request
var jsonString = await client.GetStringAsync(uri).ConfigureAwait(false);
return JObject.Parse(jsonString);
}
}
else
{
MessageBox.Show("The parameter someValue is too big!");
}
}
can I do this?
Even more complicated, how about this?
public static async Task<JObject> GetJsonAsync(Uri uri)
{
if(someValue<MAXVALUE)
{
try{
using (var client = new HttpClient())
{
//var jsonString = await client.GetStringAsync(uri); //starts the REST request
var jsonString = await client.GetStringAsync(uri).ConfigureAwait(false);
return JObject.Parse(jsonString);
}
}
catch(Exception ex)
{
MessageBox.Show("An Exception was raised!");
}
}
else
{
MessageBox.Show("The parameter someValue is too big!");
}
}
Can I do this?
Now, I am thinking perhaps all the message boxes should be called outside GetJsonAync as good design, but my question is can the above thing be done?
can I do this? [use a MessageBox]
Yes, but mainly because it has nothing to do with async/await or threading.
MessageBox.Show() is special, it is a static method and is documented as thread-safe.
You can show a MessageBox from any thread, any time.
So maybe it was the wrong example, but you do have MessageBox in the title.
public static async Task<JObject> GetJsonAsync(Uri uri)
{
try{
... // old context
... await client.GetStringAsync(uri).ConfigureAwait(false);
... // new context
}
catch
{
// this might bomb
someLabel.Text = "An Exception was raised!";
}
}
In this example, there could be code paths where the catch runs on the old and other paths where it runs on the new context.
Bottom line is: you don't know and should assume the worst case.
I would not use a Message Box, as it is very limited, and dated.
Also, Pop up's are annoying.
Use your own user control which enables user interaction the way you intend it.
In the context of Winforms / WPF / (and I guess UWP), only a single thread can manipulate the UI. Other threads can issue work to it via a queue of actions which eventually get invoked.
This architecture prevents other threads from constantly poking at the UI, which can make UX very janky (and thread unsafe).
The only way to communicate with it the UI work queue (in Winforms) is via the System.Windows.Form.Controls.BeginInvoke instance method, found on every form and control.
In your case:
public async void Button1_Click(...)
{
var json = await GetJsonAsync(...).ConfigureAwait(false);
BeginInvoke(UpdateTextBox, json);
}
private void UpdateTextBox(string value)
{
textBox1.Text=json;
}
I'm writing a .NET Core Console App that needs to continuously read data from multiple WebSockets. My current approach is to create a new Task (via Task.Run) per WebSocket that runs an infinite while loop and blocks until it reads the data from the socket. However, since the data is pushed at a rather low frequency, the threads just block most of the time which seems quite inefficient.
From my understanding, the async/await pattern should be ideal for blocking I/O operations. However, I'm not sure how to apply it for my situation or even if async/await can improve this in any way - especially since it's a Console app.
I've put together a proof of concept (doing a HTTP GET instead of reading from WebSocket for simplicity). The only way I was able to achieve this was without actually awaiting. Code:
static void Main(string[] args)
{
Console.WriteLine($"ThreadId={ThreadId}: Main");
Task task = Task.Run(() => Process("https://duckduckgo.com", "https://stackoverflow.com/"));
// Do main work.
task.Wait();
}
private static void Process(params string[] urls)
{
Dictionary<string, Task<string>> tasks = urls.ToDictionary(x => x, x => (Task<string>)null);
HttpClient client = new HttpClient();
while (true)
{
foreach (string url in urls)
{
Task<string> task = tasks[url];
if (task == null || task.IsCompleted)
{
if (task != null)
{
string result = task.Result;
Console.WriteLine($"ThreadId={ThreadId}: Length={result.Length}");
}
tasks[url] = ReadString(client, url);
}
}
Thread.Yield();
}
}
private static async Task<string> ReadString(HttpClient client, string url)
{
var response = await client.GetAsync(url);
Console.WriteLine($"ThreadId={ThreadId}: Url={url}");
return await response.Content.ReadAsStringAsync();
}
private static int ThreadId => Thread.CurrentThread.ManagedThreadId;
This seems to be working and executing on various Worker Threads on the ThreadPool. However, this definitely doesn't seem as any typical async/await code which makes me think there has to be a better way.
Is there a more proper / more elegant way of doing this?
You've basically written a version of Task.WhenAny that uses a CPU loop to check for completed tasks rather than... whatever magic the framework method uses behind the scenes.
A more idiomatic version might look like this. (Although it might not - I feel like there should be an easier method of "re-run the completed task" than the reverse dictionary I've used here.)
static void Main(string[] args)
{
Console.WriteLine($"ThreadId={ThreadId}: Main");
// No need for Task.Run here.
var task = Process("https://duckduckgo.com", "https://stackoverflow.com/");
task.Wait();
}
private static async Task Process(params string[] urls)
{
// Set up initial dictionary mapping task (per URL) to the URL used.
HttpClient client = new HttpClient();
var tasks = urls.ToDictionary(u => client.GetAsync(u), u => u);
while (true)
{
// Wait for any task to complete, get its URL and remove it from the current tasks.
var firstCompletedTask = await Task.WhenAny(tasks.Keys);
var firstCompletedUrl = tasks[firstCompletedTask];
tasks.Remove(firstCompletedTask);
// Do work with completed task.
try
{
Console.WriteLine($"ThreadId={ThreadId}: URL={firstCompletedUrl}");
using (var response = await firstCompletedTask)
{
var content = await response.Content.ReadAsStringAsync();
Console.WriteLine($"ThreadId={ThreadId}: Length={content.Length}");
}
}
catch (Exception ex)
{
Console.WriteLine($"ThreadId={ThreadId}: Ex={ex}");
}
// Queue the task again.
tasks.Add(client.GetAsync(firstCompletedUrl), firstCompletedUrl);
}
}
private static int ThreadId => Thread.CurrentThread.ManagedThreadId;
I've accepted Rawling's answer - I believe it is correct for the exact scenario I described. However, with a bit of inverted logic, I ended up with something way simpler - leaving it in case anyone needs something like this:
static void Main(string[] args)
{
string[] urls = { "https://duckduckgo.com", "https://stackoverflow.com/" };
HttpClient client = new HttpClient();
var tasks = urls.Select(async url =>
{
while (true) await ReadString(client, url);
});
Task.WhenAll(tasks).Wait();
}
private static async Task<string> ReadString(HttpClient client, string url)
{
var response = await client.GetAsync(url);
string data = await response.Content.ReadAsStringAsync();
Console.WriteLine($"Fetched data from url={url}. Length={data.Length}");
return data;
}
Maybe better question is: do you really need thread per socket in this case? You should think of threads as system-wide resource and you should take this into consideration when spawning them, especially if you don't really know the number of threads that your application will be using. This is a good read: What's the maximum number of threads in Windows Server 2003?
Few years ago .NET team introduced Asynchronous sockets.
...The client is built with an asynchronous socket, so execution of
the client application is not suspended while the server returns a
response. The application sends a string to the server and then
displays the string returned by the server on the console.
Asynchronous Client Socket Example
There are a lot more examples out there showcasing this approach. While it is a bit more complicated and "low level" it let's you be in control.
I am working on a program which makes multiple json calls to retrieve it's data.
The data however is pretty big and when running it without async it takes 17 hours to fully process.
The fetching of the data goes as follows:
Call to a service with a page number (2000 pages in total to be processed), which returns 200 records per page.
For each record it returns, an other service needs to be called to receive the data for the current record.
I'm new to the whole async functionality and I've made an attempt using async and await and already made a performance boost but was wondering if this is the correct way of using it and if there are any other ways to increase performance?
This is the code I currently have:
static void Main(string[] args)
{
MainAsyncCall().Wait();
Console.ReadKey();
}
public static async Task MainAsyncCall()
{
ServicePointManager.DefaultConnectionLimit = 999999;
List<Task> allPages = new List<Task>();
for (int i = 0; i <= 10; i++)
{
var page = i;
allPages.Add(Task.Factory.StartNew(() => processPage(page)));
}
Task.WaitAll(allPages.ToArray());
Console.WriteLine("Finished all pages");
}
public static async Task processPage(Int32 page)
{
List<Task> players = new List<Task>();
using (var client = new HttpClient())
{
string url = "<Request URL>";
var response = client.GetAsync(url).Result;
var content = response.Content.ReadAsStringAsync().Result;
dynamic item = Newtonsoft.Json.JsonConvert.DeserializeObject(content);
dynamic data = item.data;
var localPage = page;
Console.WriteLine($"Processing Page: {localPage}");
foreach (dynamic d in data)
{
players.Add(Task.Factory.StartNew(() => processPlayer(d, localPage)));
}
}
Task.WaitAll(players.ToArray());
Console.WriteLine($"Finished Page: {page}");
}
public static async Task processPlayer(dynamic player, int page)
{
using (var client = new HttpClient())
{
string url = "<Request URL>";
HttpResponseMessage response = null;
response = client.GetAsync(url).Result;
var content = await response.Content.ReadAsStringAsync();
dynamic item = Newtonsoft.Json.JsonConvert.DeserializeObject(content);
Console.WriteLine($"{page}: Processed {item.name}");
}
}
Any suggestion is welcome!
This is what it should look like to me:
static void Main(string[] args)
{
// it's okay here to use wait because we're at the root of the application
new AsyncServerCalls().MainAsyncCall().Wait();
Console.ReadKey();
}
public class AsyncServerCalls
{
// dont use static async methods
public async Task MainAsyncCall()
{
ServicePointManager.DefaultConnectionLimit = 999999;
List<Task> allPages = new List<Task>();
for (int i = 0; i <= 10; i++)
{
var page = i;
allPages.Add(processPage(page));
}
await Task.WhenAll(allPages.ToArray());
Console.WriteLine("Finished all pages");
}
public async Task processPage(Int32 page)
{
List<Task> players = new List<Task>();
using (var client = new HttpClient())
{
string url = "<Request URL>";
var response = await client.GetAsync(url)// nope .Result;
var content = await response.Content.ReadAsStringAsync(); // again never use .Result;
dynamic item = Newtonsoft.Json.JsonConvert.DeserializeObject(content);
dynamic data = item.data;
var localPage = page;
Console.WriteLine($"Processing Page: {localPage}");
foreach (dynamic d in data)
{
players.Add(processPlayer(d, localPage)); // no need to put the task unnecessarily on a different thread, let the current SynchronisationContext deal with that
}
}
await Task.WhenAll(players.ToArray()); // always await a task in an async method
Console.WriteLine($"Finished Page: {page}");
}
public async Task processPlayer(dynamic player, int page)
{
using (var client = new HttpClient())
{
string url = "<Request URL>";
HttpResponseMessage response = null;
response = await client.GetAsync(url); // don't use .Result;
var content = await response.Content.ReadAsStringAsync();
dynamic item = Newtonsoft.Json.JsonConvert.DeserializeObject(content);
Console.WriteLine($"{page}: Processed {item.name}");
}
}
}
So basially the points here are to make sure you let the SynchronisationContext do it's job. Inside a console program it should use the TaskSchedular.Default which is a ThreadPool SynchronisationContext. You can always force this by doing:
static void Main(string[] args)
{
Task.Run(() => new AsyncServerCalls().MainAsyncCall()).Wait();
Console.ReadKey();
}
Reference to Task.Run forcing Default
One thing you need to remember, which I got into trouble with last week is that you can fire hose the thread pool, i.e. spawn so many tasks that the your process just dies with insane CPU and Memory usage. So you may need to use a Semaphore to just limit the number of threads that going to be created.
I created a solution that processes a single file in multiple parts all at the same time Parallel Read it is still being worked on, but shows the uses of async stuff
Just to clarify the parallelism.
When you take a reference to all those tasks:
allPages.Add(processPage(page));
They all will be started.
When you do:
await Task.WhenAll(allPages);
This will block the current method execution until all those page processes have been executed (it won't block the current thread though, don't get these confused)
Danger Zone
If you don't want to block method execution on
Task.WhenAll
So, you can parallel all page processes for each page, then you can add that Task to an overall List<Task>.
However, the danger with this is the fire hosing... You are going to limit the number of threads you execute at some point, so where.... well that is up to you but just remember, it will happen at some point.
How do I turn the following into a Parallel.ForEach?
public async void getThreadContents(String[] threads)
{
HttpClient client = new HttpClient();
List<String> usernames = new List<String>();
int i = 0;
foreach (String url in threads)
{
i++;
progressLabel.Text = "Scanning thread " + i.ToString() + "/" + threads.Count<String>();
HttpResponseMessage response = await client.GetAsync(url);
String content = await response.Content.ReadAsStringAsync();
String user;
Predicate<String> userPredicate;
foreach (Match match in regex.Matches(content))
{
user = match.Groups[1].ToString();
userPredicate = (String x) => x == user;
if (usernames.Find(userPredicate) != user)
{
usernames.Add(match.Groups[1].ToString());
}
}
progressBar1.PerformStep();
}
}
I coded it in the assumption that asynchronous and parallel processing would be the same, and I just realized it isn't. I took a look at all the questions I could find on this, and I really can't seem to find an example that does it for me. Most of them lack readable variable names. Using single-letter variable names which don't explain what they contain is a horrible way to state an example.
I normally have between 300 and 2000 entries in the array named threads (Contains URL's to forum threads) and it would seem that parallel processing (Due to the many HTTP requests) would speed up the execution).
Do I have to remove all the asynchrony (I got nothing async outside the foreach, only variable definitions) before I can use Parallel.ForEach? How should I go about doing this? Can I do this without blocking the main thread?
I am using .NET 4.5 by the way.
I coded it in the assumption that asynchronous and parallel processing would be the same
Asynchronous processing and parallel processing are quite different. If you don't understand the difference, I think you should first read more about it (for example what is the relation between Asynchronous and parallel programming in c#?).
Now, what you want to do is actually not that simple, because you want to process a big collection asynchronously, with a specific degree of parallelism (8). With synchronous processing, you could use Parallel.ForEach() (along with ParallelOptions to configure the degree of parallelism), but there is no simple alternative that would work with async.
In your code, this is complicated by the fact that you expect everything to execute on the UI thread. (Though ideally, you shouldn't access the UI directly from your computation. Instead, you should use IProgress, which would mean the code no longer has to execute on the UI thread.)
Probably the best way to do this in .Net 4.5 is to use TPL Dataflow. Its ActionBlock does exactly what you want, but it can be quite verbose (because it's more flexible than what you need). So it makes sense to create a helper method:
public static Task AsyncParallelForEach<T>(
IEnumerable<T> source, Func<T, Task> body,
int maxDegreeOfParallelism = DataflowBlockOptions.Unbounded,
TaskScheduler scheduler = null)
{
var options = new ExecutionDataflowBlockOptions
{
MaxDegreeOfParallelism = maxDegreeOfParallelism
};
if (scheduler != null)
options.TaskScheduler = scheduler;
var block = new ActionBlock<T>(body, options);
foreach (var item in source)
block.Post(item);
block.Complete();
return block.Completion;
}
In your case, you would use it like this:
await AsyncParallelForEach(
threads, async url => await DownloadUrl(url), 8,
TaskScheduler.FromCurrentSynchronizationContext());
Here, DownloadUrl() is an async Task method that processes a single URL (the body of your loop), 8 is the degree of parallelism (probably shouldn't be a literal constant in real code) and FromCurrentSynchronizationContext() makes sure the code executes on the UI thread.
Stephen Toub has a good blog post on implementing a ForEachAsync. Svick's answer is quite good for platforms on which Dataflow is available.
Here's an alternative, using the partitioner from the TPL:
public static Task ForEachAsync<T>(this IEnumerable<T> source,
int degreeOfParallelism, Func<T, Task> body)
{
var partitions = Partitioner.Create(source).GetPartitions(degreeOfParallelism);
var tasks = partitions.Select(async partition =>
{
using (partition)
while (partition.MoveNext())
await body(partition.Current);
});
return Task.WhenAll(tasks);
}
You can then use this as such:
public async Task getThreadContentsAsync(String[] threads)
{
HttpClient client = new HttpClient();
ConcurrentDictionary<String, object> usernames = new ConcurrentDictionary<String, object>();
await threads.ForEachAsync(8, async url =>
{
HttpResponseMessage response = await client.GetAsync(url);
String content = await response.Content.ReadAsStringAsync();
String user;
foreach (Match match in regex.Matches(content))
{
user = match.Groups[1].ToString();
usernames.TryAdd(user, null);
}
progressBar1.PerformStep();
});
}
Yet another alternative is using SemaphoreSlim or AsyncSemaphore (which is included in my AsyncEx library and supports many more platforms than SemaphoreSlim):
public async Task getThreadContentsAsync(String[] threads)
{
SemaphoreSlim semaphore = new SemaphoreSlim(8);
HttpClient client = new HttpClient();
ConcurrentDictionary<String, object> usernames = new ConcurrentDictionary<String, object>();
await Task.WhenAll(threads.Select(async url =>
{
await semaphore.WaitAsync();
try
{
HttpResponseMessage response = await client.GetAsync(url);
String content = await response.Content.ReadAsStringAsync();
String user;
foreach (Match match in regex.Matches(content))
{
user = match.Groups[1].ToString();
usernames.TryAdd(user, null);
}
progressBar1.PerformStep();
}
finally
{
semaphore.Release();
}
}));
}
You can try the ParallelForEachAsync extension method from AsyncEnumerator NuGet Package:
using System.Collections.Async;
public async void getThreadContents(String[] threads)
{
HttpClient client = new HttpClient();
List<String> usernames = new List<String>();
int i = 0;
await threads.ParallelForEachAsync(async url =>
{
i++;
progressLabel.Text = "Scanning thread " + i.ToString() + "/" + threads.Count<String>();
HttpResponseMessage response = await client.GetAsync(url);
String content = await response.Content.ReadAsStringAsync();
String user;
Predicate<String> userPredicate;
foreach (Match match in regex.Matches(content))
{
user = match.Groups[1].ToString();
userPredicate = (String x) => x == user;
if (usernames.Find(userPredicate) != user)
{
usernames.Add(match.Groups[1].ToString());
}
}
// THIS CALL MUST BE THREAD-SAFE!
progressBar1.PerformStep();
},
maxDegreeOfParallelism: 8);
}