I have a list of 10,000,000 URLs in a text file. I open each of them in my await/async method - at the beginning the speed is very good (near 10,000 URLs/min), but while the program is running it keeps decreasing, reaching 500 URLs/min after ~10 hours. When I restart the program and run it from the beginning, the situation is the same - fast at first, then slower and slower. I'm working on Windows Server 2008 R2. I tested my code on various PCs - same results. Can you tell me where the problem is?
int finishedUrls = 0;
IEnumerable<string> urls = File.ReadLines("urlslist.txt");

await urls.ForEachAsync(500, async url =>
{
    Uri newUri;
    if (!Uri.TryCreate(url, UriKind.Absolute, out newUri)) return;
    _uri = newUri;

    var timeout = new CancellationTokenSource(TimeSpan.FromSeconds(30));
    string html = "";

    using (var _httpClient = new HttpClient { Timeout = TimeSpan.FromSeconds(30), MaxResponseContentBufferSize = 300000 })
    {
        using (var _req = new HttpRequestMessage(HttpMethod.Get, _uri))
        {
            using (var _response = await _httpClient.SendAsync(_req, HttpCompletionOption.ResponseContentRead, timeout.Token).ConfigureAwait(false))
            {
                if (_response != null &&
                    (_response.StatusCode == HttpStatusCode.OK || _response.StatusCode == HttpStatusCode.NotFound))
                {
                    using (var cancel = timeout.Token.Register(_response.Dispose))
                    {
                        var rawResponse = await _response.Content.ReadAsByteArrayAsync().ConfigureAwait(false);
                        html = Encoding.UTF8.GetString(rawResponse);
                    }
                }
            }
        }
    }

    Interlocked.Increment(ref finishedUrls);
});
(ForEachAsync comes from http://blogs.msdn.com/b/pfxteam/archive/2012/03/05/10278165.aspx)
I believe you are exhausting your I/O completion ports. You need to throttle your requests. If you need higher concurrency than a single box can handle, then distribute your concurrent requests across more machines. I'd suggest using TPL for managing the concurrency. I ran into this exact same behavior doing similar things. Also, you should absolutely not be disposing your HttpClient per request. Pull that code out and use a single client.
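For illustration, here is a rough sketch of that advice: one shared HttpClient plus a TPL Dataflow ActionBlock that caps the number of requests in flight. This assumes the System.Threading.Tasks.Dataflow package, and the parallelism and buffer limits are arbitrary example values, not tuned ones:

using System;
using System.IO;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow; // NuGet package: System.Threading.Tasks.Dataflow

static class Crawler
{
    // One shared client for the whole run; connections get pooled and reused.
    private static readonly HttpClient _httpClient = new HttpClient
    {
        Timeout = TimeSpan.FromSeconds(30)
    };

    private static int _finishedUrls;

    public static async Task RunAsync()
    {
        // The ActionBlock caps how many requests run at once.
        var fetchBlock = new ActionBlock<string>(async url =>
        {
            if (!Uri.TryCreate(url, UriKind.Absolute, out Uri uri)) return;
            try
            {
                using (var response = await _httpClient.GetAsync(uri).ConfigureAwait(false))
                {
                    string html = await response.Content.ReadAsStringAsync().ConfigureAwait(false);
                    // ... process html here ...
                }
            }
            catch (Exception)
            {
                // log and continue; one bad URL should not stop the crawl
            }
            Interlocked.Increment(ref _finishedUrls);
        },
        new ExecutionDataflowBlockOptions
        {
            MaxDegreeOfParallelism = 64, // arbitrary example limit
            BoundedCapacity = 128        // makes SendAsync apply back-pressure
        });

        foreach (string url in File.ReadLines("urlslist.txt"))
            await fetchBlock.SendAsync(url); // waits when the block is full

        fetchBlock.Complete();
        await fetchBlock.Completion;
    }
}

The essential parts are the single long-lived client and the fixed degree of parallelism; the exact numbers depend on your server and your box.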
Related
I'm trying to make several GET requests to an API, in parallel, but I'm getting an error ("Too many requests") when trying to do large volumes of requests (1600 items).
The following is a snippet of the code.
Call:
var metadataItemList = await GetAssetMetadataBulk(unitHashList_Unique);
Method:
private static async Task<List<MetadataModel>> GetAssetMetadataBulk(List<string> assetHashes)
{
    List<MetadataModel> resultsList = new();
    int batchSize = 100;
    int batches = (int)Math.Ceiling((double)assetHashes.Count / batchSize);

    for (int i = 0; i < batches; i++)
    {
        var currentAssets = assetHashes.Skip(i * batchSize).Take(batchSize);
        var tasks = currentAssets.Select(asset => EndpointProcessor<MetadataModel>.LoadAddress($"assets/{asset}"));
        resultsList.AddRange(await Task.WhenAll(tasks));
    }

    return resultsList;
}
The method runs tasks in parallel in batches of 100, it works fine for small volumes of requests (<~300), but for greater amounts (~1000+), I get the aforementioned "Too many requests" error.
I tried stepping through the code, and to my surprise, it worked when I manually stepped through it. But I need it to work automatically.
Is there any way to slow down requests, or a better way to somehow circumvent the error whilst maintaining relatively good performance?
The request does not return a "Retry-After" header, and I also don't know how I'd implement this in C#. Any input on what code to edit, or direction to a doc is much appreciated!
The following is the Class I'm using to send HTTP requests:
class EndpointProcessor<T>
{
    public static async Task<T> LoadAddress(string url)
    {
        using (HttpResponseMessage response = await ApiHelper.apiClient.GetAsync(url))
        {
            if (response.IsSuccessStatusCode)
            {
                T result = await response.Content.ReadAsAsync<T>();
                return result;
            }
            else
            {
                //Console.WriteLine("Error: {0} ({1})\nTrailingHeaders: {2}\n", response.StatusCode, response.ReasonPhrase, response.TrailingHeaders);
                throw new Exception(response.ReasonPhrase);
            }
        }
    }
}
You can use a semaphore as a limiter for the currently active requests. Add a field of type Semaphore (or better, SemaphoreSlim, which supports asynchronous waiting) to your API client and initialize it with a maximum and initial count of, say, 250, or whatever you determine to be a safe maximum number of running requests. In your ApiHelper.apiClient.GetAsync() call path, acquire the semaphore before making the real connection, then release it after the download completes or fails. This will allow you to enforce a maximum number of concurrently running requests.
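A minimal sketch of that idea, assuming a SemaphoreSlim and a hypothetical GetWithLimitAsync wrapper on the ApiHelper class referenced above; 250 is just the example limit:

using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

static class ApiHelper
{
    public static readonly HttpClient apiClient = new HttpClient();

    // At most 250 requests may be in flight at any moment (example value).
    private static readonly SemaphoreSlim _limiter = new SemaphoreSlim(250, 250);

    // Hypothetical wrapper: acquire a slot, make the request, always release the slot.
    public static async Task<HttpResponseMessage> GetWithLimitAsync(string url)
    {
        await _limiter.WaitAsync();
        try
        {
            return await apiClient.GetAsync(url);
        }
        finally
        {
            _limiter.Release();
        }
    }
}

EndpointProcessor<T>.LoadAddress would then call ApiHelper.GetWithLimitAsync(url) instead of apiClient.GetAsync(url), and the batching loop could stay as it is.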
The problem summary: I need to make a call to HTTP resource A while reusing the name resolution from a previous HTTP request to resource B on the same host.
CASE 1. Consecutive calls to the same resource produce a faster result after the 1st call.
The profiler tells me that the difference between the 1st and 2nd call is in DNS name resolution (GetHostAddresses).
var request = (HttpWebRequest)WebRequest.Create("https://www.somehost.com/resources/b.txt");
using (var response = (HttpWebResponse)request.GetResponse()) {}
var request = (HttpWebRequest)WebRequest.Create("https://www.somehost.com/resources/b.txt");
using (var response = (HttpWebResponse)request.GetResponse()) {}
CASE 2. Consecutive calls to different resources on the same host produce the same delay.
The profiler tells me that both incur calls to DNS name resolution.
var request = (HttpWebRequest)WebRequest.Create("https://www.somehost.com/resources/a.txt");
using (var response = (HttpWebResponse)request.GetResponse()) {}
var request = (HttpWebRequest)WebRequest.Create("https://www.somehost.com/resources/b.txt");
using (var response = (HttpWebResponse)request.GetResponse()) {}
I wonder why, in case 2, the second call can't use the DNS cache from the first call - it's the same host.
And the main question: how do I change that?
EDIT: the behaviour above also applies when using the HttpClient class. It appears this is specific to a few web servers I use, and the issue does not happen on other servers. I can't figure out what exactly happens, but I suspect the web servers in question (Amazon CloudFront and Akamai) force-close the connection after it has been served, ignoring my keep-alive request headers. I am going to close this for now, as it is not possible to formulate a coherent question.
Your problem doesn't exist for System.Net.Http.HttpClient; try it instead. It can reuse existing connections (no DNS lookup is needed for such calls), which looks like exactly what you want to achieve. As a bonus, it supports HTTP/2 (this can be enabled with a property assignment when the HttpClient instance is created).
WebRequest is ancient and not recommended by Microsoft for new development, and in .NET 5 HttpClient is also noticeably faster (perhaps twice as fast).
Create the HttpClient instance once per application (link).
private static readonly HttpClient client = new HttpClient();
The analog of your request. Note that await is available only in methods marked as async.
string text = await client.GetStringAsync("https://www.somehost.com/resources/b.txt");
You may also issue multiple requests at once without spawning extra threads.
string[] urls = new string[]
{
    "https://www.somehost.com/resources/a.txt",
    "https://www.somehost.com/resources/b.txt"
};

List<Task<string>> tasks = new List<Task<string>>();

foreach (string url in urls)
{
    tasks.Add(client.GetStringAsync(url));
}

string[] results = await Task.WhenAll(tasks);
If you're not familiar with Asynchronous programming e.g. async/await, start with this article.
You can also set a limit on how many requests are processed at once. Let's do the same request 1000 times with a limit of 10 concurrent requests.
static async Task Main(string[] args)
{
    Stopwatch sw = new Stopwatch();
    string url = "https://www.somehost.com/resources/a.txt";
    using SemaphoreSlim semaphore = new SemaphoreSlim(10);
    List<Task<string>> tasks = new List<Task<string>>();

    sw.Start();
    for (int i = 0; i < 1000; i++)
    {
        await semaphore.WaitAsync();
        tasks.Add(GetPageAsync(url, semaphore));
    }
    string[] results = await Task.WhenAll(tasks);
    sw.Stop();

    Console.WriteLine($"Elapsed: {sw.ElapsedMilliseconds}ms");
}

private static async Task<string> GetPageAsync(string url, SemaphoreSlim semaphore)
{
    try
    {
        return await client.GetStringAsync(url);
    }
    finally
    {
        semaphore.Release();
    }
}
You may measure the time.
I am sending five HttpClient requests to the same URL, but with a varying page number parameter. They all fire asynchronously, and then I wait for them all to finish using Task.WaitAll(). My requests use System.Net.Http.HttpClient.
This mostly works fine, and I get five distinct results representing each page of the data about 99% of the time.
But every so often, and I have not dug into deep analysis yet, I get the exact same response for each task. Each task does indeed instantiate its own HttpClient. When I was reusing one client instance, I got this problem. But since I started instantiating new clients for every call, the problem went away.
I am calling a 3rd party web service over which I have no control, so before nagging their team too much about this, I want to know if I may be doing something wrong here, or if there is some aspect of HttpClient or Task that I'm missing.
Here is the calling code:
for (int i = 1; i <= 5; i++)
{
    page = load_made + i;
    var t_page = page;
    var t_url = url;

    var task = new Task<List<T>>(() => DoPagedLoad<T>(t_page, per_page, t_url));
    task.Run();
    tasks.Add(task);
}

Task.WaitAll(tasks.ToArray());
Here is the code in the DoPagedLoad, which returns a Task:
var client = new HttpClient();
var response = client.GetAsync(url).Result;
var results = response.Content.ReadAsStringAsync().Result;
I would appreciate any help from folks familiar with the possible quirks of Task and HttpClient
NOTE: Run is an extension method to help with async exceptions.
public static Task Run(this Task task)
{
    task.Start();
    task.ContinueWith(t =>
    {
        if (t.Exception != null)
            Log.Error(t.Exception.Flatten().ToString());
    });
    return task;
}
It's hard to give a definitive answer because we don't have all the details, but here's a sample implementation of how you should fire off HTTP requests. Notice that all async operations are awaited - Result and Wait / WaitAll are not used. You should almost never need any of those; they block synchronously and can create problems.
Also notice that there are no global cookie containers, default headers, etc. defined for the HTTP client. If you need any of that stuff, create individual HttpRequestMessage objects and add whatever headers you need to them. Don't use the global properties - it's a lot cleaner to set per-request properties.
// Globally defined HTTP client.
private static readonly HttpClient _httpClient = new HttpClient();

// Other stuff here...

private async Task SomeFunctionToGetContent()
{
    var requestTasks = new List<Task<HttpResponseMessage>>();
    var responseTasks = new List<Task>();

    for (var i = 0; i < 5; i++)
    {
        // Fake URI but still based on the counter (or other
        // variable, similar to page in the question)
        var uri = new Uri($"https://.../{i}.html");
        requestTasks.Add(_httpClient.GetAsync(uri));
    }

    await Task.WhenAll(requestTasks);

    for (var i = 0; i < 5; i++)
    {
        var response = await requestTasks[i];
        responseTasks.Add(HandleResponse(response));
    }

    await Task.WhenAll(responseTasks);
}

private async Task HandleResponse(HttpResponseMessage response)
{
    try
    {
        if (response.Content != null)
        {
            var content = await response.Content.ReadAsStringAsync();
            // do something with content here; check IsSuccessStatusCode to
            // see if the request failed or succeeded
        }
        else
        {
            // Do something when no content
        }
    }
    finally
    {
        response.Dispose();
    }
}
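To illustrate the "per-request properties" point, here is a hedged sketch of sending a custom header via an individual HttpRequestMessage instead of the client's DefaultRequestHeaders; the header name X-Page and the method name are placeholders, not part of the code above:

private static async Task<HttpResponseMessage> GetWithHeadersAsync(Uri uri, int page)
{
    using (var request = new HttpRequestMessage(HttpMethod.Get, uri))
    {
        // Headers set here apply to this request only, not to the shared client.
        request.Headers.Add("X-Page", page.ToString());
        request.Headers.Accept.ParseAdd("application/json");
        return await _httpClient.SendAsync(request);
    }
}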
I have two versions of my program that submit ~3000 HTTP GET requests to a web server.
The first version is based off of what I read here. That solution makes sense to me because making web requests is I/O bound work, and the use of async/await along with Task.WhenAll or Task.WaitAll means that you can submit 100 requests all at once and then wait for them all to finish before submitting the next 100 requests so that you don't bog down the web server. I was surprised to see that this version completed all of the work in ~12 minutes - way slower than I expected.
The second version submits all 3000 HTTP GET requests inside a Parallel.ForEach loop. I use .Result to wait for each request to finish before the rest of the logic within that iteration of the loop can execute. I thought that this would be a far less efficient solution, since using threads to perform tasks in parallel is usually better suited for CPU-bound work, but I was surprised to see that this version completed all of the work within ~3 minutes!
My question is why is the Parallel.ForEach version faster? This came as an extra surprise because when I applied the same two techniques against a different API/web server, version 1 of my code was actually faster than version 2 by about 6 minutes - which is what I expected. Could performance of the two different versions have something to do with how the web server handles the traffic?
You can see a simplified version of my code below:
private async Task<ObjectDetails> TryDeserializeResponse(HttpResponseMessage response)
{
    try
    {
        using (Stream stream = await response.Content.ReadAsStreamAsync())
        using (StreamReader readStream = new StreamReader(stream, Encoding.UTF8))
        using (JsonTextReader jsonTextReader = new JsonTextReader(readStream))
        {
            JsonSerializer serializer = new JsonSerializer();
            ObjectDetails objectDetails = serializer.Deserialize<ObjectDetails>(
                jsonTextReader);
            return objectDetails;
        }
    }
    catch (Exception e)
    {
        // Log exception
        return null;
    }
}
private async Task<HttpResponseMessage> TryGetResponse(string urlStr)
{
    try
    {
        HttpResponseMessage response = await httpClient.GetAsync(urlStr)
            .ConfigureAwait(false);
        if (response.StatusCode != HttpStatusCode.OK)
        {
            throw new WebException("Response code is "
                + response.StatusCode.ToString() + "... not 200 OK.");
        }
        return response;
    }
    catch (Exception e)
    {
        // Log exception
        return null;
    }
}
private async Task<ObjectDetails> GetObjectDetailsAsync(string baseUrl, int id)
{
    string urlStr = baseUrl + @"objects/id/" + id + "/details";
    HttpResponseMessage response = await TryGetResponse(urlStr);
    ObjectDetails objectDetails = await TryDeserializeResponse(response);
    return objectDetails;
}
// With ~3000 objects to retrieve, this code will create 100 API calls
// in parallel, wait for all 100 to finish, and then repeat that process
// ~30 times. In other words, there will be ~30 batches of 100 parallel
// API calls.
private Dictionary<int, Task<ObjectDetails>> GetAllObjectDetailsInBatches(
    string baseUrl, Dictionary<int, MyObject> incompleteObjects)
{
    int batchSize = 100;
    int numberOfBatches = (int)Math.Ceiling(
        (double)incompleteObjects.Count / batchSize);
    Dictionary<int, Task<ObjectDetails>> objectTaskDict
        = new Dictionary<int, Task<ObjectDetails>>(incompleteObjects.Count);
    var orderedIncompleteObjects = incompleteObjects.OrderBy(pair => pair.Key);

    for (int i = 0; i < numberOfBatches; i++)
    {
        var batchOfObjects = orderedIncompleteObjects.Skip(i * batchSize)
            .Take(batchSize);
        var batchObjectsTaskList = batchOfObjects.Select(
            pair => GetObjectDetailsAsync(baseUrl, pair.Key)).ToList();
        Task.WaitAll(batchObjectsTaskList.ToArray());
        foreach (var objTask in batchObjectsTaskList)
            objectTaskDict.Add(objTask.Result.id, objTask);
    }

    return objectTaskDict;
}
public void GetObjectsVersion1()
{
    string baseUrl = @"https://mywebserver.com:/api";
    // GetIncompleteObjects is not shown, but it is not relevant to
    // the question
    Dictionary<int, MyObject> incompleteObjects = GetIncompleteObjects();

    Dictionary<int, Task<ObjectDetails>> objectTaskDict
        = GetAllObjectDetailsInBatches(baseUrl, incompleteObjects);

    foreach (KeyValuePair<int, MyObject> pair in incompleteObjects)
    {
        ObjectDetails objectDetails = objectTaskDict[pair.Key].Result
            .objectDetails;
        // Code here that copies fields from objectDetails to pair.Value
        // (the incompleteObject)
        AllObjects.Add(pair.Value);
    }
}
public void GetObjectsVersion2()
{
    string baseUrl = @"https://mywebserver.com:/api";
    // GetIncompleteObjects is not shown, but it is not relevant to
    // the question
    Dictionary<int, MyObject> incompleteObjects = GetIncompleteObjects();

    Parallel.ForEach(incompleteObjects, pair =>
    {
        ObjectDetails objectDetails = GetObjectDetailsAsync(
            baseUrl, pair.Key).Result.objectDetails;
        // Code here that copies fields from objectDetails to pair.Value
        // (the incompleteObject)
        AllObjects.Add(pair.Value);
    });
}
A possible reason why Parallel.ForEach may run faster is that it creates the side effect of throttling. Initially x threads process the first x elements (where x is the number of available cores), and progressively more threads may be added depending on internal heuristics. Throttling IO operations is a good thing because it protects the network and the server that handles the requests from becoming overburdened. Your alternative improvised method of throttling, making requests in batches of 100, is far from ideal for several reasons, one of them being that 100 concurrent requests are a lot of requests! Another is that a single long-running operation may delay the completion of its batch until long after the other 99 operations have completed.
Note that Parallel.ForEach is also not ideal for parallelizing IO operations. It just happened to perform better than the alternative, wasting memory all along. For better approaches look here: How to limit the amount of concurrent async I/O operations?
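For reference, on .NET 6 or later the same throttling idea can be expressed with Parallel.ForEachAsync, which awaits the body instead of blocking threads. This sketch reuses the names from the question's code and the degree of parallelism is only an example value:

var options = new ParallelOptions { MaxDegreeOfParallelism = 20 }; // example limit

await Parallel.ForEachAsync(incompleteObjects, options, async (pair, ct) =>
{
    ObjectDetails objectDetails = await GetObjectDetailsAsync(baseUrl, pair.Key);
    // copy fields from objectDetails to pair.Value here
    // (collect results into a thread-safe collection rather than a plain List)
});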
https://learn.microsoft.com/en-us/dotnet/api/system.threading.tasks.parallel.foreach?view=netframework-4.8
Basically, Parallel.ForEach allows iterations to run in parallel, so you are not constraining the iteration to run serially. On a host that is not thread-constrained, this will tend to lead to improved throughput.
In short:
Parallel.Foreach() is most useful for CPU bound tasks.
Task.WaitAll() is more useful for IO bound tasks.
So in your case, you are getting information from web servers, which is IO. If the async methods are implemented correctly, they won't block any thread (they use IO completion ports to wait), so the threads can do other work in the meantime.
By running the async method GetObjectDetailsAsync(baseUrl, pair.Key).Result synchronously, you block a thread, so the thread pool gets flooded with waiting threads.
So I think the Task-based solution is a better fit.
I wish to download around 100,000 files from a web site. The answers from this question have the same issues as what I tried.
I have tried two approaches, both of which use highly erratic amounts of bandwidth:
The first attempts to synchronously download the files:
ParallelOptions a = new ParallelOptions();
a.MaxDegreeOfParallelism = 30;
ServicePointManager.DefaultConnectionLimit = 10000;

Parallel.For(start, end, a, i =>
{
    using (var client = new WebClient())
    {
        ...
    }
});
This works, but my throughput is very erratic (throughput graph omitted).
The second way involves using a semaphore and async to do the parallelism more manually (without the semaphore it would obviously spawn too many work items):
Parallel.For(start, end, a, i =>
{
    list.Add(getAndPreprocess(/*get URL from somewhere*/));
});
...
static async Task getAndPreprocess(string url)
{
    var client = new HttpClient();
    sem.WaitOne();
    string content = "";
    try
    {
        var data = client.GetStringAsync(url);
        content = await data;
    }
    catch (Exception ex)
    {
        Console.WriteLine(ex.InnerException.Message);
        sem.Release();
        return;
    }
    sem.Release();
    try
    {
        //try to use results from content
    }
    catch { return; }
}
My throughput now looks better, though still erratic (graph omitted).
Is there a nice way to do this, such that it starts downloading another file when the speed falls and stops adding downloads when the aggregate speed is steady (like what you would expect a download manager to do)?
Additionally, even though the second form gives better results, I dislike having to use semaphores, as it is error-prone.
What is the standard way to do this?
Note: these are all small files (<50KB)