I'm trying to make several GET requests to an API, in parallel, but I'm getting an error ("Too many requests") when trying to do large volumes of requests (1600 items).
The following is a snippet of the code.
Call:
var metadataItemList = await GetAssetMetadataBulk(unitHashList_Unique);
Method:
private static async Task<List<MetadataModel>> GetAssetMetadataBulk(List<string> assetHashes)
{
List<MetadataModel> resultsList = new();
int batchSize = 100;
int batches = (int)Math.Ceiling((double)assetHashes.Count() / batchSize);
for (int i = 0; i < batches; i++)
{
var currentAssets = assetHashes.Skip(i * batchSize).Take(batchSize);
var tasks = currentAssets.Select(asset => EndpointProcessor<MetadataModel>.LoadAddress($"assets/{asset}"));
resultsList.AddRange(await Task.WhenAll(tasks));
}
return resultsList;
}
The method runs tasks in parallel in batches of 100. It works fine for small volumes of requests (<~300), but for larger volumes (~1000+) I get the aforementioned "Too many requests" error.
I tried stepping through the code, and to my surprise, it worked when I manually stepped through it. But I need it to work automatically.
Is there any way to slow down requests, or a better way to somehow circumvent the error whilst maintaining relatively good performance?
The request does not return a "Retry-After" header, and I also don't know how I'd implement this in C#. Any input on what code to edit, or direction to a doc is much appreciated!
The following is the Class I'm using to send HTTP requests:
class EndpointProcessor<T>
{
public static async Task<T> LoadAddress(string url)
{
using (HttpResponseMessage response = await ApiHelper.apiClient.GetAsync(url))
{
if (response.IsSuccessStatusCode)
{
T result = await response.Content.ReadAsAsync<T>();
return result;
}
else
{
//Console.WriteLine("Error: {0} ({1})\nTrailingHeaders: {2}\n", response.StatusCode, response.ReasonPhrase, response.TrailingHeaders);
throw new Exception(response.ReasonPhrase);
}
}
}
}
You can use a semaphore to limit the number of requests that are in flight at once. Add a semaphore field to your API client (in async code, SemaphoreSlim is the usual choice because its WaitAsync method does not block a thread) and initialize it with a maximum count and an initial count of, say, 250, or whatever you determine to be a safe maximum number of running requests. Before the call to ApiHelper.apiClient.GetAsync() makes the real connection, acquire the semaphore, then release it after the download completes or fails. This will allow you to enforce a maximum number of concurrently running requests.
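As a rough sketch of the semaphore idea, assuming a SemaphoreSlim rather than a Semaphore: the helper below (the names Throttle and RunAsync are made up for illustration, they are not part of any library) wraps each request in an acquire/release pair and returns the results in input order. Note that this caps concurrency, not requests per second, so an API with a strict rate limit may still need a smaller limit or a delay between requests.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

public static class Throttle
{
    // Runs `work` for every item with at most `maxConcurrency` calls in flight.
    // Results are returned in the same order as the input items.
    public static async Task<TResult[]> RunAsync<TItem, TResult>(
        IEnumerable<TItem> items,
        Func<TItem, Task<TResult>> work,
        int maxConcurrency)
    {
        using var gate = new SemaphoreSlim(maxConcurrency, maxConcurrency);

        async Task<TResult> RunOneAsync(TItem item)
        {
            await gate.WaitAsync();   // waits asynchronously; no thread is blocked
            try
            {
                return await work(item);
            }
            finally
            {
                gate.Release();       // free the slot even if the request failed
            }
        }

        // ToList() runs the query once, so each task is created exactly once.
        var tasks = items.Select(RunOneAsync).ToList();
        return await Task.WhenAll(tasks);
    }
}
```

With the code from the question, the call could look something like `await Throttle.RunAsync(assetHashes, asset => EndpointProcessor<MetadataModel>.LoadAddress($"assets/{asset}"), 20)`, where 20 is only a starting guess to tune against the API.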
I am working on a protocol and trying to use as much async/await as I can to make it scale well. The protocol will have to support hundreds to thousands of simultaneous connections. Below is a little bit of pseudo code to illustrate my problem.
private static async void DoSomeWork()
{
var protocol = new FooProtocol();
await protocol.Connect("127.0.0.1", 1234);
var i = 0;
while(i != int.MaxValue)
{
i++;
var request = new FooRequest();
request.Payload = "Request Nr " + i;
var task = protocol.Send(request);
_ = task.ContinueWith(async tmp =>
{
var resp = await task;
Console.WriteLine($"Request {resp.SequenceNr} Successful: {(resp.Status == 0)}");
});
}
}
And below is a little pseudo code for the protocol.
public class FooProtocol
{
private int sequenceNr = 0;
private SemaphoreSlim ss = new SemaphoreSlim(20, 20);
public Task<FooResponse> Send(FooRequest fooRequest)
{
var tcs = new TaskCompletionSource<FooResponse>();
ss.Wait();
var tmp = Interlocked.Increment(ref sequenceNr);
fooRequest.SequenceNr = tmp;
// Faking some arbitrary delay. This work is done over sockets.
Task.Run(async () =>
{
await Task.Delay(1000);
tcs.SetResult(new FooResponse() {SequenceNr = tmp});
ss.Release();
});
return tcs.Task;
}
}
I have a protocol with request and response pairs. I have used asynchronous socket programming. The FooProtocol will take care of matching up requests with responses (by sequence number) and will also take care of the maximum number of pending requests (done in both the pseudo code and my real code with a SemaphoreSlim, so I am not worried about runaway requests). The DoSomeWork method calls the protocol's Send method, but I don't want to await the response; I want to spin around and send the next one until I am blocked by the maximum number of pending requests. When the task does complete, I want to check the response and maybe do some work.
I would like to fix two things
I would like to avoid using Task.ContinueWith() because it seems to not fit in cleanly with the async/await patterns
Because I have awaited the connection, I have had to use the async modifier. Now I get warnings from the IDE: "Because this call is not awaited, execution of the current method continues before the call is completed. Consider applying the 'await' operator to the result of the call." I don't want to do that, because as soon as I do, it ruins the protocol's ability to have many requests in flight. The only way I can get rid of the warning is to use a discard, which isn't the worst thing, but I can't help but feel like I am missing a trick and fighting this too hard.
Side note: I hope your actual code is using SemaphoreSlim.WaitAsync rather than SemaphoreSlim.Wait.
In most socket code, you do end up with a list of connections, and along with each connection is a "processor" of some kind. In the async world, this is naturally represented as a Task.
So you will need to keep a list of Tasks; at the very least, your consuming application will need to know when it is safe to shut down (i.e., all responses have been received).
Don't preemptively worry about using Task.Run; as long as you aren't blocking (e.g., SemaphoreSlim.Wait), you probably will not starve the thread pool. Remember that during the awaits, no thread pool thread is used.
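To make the "keep a list of Tasks" idea concrete, here is a minimal sketch that replaces ContinueWith with a small async method per request. Everything in it is illustrative: InFlightTracker, SendAllAsync and the sendAsync delegate (which stands in for protocol.Send) are assumed names, and requests and responses are plain ints to keep it self-contained.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

public static class InFlightTracker
{
    // Starts `count` requests without awaiting each one individually,
    // then waits until every response has been handled.
    public static async Task<int> SendAllAsync(
        int count,
        Func<int, Task<int>> sendAsync,   // stand-in for protocol.Send
        Action<int> onResponse)           // may be invoked from several threads
    {
        var inFlight = new List<Task>();
        for (var requestNr = 0; requestNr < count; requestNr++)
        {
            // Not awaited here, so the loop moves straight on to the next request.
            // The task is stored instead of discarded, so no compiler warning.
            inFlight.Add(SendAndHandleAsync(requestNr, sendAsync, onResponse));
        }

        // Safe-to-shut-down point: every response has arrived and been handled.
        // Exceptions thrown by any request surface here.
        await Task.WhenAll(inFlight);
        return inFlight.Count;
    }

    // Plays the role of the ContinueWith callback, written as ordinary async code.
    private static async Task SendAndHandleAsync(
        int requestNr, Func<int, Task<int>> sendAsync, Action<int> onResponse)
    {
        var response = await sendAsync(requestNr);
        onResponse(response);
    }
}
```

For a loop that really runs up to int.MaxValue, the list would grow without bound, so completed tasks would have to be pruned periodically or the work handed to something with a bounded queue.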
I am not sure that it's a good idea to enforce the maximum concurrency at the protocol level. It seems to me that this responsibility belongs to the caller of the protocol. So I would remove the SemaphoreSlim, and let it do the one thing that it knows to do well:
public class FooProtocol
{
private int sequenceNr = 0;
public async Task<FooResponse> Send(FooRequest fooRequest)
{
var tmp = Interlocked.Increment(ref sequenceNr);
fooRequest.SequenceNr = tmp;
await Task.Delay(1000); // Faking some arbitrary delay
return new FooResponse() { SequenceNr = tmp };
}
}
Then I would use an ActionBlock from the TPL Dataflow library in order to coordinate the process of sending a massive number of requests through the protocol, by handling the concurrency, the backpressure (BoundedCapacity), the cancellation (if needed), the error-handling, and the status of the whole operation (running, completed, failed etc). Example:
private static async Task DoSomeWorkAsync()
{
var protocol = new FooProtocol();
var actionBlock = new ActionBlock<FooRequest>(async request =>
{
var resp = await protocol.Send(request);
Console.WriteLine($"Request {resp.SequenceNr} Status: {resp.Status}");
}, new ExecutionDataflowBlockOptions()
{
MaxDegreeOfParallelism = 20,
BoundedCapacity = 100
});
await protocol.Connect("127.0.0.1", 1234);
foreach (var i in Enumerable.Range(0, Int32.MaxValue))
{
var request = new FooRequest();
request.Payload = "Request Nr " + i;
var accepted = await actionBlock.SendAsync(request);
if (!accepted) break; // The block has failed irrecoverably
}
actionBlock.Complete();
await actionBlock.Completion; // Propagate any exceptions
}
The BoundedCapacity = 100 configuration means that the ActionBlock will store in its internal buffer at most 100 requests. When this threshold is reached, anyone who wants to send more requests to it will have to wait. The awaiting will happen in the await actionBlock.SendAsync line.
I need to use proxies to download a forum. The problem with my code is that it takes only 10% of my internet bandwidth. Also I have read that I need to use a single HttpClient instance, but with multiple proxies I don't know how to do it. Changing MaxDegreeOfParallelism doesn't change anything.
public static IAsyncEnumerable<IFetchResult> FetchInParallelAsync(
this IEnumerable<Url> urls, FetchContext context)
{
var fetchBlcock = new TransformBlock<Url, IFetchResult>(
transform: url => url.FetchAsync(context),
dataflowBlockOptions: new ExecutionDataflowBlockOptions
{
MaxDegreeOfParallelism = 128
}
);
foreach(var url in urls)
fetchBlcock.Post(url);
fetchBlcock.Complete();
var result = fetchBlcock.ToAsyncEnumerable();
return result;
}
Every call to FetchAsync will create or reuse a HttpClient with a WebProxy.
public static async Task<IFetchResult> FetchAsync(this Url url, FetchContext context)
{
var httpClient = context.ProxyPool.Rent();
var result = await url.FetchAsync(httpClient, context.Observer, context.Delay,
context.isReloadWithCookie);
context.ProxyPool.Return(httpClient);
return result;
}
public HttpClient Rent()
{
lock(_lockObject)
{
if (_uninitiliazedDatacenterProxiesAddresses.Count != 0)
{
var proxyAddress = _uninitiliazedDatacenterProxiesAddresses.Pop();
return proxyAddress.GetWebProxy(DataCenterProxiesCredentials).GetHttpClient();
}
return _proxiesQueue.Dequeue();
}
}
I am a novice at software developing, but the task of downloading using hundreds or thousands of proxies asynchronously looks like a trivial task that many should have been faced with and found a correct way to do it. So far I was unable to find any solutions to my problem on the internet. Any thoughts of how to achieve maximum download speed?
Let's take a look at what happens here:
var result = await url.FetchAsync(httpClient, context.Observer, context.Delay, context.isReloadWithCookie);
You are actually awaiting before you continue with the next item. That's why it is asynchronous and not parallel programming. As the Microsoft docs on async put it:
The await keyword is where the magic happens. It yields control to the caller of the method that performed await, and it ultimately allows a UI to be responsive or a service to be elastic.
In essence, it frees the calling thread to do other stuff but the original calling code is suspended from executing, until the IO operation is done.
Now to your problem:
You can either use this excellent solution here: foreach async
You can use the Parallel library to execute your code in different threads.
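Something like the following, using Parallel.For, for example:
// Assumes urls has been materialized into a list (e.g. urls.ToList()) so that it can be indexed
Parallel.For(0, urls.Count, index =>
{
fetchBlcock.Post(urls[index]);
});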
I have two versions of my program that submit ~3000 HTTP GET requests to a web server.
The first version is based off of what I read here. That solution makes sense to me because making web requests is I/O bound work, and the use of async/await along with Task.WhenAll or Task.WaitAll means that you can submit 100 requests all at once and then wait for them all to finish before submitting the next 100 requests so that you don't bog down the web server. I was surprised to see that this version completed all of the work in ~12 minutes - way slower than I expected.
The second version submits all 3000 HTTP GET requests inside a Parallel.ForEach loop. I use .Result to wait for each request to finish before the rest of the logic within that iteration of the loop can execute. I thought that this would be a far less efficient solution, since using threads to perform tasks in parallel is usually better suited for performing CPU bound work, but I was surprised to see that this version completed all of the work within ~3 minutes!
My question is why is the Parallel.ForEach version faster? This came as an extra surprise because when I applied the same two techniques against a different API/web server, version 1 of my code was actually faster than version 2 by about 6 minutes - which is what I expected. Could performance of the two different versions have something to do with how the web server handles the traffic?
You can see a simplified version of my code below:
private async Task<ObjectDetails> TryDeserializeResponse(HttpResponseMessage response)
{
try
{
using (Stream stream = await response.Content.ReadAsStreamAsync())
using (StreamReader readStream = new StreamReader(stream, Encoding.UTF8))
using (JsonTextReader jsonTextReader = new JsonTextReader(readStream))
{
JsonSerializer serializer = new JsonSerializer();
ObjectDetails objectDetails = serializer.Deserialize<ObjectDetails>(
jsonTextReader);
return objectDetails;
}
}
catch (Exception e)
{
// Log exception
return null;
}
}
private async Task<HttpResponseMessage> TryGetResponse(string urlStr)
{
try
{
HttpResponseMessage response = await httpClient.GetAsync(urlStr)
.ConfigureAwait(false);
if (response.StatusCode != HttpStatusCode.OK)
{
throw new WebException("Response code is "
+ response.StatusCode.ToString() + "... not 200 OK.");
}
return response;
}
catch (Exception e)
{
// Log exception
return null;
}
}
private async Task<ObjectDetails> GetObjectDetailsAsync(string baseUrl, int id)
{
string urlStr = baseUrl + @"objects/id/" + id + "/details";
HttpResponseMessage response = await TryGetResponse(urlStr);
ObjectDetails objectDetails = await TryDeserializeResponse(response);
return objectDetails;
}
// With ~3000 objects to retrieve, this code will create 100 API calls
// in parallel, wait for all 100 to finish, and then repeat that process
// ~30 times. In other words, there will be ~30 batches of 100 parallel
// API calls.
private Dictionary<int, Task<ObjectDetails>> GetAllObjectDetailsInBatches(
string baseUrl, Dictionary<int, MyObject> incompleteObjects)
{
int batchSize = 100;
int numberOfBatches = (int)Math.Ceiling(
(double)incompleteObjects.Count / batchSize);
Dictionary<int, Task<ObjectDetails>> objectTaskDict
= new Dictionary<int, Task<ObjectDetails>>(incompleteObjects.Count);
var orderedIncompleteObjects = incompleteObjects.OrderBy(pair => pair.Key);
for (int i = 0; i < numberOfBatches; i++)
{
var batchOfObjects = orderedIncompleteObjects.Skip(i * batchSize)
.Take(batchSize);
var batchObjectsTaskList = batchOfObjects.Select(
pair => GetObjectDetailsAsync(baseUrl, pair.Key));
Task.WaitAll(batchObjectsTaskList.ToArray());
foreach (var objTask in batchObjectsTaskList)
objectTaskDict.Add(objTask.Result.id, objTask);
}
return objectTaskDict;
}
public void GetObjectsVersion1()
{
string baseUrl = @"https://mywebserver.com:/api";
// GetIncompleteObjects is not shown, but it is not relevant to
// the question
Dictionary<int, MyObject> incompleteObjects = GetIncompleteObjects();
Dictionary<int, Task<ObjectDetails>> objectTaskDict
= GetAllObjectDetailsInBatches(baseUrl, incompleteObjects);
foreach (KeyValuePair<int, MyObject> pair in incompleteObjects)
{
ObjectDetails objectDetails = objectTaskDict[pair.Key].Result
.objectDetails;
// Code here that copies fields from objectDetails to pair.Value
// (the incompleteObject)
AllObjects.Add(pair.Value);
};
}
public void GetObjectsVersion2()
{
string baseUrl = @"https://mywebserver.com:/api";
// GetIncompleteObjects is not shown, but it is not relevant to
// the question
Dictionary<int, MyObject> incompleteObjects = GetIncompleteObjects();
Parallel.ForEach(incompleteObjects, pair =>
{
ObjectDetails objectDetails = GetObjectDetailsAsync(
baseUrl, pair.Key).Result.objectDetails;
// Code here that copies fields from objectDetails to pair.Value
// (the incompleteObject)
AllObjects.Add(pair.Value);
});
}
A possible reason why Parallel.ForEach may run faster is because it creates the side-effect of throttling. Initially x threads are processing the first x elements (where x is the number of the available cores), and progressively more threads may be added depending on internal heuristics. Throttling IO operations is a good thing because it protects the network and the server that handles the requests from becoming overburdened. Your alternative improvised method of throttling, by making requests in batches of 100, is far from ideal for many reasons, one of them being that 100 concurrent requests are a lot of requests! Another one is that a single long running operation may delay the completion of the batch until long after the completion of the other 99 operations.
Note that Parallel.ForEach is also not ideal for parallelizing IO operations. It just happened to perform better than the alternative, wasting memory all along. For better approaches look here: How to limit the amount of concurrent async I/O operations?
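One way to get throttling without blocking threads, assuming .NET 6 or later, is Parallel.ForEachAsync, which keeps a fixed number of async operations in flight and starts the next one as soon as any of them finishes (so one slow request does not hold up a whole batch). The sketch below is only an illustration: BoundedFetcher, FetchAllAsync and the fetchAsync delegate are made-up names, with fetchAsync standing in for something like GetObjectDetailsAsync.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

public static class BoundedFetcher
{
    // Fetches a value for every key, with at most `maxConcurrency`
    // fetches running at any moment. Results are keyed by the input key.
    public static async Task<IReadOnlyDictionary<TKey, TValue>> FetchAllAsync<TKey, TValue>(
        IEnumerable<TKey> keys,
        Func<TKey, CancellationToken, Task<TValue>> fetchAsync,
        int maxConcurrency) where TKey : notnull
    {
        // Results arrive in completion order, so a thread-safe dictionary is used.
        var results = new ConcurrentDictionary<TKey, TValue>();
        var options = new ParallelOptions { MaxDegreeOfParallelism = maxConcurrency };

        await Parallel.ForEachAsync(keys, options, async (key, cancellationToken) =>
        {
            // No thread is held while the request is pending.
            results[key] = await fetchAsync(key, cancellationToken);
        });

        return results;
    }
}
```

A reasonable starting value for maxConcurrency is something small (say 10 to 20), increased only while the server keeps responding quickly.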
https://learn.microsoft.com/en-us/dotnet/api/system.threading.tasks.parallel.foreach?view=netframework-4.8
Basically, Parallel.ForEach allows iterations to run in parallel, so you are not constraining them to run serially. On a host that is not thread-constrained, this will tend to lead to improved throughput.
In short:
Parallel.ForEach() is most useful for CPU bound tasks.
Task.WaitAll() is more useful for IO bound tasks.
So in your case, you are getting information from webservers, which is IO. If the async methods are implemented correctly, it won't block any thread. (It will use IO Completion ports to wait on) This way the threads can do other stuff.
By running the async method synchronously, as in GetObjectDetailsAsync(baseUrl, pair.Key).Result, you block a thread. So the thread pool will be flooded with waiting threads.
So I think the Task solution will have a better fit.
Questions on Lambda to delete partitions.
The existing query which uses parallelization is failing since it exceeds the number of parallel queries. We want to replace it with sequential queries and increased timeout for lambda.
Can we change the lambda to parallel with limited threads?
Database -> AWS Athena: we get the list of clients from Athena and loop through it.
Right now it works fine with sequential calls as well, but only because the number of clients is small; it would pose a problem in the future.
The only issue with limited parallel threads is that we need some code to handle the thread count as well.
Then someone suggested that I use this: https://devblogs.microsoft.com/pfxteam/implementing-a-simple-foreachasync-part-2/
https://gist.github.com/0xced/94f6c50d620e582e19913742dbd76eb6
public class AthenaClient {
private readonly IAmazonAthena _client;
private readonly string _databaseName;
private readonly string _outputLocation;
private readonly string _tableName;
const int MaxQueryLength = 262144;
readonly int _maxclientsToBeProcessed;
public AthenaClient(string databaseName, string tableName, string outputLocation, int maxclientsToBeProcessed) {
_databaseName = databaseName;
_tableName = tableName;
_outputLocation = outputLocation;
_maxclientsToBeProcessed = maxclientsToBeProcessed == 0 ? 1 : maxclientsToBeProcessed;
_client = new AmazonAthenaClient();
}
public async Task<bool> DeletePartitions() {
var clients = await GetClients();
for (int i = 0; i < clients.Count; i = i + _maxclientsToBeProcessed) {
var clientItems = clients.Skip(i).Take(_maxclientsToBeProcessed);
var queryBuilder = new StringBuilder();
queryBuilder.AppendLine($"ALTER TABLE {_databaseName}.{_tableName} DROP IF EXISTS");
foreach(var client in clientItems) {
queryBuilder.AppendLine($" PARTITION (client_id = '{client}'), ");
}
var query = queryBuilder.ToString().Trim().TrimEnd(',') + ";";
LambdaLogger.Log(query);
if (query.Length >= MaxQueryLength) {
throw new Exception("Delete partition query length exceeded.");
}
var queryExecutionId = StartQueryExecution(query).Result;
await CheckQueryExecutionStatus(queryExecutionId);
}
return true;
}
}
It seems that the actual question should be :
How can I change the database partitions for lots of clients in AWS Athena without executing them sequentially?
The answer isn't ForEachAsync or the upcoming await foreach in C# 8. An asynchronous loop would still send calls to the service one at a time, it "just" wouldn't block while waiting for an answer.
Concurrent workers
This is a concurrent worker problem that can be handled using, e.g., the TPL Dataflow library's ActionBlock class or the newer System.Threading.Channels classes.
The Dataflow library is meant to create event/message processing pipelines similar to a shell script pipeline, by moving data between independent blocks. Each block runs on its own task/thread which means you can get concurrent execution simply by breaking up processing into blocks.
It's also possible to increase the number of processing tasks per block, by specifying the MaxDegreeOfParallelism option when creating the block. This allows us to quickly create "workers" that can work on lots of messages concurrently.
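For comparison, the System.Threading.Channels route could look roughly like the sketch below: a bounded channel acts as the input queue and a fixed number of worker tasks drain it. ChannelWorkers, ProcessAsync and handleAsync are illustrative names only (handleAsync would be whatever runs the query for one client), and the sketch assumes .NET Core 3.0 or later for ReadAllAsync.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Channels;
using System.Threading.Tasks;

public static class ChannelWorkers
{
    // Processes all items using `workerCount` concurrent workers that
    // read from a bounded queue holding at most `capacity` pending items.
    public static async Task ProcessAsync<T>(
        IEnumerable<T> items,
        Func<T, Task> handleAsync,
        int workerCount,
        int capacity)
    {
        var channel = Channel.CreateBounded<T>(capacity);

        // Each worker loops until the channel is completed and empty.
        var workers = Enumerable.Range(0, workerCount)
            .Select(_ => Task.Run(async () =>
            {
                await foreach (var item in channel.Reader.ReadAllAsync())
                {
                    await handleAsync(item);
                }
            }))
            .ToArray();

        foreach (var item in items)
        {
            // Waits asynchronously while the queue is full (backpressure).
            await channel.Writer.WriteAsync(item);
        }

        channel.Writer.Complete();     // tell the workers no more items are coming
        await Task.WhenAll(workers);   // rethrows if a worker failed
    }
}
```

One caveat of this simplified version: a worker that throws stops consuming, so in real code handleAsync should catch and log its own failures, otherwise the producer could end up waiting on a full queue that nobody drains.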
Example
In this case, the "message" is the Client whatever that is. A single ActionBlock could create the DDL statement and execute it. Each block has an input queue which means we can just post messages to a block and await for it to execute everything using the DOP we specified.
We can also specify a limit to the queue so it won't get flooded if the worker tasks can't run fast enough :
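var options = new ExecutionDataflowBlockOptions
{
MaxDegreeOfParallelism = _maxclientsToBeProcessed,
BoundedCapacity = _maxclientsToBeProcessed * 3, // Just a guess
};
// Pass the options to the block; otherwise the defaults (one message at a time, unbounded queue) apply
var block = new ActionBlock<Client>(client => CreateAndRunDDL(client), options);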
//Post the client requests
foreach(var client in clients)
{
await block.SendAsync(client);
}
//Tell the block we're done
block.Complete();
//Await for all queued messages to finish processing
await block.Completion;
The CreateAndRunDDL(Client) method should do what the code inside the question's loop does. A good idea would be to refactor it though, and create separate functions to create and execute the query , eg :
async Task CreateAndRunDDL(Client client)
{
var query = QueryForClient(...);
LambdaLogger.Log(query);
if (query.Length >= MaxQueryLength) {
throw new Exception("Delete partition query length exceeded.");
}
var queryExecutionId = await StartQueryExecution(query);
await CheckQueryExecutionStatus(queryExecutionId);
}
Blocks can be linked too. If we wanted to batch multiple clients together for processing, we can use a BatchBlock and feed its results to our action block, eg :
var batchClients = new BatchBlock<Client>(20);
var linkOptions = new DataflowLinkOptions
{
PropagateCompletion = true
};
// The BatchBlock emits arrays, so the action block's input type is Client[]
var block = new ActionBlock<Client[]>(clients => CreateAndRunDDL(clients));
batchClients.LinkTo(block,linkOptions);
This time the CreateAndRunDDL method accepts a Client[] array with the number of clients/messages we specified in the batch size.
async Task CreateAndRunDDL(Client[] clients)
{
var query = QueryForClients(clients);
...
}
Messages should be posted to the batchClients block now. Once that completes, we need to wait for the last block in the pipeline to complete :
foreach(var client in clients)
{
await batchClients.SendAsync(client);
}
//Tell the *batch block* we're done
batchClients.Complete();
//Await for all queued messages to finish processing
await block.Completion;
I have a scenario where I need to make a large number of GET requests in as little time as possible (think around 1000).
I know generally it's best to keep a single client and reuse it as much as possible:
// Create Single HTTP Client
HttpClient client = new HttpClient();
// Create all tasks
for (int x = 0; x < 1000; x++)
{
tasks.Add(ProcessURLAsync($"https://someapi.com/request/{x}", client, x));
}
// wait for all tasks to complete.
Task.WaitAll(tasks.ToArray());
...
static async Task<string> ProcessURLAsync(string url, HttpClient client, int x)
{
var response = await client.GetStringAsync(url);
ParseResponse(response, x); // response is already the string; await unwrapped the task
return response;
}
But doing so takes approximately 70 seconds for all requests to complete.
On the other hand, if I create multiple clients beforehand and distribute the requests across them, it takes about 3 seconds to complete:
// Create arbitrary number of clients
while (clients.Count < maxClients)
{
clients.Add(new HttpClient());
}
// Create all tasks
for (int x = 0; x < 1000; x++)
{
tasks.Add(ProcessURLAsync(
$"https://someapi.com/request/{x}", clients[x % maxClients], x));
}
// Same code as above
Due to the nature of the data requested, I need to either keep the results sequential or pass along the index associated with the request.
Assuming the API cannot be changed to better format the requested data, and the all requests must complete before moving on, is this solution wise or am I missing a smarter alternative?
(For the sake of brevity I've used an arbitrary number of HttpClient whereas I would create a pool of HttpClient that releases a client once it receives a response and only create a new one when none are free)
I would suggest two main changes.
Remove the await so that multiple downloads can occur at the same time.
Set ServicePointManager.DefaultConnectionLimit to a larger number (e.g. 50).
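A possible shape for this with a single client is sketched below. On .NET Framework the setting is ServicePointManager.DefaultConnectionLimit; the sketch assumes .NET Core 2.1 or later instead, where the per-client counterpart is SocketsHttpHandler.MaxConnectionsPerServer. IndexedDownloader, CreateClient and RunIndexedAsync are made-up names for illustration. Task.WhenAll returns results in the order the tasks were passed in, which covers the requirement of keeping each response tied to its index.

```csharp
using System;
using System.Net.Http;
using System.Threading.Tasks;

public static class IndexedDownloader
{
    // One shared client whose handler allows several connections per server.
    public static HttpClient CreateClient(int maxConnectionsPerServer)
    {
        var handler = new SocketsHttpHandler
        {
            MaxConnectionsPerServer = maxConnectionsPerServer
        };
        return new HttpClient(handler);
    }

    // Starts all requests up front and returns the results so that
    // result[i] belongs to request i, regardless of completion order.
    public static async Task<TResult[]> RunIndexedAsync<TResult>(
        int count, Func<int, Task<TResult>> requestAsync)
    {
        var tasks = new Task<TResult>[count];
        for (var i = 0; i < count; i++)
        {
            tasks[i] = requestAsync(i);   // started, not awaited yet
        }
        return await Task.WhenAll(tasks);
    }
}
```

Usage would be along the lines of `var client = IndexedDownloader.CreateClient(50);` followed by `var bodies = await IndexedDownloader.RunIndexedAsync(1000, x => client.GetStringAsync($"https://someapi.com/request/{x}"));`, after which bodies[x] is the response for request x. The value 50 is the example figure from the answer, not a measured optimum.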