I have a task where I form thousands of requests which are later sent to a server. The server returns the response for each request and that response is then dumped to an output file line by line.
The pseudo code goes like this:
// requests contains thousands of requests to be sent to the server
string[] requests = GetRequestsString();

foreach (string request in requests)
{
    string response = MakeWebRequest(request);
    ParseandDump(response);
}
Now, as can be seen, the server is handling my requests one by one. I want to make this entire process faster. The server in question is capable of handling multiple requests at a time, so I want to apply multi-threading: send, let's say, 4 requests to the server at a time and dump each response in the same thread.
Can you please give me any pointers to possible approaches?
You can take advantage of Task from .NET 4.0 and the new toy HttpClient. The sample code below shows how to send the requests in parallel, then dump each response in the same thread by using ContinueWith:
var httpClient = new HttpClient();
var tasks = requests.Select(r => httpClient.GetStringAsync(r).ContinueWith(t =>
{
    ParseandDump(t.Result);
})).ToArray(); // ToArray materializes the lazy query so the requests actually start

Task.WaitAll(tasks); // block until every response has been parsed and dumped
Task uses the ThreadPool under the hood, so you don't need to specify how many threads should be used; the ThreadPool manages this for you in an optimized way.
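If you can move past .NET 4.0 to .NET 4.5 (C# 5), the same idea reads more naturally with async/await. A minimal sketch, assuming ParseandDump is safe to call from thread-pool threads:

var httpClient = new HttpClient();
// Each element downloads one response and dumps it as soon as it arrives
var tasks = requests.Select(async r => ParseandDump(await httpClient.GetStringAsync(r)));
await Task.WhenAll(tasks); // completes once every response has been dumped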
The easiest way would be to use Parallel.ForEach like this:
string[] requests = GetRequestsString();
Parallel.ForEach(requests, request => ParseandDump(MakeWebRequest(request)));
.NET Framework 4.0 or greater is required to use Parallel.
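Since the question asks for about 4 requests at a time, you can also cap the concurrency with ParallelOptions; a small sketch reusing the question's methods:

string[] requests = GetRequestsString();
var options = new ParallelOptions { MaxDegreeOfParallelism = 4 }; // at most 4 requests in flight
Parallel.ForEach(requests, options, request => ParseandDump(MakeWebRequest(request)));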
I think this could be done with a producer-consumer pattern. You could use a ConcurrentQueue (from the System.Collections.Concurrent namespace) as a shared resource between the many parallel web requests and the dumping thread.
The pseudo code would be something like:
var requests = GetRequestsString();
var queue = new ConcurrentQueue<string>();

Task.Factory.StartNew(() =>
{
    Parallel.ForEach(requests, currentRequest =>
    {
        queue.Enqueue(MakeWebRequest(currentRequest));
    });
});

Task.Factory.StartNew(() =>
{
    while (true) // busy-waits when the queue is empty; see the note below
    {
        string response;
        if (queue.TryDequeue(out response))
        {
            ParseandDump(response);
        }
    }
});
A BlockingCollection might serve you even better, depending on how you want to synchronize the threads to signal the end of the incoming requests; a sketch follows.
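A minimal sketch of that variant, assuming MakeWebRequest and ParseandDump from the question. CompleteAdding is what signals the consumer that no more responses will arrive:

var responses = new BlockingCollection<string>();

var producer = Task.Factory.StartNew(() =>
{
    Parallel.ForEach(requests, r => responses.Add(MakeWebRequest(r)));
    responses.CompleteAdding(); // no more items: lets the consumer loop end
});

var consumer = Task.Factory.StartNew(() =>
{
    // Blocks while the collection is empty and ends once it is marked complete
    foreach (var response in responses.GetConsumingEnumerable())
    {
        ParseandDump(response);
    }
});

Task.WaitAll(producer, consumer);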
Related
I'm writing a responsive API. We have to handle 10 requests per second.
The problem is, sending a response takes half a second. So you can imagine, the server is overwhelmed quickly.
I made the code process asynchronously, up to 10 tasks at once, to help mitigate this.
However, I have concerns about whether using a single instance of HttpClient is the correct approach. The advice, as soon as someone mentions HttpClient, is always to create a single instance of it.
I have a static instance of it. Although it is thread-safe, at least for PostAsync, should I really create 10 HttpClients (or a pool of HttpClients) to be able to send data out faster?
I assume that during the half second a response is being sent out, it won't let other PostAsync calls go out. However, I can't confirm this behaviour.
Most benchmarks and resources simply look at sending requests synchronously, i.e. one after the other (awaiting each PostAsync).
However, for my use case I need to send several simultaneously, i.e. from separate threads. The only way to reply to 10 messages per second that each take half a second is to send five simultaneous messages: not five queued to go out one by one, but five messages genuinely in flight at the same time.
I cannot find any documentation on how HttpClient handles this. I've only seen a few references to it having a connection pool, but it's unclear whether it will actually perform multiple connections simultaneously, or if I need to create a small pool of 5 httpclients to rotate through.
Question: Does a single instance of HttpClient support multiple connections simultaneously?
And I don't mean just letting you call PostAsync lots of times in a thread-safe way before earlier calls have finished; I mean truly opening five simultaneous connections and sending data through each of them at the exact same time.
An example would be, you're sending fifty 10 byte files to the moon, and there is a latency of 10 seconds. Your program scoops up all fifty files and makes fifty calls to HttpClient.PostAsync almost instantly.
Assuming the listening service can support it, would the cross-thread calls to HttpClient.PostAsync open fifty connections (or whatever, some limit, but more than 1) and send the data, meaning that the server receives all fifty files ~10 seconds later?
Or would it internally queue them up and you'd end up waiting 10x50=500 seconds?
Seems there is no limit, or at least, it's a high one.
I made a default web api application, modified the boilerplate controller method to this:
// GET api/values
public async Task<IEnumerable<string>> Get()
{
Debug.Print("Called");
Thread.Sleep(100000);
return new string[] { "value1", "value2" };
}
I then made a program that, using a single instance of HttpClient, makes lots of simultaneous connections using Task.Run.
var task1 = Task.Run(() => httpClient.GetAsync("http://localhost:57984/api/values"));
var task2 = Task.Run(() => httpClient.GetAsync("http://localhost:57984/api/values"));
var task3 = Task.Run(() => httpClient.GetAsync("http://localhost:57984/api/values"));
var task4 = Task.Run(() => httpClient.GetAsync("http://localhost:57984/api/values"));
var task5 = Task.Run(() => httpClient.GetAsync("http://localhost:57984/api/values"));
var task6 = Task.Run(() => httpClient.GetAsync("http://localhost:57984/api/values"));
var task7 = Task.Run(() => httpClient.GetAsync("http://localhost:57984/api/values"));
var task8 = Task.Run(() => httpClient.GetAsync("http://localhost:57984/api/values"));
var task9 = Task.Run(() => httpClient.GetAsync("http://localhost:57984/api/values"));
var taskA = Task.Run(() => httpClient.GetAsync("http://localhost:57984/api/values"));
var taskB = Task.Run(() => httpClient.GetAsync("http://localhost:57984/api/values"));
var taskC = Task.Run(() => httpClient.GetAsync("http://localhost:57984/api/values"));
var taskD = Task.Run(() => httpClient.GetAsync("http://localhost:57984/api/values"));
var taskE = Task.Run(() => httpClient.GetAsync("http://localhost:57984/api/values"));
await Task.WhenAll(task1, task2, task3, task4, task5, task6, task7, task8, task9, taskA, taskB, taskC, taskD, taskE);
I ran them and the word 'Called' was logged 14 times.
Since the Thread.Sleep will have blocked the response, it should mean there were 14 simultaneous connections.
There are two properties that might affect the maximum number of connections, which I found by searching:
ServicePointManager.DefaultConnectionLimit, which defaults to 2 (note that this limit is not enforced for localhost/loopback addresses, which may explain why a local test can open 14 connections),
and HttpClientHandler.MaxConnectionsPerServer, whose documented default is int.MaxValue rather than 2.
As I'm able to make many more than 2 connections, I really don't know if the first setting is just ignored, or if these are the wrong settings, or what. Changing them appears to have no effect.
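For reference, here is how those two settings are changed (on .NET Core, note that ServicePointManager does not affect HttpClient at all):

// Process-wide default for new service points (.NET Framework)
ServicePointManager.DefaultConnectionLimit = 100;

// Per-handler limit, applied to the HttpClient built on top of the handler
var handler = new HttpClientHandler { MaxConnectionsPerServer = 100 };
var httpClient = new HttpClient(handler);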
I noticed after a lot of stopping and starting my test projects that new connections were much slower to be made. I am guessing that I saturated the connection pool.
My conclusion is that if you set those two values to something higher (just in case, I mean, why not), then you can use a single httpclient concurrently where the connections will be truly concurrent, rather than sequential and thread safe.
However I can't confirm this behaviour.
Why not? Just create a webapi with a few seconds delay, and test calling it with HttpClient. Or you can use a service like Slowwly.
static async Task Main(string[] args)
{
    var stopwatch = Stopwatch.StartNew();

    await Serial(stopwatch);
    Console.WriteLine($"Serial took {stopwatch.Elapsed}");

    stopwatch.Restart();

    await Concurrent(stopwatch);
    Console.WriteLine($"Concurrent took {stopwatch.Elapsed}");
}

private static async Task Serial(Stopwatch stopwatch)
{
    for (var i = 0; i != 5; ++i)
    {
        var client = new HttpClient();
        await MakeRequest(stopwatch, client);
    }
}

private static async Task Concurrent(Stopwatch stopwatch)
{
    var client = new HttpClient();
    var tasks = Enumerable.Range(0, 5).Select(async _ => { await MakeRequest(stopwatch, client); }).ToList();
    await Task.WhenAll(tasks);
}

private static async Task MakeRequest(Stopwatch stopwatch, HttpClient client)
{
    Console.WriteLine($"{stopwatch.Elapsed}: Issuing request.");
    var response = await client.GetStringAsync("http://slowwly.robertomurray.co.uk/delay/3000/url/http://www.google.com");
    Console.WriteLine($"{stopwatch.Elapsed}: Received {response.Length} bytes.");
}
Output for me (from the US):
00:00:00.0463664: Issuing request.
00:00:04.2560734: Received 49237 bytes.
00:00:04.2562498: Issuing request.
00:00:07.6731908: Received 49247 bytes.
00:00:07.6734158: Issuing request.
00:00:11.0882322: Received 49364 bytes.
00:00:11.0883803: Issuing request.
00:00:14.4990981: Received 49294 bytes.
00:00:14.4993977: Issuing request.
00:00:17.9082167: Received 49328 bytes.
Serial took 00:00:17.9083969
00:00:00.0025096: Issuing request.
00:00:00.0252402: Issuing request.
00:00:00.0422682: Issuing request.
00:00:00.0588887: Issuing request.
00:00:00.0755351: Issuing request.
00:00:03.4631815: Received 49278 bytes.
00:00:03.4632073: Received 49293 bytes.
00:00:03.4844698: Received 49313 bytes.
00:00:03.4913929: Received 49308 bytes.
00:00:03.4915415: Received 49280 bytes.
Concurrent took 00:00:03.4917199
Question: Does a single instance of HttpClient support multiple connections simultaneously?
Yes.
Background
We have a service operation that can receive concurrent asynchronous requests and must process those requests one at a time.
In the following example, the UploadAndImport(...) method receives concurrent requests on multiple threads, but its calls to the ImportFile(...) method must happen one at a time.
Layperson Description
Imagine a warehouse with many workers (multiple threads). People (clients) can send the warehouse many packages (requests) at the same time (concurrently). When a package comes in a worker takes responsibility for it from start to finish, and the person who dropped off the package can leave (fire-and-forget). The workers' job is to put each package down a small chute, and only one worker can put a package down a chute at a time, otherwise chaos ensues. If the person who dropped off the package checks in later (polling endpoint), the warehouse should be able to report on whether the package went down the chute or not.
Question
The question then is how to write a service operation that...
can receive concurrent client requests,
receives and processes those requests on multiple threads,
processes requests on the same thread that received the request,
processes requests one at a time,
is a one-way fire-and-forget operation, and
has a separate polling endpoint that reports on request completion.
We've tried the following and are wondering two things:
Are there any race conditions that we have not considered?
Is there a more canonical way to code this scenario in C#.NET with a service oriented architecture (we happen to be using WCF)?
Example: What We Have Tried
This is the service code that we have tried. It works, though it feels somewhat like a hack or kludge.
static ImportFileInfo _inProgressRequest = null;

static readonly ConcurrentDictionary<Guid, ImportFileInfo> WaitingRequests =
    new ConcurrentDictionary<Guid, ImportFileInfo>();

public void UploadAndImport(ImportFileInfo request)
{
    // Receive the incoming request
    WaitingRequests.TryAdd(request.OperationId, request);

    while (null != Interlocked.CompareExchange(ref _inProgressRequest, request, null))
    {
        // Wait for any previous processing to complete
        Thread.Sleep(500);
    }

    // Process the incoming request
    ImportFile(request);

    Interlocked.Exchange(ref _inProgressRequest, null);
    WaitingRequests.TryRemove(request.OperationId, out _);
}

public bool UploadAndImportIsComplete(Guid operationId) =>
    !WaitingRequests.ContainsKey(operationId);
This is example client code.
private static async Task UploadFile(FileInfo fileInfo, ImportFileInfo importFileInfo)
{
    using (var proxy = new Proxy())
    using (var stream = new FileStream(fileInfo.FullName, FileMode.Open, FileAccess.Read))
    {
        importFileInfo.FileByteStream = stream;
        proxy.UploadAndImport(importFileInfo);
    }

    await Task.Run(() => Poller.Poll(timeoutSeconds: 90, intervalSeconds: 1, func: () =>
    {
        using (var proxy = new Proxy())
        {
            return proxy.UploadAndImportIsComplete(importFileInfo.OperationId);
        }
    }));
}
It's hard to write a minimum viable example of this in a Fiddle, but here is a start that gives a sense of it and that compiles.
As before, the above seems like a hack/kludge, and we are asking both about potential pitfalls in its approach and for alternative patterns that are more appropriate/canonical.
A simple solution using the producer-consumer pattern to pipe requests, in case of thread-count restrictions.
You still have to implement a simple progress reporter or event. I suggest replacing the expensive polling approach with asynchronous communication, which is offered by Microsoft's SignalR library. It uses WebSockets to enable async behavior. The client and server can register their callbacks on a hub, and using RPC the client can invoke server-side methods and vice versa; you would post progress to the client by using the hub (client side). In my experience SignalR is very simple to use and very well documented. It has libraries for most popular server-side languages (e.g. Java).
Polling, in my understanding, is the total opposite of fire-and-forget: you can't forget, because you have to check something on an interval. Event-based communication, like SignalR, is fire-and-forget, since you fire and will get a reminder (because you forgot). The "event side" will invoke your callback instead of you having to keep checking yourself!
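To illustrate just the push-based idea (this sketch assumes ASP.NET Core SignalR, so it is not a drop-in for the WCF service in the question; ImportProgressHub and the "importProgress" method name are made up for illustration):

using System;
using System.Threading.Tasks;
using Microsoft.AspNetCore.SignalR;

// Clients connect to this hub and register a handler named "importProgress".
public class ImportProgressHub : Hub { }

public class ImportProgressReporter
{
    private readonly IHubContext<ImportProgressHub> _hub;

    public ImportProgressReporter(IHubContext<ImportProgressHub> hub) => _hub = hub;

    // Called from the import pipeline; pushes progress instead of making clients poll.
    public Task ReportAsync(Guid operationId, string status) =>
        _hub.Clients.All.SendAsync("importProgress", operationId, status);
}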
The requirement to process a request on the same thread that received it is ignored, since I didn't see a reason given for it: making the receiving thread wait for processing to complete would eliminate the fire-and-forget character.
private BlockingCollection<ImportFileInfo> requestQueue = new BlockingCollection<ImportFileInfo>();
private bool isServiceEnabled;
private const int MaxNumberOfThreads = 8;
private readonly Semaphore semaphore = new Semaphore(MaxNumberOfThreads, MaxNumberOfThreads);
private readonly object syncLock = new object();

public void UploadAndImport(ImportFileInfo request)
{
    // Start the request handler background loop
    if (!this.isServiceEnabled)
    {
        this.requestQueue?.Dispose();
        this.requestQueue = new BlockingCollection<ImportFileInfo>();

        // Fire and forget: the caller does not wait for processing
        Task.Run(() => HandleRequests());
        this.isServiceEnabled = true;
    }

    // Cache multiple incoming client requests (and enable throttling)
    this.requestQueue.Add(request);
}

private void HandleRequests()
{
    while (!this.requestQueue.IsCompleted)
    {
        // Wait while the thread limit is exceeded (some throttling)
        this.semaphore.WaitOne();

        // Process the incoming requests in a dedicated thread
        // until the BlockingCollection is marked completed.
        Task.Run(() => ProcessRequest());
    }

    // Reset the request handler after the BlockingCollection was marked completed
    this.isServiceEnabled = false;
    this.requestQueue.Dispose();
}

private void ProcessRequest()
{
    ImportFileInfo request = this.requestQueue.Take();
    UploadFile(request);

    // The question says the ImportFile() method requires synchronization.
    // This is a bottleneck and will significantly drop performance
    // when the method is long-running.
    lock (this.syncLock)
    {
        ImportFile(request);
    }

    this.semaphore.Release();
}
Remarks:
BlockingCollection is an IDisposable.
TODO: you have to "close" the BlockingCollection by marking it completed ("BlockingCollection.CompleteAdding()"), or it will loop indefinitely waiting for further requests. Maybe you introduce an additional request method for the client to cancel and/or update the process and to mark adding to the BlockingCollection as completed. Or a timer that waits for an idle period before marking it completed. Or make your request handler thread block or spin.
Replace Take() and Add(...) with TryTake(...) and TryAdd(...) if you want cancellation support.
Code is not tested.
Your "ImportFile()" method is a bottleneck in your multi threading environment. I suggest to make it thread safe. In case of I/O that requires synchronization, I would cache the data in a BlockingCollection and then write them to I/O one by one.
The problem is that your total bandwidth is very small (only one job can run at a time), yet you want to handle parallel requests. That means queue time could vary wildly. It may not be the best choice to implement your job queue in memory, as it would make your system much more brittle and more difficult to scale out when your business grows.
A traditional, scalable way to architect this would be:
An HTTP service to accept requests, load balanced/redundant, with no session state.
A SQL Server database to persist the requests in a queue, returning a persistent unique job ID.
A Windows service to process the queue, one job at a time, and mark jobs as complete. The worker process for the service would probably be single-threaded.
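As a rough illustration of the third piece, here is a sketch of the worker loop. The table name, columns, and connection string are assumptions, ProcessJob is a placeholder, and a real service would add error handling and retries:

using System.Data.SqlClient;
using System.Threading;

// Claims one queued job at a time. READPAST skips rows locked by other workers
// and UPDLOCK prevents two workers from claiming the same row.
void RunWorkerLoop(CancellationToken token)
{
    while (!token.IsCancellationRequested)
    {
        using (var connection = new SqlConnection("...")) // connection string is an assumption
        {
            connection.Open();
            var claim = new SqlCommand(
                @"UPDATE TOP (1) ImportQueue WITH (READPAST, UPDLOCK)
                  SET Status = 'InProgress'
                  OUTPUT inserted.JobId, inserted.Payload
                  WHERE Status = 'Queued'", connection);

            using (var reader = claim.ExecuteReader())
            {
                if (!reader.Read())
                {
                    Thread.Sleep(1000); // queue is empty; poll again shortly
                    continue;
                }
                var jobId = reader.GetGuid(0);
                var payload = reader.GetString(1);
                ProcessJob(jobId, payload); // import, then mark the row complete
            }
        }
    }
}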
This solution requires you to choose a web server. A common choice is IIS running ASP.NET. On that platform, each request is guaranteed to be handled in a single-threaded manner (i.e. you don't need to worry about race conditions too much). Due to a feature called thread agility, a request might finish on a different thread than it started on, but in the original synchronization context, which means you will probably never notice unless you are debugging and inspecting thread IDs.
Given the constraints and context of our system, this is the implementation we ended up using:
static ImportFileInfo _importInProgressItem = null;

static readonly ConcurrentQueue<ImportFileInfo> ImportQueue =
    new ConcurrentQueue<ImportFileInfo>();

public void UploadAndImport(ImportFileInfo request)
{
    UploadFile(request);
    ImportFileSynchronized(request);
}

// Synchronize the file import,
// because the database allows a user to perform only one write at a time.
private void ImportFileSynchronized(ImportFileInfo request)
{
    ImportQueue.Enqueue(request);
    do
    {
        ImportQueue.TryPeek(out var next);
        if (null != Interlocked.CompareExchange(ref _importInProgressItem, next, null))
        {
            // Queue processing is already under way in another thread.
            return;
        }
        ImportFile(next);
        ImportQueue.TryDequeue(out _);
        Interlocked.Exchange(ref _importInProgressItem, null);
    } while (ImportQueue.Any());
}

public bool UploadAndImportIsComplete(Guid operationId) =>
    ImportQueue.All(waiting => waiting.OperationId != operationId);
This solution works well for the loads we are expecting: a maximum of about 15-20 concurrent PDF file uploads. A batch of up to 15-20 files tends to arrive all at once, and then things go quiet for several hours until the next batch arrives.
Criticism and feedback is most welcome.
I am from a Ruby background. I have a project that needs to be migrated to C#. It will make thousands of API service calls. In Ruby I use Typhoeus Hydra to run the requests in parallel and process the responses in parallel.
NOTE: each API call is separate; there are no dependencies between calls.
The Ruby template looks like this:
# typhoeus gem used to make api calls
QUEUE = Typhoeus::Hydra.new

(1..100).each do |val|
  request = Typhoeus::Request.new("http://api.com/?value=#{val}")
  request.on_complete do |response|
    # code to be executed after each call
  end
  QUEUE.queue(request)
end

# running the queue runs 100 api calls in parallel and executes the complete blocks in parallel
QUEUE.run
I have a rough idea that I have to work with async/await (the TPL) in C#, but I need some good examples that would be helpful.
Thanks in advance.
You should have a look at the Parallel class from the Task Parallel Library (TPL).
You can make the requests like this:
Parallel.ForEach(Enumerable.Range(1, 100), val =>
{
    // make a synchronous api call
    using (var webClient = new WebClient())
    {
        var result = webClient.DownloadString(string.Format("http://api.com/?value={0}", val));
        // work on the result
    }
});
Parallel processing is an option; however, it blocks threads unnecessarily. Since your operation is I/O-bound (hitting an HTTP API), asynchronous concurrency is a better option.
First, you'd define your "download and process" operation:
private static HttpClient client = new HttpClient();

private static async Task DownloadAndProcessAsync(string value)
{
    var response = await client.GetStringAsync($"http://api.com/?value={value}");
    // Process response.
}
If you want to run them all concurrently, then a simple Task.WhenAll would suffice:
var source = Enumerable.Range(1, 100);
var tasks = source.Select(v => DownloadAndProcessAsync(v.ToString()));
await Task.WhenAll(tasks);
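If you need to cap how many downloads are in flight at once, a common approach (sketched here as an assumption, not as part of the original answer) is to gate the operation with SemaphoreSlim:

private static SemaphoreSlim throttle = new SemaphoreSlim(10); // at most 10 concurrent downloads

private static async Task DownloadAndProcessThrottledAsync(string value)
{
    await throttle.WaitAsync(); // asynchronously wait for a free slot
    try
    {
        await DownloadAndProcessAsync(value);
    }
    finally
    {
        throttle.Release(); // free the slot even if the download failed
    }
}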
For more information about async/await, see my intro to async blog post (and the followup resources at the end of it).
I have a number of producer tasks that push data into a BlockingCollection; let's call it requestQueue.
I also have a consumer task that pops requests from the requestQueue, and forwards async http requests to a remote web service.
I need to throttle or block the number of active requests sent to the web service. On some machines that are far away from the service or have a slower internet connection, the http response time is long enough that the number of active requests fills up more memory than I'd like.
At the moment I am using a semaphore approach, calling WaitOne on the consumer thread multiple times, and Release on the HTTP response callback. Is there a more elegant solution?
I am bound to .net 4.0, and would like a standard library based solution.
You are already using a BlockingCollection, so why have a WaitHandle?
The way I would do it is to have a BlockingCollection with n as its bounded capacity, where n is the maximum number of concurrent requests you want to have at any given time.
You can then do something like...
var n = 4;
var blockingQueue = new BlockingCollection<Request>(n);

Action<Request> consumer = request =>
{
    // do something with request.
};

var noOfWorkers = 4;
var workers = new Task[noOfWorkers];

for (int i = 0; i < noOfWorkers; i++)
{
    var task = new Task(() =>
    {
        foreach (var item in blockingQueue.GetConsumingEnumerable())
        {
            consumer(item);
        }
    }, TaskCreationOptions.LongRunning); // DenyChildAttach needs .NET 4.5, so it's omitted for your .NET 4.0 constraint

    workers[i] = task;
    workers[i].Start();
}

Task.WaitAll(workers);
I'll let you take care of cancellation and error handling, but with this you can also control how many workers you want to have at any given time. If the workers are busy sending and processing requests, any producer will be blocked until more room is available in the queue; a cancellation sketch follows.
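For the cancellation part, GetConsumingEnumerable has an overload that takes a token; a sketch (the try/catch is needed because cancellation surfaces as an OperationCanceledException):

var cts = new CancellationTokenSource();

var worker = new Task(() =>
{
    try
    {
        foreach (var item in blockingQueue.GetConsumingEnumerable(cts.Token))
        {
            consumer(item);
        }
    }
    catch (OperationCanceledException)
    {
        // shutting down: fall out of the loop
    }
}, TaskCreationOptions.LongRunning);

// Calling cts.Cancel() from elsewhere wakes a blocked worker and ends the loop.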
I used to have:
using (MyWebClient client = new MyWebClient(TimeoutInSeconds))
{
    var res = client.DownloadData(par.Base_url);
    //code that checks res
}
Now I have:
using (MyWebClient client = new MyWebClient(TimeoutInSeconds))
{
    client.DownloadDataAsync(new Uri(par.Base_url));
    client.DownloadDataCompleted += (sender, e) =>
    {
        //code that checks e.Result
    };
}
Where MyWebClient is derived from WebClient.
Now I have lots of threads doing this. In the first case memory consumption wasn't an issue, while in the second one I see a steady rise in memory until I get an OutOfMemoryException.
I profiled, and it seems that WebClient is the culprit: it is not being disposed, and the downloaded data is kept. But why? What's the difference between the two cases? Perhaps e.Result needs to be disposed of somehow?
Your first case limits the number of concurrent downloads to the number of threads. Your second case has no limit on the number of concurrent downloads.
You are disposing of your WebClient immediately in the second option, before the download has completed. You have a couple of choices:
If you're using .NET 4.5 (or .NET 4.0 with Visual Studio 2012 and the AsyncTargetingPack installed), you can do var res = await client.DownloadDataTaskAsync(par.Base_url); and have code that looks similar to your first version but is actually asynchronous.
Use a normal continuation and get rid of your using block.
The first option would look like this:
using (MyWebClient client = new MyWebClient(TimeoutInSeconds))
{
    var res = await client.DownloadDataTaskAsync(par.Base_url);
    //code that checks res
}
The second option would look like this:
var client = new MyWebClient(TimeoutInSeconds);
client.DownloadDataTaskAsync(new Uri(par.Base_url))
    .ContinueWith(t =>
    {
        client.Dispose();
        var res = t.Result;
        //code that checks res
    });
HOWEVER
You must change your threading approach depending on which solution you use. The first version of your code runs synchronously, so if you have a thread dedicated to a URL (or connection or however it is you're splitting them up), the downloads will run synchronously on that thread and block it. If you choose either of these options, however, you'll end up using IO completion threads to complete your work, splitting it out from the main thread. In the long run, this is probably better, but it means you have to be mindful about how many of these requests you submit in parallel.