Multiple HttpClients with proxies, trying to achieve maximum download speed - c#

I need to use proxies to download a forum. The problem with my code is that it takes only 10% of my internet bandwidth. Also I have read that I need to use a single HttpClient instance, but with multiple proxies I don't know how to do it. Changing MaxDegreeOfParallelism doesn't change anything.
public static IAsyncEnumerable<IFetchResult> FetchInParallelAsync(
this IEnumerable<Url> urls, FetchContext context)
{
var fetchBlcock = new TransformBlock<Url, IFetchResult>(
transform: url => url.FetchAsync(context),
dataflowBlockOptions: new ExecutionDataflowBlockOptions
{
MaxDegreeOfParallelism = 128
}
);
foreach(var url in urls)
fetchBlcock.Post(url);
fetchBlcock.Complete();
var result = fetchBlcock.ToAsyncEnumerable();
return result;
}
Every call to FetchAsync will create or reuse a HttpClient with a WebProxy.
public static async Task<IFetchResult> FetchAsync(this Url url, FetchContext context)
{
var httpClient = context.ProxyPool.Rent();
var result = await url.FetchAsync(httpClient, context.Observer, context.Delay,
context.isReloadWithCookie);
context.ProxyPool.Return(httpClient);
return result;
}
public HttpClient Rent()
{
lock(_lockObject)
{
if (_uninitiliazedDatacenterProxiesAddresses.Count != 0)
{
var proxyAddress = _uninitiliazedDatacenterProxiesAddresses.Pop();
return proxyAddress.GetWebProxy(DataCenterProxiesCredentials).GetHttpClient();
}
return _proxiesQueue.Dequeue();
}
}
I am a novice at software developing, but the task of downloading using hundreds or thousands of proxies asynchronously looks like a trivial task that many should have been faced with and found a correct way to do it. So far I was unable to find any solutions to my problem on the internet. Any thoughts of how to achieve maximum download speed?

Let's take a look at what happens here:
var result = await url.FetchAsync(httpClient, context.Observer, context.Delay, context.isReloadWithCookie);
You are actually awaiting before you continue with the next item. That's why it is asynchronous and not parallel programming. async in Microsoft docs
The await keyword is where the magic happens. It yields control to the caller of the method that performed await, and it ultimately allows a UI to be responsive or a service to be elastic.
In essence, it frees the calling thread to do other stuff but the original calling code is suspended from executing, until the IO operation is done.
Now to your problem:
You can either use this excellent solution here: foreach async
You can use the Parallel library to execute your code in different threads.
Something like the following from Parallel for example
Parallel.For(0, urls.Count,
index => fetchBlcock.Post(urls[index])
});

Related

Multi-threading in a foreach loop

I have read a few stackoverflow threads about multi-threading in a foreach loop, but I am not sure I am understanding and using it right.
I have tried multiple scenarios, but I am not seeing much increase in performance.
Here is what I believe runs Asynchronous tasks, but running synchronously in the loop using a single thread:
Stopwatch stopWatch = new Stopwatch();
stopWatch.Start();
foreach (IExchangeAPI selectedApi in selectedApis)
{
if (exchangeSymbols.TryGetValue(selectedApi.Name, out symbol))
{
ticker = await selectedApi.GetTickerAsync(symbol);
}
}
stopWatch.Stop();
Here is what I hoped to be running Asynchronously (still using a single thread) - I would have expected some speed improvement already here:
List<Task<ExchangeTicker>> exchTkrs = new List<Task<ExchangeTicker>>();
stopWatch.Start();
foreach (IExchangeAPI selectedApi in selectedApis)
{
if (exchangeSymbols.TryGetValue(selectedApi.Name, out symbol))
{
exchTkrs.Add(selectedApi.GetTickerAsync(symbol));
}
}
ExchangeTicker[] retTickers = await Task.WhenAll(exchTkrs);
stopWatch.Stop();
Here is what I would have hoped to run Asynchronously in Multi-thread:
stopWatch.Start();
Parallel.ForEach(selectedApis, async (IExchangeAPI selectedApi) =>
{
if (exchangeSymbols.TryGetValue(selectedApi.Name, out symbol))
{
ticker = await selectedApi.GetTickerAsync(symbol);
}
});
stopWatch.Stop();
Stop watch results interpreted as follows:
Console.WriteLine("Time elapsed (ns): {0}", stopWatch.Elapsed.TotalMilliseconds * 1000000);
Console outputs:
Time elapsed (ns): 4183308100
Time elapsed (ns): 4183946299.9999995
Time elapsed (ns): 4188032599.9999995
Now, the speed improvement looks minuscule. Am I doing something wrong or is that more or less what I should be expecting? I suppose writing to files would be a better to check that.
Would you mind also confirming I am interpreting the different use cases correctly?
Finally, using a foreach loop in order to get the ticker from multiple platforms in parallel may not be the best approach. Suggestions on how to improve this would be welcome.
EDIT
Note that I am using the ExchangeSharp code base that you can find here
Here is what the GerTickerAsync() method looks like:
public virtual async Task<ExchangeTicker> GetTickerAsync(string marketSymbol)
{
marketSymbol = NormalizeMarketSymbol(marketSymbol);
return await Cache.CacheMethod(MethodCachePolicy, async () => await OnGetTickerAsync(marketSymbol), nameof(GetTickerAsync), nameof(marketSymbol), marketSymbol);
}
For the Kraken API, you then have:
protected override async Task<ExchangeTicker> OnGetTickerAsync(string marketSymbol)
{
JToken apiTickers = await MakeJsonRequestAsync<JToken>("/0/public/Ticker", null, new Dictionary<string, object> { { "pair", NormalizeMarketSymbol(marketSymbol) } });
JToken ticker = apiTickers[marketSymbol];
return await ConvertToExchangeTickerAsync(marketSymbol, ticker);
}
And the Caching method:
public static async Task<T> CacheMethod<T>(this ICache cache, Dictionary<string, TimeSpan> methodCachePolicy, Func<Task<T>> method, params object?[] arguments) where T : class
{
await new SynchronizationContextRemover();
methodCachePolicy.ThrowIfNull(nameof(methodCachePolicy));
if (arguments.Length % 2 == 0)
{
throw new ArgumentException("Must pass function name and then name and value of each argument");
}
string methodName = (arguments[0] ?? string.Empty).ToStringInvariant();
string cacheKey = methodName;
for (int i = 1; i < arguments.Length;)
{
cacheKey += "|" + (arguments[i++] ?? string.Empty).ToStringInvariant() + "=" + (arguments[i++] ?? string.Empty).ToStringInvariant("(null)");
}
if (methodCachePolicy.TryGetValue(methodName, out TimeSpan cacheTime))
{
return (await cache.Get<T>(cacheKey, async () =>
{
T innerResult = await method();
return new CachedItem<T>(innerResult, CryptoUtility.UtcNow.Add(cacheTime));
})).Value;
}
else
{
return await method();
}
}
At first it should be pointed out that what you are trying to achieve is performance, not asynchrony. And you are trying to achieve it by running multiple operations concurrently, not in parallel. To keep the explanation simple I'll use a simplified version of your code, and I'll assume that each operation is a direct web request, without an intermediate caching layer, and with no dependencies to values existing in dictionaries.
foreach (var symbol in selectedSymbols)
{
var ticker = await selectedApi.GetTickerAsync(symbol);
}
The above code runs the operations sequentially. Each operation starts after the completion of the previous one.
var tasks = new List<Task<ExchangeTicker>>();
foreach (var symbol in selectedSymbols)
{
tasks.Add(selectedApi.GetTickerAsync(symbol));
}
var tickers = await Task.WhenAll(tasks);
The above code runs the operations concurrently. All operations start at once. The total duration is expected to be the duration of the longest running operation.
Parallel.ForEach(selectedSymbols, async symbol =>
{
var ticker = await selectedApi.GetTickerAsync(symbol);
});
The above code runs the operations concurrently, like the previous version with Task.WhenAll. It offers no advantage, while having the huge disadvantage that you no longer have a way to await the operations to complete. The Parallel.ForEach method will return immediately after launching the operations, because the Parallel class doesn't understand async delegates (it does not accept Func<Task> lambdas). Essentially there are a bunch of async void lambdas in there, that are running out of control, and in case of an exception they will bring down the process.
So the correct way to run the operations concurrently is the second way, using a list of tasks and the Task.WhenAll. Since you've already measured this method and haven't observed any performance improvements, I am assuming that there something else that serializes the concurrent operations. It could be something like a SemaphoreSlim hidden somewhere in your code, or some mechanism on the server side that throttles your requests. You'll have to investigate further to find where and why the throttling happens.
In general, when you do not see an increase by multi threading, it is because your task is not CPU limited or large enough to offset the overhead.
In your example, i.e.:
selectedApi.GetTickerAsync(symbol);
This can hae 2 reasons:
1: Looking up the ticker is brutally fast and it should not be an async to start with. I.e. when you look it up in a dictionary.
2: This is running via a http connection where the runtime is LIMITING THE NUMBER OF CONCURRENT CALLS. Regardless how many tasks you open, it will not use more than 4 at the same time.
Oh, and 3: you think async is using threads. It is not. It is particularly not the case in a codel ike this:
await selectedApi.GetTickerAsync(symbol);
Where you basically IMMEDIATELY WAIT FOR THE RESULT. There is no multi threading involved here at all.
foreach (IExchangeAPI selectedApi in selectedApis) {
if (exchangeSymbols.TryGetValue(selectedApi.Name, out symbol))
{
ticker = await selectedApi.GetTickerAsync(symbol);
} }
This is linear non threaded code using an async interface to not block the current thread while the (likely expensive IO) operation is in place. It starts one, THEN WAITS FOR THE RESULT. No 2 queries ever start at the same time.
If you want a possible (just as example) more scalable way:
In the foreach, do not await but add the task to a list of tasks.
Then start await once all the tasks have started. Likein a 2nd loop.
WAY not perfect, but at least the runtime has a CHANCE to do multiple lookups at the same time. Your await makes sure that you essentially run single threaded code, except async, so your thread goes back into the pool (and is not waiting for results), increasing your scalability - an item possibly not relevant in this case and definitely not measured in your test.

Understand parallel programming in C# with async examples

I am trying to understand parallel programming and I would like my async methods to run on multiple threads. I have written something but it does not work like I thought it should.
Code
public static async Task Main(string[] args)
{
var listAfterParallel = RunParallel(); // Running this function to return tasks
await Task.WhenAll(listAfterParallel); // I want the program exceution to stop until all tasks are returned or tasks are completed
Console.WriteLine("After Parallel Loop"); // But currently when I run program, after parallel loop command is printed first
Console.ReadLine();
}
public static async Task<ConcurrentBag<string>> RunParallel()
{
var client = new System.Net.Http.HttpClient();
client.DefaultRequestHeaders.Add("Accept", "application/json");
client.BaseAddress = new Uri("https://jsonplaceholder.typicode.com");
var list = new List<int>();
var listResults = new ConcurrentBag<string>();
for (int i = 1; i < 5; i++)
{
list.Add(i);
}
// Parallel for each branch to run await commands on multiple threads.
Parallel.ForEach(list, new ParallelOptions() { MaxDegreeOfParallelism = 2 }, async (index) =>
{
var response = await client.GetAsync("posts/" + index);
var contents = await response.Content.ReadAsStringAsync();
listResults.Add(contents);
Console.WriteLine(contents);
});
return listResults;
}
I would like RunParallel function to complete before "After parallel loop" is printed. Also I want my get posts method to run on multiple threads.
Any help would be appreciated!
What's happening here is that you're never waiting for the Parallel.ForEach block to complete - you're just returning the bag that it will eventually pump into. The reason for this is that because Parallel.ForEach expects Action delegates, you've created a lambda which returns void rather than Task. While async void methods are valid, they generally continue their work on a new thread and return to the caller as soon as they await a Task, and the Parallel.ForEach method therefore thinks the handler is done, even though it's kicked that remaining work off into a separate thread.
Instead, use a synchronous method here;
Parallel.ForEach(list, new ParallelOptions() { MaxDegreeOfParallelism = 2 }, index =>
{
var response = client.GetAsync("posts/" + index).Result;
var contents = response.Content.ReadAsStringAsync().Result;
listResults.Add(contents);
Console.WriteLine(contents);
});
If you absolutely must use await inside, Wrap it in Task.Run(...).GetAwaiter().GetResult();
Parallel.ForEach(list, new ParallelOptions() { MaxDegreeOfParallelism = 2 }, index => Task.Run(async () =>
{
var response = await client.GetAsync("posts/" + index);
var contents = await response.Content.ReadAsStringAsync();
listResults.Add(contents);
Console.WriteLine(contents);
}).GetAwaiter().GetResult();
In this case, however, Task.run generally goes to a new thread, so we've subverted most of the control of Parallel.ForEach; it's better to use async all the way down;
var tasks = list.Select(async (index) => {
var response = await client.GetAsync("posts/" + index);
var contents = await response.Content.ReadAsStringAsync();
listResults.Add(contents);
Console.WriteLine(contents);
});
await Task.WhenAll(tasks);
Since Select expects a Func<T, TResult>, it will interpret an async lambda with no return as an async Task method instead of async void, and thus give us something we can explicitly await
Take a look at this: There Is No Thread
When you are making multiple concurrent web requests it's not your CPU that is doing the hard work. It's the CPU of the web server that is serving your requests. Your CPU is doing nothing during this time. It's not in a special "Wait-state" or something. The hardware inside your box that is working is your network card, that writes data to your RAM. When the response is received then your CPU will be notified about the arrived data, so it can do something with them.
You need parallelism when you have heavy work to do inside your box, not when you want the heavy work to be done by the external world. From the point of view of your CPU, even your hard disk is part of the external world. So everything that applies to web requests, applies also to requests targeting filesystems and databases. These workloads are called I/O bound, to be distinguished from the so called CPU bound workloads.
For I/O bound workloads the tool offered by the .NET platform is the asynchronous Task. There are multiple APIs throughout the libraries that return Task objects. To achieve concurrency you typically start multiple tasks and then await them with Task.WhenAll. There are also more advanced tools like the TPL Dataflow library, that is build on top of Tasks. It offers capabilities like buffering, batching, configuring the maximum degree of concurrency, and much more.

How to correctly queue up tasks to run in C#

I have an enumeration of items (RunData.Demand), each representing some work involving calling an API over HTTP. It works great if I just foreach through it all and call the API during each iteration. However, each iteration takes a second or two so I'd like to run 2-3 threads and divide up the work between them. Here's what I'm doing:
ThreadPool.SetMaxThreads(2, 5); // Trying to limit the amount of threads
var tasks = RunData.Demand
.Select(service => Task.Run(async delegate
{
var availabilityResponse = await client.QueryAvailability(service);
// Do some other stuff, not really important
}));
await Task.WhenAll(tasks);
The client.QueryAvailability call basically calls an API using the HttpClient class:
public async Task<QueryAvailabilityResponse> QueryAvailability(QueryAvailabilityMultidayRequest request)
{
var response = await client.PostAsJsonAsync("api/queryavailabilitymultiday", request);
if (response.IsSuccessStatusCode)
{
return await response.Content.ReadAsAsync<QueryAvailabilityResponse>();
}
throw new HttpException((int) response.StatusCode, response.ReasonPhrase);
}
This works great for a while, but eventually things start timing out. If I set the HttpClient Timeout to an hour, then I start getting weird internal server errors.
What I started doing was setting a Stopwatch within the QueryAvailability method to see what was going on.
What's happening is all 1200 items in RunData.Demand are being created at once and all 1200 await client.PostAsJsonAsync methods are being called. It appears it then uses the 2 threads to slowly check back on the tasks, so towards the end I have tasks that have been waiting for 9 or 10 minutes.
Here's the behavior I would like:
I'd like to create the 1,200 tasks, then run them 3-4 at a time as threads become available. I do not want to queue up 1,200 HTTP calls immediately.
Is there a good way to go about doing this?
As I always recommend.. what you need is TPL Dataflow (to install: Install-Package System.Threading.Tasks.Dataflow).
You create an ActionBlock with an action to perform on each item. Set MaxDegreeOfParallelism for throttling. Start posting into it and await its completion:
var block = new ActionBlock<QueryAvailabilityMultidayRequest>(async service =>
{
var availabilityResponse = await client.QueryAvailability(service);
// ...
},
new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 4 });
foreach (var service in RunData.Demand)
{
block.Post(service);
}
block.Complete();
await block.Completion;
Old question, but I would like to propose an alternative lightweight solution using the SemaphoreSlim class. Just reference System.Threading.
SemaphoreSlim sem = new SemaphoreSlim(4,4);
foreach (var service in RunData.Demand)
{
await sem.WaitAsync();
Task t = Task.Run(async () =>
{
var availabilityResponse = await client.QueryAvailability(serviceCopy));
// do your other stuff here with the result of QueryAvailability
}
t.ContinueWith(sem.Release());
}
The semaphore acts as a locking mechanism. You can only enter the semaphore by calling Wait (WaitAsync) which subtracts one from the count. Calling release adds one to the count.
You're using async HTTP calls, so limiting the number of threads will not help (nor will ParallelOptions.MaxDegreeOfParallelism in Parallel.ForEach as one of the answers suggests). Even a single thread can initiate all requests and process the results as they arrive.
One way to solve it is to use TPL Dataflow.
Another nice solution is to divide the source IEnumerable into partitions and process items in each partition sequentially as described in this blog post:
public static Task ForEachAsync<T>(this IEnumerable<T> source, int dop, Func<T, Task> body)
{
return Task.WhenAll(
from partition in Partitioner.Create(source).GetPartitions(dop)
select Task.Run(async delegate
{
using (partition)
while (partition.MoveNext())
await body(partition.Current);
}));
}
While the Dataflow library is great, I think it's a bit heavy when not using block composition. I would tend to use something like the extension method below.
Also, unlike the Partitioner method, this runs the async methods on the calling context - the caveat being that if your code is not truly async, or takes a 'fast path', then it will effectively run synchronously since no threads are explicitly created.
public static async Task RunParallelAsync<T>(this IEnumerable<T> items, Func<T, Task> asyncAction, int maxParallel)
{
var tasks = new List<Task>();
foreach (var item in items)
{
tasks.Add(asyncAction(item));
if (tasks.Count < maxParallel)
continue;
var notCompleted = tasks.Where(t => !t.IsCompleted).ToList();
if (notCompleted.Count >= maxParallel)
await Task.WhenAny(notCompleted);
}
await Task.WhenAll(tasks);
}

Limit number of parallel running methods

In my class I have a download function. Now in order to not allow a too high number of concurrent downloads, I would like to block this function until a "download-spot" is free ;)
void Download(Uri uri)
{
currentDownloads++;
if (currentDownloads > MAX_DOWNLOADS)
{
//wait here
}
DoActualDownload(uri); // blocks long time
currentDownloads--;
}
Is there a ready made programming pattern for this in C# / .NET?
edit: unfortunatelyt i cant use features from .net4.5 but only .net4.0
May be this
var parallelOptions = new ParallelOptions
{
MaxDegreeOfParallelism = 3
};
Parallel.ForEach(downloadUri, parallelOptions, (uri, state, index) =>
{
YourDownLoad(uri);
});
You should use Semaphore for this concurrency problem, see more in the documentation:
https://msdn.microsoft.com/en-us/library/system.threading.semaphore(v=vs.110).aspx
For such cases, I create an own Queue<MyQuery> in a custom class like QueryManager, with some methods :
Each new query is enqueued in Queue<MyQuery> queries
After each "enqueue" AND in each query answer, I call checkIfQueryCouldBeSent()
The checkIfQueryCouldBeSent() method checks your conditions : number of concomitant queries, and so on. In your case you accept to launch a new query if global counter is less than 5. And you increment the counter
Decrement the counter in query answer
It works only if all your queries are asynchronous.
You have to store Callback in MyQuery class, and call it when query is over.
You're doing async IO bound work, there's no need to be using multiple threads with a call such as Parallel.ForEach.
You can simply use naturally async API's exposed in the BCL, such ones that make HTTP calls using HttpClient. Then, you can throttle your connections using SemaphoreSlim and it's WaitAsync method which asynchronously waits:
private readonly SemaphoreSlim semaphoreSlim = new SemaphoreSlim(3);
public async Task DownloadAsync(Uri uri)
{
await semaphoreSlim.WaitAsync();
try
{
string result = await DoActualDownloadAsync(uri);
}
finally
{
semaphoreSlim.Release();
}
}
And your DoActualyDownloadAsync will use HttpClient to do it's work. Something along the lines of:
public Task<string> DoActualDownloadAsync(Uri uri)
{
var httpClient = new HttpClient();
return httpClient.GetStringAsync(uri);
}

Task Scheduler with WCF Service Reference async function

I am trying to consume a service reference, making multiple requests at the same time using a task scheduler. The service includes an synchronous and an asynchronous function that returns a result set. I am a bit confused, and I have a couple of initial questions, and then I will share how far I got in each. I am using some logging, concurrency visualizer, and fiddler to investigate. Ultimately I want to use a reactive scheduler to make as many requests as possible.
1) Should I use the async function to make all the requests?
2) If I were to use the synchronous function in multiple tasks what would be the limited resources that would potentially starve my thread count?
Here is what I have so far:
var myScheduler = new myScheduler();
var myFactory = new Factory(myScheduler);
var myClientProxy = new ClientProxy();
var tasks = new List<Task<Response>>();
foreach( var request in Requests )
{
var localrequest = request;
tasks.Add( myFactory.StartNew( () =>
{
// log stuff
return client.GetResponsesAsync( localTransaction.Request );
// log some more stuff
}).Unwrap() );
}
Task.WaitAll( tasks.ToArray() );
// process all the requests after they are done
This runs but according to fiddler it just tries to do all of the requests at once. It could be the scheduler but I trust that more then I do the above.
I have also tried to implement it without the unwrap command and instead using an async await delegate and it does the same thing. I have also tried referencing the .result and that seems to do it sequentially. Using the non synchronous service function call with the scheduler/factory it only gets up to about 20 simultaneous requests at the same time per client.
Yes. It will allow your application to scale better by using fewer threads to accomplish more.
Threads. When you initiate a synchronous operation that is inherently asynchronous (e.g. I/O) you have a blocked thread waiting for the operation to complete. You could however be using this thread in the meantime to execute CPU bound operations.
The simplest way to limit the amount of concurrent requests is to use a SemaphoreSlim which allows to asynchronously wait to enter it:
async Task ConsumeService()
{
var client = new ClientProxy();
var semaphore = new SemaphoreSlim(100);
var tasks = Requests.Select(async request =>
{
await semaphore.WaitAsync();
try
{
return await client.GetResponsesAsync(request);
}
finally
{
semaphore.Release();
}
}).ToList();
await Task.WhenAll(tasks);
// TODO: Process responses...
}
Regardless of how you are calling the WCF service whether it is an Async call or a Synchronous one you will be bound by the WCF serviceThrottling limits. You should look at these settings and possible adjust them higher (if you have them set to low values for some reason), in .NET4 the defaults are pretty good, however In older versions of the .NET framework, these defaults were much more conservative than .NET4.
.NET 4.0
MaxConcurrentSessions: default is 100 * ProcessorCount
MaxConcurrentCalls: default is 16 * ProcessorCount
MaxConcurrentInstances: default is MaxConcurrentCalls+MaxConcurrentSessions
1.)Yes.
2.)Yes.
If you want to control the number of simultaneous requests you can try using Stephen Toub's ForEachAsync method. it allows you to control how many tasks are processed at the same time.
public static class Extensions
{
public static Task ForEachAsync<T>(this IEnumerable<T> source, int dop, Func<T, Task> body)
{
return Task.WhenAll(
from partition in Partitioner.Create(source).GetPartitions(dop)
select Task.Run(async delegate {
using (partition)
while (partition.MoveNext())
await body(partition.Current);
}));
}
}
void Main()
{
var myClientProxy = new ClientProxy();
var responses = new List<Response>();
// Max 10 concurrent requests
Requests.ForEachAsync<Request>(10, async (r) =>
{
var response = await client.GetResponsesAsync( localTransaction.Request );
responses.Add(response);
}).Wait();
}

Categories

Resources