Rx extensions Parallel.ForEach throttling - c#

I'm following the answer to this question: Rx extensions: Where is Parallel.ForEach? in order to run a number of operations in parallel using Rx.
The problem I'm running into is that it seems to be allocating a new thread for every request, whereas Parallel.ForEach used considerably fewer.
The processes I'm running in parallel are quite memory intensive, so if I'm trying to process hundreds of items at once the answer provided to the linked question quickly sees me running out of memory.
Is there a way I can modify that answer to throttle the number of items being done at any given time?
I've taken a look at the Window and Buffer operations. My code currently looks like this:
return inputs.Select(i => new AccountViewModel(i))
.ToObservable()
.ObserveOn(RxApp.MainThreadScheduler)
.ToList()
.Do(l =>
{
using (Accounts.SuppressChangeNotifications())
{
Accounts.AddRange(l);
}
})
.SelectMany(x => x)
.SelectMany(acc => Observable.StartAsync(async () =>
{
var res = await acc.ProcessAsync(config, m, outputPath);
processed++;
var prog = ((double) processed/inputs.Count())*100.0;
OverallProgress.Message.OnNext(string.Format("Processing Accounts ({0:000}%)", prog));
OverallProgress.Progress.OnNext(prog);
return res;
}))
.All(x => x);
Ideally I want to be able to batch it up into chunks of account view models, that I then call the ProcessAsync method on, and only once all of that batch are done move on.
Ideally I'd like it so that if even only one of the batch finished, it moved on, but only ever kept the same batch size.
So if I've got a batch of 5 and 1 finishes, I'd like another to start, but only one until more space is available.

As usual, Paul Betts has answered a similar question that solves my problem:
The question "Reactive Extensions Parallel processing based on specific number" has some information on using Observable.Defer and then merging into batches. Using that, I've modified my previous code like so:
return inputs.Select(i => new AccountViewModel(i))
.ToObservable()
.ObserveOn(RxApp.MainThreadScheduler)
.ToList()
.Do(l =>
{
using (Accounts.SuppressChangeNotifications())
{
Accounts.AddRange(l);
}
})
.SelectMany(x => x)
.Select(x => Observable.DeferAsync(async _ =>
{
var res = await x.ProcessAsync(config, m, outputPath);
processed++;
var prog = ((double) processed/inputs.Count())*100.0;
OverallProgress.Message.OnNext(string.Format("Processing Accounts ({0:000}%)", prog));
OverallProgress.Progress.OnNext(prog);
return Observable.Return(res);
}))
.Merge(5)
.All(x => x);
And sure enough, I get the rolling completion behaviour (e.g. if 1/5 finish then just one starts).
Clearly I've got a few more fundamentals to grasp, but this is brilliant!
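The rolling behaviour described above can be reproduced in isolation. The following is a minimal sketch, not the code from the question: ProcessAsync here is a hypothetical stand-in that only waits briefly, and PeakConcurrency is an illustrative counter added to observe the limit. It assumes the System.Reactive NuGet package is referenced.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Reactive.Linq;
using System.Threading.Tasks;

public static class ThrottledProcessing
{
    private static readonly object Gate = new object();
    private static int _running;

    // Highest number of jobs observed in flight at the same time.
    public static int PeakConcurrency { get; private set; }

    // Hypothetical stand-in for acc.ProcessAsync(...): waits briefly and reports success.
    private static async Task<bool> ProcessAsync(int item)
    {
        lock (Gate)
        {
            _running++;
            PeakConcurrency = Math.Max(PeakConcurrency, _running);
        }
        await Task.Delay(50);
        lock (Gate) { _running--; }
        return true;
    }

    public static async Task<bool> RunAsync(IEnumerable<int> inputs, int maxConcurrent)
    {
        return await inputs.ToObservable()
            // FromAsync is deferred: the task is created only when Merge subscribes.
            .Select(i => Observable.FromAsync(() => ProcessAsync(i)))
            // At most maxConcurrent inner sequences are subscribed at once;
            // when one completes, the next queued one is started.
            .Merge(maxConcurrent)
            .All(ok => ok);
    }
}
```

Calling ThrottledProcessing.RunAsync(Enumerable.Range(0, 20), 5) should never have more than five jobs in flight. The key point is that each inner observable must be lazy (Defer, DeferAsync or FromAsync); with StartAsync the work begins as soon as the observable is created, so Merge has nothing left to throttle.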

Related

How to run parallel code for multiple search methods

I'm working on a code I would like to improve. It is a search method. Based on an input I would like to search this value in multiple tables of my database.
public async Task<IEnumerable<SearchResponseModel>> Search(string input)
{
var listOfSearchResponse = new List<SearchResponseModel>();
listOfSearchResponse.AddRange(await SearchOrder(input));
listOfSearchResponse.AddRange(await SearchJob(input));
listOfSearchResponse.AddRange(await SearchClient(input));
listOfSearchResponse.AddRange(await SearchItem(input));
listOfSearchResponse.AddRange(await SearchProduction(input));
return listOfSearchResponse;
}
I use the word await because every search is defined like this one:
public async Task<IEnumerable<SearchResponseModel>> SearchOrder(string input) {...}
My five search methods are not yet really async. They all execute in sequence after the previous one. What should I do from here to make them parallel?
I would think that something like this should work, in theory:
var tasks = new[]
{
SearchOrder(input),
SearchJob(input),
SearchClient(input),
SearchItem(input),
SearchProduction(input)
};
await Task.WhenAll(tasks);
//var listOfSearchResponse = tasks.Select(t => t.Result).ToList();
var listOfSearchResponse = tasks.SelectMany(t => t.Result).ToList();
In practice, it's hard to know how much benefit you'll see.
It's worth considering using Microsoft's Reactive Framework (aka Rx) - NuGet System.Reactive and add using System.Reactive.Linq; - then you can do this:
public IObservable<SearchResponseModel> SearchObservable(string input) =>
Observable.Defer<SearchResponseModel>(() =>
new []
{
Observable.FromAsync(() => SearchOrder(input)),
Observable.FromAsync(() => SearchJob(input)),
Observable.FromAsync(() => SearchClient(input)),
Observable.FromAsync(() => SearchItem(input)),
Observable.FromAsync(() => SearchProduction(input)),
}
.Merge()
.SelectMany(x => x));
The advantage here is that as each search completes you get the partial results through from the observable - there's no need to wait until all the tasks have finished.
Observables signal each value as it is produced, and they signal a final completion so you know when all of the results are through.
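To illustrate how partial results arrive, here is a small sketch under stated assumptions: FakeSearch is a hypothetical stub that replaces the real SearchOrder, SearchJob and SearchClient methods, and the table names and delays are made up for the example. It assumes the System.Reactive package.

```csharp
using System;
using System.Collections.Generic;
using System.Reactive.Linq;
using System.Threading.Tasks;

public static class StreamingSearch
{
    // Hypothetical search stub: returns one labelled hit after the given delay.
    private static async Task<IEnumerable<string>> FakeSearch(string table, string input, int delayMs)
    {
        await Task.Delay(delayMs);
        return new[] { table + ":" + input };
    }

    public static IObservable<string> Search(string input) =>
        new[]
        {
            Observable.FromAsync(() => FakeSearch("Order", input, 150)),
            Observable.FromAsync(() => FakeSearch("Job", input, 50)),
            Observable.FromAsync(() => FakeSearch("Client", input, 100)),
        }
        .Merge()              // subscribe to all searches at once
        .SelectMany(x => x);  // flatten each result batch into individual hits
}
```

A subscriber such as StreamingSearch.Search("abc").Subscribe(Console.WriteLine, () => Console.WriteLine("done")) receives each hit when its search finishes, followed by a single completion signal, rather than one list at the very end.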

rx.net locking up from use of ToEnumerable

I am trying to convert the below statement so that I can get the key alongside the selected list:
var feed = new Subject<TradeExecuted>();
feed
.GroupByUntil(x => (x.Execution.Contract.Symbol, x.Execution.AccountId, x.Tenant, x.UserId), x => Observable.Timer(TimeSpan.FromSeconds(5)))
.SelectMany(x => x.ToList())
.Select(trades => Observable.FromAsync(() => Mediator.Publish(trades, cts.Token)))
.Concat() // Ensure that the results are serialized.
.Subscribe(cts.Token); // Check status of calls.
The above works, whereas the below does not: when I try to iterate over the list, it locks up.
feed
.GroupByUntil(x => (x.Execution.Contract.Symbol, x.Execution.AccountId, x.Tenant, x.UserId), x => Observable.Timer(timespan))
.Select(x => Observable.FromAsync(() =>
{
var list = x.ToEnumerable(); // <---- LOCK UP if we use list.First() etc
var aggregate = AggregateTrades(x.Key.Symbol, x.Key.AccountId, x.Key.Tenant, list);
return Mediator.Publish(aggregate, cts.Token);
}))
.Concat()
.Subscribe(cts.Token); // Check status of calls.
I am clearly doing something wrong and probably horrific!
Going back to the original code, how can I get the Key alongside the enumerable list (and avoiding the hack below)?
As a side note, the code below works, but it is a nasty hack where I get the keys from the first list item:
feed
.GroupByUntil(x => (x.Execution.Contract.Symbol, x.Execution.AccountId, x.Tenant, x.UserId), x => Observable.Timer(TimeSpan.FromSeconds(5)))
.SelectMany(x => x.ToList())
.Select(trades => Observable.FromAsync(() =>
{
var firstTrade = trades.First();
var aggregate = AggregateTrades(firstTrade.Execution.Contract.Symbol, firstTrade.Execution.AccountId, firstTrade.Tenant, trades);
return Mediator.Publish(aggregate, cts.Token);
}))
.Concat() // Ensure that the results are serialized.
.Subscribe(cts.Token); // Check status of calls.
All versions of your code suffer from trying to eagerly evaluate the grouped sub-observable. Since in v1 and v3 your group observable will run a maximum of 5 seconds, that isn't horrible/awful, but it's still not great. In v2, I don't know what timespan is, but assuming it's 5 seconds, you have the same problem: Trying to turn the grouped sub-observable into a list or an enumerable means waiting for the sub-observable to complete, blocking the thread (or the task).
You can fix this by using the Buffer operator to lazily evaluate the grouped sub-observable:
var timespan = TimeSpan.FromSeconds(5);
feed
.GroupByUntil(x => (x.Execution.Contract.Symbol, x.Execution.AccountId, x.Tenant, x.UserId), x => Observable.Timer(timespan))
.SelectMany(x => x
.Buffer(timespan)
.Select(list => Observable.FromAsync(() =>
{
var aggregate = AggregateTrades(x.Key.Symbol, x.Key.AccountId, x.Key.Tenant, list);
return Mediator.Publish(aggregate, cts.Token);
}))
)
.Concat() // Ensure that the results are serialized.
.Subscribe(cts.Token); // Check status of calls.
This essentially means that until timespan is up, the items in the group by gather in a list inside Buffer. Once timespan is up, they're released as a list, and the mediator publish happens.

Rx.Net - process groups asynchronously and in parallel with a constrained concurrency

I'm playing with System.Reactive, trying to solve the following task:
Break an incoming stream of strings into groups
Items in each group must be processed asynchronously and sequentially
Groups must be processed in parallel
No more than N groups must be processed at the same time
Ideally, w/o using sync primitives
Here is the best I've figured out so far -
TaskFactory taskFactory = new (new LimitedConcurrencyLevelTaskScheduler(2));
TaskPoolScheduler scheduler = new (taskFactory);
source
.GroupBy(item => item)
.SelectMany(g => g.Select(item => Observable.FromAsync(() => onNextAsync(item))).ObserveOn(scheduler).Concat())
.Subscribe();
Any idea how to achieve it w/o a scheduler? Couldn't make it work via Merge()
The easiest way to enforce the "No more than N groups must be processed at the same time" limitation is probably to use a SemaphoreSlim. So instead of this:
.SelectMany(g => g.Select(item => Observable.FromAsync(() => onNextAsync(item))).Concat())
...you can do this:
var semaphore = new SemaphoreSlim(N, N);
//...
.SelectMany(g => g.Select(item => Observable.FromAsync(async () =>
{
await semaphore.WaitAsync();
try { return await onNextAsync(item); }
finally { semaphore.Release(); }
})).Merge(1))
Btw in the current Rx version (5.0.0) I don't trust the Concat operator, and I prefer to use the Merge(1) instead.
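For reference, the semaphore approach can be put into a self-contained form. This is only a sketch under assumptions: HandleAsync is a hypothetical replacement for onNextAsync, grouping by the first character of each string is an arbitrary choice for the example, and PeakConcurrency is an illustrative counter. It assumes the System.Reactive package.

```csharp
using System;
using System.Collections.Generic;
using System.Reactive.Linq;
using System.Threading;
using System.Threading.Tasks;

public static class GroupThrottle
{
    private static readonly object Gate = new object();
    private static int _running;

    // Highest number of items observed being handled at the same time.
    public static int PeakConcurrency { get; private set; }

    // Hypothetical stand-in for onNextAsync(item).
    private static async Task HandleAsync(string item)
    {
        lock (Gate) { _running++; PeakConcurrency = Math.Max(PeakConcurrency, _running); }
        await Task.Delay(20);
        lock (Gate) { _running--; }
    }

    // Returns the number of items handled.
    public static async Task<int> RunAsync(IEnumerable<string> items, int n)
    {
        var semaphore = new SemaphoreSlim(n, n);
        return await items.ToObservable()
            .GroupBy(item => item[0])                 // arbitrary key for the sketch
            .SelectMany(g => g
                .Select(item => Observable.FromAsync(async () =>
                {
                    await semaphore.WaitAsync();      // global limit of n
                    try { await HandleAsync(item); }
                    finally { semaphore.Release(); }
                }))
                .Merge(1))                            // one item at a time per group
            .Count();
    }
}
```

Because each group runs at most one item at a time, a limit of n on items in flight is also a limit of n on groups in flight.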
To solve this problem using exclusively Rx tools, ideally you would like to have something like this:
source
.GroupBy(item => item.Key)
.Select(group => group.Select(
item => Observable.FromAsync(() => ProcessAsync(item))).Merge(1))
.Merge(maxConcurrent: N)
.Wait();
The inner Merge(1) would enforce the exclusive processing within each group, and the outer Merge(N) would enforce the global maximum concurrency policy. Unfortunately this doesn't work because the outer Merge(N) restricts the subscriptions to the inner sequences (the IGroupedObservable<T>s), not to their individual elements. This is not what you want. The result will be that only the first N groups are processed, and the elements of all other groups are ignored. The GroupBy operator creates hot subsequences, and if you don't subscribe to them immediately you'll lose elements.
In order for the outer Merge(N) to work as desired, you'll have to freely merge all the inner sequences that are produced by Observable.FromAsync, and have some other mechanism to serialize the processing of each group. One idea is to implement a special Select operator that emits an Observable.FromAsync only after the previous one has completed. Below is such an implementation, based on the Zip operator. The Zip operator internally maintains two hidden buffers, so that it can produce pairs from two sequences that might emit elements at different frequencies. This buffering is exactly what we need in order to avoid losing elements.
private static IObservable<IObservable<TResult>> SelectOneByOne<TSource, TResult>(
this IObservable<TSource> source,
Func<TSource, IObservable<TResult>> selector)
{
var subject = new BehaviorSubject<Unit>(default);
var synchronizedSubject = Observer.Synchronize(subject);
return source
.Zip(subject, (item, _) => item)
.Select(item => selector(item).Do(
_ => { },
_ => synchronizedSubject.OnNext(default),
() => synchronizedSubject.OnNext(default)));
}
The BehaviorSubject<T> contains initially one element, so the first pair will be produced immediately. The second pair will not be produced before the first element has been processed. The same with the third pair and second element, etc.
You could then use this operator to solve the problem like this:
source
.GroupBy(item => item.Key)
.SelectMany(group => group.SelectOneByOne(
item => Observable.FromAsync(() => ProcessAsync(item))))
.Merge(maxConcurrent: N)
.Wait();
The above solution is presented only for the purpose of answering the question. I don't think that I would trust it in a production environment.

How can I get the first async response fastest (and don't perform the remainder)?

Here's the setup: There is a federal remote service which returns whether a particular value is correct or not correct. We can send requests as we like, up to 50 per request to the remote service.
Since we need to only use the correct value, and the set of possible values is small (~700), we can just send 15 or so batch requests of 50 and the correct value will be part of the result set. As such, I've used the following code:
Observable
.Range(0, requests.Count)
.Select(i => Observable.FromAsync(async () =>
{
responses.Add(await client.FederalService.VerifyAsync(requests[i]));
Console.Write(".");
}))
.Merge(8)
.Wait();
But - what I don't like about this is that if one of the earlier requests has the correct value, I still run all the possibilities through the service wasting time. I'm trying to make this run as fast as possible. I know the exit condition (response code is from 1 to 99, any response code within 50-59 indicates the value is "correct").
Is there a way to make this code a little smarter, so we minimize the number of requests? Unfortunately, the value we are verifying is distributed evenly so sorting the requests does nothing (that I'm aware of).
You should consider usage of the FirstAsync method here:
The secret in our example is the FirstAsync method. We are actually awaiting the first result returned by our observable and don’t care about any further results.
So your code could be like this:
await Observable
.Range(0, requests.Count)
.Select(i => Observable.FromAsync(async () =>
{
responses.Add(await client.FederalService.VerifyAsync(requests[i]));
Console.Write(".");
}))
.FirstAsync()
.Subscribe(Console.WriteLine);
> System.Reactive.Linq.ObservableImpl.Defer`1[System.Reactive.Unit]
The article "Rx and Await: Some Notes" provides some tricks with similar methods. For example, FirstAsync has an overload that takes a predicate, just like the LINQ method First:
await Observable
.Range(0, requests.Count)
.Select(i => Observable.FromAsync(async () =>
{
responses.Add(await client.FederalService.VerifyAsync(requests[i]));
Console.Write(".");
}))
.FirstAsync(r => /* do the check here */)
.Subscribe(Console.WriteLine);
You're pretty close. Change your observable to this:
Observable
.Range(0, requests.Count)
.Select(i => Observable.FromAsync(async () =>
{
var response = await Task.FromResult(i); //replace with client.FederalService.VerifyAsync(requests[i])
responses.Add(response);
Console.Write($"{i}.");
var responseCode = response; //replace with however you get the response code.
return responseCode >= 50 && responseCode <= 59;
}))
.Merge(8)
.Where(b => b)
.Take(1)
.Wait();
This way your observable continues to emit values, so you can continue acting on it.
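As a further sketch of the Where plus Take(1) idea: Observable.FromAsync also has an overload whose delegate receives a CancellationToken, and that token is cancelled when the subscription is disposed, which is what Take(1) does once it has its value. The example below assumes a hypothetical VerifyAsync that accepts a token (the real client.FederalService.VerifyAsync may not), and the rule that only batch 3 returns a 50-59 code is invented for the demonstration. It assumes the System.Reactive package.

```csharp
using System;
using System.Reactive.Linq;
using System.Threading;
using System.Threading.Tasks;

public static class FirstMatch
{
    // Number of requests that were actually started.
    public static int Started;

    // Hypothetical stand-in for the remote call: returns a response code.
    private static async Task<int> VerifyAsync(int batch, CancellationToken ct)
    {
        Interlocked.Increment(ref Started);
        await Task.Delay(20, ct);
        return batch == 3 ? 55 : 10;   // only batch 3 is "correct" in this sketch
    }

    // Returns the index of the first batch with a 50-59 code, or -1 if none matched.
    public static async Task<int> FindAsync(int batchCount, int maxConcurrent) =>
        await Observable.Range(0, batchCount)
            .Select(i => Observable.FromAsync(async ct =>
                new { Batch = i, Code = await VerifyAsync(i, ct) }))
            .Merge(maxConcurrent)
            .Where(r => r.Code >= 50 && r.Code <= 59)
            .Select(r => r.Batch)
            .Take(1)                   // completing here unsubscribes from everything upstream
            .DefaultIfEmpty(-1);
}
```

Once the matching response arrives, no further batches are started, and requests still in flight see their token cancelled.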

Parallelizing data processing

I'm trying to improve the runtime of some data processing I'm doing. The data starts out as various collections (Dictionary mostly, but a few other IEnumerable types), and the end result of processing should be a Dictionary<DataType, List<DataPoint>>.
I have all this working just fine... except it takes close to an hour to run, and it needs to run in under 20 minutes. None of the data has any connection to any other from the same collection, although they cross-reference other collections frequently, so I figured I should parallelize it.
The main structure of the processing has two levels of loops with some processing in between:
// Custom class, 0.01%
var primaryData= GETPRIMARY().ToDictionary(x => x.ID, x => x);
// Custom class, 11.30%
var data1 = GETDATAONE().GroupBy(x => x.Category)
.ToDictionary(x => x.Key, x => x);
// DataRows, 8.19%
var data2 = GETDATATWO().GroupBy(x => x.Type)
.ToDictionary(x => x.Key, x => x.OrderBy(y => y.ID));
foreach (var key in listOfKeys)
{
// 0.01%
var subData1 = data1[key].ToDictionary(x => x.ID, x => x);
// 1.99%
var subData2 = data2.GroupBy(x => x.ID)
.Where(x => primaryData.ContainsKey(x.Type))
.ToDictionary(x => x.Key, x => ProcessDataTwo(x, primaryData[x.Key]));
// 0.70%
var grouped = primaryData.Select(x => new { ID = x.Key,
Data1 = subData1[x.Key],
Data2 = subData2[x.Key] }).ToList();
foreach (var item in grouped)
{
// 62.12%
item.Data1.Results = new Results(item.ID, item.Data2);
// 12.37%
item.Data1.Status = new Status(item.ID, item.Data2);
}
results.Add(key, grouped);
}
return results;
listOfKeys is very small, but each grouped will have several thousand items. How can I structure this so that each call to item.Data1.Process(item.Data2) can get queued up and executed in parallel?
According to my profiler, all the ToDictionary() calls together take up about 21% of the time, the ToList() takes up 0.7%, and the two items inside the inner foreach together take up 74%. Hence why I'm focusing my optimization there.
I don't know if I should use Parallel.ForEach() to replace the outer foreach, the inner one, both, or if there's some other structure I should use. I'm also not sure if there's anything I can do to the data (or the structures holding it) to improve parallel access to it.
(Note that I'm stuck on .NET4, so don't have access to async or await)
Based on the percentages you posted, and given that you said grouped is very large, you would definitely benefit from parallelizing only the inner loop.
Doing so is fairly simple:
var grouped = primaryData.Select(x => new { ID = x.Key,
Data1 = subData1[x.Key],
Data2 = subData2[x.Key] }).ToList();
Parallel.ForEach(grouped, (item) =>
{
item.Data1.Results = new Results(item.ID, item.Data2);
item.Data1.Status = new Status(item.ID, item.Data2);
});
results.Add(key, grouped);
This assumes that new Results(item.ID, item.Data2); and new Status(item.ID, item.Data2); are safe to do multiple initializations at once (the only concern I would have is if they access non-thread safe static resources internally, and even so a non thread safe constructor is a really bad design flaw)
There is one big caveat: This will only help if you are CPU bound. If Results or Status is IO bound (for example it is waiting on a database call or a file on the hard drive) doing this will hurt your performance instead of helping it. If you are IO bound instead of CPU bound the only options are to buy faster hardware, attempt to optimize those two methods more, or use caching in memory if possible so you don't need to do the slow IO.
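If the number of worker threads needs to be capped (for example because each item is memory hungry), Parallel.ForEach accepts a ParallelOptions argument. The sketch below is illustrative only: Item and Crunch are hypothetical placeholders for the anonymous grouped items and for the Results and Status constructors, and it relies on each iteration writing only to its own item.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

public static class InnerLoopParallel
{
    public sealed class Item
    {
        public int Id;
        public long Result;
    }

    // Hypothetical CPU-bound work standing in for new Results(...) and new Status(...).
    public static long Crunch(int id)
    {
        long acc = 0;
        for (int i = 0; i < 10000; i++) acc += (id * 31L + i) % 7;
        return acc;
    }

    public static List<Item> Process(int count, int maxDegree)
    {
        var items = Enumerable.Range(0, count).Select(i => new Item { Id = i }).ToList();

        // Cap the number of concurrent iterations.
        var options = new ParallelOptions { MaxDegreeOfParallelism = maxDegree };

        Parallel.ForEach(items, options, item =>
        {
            // Each iteration touches only its own item, so no locking is needed.
            item.Result = Crunch(item.Id);
        });
        return items;
    }
}
```

Adding to a shared collection such as results inside the parallel body would not be safe without synchronization, which is why results.Add(key, grouped) stays outside the loop in the answer above.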
EDIT
Given the time measurements provided after I wrote this answer, it appears that this approach was looking for savings in the wrong places. I'll leave my answer as a warning against optimization without measurement!!!
So, because of the nestedness of your approach, you are causing some unnecessary over-iteration of some of your collections leading to rather nasty Big O characteristics.
This can be mitigated by using the ILookup interface to pre-group collections by a key and to use these instead of repeated and expensive Where clauses.
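As a small illustration of the difference, the sketch below builds a lookup once and then answers every key from it, instead of re-scanning the whole collection with Where for each key. The row shape (a KeyValuePair of type name and value) and the method name are assumptions made for the example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class LookupSketch
{
    // Counts the rows for each requested key using a single pass over rows.
    public static Dictionary<string, int> CountPerKey(
        IEnumerable<KeyValuePair<string, int>> rows,
        IEnumerable<string> keys)
    {
        // One pass builds the index; each later access is a hash lookup.
        ILookup<string, int> byType = rows.ToLookup(r => r.Key, r => r.Value);

        // Indexing a lookup with a missing key yields an empty sequence, not an exception.
        return keys.ToDictionary(k => k, k => byType[k].Count());
    }
}
```

With K keys and R rows, the repeated Where approach does on the order of K times R comparisons, while the lookup does one pass over R plus K hash lookups.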
I've had a stab at re-imagining your code to reduce complexity (but it is somewhat abstract):
var data2Lookup = data2.ToLookup(x => x.Type);
var tmp1 =
listOfKeys
.Select(key =>
new {
key,
subData1 = data1[key],
subData2 = data2Lookup[key].GroupBy(x=>x.Category)
})
.Select(x =>
new{
x.key,
x.subData1,
x.subData2,
subData2Lookup = x.subData2.ToLookup(y => y.Key)});
var tmp2 =
tmp1
.Select(x =>
new{
x.key,
grouped = x.subData1
.Select(sd1 =>
new{
Data1 = sd1,
Data2 = subData2Lookup[sd1]
})
});
var result =
tmp2
.ToDictionary(x => x.key, x => x.grouped);
It seems to me that the processing is somewhat arbitrarily placed midway through the building of results, but moving it shouldn't affect the outcome, right?
So once results is built, let's process it...
var items = result.SelectMany(kvp => kvp.Value);
foreach (var item in items)
{
item.Data1.Process(item.Data2);
}
EDIT
I've deliberately avoided going parallel for the time being, so if you can get this working, there might be further speedup by adding a bit of parallel magic.
