Parallel.ForEach and global variable - C#

If I have code as following:
var dict = new Dictionary<string, NetObject>();
Parallel.ForEach(results, options, result =>
{
var items = parser.Parse(result);
Parallel.ForEach(items, options, nextObject =>
{
if (nextObject != null)
{
dict[nextObject.Id] = nextObject;
}
});
});
The dict is a dictionary defined at method level. My question is: will the shared dict cause Parallel.ForEach to behave like a normal foreach, running synchronously, since every iteration writes to the same object? I don't see any performance difference between a normal foreach and the parallel one for the code above.

Your code isn't thread-safe. You're mutating a Dictionary<K, V>, which isn't a thread-safe data structure, so concurrent writes can corrupt its internal state rather than merely serialize the loop. You're also most likely over-parallelising by nesting a Parallel.ForEach for the inner loop inside another for the outer loop.
Let me suggest a different approach using PLINQ, which doesn't require you to synchronize a shared Dictionary. Note you should make sure there are no duplicate keys when doing this, since ToDictionary throws on duplicates (perhaps add a .Distinct() call first):
results
.AsParallel()
.SelectMany(x => parser.Parse(x))
.Where(x => x != null)
.ToDictionary(x => x.Id, x => x);
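If you'd rather keep the imperative loop, another option (not in the original answer) is to write into a ConcurrentDictionary, which supports concurrent mutation. A minimal sketch, assuming a stand-in Item class in place of NetObject, and parallelising only the outer loop:

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

public class Item
{
    public string Id { get; set; }
    public Item(string id) { Id = id; }
}

public static class Demo
{
    public static ConcurrentDictionary<string, Item> Collect(Item[][] results)
    {
        var dict = new ConcurrentDictionary<string, Item>();
        // One level of parallelism is usually enough; the inner loop stays sequential.
        Parallel.ForEach(results, items =>
        {
            foreach (var nextObject in items)
            {
                if (nextObject != null)
                {
                    // Thread-safe indexer: last writer wins, same semantics
                    // as dict[key] = value in the original code.
                    dict[nextObject.Id] = nextObject;
                }
            }
        });
        return dict;
    }
}
```

Note this fixes the safety problem, not necessarily the performance one; as the answer below says, benchmark before committing to parallelism.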
But of course, the most important thing is to benchmark your code to make sure that parallelism is actually increasing performance, and that you're gaining anything from this.

Related

Rx.Net - process groups asynchronously and in parallel with a constrained concurrency

Playing with System.Reactive, trying to solve the following task:
Break an incoming stream of strings into groups
Items in each group must be processed asynchronously and sequentially
Groups must be processed in parallel
No more than N groups must be processed at the same time
Ideally, w/o using sync primitives
Here is the best I've figured out so far -
TaskFactory taskFactory = new (new LimitedConcurrencyLevelTaskScheduler(2));
TaskPoolScheduler scheduler = new (taskFactory);
source
.GroupBy(item => item)
.SelectMany(g => g.Select(item => Observable.FromAsync(() => onNextAsync(item))).ObserveOn(scheduler).Concat())
.Subscribe();
Any idea how to achieve this without a scheduler? I couldn't make it work via Merge().
The easiest way to enforce the "No more than N groups must be processed at the same time" limitation is probably to use a SemaphoreSlim. So instead of this:
.SelectMany(g => g.Select(item => Observable.FromAsync(() => onNextAsync(item))).Concat())
...you can do this:
var semaphore = new SemaphoreSlim(N, N);
//...
.SelectMany(g => g.Select(item => Observable.FromAsync(async () =>
{
await semaphore.WaitAsync();
try { return await onNextAsync(item); }
finally { semaphore.Release(); }
})).Merge(1))
By the way, in the current Rx version (5.0.0) I don't trust the Concat operator, and I prefer to use Merge(1) instead.
To solve this problem using exclusively Rx tools, ideally you would like to have something like this:
source
.GroupBy(item => item.Key)
.Select(group => group.Select(
item => Observable.FromAsync(() => ProcessAsync(item))).Merge(1))
.Merge(maxConcurrent: N)
.Wait();
The inner Merge(1) would enforce the exclusive processing within each group, and the outer Merge(N) would enforce the global maximum concurrency policy. Unfortunately this doesn't work, because the outer Merge(N) restricts the subscriptions to the inner sequences (the IGroupedObservable<T>s), not to their individual elements. This is not what you want. The result will be that only the first N groups are processed, and the elements of all other groups are ignored. The GroupBy operator creates hot subsequences, and if you don't subscribe to them immediately you'll lose elements.
In order for the outer Merge(N) to work as desired, you'll have to merge freely all the inner sequences produced by Observable.FromAsync, and have some other mechanism to serialize the processing within each group. One idea is to implement a special Select operator that emits each Observable.FromAsync only after the previous one has completed. Below is such an implementation, based on the Zip operator. The Zip operator maintains internally two hidden buffers, so that it can produce pairs from two sequences that might emit elements at different frequencies. This buffering is exactly what we need in order to avoid losing elements.
private static IObservable<IObservable<TResult>> SelectOneByOne<TSource, TResult>(
this IObservable<TSource> source,
Func<TSource, IObservable<TResult>> selector)
{
var subject = new BehaviorSubject<Unit>(default);
var synchronizedSubject = Observer.Synchronize(subject);
return source
.Zip(subject, (item, _) => item)
.Select(item => selector(item).Do(
_ => { },
_ => synchronizedSubject.OnNext(default),
() => synchronizedSubject.OnNext(default)));
}
The BehaviorSubject<T> initially contains one element, so the first pair will be produced immediately. The second pair will not be produced before the first element has been processed, the third pair waits for the second element, and so on.
You could then use this operator to solve the problem like this:
source
.GroupBy(item => item.Key)
.SelectMany(group => group.SelectOneByOne(
item => Observable.FromAsync(() => ProcessAsync(item))))
.Merge(maxConcurrent: N)
.Wait();
The above solution is presented only for the purpose of answering the question. I don't think that I would trust it in a production environment.

Rx extensions Parallel.ForEach throttling

I'm following the answer to this question: Rx extensions: Where is Parallel.ForEach? in order to run a number of operations in parallel using Rx.
The problem I'm running into is that it seems to allocate a new thread for every request, whereas Parallel.ForEach used considerably fewer.
The processes I'm running in parallel are quite memory intensive, so if I'm trying to process hundreds of items at once the answer provided to the linked question quickly sees me running out of memory.
Is there a way I can modify that answer to throttle the number of items being done at any given time?
I've taken a look at the Window and Buffer operations; my code currently looks like this:
return inputs.Select(i => new AccountViewModel(i))
.ToObservable()
.ObserveOn(RxApp.MainThreadScheduler)
.ToList()
.Do(l =>
{
using (Accounts.SuppressChangeNotifications())
{
Accounts.AddRange(l);
}
})
.SelectMany(x => x)
.SelectMany(acc => Observable.StartAsync(async () =>
{
var res = await acc.ProcessAsync(config, m, outputPath);
processed++;
var prog = ((double) processed/inputs.Count())*100.0;
OverallProgress.Message.OnNext(string.Format("Processing Accounts ({0:000}%)", prog));
OverallProgress.Progress.OnNext(prog);
return res;
}))
.All(x => x);
Ideally I want to batch the account view models into chunks, call ProcessAsync on each, and only move on once the whole batch is done.
Better still, as soon as any one item in a batch finishes, the next should start, so the number in flight stays constant.
So with a batch size of 5, when one finishes I'd like another to start, but only one at a time, as space becomes available.
As usual, Paul Betts has answered a similar question that solves my problem: Reactive Extensions Parallel processing based on specific number.
It has some information on using Observable.Defer and then merging in batches; using that, I've modified my previous code like so:
return inputs.Select(i => new AccountViewModel(i))
.ToObservable()
.ObserveOn(RxApp.MainThreadScheduler)
.ToList()
.Do(l =>
{
using (Accounts.SuppressChangeNotifications())
{
Accounts.AddRange(l);
}
})
.SelectMany(x => x)
.Select(x => Observable.DeferAsync(async _ =>
{
var res = await x.ProcessAsync(config, m, outputPath);
processed++;
var prog = ((double) processed/inputs.Count())*100.0;
OverallProgress.Message.OnNext(string.Format("Processing Accounts ({0:000}%)", prog));
OverallProgress.Progress.OnNext(prog);
return Observable.Return(res);
}))
.Merge(5)
.All(x => x);
And sure enough, I get the rolling completion behaviour (as soon as one of the five in flight finishes, the next one starts).
Clearly I've got a few more fundamentals to grasp, but this is brilliant!

Dictionary of dictionaries

I have a resource file I grab like this:
var rs = <somewhere>.FAQ.ResourceManager
.GetResourceSet(CultureInfo.CurrentUICulture, true, true);
and I want to parse it into a dictionary of dictionaries, but I can't quite figure out how. This is what I'm trying:
var ret = rs.OfType<DictionaryEntry>()
.Where(x => x.Key.ToString().StartsWith("Title"))
.ToDictionary<string, Dictionary<String, string>>(
k => k.Value.ToString(),
v => rs.OfType<DictionaryEntry>()
.Where(x => x.Key.ToString().StartsWith(v.Value.ToString().Replace("Title", "")))
.ToDictionary<string, string>(
key => key.Value,
val => val.Value
)
);
If I understand this correctly, k should refer to a DictionaryEntry, so I should be able to dereference it as k.Value; and to build the inner dictionary for each of the outer dictionary's entries I run another query against the resource file, so key and val should also be of type DictionaryEntry.
When referencing val.Value I get the error "Cannot choose method from method group. Did you intend to invoke the method?", even though Value should be a property, not a method.
help?
P.S. As an explanation, my resource file looks sort of like this:
TitleUser: User Questions
TitleCust: Customer Questions
User1: Why does something happen? Because…
User2: How do I do this? Start by…
Cust1: Where can I find…? It is located at…
Cust2: Is there any…? yes, look for it…
which means I first get a list of sections (by looking for all keys that start with "Title") and then, for each section, a list of questions.
So the answer turns out to be that the compiler knows the types involved better than I do. Leaving out the explicit type arguments makes it work. The reason mine were wrong: ToDictionary's generic parameters are <TSource, TKey> (or <TSource, TKey, TElement>), so the first type argument must be the sequence's element type, DictionaryEntry, not the key type.
var ret = rs.OfType<DictionaryEntry>()
.Where(x => x.Key.ToString().StartsWith("Title"))
.ToDictionary(
k => k.Value.ToString(),
v => rs.OfType<DictionaryEntry>()
.Where(x => x.Key.ToString().StartsWith(v.Key.ToString().Replace("Title", "")))
.ToDictionary(
x => x.Value.ToString().Split('?')[0] + "?",
x => x.Value.ToString().Split('?')[1]
)
);
(I've made some changes to actually make it do what I intended for it to do).
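To make the type-argument point concrete, here is a small self-contained sketch (with made-up entries standing in for the resource set) showing the explicit form the compiler would actually accept: the first generic argument to ToDictionary must be the element type, DictionaryEntry:

```csharp
using System;
using System.Collections;
using System.Collections.Generic;
using System.Linq;

public static class Demo
{
    public static Dictionary<string, string> Build(IEnumerable<DictionaryEntry> entries)
    {
        // All three type arguments spelled out: <TSource, TKey, TElement>.
        // Writing .ToDictionary<string, string>(...) fails because there is no
        // overload taking only <TKey, TElement> for a DictionaryEntry source.
        return entries.ToDictionary<DictionaryEntry, string, string>(
            e => e.Key.ToString(),
            e => e.Value.ToString());
    }
}
```

In practice, omitting the type arguments and letting inference do the work, as in the corrected code above, is the idiomatic choice.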

Parallelizing data processing

I'm trying to improve the runtime of some data processing I'm doing. The data starts out as various collections (mostly Dictionary, but a few other IEnumerable types), and the end result of processing should be a Dictionary<DataType, List<DataPoint>>.
I have all this working just fine... except that it takes close to an hour to run, and it needs to run in under 20 minutes. No item in any collection depends on any other item in the same collection, although items frequently cross-reference other collections, so I figured I should parallelize it.
The main structure of the processing has two levels of loops with some processing in between:
// Custom class, 0.01%
var primaryData = GETPRIMARY().ToDictionary(x => x.ID, x => x);
// Custom class, 11.30%
var data1 = GETDATAONE().GroupBy(x => x.Category)
.ToDictionary(x => x.Key, x => x);
// DataRows, 8.19%
var data2 = GETDATATWO().GroupBy(x => x.Type)
.ToDictionary(x => x.Key, x => x.OrderBy(y => y.ID));
foreach (var key in listOfKeys)
{
// 0.01%
var subData1 = data1[key].ToDictionary(x => x.ID, x => x);
// 1.99%
var subData2 = data2.GroupBy(x => x.ID)
.Where(x => primaryData.ContainsKey(x.Type))
.ToDictionary(x => x.Key, x => ProcessDataTwo(x, primaryData[x.Key]));
// 0.70%
var grouped = primaryData.Select(x => new { ID = x.Key,
Data1 = subData1[x.Key],
Data2 = subData2[x.Key] }).ToList();
foreach (var item in grouped)
{
// 62.12%
item.Data1.Results = new Results(item.ID, item.Data2);
// 12.37%
item.Data1.Status = new Status(item.ID, item.Data2);
}
results.Add(key, grouped);
}
return results;
listOfKeys is very small, but each grouped will have several thousand items. How can I structure this so that each call to item.Data1.Process(item.Data2) can get queued up and executed in parallel?
According to my profiler, all the ToDictionary() calls together take up about 21% of the time, the ToList() takes up 0.7%, and the two items inside the inner foreach together take up 74%. Hence why I'm focusing my optimization there.
I don't know if I should use Parallel.ForEach() to replace the outer foreach, the inner one, both, or if there's some other structure I should use. I'm also not sure if there's anything I can do to the data (or the structures holding it) to improve parallel access to it.
(Note that I'm stuck on .NET4, so don't have access to async or await)
Based on the percentages you posted, and since you said grouped is very large, you would definitely benefit from parallelizing only the inner loop.
Doing so is fairly simple:
var grouped = primaryData.Select(x => new { ID = x.Key,
Data1 = subData1[x.Key],
Data2 = subData2[x.Key] }).ToList();
Parallel.ForEach(grouped, (item) =>
{
item.Data1.Results = new Results(item.ID, item.Data2);
item.Data1.Status = new Status(item.ID, item.Data2);
});
results.Add(key, grouped);
This assumes that new Results(item.ID, item.Data2) and new Status(item.ID, item.Data2) are safe to run concurrently with each other (the only concern would be if they access non-thread-safe static resources internally, and even then, a non-thread-safe constructor is a serious design flaw).
There is one big caveat: this will only help if you are CPU bound. If Results or Status is IO bound (for example, waiting on a database call or a file on disk), doing this will hurt your performance instead of helping it. If you are IO bound rather than CPU bound, the only options are to buy faster hardware, attempt to optimize those two methods further, or cache results in memory, if possible, so you don't need to do the slow IO at all.
EDIT
Given the time measurements provided after I wrote this answer, it appears that this approach was looking for savings in the wrong places. I'll leave my answer up as a warning against optimization without measurement!
Because of the nesting in your approach, you are causing unnecessary over-iteration of some of your collections, leading to rather nasty big-O characteristics.
This can be mitigated by using the ILookup interface to pre-group collections by a key and to use these instead of repeated and expensive Where clauses.
I've had a stab at re-imagining your code to reduce complexity (but it is somewhat abstract):
var data2Lookup = data2.ToLookup(x => x.Type);
var tmp1 =
listOfKeys
.Select(key =>
new {
key,
subData1 = data1[key],
subData2 = data2Lookup[key].GroupBy(x => x.Category)
})
.Select(x =>
new{
x.key,
x.subData1,
x.subData2,
subData2Lookup = x.subData2.ToLookup(y => y.Key)});
var tmp2 =
tmp1
.Select(x =>
new{
x.key,
grouped = x.subData1
.Select(sd1 =>
new{
Data1 = sd1,
Data2 = x.subData2Lookup[sd1.ID]
})
});
var result =
tmp2
.ToDictionary(x => x.key, x => x.grouped);
It seems to me that the processing is somewhat arbitrarily placed midway through the building of results, but moving it to the end shouldn't affect the outcome, right?
So once results is built, let's process it...
var items = result.SelectMany(kvp => kvp.Value);
foreach (var item in items)
{
item.Data1.Process(item.Data2);
}
EDIT
I've deliberately avoided going parallel for the time being, so if you can get this working, there might be further speedup from adding a bit of parallel magic.

Remove item from dictionary where value is empty list

What is the best way to remove items from a dictionary where the value is an empty list?
IDictionary<int, IList<T>>
var foo = dictionary
.Where(f => f.Value.Count > 0)
.ToDictionary(x => x.Key, x => x.Value);
This will create a new dictionary. If you want to remove in-place, Jon's answer will do the trick.
Well, if you need to perform this in-place, you could use:
var badKeys = dictionary.Where(pair => pair.Value.Count == 0)
.Select(pair => pair.Key)
.ToList();
foreach (var badKey in badKeys)
{
dictionary.Remove(badKey);
}
Or if you're happy creating a new dictionary:
var noEmptyValues = dictionary.Where(pair => pair.Value.Count > 0)
.ToDictionary(pair => pair.Key, pair => pair.Value);
Note that if you get a chance to change the way the dictionary is constructed, you could consider creating an ILookup instead, via the ToLookup method. That's usually simpler than a dictionary where each value is a list, even though they're conceptually very similar. A lookup has the nice feature where if you ask for an absent key, you get an empty sequence instead of an exception or a null reference.
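A quick sketch of the ILookup idea, with hypothetical int keys and string values:

```csharp
using System.Collections.Generic;
using System.Linq;

public static class Demo
{
    public static ILookup<int, string> BuildLookup(
        IEnumerable<KeyValuePair<int, string>> pairs)
    {
        // Groups values by key; a lookup has no notion of an "empty list" entry,
        // so the empty-value problem disappears by construction.
        return pairs.ToLookup(p => p.Key, p => p.Value);
    }
}
```

Asking a lookup for an absent key, e.g. lookup[42], returns an empty sequence rather than throwing or returning null.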
Alternative provided just for completeness.
Alternatively (and depending entirely on your usage), remove the entry at the point where you amend the list's contents, rather than in a batch at some later point in time. At that moment you'll typically know the key without having to iterate:
var list = dictionary[0];
// Do stuff with the list.
if (list.Count == 0)
{
dictionary.Remove(0);
}
The other answers address the need to do it ad-hoc over the entire dictionary.