I'm trying to improve the runtime of some data processing I'm doing. The data starts out as various collections (Dictionary mostly, but a few other IEnumerable types), and the end result of processing should be a Dictionary<DataType, List<DataPoint>>.
I have all this working just fine... except it takes close to an hour to run, and it needs to run in under 20 minutes. None of the data has any connection to any other from the same collection, although they cross-reference other collections frequently, so I figured I should parallelize it.
The main structure of the processing has two levels of loops with some processing in between:
// Custom class, 0.01%
var primaryData = GETPRIMARY().ToDictionary(x => x.ID, x => x);

// Custom class, 11.30%
var data1 = GETDATAONE().GroupBy(x => x.Category)
                        .ToDictionary(x => x.Key, x => x);

// DataRows, 8.19%
var data2 = GETDATATWO().GroupBy(x => x.Type)
                        .ToDictionary(x => x.Key, x => x.OrderBy(y => y.ID));

foreach (var key in listOfKeys)
{
    // 0.01%
    var subData1 = data1[key].ToDictionary(x => x.ID, x => x);

    // 1.99%
    var subData2 = data2.GroupBy(x => x.ID)
                        .Where(x => primaryData.ContainsKey(x.Type))
                        .ToDictionary(x => x.Key, x => ProcessDataTwo(x, primaryData[x.Key]));

    // 0.70%
    var grouped = primaryData.Select(x => new { ID = x.Key,
                                                Data1 = subData1[x.Key],
                                                Data2 = subData2[x.Key] }).ToList();

    foreach (var item in grouped)
    {
        // 62.12%
        item.Data1.Results = new Results(item.ID, item.Data2);
        // 12.37%
        item.Data1.Status = new Status(item.ID, item.Data2);
    }

    results.Add(key, grouped);
}

return results;
listOfKeys is very small, but each grouped will have several thousand items. How can I structure this so that each call to item.Data1.Process(item.Data2) can get queued up and executed in parallel?
According to my profiler, all the ToDictionary() calls together take up about 21% of the time, the ToList() takes up 0.7%, and the two statements inside the inner foreach together take up 74%, which is why I'm focusing my optimization there.
I don't know if I should use Parallel.ForEach() to replace the outer foreach, the inner one, both, or if there's some other structure I should use. I'm also not sure if there's anything I can do to the data (or the structures holding it) to improve parallel access to it.
(Note that I'm stuck on .NET4, so don't have access to async or await)
Based on the percentages you posted, and since you said that grouped is very large, you would definitely benefit from parallelizing only the inner loop.
Doing so is fairly simple:
var grouped = primaryData.Select(x => new { ID = x.Key,
                                            Data1 = subData1[x.Key],
                                            Data2 = subData2[x.Key] }).ToList();

Parallel.ForEach(grouped, (item) =>
{
    item.Data1.Results = new Results(item.ID, item.Data2);
    item.Data1.Status = new Status(item.ID, item.Data2);
});

results.Add(key, grouped);
This assumes that new Results(item.ID, item.Data2) and new Status(item.ID, item.Data2) are safe to run concurrently. The only concern I would have is if they access non-thread-safe static resources internally, and even then, a non-thread-safe constructor is a really bad design flaw.
There is one big caveat: this will only help if you are CPU bound. If Results or Status is IO bound (for example, waiting on a database call or a file on the hard drive), doing this will hurt your performance instead of helping it. If you are IO bound instead of CPU bound, the only options are to buy faster hardware, attempt to optimize those two methods further, or use in-memory caching if possible so you don't need to do the slow IO at all.
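If you do go parallel and any state ends up shared across iterations, a capped degree of parallelism plus Interlocked updates keeps things safe. Here is a minimal, self-contained sketch; all names and numbers are illustrative, not from the code above:

```csharp
using System;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

class ParallelOptionsDemo
{
    static void Main()
    {
        // Cap the worker count so a CPU-bound loop doesn't saturate every core.
        var options = new ParallelOptions { MaxDegreeOfParallelism = 2 };

        long sum = 0;
        Parallel.ForEach(Enumerable.Range(1, 100), options,
            // Interlocked keeps the shared accumulator thread-safe.
            i => Interlocked.Add(ref sum, i));

        Console.WriteLine(sum); // 5050
    }
}
```

If each iteration only touches its own item (as in the loop above), no synchronization is needed at all, which is the ideal case.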
EDIT
Given the time measurements provided after I wrote this answer, it appears that this approach was looking for savings in the wrong places. I'll leave my answer up as a warning against optimizing without measuring!
Because of the nesting in your approach, you are iterating over some of your collections more often than necessary, which leads to rather nasty big-O characteristics.
This can be mitigated by using ILookup to pre-group collections by a key, and using those lookups instead of repeated, expensive Where clauses.
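As a toy illustration of the ILookup idea (the Type/ID fields here are made up for the example):

```csharp
using System;
using System.Linq;

class LookupDemo
{
    static void Main()
    {
        // Made-up flat records standing in for one of the source collections.
        var rows = new[]
        {
            new { Type = "A", ID = 1 },
            new { Type = "B", ID = 2 },
            new { Type = "A", ID = 3 }
        };

        // One O(n) pass builds the lookup; each indexer access afterwards is
        // cheap, instead of an O(n) Where scan per key.
        var byType = rows.ToLookup(x => x.Type);

        Console.WriteLine(byType["A"].Count());       // 2
        Console.WriteLine(byType["Missing"].Count()); // 0 - a missing key yields an empty sequence, not an exception
    }
}
```

That last property (missing keys return an empty sequence) is also what makes ILookup more convenient than a Dictionary of lists here.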
I've had a stab at re-imagining your code to reduce complexity (but it is somewhat abstract):
var data2Lookup = data2.ToLookup(x => x.Type);

var tmp1 =
    listOfKeys
        .Select(key =>
            new {
                key,
                subData1 = data1[key],
                subData2 = data2Lookup[key].GroupBy(x => x.Category)
            })
        .Select(x =>
            new {
                x.key,
                x.subData1,
                x.subData2,
                subData2Lookup = x.subData2.ToLookup(y => y.Key)
            });

var tmp2 =
    tmp1
        .Select(x =>
            new {
                x.key,
                grouped = x.subData1
                    .Select(sd1 =>
                        new {
                            Data1 = sd1,
                            Data2 = x.subData2Lookup[sd1]
                        })
            });

var result =
    tmp2
        .ToDictionary(x => x.key, x => x.grouped);
It seems to me that the processing is somewhat arbitrarily placed midway through the building of results, but that shouldn't affect it, right?
So once results is built, let's process it...
var items = result.SelectMany(kvp => kvp.Value);

foreach (var item in items)
{
    item.Data1.Process(item.Data2);
}
EDIT
I've deliberately avoided going parallel for the time being, so once you have this working, there might be a further speedup from adding a bit of parallel magic.
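For instance, once results is built, the final processing loop could be parallelized roughly like this. This is a self-contained sketch with stand-in data, and it assumes processing one item never touches another:

```csharp
using System;
using System.Collections.Concurrent;
using System.Linq;
using System.Threading.Tasks;

class FlattenThenParallelDemo
{
    static void Main()
    {
        // Stand-in for result: three keys, each mapped to a list of items.
        var result = Enumerable.Range(0, 3)
            .ToDictionary(k => k, k => Enumerable.Range(0, 1000).ToList());

        // Flatten once, then let the TPL partition the whole workload,
        // rather than parallelizing per key.
        var items = result.SelectMany(kvp => kvp.Value);

        // ConcurrentBag stands in for whatever thread-safe sink the
        // real processing would write to.
        var processed = new ConcurrentBag<int>();
        Parallel.ForEach(items, item => processed.Add(item * 2));

        Console.WriteLine(processed.Count); // 3000
    }
}
```

Flattening first gives the TPL one large, evenly partitionable workload instead of several small ones.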
Related
I am trying to convert the below statement so that I can get the key alongside the selected list:
var feed = new Subject<TradeExecuted>();

feed
    .GroupByUntil(x => (x.Execution.Contract.Symbol, x.Execution.AccountId, x.Tenant, x.UserId), x => Observable.Timer(TimeSpan.FromSeconds(5)))
    .SelectMany(x => x.ToList())
    .Select(trades => Observable.FromAsync(() => Mediator.Publish(trades, cts.Token)))
    .Concat()              // Ensure that the results are serialized.
    .Subscribe(cts.Token); // Check status of calls.
The above works, whereas the below does not - when I try to iterate over the list, it locks up.
feed
    .GroupByUntil(x => (x.Execution.Contract.Symbol, x.Execution.AccountId, x.Tenant, x.UserId), x => Observable.Timer(timespan))
    .Select(x => Observable.FromAsync(() =>
    {
        var list = x.ToEnumerable(); // <---- LOCKS UP if we use list.First() etc.
        var aggregate = AggregateTrades(x.Key.Symbol, x.Key.AccountId, x.Key.Tenant, list);
        return Mediator.Publish(aggregate, cts.Token);
    }))
    .Concat()
    .Subscribe(cts.Token); // Check status of calls.
I am clearly doing something wrong and probably horrific!
Going back to the original code, how can I get the Key alongside the enumerable list (and avoiding the hack below)?
As a sidenote, the below code works, but it is a nasty hack where I get the keys from the first list item:
feed
    .GroupByUntil(x => (x.Execution.Contract.Symbol, x.Execution.AccountId, x.Tenant, x.UserId), x => Observable.Timer(TimeSpan.FromSeconds(5)))
    .SelectMany(x => x.ToList())
    .Select(trades => Observable.FromAsync(() =>
    {
        var firstTrade = trades.First();
        var aggregate = AggregateTrades(firstTrade.Execution.Contract.Symbol, firstTrade.Execution.AccountId, firstTrade.Tenant, trades);
        return Mediator.Publish(aggregate, cts.Token);
    }))
    .Concat()              // Ensure that the results are serialized.
    .Subscribe(cts.Token); // Check status of calls.
All versions of your code suffer from trying to eagerly evaluate the grouped sub-observable. Since in v1 and v3 your group observable will run a maximum of 5 seconds, that isn't horrible/awful, but it's still not great. In v2, I don't know what timespan is, but assuming it's 5 seconds, you have the same problem: Trying to turn the grouped sub-observable into a list or an enumerable means waiting for the sub-observable to complete, blocking the thread (or the task).
You can fix this by using the Buffer operator to lazily evaluate the grouped sub-observable:
var timespan = TimeSpan.FromSeconds(5);

feed
    .GroupByUntil(x => (x.Execution.Contract.Symbol, x.Execution.AccountId, x.Tenant, x.UserId), x => Observable.Timer(timespan))
    .SelectMany(x => x
        .Buffer(timespan)
        .Select(list => Observable.FromAsync(() =>
        {
            var aggregate = AggregateTrades(x.Key.Symbol, x.Key.AccountId, x.Key.Tenant, list);
            return Mediator.Publish(aggregate, cts.Token);
        }))
    )
    .Concat()              // Ensure that the results are serialized.
    .Subscribe(cts.Token); // Check status of calls.
This essentially means that until timespan is up, the items in the group by gather in a list inside Buffer. Once timespan is up, they're released as a list, and the mediator publish happens.
I have a List of Job Cards, and I want to remove the Job Cards where the VehicleID count is greater than two.
Here is my attempt:
var OpenJobCards = await _context.WorkshopJobCards.Include(wjc => wjc.WorkshopJobCardCategory).Where(wjc => wjc.Job_Card_Closed == false).ToListAsync() ;
OpenJobCards.Remove(OpenJobCards.GroupBy(wjc => wjc.VehicleID).Count() >2);
Based on your comment, it would seem you're actually looking at this the wrong way. If vehicles have a collection of job cards, and your ultimate goal is to show only vehicles with less than 2 job cards assigned to them, then just do:
var vehicles = await _context.Vehicles.Where(x => x.JobCards.Count() < 2).ToListAsync();
Done.
As far as I understand, you are using a DbContext. If so, keep in mind that it is better to stay with IQueryable where possible, so that the filtering is done on the database side and the rows are not fetched into your application's memory:
var listToRemove = await _context.WorkshopJobCards.Include(wjc => wjc.WorkshopJobCardCategory)
    .Where(wjc => wjc.Job_Card_Closed == false)
    .GroupBy(wjc => wjc.VehicleID)
    .Where(t => t.Count() > 2)
    .Select(x => new OpenJobCard() { Id = x.Key });

_context.entity.RemoveRange(listToRemove);
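To sanity-check the group-and-filter logic against in-memory data (the record shape here is invented for the example):

```csharp
using System;
using System.Linq;

class JobCardFilterDemo
{
    static void Main()
    {
        // Made-up open job cards: vehicle 1 has three, vehicle 2 has one.
        var cards = new[]
        {
            new { VehicleID = 1 }, new { VehicleID = 1 }, new { VehicleID = 1 },
            new { VehicleID = 2 }
        };

        // Keep only the cards whose vehicle has more than two open cards -
        // the same GroupBy/Where/Count shape as the query above.
        var toRemove = cards
            .GroupBy(c => c.VehicleID)
            .Where(g => g.Count() > 2)
            .SelectMany(g => g)
            .ToList();

        Console.WriteLine(toRemove.Count); // 3
    }
}
```
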
If I have code as following:
var dict = new Dictionary<string, NetObject>();

Parallel.ForEach(results, options, result =>
{
    var items = parser.Parse(result);

    Parallel.ForEach(items, options, nextObject =>
    {
        if (nextObject != null)
        {
            dict[nextObject.Id] = nextObject;
        }
    });
});
The dict is a dictionary defined at method level. My question is: since it is a shared object, will that cause the parallel foreach to behave like a normal foreach, i.e. run synchronously? I do not see any performance difference between the normal foreach and the parallel one for the code above.
Your code isn't thread-safe. You're mutating a Dictionary<K, V>, which isn't a thread-safe data structure. You're also most likely over-parallelising by using Parallel.ForEach for both the outer and inner loops.
Let me suggest a different approach using PLINQ, which doesn't require you to synchronize access to a shared Dictionary. Note that you should make sure there are no duplicate keys when doing this (perhaps with an additional .Distinct() call):
results
    .AsParallel()
    .SelectMany(x => parser.Parse(x))
    .Where(x => x != null)
    .ToDictionary(x => x.Id, x => x);
But of course, the most important thing is to benchmark your code to make sure that parallelism is actually increasing performance, and that you're gaining anything from this.
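On the duplicate-keys point: if the parser can emit the same Id twice, ToDictionary will throw. A sequential sketch of one way to dedupe first is below (it keeps the last occurrence per key; under AsParallel, which occurrence is "last" would be nondeterministic):

```csharp
using System;
using System.Linq;

class DedupDemo
{
    static void Main()
    {
        // Made-up parsed objects with a duplicate Id - a plain
        // ToDictionary over these would throw an ArgumentException.
        var parsed = new[]
        {
            new { Id = "a", Value = 1 },
            new { Id = "b", Value = 2 },
            new { Id = "a", Value = 3 }
        };

        // Group by Id first and keep exactly one item per key (here: the last seen).
        var dict = parsed
            .GroupBy(x => x.Id)
            .ToDictionary(g => g.Key, g => g.Last());

        Console.WriteLine(dict.Count);      // 2
        Console.WriteLine(dict["a"].Value); // 3
    }
}
```
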
This code was written by @Rahul Singh in the post Convert TSQL to Linq to Entities:
var result = _dbContext.ExtensionsCategories.ToList().GroupBy(x => x.Category)
    .Select(x =>
    {
        var files = _dbContext.FileLists.Count(f => x.Select(z => z.Extension).Contains(f.Extension));
        return new
        {
            Category = x.Key,
            TotalFileCount = files
        };
    });
However, this code has a problem when used against the database context: we have to add ToList() like this to fix the "Only primitive types or enumeration types are supported in this context" error:
var files = _dbContext.FileLists.Count(f => x.Select(z => z.Extension).ToList().Contains(f.Extension));
The problem is that ToList() fetches all the records and hurts performance, so I wrote my own version:
var categoriesByExtensionFileCount =
    _dbContext.ExtensionsCategories.Select(
        ec =>
            new
            {
                Category = ec.Category,
                TotalSize = _dbContext.FileLists.Count(w => w.Extension == ec.Extension)
            });

var categoriesTotalFileCount =
    categoriesByExtensionFileCount.Select(
        se =>
            new
            {
                se.Category,
                TotalCount =
                    categoriesByExtensionFileCount.Where(w => w.Category == se.Category).Sum(su => su.TotalSize)
            }).GroupBy(x => x.Category).Select(y => y.FirstOrDefault());
The performance of this code is better, but it is a lot more code. Any ideas on improving the performance of the first version, or shortening the second? :D
You should have a navigation property from ExtensionCategories to FileLists. If you are using DB First, and have your foreign key constraints set up in the database, it should do this automatically for you.
If you supply your table designs (or model classes), it would help a lot too.
Lastly, you can replace .ToList().Contains(...) with .Any(), which should solve your immediate issue. Something like:
_dbContext.FileLists.Count(f => x.Any(z => z.Extension == f.Extension));
I'm trying to combine two observables whose values share some key.
I want to produce a new value whenever the first observable produces a new value, combined with the latest value from a second observable, where the selection from the second observable depends on the latest value from the first.
pseudo code example:
var obs1 = Observable.Interval(TimeSpan.FromSeconds(1)).Select(x => Tuple.Create(SomeKeyThatVaries, x));
var obs2 = Observable.Interval(TimeSpan.FromMilliseconds(1)).Select(x => Tuple.Create(SomeKeyThatVaries, x));

from x in obs1
let latestFromObs2WhereKeyMatches = …
select Tuple.Create(x, latestFromObs2WhereKeyMatches)
Any suggestions?
Clearly this could be implemented by subscribing to the second observable and maintaining a dictionary of its latest values, indexed by the key. But I'm looking for a different approach.
Usage scenario: one-minute price bars computed from a stream of stock quotes. In this case the key is the ticker, and the dictionary contains the latest ask and bid prices for each ticker, which are then used in the computation.
(By the way, thank you Dave and James this has been a very fruitful discussion)
(sorry about the formatting, hard to get right on an iPad..)
...why are you looking for a different approach? Sounds like you are on the right lines to me. It's short, simple code... roughly speaking it will be something like:
var cache = new ConcurrentDictionary<long, long>();
obs2.Subscribe(x => cache[x.Item1] = x.Item2);

var results = obs1.Select(x => new {
    obs1 = x.Item2,
    obs2 = cache.ContainsKey(x.Item1) ? cache[x.Item1] : 0
});
At the end of the day, C# is an OO language and the heavy lifting of the thread-safe mutable collections is already all done for you.
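As a side note on that dictionary read: TryGetValue fetches the value in a single call, avoiding the ContainsKey-then-indexer pair, which can also race if the key is removed in between. A minimal sketch (the keys and values are made up):

```csharp
using System;
using System.Collections.Concurrent;

class CacheReadDemo
{
    static void Main()
    {
        var cache = new ConcurrentDictionary<long, long>();

        // What the obs2 subscription would do on each tick.
        cache[1] = 42;

        // One lookup instead of two, with a default for missing keys.
        long hitValue;
        long hit = cache.TryGetValue(1, out hitValue) ? hitValue : 0;

        long missValue;
        long miss = cache.TryGetValue(2, out missValue) ? missValue : 0;

        Console.WriteLine(hit);  // 42
        Console.WriteLine(miss); // 0
    }
}
```
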
There may be fancy Rx approach (feels like joins might be involved)... but how maintainable will it be? And how will it perform?
$0.02
I'd like to know the purpose of a such a query. Would you mind describing the usage scenario a bit?
Nevertheless, it seems like the following query may solve your problem. The initial projections aren't necessary if you already have some way of identifying the origin of each value, but I've included them for the sake of generalization, to be consistent with your extremely abstract mode of questioning. ;-)
Note: I'm assuming that someKeyThatVaries is not shared data as you've shown it, which is why I've also included the term anotherKeyThatVaries; otherwise, the entire query really makes no sense to me.
var obs1 = Observable.Interval(TimeSpan.FromSeconds(1))
                     .Select(x => Tuple.Create(someKeyThatVaries, x));

var obs2 = Observable.Interval(TimeSpan.FromSeconds(.25))
                     .Select(x => Tuple.Create(anotherKeyThatVaries, x));

var results = obs1.Select(t => new { Key = t.Item1, Value = t.Item2, Kind = 1 })
    .Merge(
        obs2.Select(t => new { Key = t.Item1, Value = t.Item2, Kind = 2 }))
    .GroupBy(t => t.Key, t => new { t.Value, t.Kind })
    .SelectMany(g =>
        g.Scan(
            new { X = -1L, Y = -1L, Yield = false },
            (acc, cur) => cur.Kind == 1
                ? new { X = cur.Value, Y = acc.Y, Yield = true }
                : new { X = acc.X, Y = cur.Value, Yield = false })
         .Where(s => s.Yield)
         .Select(s => Tuple.Create(s.X, s.Y)));