I want to sort a list of objects using a value that can take some time to compute. For now I have code like this:
public IEnumerable<Foo> SortFoo(IEnumerable<Foo> original)
{
return original.OrderByDescending(foo => CalculateBar(foo));
}
private int CalculateBar(Foo foo)
{
//some slow process here
}
The problem with the above code is that it may calculate the value several times for each item, which is not good. A possible optimization is to use a cached value (maybe a dictionary), but that would mean SortFoo has to clear the cache after each sort (to avoid a memory leak; I do want the value to be recalculated on each SortFoo call).
Is there a cleaner and more elegant solution to this problem?
It appears that .OrderBy() is already optimized for slow keySelectors.
Based on the following, .OrderBy() seems to cache the result of the keySelector delegate you supply it.
var random = new Random(0);
var ordered = Enumerable
.Range(0, 10)
.OrderBy(x => {
var result = random.Next(20);
Console.WriteLine("keySelector({0}) => {1}", x, result);
return result;
});
Console.WriteLine(String.Join(", ", ordered));
Here's the output:
keySelector(0) => 14
keySelector(1) => 16
keySelector(2) => 15
keySelector(3) => 11
keySelector(4) => 4
keySelector(5) => 11
keySelector(6) => 18
keySelector(7) => 8
keySelector(8) => 19
keySelector(9) => 5
4, 9, 7, 3, 5, 0, 2, 1, 6, 8
If it were running the delegate once per comparison, I'd see more than just one invocation of my keySelector delegate per item.
Because each item is compared against other items multiple times during a sort, caching the computed key once per item is a cheap win.
If you're often running the calculation against the same values, memoizing the function would be your best bet.
public IEnumerable<Foo> SortFoo(IEnumerable<Foo> original)
{
return original
.Select(f => new { Foo = f, SortBy = CalculateBar(f) })
.OrderByDescending(f=> f.SortBy)
.Select(f => f.Foo);
}
This will reduce the calculations to one per item.
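If memoization fits better (for example, when the same Foo values recur across many SortFoo calls), a minimal sketch could look like this; Memoize is a hypothetical helper, not part of the question's code:

```csharp
using System;
using System.Collections.Generic;

static class Memoizer
{
    // Wraps a function with a private cache. Each call to Memoize creates
    // a fresh dictionary, so the cache lives only as long as the delegate.
    public static Func<TIn, TOut> Memoize<TIn, TOut>(this Func<TIn, TOut> f)
    {
        var cache = new Dictionary<TIn, TOut>();
        return x => cache.TryGetValue(x, out var v) ? v : cache[x] = f(x);
    }
}
```

Inside SortFoo you would build the memoized delegate locally (`var key = ((Func<Foo, int>)CalculateBar).Memoize();`) and pass `foo => key(foo)` to OrderByDescending; because the cache is created per call, there is nothing to clear afterwards.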
I have a sequence of items, and want to group them by a key and calculate several aggregations for each key.
The number of items is large, but the number of distinct keys is small.
A toy example:
static List<(string Key, decimal Sum, int Count)> GroupStats(
IEnumerable<(string Key, decimal Value)> items)
{
return items
.GroupBy(x => x.Key)
.Select(g => (
Key : g.Key,
Sum : g.Sum(x => x.Value),
Count : g.Count()
))
.ToList();
}
Using LINQ's GroupBy has the unfortunate consequence that it needs to load all the items into memory.
An imperative implementation would only consume memory proportional to the number of distinct keys, but I'm wondering if there is a nicer solution.
Reactive Extensions' "push" approach should theoretically enable low-memory grouping as well, but I didn't find a way to escape from IObservable and materialize the actual values. I'm also open to other elegant solutions (besides the obvious imperative implementation).
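For reference, the "obvious imperative implementation" mentioned above is a short fold over a dictionary, using memory proportional only to the number of distinct keys (a sketch):

```csharp
using System.Collections.Generic;
using System.Linq;

static List<(string Key, decimal Sum, int Count)> GroupStatsImperative(
    IEnumerable<(string Key, decimal Value)> items)
{
    // One running (Sum, Count) accumulator per distinct key;
    // the items themselves are never buffered.
    var acc = new Dictionary<string, (decimal Sum, int Count)>();
    foreach (var (key, value) in items)
    {
        acc.TryGetValue(key, out var a); // defaults to (0m, 0) when missing
        acc[key] = (a.Sum + value, a.Count + 1);
    }
    return acc.Select(kvp => (kvp.Key, kvp.Value.Sum, kvp.Value.Count)).ToList();
}
```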
You could do this:
static IList<(string Key, decimal Sum, int Count)> GroupStats(
IEnumerable<(string Key, decimal Value)> source)
{
return source
.ToObservable()
.GroupBy(x => x.Key)
.Select(g => (
Key: g.Key,
Sum: g.Sum(x => x.Value).PublishLast().AutoConnect(0),
Count: g.Count().PublishLast().AutoConnect(0)
))
.ToList()
.Wait()
.Select(e => (e.Key, e.Sum.Wait(), e.Count.Wait()))
.ToArray();
}
With the ToObservable operator, the IEnumerable<T>¹ source is converted to an IObservable<T> sequence.
The GroupBy converts the IObservable<T> to an IObservable<IGroupedObservable<string, T>>.
The Select converts each IGroupedObservable<string, T> to a (string, IObservable<decimal>, IObservable<int>). The PublishLast is used in order to remember the last (and only) value emitted by the Sum and Count operators. The AutoConnect(0) subscribes to these subsequences immediately when they are emitted.
The ToList converts the IObservable<T> to an IObservable<IList<T>>. The outer observable will emit a single list when it is completed.
The Wait waits synchronously for the outer observable to complete, and to emit the single list. This is where all the work happens. Until this point the source sequence has not been enumerated. The Wait subscribes to the observable that has been constructed so far, which triggers subscriptions to the underlying observables, and eventually triggers the enumeration of the source. All the calculations are performed synchronously during the subscriptions, on the current thread. So the verb "wait" doesn't accurately describe what's happening here.
The next Select converts each (string, IObservable<decimal>, IObservable<int>) to a (string, decimal, int), by waiting on the subsequences. These subsequences have already completed at this point, and their single output is stored inside the PublishLast. So these inner Wait invocations don't trigger any serious work. All the heavy work has already been done in the previous step.
Finally the ToArray converts the IEnumerable<(string, decimal, int)> to an array of (string, decimal, int), which is the output of the GroupStats method.
¹ I am using the T as placeholder for a complex ValueTuple, so that the explanation is not overly verbose.
Update: The Rx ToObservable operator has quite a lot of overhead, because it has to support the Rx scheduling infrastructure.
You can replace it with the ToObservableHypersonic below, and achieve a speed-up of around 5x:
public static IObservable<TSource> ToObservableHypersonic<TSource>(
this IEnumerable<TSource> source)
{
return Observable.Create<TSource>(observer =>
{
foreach (var item in source) observer.OnNext(item);
observer.OnCompleted();
return Disposable.Empty;
});
}
I should also mention an alternative to the PublishLast+AutoConnect(0) combination, which is to convert the subsequences to tasks with the ToTask method. It has the same effect: the subsequences are subscribed immediately and their last value is memorized.
Sum: g.Sum(x => x.Value).ToTask(),
Count: g.Count().ToTask()
//...
.Select(e => (e.Key, e.Sum.Result, e.Count.Result))
I wonder if this is a simpler implementation:
static IList<(string Key, decimal Sum, int Count)> GroupStats(
IEnumerable<(string Key, decimal Value)> source)
{
return source
.ToObservable(Scheduler.Immediate)
.GroupBy(x => x.Key)
.SelectMany(
g => g.Aggregate(
(Sum: 0m, Count: 0),
(a, x) => (a.Sum + x.Value, a.Count + 1)),
(x, y) => (Key: x.Key, Sum: y.Sum, Count: y.Count))
.ToList()
.Wait();
}
Or better, a non-blocking version:
static async Task<IList<(string Key, decimal Sum, int Count)>> GroupStats(
IEnumerable<(string Key, decimal Value)> source)
{
return await source
.ToObservable(Scheduler.Immediate)
.GroupBy(x => x.Key)
.SelectMany(
g => g.Aggregate(
(Sum: 0m, Count: 0),
(a, x) => (a.Sum + x.Value, a.Count + 1)),
(x, y) => (Key: x.Key, Sum: y.Sum, Count: y.Count))
.ToList();
}
If I run the async version with this source:
var source = new[]
{
(Key: "a", Value: 1m),
(Key: "c", Value: 2m),
(Key: "b", Value: 3m),
(Key: "b", Value: 4m),
(Key: "c", Value: 5m),
(Key: "c", Value: 6m),
};
var output = await GroupStats(source);
I get the expected output: one tuple per key, with its sum and count.
I have an observable that emits unique values, e.g.:
var source = Observable.Range(1, 100).Publish();
source.Connect();
I want to observe its values from, e.g., two observers, but I want each observer to be notified only of values not seen by the other observers.
So if the first observer receives the value 10, the second observer should never be notified of 10.
Update
I chose @Asti's answer because it was first and, although buggy, it pointed in the right direction, and I up-voted @Shlomo's answer. Too bad I cannot accept both answers, as @Shlomo's answer was more correct and I really appreciate all the help he gives on this tag.
Observables aren't supposed to behave differently for different observers; a better approach would be to give each observer its own filtered observable.
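For instance, the "filtered observable per observer" approach could look like this (a sketch assuming the System.Reactive package; the parity predicates are just one example of a disjoint split):

```csharp
using System;
using System.Reactive.Linq;

var source = Observable.Range(1, 100).Publish();

// Each observer subscribes to its own disjoint, pre-filtered slice of the
// stream, so no value is ever delivered to more than one of them.
source.Where(i => i % 2 == 1).Subscribe(i => Console.WriteLine($"One sees {i}"));
source.Where(i => i % 2 == 0).Subscribe(i => Console.WriteLine($"Two sees {i}"));

source.Connect();
```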
That being said, if your constraints require that you need this behavior in a single observable - we can use a Round-Robin method.
public static IEnumerable<T> Repeat<T>(this IEnumerable<T> source)
{
for (; ; )
foreach (var item in source.ToArray())
yield return item;
}
public static IObservable<T> RoundRobin<T>(this IObservable<T> source)
{
var subscribers = new List<IObserver<T>>();
var shared = source
.Zip(subscribers.Repeat(), (value, observer) => (value, observer))
.Publish()
.RefCount();
return Observable.Create<T>(observer =>
{
subscribers.Add(observer);
var subscription =
shared
.Where(pair => pair.observer == observer)
.Select(pair => pair.value)
.Subscribe(observer);
var dispose = Disposable.Create(() => subscribers.Remove(observer));
return new CompositeDisposable(subscription, dispose);
});
}
Usage:
var source = Observable.Range(1, 100).Publish();
var dist = source.RoundRobin();
dist.Subscribe(i => Console.WriteLine($"One sees {i}"));
dist.Subscribe(i => Console.WriteLine($"Two sees {i}"));
source.Connect();
Result:
One sees 1
Two sees 2
One sees 3
Two sees 4
One sees 5
Two sees 6
One sees 7
Two sees 8
One sees 9
Two sees 10
If you already have a list of observers, the code becomes much simpler.
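For example, with a known list of observers up front, the distribution bookkeeping collapses to an index computation (a sketch; DistributeRoundRobin is a hypothetical name):

```csharp
using System;
using System.Collections.Generic;
using System.Reactive.Linq;

public static class RoundRobinExtensions
{
    // Pairs each element with the next observer in cyclic order and
    // delivers it to that observer only.
    public static IDisposable DistributeRoundRobin<T>(
        this IObservable<T> source, IReadOnlyList<IObserver<T>> observers)
    {
        return source
            .Select((value, index) => (value, target: observers[index % observers.Count]))
            .Subscribe(pair => pair.target.OnNext(pair.value));
    }
}
```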
EDIT: @Asti fixed his bug, and I fixed mine based on his answer. Our answers are now largely similar. I have an idea for how to do a purely reactive one; if I have time I'll post that later.
Fixed code:
public static IObservable<T> RoundRobin2<T>(this IObservable<T> source)
{
var subscribers = new BehaviorSubject<ImmutableList<IObserver<T>>>(ImmutableList<IObserver<T>>.Empty);
ImmutableList<IObserver<T>> latest = ImmutableList<IObserver<T>>.Empty;
subscribers.Subscribe(l => latest = l);
var shared = source
.Select((v, i) => (v, i))
.WithLatestFrom(subscribers, (t, s) => (t.v, t.i, s))
.Publish()
.RefCount();
return Observable.Create<T>(observer =>
{
subscribers.OnNext(latest.Add(observer));
var dispose = Disposable.Create(() => subscribers.OnNext(latest.Remove(observer)));
var sub = shared
.Where(t => t.i % t.s.Count == t.s.FindIndex(o => o == observer))
.Select(t => t.v)
.Subscribe(observer);
return new CompositeDisposable(dispose, sub);
});
}
Original answer:
I upvoted @Asti's answer, because he's largely correct: just because you can, doesn't mean you should. And his answer largely works, but it's subject to a bug:
This works fine:
var source = Observable.Range(1, 20).Publish();
var dist = source.RoundRobin();
dist.Subscribe(i => Console.WriteLine($"One sees {i}"));
dist.Take(1).Subscribe(i => Console.WriteLine($"Two sees {i}"));
This doesn't:
var source = Observable.Range(1, 20).Publish();
var dist = source.RoundRobin();
dist.Take(1).Subscribe(i => Console.WriteLine($"One sees {i}"));
dist.Subscribe(i => Console.WriteLine($"Two sees {i}"));
Output is:
One sees 1
Two sees 1
Two sees 2
Two sees 3
Two sees 4
...
I first thought the bug was Halloween-related, but now I'm not sure. The .ToArray() in Repeat should take care of that. I also wrote a pure-ish observable implementation which has the same bug. This implementation doesn't guarantee a perfect round-robin, but that wasn't in the question:
public static IObservable<T> RoundRobin2<T>(this IObservable<T> source)
{
var subscribers = new BehaviorSubject<ImmutableList<IObserver<T>>>(ImmutableList<IObserver<T>>.Empty);
ImmutableList<IObserver<T>> latest = ImmutableList<IObserver<T>>.Empty;
subscribers.Subscribe(l => latest = l);
var shared = source
.Select((v, i) => (v, i))
.WithLatestFrom(subscribers, (t, s) => (t.v, t.i, s))
.Publish()
.RefCount();
return Observable.Create<T>(observer =>
{
subscribers.OnNext(latest.Add(observer));
var dispose = Disposable.Create(() => subscribers.OnNext(latest.Remove(observer)));
var sub = shared
.Where(t => t.i % t.s.Count == t.s.FindIndex(o => o == observer))
.Select(t => t.v)
.Subscribe(observer);
return new CompositeDisposable(dispose, sub);
});
}
This is a simple distributed-queue implementation using TPL Dataflow. With respect to different observers not seeing the same value, there's little chance of it behaving incorrectly. It's not round-robin, but it does have back-pressure semantics.
public static IObservable<T> Distribute<T>(this IObservable<T> source)
{
var buffer = new BufferBlock<T>();
source.Subscribe(buffer.AsObserver());
return Observable.Create<T>(observer =>
buffer.LinkTo(new ActionBlock<T>(
observer.OnNext,
new ExecutionDataflowBlockOptions { BoundedCapacity = 1 })));
}
Output
One sees 1
Two sees 2
One sees 3
Two sees 4
One sees 5
One sees 6
One sees 7
One sees 8
One sees 9
One sees 10
I might prefer skipping Rx entirely and just using TPL Dataflow.
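Skipping Rx entirely, a Dataflow-only sketch of the same competing-consumers idea might look like this; the BoundedCapacity of 1 is what makes the consumers take turns based on availability:

```csharp
using System;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow;

class Demo
{
    static async Task Main()
    {
        var opts = new ExecutionDataflowBlockOptions { BoundedCapacity = 1 };
        var one = new ActionBlock<int>(i => Console.WriteLine($"One sees {i}"), opts);
        var two = new ActionBlock<int>(i => Console.WriteLine($"Two sees {i}"), opts);

        var buffer = new BufferBlock<int>();
        var link = new DataflowLinkOptions { PropagateCompletion = true };
        // Each item is handed to exactly one consumer, whichever has capacity.
        buffer.LinkTo(one, link);
        buffer.LinkTo(two, link);

        for (int i = 1; i <= 10; i++) buffer.Post(i);
        buffer.Complete();
        await Task.WhenAll(one.Completion, two.Completion);
    }
}
```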
I've got this LINQ query:
return ngrms.GroupBy(x => x)
.Select(s => new { Text = s.Key, Count = s.Count() })
.Where(x => x.Count > minCount)
.OrderByDescending(x => x.Count)
.ToDictionary(g => g.Text, g => g.Count);
ngrms is IEnumerable<String>
Is there a way that I can optimize this code?
I don't care if I have to rewrite all the code and open to all low level tweaks.
If you implement a dictionary whose values can be incremented in place (emulating a multiset or bag), you can get about 3x faster than LINQ, but the difference is small unless you have a lot of ngrms. On a list of 10 million items with about 100 unique values, the LINQ code still takes less than a second on my PC. If your LINQ code takes time 1, a foreach with a Dictionary<string,int> takes 0.85 and this code takes 0.32.
Here is the class for creating an updateable value in the Dictionary:
public class Ref<T> {
public T val { get; set; }
public Ref(T firstVal) => val = firstVal;
public static implicit operator T(Ref<T> rt) => rt.val;
}
(If C# allowed operator ref T you could return a ref to the val property and almost treat a Ref<T> as if it were an lvalue of type T.)
Now you can count the occurrences of the strings in a Dictionary<string,Ref<int>> with only one lookup per string:
var dictCounts = new Dictionary<string, Ref<int>>();
foreach (var s in ngrms) {
if (dictCounts.TryGetValue(s, out var refn))
++refn.val;
else
dictCounts.Add(s, new Ref<int>(1));
}
Finally you can compute the answer by filtering the counts to the ones you want to keep:
var ans = dictCounts.Where(kvp => kvp.Value > minCount).ToDictionary(kvp => kvp.Key, kvp => kvp.Value.val);
Going by your LINQ query, you might consider rewriting the code as a simple foreach loop for better performance, like below. It runs in O(n) time:
Dictionary<string, int> dict = new Dictionary<string, int>();
foreach(var s in ngrms)
{
if (dict.ContainsKey(s))
dict[s]++;
else
dict.Add(s, 1);
}
return dict.Where(a => a.Value > minCount).ToDictionary(a => a.Key, a => a.Value);
Let's say you have a list of items and you want to partition them, perform an operation on one partition, and concatenate the partitions back into a list.
For example, there is a list of numbers and I want to partition them by parity, then reverse the odds and concatenate them with the evens: [1,2,3,4,5,6,7,8] -> [7,5,3,1,2,4,6,8]
Sounds simple, but I've gotten stuck on merging the two groups back together. How would you do it with LINQ?
IEnumerable<int> result = Enumerable.Range(0, 1000)
.GroupBy(i => i % 2)
.Select(p => p.Key == 1 ? p.Reverse() : p)
.??? // need to concatenate
Edit
[1,2,3] is the representation of the array which I want to get as the result, not console output; sorry if I confused you with that.
The GroupBy method returns an IEnumerable<IGrouping<TKey, TSource>>. As IGrouping implements IEnumerable, you can use SelectMany to concatenate multiple IEnumerable<T> instances into one.
Enumerable.Range(0, 1000)
.GroupBy(i => i % 2)
.Select(p => p.Key == 1 ? p.Reverse() : p)
.OrderByDescending(p => p.Key)
.SelectMany(p => p);
There are a few ways to achieve this.
So if we start with your query:
Enumerable.Range(0, 1000)
.GroupBy(i => i % 2)
.Select(p => p.Key == 1 ? p.Reverse() : p)
you could then use an Aggregate
.Aggregate((aggregate, enumerable) => aggregate.Concat(enumerable))
This will then go through your list of results and concat them all into one collection and return it; you just need to make sure that aggregate and enumerable are the same type, in this case IEnumerable<int>.
Another option would be to call SelectMany():
.SelectMany(enumerable => enumerable)
This likewise pulls all the enumerables together into a single enumerable; again, you need to ensure the types are IEnumerable<int>.
Other options would be to hard-code the keys as Tim suggests, or to drop out of LINQ and use a loop.
You could use this approach using a Lookup<TKey, TElement>:
var evenOddLookup = numbers.ToLookup(i => i % 2);
string result = String.Join(",", evenOddLookup[1].Reverse().Concat(evenOddLookup[0]));
If you don't want a string but an int[] as result:
int[] result = evenOddLookup[1].Reverse().Concat(evenOddLookup[0]).ToArray();
You could do something like this.
var number = string.Join(",",
Enumerable.Range(0, 1000)
.GroupBy(i => i % 2) // Separate even/odd numbers
.OrderByDescending(x=>x.Key) // Sort to bring odd numbers first.
.SelectMany(x=> x.Key ==1? // Sort elements based on even or odd.
x.OrderByDescending(s=>s)
: x.Where(s=> s!=0).OrderBy(s=>s))
.ToArray());
string output = string.Format("[{0}]", number);
Just use OrderBy like this:
List<int> arr = new List<int> { 1, 2, 3, 4, 5, 6, 7, 8 };
var result = arr.OrderBy(i => i % 2 == 0 ? 1 : 0)
.ThenBy(i => i % 2 == 0 ? i : int.MaxValue)
.ThenByDescending(i => i);
This gives the desired result:
[1,2,3,4,5,6,7,8] will be converted into [7,5,3,1,2,4,6,8]
I have a List<List<double>> and I need to find a List MyList where MyList[0], for instance, is the max of all the first elements of the inner lists.
Example, just to be clear:
The first list contains (3,5,1), the second contains (5,1,8), the third contains (3,3,3), the fourth contains (2,0,4).
I need to find a list with (5, 5, 8).
I do NOT need the list (5,8,3,4).
Of course I know how to do it with nested for loops.
I'd like to know if there's a LINQ way, and believe me, I don't know where to start from.
var source = new List<List<int>> {
new List<int> { 3, 5, 1 },
new List<int> { 5, 1, 8 },
new List<int> { 3, 3, 3 },
new List<int> { 2, 0, 4 }
};
var maxes = source.SelectMany(x => x.Select((v, i) => new { v, i }))
.GroupBy(x => x.i, x => x.v)
.OrderBy(g => g.Key)
.Select(g => g.Max())
.ToList();
Returns { 5, 5, 8 }, which is what you need. It will work when the source lists have different numbers of elements too.
Bonus
If you need version for Min too, and want to prevent code duplication, you can go a little bit functional:
private static IEnumerable<TSource> GetByIndex<TSource>(IEnumerable<IEnumerable<TSource>> source, Func<IEnumerable<TSource>, TSource> selector)
{
return source.SelectMany(x => x.Select((v, i) => new { v, i }))
.GroupBy(x => x.i, x => x.v)
.OrderBy(g => g.Key)
.Select(g => selector(g));
}
public static IEnumerable<TSource> GetMaxByIndex<TSource>(IEnumerable<IEnumerable<TSource>> source)
{
return GetByIndex(source, Enumerable.Max);
}
public static IEnumerable<TSource> GetMinByIndex<TSource>(IEnumerable<IEnumerable<TSource>> source)
{
return GetByIndex(source, Enumerable.Min);
}
Try this one:
// Here I declare your initial list.
List<List<double>> list = new List<List<double>>()
{
new List<double>(){3,5,1},
new List<double>(){5,1,8},
new List<double>(){3,3,3},
new List<double>(){2,0,4},
};
// That would be the list, which will hold the maxs.
List<double> result = new List<double>();
// Find the maximum of the i-th elements of all the lists in the list and add it
// to the result.
for (int i = 0; i < list[0].Count; i++)
{
result.Add(list.Select(x => x[i]).Max());
}
Note: this solution only works when all the lists contained in the outer list have the same number of elements.
Even though this topic was answered a long time ago, I'd like to add another solution I've made up with LINQ, shorter than the other solutions:
List<List<int>> mylist; // initial list of lists
List<int> maxs_list = mylist.Aggregate(
(x, cur) => cur.Zip(x, (a, b) => (a > b) ? a : b).ToList()
);
This very simple code just aggregates each pair of sub-lists into a list of element-wise maxima. Note that the internal ToList is mandatory, as Zip is deferred.
You can encapsulate the code in an extension method and use the same trick as MarcinJuraszek to generate other similar computations (min, max, mean, std, ...).
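The encapsulation might look like this (a sketch; ZipAggregate is a hypothetical name):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class ZipAggregateExtensions
{
    // Folds a sequence of equal-length lists element-wise with the
    // given combiner function.
    public static List<T> ZipAggregate<T>(
        this IEnumerable<List<T>> source, Func<T, T, T> combine)
    {
        return source.Aggregate((acc, cur) => acc.Zip(cur, combine).ToList());
    }
}

// Usage, reusing the sample data from the question:
// var maxs = list.ZipAggregate(Math.Max); // element-wise maxima
// var mins = list.ZipAggregate(Math.Min); // element-wise minima
```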
If you always know how many elements are present in your lists, you can use this approach:
var result = new[]
{
list.Select(a => a[0]).Max(),
list.Select(a => a[1]).Max(),
list.Select(a => a[2]).Max()
};