Memory-optimized OrderBy and Take? - c#

I have 9 GB of data, and I want only 10 rows. When I do:
data.OrderBy(datum => datum.Column1)
.Take(10)
.ToArray();
I get an OutOfMemoryException. I would like to use an OrderByAndTake method optimized for lower memory consumption. It's easy to write, but I guess someone already did. Where can I find it?
Edit: It's Linq-to-objects. The data comes from a file. Each row can be discarded if its value for Column1 is smaller than the current list of 10 biggest values.

I'm assuming you're doing this in Linq to Objects. You could do something like...
var best = data
.Aggregate(new List<T>(), (soFar, current) => soFar
.Concat(new [] { current })
.OrderBy(datum => datum.Column1)
.Take(10)
.ToList());
In this way, not all the items need to be kept in a new sorted collection, only the best 10 you're interested in.
This was the least code way. Since you know the soFar list is sorted, testing where/if to insert current could be optimized. I didn't feel like doing ALL the work for you. ;-)
PS: Replace T with whatever your type is.
EDIT: Thinking about it, the most efficient way would actually be a plain old foreach that compares each item to the running list of best 10.
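A minimal sketch of that foreach idea, assuming the ten smallest Column1 values are wanted (as in the original OrderBy/Take query); the TakeSmallest name and the selector parameter are illustrative, not from the answer:
// Keeps a sorted running list of the n smallest keys seen so far,
// mirroring data.OrderBy(selector).Take(n) without buffering everything.
public static List<T> TakeSmallest<T, TKey>(IEnumerable<T> data, int n, Func<T, TKey> selector)
    where TKey : IComparable<TKey>
{
    var best = new List<T>(n + 1);
    foreach (var current in data)
    {
        TKey key = selector(current);

        // Find the insertion point that keeps 'best' sorted by key.
        int i = best.Count;
        while (i > 0 && selector(best[i - 1]).CompareTo(key) > 0)
            i--;

        if (i < n)
        {
            best.Insert(i, current);
            if (best.Count > n)
                best.RemoveAt(n);   // drop the element that just fell out of the top n
        }
    }
    return best;
}
// usage: var best = TakeSmallest(data, 10, datum => datum.Column1);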

It figures: OrderBy is a sort, and a sort has to buffer all the elements (the deferred, streaming behaviour is lost).
It ought to work efficiently when data is an IQueryable; then the work is pushed to the database.
// just 4 fun
public static IEnumerable<T> TakeDistinctMin<T, TKey>(this IEnumerable<T> @this,
    int n, Func<T, TKey> selector)
    where TKey : IComparable<TKey>
{
    var tops = new SortedList<TKey, T>(n + 1);
    foreach (var item in @this)
    {
        TKey k = selector(item);
        if (tops.ContainsKey(k))
            continue;
        if (tops.Count < n)
        {
            tops.Add(k, item);
        }
        else if (k.CompareTo(tops.Keys[tops.Count - 1]) < 0)
        {
            tops.Add(k, item);
            tops.RemoveAt(n);
        }
    }
    return tops.Values;
}
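A one-line usage sketch against the question's shape (data and Column1 as in the question):
// the ten rows with distinct, smallest Column1 values
var smallestTen = data.TakeDistinctMin(10, datum => datum.Column1).ToArray();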

To order a set of unordered objects you have to look at all of them, no?
I don't see how you'd be able to avoid reading all 9 GB of data to get the first 10 ordered in a certain way, unless the 9 GB of data was already ordered in that fashion, or there were indexes or other ancillary data structures that could be used.
Could you provide a bit more background on your question? Are you querying a database using LINQ to SQL, Entity Framework or some other O/RM?

You can use something like this together with a projection comparer:
public static IEnumerable<T> OrderAndTake<T>(this IEnumerable<T> seq, int count, IComparer<T> comp)
{
    var resultSet = new SortedSet<T>(comp);
    foreach (T elem in seq)
    {
        resultSet.Add(elem);
        if (resultSet.Count > count)
            resultSet.Remove(resultSet.Max);
    }
    return resultSet.Select(x => x);
}
Runtime should be O(seq.Count() * log(count)) and space O(min(count, seq.Count())).
One issue is that it will break if you have two elements for which comp.Compare(a,b)==0 since the set doesn't allow duplicate entries.
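The projection comparer mentioned above isn't shown; a minimal sketch might look like this (ProjectionComparer and the Datum/Column1 names in the usage comment are illustrative, not part of the answer):
// Compares two elements by a projected key, e.g. datum => datum.Column1.
public class ProjectionComparer<T, TKey> : IComparer<T>
    where TKey : IComparable<TKey>
{
    private readonly Func<T, TKey> projection;

    public ProjectionComparer(Func<T, TKey> projection)
    {
        this.projection = projection;
    }

    public int Compare(T x, T y)
    {
        return projection(x).CompareTo(projection(y));
    }
}
// usage: the ten rows with the smallest Column1
// var best = data.OrderAndTake(10, new ProjectionComparer<Datum, int>(d => d.Column1));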

Related

Fast filtering of an IEnumerable based on a list of tuples in C#

In my C# application I have a list of tuples, a sort of a reference database, as follow:
public List<Tuple<DateTime, double, int>> MyList = new List<Tuple<DateTime, double, int>>();
I also have an IEnumerable<Custom> whose elements hold three values: x, y and z. I need to iterate over this IEnumerable and check whether the x-y-z values are "present" in the reference database according to certain criteria.
To do this, I am using LINQ (as it is very easy to work with) to count how many tuples remain in the reference list after the custom filtering:
foreach (Container c in IEnumerable.V)
{
    var res = MyList
        .OrderBy(tuple => tuple.Item1)
        .Where(tuple => tuple.Item1 >= DateTime.Now.AddHours(-c.x))
        .OrderBy(tuple => tuple.Item3)
        .Where(tuple => tuple.Item3 == c.z)
        .OrderBy(tuple => tuple.Item2)
        .Where(tuple => somemath(tuple.Item2, c.y) <= 1)
        .Count();
    if (res != 0)
    {
        continue;
    }
    else
    {
        // do stuff
    }
}
Although this works just fine, it is too slow for my purpose (10-30 ms, I need it at least 100 times faster!).
What would be a better/faster way to do this?
Since the first criterion compares times, I believe a better strategy would be to order both the list and the input sequence by time and walk through them in lockstep. Designing such an algorithm is a bit more involved than I can post here.
The other problem is that you are effectively testing whether any item in the list matches the current input element. You should not waste resources counting all of them just to test one bit of information.
Those are the two principal sources of CPU waste in your code.
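As a rough sketch of the second point, keeping the original names (MyList, c, somemath): drop the redundant OrderBy calls and replace Count() with Any(), which stops at the first match:
// Same filters as the original query, but no sorting, and Any() short-circuits.
DateTime cutoff = DateTime.Now.AddHours(-c.x);   // hoist the cutoff out of the lambda
bool found = MyList.Any(tuple =>
    tuple.Item1 >= cutoff &&
    tuple.Item3 == c.z &&
    somemath(tuple.Item2, c.y) <= 1);

if (!found)
{
    // do stuff
}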

Why doesn't IOrderedEnumerable retain order after where filtering

I've created a simplification of the issue. I have an ordered IEnumerable, and I'm wondering why applying a Where filter would un-order the objects.
This does not compile, though arguably it has the potential to:
IOrderedEnumerable<int> tmp = new List<int>().OrderBy(x => x);
// Error: Cannot implicitly convert IEnumerable<int> to IOrderedEnumerable<int>
tmp = tmp.Where(x => x > 1);
I understand that there would be no guaranteed execution order if this came from an IQueryable, such as using LINQ against some DB provider.
However, when dealing with LINQ to Objects, what scenario could occur that would un-order your objects, or why wasn't this implemented?
EDIT
I understand how to properly order this; that is not the question. My question is more of a design question. A Where filter in LINQ to Objects should enumerate the given enumerable and apply the filter. So why is it that Where can only return an IEnumerable instead of an IOrderedEnumerable?
EDIT
To clarify the scenario in which this would be useful: I'm building queries based on conditions in my code, and I want to reuse as much code as possible. I have a function that returns an IOrderedEnumerable; after applying an additional Where I would have to reorder it, even though it is still in its original ordered state.
Rene's answer is correct, but could use some additional explanation.
IOrderedEnumerable<T> does not mean "this is a sequence that is ordered". It means "this is a sequence that has had an ordering operation applied to it and you may now follow that up with a ThenBy to impose additional ordering requirements."
The result of Where does not allow you to follow it up with ThenBy, and therefore you may not use it in a context where an IOrderedEnumerable<T> is required.
Make sense?
But of course, as others have said, you almost always want to do the filtering first and then the ordering. That way you are not spending time putting items into order that you are just going to throw away.
There are of course times when you do have to order and then filter; for example, the query "songs in the top ten that were sung by a woman" and the query "the top ten songs that were sung by a woman" are potentially very different! The first one is sort the songs -> take the top ten -> apply the filter. The second is apply the filter -> sort the songs -> take the top ten.
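A sketch of that difference in query form (the songs collection and its Rating and SungByWoman members are made up for illustration):
// "songs in the top ten that were sung by a woman":
// sort -> take the top ten -> apply the filter
var a = songs.OrderByDescending(s => s.Rating)
             .Take(10)
             .Where(s => s.SungByWoman);

// "the top ten songs that were sung by a woman":
// apply the filter -> sort -> take the top ten
var b = songs.Where(s => s.SungByWoman)
             .OrderByDescending(s => s.Rating)
             .Take(10);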
The signature of Where() is this:
public static IEnumerable<TSource> Where<TSource>(this IEnumerable<TSource> source, Func<TSource, bool> predicate)
So this method takes an IEnumerable<int> as its first argument. The IOrderedEnumerable<int> returned from OrderBy implements IEnumerable<int>, so this is no problem.
But as you can see, Where returns an IEnumerable<int> and not an IOrderedEnumerable<int>, and the former cannot be implicitly converted to the latter.
Anyway, the objects in that sequence will still have the same order. So you could just do it like this
IEnumerable<int> tmp = new List<int>().OrderBy(x => x).Where(x => x > 1);
and get the sequence you expected.
But of course you should (for performance reasons) filter your objects first and sort them afterwards when there are fewer objects to sort:
IOrderedEnumerable<int> tmp = new List<int>().Where(x => x > 1).OrderBy(x => x);
The tmp variable's type is IOrderedEnumerable.
Where() is a function just like any other with a return type, and that return type is IEnumerable. IEnumerable and IOrderedEnumerable are not the same.
So when you do this:
tmp = tmp.Where(x => x > 1);
You are trying to assign the result of a Where() function call, which is an IEnumerable, to the tmp variable, which is an IOrderedEnumerable. They are not directly compatible, there is no implicit cast, and so the compiler gives you an error.
The problem is that you are being too specific with the tmp variable's type. You can make this all work by being just a little less specific with your tmp variable:
IEnumerable<int> tmp = new List<int>().OrderBy(x => x);
tmp = tmp.Where(x => x > 1);
Because IOrderedEnumerable inherits from IEnumerable, this code will all work. As long as you don't want to call ThenBy() later on, this should give you exactly the same results as you expect without any other loss of ability to use the tmp variable later.
If you really need an IOrderedEnumerable, you can always just call .OrderBy(x => x) again:
IOrderedEnumerable<int> tmp = new List<int>().OrderBy(x => x);
tmp = tmp.Where(x => x > 1).OrderBy(x => x);
And again, in most cases (not all, but most) you want to get your filtering out of the way before you start sorting. In other words, this is even better:
var tmp = new List<int>().Where(x => x > 1).OrderBy(x => x);
why wasn't this implemented?
Most likely because the LINQ designers decided that the effort to implement, test, document etc. wasn't worth it compared to the potential use cases. In fact, you are the first person I've heard complain about it.
But if it's that important to you, you can add the missing functionality yourself (similar to Jon Skeet's MoreLINQ extension library). For instance, something like this:
using System;
using System.Collections;
using System.Collections.Generic;
using System.Linq;

namespace MyLinq
{
    public static class Extensions
    {
        public static IOrderedEnumerable<T> Where<T>(this IOrderedEnumerable<T> source, Func<T, bool> predicate)
        {
            return new WhereOrderedEnumerable<T>(source, predicate);
        }

        class WhereOrderedEnumerable<T> : IOrderedEnumerable<T>
        {
            readonly IOrderedEnumerable<T> source;
            readonly Func<T, bool> predicate;

            public WhereOrderedEnumerable(IOrderedEnumerable<T> source, Func<T, bool> predicate)
            {
                if (source == null) throw new ArgumentNullException(nameof(source));
                if (predicate == null) throw new ArgumentNullException(nameof(predicate));
                this.source = source;
                this.predicate = predicate;
            }

            public IOrderedEnumerable<T> CreateOrderedEnumerable<TKey>(Func<T, TKey> keySelector, IComparer<TKey> comparer, bool descending) =>
                new WhereOrderedEnumerable<T>(source.CreateOrderedEnumerable(keySelector, comparer, descending), predicate);

            public IEnumerator<T> GetEnumerator() => Enumerable.Where(source, predicate).GetEnumerator();

            IEnumerator IEnumerable.GetEnumerator() => GetEnumerator();
        }
    }
}
And putting it into action:
using System;
using System.Collections.Generic;
using System.Linq;
using MyLinq;
var test = Enumerable.Range(0, 100)
    .Select(n => new { Foo = 1 + (n / 20), Bar = 1 + n })
    .OrderByDescending(e => e.Foo)
    .Where(e => (e.Bar % 2) == 0)
    .ThenByDescending(e => e.Bar) // Note: this compiles :)
    .ToList();

How do I use Linq with a HashSet of Integers to pull multiple items from a list of Objects?

I have a HashSet of ID numbers, stored as integers:
HashSet<int> IDList; // Assume that this is created with a new statement in the constructor.
I have a SortedList of objects, indexed by the integers found in the HashSet:
SortedList<int,myClass> masterListOfMyClass;
I want to use the HashSet to create a List that is a subset of masterListOfMyClass.
After wasting all day trying to figure out the Linq query, I eventually gave up and wrote the following, which works:
public List<myClass> SubSet {
    get {
        List<myClass> xList = new List<myClass>();
        foreach (int x in IDList) {
            if (masterListOfMyClass.ContainsKey(x)) {
                xList.Add(masterListOfMyClass[x]);
            }
        }
        return xList;
    }
    private set { }
}
So, I have two questions here:
What is the appropriate Linq query? I'm finding Linq extremely frustrating to try to figure out. Just when I think I've got it, it turns around and "goes on strike".
Is a Linq query any better -- or worse -- than what I have written here?
var xList = IDList
.Where(masterListOfMyClass.ContainsKey)
.Select(x => masterListOfMyClass[x])
.ToList();
If your lists both have equally large numbers of items, you may wish to consider inverting the query (i.e. iterate through masterListOfMyClass and query IDList) since a HashSet is faster for random queries.
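A sketch of that inverted form (untested): iterate the SortedList once and let the HashSet do the membership tests:
// Walk the master list and probe the HashSet, which has O(1) lookups.
var xList = masterListOfMyClass
    .Where(pair => IDList.Contains(pair.Key))
    .Select(pair => pair.Value)
    .ToList();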
Edit:
It's less neat, but you could save a lookup into masterListOfMyClass with the following query, which would be a bit faster:
var xList = IDList
.Select(x => { myClass y; masterListOfMyClass.TryGetValue(x, out y); return y; })
.Where(x => x != null)
.ToList();
foreach (int x in IDList.Where(x => masterListOfMyClass.ContainsKey(x)))
{
    xList.Add(masterListOfMyClass[x]);
}
This is the appropriate LINQ query for your loop.
That said, in my view the LINQ query will not be any more effective here.
Here is the Linq expression:
List<myClass> xList = masterListOfMyClass
.Where(x => IDList.Contains(x.Key))
.Select(x => x.Value).ToList();
There is no big difference in performance in such a small example; LINQ is slower in general, and it actually uses iteration under the hood too. What you get with LINQ is, imho, clearer code, and execution is deferred until it is needed. Not in my example, though, since I call .ToList().
Another option would be (which is intentionally the same as Sankarann's first answer)
return (
    from x in IDList
    where masterListOfMyClass.ContainsKey(x)
    select masterListOfMyClass[x]
).ToList();
However, are you sure you want a List to be returned? Usually, when working with IEnumerable<> you should chain your calls using IEnumerable<> until the point where you actually need the data. There you can decide to e.g. loop once (use the iterator) or actually pull the data in some sort of cache using the ToList(), ToArray() etc. methods.
Also, exposing a List<> to the public implies that modifying that list has an impact on the owning class. I would leave it to the user of the property to decide whether to make a local copy or continue using the IEnumerable<>.
Second, as your private setter is empty, setting SubSet has no effect. This again is confusing and I would avoid it.
An alternate (and maybe less confusing) declaration of your property might look like this:
public IEnumerable<myClass> SubSet {
    get {
        return from x in IDList
               where masterListOfMyClass.ContainsKey(x)
               select masterListOfMyClass[x];
    }
}

LINQ to Objects and improved perf with an Index?

I am using LINQ to Objects and wonder if it is possible to improve the performance of my queries by making use of an index that I have. This is best explained with an example. Imagine a simple type...
public class Person
{
public int Age;
public string FirstName;
public string LastName;
}
And a simple query I would make against it...
List<Person> people = new List<Person>();
// 'people' populated with 50,000 instances...
var x = from t in people
where t.Age > 18 && t.Age < 21
select t;
If I understand LINQ to Objects correctly then the implementation of the Where extension method will enumerate all 50,000 instances in the people collection in order to find the 100 that actually match. As it happens I already have an index of the people collection that is sorted by Age. Like this...
SortedList<int, Person> ageSorted = new SortedList<int, Person>();
Clearly it would make sense if I could get the Where to use the SortedList so that it no longer has to enumerate all 50,000 instances, instead finding the range of 100 matching entries and so saving time.
Is it possible to extend LINQ to Objects to enable my situation? Is it already possible but I am missing the technique?
There's already a project which I believe does exactly that - i4o. I can't say I've used it myself, but it sounds like the kind of thing you want... you may need to juggle your existing code a bit, but it's certainly worth looking at.
If that doesn't help, you could at least write your own extension methods on SortedList<TKey, TValue>. You probably wouldn't be able to easily use your actual where clause, but you could use your own methods taking a minimum and a maximum value. You might also want to make them apply to IList<T> where you assert that you've already sorted the values appropriately (according to some comparer).
For example (completely untested):
public static IEnumerable<T> Between<T, TKey>(this IList<T> source,
    Func<T, TKey> projection,
    TKey minKeyInclusive,
    TKey maxKeyExclusive,
    IComparer<TKey> comparer)
{
    comparer = comparer ?? Comparer<TKey>.Default;
    // TODO: Find the index of the lower bound via a binary search :)
    // (It's too late for me to jot it down tonight :)
    int index = ...; // Find minimum index
    while (index < source.Count &&
           comparer.Compare(projection(source[index]), maxKeyExclusive) < 0)
    {
        yield return source[index];
        index++;
    }
}
(If you only have List<T> instead of IList<T>, you could use List<T>.BinarySearch, although you'd need to build a custom IComparer<T>.)
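A sketch of what the missing lower-bound search could look like (untested, same assumptions as the method above: the list is sorted by the projected key):
// Returns the index of the first element whose projected key is >= minKeyInclusive.
static int LowerBound<T, TKey>(IList<T> source, Func<T, TKey> projection,
                               TKey minKeyInclusive, IComparer<TKey> comparer)
{
    int low = 0, high = source.Count;          // search within [low, high)
    while (low < high)
    {
        int mid = low + (high - low) / 2;
        if (comparer.Compare(projection(source[mid]), minKeyInclusive) < 0)
            low = mid + 1;
        else
            high = mid;
    }
    return low;
}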
Also, have a look at SortedSet<T> in .NET 4.
You're right that the query you wrote will enumerate the whole list as obviously LINQ can't assume anything about your data.
If you have a SortedList, you can exploit that using the SkipWhile/TakeWhile linq methods:
var x = ageSorted.SkipWhile(kv => kv.Key <= 18).TakeWhile(kv => kv.Key < 21);
EDIT
@Davy8 is right, of course, that in the worst case this still has the same performance. See the other answers for a way to find the first value more quickly.
If you need to do this operation more than once for different age ranges then you can probably also speed it up by grouping on age:
var byAge = people.GroupBy(p => p.Age);
var x = from grp in byAge
where grp.Key > 18 && grp.Key < 21
from person in grp
select person;
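If the grouping really is reused across several different age ranges, it may also be worth materializing it once, e.g. with ToLookup (a sketch, not part of the original answer):
// Build the lookup once; each range query then only touches the relevant buckets.
var byAgeLookup = people.ToLookup(p => p.Age);
var x = Enumerable.Range(19, 2)              // ages 19 and 20, i.e. > 18 && < 21
    .SelectMany(age => byAgeLookup[age]);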
The LINQ query syntax actually uses any extension method named Where that matches the signature, so you can always write your own that handles your specific type the way you want.
public static IEnumerable<Person> Where(this IEnumerable<Person> collection, Func<Person, bool> condition )
{
Console.WriteLine("My Custom 'Where' method called");
return System.Linq.Enumerable.Where(collection, condition);
}
...
var x = from t in people
where t.Age > 18 && t.Age < 21
select t; //Will print "My Custom 'Where' method called"
Then you can apply any logic you want. I believe the normal method overload rules apply for determining which Where extension method would be called.
In a pre-sorted container, the efficiency is achieved by finding the first element quickly. Once you find the first element, just linearly retrieve the following elements until you find the end of your range.
Assuming your SortedList is sorted by Person.Age, you can find the first element of the range using SortedList.IndexOfKey, which is a binary search algorithm; therefore, this method is an O(log n) operation.
(I don't think you can change your code so the Enumerable.Where suddenly becomes more intelligent and finds the range start by using binary search.)
--- EDIT ---
Actually, what you really need is List.BinarySearch or Array.BinarySearch. SortedList.IndexOfKey won't give you the index of the closest match when an exact match does not exist. Or you can just implement the binary search yourself.

How to get last x records from a list with Lambda

I have a List of strings from which I remove duplicates; now I want to filter it further to get the last 5 records. How can I do this?
What I have so far:
List<string> query = otherlist.Distinct().Select(a => a).ToList();
You do not need the .Select(a => a). That's redundant.
You can get the last 5 records, by skipping over the rest like
List<string> query = otherlist.Distinct().ToList();
List<string> lastFive = query.Skip(query.Count-5).ToList();
Edit: to cater for non-list inputs, this now handles any IEnumerable<T> and checks whether it is an IList<T>; if not, it buffers it via ToList(), which ensures we only read the data once (rather than .Count() and .Skip(), which may read the data multiple times).
Since this is a list, I'd be inclined to write an extension method that uses that to the full:
public static IEnumerable<T> TakeLast<T>(
this IEnumerable<T> source, int count)
{
IList<T> list = (source as IList<T>) ?? source.ToList();
count = Math.Min(count, list.Count);
for (int i = list.Count - count; i < list.Count; i++)
{
yield return list[i];
}
}
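A usage sketch against the original question (note that newer framework versions ship their own Enumerable.TakeLast, which would make this extension redundant):
// distinct values first, then the last five of them
var lastFive = otherlist.Distinct().TakeLast(5).ToList();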
How about this?
var lastFive = list.Reverse().Take(5).Reverse();
edit: here's the whole thing -
var lastFiveDistinct = otherlist.Distinct()
.Reverse()
.Take(5)
.Reverse()
.ToList();
Also note that you shouldn't call it query if you've got a ToList() call at the end, because then it's not a query anymore; it's been evaluated and turned into a list. If you only need to iterate over it, you can omit the ToList() call and leave it as an IEnumerable.
var count = list.Count();
var last5 = list.Skip(count - 5);
EDIT:
I missed that the data is a List<T>. This approach would be better suited to a plain IEnumerable<T>.
