LINQ to Objects and improved performance with an index? - C#

I am using LINQ to Objects and wonder if it is possible to improve the performance of my queries by making use of an index that I have. This is best explained with an example. Imagine a simple type...
public class Person
{
    public int Age;
    public string FirstName;
    public string LastName;
}
And a simple query I would make against it...
List<Person> people = new List<Person>();
// 'people' populated with 50,000 instances...
var x = from t in people
        where t.Age > 18 && t.Age < 21
        select t;
If I understand LINQ to Objects correctly then the implementation of the Where extension method will enumerate all 50,000 instances in the people collection in order to find the 100 that actually match. As it happens I already have an index of the people collection that is sorted by Age. Like this...
SortedList<int, Person> ageSorted = new SortedList<int, Person>();
Clearly it would make sense if I could get the Where to use the SortedList so that it no longer has to enumerate all 50,000 instances, instead finding the range of 100 matching entries and so saving time.
Is it possible to extend LINQ to Objects to enable my situation? Is it already possible but I am missing the technique?

There's already a project which I believe does exactly that - i4o. I can't say I've used it myself, but it sounds like the kind of thing you want... you may need to juggle your existing code a bit, but it's certainly worth looking at.
If that doesn't help, you could at least write your own extension methods on SortedList<TKey, TValue>. You probably wouldn't be able to easily use your actual where clause, but you could use your own methods taking a minimum and a maximum value. You might also want to make them apply to IList<T> where you assert that you've already sorted the values appropriately (according to some comparer).
For example (completely untested):
public static IEnumerable<T> Between<T, TKey>(this IList<T> source,
    Func<T, TKey> projection,
    TKey minKeyInclusive,
    TKey maxKeyExclusive,
    IComparer<TKey> comparer)
{
    comparer = comparer ?? Comparer<TKey>.Default;
    // Binary search for the lower bound: the first index whose
    // projected key is >= minKeyInclusive.
    int lower = 0;
    int upper = source.Count;
    while (lower < upper)
    {
        int mid = lower + (upper - lower) / 2;
        if (comparer.Compare(projection(source[mid]), minKeyInclusive) < 0)
            lower = mid + 1;
        else
            upper = mid;
    }
    // Yield matches until we reach the exclusive upper bound.
    int index = lower;
    while (index < source.Count &&
           comparer.Compare(projection(source[index]), maxKeyExclusive) < 0)
    {
        yield return source[index];
        index++;
    }
}
(If you only have List<T> instead of IList<T>, you could use List<T>.BinarySearch, although you'd need to build a custom IComparer<T>.)
Also, have a look at SortedSet<T> in .NET 4.
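For instance, SortedSet<T>.GetViewBetween returns a live view over a key range without scanning the whole set. A minimal sketch, assuming you only need the matching ages themselves (SortedSet<T> stores unique values, so duplicate ages collapse; keeping whole Person objects would need a comparer with a tiebreaker):
var ages = new SortedSet<int>(people.Select(p => p.Age));
// Both bounds of GetViewBetween are inclusive, so [19, 20] matches Age > 18 && Age < 21.
foreach (int age in ages.GetViewBetween(19, 20))
{
    Console.WriteLine(age);
}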

You're right that the query you wrote will enumerate the whole list as obviously LINQ can't assume anything about your data.
If you have a SortedList, you can exploit that using the SkipWhile/TakeWhile LINQ methods:
var x = ageSorted.SkipWhile(kv => kv.Key <= 18).TakeWhile(kv => kv.Key < 21);
EDIT
@Davy8 is right of course that in the worst case this still has the same performance. See the other answers for a way to find the first value more quickly.
If you need to do this operation more than once for different age ranges then you can probably also speed it up by grouping on age:
var byAge = people.GroupBy(p => p.Age);
var x = from grp in byAge
        where grp.Key > 18 && grp.Key < 21
        from person in grp
        select person;
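Note that GroupBy is deferred and re-walks the source each time the query executes; if you will query several different age ranges, materializing the index once with ToLookup is an option. A sketch of that variation (not part of the original answer):
var byAge = people.ToLookup(p => p.Age); // built once, O(n)
var x = from grp in byAge
        where grp.Key > 18 && grp.Key < 21
        from person in grp
        select person;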

The LINQ query syntax actually uses any extension method named Where that matches the signature, so you can always write your own that handles your specific type the way you want.
public static IEnumerable<Person> Where(this IEnumerable<Person> collection, Func<Person, bool> condition)
{
    Console.WriteLine("My Custom 'Where' method called");
    return System.Linq.Enumerable.Where(collection, condition);
}
...
var x = from t in people
        where t.Age > 18 && t.Age < 21
        select t; // Will print "My Custom 'Where' method called"
Then you can apply any logic you want. I believe the normal method overload rules apply for determining which Where extension method would be called.

In a pre-sorted container, the efficiency is achieved by finding the first element quickly. Once you find the first element, just linearly retrieve the following elements until you find the end of your range.
Assuming your SortedList is sorted by Person.Age, you can find the first element of the range using SortedList.IndexOfKey, which is a binary search algorithm; therefore, this method is an O(log n) operation.
(I don't think you can change your code so the Enumerable.Where suddenly becomes more intelligent and finds the range start by using binary search.)
--- EDIT ---
Actually, what you really need is List.BinarySearch or Array.BinarySearch. SortedList.IndexOfKey won't give you the index of the closest match when an exact match does not exist. Or you can just implement the binary search yourself.
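For example, a sketch using List<T>.BinarySearch to find the start of the range (the parallel ages list is an assumption for illustration; when the key is missing, BinarySearch returns the bitwise complement of the index of the next larger element):
// ages is a List<int> sorted ascending, kept parallel to the sorted people.
int start = ages.BinarySearch(19); // lowest age satisfying Age > 18
if (start < 0)
{
    start = ~start; // complement = index of the first element >= 19
}
else
{
    // BinarySearch may land on any duplicate; walk back to the first one.
    while (start > 0 && ages[start - 1] == ages[start])
        start--;
}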

Related

LINQ Compare two lists where property value is not equal

I have been over a few StackOverflow articles about this (this in particular)
and for some reason my case is different. I've used Tony the Lion's answer to attempt to get a list of objects that have different property values, without success. This, however, does work:
List<Task> changedTasksWorking = new List<Task>();
for (int x = 0; x < originalTaskList.Count; x++)
{
    if (originalTaskList[x].ActiveFlag != newTaskList[x].ActiveFlag)
    {
        changedTasksWorking.Add(newTaskList[x]);
    }
}
The following is what I thought would provide me the same result. But where the returned list should equal 1, it instead equals zero. When I flip the property comparison to != and remove the nor condition on the inner list, I get ALL the objects of the list instead:
List<Task> notWork = oL.Where(o => newL.Any(n => o.ActiveFlag != n.ActiveFlag)).ToList();
I feel like I'm taking crazy pills. Looking at the above one-liner, it seems like it should give me what I'm asking for. Perhaps I have misunderstood how the LINQ methods Where and Any interact.
Your proposed LINQ approach is completely different from what you seem to actually be trying to do. In particular, according to your original example, you have two lists that are exactly in sync with each other. I.e. they have the same number of elements, and each element from one list corresponds exactly to the same element in the same position in the other list.
Your LINQ code, on the other hand, looks at each element in one list at a time, and for each of those elements, searches the other list for one that has a property value that doesn't match. In other words, if the newL list has elements of all possible values of ActiveFlag then of course it will return all elements of oL, because for each element in oL, LINQ is able to find an element in newL where the property value doesn't match.
There are at least a couple of obvious alternatives using LINQ that will actually work:
Use the overload for Where() that passes the index to the predicate delegate:
List<Task> changedTasks = newTaskList
    .Where((n, i) => n.ActiveFlag != originalTaskList[i].ActiveFlag)
    .ToList();
Use Enumerable.Zip() to pair up elements in a new sequence and filter that:
List<Task> changedTasks = originalTaskList
    .Zip(newTaskList, (o, n) => o.ActiveFlag != n.ActiveFlag ? n : null)
    .Where(n => n != null)
    .ToList();
Either of those should work fine.

How to get the max value with a lambda query in C#?

I want to get the last number in the code column in C#, using a lambda query. Note that I want a numerical value, not a list. For example, if the last registered number was 50, I want to get that number 50 back; that is, the query result stored in a numeric variable so I can use it elsewhere.
var maxcode = dbContext.personel
    .Select(a => a.Max(w => w.code))
    .FirstOrDefault();
For example
code  name    old
-----------------
1     Amelia  18
2     Olivia  27
3     Emily   11
4     Amelia  99
I want to get number 4
What if I want to use top(1) to improve the speed?
This should work:
var max = dbContext.personel.Max(x => x.code);
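To the top(1) follow-up: ordering descending and taking the first value gives the same answer and typically translates to a TOP(1)-style query, though the exact SQL depends on the provider. A sketch:
var maxCode = dbContext.personel
    .OrderByDescending(p => p.code)
    .Select(p => p.code)
    .FirstOrDefault(); // e.g. SELECT TOP(1) [code] FROM [personel] ORDER BY [code] DESC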
While SQL and LINQ share some similarities, they are quite different in how they work. It's best to start with how IEnumerable works with this type of query, then how that translates to IQueryable. In most simple cases the two look exactly the same, by design.
The Select extension for IEnumerable iterates through the sequence and passes each object to the supplied function, collecting the results in a new IEnumerable of the appropriate type. In your code a will be a record rather than a collection.
Essentially Select looks like this under the hood:
public static IEnumerable<TResult> Select<TElement, TResult>(this IEnumerable<TElement> seq, Func<TElement, TResult> selector)
{
    foreach (var item in seq)
        yield return selector(item);
}
In simple words Select is a transformation. It takes objects of one type and passes them through a transform function.
The Max extension - at least the relevant one - processes a sequence of objects, uses the supplied function to pull some value from each object, then returns the largest of those values. It looks a little bit like this pseudo-code:
public static TResult Max<TElement, TResult>(this IEnumerable<TElement> seq, Func<TElement, TResult> valueFunc)
    where TResult : IComparable<TResult>
{
    bool any = false;
    var result = default(TResult);
    foreach (var item in seq)
    {
        var curr = valueFunc(item);
        if (!any || result.CompareTo(curr) < 0)
        {
            result = curr;
            any = true;
        }
    }
    if (!any)
        throw new InvalidOperationException("Sequence contains no elements");
    return result;
}
This is simplified (the real implementation has more overloads and null handling), but it shows the basic concept.
So if you had an array of Personel objects in memory and wanted to find the largest code then you'd do this:
var maxCode = personel.Max(p => p.code);
The nice thing about LinqToSQL and pretty much all LINQ-like ORMs (Entity Framework, LinqToDB, etc) is that the exact same thing works for IQueryable:
var maxCode = dbContext.personel.Max(p => p.code);
The actual SQL for that will look something like (actual output from LinqToDB code gen):
SELECT
    Max([t1].[code]) as [c1]
FROM
    [personel] [t1]
For more interesting queries the syntax differs.
You have two Amelia entries with different ages. Let's say you want to find the age range for each name in your list. This is where grouping comes in.
In LINQ query syntax the query would look something like this:
var nameAges =
    from p in dbContext.personel
    group p.old by p.name into grp
    select new { name = grp.Key, lowAge = grp.Min(), highAge = grp.Max() };
Grouping is easier in that format. In fluent syntax it looks more like:
var nameAges = dbContext.personel
    .GroupBy(p => p.name, p => p.old)
    .Select(grp => new { name = grp.Key, lowAge = grp.Min(), highAge = grp.Max() });
Or in SQL:
SELECT name, Min(old) AS lowAge, Max(old) AS highAge
FROM personel
GROUP BY name
The moral is, writing LINQ queries is not the same as writing SQL queries... but the concepts are similar. Play around with them, work out how they work. LINQ is a great tool once you understand it.

How do I use Linq with a HashSet of Integers to pull multiple items from a list of Objects?

I have a HashSet of ID numbers, stored as integers:
HashSet<int> IDList; // Assume that this is created with a new statement in the constructor.
I have a SortedList of objects, indexed by the integers found in the HashSet:
SortedList<int,myClass> masterListOfMyClass;
I want to use the HashSet to create a List as a subset of masterListOfMyClass.
After wasting all day trying to figure out the Linq query, I eventually gave up and wrote the following, which works:
public List<myClass> SubSet {
    get {
        List<myClass> xList = new List<myClass>();
        foreach (int x in IDList) {
            if (masterListOfMyClass.ContainsKey(x)) {
                xList.Add(masterListOfMyClass[x]);
            }
        }
        return xList;
    }
    private set { }
}
So, I have two questions here:
What is the appropriate LINQ query? I'm finding LINQ extremely frustrating to figure out. Just when I think I've got it, it turns around and "goes on strike".
Is a LINQ query any better -- or worse -- than what I have written here?
var xList = IDList
    .Where(masterListOfMyClass.ContainsKey)
    .Select(x => masterListOfMyClass[x])
    .ToList();
If your lists both have equally large numbers of items, you may wish to consider inverting the query (i.e. iterate through masterListOfMyClass and query IDList) since a HashSet is faster for random queries.
Edit:
It's less neat, but you could save a lookup into masterListOfMyClass with the following query, which would be a bit faster:
var xList = IDList
    .Select(x => { myClass y; masterListOfMyClass.TryGetValue(x, out y); return y; })
    .Where(x => x != null)
    .ToList();
foreach (int x in IDList.Where(x => masterListOfMyClass.ContainsKey(x)))
{
    xList.Add(masterListOfMyClass[x]);
}
This is the LINQ equivalent of your loop. In my view, though, the LINQ query will not be any more effective here.
Here is the LINQ expression:
List<myClass> xList = masterListOfMyClass
    .Where(x => IDList.Contains(x.Key))
    .Select(x => x.Value)
    .ToList();
There is no big difference in performance in such a small example. LINQ is slower in general; it uses iteration under the hood too. What you get with LINQ is, imho, clearer code, and execution is deferred until it is needed. Not in my example, though, since I call .ToList().
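A minimal sketch of that deferral, using the same names as above:
var query = IDList.Where(masterListOfMyClass.ContainsKey)
                  .Select(x => masterListOfMyClass[x]); // nothing executes yet
IDList.Add(42); // still visible to the query...
List<myClass> snapshot = query.ToList(); // ...because execution happens here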
Another option would be (which is intentionally the same as Sankarann's first answer)
return (
    from x in IDList
    where masterListOfMyClass.ContainsKey(x)
    select masterListOfMyClass[x]
).ToList();
However, are you sure you want a List to be returned? Usually, when working with IEnumerable<> you should chain your calls using IEnumerable<> until the point where you actually need the data. There you can decide to e.g. loop once (use the iterator) or actually pull the data in some sort of cache using the ToList(), ToArray() etc. methods.
Also, exposing a List<> to the public implies that modifying this list has an impact on the calling class. I would leave it to the user of the property to decide to make a local copy or continue using the IEnumerable<>.
Second, as your private setter is empty, setting the 'SubSet' has no impact on the functionality. This again is confusing and I would avoid it.
An alternate (and maybe less confusing) declaration of your property might look like this:
public IEnumerable<myClass> SubSet {
    get {
        return from x in IDList
               where masterListOfMyClass.ContainsKey(x)
               select masterListOfMyClass[x];
    }
}

Memory optimized OrderBy and Take?

I have 9 GB of data, and I want only 10 rows. When I do:
data.OrderBy(datum => datum.Column1)
    .Take(10)
    .ToArray();
I get an OutOfMemoryException. I would like to use an OrderByAndTake method, optimized for lower memory consumption. It's easy to write, but I guess someone already did. Where can I find it?
Edit: It's Linq-to-objects. The data comes from a file. Each row can be discarded if its value for Column1 is smaller than the current list of 10 biggest values.
I'm assuming you're doing this in Linq to Objects. You could do something like...
var best = data
    .Aggregate(new List<T>(), (soFar, current) => soFar
        .Concat(new[] { current })
        .OrderBy(datum => datum.Column1)
        .Take(10)
        .ToList());
In this way, not all the items need to be kept in a new sorted collection, only the best 10 you're interested in.
This was the least code way. Since you know the soFar list is sorted, testing where/if to insert current could be optimized. I didn't feel like doing ALL the work for you. ;-)
PS: Replace T with whatever your type is.
EDIT: Thinking about it, the most efficient way would actually be a plain old foreach that compares each item to the running list of best 10.
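A sketch of that plain foreach, keeping a running list of the 10 smallest values (Datum and its comparable Column1 field are stand-ins borrowed from the question):
var best = new List<Datum>(11); // kept sorted ascending by Column1
foreach (var current in data)
{
    // Skip quickly if current can't beat the worst of the current best 10.
    if (best.Count == 10 && current.Column1 >= best[9].Column1)
        continue;
    int i = best.FindIndex(d => current.Column1 < d.Column1);
    best.Insert(i < 0 ? best.Count : i, current);
    if (best.Count > 10)
        best.RemoveAt(10);
}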
It figures: OrderBy performs a sort, and that requires buffering all the elements (deferred execution ends there).
It ought to work efficiently when data is an IQueryable, then it's up to the database.
// just 4 fun
public static IEnumerable<T> TakeDistinctMin<T, TKey>(this IEnumerable<T> @this,
    int n, Func<T, TKey> selector)
    where TKey : IComparable<TKey>
{
    var tops = new SortedList<TKey, T>(n + 1);
    foreach (var item in @this)
    {
        TKey k = selector(item);
        if (tops.ContainsKey(k))
            continue;
        if (tops.Count < n)
        {
            tops.Add(k, item);
        }
        else if (k.CompareTo(tops.Keys[tops.Count - 1]) < 0)
        {
            tops.Add(k, item);
            tops.RemoveAt(n);
        }
    }
    return tops.Values;
}
To order a set of unordered objects you have to look at all of them, no?
I don't see how you'd be able to avoid parsing all 9 GB of data to get the first 10 ordered in a certain way unless the 9 GB of data was already ordered in that fashion or if there were indexes or other ancillary data structures that could be utilized.
Could you provide a bit more background on your question. Are you querying a database using LINQ to SQL or Entity Framework or some other O/RM?
You can use something like this together with a projection comparer:
public static IEnumerable<T> OrderAndTake<T>(this IEnumerable<T> seq, int count, IComparer<T> comp)
{
    var resultSet = new SortedSet<T>(comp);
    foreach (T elem in seq)
    {
        resultSet.Add(elem);
        if (resultSet.Count > count)
            resultSet.Remove(resultSet.Max);
    }
    return resultSet.Select(x => x);
}
Runtime should be O(log(count) * seq.Count()) and space O(min(count, seq.Count())).
One issue is that it will break if you have two elements for which comp.Compare(a, b) == 0, since the set doesn't allow duplicate entries.
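For reference, a sketch of the projection comparer mentioned above, with a tiebreaker to sidestep the duplicate problem (Comparer<T>.Create requires .NET 4.5+; the Datum type and its unique Id field are hypothetical):
IComparer<Datum> byColumn1 = Comparer<Datum>.Create((a, b) =>
{
    int c = a.Column1.CompareTo(b.Column1);
    // Unique Id as tiebreaker so two distinct items never compare equal.
    return c != 0 ? c : a.Id.CompareTo(b.Id);
});
var top10 = data.OrderAndTake(10, byColumn1);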

Should I use two "where" clauses or "&&" in my LINQ query?

When writing a LINQ query with multiple "and" conditions, should I write a single where clause containing && or multiple where clauses, one for each conditon?
static void Main(string[] args)
{
    var ints = new List<int>(Enumerable.Range(-10, 20));

    var positiveEvensA = from i in ints
                         where (i > 0) && ((i % 2) == 0)
                         select i;

    var positiveEvensB = from i in ints
                         where i > 0
                         where (i % 2) == 0
                         select i;

    System.Diagnostics.Debug.Assert(positiveEvensA.Count() ==
                                    positiveEvensB.Count());
}
Is there any difference other than personal preference or coding style (long lines, readability, etc.) between positiveEvensA and positiveEvensB?
One possible difference that comes to mind is that different LINQ providers may be able to better cope with multiple wheres rather than a more complex expression; is this true?
I personally would always go with the && vs. two where clauses whenever it doesn't make the statement unintelligible.
In your case, it probably won't be noticeable at all, but having 2 where clauses can have a measurable impact on a large collection if you consume all of the results of this query, for example by calling .Count() on the results or iterating through the entire list.
Chaining the 2 conditions with && causes the query to form a single delegate that gets run as the collection is enumerated. This results in one enumeration through the collection and one call to the delegate each time a result is returned.
If you split them, the second where clause enumerates the results of the first. The data still streams through in a single pass (Where does not buffer), but every element that passes the first filter goes through an extra iterator layer and an extra delegate invocation, which adds overhead. (As a later answer notes, recent framework versions combine chained Where predicates internally, shrinking this cost further.)
If you do decide to use 2 where clauses, placing the more restrictive clause first will help quite a bit, since the second where clause is only run on the elements that pass the first one.
Now, in your case, this won't matter. On a large collection, it could. As a general rule of thumb, I go for:
Readability and maintainability
Performance
In this case, I think both options are equally maintainable, so I'd go for the more performant option.
This is mostly a personal style issue. Personally, as long as the where clause fits on one line, I group the clauses.
Using multiple wheres will tend to be less performant because it requires an extra delegate invocation for every element that makes it that far. However it's likely to be an insignificant issue and should only be considered if a profiler shows it to be a problem.
As Jared Par has already said: it depends on your personal preference, readability and the use case. For example, if your method has some optional parameters and you want to filter a collection only when they are given, chained Where clauses are perfect:
IEnumerable<SomeClass> matchingItems = allItems;

if (!string.IsNullOrWhiteSpace(name))
    matchingItems = matchingItems.Where(c => c.Name == name);

if (date.HasValue)
    matchingItems = matchingItems.Where(c => c.Date == date.Value);

if (typeId.HasValue)
    matchingItems = matchingItems.Where(c => c.TypeId == typeId.Value);

return matchingItems;
If you wanted to do this with &&, have fun ;)
Where I don't agree with Jared and Reed is the performance penalty that multiple Where clauses are supposed to have. Actually, Where is optimized in a way that combines multiple predicates into one, as you can see here (in CombinePredicates).
But I wanted to know if it really has no big impact when the collection is large and there are multiple Where clauses that all have to be evaluated. I was surprised that the following benchmark revealed the multiple Where approach to be even slightly more efficient. The summary:
Method         Mean     Error     StdDev
MultipleWhere  1.555 s  0.0310 s  0.0392 s
MultipleAnd    1.571 s  0.0308 s  0.0649 s
Here's the benchmark code; I think it's good enough for this test:
#LINQPad optimize+

void Main()
{
    var summary = BenchmarkRunner.Run<WhereBenchmark>();
}

public class WhereBenchmark
{
    string[] fruits = new string[] { "apple", "mango", "papaya", "banana", "guava", "pineapple" };
    private IList<string> longFruitList;

    [GlobalSetup]
    public void Setup()
    {
        Random rnd = new Random();
        int size = 1_000_000;
        longFruitList = new List<string>(size);
        for (int i = 1; i < size; i++)
            longFruitList.Add(GetRandomFruit());

        string GetRandomFruit()
        {
            return fruits[rnd.Next(0, fruits.Length)];
        }
    }

    [Benchmark]
    public void MultipleWhere()
    {
        int count = longFruitList
            .Where(f => f.EndsWith("le"))
            .Where(f => f.Contains("app"))
            .Where(f => f.StartsWith("pine"))
            .Count(); // counting pineapples
    }

    [Benchmark]
    public void MultipleAnd()
    {
        int count = longFruitList
            .Where(f => f.EndsWith("le") && f.Contains("app") && f.StartsWith("pine"))
            .Count(); // counting pineapples
    }
}
The performance issue only applies to memory-based collections; LINQ to SQL generates expression trees that defer execution. More details here:
Multiple WHERE Clauses with LINQ extension methods
If you run SQL Profiler and check the generated queries, you can see that there is no difference between two types of queries in terms of performance.
So it just comes down to your taste in code style.
Like others have suggested, it's more of a personal preference. I like the use of && as it's more readable and mimics the syntax of other mainstream languages.
