Can LINQ use binary search when the collection is ordered? - c#

Can I somehow "instruct" LINQ to use binary search when the collection that I'm trying to search is ordered. I'm using an ObservableCollection<T>, populated with ordered data, and I'm trying to use Enumerable.First(<Predicate>). In my predicate, I'm filtering by the value of the field my collection's sorted by.

As far as I know, it's not possible with the built-in methods. However it would be relatively easy to write an extension method that would allow you to write something like that :
var item = myCollection.BinarySearch(i => i.Id, 42);
(assuming, of course, that you collection implements IList ; there's no way to perform a binary search if you can't access the items randomly)
Here's a sample implementation :
public static T BinarySearch<T, TKey>(this IList<T> list, Func<T, TKey> keySelector, TKey key)
where TKey : IComparable<TKey>
{
if (list.Count == 0)
throw new InvalidOperationException("Item not found");
int min = 0;
int max = list.Count;
while (min < max)
{
int mid = min + ((max - min) / 2);
T midItem = list[mid];
TKey midKey = keySelector(midItem);
int comp = midKey.CompareTo(key);
if (comp < 0)
{
min = mid + 1;
}
else if (comp > 0)
{
max = mid - 1;
}
else
{
return midItem;
}
}
if (min == max &&
min < list.Count &&
keySelector(list[min]).CompareTo(key) == 0)
{
return list[min];
}
throw new InvalidOperationException("Item not found");
}
(not tested... a few adjustments might be necessary) Now tested and fixed ;)
The fact that it throws an InvalidOperationException may seem strange, but that's what Enumerable.First does when there's no matching item.

The accepted answer is very good.
However, I need that the BinarySearch returns the index of the first item that is larger, as the List<T>.BinarySearch() does.
So I watched its implementation by using ILSpy, then I modified it to have a selector parameter. I hope it will be as useful to someone as it is for me:
public static class ListExtensions
{
public static int BinarySearch<T, U>(this IList<T> tf, U target, Func<T, U> selector)
{
var lo = 0;
var hi = (int)tf.Count - 1;
var comp = Comparer<U>.Default;
while (lo <= hi)
{
var median = lo + (hi - lo >> 1);
var num = comp.Compare(selector(tf[median]), target);
if (num == 0)
return median;
if (num < 0)
lo = median + 1;
else
hi = median - 1;
}
return ~lo;
}
}

Well, you can write your own extension method over ObservableCollection<T> - but then that will be used for any ObservableCollection<T> where your extension method is available, without knowing whether it's sorted or not.
You'd also have to indicate in the predicate what you wanted to find - which would be better done with an expression tree... but that would be a pain to parse. Basically, the signature of First isn't really suitable for a binary search.
I suggest you don't try to overload the existing signatures, but write a new one, e.g.
public static TElement BinarySearch<TElement, TKey>
(this IList<TElement> collection, Func<TElement, TItem> keySelector,
TKey key)
(I'm not going to implement it right now, but I can do so later if you want.)
By providing a function, you can search by the property the collection is sorted by, rather than by the items themselves.

Enumerable.First(predicate) works on an IEnumarable<T> which only supports enumeration, therefore it does not have random access to the items within.
Also, your predicate contains arbitrary code that eventually results in true or false, and so cannot indicate whether the tested item was too low or too high. This information would be needed in order to do a binary search.
Enumerable.First(predicate) can only test each item in order as it walks through the enumeration.

Keep in mind that all(? at least most) of the extension methods used by LINQ are implemented on IQueryable<T>orIEnumerable<T> or IOrderedEnumerable<T> or IOrderedQueryable<T>.
None of these interfaces supports random access, and therefore none of them can be used for a binary search. One of the benefits of something like LINQ is that you can work with large datasets without having to return the entire dataset from the database. Obviously you can't binary search something if you don't even have all of the data yet.
But as others have said, there is no reason at all you can't write this extension method for IList<T> or other collection types that support random access.

Related

Performance of Skip and Take in Linq to Objects

"Searching for alternative functionalities for "Skip" and "Take" functionalities"
1 of the link says "Everytime you invoke Skip() it will have to iterate you collection from the beginning in order to skip the number of elements you desire, which gives a loop within a loop (n2 behaviour)"
Conclusion: For large collections, don’t use Skip and Take. Find another way to iterate through your collection and divide it.
In order to access last page data in a huge collection, can you please suggest us a way other than Skip and Take approach?
Looking at the source for Skip, you can see it enumerates over all the items, even over the first n items you want to skip.
It's strange though, because several LINQ-methods have optimizations for collections, like Count and Last.
Skip apparently does not.
If you have an array or IList<T>, you use the indexer to truly skip over them:
for (int i = skipStartIndex; i < list.Count; i++) {
yield return list[i];
}
Internally it is really correct:
private static IEnumerable<TSource> SkipIterator<TSource>(IEnumerable<TSource> source, int count)
{
using (IEnumerator<TSource> enumerator = source.GetEnumerator())
{
while (count > 0 && enumerator.MoveNext())
--count;
if (count <= 0)
{
while (enumerator.MoveNext())
yield return enumerator.Current;
}
}
}
If you want to skip for IEnumerable<T> then it works right. There are no other way except enumeration to get specific element(s). But you can write own extension method on IReadOnlyList<T> or IList<T> (if this interface is implemented in collection used for your elements).
public static class IReadOnlyListExtensions
{
public static IEnumerable<T> Skip<T>(this IReadOnlyList<T> collection, int count)
{
if (collection == null)
return null;
return ICollectionExtensions.YieldSkip(collection, count);
}
private static IEnumerable<T> YieldSkip<T>(IReadOnlyList<T> collection, int count)
{
for (int index = count; index < collection.Count; index++)
{
yield return collection[index];
}
}
}
In addition you can implement it for IEnumerable<T> but check inside for optimization:
if (collection is IReadOnlyList<T>)
{
// do optimized skip
}
Such solution is used a lot of where in Linq source code (but not in Skip unfortunately).
Depends on your implementation, but it would make sense to use indexed arrays for the purpose, instead.

Writing the implementation for List<T> using IEnumerable<T> from scratch

Out of boredom I decided to write the implementation of List from scratch using IEnumerable. I ran into a few issues that I honestly don't know how to solve:
How would you resize a generic array (T[]) when an index is nulled or set to default(T)?
Since you cannot null T, how do you overcome the numerical primitive problem with their values being 0 by default?
If nothing can be done regarding #2, how do you stop the GetEnumerator() method from yield returning 0 when utilizing a numerical data type?
Last but not least, what is the standard practice regarding downsizing an array? I know for certain that one of the best solutions for upsizing is to increase the current length by a power of 2; if and when do you downsize? Per Remove/RemoveAt or by the currently used length % 2?
Here's what I've done so far:
public class List<T> : IEnumerable<T>
{
T[] list = new T[32];
int current;
public void Add(T item)
{
if (current + 1 > list.Length)
{
T[] temp = new T[list.Length * 2];
Array.Copy(list, temp, list.Length);
list = temp;
}
list[current] = item;
current++;
}
public void Remove(T item)
{
for (int i = 0; i < list.Length; i++)
if (list[i].Equals(item))
list[i] = default(T);
}
public void RemoveAt(int index)
{
list[index] = default(T);
}
public IEnumerator<T> GetEnumerator()
{
foreach (T item in list)
if (item != null && !item.Equals(default(T)))
yield return item;
}
System.Collections.IEnumerator System.Collections.IEnumerable.GetEnumerator()
{
foreach (T item in list)
if (item != null && !item.Equals(default(T)))
yield return item;
}
}
Thanks in advance.
Well, for starters, your Remove and RemoveAt methods do not implement the same behavior as List<T>. List<T> will decrease in size by 1, whereas your List will remain constant in size. You should be shifting the values of higher index from the removed object to one lower index.
Also, GetEnumerator will iterate over all items in the array, regardless of what the value is.
I believe that will solve all of the issues you have. If someone adds a default(T) to the list, then a default(T) is what they will get back out again, regardless if T is an int and thus 0 or a class-type and thus null.
Finally, on downsizing: some growable array implementations rationalize that, if the array had ever gotten so big, then it is more likely than usual to get that big again. For that reason, they specifically avoid downsizing.
The key problem you're running into is maintaining the internal array and what remove does. List<T> does not support partial arrays internally. That doesn't mean you can't, but doing so is far more complicated. To exactly mimic List<T> you want to keep an array and a field for the number of elements in the array that are actually utilized (the list length, which is equal to or less than array length).
Add is easy, you add an element to the end like you did.
Remove is more complicated. If you are removing an element from the end, set the end element to default(T) and change the list length. If you are removing and element from the beginning or middle, then you need to shift the contents of the array and set the last one to default(T). The reason we set the last element to default(T) is to clear the reference, not so we can tell whether or not it's "in use". We know if it's "in use" based on the position in the array and our stored list length.
Another key to implementation is the enumerator. You want to loop through the first elements until you hit the list length. Don't skip nulls.
This is not a complete implementation, but should be correct implementation of the methods you started.
btw, I would not agree with
I know for certain that the best solution for upsizing is to increase the current length by a power of 2
This is the default behavior of List<T> but it's not the best solution in all situations. That's exactly why List<T> allows you to specify a capacity. If you're loading a list from a source and know how many items you're adding, then you can pre-initialize the capacity of the list to reduce the number of copies. Similarly, if you're creating hundreds or thousands of lists that are larger than the default size or likely to be larger, it can be a benefit to memory utilization to pre-initialize the lists to be the same size. That way the memory they allocate and free will be the same continuous blocks and can be more efficiently allocated and deallocated repeatedly. For example, we have a reporting calculation engine that creates about 300,000 lists for each run, with many runs a second. We know the lists are always a few hundred items each, so we pre-initialize them all to 1024 capacity. This is more than most need, but since they're all the same length and they're created and disposed of very quickly, this makes memory reusage efficient.
public class MyList<T> : IEnumerable<T>
{
T[] list = new T[32];
int listLength;
public void Add(T item)
{
if (listLength + 1 > list.Length)
{
T[] temp = new T[list.Length * 2];
Array.Copy(list, temp, list.Length);
list = temp;
}
list[listLength] = item;
listLength++;
}
public void Remove(T item)
{
for (int i = 0; i < list.Length; i++)
if (list[i].Equals(item))
{
RemoveAt(i);
return;
}
}
public void RemoveAt(int index)
{
if (index < 0 || index >= listLength)
{
throw new ArgumentException("'index' must be between 0 and list length.");
}
if (index == listLength - 1)
{
list[index] = default(T);
listLength = index;
return;
}
// need to shift the list
Array.Copy(list, index + 1, list, index, listLength - index + 1);
listLength--;
list[listLength] = default(T);
}
public IEnumerator<T> GetEnumerator()
{
for (int i = 0; i < listLength; i++)
{
yield return list[i];
}
}
IEnumerator IEnumerable.GetEnumerator()
{
return GetEnumerator();
}
}

lambda expression foreach loop

I have the following code
int someCount = 0;
for ( int i =0 ; i < intarr.Length;i++ )
{
if ( intarr[i] % 2 == 0 )
{
someCount++;
continue;
}
// Some other logic for those not satisfying the condition
}
Is it possible to use any of the Array.Where or Array.SkiplWhile to achieve the same?
foreach(int i in intarr.where(<<condtion>> + increment for failures) )
{
// Some other logic for those not satisfying the condition
}
Use LINQ:
int someCount = intarr.Count(val => val % 2 == 0);
I definitely prefer #nneonneo's way for short statements (and it uses an explicit lambda), but if you want to build a more elaborate query, you can use the LINQ query syntax:
var count = ( from val in intarr
where val % 2 == 0
select val ).Count();
Obviously this is probably a poor choice when the query can be expressed with a single lambda expression, but I find it useful when composing larger queries.
More examples: http://code.msdn.microsoft.com/101-LINQ-Samples-3fb9811b
Nothing (much) prevents you from rolling your own Where that counts the failures. "Nothing much" because neither lambdas nor methods with yield return statements are allowed to reference out/ref parameters, so the desired extension with the following signature won't work:
// dead-end/bad signature, do not attempt
IEnumerable<T> Where(
this IEnumerable<T> self,
Func<T,bool> predicate,
out int failures)
However, we can declare a local variable for the failure-count and return a Func<int> that can get the failure-count, and a local variable is completely valid to reference from lambdas. Thus, here's a possible (tested) implementation:
public static class EnumerableExtensions
{
public static IEnumerable<T> Where<T>(
this IEnumerable<T> self,
Func<T,bool> predicate,
out Func<int> getFailureCount)
{
if (self == null) throw new ArgumentNullException("self");
if (predicate == null) throw new ArgumentNullException("predicate");
int failures = 0;
getFailureCount = () => failures;
return self.Where(i =>
{
bool res = predicate(i);
if (!res)
{
++failures;
}
return res;
});
}
}
...and here's some test code that exercises it:
Func<int> getFailureCount;
int[] items = { 0, 1, 2, 3, 4 };
foreach(int i in items.Where(i => i % 2 == 0, out getFailureCount))
{
Console.WriteLine(i);
}
Console.WriteLine("Failures = " + getFailureCount());
The above test, when run outputs:
0
2
4
Failures = 2
There are a couple caveats I feel obligated to warn about. Since you could break out of the loop prematurely without having walked the entire IEnumerable<>, the failure-count would only reflect encountered-failures, not the total number of failures as in #nneonneo's solution (which I prefer.) Also, if the implementation of LINQ's Where extension were to change in a way that called the predicate more than once per item, then the failure count would be incorrect. One more point of interest is that, from within your loop body you should be able to make calls to the getFailureCount Func to get the current running failure count so-far.
I presented this solution to show that we are not locked-into the existing prepackaged solutions. The language and framework provides us with lots of opportunities to extend it to suit our needs.

Linq: The "opposite" of Take?

Using Linq; how can I do the "opposite" of Take?
I.e. instead of getting the first n elements such as in
aCollection.Take(n)
I want to get everything but the last n elements. Something like
aCollection.Leave(n)
(Don't ask why :-)
Edit
I suppose I can do it this way aCollection.TakeWhile((x, index) => index < aCollection.Count - n) Or in the form of an extension
public static IEnumerable<TSource> Leave<TSource>(this IEnumerable<TSource> source, int n)
{
return source.TakeWhile((x, index) => index < source.Count() - n);
}
But in the case of Linq to SQL or NHibernate Linq it would have been nice if the generated SQL took care of it and generated something like (for SQL Server/T-SQL)
SELECT TOP(SELECT COUNT(*) -#n FROM ATable) * FROM ATable Or some other more clever SQL implementation.
I suppose there is nothing like it?
(But the edit was actually not part of the question.)
aCollection.Take(aCollection.Count() - n);
EDIT: Just as a piece of interesting information which came up in the comments - you may think that the IEnumerable's extension method .Count() is slow, because it would iterate through all elements. But in case the actual object implements ICollection or ICollection<T>, it will just use the .Count property which should be O(1). So performance will not suffer in that case.
You can see the source code of IEnumerable.Count() at TypeDescriptor.net.
I'm pretty sure there's no built-in method for this, but this can be done easily by chaining Reverse and Skip:
aCollection.Reverse().Skip(n).Reverse()
I don't believe there's a built-in function for this.
aCollection.Take(aCollection.Count - n)
should be suitable; taking the total number of items in the collection minus n should skip the last n elements.
Keeping with the IEnumerable philosphy, and going through the enumeration once for cases where ICollection isn't implemented, you can use these extension methods:
public static IEnumerable<T> Leave<T>(this ICollection<T> src, int drop) => src.Take(src.Count - drop);
public static IEnumerable<T> Leave<T>(this IEnumerable<T> src, int drop) {
IEnumerable<T> IEnumHelper() {
using (var esrc = src.GetEnumerator()) {
var buf = new Queue<T>();
while (drop-- > 0)
if (esrc.MoveNext())
buf.Enqueue(esrc.Current);
else
break;
while (esrc.MoveNext()) {
buf.Enqueue(esrc.Current);
yield return buf.Dequeue();
}
}
}
return (src is ICollection<T> csrc) ? csrc.Leave(drop) : IEnumHelper();
}
This will be much more efficient than the solutions with a double-reverse, since it creates only one list and only enumerates the list once.
public static class Extensions
{
static IEnumerable<T> Leave<T>(this IEnumerable<T> items, int numToSkip)
{
var list = items.ToList();
// Assert numToSkip <= list count.
list.RemoveRange(list.Count - numToSkip, numToSkip);
return List
}
}
string alphabet = "abcdefghijklmnopqrstuvwxyz";
var chars = alphabet.Leave(10); // abcdefghijklmnop
Currently, C# has a TakeLast(n) method defined which takes characters from the end of the string.
See here: https://msdn.microsoft.com/en-us/library/hh212114(v=vs.103).aspx

Find sequence in IEnumerable<T> using Linq

What is the most efficient way to find a sequence within a IEnumerable<T> using LINQ
I want to be able to create an extension method which allows the following call:
int startIndex = largeSequence.FindSequence(subSequence)
The match must be adjacent and in order.
Here's an implementation of an algorithm that finds a subsequence in a sequence. I called the method IndexOfSequence, because it makes the intent more explicit and is similar to the existing IndexOf method:
public static class ExtensionMethods
{
public static int IndexOfSequence<T>(this IEnumerable<T> source, IEnumerable<T> sequence)
{
return source.IndexOfSequence(sequence, EqualityComparer<T>.Default);
}
public static int IndexOfSequence<T>(this IEnumerable<T> source, IEnumerable<T> sequence, IEqualityComparer<T> comparer)
{
var seq = sequence.ToArray();
int p = 0; // current position in source sequence
int i = 0; // current position in searched sequence
var prospects = new List<int>(); // list of prospective matches
foreach (var item in source)
{
// Remove bad prospective matches
prospects.RemoveAll(k => !comparer.Equals(item, seq[p - k]));
// Is it the start of a prospective match ?
if (comparer.Equals(item, seq[0]))
{
prospects.Add(p);
}
// Does current character continues partial match ?
if (comparer.Equals(item, seq[i]))
{
i++;
// Do we have a complete match ?
if (i == seq.Length)
{
// Bingo !
return p - seq.Length + 1;
}
}
else // Mismatch
{
// Do we have prospective matches to fall back to ?
if (prospects.Count > 0)
{
// Yes, use the first one
int k = prospects[0];
i = p - k + 1;
}
else
{
// No, start from beginning of searched sequence
i = 0;
}
}
p++;
}
// No match
return -1;
}
}
I didn't fully test it, so it might still contain bugs. I just did a few tests on well-known corner cases to make sure I wasn't falling into obvious traps. Seems to work fine so far...
I think the complexity is close to O(n), but I'm not an expert of Big O notation so I could be wrong... at least it only enumerates the source sequence once, whithout ever going back, so it should be reasonably efficient.
The code you say you want to be able to use isn't LINQ, so I don't see why it need be implemented with LINQ.
This is essentially the same problem as substring searching (indeed, an enumeration where order is significant is a generalisation of "string").
Since computer science has considered this problem frequently for a long time, so you get to stand on the shoulders of giants.
Some reasonable starting points are:
http://en.wikipedia.org/wiki/Knuth%E2%80%93Morris%E2%80%93Pratt_algorithm
http://en.wikipedia.org/wiki/Boyer%E2%80%93Moore_string_search_algorithm
http://en.wikipedia.org/wiki/Rabin-karp
Even just the pseudocode in the wikipedia articles is enough to port to C# quite easily. Look at the descriptions of performance in different cases and decide which cases are most likely to be encountered by your code.
I understand this is an old question, but I needed this exact method and I wrote it up like so:
public static int ContainsSubsequence<T>(this IEnumerable<T> elements, IEnumerable<T> subSequence) where T: IEquatable<T>
{
return ContainsSubsequence(elements, 0, subSequence);
}
private static int ContainsSubsequence<T>(IEnumerable<T> elements, int index, IEnumerable<T> subSequence) where T: IEquatable<T>
{
// Do we have any elements left?
bool elementsLeft = elements.Any();
// Do we have any of the sub-sequence left?
bool sequenceLeft = subSequence.Any();
// No elements but sub-sequence not fully matched
if (!elementsLeft && sequenceLeft)
return -1; // Nope, didn't match
// No elements of sub-sequence, which means even if there are
// more elements, we matched the sub-sequence fully
if (!sequenceLeft)
return index - subSequence.Count(); // Matched!
// If we didn't reach a terminal condition,
// check the first element of the sub-sequence against the first element
if (subSequence.First().Equals(e.First()))
// Yes, it matched - move onto the next. Consume (skip) one element in each
return ContainsSubsequence(elements.Skip(1), index + 1 subSequence.Skip(1));
else
// No, it didn't match. Try the next element, without consuming an element
// from the sub-sequence
return ContainsSubsequence(elements.Skip(1), index + 1, subSequence);
}
Updated to not just return if the sub-sequence matched, but where it started in the original sequence.
This is an extension method on IEnumerable, fully lazy, terminates early and is far more linq-ified than the currently up-voted answer. Bewarned, however (as #wai-ha-lee points out) it is recursive and creates a lot of enumerators. Use it where applicable (performance/memory). This was fine for my needs, but YMMV.
You can use this library called Sequences to do that (disclaimer: I'm the author).
It has a IndexOfSlice method that does exactly what you need - it's an implementation of the Knuth-Morris-Pratt algorithm.
int startIndex = largeSequence.AsSequence().IndexOfSlice(subSequence);
UPDATE:
Given the clarification of the question my response below isn't as applicable. Leaving it for historical purposes.
You probably want to use mySequence.Where(). Then the key is to optimize the predicate to work well in your environment. This can vary quite a bit depending on your requirements and typical usage patterns.
It is quite possible that what works well for small collections doesn't scale well for much larger collections depending on what type T is.
Of course, if the 90% use is for small collections then optimizing for the outlier large collection seems a bit YAGNI.

Categories

Resources