Find sequence in IEnumerable<T> using Linq - c#

What is the most efficient way to find a sequence within an IEnumerable<T> using LINQ?
I want to be able to create an extension method which allows the following call:
int startIndex = largeSequence.FindSequence(subSequence);
The match must be adjacent and in order.

Here's an implementation of an algorithm that finds a subsequence in a sequence. I called the method IndexOfSequence, because it makes the intent more explicit and is similar to the existing IndexOf method:
using System.Collections.Generic;
using System.Linq;

public static class ExtensionMethods
{
    public static int IndexOfSequence<T>(this IEnumerable<T> source, IEnumerable<T> sequence)
    {
        return source.IndexOfSequence(sequence, EqualityComparer<T>.Default);
    }

    public static int IndexOfSequence<T>(this IEnumerable<T> source, IEnumerable<T> sequence, IEqualityComparer<T> comparer)
    {
        var seq = sequence.ToArray();
        int p = 0; // current position in source sequence
        int i = 0; // current position in searched sequence
        var prospects = new List<int>(); // list of prospective matches
        foreach (var item in source)
        {
            // Remove bad prospective matches
            prospects.RemoveAll(k => !comparer.Equals(item, seq[p - k]));

            // Is it the start of a prospective match?
            if (comparer.Equals(item, seq[0]))
            {
                prospects.Add(p);
            }

            // Does the current item continue the partial match?
            if (comparer.Equals(item, seq[i]))
            {
                i++;
                // Do we have a complete match?
                if (i == seq.Length)
                {
                    // Bingo!
                    return p - seq.Length + 1;
                }
            }
            else // Mismatch
            {
                // Do we have prospective matches to fall back to?
                if (prospects.Count > 0)
                {
                    // Yes, use the first one
                    int k = prospects[0];
                    i = p - k + 1;
                }
                else
                {
                    // No, start from the beginning of the searched sequence
                    i = 0;
                }
            }
            p++;
        }
        // No match
        return -1;
    }
}
I didn't fully test it, so it might still contain bugs. I just did a few tests on well-known corner cases to make sure I wasn't falling into obvious traps. It seems to work fine so far...
I think the complexity is close to O(n), but I'm not an expert on Big O notation so I could be wrong... at least it only enumerates the source sequence once, without ever going back, so it should be reasonably efficient.
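A quick usage sketch (assuming the extension class above is in scope; expected results shown in comments):
    var haystack = new[] { 1, 2, 3, 4, 5, 6 };
    var needle = new[] { 3, 4, 5 };
    int index = haystack.IndexOfSequence(needle);         // 2
    int notFound = haystack.IndexOfSequence(new[] { 9 }); // -1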

The code you say you want to be able to use isn't LINQ, so I don't see why it need be implemented with LINQ.
This is essentially the same problem as substring searching (indeed, an enumeration where order is significant is a generalisation of "string").
Computer science has considered this problem frequently for a long time, so you get to stand on the shoulders of giants.
Some reasonable starting points are:
http://en.wikipedia.org/wiki/Knuth%E2%80%93Morris%E2%80%93Pratt_algorithm
http://en.wikipedia.org/wiki/Boyer%E2%80%93Moore_string_search_algorithm
http://en.wikipedia.org/wiki/Rabin-karp
Even just the pseudocode in the wikipedia articles is enough to port to C# quite easily. Look at the descriptions of performance in different cases and decide which cases are most likely to be encountered by your code.
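For reference, here is a rough sketch of a KMP port to C# over IEnumerable<T>; the class and method names are my own, and this is a sketch of the textbook algorithm rather than anyone's production code:
    using System.Collections.Generic;

    public static class KmpSearch
    {
        // Builds the KMP failure table: fail[i] is the length of the longest proper
        // prefix of pattern[0..i] that is also a suffix of it.
        private static int[] BuildFailureTable<T>(T[] pattern, IEqualityComparer<T> comparer)
        {
            var fail = new int[pattern.Length];
            int k = 0;
            for (int i = 1; i < pattern.Length; i++)
            {
                while (k > 0 && !comparer.Equals(pattern[i], pattern[k]))
                    k = fail[k - 1];
                if (comparer.Equals(pattern[i], pattern[k]))
                    k++;
                fail[i] = k;
            }
            return fail;
        }

        public static int IndexOfSequenceKmp<T>(this IEnumerable<T> source, IEnumerable<T> sequence)
        {
            var comparer = EqualityComparer<T>.Default;
            var pattern = new List<T>(sequence).ToArray();
            if (pattern.Length == 0)
                return 0;

            var fail = BuildFailureTable(pattern, comparer);
            int matched = 0, position = 0;
            foreach (var item in source)
            {
                // Fall back along the failure table instead of rescanning the source
                while (matched > 0 && !comparer.Equals(item, pattern[matched]))
                    matched = fail[matched - 1];
                if (comparer.Equals(item, pattern[matched]))
                    matched++;
                if (matched == pattern.Length)
                    return position - pattern.Length + 1;
                position++;
            }
            return -1;
        }
    }
The failure table is what lets the search keep moving forward through the source without ever backing up, which is also why it works on a one-pass IEnumerable<T>.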

I understand this is an old question, but I needed this exact method and I wrote it up like so:
public static int ContainsSubsequence<T>(this IEnumerable<T> elements, IEnumerable<T> subSequence) where T : IEquatable<T>
{
    return ContainsSubsequence(elements, 0, subSequence);
}

private static int ContainsSubsequence<T>(IEnumerable<T> elements, int index, IEnumerable<T> subSequence) where T : IEquatable<T>
{
    // Do we have any elements left?
    bool elementsLeft = elements.Any();
    // Do we have any of the sub-sequence left?
    bool sequenceLeft = subSequence.Any();

    // No elements but sub-sequence not fully matched
    if (!elementsLeft && sequenceLeft)
        return -1; // Nope, didn't match

    // No elements of sub-sequence, which means even if there are
    // more elements, we matched the sub-sequence fully
    if (!sequenceLeft)
        return index - subSequence.Count(); // Matched!

    // If we didn't reach a terminal condition,
    // check the first element of the sub-sequence against the first element
    if (subSequence.First().Equals(elements.First()))
        // Yes, it matched - move onto the next. Consume (skip) one element in each
        return ContainsSubsequence(elements.Skip(1), index + 1, subSequence.Skip(1));
    else
        // No, it didn't match. Try the next element, without consuming an element
        // from the sub-sequence
        return ContainsSubsequence(elements.Skip(1), index + 1, subSequence);
}
Updated to not just return if the sub-sequence matched, but where it started in the original sequence.
This is an extension method on IEnumerable, fully lazy, terminates early and is far more LINQ-ified than the currently up-voted answer. Be warned, however (as @wai-ha-lee points out), it is recursive and creates a lot of enumerators. Use it where applicable (performance/memory). This was fine for my needs, but YMMV.

You can use this library called Sequences to do that (disclaimer: I'm the author).
It has an IndexOfSlice method that does exactly what you need - it's an implementation of the Knuth-Morris-Pratt algorithm.
int startIndex = largeSequence.AsSequence().IndexOfSlice(subSequence);

UPDATE:
Given the clarification of the question my response below isn't as applicable. Leaving it for historical purposes.
You probably want to use mySequence.Where(). Then the key is to optimize the predicate to work well in your environment. This can vary quite a bit depending on your requirements and typical usage patterns.
It is quite possible that what works well for small collections doesn't scale well for much larger collections depending on what type T is.
Of course, if the 90% use is for small collections then optimizing for the outlier large collection seems a bit YAGNI.

Related

Expensive IEnumerable: Any way to prevent multiple enumerations without forcing an immediate enumeration? [duplicate]

This question already has answers here:
Is there an IEnumerable implementation that only iterates over it's source (e.g. LINQ) once?
I have a very large enumeration and am preparing an expensive deferred operation on it (e.g. sorting it). I'm then passing this into a function which may or may not consume the IEnumerable, depending on some logic of its own.
Here's an illustration:
IEnumerable<Order> expensiveEnumerable = fullCatalog.OrderBy(c => Prioritize(c));
MaybeFullFillSomeOrders(expensiveEnumerable);
// Elsewhere... (example use-case for multiple enumerations, not real code)
void MaybeFullFillSomeOrders(IEnumerable<Order> nextUpOrders)
{
    if (notAGoodTime())
        return;

    foreach (var order in nextUpOrders)
        collectSomeInfo(order);

    processInfo();

    foreach (var order in nextUpOrders)
    {
        maybeFulfill(order);
        if (atCapacity())
            break;
    }
}
I would like to prepare my input to the other function such that:
If they do not consume the enumerable, the performance price of sorting is not paid.
This already precludes calling e.g. ToList() or ToArray() on it
If they choose to enumerate multiple times (perhaps not realizing how expensive it would be in this case) I want some defence in place to prevent the multiple enumeration.
Ideally, the result is still an IEnumerable<T>
The best solution I've come up with is to use Lazy<>
var expensive = new Lazy<List<Order>>(
    () => fullCatalog.OrderBy(c => Prioritize(c)).ToList());
This appears to satisfy criteria 1 and 2, but has a couple of drawbacks:
I have to change the interface to all downstream usages to expect a Lazy.
The full list (which in this case was built up from a SelectMany() on several smaller partitions) would need to be allocated as a new single contiguous list in memory. I'm not sure there's an easy way around this if I want to "cache" the sort result, but if you know of one I'm all ears.
One idea I had to solve the first problem was to wrap Lazy<> in some custom class that either implements or can implicitly be converted to an IEnumerable<T>, but I'm hoping someone knows of a more elegant approach.
You certainly could write your own IEnumerable<T> implementation that wraps another one, remembering all the elements it's already seen (and whether it's exhausted or not). If you need it to be thread-safe that becomes trickier, and you'd need to remember that at any time there may be multiple iterators working against the same IEnumerable<T>.
Fundamentally I think it would come down to working out what to do when asked for the next element (which is somewhat-annoyingly split into MoveNext() and Current, but that can probably be handled...):
If you've already read the next element within another iterator, you can yield it from your buffer
If you've already discovered that there is no next element, you can return that immediately
Otherwise, you need to ask the original iterator for the next element, and remember it for all the other wrapped iterators.
The other aspect that's tricky is knowing when to dispose of the underlying IEnumerator<T> - if you don't need to do that, it makes things simpler.
As a very sketchy attempt that I haven't even attempted to compile, and which is definitely not thread-safe, you could try something like this:
using System.Collections;
using System.Collections.Generic;

public class LazyEnumerable<T> : IEnumerable<T>
{
    private readonly IEnumerator<T> iterator;
    private readonly List<T> buffer = new List<T>();
    private bool completed = false;

    public LazyEnumerable(IEnumerable<T> original)
    {
        // TODO: You could be even lazier, only calling
        // GetEnumerator when you first need an element
        iterator = original.GetEnumerator();
    }

    IEnumerator IEnumerable.GetEnumerator() => GetEnumerator();

    public IEnumerator<T> GetEnumerator()
    {
        int index = 0;
        while (true)
        {
            // If we already have the element, yield it
            if (index < buffer.Count)
            {
                yield return buffer[index];
            }
            // If we've yielded everything in the buffer and some
            // other iterator has come to the end of the original,
            // we're done.
            else if (completed)
            {
                yield break;
            }
            // Otherwise, see if there's anything left in the original
            // iterator.
            else
            {
                bool hasNext = iterator.MoveNext();
                if (hasNext)
                {
                    var current = iterator.Current;
                    buffer.Add(current);
                    yield return current;
                }
                else
                {
                    completed = true;
                    yield break;
                }
            }
            index++;
        }
    }
}
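A hypothetical usage against the question's names, assuming the sketch above is fixed up to compile:
    IEnumerable<Order> expensive = new LazyEnumerable<Order>(fullCatalog.OrderBy(c => Prioritize(c)));
    MaybeFullFillSomeOrders(expensive);
    // The sort only runs if the callee actually iterates, and each element is
    // pulled from the underlying iterator (and buffered) at most once, even if
    // the callee enumerates the sequence several times.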

C# sort List<int> recursively

There's an exercise I need to do: given a List, I need to sort the contents using ONLY recursive methods (no while, do while, for, foreach).
So... I'm struggling (for over 2 hours now) and I don't know how to even begin.
The function must be
List<int> SortHighestToLowest (List<int> list) {
}
I THINK I should check whether the previous number is greater than the current number and so on, but what if the last number is greater than the first number in the list? That's why I'm having a headache.
I appreciate your help, thanks a lot.
[EDIT]
I delivered the exercise, but then the teacher said I shouldn't use external variables like I did here:
List<int> _tempList2 = new List<int>();
int _actualListIndex = 0;
int _actualMaxNumber = 0;
int _actualMaxNumberIndex = 0;

List<int> SortHighestToLowest(List<int> list)
{
    if (list.Count == 0)
        return _tempList2;

    if (_actualListIndex == 0)
        _actualMaxNumber = list[0];

    if (_actualListIndex < list.Count - 1)
    {
        _actualListIndex++;
        if (list[_actualListIndex] > _actualMaxNumber)
        {
            _actualMaxNumberIndex = _actualListIndex;
            _actualMaxNumber = list[_actualListIndex];
        }
        return SortHighestToLowest(list);
    }

    _tempList2.Add(_actualMaxNumber);
    list.RemoveAt(_actualMaxNumberIndex);
    _actualListIndex = 0;
    _actualMaxNumberIndex = 0;
    return SortHighestToLowest(list);
}
The exercise is done and I passed (thanks to other exercises as well), but I was wondering if there's a way of doing this without external variables and without using System.Linq, like String.Empty's response (I'm just curious; the community helped me solve my issue and I'm thankful).
I am taking your instructions to the letter here.
Only recursive methods
No while, do while, for, foreach
Signature must be List<int> SortHighestToLowest(List<int> list)
Now, I do assume you may use at least the built-in properties and methods of the List<T> type. If not, you would have a hard time even reading the elements of your list.
That said, any calls to Sort or OrderBy methods would be beyond the point here, since they would render any recursive method useless.
I also assume it is okay to use other lists in the process, since you didn't mention anything in regards to that.
With all that in mind, I came to the piece below, making use of the LINQ Max method and List<T>.Remove, plus a new list of integers for each recursive call:
public static List<int> SortHighestToLowest(List<int> list)
{
    // recursivity breaker
    if (list.Count <= 1)
        return list;

    // remove the highest item
    var max = list.Max();
    list.Remove(max);

    // put the highest item in front of the sorted remainder of the list
    var result = new List<int> { max };
    result.AddRange(SortHighestToLowest(list));
    return result;
}
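A quick sanity check of the method above:
    var sorted = SortHighestToLowest(new List<int> { 1, 5, 3, 2 });
    // sorted is [5, 3, 2, 1]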
To solve this problem, try solving smaller subproblems. Consider the following list:
[1,5,3,2]
Take the last element (2) out of the list and assume the rest has already been sorted, giving [1,3,5] and 2. Now the problem reduces to inserting 2 into its correct position. If we can insert it in the correct position, the array becomes sorted. This can be applied recursively.
For every recursive problem there should be a base condition with respect to the hypothesis we make. For the first problem (Sort), the base condition is an array with a single element; a single-element array is always sorted.
For the second problem (Insert), the base condition is an empty array, or the last element of the array being less than the element to be inserted. In both cases the element is inserted at the end.
Algorithm
---------
Sort(list)
    if (list.count == 1)
        return
    temp = last element of list
    temp_list = list with last element removed
    Sort(temp_list)
    Insert(temp_list, temp)

Insert(list, temp)
    if (list.count == 0 || list[n-1] <= temp)
        list.insert(temp)
        return
    insert_temp = last element of list
    insert_temp_list = list with last element removed
    Insert(insert_temp_list, temp)
    insert_temp_list.insert(insert_temp)
After the base condition, Insert keeps calling itself recursively until it finds the correct position for the element being inserted, putting the removed last elements back as the recursion unwinds.
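Here is a C# sketch of that pseudocode (method and variable names are mine). The comparison in Insert is >= so the result comes out highest-to-lowest, matching the question; flip it to <= for ascending order:
    static List<int> SortHighestToLowest(List<int> list)
    {
        // base case: a list of zero or one elements is already sorted
        if (list.Count <= 1)
            return list;

        int last = list[list.Count - 1];       // take the last element out
        list.RemoveAt(list.Count - 1);

        SortHighestToLowest(list);             // sort the remainder...
        Insert(list, last);                    // ...then insert the element back in place
        return list;
    }

    static void Insert(List<int> list, int value)
    {
        // base case: empty list, or the last element already belongs before 'value'
        if (list.Count == 0 || list[list.Count - 1] >= value)
        {
            list.Add(value);
            return;
        }

        int last = list[list.Count - 1];
        list.RemoveAt(list.Count - 1);

        Insert(list, value);                   // find the spot for 'value' recursively
        list.Add(last);                        // then put the smaller element back after it
    }
This uses no loops and no variables outside the methods, which also answers the follow-up in the question's edit.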

C# Time complexity of Array[T].Contains(T item) vs HashSet<T>.Contains(T item)

HashSet<T>.Contains(T) (the implementation of ICollection<T>.Contains(T)) has a time complexity of O(1).
So I'm wondering what the complexity of a class member array containing integers would be, as I strive for O(1) and don't need the existence checks of HashSet<T>.Add(T).
Since built-in types are not shown in the .NET reference source, I had no luck finding the array implementation of IList<T>.Contains(T).
Any (further) reading material or reference would be very much appreciated.
You can see the source code of Array with any reflector (maybe online too, I didn't check). IList.Contains is just:
Array.IndexOf(this,value) >= this.GetLowerBound(0);
And Array.IndexOf calls Array.IndexOf<T>, which, after a bunch of consistency checks, redirects to
EqualityComparer<T>.Default.IndexOf(array, value, startIndex, count)
And that one finally does:
int num = startIndex + count;
for (int index = startIndex; index < num; ++index)
{
    if (this.Equals(array[index], value))
        return index;
}
return -1;
So it just loops over the array, with average complexity O(N). Of course that was obvious from the beginning, but it provides some more evidence.
Array source code for the .Net Framework (up to v4.8) is available in reference source, and can be de-compiled using ILSpy.
In the reference source, at line 2753 and then line 2809, you find:
// -----------------------------------------------------------
// ------- Implement ICollection<T> interface methods --------
// -----------------------------------------------------------
...
[SecuritySafeCritical]
bool Contains<T>(T value) {
    //! Warning: "this" is an array, not an SZArrayHelper. See comments above
    //! or you may introduce a security hole!
    T[] _this = JitHelpers.UnsafeCast<T[]>(this);
    return Array.IndexOf(_this, value) != -1;
}
And IndexOf ends up in this IndexOf, which is an O(n) algorithm:
internal virtual int IndexOf(T[] array, T value, int startIndex, int count)
{
    int endIndex = startIndex + count;
    for (int i = startIndex; i < endIndex; i++) {
        if (Equals(array[i], value)) return i;
    }
    return -1;
}
Those methods are on a special class SZArrayHelper in the same source file, and as explained at line 2721, this is the implementation you are looking for.
// This class is needed to allow an SZ array of type T[] to expose IList<T>,
// IList<T.BaseType>, etc., etc. all the way up to IList<Object>. When the following call is
// made:
//
// ((IList<T>) (new U[n])).SomeIListMethod()
//
// the interface stub dispatcher treats this as a special case, loads up SZArrayHelper,
// finds the corresponding generic method (matched simply by method name), instantiates
// it for type <T> and executes it.
To achieve O(1) complexity, you should convert the array to a HashSet:
var lookupHashSet = new HashSet<T>(yourArray);
...
var hasValue = lookupHashSet.Contains(testValue);
Of course, this conversion is an O(n) operation. If you do not have many lookups to do, it is moot.
Note from documentation on this constructor:
If collection contains duplicates, the set will contain one of each unique element. No exception will be thrown. Therefore, the size of the resulting set is not identical to the size of collection.
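For instance (illustrative):
    var set = new HashSet<int>(new[] { 1, 1, 2, 3, 3, 3 });
    // set.Count is 3, not 6; the duplicates are silently collapsed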
You actually can see the source for List<T>, but you need to look it up online. Here's one source.
Any pure list/array bool Contains(T item) check is O(N) complexity, because each element needs to be checked. .NET is no exception. (If you designed a data structure that manifested as a list but also contained a bloom filter helper data structure, that would be another story.)

"Unzip" IEnumerable dynamically in C# or best alternative

Let's assume you have a function that returns a lazily-enumerated object:
struct AnimalCount
{
int Chickens;
int Goats;
}
IEnumerable<AnimalCount> FarmsInEachPen()
{
....
yield return new AnimalCount(x, y);
....
}
You also have two functions that consume two separate IEnumerables, for example:
ConsumeChicken(IEnumerable<int>);
ConsumeGoat(IEnumerable<int>);
How can you call ConsumeChicken and ConsumeGoat (a) without converting FarmsInEachPen() with ToList() beforehand, because it might have two zillion records, and (b) without multi-threading?
Basically:
ConsumeChicken(FarmsInEachPen().Select(x => x.Chickens));
ConsumeGoats(FarmsInEachPen().Select(x => x.Goats));
But without forcing the double enumeration.
I can solve it with multithreading, but it gets unnecessarily complicated with a buffer queue for each list.
So I'm looking for a way to split the AnimalCount enumerator into two int enumerators without fully evaluating AnimalCount. There is no problem running ConsumeGoat and ConsumeChicken together in lock-step.
I can feel the solution just out of my grasp but I'm not quite there. I'm thinking along the lines of a helper function that returns an IEnumerable being fed into ConsumeChicken and each time the iterator is used, it internally calls ConsumeGoat, thus executing the two functions in lock-step. Except, of course, I don't want to call ConsumeGoat more than once..
I don't think there is a way to do what you want, since ConsumeChickens(IEnumerable<int>) and ConsumeGoats(IEnumerable<int>) are being called sequentially, each of them enumerating a list separately - how do you expect that to work without two separate enumerations of the list?
Depending on the situation, a better solution is to have ConsumeChicken(int) and ConsumeGoat(int) methods (which each consume a single item), and call them in alternation. Like this:
foreach (var animal in animals)
{
    ConsumeChicken(animal.Chickens);
    ConsumeGoat(animal.Goats);
}
This will enumerate the animals collection only once.
Also, a note: depending on your LINQ-provider and what exactly it is you're trying to do, there may be better options. For example, if you're trying to get the total sum of both chickens and goats from a database using linq-to-sql or linq-to-entities, the following query..
from a in animals
group a by 0 into g
select new
{
    TotalChickens = g.Sum(x => x.Chickens),
    TotalGoats = g.Sum(x => x.Goats)
}
will result in a single query, and do the summation on the database-end, which is greatly preferable to pulling the entire table over and doing the summation on the client end.
The way you have posed your problem, there is no way to do this. IEnumerable<T> is a pull enumerable - that is, you can GetEnumerator to the front of the sequence and then repeatedly ask "Give me the next item" (MoveNext/Current). You can't, on one thread, have two different things pulling from the animals.Select(a => a.Chickens) and animals.Select(a => a.Goats) at the same time. You would have to do one then the other (which would require materializing the second).
The suggestion BlueRaja made is one way to change the problem slightly. I would suggest going that route.
The other alternative is to utilize IObservable<T> from Microsoft's reactive extensions (Rx), a push enumerable. I won't go into the details of how you would do that, but it's something you could look into.
Edit:
The above is assuming that ConsumeChickens and ConsumeGoats are both returning void or are at least not returning IEnumerable<T> themselves - which seems like an obvious assumption. I'd appreciate it if the lame downvoter would actually comment.
Actually, the simplest way to achieve what you want is to convert the FarmsInEachPen return value to a push collection, i.e. an IObservable, and use Reactive Extensions to work with it:
var observable = new Subject<AnimalCount>();

// Do is lazy; the side effects only run once each pipeline is subscribed
observable.Do(x => DoSomethingWithChicken(x.Chickens)).Subscribe();
observable.Do(x => DoSomethingWithGoat(x.Goats)).Subscribe();

foreach (var item in FarmsInEachPen())
{
    observable.OnNext(item);
}
I figured it out, thanks in large part to the path that @Lee put me on.
You need to share a single enumerator between the two zips, and use an adapter function to project the correct element into the sequence.
private static IEnumerable<object> ConsumeChickens(IEnumerable<int> xList)
{
    foreach (var x in xList)
    {
        Console.WriteLine("X: " + x);
        yield return null;
    }
}

private static IEnumerable<object> ConsumeGoats(IEnumerable<int> yList)
{
    foreach (var y in yList)
    {
        Console.WriteLine("Y: " + y);
        yield return null;
    }
}

private static IEnumerable<int> SelectHelper(IEnumerator<AnimalCount> enumerator, int i)
{
    bool c = i != 0 || enumerator.MoveNext();
    while (c)
    {
        if (i == 0)
        {
            yield return enumerator.Current.Chickens;
            c = enumerator.MoveNext();
        }
        else
        {
            yield return enumerator.Current.Goats;
        }
    }
}

private static void Main(string[] args)
{
    var enumerator = GetAnimals().GetEnumerator();
    var chickensList = ConsumeChickens(SelectHelper(enumerator, 0));
    var goatsList = ConsumeGoats(SelectHelper(enumerator, 1));
    var temp = chickensList.Zip(goatsList, (i, i1) => (object) null);
    temp.ToList();
    Console.WriteLine("Total iterations: " + iterations);
}

LINQ Performance for Large Collections

I have a large collection of strings (up to 1M) alphabetically sorted. I have experimented with LINQ queries against this collection using HashSet, SortedDictionary, and Dictionary. I am static caching the collection, it's up to 50MB in size, and I'm always calling the LINQ query against the cached collection. My problem is as follows:
Regardless of collection type, performance is much poorer than SQL (up to 200 ms). When doing a similar query against the underlying SQL tables, performance is much quicker (5-10 ms). I have implemented my LINQ queries as follows:
public static string ReturnSomething(string query, int limit)
{
    StringBuilder sb = new StringBuilder();
    foreach (var stringitem in MyCollection.Where(
        x => x.StartsWith(query) && x.Length > query.Length).Take(limit))
    {
        sb.Append(stringitem);
    }
    return sb.ToString();
}
It is my understanding that the HashSet, Dictionary, etc. implement lookups using binary tree search instead of the standard enumeration. What are my options for high performance LINQ queries into the advanced collection types?
In your current code you don't make use of any of the special features of the Dictionary / SortedDictionary / HashSet collections, you are using them the same way that you would use a List. That is why you don't see any difference in performance.
If you use a dictionary as index where the first few characters of the string is the key and a list of strings is the value, you can from the search string pick out a small part of the entire collection of strings that has possible matches.
I wrote the class below to test this. If I populate it with a million strings and search with an eight character string, it rips through all possible matches in about 3 ms. Searching with a one character string is the worst case, but it finds the first 1000 matches in about 4 ms. Finding all matches for a one character string takes about 25 ms.
The class creates indexes for 1, 2, 4 and 8 character keys. If you look at your specific data and what you search for, you should be able to select what indexes to create to optimise it for your conditions.
public class IndexedList {
private class Index : Dictionary<string, List<string>> {
private int _indexLength;
public Index(int indexLength) {
_indexLength = indexLength;
}
public void Add(string value) {
if (value.Length >= _indexLength) {
string key = value.Substring(0, _indexLength);
List<string> list;
if (!this.TryGetValue(key, out list)) {
Add(key, list = new List<string>());
}
list.Add(value);
}
}
public IEnumerable<string> Find(string query, int limit) {
return
this[query.Substring(0, _indexLength)]
.Where(s => s.Length > query.Length && s.StartsWith(query))
.Take(limit);
}
}
private Index _index1;
private Index _index2;
private Index _index4;
private Index _index8;
public IndexedList(IEnumerable<string> values) {
_index1 = new Index(1);
_index2 = new Index(2);
_index4 = new Index(4);
_index8 = new Index(8);
foreach (string value in values) {
_index1.Add(value);
_index2.Add(value);
_index4.Add(value);
_index8.Add(value);
}
}
public IEnumerable<string> Find(string query, int limit) {
if (query.Length >= 8) return _index8.Find(query, limit);
if (query.Length >= 4) return _index4.Find(query,limit);
if (query.Length >= 2) return _index2.Find(query,limit);
return _index1.Find(query, limit);
}
}
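A quick usage sketch (myMillionStrings is a placeholder for your cached collection):
    var indexed = new IndexedList(myMillionStrings);
    var matches = indexed.Find("abc", 1000).ToList();
Note that Find indexes directly into the dictionary, so a query whose key has no entries at all would throw; guarding that with TryGetValue is a small adjustment left to the reader.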
I bet you have an index on the column, so SQL Server can do the comparison in O(log(n)) operations rather than O(n). To imitate the SQL Server behavior, use a sorted collection, find the first string s such that s >= query, then look at the following values until you find one that does not start with query, and apply the additional length filter to the values in between. This is what is called a range scan (Oracle) or an index seek (SQL Server).
This is some example code which is very likely to go into infinite loops or have off-by-one errors because I didn't test it, but you should get the idea.
// Note, list must be sorted (ordinally) before being passed to this function
IEnumerable<string> FindStringsThatStartWith(List<string> list, string query)
{
    // binary search for the first element >= query
    int low = 0, high = list.Count;
    while (low < high)
    {
        int mid = (low + high) / 2;
        if (string.CompareOrdinal(list[mid], query) < 0)
            low = mid + 1;
        else
            high = mid;
    }

    // walk forward while the prefix still matches
    while (low < list.Count && list[low].StartsWith(query))
    {
        if (list[low].Length > query.Length)
            yield return list[low];
        low++;
    }
}
If you're doing a "starts with" search, you only care about ordinal comparisons, and you can have the collection sorted (again in ordinal order), so I would suggest keeping the values in a list.
In fact, you could probably do another binary search for the first value which doesn't start with the prefix, so you'd have a start and an end point. Then you just need to apply the length criterion to that matching portion. (I'd hope that if it's sensible data, the prefix matching is going to get rid of most candidate values.) The way to find the first value which doesn't start with the prefix is to search for the lexicographically-first value which doesn't - e.g. with a prefix of "ABC", search for "ABD".
None of this uses LINQ, and it's all very specific to your particular case, but it should work. Let me know if any of this doesn't make sense.
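For illustration, a rough sketch of those two binary searches (helper names are mine; the list is assumed to be sorted with StringComparer.Ordinal, and the prefix is assumed to be non-empty with a last character below char.MaxValue):
    // lower-bound binary search: first index whose value is >= 'value' (ordinal)
    static int LowerBound(List<string> sorted, string value)
    {
        int low = 0, high = sorted.Count;
        while (low < high)
        {
            int mid = (low + high) / 2;
            if (string.CompareOrdinal(sorted[mid], value) < 0)
                low = mid + 1;
            else
                high = mid;
        }
        return low;
    }

    static IEnumerable<string> PrefixRange(List<string> sorted, string prefix)
    {
        // first index that could start with the prefix
        int start = LowerBound(sorted, prefix);

        // first index that can no longer start with the prefix: search for the
        // lexicographic successor of the prefix, e.g. "ABC" -> "ABD"
        string successor = prefix.Substring(0, prefix.Length - 1)
                           + (char)(prefix[prefix.Length - 1] + 1);
        int end = LowerBound(sorted, successor);

        for (int i = start; i < end; i++)
            yield return sorted[i];
    }
The matching block is then the half-open range [start, end), and the length criterion can be applied to just that slice.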
If you are trying to optimize looking up a list of strings with a given prefix, you might want to take a look at implementing a Trie (not to be confused with a regular tree) data structure in C#.
Tries offer very fast prefix lookups and have a very small memory overhead compared to other data structures for this sort of operation.
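A minimal sketch of such a trie (my own class design, using System.Collections.Generic; each node keeps its children keyed by character and a flag marking the end of a stored word):
    public class Trie
    {
        private class Node
        {
            public Dictionary<char, Node> Children = new Dictionary<char, Node>();
            public bool IsWord;
        }

        private readonly Node _root = new Node();

        public void Add(string word)
        {
            var node = _root;
            foreach (char c in word)
            {
                Node child;
                if (!node.Children.TryGetValue(c, out child))
                    node.Children[c] = child = new Node();
                node = child;
            }
            node.IsWord = true;
        }

        // Enumerates every stored word that starts with the given prefix.
        public IEnumerable<string> StartingWith(string prefix)
        {
            var node = _root;
            foreach (char c in prefix)
            {
                if (!node.Children.TryGetValue(c, out node))
                    yield break;                 // no stored word has this prefix
            }
            foreach (string word in Collect(node, prefix))
                yield return word;
        }

        private static IEnumerable<string> Collect(Node node, string current)
        {
            if (node.IsWord)
                yield return current;
            foreach (var pair in node.Children)
                foreach (string word in Collect(pair.Value, current + pair.Key))
                    yield return word;
        }
    }
The prefix walk costs O(prefix length) regardless of how many strings are stored, which is where the speed-up over a flat scan comes from.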
About LINQ to Objects in general. It's not unusual to have a speed reduction compared to SQL. The net is littered with articles analyzing its performance.
Just looking at your code, I would say that you should reorder the comparison to take advantage of short-circuiting when using boolean operators:
foreach (var stringitem in MyCollection.Where(
    x => x.Length > query.Length && x.StartsWith(query)).Take(limit))
The comparison of length is always going to be an O(1) operation (as the length is being stored as part of the string, it doesn't count each character every time), whereas the call to StartsWith is going to be an O(N) operation, where N is the length of query (or the length of the string, whichever is smaller).
By placing the comparison of length before the call to StartsWith, if that comparison fails, you save yourself some extra cycles which could add up when processing large numbers of items.
I don't think that a lookup table is going to help you here, as lookup tables are good when you are comparing the entire key, not parts of the key, like you are doing with the call to StartsWith.
Rather, you might be better off using a tree structure which is split based on the letters in the words in the list.
However, at that point, you are really just recreating what SQL Server is doing (in the case of indexes) and that would just be a duplication of effort on your part.
I think the problem is that LINQ has no way to use the fact that your sequence is already sorted. In particular, it cannot know that filtering with StartsWith preserves the order.
I would suggest using the List.BinarySearch method together with an IComparer<string> that only compares the first query.Length characters (this might be tricky, since it's not clear whether the query string will always be the first or the second parameter to Compare()).
You could even use the standard string comparison, since BinarySearch returns a negative number which you can complement (using ~) in order to get the index of the first element that is larger than your query.
You then have to scan from the returned index (in both directions!) to find all elements matching your query string.
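A rough sketch of that approach (the comparer and variable names are mine; it assumes the current framework behavior of passing the list element as the first argument to Compare, which is exactly the ambiguity mentioned above):
    class PrefixComparer : IComparer<string>
    {
        public int Compare(string element, string query)
        {
            // Compare only the first query.Length characters, ordinally.
            // (BinarySearch currently calls Compare(list[i], item), so the list
            // element arrives first and the query second.)
            return string.CompareOrdinal(element, 0, query, 0, query.Length);
        }
    }

    // list must be sorted ordinally, e.g. list.Sort(StringComparer.Ordinal)
    int hit = list.BinarySearch(query, new PrefixComparer());
    if (hit >= 0)
    {
        // BinarySearch lands somewhere inside the block of matches, not necessarily
        // at its first element, so walk backwards first, then forwards.
        int first = hit;
        while (first > 0 && list[first - 1].StartsWith(query, StringComparison.Ordinal))
            first--;
        for (int i = first; i < list.Count && list[i].StartsWith(query, StringComparison.Ordinal); i++)
            Console.WriteLine(list[i]);
    }
    // if hit < 0, no element starts with the query; ~hit is the insertion point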
