I have a string of dash-separated numbers from which I want to remove the duplicates:
string original = "45-1-3-45-10-3-15";
string new = "45-1-3-10-15";
I have tried two approaches, and used Stopwatch to determine which method is faster, but I am getting inconsistent elapsed times, so I was hoping for some insight into which method would be more efficient for producing the new duplicate-free list.
Method 1: While loop
List<string> temp = new List<string>();
bool moreNumbers = true;
while (moreNumbers)
{
if (original.Contains("-"))
{
string number = original.Substring(0, original.IndexOf("-"));
//don't add if the number is already in the list
int index = temp.FindIndex(item => item == number);
if (index < 0)
temp.Add(number);
original = original.Substring(original.IndexOf("-") + 1);
}
else
moreNumbers = false;
}
//add remaining value in
string lastNumber = original;
//don't add if the number is already in the list
int indexLast = temp.FindIndex(item => item == lastNumber);
if (indexLast < 0)
temp.Add(lastNumber);
string new = "";
foreach (string number in temp)
{
new += "-" + number;
}
if (new[0] == '-')
new = new.Substring(1);
Method 2: Split
List<string> temp = original.Split('-').Distinct().ToList();
string new = "";
foreach (string number in temp)
{
new += "-" + number;
}
if (new[0] == '-')
new = new.Substring(1);
I think the second method is more readable, but possibly slower? Which of these methods would be more efficient or a better approach?
This will be highly optimized, but you should still test it for performance:
string result = string.Join("-", original.Split('-').Distinct());
You have some inefficiencies in both your examples.
Method 1: manipulating a string is never efficient. Strings are immutable.
Method 2: there's no need to create a List, and you should use a StringBuilder instead of string concatenation.
Lastly, new is a C# reserved word so none of your code will compile.
In the first approach, you're using several Substring calls and several IndexOf calls. I don't know exactly the internal implementation, but I guess they are O(n) in time complexity.
Since, for each number in the list, you'll do a full loop in the other list (you're using strings as lists), you'll have an O(n^2) time complexity.
As for the second option: Distinct also has to iterate the list, but it uses a hash set internally, so it is O(n) on average rather than O(n^2).
I think one optimized approach to the problem is:
1) loop the main string and for each "-" or end of string, save the number (this will be more economical than the Split in terms of space).
2) for each number, put it in a Dictionary. This won't be economical in terms of space, but will provide O(1) time to check whether the item already exists. Hashing small strings shouldn't be too costly.
3) Loop the Dictionary to retrieve the distinct values.
This implementation will be O(n), better than the O(n^2) of the first approach.
Note that using only the dictionary may deliver the result string in a different order. If the order is important, use the Dictionary to check whether an item is a duplicate, but put the items in an auxiliary list. Again, this will have a space cost.
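A minimal sketch of that approach, assuming the numbers are separated by single dashes (a HashSet<string> would serve equally well as the duplicate check; variable names here are illustrative):
string original = "45-1-3-45-10-3-15";
var seen = new Dictionary<string, bool>(); // used purely as a set for O(1) duplicate checks
var ordered = new List<string>();          // auxiliary list to preserve the original order
int start = 0;
for (int i = 0; i <= original.Length; i++)
{
    // at each "-" or at the end of the string, cut out the number
    if (i == original.Length || original[i] == '-')
    {
        string number = original.Substring(start, i - start);
        if (!seen.ContainsKey(number))
        {
            seen.Add(number, true);
            ordered.Add(number);
        }
        start = i + 1;
    }
}
string result = string.Join("-", ordered); // "45-1-3-10-15"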
This is kind of a basic question, but I learned programming in C++ and am just transitioning to C#, so my ignorance of the C# methods is getting in my way.
A client has given me a few fixed length files and they want the 484th character of every odd numbered record, skipping the first one (3, 5, 7, etc...) changed from a space to a 0. In my mind, I should be able to do something like the below:
static void Main(string[] args)
{
List<string> allLines = System.IO.File.ReadAllLines(@"C:\...").ToList();
foreach(string line in allLines)
{
//odd numbered logic here
line[483] = '0';
}
...
//write to new file
}
However, the property or indexer cannot be assigned to because it is read only. All my reading says that I have not set a setter for the variable, and I have tried what was shown at this SO article, but I am doing something wrong every time. Should what is shown in that article work? Should I do something else?
You cannot modify C# strings directly, because they are immutable. You can convert strings to char[], modify it, then make a string again, and write it to file:
File.WriteAllLines(
    @"c:\newfile.txt"
    , File.ReadAllLines(@"C:\...").Select((s, index) => {
        // records 3, 5, 7, ... are the non-zero even 0-based indexes 2, 4, 6, ...
        if (index == 0 || index % 2 != 0) {
            return s; // the first record and the even-numbered records do not change
        }
        var chars = s.ToCharArray();
        chars[483] = '0';
        return new string(chars);
    })
);
Since strings are immutable, you can't modify a single character in place by indexing into the string. However, you can "modify" it by building a new string and assigning it back.
We can use the Substring() method to return any part of the original string. Combining this with some concatenation, we can take the first part of the string (up to the character you want to replace), add the new character, and then add the rest of the original string.
Also, since we can't directly modify the items in a collection being iterated over in a foreach loop, we can switch your loop to a for loop instead. Now we can access each line by index, and can modify them on the fly:
for(int i = 0; i < allLines.Count; i++)
{
if (allLines[i].Length > 483)
{
allLines[i] = allLines[i].Substring(0, 483) + "0" + allLines[i].Substring(484);
}
}
Depending on how many lines you're processing and how many in-line concatenations you end up doing, there is some chance that using a StringBuilder instead of concatenation will perform better. Here is an alternate way to do this using a StringBuilder. I'll leave the perf measuring to you...
var sb = new StringBuilder();
for (int i = 0; i < allLines.Count; i++)
{
if (allLines[i].Length > 483)
{
sb.Clear();
sb.Append(allLines[i].Substring(0, 483));
sb.Append("0");
sb.Append(allLines[i].Substring(484));
allLines[i] = sb.ToString();
}
}
The iteration variable declared in the foreach (string line in this case) is read-only inside the loop - that's why you can't assign a value to it. Try using a regular for loop instead.
The purpose of foreach is to iterate over a container, and its iteration variable is read-only in nature. You should use a regular for loop instead. It will work.
static void Main(string[] args)
{
List<string> allLines = System.IO.File.ReadAllLines(@"C:\...").ToList();
for (int i = 0; i < allLines.Count; ++i)
{
if (allLines[i].Length > 483)
{
allLines[i] = allLines[i].Substring(0, 483) + "0";
}
}
...
//write to new file
}
I'm sorry in advance if it's bad to ask for this sort of help... but I don't know who else to ask.
I have an assignment to read two text files and find the 10 longest words in the first file (and the number of times each is repeated) which don't exist in the second file.
I currently read both of the files with File.ReadAllLines then split them into arrays, where every element is a single word (punctuation marks removed as well) and removed empty entries.
The idea I had to pick out the words fitting the requirements was: make a dictionary containing a string Word and an int Count. Then make a loop repeating for the first file's length, first comparing each element with the entire dictionary - if it finds a match, increase the Count by 1. If it doesn't match any of the dictionary elements, compare the given element with every element in the 2nd file through another loop; if it finds a match there, just go on to the next element of the first file, and if it doesn't find any matches, add the word to the dictionary and set Count to 1.
So my first question is: Is this actually the most efficient way to do this? (Don't forget I've only recently started studying C# and am not allowed to use LINQ.)
Second question: How do I work with the dictionary? Most of the results I could find were very confusing, and we have not yet covered dictionaries at university.
My code so far:
// Reading and making all the words lowercase for comparisons
string punctuation = " ,.?!;:\"\r\n";
string Read1 = File.ReadAllText(@"..\Book1.txt");
Read1 = Read1.ToLower();
string Read2 = File.ReadAllText(@"..\Book2.txt");
Read2 = Read2.ToLower();
//Working with the 1st file
string[] FirstFileWords = Read1.Split(punctuation.ToCharArray());
var temp1 = new List<string>();
foreach (var word in FirstFileWords)
{
if (!string.IsNullOrEmpty(word))
temp1.Add(word);
}
FirstFileWords = temp1.ToArray();
Array.Sort(FirstFileWords, (x, y) => y.Length.CompareTo(x.Length));
//Working with the 2nd file
string[] SecondFileWords = Read2.Split(punctuation.ToCharArray());
var temp2 = new List<string>();
foreach (var word in SecondFileWords)
{
if (!string.IsNullOrEmpty(word))
temp2.Add(word);
}
SecondFileWords = temp2.ToArray();
Well I think you've done very well so far. Not being able to use Linq here is torture ;)
As for performance, you should consider making your SecondFileWords a HashSet<string>, as this would tremendously speed up checking whether a word exists in the 2nd file, without much effort. I wouldn't go much further in terms of performance optimization for an exercise like this if performance is not a key requirement.
Of course, you would have to check that you don't add duplicates to your 2nd list, so change your current implementation to something like:
HashSet<string> temp2 = new HashSet<string>();
foreach (var word in SecondFileWords)
{
if (!string.IsNullOrEmpty(word) && !temp2.Contains(word))
{
temp2.Add(word);
}
}
Don't convert this back to an Array again, this is not necessary.
This brings me back to your FirstFileWords, which would contain duplicates too. This will cause issues later on, as the top words might contain the same word multiple times. So let's get rid of them. Here it's more complicated, as you need to retain the information about how often a word appeared in your first list.
So let's bring a Dictionary<string, int> into play now. A Dictionary stores a lookup key, like the HashSet, but in addition also a value. We will use the key for the word, and the value for a number that counts how often the word appeared in the first list.
Dictionary<string, int> temp1 = new Dictionary<string, int>();
foreach (var word in FirstFileWords)
{
if (string.IsNullOrEmpty(word))
{
continue;
}
if (temp1.ContainsKey(word))
{
temp1[word]++;
}
else
{
temp1.Add(word, 1);
}
}
Now a dictionary cannot be sorted, which complicates things at this point as you still need to get your sorting by word length done. So let's get back to your Array.Sort method which I think is a good choice when you are not allowed to use Linq:
KeyValuePair<string, int>[] firstFileWordsWithCount = temp1.ToArray();
Array.Sort(firstFileWordsWithCount, (x, y) => y.Key.Length.CompareTo(x.Key.Length));
Note: You are using .ToArray() in your example, so I think it's OK to use it. But strictly speaking, this would also fall under using Linq IMHO.
Now all that's left is working through your firstFileWordsWithCount array until you got 10 words that do not exist in the HashSet temp2. Something like:
int foundWords = 0;
foreach(KeyValuePair<string, int> candidate in firstFileWordsWithCount)
{
if (!temp2.Contains(candidate.Key))
{
Console.WriteLine($"{candidate.Key}: {candidate.Value}");
foundWords++;
}
if (foundWords >= 10)
{
break;
}
}
If anything is unclear, just ask.
This is what you'll get when using dictionaries:
string File1 = "AMD Intel Skylake Processors Graphics Cards Nvidia Architecture Microprocessor Skylake SandyBridge KabyLake";
string File2 = "Graphics Nvidia";
Dictionary<string, int> Dic = new Dictionary<string, int>();
string[] File1Array = File1.Split(" ".ToCharArray(), StringSplitOptions.RemoveEmptyEntries);
Array.Sort(File1Array, (s1, s2) => s2.Length.CompareTo(s1.Length));
foreach (string s in File1Array)
{
if (Dic.ContainsKey(s))
{
Dic[s]++;
}
else
{
Dic.Add(s, 1);
}
}
string[] File2Array = File2.Split(" ".ToCharArray(), StringSplitOptions.RemoveEmptyEntries);
foreach (string s in File2Array)
{
if (Dic.ContainsKey(s))
{
Dic.Remove(s);
}
}
int i = 0;
foreach (KeyValuePair<string, int> kvp in Dic)
{
i++;
Console.WriteLine(kvp.Key + " " + kvp.Value);
if (i == 10)
{
break;
}
}
My earlier attempt used LINQ, which is apparently not allowed, but I missed that.
string[] Results = File1.Split(" ".ToCharArray()).Except(File2.Split(" ".ToCharArray())).OrderByDescending(s => s.Length).Take(10).ToArray();
for (int i = 0; i < Results.Length; i++)
{
Console.WriteLine(Results[i] + " " + Regex.Matches(File1, Results[i]).Count);
}
Here is what I have done:
// this is an example of my function, the string and the remover should be variables
string delimeter = ",";
string remover="4";
string[] separator = new string[] { "," };
List<String> List = "1,2,3,4,5,6".Split(separator, StringSplitOptions.None).ToList();
for (int i = 0; i < List.Count - 1; i++)
{
if(List[i]==remover)
List.RemoveAt(i);
}
string allStrings = (List.Aggregate((i, j) => i + delimeter + j));
return allStrings;
The problem is that the returned string is the same as the original one, "1,2,3,4,5,6" - the "4" is still in it.
How to fix it?
EDIT:
the solution was that I didn't check the last element of the list in that for loop; it doesn't look that way in the example because it was an example I wrote just now
When you remove items from a list like this you should make your for loop run in reverse, from highest index to lowest.
If you go from lowest to highest you will end up shifting items down as they are removed and this will skip items. Running it in reverse does not have this issue.
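For example, a sketch using the question's List and remover variables:
// iterate from the end so removals don't shift unchecked items;
// this also covers the last element, which the original loop missed
for (int i = List.Count - 1; i >= 0; i--)
{
    if (List[i] == remover)
        List.RemoveAt(i);
}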
Your code, as it stands, produces the expected output. When run, it will return 1,2,3,5,6. If it doesn't, it's due to a bug in how you call this method.
That's not to say that you don't have problems.
When you remove an item you still increment the current index, so you skip checking the item after any item you remove.
While there are a number of solutions, the best solution here is to use the RemoveAll method of List. Not only does it ensure that all items are evaluated, but it can do so way more efficiently. Removing an item from a list means shifting all of the items over by one. RemoveAll can do all of that shifting at the end, which is way more efficient if a lot of items are removed.
Another bug that you have is that your for loop doesn't check the last item at all, ever.
On a side note, you shouldn't use Aggregate to join a bunch of strings together given a delimiter. It's extremely inefficient as you need to copy all of the data from the first item into an intermediate string when adding the second, then both of those to a new string when adding the third, then all three of those to a new string when creating a fourth, and so on. Instead you should use string.Join(delimeter, List);, which is not only way more efficient, but is way easier to write and semantically represents exactly what you're trying to do. Win win win.
We can now re-write the method as:
string delimeter = ",";
string remover = "4";
List<String> List = "1,2,3,4,5,6"
.Split(new[] { delimeter }, StringSplitOptions.None).ToList();
List.RemoveAll(n => n == remover);
return string.Join(delimeter, List);
Another option is to avoid creating a list just to remove items from it and then aggregate the data again. We can instead take the sequence of items that we have, pull out only the items that we want to keep rather than removing the items we don't want, and then aggregate those. This is functionally the same, but removes the needless effort of building up a list and removing items, separating the mechanism from the requirements:
string delimeter = ",";
string remover = "4";
var items = "1,2,3,4,5,6"
.Split(new[] { delimeter }, StringSplitOptions.None)
.Where(n => n != remover);
return string.Join(delimeter, items);
Use this to remove the items:
list.RemoveAll(f => f==remover);
Assuming I do not want to use external libraries or more than a dozen or so extra lines of code (i.e. clear code, not code golf code), can I do better than string.Contains to handle a collection of input strings and a collection of keywords to check for?
Obviously one can use objString.Contains(objString2) to do a simple substring check. However, there are many well-known algorithms which are able to do better than this under special circumstances, particularly if one is working with multiple strings. But sticking such an algorithm into my code would probably add length and complexity, so I'd rather use some sort of shortcut based on a built in function.
E.g. an input would be a collection of strings, a collection of positive keywords, and a collection of negative keywords. Output would be the subset of the first collection of strings that contain at least 1 positive keyword but 0 negative keywords.
Oh, and please don't mention regular expressions as a suggested solution.
It may be that my requirements are mutually exclusive (not much extra code, no external libraries or regex, better than String.Contains), but I thought I'd ask.
Edit:
A lot of people are only offering silly improvements that won't beat an intelligently used call to contains by much, if anything. Some people are trying to call Contains more intelligently, which completely misses the point of my question. So here's an example of a problem to try solving. LBushkin's solution is an example of someone offering a solution that probably is asymptotically better than standard contains:
Suppose you have 10,000 positive keywords of length 5-15 characters, 0 negative keywords (this seems to confuse people), and one 1,000,000-character string. Check if the 1,000,000-character string contains at least 1 of the positive keywords.
I suppose one solution is to create an FSA. Another is to delimit on spaces and use hashes.
Your discussion of "negative and positive" keywords is somewhat confusing - and could use some clarification to get more complete answers.
As with all performance related questions - you should first write the simple version and then profile it to determine where the bottlenecks are - these can be unintuitive and hard to predict. Having said that...
One way to optimize the search (if you are always searching for "words" - and not phrases that could contain spaces) would be to build a search index from your string.
The search index could either be a sorted array (for binary search) or a dictionary. A dictionary would likely prove faster - both because dictionaries are hashmaps internally with O(1) lookup, and because a dictionary will naturally eliminate duplicate values in the search source - thereby reducing the number of comparisons you need to perform.
The general search algorithm is:
For each string you are searching against:
Take the string you are searching within and tokenize it into individual words (delimited by whitespace)
Populate the tokens into a search index (either a sorted array or dictionary)
Search the index for your "negative keywords", if one is found, skip to the next search string
Search the index for your "positive keywords"; when one is found, add it to a dictionary (you could also track a count of how often the word appears)
Here's an example using a sorted array and binary search in C# 2.0:
NOTE: You could switch from string[] to List<string> easily enough, I leave that to you.
string[] FindKeyWordOccurence( string[] stringsToSearch,
                               string[] positiveKeywords,
                               string[] negativeKeywords )
{
    Dictionary<string,int> foundKeywords = new Dictionary<string,int>();
    foreach( string searchIn in stringsToSearch )
    {
        // tokenize and sort the input to make searches faster
        string[] tokenizedList = searchIn.Split( ' ' );
        Array.Sort( tokenizedList );
        // if any negative keywords exist, skip to the next search string...
        // (a plain 'continue' inside the inner loop would only continue that
        // inner loop, so we use a flag)
        bool hasNegative = false;
        foreach( string negKeyword in negativeKeywords )
        {
            if( Array.BinarySearch( tokenizedList, negKeyword ) >= 0 )
            {
                hasNegative = true;
                break;
            }
        }
        if( hasNegative )
            continue; // skip to next search string...
        // for each positive keyword, add to dictionary to keep track of it
        // we could have also used a SortedList, but the dictionary is easier
        foreach( string posKeyword in positiveKeywords )
            if( Array.BinarySearch( tokenizedList, posKeyword ) >= 0 )
                foundKeywords[posKeyword] = 1;
    }
    // convert the Keys in the dictionary (our found keywords) to an array...
    string[] foundKeywordsArray = new string[foundKeywords.Keys.Count];
    foundKeywords.Keys.CopyTo( foundKeywordsArray, 0 );
    return foundKeywordsArray;
}
Here's a version that uses a dictionary-based index and LINQ in C# 3.0:
NOTE: This is not the most LINQ-y way to do it, I could use Union() and SelectMany() to write the entire algorithm as a single big LINQ statement - but I find this to be easier to understand.
public IEnumerable<string> FindOccurences( IEnumerable<string> searchStrings,
IEnumerable<string> positiveKeywords,
IEnumerable<string> negativeKeywords )
{
var foundKeywordsDict = new Dictionary<string, int>();
foreach( var searchIn in searchStrings )
{
// tokenize the search string...
var tokenizedDictionary = searchIn.Split( ' ' ).Distinct().ToDictionary( x => x ); // Distinct() avoids duplicate-key exceptions
// skip if any negative keywords exist...
if( negativeKeywords.Any( tokenizedDictionary.ContainsKey ) )
continue;
// merge found positive keywords into dictionary...
// an example of where Enumerable.ForEach() would be nice...
var found = positiveKeywords.Where(tokenizedDictionary.ContainsKey);
foreach (var keyword in found)
foundKeywordsDict[keyword] = 1;
}
return foundKeywordsDict.Keys;
}
If you add this extension method:
public static bool ContainsAny(this string testString, IEnumerable<string> keywords)
{
foreach (var keyword in keywords)
{
if (testString.Contains(keyword))
return true;
}
return false;
}
Then this becomes a one line statement:
var results = testStrings.Where(t => !t.ContainsAny(badKeywordCollection)).Where(t => t.ContainsAny(goodKeywordCollection));
This isn't necessarily any faster than doing the Contains checks by hand, except that it will do them efficiently: LINQ's streaming of results prevents any unnecessary Contains calls... Plus, the resulting code being a one-liner is nice.
If you're truly just looking for space-delimited words, this code would be a very simple implementation:
static void Main(string[] args)
{
string sIn = "This is a string that isn't nearly as long as it should be " +
"but should still serve to prove an algorithm";
string[] sFor = { "string", "as", "not" };
Console.WriteLine(string.Join(", ", FindAny(sIn, sFor)));
}
private static string[] FindAny(string searchIn, string[] searchFor)
{
HashSet<String> hsIn = new HashSet<string>(searchIn.Split());
HashSet<String> hsFor = new HashSet<string>(searchFor);
return hsIn.Intersect(hsFor).ToArray();
}
If you only wanted a yes/no answer (as I see now may have been the case), HashSet has another method, Overlaps, that's probably better optimized for that:
private static bool FindAny(string searchIn, string[] searchFor)
{
HashSet<String> hsIn = new HashSet<string>(searchIn.Split());
HashSet<String> hsFor = new HashSet<string>(searchFor);
return hsIn.Overlaps(hsFor);
}
Well, there is the Split() method you can call on a string. You could split your input strings into arrays of words using Split() then do a one-to-one check of words with keywords. I have no idea if or under what circumstances this would be faster than using Contains(), however.
First get rid of all the strings that contain negative words. I would suggest doing this using the Contains method. I would think that Contains() is faster than splitting, sorting, and searching.
Seems to me that the best way to do this is to take your match strings (both positive and negative) and compute a hash of them. Then march through your million-character string computing n hashes (in your case it's 10, for strings of length 5-15) and match against the hashes for your match strings. If you get hash matches, then you do an actual string compare to rule out the false positive. There are a number of good ways to optimize this by bucketing your match strings by length and creating hashes based on the string size for a particular bucket.
So you get something like:
IList<Bucket> buckets = BuildBuckets(matchStrings);
int shortestLength = buckets[0].Length;
for (int i = 0; i <= inputString.Length - shortestLength; i++) {
foreach (Bucket b in buckets) {
if (i + b.Length > inputString.Length)
continue;
string candidate = inputString.Substring(i, b.Length);
int hash = ComputeHash(candidate);
foreach (MatchString match in b.MatchStrings) {
if (hash != match.Hash)
continue;
if (candidate == match.String) {
if (match.IsPositive) {
// positive case
}
else {
// negative case
}
}
}
}
}
To optimize Contains(), you need a tree (or trie) structure of your positive/negative words.
That should speed up everything (O(n) vs O(nm), n=size of string, m=avg word size) and the code is relatively small & easy.
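For illustration, here is a minimal trie sketch (KeywordTrie and its members are made-up names, not from the answer above). Strictly speaking, a plain trie scanned from every start position costs O(n*L), where L is the longest keyword; reaching a true O(n) bound requires Aho-Corasick-style failure links on top of this:
// each node maps the next character to a child; IsWord marks a complete keyword
class TrieNode
{
    public Dictionary<char, TrieNode> Children = new Dictionary<char, TrieNode>();
    public bool IsWord;
}
class KeywordTrie
{
    private readonly TrieNode _root = new TrieNode();
    public void Add(string word)
    {
        TrieNode node = _root;
        foreach (char c in word)
        {
            TrieNode next;
            if (!node.Children.TryGetValue(c, out next))
                node.Children[c] = next = new TrieNode();
            node = next;
        }
        node.IsWord = true;
    }
    // true if any added keyword occurs as a substring of 'text'
    public bool ContainsAny(string text)
    {
        for (int start = 0; start < text.Length; start++)
        {
            TrieNode node = _root;
            for (int i = start; i < text.Length; i++)
            {
                TrieNode next;
                if (!node.Children.TryGetValue(text[i], out next))
                    break;
                node = next;
                if (node.IsWord)
                    return true; // a keyword ends at position i
            }
        }
        return false;
    }
}
You would build one trie for the positive keywords and one for the negative keywords, then test each input string with ContainsAny.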
I have a large collection of strings (up to 1M) alphabetically sorted. I have experimented with LINQ queries against this collection using HashSet, SortedDictionary, and Dictionary. I am static caching the collection, it's up to 50MB in size, and I'm always calling the LINQ query against the cached collection. My problem is as follows:
Regardless of collection type, performance is much poorer than SQL (up to 200 ms). When doing a similar query against the underlying SQL tables, performance is much quicker (5-10 ms). I have implemented my LINQ queries as follows:
public static string ReturnSomething(string query, int limit)
{
StringBuilder sb = new StringBuilder();
foreach (var stringitem in MyCollection.Where(
x => x.StartsWith(query) && x.Length > query.Length).Take(limit))
{
sb.Append(stringitem);
}
return sb.ToString();
}
It is my understanding that the HashSet, Dictionary, etc. implement lookups using binary tree search instead of the standard enumeration. What are my options for high performance LINQ queries into the advanced collection types?
In your current code you don't make use of any of the special features of the Dictionary / SortedDictionary / HashSet collections, you are using them the same way that you would use a List. That is why you don't see any difference in performance.
If you use a dictionary as index where the first few characters of the string is the key and a list of strings is the value, you can from the search string pick out a small part of the entire collection of strings that has possible matches.
I wrote the class below to test this. If I populate it with a million strings and search with an eight character string it rips through all possible matches in about 3 ms. Searching with a one character string is the worst case, but it finds the first 1000 matches in about 4 ms. Finding all matches for a one character strings takes about 25 ms.
The class creates indexes for 1, 2, 4 and 8 character keys. If you look at your specific data and what you search for, you should be able to select what indexes to create to optimise it for your conditions.
public class IndexedList {
private class Index : Dictionary<string, List<string>> {
private int _indexLength;
public Index(int indexLength) {
_indexLength = indexLength;
}
public void Add(string value) {
if (value.Length >= _indexLength) {
string key = value.Substring(0, _indexLength);
List<string> list;
if (!this.TryGetValue(key, out list)) {
Add(key, list = new List<string>());
}
list.Add(value);
}
}
public IEnumerable<string> Find(string query, int limit) {
List<string> list;
// TryGetValue avoids a KeyNotFoundException when no string has this prefix
if (!this.TryGetValue(query.Substring(0, _indexLength), out list)) {
return Enumerable.Empty<string>();
}
return list
.Where(s => s.Length > query.Length && s.StartsWith(query))
.Take(limit);
}
}
private Index _index1;
private Index _index2;
private Index _index4;
private Index _index8;
public IndexedList(IEnumerable<string> values) {
_index1 = new Index(1);
_index2 = new Index(2);
_index4 = new Index(4);
_index8 = new Index(8);
foreach (string value in values) {
_index1.Add(value);
_index2.Add(value);
_index4.Add(value);
_index8.Add(value);
}
}
public IEnumerable<string> Find(string query, int limit) {
if (query.Length >= 8) return _index8.Find(query, limit);
if (query.Length >= 4) return _index4.Find(query,limit);
if (query.Length >= 2) return _index2.Find(query,limit);
return _index1.Find(query, limit);
}
}
I bet you have an index on the column, so SQL Server can do the comparison in O(log(n)) operations rather than O(n). To imitate the SQL Server behavior, use a sorted collection, find all strings s such that s >= query, and then look at values until you find one that does not start with query, and then do an additional filter on the values. This is what is called a range scan (Oracle) or an index seek (SQL Server).
This is some example code which is very likely to go into infinite loops or have off-by-one errors because I didn't test it, but you should get the idea.
// Note, list must be sorted before being passed to this function
IEnumerable<string> FindStringsThatStartWith(List<string> list, string query) {
    int low = 0, high = list.Count - 1;
    while (high > low) {
        int mid = (low + high) / 2;
        if (string.CompareOrdinal(list[mid], query) < 0)
            low = mid + 1;
        else
            high = mid - 1;
    }
    while (low < list.Count && list[low].StartsWith(query)) {
        if (list[low].Length > query.Length)
            yield return list[low];
        low++;
    }
}
If you're doing a "starts with", you only care about ordinal comparisons, and if you can have the collection sorted (again in ordinal order), then I would suggest you keep the values in a list. You can then binary search to find the first value which starts with the right prefix, then go down the list linearly yielding results until the first value which doesn't start with the right prefix.
In fact, you could probably do another binary search for the first value which doesn't start with the prefix, so you'd have a start and an end point. Then you just need to apply the length criterion to that matching portion. (I'd hope that if it's sensible data, the prefix matching is going to get rid of most candidate values.) The way to find the first value which doesn't start with the prefix is to search for the lexicographically-first value which doesn't - e.g. with a prefix of "ABC", search for "ABD".
None of this uses LINQ, and it's all very specific to your particular case, but it should work. Let me know if any of this doesn't make sense.
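A sketch of that two-binary-search idea (assuming a non-empty prefix, a list sorted with StringComparer.Ordinal, and ignoring the corner case of a prefix ending in char.MaxValue; PrefixRange is a made-up name):
// two binary searches bound the range of entries sharing the prefix,
// then the length criterion is applied to just that slice
static IEnumerable<string> PrefixRange(List<string> list, string prefix)
{
    // BinarySearch returns the bitwise complement of the insertion point when not found
    int lo = list.BinarySearch(prefix, StringComparer.Ordinal);
    if (lo < 0) lo = ~lo;
    // lexicographically-first value that can't share the prefix: "ABC" -> "ABD"
    string upper = prefix.Substring(0, prefix.Length - 1)
                 + (char)(prefix[prefix.Length - 1] + 1);
    int hi = list.BinarySearch(upper, StringComparer.Ordinal);
    if (hi < 0) hi = ~hi;
    for (int i = lo; i < hi; i++)
        if (list[i].Length > prefix.Length) // the extra length criterion
            yield return list[i];
}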
If you are trying to optimize looking up a list of strings with a given prefix, you might want to take a look at implementing a Trie (not to be confused with a regular tree) data structure in C#.
Tries offer very fast prefix lookups and have a very small memory overhead compared to other data structures for this sort of operation.
About LINQ to Objects in general: it's not unusual to see a speed reduction compared to SQL. The net is littered with articles analyzing its performance.
Just looking at your code, I would say that you should reorder the comparison to take advantage of short-circuiting when using boolean operators:
foreach (var stringitem in MyCollection.Where(
x => x.Length > query.Length && x.StartsWith(query)).Take(limit))
The comparison of length is always going to be an O(1) operation (as the length is being stored as part of the string, it doesn't count each character every time), whereas the call to StartsWith is going to be an O(N) operation, where N is the length of query (or the length of the string, whichever is smaller).
By placing the comparison of length before the call to StartsWith, if that comparison fails, you save yourself some extra cycles which could add up when processing large numbers of items.
I don't think that a lookup table is going to help you here, as lookup tables are good when you are comparing the entire key, not parts of the key, like you are doing with the call to StartsWith.
Rather, you might be better off using a tree structure which is split based on the letters in the words in the list.
However, at that point, you are really just recreating what SQL Server is doing (in the case of indexes) and that would just be a duplication of effort on your part.
I think the problem is that LINQ has no way to use the fact that your sequence is already sorted. In particular, it cannot know that filtering with StartsWith retains the order.
I would suggest using the List.BinarySearch method together with an IComparer<string> that compares only the first query.Length characters (this might be tricky, since it's not clear whether the query string will always be the first or the second parameter to Compare()).
You could even use the standard string comparison, since BinarySearch returns a negative number which you can complement (using ~) in order to get the index of the first element that is larger than your query.
You then have to scan from the returned index (in both directions!) to find all elements matching your query string.
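For illustration, a sketch of such a comparer (PrefixComparer is a made-up name, and sortedList stands for your cached List<string>, assumed sorted ordinally):
class PrefixComparer : IComparer<string>
{
    private readonly int _length;
    public PrefixComparer(int length) { _length = length; }
    // compare at most _length characters from each side, so it works
    // regardless of which parameter BinarySearch passes the query in
    public int Compare(string x, string y)
    {
        return string.CompareOrdinal(
            x.Substring(0, Math.Min(_length, x.Length)),
            y.Substring(0, Math.Min(_length, y.Length)));
    }
}
// BinarySearch lands on *some* element sharing the prefix (not necessarily
// the first), hence the scan in both directions mentioned above
int index = sortedList.BinarySearch(query, new PrefixComparer(query.Length));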