Imagine I have a list of several-hundred unique names, e.g.
["john", "maria", "joseph", "richard", "samantha", "isaac", ...]
What's the best way I can store these to provide a fast lookup-time by matching against a pattern?
I only need to match "masks", can't think of a better word for it.
Basically, I get in letters and their positions, ____a__ (where _ represents an unknown letter.) Then I need to find all values in the data structure that match that mask, e.g. in this case it would return "richard", but it should also be possible to get multiple "returned" values.
Seems like a lot of work for "hundreds" of names. Doing a linear search on a list of hundreds of names will be very fast. Now, if you're talking hundreds of thousands or millions ...
In any case, you can speed this up using a dictionary. You can pre-process the data into a dictionary whose keys are a combination of character and position, and values are the words that contain that character at that position. For example, if you were to index "john" and "joseph", you would have:
{'j',0},{"john","joseph"}
{'o',1},{"john","joseph"}
{'h',2},{"john"}
{'n',3},{"john"}
{'s',2},{"joseph"}
{'e',3},{"joseph"}
{'p',4},{"joseph"}
{'h',5},{"joseph"}
Now let's say you're given the mask "jo...." (the dots are "don't care"). You'd turn that into two keys:
{'j',0}
{'o',1}
You query the dictionary for the list of words that has 'j' at index 0. Then you query the dictionary for the list of words that has 'o' at index 1. Then you intersect the lists to get your result.
It's a simple inverted index, but on character rather than on word.
The lists themselves will cost you a total of O(m * n) space, where m is the total number of words and n is the average word length in characters. At maximum, the number of dictionary entries will be 26*max_word_length. In practice, it will probably be much less.
If you make the values a HashSet<string> rather than List<string>, intersection will go much faster. It'll increase your memory footprint, though.
That should be faster than linear search if your masks contain only a few characters. The longer the mask, the more lists you'll have to intersect.
For the dictionary key, I'd recommend:
public struct Key
{
    public char KeyChar;
    public int Pos;

    public override int GetHashCode()
    {
        // Low 16 bits hold the character, the higher bits hold the position.
        return (int)KeyChar + (Pos << 16);
    }

    public override bool Equals(object obj)
    {
        if (!(obj is Key)) return false;
        var other = (Key)obj;
        return KeyChar == other.KeyChar && Pos == other.Pos;
    }
}
So your dictionary would be Dictionary<Key, HashSet<string>>.
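As a rough illustration of the inverted index described above, here is a minimal sketch. It swaps the Key struct for a value tuple (char, int), and it assumes that a mask only matches words of the same length, which is how the question's ____a__ example reads. The class name MaskIndex and the '_' wildcard parameter are made up for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public sealed class MaskIndex
{
    private readonly List<string> _words;

    // Maps (character, position) to the set of words with that character at that position.
    private readonly Dictionary<(char, int), HashSet<string>> _index =
        new Dictionary<(char, int), HashSet<string>>();

    public MaskIndex(IEnumerable<string> words)
    {
        _words = words.ToList();
        foreach (string word in _words)
        {
            for (int i = 0; i < word.Length; i++)
            {
                var key = (word[i], i);
                if (!_index.TryGetValue(key, out HashSet<string> bucket))
                {
                    bucket = new HashSet<string>();
                    _index[key] = bucket;
                }
                bucket.Add(word);
            }
        }
    }

    public List<string> Match(string mask, char wildcard = '_')
    {
        IEnumerable<string> candidates = null;
        for (int i = 0; i < mask.Length; i++)
        {
            if (mask[i] == wildcard) continue;

            // No word has this letter at this position: nothing can match.
            if (!_index.TryGetValue((mask[i], i), out HashSet<string> bucket))
                return new List<string>();

            candidates = candidates == null ? bucket : candidates.Intersect(bucket);
        }

        // Assumption: the mask length must equal the word length.
        // A mask with no known letters falls back to scanning all words.
        return (candidates ?? _words).Where(w => w.Length == mask.Length).ToList();
    }
}
```

With the names from the question, new MaskIndex(names).Match("____a__") would return only "richard", because "maria" also has an 'a' at index 4 but is filtered out by length.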
If the longest word has m letters, then you can keep m lists l[1], ..., l[m] such that the words in each list l[i] are sorted lexicographically starting from the i-th letter in every word (shorter words will not appear in that list). Then, if your query is ...ac., just perform a binary search in list l[4].
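With n words, this will cost you O(mn) in memory and takes O(m n log n) time to build, but will give you O(log n) query time when the known letters of the mask are contiguous. If the mask has gaps between its known letters, the words found by the binary search still have to be filtered against the remaining letters.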
EDIT
Good news, I have recently stumbled upon range trees, which would allow you to perform this kind of query somewhat efficiently, namely in O(log^m(n) + k) time, where k is the number of results, while requiring O(n log^(m-1)(n)) space.
They are not straightforward to implement, in the sense that you need to build a binary search tree sorting the words by the first letter, then build a binary search tree for every internal node which stores the words in the subtree of that node sorted by the second letter, and so on.
On the other hand, this would allow you to perform a wider range of queries, namely you can look for contiguous intervals of letters, i.e. a pattern like ..[a-c].[b-f].
I have the text of a Word document and an array of strings. The goal is to find all occurrences of those strings in the document's text. I tried to use Aho-Corasick string matching in C# implementation of the Aho-Corasick algorithm, but the default implementation doesn't fit my case.
The typical part of the text looks like
“Activation” means a written notice from Lender to the Bank substantially in the form of Exhibit A.
“Activation Notice” means a written notice from Lender to the Bank substantially in the form of Exhibit A and Activation.
“Business Day" means each day (except Saturdays and Sundays) on which banks are open for general business and Activation Notice.
The array of the keywords looks like
var keywords = new[] {"Activation", "Activation Notice"};
The default implementation of the Aho-Corasick algorithm returns the following count of the occurrences
Activation - 4
Activation Notice - 2
For 'Activation Notice' that is the correct result. But for 'Activation' the correct count should also be 2,
because I do not need to count occurrences inside the longer adjacent keyword 'Activation Notice'.
Is there a proper algorithm for this case?
I will assume you got your results according to the example you linked.
StringSearchResult[] results = searchAlg.FindAll(textToSearch);
With those results, if you assume that the only overlaps are subsets, you can sort by index and collect your desired results in a single pass.
public class SearchResultComparer : IComparer<StringSearchResult> {
    public int Compare(StringSearchResult x, StringSearchResult y)
    {
        // Try ordering by the start index.
        int compare = x.Index.CompareTo(y.Index);
        if (compare == 0)
        {
            // In case of ties, reverse order by keyword length.
            compare = y.Keyword.Length.CompareTo(x.Keyword.Length);
        }
        return compare;
    }
}
// ...
IComparer<StringSearchResult> searchResultComparer = new SearchResultComparer();
Array.Sort(results, searchResultComparer);
int activeEndIndex = -1;
List<StringSearchResult> nonOverlappingResults = new List<StringSearchResult>();
foreach(StringSearchResult r in results)
{
if (r.Index < activeEndIndex)
{
// This range starts before the active range ends.
// Since it's an overlap, skip it.
continue;
}
// Save this result, track when it ends.
nonOverlappingResults.Add(r);
activeEndIndex = r.Index + r.Keyword.Length;
}
Because of the index sorting, the loop guarantees that only non-overlapping ranges are kept. Some ranges will be rejected, and that can only happen for two reasons.
The candidate starts at the same index as the active range. Since the sorting breaks these ties so that the longest goes first, the candidate must be shorter than the active range and can be skipped.
The candidate starts after the active range starts but before it ends. Since the only overlaps are subsets, and this candidate overlaps with the active range, it is a subset that starts later but still ends at or before the end of the active range.
Therefore the only rejected candidates are subsets, which must end at or before the end of the active range. So the active range remains the only thing to worry about overlapping with.
I'm looking for the fastest way to find all strings in a collection that start with a given sequence of characters. I can use a sorted collection for this, but I can't find a convenient way to do it in .NET. Basically I need to find the low and high indexes in the collection that meet the criteria.
BinarySearch on List<T> does not guarantee the returned index is that of the 1st element, so one would need to iterate up and down to find all matching strings which is not fast if one has a large list.
There are also Linq methods (with parallel), but I'm not sure which data structure will provide the best results.
List example, ~10M of records:
aaaaaaaaaaaaaaabb
aaaaaaaaaaaaaaba
aaaaaaaaaaaaabc
...
zzzzzzzzzzzzzxx
zzzzzzzzzzzzzyzzz
zzzzzzzzzzzzzzzzzza
Search for strings starting from: skk...
Result: record indexes from x to y.
UPDATE: strings can have different lengths and are unique.
In terms of time complexity - you should use a trie, and not a sorted set or binary search.
Trie will get you a O(|S|) time complexity [while sorted set and binary search gets you O(|S|logn)] to find the node [let it be v] that represents that prefix.
All the strings [paths] in the trie that fit the prefix will "pass" via v. By adding numberOfLeaves field to each node, you can find out exactly how much leaves [=strings] this node has.
In a single pass - you can also find the index of this v [For each node u in the path from the root to v - sum numberOfLeaves for each sibling which is left to u].
This requires much more work than using already existing structures, but if the data is huge it can make your algorithm much faster, so you should consider it if performance is an issue and you expect a huge set of strings.
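A minimal sketch of that idea follows. It counts stored words per node rather than leaves, since strings of different lengths mean a word can end at an internal node; that is an adaptation, not part of the answer above. It also assumes unique strings (as the question states) and ordinal character order. The names CountingTrie and PrefixRange are invented for this example.

```csharp
using System.Collections.Generic;

public sealed class CountingTrie
{
    private sealed class Node
    {
        // Sorted children so that "siblings to the left" is well defined.
        public readonly SortedDictionary<char, Node> Children = new SortedDictionary<char, Node>();
        public int WordCount; // number of stored words whose path passes through this node
        public bool IsWord;   // true if a stored word ends exactly here
    }

    private readonly Node _root = new Node();

    // Assumes each word is added only once.
    public void Add(string word)
    {
        Node node = _root;
        node.WordCount++;
        foreach (char c in word)
        {
            if (!node.Children.TryGetValue(c, out Node next))
            {
                next = new Node();
                node.Children[c] = next;
            }
            node = next;
            node.WordCount++;
        }
        node.IsWord = true;
    }

    // Returns the index (in sorted order) of the first word with the prefix,
    // and how many words share the prefix.
    public (int Start, int Count) PrefixRange(string prefix)
    {
        Node node = _root;
        int start = 0;
        foreach (char c in prefix)
        {
            // A word ending at this node sorts before all of its extensions.
            if (node.IsWord) start++;

            // Add the word counts of every sibling to the left of c.
            foreach (KeyValuePair<char, Node> child in node.Children)
            {
                if (child.Key >= c) break;
                start += child.Value.WordCount;
            }

            if (!node.Children.TryGetValue(c, out Node next)) return (start, 0);
            node = next;
        }
        return (start, node.WordCount);
    }
}
```

The walk costs time proportional to the prefix length times the branching factor at each node, independent of how many strings are stored.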
You can do it with a hand-written binary search - one which just doesn't stop when it's found a match; it continues until it's found a single index.
In fact, you don't even have to write the binary search bit yourself - you could create a custom comparer which never returns 0, i.e. if you're looking for "abc" then it treats "abb" as being below the target value, but "abc" as being above the target value. This way the BinarySearch will always return a negative number, which you can then just bit-flip to find the theoretical insertion point for "the string which comes between abb and abc".
You can do the same in reverse (treat "abc" as lower than the target value) to find the highest bound.
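The hand-written variant could look roughly like the sketch below: two binary searches that never stop at a match, one for each bound. It assumes the list is sorted with ordinal comparison (for example with StringComparer.Ordinal), and the names SortedPrefixSearch, FirstIndexWhere and Find are hypothetical.

```csharp
using System;
using System.Collections.Generic;

public static class SortedPrefixSearch
{
    // Returns the first index for which the predicate is true, or Count if none.
    // The predicate must be false for a leading run of elements and true for the rest.
    private static int FirstIndexWhere(IReadOnlyList<string> sorted, Func<string, bool> predicate)
    {
        int lo = 0, hi = sorted.Count;
        while (lo < hi)
        {
            int mid = lo + (hi - lo) / 2;
            if (predicate(sorted[mid])) hi = mid;
            else lo = mid + 1;
        }
        return lo;
    }

    // Returns the half-open index range [Start, End) of strings starting with prefix.
    public static (int Start, int End) Find(IReadOnlyList<string> sorted, string prefix)
    {
        // Lower bound: first string that is not below the prefix.
        int start = FirstIndexWhere(sorted, s => string.CompareOrdinal(s, prefix) >= 0);

        // Upper bound: first string that sorts after the prefix without starting with it.
        int end = FirstIndexWhere(sorted, s =>
            string.CompareOrdinal(s, prefix) > 0 &&
            !s.StartsWith(prefix, StringComparison.Ordinal));

        return (start, end);
    }
}
```

Both searches take O(log n) comparisons, so no scanning up and down from an arbitrary match is needed.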
If you know the format of these strings and it won't have edge cases like Unicode NULL characters, and everything's the same length, you can even do it without writing your own comparer:
// This could be done more efficiently :)
// Decrement the last character of the target, then append "X".
string stringJustBelow = target.Substring(0, target.Length - 1) +
    (char)(target[target.Length - 1] - 1) + "X";
string stringJustAbove = target + "X"; // Or any character
int lowerBoundInclusive = ~list.BinarySearch(stringJustBelow);
int upperBoundExclusive = ~list.BinarySearch(stringJustAbove);
So if your strings are all length 3 and you were searching for "abc" you'd actually look for where "abbX" and "abcX" would be inserted.
Put them in SortedSet and use GetViewBetween.
This answer illustrates searching for both prefix and suffix, I'm sure you'll have no trouble adapting it to prefix-only search, if that is indeed what you want.
If you just want to search for a range (not prefix), directly using GetViewBetween should suffice.
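For the prefix case, one possible adaptation is sketched below. It assumes the set was created with StringComparer.Ordinal and that no stored string contains the character '\uffff', which is used here to build an upper bound; the helper name WithPrefix is made up for this example.

```csharp
using System;
using System.Collections.Generic;

public static class SortedSetPrefix
{
    // Returns a view of all strings in the set that start with the given prefix.
    // Assumption: the set uses ordinal comparison and no string contains '\uffff'.
    public static SortedSet<string> WithPrefix(SortedSet<string> ordinalSet, string prefix)
    {
        // The prefix followed by the largest UTF-16 code unit sorts after
        // every string that merely extends the prefix.
        string upper = prefix + char.MaxValue;
        return ordinalSet.GetViewBetween(prefix, upper);
    }
}
```

Usage would look like SortedSetPrefix.WithPrefix(new SortedSet<string>(StringComparer.Ordinal) { "skk", "skka", "sky" }, "skk"). Note that GetViewBetween gives you a view of the elements, not their indexes.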
I have a very large text document. I am implementing "Search" functionality to find occurrences of a given string in the file and to display their positions. It is not just whole-word search; the query can be part of a word, a sentence, or a paragraph. I am working out an efficient data structure for this process. If it were whole-word search I could have used a trie or a hash table. I will not be able to use a suffix array or suffix tree as the file size is very large. Sorting is also not that efficient. The other simple option is just to use the string search or regular expression functionality of the framework, which takes linear time. Is there any better known approach for this kind of operation? Initially it is just string search; later on I plan to add search with metacharacters.
Tries and suffix trees/arrays are good options, but if you do not like them I have another solution: create a hash table.
Create a hash table for all the substrings of length 1, 2, 3, ..., N, where N is whatever number you want. Building it has complexity O(N * size_of_text).
To find a string you then have 2 cases:
If the size of the string is at most N, you just look it up in the hash table: ~O(1) for the lookup and O(size_of_string) for creating the hash key.
If the size is larger than N, you split the string into chunks of size N and do this: look up the first chunk and remember all of its positions. Then do the same for the next chunk and check whether there are positions that are consecutive (e.g. the first lookup gives i, j and the second gives k, i+N; then i, i+N is a good combination). Keep the last number of each consecutive pair (for i, i+N you keep just i+N) and continue until you have no numbers left in your stack or you have finished the word.
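A simplified sketch of this approach is shown below. It is not the full scheme: it indexes only substrings of exactly N characters, and for longer patterns it looks up the first chunk and then verifies each candidate position directly against the text instead of chaining chunk lookups. Patterns shorter than N are not handled. The class name NGramIndex is hypothetical.

```csharp
using System;
using System.Collections.Generic;

public sealed class NGramIndex
{
    private readonly string _text;
    private readonly int _n;

    // Maps every substring of length n to the positions where it starts.
    private readonly Dictionary<string, List<int>> _positions =
        new Dictionary<string, List<int>>();

    public NGramIndex(string text, int n)
    {
        _text = text;
        _n = n;
        for (int i = 0; i + n <= text.Length; i++)
        {
            string gram = text.Substring(i, n);
            if (!_positions.TryGetValue(gram, out List<int> list))
            {
                list = new List<int>();
                _positions[gram] = list;
            }
            list.Add(i);
        }
    }

    // Returns the start position of every occurrence of the pattern.
    // Simplification: the pattern must be at least n characters long.
    public List<int> Find(string pattern)
    {
        if (pattern.Length < _n)
            throw new ArgumentException("Pattern must be at least n characters long.");

        var result = new List<int>();
        if (!_positions.TryGetValue(pattern.Substring(0, _n), out List<int> candidates))
            return result;

        foreach (int start in candidates)
        {
            // Verify the rest of the pattern at this candidate position.
            if (start + pattern.Length <= _text.Length &&
                string.CompareOrdinal(_text, start, pattern, 0, pattern.Length) == 0)
            {
                result.Add(start);
            }
        }
        return result;
    }
}
```

The trade-off is memory: the index stores one entry per text position, so for a very large file N and the storage format need some thought.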
Hope it helped.
Lucene.NET is a search engine library that does text scanning with indexes:
http://incubator.apache.org/lucene.net/
For each day we have approximately 50,000 instances of a data structure (this could eventually grow to be much larger) that encapsulate the following:
DateTime AsOfDate;
int key;
List<int> values; // list of distinct integers
This is probably not relevant but the list values is a list of distinct integers with the property that for a given value of AsOfDate, the union of values over all values of key produces a list of distinct integers. That is, no integer appears in two different values lists on the same day.
The lists usually contain very few elements (between one and five), but are sometimes as long as fifty elements.
Given adjacent days, we are trying to find instances of these objects for which the values of key on the two days are different, but the list values contain the same integers.
We are using the following algorithm. Convert the list values to a string via
string signature = String.Join("|", values.OrderBy(n => n).ToArray());
then hash signature to an integer, order the resulting lists of hash codes (one list for each day), walk through the two lists looking for matches and then check to see if the associated keys differ. (Also check the associated lists to make sure that we didn't have a hash collision.)
Is there a better method?
You could probably just hash the list itself, instead of going through String.
Apart from that, I think your algorithm is nearly optimal. Assuming no hash collisions, it is O(n log n + m log m) where n and m are the numbers of entries for each of the two days you're comparing. (The sorting is the bottleneck.)
You can do this in O(n + m) if you use a bucket array (essentially: a hashtable) that you plug the hashes in. You can compare the two bucket arrays in O(max(n, m)) assuming a length dependent on the number of entries (to get a reasonable load factor).
It should be possible to have the library do this for you (it looks like you're using .NET) by using HashSet.IntersectWith() and writing a suitable compare function.
You cannot do better than O(n + m), because every entry needs to be visited at least once.
Edit: misread, fixed.
On top of the other answers you could make the process faster by creating a low-cost hash simply constructed of a XOR amongst all the elements of each List.
You wouldn't have to order your list and all you would get is an int which is easier and faster to store than strings.
Then you only need to use the resulting XORed number as a key to a Hashtable and check for the existence of the key before inserting it.
If there is already an existing key, only then do you sort the corresponding Lists and compare them.
You still need to compare them if you find a match because there may be some collisions using a simple XOR.
I think, though, that the result would be much faster and have a much lower memory footprint than re-ordering arrays and converting them to strings.
If you were to have your own implementation of the List<>, then you could build the generation of the XOR key within it so it would be recalculated at each operation on the List.
This would make the process of checking duplicate lists even faster.
Code
Below is a first-attempt at implementing this.
Dictionary<int, List<List<int>>> checkHash = new Dictionary<int, List<List<int>>>();

public bool CheckDuplicate(List<int> theList) {
    bool isIdentical = false;
    int xorkey = 0;
    foreach (int v in theList) xorkey ^= v;

    List<List<int>> existingLists;
    if (checkHash.TryGetValue(xorkey, out existingLists)) {
        // Already in the dictionary. Check each stored list
        foreach (List<int> li in existingLists) {
            isIdentical = (theList.Count == li.Count);
            if (isIdentical) {
                // Check all elements
                foreach (int v in theList) {
                    if (!li.Contains(v)) {
                        isIdentical = false;
                        break;
                    }
                }
            }
            if (isIdentical) break;
        }
    } else {
        // First list seen with this XOR key: create its bucket
        existingLists = new List<List<int>>();
        checkHash.Add(xorkey, existingLists);
    }
    if (!isIdentical) {
        // Never seen this list before, add it to the bucket
        existingLists.Add(theList);
    }
    return isIdentical;
}
Not the most elegant or easiest to read at first sight, it's rather 'hackey' and I'm not even sure it performs better than the more elegant version from Guffa.
What it does though is take care of collision in the XOR key by storing Lists of List<int> in the Dictionary.
If a duplicate key is found, we loop through each previously stored List until we find a match.
The good point about the code is that it should probably be as fast as you can get in most cases, and still faster than building strings when there is a collision.
Implement an IEqualityComparer for List, then you can use the list as a key in a dictionary.
If the lists are sorted, it could be as simple as this:
class IntListEqualityComparer : IEqualityComparer<List<int>> {
    public int GetHashCode(List<int> list) {
        int code = 0;
        foreach (int value in list) code ^= value;
        return code;
    }
    public bool Equals(List<int> list1, List<int> list2) {
        if (list1.Count != list2.Count) return false;
        for (int i = 0; i < list1.Count; i++) {
            if (list1[i] != list2[i]) return false;
        }
        return true;
    }
}
Now you can create a dictionary that uses the IEqualityComparer:
Dictionary<List<int>, YourClass> day1 = new Dictionary<List<int>, YourClass>(new IntListEqualityComparer());
Add all the items from the first day to the dictionary, then loop through the items from the second day and check if the key exists in the dictionary. As the IEqualityComparer handles both the hash code and the comparison, you will not get any false matches.
You may want to test some different methods of calculating the hash code. The one in the example works, but may not give the best efficiency for your specific data. The only requirement on the hash code for the dictionary to work is that the same list always gets the same hash code, so you can do pretty much what ever you want to calculate it. The goal is to get as many different hash codes as possible for the keys in your dictionary, so that there are as few items as possible in each bucket (with the same hash code).
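As one example of an alternative worth measuring, the sketch below uses a multiply-and-add combination instead of XOR. It is order-sensitive, so it assumes the lists are sorted first, as in the comparer above. The constants 17 and 31 are a conventional choice rather than anything tuned for this data, and the names ListHashing and OrderedHash are made up here.

```csharp
using System.Collections.Generic;

public static class ListHashing
{
    // Order-sensitive hash: assumes the values are already sorted, so that
    // equal sets of integers always produce the same sequence.
    public static int OrderedHash(IEnumerable<int> sortedValues)
    {
        unchecked // let the arithmetic wrap around instead of throwing on overflow
        {
            int hash = 17;
            foreach (int value in sortedValues)
            {
                hash = hash * 31 + value;
            }
            return hash;
        }
    }
}
```

Unlike XOR, this does not map lists such as {1, 2, 3} and the empty list to the same value, which may reduce the number of items per bucket. Whether it helps for your data is something only a measurement can tell.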
Does the ordering matter? i.e. [1,2] on day 1 and [2,1] on day 2, are they equal?
If they are, then hashing might not work all that well. You could use a sorted array/vector instead to help with the comparison.
Also, what kind of keys is it? Does it have a definite range (e.g. 0-63)? You might be able to concatenate them into large integer (may require precision beyond 64-bits), and hash, instead of converting to string, because that might take a while.
It might be worthwhile to place this in a SQL database. If you don't want to have a full blown DBMS you could use sqlite.
This would turn uniqueness checks, unions and similar operations into very simple queries, and it would be very efficient. It would also allow you to easily store the information in case it is ever needed again.
Would you consider summing up the list of values to obtain an integer which can be used as a precheck of whether different list contains the same set of values?
There will be many more collisions (the same sum doesn't necessarily mean the same set of values), but I think it can still cut the number of comparisons required by a large part.
I have a list of input words separated by commas. I want to sort these words alphabetically and by length. How can I do this without using the built-in sorting functions?
Good question!! Sorting is probably the most important concept to learn as an up-and-coming computer scientist.
There are actually lots of different algorithms for sorting a list.
When you break all of those algorithms down, the most fundamental operation is the comparison of two items in the list, defining their "natural order".
For example, in order to sort a list of integers, I'd need a function that tells me, given any two integers X and Y whether X is less than, equal to, or greater than Y.
For your strings, you'll need the same thing: a function that tells you which of the strings has the "lesser" or "greater" value, or whether they're equal.
Traditionally, these "comparator" functions look something like this:
int CompareStrings(String a, String b) {
if (a < b)
return -1;
else if (a > b)
return 1;
else
return 0;
}
I've left out some of the details (like, how do you compute whether a is less than or greater than b? clue: iterate through the characters), but that's the basic skeleton of any comparison function. It returns a value less than zero if the first element is smaller and a value greater than zero if the first element is greater, returning zero if the elements have equal value.
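To fill in the skipped detail, here is one way the comparison could be written by iterating through the characters. It is only a sketch: it compares raw character values, so uppercase letters sort before lowercase ones, and it treats a string that runs out of characters first as the smaller one. The class name WordComparer is made up for the example.

```csharp
public static class WordComparer
{
    // Returns a negative value if a sorts before b, a positive value if it
    // sorts after b, and zero if the two strings are identical.
    public static int CompareStrings(string a, string b)
    {
        int common = a.Length < b.Length ? a.Length : b.Length;

        // Compare character by character until the first difference.
        for (int i = 0; i < common; i++)
        {
            if (a[i] < b[i]) return -1;
            if (a[i] > b[i]) return 1;
        }

        // All shared characters match: the shorter string is the smaller one.
        if (a.Length < b.Length) return -1;
        if (a.Length > b.Length) return 1;
        return 0;
    }
}
```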
But what does that have to do with sorting?
A sort routine will call that function for pairs of elements in your list, using the result of the function to figure out how to rearrange the items into a sorted list. The comparison function defines the "natural order", and the "sorting algorithm" defines the logic for calling and responding to the results of the comparison function.
Each algorithm is like a big-picture strategy for guaranteeing that ANY input will be correctly sorted. Here are a few of the algorithms that you'll probably want to know about:
Bubble Sort:
Iterate through the list, calling the comparison function for all adjacent pairs of elements. Whenever you get a result greater than zero (meaning that the first element is larger than the second one), swap the two values. Then move on to the next pair. When you get to the end of the list, if you didn't have to swap ANY pairs, then congratulations, the list is sorted! If you DID have to perform any swaps, go back to the beginning and start over. Repeat this process until there are no more swaps.
NOTE: this is usually not a very efficient way to sort a list, because in the worst cases, it might require you to scan the whole list as many as N times, for a list with N elements.
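The bubble sort steps described above might be sketched like this. The comparison is passed in as a delegate, so any comparison function with the less-than-zero, zero, greater-than-zero contract will do; the class name BubbleSorter is invented for the example.

```csharp
using System;

public static class BubbleSorter
{
    // Sorts the array in place using the supplied comparison function.
    public static void Sort(string[] items, Comparison<string> compare)
    {
        bool swapped;
        do
        {
            swapped = false;
            // One pass over all adjacent pairs.
            for (int i = 0; i + 1 < items.Length; i++)
            {
                if (compare(items[i], items[i + 1]) > 0)
                {
                    // The first element is larger: swap the pair.
                    string tmp = items[i];
                    items[i] = items[i + 1];
                    items[i + 1] = tmp;
                    swapped = true;
                }
            }
        } while (swapped); // a pass without swaps means the list is sorted
    }
}
```

For instance, BubbleSorter.Sort(words, string.CompareOrdinal) would sort an array of words by character value.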
Merge Sort:
This is one of the most popular divide-and-conquer algorithms for sorting a list. The basic idea is that, if you have two already-sorted lists, it's easy to merge them. Just start from the beginning of each list and remove the first element of whichever list has the smallest starting value. Repeat this process until you've consumed all the items from both lists, and then you're done!
1 4 8 10
2 5 7 9
------------ becomes ------------>
1 2 4 5 7 8 9 10
But what if you don't have two sorted lists? What if you have just one list, and its elements are in random order?
That's the clever thing about merge sort. You can break any single list into smaller pieces, each of which is either an unsorted list, a sorted list, or a single element (which, if you think about it, is actually a sorted list, with length = 1).
So the first step in a merge sort algorithm is to divide your overall list into smaller and smaller sub lists. At the smallest levels (where each list only has one or two elements), they're very easy to sort. And once sorted, it's easy to merge any two adjacent sorted lists into a larger sorted list containing all the elements of the two sub lists.
NOTE: This algorithm is much better than the bubble sort method, described above, in terms of its worst-case-scenario efficiency. I won't go into a detailed explanation (which involves some fairly trivial math, but would take some time to explain), but the quick reason for the increased efficiency is that this algorithm breaks its problem into ideal-sized chunks and then merges the results of those chunks. The bubble sort algorithm tackles the whole thing at once, so it doesn't get the benefit of "divide-and-conquer".
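A compact sketch of the divide-and-merge idea follows. It splits all the way down to single elements and returns a new array rather than sorting in place, which is one of several reasonable ways to arrange it. The class name MergeSorter is made up for the example.

```csharp
using System;

public static class MergeSorter
{
    // Returns a sorted copy of the input (arrays of length 0 or 1 are returned as-is).
    public static string[] Sort(string[] items, Comparison<string> compare)
    {
        if (items.Length <= 1) return items;

        // Divide: split the list into two halves.
        int mid = items.Length / 2;
        string[] left = new string[mid];
        string[] right = new string[items.Length - mid];
        Array.Copy(items, 0, left, 0, mid);
        Array.Copy(items, mid, right, 0, right.Length);

        // Conquer: sort each half, then merge the two sorted halves.
        return Merge(Sort(left, compare), Sort(right, compare), compare);
    }

    private static string[] Merge(string[] left, string[] right, Comparison<string> compare)
    {
        string[] merged = new string[left.Length + right.Length];
        int l = 0, r = 0, m = 0;

        // Repeatedly take the smaller of the two front elements.
        while (l < left.Length && r < right.Length)
        {
            merged[m++] = compare(left[l], right[r]) <= 0 ? left[l++] : right[r++];
        }

        // One list is exhausted: copy whatever remains of the other.
        while (l < left.Length) merged[m++] = left[l++];
        while (r < right.Length) merged[m++] = right[r++];
        return merged;
    }
}
```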
Those are just two algorithms for sorting a list, but there are a lot of other interesting techniques, each with its own advantages and disadvantages: Quick Sort, Radix Sort, Selection Sort, Heap Sort, Shell Sort, and Bucket Sort.
The internet is overflowing with interesting information about sorting. Here's a good place to start:
http://en.wikipedia.org/wiki/Sorting_algorithms
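Create a console application and paste this into Program.cs as the body of the class. The class must be declared static, because Sort is an extension method, and the file needs using System and using System.Linq.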
public static void Main(string[] args)
{
string [] strList = "a,b,c,d,e,f,a,a,b".Split(new [] { ',' }, StringSplitOptions.RemoveEmptyEntries);
foreach(string s in strList.Sort())
Console.WriteLine(s);
}
public static string [] Sort(this string [] strList)
{
return strList.OrderBy(i => i).ToArray();
}
Notice that I do use a built in method, OrderBy. As other answers point out there are many different sort algorithms you could implement there and I think my code snippet does everything for you except the actual sort algorithm.
Some C# specific sorting tutorials
There is an entire area of study built around sorting algorithms. You may want to choose a simple one and implement it.
Though it won't be the most performant, it shouldn't take you too long to implement a bubble sort.
If you don't want to use built-in functions, you have to create one yourself. I would recommend bubble sort or some similar algorithm. Bubble sort is not an efficient algorithm, but it gets the job done and is easy to understand.
You will find much good reading on wikipedia.
I would recommend doing a wiki for quicksort.
Still not sure why you don't want to use the built in sort?
Bubble sort damages the brain.
Insertion sort is at least as simple to understand and code, and is actually useful in practice (for very small data sets, and nearly-sorted data). It works like this:
Suppose that the first n items are already in order (you can start with n = 1, since obviously one thing on its own is "in the correct order").
Take the (n+1)th item in your array. Call this the "pivot". Starting with the nth item and working down:
- if it is bigger than the pivot, move it one space to the right (to create a "gap" to the left of it).
- otherwise, leave it in place, put the "pivot" one space to the right of it (that is, in the "gap" if you moved anything, or where it started if you moved nothing), and stop.
Now the first n+1 items in the array are in order, because the pivot is to the right of everything smaller than it, and to the left of everything bigger than it. Since you started with n items in order, that's progress.
Repeat, with n increasing by 1 at each step, until you've processed the whole list.
This corresponds to one way that you might physically put a series of folders into a filing cabinet in order: put one in; then put another one into its correct position by pushing everything that belongs after it over by one space to make room; repeat until finished. Nobody ever sorts physical objects by bubble sort, so it's a mystery to me why it's considered "simple".
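The steps above translate fairly directly into code. This is a sketch with the comparison passed in as a delegate; the variable called pivot matches the wording used in the description, and the class name InsertionSorter is invented for the example.

```csharp
using System;

public static class InsertionSorter
{
    // Sorts the array in place using the supplied comparison function.
    public static void Sort(string[] items, Comparison<string> compare)
    {
        // Invariant: the first n items are already in order.
        for (int n = 1; n < items.Length; n++)
        {
            string pivot = items[n];
            int i = n - 1;

            // Move every item bigger than the pivot one space to the right.
            while (i >= 0 && compare(items[i], pivot) > 0)
            {
                items[i + 1] = items[i];
                i--;
            }

            // Drop the pivot into the gap (or back where it started).
            items[i + 1] = pivot;
        }
    }
}
```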
All that's left now is that you need to be able to work out, given two strings, whether the first is greater than the second. I'm not quite sure what you mean by "alphabetical and length": alphabetical order is done by comparing one character at a time from each string. If they're not the same, that's your order. If they are the same, look at the next one, unless you're out of characters in one of the strings, in which case that's the one that's "smaller".
Use NSort
I ran across the NSort library a couple of years ago in the book Windows Developer Power Tools. The NSort library implements a number of sorting algorithms. The main advantage of using something like NSort over writing your own sorting is that it is already tested and optimized.
Posting link to fast string sort code in C#:
http://www.codeproject.com/KB/cs/fast_string_sort.aspx
Another point:
The suggested comparator above is not recommended for non-English languages:
int CompareStrings(String a, String b) {
    if (a < b)
        return -1;
    else if (a > b)
        return 1;
    else
        return 0;
}
Check out this link for sorting in non-English languages:
http://msdn.microsoft.com/en-us/goglobal/bb688122
And as mentioned, use nsort for really gigantic arrays that don't fit in memory.