String parsing and matching algorithm

String parsing and matching algorithm - c#

I am solving the following problem:
Suppose I have a list of software packages and their names might looks like this (the only known thing is that these names are formed like SOMETHING + VERSION, meaning that the version always comes after the name):
Efficient.Exclusive.Zip.Archiver-PROPER.v.122.24-EXTENDED
Efficient.Exclusive.Zip.Archiver.123.01
Efficient-Exclusive.Zip.Archiver(2011)-126.24-X
Zip.Archiver14.06
Zip-Archiver.v15.08-T
Custom.Zip.Archiver1.08
Custom.Zip.Archiver1
Now, I need to parse this list and select only latest versions of each package. For this example the expected result would be:
Efficient-Exclusive.Zip.Archiver(2011)-126.24-X
Zip-Archiver.v15.08-T
Custom.Zip.Archiver1.08
Current approach that I use can be described the following way:
Split the initial strings into groups by their starting letter,
ignoring spaces, case and special symbols.
(`E`, `Z`, `C` for the example list above)
Foreach element {
Apply the regular expression (or a set of regular expressions),
which tries to deduce the version from the string and perform
the following conversion `STRING -> (VERSION, STRING_BEFORE_VERSION)`
// Example for this step:
// 'Efficient.Exclusive.Zip.Archiver-PROPER.v.122.24-EXTENDED' ->
// (122.24, Efficient.Exclusive.Zip.Archiver-PROPER)
Search through the corresponding group (in this example - the 'E' group)
and find every other strings, which starts from the 'STRING_BEFORE_VERSION' or
from it's significant part. This comparison is performed in ignore-case and
ignore-special-symbols mode.
// The matches for this step:
// Efficient.Exclusive.Zip.Archiver-PROPER, {122.24}
// Efficient.Exclusive.Zip.Archiver, {123.01}
// Efficient-Exclusive.Zip.Archiver, {126.24, 2011}
// The last one will get picked, because year is ignored.
Get the possible version from each match, ***pick the latest, yield that match.***
Remove every possible match (including the initial element) from the list.
}
This algorithm (as I assume) should work for something like O(N * V + N lg N * M), where M stands for the average string matching time and V stands for the version regexp working time.
However, I suspect there is a better solution (there always is!), maybe specific data structure or better matching approach.
If you can suggest something or make some notes on the current approach, please do not hesitate to do this.

How about this? (Pseudo-Code)
Dictionary<string,string> latestPackages=new Dictionary<string,string>(packageNameComparer);
foreach element
{
(package,version)=applyRegex(element);
if(!latestPackages.ContainsKey(package) || isNewer)
{
latestPackages[package]=version;
}
}
//print out latestPackages
Dictionary operations are O(1), so you have O(n) total runtime. No pre-grouping necessary and instead of storing all matches, you only store the one which is currently the newest.
Dictionary has a constructor which accepts a IEqualityComparer-object. There you can implement your own semantic of equality between package names. Keep in mind however that you need to implement a GetHashCode method in this IEqualityComparer which should return the same values for objects that you consider equal. To reproduce the example above you could return a hash code for the first character in the string, which would reproduce the grouping you had inside your dictionary. However you will get more performance with a smarter hash code, which doesn't have so many collisions. Maybe using more characters if that still yields good results.

I think you could probably use a DAWG (http://en.wikipedia.org/wiki/Directed_acyclic_word_graph) here to good effect. I think you could simply cycle down each node till you hit one that has only 1 "child". On this node, you'll have common prefixes "up" the tree and version strings below. From there, parse the version strings by removing everything that isn't a digit or a period, splitting the string by the period and converting each element of the array to an integer. This should give you an int array for each version string. Identify the highest version, record it and travel to the next node with only 1 child.
EDIT: Populating a large DAWG is a pretty expensive operation but lookup is really fast.

Related

Efficiently pairing objects in lists based on key

So, here's the deal.
(My current use-case is in C#, but I'm also interested in the general algorithmic case)
I am given two Arrays of objects (I don't get to alter the code that creates these arrays, unfortunately).
Each object has (as part of it) a .Name property, a string.
These strings are unique per object, and they have zero or one matching strings in the other object.
What I need to do is efficiently pair these objects based on that string, into some sort of collection that allows me access to the paired objects. The strings need to match exactly to be considered a match, so I don't need any Upper or CaseInsensitive, etc.
Sadly, these lists are not sorted.
The lists themselves are maybe 30-50 items, but I need to repeat the algorithm on thousands of these array-pairs in a row, so efficiency is important.
Since I know that there's 0 or 1 match, and I know that most of them will be 1 match, I feel like there's a more efficient algorithm than x*y (Foreach item in x, foreach item in y, if x=y then x and y are a match)
I believe the most likely options are:
Keep the unsorted list and just do x*y, but drop items from the list once I've found them so I don't check ones already-found,
OR:
Convert both to Dictionaries and then do an indexed lookup on each (array2[currentArray1Item])
OR:
Sort the lists myself (Array.Sort()), and then having sorted arrays I can probably do something clever like jump to the index in B where I'd expect to find it (wherever it was in A) and then move up or down based on string until I either find it or pass where it should've been.
Then once that's done I need to figure out how to store it, I suppose I can make a custom ObjectPair class that just holds objects A and B. No need to do anything fancy here, since I'm just going to ForEach on the pairs.
So the questions are:
Are any of the above algorithms the fastest way to do this (if not, what is?) and is there some existing C# structure that'd conveniently hold the found pairs?
EDIT: Array.Sort() is a method that exists, so I don't need to convert the array to List to sort. Good to know. Updated above.

The question I have is: how much efficiency do we gain from the special handling if it requires us to sort both input arrays? According to the documentation for Array.Sort, it is O(n log n) on average and O(n ^ 2) in the worst case (quicksort). Once we have both arrays sorted, we then have another O(n) amount of work because we have to loop through the first one.
I interpret this to mean that the overall amount of work might actually increase because of the number of iterations required to sort, then process. This of course would be a different story if you could guarantee sorted arrays at the start, but as you said you cannot. (I should also note that you would need to create a custom IComparer<T> implementation to pass to Array.Sort so it knows to use the .Name property. That's not runtime work, but it's still work :-)
You might consider using a LINQ join, which only iterates the inner array a single time (see here for psuedocode). This is as opposed to the nested foreach statements, which would iterate the inner array for each element of the outer array. It's about as efficient as it can be in the general case and doesn't introduce the complexity of the special handling you suggested.
Here is an example implementation:
var pairs =
from item1 in array1
join item2 in array2 on item1.Name equals item2.Name
select new { item1, item2 };
foreach(var pair in pairs)
{
// Use the pair somehow
}
That very clearly states what you are doing with the data and also gives you an anonymous type representing each pair (so you don't have to invent a pairing). If you do end up going a different route, I would be interested in how it compares to this approach.

Sort the second array using Array.Sort method, then match objects in the second Array using Binary Search Algorithm.
Generally, for 30-50 items this would be a little faster than brute force x*y.

Fastest way to select all strings from list starting from

I'm looking for the fastest way to find all strings in a collection starting from a set of characters. I can use sorted collection for this, however I can't find convenient way to do this in .net. Basically I need to find low and high indexes in a collection that meet the criteria.
BinarySearch on List<T> does not guarantee the returned index is that of the 1st element, so one would need to iterate up and down to find all matching strings which is not fast if one has a large list.
There are also Linq methods (with parallel), but I'm not sure which data structure will provide the best results.
List example, ~10M of records:
aaaaaaaaaaaaaaabb
aaaaaaaaaaaaaaba
aaaaaaaaaaaaabc
...
zzzzzzzzzzzzzxx
zzzzzzzzzzzzzyzzz
zzzzzzzzzzzzzzzzzza
Search for strings starting from: skk...
Result: record indexes from x to y.
UPDATE: strings can have different lengths and are unique.

In terms of time complexity - you should use a trie, and not a sorted set or binary search.
Trie will get you a O(|S|) time complexity [while sorted set and binary search gets you O(|S|logn)] to find the node [let it be v] that represents that prefix.
All the strings [paths] in the trie that fit the prefix will "pass" via v. By adding numberOfLeaves field to each node, you can find out exactly how much leaves [=strings] this node has.
In a single pass - you can also find the index of this v [For each node u in the path from the root to v - sum numberOfLeaves for each sibling which is left to u].
This requires much more work then using already existing structures, but if the data is huge - it can make your algorithm much faster, so you should concider it if performance is an issue and you expect a huge set of strings.

You can do it with a hand-written binary search - one which just doesn't stop when it's found a match; it continues until it's found a single index.
In fact, you don't even have to write the binary search bit yourself - you could create a custom comparer which never returns 0, i.e. if you're looking for "abc" then it treats "abb" as being below the target value, but "abc" as being above the target value. This way the BinarySearch will always return a negative number, which you can then just bit-flip to find the theoretical insertion point for "the string which comes between abb and abc".
You can do the same in reverse (treat "abc" as lower than the target value) to find the highest bound.
If you know the format of these strings and it won't have edge cases like Unicode NULL characters, and everything's the same length, you can even do it without writing your own comparer:
// This could be done more efficiently :)
string stringJustBelow = target.Substring(0, target.Length - 1) +
target[target.Length - 1] + "X";
string stringJustAbove = target + "X"; // Or any character
int lowerBoundInclusive = ~list.BinarySearch(stringJustBelow);
int upperBoundExclusive = ~list.BinarySearch(stringJustAbove);
So if you strings are all length 3 and you were searching for "abc" you'd actually look for where "abbX" and "abcX" would be inserted.

Put them in SortedSet and use GetViewBetween.
This answer illustrates searching for both prefix and suffix, I'm sure you'll have no trouble adapting it to prefix-only search, if that is indeed what you want.
If you just want to search for a range (not prefix), directly using GetViewBetween should suffice.

Data structure for indexed searches of subsets

I'm working on a c# jquery implementation and am trying to figure out an efficient algorithm for locating elements in a subset of the entire DOM (e.g. a subselector). At present I am creating an index of common selectors: class, id, and tag when the DOM is built.
The basic data structure is as one would expect, a tree of Elements which contain IEnumerable<Element> Children and a Parent. This is simple when searching the whole domain using a Dictonary<string,HashSet<Element>> to store the index.
I have not been able to get my head around the most effective way to search subsets of elements using an index. I use the term "subset" to refer to the starting set from which a subsequent selector in a chain will be run against. The following are methods I've thought of:
Retrieve matches from entire DOM for a subquery, and eliminate those that are not part of the subset. This requires traversing up the parents of each match until the root is found (and it is eliminated) or a member of the subset is found (and it is a child, hence included)
Maintain the index separately for each element.
Maintain a set of parents for each element (to make #1 fast by eliminating traversal)
Rebuild the entire index for each subquery.
Just search manually except for primary selectors.
The cost of each possible technique depends greatly on the exact operation being done. #1 is probably pretty good most of the time, since most of the time when you do a sub-select, you're targeting specific elements. The number of iterations required would be the number of results * the average depth of each element.
The 2nd method would be by far the fastest for selecting, but at the expense of storage requirements that increase exponentially with depth, and difficult index maintenance. I've pretty much eliminated this.
The 3rd method has a fairly bad memory footprint (though much better than #2) - it may be reasonable, but in addition to the storage requirements, adding and removing elements becomes substantially more expensive and complicated.
The 4rd method requires traversing the entire selection anyway so it seems pointless since most subqueries are only going to be run once. It would only be beneficial if a subequery was expected to be repeated. (Alternatively, I could just do this while traversing a subset anyway - except some selectors don't require searching the whole subdomain, e.g. ID and position selectors).
The 5th method will be fine for limited subsets, but much worse than the 1st method for subsets that are much of the DOM.
Any thoughts or other ideas about how best to accomplish this? I could do some hybrid of #1 and #4 by guessing which is more efficient given the size of the subset being searched vs. the size of the DOM but this is pretty fuzzy and I'd rather find some universal solution. Right now I am just using #4 (only full-DOM queries use the index) which is fine, but really bad if you decided to do something like $('body').Find('#id')
Disclaimer: This is early optimization. I don't have a bottleneck that needs solving, but as an academic problem I can't stop thinking about it...
Solution
Here's the implementation for the data structure as proposed by the answer. Is working perfectly as a near drop-in replacement for a dictionary.
interface IRangeSortedDictionary<TValue>: IDictionary<string, TValue>
{
IEnumerable<string> GetRangeKeys(string subKey);
IEnumerable<TValue> GetRange(string subKey);
}
public class RangeSortedDictionary<TValue> : IRangeSortedDictionary<TValue>
{
protected SortedSet<string> Keys = new SortedSet<string>();
protected Dictionary<string,TValue> Index =
new Dictionary<string,TValue>();
public IEnumerable<string> GetRangeKeys(string subkey)
{
if (string.IsNullOrEmpty(subkey)) {
yield break;
}
// create the next possible string match
string lastKey = subkey.Substring(0,subkey.Length - 1) +
Convert.ToChar(Convert.ToInt32(subkey[subkey.Length - 1]) + 1);
foreach (var key in Keys.GetViewBetween(subkey, lastKey))
{
// GetViewBetween is inclusive, exclude the last key just in case
// there's one with the next value
if (key != lastKey)
{
yield return key;
}
}
}
public IEnumerable<TValue> GetRange(string subKey)
{
foreach (var key in GetRangeKeys(subKey))
{
yield return Index[key];
}
}
// implement dictionary interface against internal collections
}
Code is here: http://ideone.com/UIp9R

If you suspect name collisions will be uncommon, it may be fast enough to just walk up the tree.
If collisions are common though, it might be faster to use a data structure that excels at ordered prefix searches, such as a tree. Your various subsets make up the prefix. Your index keys would then include both selectors and total paths.
For the DOM:
<path>
<to>
<element id="someid" class="someclass" someattribute="1"/>
</to>
</path>
You would have the following index keys:
<element>/path/to/element
#someid>/path/to/element
.someclass>/path/to/element
#someattribute>/path/to/element
Now if you search these keys based on prefix, you can limit the query to any subset you want:
<element> ; finds all <element>, regardless of path
.someclass> ; finds all .someclass, regardless of path
.someclass>/path ; finds all .someclass that exist in the subset /path
.someclass>/path/to ; finds all .someclass that exist in the subset /path/to
#id>/body ; finds all #id that exist in the subset /body
A tree can find the lower bound (the first element >= to your search value) in O(log n), and because it is ordered from there you simply iterate until you come to a key that no longer matches the prefix. It will be very fast!
.NET doesn't have a suitable tree structure (it has SortedDictionary but that unfortunately doesn't expose the required LowerBound method), so you'll need to either write your own or use an existing third party one. The excellent C5 Generic Collection Library features trees with suitable Range methods.

Customizing Sort Order of C# Arrays

This has been bugging me for some time now. I've tried several approaches and none have worked properly.
I'm writing and IRC client and am trying to sort out the list of usernames (which needs to be sorted by a users' access level in the current channel).
This is easy enough. Trouble is, this list needs to added to whenever a user joins or leaves the channel so their username must be removed the list when the leave and re-added in the correct position when they rejoin.
Each users' access level is signified by a single character at the start of each username. These characters are reserved, so there's no potential problem of a name starting with one of the symbols. The symbols from highest to lowest (in the order I need to sort them) are:
~
&
#
%
+
Users without any sort of access have no symbol before their username. They should be at the bottom of the list.
For example: the unsorted array could contain the following:
~user1 ~user84 #user3 &user8 +user39 user002 user2838 %user29
And needs to be sorted so the elements are in the following order:
~user1 ~user84 &user8 #user3 %user29 +user39 user002 user2838
After users are sorted by access level, they also need to be sorted alphabetically.
Asking here is a last resort, if someone could help me out, I'd very much appreciate it.
Thankyou in advance.

As long as the array contains an object then implement IComparable on the object and then call Array.Sort().
Tho if the collection is changable I would recommend using a List<>.

You can use SortedList<K,V> with the K (key) implementing IComparable interface which then defines the criteria of your sort. The V can simply be null or the same K object.

You can give an IComparer<T> or a Comparison<T> to Array.Sort. Then you just need to implement the comparison yourself. If it's a relatively complex comparison (which it sounds like this is) I'd implement IComparer<T> in a separate class, which you can easily unit test. Then call:
Array.Sort(userNames, new UserNameComparer());
You might want to have a convenient instance defined, if UserNameComparer has no state:
Array.Sort(userNames, UserNameComparer.Instance);
List<T> has similar options for sorting - I'd personally use a list rather than an array, if you're going to be adding/removing items regularly.
In fact, it sounds like you don't often need to actually do a full sort. Removing a user doesn't change the sort order, and inserting only means inserting at the right point. In other words, you need:
Create list and sort it to start with
Removing a user is just a straightforward operation
Adding a user requires finding out where to insert them
You can do the last step using Array.BinarySearch or List.BinarySearch, which again allow you to specify a custom IComparer<T>. Once you know where to insert the user, you can do that relatively cheaply (compared with sorting the whole collection again).

You should take a look at the IComparer interface (or it's generic version). When implementing the CompareTo method, check whether either of the two usernames contains one of your reserved character. If neither has a special reserved character or both have the same character, call the String.CompareTo method, which will handle the alphabetical sorting. Otherwise use your custom sorting logic.

I gave the sorting a shot and came up with the following sorting approach:
List<char> levelChars = new List<char>();
levelChars.AddRange("+%#&~".ToCharArray());
List<string> names = new List<string>();
names.AddRange(new[]{"~user1", "~user84", "#user3", "&user8", "+user39", "user002", "user2838", "%user29"});
names.Sort((x,y) =>
{
int xLevel = levelChars.IndexOf(x[0]);
int yLevel = levelChars.IndexOf(y[0]);
if (xLevel != yLevel)
{
// if xLevel is higher; x should come before y
return xLevel > yLevel ? -1 : 1;
}
// x and y have the same level; regular string comparison
// will do the job
return x.CompareTo(y);
});
This comparison code can just as well live inside the Compare method of an IComparer<T> implementation.

How can I sort an array of strings?

I have a list of input words separated by comma. I want to sort these words by alphabetical and length. How can I do this without using the built-in sorting functions?

Good question!! Sorting is probably the most important concept to learn as an up-and-coming computer scientist.
There are actually lots of different algorithms for sorting a list.
When you break all of those algorithms down, the most fundamental operation is the comparison of two items in the list, defining their "natural order".
For example, in order to sort a list of integers, I'd need a function that tells me, given any two integers X and Y whether X is less than, equal to, or greater than Y.
For your strings, you'll need the same thing: a function that tells you which of the strings has the "lesser" or "greater" value, or whether they're equal.
Traditionally, these "comparator" functions look something like this:
int CompareStrings(String a, String b) {
if (a < b)
return -1;
else if (a > b)
return 1;
else
return 0;
}
I've left out some of the details (like, how do you compute whether a is less than or greater than b? clue: iterate through the characters), but that's the basic skeleton of any comparison function. It returns a value less than zero if the first element is smaller and a value greater than zero if the first element is greater, returning zero if the elements have equal value.
But what does that have to do with sorting?
A sort routing will call that function for pairs of elements in your list, using the result of the function to figure out how to rearrange the items into a sorted list. The comparison function defines the "natural order", and the "sorting algorithm" defines the logic for calling and responding to the results of the comparison function.
Each algorithm is like a big-picture strategy for guaranteeing that ANY input will be correctly sorted. Here are a few of the algorithms that you'll probably want to know about:
Bubble Sort:
Iterate through the list, calling the comparison function for all adjacent pairs of elements. Whenever you get a result greater than zero (meaning that the first element is larger than the second one), swap the two values. Then move on to the next pair. When you get to the end of the list, if you didn't have to swap ANY pairs, then congratulations, the list is sorted! If you DID have to perform any swaps, go back to the beginning and start over. Repeat this process until there are no more swaps.
NOTE: this is usually not a very efficient way to sort a list, because in the worst cases, it might require you to scan the whole list as many as N times, for a list with N elements.
Merge Sort:
This is one of the most popular divide-and-conquer algorithms for sorting a list. The basic idea is that, if you have two already-sorted lists, it's easy to merge them. Just start from the beginning of each list and remove the first element of whichever list has the smallest starting value. Repeat this process until you've consumed all the items from both lists, and then you're done!
1 4 8 10
2 5 7 9
------------ becomes ------------>
1 2 4 5 7 8 9 10
But what if you don't have two sorted lists? What if you have just one list, and its elements are in random order?
That's the clever thing about merge sort. You can break any single list into smaller pieces, each of which is either an unsorted list, a sorted list, or a single element (which, if you thing about it, is actually a sorted list, with length = 1).
So the first step in a merge sort algorithm is to divide your overall list into smaller and smaller sub lists, At the tiniest levels (where each list only has one or two elements), they're very easy to sort. And once sorted, it's easy to merge any two adjacent sorted lists into a larger sorted list containing all the elements of the two sub lists.
NOTE: This algorithm is much better than the bubble sort method, described above, in terms of its worst-case-scenario efficiency. I won't go into a detailed explanation (which involves some fairly trivial math, but would take some time to explain), but the quick reason for the increased efficiency is that this algorithm breaks its problem into ideal-sized chunks and then merges the results of those chunks. The bubble sort algorithm tackles the whole thing at once, so it doesn't get the benefit of "divide-and-conquer".
Those are just two algorithms for sorting a list, but there are a lot of other interesting techniques, each with its own advantages and disadvantages: Quick Sort, Radix Sort, Selection Sort, Heap Sort, Shell Sort, and Bucket Sort.
The internet is overflowing with interesting information about sorting. Here's a good place to start:
http://en.wikipedia.org/wiki/Sorting_algorithms

Create a console application and paste this into the Program.cs as the body of the class.
public static void Main(string[] args)
{
string [] strList = "a,b,c,d,e,f,a,a,b".Split(new [] { ',' }, StringSplitOptions.RemoveEmptyEntries);
foreach(string s in strList.Sort())
Console.WriteLine(s);
}
public static string [] Sort(this string [] strList)
{
return strList.OrderBy(i => i).ToArray();
}
Notice that I do use a built in method, OrderBy. As other answers point out there are many different sort algorithms you could implement there and I think my code snippet does everything for you except the actual sort algorithm.
Some C# specific sorting tutorials

There is an entire area of study built around sorting algorithms. You may want to choose a simple one and implement it.
Though it won't be the most performant, it shouldn't take you too long to implement a bubble sort.

If you don't want to use build-in-functions, you have to create one by your self. I would recommend Bubble sort or some similar algorithm. Bubble sort is not an effective algoritm, but it get the works done, and is easy to understand.
You will find much good reading on wikipedia.

I would recommend doing a wiki for quicksort.
Still not sure why you don't want to use the built in sort?

Bubble sort damages the brain.
Insertion sort is at least as simple to understand and code, and is actually useful in practice (for very small data sets, and nearly-sorted data). It works like this:
Suppose that the first n items are already in order (you can start with n = 1, since obviously one thing on its own is "in the correct order").
Take the (n+1)th item in your array. Call this the "pivot". Starting with the nth item and working down:
- if it is bigger than the pivot, move it one space to the right (to create a "gap" to the left of it).
- otherwise, leave it in place, put the "pivot" one space to the right of it (that is, in the "gap" if you moved anything, or where it started if you moved nothing), and stop.
Now the first n+1 items in the array are in order, because the pivot is to the right of everything smaller than it, and to the left of everything bigger than it. Since you started with n items in order, that's progress.
Repeat, with n increasing by 1 at each step, until you've processed the whole list.
This corresponds to one way that you might physically put a series of folders into a filing cabinet in order: put one in; then put another one into its correct position by pushing everything that belongs after it over by one space to make room; repeat until finished. Nobody ever sorts physical objects by bubble sort, so it's a mystery to me why it's considered "simple".
All that's left now is that you need to be able to work out, given two strings, whether the first is greater than the second. I'm not quite sure what you mean by "alphabetical and length" : alphabetical order is done by comparing one character at a time from each string. If there not the same, that's your order. If they are the same, look at the next one, unless you're out of characters in one of the strings, in which case that's the one that's "smaller".

Use NSort
I ran across the NSort library a couple of years ago in the book Windows Developer Power Tools. The NSort library implements a number of sorting algorithms. The main advantage to using something like NSort over writing your own sorting is that is is already tested and optimized.

Posting link to fast string sort code in C#:
http://www.codeproject.com/KB/cs/fast_string_sort.aspx
Another point:
The suggested comparator above is not recommended for non-English languages:
int CompareStrings(String a, String b) {
if (a < b) return -1;
else if (a > b)
return 1; else
return 0; }
Checkout this link for non-English language sort:
http://msdn.microsoft.com/en-us/goglobal/bb688122
And as mentioned, use nsort for really gigantic arrays that don't fit in memory.

Develop Reference

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.