Customizing Sort Order of C# Arrays

Customizing Sort Order of C# Arrays - c#

This has been bugging me for some time now. I've tried several approaches and none have worked properly.
I'm writing and IRC client and am trying to sort out the list of usernames (which needs to be sorted by a users' access level in the current channel).
This is easy enough. Trouble is, this list needs to added to whenever a user joins or leaves the channel so their username must be removed the list when the leave and re-added in the correct position when they rejoin.
Each users' access level is signified by a single character at the start of each username. These characters are reserved, so there's no potential problem of a name starting with one of the symbols. The symbols from highest to lowest (in the order I need to sort them) are:
~
&
#
%
+
Users without any sort of access have no symbol before their username. They should be at the bottom of the list.
For example: the unsorted array could contain the following:
~user1 ~user84 #user3 &user8 +user39 user002 user2838 %user29
And needs to be sorted so the elements are in the following order:
~user1 ~user84 &user8 #user3 %user29 +user39 user002 user2838
After users are sorted by access level, they also need to be sorted alphabetically.
Asking here is a last resort, if someone could help me out, I'd very much appreciate it.
Thankyou in advance.

As long as the array contains an object then implement IComparable on the object and then call Array.Sort().
Tho if the collection is changable I would recommend using a List<>.

You can use SortedList<K,V> with the K (key) implementing IComparable interface which then defines the criteria of your sort. The V can simply be null or the same K object.

You can give an IComparer<T> or a Comparison<T> to Array.Sort. Then you just need to implement the comparison yourself. If it's a relatively complex comparison (which it sounds like this is) I'd implement IComparer<T> in a separate class, which you can easily unit test. Then call:
Array.Sort(userNames, new UserNameComparer());
You might want to have a convenient instance defined, if UserNameComparer has no state:
Array.Sort(userNames, UserNameComparer.Instance);
List<T> has similar options for sorting - I'd personally use a list rather than an array, if you're going to be adding/removing items regularly.
In fact, it sounds like you don't often need to actually do a full sort. Removing a user doesn't change the sort order, and inserting only means inserting at the right point. In other words, you need:
Create list and sort it to start with
Removing a user is just a straightforward operation
Adding a user requires finding out where to insert them
You can do the last step using Array.BinarySearch or List.BinarySearch, which again allow you to specify a custom IComparer<T>. Once you know where to insert the user, you can do that relatively cheaply (compared with sorting the whole collection again).

You should take a look at the IComparer interface (or it's generic version). When implementing the CompareTo method, check whether either of the two usernames contains one of your reserved character. If neither has a special reserved character or both have the same character, call the String.CompareTo method, which will handle the alphabetical sorting. Otherwise use your custom sorting logic.

I gave the sorting a shot and came up with the following sorting approach:
List<char> levelChars = new List<char>();
levelChars.AddRange("+%#&~".ToCharArray());
List<string> names = new List<string>();
names.AddRange(new[]{"~user1", "~user84", "#user3", "&user8", "+user39", "user002", "user2838", "%user29"});
names.Sort((x,y) =>
{
int xLevel = levelChars.IndexOf(x[0]);
int yLevel = levelChars.IndexOf(y[0]);
if (xLevel != yLevel)
{
// if xLevel is higher; x should come before y
return xLevel > yLevel ? -1 : 1;
}
// x and y have the same level; regular string comparison
// will do the job
return x.CompareTo(y);
});
This comparison code can just as well live inside the Compare method of an IComparer<T> implementation.

Related

Fast way to check for existence and then insert into a SortedList

Whenever I want to insert into a SortedList, I check to see if the item exists, then I insert. Is this performing the same search twice? Once to see if the item is there and again to find where to insert the item? Is there a way to optimize this to speed it up or is this just the way to do it, no changes necessary?
if( sortedList.ContainsKey( foo ) == false ){
sortedList.Add( foo, 0 );
}

You can add the items to a HashSet and the List, searching in the hash set is the fastest way to see if you have to add the value to the list.
if( hashSet.Contains( foo ) == false ){
sortedList.Add( foo, 0 );
hashSet.Add(foo);
}

You can use the indexer. The indexer does this in an optimal way internally by first looking for the index corresponding to the key using a binary search and then using this index to replace an existing item. Otherwise a new item is added by taking in account the index already calculated.
list["foo"] = value;
No exception is thrown whether the key already exists or not.
UPDATE:
If the new value is the same as the old value, replacing the old value will have the same effect than doing nothing.
Keep in mind that a binary search is done. This means that it takes about 10 steps to find an item among 1000 items! log2(1000) ~= 10. Therefore doing an extra search will not have a significant impact on speed. Searching among 1,000,000 items will only double this value (~ 20 steps).
But setting the value through the indexer will do only one search in any case. I looked at the code using Reflector and can confirm this.

I'm sorry if this doesn't answer your question, but I have to say sometimes the default collection structures in .NET are unjustifiably limited in features. This could have been handled if Add method returned a boolean indicating success/failure very much like HashSet<T>.Add does. So everything goes in one step. In fact the whole of ICollection<T>.Add should have been a boolean so that implementation-wise it's forced, very much like Collection<T> in Java does.
You could either use a SortedDictionary<K, V> structure as pointed out by Servy or a combination of HashSet<K> and SortedList<K, V> as in peer's answer for better performance, but neither of them are really sticking to do it only once philosophy. I tried a couple of open source projects to see if there is a better implementation in this respect, but couldn't find.
Your options:
In vast majority of the cases it's ok to do two lookups, doesn't hurt much. Stick to one. There is no solution built in.
Write your own SortedList<K, V> class. It's not difficult at all.
If you'r desperate, you can use reflection. The Insert method is a private member in SortedList class. An example that does.. Kindly dont do it. It's a very very poor choice. Mentioned here for completeness.

ContainsKey does a binary search, which is O(log n), so unless you list is massive, I wouldn't worry about it too much. And, presumably, on insertion it does another binary search to find the location to insert at.
One option to avoid this (doing the search twice) is to use a the BinarySearch method of List. This will return a negative value if the item isn't found and that negative value is the bitwise compliment of the place where the item should be inserted. So you can look for an item, and if it's not already in the list, you know exactly where it should be inserted to keep the list sorted.

SortedList<Key,Value> is a slow data structure that you probably shouldn't use at all. You may have already considered using SortedDictionary<Key,Value> but found it inconvenient because the items don't have indexes (you can't write sortedDictionary[0]) and because you can write a find nearest key operation for SortedList but not SortedDictionary.
But if you're willing to switch to a third-party library, you can get better performance by changing to a different data structure.
The Loyc Core libraries include a data type that works the same way as SortedList<Key,Value> but is dramatically faster when the list is large. It's called BDictionary<Key,Value>.
Now, answering your original question: yes, the way you wrote the code, it performs two searches and one insert (the insert is the slowest part). If you switch to BDictionary, there is a method bdictionary.AddIfNotPresent(key, value) which combines those two operations into a single operation. It returns true if the specified item was added, or false if it was already present.

Efficiently pairing objects in lists based on key

So, here's the deal.
(My current use-case is in C#, but I'm also interested in the general algorithmic case)
I am given two Arrays of objects (I don't get to alter the code that creates these arrays, unfortunately).
Each object has (as part of it) a .Name property, a string.
These strings are unique per object, and they have zero or one matching strings in the other object.
What I need to do is efficiently pair these objects based on that string, into some sort of collection that allows me access to the paired objects. The strings need to match exactly to be considered a match, so I don't need any Upper or CaseInsensitive, etc.
Sadly, these lists are not sorted.
The lists themselves are maybe 30-50 items, but I need to repeat the algorithm on thousands of these array-pairs in a row, so efficiency is important.
Since I know that there's 0 or 1 match, and I know that most of them will be 1 match, I feel like there's a more efficient algorithm than x*y (Foreach item in x, foreach item in y, if x=y then x and y are a match)
I believe the most likely options are:
Keep the unsorted list and just do x*y, but drop items from the list once I've found them so I don't check ones already-found,
OR:
Convert both to Dictionaries and then do an indexed lookup on each (array2[currentArray1Item])
OR:
Sort the lists myself (Array.Sort()), and then having sorted arrays I can probably do something clever like jump to the index in B where I'd expect to find it (wherever it was in A) and then move up or down based on string until I either find it or pass where it should've been.
Then once that's done I need to figure out how to store it, I suppose I can make a custom ObjectPair class that just holds objects A and B. No need to do anything fancy here, since I'm just going to ForEach on the pairs.
So the questions are:
Are any of the above algorithms the fastest way to do this (if not, what is?) and is there some existing C# structure that'd conveniently hold the found pairs?
EDIT: Array.Sort() is a method that exists, so I don't need to convert the array to List to sort. Good to know. Updated above.

The question I have is: how much efficiency do we gain from the special handling if it requires us to sort both input arrays? According to the documentation for Array.Sort, it is O(n log n) on average and O(n ^ 2) in the worst case (quicksort). Once we have both arrays sorted, we then have another O(n) amount of work because we have to loop through the first one.
I interpret this to mean that the overall amount of work might actually increase because of the number of iterations required to sort, then process. This of course would be a different story if you could guarantee sorted arrays at the start, but as you said you cannot. (I should also note that you would need to create a custom IComparer<T> implementation to pass to Array.Sort so it knows to use the .Name property. That's not runtime work, but it's still work :-)
You might consider using a LINQ join, which only iterates the inner array a single time (see here for psuedocode). This is as opposed to the nested foreach statements, which would iterate the inner array for each element of the outer array. It's about as efficient as it can be in the general case and doesn't introduce the complexity of the special handling you suggested.
Here is an example implementation:
var pairs =
from item1 in array1
join item2 in array2 on item1.Name equals item2.Name
select new { item1, item2 };
foreach(var pair in pairs)
{
// Use the pair somehow
}
That very clearly states what you are doing with the data and also gives you an anonymous type representing each pair (so you don't have to invent a pairing). If you do end up going a different route, I would be interested in how it compares to this approach.

Sort the second array using Array.Sort method, then match objects in the second Array using Binary Search Algorithm.
Generally, for 30-50 items this would be a little faster than brute force x*y.

String parsing and matching algorithm

I am solving the following problem:
Suppose I have a list of software packages and their names might looks like this (the only known thing is that these names are formed like SOMETHING + VERSION, meaning that the version always comes after the name):
Efficient.Exclusive.Zip.Archiver-PROPER.v.122.24-EXTENDED
Efficient.Exclusive.Zip.Archiver.123.01
Efficient-Exclusive.Zip.Archiver(2011)-126.24-X
Zip.Archiver14.06
Zip-Archiver.v15.08-T
Custom.Zip.Archiver1.08
Custom.Zip.Archiver1
Now, I need to parse this list and select only latest versions of each package. For this example the expected result would be:
Efficient-Exclusive.Zip.Archiver(2011)-126.24-X
Zip-Archiver.v15.08-T
Custom.Zip.Archiver1.08
Current approach that I use can be described the following way:
Split the initial strings into groups by their starting letter,
ignoring spaces, case and special symbols.
(`E`, `Z`, `C` for the example list above)
Foreach element {
Apply the regular expression (or a set of regular expressions),
which tries to deduce the version from the string and perform
the following conversion `STRING -> (VERSION, STRING_BEFORE_VERSION)`
// Example for this step:
// 'Efficient.Exclusive.Zip.Archiver-PROPER.v.122.24-EXTENDED' ->
// (122.24, Efficient.Exclusive.Zip.Archiver-PROPER)
Search through the corresponding group (in this example - the 'E' group)
and find every other strings, which starts from the 'STRING_BEFORE_VERSION' or
from it's significant part. This comparison is performed in ignore-case and
ignore-special-symbols mode.
// The matches for this step:
// Efficient.Exclusive.Zip.Archiver-PROPER, {122.24}
// Efficient.Exclusive.Zip.Archiver, {123.01}
// Efficient-Exclusive.Zip.Archiver, {126.24, 2011}
// The last one will get picked, because year is ignored.
Get the possible version from each match, ***pick the latest, yield that match.***
Remove every possible match (including the initial element) from the list.
}
This algorithm (as I assume) should work for something like O(N * V + N lg N * M), where M stands for the average string matching time and V stands for the version regexp working time.
However, I suspect there is a better solution (there always is!), maybe specific data structure or better matching approach.
If you can suggest something or make some notes on the current approach, please do not hesitate to do this.

How about this? (Pseudo-Code)
Dictionary<string,string> latestPackages=new Dictionary<string,string>(packageNameComparer);
foreach element
{
(package,version)=applyRegex(element);
if(!latestPackages.ContainsKey(package) || isNewer)
{
latestPackages[package]=version;
}
}
//print out latestPackages
Dictionary operations are O(1), so you have O(n) total runtime. No pre-grouping necessary and instead of storing all matches, you only store the one which is currently the newest.
Dictionary has a constructor which accepts a IEqualityComparer-object. There you can implement your own semantic of equality between package names. Keep in mind however that you need to implement a GetHashCode method in this IEqualityComparer which should return the same values for objects that you consider equal. To reproduce the example above you could return a hash code for the first character in the string, which would reproduce the grouping you had inside your dictionary. However you will get more performance with a smarter hash code, which doesn't have so many collisions. Maybe using more characters if that still yields good results.

I think you could probably use a DAWG (http://en.wikipedia.org/wiki/Directed_acyclic_word_graph) here to good effect. I think you could simply cycle down each node till you hit one that has only 1 "child". On this node, you'll have common prefixes "up" the tree and version strings below. From there, parse the version strings by removing everything that isn't a digit or a period, splitting the string by the period and converting each element of the array to an integer. This should give you an int array for each version string. Identify the highest version, record it and travel to the next node with only 1 child.
EDIT: Populating a large DAWG is a pretty expensive operation but lookup is really fast.

In C#, is there a kind of a SortedList<double> that allows fast querying (with LINQ) for the nearest value?

I am looking for a structure that holds a sorted set of double values. I want to query this set to find the closest value to a specified reference value.
I have looked at the SortedList<double, double>, and it does quite well for me. However, since I do not need explicit key/value pairs. this seems to be overkill to me, and i wonder if i could do faster.
Conditions:
The structure is initialised only once, and does never change (no insert/deletes)
The amount of values is in the range of 100k.
The structure is queried often with new references, which must execute fast.
For simplicity and speed, the set's value just below of the reference may be returned, not actually the nearest value
I want to use LINQ for the query, if possible, for simplicity of code.
I want to use no 3rd party code if possible. .NET 3.5 is available.
Speed is more importand than memory footprint
I currently use the following code, where SortedValues is the aforementioned SortedList
IEnumerable<double> nearest = from item in SortedValues.Keys
where item <= suggestion
select item;
return nearest.ElementAt(nearest.Count() - 1);
Can I do faster?
Also I am not 100% percent sure, if this code is really safe. IEnumerable, the return type of my query is not by definition sorted anymore. However, a Unit test with a large test data base has shown that it is in practice, so this works for me. Have you hints regarding this aspect?
P.S. I know that there are many similar questions, but none actually answers my specific needs. Especially there is this one C# Data Structure Like Dictionary But Without A Value, but the questioner does just want to check the existence not find anything.

The way you are doing it is incredibly slow as it must search from the beginning of the list each time giving O(n) performance.
A better way is to put the elements into a List and then sort the list. You say you don't need to change the contents once initialized, so sorting once is enough.
Then you can use List<T>.BinarySearch to find elements or to find the insertion point of an element if it doesn't already exist in the list.
From the docs:
Return Value
The zero-based index of
item in the sorted List<T>,
if item is found; otherwise, a
negative number that is the bitwise
complement of the index of the next
element that is larger than item or,
if there is no larger element, the
bitwise complement of Count.
Once you have the insertion point, you need to check the elements on either side to see which is closest.

Might not be useful to you right now, but .Net 4 has a SortedSet class in the BCL.

I think it can be more elegant as follows:
In case your items are not sorted:
double nearest = values.OrderBy(x => x.Key).Last(x => x.Key <= requestedValue);
In case your items are sorted, you may omit the OrderBy call...

How can I sort an array of strings?

I have a list of input words separated by comma. I want to sort these words by alphabetical and length. How can I do this without using the built-in sorting functions?

Good question!! Sorting is probably the most important concept to learn as an up-and-coming computer scientist.
There are actually lots of different algorithms for sorting a list.
When you break all of those algorithms down, the most fundamental operation is the comparison of two items in the list, defining their "natural order".
For example, in order to sort a list of integers, I'd need a function that tells me, given any two integers X and Y whether X is less than, equal to, or greater than Y.
For your strings, you'll need the same thing: a function that tells you which of the strings has the "lesser" or "greater" value, or whether they're equal.
Traditionally, these "comparator" functions look something like this:
int CompareStrings(String a, String b) {
if (a < b)
return -1;
else if (a > b)
return 1;
else
return 0;
}
I've left out some of the details (like, how do you compute whether a is less than or greater than b? clue: iterate through the characters), but that's the basic skeleton of any comparison function. It returns a value less than zero if the first element is smaller and a value greater than zero if the first element is greater, returning zero if the elements have equal value.
But what does that have to do with sorting?
A sort routing will call that function for pairs of elements in your list, using the result of the function to figure out how to rearrange the items into a sorted list. The comparison function defines the "natural order", and the "sorting algorithm" defines the logic for calling and responding to the results of the comparison function.
Each algorithm is like a big-picture strategy for guaranteeing that ANY input will be correctly sorted. Here are a few of the algorithms that you'll probably want to know about:
Bubble Sort:
Iterate through the list, calling the comparison function for all adjacent pairs of elements. Whenever you get a result greater than zero (meaning that the first element is larger than the second one), swap the two values. Then move on to the next pair. When you get to the end of the list, if you didn't have to swap ANY pairs, then congratulations, the list is sorted! If you DID have to perform any swaps, go back to the beginning and start over. Repeat this process until there are no more swaps.
NOTE: this is usually not a very efficient way to sort a list, because in the worst cases, it might require you to scan the whole list as many as N times, for a list with N elements.
Merge Sort:
This is one of the most popular divide-and-conquer algorithms for sorting a list. The basic idea is that, if you have two already-sorted lists, it's easy to merge them. Just start from the beginning of each list and remove the first element of whichever list has the smallest starting value. Repeat this process until you've consumed all the items from both lists, and then you're done!
1 4 8 10
2 5 7 9
------------ becomes ------------>
1 2 4 5 7 8 9 10
But what if you don't have two sorted lists? What if you have just one list, and its elements are in random order?
That's the clever thing about merge sort. You can break any single list into smaller pieces, each of which is either an unsorted list, a sorted list, or a single element (which, if you thing about it, is actually a sorted list, with length = 1).
So the first step in a merge sort algorithm is to divide your overall list into smaller and smaller sub lists, At the tiniest levels (where each list only has one or two elements), they're very easy to sort. And once sorted, it's easy to merge any two adjacent sorted lists into a larger sorted list containing all the elements of the two sub lists.
NOTE: This algorithm is much better than the bubble sort method, described above, in terms of its worst-case-scenario efficiency. I won't go into a detailed explanation (which involves some fairly trivial math, but would take some time to explain), but the quick reason for the increased efficiency is that this algorithm breaks its problem into ideal-sized chunks and then merges the results of those chunks. The bubble sort algorithm tackles the whole thing at once, so it doesn't get the benefit of "divide-and-conquer".
Those are just two algorithms for sorting a list, but there are a lot of other interesting techniques, each with its own advantages and disadvantages: Quick Sort, Radix Sort, Selection Sort, Heap Sort, Shell Sort, and Bucket Sort.
The internet is overflowing with interesting information about sorting. Here's a good place to start:
http://en.wikipedia.org/wiki/Sorting_algorithms

Create a console application and paste this into the Program.cs as the body of the class.
public static void Main(string[] args)
{
string [] strList = "a,b,c,d,e,f,a,a,b".Split(new [] { ',' }, StringSplitOptions.RemoveEmptyEntries);
foreach(string s in strList.Sort())
Console.WriteLine(s);
}
public static string [] Sort(this string [] strList)
{
return strList.OrderBy(i => i).ToArray();
}
Notice that I do use a built in method, OrderBy. As other answers point out there are many different sort algorithms you could implement there and I think my code snippet does everything for you except the actual sort algorithm.
Some C# specific sorting tutorials

There is an entire area of study built around sorting algorithms. You may want to choose a simple one and implement it.
Though it won't be the most performant, it shouldn't take you too long to implement a bubble sort.

If you don't want to use build-in-functions, you have to create one by your self. I would recommend Bubble sort or some similar algorithm. Bubble sort is not an effective algoritm, but it get the works done, and is easy to understand.
You will find much good reading on wikipedia.

I would recommend doing a wiki for quicksort.
Still not sure why you don't want to use the built in sort?

Bubble sort damages the brain.
Insertion sort is at least as simple to understand and code, and is actually useful in practice (for very small data sets, and nearly-sorted data). It works like this:
Suppose that the first n items are already in order (you can start with n = 1, since obviously one thing on its own is "in the correct order").
Take the (n+1)th item in your array. Call this the "pivot". Starting with the nth item and working down:
- if it is bigger than the pivot, move it one space to the right (to create a "gap" to the left of it).
- otherwise, leave it in place, put the "pivot" one space to the right of it (that is, in the "gap" if you moved anything, or where it started if you moved nothing), and stop.
Now the first n+1 items in the array are in order, because the pivot is to the right of everything smaller than it, and to the left of everything bigger than it. Since you started with n items in order, that's progress.
Repeat, with n increasing by 1 at each step, until you've processed the whole list.
This corresponds to one way that you might physically put a series of folders into a filing cabinet in order: put one in; then put another one into its correct position by pushing everything that belongs after it over by one space to make room; repeat until finished. Nobody ever sorts physical objects by bubble sort, so it's a mystery to me why it's considered "simple".
All that's left now is that you need to be able to work out, given two strings, whether the first is greater than the second. I'm not quite sure what you mean by "alphabetical and length" : alphabetical order is done by comparing one character at a time from each string. If there not the same, that's your order. If they are the same, look at the next one, unless you're out of characters in one of the strings, in which case that's the one that's "smaller".

Use NSort
I ran across the NSort library a couple of years ago in the book Windows Developer Power Tools. The NSort library implements a number of sorting algorithms. The main advantage to using something like NSort over writing your own sorting is that is is already tested and optimized.

Posting link to fast string sort code in C#:
http://www.codeproject.com/KB/cs/fast_string_sort.aspx
Another point:
The suggested comparator above is not recommended for non-English languages:
int CompareStrings(String a, String b) {
if (a < b) return -1;
else if (a > b)
return 1; else
return 0; }
Checkout this link for non-English language sort:
http://msdn.microsoft.com/en-us/goglobal/bb688122
And as mentioned, use nsort for really gigantic arrays that don't fit in memory.

Develop Reference

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.