Efficiently pairing objects in lists based on key

So, here's the deal.
(My current use-case is in C#, but I'm also interested in the general algorithmic case)
I am given two Arrays of objects (I don't get to alter the code that creates these arrays, unfortunately).
Each object has (as part of it) a .Name property, a string.
These strings are unique within each array, and each one has zero or one matching string in the other array.
What I need to do is efficiently pair these objects based on that string, into some sort of collection that allows me access to the paired objects. The strings need to match exactly to be considered a match, so I don't need any Upper or CaseInsensitive, etc.
Sadly, these lists are not sorted.
The lists themselves are maybe 30-50 items, but I need to repeat the algorithm on thousands of these array-pairs in a row, so efficiency is important.
Since I know that there's 0 or 1 match, and I know that most of them will be 1 match, I feel like there's a more efficient algorithm than x*y (Foreach item in x, foreach item in y, if x=y then x and y are a match)
I believe the most likely options are:
Keep the unsorted list and just do x*y, but drop items from the list once I've found them so I don't check ones already-found,
OR:
Convert both to Dictionaries and then do an indexed lookup on each (array2[currentArray1Item])
OR:
Sort the lists myself (Array.Sort()), and then having sorted arrays I can probably do something clever like jump to the index in B where I'd expect to find it (wherever it was in A) and then move up or down based on string until I either find it or pass where it should've been.
Then once that's done I need to figure out how to store it, I suppose I can make a custom ObjectPair class that just holds objects A and B. No need to do anything fancy here, since I'm just going to ForEach on the pairs.
So the questions are:
Are any of the above algorithms the fastest way to do this (if not, what is?) and is there some existing C# structure that'd conveniently hold the found pairs?
EDIT: Array.Sort() is a method that exists, so I don't need to convert the array to List to sort. Good to know. Updated above.

The question I have is: how much efficiency do we gain from the special handling if it requires us to sort both input arrays? According to the documentation for Array.Sort, it is O(n log n) on average and O(n ^ 2) in the worst case (quicksort). Once we have both arrays sorted, we then have another O(n) amount of work because we have to loop through the first one.
I interpret this to mean that the overall amount of work might actually increase because of the number of iterations required to sort, then process. This of course would be a different story if you could guarantee sorted arrays at the start, but as you said you cannot. (I should also note that you would need to create a custom IComparer<T> implementation to pass to Array.Sort so it knows to use the .Name property. That's not runtime work, but it's still work :-)
You might consider using a LINQ join, which only iterates the inner array a single time to build a lookup keyed on .Name. This is as opposed to the nested foreach statements, which would iterate the inner array for each element of the outer array. It's about as efficient as it can be in the general case and doesn't introduce the complexity of the special handling you suggested.
Here is an example implementation:
var pairs =
    from item1 in array1
    join item2 in array2 on item1.Name equals item2.Name
    select new { item1, item2 };

foreach (var pair in pairs)
{
    // Use the pair somehow
}
That very clearly states what you are doing with the data and also gives you an anonymous type representing each pair (so you don't have to invent a pairing). If you do end up going a different route, I would be interested in how it compares to this approach.
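For comparison, here is a minimal sketch of the Dictionary option from the question, which is roughly the same idea as the join written out by hand. The Item class and the PairByName name are hypothetical stand-ins (the question only tells us the objects have a .Name string), and the ordinal comparer is an assumption based on the "must match exactly" requirement.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for the objects in the question; only Name matters here.
public sealed class Item
{
    public string Name { get; }
    public Item(string name) { Name = name; }
}

public static class Pairing
{
    // Index the second array by Name once, then probe that index
    // for each element of the first array.
    public static List<KeyValuePair<Item, Item>> PairByName(Item[] first, Item[] second)
    {
        var byName = new Dictionary<string, Item>(second.Length, StringComparer.Ordinal);
        foreach (var item in second)
            byName[item.Name] = item; // names are unique per array, per the question

        var pairs = new List<KeyValuePair<Item, Item>>(first.Length);
        foreach (var item in first)
        {
            Item match;
            if (byName.TryGetValue(item.Name, out match))
                pairs.Add(new KeyValuePair<Item, Item>(item, match));
        }
        return pairs;
    }
}
```

Items without a partner are simply skipped. Whether this beats the join for 30-50 element arrays is something you would have to measure.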

Sort the second array with Array.Sort, then look up each object from the first array in it using binary search (Array.BinarySearch).
Generally, for 30-50 items this would be a little faster than brute force x*y.
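A rough sketch of that approach, under the assumption that you first pull the .Name strings out into plain string arrays (namesA and namesB are hypothetical names, as is MatchIndexes). It returns index pairs so you can map back to the original objects.

```csharp
using System;
using System.Collections.Generic;

public static class SortedPairing
{
    // Returns (i, j) pairs such that namesA[i] equals namesB[j] (ordinal comparison).
    public static List<Tuple<int, int>> MatchIndexes(string[] namesA, string[] namesB)
    {
        // Sort a copy of B's names, carrying their original positions along.
        var keys = (string[])namesB.Clone();
        var positions = new int[keys.Length];
        for (int j = 0; j < positions.Length; j++)
            positions[j] = j;
        Array.Sort(keys, positions, StringComparer.Ordinal);

        var matches = new List<Tuple<int, int>>();
        for (int i = 0; i < namesA.Length; i++)
        {
            // BinarySearch returns a negative number when the name is absent.
            int found = Array.BinarySearch(keys, namesA[i], StringComparer.Ordinal);
            if (found >= 0)
                matches.Add(Tuple.Create(i, positions[found]));
        }
        return matches;
    }
}
```

Only one array is sorted here; the other is just scanned in its original order.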

Related

Dictionary element accessing

Question
Must looping through all C# Dictionary elements be done only through foreach, and if so why?
Or, I could ask my question as: can Dictionary elements be accessed by position within the Dictionary object (i.e., first element, last element, 3rd from last element, etc)?
Background to this question
I'm trying to learn more about how Dictionary objects work, so I'd appreciate help wrapping my mind around this. I'm learning about this, so I have several thoughts that are all tied into this question. I'll try to present in a way that is appropriate for SO format.
Research
In a C# array, elements are referenced by position. In a Dictionary, values are referenced by keys.
Looking through the documentation on MSDN, there are the statements
"For purposes of enumeration, each item in the dictionary is treated as a
KeyValuePair structure representing a value and its key. The
order in which the items are returned is undefined."
So, it would seem that since the order items are returned in is undefined, there is no way to access elements by position. I also read:
"Retrieving a value by using its key is very fast, close to O(1),
because the Dictionary class is implemented as a hash table."
Looking at the documentation for the HashTable .NET 4.5 class, there is reference to using a foreach statement to loop through and return elements. But there is no reference to using a for statement, or for that matter while or any other looping statement.
Also, I've noticed Dictionary elements use the IEnumerable interface, which seems to use foreach as the only type of statement for looping functions.
Thoughts
So, does this mean that Dictionary elements cannot be accessed by "position," as arrays or lists can?
If this is so, why is there a .Count property that returns the number of key/value pairs, yet nothing that lets me reference these by nearness to the total? For example, .Count is 5, why can't I request key/value pair .Count minus 1?
How is foreach able to loop over each element, yet I have no access to individual elements in the same way?
Is there no way to determine the position of an element (key or value) in a Dictionary object, without utilizing foreach? Can I not tell, without mapping elements to a collection, if a key is the first key in a Dictionary, or the last key?
This SO question and the excellent answers touch on this, but I'm specifically looking to see if I must copy elements to an array or other enumerable type, to access specific elements by position.
Here's an example. Please note I'm not looking for a way to specifically solve this example - it's for illustration purposes of my questions only. Suppose I want to add all the keys in a Dictionary<string, string> object to a comma-separated list, with no comma at the end. With an array I could do:
string[] arrayStr = new string[2] { "Test1", "Test2" };
string outStr = "";
for (int i = 0; i < arrayStr.Length; i++)
{
    outStr += arrayStr[i];
    if (i < arrayStr.Length - 1)
    {
        outStr += ", ";
    }
}
With Dictionary<string, string>, how would I copy each key to outStr using the above method? It appears I would have to use foreach. But what Dictionary methods or properties exist that would let me identify where an element is located at, within a dictionary?
If you're still reading this, I also want to point out I'm not trying to say there's something wrong with Dictionary... I'm simply trying to understand how this tool in the .NET framework works, and how to best use it myself.

Say you have four cars of different colors. And you want to be able to quickly find the key to a car by its color. So you make 4 envelopes labelled "red", "blue", "black", and "white" and place the key to each car in the right envelope. Which is the "first" car? Which is the "third"? You're not concerned about the order of the envelopes; you're concerned about being able to quickly get the key by the color.
So, does this mean that Dictionary elements cannot be accessed by "position," as arrays or lists can?
Not directly, no. You can use Skip and Take but all they will do is iterate until you get to the "nth" item.
If this is so, why is there a .Count property that returns the number of key/value pairs, yet nothing that lets me reference these by nearness to the total? For example, .Count is 5, why can't I request key/value pair .Count minus 1?
You can still measure the number of items even though there's no order. In my example you know there are 4 envelopes, but there's no concept of the "third" envelope.
How is foreach able to loop over each element, yet I have no access to individual elements in the same way?
Because foreach uses IEnumerable, which just asks for the "next" element each time - the underlying collection determines what order the elements are returned in. You can pick up the envelopes one by one, but the order is irrelevant.
Is there no way to determine the position of an element (key or value) in a Dictionary object, without utilizing foreach?
You can infer it by using foreach and counting how many elements you have before reaching the one you want, but as soon as you add or remove an item, that position may change. If I buy a green car and add the envelope, where in the "order" would it go?
I'm specifically looking to see if I must copy elements to an array or other enumerable type, to access specific elements by position.
Well, no, you can use Skip and Take, but there's no way to predict what item is at that location. You can pick up two envelopes, ignore them, pick up another one and call it the "third" envelope, but so what?
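To make the "just asks for the next element" point concrete, here is a small sketch that drives the enumerator by hand, which is roughly what foreach does for you. KeyJoiner and JoinKeys are made-up names for illustration; in real code string.Join(", ", dict.Keys) would do the same job.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class KeyJoiner
{
    // Builds "key1, key2, ..." without ever asking for an element by position.
    public static string JoinKeys(Dictionary<string, string> dict)
    {
        var sb = new StringBuilder();
        using (IEnumerator<KeyValuePair<string, string>> e = dict.GetEnumerator())
        {
            bool first = true;
            while (e.MoveNext()) // "give me the next one, if there is one"
            {
                if (!first)
                    sb.Append(", ");
                sb.Append(e.Current.Key);
                first = false;
            }
        }
        return sb.ToString();
    }
}
```

The trailing comma is avoided by tracking whether we are on the first element, not by knowing where the last one is. The order of the keys in the result is whatever the Dictionary happens to hand back.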

Several correct answers here, but I thought you might like a short version :)
Under the hood, the Dictionary class has a private array called buckets, alongside a companion array of entries. Together they map hash values to the key/value pairs you've added to the Dictionary.
When you add a key/value pair to the Dictionary, it calculates a hash value for your key. The hash value, reduced modulo the length of the buckets array, is used as the index into that array. Two keys can land in the same bucket; such collisions are handled by chaining the entries together, and the arrays are reallocated at a larger size as the Dictionary fills up.
Yes, it's possible via reflection (which allows you to extract private data fields) to get the 3rd, or 4th, or Nth member of the buckets array. But the array could be resized at any time and you're not even guaranteed that the implementation details of Dictionary won't change.

In addition to D Stanley's answer, I'd like to add that you should check out SortedDictionary<TKey, TValue>. It stores the key/value pairs in a data structure that does keep the keys ordered.
var d = new SortedDictionary<int, string>();
d.Add(4, "banana");
d.Add(2, "apple");
d.Add(7, "pineapple");
Console.WriteLine(d.ElementAt(1).Value); // banana

Looping through a dictionary does not need to be done using foreach, but the terms 'first' and 'last' are meaningless in terms of a dictionary, because order is not guaranteed and is in no way related to the order items are added to your dictionary.
Think of it this way. You have a bag that you are using to store blocks, and each block has a unique label on it. Throw in blocks with the labels "Foo," "Bar," and "Baz." Now, if you ask me what the count of my bag is, I can say I have 3 blocks in it, and if you ask me for the block labeled "Bar" I can get it for you. However, if you ask me for the first block, I don't know what you mean. The blocks are just a jumble inside my bag. If, instead, you say 'foreach' block, I'd like to take a photo of it, I'll hand you each block, 1 by 1. Again, the order isn't guaranteed. I'm just reaching into my bag and pulling out each block until I've gotten each one.
You can also ask for a collection of all the keys in a dictionary, then use each key to access the items in a dictionary. However, once again, the order of the keys is not guaranteed and, in theory, could change every time you access it (in practice, the .NET key order is normally pretty stable).
There's a lot of reasons why a dictionary is stored like this, but the key thing is dictionaries have to have unspecified order in order to have both O(1) insertion and O(1) access. An array, which has a specified order has O(1) access (you can get the n'th item in one step), but insertions are O(n).

There is a large number of collections available in the .Net framework. You have to analyse your requirements and decide which collection to use:
Do you need Key/Value pairs or just Items?
Is it important that items are sorted?
Do you need fast insertion or just add at start/end of collection?
Do you need fast retrieval: O(1) or O(log n)?
Do you need an index, i.e. access to items by an integer position?
For most combinations of these requirements there exists a specialized collection.
In your case: Key/Value pairs and access through an index: SortedList
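As a small illustration of positional access with SortedList<TKey, TValue>, here is a sketch. The fruit data and the Build and At helpers are made up for the example, and the ordinal comparer is an assumption to keep the ordering culture-independent.

```csharp
using System;
using System.Collections.Generic;

public static class SortedListDemo
{
    // Builds a small SortedList; the names and numbers are sample data only.
    public static SortedList<string, int> Build()
    {
        var list = new SortedList<string, int>(StringComparer.Ordinal);
        list.Add("pear", 3);
        list.Add("apple", 1);
        list.Add("fig", 2);
        return list;
    }

    // Returns the key/value pair at a given position in key order.
    public static KeyValuePair<string, int> At(SortedList<string, int> list, int index)
    {
        return new KeyValuePair<string, int>(list.Keys[index], list.Values[index]);
    }
}
```

With that list, At(list, 0) is the "apple" entry, At(list, list.Count - 1) is the "pear" entry, and list.IndexOfKey("fig") is 1. Positions are in key order, not insertion order, and they shift when items are added or removed.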

Tuple<List<int>, List<int>> or a List<Tuple<int, int>>?

Which is preferred and/or more efficient operation wise?
I ended up using a list of tuples, but am curious as to which one is preferred.
Opinions welcome but I would love a technical aspect I'm not finding via google.

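A tuple of two lists stores the ints directly in each list's backing array, so it costs roughly 8 bytes per pair. A list of 2-tuples stores one reference per item, and each Tuple<int, int> is a separate heap object with its own header, so it costs more per pair (the exact overhead depends on the runtime and on whether the process is 32- or 64-bit).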
Does such a Micro-Optimization make sense? I doubt so. I'd rather advise to use whatever makes most sense from a business-logic perspective.

Those options have fairly different semantics.
One implies two lists that only have a tenuous connection to each other, and that an element in position x in List A doesn't correspond to the same location in List B. The following would be perfectly valid:
{ [1, 2, 3, 4, 5, 6], [7] }
The other one is a list of paired elements. If you want to associate two numbers together, one from group A and one from group B, I'd suspect you want a List<Tuple<int, int>>. The above example would be invalid in this case, and you'll probably save yourself a lot of headaches double-checking things when you want to use the list.
As always, consider what you're trying to model and determine what data structure best maps to that. It's not worth sacrificing semantic integrity for a negligible performance improvement (at the very least, build it with the semantically correct version and hack it when a performance problem presents itself). I'm not sure if there even is a performance issue here; the second option allocates one list rather than two, at the cost of a separate Tuple object per pair.

It depends on what the data means.
If it's two lists of numbers, a tuple of lists would be correct. If it's a list of number pairs, then a list of tuples would be correct.
For whatever the integers mean, does it make sense that the lists would be of different lengths? If so, then you've probably got two independent lists, and a tuple of lists would be correct.
If the integers only make sense if the lists are of equal length, then you've probably got one list of number pairs, so a list of tuples would be correct.

They are for two different things. The first is if you want a tuple which contains two lists of integers that aren't really associated, but the two lists are somehow related as a whole. The second is a list of tuples with two integers, so integer A is somehow linked to integer B in the tuple.
Blindly guessing, you probably want option B. This means that each tuple has two integers that are somehow linked. I could be wrong, but of the two options, that is what most people want. It's kind of like a Key Value Pair.

It depends on what you want to achieve. I guess you want to have multiple pairs of int, in which case List<Tuple<int, int>> is the solution. There might be cases where the other is useful, but I can't think of any off the top of my head.

What is the most performant way to check for existence with a collection of integers?

I have a large list of integers that are sent to my webservice. Our business rules state that these values must be unique. What is the most performant way to figure out if there are any duplicates? I don't need to know the values, I only need to know if 2 of the values are equal.
At first I was thinking about using a generic List of integers and the list.Exists() method, but each call is O(n).
Then I was thinking about using a Dictionary and the ContainsKey method. But, I only need the Keys, I do not need the values. And I think this is a linear search as well.
Is there a better datatype to use to find uniqueness within a list? Or am I stuck with a linear search?

Use a HashSet<T>:
"The HashSet class provides high performance set operations. A set is a collection that contains no duplicate elements, and whose elements are in no particular order."
HashSet<T> even exposes a constructor that accepts an IEnumerable<T>. By passing your List<T> to the HashSet<T>'s constructor you will end up with a reference to a new HashSet<T> that will contain a distinct sequence of items from your original List<T>.

Sounds like a job for a HashSet...

If you are using framework 3.5 you can use the HashSet collection.
Otherwise the best option is the Dictionary. The value of each item will be wasted, but that will give you the best performance.
If you check for duplicates while you add the items to the HashSet/Dictionary instead of counting them afterwards, you get better performance than O(n) in case there are duplicates, as you don't have to continue looking after finding the first duplicate.
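A minimal sketch of that early-exit check, relying on the fact that HashSet<T>.Add returns false when the value is already present. DuplicateCheck and HasDuplicates are names invented for the example.

```csharp
using System;
using System.Collections.Generic;

public static class DuplicateCheck
{
    // Returns true as soon as any value is seen for the second time.
    public static bool HasDuplicates(IEnumerable<int> values)
    {
        var seen = new HashSet<int>();
        foreach (int value in values)
        {
            if (!seen.Add(value)) // Add returns false if the value was already in the set
                return true;
        }
        return false;
    }
}
```

For example, HasDuplicates(new[] { 1, 2, 3, 2 }) returns true after looking at four values, and HasDuplicates(new[] { 1, 2, 3 }) returns false.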

If the set of numbers is sparse, then as others suggest use a HashSet.
But if the set of numbers is mostly in sequence with occasional gaps, it would be a lot better if you stored the number set as a sorted array or binary tree of begin,end pairs. Then you could search to find the pair with the largest begin value that was smaller than your search key and compare with that pair's end value to see if it exists in the set.
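Here is one way the begin/end idea could look, as a sketch only. It assumes the ranges are stored as two parallel arrays (begins and ends, hypothetical names), sorted by begin, non-overlapping, and with inclusive ends.

```csharp
using System;

public static class RangeSet
{
    // begins must be sorted ascending; ends[i] is the inclusive end of the range starting at begins[i].
    public static bool Contains(int[] begins, int[] ends, int key)
    {
        int i = Array.BinarySearch(begins, key);
        if (i >= 0)
            return true; // key is exactly the start of a range

        // ~i is the index of the first begin larger than key,
        // so the range just before it has the largest begin smaller than key.
        int previous = ~i - 1;
        return previous >= 0 && key <= ends[previous];
    }
}
```

With begins { 1, 10, 20 } and ends { 5, 12, 20 }, Contains returns true for 3 and 12, and false for 7 and 21.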

What about doing:
list.Distinct().Count() != list.Count()
I wonder about the performance of this. I think it would be as good as O(n) but with less code and still easily readable.

In C#, is there a kind of a SortedList<double> that allows fast querying (with LINQ) for the nearest value?

I am looking for a structure that holds a sorted set of double values. I want to query this set to find the closest value to a specified reference value.
I have looked at SortedList<double, double>, and it does quite well for me. However, since I do not need explicit key/value pairs, this seems to be overkill to me, and I wonder if I could do it faster.
Conditions:
The structure is initialised only once, and never changes (no inserts/deletes)
The amount of values is in the range of 100k.
The structure is queried often with new references, which must execute fast.
For simplicity and speed, the set's value just below of the reference may be returned, not actually the nearest value
I want to use LINQ for the query, if possible, for simplicity of code.
I want to use no 3rd party code if possible. .NET 3.5 is available.
Speed is more important than memory footprint
I currently use the following code, where SortedValues is the aforementioned SortedList
IEnumerable<double> nearest = from item in SortedValues.Keys
                              where item <= suggestion
                              select item;
return nearest.ElementAt(nearest.Count() - 1);
Can I do faster?
Also I am not 100% percent sure, if this code is really safe. IEnumerable, the return type of my query is not by definition sorted anymore. However, a Unit test with a large test data base has shown that it is in practice, so this works for me. Have you hints regarding this aspect?
P.S. I know that there are many similar questions, but none actually answers my specific needs. Especially there is this one C# Data Structure Like Dictionary But Without A Value, but the questioner does just want to check the existence not find anything.

The way you are doing it is incredibly slow as it must search from the beginning of the list each time giving O(n) performance.
A better way is to put the elements into a List and then sort the list. You say you don't need to change the contents once initialized, so sorting once is enough.
Then you can use List<T>.BinarySearch to find elements or to find the insertion point of an element if it doesn't already exist in the list.
From the docs:
Return Value
The zero-based index of item in the sorted List<T>, if item is found; otherwise, a negative number that is the bitwise complement of the index of the next element that is larger than item or, if there is no larger element, the bitwise complement of Count.
Once you have the insertion point, you need to check the elements on either side to see which is closest.
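A sketch of how that could fit together, assuming the list is already sorted ascending and non-empty. NearestFinder and Nearest are made-up names, and ties are resolved toward the lower value, which is an arbitrary choice.

```csharp
using System;
using System.Collections.Generic;

public static class NearestFinder
{
    // sorted must be in ascending order and contain at least one element.
    public static double Nearest(List<double> sorted, double target)
    {
        if (sorted.Count == 0)
            throw new InvalidOperationException("The list is empty.");

        int index = sorted.BinarySearch(target);
        if (index >= 0)
            return sorted[index]; // exact hit

        int next = ~index; // index of the first element larger than target, or Count
        if (next == 0)
            return sorted[0]; // target is below everything
        if (next == sorted.Count)
            return sorted[sorted.Count - 1]; // target is above everything

        double below = sorted[next - 1];
        double above = sorted[next];
        return (target - below) <= (above - target) ? below : above;
    }
}
```

For a list containing 1, 4 and 9, Nearest(list, 5) returns 4 and Nearest(list, 7) returns 9. If the value just below the reference is good enough, as the question allows, you can return sorted[next - 1] and skip the comparison.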

Might not be useful to you right now, but .Net 4 has a SortedSet class in the BCL.

I think it can be more elegant as follows:
In case your items are not sorted:
double nearest = values.OrderBy(x => x.Key).Last(x => x.Key <= requestedValue).Key;
In case your items are sorted, you may omit the OrderBy call...

How can I sort an array of strings?

I have a list of input words separated by commas. I want to sort these words alphabetically and by length. How can I do this without using the built-in sorting functions?

Good question!! Sorting is probably the most important concept to learn as an up-and-coming computer scientist.
There are actually lots of different algorithms for sorting a list.
When you break all of those algorithms down, the most fundamental operation is the comparison of two items in the list, defining their "natural order".
For example, in order to sort a list of integers, I'd need a function that tells me, given any two integers X and Y whether X is less than, equal to, or greater than Y.
For your strings, you'll need the same thing: a function that tells you which of the strings has the "lesser" or "greater" value, or whether they're equal.
Traditionally, these "comparator" functions look something like this:
int CompareStrings(String a, String b) {
    if (a < b)
        return -1;
    else if (a > b)
        return 1;
    else
        return 0;
}
I've left out some of the details (like, how do you compute whether a is less than or greater than b? clue: iterate through the characters), but that's the basic skeleton of any comparison function. It returns a value less than zero if the first element is smaller and a value greater than zero if the first element is greater, returning zero if the elements have equal value.
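Filling in the skipped detail, here is a sketch of a comparator that iterates through the characters. It compares raw char values (an ordinal comparison), which is an assumption on my part; it ignores culture rules and treats upper-case letters as smaller than lower-case ones.

```csharp
using System;

public static class StringComparison1
{
    // Returns -1, 0 or 1 depending on whether a sorts before, equal to, or after b.
    public static int CompareStrings(string a, string b)
    {
        int shared = Math.Min(a.Length, b.Length);
        for (int i = 0; i < shared; i++)
        {
            if (a[i] < b[i]) return -1;
            if (a[i] > b[i]) return 1;
        }

        // All shared characters are equal, so the shorter string sorts first.
        if (a.Length < b.Length) return -1;
        if (a.Length > b.Length) return 1;
        return 0;
    }
}
```

So CompareStrings("apple", "apply") is -1, CompareStrings("app", "apple") is -1, and CompareStrings("pear", "pear") is 0.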
But what does that have to do with sorting?
A sort routine will call that function for pairs of elements in your list, using the result of the function to figure out how to rearrange the items into a sorted list. The comparison function defines the "natural order", and the "sorting algorithm" defines the logic for calling and responding to the results of the comparison function.
Each algorithm is like a big-picture strategy for guaranteeing that ANY input will be correctly sorted. Here are a few of the algorithms that you'll probably want to know about:
Bubble Sort:
Iterate through the list, calling the comparison function for all adjacent pairs of elements. Whenever you get a result greater than zero (meaning that the first element is larger than the second one), swap the two values. Then move on to the next pair. When you get to the end of the list, if you didn't have to swap ANY pairs, then congratulations, the list is sorted! If you DID have to perform any swaps, go back to the beginning and start over. Repeat this process until there are no more swaps.
NOTE: this is usually not a very efficient way to sort a list, because in the worst cases, it might require you to scan the whole list as many as N times, for a list with N elements.
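The bubble sort steps above can be sketched like this. It uses string.CompareOrdinal as the comparison function purely for brevity; you could swap in your own comparator. BubbleSorter is a made-up name.

```csharp
using System;

public static class BubbleSorter
{
    // Sorts the array in place.
    public static void Sort(string[] items)
    {
        bool swapped;
        do
        {
            swapped = false;
            for (int i = 0; i + 1 < items.Length; i++)
            {
                // A result greater than zero means the left element is larger.
                if (string.CompareOrdinal(items[i], items[i + 1]) > 0)
                {
                    string temp = items[i];
                    items[i] = items[i + 1];
                    items[i + 1] = temp;
                    swapped = true;
                }
            }
        } while (swapped); // a full pass with no swaps means the list is sorted
    }
}
```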
Merge Sort:
This is one of the most popular divide-and-conquer algorithms for sorting a list. The basic idea is that, if you have two already-sorted lists, it's easy to merge them. Just start from the beginning of each list and remove the first element of whichever list has the smallest starting value. Repeat this process until you've consumed all the items from both lists, and then you're done!
1 4 8 10
2 5 7 9
------------ becomes ------------>
1 2 4 5 7 8 9 10
But what if you don't have two sorted lists? What if you have just one list, and its elements are in random order?
That's the clever thing about merge sort. You can break any single list into smaller pieces, each of which is either an unsorted list, a sorted list, or a single element (which, if you think about it, is actually a sorted list, with length = 1).
So the first step in a merge sort algorithm is to divide your overall list into smaller and smaller sublists. At the tiniest levels (where each list only has one or two elements), they're very easy to sort. And once sorted, it's easy to merge any two adjacent sorted lists into a larger sorted list containing all the elements of the two sublists.
NOTE: This algorithm is much better than the bubble sort method, described above, in terms of its worst-case-scenario efficiency. I won't go into a detailed explanation (which involves some fairly trivial math, but would take some time to explain), but the quick reason for the increased efficiency is that this algorithm breaks its problem into ideal-sized chunks and then merges the results of those chunks. The bubble sort algorithm tackles the whole thing at once, so it doesn't get the benefit of "divide-and-conquer".
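A compact sketch of merge sort along those lines. It splits all the way down to single elements, allocates new arrays instead of sorting in place, and again borrows string.CompareOrdinal as the comparison function; those are simplifications chosen for readability. MergeSorter is a made-up name.

```csharp
using System;

public static class MergeSorter
{
    // Returns a sorted array; the input array is not modified.
    public static string[] Sort(string[] items)
    {
        if (items.Length <= 1)
            return items; // zero or one element is already sorted

        int mid = items.Length / 2;
        var left = new string[mid];
        var right = new string[items.Length - mid];
        Array.Copy(items, 0, left, 0, mid);
        Array.Copy(items, mid, right, 0, right.Length);

        return Merge(Sort(left), Sort(right));
    }

    // Repeatedly takes the smaller front element of two sorted arrays.
    private static string[] Merge(string[] a, string[] b)
    {
        var result = new string[a.Length + b.Length];
        int i = 0, j = 0, k = 0;
        while (i < a.Length && j < b.Length)
            result[k++] = string.CompareOrdinal(a[i], b[j]) <= 0 ? a[i++] : b[j++];
        while (i < a.Length) result[k++] = a[i++];
        while (j < b.Length) result[k++] = b[j++];
        return result;
    }
}
```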
Those are just two algorithms for sorting a list, but there are a lot of other interesting techniques, each with its own advantages and disadvantages: Quick Sort, Radix Sort, Selection Sort, Heap Sort, Shell Sort, and Bucket Sort.
The internet is overflowing with interesting information about sorting. Here's a good place to start:
http://en.wikipedia.org/wiki/Sorting_algorithms

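Create a console application and paste this into Program.cs as the body of the class. The class must be declared static for the extension method to compile, and the file needs using directives for System and System.Linq.
public static void Main(string[] args)
{
    string[] strList = "a,b,c,d,e,f,a,a,b".Split(new[] { ',' }, StringSplitOptions.RemoveEmptyEntries);
    foreach (string s in strList.Sort())
        Console.WriteLine(s);
}

// Extension method: returns a new, sorted array and leaves the input untouched.
public static string[] Sort(this string[] strList)
{
    return strList.OrderBy(i => i).ToArray();
}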
Notice that I do use a built in method, OrderBy. As other answers point out there are many different sort algorithms you could implement there and I think my code snippet does everything for you except the actual sort algorithm.
Some C# specific sorting tutorials

There is an entire area of study built around sorting algorithms. You may want to choose a simple one and implement it.
Though it won't be the most performant, it shouldn't take you too long to implement a bubble sort.

If you don't want to use built-in functions, you have to create one yourself. I would recommend bubble sort or some similar algorithm. Bubble sort is not an efficient algorithm, but it gets the job done and is easy to understand.
You will find much good reading on wikipedia.

I would recommend looking up quicksort on Wikipedia.
Still not sure why you don't want to use the built in sort?

Bubble sort damages the brain.
Insertion sort is at least as simple to understand and code, and is actually useful in practice (for very small data sets, and nearly-sorted data). It works like this:
Suppose that the first n items are already in order (you can start with n = 1, since obviously one thing on its own is "in the correct order").
Take the (n+1)th item in your array. Call this the "pivot". Starting with the nth item and working down:
- if it is bigger than the pivot, move it one space to the right (to create a "gap" to the left of it).
- otherwise, leave it in place, put the "pivot" one space to the right of it (that is, in the "gap" if you moved anything, or where it started if you moved nothing), and stop.
Now the first n+1 items in the array are in order, because the pivot is to the right of everything smaller than it, and to the left of everything bigger than it. Since you started with n items in order, that's progress.
Repeat, with n increasing by 1 at each step, until you've processed the whole list.
This corresponds to one way that you might physically put a series of folders into a filing cabinet in order: put one in; then put another one into its correct position by pushing everything that belongs after it over by one space to make room; repeat until finished. Nobody ever sorts physical objects by bubble sort, so it's a mystery to me why it's considered "simple".
All that's left now is that you need to be able to work out, given two strings, whether the first is greater than the second. I'm not quite sure what you mean by "alphabetical and length": alphabetical order is done by comparing one character at a time from each string. If they're not the same, that's your order. If they are the same, look at the next one, unless you're out of characters in one of the strings, in which case that's the one that's "smaller".
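The insertion sort described above could be sketched as follows. string.CompareOrdinal stands in for the comparison you would write yourself, and InsertionSorter is a made-up name.

```csharp
using System;

public static class InsertionSorter
{
    // Sorts the array in place.
    public static void Sort(string[] items)
    {
        // Invariant: items[0..n-1] are already in order.
        for (int n = 1; n < items.Length; n++)
        {
            string pivot = items[n];
            int j = n - 1;

            // Shift every larger item one space to the right, opening a gap.
            while (j >= 0 && string.CompareOrdinal(items[j], pivot) > 0)
            {
                items[j + 1] = items[j];
                j--;
            }

            // Drop the pivot into the gap (or back where it started if nothing moved).
            items[j + 1] = pivot;
        }
    }
}
```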

Use NSort
I ran across the NSort library a couple of years ago in the book Windows Developer Power Tools. The NSort library implements a number of sorting algorithms. The main advantage to using something like NSort over writing your own sorting is that it is already tested and optimized.

Posting link to fast string sort code in C#:
http://www.codeproject.com/KB/cs/fast_string_sort.aspx
Another point:
The suggested comparator above is not recommended for non-English languages:
int CompareStrings(String a, String b) {
    if (a < b)
        return -1;
    else if (a > b)
        return 1;
    else
        return 0;
}
Check out this link for non-English language sort:
http://msdn.microsoft.com/en-us/goglobal/bb688122
And as mentioned, use nsort for really gigantic arrays that don't fit in memory.
