Building a sorted dictionary using ToDictionary - c#

I'm not an expert in C# and LINQ.
I have a Dictionary, which I understand is a hash table, that is, the keys are not kept sorted.
dataBase = new Dictionary<string, Record>();
Record is a user-defined class that holds a number of data fields for a given key string.
I found an interesting example that converts this Dictionary into a sorted dictionary by LINQ:
var sortedDict = (from entry in dataBase orderby entry.Key ascending select entry)
.ToDictionary(pair => pair.Key, pair => pair.Value);
This code works correctly. The resulting sortedDict is sorted by keys.
Question: I found that sortedDict is still a hash table, of type:
System.Collections.Generic.Dictionary<string, Record>
I expected the resulting dictionary to be a sort of map as in the C++ STL, which is generally implemented as a (balanced) binary tree to maintain the ordering of the keys. However, the resulting dictionary is still a hash table.
How can sortedDict maintain the ordering? A hash table can't preserve the ordering of its keys. Is C#'s implementation of Generic.Dictionary something other than a typical hash table?

Dictionary maintains two data structures: a flat array that's kept in insertion order for enumeration, and the hash table for retrieval by key.
If you use ToDictionary() on a sorted sequence, the result will be in order when enumerated, but that order won't be maintained: any newly inserted item will appear at the back when enumerating.
Edit: if you want to rely on this behaviour, I would recommend checking the MSDN docs to see whether it is guaranteed or just incidental.
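A minimal sketch of that behaviour with hypothetical data (note that the enumeration order of Dictionary is an implementation detail, not a documented guarantee):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

var dataBase = new Dictionary<string, int>
{
    ["banana"] = 2, ["cherry"] = 3, ["apple"] = 1
};

// Building a new dictionary from a sorted sequence inserts the entries
// in sorted order, so enumeration happens to come out sorted too.
var sortedDict = dataBase.OrderBy(e => e.Key)
                         .ToDictionary(e => e.Key, e => e.Value);
Console.WriteLine(string.Join(", ", sortedDict.Keys));

// A later insert is appended, not placed in key order.
sortedDict["aardvark"] = 0;
Console.WriteLine(string.Join(", ", sortedDict.Keys));
```

On current runtimes the second line prints aardvark last, even though it sorts first alphabetically; that is exactly the "incidental" ordering the answer warns about.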

SortedDictionary has a constructor that takes an existing Dictionary, so making a SortedDictionary is very easy.
You can also wrap it in an extension method if you want, so that you can write dataBase.ToSortedDictionary():
public static SortedDictionary<K, V> ToSortedDictionary<K, V>(this Dictionary<K, V> existing)
{
    return new SortedDictionary<K, V>(existing);
}

The LINQ code looks like it builds a sorted dictionary, but the sorting is done by LINQ, not by the dictionary itself, whereas a SortedDictionary maintains the sort order itself.
To get a sorted dictionary, use new SortedDictionary<string, Record>(yourNormalDictionary);
If you want to make it more accessible, you can write an extension method on IEnumerable:
public static class Extensions
{
    public static SortedDictionary<T1, T2> ToSortedDictionary<T1, T2>(this IEnumerable<T2> source, Func<T2, T1> keySelector)
    {
        return new SortedDictionary<T1, T2>(source.ToDictionary(keySelector));
    }
}
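A quick sketch of the difference this makes (toy data): unlike the LINQ-sorted Dictionary, a SortedDictionary stays sorted as items are added later.

```csharp
using System;
using System.Collections.Generic;

var dataBase = new Dictionary<string, int> { ["b"] = 2, ["a"] = 1 };

// The constructor copies the entries into a structure ordered by key.
var sorted = new SortedDictionary<string, int>(dataBase);

// New items are inserted in key order, not appended at the end.
sorted["0"] = 0;
Console.WriteLine(string.Join(", ", sorted.Keys)); // 0, a, b
```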


Implementing hash table with both key and index-based access in O(1)

There is a data structure called NameObjectCollectionBase in .NET which I'm trying to understand.
Basically, it lets you enter arbitrary string => object key/value pairs, where both the key and the value may be null. A key may be used by multiple objects. Access is provided both by index and by string key, where string-based access returns only the first value with the specified key.
What they promise is:
add(string, object)     O(1) if no relocation, O(n) otherwise
clear                   O(1)
get(int)                O(1), corresponds to getkey(int)
get(string)             O(1), returns the first object found with the given key
getallkeys              O(n); if objects share a key, it is returned that many times
getallvalues            O(n)
getallvalues(type)      O(n), returns only objects of a given type
getkey(int)             O(1), corresponds to get(int)
haskeys                 O(1), whether there are objects with a non-null key
remove(string)          O(n), removes all objects with a given key
removeat(int)           O(n)
set(int, object)        O(1)
set(string, object)     O(1), sets the value of the first found object with the given key
getenumerator           O(1), enumerator over the keys
copyto(array, int)      O(n)
Index-based access has nothing to do with the insertion order. However, get(int) and getkey(int) have to line up with each other.
I'm wondering how the structure may be implemented. Allowing both index- and key-based access at the same time in O(1) does not seem trivial to implement. They state on the MSDN page that "The underlying structure for this class is a hash table." However, C#'s hash tables don't allow multiple values per key, and also don't allow null keys.
Implementing it as a Dictionary<string, List<object>> does not seem to be the solution: get(string) would be O(1), but get(int) would not, since you would have to traverse all the keys to find out how many items each key holds.
Implementing it as two separate lists, a simple List<string> for the keys and a List<object> for the values, combined with a Dictionary<string, int> that points from each key to the index of its first value, would allow both types of access in O(1). But it would not allow efficient removal, since all the indices in the hash table would have to be updated (possible in O(n), but that doesn't seem to be the best solution). Or is there a more efficient way to remove an entry?
How could such a data structure be implemented?
NameObjectCollectionBase uses both a Hashtable and an ArrayList to manage the entries. Have a look for yourself!
Microsoft provides reference source code for the .NET libraries, which can be integrated into Visual Studio:
http://referencesource.microsoft.com/
You can even debug the .NET library:
http://msdn.microsoft.com/en-us/library/cc667410(VS.90).aspx
Or you can grab a copy of dotPeek, a free decompiler:
http://www.jetbrains.com/decompiler/
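A minimal sketch of the idea (NameObjectSketch is a made-up name, and this is not the actual reference-source code): pair a list, which gives O(1) index-based access, with a dictionary mapping each key to the index of its first entry, which gives O(1) key-based access.

```csharp
using System;
using System.Collections.Generic;

var c = new NameObjectSketch();
c.Add("k", "v1");
c.Add("k", "v2");                    // a key may be used by multiple entries
c.Add(null, "v3");                   // null keys are allowed
Console.WriteLine(c.Get("k"));       // first value stored under "k": v1
Console.WriteLine((string)c.Get(1)); // by position: v2

class NameObjectSketch
{
    // Entries in insertion order: index-based access is O(1).
    private readonly List<KeyValuePair<string, object>> _entries = new();
    // Key -> index of the first entry with that key: key-based access is O(1).
    private readonly Dictionary<string, int> _firstIndexOfKey = new();

    public void Add(string key, object value)
    {
        // Dictionary disallows null keys, so only non-null keys are indexed;
        // null-keyed entries still live in the list.
        if (key != null && !_firstIndexOfKey.ContainsKey(key))
            _firstIndexOfKey[key] = _entries.Count;
        _entries.Add(new KeyValuePair<string, object>(key, value));
    }

    public object Get(int index) => _entries[index].Value;
    public string GetKey(int index) => _entries[index].Key;
    public object Get(string key) =>
        _firstIndexOfKey.TryGetValue(key, out var i) ? _entries[i].Value : null;

    // Removal is the expensive case: later indices shift, so the
    // key-to-index map must be rebuilt, which is O(n). That matches
    // the documented cost of remove(string) and removeat(int).
}
```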

Filling a multi-value dictionary with query results

I am retrieving five columns through an SQL query. Among the columns retrieved, I have a column RecordID which should act as a key to a dictionary.
I am referring to the solution posted in C# Multi-Value Dictionary on Stack Overflow, but I am not able to use it effectively in my situation.
I want to store all the rows of my query but the RecordID column should also act as a key to the dictionary element. I want something like:
Dictionary<RecordID, Entire columns of the current row for this RecordID>
An alternative I think is to use an array, something like:
Dictionary<key,string[]>
But I want whatever way is fastest.
Use a Lookup - you haven't said much about the data, but you may need something as simple as:
var lookup = list.ToLookup(x => x.RecordID);
There are, of course, overloads for ToLookup which allow you to do other things.
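A small sketch with a hypothetical row type (Row and its Value property are stand-ins for your actual five columns):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

var list = new List<Row>
{
    new Row { RecordID = 1, Value = "a" },
    new Row { RecordID = 1, Value = "b" },
    new Row { RecordID = 2, Value = "c" },
};

// A Lookup is a multi-value dictionary built in one pass over the data.
var lookup = list.ToLookup(x => x.RecordID);

// lookup[1] yields both rows with RecordID 1; a missing key yields
// an empty sequence rather than throwing.
foreach (var row in lookup[1])
    Console.WriteLine(row.Value);

class Row
{
    public int RecordID { get; set; }
    public string Value { get; set; }
}
```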
For goodness' sake, you need a dedicated class to hold the values. Since the columns come from a database table, the number of fields is fixed, so you can use a class with fixed fields rather than a dynamically growing collection. Then use a KeyedCollection<TKey, TItem> to hold the collection of records. In a KeyedCollection<TKey, TItem>, the TKey part is embedded in TItem; in your case it's RecordID.
class Record
{
    public int Id { get; set; }
    // rest of the fields
}
Your structure would then be a KeyedCollection<int, Record>. It preserves insertion order as well, and you can query the collection by RecordID via the indexer, just like a dictionary.
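One detail worth noting: KeyedCollection<TKey, TItem> is abstract, so you subclass it and override GetKeyForItem to tell it where the key lives. A minimal sketch (Record and RecordCollection are illustrative names):

```csharp
using System;
using System.Collections.ObjectModel;

var records = new RecordCollection
{
    new Record { Id = 42, Name = "first" },
    new Record { Id = 7,  Name = "second" }
};

// With an int key, the indexer performs key-based lookup, like a dictionary.
Console.WriteLine(records[42].Name); // first

class Record
{
    public int Id { get; set; }
    public string Name { get; set; }
    // ...rest of the columns
}

class RecordCollection : KeyedCollection<int, Record>
{
    // Tells the collection which property of the item is the key.
    protected override int GetKeyForItem(Record item) => item.Id;
}
```

Note the gotcha when TKey is int: the key-based indexer hides the positional one inherited from Collection<T>, so positional access requires a cast to Collection<Record>.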

Huge in-memory set of data. Need a fast search by integer Id property

I have a huge in-memory set (like ~100K records) of plain CLR objects of a defined type. This type has a public property int Id { get; set; }. What is the best .NET structure to hold this huge data set so as to provide quick access to any item by its Id? More specifically, the data set is operated on inside a loop that finds items by Id, so the search should be as fast as possible. The search might look like this:
// Find by id
var entity = entities.First(e => e.Id == id);
IEnumerable-based structures like collections and lists go through every element of the data until the sought element is found. What are the alternatives? I believe there should be a way to search a sorted array by Id, like an index search in a database.
Thanks
Results of testing: FYI, the Dictionary is not just fast, it's incomparable. My small test showed a performance gain from around 3000+ ms (calling First() on an IEnumerable) down to roughly 0 ms (the indexer on a Dictionary)!
I would go with a Dictionary<TKey, TValue>:
var index = new System.Collections.Generic.Dictionary<int, T>();
where T is the type of objects that you want to access.
This is implemented as a hash table, i.e. looking up an item is done by computing the key's hash value (usually a very quick operation) and using that hash value as an index into a table. That's perhaps an over-simplification, but with a dictionary it almost doesn't matter how many entries you have stored: access time should stay approximately constant.
To add an entry, do index.Add(entity.Id, entity);
To check whether an item is in the collection, index.ContainsKey(id).
To retrieve an item by ID, index[id].
Dictionary<TKey, TValue>, where TKey is int and TValue is YourEntity.
Example
var dictionary = new Dictionary<int, YourEntity>();
dictionary.Add(obj1.Id, obj1);
// continue
Or if you have a collection of objects, you can create the dictionary using a query
var dictionary = list.ToDictionary(obj => obj.Id, obj => obj);
Note: key values must be unique. If your collection contains duplicate keys, filter the duplicates out before creating the dictionary. Alternatively, if you're looping over the collection to create the dictionary manually, check ContainsKey before attempting each Add.
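A sketch of that pitfall with a hypothetical Entity type: ToDictionary throws an ArgumentException on a repeated key, and Distinct() compares whole objects rather than Ids, so grouping by Id is one way to keep a single item per key.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

var list = new List<Entity>
{
    new Entity { Id = 1 }, new Entity { Id = 1 }, new Entity { Id = 2 }
};

// list.ToDictionary(e => e.Id) would throw here because Id 1 repeats.
// Grouping first keeps one entity per Id:
var dictionary = list
    .GroupBy(e => e.Id)
    .ToDictionary(g => g.Key, g => g.First());

Console.WriteLine(dictionary.Count); // 2

class Entity
{
    public int Id { get; set; }
}
```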
Generally in-memory seek is best done with the Dictionary:
System.Collections.Generic.Dictionary<TKey, TValue>
Optionally when your data set no longer fits in memory, one would use disk-based btree.
Based on the information given, a hash-based structure is probably going to be the fastest. The Dictionary<TKey, TValue> class provides the best trade-off between ease of use and performance. If you truly need maximum performance, try all of the following classes; they differ in memory usage, insert speed, and search speed:
ListDictionary
Hashtable
Dictionary
SortedDictionary
ConcurrentDictionary
In addition to performance, you may be concerned with multithreaded access. These two collections provide thread safety:
Hashtable (multiple readers, but only one thread allowed to write)
ConcurrentDictionary
It depends on your data. If there is a ceiling on the number of objects and not too many Ids are unused (meaning you can't have more than X objects and you usually have close to X of them), then a plain array indexed by Id is the fastest:
T[] itemList = new T[MAX_ITEMS];
However, if either of those two conditions doesn't hold, a Dictionary is probably the best option:
Dictionary<int, T> itemList = new Dictionary<int, T>();
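A sketch of the array option under those assumptions (Ids are non-negative, bounded by MAX_ITEMS, and reasonably dense; Entity is a stand-in for your type T):

```csharp
using System;

const int MAX_ITEMS = 100_000;

// The Id doubles as the array index, so a lookup is a single
// bounds-checked indexing operation, with no hashing at all.
var itemList = new Entity[MAX_ITEMS];

void Add(Entity e) => itemList[e.Id] = e;
Entity Find(int id) => itemList[id]; // null when no entity has that Id

Add(new Entity { Id = 123 });
Console.WriteLine(Find(123).Id);

class Entity
{
    public int Id { get; set; }
}
```

The obvious cost is memory: the array is sized for MAX_ITEMS slots whether or not they are used, which is why the density condition matters.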

Faking IGrouping for LINQ

Imagine you have a large dataset that may or may not be filtered by a particular condition of the dataset elements that can be intensive to calculate. In the case where it is not filtered, the elements are grouped by the value of that condition - the condition is calculated once.
However, in the case where the filtering has taken place, although the subsequent code still expects to see an IEnumerable<IGrouping<TKey, TElement>> collection, it doesn't make sense to perform a GroupBy operation that would result in the condition being re-evaluated a second time for each element. Instead, I would like to be able to create an IEnumerable<IGrouping<TKey, TElement>> by wrapping the filtered results appropriately, and thus avoiding yet another evaluation of the condition.
Other than implementing my own class that provides the IGrouping interface, is there any other way I can implement this optimization? Are there existing LINQ methods to support this that would give me the IEnumerable<IGrouping<TKey, TElement>> result? Is there another way that I haven't considered?
the condition is calculated once
I hope those keys are still around somewhere...
If your data was in some structure like this:
public class CustomGroup<T, U>
{
    public T Key { get; set; }
    public IEnumerable<U> GroupMembers { get; set; }
}
You could project such items with a query like this:
var result = customGroups
    .SelectMany(cg => cg.GroupMembers, (cg, z) => new { Key = cg.Key, Value = z })
    .GroupBy(x => x.Key, x => x.Value);
Inspired by David B's answer, I have come up with a simple solution. So simple that I have no idea how I missed it.
In order to perform the filtering, I obviously need to know what value of the condition I am filtering by. Therefore, given a condition, c, I can just project the filtered list as:
filteredList.GroupBy(x => c)
This avoids any recalculation of the condition for the elements (represented by x).
Another solution I realized would work is to reverse the ordering of my query and perform the grouping before the filtering. That too would evaluate the condition only once, although it would allocate groupings that I wouldn't subsequently use.
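A minimal sketch of the constant-key trick with toy data (an int list filtered by evenness stands in for the expensive condition):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

var data = new List<int> { 1, 2, 3, 4, 5, 6 };

// Suppose the expensive condition evaluated to true for even numbers,
// and the list has already been filtered accordingly:
bool c = true;
var filteredList = data.Where(x => x % 2 == 0);

// Every surviving element shares the known condition value, so the
// key selector just returns the constant; nothing is recalculated.
IEnumerable<IGrouping<bool, int>> groups = filteredList.GroupBy(x => c);

foreach (var g in groups)
    Console.WriteLine($"{g.Key}: {string.Join(", ", g)}"); // True: 2, 4, 6
```

The result is a single group whose key is the condition value, which satisfies downstream code expecting IEnumerable<IGrouping<TKey, TElement>>.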
What about putting the result into a Lookup and using that for the rest of the time?
var lookup = data.ToLookup(i => Foo(i));

NHibernate - Getting the results as an IDictionary instead of IList

I'm using NHibernate as a persistence layer, and in many places in my code I need to retrieve all the columns of a specific table (to show in a grid, for example), but I also need a fast way to get a specific item from this collection.
The ICriteria API lets me get the query result either as a unique value of T or as an IList of T.
I wonder if there is a way to make NHibernate give me those objects as an IDictionary where the key is the object's Id and the value is the object itself. Doing it myself would mean iterating over the original list, which is not very scalable.
Thank you.
If you are working with .NET 3.5, you could use the Enumerable() method from IQuery, then the IEnumerable<T>.ToDictionary() extension method:
var dictionary = query.Enumerable().ToDictionary(r => r.Id);
This way, the list is not iterated over twice.
You mention using ICriteria, but it does not provide a way to lazily enumerate over items, whereas IQuery does.
However, if the number of items return by your query is too big, you might want to consider querying the database with the key you'd have used against the IDictionary instance.
