I have recently seen a new trend at my firm where we change an IEnumerable to a dictionary with a simple LINQ transformation as follows:
enumerable.ToDictionary(x=>x);
We mostly end up doing this when the operation on the collection is a Contains/access, and obviously a dictionary performs better in such cases.
But I realise that converting the enumerable to a dictionary has its own cost, and I am wondering at what point it starts to break even (if it does), i.e. when the performance of IEnumerable Contains/access equals that of ToDictionary plus the dictionary access/Contains.
I should add that there is no database access involved; the enumerable might be created from a database query and that's it, and the enumerable may be edited after that too.
Also it would be interesting to know how does the datatype of the key affect the performance?
The lookup might happen 2-5 times generally, but sometimes only once. And I have seen things like:
For an enumerable:
var element = enumerable.SingleOrDefault(x => x.Id == id);
// do something if element is null, or return
For a dictionary:
if (dictionary.ContainsKey(id))
// do something if false, else return
This has been bugging me for quite some time now.
Performance of Dictionary Compared to IEnumerable
A Dictionary, when used correctly, is always faster to read from (except in cases where the data set is very small, e.g. 10 items). There can be overhead when creating it.
Given m as the number of lookups performed against the same collection (these are approximate):
Performance of an IEnumerable (created from a clean list): O(mn)
This is because you need to look at all the items each time (essentially m * O(n)).
Performance of a Dictionary: O(n) + m · O(1), i.e. O(m + n)
This is because you need to insert the items first (O(n)), after which each lookup is O(1).
In general it can be seen that the Dictionary wins when m > 1, and the IEnumerable wins when m = 1 or m = 0.
In general you should:
Use a Dictionary when doing the lookup more than once against the same dataset.
Use an IEnumerable when doing the lookup only once.
Use an IEnumerable when the data-set could be too large to fit into memory.
Keep in mind a SQL table can be used like a Dictionary, so you could use that to offset the memory pressure.
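To make the break-even analysis above concrete, here is a minimal sketch (the values of n and m are arbitrary, and the integers act as their own keys, mirroring the ToDictionary(x => x) from the question):

using System;
using System.Collections.Generic;
using System.Linq;

class BreakEvenSketch
{
    static void Main()
    {
        const int n = 10_000;   // size of the collection
        const int m = 100;      // number of lookups against it
        IEnumerable<int> enumerable = Enumerable.Range(0, n).ToList();
        int[] keysToFind = Enumerable.Range(0, m).Select(i => i * (n / m)).ToArray();

        // Option 1: look up directly against the IEnumerable.
        // Each Contains may scan up to n items, so m lookups cost roughly O(m * n).
        int hits1 = keysToFind.Count(key => enumerable.Contains(key));

        // Option 2: pay O(n) once to build the dictionary, then each
        // ContainsKey is O(1), so the total is roughly O(n + m).
        var dictionary = enumerable.ToDictionary(x => x);
        int hits2 = keysToFind.Count(key => dictionary.ContainsKey(key));

        Console.WriteLine($"{hits1} / {hits2} keys found");
    }
}

Wrapping each option in a Stopwatch and varying n and m will show the crossover point for your own data; as noted above, it tends to arrive as soon as m exceeds 1 for anything but tiny collections.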
Further Considerations
Dictionaries use GetHashCode() to organise their internal state. The performance of a Dictionary is strongly related to the hash code in two ways.
Poorly performing GetHashCode() - results in overhead every time an item is added, looked up, or deleted.
Low quality hash codes - results in the dictionary not having O(1) lookup performance.
Most built-in .NET types (especially the value types) have very good hashing algorithms. However, with list-like types (e.g. string) GetHashCode() has O(n) performance, because it needs to iterate over the whole string. Thus your dictionary's performance can really be seen as O(1) + M, where M is the big-O of GetHashCode() for the key type.
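As a hedged illustration of the hashing point (the OrderKey type and its fields are made up for this example), a small value-type key with a cheap, well-distributed GetHashCode() keeps dictionary operations close to O(1), whereas keying on a long string pays O(length) on every hash:

using System;

// Hypothetical composite key, used only to illustrate a cheap hash.
public readonly struct OrderKey : IEquatable<OrderKey>
{
    public OrderKey(int customerId, int orderId)
    {
        CustomerId = customerId;
        OrderId = orderId;
    }

    public int CustomerId { get; }
    public int OrderId { get; }

    public bool Equals(OrderKey other) =>
        CustomerId == other.CustomerId && OrderId == other.OrderId;

    public override bool Equals(object obj) => obj is OrderKey other && Equals(other);

    // Two int reads, a multiply and an xor: cheap and reasonably well distributed.
    public override int GetHashCode() => unchecked(CustomerId * 397 ^ OrderId);
}

Using a key like this in a Dictionary<OrderKey, TValue> avoids hashing a concatenated string such as $"{customerId}:{orderId}", whose GetHashCode() has to walk every character.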
It depends....
How long is the IEnumerable?
Does accessing the IEnumerable cause database access?
How often is it accessed?
The best thing to do would be to experiment and profile.
If you search for elements in your collection by some key very often, the Dictionary will definitely be faster, because it is a hash-based collection and lookups are many times faster. Otherwise, if you don't search the collection much, the conversion is not necessary, because the time spent converting may be greater than one or two searches through the collection.
IMHO you need to measure this in your environment with representative data. In such cases I just write a quick console app that measures the execution time of the code. To get a better measurement you need to execute the same code multiple times, I guess.
ADD:
It also depends on the application you are developing. Usually you gain more by spending that time and effort optimizing other places (avoiding network round trips, caching, etc.).
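To make the measurement approach above concrete, a minimal sketch of such a quick console app might look like this (the Item type, sizes and lookup counts are placeholders to adapt to your own data):

using System;
using System.Diagnostics;
using System.Linq;

class LookupTiming
{
    class Item { public int Id; public string Name; }   // placeholder element type

    static void Main()
    {
        var items = Enumerable.Range(0, 50_000)
                              .Select(i => new Item { Id = i, Name = "name" + i })
                              .ToList();
        const int lookups = 1_000;

        var sw = Stopwatch.StartNew();
        for (int i = 0; i < lookups; i++)
        {
            var found = items.FirstOrDefault(x => x.Id == i);   // O(n) per lookup
        }
        Console.WriteLine($"List scan:  {sw.ElapsedMilliseconds} ms");

        sw.Restart();
        var byId = items.ToDictionary(x => x.Id);               // one-off O(n) cost
        for (int i = 0; i < lookups; i++)
        {
            byId.TryGetValue(i, out var hit);                   // O(1) per lookup
        }
        Console.WriteLine($"Dictionary: {sw.ElapsedMilliseconds} ms");
    }
}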
I'll add that you haven't told us what happens every time you "rewind" your IEnumerable<>. Is it directly backed by a data collection (for example a List<>), or is it calculated "on the fly"? If it's the first, and for small collections, enumerating it to find the wanted element is faster (a Dictionary for 3-4 elements is useless; if you want I can build some benchmark to find the breaking point). If it's the second, then you have to consider whether "caching" the IEnumerable<> in a collection is a good idea. If it is, then you can choose between a List<> and a Dictionary<>, and we return to point 1: is the IEnumerable small or big? And there is a third problem: if the collection isn't backed, but is too big for memory, then clearly you can't put it in a Dictionary<>. Then perhaps it's time to make SQL work for you :-)
I'll add that "failures" have their cost: in a List<> if you try to find an element that doesn't exist, the cost is O(n), while in a Dictionary<> the cost is still O(1).
Related
var usedIds = list.Count > 20 ? new HashSet<int>() as ICollection<int> : new List<int>();
Assuming that List is more performant with 20 or fewer items and HashSet is more performant with a greater item count (from this post), is it an efficient approach to use different collection types dynamically based on the predictable item count?
All of the actions for each of the collection types will be the same.
PS: I have also found the HybridCollection class, which seems to do the same thing automatically, but I've never used it so I have no info on its performance either.
EDIT: My collection is mostly used as a buffer with many inserts and gets.
In theory, it could be, depending on how many and what type of operations you are performing on the collections. In practice, it would be a pretty rare case where such micro-optimization would justify the added complexity.
Also consider what type of data you are working with. If you are using int as the collection item as the first line of your question suggests, then the threshold is going to be quite a bit less than 20 where List is no longer faster than HashSet for many operations.
In any case, if you are going to do that, I would create a new collection class to handle it, something along the lines of the HybridDictionary, and expose it to your user code with some generic interface like IDictionary.
And make sure you profile it to be sure that your use case actually benefits from it.
There may even be a better option than either of those collections, depending on what exactly it is you are doing. i.e. if you are doing a lot of "before or after" inserts and traversals, then LinkedList might work better for you.
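If you do decide to experiment with such a wrapper, a rough sketch might look like the following (the name ThresholdSet and the threshold of 20 are made up; profile before adopting anything like this):

using System.Collections;
using System.Collections.Generic;

// Hypothetical collection that starts life as a List<T> and switches to a
// HashSet<T> once it grows past a threshold. Sketch only.
public class ThresholdSet<T> : IEnumerable<T>
{
    private const int Threshold = 20;
    private List<T> _small = new List<T>();
    private HashSet<T> _large;

    public void Add(T item)
    {
        if (_large != null) { _large.Add(item); return; }

        _small.Add(item);
        if (_small.Count > Threshold)
        {
            _large = new HashSet<T>(_small);   // one-off migration cost
            _small = null;
        }
    }

    public bool Contains(T item) =>
        _large != null ? _large.Contains(item) : _small.Contains(item);

    public IEnumerator<T> GetEnumerator() =>
        (_large != null ? (IEnumerable<T>)_large : _small).GetEnumerator();

    IEnumerator IEnumerable.GetEnumerator() => GetEnumerator();
}

Note that once it switches to the HashSet, duplicates and insertion order are no longer preserved.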
Hash-based collections like HashSet<T> and Dictionary<K,T> are faster at searching and inserting items in any order.
Arrays (T[]) are best used if you always have a fixed size and a lot of indexing operations. Adding items to an array is slower than adding to a list, due to the covariance of arrays in C#.
List<T> is best used for dynamically sized collections with indexing operations.
I don't think it is a good idea to write something like the hybrid collection; better to use a collection that matches your requirements. If you have a buffer with a lot of index-based operations I would not suggest a hash table, since, as somebody already noted, a hash table by design uses more memory.
HashSet is for faster access, but List is for inserts. If you don't plan on adding new items, use HashSet, otherwise List.
If your collection is very small then the performance is virtually always going to be a non-issue. If you know that n is always less than 20, O(n) is, by definition, O(1). Everything is fast for small n.
Use the data structure that most appropriately represents how you are conceptually treating the data, the type of operations you need to perform, and the type of operations that should be most efficient.
is it efficient approach to use different collection types dynamicaly based on the predictable items count?
It can be, depending on what you mean by "efficiency" (MS offers the HybridDictionary class for that, though unfortunately it is non-generic). But irrespective of that it's mostly a bad choice. I will explain both.
From an efficiency standpoint:
Addition will always be faster in a List<T>, since a HashSet<T> has to compute the hash code and store it. Even though removal and lookup will be faster with a HashSet<T> as the size grows, addition to the end is where List<T> wins. You will have to decide which is more important to you.
HashSet<T> will come up with a memory overhead compared to List<T>. See this for some illustration.
However, from a usability standpoint it need not make sense. A HashSet<T> is a set, unlike the bag that a List<T> is. They are very different, and their uses are very different. For example:
HashSet<T> cannot have duplicates.
HashSet<T> will not care about any order.
So when you return a hybrid ICollection<T>, your requirement goes like this: "It doesn't matter whether duplicates can be added or not. Sometimes let them be added, sometimes not. Of course iteration order is not important anyway" - very rarely useful.
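A quick illustration of those two differences (values are arbitrary; assumes using System and System.Collections.Generic):

var list = new List<int> { 3, 1, 3 };                 // a bag: duplicates kept, order preserved
var set = new HashSet<int>();
Console.WriteLine(set.Add(3));                        // True  - first insertion
Console.WriteLine(set.Add(3));                        // False - duplicate silently rejected
Console.WriteLine(set.Add(1));                        // True
Console.WriteLine(list.Count + " vs " + set.Count);   // prints "3 vs 2"
// Enumeration order of the set is an implementation detail; don't rely on it.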
Good q, and +1.
HashSet is better, because it will probably use less space, and you will have faster access to elements.
I'm asking for something that's a bit unusual, but here is my requirement (which is all a bit computation-intensive, and which I couldn't find addressed anywhere so far).
I need a collection of <TKey, TValue> of about 30 items. But the collection is used in massively nested foreach loops that would iterate possibly almost up to a billion times, seriously. The operations on the collection are trivial, something that would look like:
Dictionary<Position, Value> _cells = new Dictionary<Position, Value>();
_cells.Clear();
_cells.Add(Position.p1, v1);
_cells.Add(Position.p2, v2);
//etc
In short, nothing more than the addition of about 30 items and clearing of the collection. Also, the values will be read from somewhere else at some point. I need this reading/retrieval by key, so I need something along the lines of a Dictionary. Now, since I'm trying to squeeze out every ounce from the CPU, I'm looking for some micro-optimizations as well. For one, I do not require the collection to check whether a duplicate already exists while adding (this typically makes a dictionary slower compared to a List<T> for addition). I know I won't be passing duplicate keys.
Since Add method would do some checks, I tried this instead:
_cells[Position.p1] = v1;
_cells[Position.p2] = v2;
//etc
But this is still about 200 ms slower for about 10k iterations than a typical List<T> implementation like this:
List<KeyValuePair<Position, Value>> _cells = new List<KeyValuePair<Position, Value>>();
_cells.Add(new KeyValuePair<Position, Value>(Position.p1, v1));
_cells.Add(new KeyValuePair<Position, Value>(Position.p2, v2));
//etc
Now that could add up to a noticeable amount of time over the full iteration. Note that in the above case I read items from the list by index (which was OK for testing purposes). The problems with a regular List<T> for us are many, the main one being not being able to access an item by key.
My questions, in short, are:
Is there a custom collection class that would let access item by key, yet bypass the duplicate checking while adding? Any 3rd party open source collection would do.
Or else please point me to a good starting point on how to implement my own collection class from the IDictionary<TKey, TValue> interface.
Update:
I went with MiMo's suggestion and List was still faster. Perhaps it has to do with the overhead of creating the dictionary.
My suggestion would be to start with the source code of Dictionary<TKey, TValue> and change it to optimize for your specific situation.
You don't have to support removal of individual key/value pairs, which might help simplify the code. There also appear to be some checks on the validity of keys etc. that you could get rid of.
But this is still a few ms slower for about ten iterations than a typical List implementation like this
A few milliseconds slower for ten iterations of adding just 30 values? I don't believe that. Adding just a few values should take microscopic amounts of time, unless your hashing/equality routines are very slow. (That can be a real problem. I've seen code improved massively by tweaking the key choice to be something that's hashed quickly.)
If it's really taking milliseconds longer, I'd urge you to check your diagnostics.
But it's not surprising that it's slower in general: it's doing more work. For a list, it just needs to check whether or not it needs to grow the buffer, then write to an array element, and increment the size. That's it. No hashing, no computation of the right bucket.
Is there a custom collection class that would let access item by key, yet bypass the duplicate checking while adding?
No. The very work you're trying to avoid is what makes it quick to access by key later.
When do you need to perform a lookup by key, however? Do you often use collections without ever looking up a key? How big is the collection by the time you perform a key lookup?
Perhaps you should build a list of key/value pairs, and only convert it into a dictionary when you've finished writing and are ready to start looking up.
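A minimal sketch of that write-then-convert approach, reusing the Position/Value names from the question (the exact phases are assumptions; requires System.Linq for ToDictionary):

// Write phase: cheap appends, no hashing, no duplicate checks.
var cells = new List<KeyValuePair<Position, Value>>(32);
cells.Add(new KeyValuePair<Position, Value>(Position.p1, v1));
cells.Add(new KeyValuePair<Position, Value>(Position.p2, v2));
// ... roughly 30 additions ...

// Read phase: pay the hashing cost once, then key lookups are O(1).
var byPosition = cells.ToDictionary(kvp => kvp.Key, kvp => kvp.Value);
Value value = byPosition[Position.p1];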
I'm still quite new to C#, but I have noticed, through forum postings, the advantages of using a HashSet instead of a List in specific cases.
My current case isn't that I'm storing a tremendous amount of data in a single List exactly, but rather that I'm having to check whether it contains particular members often.
The catch is that I do indeed need to iterate over it as well, but the order they are stored or retrieved doesn't actually matter.
I've read that foreach loops are actually slower than for loops, so how else could I go about this in the fastest way possible?
The number of .Contains() checks I'm doing is definitely hurting my performance with lists, so at least comparing to the performance of a HashSet would be handy.
Edit: I'm currently using lists, iterating through them in numerous locations, and different code is being executed in each location. Most often, the current lists contain point coordinates that I then use to refer into a 2-dimensional array, on which I then do some operation or another based on the criteria of the list.
If there's not a direct answer to my question, that's fine, but I assumed there might be other methods of iterating over a HashSet than just a foreach loop. I'm currently in the dark as to what other methods there might even be, what advantages they provide, etc. Assuming there are other methods, I also made the assumption that there would be a typical preferred method of choice that is only ignored when it doesn't suit the needs (my needs are pretty basic).
As far as premature optimization goes, I already know using the lists as I am is a bottleneck. How to go about fixing this is where I'm getting stuck. Not even stuck exactly, but I didn't want to reinvent the wheel by testing repeatedly only to find out I'm already doing it the best way I can (this is a large project with over 3 months invested; lists are everywhere, but there are definitely ones where I do not want duplicates, that have a lot of data, that need not be stored in any specific order, etc.).
A foreach loop has a small amount of additional overhead on indexed collections (like an array).
This is mostly because the foreach does a little more bounds checking than a for loop.
HashSet does not have an indexer so you have to use the enumerator.
In this case foreach is efficient as it only calls MoveNext() as it moves through the collection.
Also Parallel.ForEach can dramatically improve your performance, depending on the work you are doing in the loop and the size of your HashSet.
As mentioned before profiling is your best bet.
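For illustration, both patterns over a HashSet look like this (the tuple contents and the per-item work are placeholders; Parallel.ForEach only pays off when that work is substantial):

using System;
using System.Collections.Generic;
using System.Threading.Tasks;

class Program
{
    static void Process((int X, int Y) p) => Console.WriteLine($"{p.X},{p.Y}");

    static void Main()
    {
        var points = new HashSet<(int X, int Y)> { (1, 2), (3, 4), (5, 6) };

        // Plain foreach: just drives the set's enumerator via MoveNext().
        foreach (var p in points)
            Process(p);

        // Parallel.ForEach: worthwhile only if the per-item work is expensive
        // enough to outweigh the partitioning/threading overhead.
        Parallel.ForEach(points, p => Process(p));
    }
}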
You shouldn't be iterating over a HashSet in the first place to determine whether an item is in it. You should use the HashSet's own Contains method (not the LINQ one). The HashSet is designed so that it won't need to look through every item to see if any given value is inside the set. That is what makes it so much more powerful for searching than a List.
Not strictly answering the question in the header, but more concerning your specific problem:
I would make your own collection object that uses both a HashSet and a List internally. Iterating is fast as you can use the List, and checking for Contains is fast as you can use the HashSet. Just make it an IEnumerable and you can use this collection in a foreach as well.
The downside is more memory, but there are only twice as many references to each object, not twice as many objects. In the worst case it's only twice as much memory, but you seem much more concerned with performance.
Adding, checking, and iterating are fast this way, only removal is still O(N) because of the List.
EDIT: If removal needs to be O(1) as well, use a doubly linked list instead of a regular list, and make the HashSet a Dictionary<KeyType, Cell> instead. You can check the dictionary for Contains, but also use it to find the cell holding the data quickly, so removal from the data structure is fast.
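A rough sketch of the basic (non-removal) version described above, with a made-up name FastLookupList:

using System.Collections;
using System.Collections.Generic;

// Combined structure: O(1) Contains via the HashSet, fast ordered iteration
// via the List. Removal is deliberately omitted (it stays O(n) on the List).
public class FastLookupList<T> : IEnumerable<T>
{
    private readonly List<T> _items = new List<T>();
    private readonly HashSet<T> _lookup = new HashSet<T>();

    public void Add(T item)
    {
        if (_lookup.Add(item))      // de-duplication comes for free
            _items.Add(item);
    }

    public bool Contains(T item) => _lookup.Contains(item);

    public int Count => _items.Count;

    public IEnumerator<T> GetEnumerator() => _items.GetEnumerator();
    IEnumerator IEnumerable.GetEnumerator() => GetEnumerator();
}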
I had the same issue, where the HashSet suits the addition of unique elements very well, but is very slow when getting elements in a for loop. I solved it by converting the HashSet to an array and then running the for loop over that.
I want to have around 20,000 complex objects sitting in memory at all times (the app will run in an indefinite loop). I am considering either using List<MyObject> and then converting the list to Dictionary<int, MyObject>, or avoiding List altogether and keeping the objects in a dictionary. I was wondering, is it pricey to convert the list to a dictionary each time I need to look up an object? What would be better? Have them stored as a Dictionary at all times? Or have a List and use lambdas to get the needed object? Or should I look at other options?
Please note, I don't need queue or stack behavior, where retrieving an object dequeues it.
Thanks in advance.
Using a lambda lookup against the list is O(N), which for 20,000 items is not inconsiderable. However, if you know you'll always need to fetch the object by a known key, you can use a dictionary which is O(1) - that's as fast as algorithms go. So if there's some way you can structure your data/application so that you can base retrieval around some sort of predictable, repeatable, unique key, that will maximize performance. The worst thing (from a performance standpoint) is some complex lookup routine against a list, but sometimes it is unavoidable.
Regardless of what you're doing, if you need to access the List, then you are going to need to loop through it to find whatever you want.
If you need to access the Dictionary, then you have the option to use the key value to immediately retrieve what you are looking for, or, if you must, you can still loop through the Dictionary's Values.
Just use the Dictionary.
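A hedged sketch of what "just use the Dictionary" looks like for this scenario (MyObject and its Id property are placeholders from the question):

// Keep the ~20,000 objects keyed by Id the whole time; no repeated conversion.
var objects = new Dictionary<int, MyObject>(20_000);

objects[someObject.Id] = someObject;          // add or replace

if (objects.TryGetValue(42, out var found))   // O(1) lookup by key
{
    // use found
}

foreach (var obj in objects.Values)           // full pass when you need one
{
    // process obj
}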
HashSet
The C# HashSet data structure was introduced in the .NET Framework 3.5. A full list of the implemented members can be found at the HashSet MSDN page.
Where is it used?
Why would you want to use it?
A HashSet holds a set of objects, but in a way that allows you to easily and quickly determine whether an object is already in the set or not. It does so by internally managing an array and storing the object using an index which is calculated from the hash code of the object. Take a look here
HashSet is an unordered collection containing unique elements. It has the standard collection operations Add, Remove, Contains, but since it uses a hash-based implementation, these operations are O(1). (As opposed to List for example, which is O(n) for Contains and Remove.) HashSet also provides standard set operations such as union, intersection, and symmetric difference. Take a look here
There are different implementations of Sets. Some make insertion and lookup operations super fast by hashing elements. However, that means that the order in which the elements were added is lost. Other implementations preserve the added order at the cost of slower running times.
The HashSet class in C# goes for the first approach, thus not preserving the order of elements. It is much faster than a regular List. Some basic benchmarks showed that HashSet is decently faster when dealing with primitive types (int, double, bool, etc.). It is a lot faster when working with class objects. So the point is that HashSet is fast.
The only catch of HashSet is that there is no access by indices. To access elements you can either use an enumerator or use the built-in function to convert the HashSet into a List and iterate through that. Take a look here
A HashSet has an internal structure (hash), where items can be searched and identified quickly. The downside is that iterating through a HashSet (or getting an item by index) is rather slow.
So why would someone want to be able to know whether an entry already exists in a set?
One situation where a HashSet is useful is in getting distinct values from a list where duplicates may exist. Once an item is added to the HashSet it is quick to determine if the item exists (Contains operator).
Other advantages of the HashSet are the Set operations: IntersectWith, IsSubsetOf, IsSupersetOf, Overlaps, SymmetricExceptWith, UnionWith.
If you are familiar with the object constraint language then you will identify these set operations. You will also see that it is one step closer to an implementation of executable UML.
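For example, the set operations mentioned above replace hand-written loops (sample values are arbitrary):

var a = new HashSet<int> { 1, 2, 3, 4 };
var b = new HashSet<int> { 3, 4, 5, 6 };

var union = new HashSet<int>(a);
union.UnionWith(b);                                       // { 1, 2, 3, 4, 5, 6 }

var intersection = new HashSet<int>(a);
intersection.IntersectWith(b);                            // { 3, 4 }

bool subset = new HashSet<int> { 1, 2 }.IsSubsetOf(a);    // true
bool overlaps = a.Overlaps(b);                            // true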
Simply put, and without revealing the kitchen secrets:
a set, in general, is a collection that contains no duplicate elements, and whose elements are in no particular order. So, a HashSet<T> is similar to a generic List<T>, but is optimized for fast lookups (via hash tables, as the name implies) at the cost of losing order.
From an application perspective, if one only needs to avoid duplicates then a HashSet is what you are looking for, since its Lookup, Insert and Remove complexities are all O(1) - constant. What this means is that it does not matter how many elements the HashSet has; it will take the same amount of time to check whether an element is present or not. Plus, since you are inserting elements at O(1) too, it makes it perfect for this sort of thing.