I want to have around 20,000 complex objects sitting in memory at all times (the app will run in an indefinite loop). I am considering either using a List<MyObject> and converting it to a Dictionary<int, MyObject> when needed, or avoiding the List altogether and keeping the objects in a dictionary. Is it expensive to convert the list to a dictionary each time I need to look up an object? What would be better: keeping them stored as a Dictionary at all times, or keeping a List and using lambdas to get the needed object? Or should I look at other options?
Please note, I don't need queue or stack behavior when object retrieval causes dequeuing.
Thanks in advance.
Using a lambda lookup against the list is O(N), which for 20,000 items is not inconsiderable. However, if you know you'll always need to fetch the object by a known key, you can use a dictionary which is O(1) - that's as fast as algorithms go. So if there's some way you can structure your data/application so that you can base retrieval around some sort of predictable, repeatable, unique key, that will maximize performance. The worst thing (from a performance standpoint) is some complex lookup routine against a list, but sometimes it is unavoidable.
Regardless of what you're doing, if you need to find an item in the List by one of its properties, then you are going to need to loop through it to find whatever you want.
If you need to access the Dictionary, then you have the option to use the key value to immediately retrieve what you are looking for, or, if you must, you can still loop through the Dictionary's Values.
Just use the Dictionary.
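To make the difference concrete, here is a minimal sketch of both lookups. It assumes each object has a unique int Id; MyObject's members and the LookupDemo helper are made up for illustration, not taken from the question.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for the question's "complex object".
public sealed class MyObject
{
    public int Id { get; set; }
    public string Name { get; set; }
}

public static class LookupDemo
{
    // Build the dictionary once and keep it; do not rebuild it per lookup.
    // ToDictionary throws if two objects share the same Id.
    public static Dictionary<int, MyObject> BuildIndex(IEnumerable<MyObject> items)
    {
        return items.ToDictionary(o => o.Id);
    }

    // O(1) on average: the key is hashed and the matching bucket is read.
    public static MyObject FindInDictionary(Dictionary<int, MyObject> index, int id)
    {
        MyObject found;
        return index.TryGetValue(id, out found) ? found : null;
    }

    // O(N): the lambda is evaluated item by item until one matches.
    public static MyObject FindInList(List<MyObject> list, int id)
    {
        return list.Find(o => o.Id == id);
    }
}
```

If the objects also need to be walked in bulk, index.Values can still be enumerated, so keeping only the dictionary does not rule out iteration.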
Related
I have two questions. I was wondering if there is an easy class in the C# library that stores pairs of values instead of just one, so that I can store a class and an integer in the same node of the list. I think the easiest way is to just make a container class, but that is extra work each time, so I wanted to know whether I should be doing so or not. I know that later versions of .NET (I am using 3.5) have tuples that I could use, but they are not available to me.
I guess the bigger question is: what are the memory disadvantages of using a dictionary to store the integer-to-class map, even though I don't need O(1) access and could afford to just search the list? What is the minimum size of the hash table? Should I just make the wrapper class I need?
If you need to store an unordered list of {integer, value}, then I would suggest making the wrapper class. If you need a data structure in which you can look up integer to get value (or, look up value to get integer), then I would suggest a dictionary.
The decision of List<Tuple<T1, T2>> (or List<KeyValuePair<T1, T2>>) vs Dictionary<T1, T2> is largely going to come down to what you want to do with it.
If you're going to be storing information and then iterating over it, without needing to do frequent lookups based on a particular key value, then a List is probably what you want. Depending on how you're going to use it, a LinkedList might be even better - slightly higher memory overheads, faster content manipulation (add/remove) operations.
On the other hand, if you're going to be primarily using the first value as a key to do frequent lookups, then a Dictionary is designed specifically for this purpose. Key value searching and comparison is significantly improved, so if you do much with the keys and your list is big, a Dictionary will give you a big speed boost.
Data size is important to the decision. If you're talking about a couple hundred items or less, a List is probably fine. Above that point the lookup times will probably impact more significantly on execution time, so Dictionary might be more worth it.
There are no hard and fast rules. Every use case is different, so you'll have to balance your requirements against the overheads.
You can use a list of KeyValuePair: http://msdn.microsoft.com/en-us/library/5tbh8a42.aspx
You can use a Tuple<T,T1>, a list of KeyValuePair<T, T1> - or, an anonymous type, e.g.
var list = something.Select(x => new { Key = x.Something, Value = x.Value });
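For .NET 3.5, where Tuple is not available, a list of KeyValuePair might look like the sketch below. The string key is only a stand-in for whatever class you want to pair with the integer, and PairListDemo is a made-up name.

```csharp
using System;
using System.Collections.Generic;

public static class PairListDemo
{
    // KeyValuePair<TKey, TValue> has been in the framework since .NET 2.0,
    // so no wrapper class is needed just to hold two values together.
    public static List<KeyValuePair<string, int>> BuildPairs()
    {
        var pairs = new List<KeyValuePair<string, int>>();
        pairs.Add(new KeyValuePair<string, int>("alpha", 1));
        pairs.Add(new KeyValuePair<string, int>("beta", 2));
        return pairs;
    }

    // Plain iteration: no hashing, no lookup by key.
    public static int SumValues(List<KeyValuePair<string, int>> pairs)
    {
        int total = 0;
        foreach (KeyValuePair<string, int> pair in pairs)
        {
            total += pair.Value;
        }
        return total;
    }
}
```

The pairs are immutable structs, so changing a value means replacing the whole list entry.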
You can use either KeyValuePair or Tuple
For Tuple, you can read the following useful post:
What requirement was the tuple designed to solve?
Say that, in my method, I pass in a couple IEnumerables (probably because I'm going to get a bunch of objects from a db or something).
Then for each object in objects1, I want to pull out a diffobject from objects2 that has the same object.iD.
I don't want multiple enumerations (according to resharper) so I could make objects2 into a dictionary keyed with object.iD. Then I only enumerate once for each. (secondary question)Is that a good pattern?
(primary question) What's too big? At what point would this be a horrible pattern? How many objects is too many objects for the dictionary?
Internally, a dictionary is prevented from ever having more than about two billion items (its Count is an int). Since the way things are positioned within a dictionary is fairly complicated, if I were looking at dealing with a billion items (if a 16-bit value, for example, then 2GB), I'd be looking to store them in a database and retrieve them using data-access code.
I have to ask though, where are Objects1 and Objects2 coming from? It sounds as though you could do this at the DB level and it would be MUCH, MUCH more efficient than doing it in C#!
You might also want to consider using KeyValuePair[]
Dictionaries expose their entries as instances of KeyValuePair
If all you ever want to do is look up values in the dictionary given their Key, then yes, Dictionary is the way to go - they're pretty quick at doing that. However, if you want to sort items or search for them using the Value or a property of it, it's better to use something else.
As far as the size goes, they get a little slower as they get bigger, it's worth doing some benchmarks to see how it affects your needs, but you could always split values across multiple dictionaries based on their type or range. http://www.dotnetperls.com/dictionary-size
It's worth noting though that when you say "Then I only enumerate once for each", that's slightly incorrect. objects1 will be enumerated fully, and objects2 is enumerated once to build the dictionary, but after that the dictionary itself won't be enumerated. As long as you use the Key to retrieve values, it will hash the key and use the result to calculate the location where the value is stored, so a dictionary can get pretty quickly to the value you ask for. Ideally use an int for the Key because it can use that as the hash directly. You can enumerate them, but it's much better to look objects up using objects2Dictionary[key].
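A rough sketch of that pattern, assuming both sequences carry an int Id. The Item and Detail types and the JoinDemo helper are invented for the example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical types standing in for the question's two kinds of objects.
public sealed class Item
{
    public int Id { get; set; }
    public string Name { get; set; }
}

public sealed class Detail
{
    public int Id { get; set; }
    public string Text { get; set; }
}

public static class JoinDemo
{
    public static List<string> Match(IEnumerable<Item> objects1, IEnumerable<Detail> objects2)
    {
        // objects2 is enumerated exactly once, here.
        // ToDictionary throws if two details share an Id.
        Dictionary<int, Detail> byId = objects2.ToDictionary(d => d.Id);

        var results = new List<string>();
        foreach (Item item in objects1) // objects1 is enumerated exactly once
        {
            Detail match;
            // Hash lookup by key; the dictionary is never scanned.
            if (byId.TryGetValue(item.Id, out match))
            {
                results.Add(item.Name + ": " + match.Text);
            }
        }
        return results;
    }
}
```

TryGetValue is used rather than the indexer so that an Id with no counterpart is skipped instead of throwing KeyNotFoundException.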
I'm still quite new to C#, but noticed the advantages through forum postings of using a HashSet instead of a List in specific cases.
My current case isn't that I'm storing a tremendous amount of data in a single List exactly, but rather that I'm having to check for members of it often.
The catch is that I do indeed need to iterate over it as well, but the order they are stored or retrieved doesn't actually matter.
I've read that foreach loops are actually slower than for loops, so how else could I go about this in the fastest way possible?
The number of .Contains() checks I'm doing is definitely hurting my performance with lists, so at least comparing to the performance of a HashSet would be handy.
Edit: I'm currently using lists, iterating through them in numerous locations, and different code is being executed in each location. Most often, the current lists contain point coordinates that I then use to index into a 2-dimensional array, on which I then do some operation or another based on the criteria of the list.
If there's not a direct answer to my question, that's fine, but I assumed there might be other methods of iterating over a HashSet than just a foreach cycle. I'm currently in the dark as to what other methods there might even be, what advantages they provide, etc. Assuming there are other methods, I also made the assumption that there would be a typical preferred method of choice that is only ignored when it doesn't suit the needs (my needs are pretty basic).
As far as prematurely optimizing, I already know using the lists as I am is a bottleneck. How to go about helping this issue is where I'm getting stuck. Not even stuck exactly, but I didn't want to re-invent the wheel by testing repeatedly only to find out I'm already doing it the best way I could (this is a large project with over 3 months invested, lists are everywhere, but there are definitely ones that I do not want duplicates, have a lot of data, need not be stored in any specific order, etc).
A foreach loop can have a small amount of additional overhead on an indexed collection (like a List).
This is mostly because the enumerator does a little more checking on each step than a plain for loop does.
HashSet does not have an indexer so you have to use the enumerator.
In this case foreach is efficient as it only calls MoveNext() as it moves through the collection.
Also Parallel.ForEach can dramatically improve your performance, depending on the work you are doing in the loop and the size of your HashSet.
As mentioned before profiling is your best bet.
You shouldn't be iterating over a HashSet in the first place to determine if an item is in it. You should use the HashSet's own Contains method (not the LINQ one). The HashSet is designed such that it won't need to look through every item to see if any given value is inside of the set. That is what makes it so much more powerful for searching than a List.
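Here is a small sketch of both operations on a set of point coordinates, since that is what the question describes. It assumes C# 7 value tuples for the points; the grid, the method names, and the HashSetDemo class are made up for illustration.

```csharp
using System;
using System.Collections.Generic;

public static class HashSetDemo
{
    // Membership test: the point is hashed, the elements are not scanned.
    public static bool IsMarked(HashSet<(int X, int Y)> marked, int x, int y)
    {
        return marked.Contains((x, y));
    }

    // HashSet<T> has no indexer, so foreach is the way to walk it.
    // Order is not guaranteed, which the question says does not matter.
    public static int CountSetCells(HashSet<(int X, int Y)> marked, bool[,] grid)
    {
        int hits = 0;
        foreach (var point in marked)
        {
            if (grid[point.X, point.Y])
            {
                hits++;
            }
        }
        return hits;
    }
}
```

Value tuples already compare and hash by their components, so no custom equality code is needed for this to work.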
Not strictly answering the question in the header, but more concerning your specific problem:
I would make your own Collection object that uses both a HashSet and a List internally. Iterating is fast as you can use the List, checking for Contains is fast as you can use the HashSet. Just make it an IEnumerable and you can use this Collection in foreach as well.
The downside is more memory, but there are only twice as many references to object, not twice as many objects. Worst case scenario it's only twice as much memory, but you seem much more concerned with performance.
Adding, checking, and iterating are fast this way, only removal is still O(N) because of the List.
EDIT: If removal needs to be O(1) as well, use a doubly linked list instead of a regular list, and make the hashSet a Dictionary<KeyType, Cell> instead. You can check the dictionary for Contains, but also to find the cell with the data in it fast, so removal from the data structure is fast.
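A minimal sketch of the List-plus-HashSet version (not the linked-list variant from the edit) could look like this. ListBackedSet is a name I made up, and it assumes the default equality comparer is fine for your element type.

```csharp
using System;
using System.Collections;
using System.Collections.Generic;

// Hypothetical collection: a List for iteration, a HashSet for membership.
public sealed class ListBackedSet<T> : IEnumerable<T>
{
    private readonly List<T> _items = new List<T>();
    private readonly HashSet<T> _lookup = new HashSet<T>();

    public int Count { get { return _items.Count; } }

    // Indexer so callers can use a plain for loop if they prefer.
    public T this[int index] { get { return _items[index]; } }

    // O(1) on average; returns false and changes nothing for a duplicate.
    public bool Add(T item)
    {
        if (!_lookup.Add(item)) return false;
        _items.Add(item);
        return true;
    }

    // O(1) on average, answered by the HashSet alone.
    public bool Contains(T item)
    {
        return _lookup.Contains(item);
    }

    // Still O(N): List.Remove has to find the item and shift the rest.
    public bool Remove(T item)
    {
        if (!_lookup.Remove(item)) return false;
        _items.Remove(item);
        return true;
    }

    public IEnumerator<T> GetEnumerator() { return _items.GetEnumerator(); }

    IEnumerator IEnumerable.GetEnumerator() { return GetEnumerator(); }
}
```

Both inner collections hold references to the same objects, so the extra memory is the second set of references plus the hash table itself.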
I had the same issue, where the HashSet suits the addition of unique elements very well, but is very slow when getting elements in a for loop. I solved it by converting the HashSet to an array and then running the for loop over it.
I have recently seen a new trend in my firm where we change the IEnumerable to a dictionary by a simple LINQ transformation as follows:
enumerable.ToDictionary(x=>x);
We mostly end up doing this when the operation on the collection is a Contains/Access and obviously a dictionary has a better performance in such cases.
But I realise that converting the Enumerable to a dictionary has its own cost and I am wondering at what point does it start to break-even (if it does) i.e the performance of IEnumerable Contains/Access is equal to ToDictionary + access/contains.
OK, I might add that there is no database access here. The enumerable might be created from a database query, and that's it, and the enumerable may be edited after that too.
Also it would be interesting to know how does the datatype of the key affect the performance?
The lookup generally happens 2-5 times, but sometimes only once. But I have seen things like
For an enumerable:
var element = enumerable.SingleOrDefault(x => x.Id == id);
//do something if element is null or return
for a dictionary:
if(dictionary.ContainsKey(x))
//do something if false else return
This has been bugging me for quite some time now.
Performance of Dictionary Compared to IEnumerable
A Dictionary, when used correctly, is always faster to read from (except in cases where the data set is very small, e.g. 10 items). There can be overhead when creating it.
Given n as the number of items and m as the number of lookups performed against the same object (these are approximate):
Performance of an IEnumerable (created from a clean list): O(mn)
This is because you need to look at all the items each time (essentially m * O(n)).
Performance of a Dictionary: O(n) + m * O(1), or O(m + n)
This is because you need to insert items first (O(n)).
In general it can be seen that the Dictionary wins when m > 1, and the IEnumerable wins when m = 1 or m = 0.
In general you should:
Use a Dictionary when doing the lookup more than once against the same dataset.
Use an IEnumerable when doing the lookup once.
Use an IEnumerable when the data-set could be too large to fit into memory.
Keep in mind a SQL table can be used like a Dictionary, so you could use that to offset the memory pressure.
Further Considerations
Dictionaries use GetHashCode() to organise their internal state. The performance of a Dictionary is strongly related to the hash code in two ways.
Poorly performing GetHashCode() - results in overhead every time an item is added, looked up, or deleted.
Low quality hash codes - results in the dictionary not having O(1) lookup performance.
Most built-in .NET types (especially the value types) have very good hashing algorithms. However, with list-like types (e.g. string) GetHashCode() has O(n) performance, because it needs to iterate over the whole string. Thus your dictionary's performance can really be seen as (where M is the big-oh for an efficient GetHashCode()): O(1) + M.
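For your own key types, that means supplying a cheap, well-spread GetHashCode() together with a matching Equals(). A sketch under those assumptions follows; GridKey and its fields are invented for the example, and the multiplier is just one common way to mix two ints.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical composite key made of two ints.
public struct GridKey : IEquatable<GridKey>
{
    public readonly int Row;
    public readonly int Column;

    public GridKey(int row, int column)
    {
        Row = row;
        Column = column;
    }

    // Strongly typed Equals avoids boxing when the dictionary compares keys.
    public bool Equals(GridKey other)
    {
        return Row == other.Row && Column == other.Column;
    }

    public override bool Equals(object obj)
    {
        return obj is GridKey && Equals((GridKey)obj);
    }

    // Constant-time hash that mixes both fields; equal keys always get
    // equal hash codes, which is the rule a Dictionary relies on.
    public override int GetHashCode()
    {
        unchecked
        {
            return (Row * 397) ^ Column;
        }
    }
}

public static class GridKeyDemo
{
    public static string Lookup(int row, int column)
    {
        var labels = new Dictionary<GridKey, string>();
        labels[new GridKey(1, 2)] = "start";
        labels[new GridKey(3, 4)] = "end";

        string label;
        return labels.TryGetValue(new GridKey(row, column), out label) ? label : null;
    }
}
```

A hash that returned the same value for every key would still be correct, but every entry would land in one bucket and lookups would degrade towards O(n).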
It depends....
How long is the IEnumerable?
Does accessing the IEnumerable cause database access?
How often is it accessed?
The best thing to do would be to experiment and profile.
If you search for elements in your collection by some key very often, the Dictionary will definitely be faster, because it is a hash-based collection and searching it is many times faster. Otherwise, if you don't search through the collection much, the conversion is not necessary, because the time for the conversion may be bigger than your one or two searches in the collection.
IMHO: you need to measure this on your environment with representative data. In such cases I just write a quick console app that measures the time of the code execution. To have a better measure you need to execute the same code multiple times I guess.
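Such a measurement could be sketched roughly like this, using Stopwatch. The int data, the method names, and the BreakEvenBenchmark class are placeholders; swap in your real element type and key selector.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;

public static class BreakEvenBenchmark
{
    // Times m lookups done by scanning the list each time.
    // Returns how many keys were found so the work cannot be skipped.
    public static int TimeListLookups(List<int> data, int[] keys, out long elapsedTicks)
    {
        Stopwatch watch = Stopwatch.StartNew();
        int found = 0;
        foreach (int key in keys)
        {
            if (data.Contains(key)) found++;
        }
        watch.Stop();
        elapsedTicks = watch.ElapsedTicks;
        return found;
    }

    // Times the same m lookups, including the cost of building the dictionary,
    // because that conversion cost is the thing being weighed.
    public static int TimeDictionaryLookups(List<int> data, int[] keys, out long elapsedTicks)
    {
        Stopwatch watch = Stopwatch.StartNew();
        Dictionary<int, int> index = data.ToDictionary(x => x);
        int found = 0;
        foreach (int key in keys)
        {
            if (index.ContainsKey(key)) found++;
        }
        watch.Stop();
        elapsedTicks = watch.ElapsedTicks;
        return found;
    }
}
```

Call both with the same data for several values of m (1, 2, 5, and so on), repeat each run a few times, and ignore the first run, since it includes JIT compilation. The point where the dictionary timing drops below the list timing is your break-even for that data size.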
ADD:
It also depends on the application you develop. Usually you gain more by spending that time and effort on optimizing other places (avoiding network round trips, caching, etc.).
I'll add that you haven't told us what happens every time you "rewind" your IEnumerable<>. Is it directly backed by a data collection (for example a List<>), or is it calculated "on the fly"? If it's the first, then for small collections, enumerating them to find the wanted element is faster (a Dictionary for 3 or 4 elements is useless; if you want, I can build a benchmark to find the break-even point). If it's the second, then you have to consider whether "caching" the IEnumerable<> in a collection is a good idea. If it is, then you can choose between a List<> and a Dictionary<>, and we return to point 1: is the IEnumerable small or big? And there is a third problem: if the collection isn't backed by anything in memory and is too big for memory, then clearly you can't put it in a Dictionary<>. Then perhaps it's time to make the SQL work for you :-)
I'll add that "failures" have their cost: in a List<> if you try to find an element that doesn't exist, the cost is O(n), while in a Dictionary<> the cost is still O(1).
I often used Dictionary in C# 2.0 with a string key containing a unique identifier.
I am learning C# 3.0+ and it seems that I can now simply use a List and do LINQ on that object to get the specific object (with .Where()).
So, if I understand well, the Dictionary class has lost its purpose?
No, a dictionary is still more efficient for getting things back out given a key.
With a list, you still have to iterate through it to find what you want. A dictionary does a lookup.
If you just have a List, then doing a LINQ select will scan through every item in the list comparing it against the one you are looking for.
The Dictionary however computes a hash code of the string you are looking for (returned by the GetHashCode method). This value is then used to look up the string more efficiently. For more info on how this works see Wikipedia.
If you have more than a few strings, the initial (List) method will start to get painfully slow.
IMHO the Dictionary approach will be MUCH faster than LINQ, so if you have an array with a lot of items, you should rather use Dictionary.
Dictionary is implemented as a hashtable. Thus it should give constant time access for lookups by key. List is implemented as a dynamic array, so access by index is constant time, but finding an item by value is a linear time search.
Based on the underlying data structures, the Dictionary should still give you better performance.
MSDN docs on Dictionary
http://msdn.microsoft.com/en-us/library/xfhwa508.aspx
and List
http://msdn.microsoft.com/en-us/library/6sh2ey19.aspx