List vs Dictionary (Hashtable) - c#

This may be a silly question, but I have been reading that Hashtables and Dictionaries are faster than a List because they index their items with keys.
I know a List or Array is for elements without values, and a Dictionary is for elements with values. So I would think that it might be smart to have a Dictionary with the value that you need as the key and the same value in all of them?
Update:
Based on the comments what I think I need is a HashSet. This question talks about their performance.

"Faster" depends on what you need them for.
A .NET List is just a slab of contiguous memory (it is not a linked list), which makes it extremely efficient to access sequentially (especially when you consider the effects of caching and prefetching on modern CPUs) or "randomly" through a known integer index. Searching or inserting elements (especially in the middle) is not so efficient.
Dictionary is an associative data structure - a key can be anything hashable (not just integer index), but elements are not sorted in a "meaningful" way and the access through the known key is not as fast as List's integer index.
So, pick the right tool for the job.

There are some weaknesses to Dictionary/Hashtable vs a List/array as well:
You have to compute the hash value of the object with each lookup.
For small collections, iterating through the array can be faster than computing that hash, especially because a hash is not guaranteed to be unique [1].
They are not as good at iterating over the list of items.
They are not very good at storing duplicate entries (sometimes you legitimately want a value to show up in an array more than once).
Sometimes a type does not have a good key to associate with it.
Use what fits the situation. Sometimes that will be a list or an array. Sometimes it will be a Dictionary. You should almost never use a Hashtable any more (prefer Dictionary<KeyType, Object> if you really don't know what type you're storing).
[1] It usually is unique, but because there is a small potential for collisions, the collection must check the bucket after computing the hash value.

Your statement "list or array is for elements without values, and dictionary is for elements with values", is not strictly true.
More accurately, a List is a collection of elements, and a Hashtable or Dictionary is a collection of elements along with a unique key to be used to access each one.
Use a list for collections of a very few elements, or when you will only need to access the entire collection, not a single element of the collection.
Use a Hashtable or Dictionary when the collection is large and/or when you will need to find/access individual members of the collection.
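As a minimal sketch of the "value as its own key" idea from the question, a HashSet stores only keys, so no dummy values are needed. The names and data below are hypothetical and only for illustration.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical data for illustration only.
var names = new List<string> { "alice", "bob", "carol" };

// A HashSet stores only keys, so there is no value side to fill in.
var nameSet = new HashSet<string>(names);

bool inList = names.Contains("carol");   // linear scan over the list
bool inSet = nameSet.Contains("carol");  // hash lookup
bool addedAgain = nameSet.Add("bob");    // false: "bob" is already present

Console.WriteLine($"{inList} {inSet} {addedAgain} {nameSet.Count}");
```

Both Contains calls give the same answer; the difference is only in how the collection finds it.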

Related

Dictionary element accessing

Question
Must looping through all C# Dictionary elements be done only through foreach, and if so why?
Or, I could ask my question as: can Dictionary elements be accessed by position within the Dictionary object (i.e., first element, last element, 3rd from last element, etc)?
Background to this question
I'm trying to learn more about how Dictionary objects work, so I'd appreciate help wrapping my mind around this. I'm learning about this, so I have several thoughts that are all tied into this question. I'll try to present in a way that is appropriate for SO format.
Research
In a C# array, elements are referenced by position. In a Dictionary, values are referenced by keys.
Looking through the documentation on MSDN, there are the statements
"For purposes of enumeration, each item in the dictionary is treated as a
KeyValuePair<TKey, TValue> structure representing a value and its key. The
order in which the items are returned is undefined."
So, it would seem that since the order items are returned in is undefined, there is no way to access elements by position. I also read:
"Retrieving a value by using its key is very fast, close to O(1),
because the Dictionary<TKey, TValue> class is implemented as a hash table."
Looking at the documentation for the Hashtable class (.NET 4.5), there is a reference to using a foreach statement to loop through and return elements. But there is no reference to using a for statement, or for that matter while or any other looping statement.
Also, I've noticed Dictionary elements use the IEnumerable interface, which seems to use foreach as the only type of statement for looping functions.
Thoughts
So, does this mean that Dictionary elements cannot be accessed by "position," as arrays or lists can?
If this is so, why is there a .Count property that returns the number of key/value pairs, yet nothing that lets me reference these by nearness to the total? For example, .Count is 5, why can't I request key/value pair .Count minus 1?
How is foreach able to loop over each element, yet I have no access to individual elements in the same way?
Is there no way to determine the position of an element (key or value) in a Dictionary object, without utilizing foreach? Can I not tell, without mapping elements to a collection, if a key is the first key in a Dictionary, or the last key?
This SO question and the excellent answers touch on this, but I'm specifically looking to see if I must copy elements to an array or other enumerable type, to access specific elements by position.
Here's an example. Please note I'm not looking for a way to specifically solve this example - it's for illustration purposes of my questions only. Suppose I want to add all the keys in a Dictionary<string, string> object to a comma-separated list, with no comma at the end. With an array I could do:
string[] arrayStr = new string[2] { "Test1", "Test2" };
string outStr = "";
for (int i = 0; i < arrayStr.Length; i++)
{
    outStr += arrayStr[i];
    if (i < arrayStr.Length - 1)
    {
        outStr += ", ";
    }
}
With Dictionary<string, string>, how would I copy each key to outStr using the above method? It appears I would have to use foreach. But what Dictionary methods or properties exist that would let me identify where an element is located at, within a dictionary?
If you're still reading this, I also want to point out I'm not trying to say there's something wrong with Dictionary... I'm simply trying to understand how this tool in the .NET framework works, and how to best use it myself.
Say you have four cars of different colors. And you want to be able to quickly find the key to a car by its color. So you make 4 envelopes labelled "red", "blue", "black", and "white" and place the key to each car in the right envelope. Which is the "first" car? Which is the "third"? You're not concerned about the order of the envelopes; you're concerned about being able to quickly get the key by the color.
So, does this mean that Dictionary elements cannot be accessed by "position," as arrays or lists can?
Not directly, no. You can use Skip and Take but all they will do is iterate until you get to the "nth" item.
If this is so, why is there a .Count property that returns the number of key/value pairs, yet nothing that lets me reference these by nearness to the total? For example, .Count is 5, why can't I request key/value pair .Count minus 1?
You can still measure the number of items even though there's no order. In my example you know there are 4 envelopes, but there's no concept of the "third" envelope.
How is foreach able to loop over each element, yet I have no access to individual elements in the same way?
Because foreach uses IEnumerable, which just asks for the "next" element each time - the underlying collection determines what order the elements are returned in. You can pick up the envelopes one by one, but the order is irrelevant.
Is there no way to determine the position of an element (key or value) in a Dictionary object, without utilizing foreach?
You can infer it by using foreach and counting how many elements you have before reaching the one you want, but as soon as you add or remove an item, that position may change. If I buy a green car and add the envelope, where in the "order" would it go?
I'm specifically looking to see if I must copy elements to an array or other enumerable type, to access specific elements by position.
Well, no, you can use Skip and Take, but there's no way to predict what item is at that location. You can pick up two envelopes, ignore them, pick up another one and call it the "third" envelope, but so what?
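To make the points above concrete, here is a small sketch under assumed data (the car colors from the envelope analogy, with made-up key names). It builds the comma-separated key list from the question without any positional index, and shows that ElementAt, like Skip and Take, just enumerates until it reaches the n-th pair.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical data based on the envelope analogy.
var cars = new Dictionary<string, string>
{
    ["red"] = "key-1",
    ["blue"] = "key-2",
    ["black"] = "key-3",
};

// string.Join places separators only between items, so no index is needed.
string outStr = string.Join(", ", cars.Keys);

// ElementAt enumerates until it reaches the requested position.
// Which pair that is depends on the dictionary's unspecified enumeration order.
KeyValuePair<string, string> someThird = cars.ElementAt(2);

Console.WriteLine(outStr);
Console.WriteLine(someThird.Key);
```

The pair returned by ElementAt(2) is a valid entry, but nothing in the Dictionary contract says which one it will be.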
Several correct answers here, but I thought you might like a short version :)
Under the hood, the Dictionary class has a private field called buckets. It's just an ordinary array that maps integer positions to the entries you've added to the Dictionary.
When you add a key/value pair to the Dictionary, it calculates a hash value for your key. The hash value, reduced to the length of the buckets array, is used as the index into that array. Keys that land in the same bucket are chained together, so a collision does not overwrite an existing entry, and the array is enlarged as the Dictionary grows.
Yes, it's possible via reflection (which allows you to extract private data fields) to get the 3rd, or 4th, or Nth member of the buckets array. But the array could be resized at any time and you're not even guaranteed that the implementation details of Dictionary won't change.
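The idea of turning a hash into an array index can be sketched as follows. This is a simplified illustration, not the actual Dictionary implementation; the table size of 7 is an arbitrary assumption.

```csharp
using System;

string[] colors = { "red", "blue", "black", "white" };
int bucketCount = 7; // hypothetical table size
int[] bucketIndexes = new int[colors.Length];

for (int i = 0; i < colors.Length; i++)
{
    // Clear the sign bit so the remainder is never negative.
    int hash = colors[i].GetHashCode() & 0x7FFFFFFF;
    bucketIndexes[i] = hash % bucketCount;
    Console.WriteLine($"{colors[i]} -> bucket {bucketIndexes[i]}");
}
```

On runtimes that randomize string hash codes per process, the printed buckets can differ from run to run, which is one more reason a position inside the table carries no meaning for the caller.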
In addition to D Stanley's answer, I'd like to add that you should check out SortedDictionary<TKey, TValue>. It stores the key/value pairs in a data structure that does keep the keys ordered.
var d = new SortedDictionary<int, string>();
d.Add(4, "banana");
d.Add(2, "apple");
d.Add(7, "pineapple");
Console.WriteLine(d.ElementAt(1).Value); // banana
Looping through a dictionary does not need to be done using foreach, but the terms 'first,' and 'last' are meaningless in terms of a dictionary, because order is not guaranteed and is in no way related to the order items are added to your dictionary.
Think of it this way. You have a bag that you are using to store blocks, and each block has a unique label on it. Throw in blocks with the labels "Foo," "Bar," and "Baz." Now, if you ask me what the count of my bag is, I can say I have 3 blocks in it, and if you ask me for the block labeled "Bar" I can get it for you. However, if you ask me for the first block, I don't know what you mean. The blocks are just a jumble inside my bag. If, instead, you say 'foreach' block, I'd like to take a photo of it, I'll hand you each block, 1 by 1. Again, the order isn't guaranteed. I'm just reaching into my bag and pulling out each block until I've gotten each one.
You can also ask for a collection of all the keys in a dictionary, then use each key to access the items in a dictionary. However, once again, the order of the keys is not guaranteed and, in theory, could change every time you access it (in practice, the .NET key order is normally pretty stable).
There's a lot of reasons why a dictionary is stored like this, but the key thing is dictionaries have to have unspecified order in order to have both O(1) insertion and O(1) access. An array, which has a specified order has O(1) access (you can get the n'th item in one step), but insertions are O(n).
There is a large number of collections available in the .NET Framework. You have to analyse your requirements and decide which collection to use:
Do you need Key/Value pairs or just Items?
Is it important that items are sorted?
Do you need fast insertion or just add at start/end of collection?
Do you need fast retrieval: O(1) or O(log n)?
Do you need an index, i.e. access to items by an integer position?
For most combinations of these requirements there exists a specialized collection.
In your case: Key/Value pairs and access through an index: SortedList
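A minimal sketch of that suggestion, reusing the fruit data from the SortedDictionary example above: SortedList<TKey, TValue> exposes its Keys and Values as indexable lists, so ".Count minus 1" style access works. The variable names are only for illustration.

```csharp
using System;
using System.Collections.Generic;

var fruit = new SortedList<int, string>
{
    { 4, "banana" },
    { 2, "apple" },
    { 7, "pineapple" },
};

// Keys and Values are exposed as IList<T>, so they can be indexed by position.
int lastIndex = fruit.Count - 1;
int lastKey = fruit.Keys[lastIndex];
string lastValue = fruit.Values[lastIndex];

// Lookup by key still works as with a Dictionary.
string byKey = fruit[4];

Console.WriteLine($"{lastKey}: {lastValue}");
```

The position here is the position in key order, not insertion order, and keeping that order makes inserts more expensive than in a Dictionary.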

C# - Searching keys of dictionary vs searching values in List

In terms of speed in search, is it better to search the keys of a dictionary or the values of a list?
In other words, which of these would be most preferable?
Dictionary<string,string> dic = new Dictionary<string,string>();
if(dic.ContainsKey("needle")){ ... }
Or
List<string> list = new List<string>();
if(list.Contains("needle")){ ... }
If by "better" you mean "faster" then use a dictionary. Dictionary keys are organized by hash codes, so lookups are significantly faster than list searches once there are more than just a few items in the collection.
With a good hashing algorithm, Dictionary searches can be close to O(1), meaning the search time is independent of the size of the dictionary. Lists, on the other hand, are O(n), meaning that the time is (on average) proportional to the size of the list.
If you just have key items (not mapping keys to values) you might also try a HashSet. It has the benefit of O(1) lookups without the overhead of the Value side of a dictionary.
(Granted the overhead is probably minimal, but why have it if you don't need it?)
For lookups a dictionary is usually best because the time it takes remains constant. With a list it increases the larger the list gets.
See also: http://www.dotnetperls.com/dictionary-time
I suggest using Dictionary when the number of lookups greatly exceeds the number of insertions. It is fine to use List when you will always have fewer than four items.
For lookups, Dictionary is usually a better choice. The time required is flat, an O(1) constant time complexity. The List has an O(N) linear time complexity. Three elements can be looped over faster than looked up in a Dictionary.
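When the data arrives as a list but will be searched many times, one option is to index it once and then do keyed lookups. The sketch below assumes hypothetical name/age records; tuples are used only to keep it self-contained.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical records for illustration.
var people = new List<(string Name, int Age)>
{
    ("alice", 31),
    ("bob", 27),
    ("carol", 45),
};

// One O(n) pass builds the index; ToDictionary throws if a name repeats.
Dictionary<string, int> ageByName = people.ToDictionary(p => p.Name, p => p.Age);

// TryGetValue checks for the key and fetches the value in a single lookup.
bool found = ageByName.TryGetValue("bob", out int bobAge);
bool missing = ageByName.ContainsKey("dave");

Console.WriteLine($"{found} {bobAge} {missing}");
```

The cost of building the dictionary only pays off when the number of lookups is large relative to the number of items, which matches the advice quoted above.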

Efficiently pairing objects in lists based on key

So, here's the deal.
(My current use-case is in C#, but I'm also interested in the general algorithmic case)
I am given two Arrays of objects (I don't get to alter the code that creates these arrays, unfortunately).
Each object has (as part of it) a .Name property, a string.
These strings are unique per object, and they have zero or one matching strings in the other array.
What I need to do is efficiently pair these objects based on that string, into some sort of collection that allows me access to the paired objects. The strings need to match exactly to be considered a match, so I don't need any Upper or CaseInsensitive, etc.
Sadly, these lists are not sorted.
The lists themselves are maybe 30-50 items, but I need to repeat the algorithm on thousands of these array-pairs in a row, so efficiency is important.
Since I know that there's 0 or 1 match, and I know that most of them will be 1 match, I feel like there's a more efficient algorithm than x*y (Foreach item in x, foreach item in y, if x=y then x and y are a match)
I believe the most likely options are:
Keep the unsorted list and just do x*y, but drop items from the list once I've found them so I don't check ones already-found,
OR:
Convert both to Dictionaries and then do an indexed lookup on each (array2[currentArray1Item])
OR:
Sort the lists myself (Array.Sort()), and then having sorted arrays I can probably do something clever like jump to the index in B where I'd expect to find it (wherever it was in A) and then move up or down based on string until I either find it or pass where it should've been.
Then once that's done I need to figure out how to store it, I suppose I can make a custom ObjectPair class that just holds objects A and B. No need to do anything fancy here, since I'm just going to ForEach on the pairs.
So the questions are:
Are any of the above algorithms the fastest way to do this (if not, what is?) and is there some existing C# structure that'd conveniently hold the found pairs?
EDIT: Array.Sort() is a method that exists, so I don't need to convert the array to List to sort. Good to know. Updated above.
The question I have is: how much efficiency do we gain from the special handling if it requires us to sort both input arrays? According to the documentation for Array.Sort, it is O(n log n) on average and O(n ^ 2) in the worst case (quicksort). Once we have both arrays sorted, we then have another O(n) amount of work because we have to loop through the first one.
I interpret this to mean that the overall amount of work might actually increase because of the number of iterations required to sort, then process. This of course would be a different story if you could guarantee sorted arrays at the start, but as you said you cannot. (I should also note that you would need to create a custom IComparer<T> implementation to pass to Array.Sort so it knows to use the .Name property. That's not runtime work, but it's still work :-)
You might consider using a LINQ join, which only iterates the inner array a single time (see here for pseudocode). This is as opposed to the nested foreach statements, which would iterate the inner array for each element of the outer array. It's about as efficient as it can be in the general case and doesn't introduce the complexity of the special handling you suggested.
Here is an example implementation:
var pairs =
    from item1 in array1
    join item2 in array2 on item1.Name equals item2.Name
    select new { item1, item2 };
foreach (var pair in pairs)
{
    // Use the pair somehow
}
That very clearly states what you are doing with the data and also gives you an anonymous type representing each pair (so you don't have to invent a pairing). If you do end up going a different route, I would be interested in how it compares to this approach.
Sort the second array using the Array.Sort method, then look up each object from the first array in the sorted second array using a binary search.
Generally, for 30-50 items this would be a little faster than brute force x*y.
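For comparison, the "convert to a Dictionary" option from the question could look roughly like the sketch below. The tuple type, the Payload field, and the sample data are assumptions standing in for the real objects with a .Name property; only the second array is indexed, since the first is read once.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical inputs: tuples stand in for the objects with a .Name property.
(string Name, int Payload)[] array1 = { ("a", 1), ("b", 2), ("c", 3) };
(string Name, int Payload)[] array2 = { ("c", 30), ("a", 10), ("x", 99) };

// Index the second array by name; ordinal comparison gives exact matching.
var byName = new Dictionary<string, (string Name, int Payload)>(
    array2.Length, StringComparer.Ordinal);
foreach (var item in array2)
{
    byName[item.Name] = item;
}

// One pass over the first array; each lookup is O(1) on average.
var pairs = new List<((string Name, int Payload) First, (string Name, int Payload) Second)>();
foreach (var item in array1)
{
    if (byName.TryGetValue(item.Name, out var match))
    {
        pairs.Add((item, match));
    }
}

foreach (var pair in pairs)
{
    Console.WriteLine($"{pair.First.Name}: {pair.First.Payload} / {pair.Second.Payload}");
}
```

This does O(x + y) work per array pair instead of x*y. Whether it beats sorting plus binary search for 30-50 items is something to measure rather than assume.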

What is the most performant way to check for existence with a collection of integers?

I have a large list of integers that are sent to my webservice. Our business rules state that these values must be unique. What is the most performant way to figure out if there are any duplicates? I don't need to know the values, I only need to know if 2 of the values are equal.
At first I was thinking about using a Generic List of integers and the list.Exists() method, but this is O(n) per check.
Then I was thinking about using a Dictionary and the ContainsKey method. But, I only need the Keys, I do not need the values. And I think this is a linear search as well.
Is there a better datatype to use to find uniqueness within a list? Or am I stuck with a linear search?
Use a HashSet<T>:
The HashSet class provides high
performance set operations. A set is a
collection that contains no duplicate
elements, and whose elements are in no
particular order
HashSet<T> even exposes a constructor that accepts an IEnumerable<T>. By passing your List<T> to the HashSet<T>'s constructor you will end up with a reference to a new HashSet<T> that will contain a distinct sequence of items from your original List<T>.
Sounds like a job for a HashSet...
If you are using .NET Framework 3.5 or later you can use the HashSet collection.
Otherwise the best option is the Dictionary. The value of each item will be wasted, but that will give you the best performance.
If you check for duplicates while you add the items to the HashSet/Dictionary instead of counting them afterwards, you get better performance than O(n) in case there are duplicates, as you don't have to continue looking after finding the first duplicate.
If the set of numbers is sparse, then as others suggest use a HashSet.
But if the set of numbers is mostly in sequence with occasional gaps, it would be a lot better if you stored the number set as a sorted array or binary tree of begin,end pairs. Then you could search to find the pair with the largest begin value that was smaller than your search key and compare with that pair's end value to see if it exists in the set.
What about doing:
list.Distinct().Count() != list.Count()
I wonder about the performance of this. I think it would be as good as O(n) but with less code and still easily readable.
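The early-exit approach mentioned above (checking while adding, instead of counting afterwards) can be sketched like this. HasDuplicates is a hypothetical helper name; it relies on HashSet<T>.Add returning false when the value is already present.

```csharp
using System;
using System.Collections.Generic;

// Returns true as soon as a repeated value is seen.
static bool HasDuplicates(IEnumerable<int> values)
{
    var seen = new HashSet<int>();
    foreach (int value in values)
    {
        // Add returns false when the value is already in the set.
        if (!seen.Add(value))
        {
            return true;
        }
    }
    return false;
}

var unique = new List<int> { 3, 1, 4, 5, 9 };
var repeated = new List<int> { 3, 1, 4, 1, 5 };

Console.WriteLine(HasDuplicates(unique));
Console.WriteLine(HasDuplicates(repeated));
```

Compared with the Distinct().Count() comparison, this stops at the first duplicate rather than always processing the whole list.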

Array that can be resized fast

I'm looking for a kind of array data-type that can easily have items added, without a performance hit.
System.Array - ReDim Preserve copies the entire array from the old storage to the new, so it gets slower as the number of existing elements grows
System.Collections.ArrayList - good enough?
System.Collections.IList - good enough?
Just to summarize a few data structures:
System.Collections.ArrayList: untyped data structures are obsolete. Use List(Of T) instead.
System.Collections.Generic.List(Of T): this represents a resizable array. This data structure uses an internal array behind the scenes. Adding items to List is O(1) as long as the underlying array hasn't been filled; otherwise it's O(n) to resize the internal array and copy the elements over.
List<int> nums = new List<int>(3); // creates a resizable array
                                   // which can hold 3 elements
nums.Add(1);
// adds item in O(1). nums.Capacity = 3, nums.Count = 1
nums.Add(2);
// adds item in O(1). nums.Capacity = 3, nums.Count = 2
nums.Add(3);
// adds item in O(1). nums.Capacity = 3, nums.Count = 3
nums.Add(4);
// adds item in O(n). List doubles the size of its internal array, so
// nums.Capacity = 6, nums.Count = 4
Adding items is only efficient when adding to the back of the list. Inserting in the middle forces the array to shift all items forward, which is an O(n) operation. Deleting items is also O(n), since the array needs to shift items backward.
System.Collections.Generic.LinkedList(Of T): if you don't need random or indexed access to items in your list, for example you only plan to add items and iterate from first to last, then a LinkedList is your friend. Inserts and removals are O(1), lookup is O(n).
You should use the Generic List<> (System.Collections.Generic.List) for this. It operates in constant amortized time.
It also shares the following features with Arrays.
Fast random access (you can access any element in the list in O(1))
It's quick to loop over
Slow to insert and remove objects at the start or in the middle (since it has to do a copy of the entire list, I believe)
If you need quick insertions and deletions at the beginning or end, use either a linked list or queues
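The "constant amortized time" claim above can be observed by counting how often the internal array is replaced while appending. This is a rough sketch; the exact growth policy is an implementation detail and the item count of 1000 is arbitrary.

```csharp
using System;
using System.Collections.Generic;

var numbers = new List<int>();
int reallocations = 0;
int lastCapacity = numbers.Capacity;

for (int i = 0; i < 1000; i++)
{
    numbers.Add(i);
    if (numbers.Capacity != lastCapacity)
    {
        // The internal array was replaced by a larger one.
        reallocations++;
        lastCapacity = numbers.Capacity;
    }
}

Console.WriteLine($"{numbers.Count} items, {reallocations} reallocations");

// Pre-sizing avoids the intermediate copies when the final size is known.
var presized = new List<int>(1000);
```

Because the capacity grows geometrically, the number of reallocations stays small compared with the number of Add calls, which is where the amortized O(1) figure comes from.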
Would the LinkedList<T> structure work for you? It's not (in some cases) as intuitive as a straight array, but is very quick.
AddLast to append to the end
AddBefore/AddAfter to insert into list
AddFirst to insert at the beginning
It's not so quick for random access however, as you have to iterate over the structure to access your items... however, it has .ToList() and .ToArray() methods to grab a copy of the structure in list/array form so for read access, you could do that in a pinch. The performance increase of the inserts may outweigh the performance decrease of the need for random access or it may not. It will depend entirely on your situation.
There's also this reference which will help you decide which is the right way to go:
When to use a linked list over an array/array list?
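The LinkedList<T> methods listed above can be tried out with a short sketch like the following. The values are arbitrary; note that Find still walks the list, so only the insert itself is O(1).

```csharp
using System;
using System.Collections.Generic;

var items = new LinkedList<int>();
items.AddLast(2);
items.AddLast(4);
items.AddFirst(1); // O(1) insert at the front

// Find walks the list (O(n)); the insert is O(1) once the node is known.
var node = items.Find(4);
if (node != null)
{
    items.AddBefore(node, 3);
}

string joined = string.Join(", ", items);
Console.WriteLine(joined);
```

If the code already holds a reference to the node where it wants to insert, the Find step disappears and the linked list shows its advantage over List<T>.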
What is "good enough" for you? What exactly do you want to do with that data structure?
No array structure (i.e. O(1) access) allows insertion in the middle without an O(n) runtime; insertion at the end is O(n) worst case and O(1) amortized for self-resizing arrays like ArrayList.
Maybe hashtables (amortized O(1) access and insertion anywhere, but O(n) worst case for insertion) or trees (O(log(n)) for access and insertion anywhere, guaranteed) are better suited.
If speed is your problem, I don't see how the selected answer is any better than using a raw Array. Although it resizes itself, it uses the exact same mechanism you would use to resize an array (and should take just a touch longer), UNLESS you are always adding to the end, in which case it should do things a bit smarter because it allocates a chunk at a time instead of just one element.
If you often add near the beginning/middle of your collection and don't index into the middle/end very often, you probably want a Linked List. That will have the fastest insert time and will have great iteration time, it just sucks at indexing (such as looking at the 3rd element from the end, or the 72nd element).
