So I have been mainly using lists to retrieve small amounts of data from a database which feeds into a web application, but I have recently come across dictionaries, which produce more readable code with keys. What is the performance difference when simply referring to an item by index/key?
I understand that a dictionary uses more memory, but what is best practice in this scenario, and is it worth the performance/maintenance trade-off, bearing in mind that I will not be performing searches or sorting the data?
When you want to find an item in a list, you may have to look at ALL the items until you find the one with the matching key.
Let's look at a basic example. You have
public class Person
{
    public int ID { get; set; }
    public string Name { get; set; }
}
and you have a collection List<Person> persons and you want to find a person by its ID:
var person = persons.FirstOrDefault(x => x.ID == 5);
As written, it has to enumerate the List until it reaches the entry with the correct ID (does entry 0 match the lambda? No... does entry 1 match the lambda? No... and so on). This is O(n).
However, if you look the person up through a Dictionary<int, Person> dictPersons:
var person = dictPersons[5];
To find an element by key, a dictionary can jump straight to where the element is stored - this is O(1) per lookup, versus O(n) per lookup for the list. (If you want to know how this is done: the Dictionary runs a mathematical operation on the key which turns it into a number identifying a slot inside the dictionary - the same slot the item was placed in when it was inserted. This is called a hash function.)
So, Dictionary is faster than List because Dictionary does not iterate through the whole collection; instead, it takes the item from the exact place the hash function calculates. For lookups by key it is simply a better algorithm.
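For completeness, here is a minimal sketch (with invented sample data) of building that dictionary from the existing list and doing both lookups side by side:
using System.Collections.Generic;
using System.Linq;

var persons = new List<Person>
{
    new Person { ID = 5, Name = "Ann" },   // sample data for illustration
    new Person { ID = 7, Name = "Bob" }
};

// Build the dictionary once, keyed by ID (assumes IDs are unique).
Dictionary<int, Person> dictPersons = persons.ToDictionary(p => p.ID);

Person byScan = persons.FirstOrDefault(x => x.ID == 5); // O(n): walks the list
Person byKey  = dictPersons[5];                         // O(1): slot computed from the key's hash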
Dictionary relies on chaining (maintaining a list of items for each hash table bucket) to resolve collisions, whereas Hashtable uses rehashing for collision resolution (when a collision occurs, it tries another hash function to map the key to a bucket). It is worth reading up on how hash functions work and on the difference between chaining and rehashing.
Unless you're actually experiencing performance issues and need to optimize it's better to go with what's more readable and maintainable. That's especially true since you mentioned that it's small amounts of data. Without exaggerating - it's possible that over the life of the application the cumulative difference in performance (if any) won't equal the time you save by making your code more readable.
To put it in perspective, consider the work that your application already does just to read request headers and parse views and read values from configuration files. Not only will the difference in performance between the list and the dictionary be small, it will also be a tiny fraction of the overall processing your application does just to serve a single page request.
And even then, if you were to see performance issues and needed to optimize, there would probably be plenty of other optimizations (like caching) that would make a bigger difference.
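For example, even a very small cache in front of the database call will usually dwarf any list-versus-dictionary difference. A minimal sketch, reusing the Person class from the example above (the repository shape and the GetPersonsFromDatabase placeholder are illustrative, not part of the question):
using System.Collections.Concurrent;
using System.Collections.Generic;

public class PersonRepository
{
    // Each distinct query is loaded from the database once and reused afterwards.
    private readonly ConcurrentDictionary<string, List<Person>> _cache =
        new ConcurrentDictionary<string, List<Person>>();

    public List<Person> GetPersons(string query) =>
        _cache.GetOrAdd(query, q => GetPersonsFromDatabase(q));

    private List<Person> GetPersonsFromDatabase(string query)
    {
        // Placeholder for the real database call.
        return new List<Person>();
    }
}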
First, a little background: I enjoy working on Project Euler problems (https://projecteuler.net/archives), but many of them require a ton of heavy computation, so I try to save known constants in memory so they don't have to be recalculated every time. These include things like n!, nPr, nCr, and lists of primes. For the purpose of this question let's just stick with primes, because any solution for those can be easily ported to the others.
The question: Let's say I want to save the first 1,000,000 primes in memory for repeated access while doing heavy computation. The 1,000,000th prime is 15,485,863, so ints will do just fine here. I need to save these values in a way such that access is O(1), because they will be accessed a lot.
What I've tried so far:
Clearly I can't put all 1,000,000 in one .cs file because Visual Studio throws a fit, so I've been trying to break it into multiple files using a partial class and a 2-D List<List<int>>:
public partial class Primes
{
    // static so the field can be referenced from the aggregate initializer below
    public static readonly List<int> _primes_1 = new List<int>
    {
        2, 3, ... 999983
    };
}
So _primes_1 has the primes less than 1,000,000, _primes_2 has the primes between 1,000,000 and 2,000,000, and so on - 15 files' worth. Then I put them together:
public partial class Primes
{
    public static readonly List<List<int>> _primes = new List<List<int>>()
    {
        _primes_1, _primes_2, _primes_3, _primes_4, _primes_5,
        _primes_6, _primes_7, _primes_8, _primes_9, _primes_10,
        _primes_11, _primes_12, _primes_13, _primes_14, _primes_15
    };
}
This methodology does work: it is easy to enumerate through the list, and IsPrime(n) checks are fairly simple as well (binary search). The big downfall is that VS starts to freak out because each file has ~75,000 ints in it (~8,000 lines depending on spacing). In fact, much of my editing of these files has to be done in NPP just to keep VS from hanging/crashing.
Other things I've considered:
I originally read the numbers in from a text file and could do that in the program, but clearly I would want to do that at startup and then just have the values available. I also considered dumping them into SQL, but again, eventually they need to be in memory. For the in-memory storage I considered memcache, but I don't know enough about it to know how efficient its lookups are.
In the end, this comes down to two questions:
How do the numbers get in to memory to begin with?
What mechanism is used to store them?
Spending a little more time in spin up is fine (within reason) as long as the lookup mechanism is fast fast fast.
Quick note: Yes I know that if I only do 15 pages as shown then I won't have all 1,000,000 because 15,485,863 is on page 16. That's fine, for our purposes here this is good enough.
Bring them in from a single text file at startup. This data shouldn't be in source files (as you are discovering).
Store them in a HashSet<int>, so for any number n, isPrime = n => primeHashSet.Contains(n). This will give you your desired O(1) complexity.
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text.RegularExpressions;

HashSet<int> primeHashSet = new HashSet<int>(
    File.ReadLines(filePath)
        .AsParallel() // maybe?
        .SelectMany(line => Regex.Matches(line, @"\d+").Cast<Match>())
        .Select(m => m.Value)
        .Select(int.Parse));

Predicate<int> isPrime = primeHashSet.Contains;
bool someNumIsPrime = isPrime(5000); // for example
On my (admittedly fairly snappy) machine, this loads in about 300ms.
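If you also need ordered access (enumerating the primes in order, or the binary-search style IsPrime(n) check you describe), a sorted array alongside the HashSet works well. A sketch under the same assumptions about the input file (same usings as above):
// Sorted array for ordered enumeration and index access (primes[n - 1] is the nth prime).
// The HashSet above remains the faster option for pure membership tests.
int[] primes = File.ReadLines(filePath)
    .SelectMany(line => Regex.Matches(line, @"\d+").Cast<Match>())
    .Select(m => int.Parse(m.Value))
    .OrderBy(p => p)
    .ToArray();

bool isPrimeViaBinarySearch = Array.BinarySearch(primes, 15485863) >= 0; // O(log n)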
Recently, I started working on a C# .NET project that requires keeping a dictionary of words in memory.
My first approach was to create a
Dictionary<string, string>
(where the key would be the word and the value the definition).
That worked well, and after a while I decided to try using "buckets" and went for a
Dictionary<char, Dictionary<string, string>>
where the char is the first letter of the words inside the inner Dictionary.
My question is: do I really get a performance gain from this change (while making the code more complex)?
I'm aware Dictionary is supposed to be O(1), so in theory it would be the same for 5 words or 2 million. And by adding multiple levels I would be doubling the lookup time.
Thanks!
There are many, many factors at work here. By splitting your data per letter you add an extra lookup through additional objects that all have to compete for space in your CPU cache; you're more likely to thrash the cache and get worse performance.
On the other hand if you have a lot of entries relatively equally distributed across their first letters, and if you don't look up uniformly but focus on just a few letters, then you're likely to get an increase in lookup performance.
And as a last note: dictionary lookup is O(1) only on average - hash collisions and occasional resizing mean individual operations can cost more - so don't base your decisions purely on that label.
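If you want to settle it empirically rather than by rules of thumb, a rough Stopwatch sketch along these lines (the toy word list and iteration count are placeholders for your real data) will tell you more than any general answer:
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;

// Toy data spread across 26 first letters; swap in your real word list (keys assumed distinct).
List<string> words = Enumerable.Range(0, 100_000)
                               .Select(i => (char)('a' + i % 26) + "word" + i)
                               .ToList();

var flat   = words.ToDictionary(w => w, w => "definition");
var nested = words.GroupBy(w => w[0])
                  .ToDictionary(g => g.Key,
                                g => g.ToDictionary(w => w, w => "definition"));

var sw = Stopwatch.StartNew();
for (int i = 0; i < 1_000_000; i++)
    _ = flat[words[i % words.Count]];
Console.WriteLine($"flat:   {sw.ElapsedMilliseconds} ms");

sw.Restart();
for (int i = 0; i < 1_000_000; i++)
{
    string w = words[i % words.Count];
    _ = nested[w[0]][w];
}
Console.WriteLine($"nested: {sw.ElapsedMilliseconds} ms");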
The answer is NO - you would not improve the performance of the hash table by splitting it. And, as you noted, you would ALWAYS be doing multiple lookups.
To improve performance you need to reduce the number of collisions. Assuming the hashing function stays the same, the only thing you can alter is the load factor. As always, speed comes at the price of space.
Ignoring overhead, in the same space you can create one table with 1,000 buckets or ten tables with 100 buckets each. Placing 1,000 items in them gives a load factor of 1.0 for the big table, and an average of 1.0 across the little ones. The "lucky" small tables will perform better, the others worse - and you still have to add the time for the extra lookup on top of that...
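One .NET-specific note on the above: Dictionary<TKey, TValue> does not expose its load factor at all, so in practice the main knob you have is the initial capacity, which at least avoids repeated rehashing while the table fills. A minimal sketch, assuming you know the entry count up front:
using System.Collections.Generic;

const int expectedEntries = 1000; // illustrative figure, matching the answer above
var table = new Dictionary<string, string>(expectedEntries); // sized once, so no resizing while filling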
I have a Dictionary of objects with strings as the keys. The Dictionary is first populated with anywhere from 50 to tens of thousands of entries. Later on, my program looks up values in this dictionary, and once I have found an item I no longer have any need to keep it there. My question is: would I get better total execution time by removing entries from the dictionary once I no longer need them - cutting down memory usage or making subsequent lookups slightly faster - or would the extra time spent removing items outweigh any gain?
I understand the answer may depend on details such as how many total lookups are done against the dictionary, the size of the keys, and the size of the objects; I will try to provide these below. But is there a general answer? Is it pointless to try to improve performance this way, or are there cases where it would be a good idea?
The key is a variable-length string, either 6 characters or ~20 characters.
The total number of lookups is completely up in the air: I may have to check only 50 or so times, or I may have to look up 10K times, completely independent of the size of the dictionary - the dictionary may have 50 items and I may do 10K lookups, or it may have 10K items and I may only do 50 lookups.
One additional note: if I do remove items from the dictionary and am ever left with an empty dictionary, I can signal a waiting thread to stop waiting for me while I process the remaining items (this involves parsing a long text file while looking up items in the dictionary to determine what to do with the parsed data).
Dictionary lookups are essentially O(1). Removing items from the dictionary will have a tiny (if any) impact on lookup speed.
In the end, it's very likely that removing items will be slower than just leaving them in.
The only reason I'd suggest removing items would be if you need to reduce your memory footprint.
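If you do end up removing entries for memory reasons, the lookup and the removal combine cheaply. A sketch of the pattern (the method shape, the process callback, and the ManualResetEventSlim signal are illustrative stand-ins for your real handler and whatever primitive your waiting thread blocks on):
using System;
using System.Collections.Generic;
using System.Threading;

static void ConsumeEntry(Dictionary<string, object> items, string key,
                         Action<object> process, ManualResetEventSlim doneSignal)
{
    if (items.TryGetValue(key, out object value))
    {
        items.Remove(key);       // done with this entry; lets the GC reclaim the object
        process(value);          // your real handling of the parsed data goes here
        if (items.Count == 0)
            doneSignal.Set();    // nothing left: release the waiting thread
    }
}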
I found some interesting items over at DotNetPerls that seem to relate to your question.
The order you add keys to a Dictionary is important. It affects the performance of accessing those keys. Because the Dictionary uses a chaining algorithm, the keys that were added last are often faster to locate.
http://www.dotnetperls.com/dictionary-order
Dictionary size influences lookup performance. Smaller Dictionaries are faster than larger Dictionaries. This is true when they are tested for keys that always exist in both. Reducing Dictionary size could help improve performance.
http://www.dotnetperls.com/dictionary-size
I thought this last tidbit was really interesting. It didn't occur to me to consider my key length.
Generally, shorter [key] strings perform better than longer ones.
http://www.dotnetperls.com/dictionary-string-key
Good question!
I have a huge collection of strings. I frequently need to find all the strings that start with a given character. What would be the best collection to do this? I will initialize the collection in sorted order.
Thanks
If you want a map from a character to all strings starting with that character, you might find ILookup<TKey, TElement> suitable. It's very similar to a Dictionary<TKey, TValue>, with two main differences:
Instead of a 1:1 mapping, it performs a 1:n mapping (i.e. there can be more than one value per key).
You cannot instantiate (new) nor populate it (.Add(…)) yourself; instead, you let .NET derive a fully populated instance from another collection by calling .ToLookup(…) on the latter.
Here's an example how to build such a 1:n map:
using System;                     // for Console
using System.Collections.Generic; // for List<T>
using System.Linq;                // for ILookup<TKey, TElement> and .ToLookup(…)
// This represents the source of your strings. It doesn't have to be sorted:
var strings = new List<string>() { "Foo", "Bar", "Baz", "Quux", … };
// This is how you would build a 1:n lookup table mapping from first characters
// to all strings starting with that character. Empty strings are excluded:
ILookup<char, string> stringsByFirstCharacter =
strings.Where(str => !string.IsNullOrEmpty(str)) // exclude empty strings
.ToLookup(str => str[0]); // key := first character
// This is how you would look up all strings starting with B.
// The output will be Bar and Baz:
foreach (string str in stringsByFirstCharacter['B'])
{
Console.WriteLine(str);
}
P.S.: The above hyperlink for ILookup<…> (the interface) refers you to the help page for Lookup<…> (the implementation class). This is on purpose, as I find the documentation for the class easier to read. I would, however, recommend using the interface in your code.
If you need to search a huge collection of strings regularly, then use a hash table. Make sure the hash function distributes the keys evenly across the buckets to keep lookups fast.
Well, what you need is to build an index keyed on a function of the string.
For this I'd suggest using a
Dictionary<string, List<string>> data structure.
ToLookup isn't as good because it limits your ability to manipulate the data structure afterwards.
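A rough sketch of that idea, reusing the strings collection from the ILookup example above (the first character is stored as a one-character string to match the suggested type):
using System.Collections.Generic;
using System.Linq;

// Build the index once: first character (as a string) -> all strings starting with it.
Dictionary<string, List<string>> index = strings
    .Where(s => !string.IsNullOrEmpty(s))
    .GroupBy(s => s.Substring(0, 1))
    .ToDictionary(g => g.Key, g => g.ToList());

// Query: everything starting with "B"; returns an empty list when nothing matches.
List<string> startingWithB =
    index.TryGetValue("B", out var hits) ? hits : new List<string>();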
I have a bunch of pairs of dates and monetary values in a SortedDictionary<DateTime, decimal>, corresponding to loan balances calculated into the future at contract-defined compounding dates. Is there an efficient way to find a date key that is nearest to a given value? (Specifically, the nearest key less than or equal to the target). The point is to store only the data at the points when the value changed, but efficiently answer the question "what was the balance on x date?" for any date in range.
A similar question was asked ( What .NET dictionary supports a "find nearest key" operation? ) and the answer was "no" at the time, at least from the people who responded, but that was almost 3 years ago.
The question How to find point between two keys in sorted dictionary presents the obvious solution of naively iterating through all keys. I am wondering if any built-in framework function exists to take advantage of the fact that the keys are already indexed and sorted in memory -- or alternatively a built-in Framework collection class that would lend itself better to this kind of query.
Since SortedDictionary is sorted on the key, you can create a sorted list of keys with
var keys = new List<DateTime>(dictionary.Keys);
and then efficiently perform binary search on it:
var index = keys.BinarySearch(key);
As the documentation says, if index is positive or zero then the key exists; if it is negative, then ~index is the index where key would be found at if it existed. Therefore the index of the "immediately smaller" existing key is ~index - 1. Make sure you handle correctly the edge case where key is smaller than any of the existing keys and ~index - 1 == -1.
Of course the above approach really only makes sense if keys is built up once and then queried repeatedly; since it involves iterating over the whole sequence of keys and doing a binary search on top of that there's no point in trying this if you are only going to search once. In that case even naive iteration would be better.
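Putting the pieces together, a sketch of the "what was the balance on x date?" lookup (names are illustrative; keys is assumed to be the sorted key list built once as shown above):
using System;
using System.Collections.Generic;

// Returns the balance in effect on 'date': the value stored under the greatest
// key that is <= date. Throws if 'date' precedes the first recorded balance.
static decimal BalanceOn(SortedDictionary<DateTime, decimal> balances,
                         List<DateTime> keys, DateTime date)
{
    int index = keys.BinarySearch(date);
    if (index < 0)
        index = ~index - 1;   // index of the nearest key strictly before 'date'
    if (index < 0)
        throw new ArgumentOutOfRangeException(nameof(date), "Earlier than the first balance.");
    return balances[keys[index]];
}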
Update
As digEmAll correctly points out, you could also switch to SortedList<DateTime, decimal> so that the Keys collection implements IList<T> (which SortedDictionary.Keys does not). That interface provides enough functionality to perform a binary search on it manually, so you could take e.g. this code and make it an extension method on IList<T>.
You should also keep in mind that SortedList performs worse than SortedDictionary during construction if the items are not inserted in already-sorted order, although in this particular case it is highly likely that dates are inserted in chronological (sorted) order which would be perfect.
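For the SortedList route, the manual binary search might look roughly like this as an IList<T> extension (a sketch of the idea, not the code linked above):
using System;
using System.Collections.Generic;

public static class SortedListExtensions
{
    // Index of the last element <= item in an ascending-sorted list
    // (e.g. SortedList<K, V>.Keys), or -1 if every element is greater.
    public static int LastIndexLessOrEqual<T>(this IList<T> list, T item)
        where T : IComparable<T>
    {
        int lo = 0, hi = list.Count - 1, result = -1;
        while (lo <= hi)
        {
            int mid = lo + (hi - lo) / 2;
            if (list[mid].CompareTo(item) <= 0) { result = mid; lo = mid + 1; }
            else { hi = mid - 1; }
        }
        return result;
    }
}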
So, this doesn't directly answer your question, because you specifically asked for something built-in to the .NET framework, but facing a similar problem, I found the following solution to work best, and I wanted to post it here for other searchers.
I used the TreeDictionary<K, V> from the C5 Collections (GitHub/NuGet), which is an implementation of a red-black tree.
It has Predecessor/TryPredecessor and WeakPredecessor/TryWeakPredecessor methods (as well as similar methods for successors) to easily find the items nearest to a key.
More useful in your case, I think, are the RangeFrom/RangeTo/RangeFromTo methods that allow you to retrieve a range of key-value pairs between keys.
Note that all of these methods can also be applied to the TreeDictionary<K, V>.Keys collection, which allows you to work with only the keys as well.
It really is a very neat implementation, and something like it deserves to be in the BCL.
It is not possible to efficiently find the nearest key with SortedList, SortedDictionary or any other "built-in" .NET type, if you need to interleave queries with inserts (unless your data arrives pre-sorted, or the collection is always small).
As I mentioned on the other question you referenced, I created three data structures related to B+ trees that provide find-nearest-key functionality for any sortable data type: BList<T>, BDictionary<K,V> and BMultiMap<K,V>. Each of these data structures provide FindLowerBound() and FindUpperBound() methods that work like C++'s lower_bound and upper_bound.
These are available in the Loyc.Collections NuGet package, and BDictionary typically uses about 44% less memory than SortedDictionary.
// Assumes a compounding-period length defined elsewhere, for example:
// static readonly TimeSpan PeriodLength = TimeSpan.FromDays(30);
public static DateTime RoundDown(DateTime dateTime)
{
    // Truncate the timestamp down to the start of its period.
    long remainingTicks = dateTime.Ticks % PeriodLength.Ticks;
    return dateTime - new TimeSpan(remainingTicks);
}