.NET: Scalability of generic Dictionary - c#

I'm using a Dictionary<> to store a bazillion items. Is it safe to assume that as long as the server's memory has enough space to accommodate these bazillion items that I'll get near O(1) retrieval of items from it? What should I know about using a generic Dictionary as huge cache when performance is important?
EDIT: I shouldn't rely on the default implementations? What makes for a good hashing function?

It depends, just about entirely, on how good a hash function your "bazillion items" support -- if their hash function is not excellent (so that many collisions result), your performance will degrade as the dictionary grows.

You should measure it and find out. You're the one who has knowledge of the exact usage of your dictionary, so you're the one who can measure it to see if it meets your needs.
A word of advice: I have in the past done performance analysis on large dictionary structures, and discovered that performance did degrade as the dictionary became extremely large. But it seemed to degrade here and there, not consistently on each operation. I did a lot of work trying to analyze the hash algorithms, etc., before smacking myself in the forehead: the garbage collector was getting slower because I had so much live working set. The dictionary was just as fast as it always was, but if a collection happened to be triggered, then that was eating up my cycles.
That's why it is important not to do performance testing in unrealistic benchmark scenarios. To find out the real-world performance cost of your bazillion-item dictionary, well, that's going to be gated on lots of stuff that has nothing to do with your dictionary, like how much collection triggering is happening throughout the rest of your program, and when.

Yes, you will have O(1) access times. In fact, to be pedantic, it will be exactly O(1).
You need to ensure that all the objects you use as keys have a good GetHashCode implementation, and you should likely override Equals as well.
Edit to clarify: in reality, access times will get slower the more items you have, unless you can provide a "perfect" hash function.

Yes, you will have near O(1) access no matter how many objects you put into the Dictionary. But for the Dictionary to be fast, your key objects should provide a good GetHashCode implementation, because Dictionary uses a hash table internally.
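As an illustration of that advice, here is a minimal sketch of a key type (the Customer class and its fields are invented for this example); the essential point is that Equals and GetHashCode must agree, i.e. be derived from the same fields:

    using System;
    using System.Collections.Generic;

    public sealed class Customer
    {
        public int Id { get; }
        public string Region { get; }

        public Customer(int id, string region) { Id = id; Region = region; }

        public override bool Equals(object obj) =>
            obj is Customer c && c.Id == Id && c.Region == Region;

        // Combine the same fields that Equals compares, so that equal
        // instances always produce equal hash codes.
        public override int GetHashCode()
        {
            unchecked
            {
                int hash = 17;
                hash = hash * 31 + Id;
                hash = hash * 31 + (Region?.GetHashCode() ?? 0);
                return hash;
            }
        }
    }

    class Demo
    {
        static void Main()
        {
            var cache = new Dictionary<Customer, decimal>
            {
                [new Customer(1, "EU")] = 42.0m
            };
            // Lookup with a distinct but equal instance succeeds because
            // Equals and GetHashCode agree.
            Console.WriteLine(cache[new Customer(1, "EU")]); // 42.0
        }
    }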

Related

Boo/C#: Returning a collection performance

I have a function that gets a collection by reference, iterates over it, constructs a new collection of the same length containing updated structs, and returns that collection (or rather a reference to it, since it's Boo/C# code).
I'm worried about performance of such a function. Is performance a lot worse than just updating the collection by reference? I need to call this function tens of times per second.
Thank you. Alisa.
P.S.: Why am I doing this? I'm trying to move onto functional programming and make it as pure as possible.
It will be slower, but not by much. It will also consume more memory as you'll have two collections in RAM whenever you're updating the structs.
The impact on performance will also be affected by your collections' sizes.
The best way to answer your question is to create both functions, then profile them.
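To make the comparison concrete, here is a sketch of the two styles (the Particle struct and the method names are invented for the example):

    using System.Collections.Generic;

    struct Particle { public float X, Velocity; }

    static class Updater
    {
        // Pure style: allocate and return a new collection on each call.
        public static List<Particle> UpdatedCopy(List<Particle> source, float dt)
        {
            var result = new List<Particle>(source.Count); // pre-size to avoid regrowth
            foreach (var p in source)
                result.Add(new Particle { X = p.X + p.Velocity * dt, Velocity = p.Velocity });
            return result;
        }

        // In-place style: mutate the existing collection, no extra allocation.
        public static void UpdateInPlace(List<Particle> particles, float dt)
        {
            for (int i = 0; i < particles.Count; i++)
            {
                var p = particles[i];          // structs are copied out of the list,
                p.X += p.Velocity * dt;
                particles[i] = p;              // so the updated copy must be written back
            }
        }
    }

Profiling both versions, as suggested, will show whether the extra allocation actually matters at tens of calls per second.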

How to ensure ConcurrentDictionary of collections thread-safety?

So here I am implementing a caching layer. Particularly, I am stuck with
ConcurrentDictionary<SomeKey,HashSet<SomeKey2>>
I need to ensure that operations on the HashSet are thread-safe too (ergo, Update is thread-safe). Is this possible in any simple way, or do I have to synchronize in the UpdateFactory delegate? If the answer is yes (which I presume), has anyone encountered this problem before and solved it?
I want to avoid a ConcurrentDictionary of ConcurrentDictionaries, because they allocate a lot of synchronization objects and I potentially have around a million entries in this thing, so I want to put less pressure on the GC.
HashSet was chosen because it guarantees amortized constant cost of insertion, deletion, and access.
The aforementioned structure will be used as an index on a larger data set, with two columns as a key (SomeKey and SomeKey2), much like a database index.
OK, so finally I decided to go with an immutable set and lock striping, because it is reasonably simple to implement and understand. If I need more write performance (avoiding copying the whole hash set on insert), I will implement reader/writer locks with striping, which should be fine anyway.
Thanks for suggestions.
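For reference, a minimal sketch of that final design, assuming the System.Collections.Immutable package is referenced (the class name and stripe count are illustrative):

    using System;
    using System.Collections.Concurrent;
    using System.Collections.Immutable;

    public class StripedSetIndex<TKey, TValue>
    {
        private readonly ConcurrentDictionary<TKey, ImmutableHashSet<TValue>> _index =
            new ConcurrentDictionary<TKey, ImmutableHashSet<TValue>>();

        // A fixed pool of locks: far fewer allocations than one lock per entry.
        private readonly object[] _stripes;

        public StripedSetIndex(int stripeCount = 32)
        {
            _stripes = new object[stripeCount];
            for (int i = 0; i < stripeCount; i++) _stripes[i] = new object();
        }

        private object StripeFor(TKey key) =>
            _stripes[(key.GetHashCode() & 0x7FFFFFFF) % _stripes.Length];

        public void Add(TKey key, TValue value)
        {
            lock (StripeFor(key)) // writers for the same key serialize on one stripe
            {
                var current = _index.GetOrAdd(key, _ => ImmutableHashSet<TValue>.Empty);
                _index[key] = current.Add(value); // copy-on-write; readers never block
            }
        }

        public bool Contains(TKey key, TValue value) =>
            _index.TryGetValue(key, out var set) && set.Contains(value);
    }

Each write copies the set, which is exactly the cost the reader/writer-lock variant mentioned above would avoid.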

What C# container is most resource-efficient for existence for only one operation?

I find myself often with a situation where I need to perform an operation on a set of properties. The operation can be anything from checking if a particular property matches anything in the set to a single iteration of actions. Sometimes the set is dynamically generated when the function is called, some built with a simple LINQ statement, other times it is a hard-coded set that will always remain the same. But one constant always exists: the set only exists for one single operation and has no use before or after it.
My problem is, I have so many points in my application where this is necessary, but I appear to be very, very inconsistent in how I store these sets. Some of them are arrays, some are lists, and just now I've found a couple of linked lists. Now, none of the operations I'm specifically concerned about have to care about indices, container size, order, or any other functionality that is bestowed by any of the individual container types. I picked resource efficiency because it's a better idea than flipping coins. I figured, since an array's size is fixed and it's a very elementary container, that might be my best choice, but I figured it was a better idea to ask around. Alternatively, if there's a better choice not out of resource efficiency but strictly as being a better fit for this kind of situation, that would be nice as well.
With your acknowledgement that this is more about coding consistency than about performance or efficiency, I think the general practice is to use a List<T>. Its actual backing store is an array, so you aren't really losing much (if anything noticeable) to container overhead. Without more qualifications, I'm not sure that I can offer anything more than that.
Of course, if you truly don't care about the things that you list in your question, just type your variables as IEnumerable<T> and you're only dealing with the actual container when you're populating it; where you consume it will be entirely consistent.
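For example (a small sketch; the method name is invented), the consuming code can be written against IEnumerable<T> regardless of how each set was built:

    using System;
    using System.Collections.Generic;
    using System.Linq;

    static class Matching
    {
        // The operation doesn't care whether the set was an array, a list,
        // or a deferred LINQ query.
        public static bool MatchesAny(IEnumerable<string> candidates, string target) =>
            candidates.Contains(target);

        static void Main()
        {
            IEnumerable<string> hardCoded = new[] { "alpha", "beta" };
            IEnumerable<string> generated = Enumerable.Range(1, 3).Select(i => "item" + i);
            Console.WriteLine(MatchesAny(hardCoded, "beta"));  // True
            Console.WriteLine(MatchesAny(generated, "item9")); // False
        }
    }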
There are two basic principles to be aware of regarding resource efficiency.
Runtime complexity
Memory overhead
You said that indices and order do not matter and that a frequent operation is matching. A Dictionary<TKey, TValue> (which is a hash table) is an ideal candidate for this type of work. Lookups on the keys are very fast, which would be beneficial in your matching operation. The disadvantage is that it will consume a little more memory than what would be strictly required. The usual load factor is around 0.8, so we are not talking about a huge increase or anything.
For your other operations you may find that an array or List<T> is a better option, especially if you do not need fast lookups. As long as you do not need high performance on specialty operations (lookups, sorting, etc.), it is hard to beat the general resource characteristics of array-based containers.
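To make the trade-off concrete, a small sketch of the two options (the property names are invented): a HashSet<string> for membership tests, a List<string> for a plain one-pass iteration:

    using System;
    using System.Collections.Generic;

    class ContainerChoice
    {
        static void Main()
        {
            // Matching: hash-based membership test, O(1) on average.
            var allowed = new HashSet<string> { "Name", "Id", "CreatedOn" };
            Console.WriteLine(allowed.Contains("Id")); // True

            // A single pass of actions: a List<string> (or array) is hard to beat.
            var properties = new List<string> { "Name", "Id", "CreatedOn" };
            foreach (var p in properties)
                Console.WriteLine(p);
        }
    }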
List is probably fine in general. It's easy to understand (in the literate-programming sense) and reasonably efficient. The keyed collections (e.g. Dictionary, SortedList) will throw an exception if you add an entry with a duplicate key, though this may not be a problem for what you're working on now.
Only if you find that you're running into a CPU-time or memory-size problem should you look at improving the "efficiency", and then only after determining that this is the bottleneck.
No matter which approach you use, there will still be creation and deletion of the underlying objects (collection or iterator) that will eventually be garbage collected, if the application runs long enough.

Fast keyword lookup

I'm writing a program that matches a user submitted query against a list of keywords. The list has about 2000 words and performance is most important.
Old:
Is it faster to store this list in a SQL table or hard-code it in the source code? The list does not need to be updated often.
If a SQL table is faster, which data types would be best? (Int, Nvarchar?)
If a hardcoded list is faster, what data type would be best? (List?)
Any suggestions?
EDIT: What is the best in-memory data structure for fast lookups?
Where you store this data doesn't matter for performance.
When your program starts, you load the string array once from whichever data store you chose, and then you can use that array the whole time, until you quit the program.
IMO, if the list doesn't get updated often, store it in a file (text/XML), then cache it in your application so that it will be faster for subsequent requests.
Okay, to respond to your edit (and basically lifting my comment into an answer):
Specify in advance the performance that you are expecting.
Code your application against a sorted array, using a binary search to find a keyword in the array. This is very simple to implement and gives decent performance. Then profile to see if it matches the performance that you demand. If this performance is acceptable, move on. The worst-case performance here is O(m log n), where n is the number of keywords and m is the maximum length of your keywords.
If the performance in step two is not acceptable, use a trie (also known as a prefix tree). The expected performance here is O(m), where m is the maximum length of your keywords. Profile to see if this meets your expected performance. If it does not, revisit your performance criteria; they might have been unreasonable.
If you are still not meeting your performance specifications, consider using a hashtable (in .NET you would use a HashSet<string>). While a hashtable has worse worst-case performance, it could have better average-case performance (if there are no collisions, a hashtable lookup is O(1), while computing the hash is O(m), where m is the maximum length of your keywords). This might be faster (on average), but probably not noticeably so.
You might even consider skipping directly to the last step, as a hashtable is less complex to use than a trie. It all depends on your needs. Tries have the advantage that you can easily spit out the closest matching keyword, for example.
The important thing here is to have a specification of your performance requirements, and to profile! Use the simplest implementation that meets your performance requirements (for maintainability, readability, and implementability (if that's not a word, it is now!)).
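As a minimal sketch of step two above (the sorted-array-plus-binary-search option), assuming the keyword list is loaded once at startup; the names here are invented for the example:

    using System;

    class KeywordLookup
    {
        private readonly string[] _keywords;

        public KeywordLookup(string[] keywords)
        {
            _keywords = (string[])keywords.Clone();
            Array.Sort(_keywords, StringComparer.Ordinal); // sort once, at startup
        }

        // O(m log n): log n comparisons, each inspecting up to m characters.
        public bool Contains(string query) =>
            Array.BinarySearch(_keywords, query, StringComparer.Ordinal) >= 0;
    }

    class Program
    {
        static void Main()
        {
            var lookup = new KeywordLookup(new[] { "cache", "dictionary", "hash" });
            Console.WriteLine(lookup.Contains("hash")); // True
            Console.WriteLine(lookup.Contains("trie")); // False
        }
    }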
The list does not need to be updated often
I say if it ever needs to be updated, it does not belong in the source code.
The hardcoded list is faster. A database hit to retrieve the list will undoubtedly be slower than pulling the list from an in-memory object.
As for which datatype to store the values in, an array would probably be faster and take up less memory than a List, but trivially so.
If the list is largely static and you can afford to spend some time in prep (i.e. on application start), you'd probably be best off storing the list of keywords in a text file, then using, say, a B* tree to store the keywords internally (assuming you only care about exact matching and not partial matching or Levenshtein distance).

Hash codes for immutable types

Are there any considerations for immutable types regarding hash codes?
Should I generate it once, in the constructor?
How would you make it clear that the hash code is fixed? Should I? If so, is it better to use a property called HashCode instead of the GetHashCode method? Would there be any drawback to that? (Considering both would work, but the property would be recommended.)
Are there any considerations for immutable types regarding hash codes?
Immutable types are the easiest types to hash correctly; most hash code bugs happen when hashing mutable data. The most important thing is that hashing and equality agree; if two instances compare as equal, they should have the same hash code. (The reverse is not necessarily true; two instances that have the same hash need not be equal.)
Should I generate it once, in the constructor?
That's a performance optimizing technique; by doing so, you trade increased consumption of space (for the storage of the computed value) for a possible decrease in time. I never make performance optimizations unless they are driven by realistic, customer-focused performance tests that carefully measure the performance of both options against documented goals. You should do this if your carefully-designed experiments indicate that (1) failure to do so causes you to miss your goal, and (2) doing so causes you to meet your goal.
How would you make it clear that the hash code is fixed?
I don't understand the question. A changing hash code is the exception, not the rule. Hash codes are always supposed to be unchanging. If the hash code of an object changes then the object can get "lost" in a hash table, so everyone should assume that hash codes remain stable.
is it better to use a property called HashCode, instead of GetHashCode method?
What consumer of your object is going to say "well, I could call GetHashCode(), a method guaranteed to be on all objects, but instead I'm going to call this HashCode getter that does exactly the same thing"? Do you have such a consumer in mind?
If you don't have any consumers of functionality, then don't provide the functionality.
I wouldn't normally generate it in the constructor, but I'd also want to know more about the expected usage before deciding whether to cache it or not.
Are you expecting a small number of instances, which get hashed an awful lot and which take a long time to calculate the hash? If so, caching may be appropriate. If you're expecting a large number of potentially "throw-away" instances, I wouldn't bother caching.
Interestingly, .NET and Java made different choices for String in this respect - Java caches the hash, .NET doesn't. Given that many string instances are never hashed, and those which are hashed are often only hashed once (e.g. on insertion into the hash table) I think I favour .NET's decision here.
Basically you're trading memory + complexity against speed. As Michael says, test before making your code more complex. Of course, in some cases (e.g. for a class library) you can't accurately predict the real-world usage, but in many situations you'll have a pretty good idea.
You certainly don't need a separate property though. Hash codes should always stay the same unless someone changes the state of the object - and if your type is immutable, you're already prohibiting that, therefore a user shouldn't expect any changes. Just override GetHashCode().
I would generate the hash code once, when GetHashCode is called the first time, then cache it for later calls. This avoids computing it in the constructor when it may not be needed.
If you don't expect GetHashCode to be called very many times for each value object, you may not need to cache the value at all.
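As a sketch of that lazy-caching approach (the Point type is invented for the example; 0 doubles as a "not computed yet" sentinel for simplicity):

    using System;

    public sealed class Point
    {
        private readonly int _x, _y;
        private int _hash; // 0 means "not computed yet"

        public Point(int x, int y) { _x = x; _y = y; }

        public override bool Equals(object obj) =>
            obj is Point p && p._x == _x && p._y == _y;

        public override int GetHashCode()
        {
            if (_hash == 0) // recomputation on a race is harmless: the result is identical
            {
                unchecked
                {
                    int h = 17;
                    h = h * 31 + _x;
                    h = h * 31 + _y;
                    _hash = h == 0 ? 1 : h; // never store the sentinel value itself
                }
            }
            return _hash;
        }
    }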
Well, you've got to have an overridden GetHashCode() method, as that's how consumers are going to retrieve your hash code. Most hash codes are fairly simple arithmetic operations that will execute quickly. Do you have a reason to believe that caching the results (which has a memory cost) will give you a noticeable performance improvement?
Start simple - generate the hashcode on the fly. If you think you'll see performance improvements caching it, test first.
Regulations require me to refer you to the "premature optimization is the root of all evil" quote at this point.
I know from my personal experience that developers are really good at misjudging performance issues.
So it is recommended to keep everything as simple as possible and calculate the hash code on the fly in GetHashCode().
Why do you need to make sure that the hash code is fixed? The semantics of a hash code are that it will always be the same value for any given state of an object. Since your objects are immutable, this is a given. How you choose to implement GetHashCode is up to you.
Having it be a private field that is returned is one choice - it's small, easy, and fast.
In general, computing the hash code should be fast, so caching should not be much of an optimization and is not worth the trouble.
If profiling really shows that GetHashCode takes a significant amount of time, then maybe you should cache it, as a fix.
But I wouldn't consider it part of normal practice.
