I'm implementing a simple Dictionary<String, Int> that keeps track of the picture files I download, and of how the files are renamed.
String - original filename
Int - new filename
I read up on TryGetValue vs ContainsKey and came across this:
TryGetValue approach is faster than ContainsKey approach but only when
you want to check the key in collection and also want to get the value
associated with it. If you only want to check the key is present or
not use ContainsKey only.
from here
As such, I was wondering what were other people's views on the following:
Should I use TryGetValue even if I do not plan to use the returned value, assuming the Dictionary could grow to 1000 entries, and I check for duplicates every time I download, i.e. frequently?
In theory, follow the documentation. If you don't want the value then use ContainsKey because there's no code to go and actually grab the value out of memory.
Now, in practice, it probably doesn't matter because you're micro-optimizing on a Dictionary that's probably very small in the grand scheme of things. So, in practice, do what is best for you and the readability of your code.
And just to help you get a good sense of scale: "would grow to 1000 entries" is really small, so it really isn't going to matter in practice.
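To make the two idioms concrete, here's a minimal sketch (the filenames and values are illustrative):

```csharp
using System;
using System.Collections.Generic;

// original filename -> new numeric filename (illustrative data)
var renames = new Dictionary<string, int> { ["cat.jpg"] = 1, ["dog.jpg"] = 2 };

// Duplicate check only: ContainsKey says exactly what is meant and
// never copies a value out.
bool seen = renames.ContainsKey("cat.jpg");

// Check-and-fetch: TryGetValue probes the hash table once, instead of
// ContainsKey followed by the indexer (two probes).
if (renames.TryGetValue("cat.jpg", out int newName))
    Console.WriteLine($"cat.jpg was renamed to {newName}"); // cat.jpg was renamed to 1
```

In the duplicate-check-only case both calls cost the same single internal lookup, so readability can decide.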
If you only want to check the key is present or not use ContainsKey only.
I think you answered the question for yourself.
Let's see the implementation of both under Reflector
public bool TryGetValue(TKey key, out TValue value)
{
    int index = this.FindEntry(key);
    if (index >= 0)
    {
        value = this.entries[index].value;
        return true;
    }
    value = default(TValue);
    return false;
}

public bool ContainsKey(TKey key)
{
    return (this.FindEntry(key) >= 0);
}
This is how both methods are implemented.
Now you can decide for yourself which method is best.
I think that performance gains (if any) aren't worth the cost of obfuscating your code with this optimization.
Balance the scale you're targeting against code maintainability. E.g.:
~10K concurrent calls on average vs. a team of fewer than 5 developers: GO FOR IT!
~500 concurrent calls on average vs. a team of more than 50 developers: DON'T DO IT!
Related
I have a collection as below
private static readonly Dictionary<string, object> _AppCache = new Dictionary<string, object>();
Then I was wondering which is better for checking whether a key exists (none of my keys has a null value):
_AppCache.ContainsKey("x")
_AppCache["x"] != null
This code might be accessed by any number of threads.
The whole code is:
public void SetGlobalObject(string key, object value)
{
    globalCacheLock.EnterWriteLock();
    try
    {
        if (!_AppCache.ContainsKey(key))
        {
            _AppCache.Add(key, value);
        }
    }
    finally
    {
        globalCacheLock.ExitWriteLock();
    }
}
Update
I changed my code to use Dictionary to keep the focus of the question on ContainsKey vs. the indexer.
I don't disagree with others' advice to use Dictionary. However, to answer your question, I think you should use ContainsKey to check whether a key exists, for several reasons:
That is specifically what ContainsKey was written to do.
For _AppCache["x"] != null to work, your app must operate under an unenforced assumption (that no values will ever be null). That assumption may hold true now, but future maintainers may not know or understand it, resulting in unintuitive bugs.
ContainsKey does slightly less processing, although this is not really important.
Neither of the two choices are threadsafe, so that is not a deciding factor. For that, you either need to use locking, or use ConcurrentDictionary.
If you move to a Dictionary (per your question update), the answer is even more in favor of ContainsKey. If you used the index option, you would have to catch an exception to detect if the key is not in the Dictionary. ContainsKey would be much more straightforward in your code.
When the key is in the Dictionary, ContainsKey is slightly more efficient. Both options first call an internal method FindEntry. In the case of ContainsKey, it just returns the result of that. For the index option, it must also retrieve the value. In the case of the key not being in the Dictionary, the index option would be a fair amount less efficient, because it will be throwing an exception.
You are obviously checking for the existence of the key. In that case, _AppCache["x"] != null will give you a KeyNotFoundException if the key does not exist, which is probably not desirable. If you really want to check whether the key exists without generating an exception, use _AppCache.ContainsKey("x"). For checking whether a key exists in a dictionary or hashtable, I would stick with ContainsKey. Any performance difference in favor of != null would be offset by the additional code needed to deal with the exception when the key really does not exist.
In reality, _AppCache["x"] != null is not checking if the key exists, it is checking, given that key "x" exists, whether the associated value is null.
Neither way (although accomplishing different tasks) gives you any advantage on thread safety.
All of this holds true if you use ConcurrentDictionary - no difference in thread safety, the two ways accomplish different things, any possible gain in checking with !=null is offset by additional code to handle exception. So, use ContainsKey.
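A small sketch of the behavioral difference (the key names are illustrative):

```csharp
using System;
using System.Collections.Generic;

var _AppCache = new Dictionary<string, object> { ["x"] = 42 };

// ContainsKey is a plain boolean check; no exception machinery involved.
Console.WriteLine(_AppCache.ContainsKey("x"));       // True
Console.WriteLine(_AppCache.ContainsKey("missing")); // False

// The indexer throws KeyNotFoundException for an absent key, so the
// "!= null" test cannot even run for a missing key without a catch block.
try
{
    bool hasValue = _AppCache["missing"] != null;
}
catch (KeyNotFoundException)
{
    Console.WriteLine("indexer threw for the missing key");
}
```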
If you're concerned about thread-safety, you should have a look at the ConcurrentDictionary class.
If you do not want to use ConcurrentDictionary, then you'll have to synchronize access to your regular Dictionary<K,V> instance yourself. That means making sure that no two threads can access the dictionary at the same time, by locking on each read and write operation.
For instance, if you want to add something to a regular Dictionary in a thread-safe way, you'll have to do it like this:
private readonly object _sync = new object();

// ...

lock (_sync)
{
    if (_dictionary.ContainsKey(someKey) == false)
    {
        _dictionary.Add(someKey, somevalue);
    }
}
You shouldn't be using Hashtable anymore: the generic, type-safe Dictionary<K,V> class was introduced as its replacement in .NET 2.0.
One caveat when using a Dictionary<K,V>: when you retrieve the value associated with a given key, the Dictionary will throw an exception if there is no entry for that key, whereas a Hashtable would return null in that case.
You should use a ConcurrentDictionary rather than a Dictionary; ConcurrentDictionary is itself thread-safe, so you do not need the lock, which (generally *) improves performance, since locking mechanisms are rather expensive.
Now, only to check whether an entry exists I recommend ContainsKey, irrespective of which (Concurrent)Dictionary you use:
_AppCache.ContainsKey(key)
But what you do in two steps can be done in one step with ConcurrentDictionary, using GetOrAdd:
_AppCache.GetOrAdd(key, value);
You need a lock for neither action:
public void SetGlobalObject(string key, object value)
{
_AppCache.GetOrAdd(key, value);
}
Not only does this (probably *) perform better, but I think it expresses your intentions much clearer and less cluttered.
(*) Using "probably" and "generally" here to emphasise that these data structures do have loads of baked-in optimisations for performance, however performance in your specific case must always be measured.
The implementation of Nullable<T>.GetHashCode() is as follows:
public override int GetHashCode()
{
if (!this.HasValue)
{
return 0;
}
return this.value.GetHashCode();
}
If, however, the underlying value also generates a hash code of 0 (e.g. a bool set to false or an Int32 set to 0), then we have two commonly occurring, distinct object states with the same hash code. It seems to me that a better implementation would have been something like:
public override int GetHashCode()
{
    if (!this.HasValue)
    {
        // Some arbitrary 32-bit value with a good mix of set and unset
        // bits. (0xD523648A overflows a signed int, hence the cast.)
        return unchecked((int)0xD523648A);
    }
    return this.value.GetHashCode();
}
Yes, you do have a point. It is always possible to write a better GetHashCode() implementation if you know up front what data you are going to store. Not a luxury that a library writer ever has available. But yes, if you have a lot of bool? that are either false or !HasValue then the default implementation is going to hurt. Same for enums and ints, zero is a common value.
Your argument is academic however, changing the implementation costs minus ten thousand points and you can't do it yourself. Best you can do is submit the suggestion, the proper channel is the user-voice site. Getting traction on this is going to be difficult, good luck.
Let's first note that this question is just about performance. The hash code is not required to be unique or collision resistant for correctness. It is helpful for performance though.
Actually, this is the main value proposition of a hash table: Practically evenly distributed hash codes lead to O(1) behavior.
So what hash code constant is most likely to lead to the best possible performance profile in real applications?
Certainly not 0 because 0 is a common hash code: 0.GetHashCode() == 0. That goes for other types as well. 0 is the worst candidate because it tends to occur so often.
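The collision is easy to verify: a valueless Nullable<T> hashes to 0, the same code as a zero int or a false bool, two of the most common values in practice:

```csharp
using System;

int? none = null;
int? zero = 0;
bool? unset = null;
bool? isFalse = false;

// Nullable<T>.GetHashCode() returns 0 when HasValue is false, so the
// "no value" state collides with the default value of the underlying type.
Console.WriteLine(none.GetHashCode() == zero.GetHashCode());     // True
Console.WriteLine(unset.GetHashCode() == isFalse.GetHashCode()); // True
```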
So how to avoid collisions? My proposal:
static readonly int nullableDefaultHashCode = GetRandomInt32();

public override int GetHashCode()
{
    if (!this.HasValue)
        return nullableDefaultHashCode;
    else
        return this.value.GetHashCode();
}
Evenly distributed, unlikely to collide and no stylistic problem of choosing an arbitrary constant.
Note that GetRandomInt32 could even be implemented as return unchecked((int)0xD523648A); that would still be more useful than return 0;. But it is probably best to query a cheap source of pseudo-random numbers.
In the end, a Nullable<T> without value has to return a hashcode, and that hashcode should be a constant.
Returning an arbitrary constant may look more safe or appropriate, perhaps even more so when viewed within the specific case of Nullable<int>, but in the end it's just that: a hash.
And within the entire set that Nullable<T> can cover (which is infinite), zero is not a better hashcode than any other value.
I don't understand the concern here - poor performance in what situation?
Why would you consider a hash function poor based on its result for one value?
I could see that it would be a problem if many different values of a Type hash to the same result. But the fact that null hashes to the same value as 0 seems insignificant.
As far as I know the most common use of a .NET hash function is for a Hashtable, HashSet or Dictionary key, and the fact that zero and null happen to be in the same bucket will have an insignificant effect on overall performance.
I am working on software for scientific research that deals heavily with chemical formulas. I keep track of the contents of a chemical formula using an internal Dictionary<Isotope, int> where Isotope is an object like "Carbon-13", "Nitrogen-14", and the int represents the number of those isotopes in the chemical formula. So the formula C2H3NO would exist like this:
{ "C12", 2,
  "H1",  3,
  "N14", 1,
  "O16", 1 }
This is all fine and dandy, but when I want to add two chemical formulas together, I end up having to compute the hash of Isotope twice to update a value; see the following code example.
public class ChemicalFormula
{
    internal Dictionary<Isotope, int> _isotopes = new Dictionary<Isotope, int>();

    public void Add(Isotope isotope, int count)
    {
        if (count != 0)
        {
            int curValue = 0;
            if (_isotopes.TryGetValue(isotope, out curValue))
            {
                int newValue = curValue + count;
                if (newValue == 0)
                {
                    _isotopes.Remove(isotope);
                }
                else
                {
                    _isotopes[isotope] = newValue;
                }
            }
            else
            {
                _isotopes.Add(isotope, count);
            }
            _isDirty = true;
        }
    }
}
While this may not seem like much of a slowdown, it is when we are adding billions of chemical formulas together: this method is consistently the slowest part of the program (>45% of the running time). I am dealing with large chemical formulas like "H5921C3759N1023O1201S21" that are constantly being added to by smaller chemical formulas.
My question is, is there a better data structure for storing data like this? I have tried creating a simple IsotopeCount object that contains a int so I can access the value in a reference-type (as opposed to value-type) to avoid the double hash function. However, this didn't seem beneficial.
EDIT
Isotope is immutable and shouldn't change during the lifetime of the program so I should be able to cache the hashcode.
I have linked to the source code so you can see the classes more in depth rather than me copy and paste them here.
I second the opinion that Isotope should be made immutable with precalculated hash. That would make everything much simpler.
(in fact, functionally-oriented programming is better suited for calculations of such sort, and it deals with immutable objects)
I have tried creating a simple IsotopeCount object that contains a int so I can access the value in a reference-type (as opposed to value-type) to avoid the double hash function. However, this didn't seem beneficial.
Well it would stop the double hashing... but obviously it's then worse in terms of space. What performance difference did you notice?
Another option you should strongly consider if you're doing this a lot is caching the hash within the Isotope class, assuming it's immutable. (If it's not, then using it as a dictionary key is at least somewhat worrying.)
If you're likely to use most Isotope values as dictionary keys (or candidates) then it's probably worth computing the hash during initialization. Otherwise, pick a particularly unlikely hash value (in an ideal world, that would be any value) and use that as the "uncached" value, and compute it lazily.
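As a rough sketch of that caching idea (the fields, the hash formula, and the demo values are illustrative, not the real Isotope class):

```csharp
using System;
using System.Collections.Generic;

// Demo: with the hash cached at construction, repeated dictionary
// lookups never recompute it.
var counts = new Dictionary<Isotope, int>();
counts[new Isotope("C", 12)] = 2;
counts[new Isotope("C", 12)] = counts[new Isotope("C", 12)] + 3;
Console.WriteLine(counts[new Isotope("C", 12)]); // 5

// Immutable key type whose hash is computed exactly once.
public sealed class Isotope : IEquatable<Isotope>
{
    public string Element { get; }
    public int MassNumber { get; }
    private readonly int cachedHash; // computed once in the constructor

    public Isotope(string element, int massNumber)
    {
        Element = element;
        MassNumber = massNumber;
        cachedHash = element.GetHashCode() * 31 + massNumber; // illustrative formula
    }

    public override int GetHashCode() => cachedHash;

    public bool Equals(Isotope other) =>
        other is not null && MassNumber == other.MassNumber && Element == other.Element;

    public override bool Equals(object obj) => Equals(obj as Isotope);
}
```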
If you've got 45% of the running time in GetHashCode, have you looked at optimizing that? Is it actually GetHashCode, or Equals which is the problem? (You talk about "hashing" but I suspect you mean "hash lookup in general".)
If you could post the relevant bits of the Isotope type, we may be able to help more.
EDIT: Another option to consider if you're using .NET 4 would be ConcurrentDictionary, with its AddOrUpdate method. You'd use it like this:
public void Add(Isotope isotope, int count)
{
    // I prefer early exit to lots of nesting :)
    if (count == 0)
    {
        return;
    }
    int newCount = _isotopes.AddOrUpdate(isotope, count,
        (key, oldCount) => oldCount + count);
    if (newCount == 0)
    {
        // ConcurrentDictionary exposes TryRemove rather than Remove.
        int removed;
        _isotopes.TryRemove(isotope, out removed);
    }
    _isDirty = true;
}
Do you actually require random access to Isotope count by type or are you using the dictionary as a means for associating a key with a value?
I would guess the latter.
My suggestion to you is not to work with a dictionary but with a sorted array (or List) of IsotopeTuples, something like:
class IsotopeTuple
{
    Isotope i;
    int count;
}
sorted by the name of the isotope.
Why the sorting?
Because then, when you want to "add" two isotopes together, you can do this in linear time by traversing both arrays (hope this is clear, I can elaborate if needed). No hash computation required, just super fast comparisons of order.
This is a classic approach when dealing with vector multiplications where the dimensions are words.
Used widely in text mining.
The tradeoff, of course, is that constructing the initial vector is O(n log n), but I doubt you will feel the impact.
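A minimal sketch of the linear-time merge over name-sorted lists (the tuple representation and formula data here are illustrative):

```csharp
using System;
using System.Collections.Generic;

// Adding two formulas kept as lists sorted by isotope name: one linear
// pass with ordinal comparisons, no hashing at all.
static List<(string Isotope, int Count)> AddFormulas(
    List<(string Isotope, int Count)> a, List<(string Isotope, int Count)> b)
{
    var result = new List<(string Isotope, int Count)>();
    int i = 0, j = 0;
    while (i < a.Count && j < b.Count)
    {
        int cmp = string.CompareOrdinal(a[i].Isotope, b[j].Isotope);
        if (cmp < 0) result.Add(a[i++]);
        else if (cmp > 0) result.Add(b[j++]);
        else
        {
            int sum = a[i].Count + b[j].Count;
            if (sum != 0) result.Add((a[i].Isotope, sum)); // drop zeroed entries
            i++; j++;
        }
    }
    result.AddRange(a.GetRange(i, a.Count - i)); // whichever side has leftovers
    result.AddRange(b.GetRange(j, b.Count - j));
    return result;
}

var water = new List<(string, int)> { ("H1", 2), ("O16", 1) };
var hydroxide = new List<(string, int)> { ("H1", 1), ("O16", 1) };
var combined = AddFormulas(water, hydroxide);
Console.WriteLine(string.Join(" ", combined)); // (H1, 3) (O16, 2)
```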
Another solution that you could think of if you had a limited number of Isotopes and no memory problems:
public struct Formula
{
    public int C12;
    public int H1;
    public int N14;
    public int O16;
}
I am guessing you're looking at organic chemistry, so you may not have to deal with that many isotopes, and if the lookup is the issue, this one will be pretty fast...
I'm trying to optimise the performance of a string comparison operation on each string key of a dictionary used as a database query cache. The current code looks like:
public void Clear(string tableName)
{
    foreach (string key in cache.Keys.Where(key => key.IndexOf(tableName, StringComparison.Ordinal) >= 0).ToList())
    {
        cache.Remove(key);
    }
}
I'm new to using C# parallel features and am wondering what the best way would be to convert this into a parallel operation so that multiple string comparisons can happen 'simultaneously'. The cache can often get quite large so maintenance on it with Clear() can get quite costly.
Make your cache object a ConcurrentDictionary and use TryRemove instead of Remove.
This will make your cache thread-safe; then you can invoke your current foreach loop like this:
Parallel.ForEach(cache.Keys, key =>
{
    if (key.IndexOf(tableName, StringComparison.Ordinal) >= 0)
    {
        dynamic value; // just because I don't know your dictionary.
        cache.TryRemove(key, out value);
    }
});
Hope that gives you a starting point.
Your approach can't work well on a Dictionary<string, Whatever> because that class isn't thread-safe for multiple writers, so the simultaneous deletes could cause all sorts of problems.
You will therefore have to use a lock to synchronise the removals, which will therefore make the access of the dictionary essentially single-threaded. About the only thing that can be safely done across the threads simultaneously is the comparison in the Where.
You could use ConcurrentDictionary because its use of striped locks will reduce this impact. It still doesn't seem the best approach though.
If you are building keys from strings so that you can test whether a key starts with a sub-key, and removing an entire sub-key is a frequent need, then you could try using a Dictionary<string, Dictionary<string, Whatever>>. Adding or updating becomes a bit more expensive, but clearing becomes an O(1) removal of just one value from the higher-level dictionary.
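A minimal sketch of that two-level layout (table and key names are illustrative):

```csharp
using System;
using System.Collections.Generic;

// Outer key is the table name; the inner dictionary holds that table's
// cached entries. Clearing a table is one O(1) removal, not a key scan.
var cache = new Dictionary<string, Dictionary<string, object>>();

void Add(string table, string key, object value)
{
    if (!cache.TryGetValue(table, out var perTable))
        cache[table] = perTable = new Dictionary<string, object>();
    perTable[key] = value;
}

void Clear(string table) => cache.Remove(table);

Add("users", "query1", 1);
Add("users", "query2", 2);
Add("orders", "query3", 3);
Clear("users");
Console.WriteLine(cache.ContainsKey("users")); // False
Console.WriteLine(cache["orders"].Count);      // 1
```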
I've used Dictionaries as caches before, and what I did was clean up the cache "on the fly": with each entry I also store its time of inclusion, and whenever an entry is requested I remove the old entries. The performance hit was minimal for me, but if needed you could maintain a Queue (of Tuple<DateTime, TKey>, where TKey is the type of the keys in your dictionary) as an index holding these timestamps, so you don't need to iterate over the entire dictionary every time. Anyway, if you're having to think about these issues, it may be time to consider a dedicated caching server. For me, Shared Cache (http://sharedcache.codeplex.com) has been good enough.
My team is currently debating this issue.
The code in question is something along the lines of
if (!myDictionary.ContainsKey(key))
{
    lock (_SyncObject)
    {
        if (!myDictionary.ContainsKey(key))
        {
            myDictionary.Add(key, value);
        }
    }
}
Some of the posts I've seen say that this may be a big NO-NO (when using TryGetValue). Yet members of our team say it is OK, since ContainsKey does not iterate over the key collection but checks whether the key is contained via its hash code in O(1). Hence they claim there is no danger here.
I would like to get your honest opinions regarding this issue.
Don't do this. It's not safe.
You could be calling ContainsKey from one thread while another thread calls Add. That's simply not supported by Dictionary<TKey, TValue>. If Add needs to reallocate buckets etc, I can imagine you could get some very strange results, or an exception. It may have been written in such a way that you don't see any nasty effects, but I wouldn't like to rely on it.
It's one thing using double-checked locking for simple reads/writes to a field, although I'd still argue against it - it's another to make calls to an API which has been explicitly described as not being safe for multiple concurrent calls.
If you're on .NET 4, ConcurrentDictionary is probably the way forward. Otherwise, just lock on every access.
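For comparison, a sketch of what the double-checked pattern collapses to on ConcurrentDictionary (the key and values are illustrative):

```csharp
using System;
using System.Collections.Concurrent;

// The whole check-then-add becomes one thread-safe call; no external
// lock and no double-checked ContainsKey needed.
var myDictionary = new ConcurrentDictionary<string, int>();

bool added = myDictionary.TryAdd("key", 42);      // inserts: key was absent
bool addedAgain = myDictionary.TryAdd("key", 99); // no-op: key already present

Console.WriteLine(added);               // True
Console.WriteLine(addedAgain);          // False
Console.WriteLine(myDictionary["key"]); // 42
```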
If you are in a multithreaded environment, you may prefer to look at using a ConcurrentDictionary. I blogged about it a couple of months ago, you might find the article useful: http://colinmackay.co.uk/blog/2011/03/24/parallelisation-in-net-4-0-the-concurrent-dictionary/
This code is incorrect. The Dictionary<TKey, TValue> type does not support simultaneous read and write operations. Even though your Add call happens within the lock, the outer ContainsKey does not. Hence the code easily allows a violation of the no-simultaneous-read/write rule and will lead to corruption of your instance.
It doesn't look thread-safe, but it would probably be hard to make it fail.
The iteration vs hash lookup argument doesn't hold, there could be a hash-collision for instance.
If this dictionary is rarely written and often read, then I often employ safe double locking by replacing the entire dictionary on write. This is particularly effective if you can batch writes together to make them less frequent.
For example, this is a cut down version of a method we use that tries to get a schema object associated with a type, and if it can't, then it goes ahead and creates schema objects for all the types it finds in the same assembly as the specified type to minimize the number of times the entire dictionary has to be copied:
public static Schema GetSchema(Type type)
{
if (_schemaLookup.TryGetValue(type, out Schema schema))
return schema;
lock (_syncRoot) {
if (_schemaLookup.TryGetValue(type, out schema))
return schema;
var newLookup = new Dictionary<Type, Schema>(_schemaLookup);
foreach (var t in type.Assembly.GetTypes()) {
var newSchema = new Schema(t);
newLookup.Add(t, newSchema);
}
_schemaLookup = newLookup;
return _schemaLookup[type];
}
}
So the dictionary in this case will be rebuilt, at most, as many times as there are assemblies with types that need schemas. For the rest of the application lifetime the dictionary accesses will be lock-free. The dictionary copy becomes a one-time initialization cost per assembly. The dictionary swap is thread-safe because reference writes are atomic in .NET, so the whole reference gets switched at once.
You can apply similar principles in other situations as well.