Hash codes for immutable types - c#

Are there any considerations for immutable types regarding hash codes?
Should I generate it once, in the constructor?
How would you make it clear that the hash code is fixed? Should I? If so, is it better to use a property called HashCode instead of the GetHashCode method? Would there be any drawback to it? (Assuming both would work, but the property would be the recommended one.)

Are there any considerations for immutable types regarding hash codes?
Immutable types are the easiest types to hash correctly; most hash code bugs happen when hashing mutable data. The most important thing is that hashing and equality agree; if two instances compare as equal, they should have the same hash code. (The reverse is not necessarily true; two instances that have the same hash need not be equal.)
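For instance, a minimal sketch of an immutable type that keeps the two consistent might look like this (the Point type and its fields are invented for illustration, not taken from the question):

using System;

// Hypothetical immutable type: Equals and GetHashCode are derived from the
// same fields, so two instances that compare equal always hash the same.
public sealed class Point : IEquatable<Point>
{
    private readonly int x;
    private readonly int y;

    public Point(int x, int y)
    {
        this.x = x;
        this.y = y;
    }

    public bool Equals(Point other)
    {
        return other != null && x == other.x && y == other.y;
    }

    public override bool Equals(object obj)
    {
        return Equals(obj as Point);
    }

    public override int GetHashCode()
    {
        unchecked
        {
            return (x * 397) ^ y;   // same fields as Equals uses
        }
    }
}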
Should I generate it once, in the constructor?
That's a performance optimizing technique; by doing so, you trade increased consumption of space (for the storage of the computed value) for a possible decrease in time. I never make performance optimizations unless they are driven by realistic, customer-focused performance tests that carefully measure the performance of both options against documented goals. You should do this if your carefully-designed experiments indicate that (1) failure to do so causes you to miss your goal, and (2) doing so causes you to meet your goal.
How would you make it clear that the hash code is fixed?
I don't understand the question. A changing hash code is the exception, not the rule. Hash codes are always supposed to be unchanging. If the hash code of an object changes then the object can get "lost" in a hash table, so everyone should assume that hash codes remain stable.
is it better to use a property called HashCode, instead of GetHashCode method?
What consumer of your object is going to say "well, I could call GetHashCode(), a method guaranteed to be on all objects, but instead I'm going to call this HashCode getter that does exactly the same thing" ? Do you have such a consumer in mind?
If you don't have any consumers of functionality, then don't provide the functionality.

I wouldn't normally generate it in the constructor, but I'd also want to know more about the expected usage before deciding whether to cache it or not.
Are you expecting a small number of instances, which get hashed an awful lot and which take a long time to calculate the hash? If so, caching may be appropriate. If you're expecting a large number of potentially "throw-away" instances, I wouldn't bother caching.
Interestingly, .NET and Java made different choices for String in this respect - Java caches the hash, .NET doesn't. Given that many string instances are never hashed, and those which are hashed are often only hashed once (e.g. on insertion into the hash table) I think I favour .NET's decision here.
Basically you're trading memory + complexity against speed. As Michael says, test before making your code more complex. Of course in some cases (e.g. for a class library) you can't accurately predict the real-world usage, but in many situations you'll have a pretty good idea.
You certainly don't need a separate property though. Hash codes should always stay the same unless someone changes the state of the object - and if your type is immutable, you're already prohibiting that, therefore a user shouldn't expect any changes. Just override GetHashCode().

I would generate the hash code once, when GetHashCode is called the first time, then cache it for later calls. This avoids computing it in the constructor when it may not be needed.
If you don't expect to call GetHashCode very many times for each value object, you may not need to cache the value at all.
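A rough sketch of that lazy approach (the Name type and its fields are invented for illustration; it treats 0 as "not computed yet", so a value whose hash genuinely is 0 just gets recomputed on each call, which is still correct, merely uncached):

// Invented example: immutable value object with a lazily cached hash code.
public sealed class Name
{
    private readonly string first;
    private readonly string last;
    private int cachedHash;   // 0 until the first GetHashCode() call

    public Name(string first, string last)
    {
        this.first = first;
        this.last = last;
    }

    public override int GetHashCode()
    {
        if (cachedHash == 0)
        {
            unchecked
            {
                cachedHash = ((first != null ? first.GetHashCode() : 0) * 397)
                             ^ (last != null ? last.GetHashCode() : 0);
            }
        }
        return cachedHash;
    }

    public override bool Equals(object obj)
    {
        Name other = obj as Name;
        return other != null && first == other.first && last == other.last;
    }
}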

Well, you've got to have a GetHashCode() overridden method, as that's how consumers are going to retrieve your hashcode. Most hashcodes are fairly simple arithmetic operations, that will execute quickly. Do you have a reason to believe that caching the results (which has a memory cost) will give you a noticeable performance improvement?
Start simple - generate the hashcode on the fly. If you think you'll see performance improvements caching it, test first.
Regulations require me to refer you to the "premature optimization is the root of all evil" quote at this point.

I know from my personal experience that developers are really good at misjudging performance issues.
So it is recommended to keep everything as simple as possible and calculate the hash code on the fly in GetHashCode().

Why do you need to make sure that the hashcode is fixed? The semantics of a hash code are that it will always be the same value for any given state of an object. Since your objects are immutable, this is a given. How you choose to implement GetHashCode is up to you.
Having it be a private field that is returned is one choice - it's small, easy, and fast.
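Something along these lines, for example (the Money type is invented purely to illustrate the private-field option: the hash is computed once in the constructor and stored in a readonly field):

// Invented example: GetHashCode simply returns a value computed at construction.
public sealed class Money
{
    private readonly string currency;
    private readonly decimal amount;
    private readonly int hash;

    public Money(string currency, decimal amount)
    {
        this.currency = currency;
        this.amount = amount;
        unchecked
        {
            hash = ((currency != null ? currency.GetHashCode() : 0) * 397)
                   ^ amount.GetHashCode();
        }
    }

    public override int GetHashCode()
    {
        return hash;   // fixed for the lifetime of the (immutable) object
    }

    public override bool Equals(object obj)
    {
        Money other = obj as Money;
        return other != null && currency == other.currency && amount == other.amount;
    }
}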

In general, computing the hash code should be fast, so caching it should not be much of an optimization, and not worth the trouble.
If profiling really shows that GetHashCode takes a significant amount of time, then maybe you should cache it as a fix.
But I wouldn't consider it part of the normal practice.

Using pre-defined instances

One problem with instances is that the default Equals method returns false even when two instances contain the same values, and overriding Equals to compare values is much slower than reference equality.
While thinking about performance, it struck me that it is wasteful to create two instances that are identical but live at different memory addresses. If I can avoid creating identical instances with different references, performance should improve, and comparing references is easier than writing custom Equals methods.
For example:
I have a Coordinate class which holds chess board coordinates, so I only need a Coordinate[8,8] array to represent every coordinate on the board. Instead of creating instances on demand, I can create all of them up front and have my factory method return them:
Coordinate.Get(2,3) instead of new Coordinate(2,3)
The first is a static factory method which returns the pre-defined coordinate for the given values.
Another advantage is that we don't spend time creating and garbage-collecting objects in memory; they are all pre-defined already. We can also provide a unique GetHashCode for each instance in an easy and fast way, such as 0 for [0,0], 1 for [0,1], and so on.
Isn't it worth trying? Would this idea make the code harder to write or understand? Is there an existing pattern like this?
In short, what are the disadvantages of this approach?
This is a good solution and in certain situations can save you a lot of time and memory. The main drawback is that it gets really complicated if your objects are mutable. If that is not the case, then this is really not that bad. You just have to make sure that all instances are obtained from the same factory. You do not even have to create all the instances in advance, but can make the class create a new instance when a particular set of parameters is requested for the first time (basically lazy-loading).
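A sketch of such a factory, using the Coordinate class from the question (the members shown here are assumptions) and creating each instance lazily on first request:

using System;

// Sketch of the pre-defined-instances idea for the 8x8 board.
public sealed class Coordinate
{
    private static readonly Coordinate[,] cache = new Coordinate[8, 8];

    public int X { get; private set; }
    public int Y { get; private set; }

    private Coordinate(int x, int y)
    {
        X = x;
        Y = y;
    }

    // Factory method: always returns the same instance for the same pair,
    // so callers can rely on reference equality. Instances are created
    // lazily on first request (not thread-safe as written; pre-populate
    // the array eagerly if multiple threads are involved).
    public static Coordinate Get(int x, int y)
    {
        if (x < 0 || x > 7 || y < 0 || y > 7)
            throw new ArgumentOutOfRangeException();

        if (cache[x, y] == null)
            cache[x, y] = new Coordinate(x, y);

        return cache[x, y];
    }

    // Cheap, unique hash: 0 for [0,0], 1 for [0,1], ..., 63 for [7,7].
    public override int GetHashCode()
    {
        return X * 8 + Y;
    }
}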
In this instance where you're dealing with a chess board (so 64 total coordinates) you don't need to be overly concerned with performance or garbage collection. Having said that, I think holding the baseline 64 coordinates in a dictionary is just fine (and makes sense). In terms of the Equals() comparison, you're basically comparing 2 integer values which is going to be lightning fast so overriding the Equals() method to specify that comparison is the right approach.
The short answer is that the disadvantage of this approach is that it increases the complexity of your code, thereby making it more difficult to code correctly, more difficult to debug, and more difficult to maintain.
Therefore, as with all optimisations, you should only contemplate it if you have a genuine need to optimise (e.g. it's running way too slow, or using way too much memory), and if the reward (faster performance, smaller memory usage) outweighs the risks (spending time optimising when you could be doing something more useful, introducing a bug, not being able to find a bug, not being able to modify the code in the future quickly and easily, or without introducing a bug).
If you really were running into performance problems because of having to do deep comparisons in order to determine equality then you might want to look at the Flyweight pattern, but as you're only talking about 64 pairs of smallints I think that would be complete overkill in this case. What you actually need is to override the Equals operator to compare the two coordinates - that will be plenty fast enough. Anything more complex than that is probably a false optimisation, at least on most normal platforms.

What C# container is most resource-efficient for existence for only one operation?

I find myself often with a situation where I need to perform an operation on a set of properties. The operation can be anything from checking if a particular property matches anything in the set to a single iteration of actions. Sometimes the set is dynamically generated when the function is called, some built with a simple LINQ statement, other times it is a hard-coded set that will always remain the same. But one constant always exists: the set only exists for one single operation and has no use before or after it.
My problem is, I have so many points in my application where this is necessary, but I appear to be very, very inconsistent in how I store these sets. Some of them are arrays, some are lists, and just now I've found a couple of linked lists. Now, none of the operations I'm specifically concerned about have to care about indices, container size, order, or any other functionality that is bestowed by any of the individual container types. I picked resource efficiency because it's a better criterion than flipping coins. I figured, since an array's size is fixed up front and it's a very elementary container, that it might be my best choice, but I figure it is a better idea to ask around. Alternatively, if there's a better choice for this kind of situation that isn't based purely on resource efficiency, that would be nice as well.
With your acknowledgement that this is more about coding consistency than performance or efficiency, I think the general practice is to use a List<T>. Its actual backing store is an array, so you aren't really losing much (if anything noticeable) to container overhead. Without more qualifications, I'm not sure that I can offer anything more than that.
Of course, if you truly don't care about the things that you list in your question, just type your variables as IEnumerable<T> and you're only dealing with the actual container when you're populating it; where you consume it will be entirely consistent.
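For example, a consumer typed against IEnumerable<T> doesn't care what the caller actually built (the method and names here are hypothetical, just to illustrate the point):

using System.Collections.Generic;
using System.Linq;

public static class SetConsumers
{
    // Hypothetical consumer: it only asks for a sequence, so callers can pass
    // an array, a List<T>, a LINQ query, or anything else enumerable, and the
    // consuming code stays the same regardless of the container chosen.
    public static bool ContainsMatch(IEnumerable<string> candidates, string target)
    {
        return candidates.Any(c => c == target);
    }
}

// Usage, with whichever container happens to be handy:
//   SetConsumers.ContainsMatch(new[] { "a", "b" }, "b");
//   SetConsumers.ContainsMatch(someList, "b");
//   SetConsumers.ContainsMatch(items.Where(i => i.IsActive).Select(i => i.Name), "b");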
There are two basic principles to be aware of regarding resource efficiency.
Runtime complexity
Memory overhead
You said that indices and order do not matter and that a frequent operation is matching. A Dictionary<TKey, TValue> (which is a hash table) is an ideal candidate for this type of work. Lookups on the keys are very fast, which would be beneficial in your matching operation. The disadvantage is that it will consume a little more memory than what would be strictly required. The usual load factor is around .8, so we are not talking about a huge increase or anything.
For your other operations you may find that an array or List<T> is a better option especially if you do not need to have the fast lookups. As long as you are not needing high performance on specialty operations (lookups, sorting, etc.) then it is hard to beat the general resource characteristics of array based containers.
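As a rough illustration of the trade-off (the names are invented; HashSet<T> is used here as the set counterpart of Dictionary<TKey, TValue>):

using System.Collections.Generic;
using System.Linq;

public static class LookupSketch
{
    // Illustration only: repeated membership checks against a hash-based set
    // are O(1) on average, versus O(n) per check for a List<T>. Building the
    // HashSet is itself O(n), so this only pays off when the same set is
    // probed more than once.
    public static bool AnyMatch(IEnumerable<string> knownValues, IEnumerable<string> candidates)
    {
        HashSet<string> set = new HashSet<string>(knownValues);
        return candidates.Any(set.Contains);
    }
}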
List is probably fine in general. It's easy to understand (in the literate programming sense) and reasonably efficient. The keyed collections (e.g. Dict, SortedList) will throw an exception if you add an entry with a duplicate key, though this may not be a problem for what you're working on now.
Only if you find that you're running into a CPU-time or memory-size problem should you look at improving the "efficiency", and then only after determining that this is the bottleneck.
No matter which approach you use, there will still be creation and deletion of the underlying objects (collection or iterator) that will eventually be garbage collected, if the application runs long enough.

.NET: Scalability of generic Dictionary

I'm using a Dictionary<> to store a bazillion items. Is it safe to assume that as long as the server's memory has enough space to accommodate these bazillion items that I'll get near O(1) retrieval of items from it? What should I know about using a generic Dictionary as huge cache when performance is important?
EDIT: I shouldn't rely on the default implementations? What makes for a good hashing function?
It depends, just about entirely, on how good a hash function your "bazillion items" support -- if their hash function is not excellent (so that many collisions result), your performance will degrade as the dictionary grows.
You should measure it and find out. You're the one who has knowledge of the exact usage of your dictionary, so you're the one who can measure it to see if it meets your needs.
A word of advice: I have in the past done performance analysis on large dictionary structures, and discovered that performance did degrade as the dictionary became extremely large. But it seemed to degrade here and there, not consistently on each operation. I did a lot of work trying to analyze the hash algorithms, etc, before smacking myself in the forehead. The garbage collector was getting slower because I had so much live working set; the dictionary was just as fast as it always was, but if a collection happened to be triggered, then that was eating up my cycles.
That's why it is important to not do performance testing in unrealistic benchmark scenarios; to find out what the real-world performance cost of your bazillion-item dictionary is, well, that's going to be gated on lots of stuff that has nothing to do with your dictionary, like how much collection triggering is happening throughout the rest of your program, and when.
Yes, you will have O(1) access times. In fact, to be pedantic, it will be exactly O(1).
You need to ensure that all your objects that are used as keys have a good GetHashCode implementation and should likely override Equals.
Edit to clarify: In reality access times will get slower the more items you have unless you can provide a "perfect" hash function.
Yes, you will have near O(1) no matter how many objects you put into the Dictionary. But for the Dictionary to be fast, your key-objects should provide a sufficient GetHashCode-implementation, because Dictionary uses a hashtable inside.

Long lists of pass-by-ref parameters versus wrapper types

I need to get three objects out of a function, my instinct is to create a new type to return the three refs. Or if the refs were the same type I could use an array. However pass-by-ref is easier:
private void Mutate_AddNode_GetGenes(ref NeuronGene newNeuronGene, ref ConnectionGene newConnectionGene1, ref ConnectionGene newConnectionGene2)
{
}
There's obviously nothing wrong with this, but I hesitate to use this approach, mostly I think for reasons of aesthetics and psychological bias. Are there actually any good reasons to use one of these approaches over the others? Perhaps a performance issue with creating extra wrapper objects, or pushing parameters onto the stack. Note that in my particular case this is CPU-intensive code; CPU cycles matter.
Is there a more elegant C#2 or C#3 approach?
Thanks.
For almost all computing problems, you will not notice the CPU difference. Since your sample code has the word "Gene" in it, you may actually fall into the rare category of code that would notice.
Creating and destroying objects just to wrap other objects would cost a bit of performance (they need to be created and garbage collected after all).
Aesthetically I would not create an object just to group unrelated objects, but if they logically belong together it is perfectly fine to define a containing object.
If you're worrying about the performance of a wrapping type (which is a lot cleaner, IMHO), you should use a struct. Current 32-bit implementations of .NET (and the upcoming 64-bit 4.0) support inlining / optimizing away of structs in many cases, so you'd probably see no performance difference whatsoever between a struct and ref arguments.
Worrying about the relative execution speed of those two options is probably a premature optimization. Focus on getting the algorithm correct first, and having clean, maintainable code. When that's done, you can run a profiler on it and optimize the 20% of the code that takes 80% of the CPU time. Even if this method ends up being in that 20%, the difference between the two calling styles is probably to small to register.
So, performance issues aside, I'd probably use a container class. Since this method takes only those three parameters, and (presumably) modifies each one, it sounds like it would make sense to have it as a method of the container class, with three member variables instead of ref parameters.
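A rough sketch of that container approach, reusing the question's NeuronGene and ConnectionGene names (their definitions below are placeholders, not the real types):

// Placeholder declarations so the sketch compiles; the real NeuronGene and
// ConnectionGene come from the question's code base.
public class NeuronGene { }
public class ConnectionGene { }

// Sketch of the wrapper alternative: a small struct holding the three results,
// returned as a single value instead of three ref parameters.
public struct AddNodeMutation
{
    public NeuronGene NewNeuronGene;
    public ConnectionGene NewConnectionGene1;
    public ConnectionGene NewConnectionGene2;
}

public class Mutator
{
    private AddNodeMutation Mutate_AddNode_GetGenes()
    {
        AddNodeMutation result = new AddNodeMutation();
        result.NewNeuronGene = new NeuronGene();
        result.NewConnectionGene1 = new ConnectionGene();
        result.NewConnectionGene2 = new ConnectionGene();
        return result;
    }
}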

Is it safe to generally assume that toString() has a low cost?

Do you generally assume that toString() on any given object has a low cost (i.e. for logging)? I do. Is that assumption valid? If it has a high cost should that normally be changed? What are valid reasons to make a toString() method with a high cost? The only time that I get concerned about toString costs is when I know that it is on some sort of collection with many members.
From: http://jamesjava.blogspot.com/2007/08/tostring-cost.html
Update: Another way to put it is: Do you usually look into the cost of calling toString on any given class before calling it?
No, it's not. Because ToString() can be overridden by anyone, they can do whatever they like. It's a reasonable assumption that ToString() SHOULD have a low cost, but if ToString() accesses properties that do "lazy loading" of data, you might even hit a database inside your ToString().
The Java standard library seems to have been written with the intent of keeping the cost of toString calls very low. For example, Java arrays and collections have toString methods which do not iterate over their contents; to get a good string representation of these objects you must use either Arrays.toString or Collections.toString from the java.util package.
Similarly, even objects with expensive equals methods have inexpensive toString calls. For example, the java.net.URL class has an equals method which makes use of an internet connection to determine whether two URLs are truly equal, but it still has a simple and constant-time toString method.
So yes, inexpensive toString calls are the norm, and unless you use some weird third-party package which breaks with the convention, you shouldn't worry about these taking a long time.
Of course, you shouldn't really worry about performance until you find yourself in a situation where your program is taking too long, and even then you should use a profiler to figure out what's taking so long rather than worrying about this sort of thing ahead of time.
The best way to find out is to profile your code. However, rather than worry that a particular function has a high overhead, it's (usually) better to worry about the correctness of your application and then do performance profiling on it (but be wary that real-world use and your test setup may differ radically). As it turns out, programmers generally guess wrong about what's really slow in their application and they often spend a lot of time optimizing things that don't need optimizing (eliminating that triple nested loop which only consumes .01% of your application's time is probably a waste).
Fortunately, there are plenty of open source profilers for Java.
Do you generally assume that toString() on any given object has a low cost? I do.
Why would you do that? Profile your code if you're running into performance issues; it'll save you a lot of time working past incorrect assumptions.
Your question's title uses the contradictory words "safe" and "generally." So even though in comments you seem to be emphasizing the general case, to which the answer is probably "yes, it's generally not a problem," a lot of people are seeing "safe" and therefore are answering either "No, because there's a risk of arbitrarily poor performance," or "No, because if you want to be 'safe' with a performance question, you must profile."
Since I generally only call toString() on classes that I have written myself, where I have overridden the base method, I generally know what the cost is ahead of time. The only time I use toString() otherwise is in error handling and/or debugging, when speed is not of the same importance.
My pragmatic answer would be: yes, you always assume a toString() call is cheap, unless you make an enormous amount of them. On the one hand, it is extremely unlikely that a toString() method would be expensive and on the other hand, it is extremely unlikely that you run into trouble if it isn't. I generally don't worry about issues like these, because there are too many of them and you won't get any code written if you do ;).
If you do run into performance issues, everything is open, including the performance of toString(), and you should, as Shog9 suggests, simply profile the code. The Java Puzzlers show that even Sun wrote some pretty nasty constructors and toString() methods in their JDKs.
I think the question has a flaw. I wouldn't even assume toString() will print a useful piece of data. So, if you begin with that assumption, you know you have to check it prior to calling it and can assess its "cost" on a case-by-case basis.
Possibly the largest cost with naive toString() chaining is appending all those strings. If you want to generate large strings, you should use an underlying representation that supports an efficient append. If you know the append is efficient, then toString()s probably have a relatively low cost.
For example, in Java, StringBuilder will preallocate some space so that a certain amount of string appending takes linear time. It will reallocate when you run out of space.
In general, if you want to append sequences of things and you for whatever reason don't want to do something similar, you can use difference lists. These support linear time append by turning sequence appending into function composition.
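The same idea in C#, as a sketch (the Route type is invented): build the result with a StringBuilder so repeated appends stay roughly linear rather than quadratic.

using System.Text;

// Invented example: the string representation is built with a StringBuilder,
// avoiding the quadratic cost of repeated string concatenation in a loop.
public sealed class Route
{
    private readonly string[] stops;

    public Route(string[] stops)
    {
        this.stops = stops;
    }

    public override string ToString()
    {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < stops.Length; i++)
        {
            if (i > 0)
                sb.Append(" -> ");
            sb.Append(stops[i]);
        }
        return sb.ToString();
    }
}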
toString() is used to represent an object as a String. So if you need slow running code to create a representation of an object, you need to be very careful and have a very good reason to do so. Debugging would be the only one I can think of where a slow running toString is acceptable.
My thought is:
Yes on standard-library objects
No on non-standard objects unless you have the source code in front of you and can check it.
I will always override toString to put in whatever I think I will need to debug problems. It is usually up to the developer to use it by either calling the toString method itself or having another class call it for you (println, logging, etc.).
There's an easy answer to this one, which I first heard in a discussion about reflection: "if you have to ask, you can't afford it."
Basically, if you need ToString() of large objects in the day-to-day operation of your program, then your program is crazy. Even if you need to ToString() an integer for anything time critical, your program is crazy, because it's obviously using a string where an integer would do.
ToString() for log messages is automatically okay, because logging is already expensive. If your program is too slow, turn down the log level! It doesn't really matter how slow it is to actually generate the debug messages, as long as you can choose to not generate them. (Note: your logging infrastructure should call ToString() itself, and only when the log message is supposed to be printed. Don't ToString() it by hand on the way into the log infrastructure, or you'll pay the price even if the log level is low and you won't be printing it after all! See http://www.colijn.ca/~caffeine/?m=200708#16 for more explanation of this.)
Since you put "generally" in your question, I would say yes. For -most- objects, there isn't going to be a costly ToString overload. There definitely can be, but generally there won't be.
In general I consider toString() low cost when I use it on simple objects, such as an integer or a very simple struct. When applied to complex objects, however, toString() is a bit of a crap shoot. There are two reasons for this. First, complex objects tend to contain other objects, so a single call to toString() can cascade into many calls to toString() on other objects, plus the overhead of concatenating all those results. Second, there is no "standard" for converting complex objects to strings. One toString() call may yield a single line of comma-separated values; another a much more verbose form. Only by checking it yourself can you know.
So my rule is toString() on simple objects is generally safe but on complex objects is suspect until checked.
I'd avoid using toString() on objects other than the basic types. toString() may not display anything useful. It may iterate over all the member variables and print them out. It may load something not yet loaded. Depending on what you plan to do with that string, you should consider not building it.
There are typically a few reasons why you use toString(): logging/debugging is probably the most common for random objects; display is common for certain objects (such as numbers). For logging I'd do something like
if (logger.isDebugEnabled()) {
    logger.debug("The zig didn't take off. Response: {0}", response.getAsXML().toString());
}
This does two things: 1. Prevents constructing the string and 2. Prevents unnecessary string addition if the message won't be logged.
In general, I don't check every implementation. However, if I see a dependency on Apache commons, alarm bells go off and I look at the implementation more closely to make sure that they aren't using ToStringBuilder or other atrocities.
