Best way to keep track of objects when writing a debugger [duplicate] - c#

Is there a way of getting a unique identifier of an instance?
GetHashCode() is the same for the two references pointing to the same instance. However, two different instances can (quite easily) get the same hash code:
Hashtable hashCodesSeen = new Hashtable();
LinkedList<object> l = new LinkedList<object>();
int n = 0;
while (true)
{
object o = new object();
// Remember objects so that they don't get collected.
// This does not make any difference though :(
l.AddFirst(o);
int hashCode = o.GetHashCode();
n++;
if (hashCodesSeen.ContainsKey(hashCode))
{
// Same hashCode seen twice for DIFFERENT objects (n is as low as 5322).
Console.WriteLine("Hashcode seen twice: " + n + " (" + hashCode + ")");
break;
}
hashCodesSeen.Add(hashCode, null);
}
I'm writing a debugging addin, and I need to get some kind of ID for a reference which is unique during the run of the program.
I already managed to get internal ADDRESS of the instance, which is unique until the garbage collector (GC) compacts the heap (= moves the objects = changes the addresses).
Stack Overflow question Default implementation for Object.GetHashCode() might be related.
The objects are not under my control as I am accessing objects in a program being debugged using the debugger API. If I was in control of the objects, adding my own unique identifiers would be trivial.
I wanted the unique ID for building a hashtable ID -> object, to be able to lookup already seen objects. For now I solved it like this:
Build a hashtable: 'hashCode' -> (list of objects with hash code == 'hashCode')
Find if object seen(o) {
candidates = hashtable[o.GetHashCode()] // Objects with the same hashCode.
If no candidates, the object is new
If some candidates, compare their addresses to o.Address
If no address is equal (the hash code was just a coincidence) -> o is new
If some address equal, o already seen
}

.NET 4 and later only
Good news, everyone!
The perfect tool for this job is built in .NET 4 and it's called ConditionalWeakTable<TKey, TValue>. This class:
can be used to associate arbitrary data with managed object instances much like a dictionary (although it is not a dictionary)
does not depend on memory addresses, so is immune to the GC compacting the heap
does not keep objects alive just because they have been entered as keys into the table, so it can be used without making every object in your process live forever
uses reference equality to determine object identity; moreover, class authors cannot modify this behavior so it can be used consistently on objects of any type
can be populated on the fly, so does not require that you inject code inside object constructors
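As a rough illustration of the points above, here is a minimal sketch that hands out a stable numeric ID per instance. The names ObjectIds, IdHolder and GetId are made up for this example, and it assumes the objects live in your own process (it does not cover reading objects through a debugger API).

```csharp
using System;
using System.Runtime.CompilerServices;
using System.Threading;

public static class ObjectIds
{
    // Hypothetical holder class: ConditionalWeakTable values must be reference types.
    private sealed class IdHolder
    {
        public readonly long Id;
        public IdHolder(long id) { Id = id; }
    }

    // Keys are compared by reference and are not kept alive by the table.
    private static readonly ConditionalWeakTable<object, IdHolder> Table =
        new ConditionalWeakTable<object, IdHolder>();

    private static long lastId = 0;

    // Returns the same ID every time it is called with the same instance.
    public static long GetId(object obj)
    {
        // The callback runs only when the key has no entry yet.
        return Table.GetValue(
            obj, _ => new IdHolder(Interlocked.Increment(ref lastId))).Id;
    }
}
```

Calling ObjectIds.GetId(o) twice on the same o should give the same number, and two different instances should get different numbers, even if their types override Equals or GetHashCode.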

Checked out the ObjectIDGenerator class? This does what you're attempting to do, and what Marc Gravell describes.
The ObjectIDGenerator keeps track of previously identified objects. When you ask for the ID of an object, the ObjectIDGenerator knows whether to return the existing ID, or generate and remember a new ID.
The IDs are unique for the life of the ObjectIDGenerator instance. Generally, a ObjectIDGenerator life lasts as long as the Formatter that created it. Object IDs have meaning only within a given serialized stream, and are used for tracking which objects have references to others within the serialized object graph.
Using a hash table, the ObjectIDGenerator retains which ID is assigned to which object. The object references, which uniquely identify each object, are addresses in the runtime garbage-collected heap. Object reference values can change during serialization, but the table is updated automatically so the information is correct.
Object IDs are 64-bit numbers. Allocation starts from one, so zero is never a valid object ID. A formatter can choose a zero value to represent an object reference whose value is a null reference (Nothing in Visual Basic).
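A small sketch of how the class is typically called, in case it helps. ObjectIdGeneratorDemo is a made-up name; also note that, as far as I know, the generator keeps the objects it has seen alive, and newer .NET versions may flag the type as obsolete along with the rest of formatter-based serialization.

```csharp
using System;
using System.Runtime.Serialization;

public static class ObjectIdGeneratorDemo
{
    public static void Run()
    {
        var generator = new ObjectIDGenerator();
        var first = new object();
        var second = new object();

        bool firstTime;
        long idA = generator.GetId(first, out firstTime);     // firstTime is true here
        long idB = generator.GetId(second, out firstTime);    // true again: a new object
        long idAgain = generator.GetId(first, out firstTime); // false: already known

        Console.WriteLine("same object, same id: {0}", idA == idAgain);
        Console.WriteLine("different objects, different ids: {0}", idA != idB);
    }
}
```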

The reference is the unique identifier for the object. I don't know of any way of converting this into anything like a string etc. The value of the reference will change during compaction (as you've seen), but every previous value A will be changed to value B, so as far as safe code is concerned it's still a unique ID.
If the objects involved are under your control, you could create a mapping using weak references (to avoid preventing garbage collection) from a reference to an ID of your choosing (GUID, integer, whatever). That would add a certain amount of overhead and complexity, however.

RuntimeHelpers.GetHashCode() may help (MSDN).
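To show what it buys you, here is a minimal sketch with a made-up type (AlwaysZero) that overrides GetHashCode(). RuntimeHelpers.GetHashCode() ignores the override and returns the identity-based hash, but keep in mind it is still only a hash: two different instances can collide, so it is not a unique ID by itself.

```csharp
using System;
using System.Runtime.CompilerServices;

// Hypothetical type whose override makes the normal hash code useless.
public sealed class AlwaysZero
{
    public override int GetHashCode() { return 0; }
}

public static class RuntimeHashDemo
{
    public static void Run()
    {
        var x = new AlwaysZero();
        // The virtual call goes to the override.
        Console.WriteLine(x.GetHashCode());
        // This bypasses the override and stays the same for the lifetime of x.
        Console.WriteLine(RuntimeHelpers.GetHashCode(x));
    }
}
```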

You can develop your own thing in a second. For instance:
class Program
{
static void Main(string[] args)
{
var a = new object();
var b = new object();
Console.WriteLine("{0} {1}", a.GetId(), b.GetId());
}
}
public static class MyExtensions
{
//this dictionary should use weak key references
static Dictionary<object, int> d = new Dictionary<object,int>();
static int gid = 0;
public static int GetId(this object o)
{
if (d.ContainsKey(o)) return d[o];
return d[o] = gid++;
}
}
You can choose what you will like to have as unique ID on your own, for instance, System.Guid.NewGuid() or simply integer for fastest access.

How about this method:
Set a field in the first object to a new value. If the same field in the second object has the same value, it's probably the same instance. Otherwise, exit as different.
Now set the field in the first object to a different new value. If the same field in the second object has changed to the different value, it's definitely the same instance.
Don't forget to set the field in the first object back to its original value on exit.
Problems?

It is possible to make a unique object identifier in Visual Studio: In the watch window, right-click the object variable and choose Make Object ID from the context menu.
Unfortunately, this is a manual step, and I don't believe the identifier can be accessed via code.

You would have to assign such an identifier yourself, manually - either inside the instance, or externally.
For records related to a database, the primary key may be useful (but you can still get duplicates). Alternatively, either use a Guid, or keep your own counter, allocating using Interlocked.Increment (and make it large enough that it isn't likely to overflow).
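For the "keep your own counter" option, a minimal sketch could look like this. It assumes you control the class; TrackedEntity and Id are names invented for the example.

```csharp
using System.Threading;

public class TrackedEntity
{
    // Shared counter; long makes overflow a non-issue in practice.
    private static long lastId = 0;

    // Assigned once in the constructor and never changed afterwards.
    public long Id { get; private set; }

    public TrackedEntity()
    {
        // Interlocked.Increment is atomic, so concurrent constructors get distinct values.
        Id = Interlocked.Increment(ref lastId);
    }
}
```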

I know that this has been answered, but it's at least useful to note that you can use:
http://msdn.microsoft.com/en-us/library/system.object.referenceequals.aspx
Which will not give you a "unique id" directly, but combined with WeakReferences (and a hashset?) could give you a pretty easy way of tracking various instances.

If you are writing a module in your own code for a specific usage, majkinetor's method MIGHT have worked. But there are some problems.
First, the official documentation does NOT guarantee that GetHashCode() returns a unique identifier (see Object.GetHashCode Method ()):
You should not assume that equal hash codes imply object equality.
Second, even if you have so few objects that GetHashCode() works in most cases, this method can be overridden by some types.
For example, you are using some class C and it overrides GetHashCode() to always return 0. Then every object of C will get the same hash code.
Unfortunately, Dictionary, Hashtable and some other associative containers make use of this method:
A hash code is a numeric value that is used to insert and identify an object in a hash-based collection such as the Dictionary<TKey, TValue> class, the Hashtable class, or a type derived from the DictionaryBase class. The GetHashCode method provides this hash code for algorithms that need quick checks of object equality.
So, this approach has great limitations.
And even more, what if you want to build a general purpose library?
Not only are you not able to modify the source code of the used classes, but their behavior is also unpredictable.
I appreciate that Jon and Simon have posted their answers, and I will post a code example and a suggestion on performance below.
using System;
using System.Diagnostics;
using System.Runtime.CompilerServices;
using System.Runtime.Serialization;
using System.Collections.Generic;
namespace ObjectSet
{
public interface IObjectSet
{
/// <summary> checks whether an object is in the set. </summary>
/// <returns> true if the object is in the set, false otherwise. </returns>
bool IsExist(object obj);
/// <summary> if the object is not in the set, add it in. else do nothing. </summary>
/// <returns> true if successfully added, false otherwise. </returns>
bool Add(object obj);
}
public sealed class ObjectSetUsingConditionalWeakTable : IObjectSet
{
/// <summary> unit test on object set. </summary>
internal static void Main() {
Stopwatch sw = new Stopwatch();
sw.Start();
ObjectSetUsingConditionalWeakTable objSet = new ObjectSetUsingConditionalWeakTable();
for (int i = 0; i < 10000000; ++i) {
object obj = new object();
if (objSet.IsExist(obj)) { Console.WriteLine("bug!!!"); }
if (!objSet.Add(obj)) { Console.WriteLine("bug!!!"); }
if (!objSet.IsExist(obj)) { Console.WriteLine("bug!!!"); }
}
sw.Stop();
Console.WriteLine(sw.ElapsedMilliseconds);
}
public bool IsExist(object obj) {
return objectSet.TryGetValue(obj, out tryGetValue_out0);
}
public bool Add(object obj) {
if (IsExist(obj)) {
return false;
} else {
objectSet.Add(obj, null);
return true;
}
}
/// <summary> internal representation of the set. (only use the key) </summary>
private ConditionalWeakTable<object, object> objectSet = new ConditionalWeakTable<object, object>();
/// <summary> used to fill the out parameter of ConditionalWeakTable.TryGetValue(). </summary>
private static object tryGetValue_out0 = null;
}
[Obsolete("It will crash if there are too many objects, and ObjectSetUsingConditionalWeakTable performs better.")]
public sealed class ObjectSetUsingObjectIDGenerator : IObjectSet
{
/// <summary> unit test on object set. </summary>
internal static void Main() {
Stopwatch sw = new Stopwatch();
sw.Start();
ObjectSetUsingObjectIDGenerator objSet = new ObjectSetUsingObjectIDGenerator();
for (int i = 0; i < 10000000; ++i) {
object obj = new object();
if (objSet.IsExist(obj)) { Console.WriteLine("bug!!!"); }
if (!objSet.Add(obj)) { Console.WriteLine("bug!!!"); }
if (!objSet.IsExist(obj)) { Console.WriteLine("bug!!!"); }
}
sw.Stop();
Console.WriteLine(sw.ElapsedMilliseconds);
}
public bool IsExist(object obj) {
bool firstTime;
idGenerator.HasId(obj, out firstTime);
return !firstTime;
}
public bool Add(object obj) {
bool firstTime;
idGenerator.GetId(obj, out firstTime);
return firstTime;
}
/// <summary> internal representation of the set. </summary>
private ObjectIDGenerator idGenerator = new ObjectIDGenerator();
}
}
In my test, the ObjectIDGenerator will throw an exception to complain that there are too many objects when creating 10,000,000 objects (10x than in the code above) in the for loop.
Also, the benchmark result is that the ConditionalWeakTable implementation is 1.8x faster than the ObjectIDGenerator implementation.

The information I give here is not new, I just added this for completeness.
The idea of this code is quite simple:
Objects need a unique ID, which isn't there by default. Instead, we have to rely on the next best thing, which is RuntimeHelpers.GetHashCode to get us a sort-of unique ID
To check uniqueness, this implies we need to use object.ReferenceEquals
However, we would still like to have a unique ID, so I added a GUID, which is by definition unique.
Because I don't like locking everything if I don't have to, I don't use ConditionalWeakTable.
Combined, that will give you the following code:
public class UniqueIdMapper
{
private class ObjectEqualityComparer : IEqualityComparer<object>
{
public bool Equals(object x, object y)
{
return object.ReferenceEquals(x, y);
}
public int GetHashCode(object obj)
{
return RuntimeHelpers.GetHashCode(obj);
}
}
private Dictionary<object, Guid> dict = new Dictionary<object, Guid>(new ObjectEqualityComparer());
public Guid GetUniqueId(object o)
{
Guid id;
if (!dict.TryGetValue(o, out id))
{
id = Guid.NewGuid();
dict.Add(o, id);
}
return id;
}
}
To use it, create an instance of the UniqueIdMapper and use the GUIDs it returns for the objects.
Addendum
So, there's a bit more going on here; let me write a bit down about ConditionalWeakTable.
ConditionalWeakTable does a couple of things. The most important thing is that it doesn't care about the garbage collector, that is: the objects that you reference in this table will be collected regardless. If you look up an object, it basically works the same as the dictionary above.
Curious, no? After all, when the GC is about to collect an object, it checks whether there are references to the object, and if there are, it does not collect it. So if the ConditionalWeakTable refers to an object, why will the referenced object still be collected?
ConditionalWeakTable uses a small trick, which some other .NET structures also use: instead of storing a reference to the object, it actually stores an IntPtr. Because that's not a real reference, the object can be collected.
So, at this point there are 2 problems to address. First, objects can be moved on the heap, so what will we use as IntPtr? And second, how do we know that objects have an active reference?
The object can be pinned on the heap, and its real pointer can be stored. When the GC hits the object for removal, it unpins it and collects it. However, that would mean we get a pinned resource, which isn't a good idea if you have a lot of objects (due to memory fragmentation issues). This is probably not how it works.
When the GC moves an object, it calls back, which can then update the references. This might be how it's implemented judging by the external calls in DependentHandle - but I believe it's slightly more sophisticated.
Not the pointer to the object itself, but a pointer in the list of all objects from the GC is stored. The IntPtr is either an index or a pointer in this list. The list only changes when an object changes generations, at which point a simple callback can update the pointers. If you remember how Mark & Sweep works, this makes more sense. There's no pinning, and removal is as it was before. I believe this is how it works in DependentHandle.
This last solution does require that the runtime doesn't re-use the list buckets until they are explicitly freed, and it also requires that all objects are retrieved by a call to the runtime.
If we assume they use this solution, we can also address the second problem. The Mark & Sweep algorithm keeps track of which objects have been collected; as soon as it has been collected, we know at this point. Once the object checks if the object is there, it calls 'Free', which removes the pointer and the list entry. The object is really gone.
One important thing to note at this point is that things go horribly wrong if ConditionalWeakTable is updated in multiple threads and if it isn't thread safe. The result would be a memory leak. This is why all calls in ConditionalWeakTable do a simple 'lock' which ensures this doesn't happen.
Another thing to note is that cleaning up entries has to happen once in a while. While the actual objects will be cleaned up by the GC, the entries are not. This is why ConditionalWeakTable only grows in size. Once it hits a certain limit (determined by collision chance in the hash), it triggers a Resize, which checks if objects have to be cleaned up -- if they do, free is called in the GC process, removing the IntPtr handle.
I believe this is also why DependentHandle is not exposed directly - you don't want to mess with things and get a memory leak as a result. The next best thing for that is a WeakReference (which also stores an IntPtr instead of an object) - but unfortunately doesn't include the 'dependency' aspect.
What remains is for you to toy around with the mechanics, so that you can see the dependency in action. Be sure to start it multiple times and watch the results:
class DependentObject
{
public class MyKey : IDisposable
{
public MyKey(bool iskey)
{
this.iskey = iskey;
}
private bool disposed = false;
private bool iskey;
public void Dispose()
{
if (!disposed)
{
disposed = true;
Console.WriteLine("Cleanup {0}", iskey);
}
}
~MyKey()
{
Dispose();
}
}
static void Main(string[] args)
{
var dep = new MyKey(true); // also try passing this to cwt.Add
ConditionalWeakTable<MyKey, MyKey> cwt = new ConditionalWeakTable<MyKey, MyKey>();
cwt.Add(new MyKey(true), dep); // try doing this 5 times f.ex.
GC.Collect(GC.MaxGeneration);
GC.WaitForFullGCComplete();
Console.WriteLine("Wait");
Console.ReadLine(); // Put a breakpoint here and inspect cwt to see that the IntPtr is still there
}
}

Related

Storing an object by reference or workarounds

I am building internal logic for a game in C# and coming from C++ this is something that might be lost in translation for me.
I have an object, Ability that calculates the bonus it provides and returns that as an integer value. The calculation is meant to be dynamic and can change depending on a variety of variables.
public class Ability: Buffable
{
public string abbr { get; private set; }
public Ability(string name, string abbr, uint score) : base(name, score)
{
this.abbr = abbr;
}
// Ability Modifier
// returns the ability modifier for the class.
public int Ability_modifier()
{
const double ARBITARY_MINUS_TEN = -10;
const double HALVE = 2;
double value = (double)this.Evaluate();
double result = (value + ARBITARY_MINUS_TEN) / HALVE;
// Round down in case of odd negative modifier
if (result < 0 && ((value % 2) != 0))
{
result--;
}
return (int)result;
}
I then have another object, Skill, which should be aware of that bonus and add it into its calculation. I wanted to pass an Ability into the constructor of Skill by reference and then store that reference so that if the Ability changed the calculation would as well. The obvious problem with this being that apparently storing references is taboo in C#.
Is there either a work around way to do this or an alternate way to approach this problem that my pointer infested mind isn't considering? I would greatly prefer not to have to pass the ability to the function that evaluates Skill every time, since the one referenced never changes after construction.
The obvious problem with this being that apparently storing references is taboo in C#.
Absolutely not. References are stored all over the place. You're doing it here, for example:
this.abbr = abbr;
System.String is a class, and therefore a reference type. And so the value of abbr is a reference.
I strongly suspect you've misunderstood how reference types work in C#. If you remember a reference to an object, then changes to the object will be visible via the reference. However, changes to the original expression you copied won't be.
For example, using StringBuilder as a handy mutable reference type:
StringBuilder x = new StringBuilder("abc");
// Copy the reference...
StringBuilder y = x;
// This changes data within the object that x's value refers to
x.Append("def");
// This changes the value of x to refer to a different StringBuilder
x = new StringBuilder("ghi");
Console.WriteLine(y); // abcdef
See my articles on references and values, and parameter passing in C# for much more detail.
I am not quite seeing enough of your code to give a concrete example, but the way to do this is to pass in a lambda delegate such as () => object.property instead of this: object.property.
In C#, there are reference types and value types. A variable of a reference type (any class) holds a reference, and passing or storing it copies that reference, not the object, so there should be no issue with references. Just pass it, and both variables will refer to the same object.
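To make this concrete for the Skill/Ability case, here is a minimal sketch. The classes are simplified stand-ins (no Buffable base; Score, Ranks and Total are assumed names), not the asker's real code.

```csharp
using System;

// Simplified stand-in for the Ability class from the question.
public class Ability
{
    public int Score { get; set; }
    public Ability(int score) { Score = score; }

    // (score - 10) / 2, rounded down, as in the question.
    public int Modifier()
    {
        return (int)Math.Floor((Score - 10) / 2.0);
    }
}

public class Skill
{
    // Stores the reference passed in; no copy of the Ability is made.
    private readonly Ability ability;
    public int Ranks { get; set; }

    public Skill(Ability ability, int ranks)
    {
        this.ability = ability;
        Ranks = ranks;
    }

    // Always reads the current state of the shared Ability object.
    public int Total()
    {
        return Ranks + ability.Modifier();
    }
}
```

With var strength = new Ability(14); var climb = new Skill(strength, 3); the total is 5. After strength.Score = 18; the same climb.Total() call returns 7, because Skill holds a reference to the same Ability object.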

Cycle in the struct layout that doesn't exist

This is a simplified version of some of my code:
public struct info
{
public float a, b;
public info? c;
public info(float a, float b, info? c = null)
{
this.a = a;
this.b = b;
this.c = c;
}
}
The problem is the error Struct member 'info' causes a cycle in the struct layout. I'm after struct like value type behaviour. I could simulate this using a class and a clone member function, but I don't see why I should need to.
How is this error true? Recursion could perhaps cause construction forever in some similar situations, but I can't think of any way that it could in this case. Below are examples that ought to be fine if the program would compile.
new info(1, 2);
new info(1, 2, null);
new info(1, 2, new info(3, 4));
edit:
The solution I used was to make "info" a class instead of a struct and give it a member function that returns a copy, which I used when passing it. In effect, this simulates the same behaviour as a struct but with a class.
I also created the following question while looking for an answer.
Value type class definition in C#?
It's not legal to have a struct that contains itself as a member. This is because a struct has fixed size, and it must be at least as large as the sum of the sizes of each of its members. Your type would have to have 8 bytes for the two floats, at least one byte to show whether or not info is null, plus the size of another info. This gives the following inequality:
size of info >= 4 + 4 + 1 + size of info
This is obviously impossible as it would require your type to be infinitely large.
You have to use a reference type (i.e. class). You can make your class immutable and override Equals and GetHashCode to give value-like behaviour, similar to the String class.
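A minimal sketch of that suggestion, assuming the same three members as the struct in the question (the class name Info and the property names A, B and Next are my own choices):

```csharp
using System;

public sealed class Info : IEquatable<Info>
{
    public float A { get; private set; }
    public float B { get; private set; }
    // A reference, so the layout is finite; null means "no nested info".
    public Info Next { get; private set; }

    public Info(float a, float b, Info next = null)
    {
        A = a;
        B = b;
        Next = next;
    }

    // Value-like equality: compares contents, including the nested chain.
    public bool Equals(Info other)
    {
        if (ReferenceEquals(other, null)) return false;
        return A.Equals(other.A) && B.Equals(other.B)
            && object.Equals(Next, other.Next);
    }

    public override bool Equals(object obj) { return Equals(obj as Info); }

    // Built from the same members as Equals, so equal objects hash equally.
    public override int GetHashCode()
    {
        unchecked
        {
            int hash = 17;
            hash = hash * 31 + A.GetHashCode();
            hash = hash * 31 + B.GetHashCode();
            hash = hash * 31 + (Next == null ? 0 : Next.GetHashCode());
            return hash;
        }
    }
}
```

Because instances never change after construction, sharing a reference is as safe as copying a struct would have been.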
The reason why this creates a cycle is that Nullable<T> is itself a struct. Because it refers back to info you have a cycle in the layout (info has a field of Nullable<info> and it has a field of info). It's essentially equivalent to the following:
public struct MyNullable<T> {
public T value;
public bool hasValue;
}
struct info {
public float a, b;
public MyNullable<info> next;
}
The real problem is on this line:
public info? c;
Since this is a struct, C# needs to know the inner info's layout before it can produce the outer info's layout. And the inner info includes an inner inner info, which in turn includes an inner inner inner info, and so on. The compiler cannot produce a layout because of this circular reference issue.
Note: info? c is a shorthand for Nullable<info> which is itself a struct.
There isn't any way to achieve mutable value semantics of variable-sized items (semantically, I think what you're after is to have MyInfo1 = MyInfo2 generate a new linked list which is detached from the one started by MyInfo2). One could replace the info? with an info[] (which would always either be null or else populated with a single-element array), or with a holder class that wraps an instance of info, but the semantics would probably not be what you're after. Following MyInfo1 = MyInfo2, changes to MyInfo1.a would not affect MyInfo2.a, nor would changes to MyInfo1.c affect MyInfo2.c, but changes to MyInfo1.c[0].a would affect MyInfo2.c[0].a.
It would be nice if a future version of .net could have some concept of "value references", so that copying a struct wouldn't simply copy all of its fields. There is some value to the fact that .net does not support all the intricacies of C++ copy constructors, but there would also be value in allowing storage locations of type 'struct' to have an identity which would be associated with the storage location rather than its content.
Given that .net does not presently support any such concept, however, if you want info to be mutable, you're going to have to either put up with mutable reference semantics (including protective cloning) or with weird and wacky struct-class-hybrid semantics. One suggestion I would have if performance is a concern would be to have an abstract InfoBase class with descendants MutableInfo and ImmutableInfo, and with the following members:
AsNewFullyMutable -- Public instance -- Returns a new MutableInfo object, with data copied from the original, calling AsNewFullyMutable on any nested references.
AsNewMutable -- Public instance -- Returns a new MutableInfo object, with data copied from the original, calling AsImmutable on any nested references.
AsNewImmutable -- Protected instance -- Returns a new ImmutableInfo object, with data copied from the original, calling AsImmutable (not AsNewImmutable) on any nested references.
AsImmutable -- Public virtual -- For an ImmutableInfo, return itself; for a MutableInfo, call AsNewImmutable on itself.
AsMutable -- Public virtual -- For a MutableInfo, return itself; for an ImmutableInfo, call AsNewMutable on itself.
When cloning an object, depending upon whether one expected that the object or its descendants would be cloned again before it had to be mutated, one would call either AsImmutable, AsNewFullyMutable, or AsNewMutable. In scenarios where one would expect an object to be repeatedly defensively cloned, the object would be replaced by an immutable instance which would then no longer have to be cloned until there was a desire to mutate it.
Disclaimer: This may not achieve the goal of "struct like value type behaviour."
One solution is to use an array of one item to essentially get a reference to the recursively referenced structure. Adapting my approach to your code looks something like this.
public struct info
{
    public float a, b;
    public info? c
    {
        get
        {
            return cArray[0];
        }
        set
        {
            cArray[0] = value;
        }
    }
    // Single-element array: a reference, so the struct no longer contains itself.
    private info?[] cArray;
    public info(float a, float b, info? c = null)
    {
        this.a = a;
        this.b = b;
        this.cArray = new info?[] { c };
        this.c = c;
    }
}

How to determine the size of an instance?

I have set my project to accept unsafe code and have the following helper Class to determine the size of an instance:
struct MyStruct
{
public long a;
public long b;
}
public static class CloneHelper
{
public unsafe static void GetSize(BookSetViewModel book)
{
long n = 0;
MyStruct inst;
inst.a = 0;
inst.b = 0;
n = Marshal.SizeOf(inst);
}
}
This works perfectly fine with a struct. However as soon as I use the actual class-instance that is passed in:
public unsafe static void GetSize(BookSetViewModel book)
{
long n = 0;
n = Marshal.SizeOf(book);
}
I get this error:
Type 'BookSetViewModel' cannot be marshaled as an unmanaged structure;
no meaningful size or offset can be computed.
Any idea how I could fix this?
Thanks,
Well, it really depends on what you mean by the "size" of an instance. There's the size of the single object in memory, but you usually need to think about any objects that the root object refers to. That's how much memory may be reclaimable after the root becomes eligible for garbage collection... but you can't just add them up, as those objects may be referred to by multiple other objects, and indeed there may be repeated references even within a single object.
This blog post shows some code I've used before to determine the size of the raw objects (header + fields), disregarding any extra cost due to the objects that one object refers to. It's not something I would use in production code, but it's useful for experimenting with how large an object is under varying circumstances.

ConcurrentDictionary.GetOrAdd(): valueFactory with different signature

I have a ConcurrentDictionary:
private static ConcurrentDictionary<int, string> _cd = new ConcurrentDictionary<int, string>();
To be conservative, the actual key to be used to retrieve items is an object but I instead simply use its hash code as the key so that a (potentially large) object isn't the key:
public static string GetTheValue(Foo foo)
{
int keyCode = foo.GetHashCode(); // GetHashCode is overridden to guarantee uniqueness
string theValue = _cd.GetOrAdd(keyCode, FooFactory);
return theValue;
}
However, I need various properties in the Foo object when preparing an object in the Factory:
private static string FooFactory(Foo foo)
{
string result = null;
object propA = foo.A;
object propB = foo.B;
// ... here be magic to set result
return result;
}
Since the valueFactory parameter of GetOrAdd() expects Func<int, string>, it appears that I cannot pass my Foo object to it. Is it at all possible to do so?
There is nothing wrong with using a large object as the key.
Unless the object is a struct (and it should not be a struct), it will never be copied.
If you intend to be able to look things up later, your object will obviously stick around, so you won't be leaking memory.
As long as you have reasonable (or inherited) GetHashCode() and Equals() implementations, there won't be any performance impact.
I think there is a fundamental misunderstanding here:
To be conservative, the actual key to
be used to retrieve items is an object
but I instead simply use its hash code
as the key so that a (potentially
large) object isn't the key.
I've bolded the phrases that I think you think are related. They're really not. If you're picturing your dictionary as being too large because its keys are these big objects, you're picturing it wrong. For reference types (any class in C#), the keys will be stored in the dictionary only as references, which are the size of a whopping integer. And if you're concerned about passing the keys between methods, again: it's not what you think. Only references will be copied and passed around, not the objects themselves.
So I agree with SLaks: just use your Foo type (or whatever it's really called) as the key in the first place. It will make your life a lot simpler.
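For what it's worth, a minimal sketch of that approach. Foo's members and the factory body are placeholders I made up (the question only shows that Foo has A and B), and Foo here relies on the default reference equality.

```csharp
using System.Collections.Concurrent;

// Placeholder for the asker's type; A and B are assumed to be strings here.
public sealed class Foo
{
    public string A { get; private set; }
    public string B { get; private set; }
    public Foo(string a, string b) { A = a; B = b; }
}

public static class FooCache
{
    // The dictionary stores references to the Foo keys, not copies.
    private static readonly ConcurrentDictionary<Foo, string> Cache =
        new ConcurrentDictionary<Foo, string>();

    // Matches Func<Foo, string>, so it can be passed to GetOrAdd directly.
    private static string FooFactory(Foo foo)
    {
        return foo.A + "/" + foo.B; // stand-in for the real work
    }

    public static string GetTheValue(Foo foo)
    {
        return Cache.GetOrAdd(foo, FooFactory);
    }
}
```

With the key typed as Foo, the factory receives the Foo itself, so the signature problem from the question goes away.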
I needed this for a different reason. If someone else ends up here like I did, here's how it can be done:
public static string GetTheValue(Foo foo)
{
int keyCode = ...
string theValue = _cd.GetOrAdd(keyCode, (key => FooFactory(foo))); //key is not used in the factory
return theValue;
}

Object.GetHashCode

My question may duplicate Default implementation for Object.GetHashCode() but I'm asking again because I didn't understand the accepted answer to that one.
To begin with I have three questions about the accepted answer to the previous question, which quotes some documentation as follows:
"However, because this index can be reused after the object is reclaimed during garbage collection, it is possible to obtain the same hash code for two different objects."
Is this true? It seems to me that two objects won't have the same hash code, because an object's code isn't reused until the object is garbage collected (i.e. no longer exists).
"Also, two objects that represent the same value have the same hash code only if they are the exact same object."
Is this a problem? For example, I want to associate some data with each of the node instances in a DOM tree. To do this, the 'nodes' must have an identity or hash code, so that I can use them as keys in a dictionary of the data. Isn't a hash code which identifies whether it's "the exact same object", i.e. "reference equality" rather than "value equality", what I want?
"This implementation is not particularly useful for hashing; therefore, derived classes should override GetHashCode"
Is this true? If it's not good for hashing, then what if anything is it good for, and why is it even defined as a method of Object?
My final (and perhaps most important to me) question is, if I must invent/override a GetHashCode() implementation for an arbitrary type which has "reference equality" semantics, is the following a reasonable and good implementation:
class SomeType
{
//create a new value for each instance
static int s_allocated = 0;
//value associated with this instance
int m_allocated;
//more instance data
... plus other data members ...
//constructor
SomeType()
{
m_allocated = ++s_allocated;
}
//override GetHashCode
public override int GetHashCode()
{
return m_allocated;
}
}
Edit
FYI I tested it, using the following code:
class TestGetHash
{
//default implementation
class First
{
int m_x;
}
//my implementation
class Second
{
static int s_allocated = 0;
int m_allocated;
int m_x;
public Second()
{
m_allocated = ++s_allocated;
}
public override int GetHashCode()
{
return m_allocated;
}
}
//stupid worst-case implementation
class Third
{
int m_x;
public override int GetHashCode()
{
return 0;
}
}
internal static void test()
{
testT<First>(100, 1000);
testT<First>(1000, 100);
testT<Second>(100, 1000);
testT<Second>(1000, 100);
testT<Third>(100, 100);
testT<Third>(1000, 10);
}
static void testT<T>(int objects, int iterations)
where T : new()
{
System.Diagnostics.Stopwatch stopWatch =
System.Diagnostics.Stopwatch.StartNew();
for (int i = 0; i < iterations; ++i)
{
Dictionary<T, object> dictionary = new Dictionary<T, object>();
for (int j = 0; j < objects; ++j)
{
T t = new T();
dictionary.Add(t, null);
}
for (int k = 0; k < 100; ++k)
{
foreach (T t in dictionary.Keys)
{
object o = dictionary[t];
}
}
}
stopWatch.Stop();
string stopwatchMessage = string.Format(
"Stopwatch: {0} type, {1} objects, {2} iterations, {3} msec",
typeof(T).Name, objects, iterations,
stopWatch.ElapsedMilliseconds);
System.Console.WriteLine(stopwatchMessage);
}
}
On my machine the results/output are as follows:
First type, 100 objects, 1000 iterations, 2072 msec
First type, 1000 objects, 100 iterations, 2098 msec
Second type, 100 objects, 1000 iterations, 1300 msec
Second type, 1000 objects, 100 iterations, 1319 msec
Third type, 100 objects, 100 iterations, 1487 msec
Third type, 1000 objects, 10 iterations, 13754 msec
My implementation takes a bit under two-thirds of the time of the default implementation (roughly 1300 msec versus 2070 msec), but my type is bigger by the size of my m_allocated data member.
My implementation and the default implementation both scale linearly.
In comparison and as a sanity check, the stupid implementation starts bad and scales worse.
The most important property a hash code implementation must have is this:
If two objects compare as equal then they must have identical hash codes.
If you have a class where instances of the class compare by reference equality, then you do not need to override GetHashCode; the default implementation guarantees that two objects that are the same reference have the same hash code. (You're calling the same method twice on the same object, so of course the result is the same.)
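As a minimal sketch of that point: the Node type and the ReferenceKeyDemo helper below are hypothetical names invented for illustration, not something from the question. Node overrides neither Equals nor GetHashCode, yet it works as a dictionary key, and two distinct instances with identical contents stay distinct keys.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical node type: it overrides neither Equals nor GetHashCode,
// so it keeps the default reference-equality semantics.
class Node
{
    public readonly string Name;
    public Node(string name) { Name = name; }
}

static class ReferenceKeyDemo
{
    // Attaches a note to each of two distinct nodes that carry the same Name,
    // then reads back the note stored for the first one.
    public static string NoteForFirst()
    {
        var notes = new Dictionary<Node, string>();
        var a = new Node("div");
        var b = new Node("div"); // same contents, different instance

        notes[a] = "first";
        notes[b] = "second";     // does not overwrite the entry for a

        return notes[a];         // lookup is by instance, not by contents
    }

    static void Main()
    {
        Console.WriteLine(NoteForFirst()); // prints "first"
    }
}
```

Even if a and b happened to get the same default hash code, the dictionary would still tell them apart, because it falls back on Equals (here, reference equality) within a bucket.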
If you have written a class which implements its own equality that is different from reference equality then you are REQUIRED to override GetHashCode such that two objects that compare as equal have equal hash codes.
Now, you could do so by simply returning zero every time. That would be a lousy hash function, but it would be legal.
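For contrast, here is a sketch of a type with value equality that overrides both methods consistently. The Point class, the PointDemo helper and the multiplier 397 are assumptions chosen for illustration; any combination that yields equal codes for equal objects would satisfy the rule.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical immutable type with value equality, used only for illustration.
sealed class Point : IEquatable<Point>
{
    private readonly int x;
    private readonly int y;

    public Point(int x, int y) { this.x = x; this.y = y; }

    // Two points are equal when both coordinates match.
    public bool Equals(Point other)
    {
        return !ReferenceEquals(other, null) && x == other.x && y == other.y;
    }

    public override bool Equals(object obj) { return Equals(obj as Point); }

    // Built from exactly the fields that Equals compares, so equal points
    // always produce equal hash codes. The constant 397 is an arbitrary choice.
    public override int GetHashCode()
    {
        unchecked { return (x * 397) ^ y; }
    }
}

static class PointDemo
{
    // Stores one point and looks it up through a different but equal instance.
    public static bool ContainsEqualPoint()
    {
        var set = new HashSet<Point> { new Point(1, 2) };
        return set.Contains(new Point(1, 2));
    }

    static void Main()
    {
        Console.WriteLine(ContainsEqualPoint()); // prints "True"
    }
}
```

If GetHashCode were left at its default here, the lookup through a second instance would normally fail, even though Equals reports the two points as equal.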
Other properties of good hash functions are:
GetHashCode should never throw an exception
Mutable objects which compare for equality on their mutable state, and therefore hash on their mutable state, are dangerously bug-prone. You can put an object into a hash table, mutate it, and be unable to get it out again. Try to never hash or compare for equality on mutable state.
GetHashCode should be extremely fast -- remember, the purpose of a good hash algorithm is to improve the performance of lookups. If the hash is slow then the lookups can't be made fast.
Objects which do not compare as equal should have dissimilar hash codes, well distributed over the whole range of a 32 bit integer
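The warning about mutable state can be reproduced with a small sketch. MutableKey and MutableKeyDemo are hypothetical names made up for this example; the key hashes on a field that is changed after the key has been added to a dictionary.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical key type that (unwisely) compares and hashes on a mutable field.
class MutableKey
{
    public int Id;
    public MutableKey(int id) { Id = id; }

    public override bool Equals(object obj)
    {
        var other = obj as MutableKey;
        return !ReferenceEquals(other, null) && other.Id == Id;
    }

    public override int GetHashCode() { return Id; }
}

static class MutableKeyDemo
{
    // Adds a key, mutates it, then asks the dictionary for the very same object.
    public static bool FoundAfterMutation()
    {
        var table = new Dictionary<MutableKey, string>();
        var key = new MutableKey(1);
        table.Add(key, "payload");

        key.Id = 2; // the entry is still filed under the hash code for Id == 1

        return table.ContainsKey(key);
    }

    static void Main()
    {
        Console.WriteLine(FoundAfterMutation()); // prints "False"
    }
}
```

The entry is still in the table, but it can no longer be reached through its own key until Id is set back to 1.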
Question:
Is this true? It seems to me that two objects won't have the same hash code, because
an object's code isn't reused until the object is garbage collected (i.e. no longer exists).
Two objects may share the same hash code, if it is generated by the default GetHashCode implementation, because:
The default GetHashCode result must not change during the object's lifetime, and the default implementation ensures this. If it could change, types such as Hashtable could not work with this implementation. That is because the default hash code is expected to behave like a hash of a unique instance identifier (even though there is no such identifier :) ).
The range of GetHashCode values is the range of a 32-bit integer (2^32 distinct values).
Conclusion:
Allocating more than 2^32 strongly referenced objects (which should be feasible on Win64) is therefore enough to force at least two of them to share a hash code.
Finally, there is an explicit statement in the object.GetHashCode reference in MSDN: "The default implementation of the GetHashCode method does not guarantee unique return values for different objects. Furthermore, the .NET Framework does not guarantee the default implementation of the GetHashCode method, and the value it returns will be the same between different versions of the .NET Framework. Consequently, the default implementation of this method must not be used as a unique object identifier for hashing purposes."
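If a genuinely unique per-instance identifier is needed inside a single process, one possible approach (an assumption on my part, not something the MSDN text prescribes) is to hand out IDs from a counter and remember them in a ConditionalWeakTable, which tracks keys by reference and does not keep them alive. The ObjectIds class and its members below are hypothetical names.

```csharp
using System;
using System.Runtime.CompilerServices;
using System.Threading;

static class ObjectIds
{
    // ConditionalWeakTable requires a reference type as its value,
    // so the numeric ID is wrapped in a small holder object.
    private sealed class IdHolder
    {
        public readonly long Id;
        public IdHolder(long id) { Id = id; }
    }

    private static long s_next; // last ID handed out

    private static readonly ConditionalWeakTable<object, IdHolder> s_ids =
        new ConditionalWeakTable<object, IdHolder>();

    // Returns the same ID for the same instance on every call, and a
    // different ID for every other instance. Under contention the callback
    // may run more than once, so IDs can have gaps, but they never repeat.
    public static long GetId(object instance)
    {
        return s_ids.GetValue(
            instance,
            _ => new IdHolder(Interlocked.Increment(ref s_next))).Id;
    }
}

static class ObjectIdsDemo
{
    static void Main()
    {
        var a = new object();
        var b = new object();
        Console.WriteLine(ObjectIds.GetId(a) == ObjectIds.GetId(a)); // True
        Console.WriteLine(ObjectIds.GetId(a) == ObjectIds.GetId(b)); // False
    }
}
```

Unlike the default hash code, the counter cannot collide until a 64-bit value wraps around, and the table does not prevent the keyed objects from being collected.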
You do not actually need to modify anything on a class which requires only reference equality.
Also, formally, that is not a good implementation since it has poor distribution. A hash function should have a reasonable distribution since it improves hash bucket distribution, and indirectly, performance in collections which use hash tables. As I said, that is a formal answer, one of the guidelines when designing a hash function.
