UPDATE:
Starting with .NET Framework 4.7.2, HashSet<T>.TryGetValue (docs) is available.
HashSet.TryGetValue - SO post
I have a problem with HashSet because it does not provide any method similar to the TryGetValue known from Dictionary. I need such a method: I pass in the element to find, and the set returns the matching element from its own collection (when found).
Sidenote -- "why do you need element from the set, you already have that element?". No, I don't, equality and identity are two different things.
HashSet is not sealed, but all its fields are private, so deriving from it is pointless. I cannot use Dictionary instead because I need the SetEquals method. I was thinking about grabbing the source for HashSet and adding the desired method, but the license is not truly open source (I can look, but I cannot distribute or modify). I could use reflection, but the arrays in HashSet are not readonly, meaning I cannot bind to those fields once per instance lifetime.
And I don't want to pull in a full-blown library for just a single class.
So far I am stuck with LINQ's SingleOrDefault. So the question is how to fix this: how can I have a HashSet with TryGetValue?
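You should probably switch from a HashSet to a SortedSet.
There is a simple TryGetValue() for a SortedSet:
// Assumes a field `sortedSet` of type SortedSet<T>; First() needs `using System.Linq;`.
public bool TryGetValue(ref T element)
{
    // The view holds every stored element that compares equal to `element` (at most one).
    var foundSet = sortedSet.GetViewBetween(element, element);
    if (foundSet.Count == 1)
    {
        element = foundSet.First(); // replace the probe with the stored instance
        return true;
    }
    return false;
}
When called, the element only needs to have the properties set that the comparer uses. On success, it is replaced by the element found in the set.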
I agree this is something which is basically missing. While it's only useful in rare cases, I think they're significant rare cases, most notably key canonicalization.
I can only think of one suggestion at the moment, and it's truly foul.
You can specify your own IEqualityComparer<T> when creating a HashSet<T> - so create one which remembers the arguments to the last positive (i.e. true-returning) Equals comparison it has performed. You can then call Contains, and see what the equality comparer was asked to compare.
Caveats:
This holds on to references unnecessarily, so could end up preventing objects being garbage collected
You'd potentially want to do this on a per-thread basis (if you've got a set that isn't modified after initialization, but is then read by multiple threads, for example)
It assumes that HashSet<T> doesn't use any optimization such as "if the references are equal, don't bother consulting the equality comparer"
It's fundamentally a horrible abuse
I've been trying to think of other alternatives in terms of finding intersections, but I haven't got anywhere yet...
As noted in comments, it would be worth encapsulating this as far as possible - I suspect you only need a very limited set of operations, so I'd wrap a HashSet<T> in your own class and only expose the operations you really need - that way you get to clear the "cache" after each operation, removing my first objection above.
It still feels like a horrible abuse to me, but...
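The comparer trick and the suggested encapsulation could be sketched roughly as follows. This is only an illustration under assumptions: CanonicalSet and RecordingComparer are hypothetical names, the element type is restricted to reference types, and the sketch relies on Contains actually consulting the supplied comparer (the third caveat above). It is not thread-safe.

```csharp
using System.Collections.Generic;

// Hypothetical wrapper; not a framework type.
public sealed class CanonicalSet<T> where T : class
{
    // Records both operands of the last Equals call that returned true.
    private sealed class RecordingComparer : IEqualityComparer<T>
    {
        private readonly IEqualityComparer<T> inner;
        public T Left, Right;

        public RecordingComparer(IEqualityComparer<T> inner) { this.inner = inner; }

        public bool Equals(T x, T y)
        {
            bool equal = inner.Equals(x, y);
            if (equal) { Left = x; Right = y; }
            return equal;
        }

        public int GetHashCode(T obj) { return inner.GetHashCode(obj); }

        public void Clear() { Left = null; Right = null; }
    }

    private readonly RecordingComparer recorder;
    private readonly HashSet<T> set;

    public CanonicalSet(IEqualityComparer<T> comparer)
    {
        recorder = new RecordingComparer(comparer);
        set = new HashSet<T>(recorder);
    }

    public bool Add(T item)
    {
        bool added = set.Add(item);
        recorder.Clear(); // do not keep references alive between operations
        return added;
    }

    public bool SetEquals(IEnumerable<T> other)
    {
        bool result = set.SetEquals(other);
        recorder.Clear();
        return result;
    }

    public bool TryGetValue(T probe, out T stored)
    {
        recorder.Clear();
        bool found = set.Contains(probe);
        // Argument order inside HashSet is an implementation detail,
        // so pick whichever recorded operand is not the probe itself.
        stored = !found ? null
               : ReferenceEquals(recorder.Left, probe) ? recorder.Right
               : recorder.Left;
        recorder.Clear();
        return found;
    }
}
```

Clearing the recorder after every operation addresses the garbage-collection caveat; the other caveats still apply.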
As others have suggested, an alternative would be to use a Dictionary<TKey, TValue> and implement SetEquals yourself. That would be simple enough to do - and again, you'd want to encapsulate this in your own type. Either way, you should probably design the type itself first, and then implement it using either a HashSet<> or a Dictionary<,> as an implementation detail.
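A minimal sketch of the Dictionary-backed alternative, assuming a hypothetical type name LookupSet and exposing only a few operations. The SetEquals shown here is order-independent and assumes both sets were built with equivalent comparers.

```csharp
using System.Collections.Generic;

// Hypothetical type: a set that can hand back its stored instance.
public sealed class LookupSet<T>
{
    // Each element is stored as both key and value.
    private readonly Dictionary<T, T> items;

    public LookupSet(IEqualityComparer<T> comparer)
    {
        items = new Dictionary<T, T>(comparer);
    }

    public int Count { get { return items.Count; } }

    public bool Add(T item)
    {
        if (items.ContainsKey(item)) return false;
        items.Add(item, item);
        return true;
    }

    public bool TryGetValue(T probe, out T stored)
    {
        return items.TryGetValue(probe, out stored);
    }

    // Equal when the counts match and every element of `other` is present here.
    // Assumption: both sets use equivalent comparers.
    public bool SetEquals(LookupSet<T> other)
    {
        if (other.Count != Count) return false;
        foreach (T key in other.items.Keys)
        {
            if (!items.ContainsKey(key)) return false;
        }
        return true;
    }
}
```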
Sounds like you are trying to use the wrong tool. True, you can save some memory using a HashSet, but it seems to me that you are trying to achieve a different goal: getting the actual element that is merely equal to a representation.
So in reality they are two different elements. Just the memento (a unique representation) is equal.
Therefore you'd be better off using a Dictionary where you add your elements as both Key and Value. That way you can get the identical element back, but you lose SetEquals...
I suppose SetEquals in its implementation does little more than sequentially compare the two HashSets in bucket order and fail on the first inequality.
So you should be equally well off using a simple SequenceEqual() (LINQ) to compare the two Keys collections.
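So this extension method could do:
// Note: SequenceEqual is order-sensitive, and Dictionary enumeration order is not guaranteed.
public static bool SetEqual<T,G>(this IDictionary<T,G> d, IDictionary<T,G> e)
{
    return d.Keys.SequenceEqual(e.Keys);
}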
This should work, because a Dictionary basically is a HashSet with an associated value. And more appropriate to your problem. (OK, to be correct, the code should go for Dictionary<> instead of IDictionary<> because Key order matters)
If you need an IEnumerable<> on the second parameter try sorting to get a defined order (not so efficient).
Finally added in .NET 4.7.2:
HashSet.TryGetValue(T, T) Method
An SO post with more details
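For illustration, here is a small sketch of the key-canonicalization use case with HashSet<T>.TryGetValue(T equalValue, out T actualValue). The helper name Canonicalize and the string pool are assumptions made for the example.

```csharp
using System;
using System.Collections.Generic;

public static class TryGetValueDemo
{
    // Returns the instance already stored in the pool if an equal one exists;
    // otherwise stores the candidate and returns it.
    public static string Canonicalize(HashSet<string> pool, string candidate)
    {
        string stored;
        if (pool.TryGetValue(candidate, out stored))
        {
            return stored;
        }
        pool.Add(candidate);
        return candidate;
    }
}
```

With a pool created as new HashSet<string>(StringComparer.OrdinalIgnoreCase) that already contains "Hello", calling Canonicalize(pool, "HELLO") returns the stored "Hello".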
Hopefully I'm not blind, but I haven't seen this answer anywhere. If you want Dictionary's TryGetValue, you can just steal it.
theHashset.ToDictionary(item => item.ID).TryGetValue(key, out value)
All you need is a quick lambda for determining unique keys.
I have a method that accepts a collection, as below:
public IList<CountryDto> ApplyDefaults(IList<CountryDto> dtos)
{
    // Iterates the collection
    // Validates the items in the collection
    // If items are invalid:
    //     removes them, e.g. dtos.Remove(currentCountryDto)
    return dtos; // Do I need to do this?
}
My question is: since the reference to the collection is not changed, should I return the collection again from the method?
For: By returning the collection, I make it explicit in the signature, and the user is aware that the items in the collection could differ from the original source. It avoids ambiguity, in a way.
Against: Since the validation doesn't change the reference to the collection, it doesn't make sense technically to return it.
What is the best approach in this case?
Note: I am not sure if this question is opinion-based. I think I am probably missing something on the design side.
In every programming language, consistency of your own code or library with the approach of the core libraries is of high value. Hence, looking at how Collections.sort(), Collections.swap() and Collections.shuffle() are defined, I would suggest not returning the input parameter if you intend to modify it. In addition, your method should be named so that it is obvious the input parameter gets modified. Otherwise the modification comes across as a hidden side effect.
Returning a value most often suggests that it is a new instance which reflects the work, performed by the method or is used for method-chaining in case of builders.
Given your comments/requirements:
Does not need to report if defaults are applied.
ApplyDefaults is complicated and invoking other services and not intended to produce a fluent API
ApplyDefaults is a "black box"; validation logic is injected so the calling code doesn't know/care about the validation
I think based on these, this method definitely should not return the reference to the incoming list, even if no validation is applied. Firstly, unless the API is clearly built around method chaining (which you indicated you do not want), returning a List<T> type usually indicates a new List is being created. Secondly, if a new list is not created, users may find themselves modifying the list in ways they didn't expect.
Consider:
IList<CountryDto> originalCountries = Service.GetCountries();
IList<CountryDto> validatedCountries = ApplyDefaults(originalCountries);
validatedCountries.Add(mySpecialCountry);
OutputOriginalCountries(originalCountries);
OutputValidatedCountries(validatedCountries);
This code isn't very special; it is a fairly common pattern. If ApplyDefaults returned a reference to the same originalCountries collection, then mySpecialCountry would also be added to originalCountries. This would violate the Principle of Least Astonishment.
This would be exacerbated if this behaviour changed depending on whether or not items were validated/filtered. Since the validation logic is a black-box of behaviour that the caller doesn't know or care about, the API consumer could not depend on whether or not it returned the same reference. They would either have to do their own reference check (e.g., if (myValidatedCountries == myInputCountries)), or simply make a copy every time. Regardless, this becomes another weird behaviour that the programmer has to juggle when working with the API.
I think that the method should either:
A) always return a copied list with the items filtered out (public IList<CountryDto> ApplyDefaults(IEnumerable<CountryDto> dtos))
B) modify the incoming list in-place (public void ApplyDefaults(IList<CountryDto> dtos))
For option A, depending on the size of your list, this incurs the possibly unnecessary work of creating a copied list every time, even if no filtering is performed. However, the validation/filtering logic might be simpler, and you might be able to use LINQ queries to apply the filtering nicely. Additionally, removing items from a list is generally costly, as each removal shifts the elements that follow it in the internal array, so it might actually be faster to build a new list. You may even simplify the return type here to IEnumerable<CountryDto>; this allows for wider usage and makes it extremely obvious that you're creating a new collection.
For option B, if no validation is required, then no work is done and the method is essentially "free" (no array rebuilding, no copying, no reference changes). But if there is significant validation, the removal aspect may be costly. Since you're not method chaining, this version should have a void return type as it's much more obvious to the developer that this is modifying the list in-place. This follows other commonly known methods like List<T>.Sort. Furthermore, if a user wants to have a separate originalCountries and validatedCountries they can always make a copy:
var validatedCountries = originalCountries.ToList();
ApplyDefaults(validatedCountries);
Ultimately, which one you choose might depend on performance. If validation/removal is cheap and rare, then modifying the list in-place might be best. If you're expecting a lot of changes to the list, it might simply be faster to produce a new copy every time.
Regardless, I would suggest you name the method with a little more clarity as well. For example:
public IList<CountryDto> GetValidCountries(IEnumerable<CountryDto> dtos)
public void RemoveInvalidCountries(IList<CountryDto> dtos)
Of course, the naming might be different depending on your actual code context (I suspect ApplyDefaults is a common/inherited method name and not specific to CountryDto)
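The two options could look roughly like this. It is a sketch under assumptions: CountryDto is reduced to a single hypothetical Code property, and the injected validation logic is stood in for by an isValid delegate parameter that the original signatures do not have.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for the real DTO.
public class CountryDto
{
    public string Code { get; set; }
}

public static class CountryFilters
{
    // Option A: leave the input untouched and always return a new list.
    public static IList<CountryDto> GetValidCountries(
        IEnumerable<CountryDto> dtos, Func<CountryDto, bool> isValid)
    {
        return dtos.Where(isValid).ToList();
    }

    // Option B: modify the incoming list in place and return nothing.
    public static void RemoveInvalidCountries(
        IList<CountryDto> dtos, Func<CountryDto, bool> isValid)
    {
        // Walk backwards so removals do not shift the indexes still to visit.
        for (int i = dtos.Count - 1; i >= 0; i--)
        {
            if (!isValid(dtos[i])) dtos.RemoveAt(i);
        }
    }
}
```

With option A the caller's list keeps all its items; with option B the same list instance shrinks, which the void return type signals.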
I'd rather return boolean (or enum in an elaborated case: collection preserved intact, changed, can't be validated etc.)
// true if the collection is changed, false otherwise
public Boolean ApplyDefaults(IList<CountryDto> dtos) {
    Boolean result = false;
    // Iterates the collection
    // Validates the items in the collection
    // If items are invalid:
    //     removes items, e.g. dtos.Remove(currentCountryDto)
    //     result = true;
    ...
    return result;
}
...
if (ApplyDefaults(myData)) {
    // Collection is changed, do some extra stuff
}
First of all: a method cannot change which collection the caller's variable refers to, because by default the parameter is a copy of the reference. You'd need to use the ref keyword in order to be able to change it.
Secondly: if your method has a return type, then it has to return an object. Your method is not called GetNewCollectionWithAppliedDefaults, but ApplyDefaults, which implies that the collection will be modified. You should either return a boolean true/false to inform the user that changes were made, or always return the parameter's collection (to allow nested method calls).
Also, why would you think it doesn't make sense to return a collection? I'd say there's no argument against it. Turn the question around: "why wouldn't I return the collection and could it harm my code"?
Technically, I would say there is not much difference between the two.
However, as you pointed out, a commonly used convention is that a function should only return an object it creates. Basically, that would mean that a function that returns an object is generating one, while a function that doesn't return anything is modifying the object passed as a parameter.
Again, this is only a convention and it is not widely used within the C# community, but in the Python community, for example, it is.
Some people return a Boolean (or an error code) instead, as an indicator of an error (like the old DOS command line). I don't like this approach and by far prefer raising exceptions that I can handle later on.
Finally, the best approach in my regard, is to return a value that indicates if a change was done by the function and eventually a value indicating how much of a change was done. It can be a Boolean or it can be the number of inserted/removed elements...
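As a sketch of reporting how much changed, List<T>.RemoveAll already returns the number of removed elements, so a helper can pass that count straight through. The name RemoveInvalid and the isValid predicate are assumptions for the example.

```csharp
using System;
using System.Collections.Generic;

public static class ChangeReporting
{
    // Removes every item that fails isValid and returns how many were removed.
    public static int RemoveInvalid<T>(List<T> items, Predicate<T> isValid)
    {
        return items.RemoveAll(item => !isValid(item));
    }
}
```

A caller that only cares whether anything changed can test the result against zero.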
In any case, try to be consistent with the approach you chose, if not in all your code, at least within a single project. Sometimes, you will have no other choice but to abide with the convention used by your teammates.
(My answer is based on the Java viewpoint; C++ and C# programmers might have a different take.) I think it's best to return the collection. The fact that the collection you're returning is the same collection that was given is just an implementation detail, and in future versions of the code, you might want to change that. Document that the collection returned might not be the same one passed in.
If, on the other hand, you want to lock in the design that this method modifies a collection in place, document it that way and don't return the collection. I prefer not to do it this way, but I can see advantages in some contexts.
In your case I would leave it void, since ApplyDefaults clearly states what it's doing.
Also, it might be a good idea to ApplyDefaults in the collection itself. Subclass IList or List or whatever and then you'd call like this:
myCollection.ApplyDefaults();
Which is just obvious.
Say for example I have
Dictionary<string, double> foo;
I can do
foo["hello"] = foo["hello"] + 2.0
Or I could do
foo["hello"] += 2.0
but the compiler just expands this to the code above. I verified that by using JetBrains dotPeek to look at the assemblies.
This seems wasteful as two key lookups are required to update. Is there a dictionary implementation that can do this in one lookup? Note I'm using a dictionary to store 100k items of geometry information from a mesh and the lookups are in an inner loop. Please no "premature optimization is the root of all evil" answers. :)
Yes I have profiled.
Using a class would probably be faster as the comments mention because:
With a struct, you must do a double look-up as mentioned in the comments.
With a class, you simply go to the memory of the class reference and can update it there.
Each Lookup:
GetHashCode
Get the bucket
Iterate through to find the right one
(This all involves reading multiple ref object values)
However, if you use a class and update its value:
Change the value at the correct position relative to that ref.
It's a single change in memory.
@George Duckett's solution should be much faster. Change to a class, get the reference, and update the object's value:
var hello = foo["hello"];
hello.howAreYou += 2.0;
By the way, this is an example case where a mutable class will win in performance over the immutable struct.
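A minimal sketch of the mutable-class approach, assuming a hypothetical holder type Accumulator. Once the holder is in the dictionary, an update costs one TryGetValue lookup plus a field write; a second lookup happens only when a key is inserted for the first time.

```csharp
using System.Collections.Generic;

// Hypothetical mutable holder for the stored number.
public sealed class Accumulator
{
    public double Value;
}

public static class AccumulatorDemo
{
    public static void Add(Dictionary<string, Accumulator> map, string key, double delta)
    {
        Accumulator cell;
        if (!map.TryGetValue(key, out cell))
        {
            cell = new Accumulator();
            map.Add(key, cell); // extra lookup only on first insertion
        }
        cell.Value += delta; // plain field write, no further lookup
    }
}
```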
There's a method in ConcurrentDictionary, ConcurrentDictionary.AddOrUpdate, that does what you want. You can update an existing value in the dictionary based on its previous value in one go.
However, the concurrent dictionary is supposed to be used in multiple thread situations, so I can imagine it does some locking which might defeat your optimization goal. But then again, you can always benchmark and see how it goes.
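For reference, a call to ConcurrentDictionary<TKey, TValue>.AddOrUpdate could look like the sketch below; the wrapper name AddTo is an assumption. Whether it is actually faster than two lookups on a plain Dictionary is something only a benchmark on the real workload can tell.

```csharp
using System.Collections.Concurrent;

public static class ConcurrentUpdateDemo
{
    // Stores delta when the key is absent; otherwise stores old + delta.
    // Returns the value that ends up in the dictionary.
    public static double AddTo(ConcurrentDictionary<string, double> map, string key, double delta)
    {
        return map.AddOrUpdate(key, delta, (k, old) => old + delta);
    }
}
```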
No, it is not. As noted in the comment by bradgonesurfing, the language lacks a way to return reference to the stored value, so when it has to change that value, it needs to find it again.
Also, you said you are storing pairs of integers. Have you thought about using an array? Even a 100k-element array is not even 1 MB in size, and I'm sure it would be the fastest you can get.
I have two questions. I was wondering if there is a simple class in the C# library that stores pairs of values instead of just one, so that I can store a class and an integer in the same node of the list. I think the easiest way is to just make a container class, but this is extra work each time, so I wanted to know whether I should be doing so or not. I know that later versions of .NET (I am using 3.5) have tuples that I could store, but those are not available to me.
I guess the bigger question is: what are the memory disadvantages of using a dictionary to store the integer-to-class map, even though I don't need O(1) access and could afford to just search the list? What is the minimum size of the hash table? Should I just make the wrapper class I need?
If you need to store an unordered list of {integer, value}, then I would suggest making the wrapper class. If you need a data structure in which you can look up integer to get value (or, look up value to get integer), then I would suggest a dictionary.
The decision of List<Tuple<T1, T2>> (or List<KeyValuePair<T1, T2>>) vs Dictionary<T1, T2> is largely going to come down to what you want to do with it.
If you're going to be storing information and then iterating over it, without needing to do frequent lookups based on a particular key value, then a List is probably what you want. Depending on how you're going to use it, a LinkedList might be even better - slightly higher memory overheads, faster content manipulation (add/remove) operations.
On the other hand, if you're going to be primarily using the first value as a key to do frequent lookups, then a Dictionary is designed specifically for this purpose. Key value searching and comparison is significantly improved, so if you do much with the keys and your list is big a Dictionary will give you a big speed boost.
Data size is important to the decision. If you're talking about a couple hundred items or less, a List is probably fine. Above that point the lookup times will probably impact more significantly on execution time, so Dictionary might be more worth it.
There are no hard and fast rules. Every use case is different, so you'll have to balance your requirements against the overheads.
You can use a list of KeyValuePair: http://msdn.microsoft.com/en-us/library/5tbh8a42.aspx
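A short sketch of the list-of-KeyValuePair approach, which needs no Tuple and no custom wrapper class. The helper names and the int/string pairing are assumptions for the example; the lookup is a linear search, which is the trade-off against a Dictionary.

```csharp
using System.Collections.Generic;

public static class PairListDemo
{
    public static List<KeyValuePair<int, string>> Build()
    {
        var pairs = new List<KeyValuePair<int, string>>();
        pairs.Add(new KeyValuePair<int, string>(1, "one"));
        pairs.Add(new KeyValuePair<int, string>(2, "two"));
        return pairs;
    }

    // Linear search: returns the first value stored under key, or null if absent.
    public static string Find(List<KeyValuePair<int, string>> pairs, int key)
    {
        foreach (KeyValuePair<int, string> pair in pairs)
        {
            if (pair.Key == key) return pair.Value;
        }
        return null;
    }
}
```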
You can use a Tuple<T,T1>, a list of KeyValuePair<T, T1> - or, an anonymous type, e.g.
var list = something.Select(x => new { Key = x.Something, Value = x.Value });
You can use either KeyValuePair or Tuple
For Tuple, you can read the following useful post:
What requirement was the tuple designed to solve?
I have a question about generic collections in C#. If I need to store a collection of items, and I'm frequently going to need to check whether an item is in the collection, would it be faster to use Dictionary instead of List?
I've heard that checking if an item is in the collection is linear relative to the size for lists and constant relative to the size for dictionaries. Is using Dictionary and then setting Key and Value to the same object for each key-value pair something that other programmers frequently do in this situation?
Thanks for taking the time to read this.
Yes, yes it is. That said, you probably want to use HashSet because you don't need both a key and a value, you just need a set of items.
It's also worth noting that Dictionary was added in .NET 2.0 and HashSet was added in .NET 3.5, so for all that time in between it was actually fairly common to use a Dictionary when you wanted a set, just because that was all you had (without rolling your own). When I was forced to do this I just stuck null in the value, rather than using the item as both key and value, but the idea is the same.
Just use HashSet<Foo> if what you're concerned with is fast containment tests.
A Dictionary<TKey, TValue> is for looking a value up based on a key.
A List<T> is for random access and dynamic growth properties.
A HashSet<T> is for modeling a set and providing fast containment tests.
You're not looking up a value based on a key. You're not worried about random access, but rather fast containment checks. The right concept here is a HashSet<T>.
Assuming that there is only ever one copy of the item in the list, then the appropriate data structure is ISet<T>, specifically HashSet<T>.
That said, I've seen timings that indicate that a Dictionary<TKey, TValue> ContainsKey call is a wee bit faster than even HashSet<T>. Either way, both of them are going to be loads faster than a plain List<T> lookup.
Keep in mind that both of these (HashSet and Dictionary) rely on reasonably well-implemented Equals and GetHashCode implementations for T. List<T> only relies on Equals.
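To illustrate the dependence on Equals and GetHashCode, here is a sketch with a hypothetical Point class that overrides both consistently. With these overrides, both HashSet<Point>.Contains and List<Point>.Contains find an equal but distinct instance; the hash set uses the hash code to narrow the search, while the list compares element by element.

```csharp
using System.Collections.Generic;

// Hypothetical value-like class with consistent Equals and GetHashCode.
public sealed class Point
{
    public readonly int X;
    public readonly int Y;

    public Point(int x, int y) { X = x; Y = y; }

    public override bool Equals(object obj)
    {
        var other = obj as Point;
        return other != null && other.X == X && other.Y == Y;
    }

    // Equal points must produce equal hash codes.
    public override int GetHashCode()
    {
        return unchecked((X * 397) ^ Y);
    }
}

public static class ContainmentDemo
{
    public static bool InSet(HashSet<Point> set, Point p) { return set.Contains(p); }
    public static bool InList(List<Point> list, Point p) { return list.Contains(p); }
}
```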
A Dictionary, or HashSet will use more memory, but provide (almost) O(1) seek time.
You might want to look at HashSet, which is a collection of unique objects (uniqueness is decided by the objects' Equals and GetHashCode, or by an IEqualityComparer<T> you supply).
You mention using List<T>, which implies that ordering may be important. If this is the case then you may also want to look into the SortedSet<T> type as well.
I need to know if two references from completely different parts of the program refer to the same object.
I cannot compare the references programmatically because they are from different contexts (one reference is not visible from the other and vice versa).
So I want to print a unique identifier for each object using Console.WriteLine(). But the ToString() method doesn't return a "unique" identifier, it just returns "classname".
Is it possible to print unique identifier in C# (like in Java)?
The closest you can easily get (which won't be affected by the GC moving objects around etc) is probably RuntimeHelpers.GetHashCode(Object). This gives the hash code which would be returned by calling Object.GetHashCode() non-virtually on the object. This is still not a unique identifier though. It's probably good enough for diagnostic purposes, but you shouldn't rely on it for production comparisons.
EDIT: If this is just for diagnostics, you could add a sort of "canonicalizing ID generator" which was just a List<object>... when you ask for an object's "ID" you'd check whether it already existed in the list (by comparing references) and then add it to the end if it didn't. The ID would be the index into the list. Of course, doing this without introducing a memory leak would involve weak references etc, but as a simple hack this might work for you.
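The list-based ID generator described above might be sketched like this; DebugIdGenerator is a hypothetical name. As noted, it keeps every object alive and does a linear scan, so it is a diagnostic hack rather than a production mechanism.

```csharp
using System.Collections.Generic;

public sealed class DebugIdGenerator
{
    // Strong references: every object passed in stays alive (the leak noted above).
    private readonly List<object> seen = new List<object>();

    // Returns the index of obj in the list, adding it first if it is new.
    public int GetId(object obj)
    {
        for (int i = 0; i < seen.Count; i++)
        {
            if (object.ReferenceEquals(seen[i], obj)) return i;
        }
        seen.Add(obj);
        return seen.Count - 1;
    }
}
```

Two references that yield the same ID from the same generator instance point to the same object.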
one reference is not visible from another and vice versa
I don't buy that. If you couldn't even get the handles, how would you get their ID's?
In C# you can always get handles to objects, and you can always compare them. Even if you have to use reflection to do it.
If you need to know whether two references point to the same object, I'll just quote this:
By default, the operator == tests for reference equality. This is done by determining if two references indicate the same object. Therefore reference types do not need to implement operator == in order to gain this functionality.
So, == operator will do the trick without doing the Id workaround.
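One caveat worth a sketch: a type can overload ==, as string does, in which case == compares values rather than references. object.ReferenceEquals always compares references. The helper name SameObject is an assumption for the example.

```csharp
using System;

public static class ReferenceCheckDemo
{
    // True only when both references point to the very same object.
    public static bool SameObject(object first, object second)
    {
        return object.ReferenceEquals(first, second);
    }
}
```

For two separately constructed strings with the same characters, == reports equality while SameObject reports that they are different objects.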
I presume you're calling ToString on your object reference, but I'm not entirely clear on this or on the situation you explained, TBH, so just bear with me.
Does the type expose an ID property? If so, try this:
var idAsString = yourObjectInstance.ID.ToString();
Or, print directly:
Console.WriteLine(yourObjectInstance.ID);
EDIT:
I see Jon has seen right through this problem, which makes my answer look rather naive. Regardless, I'm leaving it in, if for nothing else than to emphasise the lack of clarity of the question, and maybe to provide an avenue to go down based on Jon's statement that 'This [GetHashCode] is still not a unique identifier', should you decide to expose your own uniqueness by way of an identifier.