Why are sets in many programming languages not really sets? - c#

In several programming languages there are Set collections which are supposed to be implementations of the mathematical concept of a finite set.
However, this is not necessarily true. For example, in C# and Java, both implementations of HashSet<T> allow you to add a HashSet<T> collection as a member of itself, which by the modern definition of a mathematical set is not allowed.
Background:
According to naive set theory, the definition of a set is:
A set is a collection of distinct objects.
However, this definition famously led to Russell's Paradox, as well as other paradoxes. For convenience, Russell's Paradox is:
Let R be the set of all sets that are not members of themselves. If R
is not a member of itself, then its definition dictates that it must
contain itself, and if it contains itself, then it contradicts its own
definition as the set of all sets that are not members of themselves.
So according to modern set theory (See: ZFC), the definition of a set is:
A set is a collection of distinct objects, none of which is the set
itself.
Specifically, this is a result of the axiom of regularity.
Well so what? What are the implications of this? Why is this question on StackOverflow?
One of the implications of Russel's Paradox is that not all collections are sets. Additionally, this was the point where mathematicians dropped the definition of a set as being the usual English definition. So I believe this question has a lot of weight when it comes to programming language design in general.
Question(s):
So why would programming languages, which in some form use these principles in their very design, just ignore them in the implementation of the Set in their standard libraries?
Secondly, is this a common occurrence with other implementations of mathematical concepts?
Perhaps I'm being a bit nitpicky, but if these are to be true implementations of sets, then why would part of the very definition be ignored?
Update
Addition of C# and Java code snippets exemplifying behavior:
Java Snippet:
Set<Object> hashSet = new HashSet<Object>();
hashSet.add(1);
hashSet.add("Tiger");
hashSet.add(hashSet);
hashSet.add('f');
Object[] array = hashSet.toArray();
HashSet<Object> hash = (HashSet<Object>)array[3];
System.out.println("HashSet in HashSet:");
for (Object obj : hash)
    System.out.println(obj);
System.out.println("\nPrinciple HashSet:");
for (Object obj : hashSet)
    System.out.println(obj);
Which prints out:
HashSet in HashSet:
f
1
Tiger
[f, 1, Tiger, (this Collection)]
Principle HashSet:
f
1
Tiger
[f, 1, Tiger, (this Collection)]
C# Snippet:
HashSet<object> hashSet = new HashSet<object>();
hashSet.Add(1);
hashSet.Add("Tiger");
hashSet.Add(hashSet);
hashSet.Add('f');
object[] array = hashSet.ToArray();
var hash = (HashSet<object>)array[2];
Console.WriteLine("HashSet in HashSet:");
foreach (object obj in hash)
    Console.WriteLine(obj);
Console.WriteLine("\nPrinciple HashSet:");
foreach (object obj in hashSet)
    Console.WriteLine(obj);
Which prints out:
HashSet in HashSet:
1
Tiger
System.Collections.Generic.HashSet`1[System.Object]
f
Principle HashSet:
1
Tiger
System.Collections.Generic.HashSet`1[System.Object]
f
Update 2
In regards to Martijn Courteaux's second point which was that it could be done in the name of computational efficiency:
I made two test collections in C#. They were identical, except that in the Add method of one of them I added the following check: if (this != obj), where obj is the item being added to the collection.
I timed both of them separately adding 100,000 random integers:
With check: ~28 milliseconds
Without check: ~21 milliseconds
That is a fairly significant performance difference.
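For reference, here is a rough sketch of the kind of guarded collection I timed; the names and the Stopwatch loop are illustrative, not the exact benchmark code:
using System;
using System.Collections.Generic;
using System.Diagnostics;

// A set wrapper whose Add carries the extra self-membership guard being timed.
public class CheckedSet<T>
{
    private readonly HashSet<T> _items = new HashSet<T>();

    public bool Add(T item)
    {
        if (ReferenceEquals(this, item))   // the extra check
            return false;
        return _items.Add(item);
    }
}

public static class Benchmark
{
    public static void Main()
    {
        var rng = new Random();
        var set = new CheckedSet<object>();
        var sw = Stopwatch.StartNew();
        for (int i = 0; i < 100000; i++)
            set.Add(rng.Next());           // each boxed int goes through the guard
        sw.Stop();
        Console.WriteLine(sw.ElapsedMilliseconds + " ms");
    }
}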

Programming language sets really aren't like ZFC sets, but for quite different reasons than you suppose:
You can't form a set by comprehension (i.e. set of all objects such that ...). Note that this already blocks all (I believe) naive set theory paradoxes, so they are irrelevant.
They usually can't be infinite.
There exist objects which aren't sets (in ZFC there are only sets).
They are usually mutable (i.e. you can add/remove elements to/from the set).
The objects they contain can be mutable.
So the answer to
So why would programming languages, which in some form use these principles in their very design, just ignore them in the implementation of the Set in their standard libraries?
is that the languages don't use these principles.

I can't speak for C#, but as far as Java is concerned a set is a set. If you look at the javadoc for the Set interface, you will see (emphasis mine):
Note: Great care must be exercised if mutable objects are used as set elements. The behavior of a set is not specified if the value of an object is changed in a manner that affects equals comparisons while the object is an element in the set. A special case of this prohibition is that it is not permissible for a set to contain itself as an element.
Whether the prohibition is actively enforced is unclear (adding a HashSet to itself does not throw any exceptions for example), but at least it is clearly documented that you should not try because the behaviour would then be unspecified.

Well, I think this is because of a couple of reasons:
For very specific programming purposes, you might want to create a set that contains itself. When you do this, you don't really care what a set means mathematically; you just want to enjoy the functionality a Set offers: adding elements to the set without creating duplicate entries. (To be honest, I can't think of a situation where you would want to do that.)
For performance purposes. The chance that you want to use a Set and let it contain itself is very, very rare. So it would be a waste of computing power, energy, time, performance and environmental health to check, each time you try to add an element, whether it is the set itself.

In Java, the Set is a mathematical set: you can insert objects, but each object can only be in the set once. From the Javadoc for Set:
"A collection that contains no duplicate elements. More formally, sets contain no pair of elements e1 and e2 such that e1.equals(e2), and at most one null element. As implied by its name, this interface models the mathematical set abstraction."
Source: http://docs.oracle.com/javase/7/docs/api/

Related

Why is ValueType.GetHashCode() implemented like it is?

From ValueType.cs
**Action: Our algorithm for returning the hashcode is a little bit complex. We look
** for the first non-static field and get it's hashcode. If the type has no
** non-static fields, we return the hashcode of the type. We can't take the
** hashcode of a static member because if that member is of the same type as
** the original type, we'll end up in an infinite loop.
I got bitten by this today when I was using a KeyValuePair as a key in a Dictionary (it stored an XML attribute name (enum) and its value (string)), and expected its hash code to be computed based on all its fields, but according to the implementation it only considered the Key part.
Example (c/p from Linqpad):
void Main()
{
var kvp1 = new KeyValuePair<string, string>("foo", "bar");
var kvp2 = new KeyValuePair<string, string>("foo", "baz");
// true
(kvp1.GetHashCode() == kvp2.GetHashCode()).Dump();
}
"The first non-static field", I guess, means the first field in declaration order, which could also cause trouble if you change the order of the variables in the source for whatever reason, believing it doesn't change the code semantically.
The actual implementation of ValueType.GetHashCode() doesn't quite match the comment. It has two versions of the algorithm, fast and slow. It first checks if the struct contains any members of a reference type and if there is any padding between the fields. Padding is empty space in a structure value, created when the JIT compiler aligns the fields. There's padding in a struct that contains bool and int (3 bytes) but no padding when it contains int and int, they fit snugly together.
Without a reference and without padding, it can do the fast version since every bit in the structure value is a bit that belongs to a field value. It simply xors 4 bytes at a time. You'll get a 'good' hash code that considers all the members. Many simple structure types in the .NET framework behave this way, like Point and Size.
Failing that test, it does the slow version, the moral equivalent of reflection. That's what you get, your KeyValuePair<> contains references. And this one only checks the first candidate field, like the comment says. This is surely a perf optimization, avoiding burning too much time.
Yes, nasty detail and not that widely known. It is usually discovered when somebody notices that their collection code sucks mud.
One more excruciating detail: the fast version has a bug that bites when the structure contains a field of type decimal. The values 12m and 12.0m are logically equal but they don't have the same bit pattern. GetHashCode() will say that they are not equal. Ouch.
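A hedged illustration of that decimal wrinkle (DecimalBox is a made-up wrapper; the behaviour described applies to the fast bitwise path on older .NET Framework runtimes and has since been changed):
using System;

struct DecimalBox
{
    public decimal Value; // no reference fields, so the fast bitwise path can apply
}

class Demo
{
    static void Main()
    {
        // decimal itself normalizes, so its own hash codes agree:
        Console.WriteLine(12m.GetHashCode() == 12.0m.GetHashCode());   // True

        var a = new DecimalBox { Value = 12m };
        var b = new DecimalBox { Value = 12.0m };

        // The default ValueType.GetHashCode may XOR the raw bits instead of
        // calling decimal.GetHashCode, so logically equal values can hash differently.
        Console.WriteLine(a.GetHashCode() == b.GetHashCode());          // False on affected runtimes
    }
}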
UPDATE: This answer was (in part) the basis of a blog article I wrote which goes into more details about the design characteristics of GetHashcode. Thanks for the interesting question!
I didn't implement it and I haven't talked to the people who did. But I can point out a few things.
(Before I go on, note that here I am specifically talking about hash codes for the purposes of balancing hash tables where the contents of the table are chosen by non-hostile users. The problems of hash codes for digital signing, redundancy checking, or ensuring good performance of a hash table when some of the users are mounting denial-of-service attacks against the table provider are beyond the scope of this discussion.)
First, as Jon correctly notes, the given algorithm does implement the required contract of GetHashCode. It might be sub-optimal for your purposes, but it is legal. All that is required is that things that compare equal have equal hash codes.
So what are the "nice to haves" in addition to that contract? A good hash code implementation should be:
1) Fast. Very fast! Remember, the whole point of the hash code in the first place is to rapidly find a relatively empty slot in a hash table. If the O(1) computation of the hash code is in practice slower than the O(n) time taken to do the lookup naively then the hash code solution is a net loss.
2) Well distributed across the space of 32 bit integers for the given distribution of inputs. The worse the distribution across the ints, the more like a naive linear lookup the hash table is going to be.
So, how would you make a hash algorithm for arbitrary value types given those two conflicting goals? Any time you spend on a complex hash algorithm that guarantees good distribution is time poorly spent.
A common suggestion is "hash all of the fields and then XOR together the resulting hash codes". But that is begging the question; XORing two 32 bit ints only gives good distribution when the inputs themselves are extremely well-distributed and not related to each other, and that is an unlikely scenario:
// (Updated example based on good comment!)
struct Control
{
    string name;
    int x;
    int y;
}
What is the likelihood that x and y are well-distributed over the entire range of 32 bit integers? Very low. Odds are much better that they are both small and close to each other, in which case XORing their hash codes together makes things worse, not better: XORing together integers that are close to each other zeroes out most of the bits.
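To make that concrete, here is a tiny illustration (relying on the fact that int.GetHashCode() returns the value itself in .NET):
int x = 3, y = 5;
int h1 = x.GetHashCode() ^ y.GetHashCode(); // 6
int h2 = y.GetHashCode() ^ x.GetHashCode(); // 6 -- (x:3, y:5) and (x:5, y:3) collide
int h3 = 7 ^ 7;                             // 0 -- every (n, n) pair collapses to zero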
Furthermore, this is O(n) in the number of fields! A value type with a lot of small fields would take a comparatively long time to compute the hash code.
Basically the situation we're in here is that the user didn't provide a hash code implementation themselves; either they don't care, or they don't expect this type to ever be used as a key in a hash table. Given that you have no semantic information whatsoever about the type, what's the best thing to do? The best thing to do is whatever is fast and gives good results most of the time.
Most of the time, two struct instances that differ will differ in most of their fields, not just one of their fields, so just picking one of them and hoping that it's the one that differs seems reasonable.
Most of the time, two struct instances that differ will have some redundancy in their fields, so combining the hash values of many fields together is likely to decrease, not increase, the entropy in the hash value, even as it consumes the time that the hash algorithm is designed to save.
Compare this with the design of anonymous types in C#. With anonymous types we do know that it is highly likely that the type is being used as a key to a table. We do know that it is highly likely that there will be redundancy across instances of anonymous types (because they are results of a cartesian product or other join). And therefore we do combine the hash codes of all of the fields into one hash code. If that gives you bad performance due to the excess number of hash codes being computed, you are free to use a custom nominal type rather than the anonymous type.
It should still obey the contract of GetHashCode even if the field order changes: equal values will have equal hash codes, within the lifetime of that process.
In particular:
Non-equal values don't have to have non-equal hash codes
Hash codes don't have to be consistent across processes (you can change an implementation, rebuild, and everything should still work - you shouldn't be persisting hash codes, basically)
Now I'm not saying that ValueType's implementation is a great idea - it will cause performance suckage in various ways... but I don't think it's actually broken.
Well, there are pros and cons to any implementation of GetHashCode(). These are of course the things we weigh up when implementing our own, but in the case of ValueType.GetHashCode() there is a particular difficulty in that they don't have much information on what the actual details of the concrete type will be. Of course, this often happens to us when we are creating an abstract class or one intended to be the base of classes that will add a lot more in terms of state, but in those cases we have an obvious solution of just using the default implementation of object.GetHashCode() unless a derived class cares to override it there.
With ValueType.GetHashCode() they don't have this luxury, as the primary difference between a value type and a reference type is, despite the popularity of talking about implementation details of stack vs. heap, the fact that for a value type equivalence relates to value, while for an object type equivalence relates to identity (even when an object defines a different form of equivalence by overriding Equals() and GetHashCode(), the concept of reference equality still exists and is still useful).
So, for the Equals() method the implementation is obvious; check the two objects are the same type, and if it is then check also that all fields are equal (actually there's an optimisation which does a bitwise comparison in some cases, but that's an optimisation on the same basic idea).
What to do for GetHashCode()? There simply is no perfect solution. One thing they could do is some sort of multiply-then-add or shift-then-xor on every field. That would likely give a pretty good hash code, but could be expensive if there were a lot of fields (never mind that it's not recommended to have value types with lots of fields, the implementer has to consider that they still can, and indeed there might even be times when it makes sense, though I honestly can't imagine a time when it both makes sense and also makes sense to hash it). If they knew some fields were rarely different between instances, they could ignore those fields and still have a pretty good hash code, while also being quite fast. Finally, they can ignore most fields and hope that the one(s) they don't ignore vary in value most of the time. They went for the most extreme version of the latter.
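For illustration, a minimal sketch (not the BCL's implementation) of the multiply-then-add approach over every field, for a made-up three-field struct:
struct Point3
{
    public int X, Y, Z;

    public override int GetHashCode()
    {
        unchecked
        {
            int hash = 17;
            hash = hash * 31 + X;   // mix each field in turn
            hash = hash * 31 + Y;
            hash = hash * 31 + Z;
            return hash;
        }
    }

    public override bool Equals(object obj)
    {
        if (!(obj is Point3)) return false;
        var other = (Point3)obj;
        return other.X == X && other.Y == Y && other.Z == Z;
    }
}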
(The matter of what is done when there are no instance fields is another matter, and a pretty good choice: such value types are equal to all other instances of the same type, and they have a hash code that matches that.)
So, it's an implementation that sucks if you are hashing lots of values where the first field is the same (or otherwise returns the same hashcode), but other implementations would suck in other cases (Mono goes for xoring all the fields' hashcodes together, better in your case, worse in others).
The matter of changing field order doesn't matter, as hashcode is quite clearly stated as only remaining valid for the lifetime of a process and not being suitable for most cases where they could be persisted beyond that (can be useful in some caching situations where it doesn't hurt if things aren't found correctly after a code change).
So, not great, but nothing would be perfect. It goes to show that one must always consider both sides of what "equality" means when using an object as a key. It's easily fixed in your case with:
public class KVPCmp<TKey, TValue> : IEqualityComparer<KeyValuePair<TKey, TValue>>, IEqualityComparer
{
    bool IEqualityComparer.Equals(object x, object y)
    {
        if (x == null)
            return y == null;
        if (y == null)
            return false;
        if (!(x is KeyValuePair<TKey, TValue>) || !(y is KeyValuePair<TKey, TValue>))
            throw new ArgumentException("Comparison of KeyValuePairs only.");
        return Equals((KeyValuePair<TKey, TValue>)x, (KeyValuePair<TKey, TValue>)y);
    }
    public bool Equals(KeyValuePair<TKey, TValue> x, KeyValuePair<TKey, TValue> y)
    {
        return x.Key.Equals(y.Key) && x.Value.Equals(y.Value);
    }
    public int GetHashCode(KeyValuePair<TKey, TValue> obj)
    {
        int keyHash = obj.Key.GetHashCode(); // hash the Key, rotate it, then mix in the Value's hash
        return ((keyHash << 16) | (keyHash >> 16)) ^ obj.Value.GetHashCode();
    }
    public int GetHashCode(object obj)
    {
        if (obj == null)
            return 0;
        if (!(obj is KeyValuePair<TKey, TValue>))
            throw new ArgumentException();
        return GetHashCode((KeyValuePair<TKey, TValue>)obj);
    }
}
Use this as your comparator when creating your dictionary, and all should be well (you only need the generic comparator methods really, but leaving the rest in does no harm and can be useful to have sometimes).
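For example, you would wire it up like this (the key/value strings are illustrative only):
var lookup = new Dictionary<KeyValuePair<string, string>, int>(new KVPCmp<string, string>());
lookup[new KeyValuePair<string, string>("foo", "bar")] = 1;
lookup[new KeyValuePair<string, string>("foo", "baz")] = 2; // hashes now differ because Value is included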
Thank you all for the very, very informative answers. I knew that there had to be some rationale behind that decision, but I wish it were documented better. I'm not able to use v4 of the framework so there is no Tuple<>, and that was the main reason I decided to piggyback on the KeyValuePair struct. But I guess there is no cutting corners and I'll have to roll my own. Once more, thank you all.

What requirement was the tuple designed to solve?

I'm looking at the new C# feature of tuples. I'm curious, what problem was the tuple designed to solve?
What have you used tuples for in your apps?
Update
Thanks for the answers thus far, let me see if I have things straight in my mind.
A good example of a tuple has been pointed out as coordinates. Does this look right?
var coords = Tuple.Create(geoLat,geoLong);
Then use the tuple like so:
var myLatlng = "new google.maps.LatLng(" + coords.Item1 + ", " + coords.Item2 + ");";
Is that correct?
When writing programs it is extremely common to want to logically group together a set of values which do not have sufficient commonality to justify making a class.
Many programming languages allow you to logically group together a set of otherwise unrelated values without creating a type in only one way:
void M(int foo, string bar, double blah)
Logically this is exactly the same as a method M that takes one argument which is a 3-tuple of int, string, double. But I hope you would not actually make:
class MArguments
{
public int Foo { get; private set; }
... etc
unless MArguments had some other meaning in the business logic.
The concept of "group together a bunch of otherwise unrelated data in some structure that is more lightweight than a class" is useful in many, many places, not just for formal parameter lists of methods. It's useful when a method has two things to return, or when you want to key a dictionary off of two data rather than one, and so on.
Languages like F# which support tuple types natively provide a great deal of flexibility to their users; they are an extremely useful set of data types. The BCL team decided to work with the F# team to standardize on one tuple type for the framework so that every language could benefit from them.
However, there is at this point no language support for tuples in C#. Tuples are just another data type like any other framework class; there's nothing special about them. We are considering adding better support for tuples in hypothetical future versions of C#. If anyone has any thoughts on what sort of features involving tuples you'd like to see, I'd be happy to pass them along to the design team. Realistic scenarios are more convincing than theoretical musings.
Tuples provide an immutable implementation of a collection
Aside from the common uses of tuples:
to group common values together without having to create a class
to return multiple values from a function/method
etc...
Immutable objects are inherently thread safe:
Immutable objects can be useful in multi-threaded applications. Multiple threads can act on data represented by immutable objects without concern of the data being changed by other threads. Immutable objects are therefore considered to be more thread-safe than mutable objects.
From "Immutable Object" on wikipedia
It provides an alternative to ref or out if you have a method that needs to return multiple new objects as part of its response.
It also allows you to use a built-in type as a return type if all you need to do is mash-up two or three existing types, and you don't want to have to add a class/struct just for this combination. (Ever wish a function could return an anonymous type? This is a partial answer to that situation.)
It's often helpful to have a "pair" type, just used in quick situations (like returning two values from a method). Tuples are a central part of functional languages like F#, and C# picked them up along the way.
very useful for returning two values from a function
Personally, I find Tuples to be an iterative part of development when you're in an investigative cycle, or just "playing". Because a Tuple is generic, I tend to think of it when working with generic parameters - especially when wanting to develop a generic piece of code, and I'm starting at the code end, instead of asking myself "how would I like this call to look?".
Quite often I realise that the collection that the Tuple forms becomes part of a list, and staring at a List<Tuple<...>> doesn't really express the intention of the list, or how it works. I often "live" with it, but find myself wanting to manipulate the list and change a value - at which point, I don't necessarily want to create a new Tuple for that, so I need to create my own class or struct to hold it, so I can add manipulation code.
Of course, there's always extension methods - but quite often you don't want to extend that extra code to generic implementations.
There have been times I've wanted to express data as a Tuple and not had Tuples available (VS2008), in which case I've just created my own Tuple class - and I don't make it thread safe (immutable).
So I guess I'm of the opinion that Tuples are lazy programming at the expense of losing a type name that describes its purpose. The other expense is that you have to declare the signature of the Tuple wherever it's used as a parameter. After a number of methods begin to look bloated, you may feel as I do that it is worth making a class, as it cleans up the method signatures.
I tend to start by having the class as a public member of the class you're already working in. But the moment it extends beyond simply a collection of values, it gets its own file, and I move it out of the containing class.
So in retrospect, I believe I use Tuples when I don't want to go off and write a class, and just want to think about what I'm writing right now. Which means the signature of the Tuple may change quite a lot in the next half hour while I figure out what data I am going to need for this method, and how it returns whatever values it will return.
If I get a chance to refactor code, then often I'll question a Tuple's place in it.
This is an old question from 2010, and by 2017 .NET has changed and become smarter.
C# 7 introduces language support for tuples, which enables semantic names for the fields of a tuple using new, more efficient tuple types.
In VS 2017 and .NET 4.7 (or by installing the NuGet package System.ValueTuple), you can create and use a tuple in a very efficient and simple way:
var person = (Id:"123", Name:"john"); //create tuble with two items
Console.WriteLine($"{person.Id} name:{person.Name}") //access its fields
Returning more than one value from a method:
public (double sum, double average) ComputeSumAndAverage(List<double> list)
{
var sum= list.Sum();
var average = sum/list.Count;
return (sum, average);
}
How to use:
var list=new List<double>{1,2,3};
var result = ComputeSumAndAverage(list);
Console.WriteLine($"Sum={result.sum} Average={result.average}");
For more details read: https://learn.microsoft.com/en-us/dotnet/csharp/tuples
A Tuple is often used to return multiple values from functions when you don’t want to create a specific type. If you're familiar with Python, Python has had this for a long time.
Returning more than one value from a function. getCoordinates() isn't very useful if it just returns x or y or z, but making a full class and object to hold three ints also seems pretty heavyweight.
A common use might be to avoid creating classes/structs that only contain two fields; instead you create a Tuple (or a KeyValuePair for now).
Useful as a return value; avoids passing N out params...
I find KeyValuePair refreshing in C# for iterating over the key/value pairs in a Dictionary.
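That is, enumerating a Dictionary yields KeyValuePair items directly (the values here are made up):
var ages = new Dictionary<string, int> { { "Alice", 30 }, { "Bob", 25 } };
foreach (KeyValuePair<string, int> pair in ages)
    Console.WriteLine(pair.Key + " is " + pair.Value);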
It's really helpful when returning values from functions. We can get multiple values back, and this is quite a saver in some scenarios.
I stumbled upon this performance benchmark between Tuples and KeyValuePairs and you will probably find it interesting. In summary, it says that Tuple has an advantage because it is a class, therefore it is stored on the heap and not on the stack, and when passed around as an argument only its reference goes along. But KeyValuePair is a struct, so it is faster to allocate but slower when used.
http://www.dotnetperls.com/tuple-keyvaluepair

Is it reliable to compare two instances of a class by comparing their serialized byte arrays?

Given two instances of a class, is it a good and reliable practice to compare them by serializing them first and then comparing the byte arrays (or possibly hashes of the arrays)?
These objects might have complex hierarchical properties but serialization should go as deep as required.
By comparison I mean the process of making sure that all properties of primitive types have equal values, properties of complex types have equal properties of primitive types, etc. As for collection properties, they should be equal to each other: equal elements, same positions:
{'a','b','c'} != {'a','c','b'}
{new Customer{Id=2, Name="abc"}, new Customer {Id=3, Name="def"}}
!=
{new Customer{Id=3, Name="def"}, new Customer {Id=2, Name="abc"}}
but
{new Customer{Id=2, Name="abc"}, new Customer {Id=3, Name="def"}}
==
{new Customer{Id=2, Name="abc"}, new Customer {Id=3, Name="def"}}
And by serialization I mean standard .NET binary formatter.
Thanks.
You are asking for a guarantee that the serialized representation will match. That's going to be awfully hard to come by; BinaryFormatter is a complicated class. In particular, serialized structures that have alignment padding could be a potential problem.
What's much simpler is to provide an example where it won't match. System.Decimal has different byte patterns for values like 0.01M and 0.010M. Its operator==() will say they are equal, its serialized byte[] won't.
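To see this without even involving BinaryFormatter, here is a small sketch using decimal.GetBits to expose the underlying representation:
decimal a = 0.01M, b = 0.010M;
Console.WriteLine(a == b);                               // True
Console.WriteLine(string.Join(",", decimal.GetBits(a))); // 1,0,0,131072   (scale 2)
Console.WriteLine(string.Join(",", decimal.GetBits(b))); // 10,0,0,196608  (scale 3)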
You would have to define what equal means here even more precisely.
If one of the properties is a collection, there could be differences in the order (as a result of particular Add/Remove sequences) that may or may not be significant to you. Think about a Dictionary where the same elements were added in a different order. A collision might result in a different binary stream.
It's reliable if:
Every single class in the graph is marked [Serializable]. This isn't as straightforward as it might sound; if you're comparing totally arbitrary objects, then there's a pretty good chance that there's something non-serializable in there that you lack control over.
You want to know if the two instances are exactly the same. Keep in mind that BinaryFormatter is basically diving into the internal state of these objects, so even if they appear to be the same by virtue of public properties, they might not be. If you know for a fact that the graph is created the exact same way in each instance, maybe you don't care about this; but if not, there may be a number of hidden differences between the graphs.
This point is also a more serious wrinkle than one might first suspect. What if you decide to swap out the class with an interface? You may have two interfaces that, as far as you know, are exactly the same; they accomplish the same task and provide the same data. But they might be totally different implementations. What's nice about IEquatable is that it's independent of the concrete type.
So this will work, for a significant number of cases, but I probably wouldn't consider it a good practice, at least not without knowing all the details about the specific context that it's being used in. At the very least, I would not rely on this as a generic comparison method for any two instances; it should be used only in specific cases where you know about the implementation details of the classes involved.
Of course, some might say that writing any code that depends upon the implementation details of a class is always a bad practice. My take on this is more moderated, but it's something to consider - relying on the implementation details of a class can lead to difficult-to-maintain code later on.

Should updating an element in List<T> by index be quicker than using ArrayList index?

I am running through some tests about using ArrayList and List<T>.
Speed is very important in my app.
I have tested creating 10,000 records in each, finding an item by index and then updating that object, for example:
List[i] = newX;
Using the ArrayList seems much faster. Is that correct?
UPDATE:
Using the List[i] approach: for my List<T> I am finding the index with a predicate before updating by index, e.g.:
....
int index = base.FindIndex(x => x.AlpaNumericString == "PopItem");
base[index] = UpdatedItem;
It is definitely slower than
ArrayList.IndexOf("PopItem"))
base[index] = UpdatedItem;
A generic List (List<T>) should always be quicker than an ArrayList.
Firstly, an ArrayList is not strongly-typed and accepts types of object, so if you're storing value types in the ArrayList, they are going to be boxed and unboxed every time they are added or accessed.
A Generic List can be defined to accept only (say) int's so therefore no boxing or unboxing needs to occur when adding/accessing elements of the list.
If you're dealing with reference types, you're probably still better off with a Generic List over an ArrayList, since although there's no boxing/unboxing going on, your Generic List is type-safe, and there will be no implicit (or explicit) casts required when retrieving your strongly-typed object from the ArrayList's "collection" of object types.
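A small sketch of the boxing difference described above (assuming the System.Collections and System.Collections.Generic namespaces are imported):
var arrayList = new ArrayList();
arrayList.Add(42);                     // 42 is boxed to object on the way in
int fromArrayList = (int)arrayList[0]; // cast + unbox on the way out

var genericList = new List<int>();
genericList.Add(42);                   // stored as an int, no boxing
int fromList = genericList[0];         // no cast, no unboxing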
There may be some edge-cases where an ArrayList is faster performing than a Generic List, however, I (personally) have not yet come across one. Even the MSDN documentation states:
Performance Considerations
In deciding whether to use the List<T> or ArrayList class, both of which have similar functionality, remember that the List<T> class performs better in most cases and is type safe. If a reference type is used for type T of the List<T> class, the behavior of the two classes is identical. However, if a value type is used for type T, you need to consider implementation and boxing issues.
If a value type is used for type T, the compiler generates an implementation of the List<T> class specifically for that value type. That means a list element of a List<T> object does not have to be boxed before the element can be used, and after about 500 list elements are created the memory saved not boxing list elements is greater than the memory used to generate the class implementation.
Make certain the value type used for type T implements the IEquatable<T> generic interface. If not, methods such as Contains must call the Object.Equals(Object) method, which boxes the affected list element. If the value type implements the IComparable interface and you own the source code, also implement the IComparable<T> generic interface to prevent the BinarySearch and Sort methods from boxing list elements. If you do not own the source code, pass an IComparer<T> object to the BinarySearch and Sort methods.
Moreover, I particularly like the very last section of that paragraph, which states:
It is to your advantage to use the type-specific implementation of the List<T> class instead of using the ArrayList class or writing a strongly typed wrapper collection yourself. The reason is your implementation must do what the .NET Framework does for you already, and the common language runtime can share Microsoft intermediate language code and metadata, which your implementation cannot.
Touché! :)
Based on your recent edit it seems as though you're not performing a 1:1 comparison here. In the List you have a class object and you're looking for the index based on a property, whereas in the ArrayList you just store the values of that property. If so, this is a severely flawed comparison.
To make it a 1:1 comparison you would add the values to the list only, not the class. Or, you would add the class items to the ArrayList. The former would allow you to use IndexOf on both collections. The latter would entail looping through your entire ArrayList and comparing each item till a match was found (and you could do the same for the List), or overriding object.Equals since ArrayList uses that for comparison.
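In other words, a like-for-like version of the test might look something like this (the values are illustrative):
var list = new List<string> { "A", "B", "PopItem" };
var arrayList = new ArrayList { "A", "B", "PopItem" };

int i1 = list.IndexOf("PopItem");      // generic lookup, no casting
int i2 = arrayList.IndexOf("PopItem"); // object-based lookup via Equals

list[i1] = "UpdatedItem";
arrayList[i2] = "UpdatedItem";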
For an interesting read, I suggest taking a look at Rico Mariani's post: Performance Quiz #7 -- Generics Improvements and Costs -- Solution. Even in that post Rico also emphasizes the need to benchmark different scenarios. No blanket statement is issued about ArrayLists, although the general consensus is to use generic lists for performance, type safety, and having a strongly typed collection.
Another related article is: Why should I use List and not ArrayList.
ArrayList seems faster? According to the documentation ( http://msdn.microsoft.com/en-us/library/6sh2ey19.aspx ) List should be faster when using a value type, and the same speed when using a reference type. ArrayList is slower with value types because it needs to box/unbox the values when you're accessing them.
I would expect them to be about the same if they are value-types. There is an extra cast/type-check for ArrayList, but nothing huge. Of course, List<T> should be preferred. If speed is the primary concern (which it almost always isn't, at least not in this way), then you might also want to profile an array (T[]) - harder (=more expensive) to add/remove, of course - but if you are just querying/assigning by index, it should be the fastest. I have had to resort to arrays for some very localised performance critical work, but 99.95% of the time this is overkill and should be avoided.
For example, for any of the 3 approaches (List<T>/ArrayList/T[]) I would expect the assignment cost to be insignificant to the cost of newing up the new instance to put into the storage.
Marc Gravell touched on this in his answer - I think it needs to be stressed.
It is usually a waste of time to prematurely optimize your code!
A better approach is to do a simple, well designed first implementation, and test it with anticipated real world data loads.
Often, you will find that it's "fast enough". (It helps to start out with a clear definition of "fast enough" - e.g. "Must be able to find a single CD in a 10,000 CD collection in 3 seconds or less")
If it's not, put a profiler on it. Almost invariably, the bottle neck will NOT be where you expect.
(I learned this the hard way when I brought a whole app to its knees with a single badly chosen string concatenation.)

Why don't Java, C# and C++ have ranges?

Ada, Pascal and many other languages support ranges, a way to subtype integers.
A range is an integer subtype restricted to values from one bound (first) to another (last).
It's easy to implement a class that does the same in OOP, but I think that supporting the feature natively could let the compiler do additional static checks.
I know that it's impossible to verify statically that a variable defined in a range is not going to "overflow" at runtime, e.g. due to bad input, but I think that something could be done.
I think about the Design by Contract approach (Eiffel) and the Spec# ( C# Contracts ), that give a more general solution.
Is there a simpler solution that checks, at least, static out-of-bound assignment at compile time in C++, C# and Java? Some kind of static-assert?
edit: I understand that "ranges" can be used for different purpose:
iterators
enumerators
integer subtype
I would focus on the latter, because the former two are easily mapped onto C-family languages.
I think about a closed set of values, something like the music volume, i.e. a range that goes from 1 up to 100. I would like to increment or decrement it by a value. I would like to have a compile error in case of static overflow, something like:
volume = rangeInt(0, 100);
volume = 101;               // compile error!
volume = getIntFromInput(); // possible runtime exception
Thanks.
Subrange types are not actually very useful in practice. We do not often allocate fixed length arrays, and there is also no reason for fixed sized integers. Usually where we do see fixed sized arrays they are acting as an enumeration, and we have a better (although "heavier") solution to that.
Subrange types also complicate the type system. It would be much more useful to bring in constraints between variables than to fixed constants.
(Obligatory mention that integers should be arbitrary size in any sensible language.)
Ranges are most useful when you can do something over that range, concisely. That means closures. For Java and C++ at least, a range type would be annoying compared to an iterator because you'd need to define an inner class to define what you're going to do over that range.
Java has had an assert keyword since version 1.4. If you're doing programming by contract, you're free to use those to check proper assignment. And any mutable attribute inside an object that should fall within a certain range should be checked prior to being set. You can also throw an IllegalArgumentException.
Why no range type? My guess is that the original designers didn't see one in C++ and didn't consider it as important as the other features they were trying to get right.
For C++, a library for constrained-value variables is currently being implemented and will be proposed to the Boost libraries: http://student.agh.edu.pl/~kawulak/constrained_value/index.html
Pascal (and also Delphi) uses a subrange type but it is limited to ordinal types (integer, char and even boolean).
It is primarily an integer with extra type checking. You can fake that in another language using a class. This gives the advantage that you can apply more complex ranges.
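For instance, a minimal sketch in C# (RangedInt and its members are made up for illustration; note the checks happen at runtime, not at compile time as the question hopes):
public struct RangedInt
{
    private readonly int _min, _max, _value;

    public RangedInt(int min, int max, int value)
    {
        if (value < min || value > max)
            throw new ArgumentOutOfRangeException("value");
        _min = min; _max = max; _value = value;
    }

    public int Value { get { return _value; } }

    public RangedInt WithValue(int newValue)
    {
        return new RangedInt(_min, _max, newValue); // re-validates the bounds
    }
}

// var volume = new RangedInt(0, 100, 99);
// volume = volume.WithValue(101); // throws at runtime, not a compile error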
I would add to Tom Hawtin's response (with which I agree) that, for C++, the existence of ranges would not imply that they would be checked - if you want to be consistent with the general language behaviour - since array accesses, for instance, are not range-checked either.
For C# and Java, I believe the decision was based on performance - to check ranges would impose a burden and complicate the compiler.
Notice that ranges are mainly useful during the debugging phase - a range violation should never occur in production code (theoretically). So range checks are better implemented not inside the language itself, but in pre- and post-conditions, which can (should) be stripped out when producing the release build.
This is an old question, but just wanted to update it. Java doesn't have ranges per se, but if you really want the functionality you can use Commons Lang, which has a number of range classes including IntRange:
IntRange ir = new IntRange(1, 10);
Bizarrely, this doesn't exist in Commons Math. I kind of agree with the accepted answer in part, but I don't believe ranges are useless, particularly in test cases.
C++ allows you to implement such types through templates, and I think there are a few libraries available doing this already. However, I think in most cases, the benefit is too small to justify the added complexity and compilation speed penalty.
As for static assert, it already exists.
Boost has a BOOST_STATIC_ASSERT, and on Windows, I think Microsoft's ATL library defines a similar one.
boost::type_traits and boost::mpl are probably your best friends in implementing something like this.
The flexibility to roll your own is better than having it built into the language. What if you want saturating arithmetic for example, instead of throwing an exception for out of range values? I.e.
MyRange<0,100> volume = 99;
volume += 10; // results in volume==100
In C# you can do this:
foreach(int i in System.Linq.Enumerable.Range(0, 10))
{
// Do something
}
JSR-305 provides some support for ranges but I don't know when if ever this will be part of Java.
