Best C# data structure for random order population?

In C# I have a use case where I need a mapping from ints to collections.
The ints form a dense (but not packed) set from 1 to n, where n is not known in advance.
The cells will be loaded in random order.
The marginal cost of each cell should be as good as a List<T> or T[].
I'd like the cells to be default-filled on demand.
What is the best structure for this?
A List<T> would work well (better in space than a Dictionary<,>), and by deriving from it I can get much of what I want, but is there something better? After all, the best code is code you don't write.

A Dictionary<int, Cell> sounds like a good match to me. Or you could use a List<Cell> quite easily, and just make sure that you expand it where necessary:
public static void EnsureCount<T>(List<T> list, int count)
{
    if (list.Count >= count)
    {
        return;
    }
    if (list.Capacity < count)
    {
        // Always at least double the capacity, to reduce
        // the number of expansions required
        list.Capacity = Math.Max(list.Capacity * 2, count);
    }
    // Pad with default values up to the requested count
    list.AddRange(Enumerable.Repeat(default(T), count - list.Count));
}
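To sanity-check the approach, here is a self-contained version of the helper plus a usage example (the string cell type is just a stand-in for whatever the list holds):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class ListGrow
{
    // Pad the list with default values so that list[count - 1] is addressable.
    public static void EnsureCount<T>(List<T> list, int count)
    {
        if (list.Count >= count) return;
        if (list.Capacity < count)
            list.Capacity = Math.Max(list.Capacity * 2, count);
        list.AddRange(Enumerable.Repeat(default(T), count - list.Count));
    }

    public static void Main()
    {
        var cells = new List<string>();
        EnsureCount(cells, 8);          // indices 0..7 now exist, filled with null
        cells[7] = "first cell loaded"; // populate in random order
        Console.WriteLine(cells.Count); // 8
    }
}
```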

If you want to be a real stickler, one option is to write a facade class that provides a strategy over a number of different mappings: use a plain array when n is below some value and the load (packedness) is above some threshold, and swap to a tree or Dictionary representation when those thresholds are crossed.
You didn't state how you plan to interact with this data structure once it has been built, so depending on your access and usage patterns, a mixed strategy might give better runtime behavior than a single data representation. Think of it like quicksort implementations that switch to a simpler sort algorithm when the collection to be sorted has fewer than N elements.
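A minimal sketch of such a facade, switching from an array to a dictionary past a size threshold (the class name, threshold, and migration policy here are all illustrative, not a tuned design):

```csharp
using System.Collections.Generic;

public sealed class HybridMap<T>
{
    private const int Threshold = 256;     // illustrative cutoff
    private T[] _dense = new T[Threshold]; // used while all keys are small
    private Dictionary<int, T> _sparse;    // used once a key exceeds the threshold

    public T this[int key]
    {
        get
        {
            if (_sparse != null)
            {
                T value;
                _sparse.TryGetValue(key, out value);
                return value;
            }
            return _dense[key];
        }
        set
        {
            if (_sparse == null && key < Threshold)
            {
                _dense[key] = value;
                return;
            }
            if (_sparse == null)
            {
                // Migrate the dense portion into the dictionary once.
                _sparse = new Dictionary<int, T>();
                for (int i = 0; i < _dense.Length; i++)
                    _sparse[i] = _dense[i];
            }
            _sparse[key] = value;
        }
    }
}
```

A real implementation would also track the load factor and possibly migrate back, but the point is that callers only ever see the indexer.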
Cheers!


Push Item to the end of an array

No, I can't use generic Collections. What I am trying to do is pretty simple, actually. In PHP I would do something like this:
$foo = [];
$foo[] = 1;
What I have in C# is this
var foo = new int [10];
// yeah that's pretty much it
Now I can do something like foo[foo.Length - 1] = 1, but that obviously won't work. Another option is foo[foo.Count(x => x.HasValue)] = 1, along with a nullable int during declaration. But there has to be a simpler way around this trivial task.
This is homework and I don't want to explain to my teacher (and possibly the entire class) what foo[foo.Count(x => x.HasValue)] = 1 is and why it works etc.
The simplest way is to create a new class that holds the index of the inserted item:
public class PushPopIntArray
{
    private int[] _vals = new int[10];
    private int _nextIndex = 0;

    public void Push(int val)
    {
        if (_nextIndex >= _vals.Length)
            throw new InvalidOperationException("No more values left to push");
        _vals[_nextIndex] = val;
        _nextIndex++;
    }

    public int Pop()
    {
        if (_nextIndex <= 0)
            throw new InvalidOperationException("No more values left to pop");
        _nextIndex--;
        return _vals[_nextIndex];
    }
}
You could add overloads to get the entire array, or to index directly into it if you wanted. You could also add overloads or constructors to create different sized arrays, etc.
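Usage then looks like a plain stack (the class is repeated here so the snippet stands alone):

```csharp
using System;

public class PushPopIntArray
{
    private int[] _vals = new int[10];
    private int _nextIndex = 0;

    public void Push(int val)
    {
        if (_nextIndex >= _vals.Length)
            throw new InvalidOperationException("No more values left to push");
        _vals[_nextIndex++] = val;
    }

    public int Pop()
    {
        if (_nextIndex <= 0)
            throw new InvalidOperationException("No more values left to pop");
        return _vals[--_nextIndex];
    }
}

public static class PushPopDemo
{
    public static void Main()
    {
        var arr = new PushPopIntArray();
        arr.Push(1);
        arr.Push(2);
        Console.WriteLine(arr.Pop()); // 2
        Console.WriteLine(arr.Pop()); // 1
    }
}
```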
In C#, arrays cannot be resized dynamically. You can use Array.Resize (but this will probably be bad for performance, since each call allocates a new array and copies the elements) or use a List<T> (or the non-generic ArrayList) instead.
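For completeness, the Array.Resize route looks like this; each call allocates a new array and copies the old elements, hence the performance caveat:

```csharp
using System;

public static class ResizeDemo
{
    public static void Main()
    {
        int[] foo = new int[0];

        // "Push" 1, then 2, by growing the array one slot at a time.
        Array.Resize(ref foo, foo.Length + 1);
        foo[foo.Length - 1] = 1;
        Array.Resize(ref foo, foo.Length + 1);
        foo[foo.Length - 1] = 2;

        Console.WriteLine(string.Join(",", foo)); // 1,2
    }
}
```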
But there has to be a simpler way around this trivial task.
Nope. Not all languages make everything equally easy; this is why collections were invented. C# <> Python <> PHP <> Java. Pick whichever suits you better, but equivalent effort isn't always the case when moving from one language to another.
foo[foo.Length] won't work because index foo.Length is outside the array; the last item is at index foo.Length - 1.
Beyond that, an array is a fixed-size structure. If you expect it to work the same as in PHP, you're just plainly wrong.
Originally I wrote this as a comment, but I think it contains enough important points to warrant writing it as an answer.
You seem to be under the impression that C# is an awkward language because you stubbornly insist on using an array while having the requirement that you should "push items onto the end", as evidenced by this comment:
Isn't pushing items into the array kind of the entire purpose of the data structure?
To answer that: no, the purpose of the array data structure is to have a contiguous block of pre-allocated memory to mimic the original array structure in C(++) that you can easily index and perform pointer arithmetic on.
If you want a data structure that supports certain operations, such as pushing elements onto the end, consider a System.Collections.Generic.List<T> or, if you insist on avoiding generics, the older non-generic System.Collections.ArrayList. The whole point of these library classes is that you don't have to concern yourself with the storage details: List<T> gives certain guarantees on its operations (appending is amortized O(1) and retrieval by index is O(1), just like an array), and the fact that an array holds the data internally, and is reallocated as the list grows, is a detail you normally never need to think about.
Don't try to compare PHP and C# by comparing PHP arrays with C# arrays - they have different programming paradigms and the way to solve a problem in one does not necessarily carry over to the other.
To answer the question as written, I see two options then:
Use arrays the awkward way. Either create an array of Nullable<int>s and accept the extra space and unpleasant LINQ statements for insertion; or keep an additional counter (preferably wrapped up in a class together with the array) to keep track of the last assigned element.
Use a proper data structure with appropriate guarantees on the operations that matter, such as List<T> which is effectively the (much better, optimised) built-in version of the second option above.
I understand that the latter option is not feasible for you because of the constraints imposed by your teacher, but then do not be surprised that things are harder than the canonical way in another language, if you are not allowed to use the canonical way in this language.
Afterthought:
A hybrid alternative that just came to mind: use a List<int> for storage and then just call .ToArray() on it. In your insert method, just Add to the list and return the new array.
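That hybrid could be wrapped like so (the class and method names are invented for illustration):

```csharp
using System.Collections.Generic;

public class GrowableIntArray
{
    private readonly List<int> _items = new List<int>();

    // Push a value and hand back a fresh snapshot array, PHP-style.
    public int[] Insert(int value)
    {
        _items.Add(value);
        return _items.ToArray();
    }
}
```

Note that ToArray copies the whole list on every call, so this is only reasonable for small collections.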

Which algorithm is used in List<T> to dynamically allocate memory?

Now I have an algorithm for dynamically allocating memory on an array:
If array is full I create a new array of twice the size, and copy items.
If array is one-quarter full I create a new array of half the size, and copy items.
This is a fairly fast algorithm for dynamic memory allocation, despite the overhead of copying the elements to the newly allocated array.
Which is faster, List<T> or such an algorithm based on an array? What would you recommend using?
Does List<T> use a simple array as its internal data structure?
To answer your question:
It is true, C#'s List<T> implementation uses an internal array. The class is
Serializable
usable from LINQ and foreach (it implements IEnumerable<T>)
binary-searchable (via List<T>.BinarySearch)
and so on. (Note that it is not thread-safe for concurrent writes, though.)
Hence, I would ask you to use List<T> instead of your own list.
Oh and btw, if you want the source code of List<T> from Microsoft, then here it is
List.cs
EDIT
The source code of EnsureCapacity in List<T> is:
// Ensures that the capacity of this list is at least the given minimum
// value. If the current capacity of the list is less than min, the
// capacity is increased to twice the current capacity or to min,
// whichever is larger.
private void EnsureCapacity(int min) {
    if (_items.Length < min) {
        int newCapacity = _items.Length == 0 ? _defaultCapacity : _items.Length * 2;
        if (newCapacity < min) newCapacity = min;
        Capacity = newCapacity;
    }
}
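You can observe the doubling directly by watching the Capacity property while adding elements. The exact sequence, starting at a default of 4, is an implementation detail of current .NET versions, not a documented guarantee:

```csharp
using System;
using System.Collections.Generic;

public static class CapacityDemo
{
    // Records every distinct Capacity value seen while adding `adds` items.
    public static List<int> ObservedCapacities(int adds)
    {
        var list = new List<int>();
        var capacities = new List<int>();
        for (int i = 0; i < adds; i++)
        {
            list.Add(i);
            if (capacities.Count == 0 || list.Capacity != capacities[capacities.Count - 1])
                capacities.Add(list.Capacity);
        }
        return capacities;
    }

    public static void Main()
    {
        // Typically prints 4, 8, 16, 32, 64 for 40 adds.
        Console.WriteLine(string.Join(", ", ObservedCapacities(40)));
    }
}
```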
Unless you have a specific reason to believe otherwise, it's almost universally a good idea to use the provided libraries that come with C#. Those implementations are well-implemented, well-debugged, and well-tested.
The data structure you're describing is a standard implementation of a dynamic array data structure, and most languages use this as their default list implementation. Looking over the documentation for List<T>, it seems like List<T> uses this implementation, since its documentation has references to an internal capacity and guarantees O(1) append as long as the size is less than the capacity.
In short, avoid reinventing the wheel unless you have to.
Hope this helps!
List<T> uses an array internally, and it uses a similar strategy to yours: it doubles the size of the array if the length would exceed the array's length. However, it doesn't shrink the array on its own if the list becomes much smaller.
The relevant method in mscorlib:
private void EnsureCapacity(int min)
{
    if (this._items.Length < min)
    {
        int num = (this._items.Length == 0) ? 4 : (this._items.Length * 2);
        if (num < min)
        {
            num = min;
        }
        this.Capacity = num;
    }
}
The resizing of the array actually happens in the setter of List<T>.Capacity.
No need to reinvent the wheel.
From MSDN:
Capacity is the number of elements that the List<T> can store
before resizing is required, while Count is the number of elements
that are actually in the List<T>.
Capacity is always greater than or equal to Count. If Count exceeds
Capacity while adding elements, the capacity is increased by
automatically reallocating the internal array before copying the old
elements and adding the new elements.
The capacity can be decreased by calling the TrimExcess method or by
setting the Capacity property explicitly. When the value of Capacity
is set explicitly, the internal array is also reallocated to
accommodate the specified capacity, and all the elements are copied.
Retrieving the value of this property is an O(1) operation; setting
the property is an O(n) operation, where n is the new capacity.
That's basically what List<T> (and many other languages' dynamic arrays, for that matter) does as well. The resizing factors may be different, and it doesn't shrink the backing array on its own when removing elements - but there's TrimExcess, and you can set Capacity yourself, which allows more efficient strategies if client code uses these features well. But basically, it's asymptotically equivalent.
As for which one to use: Unless you have cold, hard data that List<T> is suboptimal for your use case and that the difference matters (you obviously don't have this knowledge yet), you should use it. Your own implementation will be buggy, less feature-rich (cf. IEnumerable<T>, IList<T>, numerous methods), less optimized, less widely recognizable, not accepted by other libraries (so you may have to create expensive copies, or at least do more work than with List<T> to interact), etc. and there is most likely absolutely nothing to be gained.

Preventing double hash operation when trying to update a value in a Dictionary<IComparable, int>

I am working on software for scientific research that deals heavily with chemical formulas. I keep track of the contents of a chemical formula using an internal Dictionary<Isotope, int> where Isotope is an object like "Carbon-13", "Nitrogen-14", and the int represents the number of those isotopes in the chemical formula. So the formula C2H3NO would exist like this:
{"C12", 2,
 "H1", 3,
 "N14", 1,
 "O16", 1}
This is all fine and dandy, but when I want to add two chemical formulas together, I end up having to calculate the hash function of Isotope twice to update a value, see follow code example.
public class ChemicalFormula
{
    internal Dictionary<Isotope, int> _isotopes = new Dictionary<Isotope, int>();

    public void Add(Isotope isotope, int count)
    {
        if (count != 0)
        {
            int curValue = 0;
            if (_isotopes.TryGetValue(isotope, out curValue))
            {
                int newValue = curValue + count;
                if (newValue == 0)
                {
                    _isotopes.Remove(isotope);
                }
                else
                {
                    _isotopes[isotope] = newValue;
                }
            }
            else
            {
                _isotopes.Add(isotope, count);
            }
            _isDirty = true;
        }
    }
}
While this may not seem like much of a slowdown, it is when we are adding billions of chemical formulas together: this method is consistently the slowest part of the program (>45% of the running time). I am dealing with large chemical formulas like "H5921C3759N1023O1201S21" that are repeatedly being added to by smaller chemical formulas.
My question is, is there a better data structure for storing data like this? I have tried creating a simple IsotopeCount object that contains a int so I can access the value in a reference-type (as opposed to value-type) to avoid the double hash function. However, this didn't seem beneficial.
EDIT
Isotope is immutable and shouldn't change during the lifetime of the program so I should be able to cache the hashcode.
I have linked to the source code so you can see the classes more in depth rather than me copy and paste them here.
I second the opinion that Isotope should be made immutable with precalculated hash. That would make everything much simpler.
(in fact, functionally-oriented programming is better suited for calculations of such sort, and it deals with immutable objects)
I have tried creating a simple IsotopeCount object that contains a int so I can access the value in a reference-type (as opposed to value-type) to avoid the double hash function. However, this didn't seem beneficial.
Well it would stop the double hashing... but obviously it's then worse in terms of space. What performance difference did you notice?
Another option you should strongly consider if you're doing this a lot is caching the hash within the Isotope class, assuming it's immutable. (If it's not, then using it as a dictionary key is at least somewhat worrying.)
If you're likely to use most Isotope values as dictionary keys (or candidates) then it's probably worth computing the hash during initialization. Otherwise, pick a particularly unlikely hash value (in an ideal world, that would be any value) and use that as the "uncached" value, and compute it lazily.
If you've got 45% of the running time in GetHashCode, have you looked at optimizing that? Is it actually GetHashCode, or Equals which is the problem? (You talk about "hashing" but I suspect you mean "hash lookup in general".)
If you could post the relevant bits of the Isotope type, we may be able to help more.
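A minimal illustration of the cached-hash idea; the fields and hash formula here are invented stand-ins, since the real Isotope type isn't shown:

```csharp
public sealed class Isotope
{
    private readonly string _symbol;   // e.g. "C"
    private readonly int _massNumber;  // e.g. 13
    private readonly int _hash;        // computed once, safe because the type is immutable

    public Isotope(string symbol, int massNumber)
    {
        _symbol = symbol;
        _massNumber = massNumber;
        _hash = symbol.GetHashCode() * 31 + massNumber;
    }

    // Dictionary lookups now cost a field read instead of a recomputation.
    public override int GetHashCode() => _hash;

    public override bool Equals(object obj) =>
        obj is Isotope other &&
        other._massNumber == _massNumber &&
        other._symbol == _symbol;
}
```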
EDIT: Another option to consider if you're using .NET 4 would be ConcurrentDictionary, with its AddOrUpdate method. You'd use it like this:
public void Add(Isotope isotope, int count)
{
    // I prefer early exit to lots of nesting :)
    if (count == 0)
    {
        return;
    }
    int newCount = _isotopes.AddOrUpdate(isotope, count,
        (key, oldCount) => oldCount + count);
    if (newCount == 0)
    {
        int removed;
        _isotopes.TryRemove(isotope, out removed);
    }
    _isDirty = true;
}
Do you actually require random access to Isotope count by type or are you using the dictionary as a means for associating a key with a value?
I would guess the latter.
My suggestion to you is not to work with a dictionary but with a sorted array (or List) of IsotopeTuples, something like:
class IsotopeTuple
{
    Isotope i;
    int count;
}
sorted by the name of the isotope.
Why the sorting? Because then, when you want to "add" two formulas together, you can do it in linear time by traversing both arrays (hope this is clear, I can elaborate if needed). No hash computation is required, just super fast comparisons of order.
This is a classic approach when dealing with vector multiplications where the dimensions are words, used widely in text mining.
The tradeoff is of course that the construction of the initial vector is O(n log n), but I doubt you will feel the impact.
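A sketch of that linear-time merge over two name-sorted lists; IsotopeTuple is reduced here to a name and a count purely for illustration:

```csharp
using System;
using System.Collections.Generic;

public sealed class IsotopeTuple
{
    public string Name;
    public int Count;
    public IsotopeTuple(string name, int count) { Name = name; Count = count; }
}

public static class FormulaMerge
{
    // Merges two lists sorted by Name, summing counts: O(n + m), no hashing.
    public static List<IsotopeTuple> Merge(List<IsotopeTuple> a, List<IsotopeTuple> b)
    {
        var result = new List<IsotopeTuple>();
        int i = 0, j = 0;
        while (i < a.Count && j < b.Count)
        {
            int cmp = string.CompareOrdinal(a[i].Name, b[j].Name);
            if (cmp < 0) result.Add(a[i++]);
            else if (cmp > 0) result.Add(b[j++]);
            else
            {
                result.Add(new IsotopeTuple(a[i].Name, a[i].Count + b[j].Count));
                i++; j++;
            }
        }
        while (i < a.Count) result.Add(a[i++]);
        while (j < b.Count) result.Add(b[j++]);
        return result;
    }
}
```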
Another solution that you could think of if you had a limited number of Isotopes and no memory problems:
public struct Formula
{
    public int C12;
    public int H1;
    public int N14;
    public int O16;
}
I am guessing you're looking at organic chemistry, so you may not have to deal with that many isotopes, and if the lookup is the issue, this one will be pretty fast...

Efficiency: Creating an array of doubles incrementally?

Consider the following code:
List<double> l = new List<double>();
//add unknown number of values to the list
l.Add(0.1); //assume we don't have these values ahead of time.
l.Add(0.11);
l.Add(0.1);
l.ToArray(); //ultimately we want an array of doubles
Anything wrong with this approach? Is there a more appropriate way to build an array, without knowing the size, or elements ahead of time?
There's nothing wrong with your approach. You are using the correct data type for the purpose.
After some observations you can get a better idea of the total elements in that list. Then you can create a new list with an initial capacity in the constructor:
List<double> l = new List<double>(capacity);
Other than this, it's the proper technique and data structure.
UPDATE:
If you:
Need only the Add and ToArray functions of the List<T> structure,
And you can't really predict the total capacity
And you end up with more than 1K elements
And better performance is really really (really!) your goal
Then you might want to write your own interface:
public interface IArrayBuilder<T>
{
    void Add(T item);
    T[] ToArray();
}
And then write your own implementation, which might be better than List<T>. Why is that? Because List<T> holds a single array internally, and it increases its size when needed. The procedure of increasing the inner array costs, in terms of performance, since it allocates new memory and copies the elements from the old array to the new one. However, if all of the conditions described above are true, all you need is to build an array; you don't really need all of the data to be stored in a single array internally.
I know it's a long shot, but I think it's better sharing such thoughts...
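One way such an implementation could avoid repeated copying is to accumulate fixed-size chunks and only assemble the final array in ToArray. This is a sketch under the assumptions above, not a tuned implementation:

```csharp
using System;
using System.Collections.Generic;

public sealed class ChunkedArrayBuilder<T>
{
    private const int ChunkSize = 1024;
    private readonly List<T[]> _chunks = new List<T[]>(); // completed chunks
    private T[] _current = new T[ChunkSize];              // chunk being filled
    private int _index;                                   // next free slot in _current
    private int _total;                                   // total items added

    public void Add(T item)
    {
        if (_index == ChunkSize)
        {
            _chunks.Add(_current);
            _current = new T[ChunkSize];
            _index = 0;
        }
        _current[_index++] = item;
        _total++;
    }

    // Single allocation plus one copy per chunk - no element is ever copied twice.
    public T[] ToArray()
    {
        var result = new T[_total];
        int offset = 0;
        foreach (var chunk in _chunks)
        {
            Array.Copy(chunk, 0, result, offset, ChunkSize);
            offset += ChunkSize;
        }
        Array.Copy(_current, 0, result, offset, _index);
        return result;
    }
}
```

Whether this actually beats List<T> plus ToArray would need profiling; the win, if any, comes from never copying elements during growth.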
As others have already pointed out: This is the correct approach. I'll just add that if you can somehow avoid the array and use List<T> directly or perhaps IEnumerable<T>, you'll avoid copying the array as ToArray actually copies the internal array of the list instance.
Eric Lippert has a great post about arrays, that you may find relevant.
A dynamic data structure like a List<T> is the correct way to implement this. Note that indexing into a List<T> is O(1) just like an array; the only real advantages arrays have are slightly lower per-access overhead and no spare capacity. The flexibility more than makes up for this, imho.

List<T>.AddRange implementation suboptimal

Profiling my C# application indicated that significant time is spent in List<T>.AddRange. Using Reflector to look at the code in this method indicated that it calls List<T>.InsertRange which is implemented as such:
public void InsertRange(int index, IEnumerable<T> collection)
{
    if (collection == null)
    {
        ThrowHelper.ThrowArgumentNullException(ExceptionArgument.collection);
    }
    if (index > this._size)
    {
        ThrowHelper.ThrowArgumentOutOfRangeException(ExceptionArgument.index, ExceptionResource.ArgumentOutOfRange_Index);
    }
    ICollection<T> is2 = collection as ICollection<T>;
    if (is2 != null)
    {
        int count = is2.Count;
        if (count > 0)
        {
            this.EnsureCapacity(this._size + count);
            if (index < this._size)
            {
                Array.Copy(this._items, index, this._items, index + count, this._size - index);
            }
            if (this == is2)
            {
                Array.Copy(this._items, 0, this._items, index, index);
                Array.Copy(this._items, (int) (index + count), this._items, (int) (index * 2), (int) (this._size - index));
            }
            else
            {
                T[] array = new T[count];         // (*)
                is2.CopyTo(array, 0);             // (*)
                array.CopyTo(this._items, index); // (*)
            }
            this._size += count;
        }
    }
    else
    {
        using (IEnumerator<T> enumerator = collection.GetEnumerator())
        {
            while (enumerator.MoveNext())
            {
                this.Insert(index++, enumerator.Current);
            }
        }
    }
    this._version++;
}

private T[] _items;
One can argue that the simplicity of the interface (only having one overload of InsertRange) justifies the performance overhead of runtime type checking and casting.
But what could be the reason behind the 3 lines I have indicated with (*)?
I think it could be rewritten to the faster alternative:
is2.CopyTo(this._items, index);
Do you see any reason for not using this simpler and apparently faster alternative?
Edit:
Thanks for the answers. So the consensus opinion is that this is a protective measure against the input collection implementing CopyTo in a defective or malicious manner. To me it seems like overkill to constantly pay the price of (1) runtime type checking, (2) dynamic allocation of the temporary array, and (3) doubling the copy operation, when all of this could have been avoided by defining a few more overloads of InsertRange: one taking IEnumerable<T> as now, a second taking List<T>, and a third taking T[]. The latter two could have been implemented to run around twice as fast as the current case.
Edit 2:
I did implement a class FastList, identical to List, except that it also provides an overload of AddRange which takes a T[] argument. This overload does not need the dynamic type verification or the double copying of elements. I profiled FastList.AddRange against List.AddRange by adding 4-byte arrays 1000 times to an initially empty list. My implementation beats the standard List.AddRange by a factor of 9 (nine!). List.AddRange takes about 5% of the runtime in one of the important usage scenarios of our application, so replacing List with a class providing a faster AddRange could improve application runtime by 4%.
They are preventing the implementation of ICollection<T> from accessing indices of the destination list outside the bounds of insertion. The implementation above results in an IndexOutOfRangeException if a faulty (or "manipulative") implementation of CopyTo is called.
Keep in mind that T[].CopyTo is quite literally implemented internally as a memcpy, so the performance overhead of adding that line is minute. When adding safety to a tremendous number of calls costs so little, you might as well do so.
Edit: The part I find strange is the fact that the call to ICollection<T>.CopyTo (copying to the temporary array) does not occur immediately following the call to EnsureCapacity. If it were moved to that location, then following any synchronous exception the list would remain unchanged. As-is, that condition only holds if the insertion happens at the end of the list. The reasoning here is:
All necessary allocation happens before altering the list elements.
The calls to Array.Copy cannot fail because
The memory is already allocated
The bounds are already checked
The element types of the source and destination arrays match
There is no "copy constructor" used like in C++ - it's just a memcpy
The only items that can throw an exception are the external call to ICollection.CopyTo and the allocations required for resizing the list and allocating the temporary array. If all three of these occur before moving elements for the insertion, the transaction to change the list cannot throw a synchronous exception.
Final note: this addresses strictly exceptional behavior; the above rationale does not add thread safety.
Edit 2 (response to the OP's edit): Have you profiled this? You are making some bold claims that Microsoft should have chosen a more complicated API, so you should make sure you're correct in the assertions that the current method is slow. I've never had a problem with the performance of InsertRange, and I'm quite sure that any performance problems someone does face with it will be better resolved with an algorithm redesign than by reimplementing the dynamic list. Just so you don't take me as being harsh in a negative way, keep the following in mind:
I can't stand people on my dev team who like to reinvent the square wheel.
I definitely want people on my team that care about potential performance issues, and ask questions about the side effects their code may have. This point wins out when present - but as long as people are asking questions I will drive them to turn their questions into solid answers. If you can show me that an application gains a significant advantage through what initially appears to be a bad idea, then that's just the way things go sometimes.
It's a good question, and I'm struggling to come up with a reason. There's no hint in the Reference Source. One possibility is that they are trying to avoid a problem when the class that implements the ICollection<>.CopyTo() method objects to copying to a start index other than 0. Or it is a security measure, preventing the collection from messing with array elements it should not have access to.
Another one is that this is a counter-measure when the collection is used in thread-unsafe manner. If an item got added to the collection by another thread it will be the collection class' CopyTo() method that fails, not the Microsoft code. The right person will get the service call.
These are not great explanations.
There is a problem with your solution if you think about it for a minute: by changing the code in that way, you are essentially giving the collection being added access to an internal data structure.
This is not a good idea. For example, if the author of the List data structure found a better underlying representation than an array, there would be no way to change the implementation, since all collections would now expect an array to be passed to their CopyTo function.
In essence you would be cementing the implementation of the List class, even though object-oriented programming tells us that the internal implementation of a data structure should be changeable without breaking other code.
