How expensive is an array allocation and initialization? - c#

TL;DR: How expensive are the allocation and initialization of an array of relatively large size K, compared to going through such an array, multiplying each float by a fixed float, and writing the result into a different array of size K? In other words, is there a big difference between doing both (allocate, initialize, multiply) and just doing the latter (multiplying the elements of an existing array)?
I'll be recursively traversing an unknown binary tree of height at most ~50 that already has some attributes set, and I'll be filling in some other attributes. Each node contains an array/list of around 10k floats. I'll do some operation on all the floats of this array in the node - say multiply them by a float - and then the resulting 10k array will be passed down to the children.
Now, to me the neatest way to implement this would be to just declare a tree node class and recursively create the children, etc. The thing is, the values in a node can only really affect the children nodes, so I don't actually need to keep an array in each node. Since I have a bound of ~50 on the height of the tree, I could just use 50 arrays, one for each level of the tree. That seems a bit ugly to me, though. Also, I'll still have to do the actual work of multiplying and writing down the values, so assuming the complexity of allocating and initializing a new array is linear in its size (and I don't really know if that's true), the per-level approach would still cost 10k * (number of nodes) multiplications, as opposed to 10k * (number of nodes) * (multiplication + initialization) for the allocate-per-node approach.
If it is a noticeable difference (and to me, something like a 0.1% difference is not), what would be the right way to use arrays that keep track of the floats in the current node? Using static arrays seems like the simplest solution, but I'm sure that might be considered bad practice. I can't really think of another simple solution.
I was sort of expecting to just make a tree node class and give it a traversal/"go down" method, keeping all the logic in one class. But I guess I could make a different class with a function that recursively traverses the tree; that might be cleaner anyway, as the new class could correspond to an actual "tree" object. Then I could give the tree object a field containing the arrays for each level.
Alternatively, I could still use just a node class with a traversal method, and keep passing down a reference to an object holding all 50 arrays.
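For reference, the per-level-buffer idea I'm describing would look roughly like this; Node, Factor and the constants are placeholders I made up for the sketch, not anything from an actual implementation:

using System;

// A node only carries its multiplier and links; the 10k floats live in
// per-level buffers owned by the processor, not in the nodes themselves.
class Node
{
    public float Factor;          // the float this node multiplies by (assumption)
    public Node Left, Right;
}

class TreeProcessor
{
    const int K = 10_000;         // ~10k floats per node
    const int Levels = 52;        // height bound of ~50, plus slack

    // One buffer per level; allocated once, reused for every subtree.
    readonly float[][] levelBuffers;

    public TreeProcessor()
    {
        levelBuffers = new float[Levels][];
        for (int i = 0; i < Levels; i++)
            levelBuffers[i] = new float[K];
    }

    public void Process(Node root, float[] initialValues)
    {
        Array.Copy(initialValues, levelBuffers[0], K);
        Descend(root, 0);
    }

    void Descend(Node node, int depth)
    {
        if (node == null) return;

        float[] current = levelBuffers[depth];     // values handed down by the parent
        float[] next = levelBuffers[depth + 1];    // values this node hands to its children

        for (int i = 0; i < K; i++)
            next[i] = current[i] * node.Factor;

        // The left subtree only ever writes to buffers deeper than depth + 1,
        // so 'next' is still intact when the right child reads it.
        Descend(node.Left, depth + 1);
        Descend(node.Right, depth + 1);
    }
}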

Related

C# Efficient Tree Argument Recursion, Efficient sub-array or sub-list

I have an in-memory tree structure, resembling a directory tree. That is: each node has a dictionary of named subnodes. I want an efficient method of traversing the tree, from a list or array of names.
If I start at the root node with a list of subnode names to traverse, {"organisms","primates","human","male","John Smith"}, I can recursively process one step and pass the remaining sub-list to the subnode: return this.subNodes[myList[0]].GetSubNode(myList.GetRange(1, myList.Count - 1)) ... Even though List.GetRange() is a shallow copy, it still creates a new list for every level of recursion. The whole operation seems very time- and space-inefficient.
Or if I try to use an array, the best method I can find for creating a sub-array is Array.Copy, which is again a shallow copy. Same problem.
I'm thinking in terms of C, where the head of a list is just a pointer to an object that has another pointer to another object, so getting a sub-list is as simple as following one pointer. Or an array is just a pointer to some memory, so getting a sub-array is as simple as incrementing the pointer. Very time and space efficient. Is there any way to do this in C#?
At present, in C#, I'm thinking I just need to forget the recursion and do some sort of iteration from the top level...
Or I can pass the unmodified array as an argument recursively, along with an int index that I increment at each level. That's fine, except that I need to pass another argument to the recursive method call whose sole purpose is to tell the nth recursive call, "ignore the first n items in the array"... This works, it just seems silly if it's the only possible solution (or the best possible one).
Is there a better way?
There is a LinkedList<T> implementation in .NET, which lets you pass the next LinkedListNode<T> into the method.
Other than that, the approach with indexes is also fine - at least it doesn't consume extra memory.
There is also a way to pass a pointer to an array element, as in C. But that would force you to compile the program with unsafe code enabled, which is often undesirable.
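A minimal sketch of the index-passing approach might look like this; TreeNode, subNodes and GetSubNode are stand-ins for the poster's own types, not an existing API:

using System.Collections.Generic;

class TreeNode
{
    readonly Dictionary<string, TreeNode> subNodes = new Dictionary<string, TreeNode>();

    public TreeNode AddSubNode(string name)
    {
        var child = new TreeNode();
        subNodes[name] = child;
        return child;
    }

    public TreeNode GetSubNode(string[] path) => GetSubNode(path, 0);

    // No copies: every level of the recursion sees the same array, offset by 'start'.
    TreeNode GetSubNode(string[] path, int start)
    {
        if (start == path.Length) return this;
        return subNodes[path[start]].GetSubNode(path, start + 1);
    }
}

On newer frameworks you could also hide the index by accepting an ArraySegment<string> or ReadOnlySpan<string> and slicing it at each level, which gives the C-style "advance the pointer" behaviour without copying the elements.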

Yet Another Take on Measuring Object Size

Searching Google and Stack Overflow turns up a lot of references to this question, including for example:
Ways to determine size of complex object in .NET?
How to get object size in memory?
So let me say at the start that I understand it is not generally possible to get an accurate measurement. However, I am not that concerned about that - I am looking for something that gives me relative values rather than absolute ones. So if they are off a bit one way or the other, it does not matter.
I have a complex object graph. It is made up of a single parent (T) with children that may themselves have children, and so on. All the objects in the graph derive from the same base class. The children are held as a List<T>.
I have tried both the serializing method and the unsafe method to calculate size. They give different answers but the 'relative' problem is the same in both cases.
I made an assumption that the size of a parent object would be larger than the sum of the sizes of its children. This has turned out not to be true. I calculated the size of the parent, then summed the sizes of the children. In some cases this appeared to make sense, but in others the sum of the children far exceeded the size determined for the parent.
So my question is: why can serializing a parent object result in a size that is less than the sum of the sizes of its children? The only answer I have come up with is that each serialized object carries a fixed overhead (which I guess is self-evident) and the sum of these overheads can exceed the 'own size' of the parent. If that is so, is there any way to determine what that overhead might be, so that I can take account of it?
Many thanks in advance for any suggestions.
EDIT
Sorry, I forgot to say that all objects are marked serializable. The serialization method is:
var bf = new BinaryFormatter();
var ms = new MemoryStream();
bf.Serialize(ms, testObject);
byte[] array = ms.ToArray();
return array.Length;
It will really depend on which serialization mechanism you use for serializing the objects. It's possible that it's not serializing the children elements, which is one reason why you'd see the parent size smaller than the sum of the children (possibly even smaller than each individual child).
If you want to know the relative size of an object, make sure that you're serializing all the fields of all objects in your graph.
Edit: so, if you're using the BinaryFormatter, then you must look at the specification of the format used by that serializer to understand the overhead. The format specification is public and can be found at http://msdn.microsoft.com/en-us/library/cc236844(prot.20).aspx. It's not very easy to digest, but if you're willing to put in the time to understand it, you'll find exactly how much overhead each object has in its serialized form.
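As a rough illustration of where the discrepancy comes from, here is a hedged sketch that reuses the BinaryFormatter approach from the question to compare a parent against the sum of its children; SizeProbe and the Compare helper are invented names:

using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Runtime.Serialization.Formatters.Binary;

static class SizeProbe
{
    public static long SerializedSize(object obj)
    {
        var bf = new BinaryFormatter();
        using (var ms = new MemoryStream())
        {
            bf.Serialize(ms, obj);
            return ms.Length;
        }
    }

    // Serializing each child separately repeats the per-call overhead (stream
    // header, type and assembly metadata), which is why the sum of the children
    // can exceed the size of the parent graph, where that metadata is written
    // only once per type.
    public static void Compare<T>(T parent, IEnumerable<T> children)
    {
        long parentSize = SerializedSize(parent);
        long childrenSum = children.Sum(c => SerializedSize(c));
        System.Console.WriteLine($"parent: {parentSize}, sum of children: {childrenSum}");
    }
}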

How to refer to children in a tree with millions of nodes

I'm attempting to build a tree where each node can have an unspecified number of child nodes. In practice, the tree will have over a million nodes.
I've managed to construct the tree; however, I run into out-of-memory errors (a full heap) once I fill it with a few thousand nodes. The reason is that I'm storing each node's children in a Dictionary data structure (or any data structure, for that matter). So at run time I have thousands of such data structures being created, since each node can have an unspecified number of children and each node's children have to be stored somewhere.
Is there another way of doing this? I cannot simply use a fixed set of variables to hold references to the children, as there can be an unspecified number of children for each node. Thus, it is not like a binary tree, where I could have two variables keeping track of the left child and the right child respectively.
Please no suggestions for another method of doing this. I've got my reasons for needing to create this tree, and unfortunately I cannot do otherwise.
Thanks!
How many of your nodes will be "leaf" nodes? Perhaps only create the data structure to store children when you first add a child, and otherwise keep a null reference.
Unless you need to look up the children as a map, I'd use a List<T> (initialized with an appropriate capacity) instead of a Dictionary<,> for the children. It sounds like you may have more requirements than you've explained though, which makes it hard to say.
I'm surprised you're failing after only a few thousand nodes though - you should be able to create a pretty large number of objects before having problems.
I'd also suggest that if you think you'll end up using a lot of memory, make sure you're on a 64-bit machine and make sure your application itself is set to be 64-bit. (That may just be a thin wrapper over a class library, which is fine so long as the class library is set to be 64-bit or AnyCPU.)
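A small sketch of the lazy-allocation idea, with made-up Node and Payload names, might look like this:

using System.Collections.Generic;
using System.Linq;

class Node
{
    public string Payload;        // whatever data each node carries
    List<Node> children;          // stays null for leaf nodes

    public void AddChild(Node child)
    {
        // Allocate lazily and with a small capacity, so millions of mostly
        // small nodes don't each reserve a large backing array up front.
        if (children == null)
            children = new List<Node>(2);
        children.Add(child);
    }

    public IEnumerable<Node> Children =>
        children ?? Enumerable.Empty<Node>();
}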

What is the difference between LinkedList and ArrayList, and when to use which one?

What is the difference between LinkedList and ArrayList? How do I know when to use which one?
The difference is the internal data structure used to store the objects.
An ArrayList will use a system array (like Object[]) and resize it when needed. On the other hand, a LinkedList will use an object that contains the data and a pointer to the next and previous objects in the list.
Different operations will have different algorithmic complexity due to this difference in the internal representation.
Don't use either. Use System.Collections.Generic.List<T>.
That really is my recommendation, probably regardless of what your application is, but here's a little more color in case you're doing something that needs a finely tuned choice here.
ArrayList and LinkedList are different implementations of the storage mechanism behind a list. ArrayList uses an array that it must resize whenever the collection outgrows its current capacity. LinkedList, on the other hand, uses the linked list data structure from CS 201. LinkedList is better for some head- or tail-insert-heavy workloads, but ArrayList is better for random-access workloads.
ArrayList has a good replacement: List<T>.
In general, List<T> is a wrapper around an array - it allows indexing and accessing items in O(1), but every time you exceed the capacity an O(n) resize must be paid.
LinkedList<T> won't let you access items by index, but you can count on inserts always costing O(1) once you hold a reference to the node you're inserting next to. In particular, you can insert items at the beginning of the list and between existing items in O(1).
I think that in most cases List<T> is the default choice. Many common scenarios don't require a special ordering and have no strict complexity constraints, so List<T> is preferred for its simplicity of use.
The main difference between ArrayList and List<T>, LinkedList<T>, and other similar generic collections is that ArrayList holds Objects, while the others hold a type that you specify (i.e. List<Point> holds only Points).
Because of this, you need to cast any object you take out of an ArrayList to its actual type. This can take a lot of screen space if you have long class names.
In general it's much better to use List<T> and other typed Generics unless you really need to have a list with multiple different types of objects in it.
The difference lies in the semantics of how the List interface* is implemented:
http://en.wikipedia.org/wiki/Arraylist and http://en.wikipedia.org/wiki/LinkedList
*Meaning the basic list operations
As #sblom has stated, use the generic counterparts of LinkedList and ArrayList. There's really no reason not to, and plenty of reasons to do so.
The List<T> implementation is effectively wrapping an array. Should the user attempt to insert elements beyond the bounds of the backing array, it will be copied to a larger array (at considerable expense, but transparently to users of the List<T>).
A LinkedList<T> has a completely different implementation, in which data is held in LinkedListNode<T> instances that carry references to two other LinkedListNode<T> instances (or only one in the case of the head or tail of the list). No external reference to mid-list items is kept. This means that iterating the list is fast, but random access is slow, because you must walk the nodes from one end or the other. The best reason to use a LinkedList<T> is to allow for fast inserts, which involve simply changing the references held by the nodes, rather than shifting the elements after the insertion point (as List<T> must do).
They have different performance for "inserts" (adding new elements) and lookups. For inserts, ArrayList keeps an array internally (initially 16 items long) and doubles its size whenever you reach the capacity. A LinkedList starts empty and adds a node for each item as needed.
I also think that with ArrayList you can index items directly, while with a LinkedList you have to walk to the item from the head (or the LinkedList does this for you).
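To make the trade-off concrete, here is a small sketch using the generic List<T> and LinkedList<T>; the ListDemo name and sample values are arbitrary:

using System;
using System.Collections.Generic;

class ListDemo
{
    static void Main()
    {
        // List<T>: array-backed, O(1) index access, amortized O(1) Add at the
        // end, but O(n) Insert at the front (every element shifts right).
        var list = new List<int> { 1, 2, 3 };
        list.Insert(0, 0);                      // O(n)
        Console.WriteLine(list[2]);             // O(1) random access -> prints 2

        // LinkedList<T>: node-based, O(1) AddFirst/AddLast and O(1) insert
        // next to a node you already hold, but no indexer at all.
        var linked = new LinkedList<int>();
        LinkedListNode<int> node = linked.AddLast(2);
        linked.AddFirst(1);                     // O(1)
        linked.AddAfter(node, 3);               // O(1) given the node reference
        Console.WriteLine(linked.First.Value);  // prints 1; there is no linked[0]
    }
}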

How to implement a tree structure, but still be able to reference it as a flat array by index?

I want to have a tree in memory where each node can have multiple children. I would also need to reference this tree as a flat structure by index, for example:
a1
b1
b2
b3
c1
d1
e1
e2
d2
f1
This would be represented as a flat structure in the order laid out above (i.e. a1=0, b1=1, d1=5, etc.).
Ideally I would want lookup by index to be O(1), and to support insert, add, remove, etc., with the bonus of it being thread-safe - but if that is not possible, let me know.
If you have a reasonably balanced tree, you can get indexed references in O(log n) time - just store in each node a count of the number of nodes under it, and update the counts along the path to a modified leaf when you do inserts, deletions, etc. Then you can compute an indexed access by looking at the node counts on each child when you descend from the root. How important is it to you that indexed references be O(1) instead of O(log n)?
If modifications are infrequent with respect to accesses, you could compute a side vector of pointers to nodes when you are finished with a set of modifications, by doing a tree traversal. Then you could get O(1) access to individual nodes by referencing the side vector, until the next time you modify the tree. The cost is that you have to do an O(n) tree traversal after doing modifications before you can get back to O(1) node lookups. Is your access pattern such that this would be a good tradeoff for you?
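A rough sketch of that subtree-count scheme, with invented names (CountedNode, NodeAt) and pre-order indexing assumed:

using System.Collections.Generic;

class CountedNode
{
    public string Value;
    public List<CountedNode> Children = new List<CountedNode>();
    public int SubtreeCount = 1;      // must be kept up to date on insert/remove

    // Depth-first (pre-order) index: this node is 0, then its first child's
    // subtree, then the second child's subtree, and so on.
    public CountedNode NodeAt(int index)
    {
        if (index == 0) return this;
        index--;                      // skip this node
        foreach (var child in Children)
        {
            if (index < child.SubtreeCount)
                return child.NodeAt(index);
            index -= child.SubtreeCount;
        }
        throw new System.ArgumentOutOfRangeException(nameof(index));
    }
}

As the answer above says, keeping SubtreeCount current on every insert and delete is what buys O(log n) indexed lookups on a reasonably balanced tree.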
I use something similar to this in a generic red-black tree of mine. Essentially, to start you need a wrapper class like Tree, which contains the actual nodes.
The idea is to be able to reference the tree through an indexer, the way you would an array.
So you can do something like the following to set up a tree with a key and a value:
class Tree<K, V>
{
    // constructors and any methods you need (e.g. SearchForValue, AddValue)

    // Access the Tree like an array
    public V this[K key]
    {
        get
        {
            // This works just like a getter
            return SearchForValue(key);
        }
        set
        {
            // Like a setter; 'value' holds the value being assigned
            if (SearchForValue(key) == null)
            {
                // node for this key doesn't exist, add it
                AddValue(key, value);
            }
            else { /* node at this key already exists... do something */ }
        }
    }
}
This works on the assumption that you already know how to create a tree, but want to be able to do things like access the tree by index. Now you can do something like this:
Tree<string,string> t = new Tree<string,string>();
t["a"] = "Hello World";
t["b"] = "Something else";
Console.WriteLine("t at a is: {0}", t["a"]);
Finally, for thread safety, you can add an object to your Tree class and in any method exposed to the outside world simply call
lock (threadsafetyobject) { /* code you're protecting */ }
If you want something cooler for thread safety, I use an object in my tree called a ReaderWriterLockSlim, which allows multiple reads but locks down when you want to do a write. That is especially important if you're changing the tree's structure, like doing a rotation, while another thread is trying to do a read.
One last thing: I rewrote this code from memory, so it may not compile, but it should be close :)
If you have a defined number of children for each node (for instance, a binary tree), then it's not too difficult (although you potentially waste a lot of space).
If it has a variable number of children, you'd probably have to come up with some convoluted manner of storing the index of a node's first child.
I'm not seeing how it would be useful to do this, though. The point of trees is to be able to store and retrieve items below a particular node. If you want constant-time look-up by index, it doesn't sound like you want a tree at all. By storing it in an array, you have to consider the fact that if you add an element into the middle, all of the indexes you had originally stored will be invalid.
However, if indeed you want a tree, and you still want Constant Time insertion/lookup, just store a reference to the parent node in a variable, and then insert the child below it. That is constant time.
This is possible with a little work, but your insert and remove methods will become much more costly. To keep the array properly ordered, you will need to shift large chunks of data to create or fill space. The only apparent advantage is very fast traversal (minimal cache misses).
Anyhow, one solution is to store the number of children in each node, like so:
struct TreeNode
{
    int numChildren;
    /* whatever data you like */
};
Here's an example of how to traverse the tree...
TreeNode* example(TreeNode* p)
{
    /* do something interesting with p */
    int numChildren = p->numChildren;
    ++p;                                  /* the first child is stored immediately after its parent */
    for (int child = 1; child <= numChildren; ++child)
        p = example(p);                   /* each call returns the node just past that child's subtree */
    return p;                             /* the node just past this entire subtree */
}
Hopefully you can derive insert, remove, etc... on your own.
:)
You could always look at Joe Celko's technique of using nested sets to represent trees. He's SQL-focused, but the parallels are there between nested sets and representing a tree as a flat array, and it may be useful given your ultimate reason for wanting an array in the first place.
As others note, though, most of the time it's easier just to traverse the tree directly as linked nodes. The oft-cited array implementation of a binary tree (the layout used for binary heaps) works because, for a node at index n, the parent is at (n-1)/2, the left child at 2n+1 and the right child at 2n+2.
The downside of using arrays is that insertions, deletions, pruning and grafting all (usually) require the array to be modified when the tree changes.
You could also read up on B-trees
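For the binary case, the index arithmetic mentioned a couple of answers up is small enough to sketch directly; ArrayBinaryTree is just an illustrative name:

static class ArrayBinaryTree
{
    // Classic "heap style" layout: the root is at index 0 and neighbours are
    // found by arithmetic, so no per-node references are needed.
    public static int Parent(int n) => (n - 1) / 2;
    public static int Left(int n)   => 2 * n + 1;
    public static int Right(int n)  => 2 * n + 2;
}

For example, the root at index 0 has children at indices 1 and 2, the node at index 1 has children at 3 and 4, and so on. Slots for missing nodes must still be reserved, which is why this layout wastes space on sparse or unbalanced trees.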
Not sure if this is good, but a flat array can be addressed as a binary tree by computing each level's starting position as a power of two and then adding the node's offset within that level. This only works if the tree is a binary one, though.
There is a solution for a binary tree where you don't have to store any index at all. The approach could be adapted to a non-binary tree if each node has an exact, fixed number of children.
The complexity is Θ(h), where h is the height of the tree:
How to insert delete or update a value of binary tree by index?
