How to refer to children in a tree with millions of nodes


I'm attempting to build a tree where each node can have an unspecified number of child nodes. In practice the tree will have over a million nodes.
I've managed to construct the tree, but I run into memory errors due to a full heap once I fill the tree with a few thousand nodes. The reason is that I'm storing each node's children in a Dictionary (or any other data structure, for that matter). At run time, thousands of these data structures get created, since every node needs its own collection to hold an unspecified number of children.
Is there another way of doing this? I cannot simply use a variable to store a reference to the children, as each node can have an unspecified number of them. Thus, it is not like a binary tree, where I could have 2 variables keeping track of the left child and right child respectively.
Please no suggestions for another method of doing this. I've got my reasons for needing to create this tree, and unfortunately I cannot do otherwise.
Thanks!

How many of your nodes will be "leaf" nodes? Perhaps only create the data structure to store children when you first have a child, otherwise keeping a null reference.
Unless you need to look up the children as a map, I'd use a List<T> (initialized with an appropriate capacity) instead of a Dictionary<,> for the children. It sounds like you may have more requirements than you've explained though, which makes it hard to say.
I'm surprised you're failing after only a few thousand nodes though - you should be able to create a pretty large number of objects before having problems.
I'd also suggest that if you think you'll end up using a lot of memory, make sure you're on a 64-bit machine and make sure your application itself is set to be 64-bit. (That may just be a thin wrapper over a class library, which is fine so long as the class library is set to be 64-bit or AnyCPU.)
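The first two suggestions above (no collection for leaf nodes, and a List<T> rather than a Dictionary<,>) can be sketched roughly as follows. This is only a minimal illustration, not the asker's actual class: the name LazyNode, its members and the initial capacity of 2 are all assumptions made for the example.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical node type: the child list is allocated only when the first
// child is added, so a leaf node holds a single null reference instead of
// an empty collection.
public sealed class LazyNode<T>
{
    private List<LazyNode<T>> children; // stays null for leaf nodes

    public T Value { get; }

    public LazyNode(T value) { Value = value; }

    public int ChildCount => children?.Count ?? 0;

    // Leaves hand out a shared empty array instead of allocating a list.
    public IReadOnlyList<LazyNode<T>> Children =>
        children ?? (IReadOnlyList<LazyNode<T>>)Array.Empty<LazyNode<T>>();

    public LazyNode<T> AddChild(T value)
    {
        if (children == null)
            children = new List<LazyNode<T>>(2); // small assumed starting capacity
        var child = new LazyNode<T>(value);
        children.Add(child);
        return child;
    }
}
```

If most nodes turn out to be leaves, this avoids creating a collection for the majority of the tree.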

Related

How is the Root Node being updated when child nodes are the ones updated when performing Tree modifications?

How does a tree's parent node get updated with the updates we perform on its children? Specifically, when you do a BFS or DFS, perform a check at a node, and then update that node, what is it in memory or in the programming language that knows "Oh hey, I have to make this update to the Root node as well!"?
In my example I'm using a Trie (not really important, but I will just refer to it as a Tree). I have the BFS below, which searches through a bunch of nodes and updates those matching a specific word. The comment in the code marks where my question lies. That child node is currently stored in "deque". How does the program know I mean to update the value within Root, not just the value passed into the variable "deque"?
To me, what I think SHOULD be happening is Root shouldn't be updated and the only thing that gets updated is the variable "deque," and then after it's done, everything gets garbage collected and Root remains the same. Instead, Root gets updated when "deque" gets updated. Perhaps I missed this in my Data Structures course, but it has been bugging me for a while now and I've been having trouble finding resources that explain this.
private static void BFS_UpdateAllWords(Node Root, string testword, string updatevalue)
{
    Queue<Node> bfs_queue = new Queue<Node>();
    bfs_queue.Enqueue(Root);
    while (bfs_queue.Count > 0)
    {
        var deque = bfs_queue.Dequeue();
        foreach (string childKey in deque.Children.Keys)
        {   // Update all child nodes at the key
            if (deque.Children[childKey].Word.Equals(testword))
            {
                // This part right here, for any type of Tree traversal
                deque.Children[childKey].WordType = updatevalue;
            }
            bfs_queue.Enqueue(deque.Children[childKey]);
        }
    }
}

I think you are encountering what is roughly a pass by value vs pass by reference issue. When a function is pass by value, the parameter of that function is copied such that the function gets its own distinct version and the caller won't see changes the function makes to the parameter. However, if a function is pass by reference, no copy is made and if the function makes changes to a parameter the caller will see the changes that were made.
C# and Java are a little funny here because they are always pass by value (unless you explicitly tell C# to do otherwise), but they "pass references by value" too. That is, when you pass a function a reference type (such as an instance of a class), what the function receives is a copy of the reference to it (not a deep copy of the underlying object). This means the underlying object can still be changed inside a function, because that function has its own reference to it.
A consequence of this is that when you pass the Enqueue method a reference type, as your Node objects are, what is stored in the queue are just (copies of) references to the same underlying objects, which live outside the queue. That is, the elements of your queue are not deep copies of your tree's nodes: they are references to the same underlying Node objects. When you Dequeue() into the "deque" variable you're therefore just creating another alias which refers to some node that you could otherwise have reached in a traversal directly from the Root node. This way, when you modify the properties of the "deque" node, you're directly modifying the properties of a Node object as it exists in the tree under the Root node: not a copy of that object with its own address in memory, but the same underlying object.
This is why when deque gets garbage collected the changes persist: deque just contained copies of references to the nodes which already existed in the Root tree. There's nothing your program had to figure out to know to make the changes you made via deque also affect nodes in the Root tree. Because deque contained references to those same underlying nodes, changes made through it were already directly modifying those nodes, and so after deque went out of scope the changes naturally persisted.
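A small sketch may make the aliasing concrete. The Node class here is a cut-down stand-in for the one in the question (only WordType and Children are assumed), and ReferenceDemo is a made-up helper name.

```csharp
using System.Collections.Generic;

// Minimal stand-in for the question's Node class (an assumption).
public class Node
{
    public string WordType;
    public Dictionary<string, Node> Children = new Dictionary<string, Node>();
}

public static class ReferenceDemo
{
    // Returns true if the dequeued variable and the node reachable from the
    // root are the very same object, and the change is visible through the root.
    public static bool DequeuedNodeIsTheTreeNode()
    {
        var root = new Node();
        root.Children["a"] = new Node { WordType = "old" };

        var queue = new Queue<Node>();
        queue.Enqueue(root.Children["a"]); // stores a copy of the reference, not of the object

        Node alias = queue.Dequeue();      // a second reference to the same Node
        alias.WordType = "new";            // mutates the one shared object

        return object.ReferenceEquals(alias, root.Children["a"])
            && root.Children["a"].WordType == "new";
    }
}
```

Under those assumptions, reassigning the variable itself (alias = new Node()) would not touch the tree; only changes made through the reference to the object's members are shared.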

How expensive is an array allocation and initialization?

TL;DR: How expensive is allocation and initialization of an array of relatively large size K, compared to going through such an array, multiplying each float by a fixed float, and writing the result to a different array of size K? E.g. is there a big difference between doing both (allocate, initialize, multiply), or just doing the latter (multiply elements of an existing array)?
I'll be recursively traversing through an unknown binary tree of height at most ~50, that has some attributes already set, and I'll be filling out some other attributes. Each node contains an array/list containing around 10k floats, I'll do some operation on all the floats of this array in the node - say multiply them by a float, and then the resulting 10k array will go down to the children.
Now to me the neatest way to implement this would be to actually just declare a tree node class and recursively create the children, etc. Thing is, the values in a node can only really affect the child nodes, so I don't actually need to create an array in each node. Since I have a bound of ~50 on the height of the tree, I could just use 50 arrays, one for each level of the tree. This seems a bit ugly to me though. Also, I'll still have to do the actual multiplication, write down the values, etc. So, assuming the cost of allocating and initializing a new array is linear in the size of the array (and I don't really know if that's true), I'll still be doing 10k * (number of nodes) multiplications, as opposed to 10k * (number of nodes) * (multiplication + initialization) operations.
If it is a noticeable difference (which to me something like, say, a 0.1% difference is not), what would be the right way to use arrays that keep track of the floats in the current node? Using static arrays would seem like the simplest solution, but I'm sure that might be considered bad practice. I can't really think of another simple solution.
I was sort of expecting to just make a tree node class, and give it a traversal/"go down" method, keeping all the stuff in one class. But I guess I could make a different class with a function that would recursively traverse the tree, that might be cleaner anyway, as that way the new class could correspond to an actual "tree" object. Then I could just give tree object a field that would contain the arrays for each level.
Alternatively, I could still use just a node class with a traversal method, and just keep passing down a reference to an object with all of the 50 arrays.
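The "one array per level, passed down by reference" idea could look roughly like this. It is only a sketch under assumptions: the ScaleNode type, its Factor and Sum fields, and the maxHeight parameter are invented for illustration, and the per-node operation is reduced to a multiply plus a sum.

```csharp
// Hypothetical node: Factor is an attribute that is already set, Sum is one
// that the traversal fills in.
public sealed class ScaleNode
{
    public float Factor = 1f;
    public ScaleNode Left, Right;
    public float Sum;
}

public static class LevelBuffers
{
    // Allocates one scratch array per depth up front, then reuses them.
    // Assumes no node is deeper than maxHeight.
    public static void Fill(ScaleNode root, float[] input, int maxHeight)
    {
        var buffers = new float[maxHeight + 1][];
        for (int d = 0; d < buffers.Length; d++)
            buffers[d] = new float[input.Length];
        Visit(root, input, buffers, 0);
    }

    private static void Visit(ScaleNode node, float[] incoming, float[][] buffers, int depth)
    {
        if (node == null) return;

        float[] mine = buffers[depth]; // reused by every node at this depth
        float sum = 0f;
        for (int i = 0; i < incoming.Length; i++)
        {
            mine[i] = incoming[i] * node.Factor;
            sum += mine[i];
        }
        node.Sum = sum;

        // The left subtree only writes to deeper buffers, so "mine" is still
        // intact when the right child reads it.
        Visit(node.Left, mine, buffers, depth + 1);
        Visit(node.Right, mine, buffers, depth + 1);
    }
}
```

Whether this is measurably faster than allocating a fresh array in every node is exactly the open question here, so it would be worth timing both variants on realistic data.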

Get a copy of a large (160000+ internal object tree) object

Ok, I have a set of very large, identical, trees cached in memory (to be populated with non-identical data [they contain information about stuff inside each node]).
I want to copy a single instance of the tree, and populate each copy with a separate set of data.
However, at the moment, the cached 'blank' copy of the tree is not being copied, but simply referenced and filled with every single set of data.
How can I force the method that gets the cached blank tree to return a copy of the object, instead of a reference?

An alternative to Clone() - serialize it in the memory binary stream and then deserialize as a new instance.
EDIT
Also, if you are considering serialization and performance is your primary concern, please also take into account the following performance test: "Manual Serialization 200% + Faster than BinaryFormatter".

There are several ways, but I recommend implementing ICloneable on the tree object, and then calling Clone() to create a deep copy.
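As a rough sketch of what such a deep copy might look like: the DataNode type and its Data field are assumptions, since the question does not show the real tree classes.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical tree node; the real classes will have different members.
public sealed class DataNode : ICloneable
{
    public string Data;
    public List<DataNode> Children = new List<DataNode>();

    // Copies this node and, recursively, every node below it, so the result
    // shares no DataNode instances with the original.
    public DataNode DeepClone()
    {
        var copy = new DataNode { Data = Data };
        foreach (var child in Children)
            copy.Children.Add(child.DeepClone());
        return copy;
    }

    // ICloneable only promises "object", so expose the typed method as well.
    object ICloneable.Clone() => DeepClone();
}
```

The method that hands out the cached blank tree would then return blank.DeepClone() rather than blank itself. The recursion depth equals the tree depth, so a very deep, degenerate tree would probably need an explicit stack instead.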

I would suggest looking closely at your tree classes, and if you are going to be enforcing copy semantics, then use struct instead of class. Otherwise, use the ICloneable interface to provide a Clone() method, as chris166 suggested.

With such a large tree, having multiple copies of it will incur a lot of memory overhead. Why not just organise the data at each node (with a Dictionary, for example) so that it holds all the different data (as you're getting at the moment), but organised in a way which is convenient to your actual need?

Should a tree node have a pointer to its containing tree?

I'm building a GUI component that has a tree-based data model (e.g. the folder structure in a file system). So the GUI component basically has a collection of trees, which are just Node objects that have a key, a reference to a piece of the GUI component (so you can assign values to the Node object and it in turn updates the GUI), and a collection of Node children.
One thing I'd like to do is be able to set "styles" that apply to each level of nodes (e.g. all top-level nodes are bold, all level-2 nodes are italic, etc). So I added this to the GUI component object. To add nodes, you call AddChild on a Node object. I would like to apply the style here, since upon adding the node I know what level the node is at.
The problem is, the style info is only in the containing object (the GUI object), so the Node doesn't know about it. I could add a "pointer" within each Node to the GUI object, but that seems somehow wrong... or I could hide the Nodes and make the user only able to add nodes through the GUI object, e.g. gui.AddNode(Node new_node, Node parent), which seems inelegant.
Is there a nicer design for this that I'm missing, or are the couple of ways I mentioned not really that bad?

Adding a ParentNode property to each node is "not really that bad". In fact, it's rather common. Apparently you didn't add that property because you didn't need it originally. Now you need it, so you have good reason to add it.
Alternatives include:
Writing a function to find the parent of a child, which is processor intensive.
Adding a separate class of some sort which will cache parent-child relationships, which is a total waste of effort and memory.
Essentially, adding that one pointer into an existing class is a choice to use memory to cache the parent value instead of using processor time to find it. That appears to be a good choice in this situation.
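For illustration, a parent reference that is maintained inside AddChild might look like the sketch below. GuiNode and its members are hypothetical names, and the Level property is an extra that simply walks the parent chain (1 meaning a top-level node).

```csharp
using System.Collections.Generic;

// Hypothetical node type with a back-reference to its parent.
public sealed class GuiNode
{
    private readonly List<GuiNode> children = new List<GuiNode>();

    public string Key { get; }
    public GuiNode Parent { get; private set; } // null for a top-level node
    public IReadOnlyList<GuiNode> Children => children;

    public GuiNode(string key) { Key = key; }

    // Derived from the parent chain, so it never goes stale when nodes move.
    public int Level
    {
        get
        {
            int level = 1;
            for (var n = Parent; n != null; n = n.Parent)
                level++;
            return level;
        }
    }

    public GuiNode AddChild(GuiNode child)
    {
        child.Parent = this; // one extra reference per node
        children.Add(child);
        return child;
    }
}
```

The same pattern would work for a reference to the owning GUI object: the root nodes get it when they are attached, and AddChild copies it from the parent.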

It seems to me that the only thing you need is a Level property on the nodes, and use that when rendering a Node through the GUI object.
But it matters whether your Tree elements are Presentation agnostic like XmlNode or GUI oriented like Windows.Forms.TreeNode. The latter has a TreeView property and there is nothing wrong with that.

I see no reason why you should not have a reference to the GUI object in the node. A node cannot exist outside the GUI object, and it is useful to be able to easily find the GUI object a node is contained in.
You may not want to tie the formatting to the level the node is at if your leaf nodes may be at different levels.

How to implement a tree structure, but still be able to reference it as a flat array by index?

I want to have a tree in memory, where each node can have multiple children. I would also need to reference this tree as a flat structure by index. for example:
a1
b1
b2
b3
c1
d1
e1
e2
d2
f1
This would be represented as a flat structure in the order I laid out (i.e., a1=0, b1=1, d1=5, etc.).
Ideally I would want lookup by index to be O(1), and to support insert, add, remove, etc., with a bonus of it being thread-safe. If that is not possible, let me know.

If you have a reasonably balanced tree, you can get indexed references in O(log n) time - just store in each node a count of the number of nodes under it, and update the counts along the path to a modified leaf when you do inserts, deletions, etc. Then you can compute an indexed access by looking at the node counts on each child when you descend from the root. How important is it to you that indexed references be O(1) instead of O(log n)?
If modifications are infrequent with respect to accesses, you could compute a side vector of pointers to nodes when you are finished with a set of modifications, by doing a tree traversal. Then you could get O(1) access to individual nodes by referencing the side vector, until the next time you modify the tree. The cost is that you have to do an O(n) tree traversal after doing modifications before you can get back to O(1) node lookups. Is your access pattern such that this would be a good tradeoff for you?
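The subtree-count idea from the first paragraph could be sketched like this. CountedNode and its members are assumed names, the index is the pre-order position as in the question, and only insertion is shown.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical node that tracks how many nodes its subtree contains.
public sealed class CountedNode
{
    public string Name;
    public int SubtreeSize = 1; // counts this node itself
    public CountedNode Parent;
    public List<CountedNode> Children = new List<CountedNode>();

    public CountedNode(string name) { Name = name; }

    public CountedNode Add(CountedNode child)
    {
        child.Parent = this;
        Children.Add(child);
        // Update the counts along the path from here up to the root.
        for (var n = this; n != null; n = n.Parent)
            n.SubtreeSize += child.SubtreeSize;
        return child;
    }

    // Finds the node at the given pre-order index within this subtree
    // (index 0 is this node), using the counts to skip whole subtrees.
    public CountedNode At(int index)
    {
        if (index < 0 || index >= SubtreeSize)
            throw new ArgumentOutOfRangeException(nameof(index));

        var node = this;
        while (index != 0)
        {
            index--; // step past the current node
            foreach (var child in node.Children)
            {
                if (index < child.SubtreeSize) { node = child; break; }
                index -= child.SubtreeSize; // target is not in this child
            }
        }
        return node;
    }
}
```

Each lookup costs one child scan per level, so it only approaches O(log n) when the tree is reasonably balanced and nodes do not have very many children.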

I do something similar to this in a generic Red-Black tree of mine. Essentially, to start you need a wrapper class like Tree, which contains the actual nodes.
This is based on being able to reference the tree by index.
So you can do something like the following to set up a tree with a Key and a Value:
class Tree<K, V>
{
    // constructors and any methods you need

    // Access the Tree like an array
    public V this[K key]
    {
        get
        {
            // This works just like a property getter
            return SearchForValue(key);
        }
        set
        {
            // like a setter, you can use value for the value given
            if (SearchForValue(key) == null)
            {
                // node for index doesn't exist, add it
                AddValue(key, value);
            }
            else { /* node at index already exists... do something */ }
        }
    }
}
This works on the assumption that you already know how to create a tree, but want to be able to do stuff like access the tree by index. Now you can do something like so:
Tree<string,string> t = new Tree<string,string>();
t["a"] = "Hello World";
t["b"] = "Something else";
Console.WriteLine("t at a is: {0}", t["a"]);
Finally, for thread safety, you can add an object to your Tree class and, in any method exposed to the outside world, simply call
lock (threadsafetyobject) { /* Code you're protecting */ }
If you want something cooler for thread safety, I use an object in my tree called a ReaderWriterLockSlim, which allows multiple concurrent reads but locks everything down when you want to do a write. That is especially important if you're changing the tree's structure, like doing a rotation, whilst another thread is trying to do a read.
One last thing: I rewrote the code from memory, so it may not compile, but it should be close :)
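A minimal sketch of the ReaderWriterLockSlim pattern mentioned above, assuming a hypothetical GuardedMap wrapper. A SortedDictionary stands in for the hand-written tree so that the example is self-contained.

```csharp
using System.Collections.Generic;
using System.Threading;

// Hypothetical wrapper: many readers may hold the lock at once, while a
// writer gets exclusive access for the duration of the change.
public sealed class GuardedMap<K, V>
{
    // Stand-in for the actual tree implementation (an assumption).
    private readonly SortedDictionary<K, V> items = new SortedDictionary<K, V>();
    private readonly ReaderWriterLockSlim gate = new ReaderWriterLockSlim();

    public bool TryGet(K key, out V value)
    {
        gate.EnterReadLock();
        try { return items.TryGetValue(key, out value); }
        finally { gate.ExitReadLock(); }
    }

    public void Set(K key, V value)
    {
        gate.EnterWriteLock();
        try { items[key] = value; } // structural changes happen only here
        finally { gate.ExitWriteLock(); }
    }
}
```

The try/finally blocks matter: if the protected code throws, the lock is still released and other threads are not blocked forever.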

If you have a defined number of children for each tree (for instance, a Binary Tree), then it's not too difficult (although you potentially waste a lot of space).
If it has a variable number of children, you'd probably have to come up with some convoluted manner of storing the index of a node's first child.
I'm not seeing how it would be useful to do this, though. The point of trees is to be able to store and retrieve items below a particular node. If you want constant-time look-up by index, it doesn't sound like you want a tree at all. By storing it in an array, you have to consider the fact that if you add an element into the middle, all of the indexes you had originally stored will be invalid.
However, if indeed you want a tree, and you still want Constant Time insertion/lookup, just store a reference to the parent node in a variable, and then insert the child below it. That is constant time.

This is possible with a little work, but your insert and remove methods will become much more costly. To keep the array properly ordered, you will need to shift large chunks of data to create or fill space. The only apparent advantage is very fast traversal (minimal cache misses).
Anyhow, one solution is to store the number of children in each node, like so:
struct TreeNode
{
    int numChildren;
    /* whatever data you like */
};
Here's an example of how to traverse the tree...
TreeNode* example(TreeNode* p)
{
    /* do something interesting with p */
    int numChildren = p->numChildren;
    ++p;
    for(int child = 1; child <= numChildren; ++child)
        p = example(p);
    return p;
}
Hopefully you can derive insert, remove, etc... on your own.
:)
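For readers working in C#, the same layout can be sketched with array indices instead of pointers. FlatNode, FlatTree and their methods are made-up names, and the array is assumed to hold the nodes in pre-order with a correct child count on each.

```csharp
using System;

// Hypothetical flat representation: nodes stored in pre-order, each one
// recording only how many direct children it has.
public struct FlatNode
{
    public string Name;
    public int NumChildren;
}

public static class FlatTree
{
    // Returns the index just past the subtree rooted at "index".
    public static int SkipSubtree(FlatNode[] nodes, int index)
    {
        int next = index + 1; // the first child, if any, follows directly
        for (int c = 0; c < nodes[index].NumChildren; c++)
            next = SkipSubtree(nodes, next);
        return next;
    }

    // Index of the k-th (0-based) child of the node at "parent".
    public static int ChildIndex(FlatNode[] nodes, int parent, int k)
    {
        if (k < 0 || k >= nodes[parent].NumChildren)
            throw new ArgumentOutOfRangeException(nameof(k));
        int i = parent + 1;
        for (int c = 0; c < k; c++)
            i = SkipSubtree(nodes, i); // jump over the earlier siblings
        return i;
    }
}
```

Lookup by flat index is then a plain array access, while finding a child means skipping over its earlier siblings' subtrees, which is the cost hinted at above.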

You could always look at using Joe Celko's technique of using Nested Sets to represent trees. He's SQL-focused, but the parallels are there between nested sets and representing a tree as a flat array, and it may be useful for your ultimate reason for wanting to use an array in the first place.
As others note though, most of the time it's easier just to traverse the tree directly as linked nodes. The oft-cited array implementation of a tree is a complete binary tree (as used for a binary heap), because for a node at index n, the parent is at (n-1)/2, the left child is at 2n+1 and the right child is at 2n+2.
The downside of using arrays is that insertions, deletions, pruning and grafting all (usually) require the array to be modified when the tree changes.
You could also read up on B-trees
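The index arithmetic for the array-backed binary tree is short enough to write out. ImplicitBinaryTree is a hypothetical helper name; indices are 0-based with the root at index 0, and the Level helper is an addition for illustration.

```csharp
// Index arithmetic for a binary tree stored level by level in an array,
// with the root at index 0.
public static class ImplicitBinaryTree
{
    // Only meaningful for n > 0; the root has no parent.
    public static int Parent(int n) => (n - 1) / 2;

    public static int Left(int n) => 2 * n + 1;

    public static int Right(int n) => 2 * n + 2;

    // Depth of index n (the root is level 0), found by walking up to the root.
    public static int Level(int n)
    {
        int level = 0;
        while (n > 0)
        {
            n = (n - 1) / 2;
            level++;
        }
        return level;
    }
}
```

No child references are stored at all, but the layout only stays compact when the tree is (nearly) complete.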

Not sure if this is good. But a flat array can be addressed as a binary tree by calculating the tree level as a power and then adding the offset. But this only works if the tree is a binary one.

There is a solution for a binary tree that doesn't require you to store any index. It might be adapted to a non-binary tree if every node has exactly the same number of children.
The complexity is θ(h), where h is the height of the tree.
How to insert delete or update a value of binary tree by index?
