Searching Google and Stack Overflow turns up a lot of references to this question, including for example:
Ways to determine size of complex object in .NET?
How to get object size in memory?
So let me say at the start that I understand it is not generally possible to get an accurate measurement. However, I am not too concerned about that - I am looking for something that gives me relative values rather than absolute ones. So if they are off a bit one way or the other, it does not matter.
I have a complex object graph. It is made up of a single parent (T) with children that may have children of their own, and so on. All the objects in the graph derive from the same base class. The children are held in a List<T>.
I have tried both the serializing method and the unsafe method to calculate size. They give different answers but the 'relative' problem is the same in both cases.
I made an assumption that the size of a parent object would be larger than the sum of the sizes of the children. This has turned out not to be true. I calculated the size of the parent. Then summed the size of the children. In some cases this appeared to make sense but in others the sum of the children far exceeded the size determined for the parent.
So my question is: why can serializing a parent object result in a size that is less than the sum of the serialized sizes of its children? The only answer I have come up with is that each serialized object carries a fixed overhead (which I guess is self-evident) and the sum of these can exceed the 'own size' of the parent. If that is so, is there any way to determine what that overhead might be so that I can take account of it?
Many thanks in advance for any suggestions.
EDIT
Sorry, I forgot to say that all objects are marked [Serializable]. The serialization method is:
var bf = new BinaryFormatter();
var ms = new MemoryStream();
bf.Serialize(ms, testObject);
byte[] array = ms.ToArray();
return array.Length;
It will really depend on which serialization mechanism you use for serializing the objects. It's possible that it's not serializing the child elements, which is one reason why you'd see the parent size come out smaller than the sum of the children (possibly even smaller than each individual child).
If you want to know the relative size of an object, make sure that you're serializing all the fields of all objects in your graph.
Edit: so, if you're using the binary formatter, then you must look at the specification for the format used by that serializer to understand the overhead. The format specification is public, and can be found at http://msdn.microsoft.com/en-us/library/cc236844(prot.20).aspx. It's not very easy to digest, but if you're willing to put the time to understand it, you'll find exactly how much overhead each object will have in its serialized form.
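For a rough feel for that overhead, one option is to measure it empirically rather than decode the spec: serialize arrays of a trivial [Serializable] type of different lengths and see how the byte count grows per instance. This is only a sketch (the Empty and OverheadProbe names are made up) and the numbers will differ for your real types, but it gives a ballpark per-object cost:
using System;
using System.IO;
using System.Linq;
using System.Runtime.Serialization.Formatters.Binary;

[Serializable]
class Empty { }   // no fields, so its serialized form is essentially pure overhead

static class OverheadProbe
{
    static long SizeOf(object o)
    {
        var bf = new BinaryFormatter();
        using (var ms = new MemoryStream())
        {
            bf.Serialize(ms, o);
            return ms.Length;
        }
    }

    static void Main()
    {
        // Type and assembly metadata are written once per stream, so the growth
        // per extra instance approximates the fixed per-object record overhead.
        long small = SizeOf(Enumerable.Range(0, 1).Select(_ => new Empty()).ToArray());
        long large = SizeOf(Enumerable.Range(0, 100).Select(_ => new Empty()).ToArray());
        Console.WriteLine("per-object overhead ~ " + (large - small) / 99.0 + " bytes");
    }
}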
Related
TL;DR: How expensive is allocating and initializing an array of relatively large size K, compared to going through such an array, multiplying each float by a fixed float, and writing the result into a different array of size K? E.g. is there a big difference between doing both (allocate, initialize, multiply) and just doing the latter (multiplying the elements of an existing array)?
I'll be recursively traversing an unknown binary tree of height at most ~50 which has some attributes already set, and I'll be filling out some other attributes. Each node contains an array/list of around 10k floats; I'll do some operation on all the floats of this array in the node - say, multiply them by a float - and then the resulting 10k array will go down to the children.
Now, to me the neatest way to implement this would be to just declare a tree node class and recursively create the children, etc. Thing is, the values in a node can only really affect its children, so I don't actually need to keep an array in each node. Since I have a bound of ~50 on the height of the tree, I could just use 50 arrays, one for each level of the tree. That seems a bit ugly to me, though. Also, I'll still have to do the actual work of multiplying and writing down the values, so assuming allocation and initialization of a new array is linear in the size of the array (and I don't really know if that's true), I'd still be doing 10k * (number of nodes) multiplications, as opposed to 10k * (number of nodes) * (multiplication + initialization) operations.
If it is a noticeable difference (which to me, something like say a 0.1% difference is not), what would be the right way to use arrays to keep track of the floats in the current node? Using static arrays would seem like the simplest solution, but I'm sure that might be considered bad practice. I can't really think of another simple solution.
I was sort of expecting to just make a tree node class and give it a traversal/"go down" method, keeping all the stuff in one class. But I guess I could make a separate class with a function that recursively traverses the tree; that might be cleaner anyway, since the new class could correspond to an actual "tree" object. Then I could give the tree object a field containing the arrays for each level.
Alternatively, I could still use just a node class with a traversal method, and just keep passing down a reference to an object with all of the 50 arrays.
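For what it's worth, the "one buffer per level" version could look roughly like the sketch below. Everything here (Node, Factor, Tree, Traverse) is a placeholder, not a drop-in implementation; the point is that the buffers are allocated once and each level's array is overwritten in place as the traversal descends:
using System.Linq;

class Node
{
    public float Factor;          // placeholder for whatever per-node attribute drives the multiply
    public Node Left, Right;
}

class Tree
{
    const int MaxDepth = 50;
    const int VectorLength = 10000;

    // One reusable buffer per level, allocated once up front
    // (+2 leaves slack for the root's input row and for how "height" is counted).
    readonly float[][] levels = Enumerable.Range(0, MaxDepth + 2)
                                          .Select(_ => new float[VectorLength])
                                          .ToArray();

    // levels[depth] holds the values coming from the parent; this node writes
    // its result into levels[depth + 1], which its children then read.
    public void Traverse(Node node, int depth)
    {
        if (node == null) return;
        float[] fromParent = levels[depth];
        float[] current = levels[depth + 1];
        for (int i = 0; i < VectorLength; i++)
            current[i] = fromParent[i] * node.Factor;   // overwrite, never reallocate
        Traverse(node.Left, depth + 1);                 // the left subtree only writes deeper levels,
        Traverse(node.Right, depth + 1);                // so levels[depth + 1] is still intact here
    }
}
The caller would fill levels[0] with the root's input and call Traverse(root, 0); because a subtree only ever writes to levels deeper than its own, the right child can still read its parent's output after the left subtree has finished.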
Ok, I want to do the following; to me it seems like a good idea, so if there's no way to do exactly what I'm asking, I'm sure there's a reasonable alternative.
Anyways, I have a sparse matrix. It's pretty big and mostly empty. I have a class called MatrixNode that's basically a wrapper around each of the cells in the matrix. Through it you can get and set the value of that cell. It also has Up, Down, Left and Right properties that return a new MatrixNode that points to the corresponding cell.
Now, since the matrix is mostly empty, having a live node for each cell, including the empty ones, is an unacceptable memory overhead. The other solution is to make a new instance of MatrixNode every time a node is requested. This would make sure that only the needed nodes are kept in memory and the rest get collected. What I don't like about it is that a new object has to be created every time; I'm worried it will be too slow.
So here's what I've come up with. Have a dictionary of weak references to nodes. When a node is requested, if it doesn't exist, the dictionary creates it and stores it as a weak reference. If the node does already exist (probably referenced somewhere), it just returns it.
Then, if the node doesn't have any live references left, instead of letting it be collected, I want to store it in a pool. Later, when a new node is needed, I want to first check the pool and only make a new node if there isn't one already available that can just have its data swapped out.
Can this be done?
A better question would be, does .NET already do this for me? Am I right in worrying about the performance of creating single use objects in large numbers?
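For concreteness, the weak-reference cache described above could be sketched roughly like this (MatrixNode and NodeCache are placeholder names, and this covers only the caching half, not the pooling half):
using System;
using System.Collections.Generic;

class MatrixNode
{
    public int Row, Col;
    public double Value;
}

class NodeCache
{
    readonly Dictionary<(int, int), WeakReference<MatrixNode>> cache =
        new Dictionary<(int, int), WeakReference<MatrixNode>>();

    public MatrixNode Get(int row, int col)
    {
        MatrixNode node;
        if (cache.TryGetValue((row, col), out var weak) && weak.TryGetTarget(out node))
            return node;                                   // still referenced somewhere: hand back the same instance

        node = new MatrixNode { Row = row, Col = col };    // otherwise create (or, in your scheme, un-pool) one
        cache[(row, col)] = new WeakReference<MatrixNode>(node);
        return node;
    }
}
Note that dead WeakReference entries linger in the dictionary until you prune them; that bookkeeping is part of the cost the answers below weigh against simply allocating new nodes.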
Instead of guessing, you should make a performance test to see if there are any issues at all. You may be surprised to know that managed memory allocation can often outperform explicit allocation because your code doesn't have to pay for deallocation when your data goes out of scope.
Performance may become an issue only when you are allocating new objects so frequently that the garbage collector has no chance to collect them.
That said, there are sparse matrix implementations in C# already, like Math.NET and MetaNumerics. These libraries are already optimized for performance and will probably avoid the issues you would run into if you started your implementation from scratch.
An SO search for c# and sparse-matrix returns many related questions, including answers pointing to commercial libraries like ILNumerics (which has a community edition), NMath, and Extreme Optimization's libraries.
Most sparse matrix implementations use one of a few well-known storage schemes; I generally recommend CSR (compressed sparse row) or CSC (compressed sparse column), as those are efficient for common operations.
If that seems too complex, you can start with COO (coordinate format). What this means in your code is that you store nothing for empty members, but you keep one item for every non-empty one. A simple implementation might be:
public struct SparseMatrixItem
{
    public int Row;      // row index of the non-empty cell
    public int Col;      // column index
    public double Value; // stored value
}
And your matrix would generally be a simple container:
public interface SparseMatrix
{
    IList<SparseMatrixItem> Items { get; }
}
You should make sure that the Items list stays sorted according to the row and col indices, because then you can use binary search to quickly find out if an item exists for a specific (i,j).
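A sketch of that lookup, assuming the fields of SparseMatrixItem are public as shown above and the list is kept sorted by row and then column (the comparer and helper names are made up):
using System.Collections.Generic;

class SparseMatrixItemComparer : IComparer<SparseMatrixItem>
{
    public int Compare(SparseMatrixItem x, SparseMatrixItem y)
    {
        int byRow = x.Row.CompareTo(y.Row);
        return byRow != 0 ? byRow : x.Col.CompareTo(y.Col);
    }
}

static class SparseMatrixLookup
{
    static readonly SparseMatrixItemComparer Comparer = new SparseMatrixItemComparer();

    // Returns true and the stored value if (i, j) is present; false (and 0) otherwise.
    public static bool TryGetValue(List<SparseMatrixItem> items, int i, int j, out double value)
    {
        var probe = new SparseMatrixItem { Row = i, Col = j };
        int index = items.BinarySearch(probe, Comparer);
        value = index >= 0 ? items[index].Value : 0.0;
        return index >= 0;
    }
}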
The idea of having a pool of objects that people use and then return to the pool is used for really expensive objects. Objects representing a network connection, a new thread, etc. It sounds like your object is very small and easy to create. Given that, you're almost certainly going to harm performance pooling it; the overhead of managing the pool will be greater than the cost of just creating a new one each time.
Having lots of short lived very small objects is the exact case that the GC is designed to handle quickly. Creating a new object is dirt cheap; it's just moving a pointer up and clearing out the bits for that object. The real overhead for objects comes in when a new garbage collection happens; for that it needs to find all "alive" objects and move them around, leaving all "dead" objects in their place. If your small object doesn't live through a single collection it has added almost no overhead. Keeping the objects around for a long time (like, say, by pooling them so you can reuse them) means copying them through several collections, consuming a fair bit of resources.
In a tile-based game, I store some data via JSON, and the JSON file contains the names of classes along with their data.
Is it a problem to instantiate a new class each time I want to add a new object? (This ends up being around 100 instantiations.) I tried to use static classes, but because each new object can differ from the previous one, I need to create a new instance each time, so I think a static class is not the right option.
Is there a third option possible?
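For reference, the "class name plus data from JSON" pattern described here usually boils down to something like the sketch below; TileEntry and TileFactory are just placeholder names, and the JSON parsing itself is left out:
using System;

// Hypothetical shape of one record from the JSON file.
class TileEntry
{
    public string ClassName;   // e.g. "MyGame.GrassTile"
    public string Data;        // whatever per-object data the file carries
}

static class TileFactory
{
    public static object Create(TileEntry entry)
    {
        Type type = Type.GetType(entry.ClassName);    // resolve the type by its name
        if (type == null)
            throw new ArgumentException("Unknown class: " + entry.ClassName);

        object tile = Activator.CreateInstance(type); // one small allocation per object
        // ...populate the instance from entry.Data here...
        return tile;
    }
}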
What matters is the number of objects multiplied by their sizes. In general, 100 objects should not be a problem. It might be a problem if you throw them away and recreate them 60 times a second.
The resulting memory footprint and the amount of garbage collection runs might then be a problem.
Do the math, do not optimize prematurely, and post the results to get more precise advice.
Why do we need reference types in .NET?
I can think of only one case: they support sharing data between different functions and hence give a storage optimization.
Other than that, I cannot think of any reason why reference types are needed.
Why do we need reference types in .NET? I can think of only one reason: they support sharing of data and hence give a storage optimization.
You've answered your own question. Do you need a better reason than that?
Suppose every time you wanted to refer to the book The Hobbit, you had to instead make a copy of the entire text. That is, instead of saying "When I was reading The Hobbit the other day...", you'd have to say "When I was reading In a hole in the ground there lived a hobbit... [all the text] ... Well thank goodness for that, said Bilbo, handing him the tobacco jar. the other day..."
Now suppose every time you used a database in a program, instead of referring to the database, you simply made a full copy of the entire database, every single time you used any of it in any way. How fast do you think such a program would be?
References allow you to write sentences that talk about books by using their titles instead of their contents. Reference types allow you to write programs that manipulate objects by using small references rather than enormous quantities of data.
class Node {
Node parent;
}
Try implementing that without a reference type. How big would it be? How big would a string be? An array? How much space would you need to reserve on the stack for:
string s = GetSomeString();
How would any data be used in a method that wasn't specific to one call-path? Multi-threaded code, for example.
Three reasons that I can think of off the top of my head.
You don't want to continually copy objects every time you need to pass them to a method or store them in a collection.
When iterating through collections, you may want to modify the original object with new values.
Limited Stack Space.
If you look at value types like int, long, and float, you can see that the biggest of them stores 8 bytes (64 bits).
However, think about a list or an array of long values: a list of 1000 values will take 8000 bytes.
Now, passing those 8000 bytes by value would make our program run much slower, because the function that takes the list as a parameter would have to copy all of those values into a new list, and that costs us both time and space.
That's why we have reference types: if we pass that list, we don't lose time and space copying it, because we only pass the address of the list in memory.
The parameter inside the function works on the same address as the list you passed, and if you want to copy that list you can do so manually.
By using reference types we save time and space because we don't have to copy the argument we pass.
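To make that concrete, here is a small (made-up) example: the 1000-element list is never copied when it is passed; only a reference is, which is also why changes made inside the method are visible to the caller.
using System;
using System.Collections.Generic;

static class ReferenceDemo
{
    static void Scale(List<long> values, long factor)
    {
        // 'values' is a copy of the reference, not of the 1000 longs.
        for (int i = 0; i < values.Count; i++)
            values[i] *= factor;
    }

    static void Main()
    {
        var data = new List<long>(new long[1000]);   // 1000 zeros
        data[0] = 21;
        Scale(data, 2);                              // no 8000-byte copy happens here
        Console.WriteLine(data[0]);                  // prints 42: the caller sees the change
    }
}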
What is the best way to deep clone an interconnected set of objects? Example:
class A {
B theB; // optional
// ...
}
class B {
A theA; // optional
// ...
}
class Container {
A[] a;
B[] b;
}
The obvious thing to do is walk the objects and deep clone everything as I come to it. This creates a problem however -- if I clone an A that contains a B, and that B is also in the Container, that B will be cloned twice after I clone the Container.
The next logical step is to create a Dictionary and look up every object before I clone it. This seems like it could be a slow and ungraceful solution, however.
Any thoughts?
It's not an elegant solution for sure, but it isn't uncommon to use a dictionary (or hash map). One of the benefits is that a hash map has constant average lookup time, so speed does not really suffer here.
I am not that familiar with C#, but typically any kind of graph crawl for processing needs a lookup table to avoid re-processing objects reached again through cyclic references. So I would think you will need to do the same here.
The dictionary solution you suggested is the best I know of. To optimize further, you could use object.GetHashCode() to get a hash for the object, and use that as the dictionary key. Should be fast unless you're talking about huge object trees (10s to 100s of thousands of objects).
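A sketch of that memo-dictionary, with one tweak worth naming: instead of using the hash code as the key, it keys on the original object reference itself through a reference-equality comparer, which sidesteps hash collisions between distinct objects. All names here are placeholders:
using System;
using System.Collections.Generic;
using System.Runtime.CompilerServices;

// Compares keys by reference identity, so two distinct-but-equal objects
// still get separate clones.
sealed class ReferenceComparer : IEqualityComparer<object>
{
    bool IEqualityComparer<object>.Equals(object x, object y) => ReferenceEquals(x, y);
    int IEqualityComparer<object>.GetHashCode(object obj) => RuntimeHelpers.GetHashCode(obj);
}

class CloneContext
{
    readonly Dictionary<object, object> map =
        new Dictionary<object, object>(new ReferenceComparer());

    // Registers the clone *before* filling in its fields, so cyclic references
    // (an A whose B points back at that A) terminate instead of recursing forever.
    public T GetOrAdd<T>(T original, Func<T> createEmpty, Action<T, T> fill) where T : class
    {
        if (original == null) return null;
        if (map.TryGetValue(original, out object existing)) return (T)existing;

        T clone = createEmpty();
        map[original] = clone;
        fill(original, clone);
        return clone;
    }
}
Each class would then expose something like DeepClone(CloneContext ctx) that calls GetOrAdd, creating an empty instance and then cloning its own fields through the same context.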
Maybe create a flag on each object to indicate whether it has been cloned before.
Another possible solution you could investigate is serializing the objects into a stream, and then reconstructing them from that same stream into new instances. This often works wonders when everything else seems awfully convoluted and messy.
One of the practical ways to do deep cloning is serializing and then deserializing a source graph. Some serializers in .NET like DataContractSerializer are even capable of processing cycles within graphs. You can choose which serializer is the best choice for your scenario by looking at the feature comparison chart.
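As a sketch of that round-trip approach, assuming the types in the graph are usable with DataContractSerializer (e.g. marked with [DataContract]/[DataMember]); DeepCloner is just a placeholder name:
using System.IO;
using System.Runtime.Serialization;

static class DeepCloner
{
    // Serializes the graph and reads it back. With PreserveObjectReferences set,
    // shared references and cycles come back shared, not duplicated.
    public static T Clone<T>(T source)
    {
        var serializer = new DataContractSerializer(
            typeof(T),
            new DataContractSerializerSettings { PreserveObjectReferences = true });

        using (var stream = new MemoryStream())
        {
            serializer.WriteObject(stream, source);
            stream.Position = 0;
            return (T)serializer.ReadObject(stream);
        }
    }
}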