I'm designing a WCF service that will return a list of objects that are describing a person in the system.
The record count is really big, and I have some properties like the person's sex.
Is it better to create a new enum (a complex type that consumes more bandwidth) named Sex with two values (Male and Female), or to use a primitive type for this, like bool IsMale?
Very little point switching to bool; which is bigger:
<gender>male</gender> (or an attribute gender="male")
or
<isMale>true</isMale> (or an attribute isMale="true")
Not much in it ;-p
The record count is really big...
If bandwidth becomes an issue, and you control both ends of the service, then rather than change your entities you could look at some other options:
pre-compress the data as (for example) gzip, and pass a byte[] or Stream instead, remembering to enable MTOM on the service
(or) switch serializer; protobuf-net has WCF hooks, and can achieve significant bandwidth improvements over the default DataContractSerializer (again: enable MTOM). In a test based on Northwind data (here) it reduced 736,574 bytes to 133,010, and reduced the CPU required to process it (win:win). For info, it reduces enums to integers, typically requiring only 1 byte for the enum value and 1 byte to identify the field; contrast to <gender>Male</gender>, which under UTF8 is 21 bytes (more for most other encodings), or gender="male" at 14 bytes.
However, either change will break your service if you have external callers who are expecting regular SOAP...
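If you did go the protobuf-net route, a minimal sketch of an annotated contract might look like this (the DTO shape and enum values are invented, and the WCF endpoint wiring isn't shown):

using ProtoBuf;

// Hypothetical DTO: protobuf-net writes the enum as a small varint (field number
// plus value, typically 2 bytes here) instead of an element like <gender>Male</gender>.
[ProtoContract]
public class PersonDto
{
    [ProtoMember(1)]
    public string Name { get; set; }

    [ProtoMember(2)]
    public Gender Gender { get; set; }
}

public enum Gender
{
    Male = 1,
    Female = 2
}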
The reason to not use an enum is that XML Schema does not have a concept equivalent to an enum. That is, it does not have a concept of named values. The result is that enums don't always translate between platforms.
Use a bool, or a single-character field instead.
I'd suggest you model it in whatever way seems most natural until you run into a specific issue or encounter a requirement for a change.
WCF is designed to abstract the underlying details, and if bandwidth is a concern then I think a bool, int or enum will all probably end up as 4 bytes. You could optimize by using a bitmask or a single byte.
Again, ease of use of the API and maintainability are probably more important; which do you prefer?
if( user[i].Sex == Sexes.Male )
if( user[i].IsMale ) // Could also expose .IsFemale
if( user[i].Sex == 'M' )
etc. Of course you could expose multiple.
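For example, nothing stops you exposing both the enum and a convenience flag; a minimal sketch (names matching the snippets above):

public enum Sexes { Male, Female }

public class Person
{
    public Sexes Sex { get; set; }

    // Convenience wrappers so callers can use whichever form reads best.
    public bool IsMale => Sex == Sexes.Male;
    public bool IsFemale => Sex == Sexes.Female;
}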
I have a few scenarios where I need to store an unlimited value (or a maximum, whatever you like to call it), which represents "no limit" in business terms.
A few options I considered:
Make the field nullable and use DB NULL to represent the case, but the problem is that I have to check for it everywhere I need to do a comparison or display the value.
Use the actual maximum value of the given type (for example, for an integer I can use the largest Int32 value), but this needs some tweaks at the DB level: I have to write a constraint on the field (since I could be using a fixed-length decimal or integer DB type) to limit the maximum value, and the value could have no meaning to the business either.
Use a predefined big value (one that might make sense to the business) to represent it and store that at the DB level; again, I have to write a constraint on the DB field.
I have used all of them before in different scenarios, and none of them is too bad, but you know, it's a pain to handle some of the specific cases.
My question is a bit broad: what do you suggest for this? What good/best practices are available?
Any help/suggestions are appreciated.
I would think that storing it as a separate column, IsXyzUnlimited, may be a good alternate practice.
Since it doesn't mean null, it may not be best to represent it as null. As you mentioned, there is also the problem of checking it before you invoke it.
Also, as you mentioned, the other 2 values could have business meaning. If you want the data to be self-revealing about the business, explicitly say "hey business, this thing is unlimited when this box is checked". No magic values.
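A minimal sketch of what that might look like on the entity (the column and property names are made up):

public class StorageQuota
{
    // Regular value column; only meaningful when the flag below is false.
    public decimal MaxSizeMb { get; set; }

    // Separate flag column: "unlimited" is stated explicitly, no magic values
    // and no overloading of NULL.
    public bool IsMaxSizeUnlimited { get; set; }

    public bool Allows(decimal requestedMb) =>
        IsMaxSizeUnlimited || requestedMb <= MaxSizeMb;
}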
Normally, I'd never have to ask myself whether a given scenario is better suited to a struct or class and frankly I did not ask that question before going the class way in this case. Now that I'm optimizing, things are getting a little confusing.
I'm writing a number crunching application that deals with extremely large numbers containing millions of Base10 digits. The numbers are (x,y) coordinates in 2D space. The main algorithm is pretty sequential and has no more than 200 instances of the class Cell (listed below) in memory at any given time. Each instance of the class takes up approximately 5MB of memory resulting in no more than 1GB in total peak memory for the application. The finished product will run on a 16 core machine with 20GB of RAM and no other applications hogging up the resources.
Here is the class:
// Inheritance is convenient but not absolutely necessary here.
public sealed class Cell: CellBase
{
// Will contain numbers with millions of digits (512KB on average).
public System.Numerics.BigInteger X = 0;
// Will contain numbers with millions of digits (512KB on average).
public System.Numerics.BigInteger Y = 0;
public double XLogD = 0D;
// Size of the array is roughly Base2Log(this.X).
public byte [] XBytes = null;
public double YLogD = 0D;
// Size of the array is roughly Base2Log(this.Y).
public byte [] YBytes = null;
// Tons of other properties for scientific calculations on X and Y.
// NOTE: 90% of the other fields and properties are structs (similar to BigInteger).
public Cell (System.Numerics.BigInteger x, System.Numerics.BigInteger y)
{
this.X = x;
this.XLogD = System.Numerics.BigInteger.Log(x, 2);
this.XBytes = x.ToByteArray();
this.Y = y;
this.YLogD = System.Numerics.BigInteger.Log(y, 2);
this.YBytes = y.ToByteArray();
}
}
I chose to use a class instead of a struct simply because it 'felt' more natural. The number of fields, methods and memory all instinctively pointed to classes as opposed to structs. I further justified that by considering how much overhead temporary assignment calls would have since the underlying primary objects are instances of BigInteger, which itself is a struct.
The question is, have I chosen wisely here considering speed efficiency is the ultimate goal in this case?
Here's a bit about the algorithm in case it helps. In each iteration:
Sorting performed once on all 200 instances. 20% of execution time.
Calculating neighboring (x,y) coordinates of interest. 60% of execution time.
Parallel/Threading overhead for point 2 above. 10% of execution time.
Branching overhead. 10% of execution time.
The most expensive function: BigInteger.ToByteArray() (implementation).
This would be a better fit as a class, for many reasons, including:
It doesn't logically represent a single value
It's larger than 16 bytes
It's mutable
For details, see Choosing Between Classes and Structures.
In addition, I'd suggest it's better suited to a class given that:
It contains reference types (arrays). Structures containing reference types are rarely a good design idea.
This is especially true, though, given what you're doing. If you were to use a struct, sorting would require copies of the entire struct, instead of just copies of the references. Method calls (unless passed by ref) would incur a huge overhead, as well, since you'd be copying all of the data.
Parallelization of items in a collection could also likely incur huge overhead, as the bounds checking on any array of the struct (i.e., if it's kept in a List<Cell> or similar) would cause bad false sharing, since all access into the list would touch the memory at the start of the list.
I would recommend leaving this as a class, and, in addition, I would suggest trying to move the fields into properties, and making the class as immutable as possible. This will help keep your design clean, and less likely to be problematic when multithreading.
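A rough sketch of that suggestion, keeping the same fields as the question (CellBase as in your code) but behind get-only properties; the byte arrays themselves remain mutable, so treat them as write-once by convention:

using System.Numerics;

public sealed class Cell : CellBase
{
    public Cell(BigInteger x, BigInteger y)
    {
        X = x;
        XLogD = BigInteger.Log(x, 2);
        XBytes = x.ToByteArray();

        Y = y;
        YLogD = BigInteger.Log(y, 2);
        YBytes = y.ToByteArray();
    }

    // Get-only auto-properties: nothing can be reassigned after construction,
    // which makes the instances safer to share between worker threads.
    public BigInteger X { get; }
    public BigInteger Y { get; }
    public double XLogD { get; }
    public double YLogD { get; }
    public byte[] XBytes { get; }
    public byte[] YBytes { get; }
}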
It's hard to tell based on what you've written (we don't know how often you end up copying a value of type Cell for example) but I would strongly expect a class to be the correct approach here.
The number of methods in the class is irrelevant, but if it has lots of fields you need to consider the impact of copying all those fields any time you pass a value to another method (etc).
Fundamentally it doesn't feel like a value type to start with - but I understand that if performance is particularly important, the philosophical aspects may not be as interesting to you.
So yes, I think you've made the right decision, and I see no reason to believe anything else at the moment - but of course if you can easily change the decision and test it as a struct, that would be better than guesswork. Performance is remarkably difficult to predict accurately.
Since your class contains arrays, which consume most of your memory, and you have only around 200 Cell instances, the memory consumption of the class itself is not an issue. You were right that a class felt more natural; it is indeed the right choice. My guess would be that the comparison of XBytes[] and YBytes[] is what limits your sorting time. It all depends on how big your arrays are and how you perform the comparison.
Let's start ignoring the performance matters, and work up to them.
Structs are ValueTypes and ValueTypes are value-types. Integers and DateTimes are value-types and a good comparison. There's no sense in talking about how one 1 is or isn't the same as another 1, or how one 2010-02-03T12:45:23.321Z is or isn't the same as another 2010-02-03T12:45:23.321Z. They may have different significance in different uses, but the fact that 1 == 1 and 1 != 2, and that 2010-02-03T12:45:23.321Z == 2010-02-03T12:45:23.321Z while 2010-02-03T12:45:23.321Z != 2931-03-05T09:21:29.43Z, is inherent to the nature of integers and date-times, and that's what makes them value-types.
That's the purest way of thinking about this. If it matches the above it's a value-type, if it doesn't, it's a reference type. Nothing else comes into it.
Extension 1: If an X can have an X then it has to be a reference type. Whether this logically follows from what was said above is debatable, but whatever you think on the matter you can't have a struct that has an instance of another one of itself as a member (directly or indirectly) in practice, so that's that.
Extension 2: Some say that the difficulties that come from mutable structs come from the above, and some do not. Again though, whatever you think on the matter, there are practical difficulties. A mutable struct can be useful in a few cases, but they cause enough confusion that they should be restricted to private cases as an optimisation rather than public cases as a matter of course.
Here comes the performance bit...
Value types and reference types have different characteristics in different cases that affects the speed, the memory use, and the way that memory use affects garbage collection in several ways giving each different pros and cons as far as performance goes. Just how much attention we pay to that, depends on how much we need to get down to that level. It's worth saying right now that the ways in which they differ tends to balance to a win if you follow the above rule on deciding between struct and class so if we start thinking about this beyond that, we're at least bordering on optimisation territory.
Optimisation level 1.
If a value type instance will contain more than 16 bytes, it should probably be made a reference type. This is sometimes even stated as a "natural" difference rather than one of optimisation. Strictly there's nothing in "value type" that entails "16 or fewer bytes", but it does tend to balance out that way.
Moving away from the simplistic "16 bytes" rule, the smaller it is, the faster it is to copy, and vice versa, so bending the rule for a 20-byte instance has less impact than bending it for a 200-byte instance.
Will you need to box and unbox a lot? Since the introduction of generics we've been able to avoid many of the cases where we would have boxed and unboxed with 1.0 and 1.1, so this isn't as big a deal as it once was, but if you do it will hurt performance.
Optimisation level 2.
The fact that value types can be placed on the stack, placed directly in an array (rather than references to them), and be direct fields of a struct or class (again, rather than references to them) can make access to them and to their fields faster.
If you're going to create an array of them and all-zero values are a useful starting point for you, you get that immediately, whereas with reference types you get an array of nulls. This can make structs faster.
Edit: Something that extends from the above, if you are going to be iterating through arrays rapidly, then as well as the direct-access giving a boost over following the reference, you'll be loading a couple of instances into CPU cache at a time (64 bytes worth on current x86-32 or x86-64/amd, 128 bytes worth on ia-64). It has to be a pretty tight loop to matter, but there are cases where it does.
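A tiny illustration of the zero-initialisation point above (the types are invented for the example):

using System;

public struct PointValue { public int X, Y; }
public class  PointRef   { public int X, Y; }

static class ZeroInitDemo
{
    static void Main()
    {
        var values = new PointValue[1000]; // 1000 usable (0,0) points, stored contiguously
        var refs   = new PointRef[1000];   // 1000 nulls; each element still needs its own allocation

        Console.WriteLine(values[0].X);     // 0
        Console.WriteLine(refs[0] == null); // True
    }
}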
Pretty much most "I went for struct rather than class for performance" comes down to either the first point, or the first in combination with the second.
Optimisation level 3.
If you have cases where some of the values you are concerned with are duplicates of each other, and they are large in size, then with immutable instances (or mutable instances you simply never mutate once you start doing what follows) you can deliberately alias different references so that you save a lot of memory, because your e.g. 20 duplicate objects of 2KiB each are actually the same object, saving 38KiB in that case. It can also make comparisons faster, because the cases where you can short-cut on identity are more frequent. This can only be done with reference types.
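A sketch of that interning idea, assuming immutable instances (the cache and type names are made up):

using System.Collections.Generic;

// Hypothetical interning cache for large immutable payloads: callers that ask for
// the same content get the same reference, so duplicates cost one allocation.
sealed class Payload
{
    public Payload(string text) { Text = text; }
    public string Text { get; }
}

static class PayloadPool
{
    static readonly Dictionary<string, Payload> cache = new Dictionary<string, Payload>();

    public static Payload Get(string text)
    {
        Payload existing;
        if (!cache.TryGetValue(text, out existing))
        {
            existing = new Payload(text);
            cache[text] = existing;
        }
        return existing;
    }
}

// ReferenceEquals(PayloadPool.Get(s), PayloadPool.Get(s)) is true, so equality
// checks can short-circuit on identity and the duplicates share one copy.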
Optimisation level 4.
Structs that contain arrays do, though, alias the contained array, and could internally use the above technique, balancing out that point, though it's somewhat more involved.
Optimisation level X.
It doesn't matter how much thinking about these pros and cons leads to a particular answer if actually measuring the results leads to a different one. Since there are both pros and cons, it's always possible to get this wrong.
In thinking about 1 through 4, along with the differences between value and reference types aside from such optimisation concerns, I think you should go for a class.
In thinking about level X, I wouldn't be amazed if actually testing it proved me wrong. The best bit is, if it is arduous to change from class to struct (say, because you make heavy use of aliasing or of the possibility of a null value), then you can be pretty confident that doing so is a loss. If it isn't arduous, then you can just do so and measure! I'd strongly suggest measuring with a test that involves a real run rather than doing something 10,000 times: who cares if you can do a given operation 10,000 times in a few fewer seconds if you do a different operation 20 times more often in the real thing?
A struct can only contain an array-type field safely if either (1) the state of the struct depends upon the identity of the array rather than its contents (as is the case with ArraySegment), or (2) no reference to the array will ever be held by anything that might try to mutate it (typically, this means that the array field will be private, and the struct itself will create the array and perform all modifications that will ever be done to it, before storing a reference in the field).
I advocate using structs much more commonly than other people here, but the fact that your data storage thingie would have two array-type fields would seem a strong argument against using a struct.
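For what it's worth, the second safe case might look roughly like this (an invented example; the struct clones the array up front and never exposes it):

// The struct builds its own private copy of the array and never hands it out,
// so nothing outside can mutate the state behind its back.
public struct DigitBuffer
{
    private readonly byte[] digits;

    public DigitBuffer(byte[] source)
    {
        digits = (byte[])source.Clone();
    }

    public byte this[int index] => digits[index];
    public int Length => digits == null ? 0 : digits.Length;
}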
Let's say I have a double length that can either be a real length or "not ready yet", since the server has no length yet and there is nothing to send to the client. We need to pass this length from the server to the client as part of a fixed data protocol. The client currently uses the length only once, but might use it more than once in the future.
Pass double length and bool isLengthValid, and in every place you use length, check isLengthValid.
- Clean design, without mixing data types, but the user has to remember to check.
Pass double? length, and in every place you use length, check whether length == null.
- The design is clear (since it's a nullable), but only if you look at the type. Also, there will be an exception if someone uses it without checking (good or bad, depending on how you look at it).
Make a class Length instead of a double. The class will have a clear interface, GetLengthIfYouCheckedIt or something like that.
- Very readable and hard to misuse, but the design is a little overdone.
What is your solution?
I say option 2:
What you want is precisely why nullables were introduced.
Instead of adding a method to check whether it's a valid number or not, you'd use the built-in Nullable<double>.HasValue, just as it was meant to be used.
Making a class for Length makes it doubly closed: it's only for LENGTH and it only holds a double. Think of how many such classes you'll have to make and maintain for TIME/DateTime, MONEY/Decimal, etc. It will never end.
Option 1 is just your own hand-rolled Nullable<T> rewrapped under another name.
In other words, enforce the DRY principle, and use Nullable<T> ;)
HTH,
Bab.
I'd pass a double?. That's essentially a double plus a bool value indicating whether it's valid, so option 1 would just be reinventing nullable. I think option 3 is overkill.
My advice would be to use a nullable, like this: public double? Length;
You will get members like Length.HasValue and Length.Value, which makes the code easy to read and quicker for you to use (quicker meaning there is no need to write a new class, etc.).
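A quick sketch of the nullable option in use (type and member names are just for illustration):

using System;

public class MeasurementDto
{
    // null means "the server has not produced a length yet".
    public double? Length { get; set; }
}

static class LengthDemo
{
    static void Render(MeasurementDto dto)
    {
        if (dto.Length.HasValue)
            Console.WriteLine("Length: " + dto.Length.Value);
        else
            Console.WriteLine("Length not ready yet");
    }
}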
Why not just keep it as a length parameter but return -1?
If possible, I would suggest making the request async, so that you do not return anything to the client until the data is actually ready.
If that is not possible, go with the second option.
Why do we need reference types in .NET?
I can think of only one case: they support sharing data between different functions and hence give a storage optimization.
Other than that, I could not come up with any reason why reference types are needed.
Why do we need reference types in .NET? I can think of only one reason: that it support sharing of data and hence gives storage optimization.
You've answered your own question. Do you need a better reason than that?
Suppose every time you wanted to refer to the book The Hobbit, you had to instead make a copy of the entire text. That is, instead of saying "When I was reading The Hobbit the other day...", you'd have to say "When I was reading In a hole in the ground there lived a hobbit... [all the text] ... Well thank goodness for that, said Bilbo, handing him the tobacco jar. the other day..."
Now suppose every time you used a database in a program, instead of referring to the database, you simply made a full copy of the entire database, every single time you used any of it in any way. How fast do you think such a program would be?
References allow you to write sentences that talk about books by using their titles instead of their contents. Reference types allow you to write programs that manipulate objects by using small references rather than enormous quantities of data.
class Node {
Node parent;
}
Try implementing that without a reference type. How big would it be? How big would a string be? An array? How much space would you need to reserve on the stack for:
string s = GetSomeString();
How would any data be used in a method that wasn't specific to one call-path? Multi-threaded code, for example.
Three reasons that I can think of off the top of my head.
You don't want to continually copy objects every time you need to pass them to a method or a collection type.
When iterating through collections, you may want to modify the original object with new values.
Limited Stack Space.
If you look at value types like int, long and float, you can see that the biggest of them stores 8 bytes, or 64 bits.
However, think about a list or an array of long values: if we have a list of 1,000 values, then the worst case takes 8,000 bytes.
Passing 8,000 bytes by value would make our program run super slowly, because the function that took the list as a parameter would have to copy all of those values into a new list, and we would lose time and space doing it.
That's why we have reference types: if we pass that list, we don't lose time and space copying it, because we pass the address of the list in memory.
The reference type in the function works on the same address as the list you passed, and if you want to copy that list you can do so manually.
By using reference types we save time and space for our program, because we don't have to copy and store the argument we passed.
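A tiny example of that: a List<long> parameter is just a reference, so the callee works on the caller's list rather than a copy (illustrative code only):

using System;
using System.Collections.Generic;

static class ReferenceDemo
{
    static void AddValue(List<long> values)
    {
        // No copy of the 1000 elements is made; "values" is the same list the
        // caller built, reached through a small reference.
        values.Add(42);
    }

    static void Main()
    {
        var list = new List<long>(new long[1000]);
        AddValue(list);
        Console.WriteLine(list.Count); // 1001: the caller sees the change
    }
}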
I need to transfer .NET objects (with hierarchy) over the network (multiplayer game). To save bandwidth, I'd like to transfer only the fields (and/or properties) that change, so fields that don't change won't be transferred.
I also need some mechanism to match proper objects on the other client side (global object identifier...something like object ID?)
I need some suggestions how to do it.
Would you use reflection? (performance is critical)
I also need a mechanism to transfer IList deltas (added objects, removed objects).
How is MMO networking done? Do they transfer whole objects?
(Maybe my idea of per-field transfer is stupid.)
EDIT:
To make it clear: I already have a mechanism to track changes (let's say every field has a property whose setter adds the field to some sort of list or dictionary containing the changes; the structure is not final yet).
I don't know how to serialize this list and then deserialize it on the other client, and above all how to do it efficiently and how to update the proper objects.
There are about one hundred objects, so I'm trying to avoid a situation where I would have to write a special function for each object. Decorating fields or properties with attributes would be OK (for example, to specify the serializer, field ID or something similar).
More about the objects: each object has 5 fields on average. Some objects inherit from others.
Thank you for all the answers.
Another approach: don't try to serialize complex data changes; instead, send just the actual commands to apply (in a terse form), for example:
move 12432 134, 146
remove 25727
(which would move 1 object and remove another).
You would then apply the commands at the receiver, allowing for a full resync if they get out of sync.
I don't propose you would actually use text for this - that is just to make the example clearer.
One nice thing about this: it also provides "replay" functionality for free.
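For illustration, a terse binary encoding of those two commands might look something like this (the opcodes and layout are invented, not a real protocol):

using System.IO;

enum Op : byte { Move = 1, Remove = 2 }

static class CommandWriter
{
    // "move 12432 134, 146" becomes 1 opcode byte + a 4-byte id + two 4-byte coordinates.
    public static void WriteMove(BinaryWriter w, int objectId, int x, int y)
    {
        w.Write((byte)Op.Move);
        w.Write(objectId);
        w.Write(x);
        w.Write(y);
    }

    // "remove 25727" becomes 1 opcode byte + a 4-byte id.
    public static void WriteRemove(BinaryWriter w, int objectId)
    {
        w.Write((byte)Op.Remove);
        w.Write(objectId);
    }
}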
The cheapest way to track dirty fields is to make it a key feature of your object model, i.e. with a "fooDirty" field for every data field "foo", which you set to true in the "set" (if the value differs). This could also be twinned with conditional serialization, perhaps via the "ShouldSerializeFoo()" pattern observed by a few serializers. I'm not aware of any libraries that match exactly what you describe (unless we include DataTable, but... think of the kittens!)
Perhaps another issue is the need to track all the objects for merge during deserialization; that by itself doesn't come for free.
All things considered, though, I think you could do something along the above lines (fooDirty/ShouldSerializeFoo) and use protobuf-net as the serializer, because (importantly) it supports both conditional serialization and merge. I would also suggest an interface like:
interface ISomeName {
    int Key { get; }
    bool IsDirty { get; }
}
The IsDirty flag would allow you to quickly check all your objects for those with changes, then write the key to a stream, followed by the (conditional) serialization. The caller would read the key, obtain the object needed (or allocate a new one with that key), and then use the merge-enabled deserialize (passing in the existing/new object).
Not a full walk-through, but if it was me, that is the approach I would be looking at. Note: the addition/removal/ordering of objects in child-collections is a tricky area that might need thought.
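To make that concrete, a minimal sketch of the fooDirty / ShouldSerializeFoo idea combined with the interface above; the [ProtoContract]/[ProtoMember] attributes are protobuf-net's, but treat the exact wiring (and whether your serializer honours ShouldSerialize*) as something to verify rather than a given:

using ProtoBuf;

[ProtoContract]
public class PlayerState : ISomeName
{
    private int score;
    private bool scoreDirty;

    [ProtoMember(1)]
    public int Key { get; set; }

    [ProtoMember(2)]
    public int Score
    {
        get { return score; }
        set { if (score != value) { score = value; scoreDirty = true; } }
    }

    // Conditional-serialization hook: only write Score when it has changed.
    public bool ShouldSerializeScore() { return scoreDirty; }

    public bool IsDirty { get { return scoreDirty; } }

    public void MarkClean() { scoreDirty = false; }
}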
I'll just say up front that Marc Gravell's suggestion is really the correct approach. He glosses over some minor details, like conflict resolution (you might want to read up on Leslie Lamport's work. He's basically spent his whole career describing different approaches to dealing with conflict resolution in distributed systems), but the idea is sound.
If you do want to transmit state snapshots, instead of procedural descriptions of state changes, then I suggest you look into building snapshot diffs as prefix trees. The basic idea is that you construct a hierarchy of objects and fields. When you change a group of fields, any common prefix they have is only included once. This might look like:
world -> player 1 -> lives: 1
... -> points: 1337
... -> location -> X: 100
... -> Y: 32
... -> player 2 -> lives: 3
(everything in a "..." is only transmitted once).
It is not practical to transfer only the changed fields, because you would waste time detecting which fields changed and which didn't, and reconstructing them on the receiver's side, which would add a lot of latency to your game and make it unplayable online.
My proposed solution is to decompose your objects to the minimum and send these small objects, which is fast. Also, you can use compression to reduce bandwidth usage.
For the object ID, you can use a static ID that increases whenever you construct a new object.
Hope this answer helps.
You will need to do this by hand. Automatically keeping track of property and instance changes in a hierarchy of objects is going to be very slow compared to anything crafted by hand.
If you decide to try it out anyway, I would try to map your objects to a DataSet and use its built in modification tracking mechanisms.
I still think you should do this by hand, though.