Optimal Serialization of Primitive Types - C#

We are beginning to roll out more and more WAN deployments of our product (.NET fat client with an IIS hosted Remoting backend). Because of this we are trying to reduce the size of the data on the wire.
We have overridden the default serialization by implementing ISerializable (similar to this), and we are seeing anywhere from 12% to 50% gains. Most of our efforts focus on optimizing arrays of primitive types. Is there a fancy way of serializing primitive types, beyond the obvious?
For example, today we serialize an array of ints as follows:
[4-bytes (array length)][4-bytes][4-bytes]
Can anyone do significantly better?
The most obvious example of a significant improvement, for boolean arrays, is putting 8 bools in each byte, which we already do.
Note: Saving 7 bits per bool may seem like a waste of time, but when you are dealing with large volumes of data (which we are), it adds up very fast.
Note: We want to avoid general compression algorithms because of the latency they introduce. Remoting only supports buffered requests/responses (no chunked encoding). I realize there is a fine line between compression and optimal serialization, but our tests indicate we can afford very specific serialization optimizations at very little cost in latency, whereas reprocessing the entire buffered response into a new compressed buffer is too expensive.
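For reference, a minimal sketch of the bool packing mentioned above, 8 flags per byte (the names are illustrative, not the poster's actual code):

    // Pack a bool[] into a byte[], 8 flags per byte.
    static byte[] PackBools(bool[] values)
    {
        var packed = new byte[(values.Length + 7) / 8];
        for (int i = 0; i < values.Length; i++)
            if (values[i])
                packed[i / 8] |= (byte)(1 << (i % 8));
        return packed;
    }

    // The element count must be stored separately, since the packed
    // length only determines it to within 7 elements.
    static bool[] UnpackBools(byte[] packed, int count)
    {
        var values = new bool[count];
        for (int i = 0; i < count; i++)
            values[i] = (packed[i / 8] & (1 << (i % 8))) != 0;
        return values;
    }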

(relates to messages/classes, not just primitives)
Google designed "protocol buffers" for this type of scenario (they shift a huge amount of data around) - their format is compact (using things like base-128 encoding) but extensible and version tolerant (so clients and servers can upgrade easily).
In the .NET world, I can recommend 2 protocol buffers implementations:
protobuf-net (by me)
dotnet-protobufs (by Jon Skeet)
For info, protobuf-net has direct support for ISerializable and remoting (it is part of the unit tests). There are performance/size metrics here.
And best of all, all you do is add a few attributes to your classes.
Caveat: it doesn't claim to be the theoretical best - but pragmatic and easy to get right - a compromise between performance, portability and simplicity.
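To illustrate the "few attributes" point, a minimal sketch (the Order class and its fields are invented for the example; ProtoContract, ProtoMember, and Serializer are protobuf-net's actual API):

    using System.IO;
    using ProtoBuf;

    [ProtoContract]
    public class Order
    {
        [ProtoMember(1)] public int Id { get; set; }
        [ProtoMember(2)] public string Customer { get; set; }
        [ProtoMember(3)] public int[] Quantities { get; set; }
    }

    // Round-trip via any stream:
    // using (var ms = new MemoryStream())
    // {
    //     Serializer.Serialize(ms, order);
    //     ms.Position = 0;
    //     var copy = Serializer.Deserialize<Order>(ms);
    // }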

Check out the base-128 varint type used in Google's protocol buffers; that might be what you're looking for.
(There are a number of .NET implementations of protocol buffers available if you search the web which, depending on their license, you might be able to grovel some code from!)
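For the curious, a minimal sketch of base-128 varint encoding - 7 payload bits per byte, high bit set while more bytes follow (not lifted from any particular implementation):

    using System.IO;

    static void WriteVarint(Stream s, uint value)
    {
        // Emit 7 bits at a time, setting the high bit on all but the last byte.
        while (value >= 0x80)
        {
            s.WriteByte((byte)(value | 0x80));
            value >>= 7;
        }
        s.WriteByte((byte)value);
    }

    static uint ReadVarint(Stream s)
    {
        uint value = 0;
        int shift = 0, b;
        while (((b = s.ReadByte()) & 0x80) != 0)
        {
            value |= (uint)(b & 0x7F) << shift;
            shift += 7;
        }
        return value | ((uint)b << shift);
    }

Small values cost one byte (e.g. 125 encodes as 7D), at the price of an extra byte on very large ones.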

Yes, there is a fancy way of serialising primitive types. As a bonus it is also much faster (typically 20-40 times).
Simon Hewitt's open source library (see Optimizing Serialization in .NET - part 2) uses various tricks. For example, if it is known that an array contains small integers, then less data goes to the serialised output. This is described in detail in part 1 of the article. For example:
...So, an Int32 that is less than 128 can be stored in a single byte (by using 7-bit encoding)....
Full-size and size-optimised integers can be mixed and matched. This may seem obvious, but there are other tricks too; for example, the integer value 0 is special-cased - an optimisation that stores just the numeric type and a zero marker.
Part 1 states:
... If you've ever used .NET remoting for large amounts of data, you will have found that there are problems with scalability. For small amounts of data, it works well enough, but larger amounts take a lot of CPU and memory, generate massive amounts of data for transmission, and can fail with Out Of Memory exceptions. There is also a big problem with the time taken to actually perform the serialisation - large amounts of data can make it unfeasible for use in apps ....
I have used this library with great success in my application.
To make sure .NET serialisation is never used, put an ASSERT 0, Debug.WriteLine() or similar into the place in the library code where it falls back on .NET serialisation. That's at the end of the WriteObject() function in FastSerializer.cs, near createBinaryFormatter().Serialize(BaseStream, value);.

If your arrays can be sorted, you can perform a simple RLE to save space. Even if they aren't sorted, RLE can still be beneficial. It is fast to implement for both writing and reading.
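A minimal RLE sketch along those lines, writing each run as a (value, count) pair (the format and names are invented for the example):

    using System.IO;

    static void WriteRle(BinaryWriter w, int[] data)
    {
        w.Write(data.Length);
        for (int i = 0; i < data.Length; )
        {
            // Count how long the current value repeats, then emit one pair.
            int value = data[i], count = 1;
            while (i + count < data.Length && data[i + count] == value)
                count++;
            w.Write(value);
            w.Write(count);
            i += count;
        }
    }

    static int[] ReadRle(BinaryReader r)
    {
        var data = new int[r.ReadInt32()];
        for (int i = 0; i < data.Length; )
        {
            int value = r.ReadInt32(), count = r.ReadInt32();
            for (int j = 0; j < count; j++)
                data[i++] = value;
        }
        return data;
    }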

Here's a trick I used once for encoding arrays of integers:
Group array elements in groups of 4.
Precede each group with a byte (let's call it the length mask) that indicates the length of the following 4 elements. The length mask is a bitmask composed of dibits (two-bit pairs) which indicate how long the corresponding element is (00 - 1 byte, 01 - 2 bytes, 10 - 3 bytes, 11 - 4 bytes), with the first element's dibit in the least significant position.
Write out the elements as short as you can.
For example, to represent unsigned integers 0x0000017B, 0x000000A9, 0xC247E8AD and 0x00032A64, you would write (assuming little-endian): B1, 7B, 01, A9, AD, E8, 47, C2, 64, 2A, 03.
It can save you up to 68.75% (11/16) of space in the best case, when all four elements fit in one byte each (5 bytes instead of 16). In the worst case, you would actually waste an additional 6.25% (1/16: 17 bytes instead of 16).
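A sketch of the writing side under those rules (the byte-count thresholds and buffer handling are mine; it assumes the array length is a multiple of 4 for brevity):

    using System.IO;

    static void WriteGroups(Stream s, uint[] data)
    {
        for (int g = 0; g < data.Length; g += 4)
        {
            int mask = 0, pos = 0;
            var buf = new byte[16];
            for (int e = 0; e < 4; e++)
            {
                uint v = data[g + e];
                // Minimum number of bytes needed for this element.
                int len = v <= 0xFF ? 1 : v <= 0xFFFF ? 2 : v <= 0xFFFFFF ? 3 : 4;
                mask |= (len - 1) << (e * 2); // dibit: byte count minus 1
                for (int b = 0; b < len; b++)
                    buf[pos++] = (byte)(v >> (b * 8)); // little-endian
            }
            s.WriteByte((byte)mask);
            s.Write(buf, 0, pos);
        }
    }

For the four example values this writes exactly the B1, 7B, 01, A9, AD, E8, 47, C2, 64, 2A, 03 sequence above.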

If you know which int values are more common, you can encode those values in fewer bits (and encode the less-common values using correspondingly more bits): this is called "Huffman" coding/encoding/compression.
In general though I'd suggest that one of the easiest things you could do would be to run a standard 'zip' or 'compression' utility over your data.
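For what it's worth, testing that takes only a few lines with the framework's GZipStream (the asker notes general compression may cost too much latency in their remoting setup, but it makes a useful baseline):

    using System.IO;
    using System.IO.Compression;

    static byte[] Gzip(byte[] payload)
    {
        using (var ms = new MemoryStream())
        {
            // The compressor must be closed before reading the result,
            // so that its final block is flushed.
            using (var gz = new GZipStream(ms, CompressionMode.Compress))
                gz.Write(payload, 0, payload.Length);
            return ms.ToArray();
        }
    }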

For integers, if you usually have small numbers (under 127 or 32768) you can encode the number using the MSB as a flag that indicates whether it's the last byte or not. It's a little similar to UTF-8, except that the flag bit is actually wasted (which is not the case with UTF-8).
Example (big-endian):
125 which is usually encoded as 00 00 00 7D
Could be encoded as 7D
270 which is usually encoded as 00 00 01 0E
Could be encoded as 82 0E
The main limitation is that the effective range of a 32 bit value is reduced to 28 bits. But for small values you will usually gain a lot.
This method is actually used in old formats such as MIDI because old electronics needed very efficient and simple encoding techniques.

If you want to control the serialization format yourself, with just library help for compact integer storage, derive a class from BinaryWriter that uses Write7BitEncodedInt. Do likewise for BinaryReader.Read7BitEncodedInt.
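Those methods are protected on BinaryWriter/BinaryReader (they only became public in later .NET versions), so a thin derived class is enough to expose them - a minimal sketch with invented class names:

    using System.IO;

    class CompactWriter : BinaryWriter
    {
        public CompactWriter(Stream output) : base(output) { }
        // Expose the protected 7-bit (base-128) integer encoding.
        public void WriteCompact(int value) => Write7BitEncodedInt(value);
    }

    class CompactReader : BinaryReader
    {
        public CompactReader(Stream input) : base(input) { }
        public int ReadCompact() => Read7BitEncodedInt();
    }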

Before implementing ISerializable yourself, you were probably using XmlSerializer or the SOAP formatter in a web service. Given that all your fat clients are running .NET, you could try using the BinaryFormatter.

Related

Binary serialization of arbitrary objects between Objective C and C#?

We want to pass a forest - a dictionary with values which can be: dictionaries, arrays, sets, numbers, strings, byte buffers - between Objective C and C# efficiently (time-wise, space is a lesser concern). Google's Protocol Buffers looked good, but they seem to handle only structured data, while ours is arbitrary. Ultimately we can write a binary (de)serialiser ourselves, but surely this was done before and released as FOSS somewhere?
Have you considered using ASN.1? Since ASN.1 is independent of programming language or system architecture, it can be used efficiently regardless of whether you need C, C#, C++, or Java.
You create a description of the information you wish to exchange, and then use an ASN.1 tool to generate an encoder/decoder for your target programming language. ASN.1 also supports several different rules for transmitting the data, which range from the efficient PER (Packed Encoding Rules) to the verbose but flexible XER (XML Encoding Rules).
To see if this might work for you, try the free online ASN.1 compiler and encoder/decoder at http://asn1-playground.oss.com.

Byte order Java/.NET

When sending information from a Java application to a C# application through sockets, is the byte order different? Or can I just send an integer from C# to a Java application and read it as an integer?
(And does the OS matter, or is it the same for Java/.NET no matter how the actual OS handles it?)
It all comes down to how you encode the data. If you are treating it only as a raw sequence of bytes, there is no conflict; the sequence is the same. Endianness matters when you interpret chunks of the data as (for example) integers.
Any serializer written with portability in mind will have defined endianness - for example, in protocol buffers (available for both Java and C#) little-endian is always used regardless of your local hardware.
If you are doing manual writing to the stream, using things like shift-based encoding (rather than direct memory copying) will give you defined endianness.
If you use pre-canned platform serializers, you are at the mercy of the implementation. It might be endian-safe, it might not be (i.e. it might depend on the platform at both ends). For example, the .NET BitConverter class is not safe - it is usually assumed (incorrectly) to be little-endian, but on some platforms (and particularly in Mono on some hardware) it could be big-endian; hence the .IsLittleEndian property.
My advice would be to use a serializer that handles it all for you ;p
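As a sketch of the shift-based approach, here is an int written in network (big-endian) order, matching what Java's DataOutputStream.writeInt produces (the helper names are mine):

    using System.IO;

    static void WriteInt32BigEndian(Stream s, int value)
    {
        s.WriteByte((byte)(value >> 24)); // high byte first
        s.WriteByte((byte)(value >> 16));
        s.WriteByte((byte)(value >> 8));
        s.WriteByte((byte)value);
    }

    static int ReadInt32BigEndian(Stream s)
    {
        // C# evaluates operands left to right, so the bytes are read in order.
        return (s.ReadByte() << 24) | (s.ReadByte() << 16)
             | (s.ReadByte() << 8) | s.ReadByte();
    }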
In Java, you can use a DataInputStream or DataOutputStream which read and write the high-byte first, as documented:
http://download.oracle.com/javase/6/docs/api/java/io/DataOutputStream.html#writeInt%28int%29
You should check the corresponding C# documentation to see what it does (or maybe someone here can tell you).
You also have, in Java, the option of using ByteBuffer:
http://download.oracle.com/javase/6/docs/api/java/nio/ByteBuffer.html
... which has the "order" method to allow you to specify a byte order for operations reading multi-byte primitive types.
Java uses big-endian in some libraries like DataInputStream/DataOutputStream. IP protocols all use big-endian, which leads many people to use big-endian as the default for network protocols.
However, NIO's ByteBuffer allows you to specify BigEndian, LittleEndian, or NativeEndian (whatever the system uses by default).
x86 systems tend to use little-endian, so many Microsoft/Linux applications use little-endian by default but can support big-endian.
Yes, the byte order may be different. C# generally uses the platform's byte ordering (typically little-endian), while Java tends to use big-endian. This has been discussed before on SO. See for example C# little endian or big endian?

Computing, storing, and retrieving values to and from an N-Dimensional matrix

This question is probably quite different from what you are used to reading here - I hope it can provide a fun challenge.
Essentially I have an algorithm that uses 5(or more) variables to compute a single value, called outcome. Now I have to implement this algorithm on an embedded device which has no memory limitations, but has very harsh processing constraints.
Because of this, I would like to run a calculation engine which computes outcome for, say, 20 different values of each variable and stores this information in a file. You may think of this as a 5(or more)-dimensional matrix or 5(or more)-dimensional array, each dimension being 20 entries long.
In any modern language, filling this array is as simple as having 5(or more) nested for loops. The tricky part is that I need to dump these values into a file that can then be placed onto the embedded device so that the device can use it as a lookup table.
The questions now are:
1. What format(s) might be acceptable for storing the data?
2. What programs (MATLAB, C#, etc.) might be best suited to compute the data?
3. C# must be used to import the data on the device - is this possible given your answer to #1?
Edit:
Is it possible to read from my lookup table file without reading the entire file into memory? Can you explain how that might be done in C#?
I'll comment on 1 and 3 as well. It may be preferable to use a fixed width output file rather than a CSV. This may take up more or less space than a CSV, depending on the output numbers. However, it tends to work well for lookup tables, as figuring out where to look in a fixed width data file can be done without reading the entire file. This is usually important for a lookup table.
Fixed width data, as with CSV, is trivial to read and write. Some math-oriented languages might offer poor string and binary manipulation functionality, but it should be really easy to convert the data to fixed width during the import step regardless.
Number 2 is harder to answer, particularly without knowing what kind of algorithm you are computing. Matlab and similar programs tend to be great about certain types of computations and often have a lot of stuff built in to make it easier. That said, a lot of the math stuff that is built into such languages is available for other languages in the form of libraries.
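To make the "look without reading the entire file" point concrete, here is a sketch; the record width, the 20-entries-per-dimension assumption, and all names are invented for the example:

    using System.IO;
    using System.Text;

    const int EntriesPerDim = 20; // sample values per variable
    const int RecordWidth = 12;   // bytes per fixed-width record

    static string Lookup(string path, int i1, int i2, int i3, int i4, int i5)
    {
        // Row-major offset: the record number across the five dimensions.
        long index = (((i1 * EntriesPerDim + i2) * EntriesPerDim + i3)
                      * EntriesPerDim + i4) * (long)EntriesPerDim + i5;
        using (var fs = File.OpenRead(path))
        {
            fs.Seek(index * RecordWidth, SeekOrigin.Begin);
            var buf = new byte[RecordWidth];
            fs.Read(buf, 0, RecordWidth);
            return Encoding.ASCII.GetString(buf).Trim();
        }
    }

Because every record has the same width, the seek position is pure arithmetic and only one record ever needs to be in memory.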
I'll comment on (1) and (3). All you need to do is dump the data in slices. Pick a traversal and dump data out in that order. Write it out as comma-delimited numbers.

Should I use XML or Binary to send data from server to client?

I have two separate apps - one a client (in C#), one a server (in C++). They need to exchange data in the form of "structs", and about 1 MB of data per minute is sent from server to client.
What's better to use - XML or my own binary format?
With XML:
translating XML to a struct using a parser would be slow, I believe? ("good", but: load parser, load XML, parse)
the other option is parsing XML with regex (bad!)
With binary:
compact data sizes
no need for meta information like tags
but structs cannot be changed easily to accommodate new structs/new members in the future
no conversion from text (XML) to binary (struct) is necessary, so it is faster to receive and "assemble" into a struct
Any pointers? Should I not be considering binary at all? I'm a bit confused about what approach to take.
1MB of data per minute is pretty tiny if you've got a reasonable network connection.
There are other choices between binary and XML - other human-readable text serialization formats, such as JSON.
When it comes to binary, you don't have to have versioning problems - technologies like Protocol Buffers (I'm biased: I work for Google and I've ported PB to C#) are explicitly designed with backward and forward compatibility in mind. There are other binary formats to consider as well, such as Thrift.
If you're worried about performance though, you should really measure it. I'm pretty sure my phone could parse 1MB of XML sufficiently quickly for it not to be a problem in this case... basically work out what you're most concerned about, in terms of:
Simplicity of code
Interoperability
Performance in terms of CPU
Network traffic
Backward/forward compatibility
Human readability of on-the-wire format
It's all a balancing act - but you're the one who has to decide how much weight to give each of those factors.
If you have .NET applications in both ends, use Windows Communication Foundation. This will allow you to defer the decision until deployment time, as it supports both binary and XML serialization.
As you stated, XML is a little slower but much more flexible and reliable. I would go with XML until there is a proven problem with performance.
You should also take a look at Protocol Buffers (protobuf) as an alternative.
And, after your update, any cross-language, cross-platform and cross-version requirement strongly points away from binary formatting.
A good point for XML would be interoperability. Do you have other clients that also access your server?
Before you use your own binary format or do regex on XML... have you considered the serialization namespaces in .NET? There is the BinaryFormatter, there are SOAP formatters, and there is also XmlSerializer.
Another advantage of XML is that you can extend the data you are sending by adding an element; you won't have to alter the receiver's code to cope with the extra data until you are ready to.
Also, even minimal (fast) compression of XML can dramatically reduce the wire load.
text/xml
Human readable
Easier to debug
Bandwidth can be saved by compressing
Tags document the data they contain
binary
Compact
Easy to parse (if fixed size fields are used, just overlay a struct)
Difficult to debug (hex editors are a pain)
Needs a separate document to understand what the data is.
Both forms are extensible and can be upgraded to newer versions provided you insert a type and version field at the beginning of the datagram.
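A sketch of such a prefix, written with BinaryWriter (the layout here is invented; any fixed convention works as long as both ends agree on it):

    using System.IO;

    // Prefix every datagram with a message type and a format version so
    // either side can dispatch on type and upgrade gracefully.
    static void WriteHeader(BinaryWriter w, ushort messageType, ushort version)
    {
        w.Write(messageType);
        w.Write(version);
    }

    static (ushort Type, ushort Version) ReadHeader(BinaryReader r)
        => (r.ReadUInt16(), r.ReadUInt16());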
You did not say whether they are on the same machine or not; I assume not.
In that case there is another downside to binary: you cannot simply dump the structs onto the wire, because you could have endianness and sizeof issues.
XML is very wordy, YAML or JSON are much smaller
Don't forget that what most people think of as XML is XML serialized as text. It can be serialized to binary instead. This is what the netTcpBinding and other such bindings do in WCF. The XML infoset is output as binary, not as text. It's still XML, just in binary.
You could also use Google Protocol Buffers, which is a compact binary representation for structured data.

Working with very large integers in C#

Does anybody know of a way I can calculate very large integers in C#?
I am trying to calculate the factorial of numbers, e.g.
5! = 5*4*3*2*1 = 120
With small numbers this is not a problem, but trying to calculate the factorial of the biggest value of an unsigned int, which is 4,294,967,295, doesn't seem possible.
I have looked into the BigInteger class but it doesn't seem to do what I need
any help would be greatly appreciated
To calculate the factorial of uint.MaxValue you'd need a lot of storage.
For example, the Wikipedia article gives 1,000,000! as 8.2639316883... × 10^5,565,708. You're going to gain information like crazy.
I strongly suspect you're not going to find any way of calculating it on a sane computer in a sane amount of time. Why do you need this value? Would Stirling's approximation be close enough?
Firstly, it's worth pointing out that the factorial of uint.MaxValue is astronomically large. I'm not able to find a good estimate of the order of magnitude of its factorial, but its bit representation will probably occupy a high percentage of a standard RAM, if not well exceed it.
A BigInteger class seems to be what you want, providing you only want to go up to around 1,000,000 or so (very roughly). After that, time and memory become very prohibitive. In current (stable) versions of .NET, up to 3.5, you have to go with a custom implementation. This one on the CodeProject seems to be highly rated. If you happen to be developing for .NET 4.0, the Microsoft team have finally gotten around to including a BigInteger class in the System.Numerics namespace of the BCL. Unlike some BigInteger implementations, the one existing in .NET 4.0 doesn't have a built-in factorial method (I'm not sure about the CodeProject one), but it should be trivial to implement one - an extension method would be a nice way.
Since you seem to think you don't want to use a BigInteger type, it would be helpful if you could verify that it's not what you want having read my reply, and then explain precisely why it doesn't suit your purposes.
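For illustration, a minimal sketch of such an extension method on the .NET 4.0 System.Numerics.BigInteger (the naive loop; only sensible for modest n, given the sizes discussed below):

    using System.Numerics;

    public static class BigIntegerExtensions
    {
        // Naive O(n) factorial: multiply 2..n together.
        public static BigInteger Factorial(this BigInteger n)
        {
            BigInteger result = BigInteger.One;
            for (BigInteger i = 2; i <= n; i++)
                result *= i;
            return result;
        }
    }

    // Usage: var f = new BigInteger(1000).Factorial();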
4294967295! = 10^(10^10.597) ~ 10^(40,000,000,000)
This value requires about 40 GB of RAM to store as decimal digits, even if you do find a BigInteger implementation for C#!
P.S. Well, with optimised storage, say 9 digits in 4 bytes, it will take ~18 GB of RAM.
Why do you think that you need to calculate those factorials? It's not practically useful for anything to do the actual calculations.
Just the result of calculating the factorial of (2^32 - 1) would take up a lot of space, approximately 16 GB.
The calculation itself will of course take a lot of time. If you build the program so that you can transfer the calculation process to faster hardware as it is invented, you should be able to get the result within your lifetime.
If it's something like an Euler problem that you are trying to solve, consider that a lot of solutions are found by eliminating what you actually don't have to calculate in order to get the answer.
Here.
The fastest one, straight from the Factorial Man - Peter Luschny.
You can use the BigInteger class from the J# libraries for now. Here's an article on how. It makes deployment harder because you have to send out the J# redistributable. You can also consider going to VS2010 beta as Framework 4.0 will have BigInteger.
In case you have J# redist installed, an alternative way would be using java.math.BigInteger by adding a reference to the vjslib assembly.
Try using an array for this task. You can make the integers as long as your free memory allows. Every member of the array represents one decimal digit. The only thing you need is to implement multiplication.
If you are doing calculations with factorials, like combinations for example, you rarely need to multiply all the way down to 1 (e.g. 100!/97! = 100 × 99 × 98, since everything else cancels out).
