protobuf-net message serialized size property - c#

We are using protobuf-net for serialization and deserialization of messages in an application whose public protocol is based on Google Protocol Buffers. The library is excellent and covers all our requirements except for this one: we need to find out the serialized message length in bytes before the message is actually serialized.
The question was already asked a year and a half ago and, according to Marc, the only way to do this was to serialize to a MemoryStream and read the .Length property afterwards. This is not acceptable in our case, because MemoryStream allocates a byte buffer behind the scenes and we have to avoid this.
This line from the same response gives us hope that it might be possible after all:
If you clarify what the use-case is, I'm sure we can make it easily
available (if it isn't already).
Here is our use case. We have messages whose size varies between several bytes and two megabytes. The application pre-allocates byte buffers used for socket operations and for serializing / deserializing, and once the warm-up phase is over, no additional buffers can be created (hint: avoiding GC and heap fragmentation). Byte buffers are essentially pooled. We also want to avoid copying bytes between buffers / streams as much as possible.
We have come up with two possible strategies, and both of them require the message size upfront:
1. Use (large) fixed-size byte buffers and serialize all messages that can fit into one buffer; send the content of the buffer using Socket.Send. We have to know when the next message cannot fit into the buffer and stop serializing. Without the message size, the only way to achieve this is to wait for an exception to occur during Serialize.
2. Use (small) variable-size byte buffers and serialize each message into one buffer; send the content of the buffer using Socket.Send. In order to check out a byte buffer of the appropriate size from the pool, we need to know how many bytes a serialized message has.
Because the protocol is already defined (we cannot change this) and requires message length prefix to be Varint32, we cannot use SerializeWithLengthPrefix method.
So is it possible to add a method that estimates a message size without serialization into a stream? If it is something that does not fit into the current feature set and roadmap of the library, but is doable, we are interested in extending the library ourselves. We are also looking for alternative approaches, if there are any.

As noted, this is not immediately available, as the code intentionally tries to do a single pass over the data (especially IEnumerable<T> etc). Depending on your data, though, it might already be doing a moderate amount of copying, to allow for the fact that sub-messages are also length-prefixed, so might need juggling. This juggling can be greatly reduced by using the "grouped" sub-format internally in the message, as groups allow forwards-only construction without track-backs.
So is it possible to add a method that estimates a message size without serialization into a stream?
An estimate is next to useless; since there is no terminator, it needs to be exact. Ultimately, the sizes are a little hard to predict without actually doing it. There was some code in v1 for size prediction, but the single-pass code currently seems preferred, and in most cases the buffer overhead is nominal (there is code in place to re-use the internal buffers so that it doesn't spend all the time allocating buffers for small messages).
If your message internally is forwards-only (grouped), then a cheat might be to serialize to a fake stream that measures, but drops all the data; you'd end up serializing twice, however.
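The "fake stream that measures" idea can be sketched roughly as follows. This is a minimal sketch, not something protobuf-net ships: `CountingStream` and `BytesWritten` are hypothetical names, and the result only reflects what the serializer actually writes to the stream.

```csharp
using System;
using System.IO;

// Hypothetical helper: a write-only stream that discards every byte written
// to it and only keeps a running count. It allocates no backing buffer.
public sealed class CountingStream : Stream
{
    private long _count;

    // Total number of bytes that have been written so far.
    public long BytesWritten => _count;

    public override bool CanRead => false;
    public override bool CanSeek => false;
    public override bool CanWrite => true;

    public override void Write(byte[] buffer, int offset, int count) => _count += count;
    public override void WriteByte(byte value) => _count++;
    public override void Flush() { }

    // Everything else is unsupported, as for any forward-only write stream.
    public override long Length => throw new NotSupportedException();
    public override long Position
    {
        get => throw new NotSupportedException();
        set => throw new NotSupportedException();
    }
    public override int Read(byte[] buffer, int offset, int count) => throw new NotSupportedException();
    public override long Seek(long offset, SeekOrigin origin) => throw new NotSupportedException();
    public override void SetLength(long value) => throw new NotSupportedException();
}
```

A first pass would serialize into a `CountingStream` (for example with `Serializer.Serialize(countingStream, message)`), read `BytesWritten`, check out a pooled buffer of that size, and then serialize again for real. As the answer says, this costs a second serialization pass, and any buffering the library does internally for length-prefixed sub-messages still happens.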
Re:
and requires message length prefix to be Varint32, we cannot use SerializeWithLengthPrefix method
I'm not quite sure I see the relationship there - it allows a range of formats etc to be used here; perhaps if you can be more specific?
Re copying data around - an idea I played with here is that of using sub-normal forms for the length prefix. For example, it might be that in most cases 5 bytes is plenty, so rather than juggle, it could leave 5 bytes, and then simply overwrite without condensing (since the octet 10000000 still means "zero and continue", even if it is redundant). This would still need to be buffered (to allow backfill), but would not require any movement of the data.
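To make the sub-normal length prefix concrete, here is a small sketch of a varint that always occupies five bytes. `PaddedVarint` is a hypothetical helper, not a protobuf-net API, and the approach assumes the receiving decoder tolerates non-minimal varint encodings.

```csharp
using System;

// Hypothetical helper illustrating a fixed-width ("sub-normal") varint32.
public static class PaddedVarint
{
    public const int Width = 5; // five 7-bit groups cover any 32-bit value

    // Writes value as exactly five bytes. The first four bytes always carry
    // the continuation bit, so unused groups become 0x80 ("zero and continue").
    public static void Write(byte[] buffer, int offset, uint value)
    {
        for (int i = 0; i < Width - 1; i++)
        {
            buffer[offset + i] = (byte)((value & 0x7F) | 0x80);
            value >>= 7;
        }
        buffer[offset + Width - 1] = (byte)value; // last 4 bits, no continuation bit
    }

    // Ordinary varint32 decoder; the padding needs no special handling.
    public static uint Read(byte[] buffer, int offset, out int bytesRead)
    {
        uint result = 0;
        for (int i = 0; i < Width; i++)
        {
            byte b = buffer[offset + i];
            result |= (uint)(b & 0x7F) << (7 * i);
            if ((b & 0x80) == 0)
            {
                bytesRead = i + 1;
                return result;
            }
        }
        throw new FormatException("Varint32 is longer than five bytes.");
    }
}
```

With this, a writer can reserve five bytes at the start of a pooled buffer, serialize the payload right after them, and then call `PaddedVarint.Write(buffer, 0, payloadLength)` without moving the payload. The cost is up to four wasted bytes per message.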
A final simple idea would be simply: serialize to a FileStream; then write the file length, and the file data. It trades memory usage for IO, obviously.

Related

How do you receive packets through TCP sockets without knowing their packet size before receiving?

I am working on a network application that can send live video feed asynchronously from one application to another, sort of like Skype. The main issue I am having is that I want to be able to send the frames but not have to know their size each time before receiving.
The way AForge.NET works when handling images is that the size of the current frame will most likely be different than the one before it. The size is not static so I was just wondering if there was a way to achieve this. And, I already tried sending the length first and then the frame, but that is not what I was looking for.
First, make sure you understand that TCP itself has no concept of "packet" at all, not at the user code level. If one is conceptualizing one's TCP network I/O in terms of packets, they are probably getting it wrong.
Now that said, you can impose a packet structure on the TCP stream of bytes. To do that where the packets are not always the same size, you can only transmit the length before the data, or delimit the data in some way, such as wrapping it in a self-describing encoding, or terminating the data in some way.
Note that adding structure around the data (encoding, terminating, whatever) when you're dealing with binary data is fraught with hassles, because binary data usually is required to support any combination of bytes. This introduces a need for escaping the data or otherwise being able to flag something that would normally look like a delimiter or terminator, so that it can be treated as binary data instead of some boundary of the data.
Personally, I'd just write a length before the data. It's a simple and commonly used technique. If you still don't want to do it that way, you should be specific and explain why you don't, so that your specific scenario can be better understood.
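As a rough illustration of writing a length before the data, the sketch below frames each message with a 4-byte big-endian length. The prefix format, the `LengthPrefixedFraming` class, and the `MaxFrameLength` limit are assumptions made for this example; any `Stream` (including a `NetworkStream`) can be passed in.

```csharp
using System;
using System.IO;

// Hypothetical helper: frames messages as [4-byte big-endian length][payload].
public static class LengthPrefixedFraming
{
    public const int MaxFrameLength = 16 * 1024 * 1024; // arbitrary sanity limit

    public static void WriteFrame(Stream stream, byte[] payload)
    {
        int n = payload.Length;
        byte[] header = { (byte)(n >> 24), (byte)(n >> 16), (byte)(n >> 8), (byte)n };
        stream.Write(header, 0, header.Length);
        stream.Write(payload, 0, n);
    }

    // Reads one frame; returns null if the stream ended before a new frame began.
    public static byte[] ReadFrame(Stream stream)
    {
        byte[] header = new byte[4];
        if (!ReadExactly(stream, header, allowEndAtStart: true)) return null;

        int n = (header[0] << 24) | (header[1] << 16) | (header[2] << 8) | header[3];
        if (n < 0 || n > MaxFrameLength) throw new InvalidDataException("Bad frame length: " + n);

        byte[] payload = new byte[n];
        ReadExactly(stream, payload, allowEndAtStart: false);
        return payload;
    }

    // A single Read may return fewer bytes than requested, so loop until full.
    private static bool ReadExactly(Stream stream, byte[] buffer, bool allowEndAtStart)
    {
        int total = 0;
        while (total < buffer.Length)
        {
            int read = stream.Read(buffer, total, buffer.Length - total);
            if (read == 0)
            {
                if (total == 0 && allowEndAtStart) return false;
                throw new EndOfStreamException("Stream ended in the middle of a frame.");
            }
            total += read;
        }
        return true;
    }
}
```

The read loop is the important part: because TCP delivers a byte stream, the receiver keeps reading until it has the whole header and then the whole payload, regardless of how the sender's writes were split up in transit.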

How to write length prefixed binary data efficiently

I'm writing a binary data format to file containing a graph of serialized objects. To be more resilient to errors (and to be able to debug problems) I am considering length-prefixing each object in the stream. I'm using C# and a BinaryWriter at the moment, but it is quite a general problem.
The size of each object isn't known until it has been completely serialized, so to be able to
write the length prefixes there are a number of strategies:
1. Use a write buffer with enough space to have random access and insert the length at the correct position after the object is serialized.
2. Write each object to its own MemoryStream, then write the length of the buffer and the buffer contents to the main stream.
3. Write a zero length for all objects in the first pass, remember the positions in the file for all object sizes (a table of object to size), and make a second pass filling in all the sizes.
4. ??
The total size (and thus the size of the first/outermost object) is typically around 1 MB but can be as large as 50-100 MB. My concern is the performance and memory usage of the process.
Which strategy would be most efficient?
Which strategy would be most efficient?
The only way to determine this is to measure.
My first instinct would be to use #2, although that is likely to add pressure to the GC (or fragmentation to the large object heap if the worker streams exceed 85,000 bytes). However, #3 sounds interesting, assuming the complexity of tracking those positions doesn't hurt maintainability.
In the end you need to measure with your data, and consider that unless you have unusual circumstances the performance will be dominated by network or storage performance, not by processing in memory.
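For comparison when measuring, a variant of the placeholder approach (close to strategies #1 and #3) can be sketched with a seekable stream: write a zero length, write the object, then seek back and fill in the real length. `BackfilledLength` is a hypothetical helper, and the 4-byte Int32 prefix is an assumption about the format.

```csharp
using System;
using System.IO;

// Hypothetical helper: length-prefixes a body without a temporary MemoryStream
// by backfilling a placeholder. Requires a seekable underlying stream.
public static class BackfilledLength
{
    public static void WriteWithLengthPrefix(BinaryWriter writer, Action<BinaryWriter> writeBody)
    {
        Stream stream = writer.BaseStream;
        if (!stream.CanSeek) throw new NotSupportedException("Backfilling needs a seekable stream.");

        writer.Flush();
        long prefixPosition = stream.Position;
        writer.Write(0);            // 4-byte placeholder for the length
        writeBody(writer);          // the body may call this method again for nested objects
        writer.Flush();

        long endPosition = stream.Position;
        long bodyLength = endPosition - prefixPosition - sizeof(int);

        stream.Position = prefixPosition;
        writer.Write(checked((int)bodyLength)); // overwrite the placeholder
        writer.Flush();
        stream.Position = endPosition;          // continue after the body
    }
}
```

Because each call remembers its own prefix position, nested objects work without a separate table of positions. Whether the extra seeks are cheaper than per-object MemoryStreams depends on the target stream, so this still needs to be measured with real data.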
100MB is only 2.5% of the memory in a 'small' sized server (or a standard desktop computer). I'd serialize to memory (e.g. a byte[] array/MemoryStream with BinaryWriter) and then flush that to disk when done.
This would also keep your code clean, compact, and easy to manage - saving you from hours of tearing your hair and seeking back and forth in a large blob :)
Hope this helps!
If you control the format, you could accumulate a list of object sizes and append a directory at the end of your file. However, don't forget that in .NET world your write buffers are copied several times before actually getting transferred to disk anyway. Therefore any gains you make by avoiding (say) an extra MemoryStream will not increase the overall efficiency much.

Queue<byte> vs. Stream

Is there a difference between a Queue and a Stream in C#?
The question should be: do they even have anything in common besides both offering some sort of interface to retrieve bytes from?
A Queue<byte> is just that: a FIFO queue of bytes. Its main functionality is to enqueue or dequeue a single byte value at a time; there is no random access. You usually use a queue as part of a data structure or algorithm (e.g. breadth-first search in a tree comes to mind). All data in a queue is stored in memory.
A stream, on the other hand, is an abstract representation of a byte stream usually obtained from a file, memory, network or other source - there is always an underlying source or target. This source doesn't have to be in memory, e.g. a network or file stream will allow you to read from or write to a file or network - so a stream is the main way to get bytes from A to B.
A queue has to store bytes, a stream doesn't. Big difference.
I'm not a C# (or even .NET) guy at all, and hopefully someone will provide a more detailed answer, but..
I think it's pretty clear that Queue and Stream are quite different. I understand why you'd ask, but even a quick peek at the API shows a lot of differences.
http://msdn.microsoft.com/en-us/library/system.io.stream.aspx
http://msdn.microsoft.com/en-us/library/system.collections.queue.aspx
Foremost among these differences is that Queue is part of the System.Collections namespace and Stream is part of System.IO.
EDIT - typed Queue is probably more applicable, as shown with other poster
http://msdn.microsoft.com/en-us/library/7977ey2c.aspx

What does stream mean? What are its characteristics?

C++ and C# both use the word stream to name many classes.
C++: iostream, istream, ostream, stringstream, ostream_iterator, istream_iterator...
C#: Stream, FileStream, MemoryStream, BufferedStream...
So it made me curious to know, what does stream mean?
What are the characteristics of a stream?
When can I use this term to name my classes?
Is this limited to file I/O classes only?
Interestingly, C doesn’t use this word anywhere, as far as I know.
Many data-structures (lists, collections, etc) act as containers - they hold a set of objects. But not a stream; if a list is a bucket, then a stream is a hose. You can pull data from a stream, or push data into a stream - but normally only once and only in one direction (there are exceptions of course). For example, TCP data over a network is a stream; you can send (or receive) chunks of data, but only in connection with the other computer, and usually only once - you can't rewind the Internet.
Streams can also manipulate data passing through them; compression streams, encryption streams, etc. But again - the underlying metaphor here is a hose of data. A file is also generally accessed (at some level) as a stream; you can access blocks of sequential data. Of course, most file systems also provide random access, so streams do offer things like Seek, Position, Length etc - but not all implementations support such. It has no meaning to seek some streams, or get the length of an open socket.
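The point that not every stream supports Seek, Position, or Length can be checked at run time through `Stream.CanSeek`. A minimal sketch, where `StreamCapabilities.TryGetLength` is a hypothetical helper name:

```csharp
using System;
using System.IO;

// Hypothetical helper: asks a stream for its length only when that is meaningful.
public static class StreamCapabilities
{
    // Returns the length for seekable streams, or null for hose-like streams
    // (sockets, compression streams, and so on) that cannot report one.
    public static long? TryGetLength(Stream stream)
    {
        return stream.CanSeek ? stream.Length : (long?)null;
    }
}
```

A MemoryStream or FileStream reports a length here, while a stream such as `System.IO.Compression.DeflateStream` reports `CanSeek == false`, and calling `Length` on it directly would throw `NotSupportedException`.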
There's a couple different meanings. #1 is what you probably mean, but you might want to look at #2 too.
In the libraries like those you mentioned, a "stream" is just an abstraction for "binary data", that may or may not be random-access (as opposed to data that is continuously generated, such as if you were writing a stream that generated random data), or that may be stored anywhere (in RAM, on the hard disk, over a network, in the user's brain, etc.). They're useful because they let you avoid the details, and write generic code that doesn't care about the particular source of the stream.
As a more general computer science concept, a "stream" is sometimes thought of (loosely) as "finite or infinite amount of data". The concept is a bit difficult to explain without an example, but in functional programming (like in Scheme), you can turn an object with state into a stateless object, by treating the object's history as a "stream" of changes. (The idea is that an object's state may change over time, but if you treat the object's entire life as a "stream" of changes, the stream as a whole never changes, and you can do functional programming with it.)
From I/O Streams (though in Java, the meaning is the same in C++ / C#)
An I/O Stream represents an input source or an output destination. A stream can represent many different kinds of sources and destinations, including disk files, devices, other programs, and memory arrays.
Streams support many different kinds of data, including simple bytes, primitive data types, localized characters, and objects. Some streams simply pass on data; others manipulate and transform the data in useful ways.
No matter how they work internally, all streams present the same simple model to programs that use them: A stream is a sequence of data. A program uses an input stream to read data from a source, one item at a time.
In C#, the streams you have mentioned derive from the abstract base class Stream. Each implementation of this base class has a specific purpose.
For example, FileStream supports read / write operations on a file, while the MemoryStream works on an in-memory stream object. Unlike the FileStream and MemoryStream classes, BufferedStream class allows the user to buffer the I/O.
In addition to the above classes, there are several other classes that derive from the Stream class. For a complete list, refer to the MSDN documentation on the Stream class.
Official terms and explanations aside, the word stream itself was taken from the "real life" stream - instead of water, data is transferred from one place to another.
Regarding the question you asked that still hasn't been answered: you can give your own classes names that contain "stream", but the name only has the correct meaning if the class implements some sort of new stream.
In C, functions defined in <stdio.h> operate on streams.
Section 7.19.2 Streams in C99 discusses how they behave, though not what they are, apart from "an ordered sequence of characters".
The rationale gives more context in the corresponding section, starting with:
C inherited its notion of text streams from the UNIX environment in which it was born.
So that's where the concept comes from.

How should I handle incomplete packet buffers?

I am writing a client for a server that typically sends data as strings of 500 bytes or fewer. However, the data will occasionally exceed that, and a single set of data could contain 200,000 bytes, for all the client knows (on initialization or significant events). However, I would like to not have to have each client running with a 50 MB socket buffer (if it's even possible).
Each set of data is delimited by a null \0 character. What kind of structure should I look at for storing partially sent data sets?
For example, the server may send ABCDEFGHIJKLMNOPQRSTUV\0WXYZ\0123!\0. I would want to process ABCDEFGHIJKLMNOPQRSTUV, WXYZ, and 123! independently. Also, the server could send ABCDEFGHIJKLMNOPQRSTUVWXYZ1234567890LOL123HAHATHISISREALLYLONG without the terminating character. I would want that data set stored somewhere for later appending and processing.
Also, I'm using asynchronous socket methods (BeginSend, EndSend, BeginReceive, EndReceive) if that matters.
Currently I'm debating between List<Byte> and StringBuilder. Any comparison of the two for this situation would be very helpful.
Read the data from the socket into a buffer. When you get the terminating character, turn it into a message and send it on its way to the rest of your code.
Also, remember that TCP is a stream, not a packet. So you should never assume that you will get everything sent at one time in a single read.
As far as buffers go, you should probably only need one per connection at most. I'd probably start with the max size that you reasonably expect to receive, and if that fills, create a new buffer of a larger size - a typical strategy is to double the size when you run out to avoid churning through too many allocations.
If you have multiple incoming connections, you may want to do something like create a pool of buffers, and just return "big" ones to the pool when done with them.
You could just use a List<byte> as your buffer, so the .NET framework takes care of automatically expanding it as needed. When you find a null terminator you can use List.RemoveRange() to remove that message from the buffer and pass it to the next layer up.
You'd probably want to add a check and throw an exception if it exceeds a certain length, rather than just wait until the client runs out of memory.
(This is very similar to Ben S's answer, but I think a byte array is a bit more robust than a StringBuilder in the face of encoding issues. Decoding bytes to a string is best done higher up, once you have a complete message.)
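The List<byte> approach from this answer might look roughly like the sketch below. `NullDelimitedBuffer`, `Append`, and the `maxMessageLength` parameter are hypothetical names chosen for this example; the caller would invoke `Append` from its EndReceive callback with the receive buffer and the number of bytes received.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: accumulates received bytes and splits them on '\0'.
public sealed class NullDelimitedBuffer
{
    private readonly List<byte> _pending = new List<byte>();
    private readonly int _maxMessageLength;

    public NullDelimitedBuffer(int maxMessageLength)
    {
        _maxMessageLength = maxMessageLength;
    }

    // Appends the bytes from one receive call and returns every message that is
    // now complete. An unterminated tail stays buffered for the next call.
    public List<byte[]> Append(byte[] data, int count)
    {
        for (int i = 0; i < count; i++) _pending.Add(data[i]);

        var messages = new List<byte[]>();
        int terminator;
        while ((terminator = _pending.IndexOf(0)) >= 0)
        {
            messages.Add(_pending.GetRange(0, terminator).ToArray());
            _pending.RemoveRange(0, terminator + 1); // drop the message and its '\0'
        }

        // Guard against a peer that never sends a terminator.
        if (_pending.Count > _maxMessageLength)
            throw new InvalidOperationException("Message exceeds the maximum allowed length.");

        return messages;
    }
}
```

Each returned byte array is one complete data set without its terminator, so decoding to a string can happen afterwards on whole messages. Scanning with `IndexOf` from the start on every call is simple rather than fast; for very large messages it would be worth remembering how far the previous scan got.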
I would just use a StringBuilder and read in one character at a time, copying and emptying the builder whenever I hit a null terminator.
I wrote this answer regarding Java sockets but the concept is the same.
What's the best way to monitor a socket for new data and then process that data?
