I'm building a testbed client/server system (.NET 4.0) that will eventually have two components communicating via streams across some transport medium, but at the moment has the two communicating via a single MemoryStream. I've never used streams before, and I assumed I could write and read at the same time. However, because there's only one 'cursor', I can't read from the stream until it's finished writing and I can seek() back to zero.
The named pipe stuff supports full duplex operation, but only if I set one object up as the server and have the other connect to it, which is not something I want to do at this point.
I can get the result I want by creating a byte buffer and having two MemoryStream instances pointing at that buffer, but that falls over when I reach the end of the buffer and get an exception because the memory stream can't be expanded.
I could probably do this by creating a file instead of the array and having two FileStream instances, but that seems a somewhat messy way of doing it. And if left running would result in a full disk since nothing would be pruning the data that's been read.
What I'm after is a stream that doesn't support seek() or position, maintains separate read and write pointers, buffers data that's written to it and discards it sometime after it's been read. Feels like reinventing the wheel to roll my own. Surely such a thing is already around somewhere?
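For illustration, the kind of stream described above can be sketched in a few lines. This is only a minimal sketch, written in Python rather than C# to keep it short; the class name PipeBuffer and its methods are hypothetical, not an existing library type. Writes append to a queue, reads consume from the front and discard what they return, and there is no seek or position.

```python
import threading
from collections import deque


class PipeBuffer:
    """Forward-only byte pipe: writes append, reads consume and discard."""

    def __init__(self):
        self._chunks = deque()              # written chunks not yet fully read
        self._cond = threading.Condition()  # guards the queue, wakes readers
        self._closed = False

    def write(self, data: bytes) -> int:
        with self._cond:
            if self._closed:
                raise ValueError("write to closed pipe")
            if data:
                self._chunks.append(bytes(data))
                self._cond.notify_all()
            return len(data)

    def close(self) -> None:
        """Signal end of data; bytes already written can still be read."""
        with self._cond:
            self._closed = True
            self._cond.notify_all()

    def read(self, max_bytes: int) -> bytes:
        """Block until data arrives or the pipe is closed.

        Returns b"" once the pipe is closed and drained.
        """
        with self._cond:
            while not self._chunks and not self._closed:
                self._cond.wait()
            out = bytearray()
            while self._chunks and len(out) < max_bytes:
                chunk = self._chunks.popleft()
                take = max_bytes - len(out)
                out += chunk[:take]
                if len(chunk) > take:
                    # Put the unread tail back at the front of the queue.
                    self._chunks.appendleft(chunk[take:])
            return bytes(out)


pipe = PipeBuffer()
pipe.write(b"hello ")
pipe.write(b"world")
first = pipe.read(4)    # b"hell"
rest = pipe.read(100)   # b"o world"
```

Because data is dropped as soon as it is read, memory use is bounded by how far the writer gets ahead of the reader. A real version would probably also cap the queue size so a fast writer blocks instead of growing the buffer without limit.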
We are using protobuf-net for serialization and deserialization of messages in an application whose public protocol is based on Google Protocol Buffers. The library is excellent and covers all our requirements except for this one: we need to find out the serialized message length in bytes before the message is actually serialized.
The question has already been asked a year and a half ago and according to Marc, the only way to do this was to serialize to a MemoryStream and read the .Length property afterwards. This is not acceptable in our case, because MemoryStream allocates a byte buffer behind the scenes and we have to avoid this.
This line from the same response gives us hope that it might be possible after all:
If you clarify what the use-case is, I'm sure we can make it easily available (if it isn't already).
Here is our use case. We have messages whose size varies between several bytes and two megabytes. The application pre-allocates byte buffers used for socket operations and for serializing / deserializing, and once the warm-up phase is over, no additional buffers can be created (hint: avoiding GC and heap fragmentation). Byte buffers are essentially pooled. We also want to avoid copying bytes between buffers / streams as much as possible.
We have come up with two possible strategies and both of them require message size upfront:
Use (large) fixed-size byte buffers and serialize all messages that can fit into one buffer; send the content of the buffer using Socket.Send. We have to know when the next message cannot fit into the buffer and stop serializing. Without message size, the only way to achieve this is to wait for an exception to occur during Serialize.
Use (small) variable-size byte buffers and serialize each message into one buffer; send the content of the buffer using Socket.Send. In order to check out a byte buffer of the appropriate size from the pool, we need to know how many bytes a serialized message has.
Because the protocol is already defined (we cannot change this) and requires message length prefix to be Varint32, we cannot use SerializeWithLengthPrefix method.
So is it possible to add a method that estimates a message size without serialization into a stream? If it is something that does not fit into the current feature set and roadmap of the library, but is doable, we are interested in extending the library ourselves. We are also looking for alternative approaches, if there are any.
As noted, this is not immediately available, as the code intentionally tries to do a single pass over the data (especially IEnumerable<T> etc). Depending on your data, though, it might already be doing a moderate amount of copying, to allow for the fact that sub-messages are also length-prefixed, so might need juggling. This juggling can be greatly reduced by using the "grouped" sub-format internally in the message, as groups allow forwards-only construction without track-backs.
So is it possible to add a method that estimates a message size without serialization into a stream?
An estimate is next to useless; since there is no terminator, it needs to be exact. Ultimately, the sizes are a little hard to predict without actually doing it. There was some code in v1 for size prediction, but the single-pass code currently seems preferred, and in most cases the buffer overhead is nominal (there is code in place to re-use the internal buffers so that it doesn't spend all the time allocating buffers for small messages).
If your message internally is forwards-only (grouped), then a cheat might be to serialize to a fake stream that measures, but drops all the data; you'd end up serializing twice, however.
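The "fake stream that measures" idea can be sketched as follows. This is a rough illustration only, in Python, with pickle standing in for the real serializer (an assumption made purely for the demo); CountingSink and measured_size are hypothetical names, not part of protobuf-net.

```python
import io
import pickle


class CountingSink(io.RawIOBase):
    """Write-only stream that discards data and records how much it saw."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def writable(self) -> bool:
        return True

    def write(self, data) -> int:
        n = memoryview(data).nbytes  # works for bytes, bytearray, memoryview
        self.count += n              # keep the size, drop the bytes
        return n


def measured_size(obj) -> int:
    """First pass: serialize into the sink only to learn the exact length."""
    sink = CountingSink()
    pickle.dump(obj, sink)
    return sink.count


message = {"id": 42, "payload": "x" * 1000}
size = measured_size(message)   # exact byte count, no buffer allocated for the data
```

The second pass then serializes for real into a buffer of exactly that size, which is the "serializing twice" cost mentioned above.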
Re:
and requires message length prefix to be Varint32, we cannot use SerializeWithLengthPrefix method
I'm not quite sure I see the relationship there - it allows a range of formats etc to be used here; perhaps if you can be more specific?
Re copying data around - an idea I played with here is that of using sub-normal forms for the length prefix. For example, it might be that in most cases 5 bytes is plenty, so rather than juggle, it could leave 5 bytes, and then simply overwrite without condensing (since the octet 10000000 still means "zero and continue", even if it is redundant). This would still need to be buffered (to allow backfill), but would not require any movement of the data.
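To make the sub-normal prefix concrete, here is a small sketch in Python. The helper names are hypothetical and this is not protobuf-net code. It reserves 5 bytes, writes the payload after them, then backfills the length as a varint padded with redundant continuation octets. Whether a particular decoder accepts such over-long varints is an assumption you would need to check against the other end of the protocol.

```python
def encode_padded_varint(value: int, width: int = 5) -> bytes:
    """Encode value as a varint that always occupies exactly `width` bytes."""
    if value < 0:
        raise ValueError("value must be non-negative")
    out = bytearray()
    for i in range(width):
        byte = value & 0x7F
        value >>= 7
        if i < width - 1:
            byte |= 0x80          # continuation bit, even if the rest is zero
        out.append(byte)
    if value:
        raise ValueError("value does not fit in the reserved width")
    return bytes(out)


def decode_varint(buf, offset: int = 0):
    """Standard varint decode; returns (value, position after the varint)."""
    result = 0
    shift = 0
    pos = offset
    while True:
        b = buf[pos]
        pos += 1
        result |= (b & 0x7F) << shift
        if not b & 0x80:
            return result, pos
        shift += 7


PREFIX_WIDTH = 5
buf = bytearray(PREFIX_WIDTH)          # reserve room for the prefix
buf += b"x" * 300                      # stand-in for the serialized message
buf[:PREFIX_WIDTH] = encode_padded_varint(len(buf) - PREFIX_WIDTH)

length, body_start = decode_varint(buf)   # 300, 5
```

The payload never moves: only the reserved 5 bytes are overwritten once the length is known. The cost is up to 4 wasted bytes per message compared with the shortest encoding.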
A final simple idea would be simply: serialize to a FileStream; then write the file length, and the file data. It trades memory usage for IO, obviously.
Does anyone know of an unbuffered XmlReader implementation?
Not the XmlTextReaderImpl which uses a byte[] buffer internally and reads from a stream at construction! I find this incredibly annoying. I only want the XmlReader to read when actually calling Read* on the object. Not to buffer data or similar.
NOTE: I am not talking about cached vs non-cached. But stream buffering, which happens internally in the XmlTextReader for example.
I want to use this in a scenario where I currently have to create a new XmlTextReader each time I want to deserialize an object, but since this creates a 4096-byte buffer every time, it puts a lot of pressure on the garbage collector. So I would like either to keep an instance of the XmlReader around (one that can continuously read from a stream of XML objects), which is not possible with the BCL implementation, or to have an XmlReader that does not create a buffer.
Not creating a buffer could be impossible.
The parser needs to have lookahead, not to mention NameTables for fast processing.
4096 seems big, but might be the most efficient unit: it coincides with a page of virtual memory, and it could be used instead of many repeated small allocations (I'm guessing here).
However, reusing the instance would have been nice. You could have a look at the mono implementation and work it to your liking:
https://github.com/mono/mono/blob/master/mcs/class/System.XML/System.Xml/XmlTextReader.cs
I'm new to .NET and still confused about BinaryFormatter and FileStream; from all I've read, they seem to do the same or a similar thing. For example, BinaryFormatter.Serialize takes a FileStream and the object as parameters, while I thought FileStream's job was to transfer the stream object to the file. I'm just confused; can someone tell me how they work together and what the difference between the two is?
Streams represent raw data that can be accessed sequentially. They are used (directly or indirectly) whenever you input or output. There are different kinds of streams. For example:
NetworkStream and FileStream read and write data from a network port and disk without performing any transformations.
GZipStream or CryptoStream decorate an underlying stream by adding compression or encryption.
BinaryFormatter requires a Stream to write to or read from. But its responsibility is very different: it's used to convert .NET objects to a sequence of bytes that can be saved or transmitted through a network. The concrete medium and any additional transformations are determined by the type of stream you use.
All streams inherit from the Stream class and share the same interface which is very convenient. Classes like BinaryFormatter can rely on this shared interface without knowing the specifics of particular implementation.
Once again, BinaryFormatter is for converting an object to and from a sequence of bytes.
Streams are for reading and writing these bytes to a particular medium.
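The same split can be shown in a few lines. This is a sketch in Python rather than .NET, with pickle playing the formatter role and io.BytesIO / gzip.GzipFile playing the stream roles; save and load are hypothetical helper names. The serializing code is identical in both cases, and only the stream handed to it changes.

```python
import gzip
import io
import pickle


def save(obj, stream) -> None:
    """'Formatter' role: turn the object into bytes and hand them to a stream."""
    pickle.dump(obj, stream)


def load(stream):
    """Read bytes back from any readable stream and rebuild the object."""
    return pickle.load(stream)


message = {"id": 7, "tags": ["a", "b"]}

# Plain in-memory stream: bytes are stored as-is.
plain = io.BytesIO()
save(message, plain)
plain.seek(0)
from_plain = load(plain)

# Compressing stream decorating the in-memory stream: same save/load code.
packed = io.BytesIO()
with gzip.GzipFile(fileobj=packed, mode="wb") as gz:
    save(message, gz)
packed.seek(0)
with gzip.GzipFile(fileobj=packed, mode="rb") as gz:
    from_packed = load(gz)
```

Both round trips give back an equal object; the serializer never needs to know whether its bytes ended up compressed, in memory, or on disk.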
C++ and C# both use the word stream to name many classes.
C++: iostream, istream, ostream, stringstream, ostream_iterator, istream_iterator...
C#: Stream, FileStream, MemoryStream, BufferedStream...
So it made me curious to know, what does stream mean?
What are the characteristics of a stream?
When can I use this term to name my classes?
Is this limited to file I/O classes only?
Interestingly, C doesn’t use this word anywhere, as far as I know.
Many data-structures (lists, collections, etc) act as containers - they hold a set of objects. But not a stream; if a list is a bucket, then a stream is a hose. You can pull data from a stream, or push data into a stream - but normally only once and only in one direction (there are exceptions of course). For example, TCP data over a network is a stream; you can send (or receive) chunks of data, but only in connection with the other computer, and usually only once - you can't rewind the Internet.
Streams can also manipulate data passing through them; compression streams, encryption streams, etc. But again - the underlying metaphor here is a hose of data. A file is also generally accessed (at some level) as a stream; you can access blocks of sequential data. Of course, most file systems also provide random access, so streams do offer things like Seek, Position, Length etc - but not all implementations support such. It has no meaning to seek some streams, or get the length of an open socket.
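As a rough illustration of the hose metaphor, here is a small forward-only stream in Python. HoseStream is a hypothetical name and the chunks are made-up data. It can be read once, front to back, and simply has no notion of seeking.

```python
import io


class HoseStream(io.RawIOBase):
    """Read-only, forward-only stream fed by an iterator of byte chunks."""

    def __init__(self, chunks):
        super().__init__()
        self._chunks = iter(chunks)
        self._pending = b""          # bytes pulled from the source, not yet read

    def readable(self) -> bool:
        return True

    def readinto(self, buffer) -> int:
        # Pull the next non-empty chunk from the source when we run dry.
        while not self._pending:
            try:
                self._pending = next(self._chunks)
            except StopIteration:
                return 0             # end of stream
        n = min(len(buffer), len(self._pending))
        buffer[:n] = self._pending[:n]
        self._pending = self._pending[n:]
        return n


hose = HoseStream([b"abc", b"def"])
can_seek = hose.seekable()       # False: the base class default
data = hose.readall()            # b"abcdef", consumed exactly once
again = hose.readall()           # b"": nothing left, and no way to rewind

try:
    hose.seek(0)
    rewound = True
except io.UnsupportedOperation:
    rewound = False              # seeking is simply not supported
```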
There's a couple different meanings. #1 is what you probably mean, but you might want to look at #2 too.
In the libraries like those you mentioned, a "stream" is just an abstraction for "binary data", that may or may not be random-access (as opposed to data that is continuously generated, such as if you were writing a stream that generated random data), or that may be stored anywhere (in RAM, on the hard disk, over a network, in the user's brain, etc.). They're useful because they let you avoid the details, and write generic code that doesn't care about the particular source of the stream.
As a more general computer science concept, a "stream" is sometimes thought of (loosely) as a "finite or infinite amount of data". The concept is a bit difficult to explain without an example, but in functional programming (like in Scheme), you can turn an object with state into a stateless object, by treating the object's history as a "stream" of changes. (The idea is that an object's state may change over time, but if you treat the object's entire life as a "stream" of changes, the stream as a whole never changes, and you can do functional programming with it.)
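A tiny example of that second meaning, sketched in Python instead of Scheme, with a made-up bank balance as the "object with state": nothing is mutated, and the balance at any moment is just a function of the stream of changes so far.

```python
from itertools import accumulate


def balances(changes, opening=0):
    """Lazily map a stream of changes to the stream of resulting states."""
    return accumulate(changes, initial=opening)


deposits_and_withdrawals = [100, -30, 50]
history = list(balances(deposits_and_withdrawals))   # [0, 100, 70, 120]
```

The list of changes is never modified; each "state" is derived from it, which is what makes the stream view compatible with functional programming.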
From I/O Streams (though in Java, the meaning is the same in C++ / C#)
An I/O Stream represents an input source or an output destination. A stream can represent many different kinds of sources and destinations, including disk files, devices, other programs, and memory arrays.
Streams support many different kinds of data, including simple bytes, primitive data types, localized characters, and objects. Some streams simply pass on data; others manipulate and transform the data in useful ways.
No matter how they work internally, all streams present the same simple model to programs that use them: A stream is a sequence of data. A program uses an input stream to read data from a source, one item at a time.
In C#, the streams you have mentioned derive from the abstract base class Stream. Each implementation of this base class has a specific purpose.
For example, FileStream supports read / write operations on a file, while MemoryStream works on an in-memory buffer. BufferedStream, unlike FileStream and MemoryStream, wraps another stream and adds a buffering layer to its I/O.
In addition to the above classes, there are several other classes that derive from Stream. For a complete list, refer to the MSDN documentation on the Stream class.
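To show what a buffering wrapper buys you, here is a rough sketch using Python's io.BufferedReader as an analogue of BufferedStream (an analogy, not the .NET API). CountingRaw is a hypothetical helper that counts how often the underlying source is actually hit.

```python
import io


class CountingRaw(io.RawIOBase):
    """Raw source that counts how many times it is asked for data."""

    def __init__(self, data: bytes):
        super().__init__()
        self._src = io.BytesIO(data)
        self.raw_reads = 0

    def readable(self) -> bool:
        return True

    def readinto(self, buffer) -> int:
        self.raw_reads += 1
        return self._src.readinto(buffer)


raw = CountingRaw(bytes(1000))
buffered = io.BufferedReader(raw, buffer_size=4096)

# One hundred 1-byte reads from the caller's point of view...
data = b"".join(buffered.read(1) for _ in range(100))

# ...but the raw source is touched far fewer times, because the wrapper
# fetches a large block up front and serves the small reads from memory.
calls = raw.raw_reads
```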
Official terms and explanations aside, the word stream itself was taken from the "real life" stream - instead of water, data is transferred from one place to another.
Regarding the question you asked that still hasn't been answered: you can give your own classes names that contain "stream", but the name only carries the correct meaning if the class implements some sort of new stream.
In C, functions defined in <stdio.h> operate on streams.
Section 7.19.2 Streams in C99 discusses how they behave, though not what they are, apart from "an ordered sequence of characters".
The rationale gives more context in the corresponding section, starting with:
C inherited its notion of text streams from the UNIX environment in which it was born.
So that's where the concept comes from.