Unbuffered XmlReader implementation - C#

does anyone know of an unbuffered XmlReader implementation?
Not the XmlTextReaderImpl, which uses a byte[] buffer internally and reads from the stream at construction! I find this incredibly annoying. I only want the XmlReader to read when I actually call Read* on the object, not to buffer data ahead of time.
NOTE: I am not talking about cached vs. non-cached, but about stream buffering, which happens internally in the XmlTextReader, for example.
I want to use this in a scenario where I currently have to create a new XmlTextReader each time I want to deserialize an object, but since this creates a 4096-byte buffer every time, it puts a lot of pressure on the garbage collector. So I would like either to keep a single XmlReader instance around (one that can continuously read from a stream of XML objects) or to have an XmlReader that does not create a buffer, but neither is possible with the BCL implementation.
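For illustration, a rough sketch of the pattern described above (MyMessage, ReadObjectChunks and Handle are hypothetical stand-ins for the real types and framing code); every iteration constructs a new XmlTextReader, and with it a new internal buffer:

using System.IO;
using System.Xml;
using System.Xml.Serialization;

var serializer = new XmlSerializer(typeof(MyMessage));     // MyMessage is a placeholder type
foreach (Stream objectChunk in ReadObjectChunks(source))   // hypothetical per-object framing
{
    var reader = new XmlTextReader(objectChunk);            // allocates a fresh internal buffer every time
    var message = (MyMessage)serializer.Deserialize(reader);
    Handle(message);                                         // hypothetical handler
}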

Not creating a buffer could be impossible.
The parser needs to have lookahead, not to mention NameTables for fast processing.
4096 bytes seems big, but it might be the most efficient unit: it coincides with a page of virtual memory, and a single buffer of that size can be used instead of many repeated small allocations (guessing here).
However, being able to reuse the instance would be nice. You could have a look at the Mono implementation and rework it to your liking:
https://github.com/mono/mono/blob/master/mcs/class/System.XML/System.Xml/XmlTextReader.cs

Related

How to write length prefixed binary data efficiently

I'm writing a binary data format to file containing a graph of serialized objects. To be more resilient to errors (and to be able to debug problems) I am considering length-prefixing each object in the stream. I'm using C# and a BinaryWriter at the moment, but it is quite a general problem.
The size of each object isn't known until it has been completely serialized, so to be able to
write the length prefixes there are a number of strategies:
1. Use a write buffer with enough space to allow random access, and insert the length at the correct position after the object is serialized.
2. Write each object to its own MemoryStream, then write the length of the buffer and the buffer contents to the main stream.
3. Write a zero length for all objects in the first pass, remember the position of each object's size field in the file, and make a second pass filling in all the sizes.
4. ??
The total size (and thus the size of the first/outermost object) is typically around 1 MB but can be as large as 50-100 MB. My concern is the performance and memory usage of the process.
Which strategy would be most efficient?
Which strategy would be most efficient?
The only way to determine this is to measure.
My first instinct would be to use #2, but I know that it is likely to add pressure to the GC (or fragmentation to the large object heap if the worker streams exceed 80 KB). However, #3 sounds interesting, assuming the complexity of tracking those positions doesn't hurt maintainability.
In the end you need to measure with your data, and consider that unless you have unusual circumstances the performance will be dominated by network or storage performance, not by processing in memory.
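For what it's worth, a minimal sketch of strategy #3 on a seekable stream might look like the following; it is a simplified variant that backfills each length as soon as the object is written rather than in a separate second pass, and WriteObject / MyObject are hypothetical stand-ins for your own serialization code:

using System.IO;

static void WriteLengthPrefixed(BinaryWriter output, MyObject obj)
{
    long lengthPos = output.BaseStream.Position;
    output.Write(0);                                // 4-byte placeholder for the length
    long start = output.BaseStream.Position;
    WriteObject(output, obj);                       // your own serialization routine (hypothetical)
    long end = output.BaseStream.Position;
    output.BaseStream.Seek(lengthPos, SeekOrigin.Begin);
    output.Write((int)(end - start));               // fill in the real size
    output.BaseStream.Seek(end, SeekOrigin.Begin);  // continue where serialization left off
}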
100 MB is only 2.5% of the memory in a 'small' sized server (or a standard desktop computer). I'd serialize to memory (e.g. a byte[] array / MemoryStream with BinaryWriter) and then flush that to disk when done.
This would also keep your code clean, compact, and easy to manage - saving you from hours of tearing your hair and seeking back and forth in a large blob :)
Hope this helps!
If you control the format, you could accumulate a list of object sizes and append a directory at the end of your file. However, don't forget that in .NET world your write buffers are copied several times before actually getting transferred to disk anyway. Therefore any gains you make by avoiding (say) an extra MemoryStream will not increase the overall efficiency much.
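A hedged sketch of that directory-at-the-end idea; the layout and the objects / writer / WriteObject names are assumptions for illustration, not an established format:

using System.Collections.Generic;
using System.IO;

var sizes = new List<int>();
foreach (var obj in objects)                         // hypothetical object graph
{
    long before = writer.BaseStream.Position;
    WriteObject(writer, obj);                        // your own serialization (hypothetical)
    sizes.Add((int)(writer.BaseStream.Position - before));
}
long directoryOffset = writer.BaseStream.Position;
writer.Write(sizes.Count);                           // directory: count followed by each size
foreach (int size in sizes) writer.Write(size);
writer.Write(directoryOffset);                       // trailer so a reader can locate the directory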

Copying twitter stream to objects: can I fall behind?

Currently I'm working on an app that reads the stream from the Twitter API and parses it into objects. At the moment I read the stream, use ReadObject(...) from DataContractJsonSerializer to build my objects, and write them to a buffer in memory (don't worry, I read them from that buffer asynchronously and keep a maximum of 100 objects before I start overwriting old ones).
This works great! However: do I have any guarantee that the reading/writing will keep up with the actual stream? If not, what can I do about it?
You could use a BlockingCollection for the buffer, that way instead of overwriting old entries, an attempt to add more than 100 items will block instead while your reader catches up.
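A minimal sketch of that idea, assuming a Tweet type of your own for the parsed objects:

using System.Collections.Concurrent;

// Bounded to 100 items: Add blocks when the buffer is full, so the stream
// reader waits for the consumer instead of overwriting old entries.
var buffer = new BlockingCollection<Tweet>(boundedCapacity: 100);   // Tweet is your parsed type

// Reader thread:   buffer.Add(tweet);
// Consumer thread: foreach (var t in buffer.GetConsumingEnumerable()) Process(t);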
From what I understand, you will not have that guarantee. If you've got a limit of 100 buffered objects, you may get to the point where the buffer is full of unprocessed objects and a new one comes in and overwrites something. Really it's a trade-off: the more you allow in your buffer, the more protection you have against falling behind, at the cost of using more RAM.
The only alternative I can see is somehow writing your own scheduler prioritising the processing of the buffered objects over reading new ones from the stream.

protobuf-net message serialized size property

We are using protobuf-net for serialization and deserialization of messages in an application whose public protocol is based on Google Protocol Buffers. The library is excellent and covers all our requirements except for this one: we need to find out the serialized message length in bytes before the message is actually serialized.
This question was already asked a year and a half ago, and according to Marc, the only way to do it was to serialize to a MemoryStream and read the .Length property afterwards. This is not acceptable in our case, because MemoryStream allocates a byte buffer behind the scenes, and we have to avoid this.
This line from the same response gives us hope that it might be possible after all:
If you clarify what the use-case is, I'm sure we can make it easily
available (if it isn't already).
Here is our use case. We have messages whose size varies between several bytes and two megabytes. The application pre-allocates byte buffers used for socket operations and for serializing / deserializing, and once the warm-up phase is over, no additional buffers can be created (the point being to avoid GC and heap fragmentation). Byte buffers are essentially pooled. We also want to avoid copying bytes between buffers / streams as much as possible.
We have come up with two possible strategies and both of them require message size upfront:
Use (large) fixed-size byte buffers and serialize all messages that can fit into one buffer; send the content of the buffer using Socket.Send. We have to know when the next message cannot fit into the buffer and stop serializing. Without message size, the only way to achieve this is to wait for an exception to occur during Serialize.
Use (small) variable-size byte buffers and serialize each message into one buffer; send the content of the buffer using Socket.Send. In order to check out a byte buffer of the appropriate size from the pool, we need to know how many bytes the serialized message will have.
Because the protocol is already defined (we cannot change this) and requires message length prefix to be Varint32, we cannot use SerializeWithLengthPrefix method.
So is it possible to add a method that estimates a message size without serialization into a stream? If it is something that does not fit into the current feature set and roadmap of the library, but is doable, we are interested into extending the library ourselves. We are also looking for alternative approaches, if there are any.
As noted, this is not immediately available, as the code intentionally tries to do a single pass over the data (especially IEnumerable<T> etc). Depending on your data, though, it might already be doing a moderate amount of copying to allow for the fact that sub-messages are also length-prefixed, so the data might need juggling. This juggling can be greatly reduced by using the "grouped" sub-format internally in the message, as groups allow forwards-only construction without back-tracking.
So is it possible to add a method that estimates a message size without serialization into a stream?
An estimate is next to useless; since there is no terminator, it needs to be exact. Ultimately, the sizes are a little hard to predict without actually doing it. There was some code in v1 for size prediction, but the single-pass code currently seems preferred, and in most cases the buffer overhead is nominal (there is code in place to re-use the internal buffers so that it doesn't spend all the time allocating buffers for small messages).
If your message internally is forwards-only (grouped), then a cheat might be to serialize to a fake stream that measures, but drops all the data; you'd end up serializing twice, however.
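As a sketch of that "fake stream" idea (CountingStream is a made-up name, not part of protobuf-net): a stream that discards everything but counts the bytes written gives the exact serialized size, at the cost of serializing twice.

using System;
using System.IO;

class CountingStream : Stream
{
    public long BytesWritten { get; private set; }
    public override void Write(byte[] buffer, int offset, int count) { BytesWritten += count; }
    public override void Flush() { }
    public override bool CanRead { get { return false; } }
    public override bool CanSeek { get { return false; } }
    public override bool CanWrite { get { return true; } }
    public override long Length { get { return BytesWritten; } }
    public override long Position { get { return BytesWritten; } set { throw new NotSupportedException(); } }
    public override int Read(byte[] buffer, int offset, int count) { throw new NotSupportedException(); }
    public override long Seek(long offset, SeekOrigin origin) { throw new NotSupportedException(); }
    public override void SetLength(long value) { throw new NotSupportedException(); }
}

// Usage sketch ('message' is whatever you are about to send):
var measure = new CountingStream();
Serializer.Serialize(measure, message);   // measure.BytesWritten now holds the exact length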
Re:
and requires message length prefix to be Varint32, we cannot use SerializeWithLengthPrefix method
I'm not quite sure I see the relationship there - it allows a range of formats etc to be used here; perhaps if you can be more specific?
Re copying data around - an idea I played with here is that of using sub-normal forms for the length prefix. For example, it might be that in most cases 5 bytes is plenty, so rather than juggle, it could leave 5 bytes and then simply overwrite without condensing (since the octet 10000000 still means "zero and continue", even if it is redundant). This would still need to be buffered (to allow backfill), but would not require any movement of the data.
A final simple idea would be simply: serialize to a FileStream; then write the file length, and the file data. It trades memory usage for IO, obviously.

What does stream mean? What are its characteristics?

C++ and C# both use the word stream to name many classes.
C++: iostream, istream, ostream, stringstream, ostream_iterator, istream_iterator...
C#: Stream, FileStream, MemoryStream, BufferedStream...
So it made me curious to know, what does stream mean?
What are the characteristics of a stream?
When can I use this term to name my classes?
Is this limited to file I/O classes only?
Interestingly, C doesn’t use this word anywhere, as far as I know.
Many data-structures (lists, collections, etc) act as containers - they hold a set of objects. But not a stream; if a list is a bucket, then a stream is a hose. You can pull data from a stream, or push data into a stream - but normally only once and only in one direction (there are exceptions of course). For example, TCP data over a network is a stream; you can send (or receive) chunks of data, but only in connection with the other computer, and usually only once - you can't rewind the Internet.
Streams can also manipulate data passing through them; compression streams, encryption streams, etc. But again - the underlying metaphor here is a hose of data. A file is also generally accessed (at some level) as a stream; you can access blocks of sequential data. Of course, most file systems also provide random access, so streams do offer things like Seek, Position, Length etc - but not all implementations support such. It has no meaning to seek some streams, or get the length of an open socket.
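To make the "hose" point concrete, a small sketch: the same forward-only code can drain any Stream, whether its source is a file, a socket, or memory, without knowing which.

using System.IO;

static long CountBytes(Stream source)
{
    byte[] chunk = new byte[4096];
    long total = 0;
    int read;
    while ((read = source.Read(chunk, 0, chunk.Length)) > 0)
        total += read;               // forward-only: each chunk is seen exactly once
    return total;
}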
There are a couple of different meanings. #1 is probably what you mean, but you might want to look at #2 too.
In the libraries like those you mentioned, a "stream" is just an abstraction for "binary data", that may or may not be random-access (as opposed to data that is continuously generated, such as if you were writing a stream that generated random data), or that may be stored anywhere (in RAM, on the hard disk, over a network, in the user's brain, etc.). They're useful because they let you avoid the details, and write generic code that doesn't care about the particular source of the stream.
As a more general computer science concept, a "stream" is sometimes thought of (loosely) as a "finite or infinite amount of data". The concept is a bit difficult to explain without an example, but in functional programming (like in Scheme), you can turn an object with state into a stateless object by treating the object's history as a "stream" of changes. (The idea is that an object's state may change over time, but if you treat the object's entire life as a "stream" of changes, the stream as a whole never changes, and you can do functional programming with it.)
From I/O Streams (a Java tutorial, but the meaning is the same in C++ / C#):
An I/O Stream represents an input source or an output destination. A stream can represent many different kinds of sources and destinations, including disk files, devices, other programs, and memory arrays. Streams support many different kinds of data, including simple bytes, primitive data types, localized characters, and objects. Some streams simply pass on data; others manipulate and transform the data in useful ways. No matter how they work internally, all streams present the same simple model to programs that use them: a stream is a sequence of data. A program uses an input stream to read data from a source, one item at a time.
In C#, the streams you have mentioned derive from the abstract base class Stream. Each implementation of this base class has a specific purpose.
For example, FileStream supports read / write operations on a file, while MemoryStream works on an in-memory buffer. Unlike the FileStream and MemoryStream classes, the BufferedStream class allows the user to buffer the I/O of another stream.
In addition to the above classes, there are several other classes that implement the Stream class. For a complete list, refer to the MSDN documentation on the Stream class.
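A small illustrative sketch (the file name is just an example): because all of these derive from Stream, one can be wrapped around another, here a BufferedStream batching small writes to a FileStream.

using System.IO;

using (Stream file = new FileStream("example.bin", FileMode.Create))
using (Stream buffered = new BufferedStream(file, 8192))
{
    byte[] payload = { 1, 2, 3 };
    buffered.Write(payload, 0, payload.Length);   // held in the buffer until flushed/closed
}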
Official terms and explanations aside, the word stream itself was taken from the "real life" stream - instead of water, data is transferred from one place to another.
Regarding the part of your question that still wasn't answered: you can give your own classes names that contain "stream", but the name will only carry the correct meaning if you actually implement some sort of stream.
In C, functions defined in <stdio.h> operate on streams.
Section 7.19.2 Streams in C99 discusses how they behave, though not what they are, apart from "an ordered sequence of characters".
The rationale gives more context in the corresponding section, starting with:
C inherited its notion of text streams from the UNIX environment in which it was born.
So that's where the concept comes from.

Memory management in C#

Good afternoon,
I have some text files containing a list of (2-gram, count) pairs collected by analysing a corpus of newspaper articles which I need to load into memory when I start a given application I am developing. To store those pairs, I am using a structure like the following one:
private static Dictionary<String, Int64>[] ListaDigramas = new Dictionary<String, Int64>[27];
The idea of having an array of dictionaries is an efficiency concern, since I read somewhere that a very large dictionary has a negative impact on performance. Every 2-gram goes into the dictionary that corresponds to its first character's ASCII code minus 97 (or into index 26 if the first character is not in the range 'a' to 'z').
When I load the (2-gram, count) pairs into memory, the application takes about 800 MB of RAM overall, and stays like that until I use a program called Memory Cleaner to free up memory. After this, the memory taken by the program drops to somewhere between 7 MB and 100 MB, without losing functionality (I think).
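As a sketch, the bucketing described above boils down to something like this (BucketFor is just an illustrative name):

static int BucketFor(string digram)
{
    char first = digram[0];
    return (first >= 'a' && first <= 'z') ? first - 'a' : 26;   // 'a'..'z' -> 0..25, everything else -> 26
}

// ListaDigramas[BucketFor(gram)][gram] = count;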
Is there any way I can free up memory this way but without using an external application? I tried to use GC.Collect() but it doesn't work in this case.
Thank you very much.
You are using a static field, so chances are that once it is loaded it never gets garbage collected; unless you call the .Clear() method on those dictionaries, they probably won't be subject to garbage collection.
It is fairly mysterious to me how utilities like that ever make it onto somebody's machine. All they do is call EmptyWorkingSet(). Maybe it looks good in Taskmgr.exe, but it is otherwise just a way to keep the hard drive busy unnecessarily. You'll get the exact same thing by minimizing the main window of your app.
I don't know the details of how Memory Cleaner works, but given that it's unlikely to know the inner workings of a program's memory allocations, the best it can probably do is cause pages to be swapped out to disk, reducing the apparent memory usage of the program.
Garbage collection won't help unless you actually have objects you aren't using any more. If you are using your dictionaries, which the GC considers that you are since it is a static field, then all the objects in them are considered in use and must belong to the active memory of the program. There's no way around this.
What you are seeing is the total memory usage of the application. This is 800 MB and will stay that way. As the comments say, Memory Cleaner makes it look like the application uses less memory. What you can try is to access all values in the dictionaries after you've run Memory Cleaner; you'll see that the memory usage goes up again (the pages are read back from swap).
What you probably want is to not load all this data into memory. Is there a way you can get the same results using an algorithm?
Alternatively, and this would probably be the best option if you are actually storing information here, you could use a database. If it's cumbersome to use a normal database like SQLExpress, you could always go for SQLite.
About the only other idea I could come up with, if you really want to keep your memory usage down, would be store the dictionary in a stream and compress it. Factors to consider would be how often you're accessing/inflating this data, and how compressible the data is. Text from newspaper articles would compress extremely well, and the performance hit might be less than you'd think.
Using an open-source library like SharpZipLib ( http://www.icsharpcode.net/opensource/sharpziplib/ ), your code would look something like:
MemoryStream stream = new MemoryStream();
BinaryFormatter formatter = new BinaryFormatter();
formatter.Serialize(stream, ListaDigramas);
byte[] dictBytes = stream.ToArray();

// Keep a reference to the target stream so the compressed bytes can be read back later
MemoryStream compressed = new MemoryStream();
Stream zipStream = new DeflaterOutputStream(compressed);
zipStream.Write(dictBytes, 0, dictBytes.Length);
zipStream.Close();   // finishes the deflate block; compressed.ToArray() now holds the result
Inflating requires an InflaterInputStream and a loop to inflate the stream in chunks, but is fairly straightforward.
You'd have to play with the app to see if performance was acceptable. Keeping in mind, of course, that you'll still need enough memory to hold the dictionary when you inflate it for use (unless someone has a clever idea to work with the object in its compressed state).
Honestly, though, keeping it as-is in memory and letting Windows swap it to the page file is probably your best/fastest option.
Edit
I've never tried it, but you might be able to serialize directly to the compression stream, meaning the compression overhead is minimal (you'd still have the serialization overhead):
BinaryFormatter formatter = new BinaryFormatter();
MemoryStream compressed = new MemoryStream();
Stream zipStream = new DeflaterOutputStream(compressed);
formatter.Serialize(zipStream, ListaDigramas);   // serialize straight into the compressor
zipStream.Close();   // flush so compressed.ToArray() is complete
Thank you very much for all the answers. The data actually needs to be loaded during the whole running time of the application, so based on your answers I think there is nothing better to do... I could perhaps try an external database, but since I already need to deal with two other databases at the same time, I think it is not a good idea.
Do you think it is possible to work with three databases at the same time and not lose performance?
If you are disposing of your application's resources correctly, then the memory actually in use may not be what you are seeing (if you are verifying through Task Manager).
The Garbage Collector will free up the unused memory at the best possible time. It usually isn't really a good idea to force collection either...see this post
"data actually needs to be loaded during the whole running time of the application" - why?
