I'm writing a binary data format to a file containing a graph of serialized objects. To be more resilient to errors (and to be able to debug problems) I am considering length-prefixing each object in the stream. I'm using C# and a BinaryWriter at the moment, but it is quite a general problem.
The size of each object isn't known until it has been completely serialized, so to be able to
write the length prefixes there are a number of strategies:
Use a write buffer with enough space to have random access and insert the length at the correct position after the object is serialized.
Write each object to its own MemoryStream, then write the length of the buffer and the buffer contents to the main stream (see the sketch after this list).
Write a zero length for all objects in the first pass, remember the file position of each length field (a table mapping object to position), and make a second pass filling in all the sizes.
??
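For #2, what I have in mind is roughly the following sketch (MyObject and WriteObjectBody are placeholders for the real type and serialization code):

using System.IO;

void WriteLengthPrefixed(BinaryWriter mainWriter, MyObject obj)
{
    using (MemoryStream scratch = new MemoryStream())
    {
        BinaryWriter scratchWriter = new BinaryWriter(scratch);
        WriteObjectBody(scratchWriter, obj);    // placeholder: serialize one object into the scratch stream
        scratchWriter.Flush();

        mainWriter.Write((int)scratch.Length);                           // length prefix
        mainWriter.Write(scratch.GetBuffer(), 0, (int)scratch.Length);   // payload, without an extra ToArray copy
    }
}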
The total size (and thus the size of the first/outermost object) is typically around 1 MB but can be as large as 50-100 MB. My concern is the performance and memory usage of the process.
Which strategy would be most efficient?
Which strategy would be most efficient?
The only way to determine this is to measure.
My first instinct would be to use #2, although I know it is likely to add pressure on the GC (or fragmentation to the large object heap if the worker streams exceed the ~85 KB LOH threshold). However #3 sounds interesting, assuming the complexity of tracking those positions doesn't hurt maintainability.
In the end you need to measure with your data, and consider that unless you have unusual circumstances the performance will be dominated by network or storage performance, not by processing in memory.
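If you do go with #3, note that it doesn't have to be a literal second pass over the whole file: as long as the output stream is seekable you can patch each length in as soon as its object is finished. A rough sketch of that (assuming a seekable stream such as a FileStream underneath the BinaryWriter):

long BeginObject(BinaryWriter writer)
{
    long lengthPos = writer.BaseStream.Position;
    writer.Write(0);                       // placeholder length
    return lengthPos;
}

void EndObject(BinaryWriter writer, long lengthPos)
{
    long end = writer.BaseStream.Position;
    int length = (int)(end - lengthPos - sizeof(int));
    writer.BaseStream.Seek(lengthPos, SeekOrigin.Begin);
    writer.Write(length);                  // patch in the real length
    writer.BaseStream.Seek(end, SeekOrigin.Begin);
}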
100 MB is only 2.5% of the memory in a 'small' sized server (or a standard desktop computer). I'd serialize to memory (e.g. a byte[] array/MemoryStream with BinaryWriter) and then flush that to disk when done.
This would also keep your code clean, compact, and easy to manage - saving you from hours of tearing your hair out and seeking back and forth in a large blob :)
Hope this helps!
If you control the format, you could accumulate a list of object sizes and append a directory at the end of your file. However, don't forget that in .NET world your write buffers are copied several times before actually getting transferred to disk anyway. Therefore any gains you make by avoiding (say) an extra MemoryStream will not increase the overall efficiency much.
I'm working on a program that modifies a file, and I'm wondering if the way I'm working with it is wrong.
The file is stored in blocks inside another file and is separated by a bunch of hashes. It's only about 1 MB in size, so I just calculate its location once, read it into a byte array, and work with it like that.
I'm wondering if it's some kind of horrendous programming habit to read an entire file, regardless of its size, into a byte array in memory. It is the sole purpose of my program, though, and is about the only memory it takes up.
This depends entirely on the expected size (range) of the files you will be reading in. If your input files can reach over a hundred MB in size, this approach doesn't make much sense.
If your input files are small relative to the memory of machines your software will run on, and your program design benefits from having the entire contents in memory, then it's not horrendous; it's sensible.
However, if your software doesn't actually require the entire file's contents in memory, then there's not much of an argument for doing this (even for smaller files.)
If you require random read/write access to the file in order to modify it then reading it into memory is probably ok as long as you can be sure the file will never ever exceed a certain size (you don't want to read a few hundred MB file into memory).
Usually using a stream reader (like a BinaryReader) and processing the data as you go is a better option.
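For example, something along these lines keeps only one block in memory at a time (a sketch; path and ProcessBlock are placeholders):

using (FileStream fs = File.OpenRead(path))
using (BinaryReader reader = new BinaryReader(fs))
{
    byte[] block = new byte[64 * 1024];                    // fixed-size working buffer
    int read;
    while ((read = reader.Read(block, 0, block.Length)) > 0)
    {
        ProcessBlock(block, read);                         // placeholder: handle 'read' valid bytes
    }
}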
It's horrendous -- like most memory-/CPU-hogging activities -- if you don't have to do it.
We are using protobuf-net for serialization and deserialization of messages in an application whose public protocol is based on Google Protocol Buffers. The library is excellent and covers all our requirements except for this one: we need to find out the serialized message length in bytes before the message is actually serialized.
The question has already been asked a year and a half ago and according to Marc, the only way to do this was to serialize to a MemoryStream and read the .Length property afterwards. This is not acceptable in our case, because MemoryStream allocates a byte buffer behind the scenes and we have to avoid this.
This line from the same response gives us hope that it might be possible after all:
If you clarify what the use-case is, I'm sure we can make it easily
available (if it isn't already).
Here is our use case. We have messages whose size varies between several bytes and two megabytes. The application pre-allocates byte buffers used for socket operations and for serializing / deserializing, and once the warm-up phase is over, no additional buffers can be created (hint: avoiding GC and heap fragmentation). Byte buffers are essentially pooled. We also want to avoid copying bytes between buffers / streams as much as possible.
We have come up with two possible strategies and both of them require message size upfront:
Use (large) fixed-size byte buffers and serialize all messages that can fit into one buffer; send the content of the buffer using Socket.Send. We have to know when the next message cannot fit into the buffer and stop serializing. Without message size, the only way to achieve this is to wait for an exception to occur during Serialize.
Use (small) variable size byte buffers and serialize each message into one buffer; send the content of the buffer using Socket.Send. In order to check out a byte buffer of the appropriate size from the pool, we need to know how many bytes a serialized message will take.
Because the protocol is already defined (we cannot change this) and requires the message length prefix to be a Varint32, we cannot use the SerializeWithLengthPrefix method.
So is it possible to add a method that estimates a message size without serialization into a stream? If it is something that does not fit into the current feature set and roadmap of the library, but is doable, we are interested in extending the library ourselves. We are also looking for alternative approaches, if there are any.
As noted, this is not immediately available, as the code intentionally tries to do a single pass over the data (especially IEnumerable<T> etc). Depending on your data, though, it might already be doing a moderate amount of copying, to allow for the fact that sub-messages are also length-prefixed, so it might need juggling. This juggling can be greatly reduced by using the "grouped" sub-format internally in the message, as groups allow forwards-only construction without back-tracking.
So is it possible to add a method that estimates a message size without serialization into a stream?
An estimate is next to useless; since there is no terminator, it needs to be exact. Ultimately, the sizes are a little hard to predict without actually doing it. There was some code in v1 for size prediction, but the single-pass code currently seems preferred, and in most cases the buffer overhead is nominal (there is code in place to re-use the internal buffers so that it doesn't spend all the time allocating buffers for small messages).
If your message internally is forwards-only (grouped), then a cheat might be to serialize to a fake stream that measures, but drops all the data; you'd end up serializing twice, however.
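By way of illustration, that measuring stream can be as trivial as this (a sketch, not part of the library):

using System;
using System.IO;

class CountingStream : Stream
{
    public long BytesWritten { get; private set; }

    public override void Write(byte[] buffer, int offset, int count)
    {
        BytesWritten += count;                             // count the bytes, drop the data
    }
    public override void WriteByte(byte value) { BytesWritten++; }
    public override void Flush() { }

    public override bool CanRead { get { return false; } }
    public override bool CanSeek { get { return false; } }
    public override bool CanWrite { get { return true; } }
    public override long Length { get { return BytesWritten; } }
    public override long Position
    {
        get { return BytesWritten; }
        set { throw new NotSupportedException(); }
    }
    public override int Read(byte[] buffer, int offset, int count) { throw new NotSupportedException(); }
    public override long Seek(long offset, SeekOrigin origin) { throw new NotSupportedException(); }
    public override void SetLength(long value) { throw new NotSupportedException(); }
}

You would serialize once into a CountingStream just to get the length, then serialize again into the real buffer.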
Re:
and requires message length prefix to be Varint32, we cannot use SerializeWithLengthPrefix method
I'm not quite sure I see the relationship there - it allows a range of formats etc to be used here; perhaps if you can be more specific?
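For instance, PrefixStyle.Base128 is exactly a varint length prefix, so (assuming the overload taking just a PrefixStyle is available in the build you're using, and with destinationStream / message standing in for your own names) something like this should match the wire format you describe:

Serializer.SerializeWithLengthPrefix(destinationStream, message, PrefixStyle.Base128);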
Re copying data around - an idea I played with here is that of using sub-normal forms for the length prefix. For example, it might be that in most cases 5 bytes is plenty, so rather than juggle, it could leave 5 bytes and then simply overwrite without condensing (since the octet 10000000 still means "zero and continue", even if it is redundant). This would still need to be buffered (to allow backfill), but would not require any movement of the data.
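To make that concrete: any length can be padded to exactly 5 bytes and still be a valid varint, so the prefix slot can be reserved up front and overwritten in place once the payload size is known. A sketch:

// Writes 'value' as a varint padded to exactly 5 bytes; the redundant
// continuation bytes (0x80) still decode as "zero and continue".
static void WritePaddedVarint32(byte[] buffer, int offset, uint value)
{
    for (int i = 0; i < 4; i++)
    {
        buffer[offset + i] = (byte)((value & 0x7F) | 0x80);   // 7 bits plus continuation bit
        value >>= 7;
    }
    buffer[offset + 4] = (byte)(value & 0x7F);                // final byte, no continuation bit
}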
A final simple idea would be simply: serialize to a FileStream; then write the file length, and the file data. It trades memory usage for IO, obviously.
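Roughly like this (a sketch; output is your real target stream and WriteVarint32 is a hypothetical helper that writes the varint prefix):

string tmpPath = Path.GetTempFileName();
using (FileStream tmp = new FileStream(tmpPath, FileMode.Create, FileAccess.ReadWrite))
{
    Serializer.Serialize(tmp, message);              // serialize to disk instead of memory

    WriteVarint32(output, (uint)tmp.Length);         // hypothetical: write the varint length prefix
    tmp.Position = 0;
    byte[] chunk = new byte[8192];
    int read;
    while ((read = tmp.Read(chunk, 0, chunk.Length)) > 0)
    {
        output.Write(chunk, 0, read);                // copy the payload into the real output
    }
}
File.Delete(tmpPath);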
I have a very large set of binary files where several thousand raw frames of video are being sequentially read and processed, and I’m now looking to optimize it as it appears to be more CPU-bound than I/O-bound.
The frames are currently being read in this manner, and I suspect this is the biggest culprit:
private byte[] frameBuf;
BinaryReader binRead = new BinaryReader(FS);
// Initialize a new buffer of sizeof(frame)
frameBuf = new byte[VARIABLE_BUFFER_SIZE];
//Read sizeof(frame) bytes from the file
frameBuf = binRead.ReadBytes(VARIABLE_BUFFER_SIZE);
Would it make much of a difference in .NET to re-organize the I/O to avoid creating all these new byte arrays with each frame?
My understanding of .NET’s memory allocation mechanism is weak as I am coming from a pure C/C++ background. My idea is to re-write this to share a static buffer class that contains a very large shared buffer with an integer keeping track of the frame’s actual size, but I love the simplicity and readability of the current implementation and would rather keep it if the CLR already handles this in some way I am not aware of.
Any input would be much appreciated.
You don't need to init frameBuf if you use binRead.ReadBytes -- you'll get back a new byte array which will overwrite the one you just created. This does create a new array for each read, though.
If you want to avoid creating a bunch of byte arrays, you could use binRead.Read, which will put the bytes into an array you supply to it. If other threads are using the array, though, they'll see the contents of it change right in front of them. Be sure you can guarantee you're done with the buffer before reusing it.
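For example, a sketch of the reuse approach (MAX_FRAME_SIZE and ProcessFrame are placeholders; FS and VARIABLE_BUFFER_SIZE are from your snippet):

byte[] frameBuf = new byte[MAX_FRAME_SIZE];          // allocated once, large enough for any frame
BinaryReader binRead = new BinaryReader(FS);

// Read sizeof(frame) bytes into the shared buffer instead of allocating a new array per frame
int bytesRead = binRead.Read(frameBuf, 0, VARIABLE_BUFFER_SIZE);
ProcessFrame(frameBuf, bytesRead);                   // placeholder: only the first bytesRead bytes are valid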
You need to be careful here. It is very easy to get completely bogus test results on code like this, results that never repro in real use. The problem is the file system cache: it will cache the data you read from a file. The trouble starts when you run your test over and over again, tweaking the code and looking for improvements.
The second and subsequent times you run the test, the data no longer comes off the disk. It is still present in the cache, and it only takes a memory-to-memory copy to get it into your program. That's very fast, a microsecond or so of overhead plus the time needed to copy. Which runs at bus speeds, at least 5 gigabytes per second on modern machines.
Your test will now reveal that you spend a lot of time on allocating the buffer and processing the data, relative from the amount of time spent reading the data.
This will rarely repro in real use. The data won't be in the cache yet, so now the sluggish disk drive needs to seek to the data (many milliseconds) and it needs to be read off the disk platter (a couple of dozen megabytes per second, at best). Reading the data now takes a good three or four orders of magnitude longer. If you managed to make the processing step twice as fast, your program will actually only run 0.05% faster. Give or take.
I have this problem: I have a collection of small files that are about 2000 bytes each (they are all the exact same size), and there are about ~100,000 of them, which equals about 200 megabytes of space. I need to be able to, in real time, select a range in these files, say file 1000 to 1100 (100 files total), read them, and send them over the network decently fast.
The good thing is the files will always be read in sequence, i.e. it's always going to be a range of say "from this file and a hundred more" and not "this file here, and that file over there, etc.".
Files can also be added to this collection during runtime, so it's not a fixed amount of files.
The current scheme I've come up with is this: no file is larger than 2000 bytes, so instead of having several files allocated on the disk, I'm going to have one large file containing all the other files at even 2048-byte intervals. The first 2 bytes of each 2048-byte block hold the actual byte size of the file contained in the next 2046 bytes (the files range between 1800 and 1950 bytes or so in size). I will then seek inside this file instead of opening a new file handle for each file I need to read.
So when I need to get the file at position X, I will just seek to X*2048, read the first two bytes, and then read the bytes from (X*2048)+2 up to the size contained in the first two bytes. This large 200 MB file will be append-only, so it's safe to read even while the serialized input thread/process (haven't decided yet) appends more data to it.
This has to be doable on Windows; C is an option, but I would prefer C#.
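In C#, the read path I have in mind looks roughly like this (error handling and short reads omitted; fs is the already-open FileStream over the big container file):

byte[] ReadFileAt(FileStream fs, int index)
{
    fs.Seek((long)index * 2048, SeekOrigin.Begin);

    byte[] sizeBytes = new byte[2];
    fs.Read(sizeBytes, 0, 2);                        // first two bytes = payload size
    int size = sizeBytes[0] | (sizeBytes[1] << 8);   // assuming a little-endian size field

    byte[] payload = new byte[size];
    fs.Read(payload, 0, size);                       // read the actual file contents
    return payload;
}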
Do you have anything against storing these files in a database?
A simple RDBMS would drastically speed up the searching and sorting of a bunch of 2k files
I think your idea is probably the best you can do with a reasonable amount of work.
Alternatively you could buy a solid state disk and not care about the filesize.
Or you could just preload the entire data set into a collection in memory if you don't depend on keeping RAM usage low (this will also be the fastest option).
Or you could use a database, but the overhead here will be substantial.
That sounds like a reasonable option.
When reading the data for the range, I'd be quite tempted to seek to the start of the "block of data", and read the whole lot into memory (i.e. the 2048 byte buffers for all the files) in one go. That will get the file IO down to a minimum.
Once you've got all the data in memory, you can decode the sizes and send just the bits which are real data.
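As a sketch (fs, first and count are placeholders, and SendPayload stands in for whatever writes to the network):

// Read files [first, first + count) in a single pass, then send only the real data.
byte[] block = new byte[count * 2048];
fs.Seek((long)first * 2048, SeekOrigin.Begin);
fs.Read(block, 0, block.Length);                         // one big read for the whole range

for (int i = 0; i < count; i++)
{
    int offset = i * 2048;
    int size = block[offset] | (block[offset + 1] << 8); // two-byte size header per 2048-byte slot
    SendPayload(block, offset + 2, size);                // placeholder: send only the 'size' real bytes
}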
Loading all of it into memory may well be a good idea, but that will entirely depend on how often it's modified and how often it's queried.
Was there anything more to the question than just "is this a sane thing to do"?
Are you sure you will never want to delete files from, say, 1200 to 1400? What happens when you are done transferring? Is the data archived or will it continuously grow?
I really don't see why appending all of the data to a single file would improve performance. Instead it's likely to cause more issues for you down the line. So, why would you combine them?
Other things to consider are, what happens if the massive file gets some corruption in the middle from bad sectors on the disk? Looks like you lose everything. Keeping them separate should increase their survivability.
You can certainly work with large files without loading the entire thing in memory, but that's not exactly easy and you will ultimately have to drop down to some low level coding to do it. Don't constrain yourself. Also, what if the file requires a bit of hand editing? Most programs would force you to load and lock the entire thing.
Further, having a single large file would mean that you can't have multiple processes reading / writing the data. This limits scalability.
If you know you need files from #1000 to 1100, you can use the built-in (C#) APIs to get a collection of files meeting that criterion.
You can simply concatenate all the files in one big file 'dbase' without any header or footer.
In another file, 'index', you can save the position of all the small files in 'dbase'. This index file, being very small, can be cached completely in memory.
This scheme allows you to read the required files quickly and to add new ones at the end of your collection.
Your plan sounds workable. It seems like a FileStream can perform the seeks and reads that you need. Are you running into specific problems with the implementation, or are you looking for a better way to do it?
Whether there is a better way might depend upon how fast you can read the files vs. how fast you can transmit them on the network. Assuming that you can read tons of individual files faster than you can send them, perhaps you could set up a bounded buffer, where you read ahead x number of files into a queue. Another thread would then read from the queue and send them over the network.
I would modify your scheme in one way: instead of reading the first two bytes, then using those to determine the size of the next read, I'd just read 2KiB immediately, then use the first two bytes to determine how many bytes you transmit.
You'll probably save more time by using only one disk read than by avoiding transferring the last ~150 bytes from the disk into memory.
The other possibility would be to pack the data for the files together, and maintain a separate index to tell you the start position of each. For your situation, this has the advantage that instead of doing a lot of small (2K) reads from the disk, you can combine an arbitrary number into one large read. Getting up to around 64-128K per read will generally save a fair amount of time.
You could stick with your solution of one big file but use memory mapping to access it (see here e.g.). This might be a bit more performant, since you also avoid paging and the virtual memory management is optimized for transferring chunks of 4096 bytes.
AFAIK, there's no direct support for memory mapping, but here is an example of how to wrap the Win32 API calls from C#.
See also here for a related question on SO.
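(If .NET 4 or later is an option, System.IO.MemoryMappedFiles provides this directly without wrapping Win32 yourself; a minimal sketch, with "dbase.bin" and index as placeholders:)

using System.IO;
using System.IO.MemoryMappedFiles;

using (MemoryMappedFile mmf = MemoryMappedFile.CreateFromFile("dbase.bin", FileMode.Open))
using (MemoryMappedViewAccessor view = mmf.CreateViewAccessor())
{
    long offset = (long)index * 2048;
    ushort size = view.ReadUInt16(offset);               // two-byte size header
    byte[] payload = new byte[size];
    view.ReadArray(offset + 2, payload, 0, size);        // copy just the payload
}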
Interestingly, this problem reminds me of this older SO question:
Is this an over-the-top question for Senior Java developer role?
My application does a good deal of binary serialization and compression of large objects. Uncompressed, the serialized dataset is about 14 MB; compressed it is around 1.5 MB. I find that whenever I call the serialize method on my dataset, my large object heap performance counter jumps from under 1 MB to about 90 MB. I also know that under a relatively heavily loaded system, usually after a while of running (days) in which this serialization process happens a few times, the application has been known to throw out-of-memory exceptions when this serialization method is called, even though there seems to be plenty of memory. I'm guessing that fragmentation is the issue (though I can't say I'm 100% sure, I'm pretty close).
The simplest short-term fix (I guess I'm looking for both a short-term and a long-term answer) I can think of is to call GC.Collect right after I'm done with the serialization process. This, in my opinion, will garbage collect the object from the LOH and will likely do so BEFORE other objects can be added to it. This will allow other objects to fit tightly against the remaining objects in the heap without causing much fragmentation.
Other than this ridiculous 90 MB allocation I don't think I have anything else that uses a lot of the LOH. This 90 MB allocation is also relatively rare (around every 4 hours). We of course will still have the 1.5 MB array in there and maybe some other smaller serialized objects.
Any ideas?
Update as a result of good responses
Here is my code which does the work. I've actually tried changing this to compress WHILE serializing, so that serialization writes to a compressing stream at the same time, and I don't get much better results. I've also tried pre-allocating the memory stream to 100 MB and trying to use the same stream twice in a row; the LOH goes up to 180 MB anyway. I'm using Process Explorer to monitor it. It's insane. I think I'm going to try the UnmanagedMemoryStream idea next.
I would encourage you guys to try it out if you want. It doesn't have to be this exact code. Just serialize a large dataset and you will get surprising results (mine has lots of tables, around 15, and lots of strings and columns).
byte[] bytes;
System.Runtime.Serialization.Formatters.Binary.BinaryFormatter serializer =
    new System.Runtime.Serialization.Formatters.Binary.BinaryFormatter();
// Serialize the whole object graph into an in-memory buffer...
System.IO.MemoryStream memStream = new System.IO.MemoryStream();
serializer.Serialize(memStream, obj);
// ...then copy that buffer (ToArray allocates yet another large array) and compress it.
bytes = CompressionHelper.CompressBytes(memStream.ToArray());
memStream.Dispose();
return bytes;
Update after trying binary serialization with UnmanagedMemoryStream
Even if I serialize to an UnmanagedMemoryStream, the LOH jumps up to the same size. It seems that no matter what I do, calling the BinaryFormatter to serialize this large object will use the LOH. As for pre-allocating, it doesn't seem to help much. Say I pre-allocate 100 MB, then serialize; it will use 170 MB. Here is the code for that, even simpler than the code above:
BinaryFormatter serializer = new BinaryFormatter();
MemoryStream memoryStream = new MemoryStream(1024*1024*100);
GC.Collect();
serializer.Serialize(memoryStream, assetDS);
The GC.Collect() in the middle there is just to update the LOH performance counter. You will see that it allocates the expected 100 MB. But then when you call Serialize, you will notice that it seems to add that on top of the 100 MB that you have already allocated.
Beware of the way collection classes and streams like MemoryStream work in .NET. They have an underlying buffer, a simple array. Whenever the collection or stream buffer grows beyond the allocated size of the array, the array gets re-allocated, now at double the previous size.
This can cause many copies of the array in the LOH. Your 14MB dataset will start using the LOH at 128KB, then take another 256KB, then another 512KB, etcetera. The last one, the one actually used, will be around 16MB. The LOH contains the sum of these, around 30MB, only one of which is in actual use.
Do this three times without a gen2 collection and your LOH has grown to 90MB.
Avoid this by pre-allocating the buffer to the expected size. MemoryStream has a constructor that takes an initial capacity. So do all collection classes. Calling GC.Collect() after you've nulled all references can help unclog the LOH and purge those intermediate buffers, at the cost of clogging the gen1 and gen2 heaps too soon.
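As a sketch of both points applied to your snippet (the 20 MB capacity is a guess at the expected serialized size, and GZipStream is standing in for your CompressionHelper):

using System.IO;
using System.IO.Compression;

using (MemoryStream memStream = new MemoryStream(20 * 1024 * 1024))   // pre-sized: no doubling re-allocations
{
    serializer.Serialize(memStream, obj);

    using (MemoryStream compressed = new MemoryStream())
    {
        using (GZipStream gzip = new GZipStream(compressed, CompressionMode.Compress))
        {
            // GetBuffer() avoids the extra large array that ToArray() would allocate
            gzip.Write(memStream.GetBuffer(), 0, (int)memStream.Length);
        }
        bytes = compressed.ToArray();
    }
}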
Unfortunately, the only way I could fix this was to break up the data in chunks so as not to allocate large chunks on the LOH. All the proposed answers here were good and were expected to work but they did not. It seems that the binary serialization in .NET (using .NET 2.0 SP2) does its own little magic under the hood which prevents users from having control over memory allocation.
Answer then to the question would be "this is not likely to work". When it comes to using .NET serialization, your best bet is to serialize the large objects in smaller chunks. For all other scenarios, the answers mentioned above are great.
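For a DataSet, "smaller chunks" in practice meant something along these lines (a simplified sketch of the idea, not the exact production code; output is the destination stream):

BinaryFormatter serializer = new BinaryFormatter();
foreach (DataTable table in assetDS.Tables)
{
    using (MemoryStream chunk = new MemoryStream())
    {
        serializer.Serialize(chunk, table);                              // each table is a much smaller graph
        output.Write(BitConverter.GetBytes((int)chunk.Length), 0, 4);    // length prefix per chunk
        output.Write(chunk.GetBuffer(), 0, (int)chunk.Length);
    }
}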
90MB of RAM is not much.
Avoid calling GC.Collect unless you have a problem. If you have a problem, and no better fix, try calling GC.Collect and seeing if your problem is solved.
Don't worry about the LOH size jumping up. Worry about allocating/deallocating the LOH. .NET is very dumb about the LOH -- rather than allocating LOH objects far away from the regular heap, it allocates at the next available VM page. I have a 3D app that does a lot of allocating/deallocating of both LOH and regular objects -- the result (as seen in a DebugDiag dump report) is that pages of the small heap and the large heap end up alternating throughout RAM, until there are no large chunks of the application's 2 GB VM space left. The solution, when possible, is to allocate what you need once, and then don't release it -- re-use it next time.
Use DebugDiag to analyze your process. See how the VM addresses gradually creep up towards the 2 GB address mark. Then make a change that keeps that from happening.
I agree with some of the other posters here that you might want to try and use tricks to work with the .NET Framework instead of trying to force it to work with you via GC.Collect.
You may find this Channel 9 video helpful which discusses ways to ease pressure on the Garbage collector.
If you really need to use the LOH for something like a service or something that needs to be running for a long time, you need to use buffer pools that are never deallocated and that you can ideally allocate on start-up. This means you'll have to do your 'memory management' yourself for this, of course.
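A very bare-bones version of such a pool might look like this (sizes are made up, and a real one needs thread-safety and an exhaustion policy):

using System.Collections.Generic;

class BufferPool
{
    private readonly Queue<byte[]> free = new Queue<byte[]>();

    public BufferPool(int bufferCount, int bufferSize)
    {
        // Allocate everything up front so the LOH layout never changes afterwards.
        for (int i = 0; i < bufferCount; i++)
            free.Enqueue(new byte[bufferSize]);
    }

    public byte[] Rent()
    {
        return free.Dequeue();            // throws if exhausted; a real pool would block or grow
    }

    public void Return(byte[] buffer)
    {
        free.Enqueue(buffer);             // buffers are reused, never handed back to the GC
    }
}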
Depending on what you're doing with this memory, you might also have to p/Invoke over to native code for selected parts to avoid having to call some .NET API that forces you to put the data on newly allocated space in the LOH.
This is a good starting point article about the issues: https://devblogs.microsoft.com/dotnet/using-gc-efficiently-part-3/
I'd consider you very lucky if your GC trick works, and it will really only work if there isn't much going on in the system at the same time. If you have work going on in parallel, this will just slightly delay the inevitable.
Also read up on the documentation for GC.Collect. IIRC, GC.Collect(n) only says that it collects no further than generation n -- not that it actually ever GETS to generation n.