.NET Binary File Read Performance

.NET Binary File Read Performance - c#

I have a very large set of binary files where several thousand raw frames of video are being sequentially read and processed, and I’m now looking to optimize it as it appears to be more CPU-bound than I/O-bound.
The frames are currently being read in this manner, and I suspect this is the biggest culprit:
private byte[] frameBuf;
BinaryReader binRead = new BinaryReader(FS);
// Initialize a new buffer of sizeof(frame)
frameBuf = new byte[VARIABLE_BUFFER_SIZE];
//Read sizeof(frame) bytes from the file
frameBuf = binRead.ReadBytes(VARIABLE_BUFFER_SIZE);
Would it make much of a difference in .NET to re-organize the I/O to avoid creating all these new byte arrays with each frame?
My understanding of .NET’s memory allocation mechanism is weak as I am coming from a pure C/C++ background. My idea is to re-write this to share a static buffer class that contains a very large shared buffer with an integer keeping track of the frame’s actual size, but I love the simplicity and readability of the current implementation and would rather keep it if the CLR already handles this in some way I am not aware of.
Any input would be much appreciated.

You don't need to init frameBuf if you use binRead.ReadBytes -- you'll get back a new byte array which will overwrite the one you just created. This does create a new array for each read, though.
If you want to avoid creating a bunch of byte arrays, you could use binRead.Read, which will put the bytes into an array you supply to it. If other threads are using the array, though, they'll see the contents of it change right in front of them. Be sure you can guarantee you're done with the buffer before reusing it.

You need to be careful here. It is very easy to get completely bogus test results on code like this, results that never repro in real use. The problem is the file system cache, it will cache the data you read from a file. The trouble starts when you run your test over and over again, tweaking the code and looking for improvements.
The second, and subsequent times you run the test, the data no longer comes off the disk. It is still present in the cache, it only takes a memory-to-memory copy to get it into your program. That's very fast, a microsecond or so of overhead plus the time needed to copy. Which runs at bus-speeds, at least 5 gigabytes per second on modern machines.
Your test will now reveal that you spend a lot of time on allocating the buffer and processing the data, relative from the amount of time spent reading the data.
This will rarely repro in real use. The data won't be in the cache yet, now the sluggy disk drive needs to seek the data (many milliseconds) and it needs to be read off the disk platter (a couple of dozen megabytes per second, at best). Reading the data now takes a good three of four magnitudes of time longer. If you managed to make the processing step twice as fast, your program will actually only run 0.05% faster. Give or take.

Related

How to write length prefixed binary data efficiently

I'm writing a binary data format to file containing a graph of serialized objects. To be more resilient to errors (and to be able to debug problems) I am considering length-prefixing each object in the stream. I'm using C# and a BinaryWriter at the moment, but it is quite a general problem.
The size of each object isn't known until it has been completely serialized, so to be able to
write the length prefixes there are a number of strategies:
Use a write buffer with enough space to have random access and insert the length at the correct position after the object is serialized.
Write each object to its own MemoryStream, then write the length of the buffer and the buffer contents to the main stream.
Write a zero length for all objects in the first pass, remember the positions in the file for all object sizes (a table of object to size), and make a second pass filling in all the sizes.
??
The total size (and thus the size of the first/outermost object) is typically around 1mb but can be as large as 50-100mb. My concern is the performance and memory usage of the process.
Which strategy would be most efficient?

Which strategy would be most efficient?
The only way to determine this is to measure.
My first instinct would be to use #2, but knowing that is likely to add pressure to the GC (or fragmentation to the large object heap if the worker streams exceed 80Kb). However #3 sounds interesting, assuming the complexity of tracking those positions doesn't hit maintainability.
In the end you need to measure with your data, and consider that unless you have unusual circumstances the performance will be dominated by network or storage performance, not by processing in memory.

100MB is only 2.5% of the memory in a 'small' sized server (or a standard desktop computer). I'd serialize to memory (e.g. a byte[] array/MemoryStream with BinaryWriter) and then flush that to disk when done.
This would also keep your code clean, compact, and easy to manage - saving you from hours of tearing your hair and seeking back and forth in a large blob :)
Hope this helps!

If you control the format, you could accumulate a list of object sizes and append a directory at the end of your file. However, don't forget that in .NET world your write buffers are copied several times before actually getting transferred to disk anyway. Therefore any gains you make by avoiding (say) an extra MemoryStream will not increase the overall efficiency much.

Storing file in byte array vs reading and writing with file stream?

I'm working on a program that modifies a file, and I'm wondering if the way I'm working with it is wrong.
The file is stored in blocks inside another file and is separated by a bunch of hashes. It's only about 1mb in size, so I just calculate its location once and read it into a byte array and work with it like that.
I'm wondering if it's some kind of horrendous programming habit to a read an entire file, despite its size, into a byte array in memory. It is the sole purpose of my program though and is about the only memory it takes up.

This depends entirely on the expected size (range) of the files you will be reading in. If your input files can reach over a hundred MB in size, this approach doesn't make much sense.
If your input files are small relative to the memory of machines your software will run on, and your program design benefits from having the entire contents in memory, then it's not horrendous; it's sensible.
However, if your software doesn't actually require the entire file's contents in memory, then there's not much of an argument for doing this (even for smaller files.)

If you require random read/write access to the file in order to modify it then reading it into memory is probably ok as long as you can be sure the file will never ever exceed a certain size (you don't want to read a few hundred MB file into memory).
Usually using a stream reader (like a BinaryReader) and processing the data as you go is a better option.

It's horrendous -- like most memory-/CPU-hogging activities -- if you don't have to do it.

Memory management in C#

Good afternoon,
I have some text files containing a list of (2-gram, count) pairs collected by analysing a corpus of newspaper articles which I need to load into memory when I start a given application I am developing. To store those pairs, I am using a structure like the following one:
private static Dictionary<String, Int64>[] ListaDigramas = new Dictionary<String, Int64>[27];
The ideia of having an array of dictionaries is due to efficiency questions, since I read somewhere that a long dictionary has a negative impact on performance. That said, every 2-gram goes into the dictionary that corresponds to it's first character's ASCII code minus 97 (or 26 if the first character is not a character in the range from 'a' to 'z').
When I load the (2-gram, count) pairs into memory, the application takes an overall 800Mb of RAM, and stays like this until I use a program called Memory Cleaner to free up memory. After this, the memory taken by the program goes down to the range 7Mb-100Mb, without losing functionality (I think).
Is there any way I can free up memory this way but without using an external application? I tried to use GC.Collect() but it doesn't work in this case.
Thank you very much.

You are using a static field so chances are once it is loaded it never gets garbage collected, so unless you call the .Clear() method on this dictionary it probably won't be subject to garbage collection.

It is fairly mysterious to me how utilities like that ever make it onto somebody's machine. All they do is call EmptyWorkingSet(). Maybe it looks good in Taskmgr.exe, but it is otherwise just a way to keep the hard drive busy unnecessarily. You'll get the exact same thing by minimizing the main window of your app.

I don't know the details of how memory cleaner works, but given that it's unlikely to know the inner workings of a programs memory allocations, the best it can probably do is just cause pages to be swapped out to disk reducing the apparent memory usage of the program.
Garbage collection won't help unless you actually have objects you aren't using any more. If you are using your dictionaries, which the GC considers that you are since it is a static field, then all the objects in them are considered in use and must belong to the active memory of the program. There's no way around this.

What you are seeing is the total usage of the application. This is 800MB and will stay that way. As the comments say, memory cleaner makes it look like the application uses less memory. What you can try to do is access all values in the dictionary after you've run the memory cleaner. You'll see that the memory usage goes up again (it's read from swap).
What you probably want is to not load all this data into memory. Is there a way you can get the same results using an algorithm?
Alternatively, and this would probably be the best option if you are actually storing information here, you could use a database. If it's cumbersome to use a normal database like SQLExpress, you could always go for SQLite.

About the only other idea I could come up with, if you really want to keep your memory usage down, would be store the dictionary in a stream and compress it. Factors to consider would be how often you're accessing/inflating this data, and how compressible the data is. Text from newspaper articles would compress extremely well, and the performance hit might be less than you'd think.
Using an open-source library like SharpZipLib ( http://www.icsharpcode.net/opensource/sharpziplib/ ), your code would look something like:
MemoryStream stream = new MemoryStream();
BinaryFormatter formatter = new BinaryFormatter();
formatter.Serialize(stream, ListaDigramas);
byte[] dictBytes = stream.ToArray();
Stream zipStream = new DeflaterOutputStream(new MemoryStream());
zipStream.Write(dictBytes, 0, dictBytes.Length);
Inflating requires an InflaterInputStream and a loop to inflate the stream in chunks, but is fairly straightforward.
You'd have to play with the app to see if performance was acceptable. Keeping in mind, of course, that you'll still need enough memory to hold the dictionary when you inflate it for use (unless someone has a clever idea to work with the object in its compressed state).
Honestly, though, keeping it as-is in memory and letting Windows swap it to the page file is probably your best/fastest option.
Edit
I've never tried it, but you might be able to serialize directly to the compression stream, meaning the compression overhead is minimal (you'd still have the serialization overhead):
MemoryStream stream = new MemoryStream();
BinaryFormatter formatter = new BinaryFormatter();
Stream zipStream = new DeflaterOutputStream(new MemoryStream());
formatter.Serialize(zipStream, ListaDigramas);

Thank you very much for all the answers. The data actually needs to be loaded during the whole running time of the application, so based on your answers I think there is nothing better to do... I could perhaps try an external database, but since I already need to deal with two other databases at the same time, I think it is not a good idea.
Do you think it is possible to be dealing with three databases at the same time and do not lose on performance?

If you are disposing of your applications resources correctly then the actual used memory may not be what you are seeing (if verifying through Task Manager).
The Garbage Collector will free up the unused memory at the best possible time. It usually isn't really a good idea to force collection either...see this post
"data actually needs to be loaded during the whole running time of the application" - why?

Reading huge amounts of small files in sequence

I have this problem: I have a collection of small files that are about 2000 bytes large each (they are all the exact same size) and there are about ~100.000 of em which equals about 200 megabytes of space. I need to be able to, in real time, select a range in these files. Say file 1000 to 1100 (100 files total), read them and send them over the network decently fast.
The good thing is the files will always be read in sequence, i.e. it's always going to be a range of say "from this file and a hundred more" and not "this file here, and that file over there, etc.".
Files can also be added to this collection during runtime, so it's not a fixed amount of files.
The current scheme I've come up with is this: No file is larger then 2000 bytes, so instead of having several files allocated on the disk I'm going to have one large file containing all other files at even 2048 byte intervals with the 2 first bytes of each 2048 block being the actual byte size of the file contained in the next 2046 bytes (the files range between 1800 and 1950 bytes or so in size) and then seek inside this file instead of opening a new file handle for each file I need to read.
So when I need to get file at position X i will just do X*2048, read the first two bytes and then read the bytes from (X*2048)+2 to the size contained in the first two bytes. This large 200mb file will be append only so it's safe to read even while the serialized input thread/process (haven't decided yet) appends more data to it.
This has to be doable on Windows, C is an option but I would prefer C#.

Do you have anything against storing these files in a database?
A simple RDBMS would drastically speed up the searching and sorting of a bunch fo 2k files

I think your idea is probably the best you can do with decent work.
Alternatively you could buy a solid state disk and not care about the filesize.
Or you could just preload the entire data into a collection into memory if you don't depend on keeping RAM usage low (will also be the fastest option).
Or you could use a database, but the overhead here will be substantial.

That sounds like a reasonable option.
When reading the data for the range, I'd be quite tempted to seek to the start of the "block of data", and read the whole lot into memory (i.e. the 2048 byte buffers for all the files) in one go. That will get the file IO down to a minimum.
Once you've got all the data in memory, you can decode the sizes and send just the bits which are real data.
Loading all of it into memory may well be a good idea, but that will entirely depend on how often it's modified and how often it's queried.
Was there anything more to the question than just "is this a sane thing to do"?

Are you sure you will never want to delete files from, say, 1200 to 1400? What happens when you are done transferring? Is the data archived or will it continuously grow?
I really don't see why appending all of the data to a single file would improve performance. Instead it's likely to cause more issues for you down the line. So, why would you combine them?
Other things to consider are, what happens if the massive file gets some corruption in the middle from bad sectors on the disk? Looks like you lose everything. Keeping them separate should increase their survivability.
You can certainly work with large files without loading the entire thing in memory, but that's not exactly easy and you will ultimately have to drop down to some low level coding to do it. Don't constrain yourself. Also, what if the file requires a bit of hand editing? Most programs would force you to load and lock the entire thing.
Further, having a single large file would mean that you can't have multiple processes reading / writing the data. This limits scalability.
If you know you need files from #1000 to 1100, you can use the built in (c#) code to get a collection of files meeting that criteria.

You can simply concatenate all the files in one big file 'dbase' without any header or footer.
In another file 'index', you can save the position of all the small files in 'dbase'. This index file, as very small, can be cached completely in memory.
This scheme allows you to fast read the required files, and to add new ones at the end of your collection.

Your plan sounds workable. It seems like a filestream can peform the seeks and reads that you need. Are you running into specific problems with implementation, or are you looking for a better way to do it?
Whether there is a better way might depend upon how fast you can read the files vs how fast you can transmit them on the network. Assuming that you can read tons of individual files faster than you can send them, perhaps you could set up a bounded buffer, where you read ahead x number of files into a queue. Another thread would be reading from the queue and sending them on the network

I would modify your scheme in one way: instead of reading the first two bytes, then using those to determine the size of the next read, I'd just read 2KiB immediately, then use the first two bytes to determine how many bytes you transmit.
You'll probably save more time by using only one disk read than by avoiding transferring the last ~150 bytes from the disk into memory.
The other possibility would be to pack the data for the files together, and maintain a separate index to tell you the start position of each. For your situation, this has the advantage that instead of doing a lot of small (2K) reads from the disk, you can combine an arbitrary number into one large read. Getting up to around 64-128K per read will generally save a fair amount of time.

You could stick with your solution of one big file but use memory mapping to access it (see here e.g.). This might be a bit more performant, since you also avoid paging and the virtual memory management is optimized for transferring chunks of 4096 bytes.
Afaik, there's no direct support for memory mapping, but here is some example how to wrap the WIN32 API calls for C#.
See also here for a related question on SO.

Interestingly, this problem reminds me of the question in this older SO question:
Is this an over-the-top question for Senior Java developer role?

One big byte buffer or several small ones?

I'm learning C# asynchronous socket programming, and I've learned that it's a good idea to reuse byte buffers in some sort of pool, and then just check one out as needed when receiving data from a socket.
However, I have seen two different methods of doing a byte array pool: one used a simple queue system, and just added/removed them from the queue as needed. If one was requested and there were no more left in the queue, a new byte array is created.
The other method that I've seen uses one big byte array for the entire program. The idea of a queue still applies, but instead it's a queue of integers which determine the slice (offset) of the byte array to use. If one was requested and there were no more left in the queue, the array must be resized.
Which one of these is a better solution for a highly scalable server? My instinct is it would be cheaper to just use many byte arrays because I'd imagine resizing the array as needed (even if we allocate it in large chunks) would be pretty costly, especially when it gets big. Using multiple arrays seems more intuitive too - is there some advantage to using one massive array that I'm not thinking of?

You are correct in your gut feeling. Every time you need to make the array bigger, you will be recreating the array and copying the existing bytes over. Since we are talking about bytes here, the size of the array may get large very quickly. So, you will be asking for a contiguous piece of memory each time, which, depending on how your program uses memory, might or might not be viable. This will also in effect, become a virtual pool, so to speak. A pool by definition has a set of multiple items that are managed and shared by various clients.
The one array solution is also way more complex to implement. The good thing is that a one array solution allows you to give variable-sized chunks out, but this comes at the cost of essentially reimplementing malloc: dealing with fragmentation, etc, etc, which you shouldn't get into.
A multiple array solution allows you to initialize a pool with N amount of buffers and easily manage them in a straightforward fashion. Definitely the approach I'd recommend.

I wouldn't suggest the resizing option. Start simple and work your way up. A queue of byte buffers which gets a new one added to the end when it is exhausted would be a good start. You will probably have to pay attention to threading issues, so my advice would be to use somebody else's thread-safe queue implementation.
Next you can take a look at the more complex "pointers" into a big byte array chunk, except my advice would be to have a queue of 4k/16k (some power of two multiple of the page size) blocks that you index into, and when it is full you add another big chunk to the queue. Actually, I don't recommend this at all due to the complexity and the dubious gain in performance.
Start simple, work your way up. Pool of buffers, make it thread safe, see if you need anything more.

One more vote for multiple buffers, but with the addition that since you're doing things asynchronously you need to make sure your queue is threadsafe. The default Queue<T> collection is definitely not threadsafe.
SO user and MS employee JaredPar has a good threadsafe queue implementation here:
http://blogs.msdn.com/jaredpar/archive/2009/02/16/a-more-usable-thread-safe-collection.aspx

If you use the single buffer you need a strategy of how fast it should grow when needed. If you grow it by small increments you may have to do it often and copy all the data often. If you grow it by large increments (like the next size is 1,5 times the previous one) you risk to face a situation when you get "Out of memory" simply trying to grow the buffer. It's a lose-lose choice for a scalable system. This is why reusing small buffers is preferrable.

With a garbage collection heap, you should always favor small, right-sized buffers that have a short life-time. The .NET heap allocator is very fast, generation #0 collections are very cheap.
When you keep a static buffer around, you'll use up system resources for the life of the program. The worst case scenario is when it gets big enough to get moved in the Large Object Heap where it will be a permanent obstacle that cannot be moved.

Develop Reference

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.