I was going through some source code written by another developer and came across the following line when dealing with streams (file, memory, etc.) and file/content uploads. Is there a particular reason this person is using 1024 in the buffer size? Why is the 1024 multiplied by 16?
byte[] buffer = new byte[16*1024];
Could someone please clarify this further? Also, it would be awesome if anyone can direct me towards articles, etc to further read and understand this.
The practice of allocating memory in powers of 2 is a holdover from days of yore. Word sizes are powers of 2 (e.g. fullword = 32 bits, doubleword = 64 bits), and virtual memory page sizes were powers of 2. You wanted your allocated memory to align on convenient boundaries because it made execution more efficient. For instance, once upon a time, loading a word or doubleword into a register was more expensive in terms of CPU cycles if it wasn't on an appropriate boundary (e.g., a memory address divisible by 4 or 8 respectively). And if you were allocating a big chunk of memory, you might as well consume a whole page of virtual memory, because you'd likely lock an entire page anyway.
These days it doesn't really matter, but old practices die hard.
(And unless you knew something about how the memory allocator worked and how many words of overhead were involved in each malloc()'d block, it probably didn't work anyway.)
1024 is the exact number of bytes in a kilobyte. All that line means is that they are creating a buffer of 16 KB. That's really all there is to it. If you want to go down the route of why there are 1024 bytes in a kilobyte and why it's a good idea to use that in programming, this would be a good place to start, and here would also be a good place to look; although the latter is talking about disk space, it's the same general idea.
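For context, here's a minimal sketch (a generic illustration, not the code you quoted) of how such a buffer is typically used in a stream-copy loop; the 16 KB value just bounds how much is read per call:

using System.IO;

static void CopyStream(Stream input, Stream output)
{
    byte[] buffer = new byte[16 * 1024]; // 16 KB scratch buffer
    int read;
    // Read up to 16 KB per call until the source stream is exhausted.
    while ((read = input.Read(buffer, 0, buffer.Length)) > 0)
    {
        output.Write(buffer, 0, read);
    }
}

Any power-of-two size in the same ballpark (8 KB, 32 KB, 64 KB) would behave similarly; 16 KB is just a common, page-friendly compromise between memory use and the number of read calls.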
Related
I noticed that my application runs out of memory quicker than it should. It creates many byte arrays of several megabytes each. However when I looked at memory usage with vmmap, it appears .NET allocates much more than needed for each buffer. To be precise, when allocating a buffer of 9 megabytes, .NET creates a heap of 16 megabytes. The remaining 7 megabytes cannot be used to create another buffer of 9 megabytes, so .NET creates another 16 megabytes. So each 9MB buffer wastes 7MB of address space!
Here's a sample program that throws an OutOfMemoryException after allocating 106 buffers in 32-bit .NET 4:
using System.Collections.Generic;

namespace CSharpMemoryAllocationTest
{
    class Program
    {
        static void Main(string[] args)
        {
            var buffers = new List<byte[]>();
            for (int i = 0; i < 130; ++i)
            {
                buffers.Add(new byte[9 * 1000 * 1024]);
            }
        }
    }
}
Note that you can increase the size of the array to 16 * 1000 * 1024 and still allocate the same number of buffers before running out of memory.
VMMap shows this:
Also note how there's an almost 100% difference between the total size of the Managed Heap and the total Committed size (1,737 MB vs. 946 MB).
Is there a reliable way around this problem on .NET, i.e. can I coerce the runtime into allocating no more than what I actually need, or maybe much larger Managed Heaps that can be used for several contiguous buffers?
Internally the CLR allocates memory in segments. From your description it sounds like the 16 MB allocations are segments and your arrays are allocated within them. The remaining space is reserved and not really wasted under normal circumstances, as it will be used for other allocations. If you don't have any allocation that fits within the remaining chunks, they are essentially overhead.
As your arrays are allocated using contiguous memory, you can only fit a single one of them within a segment, hence the overhead in this case.
The default segment size is 16 MB, but if your allocation is larger than that the CLR will allocate larger segments. I'm not aware of the details, but e.g. if I allocate 20 MB, VMMap shows me 24 MB segments.
One way to reduce the overhead is to make allocations that fit the segment sizes, if possible. But keep in mind that these are implementation details and could change with any update of the CLR.
The CLR reserves a 16 MB chunk in one go from the OS, but only actively occupies 9 MB of it.
I believe you are expecting the 9 MB and the next 9 MB to share one heap. The difficulty is that the second buffer would then be split over two heaps:
Heap 1 = 9 MB + 7 MB
Heap 2 = 2 MB
The problem now is that if the original 9 MB is freed, we have two heaps we can't tidy up, as their contents are shared across heaps.
To keep performance up, the approach is to keep each allocation within a single heap.
If you are worried about memory usage, don't be. Memory usage is not a bad thing in .NET; if no one else needs the memory, what's the problem? The GC will kick in at some point and memory will be tidied up. The GC will only kick in:
When the CLR deems it necessary
When the OS tells the CLR to give back memory
When forced to by the code
But memory usage, especially in this example, shouldn't be a concern. Holding onto memory saves CPU cycles: if the runtime tidied up memory constantly, your CPU usage would be high and your process (and every other process on your machine) would run much slower.
This is an age-old symptom of the buddy-system heap management algorithm, where powers of 2 are used to split each block recursively in a binary tree, so for 9 MB the next size up is 16 MB. If you drop your array size down to 8 MB, you will see your usage drop by half. This is not a new problem; native programmers deal with it too.
The small object heap (objects under 85,000 bytes) is managed differently, but at 9 MB your arrays are on the large object heap. As of .NET 4.5, the large object heap doesn't participate in compaction; large objects are immediately promoted to generation 2.
You can't coerce the algorithm, but you can certainly adjust your user code by figuring out which sizes most efficiently fill the binary segments.
If you need to fill your process space with 9 MB arrays, you can:
Figure out how to save 1MB to reduce the arrays to 8MB segments
Write or use a segmented array class that abstracts a list of 1 or 2 MB array segments behind an indexer property, the same way you would build an unlimited bitfield or a growable ArrayList (a minimal sketch follows this answer). Actually, I thought one of the built-in containers already did this.
Move to 64-bit
Reclaiming the fragmented portion of a buddy-system heap is an optimization with logarithmic returns (i.e. you are approximately running out of memory anyway). At some point you'll have to move to 64-bit whether it's convenient or not, unless your data size is fixed.
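As a rough sketch of the segmented-array option above (the class name ChunkedByteArray and the 1 MB chunk size are illustrative choices, not an existing container):

using System;

public class ChunkedByteArray
{
    private const int ChunkSize = 1024 * 1024; // 1 MB per chunk
    private readonly byte[][] chunks;

    public ChunkedByteArray(long length)
    {
        Length = length;
        int chunkCount = (int)((length + ChunkSize - 1) / ChunkSize);
        chunks = new byte[chunkCount][];
        for (int i = 0; i < chunkCount; i++)
        {
            long remaining = length - (long)i * ChunkSize;
            chunks[i] = new byte[Math.Min(ChunkSize, remaining)];
        }
    }

    public long Length { get; private set; }

    // The indexer hides the chunking, so callers can treat this as one big array.
    public byte this[long index]
    {
        get { return chunks[index / ChunkSize][index % ChunkSize]; }
        set { chunks[index / ChunkSize][index % ChunkSize] = value; }
    }
}

Each chunk still lives on the large object heap, but many chunks fit within a single 16 MB segment, so far less address space is wasted than with one 9 MB array per segment.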
I'm reading binary files and here is a sample:
public static byte[] ReadFully(Stream input)
{
    byte[] buffer = new byte[16 * 1024];
    int read;
    while ((read = input.Read(buffer, 0, buffer.Length)) > 0)
    {
        ......
    }
}
Obviously the buffer size (16*1024) plays a big role in performance. I've read that it depends on the I/O technology (SATA, SSD, SCSI, etc.) and also on the allocation unit (cluster) size of the partition the file lives on (which you can choose when formatting the partition).
But here is the question:
Is there any formula or best practice for choosing the buffer size? Right now I'm choosing it by trial and error.
Edit:
I've tested the application on my server with different buffer sizes, and I get the best performance with 4095*256*16 (16 MB)!!! 4096*256*16 is 4 seconds slower.
Here are some older posts which are very helpful, but I still can't work out the reason:
Faster (unsafe) BinaryReader in .NET
Optimum file buffer read size?
File I/O with streams - best memory buffer size
How do you determine the ideal buffer size when using FileInputStream?
"Sequential File Programming Patterns and Performance with .NET" is a great article in I/O performance improvement.
In page 8 of this PDF file, it shows that the bandwidth for buffer size bigger than eight bytes, is constant. Consider that the article has been written in 2004 and the hard disk drive is "Maxtor 250 GB 7200 RPM SATA disk" and the result should be different by latest I/O technologies.
If you are looking for the best performance take a look at pinvoke.net or the page 9 of the PDF file, the un-buffered file performance measurements shows better results:
In un-buffered I/O, the disk data moves directly between the
application’s address space and the device without any intermediate
copying.
Summary
For single disks, use the defaults of the .NET framework – they deliver excellent performance for sequential file access.
Pre-allocate large sequential files (using the SetLength() method) when the file is created. This typically improves speed by about 13% when compared to a fragmented file.
At least for now, disk arrays require un-buffered I/O to achieve the highest performance - buffered I/O can be eight times slower than un-buffered I/O. We expect this problem will be addressed in later releases of the .NET framework.
If you do your own buffering, use large request sizes (64 KB is a good place to start). Using the .NET framework, a single processor can read and write a disk array at over 800 Mbytes/s using un-buffered I/O.
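To make the last two points concrete, here is a hedged sketch (paths are parameters, sizes are placeholders) that pre-allocates the output file with SetLength and copies using 64 KB requests:

using System.IO;

static void CopyWithPreallocation(string sourcePath, string targetPath)
{
    const int BufferSize = 64 * 1024; // 64 KB requests, as suggested above

    using (var source = File.OpenRead(sourcePath))
    using (var target = new FileStream(targetPath, FileMode.Create,
                                       FileAccess.Write, FileShare.None, BufferSize))
    {
        target.SetLength(source.Length); // pre-allocate the full file up front
        var buffer = new byte[BufferSize];
        int read;
        while ((read = source.Read(buffer, 0, buffer.Length)) > 0)
            target.Write(buffer, 0, read);
    }
}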
There is no single best or worst buffer size, but you have to look at a few aspects.
As you are using C#, you are presumably running on Windows. Windows typically uses NTFS, whose default cluster size is 4 KB, so it is advisable to use multiples of 4096. Your buffer size of 16*1024 = 4*4096 is a reasonable choice, but whether it is better or worse than 16*4096 is hard to say.
Everything depends on the situation and the requirements of the program. Remember that you cannot choose the best option here, only a better one. I recommend using 4096, but you could also use 4*4096 or even 16*4096; just remember that the buffer is allocated on the heap, so its allocation takes some time, and you probably don't want to allocate a very big buffer such as 128*4096.
I am writing a server which will read and write huge files and a database.
I have used Stream read and write functions in many places, using 8192 as the buffer size.
I am also reading large input from TCP sockets.
I don't know what would be the configuration of the VMs where the service will be deployed.
Is there any built-in function I can use to determine the most suitable buffer size for my server?
I have often wondered this myself, but in the end I do not think there is a general rule to apply. It always comes down to your specific needs.
As a rule of thumb, a bigger buffer means fewer round trips to the file system or database, which is best in most cases.
However, how much data your system can read into memory at once without affecting other applications depends very much on your individual environment. A mobile device has different characteristics than high-end server hardware, and so on.
Other things to consider are network bandwidth and other shared resources, as well as the sheer performance impact of your actions.
For example, in a project with thousands of image files, we found after several tries that for us the ideal buffer size was around 1 MB. For images smaller than that we used a buffer size equal to the file size. Of course this may not fit your scenario.
Rico Mariani, performance expert at Microsoft, names the 10 most important aspects of programming for performance: Measure, measure, measure, measure, ... (You get the point. :-) )
It depends on throughput, communication channel utilization, and connection stability in the production environment.
From my point of view, the best option here is an adaptive algorithm which changes the buffer size depending on the factors mentioned above.
UPDATE.
Be careful when using buffers that are equal to or larger than 85,000 bytes. Such buffers should be reused as much as possible (because of large object heap behavior).
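One way to get that reuse, assuming you can take a dependency on System.Buffers, is to rent large buffers from ArrayPool<T> instead of allocating a fresh one per operation:

using System.Buffers;
using System.IO;

static long CountBytes(Stream input)
{
    // Rent a buffer at or above the 85,000-byte LOH threshold and return it afterwards,
    // so repeated calls don't keep allocating new large-object-heap arrays.
    byte[] buffer = ArrayPool<byte>.Shared.Rent(128 * 1024);
    try
    {
        long total = 0;
        int read;
        while ((read = input.Read(buffer, 0, buffer.Length)) > 0)
            total += read;
        return total;
    }
    finally
    {
        ArrayPool<byte>.Shared.Return(buffer);
    }
}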
The critical factor is not the size of the application's buffer but the size of the socket send and receive buffers, which must be >= the bandwidth-delay product of the link. Any increase above that should yield zero benefit; any decrease below it will become visible in suboptimal bandwidth. Application buffers have a role to play in reducing system calls but 8192 is normally quite enough for most purposes, especially networking ones.
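If you do want to size the socket buffers explicitly, a minimal sketch looks like the following; the 1 MB value is only an example of a bandwidth-delay product (roughly 100 Mbit/s x 80 ms), not a recommendation:

using System.Net.Sockets;

static Socket CreateTunedSocket()
{
    var socket = new Socket(AddressFamily.InterNetwork, SocketType.Stream, ProtocolType.Tcp);

    // The kernel buffers should be >= the link's bandwidth-delay product.
    // Example: 100 Mbit/s * 80 ms RTT ~= 1 MB.
    socket.ReceiveBufferSize = 1024 * 1024;
    socket.SendBufferSize = 1024 * 1024;

    return socket;
}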
I have around 270k data block pairs, each pair consists of one 32KiB and one 16KiB block.
When I save them to one file I of course get a very large file.
But the data is easily compressed.
After compressing the 5.48GiB file with WinRAR, with strong compression, the resulting file size is 37.4MiB.
But I need random access to each individual block, so I can only compress the blocks individually.
For that I used the DeflateStream class provided by .NET, which reduced the file size to 382 MiB (which I could live with).
But the speed is not good enough.
A lot of the speed loss is probably due to always creating a new MemoryStream and Deflate instance for each block.
But it seems they aren't designed to be reused.
And I guess (much?) better compression could be achieved with a "global" dictionary instead of having one for each block.
Is there an implementation of a compression algorithm (preferably in C#) which is suited for that task?
The following link contains the percentage with which each byte value occurs, divided into three block types (32 KiB blocks only).
The first and third block types each have an occurrence of 37.5%, and the second 25%.
Block type percentages
Long file short story:
Type1 consists mostly of ones.
Type2 consists mostly of zeros and ones
Type3 consists mostly of zeros
Values greater than 128 do not occur (yet).
The 16KiB block consists almost always of zeros
If you want to try different compression you can start with RLE, which should be suitable for your data - http://en.wikipedia.org/wiki/Run-length_encoding - it will be blazingly fast even in the simplest implementation. The related http://en.wikipedia.org/wiki/Category:Lossless_compression_algorithms contains more links to other algorithms if you want to roll your own or find someone's implementation.
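As a starting point, here is a minimal byte-oriented RLE encoder sketch (illustration only; a real implementation would add a decoder and a smarter escape scheme):

using System.Collections.Generic;

static byte[] RleEncode(byte[] data)
{
    // Emit (count, value) pairs; runs longer than 255 are simply split.
    var output = new List<byte>();
    int i = 0;
    while (i < data.Length)
    {
        byte value = data[i];
        int run = 1;
        while (i + run < data.Length && data[i + run] == value && run < 255)
            run++;
        output.Add((byte)run);
        output.Add(value);
        i += run;
    }
    return output.ToArray();
}

For blocks that are mostly zeros or mostly ones this should shrink them dramatically, and it decodes trivially fast.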
Random comment: "...A lot of the speed loss is probably..." is not a way to solve a performance problem. Measure and see whether it really is.
Gzip is known to be "fine", which means compression ratio is okay, and speed is good.
If you want more compression, other alternatives exist, such as 7z.
If you want more speed, which seems to be your objective, a faster alternative will provide a significant speed advantage at the cost of some compression efficiency. "Significant" here means many times faster, such as 5x-10x. Such algorithms are favored for "in-memory" compression scenarios such as yours, since they make accessing the compressed blocks almost painless.
As an example, Clayton Stangeland just released LZ4 for C#. The source code is available here under a BSD license:
https://github.com/stangelandcl/LZ4Sharp
There are some comparison metrics against gzip on the project homepage, such as:
i5 memcpy 1658 MB/s
i5 Lz4 Compression 270 MB/s Decompression 1184 MB/s
i5 LZ4C# Compression 207 MB/s Decompression 758 MB/s 49%
i5 LZ4C# whole corpus Compression 267 MB/s Decompression 838 MB/s Ratio 47%
i5 gzip whole corpus Compression 48 MB/s Decompression 266 MB/s Ratio 33%
Hope this helps.
You can't have random access to a Deflate stream, no matter how hard you try (unless you forfeit the LZ77 part, but that's mostly what's responsible for making your compression ratio so high right now -- and even then, there are tricky issues to circumvent). This is because one part of the compressed data is allowed to refer to a previous part up to 32K bytes back, which may in turn refer to another part, and so on, so you end up having to decode the stream from the beginning to get the data you want, even if you know exactly where it is in the compressed stream (which, currently, you don't).
But, what you could do is compress many (but not all) blocks together using one stream. Then you'd get fairly good speed and compression, but you wouldn't have to decompress all the blocks to get at the one you wanted; just the particular chunk that your block happens to be in. You'd need an additional index that tracks where each compressed chunk of blocks starts in the file, but that's fairly low overhead. Think of it as a compromise between compressing everything together (which is great for compression but sucks for random access), and compressing each chunk individually (which is great for random access but sucks for compression and speed).
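A hedged sketch of that compromise using DeflateStream (the group size of 64 blocks and the offset-index format are assumptions, not part of your existing layout):

using System;
using System.Collections.Generic;
using System.IO;
using System.IO.Compression;

// Compress blocks in groups and record the stream offset where each group starts,
// so random access only needs to decompress one group, not the whole file.
static List<long> CompressInGroups(IList<byte[]> blocks, Stream output, int blocksPerGroup = 64)
{
    var offsets = new List<long>();
    for (int start = 0; start < blocks.Count; start += blocksPerGroup)
    {
        offsets.Add(output.Position); // index entry: where this compressed group begins
        using (var deflate = new DeflateStream(output, CompressionMode.Compress, leaveOpen: true))
        {
            int end = Math.Min(start + blocksPerGroup, blocks.Count);
            for (int i = start; i < end; i++)
                deflate.Write(blocks[i], 0, blocks[i].Length);
        }
    }
    return offsets; // persist this index next to the data file
}

To read block k you would seek to offsets[k / blocksPerGroup], wrap the stream in a DeflateStream in decompress mode, and skip over the earlier blocks in that group (their sizes are known in your case) before reading yours.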
I've got what I assume is a memory fragmentation issue.
We recently ported our WinForms application to WPF. There's some image processing that this application does, and the processing always worked in the WinForms version of the app. We move to WPF, and the processing dies. Debugging into the library shows the failure at random spots, but always with an array that's null, i.e. an allocation failed.
The processing itself is done in a C++ library called via P/Invoke and is fairly memory intensive; if the given image is N x M pixels big, then the image is N x M x 2 bytes big (each pixel is an unsigned short, and it's a greyscale image). During the processing, image pyramids are made, which are in float space, so the total memory usage will be N x M x (2 + 2 + 4 + 4 + 4 + 4), where the first 2 is the input, the second 2 is the output, the first 4 is the input in floats, the second 4 is the 0th-level difference image, and the last two 4s are the rest of the pyramid (since they're pyramids and each level is half the size in each direction, these 4s are upper bounds). So, for a 5000x6000 image, that's 600 MB, which should fit into memory just fine.
(There's the possibility that using marshalling is increasing the memory requirement by another N x M x 4, ie, the input and output images on the C# side and then the same arrays copied to the C++ side-- could the marshalling requirement be bigger?)
How fragmented is WPF compared to WinForms? Is there a way to consolidate memory before running this processing? I suspect that fragmentation is the issue due to the random nature of the breakages, when they happen, and that it's always a memory allocation problem.
Or should I avoid this problem entirely by making the processing run as a separate process, with data transfer via sockets or some such similar thing?
If I read this correctly, the memory allocation failure is happening on the non-managed side, not the managed side. It seems strange then to blame WPF. I recognize that you are drawing your conclusion based on the fact that "it worked in WinForms", but there are likely more changes than just that. You can use a tool like the .NET Memory Profiler to see the differences between how the WPF application and the WinForms application are treating memory. You might find that your application is doing something you don't expect. :)
Per comment: Yup, I understand. If you're confident that you've ruled out things like environment changes, I think you have to grab a copy of BoundsChecker and Memory Profiler (or DevPartner Studio) and dig in and see what's messing up your memory allocation.
I'm guessing that the GC is moving your memory. Try pinning the memory in unmanaged land as long as you have a raw pointer to the array, and unpin it as soon as possible. It's possible that WPF causes the GC to run more often, which would explain why it happens more often with it, and if it's the GC, then that would explain why it happens at random places in your code.
Edit: Out of curiosity, could you also pre-allocate all of your memory up front (I don't see the code, so don't know if this is possible), and make sure all of your pointers are non-null, so you can verify that it's actually happening in the memory allocation, rather than some other problem?
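If you try the pinning route, a minimal sketch looks like this (NativeMethods.ProcessImage stands in for your actual P/Invoke entry point and is hypothetical):

using System;
using System.Runtime.InteropServices;

static void ProcessPinned(ushort[] pixels)
{
    // Pin the managed array so the GC cannot move it while native code holds the pointer.
    GCHandle handle = GCHandle.Alloc(pixels, GCHandleType.Pinned);
    try
    {
        IntPtr ptr = handle.AddrOfPinnedObject();
        // NativeMethods.ProcessImage(ptr, pixels.Length); // hypothetical native call
    }
    finally
    {
        handle.Free(); // unpin as soon as the native call returns
    }
}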
It sounds like you want to be more careful about your memory management in general; i.e. either run the processing engine in a separate address space which carefully manages memory, or pre-allocate a sufficiently large chunk before memory gets too fragmented and manage images in that area only. If you're sharing address space with the .NET runtime in a long-running process and you need large contiguous areas, it's always going to potentially fail at some point. Just my 2c.
This post might be useful
http://blogs.msdn.com/tess/archive/2009/02/03/net-memory-leak-to-dispose-or-not-to-dispose-that-s-the-1-gb-question.aspx