Our web server needs to process many compositions of large images together before sending the results to web clients. This process is performance critical because the server can receive several thousands of requests per hour.
Right now our solution loads the PNG files (around 1MB each) from the hard disk and sends them to the video card so the composition is done on the GPU. We first tried loading our images using the PNG decoder exposed by the XNA API, but the performance was not very good.
To understand whether the problem was loading from the hard disk or decoding the PNG, we modified the test to load each file into a memory stream first, and then send that memory stream to the .NET PNG decoder. The difference in performance between XNA and the System.Windows.Media.Imaging.PngBitmapDecoder class is not significant; we get roughly the same performance from both.
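For reference, the memory-stream test looks roughly like this (a sketch assuming WPF's PngBitmapDecoder from PresentationCore; error handling omitted):

    // Read the file fully first so disk I/O and decoding are timed separately.
    byte[] bytes = File.ReadAllBytes(path);
    using (var ms = new MemoryStream(bytes))
    {
        var decoder = new PngBitmapDecoder(ms,
            BitmapCreateOptions.PreservePixelFormat, BitmapCacheOption.OnLoad);
        BitmapSource frame = decoder.Frames[0]; // OnLoad forces the full decode here
    }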
Our benchmarks show the following performance results:
Load images from disk: 37.76 ms (1%)
Decode PNGs: 2816.97 ms (77%)
Load images on video hardware: 196.67 ms (5%)
Composition: 87.80 ms (2%)
Get composition result from video hardware: 166.21 ms (5%)
Encode to PNG: 318.13 ms (9%)
Store to disk: 3.96 ms (0%)
Clean up: 53.00 ms (1%)
Total: 3680.50 ms (100%)
From these results we see that the slowest part by far is decoding the PNGs.
So we are wondering whether there is a faster PNG decoder we could use to reduce the decoding time. We also considered keeping the images uncompressed on the hard disk, but then each image would be 10MB instead of 1MB, and since there are several tens of thousands of these images stored on the hard disk, it is not possible to store them all uncompressed.
EDIT: More useful information:
The benchmark simulates loading 20 PNG images and compositing them together. This will roughly correspond to the kind of requests we will get in the production environment.
Each image used in the composition is 1600x1600 in size.
The solution will involve as many as 10 load balanced servers like the one we are discussing here. So extra software development effort could be worth the savings on the hardware costs.
Caching the decoded source images is something we are considering, but each composition will most likely be done with completely different source images, so cache misses will be high and performance gain, low.
The benchmarks were done with a crappy video card, so we can expect the PNG decoding to be even more of a performance bottleneck using a decent video card.
There is another option: write your own GPU-based PNG decoder. You could use OpenCL to perform this operation fairly efficiently (and perform your composition with OpenGL, which can share resources with OpenCL). It is also possible to interleave transfer and decoding for maximum throughput. If this is a route you can/want to pursue, I can provide more information.
Here are some resources related to GPU-based DEFLATE (and INFLATE).
Accelerating Lossless compression with GPUs
gpu-block-compression using CUDA on Google code.
Floating point data-compression at 75 Gb/s on a GPU - note that this doesn't use INFLATE/DEFLATE but a novel parallel compression/decompression scheme that is more GPU-friendly.
Hope this helps!
Have you tried the following two things?
1)
Multi-thread it. There are several ways of doing this, but one would be an "all in" method: spawn X threads, each running the full process.
2)
Consider having X threads do all the CPU work and then feed the results to the GPU thread.
Your question is very well formulated for a new user, but some information about the scenario might be useful:
Are we talking about a batch job or serving pictures in real time?
Do the 10k pictures change?
Hardware resources
You should also take into account what hardware resources you have at your disposal.
Normally the two cheapest things are CPU power and disk space, so if you only have 10k pictures that rarely change, then converting them all into a format that is quicker to handle might be the way to go.
Multi thread trivia
Another thing to consider when multithreading is that it's normally smart to run the threads at BelowNormal priority, so you don't make the entire system "lag". You have to experiment a bit with the number of threads to use; if you're lucky you can get close to a 100% speed gain per core, but this depends a lot on the hardware and the code you're running.
I normally use Environment.ProcessorCount to get the CPU count and work from there :)
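For what it's worth, a minimal sketch of that setup: one worker thread per core at BelowNormal priority, each pulling file paths off a shared queue. DecodeAndUpload and pngPaths are placeholders for your own decode-and-upload step and input list.

    var work = new System.Collections.Concurrent.ConcurrentQueue<string>(pngPaths);
    var threads = new List<Thread>();
    for (int i = 0; i < Environment.ProcessorCount; i++)
    {
        var t = new Thread(() =>
        {
            string path;
            while (work.TryDequeue(out path))
                DecodeAndUpload(path);           // placeholder for the real work
        });
        t.Priority = ThreadPriority.BelowNormal; // keep the rest of the system responsive
        t.Start();
        threads.Add(t);
    }
    threads.ForEach(t => t.Join());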
I've written a pure C# PNG coder/decoder (PngCs); you might want to give it a look.
But I highly doubt it will have better speed [*]; it's not highly optimized, rather it tries to minimize memory usage when dealing with huge images (it encodes/decodes sequentially, line by line). But perhaps it serves you as boilerplate to plug in a better compression/decompression implementation. As I see it, the speed bottleneck is zlib (inflater/deflater), which (contrary to Java) is not implemented natively in .NET; I used the SharpZipLib library, with pure managed C# code, and this cannot be very efficient.
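A minimal read loop looks roughly like this (written from memory of the PngCs API, so treat the exact entry points as an assumption and check the samples shipped with the library):

    var reader = Hjg.Pngcs.FileHelper.CreatePngReader("input.png"); // entry point assumed from the PngCs samples
    for (int row = 0; row < reader.ImgInfo.Rows; row++)
    {
        Hjg.Pngcs.ImageLine line = reader.ReadRowInt(row); // one decoded scanline at a time
        // line.Scanline holds the samples for this row
    }
    reader.End();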
I'm a little surprised, however, that in your tests decoding was so much slower than encoding. That seems strange to me, because, in most compression algorithms (perhaps in all; and surely in zlib) encoding is much more computer intensive than decoding.
Are you sure about that?
(For example, this speed test, which reads and writes 5000x5000 RGB8 images (not very compressible, about 20MB on disk), gives me about 4.5 seconds for writing and 1.5 seconds for reading.) Perhaps there are other factors apart from pure PNG decoding?
[*] Update: newer versions (since 1.1.14) have several optimizations; especially if you can use .NET 4.5, they should provide better decoding speed.
You have multiple options:
Improve the performance of the decoding process
You could implement another, faster PNG decoder
(libpng is a standard library which might be faster)
You could switch to another picture format that uses simpler/faster decodable compression
Parallelize
Use the .NET parallel processing capabilities to decode concurrently. Decoding is likely single-threaded, so this could help if you run on multicore machines (see the sketch after this list).
Store the files uncompressed but on a device that compresses
For instance a compressed folder or even a SandForce SSD.
This still compresses, but differently, and burdens other software with the decompression. I am not sure this will really help and would only try it as a last resort.
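For the parallel option mentioned above, a hedged sketch using the TPL (DecodePng and DecodedImage are placeholders for whatever decoder and image type you settle on):

    var decoded = new System.Collections.Concurrent.ConcurrentBag<DecodedImage>();
    Parallel.ForEach(pngPaths,
        new ParallelOptions { MaxDegreeOfParallelism = Environment.ProcessorCount },
        path => decoded.Add(DecodePng(path)));  // placeholder decode call
    // Upload the contents of 'decoded' to the video card sequentially afterwards.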
imageresizing.net community and developers,
Please clarify some details about imageresizing.net internals for me.
Does imageresizing.net use the .NET System.Drawing library to recompress JPEGs? If not, does it use a third-party engine or some internal algorithms?
Are there performance benchmarks? I'd like to compare imageresizing.net with other libraries: libjpeg, Intel Integrated Performance Primitives, etc.
Thanks in advance,
Anton
ImageResizer offers 3 imaging pipelines.
GDI+ (System.Drawing). The default. 2-pass high-quality bicubic with pre-smoothing. Very fast for the quality provided (avg. 200-300ms for resizing). A recent Windows update makes GDI+ parallelize poorly, but this is being actively investigated by MSFT.
WIC (what WPF also uses). 4-8x faster (20-200ms for resizing). Single-pass resizing causes lots of moiré artifacts and blurriness (in both Fant and Bicubic modes). No really high-quality resizing is available within Windows Imaging Components.
FreeImage. If you need to support DSLR formats or access Lanczos resampling (top-tier quality), FreeImage is your library. It's slower than the others (often 800-2400ms), large, monolithic, and hard to audit, so we only recommend using it with trusted image data. We use a custom version of FreeImage built against libjpeg-turbo, which is significantly faster than libjpeg.
You can mix and match encoders, decoders, and (to some extent) resizing algorithms. Some algorithms are implemented internally for quality reasons, while most are implemented in C/C++ in dependencies.
End-to-end comparison benchmarking is kind of absurd if you care about photo quality, since you can never compare apples to apples. Back in 2011, I did some benchmarks between GDI+ and WIC, but photographers and graphics designers tend to find WIC image quality unacceptable, so it's not particularly fair.
We regularly benchmark each pipeline against itself to detect performance improvements or regressions, but comparing pipelines can be deceptive for a number of reasons:
Do you care about metadata? libjpeg-turbo is (weirdly) 2-3x faster at reading JPEGs when you disable metadata parsing. If you want to auto-rotate based on camera EXIF data, you'll need that info.
Do you care about color correctness? Jpeg isn't an RGB format. If it has an ICC profile, the right thing to do is convert to sRGB before you resize. That's slow.
Do you care about resizing quality? There are a hundred ways to implement a bicubic resizing filter. Some fast, some slow, most ugly, some accurate. Bicubic WIC != Bicubic GDI+. You can get < 20ms end-to-end out of ImageResizer WIC in nearest neighbor mode - if you're cool with the visual results.
Do you care about output file size? If you're willing to spend more clock cycles, you can gain 30-80% reductions in PNG/GIF file sizes, and 5-15% in jpeg. If you want to add 150-600ms per request, ImageResizer can create WebP images that halve your bandwidth costs (WebP is more costly to encode than jpeg).
You can make sense of micro-benchmarks (is libjpeg-turbo 40% faster than libjpeg under the same circumstances, etc). You can even compare certain simple low-quality image resizing filters (nearest-neighbor, box, bilinear) after excluding encoding, decoding, and color transformation.
The problem is that truly high-quality resizing is really complex, and never implemented the same way twice. There are a vanishingly small number of high-quality implementations, and an even tinier number that have sub-second performance. I ordered a dozen textbooks on image processing to see if I could find a reference implementation, but the topic is... expertly avoided by most, and only briefly touched on by others. Edge-pixel handling, pre-filtering, and performance optimization are never mentioned.
I've funded a lot of research into fast high-quality image resizing, but we haven't been able to match GDI+ yet. ImageResizer's default configuration tends to beat Photoshop quality on many types of images.
A fourth pipeline may be added to ImageResizer in the near future, based on our fork of libgd with custom resizing algorithms. No promises yet, but we may have something nearly as high quality as GDI+ with similar single-threaded (but better concurrent) performance.
All our source code is on GitHub, so if you find something fast that you'd like to demo as a plugin or alternate pipeline, we'd love to hear about it.
As the title states, I have a problem with high page file activity.
I am developing a program that process a lot of images, which it loads from the hard drive.
From every image it generates some data that I save in a list. For every 3600 images, I save the list to the hard drive; its size is about 5 to 10 MB. The program runs as fast as it can, so it maxes out one CPU thread.
The program works, it generates the data that it is supposed to, but when I analyze it in Visual Studio I get a warning saying: DA0014: Extremely high rates of paging active memory to disk.
The memory consumption of the program, according to Task Manager, is about 50 MB and seems to be stable. When I ran the program I had about 2 GB of RAM free out of 4 GB, so I guess I am not running out of RAM.
http://i.stack.imgur.com/TDAB0.png
The DA0014 rule description says "The number of Pages Output/sec is frequently much larger than the number of Page Writes/sec, for example. Because Pages Output/sec also includes changed data pages from the system file cache. However, it is not always easy to determine which process is directly responsible for the paging or why."
Does this mean that I get this warning simply because I read a lot of images from the hard drive, or is it something else? Not really sure what kind of bug I am looking for.
EDIT: Link to image inserted.
EDIT1: The images are about 300 KB each. I dispose of each one before loading the next.
UPDATE: From experiments, it looks like the paging comes just from loading the large number of files. As I am no expert in C# or the underlying GDI+ API, I don't know which of the answers is most correct. I chose Andras Zoltan's answer as it was well explained and because he clearly did a lot of work to explain the reason to a newcomer like me :)
Updated following more info
The working set of your application might not be very big, but what about the virtual memory size? Paging can occur because of this and not just because of its physical size. See this screenshot from Process Explorer of VS2012 running on Windows 8:
And on task manager? Apparently the private working set for the same process is 305,376Kb.
We can take from this a) that Task Manager can't necessarily be trusted and b) an application's size in memory, as far as the OS is concerned, is far more complicated than we'd like to think.
You might want to take a look at this.
The paging is almost certainly because of what you do with the files and the high final figures almost certainly because of the number of files you're working with. A simple test of that would be experiment with different numbers of files and generate a dataset of final paging figures alongside those. If the number of files is causing the paging, then you'll see a clear correlation.
Then take out any processing (but keep the image-loading) you do and compare again - note the difference.
Then stub out the image-loading code completely - note the difference.
Clearly you'll see the biggest drop in faults when you take out the image loading.
Now, looking at the Emgu.CV Image code, it uses the Image class internally to get the image bits - so that's firing up GDI+ via the function GdipLoadImageFromFile (second entry on this index) to decode the image (using system resources, plus potentially large byte arrays) - and then it copies the data to an uncompressed byte array containing the actual RGB values.
This byte array is allocated using GCHandle.Alloc (also surrounded by GC.AddMemoryPressure and GC.RemoveMemoryPressure) to create a pinned byte array to hold the image data (uncompressed). Now I'm no expert on .NET memory management, but it seems to me that we have the potential for heap fragmentation here, even if each file is loaded sequentially and not in parallel.
Whether that's causing the hard paging I don't know. But it seems likely.
In particular, the in-memory representation of the image could be specifically geared towards display as opposed to being the original file bytes. So if we're talking JPEGs, for example, then a 300KB JPEG could be considerably larger in physical memory, depending on its dimensions. E.g. a 1024x768 32-bit image is 3MB - and that's allocated twice for each image, since it's loaded (first allocation) then copied (second allocation) into the EMGU image object before being disposed.
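To make that concrete, here is a sketch of the pinned-allocation pattern being described (not the actual Emgu.CV source; width and height stand in for the decoded image dimensions):

    int byteCount = width * height * 4;             // e.g. 1024x768 at 32 bpp is about 3 MB
    byte[] pixels = new byte[byteCount];
    GCHandle handle = GCHandle.Alloc(pixels, GCHandleType.Pinned);
    GC.AddMemoryPressure(byteCount);
    try
    {
        IntPtr scan0 = handle.AddrOfPinnedObject(); // native code copies the RGB data here
        // ... use the buffer ...
    }
    finally
    {
        handle.Free();
        GC.RemoveMemoryPressure(byteCount);
    }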
But you have to ask yourself if it's necessary to find a way around the problem. If your application is not consuming vast amounts of physical RAM, then it will have much less of an impact on other applications; one process hitting the page file lots and lots won't badly affect another process that doesn't, if there's sufficient physical memory.
However, it is not always easy to determine which process is directly responsible for the paging or why.
The devil is in that cop-out note. Bitmaps are mapped into memory from the file that contains the pixel data using a memory-mapped file. That's an efficient way to avoid reading and writing the data directly into/from RAM; you only pay for what you use. The mechanism that keeps the file in sync with RAM is paging. So it is inevitable that if you process a lot of images, you'll see a lot of page faults. The tool you use just isn't smart enough to know that this is by design.
Feature, not a bug.
I was thinking...
Does multithreading I/O operations in C# (let's say copying many files from c:\1\ to c:\2\) make a performance difference compared to doing the operation sequentially?
The reason I'm struggling with this is that an I/O operation ultimately goes through one device that has to do the work, so even if I'm working in parallel, it will still execute those copy orders sequentially...
Or maybe my assumption is wrong?
In that case, is there any benefit to using a multithreaded copy to:
copy many small files (4 GB total)
copy 4 big files (4 GB total, 1000 MB each)
Thanks
Like others say, it has to be measured in your concrete application context.
But I would just like to draw attention to this:
Every time you copy a file, write access to the destination location has to be checked, which is slow.
All of us have met the case where you have to copy a sequence of already-compressed files: if you compress them again into one big ZIP file, even though the total compressed size is not significantly smaller than the sum of the contents, the I/O operation executes much faster. (Just try it, you will see a huge difference if you haven't done it before.)
So I would assume (again, it has to be measured on a concrete system; these are just guesses) that writing one big file to a single disk will be faster than writing a lot of small files.
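If you can batch the copy, one way to try this on .NET 4.5 (requires a reference to System.IO.Compression.FileSystem; the paths are just examples) is to pack the source folder into a single, optionally uncompressed, archive and copy that one file:

    System.IO.Compression.ZipFile.CreateFromDirectory(
        @"c:\1", @"c:\2\batch.zip",
        System.IO.Compression.CompressionLevel.NoCompression, // one big file, no recompression cost
        includeBaseDirectory: false);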
Hope this helps.
Multithreading with files is not so much about the CPU but about IO. This means that totally different rules apply. Different devices have different characteristics:
Magnetic disks like sequential IO
SSDs like sequential or parallel random IO (mine has 4 hardware "threads")
The network likes many parallel operations to amortize latency
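Because of those differences, the only reliable answer is to measure on your own device; a rough sketch of such a test (the folder paths are just examples):

    var files = Directory.GetFiles(@"c:\1");
    var sw = Stopwatch.StartNew();
    foreach (var f in files)
        File.Copy(f, Path.Combine(@"c:\2", Path.GetFileName(f)), true);
    Console.WriteLine("Sequential: {0} ms", sw.ElapsedMilliseconds);

    sw.Restart();
    Parallel.ForEach(files, f =>
        File.Copy(f, Path.Combine(@"c:\3", Path.GetFileName(f)), true));
    Console.WriteLine("Parallel:   {0} ms", sw.ElapsedMilliseconds);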
I'm no expert in hard-disk related questions, but maybe this will shed some light for you:
Windows uses the NTFS file system. This file system doesn't "like" very small files, for example files under 1 KB. It will "magically" make 100 files of 1 KB take up 400 KB instead of 100 KB. It is also VERY slow when dealing with a lot of "small" files. Therefore, copying one big file instead of many small files of the same total size will be much faster.
Also, from my personal experience and knowledge, multithreading will NOT speed up the copying of many files, because the actual hardware disk acts like one unit and can't be sped up by sending many requests at the same time (it will process them one by one).
I have around 270k data block pairs, each pair consists of one 32KiB and one 16KiB block.
When I save them to one file I of course get a very large file.
But the data is easily compressed.
After compressing the 5.48GiB file with WinRAR, with strong compression, the resulting file size is 37.4MiB.
But I need random access to each individual block, so I can only compress the blocks individually.
For that I used the Deflate class provided by .NET, which reduced the file size to 382MiB (which I could live with).
But the speed is not good enough.
A lot of the speed loss is probably due to always creating a new MemoryStream and Deflate instance for each block.
But it seems they aren't designed to be reused.
And I guess (much?) better compression could be achieved if a "global" dictionary were used instead of having one for each block.
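For reference, the per-block approach described above looks roughly like this with .NET's DeflateStream (a new stream per block, since the classes are not designed to be reused):

    static byte[] CompressBlock(byte[] block)
    {
        using (var ms = new MemoryStream())
        {
            using (var deflate = new DeflateStream(ms, CompressionMode.Compress, leaveOpen: true))
                deflate.Write(block, 0, block.Length);
            return ms.ToArray();
        }
    }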
Is there an implementation of a compression algorithm (preferably in C#) which is suited for that task?
The following link contains the percentage with which each byte value occurs, divided into three block types (32KiB blocks only).
The first and third block types each account for 37.5% of the blocks and the second for 25%.
Block type percentages
Long story short:
Type1 consists mostly of ones.
Type2 consists mostly of zeros and ones
Type3 consists mostly of zeros
Values greater than 128 do not occur (yet).
The 16KiB block consists almost always of zeros
If you want to try different compression, you could start with RLE, which should be suitable for your data - http://en.wikipedia.org/wiki/Run-length_encoding - it will be blazingly fast even in the simplest implementation. The related http://en.wikipedia.org/wiki/Category:Lossless_compression_algorithms contains more links if you want to roll your own algorithm or find someone's implementation.
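A minimal byte-oriented RLE encoder, just as a sketch: it emits (count, value) pairs with runs capped at 255 and is not tuned to the block format above.

    static byte[] RleEncode(byte[] data)
    {
        var output = new List<byte>();
        int i = 0;
        while (i < data.Length)
        {
            byte value = data[i];
            int run = 1;
            while (i + run < data.Length && data[i + run] == value && run < 255)
                run++;
            output.Add((byte)run);  // run length
            output.Add(value);      // the repeated byte
            i += run;
        }
        return output.ToArray();
    }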
Random comment: "...A lot of the speed loss is probably..." is not a way to solve a performance problem. Measure and see if it really is.
Gzip is known to be "fine", which means compression ratio is okay, and speed is good.
If you want more compression, other alternatives exist, such as 7z.
If you want more speed, which seems to be your objective, a faster alternative will provide a significant speed advantage at the cost of some compression efficiency. "Significant" here means many times faster, such as 5x-10x. Such algorithms are favored for "in-memory" compression scenarios such as yours, since they make accessing the compressed blocks almost painless.
As an example, Clayton Stangeland just released LZ4 for C#. The source code is available here under a BSD license:
https://github.com/stangelandcl/LZ4Sharp
There are some comparison metrics against gzip on the project homepage, such as:
i5 memcpy 1658 MB/s
i5 Lz4 Compression 270 MB/s Decompression 1184 MB/s
i5 LZ4C# Compression 207 MB/s Decompression 758 MB/s 49%
i5 LZ4C# whole corpus Compression 267 MB/s Decompression 838 MB/s Ratio 47%
i5 gzip whole corpus Compression 48 MB/s Decompression 266 MB/s Ratio 33%
Hope this helps.
You can't have random access to a Deflate stream, no matter how hard you try (unless you forfeit the LZ77 part, but that's what's mostly responsible for making your compression ratio so high right now - and even then, there are tricky issues to circumvent). This is because one part of the compressed data is allowed to refer to a previous part up to 32K bytes back, which may in turn refer to another part, and so on; you end up having to decode the stream from the beginning to get the data you want, even if you know exactly where it is in the compressed stream (which, currently, you don't).
But what you could do is compress many (but not all) blocks together using one stream. Then you'd get fairly good speed and compression, but you wouldn't have to decompress all the blocks to get at the one you wanted; just the particular chunk that your block happens to be in. You'd need an additional index that tracks where each compressed chunk of blocks starts in the file, but that's fairly low overhead. Think of it as a compromise between compressing everything together (which is great for compression but sucks for random access) and compressing each block individually (which is great for random access but sucks for compression and speed).
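A sketch of that compromise, assuming the blocks are available as a List<byte[]> named blocks and picking an arbitrary chunk size of 64 blocks (both assumptions, to be tuned):

    const int BlocksPerChunk = 64;                  // tune: bigger = better ratio, slower random access
    var chunkOffsets = new List<long>();            // index: file offset where each chunk starts
    using (var output = File.Create("blocks.dfl"))
    {
        for (int i = 0; i < blocks.Count; i += BlocksPerChunk)
        {
            chunkOffsets.Add(output.Position);
            using (var deflate = new DeflateStream(output, CompressionMode.Compress, leaveOpen: true))
                for (int j = i; j < Math.Min(i + BlocksPerChunk, blocks.Count); j++)
                    deflate.Write(blocks[j], 0, blocks[j].Length);
        }
    }
    // To read block k: seek to chunkOffsets[k / BlocksPerChunk], open a DeflateStream in
    // Decompress mode, and skip (k % BlocksPerChunk) blocks' worth of decompressed bytes.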
Is there such a thing as an optimum chunk size for processing large files? I have an upload service (WCF) which is used to accept file uploads of up to several hundred megabytes.
I've experimented with chunk sizes from 4KB and 8KB up to 1MB. Bigger chunk sizes are good for performance (faster processing) but come at the cost of memory.
So, is there a way to work out the optimum chunk size at the moment the file is uploaded? How would one go about such a calculation? Would it be a combination of available memory, client, CPU, and network bandwidth that determines the optimum size?
Cheers
EDIT: I should probably mention that the client app will be in Silverlight.
If you are concerned about running out of resources, then the optimum is probably best determined by evaluating your peak upload concurrency against your system's available memory. How many simultaneous uploads you have in progress at a time would be the key critical variable in any calculation you might do. All you have to do is make sure you have enough memory to handle the upload concurrency, and that's rather trivial to achieve. Memory is cheap and you will likely run out of network bandwidth long before you get to the point where your concurrency would overrun your memory availability.
On the performance side, this isn't the kind of thing you can really optimize much during app design and development. You have to have the system in place, users uploading files for real, and then you can monitor actual runtime performance.
Try a chunk size that matches your network's TCP/IP window size. That's about as optimal as you'd really need to get at design time.
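On the client side, that just comes down to a configurable buffer in the read/send loop; a sketch with a hypothetical service proxy (UploadChunk is a placeholder, not a real API, and in Silverlight the stream would come from OpenFileDialog rather than a path):

    const int ChunkSize = 64 * 1024;                    // starting point; tune against measured throughput
    using (var fs = File.OpenRead(path))
    {
        var buffer = new byte[ChunkSize];
        int read;
        while ((read = fs.Read(buffer, 0, buffer.Length)) > 0)
        {
            // client.UploadChunk(fileId, buffer, read); // placeholder WCF call
        }
    }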