In-file data copy using DMA - C#

I need to move some data from one area of a file to another. Currently, I am reading the bytes and writing them back out. But I'm wondering if doing a DMA transfer would be faster, if it is possible. I'm in C#, but unsafe and p/invoke functions are acceptable.

As far as I can tell, you are not getting DMA 'by accident' by using the usual file streams to copy from one file to another. DMA is used in the background in some places (for instance to transfer from disk to RAM when using FileStreams), but there is no way to drive a direct file-to-file DMA copy from C#.
DMA itself is pretty complex and native to low-level languages; I'm referring to this document. All of its code examples are in C and assembly, so it is not directly applicable to C#.
The DMA is another chip on your motherboard (usually is an Intel 8237 chip) that allows you (the programmer) to offload data transfers between I/O boards. DMA actually stands for 'Direct Memory Access'. An example of DMA usage would be the Sound Blaster's ability to play samples in the background. The CPU sets up the sound card and the DMA. When the DMA is told to 'go', it simply shovels the data from RAM to the card. Since this is done off-CPU, the CPU can do other things while the data is being transferred.
An alternative could be to let the OS handle the transfer: Simply use File.Copy.
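For the in-file case the question actually asks about, the buffered read/write you are already doing is about as close to "let the OS do it" as C# gets: the file cache (and, underneath it, the disk driver's DMA) does the heavy lifting. A minimal sketch, with an illustrative path and offsets and assuming the two regions do not overlap:

// Sketch of the plain read/write approach. The OS file cache, and below it
// the disk driver's DMA, does the actual data movement.
using System;
using System.IO;

static class FileRegionCopy
{
    public static void Copy(string path, long sourceOffset, long destOffset, long length)
    {
        var buffer = new byte[64 * 1024];                 // 64 KB read/write buffer
        using (var fs = new FileStream(path, FileMode.Open, FileAccess.ReadWrite))
        {
            long remaining = length;
            while (remaining > 0)
            {
                fs.Position = sourceOffset;
                int read = fs.Read(buffer, 0, (int)Math.Min(buffer.Length, remaining));
                if (read == 0) break;                     // unexpected end of file

                fs.Position = destOffset;
                fs.Write(buffer, 0, read);

                sourceOffset += read;
                destOffset += read;
                remaining -= read;
            }
        }
    }
}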

Related

Performance and memory management impact of using a C# dll with native C++

I want to write a map editor for a game. I intend to do it using C++ and OpenGL. However, the game was written in Unity, so the map loading/saving code was written in C#.
Since I worked on a similar project in C# WinForms, I have already written a C# dll that can manage some game-generated files, including map files. I now plan to use it to load/save map files in the main C++ program.
What does the C# dll do? (tl;dr below the second line)
It has a method for loading a Region into memory, consisting of an array of 1024 MemoryStreams that each contain a compressed Chunk (about 2kB to 20kB per chunk, mostly around 5kB). It also has a method for requesting a Chunk from the Region: it decompresses the stream and reads it into a Chunk object (which is a complex object with arrays, lists, dictionaries and other custom classes with complexities of their own).
I also have methods that do the reverse - pack the Chunk object into a MemoryStream, compress it and add it to the Region object, which has a method that saves it to a file on disk.
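For illustration only, the per-chunk round trip described above looks roughly like the following; the type and member names are hypothetical stand-ins, not the actual dll's API, and the real Chunk/Region types are far more complex:

// Hypothetical sketch of the pack/unpack pattern: only the MemoryStream +
// compression round trip is shown.
using System.IO;
using System.IO.Compression;

static class ChunkCodec
{
    public static MemoryStream Pack(byte[] rawChunkData)
    {
        var packed = new MemoryStream();
        using (var gz = new GZipStream(packed, CompressionMode.Compress, leaveOpen: true))
            gz.Write(rawChunkData, 0, rawChunkData.Length);
        packed.Position = 0;
        return packed;              // ~2-20 kB compressed, kept inside the Region
    }

    public static byte[] Unpack(MemoryStream packed)
    {
        packed.Position = 0;
        using (var gz = new GZipStream(packed, CompressionMode.Decompress, leaveOpen: true))
        using (var raw = new MemoryStream())
        {
            gz.CopyTo(raw);         // 15-120 kB of raw chunk data
            return raw.ToArray();
        }
    }
}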
The uncompressed chunk data ranges from 15kB to over 120kB in size, and that's just raw data, not including any object creation related overhead.
In the main program, I'd probably have several thousand of those Chunks loaded into memory at once, some maybe only briefly to cache some data before being unloaded (perhaps to generate distant terrain), others kept around fully so the user can modify them.
tl;dr I'd be loading anywhere from a few hundred megabytes up to over a gigabyte of data within a managed C# dll. The data wouldn't be heavily accessed; it only changes when the user edits the terrain, which is rare on a CPU timescale. But as the user moves around the map, a lot of chunks might need to be loaded/unloaded at a time.
Given that all this is within a managed C# dll, my question is, what happens to memory management and how does that impact performance of the native C++ program? To what extent can I control the memory allocation for the Region/Chunk objects? How does that impact the speed of execution?
Is it something that can be overlooked/ignored and/or dealt with, or will it pose enough of a problem to justify rewriting the dll in native C++ with a more elaborate memory management scheme?

.NET: Continuously write data to the disk in different files

We have an application that extracts data from several hardware devices. Each device's data should be stored in a different file.
Currently we have one FileStream per file and simply write each piece of data as it arrives, and that's it.
We have a lot of data coming in and the disk, an HDD (not an SSD), is struggling. I guess that's partly because flash is faster, but also because an SSD would not have to keep jumping between different file locations all the time.
Some metrics for the default case: 400 different data sources (each with its own file), each delivering ~50KB/s (so ~20MB/s in total). Each data source's acquisition runs concurrently, and in total we are using ~6% of the CPU.
Is there a way to organize the flushes to the disk to ensure better throughput?
We will also consider improving the hardware, but that's not really the subject here, even though it would be a good way to improve our read/write performance.
Windows and NTFS handle multiple concurrent sequential IO streams to the same disk terribly inefficiently. Probably, you are suffering from random IO. You need to schedule the IO yourself in bigger chunks.
You might also see extreme fragmentation. In such cases NTFS sometimes allocates every Nth sector to each of the N files. It is hard to believe how bad NTFS is in such scenarios.
Buffer data for each file until you have accumulated something like 16MB, then flush it out. Do not write to multiple files at the same time. That way you have one disk seek per 16MB segment, which reduces seek overhead to near zero.
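A hedged sketch of that suggestion (names and sizes are illustrative): buffer each source in memory and flush roughly 16 MB at a time, one file at a time, so the disk sees long sequential writes instead of 400 interleaved small ones.

// Illustrative sketch: accumulate incoming data per source in memory and
// flush a whole large buffer to its file in one sequential write. A shared
// lock serializes flushes so only one file is written at a time.
using System.IO;

class BufferedFileSink
{
    const int FlushThreshold = 16 * 1024 * 1024;          // ~16 MB per flush
    static readonly object OneWriterAtATime = new object();

    readonly string path;
    readonly MemoryStream pending = new MemoryStream();

    public BufferedFileSink(string path) { this.path = path; }

    // Called from this source's acquisition thread as data arrives.
    public void Append(byte[] data, int offset, int count)
    {
        pending.Write(data, offset, count);
        if (pending.Length >= FlushThreshold)
            Flush();
    }

    public void Flush()
    {
        lock (OneWriterAtATime)                           // serialize disk access
        {
            using (var fs = new FileStream(path, FileMode.Append, FileAccess.Write))
                pending.WriteTo(fs);
            pending.SetLength(0);
        }
    }
}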

efficient continuous data writes on HDD

In my application I need to continuously write data chunks (around 2MB) about every 50ms in a large file (around 2-7 GB). This is done in a sequential, circular way, so I write chunk after chunk into the file and when I'm at the end of the file I start again at the beginning.
Currently I'm doing it as follows:
In C# I call File.OpenWrite once to open the file with read access and set the size of the file with SetLength. When I need to write a chunk, I pass the safe file handle to the unmanaged WriteFile (kernel32.dll), along with an overlapped structure that specifies the position within the file where the chunk has to be written. The chunk I need to write is stored in unmanaged memory, so I have an IntPtr which I can pass to WriteFile.
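For context, a rough sketch of the setup described above might look like this; names, sizes and error handling are illustrative, not the actual code:

// Sketch: pre-size the file once, then write 2 MB chunks from unmanaged
// memory at explicit offsets via WriteFile with an OVERLAPPED offset.
using System;
using System.IO;
using System.Runtime.InteropServices;
using System.Threading;
using Microsoft.Win32.SafeHandles;

static class ChunkWriter
{
    [DllImport("kernel32.dll", SetLastError = true)]
    static extern bool WriteFile(SafeFileHandle hFile, IntPtr lpBuffer,
        uint nNumberOfBytesToWrite, out uint lpNumberOfBytesWritten,
        ref NativeOverlapped lpOverlapped);

    public static void WriteChunkAt(SafeFileHandle handle, IntPtr chunk,
                                    uint chunkSize, long fileOffset)
    {
        // The OVERLAPPED structure carries the target offset within the file.
        var overlapped = new NativeOverlapped
        {
            OffsetLow  = (int)(fileOffset & 0xFFFFFFFF),
            OffsetHigh = (int)(fileOffset >> 32)
        };
        uint written;
        if (!WriteFile(handle, chunk, chunkSize, out written, ref overlapped))
            throw new IOException("WriteFile failed",
                                  Marshal.GetHRForLastWin32Error());
    }
}

// Usage (sketch): open once, pre-size, then write chunks in a circular pattern.
// using (var fs = File.OpenWrite("capture.bin"))
// {
//     fs.SetLength(4L * 1024 * 1024 * 1024);   // e.g. a 4 GB circular file
//     ChunkWriter.WriteChunkAt(fs.SafeFileHandle, unmanagedChunkPtr, 2 * 1024 * 1024, offset);
// }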
Now I'd like to know if and how I can make this process more efficient. Any ideas?
Some questions in detail:
Will changing from file I/O to memory-mapped file help?
Can I include some optimizations for NTFS?
Are there some useful parameters when creating the file that I'm missing? (maybe an unmanaged call with special parameters)
Using better hardware will probably be the most cost-efficient way to increase file writing efficiency.
There is a paper from Microsoft research that will answer most of your questions: Sequential File Programming Patterns and Performance with .NET and the downloadable source code (C#) if you want to run the tests from the paper on your machine.
In short:
The default behavior provides excellent performance on a single disk.
Unbuffered IO should be tested if you have a disk array; it could improve write speed by a factor of eight.
This thread on social.msdn might also be of interest.

C# array to implement support for page-file architecture

Let me explain what I need to accomplish. I need to load a file into RAM and analyze its structure. What I was doing is this:
//Stream streamFile;
byte[] bytesFileBuff = new byte[streamFile.Length];
if(streamFile.Read(bytesFileBuff, 0, (int)streamFile.Length) == streamFile.Length)
{
//Loaded OK, can now analyze 'bytesFileBuff'
//Go through bytes in 'bytesFileBuff' array from 0 to `streamFile.Length`
}
But in my previous experience with Windows and 32-bit processes, it seems that even moderately sized contiguous allocations can be hard to make. (In that particular example I failed to allocate 512MB on a Windows 7 machine with 16GB of installed RAM.)
So I was curious, is there a special class that would allow me to work with the contents on a file of hypothetically any length (by implementing an internal analog of a page-file architecture)?
If linear stream access (even with multiple passes) is not a viable option, the solution in Win32 would be to use Memory Mapped Files with relatively small Views.
I didn't think you could do that in C# easily, but I was wrong. It turns out that .NET 4.0 and above provide classes wrapping the Memory Mapped Files API.
See http://msdn.microsoft.com/en-us/library/dd997372.aspx
If you have used memory mapped files in C/C++, you will know what to do.
The basic idea would be to use MemoryMappedFile.CreateFromFile to obtain a MemoryMappedFile object. With that object, you can call the CreateViewAccessor method to get different MemoryMappedViewAccessor objects that represent chunks of the file; you can use these objects to read from the file in chunks of your choice. Make sure you dispose the MemoryMappedViewAccessor objects diligently to release the memory buffers.
You have to work out the right strategy for using memory mapped files. You don't want to create too many small views or you will suffer a lot of overhead. Too few larger views and you will consume a lot of memory.
(As I said, I didn't know about these class wrappers in .NET. Do read the MSDN docs carefully: I might have easily missed something important in the few minutes I spent reviewing them)
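To make this concrete, here is a minimal sketch (assuming .NET 4.0+ and an existing file; the path and view size are illustrative): map the file once, then analyze it through fixed-size views instead of one huge byte[] allocation.

// Sketch: walk a large file in 64 MB memory-mapped views.
using System;
using System.IO;
using System.IO.MemoryMappedFiles;

static class MappedAnalyzer
{
    public static void Analyze(string path)
    {
        long fileLength = new FileInfo(path).Length;
        const long viewSize = 64 * 1024 * 1024;   // 64 MB views; tune to taste

        using (var mmf = MemoryMappedFile.CreateFromFile(path, FileMode.Open))
        {
            for (long offset = 0; offset < fileLength; offset += viewSize)
            {
                long size = Math.Min(viewSize, fileLength - offset);
                using (var view = mmf.CreateViewAccessor(offset, size,
                                                         MemoryMappedFileAccess.Read))
                {
                    byte first = view.ReadByte(0);   // analyze the view's bytes here
                }
            }
        }
    }
}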

Fastest PNG decoder for .NET

Our web server needs to process many compositions of large images together before sending the results to web clients. This process is performance-critical because the server can receive several thousand requests per hour.
Right now our solution loads PNG files (around 1MB each) from the HD and sends them to the video card so the composition is done on the GPU. We first tried loading our images using the PNG decoder exposed by the XNA API. We saw the performance was not too good.
To understand whether the problem was loading from the HD or decoding the PNG, we modified the code to load the file into a memory stream first, and then send that memory stream to the .NET PNG decoder. The difference in performance between XNA and the System.Windows.Media.Imaging.PngBitmapDecoder class is not significant; we roughly get the same levels of performance.
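For reference, a hedged sketch of that measurement (the file name is illustrative and requires references to PresentationCore/WindowsBase): read the PNG fully into memory first so disk time and decode time can be timed separately.

// Sketch: decode a PNG from a MemoryStream with the WPF decoder.
using System.IO;
using System.Windows.Media.Imaging;

static class PngLoad
{
    public static BitmapSource DecodeFromMemory(string path)
    {
        byte[] bytes = File.ReadAllBytes(path);          // disk read
        using (var ms = new MemoryStream(bytes))
        {
            var decoder = new PngBitmapDecoder(ms,       // CPU-bound decode
                BitmapCreateOptions.PreservePixelFormat,
                BitmapCacheOption.OnLoad);
            BitmapSource frame = decoder.Frames[0];
            frame.Freeze();                              // usable across threads
            return frame;
        }
    }
}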
Our benchmarks show the following performance results:
Load images from disk: 37.76ms 1%
Decode PNGs: 2816.97ms 77%
Load images on Video Hardware: 196.67ms 5%
Composition: 87.80ms 2%
Get composition result from Video Hardware: 166.21ms 5%
Encode to PNG: 318.13ms 9%
Store to disk: 3.96ms 0%
Clean up: 53.00ms 1%
Total: 3680.50ms 100%
From these results we see that the slowest part is decoding the PNGs.
So we are wondering whether there is a PNG decoder we could use that would allow us to reduce the PNG decoding time. We also considered keeping the images uncompressed on the hard disk, but then each image would be 10MB in size instead of 1MB, and since there are several tens of thousands of these images stored on the hard disk, it is not possible to store them all without compression.
EDIT: More useful information:
The benchmark simulates loading 20 PNG images and compositing them together. This will roughly correspond to the kind of requests we will get in the production environment.
Each image used in the composition is 1600x1600 in size.
The solution will involve as many as 10 load balanced servers like the one we are discussing here. So extra software development effort could be worth the savings on the hardware costs.
Caching the decoded source images is something we are considering, but each composition will most likely be done with completely different source images, so cache misses will be high and performance gain, low.
The benchmarks were done with a crappy video card, so we can expect the PNG decoding to be even more of a performance bottleneck using a decent video card.
There is another option. And that is, you write your own GPU-based PNG decoder. You could use OpenCL to perform this operation fairly efficiently (and perform your composition using OpenGL which can share resources with OpenCL). It is also possible to interleave transfer and decoding for maximum throughput. If this is a route you can/want to pursue I can provide more information.
Here are some resources related to GPU-based DEFLATE (and INFLATE).
Accelerating Lossless compression with GPUs
gpu-block-compression using CUDA on Google code.
Floating point data-compression at 75 Gb/s on a GPU - note that this doesn't use INFLATE/DEFLATE but a novel parallel compression/decompression scheme that is more GPU-friendly.
Hope this helps!
Have you tried the following two things?
1)
Multi-thread it. There are several ways of doing this, but one would be an "all in" method: basically spawn X threads, each running the full process.
2)
Perhaps consider having X threads do all the CPU work and then feed the results to the GPU thread.
Your question is very well formulated for a new user, but some more information about the scenario might be useful:
Are we talking about a batch job or serving pictures in real time?
Do the 10k pictures change?
Hardware resources
You should also take into account what hardware resources you have at your disposal.
Normally the two cheapest things are CPU power and disk space, so if you only have 10k pictures that rarely change, converting them all into a format that is quicker to handle might be the way to go.
Multi-threading trivia
Another thing to consider when multithreading is that it is normally smart to run the threads at BelowNormal priority, so you don't make the entire system "lag". You have to experiment a bit with the number of threads to use; if you're lucky you can get close to a 100% speed gain per core, but this depends a lot on the hardware and the code you're running.
I normally use Environment.ProcessorCount to get the current CPU count and work from there :)
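A hedged sketch of that multi-threading suggestion: one worker per core, running at BelowNormal priority. The decodeOne delegate is a placeholder for whatever PNG decode you actually use.

// Sketch: pull file paths from a queue and decode them on worker threads.
using System;
using System.Collections.Concurrent;
using System.Threading;

static class DecodePool
{
    public static void Run(BlockingCollection<string> pngPaths, Action<string> decodeOne)
    {
        int workerCount = Environment.ProcessorCount;
        var workers = new Thread[workerCount];
        for (int i = 0; i < workerCount; i++)
        {
            workers[i] = new Thread(() =>
            {
                foreach (string path in pngPaths.GetConsumingEnumerable())
                    decodeOne(path);                          // CPU-bound PNG decode
            });
            workers[i].Priority = ThreadPriority.BelowNormal; // don't lag the system
            workers[i].IsBackground = true;
            workers[i].Start();
        }
        foreach (var worker in workers)
            worker.Join();
    }
}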
I've written a pure C# PNG coder/decoder (PngCs); you might want to give it a look.
But I highly doubt it will have better speed performance [*]; it's not highly optimized, and instead tries to minimize memory usage when dealing with huge images (it encodes/decodes sequentially, line by line). But perhaps it can serve you as boilerplate to plug in a better compression/decompression implementation. As I see it, the speed bottleneck is zlib (inflater/deflater), which (unlike in Java) is not implemented natively in C#; I used the SharpZipLib library, which is pure managed C# code, and this cannot be very efficient.
I'm a little surprised, however, that in your tests decoding was so much slower than encoding. That seems strange to me because, in most compression algorithms (perhaps in all, and surely in zlib), encoding is much more compute-intensive than decoding.
Are you sure about that?
(For example, this speed test, which reads and writes 5000x5000 RGB8 images (not very compressible, about 20MB on disk), gives me about 4.5 seconds for writing and 1.5 seconds for reading.) Perhaps there are other factors apart from pure PNG decoding?
[*] Update: newer versions (since 1.1.14) have several optimizations; especially if you can use .NET 4.5, they should provide better decoding speed.
You have multiple options:
Improve the performance of the decoding process
You could implement another, faster PNG decoder
(libpng is a standard library which might be faster)
You could switch to another picture format that uses simpler/faster-to-decode compression
Parallelize
Use the .NET parallel processing capabilities to decode concurrently. Decoding is likely single-threaded, so this could help if you run on multicore machines (see the sketch at the end of this answer).
Store the files uncompressed but on a device that compresses
For instance, a compressed folder or even a SandForce SSD.
This will still compress, but differently, and will burden other software or hardware with the decompression. I am not sure this will really help and would only try it as a last resort.
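Regarding the parallelization option above, a hedged sketch using the Task Parallel Library; the decode delegate is a placeholder for whichever decoder you end up choosing (PngBitmapDecoder, libpng via P/Invoke, etc.).

// Sketch: decode a batch of PNG files concurrently with Parallel.For.
using System;
using System.Threading.Tasks;

static class ParallelBatch
{
    public static T[] DecodeAll<T>(string[] paths, Func<string, T> decode)
    {
        var images = new T[paths.Length];
        Parallel.For(0, paths.Length, i =>
        {
            images[i] = decode(paths[i]);   // each file decoded on a pool thread
        });
        return images;
    }
}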
