I've read quite a few SO posts and general articles on trying to allocate over 1 GB of memory, so before getting shot down like the others, here is some context.
This app will run as a kiosk with a dedicated machine running no unnecessary processes.
My app acquires images from a high-speed camera with a rolling shutter at a rate of 120 frames per second at a resolution of 1920 x 1080 with a bit depth of 24. The app needs to write every single frame to disk for post-processing. The current problem I am facing is that the Disk I/O won't keep up with the capture rate even though it is limited to 120 frames per second. The Disk I/O bandwidth needed is around 750MBps!
The total length of the recording needs to be at least 10 seconds (7.5GB) in raw form. Performing any on-the-fly transcoding or compression brings the frame-rate down to utterly unacceptable levels.
To work around this, I have tried the following:
Compromising on quality by reducing the bit depth to 16 at the hardware level, which still requires around 500 MB/s.
Disabling all image encoding and writing raw camera data to disk. This has saved some processing time.
Creating a single 10 GB file on disk and doing a sequential write-through as frames come in (see the sketch after this list). This has helped most so far. All dev and production systems have a 100 GB dedicated drive for this application.
Using Contig.exe from Sysinternals to defragment the file. This has had astonishing gains on non-SSD drives.
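For reference, a minimal sketch of that pre-allocation plus sequential write-through approach (the frame size, buffer size, and the AcquireFrames() source are illustrative, not part of the real capture code):

const int FrameBytes = 1920 * 1080 * 3;                 // one 24-bit frame
const long CaptureFileSize = 10L * 1024 * 1024 * 1024;  // 10 GB pre-allocated

using (var fs = new FileStream(@"D:\Temp.VideoCache", FileMode.Create,
                               FileAccess.Write, FileShare.None,
                               1 << 20,                  // 1 MB internal buffer
                               FileOptions.WriteThrough))
{
    fs.SetLength(CaptureFileSize);              // pre-allocate once to avoid growth and fragmentation

    foreach (byte[] frame in AcquireFrames())   // hypothetical camera frame source
        fs.Write(frame, 0, FrameBytes);
}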
I am out of options to explore here. I am not familiar with memory-mapped files, and when trying to create them I get an IOException saying "Not enough storage is available to process this command."
using (var file = MemoryMappedFile.CreateFromFile(@"D:\Temp.VideoCache", FileMode.OpenOrCreate, "MyMapName", int.MaxValue, MemoryMappedFileAccess.CopyOnWrite))
{
    ...
}
The large file I currently use requires either sequential write-through or sequential read access. Any pointers would be appreciated.
I could even force the overall recording size down to 1.8 GB if only there was a way to allocate that much RAM. Once again, this will run on a dedicated machine with 8 GB of available memory and 100 GB of free space. However, not all production systems will have SSD drives.
A 32-bit process on a 64-bit system can allocate up to 4 GB of RAM (provided the executable is large-address-aware), so it should be possible to get 1.8 GB of RAM for storing the video, but of course you need to account for loaded DLLs and a buffer until the video is compressed.
Other than that, you could use a RAMDisk, e.g. from DataRam. You just need to find a balance between how much memory the application needs and how much memory you can grant the disk. IMHO a 5 GB / 3 GB setting could work well: 1 GB for the OS, 4 GB for your application and 3 GB for the file.
Don't forget to copy the file from the RAM disk to HDD if you want it persistent.
Commodity hardware is cheap for a reason. You need faster hardware.
Buy a faster disk system. A good RAID controller and four SSDs. Put the drives into a RAID 1+0 configuration and be done with this problem.
How much money is your company planning on spending developing and testing software to push cheap hardware past its limitations? And even if you can get it to work fast enough, how much do they plan on spending to maintain that software?
Memory-mapped files don't speed up writing to a file very much...
If you have a big file, you normally don't try to map it entirely in RAM... you map a "window" of it, then "move" the window (in C#/Windows API you create a "view" of the file starting at any one location and with a certain size)
Example code (here the window is 1 MB; bigger windows are possible... on 32 bits it should be possible to allocate a 64 or 128 MB window without any problem):
using System;
using System.IO;
using System.IO.MemoryMappedFiles;

const string fileName = "Test.bin";
const long fileSize = 1024L * 1024 * 16;  // 16 MB backing file
const long windowSize = 1024 * 1024;      // 1 MB view ("window")

// Make sure the backing file exists and is fileSize bytes long.
if (!File.Exists(fileName) || new FileInfo(fileName).Length < fileSize)
{
    using (var file = File.Create(fileName))
    {
        file.SetLength(fileSize);
    }
}

using (var mm = MemoryMappedFile.CreateFromFile(fileName, FileMode.Open))
{
    long start = 0;

    while (true)
    {
        long size = Math.Min(fileSize - start, windowSize);

        if (size <= 0)
        {
            break;
        }

        // Map only the current window of the file, fill it, then move on.
        using (var acc = mm.CreateViewAccessor(start, size))
        {
            for (int i = 0; i < size; i++)
            {
                // It is probably faster if you write the file with acc.WriteArray()
                acc.Write(i, (byte)i);
            }
        }

        start += windowSize;
    }
}
Note that this code writes a fixed, pre-known number of bytes (fileSize)... Your code will have to be different, because you can't know the exact fileSize in advance. Still, remember: memory-mapped files don't speed up writing to a file very much.
I have the following code:
using System;
using System.Drawing;
using System.Drawing.Imaging;
using System.IO;
using System.Threading.Tasks;
using ImageProcessor;
using ImageProcessor.Imaging;

class Methods
{
    public MemoryStream UniqPicture(string imagePath)
    {
        var photoBytes = File.ReadAllBytes(imagePath); // change imagePath to a valid image path
        var quality = 70;
        var format = ImageFormat.Jpeg;  // target format (currently unused below)
        var size = new Size(200, 200);  // target size (currently unused below)

        using (var inStream = new MemoryStream(photoBytes))
        {
            // The output stream is returned to the caller, so it is not wrapped in a using here.
            var outStream = new MemoryStream();

            using (var imageFactory = new ImageFactory())
            {
                imageFactory.Load(inStream)
                    .Rotate(new Random().Next(-7, 7))
                    .RoundedCorners(new RoundedCornerLayer(190))
                    .Pixelate(3, null)
                    .Contrast(new Random().Next(-15, 15))
                    .Brightness(new Random().Next(-15, 15))
                    .Quality(quality)
                    .Save(outStream);
            }

            return outStream;
        }
    }

    public void StartUniq()
    {
        var files = Directory.GetFiles("mypath");
        Parallel.ForEach(files, (picture) => { UniqPicture(picture); });
    }
}
When I start the StartUniq() method, my CPU stays at 12-13% and no more. Can I use a higher CPU percentage for this operation? Why doesn't it increase?
I tried doing it from Python and it's also only 12-13%. The CPU is a Core i7 8700.
The only way to make the operation run faster is to start a second instance of the application.
Is it a Windows limit? I'm using Windows Server 2016.
I think this is a system limit, because if I try this simple code it's also capped at 12% CPU!
while (true)
{
var a = 1 + 2;
}
A bit of research shows that you are using ImageFactory from https://imageprocessor.org/, which wraps System.Drawing. System.Drawing itself is often a wrapper for GDI/GDI+, which... incorporates process-wide locks, so your attempts at multithreading will be severely bottlenecked. Try a better image library.
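For example, here is a rough sketch of the same kind of per-file work using SixLabors.ImageSharp instead (my suggestion, not something from the question; it does not rely on GDI+ process-wide locks, so the parallel loop can actually scale):

using SixLabors.ImageSharp;
using SixLabors.ImageSharp.Formats.Jpeg;
using SixLabors.ImageSharp.Processing;

// Inside Parallel.ForEach(files, picture => { ... }):
using (var image = Image.Load(picture))
{
    image.Mutate(x => x.Rotate(5).Resize(200, 200)); // roughly what the ImageFactory chain was doing
    image.Save(picture + ".out.jpg", new JpegEncoder { Quality = 70 });
}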
(See Robert McKee's answer, although this could also be about disk I/O, but maybe not.)
So, I haven't used Parallel.ForEach before, but it seems like you should be running your UniqPicture method in parallel for all files in a given directory. I think your approach is good here, but ultimately your hard drive is probably killing the speed of your program (and vice versa).
Have you tried running UniqPicture in a loop sequentially? My concern here is that your hard drive may be thrashing. But in general, it's most likely that the input/output (I/O) from your hard drive is taking a considerable amount of time, so the CPU waits a considerable amount of time before it can operate on the images in UniqPicture. If you could pre-load the images into memory, I would think the CPU utilization would be much higher, if not maxing out your CPU.
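For instance, a quick sketch of the pre-loading idea, still using ImageFactory from the question (the result is discarded here, just as in StartUniq; requires using System.Linq;):

var files  = Directory.GetFiles("mypath");
var loaded = files.Select(File.ReadAllBytes).ToList(); // all disk I/O happens here, sequentially

Parallel.ForEach(loaded, photoBytes =>
{
    using (var inStream = new MemoryStream(photoBytes))
    using (var outStream = new MemoryStream())
    using (var imageFactory = new ImageFactory())
    {
        imageFactory.Load(inStream).Quality(70).Save(outStream); // CPU-only work
    }
});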
In no particular order, here are some thoughts
What happens if you run this sequentially? This will max out one core of the CPU at most, but it may prevent hard drive thrashing. If there are 100 threads being spun up, that's a lot of requests for the hard drive to deal with at once.
You should be able to add this option to make it run sequentially (or just make it a normal loop without Parallel.); see the sketch at the end of this answer for where it plugs in:
new ParallelOptions { MaxDegreeOfParallelism = 1 },
Maybe try 2, 3, or 4 threads and see if anything changes.
Check your hard drive utilization in Task Manager. What's the latency on the hard drive where the images are stored? What's the percentage that Windows reports it as busy? You want the hard drive to be busy the entire time (100% usage), but you also want it to be grabbing your images with the highest throughput possible so the CPU can do its job.
A spinning hard drive (HDD) has far lower IOPS (IO per second) than an SSD in general. An SSD will usually have 1000 to 100,000+ IOPS, but a HDD is around 200, I believe, and has much lower throughput usually. An SSD should help your program utilize the CPU much more.
The size of the image files could have an impact here, again relating to IO.
Or maybe see Robert McKee's answer about your threads getting bottlenecked. Maybe 13% CPU utilization is the best you can get. One of your 6 cores being maxed is ~16.7%, so you actually aren't that far off from maxing one core already.
Ultimately, time how long it takes. CPU utilization should scale inversely (higher CPU usage = lower run time) with the time this takes to run, but time it just to be sure, since that's the real benchmark.
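For example, a sketch of where that ParallelOptions snippet plugs into the question's StartUniq:

public void StartUniq()
{
    var files = Directory.GetFiles("mypath");
    var options = new ParallelOptions { MaxDegreeOfParallelism = 1 }; // try 1, 2, 3, 4... and measure each run
    Parallel.ForEach(files, options, picture => { UniqPicture(picture); });
}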
I'm using MonoTorrent to download a ~20 GB file. When MonoTorrent creates the files, memory and CPU usage reach their maximum, which slows the computer down and even overheats it, so I wanted to limit the memory usage by limiting the write rate.
Here's what I have tried.
I checked around and found that you can limit the read/write rate of the engine using this code:
EngineSettings engineSettings = new EngineSettings(downloadsPath, port);
engineSettings.PreferEncryption = true;
engineSettings.AllowedEncryption = EncryptionTypes.All;
engineSettings.MaxWriteRate = **maximum write rate in bytes**;
engineSettings.MaxReadRate = **maximum read rate in bytes**;
engineSettings.GlobalMaxDownloadSpeed = **max download in bytes**;
The download rate limit worked, but it didn't limit the memory usage, so I checked the write rate value at runtime using this code:
MessageBox.Show(engine.DiskManager.WriteRate.ToString());
and it returned 0. So instead of setting MaxWriteRate on the EngineSettings, I went into EngineSettings.cs and hard-coded a value for MaxWriteRate by changing this code:
public int MaxWriteRate
{
    get { return 5000; }
    set { maxWriteRate = 5000; }
}
That didn't limit the memory usage either, and the WriteRate value still returned 0, so I went into DiskManager.cs and hard-coded a value for WriteRate by changing this code:
public int WriteRate
{
    get { return 5000; }
}
Now the WriteRate value returned 5000, but it still didn't limit the memory usage. At that point I got stuck and couldn't find anything else to change.
Does anyone know why it's not working? I'm starting to think that WriteRate isn't about limiting the write speed at all.
When downloading a torrent, the download speed is limited by three things:
1) The maximum allowed download speed for the TorrentManager
2) The maximum allowed download speed overall
3) No more than 4MB of data is held in memory while waiting to be written to disk.
Specifically on the third point, if there are more than 4MB of pieces held in memory then no further Socket.Receive calls will be made until that data is flushed. https://github.com/mono/monotorrent/blob/caac16cffd95749febe04c3f7cf22567c3e40432/src/MonoTorrent/MonoTorrent.Client/RateLimiters/DiskWriterLimiter.cs#L43-L46
This screenshot shows what happens today when you specify a max write rate of 2 * 1024 * 1024 (2,048 kB/sec):
The download rate auto-limits because the 4MB buffer fills up, which means setting the max disk write rate ends up limiting both download rate and memory consumption.
I have frames of a video at 30 FPS in my C# code and I want to broadcast them on localhost so all other applications can use them. I thought that because it is video, where it doesn't matter if a packet is lost and there is no need to connect/accept clients, UDP would be a good choice.
But there are a number of problems here.
If I use UDP unicast, the speed is enough: about 25 FPS (CPU usage is 25%, which means 100% of one thread on my 4-core CPU, which is not ideal, but at least it sends a full set of data). But unicast can't deliver the data to all clients.
If I use broadcast, the speed is very low: about 10 FPS with the same CPU usage.
What can I do?! The data stays on the same computer, so there is no need for remote access from the LAN or anywhere else. I just want a way to transfer about 30 MB of data per second between different applications on the same machine. (640x480 fixed image size x 30 FPS x 3 bytes per pixel is about 27,000 KB per second.)
Does UDP multicast have better performance?!
Can TCP give me better performance, even if I accept each client and send to them independently?!
Is there any better way than sockets?! Memory sharing or something?!
Why is UDP broadcast so slow?! Only about 10 MB per second?!
Is there a fast way to compress the frames with high performance (encode 30 FPS in a second and decode on the other side)? The client apps are in C++, so this must be a cross-platform approach.
I just want to know other developers' experiences and ideas here, so please write what you think.
Edit:
More info about the data: the frames are in Bitmap RGB24 format and they stream from a device into my application at 30 FPS. I want to broadcast this data to other applications, and they need to receive the images in RGB24 format again. There is no header or anything, only bitmap data of a fixed size. All operations must be performed on the fly. Using a lossy compression algorithm is not a problem.
I have experimented with multicast in an industrial environment; it's a good choice over a reliable, unsaturated network.
On localhost, shared memory may be a good choice because you can build a circular queue of frames and flip from one to the next with only a single mutex protecting a pointer assignment (on the writer side). With one writer and several readers, no problems arise.
On Windows, with both C++ and C#, shared memory is called File Mapping, and you can back it with the system paging file (RAM and/or disk).
See these links to more information
http://msdn.microsoft.com/en-us/library/system.io.memorymappedfiles.aspx
http://msdn.microsoft.com/en-us/library/dd997372.aspx
Mixing C++ and C# : How to implement shared memory in .NET?
Fully managed shared memory .NET implementations?
The shared memory space isn't protected or private, but it is named.
Usually the writer process creates it and the readers open it by name. Antivirus software inspects this kind of I/O the same way it inspects any other, but it doesn't block the communication.
Here is a C++ sample to get started with File Mapping:
char shmName[MAX_PATH + 1];
sprintf( shmName, "shmVideo_%s", name );
shmName[MAX_PATH] = '\0';

_hMap = CreateFileMapping(
    INVALID_HANDLE_VALUE, 0, PAGE_READWRITE, 0, size, shmName );
if( _hMap == 0 ) {
    throw OSException( __FILE__, __LINE__ );
}

_owner = ( GetLastError() != ERROR_ALREADY_EXISTS );

_mutex = Mutex::getMutex( name );
Synchronize sync( *_mutex );

_data = (char *)MapViewOfFile( _hMap, FILE_MAP_ALL_ACCESS, 0, 0, 0 );
if( _data == 0 ) {
    throw OSException( __FILE__, __LINE__ );
}
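And since the question is C#, here is a rough sketch of the writer side using MemoryMappedFile backed by the paging file (the names, sizes, and single-frame layout are illustrative; the map and mutex names must match whatever the readers open):

using System.IO.MemoryMappedFiles;
using System.Threading;

class SharedFrameWriter
{
    const int FrameBytes = 640 * 480 * 3; // one RGB24 frame, as in the question

    readonly MemoryMappedFile _map = MemoryMappedFile.CreateOrOpen("shmVideo_demo", FrameBytes);
    readonly Mutex _mutex = new Mutex(false, "shmVideo_demo_mutex");

    public void WriteFrame(byte[] frame)
    {
        using (var view = _map.CreateViewAccessor(0, FrameBytes))
        {
            _mutex.WaitOne();
            try { view.WriteArray(0, frame, 0, frame.Length); }
            finally { _mutex.ReleaseMutex(); }
        }
    }
}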
Use live555 http://www.live555.com/ for streaming in combination with your favorite compressor - ffmpeg.
We have up to 30 GB of gzipped log files per day. Each file holds 100,000 lines and is between 6 and 8 MB when compressed. The simplified code below, in which the parsing logic has been stripped out, uses a Parallel.ForEach loop.
The maximum number of lines processed per second peaks at a MaxDegreeOfParallelism of 8 on the two-NUMA-node, 32-logical-CPU box (Intel Xeon E7-2820 @ 2 GHz):
using System;
using System.Collections.Concurrent;
using System.Linq;
using System.IO;
using System.IO.Compression;
using System.Threading.Tasks;

namespace ParallelLineCount
{
    public class ScriptMain
    {
        static void Main(String[] args)
        {
            int maxMaxDOP = (args.Length > 0) ? Convert.ToInt16(args[0]) : 2;
            string fileLocation = (args.Length > 1) ? args[1] : "C:\\Temp\\SomeFiles";
            string filePattern = (args.Length > 2) ? args[2] : "*2012-10-30.*.gz";
            string fileNamePrefix = (args.Length > 3) ? args[3] : "LineCounts";

            Console.WriteLine("Start: {0}", DateTime.UtcNow.ToString("yyyy-MM-ddTHH:mm:ss.fffffffZ"));
            Console.WriteLine("Processing file(s): {0}", filePattern);
            Console.WriteLine("Max MaxDOP to be used: {0}", maxMaxDOP.ToString());
            Console.WriteLine("");
            Console.WriteLine("MaxDOP,FilesProcessed,ProcessingTime[ms],BytesProcessed,LinesRead,SomeBookLines,LinesPer[ms],BytesPer[ms]");

            for (int maxDOP = 1; maxDOP <= maxMaxDOP; maxDOP++)
            {
                // Construct ConcurrentStacks for resulting strings and counters
                ConcurrentStack<Int64> TotalLines = new ConcurrentStack<Int64>();
                ConcurrentStack<Int64> TotalSomeBookLines = new ConcurrentStack<Int64>();
                ConcurrentStack<Int64> TotalLength = new ConcurrentStack<Int64>();
                ConcurrentStack<int> TotalFiles = new ConcurrentStack<int>();

                DateTime FullStartTime = DateTime.Now;
                string[] files = System.IO.Directory.GetFiles(fileLocation, filePattern);
                var options = new ParallelOptions() { MaxDegreeOfParallelism = maxDOP };

                // Method signature: Parallel.ForEach(IEnumerable<TSource> source, Action<TSource> body)
                Parallel.ForEach(files, options, currentFile =>
                {
                    string filename = System.IO.Path.GetFileName(currentFile);
                    DateTime fileStartTime = DateTime.Now;

                    using (FileStream inFile = File.Open(fileLocation + "\\" + filename, FileMode.Open))
                    {
                        Int64 lines = 0, someBookLines = 0, length = 0;
                        String line = "";

                        using (var reader = new StreamReader(new GZipStream(inFile, CompressionMode.Decompress)))
                        {
                            while (!reader.EndOfStream)
                            {
                                line = reader.ReadLine();
                                lines++; // total lines
                                length += line.Length; // total line length
                                if (line.Contains("book")) someBookLines++; // some special lines that need to be parsed later
                            }

                            TotalLines.Push(lines); TotalSomeBookLines.Push(someBookLines); TotalLength.Push(length);
                            TotalFiles.Push(1); // silly way to count processed files :)
                        }
                    }
                });

                TimeSpan runningTime = DateTime.Now - FullStartTime;

                // Console.WriteLine("MaxDOP,FilesProcessed,ProcessingTime[ms],BytesProcessed,LinesRead,SomeBookLines,LinesPer[ms],BytesPer[ms]");
                Console.WriteLine("{0},{1},{2},{3},{4},{5},{6},{7}",
                    maxDOP.ToString(),
                    TotalFiles.Sum().ToString(),
                    Convert.ToInt32(runningTime.TotalMilliseconds).ToString(),
                    TotalLength.Sum().ToString(),
                    TotalLines.Sum(),
                    TotalSomeBookLines.Sum().ToString(),
                    Convert.ToInt64(TotalLines.Sum() / runningTime.TotalMilliseconds).ToString(),
                    Convert.ToInt64(TotalLength.Sum() / runningTime.TotalMilliseconds).ToString());
            }

            Console.WriteLine();
            Console.WriteLine("Finish: " + DateTime.UtcNow.ToString("yyyy-MM-ddTHH:mm:ss.fffffffZ"));
        }
    }
}
Here's a summary of the results, with a clear peak at MaxDegreeOfParallelism = 8:
The CPU load (shown aggregated here, most of the load was on a single NUMA node, even when DOP was in 20 to 30 range):
The only way I've found to make CPU load cross 95% mark was to split the files across 4 different folders and execute the same command 4 times, each one targeting a subset of all files.
Can someone find a bottleneck?
It's likely that one problem is the small buffer size used by the default FileStream constructor. I suggest you use a larger input buffer. Such as:
using (FileStream infile = new FileStream(
name, FileMode.Open, FileAccess.Read, FileShare.None, 65536))
The default buffer size is 4 kilobytes, which has the thread making many calls to the I/O subsystem to fill its buffer. A buffer of 64K means that you will make those calls much less frequently.
I've found that a buffer size of between 32K and 256K gives the best performance, with 64K being the "sweet spot" when I did some detailed testing a while back. A buffer size larger than 256K actually begins to reduce performance.
Also, although this is unlikely to have a major effect on performance, you probably should replace those ConcurrentStack instances with 64-bit integers and use Interlocked.Add or Interlocked.Increment to update them. It simplifies your code and removes the need to manage the collections.
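For example, a sketch of that counter change (the variable names are illustrative and mirror the ConcurrentStack instances in the question; requires using System.Threading;):

long totalLines = 0, totalLength = 0, totalSomeBookLines = 0;
int totalFiles = 0;

Parallel.ForEach(files, options, currentFile =>
{
    long lines = 0, length = 0, someBookLines = 0;
    // ... open, decompress and count the file exactly as before, using these locals ...

    // One atomic update per counter per file instead of a concurrent collection push.
    Interlocked.Add(ref totalLines, lines);
    Interlocked.Add(ref totalLength, length);
    Interlocked.Add(ref totalSomeBookLines, someBookLines);
    Interlocked.Increment(ref totalFiles);
});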
Update:
Re-reading your problem description, I was struck by this statement:
The only way I've found to make CPU load cross 95% mark was to split
the files across 4 different folders and execute the same command 4
times, each one targeting a subset of all files.
That, to me, points to a bottleneck in opening files. As though the OS is using a mutual exclusion lock on the directory. And even if all the data is in the cache and there's no physical I/O required, processes still have to wait on this lock. It's also possible that the file system is writing to the disk. Remember, it has to update the Last Access Time for a file whenever it's opened.
If I/O really is the bottleneck, then you might consider having a single thread that does nothing but load files and stuff them into a BlockingCollection or similar data structure so that the processing threads don't have to contend with each other for a lock on the directory. Your application becomes a producer/consumer application with one producer and N consumers.
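A rough sketch of that producer/consumer shape, assuming whole compressed files (6-8 MB each) are the work items (needs using System.Threading; in addition to the usings already in the question):

var queue = new BlockingCollection<byte[]>(boundedCapacity: 8); // bounds memory use

var producer = Task.Run(() =>
{
    foreach (var file in Directory.GetFiles(fileLocation, filePattern))
        queue.Add(File.ReadAllBytes(file)); // only this thread touches the directory and disk
    queue.CompleteAdding();
});

long totalLines = 0;
var consumers = Enumerable.Range(0, Environment.ProcessorCount).Select(_ => Task.Run(() =>
{
    foreach (var compressed in queue.GetConsumingEnumerable())
    {
        using (var reader = new StreamReader(new GZipStream(new MemoryStream(compressed), CompressionMode.Decompress)))
        {
            long lines = 0;
            while (reader.ReadLine() != null) lines++;
            Interlocked.Add(ref totalLines, lines);
        }
    }
})).ToArray();

Task.WaitAll(consumers);
producer.Wait();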
The reason for this is usually that threads synchronize too much.
Looking for synchronization in your code I can see heavy syncing on the collections. Your threads are pushing the lines individually. This means that each line incurs at best an interlocked operation and at worst a kernel-mode lock wait. The interlocked operations will contend heavily because all threads race to get their current line into the collection. They all try to update the same memory locations. This causes cache line pinging.
Change this to push lines in bigger chunks. Push line-arrays of 100 lines or more. The more the better.
In other words, collect results in a thread-local collection first and only rarely merge into the global results.
You might even want to get rid of the manual data pushing altogether. This is what PLINQ is made for: Streaming data concurrently. PLINQ abstracts away all the concurrent collection manipulations in a well-performing way.
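A sketch of what that could look like for the line count alone (the other counters can be merged the same way):

long totalLines = files
    .AsParallel()
    .WithDegreeOfParallelism(maxDOP)
    .Select(path =>
    {
        long lines = 0;
        using (var reader = new StreamReader(new GZipStream(File.OpenRead(path), CompressionMode.Decompress)))
        {
            while (reader.ReadLine() != null) lines++;
        }
        return lines;
    })
    .Sum();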
I don't think parallelizing the disk reads is helping you. In fact, this could be seriously impacting your performance by creating contention in reading from multiple areas of storage at the same time.
I would restructure the program to first do a single-threaded read of raw file data into a memory stream of byte[]. Then, do a Parallel.ForEach() on each stream or buffer to decompress and count the lines.
You take an initial IO read hit up front but let the OS/hardware optimize the hopefully mostly sequential reads, then decompress and parse in memory.
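A compact sketch of that restructuring (it assumes the whole batch of compressed files fits in memory; otherwise read and process them in batches):

var buffers = files.Select(File.ReadAllBytes).ToList(); // single-threaded, mostly sequential I/O

long totalLines = 0;
Parallel.ForEach(buffers, compressed =>
{
    using (var reader = new StreamReader(new GZipStream(new MemoryStream(compressed), CompressionMode.Decompress)))
    {
        long lines = 0;
        while (reader.ReadLine() != null) lines++;
        Interlocked.Add(ref totalLines, lines); // requires using System.Threading;
    }
});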
Keep in mind that operations like decompression, Encoding.UTF8.ToString(), String.Split(), etc. will use large amounts of memory, so clean up references to / dispose of old buffers as you no longer need them.
I'd be surprised if you can't cause the machine to generate some serious waste heat this way.
Hope this helps.
The problem, I think, is that you are using blocking I/O, so your threads cannot fully take advantage of parallelism.
If I understand your algorithm right (sorry, I'm more of a C++ guy) this is what you are doing in each thread (pseudo-code):
while (there is data in the file)
    read data
    gunzip data
Instead, a better approach would be something like this:
N = 0
read data block N
while (there is data in the file)
    asyncRead data block N+1
    gunzip data block N
    N = N + 1
gunzip data block N
The asyncRead call does not block, so basically you have the decoding of block N happening concurrently with the reading of block N+1, so by the time you are done decoding block N you might have block N+1 ready (or close to ready if I/O is slower than decoding).
Then it's just a matter of finding the block size that gives you the best throughput.
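In C#, the same overlap can be sketched with FileStream.ReadAsync (DecompressAndCount is a hypothetical helper that feeds each block to an incremental decompressor/parser):

static async Task ProcessFileAsync(string path)
{
    const int BlockSize = 1 << 20; // 1 MB blocks; tune for best throughput

    using (var file = new FileStream(path, FileMode.Open, FileAccess.Read,
                                     FileShare.Read, BlockSize, useAsync: true))
    {
        var current = new byte[BlockSize];
        int read = await file.ReadAsync(current, 0, current.Length);

        while (read > 0)
        {
            var next = new byte[BlockSize];
            Task<int> nextRead = file.ReadAsync(next, 0, next.Length); // start reading block N+1

            DecompressAndCount(current, read); // decode block N while block N+1 is being read

            read = await nextRead;
            current = next;
        }
    }
}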
Good luck.
I have an issue with my FileSystemWatcher.
I have an application that needs to monitor a great, really great, number of files being created in a folder in a short period of time.
When I started developing it, I realized that a lot of files were not being notified if my buffer was smaller than 64 KB, which is what Microsoft recommends. I tried increasing the buffer size beyond this until I reached a value that worked for me, which is 2,621,440 bytes!
What would you recommend so I can use a smaller buffer in this case, or what would be the ideal buffer size?
My example code:
WATCHER = new FileSystemWatcher(SignerDocument.UnsignedPath, "*.pdf");
WATCHER.InternalBufferSize = 2621440; //Great and expensive buffer 2.5mb size!
WATCHER.IncludeSubdirectories = true;
WATCHER.EnableRaisingEvents = true;
WATCHER.Created += new FileSystemEventHandler(watcher_Created);
WATCHER.Renamed += new RenamedEventHandler(watcher_Renamed);
And here is what Microsoft says about this in .NET 2.0:
Increasing buffer size is expensive, as it comes from non paged memory
that cannot be swapped out to disk, so keep the buffer as small as
possible. To avoid a buffer overflow, use the NotifyFilter and
IncludeSubdirectories properties to filter out unwanted change
notifications.
link : FileSystemWatcher.InternalBufferSize Property
For such a huge workload you might want to opt for a "periodic sweep" approach instead of instant notifications. You could, for instance, scan the directory every 5 seconds and process the added files. If you move each file to another directory after it's processed, your periodic workload might even become minimal.
That is also a safer approach, because even if your processing code crashes you can always recover; unlike with notifications, your checkpoint won't get lost.
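A minimal sketch of that approach (the folder names, the 5-second interval, and HandlePdf are illustrative):

var incoming  = SignerDocument.UnsignedPath;          // the folder currently being watched
var processed = Path.Combine(incoming, "processed");  // hypothetical "done" folder
Directory.CreateDirectory(processed);

while (true)
{
    foreach (var pdf in Directory.EnumerateFiles(incoming, "*.pdf"))
    {
        HandlePdf(pdf); // hypothetical processing step
        File.Move(pdf, Path.Combine(processed, Path.GetFileName(pdf)));
    }

    Thread.Sleep(TimeSpan.FromSeconds(5)); // sweep interval
}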
You can set the buffer to 4 KB or larger, but it must not exceed 64 KB. If you try to set the InternalBufferSize property to less than 4096 bytes, your value is discarded and the InternalBufferSize property is set to 4096 bytes. For best performance, use a multiple of 4 KB on Intel-based computers.
From:
http://msdn.microsoft.com/de-de/library/system.io.filesystemwatcher.internalbuffersize(v=vs.110).aspx