I wrote an app that syncs local folders with online folders, but it eats all my bandwidth. How can I limit the amount of bandwidth the app uses (programmatically)?
Take a look at http://www.codeproject.com/KB/IP/MyDownloader.aspx
He's using the well-known technique which can be found in Downloader.Extension\SpeedLimit.
Basically, before more data is read off the stream, a check is performed on how much data has actually been read since the previous iteration. If that rate exceeds the max rate, the read command is suspended for a very short time and the check is repeated. Most applications use this technique.
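For illustration, here's a minimal sketch of that check (not the article's actual code; networkStream, localFileStream and maxBytesPerSecond are placeholders):
// Hypothetical throttled read loop: measure the observed rate and, while it
// exceeds the configured maximum, pause briefly before reading again.
const int maxBytesPerSecond = 64 * 1024;      // assumed limit, tune to taste
byte[] buffer = new byte[4096];
long totalBytesRead = 0;
var stopwatch = System.Diagnostics.Stopwatch.StartNew();

int bytesRead;
while ((bytesRead = networkStream.Read(buffer, 0, buffer.Length)) > 0)
{
    localFileStream.Write(buffer, 0, bytesRead);
    totalBytesRead += bytesRead;

    // If we are ahead of the allowed rate, wait until we are back under it.
    while (totalBytesRead / (stopwatch.Elapsed.TotalSeconds + 0.001) > maxBytesPerSecond)
    {
        System.Threading.Thread.Sleep(50);
    }
}
The point is that the rate is measured over the whole transfer, so short bursts are allowed but the average stays under the cap.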
Try this: http://www.netlimiter.com/ It's been on my "check this out" list for a long time (though I haven't tried it yet myself).
I'd say "don't". Unless you're doing something really wrong, your program shouldn't be hogging bandwidth. Your router should be balancing the available bandwidth between all requests.
I'd recommend you do the following:
a) Create MD5 hashes for all the files. Compare hashes and/or dates and sizes for the files and only sync the files that have changed. Unless you're syncing massive files you shouldn't have to sync a whole lot of data.
b) Limit the sending rate. In your upload thread, read the files in 1-8 KB chunks and then call Thread.Sleep after every chunk to limit the rate (see the sketch after this list). You have to do this on the upload side, however.
c) Pipe everything through a GZip stream (System.IO.Compression). For text files this can reduce the size of the data that needs to be transferred.
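A rough sketch of (a) through (c); uploadStream stands in for whatever network stream you actually write to, and the chunk size and sleep interval are arbitrary:
// Hypothetical helpers: hash comparison to skip unchanged files, then a
// throttled, GZip-compressed upload of the files that did change.
using System;
using System.IO;
using System.IO.Compression;
using System.Security.Cryptography;
using System.Threading;

static string Md5Of(string path)
{
    using (var md5 = MD5.Create())
    using (var stream = File.OpenRead(path))
        return BitConverter.ToString(md5.ComputeHash(stream));
}

static void UploadThrottled(string path, Stream uploadStream)
{
    using (var input = File.OpenRead(path))
    using (var gzip = new GZipStream(uploadStream, CompressionMode.Compress, leaveOpen: true))
    {
        var buffer = new byte[8 * 1024];          // 8 KB chunks
        int bytesRead;
        while ((bytesRead = input.Read(buffer, 0, buffer.Length)) > 0)
        {
            gzip.Write(buffer, 0, bytesRead);
            Thread.Sleep(100);                    // crude limit: ~80 KB/s at 8 KB per 100 ms
        }
    }
}
Sleeping a fixed interval per chunk only caps the average rate, but for a background sync job that is usually enough.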
Hope this helps!
We have an application that extracts data from several hardware devices. Each device's data should be stored in a different file.
Currently we have one FileStream per file and do a write whenever data comes in, and that's it.
We have a lot of data coming in and the disk is struggling to keep up on an HDD (not an SSD), I guess partly because flash is faster, but also because with flash we would not have to jump between different file locations all the time.
Some metrics for the default case: 400 different data sources (each should have its own file), and we receive ~50 KB/s from each one (so ~20 MB/s total). Each data source's acquisition runs concurrently, and in total we are using ~6% of the CPU.
Is there a way to organize the flushes to the disk to ensure better throughput?
We will also consider improving the hardware, but that's not really the subject here, even though it would be a good way to improve our read/write performance.
Windows and NTFS handle multiple concurrent sequential IO streams to the same disk terribly inefficiently. Probably, you are suffering from random IO. You need to schedule the IO yourself in bigger chunks.
You might also see extreme fragmentation. In such cases NTFS sometimes allocates every Nth sector to each of the N files. It is hard to believe how bad NTFS is in such scenarios.
Buffer data for each file until you have like 16MB. Then, flush it out. Do not write to multiple files at the same time. That way you have one disk seek for each 16MB segment which reduces seek overhead to near zero.
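A minimal sketch of that buffering idea (the class name and the 16 MB threshold are mine; the single shared lock is what keeps writes to different files from interleaving):
// Hypothetical per-source buffer that is flushed to disk in large sequential chunks.
using System.IO;

class BufferedSink
{
    private const int FlushThreshold = 16 * 1024 * 1024;    // 16 MB
    private static readonly object DiskLock = new object(); // one writer at a time
    private readonly MemoryStream _buffer = new MemoryStream();
    private readonly string _path;

    public BufferedSink(string path) { _path = path; }

    public void Append(byte[] data, int offset, int count)
    {
        _buffer.Write(data, offset, count);
        if (_buffer.Length >= FlushThreshold)
            Flush();
    }

    public void Flush()
    {
        if (_buffer.Length == 0) return;
        lock (DiskLock)   // never write two files at the same time
        {
            using (var fs = new FileStream(_path, FileMode.Append, FileAccess.Write))
                _buffer.WriteTo(fs);
        }
        _buffer.SetLength(0);
        _buffer.Position = 0;
    }
}
Each acquisition thread appends to its own BufferedSink, so only the flush itself is serialized.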
I need to copy a lot of files from many file systems into one big storage.
I also need to limit the bandwidth of the file transfer because the network is not stable and I need the bandwidth for other things.
Another requirement is that it be done in C#.
I thought about using the Microsoft File Sync Framework, but I don't think it provides bandwidth limiting.
I also thought about robocopy, but it is an external process and handling errors might be a bit of a problem.
I looked at BITS, but there is a problem with the scalability of the jobs: I will need to transfer more than 100 files, and that means 100 jobs at the same time.
Any suggestions? recommendations?
Thank you
I'd take a look at How to improve the Performance of FtpWebRequest? Though it might not be exactly what you're looking for, it should give you ideas.
I think you'll want some sort of limited tunnel, so the negotiating processes can't claim more bandwidth because there is none available for them. A connection within a connection.
Alternatively you could make a job queue, which holds off on sending all files at the same time and instead sends n files and waits until one is done before starting the next.
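If you go the queue route, a sketch along these lines might do (SemaphoreSlim caps how many transfers run at once; CopyOneFileAsync is a hypothetical stand-in for whatever transfer method you pick):
// Hypothetical queue that transfers at most maxConcurrent files at a time.
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

async Task CopyAllAsync(IEnumerable<string> files, int maxConcurrent)
{
    using (var gate = new SemaphoreSlim(maxConcurrent))
    {
        var tasks = files.Select(async file =>
        {
            await gate.WaitAsync();
            try
            {
                await CopyOneFileAsync(file);   // hypothetical: your actual transfer goes here
            }
            finally
            {
                gate.Release();
            }
        });
        await Task.WhenAll(tasks);
    }
}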
Well, you could just use the usual I/O methods (read + write) and throttle the rate.
A simple example (not exactly great, but working) would be something like this:
var buffer = new byte[4096];   // small chunks keep the average rate down
int bytesRead;
while ((bytesRead = await fsInput.ReadAsync(buffer, 0, buffer.Length)) > 0)
{
    await fsOutput.WriteAsync(buffer, 0, bytesRead);
    await Task.Delay(100);     // crude throttle: ~40 KB/s at 4 KB per 100 ms
}
Now, obviously, bandwidth throttling isn't really the job of the application. Unless you're on a very simple network, this should be handled by the QoS of the router, which should ensure that the various services get their share of the bandwidth - a steady stream of data will usually have a lower QoS priority. Of course, it does usually require you to have a network administrator.
I have a function that downloads thousands of images at a time from a 3rd party source. The number of images can range from 2,500-250,000 per run. As you can imagine, this process takes some time and I am looking to optimize it as best I can.
The way it works is I take a list of image paths, do a loop through them and request the image from the 3rd party. Currently, before I make the request, I do a check to see if the image already exists on the server...if it does, it skips that image...if it does not, it downloads it.
My question is: does anyone know if the check before the download is slowing down the process (or possibly speeding it up)? Would it be more efficient to download the file and let it overwrite already existing images, thus cutting out the step of checking for existence?
If anyone else has any tips for downloading this volume of images, they are welcome!
The real answer depends on three things:
1: How often you come across an image that already exists. The less often you have a hit, the less useful checking is.
2: The latency of the destination storage. Is the destination storage location local or far away? If it is in India with 300ms latency (and probable high packet loss), the check becomes more expensive relative to the download. This is mitigated significantly by smart threading.
3: Your bandwidth / throughput from your source to your destination. The higher your bandwidth, the less downloading a file twice costs you.
If you have a less than 1% hit rate for images that already exist, you're not getting much of a gain from the check (at most ~1%), but if 90% of the images already exist, it would probably be worth checking even if the destination file store is remote / far away. Either way it is a balancing act, but if you have a hit rate high enough to ask, it's likely that checking to see if you already have the file would be useful.
If images you already have don't get deleted, the best way to do this would probably be to keep a database of images that you've downloaded, and check your list of files to download against that database.
If that isn't feasible because images get deleted / renamed or something, minimize the impact of the check by threading it. The performance difference between foreach and Parallel.ForEach for operations with high latency is huge.
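For example, a rough sketch of the threaded check-then-download loop (imageUrls, targetFolder and the degree of parallelism are placeholders to adapt):
// Hypothetical parallel download loop that skips files we already have.
using System.Collections.Generic;
using System.IO;
using System.Net;
using System.Threading.Tasks;

void DownloadAll(IEnumerable<string> imageUrls, string targetFolder)
{
    Parallel.ForEach(imageUrls,
        new ParallelOptions { MaxDegreeOfParallelism = 16 },   // tune for your latency/bandwidth
        url =>
        {
            var localPath = Path.Combine(targetFolder, Path.GetFileName(url));
            if (File.Exists(localPath))
                return;                                        // skip images we already have

            using (var client = new WebClient())
                client.DownloadFile(url, localPath);
        });
}
The right MaxDegreeOfParallelism depends on the latency to the 3rd party, so it's worth measuring a few values.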
Finally, 250k images can be a lot of data if they're large images. It might be faster to send physical media (i.e. put the data on a hard drive and send the drive).
Doing a
System.IO.File.Exists(pathName);
is a lot less expensive than doing a download. So it would speed up the process by avoiding the time to do the download.
I was thinking...
Does multithreading I/O operations in C# (let's say copying many files from c:\1\ to c:\2\) make a performance difference compared to doing the operation sequentially?
The reason I'm struggling with this is that an IO operation ultimately goes to one device that has to do the work, so even if I'm working in parallel, it will still execute those copy orders sequentially...
Or maybe my assumption is wrong?
In that case, is there any benefit to using a multithreaded copy to:
copy many small files (4 GB total)
copy 4 big files (4 GB total, 1000 MB each)
thanks
Like others say, it has to be measured in the concrete application context.
But I would just like to draw attention to this:
Every time you copy a file, write-access permission on the destination location is checked, which is slow.
All of us have met the case where you have to copy a sequence of already-compressed files; if you compress them again into one big ZIP file, even though the total compressed size is not significantly smaller than the sum of all the content, the IO operation will execute much faster. (Just try it, you will see a huge difference if you haven't done it before.)
So I would assume (again, it has to be measured on the concrete system, these are just guesses) that one big file written to a single disk will be faster than a lot of small files.
Hope this helps.
Multithreading with files is not so much about the CPU but about IO. This means that totally different rules apply. Different devices have different characteristics:
Magnetic disks like sequential IO
SSDs like sequential or parallel random IO (mine has 4 hardware "threads")
The network likes many parallel operations to amortize latency
I'm no expert in hard-disk related questions, but maybe this will shed some light for you:
Windows uses the NTFS file system. This file system doesn't "like" too many small files, for example files under 1 KB. It will "magically" make 100 files of 1 KB weigh 400 KB instead of 100 KB. It is also VERY slow when dealing with a lot of "small" files. Therefore, copying one big file instead of many small files of the same total size will be much faster.
Also, from my personal experience and knowledge, multithreading will NOT speed up the copying of many files, because the actual hardware disk acts like one unit and can't be sped up by sending many requests at the same time (it will process them one by one).
I have some 2TB read-only (no writing once created) files on a RAID 5 (4 x 7.2k @ 3TB) system.
Now I have some threads that want to read portions of those files.
Every thread has an array of chunks it needs.
Every chunk is addressed by file offset (position) and size (mostly about 300 bytes) to read from.
What is the fastest way to read this data?
I don't care about CPU cycles; (disk) latency is what counts.
So if possible I want to take advantage of the NCQ of the hard disks.
As the files are highly compressed, will be accessed randomly and I know the exact position, I have no other way to optimize it.
Should I pool the file reading to one thread?
Should I keep the file open?
Should every thread (maybe about 30) keep every file open simultaneously, and what about new threads coming in (from the web server)?
Will it help if I wait 100ms and sort my readings by file offsets (lowest first)?
What is the best way to read the data? Do you have experiences, tips, hints?
The optimum number of parallel requests depends highly on factors outside your app (e.g. disk count = 4, NCQ depth = ?, driver queue depth = ?, ...), so you might want to use a system that can adapt or be adapted. My recommendation (sketched in code further below) is:
Write all your read requests into a queue, together with some metadata that allows notifying the requesting thread
have N threads dequeue from that queue, synchronously read the chunk, notify the requesting thread
Make N runtime-changeable
Since CPU is not your concern, your worker threads can calculate a floating latency average (and/or maximum, depending on your needs)
Slide N up and down until you hit the sweet spot
Why sync reads? They have lower latency than async reads.
Why waste latency on a queue? A good lockless queue implementation starts at less than 10ns latency, much less than two thread switches
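A sketch of that shape, assuming a BlockingCollection as the queue and a TaskCompletionSource for the notification (class names are mine; adapting N at runtime is left out):
// Hypothetical read scheduler: requests go into one queue, N workers
// service them with synchronous reads on kept-open file handles.
using System.Collections.Concurrent;
using System.IO;
using System.Threading;
using System.Threading.Tasks;

class ReadRequest
{
    public long Offset;
    public int Size;
    public TaskCompletionSource<byte[]> Completion = new TaskCompletionSource<byte[]>();
}

class ReadScheduler
{
    private readonly BlockingCollection<ReadRequest> _queue = new BlockingCollection<ReadRequest>();
    private readonly string _path;

    public ReadScheduler(string path, int workerCount)
    {
        _path = path;
        for (int i = 0; i < workerCount; i++)
            new Thread(Worker) { IsBackground = true }.Start();
    }

    // Called by the requesting (web server) threads.
    public Task<byte[]> ReadAsync(long offset, int size)
    {
        var request = new ReadRequest { Offset = offset, Size = size };
        _queue.Add(request);
        return request.Completion.Task;
    }

    private void Worker()
    {
        // Each worker keeps its own handle open for the lifetime of the app.
        using (var stream = new FileStream(_path, FileMode.Open, FileAccess.Read,
                                           FileShare.Read, 4096, FileOptions.RandomAccess))
        {
            foreach (var request in _queue.GetConsumingEnumerable())
            {
                var buffer = new byte[request.Size];
                stream.Seek(request.Offset, SeekOrigin.Begin);
                int read = 0;
                while (read < request.Size)
                {
                    int n = stream.Read(buffer, read, request.Size - read);
                    if (n == 0) break;
                    read += n;
                }
                request.Completion.SetResult(buffer);
            }
        }
    }
}
For several files you'd keep a small map of open FileStreams per worker instead of the single _path shown here.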
Update: Some Q/A
Should the read threads keep the files open? Yes, definitely.
Would you use a FileStream with FileOptions.RandomAccess? Yes
You write "synchronously read the chunk". Does this mean every single read thread should start reading a chunk from disk as soon as it dequeues an order to read a chunk? Yes, that's what I meant. The queue depth of read requests is managed by the thread count.
Disks are "single threaded" because there is only one head. It won't go faster no matter how many threads you use... in fact more threads probably will just slow things down. Just get yourself the list and arrange (sort) it in the app.
You can of course use many threads, which would probably make the use of NCQ more efficient, but arranging it in the app and using one thread should work better.
If the file is fragmented, use NCQ and a couple of threads, because then you can't know the exact position on disk, so only NCQ can optimize the reads. If it's contiguous, use sorting.
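The "arrange (sort) it in the app" part is just an ordering of the chunk list before issuing the reads, something like this (chunks, stream and buffer are placeholders, and System.Linq is assumed):
// Hypothetical single-threaded pass, sorted by offset so the head moves in
// one direction instead of seeking back and forth.
foreach (var chunk in chunks.OrderBy(c => c.Offset))
{
    stream.Seek(chunk.Offset, SeekOrigin.Begin);
    stream.Read(buffer, 0, chunk.Size);   // buffer must fit the largest chunk
    // ... hand the bytes to whichever thread requested this chunk
}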
You may also try direct I/O to bypass OS caching and read the whole file sequentially... it sometimes can be faster, especially if you have no other load on this array.
Will ReadFileScatter do what you want?