I need to copy a lot of files from many file systems into one big storage.
I also need to limit the bandwidth of the file transfer, because the network is not stable and I need the bandwidth for other things.
Another requirement is that it be done in C#.
I thought about using the Microsoft File Sync Framework, but I think it doesn't provide bandwidth limiting.
I also thought about robocopy, but it is an external process, and handling its errors might be a bit of a problem.
I looked at BITS, but there is a problem with the scalability of the jobs: I will need to transfer more than 100 files, and that means 100 jobs at the same time.
Any suggestions? recommendations?
Thank you
I'd take a look at "How to improve the Performance of FtpWebRequest?". Though it might not be exactly what you're looking for, it should give you some ideas.
I think you'll want some sort of limited tunnel, so the processes negotiating inside it cannot claim more bandwidth, because the tunnel simply has none to spare. A connection within a connection.
Alternatively, you could build a job queue that holds off on sending all files at the same time and instead sends n files, waiting until one is done before starting the next; see the sketch below.
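A rough sketch of that queue idea (assuming an async CopyFileAsync method of your own; SemaphoreSlim caps how many transfers run at once):

using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

static async Task TransferAllAsync(IEnumerable<string> files, int maxConcurrent)
{
    var gate = new SemaphoreSlim(maxConcurrent);
    var tasks = files.Select(async file =>
    {
        await gate.WaitAsync();        // wait for a free slot
        try
        {
            await CopyFileAsync(file); // hypothetical transfer method
        }
        finally
        {
            gate.Release();            // free the slot for the next file
        }
    });
    await Task.WhenAll(tasks);
}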
Well, you could just use the usual I/O methods (read + write) and throttle the rate.
A simple example (not exactly great, but working) would be something like this:
var buffer = new byte[4096];
int bytesRead;
while ((bytesRead = await fsInput.ReadAsync(buffer, 0, buffer.Length)) > 0)
{
    await fsOutput.WriteAsync(buffer, 0, bytesRead);
    await Task.Delay(100); // crude throttle: roughly 40 KB/s with a 4 KB buffer
}
Now, obviously, bandwidth throttling isn't really the application's job. Unless you're on a very simple network, this should be handled by QoS on the router, which should ensure that the various services get their share of the bandwidth; a steady stream of data will usually be given a lower QoS priority. Of course, that does usually require you to have a network administrator.
We are scraping a Web-based API using Microsoft Azure. The issue is that there is SO much data to retrieve (combinations/permutations are involved).
If we use a standard Web Job approach, we calculated it would take about 200 years to process all the data we want to get, and we would like our data to be refreshed every week.
Each request/response from the API takes about 0.5-1.0 seconds to process. The request size is 20,000 bytes on average and the average response is 35,000 bytes. I believe the total number of requests is in the millions.
Another way to think about this question would be: how would you use Azure to Web scrape - and make sure you don't overload (in terms of memory + network) the VM it's running on? (I don't think you need too much CPU processing in this case).
What we have tried so far:
Used Service Bus Queues/Worker Roles scaled to 8 small VMs - but this caused a lot of network errors (there must be some limit to how much network traffic EACH worker role VM can handle).
Used Service Bus Queues/Continuous Web Jobs scaled to 8 small VMs - but this seems to work more slowly, and even scaled out it doesn't give us much control over what's happening behind the scenes. (We don't REALLY know how many VMs are up.)
It seems that these things are built for CPU calculation - not for Web/API scraping.
Just to clarify: I throw my requests into a queue, which then get picked up by my multiple VMs for processing to get the responses. That's how I was using the queues. Each VM was using the ServiceBusTrigger class as prescribed by Microsoft.
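For readers unfamiliar with that pattern, it looks roughly like this (a sketch; the queue name and payload are illustrative, and it requires the Microsoft.Azure.WebJobs plus Service Bus extension packages):

using System.IO;
using System.Net.Http;
using System.Threading.Tasks;
using Microsoft.Azure.WebJobs;

public static class ScrapeFunctions
{
    private static readonly HttpClient Client = new HttpClient();

    // Each VM instance picks messages off the Service Bus queue
    // and fetches the corresponding API response.
    public static async Task ProcessRequest(
        [ServiceBusTrigger("scrape-requests")] string requestUrl,
        TextWriter log)
    {
        string response = await Client.GetStringAsync(requestUrl);
        await log.WriteLineAsync("Fetched " + response.Length + " chars from " + requestUrl);
        // ... persist the response ...
    }
}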
Is it better to have a lot of small VMs or a few massive VMs?
What C# classes should we be looking at?
What are the technical best practices when trying to do something like this on Azure?
Actually, a web scraper is something that I have had up and running in Azure for quite some time now :-)
AFAIK there is no 'magic bullet'. Scraping a lot of sources with deadlines is quite hard.
How it works (the most important things):
I use worker roles, and C# code for the scraper itself.
For scheduling, I use queue storage. I put crawling tasks on the queue with a timeout (i.e. 'when to crawl next') and have the scraper pull them off; see the sketch at the end of this list. You can put triggers on the queue size to ensure you meet deadlines in terms of speed -- personally I don't need them.
SQL Azure is slow, so I don't use it. Instead, I only use table storage for storing the scraped items. Note that updating data can be quite complex.
Don't use too much threading; instead, use async IO for all network traffic.
Also, you might have to consider that extra threads require extra memory (parse trees can become quite big), so there's a trade-off there... I do recall using some threads, but really just a few.
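To make the scheduling idea concrete, here is a minimal sketch using the current Azure.Storage.Queues SDK (not necessarily the SDK version I run; the queue name and connection string are illustrative):

using System;
using Azure.Storage.Queues;
using Azure.Storage.Queues.Models;

string connectionString = "<storage connection string>";
var queue = new QueueClient(connectionString, "crawl-tasks");

// Schedule a crawl: the message stays invisible until it is due.
queue.SendMessage("http://example.com/feed",
                  visibilityTimeout: TimeSpan.FromHours(1));

// Scraper side: pull whatever has become due off the queue.
foreach (QueueMessage msg in queue.ReceiveMessages(maxMessages: 10).Value)
{
    // ... crawl msg.MessageText ...
    queue.DeleteMessage(msg.MessageId, msg.PopReceipt);
}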
Note that this will probably require you to re-design and re-implement your complete web scraper if you're currently using a threaded approach. Then again, there are some benefits:
Table storage and queue storage are cheap.
I currently use a single Extra Small VM to scrape well over a thousand web sources.
Inbound network traffic is free.
As such, the result is quite cheap as well; I'm sure it's much less than the alternatives.
As for the classes I use... well, that's a bit of a long list. I'm using HttpWebRequest for the async HTTP requests, plus the Azure SDK -- but all the rest is hand-crafted (and not open source).
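For what it's worth, the async fetch part looks roughly like this (a sketch using the Task-based API; on older runtimes you would use Begin/EndGetResponse instead):

using System.IO;
using System.Net;
using System.Threading.Tasks;

static async Task<string> FetchAsync(string url)
{
    var request = (HttpWebRequest)WebRequest.Create(url);
    using (var response = (HttpWebResponse)await request.GetResponseAsync())
    using (var reader = new StreamReader(response.GetResponseStream()))
    {
        // No thread sits blocked while we wait on the network.
        return await reader.ReadToEndAsync();
    }
}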
P.S.: This doesn't just hold for Azure; most of this also holds for on-premise scrapers.
I have some experience with scraping so I will share my thoughts.
It seems that these things are built for CPU calculation - not for Web/API scraping.
They are built for dynamic scaling, which, given your task, is not something you really need.
How to make sure you don't overload the VM?
Measure the response times and error rates and tune your code to lower them.
I don't think you need too much CPU processing in this case.
Depends on how much data is coming in each second and what you are doing with it. More complex parsing of quickly incoming data (if you decide to do it on the same machine) will eat up CPU pretty quickly.
8 small VMs caused a lot of network errors to occur (there must be some network limit)
The smaller the VMs, the fewer shared resources they get. There are throughput limits, and then there is the issue of neighbors sharing the actual hardware with you. Often, the smaller your instance size, the more trouble you run into.
Is it better to have a lot of small VMs or a few massive VMs?
In my experience, smaller VMs are too crippled. However, your mileage may vary and it all depends on the particular task and its solution implementation. Really, you have to measure yourself in your environment.
What C# classes should we be looking at?
What are the technical best practices when trying to do something like this on Azure?
With high-throughput scraping you should be looking at infrastructure. Latency differs between Azure datacenters, and network latency/sustained throughput differs between VM sizes and depends on who in particular is sharing the hardware with you. The best practice is to try and find what works best for you - change datacenters, change VM sizes and otherwise experiment.
Azure may not be the best solution to this problem (unless you are on a spending spree). 8 small VMs cost about $450 a month. That is enough to pay for an unmanaged dedicated server with 256 GB of RAM, 40 hardware threads and 500 Mbps - 1 Gbps (or even bursts of up to several Gbps) of quality network bandwidth without latency issues.
For your budget, you would have a dedicated server that you cannot overload. You would have more than enough RAM to deal with async pinning (if you decide to go async), or enough hardware threads for multi-threaded synchronous IO, which gives the best throughput (if you choose to go synchronous with a fixed-size thread pool).
On a side note, depending on the API specifics, it might turn out that your main issue will be the API owner simply throttling you down to a crawl when you start to put too much pressure on the API endpoints.
I need to download certain files using FTP. It is already implemented without using threads, and it takes too much time to download all the files.
So I need to use threads to speed up the process.
My code looks like this:
foreach (string str1 in files)
{
    download_FTP(str1);
}
I referred to this, but I don't want every file to be queued at once - say, for example, 5 files at a time.
If the process is too slow, it most likely means that the network/Internet connection is the bottleneck. In that case, downloading the files in parallel won't significantly increase performance.
It might be another story though if you are downloading from different servers. We may then imagine that some of the servers are slower than others. In that case, parallel downloads would increase the overall performance since the program would download files from other servers while being busy with slow downloads.
EDIT: OK, we have more info from you: Single server, many small files.
Downloading multiple files involves some overhead. You can decrease this overhead by grouping the files somehow (tar, zip, whatever) on the server side. Of course, this may not be possible. If your app were talking to a web server, I'd advise creating a zip file on the fly server-side according to the list of files transmitted in the request. But you are on an FTP server, so I'll assume you have nearly no flexibility server-side.
Downloading several files in parallel will probably increase the throughput in your case. Be very careful, though, about restrictions set by the server, such as the maximum number of simultaneous connections. Also, keep in mind that if you have many simultaneous users, you'll end up with a big number of connections on the server: users x threads. Which may prove counter-productive depending on the scalability of the server.
A commonly accepted rule of good behaviour is to limit each user to a maximum of 2 simultaneous connections. YMMV.
Okay, as you're not using .NET 4 that makes it slightly harder - the Task Parallel Library would make it really easy to create five threads reading from a producer/consumer queue. However, it still won't be too hard.
Create a Queue<string> with all the files you want to download
Create 5 threads, each of which has a reference to the queue
Make each thread loop, taking an item off the queue and downloading it, or finishing if the queue is empty
Note that as Queue<T> isn't thread-safe, you'll need to lock to make sure that only one thread tries to fetch an item from the queue at a time:
string fileToDownload = null;
lock (padlock) // padlock and queue are fields shared by all the worker threads
{
    if (queue.Count == 0)
    {
        return; // Done
    }
    fileToDownload = queue.Dequeue();
}
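Putting the pieces together, a minimal sketch of the whole pattern might look like this (assuming files holds the file names and download_FTP is your existing method):

using System.Collections.Generic;
using System.Threading;

Queue<string> queue = new Queue<string>(files);
object padlock = new object();
List<Thread> threads = new List<Thread>();

for (int i = 0; i < 5; i++)
{
    Thread t = new Thread(() =>
    {
        while (true)
        {
            string fileToDownload;
            lock (padlock)
            {
                if (queue.Count == 0)
                {
                    return; // queue drained - this thread is done
                }
                fileToDownload = queue.Dequeue();
            }
            download_FTP(fileToDownload);
        }
    });
    t.Start();
    threads.Add(t);
}

foreach (Thread t in threads)
{
    t.Join(); // wait until every download has finished
}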
As noted elsewhere, threading may not speed things up at all - it depends where the bottleneck is. If the bottleneck is the user's network connection, you won't be able to get more data down the same size of pipe just by using multi-threading. On the other hand, if you have a lot of small files to download from different hosts, then it may be latency rather than bandwidth which is the problem, in which case threading will help.
Look up ParameterizedThreadStart.
var threadsToUse = new List<System.Threading.Thread>();
foreach (string str1 in files)
{
    // download_FTP must take a single object parameter to match ParameterizedThreadStart
    var aThread = new System.Threading.Thread(
        new System.Threading.ParameterizedThreadStart(download_FTP));
    aThread.Start(str1);
    threadsToUse.Add(aThread);
}
Thread.Join will let you block until a thread has finished, so you can wait for all the download threads to complete.
There is also something else you might want to look at, which I'm still trying to fully grasp: asynchronous calls. With these you will know when the file has been downloaded; with a normal thread you'll have to find another way to flag that it's finished.
This may or may not help your speed. If your line speed is low, it won't help you much;
on the other hand, some servers cap each connection to a certain speed, in which case multiple connections should, in theory, give a slight increase in speed. How much of an increase, I cannot say.
Hope this helps in some way
I can add some experience to the comments already posted. In an app some years ago, I had to generate a treeview of files on an FTP server. Listing files does not normally require actual downloading, but some of the files were zipped folders, and I had to download and unzip these (sometimes recursively) to display the files/folders inside. For a multithreaded solution, this required a 'FolderClass' for each folder that could keep state and so handle both unzipped and zipped folders. To start the operation off, one of these was set up with the root folder and submitted to a P-C queue and a pool of threads. As each folder was LISTed and iterated, more FolderClass instances were submitted to the queue for each subfolder. When a FolderClass instance reached the end of its LIST, it PostMessaged itself (it was not C#, for which you would need BeginInvoke or the like) to the UI thread, where its info was added to the listview.
This activity was characterised by a lot of latency-sensitive TCP connect/disconnect with occasional download/unzip.
A pool of, IIRC, 4-6 threads (as already suggested by other posters) provided the best performance on the single-core system I had at the time and, in this particular case, was much faster than a single-threaded solution. I can't remember the figures exactly, but no stopwatch was needed to detect the performance boost - something like 3-4 times faster. On a modern box with multiple cores, where LISTs and unzips could occur concurrently, I would expect even more improvement.
There were some problems - the visual ListView component could not keep up with the incoming messages (because of the multiple threads, data arrived for apparently 'random' positions in the treeview and so required continual tree navigation for display), and so the UI tended to freeze during the operation. Another problem was detecting when the operation had actually finished. These snags are probably not relevant to your download-many-small-files app.
Conclusion - I expect that downloading a lot of small files will be faster if multithreaded with multiple connections, if only from mitigating the connect/disconnect latency, which can be larger than the actual data download time. In the extreme case of a satellite connection with high speed but very high latency, a large thread pool would provide a massive speedup.
Note the valid caveats from the other posters - if the server, (or its admin), disallows or gets annoyed at the multiple connections, you may get no boost, limited bandwidth or a nasty email from the admin!
Rgds,
Martin
I want to build a Windows service that will use a remote encoding service (like encoding.com, zencoder, etc.) to upload video files for encoding, download them after the encoding process is complete, and process them.
In order to do that, I was thinking about having different queues: one for files currently waiting, one for files being uploaded, one for files waiting for encoding to complete, and one more for downloading them. Each queue has a limit; for example, only 5 files can be uploaded for encoding at any one time. The queues have to be visible and able to recover from a crash - currently we do that by writing the queue to an SQL table and managing the number of items in a separate table.
I also want the queues to run in the background, independent of each other, but able to transfer files from one queue to another as the process goes on.
My biggest question mark is about how to build and manage the queues, and less about limiting the number of items in each queue.
I am not sure what is the right approach for this and would really appreciate any help.
Thanks!
You probably don't need to separate the work into separate queues, as long as they are logically separated in some way (tagged with different "job types" or such).
As I see it, the challenge is to not pick up and process more than a given limited number of jobs from the queue, based on the type of job. I had a somewhat similar issue a while ago which led to a question here on SO, and a subsequent blog post with my solution, both of which might give you some ideas.
In short, my solution was to keep a list of "tokens". Whenever I want to perform a job that has some sort of limitation, I first pick up a token; if no tokens are available, I have to wait for one to become available. You can then use whatever queueing mechanism is suitable to handle the queue as such.
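In current C#, SemaphoreSlim is one simple way to express that token idea (a sketch of the concept, not the code from my blog post):

using System.Threading;
using System.Threading.Tasks;

class EncodingUploader
{
    // One semaphore per job type caps concurrency for that type,
    // e.g. at most 5 files being uploaded for encoding at once.
    private static readonly SemaphoreSlim UploadTokens = new SemaphoreSlim(5);

    public static async Task UploadForEncodingAsync(string file)
    {
        await UploadTokens.WaitAsync(); // pick up a token; waits if none are free
        try
        {
            // ... upload the file to the encoding service ...
        }
        finally
        {
            UploadTokens.Release();     // hand the token back
        }
    }
}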
There are various ways to approach this, and which one suits your case depends on reliability and resilience requirements, development cost and maintenance cost. You need to answer questions like: if the server crashes, is it important to carry on from where you were?
The queue can be implemented in MSMQ or SQL Server, or simply in code with all queues in memory. For the workflow you can use Windows Workflow Foundation, or implement it yourself, which would probably be easier, although change would be more difficult.
So if you give a few more hints, I should be able to help you better.
I wrote an app that syncs local folders with online folders, but it eats all my bandwidth. How can I limit the amount of bandwidth the app uses (programmatically)?
Take a look at http://www.codeproject.com/KB/IP/MyDownloader.aspx
He's using the well-known technique that can be found in Downloader.Extension\SpeedLimit.
Basically, before more data is read off the stream, a check is performed on how much data has actually been read since the previous iteration. If that rate exceeds the maximum rate, the read command is suspended for a very short time and the check is repeated. Most applications use this technique.
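In essence, something like this (a simplified sketch of the idea, not the article's exact code; input and output are ordinary streams):

using System.Diagnostics;
using System.IO;
using System.Threading;

static void CopyThrottled(Stream input, Stream output, int maxBytesPerSecond)
{
    var watch = Stopwatch.StartNew();
    var buffer = new byte[4096];
    long totalRead = 0;
    int bytesRead;
    while ((bytesRead = input.Read(buffer, 0, buffer.Length)) > 0)
    {
        output.Write(buffer, 0, bytesRead);
        totalRead += bytesRead;

        // If we're ahead of the allowed rate, pause until we're back under it.
        long expectedMs = totalRead * 1000L / maxBytesPerSecond;
        long pauseMs = expectedMs - watch.ElapsedMilliseconds;
        if (pauseMs > 0)
            Thread.Sleep((int)pauseMs);
    }
}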
Try this: http://www.netlimiter.com/ It's been on my "check this out" list for a long time (though I haven't tried it yet myself).
I'd say "don't". Unless you're doing something really wrong, your program shouldn't be hogging bandwidth. Your router should be balancing the available bandwidth between all requests.
I'd recommend you do the following:
a) Create MD5 hashes of all the files. Compare hashes and/or dates and sizes, and only sync the files that have changed. Unless you're syncing massive files, you shouldn't have to sync a whole lot of data.
b) Limit the sending rate. In your upload thread, read the files in 1-8 KB chunks and then call Thread.Sleep after every chunk to limit the rate. You have to do this on the upload side, however.
c) Pipe everything through a GZip stream (System.IO.Compression); see the sketch below. For text files this can reduce the size of the data that needs to be transferred.
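For (c), a minimal sketch (localPath and destinationStream are placeholders for your file and upload stream):

using System.IO;
using System.IO.Compression;

using (var source = File.OpenRead(localPath))
using (var gzip = new GZipStream(destinationStream, CompressionMode.Compress))
{
    source.CopyTo(gzip); // compressed bytes flow straight into the upload stream
}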
Hope this helps!
What is the best way to determine available bandwidth in .NET?
We have users that access business applications from various remote access points, wired and wireless, and at times the bandwidth can be very low depending on where the user is. When the applications appear to be running slow, the issue could be due to low bandwidth and not some other problem.
I would like to be able to run some kind of service that would warn users whenever the available bandwidth dips below a specific threshold.
Any thoughts?
Not beyond the obvious approach of downloading a file of known size and timing how long it takes. The disadvantage of that is that you'd need to waste a lot of bandwidth to do it. Also, if you want to alert when throughput drops below a threshold, you'll have to run the test more or less continuously.
IMHO, I'd live with poor performance in some locations, given that you can't do anything about it if it does occur.
Sorry.
There's no easy way to measure bandwidth without actually using it - which of course will starve the applications. A couple of points to bear in mind though:
1) Is it actually bandwidth that's the problem, or latency? You can measure latency in a less intrusive manner than bandwidth; see the ping sketch after point 2.
2) Are the applications all run from the same server (or at least the same network)? You may find that users will have a good connection to some areas of the net but not others. (It's likely that the last mile will be the limiting factor, but it's not always the case.)
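For point 1, measuring latency can be as simple as a ping (a sketch; the host name is a placeholder for your application server, which must allow ICMP):

using System;
using System.Net.NetworkInformation;

using (var ping = new Ping())
{
    PingReply reply = ping.Send("appserver.example.com");
    if (reply.Status == IPStatus.Success)
        Console.WriteLine("Round-trip time: {0} ms", reply.RoundtripTime);
}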
If you're transferring data anyway, simply measure it. You could also download a reference object from somewhere else if you want the measurement to be independent of the speed of your own server.
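A sketch of the measuring part (the URL is a placeholder for a reference object of known size):

using System.Diagnostics;
using System.Net.Http;
using System.Threading.Tasks;

static async Task<double> MeasureKBps(string url)
{
    using (var client = new HttpClient())
    {
        var watch = Stopwatch.StartNew();
        byte[] data = await client.GetByteArrayAsync(url);
        watch.Stop();
        return data.Length / 1024.0 / watch.Elapsed.TotalSeconds;
    }
}

// Usage: double kbps = await MeasureKBps("http://example.com/testfile.bin");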
Without knowing the exact nature of your connection or how it's used, there are two options that I am aware of.
MultinetGetConnectionPerformance (http://msdn.microsoft.com/en-us/library/aa385342(VS.85).aspx)
System Event Notification Service (http://msdn.microsoft.com/en-us/library/aa377538(VS.85).aspx)
Neither is a direct .NET class, but both can be used from .NET very easily.
Take a look at both of them and see if they will work for you.
Roy