Dealing with a very large number of files - C#

I am currently working on a research project which involves indexing a large number of files (240k); they are mostly html, xml, doc, xls, zip, rar, pdf, and text, with file sizes ranging from a few KB to more than 100 MB.
With all the zip and rar files extracted, I get a final total of one million files.
I am using Visual Studio 2010, C# and .NET 4.0 with support for TPL Dataflow and Async CTP V3. To extract the text from these files I use Apache Tika (converted with IKVM) and Lucene.Net 2.9.4 as the indexer. I would like to use the new TPL Dataflow library and asynchronous programming.
I have a few questions:
Would I get performance benefits from using TPL? It is mainly an I/O-bound process and, from what I understand, TPL doesn't offer much benefit when the work is heavily I/O-bound.
Would a producer/consumer approach be the best way to deal with this type of file processing, or are there other models that are better? I was thinking of creating one producer with multiple consumers using BlockingCollection.
Would the TPL dataflow library be of any use for this type of process? It seems TPL Dataflow is best used in some sort of messaging system...
Should I use asynchronous programming or stick to synchronous in this case?

async/await definitely helps when dealing with external resources - typically web requests, file system or db operations. The interesting problem here is that you need to fulfill multiple requirements at the same time:
consume as little CPU as possible (this is where async/await will help)
perform multiple operations at the same time, in parallel
control the number of tasks that are started (!) - if you do not take this into account, you will likely run out of threads when dealing with many files.
You may take a look at a small project I published on github:
Parallel tree walker
It is able to enumerate any number of files in a directory structure efficiently. You can define the async operation to perform on every file (in your case indexing it) while still controlling the maximum number of files that are processed at the same time.
For example:
await TreeWalker.WalkAsync(root, new TreeWalkerOptions
{
    MaxDegreeOfParallelism = 10,
    ProcessElementAsync = async (element) =>
    {
        var el = element as FileSystemElement;
        var path = el.Path;
        var isDirectory = el.IsDirectory;

        await DoStuffAsync(el);
    }
});
(if you cannot use the tool directly as a dll, you may still find some useful examples in the source code)

You could use Everything Search. The SDK is open source and has a C# example.
It's the fastest way to index files on Windows I've seen.
From FAQ:
1.2 How long will it take to index my files?
"Everything" only uses file and folder names and generally takes a few seconds to build its > database.
A fresh install of Windows XP SP2 (about 20,000 files) will take about 1 second to index.
1,000,000 files will take about 1 minute.
I'm not sure if you can use TPL with it though.

Related

How to asynchronously download millions of files from a file storage?

Let's assume I have a database managing millions of documents, which are stored on a WebDAV or SMB server that does not support retrieving documents in bulk.
Given a list of (potentially all) document IDs, how do I download the corresponding documents as fast as possible?
Iterating over the list and sequentially downloading them is far too slow.
The two options I see are threads and async downloads.
My gut says that async programming should be preferred to threads, because I'm just waiting for I/O on the client side. But I am rather new to async programming and I don't know how to do it.
I assume that iterating over the whole list and sending an async download request could potentially lead to too many requests in a very short time leading to rejected requests. So how do I throttle this? Is there a best practice way to do this?
Take a look at this: How to limit the amount of concurrent async I/O? Using a SemaphoreSlim, as suggested in the accepted answer, is an easy and quite good solution.
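For example, a minimal sketch of the SemaphoreSlim approach, where DownloadDocumentAsync and documentIds are hypothetical stand-ins for your actual download logic and ID list:

var semaphore = new SemaphoreSlim(10); // at most 10 downloads in flight at once
var tasks = documentIds.Select(async id =>
{
    await semaphore.WaitAsync();
    try
    {
        await DownloadDocumentAsync(id); // hypothetical: your WebDAV/SMB download
    }
    finally
    {
        semaphore.Release();
    }
});
await Task.WhenAll(tasks);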
My personal favorite though for this kind of job is the TPL Dataflow library. You can see here an example of using this library to download pages from the web asynchronously with a configurable level of concurrency, in combination with the HttpClient class. Here is another example.
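For comparison, a minimal Dataflow version of the same throttling - a single ActionBlock with a bounded degree of parallelism, again with DownloadDocumentAsync as a hypothetical stand-in:

var downloader = new ActionBlock<string>(
    id => DownloadDocumentAsync(id),
    new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 10 });

foreach (var id in documentIds)
    downloader.Post(id);

downloader.Complete();
await downloader.Completion;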
I also found this great article explaining 4 different ways to limit the number of concurrent downloads.

Fast extraction of a large number of small tar archives

I have a large number of small tar archives (around 50 KB each). I need the fastest way to extract these files using C#. I don't want to save them to disk, because I have to do some other processing of the content after extracting. I tried to do it on my own but the processing was not very fast. Could you please advise me on the fastest way to process the tar files?
I need the fastest way how to extract these files
I have to do some other processing of the content after extracting
You might want to look into TPL Dataflow. It is a good fit irrespective of whether the bottleneck is the decompression or the subsequent processing. Instead of having one big function, you break the work up into multiple connected steps: essentially, Dataflow creates a pipeline of connected components, one for each step you wish to perform, and each step has full control over its own threading and concurrency. See the sketch after the link below.
More
Dataflow (Task Parallel Library)
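A minimal sketch of such a pipeline, assuming hypothetical ExtractTar and ProcessContent helpers (and an ExtractedEntry type) standing in for the actual decompression and follow-up work:

// Extraction step: CPU-bound decompression, several archives at a time.
var extract = new TransformBlock<byte[], ExtractedEntry[]>(
    archiveBytes => ExtractTar(archiveBytes), // hypothetical helper
    new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 4 });

// Processing step: whatever you need to do with the extracted content.
var process = new ActionBlock<ExtractedEntry[]>(
    entries => ProcessContent(entries), // hypothetical helper
    new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 2 });

extract.LinkTo(process, new DataflowLinkOptions { PropagateCompletion = true });

foreach (var archive in archives) // archives: your tar files as byte arrays
    extract.Post(archive);

extract.Complete();
process.Completion.Wait(); // block until the whole pipeline has drained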

Running a CPU/memory-intensive task - what coding approach is the most performant?

As we all know, in software development we can be asked to do very ambitious things with technology.
Recently I was asked about the quickest possible way to convert 4000 documents from word to pdf. The code/software to do the conversion is in place, and it runs on a dedicated server, so the hardware is also there (this is a recurring task). But from a C# performance perspective, what is the best way to do this?
I keep thinking along the lines of breaking this up into chunks (i.e. 40 documents) and converting them (i.e. 40 unique documents x 1000 parallel tasks) at the same time. Is this the right idea, performance-wise? The simplest (and longest) approach is a serial loop that goes through each doc.
What would you recommend? There are no language constraints so C# 4.0, LINQ etc is all available.
1000 parallel tasks? You want to run 1,000 threads concurrently? You'll spend more time thread switching than doing actual work. If you have a quad-core machine, you should run four threads, each of which is converting a single document at a time.
Probably the best way to start is to use a simple Parallel.ForEach, and let the runtime library worry about scheduling the tasks. Something like:
List<string> DocumentsToConvert = new List<string>();
// here, load the file names of all the documents you want to convert.
// Then, process them with:
Parallel.ForEach(DocumentsToConvert, (doc) => { ConvertDocument(doc); });
You could do the same type of thing with the TPL and tasks:
var tasks = new List<Task>();
foreach (var doc in DocumentsToConvert)
{
    // Create and start a task to convert that document
    var d = doc; // local copy: in C# 4 the foreach variable is shared across closures
    tasks.Add(Task.Factory.StartNew(() => ConvertDocument(d)));
}
Task.WaitAll(tasks.ToArray());
In either case, you let the runtime library figure out how many tasks to execute in parallel.
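If you do want to cap the parallelism yourself - say, at the four threads suggested above - Parallel.ForEach accepts a ParallelOptions; a minimal sketch:

Parallel.ForEach(
    DocumentsToConvert,
    new ParallelOptions { MaxDegreeOfParallelism = 4 }, // e.g. one per core
    (doc) => { ConvertDocument(doc); });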
Take three documents and process them sequentially. Take the average time and multiply it by the total number of documents; if, say, each takes 2 seconds, 4000 documents is roughly 2.2 hours of serial run time. If that time is reasonable, stop coding and publish to the server. You have just saved the company development costs at your development rate; the time saved might actually be worth less than the week's worth of coding it would take you to shave 30 minutes off a server run.
Otherwise, begin to look into parallel programming with .NET 4, test on 30 documents, and do similar calculations as above to see whether the time is reasonable. If it is, stop coding and publish to the server.
If that time is not reasonable, then discuss using more servers to split up the work further.
HTH

How to download 5 files at a time using threads in .NET Framework 3.5

I need to download certain files using FTP. It is already implemented without using threads, and it takes too much time to download all the files.
So I need to use threads to speed up the process.
My code is like:
foreach (string str1 in files)
{
    download_FTP(str1);
}
I referred to this, but I don't want every file to be queued at once; say, 5 files at a time.
If the process is too slow, it means most likely that the network/Internet connection is the bottleneck. In that case, downloading the files in parallel won't significantly increase the performance.
It might be another story though if you are downloading from different servers. We may then imagine that some of the servers are slower than others. In that case, parallel downloads would increase the overall performance since the program would download files from other servers while being busy with slow downloads.
EDIT: OK, we have more info from you: Single server, many small files.
Downloading multiple files involves some overhead. You can decrease this overhead by somehow grouping the files (tar, zip, whatever) on the server side. Of course, this may not be possible. If your app talked to a web server, I'd advise creating a zip file on the fly server-side according to the list of files transmitted in the request. But you are on an FTP server, so I'll assume you have nearly no flexibility server-side.
Downloading several files in parallel will probably increase the throughput in your case. Be very careful though about restrictions set by the server, such as the maximum number of simultaneous connections. Also, keep in mind that if you have many simultaneous users, you'll end up with a large number of connections on the server: users x threads. This may prove counter-productive depending on the scalability of the server.
A commonly accepted rule of good behaviour is to limit yourself to at most 2 simultaneous connections per user. YMMV.
Okay, as you're not using .NET 4 that makes it slightly harder - the Task Parallel Library would make it really easy to create five threads reading from a producer/consumer queue. However, it still won't be too hard.
Create a Queue<string> with all the files you want to download
Create 5 threads, each of which has a reference to the queue
Make each thread loop, taking an item off the queue and downloading it, or finishing if the queue is empty
Note that as Queue<T> isn't thread-safe, you'll need to lock to make sure that only one thread tries to fetch an item from the queue at a time:
string fileToDownload = null;
lock (padlock)
{
    if (queue.Count == 0)
    {
        return; // Done
    }
    fileToDownload = queue.Dequeue();
}
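Putting it together, here is a minimal sketch of the whole setup, assuming the download_FTP(string) method from the question does the actual work:

Queue<string> queue = new Queue<string>(files);
object padlock = new object();
List<Thread> workers = new List<Thread>();

for (int i = 0; i < 5; i++) // 5 downloads at a time
{
    Thread worker = new Thread(() =>
    {
        while (true)
        {
            string fileToDownload;
            lock (padlock)
            {
                if (queue.Count == 0)
                {
                    return; // queue drained: this worker is done
                }
                fileToDownload = queue.Dequeue();
            }
            download_FTP(fileToDownload);
        }
    });
    workers.Add(worker);
    worker.Start();
}

foreach (Thread worker in workers)
{
    worker.Join(); // wait until every download has finished
}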
As noted elsewhere, threading may not speed things up at all - it depends where the bottleneck is. If the bottleneck is the user's network connection, you won't be able to get more data down the same size of pipe just by using multi-threading. On the other hand, if you have a lot of small files to download from different hosts, then it may be latency rather than bandwidth which is the problem, in which case threading will help.
Look up ParameterizedThreadStart and System.Threading.Thread:
List<System.Threading.Thread> threadsToUse = new List<System.Threading.Thread>();
foreach (string str1 in files)
{
    // download_FTP must take a single object parameter to match ParameterizedThreadStart
    System.Threading.Thread aThread = new System.Threading.Thread(new System.Threading.ParameterizedThreadStart(download_FTP));
    threadsToUse.Add(aThread);
    aThread.Start(str1); // str1 is passed to download_FTP as the thread's argument
}
(Note that this starts one thread per file rather than limiting it to 5 at a time.)
I remember something about Thread.Join that can make all threads respond to one start statement, due to it being a delegate.
There is also something else you might want to look into, which I'm still trying to fully grasp: asynchronous threads. With these you will know when the file has been downloaded; with a normal thread you're going to have to find another way to flag that it's finished.
This may or may not help your speed: if your line speed is low, then it won't help you much.
On the other hand, some servers cap each connection to a certain speed, in which case this will in theory set up multiple connections to the server and therefore give a slight increase in speed. How much of an increase, though, I cannot answer.
Hope this helps in some way
I can add some experience to the comments already posted. In an app some years ago I had to generate a treeview of files on an FTP server. Listing files does not normally require actual downloading, but some of the files were zipped folders and I had to download these and unzip them (sometimes recursively) to display the files/folders inside. For a multithreaded solution, this required a 'FolderClass' for each folder that could keep state and so handle both unzipped and zipped folders. To start the operation off, one of these was set up with the root folder and submitted to a P-C queue and a pool of threads. As the folder was LISTed and iterated, more FolderClass instances were submitted to the queue for each subfolder. When a FolderClass instance reached the end of its LIST, it PostMessaged itself (it was not C#, for which you would need BeginInvoke or the like) to the UI thread, where its info was added to the listview.
This activity was characterised by a lot of latency-sensitive TCP connect/disconnect with occasional download/unzip.
A pool of, IIRC, 4-6 threads (as already suggested by other posters) provided the best performance on the single-core system I had at the time and, in this particular case, was much faster than a single-threaded solution. I can't remember the figures exactly, but no stopwatch was needed to detect the performance boost - something like 3-4 times faster. On a modern box with multiple cores where LISTs and unzips could occur concurrently, I would expect even more improvement.
There were some problems - the visual ListView component could not keep up with the incoming messages (because of the multiple threads, data arrived for apparently 'random' positions on the treeview and so required continual tree navigation for display), and so the UI tended to freeze during the operation. Another problem was detecting when the operation had actually finished. These snags are probably not relevant to your download-many-small-files app.
Conclusion - I expect that downloading a lot of small files is going to be faster if multithreaded with multiple connections, if only from mitigating the connect/disconnect latency which can be larger than the actual data download time. In the extreme case of a satellite connection with high speed but very high latency, a large thread pool would provide a massive speedup.
Note the valid caveats from the other posters - if the server, (or its admin), disallows or gets annoyed at the multiple connections, you may get no boost, limited bandwidth or a nasty email from the admin!
Rgds,
Martin

How can I make my program more responsive? (program that loads at least 200 files) - I might have one idea

First, I want to say sorry for my poor English.
I am building a program that uses a lot of files. I have a lot of foreach loops that loop through the hard disk and those files (at least 200 files, around 600 bytes each on average), and the loops use XPath to search for values in the files (the files are XML files, of course).
I need to find a way to make my program more responsive. I thought of one approach: a computer's memory is much faster to read from than its hard disk, so maybe I should load those files into memory and then loop over the memory instead of looping over the hard disk. By the way, if someone can tell me how much faster a computer's memory is than its hard disk, thanks.
Thanks in advance.
Din
If someone didn't understand my English, I will try to explain again.
The best approach I can think of is PLINQ in C# 4.0. Group these XML files and query them with LINQ-to-XML in parallel. The following is a simple example, which loads all XML files in C:\XmlFolder and selects those documents that contain an element named "key".
List<XDocument> xmls = Directory.EnumerateFiles(@"C:\XmlFolder").AsParallel()
                                .Select(path => XDocument.Load(path))
                                .Where(doc => doc.Descendants()
                                                 .Any(ele => ele.Name.LocalName == "key"))
                                .ToList();
You should parse the XML files in a different thread and create objects with the required information; this way you will have instant access to the information.
Define "responsive." Do you mean that you want UI cues to continue to happen, or that you want to continue to be able to do other things in the UI while it's processing the files?
The former is easy, you can just toss in the occasional Application.DoEvents() in your loops. This will prompt the UI to perform any cues that are waiting (such as draw the window, etc.).
The latter is going to involve multi-threading. Diving into that is a bit more complex than can be taught in a paragraph or two, but some Google searches for "c# .net multi threading tutorial" should yield a ton of results. If you're not familiar with the basic concept of what multi-threading provides, I can further explain it.
Use a BackgroundWorker or the ThreadPool to spawn off multiple threads for the I/O, and have them read the data into a Queue (this is assuming the total size of your data is not too large). Have another thread (or threads) read off of that Queue and do your internal XPath logic to pull whatever you need from those files.
Essentially, think of it as an instance of the Producer/Consumer design pattern, wherein your I/O reader threads are producers, and your XPath logic threads are consumers.
The type of the object in the queue could be just a byte-array, but I'd suggest a custom C# class that contains the byte array, as well as some of the file metadata in case you need it for whatever reason.
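A minimal sketch of that shape, using a BlockingCollection (available from .NET 4) in place of a hand-locked Queue, and a hypothetical FileWorkItem class carrying the bytes plus metadata:

// Needs System.Collections.Concurrent, System.IO, System.Threading and System.Xml.
// Hypothetical payload type: raw bytes plus some file metadata.
class FileWorkItem
{
    public string Path;
    public byte[] Content;
}

var queue = new BlockingCollection<FileWorkItem>(100); // bounded, to limit memory use

// Producer thread: I/O only - read each file into memory.
var producer = new Thread(() =>
{
    foreach (var path in Directory.GetFiles(@"C:\XmlFolder", "*.xml"))
        queue.Add(new FileWorkItem { Path = path, Content = File.ReadAllBytes(path) });
    queue.CompleteAdding(); // signal the consumer that no more items are coming
});

// Consumer thread: run the XPath logic against the in-memory copies.
var consumer = new Thread(() =>
{
    foreach (var item in queue.GetConsumingEnumerable())
    {
        var doc = new XmlDocument();
        doc.Load(new MemoryStream(item.Content));
        var node = doc.SelectSingleNode("//key"); // your XPath here
        // ... use node and item.Path ...
    }
});

producer.Start();
consumer.Start();
producer.Join();
consumer.Join();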
You can use a database for storing the XML files; it will be faster, more secure and more reliable than your current scheme. You can build indexes, concurrent access is supported, XQuery/XPath is supported, and there are many more advantages.
If you have only XML files, you can consider native XML databases; if you have other types as well, you can consider XML-enabled DBMSs (such as Oracle or DB2).
