I have a large number of small tar archives (around 50 KB each). I need the fastest way to extract these files using C#. I don't want to save them to disk, because I have to do some other processing of the content after extracting. I tried to do it on my own, but the processing was not very fast. Could you please advise me on the fastest way to process the tar files?
I need the fastest way to extract these files
I have to do some other processing of the content after extracting
You might want to look into TPL Dataflow. It fits regardless of whether the bottleneck is in the decompressing or in the subsequent processing. Instead of having one big function, you break the work up into multiple connected steps. Essentially, Dataflow creates a pipeline of connected components, one for each step you wish to perform. Each step has complete control over its threading and concurrency.
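To make that concrete, here is a minimal sketch of such a pipeline, assuming a hypothetical ExtractEntries helper (substitute the tar library of your choice, e.g. SharpCompress or System.Formats.Tar) and a placeholder ProcessContent step for your own post-processing:

    using System;
    using System.Collections.Generic;
    using System.IO;
    using System.Threading.Tasks;
    using System.Threading.Tasks.Dataflow;

    class TarPipeline
    {
        static async Task Main()
        {
            // Step 1: read each archive from disk (I/O-bound, modest parallelism).
            var readBlock = new TransformBlock<string, byte[]>(
                path => File.ReadAllBytes(path),
                new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 2 });

            // Step 2: extract the entries in memory (CPU-bound, scale with cores).
            var extractBlock = new TransformManyBlock<byte[], KeyValuePair<string, byte[]>>(
                archive => ExtractEntries(archive),
                new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = Environment.ProcessorCount });

            // Step 3: your post-processing of each extracted file.
            var processBlock = new ActionBlock<KeyValuePair<string, byte[]>>(
                entry => ProcessContent(entry.Key, entry.Value),
                new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = Environment.ProcessorCount });

            var linkOptions = new DataflowLinkOptions { PropagateCompletion = true };
            readBlock.LinkTo(extractBlock, linkOptions);
            extractBlock.LinkTo(processBlock, linkOptions);

            foreach (var path in Directory.EnumerateFiles(@"C:\archives", "*.tar"))
                readBlock.Post(path);

            readBlock.Complete();
            await processBlock.Completion;
        }

        // Hypothetical helpers -- replace with your tar reader and your processing code.
        static IEnumerable<KeyValuePair<string, byte[]>> ExtractEntries(byte[] archive)
        { throw new NotImplementedException(); }

        static void ProcessContent(string name, byte[] content)
        { throw new NotImplementedException(); }
    }

Because each block has its own MaxDegreeOfParallelism, you can tune the I/O stage and the CPU stages independently and measure where the time actually goes.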
More: Dataflow (Task Parallel Library)
I have what I would consider to be a fairly common problem, but have not managed to find a good solution on my own or by browsing this forum.
Problem
I have written a tool to get a file listing of a folder with some additional information such as file name, file path, file size, hash, etc.
The biggest problem that I have is that some of the folders contain millions of files (maybe 50 million in the structure).
Possible Solutions
I have two solutions, but neither of them is ideal.
Every time a file is read, the information is written straight to the output file. This is OK, but it means I can't multi-thread the enumeration without running into issues with threads contending for the lock on that file.
Every time a file is read, the information is added to some form of collection such as a ConcurrentBag. This means I can multi-thread the enumeration of the files and add them to the collection. Once the enumeration is done, I can write the whole collection to a file using File.WriteAllLines; however, adding 50 million entries to the collection makes most machines run out of memory.
Other Options
Is there any way to add items to a collection and then write them to a file when it gets to a certain number of records in the collection or something like that?
I looked into a BlockingCollection, but that will just fill up really quickly, since the producers will be multi-threaded while the consumer would only be single-threaded.
Create a FileStream that is shared by all threads. Before writing to that FileStream, a thread must lock it. FileStream has some internal buffer (4096 bytes if I remember right), so it doesn't actually write to disk on every call. You can wrap it in a BufferedStream if 4096 bytes is still not enough.
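A minimal sketch of that shared-writer idea (the path and the record format are just placeholders):

    using System.IO;

    static class ListingOutput
    {
        static readonly object WriteLock = new object();
        static readonly StreamWriter Output = new StreamWriter(
            new FileStream(@"C:\temp\listing.csv", FileMode.Create, FileAccess.Write, FileShare.Read));

        // Called from the worker threads that produce listing lines.
        public static void WriteRecord(string line)
        {
            lock (WriteLock)        // serialize the writers; the stream's internal buffer means
            {                       // most calls never touch the disk directly
                Output.WriteLine(line);
            }
        }

        // Call once at the end of the run to flush the buffers.
        public static void Close()
        {
            lock (WriteLock) { Output.Dispose(); }
        }
    }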
BlockingCollection is precisely what you need. You can create one with a large buffer and have a single writer thread writing to a file that it keeps open for the duration of the run.
If reading is the dominant operation time-wise the queue will be near empty the whole time and total time will be just slightly more than the read time.
If writing is the dominant operation time-wise the queue will fill up until you reach the limit you set (to prevent out of memory situations) and producers will only advance as the writer advances. The total time will be the time needed to write all the records to a single file sequentially and you cannot do better than that (when writer is the slowest part).
You may be able to get slightly better performance by pipelining through multiple blocking collections, e.g. making the hash calculation (a CPU-bound operation) a separate stage from the read and write operations. If you want to do that, though, consider the TPL Dataflow library.
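A minimal sketch of the bounded BlockingCollection approach described above (the paths and the ComputeHash routine are placeholders for your own code):

    using System;
    using System.Collections.Concurrent;
    using System.IO;
    using System.Threading.Tasks;

    class ListingWriter
    {
        static void Main()
        {
            // The bound caps memory use if the producers ever outrun the writer.
            var queue = new BlockingCollection<string>(boundedCapacity: 100000);

            // Single consumer: keeps the output file open for the duration of the run.
            var writer = Task.Run(() =>
            {
                using (var output = new StreamWriter(@"C:\temp\listing.csv"))
                {
                    foreach (var line in queue.GetConsumingEnumerable())
                        output.WriteLine(line);
                }
            });

            // Multiple producers: enumerate and hash the files in parallel.
            Parallel.ForEach(
                Directory.EnumerateFiles(@"C:\data", "*", SearchOption.AllDirectories),
                path =>
                {
                    var info = new FileInfo(path);
                    queue.Add(info.FullName + "," + info.Length + "," + ComputeHash(path));
                });

            queue.CompleteAdding();   // tell the writer that no more records will arrive
            writer.Wait();
        }

        static string ComputeHash(string path) { throw new NotImplementedException(); }
    }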
In my application I need to continuously write data chunks (around 2 MB each) about every 50 ms to a large file (around 2-7 GB). This is done in a sequential, circular way: I write chunk after chunk into the file, and when I reach the end of the file I start again at the beginning.
Currently I'm doing it as follows:
In C# I call File.OpenWrite once to open the file with read access and set the size of the file with SetLength. When I need to write a chunk, I pass the safe file handle to the unmanaged WriteFile (kernel32.dll). In that call I pass an OVERLAPPED structure to specify the position within the file where the chunk has to be written. The chunk I need to write is stored in unmanaged memory, so I have an IntPtr which I can pass to WriteFile.
Now I'd like to know if and how I can make this process more efficient. Any ideas?
Some questions in detail:
Will changing from file I/O to memory-mapped file help?
Can I include some optimizations for NTFS?
Are there some useful parameters when creating the file that I'm missing? (maybe an unmanaged call with special parameters)
Using better hardware will probably be the most cost-efficient way to increase file-writing efficiency.
There is a paper from Microsoft research that will answer most of your questions: Sequential File Programming Patterns and Performance with .NET and the downloadable source code (C#) if you want to run the tests from the paper on your machine.
In short:
The default behavior provides excellent performance on a single disk.
Unbuffered I/O should be tested if you have a disk array. It could improve write speed by a factor of eight.
This thread on social.msdn might also be of interest.
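For comparison, here is a minimal sketch of the same circular chunk-writing pattern using only a managed FileStream (the ProduceChunk helper is a placeholder; whether this beats the P/Invoke WriteFile approach is something you would have to measure):

    using System;
    using System.IO;

    class CircularWriter
    {
        const long FileSize = 4L * 1024 * 1024 * 1024;   // ~4 GB ring file
        const int ChunkSize = 2 * 1024 * 1024;           // ~2 MB chunks

        static void Main()
        {
            // FileOptions.WriteThrough maps to FILE_FLAG_WRITE_THROUGH (skip the OS write cache).
            using (var fs = new FileStream(@"C:\data\ring.bin", FileMode.Create, FileAccess.Write,
                                           FileShare.Read, 64 * 1024, FileOptions.WriteThrough))
            {
                fs.SetLength(FileSize);
                long position = 0;
                var chunk = new byte[ChunkSize];

                while (ProduceChunk(chunk))                 // placeholder: fill the buffer, return false to stop
                {
                    if (position + ChunkSize > FileSize)    // wrap around at the end of the file
                        position = 0;
                    fs.Seek(position, SeekOrigin.Begin);
                    fs.Write(chunk, 0, chunk.Length);
                    position += ChunkSize;
                }
            }
        }

        static bool ProduceChunk(byte[] buffer) { throw new NotImplementedException(); }
    }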
I was thinking...
Will multithreading I/O operations in C# (let's say copying many files from c:\1\ to c:\2\) perform any differently than doing the operation sequentially?
The reason I am struggling with this is that an I/O operation ultimately goes to one device that has to do the work, so even if I issue the copies in parallel, it will still execute those copy orders sequentially...
Or maybe my assumption is wrong?
In that case, is there any benefit to using a multithreaded copy to:
copy many small files (4 GB in total)
copy 4 big files (4 GB in total, 1000 MB each)
Thanks
As others have said, it has to be measured in the context of the concrete application.
But I would just like to draw attention to this:
Every time you copy a file, write permission on the destination location is checked, which is slow.
All of us have hit the case where you have to copy/paste a sequence of already-compressed files: if you pack them again into one big ZIP file, even though the total compressed size is not significantly smaller than the sum of the contents, the I/O operation runs much faster. (Just try it; you will see a huge difference if you haven't done it before.)
So I would assume (again, it has to be measured on the concrete system; these are just guesses) that writing one big file to a single disk will be faster than writing a lot of small files.
Hope this helps.
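In the spirit of "measure it": a minimal benchmark sketch comparing a sequential copy with a Parallel.ForEach copy (the source and destination paths are placeholders):

    using System;
    using System.Diagnostics;
    using System.IO;
    using System.Threading.Tasks;

    class CopyBenchmark
    {
        static void Main()
        {
            var files = Directory.GetFiles(@"C:\1");

            // Sequential baseline.
            var sw = Stopwatch.StartNew();
            foreach (var file in files)
                File.Copy(file, Path.Combine(@"C:\2", Path.GetFileName(file)), true);
            Console.WriteLine("Sequential: " + sw.Elapsed);

            // Parallel copy with a capped degree of parallelism.
            sw.Restart();
            Parallel.ForEach(
                files,
                new ParallelOptions { MaxDegreeOfParallelism = 4 },
                file => File.Copy(file, Path.Combine(@"C:\3", Path.GetFileName(file)), true));
            Console.WriteLine("Parallel:   " + sw.Elapsed);
        }
    }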
Multithreading with files is not so much about the CPU but about I/O. This means that totally different rules apply. Different devices have different characteristics:
Magnetic disks like sequential IO
SSDs like sequential or parallel random IO (mine has 4 hardware "threads")
The network likes many parallel operations to amortize latency
I'm no expert in hard-disk related questions, but maybe this will shed some light for you:
Windows uses the NTFS file system. This file system doesn't "like" lots of very small files, for example files under 1 KB. It will "magically" make 100 files of 1 KB each take 400 KB on disk instead of 100 KB. It is also VERY slow when dealing with a lot of "small" files. Therefore, copying one big file instead of many small files of the same total size will be much faster.
Also, from my personal experience and knowledge, multithreading will NOT speed up the copying of many files, because the actual hardware disk acts as one unit and can't be sped up by sending many requests at the same time (it will process them one by one).
I am currently working on a research project which involves indexing a large number of files (240k); they are mostly html, xml, doc, xls, zip, rar, pdf, and text, with file sizes ranging from a few KB to more than 100 MB.
With all the zip and rar files extracted, I get a final total of one million files.
I am using Visual Studio 2010, C# and .NET 4.0 with support for TPL Dataflow and Async CTP V3. To extract the text from these files I use Apache Tika (converted with ikvm) and I use Lucene.net 2.9.4 as the indexer. I would like to use the new TPL Dataflow library and asynchronous programming.
I have a few questions:
Would I get performance benefits if I use TPL? It is mainly an I/O process and from what I understand, TPL doesn't offer much benefit when you heavily use I/O.
Would a producer/consumer approach be the best way to deal with this type of file processing, or are there other models that are better? I was thinking of creating one producer with multiple consumers using BlockingCollection.
Would the TPL dataflow library be of any use for this type of process? It seems TPL Dataflow is best used in some sort of messaging system...
Should I use asynchronous programming or stick to synchronous in this case?
async/await definitely helps when dealing with external resources - typically web requests, file system or db operations. The interesting problem here is that you need to fulfill multiple requirements at the same time:
consume as little CPU as possible (this is where async/await will help)
perform multiple operations at the same time, in parallel
control the number of tasks that are started (!); if you do not take this into account, you will likely run out of threads when dealing with many files (a throttling sketch is shown right after this list)
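A minimal sketch (not taken from the project mentioned below) of that kind of throttling with a TPL Dataflow ActionBlock; IndexFileAsync stands in for the Tika + Lucene.net work:

    using System;
    using System.IO;
    using System.Threading.Tasks;
    using System.Threading.Tasks.Dataflow;

    class ThrottledIndexer
    {
        static async Task RunAsync(string root)
        {
            var indexBlock = new ActionBlock<string>(
                path => IndexFileAsync(path),
                new ExecutionDataflowBlockOptions
                {
                    MaxDegreeOfParallelism = 8,   // at most 8 files are indexed concurrently
                    BoundedCapacity = 100         // back-pressure: SendAsync waits while the queue is full
                });

            foreach (var path in Directory.EnumerateFiles(root, "*", SearchOption.AllDirectories))
                await indexBlock.SendAsync(path);

            indexBlock.Complete();
            await indexBlock.Completion;
        }

        static Task IndexFileAsync(string path) { throw new NotImplementedException(); }
    }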
You may take a look at a small project I published on github:
Parallel tree walker
It is able to enumerate any number of files in a directory structure efficiently. You can define the async operation to perform on every file (in your case indexing it) while still controlling the maximum number of files that are processed at the same time.
For example:
    await TreeWalker.WalkAsync(root, new TreeWalkerOptions
    {
        MaxDegreeOfParallelism = 10,
        ProcessElementAsync = async (element) =>
        {
            var el = element as FileSystemElement;
            var path = el.Path;
            var isDirectory = el.IsDirectory;
            await DoStuffAsync(el);
        }
    });
(if you cannot use the tool directly as a dll, you may still find some useful examples in the source code)
You could use Everything Search. The SDK is open source and has a C# example.
It's the fastest way to index files on Windows I've seen.
From FAQ:
1.2 How long will it take to index my files?
"Everything" only uses file and folder names and generally takes a few seconds to build its > database.
A fresh install of Windows XP SP2 (about 20,000 files) will take about 1 second to index.
1,000,000 files will take about 1 minute.
I'm not sure if you can use TPL with it though.
First, I want to apologize for my broken English.
I am building a program that uses a lot of files. I have a lot of foreach loops that loop over the hard disk and those files (at least 200 files, around 600 bytes each on average), and the loops use XPath to search for values in the files (the files are XML files, of course).
I need to find a way to make my program more responsive. I thought of one way, which is the following:
Computer memory is much faster to read from than the hard disk, so I thought maybe I should load those files into memory and then loop over them in memory instead of looping over the hard disk. By the way, if someone can tell me how much faster memory is than a hard disk, thanks.
Thanks in advance.
Din
If someone didn't understand my English, I will try to explain again.
The best approach I can think of is PLINQ in C# 4.0. Group these XML files and query them with LINQ to XML in parallel. The following is a simple example, which loads all the XML files in C:\xmlFolder and selects those documents that contain an element named "key".
    using System.Collections.Generic;
    using System.IO;
    using System.Linq;
    using System.Xml.Linq;

    List<XDocument> xmls = Directory.EnumerateFiles(@"C:\XmlFolder").AsParallel()
        .Select(path => XDocument.Load(path))
        .Where(doc => doc.Descendants().Any(ele => ele.Name.LocalName == "key"))
        .ToList();
You should parse the XML files on a different thread and create objects with the required information; that way you will have instant access to the information.
Define "responsive." Do you mean that you want UI cues to continue to happen, or that you want to continue to be able to do other things in the UI while it's processing the files?
The former is easy: you can just toss the occasional Application.DoEvents() into your loops. This will prompt the UI to perform any cues that are waiting (such as drawing the window, etc.).
The latter is going to involve multi-threading. Diving into that is a bit more complex than can be taught in a paragraph or two, but some Google searches for "c# .net multi threading tutorial" should yield a ton of results. If you're not familiar with the basic concept of what multi-threading provides, I can further explain it.
Use a BackgroundWorker or the ThreadPool to spawn off multiple threads for the I/O, and have them read the data into a Queue (this assumes the total size of your data is not too large). Have another thread (or threads) reading off of that Queue and doing your internal XPath logic to pull whatever you need from those files.
Essentially, think of it as an instance of the Producer/Consumer design pattern, wherein your I/O reader threads are producers, and your XPath logic threads are consumers.
The type of the object in the queue could be just a byte-array, but I'd suggest a custom C# class that contains the byte array, as well as some of the file metadata in case you need it for whatever reason.
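A minimal sketch of that producer/consumer setup (simplified to one reader task; FilePayload, the paths, and the XPath expression are placeholders):

    using System.Collections.Concurrent;
    using System.IO;
    using System.Threading.Tasks;
    using System.Xml.XPath;

    // Hypothetical payload type: the raw bytes plus a little metadata.
    class FilePayload
    {
        public string Path;
        public byte[] Content;
    }

    class XmlScanner
    {
        static void Main()
        {
            var queue = new BlockingCollection<FilePayload>(boundedCapacity: 1000);

            // Producer: read the XML files from disk into memory.
            var producer = Task.Run(() =>
            {
                foreach (var path in Directory.EnumerateFiles(@"C:\xmlFolder", "*.xml"))
                    queue.Add(new FilePayload { Path = path, Content = File.ReadAllBytes(path) });
                queue.CompleteAdding();
            });

            // Consumers: run the XPath logic against the in-memory documents.
            var consumers = new Task[2];
            for (int i = 0; i < consumers.Length; i++)
            {
                consumers[i] = Task.Run(() =>
                {
                    foreach (var payload in queue.GetConsumingEnumerable())
                    {
                        var nav = new XPathDocument(new MemoryStream(payload.Content)).CreateNavigator();
                        var node = nav.SelectSingleNode("//key");   // placeholder XPath expression
                        // ... use node and payload.Path here ...
                    }
                });
            }

            producer.Wait();
            Task.WaitAll(consumers);
        }
    }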
You can use a database for storing the XML files; it will be faster, more secure and more reliable than your current scheme. You can build indexes, concurrent access is enabled, XQuery/XPath is supported, and there are many more pluses.
If you have only XML files, you can consider a native XML database, or if you have other types as well you can consider an XML-enabled DBMS (such as Oracle or DB2).