I've already built a recursive function to get the directory size of a folder path. It works; however, with the growing number of directories I have to search through (and the number of files in each respective folder), this is a very slow, inefficient method.
static string GetDirectorySize(string parentDir)
{
long totalFileSize = 0;
string[] dirFiles = Directory.GetFiles(parentDir, "*.*",
System.IO.SearchOption.AllDirectories);
foreach (string fileName in dirFiles)
{
// Use FileInfo to get length of each file.
FileInfo info = new FileInfo(fileName);
totalFileSize = totalFileSize + info.Length;
}
return String.Format(new FileSizeFormatProvider(), "{0:fs}", totalFileSize);
}
This searches all subdirectories under the argument path, so the dirFiles array gets quite large. Is there a better method to accomplish this? I've searched around but haven't found anything yet.
Another idea that crossed my mind was putting the results in a cache and, when the function is called again, trying to find the differences and only re-searching the folders that have changed. I'm not sure whether that's a good idea either...
You are first scanning the tree to get a list of all files. Then you are reopening every file to get its size. This amounts to scanning twice.
I suggest you use DirectoryInfo.GetFiles which will hand you FileInfo objects directly. These objects are pre-filled with their length.
In .NET 4 you can also use the EnumerateFiles method, which will return a lazy IEnumerable<FileInfo>.
The one-liner below is more cryptic, but it took about 2 seconds for 10k executions.
public static long GetDirectorySize(string parentDirectory)
{
return new DirectoryInfo(parentDirectory).GetFiles("*.*", SearchOption.AllDirectories).Sum(file => file.Length);
}
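For completeness, a .NET 4 EnumerateFiles-based variant might look like this (a sketch: same result, but the files are enumerated lazily instead of being materialized into an array first):
public static long GetDirectorySize(string parentDirectory)
{
    // EnumerateFiles streams FileInfo objects instead of building a full array up front.
    return new DirectoryInfo(parentDirectory)
        .EnumerateFiles("*", SearchOption.AllDirectories)
        .Sum(file => file.Length);
}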
Try
DirectoryInfo DirInfo = new DirectoryInfo(@"C:\DataLoad\");
Stopwatch sw = new Stopwatch();
try
{
sw.Start();
Int64 ttl = 0;
Int32 fileCount = 0;
foreach (FileInfo fi in DirInfo.EnumerateFiles("*", SearchOption.AllDirectories))
{
ttl += fi.Length;
fileCount++;
}
sw.Stop();
Debug.WriteLine(sw.ElapsedMilliseconds.ToString() + " " + fileCount.ToString());
}
catch (Exception Ex)
{
Debug.WriteLine(Ex.ToString());
}
This did 700,000 files in 70 seconds on a non-RAID P4 desktop.
So roughly 10,000 files a second; a server-class machine should easily manage 100,000+ per second.
As usr (+1) said, EnumerateFiles hands you FileInfo objects that are pre-filled with their length.
You can speed up your function a little by using EnumerateFiles() instead of GetFiles(). At least you won't load the full list into memory.
If that's not enough, you should make your function more complex using threads (one thread per directory is too much, but there is no general rule).
You could use a fixed number of threads that take directories from a queue; each thread calculates the size of a directory and adds it to the total. Something like:
Get the list of all directories (not files).
Create N threads (one per core, for example).
Each thread takes a directory from the queue and calculates its size.
If there is no other directory in the queue, the thread ends.
If there is another directory in the queue, it calculates its size, and so on.
The function finishes when all threads terminate.
You can improve the algorithm a lot by spreading the directory search across all the threads (for example, when a thread parses a directory it adds its subfolders to the queue). It's up to you to make it more complicated if you find it's still too slow (this task was used by Microsoft as an example for the new Task Parallel Library).
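A minimal sketch of that queue-based approach (the names and the number of workers are illustrative, not a drop-in implementation):
using System;
using System.Collections.Concurrent;
using System.IO;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

static long GetDirectorySizeParallel(string parentDir)
{
    // Queue of directories; each worker sums the files directly inside one directory at a time.
    var queue = new ConcurrentQueue<string>(
        Directory.EnumerateDirectories(parentDir, "*", SearchOption.AllDirectories)
                 .Concat(new[] { parentDir }));
    long total = 0;
    var workers = Enumerable.Range(0, Environment.ProcessorCount)
        .Select(_ => Task.Factory.StartNew(() =>
        {
            string dir;
            while (queue.TryDequeue(out dir))
            {
                long localSize = new DirectoryInfo(dir)
                    .EnumerateFiles("*", SearchOption.TopDirectoryOnly)
                    .Sum(f => f.Length);
                Interlocked.Add(ref total, localSize);
            }
        }))
        .ToArray();
    Task.WaitAll(workers);
    return total;
}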
long length = Directory.GetFiles(@"MainFolderPath", "*", SearchOption.AllDirectories).Sum(t => (new FileInfo(t).Length));
I'm using a vendor-provided C++ DLL, that I call with DLLImport, to parse and process files containing many object types.
I need to have a correlation between number of objects in the file and memory usage, in order to (hopefully) be able to prevent OutOfMemoryExceptions that happen sometimes.
Update
To be more clear on what I'm trying to measure and why : the out of memory exception is expected, because some very complex files take up to 7gb of memory to load (as measured by perfmon): they are 3D maps of sometimes huge and intricate buildings, from the walls down to the individual screws and bolts, including the trees outside and the tables and chairs in each room.
And since the DLL can load multiple maps in parallel (it's on a web server and the process is shared), loading 2x 7gb files understandably triggers an OutOfMemoryException on a machine with 8gb of RAM.
However, 7gb is pretty rare, most of the maps take up about 500mb, and some take 1 to 2gb.
What we really need is not to find a memory leak (yet...), but be able to know before loading the file how much memory it will probably use. So when a user tries to load a file that we calculate will probably take about 2gb of RAM while the machine has 1gb free, we do something about it; from spinning up a new VM in Azure to preventing the user from working, we don't know what yet, but we can't let the DLL crash the whole server down each time.
And in order to do that, I want to find out, for instance, that "the DLL uses 1 MB of memory for every 100 geometry objects".
So I have a bunch of files to test (about a hundred), and I want to load them up in order, measure the memory usage of the native DLL (before and after), unload the file, process the next. Then I get a nice CSV file with all the data.
I have tried System.Diagnostics.Process.GetCurrentProcess().VirtualMemorySize64, but it only gives me the current process memory, and the DLL doesn't seem to live in the current process, since most measures give me 0 bytes (difference between before and after the file load).
I also have tried GC.GetTotalMemory(), but it's not much better: the files all seemingly measure exactly 1080 bytes.
private static void MeasureFilesMemoryUsage(string[] files) {
foreach (var file in files) {
var beforeLoad = MeasureMemoryUsage();
wrapper.LoadFile(file);
var afterLoad = MeasureMemoryUsage();
wrapper.Unload();
// save beforeLoad and afterLoad
}
}
private static long MeasureMemoryUsage() {
GC.Collect();
GC.WaitForPendingFinalizers();
GC.Collect();
return System.Diagnostics.Process.GetCurrentProcess().VirtualMemorySize64;
}
I know about tools like VMMap or Red Gate ANTS Memory Profiler (or simply performance counters), but these do not allow me to match the memory usage with a specific loaded file; I would have to load the files one by one, pause the program, take a measurement in the tool, and write down the results. Not something I want to do for 100 files.
How do I measure memory usage of a specific C++ DLL from .Net code?
After reading @HansPassant's comments, I have split my test into two programs: one that loads the files, and one that reads the memory measures of the first.
Here they are, cleaned up to remove other measures (like the number of items in my JSON files) and results saving.
The "measures" program:
public static void Main(string[] args) {
foreach (var document in Directory.EnumerateDirectories(JsonFolder)) {
MeasureMemory(document);
}
}
private static void MeasureMemory(string document) {
// run process
var proc = new Process {
StartInfo = new ProcessStartInfo {
FileName = "loader.exe",
Arguments = document,
WindowStyle = ProcessWindowStyle.Hidden,
UseShellExecute = false,
RedirectStandardOutput = true,
CreateNoWindow = true
}
};
proc.Start();
// get process output
var output = string.Empty;
while (!proc.StandardOutput.EndOfStream) {
output += proc.StandardOutput.ReadLine() + "\n";
}
proc.WaitForExit();
// parse process output
var processMemoryBeforeLoad = long.Parse(Regex.Match(output, "BEFORE ([\\d]+)", RegexOptions.Multiline).Groups[1].Value);
var processMemoryAfterLoad = long.Parse(Regex.Match(output, "AFTER ([\\d]+)", RegexOptions.Multiline).Groups[1].Value);
// save the measures in a CSV file
}
And the "loader" program:
public static int Main(string[] args) {
var document = args[0];
var files = Directory.EnumerateFiles(document);
Console.WriteLine("BEFORE {0}", MeasureMemoryUsage());
wrapper.LoadFiles(files);
Console.WriteLine("AFTER {0}", MeasureMemoryUsage());
wrapper.Unload();
return 0;
}
private static long MeasureMemoryUsage() {
// make sure GC has done its job
GC.Collect();
GC.WaitForPendingFinalizers();
GC.Collect();
return System.Diagnostics.Process.GetCurrentProcess().VirtualMemorySize64;
}
I have a windows form application that currently does the following:
1) point to a directory and do 2) for all the XML files in there (usually a maximum of 25 files, ranging from 10 MB up to 5 GB - uncommon but possible)
2) read/write the XML to alter some of the existing XML attributes (currently I use a single BackgroundWorker for that)
3) write the altered XML attributes directly to a NEW file in a different directory
The little app works fine, but it takes far too long to finish (about 20 minutes, depending on the total size in GB).
What I casually tried was starting the main read/write method in a Parallel.ForEach(), but it unsurprisingly blocked itself out and exited.
My idea would be to parallelize the read/write process by starting it on all ~25 files at the same time. Is this wise? How can I do it (TPL?) without locking myself out?
PS: I have quite a powerful desktop PC, with a 1 TB Samsung Pro SSD, 16 GB of RAM, and an Intel Core i7.
You can use a ThreadPool for this approach.
You can have a pool sized for your batch of ~25 files.
Because you have a Core i7, you should use TaskFactory.StartNew.
In this case, you should encapsulate the per-file processing code in a small class, for example XMLProcessor.
Then, with TaskFactory.StartNew, you can use multithreading for the XML processing, as sketched below.
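A rough sketch of that idea (XMLProcessor is a placeholder for your per-file read/alter/write logic, and inputDir/outputDir are illustrative names):
using System.IO;
using System.Linq;
using System.Threading.Tasks;

// Hypothetical wrapper for the per-file read/alter/write logic.
public class XMLProcessor
{
    public void Process(string inputFile, string outputDir)
    {
        // read the XML, alter the attributes, write the result to a new file in outputDir
    }
}

public static class XmlBatchRunner
{
    public static void ProcessAll(string inputDir, string outputDir)
    {
        var processor = new XMLProcessor();
        // One task per file; the scheduler decides how many run concurrently.
        Task[] tasks = Directory.EnumerateFiles(inputDir, "*.xml")
            .Select(f => Task.Factory.StartNew(() => processor.Process(f, outputDir)))
            .ToArray();
        Task.WaitAll(tasks);
    }
}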
This sounds like a job for data parallelism via PLINQ + asynchronous lambdas.
I recently needed to process data from a zip archive that itself contained 5,200 zip archives which then each contained one or more data files in XML or CSV format. In total, between 40-60 GB of data when decompressed and read into memory.
The algorithm browses through this data, makes decisions based on what it finds in conjunction with supplied predicates, and finally writes the selections to disk as 1.0-1.5 GB files. Using an async PLINQ pattern with 32 processors, the average run time for each output file was 4.23 minutes.
After implementing the straightforward solution with async PLINQ, I spent some time trying to improve the running time by digging down into the TPL and TPL Dataflow libraries. In the end, attempting to beat async PLINQ proved to be a fun but ultimately fruitless exercise for my needs. The performance margins from the more "optimized" solutions were not worth the added complexity.
Below is an example of the async PLINQ pattern. The initial collection is an array of file paths.
In the first step, each file path is asynchronously read into memory and parsed, the file name is cached as a root-level attribute, and the element is streamed to the next function.
In the last step, each XElement is asynchronously written to a new file.
I recommend that you play around with the lambda that reads the files. In my case, I found that reading via an async lambda gave me better throughput while decompressing files in memory.
However, for simple XML documents, you may be better off replacing the first async lambda with a method call to XElement.Load(string file) and letting PLINQ read as needed.
using System;
using System.IO;
using System.Linq;
using System.Xml.Linq;
namespace AsyncPlinqExample
{
public class Program
{
public static void Main(string[] args)
{
// Limit parallelism here if needed
int degreeOfParallelism = Environment.ProcessorCount;
string resultDirectory = "[result directory path here]";
string[] files = Directory.GetFiles("[directory with files here]");
files.AsParallel()
.WithDegreeOfParallelism(degreeOfParallelism)
.Select(
async x =>
{
using (StreamReader reader = new StreamReader(x))
{
XElement root = XElement.Parse(await reader.ReadToEndAsync());
root.SetAttributeValue("fileName", Path.GetFileName(x));
return root;
}
})
.Select(x => x.Result)
.Select(
x =>
{
// Perform other manipulations here
return x;
})
.Select(
async x =>
{
string resultPath =
Path.Combine(
resultDirectory,
(string) x.Attribute("fileName"));
await Console.Out.WriteLineAsync($"{DateTime.Now}: Starting {(string) x.Attribute("fileName")}.");
using (StreamWriter writer = new StreamWriter(resultPath))
{
await writer.WriteAsync(x.ToString());
}
await Console.Out.WriteLineAsync($"{DateTime.Now}: Completed {(string)x.Attribute("fileName")}.");
})
.ForAll(task => task.Wait()); // execute the query and wait for every file write to finish
}
}
}
Is it possible to get the size of a file in C# without using System.IO.FileInfo at all?
I know that you can get other things like Name and Extension by using Path.GetFileName(yourFilePath) and Path.GetExtension(yourFilePath) respectively, but apparently not file size? Is there another way I can get file size without using System.IO.FileInfo?
The only reason for this is that, if I'm correct, FileInfo grabs more info than I really need, therefore it takes longer to gather all those FileInfo's if the only thing I need is the size of the file. Is there a faster way?
I performed a benchmark using these two methods:
public static uint GetFileSizeA(string filename)
{
WIN32_FIND_DATA findData;
FindFirstFile(filename, out findData);
return findData.nFileSizeLow;
}
public static uint GetFileSizeB(string filename)
{
IntPtr handle = CreateFile(
filename,
FileAccess.Read,
FileShare.Read,
IntPtr.Zero,
FileMode.Open,
FileAttributes.ReadOnly,
IntPtr.Zero);
long fileSize;
GetFileSizeEx(handle, out fileSize);
CloseHandle(handle);
return (uint) fileSize;
}
Running against a bit over 2300 files, GetFileSizeA took 62-63ms to run. GetFileSizeB took over 18 seconds.
Unless someone sees something I'm doing wrong, I think the answer is clear as to which method is faster.
Is there a way I can refrain from actually opening the file?
Update
Changing FileAttributes.ReadOnly to FileAttributes.Normal reduced the timing so that the two methods were identical in performance.
Furthermore, if you skip the CloseHandle() call, the GetFileSizeEx method becomes about 20-30% faster, though I don't know that I'd recommend that.
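For reference, the P/Invoke declarations the benchmark above assumes look roughly like this (standard Win32 signatures as commonly declared on pinvoke.net; they belong in the same class as GetFileSizeA/GetFileSizeB):
using System;
using System.IO;
using System.Runtime.InteropServices;

[StructLayout(LayoutKind.Sequential, CharSet = CharSet.Unicode)]
struct WIN32_FIND_DATA
{
    public FileAttributes dwFileAttributes;
    public System.Runtime.InteropServices.ComTypes.FILETIME ftCreationTime;
    public System.Runtime.InteropServices.ComTypes.FILETIME ftLastAccessTime;
    public System.Runtime.InteropServices.ComTypes.FILETIME ftLastWriteTime;
    public uint nFileSizeHigh;
    public uint nFileSizeLow;
    public uint dwReserved0;
    public uint dwReserved1;
    [MarshalAs(UnmanagedType.ByValTStr, SizeConst = 260)]
    public string cFileName;
    [MarshalAs(UnmanagedType.ByValTStr, SizeConst = 14)]
    public string cAlternateFileName;
}

[DllImport("kernel32.dll", CharSet = CharSet.Unicode)]
static extern IntPtr FindFirstFile(string lpFileName, out WIN32_FIND_DATA lpFindFileData);

[DllImport("kernel32.dll", CharSet = CharSet.Unicode, SetLastError = true)]
static extern IntPtr CreateFile(string lpFileName, FileAccess dwDesiredAccess,
    FileShare dwShareMode, IntPtr lpSecurityAttributes, FileMode dwCreationDisposition,
    FileAttributes dwFlagsAndAttributes, IntPtr hTemplateFile);

[DllImport("kernel32.dll")]
static extern bool GetFileSizeEx(IntPtr hFile, out long lpFileSize);

[DllImport("kernel32.dll", SetLastError = true)]
static extern bool CloseHandle(IntPtr hObject);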
From a short test I did, I've found that using a FileStream is just 1 millisecond slower on average than using Pete's GetFileSizeB (it took me about 21 milliseconds over a network share...). Personally, I prefer staying within the BCL limits whenever I can.
The code is simple:
using (var file = new FileStream(path, FileMode.Open, FileAccess.Read, FileShare.ReadWrite))
{
return file.Length;
}
As per this comment:
I have a small application that gathers the size info and saves it into an array... but I often have half a million files, give or take and that takes a while to go through all of those files (I'm using FileInfo). I was just wondering if there was a faster way...
Since you're finding the length of so many files you're much more likely to benefit from parallelization than from trying to get the file size through another method. The FileInfo class should be good enough, and any improvements are likely to be small.
Parallelizing the file size requests, on the other hand, has the potential for significant improvements in speed. (Note that the degree of improvement will be largely based on your disk drive, not your processor, so results can vary greatly.)
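A minimal sketch of that, assuming the file paths are already collected in an array:
using System.IO;
using System.Linq;

static long[] GetSizes(string[] filePaths)
{
    // PLINQ issues the FileInfo.Length lookups in parallel; the disk, not the CPU,
    // is usually the limiting factor, so results will vary by drive.
    return filePaths.AsParallel()
                    .Select(path => new FileInfo(path).Length)
                    .ToArray();
}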
Not a direct answer...because I am not sure there is a faster way using the .NET framework.
Here's the code I am using:
List<long> list = new List<long>();
DirectoryInfo di = new DirectoryInfo("C:\\Program Files");
FileInfo[] fiArray = di.GetFiles("*", SearchOption.AllDirectories);
foreach (FileInfo f in fiArray)
list.Add(f.Length);
Running that, it took 2709ms to run on my "Program Files" directory, which was around 22720 files. That's no slouch by any means. Furthermore, when I put *.txt as a filter for the first parameter of the GetFiles method, it cut the time down drastically to 461ms.
A lot of this will depend on how fast your hard drive is, but I really don't think that FileInfo is killing performance.
NOTE: I think this is only valid for .NET 4+.
A quick'n'dirty solution if you want to do this on the .NET Core or Mono runtimes on non-Windows hosts:
Include the Mono.Posix.NETStandard NuGet package, then something like this...
using Mono.Unix.Native;
private long GetFileSize(string filePath)
{
Stat stat;
Syscall.stat(filePath, out stat);
return stat.st_size;
}
I've tested this running .NET Core on Linux and macOS - not sure if it works on Windows - it might, given that these are POSIX syscalls under the hood (and the package is maintained by Microsoft). If not, combine with the other P/Invoke-based answer to cover all platforms.
When compared to FileInfo.Length, this gives me much more reliable results when getting the size of a file that is actively being written to by another process/thread.
You can try this:
[DllImport("kernel32.dll")]
static extern bool GetFileSizeEx(IntPtr hFile, out long lpFileSize);
But that's not much of an improvement...
Here's the example code taken from pinvoke.net:
IntPtr handle = CreateFile(
PathString,
GENERIC_READ,
FILE_SHARE_READ,
0,
OPEN_EXISTING,
FILE_ATTRIBUTE_READONLY,
0); //PInvoked too
if (handle.ToInt32() == -1)
{
return;
}
long fileSize;
bool result = GetFileSizeEx(handle, out fileSize);
if (!result)
{
return;
}
We have up to 30 GB of gzipped log files per day. Each file holds 100,000 lines and is between 6 and 8 MB when compressed. The simplified code, in which the parsing logic has been stripped out, utilises the Parallel.ForEach loop.
Line-processing throughput peaks at a MaxDegreeOfParallelism of 8 on the two-NUMA-node, 32-logical-CPU box (Intel Xeon E7-2820 @ 2 GHz):
using System;
using System.Collections.Concurrent;
using System.Linq;
using System.IO;
using System.IO.Compression;
using System.Threading.Tasks;
namespace ParallelLineCount
{
public class ScriptMain
{
static void Main(String[] args)
{
int maxMaxDOP = (args.Length > 0) ? Convert.ToInt16(args[0]) : 2;
string fileLocation = (args.Length > 1) ? args[1] : "C:\\Temp\\SomeFiles" ;
string filePattern = (args.Length > 2) ? args[2] : "*2012-10-30.*.gz";
string fileNamePrefix = (args.Length > 3) ? args[3] : "LineCounts";
Console.WriteLine("Start: {0}", DateTime.UtcNow.ToString("yyyy-MM-ddTHH:mm:ss.fffffffZ"));
Console.WriteLine("Processing file(s): {0}", filePattern);
Console.WriteLine("Max MaxDOP to be used: {0}", maxMaxDOP.ToString());
Console.WriteLine("");
Console.WriteLine("MaxDOP,FilesProcessed,ProcessingTime[ms],BytesProcessed,LinesRead,SomeBookLines,LinesPer[ms],BytesPer[ms]");
for (int maxDOP = 1; maxDOP <= maxMaxDOP; maxDOP++)
{
// Construct ConcurrentStacks for resulting strings and counters
ConcurrentStack<Int64> TotalLines = new ConcurrentStack<Int64>();
ConcurrentStack<Int64> TotalSomeBookLines = new ConcurrentStack<Int64>();
ConcurrentStack<Int64> TotalLength = new ConcurrentStack<Int64>();
ConcurrentStack<int> TotalFiles = new ConcurrentStack<int>();
DateTime FullStartTime = DateTime.Now;
string[] files = System.IO.Directory.GetFiles(fileLocation, filePattern);
var options = new ParallelOptions() { MaxDegreeOfParallelism = maxDOP };
// Method signature: Parallel.ForEach(IEnumerable<TSource> source, Action<TSource> body)
Parallel.ForEach(files, options, currentFile =>
{
string filename = System.IO.Path.GetFileName(currentFile);
DateTime fileStartTime = DateTime.Now;
using (FileStream inFile = File.Open(fileLocation + "\\" + filename, FileMode.Open))
{
Int64 lines = 0, someBookLines = 0, length = 0;
String line = "";
using (var reader = new StreamReader(new GZipStream(inFile, CompressionMode.Decompress)))
{
while (!reader.EndOfStream)
{
line = reader.ReadLine();
lines++; // total lines
length += line.Length; // total line length
if (line.Contains("book")) someBookLines++; // some special lines that need to be parsed later
}
TotalLines.Push(lines); TotalSomeBookLines.Push(someBookLines); TotalLength.Push(length);
TotalFiles.Push(1); // silly way to count processed files :)
}
}
}
);
TimeSpan runningTime = DateTime.Now - FullStartTime;
// Console.WriteLine("MaxDOP,FilesProcessed,ProcessingTime[ms],BytesProcessed,LinesRead,SomeBookLines,LinesPer[ms],BytesPer[ms]");
Console.WriteLine("{0},{1},{2},{3},{4},{5},{6},{7}",
maxDOP.ToString(),
TotalFiles.Sum().ToString(),
Convert.ToInt32(runningTime.TotalMilliseconds).ToString(),
TotalLength.Sum().ToString(),
TotalLines.Sum(),
TotalSomeBookLines.Sum().ToString(),
Convert.ToInt64(TotalLines.Sum() / runningTime.TotalMilliseconds).ToString(),
Convert.ToInt64(TotalLength.Sum() / runningTime.TotalMilliseconds).ToString());
}
Console.WriteLine();
Console.WriteLine("Finish: " + DateTime.UtcNow.ToString("yyyy-MM-ddTHH:mm:ss.fffffffZ"));
}
}
}
Here's a summary of the results, with a clear peak at MaxDegreeOfParallelism = 8:
The CPU load (shown aggregated here, most of the load was on a single NUMA node, even when DOP was in 20 to 30 range):
The only way I've found to make CPU load cross 95% mark was to split the files across 4 different folders and execute the same command 4 times, each one targeting a subset of all files.
Can someone find a bottleneck?
It's likely that one problem is the small buffer size used by the default FileStream constructor. I suggest you use a larger input buffer. Such as:
using (FileStream infile = new FileStream(
name, FileMode.Open, FileAccess.Read, FileShare.None, 65536))
The default buffer size is 4 kilobytes, which has the thread making many calls to the I/O subsystem to fill its buffer. A buffer of 64K means that you will make those calls much less frequently.
I've found that a buffer size of between 32K and 256K gives the best performance, with 64K being the "sweet spot" when I did some detailed testing a while back. A buffer size larger than 256K actually begins to reduce performance.
Also, although this is unlikely to have a major effect on performance, you probably should replace those ConcurrentStack instances with 64-bit integers and use Interlocked.Add or Interlocked.Increment to update them. It simplifies your code and removes the need to manage the collections.
Update:
Re-reading your problem description, I was struck by this statement:
The only way I've found to make CPU load cross 95% mark was to split
the files across 4 different folders and execute the same command 4
times, each one targeting a subset of all files.
That, to me, points to a bottleneck in opening files. As though the OS is using a mutual exclusion lock on the directory. And even if all the data is in the cache and there's no physical I/O required, processes still have to wait on this lock. It's also possible that the file system is writing to the disk. Remember, it has to update the Last Access Time for a file whenever it's opened.
If I/O really is the bottleneck, then you might consider having a single thread that does nothing but load files and stuff them into a BlockingCollection or similar data structure so that the processing threads don't have to contend with each other for a lock on the directory. Your application becomes a producer/consumer application with one producer and N consumers.
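A sketch of that producer/consumer shape, assuming the raw (still compressed) file contents fit in memory a few at a time (files, consumerCount and the per-line work are placeholders):
using System;
using System.Collections.Concurrent;
using System.IO;
using System.IO.Compression;
using System.Linq;
using System.Threading.Tasks;

static void CountLinesProducerConsumer(string[] files, int consumerCount)
{
    // Bounded queue so the single reader can't run too far ahead of the consumers.
    var rawFiles = new BlockingCollection<byte[]>(boundedCapacity: 8);

    var producer = Task.Factory.StartNew(() =>
    {
        foreach (string file in files)
            rawFiles.Add(File.ReadAllBytes(file)); // only this thread touches the disk/directory
        rawFiles.CompleteAdding();
    });

    var consumers = Enumerable.Range(0, consumerCount).Select(_ => Task.Factory.StartNew(() =>
    {
        foreach (byte[] raw in rawFiles.GetConsumingEnumerable())
        {
            using (var reader = new StreamReader(
                new GZipStream(new MemoryStream(raw), CompressionMode.Decompress)))
            {
                while (!reader.EndOfStream)
                {
                    string line = reader.ReadLine();
                    // count / parse the line here
                }
            }
        }
    })).ToArray();

    Task.WaitAll(consumers.Concat(new[] { producer }).ToArray());
}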
The reason for this is usually that threads synchronize too much.
Looking for synchronization in your code I can see heavy syncing on the collections. Your threads are pushing the lines individually. This means that each line incurs at best an interlocked operation and at worst a kernel-mode lock wait. The interlocked operations will contend heavily because all threads race to get their current line into the collection. They all try to update the same memory locations. This causes cache line pinging.
Change this to push lines in bigger chunks. Push line-arrays of 100 lines or more. The more the better.
In other words, collect results in a thread-local collection first and only rarely merge into the global results.
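As a sketch, the Parallel.ForEach overload with thread-local state does exactly this merge-rarely pattern (the per-file reading is elided):
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

static void CountWithLocalTotals(IEnumerable<string> files, ParallelOptions options)
{
    long totalLines = 0, totalLength = 0;

    Parallel.ForEach(
        files,
        options,
        () => new long[2],                         // localInit: [0] = lines, [1] = length
        (file, loopState, local) =>
        {
            // ... open 'file', read it line by line, bump local[0] and local[1] ...
            return local;
        },
        local =>
        {
            // localFinally: merge into the shared totals once per worker, not once per line
            Interlocked.Add(ref totalLines, local[0]);
            Interlocked.Add(ref totalLength, local[1]);
        });
}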
You might even want to get rid of the manual data pushing altogether. This is what PLINQ is made for: Streaming data concurrently. PLINQ abstracts away all the concurrent collection manipulations in a well-performing way.
I don't think Parallelizing the disk reads is helping you. In fact, this could be seriously impacting your performance by creating contention in reading from multiple areas of storage at same time.
I would restructure the program to first do a single-threaded read of raw file data into a memory stream of byte[]. Then, do a Parallel.ForEach() on each stream or buffer to decompress and count the lines.
You take an initial IO read hit up front but let the OS/hardware optimize the hopefully mostly sequential reads, then decompress and parse in memory.
Keep in mind that operations like decompression, Encoding.UTF8.GetString(), String.Split(), etc. will use large amounts of memory, so clean up references to / dispose of old buffers as you no longer need them.
I'd be surprised if you can't cause the machine to generate some serious waste heat this way.
Hope this helps.
The problem, I think, is that you are using blocking I/O, so your threads cannot fully take advantage of parallelism.
If I understand your algorithm right (sorry, I'm more of a C++ guy) this is what you are doing in each thread (pseudo-code):
while (there is data in the file)
read data
gunzip data
Instead, a better approach would be something like this:
N = 0
read data block N
while (there is data in the file)
asyncRead data block N+1
gunzip data block N
N = N + 1
gunzip data block N
The asyncRead call does not block, so basically you have the decoding of block N happening concurrently with the reading of block N+1, so by the time you are done decoding block N you might have block N+1 ready (or close to be ready if I/O is slower than decoding).
Then it's just a matter of finding the block size that gives you the best throughput.
Good luck.
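A simplified double-buffering sketch of that pattern in C# (it needs .NET 4.5's Stream.ReadAsync, and the 'process' delegate stands in for the gunzip-and-count work, which in practice has to feed a decompression stream):
using System;
using System.IO;
using System.Threading.Tasks;

// While block N is being processed, block N+1 is already being read.
static void ProcessOverlapped(Stream input, int blockSize, Action<byte[], int> process)
{
    byte[] current = new byte[blockSize];
    byte[] next = new byte[blockSize];

    int read = input.Read(current, 0, blockSize);                 // block 0, synchronous
    while (read > 0)
    {
        Task<int> pending = input.ReadAsync(next, 0, blockSize);  // start reading block N+1
        process(current, read);                                   // work on block N meanwhile
        read = pending.Result;                                    // wait for block N+1
        byte[] tmp = current; current = next; next = tmp;         // swap buffers
    }
}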
I've been trying to get what I believe to be the simplest possible form of threading to work in my application but I just can't do it.
What I want to do: I have a main form with a status strip and a progress bar on it. I have to read something between 3 and 99 files and add their hashes to a string[] which I want to add to a list of all files with their respective hashes. Afterwards I have to compare the items on that list to a database (which comes in text files).
Once all that is done, I have to update a textbox in the main form and the progressbar to 33%; mostly I just don't want the main form to freeze during processing.
The files I'm working with always sum up to 1.2GB (+/- a few MB), meaning I should be able to read them into byte[]s and process them from there (I have to calculate CRC32, MD5 and SHA1 of each of those files so that should be faster than reading all of them from a HDD 3 times).
Also I should note that some files may be 1MB while another one may be 1GB. I initially wanted to create 99 threads for 99 files but that seems not wise, I suppose it would be best to reuse threads of small files while bigger file threads are still running. But that sounds pretty complicated to me so I'm not sure if that's wise either.
So far I've tried worker threads and BackgroundWorkers, but neither seems to work too well for me; at least the BackgroundWorkers worked SOME of the time, but I can't even figure out why they won't work the other times... either way the main form still froze.
Now I've read about the Task Parallel Library in .NET 4.0 but I thought I should better ask someone who knows what he's doing before wasting more time on this.
What I want to do looks something like this (without threading):
List<string[]> fileSpecifics = new List<string[]>();
int fileMaxNumber = 42; // something between 3 and 99, depending on file set
for (int i = 1; i <= fileMaxNumber; i++)
{
string fileName = "C:\\path\\to\\file" + i.ToString("D2") + ".ext"; // file01.ext - file99.ext
string fileSize = new FileInfo(fileName).Length.ToString();
byte[] file = File.ReadAllBytes(fileName);
// hash calculations (using SHA1CryptoServiceProvider() etc., no problems with that so I'll spare you that, return strings)
file = null; // I didn't yet check if this made any actual difference but I figured it couldn't hurt
fileSpecifics.Add(new string[] { fileName, fileSize, fileCRC, fileMD5, fileSHA1 });
}
// look for files in text database mentioned above, i.e. first check for "file bundles" with the same amount of files I have here; then compare file sizes, then hashes
// again, no problems with that so I'll spare you that; the database text files are pretty small so parsing them doesn't need to be done in an extra thread.
Would anybody be kind enough to point me in the right direction? I'm looking for the easiest way to read and hash those files quickly (I believe the hashing takes some time in which other files could already be read) and save the output to a string[], without the main form freezing, nothing more, nothing less.
I'm thankful for any input.
EDIT to clarify: by "backgroundWorkers working some of the time" I meant that (for the very same set of files), maybe the first and fourth execution of my code produces the correct output and the UI unfreezes within 5 seconds, for the second, third and fifth execution it freezes the form (and after 60 seconds I get an error message saying some thread didn't respond within that time frame) and I have to stop execution via VS.
Thanks for all your suggestions and pointers, as you all have correctly guessed I'm completely new to threading and will have to read up on the great links you guys posted.
Then I'll give those methods a try and flag the answer that helped me the most. Thanks again!
With .NET Framework 4.X
Use Directory.EnumerateFiles Method for efficient/lazy files enumeration
Use Parallel.For() or PLINQ to parallelize the work, or use the TPL to run a single Task per pipeline stage
Use the pipeline pattern to pipeline the following stages: calculating hash codes, comparing with the pattern, updating the UI
To avoid UI freezes, use the appropriate techniques: for WPF use Dispatcher.BeginInvoke(), for WinForms use Control.Invoke(); see this SO answer
Considering that all this stuff has a UI, it might be useful to add a cancellation feature to abandon a long-running operation if needed; take a look at CancellationTokenSource.CreateLinkedTokenSource, which allows triggering a CancellationToken from an "external scope"
I could add an example, but it's worth doing it yourself so you learn all this stuff rather than simply copy/paste -> got it working -> forgot about it.
PS: Must read - Pipelines paper at MSDN
TPL specific pipeline implementation
Pipeline pattern implementation: three stages: calculate hash, match, update UI
Three tasks, one per stage
Two Blocking Queues
//
// 1) CalculateHashesImpl() should store all calculated hashes here
// 2) CompareMatchesImpl() should read input hashes from this queue
// Tuple.Item1 - hash, Tuple.Item2 - file path
var calculatedHashes = new BlockingCollection<Tuple<string, string>>();
// 1) CompareMatchesImpl() should store all pattern matching results here
// 2) SyncUiImpl() method should read from this collection and update
// UI with available results
var comparedMatches = new BlockingCollection<string>();
var factory = new TaskFactory(TaskCreationOptions.LongRunning,
TaskContinuationOptions.None);
var calculateHashesWorker = factory.StartNew(() => CalculateHashesImpl(...));
var comparedMatchesWorker = factory.StartNew(() => CompareMatchesImpl(...));
var syncUiWorker= factory.StartNew(() => SyncUiImpl(...));
Task.WaitAll(calculateHashesWorker, comparedMatchesWorker, syncUiWorker);
CalculateHashesImpl():
private void CalculateHashesImpl(string directoryPath)
{
foreach (var file in Directory.EnumerateFiles(directoryPath))
{
var hash = CalculateHashTODO(file);
calculatedHashes.Add(new Tuple<string, string>(hash, file));
}
}
CompareMatchesImpl():
private void CompareMatchesImpl()
{
foreach (var hashEntry in calculatedHashes.GetConsumingEnumerable())
{
// TODO: obviously return type is up to you
string matchResult = GetMatchResultTODO(hashEntry.Item1, hashEntry.Item2);
comparedMatches.Add(matchResult);
}
}
SyncUiImpl():
private void SyncUiImpl()
{
foreach (var matchResult in comparedMatches.GetConsumingEnumerable())
{
// TODO: track progress in UI using UI framework specific features
// to do not freeze it
}
}
TODO: Consider using CancellationToken as a parameter for all GetConsumingEnumerable() calls so you easily can stop a pipeline execution when needed.
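For example, a cancellable version of the matching stage might look like this (a sketch reusing the collections and the placeholder GetMatchResultTODO() from above):
private void CompareMatchesImpl(CancellationToken token)
{
    // One Cancel() on the shared CancellationTokenSource unblocks every stage.
    foreach (var hashEntry in calculatedHashes.GetConsumingEnumerable(token))
    {
        string matchResult = GetMatchResultTODO(hashEntry.Item1, hashEntry.Item2);
        comparedMatches.Add(matchResult, token);
    }
}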
First off, you should be using a higher level of abstraction to solve this problem. You have a bunch of tasks to complete, so use the "task" abstraction. You should be using the Task Parallel Library to do this sort of thing. Let the TPL deal with the question of how many worker threads to create -- the answer could be as low as one if the work is gated on I/O.
If you do want to do your own threading, some good advice:
Do not ever block on the UI thread. That is what is freezing your application. Come up with a protocol by which working threads can communicate with your UI thread, which then does nothing except respond to UI events. Remember that methods of user interface controls like task completion bars must never be called by any thread other than the UI thread.
Do not create 99 threads to read 99 files. That's like getting 99 pieces of mail and hiring 99 assistants to write responses: an extraordinarily expensive solution to a simple problem. If your work is CPU intensive then there is no point in "hiring" more threads than you have CPUs to service them. (That's like hiring 99 assistants in an office that only has four desks. The assistants spend most of their time waiting for a desk to sit at instead of reading your mail.) If your work is disk-intensive then most of those threads are going to be idle most of the time waiting for the disk, which is an even bigger waste of resources.
First, I hope you are using a built-in library for calculating hashes. It's possible to write your own, but it's far safer to use something that has been around for a while.
You may need only create as many threads as CPUs if your process is CPU intensive. If it is bound by I/O, you might be able to get away with more threads.
I do not recommend loading the entire file into memory. Your hashing library should support updating a chunk at a time. Read a chunk into memory, use it to update the hashes of each algorithm, read the next chunk, and repeat until end of file. The chunked approach will help lower your program's memory demands.
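As a sketch of that chunked approach using the built-in algorithms (a CRC32 implementation would be updated the same way inside the loop):
using System.IO;
using System.Security.Cryptography;

static void HashFileChunked(string path, out byte[] md5, out byte[] sha1)
{
    using (var md5Alg = MD5.Create())
    using (var sha1Alg = SHA1.Create())
    using (var stream = File.OpenRead(path))
    {
        var buffer = new byte[1 << 20];   // 1 MB chunks instead of the whole file
        int read;
        while ((read = stream.Read(buffer, 0, buffer.Length)) > 0)
        {
            md5Alg.TransformBlock(buffer, 0, read, null, 0);
            sha1Alg.TransformBlock(buffer, 0, read, null, 0);
        }
        md5Alg.TransformFinalBlock(buffer, 0, 0);
        sha1Alg.TransformFinalBlock(buffer, 0, 0);
        md5 = md5Alg.Hash;
        sha1 = sha1Alg.Hash;
    }
}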
As others have suggested, look into the Task Parallel Library, particularly Data Parallelism. It might be as easy as this:
Parallel.ForEach(fileSpecifics, item => CalculateHashes(item));
Check out TPL Dataflow. You can use a throttled ActionBlock which will manage the hard part for you.
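A sketch of what that looks like (calculateHashes stands in for your per-file work; MaxDegreeOfParallelism is the throttle):
using System;
using System.Threading.Tasks.Dataflow;

static void HashAllThrottled(string[] fileNames, Action<string> calculateHashes)
{
    // At most 4 files are processed at once; the rest queue up inside the block.
    var hashBlock = new ActionBlock<string>(
        calculateHashes,
        new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 4 });

    foreach (string fileName in fileNames)
        hashBlock.Post(fileName);

    hashBlock.Complete();          // no more input
    hashBlock.Completion.Wait();   // wait for the queued work to drain
}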
If my understanding is correct that you are looking to perform some tasks in the background without blocking your UI, then the BackgroundWorker would be an appropriate choice. You mentioned that you got it working some of the time, so my recommendation would be to take what you had in a semi-working state and improve upon it by tracking down the failures. If my hunch is correct, your worker was throwing an exception, which it does not appear you are handling in your code. Unhandled exceptions that bubble out of their containing threads make bad things happen.
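If you go that route, a minimal check looks something like this (assuming worker is your BackgroundWorker and progressBar lives on your WinForms form):
// RunWorkerCompleted fires on the UI thread, so touching controls here is safe.
worker.RunWorkerCompleted += (sender, e) =>
{
    if (e.Error != null)
    {
        // Any exception thrown inside DoWork ends up here instead of disappearing silently.
        MessageBox.Show("Background work failed: " + e.Error.Message);
        return;
    }
    progressBar.Value = 33;   // e.g. advance the progress bar after the first phase
};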
This code hashes one file (stream) using two tasks - one for reading, the second for hashing; for a more robust approach you should read more chunks ahead.
Because the processor's bandwidth is much higher than the disk's, unless you use some high-speed flash drive you gain nothing from hashing more files concurrently.
public void TransformStream(Stream a_stream, long a_length = -1)
{
Debug.Assert((a_length == -1 || a_length > 0));
if (a_stream.CanSeek)
{
if (a_length > -1)
{
if (a_stream.Position + a_length > a_stream.Length)
throw new IndexOutOfRangeException();
}
if (a_stream.Position >= a_stream.Length)
return;
}
System.Collections.Concurrent.ConcurrentQueue<byte[]> queue =
new System.Collections.Concurrent.ConcurrentQueue<byte[]>();
System.Threading.AutoResetEvent data_ready = new System.Threading.AutoResetEvent(false);
System.Threading.AutoResetEvent prepare_data = new System.Threading.AutoResetEvent(false);
Task reader = Task.Factory.StartNew(() =>
{
long total = 0;
for (; ; )
{
byte[] data = new byte[BUFFER_SIZE];
int readed = a_stream.Read(data, 0, data.Length);
if ((a_length == -1) && (readed != BUFFER_SIZE))
data = data.SubArray(0, readed);
else if ((a_length != -1) && (total + readed >= a_length))
data = data.SubArray(0, (int)(a_length - total));
total += data.Length;
queue.Enqueue(data);
data_ready.Set();
if (a_length == -1)
{
if (readed != BUFFER_SIZE)
break;
}
else if (a_length == total)
break;
else if (readed != BUFFER_SIZE)
throw new EndOfStreamException();
prepare_data.WaitOne();
}
});
Task hasher = Task.Factory.StartNew((obj) =>
{
IHash h = (IHash)obj;
long total = 0;
for (; ; )
{
data_ready.WaitOne();
byte[] data;
queue.TryDequeue(out data);
prepare_data.Set();
total += data.Length;
if ((a_length == -1) || (total < a_length))
{
h.TransformBytes(data, 0, data.Length);
}
else
{
int readed = data.Length;
readed = readed - (int)(total - a_length);
h.TransformBytes(data, 0, readed);
}
if (a_length == -1)
{
if (data.Length != BUFFER_SIZE)
break;
}
else if (a_length == total)
break;
else if (data.Length != BUFFER_SIZE)
throw new EndOfStreamException();
}
}, this);
reader.Wait();
hasher.Wait();
}
Rest of code here: http://hashlib.codeplex.com/SourceControl/changeset/view/71730#514336