I have a base directory that contains several thousand folders. Inside of these folders there can be between 1 and 20 subfolders that contains between 1 and 10 files. I'd like to delete all files that are over 60 days old. I was using the code below to get the list of files that I would have to delete:
DirectoryInfo dirInfo = new DirectoryInfo(myBaseDirectory);
FileInfo[] oldFiles =
dirInfo.GetFiles("*.*", SearchOption.AllDirectories)
.Where(t=>t.CreationTime < DateTime.Now.AddDays(-60)).ToArray();
But I let this run for about 30 minutes and it still hasn't finished. I'm curious if anyone can see anyway that I could potentially improve the performance of the above line or if there is a different way I should be approaching this entirely for better performance? Suggestions?
This is (probably) as good as it's going to get:
DateTime sixtyLess = DateTime.Now.AddDays(-60);
DirectoryInfo dirInfo = new DirectoryInfo(myBaseDirectory);
FileInfo[] oldFiles =
dirInfo.EnumerateFiles("*.*", SearchOption.AllDirectories)
.AsParallel()
.Where(fi => fi.CreationTime < sixtyLess).ToArray();
Changes:
Made the the 60 days less DateTime constant, and therefore less CPU load.
Used EnumerateFiles.
Made the query parallel.
Should run in a smaller amount of time (not sure how much smaller).
Here is another solution which might be faster or slower than the first, it depends on the data:
DateTime sixtyLess = DateTime.Now.AddDays(-60);
DirectoryInfo dirInfo = new DirectoryInfo(myBaseDirectory);
FileInfo[] oldFiles =
dirInfo.EnumerateDirectories()
.AsParallel()
.SelectMany(di => di.EnumerateFiles("*.*", SearchOption.AllDirectories)
.Where(fi => fi.CreationTime < sixtyLess))
.ToArray();
Here it moves the parallelism to the main folder enumeration. Most of the changes from above apply too.
A possibly faster alternative is to use WINAPI FindNextFile. There is an excellent Faster Directory Enumeration Tool for this. Which can be used as follows:
HashSet<FileData> GetPast60(string dir)
{
DateTime retval = DateTime.Now.AddDays(-60);
HashSet<FileData> oldFiles = new HashSet<FileData>();
FileData [] files = FastDirectoryEnumerator.GetFiles(dir);
for (int i=0; i<files.Length; i++)
{
if (files[i].LastWriteTime < retval)
{
oldFiles.Add(files[i]);
}
}
return oldFiles;
}
EDIT
So, based on comments below, I decided to do a benchmark of suggested solutions here as well as others I could think of. It was interesting enough to see that EnumerateFiles seemed to out-perform FindNextFile in C#, while EnumerateFiles with AsParallel was by far the fastest followed surprisingly by command prompt count. However do note that AsParallel wasn't getting the complete file count or was missing some files counted by the others so you could say the command prompt method is the best.
Applicable Config:
Windows 7 Service Pack 1 x64
Intel(R) Core(TM) i5-3210M CPU #2.50GHz 2.50GHz
RAM: 6GB
Platform Target: x64
No Optimization (NB: Compiling with optimization will produce drastically poor performance)
Allow UnSafe Code
Start Without Debugging
Below are three screenshots:
I have included my test code below:
static void Main(string[] args)
{
Console.Title = "File Enumeration Performance Comparison";
Stopwatch watch = new Stopwatch();
watch.Start();
var allfiles = GetPast60("C:\\Users\\UserName\\Documents");
watch.Stop();
Console.WriteLine("Total time to enumerate using WINAPI =" + watch.ElapsedMilliseconds + "ms.");
Console.WriteLine("File Count: " + allfiles);
Stopwatch watch1 = new Stopwatch();
watch1.Start();
var allfiles1 = GetPast60Enum("C:\\Users\\UserName\\Documents\\");
watch1.Stop();
Console.WriteLine("Total time to enumerate using EnumerateFiles =" + watch1.ElapsedMilliseconds + "ms.");
Console.WriteLine("File Count: " + allfiles1);
Stopwatch watch2 = new Stopwatch();
watch2.Start();
var allfiles2 = Get1("C:\\Users\\UserName\\Documents\\");
watch2.Stop();
Console.WriteLine("Total time to enumerate using Get1 =" + watch2.ElapsedMilliseconds + "ms.");
Console.WriteLine("File Count: " + allfiles2);
Stopwatch watch3 = new Stopwatch();
watch3.Start();
var allfiles3 = Get2("C:\\Users\\UserName\\Documents\\");
watch3.Stop();
Console.WriteLine("Total time to enumerate using Get2 =" + watch3.ElapsedMilliseconds + "ms.");
Console.WriteLine("File Count: " + allfiles3);
Stopwatch watch4 = new Stopwatch();
watch4.Start();
var allfiles4 = RunCommand(#"dir /a: /b /s C:\Users\UserName\Documents");
watch4.Stop();
Console.WriteLine("Total time to enumerate using Command Prompt =" + watch4.ElapsedMilliseconds + "ms.");
Console.WriteLine("File Count: " + allfiles4);
Console.WriteLine("Press Any Key to Continue...");
Console.ReadLine();
}
private static int RunCommand(string command)
{
var process = new Process()
{
StartInfo = new ProcessStartInfo("cmd")
{
UseShellExecute = false,
RedirectStandardInput = true,
RedirectStandardOutput = true,
CreateNoWindow = true,
Arguments = String.Format("/c \"{0}\"", command),
}
};
int count = 0;
process.OutputDataReceived += delegate { count++; };
process.Start();
process.BeginOutputReadLine();
process.WaitForExit();
return count;
}
static int GetPast60Enum(string dir)
{
return new DirectoryInfo(dir).EnumerateFiles("*.*", SearchOption.AllDirectories).Count();
}
private static int Get2(string myBaseDirectory)
{
DirectoryInfo dirInfo = new DirectoryInfo(myBaseDirectory);
return dirInfo.EnumerateFiles("*.*", SearchOption.AllDirectories)
.AsParallel().Count();
}
private static int Get1(string myBaseDirectory)
{
DirectoryInfo dirInfo = new DirectoryInfo(myBaseDirectory);
return dirInfo.EnumerateDirectories()
.AsParallel()
.SelectMany(di => di.EnumerateFiles("*.*", SearchOption.AllDirectories))
.Count() + dirInfo.EnumerateFiles("*.*", SearchOption.TopDirectoryOnly).Count();
}
private static int GetPast60(string dir)
{
return FastDirectoryEnumerator.GetFiles(dir, "*.*", SearchOption.AllDirectories).Length;
}
NB: I concentrated on count in the benchmark not modified date.
I realize this is very late to the party but if someone else is looking for this then you can speed things up by orders of magnitude by directly parsing the the MFT or FAT of the file system, this requires admin privileges as I think it will return all files regardless of security but can probably take your 30 mins down to 30 seconds for the enumeration stage at least.
A library for NTFS is here https://github.com/LordMike/NtfsLib there is also https://discutils.codeplex.com/ which I haven't personally used.
I would only use these methods for initial discovery of files over x days old and then verify them individual before deleting, it might be overkill but I'm cautious like that.
The method Get1 in above answer (#itsnotalie & #Chibueze Opata) is missing to count the files in the root directory, so it should read:
private static int Get1(string myBaseDirectory)
{
DirectoryInfo dirInfo = new DirectoryInfo(myBaseDirectory);
return dirInfo.EnumerateDirectories()
.AsParallel()
.SelectMany(di => di.EnumerateFiles("*.*", SearchOption.AllDirectories))
.Count() + dirInfo.EnumerateFiles("*.*", SearchOption.TopDirectoryOnly).Count();
}
When using SearchOption.AllDirectories EnumerateFiles took ages to return the first item. After reading several good answers here, I have for now ended up with the function below. By only have it work on one directory at a time and calling it recursively it now returns first item almost immediately.
But I must admit that I'm not totally sure on the correct way to use .AsParallel() so don't use this blindly.
Instead of working with arrays I would strongly suggest working with enumeration instead.
Some mentions that speed of disk is limiting factor and threads won't help, in terms of total time that is very likely as long as nothing is cached by the OS, but by using multiple threads you can get the cached data returned first, while otherwise it might be possible that the cache is pruned to make space for the new results.
Recursive calls might affect stack, but there is a limit on most FSs for how many levels there can be, so should not become a real issue.
private static IEnumerable<FileInfo> EnumerateFilesParallel(DirectoryInfo dir)
{
return dir.EnumerateDirectories()
.AsParallel()
.SelectMany(EnumerateFilesParallel)
.Concat(dir.EnumerateFiles("*", SearchOption.TopDirectoryOnly).AsParallel());
}
You are using a Linq. It would be faster if you wrote your own method for searching Directories recursively with you're special case.
public static DateTime retval = DateTime.Now.AddDays(-60);
public static void WalkDirectoryTree(System.IO.DirectoryInfo root)
{
System.IO.FileInfo[] files = null;
System.IO.DirectoryInfo[] subDirs = null;
// First, process all the files directly under this folder
try
{
files = root.GetFiles("*.*");
}
// This is thrown if even one of the files requires permissions greater
// than the application provides.
catch (UnauthorizedAccessException e)
{
// This code just writes out the message and continues to recurse.
// You may decide to do something different here. For example, you
// can try to elevate your privileges and access the file again.
log.Add(e.Message);
}
catch (System.IO.DirectoryNotFoundException e)
{
Console.WriteLine(e.Message);
}
if (files != null)
{
foreach (System.IO.FileInfo fi in files)
{
if (fi.LastWriteTime < retval)
{
oldFiles.Add(files[i]);
}
Console.WriteLine(fi.FullName);
}
// Now find all the subdirectories under this directory.
subDirs = root.GetDirectories();
foreach (System.IO.DirectoryInfo dirInfo in subDirs)
{
// Resursive call for each subdirectory.
WalkDirectoryTree(dirInfo);
}
}
}
If you really want to improve performance, get your hands dirty and use the NtQueryDirectoryFile that's internal to Windows, with a large buffer size.
FindFirstFile is already slow, and while FindFirstFileEx is a bit better, the best performance will come from calling the native function directly.
Related
I have a base directory that contains several thousand folders. Inside of these folders there can be between 1 and 20 subfolders that contains between 1 and 10 files. I'd like to delete all files that are over 60 days old. I was using the code below to get the list of files that I would have to delete:
DirectoryInfo dirInfo = new DirectoryInfo(myBaseDirectory);
FileInfo[] oldFiles =
dirInfo.GetFiles("*.*", SearchOption.AllDirectories)
.Where(t=>t.CreationTime < DateTime.Now.AddDays(-60)).ToArray();
But I let this run for about 30 minutes and it still hasn't finished. I'm curious if anyone can see anyway that I could potentially improve the performance of the above line or if there is a different way I should be approaching this entirely for better performance? Suggestions?
This is (probably) as good as it's going to get:
DateTime sixtyLess = DateTime.Now.AddDays(-60);
DirectoryInfo dirInfo = new DirectoryInfo(myBaseDirectory);
FileInfo[] oldFiles =
dirInfo.EnumerateFiles("*.*", SearchOption.AllDirectories)
.AsParallel()
.Where(fi => fi.CreationTime < sixtyLess).ToArray();
Changes:
Made the the 60 days less DateTime constant, and therefore less CPU load.
Used EnumerateFiles.
Made the query parallel.
Should run in a smaller amount of time (not sure how much smaller).
Here is another solution which might be faster or slower than the first, it depends on the data:
DateTime sixtyLess = DateTime.Now.AddDays(-60);
DirectoryInfo dirInfo = new DirectoryInfo(myBaseDirectory);
FileInfo[] oldFiles =
dirInfo.EnumerateDirectories()
.AsParallel()
.SelectMany(di => di.EnumerateFiles("*.*", SearchOption.AllDirectories)
.Where(fi => fi.CreationTime < sixtyLess))
.ToArray();
Here it moves the parallelism to the main folder enumeration. Most of the changes from above apply too.
A possibly faster alternative is to use WINAPI FindNextFile. There is an excellent Faster Directory Enumeration Tool for this. Which can be used as follows:
HashSet<FileData> GetPast60(string dir)
{
DateTime retval = DateTime.Now.AddDays(-60);
HashSet<FileData> oldFiles = new HashSet<FileData>();
FileData [] files = FastDirectoryEnumerator.GetFiles(dir);
for (int i=0; i<files.Length; i++)
{
if (files[i].LastWriteTime < retval)
{
oldFiles.Add(files[i]);
}
}
return oldFiles;
}
EDIT
So, based on comments below, I decided to do a benchmark of suggested solutions here as well as others I could think of. It was interesting enough to see that EnumerateFiles seemed to out-perform FindNextFile in C#, while EnumerateFiles with AsParallel was by far the fastest followed surprisingly by command prompt count. However do note that AsParallel wasn't getting the complete file count or was missing some files counted by the others so you could say the command prompt method is the best.
Applicable Config:
Windows 7 Service Pack 1 x64
Intel(R) Core(TM) i5-3210M CPU #2.50GHz 2.50GHz
RAM: 6GB
Platform Target: x64
No Optimization (NB: Compiling with optimization will produce drastically poor performance)
Allow UnSafe Code
Start Without Debugging
Below are three screenshots:
I have included my test code below:
static void Main(string[] args)
{
Console.Title = "File Enumeration Performance Comparison";
Stopwatch watch = new Stopwatch();
watch.Start();
var allfiles = GetPast60("C:\\Users\\UserName\\Documents");
watch.Stop();
Console.WriteLine("Total time to enumerate using WINAPI =" + watch.ElapsedMilliseconds + "ms.");
Console.WriteLine("File Count: " + allfiles);
Stopwatch watch1 = new Stopwatch();
watch1.Start();
var allfiles1 = GetPast60Enum("C:\\Users\\UserName\\Documents\\");
watch1.Stop();
Console.WriteLine("Total time to enumerate using EnumerateFiles =" + watch1.ElapsedMilliseconds + "ms.");
Console.WriteLine("File Count: " + allfiles1);
Stopwatch watch2 = new Stopwatch();
watch2.Start();
var allfiles2 = Get1("C:\\Users\\UserName\\Documents\\");
watch2.Stop();
Console.WriteLine("Total time to enumerate using Get1 =" + watch2.ElapsedMilliseconds + "ms.");
Console.WriteLine("File Count: " + allfiles2);
Stopwatch watch3 = new Stopwatch();
watch3.Start();
var allfiles3 = Get2("C:\\Users\\UserName\\Documents\\");
watch3.Stop();
Console.WriteLine("Total time to enumerate using Get2 =" + watch3.ElapsedMilliseconds + "ms.");
Console.WriteLine("File Count: " + allfiles3);
Stopwatch watch4 = new Stopwatch();
watch4.Start();
var allfiles4 = RunCommand(#"dir /a: /b /s C:\Users\UserName\Documents");
watch4.Stop();
Console.WriteLine("Total time to enumerate using Command Prompt =" + watch4.ElapsedMilliseconds + "ms.");
Console.WriteLine("File Count: " + allfiles4);
Console.WriteLine("Press Any Key to Continue...");
Console.ReadLine();
}
private static int RunCommand(string command)
{
var process = new Process()
{
StartInfo = new ProcessStartInfo("cmd")
{
UseShellExecute = false,
RedirectStandardInput = true,
RedirectStandardOutput = true,
CreateNoWindow = true,
Arguments = String.Format("/c \"{0}\"", command),
}
};
int count = 0;
process.OutputDataReceived += delegate { count++; };
process.Start();
process.BeginOutputReadLine();
process.WaitForExit();
return count;
}
static int GetPast60Enum(string dir)
{
return new DirectoryInfo(dir).EnumerateFiles("*.*", SearchOption.AllDirectories).Count();
}
private static int Get2(string myBaseDirectory)
{
DirectoryInfo dirInfo = new DirectoryInfo(myBaseDirectory);
return dirInfo.EnumerateFiles("*.*", SearchOption.AllDirectories)
.AsParallel().Count();
}
private static int Get1(string myBaseDirectory)
{
DirectoryInfo dirInfo = new DirectoryInfo(myBaseDirectory);
return dirInfo.EnumerateDirectories()
.AsParallel()
.SelectMany(di => di.EnumerateFiles("*.*", SearchOption.AllDirectories))
.Count() + dirInfo.EnumerateFiles("*.*", SearchOption.TopDirectoryOnly).Count();
}
private static int GetPast60(string dir)
{
return FastDirectoryEnumerator.GetFiles(dir, "*.*", SearchOption.AllDirectories).Length;
}
NB: I concentrated on count in the benchmark not modified date.
I realize this is very late to the party but if someone else is looking for this then you can speed things up by orders of magnitude by directly parsing the the MFT or FAT of the file system, this requires admin privileges as I think it will return all files regardless of security but can probably take your 30 mins down to 30 seconds for the enumeration stage at least.
A library for NTFS is here https://github.com/LordMike/NtfsLib there is also https://discutils.codeplex.com/ which I haven't personally used.
I would only use these methods for initial discovery of files over x days old and then verify them individual before deleting, it might be overkill but I'm cautious like that.
The method Get1 in above answer (#itsnotalie & #Chibueze Opata) is missing to count the files in the root directory, so it should read:
private static int Get1(string myBaseDirectory)
{
DirectoryInfo dirInfo = new DirectoryInfo(myBaseDirectory);
return dirInfo.EnumerateDirectories()
.AsParallel()
.SelectMany(di => di.EnumerateFiles("*.*", SearchOption.AllDirectories))
.Count() + dirInfo.EnumerateFiles("*.*", SearchOption.TopDirectoryOnly).Count();
}
When using SearchOption.AllDirectories EnumerateFiles took ages to return the first item. After reading several good answers here, I have for now ended up with the function below. By only have it work on one directory at a time and calling it recursively it now returns first item almost immediately.
But I must admit that I'm not totally sure on the correct way to use .AsParallel() so don't use this blindly.
Instead of working with arrays I would strongly suggest working with enumeration instead.
Some mentions that speed of disk is limiting factor and threads won't help, in terms of total time that is very likely as long as nothing is cached by the OS, but by using multiple threads you can get the cached data returned first, while otherwise it might be possible that the cache is pruned to make space for the new results.
Recursive calls might affect stack, but there is a limit on most FSs for how many levels there can be, so should not become a real issue.
private static IEnumerable<FileInfo> EnumerateFilesParallel(DirectoryInfo dir)
{
return dir.EnumerateDirectories()
.AsParallel()
.SelectMany(EnumerateFilesParallel)
.Concat(dir.EnumerateFiles("*", SearchOption.TopDirectoryOnly).AsParallel());
}
You are using a Linq. It would be faster if you wrote your own method for searching Directories recursively with you're special case.
public static DateTime retval = DateTime.Now.AddDays(-60);
public static void WalkDirectoryTree(System.IO.DirectoryInfo root)
{
System.IO.FileInfo[] files = null;
System.IO.DirectoryInfo[] subDirs = null;
// First, process all the files directly under this folder
try
{
files = root.GetFiles("*.*");
}
// This is thrown if even one of the files requires permissions greater
// than the application provides.
catch (UnauthorizedAccessException e)
{
// This code just writes out the message and continues to recurse.
// You may decide to do something different here. For example, you
// can try to elevate your privileges and access the file again.
log.Add(e.Message);
}
catch (System.IO.DirectoryNotFoundException e)
{
Console.WriteLine(e.Message);
}
if (files != null)
{
foreach (System.IO.FileInfo fi in files)
{
if (fi.LastWriteTime < retval)
{
oldFiles.Add(files[i]);
}
Console.WriteLine(fi.FullName);
}
// Now find all the subdirectories under this directory.
subDirs = root.GetDirectories();
foreach (System.IO.DirectoryInfo dirInfo in subDirs)
{
// Resursive call for each subdirectory.
WalkDirectoryTree(dirInfo);
}
}
}
If you really want to improve performance, get your hands dirty and use the NtQueryDirectoryFile that's internal to Windows, with a large buffer size.
FindFirstFile is already slow, and while FindFirstFileEx is a bit better, the best performance will come from calling the native function directly.
Currently I have a .txt file of about 170,000 jpg file names and I read them all into a List (fileNames).
I want to search ONE folder (this folder has sub-folders) to check if each file in fileNames exists in this folder and if it does, copy it to a new folder.
I was making a rough estimate but each search and copy for each file name in fileNames takes about .5 seconds. So 170,000 seconds is roughly 48 hours so divide by 2 that will take about 24 hours for my app to have searched for every single file name using 1 thread! Obviously this is too long so I want to narrow this down and speed the process up. What is the best way to go about doing this using multi-threading?
Currently I was thinking of making 20 separate threads and splitting my list (fileNames) into 20 different lists and search for the files simultaneously. For example I would have 20 different threads executing the below at the same time:
foreach (string str in fileNames)
{
foreach (var file in Directory.GetFiles(folderToCheckForFileName, str, SearchOption.AllDirectories))
{
string combinedPath = Path.Combine(newTargetDirectory, Path.GetFileName(file));
if (!File.Exists(combinedPath))
{
File.Copy(file, combinedPath);
}
}
}
UPDATED TO SHOW MY SOLUTION BELOW:
string[] folderToCheckForFileNames = Directory.GetFiles("C:\\Users\\Alex\\Desktop\\ok", "*.jpg", SearchOption.AllDirectories);
foreach(string str in fileNames)
{
Parallel.ForEach(folderToCheckForFileNames, currentFile =>
{
string filename = Path.GetFileName(currentFile);
if (str == filename)
{
string combinedPath = Path.Combine(targetDir, filename);
if (!File.Exists(combinedPath))
{
File.Copy(currentFile, combinedPath);
Console.WriteLine("FOUND A MATCH AND COPIED" + currentFile);
}
}
}
);
}
Thank you everyone for your contributions! Greatly Appreciated!
Instead of using ordinary foreach statement in doing your search, you should use parallel linq. Parallel linq combines the simplicity and readability of LINQ syntax with the power of parallel programming. Just like code that targets the Task Parallel Library. This will shield you from low level thread manipulation and probable exceptions (hard to find/debug exceptions) while splitting your work among many threads. So you might do something like this:
fileNames.AsParallel().ForAll(str =>
{
var files = Directory.GetFiles(folderToCheckForFileName, str, SearchOption.AllDirectories);
files.AsParallel().ForAll(file =>
{
if (!string.IsNullOrEmpty(file))
{
string combinedPath = Path.Combine(newTargetDirectory, Path.GetFileName(file));
if (!File.Exists(combinedPath))
{
File.Copy(file, combinedPath);
}
}
});
});
20 different threads won't help if your computer has fewer than 20 cores. In fact, it can make the process slower because you will 1) have to spend time context switching between each thread (which is your CPU's way of emulating more than 1 thread / core) and 2) a Thread in .NET reserves 1 MB for its stack, which is pretty hefty.
Instead, try dividing your I/O into async workloads, using Task.Run for the CPU-bound / intensive parts. Also, keep your number of Tasks to maybe 4 to 8 at the max.
Sample code:
var tasks = new Task[8];
var names = fileNames.ToArray();
for (int i = 0; i < tasks.Length; i++)
{
int index = i;
tasks[i] = Task.Run(() =>
{
for (int current = index; current < names.Length; current += 8)
{
// execute the workload
string str = names[current];
foreach (var file in Directory.GetFiles(folderToCheckForFileName, str, SearchOption.AllDirectories))
{
string combinedPath = Path.Combine(newTargetDirectory, Path.GetFileName(file));
if (!File.Exists(combinedPath))
{
File.Copy(file, combinedPath);
}
}
}
});
}
Task.WaitAll(tasks);
Following is the code which processes about 10000 files.
var files = Directory.GetFiles(directorypath, "*.*", SearchOption.AllDirectories).Where(
name => !name.EndsWith(".gif") && !name.EndsWith(".jpg") && !name.EndsWith(".png")).ToList();
Parallel.ForEach(files,Countnumberofwordsineachfile);
And the Countnumberofwordsineachfile function prints the number of words in each file into the text.
Whenever i implement Parallel.ForEach(), i miss about 4-5 files everytime while processing.
Can anyone suggest as to why this happens?
public void Countnumberofwordsineachfile(string filepath)
{
string[] arrwordsinfile = Regex.Split(File.ReadAllText(filepath).Trim(), #"\s+");
Charactercount = Convert.ToInt32(arrwordsinfile.Length);
filecontent.AppendLine(filepath + "=" + Charactercount);
}
fileContent is probably not threadsafe. So if two (or more) tasks attempt to append to it at the same time one will win, the other will not. You need to remember to either lock the sections that are shared, or don't used shared data.
This is probably the easiest solution for your code. Locking, synchronises access (other tasks have to queue up to access the locked section) so it will slow down the algorithm, but since this is very short compared to the part that counts the words is likely to be then it isn't really going to be much of an issue.
private object myLock = new object();
public void Countnumberofwordsineachfile(string filepath)
{
string[] arrwordsinfile = Regex.Split(File.ReadAllText(filepath).Trim(), #"\s+");
Charactercount = Convert.ToInt32(arrwordsinfile.Length);
lock(myLock)
{
filecontent.AppendLine(filepath + "=" + Charactercount);
}
}
The cause has already been found, here is an alternative implementation:
//Parallel.ForEach(files,Countnumberofwordsineachfile);
var fileContent = files
.AsParallel()
.Select(f=> f + "=" + Countnumberofwordsineachfile(f));
and that requires a more useful design for the count method:
// make this an 'int' function, more reusable as well
public int Countnumberofwordsineachfile(string filepath)
{ ...; return characterCount; }
But do note that going parallel won't help you much here, your main function (ReadAllText) is I/O bound so you will most likely see a degradation from using AsParallel().
The better option is to use Directory.EnumerateFiles and then collect the results without parallelism:
var files = Directory.EnumerateFiles(....);
var fileContent = files
//.AsParallel()
.Select(f=> f + "=" + Countnumberofwordsineachfile(f));
I would like to optimize the following method, that returns the total file count of the specified folder and all subfolders, for speed and memory usage; any advice is appreciated.
Thanks.
private int countfiles(string srcdir)
{
try
{
DirectoryInfo dir = new DirectoryInfo(srcdir);
//if the source dir doesn't exist, throw an exception
if (!dir.Exists)
throw new ArgumentException("source dir doesn't exist -> " + srcdir);
int count = dir.GetFiles().Length;
//loop through each sub directory in the current dir
foreach (DirectoryInfo subdir in dir.GetDirectories())
{
//recursively call this function over and over again
count += countfiles(subdir.FullName);
}
//cleanup
dir = null;
return count;
}
catch (Exception exc)
{
MessageBox.Show(exc.Message);
return 0;
}
}
So I did some benchmarking with the suggestions that were proposed. Here are my findings:
My method, with recursion, is the slowest finding 9062 files in a directory tree in 6.234 seconds.
#Matthew’s answer, using SearchOption.AllDirectories, is the fastest finding the same 9062 files in 4.546 seconds
#Jeffery’s answer, using LINQ, is in the middle of the pack finding the same 9062 files in 5.562 seconds.
Thank you everyone for your suggestions.
Could you not change the entire method to:
int count = Directory.GetFiles(path, "*.*", SearchOption.AllDirectories).Length;
It looks pretty good to me, but I'd use a LINQ expression to get the count.
Try this:
int count = dir.GetFiles().Length + dir.GetDirectories().Sum(subdir =>countfiles(subdir.FullName));
Hope that helps!
I used the approach described here in the past, it shows with and without recursion and the one without is faster. Hope this helps ;-)
How to: Iterate Through a Directory Tree
If there are exceptions, then your user could end up seeing numerous message boxes as each call could show one. I'd consolidate them, allow the user can cancel the operation, or so that it'd back all the way out to the initial caller.
If you are using .NET 4.0, this is slightly faster but not by much.
static int RecurCount(string source)
{
int count = 0;
try
{
var dirs = Directory.EnumerateDirectories(source);
count = Directory.EnumerateFiles(source).Count();
foreach (string dir in dirs)
{
count += RecurCount(dir);
}
}
catch (Exception ex)
{
Console.WriteLine(ex.Message);
}
return count;
}
I want to be able to get the size of one of the local directories using C#. I'm trying to avoid the following (pseudo like code), although in the worst case scenario I will have to settle for this:
int GetSize(Directory)
{
int Size = 0;
foreach ( File in Directory )
{
FileInfo fInfo of File;
Size += fInfo.Size;
}
foreach ( SubDirectory in Directory )
{
Size += GetSize(SubDirectory);
}
return Size;
}
Basically, is there a Walk() available somewhere so that I can walk through the directory tree? Which would save the recursion of going through each sub-directory.
A very succinct way to get a folder size in .net 4.0 is below. It still suffers from the limitation of having to traverse all files recursively, but it doesn't load a potentially huge array of filenames, and it's only two lines of code. Make sure to use the namespaces System.IO and System.Linq.
private static long GetDirectorySize(string folderPath)
{
DirectoryInfo di = new DirectoryInfo(folderPath);
return di.EnumerateFiles("*.*", SearchOption.AllDirectories).Sum(fi => fi.Length);
}
If you use Directory.GetFiles you can do a recursive seach (using SearchOption.AllDirectories), but this is a bit flaky anyway (especially if you don't have access to one of the sub-directories) - and might involve a huge single array coming back (warning klaxon...).
I'd be happy with the recursion approach unless I could show (via profiling) a bottleneck; and then I'd probably switch to (single-level) Directory.GetFiles, using a Queue<string> to emulate recursion.
Note that .NET 4.0 introduces some enumerator-based file/directory listing methods which save on the big arrays.
Here my .NET 4.0 approach
public static long GetFileSizeSumFromDirectory(string searchDirectory)
{
var files = Directory.EnumerateFiles(searchDirectory);
// get the sizeof all files in the current directory
var currentSize = (from file in files let fileInfo = new FileInfo(file) select fileInfo.Length).Sum();
var directories = Directory.EnumerateDirectories(searchDirectory);
// get the size of all files in all subdirectories
var subDirSize = (from directory in directories select GetFileSizeSumFromDirectory(directory)).Sum();
return currentSize + subDirSize;
}
Or even nicer:
// get IEnumerable from all files in the current dir and all sub dirs
var files = Directory.EnumerateFiles(searchDirectory,"*",SearchOption.AllDirectories);
// get the size of all files
long sum = (from file in files let fileInfo = new FileInfo(file) select fileInfo .Length).Sum();
As Gabriel pointed out this will fail if you have a restricted directory under the searchDirectory!
You could hide your recursion behind an extension method (to avoid the issues Marc has highlighted with the GetFiles() method):
public static class UserExtension
{
public static IEnumerable<FileInfo> Walk(this DirectoryInfo directory)
{
foreach(FileInfo file in directory.GetFiles())
{
yield return file;
}
foreach(DirectoryInfo subDirectory in directory.GetDirectories())
{
foreach(FileInfo file in subDirectory.Walk())
{
yield return file;
}
}
}
}
(You probably want to add some exception handling to this for protected folders etc.)
Then:
using static UserExtension;
long totalSize = 0L;
var startFolder = new DirectoryInfo("<path to folder>");
// iteration
foreach(FileInfo file in startFolder.Walk())
{
totalSize += file.Length;
}
// linq
totalSize = di.Walk().Sum(s => s.Length);
Basically the same code, but maybe a little neater...
First, forgive my poor english ;o)
I had a problem that took me to this page : enumerate files of a directory and his subdirectories without blocking on an UnauthorizedAccessException, and, like the new method of .Net 4 DirectoryInfo.Enumerate..., get the first result before the end of the entire query.
With the help of various examples found here and there on the web, I finally write this method :
public static IEnumerable<FileInfo> EnumerateFiles_Recursive(this DirectoryInfo directory, string searchPattern, SearchOption searchOption, Func<DirectoryInfo, Exception, bool> handleExceptionAccess)
{
Queue<DirectoryInfo> subDirectories = new Queue<DirectoryInfo>();
IEnumerable<FileSystemInfo> entries = null;
// Try to get an enumerator on fileSystemInfos of directory
try
{
entries = directory.EnumerateFileSystemInfos(searchPattern, SearchOption.TopDirectoryOnly);
}
catch (Exception e)
{
// If there's a callback delegate and this delegate return true, we don't throw the exception
if (handleExceptionAccess == null || !handleExceptionAccess(directory, e))
throw;
// If the exception wasn't throw, we make entries reference an empty collection
entries = EmptyFileSystemInfos;
}
// Yield return file entries of the directory and enqueue the subdirectories
foreach (FileSystemInfo entrie in entries)
{
if (entrie is FileInfo)
yield return (FileInfo)entrie;
else if (entrie is DirectoryInfo)
subDirectories.Enqueue((DirectoryInfo)entrie);
}
// If recursive search, we make recursive call on the method to yield return entries of the subdirectories.
if (searchOption == SearchOption.AllDirectories)
{
DirectoryInfo subDir = null;
while (subDirectories.Count > 0)
{
subDir = subDirectories.Dequeue();
foreach (FileInfo file in subDir.EnumerateFiles_Recursive(searchPattern, searchOption, handleExceptionAccess))
{
yield return file;
}
}
}
else
subDirectories.Clear();
}
I use a Queue and a recursive method to keep traditional order (content of directory and then content of first subdirectory and his own subdirectories and then content of the second...). The parameter "handleExceptionAccess" is just a function call when an exception is thrown with a directory; the function must return true to indicate that the exception must be ignored.
With this methode, you can write :
DirectoryInfo dir = new DirectoryInfo("c:\\temp");
long size = dir.EnumerateFiles_Recursive("*", SearchOption.AllDirectories, (d, ex) => true).Sum(f => f.Length);
And here we are : all exception when trying to enumerate a directory will be ignore !
Hope this help
Lionel
PS : for a reason I can't explain, my method is more quick than the framework 4 one...
PPS : you can get my test solutions with source for those methods : here TestDirEnumerate. I write EnumerateFiles_Recursive, EnumerateFiles_NonRecursive (use a queue to avoid recursion) and EnumerateFiles_NonRecursive_TraditionalOrder (use a stack of queue to avoid recursion and keep traditional order). Keep those 3 methods has no interest, I write them only for test the best one. I think to keep only the last one.
I also wrote the equivalent for EnumerateFileSystemInfos and EnumerateDirectories.
Have a look at this post:
http://social.msdn.microsoft.com/forums/en-US/vbgeneral/thread/eed54ebe-facd-4305-b64b-9dbdc65df04e
Basically there is no clean .NET way, but there is a quite straightforward COM approach so if you're happy with using COM interop and being tied to Windows, this could work for you.
the solution is already here https://stackoverflow.com/a/12665904/1498669
as in the duplicate How do I Get Folder Size in C#? shown -> you can do this also in c#
first, add the COM reference "Microsoft Scripting Runtime" to your project and use:
var fso = new Scripting.FileSystemObject();
var folder = fso.GetFolder(#"C:\Windows");
double sizeInBytes = folder.Size;
// cleanup COM
System.Runtime.InteropServices.Marshal.ReleaseComObject(folder);
System.Runtime.InteropServices.Marshal.ReleaseComObject(fso);
remember to cleanup the COM references
I've been looking some time ago for a function like the one you ask for and from what I've found on the Internet and in MSDN forums, there is no such function.
The recursive way is the only I found to obtain the size of a Folder considering all the files and subfolders that contains.
You should make it easy on yourself. Make a method and passthrough the location of the directory.
private static long GetDirectorySize(string location) {
return new DirectoryInfo(location).GetFiles("*.*", SearchOption.AllDirectories).Sum(file => file.Length);
}
-G