Why is EnumerateFiles much quicker than calculating the sizes - c#

For my WPF project, I have to calculate the total file size in a single directory (which could have subdirectories).
Sample 1
DirectoryInfo di = new DirectoryInfo(path);
var totalLength = di.EnumerateFiles("*.*", SearchOption.AllDirectories).Sum(fi => fi.Length);
if (totalLength / 1000000 >= size)
return true;
Sample 2
var sizeOfHtmlDirectory = Directory.GetFiles(path, "*.*", SearchOption.AllDirectories);
long totalLength = 0;
foreach (var file in sizeOfHtmlDirectory)
{
    totalLength += new FileInfo(file).Length;
    if (totalLength / 1000000 >= size)
        return true;
}
Both samples work.
Sample 1 completes in a massively faster time. I haven't timed this accurately, but on my PC, using the same folder with the same content/file sizes, Sample 1 takes a few seconds while Sample 2 takes a few minutes.
EDIT
I should point out that the bottleneck in Sample 2 is within the foreach loop! GetFiles returns quickly and the foreach loop is entered quickly.
My question is, how do I find out why this is the case?

Contrary to what the other answers indicate, the main difference is not EnumerateFiles vs GetFiles - it's DirectoryInfo vs Directory: in the latter case you only have strings and have to create new FileInfo instances separately, which is very costly.
DirectoryInfo returns FileInfo instances that use cached information from the directory enumeration, whereas creating new FileInfo instances yourself does not - more details here and here.
Relevant quote (via "The Old New Thing"):
In NTFS, file system metadata is a property not of the directory entry
but rather of the file, with some of the metadata replicated into the
directory entry as a tweak to improve directory enumeration
performance. Functions like FindFirstFile report the directory
entry, and by putting the metadata that FAT users were accustomed to
getting "for free", they could avoid being slower than FAT for
directory listings. The directory-enumeration functions report the
last-updated metadata, which may not correspond to the actual metadata
if the directory entry is stale.
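To see for yourself where the time goes, you could time both samples with a Stopwatch; a minimal sketch (the path is a placeholder, and the usings are needed for a standalone run):
using System;
using System.Diagnostics;
using System.IO;
using System.Linq;

string path = @"C:\SomeFolder"; // placeholder

var sw = Stopwatch.StartNew();
// Sample 1: the FileInfo instances come straight from the enumeration
long fast = new DirectoryInfo(path)
    .EnumerateFiles("*.*", SearchOption.AllDirectories)
    .Sum(fi => fi.Length);
sw.Stop();
Console.WriteLine("DirectoryInfo.EnumerateFiles: {0} bytes in {1} ms", fast, sw.ElapsedMilliseconds);

sw.Restart();
// Sample 2: every path spawns a fresh FileInfo, which queries the file system again
long slow = 0;
foreach (var file in Directory.GetFiles(path, "*.*", SearchOption.AllDirectories))
    slow += new FileInfo(file).Length;
sw.Stop();
Console.WriteLine("Directory.GetFiles + new FileInfo: {0} bytes in {1} ms", slow, sw.ElapsedMilliseconds);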

EnumerateFiles is lazy: it streams results as the directory is walked, whereas GetFiles waits until all files have been enumerated before returning the collection of files. This will have a big effect on your result.
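A small illustration of that deferred behaviour (needs System.IO and System.Linq; the folder path and the count are arbitrary):
var firstTen = new DirectoryInfo(@"C:\SomeHugeFolder")
    .EnumerateFiles("*.*", SearchOption.AllDirectories)
    .Take(10)
    .ToList();          // returns almost immediately; only enough of the tree to yield ten files is walked

var everything = Directory.GetFiles(@"C:\SomeHugeFolder", "*.*", SearchOption.AllDirectories);
                        // blocks until the whole tree has been walked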

Related

display directories and files by size order c#

Trying to list all the directories and files on a machine and sort them by size.
I get a list of the file names and their sizes, but it won't put them in order... Any suggestions greatly appreciated! Cheers
// create an instance of the drive which contains the files
DriveInfo di = new DriveInfo(@"C:\");
// find the root directory path
DirectoryInfo dirInfo = di.RootDirectory;
try
{
    // EnumerateFiles improves performance; OrderBy sorts the results by size
    foreach (var fi in dirInfo.EnumerateFiles().OrderBy(f => f.Length).ToList())
    {
        try
        {
            // display each file
            Console.WriteLine("{0}\t\t{1}", fi.FullName, fi.Length);
        }
        catch (UnauthorizedAccessException UnAuthTop)
        {
            Console.WriteLine("{0}", UnAuthTop.Message);
        }
    }
}
catch (UnauthorizedAccessException UnAuthDir)
{
    Console.WriteLine("{0}", UnAuthDir.Message);
}
You could try something like this
// get your folder
DirectoryInfo di = new DirectoryInfo(@"your path here");
// create a list of files from that folder
List<FileInfo> fi = di.GetFiles().ToList();
// pass the files in a sorted order
var files = fi.Where(f => f.FullName != null).OrderByDescending(f => f.Length);
In this example, files will contain the files from the current folder level, sorted descending by file Length.
You might want to check that fi is not null or empty before passing it to files. Then you can iterate over files with a foreach.
[ UPDATE ]
As @Abion47 points out, there doesn't seem to be much difference between the OP's code and my solution. From what I read, the OP is not getting a sorted list, which is the desired result.
What might make a difference is that, by using EnumerateFiles, you start enumerating and can act on file info before the entire collection of files has been returned. It's great for handling enormous numbers of files, and more efficient than GetFiles for performing operations on individual files as they become available.
That being the case, you can't sort the returned files properly until the complete collection has been enumerated.
By using GetFiles, you have to wait for it to return the whole collection, which makes it easier to sort.
I don't think GetFiles is ideal for handling huge collections, though. In that case, I would divide the work into steps or use some other approach.
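One practical wrinkle when sorting a whole drive is that a single protected folder throws UnauthorizedAccessException and aborts the enumeration. A rough workaround is a hand-rolled recursive walk that skips unreadable directories and sorts afterwards; this is only a sketch (assumes System.IO, System.Linq and System.Collections.Generic are imported, and EnumerateFilesSafe is a helper name I made up):
static IEnumerable<FileInfo> EnumerateFilesSafe(DirectoryInfo root)
{
    FileInfo[] files = null;
    DirectoryInfo[] subDirs = null;
    try
    {
        files = root.GetFiles();
        subDirs = root.GetDirectories();
    }
    catch (UnauthorizedAccessException) { }   // skip folders we are not allowed to read
    catch (DirectoryNotFoundException) { }    // skip folders that disappeared mid-walk

    if (files != null)
        foreach (var f in files)
            yield return f;

    if (subDirs != null)
        foreach (var d in subDirs)
            foreach (var f in EnumerateFilesSafe(d))
                yield return f;
}

// usage: sort only once the walk has yielded everything it could reach
foreach (var fi in EnumerateFilesSafe(new DirectoryInfo(@"C:\")).OrderBy(f => f.Length))
    Console.WriteLine("{0}\t\t{1}", fi.FullName, fi.Length);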

Getting Files from huge Directories on NetDrives

I need to get files from a directory on a network drive. The problem is that this directory could contain 500k files or more.
The normal ways:
Directory.GetFiles(@"L:\cs\fromSQL\Data", "*.dat",
    SearchOption.TopDirectoryOnly);
or
DirectoryInfo dir = new DirectoryInfo(@"L:\cs\fromSQL\Data");
var files = dir.GetFiles("*.dat", SearchOption.TopDirectoryOnly);
are taking way too long. They always scan the whole directory.
Example: a network directory containing ~130k files, and the first option takes 15 minutes.
Is there a way to get just a subset of the files (for example, the oldest ones), or some other approach that's faster?
Thanks!
Greetings
Christoph
You can give the DirectoryInfo.EnumerateFiles method a try.
As MSDN says:
Returns an enumerable collection of file information in the current directory.
It is IEnumerable, so it can stream entries rather than buffer them all.
For example:
foreach(var file in Directory.EnumerateFiles(path)) {
// ...
}
More details from MSDN:
The EnumerateFiles and GetFiles methods differ as follows: When you
use EnumerateFiles, you can start enumerating the collection of
FileInfo objects before the whole collection is returned; when you use
GetFiles, you must wait for the whole array of FileInfo objects to be
returned before you can access the array. Therefore, when you are
working with many files and directories, EnumerateFiles can be more
efficient.
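Because the enumeration is streamed, you can also stop early, which matters on a slow network share. For example, if you only need the first batch of files rather than all 500k (the count of 1000 is arbitrary):
var firstBatch = new DirectoryInfo(@"L:\cs\fromSQL\Data")
    .EnumerateFiles("*.dat", SearchOption.TopDirectoryOnly)
    .Take(1000)          // stop pulling directory entries after 1000 matches
    .ToList();
Note that picking out the oldest files specifically still requires looking at every entry, since you cannot order by date without seeing the whole list.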
Use Directory.EnumerateFiles instead:
var count = Directory.EnumerateFiles(#"L:\cs\fromSQL\Data", "*.dat",
SearchOption.TopDirectoryOnly).Count();
If you want to filter some files, then use DirectoryInfo.EnumerateFiles and filter the files using Where:
var di = new DirectoryInfo(@"L:\cs\fromSQL\Data");
var count = di.EnumerateFiles("*.dat",SearchOption.TopDirectoryOnly)
.Where(file => /* your condition */)
.Count();
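Purely as an illustration of what such a condition could look like (the 30-day cutoff is made up), a filter on the last-write time would approximate "the oldest ones":
// illustrative filter: count .dat files not written to in the last 30 days
var cutoff = DateTime.UtcNow.AddDays(-30);
var di = new DirectoryInfo(@"L:\cs\fromSQL\Data");
var oldCount = di.EnumerateFiles("*.dat", SearchOption.TopDirectoryOnly)
                 .Count(file => file.LastWriteTimeUtc < cutoff);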

Retrieve a list of filenames in folder and all subfolders quickly

I need to get a list of all Word documents (*.doc and *.docx) that are stored in a Windows-based folder with many subfolders, sub-subfolders, etc.
Searching for a file with C# has an answer that works, but it is 2 years old and takes 10 seconds to search through 1500 files (in the future there may be 10,000 or more). I will post my code, which is basically a copy from the above link. Does anyone have a better solution?
DateTime dt = DateTime.Now;
DirectoryInfo dir = new DirectoryInfo(MainFolder);
List<FileInfo> matches =
    new List<FileInfo>(dir.GetFiles("*.doc*", SearchOption.AllDirectories));
TimeSpan ts = DateTime.Now-dt;
MessageBox.Show(matches.Count + " matches in " + ts.TotalSeconds + " seconds");
You can use Directory.EnumerateFiles instead of GetFiles. This has the advantage of returning the files as an IEnumerable<T>, which allows you to begin your processing of the result set immediately (instead of waiting for the entire list to be returned).
If you're merely counting the number of files or listing all files, it may not help. If, however, you can do your processing and/or filtering of the results, and especially if you can do any of it in other threads, it can be significantly faster.
From the documentation:
The EnumerateFiles and GetFiles methods differ as follows: When you use EnumerateFiles, you can start enumerating the collection of names before the whole collection is returned; when you use GetFiles, you must wait for the whole array of names to be returned before you can access the array. Therefore, when you are working with many files and directories, EnumerateFiles can be more efficient.
Doubt there's much you can do with that.
A more restrictive pattern might have an impact, but note that GetFiles does not accept multiple patterns separated by | (so dir.GetFiles("*.doc|*.docx", SearchOption.AllDirectories) won't work); "*.doc*" is about as tight as it gets, and anything beyond that has to be filtered from the results.
If you want the full list then, other than making sure the Windows Indexing Service is enabled on the target folders, not really. Your main delay is going to be reading from the hard drive, and no amount of optimizing your C# code will make that faster. You could create your own simple indexing service, perhaps using a FileSystemWatcher, which would give you sub-second response times no matter how many documents are added.
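A minimal sketch of that FileSystemWatcher idea, just to show the shape of it (needs System, System.IO and System.Collections.Concurrent): the in-memory dictionary stands in for the index, MainFolder is the watched root, and you would still seed the index with one initial EnumerateFiles pass before relying on it.
// very small "index": remember .doc/.docx paths as they appear, so later
// queries don't have to walk the directory tree at all
var index = new ConcurrentDictionary<string, byte>(StringComparer.OrdinalIgnoreCase);

var watcher = new FileSystemWatcher(MainFolder) { IncludeSubdirectories = true };
watcher.Created += (s, e) =>
{
    // check the extension in the handler rather than relying on a single Filter pattern
    if (e.FullPath.EndsWith(".doc", StringComparison.OrdinalIgnoreCase) ||
        e.FullPath.EndsWith(".docx", StringComparison.OrdinalIgnoreCase))
        index.TryAdd(e.FullPath, 0);
};
watcher.Deleted += (s, e) => index.TryRemove(e.FullPath, out _);
watcher.EnableRaisingEvents = true;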
First of all, I suggest you use Stopwatch instead of DateTime to measure the elapsed time.
Secondly, to make your search faster you shouldn't copy the result of GetFiles into a List but work directly with the array it returns.
And finally, you should optimize your search pattern: you want every .doc and .docx file, so try "*.doc?".
Here is my suggestion :
var sw = new Stopwatch();
sw.Start();
var matches = Directory.GetFiles(MainFolder, "*.doc?", SearchOption.AllDirectories);
sw.Stop();
MessageBox.Show(matches.Length + " matches in " + sw.Elapsed.TotalSeconds + " seconds");

Does DirectoryInfo.EnumerateDirectories sort items?

I have an application that has to enumerate a few folders and process the files inside them.
It has to support resuming, meaning it has to start from the folder it was processing the last time.
I was thinking of using the DirectoryInfo.EnumerateDirectories method. I'd save the name of the last processed dir in a file, skip the enumeration until I meet that dir name and continue processing from there.
However, the documentation does not say anything about the order in which files are enumerated.
Is it safe to assume that using this method the program will always process the remaining directories? Or is it possible that the next time the program runs the directories will be enumerated in another order, thus making it possible to leave some unprocessed and process others two times?
If this method is not safe, what would be a good alternative?
Internally EnumerateDirectories() uses Win32 API FindNextFile() (source code). From MSDN:
"The order in which the search returns the files, such as alphabetical order, is not guaranteed, and is dependent on the file system."
DirectoryInfo.EnumerateDirectories() returns IEnumerable<DirectoryInfo> (see the MSDN doc). I don't think it is sorted, and even if it were, the question would remain: sorted by which field or property of DirectoryInfo?
You can do the sorting yourself in the query.
// Create a DirectoryInfo for the Program Files directory.
DirectoryInfo dirPrograms = new DirectoryInfo(@"c:\program files");
DateTime StartOf2009 = new DateTime(2009, 01, 01);
// LINQ query for all directories created before 2009, sorted by name.
var dirs = (from dir in dirPrograms.EnumerateDirectories()
            where dir.CreationTimeUtc < StartOf2009
            orderby dir.Name
            select dir).ToList();

What is the fastest way of deleting files in a directory? (Except specific file extension)

I have seen questions like What is the best way to empty a directory?
But I need to know,
what is the fastest way of deleting all the files found within the directory, except any .zip files?
Smells like LINQ here... or what?
By "fastest" I mean the fastest execution time.
If you are using .NET 4 you can benefit from the smart way .NET now parallelizes your functions. This code is the fastest way to do it, and it scales with the number of cores on your processor too.
DirectoryInfo di = new DirectoryInfo(yourDir);
var files = di.GetFiles();
files.AsParallel().Where(f => f.Extension != ".zip").ForAll((f) => f.Delete());
By fastest, are you asking for the fewest lines of code or the quickest execution time? Here is a sample using LINQ with a parallel ForEach loop to delete them quickly.
string[] files = System.IO.Directory.GetFiles("c:\\temp", "*.*", System.IO.SearchOption.TopDirectoryOnly);
List<string> del = (
    from string s in files
    where !s.EndsWith(".zip")
    select s).ToList();
Parallel.ForEach(del, (string s) => { System.IO.File.Delete(s); });
At the time of writing this answer, none of the previous answers used Directory.EnumerateFiles(), which allows you to carry out operations on the list of files while the list is still being constructed.
Code:
Parallel.ForEach(Directory.EnumerateFiles(path, "*", SearchOption.AllDirectories).AsParallel(), Item =>
{
    if (!string.Equals(Path.GetExtension(Item), ".zip", StringComparison.OrdinalIgnoreCase))
        File.Delete(Item);
});
As far as I know, the performance gain from using AsParallel() shouldn't be significant (if there is one at all) in this case; however, it did make a difference in my case.
I compared the time it takes to delete all but the .zip files in a list of 4689 files (of which 10 were zip files) using:
1 - foreach
2 - parallel foreach
3 - IEnumerable().AsParallel().ForAll
4 - parallel foreach over IEnumerable().AsParallel(), as illustrated above
Results:
1 - 1545
2 - 1015
3 - 1103
4 - 839
The fifth and last case was a normal foreach using Directory.GetFiles():
5 - 2266
Of course the results weren't conclusive; as far as I know, to do proper benchmarking you need to use a RAM drive instead of an HDD.
Note that the performance difference between EnumerateFiles and GetFiles becomes more apparent as the number of files increases.
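If you want to reproduce a comparison like this yourself, one rough approach is to generate throwaway files in a temp folder and time a variant with Stopwatch, recreating the files before each run (needs System.Diagnostics and System.Threading.Tasks in addition to System.IO; the file count is arbitrary). A sketch only:
string testDir = Path.Combine(Path.GetTempPath(), "delete-benchmark");
Directory.CreateDirectory(testDir);

// create throwaway files; every tenth one is a .zip that must survive
for (int i = 0; i < 5000; i++)
    File.WriteAllText(Path.Combine(testDir, "file" + i + (i % 10 == 0 ? ".zip" : ".tmp")), "x");

var sw = Stopwatch.StartNew();
Parallel.ForEach(Directory.EnumerateFiles(testDir, "*", SearchOption.AllDirectories), item =>
{
    if (!string.Equals(Path.GetExtension(item), ".zip", StringComparison.OrdinalIgnoreCase))
        File.Delete(item);
});
sw.Stop();
Console.WriteLine("EnumerateFiles + Parallel.ForEach: {0} ms", sw.ElapsedMilliseconds);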
Here's plain old C#
foreach (string file in Directory.GetFiles(Server.MapPath("~/yourdirectory")))
{
    if (Path.GetExtension(file) != ".zip")
    {
        File.Delete(file);
    }
}
And here's LINQ
var files = from f in Directory.GetFiles("")
            where Path.GetExtension(f) != ".zip"
            select f;
foreach (string file in files)
    File.Delete(file);
