Does DirectoryInfo.EnumerateDirectories sort items? - c#

I have an application that has to enumerate a few folders and process the files inside them.
It has to support resuming, meaning it has to start from the folder it was processing the last time.
I was thinking of using the DirectoryInfo.EnumerateDirectories method. I'd save the name of the last processed dir in a file, skip the enumeration until I meet that dir name and continue processing from there.
However, the documentation does not say anything about the order in which files are enumerated.
Is it safe to assume that with this method the program will always process the remaining directories? Or is it possible that the next time the program runs the directories will be enumerated in a different order, making it possible to leave some unprocessed and to process others twice?
If this method is not safe, what would be a good alternative?

Internally EnumerateDirectories() uses Win32 API FindNextFile() (source code). From MSDN:
"The order in which the search returns the files, such as alphabetical order, is not guaranteed, and is dependent on the file system."

DirectoryInfo.EnumerateDirectories() returns IEnumerable<DirectoryInfo>, per the MSDN docs. I don't think it is sorted, and even if it were, there would be the question of which field or property of DirectoryInfo it is sorted by.
You can do the sorting yourself in the query:
// Create a DirectoryInfo for the Program Files directory.
DirectoryInfo dirPrograms = new DirectoryInfo(@"c:\program files");
DateTime StartOf2009 = new DateTime(2009, 01, 01);
// LINQ query for all directories created before 2009, ordered by name.
var dirs = (from dir in dirPrograms.EnumerateDirectories()
            where dir.CreationTimeUtc < StartOf2009
            orderby dir.Name
            select dir).ToList();

Related

Why is EnumerateFiles much quicker than GetFiles when calculating directory sizes?

For my WPF project, I have to calculate the total file size in a single directory (which could have subdirectories).
Sample 1
DirectoryInfo di = new DirectoryInfo(path);
var totalLength = di.EnumerateFiles("*.*", SearchOption.AllDirectories).Sum(fi => fi.Length);
if (totalLength / 1000000 >= size)
return true;
Sample 2
var sizeOfHtmlDirectory = Directory.GetFiles(path, "*.*", SearchOption.AllDirectories);
long totalLength = 0;
foreach (var file in sizeOfHtmlDirectory)
{
totalLength += new FileInfo(file).Length;
if (totalLength / 1000000 >= size)
return true;
}
Both samples work.
Sample 1 completes in a massively faster time. I've not timed this accurately, but on my PC, using the same folder with the same content/file sizes, Sample 1 takes a few seconds while Sample 2 takes a few minutes.
EDIT
I should point out that the bottleneck in Sample 2 is within the foreach loop! The GetFiles call returns quickly and the foreach loop is entered quickly.
My question is, how do I find out why this is the case?
Contrary to what the other answers indicate, the main difference is not EnumerateFiles vs GetFiles; it's DirectoryInfo vs Directory. In the latter case you only have strings and have to create new FileInfo instances separately, which is very costly.
DirectoryInfo returns FileInfo instances that use cached information, whereas directly creating new FileInfo instances does not; more details here and here.
Relevant quote (via "The Old New Thing"):
In NTFS, file system metadata is a property not of the directory entry
but rather of the file, with some of the metadata replicated into the
directory entry as a tweak to improve directory enumeration
performance. Functions like FindFirstFile report the directory
entry, and by putting the metadata that FAT users were accustomed to
getting "for free", they could avoid being slower than FAT for
directory listings. The directory-enumeration functions report the
last-updated metadata, which may not correspond to the actual metadata
if the directory entry is stale.
EnumerateFiles is lazy and streams results as they are found, whereas GetFiles waits until all files have been enumerated before returning the collection. This will have a big effect on your result.
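Because the enumeration is lazy, Sample 1 could even be rewritten to stop as soon as the threshold is reached instead of summing every file first. A small sketch along those lines, reusing path and size from the question:
long totalLength = 0;
foreach (var fi in new DirectoryInfo(path).EnumerateFiles("*.*", SearchOption.AllDirectories))
{
    totalLength += fi.Length;
    if (totalLength / 1000000 >= size)
        return true; // stop enumerating; the remaining files never need to be touched
}
return false;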

Most efficient way to retrieve one file path from a list of folders

Let's say I have a list of over 100 folder paths. I would like to retrieve just one file path from each folder path. Here is the way I am doing it, or plan to do it:
var Files = new List<String>();
var Directories = Directory.GetDirectories("C:\\Firstfolder\\Secondfolder\\");
Array.ForEach(Directories, D => Files.Add(Directory.GetFiles(D).FirstOrDefault()));
Now, is this the most efficient way? Because my program will execute this code every time it starts.
Instead of Directory.GetFiles, use Directory.EnumerateFiles to avoid loading all file paths into memory. This quote from the documentation explains the difference:
The EnumerateFiles and GetFiles methods differ as follows: When you use EnumerateFiles, you can start enumerating the collection of names before the whole collection is returned; when you use GetFiles, you must wait for the whole array of names to be returned before you can access the array. Therefore, when you are working with many files and directories, EnumerateFiles can be more efficient.
If you are using .NET 4.0 you should do this instead (Select rather than SelectMany, since FirstOrDefault already yields a single path, plus a null check for folders that contain no files):
var Files = Directories.Select(d => Directory.EnumerateFiles(d).FirstOrDefault())
                       .Where(f => f != null)
                       .ToList();
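Going one step further, the directory listing itself can also be streamed so that no up-front array of directory paths is built; a sketch of the combined approach:
var Files = Directory.EnumerateDirectories("C:\\Firstfolder\\Secondfolder\\")
                     .Select(d => Directory.EnumerateFiles(d).FirstOrDefault())
                     .Where(f => f != null) // skip folders that contain no files
                     .ToList();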

Getting files from huge directories on network drives

I need to get files from a directory on a network drive. The problem is that this directory could contain 500k files or more.
The normal ways:
Directory.GetFiles(@"L:\cs\fromSQL\Data", "*.dat",
    SearchOption.TopDirectoryOnly);
or
DirectoryInfo dir = new DirectoryInfo(@"L:\cs\fromSQL\Data");
var files =
dir.GetFiles("*.dat", SearchOption.TopDirectoryOnly)
are taking way too long. They always parse the whole directory.
Example: for a network directory containing ~130k files, the first option takes 15 minutes.
Is there a way to get just a limited number of files (for example the oldest ones), or some other approach that's faster?
Thanks!
Greetings
Christoph
You can give the DirectoryInfo.EnumerateFiles method a try.
As MSDN says:
"Returns an enumerable collection of file information in the current directory."
It is IEnumerable, so it can stream entries rather than buffer them all.
For example:
foreach(var file in Directory.EnumerateFiles(path)) {
// ...
}
More details on MSDN:
The EnumerateFiles and GetFiles methods differ as follows: When you
use EnumerateFiles, you can start enumerating the collection of
FileInfo objects before the whole collection is returned; when you use
GetFiles, you must wait for the whole array of FileInfo objects to be
returned before you can access the array. Therefore, when you are
working with many files and directories, EnumerateFiles can be more
efficient.
Use Directory.EnumerateFiles instead:
var count = Directory.EnumerateFiles(@"L:\cs\fromSQL\Data", "*.dat",
    SearchOption.TopDirectoryOnly).Count();
If you want to filter some files, then use DirectoryInfo.EnumerateFiles and filter the files using Where:
var di = new DirectoryInfo(@"L:\cs\fromSQL\Data");
var count = di.EnumerateFiles("*.dat",SearchOption.TopDirectoryOnly)
.Where(file => /* your condition */)
.Count();
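To address the "just a number of files" part of the question: because the enumeration is lazy you can stop after the first N entries, but picking the oldest files still forces a full scan since the enumeration order is undefined. A sketch of both, with an arbitrary count of 100:
// First 100 matches, without reading the rest of the directory.
var firstBatch = Directory.EnumerateFiles(@"L:\cs\fromSQL\Data", "*.dat", SearchOption.TopDirectoryOnly)
    .Take(100)
    .ToList();
// The 100 oldest files; this still has to enumerate every entry in order to sort them.
var oldest = new DirectoryInfo(@"L:\cs\fromSQL\Data")
    .EnumerateFiles("*.dat", SearchOption.TopDirectoryOnly)
    .OrderBy(f => f.CreationTimeUtc)
    .Take(100)
    .ToList();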

Retrieve a list of filenames in folder and all subfolders quickly

I need to get a list of all Word documents (*.doc and *.docx) that are stored in a Windows-based folder, with many subfolders, sub-subfolders, etc.
Searching for a file with C# has an answer that works, but it is 2 years old and takes 10 seconds to search through 1500 files (in the future there may be 10,000 or more). I will post my code, which is basically a copy from the above link. Does anyone have a better solution?
DateTime dt = DateTime.Now;
DirectoryInfo dir = new DirectoryInfo(MainFolder);
List<FileInfo> matches =
new List<FileInfo>(dir.GetFiles("*.doc*",SearchOption.AllDirectories));
TimeSpan ts = DateTime.Now-dt;
MessageBox.Show(matches.Count + " matches in " + ts.TotalSeconds + " seconds");
You can use Directory.EnumerateFiles instead of GetFiles. This has the advantage of returning the files as an IEnumerable<T>, which allows you to begin your processing of the result set immediately (instead of waiting for the entire list to be returned).
If you're merely counting the number of files or listing all files, it may not help. If, however, you can do your processing and/or filtering of the results, and especially if you can do any of it in other threads, it can be significantly faster.
From the documentation:
The EnumerateFiles and GetFiles methods differ as follows: When you use EnumerateFiles, you can start enumerating the collection of names before the whole collection is returned; when you use GetFiles, you must wait for the whole array of names to be returned before you can access the array. Therefore, when you are working with many files and directories, EnumerateFiles can be more efficient.
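For example, a sketch that filters to exactly .doc and .docx and lets you start work on each match as soon as it is returned:
var extensions = new[] { ".doc", ".docx" };
var matches = new DirectoryInfo(MainFolder)
    .EnumerateFiles("*.doc*", SearchOption.AllDirectories)
    .Where(f => extensions.Contains(f.Extension, StringComparer.OrdinalIgnoreCase));
foreach (var file in matches)
{
    // handle each document here instead of waiting for the full list
}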
Doubt there's much you can do with that. A more restrictive search pattern might have an impact, but note that GetFiles/EnumerateFiles only accept a single pattern, so something like "*.doc|*.docx" won't work; you'd have to run two searches or filter by extension yourself.
If you want the full list, other than making sure the Windows Indexing Service is enabled on the target folders, not really. Your main delay is going to be reading from the hard drive, and no optimizing of your C# code will make that process any faster. You could create your own simple indexing service, perhaps using a FileSystemWatcher, that would give you sub-second response times no matter how many documents are added.
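A rough sketch of what such a self-maintained index might look like, leaving out the initial scan and any thread safety:
var index = new HashSet<string>(StringComparer.OrdinalIgnoreCase);
// ... fill index once with an initial EnumerateFiles scan ...
var watcher = new FileSystemWatcher(MainFolder, "*.doc*")
{
    IncludeSubdirectories = true,
    EnableRaisingEvents = true
};
watcher.Created += (s, e) => index.Add(e.FullPath);
watcher.Deleted += (s, e) => index.Remove(e.FullPath);
watcher.Renamed += (s, e) => { index.Remove(e.OldFullPath); index.Add(e.FullPath); };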
First, I suggest you use Stopwatch instead of DateTime to measure the elapsed time.
Second, to make your search faster you shouldn't copy the result of GetFiles into a List but use the returned array directly.
And finally, you should optimize your search pattern: you want every .doc and .docx file, so try "*.doc?" (in these search patterns, ? matches zero or one character).
Here is my suggestion:
var sw = new Stopwatch();
sw.Start();
var matches = Directory.GetFiles(MainFolder, "*.doc?", SearchOption.AllDirectories);
sw.Stop();
MessageBox.Show(matches.Length + " matches in " + sw.Elapsed.TotalSeconds + " seconds");

File/directory search

Any idea how to easily support file search patterns in your software, like **, *, ?
For example, subfolder/**/?svn - search all levels of subfolder for files/folders ending in "svn", 4 characters in total.
full description: http://nant.sourceforge.net/release/latest/help/types/fileset.html
If you load the directory as a DirectoryInfo, e.g.
DirectoryInfo directory = new DirectoryInfo(folder);
then do a search for files like this:
IEnumerable<FileInfo> fileInfo = directory.GetFiles("*.svn", SearchOption.AllDirectories);
This should get you a list of FileInfo objects which you can manipulate.
To get all the subdirectories you can do the same, e.g.
IEnumerable<DirectoryInfo> dirInfo = directory.GetDirectories("*svn", SearchOption.AllDirectories);
Anyway, that should give you an idea of how I'd do it. Also, because fileInfo and dirInfo are IEnumerable, you can add LINQ Where queries etc. to filter the results.
A mix of regex and recursion should do the trick (a rough sketch of the regex part follows below).
Another trick might be to spawn a thread for every folder or set of folders and have each thread proceed checking one more level down. This could speed up the process a bit.
The reason I say this is that checking folders is a highly IO-bound process, so many threads will let you submit more disk requests faster, thus improving the speed.
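To expand on the regex-plus-recursion idea, here is a rough sketch; the GlobToRegex helper and the root variable are hypothetical, and the translation only covers *, ? and **/:
// Hypothetical helper: translate a NAnt-style pattern into a Regex (not a complete implementation).
static Regex GlobToRegex(string pattern)
{
    string escaped = Regex.Escape(pattern.Replace('\\', '/'));
    string translated = escaped
        .Replace(@"\*\*/", "(.*/)?")   // "**/" matches any number of directory levels, including none
        .Replace(@"\*", "[^/]*")       // "*" matches anything except a path separator
        .Replace(@"\?", "[^/]");       // "?" matches exactly one character
    return new Regex("^" + translated + "$", RegexOptions.IgnoreCase);
}
// Usage: match the pattern against paths relative to the search root.
var re = GlobToRegex("subfolder/**/?svn");
var hits = Directory.EnumerateFileSystemEntries(root, "*", SearchOption.AllDirectories)
    .Where(p => re.IsMatch(p.Substring(root.Length).TrimStart('\\', '/').Replace('\\', '/')));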
This might sound silly, but have you considered downloading the NAnt source code to see how they did it?
