Getting files from huge directories on network drives - C#

I need to get files from a directory on a network drive. The problem is that this directory could contain 500k files or more.
The normal ways:
Directory.GetFiles(@"L:\cs\fromSQL\Data", "*.dat",
    SearchOption.TopDirectoryOnly);
or
DirectoryInfo dir = new DirectoryInfo(@"L:\cs\fromSQL\Data");
var files =
    dir.GetFiles("*.dat", SearchOption.TopDirectoryOnly);
are taking way too long. They always parse the whole directory.
Example: a network-drive directory containing ~130k files, where the first option takes 15 minutes.
Is there a way to get just a number of files (for example the oldest ones) or something else that's faster?
Thanks!
Greetings
Christoph

You can give the DirectoryInfo.EnumerateFiles method a try.
As MSDN says:
Returns an enumerable collection of file information in the current directory.
It is IEnumerable, so it can stream entries rather than buffer them all.
For example:
foreach(var file in Directory.EnumerateFiles(path)) {
// ...
}
More details from MSDN:
The EnumerateFiles and GetFiles methods differ as follows: When you
use EnumerateFiles, you can start enumerating the collection of
FileInfo objects before the whole collection is returned; when you use
GetFiles, you must wait for the whole array of FileInfo objects to be
returned before you can access the array. Therefore, when you are
working with many files and directories, EnumerateFiles can be more
efficient.
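For instance, if you only need a batch of entries rather than the whole listing, the enumeration can stop early. A minimal sketch (the batch size of 100 is an assumption, not from the question):
using System.IO;
using System.Linq;

// Streams directory entries and stops after the first 100 matches,
// instead of buffering all 500k entries the way GetFiles would.
var firstBatch = new DirectoryInfo(@"L:\cs\fromSQL\Data")
    .EnumerateFiles("*.dat", SearchOption.TopDirectoryOnly)
    .Take(100)
    .ToList();
Note that picking the oldest files specifically still forces a full enumeration, since every entry's timestamp has to be inspected before the oldest ones are known.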

Use Directory.EnumerateFiles instead:
var count = Directory.EnumerateFiles(@"L:\cs\fromSQL\Data", "*.dat",
    SearchOption.TopDirectoryOnly).Count();
If you want to filter some files, then use DirectoryInfo.EnumerateFiles and filter the files using Where:
var di = new DirectoryInfo(@"L:\cs\fromSQL\Data");
var count = di.EnumerateFiles("*.dat",SearchOption.TopDirectoryOnly)
.Where(file => /* your condition */)
.Count();
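As a hypothetical example of such a condition (the one-week cutoff is an assumption, not from the question), this counts only files older than a week:
var di = new DirectoryInfo(@"L:\cs\fromSQL\Data");
var cutoff = DateTime.Now.AddDays(-7);            // hypothetical threshold
var count = di.EnumerateFiles("*.dat", SearchOption.TopDirectoryOnly)
    .Where(file => file.LastWriteTime < cutoff)   // keep "old" files only
    .Count();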

Related

Display directories and files by size order - C#

Trying to list all the directories and files on a machine and sort them by size.
I get a list of the file names and their sizes, but it won't put them in order... Any suggestions greatly appreciated! Cheers
//create an instance of the drive which contains the files
DriveInfo di = new DriveInfo(@"C:\");
//find the root directory path
DirectoryInfo dirInfo = di.RootDirectory;
try
{
    //EnumerateFiles increases performance; OrderBy sorts the files by size
    foreach (var fi in dirInfo.EnumerateFiles().OrderBy(f => f.Length).ToList())
    {
        try
        {
            //Display each file
            Console.WriteLine("{0}\t\t{1}", fi.FullName, fi.Length);
        }
        catch (UnauthorizedAccessException UnAuthTop)
        {
            Console.WriteLine("{0}", UnAuthTop.Message);
        }
    }
}
catch (UnauthorizedAccessException UnAuthDir)
{
    Console.WriteLine("{0}", UnAuthDir.Message);
}
You could try something like this
// get your folder
DirectoryInfo di = new DirectoryInfo(yourPathHere);
// create a list of files from that folder
List<FileInfo> fi = di.GetFiles().ToList();
// pass the files in a sorted order
var files = fi.Where(f => f.FullName != null).OrderByDescending(f => f.Length);
In this example, files will contain a list of files from the current folder level, sorted descending by file Length.
You might want to check that fi is not empty before passing it to files. Then you can iterate over files with a foreach.
[ UPDATE ]
As @Abion47 points out, there doesn't seem to be much difference between the OP's code and my solution. From what I read in the OP, the OP is not getting a sorted list, which is the desired result.
What I see that might make a difference is that, by using EnumerateFiles, you start enumerating and can act on each file's info before the entire collection is returned. That's great for handling enormous numbers of files, and more efficient than GetFiles for performing operations on individual files as they become available.
Since that is the case, though, you cannot sort the returned files properly until the complete collection has been enumerated.
By using GetFiles, you wait for the whole collection to be returned, which makes it easier to sort.
I don't think GetFiles is ideal for handling huge collections, however. In that case, I would divide the work into steps or use some other approach.
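A minimal sketch of that divide-the-work idea (the folder path and the cap of 10 are hypothetical): stream the entries with EnumerateFiles and keep only the N largest files seen so far, so the full listing is never buffered in memory.
using System.Collections.Generic;
using System.IO;

var dir = new DirectoryInfo(@"C:\SomeBigFolder");     // hypothetical path
const int n = 10;
var top = new List<FileInfo>(n + 1);
foreach (var fi in dir.EnumerateFiles())
{
    top.Add(fi);
    top.Sort((a, b) => b.Length.CompareTo(a.Length)); // largest first
    if (top.Count > n)
        top.RemoveAt(n);                              // drop the smallest
}
// top now holds the n largest files, already in size order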

Why is EnumerateFiles so much quicker at calculating the sizes?

For my WPF project, I have to calculate the total file size in a single directory (which could have subdirectories).
Sample 1
DirectoryInfo di = new DirectoryInfo(path);
var totalLength = di.EnumerateFiles("*.*", SearchOption.AllDirectories).Sum(fi => fi.Length);
if (totalLength / 1000000 >= size)
return true;
Sample 2
var sizeOfHtmlDirectory = Directory.GetFiles(path, "*.*", SearchOption.AllDirectories);
long totalLength = 0;
foreach (var file in sizeOfHtmlDirectory)
{
totalLength += new FileInfo(file).Length;
if (totalLength / 1000000 >= size)
return true;
}
Both samples work.
Sample 1 completes massively faster. I've not timed this accurately, but on my PC, using the same folder with the same content/file sizes, Sample 1 takes a few seconds while Sample 2 takes a few minutes.
EDIT
I should point out that the bottleneck in Sample 2 is within the foreach loop! GetFiles returns quickly and the foreach loop is entered quickly.
My question is, how do I find out why this is the case?
Contrary to what the other answers indicate, the main difference is not EnumerateFiles vs GetFiles - it's DirectoryInfo vs Directory - in the latter case you only have strings and have to create new FileInfo instances separately, which is very costly.
DirectoryInfo returns FileInfo instances that use cached information vs directly creating new FileInfo instances which does not - more details here and here.
Relevant quote (via "The Old New Thing"):
In NTFS, file system metadata is a property not of the directory entry
but rather of the file, with some of the metadata replicated into the
directory entry as a tweak to improve directory enumeration
performance. Functions like FindFirstFile report the directory
entry, and by putting the metadata that FAT users were accustomed to
getting "for free", they could avoid being slower than FAT for
directory listings. The directory-enumeration functions report the
last-updated metadata, which may not correspond to the actual metadata
if the directory entry is stale.
EnumerateFiles is lazy (it yields results as they are found), whereas GetFiles waits until all files have been enumerated before returning the collection of files. This will have a big effect on your result.
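A quick way to observe this (path here is a placeholder): Take(1) on the lazy sequence returns as soon as the first entry is found, while GetFiles blocks until the whole scan completes.
var first = Directory.EnumerateFiles(path).Take(1).ToList(); // returns almost immediately
var all = Directory.GetFiles(path);                          // blocks until the full scan completes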

Most efficient way to retrieve one file path from a list of folders

Let's say I have a list of over 100 folder paths. I would like to retrieve just one file path from each folder. Here is how I am doing it, or plan to do it:
var Files = new List<String>();
var Directories = Directory.GetDirectories("C:\\Firstfolder\\Secondfolder\\");
Array.ForEach(Directories, D => Files.Add(Directory.GetFiles(D).FirstOrDefault()));
Now, is this the most efficient way? Because my program will execute this code every time it starts.
Instead of Directory.GetFiles, use Directory.EnumerateFiles to avoid loading all file paths into memory. This quote from the documentation explains the difference:
The EnumerateFiles and GetFiles methods differ as follows: When you use EnumerateFiles, you can start enumerating the collection of names before the whole collection is returned; when you use GetFiles, you must wait for the whole array of names to be returned before you can access the array. Therefore, when you are working with many files and directories, EnumerateFiles can be more efficient.
If you are using .NET 4.0 you should do this instead (note Select rather than SelectMany; SelectMany would flatten each path string into its individual characters):
var Files = Directories.Select(x => Directory.EnumerateFiles(x).FirstOrDefault()).ToList();
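One caveat: FirstOrDefault() yields null for an empty folder, so you may want to filter those entries out:
var Files = Directories
    .Select(d => Directory.EnumerateFiles(d).FirstOrDefault())
    .Where(f => f != null)
    .ToList();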

How do I get the nth file of a directory by alphabetical order?

Well I haven't seen this before but is there a special method that will open the nth file in a directory?
Although I don't really have much code to show, I know that this can get me the number of files in the directory:
int fileCount = Directory.GetFiles(@"C:\Users\user\Collection").Length;
Is there such a thing as a "file index"? I prefer not to convert it to an array, as it has over 900 files inside.
EX: I want the 3rd file of the directory. It is named "test.txt".
Directory.GetFiles returns an array of full paths. Simply sort it and access it via index.
var files = Directory.GetFiles(@"C:\Users\user\Collection").OrderBy(name => name).ToArray();
File.ReadAllText(files[index]); // index is your N
You could use Directory.EnumerateFiles
var result = Directory.EnumerateFiles(#"C:\Users\user\Collection")
.OrderBy(x => x)
.Skip(2)
.First();
Console.WriteLine(result);
The advantage of EnumerateFiles is that you don't need the full list of files in memory; you can start applying the required logic immediately. (Note, though, that OrderBy has to buffer the whole sequence before it can sort, so in this particular query the gain is limited.)
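One caveat: Skip(2).First() throws if the folder holds fewer than three files; FirstOrDefault() returns null instead:
var result = Directory.EnumerateFiles(@"C:\Users\user\Collection")
    .OrderBy(x => x)
    .Skip(2)
    .FirstOrDefault(); // null instead of an exception when there is no 3rd file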

What is the fastest way of deleting files in a directory? (except a specific file extension)

I have seen questions like What is the best way to empty a directory?
But I need to know:
what is the fastest way of deleting all the files within a directory, except any .zip files?
Smells like LINQ here... or what?
By "fastest" I mean the fastest execution time.
If you are using .NET 4 you can benefit from the smart way .NET now parallelizes your functions. This code is the fastest way to do it, and it scales with the number of cores on your processor too.
DirectoryInfo di = new DirectoryInfo(yourDir);
var files = di.GetFiles();
files.AsParallel().Where(f => f.Extension != ".zip").ForAll((f) => f.Delete());
By fastest, are you asking for the fewest lines of code or the quickest execution time? Here is a sample using LINQ with a parallel foreach loop to delete them quickly.
string[] files = System.IO.Directory.GetFiles("c:\\temp", "*.*", System.IO.SearchOption.TopDirectoryOnly);
List<string> del = (
    from string s in files
    where !s.EndsWith(".zip")
    select s).ToList();
Parallel.ForEach(del, (string s) => { System.IO.File.Delete(s); });
At the time of writing, none of the previous answers used Directory.EnumerateFiles(), which lets you operate on the list of files while the list is still being constructed.
Code:
Parallel.ForEach(Directory.EnumerateFiles(path, "*", SearchOption.AllDirectories).AsParallel(), Item =>
{
if(!string.Equals(Path.GetExtension(Item), ".zip",StringComparison.OrdinalIgnoreCase))
File.Delete(Item);
});
As far as I know, the performance gain from using AsParallel() shouldn't be significant (if present at all) in this case; however, it did make a difference in mine.
I compared the time it takes to delete all but the .zip files in a list of 4689 files, of which 10 were zip files, using:
1. foreach
2. parallel foreach
3. IEnumerable().AsParallel().ForAll
4. parallel foreach using IEnumerable().AsParallel(), as illustrated above
Results:
1. 1545
2. 1015
3. 1103
4. 839
The fifth and last case was a normal foreach using Directory.GetFiles():
5. 2266
Of course the results weren't conclusive; as far as I know, to carry out a proper benchmark you need to use a RAM drive instead of an HDD.
Note that the performance difference between EnumerateFiles and GetFiles becomes more apparent as the number of files increases.
Here's plain old C#
foreach(string file in Directory.GetFiles(Server.MapPath("~/yourdirectory")))
{
if(Path.GetExtension(file) != ".zip")
{
File.Delete(file);
}
}
And here's LINQ
var files = from f in Directory.GetFiles("")
where Path.GetExtension(f) != ".zip"
select f;
foreach(string file in files)
File.Delete(file);
