UNC path pointing to local directory much slower than local access - c#

Some code I'm working with occasionally needs to refer to long UNC paths (e.g. \\?\UNC\MachineName\Path), and we've discovered that no matter where the directory is located, even on the same machine, access through the UNC path is much slower than through the local path.
For example, we've written some benchmarking code that writes a string of gibberish to a file, then later reads it back, multiple times. I'm testing it with 6 different ways to access the same shared directory on my dev machine, with the code running on the same machine:
C:\Temp
\\MachineName\Temp
\\?\C:\Temp
\\?\UNC\MachineName\Temp
\\127.0.0.1\Temp
\\?\UNC\127.0.0.1\Temp
And here are the results:
Testing: C:\Temp
Wrote 1000 files to C:\Temp in 861.0647 ms
Read 1000 files from C:\Temp in 60.0744 ms
Testing: \\MachineName\Temp
Wrote 1000 files to \\MachineName\Temp in 2270.2051 ms
Read 1000 files from \\MachineName\Temp in 1655.0815 ms
Testing: \\?\C:\Temp
Wrote 1000 files to \\?\C:\Temp in 916.0596 ms
Read 1000 files from \\?\C:\Temp in 60.0517 ms
Testing: \\?\UNC\MachineName\Temp
Wrote 1000 files to \\?\UNC\MachineName\Temp in 2499.3235 ms
Read 1000 files from \\?\UNC\MachineName\Temp in 1684.2291 ms
Testing: \\127.0.0.1\Temp
Wrote 1000 files to \\127.0.0.1\Temp in 2516.2847 ms
Read 1000 files from \\127.0.0.1\Temp in 1721.1925 ms
Testing: \\?\UNC\127.0.0.1\Temp
Wrote 1000 files to \\?\UNC\127.0.0.1\Temp in 2499.3211 ms
Read 1000 files from \\?\UNC\127.0.0.1\Temp in 1678.18 ms
I tried the IP address to rule out a DNS issue. Could it be checking credentials or permissions on each file access? If so, is there a way to cache that? Does it just assume that, because it's a UNC path, everything should go over TCP/IP instead of accessing the disk directly? Is there something wrong with the code we're using for the reads/writes? I've pulled out the pertinent parts for benchmarking, shown below:
using System;
using System.Collections.Generic;
using System.IO;
using System.Runtime.InteropServices;
using System.Text;
using Microsoft.Win32.SafeHandles;
using Util.FileSystem;
namespace UNCWriteTest {
internal class Program {
[DllImport("kernel32.dll", CharSet = CharSet.Auto, SetLastError = true)]
public static extern bool DeleteFile(string path); // File.Delete doesn't handle \\?\UNC\ paths
private const int N = 1000;
private const string TextToSerialize =
"asd;lgviajsmfopajwf0923p84jtmpq93worjgfq0394jktp9orgjawefuogahejngfmliqwegfnailsjdhfmasodfhnasjldgifvsdkuhjsmdofasldhjfasolfgiasngouahfmp9284jfqp92384fhjwp90c8jkp04jk34pofj4eo9aWIUEgjaoswdfg8jmp409c8jmwoeifulhnjq34lotgfhnq34g";
private static readonly byte[] _Buffer = Encoding.UTF8.GetBytes(TextToSerialize);
public static string WriteFile(string basedir) {
string fileName = Path.Combine(basedir, string.Format("{0}.tmp", Guid.NewGuid()));
try {
IntPtr writeHandle = NativeFileHandler.CreateFile(
fileName,
NativeFileHandler.EFileAccess.GenericWrite,
NativeFileHandler.EFileShare.None,
IntPtr.Zero,
NativeFileHandler.ECreationDisposition.New,
NativeFileHandler.EFileAttributes.Normal,
IntPtr.Zero);
// if file was locked
int fileError = Marshal.GetLastWin32Error();
if ((fileError == 32 /* ERROR_SHARING_VIOLATION */) || (fileError == 80 /* ERROR_FILE_EXISTS */)) {
throw new Exception("oopsy");
}
using (var h = new SafeFileHandle(writeHandle, true)) {
using (var fs = new FileStream(h, FileAccess.Write, NativeFileHandler.DiskPageSize)) {
fs.Write(_Buffer, 0, _Buffer.Length);
}
}
}
catch (IOException) {
throw;
}
catch (Exception ex) {
throw new InvalidOperationException(" code " + Marshal.GetLastWin32Error(), ex);
}
return fileName;
}
public static void ReadFile(string fileName) {
var fileHandle =
new SafeFileHandle(
NativeFileHandler.CreateFile(fileName, NativeFileHandler.EFileAccess.GenericRead, NativeFileHandler.EFileShare.Read, IntPtr.Zero,
NativeFileHandler.ECreationDisposition.OpenExisting, NativeFileHandler.EFileAttributes.Normal, IntPtr.Zero), true);
using (fileHandle) {
//check the handle here to get a bit cleaner exception semantics
if (fileHandle.IsInvalid) {
//ms-help://MS.MSSDK.1033/MS.WinSDK.1033/debug/base/system_error_codes__0-499_.htm
int errorCode = Marshal.GetLastWin32Error();
//now that we've taken more than our allotted share of time, throw the exception
throw new IOException(string.Format("file read failed on {0} with error code {1}", fileName, errorCode));
}
//we have a valid handle and can actually read a stream, exceptions from serialization bubble out
using (var fs = new FileStream(fileHandle, FileAccess.Read, 1*NativeFileHandler.DiskPageSize)) {
//if serialization fails, we'll just let the normal serialization exception flow out
var foo = new byte[256];
fs.Read(foo, 0, 256);
}
}
}
public static string[] TestWrites(string baseDir) {
try {
var fileNames = new List<string>();
DateTime start = DateTime.UtcNow;
for (int i = 0; i < N; i++) {
fileNames.Add(WriteFile(baseDir));
}
DateTime end = DateTime.UtcNow;
Console.Out.WriteLine("Wrote {0} files to {1} in {2} ms", N, baseDir, end.Subtract(start).TotalMilliseconds);
return fileNames.ToArray();
}
catch (Exception e) {
Console.Out.WriteLine("Failed to write for " + baseDir + " Exception: " + e.Message);
return new string[] {};
}
}
public static void TestReads(string baseDir, string[] fileNames) {
try {
DateTime start = DateTime.UtcNow;
for (int i = 0; i < N; i++) {
ReadFile(fileNames[i%fileNames.Length]);
}
DateTime end = DateTime.UtcNow;
Console.Out.WriteLine("Read {0} files from {1} in {2} ms", N, baseDir, end.Subtract(start).TotalMilliseconds);
}
catch (Exception e) {
Console.Out.WriteLine("Failed to read for " + baseDir + " Exception: " + e.Message);
}
}
private static void Main(string[] args) {
foreach (string baseDir in args) {
Console.Out.WriteLine("Testing: {0}", baseDir);
string[] fileNames = TestWrites(baseDir);
TestReads(baseDir, fileNames);
foreach (string fileName in fileNames) {
DeleteFile(fileName);
}
}
}
}
}

This doesn't surprise me. You're writing/reading a fairly small amount of data, so the file system cache is probably minimizing the impact of the physical disk I/O; basically, the bottleneck is going to be the CPU. I'm not certain whether the traffic goes via the TCP/IP stack or not, but at a minimum the SMB protocol is involved. For one thing, that means requests are passed back and forth between the SMB client process and the SMB server process, so you've got context switching among three distinct processes, including your own. Using the local file system path you switch into kernel mode and back, but no other process is involved. Context switching is much slower than the transition to and from kernel mode.
There are likely to be two distinct additional overheads: one per file and one per kilobyte of data. In this particular test the per-file SMB overhead is likely to be dominant. Because the amount of data involved also affects the impact of physical disk I/O, you may find that this is only really a problem when dealing with lots of small files.
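If you want to confirm that the per-file cost is what dominates, a quick check (my own sketch, not part of the original benchmark; it assumes it is dropped into the Program class above so it can reuse the N and _Buffer fields, and it uses a plain FileStream instead of the native CreateFile) is to write the same total amount of data as a single file over each path and compare:
public static void TestSingleLargeWrite(string baseDir) {
    string fileName = Path.Combine(baseDir, Guid.NewGuid() + ".tmp");
    DateTime start = DateTime.UtcNow;
    using (var fs = new FileStream(fileName, FileMode.CreateNew, FileAccess.Write)) {
        for (int i = 0; i < N; i++) {
            fs.Write(_Buffer, 0, _Buffer.Length); // same total payload, but only one open/close
        }
    }
    DateTime end = DateTime.UtcNow;
    Console.Out.WriteLine("Wrote 1 file of {0} bytes to {1} in {2} ms",
        (long)N * _Buffer.Length, baseDir, end.Subtract(start).TotalMilliseconds);
}
If the UNC timings collapse toward the local ones in this variant, the per-file SMB round trips are the bottleneck rather than raw throughput.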

Related

How to empty contents of a log file being used by the same program

I have a C# application which uses log4net to write log output to a file named "logfile.txt" in the application directory. I want to empty the contents of the file as soon as it reaches a size of 10 GB.
For that I'm using a timer which keeps checking whether the size of the file has crossed 10 GB.
But I cannot perform any operation on "logfile.txt", since it is being used by other threads to write log output, and it throws:
System.IO.IOException "The process cannot access the file 'C:\Program Files\MyApps\TestApp1\logfile.txt' because it is being used by another process."
Here is the code of the timer which checks the size of the file "logfile.txt":
private void timer_file_size_check_Tick(object sender, EventArgs e)
{
try
{
string log_file_path = "C:\\Program Files\\MyApps\\TestApp1\\logfile.txt";
FileInfo f = new FileInfo(log_file_path);
bool ex;
long s1;
if (ex = f.Exists)
{
s1 = f.Length;
if (s1 > 10737418240)
{
System.GC.Collect();
System.GC.WaitForPendingFinalizers();
File.Delete(log_file_path);
//File.Create(log_file_path).Close();
//File.Delete(log_file_path);
//var fs = new FileStream(log_file_path, FileMode.Truncate);
}
}
else
{
MDIParent.log.Error("Log file doesn't exists..");
}
}
catch (Exception er)
{
MDIParent.log.Error("Exceptipon :: " + er.ToString());
}
}
You shouldn't delete a log file on your own because log4net can do it for you. If you use RollingFileAppender you can specify the maximum file size (maximumFileSize property). Additionally if you set maxSizeRollBackups property to 0, then the log file will be truncated when it reaches the limit. Please look at this question for an example.
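For reference, here is a hedged sketch of that configuration done programmatically (the same settings are more commonly placed in the XML config; the property names are the standard log4net RollingFileAppender ones, and the path and layout pattern are just placeholders):
using log4net.Appender;
using log4net.Config;
using log4net.Layout;
internal static class LoggingSetup {
    public static void Configure() {
        var layout = new PatternLayout { ConversionPattern = "%date %-5level %message%newline" };
        layout.ActivateOptions();
        var appender = new RollingFileAppender {
            File = @"C:\Program Files\MyApps\TestApp1\logfile.txt",
            AppendToFile = true,
            RollingStyle = RollingFileAppender.RollingMode.Size,
            MaximumFileSize = "10GB", // roll once this size is reached
            MaxSizeRollBackups = 0,   // keep no rolled backups, i.e. effectively truncate
            StaticLogFileName = true,
            Layout = layout
        };
        appender.ActivateOptions();
        BasicConfigurator.Configure(appender);
    }
}
With this in place the timer and the manual File.Delete call become unnecessary; log4net handles the rollover itself.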

fileinfo.CreationTime stays the same after new file is created [duplicate]

This question already has answers here:
Windows filesystem: Creation time of a file doesn't change when while is deleted and created again
(2 answers)
Closed 9 years ago.
I have a logging class. It creates a new log.txt file if one isn't present and writes messages to that file. I also have a method that runs to check the file size and when the file was created, against local settings. If the difference between log.txt's creation time and the current time exceeds the local settings' MaxLogHours value, the file is archived to a local archive folder and deleted. A new log.txt file is created by the above process the next time a log message is sent to the class.
This works great, except that when I look at FileInfo.CreationTime for my log.txt file, it is always the same (7/17/2012 12:05:18 PM) no matter what I do. I've manually deleted the file, the program deletes it, always the same. What is going on here? I also timestamp the old ones, but still nothing works. Does Windows think the file is the same one because it has the same filename? I'd appreciate any help, thanks!
archive method
public static void ArchiveLog(Settings s)
{
FileInfo fi = new FileInfo(AppDomain.CurrentDomain.BaseDirectory + "\\log.txt");
string archiveDir = AppDomain.CurrentDomain.BaseDirectory + "\\archive";
TimeSpan ts = DateTime.Now - fi.CreationTime;
if ((s.MaxLogKB != 0 && fi.Length >= s.MaxLogKB * 1000) ||
(s.MaxLogHours != 0 && ts.TotalHours >= s.MaxLogHours))
{
if (!Directory.Exists(archiveDir))
{
Directory.CreateDirectory(archiveDir);
}
string archiveFile = archiveDir + "\\log" + string.Format("{0:MMddyyhhmmss}", DateTime.Now) + ".txt";
File.Copy(AppDomain.CurrentDomain.BaseDirectory + "\\log.txt", archiveFile);
File.Delete(AppDomain.CurrentDomain.BaseDirectory + "\\log.txt");
}
}
Writing/Creating the log:
public static void MsgLog(string Msg, bool IsStandardMsg = true)
{
try
{
using (StreamWriter sw = new StreamWriter(Directory.GetCurrentDirectory() + "\\log.txt", true))
{
sw.WriteLine("Msg at " + DateTime.Now + " - " + Msg);
Console.Out.WriteLine(Msg);
}
}
catch (Exception ex)
{
Console.Out.WriteLine(ex.Message);
}
}
This can happen; it's documented under FileSystemInfo.CreationTime:
This method may return an inaccurate value, because it uses native
functions whose values may not be continuously updated by the
operating system.
I think the problem is that you are using FileInfo.CreationTime without first checking whether the file still exists. Run this POC: it will always print "After delete CreationTime: 1/1/1601 12:00:00 AM", because the file does not exist anymore and you did not touch FileInfo.CreationTime prior to the delete. However, if you uncomment the line:
//Console.WriteLine("Before delete CreationTime: {0}", fi.CreationTime);
in the code below, then, strangely, both calls return the correct, updated value.
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text;
using System.Threading;
namespace ConsoleApplication17088573
{
class Program
{
static void Main(string[] args)
{
for (int i = 0; i < 10; i++)
{
string fname = "testlog.txt";
using (var fl = File.Create(fname))
{
using (var sw = new StreamWriter(fl))
{
sw.WriteLine("Current datetime is {0}", DateTime.Now);
}
}
var fi = new FileInfo(fname);
//Console.WriteLine("Before delete CreationTime: {0}", fi.CreationTime);
File.Delete(fname);
Console.WriteLine("After delete CreationTime: {0}", fi.CreationTime);
Thread.Sleep(1000);
}
}
}
}
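The practical takeaway (my own note, consistent with the caching behavior described above): FileInfo lazily loads and caches the values it has read, so if you hold one instance across a delete/recreate cycle, call Refresh() (or construct a new FileInfo) before reading CreationTime, for example:
var fi = new FileInfo("log.txt");
// ... file is archived, deleted and recreated elsewhere ...
fi.Refresh();                       // re-read the metadata from disk
Console.WriteLine(fi.CreationTime); // no longer the stale cached value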

C# code to copy files, can this snippet be improved?

While copying around 50 GB of data via a local LAN share, a connectivity issue caused the copy to fail at around 10 GB copied.
I renamed the directory holding the 10 GB already copied to localRepository, and then wrote a C# program to copy files from the remote server to the destination only if they are not found in the local repository. If a file is found there, it is moved from the local repository to the destination folder instead.
The code works fine and accomplishes the task well. I wonder, though, have I written the most efficient code? Can you find any improvements?
string destinationFolder = @"C:\DataFolder";
string remoteRepository = @"\\RemoteComputer\DataFolder";
string localRepository = @"\\LocalComputer\LocalRepository";
protected void Page_Load(object sender, EventArgs e)
{
foreach (string remoteSrcFile in Directory.EnumerateFiles(remoteRepository, "*.*", SearchOption.AllDirectories))
{
bool foundInLocalRepo = false;
foreach (var localSrcFile in Directory.EnumerateFiles(localRepository, "*.*", SearchOption.AllDirectories))
{
if (Path.GetFileName(remoteSrcFile).Equals(Path.GetFileName(localSrcFile)))
{
FileInfo localFile = new FileInfo(localSrcFile);
FileInfo remoteFile = new FileInfo(remoteSrcFile);
//copy this file from local repository
if (localFile.Length == remoteFile.Length)
{
try
{
File.Move(localSrcFile, PrepareDestinationPath(remoteSrcFile));
Debug.WriteLine(remoteSrcFile + " moved from local repo");
}
catch (Exception ex)
{
Debug.WriteLine(remoteSrcFile + " did not move");
}
foundInLocalRepo = true;
break;
}
}
}
if (!foundInLocalRepo)
{
//copy this file from remote repository
try
{
File.Copy(remoteSrcFile, PrepareDestinationPath(remoteSrcFile), false);
Debug.WriteLine(remoteSrcFile + " copied from remote repo");
}
catch (Exception ex)
{
Debug.WriteLine(remoteSrcFile + " did not copy");
}
}
}
}
private string PrepareDestinationPath(string remoteSrcFile)
{
string relativePath = remoteSrcFile.Split(new string[] { "DataFolder" }, StringSplitOptions.None)[1];
string copyPath = Path.GetFullPath(destinationFolder + relativePath);
Directory.CreateDirectory(Path.GetDirectoryName(copyPath));
return copyPath;
}
EDIT:
Based on the answer given by Thomas, I am attempting to zip the files.
Traditionally, as end users, we zip a file and then copy it. As programmers, can we zip and copy the file in parallel? I mean, send each portion over the wire as soon as it has been zipped?
You are doing far too much work with the nested loop.
You should remove the inner "foreach" and replace it with some code that:
(1) Constructs the name of the file that you are looking for and
(2) Uses File.Exists() to see if it exists, then
(3) Continues with the same block of code that you currently have following the "if (Path.GetFileName(remoteSrcFile)..." condition.
Something like this:
foreach (string remoteSrcFile in Directory.EnumerateFiles(remoteRepository, "*.*", SearchOption.AllDirectories))
{
string localSrcFile = Path.Combine(localRepository, Path.GetFileName(remoteSrcFile));
if (File.Exists(localSrcFile))
{
...
}
}
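One way to flesh that out (my own sketch, not part of the answer; it needs using System.Linq, preserves the original "search the whole local repository" semantics by building a filename lookup once, and reuses the localRepository/remoteRepository/PrepareDestinationPath members from the question):
var localFilesByName = Directory
    .EnumerateFiles(localRepository, "*.*", SearchOption.AllDirectories)
    .GroupBy(Path.GetFileName, StringComparer.OrdinalIgnoreCase)
    .ToDictionary(g => g.Key, g => g.First(), StringComparer.OrdinalIgnoreCase);
foreach (string remoteSrcFile in Directory.EnumerateFiles(remoteRepository, "*.*", SearchOption.AllDirectories)) {
    string localSrcFile;
    if (localFilesByName.TryGetValue(Path.GetFileName(remoteSrcFile), out localSrcFile)
        && new FileInfo(localSrcFile).Length == new FileInfo(remoteSrcFile).Length) {
        File.Move(localSrcFile, PrepareDestinationPath(remoteSrcFile));      // reuse the local copy
    } else {
        File.Copy(remoteSrcFile, PrepareDestinationPath(remoteSrcFile), false); // fetch from remote
    }
}
This enumerates the local repository once instead of once per remote file.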
I would suggest zipping the files before moving them. Take a look at the very simple http://dotnetzip.codeplex.com/
Try zipping 1000 files at a time; that way you don't have to run the loop as many times or establish new connections each time.
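If you want to stay with what ships in the framework, a hedged sketch of the batching idea using System.IO.Compression (instead of the DotNetZip library linked above) might look like this; the grouping of files and the entry naming are placeholders:
// requires references to System.IO.Compression and System.IO.Compression.FileSystem
using System.Collections.Generic;
using System.IO;
using System.IO.Compression;
static class BatchZipper {
    public static void CopyAsZippedBatch(IEnumerable<string> files, string destinationZip) {
        using (var zip = ZipFile.Open(destinationZip, ZipArchiveMode.Create)) {
            foreach (var file in files) {
                // store each file under its own name; adjust the entry name if the
                // relative directory structure needs to be preserved
                zip.CreateEntryFromFile(file, Path.GetFileName(file));
            }
        }
    }
}
Zip a batch locally, copy the single archive across the share, then extract on the other side; one large transfer avoids most of the per-file overhead.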

Detecting a File Delete on an Open File

I am opening a file with read access and allowing subsequent read|write|delete file share access to the file (tailing the file). If the file is deleted during processing is there a way to detect that the file is pending delete (see Files section http://msdn.microsoft.com/en-us/library/aa363858(v=VS.85).aspx)? If some outside process (the owning process) has issued a delete, I want to close my handle as soon as possible to allow the file deletion so as not to interfere with any logic in the owning process.
I'm in C# and see no method of detecting the pending delete. The file was opened using a FileStream object. Is there some method for detecting the delete in C# or in some other windows function?
You can use the Windows API function GetFileInformationByHandleEx to detect a pending delete on a file you have open. The second argument is an enumeration value which lets you specify what kind of information the function should return. The FileStandardInfo (1) value will cause it to return the FILE_STANDARD_INFO structure, which includes a DeletePending boolean.
Here is a demonstration utility:
using System;
using System.Text;
using System.IO;
using System.Runtime.InteropServices;
using System.Threading;
internal static class Native
{
[DllImport("kernel32.dll", SetLastError = true)]
public extern static bool GetFileInformationByHandleEx(IntPtr hFile,
int FileInformationClass,
IntPtr lpFileInformation,
uint dwBufferSize);
public struct FILE_STANDARD_INFO
{
public long AllocationSize;
public long EndOfFile;
public uint NumberOfLinks;
public byte DeletePending;
public byte Directory;
}
public const int FileStandardInfo = 1;
}
internal static class Program
{
public static bool IsDeletePending(FileStream fs)
{
IntPtr buf = Marshal.AllocHGlobal(4096);
try
{
IntPtr handle = fs.SafeFileHandle.DangerousGetHandle();
if (!Native.GetFileInformationByHandleEx(handle,
Native.FileStandardInfo,
buf,
4096))
{
Exception ex = new Exception("GetFileInformationByHandleEx() failed");
ex.Data["error"] = Marshal.GetLastWin32Error();
throw ex;
}
else
{
Native.FILE_STANDARD_INFO info = Marshal.PtrToStructure<Native.FILE_STANDARD_INFO>(buf);
return info.DeletePending != 0;
}
}
finally
{
Marshal.FreeHGlobal(buf);
}
}
public static int Main(string[] args)
{
TimeSpan MAX_WAIT_TIME = TimeSpan.FromSeconds(10);
if (args.Length == 0)
{
args = new string[] { "deleteme.txt" };
}
for (int i = 0; i < args.Length; ++i)
{
string filename = args[i];
FileStream fs = null;
try
{
fs = File.Open(filename,
FileMode.CreateNew,
FileAccess.Write,
FileShare.ReadWrite | FileShare.Delete);
byte[] buf = new byte[4096];
UTF8Encoding utf8 = new UTF8Encoding(false);
string text = "hello world!\r\n";
int written = utf8.GetBytes(text, 0, text.Length, buf, 0);
fs.Write(buf, 0, written);
fs.Flush();
Console.WriteLine("{0}: created and wrote line", filename);
DateTime t0 = DateTime.UtcNow;
for (;;)
{
Thread.Sleep(16);
if (IsDeletePending(fs))
{
Console.WriteLine("{0}: detected pending delete", filename);
break;
}
if (DateTime.UtcNow - t0 > MAX_WAIT_TIME)
{
Console.WriteLine("{0}: timeout reached with no delete", filename);
break;
}
}
}
catch (Exception ex)
{
Console.WriteLine("{0}: {1}", filename, ex.Message);
}
finally
{
if (fs != null)
{
Console.WriteLine("{0}: closing", filename);
fs.Dispose();
}
}
}
return 0;
}
}
I would use a different signaling mechanism. (I am assuming all file access is within your control and not coming from a closed external program, mainly because of the flags being employed.)
The only "solution" within those bounds I can think of is to poll on file access and check the exception (if any) you get back. Perhaps there is something trickier at a lower level than the Win32 file API, but that is already going down the ugly path. :-)
FileSystemWatcher would probably be the closest thing, but it can't detect a "pending" delete; when the file IS deleted, an event will be raised on the FileSystemWatcher, and you can attach a handler that gracefully interrupts your file processing. If the lock (or lack of one) you acquire in opening the file makes it possible for the file to be deleted at all, simply closing your read-only FileStream when that happens should not affect the file system.
The basic steps are to create a watcher, passing the directory that contains the file (and optionally the file name as a filter) to the constructor. Then set its NotifyFilter to the type(s) of file system modification you want to watch for. Finally, attach your event handler to the Deleted event. The handler can be as simple as setting a flag somewhere that your main process can read, and closing the FileStream. You'll then get an exception on your next attempt to work with the stream; catch it, read the flag, and if it's set just gracefully stop doing file work. You can also put the file processing in a separate worker thread, and the event handler can tell that thread to shut down gracefully.
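A minimal sketch of those steps (my own illustration; fullPathToFile and the _stopProcessing flag are hypothetical names you would wire into your own code):
var watcher = new FileSystemWatcher(Path.GetDirectoryName(fullPathToFile)) {
    Filter = Path.GetFileName(fullPathToFile),
    NotifyFilter = NotifyFilters.FileName
};
watcher.Deleted += (s, e) => _stopProcessing = true; // check this flag in your read loop
watcher.EnableRaisingEvents = true;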
If the file is small enough, your application could process a copy of the file, rather than the file itself. Also, if your application needs to know whether the owning process deleted the original file, set up a FileSystemWatcher (FSW) on the file. When the file disappears, the FSW could set a flag to interrupt processing:
private volatile bool _fileExists = true; // volatile: written from the watcher's event thread
public void Process(string pathToOriginalFile, string pathToCopy)
{
File.Copy(pathToOriginalFile, pathToCopy);
FileSystemWatcher watcher = new FileSystemWatcher();
// FileSystemWatcher.Path must be a directory, so watch the containing folder
// and filter on the file name itself.
watcher.Path = Path.GetDirectoryName(pathToOriginalFile);
watcher.Filter = Path.GetFileName(pathToOriginalFile);
watcher.Deleted += new FileSystemEventHandler(OnFileDeleted);
bool doneProcessing = false;
watcher.EnableRaisingEvents = true;
while(_fileExists && !doneProcessing)
{
// process the copy here
}
...
}
private void OnFileDeleted(object source, FileSystemEventArgs e)
{
_fileExists = false;
}
No, there's no clean way to do this. If you were concerned about other processes opening and/or modifying the file, then oplocks could help you. But if you're just looking for notification of when the delete disposition gets set, there isn't a straightforward way to do it (short of building a file system filter, hooking the APIs, etc., all of which are spooky things for an application to be doing without very good reason).

Is there a faster way to scan through a directory recursively in .NET?

I am writing a directory scanner in .NET.
For each File/Dir I need the following info.
class Info {
public bool IsDirectory;
public string Path;
public DateTime ModifiedDate;
public DateTime CreatedDate;
}
I have this function:
static List<Info> RecursiveMovieFolderScan(string path){
var info = new List<Info>();
var dirInfo = new DirectoryInfo(path);
foreach (var dir in dirInfo.GetDirectories()) {
info.Add(new Info() {
IsDirectory = true,
CreatedDate = dir.CreationTimeUtc,
ModifiedDate = dir.LastWriteTimeUtc,
Path = dir.FullName
});
info.AddRange(RecursiveMovieFolderScan(dir.FullName));
}
foreach (var file in dirInfo.GetFiles()) {
info.Add(new Info()
{
IsDirectory = false,
CreatedDate = file.CreationTimeUtc,
ModifiedDate = file.LastWriteTimeUtc,
Path = file.FullName
});
}
return info;
}
Turns out this implementation is quite slow. Is there any way to speed this up? I'm thinking of hand-coding this with FindFirstFileW, but I would like to avoid that if there is a built-in way that is faster.
This implementation, which needs a bit of tweaking, is 5-10x faster.
static List<Info> RecursiveScan2(string directory) {
IntPtr INVALID_HANDLE_VALUE = new IntPtr(-1);
WIN32_FIND_DATAW findData;
IntPtr findHandle = INVALID_HANDLE_VALUE;
var info = new List<Info>();
try {
findHandle = FindFirstFileW(directory + @"\*", out findData);
if (findHandle != INVALID_HANDLE_VALUE) {
do {
if (findData.cFileName == "." || findData.cFileName == "..") continue;
string fullpath = directory + (directory.EndsWith("\\") ? "" : "\\") + findData.cFileName;
bool isDir = false;
if ((findData.dwFileAttributes & FileAttributes.Directory) != 0) {
isDir = true;
info.AddRange(RecursiveScan2(fullpath));
}
info.Add(new Info()
{
CreatedDate = findData.ftCreationTime.ToDateTime(),
ModifiedDate = findData.ftLastWriteTime.ToDateTime(),
IsDirectory = isDir,
Path = fullpath
});
}
while (FindNextFile(findHandle, out findData));
}
} finally {
if (findHandle != INVALID_HANDLE_VALUE) FindClose(findHandle);
}
return info;
}
extension method:
public static class FILETIMEExtensions {
public static DateTime ToDateTime(this System.Runtime.InteropServices.ComTypes.FILETIME filetime ) {
long highBits = filetime.dwHighDateTime;
highBits = highBits << 32;
// cast dwLowDateTime through uint so a negative int value doesn't sign-extend
return DateTime.FromFileTimeUtc(highBits | (long)(uint)filetime.dwLowDateTime);
}
}
interop defs are:
[DllImport("kernel32.dll", CharSet = CharSet.Unicode, SetLastError = true)]
public static extern IntPtr FindFirstFileW(string lpFileName, out WIN32_FIND_DATAW lpFindFileData);
[DllImport("kernel32.dll", CharSet = CharSet.Unicode)]
public static extern bool FindNextFile(IntPtr hFindFile, out WIN32_FIND_DATAW lpFindFileData);
[DllImport("kernel32.dll")]
public static extern bool FindClose(IntPtr hFindFile);
[StructLayout(LayoutKind.Sequential, CharSet = CharSet.Unicode)]
public struct WIN32_FIND_DATAW {
public FileAttributes dwFileAttributes;
internal System.Runtime.InteropServices.ComTypes.FILETIME ftCreationTime;
internal System.Runtime.InteropServices.ComTypes.FILETIME ftLastAccessTime;
internal System.Runtime.InteropServices.ComTypes.FILETIME ftLastWriteTime;
public int nFileSizeHigh;
public int nFileSizeLow;
public int dwReserved0;
public int dwReserved1;
[MarshalAs(UnmanagedType.ByValTStr, SizeConst = 260)]
public string cFileName;
[MarshalAs(UnmanagedType.ByValTStr, SizeConst = 14)]
public string cAlternateFileName;
}
There is a long history of the .NET file enumeration methods being slow. The issue is that there is no instantaneous way of enumerating large directory structures. Even the accepted answer here has its issues with GC allocations.
The best I've been able to do is wrapped up in my library and exposed as the FindFile (source) class in the CSharpTest.Net.IO namespace. This class can enumerate files and folders without unneeded GC allocations and string marshaling.
The usage is simple enough, and setting the RaiseOnAccessDenied property to false will skip the directories and files the user does not have access to:
private static long SizeOf(string directory)
{
var fcounter = new CSharpTest.Net.IO.FindFile(directory, "*", true, true, true);
fcounter.RaiseOnAccessDenied = false;
long size = 0, total = 0;
fcounter.FileFound +=
(o, e) =>
{
if (!e.IsDirectory)
{
Interlocked.Increment(ref total);
size += e.Length;
}
};
Stopwatch sw = Stopwatch.StartNew();
fcounter.Find();
Console.WriteLine("Enumerated {0:n0} files totaling {1:n0} bytes in {2:n3} seconds.",
total, size, sw.Elapsed.TotalSeconds);
return size;
}
For my local C:\ drive this outputs the following:
Enumerated 810,046 files totaling 307,707,792,662 bytes in 232.876 seconds.
Your mileage may vary by drive speed, but this is the fastest method I've found of enumerating files in managed code. The event parameter is a mutating class of type FindFile.FileFoundEventArgs, so be sure you do not keep a reference to it, as its values will change for each event raised.
You might also note that the DateTimes exposed are only in UTC. The reason is that the conversion to local time is semi-expensive. You might consider using UTC times to improve performance rather than converting them to local time.
Depending on how much time you're trying to shave off the function, it may be worth your while to call the Win32 API functions directly, since the existing API does a lot of extra processing to check things that you may not be interested in.
If you haven't done so already, and assuming you don't intend to contribute to the Mono project, I would strongly recommend downloading Reflector and having a look at how Microsoft implemented the API calls you're currently using. This will give you an idea of what you need to call and what you can leave out.
You might, for example, opt to create an iterator that yields directory names instead of a function that returns a list, that way you don't end up iterating over the same list of names two or three times through all the various levels of code.
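For example, a lazy variant of the original scan (my own sketch, built on the question's Info class and the framework's EnumerateFileSystemInfos; it needs using System.Collections.Generic and System.IO) could yield entries as they are found instead of building the full list first:
static IEnumerable<Info> ScanLazily(string root) {
    var pending = new Stack<string>();
    pending.Push(root);
    while (pending.Count > 0) {
        var dirInfo = new DirectoryInfo(pending.Pop());
        foreach (var entry in dirInfo.EnumerateFileSystemInfos()) {
            bool isDir = (entry.Attributes & FileAttributes.Directory) != 0;
            if (isDir) pending.Push(entry.FullName);
            yield return new Info {
                IsDirectory = isDir,
                CreatedDate = entry.CreationTimeUtc,
                ModifiedDate = entry.LastWriteTimeUtc,
                Path = entry.FullName
            };
        }
    }
}
Callers can start processing (or abort) before the whole tree has been walked, which also avoids holding every Info in memory at once.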
I just ran across this. Nice implementation of the native version.
This version, while still slower than the version that uses FindFirst and FindNext, is quite a bit faster than your original .NET version.
static List<Info> RecursiveMovieFolderScan(string path)
{
var info = new List<Info>();
var dirInfo = new DirectoryInfo(path);
foreach (var entry in dirInfo.GetFileSystemInfos())
{
bool isDir = (entry.Attributes & FileAttributes.Directory) != 0;
if (isDir)
{
info.AddRange(RecursiveMovieFolderScan(entry.FullName));
}
info.Add(new Info()
{
IsDirectory = isDir,
CreatedDate = entry.CreationTimeUtc,
ModifiedDate = entry.LastWriteTimeUtc,
Path = entry.FullName
});
}
return info;
}
It should produce the same output as your native version. My testing shows that this version takes about 1.7 times as long as the version that uses FindFirst and FindNext. Timings obtained in release mode running without the debugger attached.
Curiously, changing the GetFileSystemInfos to EnumerateFileSystemInfos adds about 5% to the running time in my tests. I rather expected it to run at the same speed or possibly faster because it didn't have to create the array of FileSystemInfo objects.
The following code is shorter still, because it lets the Framework take care of recursion. But it's a good 15% to 20% slower than the version above.
static List<Info> RecursiveScan3(string path)
{
var info = new List<Info>();
var dirInfo = new DirectoryInfo(path);
foreach (var entry in dirInfo.EnumerateFileSystemInfos("*", SearchOption.AllDirectories))
{
info.Add(new Info()
{
IsDirectory = (entry.Attributes & FileAttributes.Directory) != 0,
CreatedDate = entry.CreationTimeUtc,
ModifiedDate = entry.LastWriteTimeUtc,
Path = entry.FullName
});
}
return info;
}
Again, if you change that to GetFileSystemInfos, it will be slightly (but only slightly) faster.
For my purposes, the first solution above is quite fast enough. The native version runs in about 1.6 seconds. The version that uses DirectoryInfo runs in about 2.9 seconds. I suppose if I were running these scans very frequently, I'd change my mind.
It's pretty shallow: 371 dirs with an average of 10 files in each directory; some dirs contain other sub dirs.
This is just a comment, but your numbers do appear to be quite high. I ran the below using essentially the same recursive method you are using and my times are far lower despite creating string output.
public void RecurseTest(DirectoryInfo dirInfo,
StringBuilder sb,
int depth)
{
_dirCounter++;
if (depth > _maxDepth)
_maxDepth = depth;
var array = dirInfo.GetFileSystemInfos();
foreach (var item in array)
{
sb.Append(item.FullName);
if (item is DirectoryInfo)
{
sb.Append(" (D)");
sb.AppendLine();
RecurseTest(item as DirectoryInfo, sb, depth+1);
}
else
{ _fileCounter++; }
sb.AppendLine();
}
}
I ran the above code on a number of different directories. On my machine the 2nd call to scan a directory tree was usually faster due to caching either by the runtime or the file system. Note that this system isn't anything too special, just a 1yr old development workstation.
// cached call
Dirs = 150, files = 420, max depth = 5
Time taken = 53 milliseconds
// cached call
Dirs = 1117, files = 9076, max depth = 11
Time taken = 433 milliseconds
// first call
Dirs = 1052, files = 5903, max depth = 12
Time taken = 11921 milliseconds
// first call
Dirs = 793, files = 10748, max depth = 10
Time taken = 5433 milliseconds (2nd run 363 milliseconds)
Because I was concerned that I wasn't getting the created and modified dates, I modified the code to output these as well, with the following times.
// now grabbing last update and creation time.
Dirs = 150, files = 420, max depth = 5
Time taken = 103 milliseconds (2nd run 93 milliseconds)
Dirs = 1117, files = 9076, max depth = 11
Time taken = 992 milliseconds (2nd run 984 milliseconds)
Dirs = 793, files = 10748, max depth = 10
Time taken = 1382 milliseconds (2nd run 735 milliseconds)
Dirs = 1052, files = 5903, max depth = 12
Time taken = 936 milliseconds (2nd run 595 milliseconds)
Note: System.Diagnostics.StopWatch class used for timing.
I recently (2020) discovered this post because of a need to count files and directories across slow connections, and this was the fastest implementation I could come up with. The .NET enumeration methods (GetFiles(), GetDirectories()) perform a lot of under-the-hood work that slows them down tremendously by comparison.
This solution utilizes the Win32 API and .NET's Parallel.ForEach() to leverage the threadpool to maximize performance.
P/Invoke:
/// <summary>
/// https://learn.microsoft.com/en-us/windows/win32/api/fileapi/nf-fileapi-findfirstfilew
/// </summary>
[DllImport("kernel32.dll", SetLastError = true)]
public static extern IntPtr FindFirstFile(
string lpFileName,
ref WIN32_FIND_DATA lpFindFileData
);
/// <summary>
/// https://learn.microsoft.com/en-us/windows/win32/api/fileapi/nf-fileapi-findnextfilew
/// </summary>
[DllImport("kernel32.dll", SetLastError = true)]
public static extern bool FindNextFile(
IntPtr hFindFile,
ref WIN32_FIND_DATA lpFindFileData
);
/// <summary>
/// https://learn.microsoft.com/en-us/windows/win32/api/fileapi/nf-fileapi-findclose
/// </summary>
[DllImport("kernel32.dll", SetLastError = true)]
public static extern bool FindClose(
IntPtr hFindFile
);
Method:
public static Tuple<long, long> CountFilesDirectories(
string path,
CancellationToken token
)
{
if (String.IsNullOrWhiteSpace(path))
throw new ArgumentNullException("path", "The provided path is NULL or empty.");
// If the provided path doesn't end in a backslash, append one.
if (path.Last() != '\\')
path += '\\';
IntPtr hFile = IntPtr.Zero;
Win32.Kernel32.WIN32_FIND_DATA fd = new Win32.Kernel32.WIN32_FIND_DATA();
long files = 0;
long dirs = 0;
try
{
hFile = Win32.Kernel32.FindFirstFile(
path + "*", // Discover all files/folders by ending a directory with "*", e.g. "X:\*".
ref fd
);
// If we encounter an error, or there are no files/directories, we return no entries.
if (hFile.ToInt64() == -1)
return Tuple.Create<long, long>(0, 0);
//
// Find (and count) each file/directory, then iterate through each directory in parallel to maximize performance.
//
List<string> directories = new List<string>();
do
{
// If a directory (and not a Reparse Point), and the name is not "." or ".." which exist as concepts in the file system,
// count the directory and add it to a list so we can iterate over it in parallel later on to maximize performance.
if ((fd.dwFileAttributes & FileAttributes.Directory) != 0 &&
(fd.dwFileAttributes & FileAttributes.ReparsePoint) == 0 &&
fd.cFileName != "." && fd.cFileName != "..")
{
directories.Add(System.IO.Path.Combine(path, fd.cFileName));
dirs++;
}
// Otherwise, if this is a file ("archive"), increment the file count.
else if ((fd.dwFileAttributes & FileAttributes.Archive) != 0)
{
files++;
}
}
while (Win32.Kernel32.FindNextFile(hFile, ref fd));
// Iterate over each discovered directory in parallel to maximize file/directory counting performance,
// calling itself recursively to traverse each directory completely.
Parallel.ForEach(
directories,
new ParallelOptions()
{
CancellationToken = token
},
directory =>
{
var count = CountFilesDirectories(
directory,
token
);
lock (directories)
{
files += count.Item1;
dirs += count.Item2;
}
});
}
catch (Exception)
{
// Handle as desired.
}
finally
{
if (hFile.ToInt64() != 0 && hFile.ToInt64() != -1) // don't call FindClose on a null or invalid handle
Win32.Kernel32.FindClose(hFile);
}
return Tuple.Create<long, long>(files, dirs);
}
On my local system, the performance of GetFiles()/GetDirectories() can be close to this, but across slower connections (VPNs, etc.) I found that this is tremendously faster—45 minutes vs. 90 seconds to access a remote directory of ~40k files, ~40 GB in size.
This can also fairly easily be modified to include other data, like the total file size of all files counted, or rapidly recursing through and deleting empty directories, starting at the furthest branch.
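For instance, a small helper along those lines (my sketch; it assumes the WIN32_FIND_DATA declaration exposes the standard DWORD size fields as uint) reassembles the 64-bit length so a running byte total can be kept next to the file counter:
static long FileSizeFromFindData(uint nFileSizeHigh, uint nFileSizeLow) {
    // the find data splits the 64-bit file size into two 32-bit halves
    return ((long)nFileSizeHigh << 32) | nFileSizeLow;
}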
I'd use, or base my solution on, this multi-threaded library: http://www.codeproject.com/KB/files/FileFind.aspx
try this (i.e. do the initialization first, and then reuse your list and your directoryInfo objects):
static List<Info> RecursiveMovieFolderScan1() {
var info = new List<Info>();
var dirInfo = new DirectoryInfo(path);
RecursiveMovieFolderScan(dirInfo, info);
return info;
}
static List<Info> RecursiveMovieFolderScan(DirectoryInfo dirInfo, List<Info> info){
foreach (var dir in dirInfo.GetDirectories()) {
info.Add(new Info() {
IsDirectory = true,
CreatedDate = dir.CreationTimeUtc,
ModifiedDate = dir.LastWriteTimeUtc,
Path = dir.FullName
});
RecursiveMovieFolderScan(dir, info);
}
foreach (var file in dirInfo.GetFiles()) {
info.Add(new Info()
{
IsDirectory = false,
CreatedDate = file.CreationTimeUtc,
ModifiedDate = file.LastWriteTimeUtc,
Path = file.FullName
});
}
return info;
}
Recently I had the same question. I think it is also good to output all folders and files into a text file, then use a StreamReader to read the text file back and process it with multiple threads.
cmd.exe /u /c dir "M:\" /s /b >"c:\flist1.txt"
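A rough sketch of the read-back side (my own illustration; note that the /u switch makes cmd.exe write UTF-16 output, so the reader needs Encoding.Unicode, and the per-path work is a placeholder):
// using System.IO; using System.Text; using System.Threading.Tasks;
var paths = File.ReadLines(@"c:\flist1.txt", Encoding.Unicode);
Parallel.ForEach(paths, path => {
    // each line is one full path from "dir /s /b"; do the per-file work here
});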
[update]
Hi Moby, you are correct.
My approach is slower due to the overhead of reading back the output text file.
Actually, I took some time to test the top answer and cmd.exe with 2 million files.
The top answer: 2010100 files, time: 53023
cmd.exe method: 2010100 files, cmd time: 64907, scan output file time: 19832
The top answer's method (53023) is faster than cmd.exe (64907), even before accounting for the extra work of reading the output text file back. My original point was just to provide a not-too-bad alternative; still, I feel sorry, ha.
