Get file size without using System.IO.FileInfo? - c#

Is it possible to get the size of a file in C# without using System.IO.FileInfo at all?
I know that you can get other things like Name and Extension by using Path.GetFileName(yourFilePath) and Path.GetExtension(yourFilePath) respectively, but apparently not file size? Is there another way I can get file size without using System.IO.FileInfo?
The only reason for this is that, if I'm correct, FileInfo grabs more info than I really need, so it takes longer to gather all of that information when the only thing I need is the size of the file. Is there a faster way?

I performed a benchmark using these two methods:
public static uint GetFileSizeA(string filename)
{
    // Uses FindFirstFile, so the file itself is never opened.
    WIN32_FIND_DATA findData;
    IntPtr handle = FindFirstFile(filename, out findData);
    FindClose(handle); // release the search handle (not closed in the original snippet)
    // Note: nFileSizeLow alone is only correct for files under 4 GB;
    // combine it with nFileSizeHigh for larger files.
    return findData.nFileSizeLow;
}

public static uint GetFileSizeB(string filename)
{
    // Opens the file, queries its size, then closes the handle.
    IntPtr handle = CreateFile(
        filename,
        FileAccess.Read,
        FileShare.Read,
        IntPtr.Zero,
        FileMode.Open,
        FileAttributes.ReadOnly,
        IntPtr.Zero);
    long fileSize;
    GetFileSizeEx(handle, out fileSize);
    CloseHandle(handle);
    return (uint) fileSize;
}
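The post doesn't show the P/Invoke declarations these helpers rely on. Something along the following lines (based on the usual pinvoke.net signatures, and an assumption rather than the author's actual code) covers the FindFirstFile route; declarations for CreateFile, GetFileSizeEx and CloseHandle can likewise be found on pinvoke.net (a GetFileSizeEx declaration appears in the last answer on this page):
// Assumed declarations for the FindFirstFile-based helper above (not shown in the original post).
[StructLayout(LayoutKind.Sequential, CharSet = CharSet.Auto)]
struct WIN32_FIND_DATA
{
    public FileAttributes dwFileAttributes;
    public System.Runtime.InteropServices.ComTypes.FILETIME ftCreationTime;
    public System.Runtime.InteropServices.ComTypes.FILETIME ftLastAccessTime;
    public System.Runtime.InteropServices.ComTypes.FILETIME ftLastWriteTime;
    public uint nFileSizeHigh;
    public uint nFileSizeLow;
    public uint dwReserved0;
    public uint dwReserved1;
    [MarshalAs(UnmanagedType.ByValTStr, SizeConst = 260)] public string cFileName;
    [MarshalAs(UnmanagedType.ByValTStr, SizeConst = 14)] public string cAlternateFileName;
}

[DllImport("kernel32.dll", CharSet = CharSet.Auto, SetLastError = true)]
static extern IntPtr FindFirstFile(string lpFileName, out WIN32_FIND_DATA lpFindFileData);

[DllImport("kernel32.dll", SetLastError = true)]
static extern bool FindClose(IntPtr hFindFile);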
Running against a bit over 2300 files, GetFileSizeA took 62-63ms to run. GetFileSizeB took over 18 seconds.
Unless someone sees something I'm doing wrong, I think the answer is clear as to which method is faster.
Is there a way I can refrain from actually opening the file?
Update
Changing FileAttributes.ReadOnly to FileAttributes.Normal reduced the timing so that the two methods were identical in performance.
Furthermore, if you skip the CloseHandle() call, the GetFileSizeEx method becomes about 20-30% faster, though I don't know that I'd recommend that.

From a short test I did, I found that using a FileStream is just 1 millisecond slower on average than Pete's GetFileSizeB (it took me about 21 milliseconds over a network share...). Personally, I prefer staying within the BCL whenever I can.
The code is simple:
using (var file = new FileStream(path, FileMode.Open, FileAccess.Read, FileShare.ReadWrite))
{
    return file.Length;
}

As per this comment:
I have a small application that gathers the size info and saves it into an array... but I often have half a million files, give or take and that takes a while to go through all of those files (I'm using FileInfo). I was just wondering if there was a faster way...
Since you're finding the length of so many files you're much more likely to benefit from parallelization than from trying to get the file size through another method. The FileInfo class should be good enough, and any improvements are likely to be small.
Parallelizing the file size requests, on the other hand, has the potential for significant improvements in speed. (Note that the degree of improvement will be largely based on your disk drive, not your processor, so results can vary greatly.)
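As a minimal sketch of what that parallelization could look like with PLINQ (the filePaths array stands in for the half-million paths mentioned in the comment, and the degree of parallelism is something to experiment with per drive):
using System.IO;
using System.Linq;

// Hypothetical helper: runs the FileInfo.Length lookups in parallel across the input paths.
static long[] GetSizes(string[] filePaths)
{
    return filePaths
        .AsParallel()
        .AsOrdered()                    // keep results aligned with the input order
        .WithDegreeOfParallelism(4)     // tune for your disk, not your CPU
        .Select(p => new FileInfo(p).Length)
        .ToArray();
}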

Not a direct answer...because I am not sure there is a faster way using the .NET framework.
Here's the code I am using:
List<long> list = new List<long>();
DirectoryInfo di = new DirectoryInfo("C:\\Program Files");
FileInfo[] fiArray = di.GetFiles("*", SearchOption.AllDirectories);
foreach (FileInfo f in fiArray)
    list.Add(f.Length);
Running that, it took 2709ms to run on my "Program Files" directory, which was around 22720 files. That's no slouch by any means. Furthermore, when I put *.txt as a filter for the first parameter of the GetFiles method, it cut the time down drastically to 461ms.
A lot of this will depend on how fast your hard drive is, but I really don't think that FileInfo is killing performance.
NOTE: I think this is only valid for .NET 4+.

A quick'n'dirty solution if you want to do this on the .NET Core or Mono runtimes on non-Windows hosts:
Include the Mono.Posix.NETStandard NuGet package, then something like this...
using Mono.Unix.Native;
private long GetFileSize(string filePath)
{
    Stat stat;
    // Syscall.stat returns 0 on success and -1 on failure.
    if (Syscall.stat(filePath, out stat) != 0)
        throw new IOException("stat() failed for " + filePath);
    return stat.st_size;
}
I've tested this running .NET Core on Linux and macOS - not sure if it works on Windows - it might, given that these are POSIX syscalls under the hood (and the package is maintained by Microsoft). If not, combine with the other P/Invoke-based answer to cover all platforms.
When compared to FileInfo.Length, this gives me much more reliable results when getting the size of a file that is actively being written to by another process/thread.

You can try this:
[DllImport("kernel32.dll")]
static extern bool GetFileSizeEx(IntPtr hFile, out long lpFileSize);
But that's not much of an improvement...
Here's the example code taken from pinvoke.net:
IntPtr handle = CreateFile(
    PathString,
    GENERIC_READ,
    FILE_SHARE_READ,
    0,
    OPEN_EXISTING,
    FILE_ATTRIBUTE_READONLY,
    0); // CreateFile is P/Invoked too

// INVALID_HANDLE_VALUE; comparing against new IntPtr(-1) avoids the overflow that
// handle.ToInt32() can cause on 64-bit. The handle should also be released with
// CloseHandle once you are done with it.
if (handle == new IntPtr(-1))
{
    return;
}
long fileSize;
bool result = GetFileSizeEx(handle, out fileSize);
if (!result)
{
    return;
}

Related

File.WriteAllBytes does not block

I have a simple piece of code like so:
File.WriteAllBytes(Path.Combine(temp, node.Name), stuffFile.Read(0, node.FileHeader.FileSize));
One would think that WriteAllBytes would be a blocking call, as it has async counterparts in C# 5.0 and nothing in the MSDN documentation says it is non-blocking. HOWEVER, when a file is of a reasonable size (not massive, but somewhere in the realm of 20 MB), the call afterwards which opens the file seems to run before the writing is finished: the file is opened (and the program complains it's corrupted, rightly so), and WriteAllBytes then complains the file is open in another process. What is going on here?! For curiosity's sake, this is the code used to open the file:
System.Diagnostics.Process.Start(Path.Combine(temp, node.Name));
Has anyone experienced this sort of weirdness before? Or is it just me doing something wrong?
If it is indeed blocking, what could possibly be causing this issue?
EDIT: I'll put the full method up.
var node = item.Tag as FileNode;
stuffFile.Position = node.FileOffset;
string temp = Path.GetTempPath();
File.WriteAllBytes(Path.Combine(temp, node.Name), stuffFile.Read(0, node.FileHeader.FileSize));
System.Diagnostics.Process.Start(Path.Combine(temp, node.Name));
What seems to be happening is that Process.Start is being called BEFORE WriteAllBytes has finished; it attempts to open the file, and then WriteAllBytes complains about another process holding the lock on the file.
No, WriteAllBytes is a blocking, synchronous method. As you stated, if it were not, the documentation would say so.
Possibly the virus scanner is still busy scanning the file that you just wrote, and is responsible for locking the file. Try temporarily disabling the scanner to test my hypothesis.
I think your problem may be with the way you are reading from the file. Note that Stream.Read (and FileStream.Read) is not required to read everything you request.
In other words, your call stuffFile.Read(0, node.FileHeader.FileSize) might (and sometimes definitely will) return an array of node.FileHeader.FileSize bytes that contains only the beginning of the file, followed by zeros.
The bug is in your UsableFileStream.Read method. You could fix it by having it read the entire file into memory:
public byte[] Read(int offset, int count)
{
    // There are still bugs in this method, like assuming that 'count' bytes
    // can actually be read from the file
    byte[] temp = new byte[count];
    int bytesRead;
    while (count > 0 && (bytesRead = _stream.Read(temp, offset, count)) > 0)
    {
        offset += bytesRead;
        count -= bytesRead;
    }
    return temp;
}
But since you are only using this to copy file contents, you could avoid having these potentially massive allocations and use Stream.CopyTo in your tree_MouseDoubleClick:
var node = item.Tag as FileNode;
stuffFile.Position = node.FileOffset;
string temp = Path.GetTempPath();
using (var output = File.Create(Path.Combine(temp, node.Name)))
    stuffFile._stream.CopyTo(output);
System.Diagnostics.Process.Start(Path.Combine(temp, node.Name));
A little late, but adding for the benefit of anyone else that might come along.
The underlying C# implementation of File.WriteAllBytes may well be synchronous, but the authors of C# cannot control, at the OS level, how the writing to disk is handled.
Write caching means that when C# asks to save the file to disk, the OS may report "I'm done" before the file is fully written to the disk, causing the issue the OP highlighted.
In that case, after writing it may be better to sleep in a loop and keep checking whether the file is still locked before calling Process.Start; a rough sketch follows.
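A rough sketch of that wait-and-retry idea (the method name, timeout and delay values are illustrative, not from the original post):
using System.IO;
using System.Threading;

// Sketch only: polls until the freshly written file can be opened exclusively.
static void WaitForFileToUnlock(string path, int maxAttempts = 50, int delayMs = 100)
{
    for (int attempt = 0; attempt < maxAttempts; attempt++)
    {
        try
        {
            // Opening for exclusive access succeeds only once no other process holds the file.
            using (File.Open(path, FileMode.Open, FileAccess.Read, FileShare.None))
                return;
        }
        catch (IOException)
        {
            Thread.Sleep(delayMs);
        }
    }
}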
You can see that I run into problems caused by this here: C#, Entity Framework Core & PostgreSql : inserting a single row takes 20+ seconds
Also, in the final sentence of the OP's post, "and then WriteAllBytes complains about another process holding the lock on the file," I think they actually meant to write "and then Process.Start complains", which seems to have caused some confusion in the comments.

How to disable the disk cache in C#: invoking the Win32 CreateFile API with FILE_FLAG_NO_BUFFERING

Everyone, I write a lot of files to disk every second, and I want to disable the disk cache to improve performance. A Google search turned up a solution: the Win32 CreateFile method with FILE_FLAG_NO_BUFFERING, and How to empty/flush Windows READ disk cache in C#?.
I wrote a little code to test whether it would work:
const int FILE_FLAG_NO_BUFFERING = unchecked((int)0x20000000);

[DllImport("KERNEL32", SetLastError = true, CharSet = CharSet.Auto, BestFitMapping = false)]
static extern SafeFileHandle CreateFile(
    String fileName,
    int desiredAccess,
    System.IO.FileShare shareMode,
    IntPtr securityAttrs,
    System.IO.FileMode creationDisposition,
    int flagsAndAttributes,
    IntPtr templateFile);

static void Main(string[] args)
{
    var handler = CreateFile(@"d:\temp.bin", (int)FileAccess.Write, FileShare.None,
        IntPtr.Zero, FileMode.Create, FILE_FLAG_NO_BUFFERING, IntPtr.Zero);
    var stream = new FileStream(handler, FileAccess.Write, BlockSize); // BlockSize = 4096
    byte[] array = Encoding.UTF8.GetBytes("hello,world");
    stream.Write(array, 0, array.Length);
    stream.Close();
}
When I run this program, the application gets an exception: "IO operation will not work. Most likely the file will become too long or the handle was not opened to support synchronous IO operations."
Later, I found the article When you create an object with constraints, you have to make sure everybody who uses the object understands those constraints, but I can't fully understand it, so I changed my code to test:
var stream = new FileStream(handler, FileAccess.Write, 4096);
byte[] ioBuffer = new byte[4096];
byte[] array = Encoding.UTF8.GetBytes("hello,world");
Array.Copy(array, ioBuffer, array.Length);
stream.Write(ioBuffer, 0, ioBuffer.Length);
stream.Close();
It runs OK, but I just want the "hello,world" bytes, not the whole buffer. I tried changing the block size to 1 or another integer (not a multiple of 512) and got the same error. I also tried the Win32 WriteFile API and got the same error. Can someone help me?
The CreateFile() function in no-buffering mode imposes strict requirements on what may and may not be done; having a buffer of a certain size (a multiple of the device sector size) is one of them.
Now, you can improve file writes this way only if you use buffering in your code. If you want to write 10 bytes without buffering, then no-buffering mode won't help you.
If I understood your requirements correctly, this is what I'd try first:
Create a queue with objects that have the data in memory and the target file on the disk.
You start by writing the files just into memory; then, on another thread, start going through the queue, opening IO-completion-port based FileStream handles (isAsync = true). Just don't open too many of them, as at some point you'll probably start losing performance due to cache thrashing, etc. You need to profile to see what amount is optimal for your system and SSDs.
After each open, you can use the asynchronous FileStream Begin... methods to start writing data from memory to the files. isAsync imposes some requirements, so this may not be as easy to get working in every corner case as using a FileStream normally.
Whether there will be any improvement from using one thread to create the files and another to write to them via the async API: that may only be the case if there is a possibility that creating/opening the files would block. SSDs perform various things internally to keep access to data fast, so when you start doing this sort of extreme performance work, there may be pronounced differences between SSD controllers. It's also possible that if the controller drivers aren't well implemented, the OS/Windows may start to feel sluggish or freeze. Hardware benchmark sites do not really stress this particular kind of scenario (e.g. create and write x KB into a million files as fast as possible), and no doubt some drivers out there are slower than others. A rough sketch of the approach follows.
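A rough sketch of the queue-plus-async-writes idea described above, using the Task-based WriteAsync API rather than the older Begin/End pattern; the PendingWrite type and member names are illustrative, not from the original answer:
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.IO;
using System.Threading;
using System.Threading.Tasks;

class PendingWrite
{
    public string Path;
    public byte[] Data;
}

static class QueuedWriter
{
    // Producer threads Add() PendingWrite items and call CompleteAdding() when done.
    public static async Task DrainAsync(BlockingCollection<PendingWrite> queue, int maxConcurrent = 4)
    {
        var throttle = new SemaphoreSlim(maxConcurrent);   // don't open too many handles at once
        var inFlight = new List<Task>();

        foreach (var item in queue.GetConsumingEnumerable())
        {
            await throttle.WaitAsync();
            inFlight.Add(WriteOneAsync(item, throttle));
        }
        await Task.WhenAll(inFlight);
    }

    static async Task WriteOneAsync(PendingWrite item, SemaphoreSlim throttle)
    {
        try
        {
            // useAsync: true opens the handle for overlapped (IO-completion-port based) IO.
            using (var fs = new FileStream(item.Path, FileMode.Create, FileAccess.Write,
                                           FileShare.None, 4096, useAsync: true))
            {
                await fs.WriteAsync(item.Data, 0, item.Data.Length);
            }
        }
        finally
        {
            throttle.Release();
        }
    }
}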

Which method is better?

I am using an import to open a connected physical hard-drive:
var sfh = Imports.CreateFile(Path, Imports.FileAccess.GenericAll, Imports.FileShare.None, IntPtr.Zero, Imports.CreationDisposition.OpenExisting, 0, IntPtr.Zero);
if (sfh.IsInvalid)
{
    Marshal.ThrowExceptionForHR(Marshal.GetHRForLastWin32Error());
    return;
}
Geometry = Imports.GetGeometry(sfh);
var fs = new FileStream(sfh, FileAccess.ReadWrite, (int)Geometry.BytesPerSector, false);
That works fine, but instead of using FileStream, I was wondering if this would be a more efficient way to read bytes from the drive: http://msdn.microsoft.com/en-us/library/aa365467%28v=VS.85%29.aspx
Is speed and/or efficiency really that important to you? The difference is probably minor in this case.
It seems like the link you gave uses a WinAPI method. I would avoid using these where you don't necessarily have to, since the .NET garbage collector doesn't play well with native resources, and you might suffer from memory leaks if you don't handle them correctly. A short sketch of keeping the handle safely managed follows.
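For illustration, a sketch of that point using the question's own Imports wrapper (so the exact signatures are assumed): keep the SafeFileHandle and the FileStream in using blocks so the native handle is released deterministically instead of being left to the finalizer.
// Sketch only: the handle and stream are released even if an exception is thrown mid-way.
using (SafeFileHandle sfh = Imports.CreateFile(Path, Imports.FileAccess.GenericAll, Imports.FileShare.None,
           IntPtr.Zero, Imports.CreationDisposition.OpenExisting, 0, IntPtr.Zero))
{
    if (sfh.IsInvalid)
        Marshal.ThrowExceptionForHR(Marshal.GetHRForLastWin32Error());

    var geometry = Imports.GetGeometry(sfh);
    using (var fs = new FileStream(sfh, FileAccess.ReadWrite, (int)geometry.BytesPerSector, false))
    {
        // ... read/write whole sectors here ...
    }
}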

File.Copy vs. Manual FileStream.Write For Copying File

My problem regards file-copying performance. We have a media management system that requires a lot of moving files around on the file system to different locations, including Windows shares on the same network, FTP sites, Amazon S3, etc. When we were all on one Windows network we could get away with using System.IO.File.Copy(source, destination) to copy a file. Since many times all we have is an input Stream (like a MemoryStream), we tried abstracting the Copy operation to take an input Stream and an output Stream, but we are seeing a massive performance decrease. Below is some code for copying a file to use as a discussion point.
public void Copy(System.IO.Stream inStream, string outputFilePath)
{
    int bufferSize = 1024 * 64;
    using (FileStream fileStream = new FileStream(outputFilePath, FileMode.OpenOrCreate, FileAccess.Write))
    {
        int bytesRead = -1;
        byte[] bytes = new byte[bufferSize];
        while ((bytesRead = inStream.Read(bytes, 0, bufferSize)) > 0)
        {
            fileStream.Write(bytes, 0, bytesRead);
            fileStream.Flush();
        }
    }
}
Does anyone know why this performs so much slower than File.Copy? Is there anything I can do to improve performance? Am I just going to have to put special logic in to see if I'm copying from one windows location to another--in which case I would just use File.Copy and in the other cases I'll use the streams?
Please let me know what you think and whether you need additional information. I have tried different buffer sizes and it seems like a 64k buffer size is optimal for our "small" files and 256k+ is a better buffer size for our "large" files--but in either case it performs much worse than File.Copy(). Thanks in advance!
File.Copy was built around the CopyFile Win32 function, and that function gets a lot of attention from the MS crew (remember the Vista-related threads about slow copy performance).
Several clues to improve the performance of your method:
As many have said, remove the Flush call from your loop. You do not need it at all.
Increasing the buffer may help, but only for file-to-file operations; for network shares or FTP servers it will slow things down instead. 60 * 1024 is ideal for network shares, at least before Vista. For FTP, 32 KB will be enough in most cases.
Help the OS by declaring your caching strategy (in your case sequential reading and writing): use the FileStream constructor overload with the FileOptions parameter (SequentialScan). See the constructor sketch after the example below.
You can speed up copying by using the asynchronous pattern (especially useful for network-to-file cases), but do not use threads for this; instead use overlapped IO (BeginRead, EndRead, BeginWrite, EndWrite in .NET), and do not forget to set the Asynchronous option in the FileStream constructor (see FileOptions).
Example of asynchronous copy pattern:
// Double-buffered copy: one buffer is being written while the next read fills the other.
// (sourceStream and destStream are FileStreams opened with the Asynchronous option;
// the buffers are not declared in the original answer.)
byte[] ActiveBuffer = new byte[64 * 1024];
byte[] BackBuffer = new byte[64 * 1024];
byte[] WriteBuffer;

int Readed = 0;
IAsyncResult ReadResult;
IAsyncResult WriteResult;

ReadResult = sourceStream.BeginRead(ActiveBuffer, 0, ActiveBuffer.Length, null, null);
do
{
    Readed = sourceStream.EndRead(ReadResult);

    WriteResult = destStream.BeginWrite(ActiveBuffer, 0, Readed, null, null);
    WriteBuffer = ActiveBuffer;

    if (Readed > 0)
    {
        // Start the next read into the spare buffer, then swap the two buffers.
        ReadResult = sourceStream.BeginRead(BackBuffer, 0, BackBuffer.Length, null, null);
        BackBuffer = Interlocked.Exchange(ref ActiveBuffer, BackBuffer);
    }

    destStream.EndWrite(WriteResult);
}
while (Readed > 0);
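As referenced in the caching-strategy clue above, the SequentialScan and Asynchronous hints are passed through the FileStream constructor like this (a sketch; the paths and buffer size are illustrative):
// Illustrative only: FileStream constructors with the FileOptions hints mentioned above.
var sourceStream = new FileStream(sourcePath, FileMode.Open, FileAccess.Read, FileShare.Read,
                                  64 * 1024, FileOptions.SequentialScan | FileOptions.Asynchronous);
var destStream = new FileStream(destPath, FileMode.Create, FileAccess.Write, FileShare.None,
                                64 * 1024, FileOptions.SequentialScan | FileOptions.Asynchronous);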
Three changes will dramatically improve performance:
Increase your buffer size; try 1 MB (just experiment)
After you open your fileStream, call fileStream.SetLength(inStream.Length) to allocate the entire block on disk up front (only works if inStream is seekable)
Remove fileStream.Flush() - it is redundant and probably has the single biggest impact on performance as it will block until the flush is complete. The stream will be flushed anyway on dispose.
This seemed about 3-4 times faster in the experiments I tried:
public static void Copy(System.IO.Stream inStream, string outputFilePath)
{
    int bufferSize = 1024 * 1024;
    using (FileStream fileStream = new FileStream(outputFilePath, FileMode.OpenOrCreate, FileAccess.Write))
    {
        fileStream.SetLength(inStream.Length);
        int bytesRead = -1;
        byte[] bytes = new byte[bufferSize];
        while ((bytesRead = inStream.Read(bytes, 0, bufferSize)) > 0)
        {
            fileStream.Write(bytes, 0, bytesRead);
        }
    }
}
Dusting off reflector we can see that File.Copy actually calls the Win32 API:
if (!Win32Native.CopyFile(fullPathInternal, dst, !overwrite))
Which resolves to
[DllImport("kernel32.dll", CharSet=CharSet.Auto, SetLastError=true)]
internal static extern bool CopyFile(string src, string dst, bool failIfExists);
And here is the documentation for CopyFile
You're never going to be able to beat the operating system at doing something so fundamental with your own code, not even if you craft it carefully in assembler.
If you need to make sure that your operations occur with the best performance AND you want to mix and match various sources, then you will need to create a type that describes the resource locations. You then create an API with functions such as Copy that take two such types and, having examined the descriptions of both, choose the best-performing copy mechanism. E.g., having determined that both locations are Windows file locations, it would choose File.Copy; or, if the source is a Windows file but the destination is an HTTP POST, it would use a WebRequest.
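A minimal sketch of that dispatch idea; the ResourceLocation and CopyService names are illustrative, not an existing API:
using System;
using System.IO;

enum ResourceKind { LocalFile, WindowsShare, Http, Ftp, AmazonS3 }

class ResourceLocation
{
    public ResourceKind Kind;
    public string Path;   // file path, UNC path, or URL depending on Kind
}

static class CopyService
{
    public static void Copy(ResourceLocation source, ResourceLocation destination)
    {
        // Both ends are plain files or shares: let the operating system do the work.
        if (IsFileLike(source) && IsFileLike(destination))
        {
            File.Copy(source.Path, destination.Path, true);
            return;
        }

        // Otherwise fall back to a stream-to-stream copy (HTTP, FTP, S3, ...).
        using (var input = OpenRead(source))
        using (var output = OpenWrite(destination))
        {
            input.CopyTo(output);
        }
    }

    static bool IsFileLike(ResourceLocation location)
    {
        return location.Kind == ResourceKind.LocalFile || location.Kind == ResourceKind.WindowsShare;
    }

    static Stream OpenRead(ResourceLocation location)
    {
        if (IsFileLike(location))
            return File.OpenRead(location.Path);
        // WebRequest, S3 client, FTP reader, etc. would go here.
        throw new NotSupportedException(location.Kind.ToString());
    }

    static Stream OpenWrite(ResourceLocation location)
    {
        if (IsFileLike(location))
            return File.Create(location.Path);
        // HTTP POST, S3 upload, FTP writer, etc. would go here.
        throw new NotSupportedException(location.Kind.ToString());
    }
}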
Try to remove the Flush call, and move it to be outside the loop.
Sometimes the OS knows best when to flush the IO. It allows it to make better use of its internal buffers.
Here's a similar answer
How do I copy the contents of one stream to another?
Your main problem is the call to Flush(); it ties your performance to the speed of the I/O.
Mark Russinovich would be the authority on this.
He wrote on his blog an entry Inside Vista SP1 File Copy Improvements which sums up the Windows state of the art through Vista SP1.
My semi-educated guess would be that File.Copy would be most robust over the greatest number of situations. Of course, that doesn't mean that in some specific corner case your own code might not beat it...
One thing that stands out is that you are reading a chunk, writing that chunk, reading another chunk and so on.
Streaming operations are great candidates for multithreading. My guess is that File.Copy implements multithreading.
Try reading in one thread and writing in another. You will need to coordinate the threads so that the write thread doesn't start writing a buffer until the read thread is done filling it up. You can solve this by having two buffers, one that is being read while the other is being written, and a flag that says which buffer is currently being used for which purpose; a rough sketch of this hand-off follows.
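A rough sketch of that two-buffer hand-off, using a bounded BlockingCollection instead of a manual flag (names and buffer size are illustrative):
using System;
using System.Collections.Concurrent;
using System.IO;
using System.Threading.Tasks;

static class PipelinedCopy
{
    public static void Copy(Stream source, Stream destination, int bufferSize = 64 * 1024)
    {
        // Capacity 2 means one chunk can be read ahead while the previous one is being written.
        var chunks = new BlockingCollection<byte[]>(boundedCapacity: 2);

        var reader = Task.Run(() =>
        {
            var buffer = new byte[bufferSize];
            int read;
            while ((read = source.Read(buffer, 0, buffer.Length)) > 0)
            {
                var chunk = new byte[read];
                Array.Copy(buffer, chunk, read);
                chunks.Add(chunk);
            }
            chunks.CompleteAdding();
        });

        // Writer: drains chunks as they become available.
        foreach (var chunk in chunks.GetConsumingEnumerable())
            destination.Write(chunk, 0, chunk.Length);

        reader.Wait();
    }
}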

Is there any way I can speed up the opening and hashing of 15,000 small files in C#?

I'm working on SHA1 checksum hashing 15,000 images (40KB - 1.0MB each, approximately 1.8GB total). I'd like to speed this up as it is going to be a key operation in my program and right now it is taking between 500-600 seconds.
I've tried the following which took 500 seconds:
public string GetChecksum(string filePath)
{
    // Dispose the stream as well (the original snippet left the FileStream open).
    using (FileStream fs = new FileStream(filePath, FileMode.Open))
    using (SHA1Managed sha1 = new SHA1Managed())
    {
        return BitConverter.ToString(sha1.ComputeHash(fs));
    }
}
Then I thought maybe the chunks SHA1Managed() was reading were too small, so I used a BufferedStream and increased the buffer size to be greater than the size of any of the files I'm reading.
public string GetChecksum(string filePath)
{
    using (var bs = new BufferedStream(File.OpenRead(filePath), 1200000))
    {
        using (SHA1Managed sha1 = new SHA1Managed())
        {
            return BitConverter.ToString(sha1.ComputeHash(bs));
        }
    }
}
This actually took 600 seconds.
Is there anything I can do to speed up these IO operations, or am I stuck with what I got?
As per x0n's suggestion, I tried just reading each file into a byte array and discarding the result. It appears I'm IO-bound, as this took ~480 seconds by itself.
You are creating and destroying the SHA1Managed class for EVERY file; this is horrifically inefficient. Create it once, and call ComputeHash 15,000 times instead and you'll get a huge performance increase (IMO.)
public Dictionary<string, string> GetChecksums(string[] filePaths)
{
    var checksums = new Dictionary<string, string>(filePaths.Length);
    using (SHA1Managed sha1 = new SHA1Managed())
    {
        foreach (string filePath in filePaths)
        {
            using (var fs = File.OpenRead(filePath))
            {
                checksums.Add(filePath, BitConverter.ToString(sha1.ComputeHash(fs)));
            }
        }
    }
    return checksums;
}
The SHA1Managed class is particularly slow to create/destroy because it calls out to p/invoke native win32 classes.
-Oisin
Profile it first.
Try dotTrace: http://www.jetbrains.com/profiler/
You didn't say whether your operation is CPU bound, or IO bound.
With a hash, I would suspect it is CPU bound. If it is CPU bound, you will see the CPU saturated (100% utilized) during the computation of the SHA hashes. If it is IO bound, the CPU will not be saturated.
If it is CPU bound, and you have a multi-CPU or multi-core machine (true for most laptops built in the last 2 years, and almost all servers built since 2002), then you can get an instant increase by using multiple threads and multiple SHA1Managed() instances, computing the SHAs in parallel. If it's a dual-core machine, expect 2x; if it's a dual-core, 2-CPU machine (a typical server), you'll get 4x throughput.
By the way, when a single-threaded program like yours "saturates" the CPU on a dual-core machine, it will show up as 50% utilization in Windows Task Manager.
You need to manage the workflow through the threads to keep track of which thread is working on which file, but this isn't hard to do; a sketch follows.
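A minimal sketch of that multi-threaded hashing, using Parallel.ForEach with one SHA1Managed instance per worker thread (the filePaths variable is illustrative):
using System;
using System.Collections.Concurrent;
using System.IO;
using System.Security.Cryptography;
using System.Threading.Tasks;

static ConcurrentDictionary<string, string> HashAll(string[] filePaths)
{
    var results = new ConcurrentDictionary<string, string>();
    Parallel.ForEach(
        filePaths,
        () => new SHA1Managed(),                 // one hasher per worker thread
        (path, state, sha1) =>
        {
            using (var fs = File.OpenRead(path))
                results[path] = BitConverter.ToString(sha1.ComputeHash(fs));
            return sha1;
        },
        sha1 => sha1.Clear());                   // release each per-thread hasher
    return results;
}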
Use a "ramdisk" - build a file system in memory.
Have you tried using the SHA1CryptoServiceProvider class instead of SHA1Managed? SHA1CryptoServiceProvider is implemented in native code, not managed code, and was much quicker in my experience. For example:
public static byte[] CreateSHA1Hash(string filePath)
{
    byte[] hash = null;
    using (SHA1CryptoServiceProvider sha1 = new SHA1CryptoServiceProvider())
    {
        using (FileStream fs = new FileStream(filePath, FileMode.Open, FileAccess.Read, FileShare.ReadWrite, 131072))
        {
            hash = sha1.ComputeHash(fs);
        }
        //hash = sha1.ComputeHash(File.OpenRead(filePath));
    }
    return hash;
}
Also, with 15,000 files I would use a file-enumerator approach (i.e. the WinAPI functions FindFirstFile() and FindNextFile()) rather than the standard .NET Directory.GetFiles().
Directory.GetFiles loads all file paths into memory in one go, which is often much slower than enumerating files directory by directory using the WinAPI functions.
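A rough sketch of that directory-by-directory enumeration, assuming the WIN32_FIND_DATA, FindFirstFile and FindClose declarations shown near the top of this page are in scope (FindNextFile is declared here; the rest is illustrative):
using System;
using System.Collections.Generic;
using System.IO;
using System.Runtime.InteropServices;

[DllImport("kernel32.dll", CharSet = CharSet.Auto, SetLastError = true)]
static extern bool FindNextFile(IntPtr hFindFile, out WIN32_FIND_DATA lpFindFileData);

static readonly IntPtr INVALID_HANDLE_VALUE = new IntPtr(-1);

// Yields the files of a single directory lazily instead of materializing a full array first.
static IEnumerable<string> EnumerateFiles(string directory)
{
    WIN32_FIND_DATA findData;
    IntPtr handle = FindFirstFile(Path.Combine(directory, "*"), out findData);
    if (handle == INVALID_HANDLE_VALUE)
        yield break;
    try
    {
        do
        {
            if ((findData.dwFileAttributes & FileAttributes.Directory) == 0)
                yield return Path.Combine(directory, findData.cFileName);
        }
        while (FindNextFile(handle, out findData));
    }
    finally
    {
        FindClose(handle);
    }
}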
