Performance-wise: File.Copy vs. File.WriteAllText in C#?

I have file content in a string and I need to put the same content into 3 different files.
So I am using C#'s File.WriteAllText() to write the content to the first file.
Now, for the other 2 files, I have two options:
Using File.Copy(firstFile, otherFile)
Using File.WriteAllText(otherFile, content)
Performance-wise, which option is better?

If the file is relatively small, it is likely to remain in the Windows disk cache, so the performance difference will be small; it may even be that File.Copy() is faster, since it calls a Windows API that is heavily optimised and the source data will be served from the cache.
If you really care, you should instrument it and time things, although the timings are likely to be completely skewed because of Windows file caching.
One thing that might matter to you, though: if you use File.Copy(), the file attributes, including Creation Time, are copied. If you programmatically create all the files, the Creation Time is likely to differ between them.
If that is important to you, you might want to programmatically set the file attributes after the copy so that they are the same for all files.
Personally, I'd use File.Copy().
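For illustration, here is a minimal sketch of both steps, including syncing the Creation Time afterwards; the file names and content are made up:

```csharp
using System;
using System.IO;

class Program
{
    static void Main()
    {
        string content = "some file content";                          // assumed sample content
        string firstFile = Path.Combine(Path.GetTempPath(), "first.txt");
        string otherFile = Path.Combine(Path.GetTempPath(), "other.txt");

        // Write the first file from the in-memory string.
        File.WriteAllText(firstFile, content);

        // Duplicate it with File.Copy; the third argument allows overwriting.
        File.Copy(firstFile, otherFile, true);

        // Set the Creation Time explicitly if all files must share the same timestamp.
        File.SetCreationTime(otherFile, File.GetCreationTime(firstFile));

        Console.WriteLine(File.ReadAllText(otherFile) == content); // prints "True"
    }
}
```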

Related

Best strategy to implement reader for large text files

We have an application which logs its processing steps into text files. These files are used during implementation and testing to analyse problems. Each file is up to 10MB in size and contains up to 100,000 text lines.
Currently the analysis of these logs is done by opening a text viewer (Notepad++ etc) and looking for specific strings and data depending on the problem.
I am building an application which will help the analysis. It will enable a user to read files, search, highlight specific strings and other specific operations related to isolating relevant text.
The files will not be edited!
While playing a little with some concepts, I found out immediately that TextBox (and RichTextBox) don't handle display of large text very well. I managed to implement a viewer using DataGridView with acceptable performance, but that control does not support color highlighting of specific strings.
I am now thinking of holding the entire text file in memory as a string, and only displaying a very limited number of records in the RichTextBox. For scrolling and navigating I thought of adding an independent scrollbar.
One problem I have with this approach is how to get specific lines from the stored string.
If anyone has any ideas, or can point out problems with my approach, thank you.
I would suggest loading the whole thing into memory, but as a collection of strings rather than a single string. It's very easy to do that:
string[] lines = File.ReadAllLines("file.txt");
Then you can search for matching lines with LINQ, display them easily etc.
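As a sketch of the kind of LINQ search this enables (the file name and the search term are made up), keeping the original line numbers so matches can be shown in context:

```csharp
using System;
using System.IO;
using System.Linq;

class Program
{
    static void Main()
    {
        string path = Path.Combine(Path.GetTempPath(), "log.txt");  // hypothetical log file
        File.WriteAllLines(path, new[] { "INFO start", "ERROR boom", "INFO done" });

        string[] lines = File.ReadAllLines(path);

        // Pair each line with its index, then filter on the search string.
        var errors = lines
            .Select((text, index) => new { text, index })
            .Where(l => l.text.Contains("ERROR"));

        foreach (var l in errors)
            Console.WriteLine($"{l.index + 1}: {l.text}"); // prints "2: ERROR boom"
    }
}
```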
Here is an approach that scales well on modern CPUs with multiple cores.
You create an iterator block that yields the lines from the text file (or multiple text files if required):
IEnumerable<string> GetLines(string fileName)
{
    using (var streamReader = File.OpenText(fileName))
    {
        while (!streamReader.EndOfStream)
        {
            yield return streamReader.ReadLine();
        }
    }
}
You then use PLINQ to search the lines in parallel. Doing that can speed up the search considerably if you have a modern CPU.
GetLines(fileName)
.AsParallel()
.AsOrdered()
.Where(line => ...)
.ForAll(line => ...);
You supply a predicate in Where that matches the lines you need to extract. You then supply an action to ForAll that will send the lines to their final destination.
This is a simplified version of what you need to do. Your application is a GUI application and you cannot perform the search on the main thread. You will have to start a background task for this. If you want this task to be cancellable you need to check a cancellation token in the while loop in the GetLines method.
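A cancellable variant of the GetLines iterator might look like this sketch; in the GUI application the CancellationTokenSource would be owned by the UI code, and the file name here is made up:

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Threading;

class Program
{
    // Yields lines until the file ends or cancellation is requested.
    static IEnumerable<string> GetLines(string fileName, CancellationToken token)
    {
        using (var streamReader = File.OpenText(fileName))
        {
            while (!streamReader.EndOfStream)
            {
                token.ThrowIfCancellationRequested();
                yield return streamReader.ReadLine();
            }
        }
    }

    static void Main()
    {
        string path = Path.Combine(Path.GetTempPath(), "cancel-demo.txt"); // hypothetical file
        File.WriteAllLines(path, new[] { "a", "b", "c" });

        // The GUI would keep the CancellationTokenSource and call Cancel() on it.
        using (var cts = new CancellationTokenSource())
        {
            int count = 0;
            foreach (var line in GetLines(path, cts.Token))
                count++;
            Console.WriteLine(count); // prints "3"
        }
    }
}
```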
ForAll will call the action on threads from the thread pool. If you want to add the matching lines to a user interface control you need to make sure that this control is updated on the user interface thread. Depending on the UI framework you use there are different ways to do that.
This solution assumes that you can extract the lines you need in a single forward pass of the file. If you need multiple passes, perhaps based on user input, you may need to cache all the lines in memory instead. Caching 10 MB is not much, but let's say you decide to search multiple files: caching 1 GB can strain even a powerful computer, while using less memory and more CPU, as I suggest, will let you search very big files within a reasonable time on a modern desktop PC.
I suppose that when one has multiple gigabytes of RAM available, one naturally gravitates towards the "load the whole file into memory" path, but is anyone here really satisfied with such a shallow understanding of the problem? What happens when this guy wants to load a 4 gigabyte file? (Yeah, probably not likely, but programming is often about abstractions that scale and the quick fix of loading the whole thing into memory just isn't scalable.)
There are, of course, competing pressures: do you need a solution yesterday, or do you have the luxury of time to dig into the problem and learn something new? The framework also influences your thinking by presenting block-mode files as streams... you have to check the stream's BaseStream.CanSeek value and, if it is true, use the BaseStream.Seek() method to get random access. Don't get me wrong, I absolutely love the .NET framework, but I see a construction site where a bunch of "carpenters" can't put up the frame for a house because the air compressor is broken and they don't know how to use a hammer. Wax on, wax off; teach a man to fish; etc.
So if you have time, look into a sliding window. You can probably do this the easy way by using a memory-mapped file (let the framework/OS manage the sliding window), but the fun solution is to write it yourself. The basic idea is that you only have a small chunk of the file loaded into memory at any one time (the part of the file that is visible in your interface with maybe a small buffer on either side). As you move forward through the file, you can save the offsets of the beginning of each line so that you can easily seek to any earlier section of the file.
Yes, there are performance implications... welcome to the real world where one is faced with various requirements and constraints and must find the acceptable balance between time and memory utilization. This is the fun of programming... figuring out the various ways that a goal can be reached and learning what the tradeoffs are between the various paths. This is how you grow beyond the skill levels of that guy in the office who sees every problem as a nail because he only knows how to use a hammer.
[/rant]
I would suggest using MemoryMappedFile in .NET 4 (or via DllImport in earlier versions) to handle just the small portion of the file that is visible on screen, instead of wasting memory and time loading the entire file.
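As a rough sketch of that idea (the file name, content, and window offsets are made up), a MemoryMappedFile view can expose just the byte range you need without reading the rest of the file:

```csharp
using System;
using System.IO;
using System.IO.MemoryMappedFiles;
using System.Text;

class Program
{
    static void Main()
    {
        string path = Path.Combine(Path.GetTempPath(), "big.log");  // hypothetical file
        File.WriteAllText(path, "0123456789ABCDEF");

        // Map the file and read only a 4-byte window starting at offset 10,
        // standing in for "the part of the file visible on screen".
        using (var mmf = MemoryMappedFile.CreateFromFile(path, FileMode.Open))
        using (var view = mmf.CreateViewAccessor(10, 4, MemoryMappedFileAccess.Read))
        {
            var buffer = new byte[4];
            view.ReadArray(0, buffer, 0, 4);
            Console.WriteLine(Encoding.ASCII.GetString(buffer)); // prints "ABCD"
        }
    }
}
```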

Most efficient way to search for files

I am writing a program that searches and copies mp3-files to a specified directory.
Currently I am using a List that is filled with all the mp3 files in a directory (which, not surprisingly, takes a very long time). Then I use taglib-sharp to compare the ID3 tags with the artist and title entered. If they match, I copy the file.
Since this is my first program and I am very new to programming I figure there must be a better/more efficient way to do this. Does anybody have a suggestion on what I could try?
Edit: I forgot to add an important detail: I want to be able to specify which directories should be searched every time I start a search (the directories will be specified in the program itself). So storing all the files in a database or something similar isn't really an option, unless there is a way to do this on every run that is still efficient. I am basically looking for the best way to search through all the files in a directory where the files are indexed every time. (I am aware that this is probably not a good idea, but I'd like to do it that way. If there is no real way to do this, I'll have to reconsider, but for now I'd like to do it like that.)
You are mostly limited by the bottleneck that is disk IO, a consequence of the hardware you are working with. Copying the files will be the dominant cost here (finding the files is cheap by comparison).
There are other ways to go about file management, each exposing better interfaces for different purposes, such as NTFS Change Journals and low-level sector handling (not recommended). But if this is your first program in C#, you probably don't want to venture into P/Invoking native calls.
Other than alternatives to actual processes, you might consider mechanisms to minimise disk access - i.e. not redoing anything you have already done, or don't need to do.
Use a database (a simple binary-serialized file, or an embedded database like RavenDB) to cache all files, and query that cache instead.
Also store the modified time of each folder in the database. Compare the time in the database with the time on the folder each time you start your application (and sync the changed folders).
That ought to give you much better performance. Threading will not really help with searching folders, since it's the disk IO that takes time, not your application.
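A minimal in-memory sketch of that sync check; a real application would persist the dictionary between runs (serialized to a file, or in an embedded database as suggested above):

```csharp
using System;
using System.Collections.Generic;
using System.IO;

class Program
{
    // Folder path -> last write time recorded at the previous run.
    static readonly Dictionary<string, DateTime> LastSeen =
        new Dictionary<string, DateTime>();

    // True when the folder changed since the cached timestamp,
    // so only changed folders need a full re-scan.
    static bool NeedsRescan(string folder)
    {
        DateTime current = Directory.GetLastWriteTimeUtc(folder);
        DateTime cached;
        bool changed = !LastSeen.TryGetValue(folder, out cached) || cached != current;
        LastSeen[folder] = current;
        return changed;
    }

    static void Main()
    {
        string folder = Path.GetTempPath();
        Console.WriteLine(NeedsRescan(folder)); // first run: True (nothing cached yet)
        Console.WriteLine(NeedsRescan(folder)); // unchanged since then: False
    }
}
```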

Secure wipe a directory

I know how to wipe a file in C#, including its sectors and such.
But how do I overwrite the directories themselves?
Example: @"C:\mydirectory\" must be unrecoverably gone forever (all files inside are already wiped), so that it is impossible to recover the directory structure or the names.
--- Update below (comment formatting is such a hassle, so I post it here) ---
For the file deletion I look up the partition's cluster and sector sizes and overwrite the file at least 40 times using 5 different algorithms, where the last algorithm is always the random one. The data is also actually written to the disk each time (not just held in memory or something). The only risk is that when I wipe something, the physical address of that file on the disk could theoretically have changed. The only solution I know for that is to also wipe the free disk space after the file has been wiped, and hope that no other file currently occupies part of the old physical location of the wiped file. Or does Windows not do such a thing?
http://www.sans.org/reading_room/whitepapers/incident/secure-file-deletion-fact-fiction_631 states:
"It is important to note the consensus that overwriting the data only reduces the likelihood of data being recovered. The more times data is overwritten, the more expensive and time consuming it becomes to recover the data. In fact Peter Gutmann states "…it is effectively impossible to sanitize storage locations by simple overwriting them, no matter how many overwrite passes are made or what data patterns are written." Overwritten data can be recovered using magnetic force microscopy, which deals with imaging magnetization patterns on the platters of the hard disk. The actual details of how this is accomplished are beyond the scope of this paper."
Personally, I believe that if I overwrite the data 100+ times using different (maybe unknown) algorithms, and if there is no copy of the data left elsewhere (like in the swap files), it will take even a very expensive team of professionals many, many years to get that data back. And if they do get the data back after all those years, then I guess they deserve it... that must be a project for life.
So:
wiping unused data: use cipher (http://support.microsoft.com/kb/315672) or fill the hard-disk with 4GB files or use the Eraser command line executable.
wiping swap files: ?
wiping bad sectors: ?
wiping directories: use Eraser (as Teoman Soygul stated)
How do we know for sure that we overwrote the actual physical addresses?
wiping the most recently used files and the Windows log files should of course be a piece of cake for any programmer :)
Eraser solves most of the above problems but cannot wipe the page files, so any forensic analyst could still recover data that was ever in those swap files.
As far as I know, Eraser does not wipe the file allocation tables either, but I'm not sure.
And the conclusion should then be: it's (nearly) impossible to secure-wipe in C#?
There is no general approach for this... consider an SSD: you can't even be sure that your write operation will hit the same physical address, because of wear-levelling.
If all files/folders inside the folder are already wiped (as you stated), all that is left is the directory itself. Rename the directory to a name from a cryptographic random number generator and then delete it. That will be as good as wiped.
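A minimal sketch of that rename-then-delete step (the directory path is made up; Path.GetRandomFileName draws on a cryptographic random source, and a more paranoid version might rename several times before deleting):

```csharp
using System;
using System.IO;

class Program
{
    static void Main()
    {
        string target = Path.Combine(Path.GetTempPath(), "mydirectory"); // hypothetical directory
        Directory.CreateDirectory(target);

        // Rename to a random name first, so the original directory name is not
        // what remains in the (potentially recoverable) file system metadata.
        string scrambled = Path.Combine(
            Path.GetDirectoryName(target), Path.GetRandomFileName());
        Directory.Move(target, scrambled);
        Directory.Delete(scrambled, true);

        Console.WriteLine(Directory.Exists(target)); // prints "False"
    }
}
```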
If this isn't enough for you, grab a copy of the Eraser command line executable and execute the command:
Process.Start("eraserl.exe", @"-folder ""C:\MyDirectory\"" -subfolders -method DoD_E -silent");
Securely deleting is not straightforward, as you know. So it may be worth considering an alternative strategy.
Have you considered using something like TrueCrypt to create an encrypted volume? You could store the files there, then use standard delete routines. An adversary would then need to both decrypt the encrypted volume AND recover the deleted files.

C#: Is there a faster way than FileInfo.Length when comparing multiple files?

I have been trying to compare files from two different environments. Both environments are accessed via a network drive.
First, I check whether the file is present in both environments (to save time), then I get a FileInfo for both files to compare their sizes (FileInfo.Length).
My goal is to populate a ListView with every file that does not have the same size, to investigate later.
I find it hard to understand how Windows Explorer gets the file sizes of a lot of files so quickly while FileInfo is taking so long...
Thanks.
Ben
If you use DirectoryInfo.GetFiles(some-search-pattern) to get the files, and then use the FileInfo instances returned by that call, the FileInfo.Length property will already be populated from the search.
Whether this helps obviously depends on how the search performs compared to the check you are doing now (if the search is slower, you may not gain anything). Still, it may be worth looking into.
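A small sketch of that approach; the directory and file names are made up:

```csharp
using System;
using System.IO;

class Program
{
    static void Main()
    {
        string dir = Path.Combine(Path.GetTempPath(), "lencache");  // hypothetical directory
        Directory.CreateDirectory(dir);
        File.WriteAllText(Path.Combine(dir, "a.dat"), "12345");

        // GetFiles populates each FileInfo from the directory enumeration,
        // so reading Length afterwards does not hit the disk again per file.
        FileInfo[] files = new DirectoryInfo(dir).GetFiles("*.dat");
        foreach (FileInfo fi in files)
            Console.WriteLine($"{fi.Name}: {fi.Length}"); // prints "a.dat: 5"
    }
}
```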
You can also try doing it the operating-system way, via Win32 calls. You can start with this page:
http://www.pinvoke.net/default.aspx/kernel32.getfilesize

Multiple FileSystemWatchers a good idea?

I'm writing a mini editor component, much like Notepad++ or UltraEdit, that needs to monitor the files the user opens - it's a bit slimy, but that's the way it needs to be.
Is it wise to use multiple instances of FileSystemWatcher to monitor the open files, as Notepad++ or UltraEdit presumably do, or is there a better way to manage these?
They'll be properly disposed once the document has been closed.
Sorry, one other thing: would it be wiser to create a single generic FileSystemWatcher for the drive and monitor that, then only show the reload prompt once I know it's the right file? Or is that a bad idea?
You're not going to run into problems with multiple FileSystemWatchers, and there really isn't any other way to pull this off.
For performance, just be sure to specify filters as narrow as you can get away with.
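A sketch of one narrowly-filtered watcher per open document; the file path and the reload handling are made up:

```csharp
using System;
using System.IO;

class Program
{
    static void Main()
    {
        string file = Path.Combine(Path.GetTempPath(), "open.txt"); // hypothetical open document
        File.WriteAllText(file, "hello");

        // One watcher per open document: watch only the file's directory and
        // filter on the exact file name, so event traffic stays minimal.
        using (var watcher = new FileSystemWatcher
        {
            Path = System.IO.Path.GetDirectoryName(file),
            Filter = System.IO.Path.GetFileName(file),
            NotifyFilter = NotifyFilters.LastWrite | NotifyFilters.FileName
        })
        {
            // In the editor this would prompt the user to reload the document.
            watcher.Changed += (s, e) => Console.WriteLine("Reload " + e.Name + "?");
            watcher.EnableRaisingEvents = true;

            Console.WriteLine("Watching " + watcher.Filter); // prints "Watching open.txt"
        }
    }
}
```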
FileSystemWatcher has a drawback: it locks the watched folder. So if, for example, you are watching a file on removable storage, it prevents "safe device removal".
You can try using shell notifications via SHChangeNotifyRegister instead. In that case you will have one entry point for all changes (or several, if you want), but you will need some native shell interop.
It depends on the likely use cases.
If a user is going to open several files in the same directory and is unlikely to modify anything else, a single watcher for that directory may be less onerous than one per file if the number of files is large.
The only way you will find out is by benchmarking. Doing one watcher per file certainly makes each watcher's lifespan much simpler to manage, so that should be your first approach. Note that watchers fire their events on a system thread-pool thread, so multiple watchers can fire at the same time (something that may influence your design).
I certainly wouldn't do a watcher per drive; you will create far more work that way, even with aggressive filtering.
Using multiple watchers is fine if you have to. As ShuggyCoUk's comment says, you can optimize by combining file watchers into one if all your files are in the same folder.
It's probably unwise to create a file watcher on a much higher-level folder (e.g. the root of the drive), because your code then has to handle many more events fired by other changes in the file system, and it's fairly easy to hit an internal buffer overflow if your code can't handle the events fast enough.
Another argument for fewer file watchers: a FileSystemWatcher is a native object, and it pins memory. So depending on the lifespan and size of your app, you might run into memory-fragmentation issues. Here is how: your code runs for a long time (hours or days); whenever you open a file, it creates some chunk of data in memory and instantiates a file watcher; you then clean up the temporary data, but the file watcher is still there. If you repeat that many times (and don't close the files, or forget to dispose the watchers), you have created multiple objects in virtual memory that cannot be moved by the CLR, and can eventually run into memory congestion. Note that this is not a big deal with a few watchers around, but if you suspect you might get into the hundreds or more, beware: that's going to become a major issue.
