Secure wipe a directory - c#

I know how to wipe a file in C#, including its sectors and such.
But how do I overwrite the directories themselves?
Example: @"C:\mydirectory\" must be gone forever and unrecoverable (all files inside are already wiped), so that it is impossible to recover the directory structure or the directory names.
------------------- Update below (comment formatting is such a hassle so I post it here)---
For the file deletion I look up the partition's cluster and sector sizes and overwrite the file at least 40 times using 5 different algorithms, where the last algorithm is always the random one. The data is also actually flushed to the disk on each pass (not just kept in memory or a cache). The only risk is that by the time I wipe something, the physical location of that file on the disk could theoretically have changed. The only solution I know of for that is to also wipe the free disk space after the file has been wiped and hope that no other file currently occupies part of the old physical location of the wiped file. Or does Windows not do such a thing?
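For reference, a simplified sketch of the kind of multi-pass overwrite I mean (the pass count and the WipeFile helper name are just illustrative; FileOptions.WriteThrough and Flush(true) only ask Windows to bypass its own cache, they cannot control what the drive firmware does):

using System;
using System.IO;
using System.Security.Cryptography;

static class Wiper
{
    // Hypothetical helper: overwrite a file's contents in place before deleting it.
    public static void WipeFile(string path, int passes)
    {
        long length = new FileInfo(path).Length;
        var buffer = new byte[4096];

        using (var rng = RandomNumberGenerator.Create())
        using (var stream = new FileStream(path, FileMode.Open, FileAccess.Write,
                                           FileShare.None, buffer.Length,
                                           FileOptions.WriteThrough))
        {
            for (int pass = 0; pass < passes; pass++)
            {
                stream.Position = 0;
                long remaining = length;
                while (remaining > 0)
                {
                    rng.GetBytes(buffer);                       // random pattern for this chunk
                    int count = (int)Math.Min(buffer.Length, remaining);
                    stream.Write(buffer, 0, count);
                    remaining -= count;
                }
                stream.Flush(true);                             // push through the OS cache
            }
        }

        File.Delete(path);
    }
}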
http://www.sans.org/reading_room/whitepapers/incident/secure-file-deletion-fact-fiction_631 states:
"It is important to note the consensus that overwriting the data only reduces the
likelihood of data being recovered. The more times data is overwritten, the more
expensive and time consuming it becomes to recover the data. In fact Peter Guttman
states “…it is effectively impossible to sanitize storage locations by simple overwriting
them, no matter how many overwrite passes are made or what data patterns are written.”3 Overwritten data can be recovered using magnetic force microscopy, which
deals with imaging magnetization patterns on the platters of the hard disk. The actual
details of how this is accomplished are beyond the scope of this paper."
Personally, I believe that if I overwrite the data 100+ times using different (possibly unknown) algorithms, and if no copy of the data is left elsewhere (such as in the swap files), then it will take even a very expensive team of professionals many, many years to get that data back. And if they do recover the data after all those years, then I guess they deserve it... That must be a life's work.
So:
wiping unused disk space: use cipher.exe (http://support.microsoft.com/kb/315672), fill the hard-disk with 4GB files (a rough sketch of that approach follows below), or use the Eraser command line executable.
wiping swap files: ?
wiping bad sectors: ?
wiping directories: use Eraser (as Teoman Soygul stated)
How do we know for sure that we overwrote the actual physical addresses?
wiping the most recently used files and the Windows log files should of course be a piece of cake for any programmer :)
Eraser solves most of the above problems but cannot wipe the page files. So a forensic analyst could still recover the data if it was ever in those swap files.
AFAIK Eraser does not wipe the file allocation tables, but I'm not sure.
So the conclusion would be: it's (nearly) impossible to securely wipe from C#?
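The fill-the-disk idea from the list above could look roughly like this (hypothetical helper; in practice cipher /w or Eraser are more thorough, and the "disk full" IOException would need better handling):

using System;
using System.Collections.Generic;
using System.IO;
using System.Security.Cryptography;

static class FreeSpaceFiller
{
    // Hypothetical helper: fill the drive's free space with random data, then delete the filler files.
    public static void FillFreeSpace(string directoryOnTargetDrive)
    {
        var buffer = new byte[1024 * 1024];
        var fillerFiles = new List<string>();

        using (var rng = RandomNumberGenerator.Create())
        {
            try
            {
                while (true)
                {
                    string path = Path.Combine(directoryOnTargetDrive, Guid.NewGuid().ToString("N") + ".fill");
                    fillerFiles.Add(path);
                    using (var stream = new FileStream(path, FileMode.CreateNew, FileAccess.Write,
                                                       FileShare.None, buffer.Length, FileOptions.WriteThrough))
                    {
                        for (int i = 0; i < 4096; i++)      // cap each filler file at ~4 GB
                        {
                            rng.GetBytes(buffer);           // random 1 MB block
                            stream.Write(buffer, 0, buffer.Length);
                        }
                    }
                }
            }
            catch (IOException)
            {
                // Disk is full: the previously free clusters now hold random data.
            }
        }

        foreach (string path in fillerFiles)
            File.Delete(path);
    }
}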

There is no general approach for this... consider an SSD: you can't even be sure that your write operation will go to the same physical address, because of wear-levelling.

If all files/folders inside the folder are already wiped (as you stated), all that is left is the directory entry itself. Rename the directory to a name generated with a cryptographic random number generator and then delete it (a rough C# sketch is below). It will be as good as wiped.
If this isn't enough for you, grab a copy of Eraser command line executable and execute the command:
Process.Start("eraserl.exe", @"-folder ""C:\MyDirectory\"" -subfolders -method DoD_E -silent");
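For the rename-and-delete suggestion, a rough sketch could look like this (the helper name and the number of rename passes are arbitrary; note this only obscures the name in the live directory entry, it does not overwrite old MFT/FAT records):

using System;
using System.IO;
using System.Security.Cryptography;

static class DirectoryWiper
{
    // Hypothetical helper: repeatedly rename a directory to random names, then delete it.
    public static void WipeDirectoryName(string path, int renamePasses = 5)
    {
        string parent = Path.GetDirectoryName(Path.GetFullPath(path).TrimEnd(Path.DirectorySeparatorChar));
        string current = path;

        using (var rng = RandomNumberGenerator.Create())
        {
            for (int i = 0; i < renamePasses; i++)
            {
                var bytes = new byte[16];
                rng.GetBytes(bytes);
                string randomName = BitConverter.ToString(bytes).Replace("-", "");
                string target = Path.Combine(parent, randomName);
                Directory.Move(current, target);    // overwrite the visible name with a random one
                current = target;
            }
        }

        Directory.Delete(current, recursive: true); // contents are assumed to be wiped already
    }
}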

Securely deleting is not straightforward, as you know. So it may be worth considering an alternative strategy.
Have you considered using something like TrueCrypt to create an encrypted volume? You could store the files there, then use standard delete routines. An adversary would then need to both decrypt the encrypted volume AND recover the deleted files.

Related

C# completely delete a file so that it cannot be retrieved

We have a bit of software that retrieves a file from a client, decrypts it, processes it, encrypts the results, and sends it back.
We use PGP keys (our private to decrypt, their public to encrypt).
However it occurs to me that although we delete the file after we've processed it, it may be possible in theory to use an undelete tool to get it from the hard disk.
At the moment we use the gpg2.exe program (part of gpg4win) to do the PGP decryption, so I am not sure we can decrypt it directly to memory so that it never touches the hard disk.
Is there a simple way to ensure it's completely gone for good when deleting it?
You could check if the gpg program allows getting the output from stdout instead of writing it to a file, so it doesn't get written to disk. Possibly there is also a C# or C++ library that could do the same.
If you have to use an intermediate file, you can make it a bit harder by overwriting the contents with random data a few times before deleting it, or using a specialised shredder tool to delete it.
As an aside: Note that if you are paranoid enough to worry about someone using special software to recover deleted data, you may also want to worry about fragments of the data remaining in RAM.
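To illustrate the stdout suggestion: gpg2 normally writes decrypted output to stdout when no output file is given, so you could capture it straight into memory. Treat the exact flags below as an assumption to verify against your gpg2 version:

using System.Diagnostics;
using System.IO;

static class GpgDecryptor
{
    // Hypothetical helper: decrypt with gpg2.exe and keep the plaintext in memory only.
    public static byte[] DecryptToMemory(string encryptedFilePath)
    {
        var startInfo = new ProcessStartInfo
        {
            FileName = "gpg2.exe",
            Arguments = "--batch --decrypt \"" + encryptedFilePath + "\"",
            UseShellExecute = false,            // required for stream redirection
            RedirectStandardOutput = true,
            CreateNoWindow = true
        };

        using (var process = Process.Start(startInfo))
        using (var plaintext = new MemoryStream())
        {
            // Read the decrypted bytes from stdout instead of letting gpg write a file.
            process.StandardOutput.BaseStream.CopyTo(plaintext);
            process.WaitForExit();
            return plaintext.ToArray();
        }
    }
}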
You can use Wipe utility (http://wipe.sourceforge.net/) to, hm, wipe the unencrypted data.
Yes, it's trivial to both undelete the file and capture it on-the-fly as it's being written. So the safer approach is to use an OpenPGP library to perform all operations in memory (unless you have huge files that just don't fit into memory).
You need to remember that memory is also swapped to disk, so if you have top-secret data, your task becomes more complicated - you would need to create memory blocks that are not swapped, and somehow use them from .NET.
There's one more complication - if your application decrypts the data, then it has a private key nearby. There's a good chance for the attacker to steal the encryption key and then steal encrypted files and decrypt them.
So your main problem is outside of the disk - it's in ensuring security of the computer system in whole.

How to ensure that data doesn't get corrupted when saving to file?

I am relatively new to C# so please bear with me.
I am writing a business application (in C#, .NET 4) that needs to be reliable. Data will be stored in files. Files will be modified (rewritten) regularly, thus I am afraid that something could go wrong (power loss, application gets killed, system freezes, ...) while saving data which would (I think) result in a corrupted file. I know that data which wasn't saved is lost, but I must not lose data which was already saved (because of corruption or ...).
My idea is to have 2 versions of every file and each time rewrite the oldest file. Then in case of unexpected end of my application at least one file should still be valid.
Is this a good approach? Is there anything else I could do? (Database is not an option)
Thank you for your time and answers.
Rather than "always write to the oldest" you can use the "safe file write" technique of:
(Assuming you want to end up saving data to foo.data, and a file with that name contains the previous valid version.)
Write new data to foo.data.new
Rename foo.data to foo.data.old
Rename foo.data.new to foo.data
Delete foo.data.old
At any one time you've always got at least one valid file, and you can tell which is the one to read just from the filename. This is assuming your file system treats rename and delete operations atomically, of course.
If foo.data and foo.data.new exist, load foo.data; foo.data.new may be broken (e.g. power off during write)
If foo.data.old and foo.data.new exist, both should be valid, but something died very shortly afterwards - you may want to load the foo.data.old version anyway
If foo.data and foo.data.old exist, then foo.data should be fine, but again something went wrong, or possibly the file couldn't be deleted.
Alternatively, simply always write to a new file, including some sort of monotonically increasing counter - that way you'll never lose any data due to bad writes. The best approach depends on what you're writing though.
You could also use File.Replace for this, which basically performs the last three steps for you. (Pass in null for the backup name if you don't want to keep a backup.)
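A minimal sketch of that pattern using File.Replace could look like this (file names are placeholders; File.Replace throws if foo.data does not exist yet, hence the first-save branch):

using System.IO;

static class SafeFileWriter
{
    // Hypothetical helper: write to a temp file first, then swap it into place.
    public static void Save(string path, byte[] data)
    {
        string newPath = path + ".new";
        string backupPath = path + ".old";

        File.WriteAllBytes(newPath, data);          // 1. write the new version to foo.data.new

        if (File.Exists(path))
        {
            // 2+3. swap foo.data.new into place, keeping the old version as foo.data.old
            File.Replace(newPath, path, backupPath);
            File.Delete(backupPath);                // 4. drop the backup (or keep it, if you prefer)
        }
        else
        {
            File.Move(newPath, path);               // first save: nothing to replace yet
        }
    }
}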
A lot of programs use this approach, but they usually keep more copies, also to guard against human error.
For example, Cadsoft Eagle (a program used to design circuits and printed circuit boards) keeps up to 9 backup copies of the same file, naming them file.b#1 ... file.b#9.
Another thing you can do to improve integrity is hashing: append a hash like a CRC32 or MD5 at the end of the file.
When you open it, you check the CRC or MD5; if they don't match, the file is corrupted.
This also protects you from people who accidentally or deliberately modify your file with another program.
It also gives you a way to know if the hard drive or USB disk got corrupted.
Of course, the faster the save operation is, the lower the risk of losing data, but you can never be sure that nothing will happen during or after writing.
Consider that hard drives, USB drives and the Windows OS all use caches, which means that even after you finish writing, the OS or the disk itself may not yet have physically written the data to the platters.
Another thing you can do: save to a temporary file and, if everything is OK, move the file to the real destination folder; this reduces the risk of ending up with half-written files.
You can mix all these techniques together.
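A small sketch of the append-a-hash idea (MD5 here for brevity; the layout, payload followed by a 16-byte MD5, is just one possible convention):

using System;
using System.IO;
using System.Linq;
using System.Security.Cryptography;

static class ChecksummedFile
{
    // Hypothetical helpers: append an MD5 of the payload on save, verify it on load.
    public static void Save(string path, byte[] payload)
    {
        using (var md5 = MD5.Create())
        {
            byte[] hash = md5.ComputeHash(payload);         // 16 bytes
            File.WriteAllBytes(path, payload.Concat(hash).ToArray());
        }
    }

    public static byte[] Load(string path)
    {
        byte[] all = File.ReadAllBytes(path);
        if (all.Length < 16)
            throw new InvalidDataException("File too short to contain a checksum.");

        byte[] payload = all.Take(all.Length - 16).ToArray();
        byte[] storedHash = all.Skip(all.Length - 16).ToArray();

        using (var md5 = MD5.Create())
        {
            if (!md5.ComputeHash(payload).SequenceEqual(storedHash))
                throw new InvalidDataException("Checksum mismatch: the file is corrupted.");
        }
        return payload;
    }
}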
In principle there are two popular approaches to this:
Make your file format log-based, i.e. do not overwrite in the usual save case, just append changes or the latest versions at the end.
or
Write to a new file, rename the old file to a backup and rename the new file into its place.
The first leaves you with (way) more development effort, but also has the advantage of making saves go faster if you save small changes to large files (Word used to do this AFAIK).

Most efficient way to search for files

I am writing a program that searches and copies mp3-files to a specified directory.
Currently I am using a List that is filled with all the mp3s in a directory (which takes - not surprisingly - a very long time.) Then I use taglib-sharp to compare the ID3Tags with the artist and title entered. If they match I copy the file.
Since this is my first program and I am very new to programming I figure there must be a better/more efficient way to do this. Does anybody have a suggestion on what I could try?
Edit: I forgot to add an important detail: I want to be able to specify what directories should be searched every time I start a search (the directory to be searched will be specified in the program itself). So storing all the files in a database or something similar isn't really an option (unless there is a way to do this every time which is still efficient). I am basically looking for the best way to search through all the files in a directory where the files are indexed every time. (I am aware that this is probably not a good idea but I'd like to do it that way. If there is no real way to do this I'll have to reconsider but for now I'd like to do it like that.)
You are mostly saddled with the bottleneck that is I/O, a consequence of the hardware you are working with. It will be the copying of files that dominates here (finding the files is dwarfed by the cost of copying them).
There are other ways to go about file management, each exposing better interfaces for different purposes, such as NTFS Change Journals and low-level sector handling (not recommended), but if this is your first program in C# then maybe you don't want to venture into p/invoking native calls.
Other than alternatives to actual processes, you might consider mechanisms to minimise disk access - i.e. not redoing anything you have already done, or don't need to do.
Use a database (a simple binary serialized file or an embedded database like RavenDb) to cache all files, and query that cache instead.
Also store modified time for each folder in the database. Compare the time in the database with the time on the folder each time you start your application (and sync changed folders).
That ought to give you much better performance. Threading will not really help searching folders since it's the disk IO that takes time, not your application.
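A rough sketch of the folder-timestamp comparison, using a plain text cache file instead of a real database (the "folder|ticks" format is an arbitrary placeholder; note a folder's last write time changes when files are added, removed or renamed in it, not when a contained file's content changes):

using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

static class Mp3FolderCache
{
    // Hypothetical cache: "folder|ticks" lines recording each folder's last write time from the previous run.
    public static List<string> GetChangedFolders(string root, string cacheFile)
    {
        var previous = File.Exists(cacheFile)
            ? File.ReadAllLines(cacheFile)
                  .Select(line => line.Split('|'))
                  .ToDictionary(p => p[0], p => new DateTime(long.Parse(p[1]), DateTimeKind.Utc))
            : new Dictionary<string, DateTime>();

        var folders = new[] { root }
            .Concat(Directory.GetDirectories(root, "*", SearchOption.AllDirectories));

        var changed = new List<string>();
        var snapshot = new List<string>();

        foreach (string folder in folders)
        {
            DateTime lastWrite = Directory.GetLastWriteTimeUtc(folder);
            snapshot.Add(folder + "|" + lastWrite.Ticks);

            DateTime cached;
            if (!previous.TryGetValue(folder, out cached) || cached < lastWrite)
                changed.Add(folder);        // new or modified folder: only these need re-scanning
        }

        File.WriteAllLines(cacheFile, snapshot.ToArray());
        return changed;
    }
}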

What is the fastest way to write hundreds of files to disk using C#?

My program has to write hundreds of files to disk, received from external resources (network).
Each file is a simple document that I currently store with a GUID as its name in a specific folder, but creating, writing and closing hundreds of files is a lengthy process.
Is there any better way to store these amount of files to disk?
I've come to a solution, but I don't know if it is the best.
First, I create 2 files: one of them acts like an allocation table and the second one is a huge file storing all the content of my documents. But reading from this file would be a nightmare; maybe a memory-mapped file technique could help. Could working with 30 GB or more create a problem?
Edit: What is the fastest way to storing 1000 text files on disk ? (write operation performs frequently)
This is similar to how Subversion stores its repositories on disk. Each revision in the repository is stored as a file, and the repository uses a folder for each 1000 revisions. This seems to perform rather well, except there is a good chance for the files to either become fragmented or be located further apart from each other. Subversion allows you to pack each 1000 revision folder into a single file (but this works nicely since the revisions are not modified once created).
If you plan on modifying these documents often, you could consider using an embedded database to manage the solid file for you (Firebird is a good one that doesn't have any size limitations). This way you don't have to manage the growth and organization of the files yourself (which can get complicated when you start modifying files inside the solid file). This will also help with the issues of concurrent access (reading / writing) if you use a separate service / process to manage the database and communicate with it. The new version of Firebird (2.5) supports multiple process access to a database even when using an embedded server. This way you can have multiple accesses to your file storage without having to run a database server.
The first thing you should do is profile your app. In particular you want to get the counters around Disk Queue Length. Your queue length shouldn't be any more than 1.5 to 2 times the number of disk spindles you have.
For example, if you have a single disk system, then the queue length shouldn't go above 2. If you have a RAID array with 3 disks, it shouldn't go above 6.
Verify that you are indeed write bound. If so then the best way to speed up performance of doing massive writes is to buy disks with very fast write performance. Note that most RAID setups will result in decreased performance.
If write performance is critical, then spreading out the storage across multiple drives could work. Of course, you would have to take this into consideration for any app that needs to read that information. And you'll still have to buy fast drives.
Note that not all drives are created equal and some are better suited for high performance than others.
What about using the ThreadPool for that?
I.e. for each received "file", enqueue a write function in a thread pool thread that actually persists the data to a file on disk.
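A minimal sketch of that (GUID naming taken from the question; error handling omitted):

using System;
using System.IO;
using System.Threading;

static class AsyncFileWriter
{
    // Queue each received document so the network thread isn't blocked by disk I/O.
    public static void QueueWrite(string folder, byte[] document)
    {
        ThreadPool.QueueUserWorkItem(_ =>
        {
            string path = Path.Combine(folder, Guid.NewGuid().ToString("N") + ".doc");
            File.WriteAllBytes(path, document);     // the actual disk write happens on a pool thread
        });
    }
}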

Synchronizing filesystem and cached data on program startup

I have a program that needs to retrieve some data about a set of files (that is, a directory and all files within it and sub directories of certain types). The data is (very) expensive to calculate, so rather than traversing the filesystem and calculating it on program startup, I keep a cache of the data in a SQLite database and use a FilesystemWatcher to monitor changes to the filesystem. This works great while the program is running, but the question is how to refresh/synchronize the data during program startup. If files have been added (or changed -- I presume I can detect this via last modified/size) the data needs to be recomputed in the cache, and if files have been removed, the data needs to be removed from the cache (since the interface traverses the cache instead of the filesystem).
So the question is: what's a good algorithm to do this? One way I can think of is to traverse the filesystem and gather the path and last modified/size of all files in a dictionary. Then I go through the entire list in the database. If there is not a match, then I delete the item from the database/cache. If there is a match, then I delete the item from the dictionary. Then the dictionary contains all the items whose data needs to be refreshed. This might work, however it seems it would be fairly memory-intensive and time-consuming to perform on every startup, so I was wondering if anyone had better ideas?
If it matters: the program is Windows-only written in C# on .NET CLR 3.5, using the SQLite for ADO.NET thing which is being accessed via the entity framework/LINQ for ADO.NET.
Our application is a cross-platform C++ desktop application, but it has very similar requirements. Here's a high-level description of what I did:
In our SQLite database there is a Files table that stores file_id, name, hash (currently we use last modified date as the hash value) and state.
Every other record refers back to a file_id. This makes it easy to remove "dirty" records when the file changes.
Our procedure for checking the filesystem and refreshing the cache is split into several distinct steps to make things easier to test and to give us more flexibility as to when the caching occurs (the names in italics are just what I happened to pick for class names):
On 1st Launch
The database is empty. The Walker recursively walks the filesystem and adds the entries into the Files table. The state is set to UNPARSED.
Next, the Loader iterates through the Files table looking for UNPARSED files. These are handed off to the Parser (which does the actual parsing and inserting of data).
This takes a while, so 1st launch can be a bit slow.
There's a big testability benefit because you can test the filesystem-walking code independently from the loading/parsing code. On subsequent launches the situation is a little more complicated:
n+1 Launch
The Scrubber iterates over the Files table and looks for files that have been deleted and files that have been modified. It sets the state to DIRTY if the file exists but has been modified or DELETED if the file no longer exists.
The Deleter (not the most original name) then iterates over the Files table looking for DIRTY and DELETED files. It deletes other related records (related via the file_id). Once the related records are removed, the original File record is either deleted or set back to state=UNPARSED
The Walker then walks the filesystem to pick-up new files.
Finally the Loader loads all UNPARSED files
Currently the "worst case scenario" (every file changes) is very rare - so we do this every time the application starts up. But by splitting the process up into these steps we could easily extend the implementation to:
The Scrubber/Deleter could be refactored to leave the dirty records in-place until after the new data is loaded (so the application "keeps working" while new data is cached into the database)
The Loader could load/parse on a background thread during an idle time in the main application
If you know something about the data files ahead of time you could assign a 'weight' to the files and load/parse the really-important files immediately and queue-up the less-important files for processing at a later time.
Just some thoughts / suggestions. Hope they help!
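As a schematic illustration of the Scrubber step, with the Files table reduced to an in-memory list (the record and enum names are placeholders; in the real thing the rows live in SQLite):

using System;
using System.Collections.Generic;
using System.IO;

enum FileState { Unparsed, Parsed, Dirty, Deleted }

class FileRecord
{
    public long FileId;
    public string Path;
    public DateTime Hash;       // last modified date is used as the "hash" value
    public FileState State;
}

static class Scrubber
{
    // Mark records whose files were modified or removed since the last run.
    public static void Scrub(IEnumerable<FileRecord> files)
    {
        foreach (var record in files)
        {
            if (!File.Exists(record.Path))
            {
                record.State = FileState.Deleted;                  // file is gone
            }
            else if (File.GetLastWriteTimeUtc(record.Path) != record.Hash)
            {
                record.State = FileState.Dirty;                    // file changed on disk
            }
        }
    }
}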
Windows has a change journal mechanism, which does what you want: you subscribe to changes in some part of the filesystem and upon startup can read a list of changes which happened since last time you read them. See: http://msdn.microsoft.com/en-us/library/aa363798(VS.85).aspx
EDIT: I think it requires rather high privileges, unfortunately
The first obvious thing that comes to mind is creating a separate small application that would always run (as a service, perhaps) and create a kind of "log" of changes in the file system (no need to work with SQLite, just write them to a file). Then, when the main application starts, it can look at the log and know exactly what has changed (don't forget to clear the log afterwards :-).
However, if that is unacceptable to you for some reason, let us try to look at the original problem.
First of all, you have to accept that, in the worst case scenario, when all the files have changed, you will need to traverse the whole tree. And that may (although not necessarily will) take a long time. Once you realize that, you have to think about doing the job in background, without blocking the application.
Second, if you have to make a decision about each file that only you know how to make, there is probably no other way than going through all files.
Putting the above in other words, you might say that the problem is inherently complex (and any given problem cannot be solved with an algorithm that is simpler than the problem itself).
Therefore, your only hope is reducing the search space by using tweaks and hacks. And I have two of those on my mind.
First, it's better to query the database separately for every file instead of building a dictionary of all files first. If you create an index on the file path column in your database, it should be quicker, and of course, less memory-intensive.
Second, you don't actually have to query the database at all :-)
Just store the exact time when your application was last running somewhere (in a .settings file?) and check every file to see if it's newer than that time. If it is, you know it has changed. If it's not, you know you caught its change last time (with your FileSystemWatcher).
Hope this helps. Have fun.
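A small sketch of that last-run-time check (the settings file format is a placeholder; Directory.GetFiles is used since Directory.EnumerateFiles isn't available on .NET 3.5):

using System;
using System.IO;
using System.Linq;

static class StartupSync
{
    // Return files modified since the application last ran, then record the new run time.
    public static string[] GetFilesChangedSinceLastRun(string root, string settingsFile)
    {
        DateTime lastRun = File.Exists(settingsFile)
            ? new DateTime(long.Parse(File.ReadAllText(settingsFile)), DateTimeKind.Utc)
            : DateTime.MinValue;

        string[] changed = Directory.GetFiles(root, "*", SearchOption.AllDirectories)
                                    .Where(path => File.GetLastWriteTimeUtc(path) > lastRun)
                                    .ToArray();

        File.WriteAllText(settingsFile, DateTime.UtcNow.Ticks.ToString());
        return changed;
    }
}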
