I have a program that needs to retrieve some data about a set of files (that is, a directory and all files within it and sub directories of certain types). The data is (very) expensive to calculate, so rather than traversing the filesystem and calculating it on program startup, I keep a cache of the data in a SQLite database and use a FilesystemWatcher to monitor changes to the filesystem. This works great while the program is running, but the question is how to refresh/synchronize the data during program startup. If files have been added (or changed -- I presume I can detect this via last modified/size) the data needs to be recomputed in the cache, and if files have been removed, the data needs to be removed from the cache (since the interface traverses the cache instead of the filesystem).
So the question is: what's a good algorithm to do this? One way I can think of is to traverse the filesystem and gather the path and last modified/size of all files in a dictionary. Then I go through the entire list in the database. If there is not a match, then I delete the item from the database/cache. If there is a match, then I delete the item from the dictionary. Then the dictionary contains all the items whose data needs to be refreshed. This might work, however it seems it would be fairly memory-intensive and time-consuming to perform on every startup, so I was wondering if anyone had better ideas?
If it matters: the program is Windows-only written in C# on .NET CLR 3.5, using the SQLite for ADO.NET thing which is being accessed via the entity framework/LINQ for ADO.NET.
Our application is cross-platform C++ desktop application, but has very similar requirements. Here's a high-level description of what I did:
In our SQLite database there is a Files table that stores file_id, name, hash (currently we use last modified date as the hash value) and state.
Every other record refers back to a file_id. This makes is easy to remove "dirty" records when the file changes.
Our procedure for checking the filesystem and refreshing the cache is split into several distinct steps to make things easier to test and to give us more flexibility as to when the caching occurs (the names in italics are just what I happened to pick for class names):
On 1st Launch
The database is empty. The Walker recursively walks the filesystem and adds the entries into the Files table. The state is set to UNPROCESSED.
Next, the Loader iterates through the Files table looking for UNPARSED files. These are handed off to the Parser (which does the actual parsing and inserting of data)
This takes a while, so 1st launch can be a bit slow.
There's a big testability benefit because you can test the walking the filesystem code independently from the loading/parsing code. On subsequent launches the situation is a little more complicated:
n+1 Launch
The Scrubber iterates over the Files table and looks for files that have been deleted and files that have been modified. It sets the state to DIRTY if the file exists but has been modified or DELETED if the file no longer exists.
The Deleter (not the most original name) then iterates over the Files table looking for DIRTY and DELETED files. It deletes other related records (related via the file_id). Once the related records are removed, the original File record is either deleted or set back to state=UNPARSED
The Walker then walks the filesystem to pick-up new files.
Finally the Loader loads all UNPARSED files
Currently the "worst case scenario" (every file changes) is very rare - so we do this every time the application starts-up. But by splitting the process up unto these steps we could easily extend the implementation to:
The Scrubber/Deleter could be refactored to leave the dirty records in-place until after the new
data is loaded (so the application "keeps working" while new data is cached into the database)
The Loader could load/parse on a background thread during an idle time in the main application
If you know something about the data files ahead of time you could assign a 'weight' to the files and load/parse the really-important files immediately and queue-up the less-important files for processing at a later time.
Just some thoughts / suggestions. Hope they help!
Windows has a change journal mechanism, which does what you want: you subscribe to changes in some part of the filesystem and upon startup can read a list of changes which happened since last time you read them. See: http://msdn.microsoft.com/en-us/library/aa363798(VS.85).aspx
EDIT: I think it requires rather high privileges, unfortunately
The first obvious thing that comes to mind is creating a separate small application that would always run (as a service, perhaps) and create a kind of "log" of changes in the file system (no need to work with SQLite, just write them to a file). Then, when the main application starts, it can look at the log and know exactly what has changed (don't forget to clear the log afterwards :-).
However, if that is unacceptable to you for some reason, let us try to look at the original problem.
First of all, you have to accept that, in the worst case scenario, when all the files have changed, you will need to traverse the whole tree. And that may (although not necessarily will) take a long time. Once you realize that, you have to think about doing the job in background, without blocking the application.
Second, if you have to make a decision about each file that only you know how to make, there is probably no other way than going through all files.
Putting the above in other words, you might say that the problem is inherently complex (and any given problem cannot be solved with an algorithm that is simpler than the problem itself).
Therefore, your only hope is reducing the search space by using tweaks and hacks. And I have two of those on my mind.
First, it's better to query the database separately for every file instead of building a dictionary of all files first. If you create an index on the file path column in your database, it should be quicker, and of course, less memory-intensive.
Second, you don't actually have to query the database at all :-)
Just store the exact time when your application was last running somewhere (in a .settings file?) and check every file to see if it's newer than that time. If it is, you know it's changed. If it's not, you know you've caught it's change last time (with your FileSystemWatcher).
Hope this helps. Have fun.
Related
Scenario: I want to develop an application.The application should be able to connect to my remote server and download data to the local disk , while downloading it should check for new files and only download the new ones simultaneously creating the required(new) folders.
Problem: I have no idea how to compare the files in the server with the ones in the local disk.How to download only the new files from the server to the local disk?
What am thinking?: I want to sync the files in the local machine with the ones in the server. I am planning to use rsync for syncing but i have no idea how to use it with ASP.NET.
Kindly let me know if my approach is wrong or is there any other better way to accomplish this.
First you can compare the file names, then the file size and when all matches, you can compare the hashes of the files.
I call this kind of a problem a "data mastering" problem. I synchronize our databases with a Fortune 100 company throughout the week and have handled a number of business process issues.
The first rule of handling production data is not to do your users' data entry. They must be responsible for putting any business process into motion which touches production. They must understand the process and have access to logs showing what data was changed, otherwise they cannot handle issues. If you're doing this for them, then you are assuming these responsibilities. They will expect you to fix everything when problems happen, which you cannot feasibly do because IT cannot interpret business data or its relevance. For example, I handle delivery records but had to be taught that a duplicate key indicated a carrier change.
I inherited several mismanaged scenarios where IT simply dumped "newer" data into production without any further concern. Sometimes I get junk data, where I have to manually exclude incoming records from the mastering process because they have invalid negative quantities. Some of my on-hand records are more complete than incoming data, and so I have to skip synchronizing specific columns. When one application's import process simply failed, I had to put an end to complaints by creating a working update script. These are issues you need to think ahead about, because they will encourage you to organize control of each step of the synchronization process.
Synchronization steps:
Log what is there before you update
Download and compare local vs remote copies for differences; you cannot compare the two without a) having them both in the same physical location or b) controlling the other system
Log what you're updating with, and timestamp when you're updating it
Save and close the logs
Only when 1-4 are done should you post an update to production
Now as far as organizing a "mastering" process goes, which is what I call comparing the data and producing the lists of what's different, I have more experience to share. For one application, I had to restructure (decentralize) tables and reports before I could reliably compare both sources. This implies a need to understand the business data and know it is in proper form. You don't say if you're comparing PDFs, spreadsheets or images. For data, you must write a separate mastering process for each table (or worksheet), because the mastering process's comparison step may be specially shaped by business needs. Do not write one process which masters everything. Make each process controllable.
Not all information is compared the same way when imported. We get in PO and delivery data and therefore compare tens of thousands of records to determine which data points have changed, but some invoice information is simply imported without any future checks or synchronization. Business needs can even override updates and keep stale data on your end.
Each mastering process's comparer module can then be customized as needed. You'll want specific APIs when comparing file types like PDFs and spreadsheets. I use EPPlus for workbooks. Anything you cannot open has to be binary compared, of course.
A mastering process should not clean or transform the data, especially financial data. Those steps need to occur prior to mastering so that these issues are caught before mastering is begun.
My tools organize the data in 3 tabs -- Creates, Updates and Deletes -- each with DataGridViews showing the relevant records. Then I can log, review and commit changes or hand the responsibility to someone willing.
Mastering process steps:
(Clean / transform data externally)
Load data sources
Compare external to local data
Hydrate datasets indicating Creates, Updates and Deletes
I am relatively new to C# so please bear with me.
I am writing a business application (in C#, .NET 4) that needs to be reliable. Data will be stored in files. Files will be modified (rewritten) regularly, thus I am afraid that something could go wrong (power loss, application gets killed, system freezes, ...) while saving data which would (I think) result in a corrupted file. I know that data which wasn't saved is lost, but I must not lose data which was already saved (because of corruption or ...).
My idea is to have 2 versions of every file and each time rewrite the oldest file. Then in case of unexpected end of my application at least one file should still be valid.
Is this a good approach? Is there anything else I could do? (Database is not an option)
Thank you for your time and answers.
Rather than "always write to the oldest" you can use the "safe file write" technique of:
(Assuming you want to end up saving data to foo.data, and a file with that name contains the previous valid version.)
Write new data to foo.data.new
Rename foo.data to foo.data.old
Rename foo.data.new to foo.data
Delete foo.data.old
At any one time you've always got at least one valid file, and you can tell which is the one to read just from the filename. This is assuming your file system treats rename and delete operations atomically, of course.
If foo.data and foo.data.new exist, load foo.data; foo.data.new may be broken (e.g. power off during write)
If foo.data.old and foo.data.new exist, both should be valid, but something died very shortly afterwards - you may want to load the foo.data.old version anyway
If foo.data and foo.data.old exist, then foo.data should be fine, but again something went wrong, or possibly the file couldn't be deleted.
Alternatively, simply always write to a new file, including some sort of monotonically increasing counter - that way you'll never lose any data due to bad writes. The best approach depends on what you're writing though.
You could also use File.Replace for this, which basically performs the last three steps for you. (Pass in null for the backup name if you don't want to keep a backup.)
A lot of programs uses this approach, but usually, they do more copies, to avoid also human error.
For example, Cadsoft Eagle (a program used to design circuits and printed circuit boards) do up to 9 backup copies of the same file, calling them file.b#1 ... file.b#9
Another thing you can do to enforce security is to hashing: append an hash like a CRC32 or MD5 at the end of the file.
When you open it you check the CRC or MD5, if they don't match the file is corrupted.
This will also enforce you from people that accidentally or by purpose try to modify your file with another program.
This will also give you a way to know if hard drive or usb disk got corrupted.
Of course, faster the save file operation is, the less risk of loosing data you have, but you cannot be sure that nothing will happen during or after writing.
Consider that both hard drives, usb drives and windows OS uses cache, and it means, also if you finish writing the data may be OS or disk itself still didn't physically wrote it to the disk.
Another thing you can do, save to a temporary file, if everything is ok you move the file in the real destination folder, this will reduce the risk of having half-files.
You can mix all these techniques together.
In principle there are two popular approaches to this:
Make your file format log-based, i.e. do not overwrite in the usual save case, just append changes or the latest versions at the end.
or
Write to a new file, rename the old file to a backup and rename the new file into its place.
The first leaves you with (way) more development effort, but also has the advantage of making saves go faster if you save small changes to large files (Word used to do this AFAIK).
I am writing a program that searches and copies mp3-files to a specified directory.
Currently I am using a List that is filled with all the mp3s in a directory (which takes - not surprisingly - a very long time.) Then I use taglib-sharp to compare the ID3Tags with the artist and title entered. If they match I copy the file.
Since this is my first program and I am very new to programming I figure there must be a better/more efficient way to do this. Does anybody have a suggestion on what I could try?
Edit: I forgot to add an important detail: I want to be able to specify what directories should be searched every time I start a search (the directory to be searched will be specified in the program itself). So storing all the files in a database or something similar isn't really an option (unless there is a way to do this every time which is still efficient). I am basically looking for the best way to search through all the files in a directory where the files are indexed every time. (I am aware that this is probably not a good idea but I'd like to do it that way. If there is no real way to do this I'll have to reconsider but for now I'd like to do it like that.)
You are mostly saddled with the bottleneck that is IO, a consequence of the hardware with which you are working. It will be the copying of files that is the denominator here (other than finding the files, which is dwarfed compared to copying).
There are other ways to go about file management, and each exposing better interfaces for different purposes, such as NTFS Change Journals and low-level sector handling (not recommended) for example, but if this is your first program in C# then maybe you don't want to venture into p/invoking native calls.
Other than alternatives to actual processes, you might consider mechanisms to minimise disk access - i.e. not redoing anything you have already done, or don't need to do.
Use an database (simple binary serialized file or an embedded database like RavenDb) to cache all files. And query that cache instead.
Also store modified time for each folder in the database. Compare the time in the database with the time on the folder each time you start your application (and sync changed folders).
That ought to give you much better performance. Threading will not really help searching folders since it's the disk IO that takes time, not your application.
I know how to wipe a file in C# including it's sectors and such.
But how do I overwrite the directories themselves?
Example: #"C:\mydirectory\" must be unrecoverable gone forever (all files insides are already wiped) so that it will be impossible to recover the directory structure or their names.
------------------- Update below (comment formatting is such a hassle so I post it here)---
For the file deletion I look up the partition's cluster and section size's and overwrite it at least 40 times using 5 different algorithms where the last algorithm is always the random one. The data is also actually written to the disk each time (and not just in memory or something). The only risk is that when I wipe something the physical address on the disk of that file could theoretically have been changed. The only solution I know for that is to also wipe the free disk space after the file has been wiped and hope that no other file currently partially holds the old physical location of the wiped file. Or does Windows not do such a thing?
http://www.sans.org/reading_room/whitepapers/incident/secure-file-deletion-fact-fiction_631 states:
"It is important to note the consensus that overwriting the data only reduces the
likelihood of data being recovered. The more times data is overwritten, the more
expensive and time consuming it becomes to recover the data. In fact Peter Guttman
states “…it is effectively impossible to sanitize storage locations by simple overwriting
them, no matter how many overwrite passes are made or what data patterns are written.”3 Overwritten data can be recovered using magnetic force microscopy, which
deals with imaging magnetization patterns on the platters of the hard disk. The actual
details of how this is accomplished are beyond the scope of this paper."
Personally I believe that when I overwrite the data like 100+ times using different (maybe unknown) algorithms (and if there is no copy of the data left elsewhere like in the swap files) that it will take any very expensive team of professionals many many years to get that data back. And if they do get the data back after all those years then they deserve it I guess... That must be a project for life.
So:
wiping unused data: use cipher (http://support.microsoft.com/kb/315672) or fill the hard-disk with 4GB files or use the Eraser command line executable.
wiping swap files: ?
wiping bad sectors: ?
wiping directories: use Eraser (as Teoman Soygul stated)
How do we know for sure that we overwrote the actual physical addresses?
wiping the most recently used files and the Windows log files should of course be piece a cake for any programmer :)
Eraser solves most of the above problems but cannot wipe the pages files. So any forensic will still find the data back if it was in those swap files at any moment.
AFAIK eraser does not wipe the file allocation tables. But I'm not sure.
And the conclusion should then be: It's (near) impossible to secure wipe in C#?
there is no general approach for this ... consider a SSD: you can't even make sure that your write operation will write to the same physical address, because of wear-levelling methods
If all files/folders inside the folder is already wiped (as you stated), all that is left is the directory itself. Using a cryptic random number generator rename the directory and delete it. It will be as good as wiped.
If this isn't enough for you, grab a copy of Eraser command line executable and execute the command:
Process.Start("eraserl.exe", #"-folder "C:\MyDirectory\" -subfolders -method DoD_E -silent");
Securely deleting is not straightforward, as you know. So it may be worth considering an alternative strategy.
Have you considered using something like TrueCrypt to create an encrypted volume? You could store the files there, then use standard delete routines. An adversary would then need to both decrypt the encrypted volume AND recover the deleted files.
I'm interested in getting solution ideas for a problem we have.
Background:
We have software tools that run on laptops and flash data onto hardware components. This software reads in a series of data files in order to do the programming on the hardware. It's in a manufacturing environment and is running continuously throughout the day.
Problem:
Currently, they're a central repository that the software connects to to read the data files. The software reads the files and retains a lock on them throughout the entire flashing process. This is running all throughout the day on different hardware components, so it's feasible that these files could be "locked" for most of the day.
There's new requirements that state these data files that the software is reading need to be updated in real time, will minimal impact to the end user who is doing the flashing. We will be writing the service that drops the files out there in real time.
The software is developed by a third party vendor and is not modifiable by us. However, it expects a location to look for the data files, so everything up until the point of flashing is our process that we're free to change.
Question:
What approach would you take to solve this from a solution programming standpoint? We're not sure how to drop files out there in real time given the locks that will be present on them throughout the day. We'll settle for an "as soon as possible" solution if that is significantly easier.
The only way out of this conundrum seems to be the introduction of an extra file repository, along with a service-like piece of logic in charge of keeping these repositories synchronized.
In other words, the file upload takes places in one of the repositories (call it the "input repository"), and the flashing process uses the other repository (call it the "ouput repository"). The synchronization logic permanently pools the input repository for new files (based on file time stamp or other...) and when it finds such new files, attempts to copy these to the "output directory"; such copy either takes place instantly, when the flashing logic hasn't locked the corresponding file in the output directory, or it is differed till the file gets unlocked.
Note: During the file copy, the synchronization logic can/should lock the file, hence very temporarily preventing the file to be overwritten by new uploads, but ensuring full integrity of the copied file. The difference with the existing system is that the lock is held for a much shorter amount of time.
The drawback of this system is the full duplication of the repository, and this could be a problem if the repository is very big. However there doesn't appear to be many alternatives since we do not have control over the flashing process.
"As soon as possible" is your only option. You can't update a file that's locked, that's the whole point of a lock.
Edit:
Would it be possible to put the new file in a different location and then tell the 3rd party service to look in that location the next time it needs the file?