I have a mixed application that keeps data both in a database and in a physical file store maintained by the application itself. While developing the application I have, on occasion, run into scenarios where I am moving or deleting a file from the hard drive through my application and, for whatever reason, something goes wrong and an exception is thrown. Currently I simply log this and move on.
Assuming the delete or move scenario, when I throw and log the exception I now have a rogue or possibly missing file taking up space and also possibly causing presentation errors within the application. Beyond manual maintenance of the file system, what are some reliable techniques for maintaining a file system from an application?
I am particularly interested in how to make sure, no matter what, a file I call Delete() on in my application is in fact deleted.
If you are using Vista or later, you can use Transactional NTFS (the transactional file system) to ensure your file operations are atomic. You can find some examples, along with wrappers and the like, at Transactional File System Operations.
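For illustration only, here is a minimal P/Invoke sketch of a transacted delete. The wrapper class and its names are mine, not from the linked article, and note that Microsoft has since deprecated TxF, so treat this as a sketch rather than a recommendation:

```csharp
using System;
using System.ComponentModel;
using System.Runtime.InteropServices;

// Hypothetical wrapper around Transactional NTFS (TxF).
static class TransactedDelete
{
    [DllImport("ktmw32.dll", SetLastError = true, CharSet = CharSet.Unicode)]
    static extern IntPtr CreateTransaction(IntPtr securityAttributes, IntPtr uow, uint createOptions,
        uint isolationLevel, uint isolationFlags, uint timeout, string description);

    [DllImport("ktmw32.dll", SetLastError = true)]
    static extern bool CommitTransaction(IntPtr transaction);

    [DllImport("ktmw32.dll", SetLastError = true)]
    static extern bool RollbackTransaction(IntPtr transaction);

    [DllImport("kernel32.dll", SetLastError = true)]
    static extern bool CloseHandle(IntPtr handle);

    [DllImport("kernel32.dll", SetLastError = true, CharSet = CharSet.Unicode)]
    static extern bool DeleteFileTransacted(string fileName, IntPtr transaction);

    public static void Delete(string path)
    {
        // Create a kernel transaction, delete the file inside it, then commit.
        IntPtr tx = CreateTransaction(IntPtr.Zero, IntPtr.Zero, 0, 0, 0, 0, "Delete " + path);
        if (tx == IntPtr.Zero || tx == new IntPtr(-1))
            throw new Win32Exception(Marshal.GetLastWin32Error());

        try
        {
            if (!DeleteFileTransacted(path, tx))
            {
                int error = Marshal.GetLastWin32Error();
                RollbackTransaction(tx);
                throw new Win32Exception(error);
            }
            if (!CommitTransaction(tx))
                throw new Win32Exception(Marshal.GetLastWin32Error());
        }
        finally
        {
            CloseHandle(tx);
        }
    }
}
```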
Since you are already using a database in your application, you could consider creating a table to track your file system operations. For instance, insert a row containing the details of the file system operation you are about to perform, then perform the operation, and on success either delete the row or mark it as completed. If your application fails and/or needs to be restarted, this gives you an easy way to determine which file system operations did not complete successfully and need to be retried.
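A minimal sketch of that journaling idea, assuming SQL Server via System.Data.SqlClient; the FileOperations table and its columns are made up for the example:

```csharp
using System.Data.SqlClient;
using System.IO;

public static class FileJournal
{
    public static void DeleteWithJournal(string connectionString, string path)
    {
        using (var conn = new SqlConnection(connectionString))
        {
            conn.Open();

            // 1. Record the intent before touching the file system.
            long opId;
            using (var cmd = new SqlCommand(
                "INSERT INTO FileOperations (Operation, Path, Completed) VALUES ('DELETE', @path, 0); " +
                "SELECT CAST(SCOPE_IDENTITY() AS bigint);", conn))
            {
                cmd.Parameters.AddWithValue("@path", path);
                opId = (long)cmd.ExecuteScalar();
            }

            // 2. Perform the file system operation.
            File.Delete(path);

            // 3. Mark it completed only after the delete succeeded.
            using (var cmd = new SqlCommand(
                "UPDATE FileOperations SET Completed = 1 WHERE Id = @id", conn))
            {
                cmd.Parameters.AddWithValue("@id", opId);
                cmd.ExecuteNonQuery();
            }
        }
    }
}
```

On startup (or on a timer) you would then query for rows with Completed = 0 and retry or reconcile them.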
I have to design a backup algorithm for some files used by a Windows Service. I already have some ideas, but I would like to hear the opinions of wiser people in order to improve what I have in mind.
The software that I am dealing with follows a client-server architecture.
On the server side, we have a Windows Service that performs tasks such as monitoring folders, and it has several XML configuration files (around 10). These are the files I want to back up.
On the client side, the user has a graphical interface that allows them to modify these configuration files, although this shouldn't happen very often. Communication with the server is done using WCF.
So the config files might be modified remotely by the user, but the administrator might also modify them manually on the server (the Windows Service monitors these changes).
For the moment, this is what I have in mind for the backup algorithm (it's quite simple, though):
When: backups will be performed in two situations:
Periodically: a parallel thread in the server application will copy the configuration files every XXXX months/weeks/whatever (configurable parameter). That is, a backup is not performed each time the files are modified by user action, only on this schedule and when the client application is launched (see below).
Every time the user launches the client: whenever the server detects that a user has launched the client application, the server side will perform a backup.
How:
There will be a folder named Backup in the ProgramData folder of the Windows Service. There, each time a backup is performed, a sub-folder named BackupYYYYMMDDHHmm will be created containing all of the files concerned.
Maintenance: backup folders won't be kept forever. Periodically, all of those older than XXXX weeks/months/years (configurable parameter) will be deleted. Alternatively, I might only keep the N most recent backup sub-folders (configurable parameter). I still haven't chosen an option, but I think I'll go for the first one.
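A rough sketch of what the backup and retention steps might look like; the folder names, the *.xml filter, and the "keep only the newest N" pruning option are assumptions for illustration:

```csharp
using System;
using System.IO;
using System.Linq;

public static class ConfigBackup
{
    public static void Backup(string configDir, string backupRoot)
    {
        // Create BackupYYYYMMDDHHmm and copy the config files into it.
        string target = Path.Combine(backupRoot, "Backup" + DateTime.Now.ToString("yyyyMMddHHmm"));
        Directory.CreateDirectory(target);
        foreach (string file in Directory.GetFiles(configDir, "*.xml"))
            File.Copy(file, Path.Combine(target, Path.GetFileName(file)), overwrite: true);
    }

    // Keep only the newest 'keep' backup folders (the "N sub-folders" option).
    public static void Prune(string backupRoot, int keep)
    {
        var old = new DirectoryInfo(backupRoot).GetDirectories("Backup*")
            .OrderByDescending(d => d.CreationTimeUtc)
            .Skip(keep);
        foreach (var dir in old)
            dir.Delete(recursive: true);
    }
}
```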
So, this is it. Comments are very welcome. Thanks!!
I think your design is viable. Just a few comments:
Do you need to back up to a separate place other than the server? I don't feel it's safe to back up important data on the same server; I would rather back it up to a separate disk (perhaps a network location).
You need to implement the monitoring/backup/retention/etc. yourself, and it sounds complicated - how long do you wish to spend on this?
Personally I would use a simple trick to achieve the backup. For example, since the data are plain text files (XML format) and light, I might simply back them up to a source control system: make the folder an SVN checkout (or use some other tool) and create a simple script that detects and checks in changes, then schedule the script to run every few hours (or more often depending on your needs, or have it triggered on demand by your service/app). This eliminates unnecessary copying of data (only changes are checked in), and it's much more traceable since SVN keeps the full history.
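For instance, the "script" could be as small as a scheduled console app that shells out to the svn command-line client (assuming it is installed and the config folder is already a working copy):

```csharp
using System.Diagnostics;

public static class SvnBackup
{
    public static void CommitChanges(string workingCopy)
    {
        Run("svn", $"add --force \"{workingCopy}\"");                           // pick up any new files
        Run("svn", $"commit -m \"Automatic config backup\" \"{workingCopy}\""); // check in the changes
    }

    static void Run(string fileName, string arguments)
    {
        var psi = new ProcessStartInfo(fileName, arguments)
        {
            UseShellExecute = false,
            CreateNoWindow = true
        };
        using (var p = Process.Start(psi))
            p.WaitForExit();
    }
}
```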
Hope the above helps a bit...
Scenario: I want to develop an application. The application should be able to connect to my remote server and download data to the local disk. While downloading, it should check for new files and only download the new ones, creating the required (new) folders as it goes.
Problem: I have no idea how to compare the files on the server with the ones on the local disk. How do I download only the new files from the server to the local disk?
What am I thinking?: I want to sync the files on the local machine with the ones on the server. I am planning to use rsync for syncing, but I have no idea how to use it with ASP.NET.
Kindly let me know if my approach is wrong, or if there is a better way to accomplish this.
First you can compare the file names, then the file sizes, and when everything matches, you can compare the hashes of the files.
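A sketch of that name → size → hash sequence, assuming both files are reachable as paths (e.g. after downloading the remote file, or via a share); SHA-256 is used here, but MD5 would work the same way:

```csharp
using System;
using System.IO;
using System.Linq;
using System.Security.Cryptography;

public static class FileComparer
{
    public static bool AreSame(string localPath, string remotePath)
    {
        if (!string.Equals(Path.GetFileName(localPath), Path.GetFileName(remotePath),
                           StringComparison.OrdinalIgnoreCase))
            return false;                                            // 1. names differ

        if (new FileInfo(localPath).Length != new FileInfo(remotePath).Length)
            return false;                                            // 2. sizes differ

        return Hash(localPath).SequenceEqual(Hash(remotePath));      // 3. compare content hashes
    }

    static byte[] Hash(string path)
    {
        using (var sha = SHA256.Create())
        using (var stream = File.OpenRead(path))
            return sha.ComputeHash(stream);
    }
}
```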
I call this kind of problem a "data mastering" problem. I synchronize our databases with a Fortune 100 company throughout the week and have handled a number of business process issues.
The first rule of handling production data is not to do your users' data entry. They must be responsible for putting any business process into motion which touches production. They must understand the process and have access to logs showing what data was changed, otherwise they cannot handle issues. If you're doing this for them, then you are assuming these responsibilities. They will expect you to fix everything when problems happen, which you cannot feasibly do because IT cannot interpret business data or its relevance. For example, I handle delivery records but had to be taught that a duplicate key indicated a carrier change.
I inherited several mismanaged scenarios where IT simply dumped "newer" data into production without any further concern. Sometimes I get junk data, where I have to manually exclude incoming records from the mastering process because they have invalid negative quantities. Some of my on-hand records are more complete than incoming data, and so I have to skip synchronizing specific columns. When one application's import process simply failed, I had to put an end to complaints by creating a working update script. These are issues you need to think ahead about, because they will encourage you to organize control of each step of the synchronization process.
Synchronization steps:
1. Log what is there before you update
2. Download and compare local vs. remote copies for differences; you cannot compare the two without a) having them both in the same physical location or b) controlling the other system
3. Log what you're updating with, and timestamp when you're updating it
4. Save and close the logs
5. Only when 1-4 are done should you post an update to production
Now as far as organizing a "mastering" process goes, which is what I call comparing the data and producing the lists of what's different, I have more experience to share. For one application, I had to restructure (decentralize) tables and reports before I could reliably compare both sources. This implies a need to understand the business data and know it is in proper form. You don't say if you're comparing PDFs, spreadsheets or images. For data, you must write a separate mastering process for each table (or worksheet), because the mastering process's comparison step may be specially shaped by business needs. Do not write one process which masters everything. Make each process controllable.
Not all information is compared the same way when imported. We get in PO and delivery data and therefore compare tens of thousands of records to determine which data points have changed, but some invoice information is simply imported without any future checks or synchronization. Business needs can even override updates and keep stale data on your end.
Each mastering process's comparer module can then be customized as needed. You'll want specific APIs when comparing file types like PDFs and spreadsheets. I use EPPlus for workbooks. Anything you cannot open has to be binary compared, of course.
A mastering process should not clean or transform the data, especially financial data. Those steps need to occur prior to mastering so that such issues are caught before mastering begins.
My tools organize the data in 3 tabs -- Creates, Updates and Deletes -- each with DataGridViews showing the relevant records. Then I can log, review and commit changes or hand the responsibility to someone willing.
Mastering process steps:
(Clean / transform data externally)
Load data sources
Compare external to local data
Hydrate datasets indicating Creates, Updates and Deletes
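A simplified sketch of the "hydrate Creates, Updates and Deletes" step; the Record type, its Key/Content properties, and the idea of comparing a single content field are assumptions, since real comparers are shaped by business needs as described above:

```csharp
using System.Collections.Generic;
using System.Linq;

public class Record
{
    public string Key { get; set; }
    public string Content { get; set; }   // stand-in for the compared columns
}

public class MasteringResult
{
    public List<Record> Creates { get; } = new List<Record>();
    public List<Record> Updates { get; } = new List<Record>();
    public List<Record> Deletes { get; } = new List<Record>();
}

public static class Mastering
{
    public static MasteringResult Compare(IEnumerable<Record> external, IEnumerable<Record> local)
    {
        var result = new MasteringResult();
        var localByKey = local.ToDictionary(r => r.Key);

        foreach (var ext in external)
        {
            if (!localByKey.TryGetValue(ext.Key, out var loc))
                result.Creates.Add(ext);                    // not present locally
            else if (loc.Content != ext.Content)
                result.Updates.Add(ext);                    // present but different
            localByKey.Remove(ext.Key);                     // mark as matched
        }

        result.Deletes.AddRange(localByKey.Values);         // local records never matched
        return result;
    }
}
```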
I have a project ongoing at the moment, which is to create a Windows Service that essentially moves files around between multiple paths. A job may be to, every 60 seconds, get all files matching a regular expression from an FTP server and transfer them to a network path, and so on. These jobs are stored in an SQL database.
Currently, the service takes the form of a console application for ease of development. Jobs are added using an ASP.NET page and can be edited using another ASP.NET page.
I have some issues though, some relating to Quartz.NET and some general issues.
Quartz.NET:
1: This is the biggest issue I have. Seeing as I'm developing the application as a console application for the time being, I'm having to create a new Quartz.NET scheduler in all my files/pages. This is causing multiple confusing errors, and I just don't know how to instantiate the scheduler in one global file and access it from my ASP.NET pages (so I can get details into a grid view to edit, for example).
2: My manager suggested I could look into having multiple 'configurations' inside Quartz.NET. By this, I mean that at any given time an administrator can change the application's configuration so that only specifically chosen applications run. What would be the easiest way of doing this in Quartz.NET?
General:
1: One thing that's crucial in this application is assurance that the file has been moved and is actually at the target path (after the move the original file is deleted, so it would be disastrous if the file were deleted when it hadn't actually been copied!). I also need to make sure that the file contents match between the initial path and the target path, to give peace of mind that what has been copied is right. I'm currently doing this by MD5-hashing the initial file, copying the file, and, before deleting it, making sure that the file exists on the server. Then I hash the file on the server and make sure the hashes match up. Is there a simpler way of doing this? I'm concerned that the hashing may put strain on the system.
2: This relates to the above question but isn't as important, as not even my manager has any idea how I'd do this, although I'd love to implement it. An issue would arise if a job is executed while a file is being written to, which could mean a half-written file is transferred, making it totally useless, and it would also be bad because the initial file would be destroyed while it's still being written to! Is there a way of checking for this?
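A sketch of the copy → verify → delete sequence from point 1, plus a crude check for point 2 that asks for an exclusive lock to see whether anything is still writing to the file. It assumes both locations are reachable as file-system paths (e.g. a network path rather than FTP); for an FTP source you would verify against the downloaded copy instead:

```csharp
using System;
using System.IO;
using System.Linq;
using System.Security.Cryptography;

public static class SafeMove
{
    public static void Move(string source, string target)
    {
        if (IsFileInUse(source))
            throw new IOException("Source is still being written to; try again later.");

        byte[] sourceHash = Hash(source);
        File.Copy(source, target, overwrite: true);

        // Only delete the original once the copy exists and its content matches.
        if (!File.Exists(target) || !Hash(target).SequenceEqual(sourceHash))
            throw new IOException("Verification failed; original left in place.");

        File.Delete(source);
    }

    static bool IsFileInUse(string path)
    {
        try
        {
            // Asking for an exclusive lock fails if another process still has the file open.
            using (File.Open(path, FileMode.Open, FileAccess.Read, FileShare.None))
                return false;
        }
        catch (IOException)
        {
            return true;
        }
    }

    static byte[] Hash(string path)
    {
        using (var md5 = MD5.Create())
        using (var stream = File.OpenRead(path))
            return md5.ComputeHash(stream);
    }
}
```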
As you've discovered, running the Quartz scheduler inside an ASP.NET application presents many problems. Check out Marko Lahma's response to your question about running the scheduler inside an ASP.NET web app:
Quartz.Net scheduler works locally but not on remote host
As far as preventing race conditions between your jobs (e.g. trying to delete a file that hasn't actually been copied to the file system yet), what you need to implement is some sort of job chaining:
http://quartznet.sourceforge.net/faq.html#howtochainjobs
In the past I've used TriggerListeners and JobListeners to do something similar to what you need. Basically, you register event listeners that wait to execute certain jobs until another job has completed. It's important that you test those listeners and understand what's happening when those events are fired. You can easily find yourself implementing a solution that seems to work fine in development (a false positive) and then fails in production, without understanding how and when the scheduler does certain things with regard to asynchronous job execution.
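As a rough sketch of the JobListener idea, assuming the Quartz.NET 3.x async API (older 2.x versions use synchronous void methods instead); the job and group names ("copyJob", "deleteJob", "fileJobs") are placeholders:

```csharp
using System.Threading;
using System.Threading.Tasks;
using Quartz;
using Quartz.Impl.Matchers;

public class CopyCompletedListener : IJobListener
{
    public string Name => "CopyCompletedListener";

    public Task JobToBeExecuted(IJobExecutionContext context,
        CancellationToken cancellationToken = default(CancellationToken)) => Task.CompletedTask;

    public Task JobExecutionVetoed(IJobExecutionContext context,
        CancellationToken cancellationToken = default(CancellationToken)) => Task.CompletedTask;

    public async Task JobWasExecuted(IJobExecutionContext context,
        JobExecutionException jobException,
        CancellationToken cancellationToken = default(CancellationToken))
    {
        // Only chain the delete job when the copy job finished without an exception.
        if (jobException == null)
            await context.Scheduler.TriggerJob(new JobKey("deleteJob", "fileJobs"), cancellationToken);
    }
}

// Registration, somewhere during scheduler setup:
// scheduler.ListenerManager.AddJobListener(
//     new CopyCompletedListener(),
//     KeyMatcher<JobKey>.KeyEquals(new JobKey("copyJob", "fileJobs")));
```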
Good luck! Schedulers are fun!
I'm programming a simple customer-information management application with SQLite.
One exe file, one db file, some dll files - that's it :)
2-4 people may run this exe file simultaneously and access the database.
They will not only read but also frequently edit the data.
Yeah, now here comes one of the most famous problems... "synchronization".
I was trying to create/remove a temporary empty file whenever someone tries to edit the data (this acts as a 'key' to access the db), but there must be a better way :(
What would be the best way of preventing this problem?
Well, SQLite already locks the database file for each use, the idea being that multiple applications can share the same database.
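As an illustration (not from the answer above): if the 2-4 users do end up on the same machine, or the file sits on a local disk, giving each connection a busy timeout lets a writer wait for the lock instead of failing immediately with "database is locked". This assumes the Microsoft.Data.Sqlite provider, and the Customers table is made up:

```csharp
using Microsoft.Data.Sqlite;

public static class CustomerDb
{
    public static void UpdateName(string dbPath, long customerId, string name)
    {
        using (var conn = new SqliteConnection($"Data Source={dbPath}"))
        {
            conn.Open();

            using (var pragma = conn.CreateCommand())
            {
                pragma.CommandText = "PRAGMA busy_timeout = 5000;";  // wait up to 5 s for locks
                pragma.ExecuteNonQuery();
            }

            using (var cmd = conn.CreateCommand())
            {
                cmd.CommandText = "UPDATE Customers SET Name = $name WHERE Id = $id;";
                cmd.Parameters.AddWithValue("$name", name);
                cmd.Parameters.AddWithValue("$id", customerId);
                cmd.ExecuteNonQuery();
            }
        }
    }
}
```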
However, the documentation for SQLite explicitly warns about using this over the network:
SQLite will work over a network filesystem, but because of the latency associated with most network filesystems, performance will not be great. Also, the file locking logic of many network filesystem implementations contains bugs (on both Unix and Windows). If file locking does not work like it should, it might be possible for two or more client programs to modify the same part of the same database at the same time, resulting in database corruption. Because this problem results from bugs in the underlying filesystem implementation, there is nothing SQLite can do to prevent it.
A good rule of thumb is that you should avoid using SQLite in situations where the same database will be accessed simultaneously from many computers over a network filesystem.
So assuming your "2-4 people" are on different computers, using a network file share, I'd recommend that you don't use SQLite. Use a traditional client/server RDBMS instead, which is designed for multiple concurrent connections from multiple hosts.
Your app will still need to consider concurrency issues (unless it speculatively acquires locks on whatever the user is currently looking at, which is generally a nasty idea) but at least you won't have to deal with network file system locking issues as well.
You are looking at some classic problems in dealing with multiple users accessing a database: the Lost Update.
See this tutorial on concurrency:
http://www.brainbell.com/tutors/php/php_mysql/Transactions_and_Concurrency.html
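One common way to guard against the lost update is optimistic concurrency: keep a version column and only apply an update if the row is still the version you read. A sketch with a hypothetical Customers table and Version column, again using the Microsoft.Data.Sqlite provider:

```csharp
using Microsoft.Data.Sqlite;

public static class OptimisticUpdate
{
    // Returns false when someone else changed the row since we read it.
    public static bool TryUpdateName(SqliteConnection conn, long id, string newName, long versionWeRead)
    {
        using (var cmd = conn.CreateCommand())
        {
            cmd.CommandText =
                "UPDATE Customers SET Name = $name, Version = Version + 1 " +
                "WHERE Id = $id AND Version = $version;";
            cmd.Parameters.AddWithValue("$name", newName);
            cmd.Parameters.AddWithValue("$id", id);
            cmd.Parameters.AddWithValue("$version", versionWeRead);
            return cmd.ExecuteNonQuery() == 1;   // 0 rows affected => lost-update conflict
        }
    }
}
```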
At least you won't have to worry about the db file itself getting corrupted by this, because SQLite locks the whole file when it's being written. That being said, SQLite itself doesn't recommend using it if you expect your app to be accessed simultaneously by multiple clients.
I'm interested in getting solution ideas for a problem we have.
Background:
We have software tools that run on laptops and flash data onto hardware components. This software reads in a series of data files in order to do the programming on the hardware. It's in a manufacturing environment and is running continuously throughout the day.
Problem:
Currently, there's a central repository that the software connects to in order to read the data files. The software reads the files and retains a lock on them throughout the entire flashing process. This runs all throughout the day on different hardware components, so it's feasible that these files could be "locked" for most of the day.
There are new requirements stating that the data files the software reads need to be updated in real time, with minimal impact on the end user who is doing the flashing. We will be writing the service that drops the files out there in real time.
The software is developed by a third party vendor and is not modifiable by us. However, it expects a location to look for the data files, so everything up until the point of flashing is our process that we're free to change.
Question:
What approach would you take to solve this from a solution programming standpoint? We're not sure how to drop files out there in real time given the locks that will be present on them throughout the day. We'll settle for an "as soon as possible" solution if that is significantly easier.
The only way out of this conundrum seems to be the introduction of an extra file repository, along with a service-like piece of logic in charge of keeping these repositories synchronized.
In other words, the file upload takes place in one of the repositories (call it the "input repository"), and the flashing process uses the other repository (call it the "output repository"). The synchronization logic continually polls the input repository for new files (based on file timestamp or similar) and, when it finds new files, attempts to copy them to the output repository; each copy either takes place immediately, if the flashing logic hasn't locked the corresponding file in the output repository, or is deferred until the file gets unlocked.
Note: during the file copy, the synchronization logic can/should lock the file, very temporarily preventing it from being overwritten by new uploads but ensuring the full integrity of the copied file. The difference from the existing system is that this lock is held for a much shorter amount of time.
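A bare-bones sketch of that synchronization loop; the folder layout, the timestamp comparison, and the 10-second polling interval are placeholders, and a locked output file is simply skipped and retried on the next pass (the "deferred" case):

```csharp
using System;
using System.IO;
using System.Threading;

public static class RepositorySync
{
    public static void Run(string inputRepo, string outputRepo, CancellationToken token)
    {
        while (!token.IsCancellationRequested)
        {
            foreach (string source in Directory.GetFiles(inputRepo, "*", SearchOption.AllDirectories))
            {
                string relative = source.Substring(inputRepo.Length).TrimStart(Path.DirectorySeparatorChar);
                string target = Path.Combine(outputRepo, relative);

                // Only copy when the input file is newer than what the output repository has.
                if (File.Exists(target) &&
                    File.GetLastWriteTimeUtc(source) <= File.GetLastWriteTimeUtc(target))
                    continue;

                try
                {
                    Directory.CreateDirectory(Path.GetDirectoryName(target));
                    File.Copy(source, target, overwrite: true);   // brief lock while copying
                }
                catch (IOException)
                {
                    // The flashing tool still holds the output file; defer to the next pass.
                }
            }

            Thread.Sleep(TimeSpan.FromSeconds(10));   // polling interval
        }
    }
}
```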
The drawback of this approach is the full duplication of the repository, which could be a problem if the repository is very big. However, there don't appear to be many alternatives, since we do not have control over the flashing process.
"As soon as possible" is your only option. You can't update a file that's locked, that's the whole point of a lock.
Edit:
Would it be possible to put the new file in a different location and then tell the 3rd party service to look in that location the next time it needs the file?