I am writing a small application in C# to process batches of images and put them into a PDF. Each batch of images is stored in its own folder on a network share. The application will let users perform QA checks on a random number of images from a single batch before creating a PDF. At most, 4 to 6 users will run this application on individual desktops, each with access to the location where the image batches are stored.
The problem I'm running into at the moment is: how do I prevent two users from processing the same batch? Initially I thought about using FileSystemWatcher to check for last access to each folder, but after reading up on how FileSystemWatcher raises events, it didn't seem suitable. I've considered polling, checking the images in each folder for file access using a FileStream, but I don't think that will be suitable either (I may be wrong).
What would be the simplest solution?
I'd use a lock file with a package like this.
Code is quite simple:
var fileLock = new SimpleFileLock("networkFolder/file.lock", TimeSpan.FromMinutes(timeout));
where timeout is used to unlock the folder if the process using it crashed (so that it doesn't stay locked forever).
Then, every time a process needs to use that directory, you go with a simple check:
if (fileLock.TryAcquireLock())
{
    // Lock acquired - do your work here
}
else
{
    // Failed to acquire lock - SpinWait or do something else
}
Code is taken from the samples on the repo, so that's the way the author suggests using his library.
I had the chance to use it and I found it both useful and reliable.
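For anyone who would rather not take a dependency, the same idea can be sketched by hand. This is a minimal sketch, not the library's implementation: BatchLock, TryAcquire and Release are hypothetical names, and the stale-lock check assumes the clocks of the desktops and the file server roughly agree. It relies on FileMode.CreateNew, which throws an IOException when the file already exists, so only one caller can create the lock file.

```csharp
using System;
using System.IO;

// Hypothetical helper for illustration; this is NOT the SimpleFileLock API.
public static class BatchLock
{
    // Returns true if this caller created the lock file and now owns the batch.
    public static bool TryAcquire(string lockPath, TimeSpan timeout)
    {
        // Treat a lock file older than the timeout as abandoned by a crashed process.
        if (File.Exists(lockPath) &&
            DateTime.UtcNow - File.GetLastWriteTimeUtc(lockPath) > timeout)
        {
            try { File.Delete(lockPath); }
            catch (IOException) { return false; }
            catch (UnauthorizedAccessException) { return false; }
        }

        try
        {
            // CreateNew fails with IOException if the file already exists,
            // so at most one caller gets past this line.
            using (var fs = new FileStream(lockPath, FileMode.CreateNew,
                                           FileAccess.Write, FileShare.None))
            using (var writer = new StreamWriter(fs))
            {
                // Record the owner so a human can see who holds the batch.
                writer.Write(Environment.MachineName);
            }
            return true;
        }
        catch (IOException)
        {
            return false; // someone else holds the lock
        }
    }

    // Call when the batch is finished so other users can pick it up.
    public static void Release(string lockPath)
    {
        File.Delete(lockPath);
    }
}
```

Note that the stale-lock cleanup is not race-free: two users could both decide the lock is abandoned at the same moment. With 4 to 6 users and a generous timeout that is unlikely, but it is a limitation of this sketch.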
I am familiar with the FileSystemWatcher class and have tested using it. Alternatively, I have tested using a fast loop that does a directory listing of files of a given type in a directory. In this particular case they are zip-compressed SDF files that I need to decompress, open, and query.
The problem is that when a large file is put in a directory, sometimes that takes time, such as it being downloaded, or copied from a network location, etc...
When the FileSystemWatcher raises an OnChange event, I have a handle to the ChangeType and on these types of operations the Create is immediate, while the file is still not completely copied to the location.
Likewise using the loop, I see a file is there, before the whole file is there.
The FileSystemWatcher raises several change events, one after create and then one or more during the copy, but nothing that says "this file is now complete".
So if I am expecting files of a type, to be placed in a directory ultimately to read and processed, with no knowledge of their transport mechanism, and no knowledge of their final size...
How do I know when the file is ready to actually be processed other than with using error control as a workflow control (albeit the error control is there anyway as it should be)? This just seems like a bad way to have to handle this, as sometimes the error control may actually be representing a legitimate issue, sometimes it may just be that the file is not completely written, and I do not see any real safe way to differentiate.
I despise anticipated errors, but realize that they have their place, as with sockets: nothing guarantees that the result of a check for open does not change before an attempt to read/write. But I do avoid them at all costs.
This particular one troubles me mostly because of the ambiguity of the message that will be produced. There is a conflict queue for files that legitimately error because they did not come across entirely or are otherwise corrupt, I do not want otherwise good files going there. Getting more granular to detect this specific case will be almost impossible.
edit:
I know I can do this... And I have read the other SO articles about others doing the same thing. (And I know this method is both crude and blocking; it is just an example.)
private static void OnChanged(object source, FileSystemEventArgs e)
{
    if (e.ChangeType == WatcherChangeTypes.Created)
    {
        bool ready = false;
        while (!ready)
        {
            try
            {
                using (FileStream fs = new FileStream(e.FullPath, FileMode.Open))
                {
                    Console.WriteLine(String.Format("{0} - {1}", e.FullPath, fs.Length));
                }
                ready = true;
            }
            catch (IOException)
            {
                ready = false;
            }
        }
    }
}
What I am trying to find out is: is this definitively the only way? Is there no other component, or some hook into the file system, that will actually do this with a proper event?
The only way to tell is to open the file with FileShare.Read. That will always fail if the process is still writing to the file and hasn't closed it yet. There is otherwise no mechanism to know anything at all about which particular process is doing the writing, FSW operates at the file system device driver level and doesn't know anything about what process is performing the operation. Could be more than one.
That will very often fail the first time you try, FSW is very efficient. In general you have no idea how much time the process will take, it of course depends on how it is written and might leave the file opened for a while. Could be hours or days, a log file would be an example.
So you need a re-try mechanism, it should have an exponential back-off algorithm to increase the re-try delays between attempts. Start it off at, say, a half second delay and keep increasing that delay when it fails. This needs to be done in a worker thread, not the FSW callback. Use a thread-safe queue to pass the path of the file from the FSW callback to the worker thread. Also in general a good strategy to deal with the multiple FSW notifications you get.
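A rough sketch of that arrangement follows. It is only one way to do it: ReadyFileWorker, CanOpen, WaitUntilReady and Run are hypothetical names, the attempt limit of 10 is an arbitrary assumption, and the worker handles one file at a time, so a slow file delays the ones queued behind it.

```csharp
using System;
using System.Collections.Concurrent;
using System.IO;
using System.Threading;

// Hypothetical names; a sketch of the queue + back-off idea, not a finished component.
public static class ReadyFileWorker
{
    // True if the file can be opened while denying write access to others.
    // This fails with IOException while another process still has it open for writing.
    public static bool CanOpen(string path)
    {
        try
        {
            using (new FileStream(path, FileMode.Open, FileAccess.Read, FileShare.Read))
            {
                return true;
            }
        }
        catch (IOException)
        {
            return false;
        }
    }

    // Retries CanOpen, doubling the delay after every failed attempt.
    public static bool WaitUntilReady(string path, int maxAttempts, TimeSpan initialDelay)
    {
        TimeSpan delay = initialDelay;
        for (int attempt = 0; attempt < maxAttempts; attempt++)
        {
            if (CanOpen(path)) return true;
            Thread.Sleep(delay);
            delay = TimeSpan.FromTicks(delay.Ticks * 2); // exponential back-off
        }
        return false;
    }

    // Run this on a worker thread. The FSW callback only calls paths.Add(e.FullPath).
    public static void Run(BlockingCollection<string> paths, Action<string> process)
    {
        foreach (string path in paths.GetConsumingEnumerable())
        {
            // Start at half a second, as suggested above; 10 attempts is an assumption.
            if (WaitUntilReady(path, 10, TimeSpan.FromMilliseconds(500)))
            {
                process(path);
            }
        }
    }
}
```

To absorb the duplicate notifications, the FSW callback could skip paths that are already queued before calling paths.Add.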
Watch out for startup effects, you of course missed any notification before you started running so there might be a load of files that are waiting for work. And watch out for Heisenbugs, whatever you do with the file might cause another process to fall over. Much like this process did to yours :)
Consider that a batch-style program that you periodically run with the task scheduler could be an easier alternative.
For the one extreme, you could use a file system mini filter driver which analyzes all activities for a file at the lowest level (and communicates with a user mode application).
I wrote a proof-of-concept mini filter some time ago to detect MS Office file conversions. See below. This way, you can reliably check for every open handle to the file.
But even this would be no universal solution for your problem:
Consider:
A tool (e.g. FTP file transfer) could in theory write part of the file, close it, and re-open it again for appending new data. This seems very curious, but you cannot reliably just check for “no more open file handles” ==> “file is ready now”
Alex K. provided a good link in his comment, and I myself would use a solution similar to the answer from Jon (https://stackoverflow.com/a/4278034/4547223)
If time is not critical (you can waste a few seconds for the decision):
Periodic timer (1 second seems reasonable)
Check file size in every timer tick
If the file size did not increase for, e.g., 10 seconds and there are no more FSWatcher change events either, try to open it. If you notice that the size increases unevenly or very slowly, you could adjust the “wait time” on the fly.
Your big advantage is that you are processing ZIP files only, where you have a chance of detecting invalid (incomplete) files through a “checksum not valid” error.
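The size check in the steps above could look roughly like this. It is a sketch under assumptions: SizeStabilityTracker and Observe are hypothetical names, and it covers only the "size has stopped growing" part, not the check for further FSWatcher events.

```csharp
using System;

// Hypothetical helper: tracks how long a file's size has stayed the same.
public sealed class SizeStabilityTracker
{
    private readonly TimeSpan quietPeriod;
    private long lastSize = -1;          // -1 means "no observation yet"
    private DateTime lastChangeUtc;

    public SizeStabilityTracker(TimeSpan quietPeriod)
    {
        this.quietPeriod = quietPeriod;
    }

    // Call once per timer tick with the current file size.
    // Returns true once the size has been unchanged for at least quietPeriod.
    public bool Observe(long size, DateTime nowUtc)
    {
        if (size != lastSize)
        {
            // The size changed (or this is the first tick): restart the quiet period.
            lastSize = size;
            lastChangeUtc = nowUtc;
            return false;
        }
        return nowUtc - lastChangeUtc >= quietPeriod;
    }
}
```

On each tick you would call something like tracker.Observe(new FileInfo(path).Length, DateTime.UtcNow) and only attempt to open the file once it returns true.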
I do not expect official ways to detect this, since there is no universal notion of “file written completely”.
File System mini filter
This may be like a sledgehammer solution for the problem.
Some time ago, I had to work around a weird bug in Office 2010, where it does not copy ADS metadata during Office file conversion (the ADS was needed for File Classification). We discussed this with Microsoft engineers (MS was not willing to fix the bug), and they agreed with our filter driver solution. (In the end, this was stopped because the business preferred a manual workaround.)
Nevertheless, if someone really wants to check whether this could be a possible solution:
I have written an explanation of the steps:
https://stackoverflow.com/a/29252665/4547223
I have a web application to return images to my frontend.
In this application what happens is: when a request is made to a particular image the application checks if the image already exists on disk; if it exists the image is returned.
My problem starts when the image does not exist on disk and two requests are made at the same time for that same image. The problem occurs when two threads try to create the same file on disk at the same time.
To solve the problem, what I tried to do was to create a Mutex in the creation of disk image. But it had a problem: as the server load is enormous due to the large number of simultaneous requests, the server crashes.
I would like to ask what your ideas are for solving this problem, or what you would do differently.
Thank you.
You could try the following pattern:
1. Try to read the image (if that succeeds, then you are done)
2. Try to create the image with a write lock
3. Only on a "file in use" exception, wait for a small delay (milliseconds)
4. Go back to step 1 (retry)
Make the delay really small, just a tiny bit larger than the time it should need to create an image.
Implement a retry limit, max 3 times or so.
This would allow you to make use of the already existing (file) locking mechanism
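A minimal sketch of that pattern, with assumptions: ImageCache and GetOrCreate are hypothetical names, and render stands in for whatever produces the image bytes in your application.

```csharp
using System;
using System.IO;
using System.Threading;

// Hypothetical helper illustrating the read / create-exclusively / retry loop.
public static class ImageCache
{
    public static byte[] GetOrCreate(string path, Func<byte[]> render,
                                     int maxRetries, TimeSpan delay)
    {
        for (int attempt = 0; attempt <= maxRetries; attempt++)
        {
            // Step 1: try to read the image.
            try
            {
                return File.ReadAllBytes(path);
            }
            catch (FileNotFoundException)
            {
                // Not on disk yet: fall through and try to create it.
            }
            catch (IOException)
            {
                // Another thread is still writing it: wait, then retry the read.
                Thread.Sleep(delay);
                continue;
            }

            // Step 2: try to create the file exclusively.
            // CreateNew throws IOException if another thread created it first.
            try
            {
                using (var fs = new FileStream(path, FileMode.CreateNew,
                                               FileAccess.Write, FileShare.None))
                {
                    byte[] data = render();
                    fs.Write(data, 0, data.Length);
                    return data;
                }
            }
            catch (IOException)
            {
                // Step 3: lost the race; small delay, then back to step 1.
                Thread.Sleep(delay);
            }
        }
        throw new IOException("Could not read or create " + path);
    }
}
```

One thing this sketch does not handle: if render throws, an empty or partial file is left behind, so real code would want to delete the file in that case.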
You can call the open function with the O_CREAT and O_EXCL flags. The first process's open call creates the file, so that process gets to start downloading the image. Any subsequent process's open call will fail because the file already exists, and errno will be set to EEXIST.
Based on your design, the subsequent processes can either wait for complete file creation or can return back.
fd = open(path, O_CREAT | O_EXCL | O_WRONLY, 0644);  /* returns -1 with errno == EEXIST if the file already exists */
I am using a webclient to download a media file from my web server and save to isolated storage.
Clicking a button starts the download and the save to isolated storage, but if you click the button again while the file is downloading, it tries to start a second download and fails with the error "WebClient does not support concurrent I/O operations".
I want to write a conditional if statement to check whether there is already an I/O operation in progress, but I'm not sure how I would do this.
Any help would be greatly appreciated.
Can't you just use a boolean to see if you started the download already? Either way it sounds like it would be better to actually disable the button in the UI after you start a download, and enable it again once it finishes or fails.
Your UI should be consistent with what users have the ability to do at a given time - letting them try something and then make them fail sounds like a frustrating user experience.
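If you do want the flag as well, a small sketch might look like this. DownloadGuard, TryBegin and End are hypothetical names; Interlocked is used so the check-and-set is a single atomic step even if the completion callback runs on another thread.

```csharp
using System.Threading;

// Hypothetical helper: a thread-safe "download in progress" flag.
public sealed class DownloadGuard
{
    private int busy; // 0 = idle, 1 = download in progress

    // Returns true only for the caller that moves the flag from idle to busy.
    public bool TryBegin()
    {
        return Interlocked.CompareExchange(ref busy, 1, 0) == 0;
    }

    // Call from the download-completed handler, on success or failure.
    public void End()
    {
        Interlocked.Exchange(ref busy, 0);
    }
}
```

In the click handler you would return early when TryBegin() is false, otherwise disable the button and start the download; the completed handler then calls End() and re-enables the button. WebClient also exposes an IsBusy property that can be checked before starting a new request.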
My situation: I have a legacy app, which I don't have the code for, that writes data out to disk every second or so. I have a C# program I wrote that reads what was written to disk every second and uses the data. The data is written to a few text files whose names I know before they are created.
The issue is I have lots of virtual machines running this legacy app and my program. They are not limited by ram or cpu but I can't add more than 10 VMs per machine due to file io bottleneck.
Is there an easy way I can make a file on disk that exists in ram or something else? I heard something about named pipes being an option?
Thanks!
Are you sure of the actual IO involved?
Long ago I implemented a very ugly connection sending data from a DOS program to a Windows program by means of a file. This was a lot faster than once a second, though: the DOS program would send a 4K block any time anything on it changed. 50 times a second (if it was caught up), the Windows program would read the frame number and then read the 4K block if the frame number differed.
This did NOT cause disk IO! You could sit there causing the dos program to update the frame many times a second for as long as you wanted and the hard drive light would stay off. Windows saw the file was open and being frequently written, the buffers were NOT flushed to disk until the updates stopped.
While I spent a lot of time optimizing the Windows side of the link, it was all in what was done with the data, not in the connection. That simply wasn't a bottleneck despite its apparent ugliness.
It's possible Windows would handle it differently if the file was closed each time. Sticking it on a ramdisk would keep it from doing the disk IO even then.
You can search for some kind of memory/temporary file systems.
RAM drive for compiling - is there such a thing?
http://256stuff.com/gray/docs/misc/linux_memory_tmp_filesystem_fs.shtml
I'm not sure if you can use pipes here, since your legacy app is writing directly to the HDD.