I want to parse data from a log file, pump it into a database, and then purge the log file.
I could use the FileSystemWatcher component and monitor the Changed event, but the event would be firing non-stop, as the log file is pretty much "constantly" being written to. I don't want to be opening/closing db connections willy-nilly.
My current instinct is to use a Timer, and then parse/pump/purge the log file every so often (based on time or based on time and size of file).
Is there a common/proven way of handling the scenario (design pattern)?
Update: I see FileSystemWatcher has a NotifyFilter property, with one of the filterables being "Size"; I'm guessing (haven't found any verification yet) that any time the size of the file changes by 1KB it fires; this would be a reasonable "throttle," if true...
Not sure if this is a design pattern, but if you control how much you buffer before actually writing to the log file you can minimize the frequency.
The change event is way too chatty here. I would check the file on a scheduled basis with a timer, looking at the modification timestamp (and possibly create, especially if someone deletes/recreates the file.)
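A rough sketch of that timestamp check, assuming a single known log path. The class name `FileChangePoller` and its members are made up for illustration; only `FileInfo`, `LastWriteTimeUtc`, and `CreationTimeUtc` come from the framework.

```csharp
using System;
using System.IO;

// Hypothetical helper: remembers the timestamps seen on the previous check
// and reports whether either one differs on the next check.
public sealed class FileChangePoller
{
    private readonly string _path;
    private DateTime _lastWriteUtc;
    private DateTime _lastCreateUtc;

    public FileChangePoller(string path)
    {
        _path = path;
        Read(out _lastWriteUtc, out _lastCreateUtc);
    }

    // True when the file was modified, deleted, or recreated since the
    // previous call (or since construction, for the first call).
    public bool HasChanged()
    {
        DateTime writeUtc, createUtc;
        Read(out writeUtc, out createUtc);
        bool changed = writeUtc != _lastWriteUtc || createUtc != _lastCreateUtc;
        _lastWriteUtc = writeUtc;
        _lastCreateUtc = createUtc;
        return changed;
    }

    // A missing file is represented by DateTime.MinValue for both stamps.
    private void Read(out DateTime writeUtc, out DateTime createUtc)
    {
        var info = new FileInfo(_path);
        if (info.Exists)
        {
            writeUtc = info.LastWriteTimeUtc;
            createUtc = info.CreationTimeUtc;
        }
        else
        {
            writeUtc = DateTime.MinValue;
            createUtc = DateTime.MinValue;
        }
    }
}
```

You would call `HasChanged()` from the timer callback and only parse/pump/purge when it returns true. The polling interval is yours to pick; nothing here depends on it.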
Do you have any control over the log file generation? If so, you could start a new log file every time the current one reaches a certain size, and rename the old log file to a specific format. Then have the FileSystemWatcher filter for the "archive" log files and process them when they are created.
I need to fire an event when certain file is created. But this file is created in Temp folder inside a directory that is created with it. So I will have to set up monitoring of whole Temp dir with all subdirectories and I'm concerned about performance impact.
I know exact name and path of file and need to only track its creation. Is it better to poll File.Exists once a second or set up a FileSystemWatcher? Maybe there is a way to disable monitoring of all events except file creation and maybe it will be faster than polling?
I can't really test it because usage pattern of Temp directory is quite unpredictable.
I don't get it. Why do you have to watch the whole Temp dir, and not just the folder where that file is?
I know exact name and path of file
BIT OF GOOGLE: Use FileSystemWatcher on a single file in C#
This will probably perform better than polling every x amount of time.
But still, I have read, and it has even happened to me, that FSW is not 100% reliable.
So I would think of two approaches:
1) Do a mix of FSW, Poll and user intervention (ie, refresh button)
2) Get some fine drugs and read this: System Minifilter Driver
https://msdn.microsoft.com/en-us/library/windows/hardware/ff540402%28v=vs.85%29.aspx
EDIT: new link:
https://learn.microsoft.com/en-us/windows-hardware/drivers/ifs/filter-manager-concepts
And a nice code sample: https://github.com/microsoft/Windows-driver-samples/tree/master/filesys/miniFilter/change
I am familiar with the FileSystemWatcher class and have tested it; alternatively, I have tested a fast loop that lists the files of a given type in a directory. In this particular case they are zip-compressed SDF files that I need to decompress, open, and query.
The problem is that when a large file is put in a directory, sometimes that takes time, such as it being downloaded, or copied from a network location, etc...
When the FileSystemWatcher raises its event, I have access to the ChangeType, and for these operations the Created notification is immediate, while the file is still not completely copied to the location.
Likewise using the loop, I see a file is there, before the whole file is there.
The FileSystemWatcher raises several change events, one after create, and then one or more during the copy, nothing that says This file is now complete
So if I am expecting files of a type, to be placed in a directory ultimately to read and processed, with no knowledge of their transport mechanism, and no knowledge of their final size...
How do I know when the file is ready to actually be processed other than with using error control as a workflow control (albeit the error control is there anyway as it should be)? This just seems like a bad way to have to handle this, as sometimes the error control may actually be representing a legitimate issue, sometimes it may just be that the file is not completely written, and I do not see any real safe way to differentiate.
I despise anticipated errors, but realize that they have their place, as with sockets: nothing guarantees that a check for open does not change before an attempt to read/write. But I do avoid them at all costs.
This particular one troubles me mostly because of the ambiguity of the message that will be produced. There is a conflict queue for files that legitimately error because they did not come across entirely or are otherwise corrupt, I do not want otherwise good files going there. Getting more granular to detect this specific case will be almost impossible.
edit:
I know I can do this... And I have read the other SA articles concerning others doing the same thing. (And I know this method is both crude and blocking, it is just an example.)
    private static void OnChanged(object source, FileSystemEventArgs e)
    {
        if (e.ChangeType == WatcherChangeTypes.Created)
        {
            bool ready = false;
            while (!ready)
            {
                try
                {
                    using (FileStream fs = new FileStream(e.FullPath, FileMode.Open))
                    {
                        Console.WriteLine(String.Format("{0} - {1}", e.FullPath, fs.Length));
                    }
                    ready = true;
                }
                catch (IOException)
                {
                    ready = false;
                }
            }
        }
    }
What I am trying to find out: is this definitively the only way? Is there no other component, or some hook into the file system, that will actually do this with a proper event?
The only way to tell is to open the file with FileShare.Read. That will always fail if the process is still writing to the file and hasn't closed it yet. There is otherwise no mechanism to know anything at all about which particular process is doing the writing, FSW operates at the file system device driver level and doesn't know anything about what process is performing the operation. Could be more than one.
The open will very often fail the first time you try, because FSW delivers its notification very quickly. In general you have no idea how much time the process will take; it of course depends on how it is written, and it might leave the file open for a while. Could be hours or days; a log file would be an example.
So you need a re-try mechanism, it should have an exponential back-off algorithm to increase the re-try delays between attempts. Start it off at, say, a half second delay and keep increasing that delay when it fails. This needs to be done in a worker thread, not the FSW callback. Use a thread-safe queue to pass the path of the file from the FSW callback to the worker thread. Also in general a good strategy to deal with the multiple FSW notifications you get.
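A minimal sketch of just the back-off part, meant to run on the worker thread. `FileReadiness` and `TryOpenWithBackoff` are names I made up, and the attempt count and delay cap are assumptions you would tune. The queue that hands paths from the FSW callback to the worker is not shown.

```csharp
using System;
using System.IO;
using System.Threading;

public static class FileReadiness
{
    // Tries to open the file for reading with FileShare.Read, which denies
    // writers. Returns the open stream, or null if every attempt failed.
    // Note: any IOException counts as "not ready yet", including
    // FileNotFoundException, so a vanished file is retried as well.
    public static FileStream TryOpenWithBackoff(
        string path, int maxAttempts, TimeSpan initialDelay, TimeSpan maxDelay)
    {
        TimeSpan delay = initialDelay;
        for (int attempt = 1; attempt <= maxAttempts; attempt++)
        {
            try
            {
                return new FileStream(path, FileMode.Open, FileAccess.Read, FileShare.Read);
            }
            catch (IOException)
            {
                if (attempt == maxAttempts)
                {
                    break;
                }
                Thread.Sleep(delay);
                // Double the delay for the next attempt, up to the cap.
                delay = TimeSpan.FromTicks(Math.Min(delay.Ticks * 2, maxDelay.Ticks));
            }
        }
        return null;
    }
}
```

For example, `TryOpenWithBackoff(path, 8, TimeSpan.FromMilliseconds(500), TimeSpan.FromMinutes(1))` starts at the half second suggested above. A null result means the file should be re-queued or reported, not treated as corrupt.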
Watch out for startup effects, you of course missed any notification before you started running so there might be a load of files that are waiting for work. And watch out for Heisenbugs, whatever you do with the file might cause another process to fall over. Much like this process did to yours :)
Consider that a batch-style program that you periodically run with the task scheduler could be an easier alternative.
For the one extreme, you could use a file system mini filter driver which analyzes all activities for a file at the lowest level (and communicates with a user mode application).
I wrote a proof-of-concept mini filter some time ago to detect MS Office file conversions. See below. This way, you can reliably check for every open handle to the file.
But: even this would be no universal solution for your problem:
Consider:
A tool (e.g. FTP file transfer) could in theory write part of the file, close it, and re-open it again for appending new data. This seems very curious, but you cannot reliably just check for “no more open file handles” ==> “file is ready now”
Alex K. provided a good link in his comment, and I myself would use a solution similar to the answer from Jon (https://stackoverflow.com/a/4278034/4547223)
If time is not critical (you can waste a few seconds for the decision):
Periodic timer (1 second seems reasonable)
Check file size in every timer tick
If the file size has not grown for e.g. 10 seconds and there are no more FSWatcher change events either, try to open it. If you notice that the size increments are uneven or very slow, you could adjust the “wait time” on the fly.
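The size check in those steps could be sketched roughly like this. `SizeStabilityTracker` is a hypothetical name, and the time is passed in by the caller so the logic stays independent of any particular timer. It covers only the "size did not change" part, not the FSWatcher events.

```csharp
using System;

// Hypothetical helper: fed one size observation per timer tick, it reports
// when the size has stayed the same for at least the quiet period.
public sealed class SizeStabilityTracker
{
    private readonly TimeSpan _quietPeriod;
    private long _lastSize = -1;          // -1 means "nothing observed yet"
    private DateTime _lastChangeUtc;

    public SizeStabilityTracker(TimeSpan quietPeriod)
    {
        _quietPeriod = quietPeriod;
    }

    // Returns true once the size has been unchanged for the quiet period.
    public bool Observe(long size, DateTime nowUtc)
    {
        if (size != _lastSize)
        {
            // Size moved (or first observation): restart the quiet period.
            _lastSize = size;
            _lastChangeUtc = nowUtc;
            return false;
        }
        return nowUtc - _lastChangeUtc >= _quietPeriod;
    }
}
```

In each tick you would call something like `tracker.Observe(new FileInfo(path).Length, DateTime.UtcNow)` and only attempt to open the file once it returns true.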
Your big advantage is that you are processing ZIP files only, where you have a chance of detecting invalid (incomplete) files due to “checksum not valid”
I do not expect official ways to detect this, since there is no universal notion of “file written completely”.
File System mini filter
This may be like a sledgehammer solution for the problem.
Some time ago, I had the requirement of working around a weird bug in Office 2010, where it does not copy ADS meta data during office file conversion (ADS needed for File Classification). We discussed this with Microsoft engineers (MS was not willing to fix the bug), they complied with our filter driver solution (in the end, this was stopped since business preferred a manual workaround.)
Nevertheless, if someone really wants to check whether this could be a possible solution:
I have written an explanation of the steps:
https://stackoverflow.com/a/29252665/4547223
I am implementing an event handler that must open and process the content of a file created by a third-party application over which I have no control. A note in "C# 4.0 in a Nutshell" (page 495) warns about the risk of opening a file before it is fully populated, so I am wondering how to handle this. To keep the load on the event handler to a minimum, I am considering having the handler simply put the file names in a queue and letting a different thread do the processing. But either way, how can I make sure that the write is complete and the file is safe to read? The file size could be arbitrary.
Any ideas? Thanks
A reliable way to achieve what you want might be to use FileSystemWatcher + NTFS USN journal.
Maybe more complicated than you expected, but FileSystemWatcher alone won't tell you for sure that the newly created file has been closed
-first, the FileSystemWatcher, to know when a file is created. From there you have the complete file path, and are 1 or 2 pinvokes away from getting the file unique ID (which can help you to track it during its whole lifetime).
-then, read the USN journal, which tracks everything that occurs on your drive. Filter on entries corresponding to your new file's ID, and read the journal until reaching the entry with the 'Close' event.
From there, unless your file is manipulated in special ways (opened and closed multiple times by the application that generates it), you can assume it is safe to read it and do whatever you wanted to do with it.
A really great C# implementation of a USN journal parser is StCroixSkipper's work, available here:
http://mftscanner.codeplex.com/
If you are interested I can give you more help about USN journal, as I use it in my project.
Our workaround is to watch for a specific extension. When a file is uploaded, the extension is ".tmp". When its done uploading, it's renamed to have the proper extension.
Another alternative is to have the server try to move the file in a try/catch block. If the file isn't done being uploaded, the attempt to move the file will throw an exception, so we wait and try again.
Realistically, you can't know. If the other application's "write" operation is to open the file denying write access to everyone else, write, and close the file when it's done, then when you get a notification you could simply try to open the file requesting write access; if that fails, you know the operation isn't complete. But if the "write" operation is to open the file, write, close the file, open the file again, write again, etc., then you're pretty much out of luck.
The best solution I've seen is to set a timer after the last notification. When the timer elapses, try to open the file for write--if you can, assume the "operation" is done and do what you need to do. If the open fails, assume the operation is still in progress and wait some more.
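A sketch of that "timer after the last notification" idea, using `System.Threading.Timer`. The `QuietPeriodTimer` name and the callback shape are assumptions for illustration. The attempt to open the file for write would go inside the callback.

```csharp
using System;
using System.Threading;

// Hypothetical helper: every notification restarts the countdown, so the
// callback only runs once no notification has arrived for the quiet period.
public sealed class QuietPeriodTimer : IDisposable
{
    private readonly Timer _timer;
    private readonly TimeSpan _quietPeriod;

    public QuietPeriodTimer(TimeSpan quietPeriod, Action onQuiet)
    {
        _quietPeriod = quietPeriod;
        // Created disabled; it starts counting on the first Notify call.
        _timer = new Timer(_ => onQuiet(), null,
            Timeout.InfiniteTimeSpan, Timeout.InfiniteTimeSpan);
    }

    // Call this from every FileSystemWatcher notification.
    public void Notify()
    {
        // One-shot timer: fires once after the quiet period unless
        // another Notify call resets it first.
        _timer.Change(_quietPeriod, Timeout.InfiniteTimeSpan);
    }

    public void Dispose()
    {
        _timer.Dispose();
    }
}
```

If the open in the callback fails, calling `Notify()` again from there gives you the "wait some more" behaviour. The callback runs on a thread-pool thread, so anything it touches needs to be thread-safe.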
Of course, nothing is foolproof. Despite the above, another operation could start while you're doing what you want with the file and cause interaction problems.
How can I know which file was modified and what data changed in the file?
Edit: I want to watch the file as it gets modified and then compare it against a previous version to know which data blocks are changed. I guess watching the file for changes can be accomplished by using file watcher API but I have no idea about the second part.
You may need the FileSystemWatcher class.
The most common approach is to define a FileSystemWatcher, subscribe to its events, and process them according to the logic of your application.
Here is a simple example.
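A minimal sketch of that approach might look like the following. The `WatcherDemo.Watch` helper and its callback signature are made up for illustration; the `FileSystemWatcher` members are the real ones. It only tells you which file changed, not which data inside it.

```csharp
using System;
using System.IO;

public static class WatcherDemo
{
    // Watches one directory (no subdirectories) and forwards every event
    // to the supplied callback as (change type, full path).
    public static FileSystemWatcher Watch(
        string directory, string filter, Action<WatcherChangeTypes, string> onEvent)
    {
        var watcher = new FileSystemWatcher(directory, filter);
        watcher.NotifyFilter =
            NotifyFilters.FileName | NotifyFilters.LastWrite | NotifyFilters.Size;

        watcher.Created += (sender, e) => onEvent(e.ChangeType, e.FullPath);
        watcher.Changed += (sender, e) => onEvent(e.ChangeType, e.FullPath);
        watcher.Deleted += (sender, e) => onEvent(e.ChangeType, e.FullPath);
        watcher.Renamed += (sender, e) => onEvent(e.ChangeType, e.FullPath);

        // No events are raised until this is switched on.
        watcher.EnableRaisingEvents = true;
        return watcher;
    }
}
```

The caller keeps the returned watcher alive and disposes it when done. The handlers run on thread-pool threads, so the callback should do as little as possible, for example just queue the path.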
I am working on an app that will keep a running index of work accomplished.
I could write once at the end of a work session, but I don't want to risk losing data if something blows up. Therefore, I rewrite to disk (XML) every time a new entry or a correction is made by the user.
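    private void WriteIndexFile()
    {
        XmlDocument IndexDoc = new XmlDocument();
        // Build document here

        // Note: the writer is never closed or disposed here, so the file
        // handle stays open until the writer is finalized.
        XmlTextWriter tw = new XmlTextWriter(_filePath, Encoding.UTF8);
        tw.Formatting = Formatting.Indented;
        IndexDoc.Save(tw);
    }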
It is possible for the writes to be triggered in rapid succession. If this happens, it tries to open the file for writing before the prior write is complete. (While it would not be normal, I suppose it is possible that the file gets opened for use by another program.)
How can I check if the file can be re-written?
Edit for clarification: This is part of an automated lab data collection system. The users will click a button to capture data (saved in separate files), and identify the sub-task that the data package is for. Typically, it will be 3-10 minutes between clicks.
If they make an error, they need to be able to go back and correct it, so it's not an append-only usage.
Finally, the files will be read by other automated tools and manually by humans. (XML/XSLT)
The size will be limited as each work session (worker shift or less) will have a new index file generated.
Further question: As the overwhelming consensus is to not use XML and write in an append-only mode, how would I solve the requirement of going back and correcting earlier entries?
I am considering having a "dirty" flag, and saving a few minutes after the flag is set and upon closing the work session. If multiple edits happen in that time, only one write will occur, so there would be no more rapid successive writes. I would also show a retry/cancel dialog if the save fails. Thoughts?
XML is a poor choice in your case because new content has to be inserted before the closing tag. Use plain text instead and simply open the file for append and write the new content at the end of the file; see How to: Open and Append to a Log File.
You can also look into a simple logging framework like log4net and use that instead of handling the low-level file stuff yourself.
If all you want is a simple log of all operations, XML may be the wrong choice here as it is difficult to append to an XML document without rewriting the whole file, which will become slower and slower as the file grows.
I'd suggest instead File.AppendText or, even better, keeping the file open for the duration of the application's lifetime and using WriteLine.
(Oh, and as others have pointed out, you need to lock to ensure that only one thread writes to the file at a time. This is still true even with this solution.)
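A small sketch of the `File.AppendText` variant with the lock in place. `AppendOnlyLog` is a hypothetical wrapper, and it reopens the file on every call instead of keeping it open. Note that the lock only protects writers inside this one process.

```csharp
using System;
using System.IO;

// Hypothetical wrapper: serializes appends so that only one thread in this
// process writes to the file at a time.
public sealed class AppendOnlyLog
{
    private readonly object _gate = new object();
    private readonly string _path;

    public AppendOnlyLog(string path)
    {
        _path = path;
    }

    public void Append(string entry)
    {
        lock (_gate)
        {
            // File.AppendText creates the file if it does not exist and
            // positions the writer at the end of it.
            using (StreamWriter writer = File.AppendText(_path))
            {
                writer.WriteLine(entry);
            }
        }
    }
}
```

Reopening on every entry is slower than keeping the writer open, but each entry is on disk and the handle is released as soon as `Append` returns.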
There are also logging frameworks that already solve this problem, such as log4net. Have you considered using an existing logging framework instead of rolling your own?
I have a logger that uses System.Collections.Queue. Basically it waits until something is queued, then tries to write it. While it is writing items, which could be slow, more items can be added to the queue.
This will also help in just grouping messages rather than trying to keep up. It is running on a separate thread.
    private AutoResetEvent ResetEvent { get; set; }

    public void LogMessage(string fullMessage)
    {
        this.logQueue.Enqueue(fullMessage);
        // Trigger the Reset Event to send the
        this.ResetEvent.Set();
    }

    private void ProcessQueueMessages()
    {
        while (this.Running)
        {
            // This will process all the items in the queue.
            while (this.logQueue.Count > 0)
            {
                // This method will just log the top item on the queue
                this.LogQueueItem();
            }
            // Once the queue is empty, wait for another message
            // to be queued before running again.
            // Rather than sleeping and checking if the queue is full,
            // saves from doing a System.Threading.Thread.Sleep(1000); stuff
            this.ResetEvent.WaitOne();
        }
    }
I handle write failures by not dequeuing an item until it has been written to the file with no errors. Then I just keep attempting until it finally can write. This has saved me: somebody removed permissions from one of our apps while it was running. Permission was given back without shutting down our app, and we didn't lose a single log statement.
Consider using a flat text file. I have a process that I wrote that uses an XML log... it was a poor choice. You can't just write out the state as you run without having to constantly rewrite the file to make sure the tags are correct. If it was flat entries written to a file you could have an automatic timeline that could give you details of what happened without trying to figure out if it was the XML writer/tag set that blew up and you don't have to worry about your logs bloating out as much.
I agree with others suggesting you avoid XML. Also, I would suggest you have one component (a "monitor") that is responsible for all access to the file. That component will have the job of handling multiple simultaneous requests and making the disk writes happen one after another.