Quick comparison of folders - only a bool as the result - C#

I have 2 folders with files on Windows:
Local folder, e.g. C:\MyFolder
Network folder, e.g. \\server\MyFolder
I need to check whether \\server\MyFolder has any updates compared to my local folder (e.g. an updated file).
I only need to know if there is any update in the network folder, I don't need to know what are the differences.
I tried computing MD5 checksums over the folders, but this solution is too slow: creating the checksum for the network folder takes 3-4 minutes in my case, and I need the comparison to finish in a few seconds at most (we are talking about folders with a few hundred files and a few hundred MB in total).
I also tried some basic rules (e.g. if the number of files or the file names differ, return a quick answer), but in most cases the number of files stays the same and only the content of a file changes. That is only quick when an early file happens to differ; in the worst case (or when the folders are identical) I end up iterating through every file, which again takes a few minutes.
What other approaches could I try ?
(Note that the network drive is read-only to me, so I can't store any information or checksums in that folder when it changes.)

Here's a solution:
Loop over the files in the network folder and check each one's last write time, e.g.
var lastModified = System.IO.File.GetLastWriteTime(file);
If any file's last write time is later than the time of your previous check, return true; otherwise return false.
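A minimal sketch of that idea, assuming you persist the timestamp of your last successful check on your own side (the share is read-only, so the value has to live locally); the class and method names are illustrative:

using System;
using System.IO;
using System.Linq;

public static class FolderChangeCheck
{
    // Returns true if any file under 'folder' was written after 'lastCheckedUtc'.
    // Note: deletions are not caught this way; pair it with a file-count check if needed.
    public static bool HasChangedSince(string folder, DateTime lastCheckedUtc)
    {
        var root = new DirectoryInfo(folder);
        return root.EnumerateFiles("*", SearchOption.AllDirectories)
                   .Any(f => f.LastWriteTimeUtc > lastCheckedUtc);
    }
}

// Usage:
// bool updated = FolderChangeCheck.HasChangedSince(@"\\server\MyFolder", lastCheckedUtc);

Enumerating a few hundred files over the network and reading their cached timestamps usually takes seconds, far cheaper than hashing their contents.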

Related

UWP Enumerate Folders with 10,000 files while creating a subfolder

I'm a fairly experienced developer, but this has me stumped in UWP - I'll keep it simple.
Let's say I want to go through all photos in the Pictures folder, watermark them, and save the watermarked version in a subfolder of Pictures (e.g. Pictures\Watermarked).
Sound easy?
Try 1: Using GetFilesAsync (incl. GetItemsAsync, GetFoldersAsync) - This method goes through every file, giving me the StorageFile objects I need.
There are two problems with this approach:
I can't show a progress bar until I've scanned every file, and that scan is painfully slow in UWP.
The Runtime Broker will consume all memory if I keep any reference to the StorageFile objects (so enumerating once for a count and again to process is seriously slow, think 1,000 times slower than Win32).
Try 2: Using Queries - This method involves using Windows.System.Search & Queries to return a list of pointers (ish) to all the files. I can then use StorageFolderQueryResult to get each StorageFile on the fly and release it immediately so that the Runtime Broker behaves. This is very fast because it uses the Windows index system, really, really fast.
The problem is that the query system is fairly stupid: as soon as I create the subfolder "Watermarked Photos", the StorageFiles returned by the query (which did not exist when it was issued) start to include files from the Watermarked folder. It appears the query is really just a count of files rather than a static list of the actual files, so the results are arbitrary based on any files added or removed within its scope after the query was invoked.
Anyone with thoughts on how to do this?
RESOLVED - It's not possible using the index system. I created my own query class. It uses the folders' GetItemsAsync method (the number of objects there won't kill the RuntimeBroker) and stores the paths of all files and sub-folders in a string list. I can then use GetFileFromPathAsync to instantiate and destroy StorageItems as needed. The RuntimeBroker is okay with that, and although it's not the best performance it does give me custom file/folder filtering. Happy to elaborate if anyone needs more info.
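For anyone after a concrete starting point, here is a rough sketch of that approach (the class and method names here are illustrative, not from the post): collect only paths, then re-create StorageFile objects on demand.

using System;
using System.Collections.Generic;
using System.Threading.Tasks;
using Windows.Storage;

public static class PathIndexer
{
    // Recursively collect the paths of all files; only strings are kept,
    // so the Runtime Broker never holds on to StorageFile references.
    public static async Task<List<string>> CollectFilePathsAsync(StorageFolder root)
    {
        var paths = new List<string>();
        foreach (var item in await root.GetItemsAsync())
        {
            if (item is StorageFolder folder)
                paths.AddRange(await CollectFilePathsAsync(folder));
            else
                paths.Add(item.Path);
        }
        return paths;
    }

    // Re-hydrate a StorageFile only when it is needed, then let it fall out of scope.
    public static async Task ProcessAsync(IEnumerable<string> paths, Func<StorageFile, Task> work)
    {
        foreach (var path in paths)
        {
            StorageFile file = await StorageFile.GetFileFromPathAsync(path);
            await work(file);
        }
    }
}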

C# move files between directories in a SAN

I am developing a WinForms application that moves files between directories on a SAN. The app searches for a list of files in directories that hold over 200,000 files each, and then moves the found files to another directory.
For example I have this path:
\\san\xxx\yyy\zzz
I search in the zzz directory and then move the found files to yyy, but while I'm moving files, another app is inserting more files into the xxx, yyy and zzz directories.
I don't want to impact the access or performance of other applications that use these directories.
The files' integrity is also a big concern, because as soon as the application moves a file into the yyy directory another app starts using it.
All of these steps should run in parallel.
What methods could I use to achieve this?
How can I handle concurrency?
Thanks in advance.
Using the published .NET sources, one can see how the move-file operation works here. If you follow the logic you'll find there's a call into KERNEL32. The MSDN documentation for that call states:
To perform this operation as a transacted operation, use the MoveFileTransacted function.
Based on that, I'd say the Move method is a safe/good way to move a file.
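As a small illustration of that point (the paths below are placeholders, not from the question): File.Move ends up in the Win32 MoveFile call, so within a single volume the operation is a rename of the directory entry rather than a copy of the data.

using System.IO;

const string source = @"\\san\share\zzz\report_0001.dat";
const string target = @"\\san\share\yyy\report_0001.dat";

try
{
    File.Move(source, target);   // same volume: effectively a rename, the data is not copied
}
catch (IOException)
{
    // The target already exists or the file is locked by another process; retry or log and skip.
}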
Searching through a large number of files is always going to take time. If you need to speed this operation up, I'd suggest thinking outside the box. For instance, I've seen cache buckets built for both leading letters and dates. With some sort of cache system you could find files more quickly, e.g. by asking the cache bucket for files starting with "a", or for files between two dates, etc.
Presumably, these files are used by something. For the purposes of this conversation, let's assume they are indexed in an SQL database.
TABLE: Files
File ID    Filename
-------    ----------------------------
1          \\a\b\aaaaa.bin
2          \\a\b\bbbbb.bin
To move them to \\a\c\ without impacting access, you can execute the following:
Copy the file from \\a\b\aaaaa.bin to \\a\c\aaaaa.bin
Update the database to point to \\a\c\aaaaa.bin
Once all open sessions on \\a\b\aaaaa.bin have been closed, delete the old file
If the files are on the same volume, you could also use a hardlink instead of a copy.
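A sketch of the copy -> repoint -> delete sequence described above, assuming an index like the hypothetical Files table; the table/column names, method name and connection string are illustrative:

using System.Data.SqlClient;
using System.IO;

static class SafeFileMover
{
    public static void MoveWithoutBreakingReaders(string oldPath, string newPath, int fileId, string connStr)
    {
        // 1. Copy first, so the original stays available to applications already using it.
        File.Copy(oldPath, newPath, overwrite: false);

        // 2. Repoint the index so new readers pick up the new location.
        using (var conn = new SqlConnection(connStr))
        using (var cmd = new SqlCommand("UPDATE Files SET Filename = @new WHERE FileId = @id", conn))
        {
            cmd.Parameters.AddWithValue("@new", newPath);
            cmd.Parameters.AddWithValue("@id", fileId);
            conn.Open();
            cmd.ExecuteNonQuery();
        }

        // 3. Delete the old copy later, once no open handles remain on it,
        //    e.g. from a cleanup job that retries File.Delete(oldPath) and ignores IOException.
    }
}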

Best way to track files being moved (possibly between disks), VB.NET (or C#)

I am developing a "dynamic shortcutting" application which creates special shortcut files which point to a registry entry rather than an actual file/executable. The registry entry contains the path of the desired file. I want to have a daemon running which watches the linked-to files and updates their registry entries if they are moved or renamed. Renamed I can handle using System.IO.FileSystemWatcher, but what is the best way to handle moved files?
I know this is beyond the basic functions of FSW (despite being a low-level file-system operation). The question is, what is the best way of doing it?
Most posts/articles I have read suggest approaches that feel altogether "hacky", which basically involve looking for a delete of a file followed by a create in a new place, and connecting the two by file size, metadata, the time between the delete/create events, hashes, etc. This may well be the method I have to resort to, setting up FSWs on all drives. However, I am hoping there might be a better way.
Is it possible to either:
2. Listen in on the shell and "hear" move operations?
3. Or (even more radical) replace or add something to the shell move operation that either triggers some sort of event or performs the registry-updating task itself, precluding the need for the daemon?
I have a feeling that everyone is going to tell me that option 1 (the FSW approach above) is the only course, but I look forward to your suggestions. (Answers in VB.NET preferred, but I can translate from C# if necessary.)
[I'm not sure if this should be appended as an "update" to my original post or posted as a separate answer]
To sum up the (all two of the) answers plus my own experimenting, in an attempt to give a definitive answer to this question:
It seems the only high-level (.NET) solution is to use the FileSystemWatcher, which does not detect "move" out of the box (despite move being a low-level command). The FSW approach is non-trivial, comparatively resource-expensive, sloppy in places (i.e. using timers), and has its limitations and caveats. Nor does it provide a true reflection of a "move" - it merely infers one from symptoms that are very likely to be a move (and have the same effect on the file system in any case) but could theoretically be produced by non-move actions. Also, it appears you have to know in advance which files you want to watch for moves; there is no way of telling as a move occurs.
On a lower level (which would involve C++), one could hook API calls to get a faithful picture of when a "move" is called. This has the advantage that you don't have to decide which files to watch in advance, and it is also less resource-expensive than listening for "deletes" and "creates" and trying to pair them up.
On a systems-programming level (which would involve C++ and could easily break your computer if you didn't know what you were doing), one could build a filesystem filter driver: this takes the concept of detecting moves to a truly anal level, detecting re-allocation of filesystem resources even when it is not performed via the shell.
After some experimenting, here is the general structure of how the FileSystemWatcher approach (or at least the most obvious one to me) works, its quirks and its limitations. [no code atm, it's all pretty integrated into my application and I'm yet to optimise it, but I might add some snippets in here later].
The FileSystemWatcher method (to detect when files are moved or renamed):
1. FileSystemWatchers.
You will need to create one FSW for each highest-level directory you want to monitor (for example, one for each writable logical drive).
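A minimal sketch of this point, assuming one watcher per fixed, ready logical drive; the handlers are placeholders for the scenarios discussed below:

using System.Collections.Generic;
using System.IO;

var watchers = new List<FileSystemWatcher>();
foreach (DriveInfo drive in DriveInfo.GetDrives())
{
    if (!drive.IsReady || drive.DriveType != DriveType.Fixed)
        continue;

    var fsw = new FileSystemWatcher(drive.RootDirectory.FullName)
    {
        IncludeSubdirectories = true,
        InternalBufferSize = 64 * 1024,   // reduce dropped events on busy drives
        NotifyFilter = NotifyFilters.FileName | NotifyFilters.DirectoryName |
                       NotifyFilters.LastWrite | NotifyFilters.Size
    };
    fsw.Created += (s, e) => { /* scenarios 3.1 / 3.2.2 */ };
    fsw.Deleted += (s, e) => { /* scenarios 3.1 / 3.2.2 */ };
    fsw.Renamed += (s, e) => { /* point 2 and scenario 3.2.1 */ };
    fsw.Changed += (s, e) => { /* refresh cached size/dates, see 3.0.1.1 */ };
    fsw.EnableRaisingEvents = true;
    watchers.Add(fsw);
}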
2. Renamed.
Straightforward renaming of the file is trivially handled.
3. Moved.
This part is very far from trivial; it basically involves comparing files in three different scenarios.
3.0.1. Deciding if a deleted/moved-from file is the same as a created/moved-to file.
For determining whether a deleted and a created file are a match, the filename is useless (it can be changed during a move). You could use a mixture of file size and attributes like creation time, or even a hash of the entire file. In my particular solution I only needed to watch the movement of specific files "registered" before load-time, so I was able to give these files a unique fingerprint as metadata that I could then use to compare files (this works fine in real-world scenarios, but is easy to break maliciously in testing, which disappoints me as a perfectionist).
3.0.1.1. When to read filesize/attributes/take hash?
Before I came up with the static fingerprint idea, I was testing my code with a simple filesize + creation date validation check. I quickly realised though that I had to have a note of the filesize and creation date (or hash or whatever else you want to use) of the deleted file BEFORE it signals as "deleted", because you can't check the size of a file that doesn't exist. If (like me) you know the files you want to watch in advance, then you need to read in those values before you enable the FileSystemWatchers; you also need to listen for "change" events on those files to update the values of filesize and creation date, take a new hash etc. This then begs the question: what do you do if you DON'T know what files you are interested in watching to see if they move? What if you only know you are possibly interested in knowing if they've moved when they "delete"? That, unfortunately, is beyond me (it wasn't something I had to deal with.) Unless you can come up with a solution to this problem, there is zero point in continuing with the FileSystemWatcher approach. Furthermore, I would conjecture (though could very easily be wrong) that there is no high-level solution that will meet your needs. If you do however come up with a solution (please post it below/comment on this post/edit it in here on this post), I have made the rest of this compatible.
3.1. Scenario 1: Direct moving of the file itself.
Upon the "delete" of a specific file being detected, you need to start listening for a "create" of a congruous file. Rather than listening indefinitely for the matching "create" of a file that might just have been deleted (which in reality involves inspecting every file created in the directory), you can use a timer to start and stop a "listening" flag (practical, but from a purist point of view a little arbitrary), deciding that after e.g. 1000ms with no appropriately matching create it's likely there won't be one.
3.2.0. A common misconception.
A lot of people seem to be under the impression, after glancing at the docs, that moving or renaming a folder triggers a rename for all of its subfiles and subfolders rather than a delete and a create. In actual fact, what the docs say is:
If you cut and paste a folder with files into a folder being watched, the FileSystemWatcher object reports only the folder as new, but not its contents because they are essentially only renamed.
(i.e. only the top folder throws a rename or create/delete, and the subfiles/subfolders throw NOTHING). Meaning that if you want to know when and where a certain file is moved, you have to listen out for each and every one of its ancestor folders as well.
3.2.1. Scenario 2: Renaming of a containing folder.
In my solution, because I knew all the files I was watching, whenever one of my FileSystemWatchers reported a rename of a folder rather than a file (the portion of the path after the last separator contains no "."), I checked each of my watched files to see whether its path was inside that directory and, if so, changed the beginning of the file path to the path of the new directory - et voila, I knew where my files had been moved to. If you do not know in advance which files you are looking for, then you will have to recursively search through everything in every folder that throws a "rename".
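Here is a rough sketch of that re-rooting step, assuming the watched file paths live in a plain list; the method name and the no-extension heuristic follow the description above:

using System;
using System.Collections.Generic;
using System.IO;

static class WatchedPathRemapper
{
    // Call this from the watcher's Renamed handler.
    public static void RemapAfterFolderRename(RenamedEventArgs e, List<string> watchedPaths)
    {
        // Heuristic from the text: a new name without an extension is treated as a folder.
        if (Path.HasExtension(e.FullPath)) return;

        string oldPrefix = e.OldFullPath.TrimEnd(Path.DirectorySeparatorChar) + Path.DirectorySeparatorChar;
        string newPrefix = e.FullPath.TrimEnd(Path.DirectorySeparatorChar) + Path.DirectorySeparatorChar;

        for (int i = 0; i < watchedPaths.Count; i++)
        {
            if (watchedPaths[i].StartsWith(oldPrefix, StringComparison.OrdinalIgnoreCase))
            {
                // Re-root the watched file under the renamed directory.
                watchedPaths[i] = newPrefix + watchedPaths[i].Substring(oldPrefix.Length);
            }
        }
    }
}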
3.2.2. Scenario 3: Moving of a containing folder.
This one feels like a slap in the face: in order to build your move-detection routine, you have to be able to detect moves. Here folders throw a "delete" followed by a "create". In my case the solution just recycles the techniques from 3.1 and 3.2.1: when a folder "delete" is detected, I check whether it contained any of my watched files. If it did, I set a "listen" flag (and a timer to snuff it) and check the sub-directory path of my file in the old folder against every new folder "create" that is detected, to see whether it points to a file with the desired fingerprint. If it does, I now have the old and new paths of the file and have detected the move. If you don't know which files to watch for, you may have to validate folder moves by comparing size on disk and the number of subfiles/subfolders between the "deleted" folder and "created" folders to confirm a folder has moved, then search the new folder recursively for the files you're interested in.
3.3. FURTHER COMPLICATION: Cross-drive moving of large files.
This is a problem I fortunately didn't run into (because I was only comparing fingerprint metadata, and didn't need access to files); however moving large files between drives (which transfer in stages, triggering a create event then a series of change events) can cause real headaches.
3.3.1. Headache 1: The "create" fires when the destination file is incomplete.
This means comparing its size to a "deleted" file will produce a false negative. You can't even take a hash of the first part of the file to indicate to your program that this "might" be the deleted file, because the move operation will have the file access permissions locked down. You just have to try and tell if the created file might still be moving and wait for it to finish.
3.3.2. Headache 2: No sure way to "tell" that the created file is still being moved.
Some have suggested checking the file access permissions on the created file, but they might be indistinguishable from those on a file created and still in use by any random application. Others have suggested setting short time-limited listen flags for "changes" on the file, but again this is indistinguishable from a file being modified by an application. In fact if the file happened to be a log file constantly and rapidly being updated by some process, then waiting for "changes" to the file to timeout might never end.
3.3.3. Headache 3: (UNTESTED) possibly these sorts of moves "delete" the file only after "creating" the destination file.
It makes sense that this would be the case, though I haven't tested it. [if anyone does know, feel free to edit (or delete) this section appropriately]
3.4. A philosophical quandary: are two identical files the same?
This is a very pedantic and arbitrary thought-experiment, but say you have two drives, each with an identical copy of File.txt. You run a batch file that deletes the copy on the first drive and then immediately duplicates the copy on the second drive, in the same folder, naming it Copy of File.txt. Unless you are using fingerprints, your code will identify a delete and then a create of an identical file and be unable to distinguish what happened from a move (with renaming) of the file from the first drive to the second. The final state of the filesystem is identical in both cases, so it shouldn't cause your application to behave unexpectedly, but art thou really content to call that a "move" based purely on isomorphism (especially when you know the kernel sees it differently)?
Using the high-level, unrestricted APIs available from C# - no, you can't. Use FileSystemWatcher. On the same drive, moving a file is not a "delete and create" - it's a "rename".
If you can/want to go lower-level, then you can hook MoveItem and MoveItems of the shell's IFileOperation interface, and MoveFile from Kernel32.dll... That will work with most apps, but it requires expanded security rights for your application, which is mostly unacceptable in a corporate environment.
The task has two flaws that make it hard to implement: (a) a move across disks is actually a sequence of read/write operations followed by a deletion rather than a true move, and during those read/write operations the data may even be transformed in place; and (b) the move may be performed by something other than the shell.
What you can do is employ a filesystem filter driver to intercept file operations right when they take place. You then need to detect the sequence of read and write operations performed by the same process on your file. I.e. if your code detects that the file is read sequentially (note: some copying tools read the file in multiple threads in parallel), that similar blocks of data are written to another file, and that after everything has been read the source file is deleted and the complete contents have been written to the other place, then you can guess that you have come across a file move operation.
Bump & update: This may well be against the rules of StackOverflow, but I would like to point out to the many people landing on this page (and the myriad similar questions on SO) that I have started a feature request on MicroSoft UserVoice to add MOVE detection to FileSystemWatcher. The best solution in the long term, rather than trying to work around the problem, might be to petition MicroSoft to fix it. If you have come here because you too need a solution to this problem, please consider clicking here and voting for this feature.

Amazon S3, Syncing, Modified date vs. Uploaded Date

We're using the AWS SDK for .NET and I'm trying to pinpoint where we seem to be having a sync problem with our consumer applications. Basically, we have a push service that generates changeset files which get uploaded to S3, and our consumer applications are supposed to download these files and apply them in order to sync up to the correct state, which is not happening.
There are some conflicting views on what/where the correct datestamps are. Our consumers were written to sort the downloaded files for processing by the S3 object's "LastModified" field, and I no longer know what this field represents. At first I thought it was the modified/created date of the file we uploaded; then (as seen here) it apparently is a new timestamp set when the file was uploaded, and the same link seems to imply that when a file is downloaded it reverts to the old datestamp (but I cannot confirm this).
We're using this snippet of code to pull files:
// Get a list of the latest changesets since the last successful full update.
Amazon.S3.AmazonS3Client client = ...;
List<Amazon.S3.Model.S3Object> listObjects = client.GetFullObjectList(
    this.Settings.GetS3ListObjectsRequest(this.Settings.S3ChangesetSubBucket),
    Amazon.S3.AmazonS3Client.DateComparisonType.GreaterThan,
    lastModifiedDate,
    Amazon.S3.AmazonS3Client.StringTokenComparisonType.MustContainAll,
    this.Settings.RequiredChangesetPathTokens);
And then we sort by the S3Object's LastModified (which I think is where our assumption is wrong):
foreach (Amazon.S3.Model.S3Object obj in listObjects)
{
    if (DateTime.Parse(obj.LastModified) > lastModifiedDate)
    {
        // It's a new file, so we use insertion sort to put it into a list
        // ordered by LastModified.
    }
}
Am I correct in assuming that we should be doing something more to preserve the datestamps we need, such as using custom headers/metadata to record the correct datestamps on the files, or even encoding the timestamp in the filename itself?
EDIT
Perhaps this question will pinpoint my problem: if my service has two files to upload to S3 and uploads them one after the other, am I guaranteed that these files show up in S3 in the order they were uploaded (by LastModified), or does S3 do some amount of asynchronous processing that could make my files show up in a listing out of order? I'm worried about a case where, for example, my service uploads file A and then B, B shows up first in S3, my consumers get and process B, then A shows up, and my consumers may or may not get A and incorrectly treat it as newer when it isn't.
EDIT 2
It was as I and the person below suspected: we had race conditions when applying changesets in order while blindly relying on S3's datestamps. As an addendum, we ended up making two fixes to address the problem, which might be useful for others as well:
Firstly, to address the race between when our uploads finish and the modified dates reported by S3, we made all our queries look one second into the past from the last modified date we read off a pulled S3 file. While testing this fix we noticed another S3 behaviour that wasn't apparent before: S3 does not preserve milliseconds in timestamps, but rounds them up to the next second. Looking back one second circumvents this as well.
Secondly, since we were looking back in time, we risked downloading the same file multiple times when there were no new changeset files, so we added a filename buffer for the files we saw in our last request, skipped any files we had already seen, and refreshed the buffer when new files appeared.
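A rough sketch of those two fixes combined, with a hypothetical listSince delegate standing in for the real S3 listing call shown earlier; in a real system the seen-key buffer should also be pruned so it doesn't grow forever:

using System;
using System.Collections.Generic;

class ChangesetPoller
{
    private readonly HashSet<string> _seenKeys = new HashSet<string>();
    private DateTime _lastModified = DateTime.MinValue;

    public IEnumerable<string> GetNewChangesets(
        Func<DateTime, IEnumerable<(string Key, DateTime LastModified)>> listSince)
    {
        // Look back one second to cover S3's whole-second timestamps and the
        // race between upload completion and the reported modified date.
        foreach (var obj in listSince(_lastModified.AddSeconds(-1)))
        {
            if (!_seenKeys.Add(obj.Key))
                continue;                    // already handled on a previous poll
            if (obj.LastModified > _lastModified)
                _lastModified = obj.LastModified;
            yield return obj.Key;
        }
    }
}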
Hope this helps.
When listing objects in an S3 bucket, the API response from S3 will always return them in alphabetical order by key.
The S3 API does not allow you to filter or sort objects based on the LastModified value. Any such filtering or sorting is done exclusively in the client libraries that you use to connect to S3.
http://docs.aws.amazon.com/AmazonS3/latest/API/RESTBucketGET.html
As for the accuracy of the LastModified value and its possible use for sorting the list of objects by the time they were uploaded: to my knowledge, the LastModified value is set to the time the upload finishes (when the server returns a 200 OK response), not the time the upload was started.
This means that if you start upload A, which is 100 MB in size, and a second later start upload B, which is only 1 KB, the last modified timestamp for A will end up later than the last modified timestamp for B.
If you need to preserve the time your upload was started, it's best to use a custom metadata header with your original PUT request.
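For example, with the current AWS SDK for .NET (the question used an older SDK version, so treat this only as a sketch; bucket, key and metadata names are illustrative), the producer's own timestamp can be attached as user metadata at upload time:

using System;
using System.Threading.Tasks;
using Amazon.S3;
using Amazon.S3.Model;

static class ChangesetUploader
{
    public static async Task UploadWithTimestampAsync(IAmazonS3 client, string bucket, string key, string filePath)
    {
        var put = new PutObjectRequest
        {
            BucketName = bucket,
            Key = key,
            FilePath = filePath
        };
        // User metadata travels with the object; the SDK adds the x-amz-meta- prefix.
        put.Metadata.Add("source-created-utc", DateTime.UtcNow.ToString("o"));
        await client.PutObjectAsync(put);
    }
}

Consumers can then sort by that metadata value instead of relying on LastModified.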

AntiVirus ProgressBar

I have to add a progress bar to my AV, and the problem is that I don't know in advance how many files there are to scan, so I don't know the total count. If I count the files to be scanned first, the application pauses for a while because it has to walk through many files, and for a Full System Scan this pause is long, which is not desirable, so I dropped that code. I've heard that Avast changes the progress bar value dynamically, i.e. if the value approaches 100 and more files are found to scan, it pushes the value back down to, say, 50. The code is in C#.
It would be foolish to try to get a list of all the files in one pass, before updating the UI at all.
Use threads: one to locate files, add their locations to a queue, and update a running total file count;
another to do the scanning and update the count of scanned files.
The UI thread then updates itself from those counts, so the upper bound will constantly be increasing... or rather, becoming more accurate.
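A rough sketch of that producer/consumer split; ScanFile and the root path are placeholders, and real code would also need to handle inaccessible directories:

using System;
using System.Collections.Concurrent;
using System.IO;
using System.Threading;
using System.Threading.Tasks;

var queue = new BlockingCollection<string>();
int totalFound = 0, scanned = 0;

// Producer: locate files and add them to the queue.
var finder = Task.Run(() =>
{
    foreach (var file in Directory.EnumerateFiles(@"C:\Users", "*", SearchOption.AllDirectories))
    {
        queue.Add(file);
        Interlocked.Increment(ref totalFound);
    }
    queue.CompleteAdding();
});

// Consumer: scan files as they arrive.
var scanner = Task.Run(() =>
{
    foreach (var file in queue.GetConsumingEnumerable())
    {
        // ScanFile(file);   // placeholder for the actual AV scan
        Interlocked.Increment(ref scanned);
    }
});

// A UI timer can then display progress as scanned out of Math.Max(totalFound, 1),
// with both numbers growing as the scan proceeds.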
Just count the folders five levels deep and calculate the progress based on how many of those folders have been scanned. Increase the depth to increase accuracy (at the cost of pre-calculation time). Use this article to learn how to enumerate directories as fast as possible: MSDN How to: Enumerate Directories and Files
Since AV scans happen on a regular basis, can you just remember how many files were in the last scan and pad it by 5%? That would give the user a rough idea of how long it will take without doing an up-to-date full count right now. To get an accurate number for the first scan you would have to do a full directory walk and suffer the pause you described. For custom individual folders you could store the number of files for the first 2-3 levels of folders and do a directory scan below that, since there would be fewer files in each sub-directory.
I would recommend checking how much occupied space there is on the disk and estimating the number of files from a density figure: files per GB. First you have to make an assumption about how many files are stored in one GB (for my system it is 3,860). During the scan you learn the real density of the scanned hard drive, so you can adjust the estimated number of files.
Adjusting the total number of files during the scan is the best method I can think of, and as a user of your app I would like it much more than waiting while you count all the files on the machine.
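A small sketch of that estimate, using the 3,860 files-per-GB figure from the answer as the starting assumption:

using System.IO;

double filesPerGB = 3860;                 // initial assumption; refine it during the scan
var drive = new DriveInfo("C");
double usedGB = (drive.TotalSize - drive.TotalFreeSpace) / (1024.0 * 1024 * 1024);
long estimatedTotal = (long)(usedGB * filesPerGB);

// As the scan proceeds, recompute filesPerGB from the files and bytes actually seen
// and adjust estimatedTotal (and the progress bar maximum) accordingly.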
