I've created a program that's supposed to run once each night. It downloads images from my FTP server, compresses them, and uploads them back to the server. I'm using WinSCP for downloading and uploading the files.
Right now I have a file mask applied that makes sure that only images are downloaded, that subdirectories are excluded, and, most importantly, that only files modified in the last 24 hours are downloaded. Code snippet for this file mask:
DateTime currentDate = DateTime.Now;
string date = currentDate.AddHours(-24).ToString("yyyy-MM-dd");
transferOptions.FileMask = "*.jpg>=" + date + "; *.png>=" + date + "|*/";
Thing is, as I'm about to publish this, I realize that if I run it once per night and it checks for files modified in the last 24 hours, it will just keep downloading and compressing the same files, because the modification timestamp is refreshed every time a compressed file is uploaded back.
To fix this I need to edit the FileMask to only download NEW files, i.e. files that weren't in the folder the last time the program was run. I don't know if you can check the creation timestamp in some way, or if I have to do some comparisons myself. I've been looking through the docs but I haven't found any solution to my specific use case.
Is there anyone experienced in WinSCP that can point me in the right direction?
It doesn't look like WinSCP can access the Created Date of the files.
Unless you can do something to make the files 'different' when you re-upload them (e.g. put them in a different folder), your best option might be:
Forget about using FileMask
Use the WinSCP method EnumerateRemoteFiles to get a list of the files
Loop through them yourself (it's a collection of RemoteFileInfo objects)
You'll probably need to keep a list of 'files already processed' somewhere and compare against that list
Call GetFiles for the specific files that you actually want (see the sketch below)
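A minimal sketch of those steps, assuming the WinSCP .NET assembly; the host, credentials, remote/local paths and the processed.txt store are all placeholders:

using System;
using System.Collections.Generic;
using System.IO;
using WinSCP;

// Placeholder connection settings.
SessionOptions sessionOptions = new SessionOptions
{
    Protocol = Protocol.Ftp,
    HostName = "ftp.example.com",
    UserName = "user",
    Password = "password",
};

// Load the names of files processed on previous runs.
var processed = new HashSet<string>(
    File.Exists("processed.txt") ? File.ReadAllLines("processed.txt") : new string[0]);

using (Session session = new Session())
{
    session.Open(sessionOptions);

    // Enumerate the top-level folder only (no subdirectories).
    foreach (RemoteFileInfo file in
        session.EnumerateRemoteFiles("/images", "*.*", EnumerationOptions.None))
    {
        string ext = Path.GetExtension(file.Name).ToLowerInvariant();
        if (ext != ".jpg" && ext != ".png")
            continue;

        if (processed.Contains(file.FullName))
            continue; // already compressed on a previous run

        // EscapeFileMask stops special characters in the name being
        // treated as wildcards.
        session.GetFiles(
            RemotePath.EscapeFileMask(file.FullName), @"C:\Download\").Check();

        processed.Add(file.FullName);
    }
}

// Remember what we've handled for the next run.
File.WriteAllLines("processed.txt", processed);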
There's a whole article on the WinSCP site on this: How do I transfer new/modified files only?
To summarize the article:
If you keep the past files locally, just run synchronization to download only the modified/new ones.
Then iterate the list returned by Session.SynchronizeDirectories to find out what the new files are.
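For instance, a rough sketch of that first approach (assuming an open WinSCP session; the local archive path is a placeholder):

// Download only remote files that are new or changed compared to the
// local archive.
SynchronizationResult result = session.SynchronizeDirectories(
    SynchronizationMode.Local, @"C:\ImageArchive", "/images", removeFiles: false);
result.Check();

// result.Downloads contains exactly the files that were new or modified.
foreach (TransferEventArgs download in result.Downloads)
{
    Console.WriteLine("New/changed: " + download.FileName);
}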
Otherwise you have to use a time threshold. Just remember the last time you ran your application and use a time constraint that includes a time of day, not just a date:
string date = lastRun.ToString("yyyy-MM-dd HH:mm:ss");
transferOptions.FileMask = "*.jpg>=" + date + "; *.png>=" + date + "|*/";
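To persist the lastRun timestamp between runs you could use something as simple as a text file (lastrun.txt is a placeholder; any store will do). Note that it's written after the compressed files have been uploaded back, so they fall outside the next run's time constraint:

// Read the timestamp saved by the previous run, if any.
DateTime lastRun = File.Exists("lastrun.txt")
    ? DateTime.Parse(File.ReadAllText("lastrun.txt"))
    : DateTime.MinValue;

// ... download files matching the mask, compress, re-upload ...

// Save the new timestamp only once the re-uploads have finished.
File.WriteAllText("lastrun.txt", DateTime.Now.ToString("yyyy-MM-dd HH:mm:ss"));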
I have 2 folders with files on Windows:
Local folder, e.g. C:\MyFolder
Network folder, e.g. \\server\MyFolder
I need to check whether \\server\MyFolder has any updates compared to my local one (e.g. an updated file).
I only need to know if there is any update in the network folder; I don't need to know what the differences are.
I tried implementing MD5 checksums on the folders, but this solution is too slow: creating an MD5 checksum of the network folder takes 3-4 minutes in my case, and I need the comparison to be made in a few seconds at most (we are talking about folders with a few hundred files and a few hundred MB in size).
I tried implementing some basic rules (e.g. if the number of files differs or the names differ, it's a quick response), but in most cases the number of files stays the same and the only change is in the content of a file. As long as the first file has changed the response is quick, but in the worst-case scenario (or when the folders are identical) I end up iterating through all the files, which is very slow (a few minutes).
What other approaches could I try?
(Note that the network drive is read-only to me, so I can't store any information or checksums in the folder when it changes.)
Here's a solution: loop over the folder's files and check this value for each one:
var lastModified = System.IO.File.GetLastWriteTime(path);
If lastModified is later than the time of your last check, return true immediately; if you get through every file without finding a newer one, return false.
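A minimal sketch of that idea, assuming you persist the time of the last check yourself (the folder path is a placeholder):

using System;
using System.IO;
using System.Linq;

static bool HasUpdates(string folder, DateTime lastCheckUtc)
{
    // Any() stops at the first file newer than the last check, so in the
    // common "something changed" case this returns quickly.
    return Directory
        .EnumerateFiles(folder, "*", SearchOption.AllDirectories)
        .Any(file => File.GetLastWriteTimeUtc(file) > lastCheckUtc);
}

// Usage:
// bool changed = HasUpdates(@"\\server\MyFolder", lastCheckUtc);

Reading timestamps is far cheaper than hashing contents, though the no-change worst case still has to touch every file's metadata.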
I have a job that, at set intervals, "looks" at the FTP server to see whether any new files have been uploaded. Once it finds any, it downloads them.
The question is how, using C#, to extract the time when a file was actually uploaded to the FTP server.
Thank you. I still can't figure out how to extract the exact time a file was uploaded to the FTP server rather than modified, as the following gives the time of file modification:
fileInfo = session.GetFileInfo(FileFullPath);
dateUploaded = fileInfo.LastWriteTime;
Please advise on some sample code that could be integrated into my current solution:
using (Session session = new Session())
{
    // SSIS variable holding the remote path of the file to process.
    string FileFullPath = Dts.Variables["User::FTP_FileFullPath"].Value.ToString();

    session.Open(sessionOptions);

    DateTime dateTime = DateTime.Now;

    // Move the file aside so it isn't picked up again, then download it.
    session.MoveFile(FileFullPath, newFTPFullPath);

    TransferOperationResult transferResult = session.GetFiles(
        newFTPFullPath,
        Dts.Variables["User::Local_DownloadFolder"].Value.ToString(),
        false);

    // Record when this run processed the file.
    Dts.Variables["User::FTP_FileProcessDate"].Value = dateTime;
}
You might not be able to, unless you know the FTP server reliably sets the file's created/modified date to the date it was uploaded. Do some test uploads and see. If it works out for you on this particular server, great: keep a note of when you last visited and retrieve files with a greater date. By way of an example, a test upload to an Azure FTP server just now (probably derived from Microsoft IIS) did indeed set the time of the file to the datetime it was uploaded. Beware that the file time listed by the server might not be in the same timezone as you, and it won't carry any timezone information; it could simply be some number of hours out relative to your current time.
To get the date itself you'll need to parse the response the server gives you when you list the remote directory. If you're using an FTP library for C# (edit: you're using WinSCP), that might already be handled for you (edit: it is; see https://winscp.net/eng/docs/library_session_listdirectory and https://winscp.net/eng/docs/library_remotefileinfo). Unless things have improved recently, the default FTP provision in .NET isn't great; it's intended more for basic file retrieval than complex syncing, so I'd definitely look at using a capable library if you're scrutinizing the date info offered (and we don't do software recommendations here, sorry, so I can't recommend one).
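With WinSCP that lookup is roughly as follows (the remote path and the lastVisit timestamp are placeholders):

// WinSCP parses the server's directory listing for you.
RemoteDirectoryInfo directory = session.ListDirectory("/inbox");

foreach (RemoteFileInfo fileInfo in directory.Files)
{
    // LastWriteTime is the modification time the server reports; on many
    // servers that equals the upload time - verify on yours.
    if (!fileInfo.IsDirectory && fileInfo.LastWriteTime > lastVisit)
    {
        // candidate newly-uploaded file
    }
}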
That said, there's another way to carry out this sync process, one that falls naturally out of what you want to do anyway and doesn't rely on parsing a non-standard list output:
Keep a memory of every file you saw last time and reference it when looking at every file that is there now. This is actually quite easy to do:
Download all the files.
Disconnect.
Go back some time later and download any files that you don't already have
Keep track of which files you downloaded and do something with them
You say you want to download them anyway, so just treat any file you don't already have (or one that has a newer date, a different file size, etc.) as one that is new or changed since you last looked.
It's potentially a big job, depending on how many different servers you want to support.
We need to create, or integrate some existing, software which identifies the FTP folder from which to download files.
The problem here is that the folder structure will be configured by the user at run time, according to how the client stores the files on FTP, and stored in some XML file or DB.
The folder structure needs to be generic so that we can easily configure it for any type of structure. The folder or file names can contain dates, or parts of a date, which change every day according to the date.
For example, we can have a folder Files_DDMMYYYY, inside which there are specific files that have to be downloaded every day.
OR
A single folder in which different files can contain dates.
The first major improvement you can make to your solution is a very simple one.
That would be to restructure the date format in your folder structure from using
Files_DDMMYYYY
to using
Files_YYYYMMDD
This way, your folders will list in the directory in sequential order. Otherwise, your folders will be sorted first by the day of the month, then by the month, and then by the year. With DDMMYYYY you'll see them listed something like this:
01102011
01102012
01102013
01112011
01112012
01112013
01122011
01122012
01122013
15102011
15102012
15102013
15112011
15112012
15112013
15122011
15122012
15122013
With YYYYMMDD you'll see them listed like this:
20111001
20111015
20111101
20111115
20111201
20111215
20121001
20121015
20121101
20121115
20121201
20121215
20131001
20131015
20131101
20131115
20131201
20131215
As I said, it's a very simple change that will help keep your structure organized. The second list is automatically sequentially ordered by date.
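Since the structure has to be user-configurable, one simple option is to store a .NET custom date format string in your XML/DB configuration and expand it each day. The pattern below is just an example:

using System;

// The pattern would come from the user's configuration; single quotes
// mark literal text in a .NET custom format string.
string configuredPattern = "'Files_'yyyyMMdd";

// Expands to e.g. Files_20131215 for 15 December 2013.
string todaysFolder = DateTime.Today.ToString(configuredPattern);

Console.WriteLine(todaysFolder);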
We're using the AWS SDK for .NET and I'm trying to pinpoint where we seem to be having a sync problem with our consumer applications. Basically we have a push-service that generates changeset files that get uploaded to S3, and our consumer applications are supposed to download these files and apply them in order to sync up to the correct state, which is not happening.
There are conflicting views on what the correct datestamps are and where they're represented. Our consumers were written to sort the downloaded files for processing by the S3 object's "LastModified" field, and I no longer know what this field represents. At first I thought it represented the modified/created date of the file we uploaded; then (as seen here) it turned out to actually be a new datestamp for when the file was uploaded; and the same link seems to imply that when a file is downloaded it reverts back to the old datestamp (but I cannot confirm this).
We're using this snippet of code to pull files:
// Get a list of the latest changesets since the last successful full update.
Amazon.S3.AmazonS3Client client = ...;
List<Amazon.S3.Model.S3Object> listObjects = client.GetFullObjectList(
this.Settings.GetS3ListObjectsRequest(this.Settings.S3ChangesetSubBucket),
Amazon.S3.AmazonS3Client.DateComparisonType.GreaterThan,
lastModifiedDate,
Amazon.S3.AmazonS3Client.StringTokenComparisonType.MustContainAll,
this.Settings.RequiredChangesetPathTokens);
And then we sort by the S3Object's LastModified (which I think is where our assumption is wrong):
foreach (Amazon.S3.Model.S3Object obj in listObjects)
{
    if (DateTime.Parse(obj.LastModified) > lastModifiedDate)
    {
        // It's a new file, so we insertion-sort it into an ordered list
        // based on LastModified.
    }
}
Am I correct in assuming that we should be doing something more to preserve the datestamps we rely on, such as using custom headers/metadata to put the correct datestamps on the files, or even encoding them in the filenames themselves?
EDIT
Perhaps this question can answer my problem: if my service has 2 files to upload to S3 and goes through the process of doing that, am I guaranteed that these files show up in S3 in the order they were uploaded (via LastModified), or does S3 do some amount of asynchronous processing that could lead to my files showing up in a listing out of order? I'm worried about a case where, for example, my service uploads file A then file B, B shows up in S3 first, my consumers fetch and process B, then A shows up, and my consumers may or may not fetch A and incorrectly treat it as newer when it's not.
EDIT 2
It was as I and the answerer below suspected: we had race conditions from trying to apply changesets in order while blindly relying on S3's datestamps. As an addendum, we ended up making 2 fixes to address the problem, which might be useful for others as well:
Firstly, to address the race condition between when our uploads finish and the modified dates reported by S3, we decided to make all our queries look 1 second into the past from the last modified date we read from a pulled file in S3. In examining this fix we noticed another S3 behaviour that wasn't apparent before, namely that S3 does not preserve milliseconds on timestamps, but rounds them up to the next second. Looking back in time by 1 second circumvented this too.
Secondly, since we were looking back in time, we would download the same file multiple times when there were no new changeset files, so we added a filename buffer for the files we saw in our last request, skipped any files we had already seen, and refreshed the buffer when we saw new files.
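In outline, the two fixes combine like this (seenKeys, newKeys and Process are illustrative names):

// Look 1 second into the past to absorb S3's second-granularity,
// rounded-up timestamps.
DateTime threshold = lastModifiedDate.AddSeconds(-1);

var newKeys = new HashSet<string>();
foreach (Amazon.S3.Model.S3Object obj in listObjects)
{
    if (DateTime.Parse(obj.LastModified) >= threshold
        && !seenKeys.Contains(obj.Key)) // skip files seen last request
    {
        Process(obj); // apply the changeset
        newKeys.Add(obj.Key);
    }
}

if (newKeys.Count > 0)
    seenKeys = newKeys; // refresh the buffer only when new files appeared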
Hope this helps.
When listing objects in an S3 bucket, the API response received from S3 will always return them in alphabetical order.
The S3 API does not allow you to filter or sort objects based on the LastModified value. Any such filtering or sorting is done exclusively in the client libraries that you use to connect to S3.
http://docs.aws.amazon.com/AmazonS3/latest/API/RESTBucketGET.html
As for the accuracy of the LastModified value and its possible use for sorting the list of objects by the time they were uploaded: to my knowledge, the LastModified value is set to the time the upload finishes (when the server returns a 200 OK response), not the time the upload was started.
This means that if you start upload A that's 100MB in size and a second later you start upload B that's only 1K in size, in the end, the last modified timestamp for A will be after the last modified timestamp for B.
If you need to preserve the time your upload was started, it's best to use a custom metadata header with your original PUT request.
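For example, with the AWS SDK for .NET you could record a client-side timestamp as custom metadata at upload time. A sketch against the newer (v2-style) request shape; bucket, key and paths are placeholders:

// Attach our own creation timestamp so consumers can order changesets
// without relying on S3's LastModified.
var request = new Amazon.S3.Model.PutObjectRequest
{
    BucketName = "my-bucket",
    Key = "changesets/changeset-0001.json",
    FilePath = @"C:\changesets\changeset-0001.json",
};
request.Metadata.Add("x-amz-meta-created-utc", DateTime.UtcNow.ToString("o"));

client.PutObject(request);

// Consumers read it back from the object's metadata:
// var meta = client.GetObjectMetadata("my-bucket", key);
// DateTime created = DateTime.Parse(meta.Metadata["x-amz-meta-created-utc"]);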
I have a program that requires a few large (~4 or 5 MB) files. Once a week, every week, there are new versions of these files with minor changes, mostly just a few lines added or removed.
When the program starts, if there's an Internet connection, I'd like it to update these files automatically. Instead of downloading the entire new versions of the files, I'd like to download just a patch, based on the client's version of the files, that updates them.
How might I do this?
I have total control over the server.
That is a tough problem to solve if you don't have any prior knowledge of what is in the file and the server doesn't have a facility that lets you request differences. Any program that cannot determine the differences without looking at both the old and the new file will have to download the new file anyway.
C# doesn't have any built-in facility to do this, but it sounds like your requirements aren't complicated. Look at how diff and ed on Unix can be used to patch a text file based on an easy-to-grok delta. Of course you should check the resulting file against a hash and fall back to a full download if it isn't correct.
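A sketch of that final verification step; the expected hash would be published by your server alongside the patch, and the names here are illustrative:

using System;
using System.IO;
using System.Security.Cryptography;

// Compare the patched file's SHA-256 against the hash the server published.
static bool PatchedFileIsValid(string path, string expectedSha256Hex)
{
    using (var sha = SHA256.Create())
    using (var stream = File.OpenRead(path))
    {
        string actual = BitConverter.ToString(sha.ComputeHash(stream))
                                    .Replace("-", "")
                                    .ToLowerInvariant();
        return actual == expectedSha256Hex.ToLowerInvariant();
    }
}

// if (!PatchedFileIsValid(localFile, hashFromServer))
//     DownloadFullFile(fileUrl, localFile); // hypothetical full-download fallback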