We're using the AWS SDK for .NET and I'm trying to pinpoint where we seem to be having a sync problem with our consumer applications. Basically we have a push-service that generates changeset files that get uploaded to S3, and our consumer applications are supposed to download these files and apply them in order to sync up to the correct state, which is not happening.
There are conflicting views about which datestamp is the correct one and where it is represented. Our consumers were written to sort the downloaded files for processing by the S3 object's "LastModified" field, and I no longer know what this field represents. At first I thought it was the modified/created date of the file we uploaded; then (as seen here) it appears to actually be a new timestamp set when the file is uploaded, and the same link seems to imply that when a file is downloaded it reverts back to the old datestamp (but I cannot confirm this).
We're using this snippet of code to pull files:
// Get a list of the latest changesets since the last successful full update.
Amazon.S3.AmazonS3Client client = ...;
List<Amazon.S3.Model.S3Object> listObjects = client.GetFullObjectList(
    this.Settings.GetS3ListObjectsRequest(this.Settings.S3ChangesetSubBucket),
    Amazon.S3.AmazonS3Client.DateComparisonType.GreaterThan,
    lastModifiedDate,
    Amazon.S3.AmazonS3Client.StringTokenComparisonType.MustContainAll,
    this.Settings.RequiredChangesetPathTokens);
And then sort by the S3Object's LastModified (which I think is where our assumption is wrong)
foreach (Amazon.S3.Model.S3Object obj in listObjects)
{
    if (DateTime.Parse(obj.LastModified) > lastModifiedDate)
    {
        // It's a new file, so we use insertion sort to put it into an ordered list
        // based on LastModified.
    }
}
Am I correct in assuming that we should be doing something more to preserve the datestamps we need, such as using custom headers/metadata to put the correct datestamps on the files, or even encoding them in the filenames themselves?
EDIT
Perhaps this question can answer my problem: if my service has 2 files to upload to S3 and goes through the process of uploading them, am I guaranteed that these files show up in S3 in the order they were uploaded (by LastModified), or does S3 do some amount of asynchronous processing that could lead to my files showing up in a listing of S3 objects out of order? I'm worried about a case where, for example, my service uploads file A and then file B, B shows up first in S3, my consumers get and process B, then A shows up, and my consumers may or may not get A and incorrectly process it thinking it's newer when it's not.
EDIT 2
It was as I and the answerer below suspected: we had race conditions when trying to apply changesets in order while blindly relying on S3's datestamps. As an addendum, we ended up making two fixes to address the problem, which might be useful for others as well:
Firstly, to address the race condition between when our uploads finish and the modified dates reported by S3, we decided to make all our queries look into the past by 1 second from the last modified date we read from a pulled file in S3. In examining this fix we saw another problem in S3 that wasn't apparent before, namely that S3 does not preserve milliseconds on timestamps, but rather rounds them up to the next second. Looking back in time by 1 second circumvented this.
Secondly, since we were now looking back in time, we would download the same file multiple times when there were no new changeset files, so we added a buffer of filenames we saw in our last request, skipped any files we had already seen, and refreshed the buffer when we saw new files.
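Roughly, the two fixes combined look like the sketch below. This is not the SDK's API, just our own logic; seenKeys is assumed to be a HashSet<string> kept between requests, and the variable names are ours.
// Look one second into the past to absorb S3's whole-second timestamps and the
// race between upload completion and the LastModified value S3 reports.
DateTime queryFrom = lastModifiedDate.AddSeconds(-1);

List<Amazon.S3.Model.S3Object> newObjects = new List<Amazon.S3.Model.S3Object>();
foreach (Amazon.S3.Model.S3Object obj in listObjects)
{
    if (DateTime.Parse(obj.LastModified) >= queryFrom && !seenKeys.Contains(obj.Key))
        newObjects.Add(obj);   // genuinely new; goes into the ordered list as before
}

// Refresh the "already seen" buffer only when something new arrived, so the
// overlap window never makes us apply the same changeset twice.
if (newObjects.Count > 0)
{
    seenKeys.Clear();
    foreach (Amazon.S3.Model.S3Object obj in newObjects)
        seenKeys.Add(obj.Key);
}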
Hope this helps.
When listing objects in an S3 bucket, the API response from S3 always returns them in lexicographic (alphabetical) order by key.
The S3 API does not allow you to filter or sort objects based on the LastModified value. Any such filtering or sorting is done exclusively in the client libraries that you use to connect to S3.
http://docs.aws.amazon.com/AmazonS3/latest/API/RESTBucketGET.html
As for the accuracy of the LastModified value and its possible use for sorting the list of objects by the time they were uploaded: to my knowledge, the LastModified value is set to the time the upload finishes (when the server returns a 200 OK response), not the time the upload was started.
This means that if you start upload A that's 100MB in size and a second later you start upload B that's only 1K in size, in the end, the last modified timestamp for A will be after the last modified timestamp for B.
If you need to preserve the time your upload was started, it's best to use a custom metadata header with your original PUT request.
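For example, the original timestamp could be attached as user metadata at upload time and read back after download. This is only a sketch against a recent AWS SDK for .NET (the synchronous PutObject/GetObject calls shown exist in the .NET Framework builds); the bucket name, key, and metadata key name are placeholders.
// Upload: attach your own timestamp as x-amz-meta-* user metadata.
var putRequest = new Amazon.S3.Model.PutObjectRequest
{
    BucketName = "my-bucket",                 // placeholder
    Key = "changesets/changeset-0001.xml",    // placeholder
    FilePath = localPath
};
putRequest.Metadata.Add("x-amz-meta-source-timestamp",
    sourceTimestamp.ToString("o"));           // ISO 8601, round-trippable
client.PutObject(putRequest);

// Download: read the metadata back and use it for ordering.
using (var getResponse = client.GetObject(new Amazon.S3.Model.GetObjectRequest
{
    BucketName = "my-bucket",
    Key = "changesets/changeset-0001.xml"
}))
{
    DateTime original = DateTime.Parse(
        getResponse.Metadata["x-amz-meta-source-timestamp"],
        null, System.Globalization.DateTimeStyles.RoundtripKind);
}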
Related
I have 2 folders with files on Windows:
Local folder, e.g. C:\MyFolder
Network folder, e.g. \\server\MyFolder
I need to check whether \\server\MyFolder has got any updates compared to my local one (e.g. an updated file).
I only need to know if there is any update in the network folder, I don't need to know what are the differences.
I tried computing MD5 checksums on the folders, but this solution is too slow: creating the checksum for the network folder takes 3-4 minutes in my case, while I need this comparison to be made in at most a few seconds (we are talking about folders with a few hundred files, a few hundred MB in total size).
I tried implementing some basic rules (e.g. if the number of files differs, or names differ, I can return a quick answer), but in most cases the number of files stays the same and the only change is in the files' content. As long as one of the first files has changed I get a quick response, but in the worst-case scenario (or when the folders are identical) I end up iterating through all the files, which is very slow (a few minutes).
What other approaches could I try?
(Note, that network drive is read only to me, so I can't store any information, checksums in the folder whenever it changes).
Here's a possible approach: loop over the files in the network folder and compare each file's last write time with the time of your previous check. If any file is newer, the folder has been updated.
DateTime lastCheck = ...; // the time of your previous check
bool hasUpdates = Directory.EnumerateFiles(@"\\server\MyFolder", "*", SearchOption.AllDirectories)
                           .Any(path => File.GetLastWriteTimeUtc(path) > lastCheck);
I have a job that, at set intervals, "looks" at the FTP server to see whether any new files have been uploaded. Once it finds any, it downloads them.
The question is how, using C#, to extract the time when a file was actually uploaded to the FTP server.
Thank you. I still can't figure out how to extract exactly the time when the file was uploaded to FTP, rather than modified, as the following shows the time of file modification:
fileInfo = session.GetFileInfo(FileFullPath);
dateUploaded = fileInfo.LastWriteTime;
Please advise on some sample code that could be integrated into my current solution:
using (Session session = new Session())
{
    string FileFullPath = Dts.Variables["User::FTP_FileFullPath"].Value.ToString();
    session.Open(sessionOptions);
    DateTime dateTime = DateTime.Now;
    session.MoveFile(FileFullPath, newFTPFullPath);
    TransferOperationResult transferResult;
    transferResult = session.GetFiles(newFTPFullPath,
        Dts.Variables["User::Local_DownloadFolder"].Value.ToString(), false);
    Dts.Variables["User::FTP_FileProcessDate"].Value = dateTime;
}
You might not be able to, unless you know the FTP server reliably sets the file create/modified date to the date it was uploaded. Do some test uploads and see. If it works out for you on this particular server, great; keep a note of when you last visited and retrieve files with a greater date. By way of an example, a test upload to an Azure FTP server just now (probably derived from Microsoft IIS) did indeed set the time of the file to the datetime it was uploaded. Beware that the file time listed by the server might not be in the same timezone as you are, nor will it carry any timezone information; it could simply be some number of hours off relative to your current time.
To get the date itself you'll need to parse the response the server gives you when you list the remote directory. If you're using an FTP library for C# (edit: you're using WinSCP), that might already be handled for you (edit: it is, see https://winscp.net/eng/docs/library_session_listdirectory and https://winscp.net/eng/docs/library_remotefileinfo). Unless things have improved recently, the default FTP provision in .NET isn't great; it's more intended for basic file retrieval than complex syncing, so I'd definitely look at using a capable library (and we don't do software recommendations here, sorry, so I can't recommend one) if you're scrutinizing the date info offered.
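For example, with the WinSCP .NET assembly you can read each entry's timestamp from the listing directly. This is only a sketch: the remote path "/inbox" is a placeholder and sessionOptions is assumed to be configured already.
using (Session session = new Session())
{
    session.Open(sessionOptions);

    // ListDirectory parses the server's listing for you and exposes typed info.
    RemoteDirectoryInfo directory = session.ListDirectory("/inbox");
    foreach (RemoteFileInfo file in directory.Files)
    {
        if (file.IsDirectory)
            continue;
        // LastWriteTime is whatever timestamp the server reports for the file
        // (which may or may not be the upload time, and may be in the server's timezone).
        Console.WriteLine("{0}  {1}", file.Name, file.LastWriteTime);
    }
}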
That said, there's another way to carry out this sync process overall, one that is more of a side effect of what you want to do and doesn't necessarily rely on parsing non-standard listing output:
Keep a memory of every file you saw last time and compare it against every file that is there now. This is actually quite easy to do:
Download all the files.
Disconnect.
Go back some time later and download any files that you don't already have
Keep track of which files you downloaded and do something with them?
You say you want to download them anyway, so just treat any file you don't already have (or maybe one that has a newer date, different file size etc) as one that is new/changed since you last looked
Big job, potentially, depending on how many different servers you want to support.
I've created a program that's supposed to run once each night. What it does is that it downloads images from my FTP, compresses them and uploads them back to the FTP. I'm using WinSCP for downloading and uploading files.
Right now I have a filemask applied that makes sure that only images are downloaded, that subdirectories are excluded and most importantly that only files that are modified the last 24 hours are downloaded. Code snippet for this filemask:
DateTime currentDate = DateTime.Now;
string date = currentDate.AddHours(-24).ToString("yyyy-MM-dd");
transferOptions.FileMask = "*.jpg>=" + date + "; *.png>=" + date + "|*/";
Thing is, as I'm about to publish this I realize that if I run it once per night and it checks for files modified in the last 24 hours, it will just keep downloading and compressing the same files, because the modified timestamp is refreshed each time I upload the compressed version.
To fix this I need to change the FileMask to only download NEW files, i.e. files that weren't in the folder the last time the program was run. I don't know if you can check the creation timestamp in some way, or if I have to do some comparisons. I've been looking through the docs but I haven't found any solution for my specific use case.
Is there anyone experienced in WinSCP that can point me in the right direction?
It doesn't look like WinSCP can access the Created Date of the files.
Unless you can do something to make the files 'different' when you re-upload them (e.g. put them in a different folder), then your best option might be the following (sketched in code after the list):
Forget about using FileMask
Use WinSCP method EnumerateRemoteFiles to get a list of the files
Loop through them yourself (it's a collection of RemoteFileInfo objects)
You'll probably need to keep a list of 'files already processed' somewhere and compare with that list
Call GetFiles for the specific files that you actually want
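A rough sketch of that loop, assuming the WinSCP .NET assembly. The remote path, the local download folder, the processedListPath file, sessionOptions, and the extension filter are all placeholders of mine.
var alreadyProcessed = new HashSet<string>(
    File.Exists(processedListPath) ? File.ReadAllLines(processedListPath) : new string[0]);

using (Session session = new Session())
{
    session.Open(sessionOptions);

    // Enumerate everything in the remote folder; filter by extension and skip known files.
    foreach (RemoteFileInfo file in
             session.EnumerateRemoteFiles("/images", null, EnumerationOptions.None))
    {
        string ext = Path.GetExtension(file.Name).ToLowerInvariant();
        if (ext != ".jpg" && ext != ".png")
            continue;
        if (!alreadyProcessed.Add(file.FullName))
            continue;   // already processed on a previous run

        session.GetFiles(RemotePath.EscapeFileMask(file.FullName), @"C:\Downloads\").Check();
    }
}

// Persist the list so the next run knows what it has already handled.
File.WriteAllLines(processedListPath, alreadyProcessed);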
There's a whole article on WinSCP site on How do I transfer new/modified files only?
To summarize the article:
If you keep the past files locally, just run synchronization to download only the modified/new ones.
Then iterate the list returned by Session.SynchronizeDirectories to find out what the new files are (a sketch follows the code below).
Otherwise you have to use a time threshold. Just remember the last time you ran your application and use a time constraint that includes a time as well, not just a date:
string date = lastRun.ToString("yyyy-MM-dd HH:mm:ss");
transferOptions.FileMask = "*.jpg>=" + date + "; *.png>=" + date + "|*/";
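A sketch of the first (synchronization) option, assuming the WinSCP .NET assembly; the local and remote paths are placeholders.
using (Session session = new Session())
{
    session.Open(sessionOptions);

    // Download only files that are new or modified compared to the local copy.
    SynchronizationResult result = session.SynchronizeDirectories(
        SynchronizationMode.Local, @"C:\LocalImages", "/images", removeFiles: false);
    result.Check();

    // The result tells you exactly which files were downloaded this run.
    foreach (TransferEventArgs download in result.Downloads)
        Console.WriteLine("New/changed: {0} -> {1}", download.FileName, download.Destination);
}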
I'm an MMORPG private server dev and I want to create an all-in-one updater for the users' clients, because it is very annoying and lame to use patches that must be manually downloaded.
I'm new to C#, but I have already succeeded in making my own launcher with my own interface, basic game start/options buttons, and a notice that is read from my webserver.
Now I want to add an integrated update function to it and I'm pretty lost; I have no idea where to start. This is what it would look like (it's just a concept):
It will have a main button which is used both to start the game AND to update it. Basically, when you open the program the button would read "UPDATE" and be disabled (while it searches for new updates); if any updates are found it would turn into a clickable button, and after they are downloaded it would change itself into "START GAME".
There would be a progress bar for the overall update and another one showing the progress of just the file currently being downloaded, along with basic info like the percentage and how many files remain to be downloaded.
I need to find a way for the launcher to check the files on the webserver over HTTP and determine whether they are the same as the client's or newer, so it doesn't always re-download files that are already at the same version. I also need a way for the updater to download the update as a compressed archive and automatically extract and overwrite the existing files once the download finishes.
NOTE: The files being updated are not .exe, they mostly are textures/config files/maps/images/etc...
I'll sketch a possible architecture for this system. It's incomplete, you should consider it a form of detailed pseudo-C#-code for the first half and a set of hints and suggestions for the second.
I believe you may need two applications for this:
A C# WinForms client.
A C# server-side application, maybe a web service.
I won't focus on security issues in this answer, but they are obviously very important. I expect that security can be implemented at a higher level, maybe using SSL. The web service would run within IIS, and implementing some form of security should be mainly a matter of configuration.
The server-side part is not strictly required, especially if you don't want compression; there is probably a way to configure your server so that it returns an easily parsable list of files when an HTTP request is made to website.com/updater. However, it is more flexible to have a web service, and probably even easier to implement. You can start by looking at this MSDN article. If you do want compression, you can probably configure the server to transparently compress individual files. I'll try to sketch all possible variants.
In the case of a single update ZIP file, basically the updater web service should be able to answer two different requests; first, it can return a list of all game files, relative to the server directory website.com/updater, together with their last write timestamp (GetUpdateInfo method in the web service). The client would compare this list with the local files; some files may not exist anymore on the server (maybe the client must delete the local copy), some may not exist on the client (they are entirely new content), and some other files may exist both on the client and on the server, and in that case the client needs to check the last write time to determine if it needs the updated version. The client would build a list of the paths of these files, relative to the game content directory. The game content directory should mirror the server website.com/updater directory.
Second, the client sends this list to the server (GetUpdateURL in the web service). The server would create a ZIP containing the update and reply with its URL.
[ServiceContract]
public interface IUpdater
{
    [OperationContract]
    FileModified[] GetUpdateInfo();

    [OperationContract]
    string GetUpdateURL(string[] files);
}

[DataContract]
public class FileModified
{
    [DataMember]
    public string Path;

    [DataMember]
    public DateTime Modified;
}
public class Updater : IUpdater
{
    public FileModified[] GetUpdateInfo()
    {
        // Get the physical directory behind website.com/updater
        string updateDir = HostingEnvironment.MapPath("~/updater");
        List<FileModified> updateInfo = new List<FileModified>();
        foreach (string path in Directory.GetFiles(updateDir))
        {
            FileModified fm = new FileModified();
            // Store the path relative to updateDir, so the client can mirror it
            fm.Path = path.Substring(updateDir.Length).TrimStart(Path.DirectorySeparatorChar);
            fm.Modified = new FileInfo(path).LastWriteTime;
            updateInfo.Add(fm);
        }
        return updateInfo.ToArray();
    }

    public string GetUpdateURL(string[] files)
    {
        // You could use System.IO.Compression.ZipArchive and its
        // method CreateEntryFromFile. You create a ZipArchive by
        // calling ZipFile.Open. The name of the ZIP should probably
        // be unique for the update session, so that two concurrent
        // updates from different clients do not conflict. You could also
        // cache the ZIP packages you create, so that if a future
        // update requires the same exact files you can return the same
        // ZIP.
        // You have to return the URL of the ZIP, not its local path on the
        // server. There may be several ways to do this, and they tend to
        // depend on the server configuration.
        return urlOfTheUpdate;
    }
}
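Before requesting the ZIP, the client-side comparison described earlier might look roughly like this. It's only a sketch: serverFiles (the FileModified[] returned by GetUpdateInfo), gameDir (the local content directory mirroring website.com/updater), and the delete-local-extras policy are assumptions of mine.
List<string> filesToRequest = new List<string>();
HashSet<string> remotePaths = new HashSet<string>(StringComparer.OrdinalIgnoreCase);

foreach (FileModified remote in serverFiles)
{
    remotePaths.Add(remote.Path);
    string localPath = Path.Combine(gameDir, remote.Path);
    if (!File.Exists(localPath) || File.GetLastWriteTime(localPath) < remote.Modified)
        filesToRequest.Add(remote.Path);   // new on the server, or newer than our copy
}

// Files we have locally that no longer exist on the server can be deleted.
foreach (string localFile in Directory.GetFiles(gameDir, "*", SearchOption.AllDirectories))
{
    string relative = localFile.Substring(gameDir.Length).TrimStart(Path.DirectorySeparatorChar);
    if (!remotePaths.Contains(relative))
        File.Delete(localFile);
}

// filesToRequest is what the client would then pass to GetUpdateURL(...).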
The client would download the ZIP file by using HttpWebRequest and HttpWebResponse objects. To update the progress bar (you would have only one progress bar in this setup, check my comment to your question) you need to create a BackgroundWorker. This article and this other article cover the relevant aspects (unfortunately the example is written in VB.NET, but it looks very similar to what would be in C#). To advance the progress bar you need to keep track of how many bytes you received:
int nTotalRead = 0;
HttpWebRequest theRequest;
HttpWebResponse theResponse;
...
long length = theResponse.ContentLength;
byte[] readBytes = new byte[4096];
int bytesRead = theResponse.GetResponseStream().Read(readBytes, 0, readBytes.Length);
nTotalRead += bytesRead;
int percent = (int)((nTotalRead * 100.0) / length);
Once you have received the file you can use ZipFile.ExtractToDirectory (or the ZipArchive.ExtractToDirectory extension method) from System.IO.Compression to update your game.
If you don't want to explicitly compress the files with .NET, you can still use the first method of the web service to obtain the list of updated files, and copy the ones you need to the client using an HttpWebRequest/HttpWebResponse pair for each. This way you can actually have two progress bars. The one that counts files would simply be set to a percentage like:
int filesPercent = (int)((nCurrentFile * 100.0) / nTotalFiles);
If you have another way to obtain the list, you don't even need the web service.
If you want to individually compress your files, but you can't have this feature automatically implemented by the server, you should define a web service with this interface:
[ServiceContract]
public interface IUpdater
{
    [OperationContract]
    FileModified[] GetUpdateInfo();

    [OperationContract]
    string CompressFileAndGetURL(string path);
}
In which you can ask the server to compress a specific file and return the URL of the compressed single-file archive.
Edit - Important
Especially in the case that your updates are very frequent, you need to pay special attention to time zones.
Edit - An Alternative
I should restate that one of the main issues here is obtaining from the server the list of files in the current release; this list should include the last write time of each file. A server like Apache can provide such a list for free, although it is usually intended for human consumption; it is nevertheless easily parsable by a program. I'm sure there must be some script/extension to have that list formatted in an even more machine-friendly way.
There is another way to obtain that list; you could have a text file on the server that, for every game content file, stores its last write time or, maybe even better, a progressive release number. You would compare release numbers instead of dates to check which files you need. This would protect you from time zone issues. In this case, however, you need to maintain a local copy of this list, because files have no such thing as a release number, only a name and a set of dates.
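For illustration, one way the release-number comparison could look. The tab-separated manifest format ("relative-path<TAB>releaseNumber" per line), the file names, and serverManifestText are entirely made up for this sketch.
static Dictionary<string, int> ParseManifest(string text)
{
    return text
        .Split(new[] { '\n' }, StringSplitOptions.RemoveEmptyEntries)
        .Select(line => line.TrimEnd('\r').Split('\t'))
        .ToDictionary(parts => parts[0], parts => int.Parse(parts[1]));
}

// Files whose release number on the server is higher than the local one (or that
// are not in the local manifest at all) are the ones the client needs to download.
var serverManifest = ParseManifest(serverManifestText);                     // fetched from the server
var localManifest  = ParseManifest(File.ReadAllText("manifest.local.txt")); // cached from the last update

var filesToDownload = serverManifest
    .Where(kv => !localManifest.TryGetValue(kv.Key, out var local) || kv.Value > local)
    .Select(kv => kv.Key)
    .ToList();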
This is a wide and varied question, with several answers that could be called 'right' depending on various implementation requirements. Here are a few ideas...
My approach would be to use System.Security.Cryptography.SHA1 to generate a list of hash codes for each game asset. The updater can then download the list, compare it to the local file system (caching the locally-generated hashes for efficiency) and build a list of new/changed files to be downloaded.
If the game data uses archives, the process gets a bit more involved, since you don't want to download a huge archive when only a single small file inside may have been changed. In this case you'd want to hash each file within the archive and provide a method for downloading those contained files, then update the archive using the files you download from the server.
Finally, give some thought to using a Binary Diff/Patch algorithm to reduce the bandwidth requirements by downloading smaller patch files when possible. In this case the client would request a patch that updates from the current version to the latest version of a file, sending the hash of the local file so the server knows which patch to send. This requires you to maintain a stack of patches on the server for each previous version you want to be able to patch from, which might be more than you're interested in.
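For instance, building the hash manifest could look like this. It's a sketch; the asset directory layout and the relative-path keys are assumptions.
using System;
using System.Collections.Generic;
using System.IO;
using System.Security.Cryptography;

// Build a map of relative asset path -> SHA1 hash (hex string).
static Dictionary<string, string> BuildManifest(string assetDir)
{
    var manifest = new Dictionary<string, string>();
    using (SHA1 sha1 = SHA1.Create())
    {
        foreach (string path in Directory.GetFiles(assetDir, "*", SearchOption.AllDirectories))
        {
            using (FileStream stream = File.OpenRead(path))
            {
                byte[] hash = sha1.ComputeHash(stream);
                string relative = path.Substring(assetDir.Length).TrimStart(Path.DirectorySeparatorChar);
                manifest[relative] = BitConverter.ToString(hash).Replace("-", "");
            }
        }
    }
    return manifest;
}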
Here are some links that might be relevant:
SHA1 Class - Microsoft documentation for SHA1 hashing class
SevenZipSharp - using 7Zip in C#
bsdiff.net - a .NET library implementing bsdiff
Oh, and consider using a multi-part downloader to better saturate the available bandwidth at the client end. This results in higher load on the server(s), but can greatly improve the client-side experience.
I've been given a task to build a prototype for an app. I don't have any code yet, as the solution concepts that I've come up with seem stinky at best...
The problem:
The solution consists of various Azure projects which do stuff to lots of data stored in Azure SQL databases. Almost every action that happens creates a gzipped log file in blob storage, so that's one .gz file per log entry.
We should also have a small desktop (WPF) app that should be able to read, filter and sort these log files.
I have absolutely 0 influence on how the logging is done, so this is something that can not be changed to solve this problem.
Possible solutions that I've come up with (conceptually):
1:
connect to the blob storage
open the container
read/download blobs (with applied filter)
decompress the .gz files
read and display
The problem with this is that, depending on the filter, this could mean a whole lot of data to download (which is slow), and process (which will also not be very snappy). I really can't see this as a usable application.
2:
create a web role which will run a WCF or REST service
the service will take the filter params and other stuff and return a single xml/json file with the data, the processing will be done on the cloud
With this approach, will I run into problems with decompressing these files if there are a lot of them (will it take up extra space on the storage/compute instance where the service is running)?
EDIT: what I mean by filter is limit the results by date and severity (info, warning, error). The .gz files are saved in a structure that makes this quite easy, and I will not be filtering by looking into the files themselves.
3:
some other elegant and simple solution that I don't know of
I'd also need some way of making the app update the displayed logs in real time, which I suppose would need to be done with repeated requests to the blob storage/service.
This is not one of those "give me code" questions. I am looking for advice on best practices, or similar solutions that worked for similar problems. I also know this could be one of those "no one right answer" questions, as people have different approaches to problems, but I have some time to build a prototype, so I will be trying out different things, and I will select the right answer, which will be the one that showed a solution that worked, or the one that steered me in the right direction, even if it does take some time before I actually build something and test it out.
As I understand it, you have a set of log file in Azure Blob storage that are formatted in a particular way (gzip) and you want to display them.
How big are these files? Are you displaying every single piece of information in the log file?
I'm assuming that since these are log files, they are static and historical, meaning that once a log/gzip file is created it cannot be changed (you are not updating the gzip file once it is out on Blob storage); only new files can be created.
One Solution
Why not create a worker role/job process that periodically goes out, scans the blob storage, and builds a persisted "database" that you can query for display? The nice thing about this is that you are not putting the unzipping/business logic for extracting the log files into a WPF app or UI. (A sketch of the scan/unzip step follows the numbered list below.)
1) I would have the worker role scan the log files in Azure Blob storage
2) Have some kind of mechanism to track which ones were processed and a current "state", maybe the UTC date of the last gzip file
3) Do all the unzipping/extracting of the log files in the worker role
4) Have the worker role place the content in a SQL database, Azure Table Storage or Distributed Cache for access
5) Access can be done by a REST service (ASP.NET Web API/Node.js etc)
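As an illustration of steps 1-4, a minimal sketch using the current Azure.Storage.Blobs package and GZipStream. The connection string, container name, and lastProcessedUtc state are placeholders; the SDK available at the time of the question would look different.
using System.IO;
using System.IO.Compression;
using Azure.Storage.Blobs;

string connectionString = ...;          // placeholder
DateTimeOffset lastProcessedUtc = ...;  // the "state" from step 2

// Scan the container, skip blobs we already processed, unzip and hand off the content.
var container = new BlobContainerClient(connectionString, "logs");
foreach (var blobItem in container.GetBlobs())
{
    if (blobItem.Properties.LastModified <= lastProcessedUtc)
        continue;

    BlobClient blob = container.GetBlobClient(blobItem.Name);
    using (Stream compressed = blob.OpenRead())
    using (var gzip = new GZipStream(compressed, CompressionMode.Decompress))
    using (var reader = new StreamReader(gzip))
    {
        string logEntry = reader.ReadToEnd();
        // Step 4: insert logEntry into SQL / Table Storage / cache here.
    }
}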
You can add more things if you need to scale this out, for example run this as a job to re-do all of the log files from a given time (refresh all). I don't know the size of your data, so I am not sure if that is feasible.
Nice thing about this is that if you need to scale your job (overnight), you can spin up 2, 3, 6 worker roles...extract the content, pass the result to a Service Bus or Storage Queue that would insert into SQL, Cache etc for access.
Simply storing the blobs isn't sufficient. The metadata you want to filter on should be stored somewhere else where it's easy to filter and retrieve all the metadata. So I think you should split this into 2 problems:
A. How do I efficiently list all "gzips" with their metadata, and how can I apply a filter on these gzips in order to show them in my client application?
Solutions
Blobs: Listing blobs is slow and filtering is not possible (you could group in a container per month or week or user or ... but that's not filtering).
Table Storage: Very fast, but searching is slow (only PK and RK are indexed)
SQL Azure: You could create a table with a list of "gzips" together with some other metadata (like user that created the gzip, when, total size, ...). Using a stored procedure with a few good indexes you can make search very fast, but SQL Azure isn't the most scalable solution
Lucene.NET: There's an AzureDirectory for Windows Azure which makes it possible to use Lucene.NET in your application. This is a super fast search engine that allows you to index your 'documents' (metadata) and this would be perfect to filter and return a list of "gzips"
Update: Since you only filter on date and severity you should review the Blob and Table options:
Blobs: You can create a container per date+severity (20121107-low, 20121107-medium, 20121107-high, ...). Assuming you don't have too many blobs per date+severity, you can simply list the blobs directly from the container. The only issue you might have here is that a user may want to see all items with a high severity from the last week (7 days), which means you'll need to list the blobs in 7 containers.
Tables: Even though you say Table Storage or a DB isn't an option, do consider Table Storage. Using partition and row keys you can easily filter in a very scalable way (you can also use CompareTo to get a range of items, for example all records between 1 and 7 November). Duplicating data is perfectly acceptable in Table Storage. You could include some data from the gzip in the Table Storage entity in order to show it in your WPF application (the most essential information you want to show after filtering). This means you'll only need to process the blob when the user opens/double-clicks the record in the WPF application.
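For the Tables option, a sketch with the current Azure.Data.Tables package. The table name, connection string, and the PartitionKey scheme (yyyyMMdd-severity) are assumptions, pick whatever fits your data.
using Azure.Data.Tables;

string connectionString = ...;   // placeholder
var table = new TableClient(connectionString, "LogIndex");

// Query all "high" severity entries for a given day via the partition key.
string partition = "20121107-high";
foreach (TableEntity entity in table.Query<TableEntity>(e => e.PartitionKey == partition))
{
    // Each entity would carry the blob name plus the essentials you want to
    // show in the WPF list (timestamp, message summary, ...).
}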
B. How do I display a "gzip" in my application (after double-clicking a search result, for example)?
Solutions
Connect to the storage account from the WPF application, download the file, unzip it and display it. This means that you'll need to store the storage account in the WPF application (or use SAS or a container policy), and if you decide to change something in the backend of how files are stored, you'll also need to change the WPF application.
Connect to a Web Role. This Web Role gets the blob from blob storage, unzips it and sends it over the wire (or send it compressed in order to speed up the transfer). In case something changes in how you store files, you only need to update the Web Role