So, I know it is kinda crazy to report a bug at this point in Azure's life cycle, but I'm out of options. Here we go.
We have a service that you can upload files to and a client that downloads them. The blob storage is stuffed with about 27 GB of data.
On a few occasions our users reported that some files were coming back wrong, so we checked our MVC route to see if anything was wrong there and found nothing.
So we created a simple console app that loops the download:
public static void Main()
{
    var firstHash = string.Empty;
    var client = new System.Net.WebClient();

    for (int i = 0; i < 5000; i++)
    {
        try
        {
            // Download the same blob to a new local file each iteration
            var date = DateTime.Now.ToString("HH-mm-ss-ffff");
            var destination = @"C:\Users\Israel\Downloads\RO65\BLOB - RO65 -" + date + ".rfa";
            client.DownloadFile("http://myboxfree.blob.core.windows.net/public/91fe9d90-71ce-4036-b711-a5300159abfa.rfa", destination);

            // Hash the downloaded copy
            string hash = string.Empty;
            using (var md5 = MD5.Create())
            {
                using (var stream = File.OpenRead(destination))
                {
                    hash = Convert.ToBase64String(md5.ComputeHash(stream));
                }
            }

            if (string.IsNullOrEmpty(firstHash))
                firstHash = hash;

            // Flag any download whose hash differs from the first one
            if (hash != firstHash) hash += " ---------------------------------------------";

            Console.WriteLine("i: " + i.ToString() + " = " + hash);
        }
        catch { }
    }
}
So here is the result - every now and then it downloads the wrong file:
The first 1000 downloads were OK, the right file. Out of the blue the BLOB returns a different file, and then goes back to normal.
The only things the two files have in common are the extension and the file size in bytes. The hash is (of course) different.
Any thoughts?
I tried to rerun your sample code and wasn't able to repro.
Questions:
For the two different versions of the file you are seeing downloaded, have you compared the contents of the two files? I think you said it was two completely different blobs being retrieved - however, I wanted to verify that. How large is the delta between the two files?
Are you using RA-GRS with the client library's read-from-secondary retry option - meaning a network glitch could result in the read coming from the secondary region?
Suggestions:
Can you track the ETag of the retrieved files? This allows you to check whether the blob has changed since you first started reading it. (A sketch is shown after these suggestions.)
The Storage Service does enable you to explicitly validate the integrity of your objects, to check whether they have been modified in transit - potentially due to network issues etc. See the Azure Storage MD5 Overview for more information. The simplest way, however, might be to just use HTTPS, as these validations are already built into HTTPS.
Can you also try to repro using HTTPS and let me know if that helps?
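As a minimal sketch of the ETag / Content-MD5 tracking idea (assumptions: it uses HttpWebRequest instead of WebClient so the response headers are easy to read, the URL is the one from the question, and the output path is illustrative):

using System;
using System.IO;
using System.Net;

class EtagCheck
{
    static void Main()
    {
        var url = "http://myboxfree.blob.core.windows.net/public/91fe9d90-71ce-4036-b711-a5300159abfa.rfa";
        var request = (HttpWebRequest)WebRequest.Create(url);

        using (var response = (HttpWebResponse)request.GetResponse())
        {
            // The ETag changes whenever the blob is overwritten, so logging it per
            // download shows whether the "wrong" file is actually a different blob version.
            string etag = response.Headers[HttpResponseHeader.ETag];

            // Content-MD5 is the hash the service stores for the blob (if it was set on
            // upload); comparing it with a locally computed MD5 detects transit corruption.
            string contentMd5 = response.Headers["Content-MD5"];

            Console.WriteLine("ETag: " + etag + ", Content-MD5: " + contentMd5);

            using (var body = response.GetResponseStream())
            using (var file = File.Create(@"C:\temp\download.rfa")) // illustrative path
            {
                body.CopyTo(file);
            }
        }
    }
}

Logging the ETag next to each computed hash in the question's loop would show immediately whether the odd downloads correspond to a different blob version or to corruption of the same version.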
I'm trying to handle spreadsheet changes in order to update a local copy of the data, and I've run into some problems:
The Google Sheets API does not have any request for checking the last modified time or the file version. (Correct me if I'm mistaken.)
Google needs some time to process changes and update the version metadata of the file.
For example:
The file is at version 10.
Send a BatchUpdateRequest with some data.
Immediately after the previous request completes, check the file version with a Drive API Files.Get request with the field "version" - it still returns the old version, 10.
If I wait about 15 seconds, this request returns the correct version, but that's not a solution: the data is updated every minute for each spreadsheet, so waiting would cost far too much time.
To work around these problems I implemented logic that calculates the spreadsheet version locally and compares it after uploading: if the online version > the local version, the spreadsheet is reloaded. But this creates a new problem:
If changes are made to the spreadsheet from several computers at almost the same moment, the local version on every computer is incremented, but Google merges those changes into a single version. For this to work correctly the result would have to be oldVersionNumber + countOfComputersThatMadeChanges, but in fact it is oldVersionNumber + 1. As a result nobody gets the actual spreadsheet data, because the online version never ends up higher than the local one.
So my question is: how can I update spreadsheets when the data is changed from another source?
GoogleSpreadsheetsVersions is filled like this:
var versions = Instance.GoogleSpreadsheetsVersions;
if (!versions.ContainsKey(newTable.SpreadsheetId)) {
    var request = GoogleSpreadsheetsServiceDecorator.Instance.DriveService.Files.Get(newTable.SpreadsheetId);
    request.Fields = "version";
    var response = request.Execute();
    versions.Add(newTable.SpreadsheetId, response.Version);
}
The version comparison itself:
var newInfo = new Dictionary<string, long?>();
foreach (var info in GoogleSpreadsheetsVersions)
{
    try
    {
        // Gets the current file version
        var request = GoogleSpreadsheetsServiceDecorator.Instance.DriveService.Files.Get(info.Key);
        request.Fields = "version";
        var response = request.Execute();

        // local version < actual google version
        if (info.Value < response.Version)
        {
            // setting the reload flag for each sheet from this file
            foreach (var t in GoogleSpreadsheets.Where(sheet => sheet.SpreadsheetId == info.Key))
                t.IsLoadRequestRequired = true;
        }

        // Refreshing local versions
        newInfo.Add(info.Key, response.Version);
    }
    catch (Exception e) when (e.Message.Contains("File not found"))
    {
        newInfo.Add(info.Key, null);
    }
}
GoogleSpreadsheetsVersions = newInfo;
P.S.:
The version field description from the Google guide:
A monotonically increasing version number for the file. This reflects every change made to the file on the server, even those not visible to the user.
A local Spreadsheet object in the code represents the data of one sheet in Google. So if one Google spreadsheet contains 10 sheets, it becomes 10 Spreadsheet objects in the program.
Possibly helpful: Drive API Files.Get request fields
I have the task of providing an API endpoint to find out how much space a particular module is using in our Amazon S3 bucket. I'm using the C# SDK.
I have accomplished this by adapting code from the documentation here: https://docs.aws.amazon.com/AmazonS3/latest/dev/ListingObjectKeysUsingNetSDK.html
private long GetUsedBytes(string module, string customer)
{
    ListObjectsRequest listRequest = new ListObjectsRequest()
    {
        BucketName = BucketName,
        Prefix = module + "/" + customer
    };

    ListObjectsResponse listResponse;
    long totalSize = 0;
    do
    {
        listResponse = s3Client.ListObjects(listRequest);
        foreach (S3Object obj in listResponse.S3Objects)
        {
            totalSize += obj.Size;
        }
        listRequest.Marker = listResponse.NextMarker;
    } while (listResponse.IsTruncated);

    return totalSize;
}
My question is: is there a way to do this with the SDK without pulling all of the actual S3 objects down from the bucket? There are several good answers about doing it with the CLI:
AWS S3: how do I see how much disk space is using
https://serverfault.com/questions/84815/how-can-i-get-the-size-of-an-amazon-s3-bucket
But I've yet to find one that uses the SDK directly. Do I have to mimic the SDK somehow to accomplish this? Another method I considered is to get all the keys and query for their metadata, but the only way to get all the keys I've found is to grab all the objects, as in the link above. If there's a way to get all the metadata for objects with a particular prefix, that would be ideal.
Thanks for your time!
~Josh
Your code is not downloading any objects from Amazon S3. It is merely calling ListObjects() and totalling the size of each object. It will make one API call per 1000 objects.
Alternatively, you can retrieve the size of each bucket from Amazon CloudWatch.
From Monitoring Metrics with Amazon CloudWatch - Amazon S3:
Metric: BucketSizeBytes
The amount of data in bytes stored in a bucket. This value is calculated by summing the size of all objects in the bucket (both current and noncurrent objects), including the size of all parts for all incomplete multipart uploads to the bucket.
So, simply retrieve the metric from Amazon CloudWatch rather than calculating it yourself.
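For reference, a rough sketch of pulling that metric with the AWS SDK for .NET (the bucket name and region are placeholders, and note this metric is per bucket, not per prefix; BucketSizeBytes is published about once a day per storage type, so the code asks for the latest daily datapoint for StandardStorage):

using System;
using System.Collections.Generic;
using System.Linq;
using Amazon;
using Amazon.CloudWatch;
using Amazon.CloudWatch.Model;

class BucketSize
{
    static void Main()
    {
        var cloudWatch = new AmazonCloudWatchClient(RegionEndpoint.USEast1); // adjust region

        var request = new GetMetricStatisticsRequest
        {
            Namespace = "AWS/S3",
            MetricName = "BucketSizeBytes",
            Dimensions = new List<Dimension>
            {
                new Dimension { Name = "BucketName", Value = "my-bucket" },       // placeholder
                new Dimension { Name = "StorageType", Value = "StandardStorage" } // per storage class
            },
            StartTimeUtc = DateTime.UtcNow.AddDays(-2), // metric is emitted roughly once a day
            EndTimeUtc = DateTime.UtcNow,
            Period = 86400,                             // one day, in seconds
            Statistics = new List<string> { "Average" }
        };

        var response = cloudWatch.GetMetricStatisticsAsync(request).Result;
        var latest = response.Datapoints.OrderByDescending(d => d.Timestamp).FirstOrDefault();

        Console.WriteLine(latest == null
            ? "No datapoints yet (the metric can lag by up to a day)."
            : "Bucket size: " + latest.Average + " bytes");
    }
}

If you need the size per module/customer prefix rather than per bucket, the ListObjects approach in the question is still the way to go.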
I'm having a go at modifying an existing C# (.NET Core) app that reads a type of binary file, so that it uses Azure Blob Storage.
I'm using WindowsAzure.Storage (8.6.0).
The issue is that this app reads the binary data from a Stream in very small blocks (e.g. 5000-6000 bytes), which reflects how the data is structured.
Example pseudo code:
var blocks = new List<byte[]>();
var numberOfBytesToRead = 6240;
var numberOfBlocksToRead = 1700;

using (var stream = await blob.OpenReadAsync())
{
    stream.Seek(3000, SeekOrigin.Begin); // start reading at a particular position
    for (int i = 1; i <= numberOfBlocksToRead; i++)
    {
        byte[] traceValues = new byte[numberOfBytesToRead];
        stream.Read(traceValues, 0, numberOfBytesToRead);
        blocks.Add(traceValues);
    }
}
If I try to read a 10 MB file using OpenReadAsync(), I get invalid/junk values in the byte arrays after around 4,190,000 bytes.
If I set StreamMinimumReadSize to 100 MB, it works.
If I read more data per block (e.g. 1 MB), it works.
Some of the files can be more than 100 MB, so setting StreamMinimumReadSize may not be the best solution.
What is going on here, and how can I fix this?
Are the invalid/junk values zeros? If so (and maybe even if not), check the return value from stream.Read. That method is not guaranteed to actually read the number of bytes you ask it to; it can read fewer. In that case you are supposed to call it again in a loop until it has read the total amount that you want. A quick web search will show you lots of examples of the necessary looping; a minimal version is below.
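For illustration, a minimal sketch of that loop applied to the pseudo code from the question (ReadExactly is just a hypothetical helper name, not part of the storage library):

using System;
using System.IO;

static class StreamReadHelper
{
    // Reads exactly 'count' bytes into 'buffer' unless the stream ends first.
    // Stream.Read may return fewer bytes than requested, so we loop until done.
    public static int ReadExactly(Stream stream, byte[] buffer, int count)
    {
        int totalRead = 0;
        while (totalRead < count)
        {
            int read = stream.Read(buffer, totalRead, count - totalRead);
            if (read == 0)
                break; // end of stream reached
            totalRead += read;
        }
        return totalRead;
    }
}

// Usage inside the question's loop:
//   byte[] traceValues = new byte[numberOfBytesToRead];
//   int read = StreamReadHelper.ReadExactly(stream, traceValues, numberOfBytesToRead);
//   if (read != numberOfBytesToRead) { /* short read: end of blob reached */ }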
I have somewhere in the neighborhood of 4.2 million images I need to move from North Central US to West US, as part of a large migration to take advantage of Azure VM support (for those who don't know, North Central US does not support them). The images are all in one container, split into about 119,000 directories.
I'm using the following from the Copy Blob API:
public static void CopyBlobDirectory(
    CloudBlobDirectory srcDirectory,
    CloudBlobContainer destContainer)
{
    // get the SAS token to use for all blobs
    string blobToken = srcDirectory.Container.GetSharedAccessSignature(
        new SharedAccessBlobPolicy
        {
            Permissions = SharedAccessBlobPermissions.Read |
                          SharedAccessBlobPermissions.Write,
            SharedAccessExpiryTime = DateTime.UtcNow + TimeSpan.FromDays(14)
        });

    var srcBlobList = srcDirectory.ListBlobs(
        useFlatBlobListing: true,
        blobListingDetails: BlobListingDetails.None).ToList();

    foreach (var src in srcBlobList)
    {
        var srcBlob = src as ICloudBlob;

        // Create appropriate destination blob type to match the source blob
        ICloudBlob destBlob;
        if (srcBlob.Properties.BlobType == BlobType.BlockBlob)
            destBlob = destContainer.GetBlockBlobReference(srcBlob.Name);
        else
            destBlob = destContainer.GetPageBlobReference(srcBlob.Name);

        // copy using src blob as SAS
        destBlob.BeginStartCopyFromBlob(new Uri(srcBlob.Uri.AbsoluteUri + blobToken), null, null);
    }
}
The problem is, it's too slow. Waaaay too slow. At the rate it's taking to issue commands to copy all of this stuff, it is going to take somewhere in the neighborhood of four days. I'm not really sure what the bottleneck is (client-side connection limit, rate limiting on Azure's end, multithreading, etc.).
So, I'm wondering what my options are. Is there any way to speed things up, or am I just stuck with a job that will take four days to complete?
Edit: How I'm distributing the work to copy everything
//set up tracing
InitTracer();
//grab a set of photos to benchmark this
var photos = PhotoHelper.GetAllPhotos().Take(500).ToList();
//account to copy from
var from = new Microsoft.WindowsAzure.Storage.Auth.StorageCredentials(
"oldAccount",
"oldAccountKey");
var fromAcct = new CloudStorageAccount(from, true);
var fromClient = fromAcct.CreateCloudBlobClient();
var fromContainer = fromClient.GetContainerReference("userphotos");
//account to copy to
var to = new Microsoft.WindowsAzure.Storage.Auth.StorageCredentials(
"newAccount",
"newAccountKey");
var toAcct = new CloudStorageAccount(to, true);
var toClient = toAcct.CreateCloudBlobClient();
Trace.WriteLine("Starting Copy: " + DateTime.UtcNow.ToString());
//enumerate sub directories, then move them to blob storage
//note: it doesn't care how high I set the Parallelism to,
//console output indicates it won't run more than five or so at a time
var plo = new ParallelOptions { MaxDegreeOfParallelism = 10 };
Parallel.ForEach(photos, plo, (info) =>
{
    CloudBlobDirectory fromDir = fromContainer.GetDirectoryReference(info.BuildingId.ToString());

    var toContainer = toClient.GetContainerReference(info.Id.ToString());
    toContainer.CreateIfNotExists();

    Trace.WriteLine(info.BuildingId + ": Starting copy, " + info.Photos.Length + " photos...");
    BlobHelper.CopyBlobDirectory(fromDir, toContainer, info);

    //this monitors the container, so I can restart any failed
    //copies if something goes wrong
    BlobHelper.MonitorCopy(toContainer);
});
Trace.WriteLine("Done: " + DateTime.UtcNow.ToString());
The async blob copy operation is going to be very fast within the same data center (recently I copied a 30 GB VHD to another blob in about 1-2 seconds). Across data centers, the operation is queued up and occurs across spare capacity with no SLA (see this article, which calls that out specifically).
To put that into perspective: I copied the same 30GB VHD across data centers and it took around 1 hour.
I don't know your image sizes, but assuming a 500 KB average image size, you're looking at about 2,000 GB. In my example, I saw throughput of 30 GB in about an hour. Extrapolating, that puts your 2,000 GB of data at roughly 2000/30 ≈ 67 hours. Again, no SLA; just a best guess.
Someone else suggested disabling Nagle's algorithm. That should help push the 4 million copy commands out faster and get them queued up sooner. I don't think it will have any effect on the copy time itself.
This is a bit of a long shot, but I had a similar issue with table storage whereby small requests (which I think BeginStartCopyFromBlob should be) started running extremely slowly. It's a problem with Nagle's Algorithm and delayed TCP acks, two optimisations for network traffic. See MSDN or this guy for more details.
Upshot - turn Nagle's algorithm off - call the following before doing any Azure storage operations.
ServicePointManager.UseNagleAlgorithm = false;
Or for just blob:
var storageAccount = CloudStorageAccount.Parse(connectionString);
ServicePoint blobServicePoint = ServicePointManager.FindServicePoint(storageAccount.BlobEndpoint);
blobServicePoint.UseNagleAlgorithm = false;
Would be great to know if that's your problem!
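One more thing worth checking, given the comment in your code that no more than five or so copies seem to run at once: classic .NET caps outgoing HTTP connections per endpoint at a very low default, so the copy requests may simply be queueing client-side. A hedged sketch (set once at startup, before creating any storage clients; the value 100 is only an illustrative starting point):

// Raise the client-side HTTP connection limit so parallel copy requests are not
// serialized behind the tiny default (2 per endpoint in the .NET Framework).
ServicePointManager.DefaultConnectionLimit = 100; // illustrative value, tune as needed

// These two can also reduce per-request latency for lots of small requests:
ServicePointManager.UseNagleAlgorithm = false;
ServicePointManager.Expect100Continue = false;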
In a C# project that I am currently working on, we're attempting to calculate the MD5 of a large quantity of files over a network (the current pot is 2.7 million; a client pot may be in excess of 10 million). With the number of files that we are processing, speed is an issue.
The reason we do this is to verify the file was copied to a different location without modification.
We currently use the following code to calculate the MD5 of a file
MD5 md5 = new MD5CryptoServiceProvider();
StringBuilder sb = new StringBuilder();
byte[] hashMD5 = null;

try
{
    // Open stream to file to get MD5 hash for, create hash
    using (FileStream fsMD5 = new FileStream(sFilePath, FileMode.Open, FileAccess.Read, FileShare.ReadWrite))
        hashMD5 = md5.ComputeHash(fsMD5);
}
catch (Exception ex)
{
    clsLogging.logError(clsLogging.ErrorLevel.ERROR, ex);
}

string md5sum = "";
if (hashMD5 != null)
{
    // Change hash into readable text
    foreach (byte hex in hashMD5)
        sb.Append(hex.ToString("x2"));
    md5sum = sb.ToString();
}
However, the speed of this isn't what my manager has been hoping for. We've gone through a number of changes to which files we calculate the MD5 for, and how (i.e. we didn't do it for files that we don't copy... until today, when my manager changed his mind, so ALL files must have an MD5 calculated for them, in case at some future time a client wishes to bugger with our program so that all files are copied, I guess).
I realize that the speed of the network is probably a major contributing factor (100Mbit/s). Is there an efficient way to calculate the MD5 of the contents of a file over a network?
Thanks in advance.
Trevor Watson
Edit: put all code in block instead of just a part of it.
The bottleneck is that the whole file must be streamed/copied over the network, and your code looks good...
The different hash functions (MD5/SHA-256/SHA-512) have almost the same computation time.
Two possible solutions for this problem:
1) Run a hasher on the remote system and store the hashes in separate files - if that is possible in your environment.
2) Create a part-wise hash of the file, so that you only copy a part of the file.
I mean something like this:
part1Hash = md5(file.getXXXBytesFromFileAtPosition1)
part2Hash = md5(file.getXXXBytesFromFileAtPosition2)
part3Hash = md5(file.getXXXBytesFromFileAtPosition3)
finalHash = part1Hash ^ part2Hash ^ part3Hash;
You have to test which parts of the file are optimal to read, so that the hashes stay unique. A sketch of this idea in C# is below.
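A minimal sketch of the part-wise idea, assuming three fixed sample positions and a fixed sample size (both would need tuning for your data; the XOR combination mirrors the pseudo code above):

using System;
using System.IO;
using System.Security.Cryptography;

static class PartialHash
{
    // Hashes a few fixed-size samples of the file and XORs the hashes together,
    // so only a small part of the file has to travel over the network share.
    public static string ComputePartialHash(string path, int sampleSize = 64 * 1024)
    {
        using (var md5 = MD5.Create())
        using (var fs = new FileStream(path, FileMode.Open, FileAccess.Read, FileShare.ReadWrite))
        {
            // Sample positions: start, middle and end of the file (tune as needed).
            long[] positions = { 0, Math.Max(0, fs.Length / 2), Math.Max(0, fs.Length - sampleSize) };

            byte[] combined = new byte[16]; // MD5 produces 16 bytes
            byte[] buffer = new byte[sampleSize];

            foreach (long pos in positions)
            {
                fs.Seek(pos, SeekOrigin.Begin);
                int read = fs.Read(buffer, 0, buffer.Length);
                byte[] partHash = md5.ComputeHash(buffer, 0, read);

                // XOR the part hashes together, as in the pseudo code above.
                for (int i = 0; i < combined.Length; i++)
                    combined[i] ^= partHash[i];
            }

            return BitConverter.ToString(combined).Replace("-", "").ToLowerInvariant();
        }
    }
}

Note that XOR-combining makes the result order-insensitive and weaker than a full MD5, so this is purely a speed/strength trade-off.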
hope that helps...
edit: changed to bitwise xor
One possible approach would be to make use of the Task Parallel Library in .NET 4.0. 100 Mbps will still be a bottleneck, but you should see a modest improvement.
I wrote a small application last year that walks the top levels of a folder tree checking folder and file security settings. Running over a 10Mbps WAN it took about 7 minutes to complete one of our large file shares. When I parallelised the operation the execution time came down to a bit over 1 minute.
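For illustration, a rough sketch of hashing a set of paths in parallel with the Task Parallel Library (the degree of parallelism is a placeholder to be measured in your environment; on a 100 Mbit/s link the network will still dominate):

using System;
using System.Collections.Concurrent;
using System.IO;
using System.Security.Cryptography;
using System.Threading.Tasks;

static class ParallelHasher
{
    public static ConcurrentDictionary<string, string> HashAll(string[] filePaths)
    {
        var results = new ConcurrentDictionary<string, string>();

        // A handful of concurrent streams keeps the link busy without thrashing it;
        // the right number needs measuring against your network and file sizes.
        var options = new ParallelOptions { MaxDegreeOfParallelism = 4 };

        Parallel.ForEach(filePaths, options, path =>
        {
            using (var md5 = MD5.Create()) // one hasher per file; MD5 instances are not thread-safe
            using (var fs = new FileStream(path, FileMode.Open, FileAccess.Read, FileShare.ReadWrite))
            {
                byte[] hash = md5.ComputeHash(fs);
                results[path] = BitConverter.ToString(hash).Replace("-", "").ToLowerInvariant();
            }
        });

        return results;
    }
}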
Why don't you try installing a 'client' on each remote machine which listens on a port and, when signaled, calculates the MD5 hash for the requested files?
The main server will then only need to ask each client to calculate the MD5. Using this distributed approach you will gain the combined speed of all the clients and reduce network congestion.
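A very rough sketch of what such an agent could look like, using a plain TcpListener and a made-up one-line protocol (receive a local file path, reply with the hex MD5); the port number is illustrative, and a real version would need authentication and robust error handling:

using System;
using System.IO;
using System.Net;
using System.Net.Sockets;
using System.Security.Cryptography;

class Md5Agent
{
    static void Main()
    {
        var listener = new TcpListener(IPAddress.Any, 9500); // illustrative port
        listener.Start();

        while (true)
        {
            using (TcpClient client = listener.AcceptTcpClient())
            using (var stream = client.GetStream())
            using (var reader = new StreamReader(stream))
            using (var writer = new StreamWriter(stream) { AutoFlush = true })
            {
                // Protocol: one request line containing a local file path,
                // one response line containing the hex MD5 (or an error marker).
                string path = reader.ReadLine();
                try
                {
                    using (var md5 = MD5.Create())
                    using (var fs = File.OpenRead(path))
                    {
                        byte[] hash = md5.ComputeHash(fs);
                        writer.WriteLine(BitConverter.ToString(hash).Replace("-", ""));
                    }
                }
                catch (Exception ex)
                {
                    writer.WriteLine("ERROR: " + ex.Message);
                }
            }
        }
    }
}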