I have the task of providing an API endpoint to find out how much space a particular module is using in our Amazon S3 bucket. I'm using the C# SDK.
I have accomplished this by adapting code from the documentation here: https://docs.aws.amazon.com/AmazonS3/latest/dev/ListingObjectKeysUsingNetSDK.html
private long GetUsedBytes(string module, string customer)
{
    // List every object under the module/customer prefix and total up the sizes.
    ListObjectsRequest listRequest = new ListObjectsRequest()
    {
        BucketName = BucketName,
        Prefix = module + "/" + customer
    };

    ListObjectsResponse listResponse;
    long totalSize = 0;
    do
    {
        listResponse = s3Client.ListObjects(listRequest);
        foreach (S3Object obj in listResponse.S3Objects)
        {
            totalSize += obj.Size;
        }
        listRequest.Marker = listResponse.NextMarker;
    } while (listResponse.IsTruncated);

    return totalSize;
}
My question is: is there a way to do this with the SDK without pulling all of the actual S3 objects down off the bucket? There are several good answers about doing it with the CLI:
AWS S3: how do I see how much disk space is using
https://serverfault.com/questions/84815/how-can-i-get-the-size-of-an-amazon-s3-bucket
But I've yet to find one that uses the SDK directly. Do I have to mimic the CLI somehow to accomplish this? Another method I considered is to get all the keys and query for their metadata, but the only way I've found to get all the keys is to list all the objects as in the code above. If there's a way to get all the metadata for objects with a particular prefix, that would be ideal.
Thanks for your time!
~Josh
Your code is not downloading any objects from Amazon S3. It is merely calling ListObjects() and totalling the size of each object. It will make one API call per 1000 objects.
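For reference, the same pattern with the newer V2 listing call looks like the sketch below; the GetUsedBytesV2 name is just for illustration, and it assumes an SDK build that exposes the synchronous ListObjectsV2 (on .NET Core targets only the Async variants exist).
private long GetUsedBytesV2(string module, string customer)
{
    // Sketch of the same listing loop using ListObjectsV2: still one request per 1000 keys,
    // and still only key metadata (name, size, etag) on the wire - object bodies are never downloaded.
    var request = new ListObjectsV2Request
    {
        BucketName = BucketName,
        Prefix = module + "/" + customer
    };

    long totalSize = 0;
    ListObjectsV2Response response;
    do
    {
        response = s3Client.ListObjectsV2(request);
        foreach (S3Object obj in response.S3Objects)
        {
            totalSize += obj.Size;
        }
        request.ContinuationToken = response.NextContinuationToken;
    } while (response.IsTruncated);

    return totalSize;
}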
Alternatively, you can retrieve the size of each bucket from Amazon CloudWatch.
From Monitoring Metrics with Amazon CloudWatch - Amazon S3:
Metric: BucketSizeBytes
The amount of data in bytes stored in a bucket. This value is calculated by summing the size of all objects in the bucket (both current and noncurrent objects), including the size of all parts for all incomplete multipart uploads to the bucket.
So, simply retrieve the metric from Amazon CloudWatch rather than calculating it yourself.
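A rough sketch of reading that metric with the AWS SDK for .NET (AWSSDK.CloudWatch) follows; the StorageType dimension value and the exact member names (StartTimeUtc vs. StartTime, sync vs. Async calls) depend on your SDK version and storage class, so treat this as an outline rather than drop-in code.
// Sketch: read the daily BucketSizeBytes metric that S3 publishes to CloudWatch.
var cloudWatch = new AmazonCloudWatchClient();

var metricRequest = new GetMetricStatisticsRequest
{
    Namespace = "AWS/S3",
    MetricName = "BucketSizeBytes",
    Dimensions = new List<Dimension>
    {
        new Dimension { Name = "BucketName", Value = BucketName },
        new Dimension { Name = "StorageType", Value = "StandardStorage" } // assumption: Standard storage class
    },
    StartTimeUtc = DateTime.UtcNow.AddDays(-2),
    EndTimeUtc = DateTime.UtcNow,
    Period = 86400, // the metric is emitted once per day
    Statistics = new List<string> { "Average" }
};

GetMetricStatisticsResponse metricResponse = cloudWatch.GetMetricStatistics(metricRequest);

// Pick the most recent datapoint.
Datapoint latest = null;
foreach (Datapoint d in metricResponse.Datapoints)
{
    if (latest == null || d.Timestamp > latest.Timestamp)
    {
        latest = d;
    }
}
double bucketSizeBytes = latest?.Average ?? 0;
Note that BucketSizeBytes is reported per bucket (and per storage class), so it won't break usage down by prefix the way the listing approach does.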
Related
I've been trying to figure out the fastest way to connect to an Azure Storage account, cycle through a number of containers and convert the blobs inside containers into objects. All elements in the container are JSON and match different objects.
The structure as seen on Azure Storage Explorer would be:
Azure_Subscription
|--Storage_Accounts
   |--My_Storage_Account
      |--blob1
      |--blob2
      |--blob3
      etc
Now, based on what I've read in the official documentation, to access and download each blob and convert it so that it can be handled as JSON and deserialized, I would need to do all of the below (assuming I don't have a list of blob URIs).
string testConnectionString = "DefaultEndpointsProtocol=https;AccountName=;AccountKey=;EndpointSuffix=core.windows.net";

// the service clients allow working at the Azure Storage level with Tables and Blobs
TableServiceClient tableServiceClient = new TableServiceClient(testConnectionString);
BlobServiceClient blobServiceClient = new BlobServiceClient(testConnectionString);

List<blob1> blob1List = new List<blob1>();

// this gives me a list of blob containers and I can programmatically retrieve
// the name of each individual container.
Pageable<BlobContainerItem> blobContainers = blobServiceClient.GetBlobContainers();

// each BlobContainerItem represents an individual blob container (bill, building...)
foreach (BlobContainerItem blobContainerItem in blobContainers)
{
    // create a ContainerClient to make calls to each individual container
    BlobContainerClient clientForIndividualContainer =
        blobServiceClient.GetBlobContainerClient(blobContainerItem.Name);

    if (blobContainerItem.Name.Equals("blob1"))
    {
        Pageable<BlobItem> blobItemList = clientForIndividualContainer.GetBlobs();
        foreach (BlobItem bi in blobItemList)
        {
            // download each blob and deserialize its JSON content
            BlobClient blobClient = clientForIndividualContainer.GetBlobClient(bi.Name);
            var blobContent = blobClient.Download();
            StreamReader reader = new StreamReader(blobContent.Value.Content);
            string text = reader.ReadToEnd();
            blob1List.Add(JsonSerializer.Deserialize<blob1>(text));
        }
    }
}
The project targets .NET 5.0, and I will need to do something similar with Azure Tables as well. The goal is to go through all blobs inside a number of containers (all of them JSON, really) and compare them to all the blobs inside another storage account. I'm also open to any ideas on doing this differently altogether, but the purpose of this is to compare input into Azure Storage blobs and make sure that a new process uploads the same object structures. So for all blob1 items in the Azure Storage account, I compare these to a list of all the oldBlob1 items in another storage account and check whether they are all equal.
I hope the question makes sense... At this point the above code works, and I can move the functionality inside the if statement into a method and use a switch instead, but my main question is about reaching this point at all. Without a massive list of blob URIs, do I need to create a BlobServiceClient to get a list of BlobContainerItems, then cycle through each of the containers and create a BlobContainerClient for all of them, and then create a BlobClient for every single blob in the storage account, just to finally be able to get to the Content of the blob?
This seems like a lot of work to just get access to an individual file.
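For what it's worth, the service-client → container-client → blob-client chain is how the v12 SDK is designed to be used, but the per-blob download can be shortened with DownloadContent (available in recent Azure.Storage.Blobs versions), which returns the body as BinaryData. Below is a rough sketch of the comparison idea only, assuming hypothetical connection-string variables and a blob1 type with value equality.
// Sketch only: compare the "blob1" container in two storage accounts by deserializing
// each JSON blob and checking the resulting objects for equality.
// newAccountConnectionString / oldAccountConnectionString are placeholders.
BlobContainerClient newContainer =
    new BlobServiceClient(newAccountConnectionString).GetBlobContainerClient("blob1");
BlobContainerClient oldContainer =
    new BlobServiceClient(oldAccountConnectionString).GetBlobContainerClient("blob1");

foreach (BlobItem item in newContainer.GetBlobs())
{
    // DownloadContent buffers the blob in memory and exposes it as BinaryData,
    // so no StreamReader is needed for small JSON payloads.
    blob1 newObject = JsonSerializer.Deserialize<blob1>(
        newContainer.GetBlobClient(item.Name).DownloadContent().Value.Content.ToString());

    blob1 oldObject = JsonSerializer.Deserialize<blob1>(
        oldContainer.GetBlobClient(item.Name).DownloadContent().Value.Content.ToString());

    if (!newObject.Equals(oldObject))
    {
        Console.WriteLine($"Blob {item.Name} differs between the two accounts");
    }
}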
I am writing a CLI utility that does a lot of different things, but what I'm struggling with right now is this: I have a known blob, and for that blob I want to restore a snapshot that was taken of it.
await foreach (var snapshot in containerClient.GetBlobsAsync(
    BlobTraits.All,
    BlobStates.Snapshots,
    blobPath))
{
    _logger.LogInformation($"found blob {snapshot.Name} - {snapshot.Snapshot}");
    if (DecideIfRightSnapshot(snapshot))
    {
        BlobClient snapshotBlob = containerClient.GetBlobClient(snapshot.Name);
        _logger.LogInformation($"found snapshot {snapshotBlob.Uri}");
        await sourceBlob.StartCopyFromUriAsync(snapshotBlob.Uri);
    }
    break;
}
First, the filter isn't working right because the last blob in the list is always the base blob. But I can work around that one.
The real issue I'm struggling with is the proper way to restore a blob from a snapshot using the client libraries. I'm really concerned because the .Uri property always returns the base blob's URI, even if it's a snapshot. I was led to believe the URI would be something like this:
https://me.blob.core.windows.net/myapp/doc?snapshot=2020-12-16T17:07:44.1076450Z
but that's not the URI that's getting logged. Am I supposed to construct the full URI myself?
In all the searches they refer to this as "promoting" a snapshot. But I can't find a "promote" method in the API.
Am I doing this right?
If you're using the new version of the blob storage SDK, Azure.Storage.Blobs, then you should construct the full snapshot URI yourself. Sample code is below:
//other code
await foreach (var snapshot in containerClient.GetBlobsAsync(
    BlobTraits.All,
    BlobStates.Snapshots,
    blobPath))
{
    _logger.LogInformation($"found blob {snapshot.Name} - {snapshot.Snapshot}");
    if (DecideIfRightSnapshot(snapshot))
    {
        BlobClient snapshotBlob = containerClient.GetBlobClient(snapshot.Name);

        // construct the snapshot url by appending the snapshot timestamp as a query string
        var snapshotUri = snapshotBlob.Uri.ToString() + "?snapshot=" + snapshot.Snapshot;
        _logger.LogInformation($"found snapshot {snapshotUri}");

        // StartCopyFromUriAsync expects a Uri, not a string
        await sourceBlob.StartCopyFromUriAsync(new Uri(snapshotUri));
    }
    break;
}
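Note that BlobClient also has a WithSnapshot method that returns a client scoped to a specific snapshot, whose Uri already carries the ?snapshot=... query string, so the string concatenation can be avoided. A small sketch, using the same containerClient, sourceBlob and snapshot variables as above:
// Sketch: build a snapshot-scoped client instead of concatenating the query string manually.
BlobClient snapshotScopedBlob = containerClient
    .GetBlobClient(snapshot.Name)
    .WithSnapshot(snapshot.Snapshot);

// snapshotScopedBlob.Uri ends with ?snapshot=<timestamp>; copying it over the base blob
// effectively restores ("promotes") the snapshot.
await sourceBlob.StartCopyFromUriAsync(snapshotScopedBlob.Uri);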
As for "promoting": it means restoring the snapshot via the Azure portal. It's a UI operation, and it actually calls the Put Blob From URL API. Currently there is no such "promote" method in the SDK.
But if you're using an older package like WindowsAzure.Storage, it has many methods for working with snapshots; see this article. Note: using the old packages is not recommended.
I am trying to upgrade my project from Microsoft.WindowsAzure.Storage v9 (deprecated) to the latest SDK, Azure.Storage.Blobs v12.
My issue (post-upgrade) is accessing the ContentHash property.
Pre-upgrade steps:
upload file to blob
get MD5 hash of uploaded file provided by CloudBlob.Properties.ContentMD5 from Microsoft.WindowsAzure.Storage.Blob
compare the calculated MD5 hash with the one retrieved from azure
Post-upgrade attempts to access the MD5 hash that Azure is calculating on its side:
1. BlobClient.GetProperties() - calling this method
2. BlobClient.UploadAsync() - looking at the BlobContentInfo response
Both return a null ContentHash (see my edits below to see why).
One huge difference I've noticed is that with the older SDK I could tell the storage client to compute MD5 like this:
CloudBlobClient cloudBlobClient = _cloudStorageAccount.CreateCloudBlobClient();
cloudBlobClient.DefaultRequestOptions.StoreBlobContentMD5 = true;
So I was expecting to find something similar to StoreBlobContentMD5 in the latest SDK, but I couldn't.
Can anyone help me find a solution for this problem?
Edit 1:
I did a test, and in Azure Storage I do not have an MD5 hash.
Upload code:
var container = _blobServiceClient.GetBlobContainerClient(containerName);
var blob = container.GetBlobClient(blobPath);

BlobHttpHeaders blobHttpHeaders = null;
if (!string.IsNullOrWhiteSpace(fileContentType))
{
    blobHttpHeaders = new BlobHttpHeaders()
    {
        ContentType = fileContentType,
    };
}

StorageTransferOptions storageTransferOption = new StorageTransferOptions()
{
    MaximumConcurrency = 2,
};

var blobResponse = await blob.UploadAsync(stream, blobHttpHeaders, null, null, null, null, storageTransferOption, default);
return blob.GetProperties();
There is not much difference between the old upload code and the new one, apart from using the new classes from the new SDK.
The main difference remains the one I already stated: I cannot find an equivalent setting in the new SDK for StoreBlobContentMD5.
I think this is the problem. I need to tell the storage client to compute the MD5 hash, as I did with the old SDK.
Edit 2:
For download I can do something like this:
var properties = blob.GetProperties();
var download = await blob.DownloadAsync(range: new HttpRange(0, properties.Value.ContentLength), rangeGetContentHash: true);
By using this overload of DownloadAsync I can force the MD5 hash to be calculated, and it can be found in download.Value.ContentHash.
To summarize and close the question:
I did a quick test with the latest version (12.4.4) of the blob storage package, and I can see the Content-MD5 is auto-generated and can also be read.
As per the OP's comment, it may have been due to some issue with the existing solution; after creating a new solution, it works as expected.
The short version of this problem is: make sure the Stream you upload to Azure using the v12 version of the SDK supports seeking (see the CanSeek property). Seeking is currently required in order to traverse the Stream to generate the hash, and then reset the position back to 0 so that it can be read again for the actual upload.
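A minimal sketch of working around a non-seekable source stream follows, assuming a BlobClient named blob, a source stream named sourceStream, and that buffering the payload in memory is acceptable; setting BlobHttpHeaders.ContentHash is one explicit way to make sure the stored blob carries a Content-MD5 property.
// Sketch: ensure the upload stream is seekable and carry an MD5 with the blob.
// "sourceStream" and "blob" are assumed to exist in the surrounding code.
using var seekableStream = new MemoryStream();
await sourceStream.CopyToAsync(seekableStream);
seekableStream.Position = 0;

// Compute the MD5 of the content ourselves...
byte[] md5;
using (var md5Algorithm = System.Security.Cryptography.MD5.Create())
{
    md5 = md5Algorithm.ComputeHash(seekableStream);
}
seekableStream.Position = 0;

// ...and store it as the blob's Content-MD5 property alongside the upload.
var uploadOptions = new BlobUploadOptions
{
    HttpHeaders = new BlobHttpHeaders { ContentHash = md5 }
};
await blob.UploadAsync(seekableStream, uploadOptions);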
So, I know it is kind of crazy to report a bug at this point in Azure's life cycle, but I'm out of options. Here we go.
We have a service that you can upload files to and a client that downloads them. The blob storage is stuffed with about 27 GB of data.
On a few occasions our users reported that some files were coming back wrong, so we checked our MVC route to see if anything was wrong and found nothing.
So we created a simple console app that loops the download:
public static void Main()
{
    var firstHash = string.Empty;
    var client = new System.Net.WebClient();

    for (int i = 0; i < 5000; i++)
    {
        try
        {
            var date = DateTime.Now.ToString("HH-mm-ss-ffff");
            var destination = @"C:\Users\Israel\Downloads\RO65\BLOB - RO65 -" + date + ".rfa";
            client.DownloadFile("http://myboxfree.blob.core.windows.net/public/91fe9d90-71ce-4036-b711-a5300159abfa.rfa", destination);

            // hash each downloaded copy and compare it against the first one
            string hash = string.Empty;
            using (var md5 = MD5.Create())
            {
                using (var stream = File.OpenRead(destination))
                {
                    hash = Convert.ToBase64String(md5.ComputeHash(stream));
                }
            }

            if (string.IsNullOrEmpty(firstHash))
                firstHash = hash;

            if (hash != firstHash) hash += " ---------------------------------------------";

            Console.WriteLine("i: " + i.ToString() + " = " + hash);
        }
        catch { }
    }
}
So here is the result: every now and then it downloads the wrong file.
The first 1000 downloads were OK, the right file. Out of the blue the blob service returns a different file, and then it goes back to normal.
The only relation I found between the files is the extension and the file size in bytes. The hash is (of course) different.
Any thoughts?
I have tried to rerun your sample code and wasn't able to repro.
Questions:
For the two different versions of the files you are seeing downloaded, have you compared the contents of the two files? I think you said two completely different blobs are being retrieved; however, I wanted to verify that. How large is the delta between the two files?
Are you using RA-GRS with the client libraries' read-from-secondary retry condition, meaning a network glitch could result in the read coming from the secondary region?
Suggestions:
Can you track the ETag of the retrieved files? This allows you to check whether the blob has changed since you first started reading it (see the sketch after these suggestions).
The Storage Service does enable you to explicitly validate the integrity of your objects to check whether they have been modified in transit, potentially due to network issues etc. See the Azure Storage MD5 overview for more information. The simplest way, however, might just be to use HTTPS, as these validations are already built into HTTPS.
Can you also try to repro using HTTPS and let me know if that helps?
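A small sketch of the ETag / Content-MD5 tracking idea over HTTPS, reusing the blob URL from the console app above; the method name is just for illustration, and it assumes an HttpClient that is reused across iterations.
// Sketch: download the blob over HTTPS and record the ETag and Content-MD5 headers,
// so unexpected changes between iterations can be detected.
private static async Task CheckBlobOnceAsync(System.Net.Http.HttpClient httpClient)
{
    using var response = await httpClient.GetAsync(
        "https://myboxfree.blob.core.windows.net/public/91fe9d90-71ce-4036-b711-a5300159abfa.rfa");
    response.EnsureSuccessStatusCode();

    string etag = response.Headers.ETag?.Tag;
    byte[] serverMd5 = response.Content.Headers.ContentMD5;   // null if the blob has no stored MD5

    byte[] body = await response.Content.ReadAsByteArrayAsync();
    Console.WriteLine($"ETag: {etag}, bytes: {body.Length}, Content-MD5: " +
        (serverMd5 == null ? "n/a" : Convert.ToBase64String(serverMd5)));
}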
I have created the file in an Amazon S3 bucket. I know that the URL format is:
http://ptedotnet.s3-eu-west-1.amazonaws.com/UserContent/Uploads/54/26.jpg
However, I want to be able to work out what the 's3-eu-west-1' bit is without having to know it explicitly in my application. I have seen that in the API there is a call I can make to get the location...
GetBucketLocationRequest bucketRequest = new GetBucketLocationRequest();
bucketRequest.BucketName = ConfigurationManager.AppSettings["AWS_Bucket"].ToString();
GetBucketLocationResponse ree = client.GetBucketLocation(bucketRequest);
string location = ree.Location;
But this only returns 'eu', so I'm wondering how to get the other parts. The less the user has to configure in the application, the better :-)
You can just use {bucket}.s3.amazonaws.com.
i.e. http://ptedotnet.s3.amazonaws.com/UserContent/Uploads/54/26.jpg
Use S3SignURL to make a signed URL
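For context, in the AWS SDK for .NET a signed URL is typically generated with GetPreSignedURL; a short sketch reusing the bucket setting and key from the question (the expiry is an arbitrary choice):
// Sketch: generate a time-limited, signed URL for the object in the question.
var presignRequest = new GetPreSignedUrlRequest
{
    BucketName = ConfigurationManager.AppSettings["AWS_Bucket"].ToString(),
    Key = "UserContent/Uploads/54/26.jpg",
    Expires = DateTime.UtcNow.AddMinutes(30)   // arbitrary expiry for illustration
};

// The URL is built against the endpoint/region the client is configured with.
string signedUrl = client.GetPreSignedURL(presignRequest);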