Get how much disk is being used in Azure Data Lake - C#

I'm using Azure.Storage.Files.DataLake to programmatically interact with Azure.
It looks like I can get the size of a given file with
var props = await fileClient.GetPropertiesAsync();
var fileSizeInBytes = props.Value.ContentLength;
(where fileClient is an instance of DataLakeFileClient)
The DataLakeDirectoryClient class offers a similar API, but ContentLength is always zero there.
I would like to know how to get the size of:
a directory (with DataLakeDirectoryClient)
a file system (with DataLakeFileSystemClient)
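As far as I can tell, neither client exposes an aggregate size directly, so one approach is to enumerate the paths and sum the file sizes yourself. A minimal sketch, assuming Azure.Storage.Files.DataLake and a DataLakeFileSystemClient: pass a directory's path (e.g. directoryClient.Path) to measure a directory, or null to measure the whole file system (the helper name is just for illustration).

using Azure.Storage.Files.DataLake;
using Azure.Storage.Files.DataLake.Models;

static async Task<long> GetSizeInBytesAsync(DataLakeFileSystemClient fileSystemClient, string directoryPath = null)
{
    long totalBytes = 0;

    // Walk the tree recursively; directories report no ContentLength, so only files are summed.
    await foreach (PathItem pathItem in fileSystemClient.GetPathsAsync(path: directoryPath, recursive: true))
    {
        if (pathItem.IsDirectory != true)
        {
            totalBytes += pathItem.ContentLength ?? 0;
        }
    }

    return totalBytes;
}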

Related

Getting big data through SignalR - Blazor

I have a component library that uses JS code to generate an image as a base64 string, and the image needs to be transferred to C#. The image size is larger than MaximumReceiveMessageSize.
Can I get the value of the MaximumReceiveMessageSize property in C#? I need a way to correctly split the picture into chunks, or some other way to transfer it.
My component can be used in a Wasm or Server application. I can't change the value of the MaximumReceiveMessageSize property.
Thanks
Using a stream, as described in Stream from JavaScript to .NET in the Microsoft docs, solved my problem.
From Microsoft docs:
In JavaScript:
function streamToDotNet() {
    return new Uint8Array(10000000);
}
In C# code:
var dataReference = await JS.InvokeAsync<IJSStreamReference>("streamToDotNet");
using var dataReferenceStream = await dataReference.OpenReadStreamAsync(maxAllowedSize: 10_000_000);
var outputPath = Path.Combine(Path.GetTempPath(), "file.txt");
using var outputFileStream = File.OpenWrite(outputPath);
await dataReferenceStream.CopyToAsync(outputFileStream);
In the preceding example: JS is an injected IJSRuntime instance. The dataReferenceStream is written to disk (file.txt) at the current user's temporary folder path (GetTempPath).
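Adapting that to the image scenario, the bytes can be buffered in memory instead of written to disk. A rough sketch, where getImageBytes is a hypothetical JS function that returns the generated image as a Uint8Array:

// Hypothetical JS function that returns the generated image bytes as a Uint8Array.
var imageReference = await JS.InvokeAsync<IJSStreamReference>("getImageBytes");

// Open the stream on the .NET side; size the cap to the largest image expected.
using var imageStream = await imageReference.OpenReadStreamAsync(maxAllowedSize: 50_000_000);

// Buffer into memory and take the raw bytes for further processing in C#.
using var memoryStream = new MemoryStream();
await imageStream.CopyToAsync(memoryStream);
byte[] imageBytes = memoryStream.ToArray();

Because the data is streamed in chunks over the interop channel, it isn't limited by MaximumReceiveMessageSize, which is what made this approach solve the problem above.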

Download JSON Blobs and convert to objects from Azure Blob Storage

I've been trying to figure out the fastest way to connect to an Azure Storage account, cycle through a number of containers, and convert the blobs inside those containers into objects. All elements in the containers are JSON and map to different object types.
The structure as seen on Azure Storage Explorer would be:
Azure_Subscription
|--Storage_Accounts
   |--My_Storage_Account
      |--blob1
      |--blob2
      |--blob3
      etc.
Now, based on what I've read here in the official documentation, to access and download each blob and deserialize it from JSON, I would need to do all of the below (assuming I don't have a list of blob URIs).
string testConnectionString = "DefaultEndpointsProtocol=https;AccountName=;AccountKey=;EndpointSuffix=core.windows.net";

// the service clients allow working at the Azure Storage level with Tables and Blobs
TableServiceClient tableServiceClient = new TableServiceClient(testConnectionString);
BlobServiceClient blobServiceClient = new BlobServiceClient(testConnectionString);

List<blob1> blob1List = new List<blob1>();

// this gives me a list of blob containers and I can programmatically retrieve
// the name of each individual container.
Pageable<BlobContainerItem> blobContainers = blobServiceClient.GetBlobContainers();

// each BlobContainerItem represents an individual blob container (bill, building...)
foreach (BlobContainerItem blobContainerItem in blobContainers)
{
    // create a BlobContainerClient to make calls to each individual container
    BlobContainerClient clientForIndividualContainer =
        blobServiceClient.GetBlobContainerClient(blobContainerItem.Name);

    if (blobContainerItem.Name.Equals("blob1"))
    {
        Pageable<BlobItem> blobItemList = clientForIndividualContainer.GetBlobs();
        foreach (BlobItem bi in blobItemList)
        {
            BlobClient blobClient = clientForIndividualContainer.GetBlobClient(bi.Name);
            var blobContent = blobClient.Download();
            using StreamReader reader = new StreamReader(blobContent.Value.Content);
            string text = reader.ReadToEnd();
            blob1List.Add(JsonSerializer.Deserialize<blob1>(text));
        }
    }
}
The project targets .NET 5.0 and I will need to do something similar with Azure Tables as well. The goal is to go through all blobs inside a number of containers (all of them JSON, really) and compare them to all the blobs inside another storage account. I'm also open to any ideas on doing this differently altogether, but the purpose of this is to compare input into Azure Storage blobs and make sure that a new process uploads the same object structures. So for all blob1 items in the Azure Storage account, I compare these to a list of all the oldBlob1 items in another storage account and check whether they are all equal.
I hope the question makes sense... At this point the above code works, and I can move the functionality inside the if-else into a method and use a switch instead of the if-else statement, but my main question is about reaching this point at all. Without a massive list of blob URIs, do I need to create a BlobServiceClient to get a list of BlobContainerItems, then cycle through the containers creating a BlobContainerClient for each of them, and then create a BlobClient for every single blob in the storage account, to finally be able to get to the Content of each blob?
This seems like a lot of work to just get access to an individual file.
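For what it's worth, that chain of clients is how the v12 SDK is designed to be used, but it can be trimmed slightly when the container names are known up front. A rough sketch under that assumption, using async variants of the same calls; DownloadContentAsync and BinaryData.ToObjectFromJson replace the manual StreamReader, and blob1 is the question's type:

using Azure.Storage.Blobs;
using Azure.Storage.Blobs.Models;

// When the container name is known, the BlobContainerClient can be built
// directly from the connection string, skipping the BlobServiceClient hop.
var containerClient = new BlobContainerClient(testConnectionString, "blob1");

var blob1List = new List<blob1>();

// Pages through the listing lazily; still one list call per ~5000 blobs.
await foreach (BlobItem blobItem in containerClient.GetBlobsAsync())
{
    BlobClient blobClient = containerClient.GetBlobClient(blobItem.Name);

    // DownloadContentAsync buffers the blob and exposes it as BinaryData,
    // which can be deserialized straight into the target type.
    BlobDownloadResult download = await blobClient.DownloadContentAsync();
    blob1List.Add(download.Content.ToObjectFromJson<blob1>());
}

There is still one download call per blob; Blob Storage has no bulk "download a whole container" operation, so a per-blob BlobClient (or an equivalent REST call) is unavoidable.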

How to set ContentMD5 in DataLakeFileClient?

When uploading to Azure Data Lake with Microsoft Azure Storage Explorer, a value for the ContentMD5 property is automatically generated and stored on the file. The same happens automatically in a function app that uses a Blob binding.
However, this value is not generated automatically when uploading from a C# DLL.
I want to use this value to compare files in the future.
My code for the upload is very simple.
DataLakeFileClient fileClient = await directoryClient.CreateFileAsync("testfile.txt");
await fileClient.UploadAsync(fileStream);
I also know I can generate an MD5 hash using the code below, but I'm not certain whether this is the same way that Azure Storage Explorer does it.
using (var md5gen = MD5.Create())
{
md5hash = md5gen.ComputeHash(fileStream);
}
But I have no idea how to set this value as the ContentMD5 property of the file.
I have found the solution.
The UploadAsync method has an overload that accepts a parameter of type DataLakeFileUploadOptions. This class contains an HttpHeaders object, which in turn has a ContentHash property that is stored as a property of the file.
var uploadOptions = new DataLakeFileUploadOptions();
uploadOptions.HttpHeaders = new PathHttpHeaders();
uploadOptions.HttpHeaders.ContentHash = md5hash;
await fileClient.UploadAsync(fileStream, uploadOptions);
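Putting the pieces together: note that MD5.ComputeHash reads the stream to the end, so the stream position has to be rewound before the upload, otherwise nothing gets uploaded. A sketch of the combined flow, assuming fileStream is seekable:

using System.Security.Cryptography;
using Azure.Storage.Files.DataLake;
using Azure.Storage.Files.DataLake.Models;

byte[] md5hash;
using (var md5gen = MD5.Create())
{
    md5hash = md5gen.ComputeHash(fileStream);
}

// ComputeHash consumed the stream, so rewind it before uploading.
fileStream.Position = 0;

var uploadOptions = new DataLakeFileUploadOptions
{
    HttpHeaders = new PathHttpHeaders { ContentHash = md5hash }
};

DataLakeFileClient fileClient = await directoryClient.CreateFileAsync("testfile.txt");
await fileClient.UploadAsync(fileStream, uploadOptions);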

ContentHash not calculated in Azure Blob Storage v12

Continuing the saga, here is part I: ContentHash is null in Azure.Storage.Blobs v12.x.x
After a lot of debugging, the root cause appears to be that the content hash is not calculated when uploading a blob, so BlobContentInfo or BlobProperties return a null content hash, and my whole flow is based on receiving the hash from Azure.
What I've discovered is that it depends on which HttpRequest stream I use for the upload to Azure:
With HttpRequest.GetBufferlessInputStream(), the content hash is not calculated; even in Azure Storage Explorer the ContentMD5 of the blob is empty.
With HttpRequest.InputStream, everything works as expected.
Do you know why the behavior differs? And do you know how to get a content hash for streams obtained via the GetBufferlessInputStream method?
So the code flow looks like this:
var stream = HttpContext.Current.Request.GetBufferlessInputStream(disableMaxRequestLength: true);
var container = _blobServiceClient.GetBlobContainerClient(containerName);
var blob = container.GetBlockBlobClient(blobPath);

BlobHttpHeaders blobHttpHeaders = null;
if (!string.IsNullOrWhiteSpace(fileContentType))
{
    blobHttpHeaders = new BlobHttpHeaders()
    {
        ContentType = fileContentType,
    };
}

// retry is already configured on the Azure Storage API
await blob.UploadAsync(stream, httpHeaders: blobHttpHeaders);
return await blob.GetPropertiesAsync();
In the code snippet above, ContentHash is NOT calculated, but if I change the way I get the stream from the HTTP request to the following snippet, ContentHash is calculated.
var stream = HttpContext.Current.Request.InputStream;
P.S. I think it's obvious, but with the old SDK the content hash was calculated for streams obtained via the GetBufferlessInputStream method.
P.S.2 You can also find an open issue on GitHub: https://github.com/Azure/azure-sdk-for-net/issues/14037
P.S.3 Added code snippet.
Ran into this today. From my digging, it appears this is a symptom of the type of Stream you use to upload, and it's not really a bug. In order to generate a hash for your blob (which, by the looks of it, is done on the client side before uploading), the SDK needs to read the stream, which means it then has to reset the position of your stream back to 0 (for the actual upload) after generating the hash. Doing this requires the ability to perform the Seek operation on the stream. If your stream doesn't support Seek, then it looks like it doesn't generate the hash.
To get around the issue, make sure the stream you provide supports Seek (CanSeek). If it doesn't, then use a different Stream/copy your data to a stream that does (for example MemoryStream). The alternative would be for the internals of the Blob SDK to do this for you.
A workaround: after getting the stream via the GetBufferlessInputStream() method, copy it to a MemoryStream, then upload the MemoryStream. The content hash can then be generated. Sample code below:
var stream111 = System.Web.HttpContext.Current.Request.GetBufferlessInputStream(disableMaxRequestLength: true);

// copy to a MemoryStream, which supports Seek
MemoryStream stream = new MemoryStream();
stream111.CopyTo(stream);
stream.Position = 0;

// other code

// retry is already configured on the Azure Storage API
await blob.UploadAsync(stream, httpHeaders: blobHttpHeaders);
Not sure why, but as per my debugging, I can see that when using the GetBufferlessInputStream() method with the latest SDK, the upload actually calls the Put Block API in the backend, and in that API the MD5 hash is not stored with the blob (see the Put Block documentation for details).
However, when using InputStream, it calls the Put Blob API.

Get Storage used by bucket in amazon s3 using C# SDK

I have the task of providing an API endpoint to find out how much space a particular module is using in our Amazon S3 bucket. I'm using the C# SDK.
I have accomplished this by adapting code from the documentation here: https://docs.aws.amazon.com/AmazonS3/latest/dev/ListingObjectKeysUsingNetSDK.html
private long GetUsedBytes(string module, string customer)
{
    ListObjectsRequest listRequest = new ListObjectsRequest()
    {
        BucketName = BucketName,
        Prefix = module + "/" + customer
    };

    ListObjectsResponse listResponse;
    long totalSize = 0;
    do
    {
        listResponse = s3Client.ListObjects(listRequest);
        foreach (S3Object obj in listResponse.S3Objects)
        {
            totalSize += obj.Size;
        }
        listRequest.Marker = listResponse.NextMarker;
    } while (listResponse.IsTruncated);

    return totalSize;
}
My question is: is there a way to do this with the SDK without pulling all of the actual S3 objects down from the bucket? There are several good answers about doing it with the CLI:
AWS S3: how do I see how much disk space is using
https://serverfault.com/questions/84815/how-can-i-get-the-size-of-an-amazon-s3-bucket
But I've yet to be able to find one using the SDK directly. Do I have to mimic the CLI somehow to accomplish this with the SDK? Another method I considered is to get all the keys and query their metadata, but the only way to get all the keys I've found is to grab all the objects as in the link above ^. If there's a way to get all the metadata for objects with a particular prefix, that would be ideal.
Thanks for your time!
~Josh
Your code is not downloading any objects from Amazon S3. It is merely calling ListObjects() and totalling the size of each object. It will make one API call per 1000 objects.
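If you keep the listing approach, it can at least be made asynchronous with the newer ListObjectsV2 call. A rough sketch under the same assumptions as the question's code (an IAmazonS3 client and a module/customer prefix; still one API call per 1000 keys):

using System.Linq;
using System.Threading.Tasks;
using Amazon.S3;
using Amazon.S3.Model;

private static async Task<long> GetUsedBytesAsync(IAmazonS3 s3Client, string bucketName, string prefix)
{
    long totalSize = 0;
    var request = new ListObjectsV2Request
    {
        BucketName = bucketName,
        Prefix = prefix
    };

    ListObjectsV2Response response;
    do
    {
        response = await s3Client.ListObjectsV2Async(request);
        totalSize += response.S3Objects.Sum(o => o.Size);

        // continue from where the previous page stopped
        request.ContinuationToken = response.NextContinuationToken;
    } while (response.IsTruncated);

    return totalSize;
}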
Alternatively, you can retrieve the size of each bucket from Amazon CloudWatch.
From Monitoring Metrics with Amazon CloudWatch - Amazon S3:
Metric: BucketSizeBytes
The amount of data in bytes stored in a bucket. This value is calculated by summing the size of all objects in the bucket (both current and noncurrent objects), including the size of all parts for all incomplete multipart uploads to the bucket.
So, simply retrieve the metric from Amazon CloudWatch rather than calculating it yourself.
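A rough sketch of pulling that metric with the AWS SDK for .NET (AWSSDK.CloudWatch assumed; S3 publishes BucketSizeBytes roughly once a day per storage class, so this asks for the StandardStorage value over the last two days and takes the latest datapoint):

using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;
using Amazon.CloudWatch;
using Amazon.CloudWatch.Model;

private static async Task<double> GetBucketSizeBytesAsync(IAmazonCloudWatch cloudWatch, string bucketName)
{
    var request = new GetMetricStatisticsRequest
    {
        Namespace = "AWS/S3",
        MetricName = "BucketSizeBytes",
        Dimensions = new List<Dimension>
        {
            new Dimension { Name = "BucketName", Value = bucketName },
            new Dimension { Name = "StorageType", Value = "StandardStorage" }
        },
        StartTimeUtc = DateTime.UtcNow.AddDays(-2),
        EndTimeUtc = DateTime.UtcNow,
        Period = 86400, // one day, matching how often S3 publishes the metric
        Statistics = new List<string> { "Average" }
    };

    GetMetricStatisticsResponse response = await cloudWatch.GetMetricStatisticsAsync(request);

    // Take the most recent datapoint; an empty list means no metric has been published yet.
    Datapoint latest = response.Datapoints.OrderByDescending(d => d.Timestamp).FirstOrDefault();
    return latest?.Average ?? 0;
}

Note that CloudWatch reports whole buckets only, so for the per-module/per-customer prefix totals in the question, the listing approach is still the way to go.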
