Download JSON Blobs and convert to objects from Azure Blob Storage - c#

I've been trying to figure out the fastest way to connect to an Azure Storage account, cycle through a number of containers, and convert the blobs inside them into objects. Every blob is JSON, and each container maps to a different object type.
The structure as seen on Azure Storage Explorer would be:
Azure_Subscription
|--Storage_Accounts
   |--My_Storage_Account
      |--blob1
      |--blob2
      |--blob3
      etc.
Now, based on what I've read here in the official documentation, to access and download each blob and convert it so that it can be handled as JSON and deserialized, I would need to do all of the below (assuming I don't have a list of blob URIs).
string testConnectionString = "DefaultEndpointsProtocol=https;AccountName=;AccountKey=;EndpointSuffix=core.windows.net";
// the service clients allow working at the Azure Storage level with Tables and Blobs
TableServiceClient tableServiceClient = new TableServiceClient(testConnectionString);
BlobServiceClient blobServiceClient = new BlobServiceClient(testConnectionString);
List<blob1> blob1List = new List<blob1>();
// this gives me a list of blob containers and I can programmatically retrieve
// the name of each individual container.
Pageable<BlobContainerItem> blobContainers = blobServiceClient.GetBlobContainers();
// each BlobContainerItem represents an individual blob container (bill, building...)
foreach (BlobContainerItem blobContainerItem in blobContainers)
{
    // create a ContainerClient to make calls to each individual container
    BlobContainerClient clientForIndividualContainer =
        blobServiceClient.GetBlobContainerClient(blobContainerItem.Name);
    if (blobContainerItem.Name.Equals("blob1"))
    {
        Pageable<BlobItem> blobItemList = clientForIndividualContainer.GetBlobs();
        foreach (BlobItem bi in blobItemList)
        {
            BlobClient blobClient = clientForIndividualContainer.GetBlobClient(bi.Name);
            var blobContent = blobClient.Download();
            StreamReader reader = new StreamReader(blobContent.Value.Content);
            string text = reader.ReadToEnd();
            blob1List.Add(JsonSerializer.Deserialize<blob1>(text));
        }
    }
}
The project targets .NET 5.0, and I will need to do something similar with Azure Tables as well. The goal is to go through all blobs inside a number of containers (all of them JSON, really) and compare them to all the blobs inside another storage account. I'm also open to ideas on doing this differently altogether, but the purpose is to compare input into Azure Storage blobs and make sure that a new process uploads the same object structures. So for all blob1 items in the Azure Storage account, I compare them to a list of all the oldBlob1 items in another storage account and check whether they are all equal.
I hope the question makes sense... At this point the above code works, and I could move the functionality inside the if-else into a method and use a switch instead of the if-else statement, but my main question is about reaching this point at all. Without a massive list of blob URIs, do I really need to create a BlobServiceClient to list the BlobContainerItems, cycle through each container to create a BlobContainerClient for it, and then create a BlobClient for every single blob in the storage account, just to finally get to the Content of each blob?
This seems like a lot of work just to get access to an individual file.
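For what it's worth, here is the same flow written a little more compactly. This is only a sketch, assuming a recent Azure.Storage.Blobs version where DownloadContent() and BinaryData.ToObjectFromJson<T>() are available; it still creates one BlobClient per blob, but it drops the manual StreamReader step:
BlobServiceClient blobServiceClient = new BlobServiceClient(testConnectionString);
List<blob1> blob1List = new List<blob1>();

foreach (BlobContainerItem containerItem in blobServiceClient.GetBlobContainers())
{
    if (!containerItem.Name.Equals("blob1"))
        continue;

    BlobContainerClient containerClient = blobServiceClient.GetBlobContainerClient(containerItem.Name);
    foreach (BlobItem blobItem in containerClient.GetBlobs())
    {
        // DownloadContent buffers the whole blob into a BinaryData in one call.
        BinaryData content = containerClient.GetBlobClient(blobItem.Name).DownloadContent().Value.Content;
        blob1List.Add(content.ToObjectFromJson<blob1>());
    }
}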

Related

How to set ContentMD5 in DataLakeFileClient?

When uploading to an Azure Data Lake using Microsoft Azure Storage Explorer, a value for the ContentMD5 property is automatically generated and stored with the file. The same happens in a function app that uses a blob binding.
However, the property is not generated automatically when uploading from a C# DLL.
I want to use this value to compare files in the future.
My code for the upload is very simple.
DataLakeFileClient fileClient = await directoryClient.CreateFileAsync("testfile.txt");
await fileClient.UploadAsync(fileStream);
I also know I can generate an MD5 using the below code, but I'm not certain if this is the same way that Azure Storage Explorer does it.
using (var md5gen = MD5.Create())
{
    md5hash = md5gen.ComputeHash(fileStream);
}
But I have no idea how to set this value on the ContentMD5 property of the file.
I have found the solution.
The UploadAsync method has an overload that accepts a parameter of type DataLakeFileUploadOptions. This class contains an HttpHeaders object, which in turn has a ContentHash property that stores the hash as a property of the document.
var uploadOptions = new DataLakeFileUploadOptions();
uploadOptions.HttpHeaders = new PathHttpHeaders();
uploadOptions.HttpHeaders.ContentHash = md5hash;
await fileClient.UploadAsync(fileStream, uploadOptions);
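If you later want to compare files against the stored hash, it can be read back from the file's properties. A small sketch, assuming the same fileClient and md5hash as above:
// GetPropertiesAsync returns PathProperties, whose ContentHash holds the stored MD5.
PathProperties properties = await fileClient.GetPropertiesAsync();
bool matches = properties.ContentHash.SequenceEqual(md5hash); // requires System.Linq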

restore an azure blob snapshot using c# library

I am writing a CLI utility that does a lot of different things, but what I'm struggling with right now is this: I have a known blob, and I want to restore a snapshot that was taken of that blob.
await foreach (var snapshot in containerClient.GetBlobsAsync(
    BlobTraits.All,
    BlobStates.Snapshots,
    blobPath))
{
    _logger.LogInformation($"found blob {snapshot.Name} - {snapshot.Snapshot}");
    if (DecideIfRightSnapshot(snapshot)) {
        BlobClient snapshotBlob = containerClient.GetBlobClient(snapshot.Name);
        _logger.LogInformation($"found snapshot {snapshotBlob.Uri}");
        await sourceBlob.StartCopyFromUriAsync(snapshotBlob.Uri);
    }
    break;
}
First, the filter isn't working right because the last blob in the list is always the base blob. But I can work around that one.
The real issue I'm struggling with is the proper way to restore a blob from a snapshot using the libraries. I'm really concerned because the .Uri property always returns the base file's URI, even if it's a snapshot. I was led to believe the URI would be something like this:
https://me.blob.core.windows.net/myapp/doc?snapshot=2020-12-16T17:07:44.1076450Z
but that's not the URI that's getting logged. Am I supposed to construct the full URI myself?
In all the searches they refer to this as "promoting" a snapshot. But I can't find a "promote" method in the API.
Am I doing this right?
If you're using the new version of the blob storage SDK (Azure.Storage.Blobs), then you need to construct the full URI yourself. Sample code is like below:
//other code
await foreach (var snapshot in containerClient.GetBlobsAsync(
    BlobTraits.All,
    BlobStates.Snapshots,
    blobPath))
{
    _logger.LogInformation($"found blob {snapshot.Name} - {snapshot.Snapshot}");
    if (DecideIfRightSnapshot(snapshot)) {
        BlobClient snapshotBlob = containerClient.GetBlobClient(snapshot.Name);
        // construct the snapshot url
        var snapshot_uri = snapshotBlob.Uri.ToString() + "?snapshot=" + snapshot.Snapshot;
        _logger.LogInformation($"found snapshot {snapshot_uri}");
        await sourceBlob.StartCopyFromUriAsync(new Uri(snapshot_uri));
    }
    break;
}
As for promoting: it means restoring the snapshot via the Azure portal. It's a UI operation, and under the hood it calls the Put Blob From URL API. Currently there is no such method in the SDK.
But if you're using an older package like WindowsAzure.Storage, it has many methods for working with snapshots; see this article. Note: using the old packages is not recommended.
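As a hedged alternative, if the version of Azure.Storage.Blobs in use exposes WithSnapshot, the snapshot-scoped client (and its ?snapshot=... URI) can be built without string concatenation. A minimal sketch using the same variables as above:
// WithSnapshot returns a new BlobClient whose Uri carries the snapshot query parameter.
BlobClient snapshotBlob = containerClient.GetBlobClient(snapshot.Name)
                                         .WithSnapshot(snapshot.Snapshot);
await sourceBlob.StartCopyFromUriAsync(snapshotBlob.Uri);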

Is it possible to change files’ encoding azure blob storage using C#

I use PolyBase to export a file to a storage account, but the encoding is UTF-8. I need to change it to SJIS. Is there any easy way to change it to SJIS using C#? Is it possible to do it by using Blob Storage's REST API?
For the REST API, you can use Set Blob Properties and set x-ms-blob-content-encoding in the request headers.
For code, if you're using the Azure Blob Storage SDK, you can refer to this article. You'll need to modify the code, since that sample gets the properties of a container. You can use sample code for setting a blob property as below:
CloudBlobContainer blobContainer = blobClient.GetContainerReference("xxx");
CloudBlockBlob myblob = blobContainer.GetBlockBlobReference("xxx");
myblob.Properties.ContentEncoding = "SJIS";
myblob.SetProperties();
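If you're on the newer Azure.Storage.Blobs package instead of WindowsAzure.Storage, the equivalent call is SetHttpHeaders. A minimal sketch, with placeholder container/blob names; note that SetHttpHeaders replaces all HTTP headers on the blob, so the existing values are usually read first, and that this only changes the Content-Encoding header rather than re-encoding the stored bytes:
BlobClient blobClient = new BlobServiceClient(connectionString)
    .GetBlobContainerClient("xxx")
    .GetBlobClient("xxx");

// Read current properties so the other headers can be preserved.
BlobProperties current = blobClient.GetProperties().Value;
blobClient.SetHttpHeaders(new BlobHttpHeaders
{
    ContentEncoding = "SJIS",
    ContentType = current.ContentType
});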

Get Storage used by bucket in amazon s3 using C# SDK

I have the task of providing an API endpoint to find out how much space a particular module is using in our Amazon S3 bucket. I'm using the C# SDK.
I have accomplished this by adapting code from the documentation here: https://docs.aws.amazon.com/AmazonS3/latest/dev/ListingObjectKeysUsingNetSDK.html
private long GetUsedBytes(string module, string customer)
{
    ListObjectsRequest listRequest = new ListObjectsRequest()
    {
        BucketName = BucketName,
        Prefix = module + "/" + customer
    };
    ListObjectsResponse listResponse;
    long totalSize = 0;
    do
    {
        listResponse = s3Client.ListObjects(listRequest);
        foreach (S3Object obj in listResponse.S3Objects)
        {
            totalSize += obj.Size;
        }
        listRequest.Marker = listResponse.NextMarker;
    } while (listResponse.IsTruncated);
    return totalSize;
}
My question is: is there a way to do this with the SDK without pulling all of the actual S3 objects down off the bucket? There are several good answers about doing it with the CLI:
AWS S3: how do I see how much disk space is using
https://serverfault.com/questions/84815/how-can-i-get-the-size-of-an-amazon-s3-bucket
But I've yet to be able to find one using the SDK directly. Do I have to mimic the SDK somehow to accomplish this? Another method I considered is to get all the keys and query for their metadata, but the only way to get all the keys I've found is to grab all the objects as in the link above ^. If there's a way to get all the metadata for objects with a particular prefix, that would be ideal.
Thanks for your time!
~Josh
Your code is not downloading any objects from Amazon S3. It is merely calling ListObjects() and totalling the size of each object. It will make one API call per 1000 objects.
Alternatively, you can retrieve the size of each bucket from Amazon CloudWatch.
From Monitoring Metrics with Amazon CloudWatch - Amazon S3:
Metric: BucketSizeBytes
The amount of data in bytes stored in a bucket. This value is calculated by summing the size of all objects in the bucket (both current and noncurrent objects), including the size of all parts for all incomplete multipart uploads to the bucket.
So, simply retrieve the metric from Amazon CloudWatch rather than calculating it yourself.
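A hedged sketch of reading that metric with the AWSSDK.CloudWatch package. The region and the "StandardStorage" dimension are assumptions, BucketName reuses the field from the question's code, and since BucketSizeBytes is only published about once a day the query window covers the last two days:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;
using Amazon.CloudWatch;
using Amazon.CloudWatch.Model;

private async Task<double> GetBucketSizeBytesAsync()
{
    var cloudWatch = new AmazonCloudWatchClient(Amazon.RegionEndpoint.USEast1); // placeholder region

    var response = await cloudWatch.GetMetricStatisticsAsync(new GetMetricStatisticsRequest
    {
        Namespace = "AWS/S3",
        MetricName = "BucketSizeBytes",
        Dimensions = new List<Dimension>
        {
            new Dimension { Name = "BucketName", Value = BucketName },
            new Dimension { Name = "StorageType", Value = "StandardStorage" }
        },
        StartTimeUtc = DateTime.UtcNow.AddDays(-2),
        EndTimeUtc = DateTime.UtcNow,
        Period = 86400,                             // one data point per day
        Statistics = new List<string> { "Average" }
    });

    // Use the most recent data point reported by CloudWatch.
    return response.Datapoints.OrderByDescending(d => d.Timestamp).First().Average;
}
Note this returns the size of the whole bucket, which is how the metric is reported, rather than the size under a specific prefix.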

Store a list<string> to SQL database table?

I want to look up a storage account in Microsoft Azure, list the blobs in a particular container (in the storage account), then store the listed blob names in a database table. Can anyone please suggest C# code to store the list of blob names in a database table?
namespace ListStorageAccntFiles
{
    class Program
    {
        static void Main(string[] args)
        {
            Console.Clear();

            // Code to list the blob names in the console
            CloudStorageAccount StorageAccount = CloudStorageAccount.Parse(CloudConfigurationManager.GetSetting("StorageConnectionString"));
            var BlobClient = StorageAccount.CreateCloudBlobClient();
            var Container = BlobClient.GetContainerReference("samples-workitems");
            var list = Container.ListBlobs();
            List<string> blobNames = list.OfType<CloudBlockBlob>().Select(b => b.Name).ToList();
            blobNames.ForEach(Console.WriteLine);

            // Code to store blob names under the column header "name" in a database table
        }
    }
}
To add to the scenario: I want a program that can look at the storage account, process all new and modified files, and take an action if there are any new ones.
In my opinion, I recommend using the BlobTrigger of the WebJobs SDK to achieve your purpose. Here is my code sample so you can get a better understanding of it:
WebJob Class
public class Functions
{
    // This function will get triggered/executed when a blob is created or updated
    // in an Azure blob container called trigger-blob-input.
    public static void TriggerAzureBlob([BlobTrigger("trigger-blob-input/{name}")] ICloudBlob blob)
    {
        Console.WriteLine("Blob name:" + blob.Name);
        // do something with the created/updated blob
    }
}
When blobs are created or updated in the trigger-blob-input container, the BlobTrigger detects them and the blob names show up in the output.
For more details, you could refer to the following tutorials:
how to create a WebJob and deploy it to Azure
how to trigger a function when a blob is created or updated
Additionally, the WebJobs SDK scans log files to detect new or changed blobs. This process is not real-time, so a function might not be triggered until some time after the blob is created or updated.
If this limitation of the blob trigger doesn't fit your application, you could create a queue message when you create/update the blob, then use a QueueTrigger instead of a BlobTrigger on the function that processes your blob, as in the sketch below.
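A minimal sketch of that queue-based variant; the queue name blob-updates is a placeholder, and whatever process uploads the blob would also enqueue a message carrying the blob name:
public class Functions
{
    // Fires whenever a message appears on the blob-updates queue.
    public static void ProcessBlobUpdate([QueueTrigger("blob-updates")] string blobName)
    {
        Console.WriteLine("Blob to process: " + blobName);
        // look up the blob here and store its name in the database
    }
}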
I think you can store the values as a comma-separated string. When retrieving the values back in C#, you can use Split(',') to get the values as an array. [In Java, the string split method returns an array; C#'s Split does the same.]
If your values might contain commas, use some other character as the delimiter.
Going forward with your code, add this line to convert the list into a delimited string:
string namesAsString = blobNames.Aggregate((i, j) => i + "UNIQUE_DELIMITER_CHARACTER_OF_YOUR_CHOICE" + j);
For instance, you can use ':' as your delimiter
Now you can easily store this string in your table. To update this data when the parent data changes, you can update it by index (the index of the object in the original collection).
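A minimal sketch of that approach, assuming a hypothetical BlobNames table with a single name column and a connectionString variable; string.Join is used here as an alternative to the Aggregate call above:
// Join the names with a delimiter of your choice.
string namesAsString = string.Join("|", blobNames);

// Store the delimited string in one row of the table.
using (var connection = new SqlConnection(connectionString))
using (var command = new SqlCommand("INSERT INTO BlobNames (name) VALUES (@name)", connection))
{
    command.Parameters.AddWithValue("@name", namesAsString);
    connection.Open();
    command.ExecuteNonQuery();
}

// Later, split it back into individual names.
string[] names = namesAsString.Split('|');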
