I am trying to write a custom .NET activity, which will be run from Azure Data Factory. It will do two tasks, one after the other:
it will download grib2 files from an FTP server daily (grib2 is a custom compression for meteorological data)
it will decompress each file as it is downloaded.
So far I have set up Azure Batch with a pool of two nodes (Windows Server machines), which run the FTP downloads. The nodes download the grib2 files to a blob storage container.
The code for the custom app so far looks like this:
using System;
using System.Collections.Generic;
using System.Linq;
using Microsoft.Azure;
using Microsoft.Azure.Management.DataFactories.Models;
using Microsoft.Azure.Management.DataFactories.Runtime;

namespace ClassLibrary1
{
    public class Class1 : IDotNetActivity
    {
        public IDictionary<string, string> Execute(
            IEnumerable<LinkedService> linkedServices,
            IEnumerable<Dataset> datasets,
            Activity activity,
            IActivityLogger logger)
        {
            logger.Write("Start");

            // Get extended properties
            DotNetActivity dotNetActivityPipeline = (DotNetActivity)activity.TypeProperties;
            string sliceStartString = dotNetActivityPipeline.ExtendedProperties["SliceStart"];

            // Get the input and output datasets
            Dataset inputDataset = datasets.Single(dataset => dataset.Name == activity.Inputs.Single().Name);
            Dataset outputDataset = datasets.Single(dataset => dataset.Name == activity.Outputs.Single().Name);

            /*
                DO FTP download here
            */

            logger.Write("End");
            return new Dictionary<string, string>();
        }
    }
}
So far my code works and I have the files downloaded to my blob storage account.
Now that I have the files downloaded, I would like to have the nodes of the Batch pool decompress the files and put the decompressed files in my blob storage for further processing.
For this, wgrib2.exe is used, which comes with some dll files. I have already zipped the executable together with all the dll files it needs and uploaded the archive as an application package for my pool. If I am correct, when each node joins the pool, this executable will be extracted and available for calls.
My question is: how do I write the custom .NET activity so that the files are downloaded by the nodes of my pool and, after each file is downloaded, a decompression command is run on it to convert it to a csv file? The command line for this would look like:
wgrib2.exe downloadedfileName.grb2 -csv downloadedfileName.csv
How do I get hold of the name of each downloaded file, how do I process it on the node, and how do I save the result back to the blob storage?
Also, how can I control how many files are downloaded at the same time and how many are decompressed at the same time?
This should have been simple but turned out to require a bit of GoogleFu.
I have an Azure Synapse Spark Notebook written in C# that
Receives a list of Deflate compressed IIS files.
Reads the files as binary into a DataFrame
Decompresses these files one at a time and writes them into Parquet format.
Now after all of them have been successfully processed I need to delete the compressed files.
This is my proof of concept but it works perfectly.
Create a linked service pointing to the storage account that contains the files you want to delete (see Configure access to Azure Blob Storage).
Then use the code sample below:
#r "nuget:Azure.Storage.Files.DataLake,12.0.0-preview.9"
using Microsoft.Spark.Extensions.Azure.Synapse.Analytics.Utils;
using Microsoft.Spark.Extensions.Azure.Synapse.Analytics.Notebook.MSSparkUtils;
using Azure.Storage.Files.DataLake;
using Azure.Storage.Files.DataLake.Models;
string blob_sas_token = Credentials.GetConnectionStringOrCreds("your linked service name here");
Uri uri = new Uri($"https://'your storage account name here'.blob.core.windows.net/'your container name here'{blob_sas_token}") ;
DataLakeServiceClient _serviceClient = new DataLakeServiceClient(uri);
DataLakeFileSystemClient fileClient = _serviceClient.GetFileSystemClient("'path to directory containing the file here'") ;
fileClient.DeleteFile("'file name here'") ;
The call to Credentials.GetConnectionStringOrCreds returns a signed SAS token that is ready for your code to attach to a storage resource uri.
You could of course use the DeleteFileAsync method if you so desire.
Hope this saves someone else a few hours of GoogleFu.
I am uploading files to cloud storage using the .net client.
At the moment I am uploading the files one by one, like this:
StorageClient client = StorageClient.Create();
foreach (var file in files)
{
    client.UploadObject(bucketName, uploadLocation, contentType, file);
}
But I couldn't find any way to bulk upload files. Is there any way to upload files in bulk?
You are effectively bulk uploading; you're just uploading each file serially 😃
If you're looking for a method that you give files to and it handles the entire upload, I'm unsure that this exists.
You can run the uploads in parallel using threads or equivalent.
You'll want to ensure that you can resume failed uploads (uploading multiple files increases the likelihood of failure), see Create Object Uploader
You'll need to self-manage resuming on failures.
It's possible that there are libraries that implement this abstraction.
I understand that you are searching for a way to upload a large number of files at once in Google Cloud Storage. However, there is no direct method to handle the entire upload. If you have a large number of files to upload you can perform a parallel multi-threaded/multi-processing copy. The following steps are to be followed:
1. Instantiate a StorageClient object.
2. Specify the parallelization options.
3. Get a list of filenames from the upload folder and store the names of the files.
4. Get a list of the files already in the cloud.
5. Use Parallel.ForEach.
6. Call the UploadObject method from the Google Cloud Storage client library.
You can also refer to this article for more details on the above methods.
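As a rough illustration of the six steps, here is a minimal sketch. To keep it runnable without the Google package or credentials, the upload itself is passed in as a callback. The names ParallelUploader, UploadAll, upload and existingObjectNames are hypothetical and not part of any library, and the sketch assumes the object name in the bucket equals the local file name.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Threading.Tasks;

public static class ParallelUploader
{
    // Uploads every local file whose name is not already in the bucket
    // listing, running at most maxParallel uploads at a time.
    // 'upload' is a hypothetical callback; with the Google client it would
    // open the file and pass the stream to client.UploadObject(...).
    public static IReadOnlyCollection<string> UploadAll(
        IEnumerable<string> localFiles,
        ISet<string> existingObjectNames,
        Action<string> upload,
        int maxParallel)
    {
        // Step 2: parallelization options.
        var options = new ParallelOptions { MaxDegreeOfParallelism = maxParallel };
        var uploaded = new ConcurrentBag<string>();

        // Steps 3 and 4: keep only the files that are not in the cloud yet.
        var pending = localFiles
            .Where(f => !existingObjectNames.Contains(Path.GetFileName(f)))
            .ToList();

        // Steps 5 and 6: upload the remaining files in parallel.
        Parallel.ForEach(pending, options, file =>
        {
            upload(file);
            uploaded.Add(file);
        });

        return uploaded.ToArray();
    }
}
```

With the real client, the callback would typically open a FileStream for the path and call UploadObject with the bucket name, the object name, the content type and that stream. An exception thrown by any upload surfaces from Parallel.ForEach as an AggregateException, so resuming failed uploads still has to be handled by the caller.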
I was looking for resources on how to create a simple background service using C# that checks a specific folder for FLAC files and sends them to a GCP bucket, once the file is uploaded successfully the file is erased or moved to another folder. Where can I find something to read about this kind of thing?
To move a file to another location in C#, you can use the File.Move method. It moves an existing file to a new location with the same or a different file name, so it is also the method used to rename files. It takes two parameters, the source path and the destination path, and the file no longer exists at the source path afterwards.
Example:
try
{
    File.Move(sourceFile, destinationFile);
}
catch (IOException iox)
{
    Console.WriteLine(iox.Message);
}
If you need more examples about File.Move method please follow this link
Adding to that, you can use the Directory.GetFiles method to select the file extension, like in the example below.
This is the original thread where the example was posted
Example:
//Assume user types .txt into textbox
string fileExtension = "*" + textbox1.Text;
string[] txtFiles = Directory.GetFiles("Source Path", fileExtension);
foreach (var item in txtFiles)
{
File.Move(item, Path.Combine("Destination Directory", Path.GetFileName(item)));
}
If you want to know more about Directory.GetFiles method follow this link
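Putting Directory.GetFiles and File.Move together for the FLAC case in the question, a single pass over the folder could look like the sketch below. The upload is passed in as a callback so that the example runs without any cloud dependency. FlacSweeper, SweepOnce and upload are hypothetical names, and the sketch assumes a file should only be moved once its upload has finished without throwing.

```csharp
using System;
using System.IO;

public static class FlacSweeper
{
    // Handles every *.flac file in sourceDir once: calls 'upload' and, if it
    // does not throw, moves the file into processedDir.
    // Returns the number of files that were moved.
    public static int SweepOnce(string sourceDir, string processedDir, Action<string> upload)
    {
        Directory.CreateDirectory(processedDir); // no-op if it already exists
        int moved = 0;

        foreach (string path in Directory.GetFiles(sourceDir, "*.flac"))
        {
            try
            {
                upload(path);
                string target = Path.Combine(processedDir, Path.GetFileName(path));
                File.Move(path, target); // throws if the target already exists
                moved++;
            }
            catch (Exception ex)
            {
                // Leave the file where it is so the next sweep retries it.
                Console.WriteLine($"Skipping {path}: {ex.Message}");
            }
        }

        return moved;
    }
}
```

A background service would call SweepOnce repeatedly, for example from a timer or a loop with a delay between passes. Replacing File.Move with File.Delete gives the "erase after upload" variant.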
As for GCP, the Cloud Storage Transfer Service lets you move or back up data to a Cloud Storage bucket, either from other cloud storage providers or from your on-premises storage. Storage Transfer Service provides options that make data transfers and synchronization easier. For example, you can:
Schedule one-time transfer operations or recurring transfer operations.
Delete existing objects in the destination bucket if they do not have a corresponding object in the source.
Delete data source objects after transferring them.
Schedule periodic synchronization from a data source to a data sink with advanced filters based on file creation dates, file names, and the times of day you prefer to import data.
If you want to know more about GCP Cloud Storage Transfer Service follow this link
If you want to know more about how to create storage buckets follow this link
I've just started working with Data Lake and I'm currently trying to figure out the real workflow steps and how to automatize the whole process.
Say I have some files as an input and I would like to process them and download output files in order to push into my data warehouse or/and SSAS.
I've found absolutely lovely API and it's all good but I can't find a way to get all the file names in a directory to get them downloaded further.
Please correct my thoughts regarding workflow. Is there another, more elegant way to automatically get all the processed data (outputs) into a storage (like conventional SQL Server, SSAS, data warehouse and etc)?
If you have a working solution based on Data Lake, please describe the workflow (from "raw" files to reports for end-users) with a few words.
Here is my example .NET Core application:
using Microsoft.Azure.DataLake.Store;
using Microsoft.IdentityModel.Clients.ActiveDirectory;
using Microsoft.Rest.Azure.Authentication;
var creds = new ClientCredential(ApplicationId, Secret);
var clientCreds = ApplicationTokenProvider.LoginSilentAsync(Tenant, creds).GetAwaiter().GetResult();
var client = AdlsClient.CreateClient("myfirstdatalakeservice.azuredatalakestore.net", clientCreds);
var result = client.GetDirectoryEntry("/mynewfolder", UserGroupRepresentation.ObjectID);
Say I have some files as an input and I would like to process them and download output files in order to push into my data warehouse or/and SSAS.
If you want to download the files from a folder in Azure Data Lake to a local path, you could use the following code:
client.BulkDownload("/mynewfolder", @"D:\Tom\xx"); // local path
But based on my understanding, you could also use Azure Data Factory to push your data from Data Lake Store to Azure Blob storage or Azure File storage.
I have a very basic understanding of Azure WebJob, that is, it can perform tasks in background. I want to upload files to Azure Blob Storage, specifically using Azure WebJob. I would like to know how to do this, from scratch. Assume that the file to be uploaded is locally available on the system in a certain folder (say C:/Users/Abc/Alpha/Beta/).
How and where do I define the background task that is supposed to be performed?
How to make sure that whenever a new file is available in the same folder (C:/Users/Abc/Alpha/Beta/) the function is automatically triggered, and this new file is also transferred to Azure Blob Storage?
Can I monitor progress of transfer for each file? or for all files?
How to handle connection failures during transfer? and what other errors should I worry about?
How and where do I define the background task that is supposed to be performed?
According to your description, you could create a WebJob console application in Visual Studio and run it locally.
For more details, refer to this article on how to create a WebJob in Visual Studio.
Note: since you need to watch a local folder, this WebJob runs on your local machine and is not uploaded to the Azure web app.
How to make sure, that whenever a new file is available in the same folder ( C:/Users/Abc/Alpha/Beta/) the function is automatically triggered, and this new file is also transferred to Azure Blob Storage?
As far as I know, WebJobs supports a FileTrigger, which monitors a particular directory for file additions or changes and triggers a job function when they occur.
For more details, refer to the code sample below:
Program.cs:
static void Main()
{
    var config = new JobHostConfiguration();
    FilesConfiguration filesConfig = new FilesConfiguration();
    // Set the root path of the folder the function watches
    filesConfig.RootPath = @"D:\";
    config.UseFiles(filesConfig);
    var host = new JobHost(config);
    // The following code ensures that the WebJob will be running continuously
    host.RunAndBlock();
}
function.cs:
public static void ImportFile(
    [FileTrigger(@"fileupload\{name}", "*.*", WatcherChangeTypes.Created | WatcherChangeTypes.Changed)] Stream file,
    FileSystemEventArgs fileTrigger,
    [Blob("textblobs/{name}", FileAccess.Write)] Stream blobOutput,
    TextWriter log)
{
    log.WriteLine(string.Format("Processed input file '{0}'!", fileTrigger.Name));
    file.CopyTo(blobOutput);
    log.WriteLine("Upload File Complete");
}
Can I monitor progress of transfer for each file? or for all files?
As far as I know, there's a [BlobInput] attribute that lets you specify a container to listen on, and it includes an efficient blob listener that will dispatch to the method when new blobs are detected. More details, you could refer to this article.
How to handle connection failures during transfer? and what other errors should I worry about?
You could use try/catch to catch the error. If an error happens, you could send the details to a queue or write a text file to a blob. Then you could take further action based on the queue message or the blob text file.
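A minimal sketch of that try/catch idea, with a few retries added before giving up, might look like this. TransferRetry and TryTransfer are hypothetical names, and the failures list is only a stand-in for the queue message or text blob mentioned above.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;

public static class TransferRetry
{
    // Runs 'transfer' up to maxAttempts times, pausing 'delay' between
    // attempts. If every attempt fails, the last error message is appended
    // to 'failures' (a stand-in for a queue message or a text blob).
    public static bool TryTransfer(
        Action transfer,
        int maxAttempts,
        TimeSpan delay,
        IList<string> failures)
    {
        Exception last = null;

        for (int attempt = 1; attempt <= maxAttempts; attempt++)
        {
            try
            {
                transfer();
                return true; // success, nothing to record
            }
            catch (Exception ex)
            {
                last = ex;
                // Wait before the next attempt, but not after the final one.
                if (attempt < maxAttempts) Thread.Sleep(delay);
            }
        }

        failures.Add(last?.Message ?? "transfer was not attempted");
        return false;
    }
}
```

Inside ImportFile, the call to file.CopyTo(blobOutput) could be wrapped in such a helper. Whether a retry is safe there depends on the stream still being readable from the start, which this sketch does not check.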