This should have been simple but turned out to require a bit of GoogleFu.
I have an Azure Synapse Spark Notebook written in C# that:
1. Receives a list of Deflate compressed IIS files.
2. Reads the files as binary into a DataFrame.
3. Decompresses these files one at a time and writes them into Parquet format.
Now, after all of them have been successfully processed, I need to delete the compressed files.
This is only my proof of concept, but it works perfectly.
Create a linked service pointing to the storage account that contains the files you want to delete (see Configure access to Azure Blob Storage).
See the code sample below:
#r "nuget:Azure.Storage.Files.DataLake,12.0.0-preview.9"
using Microsoft.Spark.Extensions.Azure.Synapse.Analytics.Utils;
using Microsoft.Spark.Extensions.Azure.Synapse.Analytics.Notebook.MSSparkUtils;
using Azure.Storage.Files.DataLake;
using Azure.Storage.Files.DataLake.Models;
string blob_sas_token = Credentials.GetConnectionStringOrCreds("your linked service name here");
Uri uri = new Uri($"https://<your storage account name here>.blob.core.windows.net/<your container name here>{blob_sas_token}");
DataLakeServiceClient _serviceClient = new DataLakeServiceClient(uri);
DataLakeFileSystemClient fileClient = _serviceClient.GetFileSystemClient("<path to directory containing the file here>");
fileClient.DeleteFile("<file name here>");
The call to Credentials.GetConnectionStringOrCreds returns a signed SAS token that is ready for your code to attach to a storage resource URI.
You could of course use the DeleteFileAsync method if you so desire.
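For example, a minimal sketch of the async variant, reusing the fileClient from the sample above (inside an async context):
// Same delete as above, but awaited instead of blocking.
await fileClient.DeleteFileAsync("<file name here>");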
Hope this saves someone else a few hours of GoogleFu.
Related
I've just started working with Data Lake and I'm currently trying to figure out the real workflow steps and how to automate the whole process.
Say I have some files as an input and I would like to process them and download the output files in order to push them into my data warehouse and/or SSAS.
I've found an absolutely lovely API and it's all good, but I can't find a way to get all the file names in a directory so I can download them further.
Please correct my thoughts regarding the workflow. Is there another, more elegant way to automatically get all the processed data (outputs) into storage (like a conventional SQL Server, SSAS, a data warehouse, etc.)?
If you have a working solution based on Data Lake, please describe the workflow (from "raw" files to reports for end users) in a few words.
Here is my example of a .NET Core application:
using Microsoft.Azure.DataLake.Store;
using Microsoft.IdentityModel.Clients.ActiveDirectory;
using Microsoft.Rest.Azure.Authentication;
var creds = new ClientCredential(ApplicationId, Secret);
var clientCreds = ApplicationTokenProvider.LoginSilentAsync(Tenant, creds).GetAwaiter().GetResult();
var client = AdlsClient.CreateClient("myfirstdatalakeservice.azuredatalakestore.net", clientCreds);
var result = client.GetDirectoryEntry("/mynewfolder", UserGroupRepresentation.ObjectID);
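The question also asks how to get all the file names in a directory; here is a minimal sketch of that, assuming the same client and the SDK's EnumerateDirectory method:
// Enumerate the folder and print the type and full path of every entry.
foreach (DirectoryEntry entry in client.EnumerateDirectory("/mynewfolder"))
{
    Console.WriteLine($"{entry.Type}: {entry.FullName}");
}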
Say I have some files as an input and I would like to process them and download the output files in order to push them into my data warehouse and/or SSAS.
If you want to download the files from a folder in Azure Data Lake to a local path, you could use the following code to do that.
client.BulkDownload("/mynewfolder", @"D:\Tom\xx"); // local path
But based on my understanding, you could also use Azure Data Factory to push your data from Data Lake Store to Azure Blob storage or Azure Files storage.
In my application, we're uploading a large amount of image data at a time. The request is made through an Angular portal and received by an ASP.NET Web API; both are hosted on Azure. From the API I'm directly converting the image data to bytes and uploading it to an Azure blob.
Is this a proper way to upload, or do I need to save those images on my server first (e.g. under some path like 'C:/ImagesToUpload') and then upload them to the Azure blob from there?
I'm concerned because we're uploading a large amount of data, and I have no idea whether the approach I'm using right now will create memory issues.
So if someone could advise on the right approach, that would be great.
I have developed the same thing; we had the same requirement with a large number of files. I think you have to compress the file on the API side first and then send it to the blob using a SAS token. But note that you have to pass the data to Azure Blob storage in pieces smaller than 5 MB, so I found a solution for that as well.
Here is some sample code that has worked pretty well for me after some testing.
CloudStorageAccount storageAccount = CloudStorageAccount.Parse(SettingsProvider.Get("CloudStorageConnectionString", SettingType.AppSetting));
var blobClient = storageAccount.CreateCloudBlobClient();
var filesContainer = blobClient.GetContainerReference("your_containername");
filesContainer.CreateIfNotExists();
var durationHours = 24;
//Generate SAS Token
var sasConstraints = new SharedAccessBlobPolicy
{
SharedAccessExpiryTime = DateTime.UtcNow.AddHours(durationHours),
Permissions = SharedAccessBlobPermissions.Write | SharedAccessBlobPermissions.Read
};
// Generate Random File Name using GUID
var StorageFileName = Guid.NewGuid() + DateTime.Now.ToString("yyyyMMddHHmmss"); // avoid '/' and ':' from the default date format in the blob name
var blob = filesContainer.GetBlockBlobReference(StorageFileName);
var blobs = new CloudBlockBlob(new Uri(string.Format("{0}/{1}{2}", filesContainer.Uri.AbsoluteUri, StorageFileName, blob.GetSharedAccessSignature(sasConstraints))));
// Split the upload into 4 MB chunks when the file is larger than 4 MB
BlobRequestOptions blobRequestOptions = new BlobRequestOptions()
{
SingleBlobUploadThresholdInBytes = 4 * 1024 * 1024, // upload in a single request only up to 4 MB
ParallelOperationThreadCount = 5,
ServerTimeout = TimeSpan.FromMinutes(30)
};
blob.StreamWriteSizeInBytes = 4 * 1024 * 1024;
// Upload it to Azure Storage
await blobs.UploadFromByteArrayAsync(item.Document_Bytes, 0, item.Document_Bytes.Length, AccessCondition.GenerateEmptyCondition(), blobRequestOptions, new OperationContext());
But make sure that before calling this function, if you have a huge amount of data, you apply some compression. I used the "zlib" library; you can find a free C#/.NET port at http://www.componentace.com/zlib_.NET.htm. If you want to know more, visit https://www.zlib.net/.
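If you would rather avoid the extra dependency, here is a rough sketch of the same idea using the built-in System.IO.Compression classes instead of zlib.NET (my substitution, not what the code above uses):
using System.IO;
using System.IO.Compression;

// Deflate-compress a byte array before handing it to the blob upload above.
static byte[] CompressBytes(byte[] input)
{
    using (var output = new MemoryStream())
    {
        using (var deflate = new DeflateStream(output, CompressionLevel.Optimal, leaveOpen: true))
        {
            deflate.Write(input, 0, input.Length);
        }
        return output.ToArray();
    }
}
You would call this on item.Document_Bytes before the upload, and decompress on the way back with a DeflateStream in CompressionMode.Decompress.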
Per my understanding, you could also leverage Fine Uploader to upload your files directly to Azure Blob storage without sending the file to your server first. For a detailed description, you could follow Uploading Directly to Azure Blob Storage.
The script would look as follows:
var uploader = new qq.azure.FineUploader({
element: document.getElementById('fine-uploader'),
request: {
endpoint: 'https://<your-storage-account-name>.blob.core.windows.net/<container-name>'
},
signature: {
endpoint: 'https://yourapp.com/uploadimage/signature'
},
uploadSuccess: {
endpoint: 'https://yourapp.com/uploadimage/done'
}
});
You could follow Getting Started with Fine Uploader and install the fine-uploader package, then follow here for initializing FineUploader for Azure Blob Storage, then follow here to configure CORS for your blob container and expose the endpoint for creating the SAS token. Moreover, here is a similar issue for using FineUploader.
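The signature endpoint the request above points at ('https://yourapp.com/uploadimage/signature') just needs to hand back a SAS for the blob the client is about to write. Here is a rough Web API sketch of that idea; the "bloburi" query parameter name and the plain SAS-URI response are my assumptions about Fine Uploader's request contract, so verify them against the Fine Uploader Azure documentation:
using System;
using System.Web.Http;
using Microsoft.WindowsAzure.Storage.Auth;
using Microsoft.WindowsAzure.Storage.Blob;

// Hypothetical SAS endpoint for Fine Uploader's Azure mode.
public class UploadImageController : ApiController
{
    [HttpGet]
    [Route("uploadimage/signature")]
    public IHttpActionResult Signature(string bloburi) // assumed parameter name
    {
        var credentials = new StorageCredentials("<your-storage-account-name>", "<your-account-key>");
        var blob = new CloudBlockBlob(new Uri(bloburi), credentials);

        var policy = new SharedAccessBlobPolicy
        {
            Permissions = SharedAccessBlobPermissions.Write | SharedAccessBlobPermissions.Delete,
            SharedAccessExpiryTime = DateTime.UtcNow.AddMinutes(15)
        };

        // Return the blob URI with the SAS appended; the client uses it to PUT the blob.
        return Ok(bloburi + blob.GetSharedAccessSignature(policy));
    }
}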
From the API I'm directly converting the image data to bytes and uploading to Azure blob.
I'm concerned because we're uploading a large amount of data and the way I'm using right now, will create memory issue or not
For the approach of uploading the file to your Web API endpoint first and then uploading it to an Azure storage blob, I would prefer using MultipartFormDataStreamProvider, which stores the uploaded file in a temp file on the server, rather than MultipartMemoryStreamProvider, which buffers it in memory. For details, you could follow the related code snippet in this issue. Moreover, you could follow the GitHub sample for uploading files using Web API.
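A minimal sketch of that server-side approach (a hypothetical Web API controller; the upload lands as a temp file under App_Data rather than in memory):
using System.Net;
using System.Net.Http;
using System.Threading.Tasks;
using System.Web;
using System.Web.Http;

public class UploadController : ApiController
{
    // Streams the multipart upload to a temp file instead of buffering it in memory.
    public async Task<IHttpActionResult> Upload()
    {
        if (!Request.Content.IsMimeMultipartContent())
            return StatusCode(HttpStatusCode.UnsupportedMediaType);

        string root = HttpContext.Current.Server.MapPath("~/App_Data");
        var provider = new MultipartFormDataStreamProvider(root);

        await Request.Content.ReadAsMultipartAsync(provider);

        foreach (MultipartFileData file in provider.FileData)
        {
            // file.LocalFileName is the temp file on disk; upload it to blob storage
            // from here (e.g. CloudBlockBlob.UploadFromFileAsync) and then delete it.
        }

        return Ok();
    }
}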
I am trying to write a custom .NET activity, which will be run from Azure Data Factory. It will do two tasks, one after the other:
it will download grib2 files from an FTP server daily (grib2 is a custom compression for meteorological data)
it will decompress each file as it is downloaded.
So far I have set up an Azure Batch account with a pool of two nodes (Windows Server machines), which are used to run the FTP downloads. The nodes download the grib2 files to a blob storage container.
The code for the custom app so far looks like this:
using System;
using System.Collections.Generic;
using System.Linq;
using Microsoft.Azure;
using Microsoft.Azure.Management.DataFactories.Models;
using Microsoft.Azure.Management.DataFactories.Runtime;
namespace ClassLibrary1
{
public class Class1 : IDotNetActivity
{
public IDictionary<string, string> Execute(
IEnumerable<LinkedService> linkedServices,
IEnumerable<Dataset> datasets,
Activity activity,
IActivityLogger logger)
{
logger.Write("Start");
//Get extended properties
DotNetActivity dotNetActivityPipeline = (DotNetActivity)activity.TypeProperties;
string sliceStartString = dotNetActivityPipeline.ExtendedProperties["SliceStart"];
//Get linked service details
Dataset inputDataset = datasets.Single(dataset => dataset.Name == activity.Inputs.Single().Name);
Dataset outputDataset = datasets.Single(dataset => dataset.Name == activity.Outputs.Single().Name);
/*
DO FTP download here
*/
logger.Write("End");
return new Dictionary<string, string>();
}
}
}
So far my code works and I have the files downloaded to my blob storage account.
Now that I have the files downloaded, I would like to have the nodes of the Batch pool decompress the files and put the decompressed files in my blob storage for further processing.
For this, wgrib2.exe is used, which comes with some DLL files. I have already zipped the executable and all the DLLs it needs and uploaded them as an application package for my pool. If I am correct, when each node joins the pool, this executable will be extracted and available to call.
My question is: how do I go about writing the custom .NET activity so that the files are downloaded by the nodes of my pool and, after each file is downloaded, a decompression command is run on it to convert it to a CSV file? The command line for this would look like:
wgrib2.exe downloadedfileName.grb2 -csv downloadedfileName.csv
How do I get a handle on the name of each downloaded file, how do I process it on the node, and how do I save it back to blob storage?
Also, how can I control how many files are downloaded at the same time and how many are decompressed at the same time?
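A rough sketch of what the per-file call on a node could look like (placeholder file names; this just wraps the command line above):
// Shell out to wgrib2.exe for one downloaded file and wait for it to finish.
var startInfo = new System.Diagnostics.ProcessStartInfo
{
    FileName = "wgrib2.exe",
    Arguments = "downloadedfileName.grb2 -csv downloadedfileName.csv",
    UseShellExecute = false,
    RedirectStandardOutput = true,
    RedirectStandardError = true
};

using (var process = System.Diagnostics.Process.Start(startInfo))
{
    process.WaitForExit();
}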
I have multiple web servers and one central file server inside my data center,
and all my web servers store the user-uploaded files on the central internal file server.
I would like to know: what is the best way to pass the file from the web server to the file server in this case?
As suggested, I am adding more details to the question.
The solution I came up with was:
after receiving files from the user at the web server, I should just do an HTTP POST to the file server. But I think there is something wrong with this, because it causes large files to be entirely loaded into memory twice (once at the web server and once at the file server).
Is your file server just another Windows/Linux server, or is it a NAS device? I can suggest a number of approaches based on your requirements. The question is why you want to use HTTP when you have much better ways to transfer files between servers.
HTTP is best when you send text data, as HTTP itself is text-based. From the client side to the server side, HTTP is used because that is the only option the browser gives you. But between your servers, I feel you should use the SMB protocol (I am assuming you are using Windows, as the question is tagged IIS) to move data. It will be orders of magnitude faster and much more efficient to transfer the same data over SMB rather than HTTP.
And for the SMB protocol, you do not have to write any code or complex scripts to do this. As provided by one of the answers above, you can just issue a simple copy command and it will happen for you.
So, just summarizing the options for you (based on my preference):
1. Let the files get uploaded to some location on each IIS web server, e.g. C:\temp\UploadedFiles. You can write a simple 2-3 line PowerShell script which copies the files from C:\temp\UploadedFiles to \\FileServer\Files\UserID\<FILEGUID>\uploaded.file. This same PowerShell script can delete the file once it has been moved to the other server successfully.
E.g., the script can be this simple, and it is easy to set up as a Windows scheduled task:
$Source = "C:\temp\UploadedFiles"
$Destination = "\\FileServer\Files\UserID\<FILEGUID>\"
New-Item -ItemType directory -Path $Destination -Force
Copy-Item -Path $Source\*.* -Destination $Destination -Force
This script can be modified to suit your needs, e.g. to delete the files once the copy is done :)
2. In the ASP.NET application, you can save the file directly to the network location, i.e. in the SaveAs call you can give the network path itself. You have to make sure this network share is accessible to the IIS worker process and that it has write permission. Also, to my understanding, ASP.NET saves the file to a temporary location first (you do not have control over this if you are using the ASP.NET HttpPostedFileBase or FormCollection). More details here.
You can even run this asynchronously so that your requests are not blocked:
if (FileUpload1.HasFile)
// Call to save the file.
FileUpload1.SaveAs(@"\\networkshare\filename");
https://msdn.microsoft.com/en-us/library/system.web.ui.webcontrols.fileupload.saveas(v=vs.110).aspx
3. Save the file the current way to a local directory and then use HTTP POST. This is the worst design possible, as you first read the contents and then transfer them, chunked, to the other server, where you have to set up another web service which receives the file. Then you have to read the file from the request stream and save it again to your location. I am not sure you need to do this.
Let me know if you need more details on any of the listed methods.
Or you just write it to a folder on the web servers, and create a scheduled task that moves the files to the file server every x minutes (e.g. via robocopy). This also makes sure your web servers are not reliant on your file server.
Assuming that you have an HttpPostedFileBase then the best way is just to call the .SaveAs() method.
You need the UNC path to the file server and that is it. The simplest version would look something like this:
public void SaveFile(HttpPostedFileBase inputFile) {
var saveDirectory = @"\\fileshare\application\directory";
var savePath = Path.Combine(saveDirectory, inputFile.FileName);
inputFile.SaveAs(savePath);
}
However, this is simplistic in the extreme. Take a look at the OWASP Guidance on Unrestricted File Uploads. File uploads can be the source of many vulnerabilities in your application.
You also need to make sure that the web application has access to the file share. Take a look at this answer, Creating a file on network location in asp.net, for more info. Generally the best solution is to run the application pool with a special identity which is only used to access the folder.
The solution I came up with was: after receiving files from the user at the web server, I should just do an HTTP POST to the file server. But I think there is something wrong with this, because it causes large files to be entirely loaded into memory twice (once at the web server and once at the file server).
I would suggest not posting the file all at once - it is then fully held in memory, which is not needed.
You could post the file in chunks using AJAX. When a chunk arrives at your server, just append it to the file.
With the File Reader API, you can read the file in chunks in JavaScript.
Something like this:
/** upload file in chunks */
function upload(file) {
var chunkSize = 8000;
var start = 0;
while (start < file.size) {
var chunk = file.slice(start, start + chunkSize);
var xhr = new XMLHttpRequest();
xhr.onload = function () {
//check if all chunks are uploaded, then send the filename (or send it in the first/last request).
};
xhr.open("POST", "/FileUpload", true);
xhr.send(chunk);
start += chunkSize;
}
}
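On the server side, here is a rough sketch of an MVC action that appends each incoming chunk to the target file (hypothetical names; the file name would have to be passed along, as the comment in the loop above suggests):
using System.IO;
using System.Web.Mvc;

public class FileUploadController : Controller
{
    // Appends the raw body of each chunk request to the target file on the share.
    [HttpPost]
    public ActionResult FileUpload(string fileName)
    {
        // NOTE: validate fileName before using it in a real application.
        string path = Path.Combine(@"\\fileserver\uploads", Path.GetFileName(fileName));

        using (var target = new FileStream(path, FileMode.Append, FileAccess.Write))
        {
            Request.InputStream.CopyTo(target);
        }

        return new HttpStatusCodeResult(200);
    }
}
Note that the while loop in the client sketch fires all the chunk requests without waiting, so in practice you would send them sequentially (or include an offset) to keep the appended chunks in order.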
It can be implemented in different ways. If you are storing the files on the file server as plain files in the file system, and all of your servers are inside the same virtual network,
then it is better to create a shared folder on your file server; once you receive files at the web server, just save them into this shared folder directly on the file server.
Here are the instructions for how to create shared folders: https://technet.microsoft.com/en-us/library/cc770880(v=ws.11).aspx
Just map a drive
I take it you have a means of saving the uploaded file on the web server's local filesystem. The question pertains to moving the file from the web server (which is probably one of many load-balanced nodes) to a central file system that all web servers can access.
The solution to this is remarkably simple.
Let's say you are currently saving the files to some folder, say c:\uploadedfiles. The path to uploadedfiles is stored in your web.config.
Take the following steps:
Sign on as the service account under which your web site executes
Map a persistent network drive to the desired location, e.g. from command line:
NET USE f: \\MyFileServer\MyFileShare /user:SomeUserName password
Modify your web.config and change c:\uploadedfiles to f:\
Ta da, all done.
Just make sure the drive mapping is persistent, and make sure you use a user with adequate permissions, and voila.
So I am making a website for my family, where we can upload our images and view them. An important feature of the website is sorting by date, so that when, for example, my aunt has taken pictures at my mother's birthday and I have also taken pictures, the images we upload are added to the same album, etc.
I've realized that it is not possible to preserve the date when uploading through a browser, so I will make a small program which is only used for uploading pictures. I have an FTP server running, but whenever I upload images the date changes to the current datetime. I have found the answer to why it does that, so now I am looking for a way to preserve the date while uploading to FTP.
Here's some ideas I've had:
If the program adds the files to a zip file and uploads that zip file, the dates will be preserved, but that means I would have to have something on the server that unpacks the zips.
When the images get uploaded, the program extracts the created date from each original image and adds it to a text file which it also uploads, but that would again require a program on the server which changes the uploaded images' created dates.
Maybe I upload the images and afterwards change the uploaded images' created dates from the client?
Maybe I upload the images and afterwards change the uploaded images' created dates from the client?
In the FTP protocol, use the MFMT or MDTM command to update the file modification timestamp, or MFCT to update the file creation timestamp, depending on which of these your FTP server supports.
Actually, none of them is standardized.
The MFMT and MFCT are drafted here:
https://datatracker.ietf.org/doc/html/draft-somers-ftp-mfxx-04
MDTM is defined in RFC 3659 for retrieving the file modification timestamp, using the MDTM filename syntax. But many FTP servers also support an alternative (non-standard) use of MDTM with the same arguments as the proposed MFMT, to update the modification timestamp too.
Though the native FTP implementation in the .NET Framework (the FtpWebRequest class or its WebClient wrapper) does not support any of these.
You have to use a 3rd party library.
For example the WinSCP .NET assembly preserves the modification timestamp automatically for any upload (or download) without any additional code.
Simple example code to upload a file (implicitly preserving the modification timestamp):
// Setup session options
SessionOptions sessionOptions = new SessionOptions
{
Protocol = Protocol.Ftp,
HostName = "example.com",
UserName = "user",
Password = "mypassword",
};
using (Session session = new Session())
{
// Connect
session.Open(sessionOptions);
// Upload
session.PutFiles(@"d:\toupload\image.jpg", "/home/user/").Check();
}
For details, see Session.PutFiles.
WinSCP GUI can even generate the C# code for you.
(I'm the author of WinSCP)