I've just started working with Data Lake and I'm currently trying to figure out the actual workflow steps and how to automate the whole process.
Say I have some files as an input and I would like to process them and download the output files in order to push them into my data warehouse and/or SSAS.
I've found an absolutely lovely API and it's all good, but I can't find a way to get all the file names in a directory so that I can download them.
Please correct my thoughts regarding the workflow. Is there another, more elegant way to automatically get all the processed data (outputs) into storage (such as a conventional SQL Server, SSAS, a data warehouse, etc.)?
If you have a working solution based on Data Lake, please describe the workflow (from "raw" files to reports for end users) in a few words.
Here is my example .NET Core application:
using Microsoft.Azure.DataLake.Store;
using Microsoft.IdentityModel.Clients.ActiveDirectory;
using Microsoft.Rest.Azure.Authentication;
// Authenticate with the AAD application (service principal) credentials.
var creds = new ClientCredential(ApplicationId, Secret);
var clientCreds = ApplicationTokenProvider.LoginSilentAsync(Tenant, creds).GetAwaiter().GetResult();
// Create the ADLS client and fetch the metadata of a single entry.
var client = AdlsClient.CreateClient("myfirstdatalakeservice.azuredatalakestore.net", clientCreds);
var result = client.GetDirectoryEntry("/mynewfolder", UserGroupRepresentation.ObjectID);
Say I have some files as an input and I would like to process them and download the output files in order to push them into my data warehouse and/or SSAS.
If you want to download the files from a folder in Azure Data Lake to a local path, you could use the following code to do that.
client.BulkDownload("/mynewfolder", @"D:\Tom\xx"); // local destination path
But based on my understanding, you could use Azure Data Factory to push your data from Data Lake Store to Azure Blob storage or Azure Files storage.
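As for getting all the file names in a directory (the listing part of the question), the same SDK exposes an enumeration call; below is a minimal sketch based on my reading of the Microsoft.Azure.DataLake.Store client used above, with the folder path as a placeholder.
// List the entries under a folder and print the full names of the files.
foreach (var entry in client.EnumerateDirectory("/mynewfolder"))
{
    if (entry.Type == DirectoryEntryType.FILE)
    {
        Console.WriteLine(entry.FullName); // full path, usable for a per-file download
    }
}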
Related
This should have been simple but turned out to require a bit of GoogleFu.
I have an Azure Synapse Spark Notebook written in C# that:
Receives a list of Deflate compressed IIS files.
Reads the files as binary into a DataFrame
Decompresses these files one at a time and writes them into Parquet format.
Now after all of them have been successfully processed I need to delete the compressed files.
This is my proof of concept but it works perfectly.
Create a linked service pointing to the storage account that contains the files you want to delete; see Configure access to Azure Blob Storage.
See the code sample below:
#r "nuget:Azure.Storage.Files.DataLake,12.0.0-preview.9"
using Microsoft.Spark.Extensions.Azure.Synapse.Analytics.Utils;
using Microsoft.Spark.Extensions.Azure.Synapse.Analytics.Notebook.MSSparkUtils;
using Azure.Storage.Files.DataLake;
using Azure.Storage.Files.DataLake.Models;
string blob_sas_token = Credentials.GetConnectionStringOrCreds('your linked service name here');
Uri uri = new Uri($"https://'your storage account name here'.blob.core.windows.net/'your container name here'{blob_sas_token}") ;
DataLakeServiceClient _serviceClient = new DataLakeServiceClient(uri);
DataLakeFileSystemClient fileClient = _serviceClient.GetFileSystemClient("'path to directory containing the file here'") ;
fileClient.DeleteFile("'file name here'") ;
The call to Credentials.GetConnectionStringOrCreds returns a signed SAS token that is ready for your code to attach to a storage resource URI.
You could of course use the DeleteFileAsync method if you so desire.
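For reference, a minimal sketch of that async variant (same placeholder file name; assumes the calling context can await):
// Async variant of the delete call above.
await fileClient.DeleteFileAsync("<file name here>");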
Hope this saves someone else a few hours of GoogleFu.
I was looking for resources on how to create a simple background service in C# that checks a specific folder for FLAC files and sends them to a GCP bucket; once a file is uploaded successfully, it is deleted or moved to another folder. Where can I find something to read about this kind of thing?
To move a file to another location in C# you can use the File.Move method. It moves an existing file to a new location, with the same or a different file name, and it deletes the original; because the destination can have a different name, File.Move is also the method used to rename files. It takes two parameters: the source file path and the destination path.
Example:
using System;
using System.IO;
string sourceFile = @"C:\Music\song.flac";          // hypothetical source path
string destinationFile = @"D:\Processed\song.flac"; // hypothetical destination path
try
{
    File.Move(sourceFile, destinationFile);
}
catch (IOException iox)
{
    Console.WriteLine(iox.Message);
}
If you need more examples of the File.Move method, please follow this link.
Adding to that, you can use the Directory.GetFiles method to filter by file extension, as in the example below.
This is the original thread where the example was posted
Example:
//Assume user types .txt into textbox
string fileExtension = "*" + textbox1.Text;
string[] txtFiles = Directory.GetFiles("Source Path", fileExtension);
foreach (var item in txtFiles)
{
File.Move(item, Path.Combine("Destination Directory", Path.GetFileName(item)));
}
If you want to know more about the Directory.GetFiles method, follow this link.
And concerning GCP: using the Cloud Storage Transfer Service you can move or back up data to a Cloud Storage bucket, either from other cloud storage providers or from your on-premises storage. Storage Transfer Service provides options that make data transfers and synchronization easier. For example, you can:
Schedule one-time or recurring transfer operations.
Delete existing objects in the destination bucket if they do not have a corresponding object in the source.
Delete data source objects after transferring them.
Schedule periodic synchronization from a data source to a data sink, with advanced filters based on file creation dates, file names, and the times of day you prefer to import data.
If you want to know more about the GCP Cloud Storage Transfer Service, follow this link.
If you want to know more about how to create storage buckets, follow this link.
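If you end up writing the upload yourself rather than using the Transfer Service, a minimal sketch of the check-folder / upload / delete loop might look like the following. It assumes the Google.Cloud.Storage.V1 NuGet package and Application Default Credentials; the folder and bucket names are placeholders.
using System.IO;
using Google.Cloud.Storage.V1;
// Hypothetical locations; replace with your own folder and bucket.
var watchFolder = @"C:\Music\Incoming";
var bucketName = "my-flac-bucket";
// Uses Application Default Credentials (e.g. the GOOGLE_APPLICATION_CREDENTIALS variable).
var storage = StorageClient.Create();
foreach (var path in Directory.GetFiles(watchFolder, "*.flac"))
{
    using (var stream = File.OpenRead(path))
    {
        // Object name = local file name; UploadObject throws if the upload fails.
        storage.UploadObject(bucketName, Path.GetFileName(path), "audio/flac", stream);
    }
    // Reached only after a successful upload: delete (or File.Move) the local copy.
    File.Delete(path);
}
A background service would simply run a loop like this on a timer or from a FileSystemWatcher callback.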
We have files stored in an Azure storage account (Gen2).
We are using an API approach to create, delete and read the files (as mentioned here: Read File).
We are trying to copy the files from one storage account to another using an API approach. Can someone suggest a fast approach to achieve this?
Note:
I am looking for a code approach in C# without AzCopy.
In Gen1 there is the Data Movement Library, but I am looking for Gen2.
You could use AzCopy to transfer data. You could copy data between a file system and a storage account, or between storage accounts, with AzCopy.
For the details of how to use AzCopy you could refer to this official doc. In this doc there are download links and tutorials.
Update:
About transferring files between file shares, you could refer to this code:
AzCopy /Source:https://myaccount1.file.core.windows.net/myfileshare1/ /Dest:https://myaccount2.file.core.windows.net/myfileshare2/ /SourceKey:key1 /DestKey:key2 /S
For more about copying files in File Storage you could refer to the doc.
If you still have other questions, please let me know. Hope this could help you.
Actually, it's very hard to find a working solution because the official documentation is outdated and lacks up-to-date examples.
Outdated way
An outdated example of working with blob containers could be found here: https://learn.microsoft.com/en-us/azure/storage/blobs/storage-quickstart-blobs-dotnet?tabs=windows
That example uses the WindowsAzure.Storage NuGet package, which was renamed to Microsoft.Azure.Storage.* and split into separate packages.
Up-to-date solution
I'm currently working on deploying a static SPA to Azure Blob storage, which has a very nice "Static website" feature that serves the files.
Below is a working example that copies all contents from one blob container to another. Please consider it a hint (not production-ready).
All you need is to:
Have an existing blob container.
Install Microsoft.Azure.Storage.DataMovement NuGet package.
Provide a proper connection string.
Here is the code:
// I left fully qualified names of the types to make example clear.
var connectionString = "Connection string from `Azure Portal > Storage account > Access Keys`";
var sourceContainerName = "<source>";
var destinationContainerName = "<destination>";
var storageAccount = Microsoft.Azure.Storage.CloudStorageAccount.Parse(connectionString);
var client = storageAccount.CreateCloudBlobClient();
var sourceContainer = client.GetContainerReference(sourceContainerName);
var destinationContainer = client.GetContainerReference(destinationContainerName);
// Create destination container if needed
await destinationContainer.CreateIfNotExistsAsync();
var sourceBlobDir = sourceContainer.GetDirectoryReference(""); // Root directory
var destBlobDir = destinationContainer.GetDirectoryReference("");
// CopyDirectoryOptions: copy recursively, including all sub-directories
var options = new Microsoft.Azure.Storage.DataMovement.CopyDirectoryOptions
{
Recursive = true,
};
var context = new Microsoft.Azure.Storage.DataMovement.DirectoryTransferContext();
// Perform the copy
var transferStatus = await Microsoft.Azure.Storage.DataMovement.TransferManager
.CopyDirectoryAsync(sourceBlobDir, destBlobDir, true, options, context);
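If you only need to copy individual blobs rather than an entire directory, the same Data Movement library also has a per-blob call; here is a minimal sketch (the blob names are placeholders, and the third argument requests a server-side copy, as above):
// Copy a single blob from the source container to the destination container (server-side copy).
var sourceBlob = sourceContainer.GetBlockBlobReference("myfile.parquet");
var destinationBlob = destinationContainer.GetBlockBlobReference("myfile.parquet");
await Microsoft.Azure.Storage.DataMovement.TransferManager
    .CopyAsync(sourceBlob, destinationBlob, true);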
I have multiple web servers and one central file server inside my data center.
All my web servers store the user-uploaded files on the central internal file server.
I would like to know: what is the best way to pass the file from the web server to the file server in this case?
As suggested, I'll try to add more details to the question.
The solution I came up with was:
After receiving files from the user at the web server, I would just do an HTTP POST to the file server. But I think there is something wrong with this, because it causes large files to be entirely loaded into memory twice (once at the web server and once at the file server).
Is your file server just another Windows/Linux server, or is it a NAS device? I can suggest a number of approaches based on your requirement. The question is: why do you want to use the HTTP protocol when you have a much better way to transfer files between servers?
The HTTP protocol is best when you send text data, as HTTP itself is text-based. From the client side to the server side, HTTP is used because that is the only option browsers give you. But between your servers, I feel you should use the SMB protocol (I'm assuming you are using Windows, as the question is tagged for IIS) to move data. It will be orders of magnitude faster and much more efficient to transfer the same data over SMB vs HTTP.
And with the SMB protocol, you do not have to write any code or complex scripts to do this. As shown in one of the answers above, you can just issue a simple copy command and it will happen for you.
So, just summarizing the options for you (based on my preference):
1. Let the files get uploaded to some location on each IIS web server, e.g. C:\temp\UploadedFiles. You can write a simple 2-3 line PowerShell script which will copy the files from C:\temp\UploadedFiles to \\FileServer\Files\UserID\<FILEGUID>\uploaded.file. The same PowerShell script can delete the file once it has been moved to the other server successfully.
E.g. the script can be as simple as this, and is easy to set up as a Windows scheduled task:
# Source folder on the web server and destination share on the file server
$Source = "C:\temp\UploadedFiles"
$Destination = "\\FileServer\Files\UserID\<FILEGUID>\"
New-Item -ItemType directory -Path $Destination -Force
Copy-Item -Path $Source\*.* -Destination $Destination -Force
This script can be modified to suit your needs, e.g. to delete the files once the copy is done :)
2. In the ASP.NET application, you can save the file directly to the network location; in the SaveAs call you can give the network path itself. You have to make sure this network share is accessible to the IIS worker process and that it has write permission. Also, to my understanding, ASP.NET saves the file to a temporary location first (you do not have control over this if you are using HttpPostedFileBase or FormCollection). More details here
You can even run this asynchronously so that your requests will not be blocked:
if (FileUpload1.HasFile)
{
    // Call to save the file directly to the network share (verbatim string for the UNC path).
    FileUpload1.SaveAs(@"\\networkshare\filename");
}
https://msdn.microsoft.com/en-us/library/system.web.ui.webcontrols.fileupload.saveas(v=vs.110).aspx
3. Save the file the current way to a local directory and then use an HTTP POST. This is the worst design possible, as you first read the contents and then transfer them chunked to another server, where you have to set up another web service which receives the file. Then you have to read the file from the request stream and save it to your location again. I am not sure you need to do this.
Let me know if you need more details on any of the listed methods.
Or you could just write it to a folder on the web servers, and create a scheduled task that moves the files to the file server every x minutes (e.g. via robocopy). This also makes sure your web servers are not reliant on your file server.
Assuming that you have an HttpPostedFileBase then the best way is just to call the .SaveAs() method.
You need the UNC path to the file server and that is it. The simplest version would look something like this:
public void SaveFile(HttpPostedFileBase inputFile) {
var saveDirectory = @"\\fileshare\application\directory";
var savePath = Path.Combine(saveDirectory, inputFile.FileName);
inputFile.SaveAs(savePath);
}
However, this is simplistic in the extreme. Take a look at the OWASP Guidance on Unrestricted File Uploads. File uploads can be the source of many vulnerabilities in your application.
You also need to make sure that the web application has access to the file share. Take a look at this answer, Creating a file on network location in asp.net, for more info. Generally the best solution is to run the application pool with a special identity which is only used to access the folder.
The solution I came up with was: after receiving files from the user at the web server, I would just do an HTTP POST to the file server. But I think there is something wrong with this, because it causes large files to be entirely loaded into memory twice (once at the web server and once at the file server).
I would suggest not posting the file all at once - it is then fully in memory, which is not needed.
You could post the file in chunks using AJAX. When a chunk arrives at your server, just append it to the file.
With the File Reader API, you could read the file in chunks in JavaScript.
Something like this:
/** upload file in chunks */
function upload(file) {
    var chunkSize = 8000;
    var start = 0;
    while (start < file.size) {
        var end = start + chunkSize;
        var chunk = file.slice(start, end);
        var xhr = new XMLHttpRequest();
        xhr.onload = function () {
            // check if all chunks are done and then send the filename, or send it in the first/last request.
        };
        xhr.open("POST", "/FileUpload", true);
        xhr.send(chunk);
        start = end;
    }
}
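For the receiving side, here is a minimal sketch of what the /FileUpload endpoint could look like in ASP.NET MVC; the controller/action names, the target path and the absence of chunk-ordering logic are all assumptions for illustration.
using System.IO;
using System.Web.Mvc;
public class FileUploadController : Controller
{
    // Hypothetical action: appends each posted chunk to a temporary file.
    // A real implementation would identify the upload (e.g. via a client-sent id)
    // and guard against chunks arriving out of order.
    [HttpPost]
    public ActionResult Index()
    {
        var path = Server.MapPath("~/App_Data/upload.tmp");
        using (var target = new FileStream(path, FileMode.Append, FileAccess.Write))
        {
            Request.InputStream.CopyTo(target);
        }
        return new HttpStatusCodeResult(200);
    }
}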
It can be implemented in different ways. If you are storing files on the file server as files in the file system, and all of your servers are inside the same virtual network, then it would be better to create a shared folder on your file server; once you receive files at the web server, just save them to this shared folder directly on the file server.
Here are the instructions on how to create shared folders: https://technet.microsoft.com/en-us/library/cc770880(v=ws.11).aspx
Just map a drive
I take it you have a means of saving the uploaded file on the web server's local filesystem. The question pertains to moving the file from the web server (which is probably one of many load-balanced nodes) to a central file system that all web servers can access.
The solution to this is remarkably simple.
Let's say you are currently saving the files to some folder, say c:\uploadedfiles. The path to uploadedfiles is stored in your web.config.
Take the following steps:
Sign on as the service account under which your web site executes
Map a persistent network drive to the desired location, e.g. from command line:
NET USE f: \\MyFileServer\MyFileShare /user:SomeUserName password
Modify your web.config and change c:\uploadedfiles to f:\
Ta da, all done.
Just make sure the drive mapping is persistent, and make sure you use a user with adequate permissions, and voila.
Is it possible to remotely upload files using a Windows Application (C#) to Sharepoint Server?
Thank you.
Yes, although you may need some of the SharePoint assemblies on the "remote" machine in order to achieve what you need.
Uploading files using Client Object Model in SharePoint 2010 is a pretty good starting point for SharePoint 2010.
In case you are using the 2007 version (WSS 3.0), you can find a great summary of different ways to upload files on this link: http://vspug.com/smc750/2009/05/19/uploading-content-into-sharepoint-let-me-count-the-ways/
You must be very careful if your farm is 32-bit; in that case it is very easy to use up all the available memory in the w3wp.exe process if you're uploading large files or many files in parallel, especially if the farm is a busy one. In that case you might want to use the RPC interface described in the link above, since it is the only one that lets you upload files in chunks. With all other approaches, the entire file being uploaded must first be loaded into w3wp's memory before it is committed to the SharePoint list item.
For approaches that involve the SharePoint object model, you might want to write your own web service facade to enable clients that do not have the SharePoint DLLs to upload files (plus metadata if you need it).
You can use the client object model in SP2010, rather than talking to the web services directly.
Taken from my upload profile picture applications:
http://spc3.codeplex.com/SourceControl/changeset/view/57957#1015709
using (ClientContext context = new ClientContext(siteurl)) {
context.AuthenticationMode = ClientAuthenticationMode.Default;
List list = context.Web.Lists.GetByTitle(listname);
context.Load(list);
context.Load(list.RootFolder);
context.ExecuteQuery();
// CombineUrl is a helper extension method from the linked project; name, listfolder and data come from the calling code.
string url = siteurl.CombineUrl(list.RootFolder.ServerRelativeUrl).CombineUrl(listfolder).CombineUrl(name);
FileCreationInformation fci = new FileCreationInformation();
fci.Content = data; // data: byte[] with the file contents
fci.Overwrite = true;
fci.Url = url;
Microsoft.SharePoint.Client.File file = list.RootFolder.Files.Add(fci);
context.ExecuteQuery();
}
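If memory usage with larger files is a concern, the client object model also offers a streaming upload via File.SaveBinaryDirect; here is a minimal sketch, where the local path and server-relative URL are placeholders and the same SharePoint 2010 CSOM assemblies are assumed:
using (ClientContext context = new ClientContext(siteurl))
using (FileStream stream = new FileStream(@"C:\temp\report.docx", FileMode.Open))
{
    // Streams the content to the given server-relative URL, overwriting any existing file.
    Microsoft.SharePoint.Client.File.SaveBinaryDirect(context, "/Documents/report.docx", stream, true);
}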
I wrote a tool, available as a NuGet package in Visual Studio, called SharePoint.DesignFactory.ContentFiles. With this tool you can manage all files, with metadata, to be uploaded to the SharePoint content database. You can use it for SharePoint 2007 (currently you have to work on the SharePoint machine itself) or for SharePoint 2010 and Office 365; in that case you can work from a machine without SharePoint installed. See http://weblogs.asp.net/soever/archive/tags/SharePoint.DesignFactory.ContentFiles/default.aspx for blog posts on the tooling.