Decompressing an S3 File in a Stream using C#

I'm trying to read a .zip file from S3 into a stream in C# and write the entries back to the originating folder in S3. I've looked at the myriad of SO questions, watched videos, etc. trying to get this right, and I seem to be missing something. I'm further along than I was originally, but I'm still getting stuck. (I really wish Amazon would just implement a decompress method, because this seems to come up a lot, but no such luck yet.) Here is my code currently:
private async Task<string> DecompressFile(string bucketName, string keystring)
{
    AmazonS3Client client = new AmazonS3Client();
    Stream fileStream = new MemoryStream();
    string sourceDir = keystring.Split('/')[0];
    GetObjectRequest request = new GetObjectRequest { BucketName = bucketName, Key = keystring };
    try
    {
        using (var response = await client.GetObjectAsync(request))
        using (var arch = new ZipArchive(response.ResponseStream))
        {
            foreach (ZipArchiveEntry entry in arch.Entries)
            {
                fileStream = entry.Open();
                string newFile = sourceDir + "/" + entry.FullName;
                using (Amazon.S3.Transfer.TransferUtility tranute = new Amazon.S3.Transfer.TransferUtility(client))
                {
                    var upld = new Amazon.S3.Transfer.TransferUtilityUploadRequest();
                    upld.InputStream = fileStream;
                    upld.Key = newFile;
                    upld.BucketName = bucketName;
                    await tranute.UploadAsync(upld);
                }
            }
        }
        return $"Decompression complete for {keystring}...";
    }
    catch (Exception e)
    {
        ctxt.Logger.LogInformation($"Error decompressing file {keystring} from bucket {bucketName}. Please check the file and try again.");
        ctxt.Logger.LogInformation(e.Message);
        ctxt.Logger.LogInformation(e.StackTrace);
        throw;
    }
}
The error I keep hitting now is thrown at the write step, await tranute.UploadAsync(upld). The error I'm getting is:
$exception {"This operation is not supported."} System.NotSupportedException
Here are the exception details:
System.NotSupportedException
HResult=0x80131515
Message=This operation is not supported.
Source=System.IO.Compression
StackTrace:
  at System.IO.Compression.DeflateStream.get_Length()
  at Amazon.S3.Transfer.TransferUtilityUploadRequest.get_ContentLength()
  at Amazon.S3.Transfer.TransferUtility.IsMultipartUpload(TransferUtilityUploadRequest request)
  at Amazon.S3.Transfer.TransferUtility.GetUploadCommand(TransferUtilityUploadRequest request, SemaphoreSlim asyncThrottler)
  at Amazon.S3.Transfer.TransferUtility.UploadAsync(TransferUtilityUploadRequest request, CancellationToken cancellationToken)
  at File_Ingestion.Function.<DecompressFile>d__13.MoveNext() in File-Ingestion\Function.cs:line 136
This exception was originally thrown at this call stack:
[External Code]
File_Ingestion.Function.DecompressFile(string, string) in Function.cs
Any help would be greatly appreciated.
Thanks!

I think the problem is that AWS needs to know the length of the file before it can be uploaded, but the stream returned by ZipArchiveEntry.Open doesn't know its length upfront.
See how the exception is thrown when TransferUtilityUploadRequest.ContentLength tries to call DeflateStream.Length (which always throws), where DeflateStream is ultimately the thing returned from ZipArchiveEntry.Open.
(It's slightly odd that DeflateStream doesn't report its own decompressed length. It certainly knows what it should be, but that's only an indication which might be wrong, so maybe it wants to avoid reporting a value which might be incorrect.)
I think what you need to do is buffer the extracted file in memory before passing it to AWS. That way the uncompressed length of the stream is known, and it will be correctly reported by MemoryStream.Length:
using var fileStream = entry.Open();

// Copy the fileStream into an in-memory MemoryStream
using var ms = new MemoryStream();
fileStream.CopyTo(ms);
ms.Position = 0;

string newFile = sourceDir + "/" + entry.FullName;
using (Amazon.S3.Transfer.TransferUtility tranute = new Amazon.S3.Transfer.TransferUtility(client))
{
    var upld = new Amazon.S3.Transfer.TransferUtilityUploadRequest();
    upld.InputStream = ms;
    upld.Key = newFile;
    upld.BucketName = bucketName;
    await tranute.UploadAsync(upld);
}

This could be faster and cleaner:
using var fileStream = entry.Open();
using (var output = new S3UploadStream(_s3Client, "<s3 bucket>", "<key_name>"))
{
    await fileStream.CopyToAsync(output);
}
The S3UploadStream class is taken from https://github.com/mlhpdx/s3-upload-stream.
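For context, here is a minimal sketch of how that streaming approach might slot into the original loop, reusing the names from the question (arch, sourceDir, bucketName, client) and the constructor shown above. This is an illustration only, not tested code:

// Hypothetical sketch: stream each zip entry straight to S3 without
// buffering the whole entry in memory, using S3UploadStream from the repo above.
foreach (ZipArchiveEntry entry in arch.Entries)
{
    string newFile = sourceDir + "/" + entry.FullName;
    using (var entryStream = entry.Open())
    using (var output = new S3UploadStream(client, bucketName, newFile))
    {
        await entryStream.CopyToAsync(output);
    }
}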

Azure blob file got corrupted post uploading file using UploadFromStreamAsync

I tried the code below to upload a file to an Azure blob container, but the uploaded file got corrupted.
public async void UploadFile(Stream memoryStream, string fileName, string containerName)
{
    try
    {
        memoryStream.Position = 0;
        CloudBlockBlob file = GetBlockBlobContainer(containerName).GetBlockBlobReference(fileName);
        file.Metadata["FileType"] = Path.GetExtension(fileName);
        file.Metadata["Name"] = fileName;
        await file.UploadFromStreamAsync(memoryStream).ConfigureAwait(false);
    }
    catch (Exception ex)
    {
        throw ex;
    }
}
How can I resolve this? I'm unable to open the Excel file that was uploaded to the blob using the above code. Here is the calling code:
Stream streamData = ConvertDataSetToByteArray(sourceTable); // sourceTable is the DataTable
streamData.Position = 0;
UploadFile(streamData, "ABCD.xlsx", "sampleBlobContainer"); // calling logic to upload stream to blob

private Stream ConvertDataSetToByteArray(DataTable dataTable)
{
    StringBuilder sb = new StringBuilder();
    IEnumerable<string> columnNames = dataTable.Columns.Cast<DataColumn>()
        .Select(column => column.ColumnName);
    sb.AppendLine(string.Join(",", columnNames));
    foreach (DataRow row in dataTable.Rows)
    {
        IEnumerable<string> fields = row.ItemArray.Select(field => field.ToString());
        sb.AppendLine(string.Join(",", fields));
    }
    var myByteArray = System.Text.Encoding.UTF8.GetBytes(sb.ToString());
    var streamData = new MemoryStream(myByteArray);
    return streamData;
}
Your code above creates a .csv file, not an .xlsx file. You can easily test this by building a similar comma-separated file yourself; if you then rename it to .xlsx, to replicate what your code does, you will hit the same problem described in the question.
You have two solutions:
You either need to build an actual .xlsx file; you can do this with the https://github.com/JanKallman/EPPlus package, for example (see the sketch below),
or
you need to save your file as a .csv, because that's what it really is.
The fact that you upload it to Azure blob storage is completely irrelevant here - there's no issue with the upload.
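If you go with the first option, here is a minimal sketch of building a real .xlsx from the DataTable with EPPlus. This assumes the EPPlus 4.x API from the linked repository; BuildXlsxStream is a hypothetical helper name:

// Hypothetical helper: builds a genuine .xlsx workbook in memory with EPPlus 4.x.
private Stream BuildXlsxStream(DataTable dataTable)
{
    using (var package = new OfficeOpenXml.ExcelPackage())
    {
        var worksheet = package.Workbook.Worksheets.Add("Sheet1");
        // Write the DataTable contents, including a header row
        worksheet.Cells["A1"].LoadFromDataTable(dataTable, true);
        return new MemoryStream(package.GetAsByteArray());
    }
}

The resulting stream can then be passed to UploadFile in place of the CSV stream.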
Since the stream is instantiated outside this method, I assume the file is handled there and added to the stream; however, here you are resetting the stream's position to 0, thus invalidating the file.
First of all, are you sure the file got corrupted? Save both the MemoryStream contents and the blob to local files and compare them. You could also save the MemoryStream contents to a file and use UploadFromFileAsync.
To check for actual corruption you should calculate the content's MD5 hash in advance and compare it with the blob's hash after upload.
To calculate the stream's MD5 hash use ComputeHash.
var hasher = MD5.Create();
memoryStream.Position = 0;
var originalHash = Convert.ToBase64String(hasher.ComputeHash(memoryStream));
To get the client to calculate a blob hash, you need to set the BlobRequestOptions.StoreBlobContentMD5 option while uploading:
memoryStream.Position = 0;
var options = new BlobRequestOptions()
{
    // StoreBlobContentMD5 is a boolean flag, not the hash itself
    StoreBlobContentMD5 = true
};
await file.UploadFromStreamAsync(memoryStream, null, options, null).ConfigureAwait(false);
To retrieve and check the uploaded hash, use FetchAttributes or FetchAttributesAsync and compare the BlobProperties.ContentMD5 value with the original:
file.FetchAttributes();
var blobHash = file.Properties.ContentMD5;
if (blobHash != originalHash)
{
    // Ouch! Retry perhaps?
}
It seems that your method doesn't have any fatal problems. I suspect that part of your stream conversion has gone wrong.
This is my code:
using System;
using System.IO;
using System.Threading.Tasks;
using Microsoft.WindowsAzure.Storage;

namespace ConsoleApp7
{
    class Program
    {
        public static class Util
        {
            public async static Task UploadFile(Stream memoryStream, string fileName, string containerName)
            {
                memoryStream.Position = 0;
                var storageAccount = CloudStorageAccount.Parse("xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx");
                var blockBlob = storageAccount.CreateCloudBlobClient()
                    .GetContainerReference(containerName)
                    .GetBlockBlobReference(fileName);
                await blockBlob.UploadFromStreamAsync(memoryStream);
            }
        }

        static void Main(string[] args)
        {
            // Open the file
            FileStream fileStream = new FileStream("C:\\Users\\bowmanzh\\Desktop\\Book1.xlsx", FileMode.Open);
            // Read the byte[] of the file
            byte[] bytes = new byte[fileStream.Length];
            fileStream.Read(bytes, 0, bytes.Length);
            fileStream.Close();
            // Turn the byte[] into a Stream
            Stream stream = new MemoryStream(bytes);
            Util.UploadFile(stream, "Book2.xlsx", "test").GetAwaiter().GetResult();
            Console.WriteLine("Hello World!");
            Console.ReadLine();
        }
    }
}

Asp.Net Core 2 + Google Cloud Storage download Memory Stream

I'm working on an Asp.Net Core 2 Web API and I have to create an endpoint to download a file. This file is not public, so I cannot use the MediaLink property of the Google Storage object. I'm using their C# library.
In the piece of code you will see below, _storageClient was created like this: _storageClient = StorageClient.Create(cred);. The client is working; I'm just showing which class it is.
[HttpGet("DownloadFile/{clientId}/{fileId}")]
public async Task<IActionResult> DownloadFile([FromRoute] long fileId, long clientId)
{
// here there are a bunch of logic and permissions. Not relevant to the quest
var stream = new MemoryStream();
try
{
stream.Position = 0;
var obj = _storageClient.GetObject("bucket name here", "file.png");
_storageClient.DownloadObject(obj, stream);
var response = File(stream, obj.ContentType, "file.png"); // FileStreamResult
return response;
}
catch (Exception ex)
{
throw;
}
}
The variable obj comes back OK, with all properties filled as expected. The stream seems to be filled properly; it has a length and everything, but I get a 500 error that I cannot even catch.
I cannot see what I'm doing wrong. Maybe it's how I'm using the MemoryStream, but I can't even catch the error.
Thanks for any help
You're rewinding the stream before you've written anything to it, but you're not rewinding it afterwards. I'd expect that to result in an empty response rather than a 500 error, but I'd at least move the stream.Position call to after the download:
var obj = _storageClient.GetObject("bucket name here", "file.png");
_storageClient.DownloadObject(obj, stream);
stream.Position = 0;
Note that you don't need to fetch the object metadata before downloading it. You can just use:
_storageClient.DownloadObject("bucket name here", "file.png", stream);
stream.Position = 0;
A solution could look like the one below.
[HttpGet("get-file")]
public ActionResult GetFile()
{
var storageClient = ...;
byte[] buffer;
using (var memoryStream = new MemoryStream())
{
storageClient.DownloadObject("bucket name here"+"/my-file.jpg", memoryStream);
buffer = memoryStream.ToArray();
}
return File(buffer, "image/jpeg", "my-file.jpg");
}

Web API download locks file

I'm having a small issue with a WebAPI method that downloads a file when the user calls the route of the method.
The method itself is rather simple:
public HttpResponseMessage Download(string fileId, string extension)
{
    var location = ConfigurationManager.AppSettings["FilesDownloadLocation"];
    var path = HttpContext.Current.Server.MapPath(location) + fileId + "." + extension;
    var result = new HttpResponseMessage(HttpStatusCode.OK);
    var stream = new FileStream(path, FileMode.Open);
    result.Content = new StreamContent(stream);
    result.Content.Headers.ContentType = new MediaTypeHeaderValue("application/octet-stream");
    return result;
}
The method works as expected - the first time I call it. The file is transmitted and my browser starts downloading the file.
However - if I call the same URL again from either my own computer or from any other - I get an error saying:
The process cannot access the file
'D:\...\App_Data\pdfs\test-file.pdf' because it is being used by
another process.
This error persists for about a minute - and then I can download the file again - but only once - and then I have to wait another minute or so until the file is unlocked.
Please note that my files are rather large (100-800 MB).
Am I missing something in my method? It almost seems like the stream locks the file for some time or something like that?
Thanks :)
It is because your file is locked by the first stream; you must specify a FileShare value that allows it to be opened by multiple streams:
public HttpResponseMessage Download(string fileId, string extension)
{
    var location = ConfigurationManager.AppSettings["FilesDownloadLocation"];
    var path = HttpContext.Current.Server.MapPath(location) + fileId + "." + extension;
    var result = new HttpResponseMessage(HttpStatusCode.OK);
    var stream = new FileStream(path, FileMode.Open, FileAccess.Read, FileShare.Read);
    result.Content = new StreamContent(stream);
    result.Content.Headers.ContentType = new MediaTypeHeaderValue("application/octet-stream");
    return result;
}
That way you allow multiple streams to open this file for reading only.
See the MSDN documentation on that constructor overload.

Can I get file.OpenStreamForReadAsync() from file.OpenStreamForWriteAsync()?

I am still learning the tricks of read/write file streams and hope someone can tell me whether what I am looking for is feasible.
The code below makes WebApi calls (note GetAsync() on line 2) to get an image file by Id and save the downloaded file to a database with a computed MD5 hash. The code works fine, but in the interest of efficiency I was wondering if it's possible to get file.OpenStreamForReadAsync() from file.OpenStreamForWriteAsync(). I'm not sure this is even possible; I can see some extension methods that operate on a stream, but I've had no luck with my attempts so far. If it is possible, I can avoid saving the file and opening it again by instead making the GetMD5Hash() call within the using (var fileStream = await file.OpenStreamForWriteAsync()) { ... } block.
Can I have the equivalent of Utils.GetMD5Hash(stream); (shown below outside the using block) inside the block, the intention being to avoid re-opening the file outside of the using block?
var client = new HttpClient();
var response = await client.GetAsync(new Uri($"{url}{imageId}")); // call to WebApi; url and imageId defined earlier
if (response.IsSuccessStatusCode)
{
    using (var contentStream = await response.Content.ReadAsInputStreamAsync())
    {
        var stream = contentStream.AsStreamForRead();
        var file = await imagesFolder.CreateFileAsync(imageFileName, CreationCollisionOption.ReplaceExisting);
        using (var fileStream = await file.OpenStreamForWriteAsync())
        {
            await stream.CopyToAsync(fileStream, 4096);
            // >>>> At this point, from the Write fileStream, can I get the equivalent of file.OpenStreamForReadAsync() ??
        }
        var readStream = await file.OpenStreamForReadAsync();
        string md5Hash = await Utils.GetMD5Hash(readStream);
        await AddImageToDataBase(file, md5Hash);
    }
}
A MemoryStream is read/write, so if all you wanted to do was to compute the hash, something like this should do the trick:
var stream = contentStream.AsStreamForRead();
using (var ms = new MemoryStream())
{
    stream.CopyTo(ms);
    ms.Seek(0, SeekOrigin.Begin);
    string md5Hash = await Utils.GetMD5Hash(ms);
}
But since you want to save the file anyway (it's passed to AddImageToDataBase, after all), consider
Save to the MemoryStream
Reset the memory stream
Copy the memory stream to the file stream
Reset the memory stream
Compute the hash
I'd suggest you do performance measurements, though. The OS does cache file I/O, so it's unlikely that you'd actually have to do a physical disk read to compute the hash. The performance gains might not be what you imagine.
Here is a complete answer using MemoryStream, as Petter Hesselberg suggested in his answer. Take note of a couple of tricky situations I had to deal with in order to do what I wanted: (1) make sure fileStream is disposed via a using block, and (2) make sure the MemoryStream is rewound with ms.Seek(0, SeekOrigin.Begin); before it is used, both for the file being saved and for the MemoryStream handed over for computing the MD5 hash.
var client = new HttpClient();
var response = await client.GetAsync(new Uri($"{url}{imageId}")); // call to WebApi; url and imageId defined earlier
if (response.IsSuccessStatusCode)
{
    using (var contentStream = await response.Content.ReadAsInputStreamAsync())
    {
        var stream = contentStream.AsStreamForRead();
        var file = await imagesFolder.CreateFileAsync(imageFileName, CreationCollisionOption.ReplaceExisting);
        using (MemoryStream ms = new MemoryStream())
        {
            await stream.CopyToAsync(ms);
            using (var fileStream = await file.OpenStreamForWriteAsync())
            {
                ms.Seek(0, SeekOrigin.Begin);
                await ms.CopyToAsync(fileStream);
                ms.Seek(0, SeekOrigin.Begin); // rewind for next use below
            }
            string md5Hash = await Utils.GetMD5Hash(ms);
            await AddImageToDataBase(file, md5Hash);
        }
    }
}

Convert .db to binary

I'm trying to convert a .db file to binary so I can stream it across a web server. I'm pretty new to C#. I've gotten as far as looking at code snippets online, but I'm not really sure whether the code below puts me on the right track. How can I write the data once I've read it? Does BinaryReader automatically open up and read the entire file so I can then just write it out in binary format?
class Program
{
    static void Main(string[] args)
    {
        using (FileStream fs = new FileStream("output.bin", FileMode.Create))
        {
            using (BinaryWriter bw = new BinaryWriter(fs))
            {
                long totalBytes = new System.IO.FileInfo("input.db").Length;
                byte[] buffer = null;
                BinaryReader binReader = new BinaryReader(File.Open("input.db", FileMode.Open));
            }
        }
    }
}
Edit: Code to stream the database:
[WebGet(UriTemplate = "GetDatabase/{databaseName}")]
public Stream GetDatabase(string databaseName)
{
    string fileName = "\\\\computer\\" + databaseName + ".db";
    if (File.Exists(fileName))
    {
        FileStream stream = File.OpenRead(fileName);
        if (WebOperationContext.Current != null)
        {
            WebOperationContext.Current.OutgoingResponse.ContentType = "binary/.bin";
        }
        return stream;
    }
    return null;
}
When I call my server, I get nothing back. When I use this same type of method for a content-type of image/.png, it works fine.
All the code you posted will actually do is copy the file input.db to the file output.bin. You could accomplish the same using File.Copy.
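For example, under that reading, the whole program reduces to a one-liner (sketch only):

// Copies input.db to output.bin byte-for-byte.
File.Copy("input.db", "output.bin", overwrite: true);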
BinaryReader will just read in all of the bytes of the file. It is a suitable start to streaming the bytes to an output stream that expects binary data.
Once you have the bytes corresponding to your file, you can write them to the web server's response like this:
using (BinaryReader binReader = new BinaryReader(File.Open("input.db", FileMode.Open)))
{
    byte[] bytes = binReader.ReadBytes(int.MaxValue); // See note below
    Response.BinaryWrite(bytes);
    Response.Flush();
    Response.Close();
    Response.End();
}
Note: The code binReader.ReadBytes(int.MaxValue) is for demonstrating the concept only. Don't use it in production code as loading a large file can quickly lead to an OutOfMemoryException. Instead, you should read in the file in chunks, writing to the response stream in chunks.
See this answer for guidance on how to do that
https://stackoverflow.com/a/8613300/141172
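As an illustration only, here is a minimal sketch of that chunked approach, assuming classic ASP.NET with an HttpResponse available and an arbitrary 64 KB buffer:

// Hypothetical sketch: stream the file to the response in fixed-size chunks
// instead of reading the whole file into memory first.
const int ChunkSize = 64 * 1024;
byte[] buffer = new byte[ChunkSize];
using (FileStream fs = File.OpenRead("input.db"))
{
    int bytesRead;
    while ((bytesRead = fs.Read(buffer, 0, buffer.Length)) > 0)
    {
        Response.OutputStream.Write(buffer, 0, bytesRead);
        Response.Flush();
    }
}
Response.End();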
