I will try to keep it short and precise.
Requirement:
Download a large (400 MB) XML response from a 3rd party and store it as a ZipArchive on disk.
Current solution:
using (var memoryStream = new MemoryStream())
{
    using (var archive = new ZipArchive(memoryStream, ZipArchiveMode.Create, true))
    {
        var file = archive.CreateEntry($"{deliveryDate:yyyyMMdd}.xml");
        using (var entryStream = file.Open())
        {
            using (var payload = new MemoryStream())
            {
                using var response = await _httpClient.GetAsync(url, HttpCompletionOption.ResponseHeadersRead);
                response.EnsureSuccessStatusCode();
                await response.Content.CopyToAsync(payload);
                payload.Seek(0, SeekOrigin.Begin);
                await payload.CopyToAsync(entryStream);
            }
        }
    }
    using (var fileStream = new FileStream(Path.Combine(filePath), FileMode.Create, FileAccess.Write, FileShare.None))
    {
        memoryStream.Seek(0, SeekOrigin.Begin);
        await memoryStream.CopyToAsync(fileStream);
    }
}
Additional Information:
I can compress a 400 MB file to approx. 20 MB in about 40 seconds; roughly 1/4 of that time is the download and 3/4 is the compression.
The HttpClient is re-used.
The code runs in a long-lived application hosted as a Kubernetes (k8s) Linux pod.
Issues with current solution:
I fail to understand whether this implementation will clean up after itself. I would be thankful for pointers towards potential leaks.
Maybe writing more directly to the FileStream would be faster / cleaner, and the response should be disposed:
using System.IO.Compression;

string url = "https://stackoverflow.com/questions/70605408/better-way-to-process-large-httpclient-response-400mb-to-ziparchive";
string filePath = "test.zip";

using (HttpClient client = new HttpClient())
using (var fileStream = new FileStream(filePath, FileMode.Create, FileAccess.Write, FileShare.None))
using (var archive = new ZipArchive(fileStream, ZipArchiveMode.Create, true))
{
    var file = archive.CreateEntry("test.xml");
    using (var entryStream = file.Open())
    using (var response = await client.GetAsync(url, HttpCompletionOption.ResponseHeadersRead))
    {
        response.EnsureSuccessStatusCode();
        var stream = await response.Content.ReadAsStreamAsync();
        await stream.CopyToAsync(entryStream);
    }
}
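Mapped back onto the original method's context, the same streaming approach might look like the sketch below (it assumes _httpClient, url, deliveryDate and filePath from the question; CompressionLevel.Fastest is an optional trade-off worth testing, since the question reports that 3/4 of the time is spent compressing):
using var fileStream = new FileStream(filePath, FileMode.Create, FileAccess.Write, FileShare.None);
using var archive = new ZipArchive(fileStream, ZipArchiveMode.Create);
// Optionally trade compression ratio for speed with CompressionLevel.Fastest.
var entry = archive.CreateEntry($"{deliveryDate:yyyyMMdd}.xml", CompressionLevel.Optimal);
using var entryStream = entry.Open();
using var response = await _httpClient.GetAsync(url, HttpCompletionOption.ResponseHeadersRead);
response.EnsureSuccessStatusCode();
using var httpStream = await response.Content.ReadAsStreamAsync();
// The response body is compressed as it arrives and written straight to disk,
// so the 400 MB payload is never buffered in memory.
await httpStream.CopyToAsync(entryStream);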
Related
I have very little understanding of C# streams. I'm trying to upload Brotli-compressed JSON into Azure Storage.
private async Task UploadJSONAsync(BlobClient blob, object serializeObject, CancellationToken cancellationToken)
{
    var json = JsonConvert.SerializeObject(serializeObject);
    using (var sourceStream = new MemoryStream(Encoding.UTF8.GetBytes(json)))
    using (var destStream = new MemoryStream())
    using (var brotliStreamCompressor = new BrotliStream(destStream, CompressionLevel.Optimal, false))
    {
        sourceStream.CopyTo(brotliStreamCompressor);
        //brotliStreamCompressor.Close(); // Closes the stream, can't read from a closed stream.
        await blob.DeleteIfExistsAsync();
        await blob.UploadAsync(destStream, cancellationToken);
        //brotliStreamCompressor.Close(); // destStream has zero bytes
    }
}
I'm sure my lack of stream knowledge is preventing this from working.
In order to read the stream, I had to close the BrotliStream first (so it flushes the final compressed block) and set the destination stream's position back to zero.
private async Task UploadJSONAsync(BlobClient blob, object serializeObject, CancellationToken cancellationToken)
{
    var json = JsonConvert.SerializeObject(serializeObject);
    using (var sourceStream = new MemoryStream(Encoding.UTF8.GetBytes(json)))
    using (var destStream = new MemoryStream())
    // leaveOpen: true so that closing the BrotliStream does not also dispose destStream,
    // which still has to be rewound and uploaded below.
    using (var brotliStreamCompressor = new BrotliStream(destStream, CompressionLevel.Optimal, leaveOpen: true))
    {
        sourceStream.CopyTo(brotliStreamCompressor);
        brotliStreamCompressor.Close(); // flushes the final compressed block
        destStream.Position = 0;
        await blob.DeleteIfExistsAsync();
        await blob.UploadAsync(destStream, cancellationToken);
    }
}
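To confirm the blob really contains valid Brotli, one option (just a sketch; the names are illustrative and reuse the blob from the method above) is to download it again and decompress it back to JSON:
using var compressed = new MemoryStream();
await blob.DownloadToAsync(compressed, cancellationToken);
compressed.Position = 0;
using var decompressor = new BrotliStream(compressed, CompressionMode.Decompress);
using var reader = new StreamReader(decompressor, Encoding.UTF8);
string roundTrippedJson = await reader.ReadToEndAsync(); // should match the originally serialized JSON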
The following code creates a zip file from S3 objects by pulling them into memory and writing the final product to a file on disk. However, it has been observed to corrupt a few files (out of thousands) while creating the zip. I've checked that there is nothing wrong with the files which got corrupted during the process, because the same file(s) get zipped properly by other means. Any suggestions to fine-tune the code?
Code:
public static async Task S3ToZip(List<string> pdfBatch, string zipPath, IAmazonS3 s3Client)
{
    FileStream fileStream = new FileStream(zipPath, FileMode.OpenOrCreate, FileAccess.ReadWrite, FileShare.ReadWrite);
    using (ZipArchive archive = new ZipArchive(fileStream, ZipArchiveMode.Update, true))
    {
        foreach (var file in pdfBatch)
        {
            GetObjectRequest request = new GetObjectRequest
            {
                BucketName = "sample-bucket",
                Key = file
            };
            using GetObjectResponse response = await s3Client.GetObjectAsync(request);
            using Stream responseStream = response.ResponseStream;
            ZipArchiveEntry zipFileEntry = archive.CreateEntry(file.Split('/')[^1]);
            using Stream zipEntryStream = zipFileEntry.Open();
            await responseStream.CopyToAsync(zipEntryStream);
            zipEntryStream.Seek(0, SeekOrigin.Begin);
            zipEntryStream.CopyTo(fileStream);
        }
        archive.Dispose();
        fileStream.Close();
    }
}
Don't call Dispose() or Close() explicitly; let using do all the work. You also don't need to write anything to fileStream yourself: writing to the ZipArchiveEntry stream does that under the hood, and copying the entry stream back into fileStream can corrupt the archive. You also need to use FileMode.Create to guarantee that your file is always truncated before writing to it. And since you are only creating the archive, not updating it, you should use ZipArchiveMode.Create to enable memory-efficient streaming (thanks to @canton7 for some deep diving into the details of the zip archive format).
public static async Task S3ToZip(List<string> pdfBatch, string zipPath, IAmazonS3 s3Client)
{
    using FileStream fileStream = new FileStream(zipPath, FileMode.Create, FileAccess.ReadWrite, FileShare.ReadWrite);
    using ZipArchive archive = new ZipArchive(fileStream, ZipArchiveMode.Create, true);
    foreach (var file in pdfBatch)
    {
        GetObjectRequest request = new GetObjectRequest
        {
            BucketName = "sample-bucket",
            Key = file
        };
        using GetObjectResponse response = await s3Client.GetObjectAsync(request);
        using Stream responseStream = response.ResponseStream;
        ZipArchiveEntry zipFileEntry = archive.CreateEntry(file.Split('/')[^1]);
        using Stream zipEntryStream = zipFileEntry.Open();
        await responseStream.CopyToAsync(zipEntryStream);
    }
}
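If corruption is still a concern, a quick sanity check (a sketch, not part of the original answer) is to reopen the finished archive read-only and list its entries:
using ZipArchive check = ZipFile.OpenRead(zipPath);
foreach (ZipArchiveEntry entry in check.Entries)
{
    Console.WriteLine($"{entry.FullName}: {entry.Length} bytes");
}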
What I'm looking for is to zip/compress S3 files without first downloading them to EFS or a file system, and then upload the zip file back to S3. Is there a C# way to achieve this? I found the following post, but I'm not sure of its C# equivalent:
https://www.antstack.io/blog/create-zip-using-lambda-with-files-streamed-from-s3/
I've written the following code to zip files into a MemoryStream:
public static void CreateZip(string zipFileName, List<FileInfo> filesToZip)
{
    // zipFileName is the final zip file name
    LambdaLogger.Log($"Zipping in progress for: {zipFileName}");
    using (MemoryStream zipMS = new MemoryStream())
    {
        using (ZipArchive zipArchive = new ZipArchive(zipMS, ZipArchiveMode.Create, true))
        {
            // loop through files to add
            foreach (var fileToZip in filesToZip)
            {
                // read the file bytes
                byte[] fileToZipBytes = File.ReadAllBytes(fileToZip.FullName);
                ZipArchiveEntry zipFileEntry = zipArchive.CreateEntry(fileToZip.Name);
                // add the file contents
                using (Stream zipEntryStream = zipFileEntry.Open())
                using (BinaryWriter zipFileBinary = new BinaryWriter(zipEntryStream))
                {
                    zipFileBinary.Write(fileToZipBytes);
                }
            }
        }
        using (FileStream finalZipFileStream = new FileStream(zipFileName, FileMode.Create))
        {
            zipMS.Seek(0, SeekOrigin.Begin);
            zipMS.CopyTo(finalZipFileStream);
        }
    }
}
But the problem is how to make it read the files directly from S3 and upload the compressed file back to S3.
public static async Task CreateZipFile(List<List<KeyVersion>> keyVersions)
{
    using MemoryStream zipMS = new MemoryStream();
    using (ZipArchive zipArchive = new ZipArchive(zipMS, ZipArchiveMode.Create, true))
    {
        foreach (var key in keyVersions)
        {
            foreach (var fileToZip in key)
            {
                GetObjectRequest request = new GetObjectRequest
                {
                    BucketName = "dev-s3-zip-bucket",
                    Key = fileToZip.Key
                };
                using GetObjectResponse response = await s3client.GetObjectAsync(request);
                using Stream responseStream = response.ResponseStream;
                ZipArchiveEntry zipFileEntry = zipArchive.CreateEntry(fileToZip.Key);
                // add the file contents
                using Stream zipEntryStream = zipFileEntry.Open();
                await responseStream.CopyToAsync(zipEntryStream);
            }
        }
    } // disposing the archive here finalizes the zip (writes the central directory) into zipMS
    zipMS.Seek(0, SeekOrigin.Begin);
    var fileTxfr = new TransferUtility(s3client);
    await fileTxfr.UploadAsync(zipMS, "dev-s3-zip-bucket", "test.zip");
}
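Note that with this approach the finished archive still lives entirely in zipMS, so the Lambda needs enough memory for the whole zip; TransferUtility then handles the upload to S3 (switching to multipart upload for large payloads) once the archive is complete.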
I'm having a strange problem with this piece of code, which basically zips files (docs) and uploads them to blob storage.
v11 SDK:
var blockBlobClient = new BlockBlobClient(ConnectionString, ContainerName, "test-blob.zip");

// Saved zip is valid
// using (FileStream zipStream = new FileStream(@"C:\Users\artur\Desktop\test-local.zip", FileMode.OpenOrCreate))
// Uploaded zip is invalid
using (var stream = await blockBlobClient.OpenWriteAsync(true))
using (var archive = new ZipArchive(stream, ZipArchiveMode.Create))
{
    var readmeEntry = archive.CreateEntry("Readme.txt");
    using (StreamWriter writer = new StreamWriter(readmeEntry.Open()))
    {
        writer.WriteLine("Information about this package.");
        writer.WriteLine("========================");
    }
    await stream.FlushAsync();
}
v12 SDK:
var blobClient = new BlobClient(ConnectionString, InputContainerName, "test-blob.zip");

using var stream = new MemoryStream();
using (var archive = new ZipArchive(stream, ZipArchiveMode.Create))
{
    var readmeEntry = archive.CreateEntry("Readme.txt");
    using StreamWriter writer = new StreamWriter(readmeEntry.Open());
    {
        writer.WriteLine("Information about this package.");
        writer.WriteLine("========================");
        await writer.FlushAsync();
    }
    stream.Position = 0;
    await blobClient.UploadAsync(stream, true);
    await stream.FlushAsync();
}
Saving the zip file locally produces a valid zip (164 bytes). Saving the zip to blob storage (using the storage emulator) produces an invalid zip (102 bytes).
I can't figure out why.
Here is the corrected code.
The problem was premature disposal of the inner stream by ZipArchive. Note that in my code below I have passed leaveOpen as true while creating the ZipArchive, since we are already disposing the stream in the outer using. Also, for the v11 code I have switched to a MemoryStream instead of the blob's OpenWrite stream, since there is no way to set the stream position back to 0 when using OpenWrite. And you don't need any Flush :)
v11 SDK:
var blockBlobClient = new BlockBlobClient(ConnectionString, ContainerName, "test-blob.zip");

using var stream = new MemoryStream();
using (var archive = new ZipArchive(stream, ZipArchiveMode.Create, true))
{
    var readmeEntry = archive.CreateEntry("Readme.txt");
    using (StreamWriter writer = new StreamWriter(readmeEntry.Open()))
    {
        writer.WriteLine("Information about this package.");
        writer.WriteLine("========================");
    }
}
stream.Position = 0;
await blockBlobClient.UploadAsync(stream);
v12 SDK:
var blobClient = new BlobClient(ConnectionString, InputContainerName, "test-blob.zip");

using var stream = new MemoryStream();
using (var archive = new ZipArchive(stream, ZipArchiveMode.Create, true))
{
    var readmeEntry = archive.CreateEntry("Readme.txt");
    using (StreamWriter writer = new StreamWriter(readmeEntry.Open()))
    {
        writer.WriteLine("Information about this package.");
        writer.WriteLine("========================");
    }
}
stream.Position = 0;
await blobClient.UploadAsync(stream, true);
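The key detail is that ZipArchive only writes the zip's central directory when it is disposed, so the MemoryStream does not contain a complete archive until the inner using block has ended; that is why stream.Position = 0 and the upload happen outside it.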
I am trying to take a file, split it into pieces, and then push each new smaller file piece to Azure. I have tried writing a MemoryStream to Azure, but that causes the file to upload immediately, and the uploaded file is basically empty. I have tried using a BufferedStream, which allows the data to be sent as I am writing to it, but I am not sure how to end the stream. I have tried closing each of the different streams I am using, but that does not work, as it results in a stream-closed exception. Any idea how to mark the stream as complete so the Azure library will know to finish the file upload?
It does work to wait until the full file is built and then upload the memory stream, but I would like to be able to write to it while it is uploading, if possible.
CloudBlobClient blobClient = StorageAccount.CreateCloudBlobClient();
CloudBlobContainer blobContainer = blobClient.GetContainerReference("containerName");

using (FileStream fileStream = File.Open(path, FileMode.Open))
{
    int key = 0;
    CsvWriter csvWriter = null;
    MemoryStream memoryStream = null;
    BufferedStream bufferedStream = null;
    StreamWriter streamWriter = null;
    Task uploadTask = null;

    using (var reader = new StreamReader(fileStream))
    using (var csv = new CsvReader(reader, CultureInfo.InvariantCulture))
    {
        csv.Read();
        csv.ReadHeader();
        await foreach (MyModel row in csv.GetRecordsAsync<MyModel>())
        {
            if (row.KeyColumn != key)
            {
                if (memoryStream != null)
                {
                    // Wait for the current upload to finish
                    await csvWriter.FlushAsync();
                    csvWriter.Dispose();
                    await uploadTask;
                }
                // Start a new upload
                key = row.KeyColumn;
                memoryStream = new MemoryStream();
                bufferedStream = new BufferedStream(memoryStream);
                streamWriter = new StreamWriter(bufferedStream);
                csvWriter = new CsvWriter(streamWriter, CultureInfo.InvariantCulture);
                csvWriter.WriteHeader<MyModel>();
                await csvWriter.FlushAsync();
                CloudBlockBlob blockBlob = blobContainer.GetBlockBlobReference($"file_{key}.csv");
                uploadTask = blockBlob.UploadFromStreamAsync(bufferedStream);
            }
            csvWriter.WriteRecord(row);
            await csvWriter.FlushAsync();
        }
        if (memoryStream != null)
        {
            await csvWriter.FlushAsync();
            csvWriter.Dispose();
            await uploadTask;
        }
    }
}
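One way to sidestep the "how do I end the stream" problem (a sketch only, not from the original thread; it reuses blobContainer, key and MyModel from the code above) is to write through the blob's own write stream obtained from CloudBlockBlob.OpenWriteAsync. The upload is finalized when that stream is disposed, which commits the block blob, so there is no separate completion signal to send and no MemoryStream or BufferedStream is needed:
CloudBlockBlob blockBlob = blobContainer.GetBlockBlobReference($"file_{key}.csv");
using (CloudBlobStream blobStream = await blockBlob.OpenWriteAsync())
using (var writer = new StreamWriter(blobStream))
using (var csvOut = new CsvWriter(writer, CultureInfo.InvariantCulture))
{
    csvOut.WriteHeader<MyModel>();
    csvOut.NextRecord();
    // ... write each MyModel row for the current key here ...
} // disposing csvOut, writer and blobStream (in that order) flushes everything and commits the blob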