I have a C# Azure Function that reads file content from a Blob and writes it to an Azure Data Lake destination. The code works perfectly fine with large files (~8 MB and above), but with small files the destination file is written with 0 bytes. I tried lowering the chunk size and setting the parallel threads to 1, but the behavior remains the same. I am running the code locally from Visual Studio 2017.
Please find the code snippet I am using. I have gone through the documentation on Parallel.ForEach limitations but didn't come across anything specific to file size issues (https://learn.microsoft.com/en-us/dotnet/standard/parallel-programming/potential-pitfalls-in-data-and-task-parallelism).
int bufferLength = 1 * 1024 * 1024; // 1 MB chunk
long blobRemainingLength = blob.Properties.Length;
var outPutStream = new MemoryStream();
Queue<KeyValuePair<long, long>> queues = new Queue<KeyValuePair<long, long>>();
long offset = 0;
while (blobRemainingLength > 0)
{
    long chunkLength = (long)Math.Min(bufferLength, blobRemainingLength);
    queues.Enqueue(new KeyValuePair<long, long>(offset, chunkLength));
    offset += chunkLength;
    blobRemainingLength -= chunkLength;
}

Console.WriteLine("Number of Queues: " + queues.Count);

Parallel.ForEach(queues,
    new ParallelOptions()
    {
        // Gets or sets the maximum number of concurrent tasks
        MaxDegreeOfParallelism = 10
    }, (queue) =>
    {
        using (var ms = new MemoryStream())
        {
            blob.DownloadRangeToStreamAsync(ms, queue.Key, queue.Value).GetAwaiter().GetResult();
            lock (outPutStream)
            {
                var bytes = ms.ToArray();
                Console.WriteLine("Processing on thread {0}", Thread.CurrentThread.ManagedThreadId);
                outPutStream.Write(bytes, 0, bytes.Length);
            }
        }
    });
Appreciate all the help!!
I found the issue with my code: the ADL stream writer was not flushed and disposed properly. After adding the necessary code, parallelization works fine with both small and large files.
Thanks for the suggestions!!
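For anyone hitting the same symptom, here is a minimal sketch of the flush-and-dispose pattern described above. It assumes the Data Lake destination is exposed as a writable Stream; CreateAdlDestinationStream is a hypothetical helper standing in for whatever client call the original code uses, so treat this as an outline rather than the exact fix.
// Hypothetical example: adlStream stands in for the writable stream returned
// by the Data Lake client; the point is the using/Flush pattern.
using (Stream adlStream = CreateAdlDestinationStream()) // assumed helper, not a real SDK call
{
    outPutStream.Position = 0;      // rewind the in-memory content assembled above
    outPutStream.CopyTo(adlStream); // write it to the destination
    adlStream.Flush();              // flush before the using block disposes the stream
} // disposing the stream/writer is what finally commits the data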
I would like to anticipate the exact size of my file before writing it to my device, so that I can handle the error or prevent a crash in case there is no space left on the corresponding drive. So I have this simple console script that generates the file:
using System;
using System.IO;

namespace myNamespace
{
    class Program
    {
        static void Main(string[] args)
        {
            byte[] myByteArray = new byte[100];
            MemoryStream stream = new MemoryStream();
            string fileName = "E:\\myFile.mine";

            FileStream myFs = new FileStream(fileName, FileMode.CreateNew);
            BinaryWriter toStreamWriter = new BinaryWriter(stream);
            BinaryWriter toFileWriter = new BinaryWriter(myFs, System.Text.Encoding.ASCII);

            myFs.Write(myByteArray, 0, myByteArray.Length);
            for (int i = 0; i < 30000; i++)
            {
                toStreamWriter.Write(i);
            }

            Console.WriteLine($"allocated memory: {stream.Capacity}");
            Console.WriteLine($"stream length: {stream.Length}");
            Console.WriteLine($"file size: {(stream.Length / 4) * 4.096}");

            toFileWriter.Write(stream.ToArray());
            Console.ReadLine();
        }
    }
}
I got to the point where I can anticipate the size of the file. It will be (stream.Length / 4) * 4.096, but only as long as the remainder of stream.Length / 4 is 0.
For example, in the case of adding 13589 integers to the stream:
for (int i = 0; i < 13589; i++) {
toStreamWriter.Write(i);
}
I get a file size of 55660,544 bytes from the script, but then it's 57344 bytes in the explorer, the same result as if 14000 integers had been added instead of 13589.
How can I anticipate the exact size of my created file when the remainder of stream.Length / 4 is not 0?
Edit: If you run the script yourself, you need to delete the created file every time the script is run! Of course, use a path and fileName of your choice :)
Regarding the relation (stream.Length / 4) * 4.096, the 4 comes from the byte size of an integer, and I guess the 4.096 comes from the array and file generation; any further explanation would be much appreciated.
Edit 2: Note that if the remaining results are logged with:
for (int i = 13589; i <= 14000; i++) {
Console.WriteLine($"result for {i} : {(i*4 / 4) * 4.096} ");
}
You obtain:
....
result for 13991 : 57307,136
result for 13992 : 57311,232
result for 13993 : 57315,328
result for 13994 : 57319,424
result for 13995 : 57323,52
result for 13996 : 57327,616
result for 13997 : 57331,712
result for 13998 : 57335,808
result for 13999 : 57339,904
result for 14000 : 57344
So I assume that the file size is rounded up to the next cluster boundary that fits the byte stream with no decimal remainder. Would this logic make sense for anticipating the file size, even if the stream is very big?
From what I understand from the comments, the question is about how to get the actual file size, not the file size on disk. And your code is actually almost correct in doing so.
The math is pretty basic. In your example, you create a file stream and write a 100-byte-long array to the file stream. Then you create a memory stream and write 30000 integers into the memory stream. Then you write the memory stream into the file stream. Considering that each integer is 4 bytes long, as specified by C#, the resulting file has a size of (30000 * 4) + 100 = 120100 bytes. At least for me, that's exactly what the file properties say in Windows Explorer.
You could get the same result a bit more easily with the following code:
FileStream myFs = new FileStream("test.file", FileMode.CreateNew);
byte[] myByteArray = new byte[100];
myFs.Write(myByteArray, 0, myByteArray.Length);

BinaryWriter toFileWriter = new BinaryWriter(myFs, System.Text.Encoding.ASCII);
for (int i = 0; i < 30000; i++)
{
    toFileWriter.Write(i);
}

Console.WriteLine($"stream length: {myFs.Length}");
myFs.Close();
This will return a stream length of 120100 bytes.
In case I misunderstood your question and comments and you were actually trying to get the file size on disk:
Don't go there. You cannot reliably predict the file size on disk, because it depends on variable circumstances: file compression, encryption, various RAID types, file systems, disk types, and operating systems.
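Since the stated goal was to avoid a crash when the drive runs out of space, a simpler route may be to check the available free space just before writing rather than predicting the on-disk size. Below is a minimal sketch assuming the destination path and the required logical size are already known; HasEnoughFreeSpace, targetPath, and requiredBytes are illustrative names, not part of the original code:
using System;
using System.IO;

static bool HasEnoughFreeSpace(string targetPath, long requiredBytes)
{
    // DriveInfo reports free space for the drive that contains the target path.
    var drive = new DriveInfo(Path.GetPathRoot(targetPath));

    // Leave some headroom: the on-disk size is rounded up to whole clusters,
    // and other processes may write to the drive at the same time.
    const long headroom = 10L * 1024 * 1024; // 10 MB, an arbitrary safety margin
    return drive.AvailableFreeSpace >= requiredBytes + headroom;
}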
I'm trying to upload large files to an Azure File Share via the Azure.Storage.Files.Shares library and I am running into corruption issues with all media files (images, PDFs, etc.) over ~4 MB. Azure File Share has a limit of 4 MB per request, which is why I've split the upload into multiple chunks, but it still corrupts the files despite every chunk upload returning a 201.
Notes:
It doesn't seem to be an issue with writing multiple chunks per se, as I can write a 3 MB file in as many chunks as I want and it will be totally fine
.txt files over 4 MB have no issues and display totally fine after uploading
The uploading portion of this function is basically copied/pasted from the only other Stack Overflow "solution" I found regarding this issue:
public async Task WriteFileFromStream(string fullPath, MemoryStream stream)
{
    // Get pieces of path
    string dirName = Path.GetDirectoryName(fullPath);
    string fileName = Path.GetFileName(fullPath);

    ShareClient share = new ShareClient(this.ConnectionString, this.ShareName);

    // Set position of the stream to 0 so that we write all contents
    stream.Position = 0;

    try
    {
        // Get a directory client for the specified directory and create the directory if it doesn't exist
        ShareDirectoryClient directory = share.GetDirectoryClient(dirName);
        directory.CreateIfNotExists();
        if (directory.Exists())
        {
            // Get file client
            ShareFileClient file = directory.GetFileClient(fileName);

            // Create file based on stream length
            file.Create(stream.Length);

            int blockSize = 300 * 1024; // can be anything as long as it doesn't exceed 4194304
            long offset = 0; // Define http range offset
            BinaryReader reader = new BinaryReader(stream);
            while (true)
            {
                byte[] buffer = reader.ReadBytes(blockSize);
                if (buffer.Length == 0)
                    break;

                MemoryStream uploadChunk = new MemoryStream();
                uploadChunk.Write(buffer, 0, buffer.Length);
                uploadChunk.Position = 0;

                HttpRange httpRange = new HttpRange(offset, buffer.Length); // offset -> buffer.Length-1 (inclusive)
                var resp = file.UploadRange(httpRange, uploadChunk);
                Console.WriteLine($"Wrote bytes {offset}-{offset + (buffer.Length - 1)} to {fullPath}. Response: {resp.GetRawResponse()}");

                offset += buffer.Length; // Shift the offset by the number of bytes already written
            }
            reader.Close();
        }
        else
        {
            throw new Exception($"Failed to create directory: {dirName}");
        }
    }
    catch (Exception e)
    {
        // Close out memory stream
        throw new Exception($"Error occurred while writing file from stream: {e.Message}");
    }
}
Any help on this is greatly appreciated.
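No answer is recorded here, but one way to narrow down where the corruption happens is to re-download the file right after the chunked upload and compare hashes with the source stream. A minimal diagnostic sketch, assuming the same ShareFileClient (file) and source MemoryStream (stream) are still in scope:
using System.Security.Cryptography;

// Compare an MD5 of what was sent against an MD5 of what the share now holds.
stream.Position = 0;
byte[] localHash;
using (var md5 = MD5.Create())
    localHash = md5.ComputeHash(stream);

byte[] remoteHash;
using (var download = file.Download().Value.Content) // re-download the uploaded file
using (var md5 = MD5.Create())
    remoteHash = md5.ComputeHash(download);

Console.WriteLine($"Local : {Convert.ToBase64String(localHash)}");
Console.WriteLine($"Remote: {Convert.ToBase64String(remoteHash)}");
// If the hashes differ, log each chunk's offset/length to see which range went wrong.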
I'm currently developing for an environment that has poor network connectivity. My application helps to automatically download required Google Drive files for users. It works reasonably well for small files (ranging from 40KB to 2MB), but fails far too often for larger files (9MB). I know these file sizes might seem small, but given my client's network environment, the Google Drive API constantly fails with the 9MB file.
I've concluded that I need to download files in smaller byte chunks, but I don't see how I can do that with Google Drive API. I've read this over and over again, and I've tried the following code:
// with the Drive File ID, and the appropriate export MIME type, I create the export request
var request = DriveService.Files.Export(fileId, exportMimeType);

// take the message so I can modify it by hand
var message = request.CreateRequest();
var client = request.Service.HttpClient;

// I change the Range headers of both the client, and message
client.DefaultRequestHeaders.Range =
    message.Headers.Range =
        new System.Net.Http.Headers.RangeHeaderValue(100, 200);

var response = await request.Service.HttpClient.SendAsync(message);

// if status code = 200, copy to local file
if (response.IsSuccessStatusCode)
{
    using (var fileStream = new FileStream(downloadFileName, FileMode.CreateNew, FileAccess.ReadWrite))
    {
        await response.Content.CopyToAsync(fileStream);
    }
}
The resulting local file (from fileStream), however, is still full length (i.e. a 40KB file for the 40KB Drive file, and a 500 Internal Server Error for the 9MB file). On a side note, I've also experimented with ExportRequest.MediaDownloader.ChunkSize, but from what I observe it only changes the frequency at which the ExportRequest.MediaDownloader.ProgressChanged callback is called (i.e. the callback triggers every 256KB if ChunkSize is set to 256 * 1024).
How can I proceed?
You seem to be heading in the right direction. As per your last comment, the request will update progress based on the chunk size, so your observation was accurate.
Looking into the source code for MediaDownloader in the SDK, the following was found (emphasis mine):
The core download logic. We download the media and write it to an
output stream ChunkSize bytes at a time, raising the ProgressChanged
event after each chunk. The chunking behavior is largely a historical
artifact: a previous implementation issued multiple web requests, each
for ChunkSize bytes. Now we do everything in one request, but the API
and client-visible behavior are retained for compatibility.
Your example code will only download one chunk, from byte 100 to 200. Using that approach, you would have to keep track of an index and download each chunk manually, copying each partial download to the file stream:
const int KB = 0x400;
int ChunkSize = 256 * KB; // 256KB

public async Task ExportFileAsync(string downloadFileName, string fileId, string exportMimeType) {
    var exportRequest = driveService.Files.Export(fileId, exportMimeType);
    var client = exportRequest.Service.HttpClient;
    // you would need to know the file size
    var size = await GetFileSize(fileId);
    using (var file = new FileStream(downloadFileName, FileMode.CreateNew, FileAccess.ReadWrite)) {
        file.SetLength(size);
        var chunks = (size / ChunkSize) + 1;
        for (long index = 0; index < chunks; index++) {
            var request = exportRequest.CreateRequest();
            var from = index * ChunkSize;
            var to = from + ChunkSize - 1;
            request.Headers.Range = new RangeHeaderValue(from, to);
            var response = await client.SendAsync(request);
            if (response.StatusCode == HttpStatusCode.PartialContent || response.IsSuccessStatusCode) {
                using (var stream = await response.Content.ReadAsStreamAsync()) {
                    file.Seek(from, SeekOrigin.Begin);
                    await stream.CopyToAsync(file);
                }
            }
        }
    }
}

private async Task<long> GetFileSize(string fileId) {
    // Size is a nullable long on the Drive v3 file resource (the v2 client calls it FileSize);
    // depending on the API version you may need to request the size field explicitly.
    var file = await driveService.Files.Get(fileId).ExecuteAsync();
    return file.Size ?? 0;
}
This code makes some assumptions about the Drive API/server:
That the server will allow the multiple requests needed to download the file in chunks. I don't know whether requests are throttled.
That the server still accepts the Range header, as stated in the developer documentation.
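For completeness, a hypothetical call site for the helper above might look like this; the local file name and export MIME type are placeholders:
// fileId comes from wherever the application tracks its Drive files.
await ExportFileAsync("report.pdf", fileId, "application/pdf");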
I am using this code to extract a chunk from a file:
// info is FileInfo object pointing to file
var percentSplit = info.Length * 50 / 100; // extract 50% of file
var bytes = new byte[percentSplit];
var fileStream = File.OpenRead(fileName);
fileStream.Read(bytes, 0, bytes.Length);
fileStream.Dispose();
File.WriteAllBytes(splitName, bytes);
Is there any way to speed up this process?
Currently for a 530 MB file it takes around 4 - 5 seconds. Can this time be improved?
There are several cases to your question, but none of them is language-relevant.
Here are some things to consider:
What is the file system of the source/destination file?
Do you want to keep the original source file?
Do they lie on the same drive?
In C#, you will hardly find a method faster than File.Copy, which invokes the CopyFile Win32 API internally. Because the percentage is fifty, however, the following code might not be faster: it copies the whole file and then sets the length of the destination file.
var info = new FileInfo(fileName);
var percentSplit = info.Length * 50 / 100; // extract 50% of file

File.Copy(info.FullName, splitName);
using (var outStream = File.OpenWrite(splitName))
    outStream.SetLength(percentSplit);
Further, if
you don't keep the original source after the file is split,
the destination drive is the same as the source, and
you are not using a crypto/compression-enabled file system,
then the best thing you can do is not copy the file data at all.
For example, if your source file lies on a FAT or FAT32 file system, what you can do is:
create new directory entry(entries) for the newly split parts of the file,
let the entry(entries) point to the cluster of the target part(s),
set the correct file size for each entry, and
check for cross-links and avoid them.
If your file system is NTFS, you might need to spend a long time studying the spec.
Good luck!
var percentSplit = (int)(info.Length * 50 / 100); // extract 50% of file
var buffer = new byte[8192];

using (Stream input = File.OpenRead(info.FullName))
using (Stream output = File.OpenWrite(splitName))
{
    int bytesRead = 1;
    while (percentSplit > 0 && bytesRead > 0)
    {
        // Never read past the split point, and stop early on end of stream.
        bytesRead = input.Read(buffer, 0, Math.Min(percentSplit, buffer.Length));
        output.Write(buffer, 0, bytesRead);
        percentSplit -= bytesRead;
    }
    output.Flush();
}
The flush may not be needed, but it doesn't hurt. Interestingly, changing the loop to a do-while rather than a while had a big hit on performance; I suppose the generated IL is not as fast. My PC was running the original code in 4-6 seconds, while the code above ran in about 1 second.
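The answer doesn't show the slower variant, but the do-while form it compares against presumably looked roughly like this (a reconstruction, not the answerer's actual code):
// Same copy logic restructured as a do-while; the body runs once before the
// conditions are checked, which is the only behavioral difference.
int bytesRead;
do
{
    bytesRead = input.Read(buffer, 0, Math.Min(percentSplit, buffer.Length));
    output.Write(buffer, 0, bytesRead);
    percentSplit -= bytesRead;
} while (percentSplit > 0 && bytesRead > 0);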
I get better results when reading/writing in chunks of a few megabytes. Performance also changes depending on the size of the chunk.
FileInfo info = new FileInfo(@"C:\source.bin");
FileStream f = File.OpenRead(info.FullName);
BinaryReader br = new BinaryReader(f);
FileStream t = File.OpenWrite(@"C:\split.bin");
BinaryWriter bw = new BinaryWriter(t);

long count = 0;
long split = info.Length * 50 / 100;
long chunk = 8000000;

DateTime start = DateTime.Now;
while (count < split)
{
    if (count + chunk > split)
    {
        chunk = split - count;
    }
    bw.Write(br.ReadBytes((int)chunk));
    count += chunk;
}
Console.WriteLine(DateTime.Now - start);
I'm trying to upload very large (>100GB) blobs to Azure using Microsoft.Azure.Storage.Blob (9.4.2). However, it appears that even when using the stream-based blob write API, the library allocates memory proportional to the size of the file (a 1.2GB test file results in a 2GB process memory footprint). I need this to work in constant memory. My code is below (similar results using UploadFromFile, UploadFromStream, etc.):
var container = new CloudBlobContainer(new Uri(sasToken));
var blob = container.GetBlockBlobReference("test");

const int bufferSize = 64 * 1024 * 1024; // 64MB
blob.StreamWriteSizeInBytes = bufferSize;

using (var writeStream = blob.OpenWrite())
{
    using (var readStream = new FileStream(archiveFilePath, FileMode.Open))
    {
        var buffer = new byte[bufferSize];
        var bytesRead = 0;
        while ((bytesRead = readStream.Read(buffer, 0, bufferSize)) != 0)
        {
            writeStream.Write(buffer, 0, bytesRead);
        }
    }
}
This behavior is pretty baffling. I can see in Task Manager that the upload indeed starts right away, so it's not buffering things up waiting to send; there is no reason why it needs to hang on to previously sent data. How does anyone use this API for non-trivial blob uploads?
I suggest you take a look at the BlobStorageMultipartStreamProvider sample, as it shows how a request stream can be "forwarded" to an Azure Blob stream, which might reduce the amount of memory used on the server side while uploading.
Hope it helps!
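If the goal is strictly to bound client-side memory, another option (not mentioned in the answer above, so treat it as a suggestion rather than the accepted fix) is to stage blocks manually with PutBlockAsync and commit them with PutBlockListAsync, reusing a single buffer per block. The sketch below assumes blob is the CloudBlockBlob and archiveFilePath the local file from the question, plus the usual System.* usings and an async context:
// Manual block upload with a bounded buffer.
const int blockSize = 8 * 1024 * 1024; // 8 MB per block, well under the service's per-block limit
var blockIds = new List<string>();
var buffer = new byte[blockSize];

using (var readStream = new FileStream(archiveFilePath, FileMode.Open))
{
    int blockNumber = 0;
    int bytesRead;
    while ((bytesRead = readStream.Read(buffer, 0, blockSize)) > 0)
    {
        // Block IDs must be Base64 strings of equal length within a blob.
        string blockId = Convert.ToBase64String(Encoding.UTF8.GetBytes(blockNumber.ToString("d6")));
        blockIds.Add(blockId);

        using (var blockStream = new MemoryStream(buffer, 0, bytesRead, writable: false))
        {
            await blob.PutBlockAsync(blockId, blockStream, null); // null = no precomputed MD5
        }
        blockNumber++;
    }
}

// Commit the block list; the blob only becomes visible with this content after the commit.
await blob.PutBlockListAsync(blockIds);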