Parallel.ForEach memory usage keeps growing - c#

public string SavePath { get; set; } = @"I:\files\";
public void DownloadList(List<string> list)
{
var rest = ExcludeDownloaded(list);
var result = Parallel.ForEach(rest, link=>
{
Download(link);
});
}
private void Download(string link)
{
using(var net = new System.Net.WebClient())
{
var data = net.DownloadData(link);
var fileName = code to generate unique fileName;
if (File.Exists(fileName))
return;
File.WriteAllBytes(fileName, data);
}
}
var downloader = new DownloaderService();
var links = downloader.GetLinks();
downloader.DownloadList(links);
I observed that the project's RAM usage keeps growing.
I guess something is wrong with the Parallel.ForEach() call, but I cannot figure it out.
Is there a memory leak, or what is happening?
Update 1
After changing to the new code:
private void Download(string link)
{
using(var net = new System.Net.WebClient())
{
var fileName = code to generate unique fileName;
if (File.Exists(fileName))
return;
net.DownloadFile(link, fileName); // DownloadFile returns void and writes straight to disk
Track theTrack = new Track(fileName);
theTrack.Title = GetCDName();
theTrack.Save();
}
}
I still observed memory use increasing after running for 9 hours, although it grows much more slowly.
Is it because I didn't free the memory used by theTrack?
By the way, I use the ALT package to update file metadata; unfortunately, it doesn't implement the IDisposable interface.

The Parallel.ForEach method is intended for parallelizing CPU-bound workloads. Downloading a file is an I/O bound workload, and so the Parallel.ForEach is not ideal for this case because it needlessly blocks ThreadPool threads. The correct way to do it is asynchronously, with async/await. The recommended class for making asynchronous web requests is the HttpClient, and for controlling the level of concurrency an excellent option is the TPL Dataflow library. For this case it is enough to use the simplest component of this library, the ActionBlock class:
async Task DownloadListAsync(List<string> list)
{
using (var httpClient = new HttpClient())
{
var rest = ExcludeDownloaded(list);
var block = new ActionBlock<string>(async link =>
{
await DownloadFileAsync(httpClient, link);
}, new ExecutionDataflowBlockOptions()
{
MaxDegreeOfParallelism = 10
});
foreach (var link in rest)
{
await block.SendAsync(link);
}
block.Complete();
await block.Completion;
}
}
async Task DownloadFileAsync(HttpClient httpClient, string link)
{
var fileName = Guid.NewGuid().ToString(); // code to generate unique fileName;
var filePath = Path.Combine(SavePath, fileName);
if (File.Exists(filePath)) return;
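// ResponseHeadersRead returns as soon as the headers arrive, so the body is streamed to disk instead of being buffered in memory first.
var response = await httpClient.GetAsync(link, HttpCompletionOption.ResponseHeadersRead);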
response.EnsureSuccessStatusCode();
using (var contentStream = await response.Content.ReadAsStreamAsync())
using (var fileStream = new FileStream(filePath, FileMode.Create,
FileAccess.Write, FileShare.None, 32768, FileOptions.Asynchronous))
{
await contentStream.CopyToAsync(fileStream);
}
}
The code for downloading a file with HttpClient is not as simple as the WebClient.DownloadFile(), but it's what you have to do in order to keep the whole process asynchronous (both reading from the web and writing to the disk).
Caveat: Asynchronous filesystem operations are currently not implemented efficiently in .NET. For maximum efficiency it may be preferable to avoid using the FileOptions.Asynchronous option in the FileStream constructor.
.NET 6 update: The preferable way for parallelizing asynchronous work is now the Parallel.ForEachAsync API. A usage example can be found here.
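A minimal sketch of what the Parallel.ForEachAsync variant might look like, assuming .NET 6 or later. The class name Downloader, the method name DownloadAllAsync, and the use of a GUID as the file name are assumptions made for illustration, not part of the original answer:

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

public static class Downloader
{
    // Hypothetical helper: downloads every link with at most maxParallel transfers in flight.
    public static async Task DownloadAllAsync(
        HttpClient httpClient, IEnumerable<string> links, string saveDir, int maxParallel = 10)
    {
        var options = new ParallelOptions { MaxDegreeOfParallelism = maxParallel };
        await Parallel.ForEachAsync(links, options, async (link, ct) =>
        {
            // Assumption: a GUID is an acceptable unique file name.
            var filePath = Path.Combine(saveDir, Guid.NewGuid().ToString());

            // ResponseHeadersRead avoids buffering the whole body in memory.
            using var response = await httpClient.GetAsync(
                link, HttpCompletionOption.ResponseHeadersRead, ct);
            response.EnsureSuccessStatusCode();

            await using var content = await response.Content.ReadAsStreamAsync(ct);
            // File.Create opens the file without FileOptions.Asynchronous.
            await using var file = File.Create(filePath);
            await content.CopyToAsync(file, ct);
        });
    }
}
```

It could be called as `await Downloader.DownloadAllAsync(httpClient, rest, SavePath);`, with a single HttpClient instance shared across all downloads.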

Use WebClient.DownloadFile() to download directly to a file so you don't have the whole file in memory.

Related

Streaming large data from different clients at the same time

This is partly an architecture question and partly a code question. I have a lot of source URLs pointing to huge files from many different clients, which I have to download and save to the filesystem.
I have hardware limits on RAM, so I want to buffer each stream in chunks of bytes, and I think it would be a good idea to start one thread for each stream download.
I have added code that starts a thread/task using the Task Parallel Library:
public Task RunTask(Action action)
{
Task task = Task.Run(action);
return task;
}
and I pass the following method as the action parameter:
public void DownloadFileThroughWebStream(WebClient webClient, Uri src, string dest, long buffersize)
{
Stream stream = webClient.OpenRead(src);
byte[] buffer = new byte[buffersize];
int len;
using (BufferedStream bufferedStream = new BufferedStream(stream))
{
using (FileStream fileStream = new FileStream(Path.GetFullPath(dest), FileMode.Create, FileAccess.Write))
{
while ((len = stream.Read(buffer, 0, buffer.Length)) > 0)
{
fileStream.Write(buffer, 0, len);
fileStream.Flush();
}
}
}
}
For testing purposes, I try to download some resources from HTTP URIs by starting a thread/task for each download:
[Test]
public async Task DownloadSomeStream()
{
Uri uri = new Uri("http://mirrors.standaloneinstaller.com/video-sample/metaxas-keller-Bell.mpeg");
List<Uri> streams = new List<Uri> { uri, uri, uri};
List<Task> tasks = new List<Task>();
var path = "C:\\TMP\\";
//Create task for each of the streams from uri
int c = 1;
foreach (var uri in streams)
{
WebClient webClient = new WebClient();
Task task = taskInitiator.RunTask(() => DownloadFileThroughWebStream(webClient, uri, Path.Combine(path,"File"+c), 8192));
tasks.Add(task);
c++;
}
Task allTasksHaveCompleted = Task.WhenAll(tasks);
await allTasksHaveCompleted;
}
I get the following exception:
System.IO.IOException: 'The process cannot access the file 'D:\TMP\File4' because it is being used by another process'
on line:
using (FileStream fileStream = new FileStream(Path.GetFullPath(dest), FileMode.Create, FileAccess.Write))
So there are two things that I don't understand about this exception:
Why is it not allowed to write, and how is another process holding the file?
Why does it want to save File4 when I have only added 3 URLs, so I should only have the files File1, File2, and File3?
I would also appreciate thoughts on some other questions:
Is this the right approach for what I want to achieve? Am I starting the tasks with the Task Parallel Library correctly?
Any tips and tricks, best practices, etc.?
Why is it not allowed to write, and how is another process holding the file?
The file is not locked by another process, but by the same process. If you open a file for write, you basically get an exclusive lock for it. When you try to open the file again for writing from another task, it is locked and that is why you get the error.
To handle this case, you should put a lock around writing the data to disk. You should have a separate lock object for every unique file name you are writing to, and be careful to use the proper lock!
Why does it want to save File4 when I have only added 3 URLs, so I should only have the files File1, File2, and File3?
This is because you capture the variable c in the delegate you pass to Task.Run. Since these tasks normally start after the loop is over, the value of c is now 4. See here for more information about closures.
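One common way around the captured-variable problem is to copy the counter into a variable declared inside the loop body, so each lambda captures its own copy. The sketch below is illustrative only: ClosureDemo and NamesWithLocalCopyAsync are made-up names, and the download is replaced by simply building the file name.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

public static class ClosureDemo
{
    // Returns the file name that each task observed, in the order the tasks were created.
    public static async Task<string[]> NamesWithLocalCopyAsync(int count)
    {
        var tasks = new List<Task<string>>();
        int c = 1;
        for (int i = 0; i < count; i++)
        {
            // A fresh variable per iteration: the lambda captures this copy,
            // not the shared counter c that keeps changing.
            int index = c;
            tasks.Add(Task.Run(() => "File" + index));
            c++;
        }
        return await Task.WhenAll(tasks);
    }
}
```

With the local copy, three tasks see File1, File2 and File3. Capturing c directly would let every task read whatever value c holds when the task finally runs, which is how several tasks can end up opening the same File4 at once.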
We can create a download method that performs the download:
async Task DownloadFile(string url, string location, string fileName)
{
using (var client = new WebClient())
{
await client.DownloadFileTaskAsync(url, $"{location}{fileName}");
}
}
And the above method can be called by Task.Run() to execute simultaneous download of files:
IList<string> urls = new List<string>()
{
@"http://mirrors.standaloneinstaller.com/video-sample/metaxas-keller-Bell.mpeg",
@"https://...",
@"https://..."
};
string location = "D:";
Directory.CreateDirectory(location);
Task.Run(async () =>
{
var tasks = urls.Select(url =>
{
var fileName = url.Substring(url.LastIndexOf('/'));
return DownloadFile(url, location, fileName);
}).ToArray();
await Task.WhenAll(tasks);
}).GetAwaiter().GetResult();

Executing a process on IIS makes RAM goes up really quick

I built an ASP.NET MVC API hosted on IIS on Windows 10 Pro (VM on Azure - 4GB RAM, 2CPU). Within I call an .exe (wkhtmltopdf) that I want to convert an HTML page to image and save it locally. Everything works fine, except I noticed that after some calls to the API, the RAM goes crazy and while investigating the process with Task Manager I saw a process, called IIS Worker Process, that adds more RAM every time the API is called. Of course I wrapped my System.Diagnostics.Process instance usage inside a using statement to be disposed, because IDisposable is implemented, but it still consumes more and more RAM and after a while the server becomes laggy and unresponsive (it has only 4GB of RAM after all). I noticed that after some number of minutes (10-15-20 maybe) this IIS Worker Process calms down in terms of RAM usage... Here is my code, pretty straight forward:
Gets base64 encoded url
Decodes it
Uses wkhtmltoimage.exe to convert it to image
Saves it locally
Reads the byte array
Creates a blob in Azure with the image
Returns json with the url
public async Task<ActionResult> Index(string url)
{
object oJSON = new { url = string.Empty };
if (!string.IsNullOrEmpty(value: url))
{
try
{
byte[] EncodedData = Convert.FromBase64String(s: url);
string DecodedURL = Encoding.UTF8.GetString(bytes: EncodedData);
using (Process proc = new Process())
{
proc.StartInfo.FileName = wkhtmltopdfExecutablePath;
proc.StartInfo.Arguments = $"--encoding utf-8 \"{DecodedURL}\" {LocalImageFilePath}";
proc.Start();
proc.WaitForExit();
oJSON = new { procStatusCode = proc.ExitCode };
}
if (System.IO.File.Exists(path: LocalImageFilePath))
{
byte[] pngBytes = System.IO.File.ReadAllBytes(path: LocalImageFilePath);
System.IO.File.Delete(path: LocalImageFilePath);
string ImageURL = await CreateBlob(blobName: $"{BlobName}.png", data: pngBytes);
oJSON = new { url = ImageURL };
}
}
catch (Exception ex)
{
Debug.WriteLine(value: ex);
}
}
return Json(data: oJSON, behavior: JsonRequestBehavior.AllowGet);
}
private async Task<string> CreateBlob(string blobName, byte[] data)
{
string ConnectionString = "DefaultEndpointsProtocol=https;AccountName=" + AzureStorrageAccountName + ";AccountKey=" + AzureStorageAccessKey + ";EndpointSuffix=core.windows.net";
CloudStorageAccount cloudStorageAccount = CloudStorageAccount.Parse(connectionString: ConnectionString);
CloudBlobClient cloudBlobClient = cloudStorageAccount.CreateCloudBlobClient();
CloudBlobContainer cloudBlobContainer = cloudBlobClient.GetContainerReference(containerName: AzureBlobContainer);
await cloudBlobContainer.CreateIfNotExistsAsync();
BlobContainerPermissions blobContainerPermissions = await cloudBlobContainer.GetPermissionsAsync();
blobContainerPermissions.PublicAccess = BlobContainerPublicAccessType.Container;
await cloudBlobContainer.SetPermissionsAsync(permissions: blobContainerPermissions);
CloudBlockBlob cloudBlockBlob = cloudBlobContainer.GetBlockBlobReference(blobName: blobName);
cloudBlockBlob.Properties.ContentType = "image/png";
using (Stream stream = new MemoryStream(buffer: data))
{
await cloudBlockBlob.UploadFromStreamAsync(source: stream);
}
return cloudBlockBlob.Uri.AbsoluteUri;
}
Here are the resources I've been reading that seem related to this issue, but they are not helping much:
Investigating ASP.Net Memory Dumps for Idiots (like Me)
ASP.NET app eating memory. Application / Session objects the reason?
IIS Worker Process using a LOT of memory?
Run dispose method upon asp.net IIS app restart
IIS: Idle Timeout vs Recycle
UPDATE:
if (System.IO.File.Exists(path: LocalImageFilePath))
{
string BlobName = Guid.NewGuid().ToString(format: "n");
string ImageURL = string.Empty;
using (FileStream fileStream = new FileStream(LocalImageFilePath, FileMode.Open))
{
ImageURL = await CreateBlob(blobName: $"{BlobName}.png", dataStream: fileStream);
}
System.IO.File.Delete(path: LocalImageFilePath);
oJSON = new { url = ImageURL };
}
The most likely cause of your pain is the allocation of large byte arrays:
byte[] pngBytes = System.IO.File.ReadAllBytes(path: LocalImageFilePath);
The easiest change to make, to try and encourage the GC to collect the Large Object Heap more often, is to set GCSettings.LargeObjectHeapCompactionMode to CompactOnce at the end of the method. That might help.
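A minimal sketch of that setting, assuming it is acceptable to force a collection. The helper name LohHelper is made up, and the explicit GC.Collect() call is an addition of this sketch: the answer only mentions setting the mode, which then takes effect at the next blocking generation 2 collection whenever that happens.

```csharp
using System;
using System.Runtime;

public static class LohHelper
{
    // Requests a one-time compaction of the Large Object Heap.
    public static void CompactLargeObjectHeapOnce()
    {
        GCSettings.LargeObjectHeapCompactionMode = GCLargeObjectHeapCompactionMode.CompactOnce;

        // Optional (assumption of this sketch): force a blocking collection now
        // instead of waiting for the runtime to schedule one.
        GC.Collect();
    }
}
```

Forcing collections on every request has a cost of its own, so this is better treated as a diagnostic step than as a permanent fix.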
But, a better idea would be to remove the need for the large array altogether. To do this, change:
private async Task<string> CreateBlob(string blobName, byte[] data)
to instead be:
private async Task<string> CreateBlob(string blobName, FileStream data)
And then later use:
await cloudBlockBlob.UploadFromStreamAsync(source: data);
In the caller, you'll need to stop using ReadAllBytes and use a FileStream to read the file instead.

Creating List of object with byte array : OutOfMemoryException

I have a .NET Core 1.1 Application that is having a problem when generating a List of objects that have a byte array in them. If there are more than 20 items in the list (arbitrary, I'm not sure of the exact number or size at which it fails) the method throws the OutOfMemoryException. The method is below:
public async Task<List<Blob>> GetBlobsAsync(string container)
{
List<Blob> retVal = new List<Blob>();
Blob itrBlob;
BlobContinuationToken continuationToken = null;
BlobResultSegment resultSegment = null;
CloudBlobContainer cont = _cbc.GetContainerReference(container);
resultSegment = await cont.ListBlobsSegmentedAsync(String.Empty, true, BlobListingDetails.Metadata, null, continuationToken, null, null);
do
{
foreach (var bItem in resultSegment.Results)
{
var iBlob = bItem as CloudBlockBlob;
itrBlob = new Blob()
{
Contents = new byte[iBlob.Properties.Length],
Name = iBlob.Name,
ContentType = iBlob.Properties.ContentType
};
await iBlob.DownloadToByteArrayAsync(itrBlob.Contents, 0);
retVal.Add(itrBlob);
}
continuationToken = resultSegment.ContinuationToken;
} while (continuationToken != null);
return retVal;
}
I'm not using anything that can really be disposed in the method. Is there a better way to accomplish this? The ultimate goal is to pull all of these files and then create a ZIP archive. This process works as long as I don't breach some size threshold.
If it helps, the application is accessing Azure Block Blob Storage from an Azure Web Application instance. Maybe there is a setting I need to adjust to increase a threshold?
The exception is thrown when the Blob() object is instantiated.
EDIT:
So the question as posted was admittedly weak in the way of detail. The problem container has 30 files (mostly large text files that compress well). The total size of the container is 971MB. The request runs for approximately 40 seconds before reporting an HTTP 500 error and the referenced exception.
When I debug locally and step through the same operation it succeeds, resulting in a 237MB zip file. During the operation I can see the memory usage shoot over 2GB by the time the list is created.
I tried to abstract the interaction of the blob storage to its own service, but perhaps I've made this more difficult on myself than is necessary.
I found these two code samples that illustrate the concept very well and support your use case:
get list of block blobs in blob container and create ZipOutputStream on-the-fly
add each block blob to a ZipOutputStream (SharpZipLib) writing to Response.OutputStream
ZIP compression level:
zipOutputStream.SetLevel(3); //0-9, 9 being the highest level of compression
End-to-end example using ASP.NET WebApi
a Zip feature can be added to this well-structured application
Further reading
https://www.strathweb.com/2012/09/dealing-with-large-files-in-asp-net-web-api/
https://www.strathweb.com/2013/01/asynchronously-streaming-video-with-asp-net-web-api/
WebAPI StreamContent vs PushStreamContent
Using Sascha's answer, I was able to make a compromise that seems to perform decently given the parameters. Probably not perfect, but it cuts the memory usage by nearly 70% and allows me to keep some abstraction.
I added a method to my blob service called GetBlobsAsZipAsync that accepts a container name as an argument:
public async Task<Stream> GetBlobsAsZipAsync(string container)
{
BlobContinuationToken continuationToken = null;
BlobResultSegment resultSegment = null;
byte[] buffer = new byte[4194304];
MemoryStream ms = new MemoryStream();
CloudBlobContainer cont = _cbc.GetContainerReference(container);
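using (var za = new ZipArchive(ms, ZipArchiveMode.Create, true))
{
do
{
// Fetch each segment inside the loop; otherwise a non-null continuation token would repeat the first segment forever.
resultSegment = await cont.ListBlobsSegmentedAsync(String.Empty, true, BlobListingDetails.Metadata, null, continuationToken, null, null);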
foreach (var bItem in resultSegment.Results)
{
var iBlob = bItem as CloudBlockBlob;
var ze = za.CreateEntry(iBlob.Name);
using (var fs = await iBlob.OpenReadAsync())
{
using (var dest = ze.Open())
{
int count = await fs.ReadAsync(buffer, 0, buffer.Length);
while (count > 0)
{
await dest.WriteAsync(buffer, 0, count);
count = await fs.ReadAsync(buffer, 0, buffer.Length);
}
}
}
}
continuationToken = resultSegment.ContinuationToken;
} while (continuationToken != null);
}
return ms;
}
This returns the Zip as a (closed) MemoryStream that is then returned as an Array using a FileResult:
[HttpPost]
public async Task<IActionResult> DownloadFiles(string container, int projectId, int? profileId)
{
MemoryStream ms = null;
_ctx.Add(new ProjectDownload() { ProfileId = profileId, ProjectId = projectId });
await _ctx.SaveChangesAsync();
using (ms = (MemoryStream)await _blobs.GetBlobsAsZipAsync(container))
{
return File(ms.ToArray(), "application/zip", "download.zip");
}
}
I hope this is useful to someone else who just needs a push in the right direction. I took a lazy way out on this originally and it came back to bite me.
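The remaining memory cost in this compromise is the MemoryStream plus the ToArray() copy. As a rough sketch of how the same loop could write to any output stream instead (a temporary file, for example), here is a version that takes named source streams. ZipStreaming and WriteZipAsync are hypothetical names, and plain streams stand in for the blob streams returned by OpenReadAsync:

```csharp
using System.Collections.Generic;
using System.IO;
using System.IO.Compression;
using System.Threading.Tasks;

public static class ZipStreaming
{
    // Writes each (entry name, source stream) pair into a zip archive on `output`.
    // Data is copied through CopyToAsync's internal buffer, so no source is held fully in memory.
    public static async Task WriteZipAsync(
        Stream output, IEnumerable<KeyValuePair<string, Stream>> sources)
    {
        // leaveOpen: true keeps `output` usable after the archive is finalized.
        using (var archive = new ZipArchive(output, ZipArchiveMode.Create, leaveOpen: true))
        {
            foreach (var source in sources)
            {
                var entry = archive.CreateEntry(source.Key, CompressionLevel.Fastest);
                using (var entryStream = entry.Open())
                {
                    await source.Value.CopyToAsync(entryStream);
                }
            }
        }
    }
}
```

Whether the output can be the HTTP response body directly depends on the host; disposing the archive writes the central directory synchronously, which some server configurations reject.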

The best way load files from Isolated Storage

In my Windows Store app I save and use files (mostly images) in Isolated Storage. When I need to display an image, I use the following:
var file = await folder.GetFileAsync(fileName);
using (var stream = await file.OpenAsync(FileAccessMode.Read))
{
obj.Image = new BitmapImage();
await obj.Image.SetSourceAsync(stream);
}
But when I use 3+ images on the same page, it lags. I'm looking for a faster way to access Isolated Storage files.
You can start with the article Optimize media resources (Windows Store apps using C#/VB/C++ and XAML), plus a couple more things:
Make sure that if you show them in a ListView / GridView, you have enabled virtualization (you use an ItemsPanel that supports virtualization).
If you just need to load images from local storage, bind Image.Source to the right URI (ms-appx:/// or ms-appdata:///local/), and the Image control will do everything for you.
I don't know how you're opening multiple images, but since all the methods are asynchronous you shouldn't iterate through your files sequentially, but open all of them in parallel.
So instead of doing this (where you're waiting for the previous image to load before starting to load the next one):
foreach (var fileName in fileNames)
{
var file = await folder.GetFileAsync(fileName);
using (var stream = await file.OpenAsync(FileAccessMode.Read))
{
obj.Image = new BitmapImage();
await obj.Image.SetSourceAsync(stream);
}
}
You should approach it like this:
// not sure about the type of obj
public async Task LoadImage(string fileName, dynamic obj) // returns Task: the method sets obj.Image and has no result
{
var file = await folder.GetFileAsync(fileName);
using (var stream = await file.OpenAsync(FileAccessMode.Read))
{
obj.Image = new BitmapImage();
await obj.Image.SetSourceAsync(stream);
}
}
var tasks = fileNames.Select(f => LoadImage(f, obj)).ToArray();
await Task.WhenAll(tasks);
This will initialize an array of awaitable tasks loading the images and then await all of them at the same time so that they will execute in parallel.

OutOfMemoryException on MemoryStream writing

I have a little sample application I was working on trying to get some of the new .Net 4.0 Parallel Extensions going (they are very nice). I'm running into a (probably really stupid) problem with an OutOfMemoryException. My main app that I'm looking to plug this sample into reads some data and lots of files, does some processing on them, and then writes them out somewhere. I was running into some issues with the files getting bigger (possibly GB's) and was concerned about memory so I wanted to parallelize things which led me down this path.
Now the below code gets an OOME on smaller files and I think I'm just missing something. It will read in 10-15 files and write them out in parallel nicely, but then it chokes on the next one. It looks like it's read and written about 650MB. A second set of eyes would be appreciated.
I'm reading into a MemoryStream from the FileStream because that is what is needed for the main application and I'm just trying to replicate that to some degree. It reads data and files from all types of places and works on them as MemoryStreams.
This is using .Net 4.0 Beta 2, VS 2010.
namespace ParellelJob
{
class Program
{
BlockingCollection<FileHolder> serviceToSolutionShare;
static void Main(string[] args)
{
Program p = new Program();
p.serviceToSolutionShare = new BlockingCollection<FileHolder>();
ServiceStage svc = new ServiceStage(ref p.serviceToSolutionShare);
SolutionStage sol = new SolutionStage(ref p.serviceToSolutionShare);
var svcTask = Task.Factory.StartNew(() => svc.Execute());
var solTask = Task.Factory.StartNew(() => sol.Execute());
while (!solTask.IsCompleted)
{
}
}
}
class ServiceStage
{
BlockingCollection<FileHolder> outputCollection;
public ServiceStage(ref BlockingCollection<FileHolder> output)
{
outputCollection = output;
}
public void Execute()
{
var di = new DirectoryInfo(@"C:\temp\testfiles");
var files = di.GetFiles();
foreach (FileInfo fi in files)
{
using (var fs = new FileStream(fi.FullName, FileMode.Open, FileAccess.Read))
{
int b;
var ms = new MemoryStream();
while ((b = fs.ReadByte()) != -1)
{
ms.WriteByte((byte)b); //OutOfMemoryException Occurs Here
}
var f = new FileHolder();
f.filename = fi.Name;
f.contents = ms;
outputCollection.TryAdd(f);
}
}
outputCollection.CompleteAdding();
}
}
class SolutionStage
{
BlockingCollection<FileHolder> inputCollection;
public SolutionStage(ref BlockingCollection<FileHolder> input)
{
inputCollection = input;
}
public void Execute()
{
FileHolder current;
while (!inputCollection.IsCompleted)
{
if (inputCollection.TryTake(out current))
{
using (var fs = new FileStream(String.Format(@"c:\temp\parellel\{0}", current.filename), FileMode.OpenOrCreate, FileAccess.Write))
{
using (MemoryStream ms = (MemoryStream)current.contents)
{
ms.WriteTo(fs);
current.contents.Close();
}
}
}
}
}
}
class FileHolder
{
public string filename { get; set; }
public Stream contents { get; set; }
}
}
The main logic seems OK, but if that empty while-loop in Main is literal then you are burning unnecessary CPU cycles. Better use solTask.Wait() instead.
But if individual files can run into gigabytes, you still have the problem of holding at least 1 completely in memory, and usually 2 (1 being read, 1 being processed/written).
PS1: I just realized you don't pre-allocate the MemoryStream. That's bad, because it will have to resize very often for a big file, and that costs a lot of memory. Better use something like:
var ms = new MemoryStream((int)fs.Length); // the capacity parameter is an int, so this only works for files under 2 GB
And then, for big files, you have to consider the Large Object Heap (LOH). Are you sure you can't break a file up in segments and process them?
PS2: And you don't need the ref's on the constructor parameters, but that's not the problem.
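To illustrate the "break a file up in segments" suggestion, here is a small sketch that moves data through a fixed-size buffer instead of loading a whole file into a MemoryStream. ChunkedCopy and its default buffer size are assumptions for illustration; the per-segment processing step is left as a comment because the question does not say what it is.

```csharp
using System.IO;

public static class ChunkedCopy
{
    // Copies `source` to `destination` one segment at a time.
    // Only one buffer of bufferSize bytes is allocated, whatever the stream length.
    public static long Copy(Stream source, Stream destination, int bufferSize = 81920)
    {
        var buffer = new byte[bufferSize];
        long total = 0;
        int read;
        while ((read = source.Read(buffer, 0, buffer.Length)) > 0)
        {
            // Process the segment in buffer[0..read) here if needed, then write it out.
            destination.Write(buffer, 0, read);
            total += read;
        }
        return total;
    }
}
```

A buffer below 85,000 bytes also stays out of the Large Object Heap, and reading in blocks avoids the per-byte ReadByte/WriteByte calls in the question's loop.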
Just looking through quickly, inside your ServiceStage.Execute method you have
var ms = new MemoryStream();
I don't see where you are closing ms out or have it in a using. You do have the using in the other class. That's one thing to check out.
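Neither answer mentions it, but one more thing that may be worth trying is giving the BlockingCollection a bounded capacity, so the reading stage cannot queue up many files' worth of MemoryStreams before the writing stage catches up. The sketch below uses integers in place of FileHolder objects; BoundedPipeline and Run are made-up names.

```csharp
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading.Tasks;

public static class BoundedPipeline
{
    // Producer/consumer pair with at most `capacity` items queued at any time.
    public static List<int> Run(int itemCount, int capacity)
    {
        var consumed = new List<int>();
        using (var queue = new BlockingCollection<int>(boundedCapacity: capacity))
        {
            var producer = Task.Run(() =>
            {
                for (int i = 0; i < itemCount; i++)
                    queue.Add(i); // blocks while the queue is full
                queue.CompleteAdding();
            });
            var consumer = Task.Run(() =>
            {
                // Blocks while the queue is empty; ends once adding is complete and the queue is drained.
                foreach (var item in queue.GetConsumingEnumerable())
                    consumed.Add(item);
            });
            Task.WaitAll(producer, consumer);
        }
        return consumed;
    }
}
```

Add and GetConsumingEnumerable block instead of spinning, so they would also replace the TryAdd call and the TryTake polling loop in the question.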
