I've wrapped a SqlDataReader as an IEnumerable using a yield statement. I'd like to use this to dump to a file, but I'm seeing some pretty heavy memory utilization. I was wondering if anyone had any ideas on how to do this with minimal or fixed memory utilization. I don't mind specifying a buffer; I'd just like to know what it will be before I unleash this on an unsuspecting server.
I've been using something like the following:
class Program
{
    static void Main(string[] args)
    {
        var fs = File.Create("c:\\somefile.txt");
        var sw = new StreamWriter(fs);
        foreach (var asdf in Enumerable.Range(0, 500000000))
        {
            sw.WriteLine("adsfadsf"); // Data from Reader
        }
        sw.Close();
    }
}
string commandText = #"SELECT name FROM {0} WHERE name NOT LIKE '%#%'";
SqlCommand sqlCommand = new SqlCommand(string.Format(commandText, list.TableName.SQLEncapsulate()),
_connection);
using (SqlDataReader sqlDataReader = sqlCommand.ExecuteReader())
{
while (sqlDataReader.Read())
{
yield return sqlDataReader["name"].ToString();
}
}
Some heavy memory throughput is not a problem, and is unavoidable when you process a lot of data.
The data that you read will be allocated as new objects on the heap, but they are short lived objects as you just read the data, write it, then throw it away.
The memory management in .NET doesn't try to keep the memory usage as low as possible, as having a lot of unused memory doesn't make the computer faster. When you create and release objects, they will just be abandoned on the heap for a while, and the garbage collector cleans them up eventually.
It's normal for a .NET application to use a lot of memory when you are doing some heavy data processing, and the memory usage will drop again after a while. If there is some other part of the system that needs the memory, the garbage collector will do a more aggressive collection to free up as much as possible.
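If you want the buffering to be explicit rather than whatever the defaults happen to be, you can pass buffer sizes to the stream and the writer. Here is a minimal sketch of consuming the iterator; GetNames() is a stand-in for the yield-based wrapper above, and the 64 KB sizes are illustrative values, not tuned numbers:

using System.Collections.Generic;
using System.IO;
using System.Text;

class Exporter
{
    // "names" is assumed to be the yield-based SqlDataReader wrapper shown above (GetNames()).
    static void Dump(IEnumerable<string> names)
    {
        // Explicit buffer sizes so the memory cost is known up front:
        // a 64 KB FileStream buffer and a 64K-char StreamWriter buffer.
        using (var fs = new FileStream("c:\\somefile.txt", FileMode.Create, FileAccess.Write,
                                       FileShare.None, bufferSize: 64 * 1024))
        using (var sw = new StreamWriter(fs, Encoding.UTF8, bufferSize: 64 * 1024))
        {
            foreach (var name in names)
            {
                sw.WriteLine(name); // each row is written, then becomes short-lived garbage
            }
        }
    }
}

With this shape, only the stream and writer buffers plus the current row are alive at any one time; everything else is short-lived gen 0 garbage that the collector reclaims cheaply.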
I have a block of code that exports a set of records (over 200k records) to a CSV. A large memory chunk is allocated during the CSV writing, and for some reason, it's not freed by the garbage collector after returning the CSV file.
I have tried using GC.Collect(), but I know it just won't work for large heap objects. GC.WaitForPendingFinalizers() right after GC.Collect() does not release all memory that was allocated during this operation either.
using (var memoryStream = new MemoryStream())
{
    using (var streamWriter = new StreamWriter(memoryStream))
    using (var csvWriter = new CsvWriter(streamWriter))
    {
        csvWriter.Configuration.Delimiter = _configuration["Csv:Delimiter"];
        csvWriter.Configuration.CultureInfo = new CultureInfo(_configuration["Csv:Culture"]);
        if (!noClassMap)
        {
            csvWriter.Configuration.RegisterClassMap<U>();
        }
        // When writing 200,000 records, 500MB get allocated here, and never freed.
        // Details is an IQueryable that returns a 200,000 records result.
        csvWriter.WriteRecords(details);
    }
    return new SpreadsheetFile { FileBytes = memoryStream.ToArray(), FileName = fileName + DateTime.Now.ToShortDateString() + ".csv" };
}
I know it's expected that while the memory stream is being written, the whole list of objects gets allocated in memory.
However, shouldn't this be freed after the Web API returns the actual CSV? I'm not sure if this is expected, or if this is some kind of memory leak (running .NET Core 2.2.3).
return new SpreadsheetFile { FileBytes = memoryStream.ToArray()...
memoryStream.ToArray() copies all of the data into a new array in memory, and the FileBytes property of SpreadsheetFile keeps a reference to it; that's why the data is not freed by the GC.
Use a memory profiler to see what keeps the instance of SpreadsheetFile in memory if you think it should no longer be reachable.
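If holding the whole file as a byte[] is the real cost, one option (a sketch only, not from the original post; the temp-file approach and the SpreadsheetStream type are assumptions about what the endpoint could return instead) is to write the CSV to a temporary file and hand back a stream, so the full payload never sits in a single managed array:

// Sketch: requires System.IO, System.Globalization and CsvHelper, mirroring the code above.
var tempPath = Path.GetTempFileName();
using (var fileStream = new FileStream(tempPath, FileMode.Create, FileAccess.Write))
using (var streamWriter = new StreamWriter(fileStream))
using (var csvWriter = new CsvWriter(streamWriter))
{
    csvWriter.Configuration.Delimiter = _configuration["Csv:Delimiter"];
    csvWriter.Configuration.CultureInfo = new CultureInfo(_configuration["Csv:Culture"]);
    csvWriter.WriteRecords(details); // rows stream to disk instead of a MemoryStream
}

// SpreadsheetStream is a hypothetical sibling of SpreadsheetFile that carries a Stream.
return new SpreadsheetStream
{
    Content = new FileStream(tempPath, FileMode.Open, FileAccess.Read,
                             FileShare.Read, 4096, FileOptions.DeleteOnClose),
    FileName = fileName + DateTime.Now.ToShortDateString() + ".csv"
};

In ASP.NET Core the returned stream could be wrapped in a FileStreamResult, which disposes it after the response is written; with DeleteOnClose the temp file is removed at the same time.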
I've been using the methodology outlined by Shivprasad Koirala to check for memory leaks from code running inside a C# application (VoiceAttack). It basically involves using the Performance Monitor to track an application's private bytes as well as bytes in all heaps, and comparing these counters to assess whether there is a leak and of what type (managed/unmanaged). Ideally I need to test outside of Visual Studio, which is why I'm using this method.
The following portion of code generates the below memory profile (bear in mind the code has a little different format compared to Visual Studio because this is a function contained within the main C# application):
public void main()
{
    string FilePath = null;
    using (FileDialog myFileDialog = new OpenFileDialog())
    {
        myFileDialog.Title = "this is the title";
        myFileDialog.FileName = "testFile.txt";
        myFileDialog.Filter = "txt files (*.txt)|*.txt|All files (*.*)|*.*";
        myFileDialog.FilterIndex = 1;
        if (myFileDialog.ShowDialog() == DialogResult.OK)
        {
            FilePath = myFileDialog.FileName;
            var extension = Path.GetExtension(FilePath);
            var compareType = StringComparison.InvariantCultureIgnoreCase;
            if (extension.Equals(".txt", compareType) == false)
            {
                FilePath = null;
                VA.WriteToLog("Selected file is not a text file. Action canceled.");
            }
            else
                VA.WriteToLog(FilePath);
        }
        else
            VA.WriteToLog("No file selected. Action canceled.");
    }
    VA.WriteToLog("done");
}
You can see that after running this code the private bytes don't come back to the original count and the bytes in all heaps are roughly constant, which implies that a portion of unmanaged memory was not released. Running this same inline function a few times consecutively doesn't cause further increases to the maximum observed private bytes or the unreleased memory. Once the main C# application (VoiceAttack) closes, all the related memory (including the memory for the above code) is released. The bad news is that under normal circumstances the main application may be kept running indefinitely by the user, causing the allocated memory to remain unreleased.
For good measure I threw this same code into VS (with a pair of Thread.Sleep(5000) added before and after the using block for better graphical analysis) and built an executable to track with the Performance Monitor method, and the result is the same. There is an initial unmanaged memory jump for the OpenFileDialog and the allocated unmanaged memory never comes back down to the original value.
Does the memory and leak tracking methodology outlined above make sense? If YES, is there anything that can be done to properly release the unmanaged memory?
Does the memory and leak tracking methodology outlined above make sense?
No. You shouldn't expect unmanaged committed memory (Private Bytes) to always be released. For instance, processes have an unmanaged heap that is retained to allow for subsequent allocations. And since Windows can page out your committed memory, it isn't critical to minimize each process's committed memory.
If repeated calls don't increase memory use, you don't have a memory leak, you have delayed initialization. Some components aren't initialized until you use them, so their memory usage isn't being taken into account when you establish your baseline.
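One way to tell delayed initialization apart from a real leak is to do a warm-up call before taking the baseline and then watch the counters over repeated runs. Here is a minimal console sketch using the same counter names the methodology relies on; RunDialog() is a placeholder for the OpenFileDialog code above, and the instance-name lookup assumes a single process with that name:

using System;
using System.Diagnostics;

class LeakCheck
{
    static void Main()
    {
        string name = Process.GetCurrentProcess().ProcessName;
        using (var privateBytes = new PerformanceCounter("Process", "Private Bytes", name))
        using (var managedBytes = new PerformanceCounter(".NET CLR Memory", "# Bytes in all Heaps", name))
        {
            RunDialog();                      // warm-up: pays the one-time initialization cost
            Report("baseline", privateBytes, managedBytes);

            for (int i = 0; i < 5; i++)
            {
                RunDialog();                  // a genuine leak shows up as steady growth here
                Report("iteration " + i, privateBytes, managedBytes);
            }
        }
    }

    static void Report(string label, PerformanceCounter priv, PerformanceCounter managed)
    {
        Console.WriteLine("{0}: private={1:N0} managed={2:N0}", label, priv.NextValue(), managed.NextValue());
    }

    static void RunDialog()
    {
        // placeholder for the OpenFileDialog code shown above
    }
}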
I inherited a Windows service that accepts requests via remoting to analyze potentially huge amounts of data in a database. It does this by loading the raw data into a DataSet, parsing the data into a tree-like structure of objects, and then running the analysis.
The problem I am encountering is that if a very large set of data is analyzed, not all of the memory is returned to the system after the analysis is done, even if I aggressively force garbage collection. For example, if a 500MB set of data is analyzed, the windows service goes from ~10MB (the baseline at startup), to ~500MB, then down to ~200MB after GC.Collect(), and never goes any lower, even overnight. The only way to get the memory back down is to stop and restart the service. But if I run a small analysis, the service goes from ~10MB, to ~50MB, then down to something like ~20MB. Not great either, but there is a huge discrepancy between the final utilization between large and small data after the analysis is done.
This is not a memory leak per se because if I run the large analysis over and over, the total memory goes back down to ~200MB every time it completes.
This is a problem because the windows service runs on a shared server and I can't have my process taking up loads of memory all the time. It's fine if it spikes and then goes back down after the analysis is done, but this spikes and goes partially down to an unacceptably high number. A typical scenario is running an analysis and then sitting idle for hours.
Unfortunately, this codebase is very large and a huge portion of it is coded to work with datatables returned by a proprietary data access layer, so using an alternate method to load the data is not an option (I wish I could, loading all the data into memory just to loop over it makes no sense).
So my questions are:
1) Why does running a large dataset cause the memory utilization to settle back down to ~200MB, but running the small dataset causes the memory utilization to settle back down to ~20MB? It's obviously hanging on to pieces of the dataset somehow, I just can't see where.
2) Why does it make a difference if I loop over the data table's rows or not (see below)?
3) How can I get/force the memory back down to reasonable levels when the analysis is done, without radically changing the architecture?
I created a small Windows service/client app to reproduce the problem. The test database I am using has a table with a million records, an int PK, and two string fields. Here are the scenarios I have tried; the client (console app) calls LoadData via remoting ten times in a loop.
1) doWork = true, garbageCollect = true, recordCount = 100,000. Memory goes up to 78MB then stabilizes at 22MB.
2) doWork = false, garbageCollect = true, recordCount = 100,000. Memory goes up to 78MB and stabilizes at 19MB. Seriously, 3MB more to loop over the rows without doing anything?
3) doWork = false, garbageCollect = false, recordCount = 100,000. Memory goes up to about 178MB then stabilizes at 78MB. Forcing garbage collection is obviously doing something, but not enough for my needs.
4) doWork = false, garbageCollect = true, recordCount = 1,000,000. Memory goes up to 500MB and stabilizes at 35MB. Why does it stabilize at a higher number when the dataset is larger?
5) doWork = false, garbageCollect = true, recordCount = 1,000. It runs too fast to see the peak but it stabilizes at a measly 12MB.
public string LoadData(bool doWork, bool garbageCollect, int recordCount)
{
    var dataSet = new DataSet();
    using (var sqlConnection = new SqlConnection("...blah..."))
    {
        sqlConnection.Open();
        using (var dbCommand = sqlConnection.CreateCommand())
        {
            dbCommand.CommandText = string.Format("select top {0} * from dbo.FakeData", recordCount.ToString());
            dbCommand.CommandType = CommandType.Text;
            using (var dbReader = new SqlDataAdapter(dbCommand))
            {
                dbReader.Fill(dataSet);
            }
        }
        sqlConnection.Close();
    }

    // loop over the records
    var count = dataSet.Tables[0].Rows.Count;
    if (doWork)
    {
        foreach (DataRow row in dataSet.Tables[0].Rows) {}
    }

    dataSet.Clear();
    dataSet = null;

    if (garbageCollect)
    {
        GC.Collect();
        GC.WaitForPendingFinalizers();
        GC.Collect();
    }

    return string.Format("Record count is {0}", count);
}
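One knob that may be relevant to question 3 (this is a suggestion, not something from the original post, and it requires .NET Framework 4.5.1 or later): the large DataSet buffers end up on the large object heap, and you can ask the GC to compact the LOH on the next full collection. A minimal sketch:

using System;
using System.Runtime;

static class GcHelper
{
    // Sketch: request a one-off LOH compaction before forcing a collection.
    // CompactOnce only applies to the next blocking full GC and then resets to Default.
    public static void CollectAndCompactLoh()
    {
        GCSettings.LargeObjectHeapCompactionMode = GCLargeObjectHeapCompactionMode.CompactOnce;
        GC.Collect();
        GC.WaitForPendingFinalizers();
        GC.Collect();
    }
}

Even with compaction, the CLR may keep freed segments reserved for the process for a while rather than handing them straight back to the OS, so Task Manager numbers can still lag behind.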
I am following up from this question here
The problem I have is that I have some large objects coming from an MSMQ queue, mainly Strings. I have narrowed down my memory problems to these objects being created on the Large Object Heap (LOH) and therefore fragmenting it (confirmed with some help from a profiler).
In the question linked above I got some workarounds, mainly in the form of splitting the String into char arrays, which I did.
The problem I am facing is that at the end of the string processing (in whatever form that is) I need to send that string to another system which I have no control over. So I was thinking of the following solutions to avoid having this String placed in the LOH:
Represent it as an array of char arrays, each smaller than 85 KB (the threshold for objects to be placed on the LOH)
Compress it on the sender's end (i.e. before it reaches the system we are talking about here, which is the receiver) and decompress it only just before passing it to the third-party system.
Whatever I do, one way or another the String will eventually have to be complete (not split into char arrays or compressed).
Am I stuck here? I am wondering whether using a managed environment was a mistake here and whether we should bite the bullet and go for a C++ kind of environment.
Thanks,
Yannis
EDIT: I have narrowed down the problem to exactly the code posted here
The large string that comes through is placed in the LOH. I have removed every single processing module from the point where I receive the message onwards, and the memory consumption trend remains the same.
So I guess I need to change the way this WorkContext is passed around between systems.
Well, your options depend on how the 3rd party system is receiving data. If you can stream to it somehow then you don't have to have it all in memory in one go. If that is the case then compressing (which will probably really help your network load if it's easily compressible data) is great, as you can decompress through a stream and punt it to the 3rd party system in chunks.
The same of course would work if you split your strings up to go below LoH threshold.
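For the compression route, something along these lines would let you decompress and forward in chunks without ever materializing the whole string; this is a sketch, and the source and destination streams are placeholders for the MSMQ body and whatever the 3rd party system accepts:

using System.IO;
using System.IO.Compression;

static class ChunkedForwarder
{
    // Decompress 'compressedSource' and copy it to 'destination' in fixed-size
    // chunks, so only one 64 KB buffer is ever held at a time (stays off the LOH).
    public static void Forward(Stream compressedSource, Stream destination)
    {
        using (var gzip = new GZipStream(compressedSource, CompressionMode.Decompress))
        {
            var buffer = new byte[64 * 1024];
            int read;
            while ((read = gzip.Read(buffer, 0, buffer.Length)) > 0)
            {
                destination.Write(buffer, 0, read);
            }
        }
    }
}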
If not, then I would still advocate splitting the payload on the MSMQ message, and then using a memory pool of preallocated and reused byte arrays for the re-assembly before sending it to the client. Microsoft has an implementation you can use: http://msdn.microsoft.com/en-us/library/system.servicemodel.channels.buffermanager.aspx
The final option I can think of, is to handle the msmq deserialisation in unmanaged code in C++ and create your own custom large block memory pool using placement new to deserialise the strings into that. You could keep it relatively simple by ensuring your pool buffers are sufficient for the longest message possible rather than trying to be clever and dynamic which is hard.
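A rough sketch of the BufferManager idea; the pool and chunk sizes here are made up for illustration:

using System.ServiceModel.Channels;

static class ChunkPool
{
    // Pool up to 16 MB of buffers, with no single buffer larger than 64 KB
    // (kept under the 85,000-byte LOH threshold on purpose).
    private static readonly BufferManager Pool =
        BufferManager.CreateBufferManager(maxBufferPoolSize: 16 * 1024 * 1024,
                                          maxBufferSize: 64 * 1024);

    public static byte[] Rent(int size)
    {
        return Pool.TakeBuffer(size);      // may return a buffer larger than requested
    }

    public static void Return(byte[] buffer)
    {
        Pool.ReturnBuffer(buffer);         // hand the buffer back for reuse
    }
}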
You can try streaming the values using a StringBuilder (the 4.0 version that uses a rope-like implementation).
This example must be executed in Release mode and started with Start Without Debugging (CTRL-F5). Both Debug builds and running with the debugger attached interfere too much with the GC.
public class SerializableWork
{
    // This is very often between 100-120k bytes. This is actually a String - not just for the purposes of this example
    public String WorkContext { get; set; }

    // This is quite large as well but usually less than 85k bytes. This is actually a String - not just for the purposes of this example
    public String ContextResult { get; set; }
}

class Program
{
    static void Main(string[] args)
    {
        Console.WriteLine("Initial memory: {0}", GC.GetTotalMemory(true));
        var sw = new SerializableWork { WorkContext = new string(' ', 1000000), ContextResult = new string(' ', 1000000) };
        Console.WriteLine("Memory with objects: {0}", GC.GetTotalMemory(true));

        using (var mq = new MessageQueue(@".\Private$\Test1"))
        {
            mq.Send(sw);
        }
        sw = null;
        Console.WriteLine("Memory after collect: {0}", GC.GetTotalMemory(true));

        using (var mq = new MessageQueue(@".\Private$\Test1"))
        {
            StringBuilder sb1, sb2;
            using (var msg = mq.Receive())
            {
                Console.WriteLine("Memory after receive: {0}", GC.GetTotalMemory(true));
                using (var reader = XmlTextReader.Create(msg.BodyStream))
                {
                    reader.ReadToDescendant("WorkContext");
                    reader.Read();
                    sb1 = ReadContentAsStringBuilder(reader);

                    reader.ReadToFollowing("ContextResult");
                    reader.Read();
                    sb2 = ReadContentAsStringBuilder(reader);
                    Console.WriteLine("Memory after creating sb: {0}", GC.GetTotalMemory(true));
                }
            }
            Console.WriteLine("Memory after freeing mq: {0}", GC.GetTotalMemory(true));
            GC.KeepAlive(sb1);
            GC.KeepAlive(sb2);
        }
        Console.WriteLine("Memory after final collect: {0}", GC.GetTotalMemory(true));
    }

    private static StringBuilder ReadContentAsStringBuilder(XmlReader reader)
    {
        var sb = new StringBuilder();
        char[] buffer = new char[4096];
        int read;
        while ((read = reader.ReadValueChunk(buffer, 0, buffer.Length)) != 0)
        {
            sb.Append(buffer, 0, read);
        }
        return sb;
    }
}
I read the Message.BodyStream of the message directly with an XmlReader, then navigate to the elements I need and read the data in chunks using XmlReader.ReadValueChunk.
In the end I never use string objects anywhere. The only big block of memory is the Message itself.
You could maybe implement a class (call it LargeString) that reuses previously allocated strings and keeps a small pool of them.
Since strings are normally immutable, you'd have to do every change and new assignment with unsafe pointer juggling. After passing a string to the receiver, you'd need to manually mark it as free for reuse. Different message lengths might also be a problem, unless the receiver can cope with messages that are too long, or you keep a collection of strings of every length.
Probably not a great idea, but it maybe beats rewriting everything in C++.
In our application, we are reading an XPS file using the System.IO.Packaging.Package class. When we read from a stream of a PackagePart, we can see from the Task Manager that the application's memory consumption rises. However, when the reading is done, the memory consumption doesn't fall back to what it was before reading from the stream.
To illustrate the problem, I wrote a simple code sample that you can use in a standalone WPF application.
public partial class Window1 : Window
{
    public Window1()
    {
        InitializeComponent();
        _package = Package.Open(@"c:\test\1000pages.xps", FileMode.Open, FileAccess.ReadWrite, FileShare.None);
    }

    private void ReadPackage()
    {
        foreach (PackagePart part in _package.GetParts())
        {
            using (Stream partStream = part.GetStream())
            {
                byte[] arr = new byte[partStream.Length];
                partStream.Read(arr, 0, (int)partStream.Length);
                partStream.Close();
            }
        }
    }

    Package _package;

    private void Button_Click(object sender, RoutedEventArgs e)
    {
        ReadPackage();
    }
}
The ReadPackage() method will read all the PackagePart objects' stream contents into a local array. In the sample, I used a 1000 page XPS document as the package source in order to easily see the memory consumption change of the application. On my machine, the stand alone app's memory consumption starts at 18MB then rises to 100MB after calling the method. Calling the method again can raise the memory consumption again but it can fall back to 100MB. However, it doesn't fall back to 18MB anymore.
Has anyone experienced this while using PackagePart? Or am I using it wrong? I think the internal implementation of PackagePart is caching the data that was read.
Thank you!
You do not specify how you measure the "memory consumption" of your application, but perhaps you are using Task Manager? To get a better view of what is going on I suggest that you examine some performance counters for your application. Both .NET heap and general process memory performance counters are available.
If you really want to understand the details of how your application uses memory you can use the Microsoft CLR profiler.
What you see may be a result of the .NET heap expanding to accommodate a very large file. Big objects are placed on the Large Object Heap (LOH), and even when the .NET memory is garbage collected, the free memory is never returned to the operating system. Also, objects on the LOH are never moved around during garbage collection, and this may fragment the LOH, exhausting the available address space even though there is plenty of free memory.
Has anyone experienced this while using PackagePart? Or am I using it wrong?
If you want to control the resources used by the package, you are not using it in the best way. Packages are disposable, and in general you should use them like this:
using (var package = Package.Open(@"c:\test\1000pages.xps", FileMode.Open, FileAccess.ReadWrite, FileShare.None))
{
    // ... process the package
}
At the end of the using statement resources consumed by the package are either already released or can be garbage collected.
If you really want to keep the _package member of your form, you should at some point call Close() (or IDisposable.Dispose()) to release the resources. Calling GC.Collect() is not recommended and will not necessarily be able to reclaim the resources used by the package. Any managed memory (e.g. package buffers) that is reachable from _package will not be garbage collected no matter how often you try to force a garbage collection.
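If you do keep the _package field, one natural place to release it is when the window closes. A minimal sketch, assuming the Window1 class from the question:

// Sketch: dispose the package when the window goes away, so its buffers
// become unreachable and eligible for collection.
protected override void OnClosed(EventArgs e)
{
    base.OnClosed(e);
    if (_package != null)
    {
        _package.Close();   // or ((IDisposable)_package).Dispose()
        _package = null;
    }
}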