Reclaiming memory from large objects (specifically, datasets) - c#

I inherited a Windows service that accepts requests via remoting to analyze potentially huge amounts of data in a database. It does this by retrieving the raw data by loading it into a dataset, parsing the data into a tree-like structure of objects, and then running the analysis.
The problem I am encountering is that if a very large set of data is analyzed, not all of the memory is returned to the system after the analysis is done, even if I aggressively force garbage collection. For example, if a 500MB set of data is analyzed, the windows service goes from ~10MB (the baseline at startup), to ~500MB, then down to ~200MB after GC.Collect(), and never goes any lower, even overnight. The only way to get the memory back down is to stop and restart the service. But if I run a small analysis, the service goes from ~10MB, to ~50MB, then down to something like ~20MB. Not great either, but there is a huge discrepancy between the final utilization between large and small data after the analysis is done.
This is not a memory leak per se because if I run the large analysis over and over, the total memory goes back down to ~200MB every time it completes.
This is a problem because the windows service runs on a shared server and I can't have my process taking up loads of memory all the time. It's fine if it spikes and then goes back down after the analysis is done, but this spikes and goes partially down to an unacceptably high number. A typical scenario is running an analysis and then sitting idle for hours.
Unfortunately, this codebase is very large and a huge portion of it is coded to work with datatables returned by a proprietary data access layer, so using an alternate method to load the data is not an option (I wish I could, loading all the data into memory just to loop over it makes no sense).
So my questions are:
1) Why does running a large dataset cause the memory utilization to settle back down to ~200MB, but running the small dataset causes the memory utilization to settle back down to ~20MB? It's obviously hanging on to pieces of the dataset somehow, I just can't see where.
2) Why does it make a difference if I loop over the data table's rows or not (see below)?
3) How can I get/force the memory back down to reasonable levels when the analysis is done, without radically changing the architecture?
I created a small windows service/client app to reproduce the problem. The test database I am using has a table with a million records, an int PK, and two string fields. Here's the scenarios I have tried -- the client (console app) calls LoadData via remoting ten times in a loop.
1) doWork = true, garbageCollect = true, recordCount = 100,000. Memory goes up to 78MB then stabilizes at 22MB.
2) doWork = false, garbageCollect = true, recordCount = 100,000. Memory goes up to 78MB and stabilizes at 19MB. Seriously, 3MB more to loop over the rows without doing anything?
3) doWork = false, garbageCollect = false, recordCount = 100,000. Memory goes up to about 178MB then stabilizes at 78MB. Forcing garbage collection is obviously doing something, but not enough for my needs.
4) doWork = false, garbageCollect = true, recordCount = 1,000,000. Memory goes up to 500MB and stabilizes at 35MB. Why does it stabilize at a higher number when the dataset is larger?
5) doWork = false, garbageCollect = true, recordCount = 1,000. It runs too fast to see the peak but it stabilizes at a measly 12MB.
public string LoadData(bool doWork, bool garbageCollect, int recordCount)
{
var dataSet = new DataSet();
using (var sqlConnection = new SqlConnection("...blah..."))
{
sqlConnection.Open();
using (var dbCommand = sqlConnection.CreateCommand())
{
dbCommand.CommandText = string.Format("select top {0} * from dbo.FakeData", recordCount.ToString());
dbCommand.CommandType = CommandType.Text;
using (var dbReader = new SqlDataAdapter(dbCommand))
{
dbReader.Fill(dataSet);
}
}
sqlConnection.Close();
}
// loop over the records
var count = dataSet.Tables[0].Rows.Count;
if (doWork)
{
foreach (DataRow row in dataSet.Tables[0].Rows) {}
}
dataSet.Clear();
dataSet = null;
if (garbageCollect)
{
GC.Collect();
GC.WaitForPendingFinalizers();
GC.Collect();
}
return string.Format("Record count is {0}", count);
}

Related

C# ODBC driver with DBF database in parallel not working as intended

Hi everyone this is my last resort, i think I'm getting something wrong but can't get unstuck
I was trying to load some old DBF files at startup to create my data context in the new application but soon discovered that some of these really take up a long time to load at times even 1-2 minutes (these are sometimes 80-100MB databases) which is not acceptable at the start of the application.
My idea was to load them in parallel which would mean at worst i have a load time of 1-2 minutes(the bigger one) but when using a stopwatch to check the execution time i wasn't getting the 1-2 minutes expected but i instead got the sum of the time as if i was doing them one by one.
This is the code i wrote to access all the databases in the folder using an obdc adapter, in reality, the query executes very fast is the adapter that is taking a long time to load things in Datatable, i switched to Task after not getting any result with "Parallel.Foreach()" i even tried to switch on and off the background Fetch but with no avail, is there something i can try or this driver isn't made to be used by more resources?
public static List<DataTable> SelectAllParallel(string folder)
{
System.Data.Odbc.OdbcConnection conn = new System.Data.Odbc.OdbcConnection("Driver={Microsoft Visual FoxPro Driver};SourceType=DBF;SourceDB=" + folder + ";Exclusive=No;Collate=Machine;NULL=NO;DELETED=NO;BACKGROUNDFETCH=Yes;");
List<string> Databases = new List<string>();
foreach (var file in Directory.EnumerateFiles(folder, "*.dbf"))
Databases.Add(Path.GetFileName(file));
conn.Open();
List<OdbcDataReader> QueryResult = new List<OdbcDataReader>();
List<DataTable> Results = new List<DataTable>();
var watch = new System.Diagnostics.Stopwatch();
watch.Start();
foreach (string database in Databases)
{
string strQuery = $"SELECT * FROM [{database}]";
OdbcCommand command = new OdbcCommand(strQuery, conn);
QueryResult.Add(command.ExecuteReader());
}
List<Task> tsk = new List<Task>();
foreach(OdbcDataReader SingQuery in QueryResult)
{
tsk.Add(new Task(() => { DataTable dt = new DataTable();dt.Load(SingQuery); Results.Add(dt); }));
}
foreach (var tssk in tsk)
tssk.Start();
Task.WaitAll(tsk.ToArray());
watch.Stop();
conn.Close();
var h = watch.Elapsed;
return Results;
}
You are using background fetching and multiple threads which is exactly what Microsoft warns you not to do:
https://learn.microsoft.com/bs-latn-ba/sql/odbc/microsoft/thread-support-visual-foxpro-odbc-driver
Avoid using background fetch when you call the driver from multithreaded applications.
The VFP ODBC driver is thread-safe. That means, you can call it from multiple threads (if you avoid background fetching). However, it is not multi-threaded. Only one query at a time is executed. The others are blocked by a semaphore that VFP is using to synchronize access to its query engine.
Loading all rows into memory with SELECT * FROM table is usually considered a bad approach. You get better performance if you only load the records you need. The ODBC driver isn't designed and optimized to return all rows.
If your specific case needs access to all rows in all tables, for example, because your application is a converter that converts all data to a different format, you might get better performance by using a C# DBF library. There could be compatibility issues, though, because most of these libraries only implement older DBF file formats. You also should avoid writing with these libraries, because that could lead to corruption.
If you are only reading the files and not changing them later, and if your files are on a network share, you will also see better performance if you copy the entire file into a local temp folder and then use the ODBC driver to read from the temp folder.
The reason for this is that VFP reads records synchronously. It reads one record from the file server and waits until the record has been served to the client before reading the next record. This sends many IP packets across the network (up to one per record). Network latency and packet trough-put rather than bandwidth determine transfer speed in this case. With local files you don't have these limitations.

Memory leak analysis and help requested

I've been using the methodology outlined by Shivprasad Koirala to check for memory leaks from code running inside a C# application (VoiceAttack). It basically involves using the Performance Monitor to track an application's private bytes as well as bytes in all heaps and compare these counters to assess if there is a leak and what type (managed/unmanaged). Ideally I need to test outside of Visual Studio, which is why I'm using this method.
The following portion of code generates the below memory profile (bear in mind the code has a little different format compared to Visual Studio because this is a function contained within the main C# application):
public void main()
{
string FilePath = null;
using (FileDialog myFileDialog = new OpenFileDialog())
{
myFileDialog.Title = "this is the title";
myFileDialog.FileName = "testFile.txt";
myFileDialog.Filter = "txt files (*.txt)|*.txt|All files (*.*)|*.*";
myFileDialog.FilterIndex = 1;
if (myFileDialog.ShowDialog() == DialogResult.OK)
{
FilePath = myFileDialog.FileName;
var extension = Path.GetExtension(FilePath);
var compareType = StringComparison.InvariantCultureIgnoreCase;
if (extension.Equals(".txt", compareType) == false)
{
FilePath = null;
VA.WriteToLog("Selected file is not a text file. Action canceled.");
}
else
VA.WriteToLog(FilePath);
}
else
VA.WriteToLog("No file selected. Action canceled.");
}
VA.WriteToLog("done");
}
You can see that after running this code the private bytes don't come back to the original count and the bytes in all heaps are roughly constant, which implies that there is a portion of unmanaged memory that was not released. Running this same inline function a few times consecutively doesn't cause further increases to the maximum observed private bytes or the unreleased memory. Once the main C# application (VoiceAttack) closes all the related memory (including the memory for the above code) is released. The bad news is that under normal circumstances the main application may be kept running indefinitely by the user, causing the allocated memory to remain unreleased.
For good measure I threw this same code into VS (with a pair of Thread.Sleep(5000) added before and after the using block for better graphical analysis) and built an executable to track with the Performance Monitor method, and the result is the same. There is an initial unmanaged memory jump for the OpenFileDialog and the allocated unmanaged memory never comes back down to the original value.
Does the memory and leak tracking methodology outlined above make sense? If YES, is there anything that can be done to properly release the unmanaged memory?
Does the memory and leak tracking methodology outlined above make sense?
No. You shouldn't expect unmanaged committed memory (Private Bytes) always be released. For instance processes have an unmanaged heap, which is managed to allow for subsequent allocations. And since Windows can page your committed memory, it isn't critical to minimize each processes committed memory.
If repeated calls don't increase memory use, you don't have a memory leak, you have delayed initialization. Some components aren't initialized until you use them, so their memory usage isn't being taken into account when you establish your baseline.

C#, DataGridView not garbage collecting properly?

I've been looking at this problem the last two days and I'm all out of ideas, so I'm hoping someone might have some insight into what exactly is going on.
In brief, the program loads in a CSV file (with 240K lines, so not a ton) into a DataGridView, does a bunch of different operations on the data, and then dumps it into an Excel file for easy viewing and graphic and what not. However, as I iterate over the data in the DataGridView, ram usage constantly increases. I trimmed down the code to the simplest form and it's still happened and I don't quite understand why.
try
{
string conStr = #"Driver={Microsoft Text Driver (*.txt; *.csv)};Dbq=C:\Downloads;Extensions=csv,txt";
OdbcConnection conn = new OdbcConnection(conStr);
//OdbcDataAdapter da = new OdbcDataAdapter("Select * from bragg2.csv", conn);
OdbcDataAdapter da = new OdbcDataAdapter("Select * from bragg4.csv", conn);
DataTable dt = new DataTable("DudeYa");
da.Fill(dt);
dgvCsvFile.DataSource = dt;
da.Dispose();
conn.Close();
conn.Dispose();
}
catch (Exception e) { }
string sDeviceCon;
for (int i = 0; i < dgvCsvFile.Rows.Count; i++)
{
if (dgvCsvFile.Rows[i].Cells[48].Value != DBNull.Value)
sDeviceCon = (string)dgvCsvFile.Rows[i].Cells[48].Value;
else
sDeviceCon = "";
if ((sDeviceCon == null) || (sDeviceCon == ""))
continue;
}
When I enter that for loop, the program is using 230 megs. When I finish it, I'm using 3 gigs. At this point I'm not even doing anything with the data, I'm literally just looking at it and then moving on.
Futhermore, if I add these lines after it:
dgvCsvFile.DataSource = null;
dgvCsvFile.Rows.Clear();
dgvCsvFile.Columns.Clear();
dgvCsvFile.Dispose();
GC.Collect(GC.MaxGeneration, GCCollectionMode.Forced);
The RAM usage stays the same. I can't figure out how to actually get the memory back without closing the program.
I've downloaded various memory profiling tools to try and figure out exactly whats going on:
Using VMMap from Sys Internals to take periodic snapshots, there were instances of the Garbage Collector running that were filling up. Each later snapshot I took usually had more Garbage Collectors running with all the previous ones using ~260MB of ram.
Using .NET Memory Profiler, I was able to get a little bit more detailed information. According to that, I have approximately 4 million instances of PropertyStore, DataGrieViewTextBoxCell and PropertyStore.IntegerEntry[]. All of them appear to have 1 reference and are existing uncollected in the Garbage Collector.
I ran the program in x86 mode and it caused an out of memory exception (not terribly unexpected). It throws the error inside of DataGridViewTextBoxCell.Clone(), which is in turn inside a DataGridViewRow.CloneCells() and DataGridViewRow.Clone(). So obviously it's doing a whole bunch of cloning to access the cell data... but why is it never being cleaned up? What could possibly be holding onto the reference for them?
I also tried getting the data into the DataSource in a different way using the CachedCsvReader from LumenWorks.Framwork.IO with no change in the functionality.
Is there something that I'm missing? Is there a setting I should be changing somewhere? Why aren't any of these things being disposed of?
Any help would be greatly appreciated, thanks!

MemoryCache OutOfMemoryException

I am trying to figure out how the MemoryCache should be used in order to avoid getting out of memory exceptions. I come from ASP.Net background where the cache manages it's own memory usage so I expect that MemoryCache would do the same. This does not appear to be the case as illustrated in the bellow test program I made:
class Program
{
static void Main(string[] args)
{
var cache = new MemoryCache("Cache");
for (int i = 0; i < 100000; i++)
{
AddToCache(cache, i);
}
Console.ReadLine();
}
private static void AddToCache(MemoryCache cache, int i)
{
var key = "File:" + i;
var contents = System.IO.File.ReadAllBytes("File.txt");
var policy = new CacheItemPolicy
{
SlidingExpiration = TimeSpan.FromHours(12)
};
policy.ChangeMonitors.Add(
new HostFileChangeMonitor(
new[] { Path.GetFullPath("File.txt") }
.ToList()));
cache.Add(key, contents, policy);
Console.Clear();
Console.Write(i);
}
}
The above throws an out of memory exception after approximately reaching 2GB of memory usage (Any CPU) or after consuming all my machine's physical memory (x64)(16GB).
If I remove the cache.Add bit the program throws no exception. If I include a call to cache.Trim(5) after every cache add I see that it releases some memory and it keeps aproximately 150 objects in the cache at any given time (from cache.GetCount()).
Is calling cache.Trim my program's responsibility? If so when should it be called (like how can my program know that the memory is getting full)? How do you calculate the percentage argument?
Note: I am planning to use the MemoryCache in a long running windows service so it is critical for it to have proper memory management.
MemoryCache has a background thread that periodically estimates how much memory the process is using and how many keys are in the cache. When it thinks you are getting close to the cachememorylimit, it will Trim the cache. Each time this background thread runs, it checks to see how close you are to the limits, and it will increase the polling frequency under memory pressure.
If you add items very quickly, the background thread doesn't have a chance to run, and you can run out of memory before the cache can trim and GC can run (in a x64 process this can result in massive heap size and multi minute GC pauses). The trim process/memory estimation is also known to have bugs under some conditions.
If your program is prone to out of memory due to rapidly loading an excessive number of objects, something with a bounded size like an LRU cache is a much better strategy. LRU typically uses a policy based on item count to evict the least recently used items.
I wrote a thread safe implementation of TLRU (a time aware least recently used policy), that you can easily use as a drop in replacement for ConcurrentDictionary.
It's available on Github here: https://github.com/bitfaster/BitFaster.Caching
Install-Package BitFaster.Caching
Using it would look like something this for your program, and it will not run out of memory (depending on how big your files are):
class Program
{
static void Main(string[] args)
{
int capacity = 80;
TimeSpan timeToLive = TimeSpan.FromMinutes(5);
var lru = new ConcurrentTLru<int, byte[]>(capacity, timeToLive);
for (int i = 0; i < 100000; i++)
{
var value = lru.GetOrAdd(1, (k) => System.IO.File.ReadAllBytes("File.txt"));
}
Console.ReadLine();
}
}
If you really want to avoid running out of memory, you should also consider reading the files into a RecyclableMemoryStream, and using the Scoped class in BitFaster to make the cached values thread safe and avoid races on dispose.

Saving data from SQLDataReader to FileStream with minimal memory usage

I've wrapped a SQLDataReader as an IEnumberable using a yield statement. I'd like use this to dump to a file. I'm seeing some pretty heavy memory utilization. Was wondering if anyone had any ideas on how to do this with minimal or set memory utilization. I don't mind specifying a buffer I'd just like to know what it'll be before I unleash this on an unsuspecting server.
I've been using something like the following:
class Program
{
static void Main(string[] args)
{
var fs = File.Create("c:\\somefile.txt");
var sw = new StreamWriter(fs);
foreach (var asdf in Enumerable.Range(0, 500000000))
{
sw.WriteLine("adsfadsf");//Data from Reader
}
sw.Close();
}
}
string commandText = #"SELECT name FROM {0} WHERE name NOT LIKE '%#%'";
SqlCommand sqlCommand = new SqlCommand(string.Format(commandText, list.TableName.SQLEncapsulate()),
_connection);
using (SqlDataReader sqlDataReader = sqlCommand.ExecuteReader())
{
while (sqlDataReader.Read())
{
yield return sqlDataReader["name"].ToString();
}
}
Some heavy memory throughput is not a problem, and is unavoidable when you process a lot of data.
The data that you read will be allocated as new objects on the heap, but they are short lived objects as you just read the data, write it, then throw it away.
The memory management in .NET doesn't try to keep the memory usage as low as possible, as having a lot of unused memory doesn't make the computer faster. When you create and release objects, they will just be abandoned on the heap for a while, and the garbage collector cleans them up eventually.
It's normal for a .NET application to use a lot of memory when you are doing some heavy data processing, and the memory usage will drop again after a while. If there is some other part of the system that needs the memory, the garbage collector will do a more aggressive collection to free up as much as possible.

Categories

Resources