I am trying to isolate the source of a "memory leak" in my C# application. This application copies a large number of potentially large files into records in a database using the image column type in SQL Server. I am using LinqToSql and its associated objects for all database access.
The main loop iterates over a list of files and inserts. After removing much boilerplate and error handling, it looks like this:
foreach (Document doc in ImportDocs) {
    using (var dc = new DocumentClassesDataContext(connection)) {
        byte[] contents = File.ReadAllBytes(doc.FileName);
        DocumentSubmission submission = new DocumentSubmission() {
            Content = contents,
            // other fields
        };
        dc.DocumentSubmissions.InsertOnSubmit(submission); // (A)
        dc.SubmitChanges(); // (B)
    }
}
Running this program over the entire input results in an eventual OutOfMemoryException. CLR Profiler reveals that 99% of the heap consists of large byte[] objects corresponding to the sizes of the files.
If I comment both lines A and B, this leak goes away. If I uncomment only line A, the leak comes back. I don't understand how this is possible, as dc is disposed for every iteration of the loop.
Has anyone encountered this before? I suspect directly calling stored procedures or doing inserts will avoid this leak, but I'd like to understand this before trying something else. What is going on?
Update
Including GC.Collect(); after line (B) appears to make no significant change to any case. This does not surprise me much, as CLR Profiler was showing a good number of GC events without explicitly inducing them.
Which operating system are you running this on? Your problem may not be related to Linq2Sql, but to how the operating system manages large memory allocations. For instance, Windows Server 2008 is much better at managing large objects in memory than XP. I have had instances where the code working with large files was leaking on XP but was running fine on Win 2008 server.
HTH
I don't entirely understand why, but making a copy of the iterating variable fixed it. As near as I can tell, LinqToSql was somehow making a copy of the DocumentSubmission inside each Document.
foreach (Document doc in ImportDocs) {
    // make copy of doc that lives inside loop scope
    Document copydoc = new Document() {
        field1 = doc.field1,
        field2 = doc.field2,
        // complete copy
    };
    using (var dc = new DocumentClassesDataContext(connection)) {
        byte[] contents = File.ReadAllBytes(copydoc.FileName);
        DocumentSubmission submission = new DocumentSubmission() {
            Content = contents,
            // other fields
        };
        dc.DocumentSubmissions.InsertOnSubmit(submission); // (A)
        dc.SubmitChanges(); // (B)
    }
}
Related
I wrote an app that runs on multiple client machines and connects to a remote Oracle database (master) to synchronize their local open source database (slave). This works fine so far. But sometimes a local table needs to be fully initialized (dropped, and afterwards all rows of the master database inserted). If the master table is big enough (ColumnCount or DataType/DataSize and a certain RowSize), the app sometimes runs into an OutOfMemoryException.
The app is running on windows machines with .NET 4.0. Version of ODP.NET is 4.122.18.3. Oracle database is 12c (12.1.0.2.0).
I don't want to show the data to any user (the app is running in the background); otherwise I could do some paging or filtering. Since not all tables contain keys or can be sorted automatically, it's pretty hard to fetch the table in parts. The initialization of the local table should be done in one transaction without multiple partial commits.
I can break the problem down to a simple code sample showing the managed memory allocation, which I didn't expect. At this point I'm not sure how to explain or solve the problem.
using (var connection = new OracleConnection(CONNECTION_STRING))
{
    connection.Open();
    using (var command = connection.CreateCommand())
    {
        command.CommandText = STATEMENT;
        using (var reader = command.ExecuteReader())
        {
            while (reader.Read())
            {
                //reader[2].ToString();
                //reader.GetString(2);
                //reader.GetValue(2);
            }
        }
    }
}
When uncommenting any of the three reader.* calls the memory of the requested column data seems to be pinned internally by ODP.NET (OracleInternal.Network.OraBuf) for each record. For requesting a few thousand records this doesn't seem to be a problem. But when fetching 100k+ records the memory allocation gets to hundreds of MB, which leads to an OutOfMemoryException. The more data the specified column has the faster the OOM happens (mostly NVARCHAR2).
Additionally calling GC.Collect() manually doesn't do anything. The GC.Collect() calls shown in the image are done internally (no calls by myself).
Since I don't store the read data anywhere, I would have expected the data not to be cached while iterating the DbDataReader. Can you help me understand what is happening here and how to avoid it?
This seems to be a known bug (Bug 21975120) when reading CLOB column values using the ExecuteReader() method with the managed driver 12.1.0.2. The workaround is to use the OracleDataReader-specific methods (for example oracleDataReader.GetOracleValue(i)). The OracleClob value can then be closed explicitly to free the memory allocation.
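var item = oracleDataReader.GetOracleValue(columnIndex);
// The type pattern only matches non-null values, so no separate null check is needed
if (item is OracleClob clob)
{
    // use clob.Value ...
    clob.Close(); // release the memory held for this value
}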
I'm trying to use Fluent NHibernate to migrate a database that needs some of the database 'massaged'. The source database is a MS Access database and the current table I'm stuck on is one with an OLE Object field. The target database is a MS SQL Server Express database.
In the entity I simply had this field defined as a byte[]. However, even when loading just that single field for a single record, I was hitting a System.OutOfMemoryException:
byte[] test = aSession.Query<Entities.Access.Revision>().Where(x => x.Id == 5590).Select(x => x.FileData).SingleOrDefault<byte[]>();
I then tried implementing the blob type listed here, but now when running that I receive this error:
"Unable to cast object of type 'System.Byte[]' to type 'TestProg.DatabaseConverter.Entities.Blob'."
I can't imagine the OLE Object is any larger than 100 MB, but I haven't been able to check. Is there a good way to use Fluent NHibernate to copy this out of one database and save it to the other, or will I need to look at other options?
My normal loop for processing these is:
IList<Entities.Access.Revision> result;
IList<int> recordIds = aSession.Query<Entities.Access.Revision>().Select(x => x.Id).ToList<int>();
foreach (int recordId in recordIds)
{
    result = aSession.Query<Entities.Access.Revision>().Where(x => x.Id == recordId).ToList<Entities.Access.Revision>();
    Save(sqlDb, result);
}
The Save function just copies properties from one entity to another and, for some entities, is used to manipulate data or give feedback to the user about data problems. I'm using stateless sessions for both databases.
--
From further testing, the objects it appears to be hanging on are about 60-70 MB. I'm currently testing grabbing the data with an OleDbDataReader using GetBytes.
--
Update (Nov 24): I've yet to find a way to get this to work with NHibernate. I did get it working with regular db command objects. I've put the code for the function I wrote below for anybody curious who finds this. This is code from my database converter, so objects prefixed with 'a' are Access database objects and those prefixed with 's' are SQL Server ones.
public void MigrateBinaryField(int id, string tableName, string fieldName)
{
    // Note: 'buffer' is a byte[] declared elsewhere in the converter class (not shown in this post)
    var aCmd = new OleDbCommand(String.Format(@"SELECT ID, {0} FROM {1} WHERE ID = {2}", fieldName, tableName, id), aConn);
    using (var reader = aCmd.ExecuteReader(System.Data.CommandBehavior.SequentialAccess))
    {
        while (reader.Read())
        {
            if (reader[fieldName] == DBNull.Value)
                return;
            long read = 0;
            long offset = 0;
            // Can't .WRITE a NULL column so need to set an initial value
            var sCmd = new SqlCommand(string.Format(@"UPDATE {0} SET {1} = @data WHERE OldId = @OldId", tableName, fieldName), sConn);
            sCmd.Parameters.AddWithValue("@data", new byte[0]);
            sCmd.Parameters.AddWithValue("@OldId", id);
            sCmd.ExecuteNonQuery();
            // Incrementally store binary field to avoid OutOfMemoryException from having entire field loaded in memory
            sCmd = new SqlCommand(string.Format(@"UPDATE {0} SET {1}.WRITE(@data, @offset, @len) WHERE OldId = @OldId", tableName, fieldName), sConn);
            while ((read = reader.GetBytes(reader.GetOrdinal(fieldName), offset, buffer, 0, buffer.Length)) > 0)
            {
                // Send only the bytes actually read; the last chunk is usually shorter than the buffer,
                // and .WRITE would otherwise append the stale tail of the buffer as well
                byte[] chunk = buffer;
                if (read < buffer.Length)
                {
                    chunk = new byte[read];
                    Array.Copy(buffer, chunk, read);
                }
                sCmd.Parameters.Clear();
                sCmd.Parameters.AddWithValue("@data", chunk);
                sCmd.Parameters.AddWithValue("@offset", offset);
                sCmd.Parameters.AddWithValue("@len", read);
                sCmd.Parameters.AddWithValue("@OldId", id);
                sCmd.ExecuteNonQuery();
                offset += read;
            }
        }
    }
}
This sounds like the results I have seen with using .NET on top of other frameworks as well.
The native database driver beneath ADO.NET beneath NHibernate (two "beneaths" are intentional here) will require a pinned destination memory block that cannot be moved in memory while the driver fills it. Since the .NET garbage collector can randomly move blocks of memory on a separate thread in order to compact the heaps, NHibernate's underlying .NET database layer has to create a non-managed memory block to receive the data, which effectively doubles the amount of memory required to load a record.
Also, I have not verified this next point, but NHibernate should attempt to cache blocks of records, since it bypasses some of the relational database query operations. This allows NHibernate to make fewer database requests, which is optimal for smaller record sizes, but requires many records (including many blobs) to fit in memory at a time.
As a first step toward a resolution, make sure the process is really running the machine out of memory (or if it is 32-bit, make sure it is hitting the 2GB limit). If so, attempt to determine the baseline - if it is processing records with a variety of blob sizes, what is the minimum and maximum memory it uses? From that, you can estimate how much memory would be required for that large record (or the cache block that contains that record!)
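One way to collect the baseline described above is to log the process bitness and memory figures before and after each record is loaded. This is a minimal sketch only; the MemoryProbe helper and its names are hypothetical and not part of NHibernate or ADO.NET.

```csharp
using System;
using System.Diagnostics;

// Hypothetical helper for illustration; not part of any library.
public static class MemoryProbe
{
    // Returns a one-line snapshot of the current process's memory use.
    public static string Snapshot(string label)
    {
        using (Process process = Process.GetCurrentProcess())
        {
            long managedBytes = GC.GetTotalMemory(false);    // managed heap only, no forced collection
            long privateBytes = process.PrivateMemorySize64; // managed plus native memory committed to the process
            return string.Format(
                "{0}: 64-bit process = {1}, managed = {2:N0} bytes, private = {3:N0} bytes",
                label, Environment.Is64BitProcess, managedBytes, privateBytes);
        }
    }
}

public static class MemoryProbeDemo
{
    public static void Main()
    {
        Console.WriteLine(MemoryProbe.Snapshot("before"));
        byte[] blob = new byte[50 * 1024 * 1024]; // stand-in for loading one large record
        Console.WriteLine(MemoryProbe.Snapshot("after"));
        GC.KeepAlive(blob); // keep the array reachable until after the second snapshot
    }
}
```

Comparing the managed and private figures gives a rough idea of how much of the growth sits outside the managed heap, which is where driver-side buffers would show up.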
64-bit and more physical memory may be a brute-force solution, if you aren't already running 64-bit, and if bigger hardware is even an option.
Another possible solution is to check whether NHibernate has configurable settings or properties for how it caches data. For example, check whether you can set a property that limits how many records are loaded at a time, or tell it to limit its cache to a certain size in bytes.
A more efficient solution is to use your ADO.NET code for the blobs; that might be the best solution, especially if you expect even larger blobs than this particular 60-70MB blob. MS Access will normally allow multiple read-only connections, so this should work as long as NHibernate doesn't set the database to block other connections.
I strongly suspect this is an accumulation due to NHibernate session cache.
Try to read each blob in a separate session or at least to flush/clear it periodically adding a counter 'i' to your loop and a condition like
if (i % 10 == 0)
{
    aSession.Flush();
    aSession.Clear();
}
I'm using SqlBulkCopy to insert large amounts of data into SQL Server. The code is simple, like following:
using (var copy = new SqlBulkCopy(connection))
{
    copy.BatchSize = 10000;
    copy.DestinationTableName = "MyTableName";
    // other parameters...
    copy.WriteToServer(reader);
}
The problem I'm having is that the data volume is large, so it takes a very long time to finish the copy operation. Sometimes there are transient errors in the network or in SQL Server, which cause the WriteToServer call to fail.
In these situations, I want to be able to continue the copy from the failed batch, instead of deleting all the data and doing the copy all over again from the very beginning. But I can't do this because at the time when I caught the exception thrown from WriteToServer, the reader object would have already been in a wrong position.
One solution I can think of is to cache the entire batch in my code. However that would mean I need to manage the memory for the cache. Also I believe this approach hurts performance a lot.
Are there any better ways? Thanks.
Here is what I am trying to achieve: loop through a dataset of over 1 million records and create a data dump in a text file exported to the C drive.
I am looping through a dataset with over a million records. Here is what is inside the loop.
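The batch-caching idea mentioned in the question can be sketched as follows. This is only an illustration under stated assumptions: BatchWriter and its writeBatch callback are hypothetical names, and the callback stands in for whatever sends one batch to the server (for example a SqlBulkCopy.WriteToServer call over a DataTable built from the batch).

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: buffers one batch at a time and retries only that batch.
public static class BatchWriter
{
    // Returns the number of batches written. Rethrows once maxAttempts is reached for a batch.
    public static int WriteInBatches<T>(
        IEnumerable<T> source, int batchSize, int maxAttempts, Action<IList<T>> writeBatch)
    {
        int batchesWritten = 0;
        var batch = new List<T>(batchSize);
        foreach (T item in source)
        {
            batch.Add(item);
            if (batch.Count == batchSize)
            {
                WriteWithRetry(batch, maxAttempts, writeBatch);
                batchesWritten++;
                batch.Clear(); // only one batch is ever held in memory
            }
        }
        if (batch.Count > 0) // final, partially filled batch
        {
            WriteWithRetry(batch, maxAttempts, writeBatch);
            batchesWritten++;
        }
        return batchesWritten;
    }

    private static void WriteWithRetry<T>(IList<T> batch, int maxAttempts, Action<IList<T>> writeBatch)
    {
        for (int attempt = 1; ; attempt++)
        {
            try
            {
                writeBatch(batch);
                return;
            }
            catch (Exception) when (attempt < maxAttempts)
            {
                // Treated as transient: loop around and resend the same buffered batch.
            }
        }
    }
}
```

Memory use is bounded by one batch, so the cost is roughly the same as the BatchSize already configured on the copy. Retrying is only safe if a failed batch leaves no rows behind, for example when each batch runs in its own transaction (SqlBulkCopyOptions.UseInternalTransaction is one option to look at); otherwise a retry could insert duplicates.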
I am using a StringBuilder inside the loop.
myString.Append(ds.Tables[0].Rows[i][0]); // each data row is no more than 10 characters long
It throws an error saying insufficient memory. I have 12 GB of RAM.
How do I go about fixing this problem?
Don't use an intermediate StringBuilder -- its contents sit in your computer's RAM before you presumably call .ToString() on it to write the result to disk. Instead, write the data to disk as you are processing it, something like:
using (var sw = new StreamWriter(outputFilePath, true))
{
    // start loop
    sw.Write(ds.Tables[0].Rows[i][0]);
    // end loop
}
This will write text to a file using the default encoding (UTF-8) and buffer size (I think it's 4KB).
Why do you store the large string in memory at all? If all you want to do is to write it to a text-file you could use a StreamWriter to write in batches:
using(var writer = new StreamWriter("c:\\file.txt", true))
{
    for(int rowNum = 0; rowNum < ds.Tables[0].Rows.Count; rowNum++)
    {
        DataRow row = ds.Tables[0].Rows[rowNum];
        writer.Write(row.Field<string>(0));
    }
}
But maybe you can optimize it further. Do you really need the large DataSet at all? If the data comes from a database, you could use a DataReader to stream it lazily. Then you can write to the text file without holding all the rows in memory.
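A minimal sketch of the DataReader approach follows. The ReaderDump helper is a hypothetical name, and a small in-memory DataTable stands in for the database so the example runs on its own; with a real database the reader would come from command.ExecuteReader().

```csharp
using System;
using System.Data;
using System.IO;

// Hypothetical helper for illustration.
public static class ReaderDump
{
    // Writes column 0 of every row to the writer, one row at a time, and returns the row count.
    public static int WriteFirstColumn(IDataReader reader, TextWriter writer)
    {
        int rows = 0;
        while (reader.Read())
        {
            // Only the current row is materialized; nothing accumulates between iterations.
            writer.WriteLine(Convert.ToString(reader.GetValue(0)));
            rows++;
        }
        return rows;
    }
}

public static class ReaderDumpDemo
{
    public static void Main()
    {
        // Assumption: this tiny table replaces the real query result.
        var table = new DataTable("Sample");
        table.Columns.Add("Value", typeof(string));
        table.Rows.Add("alpha");
        table.Rows.Add("beta");

        using (IDataReader reader = table.CreateDataReader())
        using (var writer = new StringWriter())
        {
            int rows = ReaderDump.WriteFirstColumn(reader, writer);
            Console.Write(writer.ToString());
            Console.WriteLine(rows + " rows written");
        }
    }
}
```

In the real program the TextWriter would be a StreamWriter over the output file, so each value goes to disk as soon as it is read. The sketch writes one value per line; use Write instead of WriteLine to match the original output format.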
This means that CLR cannot allocate an object with the size you've requested. Each process has its own RAM limit so appending a million records to a StringBuilder is probably not possible on your machine or any standard machine.
Even if you have a lot of memory, and even if you're running a 64-bit CLR, there are limits to the size of objects that can be created.
Your problem is that you try to have your file in memory. But it's too large to hold it in memory. Now, you don't need to have it in memory all at once. You need to create smaller chunks (like you already do, for example lines) and instead of keeping all of them in memory at the same time, write them to disk and then "forget" about them, so that you only need the memory for one chunk, never for all chunks at the same time.
You can do that using LINQ because of a feature called deferred execution:
// DataRowCollection is non-generic, so Cast<DataRow>() is needed before Select
var collectionToBeIterated = ds.Tables[0].Rows.Cast<DataRow>().Select(r => r[0].ToString());
File.WriteAllLines(@"c:\your.file", collectionToBeIterated);
Note the absence of any method that would actually materialize the collection, like ToList() or ToArray(), which would have the same problems your code has. This simply creates a description of what to do when the dataset's rows get iterated, not a command to actually do it all at once.
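To see deferred execution at work without a DataSet, here is a small self-contained sketch. The Numbers iterator and the Produced counter are made up for the demonstration, and the output path is a hypothetical temp file.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

public static class DeferredDemo
{
    public static int Produced; // counts how many source items have been generated so far

    // Lazily yields 0..count-1; the body does not run until the sequence is enumerated.
    public static IEnumerable<int> Numbers(int count)
    {
        for (int i = 0; i < count; i++)
        {
            Produced++;
            yield return i;
        }
    }

    public static void Main()
    {
        IEnumerable<string> lines = Numbers(1000000).Select(n => "row " + n);
        Console.WriteLine("after Select: produced = " + Produced); // still 0, nothing has been enumerated

        string path = Path.Combine(Path.GetTempPath(), "deferred-demo.txt"); // hypothetical output file
        File.WriteAllLines(path, lines); // pulls items from the query and writes them as it goes
        Console.WriteLine("after WriteAllLines: produced = " + Produced);
    }
}
```

The query only describes the projection; the strings are created while File.WriteAllLines enumerates it. In the DataSet case the rows themselves are already in memory, so what this avoids is building a second, full-size copy of the output.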
I am using VSTS2008 + C# + .Net 3.5 to run this console application on x64 Server 2003 Enterprise with 12 GB of physical memory.
Here is my code. When executing the statement bformatter.Serialize(stream, table), an out-of-memory exception is thrown. I monitored memory usage through the Performance tab of Task Manager and found that only 2 GB of physical memory was in use when the exception was thrown, so the machine should not actually be out of memory. :-(
Any ideas what is wrong? Any limitation of .Net serialization?
static DataTable MakeParentTable()
{
    // Create a new DataTable.
    System.Data.DataTable table = new DataTable("ParentTable");
    // Declare variables for DataColumn and DataRow objects.
    DataColumn column;
    DataRow row;
    // Create new DataColumn, set DataType,
    // ColumnName and add to DataTable.
    column = new DataColumn();
    column.DataType = System.Type.GetType("System.Int32");
    column.ColumnName = "id";
    column.ReadOnly = true;
    column.Unique = true;
    // Add the Column to the DataColumnCollection.
    table.Columns.Add(column);
    // Create second column.
    column = new DataColumn();
    column.DataType = System.Type.GetType("System.String");
    column.ColumnName = "ParentItem";
    column.AutoIncrement = false;
    column.Caption = "ParentItem";
    column.ReadOnly = false;
    column.Unique = false;
    // Add the column to the table.
    table.Columns.Add(column);
    // Make the ID column the primary key column.
    DataColumn[] PrimaryKeyColumns = new DataColumn[1];
    PrimaryKeyColumns[0] = table.Columns["id"];
    table.PrimaryKey = PrimaryKeyColumns;
    // Create the DataRow objects (about five million) and add
    // them to the DataTable
    for (int i = 0; i <= 5000000; i++)
    {
        row = table.NewRow();
        row["id"] = i;
        row["ParentItem"] = "ParentItem " + i;
        table.Rows.Add(row);
    }
    return table;
}
static void Main(string[] args)
{
    DataTable table = MakeParentTable();
    Stream stream = new MemoryStream();
    BinaryFormatter bformatter = new BinaryFormatter();
    bformatter.Serialize(stream, table); // out of memory exception here
    Console.WriteLine(table.Rows.Count);
    return;
}
thanks in advance,
George
Note: DataTable defaults to the xml serialization format that was used in 1.*, which is incredibly inefficient. One thing to try is switching to the newer format:
table.RemotingFormat = System.Data.SerializationFormat.Binary;
Re the out-of-memory / 2GB; individual .NET objects (such as the byte[] behind a MemoryStream) are limited to 2GB. Perhaps try writing to a FileStream instead?
(edit: nope: tried that, still errors)
I also wonder if you may get better results (in this case) using table.WriteXml(stream), perhaps with compression such as GZIP if space is a premium.
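A minimal sketch of the WriteXml-plus-GZIP idea, streaming straight to a file so that no large in-memory buffer is needed. The CompressedTableXml helper and the file name are hypothetical, and the tiny table only stands in for the real one.

```csharp
using System;
using System.Data;
using System.IO;
using System.IO.Compression;

// Hypothetical helper for illustration.
public static class CompressedTableXml
{
    // Streams the table as XML (schema included) through GZip directly to a file.
    public static void Save(DataTable table, string path)
    {
        using (var file = new FileStream(path, FileMode.Create, FileAccess.Write))
        using (var gzip = new GZipStream(file, CompressionMode.Compress))
        {
            table.WriteXml(gzip, XmlWriteMode.WriteSchema);
        }
    }

    // Reads the table back; the inline schema restores the columns.
    public static DataTable Load(string path)
    {
        var table = new DataTable();
        using (var file = new FileStream(path, FileMode.Open, FileAccess.Read))
        using (var gzip = new GZipStream(file, CompressionMode.Decompress))
        {
            table.ReadXml(gzip);
        }
        return table;
    }
}

public static class CompressedTableXmlDemo
{
    public static void Main()
    {
        var table = new DataTable("ParentTable");
        table.Columns.Add("id", typeof(int));
        table.Columns.Add("ParentItem", typeof(string));
        for (int i = 0; i < 3; i++) table.Rows.Add(i, "ParentItem " + i);

        string path = Path.Combine(Path.GetTempPath(), "parent-table.xml.gz"); // hypothetical location
        CompressedTableXml.Save(table, path);
        Console.WriteLine(CompressedTableXml.Load(path).Rows.Count + " rows read back");
    }
}
```

Whether this is fast enough for five million rows would need measuring; the point is only that the output never has to exist as a single byte[].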
As already discussed this is a fundamental issue with trying to get contiguous blocks of memory in the Gigabyte sort of size.
You will be limited by (in increasing difficulty):
1) The amount of addressable memory.
   Since you are 64-bit, this will be your 12GB of physical memory, less any holes in it required by devices, plus any swap file space.
   Note that you must be running an app with the relevant PE headers that indicate it can run 64-bit, or you will run under WoW64 and only have 4GB of address space.
   Also note that the default target was changed in Visual Studio 2010, a contentious change.
2) The CLR's limitation that no single object may consume more than 2GB of space.
3) Finding a contiguous block within the available memory.
You can find that you run out of space before the CLR limit in 2) because the backing buffer in the stream is expanded in a 'doubling' fashion, and this swiftly results in the buffer being allocated in the Large Object Heap. This heap is not compacted in the same way the other heaps are (1), and as a result the process of building up to the theoretical maximum size of the buffer under 2) fragments the LOH, so that you fail to find a sufficiently large contiguous block before this happens.
Thus a mitigation approach if you are close to the limit is to set the initial capacity of the stream such that it definitely has sufficient space from the start via one of the constructors.
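The effect of presizing can be observed with a small experiment. This is a sketch only: StreamCapacityDemo is a made-up name and the 8 MB total is an arbitrary stand-in for the serialized payload.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class StreamCapacityDemo
{
    // Writes totalBytes in small chunks and returns each distinct Capacity seen along the way.
    public static List<int> ObserveCapacities(MemoryStream stream, int totalBytes)
    {
        var seen = new List<int>();
        var chunk = new byte[4096];
        int written = 0;
        while (written < totalBytes)
        {
            int count = Math.Min(chunk.Length, totalBytes - written);
            stream.Write(chunk, 0, count);
            written += count;
            if (seen.Count == 0 || seen[seen.Count - 1] != stream.Capacity)
            {
                seen.Add(stream.Capacity); // a change means the backing array was reallocated
            }
        }
        return seen;
    }

    public static void Main()
    {
        const int total = 8 * 1024 * 1024; // arbitrary size for the experiment

        using (var growing = new MemoryStream())
        {
            Console.WriteLine("default:      " + string.Join(", ", ObserveCapacities(growing, total)));
        }
        using (var presized = new MemoryStream(total))
        {
            Console.WriteLine("preallocated: " + string.Join(", ", ObserveCapacities(presized, total)));
        }
    }
}
```

Each capacity change in the first run is a new backing array plus a copy of the old one, and the discarded arrays are what fragment the LOH. The presized stream allocates once, which only helps if you can estimate the final size up front.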
Given that you are writing to the memory stream as part of a serialization process it would make sense to actually use streams as intended and use only the data required.
If you are serializing to some file based location then stream it into that directly.
If this is data going into a Sql Server database consider using:
FILESTREAM (SQL Server 2008 only, I'm afraid).
From 2005 onwards you can read/write in chunks but writing is not well integrated into ADO.Net
For versions prior to 2005 there are relatively unpleasant workarounds
If you are serializing this in memory for use in say a comparison then consider streaming the data being compared as well and diffing as you go along.
If you are persisting an object in memory to recreate it later then this really should be going to a file or a memory mapped file. In both cases the operating system is then free to structure it as best it can (in disk caches or pages being mapped in and out of main memory) and it is likely it will do a better job of this than most people are able to do themselves.
If you are doing this so that the data can be compressed then consider using streaming compression. Any block based compression stream can be fairly easily converted into a streaming mode with the addition of padding. If your compression API doesn't support this natively consider using one that does or writing the wrapper to do it.
If you are doing this to write to a byte buffer which is then pinned and passed to an unmanaged function then use the UnmanagedMemoryStream instead, this stands a slightly better chance of being able to allocate a buffer of this sort of size but is still not guaranteed to do so.
Perhaps if you tell us what you are serializing an object of this size for we might be able to tell you better ways to do it.
(1) This is an implementation detail you should not rely on.
1) The OS is x64, but is the app x64 (or AnyCPU)? If not, it is capped at 2GB.
2) Does this happen 'early on', or after the app has been running for some time (i.e. n serializations later)? Could it maybe be a result of large object heap fragmentation...?
Interestingly, it actually goes up to 3.7GB before giving a memory error here (Windows 7 x64).
Apparently, it would need about double that amount to complete.
Given that the application uses 1.65GB after creating the table, it seems likely that it's hitting the 2GB byte[] (or any single object) limit Marc Gravell is speaking of (1.65GB + 2GB ~= 3.7GB)
Based on this blog, I suppose you could allocate your memory using the WINAPI, and write your own MemoryStream implementation using that. That is, if you really wanted to do this. Or write one using more than one array of course :)
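The "more than one array" idea could look roughly like the sketch below. ChunkedWriteStream is a hypothetical, write-only stream written for this illustration, not an existing .NET type, and it is not tuned or hardened.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical write-only stream that stores data in many small arrays
// instead of one contiguous byte[].
public sealed class ChunkedWriteStream : Stream
{
    private readonly List<byte[]> chunks = new List<byte[]>();
    private readonly int chunkSize;
    private long length;

    // 64 KB default keeps each chunk below the roughly 85,000-byte large object threshold.
    public ChunkedWriteStream(int chunkSize = 64 * 1024)
    {
        if (chunkSize <= 0) throw new ArgumentOutOfRangeException("chunkSize");
        this.chunkSize = chunkSize;
    }

    public override bool CanRead { get { return false; } }
    public override bool CanSeek { get { return false; } }
    public override bool CanWrite { get { return true; } }
    public override long Length { get { return length; } }
    public override long Position
    {
        get { return length; }
        set { throw new NotSupportedException(); }
    }
    public int ChunkCount { get { return chunks.Count; } }

    public override void Write(byte[] buffer, int offset, int count)
    {
        while (count > 0)
        {
            int used = (int)(length % chunkSize);
            if (used == 0) chunks.Add(new byte[chunkSize]); // no chunk yet, or the last one is full
            int n = Math.Min(count, chunkSize - used);
            Buffer.BlockCopy(buffer, offset, chunks[chunks.Count - 1], used, n);
            offset += n;
            count -= n;
            length += n;
        }
    }

    // Copies the stored bytes to another stream, chunk by chunk.
    public void WriteTo(Stream destination)
    {
        long remaining = length;
        foreach (byte[] chunk in chunks)
        {
            int n = (int)Math.Min(remaining, chunk.Length);
            destination.Write(chunk, 0, n);
            remaining -= n;
        }
    }

    public override void Flush() { }
    public override int Read(byte[] buffer, int offset, int count) { throw new NotSupportedException(); }
    public override long Seek(long offset, SeekOrigin origin) { throw new NotSupportedException(); }
    public override void SetLength(long value) { throw new NotSupportedException(); }
}
```

Because no single array grows, there is no doubling reallocation and no need for one huge contiguous block, so the 2GB per-object limit stops being the bottleneck for the buffer. The total data still has to fit in the process's address space, and a serializer that needs to seek would require a fuller implementation.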