Huge managed memory allocation when reading (iterating) data with DbDataReader - c#

I wrote an app running on multiple client machines that connects to a remote Oracle database (master) to synchronize their local open source database (slave). This works fine so far. But sometimes a local table needs to be fully initialized (dropped, then refilled with all rows of the master table). If the master table is big enough (column count, data types/sizes and a certain row size) the app sometimes runs into an OutOfMemoryException.
The app is running on windows machines with .NET 4.0. Version of ODP.NET is 4.122.18.3. Oracle database is 12c (12.1.0.2.0).
I don't want to show the data to any user (the app is running in the background), otherwise I could do some paging or filtering. Since not all tables contain keys or can be sorted automatically, it's pretty hard to fetch the table in parts. The initialization of the local table should be done in one transaction without multiple partial commits.
I can break the problem down to a simple code sample showing the managed memory allocation, which I didn't expect. At this point I'm not sure how to explain or solve the problem.
using (var connection = new OracleConnection(CONNECTION_STRING))
{
connection.Open();
using (var command = connection.CreateCommand())
{
command.CommandText = STATEMENT;
using (var reader = command.ExecuteReader())
{
while (reader.Read())
{
//reader[2].ToString();
//reader.GetString(2);
//reader.GetValue(2);
}
}
}
}
When uncommenting any of the three reader.* calls the memory of the requested column data seems to be pinned internally by ODP.NET (OracleInternal.Network.OraBuf) for each record. For requesting a few thousand records this doesn't seem to be a problem. But when fetching 100k+ records the memory allocation gets to hundreds of MB, which leads to an OutOfMemoryException. The more data the specified column has the faster the OOM happens (mostly NVARCHAR2).
Additionally calling GC.Collect() manually doesn't do anything. The GC.Collect() calls shown in the image are done internally (no calls by myself).
Since I don't store the read data anywhere, I would have expected that the data is not cached while iterating the DbDataReader. Can you help me understand what is happening here and how to avoid it?

This seems to be a known bug (Bug 21975120) when reading CLOB column values using the ExecuteReader() method with the managed driver 12.1.0.2. The workaround is to use the OracleDataReader-specific methods (for example oracleDataReader.GetOracleValue(i)). The OracleClob value can then be closed explicitly to free the memory allocation.
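var item = oracleDataReader.GetOracleValue(columnIndex);
// The type pattern only matches a non-null OracleClob, so no extra null check is needed
if (item is OracleClob clob)
{
    // use clob.Value ...
    clob.Close(); // release the memory held for this CLOB
}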

Related

NHibernate OutOfMemoryException querying large byte[]

I'm trying to use Fluent NHibernate to migrate a database that needs some of the database 'massaged'. The source database is a MS Access database and the current table I'm stuck on is one with an OLE Object field. The target database is a MS SQL Server Express database.
In the entity I simply had this field defined as a byte[]. However, even when loading just that single field for a single record, I was hitting a System.OutOfMemoryException:
byte[] test = aSession.Query<Entities.Access.Revision>().Where(x => x.Id == 5590).Select(x => x.FileData).SingleOrDefault<byte[]>();
I then tried implementing the blob type listed here but now when running that I receive an error of:
"Unable to cast object of type 'System.Byte[]' to type 'TestProg.DatabaseConverter.Entities.Blob'."
I can't imagine the Ole Object is any larger than 100mb but haven't been able to check. Is there any good way using Fluent NHibernate to copy this out of the one database and save it to the other or will I need to look at other options?
My normal loop for processing these is:
IList<Entities.Access.Revision> result;
IList<int> recordIds = aSession.Query<Entities.Access.Revision>().Select(x => x.Id).ToList<int>();
foreach (int recordId in recordIds)
{
result = aSession.Query<Entities.Access.Revision>().Where(x => x.Id == recordId).ToList<Entities.Access.Revision>();
Save(sqlDb, result);
}
Save function just copies properties from one to another and for some entities is used to manipulate data or give feedback to user related to data problems. I'm using stateless sessions for both databases.
--
From further testing, the objects it appears to be hanging on are about 60-70 MB. I'm currently testing grabbing the data with an OleDbDataReader using GetBytes.
--
Update (Nov 24): I've yet to find a way to get this to work with NHibernate. I did get this working with regular db command objects. I've put the code for function I made below for anybody curious who finds this. This is code from my database converter so objects prefixed with 'a' are access database objects and 's' are sql ones.
public void MigrateBinaryField(int id, string tableName, string fieldName)
{
    // Assumption: 'buffer' is not declared in the original snippet, so a local chunk buffer is used here
    var buffer = new byte[8040];
    var aCmd = new OleDbCommand(String.Format(@"SELECT ID, {0} FROM {1} WHERE ID = {2}", fieldName, tableName, id), aConn);
    using (var reader = aCmd.ExecuteReader(System.Data.CommandBehavior.SequentialAccess))
    {
        while (reader.Read())
        {
            if (reader[fieldName] == DBNull.Value)
                return;
            long read = 0;
            long offset = 0;
            // Can't .WRITE a NULL column so need to set an initial value
            var sCmd = new SqlCommand(string.Format(@"UPDATE {0} SET {1} = @data WHERE OldId = @OldId", tableName, fieldName), sConn);
            sCmd.Parameters.AddWithValue("@data", new byte[0]);
            sCmd.Parameters.AddWithValue("@OldId", id);
            sCmd.ExecuteNonQuery();
            // Incrementally store binary field to avoid OutOfMemoryException from having entire field loaded in memory
            sCmd = new SqlCommand(string.Format(@"UPDATE {0} SET {1}.WRITE(@data, @offset, @len) WHERE OldId = @OldId", tableName, fieldName), sConn);
            while ((read = reader.GetBytes(reader.GetOrdinal(fieldName), offset, buffer, 0, buffer.Length)) > 0)
            {
                // Send only the bytes actually read; the last chunk is usually shorter than the buffer
                var chunk = buffer;
                if (read < buffer.Length)
                {
                    chunk = new byte[read];
                    Array.Copy(buffer, chunk, read);
                }
                sCmd.Parameters.Clear();
                sCmd.Parameters.AddWithValue("@data", chunk);
                sCmd.Parameters.AddWithValue("@offset", offset);
                sCmd.Parameters.AddWithValue("@len", read);
                sCmd.Parameters.AddWithValue("@OldId", id);
                sCmd.ExecuteNonQuery();
                offset += read;
            }
        }
    }
}
This sounds like the results I have seen with using .NET on top of other frameworks as well.
The native database driver beneath ADO.NET beneath NHibernate (two "beneaths" are intentional here) will require a pinned destination memory block that cannot be moved in memory while the driver fills it. Since the .NET garbage collector can randomly move blocks of memory on a separate thread in order to compact the heaps, NHibernate's underlying .NET database layer has to create a non-managed memory block to receive the data, which effectively doubles the amount of memory required to load a record.
Also, I have not verified this next point, but NHibernate should attempt to cache blocks of records, since it bypasses some of the relational database query operations. This allows NHibernate to make fewer database requests, which is optimal for smaller record sizes, but requires many records (including many blobs) to fit in memory at a time.
As a first step toward a resolution, make sure the process is really running the machine out of memory (or if it is 32-bit, make sure it is hitting the 2GB limit). If so, attempt to determine the baseline - if it is processing records with a variety of blob sizes, what is the minimum and maximum memory it uses? From that, you can estimate how much memory would be required for that large record (or the cache block that contains that record!)
64-bit and more physical memory may be a brute-force solution, if you aren't already running 64-bit, and if bigger hardware is even an option.
Another possible solution is to check whether NHibernate has configurable settings or properties for how it caches data. For example, check whether you can set a property that limits how many records are loaded at a time, or tell it to limit its cache to a certain size in bytes.
A more efficient solution is to use your ADO.NET code for the blobs; that might be the best solution, especially if you expect even larger blobs than this particular 60-70MB blob. MS Access will normally allow multiple read-only connections, so this should work as long as NHibernate doesn't set the database to block other connections.
I strongly suspect this is an accumulation due to NHibernate session cache.
Try to read each blob in a separate session or at least to flush/clear it periodically adding a counter 'i' to your loop and a condition like
if (i % 10 == 0)
{
aSession.Flush();
aSession.Clear();
}

Reading Oracle CLOB data from ODBC reader is super slow

We're downloading data from an Oracle database using C# and .Net 4.5.
values[] is an object array;
reader is the ODBC reader with an open connection to the Oracle database table with CLOB data.
Here's the relevant code:
if (reader.Read())
{ //Download and save the values
for (int x = 0; x < reader.FieldCount; x++)
{ //Populate all the values
values[x] = reader[x]; //this line seems to cause execution to hang
}
//
//blah blah blah
//
}
The C# code seems to hang on the line values[x] = reader[x];.
We're assigning every column in the row read to a special object array because we need to do separate stuff on that data later, and not have to worry about the data type at the moment.
The problem lies that when a table is hit with an Oracle CLOB data column that's big ( > 28,000 ), that line never seems to complete.
If we eliminate the CLOB column from what the odbc reader reads, everything works perfectly.
Questions:
Why would this be? Shouldn't the array assignment be relatively quick?
What are some possible workarounds so we can keep the CLOB columns in the data downloaded? We need to keep the ODBC reader as a generic ODBC reader (and not made Oracle specific).
The application is compiled and must remain 32-bit.
Thanks!
It is the CLOB type itself that is the problem!
Cast it to VARCHAR and the performance is OK:
change
select clob from table
to 2 selects
select DBMS_LOB.SUBSTR(clob,4000,1) from table where length(clob)<= 4000;
select clob from table where length(clob)> 4000;
It's easy to verify: the performance is slow for small CLOB columns as well, but casting them to VARCHAR solves the problem!
The problem is that every single CLOB column fetch issues 2 extra network roundtrips from the client to the server!
Normally you'll have one network roundtrip for several rows.
Unless there is a really compelling reason to use ODBC, you are far better off using ODP.net or managed ODP.net. While I can't say with 100% certainty that ODBC is your issue, I can tell you I have used .NET with LOBs numerous times without issue, using ODP.net. I can also tell you that the Microsoft driver for Oracle (deprecated) caused countless mystery issues, most of them around performance. For example, query 1 works fine, but replacing a literal with a bind variable results in a 15 second performance delay on the execute. Using ODP.net, the delay goes away. My guess is that ODBC may have similar mystery issues.
ODP.net (or DevArt dotConnect) are the only two tools I know of that enable .NET to leverage the OCI, which has great features like bulk inserts and updates. It's possible that OCI has a better way of dealing with large LOBs. Is there a compelling reason you have to use ODBC?
Add LONGDATACOMPAT=1; to your connection string:
string conn = @"DSN=database;UID=user;PWD=password;LONGDATACOMPAT=1;";

Exception of type 'System.OutOfMemoryException' was thrown. C# when using IDataReader

I have an application in which I have to get a large amount of data from DB.
Since it failed to get all of those rows at once (it's close to 2,000,000 rows), I split the work into batches: I run the SQL query repeatedly and fetch only 200,000 rows each time.
I add all of the data to a single DataTable (meaning all 2,000,000 rows should end up there).
The first few runs are fine. Then it fails with the OutOfMemoryException.
My code works as following:
private static void RunQueryAndAddToDT(string sql, string lastRowID, SqlConnection conn, DataTable dt, int prevRowCount)
{
if (string.IsNullOrEmpty(sql))
{
sql = generateSqlQuery(lastRowID);
}
if (conn.State == ConnectionState.Closed)
{
conn.Open();
}
using (IDbCommand cmd2 = conn.CreateCommand())
{
cmd2.CommandType = CommandType.Text;
cmd2.CommandText = sql;
cmd2.CommandTimeout = 0;
using (IDataReader reader = cmd2.ExecuteReader())
{
while (reader.Read())
{
DataRow row = dt.NewRow();
row["RowID"] = reader["RowID"].ToString();
row["MyCol"] = reader["MyCol"].ToString();
... //In one of these rows it returns the exception.
dt.Rows.Add(row);
}
}
}
if (conn != null)
{
conn.Close();
}
if (dt.Rows.Count > prevRowCount)
{
lastRowID = dt.Rows[dt.Rows.Count - 1]["RowID"].ToString();
sql = string.Empty;
RunQueryAndAddToDT(sql, lastRowID, conn, dt, dt.Rows.Count);
}
}
It seems to me as if the reader keeps collecting rows, and that's why it throws an exception only in the second or third round.
Shouldn't the using statement clean up the memory once it's done?
What may solve my problem?
Note: I should explain - I have no other choice but get all of those rows to the datatable, Since I do some manipulation on them later, and the order of the rows is important, and I can't split it because sometimes I have to take the data of some rows and set it into one row and so on and so on, so I can't give it up.
Thanks.
Check that you are building a 64-bit process and not a 32-bit one, which is the default compilation mode of Visual Studio. To do this, right-click your project, then Properties -> Build -> Platform target: x64. Like any 32-bit process, an application compiled as 32-bit has a virtual memory limit of 2 GB.
64-bit processes do not have this limitation, as they use 64-bit pointers, so their theoretical maximum address space is 16 exabytes (2^64). In reality, Windows x64 limits the virtual memory of processes to 8 TB. The solution to the memory limit problem is then to compile in 64-bit.
However, the size of a single object in .NET is still limited to 2 GB by default. You can create several arrays whose combined size is greater than 2 GB, but you cannot by default create a single array bigger than 2 GB. Fortunately, if you still want to create arrays bigger than 2 GB, you can enable them (on 64-bit platforms, with .NET 4.5 or later) by adding the following to your app.config file:
<configuration>
<runtime>
<gcAllowVeryLargeObjects enabled="true" />
</runtime>
</configuration>
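To confirm which mode the application actually runs in, a small check at startup can help. The following is a minimal sketch; the class and method names (ProcessBitness, Describe) are made up for illustration, while Environment.Is64BitProcess is a standard property available since .NET Framework 4.0.

```csharp
using System;

public static class ProcessBitness
{
    // Returns a short description of the address space the current process runs in.
    public static string Describe()
    {
        // Environment.Is64BitProcess reflects the running process, not the machine,
        // so a "Prefer 32-bit" build on a 64-bit OS still reports false here.
        return Environment.Is64BitProcess
            ? "64-bit process"
            : "32-bit process (about 2 GB of user address space by default)";
    }
}
```

Logging ProcessBitness.Describe() once at startup makes it easy to see whether a platform target change really took effect.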
I think simply you run out of memory because your DataTable gets so large from all the rows you keep adding to it.
You may want to try a different pattern in this case.
Instead of buffering your rows in a list (or DataTable), can you simply yield the rows as they are available for use when they arrive?
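As a rough sketch of that idea, an iterator method can hand out one record at a time instead of filling a DataTable. The names ReaderExtensions and StreamRows are hypothetical; only IDataReader and IDataRecord come from System.Data.

```csharp
using System.Collections.Generic;
using System.Data;

public static class ReaderExtensions
{
    // Lazily yields the reader's current record after each successful Read().
    // Nothing is buffered here: the yielded IDataRecord is the reader itself and
    // is only valid until the next Read(), so callers must copy out what they keep.
    public static IEnumerable<IDataRecord> StreamRows(this IDataReader reader)
    {
        while (reader.Read())
        {
            yield return reader;
        }
    }
}
```

A caller would then write foreach (var record in reader.StreamRows()) { ... } and process each row as it arrives. This only helps if the later processing can work on a sliding window of rows rather than on the complete set.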
Since you are using a DataTable, let me share a random problem that I was having using one. Check your Build properties. I had a problem with a DataTable throwing an out of memory exception randomly. As it turned out, the project's Build Platform target was set to Prefer 32-bit. Once I unselected that option, the random out of memory exception went away.
You are storing a copy of the data to dt. You are simply storing so much that the machine is running out of memory. So you have few options:
Increase the available memory.
Reduce the amount of data you are retrieving.
To increase the available memory, you can add physical memory to the machine. Note that a .NET process on a 32bit machine will not be able to access more than 2GB of memory though (3GB if you enable the 3GB switch in boot.ini) so you may need to switch to 64bit (machine and process) if you wish to address more memory than that.
Retrieving less data is probably the way to go. Depending upon what you are trying to achieve, you may be able to perform the task on subsets of the data (perhaps even on individual rows). If you are performing some kind of aggregation (e.g. producing a summary or report from the data) you may be able to employ Map-Reduce.

"cursor like" reading inside a CLR procedure/function

I have to implement an algorithm on data which is (for good reasons) stored inside SQL server. The algorithm does not fit SQL very well, so I would like to implement it as a CLR function or procedure. Here's what I want to do:
Execute several queries (usually 20-50, but up to 100-200) which all have the form select a,b,... from some_table order by xyz. There's an index which fits that query, so the result should be available more or less without any calculation.
Consume the results step by step. The exact stepping depends on the results, so it's not exactly predictable.
Aggregate some result by stepping over the results. I will only consume the first parts of the results, but cannot predict how much I will need. The stop criteria depends on some threshold inside the algorithm.
My idea was to open several SqlDataReader, but I have two problems with that solution:
You can have only one SqlDataReader per connection and inside a CLR method I have only one connection - as far as I understand.
I don't know how to tell SqlDataReader how to read data in chunks. I could not find documentation how SqlDataReader is supposed to behave. As far as I understand, it's preparing the whole result set and would load the whole result into memory. Even if I would consume only a small part of it.
Any hint how to solve that as a CLR method? Or is there a more low level interface to SQL server which is more suitable for my problem?
Update: I should have made two points more explicit:
I'm talking about big data sets, so a query might result in 1 mio records, but my algorithm would consume only the first 100-200 ones. But as I said before: I don't know the exact number beforehand.
I'm aware that SQL might not be the best choice for that kind of algorithm. But due to other constraints it has to be a SQL server. So I'm looking for the best possible solution.
SqlDataReader does not read the whole dataset, you are confusing it with the Dataset class. It reads row by row, as the .Read() method is being called. If a client does not consume the resultset the server will suspend the query execution because it has no room to write the output into (the selected rows). Execution will resume as the client consumes more rows (SqlDataReader.Read is being called). There is even a special command behavior flag SequentialAccess that instructs the ADO.Net not to pre-load in memory the entire row, useful for accessing large BLOB columns in a streaming fashion (see Download and Upload images from SQL Server via ASP.Net MVC for a practical example).
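To illustrate the streaming style of access mentioned above, here is a minimal sketch that copies one binary column of the current row to a stream in fixed-size chunks using GetBytes. The helper name CopyColumnTo and the default buffer size are assumptions made for this example. The caller is assumed to have opened the reader with CommandBehavior.SequentialAccess and to have called Read() already.

```csharp
using System.Data.Common;
using System.IO;

public static class BlobStreaming
{
    // Copies the binary column at 'ordinal' of the reader's current row to 'destination'
    // in chunks of at most 'bufferSize' bytes, and returns the total number of bytes copied.
    public static long CopyColumnTo(DbDataReader reader, int ordinal, Stream destination, int bufferSize = 81920)
    {
        var buffer = new byte[bufferSize];
        long offset = 0;
        long read;
        // GetBytes returns 0 once the end of the field has been reached.
        while ((read = reader.GetBytes(ordinal, offset, buffer, 0, buffer.Length)) > 0)
        {
            destination.Write(buffer, 0, (int)read);
            offset += read;
        }
        return offset;
    }
}
```

Only one chunk is held in managed memory at a time, so the peak allocation depends on the buffer size rather than on the size of the BLOB.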
You can have multiple active result sets (SqlDataReader) active on a single connection when MARS is active. However, MARS is incompatible with SQLCLR context connections.
So you can create a CLR streaming TVF to do some of what you need in CLR, but only if you have one single SQL query source. Multiple queries would require you to abandon the context connection and use a fully fledged connection instead, i.e. connect back to the same instance in a loopback; this would allow MARS and thus let you consume multiple result sets. But loopback has its own issues, as it breaks the transaction boundaries you have from the context connection. Specifically, with a loopback connection your TVF won't be able to read the changes made by the same transaction that called the TVF, because it is a different transaction on a different connection.
SQL is designed to work against huge data sets, and is extremely powerful. With set based logic it's often unnecessary to iterate over the data to perform operations, and there are a number of built-in ways to do this within SQL itself.
1) write set based logic to update the data without cursors
2) use deterministic User Defined Functions with set based logic (you can do this with the SqlFunction attribute in CLR code). A non-deterministic function means the output is not always the same for the same input, and it will have the effect of turning the query into a cursor internally.
[SqlFunction(IsDeterministic = true, IsPrecise = true)]
public static int algorithm(int value1, int value2)
{
int value3 = ... ;
return value3;
}
3) use cursors as a last resort. This is a powerful way to execute logic per row on the database but has a performance impact. It appears from this article that CLR can outperform SQL cursors (thanks Martin).
I saw your comment that the complexity of using set based logic was too much. Can you provide an example? There are many SQL ways to solve complex problems - CTE, Views, partitioning etc.
Of course you may well be right in your approach, and I don't know what you are trying to do, but my gut says leverage the tools of SQL. Spawning multiple readers isn't the right way to approach the database implementation. It may well be that you need multiple threads calling into a SP to run concurrent processing, but don't do this inside the CLR.
To answer your question, with CLR implementations (and IDataReader) you don't really need to page results in chunks because you are not loading data into memory or transporting data over the network. IDataReader gives you access to the data stream row-by-row. By the sounds of it, your algorithm determines the number of records that need updating, so when this happens simply stop calling Read() and end at that point.
SqlMetaData[] columns = new SqlMetaData[3];
columns[0] = new SqlMetaData("Value1", SqlDbType.Int);
columns[1] = new SqlMetaData("Value2", SqlDbType.Int);
columns[2] = new SqlMetaData("Value3", SqlDbType.Int);
SqlDataRecord record = new SqlDataRecord(columns);
SqlContext.Pipe.SendResultsStart(record);
SqlDataReader reader = comm.ExecuteReader();
bool flag = true;
while (reader.Read() && flag)
{
    int value1 = Convert.ToInt32(reader[0]);
    int value2 = Convert.ToInt32(reader[1]);
    // some algorithm
    int newValue = ...;
    // Fill the outgoing record; the setters live on SqlDataRecord (columns 0..2), not on the reader
    record.SetInt32(0, value1);
    record.SetInt32(1, value2);
    record.SetInt32(2, newValue);
    SqlContext.Pipe.SendResultsRow(record);
    // keep going?
    flag = newValue < 100;
}
reader.Close();
// Mark the end of the result set started with SendResultsStart
SqlContext.Pipe.SendResultsEnd();
Cursors are a SQL only function. If you wanted to read chunks of data at a time, some sort of paging would be required so that only a certain amount of the records would be returned. If using Linq,
.Skip(Skip)
.Take(PageSize)
Skips and takes could be used to limit results returned.
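A minimal sketch of such paging as a reusable helper follows. The names Paging and Page are hypothetical; Skip and Take are the standard LINQ operators. Against a database provider the source should be ordered first (for example with OrderBy), otherwise the page boundaries are not well defined.

```csharp
using System.Linq;

public static class Paging
{
    // Returns the zero-based page 'pageIndex' of 'source', with 'pageSize' items per page.
    public static IQueryable<T> Page<T>(this IQueryable<T> source, int pageIndex, int pageSize)
    {
        return source.Skip(pageIndex * pageSize).Take(pageSize);
    }
}
```

For example, query.OrderBy(x => x.Id).Page(2, 100) would describe rows 201 to 300 of the ordered result, where query and Id stand in for your own query and key column.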
You can simply iterate over the DataReader by doing something like this:
using (IDataReader reader = Command.ExecuteReader())
{
while (reader.Read())
{
//Do something with this record
}
}
This would iterate over the results one at a time, similar to a cursor in SQL Server.
For multiple recordsets at once, try MARS
(if SQL Server)
http://msdn.microsoft.com/en-us/library/ms131686.aspx

LinqToSql InsertOnSubmit memory leak?

I am trying to isolate the source of a "memory leak" in my C# application. This application copies a large number of potentially large files into records in a database using the image column type in SQL Server. I am using LinqToSql and its associated objects for all database access.
The main loop iterates over a list of files and inserts. After removing much boilerplate and error handling, it looks like this:
foreach (Document doc in ImportDocs) {
using (var dc = new DocumentClassesDataContext(connection)) {
byte[] contents = File.ReadAllBytes(doc.FileName);
DocumentSubmission submission = new DocumentSubmission() {
Content = contents,
// other fields
};
dc.DocumentSubmissions.InsertOnSubmit(submission); // (A)
dc.SubmitChanges(); // (B)
}
}
Running this program over the entire input results in an eventual OutOfMemoryException. CLR Profiler reveals that 99% of the heap consists of large byte[] objects corresponding to the sizes of the files.
If I comment both lines A and B, this leak goes away. If I uncomment only line A, the leak comes back. I don't understand how this is possible, as dc is disposed for every iteration of the loop.
Has anyone encountered this before? I suspect directly calling stored procedures or doing inserts will avoid this leak, but I'd like to understand this before trying something else. What is going on?
Update
Including GC.Collect(); after line (B) appears to make no significant change to any case. This does not surprise me much, as CLR Profiler was showing a good number of GC events without explicitly inducing them.
Which operating system are you running this on? Your problem may not be related to Linq2Sql, but to how the operating system manages large memory allocations. For instance, Windows Server 2008 is much better at managing large objects in memory than XP. I have had instances where the code working with large files was leaking on XP but was running fine on Win 2008 server.
HTH
I don't entirely understand why, but making a copy of the iterating variable fixed it. As near as I can tell, LinqToSql was somehow making a copy of the DocumentSubmission inside each Document.
foreach (Document doc in ImportDocs) {
// make copy of doc that lives inside loop scope
Document copydoc = new Document() {
field1 = doc.field1,
field2 = doc.field2,
// complete copy
};
using (var dc = new DocumentClassesDataContext(connection)) {
byte[] contents = File.ReadAllBytes(copydoc.FileName);
DocumentSubmission submission = new DocumentSubmission() {
Content = contents,
// other fields
};
dc.DocumentSubmissions.InsertOnSubmit(submission); // (A)
dc.SubmitChanges(); // (B)
}
}
