C# OutOfMemory Issue when dealing with large data - c#

In our application we generate reports using a Windows Service. The data for the reports is fetched from SQL Server using a stored procedure. In some scenarios the result set returned contains 250,000 records. (We cannot avoid this; we need the data in one go because we have to run calculations over the complete set.)
Problem
Our application reads this data through a data reader and converts the result set into a custom collection of custom objects. Because the data is so large, the complete set cannot be stored in the custom collection, and an OutOfMemoryException is thrown. While the report is executing, Task Manager shows the process's memory usage going very high, and CPU utilization as well.
I am not sure what we should do in this case.
Can we increase the amount of memory allocated to a single process running under the CLR?
Any other workarounds?
Any help would be really appreciated.
Why do we need all the data at once: we need to run calculations on the complete result set.
We are using ADO.NET and transforming the result set into our custom object collection.
Our system is 32-bit.
We cannot page the data.
We cannot move the computation to SQL Server.
This stack trace might help:
Exception of type 'System.OutOfMemoryException' was thrown.
Server stack trace:
   at System.Collections.Generic.Dictionary`2.ValueCollection.System.Collections.Generic.IEnumerable<TValue>.GetEnumerator()
   at System.Linq.Enumerable.WhereEnumerableIterator`1.MoveNext()
   at System.Collections.Generic.List`1.InsertRange(Int32 index, IEnumerable`1 collection)
   at System.Collections.Generic.List`1.AddRange(IEnumerable`1 collection)
   at MyProject.Common.Data.DataProperty.GetPropertiesForType(Type t) in C:\Ashish-Stuff\Projects\HCPA\Dev Branch\Common\Benefits.Common\Data\DataProperty.shared.cs:line 60
   at MyProject.Common.Data.Extensions.GetProperties[T](T target) in C:\Ashish-Stuff\Projects\HCPA\Dev Branch\Common\Benefits.Common\Data\Extensions.shared.cs:line 30
   at MyProject.Common.Data.Factories.SqlServerDataFactoryContract`1.GetData(String procedureName, IDictionary`2 parameters, Nullable`1 languageId, Nullable`1 pageNumber, Nullable`1 pageSize)
Thanks,
Ashish

Could you serialize your custom collection of objects to disk every 1,000 rows or so? Then, when you return data, paginate it from those files?
More info on your use case, i.e. why you need to pull back 250,000 rows of data at once, would be helpful.

My first thought was that the computation could be done on the SQL Server side by a stored procedure. I suspect that approach requires some SQL Server jedi skills... but anyway, have you considered it?

I would love to see a code sample highlighting where exactly you are getting this error from. Is it on the data pull itself (populating the reader) or is it creating the object and adding it to the custom collection (populating the collection).
I have had similar issues before when dealing with VERY LARGE datasets, but met great success by leaving the data in a stream for as long as possible. Streams keep the data in memory, but nothing ever has direct access to the entire mess until you finish building an object from it. Now, given that the stack trace shows the error on a "MoveNext" operation, this may not work for you. In that case, try to chunk the data: grab 10k rows at a time or so; I know this can be done with SQL. It will make the data read take a lot longer, though.
EDIT
If you read this from the database into a local stream that you then pass around (just be careful not to close it), then you shouldn't run into these issues. Make a data wrapper class that you can pass around with an open stream and an open reader. Store the data in the stream and then use wrapper functions to read the specific data you need from it, things like GetSumOfXField() or AverageOfYValues(), etc. The data will never be in a live object, but you won't have to keep going back to the database for it.
Pseudo Example
public void ReadingTheDataFunction()
{
    DbDataReader reader = dbCommand.ExecuteReader();
    MyDataStore.FillDataSource(reader);
}

private void FillDataSource(DbDataReader reader)
{
    StreamWriter writer = new StreamWriter(GlobalDataStream);
    while (reader.Read())
        writer.WriteLine(BuildStringFromDataRow(reader));
    writer.Flush();
    reader.Close();
}

private CustomObject GetNextRow()
{
    string line = GlobalDataReader.ReadLine();
    // parse the line into a CustomObject
    CustomObject ret = ParseLine(line);
    return ret;
}
From there you pass around MyDataStore, and as long as the stream and reader aren't closed, you can move your position around, go looking for individual entries, compile sums and averages, etc. You never really have to know you aren't dealing with a live object, as long as you only interact with it via the interface functions.
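A minimal sketch of that wrapper idea. The column names and the CSV-style line format are illustrative assumptions (the original BuildStringFromDataRow format isn't shown), and a temp-file-backed FileStream is used here so the buffer itself stays off the managed heap; a MemoryStream would work the same way but keeps everything in RAM:

```csharp
using System;
using System.Data.Common;
using System.IO;

// Wrapper that owns a buffered stream of rows and answers aggregate
// queries without materializing the full collection of row objects.
public class MyDataStore : IDisposable
{
    private readonly Stream _data =
        new FileStream(Path.GetTempFileName(), FileMode.Create,
                       FileAccess.ReadWrite, FileShare.None);

    public void FillDataSource(DbDataReader reader)
    {
        var writer = new StreamWriter(_data);
        while (reader.Read())
            writer.WriteLine("{0},{1}", reader["XField"], reader["YValue"]);
        writer.Flush();   // keep _data open; do not dispose the writer
        reader.Close();   // release the database connection early
    }

    // Example aggregate: stream back through the buffered rows,
    // holding only one parsed line in memory at a time.
    public double GetSumOfXField()
    {
        _data.Position = 0;
        double sum = 0.0;
        var lineReader = new StreamReader(_data);
        string line;
        while ((line = lineReader.ReadLine()) != null)
            sum += double.Parse(line.Split(',')[0]);
        return sum;
    }

    public void Dispose() { _data.Dispose(); }
}
```

Each aggregate function rewinds the stream and walks it row by row, so peak managed memory is one line of text plus the running total, regardless of row count.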

Related

Out of memory when copying large database with C# / ADO.Net Entities

Yet another How-to-free-memory question:
I'm copying data between two databases which are currently identical but will soon be getting out of sync. I have put together a skeleton app in C# using Reflection and ADO.Net Entities that does this:
For each table in the source database:
Clear the corresponding table in the destination database
For each object in the source table
For each property in the source object
If an identically-named property exists in the destination object, use Reflection to copy the source property to the destination property
This works great until I get to the big 900MB table that has user-uploaded files in it.
The process of copying the blobs (up to 7 MB each) to my machine and back to the destination database uses up local memory. However, that memory isn't getting freed, and the process dies once it's copied about 750 MB worth of data - with my program having 1500 MB of allocated space when the OutOfMemoryException is thrown, presumably two copies of everything that it's copied so far.
I tried a naive approach first, doing a simple copy. It worked on every table until I got to the big one. I have tried forcing a GC.Collect() with no obvious change to the results. I've also tried putting the actual copy into a separate function in hopes that the reference going out of scope would help it get GCed. I even put a Thread.Sleep in to try to give background processes more time to run. All of these have had no effect.
Here's the relevant code as it exists right now:
public static void CopyFrom<TSource, TDest>(this ObjectSet<TDest> Dest, ObjectSet<TSource> Source, bool SaveChanges, ObjectContext context)
    where TSource : class
    where TDest : class
{
    int total = Source.Count();
    int count = 0;
    foreach (var src in Source)
    {
        count++;
        CopyObject(src, Dest);
        if (SaveChanges && context != null)
        {
            context.SaveChanges();
            GC.Collect();
            if (count % 100 == 0)
            {
                Thread.Sleep(2000);
            }
        }
    }
}
I didn't include the CopyObject() function, it just uses reflection to evaluate the properties of src and put them into identically-named properties in a new object to be appended to Dest.
SaveChanges is a Boolean variable passed in saying that this extra processing should be done, it's only true on the big table, false otherwise.
So, my question: How can I modify this code to not run me out of memory?
The problem is that your database context utilizes a lot of caching internally, and it's holding onto a lot of your information and preventing the garbage collector from freeing it (whether you call Collect or not).
This means that your context is defined at too high of a scope. (It appears, based on your edit, that you're using it across tables. That's...not good.) You haven't shown where it is defined, but wherever it is it should probably be at a lower level. Keep in mind that because of connection pooling creating new contexts is not expensive, and based on your use cases you shouldn't need to rely on a lot of the cached info (because you're not touching items more than once) so frequently creating new contexts shouldn't add performance costs, even though it's substantially decreasing your memory footprint.
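A minimal sketch of that advice, assuming a hypothetical `ObjectContext` subclass `MyEntities` with `SourceFiles`/`DestFiles` sets and an integer key `Id` (none of these names are from the question); each batch gets fresh contexts so their internal caches never hold more than one batch of blobs:

```csharp
using System.Linq;

// All entity/context names here are placeholders, not the asker's real model.
static void CopyBigTableInBatches(string sourceCs, string destCs)
{
    const int batchSize = 100;
    int lastId = 0;
    bool more = true;

    while (more)
    {
        // Fresh contexts per batch: their identity-map caches are disposed
        // at the end of each pass, so the GC can reclaim the copied blobs.
        using (var src = new MyEntities(sourceCs))
        using (var dst = new MyEntities(destCs))
        {
            var batch = src.SourceFiles
                           .Where(f => f.Id > lastId)
                           .OrderBy(f => f.Id)
                           .Take(batchSize)
                           .ToList();
            more = batch.Count == batchSize;

            foreach (var row in batch)
            {
                CopyObject(row, dst.DestFiles);   // reflection copy, as in the question
                lastId = row.Id;
            }
            dst.SaveChanges();
        }   // both contexts disposed here; memory footprint resets per batch
    }
}
```

Keyset pagination (`Id > lastId` ordered by `Id`) rather than Skip/Take keeps each batch query cheap on the 900 MB table.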

Disposing a data reader when constructing it by value vs reference

Rookie question about when a data reader actually gets released when it's constructed in a class and passed back by ref vs. by value. I've been testing this today and the results are puzzling me a bit – hoping to get this clear in my head.
I have a class that I use for fetching data via ODBC from numerous remote servers but I need to restrict how many ODBC connections are open to each server I'm attached to – so I’m being careful about properly disposing data readers when I’m done with them before opening another. Short version is I have a method called FillDataReader that takes a data reader object, and fills it based on your query and passes it back.
If I pass it using ref and dispose the data reader from the calling side, all is well. The connection is released immediately and the client side can get another data reader filled without burning a connection. However if I pass by value, the resource is not released and if I open another data reader from the client side I now have two connections to that server.
Conceptually I get the difference – with ref only a single address is being used as it’s passing a “pointer to a pointer” and the dispose releases that resource. OK, but even if passing by value and doing an explicit dispose on the client side what, exactly, is holding the resource? I’d rather pass by value here so I can use the nifty using construct on the client side but more importantly I want to understand better what’s happening here. In a nutshell here’s what it looks like
[DB fetch class]
public bool FillDataReader(string pQueryString, ref System.Data.Odbc.OdbcDataReader pDataReader, out string pErrorReason)
{
    // (uses a connection object that's been established at class construction time and stays up all the time)
    ...
    try
    {
        pDataReader = _Command.ExecuteReader();
    }
    ...
}
[Calling class]
strSQL = "SELECT Alias, ObjectID FROM vw_GlobalUser";
if (ServerList[currentServer].DatabaseFunctions.FillDataReader(strSQL, ref drTemp, out strErrorReason) == false)
….
drTemp.Dispose();
(at this point the connection is released to the server)
However, if I take the ref out and pass by value, the connection is not released at the point of Dispose in the calling class. It goes away eventually, but I need it gone immediately (hence the call to Dispose).
So is the fill function in the DB fetch class hanging onto a reference to the allocated space on the heap somehow? I’m not sure I understand why that is – understood it’s using another copy of the address to the data reader on the stack to reference the data reader object on the heap there but when it goes out of scope, isn’t that released? Maybe I need more coffee…
Since your calling code needs to receive the reference to release the object, you do need a ref (or out). Otherwise the parameter is only passed to the method, but not back, so that the drTemp isn't updated with the data reader created in the FillDataReader method.
Note that you may want to change the signature as follows to make the intention more clear:
public Result TryGetDataReader(string pQueryString, out System.Data.Odbc.OdbcDataReader pDataReader)
Changes that I propose:
Introduced the naming convention with "Try", which is common for this type of method
Made the pDataReader an out, since it doesn't need to be initialized when calling the method
Introduced a "Result" type, which should carry the success information and the error message (if any)
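A sketch of how calling the proposed signature could look, assuming a simple hypothetical `Result` type carrying the success flag and error message, and a `db` variable holding the DB fetch class (both assumptions, not from the question):

```csharp
using System.Data.Odbc;

// Hypothetical Result type carrying success/failure plus an error message.
public sealed class Result
{
    public readonly bool Success;
    public readonly string ErrorReason;
    private Result(bool success, string reason) { Success = success; ErrorReason = reason; }
    public static Result Ok() { return new Result(true, null); }
    public static Result Fail(string reason) { return new Result(false, reason); }
}

// Calling side: `out` makes it explicit that the reader is produced by the
// method, and `using` guarantees the ODBC connection is released promptly.
OdbcDataReader drTemp;
Result result = db.TryGetDataReader(strSQL, out drTemp);
if (result.Success)
{
    using (drTemp)
    {
        while (drTemp.Read())
        {
            // consume rows...
        }
    }   // disposed here; the connection is released immediately
}
```

This gets back the "nifty using construct" the asker wanted: the caller owns the reader's lifetime, and disposal no longer depends on how the parameter was passed.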

List<> items lost during ToArray()?

A while ago I coded a system for collecting public transport disruptions. Information about any incident is collected in an MSSQL database. Consumers access these data by calling an .asmx web service. The data are fetched from the DB using ADO.NET; each data row populates a Deviation object, which is added to a List. In the service layer, ToArray() is called on the list and the result is returned to the consumer.
So far, so good. But the problem is that in some cases (5% or so), we have noticed that the array is somehow curtailed. Instead of the usual 15-20 items, only half of them, or even fewer, are returned. The items that get dropped are always at the end of the original list. And, even more rarely, a couple of items are repeated or shuffled at the beginning of the array.
After doing some testing on the different layers, it seems the curtailing occurs at the end of the process, i.e. during the conversion to an array or the SOAP serialization. But the code seems so innocent, huh??:
[WebMethod]
public Deviation[] GetDeviationsByTimeInterval(DateTime from, DateTime to)
{
return DeviRoutines.GetDeviationsByTimeInterval(from, to).ToArray();
}
I am not 100% sure the error doesn't occur in the SQL or data access layer, but they have proved to do their job during the testing. Any help on the subject would be of great help! :)
I'd do something like:
public Deviation[] GetDeviationsByTimeInterval(DateTime from, DateTime to)
{
var v1 = DeviRoutines.GetDeviationsByTimeInterval(from, to);
LogMe( "v1: " + v1.Count );
var v2 = v1.ToArray();
LogMe( "v2: " + v2.Length );
return v2;
}
Proving what you expect usually pays off :-)
http://msdn.microsoft.com/en-us/library/x303t819.aspx
Return value: T[], an array containing copies of the elements of the List<T>.
You didn't find a bug in .NET, it's most likely something in your GetDeviationsByTimeInterval
I'd be willing to bet ToArray is doing exactly what it's told, but either your from or to values are occasionally junk (validation error?) or GetDeviationsByTimeInterval is misinterpreting them for some reason.
Stick some logging into both the web method and GetDeviationsByTimeInterval to see what values get passed in, and the next time it goes pear-shaped you'll be able to diagnose where the problem is.

Command.Prepare() Causing Memory Leakage?

I've sort of inherited some code on this scientific modelling project, and my colleagues and I are getting stumped by this problem. The guy who wrote this is now gone, so we can't ask him (go figure).
Inside the data access layer, there is this insert() method. This does what it sounds like -- it inserts records into a database. It is used by the various objects being modeled to tell the database about themselves during the course of the simulation.
However, we noticed that during longer simulations after a fair number of database inserts, we eventually got connection timeouts. So we upped the timeout limits, and then we started getting "out of memory" errors from PostgreSQL. We eventually pinpointed the problem to a line where an IDbCommand object uses Prepare(). Leaving it in causes memory usage to indefinitely go up. Commenting out this line causes the code to work just fine, and eliminates all the memory problems. What does Prepare() do that causes this? I can't find anything in the documentation to explain this.
A compressed version of the code follows.
public virtual void insert(DomainObjects.EntityObject obj)
{
    lock (DataBaseProvider.DataBase.Connection)
    {
        IDbCommand cmd = null;
        IDataReader noInsertIdReader = null;
        IDataReader reader = null;
        try
        {
            if (DataBaseProvider.DataBase.Validate)
            { ... }

            // create and prepare the insert command
            cmd = createQuery(".toInsert", obj);
            cmd.Prepare(); // This is what is screwing things up

            // get the query to retrieve the sequence number
            SqlStatement lastInsertIdSql = DAOLayer...getStatement(this.GetType().ToString() + ".toGetLastInsertId");

            // if the object's insert does not use a sequence, execute the insert command and return
            if (lastInsertIdSql == null)
            {
                noInsertIdReader = cmd.ExecuteReader();
                noInsertIdReader.Close();
                return;
            }

            // append the sequence query to the end of the insert statement
            cmd.CommandText += ";" + lastInsertIdSql.Statement;
            reader = cmd.ExecuteReader();

            // read the sequence number and set the object's id
            ...
        }
        // deal with some specific exceptions
        ...
    }
}
EDIT: (In response to the first given answer) All the database objects do get disposed in a finally block. I just cut that part out here to save space. We've played with that a bit and that didn't make any difference, so I don't think that's the problem.
You'll notice that IDbCommand and IDataReader both implement IDisposable. Whenever you create an instance of an IDisposable object you should either wrap it in a using statement or call Dispose once you're finished. If you don't you'll end up leaking resources (sometimes resources other than just memory).
Try this in your code
using (IDbCommand cmd = createQuery(".toInsert", obj))
{
cmd.Prepare(); // This is what is screwing things up
...
//the rest of your example code
...
}
EDIT to talk specifically about Prepare
I can see from the code that you're preparing the command and then never reusing it.
The idea behind preparing a command is that it costs extra overhead to prepare, but each subsequent use of the command is more efficient than a non-prepared statement. This is good if you've got a command that you're going to reuse a lot; it's a trade-off of whether the overhead is worth the performance increase.
So in the code you've shown us you are preparing the command (paying all of the overhead) and getting no benefit because you then immediately throw the command away!
I would either recycle the prepared command, or just ditch the call to the prepare statement.
I have no idea why the prepared commands are leaking, but you shouldn't be preparing so many commands in the first place (especially single use commands).
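A sketch of the "recycle the prepared command" option. The table and column names are illustrative (the real insert statements come from the project's SqlStatement layer); the point is that the command is prepared once and only its parameter values change per row:

```csharp
using System.Data;

// Prepare the parameterized INSERT once, then reuse it for every row.
// Table/column names and the `measurements` collection are illustrative.
using (IDbCommand cmd = connection.CreateCommand())
{
    cmd.CommandText = "INSERT INTO measurements (entity_id, value) VALUES (@id, @value)";

    IDbDataParameter idParam = cmd.CreateParameter();
    idParam.ParameterName = "@id";
    idParam.DbType = DbType.Int32;
    cmd.Parameters.Add(idParam);

    IDbDataParameter valueParam = cmd.CreateParameter();
    valueParam.ParameterName = "@value";
    valueParam.DbType = DbType.Double;
    cmd.Parameters.Add(valueParam);

    cmd.Prepare();   // pay the preparation cost once

    foreach (var m in measurements)
    {
        idParam.Value = m.EntityId;
        valueParam.Value = m.Value;
        cmd.ExecuteNonQuery();   // reuse the prepared statement each iteration
    }
}   // Dispose releases the command and any server-side prepared handle
```

One long-lived prepared command per statement shape, instead of one per insert, is what actually recoups the Prepare() overhead.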
The Prepare() method was designed to make the query run more efficiently. It is entirely up to the provider to implement this. A typical one creates a temporary stored procedure, giving the server an opportunity to pre-parse and optimize the query.
There's a couple of ways code like this could leak memory. One is a typical .NET detail, a practical implementation of an IDbCommand class always has a Dispose() method to release resources explicitly before the finalizer thread does it. I don't see it being used in your snippet. But pretty unlikely in this case, it is very hard to consume all memory without ever running the garbage collector. You can tell from Perfmon.exe and observe the performance counters for the garbage collector.
The next candidate is more insidious: you are using a big chunk of native code. Database providers are not that simple, and the FOSS kind especially tends to rely on its users to get the bugs out. Source code is available for a reason. Perfmon.exe again to diagnose that: seeing the managed heaps not growing beyond bounds while private bytes explode is a dead giveaway.
If you don't feel much like debugging the provider you could just comment the statement.

How to deal with large result sets with Linq to Entities?

I have a fairly complex linq to entities query that I display on a website. It uses paging so I never pull down more than 50 records at a time for display.
But I also want to give the user the option to export the full results to Excel or some other file format.
My concern is that there could potentially be a large # of records all being loaded into memory at one time to do this.
Is there a way to process a linq result set 1 record at a time like you could w/ a datareader so only 1 record is really being kept in memory at a time?
I've seen suggestions that if you enumerate over the LINQ query with a foreach loop, the records will not all be read into memory at once and will not overwhelm the server.
Does anyone have a link to something I could read to verify this?
I'd appreciate any help.
Thanks
Set the query's MergeOption to MergeOption.NoTracking (since this is a read-only operation). If you are using the same ObjectContext for saving other data, detach each object from the context after processing it.
How to detach:
foreach (var item in query)
{
    // do something with item
    objectContext.Detach(item);
}
Edit: if you are using the NoTracking option, there is no need to detach.
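A sketch of the NoTracking approach for the export path, assuming a hypothetical `ObjectContext` subclass `MyEntities` with an `Orders` set (names not from the question); because foreach streams the results, only one materialized entity needs to be alive per iteration:

```csharp
using System;
using System.Data.Objects;
using System.IO;
using System.Linq;

static void ExportOrders(DateTime fromDate, DateTime toDate)
{
    using (var ctx = new MyEntities())
    using (var writer = new StreamWriter("export.csv"))
    {
        // NoTracking: entities are not cached in the ObjectStateManager,
        // so each one becomes collectible as soon as the loop moves on.
        ctx.Orders.MergeOption = MergeOption.NoTracking;

        foreach (var order in ctx.Orders.Where(o => o.CreatedOn >= fromDate && o.CreatedOn < toDate))
        {
            writer.WriteLine(string.Format("{0},{1:o},{2}", order.Id, order.CreatedOn, order.Total));
        }
        // Avoid .ToList()/.ToArray() here: either one would materialize the
        // entire result set in memory before the first row is written.
    }
}
```

The same query can still back the paged website view; only the export path needs to stream.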
Edit 2: I wrote to Matt Warren about this scenario, and am posting the relevant private correspondence here with his approval:

    The results from SQL Server may not even be all produced by the server yet. The query has started on the server and the first batch of results are transferred to the client, but no more are produced (or they are cached on the server) until the client requests to continue reading them. This is what is called 'firehose cursor' mode, or sometimes referred to as streaming. The server is sending them as fast as it can, and the client is reading them as fast as it can (your code), but there is a data transfer protocol underneath that requires acknowledgement from the client to continue sending more data.
Since IQueryable<T> inherits from IEnumerable<T>, the underlying query sent to the server is the same either way. However, when we call .ToList(), the data reader used by the underlying connection starts populating objects immediately; they are all loaded into the app domain and cannot be collected until the list itself is released, so you might run out of memory.
When you use foreach over the IEnumerable<T>, the data reader reads the SQL result set one row at a time; each object is created, used, and becomes collectible before the next one is read. The underlying connection may receive data in chunks and may not send an acknowledgement back to SQL Server until all the chunks are read. Hence you will not run into an 'out of memory' exception.
Edit 3:
While your query is running, you can actually open SQL Server's "Activity Monitor" and see the query, with the Task State as SUSPENDED and the Wait Type as ASYNC_NETWORK_IO, which indicates that the result is sitting in SQL Server's network buffer. You can read more about it here and here.
Look at the return value of the LINQ query. It should be IEnumerable<>, which only loads one object at a time. If you then use something like .ToList(), they will all be loaded into memory. Just make sure your code doesn't maintain a list or use more than one instance at a time and you will be fine.
Edit: To add on to what people have said about foreach... If you do something like:
var query = from o in Objects
            where o.Name == "abc"
            select o;

foreach (Object o in query)
{
    // Do something with o
}
The query portion uses deferred execution (see examples), so the objects are not in memory yet. The foreach iterates through the results, getting only one object at a time. query uses IEnumerator, which has Reset() and MoveNext(); the foreach calls MoveNext() on each iteration until there are no more results.
