I am using VSTS2008 + C# + .Net 3.5 to run this console application on x64 Server 2003 Enterprise with 12G physical memory.
Here is my code, and I find that when executing the statement bformatter.Serialize(stream, table), an out of memory exception is thrown. I monitored memory usage through the Performance tab of Task Manager and found that only 2GB of physical memory was in use when the exception was thrown, so it should not really be out of memory. :-(
Any ideas what is wrong? Any limitation of .Net serialization?
static DataTable MakeParentTable()
{
    // Create a new DataTable.
    System.Data.DataTable table = new DataTable("ParentTable");
    // Declare variables for DataColumn and DataRow objects.
    DataColumn column;
    DataRow row;

    // Create new DataColumn, set DataType,
    // ColumnName and add to DataTable.
    column = new DataColumn();
    column.DataType = System.Type.GetType("System.Int32");
    column.ColumnName = "id";
    column.ReadOnly = true;
    column.Unique = true;
    // Add the column to the DataColumnCollection.
    table.Columns.Add(column);

    // Create second column.
    column = new DataColumn();
    column.DataType = System.Type.GetType("System.String");
    column.ColumnName = "ParentItem";
    column.AutoIncrement = false;
    column.Caption = "ParentItem";
    column.ReadOnly = false;
    column.Unique = false;
    // Add the column to the table.
    table.Columns.Add(column);

    // Make the ID column the primary key column.
    DataColumn[] PrimaryKeyColumns = new DataColumn[1];
    PrimaryKeyColumns[0] = table.Columns["id"];
    table.PrimaryKey = PrimaryKeyColumns;

    // Create just over five million DataRow objects and add
    // them to the DataTable.
    for (int i = 0; i <= 5000000; i++)
    {
        row = table.NewRow();
        row["id"] = i;
        row["ParentItem"] = "ParentItem " + i;
        table.Rows.Add(row);
    }
    return table;
}

static void Main(string[] args)
{
    DataTable table = MakeParentTable();
    Stream stream = new MemoryStream();
    BinaryFormatter bformatter = new BinaryFormatter();
    bformatter.Serialize(stream, table); // out of memory exception here
    Console.WriteLine(table.Rows.Count);
    return;
}
thanks in advance,
George
Note: DataTable defaults to the XML serialization format that was used in .NET 1.*, which is incredibly inefficient. One thing to try is switching to the newer format:
dt.RemotingFormat = System.Data.SerializationFormat.Binary;
Re the out-of-memory / 2GB: individual .NET objects (such as the byte[] behind a MemoryStream) are limited to 2GB. Perhaps try writing to a FileStream instead?
(edit: nope: tried that, still errors)
I also wonder if you may get better results (in this case) using table.WriteXml(stream), perhaps with compression such as GZIP if space is a premium.
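As a rough sketch of that WriteXml-plus-compression idea (needs System.IO, System.IO.Compression and System.Data; the output path is purely illustrative):

using (FileStream fs = new FileStream(@"C:\temp\parent.xml.gz", FileMode.Create))
using (GZipStream gzip = new GZipStream(fs, CompressionMode.Compress))
{
    // WriteSchema keeps the column definitions alongside the data
    table.WriteXml(gzip, XmlWriteMode.WriteSchema);
}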
As already discussed, this is a fundamental issue with trying to get contiguous blocks of memory in the gigabyte size range.
You will be limited by (in increasing difficulty):
The amount of addressable memory
Since you are on 64-bit, this will be your 12GB of physical memory, less any holes in it required by devices, plus any swap file space.
Note that you must be running an app with the relevant PE headers indicating it can run 64-bit, or you will run under WoW64 and only have 4GB of address space.
Also note that the default platform target was changed in Visual Studio 2010, a contentious change.
The CLR's limitation that no single object may consume more than 2GB of space.
Finding a contiguous block within the available memory.
You can find that you run out of space before the CLR's 2GB limit because the backing buffer in the stream is expanded in a 'doubling' fashion, and this swiftly results in the buffer being allocated on the Large Object Heap. This heap is not compacted in the same way the other heaps are(1), and as a result the process of building the buffer up towards the theoretical 2GB maximum fragments the LOH, so that you fail to find a sufficiently large contiguous block before that limit is reached.
Thus a mitigation approach, if you are close to the limit, is to set the initial capacity of the stream so that it definitely has sufficient space from the start, via one of the constructors.
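A minimal sketch of that pre-sizing mitigation applied to the asker's Main; the capacity figure here is purely illustrative, would have to be tuned for your data, and must stay under the 2GB single-object limit:

const int estimatedSize = 1500000000; // ~1.5GB, illustrative guess only
MemoryStream stream = new MemoryStream(estimatedSize); // one up-front allocation, no doubling
bformatter.Serialize(stream, table);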
Given that you are writing to the memory stream as part of a serialization process it would make sense to actually use streams as intended and use only the data required.
If you are serializing to some file-based location then stream it into that directly (see the sketch after these suggestions).
If this is data going into a SQL Server database, consider using:
FILESTREAM (SQL Server 2008 only, I'm afraid).
From SQL Server 2005 onwards you can read/write in chunks, but writing is not well integrated into ADO.NET.
For versions prior to 2005 there are relatively unpleasant workarounds.
If you are serializing this in memory for use in, say, a comparison, then consider streaming the data being compared as well and diffing as you go along.
If you are persisting an object in memory to recreate it later, then this really should be going to a file or a memory-mapped file. In both cases the operating system is then free to structure it as best it can (in disk caches or in pages being mapped in and out of main memory), and it is likely to do a better job of this than most people can do themselves.
If you are doing this so that the data can be compressed, then consider using streaming compression. Any block-based compression stream can be fairly easily converted into a streaming mode with the addition of padding. If your compression API doesn't support this natively, consider using one that does or writing a wrapper to do it.
If you are doing this to write to a byte buffer which is then pinned and passed to an unmanaged function, then use UnmanagedMemoryStream instead; this stands a slightly better chance of being able to allocate a buffer of this sort of size, but is still not guaranteed to do so.
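To make the first suggestion concrete, here is a minimal sketch of serializing the table straight to disk rather than into memory, combined with the binary RemotingFormat mentioned earlier (needs System.IO, System.Data and System.Runtime.Serialization.Formatters.Binary; the path is a placeholder):

table.RemotingFormat = System.Data.SerializationFormat.Binary; // skip the 1.* XML format
using (FileStream fs = new FileStream(@"C:\temp\parent.bin", FileMode.Create, FileAccess.Write, FileShare.None))
{
    BinaryFormatter formatter = new BinaryFormatter();
    formatter.Serialize(fs, table); // streamed to disk; no single giant in-memory byte[]
}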
Perhaps if you tell us what you are serializing an object of this size for we might be able to tell you better ways to do it.
(1) This is an implementation detail you should not rely on.
1) The OS is x64, but is the app x64 (or AnyCPU)? If not, it is capped at 2GB.
2) Does this happen 'early on', or after the app has been running for some time (i.e. n serializations later)? Could it maybe be a result of large object heap fragmentation...?
Interestingly, it actually goes up to 3.7GB before giving a memory error here (Windows 7 x64).
Apparently, it would need about double that amount to complete.
Given that the application uses 1.65GB after creating the table, it seems likely that it's hitting the 2GB byte[] (or any single object) limit Marc Gravell is speaking of (1.65GB + 2GB ~= 3.7GB)
Based on this blog post, I suppose you could allocate your memory using the Win32 API and write your own MemoryStream implementation on top of that. That is, if you really wanted to do this. Or write one backed by more than one array, of course :) (a sketch of the latter is below).
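For the "more than one array" route, here is a minimal write-only sketch of a chunked stream; it is enough for a forward-only writer such as BinaryFormatter.Serialize, which only writes sequentially. The chunk size is arbitrary, and Read/Seek are deliberately left unsupported to keep the sketch short.

using System;
using System.Collections.Generic;
using System.IO;

// Write-only Stream backed by a list of fixed-size chunks, so no single
// allocation ever approaches the 2GB object limit.
class ChunkedWriteStream : Stream
{
    private const int ChunkSize = 64 * 1024 * 1024; // 64MB chunks, arbitrary
    private readonly List<byte[]> chunks = new List<byte[]>();
    private long length;

    public override bool CanRead { get { return false; } }
    public override bool CanSeek { get { return false; } }
    public override bool CanWrite { get { return true; } }
    public override long Length { get { return length; } }
    public override long Position
    {
        get { return length; }
        set { throw new NotSupportedException(); }
    }

    public override void Write(byte[] buffer, int offset, int count)
    {
        while (count > 0)
        {
            int posInChunk = (int)(length % ChunkSize);
            if (posInChunk == 0)
                chunks.Add(new byte[ChunkSize]); // start a new chunk when the last one is full
            byte[] current = chunks[chunks.Count - 1];
            int toCopy = Math.Min(count, ChunkSize - posInChunk);
            Buffer.BlockCopy(buffer, offset, current, posInChunk, toCopy);
            offset += toCopy;
            count -= toCopy;
            length += toCopy;
        }
    }

    public override void Flush() { }
    public override int Read(byte[] buffer, int offset, int count) { throw new NotSupportedException(); }
    public override long Seek(long offset, SeekOrigin origin) { throw new NotSupportedException(); }
    public override void SetLength(long value) { throw new NotSupportedException(); }
}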
Related
I'm trying to use Fluent NHibernate to migrate a database that needs some of its data 'massaged'. The source database is an MS Access database and the current table I'm stuck on is one with an OLE Object field. The target database is an MS SQL Server Express database.
In the entity I simply had this field defined as a byte[]; however, even when just loading that single field for a single record I was hitting a System.OutOfMemoryException:
byte[] test = aSession.Query<Entities.Access.Revision>().Where(x => x.Id == 5590).Select(x => x.FileData).SingleOrDefault<byte[]>();
I then tried implementing the blob type listed here, but now when running that I receive an error of:
"Unable to cast object of type 'System.Byte[]' to type
'TestProg.DatabaseConverter.Entities.Blob'."
I can't imagine the OLE Object is any larger than 100MB, but I haven't been able to check. Is there any good way using Fluent NHibernate to copy this out of the one database and save it to the other, or will I need to look at other options?
My normal loop for processing these is:
IList<Entities.Access.Revision> result;
IList<int> recordIds = aSession.Query<Entities.Access.Revision>().Select(x => x.Id).ToList<int>();
foreach (int recordId in recordIds)
{
    result = aSession.Query<Entities.Access.Revision>().Where(x => x.Id == recordId).ToList<Entities.Access.Revision>();
    Save(sqlDb, result);
}
The Save function just copies properties from one entity to another and, for some entities, is used to manipulate data or give the user feedback about data problems. I'm using stateless sessions for both databases.
--
From further testing, the objects it appears to be hanging on are about 60-70MB. I'm currently testing grabbing the data with an OleDbDataReader using GetBytes.
--
Update (Nov 24): I've yet to find a way to get this to work with NHibernate. I did get this working with regular db command objects. I've put the code for the function I made below for anybody curious who finds this. This is code from my database converter, so objects prefixed with 'a' are Access database objects and 's' are SQL ones.
public void MigrateBinaryField(int id, string tableName, string fieldName)
{
    var aCmd = new OleDbCommand(String.Format(@"SELECT ID, {0} FROM {1} WHERE ID = {2}", fieldName, tableName, id), aConn);
    using (var reader = aCmd.ExecuteReader(System.Data.CommandBehavior.SequentialAccess))
    {
        while (reader.Read())
        {
            if (reader[fieldName] == DBNull.Value)
                return;

            long read = 0;
            long offset = 0;
            byte[] buffer = new byte[1024 * 1024]; // chunk buffer; declared here so the snippet compiles (a field in the original code)

            // Can't .WRITE to a NULL column so need to set an initial value
            var sCmd = new SqlCommand(string.Format(@"UPDATE {0} SET {1} = @data WHERE OldId = @OldId", tableName, fieldName), sConn);
            sCmd.Parameters.AddWithValue("@data", new byte[0]);
            sCmd.Parameters.AddWithValue("@OldId", id);
            sCmd.ExecuteNonQuery();

            // Incrementally store the binary field to avoid an OutOfMemoryException from having the entire field loaded in memory
            sCmd = new SqlCommand(string.Format(@"UPDATE {0} SET {1}.WRITE(@data, @offset, @len) WHERE OldId = @OldId", tableName, fieldName), sConn);
            while ((read = reader.GetBytes(reader.GetOrdinal(fieldName), offset, buffer, 0, buffer.Length)) > 0)
            {
                sCmd.Parameters.Clear();
                sCmd.Parameters.AddWithValue("@data", buffer);
                sCmd.Parameters.AddWithValue("@offset", offset);
                sCmd.Parameters.AddWithValue("@len", read);
                sCmd.Parameters.AddWithValue("@OldId", id);
                sCmd.ExecuteNonQuery();
                offset += read;
            }
        }
    }
}
This sounds like the results I have seen with using .NET on top of other frameworks as well.
The native database driver beneath ADO.NET beneath NHibernate (two "beneaths" are intentional here) will require a pinned destination memory block that cannot be moved in memory while the driver fills it. Since the .NET garbage collector can randomly move blocks of memory on a separate thread in order to compact the heaps, NHibernate's underlying .NET database layer has to create a non-managed memory block to receive the data, which effectively doubles the amount of memory required to load a record.
Also, I have not verified this next point, but NHibernate should attempt to cache blocks of records, since it bypasses some of the relational database query operations. This allows NHibernate to make fewer database requests, which is optimal for smaller record sizes, but requires many records (including many blobs) to fit in memory at a time.
As a first step toward a resolution, make sure the process is really running the machine out of memory (or if it is 32-bit, make sure it is hitting the 2GB limit). If so, attempt to determine the baseline - if it is processing records with a variety of blob sizes, what is the minimum and maximum memory it uses? From that, you can estimate how much memory would be required for that large record (or the cache block that contains that record!)
64-bit and more physical memory may be a brute-force solution, if you aren't already running 64-bit, and if bigger hardware is even an option.
Another possible solution is to check whether NHibernate has configurable settings or properties for how it caches data. For example, check whether you can set a property that limits how many records are loaded at a time, or tell it to limit its cache to a certain size in bytes.
A more efficient solution is to use your ADO.NET code for the blobs; that might be the best solution, especially if you expect even larger blobs than this particular 60-70MB blob. MS Access will normally allow multiple read-only connections, so this should work as long as NHibernate doesn't set the database to block other connections.
I strongly suspect this is an accumulation due to the NHibernate session cache.
Try to read each blob in a separate session, or at least flush/clear the session periodically by adding a counter 'i' to your loop and a condition like:
if (i % 10 == 0)
{
    aSession.Flush();
    aSession.Clear();
}
Here is what I am trying to achieve: looping through a dataset of over 1 million records and creating a data dump in a text file exported to the C drive.
I am looping through a dataset with over a million records. Here is what is inside the loop:
I am using a StringBuilder inside the loop.
myString.Append(ds.Tables[0].Rows[i][0]); // each value is no more than 10 characters long
It throws an error saying there is insufficient memory. I have 12GB of RAM.
How do I go about fixing this problem?
Don't use an intermediate StringBuilder -- its contents sit in your computer's RAM before you presumably call .ToString() on it to write the result to disk. Instead, write the data to disk as you are processing it, something like:
using (var sw = new StreamWriter(outputFilePath, true))
{
    // start loop
    sw.Write(ds.Tables[0].Rows[i][0]);
    // end loop
}
This will write text to a file using the default encoding (UTF-8) and buffer size (I think it's 4KB).
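If you want to be explicit about both, one of the StreamWriter constructors takes them directly (needs System.Text; the values here are just examples):

// append = true, explicit UTF-8 encoding, 64KB internal buffer
using (var sw = new StreamWriter(outputFilePath, true, Encoding.UTF8, 64 * 1024))
{
    // same write loop as above
    sw.Write(ds.Tables[0].Rows[i][0]);
}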
Why do you store the large string in memory at all? If all you want to do is write it to a text file, you could use a StreamWriter to write in batches:
using (var writer = new StreamWriter("c:\\file.txt", true))
{
    for (int rowNum = 0; rowNum < ds.Tables[0].Rows.Count; rowNum++)
    {
        DataRow row = ds.Tables[0].Rows[rowNum];
        writer.Write(row.Field<string>(0));
    }
}
But maybe you can optimize it further. Do you really need the large DataSet at all? If the data comes from a database, you could use a DataReader to stream it lazily; then you can write to the text file without holding everything in memory.
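A rough sketch of that DataReader approach, assuming a SQL Server source (needs System.Data.SqlClient and System.IO); the connection string, query and output path are placeholders:

using (var conn = new SqlConnection("your-connection-string-here"))
using (var cmd = new SqlCommand("SELECT SomeColumn FROM SomeTable", conn))
using (var writer = new StreamWriter(@"c:\dump.txt"))
{
    conn.Open();
    using (var reader = cmd.ExecuteReader())
    {
        // only the current row is ever held in memory
        while (reader.Read())
        {
            writer.WriteLine(reader.GetString(0));
        }
    }
}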
This means that the CLR cannot allocate an object of the size you've requested. Each process has its own RAM limit, so appending a million records to a StringBuilder is probably not possible on your machine or on any standard machine.
Even if you have a lot of memory, and even if you're running a 64-bit CLR, there are limits to the size of objects that can be created.
Your problem is that you try to have the whole file in memory, but it's too large to hold there. You don't need to have it all in memory at once: create smaller chunks (as you already do, for example lines), and instead of keeping all of them in memory at the same time, write them to disk and then "forget" about them, so that you only ever need the memory for one chunk, never for all chunks at once.
You can do that using LINQ because of a feature called deferred execution:
var collectionToBeIterated = ds.Tables[0].AsEnumerable().Select(r => r[0].ToString());
File.WriteAllLines(@"c:\your.file", collectionToBeIterated);
Note the absence of any method that would actually materialize the collection, like ToList() or ToArray(), which would have the same problem your code has. This simply creates a description of what to do when the dataset's rows are iterated, not a command to actually do it all at once.
I have an application in which I have to get a large amount of data from the DB.
Since it failed to get all of those rows (it's close to 2,000,000 rows...), I broke it into batches, running the SQL query each time and fetching only 200,000 rows per run.
I use a DataTable into which I add all of the data (meaning all 2,000,000 rows should end up there).
The first few runs are fine. Then it fails with an OutOfMemoryException.
My code works as following:
private static void RunQueryAndAddToDT(string sql, string lastRowID, SqlConnection conn, DataTable dt, int prevRowCount)
{
    if (string.IsNullOrEmpty(sql))
    {
        sql = generateSqlQuery(lastRowID);
    }

    if (conn.State == ConnectionState.Closed)
    {
        conn.Open();
    }

    using (IDbCommand cmd2 = conn.CreateCommand())
    {
        cmd2.CommandType = CommandType.Text;
        cmd2.CommandText = sql;
        cmd2.CommandTimeout = 0;

        using (IDataReader reader = cmd2.ExecuteReader())
        {
            while (reader.Read())
            {
                DataRow row = dt.NewRow();
                row["RowID"] = reader["RowID"].ToString();
                row["MyCol"] = reader["MyCol"].ToString();
                ... // In one of these rows it returns the exception.
                dt.Rows.Add(row);
            }
        }
    }

    if (conn != null)
    {
        conn.Close();
    }

    if (dt.Rows.Count > prevRowCount)
    {
        lastRowID = dt.Rows[dt.Rows.Count - 1]["RowID"].ToString();
        sql = string.Empty;
        RunQueryAndAddToDT(sql, lastRowID, conn, dt, dt.Rows.Count);
    }
}
It seems to me as if the reader keeps collecting rows, and that's why it throws an exception only in the second or third round.
Shouldn't the using block free the memory when it's done?
What may solve my problem?
Note: I should explain - I have no other choice but to get all of those rows into the DataTable, since I do some manipulation on them later, the order of the rows is important, and I can't split the work up, because sometimes I have to take the data of several rows and combine it into one row, and so on, so I can't give that up.
Thanks.
Check that you are building a 64-bit process, and not a 32-bit one, which is the default compilation mode in Visual Studio. To do this, right-click on your project, then Properties -> Build -> Platform target: x64. Like any 32-bit process, applications compiled as 32-bit have a virtual memory limit of 2GB.
64-bit processes do not have this limitation, as they use 64-bit pointers, so their theoretical maximum address space is 16 exabytes (2^64). In reality, Windows x64 limits the virtual memory of a process to 8TB. The solution to the memory limit problem, then, is to compile in 64-bit.
However, a single object is still limited to 2GB by default. You will be able to create several arrays whose combined size is greater than 2GB, but you cannot by default create individual arrays bigger than 2GB. If you still want to create arrays bigger than 2GB, you can do it (on .NET 4.5 and later) by adding the following to your app.config file:
<configuration>
  <runtime>
    <gcAllowVeryLargeObjects enabled="true" />
  </runtime>
</configuration>
I think you simply run out of memory because your DataTable gets so large from all the rows you keep adding to it.
You may want to try a different pattern in this case.
Instead of buffering your rows in a list (or DataTable), can you simply yield the rows as they become available, for the caller to use as they arrive? A minimal sketch of this pattern follows.
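Here is a rough sketch of that streaming pattern (needs System.Collections.Generic and System.Data.SqlClient). It assumes an already-open SqlConnection, and MyRow is a hypothetical DTO standing in for whatever shape your later processing needs:

static IEnumerable<MyRow> StreamRows(SqlConnection conn, string sql)
{
    using (SqlCommand cmd = new SqlCommand(sql, conn))
    using (SqlDataReader reader = cmd.ExecuteReader())
    {
        while (reader.Read())
        {
            // one row materialized at a time instead of millions in a DataTable
            yield return new MyRow
            {
                RowID = reader["RowID"].ToString(),
                MyCol = reader["MyCol"].ToString()
            };
        }
    }
}

class MyRow
{
    public string RowID { get; set; }
    public string MyCol { get; set; }
}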
Since you are using a DataTable, let me share a random problem that I was having with one. Check your Build properties. I had a problem with a DataTable throwing an out-of-memory exception randomly. As it turned out, the project's Build settings had 'Prefer 32-bit' selected. Once I unselected that option, the random out-of-memory exception went away.
You are storing a copy of the data in dt. You are simply storing so much that the machine is running out of memory. So you have a few options:
Increase the available memory.
Reduce the amount of data you are retrieving.
To increase the available memory, you can add physical memory to the machine. Note that a .NET process on a 32-bit machine will not be able to access more than 2GB of memory (3GB if you enable the /3GB switch in boot.ini), so you may need to switch to 64-bit (machine and process) if you wish to address more memory than that.
Retrieving less data is probably the way to go. Depending on what you are trying to achieve, you may be able to perform the task on subsets of the data (perhaps even on individual rows). If you are performing some kind of aggregation (e.g. producing a summary or report from the data), you may be able to employ Map-Reduce.
I'm looking for a good way to export an IEnumerable to Excel 2007 (.xlsb).
The T is a known type, so reflection is not completely necessary for performance reasons.
I'm using .xlsb (excel binary format) because the amount of data will be large for Excel.
The IEnumerable in question has approximately 2 million records. The IEnumerable is retrieved from an Access database (.mdb), then goes through some processing, and finally LINQ queries are written to generate a report structure for T. These records do not need to be sent to Excel as one set (nor could they be); the data will be sub-divided by a condition, with the largest subset being roughly 1 million records.
I want to be able to convert the data to an Excel Pivot Table for easy viewing.
My initial idea was to convert the IEnumerable to a 2D array (object[,]) and then push it into an Excel range using COM interop.
public static object[,] To2DArray<T>(this IEnumerable<T> objectList)
{
    Type t = typeof(T);
    PropertyInfo[] fields = t.GetProperties();
    object[,] my2DObject = new object[objectList.Count(), fields.Count()];

    int row = 0;
    foreach (var o in objectList)
    {
        int col = 0;
        foreach (var f in fields)
        {
            my2DObject[row, col] = f.GetValue(o, null) ?? string.Empty;
            col++;
        }
        row++;
    }
    return my2DObject;
}
I then took that object[,] and did a "transaction split", as I called it, which just split the object[,] up into smaller chunks: I'd build a List<object[,]> and then send each chunk into an Excel range using something similar to:

Excel.Range range = worksheet.get_Range(cell, cell);
range.Value2 = chunks[0]; // chunks being the List<object[,]>

I'd obviously loop the above, but just for simplicity it would look like that.
This works, though it takes an enormous amount of time to process: over 30 minutes.
I've also dabbled in outputting the IEnumerable to CSV, but that is not very efficient either, since it first requires the .csv file to be created and then opened via COM interop to do the Excel pivot table formatting.
My question: Is there a better (preferred) way to do this?
Should I force execution (ToList()) before iteration?
Should I use a different mechanism to output/display the data?
I'm open to any options to get a disconnected IEnumerable out to a file in an efficient manner.
I wouldn't be opposed to using something like SQL Express.
The main question will be where the bottleneck is. I'd have a look at the code in a profiler to see what part of the execution is taking a long time. It can also be worthwhile looking at your resource usage by running the process and seeing whether there is a shortage of CPU or memory, or whether it's disk-bound.
If you're getting sensible performance doing 2000 records at a time, then I suspect memory resources may be an issue - with the code you posted you're converting an IEnumerable (which can avoid loading a complete dataset into memory) into an entirely in-memory structure with potentially a million records - depending on the size and number of fields involved, this could easily become an issue.
If the problem looks like the time to create the Excel file itself (which it doesn't immediately sound like it is in this case), then COM interop calls can add up, and some of the 3rd party Excel libraries aim to be much faster at writing Excel files, particularly with large numbers of records, so rather than necessarily use Excel Binary format and COM, I'd suggest looking at an Open Source library like EPPlus (http://epplus.codeplex.com/) and seeing what the performance difference is like.
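As a rough illustration of the EPPlus route (needs the OfficeOpenXml namespace and System.IO; 'records' stands in for your IEnumerable<T>, the output path is a placeholder, and this doesn't cover the pivot-table formatting itself):

using (var package = new ExcelPackage())
{
    ExcelWorksheet ws = package.Workbook.Worksheets.Add("Data");
    // one row per item, one column per public property, plus a header row
    ws.Cells["A1"].LoadFromCollection(records, true);
    package.SaveAs(new FileInfo(@"C:\temp\report.xlsx"));
}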
I am trying to isolate the source of a "memory leak" in my C# application. This application copies a large number of potentially large files into records in a database using the image column type in SQL Server. I am using a LinqToSql and associated objects for all database access.
The main loop iterates over a list of files and inserts. After removing much boilerplate and error handling, it looks like this:
foreach (Document doc in ImportDocs) {
    using (var dc = new DocumentClassesDataContext(connection)) {
        byte[] contents = File.ReadAllBytes(doc.FileName);
        DocumentSubmission submission = new DocumentSubmission() {
            Content = contents,
            // other fields
        };
        dc.DocumentSubmissions.InsertOnSubmit(submission); // (A)
        dc.SubmitChanges(); // (B)
    }
}
Running this program over the entire input results in an eventual OutOfMemoryException. CLR Profiler reveals that 99% of the heap consists of large byte[] objects corresponding to the sizes of the files.
If I comment both lines A and B, this leak goes away. If I uncomment only line A, the leak comes back. I don't understand how this is possible, as dc is disposed for every iteration of the loop.
Has anyone encountered this before? I suspect directly calling stored procedures or doing inserts will avoid this leak, but I'd like to understand this before trying something else. What is going on?
Update
Including GC.Collect(); after line (B) appears to make no significant change to any case. This does not surprise me much, as CLR Profiler was showing a good number of GC events without explicitly inducing them.
Which operating system are you running this on? Your problem may not be related to Linq2Sql, but to how the operating system manages large memory allocations. For instance, Windows Server 2008 is much better at managing large objects in memory than XP. I have had instances where the code working with large files was leaking on XP but was running fine on Win 2008 server.
HTH
I don't entirely understand why, but making a copy of the iterating variable fixed it. As near as I can tell, LinqToSql was somehow making a copy of the DocumentSubmission inside each Document.
foreach (Document doc in ImportDocs) {
    // make copy of doc that lives inside loop scope
    Document copydoc = new Document() {
        field1 = doc.field1,
        field2 = doc.field2,
        // complete copy
    };

    using (var dc = new DocumentClassesDataContext(connection)) {
        byte[] contents = File.ReadAllBytes(copydoc.FileName);
        DocumentSubmission submission = new DocumentSubmission() {
            Content = contents,
            // other fields
        };
        dc.DocumentSubmissions.InsertOnSubmit(submission); // (A)
        dc.SubmitChanges(); // (B)
    }
}