Here is what I am trying to achieve. Looping through a dataset over 1 million records and create a data dump in text file export to C drive.
I am looping through a dataset with over a million of records. Here is what is inside the loop
I am using a StringBuilder inside the loop.
myString.Append(ds.tables[0](i)(0)); <-- each datarow is not more than 10 char long.
It throws an error saying insufficient memory. I have 12 gb of Ram.
How do I go about fixing this problem?
Don't use an intermediate StringBuilder -- its contents sit in your computer's RAM before you presumably call .ToString() on it to write the result to disk. Instead, write the data to disk as you are processing it, something like:
using (var sw = new StreamWriter(outputFilePath, true))
{
// start loop
sw.Write(ds.tables[0](i)(0));
// end loop
}
This will write text to a file using the default encoding (UTF-8) and buffer size (I think it's 4KB).
Why do you store the large string in memory at all? If all you want to do is to write it to a text-file you could use a StreamWriter to write in batches:
using(var writer = new StreamWriter("c:\\file.txt", true))
{
for(int rowNum = 0; rowNum < ds.tables[0].Rows.Count; rowNum++)
{
DataRow row = ds.Tables[0].Rows[rowNum];
writer.Write(row.Field<string>(0));
}
}
But maybe you can optimze it further. Do you really need the large DataSet at all? If the data came from a database you could use a DataReader to stream it lazily. Then you can write to the text-file without memory.
This means that CLR cannot allocate an object with the size you've requested. Each process has its own RAM limit so appending a million records to a StringBuilder is probably not possible on your machine or any standard machine.
Even if you have a lot of memory, and even if you're running a 64-bit CLR, there are limits to the size of objects that can be created.
Your problem is that you try to have your file in memory. But it's too large to hold it in memory. Now, you don't need to have it in memory all at once. You need to create smaller chunks (like you already do, for example lines) and instead of keeping all of them in memory at the same time, write them to disk and then "forget" about them, so that you only need the memory for one chunk, never for all chunks at the same time.
You can do that using LinQ because of a feature called deferred execution:
var collectionToBeIterated = ds.Tables[0].Rows.Select(r => r[0].ToString());
File.WriteAllLines(#"c:\your.file", collectionToBeIterated);
Note the absence of any method that would actually materialize the collection, like ToList() or ToArray() which would have the same problems your code has. This simply creates a description of what to do when the datasets rows get iterated. Not a command to actually do it all at once.
Related
I'm trying to use Fluent NHibernate to migrate a database that needs some of the database 'massaged'. The source database is a MS Access database and the current table I'm stuck on is one with an OLE Object field. The target database is a MS SQL Server Express database.
In the entity I simply had this field defined as a byte[] however when loading however even when just loading that single field for a single record I was hitting a System.OutOfMemoryException
byte[] test = aSession.Query<Entities.Access.Revision>().Where(x => x.Id == 5590).Select(x => x.FileData).SingleOrDefault<byte[]>();
I then tried implementing the blob type listed here but now when running that I receive an error of:
"Unable to cast object of type 'System.Byte[]' to type
'TestProg.DatabaseConverter.Entities.Blob'."}
I can't imagine the Ole Object is any larger than 100mb but haven't been able to check. Is there any good way using Fluent NHibernate to copy this out of the one database and save it to the other or will I need to look at other options?
My normal loop for processing these is:
IList<Entities.Access.Revision> result;
IList<int> recordIds = aSession.Query<Entities.Access.Revision>().Select(x => x.Id).ToList<int>();
foreach (int recordId in recordIds)
{
result = aSession.Query<Entities.Access.Revision>().Where(x => x.Id == recordId).ToList<Entities.Access.Revision>();
Save(sqlDb, result);
}
Save function just copies properties from one to another and for some entities is used to manipulate data or give feedback to user related to data problems. I'm using stateless sessions for both databases.
--
From further testing the objects it appears to be hanging on are about 60-70mb. I'm currently testing grabbing the data with an OleDbDataReader using GetBytes.
--
Update (Nov 24): I've yet to find a way to get this to work with NHibernate. I did get this working with regular db command objects. I've put the code for function I made below for anybody curious who finds this. This is code from my database converter so objects prefixed with 'a' are access database objects and 's' are sql ones.
public void MigrateBinaryField(int id, string tableName, string fieldName)
{
var aCmd = new OleDbCommand(String.Format(#"SELECT ID, {0} FROM {1} WHERE ID = {2}", fieldName, tableName, id), aConn);
using (var reader = aCmd.ExecuteReader(System.Data.CommandBehavior.SequentialAccess))
{
while (reader.Read())
{
if (reader[fieldName] == DBNull.Value)
return;
long read = 0;
long offset = 0;
// Can't .WRITE a NULL column so need to set an initial value
var sCmd = new SqlCommand(string.Format(#"UPDATE {0} SET {1} = #data WHERE OldId = #OldId", tableName, fieldName), sConn);
sCmd.Parameters.AddWithValue("#data", new byte[0]);
sCmd.Parameters.AddWithValue("#OldId", id);
sCmd.ExecuteNonQuery();
// Incrementally store binary field to avoid OutOfMemoryException from having entire field loaded in memory
sCmd = new SqlCommand(string.Format(#"UPDATE {0} SET {1}.WRITE(#data, #offset, #len) WHERE OldId = #OldId", tableName, fieldName), sConn);
while ((read = reader.GetBytes(reader.GetOrdinal(fieldName), offset, buffer, 0, buffer.Length)) > 0)
{
sCmd.Parameters.Clear();
sCmd.Parameters.AddWithValue("#data", buffer);
sCmd.Parameters.AddWithValue("#offset", offset);
sCmd.Parameters.AddWithValue("#len", read);
sCmd.Parameters.AddWithValue("#OldId", id);
sCmd.ExecuteNonQuery();
offset += read;
}
}
}
}
This sounds like the results I have seen with using .NET on top of other frameworks as well.
The native database driver beneath ADO.NET beneath NHibernate (two "beneaths" are intentional here) will require a pinned destination memory block that cannot be moved in memory while the driver fills it. Since the .NET garbage collector can randomly move blocks of memory on a separate thread in order to compact the heaps, NHibernate's underlying .NET database layer has to create a non-managed memory block to receive the data, which effectively doubles the amount of memory required to load a record.
Also, I have not verified this next point, but NHibernate should attempt to cache blocks of records, since it bypasses some of the relational database query operations. This allows NHibernate to make fewer database requests, which is optimal for smaller record sizes, but requires many records (including many blobs) to fit in memory at a time.
As a first step toward a resolution, make sure the process is really running the machine out of memory (or if it is 32-bit, make sure it is hitting the 2GB limit). If so, attempt to determine the baseline - if it is processing records with a variety of blob sizes, what is the minimum and maximum memory it uses? From that, you can estimate how much memory would be required for that large record (or the cache block that contains that record!)
64-bit and more physical memory may be a brute-force solution, if you aren't already running 64-bit, and if bigger hardware is even an option.
Another possible solution is to check whether NHibernate has configurable settings or properties for how it caches data. For example, check whether you can set a property that limits how many records are loaded at a time, or tell it to limit its cache to a certain size in bytes.
A more efficient solution is to use your ADO.NET code for the blobs; that might be the best solution, especially if you expect even larger blobs than this particular 60-70MB blob. MS Access will normally allow multiple read-only connections, so this should work as long as NHibernate doesn't set the database to block other connections.
I strongly suspect this is an accumulation due to NHibernate session cache.
Try to read each blob in a separate session or at least to flush/clear it periodically adding a counter 'i' to your loop and a condition like
if (i % 10 == 0)
{
aSession.Flush();
aSession.Clear();
}
I'm looking for a good way to export an IEnumerable to Excel 2007 (.xlsb).
The T is a known type, so reflection is not completely necessary for performance reasons.
I'm using .xlsb (excel binary format) because the amount of data will be large for Excel.
The IEnumerable in question has approximately 2 million records. The IEnumerable is retrieved from an Access database (.mdb) then goes through some processing, then finally LINQ queries are wrote to generate a report structure for T. Though these records do not need to get sent to excel as one (nor could it); it will be sub-divided by a condition to which the largest record length will be roughly 1 million records.
I want to be able to convert the data to an Excel Pivot Table for easy viewing.
My initial idea was to convert the IEnumerable to a 2Darray [,] then push into an Excel range using COM interop.
public static object[,] To2DArray<T>(this IEnumerable<T> objectList)
{
Type t = typeof(T);
PropertyInfo[] fields = t.GetProperties();
object[,] my2DObject = new object[objectList.Count(), fields.Count()];
int row = 0;
foreach (var o in objectList)
{
int col = 0;
foreach (var f in fields)
{
my2DObject[row, col] = f.GetValue(o, null) ?? string.Empty;
col++;
}
row++;
}
return my2DObject;
}
I then took that object[,] and did a "transaction split" as I called it which just split up the object[,] into smaller chunks such as I'd create a List then go through each one and send into Excel range using something similar to:
Excel.Range range = worksheet.get_Range(cell,cell);
range.Value2 = List<object[,]>[0]
I'd obviously loop the above but just for simplicity it would look like the above.
This will work though, it takes an enormous amount of time to process, over 30minutes.
I've dabbled in outputting the IEnumerable to CSV though, it is not very efficient either; since it first requires the .csv file to be created, then open the .csv file using COM interop to do the excel pivot table formatting.
My question: Is there a better (preferred) way to do this?
Should I force execution (toList()) before iteration?
Should I use a different mechanism to output/display the data?
I'm open to any options to get a disconnected IEnumerable out to file in an efficient manner.
-I wouldn't be opposed to using something like SQL Express.
The main question will be where the bottleneck is. I'd have a look at the code in a profiler to see what part of the execution is taking a long time. It can also be worthwhile looking at your resource usage by running the process and seeing whether there is a shortage of CPU or Memory, or whether it's disk-locked.
If you're getting sensible performance doing 2000 records at a time, then I suspect memory resources may be an issue - with the code you posted you're converting an IEnumerable (which can avoid loading a complete dataset into memory) into an entirely in-memory structure with potentially a million records - depending on the size and number of fields involved, this could easily become an issue.
If the problem looks like the time to create the Excel file itself (which it doesn't immediately sound like it is in this case), then COM interop calls can add up, and some of the 3rd party Excel libraries aim to be much faster at writing Excel files, particularly with large numbers of records, so rather than necessarily use Excel Binary format and COM, I'd suggest looking at an Open Source library like EPPlus (http://epplus.codeplex.com/) and seeing what the performance difference is like.
I have this code below that executes. IT has 46,000 records in the text file that i need to process and insert into the database. It takes FOREVER if i just call it directly and loop one at a time.
I was trying to use LINQ to pull every 1000 rows or so and throw it into a thread so I could proces 3000 rows at once and cut the processing time. I can't figure it out though. so I need some help.
Any suggestions would be welcome. Thank You in advance.
var reader = ReadAsLines(tbxExtended.Text);
var ds = new DataSet();
var dt = new DataTable();
string headerNames = "Long|list|of|strings|"
var headers = headerNames.Split('|');
foreach (var header in headers)
dt.Columns.Add(header);
var records = reader.Skip(1);
foreach (var record in records)
dt.Rows.Add(record.Split('|'));
ds.Tables.Add(dt);
ds.AcceptChanges();
ProcessSmallList(ds);
If you are looking for high performance then look at the SqlBulkInsert if you are using SqlServer. The performance is significantly better than Insert row by row.
Here is an example using a custom CSVDataReader that I used for a project, but any IDataReader compatible Reader, DataRow[] or DataTable can be used as a parameter into WriteToServer, SQLDataReader, OLEDBDataReader etc.
Dim sr As CSVDataReader
Dim sbc As SqlClient.SqlBulkCopy
sbc = New SqlClient.SqlBulkCopy(mConnectionString, SqlClient.SqlBulkCopyOptions.TableLock Or SqlClient.SqlBulkCopyOptions.KeepIdentity)
sbc.DestinationTableName = "newTable"
'sbc.BulkCopyTimeout = 0
sr = New CSVDataReader(parentfileName, theBase64Map, ","c)
sbc.WriteToServer(sr)
sr.Close()
There are quite a number of options available. (See the link in the item)
To bulk insert data into a database, you probably should be using that database engine's bulk-insert utility (e.g. bcp in SQL Server). You might want to first do the processing, write out the processed data into a separate text file, then bulk-insert into your database of concern.
If you really want to do the processing on-line and insert on-line, memory is also a (small) factor, for example:
ReadAllLines reads the whole text file into memory, creating 46,000 strings. That would occupying a sizable chunk of memory. Try to use ReadLines instead which returns an IEnumerable and return strings one line at a time.
Your dataset may contain all 46,000 rows in the end, which will be slow in detecting changed rows. Try to Clear() the dataset table right after insert.
I believe the slowness you observed actually came from the dataset. Datasets issue one INSERT statement per new record, which means that you won't be saving anything by doing Update() 1,000 rows at a time or one row at a time. You still have 46,000 INSERT statements going to the database, which makes it slow.
In order to improve performance, I'm afraid LINQ can't help you here, since the bottleneck is with the 46,000 INSERT statements. You should:
Forgo the use of datasets
Dynamically create an INSERT statement in a string
Batch the update, say, 100-200 rows per command
Dynamically build the INSERT statement with multiple VALUE statments
Run the SQL command to insert 100-200 rows per batch
If you insist on using datasets, you don't have to do it with LINQ -- LINQ solves a different type of problems. Do something like:
// code to create dataset "ds" and datatable "dt" omitted
// code to create data adaptor omitted
int count = 0;
foreach (string line in File.ReadLines(filename)) {
// Do processing based on line, perhaps split it
dt.AddRow(...);
count++;
if (count >= 1000) {
adaptor.Update(dt);
dt.Clear();
count = 0;
}
}
This will improve performance somewhat, but you're never going to approach the performance you obtain by using dedicated bulk-insert utilities (or function calls) for your database engine.
Unfortunately, using those bulk-insert facilities will make your code less portable to another database engine. This is the trade-off you'll need to make.
I am trying to isolate the source of a "memory leak" in my C# application. This application copies a large number of potentially large files into records in a database using the image column type in SQL Server. I am using a LinqToSql and associated objects for all database access.
The main loop iterates over a list of files and inserts. After removing much boilerplate and error handling, it looks like this:
foreach (Document doc in ImportDocs) {
using (var dc = new DocumentClassesDataContext(connection)) {
byte[] contents = File.ReadAllBytes(doc.FileName);
DocumentSubmission submission = new DocumentSubmission() {
Content = contents,
// other fields
};
dc.DocumentSubmissions.InsertOnSubmit(submission); // (A)
dc.SubmitChanges(); // (B)
}
}
Running this program over the entire input results in an eventual OutOfMemoryException. CLR Profiler reveals that 99% of the heap consists of large byte[] objects corresponding to the sizes of the files.
If I comment both lines A and B, this leak goes away. If I uncomment only line A, the leak comes back. I don't understand how this is possible, as dc is disposed for every iteration of the loop.
Has anyone encountered this before? I suspect directly calling stored procedures or doing inserts will avoid this leak, but I'd like to understand this before trying something else. What is going on?
Update
Including GC.Collect(); after line (B) appears to make no significant change to any case. This does not surprise me much, as CLR Profiler was showing a good number of GC events without explicitly inducing them.
Which operating system are you running this on? Your problem may not be related to Linq2Sql, but to how the operating system manages large memory allocations. For instance, Windows Server 2008 is much better at managing large objects in memory than XP. I have had instances where the code working with large files was leaking on XP but was running fine on Win 2008 server.
HTH
I don't entirely understand why, but making a copy of the iterating variable fixed it. As near as I can tell, LinqToSql was somehow making a copy of the DocumentSubmission inside each Document.
foreach (Document doc in ImportDocs) {
// make copy of doc that lives inside loop scope
Document copydoc = new Document() {
field1 = doc.field1,
field2 = doc.field2,
// complete copy
};
using (var dc = new DocumentClassesDataContext(connection)) {
byte[] contents = File.ReadAllBytes(copydoc.FileName);
DocumentSubmission submission = new DocumentSubmission() {
Content = contents,
// other fields
};
dc.DocumentSubmissions.InsertOnSubmit(submission); // (A)
dc.SubmitChanges(); // (B)
}
}
I am using VSTS2008 + C# + .Net 3.5 to run this console application on x64 Server 2003 Enterprise with 12G physical memory.
Here is my code, and I find when executing statement bformatter.Serialize(stream, table), there is out of memory exception. I monitored memory usage through Perormance Tab of Task Manager and I find only 2G physical memory is used when exception is thrown, so should be not out of memory. :-(
Any ideas what is wrong? Any limitation of .Net serialization?
static DataTable MakeParentTable()
{
// Create a new DataTable.
System.Data.DataTable table = new DataTable("ParentTable");
// Declare variables for DataColumn and DataRow objects.
DataColumn column;
DataRow row;
// Create new DataColumn, set DataType,
// ColumnName and add to DataTable.
column = new DataColumn();
column.DataType = System.Type.GetType("System.Int32");
column.ColumnName = "id";
column.ReadOnly = true;
column.Unique = true;
// Add the Column to the DataColumnCollection.
table.Columns.Add(column);
// Create second column.
column = new DataColumn();
column.DataType = System.Type.GetType("System.String");
column.ColumnName = "ParentItem";
column.AutoIncrement = false;
column.Caption = "ParentItem";
column.ReadOnly = false;
column.Unique = false;
// Add the column to the table.
table.Columns.Add(column);
// Make the ID column the primary key column.
DataColumn[] PrimaryKeyColumns = new DataColumn[1];
PrimaryKeyColumns[0] = table.Columns["id"];
table.PrimaryKey = PrimaryKeyColumns;
// Create three new DataRow objects and add
// them to the DataTable
for (int i = 0; i <= 5000000; i++)
{
row = table.NewRow();
row["id"] = i;
row["ParentItem"] = "ParentItem " + i;
table.Rows.Add(row);
}
return table;
}
static void Main(string[] args)
{
DataTable table = MakeParentTable();
Stream stream = new MemoryStream();
BinaryFormatter bformatter = new BinaryFormatter();
bformatter.Serialize(stream, table); // out of memory exception here
Console.WriteLine(table.Rows.Count);
return;
}
thanks in advance,
George
Note: DataTable defaults to the xml serialization format that was used in 1.*, which is incredibly inefficient. One thing to try is switching to the newer format:
dt.RemotingFormat = System.Data.SerializationFormat.Binary;
Re the out-of-memory / 2GB; individual .NET objects (such as the byte[] behind a MemoryStream) are limited to 2GB. Perhaps try writing to a FileStream instead?
(edit: nope: tried that, still errors)
I also wonder if you may get better results (in this case) using table.WriteXml(stream), perhaps with compression such as GZIP if space is a premium.
As already discussed this is a fundamental issue with trying to get contiguous blocks of memory in the Gigabyte sort of size.
You will be limited by (in increasing difficulty)
The amount of addressable memory
since you are 64bit this will be you 12GB physical memory, less any holes in it required by devices plus any swap file space.
Note that you must be running an app with the relevant PE headers that indicate it can run 64bit or you will run under WoW64 and only have 4GB of address space.
Also note that the default target was changed in 2010, a contentious change.
The CLR's limitation that no single object may consume more than 2GB of space.
Finding a contiguous block within the available memory.
You can find that you run out of space before the CLR limit of 2 because the backing buffer in the stream is expanded in a 'doubling' fashion and this swiftly results in the buffer being allocated in the Large Object Heap. This heap is not compacted in the same way the other heaps are(1) and as a result the process of building up to the theoretical maximum size of the buffer under 2 fragments the LOH so that you fail to find a sufficiently large contiguous block before this happens.
Thus a mitigation approach if you are close to the limit is to set the initial capacity of the stream such that it definitely has sufficient space from the start via one of the constructors.
Given that you are writing to the memory stream as part of a serialization process it would make sense to actually use streams as intended and use only the data required.
If you are serializing to some file based location then stream it into that directly.
If this is data going into a Sql Server database consider using:
FILESTREAM 2008 only I'm afraid.
From 2005 onwards you can read/write in chunks but writing is not well integrated into ADO.Net
For versions prior to 2005 there are relatively unpleasant workarounds
If you are serializing this in memory for use in say a comparison then consider streaming the data being compared as well and diffing as you go along.
If you are persisting an object in memory to recreate it latter then this really should be going to a file or a memory mapped file. In both cases the operating system is then free to structure it as best it can (in disk caches or pages being mapped in and out of main memory) and it is likely it will do a better job of this than most people are able to do themselves.
If you are doing this so that the data can be compressed then consider using streaming compression. Any block based compression stream can be fairly easily converted into a streaming mode with the addition of padding. If your compression API doesn't support this natively consider using one that does or writing the wrapper to do it.
If you are doing this to write to a byte buffer which is then pinned and passed to an unmanaged function then use the UnmanagedMemoryStream instead, this stands a slightly better chance of being able to allocate a buffer of this sort of size but is still not guaranteed to do so.
Perhaps if you tell us what you are serializing an object of this size for we might be able to tell you better ways to do it.
This is an implementation detail you should not rely on
1) The OS is x64, but is the app x64 (or anycpu)? If not, it is capped at 2Gb.
2) Does this happen 'early on', or after the app has been running for some time (i.e. n serializations later)? Could it maybe be a result of large object heap fragmentation...?
Interestingly, it actually goes up to 3.7GB before giving a memory error here (Windows 7 x64).
Apparently, it would need about double that amount to complete.
Given that the application uses 1.65GB after creating the table, it seems likely that it's hitting the 2GB byte[] (or any single object) limit Marc Gravell is speaking of (1.65GB + 2GB ~= 3.7GB)
Based on this blog, I suppose you could allocate your memory using the WINAPI, and write your own MemoryStream implementation using that. That is, if you really wanted to do this. Or write one using more than one array of course :)