I'm trying to use Fluent NHibernate to migrate a database that needs some of the database 'massaged'. The source database is a MS Access database and the current table I'm stuck on is one with an OLE Object field. The target database is a MS SQL Server Express database.
In the entity I simply had this field defined as a byte[]; however, even when loading just that single field for a single record I was hitting a System.OutOfMemoryException:
byte[] test = aSession.Query<Entities.Access.Revision>().Where(x => x.Id == 5590).Select(x => x.FileData).SingleOrDefault<byte[]>();
I then tried implementing the blob type listed here but now when running that I receive an error of:
"Unable to cast object of type 'System.Byte[]' to type
'TestProg.DatabaseConverter.Entities.Blob'."}
I can't imagine the OLE Object is any larger than 100 MB, but I haven't been able to check. Is there any good way, using Fluent NHibernate, to copy this out of the one database and save it to the other, or will I need to look at other options?
My normal loop for processing these is:
IList<Entities.Access.Revision> result;
IList<int> recordIds = aSession.Query<Entities.Access.Revision>().Select(x => x.Id).ToList<int>();
foreach (int recordId in recordIds)
{
result = aSession.Query<Entities.Access.Revision>().Where(x => x.Id == recordId).ToList<Entities.Access.Revision>();
Save(sqlDb, result);
}
The Save function just copies properties from one entity to another and, for some entities, is also used to massage data or give the user feedback about data problems. I'm using stateless sessions for both databases.
--
From further testing, the objects it appears to be hanging on are about 60-70 MB. I'm currently testing grabbing the data with an OleDbDataReader using GetBytes.
--
Update (Nov 24): I've yet to find a way to get this to work with NHibernate. I did get it working with regular DB command objects. I've put the code for the function I made below for anybody curious who finds this. This is code from my database converter, so objects prefixed with 'a' are Access database objects and those prefixed with 's' are SQL Server ones.
public void MigrateBinaryField(int id, string tableName, string fieldName)
{
var aCmd = new OleDbCommand(String.Format(@"SELECT ID, {0} FROM {1} WHERE ID = {2}", fieldName, tableName, id), aConn);
using (var reader = aCmd.ExecuteReader(System.Data.CommandBehavior.SequentialAccess))
{
while (reader.Read())
{
if (reader[fieldName] == DBNull.Value)
return;
long read = 0;
long offset = 0;
byte[] buffer = new byte[32768]; // chunk buffer; the size is arbitrary (the original code assumed an existing 'buffer' field)
// Can't .WRITE to a NULL column, so set an initial (empty) value first
var sCmd = new SqlCommand(string.Format(@"UPDATE {0} SET {1} = @data WHERE OldId = @OldId", tableName, fieldName), sConn);
sCmd.Parameters.AddWithValue("@data", new byte[0]);
sCmd.Parameters.AddWithValue("@OldId", id);
sCmd.ExecuteNonQuery();
// Incrementally append to the binary field to avoid an OutOfMemoryException from holding the entire field in memory
sCmd = new SqlCommand(string.Format(@"UPDATE {0} SET {1}.WRITE(@data, @offset, @len) WHERE OldId = @OldId", tableName, fieldName), sConn);
while ((read = reader.GetBytes(reader.GetOrdinal(fieldName), offset, buffer, 0, buffer.Length)) > 0)
{
// The final chunk may be smaller than the buffer, so only send the bytes that were actually read
byte[] chunk = buffer;
if (read < buffer.Length)
{
chunk = new byte[read];
Array.Copy(buffer, chunk, read);
}
sCmd.Parameters.Clear();
sCmd.Parameters.AddWithValue("@data", chunk);
sCmd.Parameters.AddWithValue("@offset", offset);
sCmd.Parameters.AddWithValue("@len", read);
sCmd.Parameters.AddWithValue("@OldId", id);
sCmd.ExecuteNonQuery();
offset += read;
}
}
}
}
This sounds like the results I have seen with using .NET on top of other frameworks as well.
The native database driver beneath ADO.NET beneath NHibernate (two "beneaths" are intentional here) will require a pinned destination memory block that cannot be moved in memory while the driver fills it. Since the .NET garbage collector can randomly move blocks of memory on a separate thread in order to compact the heaps, NHibernate's underlying .NET database layer has to create a non-managed memory block to receive the data, which effectively doubles the amount of memory required to load a record.
Also, I have not verified this next point, but NHibernate likely attempts to cache blocks of records so that it can make fewer database requests. That is optimal for smaller record sizes, but it requires many records (including many blobs) to fit in memory at a time.
As a first step toward a resolution, make sure the process is really running the machine out of memory (or if it is 32-bit, make sure it is hitting the 2GB limit). If so, attempt to determine the baseline - if it is processing records with a variety of blob sizes, what is the minimum and maximum memory it uses? From that, you can estimate how much memory would be required for that large record (or the cache block that contains that record!)
64-bit and more physical memory may be a brute-force solution, if you aren't already running 64-bit, and if bigger hardware is even an option.
Another possible solution is to check whether NHibernate has configurable settings or properties for how it caches data. For example, check whether you can set a property that limits how many records are loaded at a time, or tell it to limit its cache to a certain size in bytes.
A more efficient solution is to use your ADO.NET code for the blobs; that might be the best solution, especially if you expect even larger blobs than this particular 60-70MB blob. MS Access will normally allow multiple read-only connections, so this should work as long as NHibernate doesn't set the database to block other connections.
I strongly suspect this is an accumulation due to the NHibernate session cache.
Try to read each blob in a separate session, or at least flush/clear the session periodically by adding a counter 'i' to your loop and a condition like:
if (i % 10 == 0)
{
aSession.Flush();
aSession.Clear();
}
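For reference, a rough sketch of how that counter might sit inside the question's original loop (this assumes aSession is a regular ISession, since a stateless session has no first-level cache to clear; the batch size of 10 is arbitrary):
// Sketch only: entity, session and Save names are taken from the question's loop.
int i = 0;
IList<int> recordIds = aSession.Query<Entities.Access.Revision>().Select(x => x.Id).ToList<int>();
foreach (int recordId in recordIds)
{
    var result = aSession.Query<Entities.Access.Revision>()
        .Where(x => x.Id == recordId)
        .ToList<Entities.Access.Revision>();
    Save(sqlDb, result);

    // Periodically push pending work and evict cached entities from the session
    if (++i % 10 == 0)
    {
        aSession.Flush();
        aSession.Clear();
    }
}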
Related
I wrote an app that runs on multiple client machines and connects to a remote Oracle database (master) to synchronize their local open source databases (slaves). This works fine so far. But sometimes a local table needs to be fully initialized (dropped, then all rows of the master table inserted). If the master table is big enough (column count, data types/sizes, and a certain number of rows), the app sometimes runs into an OutOfMemoryException.
The app is running on windows machines with .NET 4.0. Version of ODP.NET is 4.122.18.3. Oracle database is 12c (12.1.0.2.0).
I don't show the data to any user (the app runs in the background); otherwise I could do some paging or filtering. Since not all tables contain keys or can be sorted automatically, it's pretty hard to fetch the table in parts. Initializing the local table should be done in one transaction, without multiple partial commits.
I can break the problem down to a simple code sample that shows the managed memory allocation I didn't expect. At this point I'm not sure how to explain or solve the problem.
using (var connection = new OracleConnection(CONNECTION_STRING))
{
connection.Open();
using (var command = connection.CreateCommand())
{
command.CommandText = STATEMENT;
using (var reader = command.ExecuteReader())
{
while (reader.Read())
{
//reader[2].ToString();
//reader.GetString(2);
//reader.GetValue(2);
}
}
}
}
When any of the three reader.* calls is uncommented, the memory for the requested column data seems to be pinned internally by ODP.NET (OracleInternal.Network.OraBuf) for each record. For a few thousand records this doesn't seem to be a problem, but when fetching 100k+ records the allocation grows to hundreds of MB, which leads to an OutOfMemoryException. The more data the specified column holds (mostly NVARCHAR2), the sooner the OOM happens.
Calling GC.Collect() manually doesn't do anything either. The GC.Collect() calls shown in the image are triggered internally (none are made by me).
Since I don't store the read data anywhere, I would have expected the data not to be cached while iterating the DbDataReader. Can you help me understand what is happening here and how to avoid it?
This seems to be a known bug (Bug 21975120) when reading CLOB column values with the ExecuteReader() method of the managed driver 12.1.0.2. The workaround is to use the OracleDataReader-specific methods (for example oracleDataReader.GetOracleValue(i)). The returned OracleClob can then be closed explicitly to free the memory.
var item = oracleDataReader.GetOracleValue(columnIndex);
if (item is OracleClob clob) // the pattern match is false for null, so no separate null check is needed
{
// use clob.Value ...
clob.Close();
}
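For context, a rough sketch of the workaround applied to the question's read loop (the column index 2 mirrors the commented-out calls above; it assumes the managed provider types OracleDataReader and OracleClob from Oracle.ManagedDataAccess):
// Sketch only: reuses 'connection' and STATEMENT from the question's sample.
using (var command = connection.CreateCommand())
{
    command.CommandText = STATEMENT;
    using (OracleDataReader reader = command.ExecuteReader())
    {
        while (reader.Read())
        {
            // GetOracleValue returns the provider-specific type instead of a plain string
            object item = reader.GetOracleValue(2);
            if (item is OracleClob clob)
            {
                string text = clob.Value; // consume the value as needed
                clob.Close();             // release the buffers held by the driver
            }
        }
    }
}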
So I have a program that processes a lot of data: it pulls the data from the source and then saves it into the database. The only condition is that there cannot be duplicates based on a single name field. There is no way to pull only new data; I have to pull it all.
The way I have it working right now is that it pulls all the data from the source, then all of the data from the DB (~1.5M records), compares them, and sends only the new entries. However, this is not very good in terms of RAM, since it can use around 700 MB.
I looked into a way to let the DB handle more of the processing and found the extension method below on another question, but my concern with using it is that it might be even more inefficient, since it has to check 200,000 entries individually against all the records in the DB.
public static T AddIfNotExists<T>(this DbSet<T> dbSet, T entity, Expression<Func<T, bool>> predicate = null) where T : class, new()
{
var exists = predicate != null ? dbSet.Any(predicate) : dbSet.Any();
return !exists ? dbSet.Add(entity) : null;
}
So, any ideas as to how I might be able to handle this problem in an efficient manner?
1) Just a suggestion: because this task seems to be "data intensive", you could use a dedicated tool: SSIS. Create one package with one Data Flow Task, like this:
OLE DB source (to read data from source)
Lookup transformation (to check if every source value exists or not within target table; This needs indexing)
No Match output (the source value doesn't exist in the target), routed to an
OLE DB destination (which inserts the source value into the target table).
I would use this approach if I have to compare/search on a small set of columns (these columns should be indexed).
2) Another suggestion is to simply insert the source data into a #TempTable and then run:
INSERT dbo.TargetTable (...)
SELECT ...
FROM #TempTable
EXCEPT
SELECT ...
FROM dbo.TargetTable
I would use this approach if I have to compare/search on all columns.
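If the source data sits in memory on the client, one way to get it into the temp table from C# is SqlBulkCopy; the sketch below illustrates suggestion 2 (all table and column names are made up, and the temp table must be created and used on the same open connection):
// Sketch of suggestion 2: stage the source rows in a temp table, then insert only the
// rows that do not already exist. Table/column names are illustrative.
using (var conn = new SqlConnection(connectionString))
{
    conn.Open();

    using (var create = new SqlCommand(
        "CREATE TABLE #TempTable (Name NVARCHAR(200) NOT NULL, Value NVARCHAR(MAX) NULL);", conn))
    {
        create.ExecuteNonQuery();
    }

    using (var bulk = new SqlBulkCopy(conn) { DestinationTableName = "#TempTable" })
    {
        bulk.WriteToServer(sourceDataTable); // a DataTable holding the source rows
    }

    // EXCEPT compares on all selected columns, matching the approach described above
    using (var insert = new SqlCommand(
        @"INSERT dbo.TargetTable (Name, Value)
          SELECT Name, Value FROM #TempTable
          EXCEPT
          SELECT Name, Value FROM dbo.TargetTable;", conn))
    {
        insert.ExecuteNonQuery();
    }
}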
You have to do your tests to see which is the right/fastest approach.
If you stick to EF, there's a few ways of improving your algorithm:
Download only the Name field from your source table (create a new DTO object that maps to the same table but contains only the name field against which you compare). That will minimize the memory usage.
Create and store a hash of the name field in the database, then download only the hashes to the client. Check the input data: if there's no such hash, just add the record; if such a hash exists, verify with dbSet.Any(..). With an MD5 hash this takes only 16 bytes × 1.5M records ≈ 23 MB; with CRC32, 4 bytes × 1.5M records ≈ 6 MB.
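A rough sketch of that hash-based check (the entity and property names, and storing the hash as a Base64 string, are assumptions, not from the question; it needs System.Security.Cryptography and System.Text):
// Sketch of the hash comparison. 'Items', 'Name' and 'NameHash' are illustrative names,
// and NameHash is assumed to be stored as a Base64 string.
using (var md5 = MD5.Create())
using (var context = new MyDbContext())
{
    // Download only the stored hashes (~23 MB worth for 1.5M MD5 hashes)
    var knownHashes = new HashSet<string>(context.Items.Select(i => i.NameHash));

    foreach (var incoming in sourceItems)
    {
        string hash = Convert.ToBase64String(
            md5.ComputeHash(Encoding.UTF8.GetBytes(incoming.Name)));

        if (knownHashes.Contains(hash))
            continue; // a record with this name (almost certainly) exists already

        context.Items.Add(new Item { Name = incoming.Name, NameHash = hash });
        knownHashes.Add(hash);
    }

    context.SaveChanges();
}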
You get the high memory usage because all the data you add is tracked by the DB context. If you work in chunks (let's say 4,000 records) and then dispose the context, the memory will be released.
using (var context = new MyDBContext())
{
context.DBSet.Add(...); // and no more than 4000 records here.
context.SaveChanges();
}
The better solution is to use SQL statements that can be executed via
model.Database.ExecuteSqlCommand("INSERT...",
new SqlParameter("Name",..),
new SqlParameter("Value",..));
(Remember to set the parameter type on the SqlParameter.)
INSERT target(name, value)
SELECT #Name, #Value
WHERE NOT EXISTS (SELECT name FROM target WHERE name = #Name);
If the expected number of inserted values is high, you might benefit from using a table-valued parameter (System.Data.SqlDbType.Structured), assuming you use MS SQL Server, and executing a similar query in chunks, but it is a little bit more complex.
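A rough sketch of that table-valued parameter variant (it assumes a user-defined table type dbo.NameValueList already exists and reuses the target table from the query above; all other names are illustrative, and it needs System.Data and System.Data.SqlClient):
// Sketch of passing a chunk of rows as a table-valued parameter.
// Assumes: CREATE TYPE dbo.NameValueList AS TABLE (Name NVARCHAR(200), Value NVARCHAR(MAX));
var table = new DataTable();
table.Columns.Add("Name", typeof(string));
table.Columns.Add("Value", typeof(string));
foreach (var item in chunk) // 'chunk' is one batch of new records
    table.Rows.Add(item.Name, item.Value);

var rows = new SqlParameter("@Rows", SqlDbType.Structured)
{
    TypeName = "dbo.NameValueList",
    Value = table
};

model.Database.ExecuteSqlCommand(
    @"INSERT target (name, value)
      SELECT r.Name, r.Value
      FROM @Rows AS r
      WHERE NOT EXISTS (SELECT name FROM target WHERE name = r.Name);",
    rows);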
Depending on the SQL server, you may load the data into a temporary table (e.g., from a CSV file, or with a bulk insert for MS SQL) and then execute a similar query.
The data
I have a collection with around 300,000 vacations. Every vacation has several categories, countries, cities, activities and other subobjects. This data needs to be inserted into a MySQL / SQL Server database. I have the luxury of being able to truncate the entire database and start clean every time the parser program is run.
What I have tried
I have tried working with Entity Framework, which is also where my preference lies. To keep Entity Framework's performance up, I have created a construction where 300 items at a time are taken out of the vacations collection, parsed, and inserted by Entity Framework, with its context disposed thereafter. The program finishes in a matter of minutes using this method. If I fill the context with all 300k vacations from the collection (and their subobjects), it takes a matter of hours.
int total = vacationsObjects.Count;
for (int i = 0; i < total; i += Math.Min(300, (total - i)))
{
var set = vacationsObjects.Skip(i).Take(300);
int enumerator = 0;
using (var database = InitializeContext())
{
foreach (VacationModel vacationData in set)
{
enumerator++;
Vacations vacation = new Vacations
{
ProductId = vacationData.ExternalId,
Name = vacationData.Name,
Description = vacationData.Description,
Price = vacationData.Price,
Url = vacationData.Url,
};
foreach (string category in vacationData.Categories)
{
var existingCategory = database.Categories.Local.FirstOrDefault(c => c.CategoryName == category);
if (existingCategory != null)
vacation.Categories.Add(existingCategory);
else
{
vacation.Categories.Add(new Category
{
CategoryName = category
});
}
}
database.Vacations.Add(vacation);
}
database.SaveChanges();
}
}
The downside (and possibly a dealbreaker) of this method is figuring out the relationships. As you can see, when adding a Category I check whether it has already been created in the local context and, if so, use that. But what if it was added in a previous set of 300? I don't want to query the database multiple times for every vacation to check whether an entity already resides within it.
Possible solution
I could keep a dictionary in memory containing the categories that have been added. I'd need to figure out how to attach these categories to the proper vacations (or vice-versa) and insert them, including their respective relations into the database.
Possible alternatives
Segregate the context and the transaction -
Purely theoretical; I don't know if I'm making any sense here. Maybe I could have EF's context keep track of all objects and take manual control over the inserting part. I have messed around with this, trying to work with manual transaction scopes, to no avail.
Stored procedure -
I could write a stored procedure that handles and inserts my data. I'm not a big fan of this alternative, as I would like to keep the flexibility of switching between MySQL and SQL Server. Also, I would be in the dark as to where to begin.
Intermediary CSV file -
Instead of inserting parsed data directly into the RDBMS, I could export it into one or more CSV files and make use of importing tools such as MySQL's LOAD DATA INFILE.
Alternative database systems
Databases such as Azure Table Storage, MongoDB or RavenDB could be an option. However, I would prefer to stick to a traditional RDBMS due to compatibility with my skillset and tools.
I have been working on and researching this problem for a couple of weeks now. It seems like the best way of finding a solution that fits is by simply trying the different possibilities and observing the result. I was hoping that I could receive some pointers or tips from your personal experiences.
If you insert each record separately, the whole operation will take a lot of time. The bottleneck is the SQL queries between client and server: each query takes time, so try to avoid using many of them. For a huge amount of data it is much better to process it with a dedicated import tool: in MySQL you can use LOAD DATA INFILE, in MS SQL there is BULK INSERT. To import your data, you need a .csv file.
To handle foreign keys correctly, you must populate the referenced tables manually before inserting. If the destination tables are empty, you can simply create a .csv file with predefined primary and foreign keys. Otherwise you can import the existing records from the server, update them with your data, and then export them back.
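For the MS SQL case, a rough sketch of kicking off the import from C# (the file path, table name and CSV options are illustrative, and the file must be readable by the SQL Server service account):
// Sketch: import a prepared .csv file with BULK INSERT. Path, table and format
// options are illustrative and depend on how the file was exported.
using (var conn = new SqlConnection(connectionString))
using (var cmd = new SqlCommand(
    @"BULK INSERT dbo.Vacations
      FROM 'C:\import\vacations.csv'
      WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\n', FIRSTROW = 2);", conn))
{
    conn.Open();
    cmd.CommandTimeout = 0; // large imports can exceed the default 30-second timeout
    cmd.ExecuteNonQuery();
}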
Time
Since you can afford to do only INSERTs, one suggestion is to try the Entity Framework Bulk Insert extension. I have used it to save up to 200K records and it works fine. Just include it in your project and write something like this:
context.BulkInsert(listOfEntities);
This should solve (or at least greatly improve over the plain EF version) the time dimension of your problem.
Data integrity
Keeping everything in one transaction does not sound reasonable (I expect those 300K parent records to generate at least 3M records overall), so I would try the following approach:
1) make your entities insertion using bulk insert.
2) call a stored procedure to check data integrity
If the insertion takes quite long and the chance of failure is relatively high, you can load what has already been imported and have the process skip it:
1) make smaller bulk inserts for a batch of vacation records and all their child records, and ensure each batch runs in a transaction. A single BULK INSERT runs atomically (no transaction needed); for several, it seems trickier.
2) if the process fails, you have complete vacation data in your database (no partially imported vacation)
3) restart the process, but first load the existing vacation records (parents only). With EF, a faster way is to use AsNoTracking to spare the tracking overhead (which matters for large lists):
var existingVacations = context.Vacation.Select(v => v.VacationSourceIdentifier).AsNoTracking();
As suggested by Alexei, EntityFramework.BulkInsert is a very good solution if your model is supported by this library.
You can also use Entity Framework Extensions (PRO Version) which allow to use BulkSaveChanges and Bulk Operations (Insert, Update, Delete and Merge).
It supports both of your providers: MySQL and SQL Server.
// Upgrade SaveChanges performance with BulkSaveChanges
var context = new CustomerContext();
// ... context code ...
// Easy to use
context.BulkSaveChanges();
// Easy to customize
context.BulkSaveChanges(operation => operation.BatchSize = 1000);
// Use direct bulk operation
context.BulkInsert(customers);
Disclaimer: I'm the owner of the project Entity Framework Extensions
I have to implement an algorithm on data which is (for good reasons) stored inside SQL server. The algorithm does not fit SQL very well, so I would like to implement it as a CLR function or procedure. Here's what I want to do:
Execute several queries (usually 20-50, but up to 100-200) which all have the form select a,b,... from some_table order by xyz. There's an index which fits that query, so the result should be available more or less without any calculation.
Consume the results step by step. The exact stepping depends on the results, so it's not exactly predictable.
Aggregate some result by stepping over the results. I will only consume the first part of each result, but cannot predict how much I will need. The stop criterion depends on some threshold inside the algorithm.
My idea was to open several SqlDataReader, but I have two problems with that solution:
You can have only one SqlDataReader per connection and inside a CLR method I have only one connection - as far as I understand.
I don't know how to tell SqlDataReader to read data in chunks. I could not find documentation on how SqlDataReader is supposed to behave. As far as I understand, it prepares the whole result set and would load it entirely into memory, even if I consume only a small part of it.
Any hint how to solve that as a CLR method? Or is there a more low level interface to SQL server which is more suitable for my problem?
Update: I should have made two points more explicit:
I'm talking about big data sets, so a query might return 1 million records, but my algorithm would consume only the first 100-200 of them. But as I said before: I don't know the exact number beforehand.
I'm aware that SQL might not be the best choice for that kind of algorithm. But due to other constraints it has to be a SQL server. So I'm looking for the best possible solution.
SqlDataReader does not read the whole data set; you are confusing it with the DataSet class. It reads row by row, as the .Read() method is called. If a client does not consume the result set, the server will suspend the query execution because it has no room to write the output into (the selected rows). Execution resumes as the client consumes more rows (i.e., as SqlDataReader.Read is called). There is even a special command behavior flag, SequentialAccess, that instructs ADO.NET not to pre-load the entire row in memory, which is useful for accessing large BLOB columns in a streaming fashion (see Download and Upload images from SQL Server via ASP.Net MVC for a practical example).
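For illustration, a minimal sketch of streaming a large column with SequentialAccess (the table and column names are made up; with this flag the columns must be read in ordinal order):
// Minimal sketch: read a large BLOB column in chunks with SequentialAccess.
// Table/column names are illustrative.
using (var cmd = new SqlCommand("SELECT Id, Payload FROM dbo.BigRows", connection))
using (var reader = cmd.ExecuteReader(CommandBehavior.SequentialAccess))
{
    var buffer = new byte[8192];
    while (reader.Read())
    {
        int id = reader.GetInt32(0); // columns must be read strictly in order
        long offset = 0, read;
        while ((read = reader.GetBytes(1, offset, buffer, 0, buffer.Length)) > 0)
        {
            // process buffer[0..read) without ever materializing the whole column
            offset += read;
        }
    }
}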
You can have multiple active result sets (multiple SqlDataReaders) open on a single connection when MARS is enabled. However, MARS is incompatible with SQLCLR context connections.
So you can create a CLR streaming TVF to do some of what you need in CLR, but only if you have a single SQL query source. Multiple queries would require you to abandon the context connection and instead use a fully fledged connection, i.e. connect back to the same instance in a loopback; this allows MARS and thus consuming multiple result sets. But loopback has its own issues, as it breaks the transaction boundaries you get from the context connection. Specifically, with a loopback connection your TVF won't be able to read changes made by the same transaction that called the TVF, because it is a different transaction on a different connection.
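A rough sketch of what such a loopback connection with MARS could look like inside the CLR routine (the connection string and queries are placeholders; as noted above, this runs in a separate transaction from the caller):
// Sketch: a loopback connection with MARS enabled, holding several readers open at once.
// Connection string and queries are placeholders.
var builder = new SqlConnectionStringBuilder
{
    DataSource = "(local)",          // loop back to the same instance
    InitialCatalog = "MyDatabase",
    IntegratedSecurity = true,
    MultipleActiveResultSets = true  // MARS: several active readers on one connection
};

using (var conn = new SqlConnection(builder.ConnectionString))
{
    conn.Open();
    var readers = new List<SqlDataReader>();
    foreach (string query in queries) // the 20-50 ordered SELECTs
    {
        readers.Add(new SqlCommand(query, conn).ExecuteReader());
    }

    // The algorithm can now call Read() on whichever reader it needs next,
    // consuming each result set only as far as its threshold requires.
}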
SQL is designed to work against huge data sets, and is extremely powerful. With set based logic it's often unnecessary to iterate over the data to perform operations, and there are a number of built-in ways to do this within SQL itself.
1) write set based logic to update the data without cursors
2) use deterministic user-defined functions with set-based logic (you can do this with the SqlFunction attribute in CLR code). Non-deterministic functions have the effect of turning the query into a cursor internally; non-deterministic means the output value is not always the same for a given input.
[SqlFunction(IsDeterministic = true, IsPrecise = true)]
public static int algorithm(int value1, int value2)
{
int value3 = ... ;
return value3;
}
3) use cursors as a last resort. This is a powerful way to execute per-row logic in the database, but it has a performance impact. It appears from this article that CLR can outperform SQL cursors (thanks Martin).
I saw your comment that the complexity of using set based logic was too much. Can you provide an example? There are many SQL ways to solve complex problems - CTE, Views, partitioning etc.
Of course you may well be right in your approach, and I don't know what you are trying to do, but my gut says leverage the tools of SQL. Spawning multiple readers isn't the right way to approach the database implementation. It may well be that you need multiple threads calling into a SP to run concurrent processing, but don't do this inside the CLR.
To answer your question: with CLR implementations (and IDataReader) you don't really need to page results in chunks, because you are not loading the data into memory or transporting it over the network. IDataReader gives you access to the data stream row by row. By the sound of it, your algorithm determines the number of records that need updating, so when that happens simply stop calling Read() and end at that point.
SqlMetaData[] columns = new SqlMetaData[3];
columns[0] = new SqlMetaData("Value1", SqlDbType.Int);
columns[1] = new SqlMetaData("Value2", SqlDbType.Int);
columns[2] = new SqlMetaData("Value3", SqlDbType.Int);
SqlDataRecord record = new SqlDataRecord(columns);
SqlContext.Pipe.SendResultsStart(record);
SqlDataReader reader = comm.ExecuteReader();
bool flag = true;
while (reader.Read() && flag)
{
int value1 = Convert.ToInt32(reader[0]);
int value2 = Convert.ToInt32(reader[1]);
// some algorithm
int newValue = ...;
// populate the three-column output record (indexes 0-2) and stream it back to the caller
record.SetInt32(0, value1);
record.SetInt32(1, value2);
record.SetInt32(2, newValue);
SqlContext.Pipe.SendResultsRow(record);
// keep going?
flag = newValue < 100;
}
// finish the streamed result set
SqlContext.Pipe.SendResultsEnd();
Cursors are a SQL-only feature. If you want to read chunks of data at a time, some sort of paging is required so that only a certain number of records is returned at once. If using LINQ,
.Skip(Skip)
.Take(PageSize)
Skip and Take can be used to limit the results returned.
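For example, a minimal paging loop along those lines (the query, ordering, page size, and stop condition are placeholders):
// Minimal paging sketch: fetch the ordered result page by page until the
// algorithm decides to stop. Query, ordering and page size are placeholders.
const int PageSize = 200;
int page = 0;
bool done = false;

while (!done)
{
    var batch = source              // an IQueryable ordered the way the index expects
        .OrderBy(x => x.Xyz)
        .Skip(page * PageSize)
        .Take(PageSize)
        .ToList();

    if (batch.Count == 0)
        break;                      // no more rows

    foreach (var row in batch)
    {
        // ... run the algorithm; set done = true once the threshold is reached
    }

    page++;
}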
You can simply iterate over the DataReader by doing something like this:
using (IDataReader reader = Command.ExecuteReader())
{
while (reader.Read())
{
//Do something with this record
}
}
This would iterate over the results one at a time, similar to a cursor in SQL Server.
For multiple recordsets at once, try MARS
(if SQL Server)
http://msdn.microsoft.com/en-us/library/ms131686.aspx
I am trying to isolate the source of a "memory leak" in my C# application. The application copies a large number of potentially large files into records in a database, using the image column type in SQL Server. I am using LinqToSql and associated objects for all database access.
The main loop iterates over a list of files and inserts. After removing much boilerplate and error handling, it looks like this:
foreach (Document doc in ImportDocs) {
using (var dc = new DocumentClassesDataContext(connection)) {
byte[] contents = File.ReadAllBytes(doc.FileName);
DocumentSubmission submission = new DocumentSubmission() {
Content = contents,
// other fields
};
dc.DocumentSubmissions.InsertOnSubmit(submission); // (A)
dc.SubmitChanges(); // (B)
}
}
Running this program over the entire input results in an eventual OutOfMemoryException. CLR Profiler reveals that 99% of the heap consists of large byte[] objects corresponding to the sizes of the files.
If I comment out both lines A and B, the leak goes away. If I uncomment only line A, the leak comes back. I don't understand how this is possible, as dc is disposed on every iteration of the loop.
Has anyone encountered this before? I suspect directly calling stored procedures or doing inserts will avoid this leak, but I'd like to understand this before trying something else. What is going on?
Update
Adding GC.Collect(); after line (B) appears to make no significant difference in any case. This does not surprise me much, as CLR Profiler was already showing a good number of GC events without my inducing them explicitly.
Which operating system are you running this on? Your problem may not be related to Linq2Sql, but to how the operating system manages large memory allocations. For instance, Windows Server 2008 is much better at managing large objects in memory than XP. I have had instances where the code working with large files was leaking on XP but was running fine on Win 2008 server.
HTH
I don't entirely understand why, but making a copy of the iterating variable fixed it. As near as I can tell, LinqToSql was somehow making a copy of the DocumentSubmission inside each Document.
foreach (Document doc in ImportDocs) {
// make copy of doc that lives inside loop scope
Document copydoc = new Document() {
field1 = doc.field1,
field2 = doc.field2,
// complete copy
};
using (var dc = new DocumentClassesDataContext(connection)) {
byte[] contents = File.ReadAllBytes(copydoc.FileName);
DocumentSubmission submission = new DocumentSubmission() {
Content = contents,
// other fields
};
dc.DocumentSubmissions.InsertOnSubmit(submission); // (A)
dc.SubmitChanges(); // (B)
}
}