I have the following task to do: calculate interest for all active accounts. In the past I did things like this using ADO.NET and stored procedures.
This time I've tried to do it with NHibernate, because it seemed that complex algorithms would be easier to implement with pure POCOs.
So I want to do the following (pseudocode):
foreach account in accounts
    calculate interest
    save account with new interest
I'm aware that NHibernate was not designed for processing large data volumes. For me it is sufficient to be able to organize such a loop without having all accounts in memory at once.
To minimize memory usage I would use an IStatelessSession for the outer loop instead of a plain ISession.
I've tried the approach proposed by Ayende. There are two problems:
CreateQuery uses "magic strings";
more importantly, it doesn't work as described.
My program works, but after switching on ODBC tracing I saw in the debugger that all fetches were done before the lambda expression in .List was executed for the first time.
I've found myself another solution: session.Query returning .AsEnumerable(), which I used in the foreach. Again, two problems:
I would prefer IQueryOver to IQueryable;
it still doesn't work as described (all fetches happen before the first interest calculation).
I don't know why, but IQueryOver doesn't have AsEnumerable. It also doesn't have a List method with an argument (like CreateQuery). I've tried .Future, but again:
the documentation of Future doesn't describe a streaming feature;
it still doesn't work as I need (all fetches happen before the first interest calculation).
In summary: is there any equivalent in NHibernate to dataReader.Read() from ADO.NET?
My best alternative to the pure NHibernate approach would be a main loop using dataReader.Read() and then loading the account with that id inside the ADO.NET loop. However, performance would suffer: reading each account via its key is slower than the sequence of fetches done in the outer loop.
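A minimal sketch of that hybrid approach (Account, CalculateInterest and the ACCOUNTS table are hypothetical names for illustration):

using (var connection = new OdbcConnection(connectionString))
using (var session = sessionFactory.OpenSession())
using (var tx = session.BeginTransaction())
{
    connection.Open();
    var command = new OdbcCommand("SELECT ID FROM ACCOUNTS WHERE ACTIVE = 1", connection);
    using (var reader = command.ExecuteReader())
    {
        while (reader.Read())   // streams ids one row at a time
        {
            // one key lookup per row - this is the per-key cost mentioned above
            var account = session.Get<Account>(reader.GetInt32(0));
            account.CalculateInterest();   // dirty tracking picks up the change on flush
            // in practice you would Flush() and Clear() the session every few
            // hundred rows to keep its first-level cache from growing unbounded
        }
    }
    tx.Commit();
}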
I'm using NHibernate version 4.0.0.4000.
While it is true that NH was not designed with large-volume processing in mind, you can always circumvent this restriction with application-layer batch processing. I have found that, depending on the size of the object graph of the relevant entity, performance suffers after a certain number of objects have been loaded into memory (in one small project I could load 100,000 objects and performance remained acceptable; in another, with only 1,500 objects, any additional Load() would crawl).
In the past I have used paging to handle batch processing, for when IStatelessSession result sets are too poor (as they don't load proxies etc.).
So you make a count query at the beginning, pick some arbitrary batch size, and then start doing your work batch by batch. This way you can neatly avoid the n+1 select problem, assuming that for each batch you explicitly fetch-join everything needed.
The caveat is that for this to work efficiently you will need to evict the processed entities of each batch from the ISession when you are done, and this means that you will have to commit the transaction on each batch. If you can live with multiple flush+commits then this could work for you; see the sketch below.
Otherwise you will have to go with the IStatelessSession, although there are no lazy queries there: "from Books" means "select * from dbo.Books" or something equivalent, and all results are fetched into memory.
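A minimal sketch of the paging approach, assuming a hypothetical Account entity (the batch size is arbitrary):

int batchSize = 500;
int total = session.QueryOver<Account>().RowCount();   // the count query

for (int offset = 0; offset < total; offset += batchSize)
{
    using (var tx = session.BeginTransaction())
    {
        var batch = session.QueryOver<Account>()
            .OrderBy(a => a.Id).Asc      // stable ordering so pages don't overlap
            .Skip(offset)
            .Take(batchSize)
            .List();

        foreach (var account in batch)
            account.CalculateInterest();

        tx.Commit();      // flush + commit per batch
    }
    session.Clear();      // evict the processed batch so the session stays small
}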
I apologize if this question is a bit nebulous.
I am writing a C# application which does data manipulation against a SQL Server database. For a group of items, I read data for each item, do calculations on the data, then write the results to the database.
The problem I am having is that the time taken to process each item grows as the number of items to be processed increases.
I am trying to be very careful about freeing memory for allocated objects as I am through with them. I want to have nothing hanging around from the processing of one item when I start the processing of the next item. I make use of "using" blocks for data tables and the BulkCopy class to try to force memory cleanup.
Yet I get geometrically increasing run times per item the more items I try to process in one invocation of the program.
My program is a WinForms app. I don't seem to be eating up the server's memory with what I am doing. I am trying to make the processing of each item isolated from the processing of all other items, to make sure it would not matter how many items I process in each invocation of the application.
Has anyone seen this behavior in their applications and know what to look for to correct this?
A couple of things you can be watchful for if you're using "using" statements: are you making sure that you're not keeping your connection open while manipulating your objects? It's best to get your data from the database, close the connection, do your manipulation, and then send the data back.
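A minimal sketch of that pattern (the table names and the Calculate helper are illustrative, not from the original code):

DataTable items = new DataTable();

// Open the connection only long enough to fetch the data.
using (var connection = new SqlConnection(connectionString))
using (var adapter = new SqlDataAdapter("SELECT * FROM Items WHERE Processed = 0", connection))
{
    adapter.Fill(items);   // Fill opens and closes the connection itself
}

// The connection is closed here; do the expensive manipulation offline.
foreach (DataRow row in items.Rows)
    row["Result"] = Calculate(row);

// Reopen briefly to write the results back.
using (var connection = new SqlConnection(connectionString))
using (var bulkCopy = new SqlBulkCopy(connection))
{
    connection.Open();
    bulkCopy.DestinationTableName = "ItemResults";
    bulkCopy.WriteToServer(items);
}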
Are you using stored procedures for fetching/sending complex objects? You can also experiment with doing some of your data manipulation inside the stored procedures, or in functions called from them. You do NOT want to offload your entire business classes to the database, but you can do some of the work there, depending on what you're doing.
Make sure your data structure is optimized as well (primary key indices, foreign keys, triggers, etc.). You can get some scripts from http://www.brentozar.com/first-aid/ to check the optimization of your database.
As mentioned above, try using some parallel/asynchronous patterns to divvy up your work. await/async is very helpful for this, especially if you want to run calculations while also sending previous data back to the server.
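A minimal sketch of overlapping the current calculation with the previous write (Calculate and SaveAsync are hypothetical helpers, and this must run inside an async method):

Task pendingSave = Task.CompletedTask;

foreach (var item in items)
{
    var result = Calculate(item);      // CPU-bound work for the current item

    await pendingSave;                 // make sure the previous write has finished
    pendingSave = SaveAsync(result);   // start writing while the loop moves on
}

await pendingSave;                     // wait for the final write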
Thanks for all the input. I checked the issues of opening/closing connections, etc., to confirm that I was being tidy. The thing that really helped was removing the primary keys on the destination data table. These were set up relative to what an end user would require, but they really gummed up the speed of data inserts. A heads up to folks to think about database constraints for updating data vs. using the data.
I also found performance issues in selecting with a filter from an in-memory DataTable. Somehow what I was doing got bogged down with a larger number of rows (30,000). I realized that I was mishandling the data and did not really need to do this, but it did show me the need to micro-test each step of my logic when trying to drag so much data around.
I have a table with a lot of rows (3 million) from which I need to query some rows at several points in my app. The way I found to do this is to query all the data the first time any of it is needed and store it in a static DataTable with SqlDataAdapter.Fill() for the rest of the app's life.
That's fast, because then when I need something I use DataTable.Select("some query") and the app processes the info just fine.
The problem is that this table takes about 800 MB of RAM, and I have to run this app on PCs where that might be too much.
The other way I thought of was to query the data I need each time. This takes little memory but has poor performance (a lot of queries to the database, which is at a network address, and with 1000 queries you start to notice the ping and all that...).
Is there any middle ground between performance and memory usage?
EDIT: What I'm retrieving are sales, which have a date, a product and a quantity. I query by product, and the table isn't indexed that way. But anyway, making 1000 queries, even if each query takes 0.05 s, a 0.2 s ping makes a total of 200 seconds...
First, talk to the DBA about performance.
If you are downloading the entire table, you might actually be putting more load on the network and SQL Server than if you performed individual queries.
As a DBA, if I knew you were downloading an entire large table I would put an index on product immediately.
Why are you performing 1000s of queries?
If you are looking for sales when a product is created, then a cache is problematic: you would not yet have sales data for it. The problem with a cache is stale data. If you know the data will not change (you either have it or not), then you can eliminate the concern about stale data.
There is something between sequential and simultaneous: you can pack multiple selects into a single request. This makes a single round trip, which is more efficient.
select * from tableA where ....;
select * from tableB where ....;
With a DataReader, just call the SqlDataReader.NextResult method:
using (SqlDataReader rdr = cmd.ExecuteReader())
{
    while (rdr.Read())
    {
        // process the rows from the first select (tableA)
    }
    rdr.NextResult();   // advance to the result set of the second select
    while (rdr.Read())
    {
        // process the rows from the second select (tableB)
    }
}
I'm pretty sure you can do the same type of thing with multiple DataTables in a DataSet.
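A minimal sketch of the DataSet variant, assuming the same two-select command text (table names are illustrative):

var ds = new DataSet();
using (var connection = new SqlConnection(connectionString))
using (var adapter = new SqlDataAdapter(
    "select * from tableA; select * from tableB;", connection))
{
    adapter.Fill(ds);   // one round trip; fills ds.Tables[0] and ds.Tables[1]
}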
Another option is LocalDB. It is targeted at developers, but for what you are doing it would work just fine: DataTable-like speed without the memory concerns. You can even put an index on ProductID. Writing to disk takes a little longer than writing to memory, but you are not using up RAM.
Then there is the ever-evil WITH (NOLOCK). Know what you are doing; I am not going to go into all the possible evils, but I can tell you that I use it a lot.
The question can be distilled to memory vs. performance. The answer to that is caching.
If you know what your usage pattern would be like, then one thing you can do is to create a local cache in the app.
The extreme cases are: your cache size is 800 MB with all your data in it (thereby sacrificing memory), or your cache size is 0 MB and all your queries go over the network (thereby sacrificing performance).
Three important questions about the design of the cache are answered below.
How to populate the Cache?
If you are likely to run some query multiple times, store the result in the cache, and before going to the network, check whether your cache already has it. If it doesn't, query the database and then store the result in the cache.
If, after querying for some data, you are likely to query the next and/or previous piece of data, then query all of it at once and cache it, so that when you query the next piece you already have it in the cache.
Effectively the idea is that if you know some information may be needed in future, cache it beforehand.
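A minimal sketch of that lookup pattern (Sale and QuerySalesByProduct are hypothetical names for illustration):

static readonly Dictionary<int, List<Sale>> cache = new Dictionary<int, List<Sale>>();

static List<Sale> GetSales(int productId)
{
    List<Sale> sales;
    if (cache.TryGetValue(productId, out sales))
        return sales;                        // cache hit: no network round trip

    sales = QuerySalesByProduct(productId);  // cache miss: go to the database
    cache[productId] = sales;
    return sales;
}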
How to free the Cache?
You can make the freeing mechanism for the cache either active or passive.
Passive: whenever the cache is full, evict data from it.
Active: run a background thread at a regular interval that takes care of the removal for you.
One hybrid method is to start a freeing thread as soon as you reach, let's say, 80% of your memory limit, and then free whatever memory you can.
What data to remove from the Cache?
This has already been answered in the context of page replacement policies for operating systems.
For completion, I'll summarize the important ones here:
Evict the least recently used data (if it is not likely to be used again);
Evict the data that was brought in earliest (if the earliest data is not likely to be used);
Evict the data that was brought in latest (if you think the newly brought-in data is least likely to be used);
Automatically remove data that is older than t time units.
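A minimal sketch of the least-recently-used policy in C# (the capacity and key/value types are illustrative):

class LruCache<TKey, TValue>
{
    private readonly int capacity;
    private readonly Dictionary<TKey, LinkedListNode<KeyValuePair<TKey, TValue>>> map =
        new Dictionary<TKey, LinkedListNode<KeyValuePair<TKey, TValue>>>();
    private readonly LinkedList<KeyValuePair<TKey, TValue>> order =
        new LinkedList<KeyValuePair<TKey, TValue>>();

    public LruCache(int capacity) { this.capacity = capacity; }

    public bool TryGet(TKey key, out TValue value)
    {
        LinkedListNode<KeyValuePair<TKey, TValue>> node;
        if (map.TryGetValue(key, out node))
        {
            order.Remove(node);    // move to the front: most recently used
            order.AddFirst(node);
            value = node.Value.Value;
            return true;
        }
        value = default(TValue);
        return false;
    }

    public void Put(TKey key, TValue value)
    {
        LinkedListNode<KeyValuePair<TKey, TValue>> existing;
        if (map.TryGetValue(key, out existing))
            order.Remove(existing);               // replacing an existing entry
        else if (map.Count >= capacity)
        {
            map.Remove(order.Last.Value.Key);     // evict the least recently used
            order.RemoveLast();
        }
        map[key] = order.AddFirst(new KeyValuePair<TKey, TValue>(key, value));
    }
}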
RE: "I can't index by anything because I'm not the database admin nor can ask for that."
Can you prepopulate a temp table and put an index on that? E.g.:
SELECT * INTO #MyTempTable FROM BigHugeTable
CREATE INDEX Prodidx ON #MyTempTable (product)
You will have to ensure you always reuse the same connection (and that it isn't closed) in order to use the temp table.
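A minimal sketch of keeping one connection alive for the temp table (the query and names are illustrative):

using (var connection = new SqlConnection(connectionString))
{
    connection.Open();   // the temp table lives only as long as this connection

    using (var create = new SqlCommand(
        "SELECT * INTO #MyTempTable FROM BigHugeTable; " +
        "CREATE INDEX Prodidx ON #MyTempTable (product);", connection))
    {
        create.ExecuteNonQuery();
    }

    // Later queries must reuse this same open connection.
    using (var query = new SqlCommand(
        "SELECT * FROM #MyTempTable WHERE product = @p", connection))
    {
        query.Parameters.AddWithValue("@p", productId);
        using (var rdr = query.ExecuteReader())
        {
            while (rdr.Read()) { /* process the sales rows */ }
        }
    }
}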
Which will perform better, a while loop or a cursor?
After lots of research I came to know that both are equally bad for performance, that they may sometimes outperform each other depending on the situation, and that they should be used only when a set-based operation is not possible.
Now the question is: which is better performance-wise, a loop in C# or a cursor (or while loop) in SQL?
I searched the web but found no definitive result...
Does anybody have any idea?
Based on my experience I would say: it depends on which operations you perform on every item...
In one scenario I happened to use a cursor loop in SQL to perform bit-wise operations on some data read from the DB, and the performance was very poor (SQL is not intended for that kind of work)... in that case I got a better result by looping in C# over a cursor opened on the DB...
On the other hand, if you have to perform some other complex data-mining task for every item of the loop, then it is much more convenient to do that in SQL, so that your data does not have to go back and forth between the DB and C#.
Do you have a specific application scenario you can tell us about, so that we can give you a more concrete idea?
When you say "a for loop in C#", do you mean that you intend to first load all the results from the query, and then subsequently loop over them? The downside of that approach is of course that you need the memory to hold all those results. But you don't have to do that. There are many mechanisms in C# that allow you to loop over the results as they come in, which avoids the need to hold all results in memory. It depends of course on your database access technology.
If you use a technology based on IQueryable<T>, just use a foreach loop over the result, and avoid calling materialization functions on it, such as .ToList().
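A minimal sketch of the difference (db.Orders is a hypothetical IQueryable source; whether rows actually stream during iteration depends on the provider):

// Streams results as they arrive (provider permitting): roughly constant memory.
foreach (var order in db.Orders.Where(o => o.Total > 100m))
{
    Process(order);
}

// Materializes everything up front: memory proportional to the result size.
var all = db.Orders.Where(o => o.Total > 100m).ToList();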
As for cursors, always avoid them if possible. For performance, always prefer set-based operations over cursors. In fact, the same is true for a foreach loop in C#: if your processing of each result involves querying the database again, use a SQL join that returns the needed data in a single query, instead of issuing a new query for each result row retrieved.
I am having a challenge maintaining an incredibly large transaction using NHibernate. Let us say I am saving a large number of entities. If I do not flush every N entities, let us say 10,000, then performance gets killed by the overcrowded NHibernate session. If I do flush, I place locks at the DB level which, in combination with the read committed isolation level, affect the working application. Also note that in reality I import an entity whose business logic is one of the hearts of the system, and around 10 tables are affected by its import. That makes a stateless session a bad idea, due to the manual maintenance of cascades.
Moving the BL to a stored procedure is a big challenge for two reasons:
there is already complicated OO business logic in the domain classes of the application;
duplicated BL would be introduced.
Ideally I would want to flush the session to some file and, only once the preparation of the data is completed, execute its contents. Is that possible?
Any other suggestions/best practices are more than welcome.
Your scenario is a typical ORM batch problem. In general we can say that no ORM is meant to be used for stuff like that. If you want high batch-processing performance (no everlasting locks and maybe deadlocks), you should not use the ORM to insert thousands of records.
Instead, use native batch inserts, which will always be a lot faster (like SqlBulkCopy for MSSQL).
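A minimal sketch of the SqlBulkCopy route, assuming the entities have already been flattened into a DataTable (table name and batch size are illustrative):

using (var connection = new SqlConnection(connectionString))
{
    connection.Open();
    using (var bulkCopy = new SqlBulkCopy(connection))
    {
        bulkCopy.DestinationTableName = "dbo.ImportedEntities";
        bulkCopy.BatchSize = 5000;          // rows sent per round trip
        bulkCopy.WriteToServer(dataTable);  // bypasses the ORM entirely
    }
}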
Anyway, if you want to use NHibernate for this, try to make use of the batch-size setting.
Call Save or Update on all your objects and only call session.Flush once at the end. This will create all your objects in memory...
Depending on the batch size, NHibernate should try to create insert/update batches of that size, meaning you will have a lot fewer round trips to the database and therefore fewer locks, or at least they shouldn't be held that long...
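A minimal sketch of enabling ADO.NET batching in the NHibernate configuration (the value 50 is illustrative):

var cfg = new NHibernate.Cfg.Configuration();
cfg.Configure();   // reads the usual hibernate.cfg.xml

// adonet.batch_size groups inserts/updates into batched round trips
cfg.SetProperty(NHibernate.Cfg.Environment.BatchSize, "50");

var sessionFactory = cfg.BuildSessionFactory();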
In general, your operations should only lock the database from the moment your first insert statement is executed on the server, if you use normal transactions. It might work differently if you work with TransactionScope.
Here is some additional reading on how to improve batch processing:
http://fabiomaulo.blogspot.de/2011/03/nhibernate-32-batching-improvement.html
NHibernate performance insert
http://zvolkov.com/clog/2010/07/16?s=Insert+or+Update+records+in+bulk+with+NHibernate+batching
So I am troubleshooting some performance problems on a legacy application, and I have uncovered a pretty specific problem (there may be others).
Essentially, the application is using an object relational mapper to fetch data, but it is doing so in a very inefficient/incorrect way. In effect, it is performing a series of entity graph fetches to fill a datagrid in the UI, and on databinding the grid (it is ASP.Net Webforms) it is doing additional fetches, which lead to other fetches, etc.
The net effect of this is that many, many tiny queries are being performed. SQL Profiler shows that a certain page performs over 10,000 queries (to fill a single grid). No query takes over 10 ms to complete, and most of them register as 0 ms in Profiler. Each query uses and releases one connection, and the series of queries is single-threaded (per HTTP request).
I am very familiar with the ORM, and know exactly how to fix the problem.
My question is: what is the exact effect of having many, many small queries being executed in an application? In what ways does it/can it stress the different components of the system?
For example, what is the effect on the webserver's CPU and memory? Would it flood the connection pool and cause blocking? What would be the impact on the database server's memory, CPU and I/O?
I am looking for relatively general answers, mainly because I want to start monitoring the areas that are likely to be the most affected (I need to measure => fix => re-measure). Concurrent use of the system at peak would likely be around 100-200 users.
It will depend on the database, but generally there is a parse phase for each query. If the query uses bind variables, it will probably be cached; if not, you take the hit of a parse, and that often means short locks on resources, i.e. BAD. In Oracle, CPU use and blocking are much more prevalent at the parse than at the execute; SQL Server less so, but it's worse at the execute. Obviously, doing 10K of anything over a network is going to be a terrible solution, especially times 200 users. The volume I'm sure is fine, but that frequency will really highlight all the overhead in comms latency and the like. Connection pools are generally in the hundreds, not tens of thousands, and now you have tens of thousands of objects all being created, queued, managed, destroyed, garbage collected, etc.
But I'm sure you already know all this deep down. Ditch the ORM for this part and write a stored procedure to execute the single query to return your result set. Then put it on the grid.