I am having a challenge of maintaining an incredibly large transaction using Nhibernate. So, let us say, I am saving large number of entities. If I do not flush on a transaction N, let us say 10000, then the performance gets killed due to overcrowded Nh Session. If I do flush, I place locks on DB level which in combination with read committed isolation level do affect working application. Also note that in reality I import an entity whose business logic is one of the hearts of the system and on its import around 10 tables are affected. That makes Stateless session a bad idea due to manual maintaining of cascades.
Moving BL to stored procedure is a big challenge due to to reasons:
there is already complicated OO business logic in the domain
classes of application,
duplicated BL will be introduced.
Ideally I would want to Flush session to some file and only then preparation of data is completed, I would like to execute its contents. Is it possible?
Any other suggestions/best practices are more than welcome.
You scenario is a typical ORM batch problem. In general we can say that no ORM is meant to be used for stuff like that. If you want to have high batch processing performance (not everlasting locks and maybe deadlocks) you should not use the ORM to insert 1000s of records.
Instead use native batch inserts which will always be a lot faster. (like SqlBulkCopy for MMSQL)
Anyways, if you want to use nhibernate for this, try to make use of the batch size setting.
Call save or update to all your objects and only call session.Flush once at the end. This will create all your objects in memory...
Depending on the batch size, nhibernate should try to create insert/update batches with this size, meaning you will have lot less roundtrips to the database and therefore fewer locks or at least it shouldn't take that long...
In general, your operations should only lock the database the moment your first insert statement gets executed on the server if you use normal transactions. It might work differently if you work with TransactionScope.
Here are some additional reads of how to improve batch processing.
http://fabiomaulo.blogspot.de/2011/03/nhibernate-32-batching-improvement.html
NHibernate performance insert
http://zvolkov.com/clog/2010/07/16?s=Insert+or+Update+records+in+bulk+with+NHibernate+batching
Related
I want to use DeleteManyAsync method to delete multiple documents. I will encounter big collections being deleted. In the meantime I would like my new documents to be inserted. I would like to know if my database collection will be locked when DeleteManyAsync is fired.
This is the code I want to use :
List<MyDocument> list= new List<MyDocument>();
var filter = Builders<MyDocument>.Filter.In("_id", vl.Select(i => i.InternalId));
await _context?.MyDocuments?.DeleteManyAsync(filter);
Mongo db locks are a low level concern and are handled at the database server level. You, as a programmer writing a client application using the driver, do not need to concern yourself about the database locks too much.
What I'm trying to say is that when using the C# driver you won't notice any kind of issue related to concurrent write operations executed on the same collection. Locks are handled by the storage engine, not by the driver used at the client application level.
If you check this documentation you can read that, in case of conflicting write operations on the same collection, the storage engine will retry the operation at the server level:
When the storage engine detects conflicts between two operations, one will incur a write conflict causing MongoDB to transparently retry that operation
So, again, the concurrency issues are handled at the server level.
Consider that if you need your application to be highly scalable you should design your system in order to avoid as much as possible concurrent write operations on the same collection. As I said above, locks are handled by the storage engine in order to preserve the correctness of your data, but locks can reduce the overall scalability of your system. So, if scalability is critical in your scenario, you should carefully design your system and avoid contention of resources at the database level as much as possible.
At the client application level you just need to decide whether or not retrying on a failed write operation.
Sometimes you can safely retry a failed operation, some other times you can't (e.g.: in some cases you will endup having duplicate data at the database level. A good guard against this is using unique indexes).
As a rule of thumb, idempotent write operations can safely be retried in case of a failure (because applying them multiple times does not have any side effect). Put another way, strive to have idempotent write operations as much as possible: this way you are always safe retrying a failed write operation.
If you need some guidance about the mongo C# driver erorr handling, you can take a look to this documentation
Update 25th July 2020
Based on the author comment, it seems that the main concern is not the actual database locking strategy, but the delete performances instead.
In that case I would proceed in the following manner:
always prefer a command performing a single database roundtrip (such as deleteMany) over issuing multiple single commands (such as deleteOne). By doing a single roundtrip you will minimize the latency cost and you will perform a single database command. It's simply more efficient
when you use a deleteMany command be sure to always filter documents by using a proper index, so that collection scan is avoided when finding the documents to be deleted
if you measure and you are sure that your bottleneck is the deleteMany speed, considere comparing the performances of deleteMany command with the one of an equivalent bulk write operation. I never tried that, so I have no idea about the actual speed comparison. My feeling is that probably there is no difference at all, because I supsect that under the hood deleteMany performs a bulk write. I have no clue on that, this is just a feeling.
consider changing your design in order to exploit the TTL index feature for an automatic deletion of the documents when some sort of expiration criteria is satisfied. This is not always possible, but it can be handy when applicable.
if you perform the delete operation as part of some sort of cleanup task on the data, consider scheduling a job performing the data cleanup operation on a regular basis, but outisde of the business hours of your users.
I have a lot of rows (300k+) to upsert into SQL server database in a shortest possible period of time, so the idea was to use parallelization and partition the data and use async to pump the data into SQL, X threads at the time, 100 rows per context, with context being recycled to minimize tracking overhead. However, that means more than one connection is to be used in parallel and thus CommittableTransaction/TransactionScope would use distributed transaction which would cause parallelized transaction enlistment operation to return the infamous "This platform does not support distributed transactions." exception.
I do need the ability to commit/rollback the entire set of upserts. Its part of the batch upload process and any error should rollback the changes to previously working/stable condition, application wise.
What are my options? Short of using one connection and no parallelization?
Note: Problem is not so simple as a batch of insert commands, if that was the case, I would just generate inserts and run them on server as query or indeed use SqlBulkCopy. About half of them are updates, half are inserts where new keys are generated by SQL Server which need to be obtained and re-keyed on child objects which would be inserted next, rows are spread over about a dozen tables in a 3-level hierarchy.
Nope. Totally wrong approach. Do NOT use EF for that - bulk insert ETL is not what Object Relational Mappers are made for and a lot of design decisions are not productive for that. You would also not use a small car instead of a truck to transport 20 tons of goods.
300k rows are trivial if you use SqlBulkCopy API in some sort.
There is something that worries me about my application. I have a SQL query that does a bunch of inserts into the database across various tables. I timed how long it takes to complete the process, it takes about 1.5 seconds. At this point I'm not even done developing the query, I still have more inserts to program into this. So I fully expect this to process to take even longer, perhaps up to 3 seconds.
Now, it is important that all of this data be consistent and finish either completely, or not at all. So What I'm wondering about is, is it OK for a transaction to take that long. Doesn't it lock up the table, so selects, inserts, updates, etc... cannot be run until the transaction is finished? My concern is if this query is being run frequently it could lock up the entire application so that certain parts of it become either incredibly slow, or unusable. With a low user base, I doubt this would be an issue, but if my application should gain some traction, this query could potentially be a lot.
Should I be concerned about this or am I missing something where the database won't act how I am thinking. I'm using a SQL Server 2014 database.
To note, I timed this by using the StopWatch C# object immediately before the transaction starts, and stop it right after the changes are committed. So it's about as accurate as can be.
You're right to be concerned about this, as a transaction will lock the rows it's written until the transaction commits, which can certainly cause problems such as deadlocks, and temporary blocking which will slow the system response. But there are various factors that determine the potential impact.
For example, you probably largely don't need to worry if your users are only updating and querying their own data, and your tables have indexing to support both read and write query criteria. That way each user's row locking will largely not affect the other users--depending on how you write your code of course.
If your users share data, and you want to be able to support efficient searching across multiple user's data even with multiple concurrent updates for example, then you may need to do more.
Some general concepts:
-- Ensure your transactions write to tables in the same order
-- Keep your transactions as short as possible by preparing the data to be written as much as possible before starting the transaction.
-- If this is a new system (and even if not new), definitely consider enabling Snapshot Isolation and/or Read Committed Snapshot Isolation on the database. SI will (when explicitly set on the session) allow your read queries not to be blocked by concurrent writes. RCSI will allow all your read queries by default not to be blocked by concurrent writes. But read this to understand both the benefits and gotchas of both isolation levels: https://www.brentozar.com/archive/2013/01/implementing-snapshot-or-read-committed-snapshot-isolation-in-sql-server-a-guide/
I think its depend on your code, how you used loop effectively, select query and the other statement.
So I am troubleshooting some performance problems on a legacy application, and I have uncovered a pretty specific problem (there may be others).
Essentially, the application is using an object relational mapper to fetch data, but it is doing so in a very inefficient/incorrect way. In effect, it is performing a series of entity graph fetches to fill a datagrid in the UI, and on databinding the grid (it is ASP.Net Webforms) it is doing additional fetches, which lead to other fetches, etc.
The net effect of this is that many, many tiny queries are being performed. Using SQL Profiler shows that a certain page performs over 10,000 queries (to fill a single grid. No query takes over 10ms to complete, and most of them register as 0ms in Profiler. Each query will use and release one connection, and the series of queries would be single-threaded (per http request).
I am very familiar with the ORM, and know exactly how to fix the problem.
My question is: what is the exact effect of having many, many small queries being executed in an application? In what ways does it/can it stress the different components of the system?
For example, what is the effect on the webserver's CPU and memory? Would it flood the connection pool and cause blocking? What would be the impact on the database server's memory, CPU and I/O?
I am looking for relatively general answers, mainly because I want to start monitoring the areas that are likely to be the most affected (I need to measure => fix => re-measure). Concurrent use of the system at peak would likely be around 100-200 users.
It will depend on the database but generally there is a parse phase for each query. If the query has used bind variables it will probably be cached. If not, you wear the hit of a parse and that often means short locks on resources. i.e. BAD. In Oracle, CPU and blocking are much more prevelant at the parse than the execute. SQL Server less so but it's worse at the execute. Obviously doing 10K of anything over a network is going to be a terrible solution, especially x 200 users. Volume I'm sure is fine but that frequency will really highlight all the overhead in comms latency and stuff like that. Connection pools generally are in the hundreds, not tens of thousands, and now you have 10s of thousands of objects all being created, queued, managed, destroyed, garbage collected etc.
But I'm sure you already know all this deep down. Ditch the ORM for this part and write a stored procedure to execute the single query to return your result set. Then put it on the grid.
I have an importer process which is running as a windows service (debug mode as an application) and it processes various xml documents and csv's and imports into an SQL database. All has been well until I have have had to process a large amount of data (120k rows) from another table (as I do the xml documents).
I am now finding that the SQL server's memory usage is hitting a point where it just hangs. My application never receives a time out from the server and everything just goes STOP.
I am still able to make calls to the database server separately but that application thread is just stuck with no obvious thread in SQL Activity Monitor and no activity in Profiler.
Any ideas on where to begin solving this problem would be greatly appreciated as we have been struggling with it for over a week now.
The basic architecture is c# 2.0 using NHibernate as an ORM data is being pulled into the actual c# logic and processed then spat back into the same database along with logs into other tables.
The only other prob which sometimes happens instead is that for some reason a cursor is being opening on this massive table, which I can only assume is being generated from ADO.net the statement like exec sp_cursorfetch 180153005,16,113602,100 is being called thousands of times according to Profiler
When are you COMMITting the data? Are there any locks or deadlocks (sp_who)? If 120,000 rows is considered large, how much RAM is SQL Server using? When the application hangs, is there anything about the point where it hangs (is it an INSERT, a lookup SELECT, or what?)?
It seems to me that that commit size is way too small. Usually in SSIS ETL tasks, I will use a batch size of 100,000 for narrow rows with sources over 1,000,000 in cardinality, but I never go below 10,000 even for very wide rows.
I would not use an ORM for large ETL, unless the transformations are extremely complex with a lot of business rules. Even still, with a large number of relatively simple business transforms, I would consider loading the data into simple staging tables and using T-SQL to do all the inserts, lookups etc.
Are you running this into SQL using BCP? If not, the transaction logs may not be able to keep up with your input. On a test machine, try turning the recovery mode to Simple (non-logged) , or use the BCP methods to get data in (they bypass T logging)
Adding on to StingyJack's answer ...
If you're unable to use straight BCP due to processing requirements, have you considered performing the import against a separate SQL Server (separate box), using your tool, then running BCP?
The key to making this work would be keeping the staging machine clean -- that is, no data except the current working set. This should keep the RAM usage down enough to make the imports work, as you're not hitting tables with -- I presume -- millions of records. The end result would be a single view or table in this second database that could be easily BCP'ed over to the real one when all the processing is complete.
The downside is, of course, having another box ... And a much more complicated architecture. And it's all dependent on your schema, and whether or not that sort of thing could be supported easily ...
I've had to do this with some extremely large and complex imports of my own, and it's worked well in the past. Expensive, but effective.
I found out that it was nHibernate creating the cursor on the large table. I am yet to understand why, but in the mean time I have replaced the large table data access model with straight forward ado.net calls
Since you are rewriting it anyway, you may not be aware that you can call BCP directly from .NET via the System.Data.SqlClient.SqlBulkCopy class. See this article for some interesting perforance info.