EF.NET Core: Multiple insert streams within one transaction

I have a lot of rows (300k+) to upsert into a SQL Server database in the shortest possible time. The idea was to partition the data and use async to pump it into SQL Server in parallel: X threads at a time, 100 rows per context, with the context recycled to minimize tracking overhead. However, that means more than one connection is used in parallel, so CommittableTransaction/TransactionScope would escalate to a distributed transaction, and the parallel transaction enlistment then fails with the infamous "This platform does not support distributed transactions." exception.
I do need the ability to commit or roll back the entire set of upserts. It's part of a batch upload process, and any error should roll the changes back to the previous working/stable state, application-wise.
What are my options, short of using one connection and no parallelization?
Note: The problem is not as simple as a batch of insert commands. If that were the case, I would just generate the inserts and run them on the server as a query, or indeed use SqlBulkCopy. About half of the rows are updates and half are inserts, where new keys are generated by SQL Server and need to be obtained and re-keyed on child objects that are inserted next. The rows are spread over about a dozen tables in a 3-level hierarchy.

Nope. Totally wrong approach. Do NOT use EF for that. Bulk insert ETL is not what object-relational mappers are made for, and a lot of their design decisions are counterproductive for it. You would also not use a small car instead of a truck to transport 20 tons of goods.
300k rows are trivial if you use the SqlBulkCopy API in some form.
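
One way this could look for a single level of the hierarchy is sketched below. It is only a sketch under assumptions, not the answerer's code: it uses one connection and one SqlTransaction, bulk-copies the rows into a temp staging table, and runs a set-based MERGE whose OUTPUT clause returns the generated keys for re-keying children. The names dbo.Parent, #ParentStaging, ExternalRef, Name and Id are hypothetical, and the Microsoft.Data.SqlClient package is assumed.

```csharp
using System;
using System.Collections.Generic;
using System.Data;
using Microsoft.Data.SqlClient; // assumption: Microsoft.Data.SqlClient NuGet package

public static class BulkUpsert
{
    // Pure helper: shape incoming rows as a DataTable whose column order
    // matches the (hypothetical) staging table.
    public static DataTable ToStagingTable(IEnumerable<(string ExternalRef, string Name)> rows)
    {
        var table = new DataTable();
        table.Columns.Add("ExternalRef", typeof(string));
        table.Columns.Add("Name", typeof(string));
        foreach (var (externalRef, name) in rows)
            table.Rows.Add(externalRef, name);
        return table;
    }

    // Returns ExternalRef -> server-generated Id for every upserted row.
    public static Dictionary<string, int> UpsertParents(string connectionString, DataTable staged)
    {
        var keys = new Dictionary<string, int>();
        using var connection = new SqlConnection(connectionString);
        connection.Open();
        using var transaction = connection.BeginTransaction();
        try
        {
            // The temp table is visible only to this connection.
            using (var create = new SqlCommand(
                "CREATE TABLE #ParentStaging (ExternalRef nvarchar(50) NOT NULL, Name nvarchar(200) NOT NULL);",
                connection, transaction))
                create.ExecuteNonQuery();

            // Bulk copy enlisted in the same transaction as everything else.
            using (var bulk = new SqlBulkCopy(connection, SqlBulkCopyOptions.Default, transaction))
            {
                bulk.DestinationTableName = "#ParentStaging";
                bulk.WriteToServer(staged);
            }

            // Set-based upsert; OUTPUT hands back the keys SQL Server generated.
            const string merge = @"
MERGE dbo.Parent AS target
USING #ParentStaging AS source ON target.ExternalRef = source.ExternalRef
WHEN MATCHED THEN UPDATE SET target.Name = source.Name
WHEN NOT MATCHED THEN INSERT (ExternalRef, Name) VALUES (source.ExternalRef, source.Name)
OUTPUT source.ExternalRef, inserted.Id;";
            using (var command = new SqlCommand(merge, connection, transaction))
            using (var reader = command.ExecuteReader())
                while (reader.Read())
                    keys[reader.GetString(0)] = reader.GetInt32(1);

            transaction.Commit();
            return keys;
        }
        catch
        {
            // Any failure undoes the staging load and the upsert together.
            transaction.Rollback();
            throw;
        }
    }
}
```

The returned dictionary would be used to set foreign keys on the next level's staging rows before repeating the same steps on the same connection and transaction, so the whole upload still commits or rolls back as one unit.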

Related

Is my SQL transaction taking too long?

There is something that worries me about my application. I have a SQL query that does a bunch of inserts into the database across various tables. I timed how long it takes to complete the process: about 1.5 seconds. At this point I'm not even done developing the query; I still have more inserts to add. So I fully expect this process to take even longer, perhaps up to 3 seconds.
Now, it is important that all of this data be consistent and finish either completely or not at all. So what I'm wondering is: is it OK for a transaction to take that long? Doesn't it lock up the table, so that selects, inserts, updates, etc. cannot be run until the transaction is finished? My concern is that if this query is run frequently, it could lock up the entire application so that certain parts of it become either incredibly slow or unusable. With a low user base I doubt this would be an issue, but if my application should gain some traction, this query could potentially be run a lot.
Should I be concerned about this, or am I missing something and the database won't act the way I am thinking? I'm using a SQL Server 2014 database.
To note, I timed this by using the StopWatch C# object immediately before the transaction starts, and stop it right after the changes are committed. So it's about as accurate as can be.

You're right to be concerned about this, as a transaction will lock the rows it's written until the transaction commits, which can certainly cause problems such as deadlocks, and temporary blocking which will slow the system response. But there are various factors that determine the potential impact.
For example, you probably largely don't need to worry if your users are only updating and querying their own data, and your tables have indexing to support both read and write query criteria. That way each user's row locking will largely not affect the other users--depending on how you write your code of course.
If your users share data, and you want to support efficient searching across multiple users' data even with multiple concurrent updates, for example, then you may need to do more.
Some general concepts:
-- Ensure your transactions write to tables in the same order
-- Keep your transactions as short as possible by preparing the data to be written as much as possible before starting the transaction.
-- If this is a new system (and even if not new), definitely consider enabling Snapshot Isolation and/or Read Committed Snapshot Isolation on the database. SI will (when explicitly set on the session) allow your read queries not to be blocked by concurrent writes. RCSI will allow all your read queries by default not to be blocked by concurrent writes. But read this to understand both the benefits and gotchas of both isolation levels: https://www.brentozar.com/archive/2013/01/implementing-snapshot-or-read-committed-snapshot-isolation-in-sql-server-a-guide/
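
As a rough illustration of the "explicitly set on the session" part, the sketch below opens a read transaction under snapshot isolation. It assumes the database already has ALLOW_SNAPSHOT_ISOLATION turned ON, that the Microsoft.Data.SqlClient package is used, and that dbo.Orders is a hypothetical table. With RCSI there is nothing to change in code, since it alters how the default READ COMMITTED level behaves.

```csharp
using System.Data;
using Microsoft.Data.SqlClient; // assumption: Microsoft.Data.SqlClient NuGet package

public static class SnapshotRead
{
    // Reads under SNAPSHOT isolation so the query sees the last committed
    // row versions instead of waiting on locks held by concurrent writers.
    public static int CountOrders(string connectionString)
    {
        using var connection = new SqlConnection(connectionString);
        connection.Open();
        using var transaction = connection.BeginTransaction(IsolationLevel.Snapshot);
        using var command = new SqlCommand(
            "SELECT COUNT(*) FROM dbo.Orders;", // hypothetical table
            connection, transaction);
        var count = (int)command.ExecuteScalar();
        transaction.Commit();
        return count;
    }
}
```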

I think it depends on your code: how effectively you use loops, select queries and the other statements.

Storing huge number of entities in SQL Server database

I have the following scenario: I am building a dummy web app that pulls betting odds every minute, stores all the events, matches, odds etc. to the database and then updates the UI.
I have this structure: Sports > Events > Matches > Bets > Odds and I am using code first approach and for all DB-related operations I am using EF.
When I am running my application for the very first time and my database is empty I am receiving XML with odds which contains: ~16 sports, ~145 events, ~675 matches, ~17100 bets & ~72824 odds.
Here comes the problem: how do I save all these entities in a timely manner? Parsing is not that time-consuming (0.2 seconds), but when I try to bulk store all these entities I run into memory problems, and the save takes more than 1 minute, so the next odds pull is triggered and it becomes a nightmare.
I saw somewhere that I should disable Configuration.AutoDetectChangesEnabled and recreate my context every 100/1000 records I insert, but I am not nearly there. Every suggestion will be appreciated. Thanks in advance.

When you are inserting huge (though it is not that huge) amounts of data like that, try using SqlBulkCopy. You can also try using a table-valued parameter (TVP) and passing it to a stored procedure, but I do not suggest it for this case, as TVPs perform well for fewer than about 1000 records. SqlBulkCopy is super easy to use, which is a big plus.
If you need to do an update to many records, you can use SqlBulkCopy for that as well but with a little trick. Create a staging table and insert the data using SqlBulkCopy into the staging table, then call a stored procedure which will get records from the staging table and update the target table. I have used SqlBulkCopy for both cases numerous times and it works pretty well.
Furthermore, with SqlBulkCopy you can do the insertion in batches as well and provide feedback to the user, however, in your case, I do not think you need to do that. But nonetheless, this flexibility is there.
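
A small sketch of the batching and feedback mentioned above, using SqlBulkCopy's BatchSize, NotifyAfter and SqlRowsCopied members. The destination table dbo.Odds, the batch size of 5000 and the Percent helper are illustrative assumptions, as is the use of the Microsoft.Data.SqlClient package.

```csharp
using System;
using System.Data;
using Microsoft.Data.SqlClient; // assumption: Microsoft.Data.SqlClient NuGet package

public static class OddsBulkInsert
{
    // Hypothetical helper: turns a running row count into a 0..100 percentage.
    public static int Percent(long copied, long total) =>
        total <= 0 ? 100 : (int)Math.Min(100, copied * 100 / total);

    public static void Insert(string connectionString, DataTable rows, Action<int> reportPercent)
    {
        using var bulk = new SqlBulkCopy(connectionString)
        {
            DestinationTableName = "dbo.Odds", // hypothetical table
            BatchSize = 5000,                  // rows sent to the server per batch
            NotifyAfter = 5000                 // raise SqlRowsCopied every 5000 rows
        };

        // Map by name so DataTable column order does not have to match the table.
        foreach (DataColumn column in rows.Columns)
            bulk.ColumnMappings.Add(column.ColumnName, column.ColumnName);

        bulk.SqlRowsCopied += (sender, e) =>
            reportPercent(Percent(e.RowsCopied, rows.Rows.Count));

        bulk.WriteToServer(rows);
        reportPercent(100);
    }
}
```

As far as I understand the API, when batches are used without a surrounding transaction, batches that already completed may stay in the table if a later one fails, so an external SqlTransaction or a staging table is worth considering when all-or-nothing behaviour matters.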
Can I do it using EF only?
I have not tried but there is this library you can try.

I understand your situation, but all of this depends on your machine specs and the software itself. If the machine cannot handle the load, it is time to change the plan, for example by limiting the number of records inserted at a time until everything is done.

How to Improve Entity Framework bulk insert

I have an application which receives data from multiple sockets and then write the data into a DB.
I am currently using EF to do this. I would like to know how I can make it more efficient.
I have read that doing a bulk insert is faster, so I am only saving changes to the DB every 500 inserts:
    db.Logs_In.Add(tableItem);
    // Save once every 500 buffered items, then start counting again.
    if (++logBufferCounter >= 500)
    {
        db.SaveChanges();
        logBufferCounter = 0;
    }
Now I have profiled the application, and 74% of the work is being done by the function System.Data.Entity.DbSet`1[System.__Canon].Add.
Is there a better way to do the insert? Maybe queue up the tableItems in a List and then add the whole list to the DB context?
Or maybe I'm looking at it all wrong and I should avoid Entity Framework entirely for this kind of high-performance insert? Currently it is the bottleneck in my application, and if I look at the system resources, SQL Server doesn't even seem to be batting an eyelid.
So my Questions:
1: In what way would I achieve the most efficient / quickest insert when doing many inserts?
2: If EF is acceptable, how can I improve my solution?
I am using SQL Server 2012 Enterprise Edition.
The incoming data is a constant stream; however, I can afford to buffer it and then do a bulk insert if this is a better solution.
[EDIT]
To further explain the scenario: I have a thread which loops on a ConcurrentQueue and dequeues the items from it. However, because the DB insert is the bottleneck, there are often thousands of entries in the queue. So I would also like to know whether there is an async or parallel way to make use of more than one thread to do the insert.

For scenarios that involve large amounts of inserts, I tend to favor "buffer separately" (in-memory, or a redis list, or whatever), then as a batch job (perhaps every minute, or every few minutes) read the list and use SqlBulkCopy to throw the data into the database as efficiently as possible. To help with that, I use the ObjectReader.Create method of FastMember, which exposes a List<T> (or any IEnumerable<T>) as an IDataReader that can be fed into SqlBulkCopy, exposing properties of T as logical columns in the data-reader. All you need to do, then, is fill the List<T> from the buffer.
Note, however, that you need to think about the "something goes wrong" scenario; i.e. if the insert fails half way through, what do you do about the data in the buffer? One option here is to do the SqlBulkCopy into a staging table (same schema, but not the "live" table), then use a regular INSERT to copy the data in one step when you know it is at the database - this makes recovery simpler.
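
The buffering approach described above might be sketched roughly as follows. This is a sketch under assumptions rather than the answerer's actual code: the LogEntry type, the dbo.Logs_In_Staging table and the batch limit of 10,000 are hypothetical, and it assumes the FastMember and Microsoft.Data.SqlClient NuGet packages.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using FastMember;               // assumption: FastMember NuGet package
using Microsoft.Data.SqlClient; // assumption: Microsoft.Data.SqlClient NuGet package

// Hypothetical shape of one buffered item.
public sealed class LogEntry
{
    public DateTime ReceivedAt { get; set; }
    public string Message { get; set; } = "";
}

public static class LogFlusher
{
    // Takes up to maxItems from the buffer; producers can keep enqueueing meanwhile.
    public static List<T> Drain<T>(ConcurrentQueue<T> queue, int maxItems)
    {
        var batch = new List<T>();
        while (batch.Count < maxItems && queue.TryDequeue(out var item))
            batch.Add(item);
        return batch;
    }

    // Intended to be called from a timer or background loop, e.g. once a minute.
    public static void Flush(string connectionString, ConcurrentQueue<LogEntry> queue)
    {
        var batch = Drain(queue, 10_000);
        if (batch.Count == 0) return;

        try
        {
            // Expose the list as a data reader; the member order defines the column order.
            using var reader = ObjectReader.Create(
                batch, nameof(LogEntry.ReceivedAt), nameof(LogEntry.Message));
            using var bulk = new SqlBulkCopy(connectionString)
            {
                DestinationTableName = "dbo.Logs_In_Staging" // hypothetical staging table
            };
            bulk.WriteToServer(reader);
        }
        catch
        {
            // "Something goes wrong": put the items back so they are not lost
            // (their order relative to newer items is not preserved).
            foreach (var entry in batch)
                queue.Enqueue(entry);
            throw;
        }
    }
}
```

Moving the rows from the staging table to the live table would then be a single INSERT ... SELECT on the server, as the answer describes.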

Nhibernate large transactions, flushes vs locks

I am having trouble maintaining an incredibly large transaction using NHibernate. Say I am saving a large number of entities. If I do not flush every N entities, let us say 10000, then performance gets killed by the overcrowded NH session. If I do flush, I place locks at the DB level which, in combination with the read committed isolation level, affect the working application. Also note that in reality I import an entity whose business logic is at the heart of the system, and around 10 tables are affected by its import. That makes a stateless session a bad idea, because cascades would have to be maintained manually.
Moving the BL to stored procedures is a big challenge for two reasons:
there is already complicated OO business logic in the domain classes of the application,
duplicated BL would be introduced.
Ideally I would want to flush the session to some file and, only when the preparation of data is completed, execute its contents. Is that possible?
Any other suggestions/best practices are more than welcome.

Your scenario is a typical ORM batch problem. In general, no ORM is meant to be used for this kind of work. If you want high batch-processing performance (without long-held locks and possible deadlocks), you should not use the ORM to insert thousands of records.
Instead, use native batch inserts, which will always be a lot faster (like SqlBulkCopy for MSSQL).
Anyway, if you want to use NHibernate for this, try to make use of the batch size setting.
Call Save or Update on all your objects and call session.Flush only once at the end. This will create all your objects in memory...
Depending on the batch size, NHibernate should try to create insert/update batches of that size, meaning far fewer round trips to the database and therefore fewer locks, or at least it shouldn't take that long...
In general, your operations should only lock the database the moment your first insert statement gets executed on the server if you use normal transactions. It might work differently if you work with TransactionScope.
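
The batch size suggestion could be wired up roughly like this. It is a sketch, assuming the NHibernate package and an existing hibernate.cfg.xml with mappings; the value 100 is arbitrary. As far as I know, insert batching does not take effect for entities whose ids come from an identity column, so the gain depends on the id generator in use.

```csharp
using System.Collections.Generic;
using NHibernate;
using NHibernate.Cfg;

public static class BatchImport
{
    public static ISessionFactory BuildFactory()
    {
        // Assumes connection settings and mappings live in hibernate.cfg.xml.
        var configuration = new Configuration().Configure();
        // Same as setting adonet.batch_size in the configuration file.
        configuration.SetProperty(NHibernate.Cfg.Environment.BatchSize, "100");
        return configuration.BuildSessionFactory();
    }

    public static void Import<T>(ISessionFactory factory, IEnumerable<T> entities) where T : class
    {
        using var session = factory.OpenSession();
        using var transaction = session.BeginTransaction();

        // Everything is queued in the session first...
        foreach (var entity in entities)
            session.Save(entity);

        // ...and sent in batches when the commit flushes the session.
        transaction.Commit();
    }
}
```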
Here are some additional reads of how to improve batch processing.
http://fabiomaulo.blogspot.de/2011/03/nhibernate-32-batching-improvement.html
NHibernate performance insert
http://zvolkov.com/clog/2010/07/16?s=Insert+or+Update+records+in+bulk+with+NHibernate+batching

C# + SQL Server - Fastest / Most Efficient way to read new rows into memory

I have an SQL Server 2008 Database and am using C# 4.0 with Linq to Entities classes setup for Database interaction.
There exists a table which is indexed on a DateTime column where the value is the insertion time for the row. Several new rows are added every second (~20), and I need to pull them into memory efficiently so that I can display them in a GUI. For simplicity, let's just say I need to show the newest 50 rows in a list displayed via WPF.
I am concerned with the load polling may place on the database and the time it will take to process new results forcing me to become a slow consumer (Getting stuck behind a backlog). I was hoping for some advice on an approach. The ones I'm considering are;
Poll the database in a tight loop (~1 result per query)
Poll the database every second (~20 results per query)
Create a database trigger for Inserts and tie it to an event in C# (SqlDependency)
I also have some options for access;
Linq-to-Entities Table Select
Raw SQL Query
Linq-to-Entities Stored Procedure
If you could shed some light on the pros and cons or suggest another way entirely I'd love to hear it.
The process which adds the rows to the table is not under my control, I wish only to read the rows never to modify or add. The most important things are to not overload the SQL Server, keep the GUI up to date and responsive and use as little memory as possible... you know, the basics ;)
Thanks!

I'm a little late to the party here, but if you have the feature on your edition of SQL Server 2008, there is a feature known as Change Data Capture that may help. Basically, you have to enable this feature both for the database and for the specific tables you need to capture. The built-in Change Data Capture process looks at the transaction log to determine what changes have been made to the table and records them in a pre-defined table structure. You can then query this table or pull results from the table into something friendlier (perhaps on another server altogether?). We are in the early stages of using this feature for a particular business requirement, and it seems to be working quite well thus far.
You would have to test whether this feature would meet your needs as far as speed, but it may help maintenance since no triggers are required and the data capture does not tie up your database tables themselves.

Rather than polling the database, maybe you can use SQL Server Service Broker and perform the read from there, even pushing which rows are new. Then you can select from the table.
The most important thing I would see here is having an index on the way you identify new rows (a timestamp?). That way your query would select the top entries from the index instead of querying the table every time.
Test, test, test! Benchmark your performance for any tactic you want to try. The biggest issues to resolve are how the data is stored and any locking and consistency issues you need to deal with.

If your table is updated constantly with 20 rows a second, then there is nothing better to do than poll every second or every few seconds. As long as you have an efficient way (meaning an index or clustered index) to retrieve the last rows that were inserted, this method will consume the fewest resources.
If the updates occur in bursts of 20 updates per second but with significant periods of inactivity (minutes) in between, then you can use SqlDependency (which has absolutely nothing to do with triggers, by the way; read The Mysterious Notification to understand how it actually works). You can mix LINQ with SqlDependency, see linq2cache.
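
The polling option could be sketched like this: remember the newest insertion time already shown and ask only for rows after it, so the indexed column does the work. The table dbo.Events, its columns InsertedAt (assumed to be datetime2) and Payload, and the Row type are hypothetical, and the Microsoft.Data.SqlClient package is assumed.

```csharp
using System;
using System.Collections.Generic;
using System.Data;
using Microsoft.Data.SqlClient; // assumption: Microsoft.Data.SqlClient NuGet package

// Hypothetical shape of one displayed row.
public sealed class Row
{
    public Row(DateTime insertedAt, string payload) { InsertedAt = insertedAt; Payload = payload; }
    public DateTime InsertedAt { get; }
    public string Payload { get; }
}

public static class NewRowPoller
{
    // One cheap range query on the indexed column per polling interval.
    public static List<Row> FetchNewer(string connectionString, DateTime lastSeen)
    {
        const string sql = @"
SELECT TOP (50) InsertedAt, Payload
FROM dbo.Events
WHERE InsertedAt > @lastSeen
ORDER BY InsertedAt DESC;";

        var rows = new List<Row>();
        using var connection = new SqlConnection(connectionString);
        using var command = new SqlCommand(sql, connection);
        command.Parameters.Add("@lastSeen", SqlDbType.DateTime2).Value = lastSeen;
        connection.Open();
        using var reader = command.ExecuteReader();
        while (reader.Read())
            rows.Add(new Row(reader.GetDateTime(0), reader.GetString(1)));
        return rows;
    }

    // Merges freshly fetched rows into the display window: newest first, capped at capacity.
    public static List<Row> MergeWindow(List<Row> current, List<Row> incoming, int capacity)
    {
        var merged = new List<Row>(incoming);
        merged.AddRange(current);
        merged.Sort((a, b) => b.InsertedAt.CompareTo(a.InsertedAt));
        if (merged.Count > capacity)
            merged.RemoveRange(capacity, merged.Count - capacity);
        return merged;
    }
}
```

A timer would call FetchNewer once a second with the InsertedAt of the newest row in the window and pass the result to MergeWindow with a capacity of 50. Note that a strict "greater than" comparison can miss rows that share the exact timestamp of the last one seen, so a unique, increasing key may be a safer watermark if one exists.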

Do you have to query to be notified of new data?
You may be better off using push notifications from a Service Bus (eg: NServiceBus).
Using notifications (i.e events) is almost always a better solution than using polling.
