Storing huge number of entities in SQL Server database

I have the following scenario: I am building a dummy web app that pulls betting odds every minute, stores all the events, matches, odds etc. to the database and then updates the UI.
I have this structure: Sports > Events > Matches > Bets > Odds. I am using the code-first approach, and I use EF for all DB-related operations.
When I run my application for the very first time and my database is empty, I receive XML with odds which contains: ~16 sports, ~145 events, ~675 matches, ~17100 bets & ~72824 odds.
Here comes the problem: how do I save all these entities in a timely manner? Parsing is not time consuming (0.2 seconds), but when I try to bulk store all these entities I run into memory problems and the save takes more than 1 minute, so the next odds pull is triggered before the save finishes, and this is a nightmare.
I read somewhere that I should disable Configuration.AutoDetectChangesEnabled and recreate my context every 100/1000 inserted records, but I am still nowhere near fast enough. Every suggestion will be appreciated. Thanks in advance.

When you are inserting huge (though it is not that huge) amounts of data like that, try using SqlBulkCopy. You can also try a Table-Valued Parameter (TVP) passed to a stored procedure, but I do not suggest it for this case, as TVPs perform well for fewer than about 1000 records. SqlBulkCopy is very easy to use, which is a big plus.
If you need to update many records, you can use SqlBulkCopy for that as well, but with a little trick. Create a staging table and insert the data into it using SqlBulkCopy, then call a stored procedure which reads the records from the staging table and updates the target table. I have used SqlBulkCopy for both cases numerous times and it works pretty well.
Furthermore, with SqlBulkCopy you can do the insertion in batches and provide feedback to the user. In your case I do not think you need to do that, but the flexibility is there.
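As a rough illustration of the insert case described above, here is a minimal sketch. The OddRow class, the dbo.Odds table and its column names are assumptions made up for the example (the question does not show the real schema), and the batch size of 5000 is arbitrary.

```csharp
using System;
using System.Collections.Generic;
using System.Data;
using System.Data.SqlClient;

// Hypothetical row shape; the real Odds entity is not shown in the question.
public sealed class OddRow
{
    public int BetId { get; set; }
    public string Name { get; set; }
    public decimal Value { get; set; }
}

public static class OddsBulkWriter
{
    // Copies the parsed rows into an in-memory DataTable whose columns
    // mirror the (assumed) dbo.Odds table.
    public static DataTable ToDataTable(IEnumerable<OddRow> rows)
    {
        var table = new DataTable("Odds");
        table.Columns.Add("BetId", typeof(int));
        table.Columns.Add("Name", typeof(string));
        table.Columns.Add("Value", typeof(decimal));
        foreach (OddRow row in rows)
        {
            table.Rows.Add(row.BetId, row.Name, row.Value);
        }
        return table;
    }

    // Sends the table to SQL Server in batches and reports progress.
    public static void BulkInsert(string connectionString, DataTable table)
    {
        using (var bulk = new SqlBulkCopy(connectionString))
        {
            bulk.DestinationTableName = "dbo.Odds"; // assumed table name
            bulk.BatchSize = 5000;                  // rows per round trip (arbitrary)
            bulk.NotifyAfter = 5000;                // raise SqlRowsCopied every 5000 rows
            bulk.SqlRowsCopied += (sender, e) =>
                Console.WriteLine("{0} rows copied", e.RowsCopied);

            // Map by column name so the column order in the target table does not matter.
            foreach (DataColumn column in table.Columns)
            {
                bulk.ColumnMappings.Add(column.ColumnName, column.ColumnName);
            }
            bulk.WriteToServer(table);
        }
    }
}
```

Parent rows (sports, events, matches, bets) would need their keys assigned before the child rows are copied; how to do that depends on the schema and is left out here.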
Can I do it using EF only?
I have not tried but there is this library you can try.

I understand your situation, but how well all of this performs depends on your machine specs and on the software itself.
If the machine cannot handle the process, it is time to change the plan, for example by limiting the number of records inserted at a time until everything has been saved.

Related

Data Processing In C# - Best Approach?

I am still on a learning curve in C# and SQL Server so please forgive my ‘greenness’.
Here is my scenario:
I have an EMPLOYEE table with 10,000 rows. Each of these employees has transactions in a TRANSACTIONS table.
The transaction table has salary elements like basic pay, acting allowance, overtime hours etc. It also has payroll deductions like advance deductions, some loans (with interest), and savings (pension, social security savings etc.).
I need to go through each employee’s transactions and compute taxes, outstanding balances on loans, update balances on savings, convert hours into payments/deductions and some other stuff.
This processing will give me a new set of rows for each employee, with a period marker (e.g. 2013-04 for April 2013). I need to store this in a HISTORY table for future reference.
What is the best approach for processing the entire 10,000 employee table and their transactions?
I am told that pulling the entire table into memory via readers is not good practice and I agree.
Do I keep pulling an employee from the database, process their transactions, and commit the history to the database? And pull the next and so forth?
Too many calls to the back end?
(EF not an option for me, still doing raw SQL in ADO.NET)
I will appreciate any help on this.

10000 rows is not much. Memory can easily handle that unless there are some enormous varchar or binary columns. Don't feel completely locked in by good practice "rules".
On the other hand, consider a stored procedure. Then all processing will be done locally on the server.
Edit: if neither of the above is an option, try to stream your results. For example, while reading your query, put each row in a ConcurrentQueue or something like that. Before you execute the query, start another thread or a BackgroundWorker which checks the queue for new items and saves the results back simultaneously on another SqlConnection. The work is done when the query has finished AND the queue's Count is 0.
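The streaming idea above can be sketched roughly as follows. This version swaps the plain ConcurrentQueue for a BlockingCollection (which wraps a ConcurrentQueue by default), so the consumer can block instead of polling Count. The StreamingPipeline name and the save callback are hypothetical; in the real job, source would be the rows coming from the reader and save would write to the HISTORY table on a second SqlConnection.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading.Tasks;

public static class StreamingPipeline
{
    // Runs 'save' on a background task while 'source' is still being
    // enumerated on the calling thread. Returns the number of items saved.
    public static int Run<T>(IEnumerable<T> source, Action<T> save)
    {
        using (var queue = new BlockingCollection<T>())
        {
            int saved = 0;

            // Consumer: blocks until items arrive; the loop ends once adding
            // has been completed and the queue is empty.
            Task consumer = Task.Run(() =>
            {
                foreach (T item in queue.GetConsumingEnumerable())
                {
                    save(item);
                    saved++;
                }
            });

            try
            {
                // Producer: in the real job this would be the data reader loop.
                foreach (T item in source)
                {
                    queue.Add(item);
                }
            }
            finally
            {
                // Signals "query is done" so the consumer can finish.
                queue.CompleteAdding();
            }

            consumer.Wait(); // rethrows any exception thrown by 'save'
            return saved;
        }
    }
}
```

Passing a capacity to the BlockingCollection constructor would add backpressure when saving is slower than reading, at the cost of needing cancellation logic if the consumer fails.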

Check out using ROW_NUMBER(). This can be used by programs to allow large tables to be essentially browsed using 'x' number of rows at a time. You could then conceivably use this same method to batch your job over, say, 1000 rows at a time.
See this link for more information.
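A minimal sketch of that ROW_NUMBER() batching, under assumed names: dbo.Employee with EmployeeId and Name columns is invented for the example, and EmployeePager is a hypothetical helper.

```csharp
using System;
using System.Collections.Generic;
using System.Data.SqlClient;

public static class EmployeePager
{
    // Table and column names (dbo.Employee, EmployeeId, Name) are assumptions.
    private const string PageSql = @"
WITH Numbered AS (
    SELECT EmployeeId, Name,
           ROW_NUMBER() OVER (ORDER BY EmployeeId) AS RowNum
    FROM dbo.Employee
)
SELECT EmployeeId, Name
FROM Numbered
WHERE RowNum BETWEEN @First AND @Last
ORDER BY RowNum;";

    // Yields inclusive (first, last) row-number ranges that cover totalRows.
    public static IEnumerable<Tuple<int, int>> PageRanges(int totalRows, int pageSize)
    {
        if (pageSize <= 0) throw new ArgumentOutOfRangeException("pageSize");
        for (int first = 1; first <= totalRows; first += pageSize)
        {
            yield return Tuple.Create(first, Math.Min(first + pageSize - 1, totalRows));
        }
    }

    // Reads one page on an already open connection and hands each row to 'process'.
    public static void ProcessPage(SqlConnection connection, int first, int last,
                                   Action<int, string> process)
    {
        using (var command = new SqlCommand(PageSql, connection))
        {
            command.Parameters.AddWithValue("@First", first);
            command.Parameters.AddWithValue("@Last", last);
            using (SqlDataReader reader = command.ExecuteReader())
            {
                while (reader.Read())
                {
                    process(reader.GetInt32(0), reader.GetString(1));
                }
            }
        }
    }
}
```

With 10,000 employees and a page size of 1000, PageRanges yields ten ranges, and the job calls ProcessPage once per range.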

Writing code to Process 25,000 records C#, T-SQL, Quick Performance is key

What would be the most efficient way to loop through 25,000 records and, based on some prewritten VB logic that won't ever change (99% sure), update the Result column in a table to a value of 1, 2 or 3?
Performance and reliability are most important here. This will most likely get called via a client-server app on the network, but it would be nice to be able to call it from a web app. I am thinking about 3 different ways to do it with T-SQL and C#.
a. Write an object that executes a stored procedure to get the 25,000 records, use a foreach loop over the collection to go through each record and, based on some C# logic, call an object at each record that executes a stored procedure to update that row. This would call the object 25,000 times (and the proc, I assume, would just reuse the execution plan).
or
b. Write a stored procedure that gets the 25,000 records, use the forbidden cursor to go through each record and based on some T-SQL logic, update that row in this stored procedure.
or
UPDATED: MY SOLUTION IS THIS
For what it's worth, I am going with persisted computed columns, and breaking the loop into smaller update statements to update the column (all wrapped in a transaction). See the article below. I think it will be really fast compared to a loop.
http://technet.microsoft.com/en-us/library/cc917696.aspx

You obviously have some condition that determines whether the value should be 1, 2 or 3. You could just do 3 update queries. Each query would update the records based on the condition that determines if the value should be 1, 2 or 3. Don't pull all the data down to your machine if you can help it.
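A sketch of what the three update queries could look like when issued from C#. The table dbo.Records, the Score column and the WHERE conditions are placeholders invented for the example; the real conditions come from the existing VB logic, which the question does not show.

```csharp
using System;
using System.Data.SqlClient;

public static class ResultUpdater
{
    // Placeholder statements: dbo.Records, Score and the thresholds are assumptions.
    public static readonly string[] Statements =
    {
        "UPDATE dbo.Records SET Result = 1 WHERE Score < 10;",
        "UPDATE dbo.Records SET Result = 2 WHERE Score >= 10 AND Score < 100;",
        "UPDATE dbo.Records SET Result = 3 WHERE Score >= 100;"
    };

    // Runs the statements on an open connection inside one transaction
    // and returns the total number of rows affected.
    public static int Apply(SqlConnection connection)
    {
        int affected = 0;
        using (SqlTransaction transaction = connection.BeginTransaction())
        {
            foreach (string sql in Statements)
            {
                using (var command = new SqlCommand(sql, connection, transaction))
                {
                    affected += command.ExecuteNonQuery();
                }
            }
            // If an exception is thrown before this line, disposing the
            // transaction rolls everything back.
            transaction.Commit();
        }
        return affected;
    }
}
```

That is three round trips in total instead of 25,000, and no rows leave the server.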

My first choice would be to do it all in SQL if I could, i.e. update xxx set col=1 where (your logic here), update xxx set col=2 where (logic) etc.
If you need to do the logic in the VB client, either in a web app or client-server, my choice would be to use a DataReader to pass through the records (pulling down only the columns that are required, not the whole row) and then execute either a T-SQL update or a stored procedure call to update the records that need it, one at a time.
The DataReader will give you the best performance; the SP should perform at least as well as a T-SQL update, if not better (but probably not by much).
EDIT: Avoid server-side cursors at (almost) any cost...they are true hogs.

Solving this without entering c# is actually the best option if performance is key.
Run your queries outside c#.
If it's really necessary use DataReaders.

I would not go with option B. In my experience using cursors is extremely slow.
C. Use a DataReader and update the records with an ExecuteNonQuery

How about option (C) A stored procedure that updates the table using set-based logic rather than a cursor:
...
update x set col = f(x)
from x
...

Depending on how the updates work you have a couple of options.
Have a computed column where the results are persisted. That way when the record changes it will be updated in one place.
Instead of running 25,000 update queries, just use a SQL bulk load.
My preference: have your app send the parameters to SQL Server on what to update. In this case I'd lean towards using a static cursor as it would be a bit faster, as long as one record doesn't necessarily affect the next one.

You can either:
Go with the 3 separate UPDATEs suggested by @Andrew
Pull the records into a Temporary Table and loop through them in batches of maybe 1000 records at a time in a WHILE loop for the UPDATE statement (so, 25 loops / UPDATEs)
Or, if you are using SQL Server 2008 (or newer) and the algorithm to determine the change is complex, you can pull the 25,000 rows into a collection on the .Net side and stream the changes back into a Proc that has a Table-Valued Parameter and do a single update. You can find an example of this at:
http://www.sqlservercentral.com/articles/SQL+Server+2008/66554/
In each case, you want to avoid 25,000 UPDATE calls.
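A minimal sketch of the Table-Valued Parameter option. The table type dbo.ResultUpdate, the procedure dbo.ApplyResultUpdates and the table dbo.Records are hypothetical names; the T-SQL they would need is shown in the comments as an assumption, not as something taken from the linked article.

```csharp
using System.Collections.Generic;
using System.Data;
using System.Data.SqlClient;

public static class TvpUpdater
{
    // Assumed server-side objects (not part of the original post):
    //   CREATE TYPE dbo.ResultUpdate AS TABLE (Id int PRIMARY KEY, Result int);
    //   CREATE PROCEDURE dbo.ApplyResultUpdates @Updates dbo.ResultUpdate READONLY AS
    //     UPDATE r SET r.Result = u.Result
    //     FROM dbo.Records AS r JOIN @Updates AS u ON u.Id = r.Id;

    // Builds the rows to send; column order and types must match dbo.ResultUpdate.
    public static DataTable BuildUpdates(IEnumerable<KeyValuePair<int, int>> idToResult)
    {
        var table = new DataTable();
        table.Columns.Add("Id", typeof(int));
        table.Columns.Add("Result", typeof(int));
        foreach (KeyValuePair<int, int> pair in idToResult)
        {
            table.Rows.Add(pair.Key, pair.Value);
        }
        return table;
    }

    // Sends all changes in a single call on an open connection.
    public static void Send(SqlConnection connection, DataTable updates)
    {
        using (var command = new SqlCommand("dbo.ApplyResultUpdates", connection))
        {
            command.CommandType = CommandType.StoredProcedure;
            SqlParameter parameter = command.Parameters.Add("@Updates", SqlDbType.Structured);
            parameter.TypeName = "dbo.ResultUpdate";
            parameter.Value = updates;
            command.ExecuteNonQuery();
        }
    }
}
```

The complex algorithm runs on the .NET side and produces the (Id, Result) pairs; the server then applies them with one set-based UPDATE.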

I had a similar situation. Actually, I had more than 10,000,000 records. The business logic was rather complex, and there was old code written purely in SQL. Managers told me that the old code took 15+ hours per 1,000,000 records. My solution took only 5 minutes, literally! I did this in a loop with 3 steps per iteration, where each iteration handled one batch of records:
Bulk load of invoices. I don't remember the batch size; I think it was a few thousand.
Performing the business logic on the loaded records.
Bulk insert. Because it was a bulk operation, it couldn't be an update. So it was a bulk insert into a temporary table with almost the same structure as the original table, followed by an update by key in the original table. The temporary table was emptied/deleted after every bulk insert. It is much faster than a standard update.
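The third step (bulk insert into a temporary table, then update by key) could look roughly like this. The dbo.Invoice table, its InvoiceId and Total columns and the StagedUpdate helper are assumptions for illustration; the answer does not show its schema.

```csharp
using System.Data;
using System.Data.SqlClient;

public static class StagedUpdate
{
    // Assumed schema: dbo.Invoice(InvoiceId int, Total decimal(18,2), ...).
    private const string CreateStagingSql =
        "CREATE TABLE #InvoiceStaging (InvoiceId int PRIMARY KEY, Total decimal(18,2));";

    private const string UpdateFromStagingSql = @"
UPDATE i SET i.Total = s.Total
FROM dbo.Invoice AS i
JOIN #InvoiceStaging AS s ON s.InvoiceId = i.InvoiceId;
DROP TABLE #InvoiceStaging;";

    // Creates an empty batch table whose columns match #InvoiceStaging by position.
    public static DataTable NewBatchTable()
    {
        var table = new DataTable();
        table.Columns.Add("InvoiceId", typeof(int));
        table.Columns.Add("Total", typeof(decimal));
        return table;
    }

    // Bulk copies one processed batch into the temp table, then updates the
    // real table by key. The connection must be open; all three steps use it
    // so that the session-scoped temp table stays visible.
    public static int Apply(SqlConnection connection, DataTable batch)
    {
        using (var create = new SqlCommand(CreateStagingSql, connection))
        {
            create.ExecuteNonQuery();
        }

        using (var bulk = new SqlBulkCopy(connection))
        {
            bulk.DestinationTableName = "#InvoiceStaging";
            bulk.WriteToServer(batch); // no mappings: columns are matched by position
        }

        using (var update = new SqlCommand(UpdateFromStagingSql, connection))
        {
            return update.ExecuteNonQuery(); // rows updated in dbo.Invoice
        }
    }
}
```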

C# + SQL Server - Fastest / Most Efficient way to read new rows into memory

I have an SQL Server 2008 Database and am using C# 4.0 with Linq to Entities classes setup for Database interaction.
There is a table which is indexed on a DateTime column whose value is the insertion time of the row. Several new rows are added per second (~20) and I need to pull them into memory efficiently so that I can display them in a GUI. For simplicity, let's just say I need to show the newest 50 rows in a list displayed via WPF.
I am concerned about the load polling may place on the database and about the time it will take to process new results, which could force me to become a slow consumer (getting stuck behind a backlog). I was hoping for some advice on an approach. The ones I'm considering are:
Poll the database in a tight loop (~1 result per query)
Poll the database every second (~20 results per query)
Create a database trigger for Inserts and tie it to an event in C# (SqlDependency)
I also have some options for access:
Linq-to-Entities Table Select
Raw SQL Query
Linq-to-Entities Stored Procedure
If you could shed some light on the pros and cons or suggest another way entirely I'd love to hear it.
The process which adds the rows to the table is not under my control, I wish only to read the rows never to modify or add. The most important things are to not overload the SQL Server, keep the GUI up to date and responsive and use as little memory as possible... you know, the basics ;)
Thanks!

I'm a little late to the party here, but if your edition of SQL Server 2008 supports it, a feature known as Change Data Capture may help. Basically, you have to enable this feature both for the database and for the specific tables you need to capture. The built-in Change Data Capture process looks at the transaction log to determine what changes have been made to the table and records them in a pre-defined table structure. You can then query this table or pull results from the table into something friendlier (perhaps on another server altogether?). We are in the early stages of using this feature for a particular business requirement, and it seems to be working quite well thus far.
You would have to test whether this feature would meet your needs as far as speed, but it may help maintenance since no triggers are required and the data capture does not tie up your database tables themselves.

Rather than polling the database, maybe you can use SQL Server Service Broker and perform the read from there, even pushing which rows are new. Then you can select from the table.
The most important thing I would see here is having an index on the way you identify new rows (a timestamp?). That way your query would select the top entries from the index instead of querying the table every time.
Test, test, test! Benchmark your performance for any tactic you want to try. The biggest issues to resolve are how the data is stored and any locking and consistency issues you need to deal with.

If your table is updated constantly with 20 rows a second, then there is nothing better to do than poll every second or every few seconds. As long as you have an efficient way (meaning an index or clustered index) to retrieve the last rows that were inserted, this method will consume the fewest resources.
If the updates occur in bursts of 20 updates per second but with significant periods of inactivity (minutes) in between, then you can use SqlDependency (which has absolutely nothing to do with triggers, by the way; read The Mysterious Notification to understand how it actually works). You can mix LINQ with SqlDependency, see linq2cache.
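A sketch of the polling approach with a "last seen" watermark, so each poll reads only rows newer than the previous one. The names dbo.Events, InsertedAt and Payload are assumptions, as is the column type (datetime2); NewRowPoller is a hypothetical helper, and the timer that calls Poll once a second is left out.

```csharp
using System;
using System.Collections.Generic;
using System.Data;
using System.Data.SqlClient;

public sealed class NewRowPoller
{
    // Assumed schema: dbo.Events(InsertedAt datetime2 [indexed], Payload nvarchar).
    private const string Sql = @"
SELECT TOP (@Max) InsertedAt, Payload
FROM dbo.Events
WHERE InsertedAt > @Since
ORDER BY InsertedAt;";

    private readonly int capacity;
    private readonly Queue<string> newest = new Queue<string>();

    public NewRowPoller(int capacity, DateTime startFrom)
    {
        this.capacity = capacity;
        LastSeen = startFrom;
    }

    // Highest insertion time seen so far; the next poll starts after it.
    public DateTime LastSeen { get; private set; }

    // The newest rows kept in memory, oldest first.
    public IEnumerable<string> Newest { get { return newest; } }

    // Records one row, keeps at most 'capacity' rows and advances the watermark.
    public void Accept(DateTime insertedAt, string payload)
    {
        newest.Enqueue(payload);
        if (newest.Count > capacity) newest.Dequeue();
        if (insertedAt > LastSeen) LastSeen = insertedAt;
    }

    // Reads up to maxRows rows newer than LastSeen; returns how many were read.
    public int Poll(SqlConnection connection, int maxRows)
    {
        int count = 0;
        using (var command = new SqlCommand(Sql, connection))
        {
            command.Parameters.Add("@Max", SqlDbType.Int).Value = maxRows;
            command.Parameters.Add("@Since", SqlDbType.DateTime2).Value = LastSeen;
            using (SqlDataReader reader = command.ExecuteReader())
            {
                while (reader.Read())
                {
                    Accept(reader.GetDateTime(0), reader.GetString(1));
                    count++;
                }
            }
        }
        return count;
    }
}
```

One caveat of a timestamp watermark: a row that is committed later but carries exactly the last seen timestamp would be skipped. If the table has an ever-increasing identity column, using that as the watermark instead may be safer.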

Do you have to query to be notified of new data?
You may be better off using push notifications from a service bus (e.g. NServiceBus).
Using notifications (i.e. events) is almost always a better solution than polling.

SQL Insert Performance Improvement

I am writing an application that logs status updates (GPS locations) from devices to a database. The updates occur at a set interval for each device, which is currently every 3 seconds. I'm using a simple table in SQL Server 2008 for storing each update.
I've noticed that running the inserts is an area of slowdown in my application. It's not a severe slowdown, but it is noticeable. Naturally, I'd like to write to the database as efficiently as possible. I have an idea to improve the performance and am looking for input and advice to see if it will help:
The status updates come in from an asynchronous Socket thread. In my current implementation, the database insert call is executed from this thread. I'm thinking I can create a queue for holding update data that the Socket thread can quickly add its update to and then go on its merry way. There would then be a separate thread whose sole responsibility would be checking the update queue and inserting the updates into the database.
Basically this whole process rests on the assumption that writing to the database from one location with a bunch of data all at once is more efficient than writing one row of data at a random time. Is my assumption correct, or way off base? Also, on the SQL side, is there a command to tell it to write a bunch of rows at once that would improve write performance?
This is how the database is being written to:
I'm using LinqToSQL in C#, so for each insert, I first create a DataContext instance. From the DataContext object I then call a stored procedure which inserts the location update.
The table is indexed by datetime, for the time of the update.

Have a look at the SqlBulkCopy class - this allows you to use BCP to insert chunks of data very quickly.
Also, make sure your indexes are efficient. If you have a clustered index on anything that does not increase sequentially (the way an incrementing integer or a date does), you will suffer performance slowdowns as pages fill up and have to be split.

Have you looked at MSMQ (Microsoft Message Queuing)? That seems to me an option worth a look.

Yes, inserting in batches will typically be faster than separate inserts, given your description. Each insert requires a connection to be set up and packets to be transferred. If a single small insert takes one packet and you issue three of them separately, that is three round trips; if the three inserts are small enough to fit in one packet together, batching them will help.
Quantifying it is difficult just based on your description - you'll need to do testing for that. For example, if you are keeping a dedicated connection open at all times anyway, as hova suggests, then you might see less of an impact.
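To make the batching idea concrete, here is a small sketch of the draining side of such a queue. BatchDrainer is a hypothetical helper, and BlockingCollection is one possible choice of queue (an assumption; the question only says "a queue"). The socket thread would call Add on the collection, and a single writer thread would call TakeBatch in a loop and write each non-empty batch in one round trip.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;

public static class BatchDrainer
{
    // Takes up to maxBatch items (maxBatch >= 1 is assumed): waits up to 'wait'
    // for the first item, then grabs whatever else is already queued without
    // waiting. Returns an empty list if nothing arrived in time.
    public static List<T> TakeBatch<T>(BlockingCollection<T> queue, int maxBatch, TimeSpan wait)
    {
        var batch = new List<T>();
        T item;
        if (queue.TryTake(out item, wait))
        {
            batch.Add(item);
            while (batch.Count < maxBatch && queue.TryTake(out item))
            {
                batch.Add(item);
            }
        }
        return batch;
    }
}
```

Because the batch is whatever has accumulated since the last write, it grows automatically under load and shrinks to single rows when the devices are quiet. Whether this beats one insert per update on a kept-open connection is something only a measurement on the real system can tell.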

Another area you might want to take a look at is whether you are setting up and tearing down a connection for each insert. That alone might make a performance improvement, negating the need for batching.
You'll also want to have as few indexes on the table as possible.

It sounds like a good idea. Why not give it a shot and see how it performs?

On the SQL side you'd want to have a look at making sure you are using parameterized queries.
Also batching your INSERT statements will certainly increase the performance.
Connection management is also key; of course, that depends on how the application is built and whether it depends on a connection being there.

Aren't you afraid of losing data while you are collecting it for the batch copy?
I'm writing an application that does the same. At the start I will have to write data from 3.5k GPS devices. Each device should send data once a minute, but it can send faster. The target number of devices is 10.5k.
I'm wondering about insert performance too. For now I'm saving received data to the DB on every packet using pure ADO.NET ICommand and a stored procedure. On my test server (Xeon 3.4 GHz and one 1 TB hard disk - a normal desktop ;) it takes 1 ms or less for now.
@GRIMUS - should I be worried if there are more devices?

Import Process maxing SQL memory

I have an importer process which runs as a Windows service (in debug mode, as an application). It processes various XML documents and CSVs and imports them into an SQL database. All was well until I had to process a large amount of data (120k rows) from another table (as I do the XML documents).
I am now finding that the SQL server's memory usage is hitting a point where it just hangs. My application never receives a time out from the server and everything just goes STOP.
I am still able to make calls to the database server separately but that application thread is just stuck with no obvious thread in SQL Activity Monitor and no activity in Profiler.
Any ideas on where to begin solving this problem would be greatly appreciated as we have been struggling with it for over a week now.
The basic architecture is C# 2.0 using NHibernate as an ORM. Data is pulled into the actual C# logic, processed, then spat back into the same database along with logs into other tables.
The only other problem, which sometimes happens instead, is that for some reason a cursor is opened on this massive table, which I can only assume is generated by ADO.NET. A statement like exec sp_cursorfetch 180153005,16,113602,100 is called thousands of times according to Profiler.

When are you COMMITting the data? Are there any locks or deadlocks (sp_who)? If 120,000 rows is considered large, how much RAM is SQL Server using? When the application hangs, is there anything about the point where it hangs (is it an INSERT, a lookup SELECT, or what?)?
It seems to me that that commit size is way too small. Usually in SSIS ETL tasks, I will use a batch size of 100,000 for narrow rows with sources over 1,000,000 in cardinality, but I never go below 10,000 even for very wide rows.
I would not use an ORM for large ETL, unless the transformations are extremely complex with a lot of business rules. Even still, with a large number of relatively simple business transforms, I would consider loading the data into simple staging tables and using T-SQL to do all the inserts, lookups etc.

Are you running this into SQL using BCP? If not, the transaction logs may not be able to keep up with your input. On a test machine, try switching the recovery model to Simple, or use the BCP methods to get the data in (bulk loads can be minimally logged).

Adding on to StingyJack's answer ...
If you're unable to use straight BCP due to processing requirements, have you considered performing the import against a separate SQL Server (separate box), using your tool, then running BCP?
The key to making this work would be keeping the staging machine clean -- that is, no data except the current working set. This should keep the RAM usage down enough to make the imports work, as you're not hitting tables with -- I presume -- millions of records. The end result would be a single view or table in this second database that could be easily BCP'ed over to the real one when all the processing is complete.
The downside is, of course, having another box ... And a much more complicated architecture. And it's all dependent on your schema, and whether or not that sort of thing could be supported easily ...
I've had to do this with some extremely large and complex imports of my own, and it's worked well in the past. Expensive, but effective.

I found out that it was NHibernate creating the cursor on the large table. I have yet to understand why, but in the meantime I have replaced the data access for the large table with straightforward ADO.NET calls.

Since you are rewriting it anyway, you may not be aware that you can call BCP directly from .NET via the System.Data.SqlClient.SqlBulkCopy class. See this article for some interesting performance info.
