SQL Insert Performance Improvement - c#

I am writing an application that logs status updates (GPS locations) from devices to a database. The updates occur at a set interval for each device, which is currently every 3 seconds. I'm using a simple table in SQL Server 08 for storing each update.
I've noticed that running the inserts is an area of slow down in my application. Its not a severe slow down, but noticable. Naturally, I'd like to write to the database in as an efficient way as possible. I have an idea to improve the performance and am looking for input and advice to see if it will help:
The status updates come in from an asynchronous Socket thread. In my current implementation, the database insert call is executed from this thread. I'm thinking I can create a queue for holding update data that the Socket thread can quickly add its update to and then go on its merry way. There would then be a separate thread whose sole responsibility would be checking the update queue and inserting the updates into the database.
Basically this whole process rests on the assumption that writing to the database from one location with a bunch of data all at once is more efficient than writing one row of data at a random time. Is my assumption correct, or way off base? Also, on the SQL side, is there a command to tell it to write a bunch of rows at once that would improve write performance?
This is how the database is being written to:
I'm using LinqToSQL in C#, so for each insert, I first create a DataContext instance. From the DataContext object I then call a stored procedure which inserts the location update.
The table is indexed by datetime, for the time of the update.

Have a look at the SqlBulkCopy class - this allows you to use BCP to insert chunks of data very quickly.
Also, make sure your indexes are efficient. If you have a clustered index on anything that does not increase sequentially (integer, date) then you will suffer performance slowdowns as the pages are filled up.

Have you looked MSMQ ( Microsoft Message Queuing (MSMQ)) ? That seems to me an option to take a look.

Yes, inserting in batches will typically be faster than separate inserts given your description. Each insert will require a connection to be set up and packets to be transferred. If you have a single small insert that takes one packet and you issue three of those, but you alternatively have three inserts that are small enough that they can all fit in one packet then it will help.
Quantifying it is difficult just based on your description - you'll need to do testing for that. For example, if you are keeping a dedicated connection open at all times anyway, as hova suggests, then you might see less of an impact.

Another area you might want to take a look at is whether you are setting up and tearing down a connection for each insert. That alone might make a performance improvement, negating the need for batching.
You'll also want to have as few indexes on the table as possible.

It sounds like a good idea. Why not give it a shot and see how it performs?

On the SQL side you'd want to have a look at making sure you are using parameterized queries.
Also batching your INSERT statements will certainly increase the performance.
Connection management is also key, of course that depends on how the application is built and whether it depends on a connection being there.

Are you not afraid to loose data while are you collecting data to batch copy?
I'm writing application doing the same. At start I will have to write data from 3,5k GPS devices. One device should send data each minute but it can send faster. Destination number of devices is 10,5k.
I'm wondering about inserting performance too. For now I'm saving received data to db on every packet using pure ADO.NET ICommand and stored procedure. On my test serwer (Xeon 3,4GHz and one 1TB hard disk - normal desktop ;) it takes for now 1ms or less.
#GRIMUS - should I wondering if there will be more devices?

Related

Uploading a Log (.txt file) vs inserting 1 record per event on database, Efficiency on recording Logs

I'm trying to record log files on my database. My question is which has the less load on making logs on the database. I'm thinking of storing long term log files ,maybe 3-5 years maximum, for an Inventory Program.
Process: I'll be using a barcode scanner.
After scanning a barcode, I'll get all the details of who is logged in, date and time, product details then saved per piece.
I came up with two ideas.
After the scanning event, It will be saved on a DataTable then after finishing a batch.. DataTable will be written on a *.txt file and then uploaded to my database.
After every scanned barcode, an INSERT query will be executed. I suspect this option will be heavy on the server side since I'm not the only one using this server
What are the pros and cons of the two options?
Are there more efficient ways of storing logs?
Based on your use case, I also think you need to consider at least 2 additional factors, the first being how important is it that the scanned item is logged in the database immediately. If you need the scanned item to be logged because you'll be checking to see if its been scanned, for example to prevent other scans, then doing a single insert is probably a very good idea. The second thing to consider is will you ever need to "unscan" an item, and at which part of the process? If the person scanning needs the ability to revert the scan immediately, it might be a good idea to wait until theyre done all their scannings before dumping the data to the database, as this will let you avoid ever having to delete from the table.
Overall I wouldnt worry too much about what the database can handle, sql-server is very good at handling simultaneous single inserts into a table thats designed for that use case. If youre only going to be inserting new data to the end of the table, and not updating or deleting existing records, performance is going to scale very well. The same goes for larger batch inserts, theyre very efficient no matter how many rows you want to bring in, assuming your table is designed for that purpose.
So overall I would probably pick the more efficient solution from the application side for your specific use case, and then once you have decided that, you can shape the database around the code, rather than trying to shape your code around suspected limitations of the database.
What are the pros and cons of the two options?
Basically your question is which way is more efficient (bulk insert or multiple single insert)?
The answers is always depends and always be situation based. So unfortunately, I don't think there's a right answer for you
The way you structure the log table.
If you choose bulk insert, how many rows do you want to insert at 1 time?
Is it read-only table? And if you want to read from it, how often do you do the read?
Do you need to scale it up?
etc...
Are there more efficient ways of storing logs?
There're some possible ways to improve I can think of (not all of them can work together)
If you go with the first option, maybe you can schedule the insert to non-peak hours
If you go with the first option, chunk the log files and do the insert
Use another database to do the logging
If you go with the second option, do some load testing
Personally, I prefer to go with second option if the project is small to medium size and the logging is critical part of the project.
hope it helps.
Go with the second option, and use transactions. This way the data will not be sent to the db until you call the transaction to complete. (Which can be scheduled.) This will also prevent broken data from getting into your database when a crash or something occurs.
Transactions in .net
Transaction Tutorial in C#

Scaling C# database application

I apologize if this question is a bit nebulous.
I am writing a C# application which does data manipulation against a SQL Server database. For a group of items, I read data for each item, do calculations on the data, then write the results to the database.
The problem I am having is that the application starts to slow down relative to the time it takes to process each item when the number of items to be processed increases.
I am trying to be very careful as far as freeing memory for allocated objects as I am through with them. I want to have nothing hanging around from the processing of one item when I start the processing for the next item. I make use of "using" structures for data tables and the BulkCopy class to try to force memory cleanup.
Yet, I start to get geometrically increasing run times per item the more items I try to process in one invocation of the program.
My program is a WinForms app. I don't seem to be eating up the server's memory with what I am doing. I am trying to make the processing of each item isolated from the processing of all other items, to make sure it would not matter how many items I process in each invocation of the application.
Has anyone seen this behavior in their applications and know what to look for to correct this?
A couple of things you can be watchful for - if you're using "using" statements - are you making sure that you're not keeping your connection open while manipulating your objects? Best to make sure you get your data from the database, close the connection, do your manipulation and then send the data back.
Are you using stored procedures for fetching/sending complex objects? You can also experiment with doing some of you data manipulation inside of the stored procedure or in functions called from them - you do NOT want to offload your entire business classes to the database, but you can do some of it there, depending on what you're doing.
Make sure your data structure is optimized as well (primary key indices, foreign keys, triggers etc. you can get some scripts from http://www.brentozar.com/first-aid/ to check the optimization of your database.
As mentioned above, try using some parallel/asynchronous patterns to divy up your work - await/async is very helpful for this, especially if you want to have calculations while also sending previous data back to the server.
Thanks for all the input. I checked the issues of opening/closing connections, etc. to see that I was being tidy. The thing that really helped was removing the primary keys on the destination data table. These were setup relative to what an end user would require, but they really gummed up the speed of data inserts. A heads up to folks to think about database constraints for updating data vs. using the data.
Also, found performance issues in selecting with a filter from an in memory DataTable. Somehow what I was doing get bogged down with a larger number of rows (30,000). I realized that I was mishandling the data and did not really need to do this. But it did show me the need to micro-test each step of my logic when trying to drag so much data around.

update sql server rows, while reading the same table

I have a database in SQL Server 2012 and want to update a table in it.
My table has three columns, the first column is of type nchar(24). It is filled with billion of rows. The other two columns are from the same type, but they are null (empty) at this moment.
I need to read the data from the first column, with this information I do some calculations. The result of my calculations are two strings, this two strings are the data I want to insert into the two empty columns.
My question is what is the fastest way to read the information from the first column of the table and update the second and third column.
Read and update step by step? Read a few rows, do the calculation, update the rows while reading the next few rows?
As it comes to billion of rows, performance is the only important thing here.
Let me know if you need any more information!
EDIT 1:
My calculation canĀ“t be expressed in SQL.
As the SQL server is on the local machine, the througput is nothing we have to be worried about. One calculation take about 0.02154 seconds, I have a total number of 2.809.475.760 rows this is about 280 GB of data.
Normally, DML is best performed in bigger batches. Depending on your indexing structure, a small batch size (maybe 1000?!) can already deliver the best results, or you might need bigger batch sizes (up to the point where you write all rows of the table in one statement).
Bulk updates can be performed by bulk-inserting information about the updates you want to make, and then updating all rows in the batch in one statement. Alternative strategies exist.
As you can't hold all rows to be updated in memory at the same time you probably need to look into MARS to be able to perform streaming reads while writing occasionally at the same time. Or, you can do it with two connections. Be careful to not deadlock across connections. SQL Server cannot detect that by principle. Only a timeout will resolve such a (distributed) deadlock. Making the reader run under snapshot isolation is a good strategy here. Snapshot isolation causes reader to not block or be blocked.
Linq is pretty efficient from my experiences. I wouldn't worry too much about optimizing your code yet. In fact that is typically something you should avoid is prematurely optimizing your code, just get it to work first then refactor as needed. As a side note, I once tested a stored procedure against a Linq query, and Linq won (to my amazement)
There is no simple how and a one-solution-fits all here.
If there are billions of rows, does performance matter? It doesn't seem to me that it has to be done within a second.
What is the expected throughput of the database and network. If your behind a POTS dial-in link the case is massively different when on 10Gb fiber.
The computations? How expensive are they? Just c=a+b or heavy processing of other text files.
Just a couple of questions raised in response. As such there is a lot more involved that we are not aware of to answer correctly.
Try a couple of things and measure it.
As a general rule: Writing to a database can be improved by batching instead of single updates.
Using a async pattern can free up some of the time for calculations instead of waiting.
EDIT in reply to comment
If calculations take 20ms biggest problem is IO. Multithreading won't bring you much.
Read the records in sequence using snapshot isolation so it's not hampered by write locks and update in batches. My guess is that the reader stays ahead of the writer without much trouble, reading in batches adds complexity without gaining much.
Find the sweet spot for the right batchsize by experimenting.

C# + SQL Server - Fastest / Most Efficient way to read new rows into memory

I have an SQL Server 2008 Database and am using C# 4.0 with Linq to Entities classes setup for Database interaction.
There exists a table which is indexed on a DateTime column where the value is the insertion time for the row. Several new rows are added a second (~20) and I need to effectively pull them into memory so that I can display them in a GUI. For simplicity lets just say I need to show the newest 50 rows in a list displayed via WPF.
I am concerned with the load polling may place on the database and the time it will take to process new results forcing me to become a slow consumer (Getting stuck behind a backlog). I was hoping for some advice on an approach. The ones I'm considering are;
Poll the database in a tight loop (~1 result per query)
Poll the database every second (~20 results per query)
Create a database trigger for Inserts and tie it to an event in C# (SqlDependency)
I also have some options for access;
Linq-to-Entities Table Select
Raw SQL Query
Linq-to-Entities Stored Procedure
If you could shed some light on the pros and cons or suggest another way entirely I'd love to hear it.
The process which adds the rows to the table is not under my control, I wish only to read the rows never to modify or add. The most important things are to not overload the SQL Server, keep the GUI up to date and responsive and use as little memory as possible... you know, the basics ;)
Thanks!
I'm a little late to the party here, but if you have the feature on your edition of SQL Server 2008, there is a feature known as Change Data Capture that may help. Basically, you have to enable this feature both for the database and for the specific tables you need to capture. The built-in Change Data Capture process looks at the transaction log to determine what changes have been made to the table and records them in a pre-defined table structure. You can then query this table or pull results from the table into something friendlier (perhaps on another server altogether?). We are in the early stages of using this feature for a particular business requirement, and it seems to be working quite well thus far.
You would have to test whether this feature would meet your needs as far as speed, but it may help maintenance since no triggers are required and the data capture does not tie up your database tables themselves.
Rather than polling the database, maybe you can use the SQL Server Service broker and perform the read from there, even pushing which rows are new. Then you can select from the table.
The most important thing I would see here is having an index on the way you identify new rows (a timestamp?). That way your query would select the top entries from the index instead of querying the table every time.
Test, test, test! Benchmark your performance for any tactic you want to try. The biggest issues to resolve are how the data is stored and any locking and consistency issues you need to deal with.
If you table is updated constantly with 20 rows a second, then there is nothing better to do that pull every second or every few seconds. As long as you have an efficient way (meaning an index or clustered index) that can retrieve the last rows that were inserted, this method will consume the fewest resources.
IF the updates occur in burst of 20 updates per second but with significant periods of inactivity (minutes) in between, then you can use SqlDependency (which has absolutely nothing to do with triggers, by the way, read The Mysterious Notification for to udneratand how it actually works). You can mix LINQ with SqlDependency, see linq2cache.
Do you have to query to be notified of new data?
You may be better off using push notifications from a Service Bus (eg: NServiceBus).
Using notifications (i.e events) is almost always a better solution than using polling.

Any good method to work with large amount of data?

I have almost 100.000 records in the database and I need to compare them to each-other with the Longest Common Subsequence algorithm, and I need to do that with 1000 new records every day.
My application is written in c# .Net, and the problem is that this comparing is working slow on the application level, for comparing of 1000 records are needed more than 10 hours.
So does anyone knows how much faster will this go if I wrote this algorithm in Stored procedure in SQL, or is there any other way?
You might want to try and write a stored proc in C# if you are using SQL server 2005 or 2008. This might scale better in the long run as you get more and more records and can't keep them all in memory.
Check out the MSDN Introduction to SQL Server CLR Integration.
This will use more CPU on your DB server, but you don't have to transfer data back and forth.
If you have 'just' 100.000 records. Just collect them all when your app starts. Do your algorithms in memory, and store any results/alterations to the db when you finish.
It'll be much faster
I'm not sure TSQL will allow you the same flexibility as C# allows you, especially when you deal with complex algorithms like LCS. Store all needed records in memory and deal with them from there.
Now most important thing is that you can think out of box for a minute and go for other approach, try to insert flags(ranking) of some kind once new item is inserted. Noone can advice you here since you haven't provided use with little bit of data what are you doing and what are you comparing. Probably you can ease on process with some ranking made during new item insertion. I don't mean to make full comparison once new item added but to trigger event like every hour or so you update table without user input.
Its true that, stored procedure works faster than LinQ or View. That is the way, to collect your data fast.
How do you determine that two of your records follow on from each other (i.e. that they're part of a sub-sequence)? Maybe you don't need to compare the whole 1MB of each record and could speed things up by only analysing some portion of that?
Sounds to me like your algorithm's flawed or that a DB might not be the best way of storing your data if it's taking 2 seconds to compare each record?

Categories

Resources