Reading from a single table in SQL Server using multiple threads - C#

What is the best way to read from a single table in SQL Server using multiple threads in C#, while making sure the same record is never read twice by different threads?
Thank you for your help in advance

Are you trying to read records from the table in parallel to speed up retrieving the data, or are you just worried about data corruption with threads accessing the same data?
Database management systems like MsSQL handle concurrency extremely well, so thread safety in that respect is not something you need to be concerned with in your code if you have multiple threads reading the same table.
If you want to read data in parallel without any overlap, you could run a SQL command with paging and have each thread fetch a different page. You could have, say, 20 threads each read one of 20 different pages at once, and it would be guaranteed that they are not reading the same rows. Then you can concatenate the data (see the sketch below). The larger the page size, the more performance benefit you get from each thread you create.
efficient way to implement paging
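As a rough sketch of the paged parallel read (the table, columns, and connection string are placeholder assumptions, and OFFSET/FETCH needs SQL Server 2012 or later; on older versions, use the ROW_NUMBER approach shown further down):

// Minimal sketch: each task reads a distinct page, so no row is read twice.
// dbo.MyTable, its columns, and ConnectionString are placeholder assumptions.
using System;
using System.Collections.Concurrent;
using System.Data.SqlClient;
using System.Threading.Tasks;

class PagedReader
{
    const string ConnectionString = "..."; // your connection string here
    const int PageSize = 1000;

    static void Main()
    {
        var results = new ConcurrentBag<string>();
        Parallel.For(0, 20, page =>   // 20 workers, 20 distinct pages
        {
            using (var conn = new SqlConnection(ConnectionString))
            using (var cmd = new SqlCommand(
                @"SELECT name FROM dbo.MyTable
                  ORDER BY id
                  OFFSET @offset ROWS FETCH NEXT @pageSize ROWS ONLY;", conn))
            {
                cmd.Parameters.AddWithValue("@offset", page * PageSize);
                cmd.Parameters.AddWithValue("@pageSize", PageSize);
                conn.Open();
                using (var reader = cmd.ExecuteReader())
                    while (reader.Read())
                        results.Add(reader.GetString(0));
            }
        });
        Console.WriteLine("Read {0} rows.", results.Count);
    }
}

One caveat: if rows are inserted while the pages are being read, page boundaries can shift, so this is only safe against a stable snapshot or a stable ordering key.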

Assuming a dependency on SQL Server, you could look at the SQL Server Service Broker features to provide queuing for you. One thing to keep in mind is that SQL Server Service Broker is not currently available on SQL Azure, so if you have plans to move onto the Azure cloud, that could be a problem.
Anyway, with SQL Server Service Broker the concurrent access is managed at the database engine layer. Another way of doing it is to have one thread read the database and then dispatch worker threads with the message as input. That is slightly easier than trying to use transactions in the database to ensure that messages aren't read twice.
Like I said though, SQL Server Service Broker is probably the way to go. Or a proper external queuing mechanism.
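For reference, a minimal sketch of dequeuing a Service Broker message from C# might look like this (the queue name dbo.TargetQueue is an assumption, and the conversation setup/teardown around it is omitted):

// Receive one Service Broker message, waiting up to 5 seconds.
// dbo.TargetQueue is a placeholder; END CONVERSATION handling is omitted.
using System.Data.SqlClient;

class BrokerReader
{
    public static byte[] ReceiveOne(string connectionString)
    {
        using (var conn = new SqlConnection(connectionString))
        using (var cmd = new SqlCommand(
            @"WAITFOR (
                  RECEIVE TOP (1) message_body
                  FROM dbo.TargetQueue
              ), TIMEOUT 5000;", conn))
        {
            conn.Open();
            object body = cmd.ExecuteScalar();   // null if the wait timed out
            return body as byte[];               // message_body is varbinary(max)
        }
    }
}

Because RECEIVE locks and removes the message inside the engine, two competing readers never see the same message.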

Solution 1:
I am assuming that you are attempting to process or extract data from a large table. If I were assigned this task, I would first look at paging (if you are trying to split work among threads, that is). So Thread 1 handles pages 0 to 10, Thread 2 handles pages 11 to 20, and so on... or you could batch rows using the actual row number. In your stored procedure you would do this:
WITH result_set AS (
    SELECT
        ROW_NUMBER() OVER (ORDER BY <ordering>) AS [row_number],
        x, y, z
    FROM
        table
    WHERE
        <search-clauses>
)
SELECT *
FROM result_set
WHERE [row_number] BETWEEN @IN_Thread_Row_Start AND @IN_Thread_Row_End;
Another choice, which would be more efficient, is to page using a natural key (or a darn good surrogate) if you have one, and have the thread pass in the key parameters rather than the page numbers or records it is interested in.
Immediate concerns with this solution would be:
ROW_NUMBER performance
CTE Performance (I believe they are stored in memory)
So if this were my problem to resolve, I would look at paging via a key, along the lines of the sketch below.
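A minimal sketch of that key-based ("seek") paging from C#; the table name, key column, and batch size are assumptions:

// Key-based paging: each call resumes after the last key seen, which avoids
// numbering the whole set with ROW_NUMBER. Names are placeholder assumptions.
using System.Collections.Generic;
using System.Data.SqlClient;

class KeysetPager
{
    public static List<long> ReadNextBatch(string connectionString, long lastKeySeen, int batchSize)
    {
        var keys = new List<long>();
        using (var conn = new SqlConnection(connectionString))
        using (var cmd = new SqlCommand(
            @"SELECT TOP (@batchSize) id
              FROM dbo.MyTable
              WHERE id > @lastKey
              ORDER BY id;", conn))
        {
            cmd.Parameters.AddWithValue("@batchSize", batchSize);
            cmd.Parameters.AddWithValue("@lastKey", lastKeySeen);
            conn.Open();
            using (var reader = cmd.ExecuteReader())
                while (reader.Read())
                    keys.Add(reader.GetInt64(0));
        }
        return keys;
    }
}

Each thread can own a disjoint key range (for example, pre-computed [start, end) ranges), so no two threads ever touch the same rows.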
Solution 2:
The second solution would be to mark the rows as they are being processed, virtually locking them (that is, if you have data writer permission). So your table would have a field called Processed or Locked; as the rows are selected by your thread, they are updated to Locked = 1.
Then the selects from your other threads pick up only rows that aren't locked. When your process is done and all rows are processed, you can reset the lock.
Hard to say what will perform best without some trials... GL

This question is super old but still very relevant, and I spent a lot of time finding this solution, so I thought I'd post it for anyone else who happens along. This is very common when using a SQL table as a queue rather than MSMQ.
The solution (after a lot of investigation) is simple and can be tested by opening two tabs in SSMS, with each tab running its own transaction, to simulate multiple processes/threads hitting the same table.
The quick answer is this: the key is using the updlock and readpast hints on your SELECTs.
To see the reads working without duplication, check out this simple example.
-- on tab 1 in SSMS
begin tran
SELECT TOP 1 ordno FROM table_queue WITH (updlock, readpast)

-- on tab 2 in SSMS
begin tran
SELECT TOP 1 ordno FROM table_queue WITH (updlock, readpast)
You will notice that the first selected record is locked and is not returned by the SELECT statement firing on the second tab/process.
Now in the real world you wouldn't just execute a SELECT on your table like the simple example above. You would update your records as isprocessing = 1 or something similar if you are using your table as a queue. The above code just demonstrates that the hints allow concurrent reads without duplication.
So in the real world (if you are using your table as a queue and processing it with multiple services, for instance) you would most likely execute your SELECT in a subquery of an UPDATE statement.
Something like this.
begin tran
update table_queue set processing= 1 where myId in
(
SELECT TOP 50 myId FROM table_queue WITH (updlock, readpast)
)
commit tran
You can also combine your UPDATE statement with the OUTPUT clause, so you get back a list of all the ids that are now locked (processing = 1) and can work with them.
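A sketch of that UPDATE ... OUTPUT combination executed from C#, using the same placeholder table and columns as the snippet above, with a processing = 0 filter added so rows already claimed by a committed transaction are skipped:

// Claim a batch and read back the claimed ids via the OUTPUT clause.
// A single UPDATE statement is atomic, so no explicit transaction is needed.
using System.Collections.Generic;
using System.Data.SqlClient;

class QueueClaimer
{
    public static List<int> ClaimBatch(string connectionString)
    {
        var claimed = new List<int>();
        using (var conn = new SqlConnection(connectionString))
        using (var cmd = new SqlCommand(
            @"UPDATE table_queue
              SET processing = 1
              OUTPUT inserted.myId
              WHERE myId IN
              (
                  SELECT TOP 50 myId FROM table_queue WITH (updlock, readpast)
                  WHERE processing = 0
              );", conn))
        {
            conn.Open();
            using (var reader = cmd.ExecuteReader())
                while (reader.Read())
                    claimed.Add(reader.GetInt32(0));
        }
        return claimed;
    }
}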
If you are processing data using a table as a queue, this will ensure you do not duplicate records in your SELECT statements, without any need for paging or anything else.
This solution has been tested in an enterprise-level application where we experienced a lot of duplication in our SELECT statements when the queue was monitored by many services running on many different boxes.

Related

Storing huge number of entities in SQL Server database

I have the following scenario: I am building a dummy web app that pulls betting odds every minute, stores all the events, matches, odds etc. to the database and then updates the UI.
I have this structure: Sports > Events > Matches > Bets > Odds and I am using code first approach and for all DB-related operations I am using EF.
When I am running my application for the very first time and my database is empty I am receiving XML with odds which contains: ~16 sports, ~145 events, ~675 matches, ~17100 bets & ~72824 odds.
Here comes the problem: how do I save all these entities in a timely manner? Parsing is not that time-consuming an operation (0.2 seconds), but when I try to bulk store all these entities I face memory problems, and the save takes more than a minute, so the next odds pull is triggered and it becomes a nightmare.
I saw somewhere that I could disable Configuration.AutoDetectChangesEnabled and recreate my context every 100/1000 records I insert, but I am not nearly there. Every suggestion will be appreciated. Thanks in advance.
When you are inserting huge (though yours is not that huge) amounts of data like that, try using SqlBulkCopy. You can also try using a table-valued parameter and passing it to a stored procedure, but I do not suggest it for this case, as TVPs perform well for under roughly 1000 records. SqlBulkCopy is super easy to use, which is a big plus.
If you need to update many records, you can use SqlBulkCopy for that as well, with a little trick: create a staging table, insert the data into the staging table using SqlBulkCopy, then call a stored procedure which gets the records from the staging table and updates the target table. I have used SqlBulkCopy for both cases numerous times and it works pretty well.
Furthermore, with SqlBulkCopy you can do the insertion in batches and provide feedback to the user; in your case I do not think you need that, but the flexibility is there.
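A minimal SqlBulkCopy sketch (the destination table and column names are placeholders to adapt to your odds schema):

// Bulk-insert a DataTable into dbo.Odds; far faster than row-by-row EF saves.
using System.Data;
using System.Data.SqlClient;

class OddsLoader
{
    public static void BulkInsertOdds(string connectionString, DataTable odds)
    {
        using (var bulk = new SqlBulkCopy(connectionString))
        {
            bulk.DestinationTableName = "dbo.Odds";
            bulk.BatchSize = 5000;                        // send rows in batches
            bulk.ColumnMappings.Add("BetId", "BetId");    // source -> destination
            bulk.ColumnMappings.Add("Value", "Value");
            bulk.WriteToServer(odds);
        }
    }
}

For the update case, point DestinationTableName at the staging table instead, then call your stored procedure to merge into the target table.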
Can I do it using EF only?
I have not tried but there is this library you can try.
I understand your situation, but everything you have been doing depends on your machine specs and the software itself. If the machine cannot handle the process, it may be time to change the plan, for example by limiting the number of records inserted per batch until it is all done.

C# + SQL Server - Fastest / Most Efficient way to read new rows into memory

I have a SQL Server 2008 database and am using C# 4.0 with LINQ to Entities classes set up for database interaction.
There is a table indexed on a DateTime column whose value is the insertion time for the row. Several new rows are added each second (~20) and I need to pull them into memory efficiently so that I can display them in a GUI. For simplicity, let's just say I need to show the newest 50 rows in a list displayed via WPF.
I am concerned about the load polling may place on the database, and about the time it will take to process new results forcing me to become a slow consumer (getting stuck behind a backlog). I was hoping for some advice on an approach. The ones I'm considering are:
Poll the database in a tight loop (~1 result per query)
Poll the database every second (~20 results per query)
Create a database trigger for Inserts and tie it to an event in C# (SqlDependency)
I also have some options for access:
Linq-to-Entities Table Select
Raw SQL Query
Linq-to-Entities Stored Procedure
If you could shed some light on the pros and cons or suggest another way entirely I'd love to hear it.
The process which adds the rows to the table is not under my control, I wish only to read the rows never to modify or add. The most important things are to not overload the SQL Server, keep the GUI up to date and responsive and use as little memory as possible... you know, the basics ;)
Thanks!
I'm a little late to the party here, but if you have the feature on your edition of SQL Server 2008, there is a feature known as Change Data Capture that may help. Basically, you have to enable this feature both for the database and for the specific tables you need to capture. The built-in Change Data Capture process looks at the transaction log to determine what changes have been made to the table and records them in a pre-defined table structure. You can then query this table or pull results from the table into something friendlier (perhaps on another server altogether?). We are in the early stages of using this feature for a particular business requirement, and it seems to be working quite well thus far.
You would have to test whether this feature would meet your needs as far as speed, but it may help maintenance since no triggers are required and the data capture does not tie up your database tables themselves.
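For reference, enabling CDC uses documented system procedures; here is a hedged sketch from C#, where the schema/table names are placeholders and CDC additionally requires a supporting edition plus SQL Server Agent running:

// Enable CDC on the database and on one table. After this, changes can be
// queried from the generated cdc.fn_cdc_get_all_changes_* function for the
// capture instance.
using System.Data.SqlClient;

class CdcSetup
{
    public static void EnableCdc(string connectionString)
    {
        using (var conn = new SqlConnection(connectionString))
        using (var cmd = new SqlCommand(
            @"EXEC sys.sp_cdc_enable_db;
              EXEC sys.sp_cdc_enable_table
                   @source_schema = N'dbo',
                   @source_name   = N'MyTable',
                   @role_name     = NULL;", conn))
        {
            conn.Open();
            cmd.ExecuteNonQuery();
        }
    }
}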
Rather than polling the database, maybe you can use SQL Server Service Broker and perform the read from there, even pushing which rows are new. Then you can select from the table.
The most important thing I would see here is having an index on the way you identify new rows (a timestamp?). That way your query would select the top entries from the index instead of querying the table every time.
Test, test, test! Benchmark your performance for any tactic you want to try. The biggest issues to resolve are how the data is stored and any locking and consistency issues you need to deal with.
If your table is updated constantly with 20 rows a second, then there is nothing better to do than poll every second or every few seconds. As long as you have an efficient way (meaning an index or clustered index) to retrieve the last rows inserted, this method will consume the fewest resources.
If the updates occur in bursts of 20 per second with significant periods of inactivity (minutes) in between, then you can use SqlDependency (which has absolutely nothing to do with triggers, by the way; read The Mysterious Notification to understand how it actually works). You can mix LINQ with SqlDependency; see linq2cache.
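A minimal SqlDependency sketch; the table and columns are placeholders, and note that query notifications require two-part table names and an explicit column list (no SELECT *):

// Subscribe to change notifications for a table and re-subscribe on each
// notification, since each SqlDependency fires only once.
using System.Data.SqlClient;

class NewRowWatcher
{
    readonly string _connectionString;

    public NewRowWatcher(string connectionString)
    {
        _connectionString = connectionString;
        SqlDependency.Start(_connectionString);   // once per application
    }

    public void Subscribe()
    {
        using (var conn = new SqlConnection(_connectionString))
        using (var cmd = new SqlCommand(
            "SELECT Id, InsertedAt FROM dbo.Readings", conn))
        {
            var dependency = new SqlDependency(cmd);
            dependency.OnChange += (sender, e) =>
            {
                // e.Info describes the change; reload the newest rows here,
                // then subscribe again for the next notification.
                Subscribe();
            };
            conn.Open();
            using (var reader = cmd.ExecuteReader())
                while (reader.Read()) { /* load current rows into the GUI model */ }
        }
    }
}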
Do you have to query to be notified of new data?
You may be better off using push notifications from a service bus (e.g. NServiceBus).
Using notifications (i.e. events) is almost always a better solution than polling.

Prevent insert data into table at the same time

I'm working on an online sales web site. I'm using C# 4.0 and SQL Server 2008, and I want to control and prevent users from simultaneously inserting into tables like dbo.Orders... How can I do that?
Inserts will not be a problem, but updates can be. The term that you need to research is database concurrency. There are four basic models you can implement, each with its own pros and cons. Some are better suited for certain situations, and there are hundreds of articles on the web on this subject.
Are you trying to solve this in C# code or in SQL? Because in SQL it's simple. If you add BEGIN TRAN at the beginning of the stored procedure and COMMIT at the end, and the statements inside take conflicting locks (for example via a TABLOCKX hint or a serializable isolation level), this will serialize the requests: if there are two inserts, they will be executed one after another. One thing to remember is that this is a blocking operation, i.e. the second insert won't start until the first one has finished (whether successfully or not).
In your Add method you can use locking with the C# lock keyword, which will allow one thread at a time (within a single process).
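A sketch of that, with the important caveat that the C# lock only serializes threads inside one process, not across web servers or separate applications:

// Serialize inserts within this process; cross-process coordination still
// needs database transactions or constraints. Names are placeholders.
using System;

public class OrderWriter
{
    private static readonly object InsertGate = new object();
    private readonly Action<int> _insertOrder;   // placeholder for your data access call

    public OrderWriter(Action<int> insertOrder) { _insertOrder = insertOrder; }

    public void Add(int orderId)
    {
        lock (InsertGate)   // one thread at a time in this process
        {
            _insertOrder(orderId);
        }
    }
}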

SQL Server and message queues

I'm trying to build a reliable message service, or at least that's how I would describe it.
Here's my problem: I have a table, I insert data into this table, and I have at least two applications which select data from this table. However, I need a reliable way to ensure the two different applications never select the same rows at the same time.
How would I go about writing a transaction or SELECT statement that is guaranteed not to select the same rows that other applications have recently selected?
I'm not an SQL Server expert, but I would expect something similar to this.
Select work from the table; this gives the application exclusive access to some rows. The application would then process those rows; some rows get deleted, some are returned to the database.
My main concern is that if the application fails to complete its processing, SQL Server would eventually time out and return the checked-out data to the application data pool.
Add a 'Being Processed' Flag to each record, which is set in an 'atomic' way.
There have been several questions and many answers here on SO quite similar to this:
How do I lock certain SQL rows while running a process on them?
Queues against Tables in messaging systems
What’s the best way of implementing a messaging queue table in mysql
Posting Message to MSMQ from SQL Server
This is quite an old question, and the author's problem is probably long gone by now, but there's a little SQL trick that solves it and can be useful to others: the readpast hint.
With readpast, a SQL SELECT will skip locked rows. What you want is to grab the first unlocked row and immediately lock it so no one else will get it:
select top 1 *
from messages with(readpast, updlock)
order by id
This will select and lock the first free message in the table, so no one else will be able to modify it. I'm actually using the same approach in a SQL-based message bus: http://code.google.com/p/nginn-messagebus/ and it works very smoothly (sometimes faster than MSMQ).
Take a look at SQL Server Service Broker, it is designed for this kind of problem.
If you need a queue, use a queue.
Add a "grabbed" column to the messages table. Everyone can try to grab a row with a single SQL statement:
UPDATE m
SET grabbed = 1
FROM messages m
WHERE id = @myid
  AND grabbed = 0
After this, if @@ROWCOUNT is 1, the row is yours. Otherwise, somebody else grabbed the row before you.
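From C#, the same check maps naturally onto the rows-affected count returned by ExecuteNonQuery (a sketch; names are placeholders):

// Try to grab one message row; ExecuteNonQuery returns the number of rows
// affected, playing the role of @@ROWCOUNT here.
using System.Data.SqlClient;

class MessageGrabber
{
    public static bool TryGrab(string connectionString, int messageId)
    {
        using (var conn = new SqlConnection(connectionString))
        using (var cmd = new SqlCommand(
            @"UPDATE messages SET grabbed = 1
              WHERE id = @myid AND grabbed = 0;", conn))
        {
            cmd.Parameters.AddWithValue("@myid", messageId);
            conn.Open();
            return cmd.ExecuteNonQuery() == 1;   // 1 => the row is yours
        }
    }
}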
SQL Server Service Broker, MSMQ, or MQSeries introduce massive complexity and maintenance overhead. Be sure the operations guy who will be receiving your stuff agrees before you start using them.

Parallel processing of database queue

There is a small system where a database table serves as a queue, on MSSQL 2005. Several applications write to this table, and one application reads from it and processes it in a FIFO manner.
I have to make it a little more advanced, so that I can create a distributed system where several processing applications run. The result should be that 2-10 processing applications can run without interfering with each other during work.
My idea is to extend the queue table with a column showing that a process is already working on a row. The processing application will first update the table with its identifier, and then ask for the updated records.
So something like this:
start transaction
update top(10) queue set processing = 'myid' where processing is null
select * from queue where processing = 'myid'
end transaction
After processing, it sets the processing column of the table to something else, like 'done', or whatever.
I have three questions about this approach.
First: can this work in this form?
Second: if it works, is it efficient? Do you have any other ideas for creating such a distribution?
Third: in MSSQL the locking is row-based, but after a certain number of rows are locked, the lock is escalated to the whole table, so a second application cannot access it until the first application releases the transaction. How big can the selection (top x) be so as not to lock the whole table, but only create row locks?
This will work, but you'll probably find you run into blocking or deadlocks where multiple processes try to read/update the same data. I wrote a procedure to do exactly this for one of our systems, which uses some interesting locking semantics to ensure this type of thing runs with no blocking or deadlocks, described here.
This approach looks reasonable to me, and is similar to one I have used in the past, successfully.
Also, the rows/table will only be locked while the update and select operations take place, so I doubt the row-vs-table question is really a major consideration.
Unless the processing overhead of your app is so low as to be negligible, I'd keep the "top" value low, perhaps just 1. Of course that entirely depends on the details of your app.
Having said all that, I'm not a DBA, so I will also be interested in any more expert answers.
Regarding your question about locking: you can use a locking hint to force it to lock only rows:
update mytable with (rowlock) set x=y where a=b
The biggest problem with this approach is that you increase the number of updates to the table. Try this with just one process consuming (update + delete) and others inserting data into the table, and you will find that at around a million records it starts to crumble.
I would rather have one consumer for the DB and use message queues to deliver processing data to other consumers.
