Using SQL Server 2012, ASP.NET 4.6.2, and EF6. I have a table of URLs. Each URL has to go through a number of third-party processes via API calls, with the state reflected in that table. I'm planning to use scheduled background processes of some sort to kick those processes off. I've come up with a structure like:
Id int (PK)
SourceUrl varchar(200)
Process1Status int
Process2Status int
When rows go into the table, the status flags will be 0 for AwaitingProcessing; 1 would mean InProgress, and 2 would be Complete.
To ensure the overall processing is quicker, I want to run these two processes in parallel. In addition, there may be multiple instances of each of these background processors picking up urls from the queue.
I'm new to multi threaded processing though, so I'm a bit concerned that there will be some conflicting processing going on.
What I want to be able to do is ensure that no Process1 runner selects the same row as another Process1 runner, by ensuring that a Process1 runner takes only one item and flags it as currently in progress. I'd also like to ensure that when separate third-party services call back to notification URLs, no status update is lost if two processes attempt to update Process1Status and Process2Status at the same time.
I've seen two possible relevant answers: How can I lock a table on read, using Entity Framework?
and: Get "next" row from SQL Server database and flag it in single transaction
But I'm not much clearer about which route I should take for my needs. Could someone point me in the right direction? Am I on the right track?
If, by design, multiple actors need to access the same row of data, I would split the data to avoid this situation.
My first thought is to suggest building a UrlProcessStatus table with URLId, ProcessId, and Status columns. This way the workers can read/write their data independently.
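A minimal sketch of that split, with hypothetical names (UrlProcessStatus, UrlId, ProcessId) that aren't in the original design; the claim query uses UPDLOCK/READPAST so two workers for the same process never grab the same row:

```sql
-- One row per (url, process) pair, so workers never contend over a shared row.
CREATE TABLE dbo.UrlProcessStatus
(
    UrlId     INT NOT NULL,              -- FK to the Urls table
    ProcessId INT NOT NULL,              -- 1 = Process1, 2 = Process2
    Status    INT NOT NULL DEFAULT 0,    -- 0 = AwaitingProcessing, 1 = InProgress, 2 = Complete
    CONSTRAINT PK_UrlProcessStatus PRIMARY KEY (UrlId, ProcessId)
);

-- A worker claims one pending row atomically for its process.
-- READPAST skips rows locked by other workers; UPDLOCK holds the claimed row until commit.
UPDATE TOP (1) ups
SET    Status = 1                        -- InProgress
OUTPUT inserted.UrlId                    -- tells the worker which url it now owns
FROM   dbo.UrlProcessStatus AS ups WITH (UPDLOCK, READPAST)
WHERE  ups.ProcessId = @ProcessId
  AND  ups.Status = 0;
```

Because each process has its own row, callbacks updating Process1 and Process2 state touch different rows and cannot overwrite each other's status.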
I'm attempting to improve query performance for an application and I'm logically stuck.
So the application is proprietary and thus we're unable to alter application-side code. We have, however, received permission to work with the underlying database (surprisingly enough). The application calls a SQL Server database, so the current idea we're running with is to create a view with the same name as the table and rename the underlying table. When the application hits the view, the view calls one of two SQL CLR functions, which both do nothing more than call a web service we've put together. The web service performs all the logic, and contains an API call to an external, proprietary API that performs some additional logic and then returns the result.
This all works, however, we're having serious performance issues when scaling up to large data sets (100,000+ rows). The pretty clear source of this is the fact we're having to work on one row at a time with the web service, which includes the API call, which makes for a lot of latency overhead.
The obvious solution to this is to figure out a way to limit the number of times that the web service has to be hit per query, but this is where I'm stuck. I've read about a few different ways out there for potentially handling scenarios like this, but as a total database novice I'm having difficulty getting a grasp on what would be appropriate in this situation.
If there are any ideas/recommendations out there, I'd be very appreciative.
There are probably a few things to look at here:
Is your SQLCLR TVF streaming the results out (i.e. are you adding to a collection and then returning that collection at the end, or are you releasing each row as it is completed -- either with yield return or building out a full Enumerator)? If not streaming, then you should do this as it allows for the rows to be consumed immediately instead of waiting for the entire process to finish.
Since you are replacing a Table with a View that is sourced by a TVF, you are naturally going to have performance degradation since TVFs:
don't report their actual number of rows. T-SQL Multi-statement TVFs always appear to return 1 row, and SQLCLR TVFs always appear to return 1000 rows.
don't maintain column statistics. When selecting from a Table, SQL Server will automatically create statistics for columns referenced in WHERE and JOIN conditions.
Because of these two things, the Query Optimizer is not going to have an easy time generating an appropriate plan if the actual number of rows is 100k.
How many SELECTs, etc. are hitting this View concurrently? Since the View is hitting the same URI each time, you are bound by the concurrent connection limit imposed by ServicePointManager (ServicePointManager.DefaultConnectionLimit). And the default limit is a whopping 2! Meaning, all additional requests to that URI, while there are already 2 active/open HttpWebRequests, will wait in line, patiently. You can increase this by setting the .ServicePoint.ConnectionLimit property of the HttpWebRequest object.
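For illustration, raising that limit might look like this (a sketch; the URI and the value 20 are made up, not from the original post):

```csharp
using System.Net;

// Per-URI: raise the limit on the ServicePoint this request resolves to.
var request = (HttpWebRequest)WebRequest.Create("http://internal-service/lookup");
request.ServicePoint.ConnectionLimit = 20;   // default is 2

// Or globally, early at startup, before any ServicePoints are created:
ServicePointManager.DefaultConnectionLimit = 20;
```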
How often does the underlying data change? Since you switched to a View that doesn't take any parameters, you are always returning everything. This opens the door to doing some caching, and there are (at least) two options:
cache the data in the Web Service, and if it hasn't reached a particular time limit, return the cached data, else get fresh data, cache it, and return it.
go back to using a real Table. Create a SQL Server Agent job that will, every few minutes (or maybe longer if the data doesn't change that often): start a transaction, delete the current data, repopulate via the SQLCLR TVF, and commit the transaction. This requires that extra piece of the SQL Agent job, but you are then back to having more accurate statistics!!
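The refresh step of such a job might look roughly like this (object names are placeholders for whatever the real table and SQLCLR TVF are called):

```sql
BEGIN TRANSACTION;

DELETE FROM dbo.CachedResults;          -- the real table the app's view now reads

INSERT INTO dbo.CachedResults (Col1, Col2)
SELECT Col1, Col2
FROM   dbo.fn_CallWebService();         -- the SQLCLR TVF that hits the web service

COMMIT TRANSACTION;
```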
For more info on working with SQLCLR in general, please visit: SQLCLR Info
I am looking for some input on how to scale out a Windows Service that is currently running at my company. We are using .NET 4.0 (can and will be upgraded to 4.5 at some point in the future) and running this on Windows Server 2012.
About the service
The service's job is to query for new rows in a logging table (We're working with an Oracle database), process the information, create and/or update a bunch of rows in 5 other tables (let's call them Tracking tables), update the logging table and repeat.
The logging table has large amounts of XML (can go up to 20 MB per row) which needs to be selected and saved in the other 5 Tracking tables. New rows are added all the time at the maximum rate of 500,000 rows an hour.
The Tracking tables' traffic is much higher, ranging from 90,000 new rows in the smallest one to potentially millions of rows in the largest table, each hour. Not to mention that there are Update operations on those tables as well.
About the data being processed
I feel this bit is important for finding a solution based on how these objects are grouped and processed. The data structure looks like this:
public class Report
{
public long Id { get; set; }
public DateTime CreateTime { get; set; }
public Guid MessageId { get; set; }
public string XmlData { get; set; }
}
public class Message
{
public Guid Id { get; set; }
}
Report is the logging data I need to select and process
For every Message there are on average 5 Reports. This can vary between 1 to hundreds in some cases.
Message has a bunch of other collections and other relations, but they are irrelevant to the question.
Today the Windows Service we have barely manages the load on a 16-core server (I don't remember the full specs, but it's safe to say this machine is a beast). I have been tasked with finding a way to scale out and add more machines that will process all this data and not interfere with the other instances.
Currently each Message gets its own thread and handles the relevant Reports. We handle Reports in batches, grouped by their MessageId, to reduce the number of DB queries to a minimum when processing the data.
Limitations
At this stage I am allowed to re-write this service from scratch using any architecture I see fit.
Should an instance crash, the other instances need to be able to pick up where the crashed one left off. No data can be lost.
This processing needs to be as close to real-time as possible from the reports being inserted into the database.
I'm looking for any input or advice on how to build such a project. I assume the services will need to be stateless, or is there a way to synchronize caches for all the instances somehow? How should I coordinate between all the instances and make sure they're not processing the same data? How can I distribute the load equally between them? And of course, how do I handle an instance crashing and not completing its work?
For your work items, Windows Workflow is probably your quickest means to refactor your service.
Windows Workflow Foundation # MSDN
The most useful thing you'll get out of WF is workflow persistence, where a properly designed workflow may resume from a Persist point, should anything happen to the workflow from the last point at which it was saved.
Workflow Persistence # MSDN
This includes the ability for a workflow to be recovered from another process should any other process crash while processing the workflow. The resuming process doesn't need to be on the same machine if you use the shared workflow store. Note that all recoverable workflows require the use of the workflow store.
For work distribution, you have a couple options.
A service to produce messages combined with host-based load balancing via workflow invocation using WCF endpoints via the WorkflowService class. Note that you'll probably want to use the design-mode editor here to construct entry methods rather than manually set up Receive and corresponding SendReply handlers (these map to WCF methods). You would likely call the service for every Message, and perhaps also call the service for every Report. Note that the CanCreateInstance property is important here. Every invocation tied to it will create a running instance that runs independently.
~
WorkflowService Class (System.ServiceModel.Activities) # MSDN
Receive Class (System.ServiceModel.Activities) # MSDN
Receive.CanCreateInstance Property (System.ServiceModel.Activities) # MSDN
SendReply Class (System.ServiceModel.Activities) # MSDN
Use a service bus that has Queue support. At the minimum, you want something that potentially accepts input from any number of clients, and whose outputs may be uniquely identified and handled exactly once. A few that come to mind are NServiceBus, MSMQ, RabbitMQ, and ZeroMQ. Out of the items mentioned here, NServiceBus is exclusively .NET ready out-of-the-box. In a cloud context, your options also include platform-specific offerings such as Azure Service Bus and Amazon SQS.
~
NServiceBus
MSMQ # MSDN
RabbitMQ
ZeroMQ
Azure Service Bus # MSDN
Amazon SQS # Amazon AWS
~
Note that the service bus is just the glue between a producer that will initiate Messages and a consumer that can exist on any number of machines to read from the queue. Similarly, you can use this indirection for Report generation. Your consumer will create workflow instances that may then use workflow persistence.
Windows AppFabric may be used to host workflows, allowing you to use many techniques that apply to IIS load balancing to distribute your work. I don't personally have any experience with it, so there's not much I can say for it other than it has good monitoring support out-of-the-box.
~
How to: Host a Workflow Service with Windows App Fabric # MSDN
I solved this by coding all this scalability and redundancy stuff on my own. I will explain what I did and how I did it, should anyone ever need this.
I created a few processes in each instance to keep track of the others and know which records the particular instance can process. On start up, the instance would register in the database (if it's not already) in a table called Instances. This table has the following columns:
Id Number
MachineName Varchar2
LastActive Timestamp
IsMaster Number(1)
After registering and creating a row in this table (if the instance's MachineName wasn't found), the instance starts pinging this table every second in a separate thread, updating its LastActive column. Then it selects all the rows from this table and makes sure that the Master Instance (more on that later) is still alive, meaning that its LastActive time is within the last 10 seconds. If the master instance stopped responding, it will assume control and set itself as master. In the next iteration it will make sure that there is only one master (in case another instance decided to assume control simultaneously), and if not it will yield to the instance with the lowest Id.
What is the master instance?
The service's job is to scan a logging table and process that data so people can filter and read through it easily. I didn't state this in my question, but it might be relevant here. We have a bunch of ESB servers writing multiple records to the logging table per request, and my service's job is to keep track of them in near real-time. Since they're writing their logs asynchronously, I could potentially get a "finished processing request A" entry in the log before a "started processing request A" entry. So, I have some code that sorts those records and makes sure my service processes the data in the correct order. Because I needed to scale out this service, only one instance can run this logic, to avoid lots of unnecessary DB queries and possibly insane bugs.
This is where the Master Instance comes in. Only it executes this sorting logic and temporarily saves the log record Id's in another table called ReportAssignment. This table's job is to keep track of which records were processed and by whom. Once processing is complete, the record is deleted. The table looks like this:
RecordId Number
InstanceId Number Nullable
The master instance sorts the log entries and inserts their Ids here. All my service instances check this table at 1-second intervals for new records that aren't being processed by anyone, or that are being processed by an inactive instance, and that satisfy [record's Id] % [number of instances] == [index of current instance in a sorted array of all the active instances] (acquired during the pinging process). The query looks somewhat like this:
SELECT * FROM ReportAssignment
WHERE (InstanceId IS NULL OR InstanceId NOT IN (1, 2, 3)) -- 1, 2, 3 are the active instances
AND MOD(RecordId, 3) = 0 -- 0 is the index of the current instance in the list of active instances
Why do I need to do this?
The other two instances would query for RecordId % 3 == 1 and RecordId % 3 == 2.
RecordId % [instanceCount] == [indexOfCurrentInstance] ensures that the records are distributed evenly between all instances.
InstanceId NOT IN (1,2,3) allows the instances to take over records that were being processed by an instance that crashed, and not process the records of already active instances when a new instance is added.
Once an instance queries for these records, it will execute an update command, setting the InstanceId to its own and query the logging table for records with those Id's. When processing is complete, it deletes the records from ReportAssignment.
Overall I am very pleased with this. It scales nicely, ensures that no data is lost should an instance go down, and required nearly no alterations to the existing code we have.
What is the best way to read from a single table in SQL Server using multiple threads in C#, making sure that no record is read twice across different threads?
Thank you for your help in advance
Are you trying to read records from the table in parallel to speed up retrieving the data, or are you just worried about data corruption with threads accessing the same data?
Database Management Systems like MS SQL handle concurrency extremely well, so thread safety in that respect is not something you would have to be concerned with in your code if you have multiple threads reading the same table.
If you want to read data in parallel without any overlapping, you could run a SQL command with paging and just have each thread fetch a different page. You could have, say, 20 threads all read 20 different pages at once, and it would be guaranteed that they are not reading the same rows. Then you can concatenate the data. The larger the page size, the more of a performance boost you would get from each thread you create.
efficient way to implement paging
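On SQL Server 2012 and later, per-thread paging can be sketched with OFFSET/FETCH (table and column names here are placeholders; each thread passes its own page number):

```sql
SELECT Id, Payload
FROM   dbo.WorkTable
ORDER  BY Id
OFFSET (@PageNumber * @PageSize) ROWS
FETCH NEXT @PageSize ROWS ONLY;
```

Note that this only guarantees non-overlapping reads if the underlying data isn't changing between the threads' queries.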
Assuming a dependency on SQL Server, you could look at the SQL Server Service Broker features to provide queuing for you. One thing to keep in mind is that SQL Server Service Broker currently isn't available on SQL Azure, so if you had plans to move onto the Azure cloud, that could be a problem.
Anyway - with SQL Server Service Broker the concurrent access is managed at the database engine layer. Another way of doing it is having one thread that reads the database and then dispatches threads with the message as the input. That is slightly easier than trying to use transactions in the database to ensure that messages aren't read twice.
Like I said though, SQL Server Service Broker is probably the way to go. Or a proper external queuing mechanism.
Solution 1:
I am assuming that you are attempting to process or extract data from a large table. If I were assigned this task, I would first look at paging, if you are trying to split the work among threads, that is. So Thread 1 handles pages 0 to 10, Thread 2 handles pages 11 to 20, etc... Or you could batch rows using the actual row number. So in your stored proc you would do this:
WITH result_set AS (
    SELECT
        ROW_NUMBER() OVER (ORDER BY <ordering>) AS [row_number],
        x, y, z
    FROM
        [table]
    WHERE
        <search-clauses>
)
SELECT
    *
FROM
    result_set
WHERE
    [row_number] BETWEEN @IN_Thread_Row_Start AND @IN_Thread_Row_End;
Another choice, which would be more efficient, is to page using a natural key (or a darn good surrogate) and have each thread pass in the key parameters it is interested in rather than page numbers.
Immediate concerns with this solution would be:
ROW_NUMBER performance
CTE Performance (I believe they are stored in memory)
So if this was my problem to resolve I would look at paging via a key.
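A sketch of paging via a key, with placeholder names; each thread (or each iteration) passes in the last key it finished with, so the query is a cheap index seek rather than a full numbering of the result set:

```sql
SELECT TOP (@BatchSize) Id, Payload
FROM   dbo.WorkTable
WHERE  Id > @LastProcessedId
ORDER  BY Id;
```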
Solution 2:
The second solution would be to mark the rows as they are being processed, virtually locking them, that is if you have data-writer permission. So your table would have a field called Processed or Locked; as the rows are selected by your thread, they are updated with Locked = 1.
Then your select from other threads selects only rows that aren't locked. When your process is done and all rows are processed you could reset the lock.
Hard to say what will perform best without some trials... good luck!
This question is super old but still very relevant, and I spent a lot of time finding this solution, so I thought I'd post it for anyone else who happens along. This is very common when using a SQL table as a queue rather than MSMQ.
The solution (after a lot of investigation) is simple and can be tested by opening 2 tabs in SSMS, with each tab running its own transaction to simulate multiple processes/threads hitting the same table.
The quick answer is this: the key to this is using updlock and readpast hints on your selects.
To illustrate the reads working without duplication check out this simple example.
--on tab 1 in ssms
begin tran
SELECT TOP 1 ordno FROM table_queue WITH (updlock, readpast)
--on tab 2 in ssms
begin tran
SELECT TOP 1 ordno FROM table_queue WITH (updlock, readpast)
You will notice that the first selected record is locked and does not get duplicated by the select statement firing on the second tab/process.
Now in the real world you wouldn't just execute a select on your table like the simple example above. You would update your records as "isprocessing = 1" or something similar if you are using your table as a queue. The above code just demonstrates that this allows concurrent reads without duplication.
So in the real world (if you are using your table as a queue and processing that queue with multiple services, for instance) you would most likely execute your select as a subquery of an update statement.
Something like this.
begin tran
update table_queue set processing= 1 where myId in
(
SELECT TOP 50 myId FROM table_queue WITH (updlock, readpast)
)
commit tran
You may also combine your update statement with an OUTPUT clause so you have a list of all the ids that are now locked (processing = 1) and can work with them.
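For example, the batch above extended with an OUTPUT clause might look like this (a sketch of the idea, using the same placeholder names):

```sql
BEGIN TRAN;

UPDATE table_queue
SET    processing = 1
OUTPUT inserted.myId            -- the ids this worker now owns
WHERE  myId IN
(
    SELECT TOP 50 myId FROM table_queue WITH (updlock, readpast)
);

COMMIT TRAN;
```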
If you are processing data using a table as a queue, this will ensure you do not duplicate records in your select statements, without any need for paging or anything else.
This solution is being tested in an enterprise level application where we experienced a lot of duplication in our select statements when being monitored by many services running on many different boxes.
I have an SQL Server 2008 Database and am using C# 4.0 with Linq to Entities classes setup for Database interaction.
There exists a table which is indexed on a DateTime column where the value is the insertion time for the row. Several new rows are added a second (~20) and I need to effectively pull them into memory so that I can display them in a GUI. For simplicity lets just say I need to show the newest 50 rows in a list displayed via WPF.
I am concerned with the load polling may place on the database and the time it will take to process new results, forcing me to become a slow consumer (getting stuck behind a backlog). I was hoping for some advice on an approach. The ones I'm considering are:
Poll the database in a tight loop (~1 result per query)
Poll the database every second (~20 results per query)
Create a database trigger for Inserts and tie it to an event in C# (SqlDependency)
I also have some options for access:
Linq-to-Entities Table Select
Raw SQL Query
Linq-to-Entities Stored Procedure
If you could shed some light on the pros and cons or suggest another way entirely I'd love to hear it.
The process which adds the rows to the table is not under my control, I wish only to read the rows never to modify or add. The most important things are to not overload the SQL Server, keep the GUI up to date and responsive and use as little memory as possible... you know, the basics ;)
Thanks!
I'm a little late to the party here, but if you have the feature on your edition of SQL Server 2008, there is a feature known as Change Data Capture that may help. Basically, you have to enable this feature both for the database and for the specific tables you need to capture. The built-in Change Data Capture process looks at the transaction log to determine what changes have been made to the table and records them in a pre-defined table structure. You can then query this table or pull results from the table into something friendlier (perhaps on another server altogether?). We are in the early stages of using this feature for a particular business requirement, and it seems to be working quite well thus far.
You would have to test whether this feature would meet your needs as far as speed, but it may help maintenance since no triggers are required and the data capture does not tie up your database tables themselves.
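For reference, enabling CDC looks roughly like this (database, schema, and table names are placeholders; CDC requires Enterprise or Developer edition on SQL Server 2008):

```sql
USE MyDatabase;
EXEC sys.sp_cdc_enable_db;

EXEC sys.sp_cdc_enable_table
     @source_schema = N'dbo',
     @source_name   = N'MyTable',
     @role_name     = NULL;     -- NULL = no gating role required to read changes

-- Changes then accumulate in cdc.dbo_MyTable_CT and can be read via
-- cdc.fn_cdc_get_all_changes_dbo_MyTable(@from_lsn, @to_lsn, N'all').
```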
Rather than polling the database, maybe you can use the SQL Server Service broker and perform the read from there, even pushing which rows are new. Then you can select from the table.
The most important thing I would see here is having an index on the way you identify new rows (a timestamp?). That way your query would select the top entries from the index instead of querying the table every time.
Test, test, test! Benchmark your performance for any tactic you want to try. The biggest issues to resolve are how the data is stored and any locking and consistency issues you need to deal with.
If your table is updated constantly at 20 rows a second, then there is nothing better to do than poll every second or every few seconds. As long as you have an efficient way (meaning an index or clustered index) to retrieve the last rows that were inserted, this method will consume the fewest resources.
If the updates occur in bursts of 20 per second but with significant periods of inactivity (minutes) in between, then you can use SqlDependency (which has absolutely nothing to do with triggers, by the way; read The Mysterious Notification to understand how it actually works). You can mix LINQ with SqlDependency; see linq2cache.
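A rough SqlDependency sketch (the connection string, table, and RefreshGrid callback are placeholders; note that query notifications require a two-part table name and an explicit column list, so no SELECT *):

```csharp
using System.Data.SqlClient;

SqlDependency.Start(connectionString);   // once, at application startup

using (var conn = new SqlConnection(connectionString))
using (var cmd = new SqlCommand("SELECT Id, CreateTime FROM dbo.LogTable", conn))
{
    var dependency = new SqlDependency(cmd);
    dependency.OnChange += (sender, e) =>
    {
        // Fires once per subscription; re-run the query with a new
        // SqlDependency to keep receiving notifications.
        RefreshGrid();
    };

    conn.Open();
    using (var reader = cmd.ExecuteReader())
    {
        // read the current rows for the initial display...
    }
}
```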
Do you have to query to be notified of new data?
You may be better off using push notifications from a Service Bus (eg: NServiceBus).
Using notifications (i.e events) is almost always a better solution than using polling.
There is a small system where a database table is used as a queue on MSSQL 2005. Several applications write to this table, and one application reads and processes it in a FIFO manner.
I have to make it a little bit more advanced, to be able to create a distributed system where several processing applications can run. The result should be that 2-10 processing applications should be able to run and not interfere with each other during work.
My idea is to extend the queue table with a column showing that a process is already working on it. The processing application will first update the table with its identifier, and then ask for the updated records.
So something like this:
start transaction
update top(10) queue set processing = 'myid' where processing is null
select * from queue where processing = 'myid'
end transaction
After processing, it sets the processing column of the table to something else, like 'done', or whatever.
I have three questions about this approach.
First: can this work in this form?
Second: if it is working, is it effective? Do you have any other ideas to create such a distribution?
Third: In MSSQL the locking is row-based, but after a number of rows are locked, the lock is escalated to the whole table. So the second application cannot access it until the first application releases the transaction. How big can the selection (top x) be in order to not lock the whole table and only create row locks?
This will work, but you'll probably find you'll run into blocking or deadlocks where multiple processes try and read/update the same data. I wrote a procedure to do exactly this for one of our systems which uses some interesting locking semantics to ensure this type of thing runs with no blocking or deadlocks, described here.
This approach looks reasonable to me, and is similar to one I have used in the past - successfully.
Also, the row/ table will only be locked while the update and select operations take place, so I doubt the row vs table question is really a major consideration.
Unless the processing overhead of your app is so low as to be negligible, I'd keep the "top" value low - perhaps just 1. Of course that entirely depends on the details of your app.
Having said all that, I'm not a DBA, and so will also be interested in any more expert answers
In regard to your question about locking: you can use a locking hint to force it to lock only rows:
update mytable with (rowlock) set x=y where a=b
Biggest problem with this approach is that you increase the number of 'updates' to the table. Try this with just one process consuming (update + delete) and others inserting data in the table and you will find that at around a million records, it starts to crumble.
I would rather have one consumer for the DB and use message queues to deliver processing data to other consumers.