I am faced with a task, where I have to design a web application in .net framework. In this application users will only (99% of the time) have readonly access as they will just see data (SELECT).
However the backend database is going to be the beast where every minute there will be records updated / inserted / deleted. The projection is that at very minimum there will be about 10 million records added to system in a year, in less than 5 tables collectively.
Question/Request 1:
As these updates/inserts will happen very frequently (every minute or 2 the latest) I was hoping to get some tips so that when some rows are being changed a select query may not cause a thread deadlock or vice-versa.
Question/Request 2:
My calculated guess is that in normal situations, only few hundred records will be inserted every minute 24/7 (and updated / deleted based on some conditions). If I write a C# tool which will get data from any number of sources (xml, csv or direct from some tables from a remote db, a configuration file or registry setting will dictate which format the data is being imported from) and then do the insert / update / deleted, will this be fast enough and/or will cause deadlock issues?
I hope my questions are elaborate enough... Please let me know if this is all vague...
Thanks in advance
I will answer first your question #2: According with the scenario you've described, it will be fast enough. But remember to issue direct sql commands to the database. I personally have a very, very, similar scenario and the application runs without any problem, but when a scheduled job executes multiples inserts/deletes with a tool (like nhibernate) deadlocks do occurs. So, again, if prossible, execute direct sql statements.
Question #1: You can use "SELECT WITH NOLOCK".
For example:
SELECT * FROM table_sales WITH (NOLOCK)
It avoids blocks on database. But you have to remember that you might be reading an outdated info (once again, in the scenario you've described probably it will not be a problem).
You can also try "READ COMMITTED SNAPSHOT", it was supported since the 2005 version, but for this example I will keep it simple. Search a little about it to decide wich one may be the best choice for you.
Related
Given a data set where you will only know the relevant fields at run time. I want to select each row of data for analysis through a loop. Is it better to:
run a direct sql query to get the row each time by directly opening and closing the database
pull all the applicable rows into a datatable before the loop then selecting them through linq from inside the loop
For example, I am reading in a file that says look for rows a b and c then my query becomes "SELECT col1, col2, col3 FROM table WHERE col1 = 'a' or col1 = 'b' or col1= 'c'"
But I dont know it will be a,b,c during compile time, only after i run the program
Thanks
edit: better in terms or speed and best practice
depending on how long your analysis takes, the underlying transactional locking (if any), and the resources blocked by holding your result set on the db server it is either 1 or 2 ... but for me 2 seems rather unlikely (gigantic resultsets that are open for long periods of time and would eat up the memory on your DB system, which alone would suggest that you should rethink your whole data processing workflow)... i'd just build the SQL at runtime, which is even possible using linq directly against your DB (see "Entity Framework") and only if I would encounter serious performance problems once that is running, i'd refactor...
StackOverflow isn't really for "better" type questions because they're usually highly subjective/opinion based/lead to arguments.. We are however, free to decide whether subjective questions should be answered and this is one of those that (in my opinion) can
I'd assert that in 99% of cases you'd be better off leaving data in a database and querying it using the SQL functionality built into the database. The reason is that databases are purpose built for storing and querying data. It is somewhat ludicrous to conclude that it's a good use of resources to transfer 100 million rows across a network, into a poorly designed data storage container and then naively searching through it using a loop-in-a-loop (which is what linq in a loop would be) when you could leave it in the purpose-built, well indexed, high powered enterprise level software (on enterprise grade hardware) where it currently resides, and ask for retrieval of a tiny subset of those records to be transferred over a (slow) network link into a limited power client
I have the following scenario: I am building a dummy web app that pulls betting odds every minute, stores all the events, matches, odds etc. to the database and then updates the UI.
I have this structure: Sports > Events > Matches > Bets > Odds and I am using code first approach and for all DB-related operations I am using EF.
When I am running my application for the very first time and my database is empty I am receiving XML with odds which contains: ~16 sports, ~145 events, ~675 matches, ~17100 bets & ~72824 odds.
Here comes the problem: how to save all this entities in timely manner? Parsing is not that time consuming operation - 0.2 seconds, but when I try to bulk store all these entities I face memory problems and the save took more than 1 minute so next odd pull is triggered and this is nightmare.
I saw somewhere to disable the Configuration.AutoDetectChangesEnabled and recreate my context on every 100/1000 records I insert, but I am not nearly there. Every suggestion will be appreciated. Thanks in advance
When you are inserting huge (though it is not that huge) amounts of data like that, try using SqlBulkCopy. You can also try using Table Value Parameter and pass it to a stored procedure but I do not suggest it for this case as TVPs perform well for records under 1000. SqlBulkCopy is super easy to use which is a big plus.
If you need to do an update to many records, you can use SqlBulkCopy for that as well but with a little trick. Create a staging table and insert the data using SqlBulkCopy into the staging table, then call a stored procedure which will get records from the staging table and update the target table. I have used SqlBulkCopy for both cases numerous times and it works pretty well.
Furthermore, with SqlBulkCopy you can do the insertion in batches as well and provide feedback to the user, however, in your case, I do not think you need to do that. But nonetheless, this flexibility is there.
Can I do it using EF only?
I have not tried but there is this library you can try.
I understand your situation but:
All actions you've been doing it all depends on your machine specs and
the software itself.
Now if machine specs cannot handle the process it will be the time to
change a plan like to limit the count of records to be inserted till
it all to be done.
I'm trying to record log files on my database. My question is which has the less load on making logs on the database. I'm thinking of storing long term log files ,maybe 3-5 years maximum, for an Inventory Program.
Process: I'll be using a barcode scanner.
After scanning a barcode, I'll get all the details of who is logged in, date and time, product details then saved per piece.
I came up with two ideas.
After the scanning event, It will be saved on a DataTable then after finishing a batch.. DataTable will be written on a *.txt file and then uploaded to my database.
After every scanned barcode, an INSERT query will be executed. I suspect this option will be heavy on the server side since I'm not the only one using this server
What are the pros and cons of the two options?
Are there more efficient ways of storing logs?
Based on your use case, I also think you need to consider at least 2 additional factors, the first being how important is it that the scanned item is logged in the database immediately. If you need the scanned item to be logged because you'll be checking to see if its been scanned, for example to prevent other scans, then doing a single insert is probably a very good idea. The second thing to consider is will you ever need to "unscan" an item, and at which part of the process? If the person scanning needs the ability to revert the scan immediately, it might be a good idea to wait until theyre done all their scannings before dumping the data to the database, as this will let you avoid ever having to delete from the table.
Overall I wouldnt worry too much about what the database can handle, sql-server is very good at handling simultaneous single inserts into a table thats designed for that use case. If youre only going to be inserting new data to the end of the table, and not updating or deleting existing records, performance is going to scale very well. The same goes for larger batch inserts, theyre very efficient no matter how many rows you want to bring in, assuming your table is designed for that purpose.
So overall I would probably pick the more efficient solution from the application side for your specific use case, and then once you have decided that, you can shape the database around the code, rather than trying to shape your code around suspected limitations of the database.
What are the pros and cons of the two options?
Basically your question is which way is more efficient (bulk insert or multiple single insert)?
The answers is always depends and always be situation based. So unfortunately, I don't think there's a right answer for you
The way you structure the log table.
If you choose bulk insert, how many rows do you want to insert at 1 time?
Is it read-only table? And if you want to read from it, how often do you do the read?
Do you need to scale it up?
etc...
Are there more efficient ways of storing logs?
There're some possible ways to improve I can think of (not all of them can work together)
If you go with the first option, maybe you can schedule the insert to non-peak hours
If you go with the first option, chunk the log files and do the insert
Use another database to do the logging
If you go with the second option, do some load testing
Personally, I prefer to go with second option if the project is small to medium size and the logging is critical part of the project.
hope it helps.
Go with the second option, and use transactions. This way the data will not be sent to the db until you call the transaction to complete. (Which can be scheduled.) This will also prevent broken data from getting into your database when a crash or something occurs.
Transactions in .net
Transaction Tutorial in C#
I have an SQL Server 2008 Database and am using C# 4.0 with Linq to Entities classes setup for Database interaction.
There exists a table which is indexed on a DateTime column where the value is the insertion time for the row. Several new rows are added a second (~20) and I need to effectively pull them into memory so that I can display them in a GUI. For simplicity lets just say I need to show the newest 50 rows in a list displayed via WPF.
I am concerned with the load polling may place on the database and the time it will take to process new results forcing me to become a slow consumer (Getting stuck behind a backlog). I was hoping for some advice on an approach. The ones I'm considering are;
Poll the database in a tight loop (~1 result per query)
Poll the database every second (~20 results per query)
Create a database trigger for Inserts and tie it to an event in C# (SqlDependency)
I also have some options for access;
Linq-to-Entities Table Select
Raw SQL Query
Linq-to-Entities Stored Procedure
If you could shed some light on the pros and cons or suggest another way entirely I'd love to hear it.
The process which adds the rows to the table is not under my control, I wish only to read the rows never to modify or add. The most important things are to not overload the SQL Server, keep the GUI up to date and responsive and use as little memory as possible... you know, the basics ;)
Thanks!
I'm a little late to the party here, but if you have the feature on your edition of SQL Server 2008, there is a feature known as Change Data Capture that may help. Basically, you have to enable this feature both for the database and for the specific tables you need to capture. The built-in Change Data Capture process looks at the transaction log to determine what changes have been made to the table and records them in a pre-defined table structure. You can then query this table or pull results from the table into something friendlier (perhaps on another server altogether?). We are in the early stages of using this feature for a particular business requirement, and it seems to be working quite well thus far.
You would have to test whether this feature would meet your needs as far as speed, but it may help maintenance since no triggers are required and the data capture does not tie up your database tables themselves.
Rather than polling the database, maybe you can use the SQL Server Service broker and perform the read from there, even pushing which rows are new. Then you can select from the table.
The most important thing I would see here is having an index on the way you identify new rows (a timestamp?). That way your query would select the top entries from the index instead of querying the table every time.
Test, test, test! Benchmark your performance for any tactic you want to try. The biggest issues to resolve are how the data is stored and any locking and consistency issues you need to deal with.
If you table is updated constantly with 20 rows a second, then there is nothing better to do that pull every second or every few seconds. As long as you have an efficient way (meaning an index or clustered index) that can retrieve the last rows that were inserted, this method will consume the fewest resources.
IF the updates occur in burst of 20 updates per second but with significant periods of inactivity (minutes) in between, then you can use SqlDependency (which has absolutely nothing to do with triggers, by the way, read The Mysterious Notification for to udneratand how it actually works). You can mix LINQ with SqlDependency, see linq2cache.
Do you have to query to be notified of new data?
You may be better off using push notifications from a Service Bus (eg: NServiceBus).
Using notifications (i.e events) is almost always a better solution than using polling.
There is small system, where a database table as queue on MSSQL 2005. Several applications are writing to this table, and one application is reading and processing in a FIFO manner.
I have to make it a little bit more advanced to be able to create a distributed system, where several processing application can run. The result should be that 2-10 processing application should be able to run and they should not interfere each other during work.
My idea is to extend the queue table with a row showing that a process is already working on it. The processing application will first update the table with it's idetifyer, and then asks for the updated records.
So something like this:
start transaction
update top(10) queue set processing = 'myid' where processing is null
select * from processing where processing = 'myid'
end transaction
After processing, it sets the processing column of the table to something else, like 'done', or whatever.
I have three questions about this approach.
First: can this work in this form?
Second: if it is working, is it effective? Do you have any other ideas to create such a distribution?
Third: In MSSQL the locking is row based, but after an amount of rows are locked, the lock is extended to the whole table. So the second application cannot access it, until the first application does not release the transaction. How big can be the selection (top x) in order to not lock the whole table, only create row locks?
This will work, but you'll probably find you'll run into blocking or deadlocks where multiple processes try and read/update the same data. I wrote a procedure to do exactly this for one of our systems which uses some interesting locking semantics to ensure this type of thing runs with no blocking or deadlocks, described here.
This approach looks reasonable to me, and is similar to one I have used in the past - successfully.
Also, the row/ table will only be locked while the update and select operations take place, so I doubt the row vs table question is really a major consideration.
Unless the processing overhead of your app is so low as to be negligible, I'd keep the "top" value low - perhaps just 1. Of course that entirely depends on the details of your app.
Having said all that, I'm not a DBA, and so will also be interested in any more expert answers
In regards to your question about locking. You can use a locking hint to force it to lock only rows
update mytable with (rowlock) set x=y where a=b
Biggest problem with this approach is that you increase the number of 'updates' to the table. Try this with just one process consuming (update + delete) and others inserting data in the table and you will find that at around a million records, it starts to crumble.
I would rather have one consumer for the DB and use message queues to deliver processing data to other consumers.