Given a data set where I will only know the relevant fields at run time, I want to select each row of data for analysis through a loop. Is it better to:
run a direct SQL query to get the row each time, opening and closing the database connection directly, or
pull all the applicable rows into a DataTable before the loop and then select them through LINQ from inside the loop?
For example, I am reading in a file that says to look for rows a, b and c, so my query becomes "SELECT col1, col2, col3 FROM table WHERE col1 = 'a' OR col1 = 'b' OR col1 = 'c'"
But I don't know that it will be a, b, c at compile time, only after I run the program.
Thanks
edit: better in terms of speed and best practice
Depending on how long your analysis takes, the underlying transactional locking (if any), and the resources blocked by holding your result set on the DB server, it is either 1 or 2... but for me 2 seems rather unlikely (gigantic result sets that are open for long periods of time would eat up the memory on your DB system, which alone suggests you should rethink your whole data-processing workflow)... I'd just build the SQL at runtime, which is even possible using LINQ directly against your DB (see "Entity Framework"), and only if I encountered serious performance problems once that was running would I refactor...
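For what it's worth, here is a rough sketch of building that parameterized query at run time with plain ADO.NET; the table name (myTable), the column names from the question, and the connection string are placeholders:

using System.Collections.Generic;
using System.Data.SqlClient;

static List<object[]> LoadRows(string connectionString, IList<string> keys)
{
    var rows = new List<object[]>();
    if (keys.Count == 0)
        return rows;                                      // nothing to look for

    using (var conn = new SqlConnection(connectionString))
    using (var cmd = conn.CreateCommand())
    {
        var paramNames = new List<string>();
        for (int i = 0; i < keys.Count; i++)
        {
            string name = "@p" + i;
            paramNames.Add(name);
            cmd.Parameters.AddWithValue(name, keys[i]);   // parameterized, so no value concatenation
        }

        // myTable is a placeholder for the real table name
        cmd.CommandText =
            "SELECT col1, col2, col3 FROM myTable WHERE col1 IN (" +
            string.Join(", ", paramNames.ToArray()) + ")";

        conn.Open();
        using (var rdr = cmd.ExecuteReader())
        {
            while (rdr.Read())
            {
                var values = new object[rdr.FieldCount];
                rdr.GetValues(values);                    // copy the row out of the reader
                rows.Add(values);
            }
        }
    }
    return rows;
}

Called with something like LoadRows(connStr, new[] { "a", "b", "c" }) once the file has been read.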
Stack Overflow isn't really for "better"-type questions because they're usually highly subjective, opinion-based, and prone to arguments. We are, however, free to decide whether subjective questions should be answered, and this is one that (in my opinion) can be.
I'd assert that in 99% of cases you'd be better off leaving data in a database and querying it using the SQL functionality built into the database, because databases are purpose-built for storing and querying data. It is somewhat ludicrous to conclude that it's a good use of resources to transfer 100 million rows across a network into a poorly designed data storage container and then naively search through it with a loop-in-a-loop (which is what LINQ in a loop would be), when you could leave the data in the purpose-built, well-indexed, high-powered, enterprise-level software (on enterprise-grade hardware) where it currently resides and ask it to retrieve a tiny subset of those records over a (slow) network link to a limited-power client.
Related
I have a table with a lot of rows (3 million) from which I need to query some rows at several points in my app. The way I found to do this is to query all the data the first time any of it is needed and store it in a static DataTable with SqlAdapter.Fill() for the rest of the app's life.
That's fast, because then when I need something I use DataTable.Select("some query") and the app processes the info just fine.
The problem is that this table takes about 800 MB of RAM, and I have to run this app on PCs where that might be too much.
The other way I thought of was to query the data I need each time. That takes little memory but has poor performance (a lot of queries to the database, which is at a network address, and with 1000 queries you start to notice the ping and all that...).
Is there any intermediate point between performance and memory usage?
EDIT: What I'm retrieving are sales, which have a date, a product and a quantity. I query by product, and it isn't indexed that way. But anyway, with 1000 queries, even if each query takes 0.05 s, a 0.2 s ping adds up to a total of 200 seconds...
First, talk to the DBA about performance.
If you are downloading the entire table you might actually be putting more load on the network and SQL than if you performed individual queries.
As a DBA, if I knew you were downloading an entire large table I would put an index on product immediately.
Why are you performing 1000s of queries?
If you are looking for sales when a product is created, then a cache is problematic: you would not yet have sales data for it. The problem with a cache is stale data. If you know the data will not change (you either have it or you don't), then you can eliminate the concern about stale data.
There is something in between querying sequentially and pulling everything at once: you can pack multiple SELECTs into a single request. That makes a single round trip and is more efficient.
select * from tableA where ....;
select * from tableB where ....;
With a DataReader, just call the SqlDataReader.NextResult method:
using (SqlDataReader rdr = cmd.ExecuteReader())
{
    // read the rows from the first SELECT
    while (rdr.Read())
    {
    }
    // advance to the result set of the second SELECT
    rdr.NextResult();
    while (rdr.Read())
    {
    }
}
Pretty sure you can do the same type of thing with multiple DataTables in a DataSet.
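A rough sketch of that DataSet variant, assuming the same two-SELECT batch; the table names, WHERE clauses and parameter are placeholders:

using System.Data;
using System.Data.SqlClient;

static DataSet LoadBoth(string connectionString, int productId)
{
    // one round trip, two SELECTs, two DataTables in one DataSet
    using (var conn = new SqlConnection(connectionString))
    using (var cmd = new SqlCommand(
        "select * from tableA where product = @p; " +
        "select * from tableB where product = @p;", conn))
    using (var adapter = new SqlDataAdapter(cmd))
    {
        cmd.Parameters.AddWithValue("@p", productId);

        var ds = new DataSet();
        adapter.Fill(ds);                 // Fill opens and closes the connection itself

        DataTable fromA = ds.Tables[0];   // first result set
        DataTable fromB = ds.Tables[1];   // second result set
        return ds;
    }
}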
Another option is LocalDB. It is targeted at developers, but for what you are doing it would work just fine: DataTable speed without the memory concerns. You can even put an index on ProductID. It will take a little longer to write to disk compared to memory, but you are not using up memory.
Then there is the ever-evil WITH (NOLOCK). Know what you are doing; I am not going to go into all the possible evils, but I can tell you that I use it a lot.
The question boils down to memory vs. performance. The answer to that is caching.
If you know what your usage pattern would be like, then one thing you can do is to create a local cache in the app.
The extreme cases are: your cache size is 800 MB with all your data in it (thereby sacrificing memory), or your cache size is 0 MB and all your queries go over the network (thereby sacrificing performance).
Three important questions about the design of the cache are answered below.
How to populate the Cache?
If you are likely to make some query multiple times, store it in cache and before going to network, test if your cache already has the result. If it doesn't, query the database and then store the result in the cache.
If after querying for some data, you are likely to query the next and/or previous piece of data, then query all of it once and cache it so that when you query the next piece, you already have it in cache.
Effectively the idea is that if you know some information may be needed in future, cache it beforehand.
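A minimal check-the-cache-first sketch of that idea; the table and column names are guesses based on the question (a date, a product and a quantity), and the members are assumed to live in whatever data-access class the app already has:

using System.Collections.Generic;
using System.Data;
using System.Data.SqlClient;

static readonly Dictionary<int, DataTable> SalesByProduct = new Dictionary<int, DataTable>();

static DataTable GetSales(string connectionString, int productId)
{
    DataTable cached;
    if (SalesByProduct.TryGetValue(productId, out cached))
        return cached;                                 // cache hit: no network round trip

    var table = new DataTable();
    using (var conn = new SqlConnection(connectionString))
    using (var cmd = new SqlCommand(
        "SELECT SaleDate, ProductId, Quantity FROM Sales WHERE ProductId = @p", conn))
    using (var adapter = new SqlDataAdapter(cmd))
    {
        cmd.Parameters.AddWithValue("@p", productId);
        adapter.Fill(table);                           // Fill opens and closes the connection
    }

    SalesByProduct[productId] = table;                 // remember it for next time
    return table;
}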
How to free the Cache?
You can free the cache either passively or actively.
Passively: whenever the cache is full, evict data from it.
Actively: run a background thread at regular intervals that takes care of removal for you.
One hybrid method is to run a freeing thread as soon as you reach, let's say, 80% of your memory limit and then free whatever memory you can.
What data to remove from the Cache?
This has already been answered in the context of page replacement policies for operating systems.
For completeness, I'll summarize the important ones here:
Evict the least recently used data (if it is not likely to be used again); a minimal sketch of this one follows the list.
Evict the data that was brought in earliest (if the earliest data is not likely to be used).
Evict the data that was brought in latest (if you think the newly brought-in data is least likely to be used).
Automatically remove data that is older than t time units.
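For illustration, a minimal sketch of the first policy (least-recently-used eviction), independent of what you actually store in it:

using System.Collections.Generic;

class LruCache<TKey, TValue>
{
    private readonly int _capacity;
    private readonly Dictionary<TKey, LinkedListNode<KeyValuePair<TKey, TValue>>> _map =
        new Dictionary<TKey, LinkedListNode<KeyValuePair<TKey, TValue>>>();
    private readonly LinkedList<KeyValuePair<TKey, TValue>> _order =
        new LinkedList<KeyValuePair<TKey, TValue>>();

    public LruCache(int capacity) { _capacity = capacity; }

    public bool TryGet(TKey key, out TValue value)
    {
        LinkedListNode<KeyValuePair<TKey, TValue>> node;
        if (_map.TryGetValue(key, out node))
        {
            _order.Remove(node);          // touched: move to the front of the usage list
            _order.AddFirst(node);
            value = node.Value.Value;
            return true;
        }
        value = default(TValue);
        return false;
    }

    public void Put(TKey key, TValue value)
    {
        LinkedListNode<KeyValuePair<TKey, TValue>> existing;
        if (_map.TryGetValue(key, out existing))
        {
            _order.Remove(existing);      // replacing an existing entry
            _map.Remove(key);
        }
        else if (_map.Count >= _capacity)
        {
            var oldest = _order.Last;     // least recently used entry gets evicted
            _order.RemoveLast();
            _map.Remove(oldest.Value.Key);
        }
        _map[key] = _order.AddFirst(new KeyValuePair<TKey, TValue>(key, value));
    }
}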
RE: "I can't index by anything because I'm not the database admin nor can ask for that."
Can you prepopulate a temp table and put an index on that? e.g.
Select * into #MyTempTable from BigHugeTable
Create Index Prodidx on #MyTempTable (product)
You will have to ensure you always reuse the same connection (and it isn't closed) in order to use the temp table.
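A rough sketch of what that looks like from C#; the connection string and product value are placeholders, and the point is that the connection that created the #temp table stays open for as long as you need it:

using System.Data.SqlClient;

var conn = new SqlConnection(connectionString);
conn.Open();

using (var setup = conn.CreateCommand())
{
    setup.CommandText =
        "SELECT * INTO #MyTempTable FROM BigHugeTable; " +
        "CREATE INDEX Prodidx ON #MyTempTable (product);";
    setup.CommandTimeout = 600;        // copying millions of rows can take a while
    setup.ExecuteNonQuery();
}

// ...later, anywhere else in the app, on the SAME open connection:
using (var query = conn.CreateCommand())
{
    query.CommandText = "SELECT * FROM #MyTempTable WHERE product = @p";
    query.Parameters.AddWithValue("@p", someProduct);
    using (var rdr = query.ExecuteReader())
    {
        while (rdr.Read())
        {
            // use the row
        }
    }
}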
I am faced with a task where I have to design a web application in the .NET Framework. In this application, users will only (99% of the time) have read-only access, as they will just view data (SELECT).
However, the backend database is going to be the beast, where every minute records will be updated / inserted / deleted. The projection is that, at the very minimum, about 10 million records will be added to the system in a year, across fewer than 5 tables collectively.
Question/Request 1:
As these updates/inserts will happen very frequently (every minute or two at the latest), I was hoping to get some tips so that when rows are being changed, a SELECT query does not cause a deadlock, or vice versa.
Question/Request 2:
My calculated guess is that in normal situations only a few hundred records will be inserted every minute, 24/7 (and updated / deleted based on some conditions). If I write a C# tool which gets data from any number of sources (XML, CSV, or directly from some tables in a remote DB; a configuration file or registry setting will dictate which format the data is being imported from) and then does the inserts / updates / deletes, will this be fast enough and/or will it cause deadlock issues?
I hope my questions are elaborate enough... Please let me know if this is all vague...
Thanks in advance
I will answer your question #2 first: according to the scenario you've described, it will be fast enough. But remember to issue direct SQL commands to the database. I personally have a very, very similar scenario and the application runs without any problem, but when a scheduled job executes multiple inserts/deletes through a tool (like NHibernate), deadlocks do occur. So, again, if possible, execute direct SQL statements.
Question #1: You can use SELECT ... WITH (NOLOCK).
For example:
SELECT * FROM table_sales WITH (NOLOCK)
It avoids blocking on the database, but you have to remember that you might be reading outdated info (once again, in the scenario you've described that will probably not be a problem).
You can also try READ COMMITTED SNAPSHOT, which has been supported since the 2005 version, but for this example I will keep it simple. Read up on it a little to decide which one may be the best choice for you.
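For illustration, the same dirty read issued from C# (table name taken from the example above, connection string is a placeholder), with the READ COMMITTED SNAPSHOT alternative noted in a comment:

using System.Data.SqlClient;

// The snapshot alternative is a one-time, database-level switch run by the DBA:
//   ALTER DATABASE YourDb SET READ_COMMITTED_SNAPSHOT ON;
// after which plain SELECTs stop blocking on writers without any hints.
using (var conn = new SqlConnection(connectionString))
using (var cmd = new SqlCommand("SELECT * FROM table_sales WITH (NOLOCK)", conn))
{
    conn.Open();
    using (var rdr = cmd.ExecuteReader())
    {
        while (rdr.Read())
        {
            // read-only display work; rows may contain uncommitted data
            // that is later rolled back
        }
    }
}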
I am writing an app that reads a whole table, does some processing, then writes the resulting data to another table. I am using the SqlBulkCopy class (the .NET version of "bcp in"), which does the insert very fast. But I cannot find any efficient way to select the data in the first place. There is no .NET equivalent of "bcp out", which seems strange to me.
Currently I'm using select * from table_name. For perspective, it takes 2.5 seconds to select 6,000 rows... and only 600 ms to bulk insert the same number of rows.
I would expect that selecting data should always be faster than inserting. What is the fastest way to select all rows & columns from a table?
Answers to questions:
I timed my select at 2.5 seconds in two ways. The first was while running my application with a SQL trace; the second was running the same query in SSMS. Both returned about the same result.
I am reading data using SqlDataReader.
No other applications are using this database.
My current processing takes under 1 second, so a 2+ second read time is relatively large. But mostly I'm concerned (interested) in performance when scaling this up to 100,000 rows and millions of rows.
Sql Server 08r2 and my application are both running on my dev machine.
Some of the data processing is set-based, so I need to have the whole table in memory. (To support much larger data sets, I know this step will probably need to be moved into SQL so that I only need to operate per row in memory.)
Here is my code:
DataTable staging = new DataTable();
using (SqlConnection dwConn = (SqlConnection)SqlConnectionManager.Instance.GetDefaultConnection())
{
    dwConn.Open();
    SqlCommand cmd = dwConn.CreateCommand();
    cmd.CommandText = "select * from staging_table";
    // load the entire result set into the DataTable
    using (SqlDataReader reader = cmd.ExecuteReader())
    {
        staging.Load(reader);
    }
}
select * from table_name is the simplest, easiest and fastest way to read a whole table.
Let me explain why your results lead to wrong conclusions.
Copying a whole table is an optimized operation that merely requires cloning the old binary data into the new location (at most a file copy operation, depending on the storage mechanism).
Writing is buffered. The DBMS says the record was written, but it is not actually done yet unless you work with transactions; disk operations are generally delayed.
Querying a table also requires (unlike cloning) adapting data from the binary stored layout/format into a driver-dependent format that is ultimately readable by your client. This takes time.
It all depends on your hardware, but it is likely that your network is the bottleneck here.
Apart from limiting your query to read just the columns you'll actually be using, a select is as fast as it will get. There is caching involved here: when you execute it twice in a row, the second time should be much faster because the data is cached in memory. Execute DBCC DROPCLEANBUFFERS to check the effect of caching.
If you want to do it as fast as possible, try to implement the processing code in T-SQL; that way it can operate directly on the data right there on the server.
Another good tip for speed tuning is to have the table that is being read on one disk (look at filegroups) and the table that is written to on another disk. That way one disk can do a continuous read and the other a continuous write. If both operations happen on the same disk, the heads keep going back and forth, which seriously degrades performance.
If the logic you're writing cannot be done in T-SQL, you could also have a look at SQL CLR.
Another tip: when you do select * from table, use a datareader if possible. That way you don't materialize the whole thing in memory first.
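A sketch of that last tip against the code in the question, streaming the rows instead of loading them into the DataTable; the column names and types here are made up, so adjust them to the real staging table:

using System.Data.SqlClient;

using (var dwConn = (SqlConnection)SqlConnectionManager.Instance.GetDefaultConnection())
{
    dwConn.Open();
    using (var cmd = dwConn.CreateCommand())
    {
        // read only the columns you actually use
        cmd.CommandText = "select col1, col2 from staging_table";
        using (var reader = cmd.ExecuteReader())
        {
            while (reader.Read())
            {
                int col1 = reader.GetInt32(0);
                string col2 = reader.GetString(1);
                // process this row here; nothing accumulates in memory
            }
        }
    }
}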
GJ
It is generally a good idea to include the column names in the select list, but with today's RDBMSs it won't make much difference; you will only see a difference in this regard if you limit the columns selected. Generally speaking, it is good practice to include column names. But to answer the question: a select does indeed seem to be slower than an insert in the scenario you describe.
And yes, select * from table_name is indeed the fastest way to read all rows and columns from a table.
I have an SQL Server 2008 Database and am using C# 4.0 with Linq to Entities classes setup for Database interaction.
There exists a table which is indexed on a DateTime column whose value is the insertion time for the row. Several new rows are added each second (~20) and I need to effectively pull them into memory so that I can display them in a GUI. For simplicity, let's just say I need to show the newest 50 rows in a list displayed via WPF.
I am concerned with the load polling may place on the database and the time it will take to process new results, forcing me to become a slow consumer (getting stuck behind a backlog). I was hoping for some advice on an approach. The ones I'm considering are:
Poll the database in a tight loop (~1 result per query)
Poll the database every second (~20 results per query)
Create a database trigger for Inserts and tie it to an event in C# (SqlDependency)
I also have some options for access:
Linq-to-Entities Table Select
Raw SQL Query
Linq-to-Entities Stored Procedure
If you could shed some light on the pros and cons or suggest another way entirely I'd love to hear it.
The process which adds the rows to the table is not under my control, I wish only to read the rows never to modify or add. The most important things are to not overload the SQL Server, keep the GUI up to date and responsive and use as little memory as possible... you know, the basics ;)
Thanks!
I'm a little late to the party here, but if your edition of SQL Server 2008 supports it, there is a feature known as Change Data Capture (CDC) that may help. Basically, you have to enable it both for the database and for the specific tables you need to capture. The built-in Change Data Capture process looks at the transaction log to determine what changes have been made to the table and records them in a pre-defined table structure. You can then query this table or pull results from it into something friendlier (perhaps on another server altogether?). We are in the early stages of using this feature for a particular business requirement, and it seems to be working quite well thus far.
You would have to test whether this feature would meet your needs as far as speed, but it may help maintenance since no triggers are required and the data capture does not tie up your database tables themselves.
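As a very rough sketch (not tested against your schema), once CDC is enabled you read the captured changes through the generated table-valued function; the capture instance name dbo_MyTable and the connection string below are placeholders:

using System.Data.SqlClient;

const string sql = @"
    DECLARE @from binary(10) = sys.fn_cdc_get_min_lsn('dbo_MyTable');
    DECLARE @to   binary(10) = sys.fn_cdc_get_max_lsn();
    SELECT * FROM cdc.fn_cdc_get_all_changes_dbo_MyTable(@from, @to, 'all');";

using (var conn = new SqlConnection(connectionString))
using (var cmd = new SqlCommand(sql, conn))
{
    conn.Open();
    using (var rdr = cmd.ExecuteReader())
    {
        while (rdr.Read())
        {
            // __$operation column: 1 = delete, 2 = insert, 3/4 = update (before/after)
        }
    }
}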
Rather than polling the database, maybe you can use SQL Server Service Broker and perform the read from there, even pushing which rows are new. Then you can select from the table.
The most important thing I would see here is having an index on however you identify new rows (a timestamp?). That way your query can read the top entries from the index instead of scanning the table every time.
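As a rough illustration of that indexed-polling idea (the table and column names below are made up), keep the newest timestamp you have seen and only ask for rows after it, inside whatever class drives the polling timer:

using System;
using System.Data.SqlClient;
using System.Data.SqlTypes;

// newest timestamp seen so far; start at the smallest value SQL's datetime accepts
private DateTime _lastSeen = SqlDateTime.MinValue.Value;

private void PollOnce(string connectionString)
{
    using (var conn = new SqlConnection(connectionString))
    using (var cmd = new SqlCommand(
        "SELECT TOP 50 Id, InsertedAt, Payload " +
        "FROM dbo.Readings " +                    // placeholder table
        "WHERE InsertedAt > @last " +             // InsertedAt is the indexed DateTime column
        "ORDER BY InsertedAt DESC", conn))
    {
        cmd.Parameters.AddWithValue("@last", _lastSeen);
        conn.Open();
        using (var rdr = cmd.ExecuteReader())
        {
            while (rdr.Read())
            {
                DateTime insertedAt = rdr.GetDateTime(1);
                if (insertedAt > _lastSeen)
                    _lastSeen = insertedAt;
                // hand the row off to the WPF list here
            }
        }
    }
}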
Test, test, test! Benchmark your performance for any tactic you want to try. The biggest issues to resolve are how the data is stored and any locking and consistency issues you need to deal with.
If your table is updated constantly with 20 rows a second, then there is nothing better to do than pull every second or every few seconds. As long as you have an efficient way (meaning an index or clustered index) to retrieve the last rows that were inserted, this method will consume the fewest resources.
If the updates occur in bursts of 20 per second but with significant periods of inactivity (minutes) in between, then you can use SqlDependency (which has absolutely nothing to do with triggers, by the way; read "The Mysterious Notification" to understand how it actually works). You can mix LINQ with SqlDependency; see linq2cache.
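A bare-bones SqlDependency sketch; the table and column names are placeholders, and the database needs Service Broker enabled for the notifications to be delivered:

using System.Data.SqlClient;

void Subscribe(string connectionString)
{
    SqlDependency.Start(connectionString);          // once per application

    using (var conn = new SqlConnection(connectionString))
    using (var cmd = new SqlCommand(
        // notification queries need an explicit column list and two-part table names
        "SELECT Id, InsertedAt, Payload FROM dbo.Readings", conn))
    {
        var dependency = new SqlDependency(cmd);
        dependency.OnChange += (sender, e) =>
        {
            // fires once after the next change; re-subscribe and re-query here
        };

        conn.Open();
        using (var rdr = cmd.ExecuteReader())
        {
            while (rdr.Read())
            {
                // initial load of the current rows
            }
        }
    }
}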
Do you have to query to be notified of new data?
You may be better off using push notifications from a Service Bus (eg: NServiceBus).
Using notifications (i.e. events) is almost always a better solution than using polling.
I have an importer process which runs as a Windows service (debug mode as an application) and processes various XML documents and CSVs and imports them into a SQL database. All has been well until I had to process a large amount of data (120k rows) from another table (as I do with the XML documents).
I am now finding that the SQL server's memory usage is hitting a point where it just hangs. My application never receives a time out from the server and everything just goes STOP.
I am still able to make calls to the database server separately but that application thread is just stuck with no obvious thread in SQL Activity Monitor and no activity in Profiler.
Any ideas on where to begin solving this problem would be greatly appreciated as we have been struggling with it for over a week now.
The basic architecture is C# 2.0 using NHibernate as an ORM; data is pulled into the C# logic, processed, and then written back into the same database, along with logs into other tables.
The only other problem, which sometimes happens instead, is that for some reason a cursor is opened on this massive table, which I can only assume is generated by ADO.NET; a statement like exec sp_cursorfetch 180153005,16,113602,100 is called thousands of times according to Profiler.
When are you COMMITting the data? Are there any locks or deadlocks (sp_who)? If 120,000 rows is considered large, how much RAM is SQL Server using? When the application hangs, is there anything notable about the point where it hangs (is it an INSERT, a lookup SELECT, or what)?
It seems to me that that commit size is way too small. Usually in SSIS ETL tasks, I will use a batch size of 100,000 for narrow rows with sources over 1,000,000 in cardinality, but I never go below 10,000 even for very wide rows.
I would not use an ORM for large ETL, unless the transformations are extremely complex with a lot of business rules. Even still, with a large number of relatively simple business transforms, I would consider loading the data into simple staging tables and using T-SQL to do all the inserts, lookups etc.
Are you loading this into SQL using BCP? If not, the transaction logs may not be able to keep up with your input. On a test machine, try switching the recovery model to Simple, or use the BCP methods to get data in (they bypass full transaction logging).
Adding on to StingyJack's answer ...
If you're unable to use straight BCP due to processing requirements, have you considered performing the import against a separate SQL Server (separate box), using your tool, then running BCP?
The key to making this work would be keeping the staging machine clean -- that is, no data except the current working set. This should keep the RAM usage down enough to make the imports work, as you're not hitting tables with -- I presume -- millions of records. The end result would be a single view or table in this second database that could be easily BCP'ed over to the real one when all the processing is complete.
The downside is, of course, having another box ... And a much more complicated architecture. And it's all dependent on your schema, and whether or not that sort of thing could be supported easily ...
I've had to do this with some extremely large and complex imports of my own, and it's worked well in the past. Expensive, but effective.
I found out that it was NHibernate creating the cursor on the large table. I have yet to understand why, but in the meantime I have replaced the large-table data access with straightforward ADO.NET calls.
Since you are rewriting it anyway, you may not be aware that you can call BCP directly from .NET via the System.Data.SqlClient.SqlBulkCopy class. See this article for some interesting performance info.
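For reference, a minimal SqlBulkCopy sketch; the destination table name is a placeholder, and BatchSize keeps the load in chunks instead of one giant batch:

using System.Data;
using System.Data.SqlClient;

void BulkLoad(string connectionString, DataTable rows)
{
    using (var bulk = new SqlBulkCopy(connectionString))
    {
        bulk.DestinationTableName = "dbo.TargetTable";
        bulk.BatchSize = 10000;      // send rows to the server in chunks
        bulk.BulkCopyTimeout = 0;    // no timeout for a long-running load
        bulk.WriteToServer(rows);
    }
}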