I have a table with a lot of rows (3 million) from which I need to query some rows at several points in my app. The approach I found was to query all the data the first time any of it was needed and store it in a static DataTable, filled with SqlDataAdapter.Fill(), for the rest of the app's life.
That's fast, because then when I need something I use DataTable.Select("some query") and the app processes the info just fine.
The problem is that this table takes about 800MB of RAM, and I have to run this app on PCs where that might be too much.
The other approach I considered was to query the data I need each time. That uses little memory but has poor performance (a lot of queries to the database, which is at a network address; with 1000 queries you start to notice the ping and all that...).
Is there any intermediate point between performance and memory usage?
EDIT: What I'm retrieving are sales, which have a date, a product and a quantity. I query by product, and the table isn't indexed that way. But anyway, with 1000 queries, even if each query took 0.05s, a 0.2s ping adds up to 200 seconds...
First, talk to the DBA about performance.
If you are downloading the entire table you might actually be putting more load on the network and SQL than if you performed individual queries.
As a DBA, if I knew you were downloading an entire large table, I would put an index on product immediately.
Why are you performing thousands of queries?
If you are looking for sales when a product is created, then a cache is problematic: you would not yet have sales data. The problem with a cache is stale data. If you know the data will not change (you either have it or you don't), then you can eliminate the concern of stale data.
There is a middle ground between running queries one at a time and loading everything up front: you can pack multiple SELECT statements into a single command. That makes a single round trip, which is more efficient.
select * from tableA where ....;
select * from tableB where ....;
With a DataReader, just call the SqlDataReader.NextResult() method:
using (SqlDataReader rdr = cmd.ExecuteReader())
{
    while (rdr.Read())
    {
        // process rows from the first SELECT
    }
    rdr.NextResult();   // advance to the second result set
    while (rdr.Read())
    {
        // process rows from the second SELECT
    }
}
Pretty sure you can do the same type of thing with multiple DataTables in a DataSet.
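Right — a SqlDataAdapter whose command contains two SELECTs fills two DataTables in one DataSet with a single round trip (the WHERE clauses are elided as in the snippet above; the connection string is a placeholder):

```csharp
using System.Data;
using System.Data.SqlClient;

using (var conn = new SqlConnection("Server=...;Database=...;Integrated Security=true"))
{
    var adapter = new SqlDataAdapter(
        "select * from tableA where ....; select * from tableB where ....;", conn);

    var ds = new DataSet();
    adapter.Fill(ds);             // one round trip, two result sets

    DataTable a = ds.Tables[0];   // default table names are "Table" and "Table1"
    DataTable b = ds.Tables[1];
}
```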
Another option is LocalDB. It is targeted at developers, but for what you are doing it would work just fine: DataTable-like speed without the memory concerns. You can even put an index on ProductID. It will take a little longer than memory because it writes to disk, but you are not using up RAM.
Then there is the ever-evil WITH (NOLOCK). Know what you are doing; I am not going to go into all the possible evils, but I can tell you that I use it a lot.
The question can be distilled to memory vs. performance. The answer to that is caching.
If you know what your usage pattern would be like, then one thing you can do is to create a local cache in the app.
The extreme cases are: your cache size is 800MB with all your data in it (thereby sacrificing memory), or your cache size is 0MB and all your queries go to the network (thereby sacrificing performance).
Three important questions about the design of the cache are answered below.
How to populate the Cache?
If you are likely to make some query multiple times, store it in cache and before going to network, test if your cache already has the result. If it doesn't, query the database and then store the result in the cache.
If after querying for some data, you are likely to query the next and/or previous piece of data, then query all of it once and cache it so that when you query the next piece, you already have it in cache.
Effectively the idea is that if you know some information may be needed in future, cache it beforehand.
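A minimal sketch of that read-through idea in C# (the generic cache class and the loader delegate are illustrative, not from the original post):

```csharp
using System;
using System.Collections.Generic;

// Read-through cache: check locally first, fall back to the loader on a miss.
class ReadThroughCache<TKey, TValue>
{
    private readonly Dictionary<TKey, TValue> _store = new Dictionary<TKey, TValue>();
    private readonly Func<TKey, TValue> _loader; // e.g. a database query

    public ReadThroughCache(Func<TKey, TValue> loader) { _loader = loader; }

    public TValue Get(TKey key)
    {
        if (_store.TryGetValue(key, out var cached))
            return cached;            // cache hit: no network round trip
        var value = _loader(key);     // cache miss: hit the database once
        _store[key] = value;          // remember it for next time
        return value;
    }
}
```

The first Get for a given key pays the network round trip; every later Get for the same key is a dictionary lookup.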
How to free the Cache?
You can free the cache either passively or actively.
Passively: whenever the cache is full, evict data from it.
Actively: run a background thread at a regular interval that takes care of removal for you.
One hybrid method is to start a freeing thread as soon as you reach, say, 80% of your memory limit and then free whatever memory you can.
What data to remove from the Cache?
This has already been answered in the context of page replacement policies for operating systems.
For completeness, I'll summarize the important ones here:
Evict the Least Recently Used data (if it is not likely to be used);
Evict the data that was brought in earliest (if the earliest data is not likely to be used);
Evict the data that was brought in latest (if you think that the newly brought in data is least likely to be used).
Automatically remove the data that is older than t time units.
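For example, the least-recently-used policy from the first bullet can be sketched with a dictionary plus a linked list (the class and capacity are illustrative):

```csharp
using System.Collections.Generic;

// Bounded LRU cache: on overflow, evict the key that was used longest ago.
class LruCache<TKey, TValue>
{
    private readonly int _capacity;
    private readonly Dictionary<TKey, LinkedListNode<(TKey Key, TValue Value)>> _map
        = new Dictionary<TKey, LinkedListNode<(TKey Key, TValue Value)>>();
    private readonly LinkedList<(TKey Key, TValue Value)> _order
        = new LinkedList<(TKey Key, TValue Value)>();   // front = most recently used

    public LruCache(int capacity) { _capacity = capacity; }

    public bool TryGet(TKey key, out TValue value)
    {
        if (_map.TryGetValue(key, out var node))
        {
            _order.Remove(node);      // touching a key makes it most recent
            _order.AddFirst(node);
            value = node.Value.Value;
            return true;
        }
        value = default;
        return false;
    }

    public void Put(TKey key, TValue value)
    {
        if (_map.TryGetValue(key, out var existing))
        {
            _order.Remove(existing);
            _map.Remove(key);
        }
        if (_map.Count >= _capacity)
        {
            var oldest = _order.Last; // least recently used entry
            _order.RemoveLast();
            _map.Remove(oldest.Value.Key);
        }
        _map[key] = _order.AddFirst((key, value));
    }
}
```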
RE: "I can't index by anything because I'm not the database admin nor can ask for that."
Can you prepopulate a temp table and index on that? e.g.
Select * into #MyTempTable from BigHugeTable
Create Index Prodidx on #MyTempTable (product)
You will have to ensure you always reuse the same connection (and it isn't closed) in order to use the temp table.
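Since a #temp table only lives as long as the session that created it, the prepopulation and every later query must share one open SqlConnection. A sketch (the connection string, column names, and someProductId are placeholders):

```csharp
using System.Data.SqlClient;

// Keep this connection open for the app's lifetime; closing it drops #MyTempTable.
var conn = new SqlConnection("Server=...;Database=...;Integrated Security=true");
conn.Open();

using (var setup = new SqlCommand(
    "Select * into #MyTempTable from BigHugeTable; " +
    "Create Index Prodidx on #MyTempTable (product);", conn))
{
    setup.CommandTimeout = 0;  // the initial copy of 3M rows can take a while
    setup.ExecuteNonQuery();
}

// Later queries reuse the same open connection and hit the indexed copy.
using (var query = new SqlCommand(
    "select date, quantity from #MyTempTable where product = @p", conn))
{
    query.Parameters.AddWithValue("@p", someProductId);
    using (var rdr = query.ExecuteReader())
    {
        while (rdr.Read()) { /* ... */ }
    }
}
```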
Related
Given a data set where you will only know the relevant fields at run time, I want to select each row of data for analysis through a loop. Is it better to:
run a direct SQL query to get the row each time, directly opening and closing the database connection, or
pull all the applicable rows into a DataTable before the loop, then select them through LINQ inside the loop?
For example, if I am reading in a file that says to look for rows a, b and c, then my query becomes "SELECT col1, col2, col3 FROM table WHERE col1 = 'a' OR col1 = 'b' OR col1 = 'c'".
But I don't know it will be a, b, c at compile time, only after I run the program.
Thanks
edit: better in terms of speed and best practice
Depending on how long your analysis takes, the underlying transactional locking (if any), and the resources blocked by holding your result set on the DB server, it is either 1 or 2... but for me, 2 seems rather unlikely (gigantic result sets that are open for long periods of time would eat up the memory on your DB system, which alone suggests you should rethink your whole data-processing workflow). I'd just build the SQL at runtime, which is even possible using LINQ directly against your DB (see "Entity Framework"), and only if I encountered serious performance problems once that was running would I refactor...
StackOverflow isn't really for "better"-type questions, because they're usually highly subjective, opinion-based, and lead to arguments. We are, however, free to decide whether subjective questions should be answered, and this is one that (in my opinion) can be.
I'd assert that in 99% of cases you'd be better off leaving data in the database and querying it using the SQL functionality built into it. The reason is that databases are purpose-built for storing and querying data. It is somewhat ludicrous to conclude that it's a good use of resources to transfer 100 million rows across a network into a poorly designed data-storage container and then naively search through it using a loop-in-a-loop (which is what LINQ in a loop would be), when you could leave it in the purpose-built, well-indexed, high-powered enterprise-level software (on enterprise-grade hardware) where it currently resides, and ask for a tiny subset of those records to be retrieved over a (slow) network link into a limited-power client.
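If you do build the SQL at runtime, the run-time list of values from the file doesn't need to be concatenated into the query text; you can generate one parameter per value (this helper and its names are illustrative, not from the question):

```csharp
using System.Collections.Generic;
using System.Data.SqlClient;
using System.Linq;

// Turn a run-time list of keys into "col1 IN (@p0, @p1, ...)" with real parameters,
// avoiding SQL injection and plan-cache pollution from literal values.
static SqlCommand BuildLookup(SqlConnection conn, IReadOnlyList<string> keys)
{
    var names = keys.Select((_, i) => "@p" + i).ToArray();
    var cmd = new SqlCommand(
        "SELECT col1, col2, col3 FROM [table] WHERE col1 IN (" +
        string.Join(", ", names) + ")",
        conn);
    for (int i = 0; i < keys.Count; i++)
        cmd.Parameters.AddWithValue(names[i], keys[i]);
    return cmd;
}
```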
I have a C# program that is executed on an EC2 cloud server and connects to an RDS SQL DB. It collects some data, checks whether that data exists in an SQL DB table, and if not adds the item to a list to be bulk-saved later. Now I'm struggling a bit with what the best approach is here.
I've tried these options so far:
a) Before inserting, make a DB call to see if the row exists.
Pro: less memory intensive. Can run on cheaper cloud servers.
Cons: higher DB reads. Slower performance
b) Put a unique constraint on the tables, and in code catch the unique-constraint exception and move on.
Pro: less memory intensive. Can run on cheaper cloud servers.
Cons: higher DB reads. MUCH slower performance. The identity increment doesn't get rolled back with the transaction, leading to gaps and, down the road, to maxing out the int Id column.
c) At the start, make a bulk SQL call for all the lookup values and put them into a HashSet. Reference this HashSet to see whether a value exists.
Pro: Much much faster. Much less DB reads
Cons: pay more for servers with more memory. Risk of out-of-memory errors as it scales.
The reason I say making single SQL calls is slow is the latency of the RDS DB. It's minor, but it adds up when doing 100k calls in a row to see if something exists. Ideally I'd like performance to be as fast as possible. I've considered using DynamoDB, but the pricing is just too much and doesn't make sense in my situation.
So is there some way to save lookups to a file on the local disk that can be quickly referenced later?
If so, please tell me what it is!
One quick option you could try is to write a stored procedure that accepts the insert data; the SP would check for the record and, if not found, add it, or else return an error.
Doesn't cut down on the reads or writes, but does cut down on the roundtrips to the server, and you did say the latency is a problem.
Another option is to use a temp table and do inserts into it without checks (fast), and then do an "INSERT INTO mastertable SELECT * FROM temptable WHERE record NOT IN (SELECT record FROM mastertable)"-type query to insert rows as a batch. The usefulness of this option depends on whether you can bunch up a number of rows to insert at once.
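A sketch of that temp-table idea from C# using SqlBulkCopy (the table and column names are assumptions, not from the question):

```csharp
using System.Data;
using System.Data.SqlClient;

// rows: the items collected in memory that may or may not already exist.
// Expects a single column named "LookupValue" matching the staging table.
static void BulkInsertNewRows(SqlConnection conn, DataTable rows)
{
    using (var create = new SqlCommand(
        "CREATE TABLE #staging (LookupValue nvarchar(100) NOT NULL);", conn))
        create.ExecuteNonQuery();

    using (var bulk = new SqlBulkCopy(conn) { DestinationTableName = "#staging" })
        bulk.WriteToServer(rows);   // one bulk stream, no per-row round trips

    // One set-based statement inserts only the rows not already in the master table.
    using (var insert = new SqlCommand(
        "INSERT INTO MasterTable (LookupValue) " +
        "SELECT s.LookupValue FROM #staging s " +
        "WHERE NOT EXISTS (SELECT 1 FROM MasterTable m " +
        "                  WHERE m.LookupValue = s.LookupValue);", conn))
        insert.ExecuteNonQuery();
}
```

This trades 100k existence checks for one bulk copy plus one set-based insert, which is exactly where RDS latency stops mattering.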
I have a database in SQL Server 2012 and want to update a table in it.
My table has three columns; the first is of type nchar(24) and is filled with billions of rows. The other two columns are of the same type, but they are null (empty) at the moment.
I need to read the data from the first column and do some calculations with it. The result of my calculations is two strings; these two strings are the data I want to insert into the two empty columns.
My question is what is the fastest way to read the information from the first column of the table and update the second and third column.
Read and update step by step? Read a few rows, do the calculation, and update those rows while reading the next few?
As this involves billions of rows, performance is the only thing that matters here.
Let me know if you need any more information!
EDIT 1:
My calculation can't be expressed in SQL.
As the SQL Server is on the local machine, throughput is nothing we have to worry about. One calculation takes about 0.02154 seconds, and I have a total of 2,809,475,760 rows; this is about 280 GB of data.
Normally, DML is best performed in bigger batches. Depending on your indexing structure, a small batch size (maybe 1000?!) can already deliver the best results, or you might need bigger batch sizes (up to the point where you write all rows of the table in one statement).
Bulk updates can be performed by bulk-inserting information about the updates you want to make, and then updating all rows in the batch in one statement. Alternative strategies exist.
As you can't hold all the rows to be updated in memory at the same time, you probably need to look into MARS to be able to perform streaming reads while occasionally writing at the same time. Alternatively, you can do it with two connections. Be careful not to deadlock across connections: SQL Server cannot detect that in principle, and only a timeout will resolve such a (distributed) deadlock. Making the reader run under snapshot isolation is a good strategy here; snapshot isolation causes the reader to neither block nor be blocked.
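One way to arrange the two-connection variant (connection strings, table and column names, and the Calculate helper are all placeholders; the database must have ALLOW_SNAPSHOT_ISOLATION enabled):

```csharp
using System.Collections.Generic;
using System.Data.SqlClient;

const int BatchSize = 1000;   // tune by experiment, per the answer above

// readConn streams rows under snapshot isolation; writeConn applies batched updates.
using (var readConn = new SqlConnection("Server=...;Database=..."))
using (var writeConn = new SqlConnection("Server=...;Database=..."))
{
    readConn.Open();
    writeConn.Open();
    new SqlCommand("SET TRANSACTION ISOLATION LEVEL SNAPSHOT;", readConn)
        .ExecuteNonQuery();

    var batch = new List<(string Key, string B, string C)>();
    using (var rdr = new SqlCommand("SELECT KeyCol FROM BigTable", readConn)
        .ExecuteReader())
    {
        while (rdr.Read())
        {
            string key = rdr.GetString(0);
            var (b, c) = Calculate(key);  // the poster's CPU-side calculation
            batch.Add((key, b, c));
            if (batch.Count == BatchSize) { Flush(writeConn, batch); batch.Clear(); }
        }
    }
    if (batch.Count > 0) Flush(writeConn, batch);
}

// One transaction per batch instead of one per row cuts log-flush overhead.
static void Flush(SqlConnection conn, List<(string Key, string B, string C)> batch)
{
    using (var tx = conn.BeginTransaction())
    {
        foreach (var row in batch)
        {
            using (var upd = new SqlCommand(
                "UPDATE BigTable SET ColB = @b, ColC = @c WHERE KeyCol = @k", conn, tx))
            {
                upd.Parameters.AddWithValue("@b", row.B);
                upd.Parameters.AddWithValue("@c", row.C);
                upd.Parameters.AddWithValue("@k", row.Key);
                upd.ExecuteNonQuery();
            }
        }
        tx.Commit();
    }
}
```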
LINQ is pretty efficient in my experience. I wouldn't worry too much about optimizing your code yet; in fact, prematurely optimizing your code is typically something you should avoid. Just get it to work first, then refactor as needed. As a side note, I once tested a stored procedure against a LINQ query, and LINQ won (to my amazement).
There is no simple "how" and no one-size-fits-all solution here.
If there are billions of rows, does performance matter? It doesn't seem to me that it has to be done within a second.
What is the expected throughput of the database and network? If you're behind a POTS dial-in link, the case is massively different from being on 10Gb fiber.
How expensive are the computations? Just c = a + b, or heavy processing of other text files?
These are just a couple of the questions this raises. There is a lot more involved that we aren't aware of, so we can't answer definitively.
Try a couple of things and measure it.
As a general rule: writing to a database can be improved by batching instead of single updates.
Using an async pattern can free up some of the waiting time for calculations.
EDIT in reply to comment
If a calculation takes about 20ms, the biggest problem is I/O. Multithreading won't bring you much.
Read the records in sequence using snapshot isolation, so the reads aren't hampered by write locks, and update in batches. My guess is that the reader will stay ahead of the writer without much trouble; reading in batches adds complexity without gaining much.
Find the sweet spot for the right batch size by experimenting.
I am building a database on SQL Server.
This DB is going to be really huge.
However, there are a few tables which need to be queried very frequently and are quite small.
Is there a way to cache these tables in RAM for faster querying ?
Any ideas/links to make the database insertions/query faster will be highly appreciated.
Also, do I get any performance boost if I migrate from SQL Express to SQL Server Enterprise ?
Thanks in advance.
SQL Server will do an outstanding job of keeping small, frequently accessed tables in RAM.
However, a small frequently accessed table does sound like a good candidate for caching at the application layer to avoid ever hitting the database.
If your database really is "huge", you will hit the 1GB RAM limit of SQL Express (and/or the 10GB per DB storage limitation) and will want an edition that does not have that constraint.
http://msdn.microsoft.com/en-us/library/cc645993(v=SQL.110).aspx
You can read the data from the table and store it in a DataTable variable.
You should also create a suitable index to make the query faster.
If you are working with C#, then you may try data caching.
You just need to follow 3 steps:
Fetch your result to a list
Now cache the list of data
Whenever you need to query the cached result, cast your cache object back to the list type.
Following is the example code:
List<type> result = (Linq-query).ToList();
Cache["resultSet"] = result;
List<type> cachedList = (List<type>)Cache["resultSet"];
Now you may perform LINQ queries over cachedList, which actually uses the cached object.
Note: for caching any object you may use a more precise approach like the following, which provides better control over caching.
Cache cacheObjectName = new Cache();
cacheObjectName.Insert("Key", value, dependencies, absoluteExpiration, slidingExpiration, priority, onRemovedCallback);
The more a page is used by queries, the higher the chance that it stays in memory, but this works at page level rather than table level. Every time a page is referenced its use count is increased, and a background process (the lazy writer) periodically decreases the count for all pages. When a new page needs to be brought into memory, SQL Server writes the page with the lowest count to disk. So if your table's pages are accessed frequently, their counts will be high and they will stay in memory longer. However, if you run some big query that reads more data from different tables than fits in memory, even those pages might be thrown out of the cache. If you don't have that kind of query, the chances are good that the pages will stay in memory.
Also, this depends on the same pages being accessed repeatedly. If different processes read different pages from the same table, you might not have a very high use count on all of your pages, and some of them could be written to disk.
Read the blog below for more details on how the buffer pool works.
http://sqlblog.com/blogs/elisabeth_redei/archive/2009/03/01/bufferpool-performance-counters.aspx
Depending on how often these small tables are changed, Query Notifications might be a good option. Essentially, you subscribe your application to changes in a data set in the db. A canonical example is a list of vendors. Doesn't change much over time but you want the application to know when it does change.
I have an SQL Server 2008 Database and am using C# 4.0 with Linq to Entities classes setup for Database interaction.
There exists a table which is indexed on a DateTime column where the value is the insertion time for the row. Several new rows are added per second (~20), and I need to effectively pull them into memory so that I can display them in a GUI. For simplicity, let's just say I need to show the newest 50 rows in a list displayed via WPF.
I am concerned with the load polling may place on the database and the time it will take to process new results, forcing me to become a slow consumer (getting stuck behind a backlog). I was hoping for some advice on an approach. The ones I'm considering are:
Poll the database in a tight loop (~1 result per query)
Poll the database every second (~20 results per query)
Create a database trigger for Inserts and tie it to an event in C# (SqlDependency)
I also have some options for access:
Linq-to-Entities Table Select
Raw SQL Query
Linq-to-Entities Stored Procedure
If you could shed some light on the pros and cons or suggest another way entirely I'd love to hear it.
The process which adds the rows to the table is not under my control, I wish only to read the rows never to modify or add. The most important things are to not overload the SQL Server, keep the GUI up to date and responsive and use as little memory as possible... you know, the basics ;)
Thanks!
I'm a little late to the party here, but if you have the feature on your edition of SQL Server 2008, there is a feature known as Change Data Capture that may help. Basically, you have to enable this feature both for the database and for the specific tables you need to capture. The built-in Change Data Capture process looks at the transaction log to determine what changes have been made to the table and records them in a pre-defined table structure. You can then query this table or pull results from the table into something friendlier (perhaps on another server altogether?). We are in the early stages of using this feature for a particular business requirement, and it seems to be working quite well thus far.
You would have to test whether this feature would meet your needs as far as speed, but it may help maintenance since no triggers are required and the data capture does not tie up your database tables themselves.
Rather than polling the database, maybe you can use SQL Server Service Broker and perform the read from there, even pushing which rows are new. Then you can select from the table.
The most important thing I would see here is having an index on the way you identify new rows (a timestamp?). That way your query would select the top entries from the index instead of querying the table every time.
Test, test, test! Benchmark your performance for any tactic you want to try. The biggest issues to resolve are how the data is stored and any locking and consistency issues you need to deal with.
If your table is updated constantly at 20 rows a second, then there is nothing better to do than poll every second or every few seconds. As long as you have an efficient way (meaning an index or clustered index) to retrieve the last rows inserted, this method will consume the fewest resources.
If the updates occur in bursts of 20 per second with significant periods of inactivity (minutes) in between, then you can use SqlDependency (which has absolutely nothing to do with triggers, by the way; read The Mysterious Notification to understand how it actually works). You can mix LINQ with SqlDependency; see linq2cache.
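A minimal SqlDependency sketch (it requires Service Broker to be enabled on the database; the connection string, table, column names, and RefreshGui are placeholders):

```csharp
using System.Data.SqlClient;

// Call SqlDependency.Start(connectionString) once at app startup
// and SqlDependency.Stop(connectionString) at shutdown.
void Subscribe(string connectionString)
{
    using (var conn = new SqlConnection(connectionString))
    {
        conn.Open();
        // Notification queries must use a two-part table name and
        // an explicit column list (no SELECT *).
        using (var cmd = new SqlCommand(
            "SELECT Id, InsertedAt FROM dbo.Readings", conn))
        {
            var dep = new SqlDependency(cmd);
            dep.OnChange += (s, e) =>
            {
                // Fires once per subscription, so re-subscribe,
                // then re-query the newest 50 rows for the GUI.
                Subscribe(connectionString);
                RefreshGui();
            };
            using (var rdr = cmd.ExecuteReader())  // executing registers the subscription
            {
                while (rdr.Read()) { /* initial fill of the GUI list */ }
            }
        }
    }
}
```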
Do you have to query to be notified of new data?
You may be better off using push notifications from a Service Bus (eg: NServiceBus).
Using notifications (i.e events) is almost always a better solution than using polling.