I have a C# program that runs on an EC2 cloud server and connects to an RDS SQL DB. It collects some data, checks whether that data already exists in an SQL DB table, and if not it adds the item to a list to be bulk saved later. Now I'm struggling a bit on what the best approach is here.
I've tried these options so far:
a) Before inserting, make a DB call to see if the record exists.
Pro: less memory intensive. Can run on cheaper cloud servers.
Cons: higher DB reads. Slower performance
b) Put a Unique Constraint on the tables and, in code, catch the Unique Constraint exception and move on.
Pro: less memory intensive. Can run on cheaper cloud servers.
Cons: higher DB reads. MUCH slower performance. The identity increment from the failed insert doesn't get rolled back, which eventually leads to maxing out the int Id column.
c) At the start, make a bulk SQL call for all the lookup values and put them all into a HashSet. Reference this HashSet to see whether a value exists or not.
Pro: Much much faster. Much less DB reads
Cons: Pay more for servers with more memory. Risk hitting Out of memory errors as it scales.
The reason I say making single SQL calls is slow is the latency of the RDS DB. It's minor, but it adds up when doing 100k calls in a row to see if something exists. Ideally I'd like performance to be as fast as possible. I've considered using DynamoDB, but the pricing is just too much and doesn't make much sense in my situation.
So is there some way to save lookups to a file on the local disk that can be quickly referenced later?
If so, please tell me what it is!
One quick option you could try is to write a stored procedure that accepts the insert data; the stored procedure would check for the record and, if not found, add it, or else return an error.
Doesn't cut down on the reads or writes, but does cut down on the roundtrips to the server, and you did say the latency is a problem.
Another option is to use a temp table: do inserts into it without checks (fast), and then run an "insert into mastertable select * from temptable where the record is not in mastertable" type of query to insert the rows as a batch. The usefulness of this option will depend on whether or not you can bunch up a batch of rows to insert at once.
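If it helps, here is a rough C# sketch of that temp-table approach (the table and column names are just placeholders, and I'm assuming System.Data.SqlClient plus SqlBulkCopy for the unchecked inserts):

using System.Data;
using System.Data.SqlClient;  // or Microsoft.Data.SqlClient on newer stacks

static void BulkInsertMissing(string connectionString, DataTable candidates)
{
    using (var conn = new SqlConnection(connectionString))
    {
        conn.Open();

        // 1. Session-scoped temp table; it disappears when this connection closes.
        using (var create = new SqlCommand(
            "CREATE TABLE #Staging (LookupValue nvarchar(100) NOT NULL);", conn))
        {
            create.ExecuteNonQuery();
        }

        // 2. Dump all candidate rows into the temp table without any existence checks.
        using (var bulk = new SqlBulkCopy(conn) { DestinationTableName = "#Staging" })
        {
            bulk.WriteToServer(candidates);
        }

        // 3. One set-based insert of only the rows not already in the master table.
        using (var insert = new SqlCommand(@"
            INSERT INTO dbo.MasterTable (LookupValue)
            SELECT s.LookupValue
            FROM #Staging s
            WHERE NOT EXISTS (SELECT 1 FROM dbo.MasterTable m
                              WHERE m.LookupValue = s.LookupValue);", conn))
        {
            insert.ExecuteNonQuery();
        }
    }
}

The whole round trip is three commands regardless of how many rows you batch up, which is where the latency savings come from.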
I have the following scenario: I am building a dummy web app that pulls betting odds every minute, stores all the events, matches, odds etc. to the database and then updates the UI.
I have this structure: Sports > Events > Matches > Bets > Odds. I am using a code-first approach, and for all DB-related operations I am using EF.
When I am running my application for the very first time and my database is empty I am receiving XML with odds which contains: ~16 sports, ~145 events, ~675 matches, ~17100 bets & ~72824 odds.
Here comes the problem: how do I save all these entities in a timely manner? Parsing is not that time-consuming an operation - 0.2 seconds - but when I try to bulk store all these entities I face memory problems, and the save takes more than 1 minute, so the next odds pull is triggered before it finishes, which is a nightmare.
I saw a suggestion somewhere to disable Configuration.AutoDetectChangesEnabled and recreate my context every 100/1000 records I insert, but I am not nearly there. Every suggestion will be appreciated. Thanks in advance.
When you are inserting huge (though it is not that huge) amounts of data like that, try using SqlBulkCopy. You can also try using a table-valued parameter (TVP) and pass it to a stored procedure, but I do not suggest it for this case, as TVPs perform well for fewer than 1000 records. SqlBulkCopy is super easy to use, which is a big plus.
If you need to do an update to many records, you can use SqlBulkCopy for that as well but with a little trick. Create a staging table and insert the data using SqlBulkCopy into the staging table, then call a stored procedure which will get records from the staging table and update the target table. I have used SqlBulkCopy for both cases numerous times and it works pretty well.
Furthermore, with SqlBulkCopy you can do the insertion in batches as well and provide feedback to the user, however, in your case, I do not think you need to do that. But nonetheless, this flexibility is there.
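For reference, a bare-bones SqlBulkCopy call looks something like this (the destination table and column names are placeholders for your schema, not anything EF generates):

using System;
using System.Data;
using System.Data.SqlClient;  // or Microsoft.Data.SqlClient on newer stacks

// Sketch: bulk-insert a DataTable of odds into a staging table in batches.
static void BulkLoad(string connectionString, DataTable odds)
{
    using (var bulk = new SqlBulkCopy(connectionString))
    {
        bulk.DestinationTableName = "dbo.OddsStaging";
        bulk.BatchSize = 5000;      // send rows in batches instead of one giant statement
        bulk.BulkCopyTimeout = 0;   // no timeout for large loads

        // Explicit mappings so the DataTable column order doesn't matter.
        bulk.ColumnMappings.Add("MatchId", "MatchId");
        bulk.ColumnMappings.Add("BetId", "BetId");
        bulk.ColumnMappings.Add("OddValue", "OddValue");

        // Optional progress feedback.
        bulk.NotifyAfter = 5000;
        bulk.SqlRowsCopied += (s, e) => Console.WriteLine(e.RowsCopied + " rows copied");

        bulk.WriteToServer(odds);
    }
}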
Can I do it using EF only?
I have not tried it, but there is this library you can try.
I understand your situation, but all the approaches you've been trying depend on your machine specs and the software itself. If the machine specs cannot handle the process, it is time to change the plan, for example by limiting the count of records inserted per batch until it is all done.
I have a table with a lot of rows (3 million) from which I need to query some rows at several points in my app. The way I found to do this is to query all the data the first time any of it is needed and store it in a static DataTable with SqlDataAdapter.Fill() for the rest of the app's life.
That's fast, because then when I need something I use DataTable.Select("some query") and the app processes the info just fine.
The problem is that this table takes about 800MB of RAM, and I have to run this app in PCs where it might be too much.
The other way I thought was to query the data I need each time. This takes little memory but has poor performance (a lot of queries to the database, which is at a network address and with 1000 queries you start to notice the ping and all that..).
Is there any intermediate point between performance and memory usage?
EDIT: What I'm retrieving are sales, which have a date, a product and a quantity. I query by product, and it isn't indexed that way. But anyways, making 1000 queries, even if the query took 0.05s, a 0.2s ping makes a total of 200 seconds...
First talk to the dba about performance
If you are downloading the entire table you might actually be putting more load on the network and SQL than if you performed individual queries.
As a dba if I knew you were downloading an entire large table I would put an index on product immediately.
Why are you performing 1000s of queries?
If you are looking for sales when a product is created, then a cache is problematic, as you would not yet have sales data. The problem with a cache is stale data. If you know the data will not change (you either have it or you don't), then you can eliminate the concern about stale data.
There is a middle ground between querying one at a time and loading everything: you can pack multiple selects into a single request. This makes a single round trip and is more efficient.
select * from tableA where ....;
select * from tableB where ....;
With a DataReader, just call the SqlDataReader.NextResult() method:
using (SqlDataReader rdr = cmd.ExecuteReader())
{
    // Read the rows returned by the first select
    while (rdr.Read())
    {
    }

    // Advance to the result set of the second select
    rdr.NextResult();

    while (rdr.Read())
    {
    }
}
Pretty sure you can do the same type of thing with multiple DataTables in a DataSet.
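For example, with a SqlDataAdapter you can fill a DataSet from the same batched command and each select lands in its own DataTable (a sketch; the connection string, table and parameter names are placeholders):

using System.Data;
using System.Data.SqlClient;  // or Microsoft.Data.SqlClient on newer stacks

// Sketch: two selects, one round trip, two DataTables in the resulting DataSet.
static DataSet LoadBoth(string connectionString, int productId)
{
    using (var conn = new SqlConnection(connectionString))
    using (var adapter = new SqlDataAdapter(
        "SELECT * FROM tableA WHERE product = @p; " +
        "SELECT * FROM tableB WHERE product = @p;", conn))
    {
        adapter.SelectCommand.Parameters.AddWithValue("@p", productId);

        var ds = new DataSet();
        adapter.Fill(ds);   // ds.Tables[0] holds tableA's rows, ds.Tables[1] tableB's

        return ds;
    }
}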
Another option is LocalDB. It is targeted at developers, but for what you are doing it would work just fine: DataTable speed without the memory concerns. You can even put an index on ProductID. It will take a little longer to write to disk compared to memory, but you are not using up memory.
Then there is the ever-evil WITH (NOLOCK). Know what you are doing; I am not going to go into all the possible evils, but I can tell you that I use it a lot.
The question boils down to Memory vs Performance. The answer to that is caching.
If you know what your usage pattern would be like, then one thing you can do is to create a local cache in the app.
The extreme cases are: your cache size is 800MB with all your data in it (thereby sacrificing memory), or your cache size is 0MB and all your queries go to the network (thereby sacrificing performance).
Three important questions about the design of the cache are answered below.
How to populate the Cache?
If you are likely to make some query multiple times, store it in cache and before going to network, test if your cache already has the result. If it doesn't, query the database and then store the result in the cache.
If after querying for some data, you are likely to query the next and/or previous piece of data, then query all of it once and cache it so that when you query the next piece, you already have it in cache.
Effectively the idea is that if you know some information may be needed in future, cache it beforehand.
How to free the Cache?
You can decide the freeing mechanism for the cache either actively or passively.
Passively: whenever the cache is full, evict data from it.
Actively: run a background thread at regular intervals that takes care of removal for you.
One hybrid method is to run a freeing thread as soon as you reach, let's say, 80% of your memory limit and then free whatever memory you can.
What data to remove from the Cache?
This has already been answered in the context of page replacement policies for operating systems.
For completeness, I'll summarize the important ones here:
Evict the Least Recently Used data (if it is not likely to be used);
Evict the data that was brought in earliest (if the earliest data is not likely to be used);
Evict the data that was brought in latest (if you think that the newly brought in data is least likely to be used).
Automatically remove the data that is older than t time units.
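To make the "least recently used" option above concrete, here is a minimal LRU cache sketch (purely illustrative: not thread-safe, and the capacity and key/value types are whatever fits your data):

using System.Collections.Generic;

// Minimal LRU cache: a dictionary for O(1) lookup plus a linked list kept in
// most-recently-used-first order.
class LruCache<TKey, TValue>
{
    private readonly int _capacity;
    private readonly Dictionary<TKey, LinkedListNode<KeyValuePair<TKey, TValue>>> _map
        = new Dictionary<TKey, LinkedListNode<KeyValuePair<TKey, TValue>>>();
    private readonly LinkedList<KeyValuePair<TKey, TValue>> _order
        = new LinkedList<KeyValuePair<TKey, TValue>>();

    public LruCache(int capacity) { _capacity = capacity; }

    public bool TryGet(TKey key, out TValue value)
    {
        LinkedListNode<KeyValuePair<TKey, TValue>> node;
        if (_map.TryGetValue(key, out node))
        {
            _order.Remove(node);      // touched: move to the front (most recently used)
            _order.AddFirst(node);
            value = node.Value.Value;
            return true;
        }
        value = default(TValue);
        return false;
    }

    public void Put(TKey key, TValue value)
    {
        LinkedListNode<KeyValuePair<TKey, TValue>> existing;
        if (_map.TryGetValue(key, out existing))
        {
            _order.Remove(existing);
            _map.Remove(key);
        }
        else if (_map.Count >= _capacity)
        {
            // Evict the least recently used entry (the tail of the list).
            var lru = _order.Last;
            _order.RemoveLast();
            _map.Remove(lru.Value.Key);
        }
        var node = new LinkedListNode<KeyValuePair<TKey, TValue>>(
            new KeyValuePair<TKey, TValue>(key, value));
        _order.AddFirst(node);
        _map[key] = node;
    }
}

On a cache miss you query the database, Put the result, and return it; on a hit you skip the network entirely.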
RE: "I can't index by anything because I'm not the database admin nor can ask for that."
Can you prepopulate a temp table and put an index on that? e.g.
Select * into #MyTempTable from BigHugeTable
Create Index Prodidx on #MyTempTable (product)
You will have to ensure you always reuse the same connection (and it isn't closed) in order to use the temp table.
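Roughly like this; the key point is that the #temp table only lives as long as the connection that created it (everything here apart from the two SQL statements above is a placeholder):

using System.Data.SqlClient;  // or Microsoft.Data.SqlClient on newer stacks

// Build the indexed temp table once and keep the connection open for later queries.
static SqlConnection BuildTempTable(string connectionString)
{
    var conn = new SqlConnection(connectionString);
    conn.Open();

    using (var setup = new SqlCommand(
        "Select * into #MyTempTable from BigHugeTable; " +
        "Create Index Prodidx on #MyTempTable (product);", conn))
    {
        setup.CommandTimeout = 0;   // the initial copy of 3 million rows can take a while
        setup.ExecuteNonQuery();
    }

    return conn;   // keep this open; closing it drops #MyTempTable
}

// Later, as many times as needed, on the SAME connection:
static void QueryProduct(SqlConnection conn, int productId)
{
    using (var cmd = new SqlCommand(
        "select * from #MyTempTable where product = @p;", conn))
    {
        cmd.Parameters.AddWithValue("@p", productId);
        using (var reader = cmd.ExecuteReader())
        {
            while (reader.Read())
            {
                // process each sales row here
            }
        }
    }
}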
We have a C# application which parses data from text files. We then have to update records in our sql database based on the information in the text files. What's the most efficient way for passing the data from application to SQL server?
We currently use a delimited string and then loop through the string in a stored procedure to update the records. I am also testing using TVP (table valued parameter). Are there any other options out there?
Our files contain thousands of records and we would like a solution that takes the least amount of time.
Please do not use a DataTable, as that is just wasting CPU and memory for no benefit (other than possibly familiarity). I have detailed a very fast and flexible approach in my answer to the following question, which is very similar to this one:
How can I insert 10 million records in the shortest time possible?
The example shown in that answer is for INSERT only, but it can easily be adapted to include UPDATE. Also, it uploads all rows in a single shot, but that can also be easily adapted to set a counter for X number of records and to exit the IEnumerable method after that many records have been passed in, and then close the file once there are no more records. This would require storing the File pointer (i.e. the stream) in a static variable to keep passing to the IEnumerable method so that it can be advanced and picked up at the most recent position the next time around. I have a working example of this method shown in the following answer, though it was using a SqlDataReader as input, but the technique is the same and requires very little modification:
How to split one big table that has 100 million data to multiple tables?
And for some perspective, 50k records is not even close to "huge". I have been uploading / merging / syncing data using the method I am showing here on 4 million row files and that hit several tables with 10 million (or more) rows.
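To give the general shape of that pattern (this is a stripped-down sketch, not the exact code from the linked answer; the table type dbo.ImportRow, the proc dbo.ImportData, the columns and the '|' delimiter are all placeholders):

using System.Collections.Generic;
using System.Data;
using System.Data.SqlClient;       // or Microsoft.Data.SqlClient on newer stacks
using System.IO;
using Microsoft.SqlServer.Server;  // SqlDataRecord / SqlMetaData

// Streams rows straight from the text file into the TVP; no DataTable, nothing buffered.
static IEnumerable<SqlDataRecord> ReadFileRows(string path)
{
    var meta = new[]
    {
        new SqlMetaData("RecordId", SqlDbType.Int),
        new SqlMetaData("Amount", SqlDbType.Decimal, 18, 2)
    };

    using (var file = new StreamReader(path))
    {
        string line;
        while ((line = file.ReadLine()) != null)
        {
            var parts = line.Split('|');
            var record = new SqlDataRecord(meta);
            record.SetInt32(0, int.Parse(parts[0]));
            record.SetDecimal(1, decimal.Parse(parts[1]));
            yield return record;   // one row at a time, as SQL Server pulls them
        }
    }
}

static void Import(string connectionString, string path)
{
    using (var conn = new SqlConnection(connectionString))
    using (var cmd = new SqlCommand("dbo.ImportData", conn))
    {
        cmd.CommandType = CommandType.StoredProcedure;

        var p = cmd.Parameters.Add("@Rows", SqlDbType.Structured);
        p.TypeName = "dbo.ImportRow";
        p.Value = ReadFileRows(path);   // enumerated during ExecuteNonQuery, so rows stream

        conn.Open();
        cmd.ExecuteNonQuery();          // the proc does the set-based INSERT and/or UPDATE
    }
}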
Things to not do:
Use a DataTable: as I said, if you are just filling it for the purpose of using with a TVP, it is a waste of CPU, memory, and time.
Make 1 update at a time in parallel (as suggested in a comment on the question): this is just crazy. Relational database engines are heavily tuned to work most efficiently with sets, not singleton operations. There is no way that 50k inserts will be more efficient than even 500 inserts of 100 rows each. Doing it individually just guarantees more contention on the table, even if just row locks (it's 100k lock + unlock operations). It could be faster than a single 50k row transaction that escalates to a table lock (as Aaron mentioned), but that is why you do it in smaller batches, just so long as small does not mean 1 row ;).
Set the batch size arbitrarily. Staying below 5000 rows is good to help reduce chances of lock escalation, but don't just pick 200. Experiment with several batch sizes (100, 200, 500, 700, 1000) and try each one a few times. You will see what is best for your system. Just make sure that the batch size is configurable through the app.config file or some other means (table in the DB, registry setting, etc.) so that it can be changed without having to re-deploy code.
SSIS (powerful, but very bulky and not fun to debug)
Things which work, but not nearly as flexible as a properly done TVP (i.e. passing in a method that returns IEnumerable<SqlDataRecord>). These are ok, but why dump the records into a temp table just to have to parse them into the destination when you can do it all inline?
BCP / OPENROWSET(BULK...) / BULK INSERT
.NET's SqlBulkCopy
The best way to do this in my opinion is to create a temp table then use SqlBulkCopy to insert into that temp table (https://msdn.microsoft.com/en-us/library/system.data.sqlclient.sqlbulkcopy%28v=vs.110%29.aspx), and then simply Update the table based on the temp table.
Based on my tests (using Dapper and also LINQ), updating in bulk or in batches takes far longer than just creating a temp table and sending a command to the server to update the data based on the temp table. The process is faster because SqlBulkCopy populates the data natively and quickly, and the rest is completed on the SQL Server side, which goes through fewer calculation steps, and at that point the data already resides on the server.
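For anyone who wants the shape of that, a rough sketch (the target table, temp table and columns are placeholders):

using System.Data;
using System.Data.SqlClient;  // or Microsoft.Data.SqlClient on newer stacks

// Sketch: bulk copy the changed rows into a temp table, then one set-based UPDATE.
static void BulkUpdate(string connectionString, DataTable changes)
{
    using (var conn = new SqlConnection(connectionString))
    {
        conn.Open();

        // 1. Temp table shaped like the incoming data.
        using (var create = new SqlCommand(
            "CREATE TABLE #Changes (Id int PRIMARY KEY, Amount decimal(18,2));", conn))
        {
            create.ExecuteNonQuery();
        }

        // 2. Fast native load into the temp table.
        using (var bulk = new SqlBulkCopy(conn) { DestinationTableName = "#Changes" })
        {
            bulk.WriteToServer(changes);
        }

        // 3. Single join-based update executed entirely on the server.
        using (var update = new SqlCommand(@"
            UPDATE t
            SET    t.Amount = c.Amount
            FROM   dbo.Target t
            JOIN   #Changes c ON c.Id = t.Id;", conn))
        {
            update.ExecuteNonQuery();
        }
    }
}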
I am writing an app that reads a whole table, does some processing, then writes the resulting data to another table. I am using the SqlBulkCopy class (the .NET version of "bcp in"), which does the insert very fast. But I cannot find any efficient way to select the data in the first place. There is no .NET equivalent of "bcp out", which seems strange to me.
Currently I'm using select * from table_name. For perspective, it takes 2.5 seconds to select 6,000 rows ... and only 600ms to bulk insert the same number of rows.
I would expect that selecting data should always be faster than inserting. What is the fastest way to select all rows & columns from a table?
Answers to questions:
I timed my select at 2.5 seconds in two ways. The first was while running my application with a SQL trace; the second was running the same query in SSMS. Both returned about the same result.
I am reading data using SqlDataReader.
No other applications are using this database.
My current processing takes under 1 second, so 2+ second read time is relatively large. But mostly I'm concerned(interested) in performance when scaling this up to 100,000 rows and millions of rows.
Sql Server 08r2 and my application are both running on my dev machine.
Some of the data processing is set based so I need to have the whole table in memory (to support much larger data sets, I know this step will probably need to be moved into SQL so I only need to operate per row in memory)
Here is my code:
DataTable staging = new DataTable();

using (SqlConnection dwConn = (SqlConnection)SqlConnectionManager.Instance.GetDefaultConnection())
{
    dwConn.Open();
    using (SqlCommand cmd = dwConn.CreateCommand())
    {
        cmd.CommandText = "select * from staging_table";
        using (SqlDataReader reader = cmd.ExecuteReader())
        {
            staging.Load(reader);
        }
    }
}
select * from table_name is the simplest, easiest and fastest way to read a whole table.
Let me explain why your results lead to wrong conclusions.
Copying a whole table is an optimized operation that merely requires cloning the old binary data into the new location (at most you can perform a file copy operation, depending on the storage mechanism).
Writing is buffered. DBMS says the record was written but it's actually not yet done, unless you work with transactions. Disk operations are generally delayed.
Querying a table also requires (unlike cloning) adapting data from the binary-stored layout/format to a driver-dependent format that is ultimately readable by your client. This takes time.
It all depends on your hardware, but it is likely that your network is the bottleneck here.
Apart from limiting your query to read just the columns you'd actually be using, doing a select is as fast as it will get. There is caching involved here: when you execute it twice in a row, the second time should be much faster because the data is cached in memory. Execute dbcc dropcleanbuffers to check the effect of caching.
If you want to do it as fast as possible try to implement the code that does the processing in T-SQL, that way it could operate directly on the data right there on the server.
Another good tip for speed tuning is to have the table that is being read on one disk (look at filegroups) and the table that is written to on another disk. That way one disk can do a continuous read and the other a continuous write. If both operations happen on the same disk, the heads of the disk keep going back and forth, which seriously degrades performance.
If the logic you're writing cannot be done in T-SQL, you could also have a look at SQL CLR.
Another tip: when you do select * from table, use a datareader if possible. That way you don't materialize the whole thing in memory first.
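For example, instead of staging.Load(reader) you can process each row as it streams in (this mirrors the question's code; the column names and types here are made up):

using System.Data.SqlClient;  // or Microsoft.Data.SqlClient on newer stacks

using (SqlConnection dwConn = (SqlConnection)SqlConnectionManager.Instance.GetDefaultConnection())
{
    dwConn.Open();
    using (SqlCommand cmd = dwConn.CreateCommand())
    {
        cmd.CommandText = "select * from staging_table";
        using (SqlDataReader reader = cmd.ExecuteReader())
        {
            int idOrdinal = reader.GetOrdinal("Id");          // placeholder column
            int amountOrdinal = reader.GetOrdinal("Amount");  // placeholder column

            while (reader.Read())
            {
                int id = reader.GetInt32(idOrdinal);
                decimal amount = reader.GetDecimal(amountOrdinal);
                // process this row here; only one row is held in memory at a time
            }
        }
    }
}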
GJ
It is generally a good idea to include the column names in the select list, but with today's RDBMSs it won't make much difference; you will only see a difference in this regard if you limit the columns selected. Generally speaking, it is good practice to include column names. But to answer your question: yes, it seems a select is indeed slower than inserting in the scenario you describe.
And yes, select * from table_name is indeed the fastest way to read all rows and columns from a table.
I have an SQL Server 2008 database and am using C# 4.0 with Linq to Entities classes set up for database interaction.
There exists a table which is indexed on a DateTime column whose value is the insertion time for the row. Several new rows are added per second (~20) and I need to effectively pull them into memory so that I can display them in a GUI. For simplicity, let's just say I need to show the newest 50 rows in a list displayed via WPF.
I am concerned with the load polling may place on the database and the time it will take to process new results, forcing me to become a slow consumer (getting stuck behind a backlog). I was hoping for some advice on an approach. The ones I'm considering are:
Poll the database in a tight loop (~1 result per query)
Poll the database every second (~20 results per query)
Create a database trigger for Inserts and tie it to an event in C# (SqlDependency)
I also have some options for access:
Linq-to-Entities Table Select
Raw SQL Query
Linq-to-Entities Stored Procedure
If you could shed some light on the pros and cons or suggest another way entirely I'd love to hear it.
The process which adds the rows to the table is not under my control, I wish only to read the rows never to modify or add. The most important things are to not overload the SQL Server, keep the GUI up to date and responsive and use as little memory as possible... you know, the basics ;)
Thanks!
I'm a little late to the party here, but if your edition of SQL Server 2008 has it, there is a feature known as Change Data Capture that may help. Basically, you have to enable it both for the database and for the specific tables you need to capture. The built-in Change Data Capture process looks at the transaction log to determine what changes have been made to the table and records them in a pre-defined table structure. You can then query this table or pull results from it into something friendlier (perhaps on another server altogether?). We are in the early stages of using this feature for a particular business requirement, and it seems to be working quite well thus far.
You would have to test whether this feature would meet your needs as far as speed, but it may help maintenance since no triggers are required and the data capture does not tie up your database tables themselves.
Rather than polling the database, maybe you can use the SQL Server Service broker and perform the read from there, even pushing which rows are new. Then you can select from the table.
The most important thing I would see here is having an index on the way you identify new rows (a timestamp?). That way your query would select the top entries from the index instead of querying the table every time.
Test, test, test! Benchmark your performance for any tactic you want to try. The biggest issues to resolve are how the data is stored and any locking and consistency issues you need to deal with.
If your table is updated constantly with 20 rows a second, then there is nothing better to do than poll every second or every few seconds. As long as you have an efficient way (meaning an index or clustered index) to retrieve the last rows that were inserted, this method will consume the fewest resources.
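A minimal sketch of that incremental poll: keep the newest timestamp you have seen and only ask for rows after it, so each poll is an index seek (the TOP 50 comes from the question's scenario; the table and column names are invented):

using System;
using System.Data.SqlClient;  // or Microsoft.Data.SqlClient on newer stacks

// Returns the newest timestamp seen so it can be passed back in on the next poll.
static DateTime PollNewRows(string connectionString, DateTime lastSeen)
{
    using (var conn = new SqlConnection(connectionString))
    using (var cmd = new SqlCommand(@"
        SELECT TOP (50) Id, InsertedAt
        FROM   dbo.Readings
        WHERE  InsertedAt > @lastSeen
        ORDER BY InsertedAt DESC;", conn))
    {
        cmd.Parameters.AddWithValue("@lastSeen", lastSeen);
        conn.Open();

        using (var reader = cmd.ExecuteReader())
        {
            while (reader.Read())
            {
                DateTime insertedAt = reader.GetDateTime(1);
                if (insertedAt > lastSeen) lastSeen = insertedAt;
                // hand the row to the GUI (e.g. via the WPF Dispatcher) here
            }
        }
    }
    return lastSeen;
}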
If the updates occur in bursts of 20 per second but with significant periods of inactivity (minutes) in between, then you can use SqlDependency (which has absolutely nothing to do with triggers, by the way; read The Mysterious Notification to understand how it actually works). You can mix LINQ with SqlDependency, see linq2cache.
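If you do go the SqlDependency route, the basic pattern is roughly this (it requires Service Broker to be enabled on the database, and the query must follow the notification rules: explicit column list, two-part table names, no select *; the names below are placeholders):

using System;
using System.Data.SqlClient;  // SqlDependency also exists in recent Microsoft.Data.SqlClient

static void Subscribe(string connectionString)
{
    SqlDependency.Start(connectionString);   // once per app domain / connection string

    var conn = new SqlConnection(connectionString);
    conn.Open();

    var cmd = new SqlCommand("SELECT Id, InsertedAt FROM dbo.Readings", conn);

    var dependency = new SqlDependency(cmd);
    dependency.OnChange += (sender, e) =>
    {
        // A notification fires only once: re-query and re-subscribe here.
        Console.WriteLine("Change detected: " + e.Info);
    };

    using (var reader = cmd.ExecuteReader())   // executing the command registers the subscription
    {
        while (reader.Read())
        {
            // initial load of the current rows
        }
    }
}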
Do you have to query to be notified of new data?
You may be better off using push notifications from a Service Bus (eg: NServiceBus).
Using notifications (i.e. events) is almost always a better solution than using polling.