I am transferring about 350 rows (with some data collection) from a MS SQL Server to the iSeries for processing. I feel the process, which takes about a minute, is too slow. I am doing all of the MS SQL work in LINQ2SQL. Here are the basics of what I am currently doing:
Collect all of the vehicle master data to process one-at-a-time.
SUM() Fuel usage by vehicle
SUM() Oil usage by vehicle
SUM() Parts used by vehicle
SUM() Labor by vehicle
SUM() Outside Repairs by vehicle
SUM() Accident Costs by vehicle
I realize this is a lot of queries, but most of these come from different tables in the MS SQL Server, and all of them require at least one join. I am thinking of combining Oil and Parts into one query and Outside Repairs and Accident Costs into one query, since both of those are stored in the same tables, to see if that improves performance any.
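Roughly what I have in mind for the combined query, assuming the oil and parts rows really do live in one transaction table with a type column (the names here are made up, since the real schema is vendor-delivered):

// Sketch only: one grouped query with conditional sums replaces two separate
// SUM() queries. LINQ to SQL translates the conditionals into SUM(CASE WHEN ...).
var oilAndParts =
    from t in db.MaintenanceTransactions          // assumed table name
    group t by t.VehicleId into g
    select new
    {
        VehicleId  = g.Key,
        OilTotal   = g.Sum(x => x.ItemType == "OIL"  ? x.Amount : 0m),
        PartsTotal = g.Sum(x => x.ItemType == "PART" ? x.Amount : 0m)
    };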
Do you have any other suggestions?
Note that this is a vendor-delivered product, and I would prefer not to create any stored procedures or views that aren't already in the database (of which there are basically none).
Update: I had another post looking at alternatives for improving the speed.
You could perhaps launch these queries on separate threads and wait for them all to return? Then, I would guess, all of your calculations would get done in roughly half the time they take now.
Grouping your results per table is, in my view, a good idea since you are already processing that data.
Grouping your queries per table and launching them on different threads should certainly gain you some performance; whether it is optimal depends on your situation.
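A rough sketch of that idea, assuming LINQ to SQL and made-up context/table names; note that each task needs its own DataContext, because a DataContext is not thread-safe (Task.Run needs .NET 4.5, but plain threads or the ThreadPool would work the same way):

// Run two of the independent SUM queries concurrently; add the rest the same way.
// FleetDataContext, FuelTransactions, LaborTransactions etc. are placeholders.
var fuelTask = Task.Run(() =>
{
    using (var db = new FleetDataContext())
        return db.FuelTransactions
                 .GroupBy(t => t.VehicleId)
                 .Select(g => new { VehicleId = g.Key, Total = g.Sum(x => x.Gallons) })
                 .ToList();
});

var laborTask = Task.Run(() =>
{
    using (var db = new FleetDataContext())
        return db.LaborTransactions
                 .GroupBy(t => t.VehicleId)
                 .Select(g => new { VehicleId = g.Key, Total = g.Sum(x => x.Hours) })
                 .ToList();
});

Task.WaitAll(fuelTask, laborTask);
var fuelTotals  = fuelTask.Result;   // join these back to the vehicle master list
var laborTotals = laborTask.Result;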
It appears the database was poorly designed, and whatever LINQ was generating in the background was highly inefficient code. I am not saying LINQ is bad; it was just bad for this database. I converted to a quickly thrown-together .XSD setup and the processing time went from 1.25 minutes to 15 seconds. Once I do a proper redesign, I expect to shave a few more seconds off of that. I'll try LINQ again some other day on a better database.
If performance is important (one minute is a problem?) you might consider using a summary table. Then you just have to query the summary table for your report. The summary table could be built with triggers or with a nightly batch extraction to the summary table.
Related
I have the following scenario: I am building a dummy web app that pulls betting odds every minute, stores all the events, matches, odds etc. to the database and then updates the UI.
I have this structure: Sports > Events > Matches > Bets > Odds and I am using code first approach and for all DB-related operations I am using EF.
When I am running my application for the very first time and my database is empty I am receiving XML with odds which contains: ~16 sports, ~145 events, ~675 matches, ~17100 bets & ~72824 odds.
Here comes the problem: how do I save all these entities in a timely manner? Parsing is not a time-consuming operation (about 0.2 seconds), but when I try to bulk store all these entities I run into memory problems, and the save takes more than 1 minute, so the next odds pull is triggered before the previous one has finished, which is a nightmare.
I have seen suggestions to disable Configuration.AutoDetectChangesEnabled and to recreate my context every 100/1000 records I insert, but I am still not nearly where I need to be. Every suggestion will be appreciated. Thanks in advance.
When you are inserting huge (though it is not that huge) amounts of data like that, try using SqlBulkCopy. You can also try using a table-valued parameter (TVP) and passing it to a stored procedure, but I do not suggest it for this case, as TVPs perform well for fewer than about 1,000 records. SqlBulkCopy is super easy to use, which is a big plus.
If you need to do an update to many records, you can use SqlBulkCopy for that as well but with a little trick. Create a staging table and insert the data using SqlBulkCopy into the staging table, then call a stored procedure which will get records from the staging table and update the target table. I have used SqlBulkCopy for both cases numerous times and it works pretty well.
Furthermore, with SqlBulkCopy you can do the insertion in batches as well and provide feedback to the user, however, in your case, I do not think you need to do that. But nonetheless, this flexibility is there.
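A minimal sketch of the insert path, assuming a flat staging table and a merge stored procedure you create yourself (names like Odds_Staging and MergeOddsFromStaging are made up):

// using System.Data;
// using System.Data.SqlClient;

// Flatten the parsed odds into a DataTable whose columns match the staging table.
var table = new DataTable();
table.Columns.Add("MatchId", typeof(int));
table.Columns.Add("BetId", typeof(int));
table.Columns.Add("OddValue", typeof(decimal));

foreach (var odd in parsedOdds)                       // parsedOdds = your parsed XML entities
    table.Rows.Add(odd.MatchId, odd.BetId, odd.Value);

using (var connection = new SqlConnection(connectionString))
{
    connection.Open();

    using (var bulk = new SqlBulkCopy(connection))
    {
        bulk.DestinationTableName = "dbo.Odds_Staging";
        bulk.BatchSize = 5000;                        // optional: stream in batches
        bulk.BulkCopyTimeout = 0;                     // no timeout for large loads
        bulk.WriteToServer(table);
    }

    // Merge/insert from the staging table into the real Odds table.
    using (var merge = new SqlCommand("dbo.MergeOddsFromStaging", connection))
    {
        merge.CommandType = CommandType.StoredProcedure;
        merge.ExecuteNonQuery();
    }
}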
Can I do it using EF only?
I have not tried it myself, but there is this library you can try.
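If you want to stay with plain EF and no extra library, the AutoDetectChangesEnabled / context-recycling idea from your question would look roughly like this sketch (OddsContext and the Odds set name are assumptions):

// Sketch only: add in batches, save, and replace the context so the change
// tracker never grows to tens of thousands of entities.
const int batchSize = 1000;

var context = new OddsContext();
context.Configuration.AutoDetectChangesEnabled = false;
context.Configuration.ValidateOnSaveEnabled = false;
try
{
    int count = 0;
    foreach (var odd in parsedOdds)
    {
        context.Odds.Add(odd);
        if (++count % batchSize == 0)
        {
            context.SaveChanges();
            context.Dispose();                        // throw away the tracked entities
            context = new OddsContext();
            context.Configuration.AutoDetectChangesEnabled = false;
            context.Configuration.ValidateOnSaveEnabled = false;
        }
    }
    context.SaveChanges();                            // flush the final partial batch
}
finally
{
    context.Dispose();
}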
I understand your situation, but everything you have been doing depends on your machine specs and on the software itself. If the machine cannot handle the process, it may be time to change the plan, for example by limiting the number of records inserted per run until it is all done.
I have a table with more than 10,000,000 Rows.
I need some filters (some IN queries and some LIKE queries) and a dynamic ORDER BY.
I wondered what is the best way to work with big data: pagination, filtering, and ordering.
Of course it's easy to work with Entity Framework, but I think the performance is better with a stored procedure.
I have a table with more than 10,000,000 Rows.
You have a small table, nearly tiny: small enough to cause no problems for anyone who is not abusing the server.
Seriously.
I wondered what is the best way to work with big data,
That starts with HAVING big data, which is generally defined as multiple times the RAM of a low-cost server, which today means around 16 cores and around 128 GB of memory. After that it gets expensive.
General rules are:
DO NOT PAGE. Paging at the start is easy, but getting to the later results is slow: either you precalculate the pages and store them, OR you have to re-execute queries just to throw away results. It works nicely on pages 1-2, then it gets slower.
Of course it's easy to work with Entity Framework, but I think the performance is better with a stored procedure
And why would that be? The overhead of generating the query is tiny, and, contrary to an often-repeated delusion, SQL Server uses query plan caching for everything. A stored procedure is faster only if the compilation overhead is significant (i.e. SMALL data), or if you would otherwise pull a lot of data over the network just to send results back (i.e. the processing should happen only in the database).
For anything else the "general" performance impact is close to zero.
OTOH it allows you to send much more tailored SQL without getting into really bad and ugly stored procedures that either issue dynamic SQL internally or have tons of complex conditions for optional parameters.
What to be careful with:
IN clauses can be terrible for performance. Do not put hundreds of elements in there. If you need that, a stored procedure and a table variable/table-valued parameter that is prepared and joined is the better way (a sketch follows below).
As I said - careful with paging. Someone asking for page 100 and just pressing forward is repeating a TON of processing.
And: attitude adjustment. The time when 10 million rows were considered large was around 20 years ago.
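Here is the sketch mentioned above: the client side of a table-valued-parameter call, assuming you have created a table type dbo.IdList(Id int) and a stored procedure dbo.GetRowsByIds that joins against the parameter.

// using System.Data;
// using System.Data.SqlClient;

// Build the id list as a DataTable whose shape matches dbo.IdList.
var ids = new DataTable();
ids.Columns.Add("Id", typeof(int));
foreach (var id in wantedIds)           // wantedIds: whatever would have gone into IN (...)
    ids.Rows.Add(id);

using (var connection = new SqlConnection(connectionString))
using (var command = new SqlCommand("dbo.GetRowsByIds", connection))
{
    command.CommandType = CommandType.StoredProcedure;

    var parameter = command.Parameters.AddWithValue("@Ids", ids);
    parameter.SqlDbType = SqlDbType.Structured;     // marks it as a table-valued parameter
    parameter.TypeName = "dbo.IdList";

    connection.Open();
    using (var reader = command.ExecuteReader())
    {
        while (reader.Read())
        {
            // read the filtered rows here
        }
    }
}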
I have a database in SQL Server 2012 and want to update a table in it.
My table has three columns; the first column is of type nchar(24). The table is filled with billions of rows. The other two columns are of the same type, but they are null (empty) at the moment.
I need to read the data from the first column and do some calculations with that information. The result of my calculations is two strings, and these two strings are the data I want to write into the two empty columns.
My question is: what is the fastest way to read the information from the first column of the table and update the second and third columns?
Read and update step by step? Read a few rows, do the calculation, update the rows while reading the next few rows?
As this involves billions of rows, performance is the only important thing here.
Let me know if you need any more information!
EDIT 1:
My calculation can't be expressed in SQL.
As the SQL Server is on the local machine, throughput is nothing we have to worry about. One calculation takes about 0.02154 seconds, and I have a total of 2,809,475,760 rows, which is about 280 GB of data.
Normally, DML is best performed in bigger batches. Depending on your indexing structure, a small batch size (maybe 1000?!) can already deliver the best results, or you might need bigger batch sizes (up to the point where you write all rows of the table in one statement).
Bulk updates can be performed by bulk-inserting information about the updates you want to make, and then updating all rows in the batch in one statement. Alternative strategies exist.
As you can't hold all the rows to be updated in memory at the same time, you probably need to look into MARS to be able to perform streaming reads while occasionally writing at the same time. Or, you can do it with two connections. Be careful not to deadlock across connections: SQL Server cannot detect that in principle, and only a timeout will resolve such a (distributed) deadlock. Making the reader run under snapshot isolation is a good strategy here; snapshot isolation means the reader neither blocks nor is blocked.
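Pulling those pieces together, a sketch of the two-connection variant could look like this; Col1/Col2/Col3, the staging table, and the Calculate method are all placeholders for your schema and your calculation.

// using System;
// using System.Data;
// using System.Data.SqlClient;

using (var readConn = new SqlConnection(connectionString))
using (var writeConn = new SqlConnection(connectionString))
{
    readConn.Open();
    writeConn.Open();

    // One-time setup on the database: ALTER DATABASE ... SET ALLOW_SNAPSHOT_ISOLATION ON;
    var readTx = readConn.BeginTransaction(IsolationLevel.Snapshot);
    var select = new SqlCommand("SELECT Col1 FROM dbo.BigTable", readConn, readTx);

    // Batch buffer whose shape matches the (assumed) staging table dbo.BigTable_Staging.
    var batch = new DataTable();
    batch.Columns.Add("Col1", typeof(string));
    batch.Columns.Add("Col2", typeof(string));
    batch.Columns.Add("Col3", typeof(string));

    var applyBatch = new SqlCommand(
        @"UPDATE t SET t.Col2 = s.Col2, t.Col3 = s.Col3
          FROM dbo.BigTable AS t
          JOIN dbo.BigTable_Staging AS s ON s.Col1 = t.Col1;
          TRUNCATE TABLE dbo.BigTable_Staging;", writeConn);

    Action flush = () =>
    {
        using (var bulk = new SqlBulkCopy(writeConn))
        {
            bulk.DestinationTableName = "dbo.BigTable_Staging";
            bulk.WriteToServer(batch);
        }
        applyBatch.ExecuteNonQuery();
        batch.Clear();
    };

    using (var reader = select.ExecuteReader())
    {
        while (reader.Read())
        {
            string col1 = reader.GetString(0);
            string col2, col3;
            Calculate(col1, out col2, out col3);   // placeholder for your ~20 ms calculation
            batch.Rows.Add(col1, col2, col3);

            if (batch.Rows.Count == 10000)         // batch size: find the sweet spot by measuring
                flush();
        }
    }

    if (batch.Rows.Count > 0)
        flush();

    readTx.Commit();
}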
LINQ is pretty efficient in my experience. I wouldn't worry too much about optimizing your code yet; prematurely optimizing your code is typically something you should avoid. Just get it to work first, then refactor as needed. As a side note, I once tested a stored procedure against a LINQ query, and LINQ won (to my amazement).
There is no simple how-to and no one-size-fits-all solution here.
If there are billions of rows, does performance matter that much? It doesn't seem to me that it has to be done within a second.
What is the expected throughput of the database and network? If you're behind a POTS dial-in link, the case is massively different from being on 10 Gb fiber.
The computations: how expensive are they? Just c = a + b, or heavy processing of other text files?
Those are just a couple of questions raised in response. There is a lot more involved that we are not aware of, so it is hard to answer definitively.
Try a couple of things and measure it.
As a general rule: Writing to a database can be improved by batching instead of single updates.
Using an async pattern can free up some of the time for calculations instead of waiting.
EDIT in reply to comment
If a calculation takes 20 ms, the biggest problem is IO; multithreading won't bring you much.
Read the records in sequence using snapshot isolation so the read isn't hampered by write locks, and update in batches. My guess is that the reader stays ahead of the writer without much trouble; reading in batches adds complexity without gaining much.
Find the sweet spot for the right batch size by experimenting.
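A quick way to find that sweet spot is to time the same sample of rows at several candidate batch sizes; UpdateInBatches and sampleRows here are placeholders for your own batched update and a fixed test sample.

// Time each candidate batch size against the same sample and compare.
foreach (var batchSize in new[] { 100, 1000, 10000, 50000 })
{
    var stopwatch = System.Diagnostics.Stopwatch.StartNew();
    UpdateInBatches(sampleRows, batchSize);   // hypothetical helper: run the same sample each time
    stopwatch.Stop();
    Console.WriteLine("Batch size {0,6}: {1} ms", batchSize, stopwatch.ElapsedMilliseconds);
}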
I am still on a learning curve in C# and SQL Server so please forgive my ‘greenness’.
Here is my scenario:
I have an EMPLOYEE table with 10,000 rows. Each of these employees has transactions in a TRANSACTIONS table.
The transaction table has salary elements like basic pay, acting allowance, overtime hours, etc. It also has payroll deductions like advance deductions, some loans (with interest), and savings (pension, social security savings, etc.).
I need to go through each employee’s transactions and compute taxes, outstanding balances on loans, update balances on savings, convert hours into payments/deductions and some other stuff.
This processing will give me a new set of rows for each employee, with a period marker (eg 2013-04 for April 2013). I need to store this in a HISTORY table for future references.
What is the best approach for processing the entire 10,000 employee table and their transactions?
I am told that pulling the entire table into memory via readers is not good practice and I agree.
Do I keep pulling an employee from the database, process their transactions, and commit the history to the database? And pull the next and so forth?
Too many calls to the back end?
(EF not an option for me, still doing raw SQL in ADO.NET)
I will appreciate any help on this.
10,000 rows is not much. Memory can easily handle that if there aren't some enormous varchar or binary columns. Don't feel completely locked in by good-practice "rules".
On the other hand, consider a stored procedure. Then all processing will be done locally on the server.
Edit: if neither of the above is an option, try streaming your results. For example, while reading your query, save each row in a ConcurrentQueue or something like that. Before you execute the query, start another thread or a BackgroundWorker which checks the queue for new items and saves the results back simultaneously on another SqlConnection. The work is done when the query is done AND the queue has Count 0.
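A sketch of that streaming setup, using a BlockingCollection (which is backed by a ConcurrentQueue by default) and a Task instead of a BackgroundWorker; HistoryRow, ComputeHistory and SaveHistory are placeholders for your payroll types and SQL:

// using System.Collections.Concurrent;
// using System.Data.SqlClient;
// using System.Threading.Tasks;

var queue = new BlockingCollection<HistoryRow>(boundedCapacity: 1000);

// Writer: drains the queue on its own connection while the reader keeps producing.
var writer = Task.Run(() =>
{
    using (var conn = new SqlConnection(connectionString))
    {
        conn.Open();
        foreach (var row in queue.GetConsumingEnumerable())
            SaveHistory(conn, row);               // placeholder: INSERT INTO HISTORY ...
    }
});

// Reader: streams employees/transactions and queues the computed history rows.
using (var conn = new SqlConnection(connectionString))
using (var cmd = new SqlCommand("SELECT * FROM EMPLOYEE", conn))
{
    conn.Open();
    using (var reader = cmd.ExecuteReader())
    {
        while (reader.Read())
            queue.Add(ComputeHistory(reader));    // placeholder: taxes, balances, hours
    }
}

queue.CompleteAdding();                           // reading is done
writer.Wait();                                    // done once the queue has been drained too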
Check out using ROW_NUMBER(). This can be used by programs to allow large tables to be essentially browsed using 'x' number of rows at a time. You could then conceivably use this same method to batch your job over, say, 1000 rows at a time.
See this link for more information.
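In plain ADO.NET, that batching could look roughly like this (table, column, and key names are placeholders):

// using System.Data.SqlClient;

const int batchSize = 1000;

for (int start = 1; ; start += batchSize)
{
    const string sql = @"
        SELECT EmployeeId, Name
        FROM (SELECT e.EmployeeId, e.Name,
                     ROW_NUMBER() OVER (ORDER BY e.EmployeeId) AS rn
              FROM EMPLOYEE e) AS numbered
        WHERE rn BETWEEN @start AND @end;";

    int rowsRead = 0;
    using (var conn = new SqlConnection(connectionString))
    using (var cmd = new SqlCommand(sql, conn))
    {
        cmd.Parameters.AddWithValue("@start", start);
        cmd.Parameters.AddWithValue("@end", start + batchSize - 1);
        conn.Open();

        using (var reader = cmd.ExecuteReader())
        {
            while (reader.Read())
            {
                rowsRead++;
                // process this employee's transactions and build the HISTORY rows here
            }
        }
    }

    if (rowsRead < batchSize)
        break;                                   // last (partial) batch reached
}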
On a WPF application already in production, users have a window where they choose a client. It shows a list with all the clients and a TextBox where they can search for a client.
As the client base has grown, this has turned out to be exceptionally slow: around 1 minute for an operation that happens around 100 times each day.
Currently SQL Server Management Studio says the query select id, name, birth_date from client takes 41 seconds to execute (around 130,000 rows).
Are there any suggestions on how to improve this time? Indexes, ORMs, or direct SQL queries in code?
Currently I'm using framework 3.5 and LinqToSql
If your query is actually SELECT id, name, birth_date from client (i.e. no WHERE clause) there is very little that you'll be able to do to speed that up short of new hardware. SQL Server will have to do a table scan to get all of the data. Even a covering index means that it will have to scan an index just as big as the table.
What you need to ask yourself is: is a list of 130,000 clients really useful for your users? Is anybody really going to scroll to the 75,613th entry in a list to find the client they want? The answer is probably not. I would go with the search option only. At least then you can add indexes that make sense for those queries.
If you absolutely do need the entire list, try loading it lazily in chunks. Start with the first 500 records and then add more records as the user moves the scroll bar. That way the initial load time is reduced and the user will only load the data that is necessary.
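A LINQ to SQL sketch of that chunked loading, where ClientsDataContext and the 500-row chunk size stand in for your actual context and whatever page size you pick:

private const int PageSize = 500;

// Load one chunk of clients; call with pageIndex 0 first, then 1, 2, ... as the user scrolls.
public List<Client> LoadClientPage(int pageIndex)
{
    using (var db = new ClientsDataContext())    // placeholder DataContext name
    {
        return db.Clients
                 .OrderBy(c => c.Name)           // Skip/Take needs a deterministic order
                 .Skip(pageIndex * PageSize)
                 .Take(PageSize)
                 .ToList();
    }
}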
Why do you need the list of all the clients? Couldn't you just have the search TextBox that you describe and handle the search query on the server side? There you can set a cap on the maximum number of rows returned for an individual client search (e.g. max 500 matches).
Alternatively, some efficiency gains may be achieved by caching the client data list on the web server.
Indexing should not help, based on your query. You could use a view which caches the sorted query (assuming you're not ordering by the id?), but given SQL Server's baked-in query cache for ad hoc queries you're probably not going to see much gain there either. The ORM does add some overhead, but there are several tutorials out there for cutting its cost (e.g. http://www.sidarok.com/web/blog/content/2008/05/02/10-tips-to-improve-your-linq-to-sql-application-performance.html). The main points there that apply to you are to use compiled queries wherever possible and to turn off optimistic concurrency for read-only data.
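One of those tips, compiled queries, looks roughly like this in LINQ to SQL (ClientsDataContext and the search shape are assumptions):

// using System.Data.Linq;

// Compiled once per AppDomain; each call then skips the LINQ-to-SQL translation step.
private static readonly Func<ClientsDataContext, string, IQueryable<Client>> SearchByName =
    CompiledQuery.Compile((ClientsDataContext db, string term) =>
        db.Clients.Where(c => c.Name.StartsWith(term)));

// Usage:
// using (var db = new ClientsDataContext())
// {
//     var matches = SearchByName(db, "Smi").ToList();
// }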
An even bigger performance gain could be realized by having your clients not hit the db directly. If you add a service layer in there (not necessarily a web service, but it could be) then the service class or application could put some smart caching in place, which would help by an order of magnitude for read-only queries like this.
Go into SQL Server Management Studio and open a new query. In the Query menu, click "Include Client Statistics".
Run the query just as you would from code.
It will display the results and also a tab next to the results called "Client Statistics".
Click that and look at the time under "Wait time on server replies". This is in ms, and it is the time the server actually spent executing.
I just ran this query:
select firstname, lastname from leads
It took 3ms on the server to fetch 301,000 records.
The "Total Execution Time" was something like 483 ms, which includes the time for SSMS to actually get the data and process it. My query took something like 2.5-3 s to run in SSMS, and the remaining time (2,500 ms or so) was actually spent by SSMS painting the results, etc.
My guess is, the 41 seconds is probably not being spent on the SQL server, as 130,000 records really isn't that much. Your 41 seconds is probably largely being spent by everything after the SQL server returns the results.
If you find that SQL Server is taking a long time to execute, turn on "Include Actual Execution Plan" in the Query menu and rerun your query. A new tab called "Execution Plan" appears; it will show you what SQL Server is doing when you do a SELECT on this table, as well as a percentage breakdown of where it spends all of its time. In my case it spent 100% of the time in a "Clustered Index Scan" of PK_Leads.
Edited to include more stats
In general:
Find out what takes so much time, executing the query or retrieving the results
If it's the query execution, the query plan will tell you which indexes are missing; just press the display-query-plan button in SSMS and you will get hints on which indexes you should create to increase performance.
If it's the retrieval of the values, there is not much you can do about it besides upgrading hardware (RAM, disk, network, etc.).
But:
In your case it looks like the query is a full table scan, which is never good for performance; check whether you really need to retrieve all of this data at once.
Since there are no clauses whatsoever, it's unlikely that the query execution is the problem, meaning additional indexes will not help.
You will need to change the way the application accesses the data. Instead of loading all clients into memory and then searching them in memory, you will need to pass the search term on to the database query.
LinqToSql enables you to use several different features for searching values; here is a blog post describing most of them:
http://davidhayden.com/blog/dave/archive/2007/11/23/LINQToSQLLIKEOperatorGeneratingLIKESQLServer.aspx
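For example, a capped, server-side search along those lines might look like this in LINQ to SQL (ClientsDataContext and the 500-row cap are assumptions; Contains is translated to LIKE '%term%'):

// Returns at most 500 matching clients instead of materializing all 130,000 rows.
public List<Client> SearchClients(string searchTerm)
{
    using (var db = new ClientsDataContext())            // placeholder DataContext name
    {
        return db.Clients
                 .Where(c => c.Name.Contains(searchTerm)) // becomes WHERE name LIKE '%...%'
                 .OrderBy(c => c.Name)
                 .Take(500)                               // cap the result size
                 .ToList();
    }
}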