I'm using jQuery DataTables to display a grid which uses Web API to retrieve its data. The Web API uses LINQ to query an MSSQL database, and it neatly uses filtering, sorting and skip/take to assemble its query on a well-indexed table containing about a million records (and growing). A common scenario.
And it performs really well: the browser has to wait about 50 ms for the response to return (while paginating, for example).
However, after taking a look with a profiling tool I noticed that about 25 ms of that is spent just selecting the total row count of the table. I want that count because the datatable should display something like "displaying row 1 to 10 of 45.000 filtered out of 1.000.000", which needs the total.
I don't actually need the precise total count on every trip to the server (it's just informative), so perhaps I could keep the value server side and refresh it every second in a different task, without it interfering with the data retrieval of DataTables. I would then just return a 'close enough' value for the total row count.
Is there a solid mechanism for that? I've tried putting the total row count in a static shared by multiple users across multiple callbacks, and every time it was requested an async task was fired to refresh it.
That feels icky, however; sharing the static and having a different thread update it doesn't feel all that stable to me. I've also looked at SqlDependency to push the record count from my data layer to my domain model every time it changes, but that doesn't seem to support SELECT COUNT(Id) FROM Table scenarios.
Any thoughts?
You could use one of the system tables if possible; ping it every minute and stick the result in the cache. This article has two options that it claims are sufficient:
-- The way SQL Server Management Studio counts rows (look at table properties,
-- Storage, Row Count). Very fast, but still an approximate number of rows.
SELECT CAST(p.rows AS float)
FROM sys.tables AS tbl
INNER JOIN sys.indexes AS idx ON idx.object_id = tbl.object_id AND idx.index_id < 2
INNER JOIN sys.partitions AS p ON p.object_id = CAST(tbl.object_id AS int)
    AND p.index_id = idx.index_id
WHERE tbl.name = N'Transactions'
    AND SCHEMA_NAME(tbl.schema_id) = 'dbo'
or
-- Quick (although not as fast as the previous method) and, equally important, reliable.
SELECT SUM (row_count)
FROM sys.dm_db_partition_stats
WHERE object_id=OBJECT_ID('Transactions')
AND (index_id=0 or index_id=1);
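A minimal sketch of the "ping it every minute and stick it in the cache" idea, assuming System.Runtime.Caching; the cache key, the one-minute refresh and the connection string handling are illustrative, not prescriptive:
using System;
using System.Data.SqlClient;
using System.Runtime.Caching;

public static class RowCountCache
{
    const string ConnectionString = "...";   // your connection string

    // Returns a 'close enough' total row count, refreshed at most once per minute.
    public static long GetApproximateTotalRowCount()
    {
        var cache = MemoryCache.Default;
        var cached = cache.Get("Transactions.TotalRowCount");
        if (cached != null)
            return (long)cached;

        long count = QueryApproximateRowCount();
        cache.Set("Transactions.TotalRowCount", count,
                  new CacheItemPolicy { AbsoluteExpiration = DateTimeOffset.Now.AddMinutes(1) });
        return count;
    }

    static long QueryApproximateRowCount()
    {
        // Runs the sys.dm_db_partition_stats query from above.
        using (var conn = new SqlConnection(ConnectionString))
        using (var cmd = new SqlCommand(
            @"SELECT SUM(row_count) FROM sys.dm_db_partition_stats
              WHERE object_id = OBJECT_ID('Transactions') AND (index_id = 0 OR index_id = 1);", conn))
        {
            conn.Open();
            return Convert.ToInt64(cmd.ExecuteScalar());
        }
    }
}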
Have you considered taking the count when a query is performed and then echoing the value out to your clients via SignalR?
Basically, when the LINQ call returns, get a .Count() and hand the value off to a background thread so SignalR can notify the clients of the update while you return the data to the requesting client.
SignalR will activate a javascript function in all of the client pages, where you can then take the passed in value and display it somewhere on the page.
http://www.asp.net/signalr
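A rough sketch of that idea with ASP.NET SignalR (Microsoft.AspNet.SignalR); the hub name, the updateTotalCount client method and where the count comes from are all illustrative assumptions, not a prescribed design:
using System.Threading.Tasks;
using Microsoft.AspNet.SignalR;

// Empty hub: clients only listen, the server pushes.
public class RowCountHub : Hub { }

public static class RowCountBroadcaster
{
    // Call this right after the LINQ query returns; the broadcast runs on a
    // background task so the data response to the requesting client isn't delayed.
    public static void Broadcast(int totalRowCount)
    {
        Task.Run(() =>
        {
            var context = GlobalHost.ConnectionManager.GetHubContext<RowCountHub>();
            context.Clients.All.updateTotalCount(totalRowCount);
        });
    }
}
On the page, whatever JavaScript handler you register for updateTotalCount then writes the value into the grid's info text.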
Related
I'm building an app where I need to store invoices from customers so we can track who has paid and who has not, and if not, see how much they owe in total. Right now my schema looks something like this:
Customer
- Id
- Name
Invoice
- Id
- CreatedOn
- PaidOn
- CustomerId
InvoiceItem
- Id
- Amount
- InvoiceId
Normally I'd fetch all the data using Entity Framework and calculate everything in my C# service (or even do the calculation on SQL Server), something like so:
var amountOwed = Invoice.Where(i => i.CustomerId == customer.Id)
                        .SelectMany(i => i.InvoiceItems)
                        .Select(ii => ii.Amount)
                        .Sum();
But calculating everything every time I need to generate a report doesn't feel like the right approach this time, because down the line I'll have to generate reports that calculate what all the customers owe (and sometimes go even higher up the hierarchy).
For this scenario I was thinking of adding an Amount field to my Invoice table, and possibly an AmountOwed field to my Customer table, which would be updated or populated via the InvoiceService whenever I insert/update/delete an InvoiceItem. This should be safe enough and make the report querying much faster.
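Something like this sketch is what I had in mind for the service (InvoicesContext, the transaction handling and the exact recalculation strategy are just illustrative):
public void UpdateInvoiceItemAmount(int invoiceItemId, decimal newAmount)
{
    // Sketch only: real code would wrap this in a transaction.
    using (var db = new InvoicesContext())
    {
        var item = db.InvoiceItems.Find(invoiceItemId);
        item.Amount = newAmount;
        db.SaveChanges();                                   // persist the detail change first

        // Recalculate the denormalized totals from the detail rows.
        var invoice = db.Invoices.Find(item.InvoiceId);
        invoice.Amount = db.InvoiceItems
                           .Where(ii => ii.InvoiceId == invoice.Id)
                           .Sum(ii => (decimal?)ii.Amount) ?? 0m;

        // Only unpaid invoices count toward what the customer still owes.
        var customer = db.Customers.Find(invoice.CustomerId);
        customer.AmountOwed = db.Invoices
                                .Where(i => i.CustomerId == customer.Id && i.PaidOn == null)
                                .SelectMany(i => i.InvoiceItems)
                                .Sum(ii => (decimal?)ii.Amount) ?? 0m;

        db.SaveChanges();
    }
}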
But I've also been searching some on this subject and another recommended approach is using triggers on my database. I like this method best because even if I were to directly modify a value using SQL and not the app services, the other tables would automatically update.
My question is:
How do I add a trigger to update all the parent tables whenever an InvoiceItem is changed?
And from your experience, is this the best (safer, less error-prone) solution to this problem, or am I missing something?
There are many examples of triggers that you can find on the web. Many are poorly written unfortunately. And for future reference, post DDL for your tables, not some abbreviated list. No one should need to ask about the constraints and relationships you have (or should have) defined.
To start, how would you write a query to calculate the total amount at the invoice level? Presumably you know the T-SQL to do that. So write it, test it, verify it. Then add your amount column to the invoice table. Now how would you write an update statement to set that new amount column to the sum of the associated item rows? Again - write it, test it, verify it. At this point you have all the code you need to implement your trigger.
Since this process involves changes to the item table, you will need to write triggers to handle all three types of dml statements - insert, update, and delete. Write a trigger for each to simplify your learning and debugging. Triggers have access to special tables - go learn about them. And go learn about the false assumption that a trigger works with a single row - it doesn't. Triggers must be written to work correctly if 0 (yes, zero), 1, or many rows are affected.
In an insert statement, the inserted table will hold all the rows inserted by the statement that caused the trigger to execute. So you merely sum the values (using the appropriate grouping logic) and update the appropriate rows in the invoice table. Having written the update statement mentioned in the previous paragraphs, this should be a relatively simple change to that query. But since you can insert a new row for an old invoice, you must remember to add the summed amount to the value already stored in the invoice table. This should be enough direction for you to start.
And to answer your second question - the safest and easiest way is to calculate the value every time. I fear you are trying to solve a problem that you do not have and that you may never have. Generally speaking, no one cares about invoices that are of "significant" age. You might care about unpaid invoices for a period of time, but eventually you write these things off (especially if the amounts are not significant). Another relatively easy approach is to create an indexed view to calculate and materialize the total amount. But remember - nothing is free. An indexed view must be maintained and it will add extra processing for DML statements affecting the item table. Indexed views do have limitations - which are documented.
And one last comment. I would strongly hesitate to maintain a total amount at any level higher than invoice. Above that level one frequently wants to filter the results in many ways - date, location, type, customer, etc. At that level you are approaching data warehouse functionality, which is not appropriate for an OLTP system.
First of all, never use triggers for business logic. Triggers are tricky and easily forgotten, and an application built on them is hard to maintain.
In most cases you can easily populate your reporting data via Entity Framework or a SQL query. But if it requires lots of joins, consider using staging tables, because reporting calls for denormalized data. To populate the staging tables you can use SQL jobs or some other scheduling mechanism (Azure Scheduler, maybe). This way you won't have to work with lots of joins and your reports will populate faster.
I'm building a proof of concept data analysis app, using C# & Entity Framework. Part of this app is calculating TF*IDF scores, which means getting, for each word, a count of the documents that contain it.
I have a SQL query (to a remote database with about 2,000 rows) wrapped in a foreach loop:
idf = db.globalsets.Count(t => t.text.Contains("myword"));
Depending on my dataset, this loop would run 50-1,000+ times for a single report. On a sample set where it only has to run about 50 times, it takes nearly a minute, so about 1 second per query. So I'll need much faster performance to continue.
Is 1 second per query slow for an MSSQL contains query on a remote machine?
What paths could be used to dramatically improve that? Should I look at upgrading the web host the database is on? Running the queries async? Running the queries ahead of time and storing the result in a table (I'm assuming a WHERE = query would be much faster than a CONTAINS query?)
You can do much better than full text search in this case, by making use of your local machine to store the idf scores, and writing back to the database once the calculation is complete. There aren't enough words in all the languages of the world for you to run out of RAM:
Create a dictionary Dictionary<string,int> documentFrequency
Load each document in the database in turn, and split into words, then apply stemming. Then, for each distinct stem in the document, add 1 to the value in the documentFrequency dictionary.
Once all documents are processed this way, write the document frequencies back to the database.
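A minimal sketch of that first pass; the tokenization and the Stem() helper are placeholders, and you'd feed it the texts from db.globalsets as in the question:
using System;
using System.Collections.Generic;
using System.Linq;

static Dictionary<string, int> BuildDocumentFrequencies(IEnumerable<string> documents)
{
    var documentFrequency = new Dictionary<string, int>();
    var separators = new[] { ' ', '\t', '\r', '\n', '.', ',', ';', ':', '!', '?' };

    foreach (var text in documents)                        // e.g. db.globalsets.Select(g => g.text)
    {
        var distinctStems = text
            .Split(separators, StringSplitOptions.RemoveEmptyEntries)
            .Select(word => Stem(word.ToLowerInvariant()))
            .Distinct();

        foreach (var stem in distinctStems)
        {
            int count;
            documentFrequency.TryGetValue(stem, out count);
            documentFrequency[stem] = count + 1;           // one increment per document containing the stem
        }
    }
    return documentFrequency;
}

// Placeholder stemmer; swap in a real implementation (e.g. a Porter stemmer).
static string Stem(string word) { return word; }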
Calculating a tf-idf for a given term in a given document can now be done just by:
Loading the document.
Counting the number of instances of the term.
Loading the correct idf score from the idf table in the database.
Doing the tf-idf calculation.
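Step 4 is then just arithmetic; one common formulation (use whichever tf-idf variant you already prefer):
// termCountInDocument: occurrences of the term in this document (step 2)
// documentFrequency:   the value loaded from the idf table (step 3)
// totalDocuments:      total number of documents (~2,000 here)
static double ComputeTfIdf(int termCountInDocument, int documentFrequency, int totalDocuments)
{
    double tf = termCountInDocument;
    double idf = Math.Log((double)totalDocuments / (1 + documentFrequency));   // +1 avoids division by zero
    return tf * idf;
}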
This should be thousands of times faster than your original, and hundreds of times faster than full-text-search.
As others have recommended, I think you should implement that query on the DB side. Take a look at this article about SQL Server Full-Text Search; that should be the way to solve your problem.
Applying a Contains query in a loop is an extremely bad idea; it kills both performance and the database. You should change your approach, and I strongly suggest you create Full-Text Search indexes and query against them. You can retrieve the texts of the records that match your query strings.
SELECT t.Id, t.SampleColumn
FROM CONTAINSTABLE(Student, SampleColumn, 'word OR sampleword') C
INNER JOIN Student t ON C.[KEY] = t.Id
Perform just one query: put in the words you are searching for, combined with operators (OR, AND, etc.), and retrieve the matched texts. Then you can calculate the TF-IDF scores in memory.
Also, streaming the matched texts from SQL Server into memory might still take a while, but it is a far better option than running N Contains queries in a loop.
I have a simple tool for searching in a given db. The user can provide numerous conditions and my tool puts the SQL query together based on them. However, I want to prevent the query from being executed if it would return too many records. For example, if the user leaves all the filters blank, the query would pull all the records from the db, which would take tens of minutes and isn't useful to any of my users, so I want some kind of limit.
I was thinking about running a COUNT() query with the same conditions before each 'real' query, but that takes too much time.
Is there any option to measure the number of records 'during' the query and stop it once a certain amount is reached, throwing some exception that asks the user to refine the search?
I use this approach:
Say you want to fetch at most 100 rows. Construct your query so it returns at most 101 rows (with TOP N, or the more generic ANSI way by filtering on ROW_NUMBER()). Then you can easily detect whether there is more and act accordingly; in my case, I show a 'read more'.
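In EF/LINQ terms, a small sketch of that pattern (query stands for whatever IQueryable the user's filters produced):
const int pageSize = 100;

// Ask for one row more than you intend to show.
var rows = query.Take(pageSize + 1).ToList();

bool hasMore = rows.Count > pageSize;
if (hasMore)
    rows.RemoveAt(pageSize);   // drop the sentinel row, keep the first 100

// Display 'rows'; if hasMore, show a 'read more' or ask the user to refine the search.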
You could run a test query that searches the database with the user-defined options but returns only the id field of the results; this would be very quick and also lets you check the count.
Then, if all is OK, you can run the full query to return all of their results.
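Sketched with LINQ (maxRows and the Id projection are illustrative):
const int maxRows = 10000;

// Cheap pre-flight: pull only the keys.
var ids = query.Select(r => r.Id).ToList();

if (ids.Count > maxRows)
    throw new InvalidOperationException("Too many results - please refine the search.");

// Otherwise run the full query and return everything.
var results = query.ToList();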
Following on from the answer above, if you are working with large amounts of data, select TOP N with the FAST query hint.
E.g.
SELECT TOP 101 [ColumnName]
FROM [Table]
OPTION (FAST 101)
This depends on your application and how you want it to work.
If you only want to display data in a table, setting a maximum size on your query is enough; you can use TOP in your SELECT statement.
SELECT TOP N [ColumnName]
But considering you said a count takes too much time, I think you're concerned about handling a very large data set, and maybe manipulating it, rather than just getting a limited set of rows from the query.
In that case, one method is to break the job apart into chunks: grab the first N rows, then the next N rows, and repeat until there are no more values to be returned. You can also keep records for rollbacks and checkpoints to ensure data integrity.
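A rough sketch of that chunked loop using Skip/Take (query again stands for the user-built IQueryable, and ProcessChunk and the Id ordering are placeholders; checkpoint/rollback bookkeeping is left out):
const int chunkSize = 1000;
int offset = 0;

while (true)
{
    // Paging with Skip/Take needs a deterministic ordering.
    var chunk = query.OrderBy(r => r.Id)
                     .Skip(offset)
                     .Take(chunkSize)
                     .ToList();

    if (chunk.Count == 0)
        break;                 // nothing left to fetch

    ProcessChunk(chunk);       // hypothetical: display, export, checkpoint, ...
    offset += chunkSize;
}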
Similar questions maybe:
query to limit records returned by sql query based on size of data
How to iterate through a large SQL result set with multiple related tables
I have an operation (that I can't change) that starts threads which call our Oracle database to see if a certain hotel (or hotels) has availability on a certain date.
If a date/hotel combination has availability, that thread returns information about the date/hotel in the form of a DataTable, which is merged into a main DataTable of results. Yes, I know... I inherited this.
So I am trying to re-write this operation. I still must query Oracle in threads to get the availability information, but I want to display the data as it is returned (in chunks of 5, 10? I'm flexible), instead of having the user sit in front of the screen for up to 4 minutes before a complete result is spat out into a GridView.
How do I do this directly from an .aspx page so I can make a web service call and populate a grid (JqGrid?) with the results?
If I haven't provided enough information or described what I am trying to achieve, please let me know and I will elaborate.
Oracle provides a field on each row called rowid (http://www.adp-gmbh.ch/ora/concepts/rowid.html).
The first time you send the query, pass in an int (x) to define the highest row number you want. Have the service return the total number of rows plus the first x rows.
Then, the second time you send the query, get the next x rows; rinse and repeat.
Basically, you need to send an ajax query for rows x through y each time until you have them all loaded.
I would recommend paging as well, since users typically don't want to see hundreds of results at a time.
I am designing a WCF interface which returns the status of all orders (the Order data structure includes two members, a string Id and an OrderStatus enum, and is designed as a DataContract). The total number of orders is very large, about 10M. I am concerned about the traffic and the impact on the server side if clients call this API to get the status of all orders, and call it frequently.
Any advice?
I am using VSTS 2008 + C# + .Net 3.5 + WCF.
I would support ozczecho - why would you ever want to return 10M records? Will your customers REALLY want to sift through 10M orders?? I highly doubt it....
Limit the number - by e.g. date ranges (all orders from 1Q/09), or by any other criteria. Just because you could return 10M rows doesn't mean it'll really be a good idea.
Also, together with SQL Server, you could easily implement paging, e.g. you could have your WCF service send back the first e.g. 100 rows, and send back a flag indicating there's more, and then have your client request rows 101 through 200 etc. It takes a bit of logic, but it would make communication just THAT much easier (and quicker)!
Also, in WCF you have to define maximum message sizes - they're normally 64K. The reason for this is that a message needs to be assembled in memory, in full, before it can be transmitted. Imagine you have 50 clients hitting your server - how much memory can you really set aside for "message assembly" on your server?
Marc
UPDATE:
One way you can achieve paging in a service would be by having a call something like this:
[OperationContract]
public List<Orders> GetOrders(string searchCriteria, string sortExpression,
                              int skipFirstRows, int takeRows)
This is inspired by the .Skip() and .Take() extension methods introduced by LINQ.
In this case, you could call GetOrders and define some search criteria (that might be a class, too, instead of just a string) to match your orders, you can define how to sort the orders by specifying sortExpression, and then you tell the service that you want to skip the first n rows and then take x rows.
So a call
List<Orders> result = GetOrders(criteria, sort, 0, 50);
would fetch the first 50 rows. Once you're done, you can call again:
List<Orders> result = GetOrders(criteria, sort, 50, 50);
and now you'd skip the first 50 rows (which you've already displayed / reported on) and then you take the next 50 (rows 51-100).
Of course, if your WCF service on the backend uses LINQ, you can translate that directly into calls to .Skip() and .Take() on your LINQ queries! :-)
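Just to illustrate, the service side could then look roughly like this (a sketch, not production code; OrdersDataContext, the fixed ordering and skipping the criteria/sort translation are assumptions):
public List<Orders> GetOrders(string searchCriteria, string sortExpression,
                              int skipFirstRows, int takeRows)
{
    using (var db = new OrdersDataContext())   // hypothetical LINQ to SQL / EF context
    {
        IQueryable<Orders> query = db.Orders;

        // Translating searchCriteria and sortExpression into the query is up to you;
        // the point here is that paging maps directly onto Skip/Take.
        return query.OrderBy(o => o.Id)        // Skip() requires a stable ordering
                    .Skip(skipFirstRows)
                    .Take(takeRows)
                    .ToList();
    }
}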
UPDATE 2:
Are you working against SQL Server 2005 or higher? Check out Common Table Expressions (CTEs), which are basically the basis for what LINQ does here. They allow you to define a 'virtual' view on your data and select only a certain section of the data set.
See more information here:
4 Guys from Rolla
Simple-Talk
Blog post by Troy DeMonbreun
Don't return all 10M records - use paging instead.