SQL Database Rows to GridView Columns - c#

I have a few database tables set up like:
Executions
-ExecutionID
Periods
-PeriodID
-ExecutionID
Transactions
-TransactionID
-PeriodID
-Name
-ResponseTime
I have a page where I want to display the execution in a GridView, with columns:
TransactionName, Period1ResponseTime, Period2ResponseTime, etc.
I've been trying to come up with the best way to combine the tables into a single data source to feed to the GridView, but I'm only coming up with dirty brute-force ideas. What do you think the best approach for this would be? Is it possible using SQL alone?
P.S. The transaction names are distinct per period (there won't be two transactions with the same name in a period), and not every period will have the same transactions, although they're mostly the same (with different response times).

I ended up doing it with the brute-force C# approach, as I can't figure out the complex SQL involved - I'm not really sure what "tools" I could use to accomplish it. That said, I'll have to look more into cursors. Basically what I do now is one query to get all distinct transaction names from the periods in the execution (building the SQL statement on the fly with some OR appends). Then I get the transactions for each period and store them in an array. Finally I combine them all into a custom data table by going row by row (distinct transaction names), and for each period I search its transaction list for that name: if found, I put the response time value in that period's column, otherwise I leave it blank. Not sure what the duration is for Stack Overflow questions, but if anyone has a suggestion to improve on this approach I'm all ears, since it feels very inefficient.
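For reference, this rows-to-columns reshaping can be done in SQL Server with conditional aggregation (or the PIVOT operator). Below is a static sketch against the schema above, assuming the relevant PeriodIDs are known up front; a fully dynamic version would build the column list from a query against Periods and run it via sp_executesql.

-- One row per transaction name, one response-time column per period.
-- Transactions missing from a period simply come back as NULL (blank).
SELECT
    t.Name AS TransactionName,
    MAX(CASE WHEN p.PeriodID = 1 THEN t.ResponseTime END) AS Period1ResponseTime,
    MAX(CASE WHEN p.PeriodID = 2 THEN t.ResponseTime END) AS Period2ResponseTime
FROM Transactions t
INNER JOIN Periods p ON p.PeriodID = t.PeriodID
WHERE p.ExecutionID = @ExecutionID
GROUP BY t.Name;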

Best approach to track Amount field on Invoice table when InvoiceItem items change?

I'm building an app where I need to store invoices from customers so we can track who has paid and who has not, and if not, see how much they owe in total. Right now my schema looks something like this:
Customer
- Id
- Name
Invoice
- Id
- CreatedOn
- PaidOn
- CustomerId
InvoiceItem
- Id
- Amount
- InvoiceId
Normally I'd fetch all the data using Entity Framework and calculate everything in my C# service (or even do the calculation on SQL Server), something like so:
var amountOwed = Invoice.Where(i => i.CustomerId == customer.Id)
.SelectMany(i => i.InvoiceItems)
.Select(ii => ii.Amount)
.Sum();
But calculating everything every time I need to generate a report doesn't feel like the right approach this time, because down the line I'll have to generate reports that calculate what all the customers owe (and sometimes go even higher up the hierarchy).
For this scenario I was thinking of adding an Amount field on my Invoice table and possibly an AmountOwed on my Customer table which will be updated or populated via the InvoiceService whenever I insert/update/delete an InvoiceItem. This should be safe enough and make the report querying much faster.
But I've also been searching some on this subject and another recommended approach is using triggers on my database. I like this method best because even if I were to directly modify a value using SQL and not the app services, the other tables would automatically update.
My question is:
How do I add a trigger to update all the parent tables whenever an InvoiceItem is changed?
And from your experience, is this the best (safer, less error-prone) solution to this problem, or am I missing something?
There are many examples of triggers that you can find on the web. Many are poorly written unfortunately. And for future reference, post DDL for your tables, not some abbreviated list. No one should need to ask about the constraints and relationships you have (or should have) defined.
To start, how would you write a query to calculate the total amount at the invoice level? Presumably you know the T-SQL to do that. So write it, test it, verify it. Then add your amount column to the invoice table. Now how would you write an update statement to set that new amount column to the sum of the associated item rows? Again - write it, test it, verify it. At this point you have all the code you need to implement your trigger.
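As a rough sketch of that update statement, using the schema from the question (Amount on Invoice being the newly added column):

-- Backfill Invoice.Amount from the sum of its items; invoices with no items get 0.
UPDATE i
SET i.Amount = ISNULL(x.Total, 0)
FROM Invoice i
LEFT JOIN (
    SELECT InvoiceId, SUM(Amount) AS Total
    FROM InvoiceItem
    GROUP BY InvoiceId
) x ON x.InvoiceId = i.Id;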
Since this process involves changes to the item table, you will need to write triggers to handle all three types of dml statements - insert, update, and delete. Write a trigger for each to simplify your learning and debugging. Triggers have access to special tables - go learn about them. And go learn about the false assumption that a trigger works with a single row - it doesn't. Triggers must be written to work correctly if 0 (yes, zero), 1, or many rows are affected.
In an insert statement, the inserted table will hold all the rows inserted by the statement that caused the trigger to execute. So you merely sum the values (using the appropriate grouping logic) and update the appropriate rows in the invoice table. Having written the update statement mentioned in the previous paragraphs, this should be a relatively simple change to that query. But since you can insert a new row for an old invoice, you must remember to add the summed amount to the value already stored in the invoice table. This should be enough direction for you to start.
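A minimal sketch of the insert trigger described above, written so it works for any number of affected rows (the update and delete triggers follow the same pattern, using the deleted table as well; the trigger name is just an example):

CREATE TRIGGER trg_InvoiceItem_Insert
ON InvoiceItem
AFTER INSERT
AS
BEGIN
    SET NOCOUNT ON;

    -- Sum the newly inserted items per invoice and add that to the stored total.
    UPDATE i
    SET i.Amount = i.Amount + x.Total
    FROM Invoice i
    INNER JOIN (
        SELECT InvoiceId, SUM(Amount) AS Total
        FROM inserted
        GROUP BY InvoiceId
    ) x ON x.InvoiceId = i.Id;
END;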
And to answer your second question - the safest and easiest way is to calculate the value every time. I fear you are trying to solve a problem that you do not have and that you may never have. Generally speaking, no one cares about invoices that are of "significant" age. You might care about unpaid invoices for a period of time, but eventually you write these things off (especially if the amounts are not significant). Another relatively easy approach is to create an indexed view to calculate and materialize the total amount. But remember - nothing is free. An indexed view must be maintained and it will add extra processing for DML statements affecting the item table. Indexed views do have limitations - which are documented.
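A rough sketch of the indexed-view alternative mentioned above (indexed views require SCHEMABINDING, two-part names, a COUNT_BIG(*) alongside the aggregate, and a non-nullable Amount column, among the other documented restrictions):

CREATE VIEW dbo.InvoiceTotals
WITH SCHEMABINDING
AS
SELECT InvoiceId,
       SUM(Amount)  AS TotalAmount,
       COUNT_BIG(*) AS ItemCount   -- required when the view contains aggregates
FROM dbo.InvoiceItem
GROUP BY InvoiceId;
GO

-- The unique clustered index is what materializes the view.
CREATE UNIQUE CLUSTERED INDEX IX_InvoiceTotals ON dbo.InvoiceTotals (InvoiceId);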
And one last comment. I would strongly hesitate to maintain a total amount at any level higher than invoice. Above that level one frequently wants to filter the results in many ways - date, location, type, customer, etc. At that level you are approaching data warehouse functionality, which is not appropriate for an OLTP system.
First of all, never use triggers for business logic. Triggers are tricky and easily forgotten. It will be hard to maintain such an application.
For most cases you can easily populate your reporting data via Entity Framework or a SQL query. But if it requires lots of joins, then you need to consider using staging tables, because reporting requires data denormalization. To populate the staging tables you can use SQL jobs or some other scheduling mechanism (Azure Scheduler, maybe). This way you won't need to work with lots of joins and your reports will populate faster.

How can I deal with slow performance on Contains query in Entity Framework / MS-SQL?

I'm building a proof of concept data analysis app, using C# & Entity Framework. Part of this app is calculating TF*IDF scores, which means getting a count of documents that contain every word.
I have a SQL query (to a remote database with about 2,000 rows) wrapped in a foreach loop:
idf = db.globalsets.Count(t => t.text.Contains("myword"));
Depending on my dataset, this loop would run 50-1,000+ times for a single report. On a sample set where it only has to run about 50 times, it takes nearly a minute, so about 1 second per query. So I'll need much faster performance to continue.
Is 1 second per query slow for an MSSQL contains query on a remote machine?
What paths could be used to dramatically improve that? Should I look at upgrading the web host the database is on? Running the queries async? Running the queries ahead of time and storing the result in a table (I'm assuming a WHERE = query would be much faster than a CONTAINS query?)
You can do much better than full text search in this case, by making use of your local machine to store the idf scores, and writing back to the database once the calculation is complete. There aren't enough words in all the languages of the world for you to run out of RAM:
Create a dictionary Dictionary<string,int> documentFrequency
Load each document in the database in turn, and split into words, then apply stemming. Then, for each distinct stem in the document, add 1 to the value in the documentFrequency dictionary.
Once all documents are processed this way, write the document frequencies back to the database.
Calculating a tf-idf for a given term in a given document can now be done just by:
Loading the document.
Counting the number of instances of the term.
Loading the correct idf score from the idf table in the database.
Doing the tf-idf calculation.
This should be thousands of times faster than your original, and hundreds of times faster than full-text-search.
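As a rough C# sketch of the document-frequency pass described above (the document source and the Stem method are placeholders; plug in your own query against globalsets and whatever stemmer you use):

using System;
using System.Collections.Generic;
using System.Linq;

static class DocumentFrequencyBuilder
{
    // For each document, count every distinct stem once.
    public static Dictionary<string, int> Build(IEnumerable<string> documents)
    {
        var documentFrequency = new Dictionary<string, int>();
        var separators = new[] { ' ', '\t', '\r', '\n', '.', ',', ';', ':' };

        foreach (var text in documents)
        {
            var distinctStems = text
                .Split(separators, StringSplitOptions.RemoveEmptyEntries)
                .Select(word => Stem(word.ToLowerInvariant()))
                .Distinct();

            foreach (var stem in distinctStems)
            {
                documentFrequency.TryGetValue(stem, out var count);
                documentFrequency[stem] = count + 1;
            }
        }

        return documentFrequency;
    }

    // Placeholder - substitute a real stemmer (e.g. a Porter stemmer implementation).
    private static string Stem(string word) => word;
}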
As others have recommended, I think you should implement that query on the db side. Take a look at this article about SQL Server Full-Text Search; that should be the way to solve your problem.
Applying a Contains query in a loop is an extremely bad idea. It kills performance and the database. You should change your approach, and I strongly suggest you create Full-Text Search indexes and run your query against those. You can retrieve the matched records' text with your query strings.
SELECT t.Id, t.SampleColumn
FROM CONTAINSTABLE(Student, SampleColumn, 'word OR sampleword') C
INNER JOIN Student t ON C.[KEY] = t.Id
Perform just one query, put in the desired search words combined with operators (OR, AND, etc.), and retrieve the matched texts. Then you can calculate the TF-IDF scores in memory.
Also, streaming the texts from SQL Server into memory might still take a while, but it is a much better option than applying N Contains queries in a loop.

Create an in-memory readonly cache for data stored in SQL Server

I have a problem concerning application performance: I have many tables, each having millions of records. I am performing select statements over them using joins, where clauses and order by on different criteria (specified by the user at runtime). I want to get my records paged, but no matter what I do with my SQL statements I cannot reach the performance of getting my pages directly from memory. Basically the problem comes when I have to filter my records using some dynamic criteria specified at runtime. I tried everything, such as using the ROW_NUMBER() function combined with a "where RowNo between" clause, CTEs, temp tables, etc. Those SQL solutions perform well only if I don't include filtering. Keep in mind also that I want my solution to be as generic as possible (imagine that I have in my app several lists that virtually present pages over millions of records, and those records are constructed with very complex SQL statements).
All my tables have a primary key of type INT.
So, I came up with an idea: why not create a "server" only for select statements? The server first loads all records from all tables and stores them in HashSets, where each T has an Id property, GetHashCode() returns that Id, and Equals is implemented so that two records are "equal" only if their Ids are equal (don't scream, you will see later why I am not using all the record data for hashing and comparisons).
So far so good, but there's a problem: how can I sync my in-memory collections with the database records? The idea is that I must find a solution where I load only differential changes. So I invented a changelog table for each table that I want to cache. In this changelog I perform only inserts that mark dirty rows (updates or deletes) and also record newly inserted ids, all of this implemented using triggers. So whenever an in-memory select comes, I first check whether I must sync something (by querying the changelog). If something must be applied, I load the changelog, apply those changes in memory and finally clear that changelog (or maybe remember the highest changelog id that I've applied...).
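A rough sketch of what one of those changelog triggers could look like, using a hypothetical cached table and changelog (the real table and column names would follow your own schema):

CREATE TABLE dbo.Customer_ChangeLog
(
    ChangeLogId int IDENTITY PRIMARY KEY,
    RecordId    int NOT NULL,
    ChangeType  char(1) NOT NULL   -- 'I'nsert, 'U'pdate or 'D'elete
);
GO

CREATE TRIGGER trg_Customer_Log
ON dbo.Customer
AFTER INSERT, UPDATE, DELETE
AS
BEGIN
    SET NOCOUNT ON;

    -- Updated rows appear in both inserted and deleted; pure inserts only in inserted.
    INSERT INTO dbo.Customer_ChangeLog (RecordId, ChangeType)
    SELECT i.Id, CASE WHEN d.Id IS NULL THEN 'I' ELSE 'U' END
    FROM inserted i
    LEFT JOIN deleted d ON d.Id = i.Id;

    -- Deleted rows appear only in deleted.
    INSERT INTO dbo.Customer_ChangeLog (RecordId, ChangeType)
    SELECT d.Id, 'D'
    FROM deleted d
    LEFT JOIN inserted i ON i.Id = d.Id
    WHERE i.Id IS NULL;
END;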
In order to be able to apply the changelog in O(N), where N is the changelog size, I am using this algorithm (a sketch follows below):
for each log entry:
identify my in-memory Dictionary<int, T> where the key is the primary key.
if it's a delete log, call dictionary.Remove(id) (O(1)).
if it's an update log, also call dictionary.Remove(id) (O(1)) and move this id into a "to be inserted" collection.
if it's an insert log, move this id into the "to be inserted" collection.
finally, refresh the cache by selecting all data from the corresponding table where Id is in the "to be inserted" collection.
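A rough C# sketch of that apply step (the ChangeLogEntry shape and the reload delegate are hypothetical stand-ins for the real changelog rows and the "where Id in (...)" query):

using System;
using System.Collections.Generic;

enum ChangeType { Insert, Update, Delete }

class ChangeLogEntry
{
    public int RecordId { get; set; }
    public ChangeType Type { get; set; }
}

static class CacheSync
{
    // Applies a changelog batch to the in-memory cache in O(N) of the changelog size.
    public static void Apply<T>(
        Dictionary<int, T> cache,
        IEnumerable<ChangeLogEntry> log,
        Func<IReadOnlyCollection<int>, IEnumerable<KeyValuePair<int, T>>> reload)
    {
        var toReload = new HashSet<int>();

        foreach (var entry in log)
        {
            switch (entry.Type)
            {
                case ChangeType.Delete:
                    cache.Remove(entry.RecordId);        // O(1)
                    break;
                case ChangeType.Update:
                    cache.Remove(entry.RecordId);        // O(1), then re-fetch below
                    toReload.Add(entry.RecordId);
                    break;
                case ChangeType.Insert:
                    toReload.Add(entry.RecordId);
                    break;
            }
        }

        // One round trip to the database for everything that needs (re)loading.
        foreach (var pair in reload(toReload))
            cache[pair.Key] = pair.Value;
    }
}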
For filtering, I am compiling expression trees into Func<T, List<FilterCriteria>, bool> functors. Using this mechanism I am performing way faster than SQL.
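For illustration, a minimal sketch of compiling one equality criterion into a reusable delegate (the FilterCriteria shape here is hypothetical):

using System;
using System.Linq.Expressions;

class FilterCriteria
{
    public string PropertyName { get; set; }
    public object Value { get; set; }
}

static class FilterCompiler
{
    // Builds and compiles t => t.<PropertyName> == <Value> into a delegate.
    public static Func<T, bool> CompileEquals<T>(FilterCriteria criteria)
    {
        var parameter = Expression.Parameter(typeof(T), "t");
        var property = Expression.Property(parameter, criteria.PropertyName);
        var constant = Expression.Constant(criteria.Value, property.Type);
        var body = Expression.Equal(property, constant);
        return Expression.Lambda<Func<T, bool>>(body, parameter).Compile();
    }
}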
I know that SQL Server 2012 has caching support and the upcoming SQL Server version will support even more, but my client has SQL Server 2005, so... I can't benefit from this stuff.
My question: what do you think? Is this a bad idea? Is there a better approach?
The developers of SQL Server did a very good job. I think it is practically impossible to out-trick it.
Unless your data has some kind of implicit structure which might help to speed things up and which the optimizer cannot be aware of, such "I do my own speedy trick" approaches won't help - normally...
Performance problems should always be solved first where they occur:
the table structures and relations
indexes and statistics
quality of SQL statements
Even many millions of rows are no problem if the design and the queries are good...
If your queries do a lot of computations, or you need to retrieve data out of tricky structures (nested lists with recursive reads, XML...), I'd go the data-warehouse path and write some denormalized tables for quick selects. Of course you will have to deal with the fact that you are reading "old" data. If your data does not change much, you could trigger all changes into a denormalized structure immediately. But this depends on your actual situation.
If you want, you could post one of your underperforming queries together with the relevant structure details and ask for review. There are dedicated groups on Stack Exchange, such as Code Review. If it's not too big, you might try it here as well...

How to measure the result of an sql query?

I have a simple tool for searching in a given db. The user can provide numerous conditions and my tool puts the SQL query together based on that. However, I want to prevent the query from being executed in case it returns too many records. For example, if the user leaves all the filters blank, the query would pull all the records from the db, which would take tens of minutes. Of course that's not necessary for any of my users. So I want some limitation.
I was thinking about running a count() sql query with the same conditions before each 'real' query, but that takes too much time.
Is there any option to measure the records 'during' the query and stop it if a certain amount is being reached? Throwing some exception asking the user to refine the search.
I use this approach:
State that you want to fetch AT MOST 100 rows. Construct your query so it returns at most 101 rows (with TOP N, or the more generic ANSI way by filtering on row_number). Then you can easily detect whether there are more. You can act accordingly; in my case, I show a 'read more'.
You could run a test query to search the database with the user-defined options and only return the id field of the matching rows; this would be very quick and also allow you to test the count().
Then, if all is OK, you can run the full query to return all of their results.
Following on from the answer above, if you are working with large amounts of data, select TOP N with the FAST query option.
E.g.
SELECT TOP 101 [ColumnName]
FROM [Table]
OPTION (FAST 101)
This depends on your application and how you want it to work.
If you're only wanting to display data in a table and setting a maximum size on your query is enough, you can use TOP in your SELECT statement.
SELECT TOP N [ColumnName]
But considering you said a count takes too much time, I think you're concerned about handling a very large data set and maybe manipulating it, not necessarily just getting a limited set of data from the query.
Then one method is to break the job apart into chunks: grab the first N rows, then the next N rows, and repeat until there are no more values to be returned (see the sketch below). You can also keep records for rollbacks and checkpoints to ensure data integrity.
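A rough sketch of that chunking with OFFSET/FETCH (SQL Server 2012+; the table and column names are placeholders, and each pass would bump @PageIndex by one until no rows come back):

SELECT Id, Name
FROM dbo.SearchResults          -- stands in for your filtered query
ORDER BY Id                     -- a stable order is required for paging
OFFSET @PageIndex * @PageSize ROWS
FETCH NEXT @PageSize ROWS ONLY;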
Similar questions maybe:
query to limit records returned by sql query based on size of data
How to iterate through a large SQL result set with multiple related tables

Remember how far you've gotten when crunching a big MSSQL table?

We have an application that executes a job to process a range of rows from an MSSQL view.
This view contains a lot of rows, and the data is inserted with an additional column (dataid) set to identity, meant for us to use to know how far through the dataset we have gotten.
A while ago we had some issues when just getting the top n rows with a dataid larger than y (y being the biggest dataid that we had processed so far). It seemed that the rows were not returned in the correct order, meaning that when we grabbed a range of rows, the dataid of some of the rows was misplaced, which meant that we processed a row with a dataid of 100 when we had actually only gotten to 95.
Example:
The window/range is 100 rows on each crunch. But if the rows' dataids are not in sequential order, the query getting the next 100 rows may contain a dataid that really should have been located in the next crunch. And then rows will be skipped when that next crunch is executed.
An ORDER BY on the dataid would solve the problem, but that is way too slow.
Do you guys have any suggestions how this could be done in a better/working way?
When I say a lot of rows, I mean a few billion rows, and yes, if you think that is absolutely crazy you are completely right!
We use Dapper to map the rows into objects.
This is completely read only.
I hope this question is not too vague.
Thanks in advance!
An ORDER BY on the dataid would solve the problem, but that is way too slow.
Apply the proper indexes.
The only answer to "why is my query slow" is: How To: Optimize SQL Queries.
It's not clear what you mean by mixing 'view' and 'insert' in the same sentence. If you really mean a view that projects an IDENTITY function, then you can stop right now, it will not work. You need a persisted bookmark to resume your work. An IDENTITY projected in a SELECT by a view does not meet the persistence criteria.
You need to process data in a well-defined order that is persistent on consecutive reads. You must be able to read a key that clearly defines a boundary in the given order. You need to persist the last key processed in the same transaction as the batch processing the rows. How you achieve these requirements is entirely up to you. A typical solution is to process in clustered index order and remember the last processed clustered key position. A unique clustered key is a must. An IDENTITY property and a clustered index on it do satisfy the criteria you need.
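A minimal sketch of that bookmark pattern, assuming dataid is (or maps to) the unique clustered key of the underlying table (the view and column names here are placeholders):

DECLARE @LastDataId bigint = 0;   -- read the persisted bookmark instead of 0

SELECT TOP (100) dataid, Payload  -- Payload stands in for the real columns
FROM dbo.SourceView
WHERE dataid > @LastDataId
ORDER BY dataid;                  -- cheap when dataid is the clustered index key

-- After processing the batch, persist MAX(dataid) of the batch as the new
-- bookmark in the same transaction as the processing itself.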
If you only want to work on the last 100, give or take a 1,000,000, you could look at partitioning the data.
What's the point of including the other 999999000000 in the index?
