Optimal postgreSQL read-modify-write rows access - c#

Trying to modify some fields in all table records, using Npgsql data Provider for PostgreSQL.
Each record needs:
to be read,
some fields needs to be modified by a C# procedure
and write back to table
Is there an object or mechanism that allow to point to each record to do this without multiple queries to perform the C# procedure call between the reading and writing of each record?

If you're looking for a way to update a value via an open cursor, to avoid an additional UPDATE, then that doesn't exist in PostgreSQL. On the other hand, I'm pretty sure (but not 100%) that on other databases it doesn't actually improve perf either, i.e. that an additional roundtrip for each update is required anyway. In other words, "updating a cursor" for results from a SELECT is probably API sugar rather than an actual optimization.
The most efficient way to accomplish this with Npgsql is probably to do a SELECT, buffer results in memory, iterate them to calculate the new values, and then issue a prepared batched update that updates the rows (i.e. a single command with several UPDATE ...; UPDATE ... statements). If the amount of rows is too large, this can be split into several batches, i.e. "load x rows, calculate, update those x rows; load next x rows...". You can use PostgreSQL's cursor functionality to each time load the next X rows, or simple issue new SELECTs and use LIMIT/OFFSET for paging (likely to have similar performance).

Related

Best approach to track Amount field on Invoice table when InvoiceItem items change?

I'm building an app where I need to store invoices from customers so we can track who has paid and who has not, and if not, see how much they owe in total. Right now my schema looks something like this:
Customer
- Id
- Name
Invoice
- Id
- CreatedOn
- PaidOn
- CustomerId
InvoiceItem
- Id
- Amount
- InvoiceId
Normally I'd fetch all the data using Entity Framework and calculate everything in my C# service, (or even do the calculation on SQL Server) something like so:
var amountOwed = Invoice.Where(i => i.CustomerId == customer.Id)
.SelectMany(i => i.InvoiceItems)
.Select(ii => ii.Amount)
.Sum()
But calculating everything every time I need to generate a report doesn't feel like the right approach this time, because down the line I'll have to generate reports that should calculate what all the customers owe (sometimes go even higher on the hierarchy).
For this scenario I was thinking of adding an Amount field on my Invoice table and possibly an AmountOwed on my Customer table which will be updated or populated via the InvoiceService whenever I insert/update/delete an InvoiceItem. This should be safe enough and make the report querying much faster.
But I've also been searching some on this subject and another recommended approach is using triggers on my database. I like this method best because even if I were to directly modify a value using SQL and not the app services, the other tables would automatically update.
My question is:
How do I add a trigger to update all the parent tables whenever an InvoiceItem is changed?
And from your experience, is this the best (safer, less error-prone) solution to this problem, or am I missing something?
There are many examples of triggers that you can find on the web. Many are poorly written unfortunately. And for future reference, post DDL for your tables, not some abbreviated list. No one should need to ask about the constraints and relationships you have (or should have) defined.
To start, how would you write a query to calculate the total amount at the invoice level? Presumably you know the tsql to do that. So write it, test it, verify it. Then add your amount column to the invoice table. Now how would you write an update statement to set that new amount column to the sum of the associated item rows? Again - write it, test it, verify it. At this point you have all the code you need to implement your trigger.
Since this process involves changes to the item table, you will need to write triggers to handle all three types of dml statements - insert, update, and delete. Write a trigger for each to simplify your learning and debugging. Triggers have access to special tables - go learn about them. And go learn about the false assumption that a trigger works with a single row - it doesn't. Triggers must be written to work correctly if 0 (yes, zero), 1, or many rows are affected.
In an insert statement, the inserted table will hold all the rows inserted by the statement that caused the trigger to execute. So you merely sum the values (using the appropriate grouping logic) and update the appropriate rows in the invoice table. Having written the update statement mentioned in the previous paragraphs, this should be a relatively simple change to that query. But since you can insert a new row for an old invoice, you must remember to add the summed amount to the value already stored in the invoice table. This should be enough direction for you to start.
And to answer your second question - the safest and easiest way is to calculate the value every time. I fear you are trying to solve a problem that you do not have and that you may never have. Generally speaking, no one cares about invoices that are of "significant" age. You might care about unpaid invoices for a period of time, but eventually you write these things off (especially if the amounts are not significant). Another relatively easy approach is to create an indexed view to calculate and materialize the total amount. But remember - nothing is free. An indexed view must be maintained and it will add extra processing for DML statements affecting the item table. Indexed views do have limitations - which are documented.
And one last comment. I would strongly hesitate to maintain a total amount at any level higher than invoice. Above that level one frequently wants to filter the results in any ways - date, location, type, customer, etc. At this level you are approaching data warehouse functionality which is not appropriate for a OLTP system.
First of all never use triggers for business logic. Triggers are tricky and easily forgettable. It will be hard to maintain such application.
For most cases you can easily populate your reporting data via entity framework or SQL query. But if it requires lots of joins then you need to consider using staging tables. Because reporting requires data denormalization. To populate staging tables you can use SQL jobs or other schedule mechanism (Azure Scheduler maybe). This way you won't need to work with lots of join and your reports will populate faster.

Create an in-memory readonly cache for data stored in SQL Server

I have a problem concerning application performance: I have many tables, each having millions of records. I am performing select statements over them using joins, where clauses and orderby on different criterias (specified by the user at runtime). I want to get my records paged but no matter what I do with my SQL statements I cannot reach the performance of getting my pages directly from memory. Basically the problem comes when I have to filter my records by using some runtime dynamic specified criteria. I tried everything such as using ROW_NUMBER() function combined with a "where RowNo between" clause, I've tried CTE, temp tables, etc. Those SQL solutions performs well only if I don't include filtering. Keep in mind also that I want my solution to be as generic as possible (imagine that i have in my app several lists that virtually presents paged millions of records and those records are constructed with very complex sql statements).
All my tables has a primary key of type INT.
So, I come with an ideea: Why not create a "server" only for select statements. The server loads first all records from all tables and stores them into some HashSets where each T has an Id property and GetHashCode () returns that Id and also the Equals is implemented such that two records are "equal" only if Id is equal (don't scream, You will see later why I am not using all record data for hashing and comparisons).
So far so good, but there's a problem: How can I sync my in memory collections with database records?. The ideea is that I must find a solution such as I load only differential changes. So I invented a changelog table for each table that I want to cache. In this changelog I perform only inserts that marks dirty rows (updates or deletes) and also records newly inserted ids, all of this mechanism implemented using triggers. So whenever an in-memory select comes, I check first if I must sync something (by interogating the changelog). If something must be applied, I load the changelog, I apply those changes in memory and finally I am clearing that changelog (or maybe remember what was the highest changelog id that I've applied ...).
In order to be able to apply the changelog in O ( N ) where N is the changelog size, i am using this algo:
for each log.
identify my in-memory Dictionary <int, T> where the key is the primary key.
if it's a delete log then call dictionary.Remove (id) ( O ( 1 ))
if it's an update log, then call also dictionary.Remove (id) ( O (1)) and move this id into an "to be inserted" collection
if it's an insert log, move this id into a "to be inserted" collection.
finally, refresh cache by selecting all data from the corresponding table where Id in ("to be inserted").
For filtering, I am compiling some expression trees into Func < T, List < FilterCriterias >, bool > functors. Using this mechanism I am performing way more faster than SQL.
I Know that SQL 2012 has caching support and the new comming SQL version will suport even more but My client have SQL server 2005 so ... I can't benefit of this stuff.
My question: What do you think ? this is a bad ideea ? there's a better aproach ?
The developers of SQL Server did a very good job. I think it is fairly impossible to trick this out.
Unless your data has some kind of implicit structure which might help to speed things up and which the optimizer cannot be aware of, such "I do my own speedy trick" approaches won't help - normally...
Performance problems are ever first to be solved where they occur:
the tables structures and relations
indexes and statistics
quality of SQL statements
Even many million rows are no problem if the design and the queries are good...
If your queries do a lot of computations, or you need to retrieve data out of tricky structures (nested list with recursive reads, XML...) I'd go the Data-Warehouse-Path and write some denormalized tables for quick selects. Of course you will have to deal with the fact, that you are reading "old" data. If your data does not change much, you could trigger all changes to a denormalized structure immediately. But this depends on your actual situation.
If you want, you could post one of your imperformant queries together with the relevant structure details and ask for review. There are dedicated groups on Stack-Exchange, such as "Code Review". If it's not to big, you might try it here as well...

How to query an SQLite db in batches

I am using C# with .NET 4.5. I am making a scraper which collects specific data. Each time a value is scraped, I need to make sure it hasn't already been added to the SQLite db.
To do this, I am making a call each time a value is scraped to query against the db to check if it contains the value, and if not, I make another call to insert the value into the db.
Since I am scraping multiple values per second, this gets to be very IO-intensive, with constant calls to the db.
My question is, is there any better way to do this? Perhaps I could queue the values scraped and then run a batch query at once? Is that possible?
I see three approaches:
Use INSERT OR IGNORE, which will reject an entry if it is already present (based on primary key and unique fields). Or plainly INSERT (or its equivalent (INSERT or ABORT) which will return SQLITE_CONSTRAINT, a value you will have to catch and manage if you want to count failed insertions.
Accumulate, outside the database, the updates you want to make. When you have accumulated enough/all, start a transaction (BEGIN;), do your insertions (you can use INSERT OR IGNORE here as well), commit the transaction (COMMIT;)
You could pre-fetch a list of items you already have, depending, and check against that list, if your data model allows it.

Speed up LINQ inserts

I have a CSV file and I have to insert it into a SQL Server database. Is there a way to speed up the LINQ inserts?
I've created a simple Repository method to save a record:
public void SaveOffer(Offer offer)
{
Offer dbOffer = this.db.Offers.SingleOrDefault (
o => o.offer_id == offer.offer_id);
// add new offer
if (dbOffer == null)
{
this.db.Offers.InsertOnSubmit(offer);
}
//update existing offer
else
{
dbOffer = offer;
}
this.db.SubmitChanges();
}
But using this method, the program is way much slower then inserting the data using ADO.net SQL inserts (new SqlConnection, new SqlCommand for select if exists, new SqlCommand for update/insert).
On 100k csv rows it takes about an hour vs 1 minute or so for the ADO.net way. For 2M csv rows it took ADO.net about 20 minutes. LINQ added about 30k of those 2M rows in 25 minutes. My database has 3 tables, linked in the dbml, but the other two tables are empty. The tests were made with all the tables empty.
P.S. I've tried to use SqlBulkCopy, but I need to do some transformations on Offer before inserting it into the db, and I think that defeats the purpose of SqlBulkCopy.
Updates/Edits:
After 18hours, the LINQ version added just ~200K rows.
I've tested the import just with LINQ inserts too, and also is really slow compared with ADO.net. I haven't seen a big difference between just inserts/submitchanges and selects/updates/inserts/submitchanges.
I still have to try batch commit, manually connecting to the db and compiled queries.
SubmitChanges does not batch changes, it does a single insert statement per object. If you want to do fast inserts, I think you need to stop using LINQ.
While SubmitChanges is executing, fire up SQL Profiler and watch the SQL being executed.
See question "Can LINQ to SQL perform batch updates and deletes? Or does it always do one row update at a time?" here: http://www.hookedonlinq.com/LINQToSQLFAQ.ashx
It links to this article: http://www.aneyfamily.com/terryandann/post/2008/04/Batch-Updates-and-Deletes-with-LINQ-to-SQL.aspx that uses extension methods to fix linq's inability to batch inserts and updates etc.
Have you tried wrapping the inserts within a transaction and/or delaying db.SubmitChanges so that you can batch several inserts?
Transactions help throughput by reducing the needs for fsync()'s, and delaying db.SubmitChanges will reduce the number of .NET<->db roundtrips.
Edit: see http://www.sidarok.com/web/blog/content/2008/05/02/10-tips-to-improve-your-linq-to-sql-application-performance.html for some more optimization principles.
Have a look at the following page for a simple walk-through of how to change your code to use a Bulk Insert instead of using LINQ's InsertOnSubmit() function.
You just need to add the (provided) BulkInsert class to your code, make a few subtle changes to your code, and you'll see a huge improvement in performance.
Mikes Knowledge Base - BulkInserts with LINQ
Good luck !
I wonder if you're suffering from an overly large set of data accumulating in the data-context, making it slow to resolve rows against the internal identity cache (which is checked once during the SingleOrDefault, and for "misses" I would expect to see a second hit when the entity is materialized).
I can't recall 100% whether the short-circuit works for SingleOrDefault (although it will in .NET 4.0).
I would try ditching the data-context (submit-changes and replace with an empty one) every n operations for some n - maybe 250 or something.
Given that you're calling SubmitChanges per isntance at the moment, you may also be wasting a lot of time checking the delta - pointless if you've only changed one row. Only call SubmitChanges in batches; not per record.
Alex gave the best answer, but I think a few things are being over looked.
One of the major bottlenecks you have here is calling SubmitChanges for each item individually. A problem I don't think most people know about is that if you haven't manually opened your DataContext's connection yourself, then the DataContext will repeatedly open and close it itself. However, if you open it yourself, and then close it yourself when you're absolutely finished, things will run a lot faster since it won't have to reconnect to the database every time. I found this out when trying to find out why DataContext.ExecuteCommand() was so unbelievably slow when executing multiple commands at once.
A few other areas where you could speed things up:
While Linq To SQL doesn't support your straight up batch processing, you should wait to call SubmitChanges() until you've analyzed everything first. You don't need to call SubmitChanges() after each InsertOnSubmit call.
If live data integrity isn't super crucial, you could retrieve a list of offer_id back from the server before you start checking to see if an offer already exists. This could significantly reduce the amount of times you're calling the server to get an existing item when it's not even there.
Why not pass an offer[] into that method, and doing all the changes in cache before submitting them to the database. Or you could use groups for submission, so you don't run out of cache. The main thing would be how long till you send over the data, the biggest time wasting is in the closing and opening of the connection.
Converting this to a compiled query is the easiest way I can think of to boost your performance here:
Change the following:
Offer dbOffer = this.db.Offers.SingleOrDefault (
o => o.offer_id == offer.offer_id);
to:
Offer dbOffer = RetrieveOffer(offer.offer_id);
private static readonly Func<DataContext, int> RetrieveOffer
{
CompiledQuery.Compile((DataContext context, int offerId) => context.Offers.SingleOrDefault(o => o.offer_id == offerid))
}
This change alone will not make it as fast as your ado.net version, but it will be a significant improvement because without the compiled query you are dynamically building the expression tree every time you run this method.
As one poster already mentioned, you must refactor your code so that submit changes is called only once if you want optimal performance.
Do you really need to check if the record exist before inserting it into the DB. I thought it looked strange as the data comes from a csv file.
P.S. I've tried to use SqlBulkCopy,
but I need to do some transformations
on Offer before inserting it into the
db, and I think that defeats the
purpose of SqlBulkCopy.
I don't think it defeat the purpose at all, why would it? Just fill a simple dataset with all the data from the csv and do a SqlBulkCopy. I did a similar thing with a collection of 30000+ rows and the import time went from minutes to seconds
I suspect it isn't the inserting or updating operations that are taking a long time, rather the code that determines if your offer already exists:
Offer dbOffer = this.db.Offers.SingleOrDefault (
o => o.offer_id == offer.offer_id);
If you look to optimise this, I think you'll be on the right track. Perhaps use the Stopwatch class to do some timing that will help to prove me right or wrong.
Usually, when not using Linq-to-Sql, you would have an insert/update procedure or sql script that would determine whether the record you pass already exists. You're doing this expensive operation in Linq, which certainly will never hope to match the speed of native sql (which is what's happening when you use a SqlCommand and select if the record exists) looking-up on a primary key.
Well you must understand linq creates code dynamically for all ADO operations that you do instead handwritten, so it will always take up more time then your manual code. Its simply an easy way to write code but if you want to talk about performance, ADO.NET code will always be faster depending upon how you write it.
I dont know if linq will try to reuse its last statement or not, if it does then seperating insert batch with update batch may improve performance little bit.
This code runs ok, and prevents large amounts of data:
if (repository2.GeoItems.GetChangeSet().Inserts.Count > 1000)
{
repository2.GeoItems.SubmitChanges();
}
Then, at the end of the bulk insertion, use this:
repository2.GeoItems.SubmitChanges();

Minimise database updates from changes in DataTable/SqlDataAdapter

My goal is to maximise performance. The basics of the scenario are:
I read some data from SQL Server 2005 into a DataTable (1000 records x 10 columns)
I do some processing in .NET of the data, all records have at least 1 field changed in the DataTable, but potentially all 10 fields could be changed
I also add some new records in to the DataTable
I do a SqlDataAdapter.Update(myDataTable.GetChanges()) to persist the updates (an inserts) back to the db using a InsertCommand and UpdateCommand I defined at the start
Assume table being updated contains 10s of millions of records
This is fine. However, if a row has changed in the DataTable then ALL columns for that record are updated in the database even if only 1 out of 9 columns has actually changed value. This means unnecessary work, particularly if indexes are involved. I don't believe SQL Server optimises this scenario?
I think, if I was able to only update the columns that had actually changed for any given record, that I should see a noticeable performance improvement (esp. as cumulatively I will be dealing with millions of rows).
I found this article: http://netcode.ru/dotnet/?lang=&katID=30&skatID=253&artID=6635
But don't like the idea of doing multiple UPDATEs within the sproc.
Short of creating individual UPDATE statements for each changed DataRow and then firing them in somehow in a batch, I'm looking for other people's experiences/suggestions.
(Please assume I can't use triggers)
Thanks in advance
Edit: Any way to get SqlDataAdapter to send UPDATE statements specific to each changed DataRow (only to update the actual changed columns in that row) rather than giving a general .UpdateCommand that updates all columns?
Isn't it possible to implement your own IDataAdapter where you implement this functionality ?
Offcourse, the DataAdapter only fires the correct SqlCommand, which is determined by the RowState of each DataRow.
So, this means that you would have to generate the SQL command that has to be executed for each situation ...
But, I wonder if it is worth the effort. How much performance will you gain ?
I think that - if it is really necessary - I would disable all my indexes and constraints, do the update using the regular SqlDataAdapter, and afterwards enable the indexes and constraints.
you might try is do create an XML of your changed dataset, pass it as a parameter ot a sproc and the do a single update by using sql nodes() function to translate the xml into a tabular form.
you should never try to update a clustered index. if you do it's time to rethink your db schema.
I would VERY much suggest that you do this with a stored procedure.
Lets say that you have 10 million records you have to update. And lets say that each record has 100 bytes (for 10 columns this could be too small, but lets be conservative). This amounts to cca 100 MB of data that must be transferred from database (network traffic), stored in memory and than returned to database in form of UPDATE or INSERT that are much more verbose for transfer to database.
I expect that SP would perform much better.
Than again you could divide you work into smaller SP (that are called from main SP) that would update just the necessary fields and that way gain additional performance.
Disabling indexes/constraints is also an option.
EDIT:
Another thing you must consider is potential number of different update statements. In case of 10 fields per row any field could stay the same or change. So if you construct your UPDATE statement to reflect this you could potentially get 10^2 = 1024 different UPDATE statements and any of those must be parsed by SQL Server, execution plan calculated and parsed statement stored in some area. There is a price to do this.

Categories

Resources