Execute multiple SQL commands in one round trip - c#

I am building an application and I want to batch multiple queries into a single round trip to the database. For example, let's say a single page needs to display a list of users, a list of groups and a list of permissions.
So I have stored procs (or just simple sql commands like "select * from Users"), and I want to execute three of them. However, to populate this one page I have to make 3 round trips.
Now I could write a single stored proc ("getUsersTeamsAndPermissions") or execute a single SQL command "select * from Users;exec getTeams;select * from Permissions".
But I was wondering if there is a better way to specify three operations in a single round trip. Benefits would include easier unit testing and allowing the database engine to parallelize the queries.
I'm using C# 3.5 and SQL Server 2008.

Something like this. Here's a cleaned-up version that disposes its objects properly:
using (var connection = new SqlConnection(ConnectionString))
using (var command = connection.CreateCommand())
{
    connection.Open();
    command.CommandText = "select id from test1; select id from test2";
    using (var reader = command.ExecuteReader())
    {
        do
        {
            while (reader.Read())
            {
                Console.WriteLine(reader.GetInt32(0));
            }
            Console.WriteLine("--next command--");
        } while (reader.NextResult());
    }
}
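The same do/while loop over NextResult() can be wrapped in a small helper that collects every result set. This is only a sketch: the helper name ReadAllResultSets is hypothetical, and an in-memory DataSet stands in for a real command.ExecuteReader() call so that the example runs without a database.

```csharp
using System;
using System.Collections.Generic;
using System.Data;

public static class MultiResultDemo
{
    // Hypothetical helper: collects every result set exposed by a reader.
    // Each result set becomes a list of rows; each row is an object[] of column values.
    public static List<List<object[]>> ReadAllResultSets(IDataReader reader)
    {
        var resultSets = new List<List<object[]>>();
        do
        {
            var rows = new List<object[]>();
            while (reader.Read())
            {
                var values = new object[reader.FieldCount];
                reader.GetValues(values);
                rows.Add(values);
            }
            resultSets.Add(rows);
        } while (reader.NextResult());
        return resultSets;
    }

    // In-memory stand-in for "select id from test1; select id from test2".
    public static DataSet BuildSampleData()
    {
        var dataSet = new DataSet();
        foreach (var name in new[] { "test1", "test2" })
        {
            DataTable table = dataSet.Tables.Add(name);
            table.Columns.Add("id", typeof(int));
        }
        dataSet.Tables["test1"].Rows.Add(1);
        dataSet.Tables["test1"].Rows.Add(2);
        dataSet.Tables["test2"].Rows.Add(10);
        return dataSet;
    }

    public static void Main()
    {
        using (DataSet dataSet = BuildSampleData())
        using (IDataReader reader = dataSet.CreateDataReader())
        {
            List<List<object[]>> resultSets = ReadAllResultSets(reader);
            for (int i = 0; i < resultSets.Count; i++)
            {
                Console.WriteLine("Result set {0}: {1} row(s)", i + 1, resultSets[i].Count);
            }
        }
    }
}
```

Because the helper only depends on IDataReader, the reader returned by a SqlCommand carrying several statements could be passed in the same way.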

The single multi-part command and the stored procedure that you mention are the two options. You can't run them in such a way that they are "parallelized" on the database. However, both options do result in a single round trip, so you're good there. There's no way to send them more efficiently. From SQL Server 2005 onwards, a fully parameterized multi-part command is very efficient.
Edit: adding information on why cram into a single call.
Although you don't want to care too much about reducing calls, there can be legitimate reasons for this.
I once was limited to a crummy ODBC driver against a mainframe, and there was a 1.2 second overhead on each call! I'm serious. There were times when I crammed a little extra into my db calls. Not pretty.
You also might find yourself in a situation where you have to configure your sql queries somewhere, and you can't just make 3 calls: it has to be one. It shouldn't be that way, bad design, but it is. You do what you gotta do!
Sometimes of course it can be very good to encapsulate multiple steps in a stored procedure. Usually not for saving round trips though, but for tighter transactions, getting ID for new records, constraining for permissions, providing encapsulation, blah blah blah.

Making one round trip instead of three will indeed be more efficient. The question is whether it is worth the trouble. The entire ADO.NET and C# 3.5 toolset and framework opposes what you are trying to do. TableAdapters, LINQ to SQL, EF: all of these like to deal with simple one-call == one-result-set semantics. So you may lose some serious productivity by trying to beat the framework into submission.
I would say that unless you have some serious measurements showing that you need to reduce the number of roundtrips, abstain. If you do end up requiring this, then use a stored procedure to at least give an API kind of semantics.
But if your query really is what you posted (i.e. select all users, all teams and all permissions) then you obviously have much bigger fish to fry before reducing the round trips... reduce the result sets first.

I think this link might be helpful.
Consider at least reusing the same open connection; according to what it says here, opening a connection is close to the top of the list of performance costs in Entity Framework.

Firstly, 3 round trips isn't really a big deal. If you were talking about 300 round trips then that would be another matter, but for just 3 round trips I would consider this to definitely be a case of premature optimisation.
That said, the way I'd do this would probably be to execute the 3 stored procedures using SQL:
exec dbo.p_myproc_1 @param_1 = @in_param_1, @param_2 = @in_param_2
exec dbo.p_myproc_2
exec dbo.p_myproc_3
You can then iterate through the returned result sets as you would if you had directly executed multiple rowsets.

Build a temp-table? Insert all results into the temp table and then select * from #temp-table
as in,
#temptable=....
select #temptable.field=mytable.field from mytable
select #temptable.field2=mytable2.field2 from mytable2
etc... Only one trip to the database, though I'm not sure it is actually more efficient.

Related

C# Oracle ODP: Is it possible to return multiple query results in a single trip to the server without calling a stored procedure?

I expected to be able to include multiple SELECT statements, each separated by a semicolon, in my query, and get a dataset returned with the same number of datatables as individual SELECT statements.
I am starting to think that the only way that this can be done is to create a stored procedure with multiple refcursor output parameters.
string sql = @"SELECT
        R.DERVN_RULE_NUM
        ,P.DERVN_PARAM_INPT_IND
        ,R.DERVN_PARAM_NM
        ,R.DERVN_PARAM_VAL_DESC
        ,P.DERVN_PARAM_SPOT_NUM
        ,R.DERVN_PARAM_VAL_TXT
    FROM
        FDS_BASE.DERVN_RULE R
        INNER JOIN FDS_BASE.DERVN_PARAM P
            ON R.DERVN_TY_CD = P.DERVN_TY_CD
            AND R.DERVN_PARAM_NM = P.DERVN_PARAM_NM
    WHERE
        R.DERVN_TY_CD = :DERVN_TY_CD
    ORDER BY
        R.DERVN_RULE_NUM
        ,P.DERVN_PARAM_INPT_IND DESC
        , P.DERVN_PARAM_SPOT_NUM";
var dataSet = new DataSet();
using (OracleConnection oracleConnection = new OracleConnection(connectionString))
{
    oracleConnection.Open();
    var oracleCommand = new OracleCommand(sql, oracleConnection)
    {
        CommandType = CommandType.Text
    };
    oracleCommand.Parameters.Add(":DERVN_TY_CD", derivationType);
    var oracleDataAdapter = new OracleDataAdapter(oracleCommand);
    oracleDataAdapter.Fill(dataSet);
}
I tried to apply what I read here:
https://www.intertech.com/Blog/executing-sql-scripts-with-oracle-odp/
including changing my SQL to enclose it in a BEGIN END BLOCK in this form:
string sql = @"BEGIN
    SELECT 1 FROM DUAL;
    SELECT 2 FROM DUAL;
END";
and replacing my end of line character
sql = sql.Replace("\r\n", "\n");
but nothing works.
Is this even possible without a stored procedure using ODP, or must I make a separate trip to the server for each query?
The simplest way to return multiple query results from a single statement is with the CURSOR SQL function. For example:
select
    cursor(select * from all_tables) tables,
    cursor(select * from all_objects) objects
from dual;
(However, I am not a C# programmer, so I don't know if this solution will work for you. Let me know if the code doesn't work - there's probably another solution using anonymous blocks and OUT parameters.)
"Must I make a separate trip to the server for each query?"
The way this is asked makes it seem like there's a considerable effort or waste of resources going on somewhere that can be saved or reduced, as if making a database query were the equivalent of walking to the shops to get milk, coming back, then walking to the shops again to get bread and coming back.
There isn't any appreciable saving to be had. If this was going to the shops, db querying is like being able to clone yourself X times, the X of you all going to the shops and coming back at different times: some of you found your small things instantly and sprint back with them, some of you found the massive things instantly and stagger back with them, some of you took ages to find your things, etc. (These are metaphors for the speed of query execution and the time required to download large vs small result sets.)
If you have two queries that take ten seconds each to run, you can set them going in parallel and have your results ready and retrieved to the client in 10+x seconds (x being the time required to drag the data over the network), or you could execute them in series and have it be 20+x.
If you think about it, putting two queries in one statement is only the same thing as submitting two statements for execution over different connections. The set of steps the db must take, and the set of steps the client must take to read, are the same. Writing a sproc to handle it is more effort, more complexity to maintain and more places for code to live. Even writing a block to do it is more. None of it saves anything. Even the bytes in the headers of the TCP packets, minutiae as they are, are offset by more complex multi-line blocks. If one query takes considerably longer than the other you might even be hamstrung into having to wait for them all to finish before you can get the results.
Write your "query statement x with y parameters and return resultset Z" as async, start two of them and use Task.WhenAll to wait for them to finish. If you can handle it, don't do a WhenAll but instead read and use the results as they finish; that's a saving, if the process can logically proceed before all queries deliver.
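As a rough illustration of the "start both, then wait" idea, here is a minimal sketch. RunQueryAsync is a hypothetical stand-in that uses Task.Delay to simulate query time; in real code each call would presumably open its own connection and await an async reader.

```csharp
using System;
using System.Diagnostics;
using System.Threading.Tasks;

public static class ParallelQueryDemo
{
    // Hypothetical stand-in for an async query; Task.Delay simulates execution time.
    public static async Task<string> RunQueryAsync(string name, int milliseconds)
    {
        await Task.Delay(milliseconds);
        return name + " done";
    }

    public static async Task<string[]> RunBothAsync()
    {
        // Start both tasks before awaiting either one, so that they overlap.
        Task<string> first = RunQueryAsync("query A", 300);
        Task<string> second = RunQueryAsync("query B", 300);

        // WhenAll returns the results in the same order as the tasks were passed in.
        return await Task.WhenAll(first, second);
    }

    public static void Main()
    {
        Stopwatch stopwatch = Stopwatch.StartNew();
        string[] results = RunBothAsync().GetAwaiter().GetResult();
        stopwatch.Stop();

        Console.WriteLine(string.Join(", ", results));
        Console.WriteLine("Elapsed: {0} ms", stopwatch.ElapsedMilliseconds);
    }
}
```

With two simulated 300 ms queries the elapsed time should land near 300 ms rather than 600 ms. To consume results as they finish instead of waiting for all of them, Task.WhenAny can be used in a loop.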
I get that you're thinking "surely I should just walk to the shops and carry milk and bread back with me - that's more efficient than going twice", but it's a faulty perspective when you consider that the shop is nanoseconds away because you run at the speed of light, you have multiple private unobstructed paths to it, and the bigger spend of time is finding the items you want, loading them into sufficient chained-together carts and dragging them all home. With a cloning approach, if the milk is right there, one of you can take it home and spend 10 minutes making the béchamel with it while the other of you is still waiting 10 minutes for the shop to bake the bread that you'll eat directly when you get home. You can still eat in 10 minutes if you maintain the parallelism, and launching separate operations is not only simpler but also keeps you in easy control of that.

Querying InterSystems Caché through ODBC

I'm querying Caché for a list of tables in two schemas and looping through those tables to obtain a count on the tables. However, this is incredibly slow. For instance, 13 million records took 8 hours to return results. When I query an Oracle database with 13 million records (on the same network), it takes 1.1 seconds to return results.
I'm using a BackgroundWorker to carry out the work apart from the UI (Windows Form).
Here's the code I'm using with the Caché ODBC driver:
using (OdbcConnection odbcCon = new OdbcConnection(strConnection))
{
    try
    {
        odbcCon.Open();
        OdbcCommand odbcCmd = new OdbcCommand();
        // "item" is the loop variable; it must not be declared again inside the loop.
        foreach (var item in lstSchema)
        {
            odbcCmd.CommandText = "SELECT Count(*) FROM " + item;
            odbcCmd.Connection = odbcCon;
            AppendTextBox(item + " Count = " + Convert.ToInt32(odbcCmd.ExecuteScalar()) + "\r\n");
            int intPercentComplete = (int)((float)(lstSchema.IndexOf(item) + 1) / (float)intTotalTables * 100);
            worker.ReportProgress(intPercentComplete);
            ModifyLabel(" (" + (lstSchema.IndexOf(item) + 1) + " out of " + intTotalTables + " processed)");
        }
    }
    catch (Exception ex)
    {
        MessageBox.Show(ex.ToString());
        return;
    }
}
Is the driver the issue?
Thanks.
I suppose the devil is in the details. Your code does
SELECT COUNT(*) FROM Table
If the table has no indices then I wouldn't be surprised that it is slower than you expect. If the table has indices, especially bitmap indices, I would expect this to be on par with Oracle.
The other thing to consider is how Caché is configured, i.e. how large the global buffers are and what the performance of the disk looks like.
InterSystems Caché is slower for querying than any SQL database I have used, especially when you deal with large databases. Now add ODBC overhead to the picture and you will get even worse performance.
Some level of performance can be achieved through use of bitmap indexes, but often the only way to get good performance is to create more data.
You might even find that you can allocate more memory for the database (but that never seemed to do much for me)
For example every time you add new data force the database to increment a number somewhere for your count (or even multiple entries for the purpose of grouping). Then you can have performance at a reasonable level.
I wrote a little Intersystems performance test post on my blog...
http://tesmond.blogspot.co.uk/2013/09/intersystems-cache-performance-woe-is-me.html
Cache has a built in (smart) function that determines how to best execute queries. Of course having indexes, especially bitmapped, will drastically help query times. Though, a mere 13 million rows should take seconds tops. How much data is in each row? We have 260 million rows in many tables and 790 million rows in others. We can mow through the whole thing in a couple of minutes. A non-indexed, complex query may take a day, though that is understandable. Take a look at what's locking your globals. We have also discovered that apparently queries run even if the client is disconnected. You can kill the task with the management portal, but the system doesn't seem to like doing more than one ODBC query at once with larger queries because it takes gigs of temp data to do such a query. We use DBVisualizer for a JDBC connection.
Someone mentioned TuneTable; that's great to run if your table changes a lot, or at least a couple of times in the table's life. This is NOT something that you want to overuse. http://docs.intersystems.com/ens20151/csp/docbook/DocBook.UI.Page.cls?KEY=GSQL_optimizing is where you can find some documentation and other useful information about this and about improving performance. If it's not fast then someone broke it.
Someone also mentioned that select count() will count an index instead of the table itself with computed properties. This is related to the decision engine that compiles your SQL queries and decides on the most efficient way to get your data. There is a tool in the portal that shows how long a query takes and which alternative methods are available to the smart interpreter (I forget what it's called). You can see the Query Plan on the same page where you can execute SQL in the browser, mentioned below: /csp/sys/exp/UtilSqlQueryShowPlan.csp
RE: I can't run this query from within the Management Portal because the tables are only made available from within an application and/or ODBC.
That isn't actually true. Within the management portal, go to System Explorer, SQL, then Execute SQL Statements. Please note that you must have adequate privileges to see this; %ALL will allow access to everything. Also, you can run SQL queries natively in TERMINAL by executing do $system.SQL.Shell() and then typing your queries. This interface should be faster than ODBC, as I think it uses object access. Also, keep in mind that embedded SQL and object access are the fastest ways to access data.
Please let me know if you have any more questions!

"cursor like" reading inside a CLR procedure/function

I have to implement an algorithm on data which is (for good reasons) stored inside SQL server. The algorithm does not fit SQL very well, so I would like to implement it as a CLR function or procedure. Here's what I want to do:
Execute several queries (usually 20-50, but up to 100-200) which all have the form select a,b,... from some_table order by xyz. There's an index which fits that query, so the result should be available more or less without any calculation.
Consume the results step by step. The exact stepping depends on the results, so it's not exactly predictable.
Aggregate some result by stepping over the results. I will only consume the first parts of the results, but cannot predict how much I will need. The stop criteria depends on some threshold inside the algorithm.
My idea was to open several SqlDataReader, but I have two problems with that solution:
You can have only one SqlDataReader per connection and inside a CLR method I have only one connection - as far as I understand.
I don't know how to tell SqlDataReader how to read data in chunks. I could not find documentation how SqlDataReader is supposed to behave. As far as I understand, it's preparing the whole result set and would load the whole result into memory. Even if I would consume only a small part of it.
Any hint how to solve that as a CLR method? Or is there a more low level interface to SQL server which is more suitable for my problem?
Update: I should have made two points more explicit:
I'm talking about big data sets, so a query might return 1 million records, but my algorithm would consume only the first 100-200 of them. But as I said before: I don't know the exact number beforehand.
I'm aware that SQL might not be the best choice for that kind of algorithm. But due to other constraints it has to be a SQL server. So I'm looking for the best possible solution.
SqlDataReader does not read the whole result set; you are confusing it with the DataSet class. It reads row by row, as the .Read() method is called. If a client does not consume the result set, the server will suspend the query execution because it has no room to write the output (the selected rows) into. Execution will resume as the client consumes more rows (as SqlDataReader.Read is called). There is even a special command behavior flag, SequentialAccess, that instructs ADO.NET not to pre-load the entire row into memory, which is useful for accessing large BLOB columns in a streaming fashion (see Download and Upload images from SQL Server via ASP.Net MVC for a practical example).
You can have multiple active result sets (SqlDataReader) on a single connection when MARS is enabled. However, MARS is incompatible with SQLCLR context connections.
So you can create a CLR streaming TVF to do some of what you need in CLR, but only if you have one single SQL query source. Multiple queries would require you to abandon the context connection and use a fully fledged connection instead, i.e. connect back to the same instance in a loopback; this would allow MARS and thus let you consume multiple result sets. But loopback has its own issues, as it breaks the transaction boundaries you have with the context connection. Specifically, with a loopback connection your TVF won't be able to read the changes made by the same transaction that called the TVF, because it is a different transaction on a different connection.
SQL is designed to work against huge data sets, and is extremely powerful. With set based logic it's often unnecessary to iterate over the data to perform operations, and there are a number of built-in ways to do this within SQL itself.
1) write set based logic to update the data without cursors
2) use deterministic User Defined Functions with set based logic (you can do this with the SqlFunction attribute in CLR code). Non-deterministic means the output value is not always the same for the same input, and it will have the effect of turning the query into a cursor internally.
[SqlFunction(IsDeterministic = true, IsPrecise = true)]
public static int algorithm(int value1, int value2)
{
    int value3 = ... ;
    return value3;
}
3) use cursors as a last resort. This is a powerful way to execute logic per row on the database but has a performance impact. It appears from this article that CLR can outperform SQL cursors (thanks Martin).
I saw your comment that the complexity of using set based logic was too much. Can you provide an example? There are many SQL ways to solve complex problems - CTE, Views, partitioning etc.
Of course you may well be right in your approach, and I don't know what you are trying to do, but my gut says leverage the tools of SQL. Spawning multiple readers isn't the right way to approach the database implementation. It may well be that you need multiple threads calling into a SP to run concurrent processing, but don't do this inside the CLR.
To answer your question, with CLR implementations (and IDataReader) you don't really need to page results in chunks, because you are not loading data into memory or transporting data over the network. IDataReader gives you access to the data stream row by row. By the sound of it, your algorithm determines the number of records that need updating, so when that point is reached simply stop calling Read() and end there.
SqlMetaData[] columns = new SqlMetaData[3];
columns[0] = new SqlMetaData("Value1", SqlDbType.Int);
columns[1] = new SqlMetaData("Value2", SqlDbType.Int);
columns[2] = new SqlMetaData("Value3", SqlDbType.Int);
SqlDataRecord record = new SqlDataRecord(columns);
SqlContext.Pipe.SendResultsStart(record);
SqlDataReader reader = comm.ExecuteReader();
bool flag = true;
while (flag && reader.Read())
{
    int value1 = Convert.ToInt32(reader[0]);
    int value2 = Convert.ToInt32(reader[1]);
    // some algorithm
    int newValue = ...;
    // Fill the outgoing record; the reader is read-only, and column ordinals are zero-based.
    record.SetInt32(0, value1);
    record.SetInt32(1, value2);
    record.SetInt32(2, newValue);
    SqlContext.Pipe.SendResultsRow(record);
    // keep going?
    flag = newValue < 100;
}
reader.Close();
// Tell the pipe that the result set is complete.
SqlContext.Pipe.SendResultsEnd();
Cursors are a SQL only function. If you wanted to read chunks of data at a time, some sort of paging would be required so that only a certain amount of the records would be returned. If using Linq,
.Skip(Skip)
.Take(PageSize)
Skips and takes could be used to limit results returned.
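A minimal sketch of the Skip/Take idea, using LINQ to Objects so it runs without a database. The helper name GetPage and the zero-based pageIndex are assumptions for illustration; against a LINQ provider the same two operators would be applied to an ordered query.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class PagingDemo
{
    // Returns one page of an already ordered sequence. pageIndex is zero-based.
    public static List<T> GetPage<T>(IEnumerable<T> orderedSource, int pageIndex, int pageSize)
    {
        return orderedSource
            .Skip(pageIndex * pageSize)
            .Take(pageSize)
            .ToList();
    }

    public static void Main()
    {
        // In-memory stand-in for an ordered query result.
        IEnumerable<int> ids = Enumerable.Range(1, 10);

        List<int> secondPage = GetPage(ids, 1, 4);
        Console.WriteLine(string.Join(", ", secondPage));
    }
}
```

Paging only gives stable results when the source has a well-defined order, which matches the "order by xyz" queries described in the question.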
You can simply iterate over the DataReader by doing something like this:
using (IDataReader reader = Command.ExecuteReader())
{
    while (reader.Read())
    {
        //Do something with this record
    }
}
This would be iterating over the results one at a time, similar to a cursor in SQL Server.
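To illustrate stopping early when a threshold is reached, here is a small sketch. The helper ConsumeUntil and the sample table are hypothetical, and a DataTableReader stands in for the real reader so that the example runs without SQL Server.

```csharp
using System;
using System.Data;

public static class EarlyStopDemo
{
    // Sums the first column until the running total reaches the threshold,
    // then stops calling Read(). Returns the number of rows consumed.
    public static int ConsumeUntil(IDataReader reader, int threshold, out int total)
    {
        total = 0;
        int rowsRead = 0;
        while (total < threshold && reader.Read())
        {
            total += reader.GetInt32(0);
            rowsRead++;
        }
        return rowsRead;
    }

    // In-memory stand-in for a large ordered result: values 1..rowCount.
    public static DataTable BuildSampleTable(int rowCount)
    {
        var table = new DataTable("numbers");
        table.Columns.Add("value", typeof(int));
        for (int i = 1; i <= rowCount; i++)
        {
            table.Rows.Add(i);
        }
        return table;
    }

    public static void Main()
    {
        using (DataTable table = BuildSampleTable(1000))
        using (IDataReader reader = table.CreateDataReader())
        {
            int total;
            int rowsRead = ConsumeUntil(reader, 15, out total);
            Console.WriteLine("Stopped after {0} row(s), total {1}", rowsRead, total);
        }
    }
}
```

With a real SqlDataReader over a large result, it may be worth calling Cancel() on the command before closing the reader, since closing can otherwise spend time on the rows that were never read.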
For multiple recordsets at once, try MARS
(if SQL Server)
http://msdn.microsoft.com/en-us/library/ms131686.aspx

.NET/C# check if SQL query modifies database and if not execute

I know how to execute queries from C# but I want to provide a dropdown list in which people can write a query and it will execute and populate the list.
A problem is that I want to forbid all queries that modify the database in any way. I have not managed to find a way to do this and I did my best with google.
The solution I can think of is that I will scan the query for INSERT, DELETE, UPDATE and only allow SELECT statements. However, I want to be able to allow users to call stored procedures as well. This means I need to get the body of the stored procedure and scan it before I execute it. How do I download a stored procedure then?
If anyone knows a way to only execute read only queries do share please! I have the feeling scanning the text for INSERT, DELETE, UPDATE doesn't prevent SQL injections.
The easiest way to do this might be to offload this job to the database. Just make sure that the database user that will be running the queries has read-access only. Then, any queries that do anything other than SELECT will fail, and you can report that failure back to the users.
If you don't go this route, the complexity becomes quite enormous, since you basically have to be prepared to parse an arbitrary SQL statement, not to mention arbitrary sequences of SQL statements if you allow stored procs to be run.
Even then, take care to ensure that you aren't leaking sensitive data through your queries. Directly input queries from site users can be dangerous if you're not careful. Even if you are, allowing these queries on anything but a specifically constructed sandbox database is a "whoops, I accidentally changed the user's permissions" away from becoming a security nightmare.
Another option is to write a "query creator" page, where users can pick the table and columns they'd like to see. You can then a) only show tables and columns that are appropriate for a given user (possibly based on user roles etc.) and b) generate the SQL yourself, preferably using a parameterized query.
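A rough sketch of what the SQL generation behind such a "query creator" page might look like. The whitelist contents, the class name and BuildSelect are all hypothetical; the point is only that user input selects from known names and is never pasted into the SQL text.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class QueryBuilderDemo
{
    // Hypothetical whitelist: table name -> columns a user may select.
    private static readonly Dictionary<string, string[]> Allowed =
        new Dictionary<string, string[]>(StringComparer.OrdinalIgnoreCase)
        {
            { "Users", new[] { "Id", "Name" } },
            { "Groups", new[] { "Id", "Title" } }
        };

    // Builds a SELECT from whitelisted names only; throws on anything else.
    public static string BuildSelect(string table, IEnumerable<string> columns)
    {
        string[] allowedColumns;
        if (!Allowed.TryGetValue(table, out allowedColumns))
            throw new ArgumentException("Table is not allowed: " + table);

        List<string> requested = columns.ToList();
        if (requested.Count == 0)
            throw new ArgumentException("At least one column is required.");

        var safeColumns = new List<string>();
        foreach (string column in requested)
        {
            // Emit the whitelist's spelling, never the user's raw text.
            string match = allowedColumns.FirstOrDefault(
                c => string.Equals(c, column, StringComparison.OrdinalIgnoreCase));
            if (match == null)
                throw new ArgumentException("Column is not allowed: " + column);
            safeColumns.Add("[" + match + "]");
        }

        string safeTable = Allowed.Keys.First(
            k => string.Equals(k, table, StringComparison.OrdinalIgnoreCase));
        return "SELECT " + string.Join(", ", safeColumns) + " FROM [" + safeTable + "]";
    }

    public static void Main()
    {
        Console.WriteLine(BuildSelect("users", new[] { "name", "id" }));
    }
}
```

Filter values would still be passed as command parameters rather than concatenated, and the whitelist could be filtered per user role before it is shown.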
Update: As Yahia points out, if the user has execute privilege (so that they can execute stored procs,) then the permissions of the procedure itself are honoured. Given that, it might be better to not allow arbitrary stored proc execution, but rather offer the users a list of procedures that are known to be safe. That will probably be difficult to maintain and error-prone, though, so disallowing stored procs altogether might be best.
How about creating a user account on the database server which only has select (read-only) rights?
Perhaps you could set up a SQL user with read-only access to the database and issue the command using that user? Then you can catch the errors when/if they happen.
It seems to me that it's going to be very difficult and error-prone to try to parse the query to figure out if it modifies the database.
You can't parse SQL like that reliably.
Use permissions to
Allow only SELECT on tables and views
No permissions on stored procedures that change data (An end user by default won't be able to see stored procedure definition)
Best is to not allow users to enter SQL and use only prepared/parameterized queries...
The next best way to prevent that is to use a restricted user with pure read access
The above two can be combined...
BEWARE
To execute a Stored Procedure the user must have execute privilege... IF the Stored Procedure modifies data then this would happen without an error message even with a restricted user, since the permission to modify is granted to the Stored Procedure!
IF you absolutely must allow users to enter SQL and can't restrict the login then you would need to use a SQL parser - for example this...
As to how to download the body of a Stored Procedure - this is dependent on the DB you use (SQL Server, Oracle etc.).
EDIT:
Another option is a so-called "Database Firewall": instead of connecting directly to the DB, you connect to the firewall. In the firewall you configure things like time-based restrictions (when specific users/statements are or are not allowed), statement-based restrictions (which SQL statements are allowed), quantity-based restrictions (for example, you can get 100 records but cannot download the whole table/DB), etc.
There are commercial and opensource DB Firewalls out there - though these are by nature very dependent on the DB you use etc.
Examples:
Oracle Firewall - works with Oracle / SQL Server / DB2 etc.
SecureSphere - several including Oracle / SQL Server / DB2 etc.
GreenSQL - opensource version support Postgres + MySQL, commercial MS SQL Server
Don't forget about things that are even worse than INSERT, UPDATE, and DELETE. Like TRUNCATE...that's some bad stuff.
I think a SQL trigger is the best way to do what you want.
Your first move should be to create a DB user for this specific task with only the needed permissions (basically SELECT only), and with the rights to see only the tables you need them to see (so they cannot SELECT sys tables or your users table).
More generally, it seems like a bad idea to let users execute code directly on your database. Even if you protect it against data modification, they will still be able to make ugly-looking joins to make your db run slow, for instance.
Whichever language you're programming the UI in, you could look online for a custom control that allows filtering on a database. Google it...
This is not perfect but might be what you want. It allows a keyword to appear if it's part of a bigger alphanumeric string:
public static bool ValidateQuery(string query)
{
    // "update" is included because the question lists INSERT, DELETE and UPDATE.
    return !ValidateRegex("delete", query) && !ValidateRegex("exec", query) && !ValidateRegex("insert", query) &&
           !ValidateRegex("update", query) && !ValidateRegex("alter", query) && !ValidateRegex("create", query) &&
           !ValidateRegex("drop", query) && !ValidateRegex("truncate", query);
}
public static bool ValidateRegex(string term, string query)
{
    // Matches the keyword {0} when it is not directly preceded or followed by an alphanumeric
    // character, including when it sits at the very start or end of the query.
    return new Regex(string.Format("(?<![0-9a-z]){0}(?![0-9a-z])", term), RegexOptions.IgnoreCase).IsMatch(query);
}
you can see how it works here: regexstorm
see regex cheat sheet: cheatsheet1, cheatsheet2
Notice this is not perfect, since it might block a query that contains one of the keywords inside a quoted string, but if you write the queries yourself and it's just a precaution then this might do the trick.
you can also take a different approach, try the query, and if it affects the database do a rollback:
public static bool IsDbAffected(string query, string conn, List<SqlParameter> parameters = null)
{
    using (var sqlConnection = new SqlConnection(conn))
    {
        sqlConnection.Open();
        using (var transaction = sqlConnection.BeginTransaction("Test Transaction"))
        using (var command = new SqlCommand(query, sqlConnection, transaction))
        {
            command.CommandType = CommandType.Text;
            if (parameters != null)
                command.Parameters.AddRange(parameters.ToArray());
            // ExecuteNonQuery() does not return data at all: only the number of rows affected by an
            // insert, update, or delete. For other statements (SELECT, but also DDL) it returns -1.
            int rowsAffected = command.ExecuteNonQuery();
            // This is only a probe, so always roll back; the using blocks dispose everything.
            transaction.Rollback();
            return rowsAffected > 0;
        }
    }
}
you can also combine the two.

Speed up LINQ inserts

I have a CSV file and I have to insert it into a SQL Server database. Is there a way to speed up the LINQ inserts?
I've created a simple Repository method to save a record:
public void SaveOffer(Offer offer)
{
    Offer dbOffer = this.db.Offers.SingleOrDefault(
        o => o.offer_id == offer.offer_id);
    // add new offer
    if (dbOffer == null)
    {
        this.db.Offers.InsertOnSubmit(offer);
    }
    // update existing offer
    else
    {
        // NOTE: this only rebinds the local variable; to actually update the row,
        // copy the properties of offer onto dbOffer instead.
        dbOffer = offer;
    }
    this.db.SubmitChanges();
}
But using this method, the program is much slower than inserting the data using ADO.NET SQL inserts (new SqlConnection, new SqlCommand for select if exists, new SqlCommand for update/insert).
On 100k csv rows it takes about an hour vs 1 minute or so for the ADO.net way. For 2M csv rows it took ADO.net about 20 minutes. LINQ added about 30k of those 2M rows in 25 minutes. My database has 3 tables, linked in the dbml, but the other two tables are empty. The tests were made with all the tables empty.
P.S. I've tried to use SqlBulkCopy, but I need to do some transformations on Offer before inserting it into the db, and I think that defeats the purpose of SqlBulkCopy.
Updates/Edits:
After 18 hours, the LINQ version had added just ~200K rows.
I've also tested the import with LINQ inserts only, and it is still really slow compared with ADO.NET. I haven't seen a big difference between just inserts/SubmitChanges and selects/updates/inserts/SubmitChanges.
I still have to try batch commit, manually connecting to the db and compiled queries.
SubmitChanges does not batch changes, it does a single insert statement per object. If you want to do fast inserts, I think you need to stop using LINQ.
While SubmitChanges is executing, fire up SQL Profiler and watch the SQL being executed.
See question "Can LINQ to SQL perform batch updates and deletes? Or does it always do one row update at a time?" here: http://www.hookedonlinq.com/LINQToSQLFAQ.ashx
It links to this article: http://www.aneyfamily.com/terryandann/post/2008/04/Batch-Updates-and-Deletes-with-LINQ-to-SQL.aspx that uses extension methods to fix linq's inability to batch inserts and updates etc.
Have you tried wrapping the inserts within a transaction and/or delaying db.SubmitChanges so that you can batch several inserts?
Transactions help throughput by reducing the needs for fsync()'s, and delaying db.SubmitChanges will reduce the number of .NET<->db roundtrips.
Edit: see http://www.sidarok.com/web/blog/content/2008/05/02/10-tips-to-improve-your-linq-to-sql-application-performance.html for some more optimization principles.
Have a look at the following page for a simple walk-through of how to change your code to use a Bulk Insert instead of using LINQ's InsertOnSubmit() function.
You just need to add the (provided) BulkInsert class to your code, make a few subtle changes to your code, and you'll see a huge improvement in performance.
Mikes Knowledge Base - BulkInserts with LINQ
Good luck !
I wonder if you're suffering from an overly large set of data accumulating in the data-context, making it slow to resolve rows against the internal identity cache (which is checked once during the SingleOrDefault, and for "misses" I would expect to see a second hit when the entity is materialized).
I can't recall 100% whether the short-circuit works for SingleOrDefault (although it will in .NET 4.0).
I would try ditching the data-context (submit-changes and replace with an empty one) every n operations for some n - maybe 250 or something.
Given that you're calling SubmitChanges per instance at the moment, you may also be wasting a lot of time checking the delta, which is pointless if you've only changed one row. Only call SubmitChanges in batches, not per record.
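As a rough sketch of the "every n operations" idea above, the helper below splits any sequence into fixed-size chunks. It is plain C# with no database access; the names BatchingSketch and InBatches are made up for illustration, and MyDataContext in the usage comment is a placeholder for your own generated context.

```csharp
using System;
using System.Collections.Generic;

public static class BatchingSketch
{
    // Splits a sequence into consecutive chunks of at most batchSize items.
    // Because this is an iterator, nothing runs until the result is enumerated.
    public static IEnumerable<List<T>> InBatches<T>(IEnumerable<T> source, int batchSize)
    {
        if (batchSize <= 0) throw new ArgumentOutOfRangeException("batchSize");

        var batch = new List<T>(batchSize);
        foreach (var item in source)
        {
            batch.Add(item);
            if (batch.Count == batchSize)
            {
                yield return batch;
                batch = new List<T>(batchSize);
            }
        }

        // Emit the final, possibly smaller, chunk.
        if (batch.Count > 0) yield return batch;
    }
}

// Hypothetical usage with LINQ to SQL (MyDataContext is a placeholder name):
//
// foreach (var batch in BatchingSketch.InBatches(offers, 250))
// {
//     using (var db = new MyDataContext())
//     {
//         foreach (var offer in batch) db.Offers.InsertOnSubmit(offer);
//         db.SubmitChanges(); // one call per batch, then the context is discarded
//     }
// }
```

The value 250 is just the figure suggested above; it is worth measuring a few sizes against your own data.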
Alex gave the best answer, but I think a few things are being overlooked.
One of the major bottlenecks you have here is calling SubmitChanges for each item individually. A problem I don't think most people know about is that if you haven't manually opened your DataContext's connection yourself, then the DataContext will repeatedly open and close it itself. However, if you open it yourself, and then close it yourself when you're absolutely finished, things will run a lot faster since it won't have to reconnect to the database every time. I found this out when trying to find out why DataContext.ExecuteCommand() was so unbelievably slow when executing multiple commands at once.
A few other areas where you could speed things up:
While Linq To SQL doesn't support your straight up batch processing, you should wait to call SubmitChanges() until you've analyzed everything first. You don't need to call SubmitChanges() after each InsertOnSubmit call.
If live data integrity isn't super crucial, you could retrieve a list of offer_id values from the server before you start checking whether an offer already exists. This could significantly reduce the number of times you call the server to look up an existing item that isn't even there.
Why not pass an offer[] into that method and make all the changes in the cache before submitting them to the database? Or you could submit in groups, so you don't run out of cache. The main thing is how long you wait before sending the data over; the biggest waste of time is in the closing and opening of the connection.
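A minimal sketch of the "fetch the ids first" suggestion, under the assumption that offer_id is an int and that the existing ids have already been loaded into memory (for example with something like db.Offers.Select(o => o.offer_id).ToList()). The Offer class here is a stand-in for the real entity, and OfferPartitionSketch is a made-up name.

```csharp
using System.Collections.Generic;

// Hypothetical stand-in for the LINQ to SQL entity; only the key matters here.
public class Offer
{
    public int offer_id { get; set; }
}

public static class OfferPartitionSketch
{
    // Splits incoming offers into new rows (inserts) and already-known rows
    // (updates) using an in-memory set of keys, so no per-row lookup query is needed.
    public static void Partition(
        IEnumerable<int> existingIds,
        IEnumerable<Offer> incoming,
        List<Offer> toInsert,
        List<Offer> toUpdate)
    {
        var known = new HashSet<int>(existingIds);
        foreach (var offer in incoming)
        {
            if (known.Contains(offer.offer_id))
            {
                toUpdate.Add(offer);
            }
            else
            {
                toInsert.Add(offer);
                // Remember the key so a repeat later in the file is treated as an update.
                known.Add(offer.offer_id);
            }
        }
    }
}
```

Only the rows in toUpdate would then need to be fetched from the database, and the rows in toInsert can be queued with InsertOnSubmit without any lookup.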
Converting this to a compiled query is the easiest way I can think of to boost your performance here:
Change the following:
Offer dbOffer = this.db.Offers.SingleOrDefault (
o => o.offer_id == offer.offer_id);
to:
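Offer dbOffer = RetrieveOffer(this.db, offer.offer_id);

// Requires: using System.Data.Linq;
// MyDataContext is a placeholder for your generated DataContext subclass (the one that exposes Offers).
private static readonly Func<MyDataContext, int, Offer> RetrieveOffer =
    CompiledQuery.Compile((MyDataContext context, int offerId) =>
        context.Offers.SingleOrDefault(o => o.offer_id == offerId));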
This change alone will not make it as fast as your ADO.NET version, but it will be a significant improvement, because without the compiled query you are dynamically building the expression tree every time you run this method.
As one poster already mentioned, you must refactor your code so that SubmitChanges is called only once if you want optimal performance.
Do you really need to check whether the record exists before inserting it into the DB? It looked strange to me, given that the data comes from a CSV file.
"P.S. I've tried to use SqlBulkCopy, but I need to do some transformations on Offer before inserting it into the db, and I think that defeats the purpose of SqlBulkCopy."
I don't think it defeats the purpose at all; why would it? Just fill a simple dataset with all the data from the CSV and do a SqlBulkCopy. I did a similar thing with a collection of 30000+ rows, and the import time went from minutes to seconds.
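A hedged sketch of what that could look like: transform the CSV rows in memory first, load them into a DataTable, then hand the table to SqlBulkCopy. The OfferRow class, the "title" column, and the "Offers" destination table are assumptions made for illustration, since the real schema isn't shown in the question.

```csharp
using System.Collections.Generic;
using System.Data;
using System.Data.SqlClient;

// Hypothetical shape of one already-transformed CSV row.
public class OfferRow
{
    public int OfferId { get; set; }
    public string Title { get; set; }
}

public static class BulkCopySketch
{
    // Builds an in-memory table from rows that have already been transformed.
    public static DataTable ToDataTable(IEnumerable<OfferRow> rows)
    {
        var table = new DataTable("Offers");
        table.Columns.Add("offer_id", typeof(int));
        table.Columns.Add("title", typeof(string));
        foreach (var row in rows)
        {
            table.Rows.Add(row.OfferId, row.Title);
        }
        return table;
    }

    // Sends the whole table to SQL Server in one bulk operation.
    // The destination table and column names are assumptions.
    public static void Write(string connectionString, DataTable table)
    {
        using (var bulk = new SqlBulkCopy(connectionString))
        {
            bulk.DestinationTableName = "Offers";
            bulk.ColumnMappings.Add("offer_id", "offer_id");
            bulk.ColumnMappings.Add("title", "title");
            bulk.WriteToServer(table);
        }
    }
}
```

Note that SqlBulkCopy only inserts, so rows that need updating would still have to be handled separately, for instance by bulk-loading into a staging table first.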
I suspect it isn't the inserting or updating operations that are taking a long time, rather the code that determines if your offer already exists:
Offer dbOffer = this.db.Offers.SingleOrDefault (
o => o.offer_id == offer.offer_id);
If you look to optimise this, I think you'll be on the right track. Perhaps use the Stopwatch class to do some timing that will help to prove me right or wrong.
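For the timing suggested above, a small helper along these lines may be enough. TimingSketch and Measure are made-up names; Stopwatch itself lives in System.Diagnostics.

```csharp
using System;
using System.Diagnostics;

public static class TimingSketch
{
    // Runs the action once and returns the wall-clock time it took.
    public static TimeSpan Measure(Action action)
    {
        var watch = Stopwatch.StartNew();
        action();
        watch.Stop();
        return watch.Elapsed;
    }
}

// Hypothetical usage around the lookup in question:
//
// TimeSpan lookup = TimingSketch.Measure(() =>
//     this.db.Offers.SingleOrDefault(o => o.offer_id == offer.offer_id));
// Console.WriteLine("Lookup took {0} ms", lookup.TotalMilliseconds);
```

Accumulating the lookup time and the SubmitChanges time separately over the whole import should show which of the two dominates.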
Usually, when not using LINQ to SQL, you would have an insert/update procedure or SQL script that determines whether the record you pass already exists. You're doing this expensive operation in LINQ, which cannot hope to match the speed of native SQL looking up on a primary key (which is what happens when you use a SqlCommand to check whether the record exists).
Well, you must understand that LINQ generates the code for all of your ADO operations dynamically instead of using handwritten code, so it will always take more time than your manual code. It's simply an easy way to write code, but if you want to talk about performance, ADO.NET code will always be faster, depending on how you write it.
I don't know whether LINQ will try to reuse its last statement or not; if it does, then separating the insert batch from the update batch may improve performance a little bit.
This code runs OK and keeps the pending change set from growing too large:
if (repository2.GeoItems.GetChangeSet().Inserts.Count > 1000)
{
repository2.GeoItems.SubmitChanges();
}
Then, at the end of the bulk insertion, use this:
repository2.GeoItems.SubmitChanges();
