"cursor like" reading inside a CLR procedure/function

"cursor like" reading inside a CLR procedure/function - c#

I have to implement an algorithm on data which is (for good reasons) stored inside SQL server. The algorithm does not fit SQL very well, so I would like to implement it as a CLR function or procedure. Here's what I want to do:
Execute several queries (usually 20-50, but up to 100-200) which all have the form select a,b,... from some_table order by xyz. There's an index which fits that query, so the result should be available more or less without any calculation.
Consume the results step by step. The exact stepping depends on the results, so it's not exactly predictable.
Aggregate some result by stepping over the results. I will only consume the first parts of the results, but cannot predict how much I will need. The stop criteria depends on some threshold inside the algorithm.
My idea was to open several SqlDataReader, but I have two problems with that solution:
You can have only one SqlDataReader per connection and inside a CLR method I have only one connection - as far as I understand.
I don't know how to tell SqlDataReader how to read data in chunks. I could not find documentation how SqlDataReader is supposed to behave. As far as I understand, it's preparing the whole result set and would load the whole result into memory. Even if I would consume only a small part of it.
Any hint how to solve that as a CLR method? Or is there a more low level interface to SQL server which is more suitable for my problem?
Update: I should have made two points more explicit:
I'm talking about big data sets, so a query might result in 1 mio records, but my algorithm would consume only the first 100-200 ones. But as I said before: I don't know the exact number beforehand.
I'm aware that SQL might not be the best choice for that kind of algorithm. But due to other constraints it has to be a SQL server. So I'm looking for the best possible solution.

SqlDataReader does not read the whole dataset, you are confusing it with the Dataset class. It reads row by row, as the .Read() method is being called. If a client does not consume the resultset the server will suspend the query execution because it has no room to write the output into (the selected rows). Execution will resume as the client consumes more rows (SqlDataReader.Read is being called). There is even a special command behavior flag SequentialAccess that instructs the ADO.Net not to pre-load in memory the entire row, useful for accessing large BLOB columns in a streaming fashion (see Download and Upload images from SQL Server via ASP.Net MVC for a practical example).
You can have multiple active result sets (SqlDataReader) active on a single connection when MARS is active. However, MARS is incompatible with SQLCLR context connections.
So you can create a CLR streaming TVF to do some of what you need in CLR, but only if you have one single SQL query source. Multiple queries it would require you to abandon the context connection and use isntead a fully fledged connection, ie. connect back to the same instance in a loopback, and this would allow MARS and thus consume multiple resultsets. But loopback has its own issues as it breaks the transaction boundaries you have from context connection. Specifically with a loopback connection your TVF won't be able to read the changes made by the same transaction that called the TVF, because is a different transaction on a different connection.

SQL is designed to work against huge data sets, and is extremely powerful. With set based logic it's often unnecessary to iterate over the data to perform operations, and there are a number of built-in ways to do this within SQL itself.
1) write set based logic to update the data without cursors
2) use deterministic User Defined Functions with set based logic (you can do this with the SqlFunction attribute in CLR code). Non-Deterministic will have the affect of turning the query into a cursor internally, it means the value output is not always the same given the same input.
[SqlFunction(IsDeterministic = true, IsPrecise = true)]
public static int algorithm(int value1, int value2)
{
int value3 = ... ;
return value3;
}
3) use cursors as a last resort. This is a powerful way to execute logic per row on the database but has a performance impact. It appears from this article CLR can out perform SQL cursors (thanks Martin).
I saw your comment that the complexity of using set based logic was too much. Can you provide an example? There are many SQL ways to solve complex problems - CTE, Views, partitioning etc.
Of course you may well be right in your approach, and I don't know what you are trying to do, but my gut says leverage the tools of SQL. Spawning multiple readers isn't the right way to approach the database implementation. It may well be that you need multiple threads calling into a SP to run concurrent processing, but don't do this inside the CLR.
To answer your question, with CLR implementations (and IDataReader) you don't really need to page results in chunks because you are not loading data into memory or transporting data over the network. IDataReader gives you access to the data stream row-by-row. By the sounds it your algorithm determines the amount of records that need updating, so when this happens simply stop calling Read() and end at that point.
SqlMetaData[] columns = new SqlMetaData[3];
columns[0] = new SqlMetaData("Value1", SqlDbType.Int);
columns[1] = new SqlMetaData("Value2", SqlDbType.Int);
columns[2] = new SqlMetaData("Value3", SqlDbType.Int);
SqlDataRecord record = new SqlDataRecord(columns);
SqlContext.Pipe.SendResultsStart(record);
SqlDataReader reader = comm.ExecuteReader();
bool flag = true;
while (reader.Read() && flag)
{
int value1 = Convert.ToInt32(reader[0]);
int value2 = Convert.ToInt32(reader[1]);
// some algorithm
int newValue = ...;
reader.SetInt32(3, newValue);
SqlContext.Pipe.SendResultsRow(record);
// keep going?
flag = newValue < 100;
}

Cursors are a SQL only function. If you wanted to read chunks of data at a time, some sort of paging would be required so that only a certain amount of the records would be returned. If using Linq,
.Skip(Skip)
.Take(PageSize)
Skips and takes could be used to limit results returned.
You can simply iterate over the DataReader by doing something like this:
using (IDataReader reader = Command.ExecuteReader())
{
while (reader.Read())
{
//Do something with this record
}
}
This would be iterating over the results one at a time, similiar to a cursor in SQL Server.
For multiple recordsets at once, try MARS
(if SQL Server)
http://msdn.microsoft.com/en-us/library/ms131686.aspx

Related

SqlDataReader get specific ResultSet

When having a query that returns multiple results, we are iterating among them using the NextResult() of the SqlDataReader. This way we are accessing results sequentially.
Is there any way to access the result in a random / non sequential way. For example jump first to the third result, then to the first e.t.c
I am searching if there is something like rdr.GetResult(1), or a workaround.
Since I was asked Why I want something like this,
First of all I have no access to the query and so I can not changes, so in my client I will have the Results in the sequence that server writes / produces them.
To process (build collections, entities --> business logic) the first I need the Information from both the second and the third one.
Again since it is not an option to modify some of the code, I can not somehow (without writing a lot of code) 'store' the connection info (eg. ids) in order to connect in a later step the two ResultSets
The most 'elegant' solution (for sure not the only one) is to process the result sets in non sequential way. So that is why I am asking if there is such a way.
Update 13/6
While Jeroen Mostert answer gives a thoughtful explanation on why, Think2ceCode1ce answer shows the right directions for a workaround. The content of the link in the comments how in additional dataset could be utilized to work in an async way. IMHO this would be the way to go if was going to write a general solution. However in my case, I based my solution in the nature of my data and the logic behind them. In short terms, (1) I read the data as they come sequentially using the SqlDataReader; (2) I store some of the data I need in a dictionary and a Collection, when I am reading the first in row but second in logic ResultSet; (3) When I am Reading the third in row, but first in logic ResultSet I am iterating in through the collection I built earlier and based on the dictionary data I am building my final result.
The final code seems more efficient and it is more maintainable than using the async DataAdapter. However this is a very specific solution based on my data.

Provides a way of reading a forward-only stream of rows from a SQL
Server database.
https://msdn.microsoft.com/en-us/library/system.data.sqlclient.sqldatareader(v=vs.110).aspx
You need to use DataAdapter for disconnected and non-sequential access. To use this you just have to change bit of your ADO.NET code.
Instead of
SqlDataReader sqlReader = sqlCmd.ExecuteReader();
You need
DataTable dt = new DataTable();
SqlDataAdapter sqlAdapter = new SqlDataAdapter(sqlCmd);
sqlAdapter.Fill(dt);
If your SQL returns multiple result sets, you would use DataSet instead of DataTable, and then access result sets like ds.Tables[index_or_name].
https://msdn.microsoft.com/en-us/library/bh8kx08z(v=vs.110).aspx

No, this is not possible. The reason why is quite elementary: if a batch returns multiple results, it must return them in order -- the statement that returns result set #2 does not run before the one that returns result set #1, nor does the client have any way of saying "please just skip that first statement entirely" (as that could have dire consequences for the batch as a whole). Indeed, there's not even any way in general to tell how many result sets a batch will produce -- all of this is done at runtime, the server doesn't know in advance what will happen.
Since there's no way, server-side, to skip or index result sets, there's no meaningful way to do it client-side either. You're free to ignore the result sets streamed back to you, but you must still process them in order before you can move on -- and once you've moved on, you can't go back.
There are two possible global workarounds:
If you process all data and cache it locally (with a DataAdapter, for example) you can go back and forth in the data as you please, but this requires keeping all data in memory.
If you enable MARS (Multiple Active Result Sets) you can execute another query even as the first one is still processing. This does require splitting up your existing single batch code into individual statements (which, if you really can't change anything about the SQL at all, is not an option), but you could go through result sets at will (without caching). It would still not be possible for you to "go back" within a single result set, though.

How to correctly and efficiently reuse a prepared statement in C# .NET (SQL Server)?

I looked at lots of questions but evidently my SO-fu isn't up to the task, so here I am. I am trying to efficiently use prepared statements, and I don't just mean parameterizing a single statement, but compiling one for reuse many times. My question lies around the parameters and reuse and how to implement that correctly.
Generally I follow this procedure (contrived example):
SqlConnection db = new SqlConnection(...);
SqlCommand s = new SqlCommand("select * from foo where a=#a", db);
s.Parameters.Add("#a", SqlDbType.VarChar, 8);
s.Prepare();
...
s.Parameters["#a"] = "bozo";
s.Execute();
Super, that works. However, I don't want to do all of these steps (or the latter four) every time I run this query. That seems like it's counteracting the whole idea of prepared statements. In my mind I should only have to change the parameters and re-execute, but the question is how to do that?
I tried s.Parameters.Clear(), but this actually removes the parameters themselves, not just the values, so I would essentially need to re-Add the parameters and re-Prepare again, which would seem to break the whole point as well. No thanks.
At this point I am left with iterating through s.Parameters and setting them all to null or some other value. Is this correct? Unfortunately in my current project I have queries with ~15 parameters which need to be executed ~10,000 times per run. I can shunt this iteration off into a method but was wondering if there is a better way to do this (without stored procs).
My current workaround is an extension method, SqlParameterCollection.Nullify, that sets all the parameters to null, which is fine for my case. I just run this after an execute.
I found some virtually identical but (IMHO) unanswered questions:
Prepared statements and the built-in connection pool in .NET
SQLite/C# Connection Pooling and Prepared Statement Confusion (Serge was so close to answering!)
The best answer I could find is (1) common sense above and (2) this page:
http://msdn.microsoft.com/en-us/magazine/cc163799.aspx

When re-using a prepared SqlCommand, surely all you need to do is set the parameter values to the new ones? You don't need to clear them out after use.
For myself, I haven't seen a DBMS produced in the last 10 years which got any noticeable benefit from preparing a statement (I suppose if the DB Server was at the limits of its CPU it might, but this is not typical). Are you sure that Preparing is necessary?
Running the same command "~10,000 times per run" smells a bit to me, unless you're uploading from an external source. In that case, Bulk Loading might help? What is each run doing?

To add to Simon's answer, prior to Sql 2005 Command.Prepare() would have improved query plan caching of ad-hoc queries (SPROCs would generally be compiled). However, in more recent Sql Versions, provided that your query is parameterized, ad-hoc queries which are also parameterized can also be cached, reducing the need for Prepare().
Here is an example of retaining a SqlParameters collection changing just the value of those parameters values which vary, to prevent repeated creation of the Parameters (i.e. saving parameter object creation and collection):
using (var sqlConnection = new SqlConnection("connstring"))
{
sqlConnection.Open();
using (var sqlCommand = new SqlCommand
{
Connection = sqlConnection,
CommandText = "dbo.MyProc",
CommandType = CommandType.StoredProcedure,
})
{
// Once-off setup per connection
// This parameter doesn't vary so is set just once
sqlCommand.Parameters.Add("ConstantParam0", SqlDbType.Int).Value = 1234;
// These parameters are defined once but set multiple times
sqlCommand.Parameters.Add(new SqlParameter("VarParam1", SqlDbType.VarChar));
sqlCommand.Parameters.Add(new SqlParameter("VarParam2", SqlDbType.DateTime));
// Tight loop - performance critical
foreach(var item in itemsToExec)
{
// No need to set ConstantParam0
// Reuses variable parameters, by just mutating values
sqlParameters["VarParam1"].Value = item.Param1Value; // Or sqlParameters[1].Value
sqlParameters["VarParam2"].Value = item.Param2Date; // Or sqlParameters[2].Value
sqlCommand.ExecuteNonQuery();
}
}
}
Notes:
If you are inserting a large number of rows, and concurrency with other inhabitants of the database is important, and if an ACID transaction boundary is not important, you might consider batching and committing updates such that fewer than 5000 row locks are held on a table at a time, to guard against table lock escalation.
Depending on what work your proc is actually doing, there may be an opportunity to parallelize the loop, e.g. with TPL. Obviously connection and commands are not thread safe each Task will require its own connection and Reusable Command - the localInit overload of Parallel.ForEach is ideal for this.

ADO.NET SQL Server performance: multiple result sets vs. multiple command executions

With connection pooling or at least the assumption that the connection is not closed between calls, is there a network or server performance difference and how significant is it between one stored procedure execution with multiple result sets and multiple stored procedure executions.
In pseudo code, something like
using(new connection)
{
using (datareader dr = connection.Execute(Command))
{
while (dr.NextResult())
{
while (dr.Read())
{
SomeContainer.Add(Something.Parse(dr));
}
}
}
}
vs
using(new connection)
{
using (datareader dr = connection.Execute(Command))
{
while (dr.Read())
{
SomeContainer.Add(Something.Parse(dr));
}
}
using (datareader dr = connection.Execute(Command))
{
while (dr.Read())
{
SomeContainer.Add(Something.Parse(dr));
}
}
}

The first one is a single round-trip to the server, the second is distinct round trips. A round trip occurs a penalty due to network latency, time to parse the request, time to set up an execution context etc. However, this penalty is all but negligible for everything but the most critical applications.
So do whatever is easier to understand, code, debug and maintain (imho, that would be the second option). You probably won't be able to measure the difference.

I disagree, I would go with the 1st approach almost always (it really depends on the particular scenario) but in general, it's better to have a proc returning 2 result sets as opposed to having 2 calls to 2 different procs that return a single data set precisely for the reasons #Remus explained (network latency,etc).
In the majority of the cases the difference is not negligible.

Suggest you profile the difference for yourself. What is most efficient may depend greatly on how much data and how many users etc. I would tend to believe that the one round trip over the network is better, but it is best to try both approaches and measure and then you will know.

You can assume that Connection Pooling is being used in both of your scenarios, so it's really a non-factor in your determination of efficiency.
If you can receive all of the results in a single call, it is naturally more efficient than multiple calls. Consider the simple case of selecting 10 things individually versus selecting all 10 at once with an 'in' clause. That's 1 query sent to the server and 1 response parsed vs 10 of each. This is the round-trip that Remus is talking about.
It's most likely nominal in light usage scenarios, but as (if) you scale up, the chattiness could start to be a problem. Your connection pool has a limit that can be reached at some point.
I would go with option 1 if you are returning the same type of data between calls.
However, there is also maintenance and reuse to consider. If you are returning disparate data (ie: fetching all data needed for a particular view), I would go with option 2 and optimize to fewer calls as needed.

Execute multiple SQL commands in one round trip

I am building an application and I want to batch multiple queries into a single round-trip to the database. For example, lets say a single page needs to display a list of users, a list of groups and a list of permissions.
So I have stored procs (or just simple sql commands like "select * from Users"), and I want to execute three of them. However, to populate this one page I have to make 3 round trips.
Now I could write a single stored proc ("getUsersTeamsAndPermissions") or execute a single SQL command "select * from Users;exec getTeams;select * from Permissions".
But I was wondering if there was a better way to specify to do 3 operations in a single round trip. Benefits include being easier to unit test, and allowing the database engine to parrallelize the queries.
I'm using C# 3.5 and SQL Server 2008.

Something like this. The example is probably not very good as it doesn't properly dispose objects but you get the idea. Here's a cleaned up version:
using (var connection = new SqlConnection(ConnectionString))
using (var command = connection.CreateCommand())
{
connection.Open();
command.CommandText = "select id from test1; select id from test2";
using (var reader = command.ExecuteReader())
{
do
{
while (reader.Read())
{
Console.WriteLine(reader.GetInt32(0));
}
Console.WriteLine("--next command--");
} while (reader.NextResult());
}
}

The single multi-part command and the stored procedure options that you mention are the two options. You can't do them in such a way that they are "parallelized" on the db. However, both of those options does result in a single round trip, so you're good there. There's no way to send them more efficiently. In sql server 2005 onwards, a multi-part command that is fully parameterized is very efficient.
Edit: adding information on why cram into a single call.
Although you don't want to care too much about reducing calls, there can be legitimate reasons for this.
I once was limited to a crummy ODBC driver against a mainframe, and there was a 1.2 second overhead on each call! I'm serious. There were times when I crammed a little extra into my db calls. Not pretty.
You also might find yourself in a situation where you have to configure your sql queries somewhere, and you can't just make 3 calls: it has to be one. It shouldn't be that way, bad design, but it is. You do what you gotta do!
Sometimes of course it can be very good to encapsulate multiple steps in a stored procedure. Usually not for saving round trips though, but for tighter transactions, getting ID for new records, constraining for permissions, providing encapsulation, blah blah blah.

Making one round-trip vs three will be more eficient indeed. The question is wether it is worth the trouble. The entire ADO.Net and C# 3.5 toolset and framework opposes what you try to do. TableAdapters, Linq2SQL, EF, all these like to deal with simple one-call==one-resultset semantics. So you may loose some serious productivity by trying to beat the Framework into submission.
I would say that unless you have some serious measurements showing that you need to reduce the number of roundtrips, abstain. If you do end up requiring this, then use a stored procedure to at least give an API kind of semantics.
But if your query really is what you posted (ie. select all users, all teams and all permissions) then you obviosuly have much bigger fish to fry before reducing the round-trips... reduce the resultsets first.

I this this link might be helpful.
Consider using at least the same connection-openning; according to what it says here, openning a connection is almost the top-leader of performance cost in Entity-Framework.

Firstly, 3 round trips isn't really a big deal. If you were talking about 300 round trips then that would be another matter, but for just 3 round trips I would conderer this to definitley be a case of premature optimisation.
That said, the way I'd do this would probably be to executed the 3 stored procuedres using SQL:
exec dbo.p_myproc_1 #param_1 = #in_param_1, #param_2 = #in_param_2
exec dbo.p_myproc_2
exec dbo.p_myproc_3
You can then iterate through the returned results sets as you would if you directly executed multiple rowsets.

Build a temp-table? Insert all results into the temp table and then select * from #temp-table
as in,
#temptable=....
select #temptable.field=mytable.field from mytable
select #temptable.field2=mytable2.field2 from mytable2
etc... Only one trip to the database, though I'm not sure it is actually more efficient.

Speed up LINQ inserts

I have a CSV file and I have to insert it into a SQL Server database. Is there a way to speed up the LINQ inserts?
I've created a simple Repository method to save a record:
public void SaveOffer(Offer offer)
{
Offer dbOffer = this.db.Offers.SingleOrDefault (
o => o.offer_id == offer.offer_id);
// add new offer
if (dbOffer == null)
{
this.db.Offers.InsertOnSubmit(offer);
}
//update existing offer
else
{
dbOffer = offer;
}
this.db.SubmitChanges();
}
But using this method, the program is way much slower then inserting the data using ADO.net SQL inserts (new SqlConnection, new SqlCommand for select if exists, new SqlCommand for update/insert).
On 100k csv rows it takes about an hour vs 1 minute or so for the ADO.net way. For 2M csv rows it took ADO.net about 20 minutes. LINQ added about 30k of those 2M rows in 25 minutes. My database has 3 tables, linked in the dbml, but the other two tables are empty. The tests were made with all the tables empty.
P.S. I've tried to use SqlBulkCopy, but I need to do some transformations on Offer before inserting it into the db, and I think that defeats the purpose of SqlBulkCopy.
Updates/Edits:
After 18hours, the LINQ version added just ~200K rows.
I've tested the import just with LINQ inserts too, and also is really slow compared with ADO.net. I haven't seen a big difference between just inserts/submitchanges and selects/updates/inserts/submitchanges.
I still have to try batch commit, manually connecting to the db and compiled queries.

SubmitChanges does not batch changes, it does a single insert statement per object. If you want to do fast inserts, I think you need to stop using LINQ.
While SubmitChanges is executing, fire up SQL Profiler and watch the SQL being executed.
See question "Can LINQ to SQL perform batch updates and deletes? Or does it always do one row update at a time?" here: http://www.hookedonlinq.com/LINQToSQLFAQ.ashx
It links to this article: http://www.aneyfamily.com/terryandann/post/2008/04/Batch-Updates-and-Deletes-with-LINQ-to-SQL.aspx that uses extension methods to fix linq's inability to batch inserts and updates etc.

Have you tried wrapping the inserts within a transaction and/or delaying db.SubmitChanges so that you can batch several inserts?
Transactions help throughput by reducing the needs for fsync()'s, and delaying db.SubmitChanges will reduce the number of .NET<->db roundtrips.
Edit: see http://www.sidarok.com/web/blog/content/2008/05/02/10-tips-to-improve-your-linq-to-sql-application-performance.html for some more optimization principles.

Have a look at the following page for a simple walk-through of how to change your code to use a Bulk Insert instead of using LINQ's InsertOnSubmit() function.
You just need to add the (provided) BulkInsert class to your code, make a few subtle changes to your code, and you'll see a huge improvement in performance.
Mikes Knowledge Base - BulkInserts with LINQ
Good luck !

I wonder if you're suffering from an overly large set of data accumulating in the data-context, making it slow to resolve rows against the internal identity cache (which is checked once during the SingleOrDefault, and for "misses" I would expect to see a second hit when the entity is materialized).
I can't recall 100% whether the short-circuit works for SingleOrDefault (although it will in .NET 4.0).
I would try ditching the data-context (submit-changes and replace with an empty one) every n operations for some n - maybe 250 or something.
Given that you're calling SubmitChanges per isntance at the moment, you may also be wasting a lot of time checking the delta - pointless if you've only changed one row. Only call SubmitChanges in batches; not per record.

Alex gave the best answer, but I think a few things are being over looked.
One of the major bottlenecks you have here is calling SubmitChanges for each item individually. A problem I don't think most people know about is that if you haven't manually opened your DataContext's connection yourself, then the DataContext will repeatedly open and close it itself. However, if you open it yourself, and then close it yourself when you're absolutely finished, things will run a lot faster since it won't have to reconnect to the database every time. I found this out when trying to find out why DataContext.ExecuteCommand() was so unbelievably slow when executing multiple commands at once.
A few other areas where you could speed things up:
While Linq To SQL doesn't support your straight up batch processing, you should wait to call SubmitChanges() until you've analyzed everything first. You don't need to call SubmitChanges() after each InsertOnSubmit call.
If live data integrity isn't super crucial, you could retrieve a list of offer_id back from the server before you start checking to see if an offer already exists. This could significantly reduce the amount of times you're calling the server to get an existing item when it's not even there.

Why not pass an offer[] into that method, and doing all the changes in cache before submitting them to the database. Or you could use groups for submission, so you don't run out of cache. The main thing would be how long till you send over the data, the biggest time wasting is in the closing and opening of the connection.

Converting this to a compiled query is the easiest way I can think of to boost your performance here:
Change the following:
Offer dbOffer = this.db.Offers.SingleOrDefault (
o => o.offer_id == offer.offer_id);
to:
Offer dbOffer = RetrieveOffer(offer.offer_id);
private static readonly Func<DataContext, int> RetrieveOffer
{
CompiledQuery.Compile((DataContext context, int offerId) => context.Offers.SingleOrDefault(o => o.offer_id == offerid))
}
This change alone will not make it as fast as your ado.net version, but it will be a significant improvement because without the compiled query you are dynamically building the expression tree every time you run this method.
As one poster already mentioned, you must refactor your code so that submit changes is called only once if you want optimal performance.

Do you really need to check if the record exist before inserting it into the DB. I thought it looked strange as the data comes from a csv file.
P.S. I've tried to use SqlBulkCopy,
but I need to do some transformations
on Offer before inserting it into the
db, and I think that defeats the
purpose of SqlBulkCopy.
I don't think it defeat the purpose at all, why would it? Just fill a simple dataset with all the data from the csv and do a SqlBulkCopy. I did a similar thing with a collection of 30000+ rows and the import time went from minutes to seconds

I suspect it isn't the inserting or updating operations that are taking a long time, rather the code that determines if your offer already exists:
Offer dbOffer = this.db.Offers.SingleOrDefault (
o => o.offer_id == offer.offer_id);
If you look to optimise this, I think you'll be on the right track. Perhaps use the Stopwatch class to do some timing that will help to prove me right or wrong.
Usually, when not using Linq-to-Sql, you would have an insert/update procedure or sql script that would determine whether the record you pass already exists. You're doing this expensive operation in Linq, which certainly will never hope to match the speed of native sql (which is what's happening when you use a SqlCommand and select if the record exists) looking-up on a primary key.

Well you must understand linq creates code dynamically for all ADO operations that you do instead handwritten, so it will always take up more time then your manual code. Its simply an easy way to write code but if you want to talk about performance, ADO.NET code will always be faster depending upon how you write it.
I dont know if linq will try to reuse its last statement or not, if it does then seperating insert batch with update batch may improve performance little bit.

This code runs ok, and prevents large amounts of data:
if (repository2.GeoItems.GetChangeSet().Inserts.Count > 1000)
{
repository2.GeoItems.SubmitChanges();
}
Then, at the end of the bulk insertion, use this:
repository2.GeoItems.SubmitChanges();

Develop Reference

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.