Is SQL or C# faster at pairing?

Is SQL or C# faster at pairing? - c#

I have a lot of data which needs to be paired based on a few simple criteria. There is a time window (both records have a DateTime column), if one record is very close in time (within 5 seconds) to another then it is a potential match, the record which is the closest in time is considered a complete match. There are other fields which help narrow this down also.
I wrote a stored procedure which does this matching on the server before returning the
full, matched dataset to a C# application. My question is, would it be better to pull in the 1 million (x2) rows and deal with them in C#, or is sql server better suited to perform this matching? If Sql server is, then what is the fastest way of pairing data using datetime fields?
Right now I select all records from Table 1/Table 2 into temporary tables, iterate through each record in Table 1, look for a match in Table 2 and store the match (if one exists) in a temporary table, then I delete both records in their own temporary tables.
I had to rush this piece for a game I'm writing, so excuse the bad (very bad) procedure... It works, it's just horribly inefficient! The whole SP is available on pastebin: http://pastebin.com/qaieDsW7
I know the SP is written poorly, so saying "hey, dumbass... write it better" doesn't help! I'm looking for help in improving it, or help/advice on how I should do the whole thing differently! I have about 3/5 days to rewrite it, I can push that deadline back a bit, but I'd rather not if you guys can help me in time! :)
Thanks!

Ultimately, compiling your your data on the database side is preferable 99% of the time, as it's designed for data crunching (through the use of indexes, relations, etc). A lot of your code can be consolidated by the use of joins to compile the data in exactly the format you need. In fact, you can bypass almost all your temp tables entirely and just fill a master Event temp table.
The general pattern is this:
INSERT INTO #Events
SELECT <all interested columns>
FROM
FireEvent
LEFT OUTER JOIN HitEvent ON <all join conditions for HitEvent>
This way you match all fire events to zero or more HitEvents. After our discussion in chat, you can even limit it to zero or one hit event by wrapping it in a subquery and using a window function for ROW_NUMBER() OVER (PARTITION BY HitEvent.EventID ORDER BY ...) AS HitRank and add a WHERE HitRank = 1 to the outer query. This is ultimately what you ended up doing and got the results you were expecting (with a bit of work and learning in the process).

If the data is already in the database, that is where you should do the work. You absolutely should learn to display and query plans using SQL Server Management Studio, and become able to notice and optimize away expensive computations like nested loops.
Your task probably does not require any use of temporary tables. Temporary tables tend to be efficient when they are relatively small and/or heavily reused, which is not your case.

I would advise you to try to optimize the stored procedure if is not running fast enough and not rewrite it in C#. Why would you want to transfer millions of rows out of SQL Server anyway?
Unfortunately I don't have an SQL Server installation so I can't test your script, but I don't see any CREATE INDEX statements in there. If you didn't just skipped them for brevity, then you should surely analyze your queries and see which indexes are needed.

So the answer depends on several factors like resources available per client/server (Ram/CPU/Concurrent Users/Concurrent processes, etc.)
Here are some basic rules that will improve your performance regardless of what you use:
Loading a million rows into c# program is not a good practice. Unless this is a stand alone process with plenty of ram.
Uniqueidentifiers will never out perform Integers. Comparisons
Common Table Expression are a good alternative for fast performing matching. How to use CTE
Finally you have to consider output. If there is constant reading and writing that affects the user interface, then you should manage that in memory (c#), otherwise all CRUD operations should be kept inside the database.

Related

Issues using a stored procedure and what to do next

Update:
As suggested I will ask two separated questions with elaborated details.
This is probably a two-part question and maybe a common issue however I can’t figure it out.
We have a .net web service using Entity Framework with a code-first approach against a SQL Server 2012 (I think). We have a few tables some called User, License, Product etc..
We continually need to get data from the database regarding users, their licenses and products. For this data we execute a rather large stored procedure which accesses all the tables, do some processing and deliver the data e.g. the user with this userid have these licenses with these roles in relation to these products.
However, the execution of this stored procedure seems to regress over time and it becomes slower during the day. To prevent this, we run an optimization of the indexes every morning.
If the optimization is not executed every morning the stored procedure goes from 200ms execution time to 2000 ms execution time.
If anyone has insight to what is going on I would appreciate it. My knowledge of SQL and SQL Server is limited.
HOWEVER,
To avoid these issues regarding the stored procedure we have decided to rethink our strategy. For now, we have created a new table containing the key values from the others tables e.g. userid, license id, role, productid. However, this means we have to maintain this new table every time the other tables are altered.
So, my second part question is. Is a new table containing the key values which we can easily fetch a valid approach or should we do something completely else?

The index is probably getting fragmented due to inserts or updates during the day. You could consider columnstore indexes. Try using SSMS query optimisation tool. Also consider hinting the query optimiser with loop joins if applicable.

It's very hard to answer in a vacuum. You could van index problem with heavy inserts and i am guessing GUID as a key or you could have blocking issues due to heavy loads, you could have spills to tempdb due to low memory or old statistics etc. It's all guessing game. The best advise I can give you is hire professional, because one day you would have to, at list it is not going to be as bad as people without a clue thinking that they fixing stuff (because all devs are "smart").
At list get a consultation, because without looking at the issue, it is close to impossible to give you an answer.
At the very list, you need to post a execution plan for your stored procedure.

When to use Temporary SQL Tables vs DataTables

I don't know whether it is better to use temporary tables in SQL Server or use the DataTable in C# for a report. Here is the scope of the report: it will be copied into a workbook with about 10 worksheets - each worksheet containing about 1000 rows and about 30 columns so it's a lot of data. There is some guidance out there but I could not find anything specific regarding the amount of data that is too much for a DataTable. According to https://msdn.microsoft.com/en-us/library/system.data.datatable.aspx, 16M rows but my data set seems unwieldy considering the number of columns I have. Plus, I will either have to make multiple SQL queries to collect the data in my report or try to write a stored procedure in SQL to collect that data. How do I figure out this quandary?

My rule of thumb is that if it can be processed on the database server, it probably should. Keep in mind, no matter how efficient your C# code is, SQL Server will mostly likely to it faster and more efficiently, after all it was designed for data manipulation.
There is no shame in using #temp tables. They maintain stats, can be indexed, and/or manipulated. One recent example, a developer create an admittedly elegant query using cte, the performance was 12-14 seconds vs mine at 1 second using #temps.
Now, one carefully structured stored procedure could produce and return the 10 data-sets for your worksheets. If you are using a product like SpreadSheetLight (there are many options available), it becomes a small matter of passing the results and creating the tabs (no cell level looping... unless you want or need to).
I would also like to add, you can dramatically reduce the number of touch points and better enforce the business logic by making SQL Server do the heavy lifting. For example, a client introduced a 6W risk rating, which was essentially a 6.5. HUNDREDS of legacy reports had to be updated, while I only had to add the 6W into my mapping table.

There's a lot of missing context here - how is this report going to be accessed and run? Is this going to run as a scripted event every day?
Have you considered SSRS?
In my opinion it's best to abstract away your business logic by creating Views or Stored Procedures in the database. Stored Procedures would probably be the way to go but it really depends on your specific environment. Then you can point whatever tools you want to use at the database object. This has several advantages:
if you end up having different versions or different formats of the report, and your logic ever changes, you can update the logic in one place rather than many.
your code is simpler and cleaner, typically:
select v.col1, v.col2, v.col3
from MY_VIEW v
where v.date between #startdate and #enddate
I assume your 10 spreadsheets are going to be something like
Summary Page | Department 1 | Department 2 | ...
So you could make a generalized View or SP, create a master spreadsheet linked to the db object that pulls all the relevant data from SQL, and use Pivot Tables or filters or whatever else you want, and use that to generate your copies that get sent out.
But before going to all that trouble, I would make sure that SSRS is not an option, because if you can use that, it has a lot of baked in functionality that would make your life easier (export to Excel, automatic date parameters, scheduled execution, email subscriptions, etc).

Inserting/updating huge amount of rows into SQL Server with C#

I have to parse a big XML file and import (insert/update) its data into various tables with foreign key constraints.
So my first thought was: I create a list of SQL insert/update statements and execute them all at once by using SqlCommand.ExecuteNonQuery().
Another method I found was shown by AMissico: Method
where I would execute the sql commands one by one. No one complained, so I think its also a viable practice.
Then I found out about SqlBulkCopy, but it seems that I would have to create a DataTable with the data I want to upload. So, SqlBulkCopy for every table. For this I could create a DataSet.
I think every option supports SqlTransaction. It's approximately 100 - 20000 records per table.
Which option would you prefer and why?

You say that the XML is already in the database. First, decide whether you want to process it in C# or in T-SQL.
C#: You'll have to send all data back and forth once, but C# is a far better language for complex logic. Depending on what you do it can be orders of magnitude faster.
T-SQL: No need to copy data to the client but you have to live with the capabilities and perf profile of T-SQL.
Depending on your case one might be far faster than the other (not clear which one).
If you want to compute in C#, use a single streaming SELECT to read the data and a single SqlBulkCopy to write it. If your writes are not insert-only, write to a temp table and execute as few DML statements as possible to update the target table(s) (maybe a single MERGE).
If you want to stay in T-SQL minimize the number of statements executed. Use set-based logic.
All of this is simplified/shortened. I left out many considerations because they would be too long for a Stack Overflow answer. Be aware that the best strategy depends on many factors. You can ask follow-up questions in the comments.

Don't do it from C# unless you have to, it's a huge overhead and SQL can do it so much faster and better by itself
Insert to table from XML file using INSERT INTO SELECT

Scrollable ODBC cursor in C#

I'm a C++ programmer and I'm not familiar with the .NET database model. I usually use IDataReader (OdbcDataReader, OledbDataReader or SqlDataReader) to read data from database. Sometimes when I need a bulk of data I use DataAdapter, but what should I do to achieve the functionality of scrollable cursors that exists in native libraries like ODBC?
Thanks all of you for your answers, but I am in a situation that I can't accept them, of course this is my fault that didn't explain my problem completely. I explain it as a comment in one of answers that now removed.
I have to write a program that will act as a proxy between client side program and MSSQL, for this library I have following requirements:
My program should be compatible with MSSQL2000
I don't know all the tables and queries that will be sent by the user, I should simply add some information to it, make a log, ... and then execute it against MSSQL, so it is really hard to use techniques that based on ordered field(s) of the query or primary key of the table(All my works are in one database but that database is huge and may change over time).
Only a part of data is needed by the client, most DBMS support LIMIT OFFSET, unfortunately MSSQL do not support it, and ROW_NUMBER does not exist in the MSSQL2000 and if it supported, then again I need to understand program logic and that need a parse of SQL command(Actually I write a parsing library with boost::spirit but that's native code and beside that I'm not yet 100% sure about its functionality).
I may have multiple clients but most of queries that will be sent by them are one of a few predefined queries(of course users still send custom queries but its about 30% of all queries), So I think I can open some scrollable cursors and respond to clients using that cursors and a custom cache.
Server machine and its MSSQL will be dedicated to my program, so I really want to use all of the power of the server and DBMS to achieve my functionality.
So now:
What is the problem in using scrollable cursors and why I should avoid them?
How can I use scrollable cursors in .NET?

In SQL Server you can create queries paged thus. The page number you handle it easily from the application. You do not need to create cursors for this task.
For SQL Server 2005 o higher
SELECT * FROM ( SELECT *, ROW_NUMBER() OVER (ORDER BY ID) AS ROW FROM TABLEA ) AS ALIAS
WHERE ROW > 40
AND ROW <= 49
For SQL Server 2000
SELECT TOP 10 T.* FROM TABLA AS T WHERE T.ID NOT IN
( SELECT TOP 39 id from tabla order by id desc )
ORDER BY T.ID DESC
PD: edited to include support for SQL Server 2000

I usually use DataReader.Read() to skip all rows that I do not want to use when doing paging on a DB which do not support paging.
If you don't want to build the SQL paged query yourself you are free to use my paging class: https://github.com/jgauffin/Griffin.Data/blob/master/src/Griffin.Data/BasicLayer/Paging/SqlServerPager.cs

When Microsoft designed the ADO.NET API, they made the decision to expose only firehose cursors (IDataReader etc). This may or may not actually pose a problem for you. You say that you want "functionality of scrollable cursors", but that can mean all sorts of things, not just paging, and each particular use case can be tackled in a variety of ways. For example:
Requirement: The user should be able to arbitrarily page up and down the resultset.
Retrieve only one page of data at a time, e.g. using the ROW_NUMBER() function. This is more efficient than scrolling through a cursor.
Requirement: I have an extremely large data set and I only want to process one row at a time to avoid running out of memory.
Use the firehose cursor provided by ADO.NET. Note that this is only practical if (a) you don't need to hit the database at all during the loop, or (b) you have MARS configured in your connection string.
Simulate a keyset cursor by retrieving the set of unique identifiers into an array, then loop through the array and read one row of data at a time.
Requirement: I am doing a complicated calculation that involves moving forwards and backwards through the resultset.
You should be able to re-write your algorithm to eliminate this requirement. For example, read one set of rows, process them, read another set of rows, process them, etc.
UPDATE (more information provided in the question)
Your business requirements are asking too much. You have to handle arbitrary queries that assume the presence of scrollable cursors, but you can't provide scrollable cursors, and you can't re-write the client code to not use scrollable cursors. That's an impossible position to be in. I recommend you stick with what you currently have (C++ and ODBC) and don't bother trying to re-write it in .NET.

I don't think cursors will work for you particular case. The main reason is that you have 3 tiers. But let's take two steps back.
Most 3 tier applications have a stateless middle tier (your c++ code). Caching is fine since it really just an optimization and does not create any real state in the middle tier. The middle tier normally has a small number of open sessions to the database. Because opening a db session is expensive for the processor, and after the db session is open a set amount of RAM is reserved at the database server. When a request is received by the middle tier, the request is processed and handed on to the SQL database. An algorithm may be used to pick any of the open sessions, or it can even be done at random. In this model it is not possible to know what session will receive the next request. Cursors belong to the session that received the original query request. So you can't really expect that the receiving session will be the one that has your open cursor.
The 3 tier model I described is used mainly for web applications so they can scale to hundreds or thousands of clients. Were SQL servers would never be able to open that many sessions. Microsoft ADO.NET already has many features to support the kind of architecture I described, so it is not very hard to implement. And the same is used even in non Web applications depending on the circumstance. You could potentially keep track of your sessions so you could open a single session per client, I would first make sure that the use case justifies that. Know that open cursors can take up a lot of resources as well.
Cursors still have a place within a single transaction, it's just hard to keep them open so that the client application can fetch/update values within the result set.
What I would suggest its that you do the following within the query transaction. Store in a separate table the primary key values of the main table in your query. On the separate table include other values like sessionid and rownumber. Return a few of the first rows by linking to the new table in the original query. And in subsequent calls just query the corresponding rows again by linking to your new table. You will need an equivalent to a caching mechanism to purge old data, and to refresh the result set according to your needs.

Joins are for lazy people?

I recently had a discussion with another developer who claimed to me that JOINs (SQL) are useless. This is technically true but he added that using joins is less efficient than making several requests and link tables in the code (C# or Java).
For him joins are for lazy people that don't care about performance. Is this true? Should we avoid using joins?

No, we should avoid developers who hold such incredibly wrong opinions.
In many cases, a database join is several orders of magnitude faster than anything done via the client, because it avoids DB roundtrips, and the DB can use indexes to perform the join.
Off the top of my head, I can't even imagine a single scenario where a correctly used join would be slower than the equivalent client-side operation.
Edit: There are some rare cases where custom client code can do things more efficiently than a straightforward DB join (see comment by meriton). But this is very much the exception.

It sounds to me like your colleague would do well with a no-sql document-database or key-value store. Which are themselves very good tools and a good fit for many problems.
However, a relational database is heavily optimised for working with sets. There are many, many ways of querying the data based on joins that are vastly more efficient than lots of round trips. This is where the versatilty of a rdbms comes from. You can achieve the same in a nosql store too, but you often end up building a separate structure suited for each different nature of query.
In short: I disagree. In a RDBMS, joins are fundamental. If you aren't using them, you aren't using it as a RDBMS.

Well, he is wrong in the general case.
Databases are able to optimize using a variety of methods, helped by optimizer hints, table indexes, foreign key relationships and possibly other database vendor specific information.

No, you shouldnt.
Databases are specifically designed to manipulate sets of data (obviously....). Therefore they are incredibly efficient at doing this. By doing what is essentially a manual join in his own code, he is attempting to take over the role of something specifically designed for the job. The chances of his code ever being as efficient as that in the database are very remote.
As an aside, without joins, whats the point in using a database? he may as well just use text files.

If "lazy" is defined as people who want to write less code, then I agree. If "lazy" is defined as people who want to have tools do what they are good at doing, I agree. So if he is merely agreeing with Larry Wall (regarding the attributes of good programmers), then I agree with him.

Ummm, joins is how relational databases relate tables to each other. I'm not sure what he's getting at.
How can making several calls to the database be more efficient than one call? Plus sql engines are optimized at doing this sort of thing.
Maybe your coworker is too lazy to learn SQL.

"This is technicaly true" - similarly, a SQL database is useless: what's the point in using one when you can get the same result by using a bunch of CSV files, and correlating them in code? Heck, any abstraction is for lazy people, let's go back to programming in machine code right on the hardware! ;)
Also, his asssertion is untrue in all but the most convoluted cases: RDBMSs are heavily optimized to make JOINs fast. Relational database management systems, right?

Yes, You should.
And you should use C++ instead of C# because of performance. C# is for lazy people.
No, no, no. You should use C instead of C++ because of performance. C++ is for lazy people.
No, no, no. You should use assembly instead of C because of performance. C is for lazy people.
Yes, I am joking. you can make faster programs without joins and you can make programs using less memory without joins. BUT in many cases, your development time is more important than CPU time and memory. Give up a little performance and enjoy your life. Don't waste your time for little little performance. And tell him "Why don't you make a straight highway from your place to your office?"

The last company I worked for didn't use SQL joins either. Instead they moved this work to application layer which is designed to scale horizontally. The rationale for this design is to avoid work at database layer. It is usually the database that becomes bottleneck. Its easier to replicate application layer than database. There could be other reasons. But this is the one that I can recall now.
Yes I agree that joins done at application layer are inefficient compared to joins done by database. More network communication also.
Please note that I'm not taking a hard stand on avoiding SQL joins.

Without joins how are you going to relate order items to orders?
That is the entire point of a relational database management system.
Without joins there is no relational data and you might as well use text files
to process data.
Sounds like he doesn't understand the concept so he's trying to make it seem they are useless. He's the same type of person who thinks excel is a database application.
Slap him silly and tell him to read more about databases. Making multiple connections and pulling data and merging the data via C# is the wrong way to do things.

I don't understand the logic of the statement "joins in SQL are useless".
Is it useful to filter and limit the data before working on it? As you're other respondants have stated this is what database engines do, it should be what they are good at.
Perhaps a lazy programmer would stick to technologies with which they were familiar and eschew other possibilities for non technical reasons.
I leave it to you to decide.

Let's consider an example: a table with invoice records, and a related table with invoice line item records. Consider the client pseudo code:
for each (invoice in invoices)
let invoiceLines = FindLinesFor(invoice)
...
If you have 100,000 invoices with 10 lines each, this code will look up 10 invoice lines from a table of 1 million, and it will do that 100,000 times. As the table size increases, the number of select operations increases, and the cost of each select operation increases.
Becase computers are fast, you may not notice a performance difference between the two approaches if you have several thousand records or fewer. Because the cost increase is more than linear, as the number of records increases (into the millions, say), you'll begin to notice a difference, and the difference will become less tolerable as the size of the data set grows.
The join, however. will use the table's indexes and merge the two data sets. This means that you're effectively scanning the second table once rather than randomly accessing it N times. If there's a foreign key defined, the database already has the links between the related records stored internally.
Imagine doing this yourself. You have an alphabetical list of students and a notebook with all the students' grade reports (one page per class). The notebook is sorted in order by the students' names, in the same order as the list. How would you prefer to proceed?
Read a name from the list.
Open the notebook.
Find the student's name.
Read the student's grades, turning pages until you reach the next student or the last page.
Close the notebook.
Repeat.
Or:
Open the notebook to the first page.
Read a name from the list.
Read any grades for that name from the notebook.
Repeat steps 2-3 until you get to the end
Close the notebook.

Sounds like a classic case of "I can write it better." In other words, he's seeing something that he sees as kind of a pain in the neck (writing a bunch of joins in SQL) and saying "I'm sure I can write that better and get better performance." You should ask him if he is a) smarter and b) more educated than the typical person that's knee deep in the Oracle or SQL Server optimization code. Odds are he isn't.

He is most certainly wrong. While there are definite pros to data manipulation within languages like C# or Java, joins are fastest in the database due to the nature of SQL itself.
SQL keeps detailing statistics regarding the data, and if you have created your indexes correctly, can very quickly find one record in a couple of million. Besides the fact that why would you want to drag all your data into C# to do a join when you can just do it right on the database level?
The pros for using C# come into play when you need to do something iteratively. If you need to do some function for each row, it's likely faster to do so within C#, otherwise, joining data is optimized in the DB.

I will say that I have run into a case where it was faster breaking the query down and doing the joins in code. That being said, it was only with one particular version of MySQL that I had to do that. Everything else, the database is probably going to be faster (note that you may have to optimize the queries, but it will still be faster).

I suspect he has a limited view on what databases should be used for. One approach to maximise performance is to read the entire database into memory. In this situation, you may get better performance and you may want to perform joins if memory for efficiency. However this is not really using a database, as a database IMHO.

No, not only are joins better optimized in database code that ad-hoc C#/Java; but usually several filtering techniques can be applied, which yields even better performance.

He is wrong, joins are what competent programmers use. There may be a few limited cases where his proposed method is more efficient (and inthose I would probably be using a Documant database) but I can't see it if you have any deceent amount of data. For example take this query:
select t1.field1
from table1 t1
join table2 t2
on t1.id = t2.id
where t1.field2 = 'test'
Assume you have 10 million records in table1 and 1 million records in table2. Assume 9 million of the records in table 1 meet the where clause. Assume only 15 of them are in table2 as well. You can run this sql statement which if properly indexed will take milliseconds and return 15 records across the network with only 1 column of data. Or you can send ten million records with 2 columns of data and separately send another 1 millions records with one column of data across the network and combine them on the web server.
Or of course you could keep the entire contents of the database on the web server at all times which is just plain silly if you have more than a trivial amount of data and data that is continually changing. If you don't need the qualities of a relational database then don't use one. But if you do, then use it correctly.

I've heard this argument quite often during my career as a software developer. Almost everytime it has been stated, the guy making the claim didn't have much knowledge about relational database systems, the way they work and the way such systems should be used.
Yes, when used incorrectly, joins seem to be useless or even dangerous. But when used in the correct way, there is a lot of potential for database implementation to perform optimizations and to "help" the developer retrieving the correct result most efficiently.
Don't forget that using a JOIN you tell the database about the way you expect the pieces of data to relate to each other and therefore give the database more information about what you are trying to do and therefore making it able to better fit your needs.
So the answer is definitely: No, JOINSaren't useless at all!

This is "technically true" only in one case which is not used often in applications (when all the rows of all the tables in the join(s) are returned by the query). In most queries only a fraction of the rows of each table is returned. The database engine often uses indexes to eliminate the unwanted rows, sometimes even without reading the actual row as it can use the values stored in indexes. The database engine is itself written in C, C++, etc. and is at least as efficient as code written by a developer.

Unless I've seriously misunderstood, the logic in the question is very flawed
If there are 20 rows in B for each A, a 1000 rows in A implies 20k rows in B.
There can't be just 100 rows in B unless there is many-many table "AB" with 20k rows with the containing the mapping.
So to get all information about which 20 of the 100 B rows map to each A row you table AB too. So this would be either:
3 result sets of 100, 1000, and 20k rows and a client JOIN
a single JOINed A-AB-B result set with 20k rows
So "JOIN" in the client does add any value when you examine the data. Not that it isn't a bad idea. If I was retrieving one object from the database than maybe it makes more sense to break it down into separate results sets. For a report type call, I'd flatten it out into one almost always.
In any case, I'd say there is almost no use for a cross join of this magnitude. It's a poor example.
You have to JOIN somewhere, and that's what RDBMS are good at. I'd not like to work with any client code monkey who thinks they can do better.
Afterthought:
To join in the client requires persistent objects such as DataTables (in .net). If you have one flattened resultset it can be consumed via something lighter like a DataReader. High volume = lot of client resources used to avoid a database JOIN.

Develop Reference

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.