I don't know whether it is better to use temporary tables in SQL Server or a DataTable in C# for a report. Here is the scope of the report: it will be copied into a workbook with about 10 worksheets, each containing about 1000 rows and about 30 columns, so it's a lot of data. There is some guidance out there, but I could not find anything specific about how much data is too much for a DataTable. According to https://msdn.microsoft.com/en-us/library/system.data.datatable.aspx, a DataTable can hold up to 16 million rows, but my data set still seems unwieldy given the number of columns I have. Plus, I will either have to make multiple SQL queries to collect the data for my report or write a stored procedure in SQL to collect it. How do I resolve this quandary?
My rule of thumb is that if it can be processed on the database server, it probably should be. Keep in mind that no matter how efficient your C# code is, SQL Server will most likely do it faster and more efficiently; after all, it was designed for data manipulation.
There is no shame in using #temp tables. They maintain statistics and can be indexed and/or manipulated. One recent example: a developer created an admittedly elegant query using a CTE; its performance was 12-14 seconds versus 1 second for mine using #temp tables.
Now, one carefully structured stored procedure could produce and return the 10 data sets for your worksheets. If you are using a product like SpreadSheetLight (there are many options available), it becomes a small matter of passing the results and creating the tabs (no cell-level looping... unless you want or need it).
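For illustration, here is a minimal ADO.NET sketch of that flow, assuming a single stored procedure (the name dbo.usp_BuildWorkbookReport and the connection string are hypothetical) that returns one result set per worksheet; Fill() turns each result set into its own DataTable:

using System.Data;
using System.Data.SqlClient;

// Hypothetical proc returning ten result sets, one per worksheet.
var connectionString = "Server=.;Database=Reports;Integrated Security=true"; // assumption
var report = new DataSet();
using (var cn = new SqlConnection(connectionString))
using (var da = new SqlDataAdapter("dbo.usp_BuildWorkbookReport", cn))
{
    da.SelectCommand.CommandType = CommandType.StoredProcedure;
    da.Fill(report);   // report.Tables[0] .. report.Tables[9]
}
// Hand each DataTable to the spreadsheet library as one worksheet.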
I would also add that you can dramatically reduce the number of touch points and better enforce the business logic by making SQL Server do the heavy lifting. For example, a client introduced a 6W risk rating, which was essentially a 6.5. HUNDREDS of legacy reports had to be updated, while I only had to add the 6W to my mapping table.
There's a lot of missing context here - how is this report going to be accessed and run? Is this going to run as a scripted event every day?
Have you considered SSRS?
In my opinion it's best to abstract away your business logic by creating views or stored procedures in the database. Stored procedures would probably be the way to go, but it really depends on your specific environment. Then you can point whatever tools you want at the database object. This has several advantages:
if you end up with different versions or formats of the report and your logic ever changes, you can update the logic in one place rather than many.
your code is simpler and cleaner, typically:
select v.col1, v.col2, v.col3
from MY_VIEW v
where v.date between @startdate and @enddate
I assume your 10 spreadsheets are going to be something like
Summary Page | Department 1 | Department 2 | ...
So you could make a generalized view or SP, create a master spreadsheet linked to the DB object that pulls all the relevant data from SQL, apply Pivot Tables or filters or whatever else you want, and use that to generate the copies that get sent out.
But before going to all that trouble, I would make sure that SSRS is not an option, because if you can use that, it has a lot of baked in functionality that would make your life easier (export to Excel, automatic date parameters, scheduled execution, email subscriptions, etc).
Related
I'm a C++ programmer and I'm not familiar with the .NET database model. I usually use IDataReader (OdbcDataReader, OleDbDataReader, or SqlDataReader) to read data from the database. Sometimes, when I need bulk data, I use a DataAdapter. But what should I do to achieve the functionality of the scrollable cursors that exist in native libraries like ODBC?
Thanks to all of you for your answers, but I am in a situation where I can't accept them; of course, this is my fault for not explaining my problem completely. I explained it in a comment on one of the answers, which has now been removed.
I have to write a program that will act as a proxy between a client-side program and MSSQL. For this library I have the following requirements:
My program should be compatible with MSSQL2000
I don't know all the tables and queries that will be sent by the users; I simply add some information to each query, make a log, and so on, and then execute it against MSSQL. So it is really hard to use techniques based on ordered field(s) of the query or on the primary key of the table. (All my work is in one database, but that database is huge and may change over time.)
Only a part of the data is needed by the client. Most DBMSs support LIMIT/OFFSET; unfortunately MSSQL does not, and ROW_NUMBER() does not exist in MSSQL 2000. Even if it were supported, I would need to understand the program logic, and that requires parsing the SQL command. (I actually wrote a parsing library with boost::spirit, but that's native code, and besides, I'm not yet 100% sure about its functionality.)
I may have multiple clients, but most of the queries they send are one of a few predefined queries (users can still send custom queries, but those are about 30% of all queries). So I think I can open some scrollable cursors and respond to clients using those cursors and a custom cache.
The server machine and its MSSQL instance will be dedicated to my program, so I really want to use all of the power of the server and the DBMS to achieve my functionality.
So now:
What is the problem with using scrollable cursors, and why should I avoid them?
How can I use scrollable cursors in .NET?
In SQL Server you can page query results like this; the page number is easy to handle from the application. You do not need to create cursors for this task.
For SQL Server 2005 or higher
SELECT * FROM
    (SELECT *, ROW_NUMBER() OVER (ORDER BY ID) AS ROW FROM TABLEA) AS ALIAS
WHERE ROW > 39
  AND ROW <= 49
For SQL Server 2000
SELECT TOP 10 T.* FROM TABLEA AS T
WHERE T.ID NOT IN (SELECT TOP 39 ID FROM TABLEA ORDER BY ID DESC)
ORDER BY T.ID DESC
PS: edited to include support for SQL Server 2000.
I usually use DataReader.Read() to skip all the rows I don't want when doing paging against a DB that does not support paging.
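A minimal sketch of that skip-and-read approach, with hypothetical table and column names; note that every skipped row still travels from the server to the client:

using System.Collections.Generic;
using System.Data.SqlClient;

// Reads one page by skipping pageIndex * pageSize rows on the client.
static List<string> ReadPage(SqlConnection openConnection, int pageIndex, int pageSize)
{
    var names = new List<string>();
    using (var cmd = new SqlCommand("SELECT Id, Name FROM MyTable ORDER BY Id", openConnection))
    using (var reader = cmd.ExecuteReader())
    {
        // Skip the rows that belong to earlier pages.
        for (int i = 0; i < pageIndex * pageSize && reader.Read(); i++) { }

        // Read one page worth of rows.
        for (int i = 0; i < pageSize && reader.Read(); i++)
            names.Add(reader.GetString(1));
    }
    return names;
}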
If you don't want to build the SQL paged query yourself you are free to use my paging class: https://github.com/jgauffin/Griffin.Data/blob/master/src/Griffin.Data/BasicLayer/Paging/SqlServerPager.cs
When Microsoft designed the ADO.NET API, they made the decision to expose only firehose cursors (IDataReader etc). This may or may not actually pose a problem for you. You say that you want "functionality of scrollable cursors", but that can mean all sorts of things, not just paging, and each particular use case can be tackled in a variety of ways. For example:
Requirement: The user should be able to arbitrarily page up and down the resultset.
Retrieve only one page of data at a time, e.g. using the ROW_NUMBER() function. This is more efficient than scrolling through a cursor.
Requirement: I have an extremely large data set and I only want to process one row at a time to avoid running out of memory.
Use the firehose cursor provided by ADO.NET. Note that this is only practical if (a) you don't need to hit the database at all during the loop, or (b) you have MARS configured in your connection string.
Simulate a keyset cursor by retrieving the set of unique identifiers into an array, then loop through the array and read one row of data at a time.
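A minimal sketch of that keyset simulation, with hypothetical table and column names; without MARS, each inner reader must be closed before the next query, which the using blocks handle here:

using System.Collections.Generic;
using System.Data.SqlClient;

static void ProcessByKeyset(SqlConnection openConnection)
{
    // Pass 1: collect the keys into memory.
    var keys = new List<int>();
    using (var cmd = new SqlCommand("SELECT Id FROM BigTable ORDER BY Id", openConnection))
    using (var rdr = cmd.ExecuteReader())
        while (rdr.Read())
            keys.Add(rdr.GetInt32(0));

    // Pass 2: fetch one row at a time; indexing into keys lets you
    // move forwards or backwards like a keyset cursor would.
    using (var cmd = new SqlCommand("SELECT Col1, Col2 FROM BigTable WHERE Id = @id", openConnection))
    {
        var p = cmd.Parameters.Add("@id", System.Data.SqlDbType.Int);
        foreach (int id in keys)
        {
            p.Value = id;
            using (var rdr = cmd.ExecuteReader())
                if (rdr.Read())
                {
                    // process the single row here
                }
        }
    }
}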
Requirement: I am doing a complicated calculation that involves moving forwards and backwards through the resultset.
You should be able to re-write your algorithm to eliminate this requirement. For example, read one set of rows, process them, read another set of rows, process them, etc.
UPDATE (more information provided in the question)
Your business requirements are asking too much: you have to handle arbitrary queries that assume the presence of scrollable cursors, yet you can't provide scrollable cursors and you can't rewrite the client code to avoid them. That's an impossible position to be in. I recommend you stick with what you currently have (C++ and ODBC) and not bother trying to rewrite it in .NET.
I don't think cursors will work in your particular case. The main reason is that you have three tiers. But let's take two steps back.
Most 3-tier applications have a stateless middle tier (your C++ code). Caching is fine, since it is really just an optimization and does not create any real state in the middle tier. The middle tier normally keeps a small number of open sessions to the database, because opening a DB session is expensive for the processor and, once a session is open, a set amount of RAM is reserved on the database server. When a request is received by the middle tier, it is processed and handed on to the SQL database. An algorithm may be used to pick any of the open sessions, or one can even be chosen at random; in this model it is not possible to know which session will receive the next request. Cursors belong to the session that received the original query request, so you can't really expect the receiving session to be the one that has your open cursor.
The 3-tier model I described is used mainly for web applications so they can scale to hundreds or thousands of clients, where SQL servers would never be able to open that many sessions. Microsoft ADO.NET already has many features to support this kind of architecture, so it is not very hard to implement, and the same model is used even in non-web applications depending on the circumstances. You could potentially keep track of your sessions so you could open a single session per client, but I would first make sure the use case justifies that; know that open cursors can take up a lot of resources as well.
Cursors still have a place within a single transaction; it's just hard to keep them open so that the client application can fetch/update values within the result set.
What I would suggest is that you do the following within the query transaction: store the primary key values of the main table in your query in a separate table, along with other values like a session ID and row number. Return the first few rows by linking to the new table in the original query, and in subsequent calls just query the corresponding rows again by linking to your new table. You will need the equivalent of a caching mechanism to purge old data and to refresh the result set according to your needs.
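A hedged sketch of that idea, written to stay compatible with SQL Server 2000 (no ROW_NUMBER()): an IDENTITY column numbers the captured keys, and pages are fetched by joining back through the key table. The SessionKeys and MainTable names and columns are all assumptions.

using System;
using System.Data;
using System.Data.SqlClient;

class KeysetPager
{
    // Capture the ordered primary keys of a query result for one session.
    // SessionKeys(SessionId, RowNum IDENTITY, MainId) is an assumed table;
    // in SQL 2000, INSERT ... SELECT ... ORDER BY assigns IDENTITY values
    // in the requested order.
    public static void CaptureKeys(SqlConnection openConnection, Guid sessionId)
    {
        using (var cmd = new SqlCommand(
            @"INSERT INTO SessionKeys (SessionId, MainId)
              SELECT @sid, t.Id FROM MainTable t ORDER BY t.Id", openConnection))
        {
            cmd.Parameters.AddWithValue("@sid", sessionId);
            cmd.ExecuteNonQuery();
        }
    }

    // Fetch one page of rows by joining back through the key table.
    public static DataTable FetchPage(SqlConnection openConnection, Guid sessionId,
                                      int offset, int pageSize)
    {
        var page = new DataTable();
        using (var da = new SqlDataAdapter(
            @"SELECT t.*
              FROM SessionKeys k
              JOIN MainTable t ON t.Id = k.MainId
              WHERE k.SessionId = @sid
                AND k.RowNum BETWEEN
                    (SELECT MIN(RowNum) FROM SessionKeys WHERE SessionId = @sid) + @offset
                AND (SELECT MIN(RowNum) FROM SessionKeys WHERE SessionId = @sid) + @offset + @size - 1
              ORDER BY k.RowNum", openConnection))
        {
            da.SelectCommand.Parameters.AddWithValue("@sid", sessionId);
            da.SelectCommand.Parameters.AddWithValue("@offset", offset);
            da.SelectCommand.Parameters.AddWithValue("@size", pageSize);
            da.Fill(page);
        }
        return page;
    }
}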
I have a lot of data which needs to be paired based on a few simple criteria. There is a time window (both records have a DateTime column): if one record is very close in time (within 5 seconds) to another, then it is a potential match, and the record that is closest in time is considered a complete match. There are other fields which help narrow this down as well.
I wrote a stored procedure which does this matching on the server before returning the full, matched dataset to a C# application. My question is: would it be better to pull in the 1 million (x2) rows and deal with them in C#, or is SQL Server better suited to perform this matching? If SQL Server is, then what is the fastest way of pairing data using DateTime fields?
Right now I select all records from Table 1/Table 2 into temporary tables, iterate through each record in Table 1, look for a match in Table 2, and store the match (if one exists) in a temporary table; then I delete both records from their own temporary tables.
I had to rush this piece for a game I'm writing, so excuse the bad (very bad) procedure... It works, it's just horribly inefficient! The whole SP is available on pastebin: http://pastebin.com/qaieDsW7
I know the SP is written poorly, so saying "hey, dumbass... write it better" doesn't help! I'm looking for help in improving it, or help/advice on how I should do the whole thing differently! I have about 3/5 days to rewrite it, I can push that deadline back a bit, but I'd rather not if you guys can help me in time! :)
Thanks!
Ultimately, compiling your data on the database side is preferable 99% of the time, as the database is designed for data crunching (through the use of indexes, relations, etc.). A lot of your code can be consolidated by using joins to compile the data in exactly the format you need. In fact, you can bypass almost all your temp tables entirely and just fill a master Event temp table.
The general pattern is this:
INSERT INTO #Events
SELECT <all interested columns>
FROM
FireEvent
LEFT OUTER JOIN HitEvent ON <all join conditions for HitEvent>
This way you match all fire events to zero or more hit events. After our discussion in chat, you can even limit it to zero or one hit event by wrapping the query in a subquery, using a window function such as ROW_NUMBER() OVER (PARTITION BY HitEvent.EventID ORDER BY ...) AS HitRank, and adding WHERE HitRank = 1 to the outer query. This is ultimately what you ended up doing, and you got the results you were expecting (with a bit of work and learning in the process).
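As a hedged sketch of that final pattern: the EventId and EventTime column names are assumptions, and the partition side here is chosen so each fire event keeps at most one hit; the chat discussion partitioned on the hit side, so swap the partition if the hit must be the unique side instead.

using System.Data;
using System.Data.SqlClient;

static DataTable PairEvents(SqlConnection connection)
{
    // ROW_NUMBER() ranks candidate hits by time distance to the fire event;
    // HitRank = 1 keeps only the closest one (or none, thanks to LEFT JOIN).
    const string pairingSql = @"
        SELECT FireId, HitId
        FROM (
            SELECT f.EventId AS FireId, h.EventId AS HitId,
                   ROW_NUMBER() OVER (
                       PARTITION BY f.EventId
                       ORDER BY ABS(DATEDIFF(ms, f.EventTime, h.EventTime))
                   ) AS HitRank
            FROM FireEvent f
            LEFT OUTER JOIN HitEvent h
              ON h.EventTime BETWEEN DATEADD(s, -5, f.EventTime)
                                 AND DATEADD(s, 5, f.EventTime)
        ) ranked
        WHERE HitRank = 1;";

    var pairs = new DataTable();
    using (var da = new SqlDataAdapter(pairingSql, connection))
        da.Fill(pairs);
    return pairs;
}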
If the data is already in the database, that is where you should do the work. You absolutely should learn to display query execution plans in SQL Server Management Studio, and become able to notice and optimize away expensive operations like nested loops.
Your task probably does not require any use of temporary tables. Temporary tables tend to be efficient when they are relatively small and/or heavily reused, which is not your case.
I would advise you to try to optimize the stored procedure if it is not running fast enough, rather than rewrite it in C#. Why would you want to transfer millions of rows out of SQL Server anyway?
Unfortunately I don't have a SQL Server installation, so I can't test your script, but I don't see any CREATE INDEX statements in there. If you didn't just skip them for brevity, then you should surely analyze your queries and see which indexes are needed.
So the answer depends on several factors, like the resources available per client/server (RAM, CPU, concurrent users, concurrent processes, etc.).
Here are some basic rules that will improve your performance regardless of what you use:
Loading a million rows into a C# program is not good practice, unless it is a stand-alone process with plenty of RAM.
Uniqueidentifiers will never outperform integers in comparisons.
Common Table Expressions (CTEs) are a good alternative for fast-performing matching.
Finally, you have to consider output. If there is constant reading and writing that affects the user interface, then you should manage that in memory (C#); otherwise, all CRUD operations should be kept inside the database.
At the moment I'm working on a quite tricky transfer from a .csv file to a DB. I have to develop a package/solution/xxxyyy that handles a flow of data from this .csv file to my SQL Server DB (the .csv is updated with new data every day).
The approach that my boss "suggested" I use is SSIS (normally I would have written some kind of "parser" to easily convey the data from the .csv). The fact is that I have quite a bit of transformation to do.
For example:
An employee has these fields:
name;surname;id;roles
The field "roles" is formatted like this:
role1,role2,role3
This relationship in my db is mapped in 3 different tables:
tblEmployee
PK_Emp | name | surname
tblRoles
PK_Role | roleName
tblEmployeeRole
PK_Emp | PK_Role
So, from the .csv I have to extract the roles of each employee and insert them into tblRoles (checking that there are no duplicates). Then I have to manage the relationship in tblEmployeeRole.
Considering that this is just one example of the different transformations I have to manage, I was wondering whether SSIS is the best tool to achieve my goal (loads of Script Components). When I explained my perplexities to my boss, he came up with this "idea":
Use SSIS to transfer the data, as it is, into a temporary table, then handle the different transformations through stored procedures.
From the very little I know about stored procedures, I'm not sure I should follow this idea.
Now, considering that my actual superior isn't the most enlightened project manager (he usually messes up our work with bizarre ideas), and considering that I'm not an expert in either SSIS or stored procedures, I've decided to write here and see if anyone can explain whether one of the previous approaches is the right one, or whether I should consider some other (better) solution.
Sorry for my poor English, and thanks for any help =)
I would insert the data from the CSV file as-is.
Then do any parsing on the database end. If this is something that has to be done often, I would take the scripts you have made for it and create procedures/functions from them. This question is a bit grand-scheme, so this is only a general solution. If you need help parsing the roles into the lookup tables, that would be more specific and of better use.
In general when I work with massive flat-file data sets that need to be parsed into a SQL structure:
Import the data as-is
Find the commonalities among the look up codes
Create the base look up tables (in your case it would be tblRoles)
Create a script to insert into both tblEmployee and tblEmployeeRole (a sketch follows below)
Once my test scenarios work, then I worry about combining each component step into one monolithic SSIS package or stored procedure.
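As a hedged sketch of that insert script (step 4), assuming the raw CSV rows landed in a staging table stgEmployees(id, name, surname, roles) with roles still comma-separated; every table, column, and type name here is an assumption:

using System.Collections.Generic;
using System.Data.SqlClient;

static void LoadRoles(SqlConnection openConnection)
{
    // Read the staging rows fully first, so only one reader is open at a time.
    var staged = new List<(int empId, string[] roles)>();
    using (var cmd = new SqlCommand("SELECT id, roles FROM stgEmployees", openConnection))
    using (var rdr = cmd.ExecuteReader())
        while (rdr.Read())
            staged.Add((rdr.GetInt32(0), rdr.GetString(1).Split(',')));

    foreach (var (empId, roles) in staged)
        foreach (var role in roles)
        {
            // Insert the role only if it's new, then link it to the employee.
            // Assumes PK_Role is an identity column; add an EXISTS check on
            // tblEmployeeRole if the script has to be rerunnable.
            using (var cmd = new SqlCommand(@"
                IF NOT EXISTS (SELECT 1 FROM tblRoles WHERE roleName = @r)
                    INSERT INTO tblRoles (roleName) VALUES (@r);
                INSERT INTO tblEmployeeRole (PK_Emp, PK_Role)
                SELECT @e, PK_Role FROM tblRoles WHERE roleName = @r;", openConnection))
            {
                cmd.Parameters.AddWithValue("@r", role.Trim());
                cmd.Parameters.AddWithValue("@e", empId);
                cmd.ExecuteNonQuery();
            }
        }
}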
I suggest something similar here. Break this import task into small pieces and worry about the grand design later. SSIS, procs, compiled code... any of these might work for you. You just need to know what you need it to do.
Depending upon your transformations, they may all be doable within SSIS. If you don't need to store the raw .csv data, I would stay away from stored procedures and temporary tables, as you would be bypassing a large portion of SSIS's strengths.
As an example, you can do lookups on your incoming data to determine the proper relationships and insert those results into multiple tables (your three, in the example).
This task looks very suitable for the bcp utility or the BULK INSERT command.
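For instance, a minimal BULK INSERT issued from C#; the path, staging table name, and the semicolon delimiter (taken from the question's CSV format) are assumptions, and note the file path is resolved on the SQL Server machine, not the client:

using System.Data.SqlClient;

static void BulkLoadCsv(SqlConnection openConnection)
{
    const string sql = @"
        BULK INSERT stgEmployees
        FROM 'C:\imports\employees.csv'
        WITH (FIELDTERMINATOR = ';',
              ROWTERMINATOR = '\n',
              FIRSTROW = 2);";  // FIRSTROW = 2 skips a header line

    using (var cmd = new SqlCommand(sql, openConnection))
        cmd.ExecuteNonQuery();
}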
I want to understand the purpose of datasets when we can directly communicate with the database using simple SQL statements.
Also, which way is better? Updating the data in a dataset and then transferring it to the database all at once, or updating the database directly?
I want to understand the purpose of datasets when we can directly communicate with the database using simple SQL statements.
Why do you have food in your fridge, when you can just go directly to the grocery store every time you want to eat something? Because going to the grocery store every time you want a snack is extremely inconvenient.
The purpose of DataSets is to avoid directly communicating with the database using simple SQL statements. The purpose of a DataSet is to act as a cheap local copy of the data you care about, so that you do not have to keep making expensive high-latency calls to the database. They let you drive to the data store once, pick up everything you're going to need for the next week, and stuff it in the fridge in the kitchen so that it's there when you need it.
Also, which way is better? Updating the data in a dataset and then transferring it to the database all at once, or updating the database directly?
You order a dozen different products from a web site. Which way is better: delivering the items one at a time as soon as they become available from their manufacturers, or waiting until they are all available and shipping them all at once? The first way, you get each item as soon as possible; the second way has lower delivery costs. Which way is better? How the heck should we know? That's up to you to decide!
The data update strategy that is better is the one that does the thing in a way that better meets your customer's wants and needs. You haven't told us what your customer's metric for "better" is, so the question cannot be answered. What does your customer want -- the latest stuff as soon as it is available, or a low delivery fee?
DataSets support a disconnected architecture: you can add local data, delete from it, and then commit everything to the database using a SqlDataAdapter. You can even load an XML file directly into a DataSet. It really depends on what your requirements are. You can even set up in-memory relations between tables in a DataSet.
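A minimal sketch of that disconnected round trip; the table, columns, and connection string are assumptions:

using System.Data;
using System.Data.SqlClient;

var connectionString = "Server=.;Database=Shop;Integrated Security=true"; // assumption
var customers = new DataTable();
using (var cn = new SqlConnection(connectionString))
using (var da = new SqlDataAdapter("SELECT Id, Name FROM Customers", cn))
using (var cb = new SqlCommandBuilder(da))   // generates INSERT/UPDATE/DELETE for Update()
{
    da.Fill(customers);                      // local, disconnected copy

    customers.Rows[0]["Name"] = "Renamed";   // edit locally
    customers.Rows.Add(42, "New customer");  // assumes Id is client-supplied, not identity

    da.Update(customers);                    // commit all pending changes at once
}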
And by the way, embedding direct SQL queries in your application is a really poor way to design an application; it leaves you prone to SQL injection. Secondly, if you write queries embedded in the application like that, SQL Server has to work out an execution plan every time, whereas stored procedures are compiled and their execution plan is already decided at compile time; SQL Server can also change the plan as the data gets large, so you will get a performance improvement. At the very least, use stored procedures and validate junk input in them; they are inherently more resistant to SQL injection.
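By way of contrast, a hedged sketch of a parameterized stored procedure call (the procedure and column names are assumptions); the parameter value never becomes part of the SQL text, which is what blocks injection:

using System.Data;
using System.Data.SqlClient;

static void PrintOrders(string connectionString, int customerId)
{
    using (var cn = new SqlConnection(connectionString))
    using (var cmd = new SqlCommand("dbo.usp_GetOrdersForCustomer", cn))
    {
        cmd.CommandType = CommandType.StoredProcedure;
        cmd.Parameters.AddWithValue("@CustomerId", customerId); // never concatenated into SQL
        cn.Open();
        using (var rdr = cmd.ExecuteReader())
            while (rdr.Read())
                System.Console.WriteLine(rdr["OrderId"]);
    }
}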
Stored Procedures and Dataset are the way to go.
Edit: If you are on .NET Framework 3.5 or 4.0, you can use a number of ORMs, such as Entity Framework, NHibernate, or SubSonic. ORMs represent your business model more realistically. You can always use stored procedures alongside an ORM if some features are not supported by the ORM.
For example, if you are writing a recursive CTE (Common Table Expression), stored procedures are very helpful; you will run into too many problems if you use Entity Framework for that.
This page explains in detail in which cases you should use a DataSet and in which cases you should use direct access to the database.
My usual practice: if I need to perform a bunch of analytical processes on a large set of data, I fill a DataSet (or a DataTable, depending on the structure). That way it is a disconnected model, separate from the database.
But for DML queries I prefer quick hits directly against the database (preferably through stored procs). I have found this to be the most efficient approach, and with well-tuned queries it is not bad at all on the DB.
People suggest that creating database tables dynamically (at run time) should be avoided, saying it is bad practice and will be hard to maintain.
I don't see the reason why, and I don't see the difference between creating a table and any other SQL query/statement such as SELECT or INSERT. I have written apps that create, delete, and modify databases and tables at run time, and so far I have not seen any performance issues.
Can anyone explain the cons of creating databases and tables at run time?
Tables are much more complex entities than rows, and managing table creation is much more complex than an insert, which merely has to abide by an existing model: the table. True, a table create statement is a standard SQL operation, but depending on creating tables dynamically smacks of a bad design decision.
Now, if you just create one or two tables and that's it, or create an entire database dynamically or from a script once, that might be OK. But if you depend on creating more and more tables to handle your data, you will also need to join more and more and query more and more. One very serious issue I encountered with an app that made use of dynamic table creation is that a single SQL Server query can only involve 255 tables. It's a built-in constraint. (And that's SQL Server, not CE.) It took only a few weeks in production for this limit to be reached, resulting in a nonfunctioning application.
And if you get into editing the tables, e.g. adding/dropping columns, then your maintenance headache gets even worse. There's also the matter of binding your DB data to your app's logic. Another issue is upgrading production databases: it would really be a challenge if a DB had been growing with objects dynamically and you suddenly needed to update the model.
When you need to store data in such a dynamic manner, the standard practice is to use EAV (entity-attribute-value) models: you have fixed tables, and your data is added dynamically as rows, so your schema does not have to change. There are drawbacks, of course, but it's generally considered better practice.
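As a rough illustration of the EAV idea (every name here is hypothetical): new "columns" become rows in a value table, so the schema itself never changes at run time.

using System.Data.SqlClient;

static void CreateEavSchema(SqlConnection openConnection)
{
    // Hypothetical EAV layout: one row per (entity, attribute) pair
    // instead of one new table per entity type.
    const string eavSchema = @"
        CREATE TABLE Entity (
            EntityId   int IDENTITY PRIMARY KEY,
            EntityType varchar(50) NOT NULL);

        CREATE TABLE Attribute (
            AttributeId int IDENTITY PRIMARY KEY,
            Name        varchar(50) NOT NULL);

        CREATE TABLE EntityValue (
            EntityId    int NOT NULL REFERENCES Entity(EntityId),
            AttributeId int NOT NULL REFERENCES Attribute(AttributeId),
            Value       sql_variant NULL,
            PRIMARY KEY (EntityId, AttributeId));";

    using (var cmd = new SqlCommand(eavSchema, openConnection))
        cmd.ExecuteNonQuery();
}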
KMC, remember the following points:
What if you want to add or remove a column? You may need to change the code and compile it again.
What if the database location changes?
Developers who are not very good at databases can make changes; if you create the schema on the back end, DBAs can take care of it.
If you run into any performance issues, they may be tough to debug.
You will need to be a little clearer about what you mean by "creating tables".
One reason to not allow the application to control table creation and deletion is that this is a task that should be handled only by an administrator. You don't want normal users to have the ability to delete whole tables.
Temporary tables are a different story, and you may need to create temporary tables as part of your queries, but your basic database structure should be managed only by someone with the rights to do so.
Sometimes creating tables dynamically is not the best option security-wise (Google "SQL injection"); it would be better to use stored procedures and have your insert or update operations occur at the database level by executing the stored procedures in code.