ETL Processing Design and Performance - C#

I am working on an ETL process for a data warehouse using C# that supports both SQL Server and Oracle. During development I have been writing stored procedures that synchronize data from one database to another. The stored procedure code is rather ugly because it involves dynamic SQL: it has to build the SQL strings at runtime since the database names are dynamic.
My team lead wants to use C# code to do the ETL. We have code generation that automatically generates new classes when the database definition changes. That is also why I decided not to use Rhino ETL.
Here are the pros and cons:
Stored Procedure:
Pros:
fast loading process, everything is handled by the database
easy deployment, no compiling is needed
Cons:
poor readability due to dynamic SQL
Need to maintain both T-SQL and PL/SQL scripts when the database definition changes
Slow development, because there is no IntelliSense when writing dynamic SQL
C# Code:
Pros:
easier to develop the ETL process because we get IntelliSense from our generated classes
easier to maintain because of the generated classes
better logging and error handling
Cons:
slower performance compared with stored procedures
I would prefer to use application code for the ETL process, but its performance was horrible compared with the stored procedures. In one test updating 10,000 rows, the stored procedure took only 1 second, while my ETL code took 70 seconds. Even if I somehow manage to reduce the overhead, 20% of those 70 seconds is spent purely on issuing update statements from application code.
Could someone offer suggestions or comments on how to speed up the ETL process using application code?
My next idea is to try a parallel ETL process: open multiple database connections and perform the updates and inserts concurrently.
Thanks

You say you have code generation that automatically generates new classes - why don't you have code generation that automatically generates new stored procedures?
That should give you the best of both worlds: encapsulate it in a few nice classes that can inspect the database and update things as necessary, and you can, well, not increase readability, but hide it (you would not need to update the SPs manually).
Also, the difference should not be so huge; it sounds as if you are doing something wrong (not reusing connections, moving data unnecessarily from the server to the application, or processing data row by row instead of in batches?). A sketch of the batching idea is below.
Also, regarding better logging - care to elaborate on that? You can have logging in the database layer too, or you can design your SPs so that the application layer can still do the logging.
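On the batching point, here is a minimal sketch of what a set-based load could look like on the SQL Server side. The staging table staging_customer and the merge procedure proc_merge_staging_customer are hypothetical names, not anything from your schema:

```csharp
using System.Data;
using System.Data.SqlClient;

static class Loader
{
    // Assumes a hypothetical staging table "staging_customer" whose columns
    // match the DataTable, and a hypothetical MERGE-style stored procedure
    // that applies the staged rows to the real target table.
    public static void BulkLoad(string connectionString, DataTable rows)
    {
        using (var connection = new SqlConnection(connectionString))
        {
            connection.Open(); // one connection for the whole load

            // Stream all rows to the staging table in a single bulk operation
            // instead of issuing one UPDATE per row.
            using (var bulk = new SqlBulkCopy(connection))
            {
                bulk.DestinationTableName = "staging_customer";
                bulk.BatchSize = 5000;
                bulk.WriteToServer(rows);
            }

            // Apply the staged rows to the target table in one set-based call.
            using (var merge = new SqlCommand("proc_merge_staging_customer", connection))
            {
                merge.CommandType = CommandType.StoredProcedure;
                merge.ExecuteNonQuery();
            }
        }
    }
}
```

For Oracle, ODP.NET's array binding or OracleBulkCopy offers a similar path, so the two providers would likely need separate load implementations behind a common interface.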

If your C# code is already slow with 10,000 rows, I cannot imagine it in a real environment...
Most ETL is done within the database (stored procedures, packages, or even code compiled into the database: PL/SQL, or Java for Oracle). These can handle millions of rows.
Alternatively, professional tools can be used (Informatica, among others); they will still be slower than stored procedures, but easier to manage.
So my conclusion is: if you want to come anywhere close to stored procedure performance, you will have to code an application as good as the professional ones on the market, which took years to develop and mature... Do you think you can?
Plus, if you have to handle different database types (SQL Server, Oracle), you CANNOT make a generic application AND optimize it at the same time; it's a choice, because Oracle does not work the same way SQL Server does.
To give you an idea, ETLs for Oracle use hints (like the Parallel Execution hints), and some indexes may be dropped or integrity constraints disabled temporarily to optimize the ETL.
There is no way that I know of to do the exact same thing in SQL Server (it may have similar options, but with different syntax).
So "one ETL for all databases" can hardly be done without losing efficiency and speed.
So I think your pros and cons are very accurate; you have to choose between speed and ease of development, but you cannot have both.

You might consider tuning up your application.
A few tricks of mine:
Don't call connection.Open() and connection.Close() too often.
In some cases LINQ will slow things down.
Use a procedure and pass more parameters per call to reduce the number of calls; for example, change proc_load_to_table(p1 text) to proc_load_to_table(p1 text, p2 text, p3 text, p4 text, p5 text).
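As a rough illustration of the first and third tricks combined (the procedure and parameter names are the made-up ones from above, and the connection is assumed to be opened once by the caller):

```csharp
using System;
using System.Data;
using System.Data.SqlClient;

static class BatchLoader
{
    // One already-open connection and one command reused for every call,
    // with five values per round trip instead of one.
    public static void LoadBatch(SqlConnection connection, string[] values)
    {
        using (var command = new SqlCommand("proc_load_to_table", connection))
        {
            command.CommandType = CommandType.StoredProcedure;
            for (int i = 1; i <= 5; i++)
                command.Parameters.Add("@p" + i, SqlDbType.NVarChar, 4000);

            for (int offset = 0; offset < values.Length; offset += 5)
            {
                // Pad the last chunk with NULLs; the procedure is assumed
                // to ignore NULL parameters.
                for (int i = 0; i < 5; i++)
                    command.Parameters[i].Value = offset + i < values.Length
                        ? (object)values[offset + i]
                        : DBNull.Value;
                command.ExecuteNonQuery();
            }
        }
    }
}
```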

Related

Stored procedure vs. Entity Framework (LINQ queries) [duplicate]

I've used the Entity Framework in a couple of projects. In every project, I've used stored procedures mapped to the entities because of the well-known benefits of stored procedures - security, maintainability, etc. However, 99% of the stored procedures are basic CRUD stored procedures. This seems to negate one of the major, time-saving features of the Entity Framework: SQL generation.
I've read some of the arguments regarding stored procedures vs. generated SQL from the Entity Framework. While using CRUD SPs is better for security, and the SQL generated by EF is often more complex than necessary, does it really buy anything in terms of performance or maintainability to use SPs?
Here is what I believe:
Most of the time, modifying an SP requires updating the data model anyway. So, it isn't buying much in terms of maintainability.
For web applications, the connection to the database uses a single user ID specific to the application. So, users don't even have direct database access. That reduces the security benefit.
For a small application, the slightly decreased performance from using generated SQL probably wouldn't be much of an issue. For high-volume, performance-critical applications, would EF even be a wise choice? Plus, are the insert / update / delete statements generated by EF really that bad?
Sending every attribute to a stored procedure has its own performance penalties, whereas the EF-generated code only sends the attributes that were actually changed. When updating large tables, the increased network traffic and overhead of updating all attributes probably negate the performance benefit of stored procedures.
With that said, my specific questions are:
Are my beliefs listed above correct? Is the idea of always using SPs something that is "old school" now that ORMs are gaining in popularity? In your experience, which is the better way to go with EF: mapping SPs for all inserts / updates / deletes, or using EF-generated SQL for CRUD operations and only using SPs for more complex stuff?
I think always using SPs is somewhat old school. I used to code that way, and now do everything I can in EF-generated code. When I have a performance problem or some other special need, I add back in a strategic SP to solve a particular problem. It doesn't have to be either/or - use both.
All my basic CRUD operations are straight EF-generated code. My web apps used to have hundreds of SPs; now a typical one will have a dozen SPs, and everything else is done in my C# code. My productivity has gone WAY up by eliminating those 95% of CRUD stored procs.
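On the question's point about EF only sending changed attributes, here is a minimal sketch with a hypothetical Customer entity and ShopContext (EF 4.1 DbContext syntax):

```csharp
using System.Data.Entity;

// Hypothetical model, purely for illustration.
public class Customer
{
    public int Id { get; set; }
    public string Name { get; set; }
    public string Email { get; set; }
}

public class ShopContext : DbContext
{
    public DbSet<Customer> Customers { get; set; }
}

class Demo
{
    static void UpdateEmail()
    {
        using (var context = new ShopContext())
        {
            var customer = context.Customers.Find(42);
            customer.Email = "new@example.com"; // only this property is modified

            // The change tracker emits roughly
            //   UPDATE Customers SET Email = @0 WHERE Id = @1
            // touching only the changed column, not every attribute.
            context.SaveChanges();
        }
    }
}
```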
Yes, your beliefs are absolutely correct. Using stored procedures for data manipulation makes sense mainly if:
The database follows strict security rules where changing data is allowed only through stored procedures
You are using views or custom queries for mapping your entities and you need advanced logic in a stored procedure to push data back
You have some other advanced logic (related to data) in the procedure
Using procedures for pure CUD where none of the mentioned cases applies is redundant and doesn't provide any measurable performance boost, except in a single scenario:
You use stored procedures for batch / bulk modifications
EF doesn't have bulk / batch functionality, so changing 1,000 records results in 1,000 updates, each executed in a separate database roundtrip! Such procedures cannot be mapped to entities anyway and must be executed separately via a function import (if possible) or directly as ExecuteStoreCommand or old ADO.NET (for example, if you want to use a table-valued parameter).
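A sketch of the table-valued parameter route via plain ADO.NET, assuming a hypothetical user-defined table type dbo.IdList and a hypothetical procedure proc_bulk_update already exist on the server:

```csharp
using System.Data;
using System.Data.SqlClient;

static class BulkUpdater
{
    // "ids" must match the hypothetical user-defined table type, e.g.
    //   CREATE TYPE dbo.IdList AS TABLE (Id int NOT NULL)
    public static void BulkUpdate(string connectionString, DataTable ids)
    {
        using (var connection = new SqlConnection(connectionString))
        using (var command = new SqlCommand("proc_bulk_update", connection))
        {
            command.CommandType = CommandType.StoredProcedure;

            var parameter = command.Parameters.AddWithValue("@ids", ids);
            parameter.SqlDbType = SqlDbType.Structured; // marks it as a TVP
            parameter.TypeName = "dbo.IdList";          // the table type on the server

            connection.Open();
            command.ExecuteNonQuery(); // one round trip for the whole batch
        }
    }
}
```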
A whole different story can be the R in CRUD, where a stored procedure can deliver a significant performance boost by reading data with your own optimized queries.
If performance is your primary concern, then you should take one of your existing apps that uses EF with SPs, disable the SPs, and benchmark the new version. That's the only way to get an answer perfectly applicable to your situation. You might find that, no matter what you do, EF isn't fast enough for your performance needs compared to custom code, but outside of very high-volume sites I think EF 4.1 is actually pretty reasonable.
From my PoV, EF is a great developer-productivity boost. You lose a fair bit of that if you're writing SPs for simple CRUD operations, and for insert/update/delete in particular I really don't see you gaining much in performance, because those operations are so straightforward to generate SQL for. There are definitely some select cases where EF will not do the optimal thing and you can get major performance increases by writing an SP (hierarchical queries using CONNECT BY in Oracle come to mind as an example).
The best way to deal with that type of thing is to write your app letting EF generate the SQL, benchmark it, find the areas where there are performance issues, and write SPs for those. Delete is almost never going to be one of the cases where you need to do this.
As you mentioned, the security gain here is somewhat lessened, because you should have EF on an application tier that has its own account for the app anyway, so you can restrict what it does. SPs do give you a bit more control, but in typical usage I don't think it matters.
It's an interesting question that doesn't have a truly right or wrong answer. I use EF primarily so that I don't have to write generic CRUD SPs and can instead spend my time working on the more complex cases, so for me I'd say you should write fewer of them. :)
I agree broadly with E.J, but there are a couple of other options. It really boils down to the requirements for the particular system:
Do you need to get the app developed FAST? - Then use entity framework and its automatic SQL
Need fine-grained and solid security? - Get onto stored procedures
Need it to run as fast as possible? - You're probably looking at some happy medium!
In my opinion, as long as your application/database does not suffer from performance issues, you are mostly using the database for CRUD, and you are accessing it using just one DB user, it is better to use generated SQL. It is faster to develop and more maintainable, and the few security or privacy benefits are not worth it (if the data is not that sensitive). Also, using model-based database access or LINQ removes the threat of SQL injection.

What's the purpose of Datasets?

I want to understand the purpose of datasets when we can directly communicate with the database using simple SQL statements.
Also, which way is better? Updating the data in a dataset and then transferring it to the database at once, or updating the database directly?
I want to understand the purpose of datasets when we can directly communicate with the database using simple SQL statements.
Why do you have food in your fridge, when you can just go directly to the grocery store every time you want to eat something? Because going to the grocery store every time you want a snack is extremely inconvenient.
The purpose of DataSets is to avoid directly communicating with the database using simple SQL statements. The purpose of a DataSet is to act as a cheap local copy of the data you care about so that you do not have to keep making expensive high-latency calls to the database. They let you drive to the data store once, pick up everything you're going to need for the next week, and stuff it in the fridge in the kitchen so that it's there when you need it.
Also, which way is better? Updating the data in a dataset and then transferring it to the database at once, or updating the database directly?
You order a dozen different products from a web site. Which way is better: delivering the items one at a time as soon as they become available from their manufacturers, or waiting until they are all available and shipping them all at once? The first way, you get each item as soon as possible; the second way has lower delivery costs. Which way is better? How the heck should we know? That's up to you to decide!
The data update strategy that is better is the one that does the thing in a way that better meets your customer's wants and needs. You haven't told us what your customer's metric for "better" is, so the question cannot be answered. What does your customer want -- the latest stuff as soon as it is available, or a low delivery fee?
DataSets support a disconnected architecture. You can add local data and delete from it, and then commit everything to the database using a SqlDataAdapter. You can even load an XML file directly into a DataSet, and you can set up in-memory relations between the tables in a DataSet. It really depends on what your requirements are.
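A minimal sketch of that disconnected cycle, assuming a hypothetical Orders table with Id as its primary key:

```csharp
using System.Data;
using System.Data.SqlClient;

static class OrdersDemo
{
    public static void Process(string connectionString)
    {
        var adapter = new SqlDataAdapter("SELECT Id, Status FROM Orders", connectionString);
        // Derives the INSERT/UPDATE/DELETE commands from the SELECT;
        // it needs the primary key (Id) in the query to do so.
        var builder = new SqlCommandBuilder(adapter);

        var dataSet = new DataSet();
        adapter.Fill(dataSet, "Orders");        // one trip: load the local copy

        foreach (DataRow row in dataSet.Tables["Orders"].Rows)
            if ((string)row["Status"] == "New") // work entirely offline
                row["Status"] = "Processed";

        adapter.Update(dataSet, "Orders");      // one trip: push all changes back
    }
}
```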
And by the way, using direct SQL queries embedded in your application is a really, really poor way of designing an application. Your application will be prone to SQL injection. Secondly, if you write queries embedded in the application like that, SQL Server has to build an execution plan every time, whereas stored procedures are compiled and their execution plan is already decided at compile time; SQL Server can also change the plan as the data grows, which gives you a performance improvement. At least use stored procedures and validate junk input in them; they are inherently resistant to SQL injection.
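To make the injection point concrete, the difference is roughly this (the table and the inputs are invented for illustration):

```csharp
using System.Data.SqlClient;

static class Queries
{
    static SqlCommand Unsafe(SqlConnection connection, string userInput)
    {
        // Vulnerable: input is concatenated into the SQL text, so a value like
        // "x'; DROP TABLE Users; --" becomes part of the statement itself.
        return new SqlCommand(
            "SELECT * FROM Users WHERE Name = '" + userInput + "'", connection);
    }

    static SqlCommand Safe(SqlConnection connection, string userInput)
    {
        // Safe: the value travels as a typed parameter, never as SQL text,
        // and SQL Server can cache and reuse the parameterized plan.
        var command = new SqlCommand(
            "SELECT * FROM Users WHERE Name = @name", connection);
        command.Parameters.AddWithValue("@name", userInput);
        return command;
    }
}
```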
Stored procedures and DataSets are the way to go.
Edit: If you are on .NET Framework 3.5 or 4.0, you can use a number of ORMs such as Entity Framework, NHibernate, or SubSonic. ORMs represent your business model more realistically. You can always use stored procedures alongside an ORM if some features are not supported by it.
For example, if you are writing a recursive CTE (Common Table Expression), stored procedures are very helpful; you will run into too many problems if you use Entity Framework for that.
This page explains in detail in which cases you should use a DataSet and in which cases you should use direct access to the database.
My usual practice is: if I need to perform a bunch of analytical processes on a large set of data, I fill a DataSet (or a DataTable, depending on the structure). That way it is a model disconnected from the database.
But for DML queries I prefer quick hits directly against the database (preferably through stored procs). I have found this to be the most efficient, and with well-tuned queries it is not bad at all on the DB.

SQL Server stored procedure vs an external DLL

I am trying to convince someone that using an external DLL to manage SQL data is better than using stored procedures. Currently the person I am working with uses VBA and calls SQL stored procedures to get the complicated data they need from many different sources. It is my understanding that the best way to go about this kind of thing is to use a DLL / some intermediate layer to get the data and be able to format it as needed.
Some things to keep in mind:
The person I am working with doesn't care much about being able to scale much further than we are now
They don't care about being able to switch to different platforms
They don't see much of a performance problem with the current setup
Using a DLL requires more work, in a different direction
They don't want to switch if there isn't a current problem with doing it the way it is now (so "just because it's not the right way" won't work... I tried)
So can anyone tell me some benefits of using an external DLL rather than SQL stored procedures?
Use stored procedures, and write your data access layer, which calls them via parameterized commands, in a separate DLL. Stored procedures are a standard and give you a ton of benefits; parameterized commands give you automatic string safety.
This type of design is so standardized, and has been for years now, that Microsoft has included a framework that constructs it for you in .NET 4.
More or less, both you and this other fellow are right: use sprocs for security, and separate your DAL for security, reusability, and lots of other reasons.
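A sketch of what such a separate data access DLL might contain (the class, procedure, and parameter names are all invented):

```csharp
using System.Data;
using System.Data.SqlClient;

// Lives in its own DLL: the rest of the application sees only this API,
// while the database exposes only stored procedures.
public class ReportRepository
{
    private readonly string _connectionString;

    public ReportRepository(string connectionString)
    {
        _connectionString = connectionString;
    }

    public DataTable GetMonthlySales(int year, int month)
    {
        using (var connection = new SqlConnection(_connectionString))
        using (var command = new SqlCommand("proc_get_monthly_sales", connection))
        {
            command.CommandType = CommandType.StoredProcedure;
            command.Parameters.AddWithValue("@year", year);   // parameterized:
            command.Parameters.AddWithValue("@month", month); // no string building

            var table = new DataTable();
            new SqlDataAdapter(command).Fill(table); // the adapter opens the connection
            return table;
        }
    }
}
```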
ORM/DLL Approach
Pro:
You don't have to learn SQL, or stored procedure syntax
Con:
Complicates multiple operations in a single transaction
Risks increasing trips between the application and the database, which means data sync/concurrency issues
Utterly fails at complex queries; most ORMs support calling stored procedures precisely because of this
You can save SQL, including stored procedures, in flat files. The file extension could be txt, but most use sql - which makes the source-control argument moot: SQL source can be stored in CVS/etc. just like .NET or Java source code.
I agree with the points about controlling the code; it's much easier in a DLL. Same with source control. However, from a pure performance perspective, the stored procedures will win the day because they are compiled, not just cached. I don't know if it will make enough difference, but I thought I'd throw that in.
Using stored procedures can also be much more secure as you can lock down access to only stored procedures and you don't (have to) expose your table data to anyone with a connection.
I guess I'm not really answering your question so much as pointing out holes in your argument. Sorry about that, but I'm looking at it from their perspective.
I really think it comes down to a matter of preference. Personally I like an ORM and saved queries in a DLL rather than stored procs; I find them much easier to maintain and distribute than deploying sprocs to a DB. There are certain advantages that a sproc has over a raw query, though: some optimizations, and some server-side logic that could improve performance in some areas.
All in all though, personally I prefer to work in code rather than in DB mumbo-jumbo, so that's really why I opt for the DLL approach.
Plus you can keep your source code in source control too, which is much harder to do with a stored proc.
Just my 2c.

When to use Stored Procedures instead of using any ORM with programming logic?

Hi all, I wanted to know when I should prefer writing stored procedures over writing programming logic and pulling data using an ORM or something else.
Stored procedures are executed on server side.
This means that processing large amounts of data does not require passing these data over the network connection.
Also, with stored procedures, you can build consistent complicated business logic.
Say, you need to update the account balance each time you insert a transaction, and you need to insert many transactions at once.
Instead of doing this with triggers (which in many systems are implemented using an inefficient record-by-record approach), you can pass a table variable or temporary table with the inputs and issue a set-based SQL statement inside the procedure. This will be much more efficient.
I prefer SPs over programming logic mainly for two reasons:
Performance: anything that will reduce the result set or can be done more effectively on the server (see the paging sketch after this list), e.g.:
paging
filtering
ordering (on indexed columns)
Security -- if someone has gained the application's access to the database and wants to wipe out all your records, having to execute Row_Delete for each single one of them instead of DELETE FROM Rows already sounds good.
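As promised above, a sketch of server-side paging so that only one page of rows crosses the network (the table and column names are invented; ROW_NUMBER requires SQL Server 2005 or later):

```csharp
using System.Data;
using System.Data.SqlClient;

static class Pager
{
    // Paging done on the server: the WHERE clause on the row number means
    // only pageSize rows are ever sent to the application.
    public static DataTable GetPage(SqlConnection connection, int page, int pageSize)
    {
        var command = new SqlCommand(@"
            SELECT Id, Title FROM
                (SELECT Id, Title, ROW_NUMBER() OVER (ORDER BY Id) AS rn
                 FROM Posts) AS numbered
            WHERE rn BETWEEN @from AND @to", connection);
        command.Parameters.AddWithValue("@from", (page - 1) * pageSize + 1);
        command.Parameters.AddWithValue("@to", page * pageSize);

        var result = new DataTable();
        new SqlDataAdapter(command).Fill(result);
        return result;
    }
}
```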
Never unless you identify a performance issue. (largely opinion)
(a Jeff blog post!)
http://www.codinghorror.com/blog/2004/10/who-needs-stored-procedures-anyways.html
If you see stored procs as optimizations:
http://en.wikipedia.org/wiki/Program_optimization#When_to_optimize
When appropriate.
complex data validation/checking logic
avoid several round trips to do one action in the DB
several clients
anything that should be set based
You can't say "never" or "always".
There is also the case where the database engine will outlive your client code. I bet there is more DAL or ORM upgrading/refactoring going on than DB engine upgrading/refactoring.
Finally, why can't I encapsulate code in a stored proc? Isn't that a good thing?
As ever, much of your decision as to which to use will depend on your application and its environment.
There are a couple of schools of thought here, and this debate always arouses strong sentiments on both sides.
The advantages of stored procedures (as well as the large-data-moving case that Quassnoi has mentioned) are that the logic is tied down in the database, and therefore potentially more secure. It is also only ever in one place.
However, there will be others who believe that the place for application logic is in the application, especially if you are planning to access other types of databases (for which you would often have to write different SPs).
Another consideration may be the skills of the resources you have to implement your application.
The point at which stored procedures become preferable to an ORM is that point at which you have multiple applications talking to the same database. At this point, you want your query logic embedded in one place, rather than once per application. And even here, you might want to prefer a service layer (which can scale horizontally) instead of the database (which only scales vertically).

Comparing hand-written ADO.NET/sprocs with NHibernate

I am going back and forth between using NHibernate and hand-written ADO.NET/stored procedures.
I currently use CodeSmith with templates I wrote that spit out simple classes that map my database tables, wrap my stored procedures for my data layer, and provide a thin business logic layer that just calls my data layer and returns the objects (one object or a collection).
This application is a web application, used for online communities (basically a forum).
I am watching the Summer of NHibernate videos right now.
Will using NHibernate make my life easier? Will updates to the database schema be any easier? What effect will there be on performance?
Is setting up NHibernate, and ensuring it performs optimally, a headache of its own?
I don't want a complicated or deep object model; I simply want classes that map my tables, and a way to fetch data from my other tables that have foreign keys to them. I don't want a very complicated OOP model.
NHibernate can definitely make your life easier. Updates to your database schema will definitely be easier, because when you use an ORM, you don't have an API of stored procedures hindering you from refactoring your database schema to meet changes in your business model.
OR mappers have a LOT to offer, and are sadly misunderstood by a significant portion of the developer community, and almost all of the DBA community.
Stored procedures in general give the DBA more options for tuning performance in a database, because they have the freedom to rewrite a stored proc as long as they don't change its output. However, in my experience, stored procedures are rarely rewritten, due to other issues that can arise as a result (i.e., when a new version of the software is deployed, any modified versions of existing procs will overwrite the optimized versions that were changed by a DBA, thus negating the benefit and creating a maintenance problem and unexpected performance issues).
Another grave misconception (and this is primarily from the SQL Server camp; I have very little experience with Oracle) is that stored procedures are the only thing whose execution plan can be compiled and cached. As far as SQL Server is concerned, any parameterized query can and probably will be compiled and cached.
A benefit of OR mappers is that they are adaptive; with a stored procedure, you write a single statement that will be used regardless of contextual nuances when that query is executed. LINQ to SQL has an amazing capacity to generate the most efficient queries I've ever seen, and often throws DBAs for a serious loop. I've shown DBAs queries generated by L2S that were full of subqueries and unconventional constructs, which were immediately scoffed at. However, given the challenge, the performance (namely physical reads) of the supposedly superior query written by a DBA ended up being significantly inferior (sometimes on a scale of 30 physical reads for L2S vs. 400 physical reads for the DBA).
Another detractor as far as DBAs are concerned is that, because ORMs generate dynamic SQL, they have no way to optimize those queries. On the contrary (and again, this is restricted to SQL Server), SQL Server offers a multitude of optimization paths (horizontal and vertical table partitioning, distribution of physical files across disks for any table or view, indexes, etc.) that can be taken before modifying a query becomes a necessity. Even in the event that a query needs to be modified, SQL Server 2005 and later provide something called Plan Guides, which allow you to moderately tune any query (stored proc, straight SQL, etc.). In the event that tuning a query isn't enough, you can match any particular query to a complete replacement query, allowing the DBA to tune the query as much as they need to (but as a last resort).
There are many, many benefits to be gained by using an OR mapper, and NHibernate is one of the best free ones (LLBLGen is also very nice, but it is not free). LINQ to SQL and Entity Framework are newer offerings from Microsoft (L2S is soon to be replaced by EF 4.0 from the .NET 4.0 framework, which will at least rival, if not outpace, NHibernate). The biggest hurdle to adopting an ORM is usually not the ORM product itself, nor its capabilities or performance. The greatest hurdle is usually convincing your DBA (if you're lucky/unlucky enough, depending on your experience, to have one) that an ORM can improve efficiency and reduce maintenance costs without costing the DBA their optimization paths.
NHibernate works very well, especially for a simple model. It will make your life much easier and isn't too tough to learn. Look at "Fluent NHibernate" instead of using XML mappings, it is much easier.
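For illustration, a Fluent NHibernate mapping for a hypothetical forum entity looks roughly like this; the mapping class replaces what would otherwise be a Post.hbm.xml file, so it is checked by the compiler:

```csharp
using FluentNHibernate.Mapping;

// Hypothetical entity: NHibernate requires virtual members for lazy-loading proxies.
public class Post
{
    public virtual int Id { get; set; }
    public virtual string Title { get; set; }
    public virtual string Body { get; set; }
}

// The mapping, expressed in code instead of XML.
public class PostMap : ClassMap<Post>
{
    public PostMap()
    {
        Table("Posts");
        Id(x => x.Id).GeneratedBy.Identity();
        Map(x => x.Title).Length(200).Not.Nullable();
        Map(x => x.Body);
    }
}
```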
