Creating an ETL system (data import and transformation) - C#

I have been tasked to write a module for importing data into a client's system.
I thought to break the process into 4 parts:
1. Connect to the data source (SQL, Excel, Access, CSV, ActiveDirectory, SharePoint and Oracle) - DONE
2. Get the available tables/data groups from the source - DONE
   i. Get the available fields from the selected table/data group - DONE
   ii. Get all data from the selected fields - DONE
3. Transform the data to the user's requirements
4. Write the transformed data to the MSSQL target
I am trying to plan how to handle complex data transformations like:
Get column A from Table tblA, inner joined to column FA from table tblB, and concatenate these two with a semicolon in between.
OR
Get column C from table tblC on source where column tblC.D is not in table tblG column G on target database.
My worry is not the visual, but the representation in code of this operation.
I am NOT asking for sample code, but rather for some creative ideas.
The data transformation will not be defined with free text, but with drag-and-drop objects that represent actions.
I am a bit lost, and need some fresh input.

Maybe you can grab some ideas from this open source project: Rhino ETL.
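One way to represent such operations in code (a sketch of the general idea only, not Rhino ETL's actual API; every name below is made up for illustration) is to model each drag-and-drop action as a small step object, and a transformation as a pipeline of steps applied to a stream of rows:

using System;
using System.Collections.Generic;
using System.Linq;

// Each row is a simple column-name -> value map.
public interface ITransformStep
{
    IEnumerable<Dictionary<string, object>> Apply(
        IEnumerable<Dictionary<string, object>> rows);
}

// "Concatenate column A and column FA with a semicolon" as one step.
public class ConcatStep : ITransformStep
{
    private readonly string _left, _right, _target;

    public ConcatStep(string left, string right, string target)
    {
        _left = left; _right = right; _target = target;
    }

    public IEnumerable<Dictionary<string, object>> Apply(
        IEnumerable<Dictionary<string, object>> rows)
    {
        foreach (var row in rows)
        {
            row[_target] = $"{row[_left]};{row[_right]}";
            yield return row;
        }
    }
}

// The designer builds a list of steps; execution chains them lazily.
public class Pipeline
{
    private readonly List<ITransformStep> _steps = new List<ITransformStep>();

    public void Add(ITransformStep step) => _steps.Add(step);

    public IEnumerable<Dictionary<string, object>> Run(
        IEnumerable<Dictionary<string, object>> rows) =>
        _steps.Aggregate(rows, (current, step) => step.Apply(current));
}

A join/lookup against another table, or the "not in target" filter, would just be further ITransformStep implementations, so the designer only ever has to build and order a list of steps.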

See my answer: Manipulate values in a datatable?

Related

Load data from flat file to SQL Server table and also export to Excel using SSIS

Problem Statement: The requirement is straightforward: we have a flat file (CSV, basically) which we need to load into one of the tables in a SQL Server database. The problem arises when we have to derive a new column (not present in the flat file) and populate it along with the rest of the columns from the file.
The derivation logic for the new column is: find the max date of "TransactionDate".
The entire exercise is to be performed in SSIS. We were hoping to get it done using a DataFlowTask, but we are stuck on how to derive the new column and then add it to the destination flow.
Ideas:
Use a DataFlowTask to read the file and store it in a recordset, so that in the ControlFlow we could use a ScriptTask to read it as a DataTable and use LINQ (or similar) to determine the max value, then push it to another DataFlow to be consumed by the SQL table (but I guess this would require creating a table type in the database, which I would rather avoid)
Perform the entire operation in the DataFlowTask itself, which would need an asynchronous transformation (to get all the data and find the max value)
We are kind of out of ideas here; any lead would be much appreciated, and do let us know if any further information is required.
Run a dataflow task to insert the data into your destination table. Follow that up with an Execute SQL task that calculates MAX(TransactionDate) and applies it to the rows whose MaxTransactionDate is NULL (or whatever your new-record indicator is).
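For example, the Execute SQL task could run something along these lines (a sketch only; the table and column names are assumptions, not from the question):

-- Stamp newly loaded rows (MaxTransactionDate IS NULL) with the max date.
UPDATE dbo.Transactions
SET MaxTransactionDate = (SELECT MAX(TransactionDate) FROM dbo.Transactions)
WHERE MaxTransactionDate IS NULL;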

C# Excel Reading optimization

My app will build an item list and grab the necessary data (e.g. prices, customer item codes) from an Excel file.
This reference Excel file has 650 rows and 7 columns.
The app will read 10-12 item rows in one run.
Would it be wiser to read line item by line item?
Or should I first read all the line items in the Excel file into a list/array and search from there?
Thank you
It's good to start by designing the classes that best represent the data regardless of where it comes from. Pretend that there is no Excel, SQL, etc.
If your data is always going to be relatively small (650 rows) then I would just read the whole thing into whatever data structure you create (your own classes.) Then you can query those for whatever data you want, like
var itemsIWant = allMyData.Where(item => item.Value == "something");
The reason is that it enables you to separate the query (selecting individual items) from the storage (whatever file or source the data comes from.) If you replace Excel with something else you won't have to rewrite other code. If you read it line by line then the code that selects items based on criteria is mingled with your Excel-reading code.
Keeping things separate enables you to more easily test parts of your code in isolation. You can confirm that one component correctly reads what's in Excel and converts it to your data. You can confirm that another component correctly executes a query to return the data you want (and it doesn't care where that data came from.)
With regard to optimization - you're going to be opening the file from disk and no matter what you'll have to read every row. That's where all the overhead is. Whether you read the whole thing at once and then query or check each row one at a time won't be a significant factor.
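A minimal sketch of that separation (the class and interface names here are invented for illustration):

using System.Collections.Generic;
using System.Linq;

// The data as your app sees it, independent of Excel.
public class Item
{
    public string CustomerItemCode { get; set; }
    public decimal Price { get; set; }
}

// Any source (Excel, SQL, a test fixture) can implement this.
public interface IItemSource
{
    List<Item> LoadAll();
}

public class ItemCatalog
{
    private readonly List<Item> _allMyData;

    public ItemCatalog(IItemSource source)
    {
        // ~650 rows: read everything once and keep it in memory.
        _allMyData = source.LoadAll();
    }

    // Queries know nothing about where the data came from.
    public Item Find(string code) =>
        _allMyData.FirstOrDefault(item => item.CustomerItemCode == code);
}

Swapping Excel for something else later then only means writing a new IItemSource; the query side is untouched.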

Format data in stored procedure

After a little bit of advice.
We are producing a standard list of data with filters.
This filtered list can have 3 alternative views which represent the data in a different manner.
my question is:
should the stored procedure return the data in the format that is required for each view for the front end?
or should the business layer re-format the data?
I have always believed it best to get in and out of the database as fast as possible, and allow the business layer to handle formatting.
Thanks in advance for any advice
UPDATE:
By formatting, I mean:
1: view contains all data within a scrollable table
2: view will group data within headings (click on a heading to expand the view)
3: view will group data into date groups and display dates as table row headers (7-day intervals), with the data count displayed within those intervals
I have always believed it best to get in and out of the database as fast as possible, and allow the business layer to handle formatting.
Short answer: Correct, you have answered your own question.
The database returns data to your application, how it is presented is up to the application. What happens when you want to supply the same data to another application? Do you want to have to write procedures every time you want the data in a different format? Of course not.
1: view contains all data within a scrollable table
You cannot return a scrolling window from a procedure. Anyway, how would the database know how much data will need to be scrolled? It does not know the size of window/viewport you are using.
2: view will group data within headings (click on a heading to expand the view)
This requires user interaction which is solely the domain of the presentation layer.
3: view will group data into date groups and display dates as table row headers (7-day intervals), with the data count displayed within those intervals
You can return multiple result sets from a procedure, to give the results and the counts, but again this is more easily handled in the application.
In general you should use the presentation layer to format your data output. You can have the stored procedure return the raw data set and then set up, for example, 3 different XSLT files for the 3 different presentation types. This way you can always serialize the DataTable to XML (DataTable.WriteXml) and apply the XSLT template to that output to produce the format the user needs.
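A minimal sketch of that approach (assuming the procedure's results have been loaded into a DataTable; the view-specific XSLT file is supplied by the caller):

using System.Data;
using System.IO;
using System.Text;
using System.Xml;
using System.Xml.Xsl;

public static class ViewFormatter
{
    // Serializes the raw result set to XML, then applies a view-specific
    // XSLT template (e.g. "GroupedByHeading.xslt") to it.
    public static string FormatForView(DataTable data, string xsltPath)
    {
        data.TableName = "Row"; // element name for each row in the XML

        var rawXml = new StringBuilder();
        using (var writer = XmlWriter.Create(rawXml))
            data.WriteXml(writer);

        var xslt = new XslCompiledTransform();
        xslt.Load(xsltPath);

        var output = new StringBuilder();
        using (var reader = XmlReader.Create(new StringReader(rawXml.ToString())))
        using (var writer = XmlWriter.Create(output, xslt.OutputSettings))
            xslt.Transform(reader, writer);

        return output.ToString();
    }
}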

How do I programmatically verify, create, and update SQL table structure?

Scenario:
I have an application (C#) that expects a SQL database and login, which are set by a user. Once connected, it checks for the existence of several tables and creates them if not found.
I'd like to expand on this by having the program be capable of adding columns to those tables if I release a new version of the program which relies upon the new columns.
Question:
What is the best way to programmatically check the structure of an existing SQL table and create or update it to match an expected structure?
I am planning to iterate through the list of required columns and alter the existing table whenever it does not contain the new column. I can't help but wonder if there's an approach that is different or better.
Criteria:
Here are some of my expectations and self-imposed rules:
Newer versions of the program might no longer use certain columns, but they would be retained for data logging purposes. In other words, no columns will be removed.
Existing data in the table must be preserved, so the table cannot simply be dropped and recreated.
In all cases, newly added columns would allow null data, so the population of old records is taken care of by having default null values.
Example:
Here is a sample table (because visual examples help!):
id  datetime         sensor_name  sensor_status  x1    x2    x3    x4
1   20100513T151907  na019        OK             0.01  0.21  1.41  1.22
2   20100513T152907  na019        OK             0.02  0.23  1.45  1.52
Then, in a new version, I may want to add the column x5. The "x-columns" are all data-storage columns that accept null.
Edit:
I updated the sample table above. It is more of a log and not a parent table. So the sensors will repeatedly show up in this logging table with the values logged. A separate parent table contains the geographic and other logistical information about the sensor, making the table I wish to modify a child table.
This is a very troublesome feature that you're thinking about implementing. I would advise against it and instead consider scripting changes using a 3rd party tool such as Red Gate's SQL Compare: http://www.red-gate.com/products/SQL_Compare/index.htm
If you're in doubt, consider downloading the trial version of the software and performing a structure diff script on two databases with some non-trivial differences. You'll see from the result that the considerations for such operations are far from simple.
The other way around this type of issue is to redesign your database using the EAV model: http://en.wikipedia.org/wiki/Entity-attribute-value_model (It pivots attributes into rows, so new attributes never change the structure. It has its own issues but it's very flexible.)
(To utilize a diff tool you would have to keep a copy of every version of your db and create diff scripts which would go out and get executed with new releases and upgrades. That's a huge mess of its own to maintain. EAV is the way to go for a thing like this. It wrongfully gets a lot of flak for not being as performant as a traditional db structure, but I've used it a number of times with great success. In fact, I have a HIPAA-compliant EAV db (SQL Server 2000) that's been in production for over six years, with several of the EAV tables containing tens of millions of rows, and it's still going strong with no big slowdown. Of course we don't do heavy reporting against that db; for reports we have an export that flattens the data into a relational structure.)
The common solution I see would be to store version information somewhere in your database. Maybe have a really small table:
CREATE TABLE DB_PROPERTIES ([key] varchar(100), [value] varchar(100));
then you could add a row:
key | value
version | 12
Then you could just create a SQL update script (or set of scripts) which updates the db from version 12 to version 13.
DECLARE @v varchar(100);
SELECT @v = [value] FROM DB_PROPERTIES WHERE [key] = 'version';

IF @v = '12'
    EXEC dbo.Upgrade_12_to_13;   -- hypothetical upgrade procedure
ELSE IF @v = '11'
    EXEC dbo.Upgrade_11_to_13;   -- hypothetical upgrade procedure
-- ...and so on
Depending on what upgrade paths you want to support, you could add more cases. You could also move this upgrade logic into C#, or whatever design works for you. But having the db version information stored in the database will make it much easier to figure out what is already there, rather than querying for all the db structures individually.
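A minimal sketch of the same check driven from C# (the connection string, table, and upgrade steps are placeholders):

using System.Data.SqlClient;

public static class SchemaUpgrader
{
    public static void UpgradeIfNeeded(string connectionString)
    {
        using (var conn = new SqlConnection(connectionString))
        {
            conn.Open();

            // Read the current schema version from the properties table.
            var cmd = new SqlCommand(
                "SELECT [value] FROM DB_PROPERTIES WHERE [key] = 'version'", conn);
            var version = (string)cmd.ExecuteScalar();

            if (version == "12")
            {
                // Run the 12 -> 13 upgrade script(s) here,
                // then update the version row to '13'.
            }
        }
    }
}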
If you have to build something in such a way as to rely on the application making table changes, your design is flawed. You should have a related table for the sensor values (x1, x2, etc.), then you can just add another record rather than having to create a new column.
Suggested child table structure
READINGS
ID int
Reading_type varchar (10)
Reading_Value int
Then data in the table would read:
ID  Reading_type  Reading_value
1   x1            2
1   x2            3
1   x3            1
2   x1            7
Try Microsoft.SqlServer.Management.Smo
These are a set of C# classes that provide an API to SQL Server database objects.
The Microsoft.SqlServer.Management.Smo.Table class has a Columns collection that will allow you to query and manipulate the columns.
Have fun.
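For instance, a minimal sketch of checking for and adding a missing nullable column with SMO (server, database, table, and column names are placeholders):

using Microsoft.SqlServer.Management.Smo;

public static class ColumnChecker
{
    public static void EnsureColumn()
    {
        var server = new Server(".");            // local default instance
        Database db = server.Databases["MyAppDb"];
        Table sensorLog = db.Tables["SensorLog"];

        if (!sensorLog.Columns.Contains("x5"))
        {
            // New columns allow NULL, so existing rows are unaffected.
            var x5 = new Column(sensorLog, "x5", DataType.Float)
            {
                Nullable = true
            };
            sensorLog.Columns.Add(x5);
            sensorLog.Alter();                   // issues the ALTER TABLE
        }
    }
}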

C#: Reading data from an xls document

I am currently working on a project for traversing an Excel document and inserting its data into a database using C#.
The relevant data for this project is:
The Excel sheet has 14 rows at the top that I do not care about (sometimes 15; see Russia/Siberia below).
The data is grouped by name into 2 columns (date and value), such as:
Sheet 1
USA China Russia
Date Value Date Value Siberia
1/1/09 4.3654 1/1/09 2.7456 Date Value
1/2/09 3.5545 1/3/09 9.3214 2/5/09 0.2454
1/3/09 3.2322 1/21/09 5.2234 2/6/09 0.5557
The name I need to acquire is whichever is listed directly above "Date".
I only care about data from dates we do not have in the database. Before each column set is parsed, I will acquire the max date for any given name from the database, and skip anything at or before it.
There is no guarantee that the columns will be in a constant order or have constant spacing.
I do not want data for all names, rather only those in a list I put together before the file is acquired.
My current plan is this:
For each column, if the date field is at row 16, save the name as the value in row 15 above it, check the database for the last date for that name, only insert data where the date is greater than the acquired date.
If the date field is at row 17, do the same thing, but start the for loop through each row at 18.
If the name is not in the list, skip the column. If it is, make sure to grab the column next to it for the necessary values.
My problem is:
I am currently trying to use the ExcelDataReader from CodePlex (http://www.codeplex.com/ExcelDataReader). This only likes CSV-like sheets, which this project does not have.
I do not know of any alternative Excel readers.
To the best of my knowledge, a straight FileStream traversal of this file can only go row-by-row, rather than column-by-column.
To anyone still reading, thank you for your time. Any recommendations on how to proceed? Please ensure that solutions can traverse each column, not each row.
Also, please don't worry about the database stuff, or the list of names that precedes the traversal.
Addendum: What I'd really like to end up with is some type of table that I can just traverse with a nested loop, making column-centric traversal much, much easier. Because there is so much garbage near the top of the sheet (14+ rows), most simple solutions are not feasible.
If you want to read from Excel in C#, I've used this library with great success; it'll give you the flexibility to parse columns/rows however you'd like:
http://sourceforge.net/projects/koogra/ (read-only)
Other open source libraries I haven't used but that could be good:
http://nexcel.sourceforge.net/ (read-only)
http://npoi.codeplex.com/ (can read and write)
http://developer.novell.com/wiki/index.php/Poi.Net (this project is dead)
Alternatively, you can use one of the many good Java libraries, and convert it into a C# assembly using IKVM:
http://jxls.sourceforge.net/
http://www.andykhan.com/jexcelapi/
http://poi.apache.org/ (this one's the grand-daddy of java XLS libraries)
I've covered how to do the IKVM Java -> C# conversion here (it's really not as horrible an option as you might think):
http://splinter.com.au/blog/?p=207
Not a straight answer to your question but an alternative idea:
Your data looks like a pivot-ish table. I'd recommend "unpivoting" it into a simple table.
Example:
Russia USA
Q1 123 323
Q2 456 321
Q3 567 843
Becomes:
Quarter Country Value
Q1 Russia 123
Q1 USA 323
Q2 Russia 456
....
If that is the case (not sure if I got this right from your question), then processing the data using an OleDB driver or similar CSV-style tooling should become much less painful.
You can access Excel directly using ADO.NET via the ODBC driver. See http://www.davidhayden.com/blog/dave/archive/2006/05/26/2973.aspx or Google for more info on how to do that. You may wish to try HDR=No in your connection string, since your first row isn't really a proper header row by the looks of it.
I haven't done this for a while, but I remember that it is a bit "temperamental" and takes some playing around with to get the column names right, but it should work. Try SELECT * FROM [Sheet1$] and see what you get.
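A minimal sketch of that idea using ADO.NET's OleDb provider (closely related to the ODBC route; the file path, sheet name, and row offsets are assumptions based on the layout described above):

using System.Data;
using System.Data.OleDb;

public static class SheetLoader
{
    public static DataTable LoadSheet()
    {
        // HDR=No because the real headers are buried ~15 rows down;
        // IMEX=1 treats mixed-type columns as text.
        var connStr = @"Provider=Microsoft.Jet.OLEDB.4.0;" +
                      @"Data Source=C:\data\rates.xls;" +
                      @"Extended Properties=""Excel 8.0;HDR=No;IMEX=1""";

        var table = new DataTable();
        using (var conn = new OleDbConnection(connStr))
        using (var adapter = new OleDbDataAdapter("SELECT * FROM [Sheet1$]", conn))
        {
            adapter.Fill(table); // pulls the whole sheet into memory
        }
        return table;
    }

    public static void TraverseByColumn(DataTable table)
    {
        // Column-centric traversal with a nested loop, as the addendum asks.
        for (int c = 0; c < table.Columns.Count; c++)
        {
            for (int r = 14; r < table.Rows.Count; r++) // skip the junk rows
            {
                object cell = table.Rows[r][c];
                // inspect/parse the cell here
            }
        }
    }
}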
I highly recommend saving this Excel document in a CSV format before doing anything else with it. You can do this in code.
After you have a CSV, you can either parse it with a CSV library, or write your own parser for it.
As I have done before, I prefer to use an OLEDB connection to connect to an Excel document.
By the way, you can take a look at the following article for more information:
http://www.codeproject.com/KB/office/excel_using_oledb.aspx
SpreadsheetGear for .NET can load workbooks and access any cells on any sheet in any order. You can get the formatted text of the cell (such as "1/1/09") or the underlying value ("1/1/09" is stored as the double 39814.0 in Excel or SpreadsheetGear).
You can see some live ASP.NET samples here and download the free trial here if you want to try it yourself.
Disclaimer: I own SpreadsheetGear LLC
