SSIS setting variables in Script Component of Data Flow - C#

I have an input flat file that has 2 types of input records for each output record. The first record (identified by C in first column) has an ID and Demographic information. The second record (identified by L in first column) has some financial information. They are pipe delimited and of different lengths.
There isn't any way to write all the C records to one stream and the L records to another stream and then bring them back together. So my solution is to put in a conditional split: when I hit a C record, store all the info I need in SSIS variables; when I hit an L record, make derived columns out of the variables and use those derived columns plus the columns from the L record to build my output record (also a flat file).
I've looked all over the Internet and can't find C# code to set my variables within the Script Component on the path of the C records. What I want the code to look like is something like:
Variable.User::Firstname = Column 2 (from input file)
Variable.User::Lastname = Column 3 (from input file)
etc.
Can somebody help me out?
Thanks,
Dick

This idea won't work. What do you think you will be able to do with the variables as each row gets processed? Anything you do with the value of the variables would have to be done IN the script that populates them, because by the time you leave the script, the variable is being populated by the value of the next row.
However, treating your question as academic, the way to access variables in a script component has already been asked and answered here: How to access ssis package variables inside script component
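For completeness, here is roughly what the mechanics look like. This is only a sketch: it assumes a package variable User::Firstname has been added to the component's ReadWriteVariables, and lastFirstName is a hypothetical field captured while rows are processed. SSIS only lets a Script Component assign read/write variables in PostExecute, which is exactly why the per-row idea above falls apart.

private string lastFirstName;  // hypothetical field holding the value seen on the last C row

public override void Input0_ProcessInputRow(Input0Buffer Row)
{
    lastFirstName = Row.Column2;  // Column2 is an assumed input column name
}

public override void PostExecute()
{
    base.PostExecute();
    // Read/write variables can only be assigned here, after all rows have been processed,
    // so only the value from the final row ever reaches the variable.
    Variables.Firstname = lastFirstName;
}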
Here is how I would approach this:
Configure your source component so that each row is a single column
Next do a conditional split that sends the C-rows down one path, and the L-rows down another
In each path use either a Derived Column transformation or a Script transformation that splits the string on the actual delimiter and creates the real columns for the type of record in that path (a sketch of the script approach follows below).
Continue on with the rest of your processing until they reach their separate destinations.
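For the script option, a minimal sketch of a synchronous Script transformation on the C path might look like the following. The input column name (Line) and the output column names (FirstName, LastName) are assumptions; you would define output columns matching your own layout.

public override void Input0_ProcessInputRow(Input0Buffer Row)
{
    // Split the raw line on the pipe delimiter and map the pieces to the
    // typed output columns defined for this path.
    string[] cols = Row.Line.Split('|');
    Row.FirstName = cols[1];  // cols[0] is the "C" record marker
    Row.LastName = cols[2];
    // ...and so on for the remaining columns
}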

I like using script components:
First step is to add a Data Flow.
In the Data Flow, add a Script Component and choose Source.
In Inputs/Outputs:
Add the column info that you want as output.
(Note you will have two separate outputs with many columns.) Choose your data types.
Now enter the script editor.
Here is the code, using two separate if statements:
public override void CreateNewOutputRows()
{
    string strPath = "";  // path to your input flat file
    var lines = System.IO.File.ReadAllLines(strPath);

    foreach (string line in lines)
    {
        if (line.StartsWith("C"))
        {
            char[] delim = "|".ToCharArray();
            var C_cols = line.Split(delim);
            Output0Buffer.AddRow();
            Output0Buffer.FirstName = C_cols[1];  // C_cols[0] is the "C" record marker
            Output0Buffer.LastName = C_cols[2];
            // Note that everything read from the file is a string until cast to the
            // output column's data type, e.g. int.Parse(C_cols[3]) for a numeric column.
            // Keep on going
        }
        else if (line.StartsWith("L"))
        {
            char[] delim = "|".ToCharArray();
            var L_cols = line.Split(delim);
            Output2Buffer.AddRow();  // second output; the buffer name depends on what you called the output
            // Assign whatever columns you defined for the L output, e.g.:
            Output2Buffer.FirstName = L_cols[1];
            Output2Buffer.LastName = L_cols[2];
            // Keep on going
        }
    }
}
At this point the script component will have two outputs that can lead down different paths.

Thanks for your responses.
Since I need both a C record and the L record that follows it, I decided to load the entire input file into a SQL Server table. Once it was in SQL, I wrote a fairly straightforward stored procedure that cursors through the records in the table: when it hits a C record it puts the needed columns into SQL variables, and when it hits an L record it inserts the data from those variables together with the input data into the output record (the same thing I was trying to do entirely within SSIS but was unable to). After my table of output records was populated, it was a simple matter to write a Data Flow that SELECTs all the records from the output table as the Data Flow Source and writes them to the desired flat file as the Destination.
Dick Rosenberg

Related

SSIS Excel destination inserting null values on other columns

I have two script components which extract data from result set objects, say User::AllXData and User::AllYData.
It is run through a foreach loop and the data is stored in a data table.
Next, I'm adding the data into an Excel sheet using an Excel destination. When I do that, all the data corresponding to column A (i.e., the data from User::AllXData) is added to the Excel sheet, but column B gets filled with null values until the end of column A's data.
Then column B gets filled, leaving column A with null values. The two columns are supposed to be aligned.
Is there a workaround for this?
Edit:
After a lot of grinding and running many tests, I finally came across a solution.
The answer to this is pretty simple. Instead of using two objects as result sets, it's better to use only one.
If you're going to query from a single source, include all the required columns in one SQL query, store the result in a single object result set, and use that as a read-only variable in the script component.
Create a single data table that includes all the required columns and add the rows to your Excel destination one by one, without any null values.
Here's an article that has a good example.
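As a rough illustration of the single-result-set approach (not the exact code from the original post): assume one object variable User::AllData holds the query result and has been added to the Script Component's ReadOnlyVariables, and that output columns XValue and YValue have been defined. The variable and column names are assumptions.

using System.Data;
using System.Data.OleDb;

public override void CreateNewOutputRows()
{
    // Load the ADO recordset stored in the object variable into one DataTable.
    var table = new DataTable();
    var adapter = new OleDbDataAdapter();
    adapter.Fill(table, Variables.AllData);

    // Emit one output row per DataTable row, so both columns stay aligned.
    foreach (DataRow row in table.Rows)
    {
        Output0Buffer.AddRow();
        Output0Buffer.XValue = row[0].ToString();
        Output0Buffer.YValue = row[1].ToString();
    }
}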

Populate OutputBuffer from a List in PostExecute using C# transformer

The saga of trying to chop flat files up into useable bits continues!
You may see from my other questions that I am trying to wrangle some flat file data into various bits using C# transformer in SSIS. The current challenge is trying to turn a selection of rows with one column into one row with many columns.
A friend has very kindly tipped me off to use List and then to somehow loop through that in the PostExecute().
The main problem is that I do not know how to loop through the list and programmatically create a row to add to the output buffer: there might be a variable number of fields listed in the flat file, and there is no consistency. For now, I have allowed for 100 outputs and called these pos1, pos2, etc.
What I would really like to do is count everything in my list, and loop through that many times, incrementing the numbers accordingly - i.e. fieldlist[0] goes to OutputBuffer.pos1, fieldlist[1] goes to OutputBuffer.pos2, and if there is nothing after this then nothing is put in pos3 to pos100.
The secondary problem is that I can't even test that my list and the writing to an output table are working by using OutputBuffer directly in PostExecute, never mind working out a loop.
The file has all sorts in it, but the list of fields is handily contained between START-OF-FIELDS and END-OF-FIELDS, so I have used the same logic as before to only process the rows in the middle of those.
bool passedSOF;
bool passedEOF;
List<string> fieldlist = new List<string>();

public override void PostExecute()
{
    base.PostExecute();
    OutputBuffer.AddRow();
    OutputBuffer.field1 = fieldlist[0];
    OutputBuffer.field2 = fieldlist[1];
}

public override void Input_ProcessInputRow(InputBuffer Row)
{
    if (Row.RawData.Contains("END-OF-FIELDS"))
    {
        passedEOF = true;
        OutputBuffer.SetEndOfRowset();
    }
    if (passedSOF && !passedEOF)
    {
        fieldlist.Add(Row.RawData);
    }
    if (Row.RawData.Contains("START-OF-FIELDS"))
    {
        passedSOF = true;
    }
}
I have nothing underlined in red, but when I try to run this I get an error message about PostExecute() and "object reference not set to an instance of an object", which I thought meant something contained a null where it shouldn't, but in my test file I have more than two fields between START and END markers.
So first of all, what am I doing wrong in the example above, and secondly, how do I do this in a proper loop? There are only 100 possible outputs right now, but this could increase over time.
"Post execute" It's named that for a reason.
The execution of your data flow has ended and this method is for cleanup or anything that needs to happen after execution - like modification of SSIS variables. The buffers have gone away, there's no way to do interact with the contents of the buffers at this point.
As for the rest of your problem statement... it needs focus
So once again I have misunderstood a basic concept - PostExecute cannot be used to write out in the way I was trying. As people have pointed out, there is no way to do anything with the buffer contents here.
I cannot take credit for this answer, as again someone smarter than me came to the rescue, but I have got permission from them to post the code in case it is useful to anyone. I hope I have explained this OK, as I only just understand it myself and am very much learning as I go along.
First of all, make sure you have the following using directives at the top of your script:
using System.Reflection;
using System.Linq;
using System.Collections.Generic;
These are going to be used to get properties for the Output Buffer and to allow me to output the first item in the list to pos_1, the second to pos_2, etc.
As usual I have two boolean variables to determine if I have passed the row which indicates the rows of data I want have started or ended, and I have my List.
bool passedSOF;
bool passedEOF;
List<string> fieldList = new List<string>();
Here is where it is different - as I have something which indicates I am done processing my rows, which is the row containing END-OF-FIELDS, when I hit that point, I should be writing out my collected List to my output buffer. The aim is to take all of the multiple rows containing field names, and turn that into a single row with multiple columns, with the field names populated across those columns in the row order they appeared.
if (Row.RawData.Contains("END-OF-FIELDS"))
{
    passedEOF = true;
    //IF WE HAVE GOT TO THIS POINT, WE HAVE ALL THE DATA IN OUR LIST NOW
    OutputBuffer.AddRow();
    var fields = typeof(OutputBuffer).GetProperties();
    //SET UP AND INITIALISE A VARIABLE TO HOLD THE ROW NUMBER COUNT
    int rowNumber = 0;
    foreach (var fieldName in fieldList)
    {
        //ADD ONE TO THE CURRENT VALUE OF rowNumber
        rowNumber++;
        //MATCH THE ROW NUMBER TO THE OUTPUT FIELD NAME
        PropertyInfo field = fields.FirstOrDefault(x => x.Name == string.Format("pos{0}", rowNumber));
        if (field != null)
        {
            field.SetValue(OutputBuffer, fieldName);
        }
    }
    OutputBuffer.SetEndOfRowset();
}
if (passedSOF && !passedEOF)
{
    this.fieldList.Add(Row.RawData);
}
if (Row.RawData.Contains("START-OF-FIELDS"))
{
    passedSOF = true;
}
So instead of having something like this:
START-OF-FIELDS
FRUIT
DAIRY
STARCHES
END-OF-FIELDS
I have the output:
pos_1 | pos_2 | pos_3
FRUIT | DAIRY | STARCHES
So I can build a position key table to show which field will appear in which order in the current monthly file, and now I am looking forward into getting myself into more trouble splitting the actual data rows out into another table :)

C# Excel Reading optimization

My app will build an item list and grab the necessary data (e.g. prices, customer item codes) from an Excel file.
This reference Excel file has 650 rows and 7 columns.
The app will look up 10-12 items in a single run.
Would it be wiser to read the file line item by line item?
Or should I first read all line items in the Excel file into a list/array and search from there?
Thank you
It's good to start by designing the classes that best represent the data regardless of where it comes from. Pretend that there is no Excel, SQL, etc.
If your data is always going to be relatively small (650 rows) then I would just read the whole thing into whatever data structure you create (your own classes.) Then you can query those for whatever data you want, like
var itemsIWant = allMyData.Where(item => item.Value == "something");
The reason is that it enables you to separate the query (selecting individual items) from the storage (whatever file or source the data comes from.) If you replace Excel with something else you won't have to rewrite other code. If you read it line by line then the code that selects items based on criteria is mingled with your Excel-reading code.
Keeping things separate enables you to more easily test parts of your code in isolation. You can confirm that one component correctly reads what's in Excel and converts it to your data. You can confirm that another component correctly executes a query to return the data you want (and it doesn't care where that data came from.)
With regard to optimization - you're going to be opening the file from disk and no matter what you'll have to read every row. That's where all the overhead is. Whether you read the whole thing at once and then query or check each row one at a time won't be a significant factor.
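As a rough sketch of that separation (the class, member names, and the loader are illustrative, not taken from the original post; assumes using System.Linq and System.Collections.Generic):

// A plain class that represents one row of the reference data, independent of Excel.
public class CatalogItem
{
    public string ItemCode { get; set; }
    public decimal Price { get; set; }
}

// One component loads all ~650 rows into memory once (LoadItemsFromExcel is a
// hypothetical loader you would implement with whatever Excel library you use)...
List<CatalogItem> allMyData = LoadItemsFromExcel(@"C:\data\reference.xlsx");

// ...and the rest of the app queries the in-memory list without knowing about Excel.
var itemsIWant = allMyData.Where(item => item.ItemCode == "ABC123").ToList();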

SSIS add only rows that changed

I have a project that consists of importing all the users (including all their properties) from an Active Directory domain into a SQL Server table. This table will be used by a Reporting Services application.
The table model has the following columns:
-ID: (a unique identifier that is generated automatically).
-distinguishedName: contains the LDAP distinguished Name attribute of the user.
-attribute_name: contains the name of the user property.
-attribute_value: contains the property values.
-timestamp: contains a datetime value that is generated automatically.
I have created an SSIS package with a Script Task which contains C# code that exports all the data to a .CSV file, which is then imported into the table by a Data Flow task. The project works without any problem, but it generates more than 2 million rows (the AD domain has around 30,000 users and each user has between 100 and 200 properties).
The SSIS package should run every day and import data only when a new user property exists or a property value has changed.
In order to do this, I created a data flow which copies the entire table into a recordset.
This recordset is converted to a DataTable and used in a Script Component step which verifies whether the current row exists in the DataTable. If the row exists, it compares the property values and returns the row to the output only when the values are different or when the row is not found in the DataTable. This is the code:
public override void Input0_ProcessInputRow(Input0Buffer Row)
{
    bool processRow = compareValues(Row);
    if (processRow)
    {
        //Direct to output 0
        Row.OutdistinguishedName = Row.distinguishedName.ToString();
        Row.Outattributename = Row.AttributeName.ToString();
        Row.Outattributevalue.AddBlobData(System.Text.Encoding.UTF8.GetBytes(Row.AttributeValue.ToString()));
    }
}

public bool compareValues(Input0Buffer Row)
{
    //Variable declaration
    DataTable dtHostsTbl = (DataTable)Variables.dataTableTbl;
    string expression = "", distinguishedName = Row.distinguishedName.ToString(), attribute_name = Row.AttributeName.ToString(), attribute_value = Row.AttributeValue.ToString();
    DataRow[] foundRowsHost = null;

    //Query datatable
    expression = "distinguishedName LIKE '" + distinguishedName + "' AND attribute_name LIKE '" + attribute_name + "'";
    foundRowsHost = dtHostsTbl.Select(expression);

    //Process found row
    if (foundRowsHost.Length > 0)
    {
        //Get the host id
        if (!foundRowsHost[0][2].ToString().Equals(attribute_value))
        {
            return true;
        }
        else
        {
            return false;
        }
    }
    else
    {
        return true;
    }
}
The code is working, but it's extremely slow. Is there any better way of doing this?
Here are some ideas:
Option A.
(actually a combination of options)
Eliminate unnecessary data when querying Active Directory by using the whenChanged attribute.
This alone should reduce the number of records significantly.
If filtering by whenChanged is not possible, or in addition to this, consider the following steps.
Instead of importing all existing records into Recordset Destination - import them into a Cache Transform.
Then use this Cache Transform in Cache connection manager of 2 Lookup components.
One Lookup component verifies whether the {distinguishedName, attribute_name} combination exists (if not, this will be an insert).
Another Lookup component verifies whether the {distinguishedName, attribute_name, attribute_value} combination exists (if not, this will be an update, or a delete/insert).
This pair of Lookups should replace your Script Component that skips rows which are already in the table.
Evaluate whether it is possible to reduce your column sizes for attribute_name and attribute_value. nvarchar(max) in particular often spoils the party.
If you cannot reduce the size of attribute_name and attribute_value, consider storing their hashes and verifying whether the hashes changed instead of comparing the values themselves (see the sketch after this list).
Remove the CSV step: transfer data in one data flow from the source that currently populates the CSV to the Lookups, and send whatever is not found in the Lookups to your OLE DB Destination component.
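A sketch of the hashing idea, assuming the comparison is done in C# (for example inside the Script Component) and that a hash of attribute_value is stored alongside each row; the helper name is made up for illustration:

using System;
using System.Security.Cryptography;
using System.Text;

static string HashValue(string attributeValue)
{
    // Hash the (potentially very long) attribute value down to a short, fixed-length string.
    using (var sha = SHA256.Create())
    {
        byte[] hash = sha.ComputeHash(Encoding.UTF8.GetBytes(attributeValue ?? string.Empty));
        return Convert.ToBase64String(hash);
    }
}

// Comparing two hashes is then a cheap fixed-length string comparison:
// bool changed = HashValue(Row.AttributeValue.ToString()) != storedHashFromTable;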
Option B.
Check if the source, which reads from Active Directory, is fast itself. (Just run the data flow with that source alone, without any destination to measure its performance).
If you are satisfied with its performance, and if you do not have objections against deleting everything from the ad_User table, just delete and repopulate those 2 million rows every day.
Reading everything from AD and writing into the SQL Server, in the same data flow, without any change detection, might actually be the simplest and fastest option.

Saving a text file to a SQL database without column names

I am reading a text file in C# and trying to save it to a SQL database. I am fine except that I don't want the first line, which contains the column names, included in the import. What's the easiest way to exclude it?
The code is like this
while (textIn.Peek() != -1)
{
    string row = textIn.ReadLine();
    string[] columns = row.Split(' ');
    Product product = new Product();
    product.Column1 = columns[0];
    // etc.
    product.Save();
}
thanks
If you are writing the code yourself to read in the file and then importing...why don't you just skip over the first line?
Here's my suggestion:
string[] file_rows;
using (var reader = File.OpenText(filepath))
{
    file_rows = reader.ReadToEnd().Split(new[] { "\r\n" }, StringSplitOptions.None);
}

for (var i = 1; i < file_rows.Length; i++)   // start at 1 to skip the header row
{
    var row = file_rows[i];
    var cells = row.Split('\t');
    ....
}
How are you importing the data? If you are looping in C# and inserting rows one at a time, construct your loop to skip the first insert!
Or just delete the first row after it has been inserted.
Give more info, get more details...
Pass a flag into the program (in case in future the first line is also data) that causes the program to skip the first line of text.
If the first line contains the same column names as used in the database, you could also parse it to grab the column names instead of hard-coding them (assuming that's what you're doing currently :)).
As a final note, if you're using a MySQL database and you have command line access, you may want to look at the LOAD DATA LOCAL INFILE syntax which lets you import pretty arbitrarily defined CSV data.
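Going back to the idea of parsing the column names from the header line, here is a small sketch (assumes using System.IO); the file path and the space delimiter are assumptions based on the code in the question:

using (var textIn = new StreamReader(@"C:\data\products.txt"))
{
    // First line: column names.
    string header = textIn.ReadLine();
    string[] columnNames = header.Split(' ');

    // Remaining lines: data rows.
    string row;
    while ((row = textIn.ReadLine()) != null)
    {
        string[] columns = row.Split(' ');
        // map columns[i] to the database column named columnNames[i]
    }
}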
For future reference, have a look at this awesome package: FileHelpers Library.
I can't add links just yet, but Google should help; it's on SourceForge.
It makes our lives here a little easier when people insist on using files for integration.
