I have a project that consists of importing all the users (including all their properties) from an Active Directory domain into a SQL Server table. This table will be used by a Reporting Services application.
The table model has the following columns:
-ID: a unique identifier that is generated automatically.
-distinguishedName: contains the LDAP distinguishedName attribute of the user.
-attribute_name: contains the name of the user property.
-attribute_value: contains the property value.
-timestamp: contains a datetime value that is generated automatically.
I have created an SSIS package with a Script Task which contains C# code that exports all the data to a .CSV file, which is then imported into the table by a Data Flow Task. The project works without any problem, but it generates more than 2 million rows (the AD domain has around 30,000 users and each user has between 100 and 200 properties).
The SSIS package should run every day and import data only when a new user property appears or a property value has changed.
In order to do this, I created a data flow which copies the entire table into a recordset.
This recordset is converted to a DataTable and used in a Script Component step which verifies whether the current row exists in the DataTable. If the row exists, it compares the property values; it sends the row to the output only when the values are different or when the row is not found in the DataTable. This is the code:
public override void Input0_ProcessInputRow(Input0Buffer Row)
{
    bool processRow = compareValues(Row);
    if (processRow)
    {
        // Direct to output 0
        Row.OutdistinguishedName = Row.distinguishedName.ToString();
        Row.Outattributename = Row.AttributeName.ToString();
        Row.Outattributevalue.AddBlobData(System.Text.Encoding.UTF8.GetBytes(Row.AttributeValue.ToString()));
    }
}

public bool compareValues(Input0Buffer Row)
{
    // Variable declaration
    DataTable dtHostsTbl = (DataTable)Variables.dataTableTbl;
    string expression = "",
           distinguishedName = Row.distinguishedName.ToString(),
           attribute_name = Row.AttributeName.ToString(),
           attribute_value = Row.AttributeValue.ToString();
    DataRow[] foundRowsHost = null;

    // Query the DataTable
    expression = "distinguishedName LIKE '" + distinguishedName + "' AND attribute_name LIKE '" + attribute_name + "'";
    foundRowsHost = dtHostsTbl.Select(expression);

    // Process the found row
    if (foundRowsHost.Length > 0)
    {
        // Compare the stored value with the current attribute value
        if (!foundRowsHost[0][2].ToString().Equals(attribute_value))
        {
            return true;
        }
        else
        {
            return false;
        }
    }
    else
    {
        return true;
    }
}
The code is working, but it's extremely slow. Is there any better way of doing this?
Here are some ideas:
Option A (actually a combination of options).
Eliminate unnecessary data when querying Active Directory by filtering on the whenChanged attribute.
This alone should reduce the number of records significantly.
If filtering by whenChanged is not possible, or in addition to it, consider the following steps.
Instead of importing all existing records into a Recordset Destination, import them into a Cache Transform.
Then use this cache in the Cache Connection Manager of two Lookup components.
One Lookup component verifies whether the {distinguishedName, attribute_name} combination exists (if it does not, the row will be an insert).
The other Lookup component verifies whether the {distinguishedName, attribute_name, attribute_value} combination exists (if the pair exists but this combination does not, the row will be an update, or a delete/insert).
This pair of lookups should replace your "skip rows which are already in the table" Script Component.
Evaluate whether you can reduce the sizes of your attribute_name and attribute_value columns. nvarchar(max) in particular often spoils the party.
If you cannot reduce the size of attribute_name and attribute_value, consider storing their hashes and checking whether the hashes changed instead of comparing the values themselves (see the sketch after this list).
Remove the CSV step: feed the data from the source that currently populates the CSV straight into the lookups in one data flow, and send whatever is not found by the lookups to your OLE DB Destination component.
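A minimal sketch of the hashing idea, assuming the attribute value arrives as a plain string (the class and method names below are made up for illustration): compute a fixed-length hash once per row, store it next to (or instead of) the full nvarchar(max) value, and compare hashes to detect changes.
using System;
using System.Security.Cryptography;
using System.Text;

public static class AttributeHasher
{
    // Returns a fixed-length fingerprint of an attribute value.
    // Comparing these 64-character strings is much cheaper than comparing
    // nvarchar(max) values, at the cost of a negligible collision risk.
    public static string HashValue(string attributeValue)
    {
        using (var sha = SHA256.Create())
        {
            byte[] hash = sha.ComputeHash(Encoding.UTF8.GetBytes(attributeValue ?? string.Empty));
            return BitConverter.ToString(hash).Replace("-", string.Empty);
        }
    }
}

// Usage inside the row comparison:
// bool changed = AttributeHasher.HashValue(newValue) != storedHashFromTable;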
Option B.
Check whether the source which reads from Active Directory is fast by itself. (Just run the data flow with that source alone, without any destination, to measure its performance.)
If you are satisfied with its performance, and if you have no objections to deleting everything from the ad_User table, just delete and repopulate those 2 million rows every day.
Reading everything from AD and writing it into SQL Server in the same data flow, without any change detection, might actually be the simplest and fastest option.
Related
I have a folder filled with about 200 CSV files, each containing about 6,000 rows of mutual fund data. I have to copy this comma-separated data into the database via Entity Framework.
The two major objects are Mutual_Fund_Scheme_Details and Mutual_Fund_NAV_Details.
Mutual_Fund_Scheme_Details - this contains columns like Scheme_Name, Scheme_Code, Id, Last_Updated_On.
Mutual_Fund_NAV_Details - this contains Scheme_Id (foreign key), NAV, NAV_Date.
Each line in the CSV contains all of the above columns, so before inserting I have to:
Split each line.
First extract the scheme-related data and check whether the scheme exists, and get its id. If it does not exist, insert the scheme details and get the id.
Using the id obtained in step 2, check whether a NAV entry exists for the same date. If not, insert it; otherwise skip it.
If an entry is inserted in step 3, the Last_Updated_On date of the scheme might need to be updated with the NAV date (depending on whether it is newer than the existing value).
All the exists checks are done using the Any LINQ extension method, and all the new entries are added to the DbContext, but SaveChanges is called only at the end of processing each file. I used to call it after each insert, but that took even longer than it does now.
Now, since this involves at least two exists checks, at most two inserts and one update per line, processing each file takes too long: close to 5-7 minutes per file. I am looking for suggestions to improve this. Any help would be useful.
Specifically, I am looking to:
Reduce the time it takes to process each file
Decrease the number of individual exists checks (if I can possibly club them in some way)
Decrease individual inserts/updates (if I can possibly club them in some way)
It's going to be hard to optimize this with EF. Here is a suggestion:
Once you have parsed the whole file (~6,000 rows), do the exists check in a single query with .Where(x => listOfIdsFromFile.Contains(x.Id)). This should work fine for 6,000 ids and it will allow you to separate inserts from updates; a sketch follows.
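A rough sketch of that idea, assuming hypothetical names (MutualFundContext, SchemeCode, parsedRows and so on) that do not come from the question; the point is one Contains query per file instead of one Any call per line.
// One file has already been parsed into parsedRows (a simple in-memory DTO list).
var schemeCodes = parsedRows.Select(r => r.SchemeCode).Distinct().ToList();

using (var db = new MutualFundContext())
{
    // Single round trip: fetch every scheme this file refers to.
    var existingSchemes = db.Mutual_Fund_Scheme_Details
        .Where(s => schemeCodes.Contains(s.Scheme_Code))
        .ToDictionary(s => s.Scheme_Code);

    foreach (var row in parsedRows)
    {
        if (existingSchemes.TryGetValue(row.SchemeCode, out var scheme))
        {
            // Scheme already exists: this line is a potential NAV insert / Last_Updated_On update.
        }
        else
        {
            // Scheme is new: queue a scheme insert (and its NAV row) here.
        }
    }

    // The same pattern works for the NAV exists check: fetch the already-stored
    // (Scheme_Id, NAV_Date) pairs for these schemes in one query, put them in a
    // HashSet, and test membership in memory instead of calling Any() per line.

    db.SaveChanges(); // still one SaveChanges per file
}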
I have been given an Excel file from a customer. It has 4 columns: Id, Name, Place, Date.
I have a table in my database which stores these values. I have to check each row from the Excel file and compare its values with the database table. If a row already exists, compare the dates and update the row with the latest date from the Excel file. If the row does not exist yet, insert a new row.
I'm fetching each row, comparing its values in a for loop, and updating the database with insert/update statements through a data table adapter.
My problem is that this operation takes 4+ hours to update the data. Is there a more efficient way to do this? I have searched a lot and found options like SqlBulkCopy, but then how would I compare each and every row against the database?
I'm using ASP.NET with C# and SQL Server.
Here's my code:
for (var row = 2; row <= workSheet.Dimension.End.Row; row++)
{
    // Get data from Excel
    var Id = workSheet.Cells[row, 1].Text;
    var Name = workSheet.Cells[row, 2].Text;
    var Place = workSheet.Cells[row, 3].Text;
    var dateInExcel = workSheet.Cells[row, 4].Text;

    // Check in the database if the Id exists; if it does, compare the date and update the database
    if (ID.Rows.Count <= 0) // no row exists in the database
    {
        // Insert the row in the database using the data table adapter's insert statement
    }
    else if (Id.Rows.Count > 0) // Id exists in the database
    {
        if (Db.DateInDB < (dateUpdate)) // compare dates
        {
            // Update the database with the new date using the data table adapter's update statement
        }
    }
}
@mjwills and @Dan Guzman make very valid points in the comments section.
My suggestion would be to create an SSIS package to import the spreadsheet into a temp table, then use a MERGE query (or queries) to make conditional updates to the required table(s).
https://learn.microsoft.com/en-us/sql/integration-services/import-export-data/start-the-sql-server-import-and-export-wizard?view=sql-server-ver15
The simplest way to get a good starting point is to use the Import and Export Wizard in SSMS and save the resulting package. Then create an SSIS project in Visual Studio (you will need the correct version of the BI/SSDT tooling installed for the target SQL Server version).
https://learn.microsoft.com/en-us/sql/ssdt/download-sql-server-data-tools-ssdt?view=sql-server-ver15
https://learn.microsoft.com/en-us/sql/t-sql/statements/merge-transact-sql?view=sql-server-ver15
This approach would leverage SQL doing what it does best, dealing with relational data sets, and move the work out of the ASP code.
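For illustration, the conditional update could look roughly like the MERGE below, wrapped in a small C# helper so the example is self-contained. Every table and column name (ExcelStaging, CustomerRecords, Id, Name, Place, RecordDate) is an assumption rather than something from the question, and in practice the statement could just as well live in an Execute SQL Task inside the package.
using System.Data.SqlClient;

public static class ExcelMergeRunner
{
    // Runs one set-based MERGE after the SSIS package has loaded the spreadsheet
    // into the staging table. All object names are placeholders.
    public static void MergeStagedRows(string connectionString)
    {
        const string mergeSql = @"
MERGE dbo.CustomerRecords AS target
USING dbo.ExcelStaging AS source
    ON target.Id = source.Id
WHEN MATCHED AND source.RecordDate > target.RecordDate THEN
    UPDATE SET target.RecordDate = source.RecordDate
WHEN NOT MATCHED BY TARGET THEN
    INSERT (Id, Name, Place, RecordDate)
    VALUES (source.Id, source.Name, source.Place, source.RecordDate);";

        using (var connection = new SqlConnection(connectionString))
        using (var command = new SqlCommand(mergeSql, connection))
        {
            connection.Open();
            command.CommandTimeout = 300; // large spreadsheets may take a while
            command.ExecuteNonQuery();
        }
    }
}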
To invoke this, the ASP application would need to handle the initial file upload (or however the file arrives) and then invoke the SSIS package.
This can be done by setting up the SSIS package as a job on the SQL Server, with no schedule, and then starting the job when you want it to run.
How to execute an SSIS package from .NET?
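Starting that Agent job from the .NET side could be as simple as the sketch below; the job name is a placeholder, and note that sp_start_job returns as soon as the job is queued, so the package runs asynchronously.
using System.Data;
using System.Data.SqlClient;

public static class ImportJobStarter
{
    public static void StartImportJob(string connectionString)
    {
        using (var connection = new SqlConnection(connectionString))
        using (var command = new SqlCommand("msdb.dbo.sp_start_job", connection))
        {
            command.CommandType = CommandType.StoredProcedure;
            command.Parameters.AddWithValue("@job_name", "Import Customer Spreadsheet"); // placeholder job name
            connection.Open();
            command.ExecuteNonQuery();
        }
    }
}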
There are most likely some optimisations that can be made to this approach, but it should work in principle.
Hope this helps :)
10,000 records taking more than 3x3600 seconds suggests over 1 second per record; I think it should be possible to improve on that.
Doing the work in the database would give the best performance, but there are a few things you can do before going that far.
Check the basics:
Indexes
Network speed. Is your timing based on running the code on your computer and talking to a cloud database? If the code and the db are in the same cloud (Azure/Amazon/etc.) it may be much faster than what you're measuring with code running on your office computer talking to a database far away.
Use batches. You should be able to get an order of magnitude better performance if you do the work in batches rather than one record at a time.
Get 10, 100, 500 or 1000 records from the Excel file and fetch the corresponding records from the database. Do the presence check and the date comparison in memory. After that, do a single save to the database. A sketch of this follows.
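A rough sketch of that batching idea using plain ADO.NET; the table name dbo.CustomerRecords, its columns and the tuple shape are assumptions, and the connection is assumed to be open. The point is one SELECT plus one write per batch instead of one round trip per row.
using System;
using System.Collections.Generic;
using System.Data.SqlClient;
using System.Linq;

public static class BatchComparer
{
    // rows: one batch (e.g. 500) of (Id, Name, Place, Date) values read from the worksheet.
    public static void ProcessBatch(SqlConnection connection,
        List<(string Id, string Name, string Place, DateTime Date)> rows)
    {
        // 1. Fetch the existing dates for this batch in a single query.
        var parameters = rows.Select((r, i) => new SqlParameter("@p" + i, r.Id)).ToArray();
        var inList = string.Join(",", parameters.Select(p => p.ParameterName));
        var existing = new Dictionary<string, DateTime>();

        using (var cmd = new SqlCommand(
            "SELECT Id, RecordDate FROM dbo.CustomerRecords WHERE Id IN (" + inList + ")", connection))
        {
            cmd.Parameters.AddRange(parameters);
            using (var reader = cmd.ExecuteReader())
            {
                while (reader.Read())
                {
                    existing[reader.GetString(0)] = reader.GetDateTime(1);
                }
            }
        }

        // 2. Decide in memory what each Excel row needs.
        foreach (var row in rows)
        {
            if (!existing.TryGetValue(row.Id, out var dbDate))
            {
                // queue an INSERT for this row
            }
            else if (dbDate < row.Date)
            {
                // queue an UPDATE of the date for this row
            }
        }

        // 3. Send all queued inserts/updates in one go, e.g. a single DataTable
        //    pushed through the existing adapter, or a table-valued parameter.
    }
}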
I have an input flat file that has two types of input records for each output record. The first record (identified by a C in the first column) has an ID and demographic information. The second record (identified by an L in the first column) has some financial information. They are pipe-delimited and of different lengths.
There isn't any way to write all the C records to one stream and the L records to another stream and then bring them back together. So my solution is to put in a conditional split: when I hit a C record, store all the info I need in SSIS variables; when I hit an L record, make derived columns out of the variables and use those derived columns plus the columns from the L record to build my output record (also a flat file).
I've looked all over the Internet and can't find C# code to set my variables within the Script Component in the path for the C records. What I want the code to look like is something like:
Variable.User::Firstname = Column 2 (from input file)
Variable.User::Lastname = Column 3 (from input file)
etc.
Can somebody help me out?
Thanks,
Dick
This idea won't work. What do you think you will be able to do with the variables as each row gets processed? Anything you do with the value of the variables would have to be done IN the script that populates them, because by the time you leave the script, the variable is being populated by the value of the next row.
However, treating your question as academic, the way to access variables in a script component has already been asked and answered here: How to access ssis package variables inside script component
Here is how I would approach this:
Configure your source component so that each row is read as a single column.
Next, add a Conditional Split that sends the C rows down one path and the L rows down another.
In each path, use either a Derived Column transformation or a Script transformation that splits the string on the actual delimiter and creates the real columns for the record type in that path.
Continue with the rest of your processing until the rows reach their separate destinations.
I like using script components:
First step: add a data flow.
In the data flow, add a Script Component and choose Source.
In Inputs/Outputs:
Add the column info that you want as output.
(Note you will have 2 separate outputs with many columns.) Choose your data types.
Now enter the script editor.
Here is the code, using 2 separate if statements:
// This goes in the CreateNewOutputRows method of the source Script Component.
public override void CreateNewOutputRows()
{
    string strPath = "";
    var lines = System.IO.File.ReadAllLines(strPath);

    foreach (string line in lines)
    {
        if (line.Substring(0, 1) == "C") // record type is the first character of the line
        {
            char[] delim = "|".ToCharArray();
            var C_cols = line.Split(delim);

            Output0Buffer.AddRow();
            Output0Buffer.FirstName = C_cols[0];
            Output0Buffer.LastName = int.Parse(C_cols[1]); // Note that everything is a string until cast to the correct data type
            // Keep on going
        }

        if (line.Substring(0, 1) == "L")
        {
            char[] delim = "|".ToCharArray();
            var L_cols = line.Split(delim);

            Output2Buffer.AddRow();
            Output2Buffer.FirstName = L_cols[0];
            Output2Buffer.LastName = int.Parse(L_cols[1]);
            // Keep on going
        }
    }
}
At this point the script component will have two outputs that can lead down different paths.
Thanks for your responses.
Since I need both the C record and the L record that follows it, I decided to load the entire input file into a SQL Server table. Once it was in SQL, I wrote a fairly straightforward stored procedure to cursor through the records in the table, put the needed columns into SQL variables when the row was a C record, and insert the data from the SQL variables together with the input data into the output record when it was an L record (the same thing I was trying to do entirely within SSIS but was unable to). After my table of output records was populated, it was a simple matter to write a data flow that SELECTs all the records from the output table as the data flow source and writes them to the desired flat file as the destination.
Dick Rosenberg
I am using the answer to this question How to automatically generate unique id in sql server to create a custom id for a table. It worked perfectly. Now I have a column which holds values such as UID00000001, UID00000002 and so on. Suppose the last value in this column is UID00000003. Now I want to calculate, via C# in one of my .aspx pages, the value for the row which hasn't been inserted yet; in this case UID00000004. How can I get this value?
Any help would be appreciated.
Thank you.
If you are not required to generate these identifiers at the database level (which you would be if, for example, other processes also insert records into that table), you can pre-generate them within your application with something like the code below:
using System.Threading;

class Generator
{
    public static int UniqueId = 0;

    public static int GetNextId()
    {
        return Interlocked.Increment(ref UniqueId);
    }
}
Then your code can preallocate these identifiers and also format the strings (see the snippet below). If multiple users access the same functionality, they will each receive different identifiers. However, if a user does not (successfully) perform a save operation, those identifiers will be lost.
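For example, formatting the pre-generated number in the UIDxxxxxxxx shape used by the question could look like this sketch (the in-memory seed would still have to be initialized from the highest value already stored in the table):
// Turn the next in-memory id into a UID00000004-style string.
int nextId = Generator.GetNextId();
string customId = "UID" + nextId.ToString("D8"); // e.g. 4 -> "UID00000004"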
You need to execute this query to get the next identity value that will be generated for the table:
SELECT IDENT_CURRENT('table_name')+1;
In your case, the column has other text concatenated with the identity value, so the query will look like this:
SELECT 'UID' + RIGHT('00000000' + CAST(IDENT_CURRENT('table_name')+1 AS VARCHAR(8)), 8)
Of course you will need to write the C# code to send that query to SQL Server; a sketch follows.
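A minimal sketch of that call, assuming a valid connection string and your real table name in place of table_name:
using System.Data.SqlClient;

public static string GetNextCustomId(string connectionString)
{
    const string sql =
        "SELECT 'UID' + RIGHT('00000000' + CAST(IDENT_CURRENT('table_name') + 1 AS VARCHAR(8)), 8)";

    using (var connection = new SqlConnection(connectionString))
    using (var command = new SqlCommand(sql, connection))
    {
        connection.Open();
        // ExecuteScalar returns the single value produced by the SELECT.
        return (string)command.ExecuteScalar();
    }
}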
Having said that, keep this in mind: once you get the value from that call and hold onto it, if a record is inserted into the table while you are holding it, the value is no longer the next value.
If you need the identity value after a record is inserted by your application, please refer to this answer.
So I have a program that processes a lot of data: it pulls the data from the source and then saves it into the database. The only condition is that there cannot be duplicates based on a single name field, and there is no way to pull only the new data; I have to pull it all.
The way it works right now is that it pulls all the data from the source, then all of the data from the DB (~1.5M records), then compares the two sets and only sends the new entries. However, this is not very good in terms of RAM, since it can use up around 700 MB.
I looked into letting the DB handle more of the processing and found the extension method below on another question, but my concern with using it is that it might be even more inefficient, because it would check roughly 200,000 entities one at a time against all the records in the DB.
public static T AddIfNotExists<T>(this DbSet<T> dbSet, T entity, Expression<Func<T, bool>> predicate = null) where T : class, new()
{
    var exists = predicate != null ? dbSet.Any(predicate) : dbSet.Any();
    return !exists ? dbSet.Add(entity) : null;
}
So, any ideas as to how I might be able to handle this problem in an efficient manner?
1) Just a suggestion: because this task seems to be "data intensive", you could use a dedicated tool: SSIS. Create one package with one Data Flow Task like this:
OLE DB Source (to read data from the source)
Lookup transformation (to check whether each source value already exists in the target table; the lookup columns need indexing)
No-match output (the source value doesn't exist), leading to an
OLE DB Destination (which inserts the source value into the target table).
I would use this approach if I had to compare/search on a small set of columns (these columns should be indexed).
2) Another suggestion is to simply insert the source data into a #TempTable and then run:
INSERT dbo.TargetTable (...)
SELECT ...
FROM #TempTable
EXCEPT
SELECT ...
FROM dbo.TargetTable
I would use this approach if I had to compare/search on all columns. A sketch of loading the temp table from C# follows.
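A rough sketch of that second suggestion from C#: SqlBulkCopy loads the source rows into a session temp table, and the INSERT ... EXCEPT then runs on the same open connection (the temp table only exists for the lifetime of that connection). The target table dbo.TargetTable and the Name/Value columns are assumptions for illustration.
using System.Data;
using System.Data.SqlClient;

public static class DedupInserter
{
    // sourceRows must have the same column layout and order as the temp table below.
    public static void InsertNewRows(string connectionString, DataTable sourceRows)
    {
        using (var connection = new SqlConnection(connectionString))
        {
            connection.Open(); // keep this connection open: #TempTable lives in this session

            using (var create = new SqlCommand(
                "CREATE TABLE #TempTable (Name NVARCHAR(200) NOT NULL, Value NVARCHAR(400) NULL);",
                connection))
            {
                create.ExecuteNonQuery();
            }

            using (var bulk = new SqlBulkCopy(connection))
            {
                bulk.DestinationTableName = "#TempTable";
                bulk.WriteToServer(sourceRows);
            }

            using (var insert = new SqlCommand(@"
INSERT dbo.TargetTable (Name, Value)
SELECT Name, Value FROM #TempTable
EXCEPT
SELECT Name, Value FROM dbo.TargetTable;", connection))
            {
                insert.CommandTimeout = 300;
                insert.ExecuteNonQuery();
            }
        }
    }
}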
You have to do your tests to see which is the right/fastest approach.
If you stick to EF, there are a few ways of improving your algorithm:
Download only the Name field from your source table (create a new DTO object that maps to the same table but contains only the name field you compare against). That will minimize the memory usage.
Create and store a hash of the name field in the database, then download only the hashes to the client. Check the input data: if there's no matching hash, just add the record; if there is such a hash, verify with dbSet.Any(..). An MD5 hash takes only 16 bytes * 1.5M records ≈ 23 MB; CRC32 takes 4 bytes * 1.5M ≈ 6 MB.
You have high memory usage because everything you add is tracked by the DB context. If you work in chunks (let's say 4000 records) and then dispose of the context, the memory will be released.
using (var context = new MyDBContext())
{
    context.DBSet.Add(...); // and no more than 4000 records here
    context.SaveChanges();
}
A better solution is to use SQL statements, which can be executed via
model.Database.ExecuteSqlCommand("INSERT ...",
    new SqlParameter("Name", ..),
    new SqlParameter("Value", ..));
(Remember to set the parameter type on each SqlParameter.)
INSERT target(name, value)
SELECT #Name, #Value
WHERE NOT EXISTS (SELECT name FROM target WHERE name = #Name);
If the expected number of inserted values is high, you might benefit from using a table-valued parameter (System.Data.SqlDbType.Structured), assuming you use MS SQL Server, and executing a similar query in chunks, but that is a little more complex.
Depending on your SQL Server setup, you may also load the data into a temporary table first (e.g. from a CSV file, or with BULK INSERT for MS SQL) and then execute a similar query.