Bulk insert C# DataTable to SQL Server - efficient check for duplicates

I wish to import many files into a database (with custom business logic preventing simple SSIS package use).
High-level description of the problem:
Pull existing SQL data into a DataTable (~1M rows)
Read the Excel file into a 2D array in one chunk
Validate fields row by row (custom logic)
Check for duplicates against the existing DataTable
Add a new DataRow to the DataTable
Bulk insert the DataTable into the SQL table
Problem with my approach:
Each row must be checked for duplicates. I thought a call to the remote server to leverage SQL would be too slow, so I opted for LINQ. The query is simple, but the size of the dataset makes it crawl (90% of execution time is spent in this query checking the fields).
var existingRows = from row in recordDataTable.AsEnumerable()
                   where row.Field<int>("Entry") == entry
                      && row.Field<string>("Device") == dev
                   select row;
bool update = existingRows.Count() > 0;
What other ways might there be to more efficiently check for duplicates?

Using LINQ, that query basically does a for loop over your ~1M records every time you check for a duplicate.
You would be better off putting the data into a dictionary so your lookups go against an in-memory index.
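For example, a minimal sketch of that idea, assuming the same "Entry" and "Device" columns from the question (a HashSet of tuples stands in for the dictionary, since Tuple provides structural equality and hashing):

// Build the in-memory index once, before processing the Excel rows.
var existingKeys = new HashSet<Tuple<int, string>>(
    recordDataTable.AsEnumerable()
        .Select(r => Tuple.Create(r.Field<int>("Entry"), r.Field<string>("Device"))));

// Each duplicate check is now a hash lookup instead of a scan of ~1M rows.
bool update = existingKeys.Contains(Tuple.Create(entry, dev));

If the new rows should also be deduplicated against each other, add their keys to the set as you append them to the DataTable.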

Related

How to insert/Update 10000 rows in SQL Server using C# efficiently while comparing each row from database

I have been given an Excel file by a customer. It has 4 columns: Id, Name, Place, Date.
I have a table in my database which stores these values. I have to check each row from Excel and compare its values to the database table. If a row already exists, compare the dates and update it to the latest date from Excel. If the row does not exist yet, insert a new row.
I'm fetching each row, comparing its values in a for loop, and updating the database with insert/update statements through a data table adapter.
My problem is that this operation takes 4+ hours to update the data. Is there a more efficient way to do this? I have searched a lot and found options like SqlBulkCopy, but how would I compare each and every row against the database?
I'm using ASP.NET with C# and SQL Server.
Here's my code:
for (var row = 2; row <= workSheet.Dimension.End.Row; row++)
{
    // Get data from the Excel worksheet
    var id = workSheet.Cells[row, 1].Text;
    var name = workSheet.Cells[row, 2].Text;
    var place = workSheet.Cells[row, 3].Text;
    var dateInExcel = workSheet.Cells[row, 4].Text;

    // Check in the database whether this Id exists; if it does, compare dates and update
    var existingRows = GetRowsById(id); // pseudocode: the data table adapter's query for this Id

    if (existingRows.Rows.Count <= 0) // no row exists in the database
    {
        // Insert the row using the data table adapter's Insert statement
    }
    else // Id exists in the database
    {
        if (dateInDb < dateInExcel) // pseudocode: compare the stored date with the Excel date
        {
            // Update the database with the new date using the data table adapter's Update statement
        }
    }
}
@mjwills and @Dan Guzman make very valid points in the comments section.
My suggestion would be to create an SSIS package to import the spreadsheet into a temp table, then use a merge query (or queries) to make conditional updates to the required table(s).
https://learn.microsoft.com/en-us/sql/integration-services/import-export-data/start-the-sql-server-import-and-export-wizard?view=sql-server-ver15
The simplest way to get a good starting point is to use the import wizard in SSMS and save the resulting package. Then create an SSIS project in Visual Studio (you will need the correct version of the BI tools installed for the target SQL Server version).
https://learn.microsoft.com/en-us/sql/ssdt/download-sql-server-data-tools-ssdt?view=sql-server-ver15
https://learn.microsoft.com/en-us/sql/t-sql/statements/merge-transact-sql?view=sql-server-ver15
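For illustration, a rough sketch of that merge step run from C# once the spreadsheet has landed in a staging table. The staging table, target table, and column names below are assumptions for the example, not the actual schema:

// Requires System.Data.SqlClient; dbo.StagingCustomer and dbo.Customer are placeholder names.
const string mergeSql = @"
MERGE dbo.Customer AS target
USING dbo.StagingCustomer AS source
    ON target.Id = source.Id
WHEN MATCHED AND source.LastDate > target.LastDate THEN
    UPDATE SET target.LastDate = source.LastDate
WHEN NOT MATCHED BY TARGET THEN
    INSERT (Id, Name, Place, LastDate)
    VALUES (source.Id, source.Name, source.Place, source.LastDate);";

using (var conn = new SqlConnection(connectionString))
using (var cmd = new SqlCommand(mergeSql, conn))
{
    conn.Open();
    cmd.ExecuteNonQuery();   // one set-based statement instead of thousands of round trips
}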
This approach leverages SQL doing what it does best, dealing with relational data sets, and moves the work out of the ASP code.
To invoke it, the ASP app would need to handle the initial file upload and then kick off the SSIS package.
This can be done by setting the SSIS package up as a job on the SQL Server, with no schedule, and then starting the job when you want it to run.
How to execute an SSIS package from .NET?
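For example, a minimal sketch of starting such a job from the application; the job name here is a placeholder, and the caller needs permission to run jobs in msdb:

// Requires System.Data and System.Data.SqlClient.
using (var conn = new SqlConnection(connectionString))
using (var cmd = new SqlCommand("msdb.dbo.sp_start_job", conn))
{
    cmd.CommandType = CommandType.StoredProcedure;
    cmd.Parameters.AddWithValue("@job_name", "ImportCustomerSpreadsheet"); // placeholder job name
    conn.Open();
    cmd.ExecuteNonQuery();   // returns immediately; the job then runs asynchronously on the server
}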
There are most likely some optimisations that can be made to this approach, but it should work in principle.
Hope this helps :)
10,000 records taking more than 3x3600 seconds suggests over 1 second per record - it should be possible to improve on that.
Doing the work in the database would give the best performance, but there are a few things you can do beforehand.
Check the basics:
Indexes
Network speed. Is your timing based on running the code on your computer and talking to a cloud database? If the code and the DB are in the same cloud (Azure/Amazon/etc.) it may be much faster than what you're measuring with code running on your office computer talking to a DB far away.
Use batches. You should be able to get an order of magnitude better performance if you do the work in batches rather than one record at a time.
Get 10, 100, 500 or 1000 records from the file and fetch the corresponding records from the DB in one query. Do the presence check and date comparison in memory. After that, do a single save back to the database (a rough sketch follows below).
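The sketch below assumes a hypothetical Customer table with Id, Name, Place and LastDate columns, treats Id as a string, and uses a string-built IN list for brevity (a table-valued parameter would be the safer choice in real code):

// excelRows: the worksheet rows already read into a list of simple objects (Id, Name, Place, Date).
const int batchSize = 500;
using (var conn = new SqlConnection(connectionString))
{
    conn.Open();
    for (int i = 0; i < excelRows.Count; i += batchSize)
    {
        var batch = excelRows.Skip(i).Take(batchSize).ToList();
        var ids = string.Join(",", batch.Select(r => "'" + r.Id.Replace("'", "''") + "'"));

        // One round trip fetches every existing row for this chunk of Ids.
        var select = "SELECT Id, Name, Place, LastDate FROM Customer WHERE Id IN (" + ids + ")";
        using (var adapter = new SqlDataAdapter(select, conn))
        using (var builder = new SqlCommandBuilder(adapter))   // generates the Insert/Update commands
        {
            var table = new DataTable();
            adapter.Fill(table);
            var byId = table.AsEnumerable().ToDictionary(r => r.Field<string>("Id"));

            // Presence check and date comparison happen in memory.
            foreach (var row in batch)
            {
                DataRow dbRow;
                if (!byId.TryGetValue(row.Id, out dbRow))
                    table.Rows.Add(row.Id, row.Name, row.Place, row.Date);   // new row -> insert
                else if (dbRow.Field<DateTime>("LastDate") < row.Date)
                    dbRow["LastDate"] = row.Date;                            // newer date -> update
            }

            adapter.Update(table);   // one SELECT per chunk instead of one per record
        }
    }
}

Going further, SqlBulkCopy into a staging table followed by a single MERGE (as in the answer above) removes the per-row statements entirely.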

SqlBulkCopy.WriteToServer(DataTable) row by row: very slow

I had to make an application that imports a CSV file into a database table.
The CSV files have ~500 rows and ~30 columns and come from a not very reliable source (they might contain corrupted data).
I did it like this: CSV -> DataTable -> SqlBulkCopy.WriteToServer(DataTable). It processes 500 records to a non-local SQL Server in about 4 seconds, which is not a big problem. But since the CSV may contain corrupted data (wrong date format, integer overflow, etc.) I had to make it error proof, importing the good rows and skipping the bad rows. The problem does not occur when processing the corrupted data into the DataTable, but when importing the DataTable into the database. What I did was try{} to add the rows to the DB one by one, like this:
int Row = 0;
// csvFileData is the DataTable filled with CSV data; s is the SqlBulkCopy instance
foreach (DataRow RowToAdd in csvFileData.Rows)
{
    // here it spends 1-2% of the time (the problem is not in the row-by-row DataTable processing)
    Row++;
    DataTable TempDT = csvFileData.Clone();   // empty table with the same schema
    TempDT.ImportRow(RowToAdd);
    try
    {
        // here it spends 98% of the time
        s.WriteToServer(TempDT);
    }
    catch (Exception importex)
    {
        Console.WriteLine("Couldn't import row {0}, reason: {1}", Row, importex.Message);
    }
}
Calling
s.WriteToServer(csvFileData);
just once is not good in my case.
And it works really fine. The problem is that the execution time rose to 15 seconds, which is a lot, because it makes a round trip to the DB for every row. How can I fix this? I was thinking about emulating something like a local clone of the database table design, try{}-ing every row against it, excluding the bad ones, and then importing the entire DataTable (with the bad rows removed) at once. Or doing some async import row by row, but I think the rows might get scrambled in their order, get missed, or even duplicated. Can someone give me a tip?
A bulk insert of one row at a time is more than 10 times slower than a plain single-row insert, so your current strategy does not work.
Validate and cleanse the data on the client so that it is guaranteed the inserts will succeed. Copy it into a DataTable and insert everything at once, or at least in huge batches (performance gains start to appear at 100 or 1,000 rows).
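A minimal sketch of that approach; the column checks and the destination table name are placeholders for whatever the real file actually needs:

// csvFileData: the raw DataTable parsed from the CSV; cleanTable receives only rows that will insert cleanly.
DataTable cleanTable = csvFileData.Clone();
var rejected = new List<int>();

for (int i = 0; i < csvFileData.Rows.Count; i++)
{
    DataRow row = csvFileData.Rows[i];
    DateTime date;
    int number;

    // Placeholder checks: parse with the same types the target columns use.
    bool ok = DateTime.TryParse(row["DateField"].ToString(), out date)
           && int.TryParse(row["NumericField"].ToString(), out number);

    if (ok)
        cleanTable.ImportRow(row);
    else
        rejected.Add(i + 1);   // remember the 1-based CSV line for the error report
}

using (var bulk = new SqlBulkCopy(connectionString))
{
    bulk.DestinationTableName = "dbo.TargetTable";   // placeholder name
    bulk.WriteToServer(cleanTable);                  // one bulk operation for all good rows
}

Console.WriteLine("Skipped rows: " + string.Join(", ", rejected));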
The obvious solution, as mentioned, is to verify the data as it is read from the CSV file and fill the data table only with 'good' rows.
If your verification includes datatype checking, i.e. whether a string is convertible by the target system (here: SQL Server), you duplicate logic, i.e. you re-implement parsing/conversion logic that is already implemented in SQL Server. This is not a big problem, but from a design standpoint it is not necessarily smart.
In fact you can import a CSV file directly into SQL Server using the BULK INSERT command.
So another approach may be to import the raw data into a temporary table on the server and then do the datatype checks there. This is very easy if you happen to run SQL Server 2005 or above, which introduced functions like ISDATE and ISNUMERIC.
BULK INSERT CSVRawData FROM 'c:\csvtest.txt' WITH (
    FIELDTERMINATOR = ',', ROWTERMINATOR = '\n'
)

INSERT INTO FinalTable
SELECT * FROM CSVRawData
WHERE ISDATE(DateField) = 1
  AND ISNUMERIC(NumericField) = 1
I would personally go this way if:
The CSV file has a fixed format
The integrity checks being made are easy to code in SQL
E.g. we analyze log files that way. They contain 50 million+ rows and some of them are corrupted, or we are simply not interested in them.

Process 46,000 rows of a document in groups of 1000 using C# and Linq

I have this code below that executes. It has 46,000 records in the text file that I need to process and insert into the database. It takes forever if I just call it directly and loop through one row at a time.
I was trying to use LINQ to pull every 1000 rows or so and throw them into a thread so I could process 3000 rows at once and cut the processing time. I can't figure it out, though, so I need some help.
Any suggestions would be welcome. Thank you in advance.
var reader = ReadAsLines(tbxExtended.Text);

var ds = new DataSet();
var dt = new DataTable();

string headerNames = "Long|list|of|strings|";
var headers = headerNames.Split('|');
foreach (var header in headers)
    dt.Columns.Add(header);

var records = reader.Skip(1);   // skip the header line
foreach (var record in records)
    dt.Rows.Add(record.Split('|'));

ds.Tables.Add(dt);
ds.AcceptChanges();

ProcessSmallList(ds);
If you are looking for high performance, look at SqlBulkCopy if you are using SQL Server. The performance is significantly better than inserting row by row.
Here is an example using a custom CSVDataReader that I used for a project, but any IDataReader-compatible reader, DataRow[] or DataTable can be used as a parameter to WriteToServer (SqlDataReader, OleDbDataReader, etc.).
Dim sr As CSVDataReader
Dim sbc As SqlClient.SqlBulkCopy
sbc = New SqlClient.SqlBulkCopy(mConnectionString, SqlClient.SqlBulkCopyOptions.TableLock Or SqlClient.SqlBulkCopyOptions.KeepIdentity)
sbc.DestinationTableName = "newTable"
'sbc.BulkCopyTimeout = 0
sr = New CSVDataReader(parentfileName, theBase64Map, ","c)
sbc.WriteToServer(sr)
sr.Close()
There are quite a number of options available. (See the link in the item)
To bulk insert data into a database, you should probably use that database engine's bulk-insert facility (e.g. bcp in SQL Server). You might want to do the processing first, write the processed data out to a separate text file, and then bulk-insert it into the database of concern.
If you really want to do the processing and the inserts online, memory is also a (small) factor, for example:
ReadAllLines reads the whole text file into memory, creating 46,000 strings. That occupies a sizable chunk of memory. Try to use ReadLines instead, which returns an IEnumerable and yields strings one line at a time.
Your dataset may contain all 46,000 rows in the end, which makes detecting changed rows slow. Try to Clear() the dataset table right after each insert.
I believe the slowness you observed actually comes from the dataset. Datasets issue one INSERT statement per new record, which means that you won't save anything by doing Update() 1,000 rows at a time instead of one row at a time. You still have 46,000 INSERT statements going to the database, which makes it slow.
In order to improve performance, I'm afraid LINQ can't help you here, since the bottleneck is the 46,000 INSERT statements. You should:
Forgo the use of datasets
Dynamically build an INSERT statement with multiple VALUES rows in a string
Run that SQL command to insert 100-200 rows per batch (a rough sketch follows below)
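For instance, a rough sketch of that multi-row INSERT idea, reusing the records variable from the question; the table name and ColA/ColB/ColC are placeholders standing in for the real pipe-delimited column list:

// Requires System.Data.SqlClient and System.Text. One parameterised INSERT per batch of rows.
const int batchSize = 200;   // SQL Server allows at most 1000 VALUES rows and 2100 parameters per command
using (var conn = new SqlConnection(connectionString))
{
    conn.Open();
    foreach (var batch in records.Select((line, n) => new { line, n })
                                 .GroupBy(x => x.n / batchSize, x => x.line.Split('|')))
    {
        var sql = new StringBuilder("INSERT INTO dbo.TargetTable (ColA, ColB, ColC) VALUES ");
        using (var cmd = new SqlCommand { Connection = conn })
        {
            int p = 0;
            foreach (var fields in batch)
            {
                if (p > 0) sql.Append(", ");
                sql.AppendFormat("(@p{0}, @p{1}, @p{2})", p, p + 1, p + 2);
                cmd.Parameters.AddWithValue("@p" + p++, fields[0]);
                cmd.Parameters.AddWithValue("@p" + p++, fields[1]);
                cmd.Parameters.AddWithValue("@p" + p++, fields[2]);
            }
            cmd.CommandText = sql.ToString();
            cmd.ExecuteNonQuery();   // one round trip inserts the whole batch
        }
    }
}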
If you insist on using datasets, you don't have to do it with LINQ -- LINQ solves a different type of problem. Do something like:
// code to create dataset "ds" and datatable "dt" omitted
// code to create the data adaptor omitted
int count = 0;
foreach (string line in File.ReadLines(filename)) {
    // Do processing based on line, perhaps split it
    dt.Rows.Add(...);
    count++;
    if (count >= 1000) {
        adaptor.Update(dt);   // push this batch of rows
        dt.Clear();
        count = 0;
    }
}
if (dt.Rows.Count > 0) {
    adaptor.Update(dt);       // push the final partial batch
}
This will improve performance somewhat, but you're never going to approach the performance you would obtain by using dedicated bulk-insert utilities (or function calls) for your database engine.
Unfortunately, using those bulk-insert facilities will make your code less portable to another database engine. This is the trade-off you'll need to make.

Bulk Insert into access database from c#?

How can I do this? I have about 10,000 records in an Excel file and I want to insert all of them as fast as possible into an Access database.
Any suggestions?
What you can do is something like this:
Dim AccessConn As New System.Data.OleDb.OleDbConnection("Provider=Microsoft.Jet.OLEDB.4.0; Data Source=C:\Test Files\db1 XP.mdb")
AccessConn.Open()
' SELECT ... INTO pulls the external text file into a new Access table in a single statement
Dim AccessCommand As New System.Data.OleDb.OleDbCommand("SELECT * INTO [ReportFile] FROM [Text;DATABASE=C:\Documents and Settings\...\My Documents\My Database\Text].[ReportFile.txt]", AccessConn)
AccessCommand.ExecuteNonQuery()
AccessConn.Close()
Switch off the indexes on the affected tables before starting the load and rebuild them from scratch after the bulk load has finished. Rebuilding the indexes from scratch is faster than trying to keep them up to date while loading a large amount of data into a table.
If you choose to insert row by row, then you may want to consider using transactions: open a transaction, insert 1000 records, commit the transaction. This should work fine (a rough sketch follows below).
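The sketch below uses OleDb with purely hypothetical table and column names; the point is only that committing every 1000 rows keeps Jet/ACE from flushing on each insert:

// Requires System.Data and System.Data.OleDb; [People] and its columns are placeholders.
using (var conn = new OleDbConnection(accessConnectionString))
{
    conn.Open();
    OleDbTransaction tx = conn.BeginTransaction();
    var cmd = new OleDbCommand("INSERT INTO [People] (Id, Name, Place) VALUES (?, ?, ?)", conn, tx);
    cmd.Parameters.Add("@Id", OleDbType.Integer);
    cmd.Parameters.Add("@Name", OleDbType.VarWChar, 255);
    cmd.Parameters.Add("@Place", OleDbType.VarWChar, 255);

    int count = 0;
    foreach (DataRow row in excelData.Rows)        // excelData: DataTable read from the Excel file
    {
        cmd.Parameters[0].Value = row["Id"];
        cmd.Parameters[1].Value = row["Name"];
        cmd.Parameters[2].Value = row["Place"];
        cmd.ExecuteNonQuery();

        if (++count % 1000 == 0)                   // commit every 1000 rows, then start a new transaction
        {
            tx.Commit();
            tx = conn.BeginTransaction();
            cmd.Transaction = tx;
        }
    }
    tx.Commit();                                   // commit the final partial batch
}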
Use the default data import features in Access. If those do not suit your needs and you want to use C#, use standard ADO.NET and simply write the records one by one. 10K records should not take too long.

Join multiple DataRows into a single DataRow

I am writing this in C# using .NET 3.5. I have a System.Data.DataSet object with a single DataTable that uses the following schema:
Id : uint
AddressA: string
AddressB: string
Bytes : uint
When I run my application, let's say the DataTable gets filled with the following:
1 192.168.0.1 192.168.0.10 300
2 192.168.0.1 192.168.0.20 400
3 192.168.0.1 192.168.0.30 300
4 10.152.0.13 167.10.2.187 80
I'd like to be able to query this DataTable where AddressA is unique and the Bytes column is summed together (I'm not sure I'm saying that correctly). In essence, I'd like to get the following result:
1 192.168.0.1 1000
2 10.152.0.13 80
I ultimately want this result in a DataTable that can be bound to a DataGrid, and I need to update/regenerate this result every 5 seconds or so.
How do I do this? DataTable.Select() method? If so, what does the query look like? Is there an alternate/better way to achieve my goal?
EDIT: I do not have a database. I'm simply using an in-memory DataSet to store the data, so a pure SQL solution won't work here. I'm trying to figure out how to do it within the DataSet itself.
For readability (and because I love it) I would try to use LINQ:
var aggregatedAddresses = from DataRow row in dt.Rows
                          group row by row["AddressA"] into g
                          select new {
                              Address = g.Key,
                              Byte = g.Sum(row => (uint)row["Bytes"])
                          };

// result is assumed to be a DataTable with Id, AddressA and Bytes columns
int i = 1;
foreach (var row in aggregatedAddresses)
{
    result.Rows.Add(i++, row.Address, row.Byte);
}
If a performance issue is discovered with the LINQ solution, I would go with a manual solution: sum up the rows in a loop over the original table and insert the results into the result table.
You can also bind aggregatedAddresses directly to the grid instead of putting it into a DataTable.
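If that manual fallback is ever needed, a minimal sketch might look like the following; it also builds the result table itself, matching the Id/AddressA/Bytes shape of the desired output:

// Sum Bytes per AddressA with a dictionary, then copy the totals into a result DataTable.
var totals = new Dictionary<string, uint>();
foreach (DataRow row in dt.Rows)
{
    string address = (string)row["AddressA"];
    uint bytes = (uint)row["Bytes"];
    uint sum;
    totals[address] = totals.TryGetValue(address, out sum) ? sum + bytes : bytes;
}

var result = new DataTable();
result.Columns.Add("Id", typeof(uint));
result.Columns.Add("AddressA", typeof(string));
result.Columns.Add("Bytes", typeof(uint));

uint i = 1;
foreach (var pair in totals)
{
    result.Rows.Add(i++, pair.Key, pair.Value);
}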
The most efficient solution would be to do the sum in SQL directly:
select AddressA, SUM(bytes) from ... group by AddressA
I agree with Steven as well that doing this on the server side is the best option. If you are using .NET 3.5, though, you don't have to go through what Rune suggests; rather, use the DataSet extension methods to query and sum the values.
Then you can easily map the result to an anonymous type, which you can set as the data source for your grid (assuming you don't allow edits to it, which I don't see how you could, since you are aggregating the data).
I agree with Steven that the best way to do this is to do it in the database. But if that isn't an option, you can try the following:
Make a new DataTable and add the columns you need manually using DataTable.Columns.Add(name, datatype)
Step through the first DataTable's Rows collection and, for each row, create a new row in your new DataTable using DataTable.NewRow()
Copy the values of the columns found in the first table into the new row
Find the matching row in the other DataTable using Select() and copy the final value into the new DataRow
Add the row to your new DataTable using DataTable.Rows.Add(newRow)
This will give you a new DataTable containing the combined data from the two tables. It won't be very fast, but unless you have huge amounts of data it will probably be fast enough. Try to avoid a LIKE query in the Select, though, as that is slow.
One possible optimization exists if both tables contain rows with identical primary keys: you could then sort both tables and step through them, fetching both data rows by their array index. That would rid you of the Select call.
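A loose sketch of those steps, with purely hypothetical table and column names (Key, ValueA, ValueB):

// Combine rows from firstTable and secondTable into one new DataTable, matched on a shared key.
var combined = new DataTable();
combined.Columns.Add("Key", typeof(int));
combined.Columns.Add("ValueA", typeof(string));   // comes from firstTable
combined.Columns.Add("ValueB", typeof(string));   // comes from secondTable

foreach (DataRow source in firstTable.Rows)
{
    DataRow newRow = combined.NewRow();
    newRow["Key"] = source["Key"];
    newRow["ValueA"] = source["ValueA"];

    // Find the matching row in the other table; an exact match is much faster than a LIKE filter.
    DataRow[] matches = secondTable.Select("Key = " + source["Key"]);
    if (matches.Length > 0)
        newRow["ValueB"] = matches[0]["ValueB"];

    combined.Rows.Add(newRow);
}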
