I had to make an application that imports a CSV file into a database table.
The CSV files are roughly ~500 rows by ~30 columns and come from a not very reliable source (they might contain corrupted data).
I did it like this: CSV -> DataTable -> SqlBulkCopy.WriteToServer(DataTable). It processes 500 records to a non-local SQL Server in about 4 seconds, which is not a big problem. But since the CSV may contain corrupted data (wrong date format, integer overflow, etc.) I had to make it error proof, importing the good rows and skipping the bad ones. The problem does not occur while loading the corrupted data into the DataTable, but when importing the DataTable into the database. What I did was try {} adding the rows to the database one by one, like this:
int Row = 0;
// csvFileData is the DataTable filled with csv data
foreach (DataRow RowToAdd in csvFileData.Rows)
{
    // here it spends 1-2% of the time (the problem is not the row-by-row DataTable processing)
    Row++;
    DataTable TempDT = csvFileData.Clone();   // empty table with the same schema
    TempDT.ImportRow(RowToAdd);
    try
    {
        // here it spends 98% of the time
        s.WriteToServer(TempDT);
    }
    catch (Exception importex)
    {
        Console.WriteLine("Couldn't import row {0}, reason: {1}", Row, importex.Message);
    }
}
Calling:
s.WriteToServer(csvFileData);
just once is not good in my case.
And it works really well. The problem is that the execution time rose to 15 seconds, which is a lot, because it does a round trip to the database for every single row. How can I fix this? I was thinking about emulating something like a local clone of the database table design, try {} every row against it, exclude the bad ones, and then import the entire DataTable (with the bad rows removed) at once. Or doing some async import row by row, but I think the rows might get scrambled in their order, get missed, or even get duplicated. Can someone give a tip?
A bulk insert of a single row is more than 10 times slower than a regular single-row insert, so your current strategy does not work well.
Validate and cleanse the data on the client so that the inserts are guaranteed to succeed. Copy it into a DataTable and insert everything at once, or at least in big batches (the performance gains start to appear at around 100 to 1000 rows).
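As a rough sketch of that approach (the "OrderDate" and "Quantity" column names are placeholders, not your real schema; s and csvFileData are the SqlBulkCopy and DataTable from your question):

DataTable goodRows = csvFileData.Clone();   // same schema, no rows

foreach (DataRow row in csvFileData.Rows)
{
    // Check the fields that have been failing on the server side.
    DateTime d;
    int n;
    bool ok = DateTime.TryParse(Convert.ToString(row["OrderDate"]), out d)
           && int.TryParse(Convert.ToString(row["Quantity"]), out n);

    if (ok)
        goodRows.ImportRow(row);
    else
        Console.WriteLine("Skipping bad row {0}", csvFileData.Rows.IndexOf(row) + 1);
}

s.WriteToServer(goodRows);   // one round trip instead of 500

The whole cost is then one bulk copy call, and the per-row work happens in memory on the client.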
The obvious solution, as mentioned, is to verify the data as it is read from the CSV file and to fill the DataTable only with the 'good' rows.
If your verification includes data type checking, i.e. whether a string is convertible by the target system (here: SQL Server), you would duplicate logic on the client, i.e. re-implement parsing/conversion logic that already exists in SQL Server. That is not a big problem, but from a design standpoint it is not necessarily smart.
In fact, you can import a CSV file directly into SQL Server using the BULK INSERT command.
So another approach may be to import the raw data into a staging table on the server and then do the data type checks there. This is very easy because SQL Server provides functions like ISDATE and ISNUMERIC:
BULK INSERT CSVRawData FROM 'c:\csvtest.txt' WITH (
    FIELDTERMINATOR = ',', ROWTERMINATOR = '\n'
)

INSERT INTO FinalTable
SELECT * FROM CSVRawData
WHERE ISDATE(DateField) = 1
  AND ISNUMERIC(NumericField) = 1
I would personally go this way if:
The CSV file has a fixed format
The integrity checks are easy to code in SQL
E.g. we analyze log files that way. They contain 50 million+ rows, and some of them are corrupted or are simply of no interest to us.
Related
I have a tool I made for work. Every week there are 5-20 files for a certain process that fails and I have to find their job ids and rerun them.
I made a tool in C# that takes the names of the failed files from an Excel spreadsheet (we'll call it the Failed File Spreadsheet, or FFS if you're feeling cynical), cross references them with a different Excel spreadsheet that has the job ids, and displays the result in the terminal. It reads the FFS with fairly simple OleDbDataAdapter code:
public static DataTable GetDataFromExcel(string filename, string sheetName)
{
    using (var oledb = new OleDbConnection(CONN_STR.Replace("<FILENAME>", filename).Replace("<HDR>", "no")))
    {
        var result = new DataSet();
        new OleDbDataAdapter($"SELECT * FROM [{sheetName}]", oledb).Fill(result);
        return result.Tables[0];
    }
}
The tool works fine, mostly. It cross references with another excel sheet and I get my job ids and I can carry on with my task.
However, there's one slight issue: often, when the tool reads from the FFS, it returns blank lines. For example, if last week I had 7 files and this week I erased those and pasted in 5 files, my tool will show the job ids for those 5 files just fine, but it will also show two blanks, as if it's still reading the two extra rows from the previous week. If, however, I make a new blank spreadsheet in Excel, plug in my failed files and overwrite the saved file, I don't have this issue at all, making me think this is an Excel issue and not a C# coding issue.
Is there a reason why, if I delete the contents of a cell, the OleDbDataAdapter would still be reading those cells? Like are there whitespace characters or other hidden characters still present after deleting contents? I mean I could fix it in the code and just say "don't write it out if the values are whitespace or null" but I want to know why blank cells are even being read at all.
This is just a minor bug and it's not stopping me from doing my work and this tool is nothing more than a personal tool to help with a weekly task. But I'd still like to know why cells that had content, but then had that content deleted, are still being read.
Excel is a little bit quirky like that. If you are manually editing your "Failed File Spreadsheet" (FFS) and as you say, you are pasting 5 rows over the existing 7 rows, then you may still read in those extra rows after the data you expect, if there is any formatting on the cells. To avoid this, in Excel select the range of cells of the whole sheet and right-click and select "Clear Contents".
To be fair, as you alluded to, I think it would be simpler just to fix it in code and skip rows in the DataTable that are empty. Or there is a SO post here which shows how to remove empty rows from a DataTable.
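If you do handle it in code, a minimal sketch (using the GetDataFromExcel method above; it needs using System.Linq; and a reference to System.Data.DataSetExtensions) could look like this:

// Filter out rows where every cell is DBNull or whitespace.
DataTable table = GetDataFromExcel(filename, sheetName);

var nonEmpty = table.AsEnumerable()
    .Where(r => r.ItemArray.Any(v => v != DBNull.Value
                                  && !string.IsNullOrWhiteSpace(Convert.ToString(v))));

DataTable cleaned = nonEmpty.Any() ? nonEmpty.CopyToDataTable() : table.Clone();

That way the leftover formatted-but-empty rows from Excel never reach the rest of the tool.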
I have been given an Excel file by a customer. It has 4 columns: Id, name, place, date.
I have a table in my database which stores these values. I have to check each row from the Excel file and compare its values to the database table. If a row already exists, compare the dates and update the stored date to the latest date from the Excel file. If the row does not exist yet, insert a new row.
I'm fetching each row and comparing its values in a for loop, and updating the database with insert/update statements through a data table adapter.
My problem is that this operation takes 4+ hours. Is there a more efficient way to do this? I have searched a lot and found options like SqlBulkCopy, but then how would I compare each row against the database?
I'm using ASP.NET with C# and SQL Server.
Here's my code:
for (var row = 2; row <= workSheet.Dimension.End.Row; row++)
{
    // Get data from Excel
    var Id = workSheet.Cells[row, 1].Text;
    var Name = workSheet.Cells[row, 2].Text;
    var Place = workSheet.Cells[row, 3].Text;
    var dateInExcel = workSheet.Cells[row, 4].Text;

    // Check in the database whether the Id exists; matchingRows is the DataTable
    // returned by the table adapter's SELECT-by-Id query
    if (matchingRows.Rows.Count == 0) // no row exists in database
    {
        // Insert the row in the database using the data table adapter's insert statement
    }
    else // Id exists in database
    {
        if (dateInDb < DateTime.Parse(dateInExcel)) // compare the stored date with the Excel date
        {
            // Update the database with the new date using the data table adapter's update statement
        }
    }
}
@mjwills and @Dan Guzman make very valid points in the comments section.
My suggestion would be to create an SSIS package to import the spreadsheet into a temp table, then use a merge query (or queries) to make conditional updates to the required table(s).
https://learn.microsoft.com/en-us/sql/integration-services/import-export-data/start-the-sql-server-import-and-export-wizard?view=sql-server-ver15
The simplest way to get a good starting point is to use the Import Wizard in SSMS and save the resulting package. Then create an SSIS project in Visual Studio (you will need the correct version of the BI tools installed for the target SQL Server version).
https://learn.microsoft.com/en-us/sql/ssdt/download-sql-server-data-tools-ssdt?view=sql-server-ver15
https://learn.microsoft.com/en-us/sql/t-sql/statements/merge-transact-sql?view=sql-server-ver15
This approach leverages SQL doing what it does best (dealing with relational data sets) and moves the work out of the ASP code.
To invoke this, the ASP app would need to handle the initial file upload and then kick off the SSIS package.
This can be done by setting the SSIS package up as a job on the SQL Server with no schedule, and then starting the job when you want it to run.
How to execute an SSIS package from .NET?
There are most likely some optimisations that can be made to this approach, but it should work in principle.
Hope this helps :)
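If the import ends up being driven from the C# side instead of an SSIS package, the same staging-table-plus-MERGE idea can be sketched roughly like this (dbo.StagingCustomer, dbo.Customer, connectionString and excelDataTable are assumptions standing in for your real schema and variables):

using (var conn = new SqlConnection(connectionString))
{
    conn.Open();

    // 1. Bulk copy the rows read from the spreadsheet into a staging table.
    using (var bulk = new SqlBulkCopy(conn) { DestinationTableName = "dbo.StagingCustomer" })
    {
        bulk.WriteToServer(excelDataTable);
    }

    // 2. One set-based MERGE decides insert vs. update on the server.
    const string mergeSql = @"
        MERGE dbo.Customer AS t
        USING dbo.StagingCustomer AS s ON t.Id = s.Id
        WHEN MATCHED AND s.[Date] > t.[Date] THEN
            UPDATE SET t.[Date] = s.[Date]
        WHEN NOT MATCHED THEN
            INSERT (Id, Name, Place, [Date]) VALUES (s.Id, s.Name, s.Place, s.[Date]);";

    using (var cmd = new SqlCommand(mergeSql, conn))
    {
        cmd.ExecuteNonQuery();
    }
}

The row-by-row comparison then happens entirely inside the database engine, which is where it is cheapest.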
10,000 records taking more than 3 x 3600 s means over 1 s per record - I think it should be possible to improve on that.
Doing the work in the database would give the best performance, but there are a few things you can do beforehand.
Check the basics:
Indexes
Network speed. Is your timing based on running the code on your own computer against a cloud database? If the code and the db are in the same cloud (Azure/Amazon/etc.) it may be much faster than what you measure with code running on your office machine talking to a db far away.
Use batches. You should be able to get an order of magnitude better performance if you work in batches rather than one record at a time.
Get 10, 100, 500 or 1000 records from the file, fetch the corresponding records from the db, do the presence check and the date comparison in memory, and then do a single save to the database.
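A sketch of that batching idea, reusing the worksheet loop from the question; LoadExistingDates, QueueInsert, QueueUpdate and FlushBatch are hypothetical helpers standing in for your actual data access:

const int batchSize = 500;
int lastRow = workSheet.Dimension.End.Row;

for (int start = 2; start <= lastRow; start += batchSize)
{
    int end = Math.Min(start + batchSize - 1, lastRow);

    // 1. Read one batch of rows from the worksheet.
    var excelRows = new List<(string Id, string Name, string Place, DateTime Date)>();
    for (int row = start; row <= end; row++)
    {
        excelRows.Add((workSheet.Cells[row, 1].Text,
                       workSheet.Cells[row, 2].Text,
                       workSheet.Cells[row, 3].Text,
                       DateTime.Parse(workSheet.Cells[row, 4].Text)));
    }

    // 2. Fetch only the matching db rows in one query (e.g. WHERE Id IN (...)).
    Dictionary<string, DateTime> existing = LoadExistingDates(excelRows.Select(r => r.Id));

    // 3. Decide insert vs. update in memory, then save the whole batch at once.
    foreach (var r in excelRows)
    {
        DateTime storedDate;
        if (!existing.TryGetValue(r.Id, out storedDate))
            QueueInsert(r);
        else if (storedDate < r.Date)
            QueueUpdate(r);
    }
    FlushBatch();   // a single round trip per batch
}

Instead of two round trips per row you get two or three per batch, which is where the time goes.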
I wish to import many files into a database (custom business logic prevents the use of a simple SSIS package).
High level description of problem:
Pull existing sql data into DataTable (~1M rows)
Read excel file into 2d array in one chunk
Validate fields row by row (custom logic)
Check for duplicates in the existing DataTable
Insert the values into a new DataRow
Bulk insert DataTable into SQL table
Problem with my approach:
Each row must be checked for duplicates. I thought a call to the remote server to leverage SQL would be too slow, so I opted for LINQ. The query is simple, but the size of the dataset makes it crawl (90% of the execution time is spent in this query checking the fields).
var existingRows = from row in recordDataTable.AsEnumerable()
where row.Field<int>("Entry") == entry
&& row.Field<string>("Device") == dev
select row;
bool update = existingRows.Count() > 0;
What other ways might there be to more efficiently check for duplicates?
Using LINQ it will basically do a for loop over your ~1M records every time you check for a duplicate.
You would be better off putting the data into a dictionary so your lookups go against an in-memory index.
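For example (a sketch assuming C# 7 value tuples; a Dictionary keyed on a composite string works just as well), built once before the per-row loop:

// Build the index once; each duplicate check is then O(1) instead of a table scan.
var existingKeys = new HashSet<(int Entry, string Device)>(
    recordDataTable.AsEnumerable()
        .Select(r => (r.Field<int>("Entry"), r.Field<string>("Device"))));

// Inside the per-row validation loop:
bool update = existingKeys.Contains((entry, dev));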
How can I do this? I have about 10,000 records in an Excel file and I want to insert all of them as fast as possible into an Access database.
Any suggestions?
What you can do is something like this:
Dim AccessConn As New System.Data.OleDb.OleDbConnection("Provider=Microsoft.Jet.OLEDB.4.0; Data Source=C:\Test Files\db1 XP.mdb")
AccessConn.Open()
Dim AccessCommand As New System.Data.OleDb.OleDbCommand("SELECT * INTO [ReportFile] FROM [Text;DATABASE=C:\Documents and Settings\...\My Documents\My Database\Text].[ReportFile.txt]", AccessConn)
AccessCommand.ExecuteNonQuery()
AccessConn.Close()
Switch off the indexing on the affected tables before starting the load, and then rebuild the indexes from scratch after the bulk load has finished. Rebuilding the indexes from scratch is faster than trying to keep them up to date while loading a large amount of data into a table.
If you choose to insert row by row, then you may want to consider using transactions: open a transaction, insert 1000 records, commit the transaction, repeat. This should work fine.
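A rough sketch of that via OleDb against the Access database; the [ReportFile] table, its two columns, accessConnectionString and the excelData DataTable are placeholders, not your actual names:

using (var conn = new OleDbConnection(accessConnectionString))
{
    conn.Open();
    OleDbTransaction tx = conn.BeginTransaction();

    var cmd = new OleDbCommand("INSERT INTO [ReportFile] (Col1, Col2) VALUES (?, ?)", conn, tx);
    cmd.Parameters.Add("p1", OleDbType.VarWChar);
    cmd.Parameters.Add("p2", OleDbType.VarWChar);

    int count = 0;
    foreach (DataRow row in excelData.Rows)
    {
        cmd.Parameters[0].Value = row[0];
        cmd.Parameters[1].Value = row[1];
        cmd.ExecuteNonQuery();

        if (++count % 1000 == 0)   // commit every 1000 inserts
        {
            tx.Commit();
            tx = conn.BeginTransaction();
            cmd.Transaction = tx;
        }
    }
    tx.Commit();                   // commit the remaining rows
}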
Use the default data import features in Access. If that does not suit your needs and you want to use C#, use standard ADO.NET and simply write record-for-record. 10K records should not take too long.
What is the best method for saving thousands of rows and, after doing something with them, updating them?
Currently, I use a DataTable, fill it, and when done insert the rows with
MyDataAdapter.Update(MyDataTable)
After making some changes to MyDataTable, I call the MyDataAdapter.Update(MyDataTable) method again.
Edit:
I am sorry for not providing more info.
There may be up to 200,000 rows, which will be created from an XML file. These rows will be saved to the database. After that there will be some processing for each row, and I will need to update each row in the database.
Instead of updating row by row, I decided to update the DataTable and use the same DataAdapter to update the rows.
This is the best I could come up with.
I think there may be a smarter approach.
Reacting to your comments:
A DataAdapter.Update() will update (and insert/delete) row by row. If you have individual changes, there really is no faster way. If you have systematic changes, like SET Price = Price + 2 WHERE SelByDate < '1/1/2010', you are better off running a DbCommand against the database.
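For the systematic case, a sketch of such a command (the Products table name and connectionString are made up; Price/SelByDate follow the example above):

using (var conn = new SqlConnection(connectionString))
using (var cmd = new SqlCommand(
    "UPDATE Products SET Price = Price + 2 WHERE SelByDate < @cutoff", conn))
{
    cmd.Parameters.AddWithValue("@cutoff", new DateTime(2010, 1, 1));
    conn.Open();
    int affected = cmd.ExecuteNonQuery();   // one round trip, however many rows match
}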
But maybe you should worry about transactions and error handling before performance.
If I understand correctly, you are doing two separate operations: loading rows into a database and then updating those rows.
If the rows you are inserting come from another ADO.NET-supported data source, you can use SqlBulkCopy to insert them in batches, which will be more efficient than going through a DataTable.
Once the rows are in the database, I would assume you would be better off executing a SqlCommand to modify their values.
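For instance, a sketch of streaming rows from one SQL Server source into the target table in batches via a data reader (the table names and connection strings are placeholders, and an XML or other source would need its own IDataReader):

using (var sourceConn = new SqlConnection(sourceConnectionString))
using (var destConn = new SqlConnection(destConnectionString))
{
    sourceConn.Open();
    destConn.Open();

    using (var readerCmd = new SqlCommand("SELECT * FROM dbo.SourceTable", sourceConn))
    using (SqlDataReader reader = readerCmd.ExecuteReader())
    using (var bulk = new SqlBulkCopy(destConn)
                      { DestinationTableName = "dbo.TargetTable", BatchSize = 5000 })
    {
        bulk.WriteToServer(reader);   // rows are streamed and committed in batches of 5000
    }
}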
If you can provide more details about what--and why--you're asking the question then perhaps we can better tailor an answer for it.