How to bulk insert 20 100 MB CSV files into SQL Server - C#

I have about 20 .csv files which are around 100-200 MB each.
They each have about 100 columns.
90% of the columns are the same across the files; however, some files have more columns and some files have fewer columns.
I need to import all of these files into one table in a SQL Server 2008 database.
If a field does not exist, I need it to be created.
Question: What should the process for this import be? How do I import all of these files into one table in a database as efficiently and quickly as possible, and make sure that if a field does not exist, it is created? Please also keep in mind that the same field might be in a different position. For example, CAR can be column AB in one CSV whereas the same field name (CAR) can be column AC in another CSV file. The solution can be SQL or C# or both.

You have a number of options:
1. Use a DTS/SSIS package.
2. Produce one uniform CSV file, get the DB table in sync with its columns, and bulk insert it (see the sketch after this list).
3. Bulk insert every file into its own staging table, and after that merge the staging tables into the target table.
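
For option 2, here is a minimal sketch, assuming plain comma-separated files with a header row and no quoted fields (a real CSV parser is safer for messy data). It builds the union of all column names, then rewrites every file into one uniform CSV whose columns always appear in the same order; the folder and output paths are placeholders.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

static class CsvNormalizer
{
    public static void Normalize(string inputFolder, string outputFile)
    {
        string[] files = Directory.GetFiles(inputFolder, "*.csv");

        // Pass 1: union of every column name seen in any header, in first-seen order.
        var allColumns = new List<string>();
        foreach (var file in files)
        {
            string header;
            using (var reader = new StreamReader(file))
                header = reader.ReadLine();
            foreach (var column in header.Split(','))
                if (!allColumns.Contains(column.Trim(), StringComparer.OrdinalIgnoreCase))
                    allColumns.Add(column.Trim());
        }

        // Pass 2: rewrite every row with the columns in canonical order,
        // emitting empty fields for columns a file does not have.
        using (var output = new StreamWriter(outputFile))
        {
            output.WriteLine(string.Join(",", allColumns.ToArray()));
            foreach (var file in files)
            {
                using (var reader = new StreamReader(file))
                {
                    var header = reader.ReadLine().Split(',').Select(h => h.Trim()).ToArray();
                    int[] positions = allColumns
                        .Select(c => Array.FindIndex(header,
                            h => h.Equals(c, StringComparison.OrdinalIgnoreCase)))
                        .ToArray();

                    string line;
                    while ((line = reader.ReadLine()) != null)
                    {
                        var fields = line.Split(',');
                        output.WriteLine(string.Join(",", positions
                            .Select(p => p >= 0 && p < fields.Length ? fields[p] : "")
                            .ToArray()));
                    }
                }
            }
        }
    }
}
```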

I would recommend looking at the bcp utility, which comes with SQL Server and is intended for jobs just like this:
http://msdn.microsoft.com/en-us/library/aa337544.aspx
There are "format files" which allow you to specify which CSV columns go to which SQL columns.
If you are more inclined to use C#, have a look at the SqlBulkCopy class:
http://msdn.microsoft.com/en-us/library/system.data.sqlclient.sqlbulkcopy.aspx
Take a look at this SO thread as well; it is also about importing from CSV files into SQL Server:
SQL Bulk import from CSV
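
If you go the SqlBulkCopy route, its ColumnMappings collection plays roughly the same role as a BCP format file: columns are matched by name rather than by position, which covers the case where CAR sits in a different column in each file. A rough sketch, assuming the DataTable has already been filled from one CSV; the destination table name and connection string are placeholders.

```csharp
using System.Data;
using System.Data.SqlClient;

static class BulkLoader
{
    public static void BulkCopy(DataTable csvData, string connectionString)
    {
        using (var connection = new SqlConnection(connectionString))
        {
            connection.Open();
            using (var bulk = new SqlBulkCopy(connection))
            {
                bulk.DestinationTableName = "dbo.ImportTarget";   // placeholder table name
                bulk.BulkCopyTimeout = 0;                         // no timeout for large files

                // Map every CSV column to the destination column with the same name,
                // so the physical column order in each file does not matter.
                foreach (DataColumn column in csvData.Columns)
                    bulk.ColumnMappings.Add(column.ColumnName, column.ColumnName);

                bulk.WriteToServer(csvData);
            }
        }
    }
}
```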

I recommend writing a small C# application that reads each of the CSV file headers and stores a dictionary of the columns needed, then either outputs a CREATE TABLE statement or runs a create-table operation directly against the database. You can then use SQL Server Management Studio to load the 20 files individually using the import routine.
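
A rough sketch of that header-scan step, assuming a simple comma-split of the header line; every column is created as NVARCHAR(255) here because the CSVs carry no type information, so adjust the types as needed.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text;

static class SchemaBuilder
{
    public static string BuildCreateTable(IEnumerable<string> csvFiles, string tableName)
    {
        // Union of the column names found in every CSV header.
        var columns = new List<string>();
        foreach (var file in csvFiles)
        {
            string headerLine;
            using (var reader = new StreamReader(file))
                headerLine = reader.ReadLine();               // first line = column names

            foreach (var column in headerLine.Split(','))
                if (!columns.Contains(column.Trim(), StringComparer.OrdinalIgnoreCase))
                    columns.Add(column.Trim());
        }

        var sql = new StringBuilder();
        sql.AppendLine("CREATE TABLE " + tableName + " (");
        sql.AppendLine(string.Join(",\r\n",
            columns.Select(c => "    [" + c + "] NVARCHAR(255) NULL").ToArray()));
        sql.AppendLine(");");
        return sql.ToString();
    }
}
```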

Use the SqlBulkCopy class in System.Data.SqlClient.
It facilitates bulk data transfer. The only catch is that it won't work with DateTime DB columns.

Less of an answer and more of a direction, but here I go. The way I would do it is first enumerate the column names from both the CSV files and the DB, then make sure the ones from your CSV all exist in the destination.
Once you have validated and/or created all the columns, then you can do your bulk insert. Assuming you don't have multiple imports happening at the same time, you could cache the column names from the DB when you start the import, as they shouldn't be changing.
If you will have multiple imports running at the same time, then you will need to make sure you have a full table lock during the import, as race conditions could show up.
I do a lot of automated imports for SQL DBs, and I haven't ever seen what you're asking for, as it's an assumed requirement that one knows the data that is coming into the DB. Not knowing columns ahead of time is typically a very bad thing, but it sounds like you have an exception to the rule.
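
A sketch of that validate-and-create step, assuming the destination table already exists and any missing column can simply be added as a nullable NVARCHAR(255); the table name, connection string, and type choice are placeholders.

```csharp
using System;
using System.Collections.Generic;
using System.Data.SqlClient;

static class ColumnSync
{
    public static void EnsureColumns(string connectionString, string tableName,
                                     IEnumerable<string> csvColumns)
    {
        var existing = new HashSet<string>(StringComparer.OrdinalIgnoreCase);
        using (var connection = new SqlConnection(connectionString))
        {
            connection.Open();

            // Read the columns the destination table already has.
            using (var cmd = new SqlCommand(
                "SELECT COLUMN_NAME FROM INFORMATION_SCHEMA.COLUMNS " +
                "WHERE TABLE_NAME = @table", connection))
            {
                cmd.Parameters.AddWithValue("@table", tableName);
                using (var reader = cmd.ExecuteReader())
                    while (reader.Read())
                        existing.Add(reader.GetString(0));
            }

            // Add any CSV column that is missing. Column names are concatenated
            // into DDL here, so they should be validated/sanitized in real code.
            foreach (var column in csvColumns)
            {
                if (existing.Contains(column)) continue;
                using (var alter = new SqlCommand(
                    "ALTER TABLE [" + tableName + "] ADD [" + column + "] NVARCHAR(255) NULL",
                    connection))
                {
                    alter.ExecuteNonQuery();
                }
            }
        }
    }
}
```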

Roll your own.
Keep (or create) a runtime representation of the target table's columns in the database. Before importing each file, check to see if the column exists already. If it doesn't, run the appropriate ALTER statement. Then import the file.
The actual import process can and probably should be done by BCP or whatever Bulk protocol you have available. You will have to do some fancy kajiggering since the source data and destination align only logically, not physically. So you will need BCP format files.

There are several possibilities that you have here.
You can use SSIS if it is available to you.
In SQL Server you can use SqlBulkCopy to bulk insert into a staging table, where you will insert the whole .csv file, and then use a stored procedure, possibly with a MERGE statement in it, to place each row where it belongs or create a new one if it doesn't exist.
You can use C# code to read the files and write them using SqlBulkCopy or EntityDataReader.
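
A rough sketch of the staging-table plus MERGE pattern described above; the staging and target table names, the key column, and the column list are placeholders, and MERGE requires SQL Server 2008 or later.

```csharp
using System.Data;
using System.Data.SqlClient;

static class StagingLoader
{
    public static void Load(DataTable csvData, string connectionString)
    {
        using (var connection = new SqlConnection(connectionString))
        {
            connection.Open();

            // Step 1: bulk copy the CSV rows into the staging table.
            using (var bulk = new SqlBulkCopy(connection))
            {
                bulk.DestinationTableName = "dbo.Staging_Import";   // placeholder staging table
                bulk.WriteToServer(csvData);
            }

            // Step 2: merge staging rows into the target table.
            const string merge = @"
                MERGE dbo.Target AS t
                USING dbo.Staging_Import AS s ON t.Id = s.Id
                WHEN MATCHED THEN
                    UPDATE SET t.Car = s.Car          -- repeat for the other columns
                WHEN NOT MATCHED THEN
                    INSERT (Id, Car) VALUES (s.Id, s.Car);";

            using (var cmd = new SqlCommand(merge, connection))
                cmd.ExecuteNonQuery();
        }
    }
}
```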

For those data volumes, you should use an ETL tool. See this tutorial.
ETL tools are designed for manipulating large amounts of data.

Related

Best Way To Migrate Huge Numbers of Data Rows from Access To SQL

I have several Access databases with more than 40,000,000 rows. I'm reading each row using a DataReader and inserting every row one by one into a SQL database. But it seems it will take weeks or even longer!
Is there any way to do this migration faster?
I would recommend exporting your Access database to a CSV file (or a number of CSV files); a guide is here: https://support.spatialkey.com/export-data-from-database-to-csv-file/
You can then use BULK INSERT or SSIS to import the rows into SQL Server. A reference for this operation: http://blog.sqlauthority.com/2008/02/06/sql-server-import-csv-file-into-sql-server-using-bulk-insert-load-comma-delimited-file-into-sql-server/
This way should be substantially faster.
A programmatic alternative would be to use the SqlBulkCopy class: https://msdn.microsoft.com/en-us/library/system.data.sqlclient.sqlbulkcopy(v=vs.110).aspx
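
With SqlBulkCopy you can also skip the CSV step entirely and stream straight from the Access data reader, since WriteToServer accepts an IDataReader. A sketch, where the OLE DB provider string, file path, and table names are placeholders.

```csharp
using System.Data.OleDb;
using System.Data.SqlClient;

static class AccessMigration
{
    public static void Copy(string accessFile, string sqlConnectionString)
    {
        // Assumed provider; use Microsoft.Jet.OLEDB.4.0 for older .mdb files.
        string accessConn = "Provider=Microsoft.ACE.OLEDB.12.0;Data Source=" + accessFile;

        using (var source = new OleDbConnection(accessConn))
        using (var destination = new SqlConnection(sqlConnectionString))
        {
            source.Open();
            destination.Open();

            using (var cmd = new OleDbCommand("SELECT * FROM MyAccessTable", source))
            using (var reader = cmd.ExecuteReader())
            using (var bulk = new SqlBulkCopy(destination))
            {
                bulk.DestinationTableName = "dbo.MySqlTable";  // placeholder target table
                bulk.BulkCopyTimeout = 0;
                bulk.BatchSize = 10000;                        // commit in batches
                bulk.WriteToServer(reader);                    // streams rows, no row-by-row inserts
            }
        }
    }
}
```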

How do I pass a table as a stored procedure parameter inside an SSIS Data Flow

I am trying to write an SSIS package to transfer data from one database to another (a straight copy; the tables that I am transferring to and from have the same structure). I am selecting a subset of the records (the ones that have been created or modified since the last time the package was run) and I am trying to hand them to a sproc on the destination database which will determine which records need to be updated and which records need to be inserted.
How do I either do this inside a Data Flow object, or transfer the records out of the object so I can do it with an Execute SQL Task?
I don't want to use an OLE DB Command since that only works on one record at a time. The two databases are on different machines at different locations, and I'd like this package to take as little time as possible to run, since I'm writing it to replace a DTS package that takes way too long (it deletes the entire contents of the destination table and re-copies everything, changed or otherwise).
SSIS does not support table-valued parameters. There are several workarounds for what you are trying to achieve:
There are several 3rd party UPSERT/MERGE destinations, including this one on CodePlex http://ssisctc.codeplex.com/wikipage?title=MERGE%20Destination
You could use an OLEDB destination to insert the rows into a temp table on the server, then run your stored procedure against the temp table from an Execute SQL task. (This is what we have done on my current project)
You could write a custom .NET script destination
You could use a Lookup Transformation to see if the rows already exist in the table, then insert the rows that don't exist and run an OLE DB Command against the others, which would at least be better than running it against every row.

SQL Server data import dilemma

I have created an application that will import CSV files into a table in a SQL Server database; I have multiple CSV files that all need to go into that one table.
I've got a couple of approaches in mind but I'm not sure which is most practical. The application works by asking the user to select the files they want to import (from their local file system) and then they simply click a [Load Files] button. These files may contain 100,000+ rows at times.
What would be better for the above scenario?
Import the CSV file into a DataTable using C# and the open-source GenericParser, then use the traditional SqlBulkCopy method to push the DataTable to the database.
Note: my concern is the strain on the user's PC when doing this for files with 100,000+ rows. How will this affect the processing, or could it crash the program?
Use BULK INSERT, which requires the file name and path. My concern for this option is that I'm not sure the server would be able to process the BULK INSERT command without the physical file being located on the server. The file path would refer to the user's local machine. The only time I've used BULK INSERT is when I was logged onto the server itself, which is not possible for this app.
Is there a way to do it with LINQ? While I'm not really familiar with LINQ, if it can be accomplished I'm open to trying it.
Any insight is appreciated. I know what I need to do, I'm just not sure how to accomplish it practically.
Thanks
My recommendation would be to use the SqlBulkCopy class in .NET. It will allow you to import rows almost as fast as the BULK INSERT statement, but only requires that you populate a DataTable with the rows and then send them to SQL Server.
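
A minimal sketch of that DataTable approach; the CSV parsing here is a naive comma split (use a real parser such as GenericParser for quoted fields), and the destination table name is a placeholder.

```csharp
using System.Data;
using System.Data.SqlClient;
using System.IO;

static class CsvLoader
{
    public static void Load(string csvPath, string connectionString)
    {
        var table = new DataTable();

        using (var reader = new StreamReader(csvPath))
        {
            // First line becomes the column list; all columns are loaded as strings.
            foreach (var column in reader.ReadLine().Split(','))
                table.Columns.Add(column.Trim());

            string line;
            while ((line = reader.ReadLine()) != null)
                table.Rows.Add(line.Split(','));        // naive split; no quoted-field handling
        }

        using (var connection = new SqlConnection(connectionString))
        {
            connection.Open();
            using (var bulk = new SqlBulkCopy(connection))
            {
                bulk.DestinationTableName = "dbo.ImportedRows";  // placeholder table
                bulk.BatchSize = 5000;                           // keeps each batch modest
                bulk.WriteToServer(table);
            }
        }
    }
}
```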
Another consideration you might want to look at (and this is my personal favorite for simple file import programs) is to use PowerShell instead of C#, which has a built-in cmdlet for importing CSV files. Pretty cool stuff.
1) loader app in .Net is a good choice, generally. 100,000 rows is really not a strenuous workload, especially for simple loads. Only if there is a ton of multiple-table-joins involved in order to look up values on the fly would that really be a big concern.
2) although strictly speaking physical file location is just a performance question, I wouldn't do it. It will introduce administrative headaches.
3) I don't have experience with Linq, so I cannot remark.
Just for bonus alternate idea: if you have IIS running somewhere, maybe even on the DB server, you can whip up a lightweight, one-page "webapp" which is just a CGI script with ODBC connection to the DB and the user just feeds the CSV in as a "web/CGI" upload. No utility application to install on user workstations this way.
To solve your problem, you have to look at it from two basic angles:
Do you need to perform some operations on the data before inserting it into the database (some summarization, correction, ...)?
If yes, then the best way is to load the rows from the file into objects (each row into one object instance). You can then elegantly work with the list of items using Linq.
Do you only need to insert the rows from the file into the database as they are?
In this case, use the process described in point 2 of your question.
I'd prefer to upload the file to the server before any action. It's safer.

Limiting Number of Rows Inserted into a SQL Server Database

I have a program in C# in VS that runs a mainform.
That mainform exports data into tables in a SQL database using stored procedures. The data exported is a lot of data (600,000+ rows).
I have a problem though. On my mainform I need to have a "database write out interval". This is a number of how many "rows" will be imported into the database.
My problem, however, is how to implement that interval. The mainform runs, and when the main program is done, the SQL side still takes in data for another 5-10 minutes.
Therefore, if I close the mainform, the rest of the data will not be imported.
Do you professional programmers out there know a way where I can somehow communicate with SQL to only export data for a user-specified interval?
This has to be done with my C# class.
I don't know where to begin.
I don't think a timer would be a good idea because different computers and CPUs perform differently. Any advice would be appreciated.
If the data is of a fixed format (i.e., there are going to be the same columns for every row and it's not going to change much), you should look at BULK INSERT. It's incredibly fast at inserting large numbers of rows.
The basics are: you write your data out to a text file (e.g., CSV, but you can specify whatever delimiter you want), then execute a BULK INSERT command against the server. One of the arguments is the path to the file you wrote out. It's a bit of a pain to use because you have to write the file to a folder on the server (or a UNC path that the server has access to), which leads to configuring Windows shares or setting up FTP on the server. It sounds like exactly what you want to use, though.
Here's the MSDN documentation on BULK INSERT:
http://msdn.microsoft.com/en-us/library/ms188365.aspx
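
For reference, a sketch of issuing BULK INSERT from C#; the file path must be visible to the SQL Server service account (a folder on the server or a UNC share), and the path, table name, and options shown are placeholders.

```csharp
using System.Data.SqlClient;

static class BulkInsertRunner
{
    public static void Run(string connectionString)
    {
        // Path and table name are placeholders; FIRSTROW = 2 skips the header line.
        const string sql = @"
            BULK INSERT dbo.ExportedRows
            FROM '\\fileserver\exports\data.csv'
            WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\n', FIRSTROW = 2);";

        using (var connection = new SqlConnection(connectionString))
        using (var command = new SqlCommand(sql, connection))
        {
            command.CommandTimeout = 0;      // large files can take a while
            connection.Open();
            command.ExecuteNonQuery();
        }
    }
}
```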
Instead of exporting all of your data to SQL and then trying to abort or manage the load, a better process might be to split your load into smaller chunks (10,000 records or so) and check whether the user wants to continue after each load. This gives you a lot more flexibility and control over the load than dumping all 600,000 records to SQL and trying to manage the process.
Also, what Tim Coker mentioned is spot on. Even if your stored proc is doing some data manipulation, it is a lot faster to load the data via bulk insert and run a query after the load to do any work you have to do than to run all 600,000 records through the stored proc.
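
A sketch of that chunked approach, where rows are sent 10,000 at a time and a callback decides whether to keep going; the source table, destination table name, and the shouldContinue callback are placeholders for the real application logic.

```csharp
using System;
using System.Data;
using System.Data.SqlClient;

static class ChunkedLoader
{
    public static void Load(DataTable source, string connectionString, Func<bool> shouldContinue)
    {
        DataTable chunk = source.Clone();                     // empty table, same columns
        using (var connection = new SqlConnection(connectionString))
        {
            connection.Open();
            using (var bulk = new SqlBulkCopy(connection))
            {
                bulk.DestinationTableName = "dbo.ExportedRows";   // placeholder table name

                foreach (DataRow row in source.Rows)
                {
                    chunk.ImportRow(row);
                    if (chunk.Rows.Count == 10000)
                    {
                        bulk.WriteToServer(chunk);
                        chunk.Clear();
                        if (!shouldContinue()) return;        // user chose to stop between chunks
                    }
                }
                if (chunk.Rows.Count > 0)
                    bulk.WriteToServer(chunk);                // remaining rows
            }
        }
    }
}
```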
Like all the other comments before, I suggest you use bulk insert. You will be amazed by how fast it performs on large datasets, and perhaps your concept of an interval will no longer be required. Inserting 100k records may take only seconds.
Depending on how your code is written, ADO.NET has native support for bulk insert through SqlBulkCopy; see the link below:
http://www.knowdotnet.com/articles/bulkcopy_intro1.html
If you have been using LINQ to SQL for your code, there is already some clever code written as an extension method to the DataContext which transforms the LINQ changeset into a DataSet and internally uses ADO.NET to achieve the bulk insert:
http://blogs.microsoft.co.il/blogs/aviwortzel/archive/2008/05/06/implementing-sqlbulkcopy-in-linq-to-sql.aspx

Clean design and implementation for data import library

I have been given the task of developing a small library (using C# 3.0 and .NET 3.5) to provide data import functionality for an application.
The spec is:
Data can be imported from CSV files (potentially other file formats in the future).
The CSV files can contain any schema and number of rows, with a maximum file size of 10 MB.
It must be possible to change the datatype and column name of each column in the CSV file.
It must be possible to exclude columns in the CSV file from the import.
Importing the data will result in a table matching the schema being created in a SQL Server database, and then being populated using rows in the CSV.
I've been playing around with ideas for a while now, and my current code feels like it has been hacked together a bit.
My current implementation approach is:
Open the CSV and estimate the schema; store it in an ImportSchema class.
Allow the schema to be modified.
Use SMO to create the table in SQL Server according to the schema.
Create a System.Data.DataTable instance using the schema for datatypes.
Use CsvReader to read the CSV data into the DataTable.
Apply column name changes and remove unwanted columns from the DataTable.
Use System.Data.SqlClient.SqlBulkCopy() to add the rows from the DataTable into the created database table.
This sounds overly complex to me and I am facing a mental block trying to wrap it up neatly in a handful of testable/extensible objects.
Any suggestions/thoughts on ways to approach this problem, both from an implementation and a design perspective?
Many thanks for any suggestions.
As suggested in some previous SO answers, take a look at the FileHelpers library. It might be at least helpful in your task of importing and analyzing the CSV files.
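
On the design side, one possible way to break those steps into a handful of testable, swappable pieces; the interface and class names below are only illustrative, not a prescribed API.

```csharp
using System.Data;

// Reads a file and proposes a schema (column names + guessed types).
public interface ISchemaReader
{
    ImportSchema ReadSchema(string path);
}

// Reads the file into a DataTable shaped by the (possibly edited) schema.
public interface IDataSource
{
    DataTable ReadData(string path, ImportSchema schema);
}

// Creates the destination table and bulk loads the rows.
public interface IImportTarget
{
    void CreateTable(ImportSchema schema);
    void WriteRows(DataTable rows);
}

// Orchestrates the pipeline; each dependency can be mocked in unit tests,
// and a new file format only needs new ISchemaReader/IDataSource implementations.
public class Importer
{
    private readonly ISchemaReader schemaReader;
    private readonly IDataSource dataSource;
    private readonly IImportTarget target;

    public Importer(ISchemaReader schemaReader, IDataSource dataSource, IImportTarget target)
    {
        this.schemaReader = schemaReader;
        this.dataSource = dataSource;
        this.target = target;
    }

    public void Import(string path)
    {
        ImportSchema schema = schemaReader.ReadSchema(path);
        // ... the caller edits the schema here (rename columns, change types, exclude columns) ...
        target.CreateTable(schema);
        target.WriteRows(dataSource.ReadData(path, schema));
    }
}

// Placeholder for the schema class the question already mentions.
public class ImportSchema { /* column names, types, include/exclude flags */ }
```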
