I have a CSV file with 350,000 rows; each row has about 150 columns.
What would be the best way to insert these rows into SQL Server using ADO.Net?
The way I've usually done it is to build the SQL statements manually. I was wondering whether there is a way to simply insert an entire DataTable into SQL Server, or some similar shortcut.
By the way, I already tried doing this with SSIS, but there are a few data clean-up issues which I can handle in C# but not so easily in SSIS. The data started as XML, but I changed it to CSV for simplicity.
Make a class CsvDataReader that implements IDataReader. Just implement Read(), GetValue(int i), Dispose() and the constructor: you can leave the rest throwing NotImplementedException if you want, because SqlBulkCopy won't call them. Use Read() to advance to the next line and GetValue(i) to return the i-th value in the current line.
Then pass it to SqlBulkCopy with the column mappings you need.
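Here's a minimal sketch of what that class can look like, assuming a plain comma-delimited file with no quoted fields; the file path, column count, destination table, and connection string are placeholders you'd replace with your own:

using System;
using System.Data;
using System.Data.SqlClient;
using System.IO;

// Minimal IDataReader over a comma-delimited file. Only the members SqlBulkCopy
// needs for a plain ordinal-mapped copy are implemented; the rest just throw.
public class CsvDataReader : IDataReader
{
    private readonly StreamReader _reader;
    private readonly int _fieldCount;
    private string[] _current;

    public CsvDataReader(string path, int fieldCount)
    {
        _reader = new StreamReader(path);
        _fieldCount = fieldCount;                     // number of columns in the file
    }

    public bool Read()
    {
        string line = _reader.ReadLine();             // one record per line
        if (line == null) return false;
        _current = line.Split(',');                   // naive split: no quoted fields
        return true;
    }

    public object GetValue(int i) { return _current[i]; }
    public int FieldCount { get { return _fieldCount; } }
    public void Dispose() { _reader.Dispose(); }
    public void Close() { Dispose(); }
    public bool IsClosed { get { return false; } }

    // SqlBulkCopy should not need anything below for a plain ordinal mapping.
    public DataTable GetSchemaTable() { throw new NotImplementedException(); }
    public bool NextResult() { throw new NotImplementedException(); }
    public int Depth { get { throw new NotImplementedException(); } }
    public int RecordsAffected { get { throw new NotImplementedException(); } }
    public string GetName(int i) { throw new NotImplementedException(); }
    public int GetOrdinal(string name) { throw new NotImplementedException(); }
    public string GetDataTypeName(int i) { throw new NotImplementedException(); }
    public Type GetFieldType(int i) { throw new NotImplementedException(); }
    public int GetValues(object[] values) { throw new NotImplementedException(); }
    public bool GetBoolean(int i) { throw new NotImplementedException(); }
    public byte GetByte(int i) { throw new NotImplementedException(); }
    public long GetBytes(int i, long fo, byte[] buf, int off, int len) { throw new NotImplementedException(); }
    public char GetChar(int i) { throw new NotImplementedException(); }
    public long GetChars(int i, long fo, char[] buf, int off, int len) { throw new NotImplementedException(); }
    public Guid GetGuid(int i) { throw new NotImplementedException(); }
    public short GetInt16(int i) { throw new NotImplementedException(); }
    public int GetInt32(int i) { throw new NotImplementedException(); }
    public long GetInt64(int i) { throw new NotImplementedException(); }
    public float GetFloat(int i) { throw new NotImplementedException(); }
    public double GetDouble(int i) { throw new NotImplementedException(); }
    public string GetString(int i) { throw new NotImplementedException(); }
    public decimal GetDecimal(int i) { throw new NotImplementedException(); }
    public DateTime GetDateTime(int i) { throw new NotImplementedException(); }
    public IDataReader GetData(int i) { throw new NotImplementedException(); }
    public bool IsDBNull(int i) { throw new NotImplementedException(); }
    public object this[int i] { get { throw new NotImplementedException(); } }
    public object this[string name] { get { throw new NotImplementedException(); } }
}

// Usage: with no explicit mappings, columns map by ordinal to the destination table.
using (var reader = new CsvDataReader(@"C:\data\rows.csv", 150))
using (var bulk = new SqlBulkCopy("your-connection-string"))
{
    bulk.DestinationTableName = "dbo.ImportedRows";   // placeholder table name
    bulk.BatchSize = 5000;
    bulk.WriteToServer(reader);
}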
I get an insert speed of about 30,000 records per second with that method.
If you have control of the source file format, make it tab delimited as it's easier to parse than CSV.
Edit: http://www.codeproject.com/KB/database/CsvReader.aspx - thanks Mark Gravell.
SqlBulkCopy if it's available. Here is a very helpful explanation of using SqlBulkCopy in ADO.NET 2.0 with C#
I think you can load your XML directly into a DataSet and then map your SqlBulkCopy to the database and the DataSet.
You should go back to XML instead of CSV, then load that XML file into a temp table using OPENXML, clean up your data in the temp table, and finally process it.
I have been following this approach for huge data imports where my XML files happen to be > 500 MB in size, and OPENXML works like a charm.
You would be surprised how much faster this works compared to manual ADO.NET statements.
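A rough sketch of that flow from C#, assuming a simple <Rows><Row>...</Row></Rows> layout; the element names, column list, and target table are placeholders, and for a 500 MB file you would want to get the XML to the server more efficiently than as a single string parameter:

using System.Data;
using System.Data.SqlClient;
using System.IO;

string xml = File.ReadAllText(@"C:\data\import.xml");

const string sql = @"
DECLARE @hdoc int;
EXEC sp_xml_preparedocument @hdoc OUTPUT, @xml;

-- shred the XML into a temp table
SELECT Code, Name, Price
INTO #Staging
FROM OPENXML(@hdoc, '/Rows/Row', 2)
WITH (Code varchar(20), Name nvarchar(100), Price decimal(18,2));

EXEC sp_xml_removedocument @hdoc;

-- ... clean-up queries against #Staging go here ...

INSERT INTO dbo.TargetTable (Code, Name, Price)
SELECT Code, Name, Price FROM #Staging;";

using (var conn = new SqlConnection("your-connection-string"))
using (var cmd = new SqlCommand(sql, conn))
{
    cmd.CommandTimeout = 0;                                   // big imports take a while
    cmd.Parameters.Add("@xml", SqlDbType.NVarChar, -1).Value = xml;
    conn.Open();
    cmd.ExecuteNonQuery();
}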
I am trying to perform insert/update/delete operations on a SQL table based on an input CSV file, which is loaded into a DataTable from a web application. Currently I am using a DataSet to do the CRUD operations, but I would like to know whether there would be any advantages to using LINQ over a DataSet. I assume the code would be shorter and more strongly typed, but I'm not sure whether I should switch to LINQ. Any input is appreciated.
Edit
It is not a bulk operation; the CSV might contain 200 records max.
I used the LumenWorks CSV reader, which is very fast. It has its own API for extracting data, built on the IDataReader interface, and there is a brief example on codeplex.com. I use it for all my CSV projects and was surprised at how fast it actually is.
If you start from a reader like this, you're essentially working against a data reader API, so moving the data into a DataTable is straightforward (you can create a DataTable matching the result set and copy the data over column by column).
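For example, something like this (a sketch assuming the LumenWorks CsvReader and a CSV file with a header row; the path is a placeholder):

using System.Data;
using System.IO;
using LumenWorks.Framework.IO.Csv;

var table = new DataTable();
using (var csv = new CsvReader(new StreamReader(@"C:\data\import.csv"), true))
{
    // CsvReader implements IDataReader, so DataTable.Load pulls everything in,
    // creating one string column per CSV header.
    table.Load(csv);
}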
A lot of updates can be slower with LINQ, depending on whether you are using Entity Framework or something else, and which flavor you are using. A DataTable, IMHO, would probably be faster. I had issues with LINQ change tracking with a lot of objects (if you are using attached entities rather than POCOs). I've had pretty good performance taking a CSV file from LumenWorks and copying it to a DataTable.
I have a Windows Service application that receives a stream of data with the following format
IDX|20120512|075659|00000002|3|AALI |Astra Agro Lestari Tbk. |0|ORDI_PREOPEN|12 |00000001550.00|00000001291.67|00001574745000|00001574745000|00500|XDS1BXO1| |00001574745000|ݤ
IDX|20120512|075659|00000022|3|ALMI |Alumindo Light Metal Industry Tbk. |0|ORDI |33 |00000001300.00|00000001300.00|00000308000000|00000308000000|00500|--U3---2| |00000308000000|õÄ
This data comes in millions of rows, in sequence 00000002....00198562, and I have to parse it and insert it into a database table in that order.
My question is: what is the best (most effective) way to insert this data into my database? I have tried a simple approach: open a SqlConnection, build up a string of SQL INSERT statements, and execute the script with a SqlCommand object, but this method takes too long.
I read that I can use SQL BULK INSERT, but it has to read from a text file. Is it possible to use BULK INSERT in this scenario? (I have never used it before.)
Thank you
Update: I'm aware of SqlBulkCopy, but it seems to require a DataTable first; is that good for performance? If possible I want to insert directly from my data source into SQL Server without having to build an in-memory DataTable.
If you are writing this in C# you might want to look at the SqlBulkCopy class.
Lets you efficiently bulk load a SQL Server table with data from another source.
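A rough sketch of how that can look; the table, columns, and connection string are placeholders, and the reader can be any IDataReader over your incoming stream (for example a custom one along the lines of the CsvDataReader idea earlier), so no in-memory DataTable is required:

using System.Data;
using System.Data.SqlClient;

static void BulkLoad(IDataReader reader, string connectionString)
{
    using (var bulk = new SqlBulkCopy(connectionString))
    {
        bulk.DestinationTableName = "dbo.Quotes";   // placeholder target table
        bulk.BatchSize = 10000;                     // commit in chunks
        bulk.BulkCopyTimeout = 0;                   // no timeout for large loads

        // Map source field ordinals to destination columns by name.
        bulk.ColumnMappings.Add(0, "Exchange");
        bulk.ColumnMappings.Add(1, "TradeDate");
        bulk.ColumnMappings.Add(2, "TradeTime");
        bulk.ColumnMappings.Add(3, "SequenceNo");

        bulk.WriteToServer(reader);                 // streams rows as they are read
    }
}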
First, download the free LumenWorks.Framework.IO.Csv library.
Second, use code like this:
var sr = new StreamReader(yourStream);
var sbc = new SqlBulkCopy(connectionString) { DestinationTableName = "dbo.YourTable" };  // set your target table
sbc.WriteToServer(new LumenWorks.Framework.IO.Csv.CsvReader(sr, true));  // true = the file has a header row
Yeah, it is really that easy.
You can use SSIS (SQL Server Integration Services) to move data from a source data flow to a destination data flow.
The source can be a text file and the destination can be a SQL Server table; the load executes in bulk insert mode.
I have a text file that contains about a million records. What is the best way to insert them into a SQL Server database from C#?
Can I use BULK INSERT?
The best way is to use the bcp utility or an SSIS workflow; those tools have refinements such as caching and batching that you will miss in a naive implementation. The next best option is the BULK INSERT statement, as long as the SQL Server engine itself can reach the file. The last option is the SqlBulkCopy class, which lets your app read the file, possibly process and transform it, and then feed the data to SqlBulkCopy through an IDataReader.
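For the BULK INSERT route, here is a sketch of issuing it from C# (the file path must be visible to the SQL Server machine; the table name, path, and terminators are placeholders to adjust for your file):

using System.Data.SqlClient;

const string sql = @"
BULK INSERT dbo.ImportedRecords
FROM 'D:\feeds\records.txt'
WITH (
    FIELDTERMINATOR = ',',       -- adjust to your delimiter
    ROWTERMINATOR   = '\n',
    BATCHSIZE       = 50000,
    TABLOCK
);";

using (var conn = new SqlConnection("your-connection-string"))
using (var cmd = new SqlCommand(sql, conn))
{
    cmd.CommandTimeout = 0;      // a million rows can exceed the default 30 seconds
    conn.Open();
    cmd.ExecuteNonQuery();
}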
I recently worked on the same kind of problem and realized there are a couple of solutions to it.
I wrote a batch program (for batch program design, read this: http://msdn.microsoft.com/en-us/magazine/cc164014.aspx).
You can use the SQL Server utilities BCP.exe or OSQL.exe, or the .NET Framework's SqlBulkCopy class.
I ended up using BCP (I had a CSV file and used a format file to load the data) and OSQL (where I had to supply a file to a stored proc).
I also used the .NET Process class and its OutputDataReceived event to log all output from BCP.exe to the console (read this: http://msdn.microsoft.com/en-us/library/system.diagnostics.process.outputdatareceived.aspx); this worked pretty well.
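A sketch of that Process wiring; the bcp arguments (database, table, data file, format file, server) are illustrative only:

using System;
using System.Diagnostics;

var psi = new ProcessStartInfo
{
    FileName = "bcp.exe",
    // <table> in <datafile> -f <formatfile> -T (trusted connection) -S <server>
    Arguments = @"MyDb.dbo.Target in D:\feeds\data.csv -f D:\feeds\data.fmt -T -S .\SQLEXPRESS",
    UseShellExecute = false,             // required to redirect output
    RedirectStandardOutput = true,
    RedirectStandardError = true
};

using (var proc = new Process { StartInfo = psi })
{
    proc.OutputDataReceived += (s, e) => { if (e.Data != null) Console.WriteLine(e.Data); };
    proc.ErrorDataReceived  += (s, e) => { if (e.Data != null) Console.Error.WriteLine(e.Data); };
    proc.Start();
    proc.BeginOutputReadLine();          // pumps OutputDataReceived asynchronously
    proc.BeginErrorReadLine();
    proc.WaitForExit();
}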
I also tried the SqlBulkCopy class; it can be slow if you load the data into a DataTable first (http://msdn.microsoft.com/en-us/library/ex21zs8x.aspx), but if you use an IDataReader (http://msdn.microsoft.com/en-us/library/434atets.aspx) it can be fast.
Since I had millions of rows I tried using CsvReader (http://www.codeproject.com/Articles/9258/A-Fast-CSV-Reader), which is pretty fast, but down the line there were too many problems with data conversion and I did not have much flexibility on the SQL Server side.
I ended up using BCP and OSQL.
Is using C# a requirement? The fastest way is to use the bcp command line tool: http://msdn.microsoft.com/en-us/library/ms162802.aspx
What's the best way to import a small csv file into SQL Server using an ASP.NET form with C#? I know there are many ways to do this, but I'm wondering which classes would be best to read the file and how to insert into the database. Do I read the file into a DataTable and then use the SqlBulkCopy class, or just insert the data using ADO.NET? Not sure which way is best. I'm after the simplest solution and am not concerned about scalability or performance as the csv files are tiny.
Using ASP.NET 4.0, C# 4.0 and SQL Server 2008 R2.
The DataTable and SqlBulkCopy classes will do just fine, and that is the way I would prefer to do it: if these tiny CSV files someday grow larger, your program will already be ready for it, whereas plain ADO.NET inserts add some overhead by treating one row at a time.
EDIT #1
What's the best way to get from a CSV file to a DataTable?
The CSV format is nothing more than a text file. As such, you might want to read it using the File.ReadAllLines(string) method, which returns a string[]. Then you can add rows to your DataTable using the DataRow class, or whichever way you prefer.
Consider adding your columns when defining your DataTable so that it knows its structure when adding rows.
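A minimal sketch of that approach, assuming a comma-separated file with a header row and no quoted fields (the path and column names are placeholders):

using System.Data;
using System.IO;
using System.Linq;

var table = new DataTable();
table.Columns.Add("Name");               // define the structure up front
table.Columns.Add("Email");
table.Columns.Add("Phone");

foreach (string line in File.ReadAllLines(@"C:\uploads\contacts.csv").Skip(1))
{
    string[] fields = line.Split(',');   // naive split: no quoted fields
    table.Rows.Add(fields[0], fields[1], fields[2]);
}

From there the DataTable goes straight to SqlBulkCopy.WriteToServer.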
What's the most efficient method to load large volumes of data from CSV (3 million+ rows) into a database?
The data needs to be formatted (e.g. the name column needs to be split into first name and last name, etc.).
I need to do this as efficiently as possible, i.e. there are time constraints.
I am leaning toward reading, transforming and loading the data row by row with a C# application. Is this ideal? If not, what are my options? Should I use multithreading?
You will be I/O bound, so multithreading will not necessarily make it run any faster.
Last time I did this, it was about a dozen lines of C#. In one thread it ran the hard disk as fast as it could read data from the platters. I read one line at a time from the source file.
If you're not keen on writing it yourself, you could try the FileHelpers libraries. You might also want to have a look at Sébastien Lorion's work. His CSV reader is written specifically to deal with performance issues.
You could use the csvreader to quickly read the CSV.
Assuming you're using SQL Server, you can use csvreader's CachedCsvReader to read the data into a DataTable, which you can then use with SqlBulkCopy to load into SQL Server.
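Roughly like this, assuming the LumenWorks CachedCsvReader, a header row, and placeholder table/connection names:

using System.Data;
using System.Data.SqlClient;
using System.IO;
using LumenWorks.Framework.IO.Csv;

using (var csv = new CachedCsvReader(new StreamReader(@"C:\data\input.csv"), true))
{
    var table = new DataTable();
    table.Load(csv);                                 // CachedCsvReader is an IDataReader

    using (var bulk = new SqlBulkCopy("your-connection-string"))
    {
        bulk.DestinationTableName = "dbo.Staging";   // placeholder table
        bulk.WriteToServer(table);
    }
}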
I would agree with your solution. Reading the file one line at a time should avoid the overhead of reading the whole file into memory at once, which should make the application run quickly and efficiently, primarily taking time to read from the file (which is relatively quick) and parse the lines. The one note of caution I have for you is to watch out if you have embedded newlines in your CSV. I don't know if the specific CSV format you're using might actually output newlines between quotes in the data, but that could confuse this algorithm, of course.
Also, I would suggest batching the insert statements (including many INSERT statements in one command) before sending them to the database, as long as this doesn't cause problems retrieving generated key values that you need for subsequent foreign keys (hopefully you don't need to retrieve any). Keep in mind that SQL Server (if that's what you're using) can only handle 2,100 parameters per batch, so size your batches to account for that. I would also recommend using parameterized T-SQL statements to perform the inserts. I suspect more time will be spent inserting records than reading them from the file.
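A sketch of that batching idea, assuming a three-column target table (the Person type, table, and column names are placeholders); at 3 parameters per row, 500 rows per batch stays well under the 2,100-parameter limit:

using System;
using System.Collections.Generic;
using System.Data.SqlClient;
using System.Text;

class Person { public string FirstName, LastName, Email; }

static void InsertBatched(IList<Person> rows, SqlConnection conn)
{
    const int rowsPerBatch = 500;                    // 3 params/row -> 1,500 params/batch
    for (int start = 0; start < rows.Count; start += rowsPerBatch)
    {
        int end = Math.Min(start + rowsPerBatch, rows.Count);
        var sql = new StringBuilder();
        using (var cmd = new SqlCommand { Connection = conn })
        {
            for (int i = start; i < end; i++)
            {
                sql.AppendFormat(
                    "INSERT INTO dbo.People (FirstName, LastName, Email) VALUES (@f{0}, @l{0}, @e{0});", i);
                cmd.Parameters.AddWithValue("@f" + i, rows[i].FirstName);
                cmd.Parameters.AddWithValue("@l" + i, rows[i].LastName);
                cmd.Parameters.AddWithValue("@e" + i, rows[i].Email);
            }
            cmd.CommandText = sql.ToString();
            cmd.ExecuteNonQuery();                   // one round trip per batch
        }
    }
}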
You don't state which database you're using, but given the language you mention is C# I'm going to assume SQL Server.
If the data can't be imported using BCP (which it sounds like it can't if it needs significant processing) then SSIS is likely to be the next fastest option. It's not the nicest development platform in the world, but it is extremely fast. Certainly faster than any application you could write yourself in any reasonable timeframe.
BCP is pretty quick so I'd use that for loading the data. For string manipulation I'd go with a CLR function on SQL once the data is there. Multi-threading won't help in this scenario except to add complexity and hurt performance.
Read the contents of the CSV file line by line into an in-memory DataTable. You can manipulate the data (i.e. split the first name and last name, etc.) as the DataTable is being populated.
Once the CSV data has been loaded into memory, use SqlBulkCopy to send it to the database.
See http://msdn.microsoft.com/en-us/library/system.data.sqlclient.sqlbulkcopy.writetoserver.aspx for the documentation.
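A minimal sketch of that, assuming a simple "Full Name,Email" layout with no quoted fields; the file path, table, and column names are placeholders:

using System.Data;
using System.Data.SqlClient;
using System.IO;

var table = new DataTable();
table.Columns.Add("FirstName");
table.Columns.Add("LastName");
table.Columns.Add("Email");

using (var reader = new StreamReader(@"C:\data\people.csv"))
{
    string line;
    while ((line = reader.ReadLine()) != null)       // stream; don't hold the raw file in memory
    {
        string[] fields = line.Split(',');           // naive split: no quoted fields
        string[] name = fields[0].Split(' ');        // "First Last" -> two columns
        table.Rows.Add(name[0], name.Length > 1 ? name[1] : "", fields[1]);
    }
}

using (var bulk = new SqlBulkCopy("your-connection-string"))
{
    bulk.DestinationTableName = "dbo.People";
    bulk.WriteToServer(table);
}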
If you really want to do it in C#, create & populate a DataTable, truncate the target db table, then use System.Data.SqlClient.SqlBulkCopy.WriteToServer(DataTable dt).