I am trying to perform insert/update/delete operations on a SQL table based on an input CSV file that is loaded into a DataTable from a web application. Currently I am using a DataSet for the CRUD operations, but I would like to know whether there are any advantages to using LINQ over a DataSet. I assume the code will be shorter and more strongly typed, but I am not sure whether I need to switch to LINQ. Any input appreciated.
Edit
It is not a bulk operation; the CSV might contain 200 records at most.
I used the LumenWorks CSV reader, which is very fast. It has its own API for extracting data via the IDataReader interface; a brief example is available on codeplex.com. I use it for all my CSV projects because it is very fast at reading CSV data. I was surprised at how fast it actually was.
Since a reader like this exposes a data reader API, it works naturally with a DataTable: you could create a DataTable matching the result set and easily copy data over, column to column.
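A minimal sketch of that reader-to-DataTable hand-off, assuming the LumenWorks NuGet package and a hypothetical input file name:

```csharp
using System.Data;
using System.IO;
using LumenWorks.Framework.IO.Csv; // NuGet: LumenWorksCsvReader

// CsvReader implements IDataReader, so DataTable.Load can consume it directly.
var table = new DataTable();
using (var csv = new CsvReader(new StreamReader("input.csv"), true)) // true = first row is a header
{
    table.Load(csv); // column names come from the CSV header row
}
```

From here the DataTable can be diffed against the database or handed to a bulk-load API.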
A lot of updates can be slower with LINQ, depending on whether you are using Entity Framework or something else, and which flavor you are using. A DataTable, IMHO, would probably be faster. I have had issues with LINQ and change tracking over many objects (when using attached entities rather than POCOs). I have had pretty good performance taking a CSV file from LumenWorks and copying it to a DataTable.
Related
I am currently working on a small .NET app in C#, that fetches data through some web service.
The data is represented in objects, so it would have been logical to store the data in a document based database, but there is a demand to use SQL Server.
So what might be the fastest way to insert many thousands, perhaps millions of rows into a database.
I am open to any framework that might support that, but I haven't been able to find any benchmarks on this, e.g. for Entity Framework.
Iterating over the data and doing an insert per row is simply too slow; it would be quicker to dump the data to a file and then do a bulk import using SSIS, but in this scenario I would rather avoid that and keep all the logic in the C# app.
You might want to use the SqlBulkCopy class. It is quite efficient for large data sets.
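A sketch of the basic usage (the destination table name and connection string are placeholders, and the DataTable's columns are assumed to match the target schema):

```csharp
using System.Data;
using Microsoft.Data.SqlClient; // or System.Data.SqlClient on .NET Framework

static void BulkInsert(string connectionString, DataTable dataTable)
{
    using (var connection = new SqlConnection(connectionString))
    {
        connection.Open();
        using (var bulk = new SqlBulkCopy(connection))
        {
            bulk.DestinationTableName = "dbo.TargetTable";
            bulk.BatchSize = 5000;         // tune for your data
            bulk.WriteToServer(dataTable); // streams rows in bulk, far faster than per-row INSERTs
        }
    }
}
```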
I'm wondering if there is a faster method than using something like:
while (Reader.Read())
to read the results of mysql select queries.
I'm randomly pulling 10,000 rows from a database and would like to read it as quickly as possible. Is there a way to serialize the results if we know what they are (such as using the metadata to setup a structure)?
Try the MySqlDataAdapter.Fill method to fill a DataTable. Read speed is comparable to an optimal hand-written loop over Read() (it depends on how your while block reads the data), and the main advantage is that you end up with a prepared data collection that you can manage or simply write out to an XML file.
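A sketch of that approach (the query, table name, and connection string are illustrative, matching the "random 10,000 rows" scenario):

```csharp
using System.Data;
using MySql.Data.MySqlClient; // NuGet: MySql.Data

static DataTable FetchRandomRows(string connectionString)
{
    var table = new DataTable();
    using (var connection = new MySqlConnection(connectionString))
    using (var adapter = new MySqlDataAdapter(
        "SELECT * FROM items ORDER BY RAND() LIMIT 10000", connection))
    {
        adapter.Fill(table); // Fill opens and closes the connection itself if needed
    }
    return table;
}
```

Once filled, `table.WriteXml("items.xml")` gives you the XML output mentioned above.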
What's the most efficient method to load large volumes of data from CSV (3 million + rows) to a database.
The data needs to be formatted (e.g. the name column needs to be split into first name and last name, etc.).
I need to do this as efficiently as possible, i.e. there are time constraints.
I am leaning toward the option of reading, transforming, and loading the data row by row using a C# application. Is this ideal? If not, what are my options? Should I use multithreading?
You will be I/O bound, so multithreading will not necessarily make it run any faster.
Last time I did this, it was about a dozen lines of C#. In one thread it ran the hard disk as fast as it could read data from the platters. I read one line at a time from the source file.
If you're not keen on writing it yourself, you could try the FileHelpers libraries. You might also want to have a look at Sébastien Lorion's work. His CSV reader is written specifically to deal with performance issues.
You could use the csvreader to quickly read the CSV.
Assuming you're using SQL Server, you use csvreader's CachedCsvReader to read the data into a DataTable which you can use with SqlBulkCopy to load into SQL Server.
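A sketch of that pipeline, assuming the LumenWorks package and placeholder file, table, and connection-string names:

```csharp
using System.Data;
using System.Data.SqlClient;
using System.IO;
using LumenWorks.Framework.IO.Csv; // NuGet: LumenWorksCsvReader

static void LoadCsvIntoSqlServer(string connectionString)
{
    // Cache the CSV in a DataTable, then bulk-load it into SQL Server.
    var table = new DataTable();
    using (var csv = new CachedCsvReader(new StreamReader("data.csv"), true))
    {
        table.Load(csv); // the header row supplies the column names
    }

    using (var bulk = new SqlBulkCopy(connectionString))
    {
        bulk.DestinationTableName = "dbo.Target";
        bulk.WriteToServer(table);
    }
}
```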
I would agree with your solution. Reading the file one line at a time should avoid the overhead of reading the whole file into memory at once, which should make the application run quickly and efficiently, primarily taking time to read from the file (which is relatively quick) and parse the lines. The one note of caution I have for you is to watch out if you have embedded newlines in your CSV. I don't know if the specific CSV format you're using might actually output newlines between quotes in the data, but that could confuse this algorithm, of course.
Also, I would suggest batching the insert statements (including many insert statements in one string) before sending them to the database, provided this doesn't cause problems retrieving generated key values that you need for subsequent foreign keys (hopefully you don't need to retrieve any). Keep in mind that SQL Server (if that's what you're using) can only handle 2100 parameters per batch, so limit your batch size to account for that. And I would recommend using parameterized T-SQL statements to perform the inserts. I suspect more time will be spent inserting records than reading them from the file.
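A sketch of that batching pattern (table and column names are illustrative; with two parameters per row, 500 rows per batch stays well under the 2,100-parameter cap):

```csharp
using System;
using System.Collections.Generic;
using System.Data.SqlClient;
using System.Text;

static void InsertBatched(SqlConnection connection,
                          IReadOnlyList<(string First, string Last)> rows)
{
    const int maxRowsPerBatch = 500;
    for (int start = 0; start < rows.Count; start += maxRowsPerBatch)
    {
        int end = Math.Min(start + maxRowsPerBatch, rows.Count);
        var sql = new StringBuilder();
        using (var command = new SqlCommand { Connection = connection })
        {
            for (int i = start; i < end; i++)
            {
                // One parameterized INSERT per row, all sent in a single round trip.
                sql.AppendLine(
                    $"INSERT INTO dbo.People (FirstName, LastName) VALUES (@f{i}, @l{i});");
                command.Parameters.AddWithValue($"@f{i}", rows[i].First);
                command.Parameters.AddWithValue($"@l{i}", rows[i].Last);
            }
            command.CommandText = sql.ToString();
            command.ExecuteNonQuery();
        }
    }
}
```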
You don't state which database you're using, but given the language you mention is C# I'm going to assume SQL Server.
If the data can't be imported using BCP (which it sounds like it can't if it needs significant processing) then SSIS is likely to be the next fastest option. It's not the nicest development platform in the world, but it is extremely fast. Certainly faster than any application you could write yourself in any reasonable timeframe.
BCP is pretty quick so I'd use that for loading the data. For string manipulation I'd go with a CLR function on SQL once the data is there. Multi-threading won't help in this scenario except to add complexity and hurt performance.
Read the contents of the CSV file line by line into an in-memory DataTable. You can manipulate the data (e.g. split the first name and last name) as the DataTable is being populated.
Once the CSV data has been loaded in memory then use SqlBulkCopy to send the data to the database.
See http://msdn.microsoft.com/en-us/library/system.data.sqlclient.sqlbulkcopy.writetoserver.aspx for the documentation.
If you really want to do it in C#, create & populate a DataTable, truncate the target db table, then use System.Data.SqlClient.SqlBulkCopy.WriteToServer(DataTable dt).
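A compilable sketch of the populate-while-transforming step (column names and the naive comma split are assumptions; real CSVs with quoted fields need a proper parser):

```csharp
using System;
using System.Collections.Generic;
using System.Data;

// Build a DataTable from CSV lines of the form "Full Name,email",
// splitting the name into first/last while populating.
static DataTable LoadCsv(IEnumerable<string> lines)
{
    var table = new DataTable();
    table.Columns.Add("FirstName", typeof(string));
    table.Columns.Add("LastName", typeof(string));
    table.Columns.Add("Email", typeof(string));

    foreach (var line in lines)
    {
        var fields = line.Split(',');                 // naive split: no quoted fields
        var name = fields[0].Split(new[] { ' ' }, 2); // at most two parts
        table.Rows.Add(name[0], name.Length > 1 ? name[1] : "", fields[1]);
    }
    return table;
}
```

Once populated, hand the table to `SqlBulkCopy.WriteToServer` as described above.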
Say I have a few tables in the MSSQL database, each with about 5-10 attributes. There are some simple associations between the tables, but each table has 500,000 to 1,000,000 rows.
There is an algorithm that runs on that data (all of it), so before running the algorithm, I have to retrieve all the data from the database. The algorithm does not change the data, only reads it, so I just need to retrieve the data.
I am using LINQ to SQL. To retrieve all the data takes about two minutes. What I want to know is whether the serialization to file and then deserialization (when needed) would actually load the data faster.
The data is about 200 MB, and I don't mind saving it to disk. So, would it be faster if the objects were deserialized from the file or by using LINQ 2 SQL DataContext?
Any experiences with this?
I would argue that LINQ to SQL may not be the best choice for this kind of application. With so many objects, you incur considerable overhead creating object instances (your persistent classes).
I would choose a solution where a stored procedure retrieves only the necessary data via ADO.NET, the application stores it in memory (memory is cheap nowadays, 200MB should not be a problem) and the analyzing algorithm is run on the in-memory data.
I don't think you should store the data in a file. In the end, your database is also simply one or more files that are read by the database engine. So you either
let the database engine read your data and you analyze it, or
let the database engine read your data, you write it to file, you read the file (reading the same data again, but now you do it yourself) and you analyze the data
The latter option involves a lot of overhead without any advantages as far as I can see.
EDIT: If your data changes very infrequently, you may consider preprocessing your data before analyzing and caching the preprocessed data somewhere (in the database or on the file system). This only makes sense if your preprocessed data can be analyzed (a lot) faster than the raw data. Maybe some preprocessing can be done in the database itself.
You should try to use ADO.NET directly without the LINQ to SQL layer on top of it, i.e. using an SqlDataReader to read the data.
If you work sequentially with the data, you can get the records from the reader when you need them without having to read them all into memory first.
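A sketch of that streaming approach with plain ADO.NET (the query, table, and per-row consumer are assumptions):

```csharp
using System.Data.SqlClient;

static void StreamRows(string connectionString)
{
    // Stream rows one at a time instead of materializing them all in memory.
    using (var connection = new SqlConnection(connectionString))
    using (var command = new SqlCommand("SELECT Id, Value FROM dbo.BigTable", connection))
    {
        connection.Open();
        using (var reader = command.ExecuteReader())
        {
            while (reader.Read())
            {
                int id = reader.GetInt32(0);       // ordinal access skips name lookups
                double value = reader.GetDouble(1);
                Analyze(id, value);                // hypothetical per-row consumer
            }
        }
    }
}

static void Analyze(int id, double value) { /* run the algorithm's per-row step */ }
```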
If you have a process that operates on most of the data in a database... then that sounds like a job for a stored procedure. It won't be object oriented, but it will be a lot faster and less brittle.
Since you are doing this in C# and your database is MS SQL (since you use LINQ to SQL), could you not run your code in a managed stored procedure? That would allow you to keep your current code as it is, but loading the data would be much faster since the code would be running inside SQL Server.
I have a csv file with 350,000 rows, each row has about 150 columns.
What would be the best way to insert these rows into SQL Server using ADO.Net?
The way I've usually done it is to create the SQL statement manually. I was wondering if there is any way I can code it to simply insert the entire datatable into SQL Server? Or some short-cut like this.
By the way I already tried doing this with SSIS, but there are a few data clean-up issues which I can handle with C# but not so easily with SSIS. The data started as XML, but I changed it to CSV for simplicity.
Make a class "CsvDataReader" that implements IDataReader. Just implement Read(), GetValue(int i), FieldCount, Dispose() and the constructor; you can leave the rest throwing NotImplementedException if you want, because SqlBulkCopy won't call them. Use Read() to handle reading each line and GetValue() to read the i-th value in the line.
Then pass it to the SqlBulkCopy with the appropriate column mappings you want.
I get about 30,000 records per second insert speed with that method.
If you have control of the source file format, make it tab delimited as it's easier to parse than CSV.
Edit: http://www.codeproject.com/KB/database/CsvReader.aspx - thanks, Marc Gravell.
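A compilable sketch of such a class (naive splitting, no quoted-field handling; the real members do the work and the rest throw):

```csharp
using System;
using System.Data;
using System.IO;

// Just enough IDataReader for SqlBulkCopy: Read(), GetValue(), FieldCount
// and Dispose() are implemented; most other members simply throw.
public sealed class CsvDataReader : IDataReader
{
    private readonly TextReader _source;
    private readonly char _delimiter;
    private string[] _current = Array.Empty<string>();

    public CsvDataReader(TextReader source, int fieldCount, char delimiter = ',')
    {
        _source = source;
        FieldCount = fieldCount;
        _delimiter = delimiter;
    }

    public int FieldCount { get; }

    public bool Read()
    {
        string line = _source.ReadLine();
        if (line == null) return false;
        _current = line.Split(_delimiter); // naive: no quoted fields
        return true;
    }

    public object GetValue(int i) => _current[i];
    public void Dispose() => _source.Dispose();
    public void Close() => Dispose();
    public bool IsClosed => false;
    public bool NextResult() => false;
    public int RecordsAffected => -1;
    public int Depth => 0;
    public DataTable GetSchemaTable() => throw new NotImplementedException();

    public object this[int i] => GetValue(i);
    public object this[string name] => throw new NotImplementedException();
    public string GetName(int i) => "Column" + i;
    public int GetOrdinal(string name) => throw new NotImplementedException();
    public Type GetFieldType(int i) => typeof(string);
    public string GetString(int i) => _current[i];
    public int GetValues(object[] values) => throw new NotImplementedException();
    public bool IsDBNull(int i) => false;
    public bool GetBoolean(int i) => throw new NotImplementedException();
    public byte GetByte(int i) => throw new NotImplementedException();
    public long GetBytes(int i, long o, byte[] b, int bo, int l) => throw new NotImplementedException();
    public char GetChar(int i) => throw new NotImplementedException();
    public long GetChars(int i, long o, char[] b, int bo, int l) => throw new NotImplementedException();
    public IDataReader GetData(int i) => throw new NotImplementedException();
    public string GetDataTypeName(int i) => throw new NotImplementedException();
    public DateTime GetDateTime(int i) => throw new NotImplementedException();
    public decimal GetDecimal(int i) => throw new NotImplementedException();
    public double GetDouble(int i) => throw new NotImplementedException();
    public float GetFloat(int i) => throw new NotImplementedException();
    public Guid GetGuid(int i) => throw new NotImplementedException();
    public short GetInt16(int i) => throw new NotImplementedException();
    public int GetInt32(int i) => throw new NotImplementedException();
    public long GetInt64(int i) => throw new NotImplementedException();
}
```

Pass an instance to `SqlBulkCopy.WriteToServer` with the column mappings set up by ordinal.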
SqlBulkCopy if it's available. Here is a very helpful explanation of using SqlBulkCopy in ADO.NET 2.0 with C#
I think you can load your XML directly into a DataSet and then map your SqlBulkCopy to the database and the DataSet.
You should revert to XML instead of CSV, then load that XML file into a temp table using OPENXML, clean up your data in the temp table, and then process it.
I have been following this approach for huge data imports where my XML files happen to be > 500 MB in size, and OPENXML works like a charm.
You would be surprised how much faster this works compared to manual ADO.NET statements.
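An illustrative sketch of the shape of that approach (the XML element names, staging columns, and connection details are all assumptions):

```csharp
using System.Data.SqlClient;

static void ShredXmlIntoStaging(string connectionString, string xmlDocumentText)
{
    // Shred the XML into a temp table with OPENXML (flag 2 = element-centric),
    // then clean/process it in T-SQL before inserting into the real table.
    const string sql = @"
DECLARE @h int;
EXEC sp_xml_preparedocument @h OUTPUT, @xml;
SELECT FirstName, LastName
INTO   #Staging
FROM   OPENXML(@h, '/Rows/Row', 2)
       WITH (FirstName nvarchar(50), LastName nvarchar(50));
EXEC sp_xml_removedocument @h;
-- clean up #Staging here, then INSERT ... SELECT into the destination table
";

    using (var connection = new SqlConnection(connectionString))
    using (var command = new SqlCommand(sql, connection))
    {
        command.Parameters.AddWithValue("@xml", xmlDocumentText);
        connection.Open();
        command.ExecuteNonQuery();
    }
}
```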