I have been given the task of developing a small library (using C# 3.0 and .NET 3.5) to provide data import functionality for an application.
The spec is:
- Data can be imported from a CSV file (potentially other file formats in the future).
- The CSV files can contain any schema and any number of rows, with a maximum file size of 10 MB.
- It must be possible to change the data type and column name of each column in the CSV file.
- It must be possible to exclude columns in the CSV file from the import.
- Importing the data will result in a table matching the schema being created in a SQL Server database, which is then populated using the rows in the CSV.
I've been playing around with ideas for a while now, but my current code feels like it has been hacked together a bit.
My current implementation approach is:
1. Open the CSV and estimate the schema, storing it in an ImportSchema class (see the sketch after this list).
2. Allow the schema to be modified.
3. Use SMO to create the table in SQL Server according to the schema.
4. Create a System.Data.DataTable instance using the schema for data types.
5. Use CsvReader to read the CSV data into the DataTable.
6. Apply column name changes and remove unwanted columns from the DataTable.
7. Use System.Data.SqlClient.SqlBulkCopy() to add the rows from the DataTable into the created database table.
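To make step 1 concrete, here is a rough sketch of the kind of schema estimation I have in mind (the ImportColumn shape, the naive comma splitting and the type-widening rules are simplified stand-ins, not my real code):

```csharp
// Rough sketch only: infers a per-column CLR type from the header row plus a
// sample of data rows. Does not handle quoted fields or embedded commas.
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

public class ImportColumn
{
    public string SourceName { get; set; }   // name from the CSV header
    public string TargetName { get; set; }   // user-editable column name
    public Type DataType { get; set; }       // user-editable data type
    public bool Include { get; set; }        // false = exclude from the import
}

public static class SchemaEstimator
{
    public static List<ImportColumn> Estimate(string path, int sampleRows)
    {
        using (var reader = new StreamReader(path))
        {
            string[] headers = reader.ReadLine().Split(',');
            var columns = headers
                .Select(h => new ImportColumn { SourceName = h.Trim(), TargetName = h.Trim(),
                                                DataType = typeof(int), Include = true })
                .ToList();

            for (int r = 0; r < sampleRows && !reader.EndOfStream; r++)
            {
                string[] fields = reader.ReadLine().Split(',');
                for (int c = 0; c < columns.Count && c < fields.Length; c++)
                {
                    // Widen the guessed type whenever a value stops parsing:
                    // int -> decimal -> DateTime -> string.
                    columns[c].DataType = Widen(columns[c].DataType, fields[c]);
                }
            }
            return columns;
        }
    }

    private static Type Widen(Type current, string value)
    {
        int i; decimal d; DateTime dt;
        if (current == typeof(int) && int.TryParse(value, out i)) return current;
        if ((current == typeof(int) || current == typeof(decimal)) && decimal.TryParse(value, out d)) return typeof(decimal);
        if (current != typeof(string) && DateTime.TryParse(value, out dt)) return typeof(DateTime);
        return typeof(string);
    }
}
```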
This sounds overly complex to me and I am facing a mental block trying to wrap it up neatly in a handful of testable/extensible objects.
Any suggestions/thoughts on ways to approach this problem, both from an implementation and a design perspective?
Many thanks for any suggestions.
As suggested in some previous SO answers, take a look at the FileHelpers library. It should at least be helpful for importing and analyzing the CSV files.
Related
Currently I'm working on an application to import data from different sources (CSV and XML). The core data in the files is the same (ID, Name, Coordinate, etc.) but the structures differ (XML: in nodes, tables: in rows), and the XML contains additional data which I need to keep together until the export. It's also important to say that I only need a few of the fields for visualization and modification, but I need all of them for the export.
Problem:
I'm looking for a good structure (database or whatever) into which I can import the data at run-time. I need to have a reference to the data in order to visualize and modify it. Afterwards I need to export the information to a user-specified file type (consider the image).
Approaches:
I defined a class for the CSV schema and mapped the necessary information from the XML to it. The problem occurs when I try to export the data, because not all of the data is available in memory.
I defined a class for the XML schema and mapped the information from the CSV to it. The problem in this case is that the storage structure is based on the schema of the XML, and if that XML schema changes, I need to change the whole storage structure.
I'm now planning to implement a SQL database with Entity Framework. This is not the easiest way, but it seems to be state-of-the-art and updatable. The thing is that I'm not very experienced with databases or Entity Framework, which is why I'd like to know whether this is a good way to solve the problem.
One last thing: I would like to store the imported data just once and work with references to that single source. That way I can export the information from it and be certain that I have the current data.
Question:
What is the common way to solve this kind of storage problem? Did I miss a good approach? Thank you so much for your help!
I have about 20 .csv files which are around 100-200mb each.
They each have about 100 columns.
90% of the columns of each file are the same; however, some files have more columns and some files have fewer.
I need to import all of these files into one table in a sql server 2008 database.
If the field does not exist, I need it to be created.
Question: what should the process for this import be? How do I import all of these files into one table in a database as efficiently and quickly as possible, and make sure that if a field does not exist it is created? Please also keep in mind that the same field might be in a different position in different files. For example, CAR can be in column AB in one CSV, whereas the same field name (CAR) can be in column AC in another CSV file. The solution can be SQL or C# or both.
You have a number of options:
1. Use a DTS package.
2. Try to produce one uniform CSV file, get the db table in sync with its columns, and bulk insert it.
3. Bulk insert every file into its own table, and afterwards merge those tables into the target table.
I would recommend looking at the BCP program which comes with SQL Server and is intended to help with jobs just like this:
http://msdn.microsoft.com/en-us/library/aa337544.aspx
There are "format files" which allow you to specify which CSV columns go to which SQL columns.
If you are more inclined to use C#, have a look at the SqlBulkCopy class:
http://msdn.microsoft.com/en-us/library/system.data.sqlclient.sqlbulkcopy.aspx
Also take a look at this SO thread, also about importing from CSV files into SQL Server:
SQL Bulk import from CSV
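If you take the SqlBulkCopy route instead of BCP, the ColumnMappings collection plays roughly the same role as a format file: it says which source column feeds which destination column. A minimal sketch, assuming a made-up dbo.Cars table and column names:

```csharp
// Sketch: bulk-copy a DataTable whose column names/order differ from the target table.
// "dbo.Cars" and the column names are placeholders for whatever your schema actually uses.
using System.Data;
using System.Data.SqlClient;

public static class BulkLoader
{
    public static void Load(DataTable source, string connectionString)
    {
        using (var connection = new SqlConnection(connectionString))
        {
            connection.Open();
            using (var bulk = new SqlBulkCopy(connection))
            {
                bulk.DestinationTableName = "dbo.Cars";
                bulk.BatchSize = 5000;

                // Explicit source -> destination mappings; unmapped source columns are ignored.
                bulk.ColumnMappings.Add("CAR", "CarName");
                bulk.ColumnMappings.Add("PRICE", "ListPrice");

                bulk.WriteToServer(source);
            }
        }
    }
}
```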
I recommend writing a small C# application that reads each of the CSV files' headers, stores a dictionary of the columns needed, and either outputs a 'create table' statement or directly runs a create table operation against the database. Then you can use SQL Server Management Studio to load the 20 files individually using the import routine.
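A rough sketch of that header-scanning application (the folder path, the hard-coded table name and the nvarchar(max)-for-everything choice are all assumptions to adjust):

```csharp
// Sketch: scan the header row of every CSV, collect the union of the column
// names, and emit a CREATE TABLE statement covering all of them.
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text;

class CreateTableFromHeaders
{
    static void Main()
    {
        var columns = new List<string>();
        var seen = new HashSet<string>(StringComparer.OrdinalIgnoreCase);

        foreach (string file in Directory.GetFiles(@"C:\imports", "*.csv"))
        {
            string headerLine;
            using (var reader = new StreamReader(file))
                headerLine = reader.ReadLine();   // only the first line is needed

            foreach (string header in headerLine.Split(',').Select(h => h.Trim()))
            {
                if (seen.Add(header))             // keep each column name once
                    columns.Add(header);
            }
        }

        // Every column as nvarchar(max) NULL for simplicity; tighten the types as needed.
        var sql = new StringBuilder("CREATE TABLE dbo.ImportTarget (");
        sql.Append(string.Join(", ", columns.Select(c => "[" + c + "] nvarchar(max) NULL").ToArray()));
        sql.Append(");");
        Console.WriteLine(sql.ToString());
    }
}
```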
Use the SqlBulkCopy class in System.Data.SqlClient.
It facilitates bulk data transfer. The only catch is that it won't work with DateTime DB columns.
Less of an answer and more of a direction, but here I go. The way I would do it is first enumerate the column names from both the CSV files and the DB, then make sure the ones from your CSV all exist in the destination.
Once you have validated and/or created all the columns, then you can do your bulk insert. Assuming you don't have multiple imports happening at the same time, you could cache the column names from the DB when you start the import, as they shouldn't be changing.
If you will have multiple imports running at the same time, then you will need to make sure you have a full table lock during the import, as race conditions could show up.
I do a lot of automated imports for SQL DBs, and I haven't ever seen what you're asking for, as it's an assumed requirement that one knows the data that is coming into the DB. Not knowing the columns ahead of time is typically a very bad thing, but it sounds like you have an exception to the rule.
Roll your own.
Keep (or create) a runtime representation of the target table's columns in the database. Before importing each file, check to see if the column exists already. If it doesn't, run the appropriate ALTER statement. Then import the file.
The actual import process can and probably should be done by BCP or whatever Bulk protocol you have available. You will have to do some fancy kajiggering since the source data and destination align only logically, not physically. So you will need BCP format files.
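A hedged sketch of that "check and ALTER" step (the nvarchar(max) default and the lack of header sanitising are simplifications):

```csharp
// Sketch: ensure every CSV header exists as a column on the target table,
// adding any that are missing before the import runs.
using System;
using System.Collections.Generic;
using System.Data.SqlClient;

public static class SchemaSync
{
    public static void EnsureColumns(string connectionString, string table, IEnumerable<string> csvHeaders)
    {
        using (var connection = new SqlConnection(connectionString))
        {
            connection.Open();

            // Read the existing column names once per import run.
            var existing = new HashSet<string>(StringComparer.OrdinalIgnoreCase);
            using (var cmd = new SqlCommand(
                "SELECT COLUMN_NAME FROM INFORMATION_SCHEMA.COLUMNS WHERE TABLE_NAME = @table", connection))
            {
                cmd.Parameters.AddWithValue("@table", table);
                using (var reader = cmd.ExecuteReader())
                {
                    while (reader.Read())
                        existing.Add(reader.GetString(0));
                }
            }

            // ALTER the table for any header it does not have yet.
            // Header names are concatenated into SQL here; sanitise them if the source is untrusted.
            foreach (string header in csvHeaders)
            {
                if (existing.Contains(header)) continue;
                string alter = string.Format("ALTER TABLE [{0}] ADD [{1}] nvarchar(max) NULL", table, header);
                using (var cmd = new SqlCommand(alter, connection))
                    cmd.ExecuteNonQuery();
                existing.Add(header);
            }
        }
    }
}
```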
There are several possibilities here.
You can use SSIS if it is available to you.
In SQL Server you can use SqlBulkCopy to bulk insert the whole .csv file into a staging table, and then use a stored procedure, possibly with a MERGE statement in it, to place each row where it belongs or create a new one if it doesn't exist (see the sketch below this list).
You can use C# code to read the files and write them using SqlBulkCopy or the EntityDataReader.
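A sketch of the staging-table variant of option 2 (the staging and target table names, the Id/Name columns, and the MERGE text are assumptions about your schema):

```csharp
// Sketch: bulk-load the CSV rows into a staging table, then MERGE them into
// the real table so existing rows are updated and new rows inserted.
using System.Data;
using System.Data.SqlClient;

public static class StagedImport
{
    public static void Run(DataTable csvRows, string connectionString)
    {
        using (var connection = new SqlConnection(connectionString))
        {
            connection.Open();

            using (var bulk = new SqlBulkCopy(connection) { DestinationTableName = "dbo.Import_Staging" })
                bulk.WriteToServer(csvRows);

            const string merge = @"
                MERGE dbo.Target AS t
                USING dbo.Import_Staging AS s ON t.Id = s.Id
                WHEN MATCHED THEN UPDATE SET t.Name = s.Name
                WHEN NOT MATCHED THEN INSERT (Id, Name) VALUES (s.Id, s.Name);
                TRUNCATE TABLE dbo.Import_Staging;";

            using (var cmd = new SqlCommand(merge, connection))
                cmd.ExecuteNonQuery();
        }
    }
}
```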
For those data volumes, you should use an ETL tool. See this tutorial.
ETL tools are designed for manipulating large amounts of data.
What's the best way to import a small csv file into SQL Server using an ASP.NET form with C#? I know there are many ways to do this, but I'm wondering which classes would be best to read the file and how to insert into the database. Do I read the file into a DataTable and then use the SqlBulkCopy class, or just insert the data using ADO.NET? Not sure which way is best. I'm after the simplest solution and am not concerned about scalability or performance as the csv files are tiny.
Using ASP.NET 4.0, C# 4.0 and SQL Server 2008 R2.
The DataTable and SqlBulkCopy classes will do just fine, and that is the way I would prefer to do it: if these tiny CSV files someday become larger, your program will already be ready for them, whereas plain ADO.NET inserts might add some overhead by handling a single row at a time.
EDIT #1
What's the best way to get from csv file to datatable?
A CSV file is nothing more than a text file. As such, you might want to read it using the File.ReadAllLines(String) method, which returns a string[]. Then you can add rows to your DataTable using the DataRow class or your preferred way.
Consider adding your columns when defining your DataTable so that it knows its structure when adding rows.
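Putting those pieces together, a minimal sketch for a tiny CSV (assuming a header row, no quoted commas, and a destination table whose columns accept the string values as-is):

```csharp
// Sketch: read a small CSV into a DataTable and bulk-copy it to SQL Server.
// Assumes a simple comma-separated file with a header row and no quoted commas.
using System.Data;
using System.Data.SqlClient;
using System.IO;

public static class TinyCsvImport
{
    public static void Import(string csvPath, string connectionString, string destinationTable)
    {
        string[] lines = File.ReadAllLines(csvPath);

        var table = new DataTable();
        foreach (string header in lines[0].Split(','))
            table.Columns.Add(header.Trim());        // every column as string for simplicity

        for (int i = 1; i < lines.Length; i++)
            table.Rows.Add(lines[i].Split(','));     // the string[] becomes the row's values

        using (var bulk = new SqlBulkCopy(connectionString))
        {
            bulk.DestinationTableName = destinationTable;
            bulk.WriteToServer(table);
        }
    }
}
```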
I have to import a series of CSV files and place the records into an Access database. This, in itself, is not too much of an issue.
The complication comes from the fact that the format of each CSV file is variable in that the number of fields may vary.
The best way to think about this is that there are 80 master fields, and each CSV file may contain ANY subset of those 80 fields. The only way of telling what you are dealing with is to look at the field headers in the CSV file. The data written to the Access file must contain all 80 fields (so missing fields need to be written as null values).
Rather than re-inventing the wheel, does anyone know of any code that handles this kind of variable mapping/import?
Any pointers in the right direction would be appreciated.
The best CSV reader library for .NET is LumenWorks'. It will not insert the data into Access, and I think you will still need to write some code to handle the column differences, but it will make things easier than rolling your own parser, and it will be faster.
Generally, a basic automated CSV import for a repeated data transfer should expect a consistent input file to an agreed specification.
If the CSV file is being imported into an application by a user, then a basic mapping can be done by listing the fields in the CSV file (e.g. reading in the values from the first row) and showing a drop-down next to each one containing the list of master (db) fields.
You could save the mapping for future reference by storing the list of input field indexes or names together with the corresponding master field index/ID/name.
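A sketch of applying such a saved mapping when a file is imported (the master field list, the mapping dictionary and their contents are assumed to come from your own configuration):

```csharp
// Sketch: given the master field list and a saved CSV-header -> master-field mapping,
// build a DataRow where any master field missing from this particular CSV stays DBNull.
using System;
using System.Collections.Generic;
using System.Data;

public static class MappedRowBuilder
{
    public static DataRow Build(DataTable target, string[] csvHeaders, string[] csvValues,
                                IDictionary<string, string> headerToMasterField)
    {
        DataRow row = target.NewRow();

        // Explicitly set every master field to null so missing columns are written as null values.
        foreach (DataColumn column in target.Columns)
            row[column] = DBNull.Value;

        // Copy over only the fields this CSV actually provides, via the saved mapping.
        for (int i = 0; i < csvHeaders.Length && i < csvValues.Length; i++)
        {
            string masterField;
            if (headerToMasterField.TryGetValue(csvHeaders[i], out masterField))
                row[masterField] = csvValues[i];
        }

        return row;
    }
}
```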
One of my current requirements is to take in an Excel spreadsheet that the user updates about once a week, and to be able to query that document for certain fields.
As of right now, when they upload the file I run through it once and push all of the Excel (2007) data into an XML file, which then holds all of the needed data (not all of the columns in the spreadsheet) for querying via LINQ to XML; from then on I just use the XML. Note that the XML file is smaller than the Excel file.
Now my question is: is there any performance difference between querying an XML file with LINQ and an Excel file with an OleDbConnection? Am I just adding another unnecessary step?
I suppose the follow-up question would be: is it worth it, for ease of use, to keep pushing it to XML?
The file has about 1000 rows.
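For context, the kind of query I run against the generated XML looks roughly like the sketch below (the element and attribute names are placeholders, not my real schema):

```csharp
// Sketch: query the intermediate XML for rows matching some field value.
// The element/attribute names ("Row", "Region", "Amount") are placeholders.
using System;
using System.Linq;
using System.Xml.Linq;

class XmlQueryExample
{
    static void Main()
    {
        XDocument doc = XDocument.Load("export.xml");

        var matches =
            from row in doc.Descendants("Row")
            where (string)row.Attribute("Region") == "West"
            select new
            {
                Region = (string)row.Attribute("Region"),
                Amount = (decimal?)row.Attribute("Amount") ?? 0m
            };

        foreach (var m in matches)
            Console.WriteLine("{0}: {1}", m.Region, m.Amount);
    }
}
```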
For something that is done only once per week I don't see the need to perform any optimizations. Instead you should focus on what is maintainable and understandable both for you and whoever will maintain the solution in the future.
Use whatever solution you find most natural :-)
As I understand it the performance side of things stands like this for accessing Excel data.
Fastest to Slowest
1. Custom 3rd party vendor software using C++ directly on the Excel file type.
2. OleDbConnection method, using a schema file if necessary for data types; treats Excel as a flat-file db (see the sketch after this list).
3. LINQ to XML method; a superior method for reading/writing data, but with Excel 2007 file formats only.
4. Straight XML data manipulation using the OOXML SDK and optionally 3rd-party XML libraries. Again, limited to Excel 2007 file formats only.
5. Using an Object[,] array to read a region of cells (via the .Value2 property), and passing an Object[,] array back to a region of cells (again via .Value2) to write data.
6. Updating and reading from cells individually using the .Cells(x,y) and .Offset(x,y) property accessors.
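For reference, the OleDbConnection approach from item 2 looks roughly like this (the connection string, provider and sheet name are assumptions; the ACE 12.0 provider must be installed for Excel 2007 files):

```csharp
// Sketch: read an Excel 2007 worksheet as a flat table through OLE DB.
// "Sheet1$" and the Extended Properties are placeholders to adjust for your workbook.
using System.Data;
using System.Data.OleDb;

public static class ExcelFlatFileReader
{
    public static DataTable ReadSheet(string xlsxPath)
    {
        string connectionString =
            "Provider=Microsoft.ACE.OLEDB.12.0;Data Source=" + xlsxPath +
            ";Extended Properties=\"Excel 12.0 Xml;HDR=YES;IMEX=1\"";

        var table = new DataTable();
        using (var connection = new OleDbConnection(connectionString))
        using (var adapter = new OleDbDataAdapter("SELECT * FROM [Sheet1$]", connection))
        {
            adapter.Fill(table);   // Fill opens and closes the connection itself
        }
        return table;
    }
}
```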
You can't use a SqlConnection to access an Excel spreadsheet. More than likely, you are using an OleDbConnection or an OdbcConnection.
That being said, I would guess that using the OleDbConnection to access the Excel sheet would be faster, as you are processing the data natively, but the only way to know for your data is to test it yourself, using the Stopwatch class in the System.Diagnostics namespace or a profiling tool.
If you have a great deal of data to process, you might also want to consider putting it in SQL Server and then querying that (depending on the ratio of queries to the time it takes to save the data, of course).
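Measuring it yourself is straightforward with Stopwatch; the two methods being timed below are stand-ins for your own LINQ-to-XML and OLE DB query paths:

```csharp
// Sketch: time two alternative query implementations against the same data.
// QueryViaLinqToXml / QueryViaOleDb are placeholders for your own methods.
using System;
using System.Diagnostics;

class QueryTiming
{
    static void Main()
    {
        var sw = Stopwatch.StartNew();
        QueryViaLinqToXml();
        sw.Stop();
        Console.WriteLine("LINQ to XML: {0} ms", sw.ElapsedMilliseconds);

        sw = Stopwatch.StartNew();
        QueryViaOleDb();
        sw.Stop();
        Console.WriteLine("OLE DB:      {0} ms", sw.ElapsedMilliseconds);
    }

    static void QueryViaLinqToXml() { /* placeholder for the XML query path */ }
    static void QueryViaOleDb()     { /* placeholder for the OLE DB query path */ }
}
```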
I think it's important to discuss what type of querying you are doing with the file. I have to believe it will be a great deal easier to query using LINQ than the OleDbConnection, although I am speaking more from experience than anything else.