CSV import (variable fields) - C#

I have to import a series of CSV files and place the records into an Access database. This, in itself, is not too much of an issue.
The complication comes from the fact that the format of each CSV file is variable in that the number of fields may vary.
The best way to think about this is that there are 80 master fields, and each CSV file may contain ANY subset of those 80 fields. The only way of telling what you are dealing with is to look at the field headers in the CSV file. The data written to the Access file must contain all 80 fields (so missing fields need to be written as null values).
Rather than re-inventing the wheel, does anyone know of any code that handles this variable mapping/import?
Or any pointers in the right direction would be appreciated.

The best CSV reader library for .NET is LumenWorks'. It will not insert the data into Access, and I think you will still need to write some code to handle the column differences, but it will make that easier than rolling your own parser, and it will be faster.
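For the column-difference handling, a minimal sketch using LumenWorks' CsvReader (the file name and master field names are placeholders, and I'm assuming a string-typed staging DataTable that you would then write to Access however you prefer, e.g. via OleDb):

using System;
using System.Data;
using System.IO;
using LumenWorks.Framework.IO.Csv;

// Reads whatever subset of the master fields the CSV happens to contain and
// returns a DataTable with all master columns; absent fields stay as DBNull.
static DataTable LoadCsvIntoMasterSchema(string csvPath, string[] masterFields)
{
    DataTable table = new DataTable();
    foreach (string field in masterFields)
        table.Columns.Add(field, typeof(string));

    using (CsvReader csv = new CsvReader(new StreamReader(csvPath), true))
    {
        string[] headers = csv.GetFieldHeaders();
        while (csv.ReadNextRecord())
        {
            DataRow row = table.NewRow();   // every column starts out as DBNull.Value
            for (int i = 0; i < headers.Length; i++)
            {
                if (table.Columns.Contains(headers[i]))
                    row[headers[i]] = csv[i];
            }
            table.Rows.Add(row);
        }
    }
    return table;
}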

Generally, a basic automated CSV import for a repeated data transfer should expect a consistent input file to an agreed specification.
If the CSV file is being imported into an application by a user, then a basic mapping can be done by listing the fields found in the CSV file (e.g. by reading in the values from the first row) and showing a drop-down next to each one containing the list of master (db) fields.
You could save the mapping for future reference by storing the list of input field indexes or names together with the corresponding master field index/ID/name.
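The saved mapping itself can be as simple as a dictionary from CSV header to master field, persisted however you like; a minimal sketch (the field names here are made up):

using System.Collections.Generic;

// Built from the user's drop-down selections: CSV header -> master (db) field.
// Serialize it (XML, a settings table, etc.) so the same file layout maps automatically next time.
Dictionary<string, string> fieldMapping = new Dictionary<string, string>
{
    { "Cust Name", "CustomerName" },
    { "Post Code", "PostalCode" }
};

string masterField;
if (fieldMapping.TryGetValue("Cust Name", out masterField))
{
    // write this CSV column's value into the masterField column
}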

Related

Data storage approach for different file types

Currently I'm working on an application to import data from different sources (CSV and XML). The core data in the files is the same (ID, Name, Coordinate, etc.) but the structures differ (XML: in nodes, CSV: in rows), and the XML contains additional data that I need to keep together until the export. It is also important to say that I only need a few of the fields for visualization and modification, but I need all of them for the export.
Problem:
I'm looking for a good structure (a database or whatever) into which I can import the data at run-time. I need to have a reference to the data to visualize and modify it. Afterwards I need to export the information to a user-specified file type (consider the image).
Approaches:
I defined a class for the CSV schema and mapped the necessary information from the XML to it. The problem occurs when I try to export the data, because not all of the data is available in memory.
I defined a class for the XML schema and mapped the information from the CSV to it. The problem in this case is that the storage structure is based on the XML schema, and if that schema changes I need to change the whole storage structure.
I'm now planning to implement a SQL database with Entity Framework. This is not the easiest way, but it seems to be state-of-the-art and easy to update later. The thing is that I'm not very experienced with databases or Entity Framework, which is why I would like to know whether this is a good way to solve the problem.
Last thing to say: I would like to store the imported data just once and work with references to this single source. That way I can export the information from it and be certain that I have the current data.
Question:
What is the common way to solve such storage problems? Did I miss a good approach? Thank you so much for your help!

For a tag database is it better to store filenames per tag or tags per filename?

I want to write a small app that manages file tags for my personal files. It's gonna be pretty straightforward but I am not sure if I should be storing filenames for each unique tag, i.e.:
"sharp":
file0.ext file1.ext file2.ext file3.ext
"cold":
file1.ext file2.ext
"ice":
file3.ext
Or if I should be storing tags for each file name i.e:
file0.ext:
"sharp"
file1.ext:
"sharp" "cold"
file2.ext:
"sharp" "cold"
file3.ext:
"sharp" "ice"
I want to use the method that will give me the best performance and/or best design. Since I never did anything like this, the method I think is right might not be optimal.
Just to give more info about the app:
I will search files by tag. All I need is to be able to type my tags so I can see which files match, and double click to open them, etc.
I will use protobuffers (Marc's version) to save and load the database.
Database size is not important as I will use it on my PC.
I don't think I will ever have more than 50K files. Most likely I will have 20K max as these are mostly personal files so it's not possible for me to create/collect more than that.
EDIT: I forgot to mention another feature. Since this will also be the app used to define tags for files, when I select a file I need it to load all the tags that file has, so I can show them in case I want to edit them.
It all depends on how you want to search the data... Since you say that you want to search files by tag, your first method will be the simplest, since you will only need to read a small part of the data file.
If you really wanted to be simple, you could have a separate data file for each tag (i.e. sharp.txt, cold.txt, ice.txt) and then just have a list of filenames in the file.
If you're searching by tag, that seems like the more appropriate index. You may incur some performance penalty for finding all tags on a file if that's something you need to do.
Alternatively if you do want to support either scenario: store both, and you can query on them as needed. This creates some data duplication and you'll need extra logic to update both data sets when a file is changed/added, but it should be pretty straightforward.
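A minimal sketch of keeping both indexes in sync (the class and method names are just illustrative):

using System.Collections.Generic;
using System.Linq;

class TagIndex
{
    // tag -> files carrying that tag
    private readonly Dictionary<string, HashSet<string>> filesByTag = new Dictionary<string, HashSet<string>>();
    // file -> tags on that file
    private readonly Dictionary<string, HashSet<string>> tagsByFile = new Dictionary<string, HashSet<string>>();

    public void Add(string file, string tag)
    {
        HashSet<string> files;
        if (!filesByTag.TryGetValue(tag, out files))
            filesByTag[tag] = files = new HashSet<string>();
        files.Add(file);

        HashSet<string> tags;
        if (!tagsByFile.TryGetValue(file, out tags))
            tagsByFile[file] = tags = new HashSet<string>();
        tags.Add(tag);
    }

    public IEnumerable<string> FilesWithTag(string tag)
    {
        HashSet<string> files;
        return filesByTag.TryGetValue(tag, out files) ? (IEnumerable<string>)files : Enumerable.Empty<string>();
    }

    public IEnumerable<string> TagsOnFile(string file)
    {
        HashSet<string> tags;
        return tagsByFile.TryGetValue(file, out tags) ? (IEnumerable<string>)tags : Enumerable.Empty<string>();
    }
}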
In case you have a lot of tags, a lot of files and a lot of relations, I would suggest using a relational database. If you don't have a lot of data, I don't think you need to worry about it.
Anyway, I suppose that even if you do want to save the relations in plain text files, the same principles as in database normalization apply. The main goal is to avoid data repetition. In your model, a tag and a file have a many-to-many relation. I would imitate the structure of a relational database even if the data is stored in plain text files: one file holding the filenames with one ID per filename, another file holding the tags with one ID per tag, and a third file containing the relationships. Simple, and it keeps the files to a minimum size.
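As a sketch, that three-file (files / tags / relations) shape maps directly onto a few small classes, which would also serialize fine with protobuf-net as mentioned in the question (the class names here are made up; the attributes are standard protobuf-net):

using System.Collections.Generic;
using ProtoBuf;

[ProtoContract]
class TagDatabase
{
    [ProtoMember(1)] public List<FileEntry> Files = new List<FileEntry>();   // one ID per filename
    [ProtoMember(2)] public List<TagEntry> Tags = new List<TagEntry>();      // one ID per tag
    [ProtoMember(3)] public List<FileTag> Relations = new List<FileTag>();   // many-to-many links
}

[ProtoContract]
class FileEntry
{
    [ProtoMember(1)] public int Id;
    [ProtoMember(2)] public string Path;
}

[ProtoContract]
class TagEntry
{
    [ProtoMember(1)] public int Id;
    [ProtoMember(2)] public string Name;
}

[ProtoContract]
class FileTag
{
    [ProtoMember(1)] public int FileId;
    [ProtoMember(2)] public int TagId;
}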
Hope I helped!

C# - Loading XML file in parts

My task is to load new set of data (which is written in XML file) and then compare it to the 'old' set (also in XML). All the changes are written to another file.
My program loads the new and the old file into two DataSets, then row after row I compare the primary key from the new set with the old one. When I find the corresponding row, I check all the fields, and if there are differences from the old one I write it to a third set, and then write that set to a file.
Right now I use:
newDS.ReadXml("data.xml");
oldDS.ReadXml("old.xml");
and then I just find rows with corresponding primary key and compare other fields. It is working quite good for small files.
The problem is that my files may be up to about 4 GB. If my new and old data are that big, it is quite problematic to load 8 GB of data into memory.
I would like to load my data in parts, but to compare I need the whole old data set (or is there a way to get a specific row with a matching primary key from an XML file?).
Another problem is that I don't know the structure of the XML file. It is defined by the user.
What is the best way to work with such big files? I thought about using LINQ to XML, but I don't know if it has options that can help with my problem. Maybe it would be better to leave XML and use something different?
You are absolutely right that you should leave XML. It is not a good tool for datasets this size, especially if the dataset consists of many 'records' all with the same structure. Not only are 4GB files unwieldy, but almost anything you use to load and parse them is going to use even more memory overhead than the size of the file.
I would recommend that you look at solutions involving an SQL database, but I have no idea how it can make sense to be analysing a 4GB file where you "don't know the structure [of the file]" because "it is defined by the user". What meaning do you ascribe to 'rows' and 'primary keys' if you don't understand the structure of the file? What do you know about the XML?
It might make sense, e.g., to read one file, store all the records with primary keys in a certain range, do the same for the other file, compare that data, then carry on. By segmenting the key space you make sure that you always find matches if they exist. It could also make sense to break your files into smaller chunks in the same way (although I still think XML storage this large is usually inappropriate). Can you say a little more about the problem?
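One way to sketch the key-range idea with a streaming XmlReader (the "record" element name and "id" key attribute below are placeholders, since the real structure is user-defined):

using System.Collections.Generic;
using System.Xml;

// Streams one file and keeps only records whose key falls in [low, high),
// so memory use is bounded by the segment size rather than the whole 4 GB file.
static Dictionary<string, string> LoadKeyRange(string path, string low, string high)
{
    Dictionary<string, string> segment = new Dictionary<string, string>();
    using (XmlReader reader = XmlReader.Create(path))
    {
        reader.MoveToContent();
        while (!reader.EOF)
        {
            if (reader.NodeType == XmlNodeType.Element && reader.Name == "record")   // placeholder element name
            {
                string key = reader.GetAttribute("id");                               // placeholder key attribute
                bool inRange = key != null
                    && string.CompareOrdinal(key, low) >= 0
                    && string.CompareOrdinal(key, high) < 0;
                if (inRange)
                    segment[key] = reader.ReadOuterXml();   // stores the raw record and moves past it
                else
                    reader.Skip();                          // moves past the record without buffering it
            }
            else
            {
                reader.Read();
            }
        }
    }
    return segment;
}

Running this over both files with the same range gives two in-memory dictionaries you can compare key by key before moving on to the next range.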

Clean design and implementation for data import library

I have been given the task of developing a small library (using C# 3.0 and .NET 3.5) to provide data import functionality for an application.
The spec is:
Data can be imported from a CSV file (potentially other file formats in the future).
The CSV files can contain any schema and number of rows, with a maximum file size of 10MB.
It must be possible to change the datatype and column name of each column in the CSV file.
It must be possible to exclude columns in the CSV file from the import.
Importing the data will result in a table matching the schema being created in a SQL Server database, and then being populated using rows in the CSV.
I've been playing around with ideas for a while now, and my current code feels like it has been hacked together a bit.
My current implementation approach is:
Open the CSV and estimate the schema, storing it in an ImportSchema class.
Allow the schema to be modified.
Use SMO to create the table in SQL Server according to the schema.
Create a System.Data.DataTable instance using the schema for datatypes.
Use CsvReader to read the CSV data into the DataTable.
Apply column name changes and remove unwanted columns from the DataTable.
Use System.Data.SqlClient.SqlBulkCopy() to add the rows from the DataTable into the created database table (rough sketch of this step below).
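A rough sketch of that last step (the connection string and target table name are placeholders, and I'm assuming the DataTable's column names already match the renamed target columns):

using System.Data;
using System.Data.SqlClient;

static void BulkLoad(DataTable table, string connectionString, string targetTable)
{
    using (SqlConnection connection = new SqlConnection(connectionString))
    {
        connection.Open();
        using (SqlBulkCopy bulkCopy = new SqlBulkCopy(connection))
        {
            bulkCopy.DestinationTableName = targetTable;
            // Map by name so the CSV's column order does not have to match the table's.
            foreach (DataColumn column in table.Columns)
                bulkCopy.ColumnMappings.Add(column.ColumnName, column.ColumnName);
            bulkCopy.WriteToServer(table);
        }
    }
}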
This sounds overly complex to me and I am facing a mental block trying to wrap it up neatly in a handful of testable/extensible objects.
Any suggestions/thoughts on ways to approach this problem, both from an implementation and a design perspective?
Many thanks for any suggestions.
As suggested in some previous SO answers, take a look at the FileHelpers library. It might be at least helpful in your task of importing and analyzing the CSV files.

Using SSIS to read flat files with multiple record types

We are evaluating SSIS to see if it will be appropriate for a new project that is coming up. One of the processes will have to process a flat file with delimited records. The file will contain orders. There is a header line, an (optional) shipping address line, and one or more detail lines. Each line's fields are delimited, but the different line types do not share the same format.
I read this answer:
SSIS transactional data (different record types, one file)
And I can split the data using the Conditional Split task to produce several outputs, but am not sure how to proceed from there. I have two issues that I need to resolve:
The order header should be inserted first, before the address and details since the address and details will reference the order record, so I think I need to process that output first, but I'm not sure in SSIS how to make that branch of the Conditional Split task be processed before the other branches. Ideally, I would like to process the order header and then store the order id in a user variable so that when processing the details, I can reference that variable.
There will be multiple orders in the file, so splitting it is more complex.
I could always write an application in C# that will preprocess the file or read the file into a staging table, but I'm not sure I like that approach.
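For what it's worth, that pre-processor would not have to be much more than a splitter; a sketch that assumes the record type code is the first delimited field (the pipe delimiter and ".txt" output names are just placeholders):

using System.Collections.Generic;
using System.IO;

// Splits the mixed-format order file into one file per record type, so each output
// can be loaded by a simple single-format flat file source or bulk insert.
static void SplitByRecordType(string inputPath, string outputFolder)
{
    Dictionary<string, StreamWriter> writers = new Dictionary<string, StreamWriter>();
    try
    {
        using (StreamReader reader = new StreamReader(inputPath))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                if (line.Length == 0) continue;
                string recordType = line.Split('|')[0];   // assumed: type code in the first field
                StreamWriter writer;
                if (!writers.TryGetValue(recordType, out writer))
                {
                    writer = new StreamWriter(Path.Combine(outputFolder, recordType + ".txt"));
                    writers[recordType] = writer;
                }
                writer.WriteLine(line);
            }
        }
    }
    finally
    {
        foreach (StreamWriter writer in writers.Values)
            writer.Dispose();
    }
}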
Can anyone who has been through this process share some insights into how they dealt with it?
Thanks,
Chris
After the split, deposit each type of record into its own staging table - or into an SSIS raw file destination, which is faster and good for intermediate steps like this. Then load all the headers into their final table and proceed without referential errors.
I'm assuming the detail records have a header ID in them? That should make dealing with your second issue easy. If not, let us know.
