Inserting large csv files into a database

Inserting large csv files into a database - c#

We have an application on the web that must allow the user to upload files with zip codes, these files are .csv's files. Any user will be able to upload the file from their computer, the issue is that the file may contain thousands of records. Right now i am getting the file, making sure it has the right headers but I am pushing the records one by one into the database.
I am using c# asp.net, is there a better way to do this?, more efficient from the code?. We cant use any external importers or data importers or tools like sql server business intelligence. How can I do this?, i was reading something about putting it in memory and then push it to the database?. Any urls, examples or suggestions would be much appreciated.
Regards

Firstly, I'm pretty sure that what you are asking is actually "How do you process a large file and insert the processed data into the database?".
Now assuming I am correct I would say the question is akin to 'how long is a piece of string?'. The reality is that an implementation for processing large files into a database is highly specific to your requirements.
However, at the simplest end of the spectrum you could simply upload the file straight into a table (or folder) and create a windows service that runs every x minutes, traverses through the table, picks each file and processes your data using bulk inserts and the prepare method (which may give you some performance benefits).
Alternatively you could look at something like MSMQ (Microsoft Message Queuing) and save any uploaded files direct to a queue which is then completely independent of your application and can be processed at any point in time along with easily scaled out.
At the end of the day though, honestly I don't think anyone here can give you a 'correct' answer to your question cause there really isn't one and you'll only be able to find improvements to your implementation by experimentation.

if this contains up to a million record, best to do this is to create a service to manage the inserting of records into the database to avoid timeout and prevent the web iis stress.
if you make it a windows service you can notify the service to process the zip files in certain directory where it was uploaded.
also, i would suggest to use bulk insert for more faster database transactions.
if there are validation you can probably stage the data into a different database and validate the data then push to the final database.

Since these records are in the same table and would then not be related to each other, Parallel.ForEach may be a valid answer here. Assuming you have a static method (may not necessarily need to be static) that inserts an individual record into the db, you can run Parallel.ForEach loop over an array where each index of the array represents a line of the CSV.
This assumes that uploading the large file to the server isn't the initial issue. If that is also part of the issue I would reccomend zipping the file and then using something like SharpZipLib to unzip it once it is uploaded. Since text compresses very well this may be the biggest boon to performance from the user's perspective.

Related

C# combined files direct accessable

I am creating a very simple database in C# which I use to store playlists and an overview of all my music. I want to make this C compatible in the future I plan to make this completely text based. The idea is that every text file is a table, and the contents are JSON format where every line of text is a record.
I don't want to have loose files for each database, so I was thinking about something like a zip file. I don't want to extract and compress every time I access a file. Is there someway I can use a stream reader/writer in C# on different files where windows only see one file?
I'm not completely convinced that this is the way to go. So I'm open to suggestions.
Update,
Im currently messing around with the "Local database" item in C#. I never payed any atention to it. It could very well be the solution.
Update2,
SQLite seems to be very simple. I have some experience with MySQL in the past with some php projects so that will give me a headstart.

You want to use a file as a container containing different files? If so, there are a lot ways to accomplish this. These are techniques I used in the past:
Zip:
A compressed file, such as Zip is known to behave that way, can be used as a solution for your interest. It is capable to store virtual files. They can vary in size to at least up to 1 Gigabyte (testet, but I currently don't know if there are implementation based size limits).
SQLite:
SQLite sounds oldschool, but it stores all database related stuff into one physical file. Creating a database with tables for each virtual file should to the trick. This approach is useful if you know that your virtual files won't use a lot of bytes in size or neither reach any limit of sqlite field datatypes. As your virtual files are going to use textlines, may you can be able to form then into attributes and tuples. This way you can even use SQL specific statements to query and filter your data as you wish to.
There are still more ways to implement that kind of container format by your own, but propably needs to invest more time and work in it than getting effort out of it. Stay tuned for better ideas and may ready to use implementations :-)

Will you ever try to search between your data? Then use a real database manager, in C# the built in local database file is the simpliest choice (if you are familiar with SQL).
The zip file is a good choice for data space and compactness (a single file instead of many files) but it is very slow: for each database operation the whole zip file will be reorganized. Even a tar file (without compression) needs a continous reallocation when the content changes, and a zip file needs extra computation and relocation.
If you want something what is compressed and still standard, you can use OpenXML (ods or xlsx, does not matter) to store your data but the save operation will be slow and even slower as your database grows.

Database Relational Records Archive & Restore

Years back, I had created a small system against a requirement where a snapped image from Android was uploaded onto a server along with its respective custom data and then stored on the disk and the custom data describing the image was further broken up and stored in the database. Each of the snapped images was actually a part of a campaign. Over the period, the system went on growing enough and now there are now over 10,000 campaigns already and over 500-1000 images per campaign. Though, the performance is not all that bad however I believe its just a matter of time. We now are thinking of archiving the past campaigns in another database called as Archive. Now here is what I am planning to do.
The Archive Database will have the exact same structure and the Archive functionality may have a search mechanism however, retrieval speed is not much of a concern here as this will happen very rarely.
I was thinking of removing records from one database and cloning it in the other, however the identity column probably will not let me do that very seamlessly. (and I may be wrong too.)
There needs to be a restore option too. (This is probably the most challenging part)
If I just make the records blank(except for the identity) from the original database and copy it to the other with no identity constraint, probably it is not going to help and I think it will loose the purpose of the exercise.
Any advise over this? Is there any known strategy or pattern or literature or even a link that may guide me on this?
Thank you in advance for your help.

I say: as long as you don't run out of space on your server, leave it as it is.
Over the period, the system went on growing enough and now there are now over 10,000 campaigns already and over 500-1000 images per campaign.
→ That's 5-10 millions of rows (created over several years).
For SQL Server, that's not that much.
Yes, I know...we're talking about image files stored in the database, not "regular" rows. Still, if your server has reasonably sized hardware, it shouldn't really matter.
I'm talking from experience here - at work, we have a SQL Server database which we use to store PDF files and images.
In our case, we're using a "regular" image column - since you're using SQL Server 2008, you could even use FILESTREAM (maybe you already do, but I don't know - you didn't say anything how exactly you're storing the image in the database).
We started the project on SQL Server 2005, where FILESTREAM wasn't available yet. In the meantime, we upgraded to SQL Server 2012, but never changed the data type in the table where we're storing the files.
If you still prefer creating a separate archive database and moving old data there, one piece of advice concerning this:
2) I was thinking of removing records from one database and cloning it
in the other, however the identity column probably will not let me do
that very seamlessly. (and I may be wrong too.)
[...]
4) If I just make the records blank(except for the identity) from the
original database and copy it to the other with no identity
constraint, probably it is not going to help and I think it will loose
the purpose of the exercise.
You don't need to set the column to identity in the archive database as well.
Just leave everything as it is in the main database, but remove the identity setting from the primary key in the archive database.
The archive database doesn't ever need to generate new keys (hence no need for identity), you're just copying rows with already existing keys from the main database.

I think good solution for you case is SSIS. This technology can provide fast loading of big volume of data to you Archive system. In addition you can use table partitioning to increase performance of manipulation of big data in Archive system. Also check such thing like comumnstore indexes (but it depends on version of SQL server).
I created such solution with following steps:
1) switch partition from main table t to another table t_1(the oldest rows in a table) in production system
2) load data to Archive system from table t_1
3) drop or truncate table t_1

Storing large files / binary data in a mysql database: when is it ok?

Ok, I have searched about this and read a few points of view about storing binary data in a [MySQL] database. Generally I consider this a bad idea and try to avoid it, favouring traditional file transfers and just storing a reference to the file in a database.
However, I am working on a project which requires database synchronisation with a remote/cloud database, not just for files, but also for settings and other user content. For this, and other reasons, I felt this might be an appropriate situation for binary storage in a database.
I have written a general system for the database sync which works well using Reflection and XML. I have also (against my instincts) integrated the file storage in to this system. Again, it works well - I chop files in to 64Kb BLOBs and store them in a table, with a file_id reference (linked to a seperate table which contains meta data such as file name/size/mime type).
This enables me to send bits and pieces as and when a connection is available, and also allows me to limit each request size to keep things running smoothly.
So far I have not found any issues with this, and have successfully imported and transferred over 1gb of data in both directions (over about 10-15 files / 16000 rows), but I worry about its scalability - will it slow down once there is 20gb+ data in there, or can MySQL handle it provided my queries are well structured?
Another reason for my decision to store the data in the database was that I figured I could simply add another HDD/storage device to MySQL if space ran low, in the hope of efficient scaling/replication/etc.
I would very much appreciate any views or comments as to whether this is a good or bad approach, and have I missed any obvious problems I'm likely to see once used in a production environment?
edit: I forgot to mention, the file sizes could range from 1KB to ~1GB
[Rough] Conclusion
Firstly: thanks very much to those who contributed a considered answer. Choosing the accepted answer here has been quite difficult as each has something decent to offer.
In the end (despite my hopes), I have decided that a pure MySQL storage server is at best only an ok solution (I still can't help wondering why they bother including the BLOB types though).
As the alternative, I am torn between #Nick Coons file system approach and #tadman's suggestion of a hybrid using a light weight key/value database engine such as leveldb. Provided the practicalities of using leveldb in this project are not an issue, this is most likely the approach I will work towards.
I have accepted tadman's answer on this basis; his answer was also most applicable and useful to my situation.
That being said, and for those that are interested: I have enjoyed quite a lot of success using only MySQL so far. I have tested a table storing over 15gb of binary data without any noticable negative side effects from to inserting/retrieving data from large tables (with careful queries). However, I am certain this is still very inefficient and either of the alternative methods mentioned will be significantly better.

I have to wonder why you're even bothering with a database at all, when the layer you've added on top to chunk, store, retrieve and reassemble would work just as well on a well-defined filesystem structure. MySQL wants all of its data on a single volume, so it's not a case of adding another drive whenever you feel like it, and replication of large amounts of binary data is going to be cripplingly slow as the binary logs will end up duplicating the amount of data you need to store.
The simplest approach is often the best one. Storing this in the filesystem directly is probably the best way to do it. If you need to keep an index of what's stored where, maybe you'd use a database like MySQL, but there's many ways to accomplish this same task. The more low-tech, the better. For example, don't rule out SQLite because an embedded database performs very well under light read and write load, and has the advantage of being "just a file" when it comes to backing up and restoring.
That being said, what you're doing sounds suspiciously similar to LevelDB, so before you commit to your approach, you'd have to see how it's significantly different than a key-value document store of that variety.

Short Answer:
I'm not sure there's a hard-lined way to answer this. You mentioned files being from 1KB to 1GB.. I wouldn't store binary data in a DB if it's going to anywhere near 1KB, let along 1GB. I may store a few bytes of binary data in a DB if it's incidental, but any large amount of data, especially that doesn't need to be searched, should be stored in the filesystem:
When you store data in a DB, you're storing it on a filesystem anyway, you've just added another layer (the DB) to the mix. There's a cost to this layer, so there ought to be a benefit to make up the difference. If you're storing the data so that you can search based on it or join it to other data, then this makes sense. But file data, binary or not, is typically not used in that way.
Example Implementation:
There are better methods to distribute file data than to enter it into a DB, such as a distributed filesystems (check into GlusterFS, MooseFS, both of which will scale by simply adding additional hard drives, whereas MySQL will not).
Typically, I'll store file data in the filesystem using an SHA1 hash of the data as the name of the file. If the hash is 98a75af529f07b1ef7be7400f51344b9f07b1ef7, then I'll store it in this directory structure:
./98/a7/98a75af529f07b1ef7be7400f51344b9f07b1ef7
That is, a top-level directory made up of the first two characters, a second-level directory made up of the second two characters, and then finally the file with the name of the total string. In this way, I can literally have billions of files without having so many in a single directory that the system is too slow to function.
Then I create a DB table with these columns to hold the meta data:
file_id, an auto_increment field
created, a field with a default value of current_timestamp
prev_id, more on this below
hash, the SHA1 hash on the filesystem
name, a textual name of the file (such as the original name that the file would have taken on disk.
When I need a hierarchical directory structure, I would also create a directory table and add a dir_id to the list of columns above.
If I edit the file represented by ./98/a7/98a75af529f07b1ef7be7400f51344b9f07b1ef7, I don't actually change that file on disk, I create a new one (because the new file contents would be represented by a new SHA1 hash), and create a new entry in the files table where prev_id equals the file_id of the file I edited. In other words, I now have versioning.
If I need this to be available in a distributed fashion, I setup MySQL replication and then use GlusterFS to replicate he filesystem across multiple servers.

I think you will find a fair amount of debate on this as I did when I began looking into this. I tend to lean toward storing in the file system and maintaining a reference. However, that is not to say that there is never a time to store binary data in a database.
I would say that simply to keep things in sync is not a reason within itself to make an argument for storing binary data in a database. There certainly are ways to keep file systems in sync so that as a database is kept in sync so is the file system.
The bottom line is that there is a fair amount of debate on this topic and you have to go with what works for you. If what you have set up works. Use it. Do performance and load testing to make sure it works. If it doesn't hold up, change it.

SQL Server data import dilemma

I have created an application that will be importing CSV files into a database table, and I've got multiple CSV files I need to import into a table in a SQL Server database.
I've got a couple approaches in mind but I'm not sure which is most practical. The application works by asking the user to select the files they want to import (from their local file system) and then they simply click a [Load Files] button. These files may contain 100,000+ rows at times.
What would be better for the above scenario?
Import CSV file into datatable using C# and open-source GenericParser then using a traditional method of BulkCopy to push the datatable to the database
Note: my concern is the strain on users PC when doing this for files with 100,000+ rows. How will this affect the processing or would it crash the program?
Use Bulk Insert which requires the file name and path. My concern for this option is I'm not sure if the server would be able to process the Bulk Insert command without the physical file being located on the server? The file path would relate to the users local machine. The only time I've used Bulk Insert is when I was logged onto the server itself, which is not possible for this app.
Is there a way to do it with Linq? While I'm not really familiar with Linq if it can be accomplished I'm open to trying it.
Any insight is appreciated. I know what I need to do just not sure of how to accomplish it practically.
Thanks

My recommendation would be to use the SqlBulkCopy class in .NET. It will allow you to import rows almost as fast at the BULK INSERT statement, but only requires that you populate a DataTable with the rows, and then send them to SQL Server.
Another consideration you might want to look at would be (and this is my personal favorite for simple file import programs) to use PowerShell instead of C#, which has a built-in cmdlet for imporing CSV files. Pretty cool stuff.

1) loader app in .Net is a good choice, generally. 100,000 rows is really not a strenuous workload, especially for simple loads. Only if there is a ton of multiple-table-joins involved in order to look up values on the fly would that really be a big concern.
2) although strictly speaking physical file location is just a performance question, I wouldn't do it. It will introduce administrative headaches.
3) I don't have experience with Linq, I cannot remark.
Just for bonus alternate idea: if you have IIS running somewhere, maybe even on the DB server, you can whip up a lightweight, one-page "webapp" which is just a CGI script with ODBC connection to the DB and the user just feeds the CSV in as a "web/CGI" upload. No utility application to install on user workstations this way.

To solve your problem, you have to see on in two basic views:
Do you need make some operations with data before insert in into database (some sumarization, correction,...)?
If yes, than here is the best way to upload rows from file to object (each row into one object instance). And than you can elegantly move with list of items with Linq.
Do you need only insert rows from file to database as they are?
I this case, use process described in point 2 of your question.
I'd prefer to upload file to server before any action. It's more safe.

What are the pitfalls of inserting millions of records into SQL Server from flat file?

I am about to start on a journey writing a windows forms application that will open a txt file that is pipe delimited and about 230 mb in size. This app will then insert this data into a sql server 2005 database (obviously this needs to happen swiftly). I am using c# 3.0 and .net 3.5 for this project.
I am not asking for the app, just some communal advise here and potential pitfalls advise. From the site I have gathered that SQL bulk copy is a prerequisite, is there anything I should think about (I think that just opening the txt file with a forms app will be a large endeavor; maybe break it into blob data?).
Thank you, and I will edit the question for clarity if anyone needs it.

Do you have to write a winforms app? It might be much easier and faster to use SSIS. There are some built-in tasks available especially Bulk Insert task.
Also, worth checking Flat File Bulk Import methods speed comparison in SQL Server 2005.
Update: If you are new to SSIS, check out some of these sites to get you on fast track. 1) SSIS Control Flow Basics 2) Getting Started with SQL Server Integration Services
This is another How to: on importing Excel file into SQL 2005.

This is going to be a streaming endeavor.
If you can, do not use transactions here. The transactional cost will simply be too great.
So what you're going to do is read the file a line at a time and insert it in a line at a time. You should dump failed inserts into another file that you can diagnose later and see where they failed.
At first I would go ahead and try a bulk insert of a couple of hundred rows just to see that the streaming is working properly and then you can open up all you want.

You could try using SqlBulkCopy. It lets you pull from "any data source".

Just as a side note, it's sometimes faster to drop the indices of your table and recreate them after the bulk insert operation.

You might consider switching from full recovery to bulk-logged. This will help to keep your backups a reasonable size.

I totally recommend SSIS, you can read in millions of records and clean them up along the way in relatively little time.
You will need to set aside some time to get to grips with SSIS, but it should pay off. There are a few other threads here on SO which will probably be useful:
What's the fastest way to bulk insert a lot of data in SQL Server (C# client)
What are the recommended learning material for SSIS?
You can also create a package from C#. I have a C# program which reads a 3GL "master file" from a legacy system (parses into an object model using an API I have for a related project), takes a package template and modifies it to generate a package for the ETL.

The size of data you're talking about actually isn't that gigantic. I don't know what your efficiency concerns are, but if you can wait a few hours for it to insert, you might be surprised at how easy this would be to accomplish with a really naive technique of just INSERTing each row one at a time. Batching together a thousand or so rows at a time and submitting them to SQL server may make it quite a bit faster as well.
Just a suggestion that could save you some serious programming time, if you don't need it to be as fast as conceivable. Depending on how often this import has to run, saving a few days of programming time could easily be worth it in exchange for waiting a few hours while it runs.

You could use SSIS for the read & insert, but call it as a package from your WinForms app. Then you could pass in things like source, destination, connection strings etc as parameter/configurations.
HowTo: http://msdn.microsoft.com/en-us/library/aa337077.aspx
You can set up transforms and error handling inside SSIS and even create logical branching based on input parameters.

If the column format of the file matches the target table where the data needs to end up, I prefer using the command line utility bcp to load the data file. It's blazingly fast and you can specify and error file for any "odd" records that fail to be inserted.
Your app could kick off the command if you need to store the command line parameters for it (server, database, username / password or trusted connection, table, error file etc.).
I like this method better than running a BULK INSERT SQL command because the data file isn't required to be on a system accessible by the database server. To use bulk insert you have to specify the path to the data file to load, so it must be a path visible and readable by the system user on the database server that is running the load. Too much hassle for me usually. :-)

Develop Reference

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.