Need to read a CSV file and recognize blank lines

Need to read a CSV file and recognize blank lines - c#

I've read many posts about how to read a CSV file using TextFieldParser and .ReadFields().
I need to read a CSV file up to the first blank line. However, .ReadFields() skips (ignores) blank lines so I don't know when they are encountered in the file.
Is there a way that I can use TextFieldParser and .ReadFields() but also detect when a blank line is encountered in a CSV file?

I don't know if you can do this with TextFieldParser or not. However, I maintain a library Sylvan.Data.Csv which can handle this scenario.
Add the nuget package Sylvan.Data.Csv. Create a CsvDataReader and set the option of ResultSetMode to MultiResult. This mode will cause the reader to stop any time it detects a change in column count. A blank line will be skipped between result sets.
CsvDataReader derives from System.Data.DbDataReader, so it can be used in the same way as a SqlDataReader. For example, you could use DataTable.Load(csvDataReader) to load the results into an ADO.NET data table. Or used a general-purpose data binder to bind it to objects. My Sylvan.Data library (currently pre-release nuget package) has such a binder, or the popular Dapper library has one as well.
using Sylvan.Data.Csv;
...
var csv = CsvDataReader.Create("mydata.csv", new CsvDataReaderOptions { ResultSetMode = ResultSetMode.MultiResult});
while(csv.Read()) {
for(int i = 0; i < csv.FieldCount; i++) {
csv.GetString(i);
}
}
// if you need to process subsequent results:
csv.NextResult(); // returns true if there is a subsequent result set.
My library is also the fastest CSV parser for .NET; the performance is vastly better than the TextFieldParser (VisualBasic), if that's a concern.

Related

What is the fastest way to export a DataTable in C# to MS Excel?

As the title says I have massive DataTables in C# that I need to export to Excel as fast as possible whilst having a reasonable memory consumption.
I've tried using the following:
EPPlus (currently the fastest)
OpenXML (slower than EPPlus - not sure this makes sense as EPPlus probably uses OpenXML?)
SpreadsheetLight (slow)
ClosedXML (OOM exception with large tables)
Assuming massive data sets (up to 1,000,000 rows, 50 columns) and an infinite development time, what is THE fastest way I can export?
EDIT: I need more than just a basic table starting in A1. I need the table to start in a cell of my choosing, I need to be able to format the cells with data in, and have multiple tabs all of which contain their own data set.
Thanks.

You did not specified any requirements on how the data should look like in excel file. I guess, you don't need any complicated logic, just setting up correct data in correct columns. In that case, you can put your data in CSV (comma separated values) file. Excel can read this file just fine.
Example of CSV file:
Column 1,Column 2,Column 3
value1,value1,value1
value2,value2,value2
...
As requested, here is the code sample for creation csv file.
var csvFile = new StringBuilder();
csvFile.AppendLine("Column 1,Column 2,Column 3");
foreach (var row in data)
{
csvFile.AppendLine($"{row.Column1Value},{row.Column2Value}, row.Column3Value}");
}
File.WriteAllText(filePath, csvFile.ToString());
You can use some texternal libraries for parsing csv files, but this is the most basic way i can think of atm.

Excel is just an XML file format. If you strip away all the helper libraries, and think you can do a better job at coding than the people at EPPPlus or OpenXML then you can just use an XML Stream Writer and write the properly tagged Excel XML to a file.
You can make use of all kinds of standard file buffering and caching to make the writing as fast as possible but none of that will be specific to an Excel file - just standard buffered writes.

Assuming ... an infinite development time, what is THE fastest way I can export?
Hand-roll your own XLSX export. It's basically compressed XML. So stream your XML to a ZipArchive and it will be more-or-less as fast as it can go. If you stream it rather than buffer it then memory usage should be fixed for any size export.

Convert SQL result set to CSV file

I'm working in C# and need making use of Entity Framework 6. I have a service that calls a stored procedure (using the Dbcontext) and places the results in an IList. I then have a controller that makes use of this service. Now originally I was using the results combined with Epplus to save this as an Excel/Xlsl File - this worked perfectly/as intended.
However now, I need to save it as an CSV file. I have found several links, such as this and this, which converts excel to CSV (however I can now skip this step, as I can just convert the resultset to CSV, with no need for the excel file), I also found this link.
From what I understand, it is fairly easy to export/convert a dataset/result set to CSV using a stringbuilder. However, I was wondering, given that I have Epplus and the ability to save as Excel- is there not a cleaner way of doing it? Or is it best to take the data, use a stringbuilder for comma delimiting the values and use that for the CSV?
I know similar topics (like this one) have been posted before - but I felt my question was unique enough for a new post.

Using EPPlus is not a cleaner way of doing this. You would only need much more code to accomplish exactly the same result. Creating a CSV file is nothing more than writing a text file with commas in it. So why not just do that?
StringBuilder sb = new StringBuilder();
foreach (DataRow dr in yourDataSet)
{
List<string> fields = new List<string>();
foreach (object field in dr.ItemArray)
{
fields.Add(field);
}
sb.Append(String.Join(",", fields) + Environment.NewLine);
}
//and save.. sb.ToString() as a .csv file

C# Parallelizing CSV Parsing

Please look at the code below.
void func()
{
for(;;)
{
var item = new Item();
}
}
Item is a class in whose constructor I read several csv files, as follows
List<string> data = new List<string>();
Item()
{
//read from csv into List<string>data
}
As is visible, the csv files are distinct and are read into unique variables. I would like to be able to parallelize this. All my data is on a network drive. I understand that the limitation in this case is the disk access. Can someone suggest what I can do to parallelize this?

As stated before Parallel.ForEach is the easiest way to run something in parallel, but if I recall correctly parallel.ForEach is a .net 4 method. so if you are using a different version you will have to find another method that uses locks.
If you are looking to read in data from a csv, ADO.net has a built in function that can read in csv files based on a schema file, it's one of the fastest ways in my experience to read in csv files.
quick link I found from google
http://www.daniweb.com/web-development/aspnet/threads/38676
I've also had great success with this http://www.codeproject.com/KB/database/CsvReader.aspx . It's a little slower than the ado.net version but it's easier to use and you don't need schema file.
just a warning, if you use the ado.net and you large string numeric values like credit cards numbers and you are getting things that look like scientific notation, your schema file needs to be adjusted, I've had a lot of coders complain about this.
happy coding.

if all your files are unique and stored in unique variables, take a look on Parallel.ForEach statement - take a look at http://msdn.microsoft.com/en-us/library/dd460720.aspx

C# Excel import data from CSV into Excel

How do I import data in Excel from a CSV file using C#? Actually, what I want to achieve is similar to what we do in Excel, you go to the Data tab and then select From Text option and then use the Text to columns option and select CSV and it does the magic, and all that stuff. I want to automate it.
If you could head me in the right direction, I'll really appreciate that.
EDIT: I guess I didn't explained well. What I want to do is something like
Excel.Application excelApp;
Excel.Workbook excelWorkbook;
// open excel
excelApp = new Excel.Application();
// something like
excelWorkbook.ImportFromTextFile(); // is what I need
I want to import that data into Excel, not my own application. As far as I know, I don't think I would have to parse the CSV myself and then insert them in Excel. Excel does that for us. I simply need to know how to automate that process.

I think you're over complicating things. Excel automatically splits data into columns by comma delimiters if it's a CSV file. So all you should need to do is ensure your extension is CSV.
I just tried opening a file quick in Excel and it works fine. So what you really need is just to call Workbook.Open() with a file with a CSV extension.

You could open Excel, start recording a macro, do what you want, then see what the macro recorded. That should tell you what objects to use and how to use them.

I beleive there are two parts, one is the split operation for the csv that the other responder has already picked up on, which I don't think is essential but I'll include anyways. And the big one is the writing to the excel file, which I was able to get working, but under specific circumstances and it was a pain to accomplish.
CSV is pretty simple, you can do a string.split on a comma seperator if you want. However, this method is horribly broken, albeit I'll admit I've used it myself, mainly because I also have control over the source data, and know that no quotes or escape characters will ever appear. I've included a link to an article on proper csv parsing, however, I have never tested the source or fully audited the code myself. I have used other code by the same author with success. http://www.boyet.com/articles/csvparser.html
The second part is alot more complex, and was a huge pain for me. The approach I took was to use the jet driver to treat the excel file like a database, and then run SQL queries against it. There are a few limitations, which may cause this to not fit you're goal. I was looking to use prebuilt excel file templates to basically display data and some preset functions and graphs. To accomplish this I have several tabs of report data, and one tab which is raw_data. My program writes to the raw_data tab, and all the other tabs calculations point to cells in this table. I'll go into some of the reasoning for this behavior after the code:
First off, the imports (not all may be required, this is pulled from a larger class file and I didn't properly comment what was for what):
using System.IO;
using System.Diagnostics;
using System.Data.Common;
using System.Globalization;
Next we need to define the connection string, my class already has a FileInfo reference at this point to the file I want to use, so that's what I pass on. It's possible to search on google what all the parameters are for, but basicaly use the Jet Driver (should be available on ANY windows install) to open an excel file like you're referring to a database.
string connectString = #"Provider=Microsoft.Jet.OLEDB.4.0;Data Source={filename};Extended Properties=""Excel 8.0;HDR=YES;IMEX=0""";
connectString = connectString.Replace("{filename}", fi.FullName);
Now let's open up the connection to the DB, and be ready to run commands on the DB:
DbProviderFactory factory = DbProviderFactories.GetFactory("System.Data.OleDb");
using (DbConnection connection = factory.CreateConnection())
{
connection.ConnectionString = connectString;
using (DbCommand command = connection.CreateCommand())
{
connection.Open();
Next we need the actual logic for DB insertion. So basically throw queries into a loop or whatever you're logic is, and insert the data row-by-row.
string query = "INSERT INTO [raw_aaa$] (correlationid, ipaddr, somenum) VALUES (\"abcdef", \"1.1.1.1", 10)";
command.CommandText = query;
command.ExecuteNonQuery();
Now here's the really annoying part, the excel driver tries to detect you're column type before insert, so even if you pass a proper integer value, if excel thinks the column type is text, it will insert all you're numbers as text, and it's very hard to get this treated like a number. As such, excel must already have the column type as the number. In order to accomplish this, for my template file I fill in the first 10 rows with dummy data, so that when you load the file in the jet driver, it can detect the proper types and use them. Then all my forumals that point at my csv table will operate properly since the values are of the right type. This may work for you if you're goals are similar to mine, and to use templates that already point to this data (just start at row 10 instead of row 2).
Because of this, my raw_aaa tab in excel might look something like this:
correlationid ipaddr somenum
abcdef 1.1.1.1 5
abcdef 1.1.1.1 5
abcdef 1.1.1.1 5
abcdef 1.1.1.1 5
abcdef 1.1.1.1 5
abcdef 1.1.1.1 5
abcdef 1.1.1.1 5
abcdef 1.1.1.1 5
Note row 1 is the column names that I referenced in my sql queries. I think you can do without this, but that will require a little more research. By already having this data in the excel file, the somenum column will be detected as a number, and any data inserted will be properly treated as such.
Antoher note that makes this annoying, the Jet Driver is 32-bit only, so in my case where I had an explicit 64-bit program, I was unable to execute this directly. So I had the nasty hack of writing to a file, then launch a program that would insert the data in the file into my excel template.
All in all, I think the solution is pretty nasty, but thus far haven't found a better way to do this unfortunatly. Good luck!

You can take a look at TakeIo.Spreadsheet .NET library. It accepts files from Excel 97-2003, Excel 2007 and newer, and CSV format (semicolon or comma separators).
Example:
var inputFile = new FileInfo("Book1.csv"); // could be .xls or .xlsx too
var sheet = Spreadsheet.Read(inputFile);
foreach (var row in sheet)
{
foreach (var cell in row)
{
// do something
}
}
You can remove beginning and trailing empty rows, and also beginning and trailing columns from the imported data using the Normalize() function:
sheet.Normalize();
Sometimes you can find that your imported data contains empty rows between data, so you can use another helper for this case:
sheet.RemoveEmptyRows();
There is a Serialize() function to convert any input to CSV too:
var outfile = new StreamWriter("AllData.csv");
sheet.Serialize(outfile);
If you like to use comma instead of the default semicolon separator in your CSV file, do:
sheet.Serialize(outfile, ',');
And yes, there is also a ToString() function too...
This package is available at NuGet too, just take a look at TakeIo.Spreadsheet.

You can use ADO.NET
http://vbadud.blogspot.com/2008/09/opening-comma-separate-file-csv-through.html

Well, importing from CSV shouldn't be a big deal. I think the most basic method would be to do it using string operations. You could build a pretty fine parser using simple Split() command, and getting the stuff in arrays.

Saving a text file to a SQL database without column names

I am reading a text file in C# and trying to save it to a SQL database. I am fine except I don't want the first line, which is the names of the columns, included in the import. What's the easiest way to exclude these?
The code is like this
while (textIn.Peek() != -1)
{
string row = textIn.ReadLine();
string[] columns = row.Split(' ');
Product product = new Product();
product.Column1 = columns[0];
etc.....
product.Save();
}
thanks

If you are writing the code yourself to read in the file and then importing...why don't you just skip over the first line?

Here's my suggestion:
string[] file_rows;
using(var reader=File.OpenText(filepath))
{
file_rows=reader.ReadToEnd().Split("\r\n");
reader.Close();
}
for(var i=1;i<file_rows.Length;i++)
{
var row=file_rows[i];
var cells=row.Split("\t");
....
}

how are you importing the data? if you are looping in C# and inserting them one at a time, construct your loop to skip the first insert!
or just delete the first row inserted after they are there.
give more info, get more details...

Pass a flag into the program (in case in future the first line is also data) that causes the program to skip the first line of text.
If it's the same column names as used in the database you could also parse it to grab the column names from that instead of hard-coding them too (assuming that's what you're doing currently :)).
As a final note, if you're using a MySQL database and you have command line access, you may want to look at the LOAD DATA LOCAL INFILE syntax which lets you import pretty arbitrarily defined CSV data.

For future reference have a look at this awesome package: FileHelpers Library
I can't add links just yet but google should help, it's on sourceforge
It makes our lives here a little easier when people insist on using files as integration

Develop Reference

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.