Fine and fast CSV reader - C#

I'm using the LumenWorks CSV reader and I must say I'm not very happy with how it works so far.
I'm parsing thousands of CSV files within an hour, and there is always a problem: it either throws an exception complaining about bad records or skews the columns, etc.
Can you recommend a good CSV reader? It doesn't have to be free, but it should be bug-free.
Thank you.

FileHelpers Open Source Library http://www.filehelpers.net/

Try CsvHelper (a library I maintain). It's also available on NuGet.
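A minimal sketch of how reading a file with CsvHelper typically looks (the `Record` class and file name here are placeholders; recent versions of the library take a culture in the constructor):

```csharp
using System.Collections.Generic;
using System.Globalization;
using System.IO;
using System.Linq;
using CsvHelper;

public class Record // hypothetical shape matching your CSV columns
{
    public string Timestamp { get; set; }
    public string Total1 { get; set; }
}

// ...
using (var reader = new StreamReader("data.csv"))
using (var csv = new CsvReader(reader, CultureInfo.InvariantCulture))
{
    // Maps each row onto a Record by header name.
    List<Record> records = csv.GetRecords<Record>().ToList();
}
```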

You cite that you are receiving exceptions and such from the files. While these may be undesired, have you investigated the cause?
You might just want to use one of the current parsers that are on the table and, when an exception occurs, try an alternative and/or handle the scenario with custom code. I know it's not exactly what you are looking for, but the problem may not be the tools you are using but the input the tools are receiving.
You could also move the offending file to a separate directory (in code) to look at a bit later, and get what will process, processed.
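A sketch of that quarantine idea; the directory names and `ProcessCsv` routine are hypothetical stand-ins for your own:

```csharp
using System;
using System.IO;

foreach (string path in Directory.GetFiles(@"C:\csv\incoming", "*.csv"))
{
    try
    {
        ProcessCsv(path); // your existing parsing routine
    }
    catch (Exception ex) // ideally catch the parser's specific exception type
    {
        // Move the offending file aside so the rest of the batch keeps running.
        File.Move(path, Path.Combine(@"C:\csv\failed", Path.GetFileName(path)));
        Console.WriteLine(string.Format("Quarantined {0}: {1}", path, ex.Message));
    }
}
```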

There is a CSV parser built into .NET.
From http://coding.abel.nu/2012/06/built-in-net-csv-parser/:
// TextFieldParser is in the Microsoft.VisualBasic.FileIO namespace.
using (TextFieldParser parser = new TextFieldParser(path))
{
    parser.CommentTokens = new string[] { "#" };
    parser.SetDelimiters(new string[] { ";" });
    parser.HasFieldsEnclosedInQuotes = true;

    // Skip over header line.
    parser.ReadLine();

    while (!parser.EndOfData)
    {
        string[] fields = parser.ReadFields();
        yield return new Brand()
        {
            Name = fields[0],
            FactoryLocation = fields[1],
            EstablishedYear = int.Parse(fields[2]),
            Profit = double.Parse(fields[3], swedishCulture)
        };
    }
}

You have to check the input files. I think these tools don't stop at a format check because they aim for throughput (skipping erroneous data in order to process the maximum number of files).
In the real world you rarely see a stream of clean CSV. Writers tend to produce their own variant:
- no quotes
- semicolons instead of commas
Files that generate errors usually come from the same source.

It's been a long time since I used it, but FileHelpers does CSV parsing with lots of options.

Related

Removing text above real content of CSV file

I have a CSV whose author, annoyingly enough, has decided to 'introduce' the file before the contents themselves. So in all, I have a CSV that looks like:
This file was created by XXXXYY and represents the crossover between YY and QQQ.
Additional information can be found through the website GG, blah blah blah...
Jacob, Hybrid
Dan, Pure
Lianne, Hybrid
Jack, Hatchback
So the problem here is that I want to get rid of the first few lines before the 'real content' of the CSV file begins. I'm looking for robustness here, so using StreamReader and removing all content before the 4th line, for example, is not ideal (plus the length of the text can vary).
Is there a way in which one can read only what matters and write a new CSV into a directory path?
Regards,
genesis
(edit - I'm looking for C# code)
The solution depends on the files you have to parse. You need to look for a reliable pattern that distinguishes data from comment.
In your example, there are some possibilities that might be the same in other files:
- there are 4 lines of text, but you say this isn't consistent across files;
- the text lines may not contain the same number of commas as the data table, but that is unlikely to be reliable for all files;
- there is a blank/whitespace-only line between the text and the data;
- the data appears to be in the form word-comma-word. If this is true, it should be easy to identify non-data lines (any line which doesn't contain exactly one comma, or has multiple words, etc.).
You may be able to use a combination of these heuristics to detect the data more reliably.
You could scan by line (looking for the \r\n) and ignore lines whose comma count doesn't match your CSV.
You should be able to read the file into a string pretty easily unless it is really massive.
e.g.
var csv = "some text\r\nsome more text\r\na,b,c\r\nd,e,f\r\n";
var lines = csv.Split(new[] { "\r\n" }, StringSplitOptions.RemoveEmptyEntries);
// Keep only the lines with the expected comma count (2 here).
var csvLines = lines.Where(l => l.Count(c => c == ',') == 2);
// now csvLines contains only the lines you are after
List<string> info = new List<string>();
int counter = 0;

// Open the file to read from.
info = System.IO.File.ReadAllLines(path).ToList();

// Find the lines up until (& including) the empty one.
foreach (string s in info)
{
    counter++;
    if (string.IsNullOrEmpty(s))
        break; // exit from the loop
}

// Remove the lines including the blank one.
info.RemoveRange(0, counter);
Something like this should work; you should probably put some tests in to make sure counter is not greater than the length, plus other checks to handle errors.
You could adapt this code so that it just finds the empty line number using LINQ or something, but I don't like the overhead of LINQ (yeah, ironic considering I'm using C#).
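For what it's worth, that LINQ-style adaptation might look something like this (the input and output paths are placeholders):

```csharp
using System.IO;
using System.Linq;

var lines = File.ReadAllLines(@"C:\input.csv").ToList();

// Index of the first blank/whitespace-only line, or -1 if there is none.
int blank = lines.FindIndex(string.IsNullOrWhiteSpace);

// Keep everything after the blank line; fall back to the whole file.
var dataLines = blank >= 0 ? lines.Skip(blank + 1) : lines;
File.WriteAllLines(@"C:\output.csv", dataLines);
```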
Regards,
Slipoch

Parsing Resx file with C# crashes on relative paths

Out of sheer frustration at having to copy resx data into Word to get the word count, I've started to write my own tool to do so.
Well, that made me run into an issue.
I have icons and such things in the Resources.resx file, and they have relative paths according to the project they are being used in. Which they should have, obviously.
When I try to parse the resx file in another application to count the words from the Value column, I get errors because it can't resolve the relative paths; they end up pointing to folders that do not exist in my word-count application.
Does anyone have an idea how I can fool the app into looking in the right folder when parsing these values?
I'm not quite sure why it is parsing those values to begin with; it should just grab the string, that's all I care about.
I'm using the ResXResourceReader:
ResXResourceReader reader = new ResXResourceReader(filename);
foreach (System.Collections.DictionaryEntry de in reader)
{
    if (((string)de.Key).EndsWith(".Text"))
    {
        System.Diagnostics.Debug.WriteLine(string.Format("{0}: {1}", de.Key, de.Value));
    }
}
I found this here: Word Count of .resx files
It errors out on the foreach, on ..\common\app.ico for example.
Anyone have an idea on how to do this?

Alright, so the solution was a little easier than expected.
I was using an outdated class: I should have been using XDocument instead of XmlDataDocument.
Secondly, LINQ is the bomb. All I had to do was this:
try
{
    XDocument xDoc = XDocument.Load(resxFile);
    var result = from item in xDoc.Descendants("data")
                 select new
                 {
                     Name = item.Attribute("name").Value,
                     Value = item.Element("value").Value
                 };
    resxGrid.DataSource = result.ToArray();
}
catch (System.Xml.XmlException)
{
    // Handle a malformed resx file here.
}
And you can even allow empty strings if you cast those attributes/elements to (string).
Hope this helps someone!
Try to use ResXResourceReader for this purpose - see http://msdn.microsoft.com/en-us/library/czdde9sc.aspx

C# smart way to delete multiple occurrences of a character in a string

My program reads a file which has thousands of lines of something like this:
"Timestamp","LiveStandby","Total1","Total2","Total3", etc.
Each line is different.
What is the best way to split on the comma and delete the quotes, as well as put the values in a list?
this is what I have
while ((line = file.ReadLine()) != null)
{
    List<string> title_list = new List<string>(line.Split(','));
}
The step above is still missing the deletion of the quotes. I could do a foreach, but that kind of defeats the purpose of having List and Split in just one line. What is the best and smartest way to do it?
The best way in my opinion is to use a library that parses CSV, such as FileHelpers.
Concretely, in your case, this would be the solution using the FileHelpers library:
Define a class that describes the structure of a record:
[DelimitedRecord(",")]
public class MyDataRecord
{
    [FieldQuoted('"')]
    public string TimeStamp;

    [FieldQuoted('"')]
    public string LiveStandby;

    [FieldQuoted('"')]
    public string Total1;

    [FieldQuoted('"')]
    public string Total2;

    [FieldQuoted('"')]
    public string Total3;
}
Use this code to parse the entire file:
var csvEngine = new FileHelperEngine<MyDataRecord>(Encoding.UTF8)
{
    Options = { IgnoreFirstLines = 1, IgnoreEmptyLines = true }
};
var parsedItems = csvEngine.ReadFile(@"D:\myfile.csv");
Please note that this code is for illustration only and I have not compiled/run it. However, the library is pretty straightforward to use and there are good examples and documentation on the website.
Keeping it simple like this should work:
List<string> strings = new List<string>();
while ((line = file.ReadLine()) != null)
    strings.AddRange(line.Replace("\"", "").Split(','));
I'm going to clarify this a bit. If you have a user-formatted file that has a predictable format (i.e. the user has generated the data out of Excel or a similar program), then you are way better off using an existing parser that is well tested.
Scenarios like the following are just a few examples that manual parsing will have problems with:
"column 1", 2, 0104400, $1,300, "This is an interesting question, he said"
...and there are more, with escaping, file formats, etc., that can be a headache for roll-your-own.
If you do that, then ensure you get one that can tolerate differences in columns per row, as it can make a difference.
If, on the other hand, you know what's going into the data, which is common in system-generated files, then using CSV parsers will cause more problems than they solve. For example, I have dealt with scenarios where the first part of a row is fixed and can be strongly typed, but the following parts are not. This can also happen if you're parsing flat-file data in fixed-width scenarios from legacy databases. A CSV solution makes assumptions we don't want and is not the right solution in many of those cases.
If that is the case and you just want to strip out quotes after splitting on commas, then try a bit of LINQ. This can also be extended to replace other specific characters you are worried about.
line.Split(',').Select(i => i.Replace("\"", "")).ToArray()
Hope that clears up all the conflicting advice.
You can use the Array.ConvertAll() function.
string line = "\"Timestamp\",\"LiveStandby\",\"Total1\",\"Total2\",\"Total3\"";
var list = new List<String>(Array.ConvertAll(line.Split(','), x=> x.Replace("\"","")));
Perform the Replace first, then Split into your List. Here's your code with Replace.
while ((line = file.ReadLine()) != null)
{
    List<string> title_list = new List<string>(line.Replace("\"", "").Split(','));
}
Although, you're going to need a variable to hold all of the Lists, so look at using AddRange().

Convert CSV to XML when CSV contains both character and number data

From this thread, I got the basic info on how to parse CSV to create XML. Unfortunately, the text fields (all enclosed in quotes) sometimes contain commas, so line.Split(',') gives me too many columns. I can't figure out how to parse the CSV so that it distinguishes between commas within a text field and commas separating fields. Any thoughts on how to do that?
Thanks!
Go grab this code: http://geekswithblogs.net/mwatson/archive/2004/09/04/10658.aspx
Then replace line.Split(",") with SplitCSV(line), like:
var lines = File.ReadAllLines(@"C:\text.csv");
var xml = new XElement("TopElement",
    lines.Select(line => new XElement("Item",
        SplitCSV(line)
            .Select((column, index) => new XElement("Column" + index, column)))));
xml.Save(@"C:\xmlout.xml");
Note that the code at the link above is rather old, and probably could be cleaned up a bit using Linq, but it should do the trick.
Try FileHelpers.
The FileHelpers are a free and easy to use .NET library to import/export data from fixed length or delimited records in files, strings or streams.
What about using the pipe character "|"? This often happens with CSV files, and a better approach is to separate on pipes.
If your CSV files are too complex to make writing your own parser practical, use another parser. The Office ACE OLEDB provider may already be available on your system, but may be overkill for your purposes. I haven't used any of the lightweight alternatives, so I can't speak to their suitability.
Here's a little trick if you don't want to use Regex: instead of splitting on the comma alone, you can split on the quote-comma-quote sequence ","
Assuming there's no space before or after the comma:
line.Split(new[] { "\",\"" }, StringSplitOptions.None)
You will need to remove the quote before the first field and after the last field, however.
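A sketch of that cleanup, assuming every field is quoted and there are no embedded quote characters:

```csharp
using System;

string line = "\"a\",\"b,c\",\"d\"";

// Trim the outer quotes first, then split on the quote-comma-quote sequence.
string[] fields = line.Trim('"').Split(new[] { "\",\"" }, StringSplitOptions.None);
// fields: a | b,c | d
```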
While I'm almost always against regular expressions, here's a solution using them.
Assume you have data as such:
"first name","last name","phone number"
"john,jane","doe","555-5555"
Then, the following code:
string csv = GetCSV(); // will load your CSV, or the above data
foreach (string line in csv.Split('\n'))
{
    Console.WriteLine("--- Begin record ---");
    foreach (Match m in Regex.Matches(line, "\".+?\""))
        Console.WriteLine(m.Value);
}
will output this:
--- Begin record ---
"first name"
"last name"
"phone number"
--- Begin record ---
"john,jane"
"doe"
"555-5555"
But I would not recommend the Regex approach if you have something like a 2 GB CSV file.
So you can use that as your baseline for making up your XML records.
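Building on that baseline, one way to turn the matched fields into XML elements (the element names here are arbitrary):

```csharp
using System.Linq;
using System.Text.RegularExpressions;
using System.Xml.Linq;

string csv = "\"first name\",\"last name\"\n\"john,jane\",\"doe\"";

// One <Record> per line; one <FieldN> per quoted match, quotes stripped.
var xml = new XElement("Records",
    csv.Split('\n').Select(line =>
        new XElement("Record",
            Regex.Matches(line, "\".+?\"").Cast<Match>()
                 .Select((m, i) => new XElement("Field" + i, m.Value.Trim('"'))))));
```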

What's the best way to read a tab-delimited text file in C#

We have a text file with about 100,000 rows, about 50 columns per row, most of the data is pretty small (5 to 10 characters or numbers).
This is a pretty simple task, but just wondering what the best way would be to import this data into a C# data structure (for example a DataTable)?
I would read it in as a CSV with the tab column delimiters:
A Fast CSV Reader
Edit:
Here's a barebones example of what you'd need:
DataTable dt = new DataTable();
using (CsvReader csv = new CsvReader(new StreamReader(CSV_FULLNAME), false, '\t'))
{
    dt.Load(csv);
}
Where CSV_FULLNAME is the full path + filename of your tab delimited CSV.
Use .NET's built in text parser. It is free, has great error handling, and deals with a lot of odd ball cases.
http://msdn.microsoft.com/en-us/library/microsoft.visualbasic.fileio.textfieldparser(VS.80).aspx
What about FileHelpers? You can define the tab as a delimiter. Head on over to that site via the link supplied and have a peek.
Hope this helps,
Best regards,
Tom.
Two options:
Use the classes in the System.Data.OleDb namespace. This has the advantage of reading directly into a DataTable like you asked, with very little code, but it can be tricky to get right because the file is tab rather than comma delimited.
Use or write a CSV parser. Make sure it's a state-machine-based parser like the one @Jay Riggs linked to, rather than a String.Split()-based parser. This should be faster than the OleDb method, but it will give you a List or array rather than a DataTable.
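For reference, a sketch of the OleDb route. The provider string and paths are assumptions; the Jet text driver usually also wants a schema.ini file with Format=TabDelimited sitting next to the data file, and the Jet provider is 32-bit only:

```csharp
using System.Data;
using System.Data.OleDb;

var dt = new DataTable();

// The connection points at the *directory*; the file name acts as the "table".
string connStr = @"Provider=Microsoft.Jet.OLEDB.4.0;Data Source=C:\data;" +
                 "Extended Properties=\"text;HDR=Yes;FMT=TabDelimited\"";

using (var conn = new OleDbConnection(connStr))
using (var adapter = new OleDbDataAdapter("SELECT * FROM [data.txt]", conn))
{
    adapter.Fill(dt); // loads the whole file into the DataTable
}
```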
However you parse the lines, make sure you use something that supports forwarding and rewinding as the data source of your data grid. You don't want to load everything into memory first, do you? What if the amount of data is ten times larger next time? Build something that uses file seeks deep down; don't read everything into memory first. That's my advice.
Simple, but not necessarily a great way:
- Read the file using a text reader into a string
- Use String.Split to get the rows
- Use String.Split with a tab character to get field values
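Those three steps can be sketched like this (a naive approach: it breaks if a field itself contains a tab or newline):

```csharp
using System;
using System.Linq;

// 1. Sample content standing in for File.ReadAllText(path).
string text = "Name\tAge\r\nAlice\t30\r\nBob\t25\r\n";

// 2. Split into rows (handles both \r\n and \n line endings).
string[] rows = text.Split(new[] { "\r\n", "\n" },
                           StringSplitOptions.RemoveEmptyEntries);

// 3. Split each row on tabs.
string[][] fields = rows.Select(r => r.Split('\t')).ToArray();
// fields[1][0] == "Alice", fields[1][1] == "30"
```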
