Convert CSV to XML when CSV contains both character and number data - c#

From this thread, I got the basic info on how to parse CSV to create XML. Unfortunately, the text fields (all enclosed in quotes) sometimes contain commas, so line.Split(',') gives me too many columns. I can't figure out how to parse the CSV so that it distinguishes between commas within a text field and commas separating fields. Any thoughts on how to do that?
Thanks!

Go grab this code: http://geekswithblogs.net/mwatson/archive/2004/09/04/10658.aspx
Then replace line.Split(",") with SplitCSV(line), like:
// Requires: using System.IO; using System.Linq; using System.Xml.Linq;
var lines = File.ReadAllLines(@"C:\text.csv");
var xml = new XElement("TopElement",
    lines.Select(line => new XElement("Item",
        SplitCSV(line)
            .Select((column, index) => new XElement("Column" + index, column)))));
xml.Save(@"C:\xmlout.xml");
Note that the code at the link above is rather old, and probably could be cleaned up a bit using Linq, but it should do the trick.
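If the linked code is not to hand, a quote-aware splitter along these lines could stand in for it. This is only a minimal sketch and not the code from the link: the names CsvSketch and SplitCsvLine are made up for illustration, and it assumes a field never spans more than one line.

```csharp
using System.Collections.Generic;
using System.Text;

public static class CsvSketch
{
    // Hypothetical stand-in for the SplitCSV helper mentioned above.
    // Splits one line on commas that are outside double quotes.
    public static List<string> SplitCsvLine(string line)
    {
        var fields = new List<string>();
        var current = new StringBuilder();
        bool inQuotes = false;

        for (int i = 0; i < line.Length; i++)
        {
            char c = line[i];
            if (inQuotes)
            {
                if (c == '"')
                {
                    // A doubled quote inside a quoted field is a literal quote.
                    if (i + 1 < line.Length && line[i + 1] == '"')
                    {
                        current.Append('"');
                        i++;
                    }
                    else
                    {
                        inQuotes = false;
                    }
                }
                else
                {
                    current.Append(c);
                }
            }
            else if (c == '"')
            {
                inQuotes = true;
            }
            else if (c == ',')
            {
                // Comma outside quotes: the current field is complete.
                fields.Add(current.ToString());
                current.Clear();
            }
            else
            {
                current.Append(c);
            }
        }

        fields.Add(current.ToString());
        return fields;
    }
}
```

With that in place, SplitCSV(line) in the snippet above would become CsvSketch.SplitCsvLine(line).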

Try FileHelpers.
The FileHelpers are a free and easy to use .NET library to import/export data from fixed length or delimited records in files, strings or streams.

What about using the pipe character "|"? This problem often comes up with CSV files, and a better approach is to separate on pipes.

If your CSV files are too complex to make writing your own parser practical, use another parser. The Office ACE OLEDB provider may already be available on your system, but may be overkill for your purposes. I haven't used any of the lightweight alternatives, so I can't speak to their suitability.

Here's a little trick if you don't want to use Regex. Instead of splitting on the comma alone, you can split on the comma and its surrounding quotes together: ","
Assuming there's no space before or after the comma:
line.Split(new[] { "\",\"" }, StringSplitOptions.None)
You will need to remove the quote before the first field and after the last field however.
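Put together, the trick might look like the sketch below. The class name QuoteCommaSplit is made up for illustration, and the code assumes that every field is quoted, that there are no spaces around the separating commas, and that no field contains the sequence "," itself.

```csharp
using System;

public static class QuoteCommaSplit
{
    // Splits on the quote-comma-quote sequence, then strips the
    // leftover opening and closing quotes of the line.
    public static string[] Split(string line)
    {
        string[] parts = line.Split(new[] { "\",\"" }, StringSplitOptions.None);
        if (parts.Length > 0)
        {
            // Opening quote of the first field.
            parts[0] = parts[0].TrimStart('"');
            // Closing quote of the last field.
            parts[parts.Length - 1] = parts[parts.Length - 1].TrimEnd('"');
        }
        return parts;
    }
}
```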

While I'm almost always against regular expressions, here's a solution using one.
Assume you have data as such:
"first name","last name","phone number"
"john,jane","doe","555-5555"
Then, the following code:
string csv = GetCSV(); // will load your CSV, or the above data
foreach (string line in csv.Split('\n'))
{
    Console.WriteLine("--- Begin record ---");
    foreach (Match m in Regex.Matches(line, "\".+?\""))
        Console.WriteLine(m.Value);
}
will output this:
--- Begin record ---
"first name"
"last name"
"phone number"
--- Begin record ---
"john,jane"
"doe"
"555-5555"
But I would not recommend the Regex approach if you have something like a 2 GB CSV file.
So you can use that as your baseline for making up your XML records.

Related

Find a pattern and replace an element of it

I have the following problem:
I am trying to split the rows of a CSV file but the thing is that sometimes I read the following line:
string input = "a,b,c,d,\"V=12.503,I=0.194\",e,f";
I use the following code:
string[] SplittedLine = input.Split(',');
The result is that I get an extra column, because the data \"V=12.503,I=0.194\" has a comma inside. When I open the CSV file with Excel, however, I notice that Excel doesn't add an extra column, because it doesn't split that value in two. How can I properly split this CSV file considering this situation?
You are encountering commas in the "cells" of your CSV, which by convention (but not by any standard) are escaped by wrapping the cell data with double quotes. You also need to be aware that the quote-escaped string can contain quote literals.
Let's say you had a name column and someone's name was
Jonathan "Jake" Smith, Jr.
That would be encoded as
"Jonathan ""Jake"" Smith, Jr."
You can certainly improve your code to handle those cases. However, that problem has been solved before. If you don't want to reinvent the wheel, there are a number of solid open source libraries that handle the headache of parsing CSV files. The one I use is
http://www.codeproject.com/Articles/9258/A-Fast-CSV-Reader
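To make the quoting convention concrete, here is a minimal sketch of the encoding side, i.e. how a writer would produce the escaped form shown above. The name CsvFieldEncoder is made up for illustration, and the rule it applies (quote only when the field contains a comma, a quote, or a line break) is one common choice rather than something every CSV producer follows.

```csharp
public static class CsvFieldEncoder
{
    // Wraps a field in double quotes when it contains a comma, a quote,
    // or a line break, and doubles any quotes inside it.
    public static string Encode(string field)
    {
        bool needsQuotes = field.IndexOfAny(new[] { ',', '"', '\r', '\n' }) >= 0;
        if (!needsQuotes)
        {
            return field;
        }
        return "\"" + field.Replace("\"", "\"\"") + "\"";
    }
}
```

A reader has to undo exactly these two steps, which is why a plain Split(',') cannot work on such data.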

Removing text above real content of CSV file

I have a CSV whose author, annoyingly enough, has decided to 'introduce' the file before the contents themselves. So in all, I have a CSV that looks like:
This file was created by XXXXYY and represents the crossover between YY and QQQ.
Additional information can be found through the website GG, blah blah blah...
Jacob, Hybrid
Dan, Pure
Lianne, Hybrid
Jack, Hatchback
So the problem here is that I want to get rid of the first few lines before the 'real content' of the CSV file begins. I'm looking for robustness here, so using Streamreader and removing all content before the 4th line for example, is not ideal (plus the length of the text can vary).
Is there a way in which one can read only what matters and write a new CSV into a directory path?
Regards,
genesis
(edit - I'm looking for C sharp code)
The solution depends on the files you have to parse. You need to look for a reliable pattern that distinguishes data from comment.
In your example, there are some possibilities that might be the same in other files:
There are 4 lines of text. But you say this isn't consistent across files.
The text lines may not contain the same number of commas as the data table. But that is unlikely to be reliable for all files.
There is a blank/whitespace-only line between the text and the data.
The data appears to be in the form word-comma-word. If this is true, it should be easy to identify non-data lines (any line which doesn't contain exactly one comma, or has multiple words, etc.).
You may be able to use a combination of these heuristics to more reliably detect the data.
You could scan by line (looking for the \r\n) and ignore lines that don't have a comma count that matches your CSV.
You should be able to read the file into a string pretty easily unless it is really massive.
e.g.
var csv = "some test\r\nsome more text\r\na,b,c\r\nd,e,f\r\n";
var lines = csv.Split(new[] { "\r\n" }, StringSplitOptions.RemoveEmptyEntries);
var csvLines = lines.Where(l => l.Count(c => c == ',') == 2);
// now csvLines contains only the lines you are after
List<string> info = new List<string>();
int counter = 0;
// Open the file to read from.
info = System.IO.File.ReadAllLines(path).ToList();
// Find the lines up until (& including) the empty one
foreach (string s in info)
{
    counter++;
    if (string.IsNullOrEmpty(s))
        break; // exit from the loop
}
// Remove the lines including the blank one.
info.RemoveRange(0, counter);
Something like this should work. You should probably add checks to make sure counter does not exceed the list length (if no blank line is found, every line gets removed), plus other tests to handle errors.
You could adapt this code so that it just finds the empty line number using LINQ or something, but I don't like the overhead of LINQ (yeah, ironic considering I'm using C#).
Regards,
Slipoch
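The LINQ-style adaptation mentioned above could look roughly like this. It is only a sketch: the name PreambleSkipper is made up for illustration, it assumes the preamble is always followed by a blank or whitespace-only line, and it keeps the whole file when no such line exists instead of removing everything.

```csharp
using System.Collections.Generic;
using System.Linq;

public static class PreambleSkipper
{
    // Returns the lines that follow the first blank line.
    // If there is no blank line, all lines are returned unchanged.
    public static List<string> SkipPreamble(IEnumerable<string> lines)
    {
        var all = lines.ToList();
        int blank = all.FindIndex(l => string.IsNullOrWhiteSpace(l));
        return blank < 0 ? all : all.Skip(blank + 1).ToList();
    }
}
```

To write the cleaned file, pass the result to System.IO.File.WriteAllLines together with the output path, for example File.WriteAllLines(outPath, PreambleSkipper.SkipPreamble(File.ReadAllLines(path))), where outPath and path are whatever locations you use.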

CSV with \n\r in lines - how to define a line end?

I have a very big CSV with 244 columns and 4000 rows.
There are a lot of \n\r in it, so when I try to split on them (to find the end of a line) I get around 9000 rows instead of the 4000 I expect.
So how do I determine which \n\r is within text, or maybe at the end of a cell, and which one is a definitive end of a line?
When a CSV file has data in a column that contains \n, \r or a comma, those values are usually wrapped in quotes. To parse the CSV correctly, I would recommend an existing parser. See this answer as an example.
If you truly want to do it on your own, you have to write a simple state machine that reads the data column by column. While reading a column you have to take care of the escaping rules. Only that way can you distinguish between line endings inside data and line endings that separate rows.
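A stripped-down version of such a state machine, handling only the record boundaries, might look like this. It is a sketch under assumptions: the name CsvRecordSplitter is made up, fields containing line breaks are assumed to be wrapped in double quotes, and completely empty lines are dropped.

```csharp
using System.Collections.Generic;
using System.Text;

public static class CsvRecordSplitter
{
    // Splits raw CSV text into records, treating line breaks inside
    // double-quoted fields as data rather than as record separators.
    public static List<string> SplitRecords(string text)
    {
        var records = new List<string>();
        var current = new StringBuilder();
        bool inQuotes = false;

        foreach (char c in text)
        {
            if (c == '"')
            {
                // A doubled quote toggles the state twice, leaving it unchanged.
                inQuotes = !inQuotes;
                current.Append(c);
            }
            else if ((c == '\n' || c == '\r') && !inQuotes)
            {
                // Line break outside quotes: the record is complete.
                // The empty piece between \r and \n is skipped.
                if (current.Length > 0)
                {
                    records.Add(current.ToString());
                    current.Clear();
                }
            }
            else
            {
                current.Append(c);
            }
        }

        if (current.Length > 0)
        {
            records.Add(current.ToString());
        }
        return records;
    }
}
```

Each returned record still contains its quotes, so it can be handed to a quote-aware field splitter afterwards.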
Try using Environment.NewLine for splitting instead of \n\r:
string path = yourfilepath;
string csv = System.IO.File.ReadAllText(path);
List<string> rows = csv.Split(new string[] {Environment.NewLine }, System.StringSplitOptions.RemoveEmptyEntries).ToList();

Import data from CSV file with comma in string cells

I want to import data from a CSV file, but some cells contain a comma in the string value. How can I recognize which comma is a separator and which is part of the cell content?
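Use TextFieldParser. Usage:
using Microsoft.VisualBasic.FileIO; // reference Microsoft.VisualBasic.dll
...
// reader is a TextReader (or Stream) over your CSV data
using (var csvReader = new TextFieldParser(reader))
{
    csvReader.SetDelimiters(new string[] { "," });
    csvReader.HasFieldsEnclosedInQuotes = true;
    // ReadFields returns the fields of the next line
    while (!csvReader.EndOfData)
    {
        string[] fields = csvReader.ReadFields();
    }
}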
In general, do not bother writing the import yourself.
I have good experiences with the FileHelpers lib.
http://www.filehelpers.com/
And indeed, I hope your fields are quoted. FileHelpers supports this out of the box.
Otherwise there is not much you can do.
Unless you have quotes around the strings you are pretty much hosed, hence the "quote and comma" delimiter style. If you have control of the export facility, select the "enclose strings in quotes" option or change the delimiter to something like a tilde or caret symbol.
If not, well, then you have to write some code. If you detect "a..z" then start counting commas and keep working through the string until you detect [0..9], and even then this is going to be problematic, since people can put a [0..9] in their text. At best this is going to be a best-effort process. You're going to have to know when you are in chars and when you are not. I doubt even regex will help you much on this.
The only other thing I can think of is to run through the data and look for commas. Then look at what comes before and after each comma. If it is surrounded by chars, replace the comma with an alternate char like the caret "^" or the tilde "~". Then process the file as normal, and afterwards go back and replace the alternate char with a comma.
Good luck.
Using FileHelpers is definitely the way to go. They have done a great job building all the logic for you. I had the same issue, where I had to parse a CSV file that had commas as part of a field, and this utility did the job very well. All you have to do is put the following attribute on the field:
[FieldQuoted('"', QuoteMode.OptionalForBoth)]
For details http://www.filehelpers.com/forums/viewtopic.php?f=12&t=391
We can also use Regex, as below.
Regex CSVParser = new Regex(",(?=(?:[^\"]*\"[^\"]*\")*(?![^\"]*\"))");
String[] Fields = CSVParser.Split(Test);

Parsing a CSV File with C#, ignoring thousand separators

Working on a program that takes a CSV file and splits on each ",". The issue I have is there are thousand separators in some of the numbers. In the CSV file, the numbers render correctly. When viewed as a text document, they are shown like below:
Dog,Cat,100,100,Fish
In a CSV file, there are four cells, with the values "Dog", "Cat", "100,100", "Fish". When I split on the "," into an array of strings, it contains 5 elements, when what I want is 4. Anyone know a way to work around this?
Thanks
There are two common mistakes made when reading CSV data: using a split() function and using regular expressions. Both approaches are wrong, in that they are prone to corner cases such as yours and slower than they could be.
Instead, use a dedicated parser such as Microsoft.VisualBasic.FileIO.TextFieldParser, CodeProject's FastCSV or Linq2csv, or my own implementation here on Stack Overflow.
Typically, CSV files would wrap these elements in quotes, causing your line to be displayed as:
Dog,Cat,"100,100",Fish
This would parse correctly (if using a reasonable method, i.e. the TextFieldParser class or a 3rd party library), and avoid this issue.
I would consider your file as an error case - and would try to correct the issue on the generation side.
That being said, if that is not possible, you will need to have more information about the data structure in the file to correct this. For example, in this case, you know you should have 4 elements - if you find five, you may need to merge back together the 3rd and 4th, since those two represent the only number within the line.
This is not possible in a general case, however - for example, take the following:
100,100,100
If that is 2 numbers, should it be 100100, 100, or should it be 100, 100100? There is no way to determine this without more information.
You might want to have a look at the free open source project FileHelpers. If you MUST use your own code, here is a primer on the CSV "standard" format.
Well, you could always split on ("\",\"") and then trim the first and last element.
But I would look into regular expressions that match elements within "".
Don't just split on the , split on ", ".
Better still, use a CSV library from Google or CodePlex etc.
Reading a CSV file in .NET?
You may be able to use Regex.Replace to get rid of specifically the third comma as per below before parsing?
Replaces up to a specified number of occurrences of a pattern specified in the Regex constructor with a replacement string, starting at a specified character position in the input string. A MatchEvaluator delegate is called at each match to evaluate the replacement.
[C#] public string Replace(string, MatchEvaluator, int, int);
I ran into a similar issue with fields that had line feeds in them. I'm not convinced this is elegant, but for mine I basically chopped the file into lines, then if a line didn't start with a text delimiter, I appended it to the line above.
You could try something like this: step through each field; if the field has an end text delimiter, move to the next; if not, grab the next field, append it, rinse and repeat until you do have an end delimiter (this allows for 1,000,000,000 etc.).
(I'm caffeine deprived and hungry. I did write some code, but it was so ugly I didn't even post it.)
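For what it's worth, the field-by-field merge described above might be sketched like this. It is not the code that went unposted: the name FragmentMerger is made up, and it only helps when the numbers are wrapped in text delimiters (double quotes here) and fields contain no escaped quotes.

```csharp
using System;
using System.Collections.Generic;

public static class FragmentMerger
{
    // Splits on every comma, then glues fragments back together from an
    // opening quote up to the fragment that carries the closing quote.
    public static List<string> MergeQuotedFragments(string line)
    {
        var fields = new List<string>();
        string pending = null;

        foreach (string part in line.Split(','))
        {
            bool endsWithQuote = part.EndsWith("\"", StringComparison.Ordinal);
            if (pending != null)
            {
                // Still inside a quoted field: put the comma back.
                pending += "," + part;
                if (endsWithQuote)
                {
                    fields.Add(pending.Trim('"'));
                    pending = null;
                }
            }
            else if (part.StartsWith("\"", StringComparison.Ordinal)
                     && !(part.Length > 1 && endsWithQuote))
            {
                // Opening quote without a closing one: start collecting.
                pending = part;
            }
            else
            {
                fields.Add(part.Trim('"'));
            }
        }

        // Unterminated quote: keep what was collected as-is.
        if (pending != null)
        {
            fields.Add(pending);
        }
        return fields;
    }
}
```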
Do you know that it will always contain exactly four columns? If so, this quick-and-dirty LINQ code would work:
string[] elements = line.Split(',');
string element1 = elements.ElementAt(0);
string element2 = elements.ElementAt(1);
// Exclude the first two elements and the last element.
var element3parts = elements.Skip(2).Take(elements.Count() - 3);
int element3 = Convert.ToInt32(string.Join("",element3parts));
string element4 = elements.Last();
Not elegant, but it works.
