Find a pattern and replace an element of it - c#

I have the following problem:
I am trying to split the rows of a CSV file, but sometimes I read a line like the following:
string input = "a,b,c,d,\"V=12.503,I=0.194\",e,f";
I use the following code:
string[] splitLine = input.Split(',');
The result is that I get an extra column, because the field \"V=12.503,I=0.194\" has a comma inside it. When I open the CSV file in Excel, however, Excel doesn't add an extra column, because it treats that field as a single value. How can I properly split this CSV file, taking this situation into account?

You are encountering commas in the "cells" of your CSV, which by convention (RFC 4180 documents it, but not every producer follows it) are escaped by wrapping the cell data in double quotes. You also need to be aware that a quoted cell can itself contain literal quotes, which are escaped by doubling them.
Let's say you had a name column and someone's name was
Jonathan "Jake" Smith, Jr.
That would be encoded as
"Jonathan ""Jake"" Smith, Jr."
You can certainly improve your code to handle those cases. However, that problem has been solved before. If you don't want to reinvent the wheel, there are a number of solid open source libraries that handle the headache of parsing CSV files. The one I use is
http://www.codeproject.com/Articles/9258/A-Fast-CSV-Reader
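If you do want to see what handling those cases yourself involves, here is a minimal sketch of a one-line parser. The class and method names (CsvLine, Split) are made up for this example, and it assumes a record never spans more than one line, which real CSV files do not guarantee.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class CsvLine
{
    // Splits one CSV record into fields, honoring double-quoted fields
    // and the "" escape for a literal quote. Hypothetical helper, not a library API.
    public static List<string> Split(string line)
    {
        var fields = new List<string>();
        var current = new StringBuilder();
        bool inQuotes = false;

        for (int i = 0; i < line.Length; i++)
        {
            char c = line[i];
            if (inQuotes)
            {
                if (c == '"')
                {
                    if (i + 1 < line.Length && line[i + 1] == '"')
                    {
                        current.Append('"'); // doubled quote: emit one literal quote
                        i++;                 // and skip the second quote
                    }
                    else
                    {
                        inQuotes = false;    // closing quote of the field
                    }
                }
                else
                {
                    current.Append(c);       // commas inside quotes are plain data
                }
            }
            else if (c == '"')
            {
                inQuotes = true;             // opening quote of a field
            }
            else if (c == ',')
            {
                fields.Add(current.ToString()); // unquoted comma ends the field
                current.Clear();
            }
            else
            {
                current.Append(c);
            }
        }

        fields.Add(current.ToString());      // last field has no trailing comma
        return fields;
    }
}
```

Calling CsvLine.Split on the line from the question returns seven fields, with V=12.503,I=0.194 kept together as the fifth one. A library is still the safer choice, since this sketch ignores multi-line fields and malformed input.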

Related

CSV file parsing issue

I am trying to read a CSV file, and I have succeeded to some extent.
I tried
string[] values = strCurLine.Split(',');
and I am getting this output:
array[0]="180"
array[1]="LMN"
array[2]="8"
array[3]="5/17/2012 15:00"
array[4]=""
array[5]="name"
array[6]="2nd row"
array[7]="step 2 from 4 to 2"
array[8]="7/9/2012 8:47:00 AM"
But when I try to read content that contains a comma (,), it produces an additional array item, and because of this I am losing data. How can I replace a comma (,) inside a string?
My expected output is
array[0]="180"
array[1]="LMN"
array[2]="8"
array[3]="5/17/2012 15:00"
array[4]=""
array[5]="name"
array[6]="2nd row, step 2 from 4 to 2"
array[7]="7/9/2012 8:47:00 AM"
Please suggest a solution.
You've hit on what is known as a "delimiter collision".
You're going to need to change either the delimiter itself or the way commas inside the data are written.
To replace parts of a string you can use String.Replace(), but that won't really help you here: applied blindly, it also replaces the commas that separate fields and breaks your CSV file.
So what I suggest is the following:
Use a different delimiter in the file (even though it's called CSV, you can delimit the fields with tabs or semicolons), something you don't expect to appear in the data. If you have complete control of the CSV, you could use your own delimiter (maybe ,, or something like that)
or
Define an escape sequence for commas in the data. For instance, use \, or ,, to represent a literal comma. Then you can use String.Replace to swap every escaped comma in the file for a temporary character, split on the remaining commas, and swap the temporary character back to a comma when you're done.
Check here, I found the answer to my question:
Dealing with commas in a CSV file, answered by harpo.
Thanks, happy coding.

Import data from CSV file with comma in string cells

I want to import data from a CSV file, but some cells contain a comma in their string value. How can I tell which commas are separators and which are part of the cell content?
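Use TextFieldParser. Usage:
using Microsoft.VisualBasic.FileIO; // reference Microsoft.VisualBasic.dll
...
// reader is a TextReader (for example a StreamReader) opened on the CSV file
using (var csvReader = new TextFieldParser(reader))
{
    csvReader.SetDelimiters(",");
    csvReader.HasFieldsEnclosedInQuotes = true;
    while (!csvReader.EndOfData)
    {
        string[] fields = csvReader.ReadFields(); // one record per call
    }
}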
In general, do not bother writing the import yourself.
I have had good experiences with the FileHelpers library.
http://www.filehelpers.com/
And indeed, I hope your fields are quoted. FileHelpers supports this out of the box.
Otherwise there is not much you can do.
Unless you have quotes around the strings you are pretty much hosed, hence the "quote and comma" delimiter style. If you have control of the export facility, then you must select "enclose strings in quotes" or change the delimiter to something like a tilde or caret symbol.
If not, then you have to write some code. If you detect "a..z", start counting commas and keep working through the string until you detect [0..9]. Even then this is going to be problematic, since people can put [0..9] in their text. At best this is going to be a best-effort process. You're going to have to know when you are in chars and when you are not. I doubt even regex will help you much on this.
The only other thing I can think of is to run through the data and look for commas, then look at what comes before and after each comma. If it is surrounded by chars, replace the comma with an alternate char like the caret "^" or the tilde "~". Then process the file as normal, and afterwards replace the alternate char with a comma again.
Good luck.
Using FileHelpers is definitely the way to go. They have done a great job building all the logic for you. I had the same issue, where I had to parse a CSV file with commas as part of a field, and this utility did the job very well. All you have to do is put the following attribute on the field:
[FieldQuoted('"', QuoteMode.OptionalForBoth)]
For details http://www.filehelpers.com/forums/viewtopic.php?f=12&t=391
We can also use Regex, as below.
Regex CSVParser = new Regex(",(?=(?:[^\"]*\"[^\"]*\")*(?![^\"]*\"))");
String[] Fields = CSVParser.Split(Test);
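To make the behavior of that pattern concrete, here is a small sketch that wraps it in a helper. The names RegexCsv, Split and Unquote are invented for this example. The pattern only splits on commas that are followed by an even number of quotes, so the fields come back with their surrounding quotes still attached, and the sketch strips those in a second step.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class RegexCsv
{
    // Same pattern as above: a comma followed by an even number of quotes,
    // i.e. a comma that is not inside a quoted field.
    private static readonly Regex CsvSplitter =
        new Regex(",(?=(?:[^\"]*\"[^\"]*\")*(?![^\"]*\"))");

    // Hypothetical helper: split one line, then remove the enclosing quotes.
    public static string[] Split(string line)
    {
        return CsvSplitter.Split(line).Select(Unquote).ToArray();
    }

    private static string Unquote(string field)
    {
        if (field.Length >= 2 && field[0] == '"' && field[field.Length - 1] == '"')
        {
            // Drop the outer quotes and collapse doubled quotes to one.
            return field.Substring(1, field.Length - 2).Replace("\"\"", "\"");
        }
        return field;
    }
}
```

This assumes well-formed input with balanced quotes on a single line. The lookahead rescans the rest of the line for every comma, so it can get slow on very long lines.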

Parsing a CSV File with C#, ignoring thousand separators

Working on a program that takes a CSV file and splits on each ",". The issue I have is there are thousand separators in some of the numbers. In the CSV file, the numbers render correctly. When viewed as a text document, they are shown like below:
Dog,Cat,100,100,Fish
In the CSV file there are four cells, with the values "Dog", "Cat", "100,100", "Fish". When I split on "," into an array of strings, it contains 5 elements, when what I want is 4. Does anyone know a way to work around this?
Thanks
There are two common mistakes made when reading CSV data: using a split() function and using regular expressions. Both approaches are wrong, in that they are prone to corner cases such as yours and are slower than they could be.
Instead, use a dedicated parser such as Microsoft.VisualBasic.FileIO.TextFieldParser, CodeProject's FastCSV or Linq2csv, or my own implementation here on Stack Overflow.
Typically, CSV files would wrap these elements in quotes, causing your line to be displayed as:
Dog,Cat,"100,100",Fish
This would parse correctly (if using a reasonable method, i.e. the TextFieldParser class or a third-party library) and avoid this issue.
I would consider your file as an error case - and would try to correct the issue on the generation side.
That being said, if that is not possible, you will need to have more information about the data structure in the file to correct this. For example, in this case, you know you should have 4 elements - if you find five, you may need to merge back together the 3rd and 4th, since those two represent the only number within the line.
This is not possible in a general case, however - for example, take the following:
100,100,100
If that is 2 numbers, should it be 100100, 100, or should it be 100, 100100? There is no way to determine this without more information.
You might want to have a look at the free open source project FileHelpers. If you MUST use your own code, here is a primer on the CSV "standard" format.
Well, you could always split on ("\",\"") and then trim the first and last element.
But I would look into regular expressions that match elements within "".
Don't just split on the comma; split on ", ".
Better still, use a CSV library from Google, CodePlex, etc.
Reading a CSV file in .NET?
You may be able to use Regex.Replace to get rid of specifically the third comma as per below before parsing?
Replaces up to a specified number of occurrences of a pattern specified in the Regex constructor with a replacement string, starting at a specified character position in the input string. A MatchEvaluator delegate is called at each match to evaluate the replacement.
[C#] public string Replace(string, MatchEvaluator, int, int);
I ran into a similar issue with fields containing line feeds. I'm not convinced this is elegant, but I basically chopped the file into lines, and if a line didn't start with a text delimiter, I appended it to the line above.
You could try something like this: step through each field; if the field has an end text delimiter, move to the next one; if not, grab the next field, append it, and rinse and repeat until you do have an end delimiter (this allows for 1,000,000,000 etc.).
(I'm caffeine deprived and hungry. I did write some code, but it was so ugly I didn't even post it.)
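The append-until-closed idea described above could look roughly like the sketch below. It is not the code the poster mentions, just one way to do it, and it assumes the numbers are wrapped in a text delimiter (a double quote). Instead of checking for an end delimiter directly, it counts quotes: while the count is odd, the field is still open. FieldRejoiner is a made-up name.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

public static class FieldRejoiner
{
    // Splits naively on commas, then glues pieces back together
    // until the quotes in the accumulated field are balanced.
    public static List<string> Split(string line)
    {
        var result = new List<string>();
        var pending = new StringBuilder();
        bool hasPending = false;
        int quoteCount = 0;

        foreach (string piece in line.Split(','))
        {
            if (hasPending)
            {
                pending.Append(','); // restore the comma the naive split removed
            }
            pending.Append(piece);
            hasPending = true;
            quoteCount += piece.Count(ch => ch == '"');

            // An even quote count means we are no longer inside a quoted field.
            if (quoteCount % 2 == 0)
            {
                result.Add(pending.ToString());
                pending.Clear();
                hasPending = false;
                quoteCount = 0;
            }
        }

        if (hasPending)
        {
            result.Add(pending.ToString()); // unbalanced quote: keep what we have
        }
        return result;
    }
}
```

The fields keep their surrounding quotes, so they still need trimming afterwards. None of this helps with the unquoted Dog,Cat,100,100,Fish line from the question, where there is no delimiter to look for.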
Do you know that it will always contain exactly four columns? If so, this quick-and-dirty LINQ code would work:
string[] elements = line.Split(',');
string element1 = elements.ElementAt(0);
string element2 = elements.ElementAt(1);
// Exclude the first two elements and the last element.
var element3parts = elements.Skip(2).Take(elements.Count() - 3);
int element3 = Convert.ToInt32(string.Join("",element3parts));
string element4 = elements.Last();
Not elegant, but it works.

Excel adds extra quotes on CSV export

I've recently created an application which adds items to a Database by CSV. After adding items I realized that lots of my values had extra quotes (") that weren't needed and this was messing up my ordering.
The problem is that when exporting to a CSV from Excel, Excel adds extra quotes to all of my values that already have a quote in them. I've shown the difference below:
Original Item: Drill Electric Reversible 1/2" 6.3A
Exported Item: "Drill Electric Reversible 1/2"" 6.3"
Note: the CSV export is adding three (3) extra quotes ("). Two on the ends, and one after the original intended quote.
Is there a setting I can change, or a formatting property I can set on the Excel File/Column? Or do I have to live with it and remove these quotes in my back-end code before adding them to the Database?
This is entirely normal. The outer quotes are added because this is a string. The inner quote is doubled to escape it. It's the same kind of thing you'd see in a SQL query, for example. Use the TextFieldParser class to have tried-and-true framework code take care of parsing this for you automatically.
That's standard.
The values within a CSV file should have quotes around them (otherwise commas and linebreaks inside a field may be misinterpreted).
The way to escape a quote within a field is to double it, just as you are seeing.
I suggest you read about the basic rules of CSV:
CSV is a delimited data format that has fields/columns separated by the comma character and records/rows terminated by newlines. Fields that contain a special character (comma, newline, or double quote), must be enclosed in double quotes. If a line contains a single entry which is the empty string, it may be enclosed in double quotes. If a field's value contains a double quote character it is escaped by placing another double quote character next to it. The CSV file format does not require a specific character encoding, byte order, or line terminator format.
(emphasis mine)
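As an illustration of those rules from the writing side, here is a minimal sketch of how a single field could be escaped. The names CsvWriterSketch and EscapeField are made up, and the code only mirrors the rules quoted above rather than Excel's actual implementation.

```csharp
using System;

public static class CsvWriterSketch
{
    // Characters that force a field to be enclosed in double quotes.
    private static readonly char[] Special = { ',', '"', '\r', '\n' };

    // Hypothetical helper: returns the field as it should appear in the file.
    public static string EscapeField(string value)
    {
        if (value.IndexOfAny(Special) < 0)
        {
            return value; // nothing special, write it as-is
        }
        // Double every quote, then wrap the whole field in quotes.
        return "\"" + value.Replace("\"", "\"\"") + "\"";
    }
}
```

Passing Drill Electric Reversible 1/2" 6.3A through EscapeField adds the same three quotes the question describes: one doubled inner quote and two on the ends. A conforming reader reverses this, which is why the quotes should not be stripped by hand before parsing.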
You could try exporting from Excel as TAB delimited files. I find it easier to parse.
Replace every Right Double Quotation Mark character with a Left Double Quotation Mark character. They look similar, so Excel will be confused and leave the text unchanged.
This solution will only help if your end output is HTML. This is the JavaScript solution, so obviously you'll need to redo it in C# or whichever language you're working in:
base = base.replace(/""/gi, '&quot;');
base = base.replace(/'/gi, '&#39;');
Apply this before you parse the CSV.
Another approach would be to use the Unicode Character "DOUBLE PRIME"
http://www.fileformat.info/info/unicode/char/2033/index.htm
in your Excel data. To export from Excel into a UTF-8 or UTF-16 .csv you'll have to provide a schema.ini with an appropriate CharacterSet property. Obviously, the tool you use to import the .csv into your database has to be Unicode aware too.
Depending on the DBMS a more direct way of data transfer (SELECT/INSERT ... INTO ... IN ) can be used, thereby eliminating the .csv entirely.

Convert CSV to XML when CSV contains both character and number data

From this thread, I got the basic info on how to parse CSV to create XML. Unfortunately, the text fields (all enclosed in quotes) sometimes contain commas, so line.split(',') gives me too many columns. I can't figure out how to parse the CSV so line.split(',') distinguishes between commas within a text field, and commas separating fields. Any thoughts on how to do that?
Thanks!
Go grab this code: http://geekswithblogs.net/mwatson/archive/2004/09/04/10658.aspx
Then replace line.Split(",") with SplitCSV(line), like:
var lines = File.ReadAllLines(@"C:\text.csv");
var xml = new XElement("TopElement",
    lines.Select(line => new XElement("Item",
        SplitCSV(line)
            .Select((column, index) => new XElement("Column" + index, column)))));
xml.Save(@"C:\xmlout.xml");
Note that the code at the link above is rather old, and probably could be cleaned up a bit using Linq, but it should do the trick.
Try FileHelpers.
The FileHelpers are a free and easy to use .NET library to import/export data from fixed length or delimited records in files, strings or streams.
What about using the pipe character "|"? This often happens with CSV files, and a better approach is to separate on pipes.
If your CSV files are too complex to make writing your own parser practical, use another parser. The Office ACE OLEDB provider may already be available on your system, but may be overkill for your purposes. I haven't used any of the lightweight alternatives, so I can't speak to their suitability.
Here's a little trick if you don't want to use Regex. Instead of splitting on the comma, you can split on the comma and its surrounding quotes together: ","
Assuming there's no space before or after the comma:
line.Split(new[] { "\",\"" }, StringSplitOptions.None)
You will need to remove the quote before the first field and after the last field however.
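Put together, the trick might look like the sketch below. QuotedSplit is a made-up name, and the code assumes that every field is quoted and that there are no spaces around the separating commas. It uses the Split(string[], StringSplitOptions) overload, which is available on both .NET Framework and newer runtimes.

```csharp
using System;

public static class QuotedSplit
{
    // Splits a line in which EVERY field is wrapped in double quotes.
    public static string[] Split(string line)
    {
        string inner = line;

        // Remove the quote before the first field and after the last field.
        if (inner.Length >= 2 && inner[0] == '"' && inner[inner.Length - 1] == '"')
        {
            inner = inner.Substring(1, inner.Length - 2);
        }

        // Split on the three-character sequence quote, comma, quote.
        return inner.Split(new[] { "\",\"" }, StringSplitOptions.None);
    }
}
```

For the line "john,jane","doe","555-5555" this gives three fields, with john,jane kept intact. It breaks as soon as a field is unquoted or itself contains the sequence ",", so it only suits files whose format you control.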
While I'm almost always against regular expressions, here's a solution using one.
Assume you have data as such:
"first name","last name","phone number"
"john,jane","doe","555-5555"
Then, the following code:
string csv = GetCSV(); // will load your CSV, or the above data
foreach (string line in csv.Split('\n'))
{
Console.WriteLine("--- Begin record ---");
foreach (Match m in Regex.Matches(line, "\".+?\""))
Console.WriteLine(m.Value);
}
will output this:
--- Begin record ---
"first name"
"last name"
"phone number"
--- Begin record ---
"john,jane"
"doe"
"555-5555"
But I would not recommend the regex approach if you have, say, a 2 GB CSV file.
So you can use that as your baseline for making up your XML records.
