Parsing a CSV File with C#, ignoring thousand separators - c#

Working on a program that takes a CSV file and splits on each ",". The issue I have is there are thousand separators in some of the numbers. In the CSV file, the numbers render correctly. When viewed as a text document, they are shown like below:
Dog,Cat,100,100,Fish
In a CSV file, there are four cells, with the values "Dog", "Cat", "100,000", "Fish". When I split on the "," to an array of strings, it contains 5 elements, when what I want is 4. Anyone know a way to work around this?
Thanks

There are two common mistakes made when reading csv code: using a split() function and using regular expressions. Both approaches are wrong, in that they are prone to corner cases such as yours and slower than they could be.
Instead, use a dedicated parser such as Microsoft.VisualBasic.TextFieldParser, CodeProject's FastCSV or Linq2csv, or my own implemention here on Stack Overflow.

Typically, CSV files would wrap these elements in quotes, causing your line to be displayed as:
Dog,Cat,"100,100",Fish
This would parse correctly (if using a reasonable method, ie: the TextFieldParser class or a 3rd party library), and avoid this issue.
I would consider your file as an error case - and would try to correct the issue on the generation side.
That being said, if that is not possible, you will need to have more information about the data structure in the file to correct this. For example, in this case, you know you should have 4 elements - if you find five, you may need to merge back together the 3rd and 4th, since those two represent the only number within the line.
This is not possible in a general case, however - for example, take the following:
100,100,100
If that is 2 numbers, should it be 100100, 100, or should it be 100, 100100? There is no way to determine this without more information.

you might want to have a look at the free opensource project FileHelpers. If you MUST use your own code, here is a primer on the CSV "standard" format

well you could always split on ("\",\"") and then trim the first and last element.
But I would look into regular expressions that match elements with in "".

Don't just split on the , split on ", ".
Better still, use a CSV library from google or codeplex etc
Reading a CSV file in .NET?

You may be able to use Regex.Replace to get rid of specifically the third comma as per below before parsing?
Replaces up to a specified number of occurrences of a pattern specified in the Regex constructor with a replacement string, starting at a specified character position in the input string. A MatchEvaluator delegate is called at each match to evaluate the replacement.
[C#] public string Replace(string, MatchEvaluator, int, int);

I ran into a similar issue with fields with line feeds in. Im not convinced this is elegant, but... For mine I basically chopped mine into lines, then if the line didnt start with a text delimeter, I appended it to the line above.
You could try something like this : Step through each field, if the field has an end text delimeter, move to the next, if not, grab the next field, appaend it, rince and repeat till you do have an end delimeter (allows for 1,000,000,000 etc) ..
(Im caffeine deprived, and hungry, I did write some code but it was so ugly, I didnt even post it)

Do you know that it will always contain exactly four columns? If so, this quick-and-dirty LINQ code would work:
string[] elements = line.Split(',');
string element1 = elements.ElementAt(0);
string element2 = elements.ElementAt(1);
// Exclude the first two elements and the last element.
var element3parts = elements.Skip(2).Take(elements.Count() - 3);
int element3 = Convert.ToInt32(string.Join("",element3parts));
string element4 = elements.Last();
Not elegant, but it works.

Related

Get contents of a Var where part of line matches search string C#

I am reading a couple of csv files into var's as follows:
var myFullCsv = ReadFile(myFullCsvFilePath);
var masterCsv = ReadFile(csvFilePath);
Some of the line entries in each csv appear in both files and I am able to create a new var containing lines that exists in myFullCsv but not in masterCsv as follows:
var extraFilesCsv = myFullCsv.Except(masterCsv);
This is great because its very simple. However, I now wish to identify lines in myFullCsv where a specific string appears in the line. The string will correspond to one column of the csv data. I know that I can do this by reading each line of the var and splitting it up, then comparing the field I'm interested in to the string that I am searching for. However, this seems like a very long and inefficient approach as compared to my code above using the 'Except' command.
Is there some way that I can get the lines from myFullCsv with a very simple command or will I have to do it the long way? Please don't ask me to show the long way as that's what I am trying to avoid having to code although I can do it.
Sample csv data:
07801.jpg,67466,9452d316,\Folder1\FolderA\,
07802.jpg,78115,e50492d8,\Folder1\FolderB\,
07803.jpg,41486,37b6a100,\Folder1\FolderC\,
07804.jpg,93500,acdffc2b,\Folder2\FolderA\,
07805.jpg,67466,9452d316,\Folder2\FolderB\,
Sample desired output (I'm always looking for the entry in the 3rd column to match a string, in this case 9452d316):
07801.jpg,67466,9452d316,\Folder1\FolderA\,
07805.jpg,67466,9452d316,\Folder2\FolderB\,
Well you could use:
var results = myFullCsv.Where(line => line.Split(',')[2] == targetValue)
.ToList();
That's just doing the "splitting and checking" you mention in the question but it's pretty simple code. It could be more efficient if you only consider as far as the third comma, but I wouldn't worry about that until it's proved to be a problem.
Personally I'd probably parse each line to an object with meaningful properties rather than treating is as a string, but that's probably what you mean by "the long way".
Note that this doesn't perform any validation, or try to handle escaped commas, or lines with fewer columns etc. Depending on your data source, you may need to make it a lot more robust.
You could use a regex. It doesn't require every line to have at least 3 elements. It doesn't allocate a string array for each line. Therefore it may be faster, but you'd have to test it to prove it.
var regex = new Regex("^.+?,.+?," + Regex.Escape(targetValue) + ",");
var results = myFullCsv.Where(l => regex.IsMatch(l)).ToList();

Parsing Log using something else than string split c#

I'm pretty sure it has been asked before, but I could not find anything good.
I'm trying to parse a log but having troubles with it.
At first it looked pretty easy because the log is build like this:
thing,thing,thing,thing
so I string split it on the ,
however in the value itself it is possible that a , appears, and this is where I did not know what to do anymore.
How would I successfully parse this kind of log?
Edit~~
here is an log example:
1326139200953,info,,0,"str value which may contain, ",,,0
1326139201109,info,,0,"str value which may contain, ",,,0
1326139201265,info,,0,"str value which may contain, ",,,0
1326139201999,start,,0,,,,0
1326139368296,new,F:\Dir\Dir\file.txt,1536,,0,,0
``
If your log file doesn't have field encapsulators, the fields have variable width, and the separator/delimiter can also appear in a field, then it's likely you can't program something that will work in all cases.
Can you supply an example of your log file data? It may be possible to match the parts you need with a regex.
Unfortunately I think your question is not answerable in its current state, please provide more info.
Edit: Thanks for updating the question, you do have field encapsulators (double quotes). This will make it easier!
I think there are many ways to do this. Personally i think i would carry on splitting on commas, but then loop over the resulting array, checking if the first character of any value is a double quote. If it is, then you need to join it to the array item after it. If the last character of the joined array item isn't a double quote, you need to continue joining until you've closed your opening double quote.
There's certainly a better way so you may wish to wait for another solution.
Edit 2: Give this a go and let me know how you get on:
string myRegex = #"(?<=^(?:[^""]*""[^""]*"")*[^""]*),";
string[] outputArray = Regex.Split(myStr, myRegex);

Regex between, from the last to specific end

Today my wish is to take text form the string.
This string must be, between last slash and .partX.rar or .rar
First I tried to find edge's end of the word and then the beginning. After I get that two elements I merged them but I got empty results.
String:
http://hosting.xyz/1234/15-game.part4.rar.html
http://hosting.xyz/1234/16-game.rar.html
Regex:
Begin:(([^/]*)$) - start from last /
End:(.*(?=.part[0-9]+.rar|.rar)) stop before .partX.rar or .rar
As you see, if I merge that codes I won't get any result.
What is more, "end" select me only .partX instead of .partX.rar
All what I want is:
15-game.part4.rar and 16-game.rar
What i tried:
(([^/]*)$)(.*(?=.part[0-9]+.rar|.rar))
(([^/]*)$)
(.*(?=.part[0-9]+.rar|.rar))
I tried also
/[a-zA-Z0-9]+
but I do not know how select symbols.. This could be the easiest way. But this select only letters and numbers, not - or _.
If I could select symbols..
You don't really need a regex for this as you can merely split the url on / and then grab the part of the file name that you need. Since you didn't mention a language, here's an implementation in Perl:
use strict;
use warnings;
my $str1="http://hosting.xyz/1234/15-game.part4.rar.html";
my $str2="http://hosting.xyz/1234/16-game.rar.html";
my $file1=(split(/\//,$str1))[-1]; #last element of the resulting array from splitting on slash
my $file2=(split(/\//,$str2))[-1];
foreach($file1,$file2)
{
s/\.html$//; #for each file name, if it ends in ".html", get rid of that ending.
print "$_\n";
}
The output is:
15-game.part4.rar
16-game.rar
Nothing could be simpler! :-)
Use this:
new Regex("^.*\/(.*)\.html$")
You'll find your filename in the first captured group (don't have a c# compiler at hand, so can't give you working sample, but you have a working regex now! :-) )
See a demo here: http://rubular.com/r/UxFNtJenyF
I'm not a C# coder so can't write full code here but I think you'll need support of negative lookahead here like this:
new Regex("/(?!.*/)(.+?)\.html$");
Matched Group # 1 will have your string i.e. "16-game.rar" OR "15-game.part4.rar"
Use two regexes:
start to substitute .*/ with nothing;
then substitute \.html with nothing.
Job done!

Splitting a string in C#

Let's say I have this string:
"param1,r:1234,p:myparameters=1,2,3"
...and I would like to split it into:
param1
r:1234
p:myparameters=1,2,3
I've used the split function and of course it splits it at every comma. Is there a way to do this using regex or will I have to write my own split function?
Personally, I would try something like this:
,(?=[^,]+:.*?)
Basically, use a positive look-ahead to find a comma, followed by a "key-value" pair (this defined by a key, a colon, and more information [data] (including other commas). This should disqualify the commas between the numbers, too.
You can use ; for separating values which makes easy to work with it.
Since you have , for separation and also for values it is difficult to split it.
You have
string str = "param1,r:1234,p:myparameters=1,2,3"
Recommended to use
string str = "param1;r:1234;p:myparameters=1,2,3"
which can be splited as
var strArray = str.Split(';');
strArray[0]; // contains param1
strArray[1]; // r:1234
strArray[2]; // p:myparameters=1,2,3
I'm not sure how you would write a split that knew which commas to split on there, honestly.
Unless it's a fixed number each time in which case, just use the String.Split overload that takes an int specifying how many substrings to return at max
If you're going to have comma-delimited data that's not always a fixed number of items and it could have literal commas in the data itself, they really should be quoted. If you can control the input in any way, you should encourage that, and use an actual CSV parser instead of String.Split
That depends. You can't parse it with regex (or anything else) unless you can identify a consistent rule separating one group from another. Based on your sample, I can't clearly identify such a rule (though I have some guesses). How does the system know that p:myparameters=1,2,3 is a single item? For example, if there were another item after it, what would be the difference between that and the 1,2,3? Figure that out and you'll be pretty close to a solution.
If you're able to change the format of the input string, why not decide on a consistent delimiter between your groups? ; would be a good choice. Use an input like param1;r:1234;p:myparameters=1,2,3 and there will be no ambiguity where the groups are, plus you can just split on ; and you won't need regex.
The simplest approach would be changing your delimiter from "," to something like "|". Then you can split on "|" no problem. However if you can't change the delimiting character then maybe you could encode the sections in a fashion similar to CSV.
CSV files have the same issue... the standard there is to put double quotes "" around columns.
For example, your string would be "param1","r:1234","p:myparameters=1,2,3".
Then you could use the Microsoft.VisualBasic.FileIO.TextFieldParser to split/parse. You can include this in c# even though its in the VisualBasic namespace.
TextFieldParser
Do you mean that:string[] str = System.Text.RegularExpression.Regex.Spilt("param1,r:1234,p:myparameters=1,2,3",#"\,");

Splitting on a Unique Character

I want to build a comma separated list so that I can split on the comma later to get an array of the values. However, the values may have comma's in them. In fact, they may have any normal keyboard character in them (they are supplied from a user). What is a good strategy for determining a character you are sure will not collide with the values?
In case this matters in a language dependent way, I am building the "some character" separated list in C# and sending it to a browser to be split in javascript.
If JavaScript is consuming the list, why not send it in the form of a JavaScript array? It already has an established and reliable method for representing a list and escaping characters.
["Value 1", "Value 2", "Escaped \"Quotes\"", "Escaped \\ Backslash"]
You could split it by a null character, and terminate your list with a double null character.
I always use | but if you still think that it can contain it, you can use combinations like #|#. For example:
"string one#|#string two#|#...#|#last string"
Eric S. Raymond wrote a book chapter on this that you might find useful. It is directed toward Unix users but should still apply.
As for your question, if you will have commas within cells, then you will need some form of escaping. Using \, is a standard way, but you will also have to escape slashes, which are also common.
Alternatively, use another character such as the pipe (|), tab, or something else of your choice. If users need to work with the data using a spreadsheet program, you can usually add filter rules to split cells on the delimiter of your choice. If this is a concern, it's probably best to choose a delimiter that users can easily type, which excludes the nul char, among others.
You could also use quoting:
"value1", "value2", "etc"
In which case, you will only need to escape quotes (and slashes). This should also be accepted by spreadsheets given the correct filter options.
There are several ways to do this. The first is to select a separator character that would not normally be input from the keyboard. NULL or TAB are normally good. The second is to use a character sequence as a separator, the Excel CSV files are a good example where the cell values are defined by quotes with commas separating the cells.
The answer is dependent on whether you want to reinvent the wheel or not.
If there is potential for any splitting character to appear in your strings then then I would suggest that you write a script element to your output with a javascript array definition in it. For example:
<script>
var myVars=new Array();
myVars[0]="abc|#123$";
myVars[1]="123*456";
myVars[2]="blah|blah";
</script>
Your javascript can then reference that array
Doing this also avoids the need to create a comma seperated string from your C# string array.
The only gotcha I can think of is strings that contains quotes, in this case you would have to escape them in C# when writing them out to the myVars output.
There is an RFC which documents the CSV format. Follow the standards and you will avoid reinventing the wheel and creating a mess for the next guy to come along and maintain your code. The nice thing is that there are libraries available to import/export CSV for just about any platform you can imagine.
That said, if you are serialising data to send to a browser, JSON is really the way to go and it too is documented in an RFC and you can get libraries for just about any platform such as JSON.NET

Categories

Resources