regex to split line (csv file) - c#

I am not good in regex. Can some one help me out to write regex for me?
I may have values like this while reading csv file.
"Artist,Name",Album,12-SCS
"val""u,e1",value2,value3
Output:
Artist,Name
Album
12-SCS
Val"u,e1
Value2
Value3
Update:
I like idea using Oledb provider. We do have file upload control on the web page, that I read the content of the file using stream reader without actual saving file on the file system. Is there any way I can user Oledb provider because we need to specify the file name in connection string and in my case i don't have file saved on file system.

Just adding the solution I worked on this morning.
var regex = new Regex("(?<=^|,)(\"(?:[^\"]|\"\")*\"|[^,]*)");
foreach (Match m in regex.Matches("<-- input line -->"))
{
var s = m.Value;
}
As you can see, you need to call regex.Matches() per line. It will then return a MatchCollection with the same number of items you have as columns. The Value property of each match is, obviously, the parsed value.
This is still a work in progress, but it happily parses CSV strings like:
2,3.03,"Hello, my name is ""Joshua""",A,B,C,,,D

Actually, its pretty easy to match CVS lines with a regex. Try this one out:
StringCollection resultList = new StringCollection();
try {
Regex pattern = new Regex(#"
# Parse CVS line. Capture next value in named group: 'val'
\s* # Ignore leading whitespace.
(?: # Group of value alternatives.
"" # Either a double quoted string,
(?<val> # Capture contents between quotes.
[^""]*(""""[^""]*)* # Zero or more non-quotes, allowing
) # doubled "" quotes within string.
""\s* # Ignore whitespace following quote.
| (?<val>[^,]*) # Or... zero or more non-commas.
) # End value alternatives group.
(?:,|$) # Match end is comma or EOS",
RegexOptions.Multiline | RegexOptions.IgnorePatternWhitespace);
Match matchResult = pattern.Match(subjectString);
while (matchResult.Success) {
resultList.Add(matchResult.Groups["val"].Value);
matchResult = matchResult.NextMatch();
}
} catch (ArgumentException ex) {
// Syntax error in the regular expression
}
Disclaimer: The regex has been tested in RegexBuddy, (which generated this snippet), and it correctly matches the OP test data, but the C# code logic is untested. (I don't have access to C# tools.)

Regex is not the suitable tool for this. Use a CSV parser. Either the builtin one or a 3rd party one.

Give the TextFieldParser class a look. It's in the Microsoft.VisualBasic assembly and does delimited and fixed width parsing.

Give CsvHelper a try (a library I maintain). It's available via NuGet.
You can easily read a CSV file into a custom class collection. It's also very fast.
var streamReader = // Create a StreamReader to your CSV file
var csvReader = new CsvReader( streamReader );
var myObjects = csvReader.GetRecords<MyObject>();

Regex might get overly complex here. Split the line on commas, and then iterate over the resultant bits and concatenate them where "the number of double quotes in the concatenated string" is not even.
"hello,this",is,"a ""test"""
...split...
"hello | this" | is | "a ""test"""
...iterate and merge 'til you've an even number of double quotes...
"hello,this" - even number of quotes (note comma removed by split inserted between bits)
is - even number of quotes
"a ""test""" - even number of quotes
...then strip of leading and trailing quote if present and replace "" with ".

It could be done using below code:
using Microsoft.VisualBasic.FileIO;
string csv = "1,2,3,"4,3","a,"b",c",end";
TextFieldParser parser = new TextFieldParser(new StringReader(csv));
//To read from file
//TextFieldParser parser = new TextFieldParser("csvfile.csv");
parser.HasFieldsEnclosedInQuotes = true;
parser.SetDelimiters(",");
string[] fields =null;
while (!parser.EndOfData)
{
fields = parser.ReadFields();
}
parser.Close();

Related

c# .NET isn't rendering russian characters

I'm using regex to match a string of unicode and store it in a string. For example:
NOTE: The following content must be read from an outside text file or else visual studio will automagically render it into russian.
"Name": "\u0412\u0438\u043d\u043d\u0438\u0446\u0430, \u0443\u043b. \u041a\u0438\u0435\u0432\u0441\u043a\u0430\u044f, 14-\u0431",
I'm using the pattern:
"\"Name\":\\s*\"(?<match>[^\"]+)\""
However, when I store the match in a string, the string is saved as:
match = "\\u0412\\u0438\\u043d\\u043d\\u0438\\u0446\\u0430, \\u0443\\u043b. \\u041a\\u0438\\u0435\\u0432\\u0441\\u043a\\u0430\\u044f, 14-\\u0431"
.NET is storing the string with an extra "\"
I tried using:
match = match.replace(#"\\", #"\")
but .NET doesn't recognize #"\\" as existing because it is looking at the 'visualizer version'.
How can I store my unicode without c# adding an extra '\'?
EDIT:
Another point:
// this works!
string russianCharacters = "\u041b\u044c\u0432\u043e\u0432, \u0414\u043e\u043b\u0438\u043d\u0430, \u0432\u0443\u043b. \u0427\u043e\u0440\u043d\u043e\u0432\u043e\u043b\u0430, 18");
This renders correctly in the visualizer as russian characters. But when I store characters from a regex match FROM AN OUTSIDE TEXT FILE, it is stored as an excaped sequence.
How can I render my string as russian characters instead of an escaped sequence of unicode?
It seems you read the string from a text file that actually contains literal Unicode points, not actual Unicode symbols. That is, your C# variable looks like:
var match = "\\u0412\\u0438\\u043d\\u043d\\u0438\\u0446\\u0430, \\u0443\\u043b. \\u041a\\u0438\\u0435\\u0432\\u0441\\u043a\\u0430\\u044f, 14-\\u0431"
or
var match = #"\u0412\u0438\u043d\u043d\u0438\u0446\u0430, \u0443\u043b. \u041a\u0438\u0435\u0432\u0441\u043a\u0430\u044f, 14-\u0431"
In this case, to get the actual Unicode string, you need to use Regex.Unescape:
Converts any escaped characters in the input string.
C# demo:
var s = "\\u0412\\u0438\\u043d\\u043d\\u0438\\u0446\\u0430, \\u0443\\u043b. \\u041a\\u0438\\u0435\\u0432\\u0441\\u043a\\u0430\\u044f, 14-\\u0431";
Console.WriteLine(s);
// \u0412\u0438\u043d\u043d\u0438\u0446\u0430, \u0443\u043b. \u041a\u0438\u0435\u0432\u0441\u043a\u0430\u044f, 14-\u0431
Console.WriteLine(Regex.Unescape(s));
// Винница, ул. Киевская, 14-б
The extra '\' is just an escape character. I'm guessing you are viewing the value in the debugger window in which case it is showing the extra '\' but the underlying value will not have the extra '\'. Try using the actual value and you will see this.
This code works as expected:
var myString = "\"Name\": \"\u0412\u0438\u043d\u043d\u0438\u0446\u0430, \u0443\u043b.\u041a\u0438\u0435\u0432\u0441\u043a\u0430\u044f, 14 - \u0431\",";
var pattern = "\"Name\":\\s*\"(?<match>[^\"]+)\"";
Regex rgx = new Regex(pattern, RegexOptions.IgnoreCase);
MatchCollection matches = rgx.Matches(myString);
if (matches.Count > 0)
{
foreach (Match match in matches)
{
var ma = System.Web.HttpUtility.HtmlDecode(match.ToString());
}
}

C# Replace with regex

I'm new to VB, C#, and am struggling with regex. I think I've got the following code format to replace the regex match with blank space in my file.
EDIT: Per comments this code block has been changed.
var fileContents = System.IO.File.ReadAllText(#"C:\path\to\file.csv");
fileContents = fileContents.Replace(fileContents, #"regex", "");
regex = new Regex(pattern);
regex.Replace(filecontents, "");
System.IO.File.WriteAllText(#"C:\path\to\file.csv", fileContents);
My files are formatted like this:
"1111111","22222222222","Text that may, have a comma, or two","2014-09-01",,,,,,
So far, I have regex finding any string between ," and ", that contains a comma (there are never commas in the first or last cell, so I'm not worried about excluding those two. I'm testing regex in Expresso
(?<=,")([^"]+,[^"]+)(?=",)
I'm just not sure how to isolate that comma as what needs to be replaced. What would be the best way to do this?
SOLVED:
Combined [^"]+ with look behind/ahead:
(?<=,"[^"]+)(,)(?=[^"]+",)
FINAL EDIT:
Here's my final complete solution:
//read file contents
var fileContents = System.IO.File.ReadAllText(#"C:\path\to\file.csv");
//find all commas between double quotes
var regex = new Regex("(?<=,\")([^\"]+,[^\"]+(?=\",)");
//replace all commas with ""
fileContents = regex.Replace(fileContents, m => m.ToString().Replace(",", ""));
//write result back to file
System.IO.File.WriteAllText(#"C:\path\to\file.csv", fileContents);
Figured it out by combining the [^"]+ with the look ahead ?= and look behind ?<= so that it finds strings beginning with ,"[anything that's not double quotes, one or more times] then has a comma, then ends with [anything that's not double quotes, one or more times]",
(?<=,"[^"]+)(,)(?=[^"]+",)
Try to parse out all your columns with this:
Regex regex = new Regex("(?<=\").*?(?=\")");
Then you can just do:
foreach(Match match in regex.Matches(filecontents))
{
fileContents = fileContents.Replace(match.ToString(), match.ToString().Replace(",",string.Empty))
}
Might not be as fast but should work.
I would probably use the overload of Regex.Replace that takes a delegate to return the replaced text.
This is useful when you have a simple regex to identify the pattern but you need to do something less straightforward (complex logic) for the replace.
I find keeping your regexes simple will pay benefits when you're trying to maintain them later.
Note: this is similar to the answer by #Florian, but this replace restricts itself to replacement only in the matched text.
string exp = "(?<=,\")([^\"]+,[^\"]+)(?=\",)";
var regex = new Regex(exp);
string replacedtext = regex.Replace(filecontents, m => m.ToString().Replace(",",""))
What you have there is an irregular language. This is because a comma can mean different things depending upon where it is in the text stream. Strangely Regular Expressions are designed to parse regular languages where a comma would mean the same thing regardless of where it is in the text stream. What you need for an irregular language is a parser. In fact Regular expressions are mostly used for tokenizing strings before they are entered into a parser.
While what you are trying to do can be done using regular expressions it is likely to be very slow. For example you can use the following (which will work even if the comma is the first or last character in the field). However every time it finds a comma it will have to scan backwards and forwards to check if it is between two quotation characters.
(?<=,"[^"]*),(?=[^"]*",)
Note also that their may be a flaw in this approach that you have not yet spotted. I don't know if you have this issue but often in CSV files you can have quotation characters in the middle of fields where there may also be a comma. In these cases applications like MS Excel will typically double the quote up to show that it is not the end of the field. Like this:
"1111111","22222222222","Text that may, have a comma, Quote"" or two","2014-09-01",,,,,,
In this case you are going to be out of luck with a regular expression.
Thankfully the code to deal with CSV files is very simple:
public static IList<string> ParseCSVLine(string csvLine)
{
List<string> result = new List<string>();
StringBuilder buffer = new StringBuilder();
bool inQuotes = false;
char lastChar = '\0';
foreach (char c in csvLine)
{
switch (c)
{
case '"':
if (inQuotes)
{
inQuotes = false;
}
else
{
if (lastChar == '"')
{
buffer.Append('"');
}
inQuotes = true;
}
break;
case ',':
if (inQuotes)
{
buffer.Append(',');
}
else
{
result.Add(buffer.ToString());
buffer.Clear();
}
break;
default:
buffer.Append(c);
break;
}
lastChar = c;
}
result.Add(buffer.ToString());
buffer.Clear();
return result;
}
PS. There are another couple of issues often run into with CSV files which the code I have given doesn't solve. First is what happens if a field has an end of line character in the middle of it? Second is how do you know what character encoding a CSV file is in? The former of these two issues is easy to solve by modifying my code slightly. The second however is near impossible to do without coming to some agreement with the person supplying the file to you.

Why does regex replace my feed string once + 3 times?

Interesting situation I have here. I have some files in a folder that all have a very explicit string in the first line that I always know will be there. Want I want to do is really just append |DATA_SOURCE_KEY right after AVAILABLE_IND
//regex to search for the bb_course_*.bbd files
string courseRegex = #"BB_COURSES_([C][E][Q]|[F][A]|[H][S]|[S][1]|[S][2]|[S][P])\d{1,6}.bbd";
string courseHeaderRegex = #"EXTERNAL_COURSE_KEY|COURSE_ID|COURSE_NAME|AVAILABLE_IND";
//get files from the directory specifed in the GetFiles parameter and returns the matches to the regex
var matches = Directory.GetFiles(#"c:\courseFolder\").Where(path => Regex.Match(path, courseRegex).Success);
//prints the files returned
foreach (string file in matches)
{
Console.WriteLine(file);
File.WriteAllText(file, Regex.Replace(File.ReadAllText(file), courseHeaderRegex, "EXTERNAL_COURSE_KEY|COURSE_ID|COURSE_NAME|AVAILABLE_IND|DATA_SOURCE_KEY"));
}
But this code takes the original occurrence of the matching regex, replaces it with my replacement value, and then does it 3 more times.
EXTERNAL_COURSE_KEY|COURSE_ID|COURSE_NAME|AVAILABLE_IND|DATA_SOURCE_KEY|EXTERNAL_COURSE_KEY|COURSE_ID|COURSE_NAME|AVAILABLE_IND|DATA_SOURCE_KEY|EXTERNAL_COURSE_KEY|COURSE_ID|COURSE_NAME|AVAILABLE_IND|DATA_SOURCE_KEY|EXTERNAL_COURSE_KEY|COURSE_ID|COURSE_NAME|AVAILABLE_IND|DATA_SOURCE_KEY
And I can't figure out why with breakpoints. My loop is running only 12 times to match the # of files I have in the directory. My only guess is that File.WriteAllText is somehow recursively searching itself after replacing the text and re-replacing. If that makes sense. Any ideas? Is it because courseHeaderRegex is so explicit?
If I change courseHeaderRegex to string courseHeaderRegex = #"AVAILABLE_IND";
then I get the correct changes in my files
EXTERNAL_COURSE_KEY|COURSE_ID|COURSE_NAME|AVAILABLE_IND|DATA_SOURCE_KEY
I'd just like to understand why the original way doesn't work.
I think your problem is that you need to escape the | character in courseHeaderRegex:
string courseHeaderRegex = #"EXTERNAL_COURSE_KEY\|COURSE_ID\|COURSE_NAME\|AVAILABLE_IND";
The character | is the Alternation Operator and it will match 'EXTERNAL_COURSE_KEY' , 'COURSE_ID' , ,'COURSE_NAME' and 'AVAILABLE_IND', replacing each of them with your substitution string.
What about
string newString = File.ReadAllText(file)
.Replace(#"EXTERNAL_COURSE_KEY|COURSE_ID|COURSE_NAME|AVAILABLE_IND",#"EXTERNAL_COURSE_KEY|COURSE_ID|COURSE_NAME|AVAILABLE_IND|DATA_SOURCE_KEY");
just using a simple String.Replace()

Parsing Mixedt CSV-File with TextFieldParser

Hello I have a problem to parse a CSV-file. The CSV-File is Delimited with | character . So far so good. But only one field is enclosed with the " char.
For example
field1|field2|"field3"|field4
When I set the
HasFieldsEnclosedInQuotes
to true the i will become a exception otherwise the parsing of the CSV-File goes wrong. Can you help me here.
I haven't seen a culture, where '|' is csv separator...
All in all,
var line = "field1|field2|\"field3\"|field4";
var pattern = string.Format("{0}(?=([^\"]*\"[^\"]*\")*[^\"]*$)", Regex.Escape("|"));
//{0} in pattern is CSV separator. To get current use System.Globalization.CultureInfo.CurrentCulture.TextInfo.ListSeparator
var splitted = Regex.Split(line, pattern, RegexOptions.Compiled | RegexOptions.ExplicitCapture);
foreach (var s in splitted)
Console.WriteLine(s);
Output:
field1
field2
"field3"
field4
Pattern is designed to split a single line from a CSV file using specified separator characters. Includes handling of quotes, etc.
Hope that will help you.
Quick and dirty: You could consider stripping the document of all uses of " beforehand.
string path = "c:\\test.txt";
string s = System.IO.File.ReadAllText(path, System.Text.Encoding.Default);
s = s.Replace("\"", string.Empty);
System.IO.File.WriteAllText(path, s, System.Text.Encoding.Default);
Edit 1:
This method works for number columns or string columns containing only one word, but could break your csv structure in other cases (e.g. field stores html content) - be aware of possible side effects.

how can i optimize the performance of this regular expression?

I'm using a regular expression to replace commas that are not contained by text qualifying quotes into tab spaces.
I'm running the regex on file content through a script task in SSIS. The file content is over 6000 lines long.
I saw an example of using a regex on file content that looked like this
String FileContent = ReadFile(FilePath, ErrInfo);
Regex r = new Regex(#"(,)(?=(?:[^""]|""[^""]*"")*$)");
FileContent = r.Replace(FileContent, "\t");
That replace can understandably take its sweet time on a decent sized file.
Is there a more efficient way to run this regex?
Would it be faster to read the file line by line and run the regex per line?
It seems you're trying to convert comma separated values (CSV) into tab separated values (TSV).
In this case, you should try to find a CSV library instead and read the fields with that library (and convert them to TSV if necessary).
Alternatively, you can check whether each line has quotes and use a simpler method accordingly.
The problem is the lookahead, which looks all the way to the end on each comman, resulting in O(n2) complexity, which is noticeable on long inputs. You can get it done in a single pass by skipping over quotes while replacing:
Regex csvRegex = new Regex(#"
(?<Quoted>
"" # Open quotes
(?:[^""]|"""")* # not quotes, or two quotes (escaped)
"" # Closing quotes
)
| # OR
(?<Comma>,) # A comma
",
RegexOptions.IgnorePatternWhitespace);
content = csvRegex.Replace(content,
match => match.Groups["Comma"].Success ? "\t" : match.Value);
Here we match free command and quoted strings. The Replace method takes a callback with a condition that checks if we found a comma or not, and replaced accordingly.
The simplest optimization would be
Regex r = new Regex(#"(,)(?=(?:[^""]|""[^""]*"")*$)", RegexOptions.Compiled);
foreach (var line in System.IO.File.ReadAllLines("input.txt"))
Console.WriteLine(r.Replace(line, "\t"));
I haven't profiled it, but I wouldn't be surprised if the speedup was huge.
If that's not enough I suggest some manual labour:
var input = new StreamReader(File.OpenRead("input.txt"));
char[] toMatch = ",\"".ToCharArray ();
string line;
while (null != (line = input.ReadLine()))
{
var result = new StringBuilder(line);
bool inquotes = false;
for (int index=0; -1 != (index = line.IndexOfAny (toMatch, index)); index++)
{
bool isquote = (line[index] == '\"');
inquotes = inquotes != isquote;
if (!(isquote || inquotes))
result[index] = '\t';
}
Console.WriteLine (result);
}
PS: I assumed #"\t" was a typo for "\t", but perhaps it isn't :)

Categories

Resources