Parsing Mixed CSV-File with TextFieldParser - c#

Hello, I have a problem parsing a CSV file. The file is delimited with the | character. So far so good. But only one field is enclosed in " characters.
For example:
field1|field2|"field3"|field4
When I set HasFieldsEnclosedInQuotes to true, I get an exception; otherwise the CSV file is parsed incorrectly. Can you help me here?

I haven't seen a culture where '|' is the CSV separator...
All in all,
var line = "field1|field2|\"field3\"|field4";
var pattern = string.Format("{0}(?=([^\"]*\"[^\"]*\")*[^\"]*$)", Regex.Escape("|"));
// {0} in the pattern is the CSV separator. To get the current one, use System.Globalization.CultureInfo.CurrentCulture.TextInfo.ListSeparator
var splitted = Regex.Split(line, pattern, RegexOptions.Compiled | RegexOptions.ExplicitCapture);
foreach (var s in splitted)
    Console.WriteLine(s);
Output:
field1
field2
"field3"
field4
The pattern is designed to split a single line from a CSV file on the specified separator character, ignoring separators that appear inside quotes.
Hope that helps.
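For reuse, the same lookahead idea could be wrapped in a small helper. This is only a sketch: the names `DelimitedLine` and `Split` are made up for illustration, and the quote stripping assumes that a quote inside a quoted field is escaped by doubling it (""), which the question does not confirm.

```csharp
using System.Text.RegularExpressions;

public static class DelimitedLine
{
    // Splits one line on the separator, ignoring separators inside double quotes.
    public static string[] Split(string line, string separator)
    {
        // A separator only counts when an even number of quotes follows it.
        // The group is non-capturing so Regex.Split does not return it as a field.
        string pattern = Regex.Escape(separator) + "(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)";
        string[] fields = Regex.Split(line, pattern);
        for (int i = 0; i < fields.Length; i++)
        {
            string f = fields[i];
            // Remove enclosing quotes and undouble escaped ones (assumed convention).
            if (f.Length >= 2 && f[0] == '"' && f[f.Length - 1] == '"')
                fields[i] = f.Substring(1, f.Length - 2).Replace("\"\"", "\"");
        }
        return fields;
    }
}
```

Calling `DelimitedLine.Split("field1|field2|\"field3\"|field4", "|")` should give four fields, with the quotes removed from the third one.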

Quick and dirty: You could consider stripping the document of all uses of " beforehand.
string path = "c:\\test.txt";
string s = System.IO.File.ReadAllText(path, System.Text.Encoding.Default);
s = s.Replace("\"", string.Empty);
System.IO.File.WriteAllText(path, s, System.Text.Encoding.Default);
Edit 1:
This method works for numeric columns or string columns containing a single word, but it could break your CSV structure in other cases (e.g. a field that stores HTML content or contains the delimiter). Be aware of possible side effects.

Related

Regex Replacing only whole matches

I am trying to replace a bunch of strings in files. The strings are stored in a datatable along with the new string value.
string contents = File.ReadAllText(file);
foreach (DataRow dr in FolderRenames.Rows)
{
contents = Regex.Replace(contents, dr["find"].ToString(), dr["replace"].ToString());
File.SetAttributes(file, FileAttributes.Normal);
File.WriteAllText(file, contents);
}
The strings look like this _-uUa, -_uU, _-Ha etc.
The problem I am having is that, for example, the string "_uU" will also overwrite "_-uUa", so the replacement would look like "newvaluea".
Is there a way to tell regex to look at the next character after the found string and make sure it is not an alphanumeric character?
I hope it is clear what I am trying to do here.
Here is some sample data:
private function _-0iX(arg1:flash.events.Event):void
{
    if (arg1.type == flash.events.Event.RESIZE)
    {
        if (this._-2GU)
        {
            this._-yu(this._-2GU);
        }
    }
    return;
}
The next characters could be ;, (, ), dot, comma, space, :, etc.
First of all, you should use Regex.Escape.
You can then use
contents = Regex.Replace(
    contents,
    Regex.Escape(dr["find"].ToString()) + @"(?![a-zA-Z])",
    Regex.Escape(dr["replace"].ToString()));
or even better
contents = Regex.Replace(
    contents,
    @"\b" + Regex.Escape(dr["find"].ToString()) + @"\b",
    Regex.Escape(dr["replace"].ToString()));
I think this is what you're looking for:
contents = Regex.Replace(
    contents,
    string.Format(@"(?<!\w){0}(?!\w)", Regex.Escape(dr["find"].ToString())),
    dr["replace"].ToString().Replace("$", "$$")
);
You can't use \b because your search strings don't always start and end with word characters. Instead, I used (?<!\w) and (?!\w) to make sure the matched substring is not immediately preceded or followed by a word character (i.e., a letter, a digit, or an underscore). I don't know the complete specs for your search strings, so this pattern might need some tweaking.
None of the sample patterns you provided contain regex metacharacters, but like the other responders, I used Regex.Escape() to render it safe anyway. In the replacement string the only character you have to watch out for is the dollar sign, and the way to escape that is with another dollar sign. Notice that I used String.Replace() for that instead of Regex.Replace().
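To see the lookarounds in isolation, here is a small sketch without the DataTable. The `TokenReplacer` class and its method name are hypothetical, and the sample strings in the usage note are invented for illustration.

```csharp
using System.Text.RegularExpressions;

public static class TokenReplacer
{
    // Replaces 'find' only where it is not glued to a word character on either side.
    public static string ReplaceWhole(string contents, string find, string replacement)
    {
        string pattern = @"(?<!\w)" + Regex.Escape(find) + @"(?!\w)";
        // "$" is special in replacement strings, so double it to keep it literal.
        return Regex.Replace(contents, pattern, replacement.Replace("$", "$$"));
    }
}
```

With the input `this._-2GU; this._-2GUa;`, replacing `_-2GU` by `size` should change only the first occurrence, because the second one is followed by a letter.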
There are two tricks that can help you here:
Order all the search strings by length, and replace the longest ones first; that way you won't accidentally replace the shorter ones.
Use a MatchEvaluator and, instead of looping through all your rows, search for all replacement patterns in the string and look them up in your dataset.
Option one is simple; option two would look like this:
contents = Regex.Replace(contents, "_-\\w+", ReplaceIdentifier);
public string ReplaceIdentifier(Match m)
{
    DataRow row = FolderRenames.Rows.Find(m.Value); // Requires a primary key on "find"
    if (row != null) return row["replace"].ToString();
    else return m.Value;
}
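The single-pass idea can also be sketched with a plain dictionary instead of a DataTable. The `IdentifierRenamer` class is hypothetical, and the pattern assumes every obfuscated name starts with `_-` followed by word characters, which does not cover all the sample strings in the question (for example `-_uU`).

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class IdentifierRenamer
{
    // Assumption: every obfuscated name starts with "_-" followed by word characters.
    private static readonly Regex Identifier = new Regex(@"_-\w+");

    public static string Rename(string contents, IDictionary<string, string> renames)
    {
        // The evaluator runs once per match; names missing from the map stay untouched.
        return Identifier.Replace(contents, m =>
        {
            string replacement;
            return renames.TryGetValue(m.Value, out replacement) ? replacement : m.Value;
        });
    }
}
```

Because the regex always consumes the whole identifier, a short name can never overwrite the start of a longer one, so no ordering by length is needed.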

C# insert string in multiline string

I have a multiline string (from a text file, read using ReadAllText).
The string looks like this:
R;0035709310000026542510X0715;;;
R;0035709310000045094410P1245;;;
R;0035709310000026502910Z1153;;;
I want to insert a ";" into each line at position 22, so it looks like this:
R;00357093100000265425;10X0715;;;
R;00357093100000450944;10P1245;;;
R;00357093100000265029;10Z1153;;;
The multiline string always contains the same amount of data, but not always 3 lines; sometimes there are more lines.
How do I do this? Please show some code.
Thanks a lot :-)
Best regards
Bent
Try this ...
using System.IO;
using System.Linq;
var lines = File.ReadAllLines("data.txt");
var results = lines.Select(x => x.Insert(22, ";"));
Step 1, don't use ReadAllText(). Use ReadAllLines() instead.
string[] myLinesArray = File.ReadAllLines(...);
Step 2, replace all lines (strings) with the changed version.
for (int i = 0; i < myLinesArray.Length; i++)
    myLinesArray[i] = myLinesArray[i].Insert(22, ";");
Step 3, Use WriteAllLines()
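The three steps could be combined roughly as follows. This is a sketch, not a drop-in solution: `LineEditor` and `InsertInEachLine` are made-up names, and the length check is an added assumption so that lines shorter than the insert position do not make `Insert` throw.

```csharp
using System.Linq;

public static class LineEditor
{
    // Inserts 'text' at 'index' in every line that is long enough; shorter lines pass through.
    public static string[] InsertInEachLine(string[] lines, int index, string text)
    {
        return lines
            .Select(line => line.Length >= index ? line.Insert(index, text) : line)
            .ToArray();
    }
}
```

It would be used as `File.WriteAllLines(path, LineEditor.InsertInEachLine(File.ReadAllLines(path), 22, ";"));`, where `path` stands for your own file.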
try this
string s ="R;0035709310000026542510X0715;;;";
s = s.Insert(22,";");
Console.Write(s);
or use Regex
string s = @"R;0035709310000026542510X0715;;;
R;0035709310000045094410P1245;;;
R;0035709310000026502910Z1153;;;";
string resultString = Regex.Replace(s, "^.{22}", "$0;", RegexOptions.IgnoreCase | RegexOptions.Multiline);
Console.Write(resultString);
I think it would be better to read the source file line by line and modify each line as you go.
You could build up your new file in a StringBuilder or, if it is large,
write it to a new file that replaces the source at the end.
Something like this,
using System.IO;
string tempFileName = Path.GetTempFileName();
using (StreamWriter target = File.CreateText(tempFileName))
{
    using (StreamReader source = File.OpenText("YourSourceFile.???"))
    {
        while (source.Peek() >= 0)
        {
            target.WriteLine(source.ReadLine().Insert(22, ";"));
        }
    }
}
File.Delete("YourSourceFile.???");
File.Move(tempFileName, "YourSourceFile.???");
This approach is especially appropriate for large files, since it avoids loading all the data into memory at once, but the performance should be good for all but very large files or, I guess, if the lines were very (very) long.
As suggested, you can use the Insert method to achieve your goal.
If your file contains a lot of lines and you need to work on 1 line at a time, you might also consider reading it line by line from a TextReader.
You could go with Regex:
myString = Regex.Replace(myString, @"(^.{22})", "$1;", RegexOptions.Multiline);
Explanation:
You have 3 string arguments:
1st one is the input
2nd is the pattern
3rd is the replacement string
In the pattern:
() is a capturing group: you can refer to it in the replacement string with $n, n being the 1-based index of the capturing group in the pattern. In this case, $1 is whatever matched "(^.{22})"
"^" is the beginning of a line (because we set the Multiline option; otherwise it would be the beginning of the input string)
"." matches any character except a newline
{22} means you want the preceding pattern (in this case ".", any character) 22 times
So what that means is:
"in any line with 22 characters or more, replace the first 22 characters with those same 22 characters plus ';'"

regex to split line (csv file)

I am not good at regex. Can someone help me write a regex for this?
I may have values like this while reading a CSV file:
"Artist,Name",Album,12-SCS
"val""u,e1",value2,value3
Output:
Artist,Name
Album
12-SCS
Val"u,e1
Value2
Value3
Update:
I like the idea of using the OleDb provider. We have a file upload control on the web page, and I read the content of the file using a stream reader without actually saving the file on the file system. Is there any way I can use the OleDb provider? We need to specify the file name in the connection string, and in my case I don't have the file saved on the file system.
Just adding the solution I worked on this morning.
var regex = new Regex("(?<=^|,)(\"(?:[^\"]|\"\")*\"|[^,]*)");
foreach (Match m in regex.Matches("<-- input line -->"))
{
    var s = m.Value;
}
As you can see, you need to call regex.Matches() per line. It will then return a MatchCollection with the same number of items you have as columns. The Value property of each match is, obviously, the parsed value.
This is still a work in progress, but it happily parses CSV strings like:
2,3.03,"Hello, my name is ""Joshua""",A,B,C,,,D
Actually, it's pretty easy to match CSV lines with a regex. Try this one out:
StringCollection resultList = new StringCollection();
try {
    Regex pattern = new Regex(@"
        # Parse CSV line. Capture next value in named group: 'val'
        \s*                      # Ignore leading whitespace.
        (?:                      # Group of value alternatives.
          ""                     # Either a double quoted string,
          (?<val>                # Capture contents between quotes.
            [^""]*(""""[^""]*)*  # Zero or more non-quotes, allowing
          )                      # doubled "" quotes within string.
          ""\s*                  # Ignore whitespace following quote.
        | (?<val>[^,]*)          # Or... zero or more non-commas.
        )                        # End value alternatives group.
        (?:,|$)                  # Match end is comma or EOS",
        RegexOptions.Multiline | RegexOptions.IgnorePatternWhitespace);
    Match matchResult = pattern.Match(subjectString);
    while (matchResult.Success) {
        resultList.Add(matchResult.Groups["val"].Value);
        matchResult = matchResult.NextMatch();
    }
} catch (ArgumentException ex) {
    // Syntax error in the regular expression
}
Disclaimer: The regex has been tested in RegexBuddy, (which generated this snippet), and it correctly matches the OP test data, but the C# code logic is untested. (I don't have access to C# tools.)
Regex is not the suitable tool for this. Use a CSV parser. Either the builtin one or a 3rd party one.
Give the TextFieldParser class a look. It's in the Microsoft.VisualBasic assembly and does delimited and fixed width parsing.
Give CsvHelper a try (a library I maintain). It's available via NuGet.
You can easily read a CSV file into a custom class collection. It's also very fast.
var streamReader = // Create a StreamReader to your CSV file
var csvReader = new CsvReader( streamReader );
var myObjects = csvReader.GetRecords<MyObject>();
Regex might get overly complex here. Split the line on commas, and then iterate over the resultant bits and concatenate them where "the number of double quotes in the concatenated string" is not even.
"hello,this",is,"a ""test"""
...split...
"hello | this" | is | "a ""test"""
...iterate and merge 'til you've an even number of double quotes...
"hello,this" - even number of quotes (note comma removed by split inserted between bits)
is - even number of quotes
"a ""test""" - even number of quotes
...then strip off the leading and trailing quote if present and replace "" with ".
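The split-and-merge steps above can be sketched as follows. The `QuoteAwareSplitter` class is a made-up name, and the handling of a line with unbalanced quotes (the remainder is kept as-is) is my own assumption, since the description does not cover that case.

```csharp
using System.Collections.Generic;
using System.Linq;

public static class QuoteAwareSplitter
{
    public static List<string> Split(string line)
    {
        var fields = new List<string>();
        string pending = null;
        foreach (string piece in line.Split(','))
        {
            // Re-join with the comma that Split removed while a quoted field is still open.
            pending = pending == null ? piece : pending + "," + piece;
            if (pending.Count(c => c == '"') % 2 != 0)
                continue; // odd number of quotes: the field continues in the next piece

            fields.Add(Unquote(pending));
            pending = null;
        }
        if (pending != null)
            fields.Add(pending); // unbalanced quotes: keep the remainder as-is (assumption)
        return fields;
    }

    // Strips one enclosing pair of quotes, then turns doubled quotes into single ones.
    private static string Unquote(string field)
    {
        if (field.Length >= 2 && field[0] == '"' && field[field.Length - 1] == '"')
            field = field.Substring(1, field.Length - 2);
        return field.Replace("\"\"", "\"");
    }
}
```

For the line `"hello,this",is,"a ""test"""` this should give the three fields `hello,this`, `is` and `a "test"`.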
It could be done using the code below:
using System.IO;
using Microsoft.VisualBasic.FileIO;
string csv = "1,2,3,\"4,3\",\"a,\"\"b\"\",c\",end";
TextFieldParser parser = new TextFieldParser(new StringReader(csv));
// To read from a file instead:
// TextFieldParser parser = new TextFieldParser("csvfile.csv");
parser.HasFieldsEnclosedInQuotes = true;
parser.SetDelimiters(",");
string[] fields = null;
while (!parser.EndOfData)
{
    fields = parser.ReadFields();
}
parser.Close();

Parsing text file for hexadecimal content

I have this text file that contains approximately 22 000 lines, with each line looking like this:
12A4 (Text)
So it's in the format of a 4-character hexadecimal value followed by text. Sometimes there is more than one value in the text, separated by a comma:
A34d (Text, Optional)
Is there any efficient way to search for the Hex and then return the first text in the parentheses? Would it be much more effective if I stored this data in SQLite?
Example using substring and split.
string value = "A34d (Text, Optional)";
string hex = value.Substring(0, 4);
string text = value.Split('(')[1];
if (text.Contains(','))
    text = text.Substring(0, text.IndexOf(','));
else
    text = text.Substring(0, text.Length - 1);
For searching use a Dictionary.
That's probably < 2 MB of data.
I think you can:
Read the whole file
Split each line into key (the hex number) and value (the remainder); Chris Persichetti's answer is excellent for that
Store each line in a dictionary (using the number as an int, not as a string)
var d = new Dictionary<int, string>();
d[Convert.ToInt32(key, 16)] = value;
Keep that dictionary in memory and then perform a very quick lookup by the id
There are elegant answers posted already, but since you requested regex, try this:
var regex = @"^(?<hexData>.{4})\s(?<textData>.*)$";
var matches = Regex.Matches(textInput, regex, RegexOptions.Multiline);
Then you iterate over the matches object to get whatever you want.
If you want to search for the Hex value more than once, you definitely want to store this in a lookup table of some sort.
This could be as simple as a Dictionary<string, string> that you populate with the contents of your file on startup:
read each line (StreamReader.ReadLine)
hexString = substring of first 4 characters in line
store the rest of the string
To find the first part, create a function that retrieves "A" from "(A, B, C, ...)"
If you can rule out commas "," in "A", you are in luck: Remove the parentheses, split on "," and return first substring.
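A rough sketch of that lookup table, assuming no commas inside the first value. `HexLookup` and `Build` are hypothetical names, and lines that do not fit the "HHHH (Text, ...)" shape are simply skipped, which is my own choice rather than something the question specifies.

```csharp
using System;
using System.Collections.Generic;

public static class HexLookup
{
    // Builds a case-insensitive map from the 4-character hex key to the first text value.
    public static Dictionary<string, string> Build(IEnumerable<string> lines)
    {
        var table = new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase);
        foreach (string line in lines)
        {
            int open = line.IndexOf('(');
            if (line.Length < 4 || open < 0)
                continue; // skip lines that do not match the assumed shape

            // Drop the closing parenthesis, then keep only the part before the first comma.
            string inner = line.Substring(open + 1).TrimEnd().TrimEnd(')');
            table[line.Substring(0, 4)] = inner.Split(',')[0].Trim();
        }
        return table;
    }
}
```

Populate it once at startup, for example with `HexLookup.Build(File.ReadLines(path))` where `path` is your data file, and then look up values with `table["a34d"]` or `TryGetValue`.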
var lines = ...;
var item = (from line in lines
            where line.StartsWith("a34d", StringComparison.OrdinalIgnoreCase)
            select line).FirstOrDefault();
// if item == null, it was not found
var firstText = item == null ? null : item.Split('(', ',', ')')[1];
This works; if you want to strip leading and trailing whitespace from firstText, add a .Trim() at the end.
For splitting a text into several lines, see my two answers to "How can I convert a string with newlines in it to separate lines?"
Use a StreamReader with ReadLine; you can then check whether the first characters equal what you are searching for and, if they do, you can do
string yourresult = thereadline.Split(
    new string[] { " (", ",", ")" },
    StringSplitOptions.RemoveEmptyEntries)[1];

Removing all whitespace lines from a multi-line string efficiently

In C# what's the best way to remove blank lines i.e., lines that contain only whitespace from a string? I'm happy to use a Regex if that's the best solution.
EDIT: I should add I'm using .NET 2.0.
Bounty update: I'll roll this back after the bounty is awarded, but I wanted to clarify a few things.
First, any Perl 5 compat regex will work. This is not limited to .NET developers. The title and tags have been edited to reflect this.
Second, while I gave a quick example in the bounty details, it isn't the only test you must satisfy. Your solution must remove all lines which consist of nothing but whitespace, as well as the last newline. If there is a string which, after running through your regex, ends with "\r\n" or any whitespace characters, it fails.
If you want to remove lines containing only whitespace (tabs, spaces), try:
string fix = Regex.Replace(original, @"^\s*$\n", string.Empty, RegexOptions.Multiline);
Edit (for @Will): The simplest solution to trim trailing newlines would be to use TrimEnd on the resulting string, e.g.:
string fix =
    Regex.Replace(original, @"^\s*$\n", string.Empty, RegexOptions.Multiline)
        .TrimEnd();
string outputString;
using (StringReader reader = new StringReader(originalString))
using (StringWriter writer = new StringWriter())
{
    string line;
    while ((line = reader.ReadLine()) != null)
    {
        if (line.Trim().Length > 0)
            writer.WriteLine(line);
    }
    outputString = writer.ToString();
}
Off the top of my head...
string fixedText = Regex.Replace(input, @"\s*(\n)", "$1");
turns this:
fdasdf
asdf
[tabs]
[spaces]
asdf
into this:
fdasdf
asdf
asdf
Using LINQ:
var result = string.Join("\r\n",
    multilineString.Split(new string[] { "\r\n" }, ...None)
        .Where(s => !string.IsNullOrWhiteSpace(s)));
If you're dealing with large inputs and/or inconsistent line endings you should use a StringReader and do the above old-school with a foreach loop instead.
Alright this answer is in accordance to the clarified requirements specified in the bounty:
I also need to remove any trailing newlines, and my Regex-fu is
failing. My bounty goes to anyone who can give me a regex which passes
this test: StripWhitespace("test\r\n \r\nthis\r\n\r\n") ==
"test\r\nthis"
So Here's the answer:
(?<=\r?\n)(\s*$\r?\n)+|(?<=\r?\n)(\r?\n)+|(\r?\n)+\z
Or in the C# code provided by @Chris Schmich:
string fix = Regex.Replace("test\r\n \r\nthis\r\n\r\n", @"(?<=\r?\n)(\s*$\r?\n)+|(?<=\r?\n)(\r?\n)+|(\r?\n)+\z", string.Empty, RegexOptions.Multiline);
Now let's try to understand it. There are three alternative patterns in here, each of which I replace with string.Empty.
(?<=\r?\n)(\s*$\r?\n)+ - matches one or more lines containing only white space and preceded by a line break (but does not match that preceding line break).
(?<=\r?\n)(\r?\n)+ - matches one or more empty lines with no content that are preceded by a line break (but does not match that preceding line break).
(\r?\n)+\z - matches one or more line breaks at the end of the tested string (trailing line breaks, as you called them)
That satisfies your test perfectly! It also handles both \r\n and \n line break styles! Test it out! I believe this will be the most correct answer; although a simpler expression would pass your specified bounty test, this regex passes more complex conditions.
EDIT: @Will pointed out a potential flaw in the last alternative of the above regex: it won't match multiple line breaks containing white space at the end of the test string. So let's change that last pattern to this:
\b\s+\z The \b is a word boundary (beginning or END of a word), the \s+ is one or more white space characters, and the \z is the end of the test string (end of "file"). So now it will match any assortment of whitespace at the end of the file, including tabs and spaces in addition to carriage returns and line breaks. I tested both of @Will's provided test cases.
So all together now, it should be:
(?<=\r?\n)(\s*$\r?\n)+|(?<=\r?\n)(\r?\n)+|\b\s+\z
EDIT #2: Alright, there is one more possible case @Will found that the last regex doesn't cover: inputs that have line breaks at the beginning of the file before any content. So let's add one more pattern to match the beginning of the file.
\A\s+ - The \A matches the beginning of the file, the \s+ matches one or more white space characters.
So now we've got:
\A\s+|(?<=\r?\n)(\s*$\r?\n)+|(?<=\r?\n)(\r?\n)+|\b\s+\z
So now we have four patterns for matching:
whitespace at the beginning of the file,
redundant line breaks containing white space, (ex: \r\n \r\n\t\r\n)
redundant line breaks with no content, (ex: \r\n\r\n)
whitespace at the end of the file
Not good. I would use this one, using JSON.NET:
var o = JsonConvert.DeserializeObject(prettyJson);
var minifiedJson = JsonConvert.SerializeObject(o, Formatting.None);
In response to Will's bounty, which expects a solution that takes "test\r\n \r\nthis\r\n\r\n" and outputs "test\r\nthis", I've come up with a solution that makes use of atomic grouping (aka Nonbacktracking Subexpressions on MSDN). I recommend reading those articles for a better understanding of what's happening. Ultimately the atomic group helped match the trailing newline characters that were otherwise left behind.
Use RegexOptions.Multiline with this pattern:
^\s+(?!\B)|\s*(?>[\r\n]+)$
Here is an example with some test cases, including some I gathered from Will's comments on other posts, as well as my own.
string[] inputs =
{
"one\r\n \r\ntwo\r\n\t\r\n \r\n",
"test\r\n \r\nthis\r\n\r\n",
"\r\n\r\ntest!",
"\r\ntest\r\n ! test",
"\r\ntest \r\n ! "
};
string[] outputs =
{
"one\r\ntwo",
"test\r\nthis",
"test!",
"test\r\n ! test",
"test \r\n ! "
};
string pattern = @"^\s+(?!\B)|\s*(?>[\r\n]+)$";
for (int i = 0; i < inputs.Length; i++)
{
    string result = Regex.Replace(inputs[i], pattern, "",
                                  RegexOptions.Multiline);
    Console.WriteLine(result == outputs[i]);
}
EDIT: To address the issue of the pattern failing to clean up text with a mix of whitespace and newlines, I added \s* to the last alternation portion of the regex. My previous pattern was redundant and I realized \s* would handle both cases.
string corrected =
    System.Text.RegularExpressions.Regex.Replace(input, @"\n+", "\n");
I'll go with:
public static string RemoveEmptyLines(string value) {
    using (StringReader reader = new StringReader(value)) {
        StringBuilder builder = new StringBuilder();
        string line;
        while ((line = reader.ReadLine()) != null) {
            if (line.Trim().Length > 0)
                builder.AppendLine(line);
        }
        return builder.ToString();
    }
}
Here's another option: use the StringReader class. Advantages: one pass over the string, creates no intermediate arrays.
public static string RemoveEmptyLines(this string text) {
var builder = new StringBuilder();
using (var reader = new StringReader(text)) {
while (reader.Peek() != -1) {
string line = reader.ReadLine();
if (!string.IsNullOrWhiteSpace(line))
builder.AppendLine(line);
}
}
return builder.ToString();
}
Note: the IsNullOrWhiteSpace method is new in .NET 4.0. If you don't have that, it's trivial to write on your own:
public static bool IsNullOrWhiteSpace(string text) {
return string.IsNullOrEmpty(text) || text.Trim().Length < 1;
}
In response to Will's bounty here is a Perl sub that gives correct response to the test case:
sub StripWhitespace {
my $str = shift;
print "'",$str,"'\n";
$str =~ s/(?:\R+\s+(\R)+)|(?:()\R+)$/$1/g;
print "'",$str,"'\n";
return $str;
}
StripWhitespace("test\r\n \r\nthis\r\n\r\n");
output:
'test
this
'
'test
this'
In order to not use \R, replace it with [\r\n] and inverse the alternative. This one produces the same result:
$str =~ s/(?:(\S)[\r\n]+)|(?:[\r\n]+\s+([\r\n])+)/$1/g;
There is no need for special configuration or multi-line support. Nevertheless, you can add the s flag if it's mandatory.
$str =~ s/(?:(\S)[\r\n]+)|(?:[\r\n]+\s+([\r\n])+)/$1/sg;
If it's only spaces, why don't you use the C# string method
string yourstring = "A O P V 1.5";
yourstring = yourstring.Replace(" ", string.Empty);
The result will be "AOPV1.5"
char[] delimiters = new char[] { '\r', '\n' };
string[] lines = value.Split(delimiters, StringSplitOptions.RemoveEmptyEntries);
string result = string.Join(Environment.NewLine, lines);
Here is something simple if working against each individual line...
(^\s+|\s+|^)$
Eh. Well, after all that, I couldn't find one that would hit all the corner cases I could figure out. The following is my latest incantation of a regex that strips
All empty lines from the start of a string
Not including any spaces at the beginning of the first non-whitespace line
All empty lines after the first non-whitespace line and before the last non-whitespace line
Again, preserving all whitespace at the beginning of any non-whitespace line
All empty lines after the last non-whitespace line, including the last newline
(?<=(\r\n)|^)\s*\r\n|\r\n\s*$
which essentially says:
Immediately after
The beginning of the string OR
The end of the last line
Match as much contiguous whitespace as possible that ends in a newline*
OR
Match a newline and as much contiguous whitespace as possible that ends at the end of the string
The first half catches all whitespace at the start of the string until the first non-whitespace line, or all whitespace between non-whitespace lines. The second half snags the remaining whitespace in the string, including the last non-whitespace line's newline.
Thanks to all who tried to help out; your answers helped me think through everything I needed to consider when matching.
*(This regex considers a newline to be \r\n, and so will have to be adjusted depending on the source of the string. No options need to be set in order to run the match.)
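As a quick check of that final pattern, here is a sketch that wraps it in a helper. The `BlankLineStripper` name is made up, and like the regex itself it assumes "\r\n" line endings.

```csharp
using System.Text.RegularExpressions;

public static class BlankLineStripper
{
    // Assumes "\r\n" line endings, as noted above; no RegexOptions are required.
    private static readonly Regex BlankLines =
        new Regex(@"(?<=(\r\n)|^)\s*\r\n|\r\n\s*$");

    public static string Strip(string text)
    {
        return BlankLines.Replace(text, string.Empty);
    }
}
```

`BlankLineStripper.Strip("test\r\n \r\nthis\r\n\r\n")` should return `"test\r\nthis"`, while indentation at the start of a non-blank line is left alone.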
String Extension
public static string UnPrettyJson(this string s)
{
try
{
// var jsonObj = Json.Decode(s);
// var sObject = Json.Encode(value); dont work well with array of strings c:['a','b','c']
object jsonObj = JsonConvert.DeserializeObject(s);
return JsonConvert.SerializeObject(jsonObj, Formatting.None);
}
catch (Exception e)
{
throw new Exception(
s + " Is Not a valid JSON ! (please validate it in http://www.jsoneditoronline.org )", e);
}
}
I'm not sure whether it is efficient, but =)
List<string> strList = myString.Split(new string[] { "\n" }, StringSplitOptions.None).ToList<string>();
myString = string.Join("\n", strList.Where(s => !string.IsNullOrWhiteSpace(s)).Distinct().ToList());
Try this.
string s = "Test1" + Environment.NewLine + Environment.NewLine + "Test 2";
Console.WriteLine(s);
string result = s.Replace(Environment.NewLine, String.Empty);
Console.WriteLine(result);
s = Regex.Replace(s, @"^[^\n\S]*\n", "", RegexOptions.Multiline);
[^\n\S] matches any character that's not a linefeed or a non-whitespace character--so, any whitespace character except \n. But most likely the only characters you have to worry about are space, tab and carriage return, so this should work too:
s = Regex.Replace(s, @"^[ \t\r]*\n", "", RegexOptions.Multiline);
And if you want it to catch the last line, without a final linefeed:
s = Regex.Replace(s, @"^[ \t\r]*\n?", "", RegexOptions.Multiline);
