I can't be the first person to have this issue but hours of searching Stack revealed nothing close to an answer. I have an SSIS script that works over a directory of csv files. This script folds, bends and mutilates these files; performs queries, data cleansing, persists some data and finally outputs a small set to csv file that is ingested by another system.
One of the files has a free text field that contains the value: "20,000 BONUS POINTS". This one field, in a file of 10k rows, one of dozens of similar files, is the problem that I can't seem to solve.
Be advised: I'm weak on both C# and Regex.
Sample csv set:
4121,6383,0,,,TRUE
4122,6384,0,"20,000 BONUS POINTS",,TRUE
4123,6385,,,,
4124,6386,0,,,TRUE
4125,6387,0,,,TRUE
4126,6388,0,,,TRUE
4127,6389,0,,,TRUE
4128,6390,0,,,TRUE
I found plenty of information on how to parse this using a variety of Regex patterns but what I've noticed is the StreamReader.ReadLine() method wraps the complete line with double quotes:
"4121,6383,0,,,TRUE"
such that the output of the regex Replace method:
s = Regex.Replace(line, @"[^\""]([^\""])*[^\""]",
m => m.Value.Replace(",", ""));
looks like this:
412163830TRUE
and the target line that actually contains a double quote delimited string ends up looking like:
"412263840\"20000 BONUS POINTS\"TRUE"
My entire method (for your reading pleasure) is this:
string fileDirectory = "C:\\tmp\\Unzip\\";
string fullPath = "C:\\tmp\\Unzip\\test.csv";
string line = "";
//int count=0;
List<string> list = new List<string>();
try
{
//MessageBox.Show("inside Try Block");
string s = null;
StreamReader infile = new StreamReader(fullPath);
StreamWriter outfile = new StreamWriter(Path.Combine(fileDirectory, "output.csv"));
while ((line = infile.ReadLine()) != null)
{
//line.Substring(0,1).Substring(line.Length-1, 1);
System.Console.WriteLine(line);
Console.WriteLine(line);
line =
s = Regex.Replace(line, @"[^\""]([^\""])*[^\""]",
m => m.Value.Replace(",", ""));
System.Console.WriteLine(s);
list.Add(s);
}
foreach (string item in list)
{
outfile.WriteLine(item);
};
infile.Close();
outfile.Close();
//System.Console.WriteLine("There were {0} lines.", count);
}
catch (Exception e)
{
Console.WriteLine(e.Message);
}
//another addition for TFS consumption
}
Thanks for reading and if you have a useful answer, bless you and your progeny for generations to come!
mfc
EDIT: The requirement is a valid csv file output. In the case of the test data, it would look like this:
4121,6383,0,,,TRUE
4122,6384,0,"20000 BONUS POINTS",,TRUE
4123,6385,,,,
4124,6386,0,,,TRUE
4125,6387,0,,,TRUE
4126,6388,0,,,TRUE
4127,6389,0,,,TRUE
4128,6390,0,,,TRUE
I recommend using a CSV reader lib like others have suggested.
Install-Package LumenWorksCsvReader
https://github.com/phatcher/CsvReader#getting-started
However, if you just want to try something fast and dirty, give this a try.
If I understand correctly, you need to remove the commas between double quotes within each line of a CSV file. This should do that.
using System;
using System.Collections.Generic;
using System.Text;
using System.Text.RegularExpressions;
public class Program
{
public static void Main()
{
string pattern = @"([""'])(?:(?=(\\?))\2.)*?\1";
List<string> lines = new List<string>();
lines.Add("4121,6383,0,,,TRUE");
lines.Add("4122,6384,0,\"20,000 BONUS POINTS\",,TRUE");
lines.Add("4123,6385,,,,");
lines.Add("4124,6386,0,,,TRUE");
lines.Add("4125,6387,0,,,TRUE");
lines.Add("4126,6388,0,,,TRUE");
lines.Add("4127,6389,0,,,TRUE");
lines.Add("4128,6390,0,,,TRUE");
StringBuilder sb = new StringBuilder();
foreach (var line in lines)
{
sb.Append(Regex.Replace(line, pattern, m => m.Value.Replace(",", ""))+"\n");
}
Console.WriteLine(sb.ToString());
}
}
OUTPUT
4121,6383,0,,,TRUE
4122,6384,0,"20000 BONUS POINTS",,TRUE
4123,6385,,,,
4124,6386,0,,,TRUE
4125,6387,0,,,TRUE
4126,6388,0,,,TRUE
4127,6389,0,,,TRUE
4128,6390,0,,,TRUE
https://dotnetfiddle.net/flmWG3
I haven't tried with numerous lines, but this would be my first approach:
using System;
using System.Text.RegularExpressions;
namespace ConsoleTestApplication
{
class Program
{
static void Main(string[] args)
{
var before = "4122,6384,0,\"20,000 BONUS POINTS\",,TRUE";
var pattern = @"""[^""]*""";
var after = Regex.Replace(before, pattern, match => match.Value.Replace(",", ""));
Console.WriteLine(after);
}
}
}
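To run the same replacement over a whole file instead of a single string, something along these lines might work. This is only a sketch: the names CsvCommaStripper, StripQuotedCommas and ProcessFile are made up for illustration, and it assumes that no quoted field spans several lines or contains escaped quotes (""). The input and output paths must differ, because the input is read lazily while the output is written.

```csharp
using System;
using System.IO;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of any library.
public static class CsvCommaStripper
{
    // Matches one double-quoted field (assumes no escaped quotes inside it).
    private static readonly Regex QuotedField = new Regex("\"[^\"]*\"");

    // Removes commas only inside quoted fields; the rest of the line is untouched.
    public static string StripQuotedCommas(string line)
    {
        return QuotedField.Replace(line, m => m.Value.Replace(",", ""));
    }

    // Reads the input line by line and writes the cleaned lines to outputPath.
    public static void ProcessFile(string inputPath, string outputPath)
    {
        File.WriteAllLines(
            outputPath,
            File.ReadLines(inputPath).Select(line => StripQuotedCommas(line)));
    }
}
```

Usage would be something like CsvCommaStripper.ProcessFile("C:\\tmp\\Unzip\\test.csv", "C:\\tmp\\Unzip\\output.csv"), with the paths taken from the question.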
Related
Fast way to replace text in text file.
From this: somename#somedomain.com:hello_world
To This: somename:hello_world
It needs to be FAST and support multiple lines of text file.
I tried splitting the string into three parts, but it seems slow. Example in the code below.
public static void Conversion()
{
List<string> list = File.ReadAllLines("ETU/Tut.txt").ToList();
Console.WriteLine("Please wait, converting in progress !");
foreach (string combination in list)
{
if (combination.Contains("#"))
{
write: try
{
using (StreamWriter sw = new
StreamWriter("ETU/UPCombination.txt", true))
{
sw.WriteLine(combination.Split('#', ':')[0] + ":"
+ combination.Split('#', ':')[2]);
}
}
catch
{
goto write;
}
}
else
{
Console.WriteLine("At least one line doesn't contain #");
}
}
}
So a fast way to convert every line in text file from
somename#somedomain.com:hello_world
To: somename:hello_world
then save it to a different text file.
!Remember the domain bit always changes!
Most likely not the fastest, but it is pretty fast with an expression similar to,
#[^:]+
and replace that with an empty string.
using System;
using System.Text.RegularExpressions;
public class Example
{
public static void Main()
{
string pattern = @"#[^:]+";
string substitution = @"";
string input = @"somename#somedomain.com:hello_world1
somename#some_other_domain.com:hello_world2";
RegexOptions options = RegexOptions.Multiline;
Regex regex = new Regex(pattern, options);
string result = regex.Replace(input, substitution);
}
}
If you wish to simplify/modify/explore the expression, it's been explained on the top right panel of regex101.com. If you'd like, you can also watch in this link, how it would match against some sample inputs.
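If the regex turns out to be a bottleneck, a plain string search is another option. The sketch below swaps the regex for IndexOf and Substring, and it writes through a single StreamWriter instead of reopening the output file for every line. The names DomainStripper, StripDomain and ConvertFile are invented for this example, and I have not measured it against the regex version.

```csharp
using System;
using System.IO;

// Hypothetical helper, not part of any library.
public static class DomainStripper
{
    // Drops everything from the first '#' up to (not including) the next ':'.
    // Lines that do not have that shape are returned unchanged.
    public static string StripDomain(string line)
    {
        int hash = line.IndexOf('#');
        if (hash < 0) return line;
        int colon = line.IndexOf(':', hash);
        if (colon < 0) return line;
        return line.Substring(0, hash) + line.Substring(colon);
    }

    // Streams the input file and writes every converted line to outputPath.
    public static void ConvertFile(string inputPath, string outputPath)
    {
        using (var writer = new StreamWriter(outputPath))
        {
            foreach (string line in File.ReadLines(inputPath))
            {
                writer.WriteLine(StripDomain(line));
            }
        }
    }
}
```

With the paths from the question this would be called as DomainStripper.ConvertFile("ETU/Tut.txt", "ETU/UPCombination.txt").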
I need to delimit text by a single character, a comma. But I only want to use that comma as a delimiter if it is not enclosed in quotation marks.
An example:
Method,value1,value2
Would contain three values: Method, value1 and value2
But:
Method,"value1,value2"
Would contain two values: Method and "value1,value2"
I'm not really sure how to go about this as when splitting a string I would use:
String.Split(',');
But that would delimit based on ALL commas. Is this possible without getting overly complicated and having to manually check every character of the string?
Thanks in advance
Copied from my comment: Use an available csv parser like VisualBasic.FileIO.TextFieldParser or this or this.
As requested, here is an example for the TextFieldParser:
var allLineFields = new List<string[]>();
string sampleText = "Method,\"value1,value2\"";
var reader = new System.IO.StringReader(sampleText);
using (var parser = new Microsoft.VisualBasic.FileIO.TextFieldParser(reader))
{
parser.Delimiters = new string[] { "," };
parser.HasFieldsEnclosedInQuotes = true; // <--- !!!
string[] fields;
while ((fields = parser.ReadFields()) != null)
{
allLineFields.Add(fields);
}
}
This list now contains a single string[] with two strings. I have used a StringReader because this sample uses a string; if the source is a file, use a StreamReader (e.g. via File.OpenText).
You can try Regex.Split() to split the data up using the pattern
",|(\"[^\"]*\")"
This will split by commas and by characters within quotes.
Code Sample:
using System;
using System.Linq;
using System.Text.RegularExpressions;
public class Program
{
public static void Main()
{
string data = "Method,\"value1,value2\",Method2";
string[] pieces = Regex.Split(data, ",|(\"[^\"]*\")").Where(exp => !String.IsNullOrEmpty(exp)).ToArray();
foreach (string piece in pieces)
{
Console.WriteLine(piece);
}
}
}
Results:
Method
"value1,value2"
Method2
I want to read 4-5 CSV files into some array in C#
I know that this question has been asked before, and I have gone through the answers...
But my use of CSVs is much simpler than that...
I have csv files with columns of the following data types....
string , string
These strings contain no ',' so no tension...
That's it. And they aren't very big. Only about 20 records in each.
I just want to read them into an array in C#....
Is there any very very simple and direct way to do that?
To read the file, use
TextReader reader = File.OpenText(filename);
To read a line:
string line = reader.ReadLine();
then
string[] tokens = line.Split(',');
to separate them.
By using a loop around the two last example lines, you could add each array of tokens into a list, if that's what you need.
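The loop described above could look roughly like this. It is a minimal sketch: SimpleCsv and ReadRows are names I made up, and the plain Split(',') only works because the question guarantees that the fields contain no commas.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical helper, not part of any library.
public static class SimpleCsv
{
    // Reads every line from the reader and splits it on commas.
    // Assumes fields never contain commas or quotes.
    public static List<string[]> ReadRows(TextReader reader)
    {
        var rows = new List<string[]>();
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            rows.Add(line.Split(','));
        }
        return rows;
    }
}
```

For a file, wrap the reader in a using block: using (TextReader reader = File.OpenText(filename)) { var rows = SimpleCsv.ReadRows(reader); }. Call rows.ToArray() if an array is really needed.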
This one includes the quotes & commas in fields. (assumes you're doing a line at a time)
using Microsoft.VisualBasic.FileIO; //For TextFieldParser
// blah blah blah
StringReader csv_reader = new StringReader(csv_line);
TextFieldParser csv_parser = new TextFieldParser(csv_reader);
csv_parser.SetDelimiters(",");
csv_parser.HasFieldsEnclosedInQuotes = true;
string[] csv_array = csv_parser.ReadFields();
Here is a simple way to get a CSV content to an array of strings. The CSV file can have double quotes, carriage return line feeds and the delimiter is a comma.
Here are the namespaces that you need:
System.IO;
System.Collections.Generic;
System.Text;
System.IO is for the FileStream and StreamReader classes to access your file. Both classes implement the IDisposable interface, so you can use using statements to close your streams. (example below)
The System.Collections.Generic namespace is for collections such as IList<T> and List<T>. In this example, we'll use the List class, because Lists are better than Arrays in my honest opinion. However, before I return our outbound variable, I'll call the .ToArray() member method to return the array. System.Text is only needed for the StringBuilder declared in the example.
There are many ways to get content from your file, I personally prefer to use a while(condition) loop to iterate over the contents. In the condition clause, use !lReader.EndOfStream. While not end of stream, continue iterating over the file.
public string[] GetCsvContent(string iFileName)
{
List<string> oCsvContent = new List<string>();
using (FileStream lFileStream =
new FileStream(iFileName, FileMode.Open, FileAccess.Read))
{
StringBuilder lFileContent = new StringBuilder();
using (StreamReader lReader = new StreamReader(lFileStream))
{
// flag if a double quote is found
bool lContainsDoubleQuotes = false;
// a string for the csv value
string lCsvValue = "";
// loop through the file until you read the end
while (!lReader.EndOfStream)
{
// stores each line in a variable
string lCsvLine = lReader.ReadLine();
// for each character in the line...
foreach (char lLetter in lCsvLine)
{
// check if the character is a double quote
if (lLetter == '"')
{
if (!lContainsDoubleQuotes)
{
lContainsDoubleQuotes = true;
}
else
{
lContainsDoubleQuotes = false;
}
}
// if we come across a comma
// AND it's not within a double quote..
if (lLetter == ',' && !lContainsDoubleQuotes)
{
// add our string to the array
oCsvContent.Add(lCsvValue);
// null out our string
lCsvValue = "";
}
else
{
// add the character to our string
lCsvValue += lLetter;
}
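}
// end of the line: store the last value unless we are still inside quotes
if (!lContainsDoubleQuotes)
{
oCsvContent.Add(lCsvValue);
lCsvValue = "";
}
else
{
// a quoted value continues on the next line, so keep the line break
lCsvValue += "\r\n";
}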
}
}
}
return oCsvContent.ToArray();
}
Hope this helps! Very easy and very quick.
Cheers!
This is what the data in the string looks like:
<temp><id>TGPU1</id><label>GPU</label><value>67</value></temp><temp><id>THDD1</id><label>ST3320620AS</label><value>34</value></temp><temp><id>FCPU</id><label>CPU</label><value>1430</value></temp>
(There is more; that is just a small snippet of the original output.)
What I would like to do is feed it through some fast code (fast not in terms of how long the actual code is, but how long it takes to execute) that will remove all the
<temp><id>TGPU1</id><label>GPU</label><value>
and output it to a new string with only the values between all the <value>'s and </value>'s. I'm looking for output in the string like: 67341430.
I found this code:
bool FoundMatch = false;
try {
Regex regex = new Regex(@"<([be])pt[^>]+>.+?</\1pt>");
while(regex.IsMatch(yourstring) ) {
yourstring = regex.Replace(yourstring, "");
}
} catch (ArgumentException ex) {
// Syntax error in the regular expression
}
But it only removes the <*pt> tags, which could be changed to remove something else. It would also only remove one tag at a time, which is not very fast if I have to remove all tags.
Thanks
PS: if you wanted to know, the code below is the code I am looking to add this to. This is also the code that printed the original string mentioned above:
static void Main(string[] args)
{
Console.WriteLine("Memory mapped file reader started");
using (var file = MemoryMappedFile.OpenExisting("sensor"))
{
using (var reader = file.CreateViewAccessor(0, 3800))
{
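// read the mapped bytes into a buffer before decoding them
var bytes = new byte[3800];
reader.ReadArray(0, bytes, 0, bytes.Length);
var encoding = Encoding.ASCII;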
Console.WriteLine(encoding.GetString(bytes));
}
}
Console.WriteLine("Press any key to exit ...");
Console.ReadLine();
}
You can use LINQ:
String.Concat(
XElement.Parse(...)
.Descendants("value")
.Select(v => v.Value)
);
If you're really concerned about performance, you could use XmlReader directly to read through <value> tags and append to a StringBuilder.
However, that's not worth it, unless your XML is hundreds of megabytes large.
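One caveat with the LINQ approach: the sample string has several top-level <temp> elements, and XElement.Parse expects a single root. A minimal sketch that wraps the fragment in a made-up <root> element first could look like this (SensorValues and ConcatValues are names invented for the example):

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

// Hypothetical helper, not part of any library.
public static class SensorValues
{
    // Wraps the fragment in a dummy <root> so it parses as one document,
    // then concatenates the text of every <value> element in document order.
    public static string ConcatValues(string fragment)
    {
        XElement root = XElement.Parse("<root>" + fragment + "</root>");
        return string.Concat(root.Descendants("value").Select(v => v.Value));
    }
}
```

For the sample string in the question this returns 67341430.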
What I would like to do is find all instances of a string in a text file, then add the full lines containing the said string to an array.
For example:
eng GB English
lir LR Liberian Creole English
mao NZ Maori
Searching eng, for example, must add the first two lines to the array, including of course the many more instances of 'eng' in the file.
How can this be done, using a text file input and C#?
You can use a TextReader to read each line and search it; if you find what you want, add that line to a list of strings.
List<string> found = new List<string>();
string line;
using(StreamReader file = new StreamReader("c:\\test.txt"))
{
while((line = file.ReadLine()) != null)
{
if(line.Contains("eng"))
{
found.Add(line);
}
}
}
Or you can use yield return to return an enumerable.
One line:
using System.IO;
using System.Linq;
var result = File.ReadAllLines(@"c:\temp").Where(s => s.Contains("eng"));
Or, if you want a more memory efficient solution, you can roll an extension method. You can use FileInfo, FileStream, etc. as the base handler:
public static IEnumerable<string> ReadAndFilter(this FileInfo info, Predicate<string> condition)
{
string line;
using (var reader = new StreamReader(info.FullName))
{
while ((line = reader.ReadLine()) != null)
{
if (condition(line))
{
yield return line;
}
}
}
}
Usage:
var result = new FileInfo(path).ReadAndFilter(s => s.Contains("eng"));
You can try the following code; I tried it and it worked.
string searchKeyword = "eng";
string fileName = "Some file name here";
string[] textLines = File.ReadAllLines(fileName);
List<string> results = new List<string>();
foreach (string line in textLines)
{
if (line.Contains(searchKeyword))
{
results.Add(line);
}
}
The File class contains a static ReadLines method that returns the lines lazily, one at a time, in contrast with ReadAllLines, which returns an array and thus needs to load the complete file into memory.
So, by using File.ReadLines and LINQ an efficient and short solution could be written as:
var found = File.ReadLines(path).Where(line => line.Contains("eng")).ToArray();
As for the original question, it could be optimized further by replacing line.Contains with line.StartsWith, as it seems the required term appears in the beginning of each line.
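One thing to watch with the sample data: Contains and StartsWith are case-sensitive by default, so the line "lir LR Liberian Creole English" would not match "eng", even though the question expects it to. If that line should be found, a case-insensitive search is one way to get there. The sketch below uses invented names (LineSearch, FindLines) and takes any sequence of lines, so it can be fed from File.ReadLines.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

// Hypothetical helper, not part of any library.
public static class LineSearch
{
    // Returns every line that contains the term, ignoring case.
    public static string[] FindLines(IEnumerable<string> lines, string term)
    {
        return lines
            .Where(line => line.IndexOf(term, StringComparison.OrdinalIgnoreCase) >= 0)
            .ToArray();
    }
}
```

Usage would be something like var found = LineSearch.FindLines(File.ReadLines(path), "eng"); where path is the text file to search.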