Reading Lines From Delimited File in C# [closed] - c#

I have a program where I'm trying to take the information a user entered and store it in a template file of sorts, so it can be saved and reloaded easily. The template format looks like this:
#START#1 -- Contact#END#
#START#1 -- Error
2 -- Error
3 -- Error#END#
#START#1 -- Actions
2 -- Actions
3 -- Actions
4 -- Actions#END#
#START#1 -- Res
2 -- Res
3 -- Res#END#
#START#WorkedWith#END#
#START#3011#END#
#START#1 -- Details
2 -- Details
3 -- Details#END#
Everything between the #START# and #END# tags is a value that needs to be stored in a different variable.
For instance the first variable would need to contain
1 -- Contact
The second variable would need to contain
1 -- Error
2 -- Error
3 -- Error
And so on until the 7th variable contains the Details section.
What is the easiest way to go about reading the file and storing the data between the delimiters into variables?
Thanks in advance!
EDIT: For Sakura
Code:
string sInput = "";
using (var reader = new StreamReader(sTemplateFilePath))
{
while (!reader.EndOfStream)
{
var line = reader.ReadLine();
sInput = sInput + line;
}
reader.Close();
}
foreach (Match m in Regex.Matches(sInput, "#START#(.*?)#END#", RegexOptions.Singleline | RegexOptions.Compiled))
{
foreach (var line in m.Groups[1].Value.Split('\n'))
{
switch (iLineCount)
{
case 0:
sContactReason = line;
break;
case 1:
sError = line;
break;
case 2:
sActionsTaken = line;
break;
case 3:
sResolution = line;
break;
case 4:
sL3 = line;
break;
case 5:
sKB = line;
break;
case 6:
sDetails = line;
break;
}
iLineCount++;
}
}
Output:
1 -- Contact
1 -- Error2 -- Error3 -- Error
1 -- Actions2 -- Actions3 -- Actions4 -- Actions
1 -- Res2 -- Res3 -- Res
WorkedWith
3011
1 -- Details2 -- Details3 -- Details

static void Main()
{
string s = @"#START#1 -- Contact#END#
#START#1 -- Error
2 -- Error
3 -- Error#END#
#START#1 -- Actions
2 -- Actions
3 -- Actions
4 -- Actions#END#
#START#1 -- Res
2 -- Res
3 -- Res#END#
#START#WorkedWith#END#
#START#3011#END#
#START#1 -- Details
2 -- Details
3 -- Details#END#";
int k = -1;
foreach (Match m in Regex.Matches(s, "#START#(.*?)#END#", RegexOptions.Singleline | RegexOptions.Compiled))
{
Console.WriteLine("Variable " + ++k + " is:\n" + m.Groups[1].Value);
Console.WriteLine();
}
Console.ReadLine();
}
"#START#(.*?)#END#" will match anything between #START# and #END# for you. RegexOptions.Singleline makes . match line breaks as well, so multi-line blocks are captured whole.
Result:
Variable 0 is:
1 -- Contact
Variable 1 is:
1 -- Error
2 -- Error
3 -- Error
Variable 2 is:
1 -- Actions
2 -- Actions
3 -- Actions
4 -- Actions
Variable 3 is:
1 -- Res
2 -- Res
3 -- Res
Variable 4 is:
WorkedWith
Variable 5 is:
3011
Variable 6 is:
1 -- Details
2 -- Details
3 -- Details
If you want to split each result into lines, you can call Split on the captured value:
int k = -1;
foreach (Match m in Regex.Matches(s, "#START#(.*?)#END#", RegexOptions.Singleline | RegexOptions.Compiled))
{
k++;
int k2 = -1;
Console.WriteLine("Variable " + k + ":");
foreach (var line in m.Groups[1].Value.Split('\n'))
{
Console.WriteLine("Line " + ++k2 + ": " + line);
}
Console.WriteLine();
}
Result:
Variable 0:
Line 0: 1 -- Contact
Variable 1:
Line 0: 1 -- Error
Line 1: 2 -- Error
Line 2: 3 -- Error
Variable 2:
Line 0: 1 -- Actions
Line 1: 2 -- Actions
Line 2: 3 -- Actions
Line 3: 4 -- Actions
Variable 3:
Line 0: 1 -- Res
Line 1: 2 -- Res
Line 2: 3 -- Res
Variable 4:
Line 0: WorkedWith
Variable 5:
Line 0: 3011
Variable 6:
Line 0: 1 -- Details
Line 1: 2 -- Details
Line 2: 3 -- Details
Edit:
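The code below is unnecessary and also wrong: ReadLine() strips the line terminator from every line, so concatenating the lines loses the newlines inside each block.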
string sInput = "";
using (var reader = new StreamReader(sTemplateFilePath))
{
while (!reader.EndOfStream)
{
var line = reader.ReadLine();
sInput = sInput + line;
}
reader.Close();
}
Change it to:
string sInput = File.ReadAllText(sTemplateFilePath);
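The concatenated output in the question (1 -- Error2 -- Error3 -- Error) follows from how ReadLine works: it returns each line without its terminator. Here is a small sketch that reproduces the effect with a StringReader instead of a file. The class and method names (ReadLineDemo, ConcatLines) are made up for illustration.

```csharp
using System;
using System.IO;
using System.Text;

public static class ReadLineDemo
{
    // Rebuilds the text the way the question's loop does.
    // ReadLine returns each line WITHOUT its "\r\n" or "\n" terminator,
    // so the line breaks disappear from the result.
    public static string ConcatLines(string text)
    {
        var sb = new StringBuilder();
        using (var reader = new StringReader(text))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                sb.Append(line);
            }
        }
        return sb.ToString();
    }
}
```

ConcatLines("1 -- Error\r\n2 -- Error") gives "1 -- Error2 -- Error". File.ReadAllText keeps the line breaks, so the regex can capture them and Split('\n') has something to split on.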
EDIT
@Sakura I need to assign each Regex match to a different variable. So
the first match goes into Variable1, the second match goes in
Variable2, the third match goes in Variable3. Does that make sense?
Is this what you need?
int k = 0;
foreach (Match m in Regex.Matches(sInput, "#START#(.*?)#END#", RegexOptions.Singleline | RegexOptions.Compiled))
{
k++;
switch (k)
{
case 1:
var1 = m.Groups[1].Value;
break;
case 2:
//var2...
break;
}
foreach (var line in m.Groups[1].Value.Split('\n'))
{
switch (iLineCount)
{
}
}
}

Use a CSV file; the format is made for what you are trying to do. If you don't want to use commas, most CSV readers and writers let you specify a different delimiter.
You can use rows to separate the multiple values you currently have between the custom delimiters in your post. I apologize if I missed something.

Write your own parser. It's really simple. Here I'm making an assumption that #START# and #END# are each on its own line (you can enforce it with search&replace or with C# code)
private List<List<string>> parseData(string data)
{
    List<List<string>> allValues = new List<List<string>>();
    List<string> currentValues = null;
    // Assume that each line has only one entry; accept both \r\n and \n line endings
    foreach (var line in data.Split(new [] {"\r\n", "\n"}, StringSplitOptions.RemoveEmptyEntries))
    {
        if (line == "#START#")
        {
            currentValues = new List<string>();
        }
        else if (line == "#END#" && currentValues != null)
        {
            allValues.Add(currentValues);
            currentValues = null;
        }
        else if (currentValues != null) // ignore text outside a #START#/#END# pair
        {
            currentValues.Add(line);
        }
    }
    return allValues;
}
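In the question's template the tags share a line with the data, so the "tags on their own line" assumption would need a preprocessing step. As an alternative, here is a minimal sketch that scans for the tags with IndexOf instead. The class name TagParser and its Parse method are hypothetical, and unterminated sections are simply ignored.

```csharp
using System;
using System.Collections.Generic;

public static class TagParser
{
    // Hypothetical variant: finds #START#/#END# with IndexOf, so the tags
    // may sit on the same line as the data (as in the question's template).
    public static List<string[]> Parse(string data)
    {
        const string startTag = "#START#";
        const string endTag = "#END#";
        var sections = new List<string[]>();
        int pos = 0;
        while (true)
        {
            int start = data.IndexOf(startTag, pos, StringComparison.Ordinal);
            if (start < 0) break;                 // no more sections
            start += startTag.Length;
            int end = data.IndexOf(endTag, start, StringComparison.Ordinal);
            if (end < 0) break;                   // unterminated section: ignore it
            string body = data.Substring(start, end - start);
            // Split the section into lines, accepting Windows and Unix endings.
            sections.Add(body.Split(new[] { "\r\n", "\n" }, StringSplitOptions.None));
            pos = end + endTag.Length;
        }
        return sections;
    }
}
```

Each element of the returned list is one section, already split into its lines, so the first section of the question's file would be a one-element array and the second a three-element array.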

Contrary to the other answers pointing towards regex or writing your own parser, I'd like to suggest using FileHelpers library.
Reading a delimited file would sort of look like this; first define a class matching a single file record:
[DelimitedRecord("|")]
public class Orders
{
public int OrderID;
public string CustomerID;
[FieldConverter(ConverterKind.Date, "ddMMyyyy")]
public DateTime OrderDate;
[FieldConverter(ConverterKind.Decimal, ".")] // The decimal separator is .
public decimal Freight;
}
Reading the file:
var engine = new FileHelperEngine<Orders>();
var records = engine.ReadFile("Input.txt");
foreach (var record in records)
{
Console.WriteLine(record.CustomerID);
Console.WriteLine(record.OrderDate.ToString("dd/MM/yyyy"));
Console.WriteLine(record.Freight);
}

I would probably use the Regex class with a capture group to get the content between the #START# and #END# delimiters. I'm guessing that you want to discard any other text. The regular expression would look something like:
#START#(.*?)#END#
The capture group (group 1) is indicated by the parentheses and will contain the delimited text. Use RegexOptions.Singleline so that . also matches the line breaks inside a block. You can iterate through the content by loading it into a string buffer and applying this regex repeatedly, terminating when there are no remaining matches.
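The iteration described above could be sketched roughly as follows, using Regex.Match and Match.NextMatch. The class and method names (DelimitedReader, Extract) are assumptions for illustration, and the pattern uses the #START# tag from the question's file.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class DelimitedReader
{
    // Returns the text of every #START#...#END# block, in order of appearance.
    public static List<string> Extract(string buffer)
    {
        var values = new List<string>();
        // Singleline is needed so that '.' also matches newlines inside a block.
        Match m = Regex.Match(buffer, "#START#(.*?)#END#", RegexOptions.Singleline);
        while (m.Success)
        {
            values.Add(m.Groups[1].Value);   // capture group 1: the delimited text
            m = m.NextMatch();               // NextMatch returns a new Match; reassign it
        }
        return values;
    }
}
```

The loop ends when NextMatch returns an unsuccessful match. Text outside the delimiters is never captured, so it is discarded implicitly.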

Related

TakeWhile() with conditions in multiple lines

I have a text file with a few thousand lines of text. A ReadLines() function reads each line of the file and yield returns the lines.
I need to skip the lines until a condition is met, so it's pretty easy like this:
var lines = ReadLines().SkipWhile(x => !x.Text.Contains("ABC Header"))
The problem I am trying to solve is there is another condition -- once I find the line with "ABC Header", the following line must contain "XYZ Detail". Basically, the file contains multiple lines that have the text "ABC Header" in it, but not all of these lines are followed by "XYZ Detail". I only need those where both of these lines are present together.
How do I do this? I tried to add .Where after the SkipWhile, but that doesn't guarantee the "XYZ Detail" line is immediately following the "ABC Header" line. Thanks!
// See https://aka.ms/new-console-template for more information
using Experiment72045808;
string? candidate = null;
List<(string, string)> results = new();
foreach (var entry in FileReader.ReadLines())
{
if (entry is null) continue;
if (candidate is null)
{
if (entry.Contains("ABC Header"))
{
candidate = entry;
}
}
else
{
// This handles two adjacent "ABC Header" lines.
if (entry.Contains("ABC Header"))
{
candidate = entry;
continue;
}
// Add to the result set if the detail line follows directly.
if (entry.Contains("XYZ Detail"))
{
results.Add((candidate, entry));
}
// Reset either way: only the line immediately after the header counts.
candidate = null;
}
}
Console.WriteLine("Found results:");
foreach (var result in results)
{
Console.WriteLine($"{result.Item1} / {result.Item2}");
}
resulted in output:
Found results:
ABC Header 2 / XYZ Detail 2
ABC Header 3 / XYZ Detail 3
For input File
Line 1
Line 2
ABC Header 1
Line 4
Line 5
ABC Header 2
XYZ Detail 2
Line 6
Line7
ABC Header 3
XYZ Detail 3
Line 8
Line 9
FileReader.ReadLines() in my test is implemented nearly identically to yours.
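If a LINQ-style solution is preferred, one option is to pair every line with its successor and filter the pairs. This is only a sketch: it assumes the lines are plain strings (the question's items expose a Text property instead), and AdjacentPairs and Find are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class AdjacentPairs
{
    // Returns every "ABC Header" line that is immediately followed by an "XYZ Detail" line.
    public static List<(string Header, string Detail)> Find(IEnumerable<string> source)
    {
        // Materialize once: Zip + Skip would otherwise enumerate (re-read) the source twice.
        List<string> lines = source.ToList();
        return lines
            .Zip(lines.Skip(1), (current, next) => (Header: current, Detail: next))
            .Where(p => p.Header.Contains("ABC Header") && p.Detail.Contains("XYZ Detail"))
            .ToList();
    }
}
```

Materializing a few thousand lines is cheap. For very large files the single-pass loop with a candidate variable avoids holding the whole file in memory.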

How do I split a string containing a numbered list into multiple strings?

I’m trying to take a string which contains a numbered list with a data item next to each number, and split it into multiple strings based on the data next to each number. I tried using regex but this caused some issues as some of the data are monetary values such as £120,000.00.
The example data is
Mr Test Test
£100,000.00
5 Test Road, Test Street
Test Input:
string testInput = "1. Mr Test Test 2. £100,000 3. 5 Test Road";
The numbered list may appear on separate lines in the string though, as the string is pulled from a PDF using PdfTextStripper in PDFBox
Is there a way I could accurately split this?
I was originally using indexof and relying on the next data item title as the stopping point, but the data is not always the same (15 points on one, 25 on another).
Desired result is:
string Name = "Mr Test Test";
string Money = "£100,000.00";
string Address = "5 Test Road, Test Street";
Any help would be greatly appreciated
Let's implement a simple generator; we can just find line after line in a loop:
Code:
private static IEnumerable<string> ParseListToLines(string value) {
int start = 0;
bool first = true;
for (int index = 1; ; ++index) {
string toFind = $"{index}.";
int next = value.IndexOf(toFind, start);
if (next < 0) {
yield return value.Substring(start);
break;
}
if (!first) // we don't return text before the 1st item
yield return value.Substring(start, next - start);
first = false;
start = next + toFind.Length;
}
}
Demo:
string testInput = "1. Mr Test Test 2. £100,000 3. 5 Test Road";
string[] lines = ParseListToLines(testInput).ToArray();
// string Name = lines[0].Trim();
// string Money = lines[1].Trim();
// string Address = lines[2].Trim();
Console.Write(string.Join(Environment.NewLine, lines));
Outcome:
Mr Test Test
£100,000
5 Test Road
Edit: a more elaborate test (line breaks inside and between items; one- and two-digit markers; text, **, before the initial 1. marker):
// Let's build multiline string...
string testInput = "**\r\n" + string.Join(Environment.NewLine, Enumerable
.Range(1, 12)
.Select(i => $"{i,2}.String #{i} {(i < 3 ? "\r\n Next Line" : "")}"));
Console.WriteLine("Initial:");
Console.WriteLine();
Console.WriteLine(testInput);
Console.WriteLine();
Console.WriteLine("Parsed:");
Console.WriteLine();
// ... and parse it into lines
string[] lines = ParseListToLines(testInput)
.Select(line => line.Trim())
.Select((item, index) => $"line number {index} = \"{item}\"")
.ToArray();
Console.WriteLine(string.Join(Environment.NewLine, lines));
Outcome:
Initial:
** // This text - before initial "1." marker should be ignored
1.String #1 // 1st item contains multiline text
Next Line // 1st item continuation
2.String #2 // Again, multiline marker
Next Line
3.String #3
4.String #4
5.String #5
6.String #6
7.String #7
8.String #8
9.String #9
10.String #10 // two digit markers: "10.", "11.", "12."
11.String #11
12.String #12
Parsed:
line number 0 = "String #1 // 1st item is multiline one
Next Line"
line number 1 = "String #2 // 2nd item is multiline as well
Next Line"
line number 2 = "String #3"
line number 3 = "String #4"
line number 4 = "String #5"
line number 5 = "String #6"
line number 6 = "String #7"
line number 7 = "String #8"
line number 8 = "String #9"
line number 9 = "String #10"
line number 10 = "String #11"
line number 11 = "String #12"
Edit 2: Well, let's try yet another test:
string testInput =
"1. test 5. wrong 2. It's Correct 3. OK 4. 1. 2. 3. - all wrong 5. Corect Now;";
string[] report = ParseListToLines(testInput)
.Select(line => line.Trim())
.ToArray();
Console.Write(string.Join(Environment.NewLine, report));
Outcome:
test 5. wrong
It's Correct
OK
1. 2. 3. - all wrong
Corect Now;

Why does my reg ex not capture 2nd and subsequent lines?

Update
I tried adding RegexOptions.Singleline to my regex options. It worked in that it captured the lines that weren't previously captured, but it put the entire text file into the first match instead of creating one match per date as desired.
End of Update
Update #2
Added new output showing matches and groups when using Poul Bak's modification. See screen shot below titled Output from Poul Bak's modification
End of Update #2
Final Update
Updating the target framework from 4.6.1 to 4.7.1 and tweaking Poul Bak's reg ex a little bit solved all problems. See Poul Bak's answer below
End of Final Update
Original Question: Background
I have the following text file test_text.txt:
2018-10-16 12:00:01 - Error 1<CR><LF>
Error 1 text line 1<CR><LF>
Error 1 text line 2<CR><LF>
2018-10-16 12:00:02 AM - Error 2<CR><LF>
Error 2 text line 1<CR><LF>
Error 2 text line 2<CR><LF>
Error 2 text line 3<CR><LF>
Error 2 text line 4<CR><LF>
2018-10-16 12:00:03 PM - Error 3
Objective
My objective is to have each match be comprised of 3 named groups: Date, Delim, and Text as shown below.
Note: apostrophes used only to denote limits of matched text.
Matches I expect to see:
Match 1: '2018-10-16 12:00:01 - Error 1<CR><LF>'
Date group = '2018-10-16 12:00:01'
Delim group = ' - '
Text group = 'Error 1<CR><LF>Error 1 text line 1<CR><LF>Error 1 text line 2<CR><LF>'
Match 2: '2018-10-16 12:00:02 AM - Error 2<CR><LF>'
Date group = '2018-10-16 12:00:02 AM'
Delim group = ' - '
Text group = 'Error 2 text line 1<CR><LF>Error 2 text line 2<CR><LF>Error 2 text line 3<CR><LF>Error 2 text line 4<CR><LF>'
Match 3: `2018-10-16 12:00:03 PM - Error 3`
Date group = '2018-10-16 12:00:03 PM'
Delim group = ' - '
Text group = 'Error 3'
The problem
My regex is not working in that 2nd and subsequent lines of text (e.g., 'Error 1 text line 1', 'Error 2 text line 1') are not being captured. I expect them to be captured because I'm using the Multiline option.
How do I modify my regex to capture 2nd and subsequent lines of text?
Current code
using System;
using System.Text.RegularExpressions;
namespace ConsoleApp_RegEx
{
class Program
{
static void Main(string[] args)
{
string text = System.IO.File.ReadAllText(@"C:\Users\bill\Desktop\test_text.txt");
string pattern = @"(?<Date>\d{4}-\d{2}-\d{2}\s{1}\d{2}:\d{2}:\d{2}.*)(?<Delim>\s-\s)(?<Text>.*\n|.*)";
RegexOptions regexOptions = (RegexOptions.IgnoreCase | RegexOptions.Multiline | RegexOptions.CultureInvariant | RegexOptions.IgnorePatternWhitespace | RegexOptions.Compiled);
Regex rx = new Regex(pattern, regexOptions);
MatchCollection ms = rx.Matches(text);
// Find matches.
MatchCollection matches = rx.Matches(text);
Console.WriteLine("Input Text\n--------------------\n{0}\n--------------------\n", text);
// Report the number of matches found.
Console.WriteLine("Output ({0} matches found)\n--------------------\n", matches.Count);
int m = 1;
// Report on each match.
foreach (Match match in matches)
{
Console.WriteLine("Match #{0}: ", m++, match.Value);
int g = 1;
GroupCollection groups = match.Groups;
foreach (Group group in groups)
{
Console.WriteLine(" Group #{0} {1}", g++, group.Value);
}
Console.WriteLine();
}
Console.Read();
}
}
}
Current Output
Output from Poul Bak's modification (on the right track, but not quite there yet)
@"(?<Date>\d{4}-\d{2}-\d{2}\s{1}\d{2}:\d{2}:\d{2}(?:\s\w\w)?)(?<Delim>\s-\s)(?<Text>([\s\S](?!\d{4}))*)"
You can use the following regex, modified from yours:
@"(?<Date>\d{4}-\d{2}-\d{2}\s{1}\d{2}:\d{2}:\d{2}(?:\s\w\w)?)(?<Delim>\s-\s)(?<Text>([\s\S](?!\d{4}))*)"
I have changed the 'Date' Group so it accepts 'AM' or 'PM' (otherwise it will only match the first).
Then I have changed the 'Text' Group, so it matches any number of any char (including Newlines) until it looks forward and finds a new date.
Edit:
I don't understand what you mean when you say 'AM' and 'PM' are not matched; they are part of the 'Date' Group. I assume you want them to be part of the 'Delim' Group, so I have moved the check to that Group.
I have also changed a Group to a non capturing Group.
The new regex:
@"(?<Date>\d{4}-\d{2}-\d{2}\s{1}\d{2}:\d{2}:\d{2})(?<Delim>(?:\s\w\w)?\s-\s)(?<Text>(?:[\s\S](?!\d{4}))*)"
BTW: You should change your code for checking Groups, like this:
foreach (Group group in groups)
{
Console.WriteLine(" Group #{0} {1}", group.Name, group.Value);
}
Then you will see your named Groups by Name and Value. When you have named Groups, there's no need for accessing by index.
Edit 2:
About 'group.Name': I had mistakenly used 'Group' (capitalized), it should be: 'group.Name'.
This is what the regex look like now:
@"(?<Date>\d{4}-\d{2}-\d{2}\s{1}\d{2}:\d{2}:\d{2}(?:\s\w\w)?)(?<Delim>\s-\s)(?<Text>(?:[\s\S](?!\d{4}))*)"
I suggest you set the 'RegexOptions.ExplicitCapture' flag, then you only get named groups.
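Putting the final pattern and the ExplicitCapture flag together might look like the sketch below. The LogParser class and its Parse method are hypothetical, and the input is assumed to be already loaded into a string. Note that the lookahead (?!\d{4}) ends the Text group before any run of four digits, not only before a date, so error text containing a year or a four-digit code would be cut short.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class LogParser
{
    // ExplicitCapture: only the named groups (Date, Delim, Text) are captured.
    private static readonly Regex EntryRegex = new Regex(
        @"(?<Date>\d{4}-\d{2}-\d{2}\s{1}\d{2}:\d{2}:\d{2}(?:\s\w\w)?)(?<Delim>\s-\s)(?<Text>(?:[\s\S](?!\d{4}))*)",
        RegexOptions.ExplicitCapture | RegexOptions.CultureInvariant);

    public static List<(string Date, string Text)> Parse(string log)
    {
        var entries = new List<(string Date, string Text)>();
        foreach (Match match in EntryRegex.Matches(log))
        {
            // Trim removes any line-break characters left at the end of an entry.
            entries.Add((match.Groups["Date"].Value, match.Groups["Text"].Value.Trim()));
        }
        return entries;
    }
}
```

Accessing the groups by name keeps the calling code independent of the group order in the pattern.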

How to remove pieces of data from string

I have a text file with multiple entries of this format:
Page: 1 of 1
Report Date: January 15 2018
Mr. Gerald M. Abridge ID #: 0000008 1 Route 81 Mr. Gerald Michael Abridge Pittaburgh PA 15668 SSN: XXX-XX-XXXX
Birthdate: 01/00/1998 Sex: M
COURSE Course Title CRD GRD GRDPT COURSE Course Title CRD GRD GRDPT
FALL 2017 (08/28/2017 to 12/14/2017) CS102F FUND. OF IT & COMPUTING 4.00 A 16.00 CS110 C++ PROGRAMMING I 3.00 A- 11.10 EL102 LANGUAGE AND RHETORIC 3.00 B+ 9.90 MA109 CALC WITH APPLICATIONS I 4.00 A 16.00 SP203 INTERMEDIATE SPANISH I 3.00 A 12.00
EHRS QHRS QPTS GPA Term 17.00 17.00 65.00 3.824 Cum 17.00 17.00 65.00 3.824
Current Program(s): Bachelor of Science in Computer Science
End of official record.
So far, I have read the text file into a string, full. I want to be able to remove first two lines of each of the entries. How would I go about doing this?
Here's the code that I used to read it in:
using (StreamReader sr = new StreamReader(fileName, Encoding.Default))
{
string full = sr.ReadToEnd();
}
If all the lines you want to skip begin with the same strings, you can put those prefixes in a list and then, when you're reading the lines, skip any that begin with one of the prefixes. This will leave you with a list of strings that represent all the file lines that don't begin with one of the specified prefixes:
var filePath = @"f:\public\temp\temp.txt";
var ignorePrefixes = new List<string> {"Page:", "Report Date:"};
var filteredContent = File.ReadAllLines(filePath)
.Where(line => ignorePrefixes.All(prefix => !line.StartsWith(prefix)))
.ToList();
If you want all the content as a single string, you can use String.Join:
var filteredAsString = string.Join(Environment.NewLine, filteredContent);
If Linq isn't your thing, or you don't understand what it's doing, here's the "old school" way of doing the same thing:
List<string> filtered = new List<string>();
foreach (string line in File.ReadLines(filePath))
{
bool okToAdd = true;
foreach (string prefix in ignorePrefixes)
{
if (line.StartsWith(prefix))
{
okToAdd = false;
break;
}
}
if (okToAdd)
{
filtered.Add(line);
}
}
public static IEnumerable<string> ReadReportFile(FileInfo file)
{
    string line;
    var page = "Page:";
    var date = "Report Date:";
    using (var reader = File.OpenText(file.FullName))
    {
        while ((line = reader.ReadLine()) != null)
        {
            // Yield only lines that contain neither marker.
            if (line.IndexOf(page) == -1 && line.IndexOf(date) == -1)
                yield return line;
        }
    }
}
The code is pretty straightforward: for each line that is not null and doesn't contain the page or date marker, return the line. You could condense it or get fancier by building lookups for your prefixes, but if the requirements are this simple, this should suffice.

C# Regular Expression: How to extract a collection

I have collection in text file:
(Collection
(Item "Name1" 1 2 3)
(Item "Simple name2" 1 2 3)
(Item "Just name 3" 4 5 6))
Collection also could be empty:
(Collection)
The number of items is undefined. It could be one item or one hundred. By previous extraction I already have inner text between Collection element:
(Item "Name1" 1 2 3)(Item "Simple name2" 1 2 3)(Item "Just name 3" 4 5 6)
In the case of empty collection it will be empty string.
How could I parse this collection using .Net Regular Expression?
I tried this:
string pattern = @"(\(Item\s""(?<Name>.*)""\s(?<Type>.*)\s(?<Length>.*)\s(?<Number>.*))*";
But the code above doesn't produce any real results.
UPDATE:
I tried to use regex differently:
foreach (Match match in Regex.Matches(document, pattern, RegexOptions.Singleline))
{
for (int i = 0; i < match.Groups["Name"].Captures.Count; i++)
{
Console.WriteLine(match.Groups["Name"].Captures[i].Value);
}
}
or
while (m.Success)
{
m.Groups["Name"].Value.Dump();
m.NextMatch();
}
Try
\(Item (?<part1>\".*?\")\s(?<part2>\d+)\s(?<part3>\d+)\s(?<part4>\d+)\)
this will create a collection of matches:
Regex regex = new Regex(
"\\(Item (?<part1>\\\".*?\\\")\\s(?<part2>\\d+)\\s(?<part3>\\d"+
"+)\\s(?<part4>\\d+)\\)",
RegexOptions.Multiline | RegexOptions.Compiled
);
//Capture all Matches in the InputText
MatchCollection ms = regex.Matches(InputText);
//Get the names of all the named and numbered capture groups
string[] GroupNames = regex.GetGroupNames();
// Get the numbers of all the named and numbered capture groups
int[] GroupNumbers = regex.GetGroupNumbers();
I think you might need to make your captures non-greedy...
(?<Name>.*?)
instead of
(?<Name>.*)
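The difference between the two quantifiers is easy to see on the inner text from the question. Below is a small sketch; the GreedyVsLazy class and its FirstName method are made-up names, and the pattern is reduced to just the quoted name for clarity.

```csharp
using System;
using System.Text.RegularExpressions;

public static class GreedyVsLazy
{
    // Returns the first quoted name found, using either the lazy or the greedy quantifier.
    public static string FirstName(string input, bool lazy)
    {
        // ".*" grabs as much as it can and backtracks to the LAST closing quote;
        // ".*?" grows one character at a time and stops at the FIRST closing quote.
        string pattern = lazy ? "\"(?<Name>.*?)\"" : "\"(?<Name>.*)\"";
        return Regex.Match(input, pattern).Groups["Name"].Value;
    }
}
```

For the input (Item "Name1" 1 2 3)(Item "Simple name2" 1 2 3), the lazy version returns Name1, while the greedy one returns Name1" 1 2 3)(Item "Simple name2, which is why the original pattern swallows several items at once.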
I think you should read the file and then use the String.Split function to split the collection and process the pieces:
string s = @"(Collection
(Item ""Name1"" 1 2 3)
(Item ""Simple name2"" 1 2 3)
(Item ""Just name 3"" 4 5 6))";
string[] collection = s.Split('(');
// collection[0] is the empty text before the first '(' and collection[1] is "Collection"
if (collection.Length > 1)
{
    for (int i = 1; i < collection.Length; i++)
    {
        // process the strings one by one and add '(' back if you need it
        // remove the trailing ')' from each item ('))' on the last one)
    }
}
This will resolve the issue easily; there is no need for the extra burden of a regular expression.
