I have a text file with a few thousand lines of text. A ReadLines() function reads the file and yield returns each line.
I need to skip the lines until a condition is met, so it's pretty easy like this:
var lines = ReadLines().SkipWhile(x => !x.Text.Contains("ABC Header"))
The problem I am trying to solve is there is another condition -- once I find the line with "ABC Header", the following line must contain "XYZ Detail". Basically, the file contains multiple lines that have the text "ABC Header" in it, but not all of these lines are followed by "XYZ Detail". I only need those where both of these lines are present together.
How do I do this? I tried to add .Where after the SkipWhile, but that doesn't guarantee the "XYZ Detail" line is immediately following the "ABC Header" line. Thanks!
// See https://aka.ms/new-console-template for more information
using Experiment72045808;
string? candidate = null;
List<(string, string)> results = new();
foreach (var entry in FileReader.ReadLines())
{
if (entry is null) continue;
if (candidate is null)
{
if (entry.Contains("ABC Header"))
{
candidate = entry;
}
}
else
{
// This handles two adjacent ABC Header lines.
if (entry.Contains("ABC Header"))
{
candidate = entry;
continue;
}
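// Add to the result set only if this line is the detail line.
if (entry.Contains("XYZ Detail"))
{
results.Add((candidate, entry));
}
// Reset either way, so a detail line that does not immediately follow the header is never paired.
candidate = null;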
}
}
Console.WriteLine("Found results:");
foreach (var result in results)
{
Console.WriteLine($"{result.Item1} / {result.Item2}");
}
resulted in output:
Found results:
ABC Header 2 / XYZ Detail 2
ABC Header 3 / XYZ Detail 3
For input File
Line 1
Line 2
ABC Header 1
Line 4
Line 5
ABC Header 2
XYZ Detail 2
Line 6
Line7
ABC Header 3
XYZ Detail 3
Line 8
Line 9
FileReader.ReadLines() in my test is implemented nearly identically to yours.
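The same "header immediately followed by detail" rule can also be expressed by looking at each line together with the line directly before it. The following is only a sketch under assumptions: Pairwise is a hypothetical helper (not part of LINQ), and the inline sample array stands in for whatever ReadLines() yields.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical sample data standing in for the lines yielded by ReadLines().
string[] sample =
{
    "Line 1", "ABC Header 1", "Line 4", "XYZ Detail stray",
    "ABC Header 2", "XYZ Detail 2", "Line 6",
};

// Keep only the pairs where a header line is directly followed by a detail line.
var pairs = Pairwise(sample)
    .Where(p => p.Previous.Contains("ABC Header") && p.Current.Contains("XYZ Detail"))
    .Select(p => (Header: p.Previous, Detail: p.Current))
    .ToList();

foreach (var pair in pairs)
    Console.WriteLine($"{pair.Header} / {pair.Detail}");

// Hypothetical helper: yields every line with its predecessor, enumerating the source only once.
static IEnumerable<(string Previous, string Current)> Pairwise(IEnumerable<string> source)
{
    using IEnumerator<string> e = source.GetEnumerator();
    if (!e.MoveNext())
        yield break;
    string previous = e.Current;
    while (e.MoveNext())
    {
        yield return (previous, e.Current);
        previous = e.Current;
    }
}
```

Because the helper reads the source a single time, it should also work on a lazily read file, where calling Skip(1) on a second enumeration would reopen the file.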
I’m trying to take a string which contains a numbered list with a data item next to each number, and split it into multiple strings, one per item. I tried using a regex, but that caused problems because some of the data are monetary values such as £120,000.00.
The example data is
Mr Test Test
£100,000.00
5 Test Road, Test Street
Test Input:
string testInput = "1. Mr Test Test 2. £100,000 3. 5 Test Road"
The numbered list may appear on separate lines in the string though, as the string is pulled from a PDF using PdfTextStripper in PDFBox
Is there a way I could accurately split this?
I was originally using indexof and relying on the next data item title as the stopping point, but the data is not always the same (15 points on one, 25 on another).
Desired result is:
string Name = "Mr Test Test";
string Money = "£100,000.00";
string Address = "5 Test Road, Test Street";
Any help would be greatly appreciated
Let's implement a simple generator; we can just find line after line in a loop:
Code:
private static IEnumerable<string> ParseListToLines(string value) {
int start = 0;
bool first = true;
for (int index = 1; ; ++index) {
string toFind = $"{index}.";
int next = value.IndexOf(toFind, start);
if (next < 0) {
yield return value.Substring(start);
break;
}
if (!first) // we don't return text before the 1st item
yield return value.Substring(start, next - start);
first = false;
start = next + toFind.Length;
}
}
Demo:
string testInput = "1. Mr Test Test 2. £100,000 3. 5 Test Road";
string[] lines = ParseListToLines(testInput).ToArray();
// string Name = lines[0].Trim();
// string Money = lines[1].Trim();
// string Address = lines[2].Trim();
Console.Write(string.Join(Environment.NewLine, lines));
Outcome:
Mr Test Test
£100,000
5 Test Road
Edit: a more elaborate demo (line breaks inside and between items; one- and two-digit markers; some text, **, before the initial 1. marker):
// Let's build multiline string...
string testInput = "**\r\n" + string.Join(Environment.NewLine, Enumerable
.Range(1, 12)
.Select(i => $"{i,2}.String #{i} {(i < 3 ? "\r\n Next Line" : "")}"));
Console.WriteLine("Initial:");
Console.WriteLine();
Console.WriteLine(testInput);
Console.WriteLine();
Console.WriteLine("Parsed:");
Console.WriteLine();
// ... and parse it into lines
string[] lines = ParseListToLines(testInput)
.Select(line => line.Trim())
.Select((item, index) => $"line number {index} = \"{item}\"")
.ToArray();
Console.WriteLine(string.Join(Environment.NewLine, lines));
Outcome:
Initial:
** // This text - before initial "1." marker should be ignored
1.String #1 // 1st item contains multiline text
Next Line // 1st item continuation
2.String #2 // Again, multiline marker
Next Line
3.String #3
4.String #4
5.String #5
6.String #6
7.String #7
8.String #8
9.String #9
10.String #10 // two digit markers: "10.", "11.", "12."
11.String #11
12.String #12
Parsed:
line number 0 = "String #1 // 1st item is multiline one
Next Line"
line number 1 = "String #2 // 2nd item is multiline as well
Next Line"
line number 2 = "String #3"
line number 3 = "String #4"
line number 4 = "String #5"
line number 5 = "String #6"
line number 6 = "String #7"
line number 7 = "String #8"
line number 8 = "String #9"
line number 9 = "String #10"
line number 10 = "String #11"
line number 11 = "String #12"
Edit 2: Well, let's try yet another test:
string testInput =
"1. test 5. wrong 2. It's Correct 3. OK 4. 1. 2. 3. - all wrong 5. Corect Now;";
string[] report = ParseListToLines(testInput)
.Select(line => line.Trim())
.ToArray();
Console.Write(string.Join(Environment.NewLine, report));
Outcome:
test 5. wrong
It's Correct
OK
1. 2. 3. - all wrong
Corect Now;
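One caveat with a plain IndexOf search is that the next marker text can also occur inside a value, for example "2." inside "£2.50". A possible way around this, sketched here under assumptions, is to accept a marker only when it starts the text or follows whitespace, and is itself followed by whitespace or the end of the text. ParseNumberedList is a hypothetical variant of ParseListToLines, and the input string is made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Made-up input: the first item contains "2." inside a monetary value.
string input = "1. Deposit £2.50 2. £100,000.00 3. 5 Test Road";

string[] items = ParseNumberedList(input).Select(s => s.Trim()).ToArray();
Console.WriteLine(string.Join(Environment.NewLine, items));

// Hypothetical variant: a marker "N." only counts when it stands on its own.
static IEnumerable<string> ParseNumberedList(string value)
{
    int itemStart = -1;   // start of the current item's text; -1 until "1." is found
    int searchFrom = 0;   // where to continue looking for the current marker
    int index = 1;        // number of the marker we are looking for

    while (true)
    {
        string marker = $"{index}.";
        int next = value.IndexOf(marker, searchFrom, StringComparison.Ordinal);
        if (next < 0)
            break;

        int end = next + marker.Length;
        bool boundaryBefore = next == 0 || char.IsWhiteSpace(value[next - 1]);
        bool boundaryAfter = end == value.Length || char.IsWhiteSpace(value[end]);
        if (!boundaryBefore || !boundaryAfter)
        {
            searchFrom = next + 1; // part of a value such as "£2.50"; keep looking
            continue;
        }

        if (itemStart >= 0)
            yield return value.Substring(itemStart, next - itemStart);

        itemStart = end;
        searchFrom = end;
        index++;
    }

    if (itemStart >= 0)
        yield return value.Substring(itemStart); // text after the last marker
}
```

The trade-off is that this variant no longer accepts markers glued to their text (such as "1.String #1" in the elaborate demo), so it only fits input where a space or line break follows each marker.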
I have a text file with multiple entries of this format:
Page: 1 of 1
Report Date: January 15 2018
Mr. Gerald M. Abridge ID #: 0000008 1 Route 81 Mr. Gerald Michael Abridge Pittaburgh PA 15668 SSN: XXX-XX-XXXX
Birthdate: 01/00/1998 Sex: M
COURSE Course Title CRD GRD GRDPT COURSE Course Title CRD GRD GRDPT
FALL 2017 (08/28/2017 to 12/14/2017) CS102F FUND. OF IT & COMPUTING 4.00 A 16.00 CS110 C++ PROGRAMMING I 3.00 A- 11.10 EL102 LANGUAGE AND RHETORIC 3.00 B+ 9.90 MA109 CALC WITH APPLICATIONS I 4.00 A 16.00 SP203 INTERMEDIATE SPANISH I 3.00 A 12.00
EHRS QHRS QPTS GPA Term 17.00 17.00 65.00 3.824 Cum 17.00 17.00 65.00 3.824
Current Program(s): Bachelor of Science in Computer Science
End of official record.
So far, I have read the text file into a string, full. I want to be able to remove first two lines of each of the entries. How would I go about doing this?
Here's the code that I used to read it in:
using (StreamReader sr = new StreamReader(fileName, Encoding.Default))
{
string full = sr.ReadToEnd();
}
If all the lines you want to skip begin with the same strings, you can put those prefixes in a list and then, when reading the lines, skip any that begin with one of the prefixes. This leaves you with a list of strings representing all the file lines that don't begin with one of the specified prefixes:
var filePath = @"f:\public\temp\temp.txt";
var ignorePrefixes = new List<string> {"Page:", "Report Date:"};
var filteredContent = File.ReadAllLines(filePath)
.Where(line => ignorePrefixes.All(prefix => !line.StartsWith(prefix)))
.ToList();
If you want all the content as a single string, you can use String.Join:
var filteredAsString = string.Join(Environment.NewLine, filteredContent);
If Linq isn't your thing, or you don't understand what it's doing, here's the "old school" way of doing the same thing:
List<string> filtered = new List<string>();
foreach (string line in File.ReadLines(filePath))
{
bool okToAdd = true;
foreach (string prefix in ignorePrefixes)
{
if (line.StartsWith(prefix))
{
okToAdd = false;
break;
}
}
if (okToAdd)
{
filtered.Add(line);
}
}
public static IEnumerable<string> ReadReportFile(FileInfo file)
{
var line = String.Empty;
var page = "Page:";
var date = "Report Date:";
using(var reader = File.OpenText(file.FullName))
while((line = reader.ReadLine()) != null)
if(line.IndexOf(page) == -1 && line.IndexOf(date) == -1)
yield return line;
}
The code is pretty straightforward: for each line that is not null and contains neither the page nor the date marker, yield the line. You could condense it or get fancier, for example by building a lookup for your prefixes, but if the requirement is this simple, this should suffice.
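If the header lines cannot be recognised by a fixed prefix, another option is to drop the first two lines of each entry by position. This sketch assumes that every entry ends with the line "End of official record." and that the next entry starts on the very next line; SkipEntryHeaders is a hypothetical helper, and the inline array stands in for the file's lines.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Inline stand-in for File.ReadLines(fileName), using lines from the sample entry.
string[] fileLines =
{
    "Page: 1 of 1", "Report Date: January 15 2018",
    "Birthdate: 01/00/1998 Sex: M", "End of official record.",
    "Page: 1 of 1", "Report Date: January 15 2018",
    "Birthdate: 01/00/1998 Sex: M", "End of official record.",
};

List<string> kept = SkipEntryHeaders(fileLines, 2, "End of official record.").ToList();
Console.WriteLine(string.Join(Environment.NewLine, kept));

// Hypothetical helper: skips the first headerLines lines of every entry.
// Assumes each entry is terminated by a line starting with endMarker.
static IEnumerable<string> SkipEntryHeaders(IEnumerable<string> lines, int headerLines, string endMarker)
{
    int positionInEntry = 0;
    foreach (string line in lines)
    {
        if (positionInEntry >= headerLines)
            yield return line;

        positionInEntry++;
        if (line.StartsWith(endMarker, StringComparison.Ordinal))
            positionInEntry = 0; // the next line starts a new entry
    }
}
```

Blank lines between entries would count as header lines under this assumption, so they would need to be filtered out first.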
I have a program that reads text files, and I want it to collect data after a certain title in the text file, in this case [HRData]. Once the StreamReader reaches [HRData], I want it to read every line after that and store each line in a list, while still allowing me to access the separate numbers.
The text file is like so:
[HRZones]
190
175
162
152
143
133
0
0
0
0
0
[SwapTimes]
[Trip]
250
0
3978
309
313
229
504
651
//n header
[HRData]
91 154 70 309 83 6451
91 154 70 309 83 6451
92 160 75 309 87 5687
94 173 80 309 87 5687
96 187 87 309 95 4662
100 190 93 309 123 4407
101 192 97 309 141 4915
103 191 98 309 145 5429
So, referring to the text file, I want it to store the first line after [HRData] and allow me to access each value, for example 91 being [0].
I have code that already stores to a list if the word matches the regex, but I do not know how to code it to read after a specific string like [HRData].
if (squareBrackets.Match(line).Success) {
titles.Add(line);
if (textAfterTitles.Match(line).Success) {
textaftertitles.Add(line);
}
}
This is my attempt so far:
if (line.Contains("[HRData]")) {
inttimes = true;
MessageBox.Show("HRDATA Found");
if (inttimes == true) {
while (null != (line = streamReader.ReadLine())) {
//ADD LINE AND BREAK UP INTO PARTS S
}
}
}
You can call the LINQ-friendly method File.ReadLines, then use LINQ to get the part you want:
List<string> numbers = File.ReadLines("data.txt")
.SkipWhile(line => line != "[HRData]")
.Skip(1)
.SelectMany(line => line.Split())
.ToList();
Console.WriteLine(numbers[0]); // 91
Edit: this will give you all the numbers in one List<string>. If you want to keep the numbers grouped by line, use Select instead of SelectMany:
List<List<string>> listsOfNums = File.ReadLines("data.txt")
.SkipWhile(line => line != "[HRData]")
.Skip(1)
.Select(line => line.Split().ToList())
.ToList();
Note that this requires an additional index to get a single number:
Console.WriteLine(listsOfNums[0][0]); // 91
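The queries above read to the end of the file and keep the numbers as strings. A possible refinement, shown only as a sketch, stops at the next section header with TakeWhile and converts each row to integers. The inline array stands in for File.ReadLines("data.txt"), and the trailing [Extra] section is a made-up addition to show where reading stops.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Inline stand-in for File.ReadLines("data.txt"); "[Extra]" is a hypothetical later section.
string[] fileLines =
{
    "[Trip]", "250", "0",
    "[HRData]",
    "91 154 70 309 83 6451",
    "92 160 75 309 87 5687",
    "[Extra]", "1 2 3",
};

List<int[]> rows = fileLines
    .SkipWhile(line => line.Trim() != "[HRData]")   // find the section header
    .Skip(1)                                        // skip the header itself
    .TakeWhile(line => !line.TrimStart().StartsWith("[", StringComparison.Ordinal)) // stop at the next section
    .Where(line => !string.IsNullOrWhiteSpace(line))
    .Select(line => line
        .Split(new[] { ' ', '\t' }, StringSplitOptions.RemoveEmptyEntries)
        .Select(token => int.Parse(token))
        .ToArray())
    .ToList();

Console.WriteLine(rows[0][0]); // 91
```

int.Parse throws on a non-numeric token, so this version assumes the section contains only whitespace-separated integers.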
You could use a variable to track the current section:
var list = new List<int[]>();
using (StreamReader streamReader = ...)
{
string line;
string sectionName = null;
while (null != (line = streamReader.ReadLine()))
{
var sectionMatch = Regex.Match(line, @"\s*\[\s*(?<NAME>[^\]]+)\s*\]\s*");
if (sectionMatch.Success)
{
sectionName = sectionMatch.Groups["NAME"].Value;
}
else if (sectionName == "HRData")
{
// You can process lines inside the `HRData` section here.
// Getting the numbers in the line, and adding to the list, one array for each line.
var nums = Regex.Matches(line, @"\d+")
.Cast<Match>()
.Select(m => m.Value)
.Select(int.Parse)
.ToArray();
list.Add(nums);
}
}
}
Presuming your current code attempt works, which I have not gone through to verify...
You could simply do the following:
List<int> elements = new List<int>();
while (null != (line = streamReader.ReadLine()))
{
if(line.Contains("["))
{
//Prevent reading in the next section
break;
}
string[] split = line.Split(Convert.ToChar(" "));
//Each element in split will be each number on each line.
for(int i=0;i<split.Length;i++)
{
elements.Add(Convert.ToInt32(split[i]));
}
}
Alternatively, if you want a 2 dimensional list, such that you can reference the numbers by line, you could use a nested list. For each run of the outer loop, create a new list and add it to elements (elements would be List<List<int>>).
Edit
Just a note, be careful with the Convert.ToInt32() function. It should really be in a try catch statement just in case some text is read in that isn't numeric.
Edit
Ok.. to make the routine more robust (per my comment below):
First make sure the routine doesn't go beyond your block of numbers. I'm not sure what is beyond the block you listed, so that will be up to you, but it should take the following form:
if(line.Contains("[") || line.Contains("]") /* || any other marker that ends your block */)
{
break;
}
Next thing is pre-format your split values. Inside the for statement:
for(int i=0;i<split.Length;i++)
{
string val = split[i].Trim(); //Get rid of white space
val = val.Replace("\r\n",""); //Strip any stray line-ending characters.
val = val.Replace("\n","");
try
{
elements.Add(Convert.ToInt32(val));
}
catch (Exception ex)
{
string err = ex.Message;
//You might try formatting the split value even more here and retry convert
}
}
To access the individual numbers (presuming you are using a single dimension list) there are a couple ways to do this. If you want to access by index value:
elements.ElementAt(index)
if you want to iterate through the list of values:
foreach(int val in elements)
{
}
If you need to know exactly what line the value came from, I suggest a 2d list. It would be implemented as follows (I'm copying my code from the original code snippet, so assume all of the error checking is added!)
List<List<int>> elements = new List<List<int>>();
while (null != (line = streamReader.ReadLine()))
{
if(line.Contains("["))
{
//Prevent reading in the next section
break;
}
List<int> newLine = new List<int>();
string[] split = line.Split(Convert.ToChar(" "));
//Each element in split will be each number on each line.
for(int i=0;i<split.Length;i++)
{
newLine.Add(Convert.ToInt32(split[i]));
}
elements.Add(newLine);
}
Now to access each element by line:
foreach(var line in elements)
{
//line is a List<int>
int value = line.ElementAt(index); //grab element at index for the given line.
}
Alternatively, if you need to reference directly by line index, and column index
int value = elements.ElementAt(lineIndex).ElementAt(columnIndex);
Be careful with all of these direct index references. You could pretty easily get an index out of bounds issue.
One other thing: you should probably put a breakpoint on your Convert.ToInt32 statement and find out which string it is breaking on. If you can assume that the input data will be consistent, then finding exactly which string breaks the conversion will help you create a routine that handles the particular characters that are slipping in. My guess is that the method broke when it attempted to convert the last split value to an integer, because the line endings had not been removed.
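As an alternative to wrapping Convert.ToInt32 in try/catch, int.TryParse reports failure through its return value, and StringSplitOptions.RemoveEmptyEntries avoids empty tokens caused by repeated spaces. This is a sketch only; the sample line with its bad token is made up to show both paths.

```csharp
using System;
using System.Collections.Generic;

// Made-up line: one non-numeric token ("x70") and a double space.
string line = "91 154 x70 309  83 6451";

var values = new List<int>();
var rejected = new List<string>();

foreach (string token in line.Split(new[] { ' ', '\t' }, StringSplitOptions.RemoveEmptyEntries))
{
    if (int.TryParse(token, out int number))
        values.Add(number);   // token was a valid integer
    else
        rejected.Add(token);  // keep the bad token so it can be inspected later
}

Console.WriteLine(string.Join(", ", values));
Console.WriteLine("Rejected: " + string.Join(", ", rejected));
```

Collecting the rejected tokens also answers the "which string is it breaking on" question without needing a breakpoint.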
I have a program in which I want to take the information the user entered and store it in a template file of sorts, so it can be saved and reloaded easily. The template format looks like this:
#START#1 -- Contact#END#
#START#1 -- Error
2 -- Error
3 -- Error#END#
#START#1 -- Actions
2 -- Actions
3 -- Actions
4 -- Actions#END#
#START#1 -- Res
2 -- Res
3 -- Res#END#
#START#WorkedWith#END#
#START#3011#END#
#START#1 -- Details
2 -- Details
3 -- Details#END#
Everything between the #START# and #END# tags is a value that needs to be stored in a different variable.
For instance the first variable would need to contain
1 -- Contact
The second variable would need to contain
1 -- Error
2 -- Error
3 -- Error
And so on until the 7th variable contains the Details section.
What is the easiest way to go about reading the file and storing the data between the delimiters into variables?
Thanks in advance!
EDIT: For Sakura
Code:
string sInput = "";
using (var reader = new StreamReader(sTemplateFilePath))
{
while (!reader.EndOfStream)
{
var line = reader.ReadLine();
sInput = sInput + line;
}
reader.Close();
}
foreach (Match m in Regex.Matches(sInput, "#START#(.*?)#END#", RegexOptions.Singleline | RegexOptions.Compiled))
{
foreach (var line in m.Groups[1].Value.Split('\n'))
{
switch (iLineCount)
{
case 0:
sContactReason = line;
break;
case 1:
sError = line;
break;
case 2:
sActionsTaken = line;
break;
case 3:
sResolution = line;
break;
case 4:
sL3 = line;
break;
case 5:
sKB = line;
break;
case 6:
sDetails = line;
break;
}
iLineCount++;
}
}
Output:
1 -- Contact
1 -- Error2 -- Error3 -- Error
1 -- Actions2 -- Actions3 -- Actions4 -- Actions
1 -- Res2 -- Res3 -- Res
WorkedWith
3011
1 -- Details2 -- Details3 -- Details
static void Main()
{
string s = @"#START#1 -- Contact#END#
#START#1 -- Error
2 -- Error
3 -- Error#END#
#START#1 -- Actions
2 -- Actions
3 -- Actions
4 -- Actions#END#
#START#1 -- Res
2 -- Res
3 -- Res#END#
#START#WorkedWith#END#
#START#3011#END#
#START#1 -- Details
2 -- Details
3 -- Details#END#";
int k = -1;
foreach (Match m in Regex.Matches(s, "#START#(.*?)#END#", RegexOptions.Singleline | RegexOptions.Compiled))
{
Console.WriteLine("Variable " + ++k + " is:\n" + m.Groups[1].Value);
Console.WriteLine();
}
Console.ReadLine();
}
"#START#(.*?)#END#" will match anything between #START# and #END# for you.
Result:
Variable 0 is:
1 -- Contact
Variable 1 is:
1 -- Error
2 -- Error
3 -- Error
Variable 2 is:
1 -- Actions
2 -- Actions
3 -- Actions
4 -- Actions
Variable 3 is:
1 -- Res
2 -- Res
3 -- Res
Variable 4 is:
WorkedWith
Variable 5 is:
3011
Variable 6 is:
1 -- Details
2 -- Details
3 -- Details
If you want to split each result into lines, you can use Split on the captured value:
int k = -1;
foreach (Match m in Regex.Matches(s, "#START#(.*?)#END#", RegexOptions.Singleline | RegexOptions.Compiled))
{
k++;
int k2 = -1;
Console.WriteLine("Variable " + k + ":");
foreach (var line in m.Groups[1].Value.Split('\n'))
{
Console.WriteLine("Line " + ++k2 + ": " + line);
}
Console.WriteLine();
}
Result:
Variable 0:
Line 0: 1 -- Contact
Variable 1:
Line 0: 1 -- Error
Line 1: 2 -- Error
Line 2: 3 -- Error
Variable 2:
Line 0: 1 -- Actions
Line 1: 2 -- Actions
Line 2: 3 -- Actions
Line 3: 4 -- Actions
Variable 3:
Line 0: 1 -- Res
Line 1: 2 -- Res
Line 2: 3 -- Res
Variable 4:
Line 0: WorkedWith
Variable 5:
Line 0: 3011
Variable 6:
Line 0: 1 -- Details
Line 1: 2 -- Details
Line 2: 3 -- Details
Edit:
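The code below is wasteful and wrong: ReadLine() strips the line terminators, so concatenating the lines loses every line break, which is why your output runs the lines of each block together.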
string sInput = "";
using (var reader = new StreamReader(sTemplateFilePath))
{
while (!reader.EndOfStream)
{
var line = reader.ReadLine();
sInput = sInput + line;
}
reader.Close();
}
Change it to:
string sInput = File.ReadAllText(sTemplateFilePath);
EDIT
@Sakura I need to assign each Regex match to a different variable. So
the first match goes into Variable1, the second match goes in
Variable2, the third match goes in Variable3. Does that make sense?
Is this what you need?
int k = 0;
foreach (Match m in Regex.Matches(sInput, "#START#(.*?)#END#", RegexOptions.Singleline | RegexOptions.Compiled))
{
k++;
switch (k)
{
case 1:
var1 = m.Groups[1].Value;
break;
case 2:
//var2...
break;
}
foreach (var line in m.Groups[1].Value.Split('\n'))
{
switch (iLineCount)
{
}
}
}
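Instead of a switch with one case per match, the captured values can be collected into an array and assigned by position. This is a sketch under assumptions: the shortened inline string stands in for File.ReadAllText(sTemplateFilePath), and ValueAt is a hypothetical helper that returns an empty string when the template has fewer blocks than expected.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Shortened inline stand-in for File.ReadAllText(sTemplateFilePath).
string sInput =
    "#START#1 -- Contact#END#\n" +
    "#START#1 -- Error\n2 -- Error#END#\n" +
    "#START#1 -- Actions#END#";

// One array element per #START#...#END# block, line breaks inside a block preserved.
string[] values = Regex.Matches(sInput, "#START#(.*?)#END#", RegexOptions.Singleline)
    .Cast<Match>()
    .Select(m => m.Groups[1].Value)
    .ToArray();

// Hypothetical helper: avoids an index-out-of-range error for missing blocks.
string ValueAt(int index) => index < values.Length ? values[index] : "";

string sContactReason = ValueAt(0);
string sError = ValueAt(1);
string sActionsTaken = ValueAt(2);
string sResolution = ValueAt(3); // not present in the shortened sample, so empty

Console.WriteLine(sError);
```

With the full seven-block template the same pattern would continue with ValueAt(4) through ValueAt(6) for the remaining variables.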
Use a CSV file. They are made for exactly what you are trying to do. If you don't want to use commas, you can always specify a different delimiter.
You can use rows to separate multiples like you have in between the custom delimiters in your post. I apologize if I missed something.
Write your own parser. It's really simple. Here I'm making an assumption that #START# and #END# are each on its own line (you can enforce it with search&replace or with C# code)
private List<List<string>> parseData(string data)
{
List<List<string>> allValues = new List<List<string>>();
List<string> currentValues = null;
// Assume that each line has only one entry
foreach (var line in data.Split(new [] {"\r\n"}, StringSplitOptions.RemoveEmptyEntries))
{
if (line == "#START#")
{
currentValues = new List<string>();
}
else if (line == "#END#")
{
allValues.Add(currentValues);
}
else
{
currentValues.Add(line);
}
}
return allValues;
}
Contrary to the other answers pointing towards regex or writing your own parser, I'd like to suggest using FileHelpers library.
Reading a delimited file would sort of look like this; first define a class matching a single file record:
[DelimitedRecord("|")]
public class Orders
{
public int OrderID;
public string CustomerID;
[FieldConverter(ConverterKind.Date, "ddMMyyyy")]
public DateTime OrderDate;
[FieldConverter(ConverterKind.Decimal, ".")] // The decimal separator is .
public decimal Freight;
}
Reading the file:
var engine = new FileHelperEngine<Orders>();
var records = engine.ReadFile("Input.txt");
foreach (var record in records)
{
Console.WriteLine(record.CustomerID);
Console.WriteLine(record.OrderDate.ToString("dd/MM/yyyy"));
Console.WriteLine(record.Freight);
}
I would probably use the Regex class with a capture group to get the content between the #START# and #END# delimiters. I'm guessing that you want to discard any other text. The regular expression would look something like:
#START#(.*?)#END#
The capture group (group 1) is indicated by the parentheses and will contain the delimited text. You can load the content into a string and iterate over the matches of this regex until there are none left.
I have a collection in a text file:
(Collection
(Item "Name1" 1 2 3)
(Item "Simple name2" 1 2 3)
(Item "Just name 3" 4 5 6))
Collection also could be empty:
(Collection)
The number of items is undefined. It could be one item or one hundred. By previous extraction I already have inner text between Collection element:
(Item "Name1" 1 2 3)(Item "Simple name2" 1 2 3)(Item "Just name 3" 4 5 6)
In the case of empty collection it will be empty string.
How could I parse this collection using .Net Regular Expression?
I tried this:
string pattern = @"(\(Item\s""(?<Name>.*)""\s(?<Type>.*)\s(?<Length>.*)\s(?<Number>.*))*";
But the code above doesn't produce any real results.
UPDATE:
I tried to use regex differently:
foreach (Match match in Regex.Matches(document, pattern, RegexOptions.Singleline))
{
for (int i = 0; i < match.Groups["Name"].Captures.Count; i++)
{
Console.WriteLine(match.Groups["Name"].Captures[i].Value);
}
}
or
while (m.Success)
{
m.Groups["Name"].Value.Dump();
m.NextMatch();
}
Try
\(Item (?<part1>\".*?\")\s(?<part2>\d+)\s(?<part3>\d+)\s(?<part4>\d+)\)
this will create a collection of matches:
Regex regex = new Regex(
"\\(Item (?<part1>\\\".*?\\\")\\s(?<part2>\\d+)\\s(?<part3>\\d"+
"+)\\s(?<part4>\\d+)\\)",
RegexOptions.Multiline | RegexOptions.Compiled
);
//Capture all Matches in the InputText
MatchCollection ms = regex.Matches(InputText);
//Get the names of all the named and numbered capture groups
string[] GroupNames = regex.GetGroupNames();
// Get the numbers of all the named and numbered capture groups
int[] GroupNumbers = regex.GetGroupNumbers();
I think you might need to make your captures non-greedy...
(?<Name>.*?)
instead of
(?<Name>.*)
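Another way to keep the name capture from running past its closing quote is a negated character class, [^"]*, which cannot match a quote at all. The sketch below applies that idea to the inner text from the question, reusing the question's group names; treating Type, Length and Number as integers is an assumption based on the sample data.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Inner text between the Collection parentheses, as given in the question.
string inner = "(Item \"Name1\" 1 2 3)(Item \"Simple name2\" 1 2 3)(Item \"Just name 3\" 4 5 6)";

// One match per item; [^""]* stops the name at its closing quote.
var itemPattern = new Regex(
    @"\(Item\s+""(?<Name>[^""]*)""\s+(?<Type>\d+)\s+(?<Length>\d+)\s+(?<Number>\d+)\)");

var items = new List<(string Name, int Type, int Length, int Number)>();
foreach (Match match in itemPattern.Matches(inner))
{
    items.Add((
        match.Groups["Name"].Value,
        int.Parse(match.Groups["Type"].Value),
        int.Parse(match.Groups["Length"].Value),
        int.Parse(match.Groups["Number"].Value)));
}

foreach (var item in items)
    Console.WriteLine($"{item.Name}: {item.Type} {item.Length} {item.Number}");
```

For an empty collection the inner text is an empty string, Matches returns no matches, and the list simply stays empty.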
I think you should read the file and then use the String.Split function to split the collection and process the pieces:
string s = @"(Collection
(Item ""Name1"" 1 2 3)
(Item ""Simple name2"" 1 2 3)
(Item ""Just name 3"" 4 5 6))";
string[] colection = s.Split('(');
// colection[0] is empty and colection[1] is "Collection", so the items start at index 2
if(colection.Length>2)
{
for(int i=2;i<colection.Length;i++)
{
//process the strings one by one, adding back the ( if you need it
//and removing the trailing ) characters
}
}
This resolves the issue easily, without the extra burden of a regular expression.