Converting a file with an unspecified number of lines using regex in Visual C#

I have an app that converts a file by reading all lines from a source text file and printing only the lines that contain the word 'student'. It also removes some characters and splits each printed line into 5 fields, as shown below:
Input text file:
Form|01; 23_anna- Member 12569 is student - 12*01*2006
Form|02; 17_smith_ Member 12570 is teacher - 13*01*2007
Form|03; 12_ben_ Member 12571 is student - 14*01*2007
The output file:
Form01 anna 12569 student 12 01 2006
Form03 ben 12571 student 14 01 2007
The code I have tried:
private Regex find = new Regex(@"^(.+?)(?:\|)(\d+)(?:.+?_)(.+?)(?:[_-] Member ?)(\d+)(?:.+?)(student)(?:.+?)(\d\d).(\d\d).(\d\d\d\d)$", RegexOptions.Multiline);
private void MyButton_Click(object sender, EventArgs e)
{
string sample = "Form|01; 23_anna- Member 12569 is student - 12*01*2006\nForm|02; 17_smith_ Member 12570 is teacher - 13*01*2007\nForm|03; 12_ben_ Member 12571 is student - 14*01*2007";
MatchCollection matches = find.Matches(sample);
foreach (Match m in matches)
{
Console.WriteLine("{0}{1} {2} {3} is {4} {5} {6} {7}", m.Groups[1], m.Groups[2], m.Groups[3], m.Groups[4], m.Groups[5], m.Groups[6], m.Groups[7], m.Groups[8]);
}
Console.WriteLine();
}
But how can I change the code if I want to convert a file with more lines (about 500)?

In my opinion, the best way to do this is to read the file with File.ReadAllLines() and then apply your regex to each line in a foreach loop. I also think you are overcomplicating your regex, so I have made a few changes where I think it can be simplified.
I am working under the assumption that the format of the lines you are looking for is always the same. Since 'Form' and 'student' appear in all of these lines, I see little reason to capture them. In reality there are 6 key pieces of information to capture:
1 – the numbers after form
2 – the name
3 – the 5-digit member number
4,5,6 – the three sections of the date
Everything else is either constant or not used in the output string. So when we come to rewrite the search and replace we get something like:
/^\w+\|([^;]+).+?([a-z]+)[^\d]+(\d{5})[^\d]+(\d{2}).(\d{2}).(\d{4})/m
Console.WriteLine("Form{0} {1} {2} student {3} {4} {5}", m.Groups[1], m.Groups[2], m.Groups[3], m.Groups[4], m.Groups[5], m.Groups[6]);
Note that the regex makes several assumptions: the name is always in lower case, the member number is always 5 digits, and names never contain digits. It isn't optimal, but I think it is tidier than yours; that is personal preference, I guess.
To get only the lines with student, use string.Contains("student"), or, if you really want to include the check in your regex, I would recommend a positive lookahead for student: (?=.*student)
Here is a bit of example code I wrote for one way that I would do it:
var regex = new Regex(@"^\w+\|([^;]+).+?([a-z]+)[^\d]+(\d{5})[^\d]+(\d{2}).(\d{2}).(\d{4})$",RegexOptions.Multiline);
var file = File.ReadAllLines(@"C:temp\test.txt");
foreach(var line in file)
{
if (line.Contains("student"))
{
var m = regex.Match(line);
if (!m.Success) continue; // skip lines that do not fit the expected format
Console.WriteLine("Form{0} {1} {2} student {3} {4} {5}", m.Groups[1], m.Groups[2], m.Groups[3], m.Groups[4], m.Groups[5], m.Groups[6]);
}
}
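Since the question is about converting a whole file rather than printing to the console, here is a minimal sketch of how the loop above could be wrapped so that it writes the converted lines to an output file. The class name StudentFileConverter, the method names, and the two path parameters are made up for illustration; the pattern is the one from this answer, and File.ReadLines is used instead of File.ReadAllLines only so that the file is streamed line by line.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the original answer.
public static class StudentFileConverter
{
    // Same pattern as above; each line is matched on its own,
    // so RegexOptions.Multiline is not needed here.
    private static readonly Regex LinePattern = new Regex(
        @"^\w+\|([^;]+).+?([a-z]+)[^\d]+(\d{5})[^\d]+(\d{2}).(\d{2}).(\d{4})$");

    // Converts every line that contains "student" and fits the expected format.
    public static IEnumerable<string> ConvertLines(IEnumerable<string> lines)
    {
        foreach (string line in lines)
        {
            if (!line.Contains("student")) continue;

            Match m = LinePattern.Match(line);
            if (!m.Success) continue; // unexpected format: skip the line

            yield return string.Format("Form{0} {1} {2} student {3} {4} {5}",
                m.Groups[1].Value, m.Groups[2].Value, m.Groups[3].Value,
                m.Groups[4].Value, m.Groups[5].Value, m.Groups[6].Value);
        }
    }

    // inputPath and outputPath are placeholders and must be different files.
    public static void ConvertFile(string inputPath, string outputPath)
    {
        File.WriteAllLines(outputPath, ConvertLines(File.ReadLines(inputPath)));
    }
}
```

The number of lines does not matter with this approach: 3 lines or 500 lines go through the same loop.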

Related

Why does my reg ex not capture 2nd and subsequent lines?

Update
I tried adding RegexOptions.Singleline to my regex options. It worked in that it captured the lines that weren't previously captured, but it put the entire text file into the first match instead of creating one match per date as desired.
End of Update
Update #2
Added new output showing matches and groups when using Poul Bak's modification. See screen shot below titled Output from Poul Bak's modification
End of Update #2
Final Update
Updating the target framework from 4.6.1 to 4.7.1 and tweaking Poul Bak's reg ex a little bit solved all problems. See Poul Bak's answer below
End of Final Update
Original Question: Background
I have the following text file test_text.txt:
2018-10-16 12:00:01 - Error 1<CR><LF>
Error 1 text line 1<CR><LF>
Error 1 text line 2<CR><LF>
2018-10-16 12:00:02 AM - Error 2<CR><LF>
Error 2 text line 1<CR><LF>
Error 2 text line 2<CR><LF>
Error 2 text line 3<CR><LF>
Error 2 text line 4<CR><LF>
2018-10-16 12:00:03 PM - Error 3
Objective
My objective is to have each match be comprised of 3 named groups: Date, Delim, and Text as shown below.
Note: apostrophes used only to denote limits of matched text.
Matches I expect to see:
Match 1: '2018-10-16 12:00:01 - Error 1<CR><LF>'
Date group = '2018-10-16 12:00:01'
Delim group = ' - '
Text group = 'Error 1<CR><LF>Error 1 text line 1<CR><LF>Error 1 text line 2<CR><LF>'
Match 2: '2018-10-16 12:00:02 AM - Error 2<CR><LF>'
Date group = '2018-10-16 12:00:02 AM'
Delim group = ' - '
Text group = 'Error 2 text line 1<CR><LF>Error 2 text line 2<CR><LF>Error 2 text line 3<CR><LF>Error 2 text line 4<CR><LF>'
Match 3: '2018-10-16 12:00:03 PM - Error 3'
Date group = '2018-10-16 12:00:03 PM'
Delim group = ' - '
Text group = 'Error 3'
The problem
My regex is not working in that 2nd and subsequent lines of text (e.g., 'Error 1 text line 1', 'Error 2 text line 1') are not being captured. I expect them to be captured because I'm using the Multiline option.
How do I modify my regex to capture 2nd and subsequent lines of text?
Current code
using System;
using System.Text.RegularExpressions;
namespace ConsoleApp_RegEx
{
class Program
{
static void Main(string[] args)
{
string text = System.IO.File.ReadAllText(@"C:\Users\bill\Desktop\test_text.txt");
string pattern = @"(?<Date>\d{4}-\d{2}-\d{2}\s{1}\d{2}:\d{2}:\d{2}.*)(?<Delim>\s-\s)(?<Text>.*\n|.*)";
RegexOptions regexOptions = (RegexOptions.IgnoreCase | RegexOptions.Multiline | RegexOptions.CultureInvariant | RegexOptions.IgnorePatternWhitespace | RegexOptions.Compiled);
Regex rx = new Regex(pattern, regexOptions);
// Find matches.
MatchCollection matches = rx.Matches(text);
Console.WriteLine("Input Text\n--------------------\n{0}\n--------------------\n", text);
// Report the number of matches found.
Console.WriteLine("Output ({0} matches found)\n--------------------\n", matches.Count);
int m = 1;
// Report on each match.
foreach (Match match in matches)
{
Console.WriteLine("Match #{0}: {1}", m++, match.Value);
int g = 1;
GroupCollection groups = match.Groups;
foreach (Group group in groups)
{
Console.WriteLine(" Group #{0} {1}", g++, group.Value);
}
Console.WriteLine();
}
Console.Read();
}
}
}
Current Output
Output from Poul Bak's modification (on the right track, but not quite there yet)
@"(?<Date>\d{4}-\d{2}-\d{2}\s{1}\d{2}:\d{2}:\d{2}(?:\s\w\w)?)(?<Delim>\s-\s)(?<Text>([\s\S](?!\d{4}))*)"
You can use the following regex, modified from yours:
@"(?<Date>\d{4}-\d{2}-\d{2}\s{1}\d{2}:\d{2}:\d{2}(?:\s\w\w)?)(?<Delim>\s-\s)(?<Text>([\s\S](?!\d{4}))*)"
I have changed the 'Date' Group so it accepts 'AM' or 'PM' (otherwise it will only match the first).
Then I have changed the 'Text' Group, so it matches any number of any char (including Newlines) until it looks forward and finds a new date.
Edit:
I don't understand what you mean when you say 'AM' and 'PM' are not matched; they are part of the 'Date' Group. I assume you want them to be part of the 'Delim' Group, so I have moved the check to that Group.
I have also changed a Group to a non capturing Group.
The new regex:
@"(?<Date>\d{4}-\d{2}-\d{2}\s{1}\d{2}:\d{2}:\d{2})(?<Delim>(?:\s\w\w)?\s-\s)(?<Text>(?:[\s\S](?!\d{4}))*)"
BTW: You should change your code for checking Groups, like this:
foreach (Group group in groups)
{
Console.WriteLine(" Group #{0} {1}", group.Name, group.Value);
}
Then you will see your named Groups by Name and Value. When you have named Groups, there's no need for accessing by index.
Edit 2:
About 'group.Name': I had mistakenly used 'Group' (capitalized), it should be: 'group.Name'.
This is what the regex looks like now:
@"(?<Date>\d{4}-\d{2}-\d{2}\s{1}\d{2}:\d{2}:\d{2}(?:\s\w\w)?)(?<Delim>\s-\s)(?<Text>(?:[\s\S](?!\d{4}))*)"
I suggest you set the 'RegexOptions.ExplicitCapture' flag; then you only get the named groups.
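To illustrate how the final regex and the ExplicitCapture suggestion fit together, here is a small sketch. The class name LogEntryParser and the Parse method are invented for this example, and trimming the captured text is my own choice (the regex leaves a trailing carriage return on every entry except the last). Also note that the lookahead stops at any run of four digits, so a message that itself contains something like a year would be cut short.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper built around the regex from this answer.
public static class LogEntryParser
{
    private static readonly Regex EntryPattern = new Regex(
        @"(?<Date>\d{4}-\d{2}-\d{2}\s{1}\d{2}:\d{2}:\d{2}(?:\s\w\w)?)(?<Delim>\s-\s)(?<Text>(?:[\s\S](?!\d{4}))*)",
        RegexOptions.ExplicitCapture); // only the named groups are captured

    // Returns one (Date, Text) pair per log entry, in file order.
    public static List<(string Date, string Text)> Parse(string text)
    {
        var entries = new List<(string Date, string Text)>();
        foreach (Match match in EntryPattern.Matches(text))
        {
            // Named groups are read by name; Trim removes the leftover line break.
            entries.Add((match.Groups["Date"].Value, match.Groups["Text"].Value.Trim()));
        }
        return entries;
    }
}
```

Called as LogEntryParser.Parse(File.ReadAllText(path)), this should give one entry per date line, with the continuation lines kept inside Text.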

Regex C# Matching string from two words in exact order and returning capture of non-matched words

C# Regex
I have the following list of strings:
"New patient, brief"
"New patient, limited"
"Established patient, brief"
"Established patient, limited"
"New diet patient"
"Established diet patient"
"School Physical"
"Deposition, 1 hour"
"Deposition, 2 hour"
I would like to separate these strings into groups using regex.
The first pattern I see is:
"New" or "Established" -- will be the first word of the matched pattern. This word will need to be captured and returned. Of this pattern, "patient" must be present without need to capture. Any word after "patient" must be captured.
I've tried: ((?=.*\bNew\b))(?=.*\bpatient\b)([A-Za-z0-9\-]+)
but the return match gives:
Full match 0-3 `New`
Group 1. 0-0 ``
Group 2. 0-3 `New`
Not at all what I am looking for.
string input = "New patient, limited";
string pattern = @"((?=.*\bNew\b))(?=.*\bpatient\b)([A-Za-z0-9\-]+)";
MatchCollection matches = Regex.Matches(input, pattern);
GroupCollection groups = matches[0].Groups;
foreach (Match match in matches)
{
Console.WriteLine("First word: {0}", match.Groups[1].Value);
Console.WriteLine("Last words: {0}", match.Groups[2].Value);
Console.WriteLine();
}
Console.WriteLine();
Thank you for any help with this.
Edit #1
For strings like "New patient, limited"
output should be: "New" "limited"
For strings like "Deposition, 1 hour" where "hour" is present,
output should be: "Deposition, 1 hour"
For strings where there are no words after "patient" but "patient" is present, like
"New diet patient",
output should be: "New" "diet"
For strings where neither "patient" nor "hour" is present, the entire string should be returned. i.e like "School Physical" should return the entire string,
"School Physical".
As I said, this is my ultimate quest. At the moment, I am trying to focus on separating out only the first pattern :). Much Thanks.
I suggest using
^(?:(?!\b(?:New|Established)\b).)*$|\b(New|Established)\s+(?:patient\b\W*)?(.+)
See the regex demo
Details
^(?:(?!\b(?:New|Established)\b).)*$ - any string that has no New or Established as whole words
| - or
\b(New|Established) - a whole word New or Established (put into Group 1)
\s+ - 1+ whitespaces
(?:patient\b\W*)? - an optional non-capturing group matching 1 or 0 occurrences of patient followed with word boundary and 0+ non-word chars
(.+) - Group 2: any 1 or more chars other than line break chars.
The code will look like
var match = Regex.Match(s, @"^(?:(?!\b(?:New|Established)\b).)*$|\b(New|Established)\s+(?:patient\b\W*)?(.+)");
If Group 1 is not matched (!match.Groups[1].Success), grab the whole match, match.Value. Else, grab match.Groups[1].Value and match.Groups[2].Value.
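The branching described above could be sketched roughly as follows. PatientLabelSplitter and Split are names I made up; the regex is the one suggested in this answer, and returning the input unchanged when nothing matches at all is an extra assumption of mine.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical wrapper around the suggested regex.
public static class PatientLabelSplitter
{
    private static readonly Regex Pattern = new Regex(
        @"^(?:(?!\b(?:New|Established)\b).)*$|\b(New|Established)\s+(?:patient\b\W*)?(.+)");

    // Returns { first word, rest } when Group 1 matched, otherwise the whole string.
    public static string[] Split(string s)
    {
        Match match = Pattern.Match(s);
        if (!match.Success)
            return new[] { s };               // no match at all: keep the input (assumption)
        if (!match.Groups[1].Success)
            return new[] { match.Value };     // no New/Established: whole match
        return new[] { match.Groups[1].Value, match.Groups[2].Value };
    }
}
```

For "New patient, limited" this yields "New" and "limited", and "School Physical" comes back whole. Note that for "New diet patient" the second value is "diet patient", so dropping a trailing "patient" would need an extra step.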

Parsing structured text input and composing structured output of nested classes

Here is my code for reading from a text file. It "works" and reads from the file, but there is a small bug. It returns this: {Employee: Name: Name: red ID: 123 ID: Request: Name: Name: toilet ID: 444 Desc: water ID: Desc: }. I know why it's doing it; I just can't figure out how to fix it. The value of columns[0] is "Name: red \t ID: 123" and the value of columns[1] is "Name: toilet \t ID: 444 \t Desc: water".
I know it's doing it because I'm calling assignment.Employee.Name but I don't know how else to call it to get it to show on my form. I thought it would be something like assignment.Employee but then it gives the error that I can't convert string to the Employee type.
Assignment is a list that holds 2 objects from other lists (employee and service request).
public static List<Assignment> GetAssignment()
{
if (!Directory.Exists(dir))
Directory.CreateDirectory(dir);
StreamReader textIn =
new StreamReader(
new FileStream(path3, FileMode.OpenOrCreate, FileAccess.Read));
List<Assignment> assignments = new List<Assignment>();
while (textIn.Peek() != -1)
{
string row = textIn.ReadLine();
string[] columns = row.Split('|');
if (columns.Length >= 2)
{
Assignment assignment = new Assignment();
assignment.Employee.Name = columns[0];
assignment.Request.Name = columns[1];
assignments.Add(assignment);
}
}
textIn.Close();
return assignments;
}
EDIT: I expect it to just return {Employee: Name: red ID: 123 Request: Name: toilet ID: 444 Desc: water}
Sorry this isn't an answer but due to the strange rules on this site I am not allowed to add a comment. Please give us the definition of the class or structure called "Assignment" and tell us what you expect it to contain after your code has run.
You are performing a string.Format() on the this.Employee so basically it is performing the default ToString() on the Employee object, which will list all fields and their associated values. You perhaps are meaning to call it like this:
return string.Format("Employee: {0} \t Request: {1}", this.Employee.Name, this.Request.Name);
Or perhaps you want to override the ToString() on your Employee and ServiceRequest objects to return your desired results.
Update
Since you edited your question to include the Employee object, the above is not relevant. Since your columns[0] value already contains the text "Name: red \t ID: 123", your Employee override of ToString does not need to add the text "Name:" as well.
This answer is based on the assumption that a typical text line in your data file looks like this:
Name: red \t ID: 123 | Name: toilet \t ID: 444 \t Desc: water
This looks to me like it is encoding two objects, the first one having two attributes (Name and ID) and the second one having three attributes (Name, ID, Desc).
Objects within the same line are separated by pipe signs ("|"). Attributes within the same object are separated by tabs ("\t"). Each attribute consists of an identifier ("Name", "ID") and a value ("red", "123"), separated by a colon (":"). The natural data structure for such pairs would be a Dictionary<string, string>.
Reading such a file would emulate that nesting.
Read a line; split it by "|" into strings containing one object each (your columns).
Split each of these object strings by \t so that each resulting string contains one key and one value with a colon (":") and white space between them.
Split each of those key-values by ":" to separate the key from the value. Trim both to get rid of excess white space.
Employees or other objects of this kind hold a dictionary to store the key/value pairs, and ToString() just prints each pair by printing a key, a colon, and the value.
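The nested splitting described above might look something like this. It is only a sketch: RecordParser and ParseLine are hypothetical names, and it assumes the attributes are separated by real tab characters and that each value may itself contain further colons (hence the split limit of 2).

```csharp
using System;
using System.Collections.Generic;

// Hypothetical parser for lines like:
// Name: red <tab> ID: 123 | Name: toilet <tab> ID: 444 <tab> Desc: water
public static class RecordParser
{
    // Returns one dictionary of key/value pairs per object on the line.
    public static List<Dictionary<string, string>> ParseLine(string line)
    {
        var objects = new List<Dictionary<string, string>>();
        foreach (string objectText in line.Split('|'))       // one object per pipe-separated part
        {
            var attributes = new Dictionary<string, string>();
            foreach (string pair in objectText.Split('\t'))  // one attribute per tab-separated part
            {
                // Split on the first colon only, then trim the excess white space.
                string[] parts = pair.Split(new[] { ':' }, 2);
                if (parts.Length == 2)
                    attributes[parts[0].Trim()] = parts[1].Trim();
            }
            objects.Add(attributes);
        }
        return objects;
    }
}
```

With the sample line, the first dictionary would hold Name and ID and the second Name, ID and Desc; an Employee or ServiceRequest class could then be filled from those dictionaries instead of storing the raw column text in Name.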

Trying to match multiple words multiple times, any order using regex

I'm trying to check whether a text contains two or more specific words. The words can be in any order and can show up in the text multiple times, but each must appear at least once.
If the text is a match, I will need to get the location of each word.
Let's say we have the text:
"Once I went to a store and bought a coke for a dollar and I got another coke for free"
In this example I want to match the words coke and dollar.
So the result should be:
coke : index 37, length 4
dollar : index 48, length 6
coke : index 84, length 4
What I have already is this (which I think is a little bit wrong, because the text should contain each word at least once, so the + should be there instead of the *):
(?:(\bcoke\b))*(?:(\bdollar\b))*
With that regex, RegexBuddy highlights all three words if I ask it to highlight group 1 and group 2.
But when I run this in C# I don't get any results.
Can you point me in the right direction?
I don't think what you want is possible using regular expressions alone.
Here is a possible solution using regular expressions and LINQ:
var words = new HashSet<string>(StringComparer.OrdinalIgnoreCase) { "coke", "dollar" };
var regex = new Regex(@"\b(?:"+string.Join("|", words)+@")\b", RegexOptions.IgnoreCase);
var text = @"Once I went to a store and bought a coke
for a dollar and I got another coke for free";
var grouped = regex.Matches(text)
.OfType<Match>()
.GroupBy(m => m.Value, StringComparer.OrdinalIgnoreCase)
.ToArray();
if (grouped.Length != words.Count)
{
//not all words were found
}
else
{
foreach (var g in grouped)
{
Console.WriteLine("Found: " + g.Key);
foreach (var match in g)
Console.WriteLine(" At {0} length {1}", match.Index, match.Length);
}
}
Output:
Found: coke
At 36 length 4
At 72 length 4
Found: dollar
At 47 length 6
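If the matches are wanted in text order rather than grouped by word, the same idea can be packed into a small helper. This is a sketch under my own naming (WordLocator, Locate); it escapes the words with Regex.Escape, which the code above does not do, and returns an empty list when any of the words is missing.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper; a variation on the approach above.
public static class WordLocator
{
    // Returns every hit in text order, or an empty list if any word is absent.
    public static List<(string Word, int Index, int Length)> Locate(string text, params string[] words)
    {
        string alternation = string.Join("|", words.Select(w => Regex.Escape(w)));
        var regex = new Regex(@"\b(?:" + alternation + @")\b", RegexOptions.IgnoreCase);

        var hits = regex.Matches(text)
            .Cast<Match>()
            .Select(m => (Word: m.Value, Index: m.Index, Length: m.Length))
            .ToList();

        // Every requested word has to occur at least once.
        var found = new HashSet<string>(hits.Select(h => h.Word), StringComparer.OrdinalIgnoreCase);
        bool allPresent = words.All(w => found.Contains(w));

        return allPresent ? hits : new List<(string Word, int Index, int Length)>();
    }
}
```

Indexes are zero-based, as with Match.Index, so they will be one lower than a count that starts at 1.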
How about this? It is pret-tay bad, but I think it has a shot at working, and it is pure regex with no extra tools.
(?:^|\W)[cC][oO][kK][eE](?:$|\W)|(?:^|\W)[dD][oO][lL][lL][aA][rR](?:$|\W)
Get rid of the \W's if you want it to capture cokeDollar or dollarCoKe etc.

How can I exclude the first match in a regular expression?

I have the following regex, so far:
([0-9]+){1}\s*[xX]\s*([A-Za-z\./%\$\s\*]+)
to be used on strings such as:
2x Soup, 2x Meat Balls, 4x Iced Tea
My intent was to capture the number of times something was ordered, as well as the name of item ordered.
In this regular expression however, the multiplier 'x' gets captured before the title.
How can I make it so that the x is ignored, and what comes after the x (and a space) is captured?
You can't ignore something in the middle of a pattern. That is what capturing groups are for.
([0-9]+){1}\s*[xX]\s*([A-Za-z\./%\$\s\*]+)
^^^^^^^^             ^^^^^^^^^^^^^^^^^^^^^
The marked parts of your pattern are stored in capturing groups, because of the parentheses around them.
Your number is in group 1 and the name is in group 2. The "x" is not captured in a group.
How you now access your groups depends on the language you are using.
By the way, the {1} is redundant.
So for C# try this:
string text = "2x Soup, 2x Meat Balls, 4x Iced Tea";
MatchCollection result = Regex.Matches(text, @"([0-9]+)\s*[xX]\s*([A-Za-z\./%\$\s\*]+)");
int counter = 0;
foreach (Match m in result)
{
counter++;
Console.WriteLine("Order {0}: " + m.Groups[1] + " " + m.Groups[2], counter);
}
Console.ReadLine();
Further, I would change the regex to this, since it seems you want the name to match every character up to the next comma:
@"([0-9]+)\s*x\s*([^,]+)"
and use RegexOptions.IgnoreCase to avoid having to write [xX].
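Putting the simplified pattern and RegexOptions.IgnoreCase together, a sketch could look like this. OrderParser and Parse are made-up names, and the Trim call on the item name is my addition to drop any stray white space around it.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper using the simplified pattern.
public static class OrderParser
{
    // Returns (quantity, item name) pairs; the multiplier "x" is matched but not captured.
    public static List<(int Quantity, string Item)> Parse(string text)
    {
        var orders = new List<(int Quantity, string Item)>();
        foreach (Match m in Regex.Matches(text, @"([0-9]+)\s*x\s*([^,]+)", RegexOptions.IgnoreCase))
        {
            int quantity = int.Parse(m.Groups[1].Value);  // group 1: the number
            string item = m.Groups[2].Value.Trim();       // group 2: everything up to the next comma
            orders.Add((quantity, item));
        }
        return orders;
    }
}
```

For the sample string this should give 2 and "Soup", 2 and "Meat Balls", 4 and "Iced Tea".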
