Parsing structured text input and composing structured output of nested classes


Here is my code for reading from a text file. It "works" and reads from the file, but there is a small bug. It returns this: {Employee: Name: Name: red ID: 123 ID: Request: Name: Name: toilet ID: 444 Desc: water ID: Desc: } I know why it's doing it, I just can't figure out how to fix it. The value of columns[0] is "Name: red \t ID: 123" and the value of columns[1] is "Name: toilet \t ID: 444 \t Desc: water".
I know it's doing it because I'm assigning to assignment.Employee.Name, but I don't know how else to call it to get it to show on my form. I thought it would be something like assignment.Employee, but then I get an error that a string can't be converted to the Employee type.
Assignment is a list that holds 2 objects from other lists (employee and service request).
public static List<Assignment> GetAssignment()
{
    if (!Directory.Exists(dir))
        Directory.CreateDirectory(dir);
    StreamReader textIn =
        new StreamReader(
            new FileStream(path3, FileMode.OpenOrCreate, FileAccess.Read));
    List<Assignment> assignments = new List<Assignment>();
    while (textIn.Peek() != -1)
    {
        string row = textIn.ReadLine();
        string[] columns = row.Split('|');
        if (columns.Length >= 2)
        {
            Assignment assignment = new Assignment();
            assignment.Employee.Name = columns[0];
            assignment.Request.Name = columns[1];
            assignments.Add(assignment);
        }
    }
    textIn.Close();
    return assignments;
}
EDIT: I expect it to just return {Employee: Name: red ID: 123 Request: Name: toilet ID: 444 Desc: water}

Sorry this isn't an answer but due to the strange rules on this site I am not allowed to add a comment. Please give us the definition of the class or structure called "Assignment" and tell us what you expect it to contain after your code has run.

You are performing a string.Format() on the this.Employee so basically it is performing the default ToString() on the Employee object, which will list all fields and their associated values. You perhaps are meaning to call it like this:
return string.Format("Employee: {0} \t Request: {1}", this.Employee.Name, this.Request.Name);
Or perhaps you want to override the ToString() on your Employee and ServiceRequest objects to return your desired results.
Update
Since you edited your question to include the Employee object, the above is no longer relevant. Because the columns[0] value already contains the text "Name: red \t ID: 123", your Employee override of ToString() does not need to add the "Name:" label again.

This answer is based on the assumption that a typical text line in your data file looks like this:
Name: red \t ID: 123 | Name: toilet \t ID: 444 \t Desc: water
This looks to me like it is encoding two objects, the first one having two attributes (Name and ID) and the second one having three attributes (Name, ID, Desc).
Objects within the same line are separated by pipe signs ("|"). Attributes within the same object are separated by tabs ("\t"). Each attribute consists of an identifier ("Name", "ID") and a value ("red", "123"), separated by a colon (":"). The natural data structure for such pairs would be a Dictionary<string, string>.
Reading such a file would emulate that nesting.
Read a line; split it by "|" into strings containing one object each (your columns).
Split each of these object strings by \t so that each resulting string contains one key and one value with a colon (":") and white space between them.
Split each of those key-values by ":" to separate the key from the value. Trim both to get rid of excess white space.
Employees or other objects of this kind hold a dictionary to store the key/value pairs, and ToString() just prints each pair by printing a key, a colon, and the value.
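The steps above can be sketched roughly as follows. This is only an illustration under assumptions: the class name RecordParser and its methods are hypothetical (they are not from the question), and the attribute separator is assumed to be a real tab character in the file.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper; the name RecordParser is not from the question.
public static class RecordParser
{
    // Parses one line into one dictionary per object.
    // Assumes objects are separated by '|' and attributes by real tab characters.
    public static List<Dictionary<string, string>> ParseLine(string line)
    {
        var objects = new List<Dictionary<string, string>>();
        foreach (string objectText in line.Split('|'))
        {
            var attributes = new Dictionary<string, string>();
            foreach (string pair in objectText.Split('\t'))
            {
                // Split on the first colon only, so a value may itself contain colons.
                string[] parts = pair.Split(new[] { ':' }, 2);
                if (parts.Length == 2)
                    attributes[parts[0].Trim()] = parts[1].Trim();
            }
            objects.Add(attributes);
        }
        return objects;
    }

    // Composes "Key: Value" pairs back into text.
    // Note: Dictionary does not guarantee enumeration order.
    public static string Format(Dictionary<string, string> attributes)
    {
        return string.Join(" ", attributes.Select(kv => kv.Key + ": " + kv.Value));
    }
}
```

For example, RecordParser.ParseLine("Name: red \t ID: 123 | Name: toilet \t ID: 444 \t Desc: water") yields two dictionaries, and the "Name" entry of the first one is "red". An Employee or ServiceRequest class could hold such a dictionary and call Format from its ToString() override.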

Related

Is it possible to have overlapping regex matches?

Take this data as an example:
ID: JK546|Guitar: 0|Piano: 1|Violin: 0|Expiry: Aug14,2021
I was wondering if it's possible to create a regex that will return this set of matches
ID: JK546|Guitar: 0|Expiry: Aug14,2021
ID: JK546|Piano: 1|Expiry: Aug14,2021
ID: JK546|Violin: 0|Expiry: Aug14,2021
I did try creating one below:
ID: (?<id>\w+).*\|(?<instrument>\w+):\s(?<count>\d).*Expiry:\s(?<expiry>[\w\d]+)
but it only returned the one with the violin instrument. I would highly appreciate your insights on this.

I would not use a regular expression. Especially since the string ID: JK546|Guitar: 0|Expiry: Aug14,2021 does not appear in the string ID: JK546|Guitar: 0|Piano: 1|Violin: 0|Expiry: Aug14,2021, so it's not strictly a match, but more of a replacement. But there's no good way to get all replacements from all matches.
So, I'd just split the input string on |.
Then you want to compose a result string that is comprised of the first field, one of the middle fields, and the last field. You'll get one result for each middle field that exists. If it splits into N fields, you'll get N-2 results. e.g.: if it splits into 5 fields, then you'll get 3 results, one for each of the "middle" fields.
string input = "ID: JK546|Guitar: 0|Piano: 1|Violin: 0|Expiry: Aug14,2021";
string[] fields = input.Split('|');
for( int i = 1; i < fields.Length - 1; ++i) {
string result = string.Join("|", fields.First(), fields[i], fields.Last());
Console.WriteLine(result);
}
output:
ID: JK546|Guitar: 0|Expiry: Aug14,2021
ID: JK546|Piano: 1|Expiry: Aug14,2021
ID: JK546|Violin: 0|Expiry: Aug14,2021

A single regular expression to return multiple matches on multiple calls? 
I wonder whether that is possible.
I’m not familiar with how to do regex processing in C#,
but this sed command will do what you want. 
Perhaps you can understand how it works and adapt it to your needs:
sed -n ':loop; h; s/^\([^|]*|[^|]*\).*\(|.*\)$/\1\2/p; g; s/^\([^|]*\)|[^|]*\(|.*\)$/\1\2/; t loop'
For simplicity, let’s pretend that the input string is “A|B|C|D|E”.
What it does:
-n is the option to tell sed not to print anything automatically
(but only print when told to, with a p command).
:loop is a label for, effectively, a “goto”. 
So use a while loop structure.
h saves the pattern space into the hold space. 
In other words, make a copy of your string.
s/^\([^|]*|[^|]*\).*\(|.*\)$/\1\2/p captures the first two segments
and the last one, and prints the result. 
So “A|B|C|D|E” becomes “A|B|E” (i.e., your first desired output).
g restores the saved string from the hold space into the pattern space. 
In other words, retrieve the copy of the string that you saved.
s/^\([^|]*\)|[^|]*\(|.*\)$/\1\2/ captures the first segment,
skips the second, and then captures the rest. 
So “A|B|C|D|E” becomes “A|C|D|E”.
t loop is the “goto” command. 
It says to go back to the beginning of the loop
if the most recent substitution succeeded. 
In other words, this is the end of the loop,
and the specification of the loop condition.
The second iteration of the loop will change “A|C|D|E” to “A|C|E”
and print it. 
And then change “A|C|D|E” to “A|D|E” and iterate. 
The third iteration of the loop will change “A|D|E” to “A|D|E” and print it. 
(Obviously there is no change, because the .* in the middle of the regex
matches the zero-length string between “A|D” and “|E”.) 
The final substitution changes “A|D|E” to “A|E”,
and then there is nothing left to find.

You can make use of the .NET Group.Captures property to get the values of Guitar, Piano and Violin.
(ID: \w+\|)(\w+: \d+\|)+(Expiry: \w+,\d+)
The pattern matches:
(ID: \w+\|) Capture group 1 match ID: 1+ word chars and |
(\w+: \d+\|)+ Capture group 2 Repeat 1+ times matching 1+ word chars : 1+ digits |
(Expiry: \w+,\d+) Capture group 3 match Expiry: 1+ word chars , and 1+ digits
For example
var str = "ID: JK546|Guitar: 0|Piano: 1|Violin: 0|Expiry: Aug14,2021";
string pattern = @"(ID: \w+\|)(\w+: \d+\|)+(Expiry: \w+,\d+)";
Match m = Regex.Match(str, pattern);
foreach(Capture c in m.Groups[2].Captures) {
Console.WriteLine(m.Groups[1].Value + c.Value + m.Groups[3].Value);
}
Output
ID: JK546|Guitar: 0|Expiry: Aug14,2021
ID: JK546|Piano: 1|Expiry: Aug14,2021
ID: JK546|Violin: 0|Expiry: Aug14,2021

It should be possible with look behind and look ahead:
string foo = @"ID: JK546 | Guitar: 0 | Piano: 1 | Violin: 0 | Expiry: Aug14,2021";
// First match "Guitar: 0", "Piano: 1" and "Violin: 0". Then look behind with (?<= ) to find the ID, and look ahead with (?= ) to find the Expiry.
string pattern = @"(\w+: \d)(?<=(ID: [A-Z0-9]+).*?)(?=.*?(Expiry: \S+))";
foreach (var match in Regex.Matches(foo, pattern))
{
    ....
}
Fortunately, C# is one of the few languages whose regex engine supports variable-length lookbehind.
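One possible way to consume the matches of this lookbehind/lookahead pattern is sketched below. The class name LookaroundDemo and the method Expand are hypothetical, and joining the three groups with "|" is an assumption based on the output format requested in the question.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper; the name LookaroundDemo is not from the answer.
public static class LookaroundDemo
{
    public static List<string> Expand(string input)
    {
        // Group 1: the instrument field matched directly.
        // Group 2: the ID captured inside the lookbehind.
        // Group 3: the Expiry captured inside the lookahead.
        const string pattern = @"(\w+: \d)(?<=(ID: [A-Z0-9]+).*?)(?=.*?(Expiry: \S+))";
        var results = new List<string>();
        foreach (Match match in Regex.Matches(input, pattern))
        {
            results.Add(string.Join("|",
                match.Groups[2].Value, match.Groups[1].Value, match.Groups[3].Value));
        }
        return results;
    }
}
```

Called with the string foo from above, Expand should return one entry per instrument, each combining the ID, that instrument and the Expiry.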

How to match values ending with an optional string through Regex?

I am trying to extract a first name from a text snippet, which optionally has a last name in the same line as: <first_name>name<last_name>
E.g.:
Text: JohnnameSnow -> Result: John
Text: John -> Result: John
So I want to extract the <first_name> part from that line, but if there is no name<last_name> it should return the full line.
I have tried the following Regex:
([A-zÀ-ÿ-]{2,})(?=(?:name))
That works fine if there's actually a last name in the same line, but does not return me the full line when there is not. Unfortunately the solution doesn't seem to be as easy as adding |$.
Can I look for an optional end word and ignore it if it does not occur?

You can use
^(?<first>\p{L}+?)(?:name(?<last>\p{L}+))?$
Details
^ - start of string
(?<first>\p{L}+?) - Group "first": one or more letters, but as few as possible
(?:name(?<last>\p{L}+))? - an optional non-capturing group:
name - a substring
(?<last>\p{L}+) - Group "last": one or more letters
$ - end of string.
See the C# demo:
var strings = new List<string> { "JohnnameSnow", "John" };
foreach (var s in strings)
{
Console.WriteLine(s);
var m = Regex.Match(s, @"^(?<first>\p{L}+?)(?:name(?<last>\p{L}+))?$");
if (m.Success)
{
Console.WriteLine("First name: {0}, Last name = {1}", m.Groups["first"].Value, m.Groups["last"].Value);
}
else
{
Console.WriteLine("No match!");
}
}
Output:
JohnnameSnow
First name: John, Last name = Snow
John
First name: John, Last name =

converting file with unspecified number of lines, by using regex, visual c#

I have an app which converts a file by reading all lines from a source text file and printing only the lines which contain the word 'student'. It also removes some characters and splits the printed line into 5 fields, as shown below:
input text file
Form|01; 23_anna- Member 12569 is student - 12*01*2006
Form|02; 17_smith_ Member 12570 is teacher - 13*01*2007
Form|03; 12_ben_ Member 12571 is student - 14*01*2007
The output file:
Form01 anna 12569 student 12 01 2006
Form03 ben 12571 student 14 01 2007
The code i have tried:
private Regex find = new Regex(@"^(.+?)(?:\|)(\d+)(?:.+?_)(.+?)(?:[_-] Member ?)(\d+)(?:.+?)(student)(?:.+?)(\d\d).(\d\d).(\d\d\d\d)$", RegexOptions.Multiline);
private void MyButton_Click(object sender, EventArgs e)
{
    string sample = "Form|01; 23_anna- Member 12569 is student - 12*01*2006\nForm|02; 17_smith_ Member 12570 is teacher - 13*01*2007\nForm|03; 12_ben_ Member 12571 is student - 14*01*2007";
    MatchCollection matches = find.Matches(sample);
    foreach (Match m in matches)
    {
        Console.WriteLine("{0}{1} {2} {3} is {4} {5} {6} {7}", m.Groups[1], m.Groups[2], m.Groups[3], m.Groups[4], m.Groups[5], m.Groups[6], m.Groups[7], m.Groups[8]);
    }
    Console.WriteLine();
}
But how can I change the code if I want to convert a file with more lines (about 500)?

The best way to do this, in my opinion, is to use File.ReadAllLines() and then apply your regex in a foreach loop. I also think that you are overcomplicating your regex, so I have made a few changes where I think it can be simplified.
I am working under the assumption that the format of the lines you are looking for will always be the same. Since "Form" and "student" appear in all of these lines, I see little reason to capture them. In reality there are 6 key pieces of information to capture:
1 – the numbers after form
2 – the name
3 – the 5-digit member number
4,5,6 – the three sections of the date
Everything else is either constant or not used in the output string. So when we come to rewrite the search and replace we get something like:
/^\w+\|([^;]+).+?([a-z]+)[^\d]+(\d{5})[^\d]+(\d{2}).(\d{2}).(\d{4})/m
Console.WriteLine("Form{0} {1} {2} student {3} {4} {5}", m.Groups[1], m.Groups[2], m.Groups[3], m.Groups[4], m.Groups[5], m.Groups[6]);
Note that there are assumptions in the regex such as the name is always in lower case and the member number is always 5 digits and some other stuff like there can't be numbers in the names etc. It isn't optimal but I think it is tidier than yours, but this is personal preference I guess.
To get the lines with student use string.Contains("student") or if you really want to include it in your regex I would recommend using a positive lookahead for student (?=.*student)
Here is a bit of example code I wrote for one way that I would do it:
var regex = new Regex(@"^\w+\|([^;]+).+?([a-z]+)[^\d]+(\d{5})[^\d]+(\d{2}).(\d{2}).(\d{4})$", RegexOptions.Multiline);
var file = File.ReadAllLines(@"C:temp\test.txt");
foreach (var line in file)
{
    if (line.Contains("student"))
    {
        var m = regex.Match(line);
        Console.WriteLine("Form{0} {1} {2} student {3} {4} {5}", m.Groups[1], m.Groups[2], m.Groups[3], m.Groups[4], m.Groups[5], m.Groups[6]);
    }
}

How to extract address components from a string?

I have a Xamarin Forms application that uses Xamarin.Mobile on the platforms to get the current location and then ascertain the current address. The address is returned in string format with line breaks.
The address can look like this:
111 Mandurah Tce
Mandurah WA 6210
Australia
or
The Glades
222 Mandurah Tce
Mandurah WA 6210
Australia
I have this code to break it down into the street address (including number), suburb, state and postcode (not very elegant, but it works)
string[] lines = address.Split(new string[] { Environment.NewLine }, StringSplitOptions.RemoveEmptyEntries);
List<string> addyList = new List<string>(lines);
int count = addyList.Count;
string lineToSplit = addyList.ElementAt(count - 2);
string[] splitLine = lineToSplit.Split(null);
List<string> splitList = new List<string>(splitLine);
string streetAddress = addyList.ElementAt (count - 3).ToString ();
string postCode = splitList.ElementAt(2);
string state = splitList.ElementAt(1);
string suburb = splitList.ElementAt(0);
I would like to extract the street number. In the previous examples this would be easy, but what is the best way to do it, taking into account that the number might be Lot 111 (I only need to capture the 111, not the word Lot), or 123A, or 8/123? Sometimes something like 111-113 is also returned.
I know that I can use regex and look for every possible combo, but is there an elegant built-in type solution, before I go writing any more messy code (and I know that the above code isn't particularly robust)?

These simple regular expressions will account for many types of address formats, but have you considered all the possible variations, such as:
PO Box 123 suburb state post_code
Unit, Apt, Flat, Villa, Shop X Y street name
7C/94 ALISON ROAD RANDWICK NSW 2031
and that is just to get the number. You will also have to deal with all the possible types of streets such as Lane, Road, Place, Av, Parkway.
Then there are street types such as:
12 Grand Ridge Road suburb_name
This could be interpreted as street = "Grand Ridge" and suburb = "Road suburb_name", as Ridge is also a valid street type.
I have done a lot of work in this area and found that the huge number of valid address patterns meant simple regexes didn't solve the problem on large amounts of data.
I ended up developing this parser http://search.cpan.org/~kimryan/Lingua-EN-AddressParse-1.20/lib/Lingua/EN/AddressParse.pm to solve the problem. It was originally written for Australian addresses, so it should work well for you.

Regex can capture the parts of a match into groups. Each pair of parentheses () defines a group.
([^\d]*)(\d*)(.*)
For "Lot 222 Mandurah Tce" this returns the following groups
Group 0: "Lot 222 Mandurah Tce" (the input string)
Group 1: "Lot "
Group 2: "222"
Group 3: " Mandurah Tce"
Explanation:
[^\d]* Any number (including 0) of any character except digits.
\d* Any number (including 0) of digits.
.* Any number (including 0) of any character.
string input = "Lot 222 Mandurah Tce";
Match match = Regex.Match(input, @"([^\d]*)(\d*)(.*)");
string beforeNumber = match.Groups[1].Value; // --> "Lot "
string number = match.Groups[2].Value; // --> "222"
string afterNumber = match.Groups[3].Value; // --> " Mandurah Tce"
If a group finds no match, match.Groups[i].Value will return an empty string ("") for that group.

You could check if the content starts with a number for each entry in the splitLine.
// addressLine is the street line of the address, e.g. "Lot 111 Mandurah Tce"
string[] splitLine = addressLine.Split(null);
var streetNumber = string.Empty;
foreach (var s in splitLine)
{
    // Take the first entry that starts with a digit
    if (Regex.IsMatch(s, @"^\d"))
    {
        streetNumber = s;
        break;
    }
}
// Deal with an empty value another way
Console.WriteLine("My street number is " + streetNumber);

Yea I think you have to identify what will work.
If:
it is always in the address line and it must always start with a Digit
nothing else in that line can start with a digit (or if something else does you know which always comes in what order, ie the code below will always work if the street number is always first)
you want every contiguous character to the digit that isn't whitespace (the - and / examples suggest that to me)
Then it could be as simple as:
var regx = new Regex(@"(?:\s|^)\d[^\s]*");
var mtch = regx.Match(addressline);
You would sort of have to sift and see if any of those assumptions are broken.
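To check those assumptions against the variants mentioned in the question, a small sketch like the following may help. The class name StreetNumber and the method Extract are hypothetical. The call to Trim() is there because the pattern can include the whitespace character that precedes the digit.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper; the name StreetNumber is not from the answer.
public static class StreetNumber
{
    // A digit at the start of the line or after whitespace,
    // followed by everything up to the next whitespace.
    private static readonly Regex Pattern = new Regex(@"(?:\s|^)\d[^\s]*");

    // Returns the first token that starts with a digit, or "" if there is none.
    public static string Extract(string addressLine)
    {
        Match match = Pattern.Match(addressLine);
        // The match may begin with the preceding whitespace character, so trim it.
        return match.Success ? match.Value.Trim() : string.Empty;
    }
}
```

Under these assumptions, StreetNumber.Extract("Lot 111 Mandurah Tce") gives "111", while "123A", "8/123" and "111-113" at the start of a line come back unchanged. A line such as "The Glades" gives an empty string.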

Format a string into columns

Is there a cool way to take something like this:
Customer Name - City, State - ID
Bob Whiley - Howesville, TN - 322
Marley Winchester - Old Towne, CA - 5653
and format it to something like this:
Customer Name     - City,       State - ID
Bob Whiley        - Howesville, TN    - 322
Marley Winchester - Old Towne,  CA    - 5653
Using string format commands?
I am not too hung up on what to do if one is very long. For example this would be ok by me:
Customer Name     - City,       State - ID
Bob Whiley        - Howesville, TN    - 322
Marley Winchester - Old Towne,  CA    - 5653
Super Town person - Long Town Name, WA- 45648
To provide some context. I have a drop down box that shows info very similar to this. Right now my code to create the item in the drop down looks like this:
public partial class CustomerDataContract
{
public string DropDownDisplay
{
get
{
return Name + " - " + City + ", " + State + " - " + ID;
}
}
}
I am looking for a way to format this better. Any ideas?
This is what I ended up with:
HttpContext.Current.Server.HtmlDecode(
    String.Format("{0,-27} - {1,-15}, {2, 2} - {3,5}",
        Name, City, State, ID)
    .Replace(" ", "&nbsp;"));
The HtmlDecode changes each &nbsp; into a non-breaking space, which can withstand the space-removing formatting of the drop-down list.

You can specify the number of columns occupied by the text as well as alignment using Console.WriteLine or using String.Format:
// Prints "--123       --"
Console.WriteLine("--{0,-10}--", 123);
// Prints "--       123--"
Console.WriteLine("--{0,10}--", 123);
The number specifies the number of columns you want to use and the sign specifies alignment (a negative number left-aligns, a positive number right-aligns). So, if you know the number of columns available, you could write for example something like this:
public string DropDownDisplay {
    get {
        return String.Format("{0,-10} - {1,-10}, {2, 10} - {3,5}",
            Name, City, State, ID);
    }
}
If you'd like to calculate the number of columns based on the entire list (e.g. the longest name), then you'll need to get that number in advance and pass it as a parameter to your DropDownDisplay - there is no way to do this automatically.

In addition to Tomas's answer I just want to point out that string interpolation can be used in C# 6 or newer.
// with string format
var columnHeaders1 = string.Format("|{0,-30}|{1,-4}|{2,-15}|{3,-30}|{4,-30}|{5,-30}|{6,-30}", "ColumnA", "ColumnB", "ColumnC", "ColumnD", "ColumnE", "ColumnF", "ColumnG");
// with string interpolation
var columnHeaders2 = $"|{"ColumnA",-30}|{"ColumnB",-4}|{"ColumnC",-15}|{"ColumnD",-30}|{"ColumnE",-30}|{"ColumnF",-30}|{"ColumnG",-30}";

I am unable to add a comment above, but in the accepted answer it was stated:
If you'd like to calculate the number of columns based on the entire list (e.g. the longest name), then you'll need to get that number in advance and pass it as a parameter to your DropDownDisplay - there is no way to do this automatically.
This can in fact be done programmatically at runtime by creating the format string 'on the fly':
string p0 = "first";
string p1 = "separated by alignment value x";
int n = 2; // example value
int x = n * 10; // calculate the alignment x as needed
// now use x to give something like: {0,-20},{1}
string fmt = "{0,-" + x + "},{1}"; // or whatever formatting expression you want
// then use the fmt string
string str = string.Format(fmt, p0, p1);
// with n = 2 this would give us
// "first               ,separated by alignment value x"
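The same idea can be extended to a whole list by measuring the longest name first. The following is a sketch under assumptions: the class name ColumnFormatter, the method Format and the use of KeyValuePair rows (name, ID) are hypothetical choices for illustration, not part of the answers above.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper; the name ColumnFormatter is not from the answers.
public static class ColumnFormatter
{
    // Formats (name, id) rows so the name column is as wide as the longest name.
    public static List<string> Format(IList<KeyValuePair<string, int>> rows)
    {
        if (rows.Count == 0)
            return new List<string>();

        // Width of the first column: length of the longest name in the list.
        int width = rows.Max(row => row.Key.Length);
        // Build the composite format string at run time, e.g. "{0,-17} - {1}".
        string format = "{0,-" + width + "} - {1}";
        return rows.Select(row => string.Format(format, row.Key, row.Value)).ToList();
    }
}
```

With the rows ("Bob Whiley", 322) and ("Marley Winchester", 5653), the width works out to 17, so both " - " separators line up. This only aligns visually in a fixed-width font.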
