C# Regular Expression: How to extract a collection

C# Regular Expression: How to extract a collection - c#

I have collection in text file:
(Collection
(Item "Name1" 1 2 3)
(Item "Simple name2" 1 2 3)
(Item "Just name 3" 4 5 6))
Collection also could be empty:
(Collection)
The number of items is undefined. It could be one item or one hundred. By previous extraction I already have inner text between Collection element:
(Item "Name1" 1 2 3)(Item "Simple name2" 1 2 3)(Item "Just name 3" 4 5 6)
In the case of empty collection it will be empty string.
How could I parse this collection using .Net Regular Expression?
I tried this:
string pattern = #"(\(Item\s""(?<Name>.*)""\s(?<Type>.*)\s(?<Length>.*)\s(?<Number>.*))*";
But the code above doesn't produce any real results.
UPDATE:
I tried to use regex differently:
foreach (Match match in Regex.Matches(document, pattern, RegexOptions.Singleline))
{
for (int i = 0; i < match.Groups["Name"].Captures.Count; i++)
{
Console.WriteLine(match.Groups["Name"].Captures[i].Value);
}
}
or
while (m.Success)
{
m.Groups["Name"].Value.Dump();
m.NextMatch();
}

Try
\(Item (?<part1>\".*?\")\s(?<part2>\d+)\s(?<part3>\d+)\s(?<part4>\d+)\)
this will create a collection of matches:
Regex regex = new Regex(
"\\(Item (?<part1>\\\".*?\\\")\\s(?<part2>\\d+)\\s(?<part3>\\d"+
"+)\\s(?<part4>\\d+)\\)",
RegexOptions.Multiline | RegexOptions.Compiled
);
//Capture all Matches in the InputText
MatchCollection ms = regex.Matches(InputText);
//Get the names of all the named and numbered capture groups
string[] GroupNames = regex.GetGroupNames();
// Get the numbers of all the named and numbered capture groups
int[] GroupNumbers = regex.GetGroupNumbers();

I think you might need to make your captures non-greedy...
(?<Name>.*?)
instead of
(?<Name>.*)

I think you should read file and than make use of Sting.Split function to split the collection and start to read it
String s = "(Collection
(Item "Name1" 1 2 3)
(Item "Simple name2" 1 2 3)
(Item "Just name 3" 4 5 6))";
string colection[] = s.Split('(');
if(colection.Length>1)
{
for(i=1;i<colection.Length;i++)
{
//process string one by one and add ( if you need it
//from the last item remove )
}
}
this will resolve issue easily there is no need of put extra burden of regulat expression.

Related

TakeWhile() with conditions in multiple lines

I have a text file with a few thousand lines of texts. A Readlines() function reads each line of the file and yield return the lines.
I need to skip the lines until a condition is met, so it's pretty easy like this:
var lines = ReadLines().SkipWhile(x => !x.Text.Contains("ABC Header"))
The problem I am trying to solve is there is another condition -- once I find the line with "ABC Header", the following line must contain "XYZ Detail". Basically, the file contains multiple lines that have the text "ABC Header" in it, but not all of these lines are followed by "XYZ Detail". I only need those where both of these lines are present together.
How do I do this? I tried to add .Where after the SkipWhile, but that doesn't guarantee the "XYZ Detail" line is immediately following the "ABC Header" line. Thanks!

// See https://aka.ms/new-console-template for more information
using Experiment72045808;
string? candidate = null;
List<(string, string)> results = new();
foreach (var entry in FileReader.ReadLines())
{
if (entry is null) continue;
if (candidate is null)
{
if (entry.Contains("ABC Header"))
{
candidate = entry;
}
}
else
{
// This will handle two adjacend ABC Header - Lines.
if (entry.Contains("ABC Header"))
{
candidate = entry;
continue;
}
// Add to result set and reset.
if (entry.Contains("XYZ Detail"))
{
results.Add((candidate, entry));
candidate = null;
}
}
}
Console.WriteLine("Found results:");
foreach (var result in results)
{
Console.WriteLine($"{result.Item1} / {result.Item2}");
}
resulted in output:
Found results:
ABC Header 2 / XYZ Detail 2
ABC Header 3 / XYZ Detail 3
For input File
Line 1
Line 2
ABC Header 1
Line 4
Line 5
ABC Header 2
XYZ Detail 2
Line 6
Line7
ABC Header 3
XYZ Detail 3
Line 8
Line 9
FileReader.ReadResults() in my test is implemented nearly identical to yours.

Trying to match multiple words multiple times, any order using regex

I'm trying to check if a text contains two or more specific words. The words can be in any order an can show up in the text multiple times but at least once.
If the text is a match I will need to get the information about location of the words.
Lets say we have the text :
"Once I went to a store and bought a coke for a dollar and I got another coke for free"
In this example I want to match the words coke and dollar.
So the result should be:
coke : index 37, lenght 4
dollar : index 48, length 6
coke : index 84, length 4
What I have already is this: (which I think is little bit wrong because it should contain each word at least once so the + should be there instead of the *)
(?:(\bcoke\b))\*(?:(\bdollar\b))\*
But with that regex the RegEx Buddy highlights all the three words if I ask it to hightlight group 1 and group 2.
But when I run this in C# I won't get any results.
Can you point me to the right direction ?

I don't think it's possible what you want only using regular expressions.
Here is a possible solution using regular expressions and linq:
var words = new HashSet<string>(StringComparer.OrdinalIgnoreCase) { "coke", "dollar" };
var regex = new Regex(#"\b(?:"+string.Join("|", words)+#")\b", RegexOptions.IgnoreCase);
var text = #"Once I went to a store and bought a coke
for a dollar and I got another coke for free";
var grouped = regex.Matches(text)
.OfType<Match>()
.GroupBy(m => m.Value, StringComparer.OrdinalIgnoreCase)
.ToArray();
if (grouped.Length != words.Count)
{
//not all words were found
}
else
{
foreach (var g in grouped)
{
Console.WriteLine("Found: " + g.Key);
foreach (var match in g)
Console.WriteLine(" At {0} length {1}", match.Index, match.Length);
}
}
Output:
Found: coke
At 36 length 4
At 72 length 4
Found: dollar
At 47 length 6

How about this, it is pret-tay bad but I think it has a shot at working and it is pure RegEx no extra tools.
(?:^|\W)[cC][oO][kK][eE](?:$|\W)|(?:^|\W)[dD][oO][lL][lL][aA][rR](?:$|\W)
Get rid of the \w's if you want it to capture cokeDollar or dollarCoKe etc.

regular expression of 3 groups check

I'd like a group of 3 values in the following regular expression and input string
With the help of the SO experts this is what I have:
string item = "strawb bana 1 10 1.93";
string pattern = #"(?<str>[\w\s]*)(?<qty>\s\d*\s)(?<num>\d*\.\d+)";
Basically,
The first value is going to be the product description. I put a 1 on the end just in case the description has a number in it.
The second value is the quantity.
The third value is price.
Does this look correct? Might I be missing other cases?
Result should be the following
Group 1 = "strawb bana 1"
Group 2 = "10"
Group 3 = "1.93"

It looks like you forgot to include digits in your first match.
string item = "strawb bana 1 10 1.93";
string pattern = #"(?<str>[\w\s]*)(?<qty>\s\d*\s)(?<num>\d*\.\d+)";
Should be:
string item = "strawb bana 1 10 1.93";
string pattern = #"(?<str>[\w\s\d]*)(?<qty>\s\d*\s)(?<num>\d*\.\d+)";

Need multiple regular expression matches using C#

So I have this list of flight data and I need to be able to parse through it using regular expressions (this isn't the entire list).
1 AA2401 F7 A4 Y7 B7 M7 H7 K7 /DFW A LAX 4 0715 0836 E0.M80 9 3:21
2 AA2421 F7 A1 Y7 B7 M7 H7 K7 DFWLAX 4 1106 1215 E0.777 7 3:09
3UA:US6352 B9 M9 H9 K0 /DFW 1 LAX 1200 1448 E0.733 1:48
For example, I might need from the first line 1, AA, 2401, and so on and so on. Now, I'm not asking for someone to come up with a regular expression for me because for the most part I'm getting to where I can pretty much handle that myself. My issue has more to do with being able to store the data some where and access it.
So I'm just trying to initially just "match" the first piece of data I need, which is the line number '1'. My "pattern" for just getting the first number is: ".?(\d{1,2}).*" . The reason it's {1,2} is because obviously once you get past 10 it needs to be able to take 2 numbers. The rest of the line is set up so that it will definitely be a space or a letter.
Here's the code:
var assembly = Assembly.GetExecutingAssembly();
var textStreamReader = new StreamReader(
assembly.GetManifestResourceStream("FlightParser.flightdata.txt"));
List<string> lines = new List<string>();
do
{
lines.Add(textStreamReader.ReadLine());
} while (!textStreamReader.EndOfStream);
Regex sPattern = new Regex(#".?(\d{1,2}).*");//whatever the pattern is
foreach (string line in lines)
{
System.Console.Write("{0,24}", line);
MatchCollection mc = sPattern.Matches(line);
if ( sPattern.IsMatch(line))
{
System.Console.WriteLine(" (match for '{0}' found)", sPattern);
}
else
{
System.Console.WriteLine();
}
System.Console.WriteLine(mc[0].Groups[0].Captures);
System.Console.WriteLine(line);
}//end foreach
System.Console.ReadLine();
With the code I'm writing, I'm basically just trying to get '1' into the match collection and somehow access it and write it to the console (for the sake of testing, that's not the ultimate goal).

Your regex pattern includes an asterisk which matches any number of characters - ie. the whole line. Remove the "*" and it will only match the "1". You may find an online RegEx tester such as this useful.

Assuming your file is not actually formatted as you posted and has each of the fields separated by something, you can match the first two-digit number of the line with this regex (ignoring 0 and leading zeros):
^\s*([1-9]\d?)
Since it is grouped, you can access the matched part through the Groups property of the Match object.
var line = "12 foobar blah 123 etc";
var re = new Regex(#"^\s*([1-9]\d?)");
var match = re.Match(line);
if (match.Success)
{
Console.WriteLine(match.Groups[1].Value); // "12"
}
else
{
Console.WriteLine("No match");
}

The following expression matches the first digit, that you wanted to capture, in the group "First".
^\s*(?<First>\d{1})
I find this regular expression tool highly useful when dealing with regex. Give it a try.
Also set RegexOption to Multiline when you are making the match.

What's the difference between "groups" and "captures" in .NET regular expressions?

I'm a little fuzzy on what the difference between a "group" and a "capture" are when it comes to .NET's regular expression language. Consider the following C# code:
MatchCollection matches = Regex.Matches("{Q}", #"^\{([A-Z])\}$");
I expect this to result in a single capture for the letter 'Q', but if I print the properties of the returned MatchCollection, I see:
matches.Count: 1
matches[0].Value: {Q}
matches[0].Captures.Count: 1
matches[0].Captures[0].Value: {Q}
matches[0].Groups.Count: 2
matches[0].Groups[0].Value: {Q}
matches[0].Groups[0].Captures.Count: 1
matches[0].Groups[0].Captures[0].Value: {Q}
matches[0].Groups[1].Value: Q
matches[0].Groups[1].Captures.Count: 1
matches[0].Groups[1].Captures[0].Value: Q
What exactly is going on here? I understand that there's also a capture for the entire match, but how do the groups come in? And why doesn't matches[0].Captures include the capture for the letter 'Q'?

You won't be the first who's fuzzy about it. Here's what the famous Jeffrey Friedl has to say about it (pages 437+):
Depending on your view, it either adds
an interesting new dimension to the
match results, or adds confusion and
bloat.
And further on:
The main difference between a Group
object and a Capture object is that
each Group object contains a
collection of Captures representing
all the intermediary matches by the
group during the match, as well as the
final text matched by the group.
And a few pages later, this is his conclusion:
After getting past the .NET
documentation and actually
understanding what these objects add,
I've got mixed feelings about them. On
one hand, it's an interesting
innovation [..] on the other hand, it
seems to add an efficiency burden [..]
of a functionality that won't be used
in the majority of cases
In other words: they are very similar, but occasionally and as it happens, you'll find a use for them. Before you grow another grey beard, you may even get fond of the Captures...
Since neither the above, nor what's said in the other post really seems to answer your question, consider the following. Think of Captures as a kind of history tracker. When the regex makes his match, it goes through the string from left to right (ignoring backtracking for a moment) and when it encounters a matching capturing parentheses, it will store that in $x (x being any digit), let's say $1.
Normal regex engines, when the capturing parentheses are to be repeated, will throw away the current $1 and will replace it with the new value. Not .NET, which will keep this history and places it in Captures[0].
If we change your regex to look as follows:
MatchCollection matches = Regex.Matches("{Q}{R}{S}", #"(\{[A-Z]\})+");
you will notice that the first Group will have one Captures (the first group always being the whole match, i.e., equal to $0) and the second group will hold {S}, i.e. only the last matching group. However, and here's the catch, if you want to find the other two catches, they're in Captures, which contains all intermediary captures for {Q} {R} and {S}.
If you ever wondered how you could get from the multiple-capture, which only shows last match to the individual captures that are clearly there in the string, you must use Captures.
A final word on your final question: the total match always has one total Capture, don't mix that with the individual Groups. Captures are only interesting inside groups.

This can be explained with a simple example (and pictures).
Matching 3:10pm with the regular expression ((\d)+):((\d)+)(am|pm), and using Mono interactive csharp:
csharp> Regex.Match("3:10pm", #"((\d)+):((\d)+)(am|pm)").
> Groups.Cast<Group>().
> Zip(Enumerable.Range(0, int.MaxValue), (g, n) => "[" + n + "] " + g);
{ "[0] 3:10pm", "[1] 3", "[2] 3", "[3] 10", "[4] 0", "[5] pm" }
So where's the 1?
Since there are multiple digits that match on the fourth group, we only "get at" the last match if we reference the group (with an implicit ToString(), that is). In order to expose the intermediate matches, we need to go deeper and reference the Captures property on the group in question:
csharp> Regex.Match("3:10pm", #"((\d)+):((\d)+)(am|pm)").
> Groups.Cast<Group>().
> Skip(4).First().Captures.Cast<Capture>().
> Zip(Enumerable.Range(0, int.MaxValue), (c, n) => "["+n+"] " + c);
{ "[0] 1", "[1] 0" }
Courtesy of this article.

A Group is what we have associated with groups in regular expressions
"(a[zx](b?))"
Applied to "axb" returns an array of 3 groups:
group 0: axb, the entire match.
group 1: axb, the first group matched.
group 2: b, the second group matched.
except that these are only 'captured' groups. Non capturing groups (using the '(?: ' syntax are not represented here.
"(a[zx](?:b?))"
Applied to "axb" returns an array of 2 groups:
group 0: axb, the entire match.
group 1: axb, the first group matched.
A Capture is also what we have associated with 'captured groups'. But when the group is applied with a quantifier multiple times, only the last match is kept as the group's match. The captures array stores all of these matches.
"(a[zx]\s+)+"
Applied to "ax az ax" returns an array of 2 captures of the second group.
group 1, capture 0 "ax "
group 1, capture 1 "az "
As for your last question -- I would have thought before looking into this that Captures would be an array of the captures ordered by the group they belong to. Rather it is just an alias to the groups[0].Captures. Pretty useless..

From the MSDN documentation:
The real utility of the Captures property occurs when a quantifier is applied to a capturing group so that the group captures multiple substrings in a single regular expression. In this case, the Group object contains information about the last captured substring, whereas the Captures property contains information about all the substrings captured by the group. In the following example, the regular expression \b(\w+\s*)+. matches an entire sentence that ends in a period. The group (\w+\s*)+ captures the individual words in the collection. Because the Group collection contains information only about the last captured substring, it captures the last word in the sentence, "sentence". However, each word captured by the group is available from the collection returned by the Captures property.

Imagine you have the following text input dogcatcatcat and a pattern like dog(cat(catcat))
In this case, you have 3 groups, the first one (major group) corresponds to the match.
Match == dogcatcatcat and Group0 == dogcatcatcat
Group1 == catcatcat
Group2 == catcat
So what it's all about?
Let's consider a little example written in C# (.NET) using Regex class.
int matchIndex = 0;
int groupIndex = 0;
int captureIndex = 0;
foreach (Match match in Regex.Matches(
"dogcatabcdefghidogcatkjlmnopqr", // input
#"(dog(cat(...)(...)(...)))") // pattern
)
{
Console.Out.WriteLine($"match{matchIndex++} = {match}");
foreach (Group #group in match.Groups)
{
Console.Out.WriteLine($"\tgroup{groupIndex++} = {#group}");
foreach (Capture capture in #group.Captures)
{
Console.Out.WriteLine($"\t\tcapture{captureIndex++} = {capture}");
}
captureIndex = 0;
}
groupIndex = 0;
Console.Out.WriteLine();
}
Output:
match0 = dogcatabcdefghi
group0 = dogcatabcdefghi
capture0 = dogcatabcdefghi
group1 = dogcatabcdefghi
capture0 = dogcatabcdefghi
group2 = catabcdefghi
capture0 = catabcdefghi
group3 = abc
capture0 = abc
group4 = def
capture0 = def
group5 = ghi
capture0 = ghi
match1 = dogcatkjlmnopqr
group0 = dogcatkjlmnopqr
capture0 = dogcatkjlmnopqr
group1 = dogcatkjlmnopqr
capture0 = dogcatkjlmnopqr
group2 = catkjlmnopqr
capture0 = catkjlmnopqr
group3 = kjl
capture0 = kjl
group4 = mno
capture0 = mno
group5 = pqr
capture0 = pqr
Let's analyze just the first match (match0).
As you can see there are three minor groups: group3, group4 and group5
group3 = kjl
capture0 = kjl
group4 = mno
capture0 = mno
group5 = pqr
capture0 = pqr
Those groups (3-5) were created because of the 'subpattern' (...)(...)(...) of the main pattern (dog(cat(...)(...)(...)))
Value of group3 corresponds to it's capture (capture0). (As in the case of group4 and group5). That's because there are no group repetition like (...){3}.
Ok, let's consider another example where there is a group repetition.
If we modify the regular expression pattern to be matched (for code shown above)
from (dog(cat(...)(...)(...))) to (dog(cat(...){3})),
you'll notice that there is the following group repetition: (...){3}.
Now the Output has changed:
match0 = dogcatabcdefghi
group0 = dogcatabcdefghi
capture0 = dogcatabcdefghi
group1 = dogcatabcdefghi
capture0 = dogcatabcdefghi
group2 = catabcdefghi
capture0 = catabcdefghi
group3 = ghi
capture0 = abc
capture1 = def
capture2 = ghi
match1 = dogcatkjlmnopqr
group0 = dogcatkjlmnopqr
capture0 = dogcatkjlmnopqr
group1 = dogcatkjlmnopqr
capture0 = dogcatkjlmnopqr
group2 = catkjlmnopqr
capture0 = catkjlmnopqr
group3 = pqr
capture0 = kjl
capture1 = mno
capture2 = pqr
Again, let's analyze just the first match (match0).
There are no more minor groups group4 and group5 because of (...){3} repetition ({n} wherein n>=2)
they've been merged into one single group group3.
In this case, the group3 value corresponds to it's capture2 (the last capture, in other words).
Thus if you need all the 3 inner captures (capture0, capture1, capture2) you'll have to cycle through the group's Captures collection.
Сonclusion is: pay attention to the way you design your pattern's groups.
You should think upfront what behavior causes group's specification, like (...)(...), (...){2} or (.{3}){2} etc.
Hopefully it will help shed some light on the differences between Captures, Groups and Matches as well.

Develop Reference

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.

C# Regular Expression: How to extract a collection - c#

I think you might need to make your captures non-greedy... (?<Name>.?) instead of (?<Name>.)

Related

TakeWhile() with conditions in multiple lines

Trying to match multiple words multiple times, any order using regex

regular expression of 3 groups check

Need multiple regular expression matches using C#

What's the difference between "groups" and "captures" in .NET regular expressions?

Categories

Resources

Develop Reference

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.

C# Regular Expression: How to extract a collection - c#

I think you might need to make your captures non-greedy... (?<Name>.*?) instead of (?<Name>.*)

Related

TakeWhile() with conditions in multiple lines

Trying to match multiple words multiple times, any order using regex

regular expression of 3 groups check

Need multiple regular expression matches using C#

What's the difference between "groups" and "captures" in .NET regular expressions?

Categories

Resources

I think you might need to make your captures non-greedy... (?<Name>.?) instead of (?<Name>.)