How can I exclude the first match in a regular expression? - c#

I have the following regex, so far:
([0-9]+){1}\s*[xX]\s*([A-Za-z\./%\$\s\*]+)
to be used on strings such as:
2x Soup, 2x Meat Balls, 4x Iced Tea
My intent was to capture the number of times something was ordered, as well as the name of item ordered.
In this regular expression however, the multiplier 'x' gets captured before the title.
How can I make it so that the x is ignored, and what comes after the x (and a space) is captured?

You can't ignore something in the middle of the pattern. Therefore you do have your capturing groups.
([0-9]+){1}\s*[xX]\s*([A-Za-z\./%\$\s\*]+)
^^^^^^^^ ^^^^^^^^^^^^^^^^^^^^^
The marked parts of your pattern are stored in capturing groups, because of the brackets around them.
Your number is in group 1 and the name is in group 2. The "x" is not captured in a group.
How you now access your groups depends on the language you are using.
Btw. the {1} is obsolete.
So for c# try this:
string text = "2x Soup, 2x Meat Balls, 4x Iced Tea";
MatchCollection result = Regex.Matches(text, #"([0-9]+)\s*[xX]\s*([A-Za-z\./%\$\s\*]+)");
int counter = 0;
foreach (Match m in result)
{
counter++;
Console.WriteLine("Order {0}: " + m.Groups[1] + " " + m.Groups[2], counter);
}
Console.ReadLine();
Further I would change the regex to this, since it seems you want to match as name every character till the next comma
#"([0-9]+)\s*x\s*([^,]+)"
and use RegexOptions.IgnoreCase to avoid having to write [xX]

Related

Is it possible to have overlapping regex matches?

Take this data as an example:
ID: JK546|Guitar: 0|Piano: 1|Violin: 0|Expiry: Aug14,2021
I was wondering if it's possible to create a regex that will return this set of matches
ID: JK546|Guitar: 0|Expiry: Aug14,2021
ID: JK546|Piano: 1|Expiry: Aug14,2021
ID: JK546|Violin: 0|Expiry: Aug14,2021
I did try creating one below:
ID: (?<id>\w+).*\|(?<instrument>\w+):\s(?<count>\d).*Expiry:\s(?<expiry>[\w\d]+)
but it only returned the one with the violin instrument. I would highly appreciate your insights on this.
I would not use a regular expression. Especially since the string ID: JK546|Guitar: 0|Expiry: Aug14,2021 does not appear in the string ID: JK546|Guitar: 0|Piano: 1|Violin: 0|Expiry: Aug14,2021, so it's not strictly a match, but more of a replacement. But there's no good way to get all replacements from all matches.
So, I'd just split the input string on |.
Then you want to compose a result string that is comprised of the first field, one of the middle fields, and the last field. You'll get one result for each middle field that exists. If it splits into N fields, you'll get N-2 results. e.g.: if it splits into 5 fields, then you'll get 3 results, one for each of the "middle" fields.
string input = "ID: JK546|Guitar: 0|Piano: 1|Violin: 0|Expiry: Aug14,2021";
string[] fields = input.Split('|');
for( int i = 1; i < fields.Length - 1; ++i) {
string result = string.Join("|", fields.First(), fields[i], fields.Last());
Console.WriteLine(result);
}
output:
ID: JK546|Guitar: 0|Expiry: Aug14,2021
ID: JK546|Piano: 1|Expiry: Aug14,2021
ID: JK546|Violin: 0|Expiry: Aug14,2021
A single regular expression to return multiple matches on multiple calls? 
I wonder whether that is possible.
I’m not familiar with how to do regex processing in C#,
but this sed command will do what you want. 
Perhaps you can understand how it works and adapt it to your needs:
sed -n ':loop; h; s/^\([^|]*|[^|]*\).*\(|.*\)$/\1\2/p; g; s/^\([^|]*\)|[^|]*\(|.*\)$/\1\2/; t loop'
For simplicity, let’s pretend that the input string is “A|B|C|D|E”.
What it does:
-n is the option to tell sed not to print anything automatically
(but only print when told to, with a p command).
:loop is a label for, effectively, a “goto”. 
So use a while loop structure.
h saves the pattern space into the hold space. 
In other words, make a copy of your string.
s/^\([^|]*|[^|]*\).*\(|.*\)$/\1\2/p captures the first two segments
and the last one, and prints the result. 
So “A|B|C|D|E” becomes “A|B|E” (i.e., your first desired output).
g restores the saved string from the hold space into the pattern space. 
In other words, retrieve the copy of the string that you saved.
s/^\([^|]*\)|[^|]*\(|.*\)$/\1\2/ captures the first segment,
skips the second, and then captures the rest. 
So “A|B|C|D|E” becomes “A|C|D|E”.
t loop is the “goto” command. 
It says to go back to the beginning of the loop
if the most recent substitution succeeded. 
In other words, this is the end of the loop,
and the specification of the loop condition.
The second iteration of the loop will change “A|C|D|E” to “A|C|E”
and print it. 
And then change “A|C|D|E” to “A|D|E” and iterate. 
The third iteration of the loop will change “A|D|E” to “A|D|E” and print it. 
(Obviously there is no change, because the .* in the middle of the regex
matches the zero-length string between “A|D” and “|E”.) 
The final substitution changes “A|D|E” to “A|E”,
and then there is nothing left to find.
You can make use of the .NET Groups.Captures property to get the values of Guitar, Piano and Violin.
(ID: \w+\|)(\w+: \d+\|)+(Expiry: \w+,\d+)
The pattern matches:
(ID: \w+\|) Capture group 1 match ID: 1+ word chars and |
(\w+: \d+\|)+ Capture group 2 Repeat 1+ times matching 1+ word chars : 1+ digits |
(Expiry: \w+,\d+) Capture group 3 match Expiry: 1+ word chars , and 1+ digits
See a .NET regex demo | C# demo
For example
var str = "ID: JK546|Guitar: 0|Piano: 1|Violin: 0|Expiry: Aug14,2021";
string pattern = #"(ID: \w+\|)(\w+: \d+\|)+(Expiry: \w+,\d+)";
Match m = Regex.Match(str, pattern);
foreach(Capture c in m.Groups[2].Captures) {
Console.WriteLine(m.Groups[1].Value + c.Value + m.Groups[3].Value);
}
Output
ID: JK546|Guitar: 0|Expiry: Aug14,2021
ID: JK546|Piano: 1|Expiry: Aug14,2021
ID: JK546|Violin: 0|Expiry: Aug14,2021
It should be possible with look behind and look ahead:
string foo = #"ID: JK546 | Guitar: 0 | Piano: 1 | Violin: 0 | Expiry: Aug14,2021";
// First look at "Guitar: 0", "Piano: 1" and "Violin: 0". Then look behind "(?<= )" and search for the ID. Then look ahead "(?= )" and search for Expiry.
string pattern = #"(\w+: \d)(?<=(ID: [A-Z0-9]+).*?)(?=.*?(Expiry: \S+))";
foreach (var match in Regex.Matches(foo, pattern))
{
....
}
Fortunately c# is one of the few languages that can handle variable length look behinds.

Regex C# Matching string from two words in exact order and returning capture of non-matched words

C# Regex
I have the following list of strings:
"New patient, brief"
"New patient, limited"
"Established patient, brief"
"Established patient, limited"
"New diet patient"
"Established diet patient"
"School Physical"
"Deposition, 1 hour"
"Deposition, 2 hour"
I would like to separate these strings into groups using regex.
The first pattern I see is:
"New" or "Established" -- will be the first word of the matched pattern. This word will need to be captured and returned. Of this pattern, "patient" must be present without need to capture. Any word after "patient" must be captured.
I've tried: ((?=.*\bNew\b))(?=.*\bpatient\b)([A-Za-z0-9\-]+)
but the return match gives:
Full match 0-3 `New`
Group 1. 0-0 ``
Group 2. 0-3 `New`
Not at all what I am looking for.
string input = "New patient, limited";
string pattern = #"((?=.*\bNew\b))(?=.*\bpatient\b)([A-Za-z0-9\-]+)";
MatchCollection matches = Regex.Matches(input, pattern);
GroupCollection groups = matches[0].Groups;
foreach (Match match in matches)
{
Console.WriteLine("First word: {0}", match.Groups[1].Value);
Console.WriteLine("Last words: {0}", match.Groups[2].Value);
Console.WriteLine();
}
Console.WriteLine();
Thank you for any help with this.
Edit #1
For strings like "New patient, limited"
output should be: "New" "limited"
For strings like "Deposition, 1 hour" where "hour" is present,
output should be: "Deposition, 1 hour"
For strings where there are no words after "patient" but "patient" is present, like
"New diet patient",
output should be: "New" "diet"
For strings where neither "patient" nor "hour" is present, the entire string should be returned. i.e like "School Physical" should return the entire string,
"School Physical".
As I said, this is my ultimate quest. At the moment, I am trying to focus on separating out only the first pattern :). Much Thanks.
I suggest using
^(?:(?!\b(?:New|Established)\b).)*$|\b(New|Established)\s+(?:patient\b\W*)?(.+)
See the regex demo
Details
^(?:(?!\b(?:New|Established)\b).)*$ - any string that has no New or Established as whole words
| - or
\b(New|Established) - a whole word New or Established (put into Group 1)
\s+ - 1+ whitespaces
(?:patient\b\W*)? - an optional non-capturing group matching 1 or 0 occurrences of patient followed with word boundary and 0+ non-word chars
(.+) - Group 2: any 1 or more chars other than line break chars.
The code will look like
var match = Regex.Match(s, #"^(?:(?!\b(?:New|Established)\b).)*$|\b(New|Established)\s+(?:patient\b\W*)?(.+)");
If Group 1 is not matched (!match.Groups[1].Success), grab the whole match, match.Value. Else, grab match.Groups[1].Value and match.Groups[2].Value.
Results:

Trying to match multiple words multiple times, any order using regex

I'm trying to check if a text contains two or more specific words. The words can be in any order an can show up in the text multiple times but at least once.
If the text is a match I will need to get the information about location of the words.
Lets say we have the text :
"Once I went to a store and bought a coke for a dollar and I got another coke for free"
In this example I want to match the words coke and dollar.
So the result should be:
coke : index 37, lenght 4
dollar : index 48, length 6
coke : index 84, length 4
What I have already is this: (which I think is little bit wrong because it should contain each word at least once so the + should be there instead of the *)
(?:(\bcoke\b))\*(?:(\bdollar\b))\*
But with that regex the RegEx Buddy highlights all the three words if I ask it to hightlight group 1 and group 2.
But when I run this in C# I won't get any results.
Can you point me to the right direction ?
I don't think it's possible what you want only using regular expressions.
Here is a possible solution using regular expressions and linq:
var words = new HashSet<string>(StringComparer.OrdinalIgnoreCase) { "coke", "dollar" };
var regex = new Regex(#"\b(?:"+string.Join("|", words)+#")\b", RegexOptions.IgnoreCase);
var text = #"Once I went to a store and bought a coke
for a dollar and I got another coke for free";
var grouped = regex.Matches(text)
.OfType<Match>()
.GroupBy(m => m.Value, StringComparer.OrdinalIgnoreCase)
.ToArray();
if (grouped.Length != words.Count)
{
//not all words were found
}
else
{
foreach (var g in grouped)
{
Console.WriteLine("Found: " + g.Key);
foreach (var match in g)
Console.WriteLine(" At {0} length {1}", match.Index, match.Length);
}
}
Output:
Found: coke
At 36 length 4
At 72 length 4
Found: dollar
At 47 length 6
How about this, it is pret-tay bad but I think it has a shot at working and it is pure RegEx no extra tools.
(?:^|\W)[cC][oO][kK][eE](?:$|\W)|(?:^|\W)[dD][oO][lL][lL][aA][rR](?:$|\W)
Get rid of the \w's if you want it to capture cokeDollar or dollarCoKe etc.

Regex for checking for compass directions

I'm looking to match the 8 main directions as might appear in a street or location prefix or suffix, such as:
N Main
south I-22
124 Grover Ave SE
This is easy to code using a brute force list of matches and cycle through every match possibility for every street address, matching once with a start-of-string anchor and once with a end-of-string anchor. My blunt starting point is shown farther down, if you want to see it.
My question is if anyone has some clever ideas for compact, fast-executing patterns to accomplish the same thing. You can assume:
Compound directions always start with the north / south component. So I need to match South East but not EastSouth
The pattern should not match [direction]-ern words, like "Northern" or "Southwestern"
The match will always be at the very beginning or very end of the string.
I'm using C#, but I'm just looking for a pattern so I'm not emphasizing the language. /s(outh)?/ is just as good as #"s(outh)?" for me or future readers.
SO emphasizes real problems, so FYI this is one. I'm parsing a few hundred thousand nasty, unvalidated user-typed address strings. I want to check if the start or end of the "street" field (which is free-form jumble of PO boxes, streets, apartments, and straight up invalid junk) begins or ends with a compass direction. I'm trying to deconstruct these free form strings to find similar addresses which may be accidental or intentional variations and obfuscations.
My blunt attempt
Core pattern: /n(orth)?|e(ast)?|s(outh)?|w(est)?|n(orth\s*east|e|orth\s*west|w)|s(outh\s*east|e|outh\s*west|w)/
In a function:
public static Tuple<Match, Match> MatchDirection(String value) {
string patternBase = #"n(orth)?|e(ast)?|s(outh)?|w(est)?|n(orth\s*east|e|orth\s*west|w)|s(outh\s*east|e|outh\s*west|w)";
Match[] matches = new Match[2];
string[] compassPatterns = new[] { #"^(" + patternBase + #")\b", #"\b(" + patternBase + #")$" };
for (int i = 0; i < 2; i++) { matches[i] = Regex.Match(value, compassPatterns[i], RegexOptions.IgnoreCase); }
return new Tuple<Match, Match>(matches[0], matches[1]);
}
In use, where sourceDt is a table with all the addresses:
var parseQuery = sourceDt.AsEnumerable()
.Select((DataRow row) => {
string addr = ((string)row["ADDR_STREET"]).Trim();
Tuple<Match, Match> dirMatches = AddressParser.MatchDirection(addr);
return new string[] { addr, dirMatches.Item1.Value, dirMatches.Item2.Value };
})
Edit: Actually this is probably wrong answer - so keeping it just so people not suggest the same thing - figuring out tokenization for "South East" is task in itself. Also I still doubt RegExp will be very usable either.
Original answer:
Don't... your initial RegExp attempt is already non-readable.
Dictionary look up for each word you want from the tokenized string ("brute force approach") already gives you linear time on length and constant time per word. And it is very easy to customize with new words.
(^[nesw][^n\s]*)|([nesw][^n\s]*$)
So this will match a line that:
begins or ends with a word that:
Begins with a cardinal direction
Doesn't have an n otherwise in it (to get rid of the "-ern"s)
Perl/PCRE compatible expression:
(?xi)
(^)?
\b
(?:
n(?:orth)?
(?:\s* (?: e(?:ast)? | w(?:est)? ))?
|
s(?:outh)?
(?:\s* (?: e(?:ast)? | w(?:est)? ))?
|
e(?:ast)?
|
w(?:est)?
)
\b
(?(1)|$)
I think C# supports all the features used here.

How does this regex find triangular numbers?

Part of a series of educational regex articles, this is a gentle introduction to the concept of nested references.
The first few triangular numbers are:
1 = 1
3 = 1 + 2
6 = 1 + 2 + 3
10 = 1 + 2 + 3 + 4
15 = 1 + 2 + 3 + 4 + 5
There are many ways to check if a number is triangular. There's this interesting technique that uses regular expressions as follows:
Given n, we first create a string of length n filled with the same character
We then match this string against the pattern ^(\1.|^.)+$
n is triangular if and only if this pattern matches the string
Here are some snippets to show that this works in several languages:
PHP (on ideone.com)
$r = '/^(\1.|^.)+$/';
foreach (range(0,50) as $n) {
if (preg_match($r, str_repeat('o', $n))) {
print("$n ");
}
}
Java (on ideone.com)
for (int n = 0; n <= 50; n++) {
String s = new String(new char[n]);
if (s.matches("(\\1.|^.)+")) {
System.out.print(n + " ");
}
}
C# (on ideone.com)
Regex r = new Regex(#"^(\1.|^.)+$");
for (int n = 0; n <= 50; n++) {
if (r.IsMatch("".PadLeft(n))) {
Console.Write("{0} ", n);
}
}
So this regex seems to work, but can someone explain how?
Similar questions
How to determine if a number is a prime with regex?
Explanation
Here's a schematic breakdown of the pattern:
from beginning…
| …to end
| |
^(\1.|^.)+$
\______/|___match
group 1 one-or-more times
The (…) brackets define capturing group 1, and this group is matched repeatedly with +. This subpattern is anchored with ^ and $ to see if it can match the entire string.
Group 1 tries to match this|that alternates:
\1., that is, what group 1 matched (self reference!), plus one of "any" character,
or ^., that is, just "any" one character at the beginning
Note that in group 1, we have a reference to what group 1 matched! This is a nested/self reference, and is the main idea introduced in this example. Keep in mind that when a capturing group is repeated, generally it only keeps the last capture, so the self reference in this case essentially says:
"Try to match what I matched last time, plus one more. That's what I'll match this time."
Similar to a recursion, there has to be a "base case" with self references. At the first iteration of the +, group 1 had not captured anything yet (which is NOT the same as saying that it starts off with an empty string). Hence the second alternation is introduced, as a way to "initialize" group 1, which is that it's allowed to capture one character when it's at the beginning of the string.
So as it is repeated with +, group 1 first tries to match 1 character, then 2, then 3, then 4, etc. The sum of these numbers is a triangular number.
Further explorations
Note that for simplification, we used strings that consists of the same repeating character as our input. Now that we know how this pattern works, we can see that this pattern can also match strings like "1121231234", "aababc", etc.
Note also that if we find that n is a triangular number, i.e. n = 1 + 2 + … + k, the length of the string captured by group 1 at the end will be k.
Both of these points are shown in the following C# snippet (also seen on ideone.com):
Regex r = new Regex(#"^(\1.|^.)+$");
Console.WriteLine(r.IsMatch("aababc")); // True
Console.WriteLine(r.IsMatch("1121231234")); // True
Console.WriteLine(r.IsMatch("iLoveRegEx")); // False
for (int n = 0; n <= 50; n++) {
Match m = r.Match("".PadLeft(n));
if (m.Success) {
Console.WriteLine("{0} = sum(1..{1})", n, m.Groups[1].Length);
}
}
// 1 = sum(1..1)
// 3 = sum(1..2)
// 6 = sum(1..3)
// 10 = sum(1..4)
// 15 = sum(1..5)
// 21 = sum(1..6)
// 28 = sum(1..7)
// 36 = sum(1..8)
// 45 = sum(1..9)
Flavor notes
Not all flavors support nested references. Always familiarize yourself with the quirks of the flavor that you're working with (and consequently, it almost always helps to provide this information whenever you're asking regex-related questions).
In most flavors, the standard regex matching mechanism tries to see if a pattern can match any part of the input string (possibly, but not necessarily, the entire input). This means that you should remember to always anchor your pattern with ^ and $ whenever necessary.
Java is slightly different in that String.matches, Pattern.matches and Matcher.matches attempt to match a pattern against the entire input string. This is why the anchors can be omitted in the above snippet.
Note that in other contexts, you may need to use \A and \Z anchors instead. For example, in multiline mode, ^ and $ match the beginning and end of each line in the input.
One last thing is that in .NET regex, you CAN actually get all the intermediate captures made by a repeated capturing group. In most flavors, you can't: all intermediate captures are lost and you only get to keep the last.
Related questions
(Java) method matches not work well - with examples on how to do prefix/suffix/infix matching
Is there a regex flavor that allows me to count the number of repetitions matched by * and + (.NET!)
Bonus material: Using regex to find power of twos!!!
With very slight modification, you can use the same techniques presented here to find power of twos.
Here's the basic mathematical property that you want to take advantage of:
1 = 1
2 = (1) + 1
4 = (1+2) + 1
8 = (1+2+4) + 1
16 = (1+2+4+8) + 1
32 = (1+2+4+8+16) + 1
The solution is given below (but do try to solve it yourself first!!!!)
(see on ideone.com in PHP, Java, and C#):
^(\1\1|^.)*.$

Categories

Resources