Matching balanced parentheses before a character - c#

I need to match a string within balanced parentheses before a literal period in c#. My regex with balanced groups works except when there are extra open parens in the string. According to my understanding, this requires a conditional fail pattern to ensure the stack is empty on match, yet something is not quite right.
Original regex:
#"(?<Par>[(]).+(?<-Par>[)])\."
With fail-pattern:
#"(?<Par>[(]).+(?<-Par>[)])(?(Par)(?!))\."
Test-code (last 2 fail):
string[] tests = {
"a.c", "",
"a).c", "",
"(a.c", "",
"a(a).c", "(a).",
"a(a b).c", "(a b).",
"a((a b)).c", "((a b)).",
"a(((a b))).c", "(((a b))).",
"a((a) (b)).c", "((a) (b)).",
"a((a)(b)).c", "((a)(b)).",
"a((ab)).c", "((ab)).",
"a)((ab)).(c", "((ab)).",
"a(((a b)).c", "((a b)).",
"a(((a b)).)c", "((a b))."
};
Regex re = new Regex(#"(?<Par>[(]).+(?<-Par>[)])(?(Par)(?!))\.");
for (int i = 0; i < tests.Length; i += 2)
{
var result = re.Match(tests[i]).Groups[0].Value;
if (result != tests[i + 1]) throw new Exception
("Expecting: " + tests[i + 1] + ", got " + result);
}

You may use a well-known regex to match balanced parentheses and just append a \. to it:
\((?>[^()]+|(?<o>)\(|(?<-o>)\))*(?(o)(?!))\)\.
|---------- balanced parens part ----------|.|
See the regex demo.
Details
\( - a (
(?> - start of an atomic group
[^()]+ - 1 or more chars other than ( and )
| - or
(?<o>)\( - an opening ( is pushed on to the Group o stack
| - or
(?<-o>)\) - a closing ( is popped off the Group o stack
)* - 0 or more repetitions of the atomic group
(?(o)(?!)) - fail the match if Group o stack is not empty
\) - a )
\. - a dot.

Related

Regex to replace a symbol & within quotes C#

I'm trying to replace '&' inside quotes.
Input
"I & my friends are stuck here", & we can't resolve
Output
"I and my friends are stuck here", & we can't resolve
Replace '&' by 'and' and only inside quotes, could you please help?
By far the quickest way is to use the \G construct and do it with a single regex.
C# code
var str =
"\"I & my friends are stuck here & we can't get up\", & we can't resolve\n" +
"=> \"I and my friends are stuck here and we can't get up\", & we can't resolve\n";
var rx = #"((?:""(?=[^""]*"")|(?<!""|^)\G)[^""&]*)(?:(&)|(""))";
var res = Regex.Replace(str, rx, m =>
// Replace the ampersands inside double quotes with 'and'
m.Groups[1].Value + (m.Groups[2].Value.Length > 0 ? "and" : m.Groups[3].Value));
Console.WriteLine(res);
Output
"I and my friends are stuck here and we can't get up", & we can't resolve
=> "I and my friends are stuck here and we can't get up", & we can't resolve
Regex ((?:"(?=[^"]*")|(?<!"|^)\G)[^"&]*)(?:(&)|("))
https://regex101.com/r/db8VkQ/1
Explained
( # (1 start), Preamble
(?: # Block
" # Begin of quote
(?= [^"]* " ) # One-time check for close quote
| # or,
(?<! " | ^ ) # If not a quote behind or BOS
\G # Start match where last left off
)
[^"&]* # Many non-quote, non-ampersand
) # (1 end)
(?: # Body
( & ) # (2), Ampersand, replace with 'and'
| # or,
( " ) # (3), End of quote, just put back "
)
Benchmark
Regex1: ((?:"(?=[^"]*")|(?<!"|^)\G)[^"&]*)(?:(&)|("))
Completed iterations: 50 / 50 ( x 1000 )
Matches found per iteration: 10
Elapsed Time: 2.21 s, 2209.03 ms, 2209035 ยตs
Matches per sec: 226,343
Use
Regex.Replace(s, "\"[^\"]*\"", m => Regex.Replace(m.Value, #"\B&\B", "and"))
See the C# demo:
using System;
using System.Linq;
using System.Text.RegularExpressions;
public class Test
{
public static void Main()
{
var s = "\"I & my friends are stuck here\", & we can't resolve";
Console.WriteLine(
Regex.Replace(s, "\"[^\"]*\"", m => Regex.Replace(m.Value, #"\B&\B", "and"))
);
}
}
Output: "I and my friends are stuck here", & we can't resolve

Removal of colon and carriage returns and replace with colon

I'm working on a project where I have a HMTL fragment which needs to be cleaned up - the HTML has been removed and as a result of table being removed, there are some strange ends where they shouldnt be :-)
the characters as they appear are
a space at the beginning of a line
a colon, carriage return and linefeed at the end of the line - which needs to be replaced simply with the colon;
I am presently using regex as follows:
s = Regex.Replace(s, #"(:[\r\n])", ":", RegexOptions.Multiline | RegexOptions.IgnoreCase);
// gets rid of the leading space
s = Regex.Replace(s, #"(^[( )])", "", RegexOptions.Multiline | RegexOptions.IgnoreCase);
Example of what I am dealing with:
Tomas Adams
Solicitor
APLawyers
p:
1800 995 718
f:
07 3102 9135
a:
22 Fultam Street
PO Box 132, Booboobawah QLD 4113
which should look like:
Tomas Adams
Solicitor
APLawyers
p:1800 995 718
f:07 3102 9135
a:22 Fultam Street
PO Box 132, Booboobawah QLD 4313
as my attempt to clean the string, but the result is far from perfect ... Can someone assist me to correct the error and achive my goal ...
[EDIT]
the offending characters
f:\r\n07 3102 9135\r\na:\r\n22
the combination of :\r\n should be replaced by a single colon.
MTIA
Darrin
You may use
var result = Regex.Replace(s, #"(?m)^\s+|(?<=:)(?:\r?\n)+|(\r?\n){2,}", "$1")
See the .NET regex demo.
Details
(?m) - equal to RegexOptions.Multiline - makes ^ match the start of any line here
^ - start of a line
\s+ - 1+ whitespaces
| - or
(?<=:)(?:\r?\n)+ - a position that is immediately preceded with : (matched with (?<=:) positive lookbehind) followed with 1+ occurrences of an optional CR and LF (those are removed)
| - or
(\r?\n){2,} - two or more consecutive occurrences of an optional CR followed with an LF symbol. Only the last occurrence is saved in Group 1 memory buffer, thus the $1 replacement pattern inserts that last, single, occurrence.
A basic solution without Regex:
var lines = input.Split(new []{"\n"}, StringSplitOptions.RemoveEmptyEntries);
var output = new StringBuilder();
for (var i = 0; i < lines.Length; i++)
{
if (lines[i].EndsWith(":")) // feel free to also check for the size
{
lines[i + 1] = lines[i] + lines[i + 1];
continue;
}
output.AppendLine(lines[i].Trim()); // remove space before or after a line
}
Try it Online!
I tried to use your regular expression.I was able to replace "\n" and ":" with the following regular expression.This is removing ":" and "\n" at the end of the line.
#"([:\r\n])"
A Linq solution without Regex:
var tmp = string.Empty;
var output = input.Split(new []{"\n"}, StringSplitOptions.RemoveEmptyEntries).Aggregate(new StringBuilder(), (a,b) => {
if (b.EndsWith(":")) { // feel free to also check for the size
tmp = b;
}
else {
a.AppendLine((tmp + b).Trim()); // remove space before or after a line
tmp = string.Empty;
}
return a;
});
Try it Online!

How to write a regular expression that captures tags in a comma-separated list?

Here is my input:
#
tag1, tag with space, !##%^, ๐Ÿฆ„
I would like to match it with a regex and yield the following elements easily:
tag1
tag with space
!##%^
๐Ÿฆ„
I know I could do it this way:
var match = Regex.Match(input, #"^#[\n](?<tags>[\S ]+)$");
// if match is a success
var tags = match.Groups["tags"].Value.Split(',').Select(x => x.Trim());
But that's cheating, as it involves messing around with C#. There must be a neat way to do this with a regex. Just must be... right? ;D
The question is: how to write a regular expression that would allow me to iterate through captures and extract tags, without the need of splitting and trimming?
This works (?ms)^\#\s+(?:\s*((?:(?!,|^\#\s+).)*?)\s*(?:,|$))+
It uses C#'s Capture Collection to find a variable amount of field data
in a single record.
You could extend the regex further to get all records at once.
Where each record contains its own variable amount of field data.
The regex has built-in trimming as well.
Expanded:
(?ms) # Inline modifiers: multi-line, dot-all
^ \# \s+ # Beginning of record
(?: # Quantified group, 1 or more times, get all fields of record at once
\s* # Trim leading wsp
( # (1 start), # Capture collector for variable fields
(?: # One char at a time, but not comma or begin of record
(?!
,
| ^ \# \s+
)
.
)*?
) # (1 end)
\s*
(?: , | $ ) # End of this field, comma or EOL
)+
C# code:
string sOL = #"
#
tag1, tag with space, !##%^, ๐Ÿฆ„";
Regex RxOL = new Regex(#"(?ms)^\#\s+(?:\s*((?:(?!,|^\#\s+).)*?)\s*(?:,|$))+");
Match _mOL = RxOL.Match(sOL);
while (_mOL.Success)
{
CaptureCollection ccOL1 = _mOL.Groups[1].Captures;
Console.WriteLine("-------------------------");
for (int i = 0; i < ccOL1.Count; i++)
Console.WriteLine(" '{0}'", ccOL1[i].Value );
_mOL = _mOL.NextMatch();
}
Output:
-------------------------
'tag1'
'tag with space'
'!##%^'
'??'
''
Press any key to continue . . .
Nothing wrong with cheating ;]
string input = #"#
tag1, tag with space, !##%^, ๐Ÿฆ„";
string[] tags = Array.ConvertAll(input.Split('\n').Last().Split(','), s => s.Trim());
You can pretty much make it without regex. Just split it like this:
var result = input.Split(new []{'\n','\r'}, StringSplitOptions.RemoveEmptyEntries).Skip(1).SelectMany(x=> x.Split(new []{','},StringSplitOptions.RemoveEmptyEntries).Select(y=> y.Trim()));

Regex with balancing groups

I need to write regex that capture generic arguments (that also can be generic) of type name in special notation like this:
System.Action[Int32,Dictionary[Int32,Int32],Int32]
lets assume type name is [\w.]+ and parameter is [\w.,\[\]]+
so I need to grab only Int32, Dictionary[Int32,Int32] and Int32
Basically I need to take something if balancing group stack is empty, but I don't really understand how.
UPD
The answer below helped me solve the problem fast (but without proper validation and with depth limitation = 1), but I've managed to do it with group balancing:
^[\w.]+ #Type name
\[(?<delim>) #Opening bracet and first delimiter
[\w.]+ #Minimal content
(
[\w.]+
((?(open)|(?<param-delim>)),(?(open)|(?<delim>)))* #Cutting param if balanced before comma and placing delimiter
((?<open>\[))* #Counting [
((?<-open>\]))* #Counting ]
)*
(?(open)|(?<param-delim>))\] #Cutting last param if balanced
(?(open)(?!) #Checking balance
)$
Demo
UPD2 (Last optimization)
^[\w.]+
\[(?<delim>)
[\w.]+
(?:
(?:(?(open)|(?<param-delim>)),(?(open)|(?<delim>))[\w.]+)?
(?:(?<open>\[)[\w.]+)?
(?:(?<-open>\]))*
)*
(?(open)|(?<param-delim>))\]
(?(open)(?!)
)$
I suggest capturing those values using
\w+(?:\.\w+)*\[(?:,?(?<res>\w+(?:\[[^][]*])?))*
See the regex demo.
Details:
\w+(?:\.\w+)* - match 1+ word chars followed with . + 1+ word chars 1 or more times
\[ - a literal [
(?:,?(?<res>\w+(?:\[[^][]*])?))* - 0 or more sequences of:
,? - an optional comma
(?<res>\w+(?:\[[^][]*])?) - Group "res" capturing:
\w+ - one or more word chars (perhaps, you would like [\w.]+)
(?:\[[^][]*])? - 1 or 0 (change ? to * to match 1 or more) sequences of a [, 0+ chars other than [ and ], and a closing ].
A C# demo below:
var line = "System.Action[Int32,Dictionary[Int32,Int32],Int32]";
var pattern = #"\w+(?:\.\w+)*\[(?:,?(?<res>\w+(?:\[[^][]*])?))*";
var result = Regex.Matches(line, pattern)
.Cast<Match>()
.SelectMany(x => x.Groups["res"].Captures.Cast<Capture>()
.Select(t => t.Value))
.ToList();
foreach (var s in result) // DEMO
Console.WriteLine(s);
UPDATE: To account for unknown depth [...] substrings, use
\w+(?:\.\w+)*\[(?:\s*,?\s*(?<res>\w+(?:\[(?>[^][]+|(?<o>\[)|(?<-o>]))*(?(o)(?!))])?))*
See the regex demo

C# Regex Grouping

I have to write a function that will get a string and it will have 2 forms:
XX..X,YY..Y where XX..X are max 4 characters and YY..Y are max 26 characters(X and Y are digits or A or B)
XX..X where XX..X are max 8 characters (X is digit or A or B)
e.g. 12A,784B52 or 4453AB
How can i user Regex grouping to match this behavior?
Thanks.
p.s. sorry if this is to localized
You can use named captures for this:
Regex regexObj = new Regex(
#"\b # Match a word boundary
(?: # Either match
(?<X>[AB\d]{1,4}) # 1-4 characters --> group X
, # comma
(?<Y>[AB\d]{1,26}) # 1-26 characters --> group Y
| # or
(?<X>[AB\d]{1,8}) # 1-8 characters --> group X
) # End of alternation
\b # Match a word boundary",
RegexOptions.IgnorePatternWhitespace);
X = regexObj.Match(subjectString).Groups["X"].Value;
Y = regexObj.Match(subjectString).Groups["Y"].Value;
I don't know what happens if there is no group Y, perhaps you might need to wrap the last line in an if statement.

Categories

Resources