Regex issue with nested groups and using "^" and "$" in the pattern - c#

I have content like this:
var testInput =
"05(testcontent)\r\n" +
"06(testcontent2)\r\n" +
"07(testcontent3)(testcontent4)" +
"08(testcontent5)";
I need to extract one code string and two value strings from each line.
For the first line:
Code: "05"
Value1: "testcontent"
Value2: Empty string.
For the third line:
Code: "07"
Value1: "testcontent3"
Value2: "testcontent4"
The pattern I use:
// (?<Code>[0-9]{2}) - 2 digit number
// \((?<Value1>.+)\) - First value, which is inside the parentheses.
// (\((?<Value2>.+)\))? - Second value, which also is inside the parentheses.
// The second value does not always exist. Which is why it has "?" at its end.
var testPattern = @"(?<Code>[0-9]{2})\((?<Value1>.+)\)(\((?<Value2>.+)\))?";
The code I use:
var testRegex = new Regex(testPattern,
    RegexOptions.Compiled |
    RegexOptions.CultureInvariant |
    RegexOptions.ExplicitCapture |
    RegexOptions.Multiline);
foreach (Match match in testRegex.Matches(testInput))
    Console.WriteLine("{0}: {1} | {2}",
        match.Groups["Code"].Value,
        match.Groups["Value1"].Value,
        match.Groups["Value2"].Value);
The result I get:
05: testcontent |
06: testcontent2 |
07: testcontent3)(testcontent4)08(testcontent5 |
If I add ^ at the start and $ at the end of the pattern, the result is even worse:
07: testcontent3)(testcontent4)08(testcontent5 |
So:
Why do ^ and $ complicate things even more when I have specified RegexOptions.Multiline?
What is wrong with my pattern?

Will you ever have closing parentheses inside your Value1 or Value2? If not, I'd suggest using a negated character class like [^)]+ instead of .+. The reason is that .+ being "greedy" (i.e. repeating as many times as possible) is causing problems in this case.
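As a minimal sketch of that suggestion, assuming the values never contain a closing parenthesis, the question's pattern can be rerun with .+ swapped for [^)]+. The class name NegatedClassDemo and the helper Parse are hypothetical names used only for illustration:

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class NegatedClassDemo
{
    // Same pattern as in the question, with .+ replaced by [^)]+ so that a
    // value can never run past its own closing parenthesis.
    private static readonly Regex LineRegex = new Regex(
        @"(?<Code>[0-9]{2})\((?<Value1>[^)]+)\)(\((?<Value2>[^)]+)\))?",
        RegexOptions.CultureInvariant | RegexOptions.ExplicitCapture);

    // Hypothetical helper: returns one "Code: Value1 | Value2" string per match.
    public static List<string> Parse(string input)
    {
        var rows = new List<string>();
        foreach (Match match in LineRegex.Matches(input))
        {
            rows.Add(string.Format("{0}: {1} | {2}",
                match.Groups["Code"].Value,
                match.Groups["Value1"].Value,
                match.Groups["Value2"].Value));
        }
        return rows;
    }

    public static void Main()
    {
        // Input copied from the question (no line break between 07 and 08).
        var testInput =
            "05(testcontent)\r\n" +
            "06(testcontent2)\r\n" +
            "07(testcontent3)(testcontent4)" +
            "08(testcontent5)";

        foreach (var row in Parse(testInput))
            Console.WriteLine(row);
    }
}
```

With this pattern the 07 and 08 entries come out as separate matches. As for ^ and $: in .NET, $ under RegexOptions.Multiline matches only before "\n", so with "\r\n" line endings the "\r" sits between the closing parenthesis and $. Writing \r?$ is one common workaround.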

Related

Removal of colon and carriage returns and replace with colon

I'm working on a project where I have an HTML fragment which needs to be cleaned up. The HTML has been removed and, as a result of a table being removed, there are some strange line ends where they shouldn't be :-)
the characters as they appear are
a space at the beginning of a line
a colon, carriage return and linefeed at the end of the line - which needs to be replaced simply with the colon;
I am presently using regex as follows:
s = Regex.Replace(s, @"(:[\r\n])", ":", RegexOptions.Multiline | RegexOptions.IgnoreCase);
// gets rid of the leading space
s = Regex.Replace(s, @"(^[( )])", "", RegexOptions.Multiline | RegexOptions.IgnoreCase);
Example of what I am dealing with:
Tomas Adams
Solicitor
APLawyers
p:
1800 995 718
f:
07 3102 9135
a:
22 Fultam Street
PO Box 132, Booboobawah QLD 4113
which should look like:
Tomas Adams
Solicitor
APLawyers
p:1800 995 718
f:07 3102 9135
a:22 Fultam Street
PO Box 132, Booboobawah QLD 4113
The two Regex.Replace calls above are my attempt to clean the string, but the result is far from perfect. Can someone help me correct the error and achieve my goal?
[EDIT]
the offending characters
f:\r\n07 3102 9135\r\na:\r\n22
the combination of :\r\n should be replaced by a single colon.
MTIA
Darrin
You may use
var result = Regex.Replace(s, @"(?m)^\s+|(?<=:)(?:\r?\n)+|(\r?\n){2,}", "$1");
See the .NET regex demo.
Details
(?m) - equal to RegexOptions.Multiline - makes ^ match the start of any line here
^ - start of a line
\s+ - 1+ whitespaces
| - or
(?<=:)(?:\r?\n)+ - a position that is immediately preceded with : (matched with (?<=:) positive lookbehind) followed with 1+ occurrences of an optional CR and LF (those are removed)
| - or
(\r?\n){2,} - two or more consecutive occurrences of an optional CR followed with an LF symbol. Only the last occurrence is saved in Group 1 memory buffer, thus the $1 replacement pattern inserts that last, single, occurrence.
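The three alternatives described above can be tried on a shortened version of the sample from the question. This is only a sketch: the class name ColonCleanupDemo and the helper Clean are hypothetical, and the input is an abbreviated copy of the asker's data with "\r\n" line endings assumed.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ColonCleanupDemo
{
    // Hypothetical wrapper around the single Regex.Replace call from the answer.
    public static string Clean(string s)
    {
        // ^\s+            leading whitespace on a line: removed ($1 is empty here)
        // (?<=:)(\r?\n)+  line breaks right after a colon: removed
        // (\r?\n){2,}     runs of line breaks: collapsed to the last one ($1)
        return Regex.Replace(s, @"(?m)^\s+|(?<=:)(?:\r?\n)+|(\r?\n){2,}", "$1");
    }

    public static void Main()
    {
        var s = " Tomas Adams\r\np:\r\n1800 995 718\r\nf:\r\n07 3102 9135";
        Console.WriteLine(Clean(s));
    }
}
```

In .NET a group that did not take part in the match substitutes as an empty string, which is why the first two alternatives simply delete what they match.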
A basic solution without Regex:
var lines = input.Split(new[] { "\r\n", "\n" }, StringSplitOptions.RemoveEmptyEntries);
var output = new StringBuilder();
for (var i = 0; i < lines.Length; i++)
{
    // A line ending in ":" is a label: prepend it to the next line and skip it.
    // (Feel free to also check for the size.)
    if (lines[i].EndsWith(":") && i + 1 < lines.Length)
    {
        lines[i + 1] = lines[i] + lines[i + 1];
        continue;
    }
    output.AppendLine(lines[i].Trim()); // remove space before or after a line
}
Try it Online!
I tried to use your regular expression. I was able to replace "\n" and ":" with the following regular expression. This removes ":" and "\n" at the end of the line.
@"([:\r\n])"
A Linq solution without Regex:
var tmp = string.Empty;
var output = input
    .Split(new[] { "\r\n", "\n" }, StringSplitOptions.RemoveEmptyEntries)
    .Aggregate(new StringBuilder(), (a, b) =>
    {
        if (b.EndsWith(":")) // feel free to also check for the size
        {
            tmp = b; // remember the label until the next line arrives
        }
        else
        {
            a.AppendLine((tmp + b).Trim()); // remove space before or after a line
            tmp = string.Empty;
        }
        return a;
    });
Try it Online!

How to write a regular expression that captures tags in a comma-separated list?

Here is my input:
#
tag1, tag with space, !##%^, 🦄
I would like to match it with a regex and yield the following elements easily:
tag1
tag with space
!##%^
🦄
I know I could do it this way:
var match = Regex.Match(input, @"^#[\n](?<tags>[\S ]+)$");
// if match is a success
var tags = match.Groups["tags"].Value.Split(',').Select(x => x.Trim());
But that's cheating, as it involves messing around with C#. There must be a neat way to do this with a regex. Just must be... right? ;D
The question is: how to write a regular expression that would allow me to iterate through captures and extract tags, without the need of splitting and trimming?
This works (?ms)^\#\s+(?:\s*((?:(?!,|^\#\s+).)*?)\s*(?:,|$))+
It uses C#'s Capture Collection to find a variable amount of field data
in a single record.
You could extend the regex further to get all records at once.
Where each record contains its own variable amount of field data.
The regex has built-in trimming as well.
Expanded:
(?ms) # Inline modifiers: multi-line, dot-all
^ \# \s+ # Beginning of record
(?: # Quantified group, 1 or more times, get all fields of record at once
\s* # Trim leading wsp
( # (1 start), # Capture collector for variable fields
(?: # One char at a time, but not comma or begin of record
(?!
,
| ^ \# \s+
)
.
)*?
) # (1 end)
\s*
(?: , | $ ) # End of this field, comma or EOL
)+
C# code:
string sOL = @"
#
tag1, tag with space, !##%^, 🦄";
Regex RxOL = new Regex(@"(?ms)^\#\s+(?:\s*((?:(?!,|^\#\s+).)*?)\s*(?:,|$))+");
Match _mOL = RxOL.Match(sOL);
while (_mOL.Success)
{
    CaptureCollection ccOL1 = _mOL.Groups[1].Captures;
    Console.WriteLine("-------------------------");
    for (int i = 0; i < ccOL1.Count; i++)
        Console.WriteLine(" '{0}'", ccOL1[i].Value);
    _mOL = _mOL.NextMatch();
}
Output:
-------------------------
'tag1'
'tag with space'
'!##%^'
'??'
''
Press any key to continue . . .
Nothing wrong with cheating ;]
string input = @"#
tag1, tag with space, !##%^, 🦄";
string[] tags = Array.ConvertAll(input.Split('\n').Last().Split(','), s => s.Trim());
You can pretty much make it without regex. Just split it like this:
var result = input.Split(new []{'\n','\r'}, StringSplitOptions.RemoveEmptyEntries).Skip(1).SelectMany(x=> x.Split(new []{','},StringSplitOptions.RemoveEmptyEntries).Select(y=> y.Trim()));

How to create Regex that contains Not colon char?

I have created a regular expression that seems to be working somewhat:
// look for years starting with 19 or 20 followed by two digits surrounded by spaces.
// Instead of ending space, the year may be followed by a '.' or ';'
static Regex regex = new Regex(@" 19\d{2} | 19\d{2}. | 19\d{2}; | 20\d{2} | 20\d{2}. | 20\d{2}; ");
// Trying to add 'NOT followed by a colon'
static Regex regex = new Regex(@" 19\d{2}(?!:) | 19\d{2}. | 19\d{2}; | 20\d{2}(?!:) | 20\d{2}. | 20\d{2}; ");
// Trying to optimize --
//static Regex regex = new Regex(@" (19|20)\d{2}['.',';']");
You can see where I tried to optimize a bit.
But more importantly, it is finding a match for 2002:
How do I make it not do that?
I think I am looking for some sort of NOT operator?
(?:19|20)\d{2}(?=[ ,;.])
Try this. See demo.
https://regex101.com/r/sJ9gM7/103
I'd rather go with \b here, this will help deal with other punctuation that may appear after/before the years:
\b(?:19|20)[0-9]{2}\b
C#:
static Regex regex = new Regex(@"\b(?:19|20)[0-9]{2}\b");
Tested in Expresso:
The problem in your regex is the unescaped dot, which matches any character (including a colon).
You should have something like this:
static Regex regex =
    new Regex(@" 19\d{2} | 19\d{2}[.] | 19\d{2}; | 20\d{2} | 20\d{2}[.] | 20\d{2}; ");
This did it for me:
// look for years starting with 19 or 20 followed by two digits surrounded by spaces.
// Instead of ending space, the year may also be followed by a '.' or ';'
// but may not be followed by a colon, dash or any other unspecified character.
// optimized --
static Regex regex = new Regex(@"(19|20)\d{2} | (19|20)\d{2};| (19|20)\d{2}[.]");
Used Regex Tester here:
http://regexhero.net/tester/
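One way to combine the ideas in this thread is to keep the \b word boundaries and add the negative lookahead (?!:) from the question. The sketch below is an assumption about what counts as a year (19xx or 20xx as a whole word, not directly followed by a colon). The names YearRegexDemo and FindYears, and the sample sentence, are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class YearRegexDemo
{
    // \b...\b keeps the year from matching inside a longer number;
    // (?!:) rejects a year that is immediately followed by a colon.
    private static readonly Regex YearRegex =
        new Regex(@"\b(?:19|20)\d{2}\b(?!:)");

    // Hypothetical helper: returns every year found in the text.
    public static List<string> FindYears(string text)
    {
        var years = new List<string>();
        foreach (Match match in YearRegex.Matches(text))
            years.Add(match.Value);
        return years;
    }

    public static void Main()
    {
        var text = "Built in 1999. Logged at 2002: ok; revised 2015; id 120034";
        foreach (var year in FindYears(text))
            Console.WriteLine(year);
    }
}
```

Note that \b on its own would still accept "2002:", because a colon is a non-word character and therefore forms a boundary. The lookahead is what excludes it.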

Regex Pattern for filter out anything that doesn't Match

Using Regex.Replace(mystring, @"[^MV:0-9]", "") removes every character that is not M, V, :, or 0-9 (\d could also be used). The problem is that I want to remove everything that is not MV: followed by numbers.
I need to replace anything that is not this pattern with nothing:
Starting String | Wanted Result
---------------------------------------------------------
sdhfuiosdhusdhMV:1234567890sdfahuosdho | MV:1234567890
MV:2138911230989hdsafh89ash32893h8u098 | MV:2138911230989
809308ej0efj0934jf0934jf4fj84j8904jf09 | Null
123MV:1234321234mnnnio234324234njiojh3 | MV:1234321234
mdfmsdfuiovvvajio123oij213432ofjoi32mm | Null
But what I get with what I have is:
Starting String | Returned Result
---------------------------------------------------------
sdhfuiosdhusdhMV:1234567890sdfahuosdho | MV:1234567890
MV:2138911230989hdsafh89ash32893h8u098 | MV:213891123098989328938098
809308ej0efj0934jf0934jf4fj84j8904jf09 | 809308009340934484890409
123MV:1234321234mnnnio234324234njiojh3 | 123MV:12343212342343242343
mdfmsdfuiovvvajio123oij213432ofjoi32mm | mmvvv1232134232mm
And even if there is a Regex pattern for this, would I be better off using something along the lines of:
if (Regex.IsMatch(strMyString, @"MV:"))
{
    string[] strarMyString = Regex.Split(strMyString, @"MV:");
    string[] strarNumbersAfterMV = Regex.Split(strarMyString[1], @"[^\d]");
    string WhatIWant = strarNumbersAfterMV[0];
}
If I went with the latter option, would there be a way to have:
string[] strarNumbersAfterMV = Regex.Split(strarMyString[1], @"[^\d]");
only make one split, at the first change from numbers? (It will always start with a number following the MV:)
Can't you just do:
string matchedText = null;
var match = Regex.Match(myString, @"MV:[0-9]+");
if (match.Success)
{
    matchedText = match.Value;
}
Console.WriteLine((matchedText == null) ? "Not found" : matchedText);
That should give you exactly what you need.

C# regular expression

Please give me a suggestion here.
I am trying to write a regex in C#. Here is the text:
E1A: pop(+)
call T
call E1
E1B: return
TA: call F
call T1
I want to split it like this:
1)
E1A: pop(+)
call T
call E1
2)
E1B: return
3)
TA: call F
call T1
I thought of using a lookahead, but it's not working because of the .+
Here is what I hoped would work, but it doesn't:
"[A-Z0-9]+[:].+(?=([A-Z0-9]+[:]))"
Does anyone have a better idea?
EDIT: The "E1A", "E1B", "TA" labels change. All that stays the same is that they are made of letters and numbers followed by ":".
Regex regexObj = new Regex(
    @"^                     # Start of line
    [A-Z0-9]+:              # Match identifier
    (?:                     # Match...
     (?!^[A-Z0-9]+:)        # (unless it's the start of the next identifier)
     .                      # ... any character,
    )*                      # repeat as needed.",
    RegexOptions.Singleline | RegexOptions.Multiline | RegexOptions.IgnorePatternWhitespace);
MatchCollection allMatchResults = regexObj.Matches(subjectString);
Now allMatchResults.Count holds the number of matches in subjectString, and allMatchResults[i] is the ith match.
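A small usage sketch of the same pattern on the question's text, written without the pattern comments. The names LabelBlockDemo and SplitBlocks are hypothetical, and "\n" line endings with indented continuation lines are assumed. Each match still contains its trailing line break, so the sketch trims it.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class LabelBlockDemo
{
    // Identifier at the start of a line, then any characters (including
    // line breaks) up to the next line that starts with an identifier.
    private static readonly Regex BlockRegex = new Regex(
        @"^[A-Z0-9]+:(?:(?!^[A-Z0-9]+:).)*",
        RegexOptions.Singleline | RegexOptions.Multiline);

    // Hypothetical helper: returns one string per labelled block.
    public static List<string> SplitBlocks(string text)
    {
        var blocks = new List<string>();
        foreach (Match match in BlockRegex.Matches(text))
            blocks.Add(match.Value.TrimEnd()); // drop the trailing line break
        return blocks;
    }

    public static void Main()
    {
        var text = "E1A: pop(+)\n call T\n call E1\nE1B: return\nTA: call F\n call T1";
        var blocks = SplitBlocks(text);
        for (var i = 0; i < blocks.Count; i++)
            Console.WriteLine("{0})\n{1}", i + 1, blocks[i]);
    }
}
```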
