How to capture hyphen, space or none and ignore case? - c#

I am trying to write a Regex in C# to capture all these potential strings:
Test Pre-Requisite
Test PreRequisite
Test Pre Requisite
Of course the user could also enter any possible case. So it would be great to be able to ignore case. The best I can do is:
Regex TestPreReqRegex = new Regex("Test Pre[- rR]");
if (TestPreReqRegex.IsMatch(StringToCompare)){
// Do Stuff
}
But this doesn't capture "Test PreRequisite" and also doesn't capture lower case. How can I fix this? Any help is much appreciated.

If you're trying to match the entire string, use:
Regex TestPreReqRegex = new Regex("^Test Pre[- ]?Requisite$", RegexOptions.IgnoreCase);
If you're looking for partial matches, then change the pattern to:
\bTest Pre[- ]?Requisite
Or:
\bTest Pre[- ]?R
Pattern details:
^ - Beginning of string.
\b - Word boundary.
[- ]? - Match a hyphen or a space character between zero and one times.
$ - End of string.
C# Demo:
var inputs = new[]
{ "Test Pre-Requisite", "Test PreRequisite", "Test Pre Requisite" };
Regex TestPreReqRegex = new Regex("^Test Pre[- ]?Requisite$",
RegexOptions.IgnoreCase);
foreach (string s in inputs)
{
Console.WriteLine("'{0}' is {1}.", s,
TestPreReqRegex.IsMatch(s) ? "a match" : "not a match");
}
Output:
'Test Pre-Requisite' is a match.
'Test PreRequisite' is a match.
'Test Pre Requisite' is a match.
Try it online.
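For the partial-match variant, a minimal sketch might look like the following. The class and method names (PreReqMatcher, ContainsPreRequisite) are hypothetical and only serve the illustration; the pattern itself is the one given above.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PreReqMatcher
{
    // \b keeps "Test" from matching inside a longer word such as "Contest";
    // [- ]? allows a hyphen, a space, or nothing between "Pre" and "Requisite".
    private static readonly Regex PartialPattern =
        new Regex(@"\bTest Pre[- ]?Requisite", RegexOptions.IgnoreCase);

    // Returns true when the phrase occurs anywhere in the input, in any casing.
    public static bool ContainsPreRequisite(string input)
    {
        return PartialPattern.IsMatch(input);
    }
}
```

Calling PreReqMatcher.ContainsPreRequisite("please run the test pre-requisite first") should return true, while an input with two spaces between "Pre" and "Requisite" should not match, because [- ]? allows at most one separator.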

Related

How do I regex match each individual word within backticks?

I am trying to get results for each individual word within backticks. For example,
if I have something like this text
some description `match these_words th_is_wor` or `THIS_WOR thi_sqw` a `word_snake`
I want the search results to be:
match
these_words
th_is_wor
THIS_WOR
thi_sqw
word_snake
I'm essentially trying to get each "word", word being one or more english letter or underscore characters, between each set of backticks.
I currently have the following regex that seems to match ALL the text between each set of backticks:
/(?<=`)(\b([^`\]|\w|_)*\b)(?=`)/gi
This uses a positive lookbehind to find text that comes after a ` character: (?<=`)
Followed by a capture group for one or more things such that the thing is not a `, not a \, is a word character, or is an _ character within word boundaries: (\b([^`\]|\w|_)*\b)
Followed by a positive lookahead for another ` character to ensure we're enclosed within backticks.
This sort of works, but captures ALL the text between backticks instead of each individual word. This would require further processing which I'd like to avoid. My regex results right now are:
match these_words th_is_wor
THIS_WOR thi_sqw
word_snake
If there is a generic formula for getting each individual word within backticks or within quotes, that would be fantastic. Thank you!
Note: Much appreciated if the answer could be formatted in C#, but not required, as I can do that bit myself if needed.
Edit: Thank you Mr. إين from Ben Awad's Discord server for the quickest response! This is the solution as proposed by him. Also thank you to everyone who responded to my post, you guys are all AWESOME!
using System;
using System.Text.RegularExpressions;
class Program {
static void Main(string[] args) {
string backtickSentence = "i want to `match these_words th_is_wor` or `THIS_WOR thi_sqw` a `word_snake`";
string backtickPattern = @"(?<=^[^`]*(?:`[^`]*`[^`]*)*`(?:[^`]* )*)\w+";
string quoteSentence = "some other \"words in a \" sentence be \"gettin me tripped_up AllUp inHere\"";
string quotePattern = "(?<=^[^\"]*(?:\"[^\"]*\"[^\"]*)*\"(?:[^\"]* )*)\\w+";
// Call Matches method without specifying any options.
try {
foreach (Match match in Regex.Matches(backtickSentence, backtickPattern, RegexOptions.None, TimeSpan.FromSeconds(1)))
Console.WriteLine("Found '{0}' at position {1}", match.Value, match.Index);
Console.WriteLine();
foreach (Match match in Regex.Matches(quoteSentence, quotePattern, RegexOptions.None, TimeSpan.FromSeconds(1)))
Console.WriteLine("Found '{0}' at position {1}", match.Value, match.Index);
}
catch (RegexMatchTimeoutException) {} // Do Nothing: Assume that timeout represents no match.
Console.WriteLine();
// Call Matches method for case-insensitive matching.
try {
foreach (Match match in Regex.Matches(backtickSentence, backtickPattern, RegexOptions.IgnoreCase))
Console.WriteLine("Found '{0}' at position {1}", match.Value, match.Index);
Console.WriteLine();
foreach (Match match in Regex.Matches(quoteSentence, quotePattern, RegexOptions.IgnoreCase))
Console.WriteLine("Found '{0}' at position {1}", match.Value, match.Index);
}
catch (RegexMatchTimeoutException) {}
}
}
His explanation for this was as follows, but you can paste his regex into regexr.com for more info
var NOT_BACKTICK = @"[^`]*";
var WORD = @"(\w+)";
var START = $@"^{NOT_BACKTICK}"; // match anything before the first backtick
var INSIDE_BACKTICKS = $@"`{NOT_BACKTICK}`"; // match a pair of backticks
var ODD_NUM_BACKTICKS_BEFORE = $@"{START}({INSIDE_BACKTICKS}{NOT_BACKTICK})*`"; // match anything before the first backtick, then any amount of paired backticks with anything afterwards, then a single opening backtick
var CONDITION = $@"(?<={ODD_NUM_BACKTICKS_BEFORE})";
var CONDITION_TRUE = $@"(?: *{WORD})"; // match any spaces then a word
var CONDITION_FALSE = $@"(?:(?<={ODD_NUM_BACKTICKS_BEFORE}{NOT_BACKTICK} ){WORD})"; // match up to an opening backtick, then anything up to a space before the current word
// uses conditional matching
// see https://learn.microsoft.com/en-us/dotnet/standard/base-types/alternation-constructs-in-regular-expressions#Conditional_Expr
var pattern = $@"(?{CONDITION}{CONDITION_TRUE}|{CONDITION_FALSE})";
// refined backtick pattern
string backtickPattern = @"(?<=^[^`]*(?:`[^`]*`[^`]*)*`(?:[^`]* )*)\w+";
With C# you can use the Group.Captures Property and then get the capture group values.
Note that \w also matches _
`(?:[\p{Zs}\t]*(\w+)[\p{Zs}\t]*)+`
Explanation
` Match literally
(?: Non capture group to repeat as a whole part
[\p{Zs}\t]* Match optional spaces
(\w+) Capture group 1, match 1+ word characters
[\p{Zs}\t]* Match optional spaces
)+ Close the non capture group and repeat it 1 or more times
` Match literally
See a .NET regex demo and a C# demo.
For example:
string s = @"some description ` match these_words th_is_wor ` or `THIS_WOR thi_sqw` a `word_snake`";
string pattern = @"`(?:[\p{Zs}\t]*(\w+)[\p{Zs}\t]*)+`";
foreach (Match m in Regex.Matches(s, pattern))
{
string[] result = m.Groups[1].Captures.Select(c => c.Value).ToArray();
Console.WriteLine(String.Join(',', result));
}
Output
match,these_words,th_is_wor
THIS_WOR,thi_sqw
word_snake
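Since the question also asks for a generic version that works for quotes, here is a minimal sketch that parameterizes the same Group.Captures approach by delimiter. The names DelimitedWords and Extract are hypothetical, and the sketch assumes the opening and closing delimiter are the same single character.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class DelimitedWords
{
    // Returns every \w+ word found between pairs of the given delimiter character.
    public static List<string> Extract(string input, char delimiter)
    {
        // Escape the delimiter in case it is a regex metacharacter.
        string d = Regex.Escape(delimiter.ToString());
        // Delimiter, one or more words separated by spaces or tabs, delimiter.
        string pattern = d + @"(?:[\p{Zs}\t]*(\w+)[\p{Zs}\t]*)+" + d;

        var words = new List<string>();
        foreach (Match m in Regex.Matches(input, pattern))
        {
            // Group 1 is repeated, so each repetition is kept in Captures.
            foreach (Capture c in m.Groups[1].Captures)
            {
                words.Add(c.Value);
            }
        }
        return words;
    }
}
```

For example, DelimitedWords.Extract(text, '`') collects the words inside backticks and DelimitedWords.Extract(text, '"') does the same for double quotes. A delimited span that contains anything other than word characters, spaces, or tabs is skipped entirely; for untrusted input it may be worth passing a match timeout, as the code in the question does.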

Regex without taking care of escape codes

I want to validate a string like this (netsh cmd output):
"\r\nR‚servations d'URLÿ:\r\n--------------------\r\n\r\n URL r‚serv‚e : https://+:443/SomeWebSite/ \r\n Utilisateurÿ: AUTORITE NT\\SERVICE R\u0090SEAU\r\n \u0090couterÿ: Yes\r\n D‚l‚guerÿ: Yes\r\n SDDLÿ: D:(A;;GA;;;NS) \r\n\r\n\r\n"
with this pattern:
"URL .+https:\/\/\+:443\/SomeWebSite\/.+Yes.+Yes.+SDDL.+"
So, I intend to detect this kind of strings (xxxxx is something(+)):
xxxxxURLxxxxxhttps://+:443/SomeWebSite/xxxxxYesxxxxxYesxxxxxSDDLxxxx
I wrote this code in C# to do it but my expression still doesn't work:
string output = "\r\nR‚servations d'URLÿ:\r\n--------------------\r\n\r\n URL r‚serv‚e : https://+:443/SomeWebSite/ \r\n Utilisateurÿ: AUTORITE NT\\SERVICE R\u0090SEAU\r\n \u0090couterÿ: Yes\r\n D‚l‚guerÿ: Yes\r\n SDDLÿ: D:(A;;GA;;;NS) \r\n\r\n\r\n";
output = output.Replace(Environment.NewLine, ""); //==> output2=="R‚servations d'URLÿ:-----------
Regex testUrlOpened = new Regex(output, RegexOptions.Singleline);
MessageBox.Show(testUrlOpened.IsMatch(@"URL").ToString()); // ==> False
MessageBox.Show(testUrlOpened.IsMatch(@".+URL.+").ToString()); // ==> False
MessageBox.Show(testUrlOpened.IsMatch(@"URL .+https:\/\/\+:443\/SomeWebSite\/.+Yes.+Yes.+SDDL.+").ToString()); // ==> False
So I suppose that I've another issue with regex in c#...
May be encoding issue?
Start by removing the escape characters (newlines, carriage returns, tabs) from the string. Depending on your scenario, it might be better to remove them all (C# escape codes):
output = output.Replace("\n", "").Replace("\r", "").Replace("\t", "");
Now you have a single line string, you can do the regex matching
.+URL.+https:\/\/.+:443\/SomeWebSite\/.+Yes.+Yes.+SDDL.+
Notice the following:
1- ^ and $ anchor the match to the exact beginning and end of the string. If the target text sits in the middle of the line, using them will cause the match to fail.
2- You need to escape any characters that have a special meaning in regex.
3- To match "any character except newline, one or more times", use .+
I hope this helps
You can use Regex.Unescape to unescape the string, and then do your regex match:
var output = @"\r\nR‚servations d'URLÿ:\r\n--------------------\r\n\r\n URL r‚serv‚e : https://+:443/SomeWebSite/ \r\n Utilisateurÿ: AUTORITE NT\\SERVICE R\u0090SEAU\r\n \u0090couterÿ: Yes\r\n D‚l‚guerÿ: Yes\r\n SDDLÿ: D:(A;;GA;;;NS) \r\n\r\n\r\n";
output = Regex.Unescape(output).Dump(); // Dump() is a LINQPad extension method; omit it elsewhere
// "Yes" must match the casing in the input, and Singleline lets . match the newlines produced by Unescape
var foundUrl = Regex.IsMatch(output, @"URL .+ https://\+:443/SomeWebSite/.+Yes.+Yes.+SDDL.+", RegexOptions.Singleline);
+ means one or more of the preceding pattern. If we put the pattern (.|\n), which matches any character including a newline, in front of each +, the match works without having to remove or account for escape codes.
^(.|\n)+URL(.|\n)+https://(.|\n)+:443/SomeWebSite/(.|\n)+Yes(.|\n)+Yes(.|\n)+SDDL(.|\n)+$
EDIT: The risk of doing something like this instead of sanitizing your string first is that you may get false positives because there could be any character separating the matches, all this regex does is ensure that somewhere in the string, in order, are the strings
"URL", "https://", ":443/SomeWebSite/", "Yes", "Yes", "SDDL"
It turned out to be simple. The remaining issue was that I had swapped the two arguments: the regular expression belongs in the Regex constructor and the input string in the IsMatch method... :(
So the final code is:
string output = "\r\nR‚servations d'URLÿ:\r\n--------------------\r\n\r\n URL r‚serv‚e : https://+:443/SomeWebSite/ \r\n Utilisateurÿ: AUTORITE NT\\SERVICE R\u0090SEAU\r\n \u0090couterÿ: Yes\r\n D‚l‚guerÿ: Yes\r\n SDDLÿ: D:(A;;GA;;;NS) \r\n\r\n\r\n";
output = output.Replace(Environment.NewLine, ""); //==> output2=="R‚servations d'URLÿ:-----------
Regex testUrlOpened = new Regex(@"URL .+https:\/\/\+:443\/SomeWebSite\/.+Yes.+Yes.+SDDL.+", RegexOptions.Singleline);
MessageBox.Show(testUrlOpened.IsMatch(output).ToString()); // ==> True!!!
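As a rough sketch of why this works: with RegexOptions.Singleline the dot also matches '\n', so .+ can span the line breaks in the netsh output even if they are not stripped first. The helper below is hypothetical (NetshOutputCheck and HasReservation are illustrative names), and the sample input used with it is a simplified stand-in for the real output.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NetshOutputCheck
{
    // The pattern goes into the constructor; the text to test goes into IsMatch.
    // Singleline makes "." match '\n' as well, so ".+" can cross line breaks.
    private static readonly Regex ReservationPattern = new Regex(
        @"URL .+https://\+:443/SomeWebSite/.+Yes.+Yes.+SDDL.+",
        RegexOptions.Singleline);

    // Returns true when the URL reservation block appears in the given output.
    public static bool HasReservation(string netshOutput)
    {
        return ReservationPattern.IsMatch(netshOutput);
    }
}
```

In .NET the forward slash has no special meaning in a pattern, so the \/ escapes from the original are optional and are left out here.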
Regex taking decimal number only without using escape character.
^[0-9]+([.][0-9]+)?$
Test It

Remove spaces before non-word character with RegEx

I have the following C# code:
var sentence = "As a result , he failed the test .";
var pattern = new Regex();
var outcome = pattern.Replace(sentence, String.Empty);
What should I do to the RegEx to obtain the following output:
As a result, he failed the test.
If you want to white-list punctuation marks that generally don't appear in English after spaces, you can use:
\s+(?=[.,?!])
\s+ - all white space characters. You may want [ ]+ instead.
(?=[.,?!]) - lookahead. The next character should be ., ,, ?, or ! .
Working example: https://regex101.com/r/iJ5vM8/1
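Plugged into the C# from the question, the lookahead version might be sketched as follows. The class and method names (PunctuationSpacing, Tighten) are made up for the illustration. Because the lookahead does not consume the punctuation mark, the replacement can simply be an empty string.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PunctuationSpacing
{
    // \s+ consumes the whitespace; (?=[.,?!]) only checks that one of
    // the listed punctuation marks follows, without consuming it.
    private static readonly Regex SpaceBeforePunctuation =
        new Regex(@"\s+(?=[.,?!])");

    // Removes whitespace that directly precedes '.', ',', '?' or '!'.
    public static string Tighten(string sentence)
    {
        return SpaceBeforePunctuation.Replace(sentence, string.Empty);
    }
}
```

PunctuationSpacing.Tighten("As a result , he failed the test .") should return "As a result, he failed the test.".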
You need to add a pattern to your code that will match spaces before punctuation:
var sentence = "As a result , he failed the test .";
var pattern = new Regex(@"\s+(\p{P})");
var outcome = pattern.Replace(sentence, "$1");
Output:
As a result, he failed the test.

C# Replace everything except two cases

How can I do something like this?
new Regex("([^my]|[^test])").Replace("Thats my working test", "");
I would like to get this:
my test
But I get an empty string, because everything is replaced.
Thank you in Advance!
You can use this lookahead based regex:
new Regex(@"(?!\b(?:my|test)\b)\b(\w+)\s*").Replace("Thats my working test", "");
//=> my test
Your use of negation in a character class is incorrect here: ([^my]|[^test])
Inside a character class every character is checked individually, not as a string.
RegEx Demo
Use this regex replacement:
new Regex(@"\b(?!my|test)\w+\s?").Replace("Thats my working test", "");
Here is a regex demo!
\b Asserts the position before our word to check.
(?! Negative lookahead - asserts that our match is NOT:
my|test The character sequences "my" or "test".
)
\w+ Then match the word because it's what we want.
\s? And scrap the whitespace after it if it's there, too.
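One thing to be aware of: (?!my|test) only looks at how a word starts, so a word such as "myth" would be kept as well. A hedged variant that adds a \b inside the lookahead, so that only the whole words are kept, could be sketched like this. KeepWords and RemoveAllExcept are hypothetical names, and the sketch assumes at least one word is passed in.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class KeepWords
{
    // Removes every word (and one trailing whitespace character) except
    // the whole words listed in keep. Assumes keep is not empty.
    public static string RemoveAllExcept(string input, params string[] keep)
    {
        // Escape each word so regex metacharacters are treated literally.
        string alternatives = string.Join("|", keep.Select(Regex.Escape));
        // The \b inside the lookahead makes "my" protect "my" but not "myth".
        string pattern = @"\b(?!(?:" + alternatives + @")\b)\w+\s?";
        return Regex.Replace(input, pattern, string.Empty);
    }
}
```

KeepWords.RemoveAllExcept("Thats my working test", "my", "test") should yield "my test".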
I suggest the following regex:
var res = new Regex(@"my(?:$|[\s\.;\?\!,])|test(?:$|[\s\.;\?\!,])").Replace("Thats my working test", "");
Upd: Or even simpler:
var res = new Regex(@"my($|[\s])|test($|[\s])").Replace("Thats my working test", "");
Upd2: If you don't know in advance which words you'll use, you can make it more flexible:
// requires: using System.Linq; using System.Text.RegularExpressions;
private string ExeptWords(string input, string[] exept){
string tmpl = @"{0}|[\s]";
var regexp = string.Join("|", exept.Select(s => string.Format(tmpl, s)));
return new Regex(regexp).Replace(input, "");
}

Multiline Regex matches first occurrence but can't match second

I have a string in the format below. (I added the markers to get the newlines to show up correctly)
-- START BELOW THIS LINE --
2013-08-28 00:00:00 - Tom Smith (Work notes)
Blah blah
b;lah blah
2013-08-27 00:00:00 - Tom Smith (Work notes)
ZXcZXCZXCZX
ZXcZXCZX
ZXCZXcZXc
ZXCZXC
-- END ABOVE THIS LINE --
I am trying to get a regular expression that will allow me to extract the information from the two separate parts of the string.
The following expression matches the first portion successfully:
^(\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2}) - (.*) \\(Work notes\\)\n([\\w\\W]*)(?=\n\n\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2} - .* \\(Work notes\\)\n)
I am trying to figure out a way that I can modify it to get the second part of the string. I have tried things like what is below, but it ends up extending the match all the way to the end of the string. It is like it is giving preference to the expression following the OR.
^(\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2}) - (.*) \\(Work notes\\)\n([\\w\\W]*)(?:(?=\n\n\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2} - .* \\(Work notes\\)\n)|\n\\Z)
Any help would be appreciated
-- EDIT --
Here is a copy of the test program I created to try and get this correct. I also added a 3rd message and my RegEx above breaks in that case.
using System;
using System.Text.RegularExpressions;
namespace RegExTest
{
class MainClass
{
public static void Main (string[] args)
{
string str = "2013-08-28 10:50:13 - Tom Smith (Work notes)\nWhat's up? \nHow you been?\n\n2013-08-19 10:21:03 - Tom Smith (Work notes)\nWork Notes\n\n2013-08-19 10:10:48 - Tom Smith (Work notes)\nGood day\n\n";
var regex = new Regex ("^(\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2}) - (.*) \\(Work notes\\)\n([\\w\\W]*)\n\n(?=\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2} - .* \\(Work notes\\)\n)",RegexOptions.Multiline);
foreach (Match match in regex.Matches(str))
{
if (match.Success)
{
for (var i = 0; i < match.Groups.Count; i++)
{
Console.WriteLine('>'+match.Groups [i].Value);
}
}
}
Console.ReadKey();
}
}
}
-- EDIT --
Just to make it clear, the data I am trying to extract is the Date and Timestamp (as one item), the name, and the "body" from each "paragraph".
This is a pretty beefy piece of regex you've got here.
While you can do regex over multiple lines, it just complicates things. Additionally, because you have repetitive patterns, it would be cleaner to split your string on the newline character, and then just match each line.
Eventually, if you intend to ingest this from a file, it will be easy to match each line of the file, rather than reading in the whole file and then matching.
Here's what I would do:
var regex = new Regex ("(\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2}) - (.*?) \\(Work notes\\)");
var lines = str.Split(new char[] {'\n'});
foreach (var line in lines)
{
var match = regex.Match(line);
if (match.Success)
{
for (var i = 0; i < match.Groups.Count; i++)
{
Console.WriteLine('>' + match.Groups[i].Value);
}
// will preface the body after each header
Console.WriteLine(">");
}
else
{
Console.WriteLine(line);
}
}
As far as your regex goes, I maintained the original groups you had, so we get the Date/timestamp in one group, and the name in the other. The body does not get matched to a group, but it would be trivial to construct a string that is the body.
(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) Matching Group 1.
- Matched, but not grouped.
(.*?) Matching Group 2.
\(Work notes\) Matched, but not grouped.
Regex is not really the right solution for this, but if you must...
Your problem is a combination of regex greediness and starting the match with ^. Without RegexOptions.Multiline, ^ only matches at the very start of the string, so the pattern cannot match anywhere else.
The greediness of .* can be fixed by making it .*? instead.
Try this:
(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) - (.*?) \(Work notes\)\n([\w\W]*?)((?=\n\n\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2} - .*? \(Work notes\)\n)|((\s{0,})$))
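To show how the lazy body and the "next header or end of string" lookahead fit together, here is a minimal sketch. It is a simplified variant rather than the exact pattern above, WorkNotesParser and Parse are hypothetical names, and it assumes the text uses '\n' line endings and that no body line starts with a timestamp of its own.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class WorkNotesParser
{
    // Header line, then a lazy body that stops right before either the
    // next header (after one or more newlines) or trailing whitespace
    // at the very end of the text. Multiline lets ^ match at each line start.
    private static readonly Regex EntryPattern = new Regex(
        @"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) - (.*?) \(Work notes\)\n" +
        @"([\s\S]*?)" +
        @"(?=\n+\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2} - |\s*\z)",
        RegexOptions.Multiline);

    // Returns one (timestamp, author, body) tuple per entry, in input order.
    public static List<(string Timestamp, string Author, string Body)> Parse(string text)
    {
        var entries = new List<(string Timestamp, string Author, string Body)>();
        foreach (Match m in EntryPattern.Matches(text))
        {
            entries.Add((m.Groups[1].Value, m.Groups[2].Value, m.Groups[3].Value));
        }
        return entries;
    }
}
```

Run against the three-message test string from the question, Parse should return three entries, with the bodies free of the blank lines that separate the messages.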
I was able to get an expression working, but it looks a bit scary I guess:
@"([0-9\s:-]+)(?>\s-\s)(?>[^\n\r]+[\r\n]*)((?=[^0-9]+(\d{4}-\d{2}-\d{2}|$))[\s\S])+"
The @ before the expression makes this a verbatim string, so you won't have to double-escape everything.
Note: This is by no means the right way to go about doing this, but I wanted to try out anyway.
