Regular expression to match base path - c#

I'm trying to come up with a regular expression to match a certain base path. The rule should be to match the base path itself, optionally followed by a "/" or a "." and the rest of the path.
For example, given /api/ping, the following should match:
/api/ping.json
/api/ping
/api/ping/xxx/sss.json
/api/ping.xml
and this should NOT match
/api/pingpong
/api/ping_pong
/api/ping-pong
I tried with the following regexp:
/api/ping[[\.|\/].*]?
But it doesn't seem to catch the /api/ping case
Here is a link to the Regex Storm tester
--
Update: thanks to the answers, I now have this version, which better reflects my reasoning:
\/api\/ping(?:$|[.\/?]\S*)
The expression either ends after ping (that's the $ part) or continues with a ., / or ? followed by any non-space characters.
here's the regex

You can use this regex, which uses a lookahead with alternation to ensure the base path is followed by either a . or a / or the end of the line $
\/api\/ping(?=\.|\/|$)\S*
Explanation:
\/api\/ping - Matches /api/ping text literally
(?=\.|\/|$) - Look ahead ensuring what follows is either a literal dot . or a slash / or end of line $
\S* - Matches zero or more non-space characters, i.e. whatever remains of the path
Demo
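As a quick check, the lookahead pattern above can be run against the sample paths from the question. This is only a minimal sketch: the class and method names (BasePathDemo, Matches) are made up for illustration, and the pattern is copied unchanged from the answer.

```csharp
using System;
using System.Text.RegularExpressions;

public static class BasePathDemo
{
    // Pattern copied from the answer above; the \/ escapes are harmless in .NET.
    private static readonly Regex BasePath =
        new Regex(@"\/api\/ping(?=\.|\/|$)\S*");

    // Hypothetical helper: true when the input contains a match for the pattern.
    public static bool Matches(string path) => BasePath.IsMatch(path);

    public static void Main()
    {
        string[] samples =
        {
            "/api/ping", "/api/ping.json", "/api/ping/xxx/sss.json", "/api/ping.xml",
            "/api/pingpong", "/api/ping_pong", "/api/ping-pong"
        };

        // Print each sample path with the result of the check.
        foreach (string s in samples)
            Console.WriteLine($"{s} -> {Matches(s)}");
    }
}
```

The first four samples should report True and the last three False, because the lookahead rejects any character other than . or / directly after ping.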
In your regex, /api/ping[[\.|\/].*]?, the character set [] is not used correctly. Inside a character set you don't need to escape a dot ., and alternation | is neither needed nor possible there: a | placed within a character class is just a literal |. The character class also looks nested, which isn't required and doesn't do what you intended. I guess you wanted your regex to be something like this:
\/api\/ping([.\/].*)?$
Demo with your corrected regex
Notice that whatever you place in [] counts as a single character chosen from the set, so [.\/] allows either a dot . or a slash /. Escaping / as \/ is only required where / is the pattern delimiter; in a C# string it is optional.

Your pattern uses a character class that will match any one of the listed characters, and could also be written as [[./|].
It does not match /api/ping because the character class has to match at least 1 time, as it is not optional.
You could use an alternation to match /api/ping followed by either asserting the end of the string, or | zero or more path segments (a forward slash followed by 1+ characters that are neither a forward slash nor whitespace) and then a dot and the extension.
/api/ping(?:(?:/[^/\s]+)*\.\S+|$)
That will match
/api/ping Match literally
(?: Non capturing group
(?:/[^/\s]+)* Repeat a grouping structure 0+ times matching / then 1+ times not / or a whitespace character
\.\S+ Match a dot and 1+ times a non whitespace character
| Or
$ Assert the end of the string
) Close non capturing group
See the regex demo | C# demo

Related

Regex start new match at specific pattern

Hello, I'm kind of new to regex and have a small, maybe simple question.
I have the given text:
17.11.2020 15:32 typical Pat. seems sleeping
Additional test
17.11.2020 15:32 typical Pat. seems sleeping
Additional test
17.11.2020 15:32 typical Pat. seems sleeping
Additional test
My current regex (\d{2}.\d{2}.\d{4}\s\d{2}:\d{2})\s?(.*)
matches only up to "sleeping" but correctly creates 3 matches.
But I need the "Additional test" text in the second group as well.
I tried something like (\d{2}.\d{2}.\d{4}\s\d{2}:\d{2})\s?([,.:\w\s]*) but now I have only one huge match because the second group takes everything until the end.
How can I match everything until a new line with a date starts and create a new match from there on?
If you are sure there is only one additional line to be matched you can use
(?m)^(\d{2}\.\d{2}\.\d{4}\s\d{2}:\d{2})\s*(.*(?:\n.*)?)
See the regex demo. Details:
(?m) - a multiline modifier
^ - start of a line
(\d{2}\.\d{2}\.\d{4}\s\d{2}:\d{2}) - Group 1: a datetime string
\s* - zero or more whitespaces
(.*(?:\n.*)?) - Group 2: any zero or more chars other than a newline char as many as possible and then an optional line, a newline followed with any zero or more chars other than a newline char as many as possible.
If there can be any amount of lines, you may consider
(?m)^(\d{2}\.\d{2}\.\d{4}[\p{Zs}\t]\d{2}:\d{2})[\p{Zs}\t]*(?s)(.*?)(?=\n\d{2}\.\d{2}\.\d{4}|\z)
See this regex demo. Here,
(?m)^(\d{2}\.\d{2}\.\d{4}[\p{Zs}\t]\d{2}:\d{2}) - matches the same as above, just \s is replaced with [\p{Zs}\t] that only matches horizontal whitespace
[\p{Zs}\t]* - 0+ horizontal whitespace chars
(?s) - now, . will match any chars including a newline
(.*?) - Group 2: any zero or more chars, as few as possible
(?=\n\d{2}\.\d{2}\.\d{4}|\z) - up to the leftmost occurrence of a newline, followed with a date string, or up to the end of string.
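To see the "any amount of lines" pattern in action, here is a minimal C# sketch. The class name LogEntryDemo and the Parse helper are hypothetical, and the sample input is assumed to use \n line endings (with \r\n, a trailing \r would end up in group 2).

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class LogEntryDemo
{
    // Pattern copied from the answer above.
    private static readonly Regex Entry = new Regex(
        @"(?m)^(\d{2}\.\d{2}\.\d{4}[\p{Zs}\t]\d{2}:\d{2})[\p{Zs}\t]*(?s)(.*?)(?=\n\d{2}\.\d{2}\.\d{4}|\z)");

    // Hypothetical helper: returns (timestamp, text) pairs, one per entry.
    public static List<(string Stamp, string Text)> Parse(string input)
    {
        var result = new List<(string Stamp, string Text)>();
        foreach (Match m in Entry.Matches(input))
            result.Add((m.Groups[1].Value, m.Groups[2].Value));
        return result;
    }

    public static void Main()
    {
        // Sample text from the question, joined with \n (an assumption).
        string input =
            "17.11.2020 15:32 typical Pat. seems sleeping\nAdditional test\n" +
            "17.11.2020 15:32 typical Pat. seems sleeping\nAdditional test\n" +
            "17.11.2020 15:32 typical Pat. seems sleeping\nAdditional test";

        foreach (var entry in Parse(input))
            Console.WriteLine($"[{entry.Stamp}] {entry.Text.Replace("\n", " / ")}");
    }
}
```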
You are repeating the character class [,.:\w\s] with the * quantifier, and since \s also matches newlines, it will match too much.
Because .* does not match a newline, you can use (.*\r?\n.*) to match the rest of the line, then a newline and the next line, all in the same group.
^(\d{2}\.\d{2}\.\d{4}\s\d{2}:\d{2})\s?(.*\r?\n.*)
Regex demo
If multiple lines can follow, match all following lines that do not start with a date like pattern.
^(\d{2}\.\d{2}\.\d{4})\s*(.*(?:\r?\n(?!\d{2}\.\d{2}\.\d{4}).*)*)
Explanation
^ Start of the string (or of each line when the multiline option is enabled)
( Capture group1
\d{2}\.\d{2}\.\d{4} Match a date like pattern
) Close group 1
\s* Match 0+ whitespace chars (Or match whitespace chars without newlines [^\S\r\n]*)
( Capture group 2
.* Match the whole line
(?:\r?\n(?!\d{2}\.\d{2}\.\d{4}).*)* Optionally repeat matching the whole line if it does not start with a date like pattern
) Close group 2
Regex demo
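The line-by-line variant can be tried the same way. In this sketch the names DateBlockDemo and Blocks are invented for illustration, RegexOptions.Multiline is assumed so that ^ matches at the start of every line, and the input is assumed to use \n line endings.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class DateBlockDemo
{
    // Pattern copied from the answer above; Multiline makes ^ match at each line start.
    private static readonly Regex Block = new Regex(
        @"^(\d{2}\.\d{2}\.\d{4})\s*(.*(?:\r?\n(?!\d{2}\.\d{2}\.\d{4}).*)*)",
        RegexOptions.Multiline);

    // Hypothetical helper: returns the group 2 text of every match.
    public static string[] Blocks(string input) =>
        Block.Matches(input).Cast<Match>().Select(m => m.Groups[2].Value).ToArray();

    public static void Main()
    {
        string input =
            "17.11.2020 15:32 typical Pat. seems sleeping\nAdditional test\n" +
            "17.11.2020 15:32 typical Pat. seems sleeping\nAdditional test";

        foreach (string block in Blocks(input))
            Console.WriteLine(block.Replace("\n", " / "));
    }
}
```

Note that group 1 of this pattern holds only the date, so the time ends up at the start of group 2.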

Fail validation if there is a period (.) not in the specific format?

I'm scanning a string and a period is allowed but if there is a period it has to be in the following format alphanumber.numeric or numeric.numeric. Here are some possible acceptable formats:
5555.1312
ajfdkd.555
Here is what I have so far:
private const string containsPeroidRegularExpress = @"([a-zA-Z]+\.[0-9]+)|([0-9]+\.[0-9]+)";
validator.RuleFor(x => x.myString)
    .Matches(containsPeroidRegularExpress)
    .When(x => x.myString.Contains("."), ApplyConditionTo.CurrentValidator);
When you have an example like this it works fine:
This is my example 1 555.1212
But in this example it does not
This is my example 2 555.1212 .
You can see the extra period at the end of the 2nd example. It should fail validation because the extra period is not in the format specified above. The 1st example should pass validation. Both pass the validation though.
Your pattern still captures exactly what you want, but it is not anchored, so it doesn't "know" that it needs to keep checking through to the end of the string.
private const string containsPeroidRegularExpress =
    @"^([a-zA-Z]+\.[0-9]+)$|^([0-9]+\.[0-9]+)$";
The $ tells it to check right up until the end of the line (I also added ^ to tell it to start at the beginning for completeness so that ". 555.1212" doesn't pass as well).
I definitely won't say this is the best solution. As others mention, you can definitely simplify it. However regex isn't my forte...
I also noticed you mention that the pattern could be alphanumber.numeric. Your pattern does not allow both alpha and numeric characters mixed in the first part. You could use the following:
private const string containsPeroidRegularExpress =
    @"^([a-zA-Z0-9]+\.[0-9]+)$|^([0-9]+\.[0-9]+)$";
You might check that after matching the value, there is no space followed by a dot on the right.
You can shorten the pattern a bit by matching either 1+ digits or 1+ chars a-zA-Z, and then a dot and 1+ digits
(?<!\.[^\S\r\n]+)\b(?:[a-zA-Z]+|[0-9]+)\.[0-9]+\b(?![^\S\r\n]+\.)
The pattern matches
(?<! Negative lookbehind, assert what is on the left is not
\.[^\S\r\n]+ Match a dot and 1+ whitespace chars without a newline
) Close lookbehind
\b Word boundary
(?: Non capture group
[a-zA-Z]+|[0-9]+ Match either 1+ chars a-zA-Z or 1+ digits
) Close group
\.[0-9]+ Match a dot and 1+ digits 0-9
\b Word boundary
(?! Negative lookahead, assert that on the right is not
[^\S\r\n]+\. Match 1+ whitespaces without newlines followed by a dot
) Close lookahead
Regex demo
If you want to match mixed char a-zA-Z and digits:
(?<!\.[^\S\r\n]+)\b[a-zA-Z0-9]+\.[0-9]+\b(?![^\S\r\n]+\.)
Regex demo
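A minimal sketch of how the lookaround pattern behaves on the two examples from the question, using plain Regex.IsMatch rather than the validator. PeriodFormatDemo and HasWellFormedPeriod are hypothetical names. The check only reports whether some token in the accepted format exists that is not separated from a stray dot by horizontal whitespace; it is not a proof that the string contains no other stray period.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PeriodFormatDemo
{
    // Mixed letters/digits variant copied from the answer above.
    // The variable-length lookbehind is supported by the .NET regex engine.
    private static readonly Regex WellFormed = new Regex(
        @"(?<!\.[^\S\r\n]+)\b[a-zA-Z0-9]+\.[0-9]+\b(?![^\S\r\n]+\.)");

    // Hypothetical helper wrapping Regex.IsMatch.
    public static bool HasWellFormedPeriod(string s) => WellFormed.IsMatch(s);

    public static void Main()
    {
        Console.WriteLine(HasWellFormedPeriod("This is my example 1 555.1212"));   // True
        Console.WriteLine(HasWellFormedPeriod("This is my example 2 555.1212 .")); // False
    }
}
```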

Regex to match spaces within quotes only

I need to match any space within double quotes ONLY - not outside. I've tried a few things, but none of them work.
[^"]\s[^"] - matches spaces outside of quotes
[^"] [^"] - see above
\s+ - see above
For example I want to match "hello world" but not "helloworld" and not hello world (without quotes). I will specifically be using this regex inside of Visual Studio via the Find feature.
With the .NET and PCRE regex engines, you can use the \G anchor, which matches the position after a successful match or the start of the string, to build a pattern that returns contiguous occurrences from the start of the string:
((?:\G(?!\A)|\A[^"]*")[^\s"]*)\s([^\s"]*"[^"]*")?
example for a replacement with #: demo
pattern details:
( # capture group 1
(?: # two possible beginning
\G(?!\A) # contiguous to a previous match
| # OR
\A[^"]*" # start of the string and reach the first quote
) # at this point you are sure to be inside quotes
[^\s"]* # all that isn't a white-space or a quote
)
\s # the white-space
([^\s"]*"[^"]*")? # optional capture group 2: useful for the last quoted white-space
# since it reaches an eventual next quoted part.
Note: with the .NET regex engine you can also use a lookbehind to test whether the number of quotes before a space is even or odd, but this isn't efficient. (The same goes for a lookahead that checks the remaining quotes until the end of the string; in addition, that approach may give wrong results if the quotes aren't balanced.)
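For use outside the Visual Studio Find dialog, here is a minimal C# sketch of the \G pattern with a replacement. The names QuotedSpaceDemo and MarkQuotedSpaces are hypothetical, the sample sentence is invented, and the quotes in the pattern are doubled only because it is written as a verbatim string.

```csharp
using System;
using System.Text.RegularExpressions;

public static class QuotedSpaceDemo
{
    // The \G pattern from the answer above, with each " doubled for the verbatim string.
    private static readonly Regex QuotedSpace = new Regex(
        @"((?:\G(?!\A)|\A[^""]*"")[^\s""]*)\s([^\s""]*""[^""]*"")?");

    // Hypothetical helper: replaces every white-space inside double quotes with '#'.
    // Group 1 and group 2 are put back unchanged around the replacement character.
    public static string MarkQuotedSpaces(string input) =>
        QuotedSpace.Replace(input, "$1#$2");

    public static void Main()
    {
        string input = "say \"hello big world\" and \"a b\" end";
        Console.WriteLine(MarkQuotedSpaces(input));
        // prints: say "hello#big#world" and "a#b" end
    }
}
```

Spaces outside the quotes are left alone because each match must either start at the beginning of the string or continue exactly where the previous match ended.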

Parsing text between quotes with .NET regular expressions

I have the following input text:
@"This is some text #foo=bar #name=""John \""The Anonymous One\"" Doe"" #age=38"
I would like to parse the values with the #name=value syntax as name/value pairs. Parsing the previous string should result in the following named captures:
name:"foo"
value:"bar"
name:"name"
value:"John \""The Anonymous One\"" Doe"
name:"age"
value:"38"
I tried the following regex, which got me almost there:
@"(?:(?<=\s)|^)#(?<name>\w+[A-Za-z0-9_-]+?)\s*=\s*(?<value>[A-Za-z0-9_-]+|(?="").+?(?=(?<!\\)""))"
The primary issue is that it captures the opening quote in "John \""The Anonymous One\"" Doe". I feel like this should be a lookbehind instead of a lookahead, but that doesn't seem to work at all.
Here are some rules for the expression:
Name must start with a letter and can contain any letter, number, underscore, or hyphen.
Unquoted must have at least one character and can contain any letter, number, underscore, or hyphen.
Quoted value can contain any character including any whitespace and escaped quotes.
Edit:
Here's the result from regex101.com:
(?:(?<=\s)|^)#(?<name>\w+[A-Za-z0-9_-]+?)\s*=\s*(?<value>(?<!")[A-Za-z0-9_-]+|(?=").+?(?=(?<!\\)"))
(?:(?<=\s)|^) Non-capturing group
# matches the character # literally
(?<name>\w+[A-Za-z0-9_-]+?) Named capturing group name
\s* match any white space character [\r\n\t\f ]
= matches the character = literally
\s* match any white space character [\r\n\t\f ]
Quantifier: * Between zero and unlimited times, as many times as possible, giving back as needed [greedy]
(?<value>(?<!")[A-Za-z0-9_-]+|(?=").+?(?=(?<!\\)")) Named capturing group value
1st Alternative: [A-Za-z0-9_-]+
[A-Za-z0-9_-]+ match a single character present in the list below
Quantifier: + Between one and unlimited times, as many times as possible, giving back as needed [greedy]
A-Z a single character in the range between A and Z (case sensitive)
a-z a single character in the range between a and z (case sensitive)
0-9 a single character in the range between 0 and 9
_- a single character in the list _- literally
2nd Alternative: (?=").+?(?=(?<!\\)")
(?=") Positive Lookahead - Assert that the regex below can be matched
" matches the characters " literally
.+? matches any character (except newline)
Quantifier: +? Between one and unlimited times, as few times as possible, expanding as needed [lazy]
(?=(?<!\\)") Positive Lookahead - Assert that the regex below can be matched
(?<!\\) Negative Lookbehind - Assert that it is impossible to match the regex below
\\ matches the character \ literally
" matches the characters " literally
You can use a very useful .NET regex feature where multiple same-named captures are allowed. Also, there is an issue with your (?<name>) capture group: it allows a digit in the first position, which does not meet your 1st requirement.
So, I suggest:
(?si)(?:(?<=\s)|^)#(?<name>\w+[a-z0-9_-]+?)\s*=\s*(?:(?<value>[a-z0-9_-]+)|(?:"")?(?<value>.+?)(?=(?<!\\)""))
See demo
Note that you cannot debug .NET-specific regexes at regex101.com; you need to test them in a .NET-compliant environment.
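A minimal sketch of running the suggested pattern in C#, showing how the two same-named value groups surface as a single Groups["value"]. The class name NamedCaptureDemo and the Parse helper are hypothetical; the pattern and the input string are taken from the question and answer above.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class NamedCaptureDemo
{
    // Pattern from the answer above; both alternatives capture into "value".
    private static readonly Regex Pair = new Regex(
        @"(?si)(?:(?<=\s)|^)#(?<name>\w+[a-z0-9_-]+?)\s*=\s*(?:(?<value>[a-z0-9_-]+)|(?:"")?(?<value>.+?)(?=(?<!\\)""))");

    // Hypothetical helper: returns every name/value pair found in the input.
    public static List<KeyValuePair<string, string>> Parse(string input)
    {
        var pairs = new List<KeyValuePair<string, string>>();
        foreach (Match m in Pair.Matches(input))
            pairs.Add(new KeyValuePair<string, string>(
                m.Groups["name"].Value, m.Groups["value"].Value));
        return pairs;
    }

    public static void Main()
    {
        string input = @"This is some text #foo=bar #name=""John \""The Anonymous One\"" Doe"" #age=38";
        foreach (var pair in Parse(input))
            Console.WriteLine($"{pair.Key} = {pair.Value}");
    }
}
```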
Use string methods.
Split
string myLongString = @"This is some text #foo=bar #name=""John \""The Anonymous One\"" Doe"" #age=38";
string[] nameValues = myLongString.Split('#');
From there either use Split function with "=" or use IndexOf("=").
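The string-method approach could be sketched as follows. SplitDemo and Parse are hypothetical names, and the sketch assumes that values never contain a # character and never end in an escaped quote, since either would break the simple Split and Trim logic.

```csharp
using System;
using System.Collections.Generic;

public static class SplitDemo
{
    // Hypothetical helper: splits on '#', then on the first '=' of each part.
    public static List<KeyValuePair<string, string>> Parse(string input)
    {
        var pairs = new List<KeyValuePair<string, string>>();
        string[] parts = input.Split('#');

        // parts[0] is the text before the first '#', so start at index 1.
        for (int i = 1; i < parts.Length; i++)
        {
            int eq = parts[i].IndexOf('=');
            if (eq <= 0) continue; // no name or no '=': not a name/value pair

            string name = parts[i].Substring(0, eq).Trim();
            // Drop surrounding white-space, then the enclosing quotes if present.
            string value = parts[i].Substring(eq + 1).Trim().Trim('"');
            pairs.Add(new KeyValuePair<string, string>(name, value));
        }
        return pairs;
    }

    public static void Main()
    {
        string myLongString = @"This is some text #foo=bar #name=""John \""The Anonymous One\"" Doe"" #age=38";
        foreach (var pair in Parse(myLongString))
            Console.WriteLine($"{pair.Key} = {pair.Value}");
    }
}
```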

How do I specify a wildcard (for ANY character) in a c# regex statement?

Trying to use a wildcard in C# to grab information from a webpage source, but I cannot seem to figure out what to use as the wildcard character. Nothing I've tried works!
The wildcard only needs to allow for numbers, but as the page is generated the same every time, I may as well allow for any characters.
Regex statement in use:
Regex guestbookWidgetIDregex = new Regex("GuestbookWidget(' INSERT WILDCARD HERE ', '(.*?)', 500);", RegexOptions.IgnoreCase);
If anyone can figure out what I'm doing wrong, it would be greatly appreciated!
The wildcard character is . (a dot).
To match any number of arbitrary characters, use .* (which means zero or more .) or .+ (which means one or more .)
Note that you need to escape your parentheses as \\( and \\) (or \( and \) in an @"" string).
On the dot
In regular expressions, the dot . matches almost any character. The only characters it doesn't normally match are the newline characters. For the dot to match all characters, you must enable what is called the single line mode (aka "dot all").
In C#, this is specified using RegexOptions.Singleline. You can also embed this as (?s) in the pattern.
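A small sketch of the difference, using an invented two-line input; the class and method names are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class DotDemo
{
    // Hypothetical helper: does "a.b" match the string a, newline, b?
    public static bool DotMatchesNewline(RegexOptions options) =>
        Regex.IsMatch("a\nb", "a.b", options);

    // Same check with the inline (?s) modifier instead of the option.
    public static bool InlineModifierMatches() =>
        Regex.IsMatch("a\nb", "(?s)a.b");

    public static void Main()
    {
        Console.WriteLine(DotMatchesNewline(RegexOptions.None));       // False
        Console.WriteLine(DotMatchesNewline(RegexOptions.Singleline)); // True
        Console.WriteLine(InlineModifierMatches());                    // True
    }
}
```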
References
regular-expressions.info/The Dot Matches (Almost) Any Character
On metacharacters and escaping
The . isn't the only regex metacharacter. They are:
( ) { } [ ] ? * + - ^ $ . | \
Depending on where they appear, if you want these characters to mean literally (e.g. . as a period), you may need to do what is called "escaping". This is done by preceding the character with a \.
Of course, a \ is also an escape character for C# string literals. To get a literal \, you need to double it in your string literal (i.e. "\\" is a string of length one). Alternatively, C# also has what is called @-quoted (verbatim) string literals, where escape sequences are not processed. Thus, the following two strings are equal:
"c:\\Docs\\Source\\a.txt"
@"c:\Docs\Source\a.txt"
Since \ is used a lot in regular expressions, @-quoting is often used to avoid excessive doubling.
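The equivalence is easy to confirm with a short sketch (the class name and the regex sample \d+\.\d+ are just illustrations):

```csharp
using System;

public static class VerbatimDemo
{
    public const string Regular = "c:\\Docs\\Source\\a.txt";
    public const string Verbatim = @"c:\Docs\Source\a.txt";

    // The same regex pattern written both ways.
    public const string PatternRegular = "\\d+\\.\\d+";
    public const string PatternVerbatim = @"\d+\.\d+";

    public static void Main()
    {
        Console.WriteLine(Regular == Verbatim);               // True
        Console.WriteLine("\\".Length);                       // 1
        Console.WriteLine(PatternRegular == PatternVerbatim); // True
    }
}
```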
References
regular-expressions.info/Metacharacters
MSDN - C# Programmer's Reference - string
On character classes
Regular expression engines allow you to define character classes, e.g. [aeiou] is a character class containing the 5 vowel letters. You can also use the - metacharacter to define a range, e.g. [0-9] is a character class containing all 10 digit characters.
Since digit characters are so frequently used, regex also provides a shorthand notation for it, which is \d. In C#, this will also match decimal digits from other Unicode character sets, unless you're using RegexOptions.ECMAScript where it's strictly just [0-9].
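To illustrate the \d behaviour, the sketch below tests an Arabic-Indic digit (U+0663), chosen here only as an example of a non-ASCII decimal digit; the names DigitDemo and IsDigit are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class DigitDemo
{
    // Hypothetical helper: is the whole string a single \d character?
    public static bool IsDigit(string s, RegexOptions options = RegexOptions.None) =>
        Regex.IsMatch(s, @"^\d$", options);

    public static void Main()
    {
        string arabicIndicThree = "\u0663"; // a Unicode decimal digit outside 0-9

        Console.WriteLine(IsDigit("7"));                                         // True
        Console.WriteLine(IsDigit(arabicIndicThree));                            // True
        Console.WriteLine(IsDigit(arabicIndicThree, RegexOptions.ECMAScript));   // False
    }
}
```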
References
regular-expressions.info/Character Classes
MSDN - Character Classes - Decimal Digit Character
Related questions
.NET regex: What is the word character \w
Putting it all together
It looks like the following will work for you:
new Regex(@"GuestbookWidget\('\d*', '([^']*)', 500\);", RegexOptions.IgnoreCase);
where:
@"..." - @-quoting
\( and \) - escaped ( and )
\d* - digits
([^']*) - anything but ', captured
Note that I've modified the pattern slightly so that it uses a negated character class instead of reluctant (lazy) wildcard matching. This causes a slight difference in behavior if you allow ' to be escaped in your input string, but neither pattern handles this case perfectly. If you're not allowing ' to be escaped, however, this pattern is definitely better.
References
regular-expressions.info/An Alternative to Laziness and Capturing Groups
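A minimal sketch of using the final pattern to pull out the captured value. The page fragment in Main is invented for illustration, and GuestbookDemo and ExtractId are hypothetical names.

```csharp
using System;
using System.Text.RegularExpressions;

public static class GuestbookDemo
{
    // Pattern from the answer above.
    private static readonly Regex Widget = new Regex(
        @"GuestbookWidget\('\d*', '([^']*)', 500\);", RegexOptions.IgnoreCase);

    // Hypothetical helper: returns the captured second argument, or null if absent.
    public static string ExtractId(string pageSource)
    {
        Match m = Widget.Match(pageSource);
        return m.Success ? m.Groups[1].Value : null;
    }

    public static void Main()
    {
        // Made-up page fragment; the real page source is not shown in the question.
        string source = "<script>GuestbookWidget('12345', 'abc-XYZ', 500);</script>";
        Console.WriteLine(ExtractId(source)); // abc-XYZ
    }
}
```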
