RegEx pattern for partial URL (switch on two values in path)

I have a URL pattern that needs to contain either APPLES or ORANGES in it, no other value. Optionally, it can also have query parameters. I've tried a number of RegEx patterns, but I just can't get a pattern that will respect the strict match.
Sample URLs
Good
http://www.website.com/en/pages/APPLES
http://www.website.com/en/pages/APPLES?k=v
http://www.website.com/en/pages/ORANGES?k=v&k2=v2
http://www.website.com/en/pages/ORANGES
Bad
http://www.website.com/en/pages/APPLES???k=v
http://www.website.com/en/pages/APPLES?k=v=v
http://www.website.com/en/pages/APPLESORANGES
http://www.website.com/en/pages/1APPLES
http://www.website.com/en/APPLES
Attempted RegEx Patterns (well, at least the best attempts)
(http://*.*.website*.*.com/*.*/pages(/APPLES)|(/ORANGES)[\?]*.*)
(http://*.*.website*.*.com/*.*/pages(/APPLES|/ORANGES)[\?]*.*)
If you're curious, I intentionally want to allow any sub-domain, suffix after "website" (for different environments), and any path between .com/ and /pages, hence the use of . in a number of places.
What would be the best way to achieve this?
**Edit: Final Answer**
My final answer was merged from mathematical.coffee and fardjad.
^https?://.*\.website\b.*\.com/.*/pages/(APPLES\b|ORANGES\b)((\?\w+=\w+)(&?\w+=\w+)*)?$
The single limitation I've discovered is that it will not allow a few valid characters (.~_-%+) in the query string parameter key=value pairs (see: http://en.wikipedia.org/wiki/Query_string#Structure). This isn't an issue for me as I'm matching against a string returned from .NET's Uri class, so I know the URL is well-formed overall.
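A minimal C# sketch of how the final pattern above could be checked against the sample URLs. The class and method names (FruitUrlCheck, IsAllowed) are made up for illustration; the pattern is the one quoted above, unchanged.

```csharp
using System.Text.RegularExpressions;

public static class FruitUrlCheck
{
    // The merged pattern from the final answer, copied verbatim.
    // A verbatim string (@"...") keeps every backslash literal.
    private static readonly Regex Pattern = new Regex(
        @"^https?://.*\.website\b.*\.com/.*/pages/(APPLES\b|ORANGES\b)((\?\w+=\w+)(&?\w+=\w+)*)?$");

    // True only when the whole URL matches the pattern.
    public static bool IsAllowed(string url)
    {
        return Pattern.IsMatch(url);
    }
}
```

With this helper, FruitUrlCheck.IsAllowed("http://www.website.com/en/pages/ORANGES?k=v&k2=v2") returns true, while the APPLES???k=v, APPLES?k=v=v and APPLESORANGES samples return false.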

I think the *.* should be .*:
http://.*\.website\b.*\.com/.*/pages/PAGE[12](\?[^=]+=[^&=]+(&[^=]+=[^=&]+)*)?
Explanation:
http:// # just http://
.*\. # any thing, just make sure it's followed by '.'
website\b # website, the whole word
.*\.com # anything between website and .com
/.*/pages/ # anything between the .com and the pages
PAGE[12] # PAGE1 or PAGE2
(\? # opening bracket and '?' (query string)
[^=]+ # the key: i've said it can't include =
= # =
[^=&]+ # the value: i've said it can't include = or &
(& # opening bracket and '&' for next part of query string
[^=]+=[^=&]+ # key=value pair, same regex as before
)* # 0 or more of these (the &key=value)
)? # the entire query string is optional.
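The commented breakdown above can be kept almost as-is in C# by using RegexOptions.IgnorePatternWhitespace, which ignores unescaped whitespace and treats # as the start of a comment. This is only a sketch: the class name CommentedUrlPattern is hypothetical, and the ^ and $ anchors are an addition (an assumption that the whole URL should match), not part of the pattern above.

```csharp
using System.Text.RegularExpressions;

public static class CommentedUrlPattern
{
    // Same pieces as the explanation above, one per line.
    // The ^ and $ anchors are added here and are not in the original pattern.
    private static readonly Regex Pattern = new Regex(@"
        ^http://            # just http:// (anchored at the start)
        .*\.                # anything, as long as it is followed by a dot
        website\b           # website, the whole word
        .*\.com             # anything between website and .com
        /.*/pages/          # anything between .com and /pages/
        PAGE[12]            # PAGE1 or PAGE2
        (\?                 # optional query string, starting with a question mark
          [^=]+=[^&=]+      # first key=value pair
          (&[^=]+=[^=&]+)*  # zero or more further &key=value pairs
        )?
        $                   # end of input",
        RegexOptions.IgnorePatternWhitespace);

    public static bool IsMatch(string url)
    {
        return Pattern.IsMatch(url);
    }
}
```

Because the key is written as [^=]+, it still accepts odd keys such as ??k, so treat it as a loose structural check rather than a validator.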
NOTE - parsing query strings with regex, and making sure the query string is syntactically valid, is usually problematic.
For example, in the regex I supplied above, I've said that the value in &key=value can't have an ampersand in it. But it could contain an escaped entity, like &amp;, which is legal.
You'll always suffer from this sort of problem when you try to parse syntax with regex. It's a risk you'll have to take.
Alternatively, I am sure there is a C# module to parse URLs (many other languages have these), and they take care of all these special cases for you.
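In .NET, the System.Uri class is one such parser. The sketch below is an assumption about how the check might look without a regex: it validates only the last two path segments and a simplified key=value rule, and it deliberately skips the host check. The names FruitUrlParser and IsAllowed are hypothetical.

```csharp
using System;

public static class FruitUrlParser
{
    public static bool IsAllowed(string url)
    {
        Uri uri;
        if (!Uri.TryCreate(url, UriKind.Absolute, out uri))
            return false;

        // For /en/pages/APPLES the segments are "/", "en/", "pages/", "APPLES".
        string[] segments = uri.Segments;
        if (segments.Length < 3)
            return false;
        string last = segments[segments.Length - 1];
        string parent = segments[segments.Length - 2];
        if (parent != "pages/" || (last != "APPLES" && last != "ORANGES"))
            return false;

        // Uri.Query is either empty or starts with '?'.
        string query = uri.Query;
        if (query.Length <= 1)
            return true; // no query string (or a bare '?')

        // Simplified rule (an assumption): every pair is key=value with
        // exactly one '=', a non-empty key and value, and no stray '?'.
        foreach (string pair in query.Substring(1).Split('&'))
        {
            string[] parts = pair.Split('=');
            if (parts.Length != 2 || parts[0].Length == 0 ||
                parts[1].Length == 0 || pair.Contains("?"))
                return false;
        }
        return true;
    }
}
```

The benefit of this shape is that path and query are separated by the parser rather than by the pattern, so each rule can be tightened or relaxed on its own.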

Try this:
^https?://(www\.)?\w+[^/]+(/\w+(?=/)){2}/(PAGE1|PAGE2)((\?\w+=\w+)(&?\w+=\w+)*)?$

Related

How to check query parameter in url using Regex in C# .NET?

I'm trying to check whether the URL below matches the regex below:
Target Url: /challenge/getAllChallenges?type=public
Regex: "/challenge/getAllChallenges([/?]+)"
But it seems the regex above allows any character to appear after "getAllChallenges".
How do I allow only '?' as the first character after "getAllChallenges"?
Ideally, both of the URLs below should be validated as a match by a single regex:
/challenge/getAllChallenges
/challenge/getAllChallenges?type=public
but the ones below should not be valid:
/challenge/getAllChallenge/blah
/challenge/getAllChallengeblah

Something like
/challenge/getAllChallenges(\?[^?]+)?$
It's "/challenge/getAllChallenges" followed by zero or one of: (a question mark followed by one or more of anything other than a question mark).
Your original regex required "/challenge/getAllChallenges" followed by one or more of: (forward slash, question mark).
Most special characters lose their meaning inside [] character classes and hence do not need escaping, if that's what you were trying to do by putting a slash before the ?. If it was, the slash was the wrong direction: a backslash escapes, not a forward slash.
It is important to include the end-of-input marker $ to prevent a partial match from reporting success.
-
I support the notion raised in the comment: generally, using a class dedicated to parsing and manipulating a kind of value will give better results than a regex-based solution. For example, the regex I gave above follows what was determined from your required matches: either no ?, or a ? followed by something. A URL that simply ends with a question mark is perfectly valid, though, and the regex will need tweaking to allow it.
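A short C# sketch of the pattern from this answer, plus the tweak mentioned for a trailing question mark. The class and member names are hypothetical, and the leading ^ is an addition (an assumption that the path must also start at the beginning of the input).

```csharp
using System.Text.RegularExpressions;

public static class ChallengeRoute
{
    // The answer's pattern with a leading ^ added.
    private static readonly Regex Strict =
        new Regex(@"^/challenge/getAllChallenges(\?[^?]+)?$");

    // Variant: * instead of + also accepts a path ending in a bare "?".
    private static readonly Regex Lenient =
        new Regex(@"^/challenge/getAllChallenges(\?[^?]*)?$");

    public static bool Matches(string path)
    {
        return Strict.IsMatch(path);
    }

    public static bool MatchesLenient(string path)
    {
        return Lenient.IsMatch(path);
    }
}
```

ChallengeRoute.Matches accepts /challenge/getAllChallenges and /challenge/getAllChallenges?type=public and rejects the two "blah" paths; only MatchesLenient accepts /challenge/getAllChallenges? with nothing after the question mark.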

RegEx different substitutions based on groups?

So I'm relatively n00bish at regular expressions, and doing a little practicing.
I'm playing with a dog-simple "deobfuscator" that just looks for [dot] or (dot) or [at] or (at). Case-insensitive, and with or without any number of spaces before or after the match(es).
This is for the usual: someemail [AT] domain (dot) com type of thing. I obviously want to turn it into someemail@domain.com.
The RegEx I've come up with does the matching fine, but now I want to replace with either a . or a @ depending on the match.
i.e.
I want a match on the "dot" group to be replaced with the literal ., and a match on the "at" group with the literal @.
I know I could just write 2 different (almost identical) RegEx's and run it through both, but for the sake of education, I'm trying to see if I can do it all in one RegEx?
Here's the RegEx I came up with (probably not the smallest possible, which I'd also be interested in seeing):
+(\[|\()(dot)(\)|\]) +| +(\[|\()(at)(\)|\]) +
NOTE: before each + there's an empty space, for matching spaces.
What I'm looking for is what I would use to do the replacement(s) properly?
Update: Sorry all, forgot to add which language I was working with for this. In this case, I'm using a clipboard utility that can run RegExes on its input (whatever gets copied to the clipboard), and the engine it uses is C#/VB.NET. The ultimate goal for this little project is to just be able to copy an "obfuscated" email address or URL, and run the RegEx on it so that it's set on the clipboard in its "unobfuscated" state.
That said, I do tend to use RegEx's on many different languages, so converting them between languages generally isn't an issue.

.NET regex does not support conditional replacement patterns.
You wrote: "for the sake of education, I'm trying to see if I can do it all in one RegEx?"
There are other regex engines that allow conditional replacement logic in a single regex replacement operation with conditional replacement patterns.
There are 3 engines that support this type of replacements: JGsoft V2, Boost, and PCRE2.
For conditionals to work in Boost, you need to pass regex_constants::format_all to regex_replace. For them to work in PCRE2, you need to pass PCRE2_SUBSTITUTE_EXTENDED to pcre2_substitute.
In PCRE2:
${1:+matched:unmatched} where 1 is a number between 1 and 99 referencing a numbered capturing group. If your regex contains named capturing groups then you can reference them in a conditional by their name: ${name:+matched:unmatched}.
If you want a literal colon in the matched part, then you need to escape it with a backslash. If you want a literal closing curly brace anywhere in the conditional, then you need to escape that with a backslash too. Plus signs have no special meaning beyond the :+ that starts the conditional, so they don't need to be escaped.
Also, see The Boost-Specific Format Sequences:
When specifying the format_all flag to regex_replace(), the escape sequences recognized are the same as those above for format_perl. In addition, conditional expressions of the following form are recognized:
?Ntrue-expression:false-expression
where N is a decimal digit representing a sub-match. If the corresponding sub-match participated in the full match, then the substitution is true-expression. Otherwise, it is false-expression. In this mode, you can use parens () for grouping. If you want a literal paren, you must escape it as \(.
In Boost replacement patterns, literal ( and ) must be escaped.
The syntax for JGsoft V2 replacement string conditionals is the same as that in the C++ Boost library.
So, your regex can be contracted to ( +)[[(](?:(dot)|(at))[])]( +):
( +) - Group 1: one or more spaces
[[(] - a [ or (
(?:(dot)|(at)) - Either (Group 2) a dot substring or (Group 3) an at substring
[])] - a ) or ]
( +) - Group 4: one or more spaces
And replace with $1(?{2}.:@)$4:
$1 - Group 1 value,
(?{2}.:@) - if Group 2 (dot) matched, replace with ., else with @
$4 - Group 4 value.
This conditional replacement syntax is available in Notepad++.
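Since .NET has no conditional replacement patterns, a common workaround there is to pass a MatchEvaluator to Regex.Replace and decide the replacement in code. The sketch below is one way that might look; the class name Deobfuscator is hypothetical, and the pattern is a simplified variant (it uses \s* and drops the surrounding spaces instead of preserving them), not the exact regex from above.

```csharp
using System.Text.RegularExpressions;

public static class Deobfuscator
{
    public static string Clean(string input)
    {
        return Regex.Replace(
            input,
            // Optional spaces, an opening [ or (, then dot (Group 1) or at,
            // a closing ] or ), and optional spaces again.
            @"\s*[\[(](?:(dot)|at)[\])]\s*",
            // Group 1 only participates when "dot" was matched.
            m => m.Groups[1].Success ? "." : "@",
            RegexOptions.IgnoreCase);
    }
}
```

For example, Deobfuscator.Clean("someemail [AT] domain (dot) com") returns "someemail@domain.com".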

If you are using Java, try the replaceAll method of the String class.
Finally, you need to normalize the whitespace:
- Pure Java - String after = before.trim().replaceAll("\\s+", " ");
- Pure Java - String after = before.replaceAll("\\s{2,}", " ").trim();
- Apache commons lang3 - String after = StringUtils.normalizeSpace(before);
- ...

Removing comments using regex

I am building a parser, and I would like to remove comments from various lines. For example,
variable = "some//thing" ////actual comment
Comment marker is //. In this case, variable would contain "some//thing" and everything else would be ignored. I plan to do it using regex replace. Currently I am using (".*"|[ \t])*(\/\/.*) as regex. However replacing it replaces "some//thing" ////actual comment entirely.
I can not figure out the regex which I should use instead. Thanks for any help.
Additional info - I am using C# with netcoreapp 1.1.0
Edit - some cases might be of a line with just comment like //line comment. Strings also might contain escaped quotes.

Here is the ugly regex pattern. I believe it will work well. I have tried it with every pathological example I can think of, including lines that contain syntax errors. For example, a quoted string that has too many quotes, or too few, or has a double escaped quote, which is, therefore, not escaped. And with quoted strings in the comments, which I have been known to do when I want to remind myself of alternatives.
The only time that it trips up is if there is a double slash inside a seemingly quoted string and somehow that string is malformed and the double slash ends up legally outside the properly quoted portion. Syntactically that makes it a valid comment, even though not the programmer's intention. So, from the programmer's perspective it's wrong, but by the rules, it's really a comment. Meaning, the pattern only appears to trip up.
When used, the pattern will return the non-comment portion of the line(s). The pattern has a newline \n in it to allow for applying it to an entire file. You may need to modify that if your system interprets newlines in some other fashion, for example as \r or \r\n. To use it in single-line mode you can remove that if you choose. It is at characters 17 and 18 in the one-liner and is on the fifth line, 6th and 7th printing characters, in the multi-line version. You can safely leave it there, however, as in single-line mode it makes no difference, and in multi-line mode it will return a newline for lines of code that are either blank or have a comment beginning in the first column. That will keep the line numbers the same in the original version and the stripped version if you write the results to a new file, which makes comparison easy.
One major caveat for this pattern: it uses a grouping construct that has varying levels of support in regex engines. I believe that, as used here with a lookaround, only the .NET and PCRE engines will accept it; YMMV. It is a ternary-style construct: (?(_condition_)_then_|_else_). The _condition_ pattern is treated as a zero-width assertion. If the pattern matches, then the _then_ pattern is used in the attempted match, otherwise the _else_ pattern is used. Without that construct, the pattern was growing to uncommon lengths, and was still failing on some of my pathological test cases.
The pattern presented here is as it needs to be seen by the regex engine. I am not a C# programmer, so I don't know all the nuances of escaping quoted strings. Getting this pattern into your code, such that all the backslashes and quotes are seen properly by the regex engine is still up to you. Maybe C# has the equivalent of Perl's heredoc syntax.
This is the one-liner pattern to use:
^((?:(?:(?:[^"'/\n]|/(?!/))*)(?("(?=(?:\\\\|\\"|[^"])*"))(?:"(?:\\\\|\\"|[^"])*")|(?('(?=(?:\\\\|\\'|[^'])*'))(?:'(?:\\\\|\\'|[^'])*')|(?(/)|.))))*)
If you want to use the ignore pattern whitespace option, you can use this version:
(?x) # Turn on the ignore white space option
^( # Start the only capturing group
(?: # A non-capturing group to allow for repeating the logic
(?: # Capture either of the two options below
[^"'/\n] # Capture everything not a single quote, double quote, a slash, or a newline
| # OR
/(?!/) # Capture a slash not followed by a slash [slash and negative look-ahead slash]
)* # As many times as possible, even if none
(?(" # Start a conditional match for double-quoted strings
(?=(?:\\\\|\\"|[^"])*") # Followed by a properly closed double-quoted string
) # Then
(?:"(?:\\\\|\\"|[^"])*") # Capture the whole double-quoted string
| # Otherwise
(?(' # Start a conditional match for single-quoted strings
(?=(?:\\\\|\\'|[^'])*') # Followed by a properly closed single-quoted string
) # Then
(?:'(?:\\\\|\\'|[^'])*') # Capture the whole single-quoted string
| # Otherwise
(?([^/]) # If next character is not a slash
.) # Capture that character, it is either a single quote, or a double quote not part of a properly closed
) # end the conditional match for single-quoted strings
) # End the conditional match for double-quoted strings
)* # Close the repeating non-capturing group, capturing as many times as possible, even if none
) # Close the only capturing group
This allows for your code to explain this monstrosity so that when someone else looks at it, or in a few months you have to work on it yourself, there's no WTF moment. I think the comments explain it well, but feel free to change them any way you please.
As mentioned above, the conditional match grouping has limited support. One place it will fail is on the site you linked to in an earlier comment. Since you're using C#, I chose to do my testing in the .NET Regex Tester, which can handle those constructs. It includes a nice Reference too. Given the proper selections on the side, you can test either version above, and experiment with it as well. Considering its complexity, I would recommend testing it, somewhere, against data from your files, as well as any edge cases and pathological tests you can dream up.
Just to redeem this small pattern, there is a much bigger pattern for testing email address that is 78 columns by 81 lines, with a couple dozen characters to spare. (Which I do not recommend using, or any other regex, for testing email addresses. Wrong tool for the job.) If you want to scare yourself, have a peek at it on the ex-parrot site. I had nothing to do with that!!

"[^"\\]*(?:\\[\W\w][^"\\]*)*"|(\/\/.*)
Flags: global
Matches full strings or a comment.
Group 1: comment.
So if there's no comment, replace with the same matching text. Otherwise, do your thing on the comment itself.
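In C#, the "keep strings, drop comments" idea above could be sketched with a MatchEvaluator. The class name CommentStripper is hypothetical; the pattern is the one from this answer, written as a verbatim string (so each " is doubled) and without the optional escaping of the slashes.

```csharp
using System.Text.RegularExpressions;

public static class CommentStripper
{
    // Alternative 1: a complete double-quoted string, escapes included.
    // Alternative 2 (Group 1): a // comment running to the end of the line.
    private static readonly Regex Pattern = new Regex(
        @"""[^""\\]*(?:\\[\W\w][^""\\]*)*""|(//.*)");

    public static string Strip(string line)
    {
        // Strings are put back unchanged; comments are replaced by nothing.
        return Pattern.Replace(line, m => m.Groups[1].Success ? "" : m.Value);
    }
}
```

Whitespace that preceded the comment is left in place, so a caller may want to apply TrimEnd() to the result.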

Regex : replace a string

I'm currently facing a (little) blocking issue. I'd like to replace a substring by one another using regular expression. But here is the trick : I suck at regex.
Regex.Replace(contenu, "Request.ServerVariables("*"))",
"ServerVariables('test')");
Basically I'd like to replace whatever is between the " by "test". I tried ".{*}" as a pattern but it doesn't work.
Could you give me some tips, I'd appreciate it!

There are several issues you need to take care of.
1. You are using special characters in your regex (., parens, quotes) -- you need to escape these with a backslash. And you need to escape the backslashes with another backslash as well because we're in a C# string literal, unless you prefix the string with @ (a verbatim string), in which case the escaping rules are different.
2. The expression to match "any number of whatever characters" is .*. In this case, you would want to match any number of non-quote characters, which is [^"]*.
3. In contrast to (1) above, the replacement string is not a regular expression so you don't want any backslashes there.
4. You need to store the return value of the replace somewhere.
The end result is
var result = Regex.Replace(contenu,
@"Request\.ServerVariables\(""[^""]*""\)",
"Request.ServerVariables('test')");

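A self-contained sketch of the Regex.Replace call from the answer above, wrapped in a helper so it can be tried on sample input. The class and method names (ServerVariablesRewrite, Rewrite) are made up for illustration.

```csharp
using System.Text.RegularExpressions;

public static class ServerVariablesRewrite
{
    public static string Rewrite(string contenu)
    {
        // Verbatim string: backslashes are literal, and "" stands for one ".
        // [^""]* matches whatever sits between the two double quotes.
        return Regex.Replace(
            contenu,
            @"Request\.ServerVariables\(""[^""]*""\)",
            "Request.ServerVariables('test')");
    }
}
```

For example, passing host = Request.ServerVariables("HTTP_HOST") yields host = Request.ServerVariables('test').
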
Based purely on my knowledge of regex (and not how they are done in C#), the pattern you want is probably:
"[^"]*"
ie - match a " then match everything that's not a " then match another "
You may need to escape the double-quotes to make your regex-parser actually match on them... that's what I don't know about C#

Where you can, try to avoid '.*' in a regex; you can usually get what you want by excluding other characters instead, for example [^"]+ for text that is not quoted, or ([^)]+) for text that is not in parentheses. So you may just want "([^"]+)", which should give you the whole thing in [0], and in [1] you'll find the text between the quotes.
You could also just replace '"' with '' I think.

Taryn East's regex includes the *. You should remove it, if it is just a placeholder for any value:
"[^"]"
BTW: You can test this regex with this cool editor: http://rubular.com/r/1MMtJNF3kM

I have two problems, one of them is a regex

I am updating some code that I didn't write and part of it is a regex as follows:
\[url(?:\s*)\]www\.(.*?)\[/url(?:\s*)\]
I understand that .*? does a non-greedy match of everything in the second register.
What does ?:\s* in the first and third registers do?
Update: As requested, language is C# on .NET 3.5

The syntax (?:) is a way of putting parentheses around a subexpression without separately extracting that part of the string.
The author wanted to match the (.*?) part in the middle, and didn't want the spaces at the beginning or the end to get in the way. Now you can use \1 or $1 (or whatever the appropriate method is in your particular language) to refer to the domain name, instead of the first chunk of spaces at the beginning of the string.

?: makes the parentheses non-capturing. In that regex, you'll only pull out one piece of information, $1, which contains the middle (.*?) expression.

What does ?:\s* in the first and third registers do?
It's matching zero or more whitespace characters, without capturing them.
The regex author intends to allow trailing whitespace in the square-bracket-tags, matching all DNS labels following the "www." like so:
[url]www.foo.com[/url] # foo.com
[url ]www.foo.com[/url ] # same
[url ]www.foo.com[/url] # same
[url]www.foo.com[/url ] # same
Note that the regex also matches:
[url]www.[/url] # empty string!
and fails to match
[url]stackoverflow.com[/url] # no match, bummer
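The effect of the non-capturing groups can be observed from C#. In this sketch the class and method names (UrlTag, ExtractDomain, GroupCount) are hypothetical; the pattern is the one from the question.

```csharp
using System.Text.RegularExpressions;

public static class UrlTag
{
    private static readonly Regex Pattern = new Regex(
        @"\[url(?:\s*)\]www\.(.*?)\[/url(?:\s*)\]");

    // Returns the text after "www.", or null when the tag does not match.
    public static string ExtractDomain(string text)
    {
        Match m = Pattern.Match(text);
        return m.Success ? m.Groups[1].Value : null;
    }

    // Group 0 (the whole match) plus Group 1; the (?:\s*) groups add none.
    public static int GroupCount()
    {
        return Pattern.GetGroupNumbers().Length;
    }
}
```

UrlTag.GroupCount() is 2, and UrlTag.ExtractDomain("[url ]www.foo.com[/url]") returns "foo.com", which is why $1 refers to the domain rather than to the whitespace.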

You may find this Regular Expressions Cheat Sheet very helpful (hopefully). I spent ages trying to learn Regex with no luck. And once I read this cheat-sheet - I immediately understood what I previously failed to learn.
http://krijnhoetmer.nl/stuff/regex/cheat-sheet/
