Determine if a regex is just a literal match

Using the .NET Regex class is there an easy way to check whether a regex is a literal match with no special characters (save for escaped special characters)?
Looking for something like this:
var literalRegex = new Regex(@"\(foo\)");
var fancyRegex = new Regex("foo.*");
Console.WriteLine(IsPlainLiteral(literalRegex)); // True
Console.WriteLine(IsPlainLiteral(fancyRegex)); // False

I suggest this pattern, which matches all "literal patterns" (that is, well-formed patterns in which every character is a literal, an escaped special character, or an ignored backslash escape).
In a verbatim string:
\A
[^[\\|{.?*+^$()]* # characters that aren't one of the twelve special characters
(?>
(?: # exceptions:
# - the opening curly bracket that is not the start of a quantifier
{+ (?! [0-9]+ (?:,[0-9]*)? } )
|
# - the backslash if it escapes a character:
# - that is one of the twelve special characters
# - or produces an ignored escape sequence
\\ [^\p{L}\p{N}]
)
[^[\\|{.?*+^$()]*
)*
\z
Note: this pattern is designed for the .NET regex syntax.
Note 2: for patterns used with the IgnorePatternWhitespace option, you must also exclude whitespace and # from the character class, so it becomes: [^[\\|{.?*+^$()#\s]
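A minimal sketch of how the pattern above could back the IsPlainLiteral helper from the question. The class name LiteralPatternCheck is hypothetical, the pattern comments are shortened, and the sketch assumes the regex being checked was built without RegexOptions.IgnorePatternWhitespace (see the second note above).

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper; not part of the .NET API.
public static class LiteralPatternCheck
{
    // The pattern from the answer above, with shortened comments.
    private static readonly Regex LiteralPattern = new Regex(@"
        \A
        [^[\\|{.?*+^$()]*                        # ordinary characters
        (?>
            (?:
                {+ (?! [0-9]+ (?:,[0-9]*)? } )   # opening brace that does not start a quantifier
              |
                \\ [^\p{L}\p{N}]                 # backslash followed by a non-alphanumeric
            )
            [^[\\|{.?*+^$()]*
        )*
        \z", RegexOptions.IgnorePatternWhitespace);

    public static bool IsPlainLiteral(string pattern) => LiteralPattern.IsMatch(pattern);

    // Regex.ToString() returns the pattern text passed to the constructor.
    public static bool IsPlainLiteral(Regex regex) => IsPlainLiteral(regex.ToString());
}
```

Called as LiteralPatternCheck.IsPlainLiteral(literalRegex), this should accept \(foo\) and reject foo.*, matching the behaviour asked for in the question.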

An easy, correct way? I don't think one exists.
However, if you are not afraid of a little reflection hacking, it should be quite easy. Once a Regex object has been initialized, it holds an abstract syntax tree called RegexTree (an internal type, so reaching it requires reflection).
All you need to do is answer this question:
does the tree have only one node, and is that node a literal node?
One other option is to write your own regex parser that follows the syntax of C# regex, and build the AST yourself.

Related

Regex.Matches throws exception for regex formula c# [duplicate]

I am trying to create a .NET RegEx expression that will properly balance out my parentheses. I have the following RegEx expression:
func([a-zA-Z_][a-zA-Z0-9_]*)\(.*\)
The string I am trying to match is this:
"test -> funcPow((3),2) * (9+1)"
What should happen is Regex should match everything from funcPow until the second closing parenthesis. It should stop after the second closing parenthesis. Instead, it is matching all the way to the very last closing parenthesis. RegEx is returning this:
"funcPow((3),2) * (9+1)"
It should return this:
"funcPow((3),2)"
Any help on this would be appreciated.

Regular Expressions can definitely do balanced parentheses matching. It can be tricky, and requires a couple of the more advanced Regex features, but it's not too hard.
Example:
var r = new Regex(@"
func([a-zA-Z_][a-zA-Z0-9_]*) # The func name
\( # First '('
(?:
[^()] # Match all non-braces
|
(?<open> \( ) # Match '(', and capture into 'open'
|
(?<-open> \) ) # Match ')', and delete the 'open' capture
)+
(?(open)(?!)) # Fails if 'open' stack isn't empty!
\) # Last ')'
", RegexOptions.IgnorePatternWhitespace);
Balanced matching groups have a couple of features, but for this example, we're only using the capture deleting feature. The line (?<-open> \) ) will match a ) and delete the previous "open" capture.
The trickiest line is (?(open)(?!)), so let me explain it. (?(open) is a conditional expression that only matches if there is an "open" capture. (?!) is a negative expression that always fails. Therefore, (?(open)(?!)) says "if there is an open capture, then fail".
Microsoft's documentation was pretty helpful too.
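To see the balancing-group pattern at work, here is a small sketch that applies it to the string from the question. The class and method names (BalancedCallFinder, FirstCall) are made up for illustration; the pattern itself is the one shown above.

```csharp
using System.Text.RegularExpressions;

// Hypothetical wrapper around the pattern from this answer.
public static class BalancedCallFinder
{
    private static readonly Regex CallPattern = new Regex(@"
        func([a-zA-Z_][a-zA-Z0-9_]*)  # the func name
        \(                            # first opening parenthesis
        (?:
            [^()]                     # any non-parenthesis character
          |
            (?<open> \( )             # opening parenthesis: push a capture onto 'open'
          |
            (?<-open> \) )            # closing parenthesis: pop the latest 'open' capture
        )+
        (?(open)(?!))                 # fail if an opening parenthesis is still unclosed
        \)                            # last closing parenthesis
        ", RegexOptions.IgnorePatternWhitespace);

    // Returns the first balanced call found, or an empty string if there is none.
    public static string FirstCall(string input)
    {
        Match m = CallPattern.Match(input);
        return m.Success ? m.Value : string.Empty;
    }
}
```

BalancedCallFinder.FirstCall("test -> funcPow((3),2) * (9+1)") should return funcPow((3),2). Note that because of the + quantifier, a call with an empty argument list such as funcFoo() would not match this pattern.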

Using balanced groups, it is:
Regex rx = new Regex(@"func([a-zA-Z_][a-zA-Z0-9_]*)\(((?<BR>\()|(?<-BR>\))|[^()]*)+\)");
var match = rx.Match("funcPow((3),2) * (9+1)");
var str = match.Value; // funcPow((3),2)
(?<BR>\()|(?<-BR>\)) are a Balancing Group (the BR I used for the name is for Brackets). It's more clear in this way (?<BR>\()|(?<-BR>\)) perhaps, so that the \( and \) are more "evident".
If you really hate yourself (and the world/your fellow co-programmers) enough to use these things, I suggest using the RegexOptions.IgnorePatternWhitespace and "sprinkling" white space everywhere :-)

Regular expressions only work on regular languages. This means that a regular expression can find things like "any combination of a's and b's" (ab, babbabaaa, etc.), but it can't find "n a's, one b, n a's" (a^n b a^n). A regular expression can't guarantee that the first run of a's is the same length as the second.
Because of this, regular expressions aren't able to match equal numbers of opening and closing parentheses. It would be easy enough to write a function that traverses the string one character at a time. Keep two counters, one for opening parentheses and one for closing. Increment the counters as you traverse the string; if opening_paren_count != closing_paren_count at the end, return false.
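The character-by-character scan described above could look roughly like this. It is a sketch that swaps the two counters for a single depth counter (incremented on an opening parenthesis, decremented on a closing one), which also lets it report where the matching parenthesis is. ParenScanner and FindMatchingParen are hypothetical names.

```csharp
// Hypothetical helper illustrating a manual scan for balanced parentheses.
public static class ParenScanner
{
    // Assumes text[openIndex] is '('. Returns the index of the ')' that
    // closes it, or -1 if the parentheses are not balanced.
    public static int FindMatchingParen(string text, int openIndex)
    {
        int depth = 0;
        for (int i = openIndex; i < text.Length; i++)
        {
            if (text[i] == '(')
            {
                depth++;
            }
            else if (text[i] == ')' && --depth == 0)
            {
                return i;
            }
        }
        return -1;
    }
}
```

For the string funcPow((3),2) * (9+1), starting at the first opening parenthesis, the function returns the index of the parenthesis that closes the argument list, so the text up to and including that index is funcPow((3),2).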

func[a-zA-Z0-9_]*\((([^()])|(\([^()]*\)))*\)
You can use that, but if you're working with .NET, there may be better alternatives.
This part you already know:
func[a-zA-Z0-9_]*\( --weird part-- \)
The --weird part-- just means: ( allow any character ., or | any section (.*) to exist as many times as it wants )*. The only issue is that you can't match any character with ., so you have to use [^()] to exclude the parentheses.
(([^()])|(\([^()]*\)))*

Regex to match spaces within quotes only

I need to match any space within double quotes ONLY - not outside. I've tried a few things, but none of them work.
[^"]\s[^"] - matches spaces outside of quotes
[^"] [^"] - see above
\s+ - see above
For example I want to match "hello world" but not "helloworld" and not hello world (without quotes). I will specifically be using this regex inside of Visual Studio via the Find feature.

With .net and pcre regex engines, you can use the \G feature that matches the position after a successful match or the start of the string to build a pattern that returns contiguous occurrences from the start of the string:
((?:\G(?!\A)|\A[^"]*")[^\s"]*)\s([^\s"]*"[^"]*")?
example for a replacement with #: demo
pattern details:
( # capture group 1
(?: # two possible beginnings
\G(?!\A) # contiguous to a previous match
| # OR
\A[^"]*" # start of the string and reach the first quote
) # at this point you are sure to be inside quotes
[^\s"]* # all that isn't a white-space or a quote
)
\s # the white-space
([^\s"]*"[^"]*")? # optional capture group 2: useful for the last quoted white-space
# since it reaches an eventual next quoted part.
Notice: with the .net regex engine you can also use the lookbehind to test if the number of quotes before a space is even or odd, but this way isn't efficient. (same thing for a lookahead that checks remaining quotes until the end, but in addition this approach may be wrong if the quotes aren't balanced).
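As a rough illustration of the replacement described above, the following sketch runs the \G pattern through Regex.Replace with $1#$2 as the replacement, so each whitespace character inside quotes becomes #. The class and method names are invented for the example, and the input is assumed to have balanced quotes.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper; the pattern is the one from this answer.
public static class QuotedSpaceReplacer
{
    // In a verbatim string, each double quote of the pattern is written twice.
    private static readonly Regex SpaceInQuotes = new Regex(
        @"((?:\G(?!\A)|\A[^""]*"")[^\s""]*)\s([^\s""]*""[^""]*"")?");

    // Replaces every whitespace character found inside double quotes with '#'.
    public static string MarkSpaces(string input)
    {
        return SpaceInQuotes.Replace(input, "$1#$2");
    }
}
```

With the input say "hello big world" and "a b" end, the spaces inside the two quoted parts are replaced and the spaces outside them are left alone.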

Regular Expression Space character not working

My Regex is for a canadian postal code and only allowing the valid letters:
Regex pattern = new Regex("^[ABCEGHJKLMNPRSTVXY][0-9][ABCEGHJKLMNPRSTVWXYZ][/s][0-9][ABCEGHJKLMNPRSTVWXYZ][0-9]$");
The problem I am having is that I want to allow for a space to be put in between the each set but cannot find the correct character to use.

You've got a forward-slash instead of a backslash in your regular expression for whitespace (\s). The following regex should work.
@"^[ABCEGHJKLMNPRSTVXY][0-9][ABCEGHJKLMNPRSTVWXYZ][\s][0-9][ABCEGHJKLMNPRSTVWXYZ][0-9]$"

If you are simply looking for whitespace, use \s.
To avoid having to escape the backslash itself, use a verbatim string literal (the @ prefix), as in the example below.
Regex pattern = new Regex(@"^[ABCEGHJKLMNPRSTVXY][0-9][ABCEGHJKLMNPRSTVWXYZ]\s[0-9][ABCEGHJKLMNPRSTVWXYZ][0-9]$");
As pointed out in the comments, if the space is optional you can use the ? quantifier:
Regex pattern = new Regex(@"^[ABCEGHJKLMNPRSTVXY][0-9][ABCEGHJKLMNPRSTVWXYZ]\s?[0-9][ABCEGHJKLMNPRSTVWXYZ][0-9]$");

Use the \s token for whitespace instead of /s.
Some handy tools to speed up regex development:
regexr.com helps with syntax and provides realtime testing
regexpr.com (yes I know :)) visualizes your expression.

As per other answers....
Use \s instead of /s
You shouldn't need to square bracket the [\s], because it already implies a complete class of characters.
Also...
In most languages, you probably don't want to use double quotes "..." as delimiters to the Regex, since this might be interpolating the \s before the pattern is applied. It's certainly worth a try.
Use a trailing quantifier \s* or \s? to allow the space to be optional.
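Putting these suggestions together, a quick check might look like the sketch below. It assumes the space between the two halves of the postal code is optional; PostalCodeCheck is a hypothetical name.

```csharp
using System.Text.RegularExpressions;

// Hypothetical validator combining the fixes suggested in the answers.
public static class PostalCodeCheck
{
    // Verbatim string, so \s reaches the regex engine unchanged.
    // \s? allows at most one whitespace character between the two halves.
    private static readonly Regex PostalCode = new Regex(
        @"^[ABCEGHJKLMNPRSTVXY][0-9][ABCEGHJKLMNPRSTVWXYZ]\s?[0-9][ABCEGHJKLMNPRSTVWXYZ][0-9]$");

    public static bool IsValid(string code)
    {
        return PostalCode.IsMatch(code);
    }
}
```

Here both K1A 0B1 and K1A0B1 are accepted, while K1A/0B1 and D1A 0B1 (D is not an allowed first letter) are rejected.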

Trying to understand this regex

I have this regex
^(\\w|@|\\-| |\\[|\\]|\\.)+$
I'm trying to understand what it does exactly but I can't seem to get any result...
I just can't understand the double backslashes everywhere... Isn't double backslash supposed to be used to get a single backslash?
This regex is to validate that a username doesn't use weird characters and stuff.
If someone could explain the double backslash thing to me, please. @_@
Additional info: I got this regex in C# using Regex.IsMatch to check if my username string match the regex. It's for an asp website.

My guess is that it's simply escaping the \, since backslash is the escape character in C# string literals.
string pattern = "^(\\w|@|\\-| |\\[|\\]|\\.)+$";
can be rewritten using a verbatim string as
string pattern = @"^(\w|@|\-| |\[|\]|\.)+$";
Now it's a bit easier to understand what's going on. It will match any word character, at-sign, hyphen, space, square bracket or period, repeated one or more times. The ^ and $ match the beginning and end of the string, respectively, so only those characters are allowed.
Therefore this pattern is equivalent to:
string pattern = @"^([\w@ \[\].-])+$";
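A small sketch to check these claims: the escaped and verbatim literals are the same string, and the character-class version accepts and rejects the same sample inputs. UsernameCheck and its member names are hypothetical.

```csharp
using System.Text.RegularExpressions;

// Hypothetical holder for the three spellings of the username pattern.
public static class UsernameCheck
{
    // Regular string literal: every backslash has to be doubled.
    public const string Escaped = "^(\\w|@|\\-| |\\[|\\]|\\.)+$";

    // Verbatim string literal: backslashes are written once.
    public const string Verbatim = @"^(\w|@|\-| |\[|\]|\.)+$";

    // The alternation rewritten as a single character class.
    public const string CharClass = @"^([\w@ \[\].-])+$";

    public static bool IsValid(string name, string pattern)
    {
        return Regex.IsMatch(name, pattern);
    }
}
```

UsernameCheck.Escaped == UsernameCheck.Verbatim evaluates to true, and a name such as john.doe@example [dev] passes with any of the three patterns while bad!name fails.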

Each double backslash stands for a single backslash. The backslash is doubled to escape the backslash itself, because in a C# string literal the backslash introduces escape sequences, e.g. \n stands for a newline.
With the double backslashes sorted out, it becomes ^(\w|@|\-| |\[|\]|\.)+$
Break this regex down: | means OR, so \w|@|\-| |\[|\]|\. means \w or @ or \- or space or \[ or \] or \. That is, any word character, @, -, space, [, ] or . character. Note that here the backslash is the regex escape, used to escape the -, [, ] and . characters, which can have special meanings in a regex.
And + means the previous token (i.e. the group \w|@|\-| |\[|\]|\.) repeated one or more times.
So the entire thing means one or more of any combination of word characters, @, -, space, [, ] and . characters.

There are online tools to analyze regexes. One such is at http://www.myezapp.com/apps/dev/regexp/show.ws
where it reports
Sequence: match all of the followings in order
BeginOfLine
Repeat
CapturingGroup
GroupNumber:1
OR: match either of the followings
WordCharacter
@
-
[
]
.
one or more times
EndOfLine
As others have noted, the double backslashes just escape a backslash so you can embed the regex in a string. For example, "\\w" will be interpreted as "\w" by the parser.

^ means the beginning of the line.
The parentheses are used for grouping.
\w is a word character.
| means OR.
@ matches the @ character.
\- matches the hyphen character.
\[ and \] match the square brackets.
\. matches a period.
+ means one or more.
$ means the end of the line.
So the regex is used to match a string which contains only word characters, @, hyphens, spaces, square brackets or dots.

Here's what it means:
^(\\w|@|\\-| |\\[|\\]|\\.)+$
^ - Means the regex starts at the beginning of the string. The match shouldn't start in the middle of the string.
Here's the individual things in the parentheses:
\\w - Indicates a "word" character. Normally, this is shown as \w, but this is being escaped.
@ - Indicates an @ symbol is allowed
\\- - Indicates a - is allowed. This is escaped since the dash can have other meanings in regex. Since it's not in a character class, I don't believe this is technically needed.
(space) - A space is allowed
\\[ and \\] - [ and ] are allowed.
\\. - A period is a valid character. Escaped because periods have special meanings in regex.
Now all of those characters have | as delimiters in the parentheses - this means OR. So any of those characters are valid.
The + at the end means one or more characters as described in parentheses are valid. The $ means the end of the regex must match the end of the string.
Note that the double slashes aren't necessary if you just prefix the string like this:
@"\w" is the same as "\\w"

How do I specify a wildcard (for ANY character) in a c# regex statement?

Trying to use a wildcard in C# to grab information from a webpage source, but I cannot seem to figure out what to use as the wildcard character. Nothing I've tried works!
The wildcard only needs to allow for numbers, but as the page is generated the same every time, I may as well allow for any characters.
Regex statement in use:
Regex guestbookWidgetIDregex = new Regex("GuestbookWidget(' INSERT WILDCARD HERE ', '(.*?)', 500);", RegexOptions.IgnoreCase);
If anyone can figure out what I'm doing wrong, it would be greatly appreciated!

The wildcard character is . (a dot).
To match any number of arbitrary characters, use .* (which means zero or more .) or .+ (which means one or more .).
Note that you need to escape your parentheses as \\( and \\) (or \( and \) in an @"" string).

On the dot
In regular expressions, the dot . matches almost any character. The only characters it doesn't normally match are the newline characters. For the dot to match all characters, you must enable what is called the single line mode (aka "dot all").
In C#, this is specified using RegexOptions.Singleline. You can also embed this as (?s) in the pattern.
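A short sketch of the difference, using a throwaway pattern a.b and an input with a newline between the two letters. DotNewlineDemo and its method names are hypothetical.

```csharp
using System.Text.RegularExpressions;

// Hypothetical demo of the dot with and without single line mode.
public static class DotNewlineDemo
{
    // True when the pattern a.b finds a match in the input under the given options.
    public static bool DotMatches(string input, RegexOptions options)
    {
        return Regex.IsMatch(input, "a.b", options);
    }

    // Same check with the inline (?s) flag instead of RegexOptions.Singleline.
    public static bool DotMatchesInline(string input)
    {
        return Regex.IsMatch(input, "(?s)a.b");
    }
}
```

For the input "a\nb", DotMatches returns false with RegexOptions.None and true with RegexOptions.Singleline; DotMatchesInline also returns true.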
References
regular-expressions.info/The Dot Matches (Almost) Any Character
On metacharacters and escaping
The . isn't the only regex metacharacter. The full list is:
( ) { } [ ] ? * + - ^ $ . | \
Depending on where they appear, if you want these characters to mean literally (e.g. . as a period), you may need to do what is called "escaping". This is done by preceding the character with a \.
Of course, \ is also the escape character in C# string literals. To get a literal \, you need to double it in your string literal (i.e. "\\" is a string of length one). Alternatively, C# also has what are called @-quoted string literals, where escape sequences are not processed. Thus, the following two strings are equal:
"c:\\Docs\\Source\\a.txt"
@"c:\Docs\Source\a.txt"
Since \ is used a lot in regular expressions, @-quoting is often used to avoid excessive doubling.
References
regular-expressions.info/Metacharacters
MSDN - C# Programmer's Reference - string
On character classes
Regular expression engines allow you to define character classes, e.g. [aeiou] is a character class containing the 5 vowel letters. You can also use the - metacharacter to define a range, e.g. [0-9] is a character class containing all 10 digit characters.
Since digit characters are so frequently used, regex also provides a shorthand notation for it, which is \d. In C#, this will also match decimal digits from other Unicode character sets, unless you're using RegexOptions.ECMAScript where it's strictly just [0-9].
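The Unicode behaviour of \d can be seen with a non-ASCII digit such as U+0663 (ARABIC-INDIC DIGIT THREE). This is a sketch only; DigitClassDemo and IsSingleDigit are hypothetical names.

```csharp
using System.Text.RegularExpressions;

// Hypothetical demo of \d with and without RegexOptions.ECMAScript.
public static class DigitClassDemo
{
    // True when the whole input is exactly one character matched by \d.
    public static bool IsSingleDigit(string s, RegexOptions options)
    {
        return Regex.IsMatch(s, @"^\d$", options);
    }
}
```

IsSingleDigit("\u0663", RegexOptions.None) returns true, whereas with RegexOptions.ECMAScript it returns false; the ASCII digit "7" matches in both modes.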
References
regular-expressions.info/Character Classes
MSDN - Character Classes - Decimal Digit Character
Related questions
.NET regex: What is the word character \w
Putting it all together
It looks like the following will work for you:
      @-quoting         digits_       _____anything but ', captured
          |                  /   \   /     \
new Regex(@"GuestbookWidget\('\d*', '([^']*)', 500\);", RegexOptions.IgnoreCase);
                           \/                     \/
                        escape (               escape )
Note that I've modified the pattern slightly so that it uses a negated character class instead of reluctant wildcard matching. This causes a slight difference in behavior if you allow ' to be escaped in your input string, but neither pattern handles this case perfectly. If you're not allowing ' to be escaped, however, this pattern is definitely better.
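For completeness, here is a sketch of pulling the captured value out with this pattern, with both parentheses escaped as the diagram indicates. The class name, method name and sample input are invented for illustration; the real page source may differ.

```csharp
using System.Text.RegularExpressions;

// Hypothetical extractor built around the pattern from this answer.
public static class GuestbookIdExtractor
{
    private static readonly Regex WidgetCall = new Regex(
        @"GuestbookWidget\('\d*', '([^']*)', 500\);", RegexOptions.IgnoreCase);

    // Returns the text captured by group 1, or an empty string if nothing matched.
    public static string ExtractId(string pageSource)
    {
        Match m = WidgetCall.Match(pageSource);
        return m.Success ? m.Groups[1].Value : string.Empty;
    }
}
```

Given a source fragment like GuestbookWidget('42', 'abc123', 500); the method returns abc123.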
References
regular-expressions.info/An Alternative to Laziness and Capturing Groups