Regex to match backslash inside a string - c#

I'm trying to match the following strings:
this\test_
_thistes\t
_t\histest
In other words, the allowed strings have ONLY a backslash, splitting 2 substrings which can contain numbers, letters and _ characters.
I tried the following regex, testing it on http://regexhero.net/tester/:
^[a-zA-Z_][\\\]?[a-zA-Z0-9_]+$
Unfortunately, it recognizes also the following not allowed strings:
this\\
_\
_\w\s\x
Any help please?

Don't make the \ optional. The regex below rejects two or more backslashes and asserts that at least one word character is present before and after the \ symbol.
@"^\w+\\\w+$"
OR
@"^[A-Za-z0-9_]+\\[A-Za-z0-9_]+$"

The best way to fix up your regex is the following:
^[a-zA-Z0-9_]+\\[a-zA-Z0-9_]+$
This breaks down to:
NODE EXPLANATION
--------------------------------------------------------------------------------
^ the beginning of the string
--------------------------------------------------------------------------------
[a-zA-Z0-9_]+ any character of: 'a' to 'z', 'A' to 'Z',
'0' to '9', '_' (1 or more times (matching
the most amount possible))
--------------------------------------------------------------------------------
\\ '\'
--------------------------------------------------------------------------------
[a-zA-Z0-9_]+ any character of: 'a' to 'z', 'A' to 'Z',
'0' to '9', '_' (1 or more times (matching
the most amount possible))
--------------------------------------------------------------------------------
$ before an optional \n, and the end of the
string
Explanation courtesy of http://rick.measham.id.au/paste/explain.pl
As you can see we have the same pattern before and after the backslash (since you indicated they should both be letters, numbers and underscores) with the + modifier meaning at least one. Then in the middle there is just the backslash which is compulsory.
It is unclear whether by "letters" you meant the basic alphabet or anything letter-like (most obviously accented characters, but also other alphabets). In the latter case you may want to expand your set of characters by using something like \w, as Avinash Raj suggests. See http://msdn.microsoft.com/en-us/library/20bw873z(v=vs.110).aspx#WordCharacter for more info on what the "word character" covers.

Your regex can mean two things, depending on whether you are declaring it as a raw string or as a normal string.
Using:
"^[a-zA-Z_][\\\]?[a-zA-Z0-9_]+$"
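Will not match any of your test examples (in C# it will not even compile, because \] is not a valid string escape). Where the unknown escape is passed through unchanged, the regex engine sees [\\]? and matches, in order: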
^ beginning of string,
[a-zA-Z_] 1 alpha character or underscore,
[\\\]? 1 optional backslash,
[a-zA-Z0-9_]+ at least 1 alphanumeric and/or underscore characters,
$ end of string
If you use it as a verbatim string (which is how regexhero interpreted it, and which is indicated by the @ sign before the opening quote), it is:
@"^[a-zA-Z_][\\\]?[a-zA-Z0-9_]+$"
^ beginning of string,
[a-zA-Z_] 1 alpha character or underscore,
[\\\]?[a-zA-Z0-9_]+ one or more characters from the set: backslash, ], ?, [, alphanumerics and underscore,
$ end of string.
So what you actually need is either:
"^[a-zA-Z0-9_]+\\\\[a-zA-Z0-9_]+$"
(Two pairs of backslashes become two literal backslashes, which will be interpreted by the regex engine as an escaped backslash; hence 1 literal backslash)
Or
@"^[a-zA-Z0-9_]+\\[a-zA-Z0-9_]+$"
(No backslash substitution performed, so the regex engine directly interprets the escaped backslash)
Note that I added the numbers in the first character class to allow it to match numbers like you requested and added the + quantifier to allow it to match more than one character before the backslash.
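To make the two spellings concrete, here is a minimal sketch. The class name BackslashPattern and the helper IsValid are made up for illustration; only Regex.IsMatch is real .NET API. It shows that both literals produce the same pattern text and checks it against the strings from the question.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper class, not part of any library.
public static class BackslashPattern
{
    // Regular literal: every backslash is doubled for the C# compiler,
    // so four backslashes in source become two in the pattern.
    public const string Regular = "^[a-zA-Z0-9_]+\\\\[a-zA-Z0-9_]+$";

    // Verbatim literal: backslashes reach the regex engine unchanged.
    public const string Verbatim = @"^[a-zA-Z0-9_]+\\[a-zA-Z0-9_]+$";

    // True when the input is two non-empty runs of letters, digits or
    // underscores separated by exactly one backslash.
    public static bool IsValid(string input) => Regex.IsMatch(input, Verbatim);
}
```

With this, BackslashPattern.IsValid(@"this\test_") should be true, while @"this\\", @"_\" and @"_\w\s\x" should all be rejected.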

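Pretty sure this should work if I understood everything you wanted. Note the $ at the end; without it, anything after the first valid prefix (such as a second backslash) would still be accepted.
^([a-zA-Z0-9_]+\\[a-zA-Z0-9_]+)$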

Related

Repetitive pattern but the last one is different - Regex c#

I have this pattern:
^([a-zA-Z0-9]+ )+$
It is supposed to match sentences like:
sfjgsjsg_sbskdf_dsjkfshfsh
sdfhs_skjhsijdgh_dsnjbkg_sdkfsbk_nasjksdj_nsdjkfs
I don't know the word size nor how many words will be in each line.
The problem is that the pattern above matches only sentences like:
sfjgsjsg_sbskdf_dsjkfshfsh_
sdfhs_skjhsijdgh_dsnjbkg_sdkfsbk_nasjksdj_nsdjkfs_
where _ stands for a space.
You can use
^[a-zA-Z0-9]+(?: [a-zA-Z0-9]+)*$
Or, if any whitespace is meant:
^[a-zA-Z0-9]+(?:\s[a-zA-Z0-9]+)*$
If there can be only one occurrence of horizontal spaces:
^[a-zA-Z0-9]+(?:[\p{Zs}\t][a-zA-Z0-9]+)*$
and if there can be more than one:
^[a-zA-Z0-9]+(?:[\p{Zs}\t]+[a-zA-Z0-9]+)*$
Note that leading/trailing whitespace support can be added by placing a space followed by *, [\p{Zs}\t]* or \s* right after the ^ anchor and right before the $ anchor.
Details:
^ - start of string
[a-zA-Z0-9]+ - one or more ASCII alphanumeric chars
(?: [a-zA-Z0-9]+)* - zero or more repetitions of a space and one or more ASCII alphanumeric chars (in the variants above, [\p{Zs}\t] is any whitespace other than line break chars, and \s matches any whitespace)
$ - end of string.
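As a quick check of the first pattern, here is a small sketch. The class name WordList is hypothetical; the pattern is the one given above.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical wrapper around the first suggested pattern.
public static class WordList
{
    // One word, then zero or more groups of "single space + word".
    private static readonly Regex Words =
        new Regex(@"^[a-zA-Z0-9]+(?: [a-zA-Z0-9]+)*$");

    public static bool IsMatch(string line) => Words.IsMatch(line);
}
```

WordList.IsMatch("abc def ghi") should be true, while a trailing space ("abc def ") or a doubled space ("abc  def") should fail, because every space must be followed by at least one alphanumeric char.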

Parsing text between quotes with .NET regular expressions

I have the following input text:
@"This is some text #foo=bar #name=""John \""The Anonymous One\"" Doe"" #age=38"
I would like to parse the values with the #name=value syntax as name/value pairs. Parsing the previous string should result in the following named captures:
name:"foo"
value:"bar"
name:"name"
value:"John \""The Anonymous One\"" Doe"
name:"age"
value:"38"
I tried the following regex, which got me almost there:
@"(?:(?<=\s)|^)#(?<name>\w+[A-Za-z0-9_-]+?)\s*=\s*(?<value>[A-Za-z0-9_-]+|(?="").+?(?=(?<!\\)""))"
The primary issue is that it captures the opening quote in "John \""The Anonymous One\"" Doe". I feel like this should be a lookbehind instead of a lookahead, but that doesn't seem to work at all.
Here are some rules for the expression:
Name must start with a letter and can contain any letter, number, underscore, or hyphen.
Unquoted must have at least one character and can contain any letter, number, underscore, or hyphen.
Quoted value can contain any character including any whitespace and escaped quotes.
Edit:
Here's the result from regex101.com:
(?:(?<=\s)|^)#(?<name>\w+[A-Za-z0-9_-]+?)\s*=\s*(?<value>(?<!")[A-Za-z0-9_-]+|(?=").+?(?=(?<!\\)"))
(?:(?<=\s)|^) Non-capturing group
# matches the character # literally
(?<name>\w+[A-Za-z0-9_-]+?) Named capturing group name
\s* match any white space character [\r\n\t\f ]
= matches the character = literally
\s* match any white space character [\r\n\t\f ]
Quantifier: * Between zero and unlimited times, as many times as possible, giving back as needed [greedy]
(?<value>(?<!")[A-Za-z0-9_-]+|(?=").+?(?=(?<!\\)")) Named capturing group value
1st Alternative: [A-Za-z0-9_-]+
[A-Za-z0-9_-]+ match a single character present in the list below
Quantifier: + Between one and unlimited times, as many times as possible, giving back as needed [greedy]
A-Z a single character in the range between A and Z (case sensitive)
a-z a single character in the range between a and z (case sensitive)
0-9 a single character in the range between 0 and 9
_- a single character in the list _- literally
2nd Alternative: (?=").+?(?=(?<!\\)")
(?=") Positive Lookahead - Assert that the regex below can be matched
" matches the characters " literally
.+? matches any character (except newline)
Quantifier: +? Between one and unlimited times, as few times as possible, expanding as needed [lazy]
(?=(?<!\\)") Positive Lookahead - Assert that the regex below can be matched
(?<!\\) Negative Lookbehind - Assert that it is impossible to match the regex below
\\ matches the character \ literally
" matches the characters " literally
You can use a very useful .NET regex feature where multiple same-named captures are allowed. Also, there is an issue with your (?<name>) capture group: it allows a digit in the first position, which does not meet your 1st requirement.
So, I suggest:
(?si)(?:(?<=\s)|^)#(?<name>\w+[a-z0-9_-]+?)\s*=\s*(?:(?<value>[a-z0-9_-]+)|(?:"")?(?<value>.+?)(?=(?<!\\)""))
Note that you cannot debug .NET-specific regexes at regex101.com; you need to test them in a .NET-compliant environment.
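The same-named capture idea can be sketched as follows. This is a simplified variant, not the exact pattern above: it consumes the surrounding quotes explicitly and lets \\. step over escaped characters, and it requires the name to start with a letter. The class name TagParser and the method Parse are invented for the example.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical parser illustrating .NET's same-named capture groups.
public static class TagParser
{
    // Both alternatives write to the group named "value"; only one of them
    // participates in any given match. The quoted alternative accepts
    // backslash-escaped characters (including \") via \\. in the loop.
    private static readonly Regex Tag = new Regex(
        @"(?<=^|\s)#(?<name>[A-Za-z][A-Za-z0-9_-]*)\s*=\s*" +
        @"(?:""(?<value>(?:\\.|[^""\\])*)""|(?<value>[A-Za-z0-9_-]+))");

    public static List<KeyValuePair<string, string>> Parse(string text)
    {
        var pairs = new List<KeyValuePair<string, string>>();
        foreach (Match m in Tag.Matches(text))
        {
            pairs.Add(new KeyValuePair<string, string>(
                m.Groups["name"].Value, m.Groups["value"].Value));
        }
        return pairs;
    }
}
```

Run against the input string from the question, this should yield three pairs (foo, name, age), with the quoted value returned without its outer quotes but with the escaped inner quotes left as written.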
Use string methods.
Split
string myLongString = @"This is some text #foo=bar #name=""John \""The Anonymous One\"" Doe"" #age=38";
string[] nameValues = myLongString.Split('#');
From there either use Split function with "=" or use IndexOf("=").
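A rough sketch of that approach, using IndexOf for the second step. SplitParser is a made-up name. Be aware of the assumptions: a # inside a quoted value would break the split, and quoted values keep their surrounding quotes.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper; assumes '#' never appears inside a value.
public static class SplitParser
{
    public static Dictionary<string, string> Parse(string text)
    {
        var result = new Dictionary<string, string>();
        string[] parts = text.Split('#');

        // parts[0] is whatever precedes the first '#', so start at 1.
        for (int i = 1; i < parts.Length; i++)
        {
            int eq = parts[i].IndexOf('=');
            if (eq <= 0) continue; // no name before '=' or no '=' at all

            string name = parts[i].Substring(0, eq).Trim();
            string value = parts[i].Substring(eq + 1).Trim();
            result[name] = value; // quoted values keep their quotes
        }
        return result;
    }
}
```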

Regular Expression for no repeating special characters (C#)

I am new to regular expressions and need a regular expression for address, in which user cannot enter repeating special characters such as: ..... or ,,,.../// etc and none of the special characters could be entered more than 5 times in the string.
...,,,....// =>No Match
Street no. 40. hello. =>Match
Thanks in advance!
I have tried this:
([a-zA-Z]+|[\s\,\.\/\-]+|[\d]+)|(\(([\da-zA-Z]|[^)^(]+){1,}\))
It selects all alphanumeric characters and some special characters, with no empty brackets.
You can use a negative lookahead, which asserts what must not match. Its format is (?! ... )
For your case you can try something like this:
This will not match the input string if it has 2 or more consecutive dots, commas or slashes (or any combination of them)
(?!.*[.,\/]{2}) ... rest of the regex
This will not match the input string if it contains the character 'A' 5 or more times (use {6} to allow up to 5 occurrences).
(?!(.*A.*){5}) ... rest of the regex
This will match everything except your restrictions. Replace the last part (.*) with your regex.
^(?!.*[.,\/]{2})(?!(.*\..*){5})(?!(.*,.*){5})(?!(.*\/.*){5}).*$
Note: This regex may not be optimized. It may be faster to loop over the string's characters and count their occurrences.
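The loop alternative mentioned in the note could look roughly like this. It is only a sketch: the class name AddressCheck, the set of "special" characters (dot, comma, slash) and the limit of 5 per character are assumptions taken from the question, not a fixed rule.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical validator: no two special chars in a row, and no single
// special char used more than maxPerChar times.
public static class AddressCheck
{
    private const string Specials = ".,/"; // assumed set of special chars

    public static bool IsValid(string input, int maxPerChar = 5)
    {
        var counts = new Dictionary<char, int>();
        bool previousWasSpecial = false;

        foreach (char c in input)
        {
            bool isSpecial = Specials.IndexOf(c) >= 0;
            if (isSpecial)
            {
                // Two special characters in a row are not allowed.
                if (previousWasSpecial) return false;

                counts.TryGetValue(c, out int seen);
                counts[c] = seen + 1;
                if (seen + 1 > maxPerChar) return false;
            }
            previousWasSpecial = isSpecial;
        }
        return true;
    }
}
```

This makes a single pass over the string and avoids the backtracking that the nested .* lookaheads can cause on long inputs.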
You can use this regex:
^(?![^,./-]*([,./-])\1)(?![^,./-]*([,./-])(?:[^,./-]*\2){4})[ \da-z,./-]+$
In C#:
foundMatch = Regex.IsMatch(yourString, @"^(?![^,./-]*([,./-])\1)(?![^,./-]*([,./-])(?:[^,./-]*\2){4})[ \da-z,./-]+$", RegexOptions.IgnoreCase);
Explanation
The ^ anchor asserts that we are at the beginning of the string
The negative lookahead (?![^,./-]*([,./-])\1) asserts that it is not possible to match any number of non-special chars, followed by one special char (captured to Group 1), followed by the same special char (the \1 backreference)
The negative lookahead (?![^,./-]*([,./-])(?:[^,./-]*\2){4}) asserts that it is not possible to match any number of non-special chars, followed by one special char (captured to Group 2), then any number of non-special chars and that same char from Group 2, four times (five times total)
The $ anchor asserts that we are at the end of the string
A regular expression string to detect invalid strings is:
[^\w \-\r\n]{2}|(?:[\w \-]+[^\w \-\r\n]){5}
As C# string literal (regular and verbatim):
"[^\\w \\-\\r\\n]{2}|(?:[\\w \\-]+[^\\w \\-\\r\\n]){5}"
@"[^\w \-\r\n]{2}|(?:[\w \-]+[^\w \-\r\n]){5}"
It is much easier to find a string than to validate if a string does not contain ...
It can be checked with this expression if the string entered by the user is invalid because of a match of 2 special characters in sequence OR 5 special characters used in the string.
Explanation:
[^...] ... a negative character class definition which matches any character NOT being one of the characters listed within the square brackets.
\w ... a word character which is either a letter, a digit or an underscore.
The next character is simply a space character.
\- ... the hyphen character which must be escaped with a backslash within square brackets as otherwise the hyphen character would be interpreted as "FROM x TO z" (except when being the first or the last character within the square brackets).
\r ... carriage return
\n ... line-feed
Therefore [^\w \-\r\n] finds a character which is NOT a letter, NOT a digit, NOT an underscore, NOT a space, NOT a hyphen, NOT a carriage return and also NOT a line-feed.
{2} ... the preceding expression must match 2 such characters.
So with the expression [^\w \-\r\n]{2} it can be checked if the string contains 2 special characters in a sequence which makes the string invalid.
| ... OR
(?:...) ... non-capturing group, needed here so that the multiplier {5} applies to the whole expression inside, i.e. it must match 5 times in a row.
[...] ... a positive character class definition which matches any character being one of the characters listed within the square brackets.
[\w \-]+ ... find a word character, or a space, or a hyphen 1 or more times.
[^\w \-\r\n] ... and next character being NOT a word character, space, hyphen, carriage return or line-feed.
Therefore (?:[\w \-]+[^\w \-\r\n]){5} finds a string with 5 "special" characters between "standard" characters.
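Used from C#, the "detect invalid input" expression might be wrapped like this. InvalidAddress and IsInvalid are hypothetical names; the pattern is the verbatim literal given above.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical wrapper: a match means the input is INVALID.
public static class InvalidAddress
{
    private static readonly Regex Invalid = new Regex(
        @"[^\w \-\r\n]{2}|(?:[\w \-]+[^\w \-\r\n]){5}");

    public static bool IsInvalid(string input) => Invalid.IsMatch(input);
}
```

Note that the second alternative counts special characters in total rather than per character, so "a.b,c/d.e," is reported as invalid even though no single character repeats five times.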

Trying to understand this regex

I have this regex
^(\\w|#|\\-| |\\[|\\]|\\.)+$
I'm trying to understand what it does exactly but I can't seem to get any result...
I just can't understand the double backslashes everywhere... Isn't double backslash supposed to be used to get a single backslash?
This regex is to validate that a username doesn't use weird characters and stuff.
If someone could explain me the double backslashes thing please. #_#
Additional info: I got this regex in C# using Regex.IsMatch to check if my username string match the regex. It's for an asp website.
My guess is that it's simply escaping the \ since backslash is the escape character in c#.
string pattern = "^(\\w|#|\\-| |\\[|\\]|\\.)+$";
Can be rewritten using a verbatim string as
string pattern = @"^(\w|#|\-| |\[|\]|\.)+$";
Now it's a bit easier to understand what's going on. It will match any word character, # sign, hyphen, space, square bracket or period, repeated one or more times. The ^ and $ match the beginning and end of the string, respectively, so only those characters are allowed.
Therefore this pattern is equivalent to:
string pattern = @"^([\w# \[\].-])+$";
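A small sketch to confirm the equivalence. UsernamePattern is an invented class name; the three constants are the regular literal, the verbatim literal and the character-class form of the same rule.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical holder for the three spellings of the same rule.
public static class UsernamePattern
{
    // Regular literal: the compiler turns each \\ into a single \.
    public const string Escaped = "^(\\w|#|\\-| |\\[|\\]|\\.)+$";

    // Verbatim literal: same characters, no doubling needed.
    public const string Verbatim = @"^(\w|#|\-| |\[|\]|\.)+$";

    // Character-class form of the same set of allowed characters.
    public const string CharClass = @"^([\w# \[\].-])+$";

    public static bool IsValid(string name) => Regex.IsMatch(name, CharClass);
}
```

UsernamePattern.Escaped == UsernamePattern.Verbatim should be true, since both literals describe the same string once the compiler has processed the escapes.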
The double backslashes are meant to be single backslashes. A backslash is doubled to escape the backslash itself, because backslashes introduce other escape sequences in C# string literals, e.g. \n stands for a new line.
With the double backslashes sorted out, it becomes ^(\w|#|\-| |\[|\]|\.)+$
Breaking this regex down: | means OR, so \w|#|\-| |\[|\]|\. means \w or # or \- or space or \[ or \] or \. That is, any word character (letter, digit or underscore), #, -, space, [, ] and . characters. Note that the remaining backslash is the regex escape, used to escape the -, [, ] and . characters because they have special meanings in a regex context.
And + means the previous token (i.e. the group \w|#|\-| |\[|\]|\.) is repeated one or more times.
So, the entire thing means one or more of any combination of word characters, #, -, space, [, ] and . characters.
There are online tools to analyze regexes. One such tool is at http://www.myezapp.com/apps/dev/regexp/show.ws
where it reports
Sequence: match all of the followings in order
BeginOfLine
Repeat
CapturingGroup
GroupNumber:1
OR: match either of the followings
WordCharacter
#
-
[
]
.
one or more times
EndOfLine
As others have noted, the double backslashes just escape a backslash so you can embed the regex in a string. For example, "\\w" will be interpreted as "\w" by the parser.
^ means the beginning of the line.
The parentheses are used for grouping.
\w is a word character.
| means OR.
# matches the # character.
\- matches the hyphen character.
\[ and \] match the square brackets.
\. matches a period.
+ means one or more.
$ means the end of the line.
So the regex is used to match a string which contains only word characters, #, hyphens, spaces, square brackets or dots.
Here's what it means:
^(\\w|#|\\-| |\\[|\\]|\\.)+$
^ - Means the regex starts at the beginning of the string. The match shouldn't start in the middle of the string.
Here's the individual things in the parentheses:
\\w - Indicates a "word" character. Normally, this is shown as \w, but this is being escaped.
# - Indicates an # symbol is allowed
\\- - Indicates a - is allowed. This is escaped since the dash can have other meanings in regex. Since it's not in a character class, I don't believe this is technically needed.
(space) - A space is allowed
\\[ and \\] - [ and ] are allowed.
\\. - A period is a valid character. Escaped because periods have special meanings in regex.
Now all of those characters have | as delimiters in the parentheses - this means OR. So any of those characters are valid.
The + at the end means one or more characters as described in parentheses are valid. The $ means the end of the regex must match the end of the string.
Note that the double backslashes aren't necessary if you prefix the string with @, like this:
@"\w" is the same as "\\w"

How do I specify a wildcard (for ANY character) in a c# regex statement?

Trying to use a wildcard in C# to grab information from a webpage source, but I cannot seem to figure out what to use as the wildcard character. Nothing I've tried works!
The wildcard only needs to allow for numbers, but as the page is generated the same every time, I may as well allow for any characters.
Regex statement in use:
Regex guestbookWidgetIDregex = new Regex("GuestbookWidget(' INSERT WILDCARD HERE ', '(.*?)', 500);", RegexOptions.IgnoreCase);
If anyone can figure out what I'm doing wrong, it would be greatly appreciated!
The wildcard character is . (a dot).
To match any number of arbitrary characters, use .* (which means zero or more .) or .+ (which means one or more .)
Note that you need to escape your parentheses as \\( and \\) (or as \( and \) in an @"" string).
On the dot
In regular expressions, the dot . matches almost any character. The only characters it doesn't normally match are the newline characters. For the dot to match all characters, you must enable what is called the single line mode (aka "dot all").
In C#, this is specified using RegexOptions.Singleline. You can also embed this as (?s) in the pattern.
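A short sketch of the difference. DotDemo is a made-up class name; RegexOptions.Singleline and the inline (?s) are the real .NET switches.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo: does "." match the newline between 'a' and 'b'?
public static class DotDemo
{
    public static bool DotMatchesNewline(RegexOptions options) =>
        Regex.IsMatch("a\nb", "^a.b$", options);

    // Same test with the option embedded in the pattern itself.
    public static bool InlineSingleline() =>
        Regex.IsMatch("a\nb", "(?s)^a.b$");
}
```

DotDemo.DotMatchesNewline(RegexOptions.None) should be false, while passing RegexOptions.Singleline or using the inline (?s) form should give true.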
References
regular-expressions.info/The Dot Matches (Almost) Any Character
On metacharacters and escaping
The . isn't the only regex metacharacter. They are:
( ) { } [ ] ? * + - ^ $ . | \
Depending on where they appear, if you want these characters to be taken literally (e.g. . as a period), you may need to do what is called "escaping". This is done by preceding the character with a \.
Of course, a \ is also an escape character for C# string literals. To get a literal \, you need to double it in your string literal (i.e. "\\" is a string of length one). Alternatively, C# also has what is called @-quoted string literals, where escape sequences are not processed. Thus, the following two strings are equal:
"c:\\Docs\\Source\\a.txt"
@"c:\Docs\Source\a.txt"
Since \ is used a lot in regular expressions, @-quoting is often used to avoid excessive doubling.
References
regular-expressions.info/Metacharacters
MSDN - C# Programmer's Reference - string
On character classes
Regular expression engines allow you to define character classes, e.g. [aeiou] is a character class containing the 5 vowel letters. You can also use the - metacharacter to define a range, e.g. [0-9] is a character class containing all 10 digit characters.
Since digit characters are so frequently used, regex also provides a shorthand notation for it, which is \d. In C#, this will also match decimal digits from other Unicode character sets, unless you're using RegexOptions.ECMAScript where it's strictly just [0-9].
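To see the Unicode behaviour of \d, here is a minimal sketch. DigitDemo is a hypothetical name, and U+0663 (ARABIC-INDIC DIGIT THREE) is just one convenient example of a non-ASCII decimal digit.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo comparing \d with and without RegexOptions.ECMAScript.
public static class DigitDemo
{
    // U+0663 ARABIC-INDIC DIGIT THREE: a decimal digit outside ASCII.
    public const string ArabicIndicThree = "\u0663";

    public static bool IsDigit(string s, RegexOptions options) =>
        Regex.IsMatch(s, @"^\d$", options);
}
```

DigitDemo.IsDigit(DigitDemo.ArabicIndicThree, RegexOptions.None) should be true, while the same call with RegexOptions.ECMAScript should be false. If only 0 to 9 should be accepted, writing [0-9] explicitly avoids the question.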
References
regular-expressions.info/Character Classes
MSDN - Character Classes - Decimal Digit Character
Related questions
.NET regex: What is the word character \w
Putting it all together
It looks like the following will work for you:
      @-quoting             digits   anything but ', captured
          |                   / \    /     \
new Regex(@"GuestbookWidget\('\d*', '([^']*)', 500\);", RegexOptions.IgnoreCase);
                           \/                     \/
                        escape (               escape )
Note that I've modified the pattern slightly so that it uses a negated character class instead of reluctant wildcard matching. This causes a slight difference in behavior if you allow ' to be escaped in your input string, but neither pattern handles this case perfectly. If you're not allowing ' to be escaped, however, this pattern is definitely better.
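For completeness, a sketch of how the captured value might be read out. The class and method names are invented, and the sample page source in the usage note is a made-up stand-in for the real page.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper around the pattern shown above.
public static class GuestbookWidget
{
    private static readonly Regex WidgetCall = new Regex(
        @"GuestbookWidget\('\d*', '([^']*)', 500\);",
        RegexOptions.IgnoreCase);

    // Returns the second quoted argument, or null when the call is absent.
    public static string ExtractSecondArgument(string pageSource)
    {
        Match m = WidgetCall.Match(pageSource);
        return m.Success ? m.Groups[1].Value : null;
    }
}
```

For a page containing GuestbookWidget('12345', 'abc-def', 500); this should return abc-def. Group 1 is the only capturing group in the pattern, so Groups[1] is the value between the second pair of quotes.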
References
regular-expressions.info/An Alternative to Laziness and Capturing Groups
