+ in regular expression - c#

I am trying to capture a value out of a string. The string's format should be
01+XXXX
and I want to capture XXXX using a regular expression. This is what I came up with -
01+\\s*(?<1>[.0-9]*)
But that won't work. What DOES work is -
01+\\s*(?<1>[+.0-9]*)
The only difference is adding the + to the character class. My main question is: why does the second expression work when the first one doesn't? In the first one, I look for 01+ and the rest of it should go to [.0-9]. It seems to me that the second one wants to read + twice. Is that not what it's doing? I am pretty new to regular expressions, so I feel like I might be missing something small.
On this site http://www.codeproject.com/Articles/9099/The-30-Minute-Regex-Tutorial it says that + is used for "Repeat one or more times". So is it trying to read 01+ more than once?

It's reading the 1 one or more times. That is, the regex 01+ matches 01 or 011 or 0111 etc.
But it doesn't match the +. If you want to match a literal +, write 01\+ or 01[+] for the regex.

The + is a special character, meaning "one or more times." In this case, it means 01, 011, 0111, etc. instead of 01+. If you want to use it literally, you need to escape it, like this: \+
Note: It looks like you are using it with strings, so you would need to double-escape: \\+
It works inside a character class ([+]) because character classes take most characters literally, with exceptions including \ and ].
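The difference between the three spellings can be checked with a short sketch. It uses Python's `re` module purely for illustration (an assumption; the question is about C#, where these particular constructs behave the same way), and the sample strings are made up.

```python
import re

# 01+ : '0' followed by one or more '1' characters; the plus is a quantifier.
quantified = re.compile(r"01+")
# 01\+ and 01[+] : '0', '1', then a literal plus sign.
escaped = re.compile(r"01\+")
in_class = re.compile(r"01[+]")

samples = ["01", "0111", "01+"]
for text in samples:
    # Print which of the three patterns match the whole sample.
    print(text,
          bool(quantified.fullmatch(text)),
          bool(escaped.fullmatch(text)),
          bool(in_class.fullmatch(text)))
```

Only the escaped and character-class versions accept the text 01+ as a whole.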

'+' is a special character in regex, it means "1 or more times". So what you have written means:
The character '0'
The character '1' one or more times
Whitespace 0 or more times
etc.
If you want to match a literal plus you need to escape it:
01\\+\\s*(?<1>[.0-9]*)

The + is a quantifier, as explained in the tutorial you linked. So, your regex means "match a zero, then one or more ones, then zero or more whitespaces, then ...".
The plus needs to be escaped:
01\\+\\s*(?<1>[.0-9]*)
Your second regex worked, because the + there was part of a character class and does not need to be escaped there.

01\+(?<cap>[\d.]*)
explain:
01 '01'
\+ '+'
[\d.]* any character of: digits (0-9), '.'
(0 or more times, matching the most amount possible)
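As a rough illustration of the capture, here is a sketch in Python (an assumption; the thread is about C#). Python writes a named group as `(?P<cap>...)` where .NET writes `(?<cap>...)`. The helper name `capture_value` and the sample inputs are hypothetical.

```python
import re

# Literal "01+", optional whitespace, then digits and dots captured as "cap".
pattern = re.compile(r"01\+\s*(?P<cap>[.0-9]*)")

def capture_value(text):
    """Return the digits/dots after '01+', or None if the prefix is missing."""
    match = pattern.search(text)
    return match.group("cap") if match else None

# The asker's second pattern: here the plus is consumed by the character
# class, so it ends up inside the captured value.
loose = re.compile(r"01+\s*(?P<cap>[+.0-9]*)")
```

This also shows why the second expression from the question "worked": the + was matched by the character class and became part of the capture.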

Related

How to match string by using regular expression which will not allow same special character at same time?

I'm trying to match a string that must not contain the same special character twice in a row.
My regular expression is:
[RegularExpression(@"^+[a-zA-Z0-9]+[a-zA-Z0-9.&' '-]+[a-zA-Z0-9]$")]
This solves all my requirements except the two issues below.
this is my string : bracks
acceptable :
bra-cks, b-r-a-c-ks, b.r.a.c.ks, bra cks (by the way above regular expression solved this)
not acceptable:
issue 1: b.. or bra..cks, b..racks, bra...cks (two or more special characters together),
issue 2: bra  cks (two or more whitespace characters together)
You can use a negative lookahead to invalidate strings containing two consecutive special characters:
^(?!.*[.&' -]{2})[a-zA-Z0-9.&' -]+$
Demo: https://regex101.com/r/7j14bu/1
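A quick way to try the lookahead pattern against the examples from the question is sketched below in Python (an assumption; the question uses a C# attribute). The helper name `is_valid_name` is hypothetical.

```python
import re

# (?!.*[.&' -]{2}) rejects any string that contains two special
# characters in a row anywhere after the start.
name_pattern = re.compile(r"^(?!.*[.&' -]{2})[a-zA-Z0-9.&' -]+$")

def is_valid_name(text):
    """True if text uses only allowed characters and has no doubled special."""
    return name_pattern.match(text) is not None

for sample in ["bra-cks", "b.r.a.c.ks", "bra cks", "bra..cks", "bra  cks"]:
    print(repr(sample), is_valid_name(sample))
```

Note that this pattern on its own does not require the string to start or end with an alphanumeric character, so a value such as -bracks is still accepted.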
The goal
From what I can tell from your description and pattern, you are trying to match text that starts and ends with an alphanumeric character (due to ^+[a-zA-Z0-9] and [a-zA-Z0-9]$ in your original pattern), and inside it you don't want any two consecutive (adjacent) special characters, which, again guessing from the regex, are . & ' - and the space.
What was wrong
^+ - I think you wanted to ensure that the match starts at the beginning of the line/string, so you don't need the + here
[a-zA-Z0-9.&' '-] - in this character class you doubled ' which is totally unnecessary
Solution
Please try pattern
^[a-zA-Z0-9](?:(?![.& '-]{2,})[a-zA-Z0-9.& '-])*[a-zA-Z0-9]$
Pattern explanation
^ - anchor, match the beginning of the string
[a-zA-Z0-9] - character class, match one of the characters inside []
(?:...) - non capturing group
(?!...) - negative lookahead
[.& '-]{2,} - match 2 or more of characters inside character class
[a-zA-Z0-9.& '-] - character class, match one of the characters inside []
* - match zero or more repetitions of the preceding pattern
$ - anchor, match the end of the string
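The pattern explained above can be exercised with a small sketch, written in Python as an assumption (the constructs used are the same in .NET). The helper name `is_valid` is hypothetical.

```python
import re

# Alphanumeric at both ends; in between, each character is consumed only
# if the negative lookahead confirms it does not start a run of two or
# more special characters.
tempered = re.compile(
    r"^[a-zA-Z0-9](?:(?![.& '-]{2,})[a-zA-Z0-9.& '-])*[a-zA-Z0-9]$"
)

def is_valid(text):
    """True if text passes the start/end and no-doubled-special rules."""
    return tempered.match(text) is not None

for sample in ["bra-cks", "b-r-a-c-ks", "bra..cks", "bracks-", "b"]:
    print(repr(sample), is_valid(sample))
```

Because the first and last characters are matched separately, this pattern needs at least two characters, so a single letter is rejected.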
Some remarks on your current regex:
It looks like you placed the + quantifiers before the pattern you wanted to quantify, instead of after. For instance, ^+ doesn't make much sense, since ^ is just the start of the input, and most regex engines would not even allow that.
The pattern [a-zA-Z0-9.&' '-]+ doesn't distinguish between alphanumerical and other characters, while you want the rules for them to be different. Especially for the other characters you don't want them to repeat, so that + is not desired for those.
In a character class it doesn't make sense to repeat the same character, like you have a repeat of a quote ('). Maybe you wanted to somehow delimit the space, but realise that those quotes are interpreted literally. So probably you should just remove them. Or if you intended to allow for a quote, only list it once.
Here is a correction (add the quote if you still need it):
^[a-zA-Z0-9]+(?:[.& -][a-zA-Z0-9]+)*$
Follow-up
Based on a comment, I suspect you would allow a non-alphanumerical character to be surrounded by single spaces, even if that gives a sequence of more than one non-alphanumerical character. In that case use this:
^[a-zA-Z0-9]+(?:(?:[ ]|[ ]?[.&-][ ]?)[a-zA-Z0-9]+)*$
So here the space gets a different role: it can optionally occur before and after a delimiter (one of ".&-"), or it can occur on its own. The brackets around the spaces are not needed, but I used them to stress that the space is intended and not a typo.
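The two corrections can be compared side by side. This is a sketch in Python (an assumption; the thread targets C#), and the variable names are made up for illustration.

```python
import re

# One delimiter at a time, always followed by at least one alphanumeric.
strict = re.compile(r"^[a-zA-Z0-9]+(?:[.& -][a-zA-Z0-9]+)*$")

# Follow-up version: a delimiter may be wrapped in single spaces.
spaced = re.compile(
    r"^[a-zA-Z0-9]+(?:(?:[ ]|[ ]?[.&-][ ]?)[a-zA-Z0-9]+)*$"
)

for sample in ["bra-cks", "bra..cks", "bra - cks", "bra  cks"]:
    print(repr(sample),
          strict.match(sample) is not None,
          spaced.match(sample) is not None)
```

Only the follow-up pattern accepts "bra - cks"; both reject doubled dots and doubled spaces.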

Simple phone number regex to match numbers, spaces, etc

I'm trying to modify a fairly basic regex pattern in C# that tests for phone numbers.
The pattern is -
[0-9]+(\.[0-9][0-9]?)?
I have two questions -
1) The existing expression does work (although it is fairly restrictive) but I can't quite understand how it works. Regexps for similar issues seem to look more like this one -
/^[0-9()]+$/
2) How could I extend this pattern to allow brackets, periods and a single space to separate numbers. I tried a few variations to include -
[0-9().+\s?](\.[0-9][0-9]?)?
But I can't seem to create a valid pattern.
Any help would be much appreciated.
Thanks,
[0-9]+(\.[0-9][0-9]?)?
First of all, I recommend checking out either regexr.com or regex101.com, so you yourself get an understanding of how regex works. Both websites will give you a step-by-step explanation of what each symbol in the regex does.
Now, one of the main things you have to understand is that regex has special characters. This includes, among others, the following: []().-+*?\^$. So, if you want your regex to match a literal ., for example, you would have to escape it, since it's a special character. To do so, either use \. or [.]. Backslashes serve to escape other characters, while [] means "match any one of the characters in this set". Some special characters don't have a special meaning inside these brackets and don't require escaping.
Therefore, the regex above will match any combination of digits of length 1 or more, followed by an optional suffix (foobar)?, which has to be a dot, followed by one or two digits. In fact, this regex seems more like it's supposed to match decimal numbers with up to two digits behind the dot - not phone numbers.
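To see that the original pattern behaves like a decimal-number matcher, here is a minimal sketch in Python (an assumption; the question is about C#). `fullmatch` is used so that the whole input has to fit the pattern.

```python
import re

# Digits, optionally followed by a dot and one or two more digits.
decimal_like = re.compile(r"[0-9]+(\.[0-9][0-9]?)?")

for sample in ["12", "12.5", "12.50", "12.505", "555-1234"]:
    print(repr(sample), decimal_like.fullmatch(sample) is not None)

# Without anchors, a search still succeeds on the leading digits only.
partial = decimal_like.search("555-1234").group(0)
```

The last line illustrates why the unanchored pattern appears to "work" on phone numbers: it simply matches the first run of digits.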
/^[0-9()]+$/
What this does is pretty simple - match any combination of digits or round brackets that has the length 1 or greater.
[0-9().+\s?](\.[0-9][0-9]?)?
What you're matching here is:
one of: a digit, round bracket, dot, plus sign, whitespace or question mark; but exactly once only!
optionally followed by a dot and one or two digits
A suitable regex for your purpose could be:
(\+\d{2})?((\(0\)\d{2,3})|\d{2,3})?\d+
Enter this in one of the websites mentioned above to understand how it works. I modified it a little to also allow, for example +49 123 4567890.
Also, for simplicity, I didn't include spaces - so when using this regex, you have to remove all the spaces in your input first. In C#, that should be possible with yourString.Replace(" ", ""); (simply replacing all spaces with nothing = deleting spaces)
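The strip-then-match approach described above could look roughly like this. The sketch is in Python rather than C# (an assumption), the helper name `is_phone_number` is hypothetical, and requiring a full match is my own choice since the suggested pattern has no anchors.

```python
import re

# Optional "+NN" country code, optional area code (with or without "(0)"),
# then the remaining digits.
phone_pattern = re.compile(r"(\+\d{2})?((\(0\)\d{2,3})|\d{2,3})?\d+")

def is_phone_number(raw):
    """Remove spaces first, then require the whole string to match."""
    without_spaces = raw.replace(" ", "")
    return phone_pattern.fullmatch(without_spaces) is not None

for sample in ["+49 123 4567890", "+49 (0)123 4567890", "0123 456", "12ab"]:
    print(repr(sample), is_phone_number(sample))
```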
The + after the character set is a quantifier: the preceding character, character set or group is repeated at least once, with no upper limit. It is also greedy, so it matches as much as possible.
So [0-9().+\s]+ will match one or more characters from the set.

What does `\?` mean in a regular expression?

May I know what \? means in a regular expression? For example, what is its significance in this expression.
I have used this for validating 7 digit telephone no
Any help is highly appreciated.
"\?" means "?" itself. "\" - is escape character. "?" is quantifier and "\" is used to escape it.
I have used this for validating 7 digit telephone no
"[[:number:]]\{3\}[ -]\?[[:number:]]\{4\}"
Looking at your example, it seems you are using BRE (POSIX basic regular expressions). There, the \ gives ? its special meaning: \? makes the preceding [ -] match zero or one time.
If it is ERE/PCRE, the \ takes that special meaning away from ?, so \? means a literal question mark: ?
The properly-escaped "?" will match that exact character, the "?", as it appears in the text.
For instance, if you do
Regex re = new Regex(@"\d{3}-\?\d{4}");
you will be able to get a positive match for 123-?1234.
If you want to get a positive match for 1231234 OR 123-1234, you can use the special character "?" without escape, like this:
Regex re = new Regex(@"\d{3}-?\d{4}");
P.S. for C# .NET, I find the best regex-testing place online is MyRegexTester. If you use it for C#, don't forget to check the appropriate "C# .NET" checkbox.
P.P.S. as per the comment, putting "\s*" into the regex will match any length white space (spaces and tabs included), "\ ?" will match an optional space, and "[ ]" will match exactly one space (no less).
"\?" escapes "?" that have a special meaning in the regex (0 or 1 match) so "\?" escapes it and identifies the literal "?"
Your regex looks strange to me: all the special characters seem to be escaped (including "{"), and from what I know it doesn't appear to be valid in the .NET flavor.
I think you want to write
"\d{3}[ -]?\d{4}"
if you want to match something that contains the pattern, or
"^\d{3}[ -]?\d{4}$"
if you want to match something that is exactly the pattern
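The escaped and unescaped question marks from the answers above can be compared directly. This sketch uses Python's `re` module as an assumption (its syntax here matches .NET and PCRE, not POSIX BRE), and the variable names are made up.

```python
import re

# \? is a literal question mark in this flavor.
literal_question = re.compile(r"\d{3}-\?\d{4}")
# An unescaped ? makes the preceding dash optional.
optional_dash = re.compile(r"\d{3}-?\d{4}")
# Optional single space or dash between the two digit groups.
seven_digit = re.compile(r"^\d{3}[ -]?\d{4}$")

for sample in ["123-?1234", "123-1234", "1231234", "123 1234"]:
    print(repr(sample),
          literal_question.fullmatch(sample) is not None,
          optional_dash.fullmatch(sample) is not None,
          seven_digit.match(sample) is not None)
```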

Regular expression to replace a string

I'm working on some code inherited from someone else and trying to understand some regular expression code in C#:
Regex.Replace(query, @"""[^""~]+""([^~]|$)",
m => string.Format(field + "_exact:{0}", m.Value))
What is the above regular expression doing? This is in relation to input from a user performing a search. It's doing a replace of the query string using the pattern provided in the second argument, with the value of the third. But what is that regular expression? For the life of me, it doesn't make sense. Thanks.
As far as I can see, xanatos' answer is correct. I tried to understand the regex, so here it comes:
"[^"~]+"([^~]|$)
You can test our regex and play with the single parts for better understanding at http://www.regexpal.com/
1.) a single character
"
The first pattern is a literal character. Since there is no statement of relative position, it can occur everywhere.
2.) a character class
[^"~]
The next expression is the []-bracket. This is a character set. It defines a quantity of characters, which maybe follow next. It is a placeholder for one single character... So lets see inside, which content is allowed:
^"~
The definition of the character class begins with an caret (^), which is a special character. Typing a caret after the opening square bracket will negate the character class. So it's "upside down": everything following, which does not match the class expression, matches and is a valid character.
In this case, every literal character is possible, except the two excluded ones: " or ~.
3.) a special character
+
The next expression, a plus, tells the engine to attempt to match the preceding token once or more.
So the defined character class should one or multiple times repeated to match the given expression.
4.) a single character
"
To match, the text must then contain one more quotation mark, which is the counterpart of the first one in 1.), since the character class in 2.) (repeated by 3.) does not permit a quotation mark.
5.) a capturing group
([^~]|$)
The first structure to examine here is the ()-bracket. This is a capturing group, not a lookaround: whatever it matches becomes part of the overall match.
Inside the group there are two alternatives, connected by a logical OR via the pipe symbol: |
So the next part of the match can either be
[^~] one single character of any kind except ~ (this character is consumed and included in the match)
or
$ the end of the string (or the end of a line, if multiline mode is enabled in the regex engine)
I'll try to edit my answer to a better format, since this is my first post, I first have to check out how this is working.. :)
Update:
To "detect" an asterisk/star at the start or end of the line, do the following:
First, it's a special character, so you have to escape it with a backslash: \*
To define the position, you can use:
^ to look at the beginning of the line,
$ end of the line
The overall expression would be:
^\* in front of the expression to search for an * at the beginning of
the line, \*$ at the end of the regex to demand an * at the end.
.... in your case you can add the * as another alternative in the last group to detect an * at the end:
([^~]|$|\*$)
and to force an * at the end, delete the other conditions:
(\*$)
The @ makes it necessary to escape all the " with a second ", so "". Without it, you would have used \" to escape the ", but I consider it better to always use @ in regexes, because the \ is used quite often, and it's boring and unreadable to always have to escape it to \\.
Let's see what the regex really is:
Console.WriteLine(@"""[^""~]+""([^~]|$)");
is
"[^"~]+"([^~]|$)
So now we can look at the "real" regex.
It looks for a " followed by one or more non-" and non-~ followed by another " followed by a non-~ or the end of the string. Note that the match could start after the start of the string and it could end before the end of the string (with a non-~)
For example in
car"hello"help
it would match "hello"h
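The behaviour described above, including the replacement step from the question, can be reproduced in a short sketch. Python is used here as an assumption (the original code is C#), and `mark_exact` plus the sample queries are hypothetical stand-ins for the inherited code.

```python
import re

# A quoted phrase without " or ~ inside, followed by a non-~ character
# (which is consumed) or by the end of the string.
quoted_phrase = re.compile(r'"[^"~]+"([^~]|$)')

def mark_exact(query, field):
    """Prefix every match with '<field>_exact:', like the string.Format call."""
    return quoted_phrase.sub(
        lambda m: f"{field}_exact:{m.group(0)}", query
    )

matched = quoted_phrase.search('car"hello"help').group(0)
print(matched)
print(mark_exact('"big cat" food', "title"))
```

A quoted phrase that is immediately followed by ~ is left alone, because neither alternative in the group can match at that position.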

Are these the proper regex expressions

I am trying to make a few regex strings to use in my syntax highlighter. This is the first time I have ever used them and I am having a great deal of difficulty...
The first four are, I will have a specified character followed by any number of numbers, match it.
The best I have is "G[0-9]|G[0-9][0-9]|G[0-9][0-9][0-9]" to match either G#, G##, or G###
but I want to do G with any number of numbers after it.
The next three are, I will have a character (X,Y,Z, or P) and I want to match it if there is no letter or symbol behind it
"[X|Y|Z|P][0-9]"
These next few are harder, match "#11.11=11.11" where 1 is a number and there can be any number of numbers between the pound sign, the periods, and the equal sign. And the periods do not have to be there can also be "#11=11" or " #1.1=11" or "#11=1.1"
I have no idea... "#[0-9][ |.] ..."
Anything after a " ' " and between a newline
'[A-Za-z0-9]\n" but I know this only gives me one character...
And the easy one I think is anything between two () or []
"(*) | [*]"?
Quick and dirty, but tested using regexpal
1) G[0-9]{1-3} - the '{1-3}' specifies the last symbol to occur one to three times.
2) ((.*|)) - you put a '\' before the '(' and ')' as escape characters
3) [0-9]1*(.|)1*=1*(.|)1 - this matches your three examples
4) \'.*\n - I think this should work... '\n' represents a new line char right?
5) ((|[).*()|]) - this one has those escape characters again
Again...quick and dirty. Regexpal.com is your friend
1> G[0-9]{1,3}
2> No, it's WRONG.
The correct one is [XYZ][0-9]
(you do not use an OR operator (|), but just write the characters side by side within square braces)
You should really look up how to use regexes. Having said that:
I will have a specified character followed by any number of numbers, match it
G\d+
I will have a character (X,Y,Z, or P) and I want to match it if there
is no letter or symbol behind it
(?<!\w)[XYZP][0-9]
These next few are harder, make "#11.11=11.11" blue
Huh?
Anything after a " ' " and between a newline
'(.+?)\n
And the easy one I think is anything between two () or []
\(.+?\)|\[.+?\]
And the easy one I think is anything between two () or []
"(*) | [*]"?
@"\([^(]*\)" and @"\[[^\[]*\]"
It means: an open bracket - then any number of characters which are not an open bracket - and a close bracket.
Backslashes are required to indicate to the regex engine that brackets should be treated literally.
@ - verbatim string - is to inform C#, in turn, that those backslashes should be taken literally and not as C# escape characters.
Anything after a " ' " and between a newline
Similarly: @"'[^']*\n"
G\d+
[XYZP](?=\d)
#(\d+(\.\d+)?)=(\d+(\.\d+)?)
'.*?\n
\(.*?\)|\[.*?\]
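The five patterns listed above can be tried on one line of input. The sketch below is in Python (an assumption; the highlighter in question is written in C#), and the sample line and variable names are invented for illustration.

```python
import re

g_code = re.compile(r"G\d+")                        # G plus one or more digits
axis_letter = re.compile(r"[XYZP](?=\d)")           # letter followed by a digit
assignment = re.compile(r"#(\d+(\.\d+)?)=(\d+(\.\d+)?)")
comment = re.compile(r"'.*?\n")                     # from ' up to the newline
bracketed = re.compile(r"\(.*?\)|\[.*?\]")          # (...) or [...]

line = "G01 X10 #11.5=2 (note) [tag] 'comment\n"

print(g_code.findall(line))
print(axis_letter.findall(line))
print(assignment.search(line).group(0))
print(repr(comment.search(line).group(0)))
print(bracketed.findall(line))
```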
The first one:
G[0-9]+
In regular expressions + means at least 1 or more (repetitions of the previous "characters").
You could also use * for none or any number of repetitions.
The second might be something like this:
^(X|Y|Z|P)$
This actually matches only if it's at the beginning of a line and has no characters behind. If you want it to be anywhere and only exclude certain characters behind it, modify the following:
[XYZP][^0-9a-z]
This is X or Y or Z or P followed by NOT 0-9 and NOT a-z
Notice that I use the OR character | in the first example in brackets, but not in the square brackets.
For the third one:
#[0-9]+\.[0-9]+=[0-9]+\.[0-9]+
Might not be 100 percent correct, I always confuse when to escape which characters. You might need to escape # and =.
For the last one:
(\(.*\)|\[.*\])
For the first one you can use this Regex :
^G\d+
For G with any number of digits after it
\b([Gg]\d+)\b
This matches a wordboundary (\b) followed by a lower or upper G [Gg], followed by 1 or more (+) digits (\d), followed by a wordboundary (\b)
The next three are, I will have a character (X,Y,Z, or P) and I want
to match it if there is no letter or symbol behind it
This is a little tougher
\b[XYZP]([\W]|_)
This matches an XYZ or P followed by a non-word character \W, (word characters are typically a-z, 0-9 and the underscore), so after saying we don't want a word character, we add in that the _ is allowed.
I use perl for regex, but it should hopefully be the same as what you're looking for.
For the first one, G[0-9]+ should work. The square brackets means that the regex looks for only one of the characters within the brackets (the characters being 0 through 9) and the + right after it means that it looks for one or more matches.
The second is a bit more tricky, but I would use \s[XYPZ]. The square brackets function the same as before, only matching one of X, Y, P or Z. Also the \s matches any whitespace character (tab, space, newline, etc.).
For the third one, I would try #[0-9]+\.?[0-9]+=[0-9]+\.?[0-9]+. If we go from left to right, we encounter \.? and it's new. \. matches a literal period (you have to escape it with the backslash, as just a period by itself means that it can match one of any character). The question mark means that the period can either be there or not (matches zero or one instance of a period).
The fourth one: '.*\n. The combination of the period by itself and the asterisk means that it'll match zero or more characters, the characters being anything at all. I'm not too sure if you need to escape the single quotes though.
And for the fifth one, (\(.*\)|\[.*\]) should do the trick. You need to escape the []() inside the brackets because they mean something by themselves. Also, the | means or, so the regex can either matches whatever is on the left side of the bar, or on the right.
You can specify repetitions in different ways. A star "*" after a term means, repeat the term zero, one or several times. A plus sign "+" means, repeat the term one or several times. You can also specify a number range with {n,m}. In your case the expression would be
G\d{1,3}
where \d is a digit.
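The three ways of specifying repetition can be contrasted with a small sketch, shown in Python as an assumption (the quantifier syntax is the same in .NET). The variable names are made up.

```python
import re

one_to_three = re.compile(r"G\d{1,3}")  # between one and three digits
one_or_more = re.compile(r"G\d+")       # at least one digit
zero_or_more = re.compile(r"G\d*")      # digits are optional

for sample in ["G", "G1", "G123", "G1234"]:
    print(sample,
          one_to_three.fullmatch(sample) is not None,
          one_or_more.fullmatch(sample) is not None,
          zero_or_more.fullmatch(sample) is not None)
```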
With this expression you can match text that is not followed by a given suffix
find(?!suffix)
I am not sure what you mean by symbol
[XYZP](?![a-zA-Z specify your symbols here])
For the pound number
\#\d+(\.\d+)?=\d+(\.\d+)?
\# the pound sign
\d+ at least one digit
(\.\d+)? optionally (?) a period succeeded by at least one digit
finally an equal sign succeeded by another number
Everything between "'" and \n. Use this pattern here, which finds a position between a prefix and a suffix.
(?<=prefix)find(?=suffix)
(?<=').*(?=\n)
.* means any character as many times as possible. Alternatively you could use
(?<=').*?(?=\n)
.*? means any character as few times as possible, which helps if the greedy version runs past the first \n. Also be careful with RegexOptions.Multiline. Depending on its setting you will have to test for the end of line with $ instead of \n.
For the parentheses () or [] you can use the same pattern again
(?<=prefix)find(?=suffix)
(?<=\().*?(?=\))|(?<=\[).*?(?=])
where | is the alternative.
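The lookbehind/lookahead patterns above return only the text between the delimiters. Here is a sketch in Python (an assumption; Python supports these fixed-width lookbehinds just as .NET does), with invented sample strings and variable names.

```python
import re

# Text between ( and ) or between [ and ], without the brackets themselves.
inside = re.compile(r"(?<=\().*?(?=\))|(?<=\[).*?(?=])")
# Everything after a single quote, up to but excluding the newline.
after_quote = re.compile(r"(?<=').*(?=\n)")
# X, Y, Z or P that is not followed by a letter.
axis = re.compile(r"[XYZP](?![a-zA-Z])")

print(inside.findall("f(a) g[b]"))
print(after_quote.search("x = 1 'set x\n").group(0))
print(axis.findall("X1 XA X"))
```

Since the lookarounds match positions only, the delimiters are checked but never included in the result.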
