Problem
In a very special case, my negative lookahead is an empty list:
(?!^()$)
Is there any string that matches it?
Clarification
Let's say:
(?!^()$)^(.*)$
Will it match everything?
(?!^()$) can be simplified to (?!^$) since () is a null group and will match at any position, all the time.
So now you're saying "match at any and every position where the start and end anchors aren't right next to one another, or in other words, we aren't at an empty string".
Therefore (?!^$) can match at every position in a string that isn't just empty or a newline.
(?!^()$)^(.*)$ is "match everywhere but at an empty string" plus ^(.*)$, which will "match and consume a whole line, empty or not". Combined, it essentially says "consume one or more characters", which can be distilled down to ^.+$ (or simply .+ if the match does not need to be anchored).
Literally anything, beside empty string.
The regex contains 2 parts, (?!^()$) and ^(.*)$ :
(?!^()$) is a negative zero-width assertion that rejects the empty string. In other words, string.Empty is out.
^(.*)$ is a full match for any character except the newline character, repeated zero or more times, so basically anything.
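A quick way to check the claims above is to run the pattern against a few inputs. This is only a sketch: the class and method names are made up for illustration, and it assumes the default RegexOptions (no Multiline).

```csharp
using System;
using System.Text.RegularExpressions;

public static class EmptyLookaheadDemo
{
    // The pattern from the question, used with default options.
    public const string Pattern = @"(?!^()$)^(.*)$";

    public static bool Matches(string input)
    {
        return Regex.IsMatch(input, Pattern);
    }

    public static void Main()
    {
        Console.WriteLine(Matches(""));     // False: the lookahead rejects the empty string
        Console.WriteLine(Matches("abc"));  // True
        Console.WriteLine(Matches("\n"));   // False: $ also matches just before a final newline
    }
}
```

The "\n" case is the one mentioned earlier: a string that is "just a newline" is rejected as well, because $ can match right before a trailing newline.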
Related
I have this regular expression to match:
String start with a zero + white space + anything else
String is a zero
"0 fkvjdm" // Must Match
"0" // Must match
"0.56" // NOT match
Here is the regular expression I'm using:
^([0]$|([0]\s+.))
Is there a way to improve it? Or does it have a bug?
Thanks a lot for your help.
Environment
VS 2010 .net 4
First of all, there is no need to put 0 in a character class.
Secondly, your regex will not consume more than a single character after the whitespace, because there is no quantifier on the dot (.) in the second part of your regex. To match more characters after the whitespace, you should use .* (0 or more) or .+ (1 or more).
To improve in clarity, you can make use of optional quantifier here:
^0(\s+.*)?$
It seems the second character is what should cause the match to fail: if the second character is a period, don't match; otherwise match. A negative lookahead (?!...) fails the match at that position if its contents match. Hence if the second character is a period, the whole match fails.
^0(?!\.).*
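The two suggested patterns are not equivalent, which a small comparison makes visible. The sketch below uses the sample strings from the question plus one extra input ("01") that I added as an assumption to show the difference; the class and method names are illustrative only.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ZeroPrefixDemo
{
    // "0" alone, or "0" + whitespace + anything.
    public const string Optional = @"^0(\s+.*)?$";
    // "0" followed by anything that does not start with a period.
    public const string Lookahead = @"^0(?!\.).*";

    public static bool IsMatch(string pattern, string input)
    {
        return Regex.IsMatch(input, pattern);
    }

    public static void Main()
    {
        foreach (string s in new[] { "0 fkvjdm", "0", "0.56", "01" })
        {
            Console.WriteLine("{0,-10} optional={1} lookahead={2}",
                s, IsMatch(Optional, s), IsMatch(Lookahead, s));
        }
    }
}
```

Both patterns accept "0 fkvjdm" and "0" and reject "0.56", but only the lookahead version accepts "01", so the optional-group version is closer to the stated requirement.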
I'm working on some code inherited from someone else and trying to understand some regular expression code in C#:
Regex.Replace(query, @"""[^""~]+""([^~]|$)",
m => string.Format(field + "_exact:{0}", m.Value))
What is the above regular expression doing? This is in relation to input from a user performing a search. It's doing a replace of the query string using the pattern provided in the second argument, with the value of the third. But what is that regular expression? For the life of me, it doesn't make sense. Thanks.
As far as I can see, xanatos' answer is correct. I tried to understand the regex, so here it comes:
"[^"~]+"([^~]|$)
You can test our regex and play with the single parts for better understanding at http://www.regexpal.com/
1.) a single character
"
The first pattern is a literal character. Since the pattern is not anchored, it can occur anywhere in the input.
2.) a character class
[^"~]
The next expression is the []-bracket. This is a character class (character set). It defines the set of characters that may come next, and it stands for exactly one character. So let's look inside to see which content is allowed:
^"~
The definition of the character class begins with a caret (^), which is a special character. A caret right after the opening square bracket negates the character class. So it's "upside down": any character that is not listed in the class matches.
In this case, every literal character is possible, except the two excluded ones: " or ~.
3.) a special character
+
The next expression, a plus, tells the engine to match the preceding token one or more times.
So the character class has to match one or more consecutive characters.
4.) a single character
"
To match, the input must then contain one more quotation mark, which is the counterpart of the first one in 1.), since the character class in 2.), repeated by 3.), does not permit a quotation mark.
5.) a capturing group
([^~]|$)
The first structure here to examine is the ()-bracket. This is a capturing group, not a lookaround.
A lookaround would be written (?=...) or (?!...) and would only assert a position without extending the match.
A plain group like this one does consume input, so whatever it matches becomes part of the overall match.
The group contains two alternatives, which are connected by a logical OR, the pipe symbol: |
So the next part of the match can be either
[^~] one single character of any kind except ~
or
$ the end of the string (or of a line, if multiline mode is enabled)
I'll try to edit my answer to a better format, since this is my first post, I first have to check out how this is working.. :)
Update:
To "detect" an asterisk/star at the beginning/end of the line, you have to do the following:
First, it's a special character, so you have to escape it with a backslash: \*
To define the position, you can use:
^ to look at the beginning of the line,
$ end of the line
The overall expression would be:
^\* in front of the expression to search for an * at the beginning of
the line, \*$ at the end of the regex to demand an * at the end.
.... in your case you can add the \* to the last group to detect an * at the end:
([^~]|$|\*$)
and to force an * at the end, delete the other conditions:
(\*$)
The @ makes it necessary to escape every " with a second ", so "". Without it you would escape the " as \", but I consider it better to always use @ in regexes, because the \ is used quite often, and it's tedious and unreadable to always have to escape it as \\.
Let's see what the regex really is:
Console.WriteLine(@"""[^""~]+""([^~]|$)");
is
"[^"~]+"([^~]|$)
So now we can look at the "real" regex.
It looks for a " followed by one or more non-" and non-~ followed by another " followed by a non-~ or the end of the string. Note that the match could start after the start of the string and it could end before the end of the string (with a non-~)
For example in
car"hello"help
it would match "hello"h
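The example above, together with the Replace call from the question, can be reproduced with a short program. This is a sketch under assumptions: the field name "title" and the sample queries are invented for illustration, and the helper names are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class QuotedPhraseDemo
{
    // Same pattern as in the question, written as a verbatim string.
    public const string Pattern = @"""[^""~]+""([^~]|$)";

    public static string FirstMatch(string input)
    {
        return Regex.Match(input, Pattern).Value;
    }

    // Mirrors the Replace call from the question; 'field' is a hypothetical field name.
    public static string TagExact(string query, string field)
    {
        return Regex.Replace(query, Pattern,
            m => string.Format(field + "_exact:{0}", m.Value));
    }

    public static void Main()
    {
        Console.WriteLine(FirstMatch("car\"hello\"help"));        // "hello"h
        Console.WriteLine(TagExact("\"hello world\"", "title"));  // title_exact:"hello world"
        Console.WriteLine(TagExact("\"hello\"~2", "title"));      // "hello"~2 (unchanged)
    }
}
```

The last call shows what the trailing ([^~]|$) appears to be for: a quoted phrase that is immediately followed by ~ is left alone.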
I am trying to make a few regex strings to use in my syntax highlighter. This is the first time I have ever used them and I am having a great deal of difficulty...
The first four are, I will have a specified character followed by any number of numbers, match it.
The best I have is "G[0-9]|G[0-9][0-9]|G[0-9][0-9][0-9]" to match either G#, G##, or G###
but I want to do G with any number of numbers after it.
The next three are, I will have a character (X,Y,Z, or P) and I want to match it if there is no letter or symbol behind it
"[X|Y|Z|P][0-9]"
These next few are harder, match "#11.11=11.11" where 1 is a number and there can be any number of numbers between the pound sign, the periods, and the equal sign. And the periods do not have to be there can also be "#11=11" or " #1.1=11" or "#11=1.1"
I have no idea... "#[0-9][ |.] ..."
Anything after a " ' " and between a newline
'[A-Za-z0-9]\n" but I know this only gives me one character...
And the easy one I think is anything between two () or []
"(*) | [*]"?
Quick and dirty, but tested using regexpal
1) G[0-9]{1-3} - the '{1-3}' specifies the last symbol to occur one to three times.
2) ((.*|)) - you put a '\' before the '(' and ')' as escape characters
3) [0-9]1*(.|)1*=1*(.|)1 - this matches your three examples
4) \'.*\n - I think this should work... '\n' represents a new line char right?
5) ((|[).*()|]) - this one has those escape characters again
Again...quick and dirty. Regexpal.com is your friend
1> G[0-9]{1,3}
2> No, it's WRONG.
The correct one is [XYZ][0-9]
(you do not use an OR operator (|), but just write the characters side by side within square braces)
You should really look up how to use regexes. Having said that:
I will have a specified character followed by any number of numbers, match it
G\d+
I will have a character (X,Y,Z, or P) and I want to match it if there
is no letter or symbol behind it
(?<!\w)[XYZP][0-9]
These next few are harder, make "#11.11=11.11" blue
Huh?
Anything after a " ' " and between a newline
'(.+?)\n
And the easy one I think is anything between two () or []
\(.+?\)|\[.+?\]
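To see how the first two patterns from this answer behave, here is a small sketch. The input strings are made up for illustration and the helper name is hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class HighlighterPatternsDemo
{
    // Collects every match of 'pattern' in 'input'.
    public static List<string> Find(string input, string pattern)
    {
        var found = new List<string>();
        foreach (Match m in Regex.Matches(input, pattern))
        {
            found.Add(m.Value);
        }
        return found;
    }

    public static void Main()
    {
        // G followed by one or more digits.
        Console.WriteLine(string.Join(",", Find("G1 G22 G12345", @"G\d+")));
        // X, Y, Z or P plus a digit, but only when no word character precedes it.
        Console.WriteLine(string.Join(",", Find("X1 AX2 Y3", @"(?<!\w)[XYZP][0-9]")));
    }
}
```

With these inputs the first line lists G1, G22 and G12345, and the second lists X1 and Y3; the X2 inside AX2 is skipped because of the negative lookbehind.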
And the easy one I think is anything between two () or []
"(*) | [*]"?
@"\([^(]*\)" and @"\[[^\[]*\]"
It means: an open bracket - then any number of characters which are not an open bracket - and a close bracket.
Slashes are required to indicate to the regex engine that brackets should be treated literally.
@ - verbatim string - is to inform C#, in turn, that those slashes should be taken literally and not as C# escape characters.
Anything after a " ' " and between a newline
Similarly: @"'[^']*\n"
G\d+
[XYZP](?=\d)
#(\d+(\.\d+)?)=(\d+(\.\d+)?)
'.*?\n
\(.*?\)|\[.*?\]
Regex explanation here.
The first one:
G[0-9]+
In regular expressions + means one or more repetitions of the preceding token.
You could also use * for none or any number of repetitions.
The second might be something like this:
^(X|Y|Z|P)$
This actually matches only if it's at the beginning of a line and has no characters behind. If you want it to be anywhere and only exclude certain characters behind it, modify the following:
[XYZP][^0-9a-z]
This is X or Y or Z or P followed by NOT 0-9 and NOT a-z
Notice that I use the OR character | in the first example in brackets, but not in the square brackets.
For the third one:
#[0-9]+\.[0-9]+=[0-9]+\.[0-9]+
Might not be 100 percent correct, I always confuse when to escape which characters. You might need to escape # and =.
For the last one:
(\(.*\)|\[.*\])
For the first one you can use this Regex :
^G\d+
For G with any number of digits after it
\b([Gg]\d+)\b
This matches a word boundary (\b) followed by a lower or upper case G [Gg], followed by 1 or more (+) digits (\d), followed by a word boundary (\b)
The next three are, I will have a character (X,Y,Z, or P) and I want
to match it if there is no letter or symbol behind it
This is a little tougher
\b[XYZP]([\W]|_)
This matches an XYZ or P followed by a non-word character \W, (word characters are typically a-z, 0-9 and the underscore), so after saying we don't want a word character, we add in that the _ is allowed.
I use perl for regex, but it should hopefully be the same as what you're looking for.
For the first one, G[0-9]+ should work. The square brackets means that the regex looks for only one of the characters within the brackets (the characters being 0 through 9) and the + right after it means that it looks for one or more matches.
The second is a bit more tricky, but I would use \s[XYPZ]. The square brackets function the same as before, only matching one of X, Y, P or Z. Also the \s matches any whitespace character (tab, space, newline, etc.).
For the third one, I would try #[0-9]+\.?[0-9]+=[0-9]+\.?[0-9]+. If we go from left to right, we encounter \.? and it's new. \. matches a literal period (you have to escape it with the backslash, as just a period by itself means that it can match one of any character). The question mark means that the period can either be there or not (matches zero or one instance of a period).
The fourth one: '.*\n. The combination of the period by itself and the asterisk means that it'll match zero or more characters, the characters being anything at all. I'm not too sure if you need to escape the single quotes though.
And for the fifth one, (\(.*\)|\[.*\]) should do the trick. You need to escape the []() inside the brackets because they mean something by themselves. Also, the | means or, so the regex can either matches whatever is on the left side of the bar, or on the right.
You can specify repetitions in different ways. A star "*" after a term means, repeat the term zero, one or several times. A plus sign "+" means, repeat the term one or several times. You can also specify a number range with {n,m}. In your case the expression would be
G\d{1,3}
where \d is a digit.
With this expression you can match text that is not followed by a given suffix
find(?!suffix)
I am not sure what you mean by symbol
[XYZP](?![a-zA-Z specify your symbols here])
For the pound number
\#\d+(\.\d+)?=\d+(\.\d+)?
\# the pound sign
\d+ at least one digit
(\.\d+)? optionally (?) a period succeeded by at least one digit
finally an equal sign succeeded by another number
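The pound-number pattern can be tried against the four sample strings from the question. The sketch below wraps it in ^...$ so that the whole input has to match; that anchoring, the two negative examples, and the names are my own additions for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PoundNumberDemo
{
    // Pattern from the answer: '#', a number, '=', a number; fractions are optional.
    public const string Pattern = @"\#\d+(\.\d+)?=\d+(\.\d+)?";

    // Anchored variant, so partial matches do not count.
    public static bool IsWholeMatch(string input)
    {
        return Regex.IsMatch(input, "^" + Pattern + "$");
    }

    public static void Main()
    {
        string[] samples = { "#11.11=11.11", "#11=11", "#1.1=11", "#11=1.1", "#11.=11", "11=11" };
        foreach (string s in samples)
        {
            Console.WriteLine("{0,-14} {1}", s, IsWholeMatch(s));
        }
    }
}
```

The first four inputs are accepted; "#11.=11" (period without digits) and "11=11" (no pound sign) are rejected.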
Everything between "'" and \n. Use this pattern here, which finds a position between a prefix and a suffix.
(?<=prefix)find(?=suffix)
(?<=').*(?=\n)
.* means any character as many times as possible. Alternatively you could use
(?<=').*?(?=\n)
.*? means any character as few times as possible, which helps if .* takes in too much. Also be careful with RegexOptions.Multiline: depending on its setting, you will have to test for the end of line with $ instead of \n.
For the parentheses () or [] you can use the same pattern again
(?<=prefix)find(?=suffix)
(?<=\().*?(?=\))|(?<=\[).*?(?=])
where | is the alternative.
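Because the lookbehind and lookahead only assert positions, the match value contains just the text between the brackets. A minimal sketch, with an invented input string and hypothetical names:

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class BetweenBracketsDemo
{
    // Text between ( ) or between [ ], without the brackets themselves.
    public const string Pattern = @"(?<=\().*?(?=\))|(?<=\[).*?(?=])";

    public static List<string> Extract(string input)
    {
        var parts = new List<string>();
        foreach (Match m in Regex.Matches(input, Pattern))
        {
            parts.Add(m.Value);
        }
        return parts;
    }

    public static void Main()
    {
        Console.WriteLine(string.Join(",", Extract("f(a) [b]")));  // a,b
    }
}
```

For a syntax highlighter this may or may not be what is wanted: if the brackets should be coloured too, the simpler \(.*?\)|\[.*?\] form keeps them in the match.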
While answering this question C# Regex Replace and * the point was raised as to why the problem exists. When playing I produced the following code:
string s = Regex.Replace(".A.", @"\w*", "B");
Console.Write(s);
This has the output: B.BB.B
I get that the zero-length string is matched before and after each . character, but why is A replaced by two Bs?
I could understand B.BBB.B (replacing the zero-length strings on either side of A) or B.B.B
But the actual result confuses me - any help appreciated.
Or as AakashM has put it:
Why is Regex.Matches("A", @"\w*").Count equal to 2, not 1 or 3 ?
There is a star after \w
It means "zero or many" so that means:
First symbol is a dot, it is NOT \w so there is zero \w here, replace by B
Next we have a dot itself, which is not replaceable
A gets replaced by B
zero \w before the next dot, replace by B
dot, not replaceable
Line end, zero \w so replace by B again.
Expression \w{0,} will have the same effect.
If you want to avoid it, use 'plus' which means 'at least one': \w+
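The walk-through above can be made visible by listing each match with its start index. This is just an illustrative sketch; the Describe helper and its output format are made up.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class StarMatchDemo
{
    // Returns each match as index:'value' so empty matches are visible.
    public static List<string> Describe(string input, string pattern)
    {
        var result = new List<string>();
        foreach (Match m in Regex.Matches(input, pattern))
        {
            result.Add(m.Index + ":'" + m.Value + "'");
        }
        return result;
    }

    public static void Main()
    {
        Console.WriteLine(string.Join(" ", Describe(".A.", @"\w*")));  // 0:'' 1:'A' 2:'' 3:''
        Console.WriteLine(Regex.Replace(".A.", @"\w*", "B"));          // B.BB.B
        Console.WriteLine(Regex.Replace(".A.", @"\w+", "B"));          // .B.
    }
}
```

There are four matches: an empty one at index 0, "A" at index 1, an empty one at index 2 (right after the A) and an empty one at the end. The empty match at index 2 is where the second B next to the A comes from.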
That's the same behaviour as
Regex.Replace("", @"\w*", "B") results in B
Regex.Replace("A", @"\w*", "B") results in BB
See it here on Regexr
For the string ".A." \w* matches before the first dot the empty string, then on the "A", after the "A" the empty string and after the last dot the empty string.
Explanation
You can think of the pattern eating the characters, \w* has eaten the "A", the next char is a dot, so this match is complete and replaced. But the start position for the pattern to continue matching is still between the A and the dot. The dot can not be matched, so it matches the empty string before the dot, but then this position is done and the next start position is after the dot.
Because \w* is greedy and tries to find the longest sequence. So it matches "nothing" before the first dot, then "A" between the two dots, then "nothing" before the second dot, and finally "nothing" after the second dot.
By default the match is greedy, so it looks for the longest possible match at each position. That is why you get that result.
If you use the reluctant (lazy) form instead, like this
string s = Regex.Replace(".A.", "\\w*?", "B");
you will get this result, because it looks for the shortest possible matches:
B.BAB.B
Why does this not match, and how can I make it work?
Regex.Match("qwe", ".*?(?=([ $]))");
It should match everything up to the first space or the end of the line.
Your specific problem is that you need to use an alternation, not a character class, because inside a character class the $ symbol literally means "match a dollar symbol", and does not have its special meaning end-of-line in that context.
( |$)
It seems however that your example is a bit strange. It would be simpler to match any character except space, then you wouldn't need a lookahead at all.
Try with:
Regex.Match("qwe", "^([^ ]*)");
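The difference between the character class and the alternation can be checked directly. A small sketch follows; the second input string ("qwe rty") and the UpTo helper are assumptions added for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class EndOrSpaceDemo
{
    // Returns the matched text, or null when the pattern does not match at all.
    public static string UpTo(string input, string pattern)
    {
        Match m = Regex.Match(input, pattern);
        return m.Success ? m.Value : null;
    }

    public static void Main()
    {
        // Inside [...] the $ is a literal dollar sign, so nothing matches here.
        Console.WriteLine(UpTo("qwe", @".*?(?=[ $])") ?? "(no match)");
        // With alternation, $ keeps its end-of-input meaning.
        Console.WriteLine(UpTo("qwe", @".*?(?=( |$))"));      // qwe
        Console.WriteLine(UpTo("qwe rty", @".*?(?=( |$))"));  // qwe
        // The simpler form without a lookahead.
        Console.WriteLine(UpTo("qwe rty", @"^([^ ]*)"));      // qwe
    }
}
```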