Here is my c# regex:
\"([a-zA-Z0-9]*)\":\"?([a-zA-Z0-9]*)\"?,?}?
I am testing here with sample string:
{"RestrictedCompany": "","SQLServerIndex": 0,"SurveyAdmin": false}`
This is what I think the regex does:
PART 1: Look for the pattern of " ANYTHING ":
and store ANYTHING (without the quotes).
PART 2: Then look for a : and store everything until you reach a stop character of either " or , or }
It extracts part 1 fine, but doesnt pick up part 2 at all when the " isnt present (ie when part 2 isnt a string). So I have two questions:
Why isn't my current code picking up part 2? (and how can I fix it)
is there a way to make the ANYTHING match more flexible? (I tried using \S but it was too greedy)
First off, don't write your own JSON parser. Use one written by professionals. You're reinventing a rather complex wheel here.
That said, there are also lessons you could learn here about how to write, understand and debug regular expressions, so let's look at that.
Why isn't my current code picking up part 2? (and how can I fix it)
Learn to reason like the regular expression engine.
Let's take a simpler case. We'll take the expression
\"([a-zA-Z0-9]*)\":\"?([a-zA-Z0-9]*)\"?,?}?
And we will search this string:
{"A": "B"}
for an instance of the regular expression.
OK.
The { doesn't match anything, so skip it.
The first " matches \", so maybe we have a match.
A matches ([a-zA-Z0-9]*), so again, maybe we have a match.
The second " matches the second \", so we're still good.
The : matches :...
We now are trying to match \"?, zero or one quotes. We have , a space. We match zero quotes.
We are now trying to match ([a-zA-Z0-9]*), any number of alphanumerics. We have , a space. Therefore we have zero alphanumerics.
We are now trying to again match \"?, and again we have , so we match zero.
We are now trying to match ,?, we have zero of them.
We are now trying to match }?, again we have zero of them
And we're done. We've successfully matched the pattern, and the match is "A":.
Now keep on going; can we match anything in the rest of the string? No. The pattern requires a :, and there is no : in the rest of the string, so I won't labour the point; plainly the match will fail.
If that's not the pattern you wanted to match then write a different pattern. For example, if you want there to be arbitrary whitespace before and after the colon, you probably need a /s* before and after the colon. Also, if you require a value after the : then why did you make everything after the colon optional? "Required" and "optional" are opposites.
So what's the right thing to do here? Again, the right thing to do is to stop trying to solve this problem with regular expressions and use a json parser like a sensible person. But suppose we did want to parse this with regular expressions. How do we do it?
We do it by breaking the problem down into smaller parts.
What do we really want to match? Let's name each thing we want to match and then write a colon, and then say what the structure of that thing is:
DESIRED : NAME OPTIONAL_WHITESPACE COLON OPTIONAL_WHITESPACE VALUE
OK, break it down. What's a name?
NAME : QUOTE NAMECONTENTS QUOTE
Keep breaking it down.
NAMECONTENTS : any alphanumeric text of any length
Ask yourself is that true? Is an "" a NAME? Is "1234" a NAME? Is "$" a NAME? Refine the pattern until you get it right. We'll go with this for now.
Now here is a hard one:
VALUE : BOOLEAN_LITERAL
VALUE : NUMBER_LITERAL
VALUE : STRING_LITERAL
This can be any of three things. So again, keep breaking it down:
BOOLEAN_LITERAL : true
BOOLEAN_LITERAL : false
Keep going; you can see how to do it from here.
Now make a regular expression for each part and start putting it back together.
The regular expression for NAMECONTENTS is \w*.
The regular expression for QUOTE is \".
Therefore the regular expression for NAME is \"\w*\".
We want to capture the name text so put it in a group: \"(\w*)\"
Great. Similarly:
The regular expression for OPTIONAL_WHITESPACE is \s*.
The regular expression for COLON is :.
So our regular expression begins \"(\w*)\"\s:\s
Now we need to handle VALUE. But we've broken it down. What is the regular expression for BOOLEAN_LITERAL? That's [true|false].
Keep going; make a regular expression for the other literals and then build up your regular expression from the leaves to the root.
Related
For my internship I've been asked to create a tool that creates a regular expression from a few examples. Now I got it working, it generates multiple regular expressions and they are sorted depending on how greedy they are, but I want more.
The regex generator works by replacing parts of a string with regular expression character classes. For example GOM178 would turn into [A-Z]+178(letters replaced) or GOM\d+ (numbers replaced). The hard part is getting multiple character classes in one. For example at one point \p{P} is tried as well, and it replaces [],/\- and more. That causes the other character classes to mess up. It would turn [A-Z] in \p{P}A\p{P}Z\p{P}. Replacing \p{P} before [A-Z] wouldn't work as well, because that would replace the P in \p{P} causing this: \p{[A-Z]}.
I've already tried negative lookaheads, but that didn't workout too well. The only reason why it currently works is because I test it before saving the result. This is the regular expression I've used for that:
(?:(?!(?:\[a-z\]|\[A-Z\]|\[a-zA-Z\]|\\d\+\[\\\.,\]\?\\d\*|\\d|\\s|\\p\{P\}|\\w|\\n|\.)(?:\*|\?|\+|\+\?|\*\?)?)(<The character class to match goes here>))
Here is an example of the regex in action: Link to example.
As you can see it also matches the - and ] in the character class. It should ignore it because it's part of [a-z] which is noted in the negative lookahead.
Long story short, a string should not be replaced when it's in a specific context. Does anyone have an idea on how to fix this, or perhaps have a better idea on how to do this.
I'm working on some code inherited from someone else and trying to understand some regular expression code in C#:
Regex.Replace(query, #"""[^""~]+""([^~]|$)",
m => string.Format(field + "_exact:{0}", m.Value))
What is the above regular expression doing? This is in relation to input from a user performing a search. It's doing a replace of the query string using the pattern provided in the second argument, with the value of the third. But what is that regular expression? For the life of me, it doesn't make sense. Thanks.
As far as I can see, xanatos' answer is correct. I tried to understand the regex, so here it comes:
"[^"~]+"([^~]|$)
You can test our regex and play with the single parts for better understanding at http://www.regexpal.com/
1.) a single character
"
The first pattern is a literal character. Since there is no statement of relative position, it can occur everywhere.
2.) a character class
[^"~]
The next expression is the []-bracket. This is a character set. It defines a quantity of characters, which maybe follow next. It is a placeholder for one single character... So lets see inside, which content is allowed:
^"~
The definition of the character class begins with an caret (^), which is a special character. Typing a caret after the opening square bracket will negate the character class. So it's "upside down": everything following, which does not match the class expression, matches and is a valid character.
In this case, every literal character is possible, except the two excluded ones: " or ~.
3.) a special character
+
The next expression, a plus, tells the engine to attempt to match the preceding token once or more.
So the defined character class should one or multiple times repeated to match the given expression.
4.) a single character
"
To match, the expression should contain furthermore one further apostrophe, which will be the corresponding apostrophe to the first one in 1.) since the character class in (2.) hence (3.) does not permit an apostrophe.
5.) a lookaround
([^~]|$)
The first structure here to examine is the ()-bracket. This is called a "Lookaround".
It is is a special kind of group. Lookaround matches a position. It does not expand the regex match.
So this means this part does not try to find any certain characters inside of an expression
rather then to localize them.
The localisation demands has two conditions, which are connected by a logical OR by the pipeline symbol: |
So the next character of the matched expression could either be
[^~] one single character out of the class everything excluding the character ~
or
$ the end of the line (or word, if multiline-mode is not used in regex engine)
I'll try to edit my answer to a better format, since this is my first post, I first have to check out how this is working.. :)
Update:
to "detect" a Asterisk/star in front/end of the line, you have to do following:
First it's a special character, so you have to escape it with an backslash: *
To define the position, you can use:
^ to look at the beginning of the line,
$ end of the line
The overall expression would be:
^* in front of the expression to search for an * at the beginning of
the line $* at the end of the regex to demand an * at the end.
.... in your case you can add the * in the last character class to detect an * in the end:
([^~]|$|$*)
and to force an * in the end, delete the other conditions:
($*)
PS:
(somehow my regex is swallowed up by formating engine, so my update is wrong...)
The # makes it necessary to escape all the " with a second ", so "". Without it to escape the " you would have used \", but I consider it better to always use # in regexes, because the \ is used quite often, and it's boring and unreadable to always have to escape it to \\.
Let's see what the regex really is:
Console.WriteLine(#"""[^""~]+""([^~]|$)");
is
"[^"~]+"([^~]|$)
So now we can look at the "real" regex.
It looks for a " followed by one or more non-" and non-~ followed by another " followed by a non-~ or the end of the string. Note that the match could start after the start of the string and it could end before the end of the string (with a non-~)
For example in
car"hello"help
it would match "hello"h
I am trying to use regular expressions to parse a method in the following format from a text:
mvAddSell[value, type1, reference(Moving, 60)]
so using the regular expressions, I am doing the following
tokensizedStrs = Regex.Split(target, "([A-Za-z ]+[\\[ ][A-Za-z0-9 ]+[ ,][A-Za-z0-9 ]+[ ,][A-Za-z0-9 ]+[\\( ][A-Za-z0-9 ]+[, ].+[\\) ][\\] ])");
It is working, but the problem is that it always gives me an empty array at the beginning if the string started with a method in the given format and the same happens if it comes at the end. Also if two methods appeared in the string, it catches only the first one! why is that ?
I think what is causing the parser not to catch two methods is the existance of ".+" in my patern, what I wanted to do is that I want to tell it that there will be a number of a date in that location, so I tell it that there will be a sequence of any chars, is that wrong ?
it woooorked with ,e =D ... I replaced ".+" by ".+?" which meant as few as possible of any number of chars ;)
Your goal is quite unclear to me. What do you want as result? If you split on that method pattern, you will get the part before your pattern and the part after your pattern in an array, but not the method itself.
Answer to your question
To answer your concrete question: your .+ is greedy, that means it will match anything till the last )] (in the same line, . does not match newline characters by default).
You can change this behaviour by adding a ? after the quantifier to make it lazy, then it matches only till the first )].
tokensizedStrs = Regex.Split(target, "([A-Za-z ]+[\\[ ][A-Za-z0-9 ]+[ ,][A-Za-z0-9 ]+[ ,][A-Za-z0-9 ]+[\\( ][A-Za-z0-9 ]+[, ].+?[\\) ][\\] ])");
Problems in your regex
There are several other problems in your regex.
I think you misunderstood character classes, when you write e.g. [\\[ ]. this construct will match either a [ or a space. If you want to allow optional space after the [ (would be logical to me), do it this way: \\[\\s*
Use a verbatim string (with a leading #) to define your regex to avoid excessive escaping.
tokensizedStrs = Regex.Split(target, #"([A-Za-z ]+\[\s*[A-Za-z0-9 ]+\s*,\s*[A-Za-z0-9 ]+\s*,\s*[A-Za-z0-9 ]+\(\s*[A-Za-z0-9 ]+\s*,\s*.+?\)s*\]\s*)");
You can simplify your regex, by avoiding repeating parts
tokensizedStrs = Regex.Split(target, #"([A-Za-z ]+\[\s*[A-Za-z0-9 ]+(?:\s*,\s*[A-Za-z0-9 ]+){2}\(\s*[A-Za-z0-9 ]+\s*,\s*.+?\)s*\]\s*)");
This is an non capturing group (?:\s*,\s*[A-Za-z0-9 ]+){2} repeated two times.
I'm having a hard time understanding why the following expression \\[B.+\\] and code returns a Matches count of 1:
string r = "\\[B.+\\]";
return Regex.Matches(Markup, sRegEx);
I want to find all the instances (let's call them 'tags') (in a variable length HTML string Markup that contains no line breaks) that are prefixed by B and are enclosed in square brackets.
If the markup contains [BName], I get one match - good.
If the markup contains [BName] [BAddress], I get one match - why?
If the markup contains [BName][BAddress], I also only get one match.
On some web-based regex testers, I've noticed that if the text contains a CR character, I'll get a match per line - but I need some way to specify that I want matches returned independent of line breaks.
I've also poked around in the Groups and Captures collections of the MatchCollection, but to no avail - always just one result.
You are getting only one match because, by default, .NET regular expressions are "greedy"; they try to match as much as possible with a single match.
So if your value is [BName][BAddress] you will have one match - which will match the entire string; so it will match from the [B at the beginning all the way to the last ] - instead of the first one. If you want two matches, use this pattern instead: \\[B.+?\\]
The ? after the + tells the matching engine to match as little as possible... leaving the second group to be its own match.
Slaks also noted an excellent option; specifying specifically that you do not wish to match the ending ] as part of the content, like so: \\[B[^\\]]+\\] That keeps your match 'greedy', which might be useful in some other case. In this specific instance, there may not be much difference - but it's an important thing to keep in mind depending on what data/patterns you might be dealing with specifically.
On a side note, I recommend using the C# "literal string" specifier # for regular expression patterns, so that you do not need to double-escape things in regex patterns; So I would set the pattern like so:
string pattern = #"\[B.+?\]";
This makes it much easier to figure out regular expressions that are more complex
Try the regex string \\[B.+?\\] instead. .+ on it's own (same is pretty much true for .*) will match against as many characters as possible, whereas .+? (or .*?) will match against the bare minimum number of characters whilst still satisfying the rest of the expression.
.+ is a greedy match; it will match as much as possible.
In your second example, it matches BName] [BAddress.
You should write \[B[^\]]+\].
[^\]] matches every character except ], so it is forced to stop before the first ].
I am trying to write a regular expression that matches only any of the following:
{string}
{string[]}
{string[6]}
where instead of 6 in the last line, there could be any positive integer. Also, wherever string appears in the above lines, there could be int.
Here is the regular expression I initially wrote : {(string|int)(([]|[[0-9]])?)}. It worked well but wouldn't allow more than one digit within the square bracket. To overcome this problem, I modified it this way : {(string|int)(([]|[[0-9]*])?)}.
The modified regex seems to be having serious problems. It matches {string[][]}. Can you please tell me what causes it to match against this? Also, when I try to enclose [0-9] within paranthesis, I get an exception saying "too many )'s". Why does this happen?
Can you please tell me how to write the regular expression that would satisfy my requirements?
Note : I am tryingthis in C#
You need to escape the special characters like {} and []:
\{(string|int)(\[\d*\])?\}
You might need to use [0-9] instead of \d depending on the engine you use.
You need to escape the [ as it indicates a character set in regular expressions; otherwise [[0-9]*] would be interpreted as character class of [ and 0–9, * quantifier, followed by a literal ]. So:
{(string|int)(\[[0-9]*])?}
And since the quantifier * allows zero repetitions too, you don’t need the special case for an empty [].