problem writing regular expression

I am trying to write a regular expression that matches only any of the following:
{string}
{string[]}
{string[6]}
where instead of 6 in the last line, there could be any positive integer. Also, wherever string appears in the above lines, there could be int.
Here is the regular expression I initially wrote: {(string|int)(([]|[[0-9]])?)}. It worked well but wouldn't allow more than one digit within the square brackets. To overcome this problem, I modified it this way: {(string|int)(([]|[[0-9]*])?)}.
The modified regex seems to have serious problems. It matches {string[][]}. Can you please tell me what causes it to match this? Also, when I try to enclose [0-9] in parentheses, I get an exception saying "too many )'s". Why does this happen?
Can you please tell me how to write a regular expression that would satisfy my requirements?
Note: I am trying this in C#.

You need to escape the special characters like {} and []:
\{(string|int)(\[\d*\])?\}
You might need to use [0-9] instead of \d depending on the engine you use.

You need to escape the [ as it indicates a character set in regular expressions; otherwise [[0-9]*] would be interpreted as character class of [ and 0–9, * quantifier, followed by a literal ]. So:
{(string|int)(\[[0-9]*])?}
And since the quantifier * allows zero repetitions too, you don’t need the special case for an empty [].
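To check the escaped pattern against the inputs from the question, a minimal C# sketch could look like the one below. The class and method names are made up for illustration, and the ^ and $ anchors are an added assumption so that the whole input has to be a single token.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TypeTokenDemo
{
    // Pattern from the first answer, with ^ and $ added (an assumption)
    // so that nothing may precede or follow the token.
    private static readonly Regex Token =
        new Regex(@"^\{(string|int)(\[\d*\])?\}$");

    // Hypothetical helper: true when the input is exactly one token.
    public static bool IsTypeToken(string input) => Token.IsMatch(input);

    public static void Main()
    {
        string[] samples =
        {
            "{string}", "{int[]}", "{string[6]}", "{int[42]}", "{string[][]}"
        };

        foreach (string s in samples)
        {
            // The first four samples are accepted, the last one is rejected.
            Console.WriteLine($"{s} -> {IsTypeToken(s)}");
        }
    }
}
```

Note that \d* also accepts [0] and leading zeros; if only positive integers should pass, something like [1-9][0-9]* inside an optional group would probably be closer to the requirement.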

Related

How to match string by using regular expression which will not allow same special character at same time?

I am trying to match a string that must not contain two special characters in a row.
My regular expression is:
[RegularExpression(@"^+[a-zA-Z0-9]+[a-zA-Z0-9.&' '-]+[a-zA-Z0-9]$")]
This meets all my requirements except for the two issues below.
this is my string : bracks
acceptable :
bra-cks, b-r-a-c-ks, b.r.a.c.ks, bra cks (by the way above regular expression solved this)
not acceptable:
issue 1: b.., bra..cks, b..racks, bra...cks (two or more special characters together)
issue 2: bra  cks (two or more whitespace characters together)

You can use a negative lookahead to invalidate strings containing two consecutive special characters:
^(?!.*[.&' -]{2})[a-zA-Z0-9.&' -]+$
Demo: https://regex101.com/r/7j14bu/1

The goal
From what I can tell from your description and pattern, you are trying to match text that starts and ends with an alphanumeric character (because of ^+[a-zA-Z0-9] and [a-zA-Z0-9]$ in your original pattern) and that contains no two consecutive (adjacent) special characters, which, again guessing from the regex, are . & ' - and the space.
What was wrong
^+ - I think you wanted to make sure that the match starts at the beginning of the line/string, so you don't need the + here
[a-zA-Z0-9.&' '-] - in this character class you doubled the ', which is unnecessary
Solution
Please try pattern
^[a-zA-Z0-9](?:(?![.& '-]{2,})[a-zA-Z0-9.& '-])*[a-zA-Z0-9]$
Pattern explanation
^ - anchor, match the beginning of the string
[a-zA-Z0-9] - character class, match one of the characters inside []
(?:...) - non capturing group
(?!...) - negative lookahead
[.& '-]{2,} - match 2 or more of characters inside character class
[a-zA-Z0-9.& '-] - character class, match one of the characters inside []
* - match zero or more repetitions of the preceding pattern
$ - anchor, match the end of the string
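As a quick way to try the pattern explained above, here is a small C# sketch that runs it over the acceptable and unacceptable examples from the question. The class and method names are hypothetical; only the pattern itself comes from the answer.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NoAdjacentSpecialsDemo
{
    // Pattern from the answer: alphanumeric at both ends, and a negative
    // lookahead in front of every inner character that rejects a run of
    // two or more special characters starting at that position.
    private static readonly Regex Valid = new Regex(
        @"^[a-zA-Z0-9](?:(?![.& '-]{2,})[a-zA-Z0-9.& '-])*[a-zA-Z0-9]$");

    // Hypothetical helper used by Main.
    public static bool IsValid(string input) => Valid.IsMatch(input);

    public static void Main()
    {
        string[] samples =
        {
            "bra-cks", "b-r-a-c-ks", "b.r.a.c.ks", "bra cks", // expected: accepted
            "b..", "bra..cks", "b..racks", "bra  cks"         // expected: rejected
        };

        foreach (string s in samples)
        {
            Console.WriteLine($"\"{s}\" -> {IsValid(s)}");
        }
    }
}
```

One side effect worth knowing: because the pattern demands a separate first and last alphanumeric character, a single-character input such as "b" is rejected as well.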

Some remarks on your current regex:
It looks like you placed the + quantifiers before the pattern you wanted to quantify, instead of after. For instance, ^+ doesn't make much sense, since ^ is just the start of the input, and most regex engines would not even allow that.
The pattern [a-zA-Z0-9.&' '-]+ doesn't distinguish between alphanumerical and other characters, while you want the rules for them to be different. Especially for the other characters you don't want them to repeat, so that + is not desired for those.
In a character class it doesn't make sense to repeat the same character, like you have a repeat of a quote ('). Maybe you wanted to somehow delimit the space, but realise that those quotes are interpreted literally. So probably you should just remove them. Or if you intended to allow for a quote, only list it once.
Here is a correction (add the quote if you still need it):
^[a-zA-Z0-9]+(?:[.& -][a-zA-Z0-9]+)*$
Follow-up
Based on a comment, I suspect you would allow a non-alphanumerical character to be surrounded by single spaces, even if that gives a sequence of more than one non-alphanumerical character. In that case use this:
^[a-zA-Z0-9]+(?:(?:[ ]|[ ]?[.&-][ ]?)[a-zA-Z0-9]+)*$
So here the space gets a different role: it can optionally occur before and after a delimiter (one of ".&-"), or it can occur on its own. The brackets around the spaces are not needed, but I used them to stress that the space is intended and not a typo.
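The difference between the strict correction and the follow-up variant is easiest to see side by side. The following C# sketch is only an illustration with made-up names; both patterns are copied from this answer.

```csharp
using System;
using System.Text.RegularExpressions;

public static class DelimiterDemo
{
    // Strict version: exactly one delimiter between alphanumeric runs.
    public static readonly Regex Strict =
        new Regex(@"^[a-zA-Z0-9]+(?:[.& -][a-zA-Z0-9]+)*$");

    // Follow-up version: a lone space, or one of . & - with an optional
    // single space on either side.
    public static readonly Regex Spaced =
        new Regex(@"^[a-zA-Z0-9]+(?:(?:[ ]|[ ]?[.&-][ ]?)[a-zA-Z0-9]+)*$");

    public static void Main()
    {
        string[] samples = { "bra-cks", "bra cks", "bra & cks", "bra  cks", "bra..cks" };

        foreach (string s in samples)
        {
            // "bra & cks" is the only sample on which the two patterns differ.
            Console.WriteLine(
                $"\"{s}\": strict={Strict.IsMatch(s)}, spaced={Spaced.IsMatch(s)}");
        }
    }
}
```

Structuring the pattern as "run, then (delimiter, run) repeated" avoids lookaheads entirely, which tends to make the intent easier to read.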

regex not matching when using ? if first character not present

Here is my c# regex:
\"([a-zA-Z0-9]*)\":\"?([a-zA-Z0-9]*)\"?,?}?
I am testing here with sample string:
{"RestrictedCompany": "","SQLServerIndex": 0,"SurveyAdmin": false}
This is what I think the regex does:
PART 1: Look for the pattern of " ANYTHING ":
and store ANYTHING (without the quotes).
PART 2: Then look for a : and store everything until you reach a stop character of either " or , or }
It extracts part 1 fine, but doesn't pick up part 2 at all when the " isn't present (i.e. when part 2 isn't a string). So I have two questions:
Why isn't my current code picking up part 2? (and how can I fix it)
is there a way to make the ANYTHING match more flexible? (I tried using \S but it was too greedy)

First off, don't write your own JSON parser. Use one written by professionals. You're reinventing a rather complex wheel here.
That said, there are also lessons you could learn here about how to write, understand and debug regular expressions, so let's look at that.
Why isn't my current code picking up part 2? (and how can I fix it)
Learn to reason like the regular expression engine.
Let's take a simpler case. We'll take the expression
\"([a-zA-Z0-9]*)\":\"?([a-zA-Z0-9]*)\"?,?}?
And we will search this string:
{"A": "B"}
for an instance of the regular expression.
OK.
The { doesn't match anything, so skip it.
The first " matches \", so maybe we have a match.
A matches ([a-zA-Z0-9]*), so again, maybe we have a match.
The second " matches the second \", so we're still good.
The : matches :...
We now are trying to match \"?, zero or one quotes. The next character is a space, so we match zero quotes.
We are now trying to match ([a-zA-Z0-9]*), any number of alphanumerics. The next character is still the space, so we match zero alphanumerics.
We are now trying to again match \"?, and again the next character is the space, so we match zero.
We are now trying to match ,?, we have zero of them.
We are now trying to match }?, again we have zero of them
And we're done. We've successfully matched the pattern, and the match is "A":.
Now keep on going; can we match anything in the rest of the string? No. The pattern requires a :, and there is no : in the rest of the string, so I won't labour the point; plainly the match will fail.
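The walkthrough above can be reproduced with a few lines of C#. This is just a sketch with a made-up class name; the pattern is the one from the question, written as a regular C# string literal so that \" stands for a plain quote.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TraceDemo
{
    // Pattern from the question. In a regular C# string literal, \" is a quote.
    public const string Pattern = "\"([a-zA-Z0-9]*)\":\"?([a-zA-Z0-9]*)\"?,?}?";

    public static void Main()
    {
        MatchCollection matches = Regex.Matches("{\"A\": \"B\"}", Pattern);

        Console.WriteLine(matches.Count); // prints 1

        foreach (Match m in matches)
        {
            // prints: match=<"A":> name=<A> value=<>
            Console.WriteLine(
                $"match=<{m.Value}> name=<{m.Groups[1].Value}> value=<{m.Groups[2].Value}>");
        }
    }
}
```

The second group succeeds but is empty, which is why part 2 appears to be "not picked up" rather than producing a failed match.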
If that's not the pattern you wanted to match then write a different pattern. For example, if you want there to be arbitrary whitespace before and after the colon, you probably need a \s* before and after the colon. Also, if you require a value after the : then why did you make everything after the colon optional? "Required" and "optional" are opposites.
So what's the right thing to do here? Again, the right thing to do is to stop trying to solve this problem with regular expressions and use a json parser like a sensible person. But suppose we did want to parse this with regular expressions. How do we do it?
We do it by breaking the problem down into smaller parts.
What do we really want to match? Let's name each thing we want to match and then write a colon, and then say what the structure of that thing is:
DESIRED : NAME OPTIONAL_WHITESPACE COLON OPTIONAL_WHITESPACE VALUE
OK, break it down. What's a name?
NAME : QUOTE NAMECONTENTS QUOTE
Keep breaking it down.
NAMECONTENTS : any alphanumeric text of any length
Ask yourself is that true? Is an "" a NAME? Is "1234" a NAME? Is "$" a NAME? Refine the pattern until you get it right. We'll go with this for now.
Now here is a hard one:
VALUE : BOOLEAN_LITERAL
VALUE : NUMBER_LITERAL
VALUE : STRING_LITERAL
This can be any of three things. So again, keep breaking it down:
BOOLEAN_LITERAL : true
BOOLEAN_LITERAL : false
Keep going; you can see how to do it from here.
Now make a regular expression for each part and start putting it back together.
The regular expression for NAMECONTENTS is \w*.
The regular expression for QUOTE is \".
Therefore the regular expression for NAME is \"\w*\".
We want to capture the name text so put it in a group: \"(\w*)\"
Great. Similarly:
The regular expression for OPTIONAL_WHITESPACE is \s*.
The regular expression for COLON is :.
So our regular expression begins \"(\w*)\"\s*:\s*
Now we need to handle VALUE. But we've broken it down. What is the regular expression for BOOLEAN_LITERAL? That's true|false.
Keep going; make a regular expression for the other literals and then build up your regular expression from the leaves to the root.
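To make the "leaves to root" idea concrete, here is one possible C# sketch that builds the pattern from named parts and runs it on the sample string from the question. The NumberLiteral and StringLiteral definitions are simplifying assumptions that do not come from this answer (integers only, strings without escaped quotes), and the class and method names are hypothetical. It illustrates composing a pattern; it is not a substitute for a JSON parser.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class PairGrammarDemo
{
    // Leaves taken from the answer.
    private const string Quote = "\"";
    private const string NameContents = @"\w*";
    private const string OptionalWhitespace = @"\s*";
    private const string Colon = ":";
    private const string BooleanLiteral = "true|false";

    // Assumptions for illustration only: integers, and strings with no escapes.
    private const string NumberLiteral = @"-?\d+";
    private const string StringLiteral = "\"[^\"]*\"";

    // NAME  : QUOTE NAMECONTENTS QUOTE   (contents captured as group 1)
    private const string Name = Quote + "(" + NameContents + ")" + Quote;

    // VALUE : BOOLEAN_LITERAL | NUMBER_LITERAL | STRING_LITERAL   (group 2)
    private const string Value =
        "(" + BooleanLiteral + "|" + NumberLiteral + "|" + StringLiteral + ")";

    // DESIRED : NAME OPTIONAL_WHITESPACE COLON OPTIONAL_WHITESPACE VALUE
    public const string Pattern =
        Name + OptionalWhitespace + Colon + OptionalWhitespace + Value;

    public static List<(string Name, string Value)> Extract(string text)
    {
        var pairs = new List<(string Name, string Value)>();
        foreach (Match m in Regex.Matches(text, Pattern))
        {
            pairs.Add((m.Groups[1].Value, m.Groups[2].Value));
        }
        return pairs;
    }

    public static void Main()
    {
        string sample =
            "{\"RestrictedCompany\": \"\",\"SQLServerIndex\": 0,\"SurveyAdmin\": false}";

        foreach (var pair in Extract(sample))
        {
            Console.WriteLine($"{pair.Name} = {pair.Value}");
        }
    }
}
```

Keeping each grammar rule as its own named constant means a mistake in one rule (say, what counts as a NAME) can be fixed in one place without touching the rest.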

C# Error in the regular expression

I'm using C# 2012 and I can not solve this regular expression.
I need to validate the text so that dots or hyphens are mandatory as separators between the groups of digits in the input text:
[0-9]{9}(-|.)[\s]?[0-9]{4}(-|.)[0-9]{4}(-|.)0-9[0-9]{2}(-|.)[0-9]{4}
A valid text should be as follows:
0706570-39.2014.8.02.0001
but the expression above returns true to the text below although it should be false:
...Certidão de Casamento nº 00287301551
982200032250000901391 - Cartório Privativo....

^[0-9]{9}(-|\.)[\s]?[0-9]{4}(-|\.)[0-9]{4}(-|\.)0-9[0-9]{2}(-|\.)[0-9]{4}$
Add anchors ^...$ to denote start and end of string. Also escape the dot: write \. instead of ., because an unescaped . matches any character.

You need to use the following regex:
\b[0-9]{7}[-.][0-9]{2}[-.][0-9]{4}[-.][0-9][.-][0-9]{2}[-.][0-9]{4}\b
If the expression must match individual full strings, replace \b...\b with ^...$.
Note that (-|.) is pointless, because an unescaped . already matches -; your intention was presumably to match a literal dot. To match a literal dot, you need to escape it (as vks shows) or put it into a character class: [.]. A character class is a bit more efficient here since there is much less backtracking than with the alternation operator |. Also, the original expression expects different digit groups than your sample has; [0-9]{7}(-|\.)\s?[0-9]{2}(-|\.)[0-9]{4}(-|\.)[0-9]{1}(-|\.)[0-9]{2}(-|\.)[0-9]{4} would be a "fixed" version of it, shown just for the sake of demonstration.
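A short C# sketch for trying the \b...\b pattern against both texts from the question might look like this. The class and method names are invented for the example; the pattern and the sample texts come from this thread.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CaseNumberDemo
{
    // Pattern from the answer above: digit groups of 7-2-4-1-2-4, each
    // separated by a mandatory hyphen or dot.
    private static readonly Regex CaseNumber = new Regex(
        @"\b[0-9]{7}[-.][0-9]{2}[-.][0-9]{4}[-.][0-9][.-][0-9]{2}[-.][0-9]{4}\b");

    // Hypothetical helper: true when the text contains such a number.
    public static bool ContainsCaseNumber(string text) => CaseNumber.IsMatch(text);

    public static void Main()
    {
        string valid = "0706570-39.2014.8.02.0001";
        string invalid =
            "...Certidão de Casamento nº 00287301551\n" +
            "982200032250000901391 - Cartório Privativo....";

        Console.WriteLine(ContainsCaseNumber(valid));   // prints True
        Console.WriteLine(ContainsCaseNumber(invalid)); // prints False
    }
}
```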

parsing a method Signature using regular expressions

I am trying to use regular expressions to parse a method in the following format from a text:
mvAddSell[value, type1, reference(Moving, 60)]
so using the regular expressions, I am doing the following
tokensizedStrs = Regex.Split(target, "([A-Za-z ]+[\\[ ][A-Za-z0-9 ]+[ ,][A-Za-z0-9 ]+[ ,][A-Za-z0-9 ]+[\\( ][A-Za-z0-9 ]+[, ].+[\\) ][\\] ])");
It is working, but the problem is that it always gives me an empty entry at the beginning of the array if the string starts with a method in the given format, and the same happens at the end if the string ends with one. Also, if two methods appear in the string, it catches only the first one! Why is that?
I think what keeps the parser from catching two methods is the ".+" in my pattern. What I wanted to say is that there will be a number or a date in that location, so I told it that there will be a sequence of any characters. Is that wrong?
It worked for me =D ... I replaced ".+" with ".+?", which means as few characters as possible ;)

Your goal is quite unclear to me. What do you want as result? If you split on that method pattern, you will get the part before your pattern and the part after your pattern in an array, but not the method itself.
Answer to your question
To answer your concrete question: your .+ is greedy, that means it will match anything till the last )] (in the same line, . does not match newline characters by default).
You can change this behaviour by adding a ? after the quantifier to make it lazy, then it matches only till the first )].
tokensizedStrs = Regex.Split(target, "([A-Za-z ]+[\\[ ][A-Za-z0-9 ]+[ ,][A-Za-z0-9 ]+[ ,][A-Za-z0-9 ]+[\\( ][A-Za-z0-9 ]+[, ].+?[\\) ][\\] ])");
Problems in your regex
There are several other problems in your regex.
I think you misunderstood character classes when you write e.g. [\\[ ]. This construct will match either a [ or a space. If you want to allow optional whitespace after the [ (which would be logical to me), do it this way: \\[\\s*
Use a verbatim string (with a leading @) to define your regex to avoid excessive escaping.
tokensizedStrs = Regex.Split(target, @"([A-Za-z ]+\[\s*[A-Za-z0-9 ]+\s*,\s*[A-Za-z0-9 ]+\s*,\s*[A-Za-z0-9 ]+\(\s*[A-Za-z0-9 ]+\s*,\s*.+?\)\s*\]\s*)");
You can simplify your regex by not repeating parts:
tokensizedStrs = Regex.Split(target, @"([A-Za-z ]+\[\s*[A-Za-z0-9 ]+(?:\s*,\s*[A-Za-z0-9 ]+){2}\(\s*[A-Za-z0-9 ]+\s*,\s*.+?\)\s*\]\s*)");
Here (?:\s*,\s*[A-Za-z0-9 ]+){2} is a non-capturing group repeated two times.
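The greedy versus lazy behaviour, and the empty entries that Regex.Split produces, can be observed with a sketch like the following. The second method call in the input (mvAddBuy and its arguments) is invented test data, and the class and helper names are hypothetical. The pattern is the simplified one from this answer with whitespace after the closing parenthesis written as \s*.

```csharp
using System;
using System.Text.RegularExpressions;

public static class MethodSplitDemo
{
    // Everything up to the last argument, and everything after it.
    private const string Head =
        @"[A-Za-z ]+\[\s*[A-Za-z0-9 ]+(?:\s*,\s*[A-Za-z0-9 ]+){2}\(\s*[A-Za-z0-9 ]+\s*,\s*";
    private const string Tail = @"\)\s*\]\s*";

    public const string Greedy = "(" + Head + ".+" + Tail + ")";
    public const string Lazy = "(" + Head + ".+?" + Tail + ")";

    public static int CountMethods(string input, string pattern) =>
        Regex.Matches(input, pattern).Count;

    public static void Main()
    {
        string one = "mvAddSell[value, type1, reference(Moving, 60)]";
        string two = one + "; mvAddBuy[price, type2, reference(Fixed, 30)]";

        // Greedy .+ runs to the last ")]" on the line, so both calls
        // collapse into a single match; lazy .+? stops at the first ")]".
        Console.WriteLine(CountMethods(two, Greedy)); // prints 1
        Console.WriteLine(CountMethods(two, Lazy));   // prints 2

        // Split returns the text around each match (plus the captured group),
        // so a match at the very start and end yields empty strings there.
        string[] parts = Regex.Split(one, Lazy);
        Console.WriteLine(parts.Length); // prints 3: "", the method, ""

        // To get the methods themselves, iterate over the matches instead.
        foreach (Match m in Regex.Matches(two, Lazy))
        {
            Console.WriteLine(m.Value.Trim());
        }
    }
}
```

If the goal is to extract the method calls rather than the text between them, Regex.Matches is probably the more direct tool than Regex.Split.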

How Can I Check If a C# Regular Expression Is Trying to Match 1-(and-only-1)-Character Strings?

Maybe this is a very rare (or even dumb) question, but I do need it in my app.
How can I check if a C# regular expression is trying to match 1-character strings?
That means I only allow users to search for 1-character strings. If the user tries to search for a multi-character string, an error message will be displayed.
Did I make myself clear?
Thanks.
Peter
P.S.: I saw an answer about calculating the final matched strings' length, but for some unknown reason, the answer is gone.
I thought about it for a while, and I think calculating the length of the final matched strings is okay, though it's going to be kind of slow.
Yet, the original question is very rare and tedious.

A regexp for this would be .{1}
This will allow any character, though. If you only want alphanumerics, you can use [a-z0-9]{1} or the shorthand \w{1} (which also accepts uppercase letters and the underscore).
Another option is to limit the number of characters a user can type into the input field: set a maxlength on it.
Yet another option is to store the form's input field in a char rather than a string, although you may need some handling around this to prevent errors.
Why not use maxlength and save to a char?

You can look for unescaped *, +, {}, ? etc. and count the number of characters (don't forget to flatten the [] as one character).
Basically you have to parse your regex.

Instead of validating the regular expression, which could be complicated, you could apply it only on single characters instead of the whole string.
If this is not possible, you may want to limit the possibilities of regular expression to some certain features. For instance the user can only enter characters to match or characters to exclude. Then you build up the regex in your code.
eg:
ABC matches [ABC]
^ABC matches [^ABC]
A-Z matches [A-Z]
# matches [0-9]
\w matches \w
AB#x-z matches [AB]|[0-9]|[x-z]|\w
which cases do you need to support?
This would be somewhat easy to parse and validate.
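The first suggestion in this answer, applying the user's pattern to single characters rather than to the whole string, could be sketched in C# roughly as follows. The helper name and the \A(?:...)\z wrapper are assumptions made for illustration, and the sketch assumes the user's pattern is a valid regex on its own.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class SingleCharSearch
{
    // Hypothetical helper: tests the user's pattern against every character
    // of the text in isolation, so no match can span more than one character.
    public static List<int> FindMatchingPositions(string text, string userPattern)
    {
        // \A and \z force the pattern to cover the entire one-character input.
        var single = new Regex(@"\A(?:" + userPattern + @")\z");
        var positions = new List<int>();

        for (int i = 0; i < text.Length; i++)
        {
            if (single.IsMatch(text[i].ToString()))
            {
                positions.Add(i);
            }
        }

        return positions;
    }

    public static void Main()
    {
        // Digits sit at indexes 1, 3 and 4 of "a1b22".
        Console.WriteLine(string.Join(",", FindMatchingPositions("a1b22", @"\d")));

        // A pattern that needs two characters can never match here.
        Console.WriteLine(FindMatchingPositions("a1b22", "ab").Count); // prints 0
    }
}
```

One limitation of this approach: patterns that rely on surrounding context, such as lookarounds or \b, see only the isolated character and may therefore behave differently than they would on the full text.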
