parsing a method Signature using regular expressions - c#

I am trying to use regular expressions to parse a method in the following format from a text:
mvAddSell[value, type1, reference(Moving, 60)]
so using the regular expressions, I am doing the following
tokensizedStrs = Regex.Split(target, "([A-Za-z ]+[\\[ ][A-Za-z0-9 ]+[ ,][A-Za-z0-9 ]+[ ,][A-Za-z0-9 ]+[\\( ][A-Za-z0-9 ]+[, ].+[\\) ][\\] ])");
It is working, but the problem is that it always gives me an empty array at the beginning if the string started with a method in the given format and the same happens if it comes at the end. Also if two methods appeared in the string, it catches only the first one! why is that ?
I think what is causing the parser not to catch two methods is the existance of ".+" in my patern, what I wanted to do is that I want to tell it that there will be a number of a date in that location, so I tell it that there will be a sequence of any chars, is that wrong ?
it woooorked with ,e =D ... I replaced ".+" by ".+?" which meant as few as possible of any number of chars ;)

Your goal is quite unclear to me. What do you want as result? If you split on that method pattern, you will get the part before your pattern and the part after your pattern in an array, but not the method itself.
Answer to your question
To answer your concrete question: your .+ is greedy, that means it will match anything till the last )] (in the same line, . does not match newline characters by default).
You can change this behaviour by adding a ? after the quantifier to make it lazy, then it matches only till the first )].
tokensizedStrs = Regex.Split(target, "([A-Za-z ]+[\\[ ][A-Za-z0-9 ]+[ ,][A-Za-z0-9 ]+[ ,][A-Za-z0-9 ]+[\\( ][A-Za-z0-9 ]+[, ].+?[\\) ][\\] ])");
Problems in your regex
There are several other problems in your regex.
I think you misunderstood character classes, when you write e.g. [\\[ ]. this construct will match either a [ or a space. If you want to allow optional space after the [ (would be logical to me), do it this way: \\[\\s*
Use a verbatim string (with a leading #) to define your regex to avoid excessive escaping.
tokensizedStrs = Regex.Split(target, #"([A-Za-z ]+\[\s*[A-Za-z0-9 ]+\s*,\s*[A-Za-z0-9 ]+\s*,\s*[A-Za-z0-9 ]+\(\s*[A-Za-z0-9 ]+\s*,\s*.+?\)s*\]\s*)");
You can simplify your regex, by avoiding repeating parts
tokensizedStrs = Regex.Split(target, #"([A-Za-z ]+\[\s*[A-Za-z0-9 ]+(?:\s*,\s*[A-Za-z0-9 ]+){2}\(\s*[A-Za-z0-9 ]+\s*,\s*.+?\)s*\]\s*)");
This is an non capturing group (?:\s*,\s*[A-Za-z0-9 ]+){2} repeated two times.

Related

Fastest regex for first occurence of a word

I would like my regex to capture the following kind of strings as two Urls with "%3f" inside them.
https://*****%3f****%3D,https://*****%3f****%3D …
Where each string URL of this type should be captured by itself. Note - The * is here for simplification and the URLS can be in any part of the big string with anything in between.
My regex now is:
(https://\S+?%3f)(?<toDelete>\S+?%3D)
But I've been asked to see if there's a non lazy approach for this (or just a faster version), as it is much slower then greediness, and this regex will be called over huge strings and dataflow.
Note that the reason I cant simply put \S* is that doing so will capture in one match from the first http to the last %3D.
You might probably split the string with a comma and then get a substring up to the %3f value.
If you want to make the \S*? pattern work "faster" you must take into account what kind of context this part of a pattern should be aware of.
You are matching any char that is not a whitespace char, any amount of times, up to the first occurrence of %3f. That is, you want to match any chars other than % and whitespace or % chars that are not followed with 3f. That makes (?:[^\s%]|%(?!3f))*. However, alternation ruins the whole idea of optimization. You need to use the "unroll-the-loop" approach: [^%\s]*(?:%(?!3f)[^%\s]*)*.
So, the whole pattern will look like
https://[^%\s]*(?:%(?!3f)[^%\s]*)*%3f
Or with the Delete part:
(https://[^%\s]*(?:%(?!3f)[^%\s]*)*%3f)(?<toDelete>[^%\s]*(?:%(?!3D)[^%\s]*)*%3D)
For short strings, this last pattern might work a tiny bit slower than the \S+? based pattern, but it becomes much more efficient when the matched string becomes longer.

regex not matching when using ? if first character not present

Here is my c# regex:
\"([a-zA-Z0-9]*)\":\"?([a-zA-Z0-9]*)\"?,?}?
I am testing here with sample string:
{"RestrictedCompany": "","SQLServerIndex": 0,"SurveyAdmin": false}`
This is what I think the regex does:
PART 1: Look for the pattern of " ANYTHING ":
and store ANYTHING (without the quotes).
PART 2: Then look for a : and store everything until you reach a stop character of either " or , or }
It extracts part 1 fine, but doesnt pick up part 2 at all when the " isnt present (ie when part 2 isnt a string). So I have two questions:
Why isn't my current code picking up part 2? (and how can I fix it)
is there a way to make the ANYTHING match more flexible? (I tried using \S but it was too greedy)
First off, don't write your own JSON parser. Use one written by professionals. You're reinventing a rather complex wheel here.
That said, there are also lessons you could learn here about how to write, understand and debug regular expressions, so let's look at that.
Why isn't my current code picking up part 2? (and how can I fix it)
Learn to reason like the regular expression engine.
Let's take a simpler case. We'll take the expression
\"([a-zA-Z0-9]*)\":\"?([a-zA-Z0-9]*)\"?,?}?
And we will search this string:
{"A": "B"}
for an instance of the regular expression.
OK.
The { doesn't match anything, so skip it.
The first " matches \", so maybe we have a match.
A matches ([a-zA-Z0-9]*), so again, maybe we have a match.
The second " matches the second \", so we're still good.
The : matches :...
We now are trying to match \"?, zero or one quotes. We have , a space. We match zero quotes.
We are now trying to match ([a-zA-Z0-9]*), any number of alphanumerics. We have , a space. Therefore we have zero alphanumerics.
We are now trying to again match \"?, and again we have , so we match zero.
We are now trying to match ,?, we have zero of them.
We are now trying to match }?, again we have zero of them
And we're done. We've successfully matched the pattern, and the match is "A":.
Now keep on going; can we match anything in the rest of the string? No. The pattern requires a :, and there is no : in the rest of the string, so I won't labour the point; plainly the match will fail.
If that's not the pattern you wanted to match then write a different pattern. For example, if you want there to be arbitrary whitespace before and after the colon, you probably need a /s* before and after the colon. Also, if you require a value after the : then why did you make everything after the colon optional? "Required" and "optional" are opposites.
So what's the right thing to do here? Again, the right thing to do is to stop trying to solve this problem with regular expressions and use a json parser like a sensible person. But suppose we did want to parse this with regular expressions. How do we do it?
We do it by breaking the problem down into smaller parts.
What do we really want to match? Let's name each thing we want to match and then write a colon, and then say what the structure of that thing is:
DESIRED : NAME OPTIONAL_WHITESPACE COLON OPTIONAL_WHITESPACE VALUE
OK, break it down. What's a name?
NAME : QUOTE NAMECONTENTS QUOTE
Keep breaking it down.
NAMECONTENTS : any alphanumeric text of any length
Ask yourself is that true? Is an "" a NAME? Is "1234" a NAME? Is "$" a NAME? Refine the pattern until you get it right. We'll go with this for now.
Now here is a hard one:
VALUE : BOOLEAN_LITERAL
VALUE : NUMBER_LITERAL
VALUE : STRING_LITERAL
This can be any of three things. So again, keep breaking it down:
BOOLEAN_LITERAL : true
BOOLEAN_LITERAL : false
Keep going; you can see how to do it from here.
Now make a regular expression for each part and start putting it back together.
The regular expression for NAMECONTENTS is \w*.
The regular expression for QUOTE is \".
Therefore the regular expression for NAME is \"\w*\".
We want to capture the name text so put it in a group: \"(\w*)\"
Great. Similarly:
The regular expression for OPTIONAL_WHITESPACE is \s*.
The regular expression for COLON is :.
So our regular expression begins \"(\w*)\"\s:\s
Now we need to handle VALUE. But we've broken it down. What is the regular expression for BOOLEAN_LITERAL? That's [true|false].
Keep going; make a regular expression for the other literals and then build up your regular expression from the leaves to the root.

What is the correct RegEx to extract my substring

I have an input string like this
$(xx.xx.xx)abcde$(yyy.yyy.yyy)fghijk$(zzz.zz.zz.zzz)
I want to be able to pull out each subset of strings matching $(anything inside here), so for the example above I would like to get 3 substrings.
the characters in between the brackets do not necessarily always match the same pattern.
I have tried using the following regex
(\$\([a-z]+.*\))
but this matches whole string, due to the fact it starts with '$', anything in middle, and ends with ')'
Hopefully this makes sense.
I should also note that I have very limited experience using regex.
Thanks
(\$\([a-z]+.*?\))
Use ? to make your search non greedy.* is greedy and consumes the max it can.adding ? to * makes it non greedy and it will stop at the first instance of ).
See demo.
http://regex101.com/r/sU3fA2/28
try the below
\((.*?)\)\g
for the given string $(xx.xx.xx)abcde$(yyy.yyy.yyy)fghijk$(zzz.zz.zz.zzz) it returns the three substring..
MATCH 1
1. [2-10] `xx.xx.xx`
MATCH 2
1. [18-29] `yyy.yyy.yyy`
MATCH 3
1. [38-51] `zzz.zz.zz.zzz`
http://regex101.com/r/bX7qR2/1

Regex.Matches returns one match per line, not per "word"

I'm having a hard time understanding why the following expression \\[B.+\\] and code returns a Matches count of 1:
string r = "\\[B.+\\]";
return Regex.Matches(Markup, sRegEx);
I want to find all the instances (let's call them 'tags') (in a variable length HTML string Markup that contains no line breaks) that are prefixed by B and are enclosed in square brackets.
If the markup contains [BName], I get one match - good.
If the markup contains [BName] [BAddress], I get one match - why?
If the markup contains [BName][BAddress], I also only get one match.
On some web-based regex testers, I've noticed that if the text contains a CR character, I'll get a match per line - but I need some way to specify that I want matches returned independent of line breaks.
I've also poked around in the Groups and Captures collections of the MatchCollection, but to no avail - always just one result.
You are getting only one match because, by default, .NET regular expressions are "greedy"; they try to match as much as possible with a single match.
So if your value is [BName][BAddress] you will have one match - which will match the entire string; so it will match from the [B at the beginning all the way to the last ] - instead of the first one. If you want two matches, use this pattern instead: \\[B.+?\\]
The ? after the + tells the matching engine to match as little as possible... leaving the second group to be its own match.
Slaks also noted an excellent option; specifying specifically that you do not wish to match the ending ] as part of the content, like so: \\[B[^\\]]+\\] That keeps your match 'greedy', which might be useful in some other case. In this specific instance, there may not be much difference - but it's an important thing to keep in mind depending on what data/patterns you might be dealing with specifically.
On a side note, I recommend using the C# "literal string" specifier # for regular expression patterns, so that you do not need to double-escape things in regex patterns; So I would set the pattern like so:
string pattern = #"\[B.+?\]";
This makes it much easier to figure out regular expressions that are more complex
Try the regex string \\[B.+?\\] instead. .+ on it's own (same is pretty much true for .*) will match against as many characters as possible, whereas .+? (or .*?) will match against the bare minimum number of characters whilst still satisfying the rest of the expression.
.+ is a greedy match; it will match as much as possible.
In your second example, it matches BName] [BAddress.
You should write \[B[^\]]+\].
[^\]] matches every character except ], so it is forced to stop before the first ].

How Can I Check If a C# Regular Expression Is Trying to Match 1-(and-only-1)-Character Strings?

Maybe this is a very rare (or even dumb) question, but I do need it in my app.
How can I check if a C# regular expression is trying to match 1-character strings?
That means, I only allow the users to search 1-character strings. If the user is trying to search multi-character strings, an error message will be displaying to the users.
Did I make myself clear?
Thanks.
Peter
P.S.: I saw an answer about calculating the final matched strings' length, but for some unknown reason, the answer is gone.
I thought it for a while, I think calculating the final matched strings length is okay, though it's gonna be kind of slow.
Yet, the original question is very rare and tedious.
a regexp would be .{1}
This will allow any char though. if you only want alpanumeric then you can use [a-z0-9]{1} or shorthand /w{1}
Another option its to limit the number of chars a user can type in an input field. set a maxlength on it.
Yet another option is to save the forms input field to a char and not a string although you may need some handling around this to prevent errors.
Why not use maxlength and save to a char.
You can look for unescaped *, +, {}, ? etc. and count the number of characters (don't forget to flatten the [] as one character).
Basically you have to parse your regex.
Instead of validating the regular expression, which could be complicated, you could apply it only on single characters instead of the whole string.
If this is not possible, you may want to limit the possibilities of regular expression to some certain features. For instance the user can only enter characters to match or characters to exclude. Then you build up the regex in your code.
eg:
ABC matches [ABC]
^ABC matches [^ABC]
A-Z matches [A-Z]
# matches [0-9]
\w matches \w
AB#x-z matches [AB]|[0-9]|[x-z]|\w
which cases do you need to support?
This would be somewhat easy to parse and validate.

Categories

Resources