Removing comments using regex - c#

I am building a parser, and I would like to remove comments from various lines. For example,
variable = "some//thing" ////actual comment
Comment marker is //. In this case, variable would contain "some//thing" and everything else would be ignored. I plan to do it using regex replace. Currently I am using (".*"|[ \t])*(\/\/.*) as regex. However replacing it replaces "some//thing" ////actual comment entirely.
I can not figure out the regex which I should use instead. Thanks for any help.
Additional info - I am using C# with netcoreapp 1.1.0
Edit - some cases might be of a line with just comment like //line comment. Strings also might contain escaped quotes.

Here is the ugly regex pattern. I believe it will work well. I have tried it with every pathological example I can think of, including lines that contain syntax errors. For example, a quoted string that has too many quotes, or too few, or has a double escaped quote, which is, therefore, not escaped. And with quoted strings in the comments, which I have been known to do when I want to remind myself of alternatives.
The only time that it trips up is if there is a double slash inside a seemingly quoted string and somehow that string is malformed and the double slash ends up legally outside the properly quoted portion. Syntactically that makes it a valid comment, even though not the programmer's intention. So, from the programmer's perspective it's wrong, but by the rules, it's really a comment. Meaning, the pattern only appears to trip up.
When used the pattern will return the non-comment portion of the line(s). The pattern has a newline \n in it to allow for applying it to an entire file. You may need to modify that if you system interprets newlines in some other fashion, for example as \r or \r\n. To use it in single line mode you can remove that if you choose. It is at characters 17 and 18 in the one-liner and is on the fifth line, 6th and 7th printing characters in the multi-line version. You can safely leave it there, however, as in single-line mode it makes no difference, and in multi-line mode it will return a newline for lines of code that are either blank, or have a comment beginning in the first column. That will keep the line numbers the same in the original version and the stipped version if you write the results to a new file. Makes comparison easy.
One major caveat for this pattern: It uses a grouping construct that has varying level of support in regex engines. I believe as used here, with a lookaround, it's only the .NET and PCRE engines that will accept it YMMV. It is a tertiary type: (?(_condition_)_then_|_else_). The _condition_ pattern is treated as a zero-width assertion. If the pattern matches, then the _then_ pattern is used in the attempted match, otherwise the _else_ pattern is used. Without that construct, the pattern was growing to uncommon lengths, and was still failing on some of my pathological test cases.
The pattern presented here is as it needs to be seen by the regex engine. I am not a C# programmer, so I don't know all the nuances of escaping quoted strings. Getting this pattern into your code, such that all the backslashes and quotes are seen properly by the regex engine is still up to you. Maybe C# has the equivalent of Perl's heredoc syntax.
This is the one-liner pattern to use:
^((?:(?:(?:[^"'/\n]|/(?!/))*)(?("(?=(?:\\\\|\\"|[^"])*"))(?:"(?:\\\\|\\"|[^"])*")|(?('(?=(?:\\\\|\\'|[^'])*'))(?:'(?:\\\\|\\'|[^'])*')|(?(/)|.))))*)
If you want to use the ignore pattern whitespace option, you can use this version:
(?x) # Turn on the ignore white space option
^( # Start the only capturing group
(?: # A non-capturing group to allow for repeating the logic
(?: # Capture either of the two options below
[^"'/\n] # Capture everything not a single quote, double quote, a slash, or a newline
| # OR
/(?!/) # Capture a slash not followed by a slash [slash an negative look-ahead slash]
)* # As many times as possible, even if none
(?(" # Start a conditional match for double-quoted strings
(?=(?:\\\\|\\"|[^"])*") # Followed by a properly closed double-quoted string
) # Then
(?:"(?:\\\\|\\"|[^"])*") # Capture the whole double-quoted string
| # Otherwise
(?(' # Start a conditional match for single-quoted strings
(?=(?:\\\\|\\'|[^'])*') # Followed by a properly closed single-quoted string
) # Then
(?:'(?:\\\\|\\'|[^'])*') # Capture the whole double-quoted string
| # Otherwise
(?([^/]) # If next character is not a slash
.) # Capture that character, it is either a single quote, or a double quote not part of a properly closed
) # end the conditional match for single-quoted strings
) # End the conditional match for double-quoted strings
)* # Close the repeating non-capturing group, capturing as many times as possible, even if none
) # Close the only capturing group
This allows for your code to explain this monstrosity so that when someone else looks at it, or in a few months you have to work on it yourself, there's no WTF moment. I think the comments explain it well, but feel free to change them any way you please.
As mentioned above, the conditional match grouping has limited support. One place it will fail is on the site you linked to in an earlier comment. Since you're using C#, I choose to do my testing in the .NET Regex Tester, which can handle those constructs. It includes a nice Reference too. Given the proper selections on the side, you can test either version above, and experiment with it as well. Considering its complexity, I would recommend testing it, somewhere, against data from your files, as well as any edge cases and pathological tests you can dream up.
Just to redeem this small pattern, there is a much bigger pattern for testing email address that is 78 columns by 81 lines, with a couple dozen characters to spare. (Which I do not recommend using, or any other regex, for testing email addresses. Wrong tool for the job.) If you want to scare yourself, have a peek at it on the ex-parrot site. I had nothing to do with that!!

"[^"\\]*(?:\\[\W\w][^"\\]*)*"|(\/\/.*)
Flags: global
Matches full strings or a comment.
Group 1: comment.
So if there's no comment, replace with the same matching text. Otherwise, do your thing on the comment itself.

Related

Regex.IsMatch gives true but http://www.regexr.com/ gives false

I'm trying to check if the next string is match to this pattern in this code:
string str = "CRSSA.T,";
var pattern = #"((\w+\.{1}\w+)+(,\w+\.{1}\w+)*)";
Console.WriteLine(Regex.IsMatch(str, pattern));
the site: http://www.regexr.com/ says it's not match(everything match, except the last comma), but that code prints True. is it possible?
thanks ahead! :)
First of all, sure it can happen that different regex engines disagree, either because the capabilities differ or the interpretation, e.g. Java's String.matches method explicitly requires the whole string to match, not just a substring.
In your case, though, both regexr and .NET say it matches, because the substring CRSSA.T will match. Your third group, containing the comma, has a * quantifier, i.e. it can be matched zero or more times. In this case it's being matched zero times, but that's okay. It's still a match.
If you want the whole string to match, and no substrings whatsoever, then you need to add anchors to your regex:
^((\w+\.{1}\w+)+(,\w+\.{1}\w+)*)$
Furthermore, {1} is a useless quantifier, you can just leave it out. Also, if you have a capturing group around the whole regex, you can leave that out as well, as it's already in capturing group 0 automatically. So a bit simplified you could use:
^(\w+\.\w+)+(,\w+\.\w+)*$
Also be careful with \w and \b. Those two features are closely linked (by the definition of \w and \W and are not always intuitive. E.g. they include the underscore and, depending on the regex engine, a lot more than just [A-Za-z_], e.g. in .NET \w also matches things like ä, µ, Ð, ª, or º. For those reasons I tend to be rather explicit when writing more robust regexes (i.e. those that are not just used for a quick one-off usage) and use things like [A-Za-z], \p{L}, (?=\P{L}|$), etc. instead of \w, \W and \b.

Performance and readability of RegEx using positive look ahead

I am validating following strings with regular expressions in C#:
[/1/2/]
[/1/2/];[/3/4/5/]
[/1/22/333/];[/1/];[/9999/]
Basically it's one or more group of square brackets separated by semi-colon (but not at the end). Each group consists out of one or more numbers seperated by slashes. There are no other characters allowed.
These are two alternatives:
^(\[\/(\d+\/)+\](;(?=\[)|$))+$
^(\[\/(\d+\/)+\];)*(\[\/(\d+\/)+\])$
The first version uses a positive look ahead and the second version duplicates part of the pattern.
Both RegEx-es seem to be ok, do what they should and aren't very nice to read. ;)
Does anybody have an idea for a better, faster and more easy to read solution? When I was playing around in regex101 I realized that the second version uses more steps, why?
At the same time I realized that it would be nice to count the steps used in a C#-RegEx. Is there any way to achieve this?
You can use 1 regex to validate all these strings:
^\[/(\d+/)+\](?:;\[/(\d+/)+\])*$
See regex demo
To make it easier to read, use a VERBOSE flag (inline (?x) or RegexOptions.IgnorePatternWhitespace):
var rx = #"(?x)^ # Start of string
\[/ # Literal `[/`
(\d+/)+ # 1 or more sequences of 1 or more digits followed by `/`
\] # Closing `]`
(?: # A non-capturing group start
; # a semi-colon delimiter
\[/(\d+/)+\] # Same as the first part of the regex
)* # 0 or more occurrences
$ # End of string
";
To test a .NET regex performance (not the number of steps), you can use a regexhero.net service. With the 3 sample strings above, my regex shows 217K iterations per second speed, which is more than either of your regexps.
There is nothing particularly wrong with the two options you suggest. They are not that complicated as regexes go, and they should be understandable enough, as long as you put an appropriate comment in your code.
In general, I think it is preferable to avoid look-arounds, unless they are necessary or greatly simplify the regex--they make it harder to figure out what is going on, since they add a non-linear element to the logic.
The relative performance of regexes this simple is not something to worry about, unless you are performing a huge number of operations or discover a performance problem with your code. Still, understanding the relative performance of different patterns may be instructive.

Difficulty finding where to insert "word exclusion" in a regex

I know the regex for excluding words, roughly anyway, It would be (!?wordToIgnore|wordToIgnore2|wordToIgnore3)
But I have an existing, complicated regex that I need to add this to, and I am a bit confused about how to go about that. I'm still pretty new to regex, and it took me a very long time to make this particular one, but I'm not sure where to insert it or how ...
The regex I have is ...
^(?!.*[ ]{2})(?!.*[']{2})(?!.*[-]{2})(?:[a-zA-Z0-9 \:/\p{L}'-]{1,64}$)$
This should only allow the person typing to insert between 1 and 64 letters that match that pattern, cannot start with a space, quote, double quote, special character, a dash, an escape character, etc, and only allows a-z both upper and lowercase, can include a space, ":", a dash, and a quote anywhere but the beginning.
But I want to forbid them from using certain words, so I have this list of words that I want to be forbidden, I just cannot figure out how to get that to fit into here.. I tried just pasting the whole .. "block" in, and that didn't work.
?!the|and|or|a|given|some|that|this|then|than
Has anyone encountered this before?
ciel, first off, congratulations for getting this far trying to build your regex rule. If you want to read something detailed about all kinds of exclusions, I suggest you have a look at Match (or replace) a pattern except in situations s1, s2, s3 etc
Next, in your particular situation, here is how we could approach your regex.
For consision, let's make all the negative lookarounds more compact, replacing them with a single (?!.*(?: |-|'){2})
In your character class, the \: just escapes the colon, needlessly so as : is enough. I assume you wanted to add a backslash character, and if so we need to use \\
\p{L} includes [a-zA-Z], so you can drop [a-zA-Z]. But are you sure you want to match all letters in any script? (Thai etc). If so, remember to set the u flag after the regex string.
For your "bad word exclusion" applying to the whole string, place it at the same position as the other lookarounds, i.e., at the head of the string, but using the .* as in your other exclusions: (?!.*(?:wordToIgnore|wordToIgnore2|wordToIgnore3)) It does not matter which lookahead comes first because lookarounds do not change your position in the string. For more on this, see Mastering Lookahead and Lookbehind
This gives us this glorious regex (I added the case-insensitive flag):
^(?i)(?!.*(?:wordToIgnore|wordToIgnore2|wordToIgnore3))(?!.*(?: |-|'){2})(?:[\\0-9 :/\p{L}'-]{1,64}$)$
Of course if you don't want unicode letters, replace \p{L} with a-z
Also, if you want to make sure that the wordToIgnore is a real word, as opposed to an embedded string (for instance you don't want cat but you are okay with catalog), add boundaries to the lookahead rule: (?!.*\b(?:wordToIgnore|wordToIgnore2|wordToIgnore3)\b)
use this:
^(?!.*(the|and|or|a|given|some|that|this|then|than))(?!.*[ ]{2})(?!.*[']{2})(?!.*[-]{2})(?:[a-zA-Z0-9 \:\p{L}'-]{1,64}$)$
see demo

Regex not capturing date

I have a regex that works fine currently. But now I want to add on to it to capture dates.
Current regex:
(?<GeneralHelp>^/help\s*)?
(?:/client:)
(?<Client>\w*)
(?:(?:\s*/(?<ClientHelp>help))*)*
(?:(?:\s*/)(?<Modules>createHistory)(?:(?:\s*/(?<ModuleHelp>help))*)*)*
I added to the end:
(?:(?:\s*/)(?<StartDate>^([0]?[1-9]|[1|2][0-9]|[3][0|1])[. -]([0]?[1-9]|[1][0-2])[. -]([0-9]{4}|[0-9]{2})$))*)*
(?:(?:\s*/)(?<EndDate>^([0]?[1-9]|[1|2][0-9]|[3][0|1])[. -]([0]?[1-9]|[1][0-2])[. -]([0-9]{4}|[0-9]{2})$))*)*
Using the below example, it just won't get the dates, but it does match everything else.
/client:testClient/createHistory/11-11-2013/11.11.2013
This regex is used to break up the Main one string in the string array parameter from a console app. No one on my team in "fluent" in regex, nor do we have time to become fluent. We work with what we can and this addition is something I thought of today that may have with bigger problems what we have with our project and we are running low on time. So any help would be appreciated.
First, the ^ in your regex means "start of string", that is you only want to match a date at the start of the string (which is not true for you). So remove it. Same with "$" which means "end of string".
Secondly, [0|1] means "match characters 0, | or 1". You probably want [01] meaning "match characters 0 or 1".
Thirdly, you have an extra closing bracket with an unmatched opening bracket in both your regexes.
Fourthly as a general style point, [0] is the same as 0 so the square brackets are redundant here.
So your (not quite!) "fixed" regex is:
(?:(?:\s*/)(?<StartDate>(0?[1-9]|[12][0-9]|[3][01])[. -](0?[1-9]|1[0-2])[. -]([0-9]{4}|[0-9]{2})))*
(?:(?:\s*/)(?<EndDate>(0?[1-9]|[12][0-9]|3[01])[. -](0?[1-9]|1[0-2])[. -]([0-9]{4}|[0-9]{2})))*
However, this will not match your test string because of the extra "/testModule" in the string which is not in your working regex anywhere.
You could modify your original regex to allow extra slashes in between the two parts of regex?
<original regex>
(?:/[^/]+)* # <-- for the /testModule and any other similar tokens that appear in between
<date regex>
Also as a general point
you have a few occurences (?:(?:regex)*)*. I am not sure what the point is of doubling the outer * besides making the regex parser work much harder than it should for no good reason (the outer (?: )* is redundant here).
there is no point doing (?:/\s*) as you are not doing anything with the brackets, so just do /\s*
same with things like (?:/client:). Why have non-capturing brackets if you are not doing anything with them. /client: will do.
(?:regex)* means "match 0 to infinity occurences of regex". With things like (?:\s*/(?<ClientHelp>help))*, do you really expect this to occur infinitely many times in your string, or will it appear just once or not at all? Consider replacing * with ? which means "match 0 or 1 occurences" (if you know that that token will appear either once or not at all), or replace it with (say) {0, 100} if you know that that token will appear at most 100 times (and at least 0 times). This can improve performance.
So I recommend changing your regex like this:
(?<GeneralHelp>^/help\s*)?
/client:
(?<Client>\w*)
(?:\s*/(?<ClientHelp>help))*
(?:\s*/(?<Modules>createHistory)(?:\s*/(?<ModuleHelp>help))*)*
(?:/[^/]+)*
(?:\s*/(?<StartDate>(0?[1-9]|[12][0-9]|[3][01])[. -](0?[1-9]|1[0-2])[. -]([0-9]{4}|[0-9]{2})))*
(?:\s*/(?<EndDate>(0?[1-9]|[12][0-9]|3[01])[. -](0?[1-9]|1[0-2])[. -]([0-9]{4}|[0-9]{2})))*
You can fiddle around with your regex at regexr where I've created an example with your regex/test string. (Edit: the < and > in the regex seem to have been changed to < and > in regexr so the link won't work unless you copy/paste the regex I've written directly)
If you're sure these two last fields are dates, you could simply add something like
(?<StartDate>(?:\d+[. -]?){3})/(?<EndDate>.*)$
(or even (?<StartDate>[^/]+)/(?<EndDate>.+)$ if your cases are all in the same pattern and it fits your needs).
Also as already pointed out by mathematical.coffee, the first regex can be improved.

I have two problems, one of them is a regex

I am updating some code that I didn't write and part of it is a regex as follows:
\[url(?:\s*)\]www\.(.*?)\[/url(?:\s*)\]
I understand that .*? does a non-greedy match of everything in the second register.
What does ?:\s* in the first and third registers do?
Update: As requested, language is C# on .NET 3.5
The syntax (?:) is a way of putting parentheses around a subexpression without separately extracting that part of the string.
The author wanted to match the (.*?) part in the middle, and didn't want the spaces at the beginning or the end from getting in the way. Now you can use \1 or $1 (or whatever the appropriate method is in your particular language) to refer to the domain name, instead of the first chunk of spaces at the beginning of the string
?: makes the parentheses non-grouping. In that regex, you'll only pull out one piece of information, $1, which contains the middle (.*?) expression.
What does ?:\s* in the first and third registers do?
It's matching zero or more whitespace characters, without capturing them.
The regex author intends to allow trailing whitespace in the square-bracket-tags, matching all DNS labels following the "www." like so:
[url]www.foo.com[/url] # foo.com
[url ]www.foo.com[/url ] # same
[url ]www.foo.com[/url] # same
[url]www.foo.com[/url ] # same
Note that the regex also matches:
[url]www.[/url] # empty string!
and fails to match
[url]stackoverflow.com[/url] # no match, bummer
You may find this Regular Expressions Cheat Sheet very helpful (hopefully). I spent ages trying to learn Regex with no luck. And once I read this cheat-sheet - I immediately understood what I previously failed to learn.
http://krijnhoetmer.nl/stuff/regex/cheat-sheet/

Categories

Resources