Regex to match comments in an INF file - c#

While designing a Regex to match comments in an INF file (its formal syntax definition is here), I ran into a problem I could not solve: a mismatch occurs when an opening left square bracket is wrongly paired with a double quote (dquote).
Here is a brief description of the comment syntax in the INF file:
Note ; A comment starts with a semicolon and continues until the EOL
A string begins with an odd dquote "and terminates at the next dquote" ...or at the EOL.
Unbalanced dquotes "are implicitly closed by the EOL .i.e. ;this is not a comment.
A semicolon ";inside a string" loses its power to initiate a comment.
...however, an even number of dquotes "" before the semicolon ;does not disable its power to
initiate a comment
A string can contain a dquote when such dquote is escaped with another dquote, in other words - "doubled dquote""."
Inside a string "the semicolon ; a backslash \ and square brackets ][ are treated just like regular characters"
;From the comment-initiating semicolon to the EOL, the other semicolons; the line continuation backslash, the dquotes """ and square brackets ][ are treated just like regular characters. \
A semicolon [inside;] square brackets does not initiate a comment; but outside - it does.
A left square bracket [ before the semicolon ; disables its comment initiating function; too.
Neither nested nor unbalanced square brackets occur in the input unless they are inside a string.
The line continuation sequence (backslash+newline) does not work:
inside ;comments\
between the left square bracket [ and EOL \
between an unbalanced dquote and EOL "\
between balanced dquotes"\
;nice try"
( unbalanced and nested square brackets do not occur in the input unless they are inside a string )
Here is my C# (.NET flavor) Regex to match the comments:
(?:^|(?<=\n))(?>(?:[^;""\[\n]*[""\[][^""\]\n]*[""\]])*)(?>[^;""\[\n;]*)(;.*?)(?:\n|$)
See its simulation here (click "List" in the left panel) to get a list of matches.
Applied to the text of the syntax description above, my Regex pattern produces the matches I expect (shown in green in the simulation).
However my Regex fails with the following input:
A semicolon [inside";] square brackets" does not initiate a comment; but outside - it does.
The expected match is at the second semicolon, but the Regex matches at the first semicolon because the left square bracket is wrongly paired with the first dquote.
QUESTION: How can this Regex be repaired so that only a right square bracket can pair with and close a left square bracket?

Related

Regex to match spaces within quotes only

I need to match any space within double quotes ONLY - not outside. I've tried a few things, but none of them work.
[^"]\s[^"] - matches spaces outside of quotes
[^"] [^"] - see above
\s+ - see above
For example I want to match "hello world" but not "helloworld" and not hello world (without quotes). I will specifically be using this regex inside of Visual Studio via the Find feature.
With the .NET and PCRE regex engines, you can use the \G anchor, which matches at the position where the previous match ended or at the start of the string, to build a pattern that returns contiguous occurrences from the start of the string:
((?:\G(?!\A)|\A[^"]*")[^\s"]*)\s([^\s"]*"[^"]*")?
example for a replacement with #: demo
pattern details:
(                       # capture group 1
    (?:                 # two possible beginnings
        \G(?!\A)        # contiguous to a previous match
      |                 # OR
        \A[^"]*"        # start of the string and reach the first quote
    )                   # at this point you are sure to be inside quotes
    [^\s"]*             # all that isn't a white-space or a quote
)
\s                      # the white-space
([^\s"]*"[^"]*")?       # optional capture group 2: useful for the last quoted white-space
                        # since it reaches an eventual next quoted part.
Note: with the .NET regex engine you can also use a lookbehind to test whether the number of quotes before a space is odd or even, but that approach isn't efficient. (The same goes for a lookahead that checks the remaining quotes up to the end; in addition, that approach may give wrong results if the quotes aren't balanced.)
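JavaScript's engine has no \G anchor, so the pattern above cannot be ported to it directly. As a rough sketch of a different technique that gives the same result when the quotes are balanced, the snippet below first matches each quoted segment and then replaces whitespace only inside it. The helper name markQuotedSpaces and the sample input are made up for illustration.

```javascript
// Hypothetical helper (not from the answer above): replace every whitespace
// character that sits inside a pair of double quotes with `mark`.
// Assumes the quotes are balanced and never escaped.
function markQuotedSpaces(text, mark) {
  // Outer replace finds each "..." segment; inner replace touches only
  // the whitespace inside that segment.
  return text.replace(/"[^"]*"/g, (quoted) => quoted.replace(/\s/g, () => mark));
}

console.log(markQuotedSpaces('say "hello world" and "a b c" now', '#'));
// prints: say "hello#world" and "a#b#c" now
```

Text outside the quotes is never passed to the inner replace, so unquoted spaces stay untouched.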

.Net Regex - last of repeating characters

I'm trying to capture everything inside curly braces, but in some cases there may be multiple braces and I want the outermost ones.
For example: I want to capture {{this}} part
I'll need {{this}} as the capture.
So I went with ({[^}]+}+) to capture the inner text, but of course this will yield multiple captures {{this} and {{this}}.
So I tried telling the regex to match the phrase only if the next character is not a curly brace: ({[^}]+}+)[^}]. This works unless the capture is at the end of the input, in which case it fails because it expects a non-} character at the end.
So I tried adding an end-of-string option, ({[^}]+}+)[$|^}], but for some reason this captures {{this} again. I have no idea why; it should only capture if the next character is the end of input or not a curly brace...
Suggestions?
Edit:
Just to be clear, I'm not searching for valid nested parenthesis, only for text between { and the first matching } (no nesting!), however there may be cases where instead of one open/close brace there are two (so {something} and {{something}} both need to be caught).
The reason for this, is that the original text always has double braces {{ }}, but sometimes before the regex the text undergoes string.Format, in which case the double braces become single braces.
Generally, regex is not powerful enough to do this. However, the .NET regex engine supports balancing group definitions, which let you match groups with balanced delimiters (the pattern below uses them inside an atomic group):
{(?>{(?<DEPTH>)|}(?<-DEPTH>)|[^}]+)*}(?(DEPTH)(?!))
If you want to match all text between braces, I think this should do the trick:
{+.*?}+
This matches everything between braces, taking all consecutive braces and as few internal characters as possible.
Further explanation: matches 1 or more { ({+), then any amount of any character (.*) but gives you the shortest string that does it (?), and finally matches 1+ } (}+). Without that ?, if you have {a} {b} it would match the whole thing instead of {a} and {b} separately.
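To make the lazy-versus-greedy point concrete, here is a small JavaScript sketch (the same quantifier semantics apply in .NET). The braces are escaped for clarity, and the variable names are illustrative only.

```javascript
// Lazy vs. greedy body between one or more opening and closing braces.
const twoGroups = '{a} {b}';

const lazyMatches = twoGroups.match(/\{+.*?\}+/g);  // shortest possible body
const greedyMatches = twoGroups.match(/\{+.*\}+/g); // longest possible body

console.log(lazyMatches);   // [ '{a}', '{b}' ]
console.log(greedyMatches); // [ '{a} {b}' ]

// Doubled braces are consumed as a unit by {+ and }+.
console.log('{{this}}'.match(/\{+.*?\}+/g)); // [ '{{this}}' ]
```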
If you don't want spaces between the braces, you can use this:
{+\S*?}+
If you only want letters, use \w instead of \S.
The only thing this is not validating is that the same amount of braces are used. Do you need that?
Result comparison (should be a comment).
Considering {{{{{{this}}}}}Blabla, I get this:
Regex author: c0d3rman
Matched string: {{{{{{this}}}}}B
Groups: 2 ({{{{{{this}}}}}B and {{{{{{this}}}}})
Captures: {{{{{{this}}}}}
Regex author: dasblinkenlight
Matched string: {{{{{this}}}}}
Groups: 2 ({{{{{this}}}}} and {})
Captures: {{{{{this}}}}}
Note: symmetric braces
Regex author: Andrew
Matched string: {{{{{{this}}}}}
Groups: {{{{{{this}}}}}
Captures: {{{{{{this}}}}}
You seem to have used a character class at the end instead of a non-capturing group. Try:
({[^}]+}+)(?:$|[^}])
This is a very small modification to your final attempt that just uses the correct syntax. Your final attempt has [$|^}]. The issue is that you can't have an alternation | inside a character class []. Most special characters lose their special meaning inside a character class, with a couple of exceptions, one of which is ^ when it is the first character. So [$|^}] means any one of the four literal characters $, |, ^, or }. I changed the syntax to what you intended by using a non-capturing group, (?:stuff); this group does not save its contents and is purely for grouping. As such, (?:$|[^}]) means end of input or a non-} character, as you wanted.
Note that this makes no effort to balance the curly braces (match the number of braces at the beginning and end).
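The difference between the character class and the non-capturing group can be checked with a short JavaScript sketch. These constructs behave the same way in .NET; the braces are escaped here and the sample strings are made up.

```javascript
// Broken attempt: [$|^}] is a character class of the four literals $ | ^ }.
const brokenBraces = /(\{[^}]+\}+)[$|^}]/;
// Fixed: a non-capturing group meaning "end of input OR one non-} character".
const fixedBraces = /(\{[^}]+\}+)(?:$|[^}])/;

const atEnd = 'I want {{this}}';
const inMiddle = 'I want to capture {{this}} part';

// The class must consume a real character, so the engine gives back one }.
console.log(brokenBraces.exec(atEnd)[1]);   // {{this}
console.log(fixedBraces.exec(atEnd)[1]);    // {{this}}
console.log(fixedBraces.exec(inMiddle)[1]); // {{this}}
```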

When multi-line text pasted into text input regex does not match the space

When user pastes something like this (from notepad for example):
multi
line@email.com
into input text box, the line break disappears and it looks like this:
multi
line@email.com
But whatever the line break is converted to does not match this regex:
'\s|\t|\r|\n|\0','i'
so this invalid character passes through js validation to the .NET application code I am working on.
Interestingly, this text editor does the same transformation, which is why I had to post the original sample as code. I would like to find out what the line break got converted to so I can add a literal to the regex, but I don't know how. Many thanks!
Here is the whole snippet:
var invalidChars = new RegExp('(^[.])|[<]|[>]|[(]|[)]|[\]|[,]|[;]|[:]|([.])[.]|\s|\t|\r|\n|\0', 'i');
if (text.match(invalidChars)) {
    return false;
}
Your immediate problem is escaping. You're using a string literal to create the regex, like this:
'(^[.])|[<]|[>]|[(]|[)]|[\]|[,]|[;]|[:]|([.])[.]|\s|\t|\r|\n|\0'
But before it ever reaches the RegExp constructor, the [\] becomes []; \s becomes s; and \t, \r, \n and \0 are converted to the characters they represent (tab, carriage return, linefeed and NUL, respectively). That won't happen if you use a regex literal instead, but you still have to escape the backslash to match a literal backslash.
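The string-literal layer is easy to observe directly. This is a minimal sketch; the variable names are illustrative only.

```javascript
// In a string literal, \s is not a recognised escape, so it collapses to "s"
// before the RegExp constructor ever sees it.
const fromString = new RegExp('\s');         // pattern is just: s
const fromEscapedString = new RegExp('\\s'); // backslash survives: \s
const fromLiteral = /\s/;                    // no string layer involved

console.log('\s' === 's');                // true
console.log('[\]' === '[]');              // true
console.log(fromString.test(' '));        // false: only matches the letter s
console.log(fromString.test('s'));        // true
console.log(fromEscapedString.test(' ')); // true
console.log(fromLiteral.test('\t'));      // true
```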
Your regex also has way more brackets than it needs. I think this is what you were trying for:
/^\.|\.\.|[<>()\\,;:\s]/
That matches a dot at the beginning, two consecutive dots, or one of several forbidden characters including any whitespace character (\s matches any whitespace character, not just a space).
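As a quick check of the simplified pattern, here is a sketch that wraps it in a validation function. The name isValidInput and the sample addresses are assumptions for illustration, not part of the original code.

```javascript
// The simplified pattern from the answer, as a regex literal.
const invalidCharsFixed = /^\.|\.\.|[<>()\\,;:\s]/;

// Hypothetical wrapper: input is valid when no forbidden construct occurs.
function isValidInput(text) {
  return !invalidCharsFixed.test(text);
}

console.log(isValidInput('user@example.com'));          // true
console.log(isValidInput('multi\r\nline@example.com')); // false: pasted line break
console.log(isValidInput('.user@example.com'));         // false: leading dot
console.log(isValidInput('a..b@example.com'));          // false: consecutive dots
```

Because \s covers carriage return and linefeed, a pasted CR+LF pair is rejected without listing \r and \n separately.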
Ok - here it is
vbCrLF
This is what pasted line breaks are converted to. I added (vbCrLF) group and those spaces are now detected. Thanks, Dan1M
http://forums.asp.net/t/1183613.aspx?Multiline+Textbox+Input+not+showing+line+breaks+in+Repeater+Control

Trying to understand this regex

I have this regex
^(\\w|@|\\-| |\\[|\\]|\\.)+$
I'm trying to understand what it does exactly but I can't seem to get any result...
I just can't understand the double backslashes everywhere... Isn't double backslash supposed to be used to get a single backslash?
This regex is to validate that a username doesn't use weird characters and stuff.
If someone could explain the double-backslash thing to me, please. @_@
Additional info: I got this regex in C# using Regex.IsMatch to check if my username string match the regex. It's for an asp website.
My guess is that it's simply escaping the \ since backslash is the escape character in c#.
string pattern = "^(\\w|@|\\-| |\\[|\\]|\\.)+$";
Can be rewritten using a verbatim string as
string pattern = @"^(\w|@|\-| |\[|\]|\.)+$";
Now it's a bit easier to understand what's going on. It will match any word character, at-sign, hyphen, space, square bracket or period, repeated one or more times. The ^ and $ match the beginning and end of the string, respectively, so only those characters are allowed.
Therefore this pattern is equivalent to:
string pattern = @"^([\w@ \[\].-])+$";
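The claimed equivalence can be spot-checked with a small JavaScript sketch. A regular JavaScript string needs the same doubled backslashes as a regular C# string, and a regex literal plays the role of the verbatim string. The literal symbol is assumed to be the at-sign named in the explanation, and the sample usernames are made up.

```javascript
// Alternation form and character-class form of the same check.
const altForm = /^(\w|@|\-| |\[|\]|\.)+$/;
const classForm = /^([\w@ \[\].-])+$/;

// Built from a regular string: every backslash has to be doubled.
const fromDoubled = new RegExp('^(\\w|@|\\-| |\\[|\\]|\\.)+$');
console.log(fromDoubled.source === altForm.source); // true

const sampleNames = ['john.doe', 'user@name [admin]', 'a-b', 'bad!name', 'semi;colon', ''];
for (const name of sampleNames) {
  // Both forms should agree on every sample.
  console.log(JSON.stringify(name), altForm.test(name), classForm.test(name));
}
```

Only the first three samples pass; the empty string fails because + requires at least one character.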
Double backslashes stand for a single backslash. The backslash has to be escaped itself because it introduces other escape sequences in a C# string, e.g. \n stands for a new line.
With the double backslashes sorted out, it becomes ^(\w|@|\-| |\[|\]|\.)+$
Breaking down this regex: | means OR, so \w|@|\-| |\[|\]|\. means \w or @ or \- or space or \[ or \] or \.. That is, any word character, @, -, space, [, ] or . character. Note that these backslashes are regex escapes for the -, [, ] and . characters, which have special meanings in regex context.
And + means the previous token (i.e. the group (\w|@|\-| |\[|\]|\.)) repeated one or more times.
So the entire thing means one or more of any combination of word characters, @, -, space, [, ] and . characters.
There are online tools to analyze regexes. One such tool is at http://www.myezapp.com/apps/dev/regexp/show.ws
where it reports
Sequence: match all of the followings in order
  BeginOfLine
  Repeat
    CapturingGroup
    GroupNumber:1
      OR: match either of the followings
        WordCharacter
        @
        -
        [
        ]
        .
    one or more times
  EndOfLine
As others have noted, the double backslashes just escape a backslash so you can embed the regex in a string. For example, "\\w" will be interpreted as "\w" by the parser.
^ means the beginning of the line.
the parentheses are used for grouping
\w is a word character
| means OR
@ matches the @ character
\- matches the hyphen character
\[ and \] match the square brackets
\. matches a period
+ means one or more
$ means the end of the line.
So the regex is used to match a string which contains only word characters, an @, a hyphen, a space, square brackets or a dot.
Here's what it means:
^(\\w|@|\\-| |\\[|\\]|\\.)+$
^ - Means the regex starts at the beginning of the string. The match shouldn't start in the middle of the string.
Here's the individual things in the parentheses:
\\w - Indicates a "word" character. Normally, this is shown as \w, but here the backslash is escaped.
@ - Indicates an @ symbol is allowed
\\- - Indicates a - is allowed. This is escaped since the dash can have other meanings in regex. Since it's not in a character class, I don't believe this is technically needed.
(space) - A space is allowed
\\[ and \\] - [ and ] are allowed.
\\. - A period is a valid character. Escaped because periods have special meanings in regex.
Now all of those characters have | as delimiters in the parentheses - this means OR. So any of those characters are valid.
The + at the end means one or more characters as described in parentheses are valid. The $ means the end of the regex must match the end of the string.
Note that the double slashes aren't necessary if you just prefix the string like this:
@"\w" is the same as "\\w"

regex to fetch string between [a] and [/a] excluding any other tag like [b][/b] that comes in between

I have an input like the following
[a href=http://twitter.com/suddentwilight][font][b][i]@suddentwilight[/font][/a] My POV: Rakhi Sawant hits below the belt & does anything for attention... [a href=http://twitter.com/mallikaLA][b]http://www.test.com[/b][/a] has maintained the grace/decency :)
Now I need to get the strings @suddentwilight and http://www.test.com that come inside the anchor tags. There might be some [b] or [i] like tags wrapping the actual text; I need to ignore those.
Basically I need a match that starts at [a] and then get the string/URL before the closing [/a] tag.
Please suggest.
I don't know C#, but here's a regex:
/\[a\s+[^\]]*\](?:\[[^\]]+\])*(.*?)(?:\[[^\]]+\])*\[\/a\]/
This will match [a ...][tag1][tag2][...][tagN]text[/tagN]...[tag2][tag1][/a] and capture text.
To explain:
the /.../ are common regex delimiters (like double quotes for strings). C# may just use strings to initialize regexes - in which case the forward slashes aren't necessary.
\[ and \] match a literal [ and ] character. We need to escape them with a backslash since square brackets have a special meaning in regexes.
[^\]] is an example of a character class - here meaning any character that is not a close square bracket. The square brackets delimit the character class, the caret (^) denotes negation, and the escaped close square bracket is the character being negated.
* and + are suffixes meaning match 0 or more and 1 or more of the previous pattern, respectively. So [^\]]* means match 0 or more of anything except a close square bracket.
\s is a shorthand for the character class of whitespace characters
(?:...) allows you to group the contents into a single unit without capturing it.
(...) groups like (?:...) does, but also saves the substring that this portion of the regex matches into a variable. This is normally called a capture, since it captures this portion of the string for you to use later. Here, we are using a capture to grab the linktext.
. matches any single character.
*? is a suffix for non-greedy matching. Normally, the * suffix is greedy, and matches as much as it can while still allowing the rest of the pattern to match something. *? is the opposite - it matches as little as it can while still allowing the rest of the pattern to match something. The reason we use *? here instead of * is so that if we have multiple [/a]s on a line, we only go as far as the next one when matching link text.
This will only remove [tag]s that come at the beginning and end of the text. To remove any that come in the middle of the text (like [a href=""]a [b]big[/b] frog[/a]), you'll need to do a second pass on the capture from the first, scrubbing out any text that matches:
/\[[^\]]+\]/
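Putting the two passes together, a JavaScript sketch might look like the following. The function name extractLinkTexts and the sample markup are invented for illustration, and String.prototype.matchAll needs a reasonably recent runtime.

```javascript
// First pass: capture the text between [a ...] and [/a], skipping wrapper
// tags at either end. Second pass: scrub any tags left inside the capture.
const anchorPattern = /\[a\s+[^\]]*\](?:\[[^\]]+\])*(.*?)(?:\[[^\]]+\])*\[\/a\]/g;
const tagPattern = /\[[^\]]+\]/g;

// Hypothetical helper combining both passes.
function extractLinkTexts(input) {
  return Array.from(input.matchAll(anchorPattern), (m) => m[1].replace(tagPattern, ''));
}

const sampleMarkup =
  '[a href=http://example.com/one][b][i]first link[/i][/b][/a] some text ' +
  '[a href=http://example.com/two]a [b]big[/b] frog[/a]';

console.log(extractLinkTexts(sampleMarkup)); // [ 'first link', 'a big frog' ]
```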
