I am writing a hand-coded CSS 2.1 parsing engine (in C#), and I'm working directly off the W3C CSS 2.1 grammar (http://www.w3.org/TR/CSS21/grammar.html). However, there's a token that I just don't quite get:
url ([!#$%&*-~]|{nonascii}|{escape})*
...
"url("{w}{url}{w}")" {return URI;}
"url("{w}{string}{w}")" {return URI;}
I don't get what the URL production is supposed to do. It appears to be a string of only !#$%&*-~, non-ascii, or escaped unicode code points. How is that a URL? Is this production just really badly named, and what purpose is it supposed to serve?
Any help appreciated. FYI, I've added the C# tag only to increase the audience to actual programmers who might have encountered this or have insights - I apologize if you think I shouldn't apply.
Dude, did you read the CONTEXT surrounding that expression?
baduri1 url\({w}([!#$%&*-\[\]-~]|{nonascii}|{escape})*{w}
baduri2 url\({w}{string}{w}
baduri3 url\({w}{badstring}
Hmmm... Bad, bad, bad. Bit of a giveaway, eh what? Generally, if something in the doco doesn't make sense to you, or appears just plain wrong, maybe it shouldn't make sense? Yes? So you read around it... to acquire the correct context.
[!#$%&*-~] breaks down to:
!, #, $, %, &, plus the character range * - ~.
This takes in most printable ASCII characters, including uppercase, lowercase, digits and a range of punctuation characters.
It's easier to list the printable ASCII characters which this regex doesn't match:
Double quote ", single quote ', and parentheses (, ); i.e. printable ASCII characters minus the delimiters. This makes it possible to parse URLs that are not wrapped in quotation marks, e.g. url(http://example.com) instead of url("http://example.com").
Concise, but tricky!
P.S. The token name is confusing as well. A better name would have been something like: url_string or url_arg.
EDIT Feb 2015 The latest CSS3 Syntax Spec names the token url-unquoted
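The character-class breakdown above can be checked mechanically. Here is a minimal Python sketch (Python is used only for illustration; this character class behaves the same way in .NET) that lists every ASCII character from space to ~ that [!#$%&*-~] does not match. The names URL_CHAR and excluded are made up for this example.

```python
import re

# The character class from the CSS 2.1 "url" macro, taken on its own.
URL_CHAR = re.compile(r'[!#$%&*-~]')

# Walk ASCII 0x20 (space) through 0x7E (~) and keep what the class rejects.
excluded = [chr(c) for c in range(0x20, 0x7F) if not URL_CHAR.fullmatch(chr(c))]

print(excluded)  # [' ', '"', "'", '(', ')']
```

Besides the four delimiters, the space character is also rejected, which is why whitespace inside url( ... ) is handled by the separate {w} macro.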
I don't get what the URL production is supposed to do. It appears to be a string of only !#$%&*-~, non-ascii, or escaped unicode code points. How is that a URL? Is this production just really badly named, and what purpose is it supposed to serve?
The first line defines url as a regular expression:
url ([!#$%&*-~]|{nonascii}|{escape})*
The second line defines URI as a token which can be produced/returned by the lexer:
"url("{w}{url}{w}")" {return URI;}
The second line says that if the lexer sees url( then {w} then {url} then {w} then ) then it has found a URI.
The {w} expression is optional whitespace.
So according to the definition, {url} is a regular expression that defines which characters are allowed inside a URI token, between the initial url( and the final ).
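To make the macro expansion concrete, here is a rough Python sketch of the unquoted URI rule. It is a simplification under stated assumptions, not the grammar itself: NONASCII is approximated as any code point above U+009F, ESCAPE as a backslash followed by any character that is not a newline or form feed, and the names W, URL, URI and match_uri are invented for the example.

```python
import re

W = r'[ \t\r\n\f]*'            # {w}: optional whitespace
NONASCII = r'[^\x00-\x9f]'     # assumption: simplified {nonascii}
ESCAPE = r'\\[^\r\n\f]'        # assumption: simplified {escape}
URL = rf'(?:[!#$%&*-~]|{NONASCII}|{ESCAPE})*'   # the {url} macro

# "url("{w}{url}{w}")" with the url body captured in group 1.
URI = re.compile(rf'url\({W}({URL}){W}\)')

def match_uri(text):
    """Return the unquoted URL body if text starts with a URI token, else None."""
    m = URI.match(text)
    return m.group(1) if m else None

print(match_uri('url( http://example.com/a.png )'))  # http://example.com/a.png
print(match_uri('url(a b)'))                         # None
```

A quoted argument such as url("x") is rejected by this rule on purpose; in the grammar it is picked up by the second rule, the one that uses {string}.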
I'm trying to check whether the URL and regex below match:
Target Url: /challenge/getAllChallenges?type=public
Regex: "/challenge/getAllChallenges([/?]+)"
But it seems the regex above allows any character to appear after "getAllChallenges".
How do I allow only '?' as the first character to appear after "getAllChallenge"?
Ideally, both of the URLs below should be validated as a match by a single regex:
/challenge/getAllChallenges
/challenge/getAllChallenges?type=public
but the ones below should not be valid:
/challenge/getAllChallenge/blah
/challenge/getAllChallengeblah
Something like
/challenge/getAllChallenges(\?[^?]+)?$
It's "/challenge/getAllChallenges" followed by zero or one of: (question mark followed by one or more of anything other than question mark)
Your original regex required "/challenge/getAllChallenges" followed by one or more of: (forward slash, question mark)
Most special characters lose their meaning inside [] character classes and hence do not need escaping there, if that is what you were trying to do by putting a slash before the ?. If so, the slash was also the wrong direction: a backslash escapes, not a forward slash.
It is important to include the end-of-input anchor $ so that a partial match is not reported as a success.
I support the notion raised in the comment: generally, using a class dedicated to parsing and manipulating a kind of value will give better results than a regex-based solution. For example, the regex I gave above follows from your required matches: either no ?, or ? followed by something. A URL that simply ends with a question mark is perfectly valid, though, and the regex would need tweaking to allow it.
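As a quick check of the pattern against the cases from the question, here is a small Python sketch (the pattern syntax is the same in .NET; the helper name is_match is made up). It also shows the trailing-question-mark limitation mentioned above.

```python
import re

# Base path, then optionally: '?' followed by one or more non-'?' characters.
PATTERN = re.compile(r'/challenge/getAllChallenges(\?[^?]+)?$')

def is_match(path):
    # re.match anchors at the start; the trailing $ anchors at the end.
    return PATTERN.match(path) is not None

print(is_match('/challenge/getAllChallenges'))              # True
print(is_match('/challenge/getAllChallenges?type=public'))  # True
print(is_match('/challenge/getAllChallenge/blah'))          # False
print(is_match('/challenge/getAllChallengeblah'))           # False
print(is_match('/challenge/getAllChallenges?'))             # False (bare '?')
```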
I am building a parser, and I would like to remove comments from various lines. For example,
variable = "some//thing" ////actual comment
Comment marker is //. In this case, variable would contain "some//thing" and everything else would be ignored. I plan to do it using regex replace. Currently I am using (".*"|[ \t])*(\/\/.*) as regex. However replacing it replaces "some//thing" ////actual comment entirely.
I can not figure out the regex which I should use instead. Thanks for any help.
Additional info - I am using C# with netcoreapp 1.1.0
Edit - some cases might be of a line with just comment like //line comment. Strings also might contain escaped quotes.
Here is the ugly regex pattern. I believe it will work well. I have tried it with every pathological example I can think of, including lines that contain syntax errors. For example, a quoted string that has too many quotes, or too few, or has a double escaped quote, which is, therefore, not escaped. And with quoted strings in the comments, which I have been known to do when I want to remind myself of alternatives.
The only time that it trips up is if there is a double slash inside a seemingly quoted string and somehow that string is malformed and the double slash ends up legally outside the properly quoted portion. Syntactically that makes it a valid comment, even though not the programmer's intention. So, from the programmer's perspective it's wrong, but by the rules, it's really a comment. Meaning, the pattern only appears to trip up.
When used, the pattern will return the non-comment portion of the line(s). The pattern has a newline \n in it to allow for applying it to an entire file. You may need to modify that if your system interprets newlines in some other fashion, for example as \r or \r\n. To use it in single-line mode you can remove that if you choose. It is at characters 17 and 18 in the one-liner and is on the fifth line, 6th and 7th printing characters, in the multi-line version. You can safely leave it there, however: in single-line mode it makes no difference, and in multi-line mode it will return a newline for lines of code that are either blank or have a comment beginning in the first column. That will keep the line numbers the same in the original version and the stripped version if you write the results to a new file. Makes comparison easy.
One major caveat for this pattern: it uses a grouping construct that has varying levels of support in regex engines. I believe that, as used here with a lookaround, only the .NET and PCRE engines will accept it; YMMV. It is a ternary-style conditional: (?(_condition_)_then_|_else_). The _condition_ pattern is treated as a zero-width assertion. If it matches, the _then_ pattern is used in the attempted match; otherwise the _else_ pattern is used. Without that construct, the pattern was growing to uncommon lengths, and was still failing on some of my pathological test cases.
The pattern presented here is as it needs to be seen by the regex engine. I am not a C# programmer, so I don't know all the nuances of escaping quoted strings. Getting this pattern into your code, such that all the backslashes and quotes are seen properly by the regex engine is still up to you. Maybe C# has the equivalent of Perl's heredoc syntax.
This is the one-liner pattern to use:
^((?:(?:(?:[^"'/\n]|/(?!/))*)(?("(?=(?:\\\\|\\"|[^"])*"))(?:"(?:\\\\|\\"|[^"])*")|(?('(?=(?:\\\\|\\'|[^'])*'))(?:'(?:\\\\|\\'|[^'])*')|(?(/)|.))))*)
If you want to use the ignore pattern whitespace option, you can use this version:
(?x) # Turn on the ignore white space option
^( # Start the only capturing group
(?: # A non-capturing group to allow for repeating the logic
(?: # Capture either of the two options below
[^"'/\n] # Capture everything not a single quote, double quote, a slash, or a newline
| # OR
/(?!/) # Capture a slash not followed by a slash [slash and negative look-ahead slash]
)* # As many times as possible, even if none
(?(" # Start a conditional match for double-quoted strings
(?=(?:\\\\|\\"|[^"])*") # Followed by a properly closed double-quoted string
) # Then
(?:"(?:\\\\|\\"|[^"])*") # Capture the whole double-quoted string
| # Otherwise
(?(' # Start a conditional match for single-quoted strings
(?=(?:\\\\|\\'|[^'])*') # Followed by a properly closed single-quoted string
) # Then
(?:'(?:\\\\|\\'|[^'])*') # Capture the whole single-quoted string
| # Otherwise
(?([^/]) # If next character is not a slash
.) # Capture that character, it is either a single quote, or a double quote not part of a properly closed string
) # end the conditional match for single-quoted strings
) # End the conditional match for double-quoted strings
)* # Close the repeating non-capturing group, capturing as many times as possible, even if none
) # Close the only capturing group
This allows for your code to explain this monstrosity so that when someone else looks at it, or in a few months you have to work on it yourself, there's no WTF moment. I think the comments explain it well, but feel free to change them any way you please.
As mentioned above, the conditional match grouping has limited support. One place it will fail is on the site you linked to in an earlier comment. Since you're using C#, I chose to do my testing in the .NET Regex Tester, which can handle those constructs. It includes a nice Reference too. Given the proper selections on the side, you can test either version above, and experiment with it as well. Considering its complexity, I would recommend testing it, somewhere, against data from your files, as well as any edge cases and pathological tests you can dream up.
Just to redeem this small pattern, there is a much bigger pattern for testing email address that is 78 columns by 81 lines, with a couple dozen characters to spare. (Which I do not recommend using, or any other regex, for testing email addresses. Wrong tool for the job.) If you want to scare yourself, have a peek at it on the ex-parrot site. I had nothing to do with that!!
"[^"\\]*(?:\\[\W\w][^"\\]*)*"|(\/\/.*)
Flags: global
Matches full strings or a comment.
Group 1: comment.
So if there's no comment, replace with the same matching text. Otherwise, do your thing on the comment itself.
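The "replace with the same matching text" idea can be sketched with a replacement callback. This is a minimal Python illustration, not the answerer's own code; in C# the equivalent would be a MatchEvaluator passed to Regex.Replace. The names PATTERN, strip_comment and keep_strings are made up, and trimming trailing whitespace is an extra assumption of this sketch.

```python
import re

# Either a complete double-quoted string (escapes allowed) or, in group 1, a // comment.
PATTERN = re.compile(r'"[^"\\]*(?:\\[\W\w][^"\\]*)*"|(//.*)')

def strip_comment(line):
    def keep_strings(m):
        # Group 1 is set only when the comment alternative matched.
        return "" if m.group(1) is not None else m.group(0)
    # Assumption: whitespace left in front of the removed comment is trimmed too.
    return PATTERN.sub(keep_strings, line).rstrip()

print(strip_comment('variable = "some//thing" ////actual comment'))
# variable = "some//thing"
```

Because the string alternative is tried first at each position, a // inside a properly closed string is consumed as part of the string and never reaches the comment alternative.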
I have added a regular expression from a site to verify a user name. It should work, but it gives an error at compile time (please see the image). I googled and learned that some sequences like '\w' are not going to work because JS does not support them. Now I don't know how to convert it. Can anyone please help me convert this so that it works with ASP.NET MVC data annotations?
[RegularExpression("^([a-zA-Z])[a-zA-Z_-]*[\w_-]*[\S]$|^([a-zA-Z])[0-9_-]*[\S]$|^[a-zA-Z]*[\S]$")]
Thank you all in advance.
Make your string a verbatim string literal by adding the @ sign before the opening quote. Otherwise you would need to escape all the backslashes that the string contains, which would make a regular expression like this even less readable.
[RegularExpression(@"^([a-zA-Z])[a-zA-Z_-]*[\w_-]*[\S]$|^([a-zA-Z])[0-9_-]*[\S]$|^[a-zA-Z]*[\S]$")]
A literal string enables you to use special characters such as a
backslash or double-quotes without having to use special codes or
escape characters. This makes literal strings ideal for file paths
that naturally contain many backslashes. To create a literal string,
add the at-sign @ before the string’s opening quote
I have a URL pattern that needs to contain either APPLES or ORANGES in it, no other value. Optionally, it can also have query parameters. I've tried a number of RegEx patterns, but I just can't get a pattern that will respect the strict match.
Sample URLs
Good
http://www.website.com/en/pages/APPLES
http://www.website.com/en/pages/APPLES?k=v
http://www.website.com/en/pages/ORANGES?k=v&k2=v2
http://www.website.com/en/pages/ORANGES
Bad
http://www.website.com/en/pages/APPLES???k=v
http://www.website.com/en/pages/APPLES?k=v=v
http://www.website.com/en/pages/APPLESORANGES
http://www.website.com/en/pages/1APPLES
http://www.website.com/en/APPLES
Attempted RegEx Patterns (well, at least the best attempts)
(http://*.*.website*.*.com/*.*/pages(/APPLES)|(/ORANGES)[\?]*.*)
(http://*.*.website*.*.com/*.*/pages(/APPLES|/ORANGES)[\?]*.*)
If you're curious, I intentionally want to allow any sub-domain, suffix after "website" (for different environments), and any path between .com/ and /pages, hence the use of . in a number of places.
What would be the best way to achieve this?
**Edit: Final Answer**
My final answer was merged from mathematical.coffee and fardjad.
^https?://.*\.website\b.*\.com/.*/pages/(APPLES\b|ORANGES\b)((\?\w+=\w+)(&?\w+=\w+)*)?$
The single limitation I've discovered is that it will not allow a few valid characters (.~_-%+) in the query string parameter key=value pairs (see: http://en.wikipedia.org/wiki/Query_string#Structure). This isn't an issue for me as I'm matching against a string returned from .NET's Uri class, so I know the URL is well-formed overall.
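For anyone who wants to re-check the final pattern against the sample URLs from the question, here is a small Python sketch (the pattern uses only syntax that .NET shares; the helper name is_valid is made up for the example).

```python
import re

PATTERN = re.compile(
    r'^https?://.*\.website\b.*\.com/.*/pages/'
    r'(APPLES\b|ORANGES\b)((\?\w+=\w+)(&?\w+=\w+)*)?$'
)

def is_valid(url):
    return PATTERN.match(url) is not None

good = [
    'http://www.website.com/en/pages/APPLES',
    'http://www.website.com/en/pages/APPLES?k=v',
    'http://www.website.com/en/pages/ORANGES?k=v&k2=v2',
    'http://www.website.com/en/pages/ORANGES',
]
bad = [
    'http://www.website.com/en/pages/APPLES???k=v',
    'http://www.website.com/en/pages/APPLES?k=v=v',
    'http://www.website.com/en/pages/APPLESORANGES',
    'http://www.website.com/en/pages/1APPLES',
    'http://www.website.com/en/APPLES',
]

print(all(is_valid(u) for u in good))      # True
print(any(is_valid(u) for u in bad))       # False
```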
I think the *.* should be .*:
http://.*\.website\b.*\.com/.*/pages/PAGE[12](\?[^=]+=[^&=]+(&[^=]+=[^=&]+)*)?
Explanation:
http:// # just http://
.*\. # any thing, just make sure it's followed by '.'
website\b # website, the whole word
.*\.com # anything between website and .com
/.*/pages/ # anything between the .com and the pages
PAGE[12] # PAGE1 or PAGE2
(\? # opening bracket and '?' (query string)
[^=]+ # the key: i've said it can't include =
= # =
[^=&]+ # the value: i've said it can't include = or &
(& # opening bracket and '&' for next part of query string
[^=]+=[^=&]+ # key=value pair, same regex as before
)* # 0 or more of these (the &key=value)
)? # the entire query string is optional.
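The annotated breakdown above maps almost directly onto a "verbose" regex, where whitespace is ignored and # starts a comment. Here is a Python sketch of that idea using re.VERBOSE (in .NET the comparable option is RegexOptions.IgnorePatternWhitespace); the names PATTERN and matches are made up, and fullmatch is used so that the whole URL must match.

```python
import re

PATTERN = re.compile(r"""
    http://          # just http://
    .*\.             # anything, as long as it is followed by '.'
    website\b        # website, as a whole word
    .*\.com          # anything between website and .com
    /.*/pages/       # anything between .com and /pages/
    PAGE[12]         # PAGE1 or PAGE2
    (\?              # start of the optional query string
      [^=]+          # key: may not contain '='
      =
      [^=&]+         # value: may not contain '=' or '&'
      (&             # further pairs, each introduced by '&'
        [^=]+=[^=&]+
      )*             # zero or more of them
    )?               # the whole query string is optional
""", re.VERBOSE)

def matches(url):
    return PATTERN.fullmatch(url) is not None

print(matches('http://www.website.com/en/pages/PAGE1?k=v'))   # True
print(matches('http://www.website.com/en/pages/PAGE3'))       # False
```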
NOTE - there are usually problems parsing query strings with regex and making sure the query string is syntactically valid.
For example, in the regex I supplied above, I've said that the value in &key=value can't have an ampersand in it. But it could contain an escaped entity, like &amp;, which is legal.
You'll always suffer from this sort of problem when you try to parse syntax with regex. It's a risk you'll have to take.
Alternatively, I am sure there is a C# class for parsing URLs (many other languages have one), and it would take care of all these special cases for you.
Try this:
^https?://(www\.)?\w+[^/]+(/\w+(?=/)){2}/(PAGE1|PAGE2)((\?\w+=\w+)(&?\w+=\w+)*)?$
I need to use regex to search through an html file and replace href="pagename" with href="pages/pagename"
Also the href could be formatted like HREF = 'pagename'
I do not want to replace any hrefs (in upper or lower case) that begin with http, ftp, mailto, javascript, or #.
I am using C# to develop this little app.
HTML manipulation through Regex is not recommended since HTML is not a "regular language." I'd highly recommend using the HTML Agility Pack instead. That gives you a DOM interface for HTML.
I have not tested with many cases, but for this case it worked:
var str = "href='page' href = 'www.goo' href='http://' href='ftp://'";
Console.WriteLine(Regex.Replace(str, @"href ?= ?(('|"")([a-z0-9_#.-]+)('|""))", "x", RegexOptions.IgnoreCase));
Result:
"x x href='http://' href='ftp://'"
You'd better keep backup files before running this :P
There are lots of caveats when using a find/replace with HTML and XML. The problem is, there are many variations of syntax which are permitted. (and many which are not permitted but still work!)
But, you seem to want something like this:
search for
([Hh][Rr][Ee][Ff]\s*=\s*['"])(\w+)(['"])
This means:
[Hh]: any of the items in square-brackets, followed by
\s*: any number of whitespaces (maybe zero),
=
\s* any more whitespaces,
['"] either quote type,
\w+: a word (without any slashes or dots - if you want to include .html then use [.\w]+ instead ),
and ['"]: another quote of any kind.
replace with
$1pages/$2$3
Which means the things in the first bracket, then pages/, then the stuff in the second and third sets of brackets.
You will need to put the first string in @" quotes (a verbatim string), and also escape the double quotes as "".
Note that it won't do anything even vaguely intelligent, like making sure the quotes match. Warning: try never to use the "any character" symbol (.) in this kind of regex, as it will grab large sections of text, up to and including the next quotation mark, possibly up to the end of the file!
see a regex tutorial for more info, e.g. http://www.regular-expressions.info/dotnet.html
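As a rough illustration of the search-and-replace described above, here is a Python sketch. It is not the C# code itself: Python writes group references as \g<1> where .NET uses $1, and the case-insensitive flag stands in for the [Hh][Rr][Ee][Ff] spelling. The names HREF and prefix_pages are made up for the example.

```python
import re

# Group 1: href plus opening quote, group 2: the page name, group 3: closing quote.
HREF = re.compile(r"""(href\s*=\s*['"])(\w+)(['"])""", re.IGNORECASE)

def prefix_pages(html):
    # \w+ cannot match ':', '/', '#' or '.', so http://, mailto:, #anchors
    # and names such as about.html are left untouched.
    return HREF.sub(r"\g<1>pages/\g<2>\g<3>", html)

print(prefix_pages('<a href="about">'))               # <a href="pages/about">
print(prefix_pages("<a HREF = 'contact'>"))           # <a HREF = 'pages/contact'>
print(prefix_pages('<a href="http://example.com">'))  # unchanged
```

As the answer warns, this does not check that the opening and closing quotes are of the same kind; an HTML parser is the more robust route.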