I'm using an HTML sanitizing whitelist code found here:
http://refactormycode.com/codes/333-sanitize-html
I needed to add the "font" tag as an additional tag to match, so I tried adding this condition after the <img tag check
if (tagname.StartsWith("<font"))
{
// detailed <font> tag checking
// Non-escaped expression (for testing in a Regex editor app)
// ^<font(\s*size="\d{1}")?(\s*color="((#[0-9a-f]{6})|(#[0-9a-f]{3})|red|green|blue|black|white)")?(\s*face="(Arial|Courier New|Garamond|Georgia|Tahoma|Verdana)")?\s*?>$
if (!IsMatch(tagname, @"<font
(\s*size=""\d{1}"")?
(\s*color=""((#[0-9a-f]{6})|(#[0-9a-f]{3})|red|green|blue|black|white)"")?
(\s*face=""(Arial|Courier New|Garamond|Georgia|Tahoma|Verdana)"")?
\s*?>"))
{
html = html.Remove(tag.Index, tag.Length);
}
}
Aside from the condition above, my code is almost identical to the code on the page I linked to. When I test this in C#, it throws an exception saying "Not enough )'s". I've counted the parentheses several times and run the expression through a few online JavaScript-based regex testers, and none of them report any problems.
Am I missing something in my Regex that is causing a parenthesis to escape? What do I need to do to fix this?
UPDATE
After a lot of trial and error, I remembered that the # sign is a comment in regexes. The key to fixing this is to escape the # character. In case anyone else comes across the same problem, I've included my fix (just escaping the # sign)
if (tagname.StartsWith("<font"))
{
// detailed <font> tag checking
// Non-escaped expression (for testing in a Regex editor app)
// ^<font(\s*size="\d{1}")?(\s*color="((#[0-9a-f]{6})|(#[0-9a-f]{3})|red|green|blue|black|white)")?(\s*face="(Arial|Courier New|Garamond|Georgia|Tahoma|Verdana)")?\s*?>$
if (!IsMatch(tagname, @"<font
(\s*size=""\d{1}"")?
(\s*color=""((\#[0-9a-f]{6})|(\#[0-9a-f]{3})|red|green|blue|black|white)"")?
(\s*face=""(Arial|Courier\sNew|Garamond|Georgia|Tahoma|Verdana)"")?
\s*?>"))
{
html = html.Remove(tag.Index, tag.Length);
}
}
Your IsMatch method is using the option RegexOptions.IgnorePatternWhitespace, which allows you to put comments inside regular expressions, so you have to escape the # character; otherwise it will be interpreted as the start of a comment.
if (!IsMatch(tagname,@"<font(\s*size=""\d{1}"")?
(\s*color=""((\#[0-9a-f]{6})|(\#[0-9a-f]{3})|red|green|blue|black|white)"")?
(\s*face=""(Arial|Courier New|Garamond|Georgia|Tahoma|Verdana)"")?
\s?>"))
{
html = html.Remove(tag.Index, tag.Length);
}
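To see the behavior in isolation, here is a minimal sketch. The class and method names (HashCommentDemo, Compiles) are made up for illustration and are not part of the sanitizer code. It only checks whether a pattern can be constructed when RegexOptions.IgnorePatternWhitespace is set.

```csharp
using System;
using System.Text.RegularExpressions;

static class HashCommentDemo
{
    // Hypothetical helper: returns true when the pattern can be constructed
    // with IgnorePatternWhitespace, false when the regex parser rejects it.
    public static bool Compiles(string pattern)
    {
        try
        {
            new Regex(pattern, RegexOptions.IgnorePatternWhitespace);
            return true;
        }
        catch (ArgumentException)
        {
            // With an unescaped #, everything after it is treated as a
            // comment, so the opening parenthesis is never closed.
            return false;
        }
    }
}
```

Under this option, Compiles(@"(#[0-9a-f]{3})") should return false, while Compiles(@"(\#[0-9a-f]{3})") should return true.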
I don't see anything obviously wrong with the regex. I would try isolating the problem by removing pieces of the regex until the problem goes away and then focus on the part that causes the issue.
It works fine for me... what version of the .NET framework are you using, and what is the exact exception?
Also - what does your IsMatch method look like? Is this just a pass-thru to Regex.IsMatch?
[update] The problem is that the OP's example code didn't show they are using the IgnorePatternWhitespace regex option; with this option it doesn't work; without this option (i.e. as presented) the code is fine.
Download Chris Sells' Regex Designer. It's a great free tool for testing .NET regexes.
I'm not sure this regex is going to do what you want, because it depends on the attributes appearing in the same order as in the regex. If, for example, face="Arial" preceded size="5", then face= wouldn't match.
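One possible way around the ordering issue is to put the attributes in a single repeated alternation. This is only a sketch: the FontTagCheck class and IsAllowed method are hypothetical names, and unlike the original pattern it also accepts the same attribute more than once, which may or may not be acceptable for a whitelist.

```csharp
using System;
using System.Text.RegularExpressions;

static class FontTagCheck
{
    // Each whitelisted attribute is one alternative, so order no longer matters.
    // IgnorePatternWhitespace is on, so # is escaped and the space in
    // "Courier New" is written as \s.
    static readonly Regex FontTag = new Regex(@"
        ^<font
        (?:\s+(?:
            size=""\d""
          | color=""(?:\#[0-9a-f]{6}|\#[0-9a-f]{3}|red|green|blue|black|white)""
          | face=""(?:Arial|Courier\sNew|Garamond|Georgia|Tahoma|Verdana)""
        ))*
        \s*>$",
        RegexOptions.IgnorePatternWhitespace | RegexOptions.IgnoreCase);

    // Hypothetical helper: true when the tag uses only whitelisted attributes.
    public static bool IsAllowed(string tag)
    {
        return FontTag.IsMatch(tag);
    }
}
```

With this sketch, both <font size="5" face="Arial"> and <font face="Arial" size="5"> are accepted, while a tag carrying any other attribute is rejected.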
There are some escaping problems in your regex. You need to escape your " with \. You need to escape your # with \. You need to use \s in Courier New instead of just the space. You need to use the RegexOptions.IgnorePatternWhitespace and RegexOptions.IgnoreCase options.
<font
(\s+size=\"\d{1}\")?
(\s+color=\"((\#[0-9a-f]{6})|(\#[0-9a-f]{3})|red|green|blue|black|white)\")?
(\s+face=\"(Arial|Courier\sNew|Garamond|Georgia|Tahoma|Verdana)\")?
The # characters are what was causing the exception with the somewhat misleading missing ) message.
Related
Hi, I am trying to create an app to help me swap the test contents of my catch blocks for production contents. I am able to read through my file and parse the contents, but I am having problems creating a regex (I am still brand new to this) to identify the try-catch block, so that I can either delete or change the contents of the catch block. Is anyone able to help me solve this problem?
So far I have the expression below (it does not work at all):
try{*}catch(*){*)
Thanks in advance.
You can't write a regex that does this, because regex can't be used to match nested patterns. Which means that it won't be able to identify when your closing brace occurs vs other nested braces in your code. You will need a parser generator such as ANTLR like the linked answer suggests to accomplish this.
I'd suggest taking a look at Microsoft's Roslyn compiler, which is under development. Its APIs should probably allow you to accomplish whatever it is you're looking to do. It is currently in preview.
I think this would drive you to a solution:
try\s*\{[^{}]*([^{}]*\{[^{}]*\}[^{}]*)*[^{}]*\}\s*catch\s*\([^()]*(\([^()]*\))*\)\s*\{[^{}]*([^{}]*\{[^{}]*\}[^{}]*)*[^{}]*\}
try\s* matches try followed by zero or more spaces.
\{[^{}]*([^{}]*\{[^{}]*\}[^{}]*)*[^{}]*\} matches a block ({ followed by zero or more characters except { and } followed by zero or more blocks with any number of characters preceding and succeeding each)
\s*catch\s*\([^()]*(\([^()]*\))*\) matches zero or more spaces preceding and following catch, then something inside brackets ()
\s*\{[^{}]*([^{}]*\{[^{}]*\}[^{}]*)*[^{}]*\} similar to try block
Note: May fail in case of ones with comments containing {s or }s
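As a side note for .NET specifically: its regex engine has balancing groups, which can track brace depth beyond the fixed nesting level handled by the pattern above. The following is a rough sketch, not a replacement for a real parser. The TryCatchFinder class and FindFirst method are hypothetical names, and the sketch still breaks on braces inside comments or string literals.

```csharp
using System;
using System.Text.RegularExpressions;

static class TryCatchFinder
{
    // One brace-delimited block with arbitrary nesting depth.
    // (?<open>\{) pushes on every inner '{', (?<-open>\}) pops on every inner '}',
    // and (?(open)(?!)) fails the match if anything is left unclosed.
    const string Block =
        @"\{(?>[^{}]+|(?<open>\{)|(?<-open>\}))*(?(open)(?!))\}";

    static readonly Regex TryCatch = new Regex(
        @"try\s*" + Block + @"\s*catch\s*(?:\([^()]*\))?\s*" + Block);

    // Hypothetical helper: returns the first try/catch statement found, or null.
    public static string FindFirst(string source)
    {
        Match m = TryCatch.Match(source);
        return m.Success ? m.Value : null;
    }
}
```

The catch block's body could then be replaced by capturing the second block in its own named group; that step is left out here to keep the sketch short.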
I'm currently working on a little project that uses regex to parse a template. The big problem is that we also have a "tag" for includes, which makes it kind of difficult.
Regex reg = new Regex(@"##############TEMPLATEENGINE(^#)##############(.*?)##############TEMPLATEENGINE(\1)##############", RegexOptions.IgnoreCase | RegexOptions.Compiled);
works fine on templates like
########TEMPLATEENGINE$$startswith$$account:firstname$$Firstn##############
blah
########TEMPLATEENGINE$$startswith$$account:firstname$$Firstn
########TEMPLATEENGINEaccount:firstname############## Attribute missing: firstname! ##############TEMPLATEENGINEaccount:firstname
but as soon as i have a template like
########TEMPLATEENGINE$$startswith$$account:firstname$$Firstn##############
blah
########TEMPLATEENGINEaccount:firstname############## Attribute missing: firstname! ##############TEMPLATEENGINEaccount:firstname
########TEMPLATEENGINE$$startswith$$account:firstname$$Firstn##############
blahg
it finds only the inner template, although I think the \1 should make sure that the start and end tags are equal.
I had a go at getting this pattern to work:
########TEMPLATEENGINE([^#]+)##############(((?!########TEMPLATEENGINE(?!\1)).)*)########TEMPLATEENGINE\1##############
But no luck so far and I need to move on - I've posted it anyway in case it helps, as I think the technique is sound. I will delete it later (or fix it, if I can!) if nobody makes use of it, so please don't vote it down - I'm not claiming it's a complete solution!
Note that you'd need to iterate the match expression, capturing the inner tags first.
Also note that the end tag for ########TEMPLATEENGINEaccount:firstname############## seems to be in a different format in the example (missing its # suffix) - is that a problem?
I have a source to a web page and I need to extract the body. So anything between </head><body> and </body></html>.
I've tried the following with no success:
var match = Regex.Match(output, @"(?<=\</head\>\<body\>)(.*?)(?=\</body\>\</html\>)");
It finds a string but cuts it off long before </body></html>. I escaped characters based on the RegEx cheat sheet.
What am I missing?
I'd recommend using the HtmlAgilityPack instead - parsing HTML with regular expressions is very, very fragile.
The latest version even supports Linq so you can get your content like this:
HtmlWeb web = new HtmlWeb();
HtmlDocument doc = web.Load("http://stackoverflow.com");
string html = doc.DocumentNode.Descendants("body").Single().InnerHtml;
Regex is not meant for this kind of HTML handling, as many here would say. Without your sample web page / HTML, all I can suggest is removing the non-greedy ? quantifier in (.*?) and trying again. After all, an HTML page will have only one head and one body.
Though regexes are definitely not the best tool for this task, there are a few suggestions and points I would like to make:
un-escape the angle brackets - with the @ before your string, the backslashes are passed through to the regex, and angle brackets do not need to be escaped in a .NET regex
with your regex, you need to make sure that the head/body tag combinations do not have any white-space between them.
with your regex, the body tag cannot have any attributes.
I would suggest something more like:
(?<=</head>\s*<body(\s[^>]*)?>)(.*?)(?=</body>\s*</html>)
this seems to work for me on the source of this page!
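For reference, here is roughly how that pattern could be used from C#. This is a sketch with assumptions: the BodyExtractor class and Extract method are hypothetical names, and RegexOptions.Singleline and RegexOptions.IgnoreCase are additions that are not in the pattern as posted (Singleline lets . match newlines, which a real page body will contain).

```csharp
using System;
using System.Text.RegularExpressions;

static class BodyExtractor
{
    // .NET allows variable-length lookbehind, so \s* and the optional
    // attribute list inside (?<=...) are accepted.
    static readonly Regex Body = new Regex(
        @"(?<=</head>\s*<body(\s[^>]*)?>)(.*?)(?=</body>\s*</html>)",
        RegexOptions.Singleline | RegexOptions.IgnoreCase);

    // Hypothetical helper: returns the text between the body tags, or null.
    public static string Extract(string html)
    {
        Match m = Body.Match(html);
        // The lookarounds consume nothing, so the whole match is the body content.
        return m.Success ? m.Value : null;
    }
}
```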
As the others have said, the correct way to handle this is with an HTML-specific tool. I just want to point out some problems with that cheat-sheet.
First, it's wrong about angle brackets: you do not need to escape them. In fact, it's wrong twice: it also says \< and \> match word boundaries, which is both incorrect for .NET, and incompatible with the advice about escaping angle brackets.
That cheat-sheet is just a random collection of regex syntax elements; most of them will work in most flavors, but many are guaranteed not to work in your particular flavor, whatever it happens to be. I recommend you disregard it and rely instead on .NET-specific documents or Regular-Expressions.info. The books Mastering Regular Expressions and Regular Expressions Cookbook are both excellent, too.
As for your regex, I don't see how it could behave the way you say it does. If it were going to fail, I would expect it to fail completely. Does your HTML document contain a CDATA section or SGML comment with </body></html> inside it? Or is it really two or more HTML documents run together?
It would be great if someone could provide me the Regular expression for the following string.
Sample 1: <div>abc</div><br>
Sample 2: <div>abc</div></div></div></div></div><br>
As you can see in the samples above, I need to match the string no matter how many </div> tags occur. If any other string occurs between </div> and <br>, like <div>abc</div></div></div>DEF</div></div><br> or <div>abc</div></div></div></div></div>DEF<br>, then the regex should not match.
Thanks in advance.
Try this:
<div>([^<]+)(?:<\/div>)*<br>
As seen on rubular
Notes:
This only works if there are not tags in the abc part (or anything that has a < symbol).
You might want to use start and end of string anchors (^<div>([^<]+)(?:<\/div>)*<br>$) if you want your string to match the pattern exactly.
If you want to allow the abc part to be empty, use * instead of +
That being said, you should be wary of using regex to parse HTML.
In this example, you can use regex because you are parsing a (hopefully) known, regular subset of HTML. But a more robust solution (ie: an [X]HTML parser like HtmlAgilityPack) is preferred when it comes to parsing HTML.
You need to use a real parser. Things like infinitely nested tags can't be handled via regex.
You could also include a named group in the expression, e.g.:
<div>(?<text>[^<]*)(?:<\/div>)*<br>
Implemented in C#:
var regex = new Regex(@"<div>(?<text>[^<]*)(?:<\/div>)*<br>");
Func<Match, string> getGroupText = m => (m.Success && m.Groups["text"] != null) ? m.Groups["text"].Value : null;
Func<string, string> getText = s => getGroupText(regex.Match(s));
Console.WriteLine(getText("<div>abc</div><br>"));
Console.WriteLine(getText("<div>123</div></div></div></div></div><br>"));
NullUserException's answer is good. Here are a couple of questions, and variations, depending on what you want.
Do you want to prevent anything from occurring before the open div tag? If so, keep the ^ at the beginning of the regex. If not, drop it.
The rest of this post refers to the following section of the regex:
([^<]+?)
Do you want to capture the contents of the div, or just know that it matches your form? To capture, leave it as is. If you don't need to capture, drop the parentheses from the above.
Do you want to match if there is nothing inside the div? If so change the + in the above to *
Finally, although it will work fine, you don't need the ? in the above.
I think this regex is more flexible:
<div\b[^><]*+>(?>.*?</div>)(?:\s*+</div>)*+\s*+<br(?:\s*+/)?>
I don't include the ^ and $ at the beginning and end of my regex because we cannot be sure that your sample will always be on a single line.
How can I write a regular expression to replace links with no link text like this:
with
http://www.somesite.com
?
This is what I was trying to do to capture the matches, and it isn't catching any. What am I doing wrong?
string pattern = "<a\\s+href\\s*=\\s*\"(?<href>.*)\">\\s*</a>";
I wouldn't use a regex - I'd use the Html Agility Pack, and a query like:
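// Note: SelectNodes returns null when nothing matches, so check before looping in real code.
foreach(HtmlNode link in doc.DocumentNode.SelectNodes("//a[.='']")) {
    // Use the href value as the link text (HTML-encode it if it may contain markup).
    link.InnerHtml = link.GetAttributeValue("href", "");
}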
I could be wrong, but I think you simply need to change the quantifier within the href group to be lazy rather than greedy.
string pattern = @"<a\s+href\s*=\s*""(?<href>.*?)"">\s*</a>";
(I've also changed the type of the string literal to use @, for better readability.)
The rest of the regex appears fine to me. That you're not capturing any matches at all makes me think otherwise, but there could be a problem in the rest of the code (or even the input data - have you verified that?).
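For completeness, here is a sketch of the replacement step using Regex.Replace. The EmptyLinkFiller class and Fill method are hypothetical names. One deliberate change from the pattern above: the href group uses [^"]* instead of .*?, because a lazy dot can still run across a preceding non-empty link until it reaches the first empty one.

```csharp
using System;
using System.Text.RegularExpressions;

static class EmptyLinkFiller
{
    // Matches an anchor whose only attribute is href and whose body is empty
    // or whitespace. [^""]* keeps the capture inside a single attribute value.
    static readonly Regex EmptyLink = new Regex(
        @"<a\s+href\s*=\s*""(?<href>[^""]*)"">\s*</a>",
        RegexOptions.IgnoreCase);

    // Hypothetical helper: inserts the href value as the link text.
    public static string Fill(string html)
    {
        return EmptyLink.Replace(html, "<a href=\"${href}\">${href}</a>");
    }
}
```

Links that already have text, or that carry other attributes, are left untouched by this sketch.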
I would suggest
string pattern = "(<a\\b[^>]*href=\"([^\"]+)\"[^>]*>)[\\s\\r\\n]*(</a>)";
This way also links with their href attribute somewhere else would be captured.
Replace with
"$1$2$3"
The usual word of warning: HTML and regex are essentially incompatible. Use with caution, this might blow up.
Marc Gravell has the right answer, regexes are fundamentally bad at parsing HTML (see Can you provide some examples of why it is hard to parse XML and HTML with a regex? for why). See Can you provide an example of parsing HTML with your favorite parser? for examples using a variety of parsers.