Regex match URL if not in html comment line

I want to match "https://www.mysite/embed/M7znk1c-ay0" only if it is not inside an HTML comment.
So don't match this line:
<!--<p><iframe src="https://www.mysite/embed/M7znk1c-ay0" width="854" height="480" frameborder="0" allowfullscreen="allowfullscreen"></iframe>-->
but match this line
<article class="art-post"><div class="art-postcontent clearfix"><div class="art-article"><p><iframe src="https://www.mysite/embed/M7znk1c-ay0" ></iframe></p>
I tried this pattern: ^(?=<!--).*www.mysite\/embed\/+[\w\-]*
but it isn't quite working.

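You almost had it. (?=<!--) is a positive lookahead, so your pattern only matches lines that do start with a comment; you need the negative lookahead (?!<!--) instead. The correct regex is ^(?!<!--).*"(.*www.mysite\/embed\/+[\w\-]*)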
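A minimal sketch of how that pattern behaves on the two lines from the question. The shortened sample strings and the variable names are assumptions for illustration; RegexOptions.Multiline is added so that ^ applies at the start of each line.

```csharp
using System;
using System.Text.RegularExpressions;

// Two lines modelled on the question: the first is commented out, the second is not.
string input =
    "<!--<p><iframe src=\"https://www.mysite/embed/M7znk1c-ay0\" width=\"854\"></iframe>-->\n" +
    "<article class=\"art-post\"><p><iframe src=\"https://www.mysite/embed/M7znk1c-ay0\" ></iframe></p>";

// (?!<!--) rejects any line that begins with a comment opener.
var regex = new Regex(@"^(?!<!--).*""(.*www.mysite\/embed\/+[\w\-]*)", RegexOptions.Multiline);

MatchCollection matches = regex.Matches(input);
int matchCount = matches.Count;
string firstUrl = matchCount > 0 ? matches[0].Groups[1].Value : "";

Console.WriteLine($"{matchCount}: {firstUrl}");
```

Only the second line should produce a match, with the URL in capture group 1.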

HTML is not regular, so using regular expressions to parse HTML might not be a good idea.
@csabinho's answer, ^(?!<!--).*"(.*www.mysite\/embed\/+[\w\-]*), won't work if the URL you want to match is in the middle of a page; it only checks that the line doesn't begin with a comment.
Best practice would be to build a DOM and use XPath to query the XML-like content.
Edit:
By the way, you can use the following code to remove the comments first:
using System.Text.RegularExpressions;
...
string pattern = @"(<!--(.+?)-->)";
var res = Regex.Replace(input, pattern, "", RegexOptions.Singleline);
and then use a simple pattern to extract the URL from the result.
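The two steps could be combined roughly as follows. This is only a sketch: the sample input, the "VISIBLE-1" ID and the extraction pattern are assumptions made for illustration, not part of the original answer.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical input: one commented-out iframe and one live iframe.
string input =
    "<!--<p><iframe src=\"https://www.mysite/embed/M7znk1c-ay0\" width=\"854\"></iframe>-->\n" +
    "<p><iframe src=\"https://www.mysite/embed/VISIBLE-1\" ></iframe></p>";

// Step 1: drop every HTML comment; Singleline lets '.' cross line breaks.
string withoutComments = Regex.Replace(input, @"<!--.*?-->", "", RegexOptions.Singleline);

// Step 2: pull the embed URLs out of what is left.
var urls = new List<string>();
foreach (Match m in Regex.Matches(withoutComments, @"https://www\.mysite/embed/[\w\-]+"))
{
    urls.Add(m.Value);
}

Console.WriteLine(string.Join(", ", urls));
```

Because the comment is removed before the URL search runs, the commented-out URL never reaches the second pattern, wherever it sits in the page.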

Related

Regular expression Google image

I am writing an RSS reader and I need to find the image URL in the Google RSS feed using a regular expression.
The RSS channel is https://news.google.com/?output=rss.
An image tag from the feed looks like this, for example:
<img src="//t0.gstatic.com/images?q=tbn:ANd9GcRfMZ3MOzznCthFKCdIan17n9B8vZvEE-tRSQVTcgJa5i1OPfdf90zi4mBuGzPfB7Bj2mwE0TE" alt="" border="1" width="80" height="80" />
By the way, this is the regex I currently use:
Regex regx = new Regex("\\s*(?:\"(?<1>[^\"]*)\"|(?<1>\\S+))",
RegexOptions.IgnoreCase);
Any advice?

First, you should not parse XML with regex; use XmlDocument, XmlReader, or a similar parser.
If you know what you are doing, here is the quick and dirty regex solution.
1. All image tags in your feed seem to be inside description tags, and they are of course XML-encoded (keep that in mind for the next few steps).
2. Look at some example img tags.
   - Are you also looking for img tags without a src, or with an empty src?
   - Overall: define what you are looking for.
3. Design your regex.
   - Because the feed is generated automatically, the attributes seem to be in the same order every time (we use that fact for a shorter regex).
   - Each img tag starts with < (but keep point 1 in mind: it is XML-encoded, so < appears as &lt; and " as &quot;).
   - Look for &lt; followed by img (current regex: &lt;img)
   - Next comes at least one whitespace character (current regex: &lt;img\s+)
   - The src attribute is always the first attribute (in this case) if present, so we select src=&quot; (current regex: &lt;img\s+src=&quot;)
   - Next select the URL itself with .* but be careful: the * quantifier is greedy, so we have to use the lazy form .*? and finally close with &quot;
   - Final regex: &lt;img\s+src=&quot;(.*?)&quot; Make sure you put parentheses around the URL for easy group selection.
4. Last step: C# code
using System.Diagnostics;
using System.IO;
using System.Net;
using System.Text.RegularExpressions;
//quick & dirty :-)
var url = "https://news.google.com/?output=rss";
var regex = @"&lt;img\s+src=&quot;(.*?)&quot;";
var rssContent = new StreamReader(((HttpWebRequest)WebRequest.Create(url)).GetResponse().GetResponseStream()).ReadToEnd();
foreach (Match match in Regex.Matches(rssContent, regex))
{
//print img urls
Debug.WriteLine(match.Groups[1].Value);
}
PS: If you are trying to write an RSS reader, you should NOT use regex to parse HTML at all! Try to find a way to transform the HTML into XAML and write your reader in WPF, or start by learning more about these problems by studying some open-source RSS readers.
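One way to sidestep the encoding question is to decode the entities first and then match against plain HTML. The following is a sketch under that assumption; the encoded sample string and its shortened URL are hypothetical, and no network request is made.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical fragment of an XML-encoded description element.
string encoded = "&lt;img src=&quot;//t0.gstatic.com/images?q=tbn:EXAMPLE&quot; alt=&quot;&quot; border=&quot;1&quot; /&gt;";

// Decode the entities so the pattern can be written against ordinary HTML.
string html = WebUtility.HtmlDecode(encoded);

// Lazy quantifier: stop at the first closing quote after src=".
Match match = Regex.Match(html, @"<img\s+src=""(.*?)""");
string imageUrl = match.Success ? match.Groups[1].Value : "";

Console.WriteLine(imageUrl);
```

Decoding first keeps the pattern readable, at the cost of one extra pass over the text.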

You can use the regex pattern below:
/(.*\/images.*)/

Strip out content between and including h2 tag

I am trying to strip the content from between the h2 tags in a string using a Regex in C#:
<h2>content needs removing</h2> other content...
I have the following regex, which should work according to RegexBuddy (the tool I used to test it), but it doesn't:
myString = Regex.Replace(myString, @"<h[0-9]>.*</h[0-9]>", String.Empty);
I have another regex that runs after this one to remove all other HTML tags; it is called in the same way and works fine. Can anyone help me work out why this one isn't working?

Don't use Regular Expressions.
HTML is not a Regular Language, thus it can't be parsed correctly with a Regular Expression.
For example, your Regex would match:
<h2>sample</h1>
which is not valid. When dealing with nested structures, this would lead to unexpected results (.* is greedy and matches everything until the last closing h[0-9] tag in your input HTML string)
You can use XmlDocument (HTML is not XML, but that would be sufficient for what you're trying to do) or you can use the Html Agility Pack.

Try this code:
String sourcestring = "<h2>content needs removing</h2> other content...";
String matchpattern = @"\s?<h[0-9]>[^<]+</h[0-9]>\s?";
String replacementpattern = @"";
MessageBox.Show(Regex.Replace(sourcestring,matchpattern,replacementpattern));
[^<]+ is safer than .+ because it stops consuming at the first < it sees.

This works fine for me:
string myString = "<h2>content needs removing</h2> other content...";
Console.WriteLine(myString);
myString = Regex.Replace(myString, "<h[0-9]>.*</h[0-9]>", string.Empty);
Console.WriteLine(myString);
Displays:
<h2>content needs removing</h2> other content...
other content...
As expected.
If your problem is that your real case has several different heading tags, then you have an issue with the greedy * quantifier. It will create the longest match that it can. For example, if you have:
<h2>content needs removing</h2> other content...<h3>some more headings</h3> and some other stuff
You will match everything from <h2> to </h3> and replace it. To fix this, you need to use a lazy quantifier:
myString = Regex.Replace(myString, "<h[0-9]>.*?</h[0-9]>", string.Empty);
Will leave you with:
other content... and some other stuff
Note, however, that this will not fix nested <h> tags. As @fardjad said, using regex for HTML isn't generally a good idea.
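A small variation, not taken from the answers above: a backreference can require the closing tag to carry the same digit as the opening one, so that <h2>...</h1> is no longer treated as a pair. This is a sketch only and still does not handle nested headings.

```csharp
using System;
using System.Text.RegularExpressions;

string input = "<h2>content needs removing</h2> other content...<h3>some more headings</h3> and some other stuff";

// ([0-9]) captures the heading level; \1 requires the same digit in the closing tag.
// .*? is lazy, so each match ends at the nearest matching closing tag.
string stripped = Regex.Replace(input, @"<h([0-9])>.*?</h\1>", string.Empty);

Console.WriteLine(stripped);
```

Both headings are removed separately, and the text between them is kept.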

Regex for a string

It would be great if someone could provide me the Regular expression for the following string.
Sample 1: <div>abc</div><br>
Sample 2: <div>abc</div></div></div></div></div><br>
As you can see in the samples above, I need to match the string no matter how many </div> tags occur. If any other string occurs between </div> and <br>, for example <div>abc</div></div></div>DEF</div></div><br> or <div>abc</div></div></div></div></div>DEF<br>, then the regex should not match.
Thanks in advance.

Try this:
<div>([^<]+)(?:<\/div>)*<br>
As seen on rubular
Notes:
This only works if there are no tags in the abc part (or anything else that contains a < symbol).
You might want to use start and end of string anchors (^<div>([^<]+)(?:<\/div>)*<br>$) if you want your string to match the pattern exactly.
If you want to allow the abc part to be empty, use * instead of +
That being said, you should be wary of using regex to parse HTML.
In this example, you can use regex because you are parsing a (hopefully) known, regular subset of HTML. But a more robust solution (ie: an [X]HTML parser like HtmlAgilityPack) is preferred when it comes to parsing HTML.
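As a quick check, the anchored pattern can be run against the four strings from the question. This is a sketch; the variable names are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

// Anchored form of the pattern, so the whole string has to fit.
var regex = new Regex(@"^<div>([^<]+)(?:<\/div>)*<br>$");

bool sample1 = regex.IsMatch("<div>abc</div><br>");
bool sample2 = regex.IsMatch("<div>abc</div></div></div></div></div><br>");
// Extra text between the closing tags, or just before <br>, must not match.
bool textInMiddle = regex.IsMatch("<div>abc</div></div></div>DEF</div></div><br>");
bool textBeforeBr = regex.IsMatch("<div>abc</div></div></div></div></div>DEF<br>");

Console.WriteLine($"{sample1} {sample2} {textInMiddle} {textBeforeBr}");
```

The first two samples match and the two strings containing DEF do not.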

You need to use a real parser. Things like infinitely nested tags can't be handled via regex.

You could also include a named group in the expression, e.g.:
<div>(?<text>[^<]*)(?:<\/div>)*<br>
Implemented in C#:
var regex = new Regex(@"<div>(?<text>[^<]*)(?:<\/div>)*<br>");
Func<Match, string> getGroupText = m => (m.Success && m.Groups["text"] != null) ? m.Groups["text"].Value : null;
Func<string, string> getText = s => getGroupText(regex.Match(s));
Console.WriteLine(getText("<div>abc</div><br>"));
Console.WriteLine(getText("<div>123</div></div></div></div></div><br>"));

NullUserException's answer is good. Here are a couple of questions, and variations, depending on what you want.
Do you want to prevent anything from occurring before the open div tag? If so, keep the ^ at the beginning of the regex. If not, drop it.
The rest of this post refers to the following section of the regex:
([^<]+?)
Do you want to capture the contents of the div, or just know that it matches your form? To capture, leave it as is. If you don't need to capture, drop the parentheses from the above.
Do you want to match if there is nothing inside the div? If so change the + in the above to *
Finally, although it will work fine, you don't need the ? in the above.

I think, this regex is more flexible:
<div\b[^><]*+>(?>.*?</div>)(?:\s*+</div>)*+\s*+<br(?:\s*+/)?>
I don't include ^ and $ at the beginning and end of my regex because we cannot be sure that your sample will always be on a single line.

C# extracting certain parts of a string

I have a console application which is parsing HTML documents via the WebRequest method (http). The issue is really with extracting data from the html code that is returned.
Below is a fragment of the html I am interested in:
<span class="header">Number of People:</span>
<span class="peopleCount">1001</span> <!-- this is the line we are interested in! -->
<span class="footer">As of June 2009.</span>
Assume that the above html is contained in a string called "responseHtml". I would like to just extract the 'People Count' value, (second line).
I've searched Stack Overflow and found some code that could work:
How do I extract text that lies between parentheses (round brackets)?
But when I implement it, it doesn't work - I don't think it likes the way I have placed HTML tags into the regex:
string responseHtml; // this is already filled with html code above ^^
string insideBrackets = null;
Regex regex = new Regex("\\<span class=\"peopleCount\">?<TextInsideBrackets>\\w+\\</span>");
Match match = regex.Match(responseHtml);
if (match.Success)
{
insideBrackets = match.Groups["TextInsideBrackets"].Value;
Console.WriteLine(insideBrackets);
}
The above just fails to work, is it something to do with the html span brackets? All I want is the text value in between the tags for that specific span.
Thanks in advance!

Try this one:
Regex regex = new Regex("class=\\\"peopleCount\\\"\\>(?<data>[^\\<]*)",
RegexOptions.CultureInvariant
| RegexOptions.Compiled
);
It should be a tad faster, as you are basically saying the data you are looking for starts after peopleCount"> and ends at the first <
(I changed the group name to data)
Cheers,
Florian

?<TextInsideBrackets> is incorrect
You need:
(?<TextInsideBrackets>...)

I assume you want to do a named capture.
You should use
Regex regex = new Regex("\\<span class=\"peopleCount\">(?<TextInsideBrackets>\\w+)\\</span>");
and not
Regex regex = new Regex("\\<span class=\"peopleCount\">?<TextInsideBrackets>\\w+\\</span>");
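Put together with the HTML fragment from the question, the corrected named capture could be used like this. It is a sketch; the inline sample string stands in for the real response, and peopleCount is a made-up variable name.

```csharp
using System;
using System.Text.RegularExpressions;

// Stand-in for the HTML returned by the web request.
string responseHtml =
    "<span class=\"header\">Number of People:</span>\n" +
    "<span class=\"peopleCount\">1001</span>\n" +
    "<span class=\"footer\">As of June 2009.</span>";

// (?<TextInsideBrackets>...) is a named group; the parentheses are required.
var regex = new Regex("<span class=\"peopleCount\">(?<TextInsideBrackets>\\w+)</span>");

Match match = regex.Match(responseHtml);
string peopleCount = match.Success ? match.Groups["TextInsideBrackets"].Value : "";

Console.WriteLine(peopleCount);
```

The < character has no special meaning in a .NET regex, so the backslashes in front of it are optional.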

How can I write a regular expression to capture links with no link text?

How can I write a regular expression to replace links with no link text like this:
<a href="http://www.somesite.com"></a>
with
<a href="http://www.somesite.com">http://www.somesite.com</a>
?
This is what I was trying to do to capture the matches, and it isn't catching any. What am I doing wrong?
string pattern = "<a\\s+href\\s*=\\s*\"(?<href>.*)\">\\s*</a>";

I wouldn't use a regex - I'd use the Html Agility Pack, and a query like:
foreach (HtmlNode link in doc.DocumentNode.SelectNodes("//a[.='']")) {
link.InnerHtml = link.GetAttributeValue("href", "");
}

I could be wrong, but I think you simply need to change the quantifier within the href group to be lazy rather than greedy.
string pattern = @"<a\s+href\s*=\s*""(?<href>.*?)"">\s*</a>";
(I've also changed the string literal to a verbatim string using @, for better readability.)
The rest of the regex appears fine to me. That you're not capturing any matches at all makes me think otherwise, but there could be a problem in the rest of the code (or even the input data - have you verified that?).

I would suggest
string pattern = "(<a\\b[^>]*href=\"([^\"]+)\"[^>]*>)[\\s\\r\\n]*(</a>)";
This way also links with their href attribute somewhere else would be captured.
Replace with
"$1$2$3"
The usual word of warning: HTML and regex are essentially incompatible. Use with caution, this might blow up.
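A sketch of that replacement in use. The sample HTML, including the second link with the text "kept", is an assumption added to show that links which already have text are left alone.

```csharp
using System;
using System.Text.RegularExpressions;

// Group 1: opening tag, group 2: href value, group 3: closing tag.
string pattern = "(<a\\b[^>]*href=\"([^\"]+)\"[^>]*>)[\\s\\r\\n]*(</a>)";

// Hypothetical input: one empty link and one link that already has text.
string html = "<p><a href=\"http://www.somesite.com\"></a> and <a href=\"http://example.com\">kept</a></p>";

// $1$2$3 re-emits the tag, inserts the href as the link text, then closes the tag.
string result = Regex.Replace(html, pattern, "$1$2$3");

Console.WriteLine(result);
```

Only the empty link is rewritten; the second one does not match because text sits between its opening and closing tags.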

Marc Gravell has the right answer, regexes are fundamentally bad at parsing HTML (see Can you provide some examples of why it is hard to parse XML and HTML with a regex? for why). See Can you provide an example of parsing HTML with your favorite parser? for examples using a variety of parsers.
