Regex for a URI / URL - c#

I am currently searching through an HTML page for a specific link. At the moment I have the following regex, which picks up a generic URI:
Regex regex = new Regex(@"(https?|ftp|file)\://[A-Za-z0-9\.\-]+(/[A-Za-z0-9\?\&\=;\+!'\(\)\*\-\._~%]*)*");
However, there are several links in the HTML, so it picks out the first one, whereas the link I want to extract looks like this:
http://*.*.com/dlp/*/*/*
How could this be achieved using a regex?

Try this:
http\://[A-Za-z0-9\.\-]+\.com/dlp[A-Za-z0-9\.\-/]*
You may need to escape some characters again.
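For what it's worth, here is a minimal sketch of how that pattern could be used from C# to collect only the matching links instead of the first URL on the page. The helper name FindDlpLinks and the sample host names are made up for illustration; the pattern itself is the one suggested above.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper: returns every URL of the form http://<host>.com/dlp...
List<string> FindDlpLinks(string html)
{
    // Verbatim string (@"...") so the backslashes reach the regex engine unchanged.
    var regex = new Regex(@"http\://[A-Za-z0-9\.\-]+\.com/dlp[A-Za-z0-9\.\-/]*");
    var links = new List<string>();
    foreach (Match m in regex.Matches(html))
    {
        links.Add(m.Value);
    }
    return links;
}

// Made-up sample input: only the second link sits under /dlp.
string sample = "<a href=\"http://www.example.com/about\">x</a> " +
                "<a href=\"http://cdn.example.com/dlp/a/b/c\">y</a>";
foreach (string link in FindDlpLinks(sample))
{
    Console.WriteLine(link);
}
```

Using Regex.Matches rather than Regex.Match is what lets you walk over every candidate instead of stopping at the first hit.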

Regex match URL if not in html comment line

I want to match "https://www.mysite/embed/M7znk1c-ay0" only if it is not inside an HTML comment.
So don't match this line
<!--<p><iframe src="https://www.mysite/embed/M7znk1c-ay0" width="854" height="480" frameborder="0" allowfullscreen="allowfullscreen"></iframe>-->
but match this line
<article class="art-post"><div class="art-postcontent clearfix"><div class="art-article"><p><iframe src="https://www.mysite/embed/M7znk1c-ay0" ></iframe></p>
I tried this pattern ^(?=<!--).*www.mysite\/embed\/+[\w\-]*
but it isn't quite working
You almost had it. The lookahead needs to be negative, (?!<!--), rather than positive, (?=<!--). The correct regex is:
^(?!<!--).*"(.*www.mysite\/embed\/+[\w\-]*)
HTML is not regular, so using regular expressions to parse HTML might not be a good idea.
@csabinho's answer ^(?!<!--).*"(.*www.mysite\/embed\/+[\w\-]*) won't work if the URL you want to match is in the middle of a page; it only checks that the line doesn't begin with a comment.
Best practice would be to build a DOM and use XPath to query the XML-like content.
Edit:
By the way, you can use the following code first to remove comments:
using System.Text.RegularExpressions;
...
string pattern = @"(<!--(.+?)-->)";
var res = Regex.Replace(input, pattern, "", RegexOptions.Singleline);
Then use a simple pattern to extract the URL from the result.
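Putting the two steps together might look roughly like this. It is only a sketch: the helper name FindEmbedUrl and the second pattern are my own assumptions, and the comment pattern uses .*? rather than .+? so that an empty comment is removed as well.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper: strips HTML comments, then returns the first embed URL
// that is left, or an empty string if there is none.
string FindEmbedUrl(string html)
{
    // Singleline lets '.' match newlines, so comments spanning several lines go too.
    string withoutComments = Regex.Replace(html, @"<!--.*?-->", "", RegexOptions.Singleline);
    Match m = Regex.Match(withoutComments, @"https://www\.mysite/embed/[\w\-]+");
    return m.Success ? m.Value : "";
}

string commented = "<!--<p><iframe src=\"https://www.mysite/embed/M7znk1c-ay0\"></iframe>-->";
string live = "<p><iframe src=\"https://www.mysite/embed/M7znk1c-ay0\"></iframe></p>";

string fromComment = FindEmbedUrl(commented);
Console.WriteLine(fromComment == "" ? "(no match)" : fromComment);
Console.WriteLine(FindEmbedUrl(live));
```

Because the comments are removed from the whole input first, it no longer matters where on a line the comment starts.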

Regex for URL C#

In my C# program I wrote a Google Search Function, which works by fetching the source from each page and getting the URLs via regex.
My actual Regex is:
(?:(?:(?:http)://)(?:w{3}\\.)?(?:[a-zA-Z0-9/;\\?&=:\\-_\\$\\+!\\*'\\(\\|\\\\~\\[\\]#%\\.])+)
This works well at the moment, but I get, for example, URLs like http://www.example.com/forums/arcade.php?efdf=332
I just want to get in this case the URL without the ?efdf=332 at the end.
So how should I change the regex?
http://(?:www\.)?[a-zA-Z0-9/;&=:_$+!*'()|~\[\]#%.\\-]+
does the same as your regex (I've removed a lot of unnecessary cruft) but stops matching a link before a ?.
In C#:
Regex regexObj = new Regex(@"http://(?:www\.)?[a-zA-Z0-9/;&=:_$+!*'()|~\[\]#%.\\-]+");
That said, I'm not sure this is such a good way of matching URLs (what about https, ftp, mailto etc.?)
You can use the Uri class to access various parts of the URL and either remove the query string from the end, or concatenate the parts you want.
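As a small sketch of the Uri approach: Uri.GetLeftPart(UriPartial.Path) returns everything up to and including the path, which drops the query string. The helper name StripQuery is just for illustration.

```csharp
using System;

// Hypothetical helper: returns the URL without its query string or fragment.
string StripQuery(string url)
{
    var uri = new Uri(url);
    // UriPartial.Path keeps scheme, host, port and path, nothing after them.
    return uri.GetLeftPart(UriPartial.Path);
}

Console.WriteLine(StripQuery("http://www.example.com/forums/arcade.php?efdf=332"));
```

This assumes the input is an absolute URL; new Uri(string) throws a UriFormatException otherwise, so Uri.TryCreate may be a better fit for scraped data.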

Extract action attribute in a Form tag with Regex in C#?

I want to extract https://www.sth.com/yment/Paymentform.aspx from the string below:
<form id='paymentUTLfrm' action='https://www.sth.com/yment/Paymentform.aspx' method='post'>
How can I do it with Regex or something?
While I don't encourage using regex to parse HTML, this is simple enough that a regex will suffice. For more complex operations, do use a proper (X)HTML parser like HtmlAgilityPack.
This regex should work:
<\s*form[^>]*\s+action=(["'])(.*?)\1
EDIT:
Updated regex so it will work with apostrophes in URLs. Note that the URL is now in the 2nd capture group.
See it on rubular
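In C#, using that regex could look something like the sketch below. The helper name GetFormAction is made up; the point is that the URL comes out of group 2, because group 1 holds the opening quote that \1 refers back to.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper: returns the action attribute of the first <form> tag,
// or an empty string if no match is found.
string GetFormAction(string html)
{
    // ([""']) captures the opening quote; \1 demands the same quote to close the value.
    Match m = Regex.Match(html, @"<\s*form[^>]*\s+action=([""'])(.*?)\1");
    return m.Success ? m.Groups[2].Value : "";
}

Console.WriteLine(GetFormAction(
    "<form id='paymentUTLfrm' action='https://www.sth.com/yment/Paymentform.aspx' method='post'>"));

// An apostrophe inside a double-quoted value does not end the match early.
Console.WriteLine(GetFormAction("<form action=\"/it's\" method=\"post\">"));
```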
Use Html Agility Pack. It will save you a lot of trouble in the long run.
using HtmlAgilityPack;
var doc = new HtmlDocument();
doc.LoadHtml("<form id='paymentUTLfrm' action='https://www.sth.com/yment/Paymentform.aspx' method='post'>");
var form = doc.DocumentNode.SelectSingleNode("id('paymentUTLfrm')");
string action = form.Attributes["action"].Value;
It supports loading pages directly from the web, as well as XPath (used above). The HTML does not have to be valid.
EDIT: If you want to use the name:
doc.DocumentNode.SelectSingleNode("//*[@name='paymentUTLfrm']");
While I would agree that general html parsing is best done with html agility pack (etc) rather than with regex, this is a pretty simple requirement and a regex would be appropriate. I am no regex expert, but this one works:
action=["'](.*)["']
The (.*) will capture the url
Maybe some expert can add a comment to refine this...
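One possible refinement, offered as a sketch rather than a definitive fix: (.*) is greedy, so on the form tag from the question it runs on to the last quote in the line. Making it lazy with (.*?) stops at the first closing quote. The helper name CaptureAction is hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper: returns capture group 1 of the pattern, or "" if no match.
string CaptureAction(string html, string pattern)
{
    Match m = Regex.Match(html, pattern);
    return m.Success ? m.Groups[1].Value : "";
}

string form = "<form id='paymentUTLfrm' action='https://www.sth.com/yment/Paymentform.aspx' method='post'>";

// Greedy: (.*) backtracks from the end of the line, so it also swallows method='post.
Console.WriteLine(CaptureAction(form, @"action=[""'](.*)[""']"));

// Lazy: (.*?) stops at the first closing quote, leaving just the URL.
Console.WriteLine(CaptureAction(form, @"action=[""'](.*?)[""']"));
```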

How can I write a regular expression to capture links with no link text?

How can I write a regular expression to replace links with no link text like this:
<a href="http://www.somesite.com"></a>
with
<a href="http://www.somesite.com">http://www.somesite.com</a>
?
This is what I was trying to do to capture the matches, and it isn't catching any. What am I doing wrong?
string pattern = "<a\\s+href\\s*=\\s*\"(?<href>.*)\">\\s*</a>";
I wouldn't use a regex - I'd use the Html Agility Pack, and a query like:
var links = doc.DocumentNode.SelectNodes("//a[.='']");  // null when nothing matches
if (links != null)
    foreach (HtmlNode link in links)
        link.InnerHtml = link.GetAttributeValue("href", "");
I could be wrong, but I think you simply need to change the quantifier within the href group to be lazy rather than greedy.
string pattern = @"<a\s+href\s*=\s*""(?<href>.*?)"">\s*</a>";
(I've also changed the type of the string literal to use @, for better readability.)
The rest of the regex appears fine to me. That you're not capturing any matches at all makes me think otherwise, but there could be a problem in the rest of the code (or even the input data - have you verified that?).
I would suggest
string pattern = "(<a\\b[^>]*href=\"([^\"]+)\"[^>]*>)[\\s\\r\\n]*(</a>)";
This way also links with their href attribute somewhere else would be captured.
Replace with
"$1$2$3"
The usual word of warning: HTML and regex are essentially incompatible. Use with caution, this might blow up.
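With that caveat in mind, the replacement above could be applied in C# roughly as follows. This is a sketch; the helper name FillEmptyLinks is made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper: gives every empty <a ...></a> its own href value as link text.
string FillEmptyLinks(string html)
{
    string pattern = "(<a\\b[^>]*href=\"([^\"]+)\"[^>]*>)[\\s\\r\\n]*(</a>)";
    // $1 = opening tag, $2 = href value, $3 = closing tag.
    return Regex.Replace(html, pattern, "$1$2$3");
}

Console.WriteLine(FillEmptyLinks("<a href=\"http://www.somesite.com\"></a>"));

// A link that already has text is left alone.
Console.WriteLine(FillEmptyLinks("<a href=\"http://www.somesite.com\">Some site</a>"));
```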
Marc Gravell has the right answer: regexes are fundamentally bad at parsing HTML (see "Can you provide some examples of why it is hard to parse XML and HTML with a regex?" for why). See "Can you provide an example of parsing HTML with your favorite parser?" for examples using a variety of parsers.

Get URL from HTML code using a regular expression

Consider:
<div><a href="http://anirudhagupta.blogspot.com/">Anirudha Web blog</a></div>
What is the regular expression to get http://anirudhagupta.blogspot.com/ from it?
If you can suggest something in C#, that's good. I would also like to do this with jQuery.
If you want to use jQuery you can do the following.
$('a').attr('href')
Quick and dirty:
href="(.*?)"
Ok, let's go with another regex for parsing URLs. This comes from RFC 2396 - URI Generic Syntax: Parsing a URI Reference with a Regular Expression
^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?
Of course, your HTML may also contain relative URLs, which you'll need to handle another way; I recommend the C# Uri constructor Uri(Uri, String), which resolves a relative URL against a base URI.
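A short sketch of that constructor in use, assuming you know the address of the page the HTML came from. The helper name Resolve and the example.com page address are made up for illustration.

```csharp
using System;

// Hypothetical helper: resolves an href found in a page against the page's own URL.
string Resolve(string pageUrl, string href)
{
    var baseUri = new Uri(pageUrl);
    // Uri(Uri, String) leaves absolute hrefs as they are and resolves relative ones.
    return new Uri(baseUri, href).AbsoluteUri;
}

string page = "http://www.example.com/forums/index.php";
Console.WriteLine(Resolve(page, "arcade.php"));
Console.WriteLine(Resolve(page, "/about"));
Console.WriteLine(Resolve(page, "http://anirudhagupta.blogspot.com/"));
```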
The simplest way to do this is using the following regular expression.
/href="([^"]+)"/
This will get all characters from the first quote until it finds a character that is a quote. This is, in most languages, the fastest way to get a quoted string, that can't itself contain quotes. Quotes should be encoded when used in attributes.
UPDATE: A complete Perl program for parsing URLs would look like this:
use 5.010;
my @matches;
while (<>) {
    push @matches, m/href="([^"]+)"/gi;
    push @matches, m/href='([^']+)'/gi;
    push @matches, m/href=([^"'][^>\s]*)[>\s]+/gi;
}
say for @matches;  # print once, after all input has been read
It reads from stdin and prints all URLs. It handles the three possible quoting styles: double quotes, single quotes, and none. Use it with curl to find all the URLs in a webpage:
curl url | perl urls.pl
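Since the question asked about C#, here is a rough C# counterpart that covers the same three quoting styles. It is a sketch under my own assumptions: the helper name ExtractHrefs and the group names dq, sq and bare are made up, and a single alternation replaces the three separate Perl patterns.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper: collects href values written with double quotes,
// single quotes, or no quotes at all.
List<string> ExtractHrefs(string html)
{
    // Exactly one of the three named groups takes part in each match.
    var regex = new Regex(
        @"href\s*=\s*(?:""(?<dq>[^""]*)""|'(?<sq>[^']*)'|(?<bare>[^\s>""']+))",
        RegexOptions.IgnoreCase);
    var urls = new List<string>();
    foreach (Match m in regex.Matches(html))
    {
        if (m.Groups["dq"].Success) urls.Add(m.Groups["dq"].Value);
        else if (m.Groups["sq"].Success) urls.Add(m.Groups["sq"].Value);
        else urls.Add(m.Groups["bare"].Value);
    }
    return urls;
}

// Made-up sample input with one link per quoting style.
string threeStyles = "<a href=\"http://a.example/\">A</a> " +
                     "<a href='http://b.example/'>B</a> " +
                     "<a href=http://c.example/>C</a>";
foreach (string url in ExtractHrefs(threeStyles))
{
    Console.WriteLine(url);
}
```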
The right way to do this is to load the HTML into the C# XML parser and then use XPath to query the URLs. This way you don't have to worry about parsing at all.
You don't need a complicated regular expression or HTML parser, since you only want to extract links. Here's a generic way to do it.
data = """
<html>
abcd ef ....
blah blah <div><a href="http://anirudhagupta.blogspot.com/">Anirudha Web blog</a></div>
blah ...
<div><a href="http://mike.blogspot.com/">Mike's Web blog
</a></div>
end...
</html>
"""
for item in data.split("</a>"):
    if '<a href="' in item:
        start_of_href = item.index('<a href="')  # where <a href=" begins
        rest = item[start_of_href + len('<a href="'):]  # text after the opening quote
        print(rest.split('"')[0])  # keep only the URL, up to the closing quote
The above is Python, but you can adapt the idea to C#. Split your HTML string using "</a>" as the delimiter. Go through each piece, check for "<a href", then take the substring after it up to the closing quote. Those will be your links.
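A rough C# adaptation of the same splitting idea might look like this. The helper name LinksBySplitting is made up, and like the Python version it only recognises anchors written exactly as <a href="...", so treat it as a sketch rather than a general solution.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: splits on "</a>" and pulls the href out of each piece.
List<string> LinksBySplitting(string html)
{
    const string marker = "<a href=\"";
    var links = new List<string>();
    foreach (string piece in html.Split(new[] { "</a>" }, StringSplitOptions.None))
    {
        int start = piece.IndexOf(marker, StringComparison.Ordinal);
        if (start < 0) continue;              // no link in this piece
        start += marker.Length;               // first character of the URL
        int end = piece.IndexOf('"', start);  // closing quote of the href value
        if (end > start) links.Add(piece.Substring(start, end - start));
    }
    return links;
}

string pageHtml = "<div><a href=\"http://anirudhagupta.blogspot.com/\">Anirudha Web blog</a></div>\n" +
                  "<div><a href=\"http://mike.blogspot.com/\">Mike's Web blog\n</a></div>";
foreach (string link in LinksBySplitting(pageHtml))
{
    Console.WriteLine(link);
}
```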
