Regex for URL C# - c#

In my C# program I wrote a Google Search Function, which works by fetching the source from each page and getting the URLs via regex.
My actual Regex is:
(?:(?:(?:http)://)(?:w{3}\\.)?(?:[a-zA-Z0-9/;\\?&=:\\-_\\$\\+!\\*'\\(\\|\\\\~\\[\\]#%\\.])+)
This works well at the moment, but I get, for example, URLs like http://www.example.com/forums/arcade.php?efdf=332
I just want to get in this case the URL without the ?efdf=332 at the end.
So how should I change the regex?

http://(?:www\.)?[a-zA-Z0-9/;&=:_$+!*'()|~\[\]#%.\\-]+
does the same as your regex (I've removed a lot of unnecessary cruft) but stops matching a link before a ?.
In C#:
Regex regexObj = new Regex(@"http://(?:www\.)?[a-zA-Z0-9/;&=:_$+!*'()|~\[\]#%.\\-]+");
That said, I'm not sure this is such a good way of matching URLs (what about https, ftp, mailto etc.?)

You can use the Uri class to access various parts of the URL and either remove the query string from the end, or concatenate the parts you want.
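A minimal sketch of the Uri-based approach, assuming the input is an absolute URL. The class and method names (UrlTrimmer, StripQuery) are made up for illustration; Uri.TryCreate and Uri.GetLeftPart are the framework calls doing the work.

```csharp
using System;

public static class UrlTrimmer
{
    // Returns the URL without its query string or fragment,
    // or null when the input is not a valid absolute URL.
    public static string StripQuery(string url)
    {
        Uri uri;
        if (!Uri.TryCreate(url, UriKind.Absolute, out uri))
            return null;

        // UriPartial.Path keeps the scheme, authority and path only.
        return uri.GetLeftPart(UriPartial.Path);
    }
}
```

With the URL from the question, UrlTrimmer.StripQuery("http://www.example.com/forums/arcade.php?efdf=332") should give http://www.example.com/forums/arcade.php.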

Related

regular expression get all hosts from html

I'm trying to get all urls in one regular expression, currently i'm using this pattern.
/^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$/
However that regex returns the pages/files, instead of hosts. So instead of having to run a second regular expression, I'm hoping someone here can help
This returns http://www.yoursite.com/index.html
I'm attempting to return yoursite.com.
Also, the regex will be parsing from HTML and the hosts will be checked afterwards, so 100% accuracy isn't crucial.
Assuming that your regex:
/^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$/
Actually does parse the Urls (I haven't checked it), you could easily use a capture group to get the host:
/^(https?:\/\/)?(?<host>([\da-z\.-]+)\.([a-z\.]{2,6}))([\/\w \.-]*)*\/?$/
When you get the Match result, you can examine Groups["host"] to get the host name.
But you're much better off, in my opinion, just using Uri.TryCreate, although you'll need a little logic to get around the possible lack of a scheme. That is:
// Prepend a scheme when the line does not already start with one.
if (!Regex.IsMatch(line, @"^https?://"))
    line = "http://" + line;
Uri uri;
if (Uri.TryCreate(line, UriKind.Absolute, out uri))
{
    // It's a valid URL.
    host = uri.Host;
}
Parsing Urls is a pretty tricky business. For example, no individual dotted segment can exceed 63 characters, and there's nothing preventing the last dotted segment from having numbers or hyphens. Nor is it limited to 6 characters. You're better off passing the entire string to Uri.TryCreate than you are trying to duplicate the craziness of URL parsing with a single regular expression.
It's possible that the rest of the Url (after the host name) could be trash. If you want to eliminate that bit causing a problem, then extract everything up to the end of the host name:
^https?:\/\/[^\/]*
Then run that through Uri.TryCreate.
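Putting those steps together might look like the sketch below. HostExtractor and GetHost are hypothetical names, and the code assumes only http and https links matter. Note that it returns the full host (www.yoursite.com), not the registrable domain (yoursite.com).

```csharp
using System;
using System.Text.RegularExpressions;

public static class HostExtractor
{
    // Scheme plus everything up to the first slash after it.
    private static readonly Regex SchemeAndHost =
        new Regex(@"^https?://[^/]*", RegexOptions.IgnoreCase);

    // Returns the host name, or null when nothing usable is found.
    public static string GetHost(string line)
    {
        // Work around a missing scheme by assuming http.
        if (!Regex.IsMatch(line, @"^https?://", RegexOptions.IgnoreCase))
            line = "http://" + line;

        // Drop the path so trash after the host cannot break parsing.
        Match m = SchemeAndHost.Match(line);
        if (!m.Success)
            return null;

        Uri uri;
        return Uri.TryCreate(m.Value, UriKind.Absolute, out uri) ? uri.Host : null;
    }
}
```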
To capture just the yoursite.com from sample text http://www.yoursite.com/index?querystring=value you could use this expression, however this does not validate the string:
^(https?:\/\/)?(?:[^.\/?]*[.])?([^.\/?]*[.][^.\/?]*)
Live demo: http://www.rubular.com/r/UNR7qiQ0Eq

Regex for a URI / URL

I am currently searching through an HTML page for a specific link, at the moment I have a regex as follows that picks up a generic URI:
Regex regex = new Regex(@"(https?|ftp|file)\://[A-Za-z0-9\.\-]+(/[A-Za-z0-9\?\&\=;\+!'\(\)\*\-\._~%]*)*");
However, there are several links in the HTML, so it picks out the first one, whereas the link I want to extract has this form:
http://*.*.com/dlp/*/*/*
How could this be achieved using a regex?
Try this:
http\://[A-Za-z0-9\.\-]+\.com/dlp[A-Za-z0-9\.\-/]*
You may need to escape some characters again.
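As a rough sketch, the pattern can be applied to the whole page with Regex.Matches so that every matching link is returned, not just the first. DlpLinkFinder and FindAll are illustrative names, and the pattern assumes plain http links on .com hosts as in the question.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class DlpLinkFinder
{
    // A verbatim string (@"...") avoids escaping the backslashes a second time.
    private static readonly Regex DlpLink =
        new Regex(@"http://[A-Za-z0-9.\-]+\.com/dlp[A-Za-z0-9.\-/]*");

    // Returns every link in the HTML whose path starts with /dlp.
    public static List<string> FindAll(string html)
    {
        var links = new List<string>();
        foreach (Match m in DlpLink.Matches(html))
            links.Add(m.Value);
        return links;
    }
}
```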

Regex to match a fragment of the URL

I have URLs like:
http://127.0.0.1:81/controller/verbOne/NXw4fDF8MXwxfDQ1?source=dddd
or
http://127.0.0.1:81/controller/verbTwo/NXw4fDF8MXwxfDQ1
I'd like to extract the last path segment, NXw4fDF8MXwxfDQ1 in both examples. The host and port can change to anything (when I publish it to a live server it will change). The controller never changes. And for the verb part, there are 2 possibilities.
Can anyone help me with the regex?
Thanks
Instead of using a regex you could use the built in functionality of Uri
Uri uri = new Uri("http://127.0.0.1:81/controller/verbOne/NXw4fDF8MXwxfDQ1?source=dddd");
var lastSegment = uri.Segments.Last();
You're looking for the Uri and Path classes:
Path.GetFileName(new Uri(str).AbsolutePath)
Why use a regex at all? You can search for the two strings "verbOne/" and "verbTwo/" and take the substring that follows. Then cut off everything from the '?' onward.
I think this is faster than a regex.
krikit
Though everyone else here is correct that regex is not the best solution, because it could fail when parsers already exist that should never fail due to their specialization, I believe you could use the following regex:
(?<=http://127\.0\.0\.1:81/controller/verb(One|Two)/)[a-zA-Z0-9]*

How do I use a pattern Url to extract a segment from an actual Url?

If I have a series of "pattern" Urls of the form:
http://{username}.sitename.com/
http://{username}.othersite.net/
http://mysite.com/{username}
and I have an actual Url of the form:
http://joesmith.sitename.com/
Is there any way that I can match a pattern Url and in turn use it to extract the username portion out the actual Url? I've thought of nasty ways to do it, but it just seems like there should be a more intuitive way to accomplish this.
ASP.NET MVC uses a similar approach to extract the various segments of the URL when it is building its routes. Given the example:
{controller}/{action}
So given the Url of the form, Home/Index, it knows that it is the Home controller calling the Index action method.
Not sure I understand this question correctly but you can just use a regular expression to match anything between 'http://' and the first dot.
A very simple regex will do:
':https?://([a-z0-9\.-]*[a-z0-9])\.sitename\.com'
This will allow any subdomain that only contains valid subdomain characters. Example of allowed subdomains:
joesmith.sitename.com
joe.smith.sitename.com
joe-smith.sitename.com
a-very-long-subdomain.sitename.com
As you can see, you might want to complicate the regex slightly. For instance, you could limit it to only allow a certain amount of characters in the subdomain.
It seems the quickest and easiest solution is to build on Machine's answer.
var givenUri = "http://joesmith.sitename.com/";
var patternUri = "http://{username}.sitename.com/";
// Turn the placeholder into a capture group (note the closing parenthesis).
patternUri = patternUri.Replace("{username}", @"([a-z0-9\.-]*[a-z0-9])");
var result = Regex.Match(givenUri, patternUri, RegexOptions.IgnoreCase).Groups;
if (!String.IsNullOrEmpty(result[1].Value))
    return result[1].Value;
Seems to work great.
Well, this "pattern URL" is a format you've made up, right? Basically, you'll just need to process it.
If the format of it is:
anything inside "{ }" is a thing to capture, everything else must be as is
Then you'd just find the start/end index of those brackets, and match everything else. Then when you get to a place where one is, make sure you only look for chars such that they don't match whatever 'token' comes after the next ending '}'.
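One way to sketch this in C# is to turn the pattern URL into a regular expression: literal parts are escaped and each {name} becomes a named capture group. UrlTemplate, ToRegex and Extract are hypothetical names, and the rule that a placeholder matches anything except '.' and '/' is an assumption, not something the question specifies.

```csharp
using System.Text;
using System.Text.RegularExpressions;

public static class UrlTemplate
{
    // Builds an anchored regex from a template such as
    // "http://{username}.sitename.com/".
    public static Regex ToRegex(string template)
    {
        // Because the pattern has a capture group, Regex.Split keeps the
        // placeholder names: even indexes are literal text, odd are names.
        string[] parts = Regex.Split(template, @"\{([A-Za-z_]\w*)\}");

        var pattern = new StringBuilder("^");
        for (int i = 0; i < parts.Length; i++)
        {
            if (i % 2 == 0)
                pattern.Append(Regex.Escape(parts[i]));
            else
                pattern.Append("(?<" + parts[i] + ">[^./]+)");
        }
        pattern.Append("$");

        return new Regex(pattern.ToString(), RegexOptions.IgnoreCase);
    }

    // Returns the value captured for the placeholder, or null
    // when the URL does not match the template.
    public static string Extract(string template, string url, string name)
    {
        Match m = ToRegex(template).Match(url);
        return m.Success ? m.Groups[name].Value : null;
    }
}
```

For example, UrlTemplate.Extract("http://{username}.sitename.com/", "http://joesmith.sitename.com/", "username") should return joesmith. Trying each pattern URL in turn and taking the first non-null result covers the case of several patterns.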
There are definitely different ways - ultimately though your server must be configured to handle (and possibly route) these different subdomain requests.
What I would do would be to answer all subdomain requests (except maybe some reserved words, like 'www', 'mail', etc.) on sitename.com with a single handler or page (I'm assuming ASP.NET here based on your C# tag).
I'd use the request path, which is easy enough to get, with some simple string parsing/regex routines (remove the 'http://', grab the first token up until '.' or '/' or '\', etc.) and then use that in a session, making sure to observe URL changes.
Alternately, you could map certain virtual paths to request urls ('joesmith.sitename.com' => 'sitename.com/index.aspx?username=joesmith') via IIS but that's kind of nasty too.
Hope this helps!

Get URL from HTML code using a regular expression

What is the regular expression to get http://anirudhagupta.blogspot.com/ from the following?
<div><a href="http://anirudhagupta.blogspot.com/">Anirudha Web blog</a></div>
If you suggest something in C# that's good. I also like jQuery to do this.
If you want to use jQuery you can do the following.
$('a').attr('href')
Quick and dirty:
href="(.*?)"
Ok, let's go with another regex for parsing URLs. This comes from RFC 2396 - URI Generic Syntax: Parsing a URI Reference with a Regular Expression
^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?
Of course, the HTML may also contain relative URLs, which need to be handled differently; for those I recommend the C# Uri constructor (Uri, String), which resolves a relative address against a base URI.
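A short sketch of resolving an href against the address of the page it came from, using that constructor. LinkResolver and Resolve are illustrative names; the constructor throws UriFormatException on malformed input, which this sketch does not handle.

```csharp
using System;

public static class LinkResolver
{
    // Resolves an href (relative or absolute) against the page URL.
    public static string Resolve(string pageUrl, string href)
    {
        Uri baseUri = new Uri(pageUrl);
        // An absolute href is returned unchanged; a relative one is
        // combined with the base according to the usual URI rules.
        return new Uri(baseUri, href).AbsoluteUri;
    }
}
```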
The simplest way to do this is using the following regular expression.
/href="([^"]+)"/
This will get all characters from the first quote until it finds a character that is a quote. This is, in most languages, the fastest way to get a quoted string, that can't itself contain quotes. Quotes should be encoded when used in attributes.
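Since the question asks for C#, here is a sketch of the same expression used with Regex.Matches. HrefExtractor and Extract are made-up names, and only double-quoted href attributes are handled.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class HrefExtractor
{
    // Group 1 captures everything between the quotes.
    private static readonly Regex Href =
        new Regex("href=\"([^\"]+)\"", RegexOptions.IgnoreCase);

    // Returns the value of every double-quoted href attribute.
    public static List<string> Extract(string html)
    {
        var urls = new List<string>();
        foreach (Match m in Href.Matches(html))
            urls.Add(m.Groups[1].Value);
        return urls;
    }
}
```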
UPDATE: A complete Perl program for parsing URLs would look like this:
use 5.010;
while (<>) {
    my @matches;    # matches found on the current line only
    push @matches, m/href="([^"]+)"/gi;
    push @matches, m/href='([^']+)'/gi;
    push @matches, m/href=([^"'][^>\s]*)[>\s]+/gi;
    say for @matches;
}
It reads from stdin and prints all URLs. It takes care of the three possible quotes. Use it with curl to find all the URLs in a webpage:
curl url | perl urls.pl
The right way to do this is to load the HTML into the C# XML parser and then use XPath to query the URLs. This way you don't have to worry about parsing at all.
You don't need a complicated regular expression or HTML parser, since you only want to extract links. Here's a generic way to do it.
data = """
<html>
abcd ef ....
blah blah <div><a href="http://anirudhagupta.blogspot.com/">Anirudha Web blog</a></div>
blah ...
<div><a href="http://mike.blogspot.com/">Mike's Web blog
</a></div>
end...
</html>
"""
for item in data.split("</a>"):
    if "<a href" in item:
        start_of_href = item.index("<a href")  # position of <a href="
        print(item[start_of_href + len('<a href="'):])  # everything after <a href="
The above is Python code, but you can adapt the idea to C#. Split the HTML string using "</a>" as the delimiter, go through each piece, check for "<a href", and take the substring that follows it. That gives you the links.
