I'm trying to extract all URLs with one regular expression; currently I'm using this pattern:
/^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$/
However, that regex returns the full pages/files instead of just the hosts. Rather than having to run a second regular expression, I'm hoping someone here can help.
This returns http://www.yoursite.com/index.html
I'm attempting to return yoursite.com.
Also, the regex will be parsing URLs out of HTML, and the hosts will be checked afterwards, so 100% accuracy isn't crucial.
Assuming that your regex:
/^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$/
actually does parse the URLs (I haven't checked it), you could easily use a named capture group to get the host:
/^(https?:\/\/)?(?<host>([\da-z\.-]+)\.([a-z\.]{2,6}))([\/\w \.-]*)*\/?$/
When you get the Match result, you can examine Groups["host"] to get the host name.
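For example (a quick C# sketch; assumes using System.Text.RegularExpressions):
var pattern = new Regex(
    @"^(https?:\/\/)?(?<host>([\da-z\.-]+)\.([a-z\.]{2,6}))([\/\w \.-]*)*\/?$");

Match match = pattern.Match("http://www.yoursite.com/index.html");
if (match.Success)
{
    // Prints "www.yoursite.com"; stripping a leading "www." is left to you.
    Console.WriteLine(match.Groups["host"].Value);
}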
But you're much better off, in my opinion, just using Uri.TryCreate, although you'll need a little logic to get around the possible lack of a scheme. That is:
if (!Regex.IsMatch(line, "^https?://"))
    line = "http://" + line;

Uri uri;
if (Uri.TryCreate(line, UriKind.Absolute, out uri))
{
    // It's a valid URL.
    host = uri.Host;
}
Parsing URLs is a pretty tricky business. For example, no individual dotted segment can exceed 63 characters, there's nothing preventing the last dotted segment from containing digits or hyphens, and that segment isn't limited to 6 characters. You're better off passing the entire string to Uri.TryCreate than trying to duplicate the craziness of URL parsing with a single regular expression.
It's possible that the rest of the URL (after the host name) is trash. If you want to keep that part from causing a problem, extract everything up to the end of the host name:
^https?:\/\/[^\/]*
Then run that through Uri.TryCreate.
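Putting the two steps together (a sketch; error handling omitted):
// Grab everything up to the end of the host name, then let Uri do the real work.
Match m = Regex.Match(line, @"^https?:\/\/[^\/]*");
if (m.Success)
{
    Uri uri;
    if (Uri.TryCreate(m.Value, UriKind.Absolute, out uri))
    {
        string host = uri.Host;  // e.g. "www.yoursite.com"
    }
}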
To capture just yoursite.com from the sample text http://www.yoursite.com/index?querystring=value, you could use this expression (note that it does not validate the string):
^(https?:\/\/)?(?:[^.\/?]*[.])?([^.\/?]*[.][^.\/?]*)
Live demo: http://www.rubular.com/r/UNR7qiQ0Eq
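In C#, reading that capture would look like this (group 2 holds the domain):
Match m = Regex.Match(
    "http://www.yoursite.com/index?querystring=value",
    @"^(https?:\/\/)?(?:[^.\/?]*[.])?([^.\/?]*[.][^.\/?]*)");
if (m.Success)
    Console.WriteLine(m.Groups[2].Value);  // Prints "yoursite.com"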
I want to validate a URL using regular expression. Following are my conditions to validate the URL:
Scheme is optional
Subdomains should be allowed
Port number should be allowed
Path should be allowed.
I was trying the following pattern:
((http|https)://)?([\w-]+\.)+[\w-]+(/[\w- ./?%&=]*)?
But I am not getting the desired results. Even an invalid URL like '*.example.com' is getting matched.
What is wrong with it?
Are you matching the entire string? You don't say what language you are using, but in Python it looks like you may be using search instead of match.
One way to fix this is to start your regexp with ^ and end it with $.
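To illustrate in C# (the same idea applies to Python's re.search vs. re.match):
// Unanchored: the regex is free to match "example.com" inside "*.example.com".
var loose = new Regex(@"((http|https)://)?([\w-]+\.)+[\w-]+(/[\w- ./?%&=]*)?");

// Anchored: the entire input must fit the pattern, so "*.example.com" fails.
var strict = new Regex(@"^((http|https)://)?([\w-]+\.)+[\w-]+(/[\w- ./?%&=]*)?$");

Console.WriteLine(loose.IsMatch("*.example.com"));   // True
Console.WriteLine(strict.IsMatch("*.example.com"));  // False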
While parsing URLs is best left to a library (since I know Perl best, I would suggest something like http://search.cpan.org/dist/URI/), if you want help debugging that pattern, it might be best to try it in a regex debugger such as http://www.debuggex.com/.
I think the main reason it matches is that you don't anchor the pattern with beginning- and end-of-string markers. Nothing in '*.example.com' has to match your pattern explicitly; because the pattern is unanchored, your regex can match just the 'example.com' portion of the string rather than the entire input.
I found the regular expression for my conditions with help from your inputs:
^(http(s)?://)?[0-9a-zA-Z]([-.\w]*[0-9a-zA-Z])*(:[0-9]+)?(\/?)([a-zA-Z0-9\-\.\?\,\'\/\\\+&%\$#_]*)?$
The following code works for me in C#:
private static bool IsValidUrl(string url)
{
    return new Regex(@"^(http|http(s)?://)?([\w-]+\.)+[\w-]+[.\w]+(\[\?%&=]*)?").IsMatch(url)
        && !new Regex(@"[^a-zA-Z0-9]+$").IsMatch(url);
}
it allows "something.anything (at least 2 later after period) with or without http(s) and www.
I have the following so far:
^((http[s]?|ftp):\/\/)(([^.:\/\s]*)[\.]([^:\/\s]+))(:([^\/]*))?(((\/\w+)*\/)([\w\-\.]+[^#?\s]+)(\?([^#]*))?(#(.*))?)?$
I've been testing against these:
https://www.google.com.ar:8080/dir/1/2/search.html?arg=0-a&arg1=1-b&arg3-c#hash
https://google.com.ar:8080/dir/1/2/search.html?arg=0-a&arg1=1-b&arg3-c#hash
https://google.com:8080/dir/1/2/search.html?arg=0-a&arg1=1-b&arg3-c#hash
http://www.foo.com
http://www.foo.com/
http://blog.foo.com/
http://blog.foo.com.ar/
http://foo.com
http://blog.foo.com
http://foo.com.ar
I'm using the following tool to test the regexes: regex tester
So far I've been able to yield the following groups:
full protocol
reduced protocol
full domain name
subdomain?
top level domain
port
port number
rest of the url
rest of the "directory"
no idea how to drop this group
page name
argument string
argument string
hash tag
hash tag
I will be using this regex to change the subdomain for my application for cross-domain redirect hyperlinks.
Using Request.Url as a parameter, I want to redirect from
http://example.com or http://www.example.com to http://blog.example.com
How can I achieve this?
I can't really tell what, if anything, the current subdomain (nothing, www, blog, or forum, for instance) actually is...
What would be the best way to make this replacement?
What I actually need is some way to find out what the top-level domain is. In either http://www.example.com, http://blog.example.com, or http://example.com, I want to get example.com.
This may not be the answer you're looking for... but IMO the best way would be to make use of the System.Uri class.
The Uri class will easily extract the Host for you - and you can then split the host on the "." delimiter; that should easily give you access to the current subdomain.
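Something like this (a sketch; note that two-part public suffixes such as .com.ar need extra handling):
Uri uri = new Uri("http://blog.example.com/some/page");
string[] parts = uri.Host.Split('.');     // ["blog", "example", "com"]

// Anything before the last two labels is the subdomain ("" for example.com).
string subdomain = parts.Length > 2
    ? string.Join(".", parts, 0, parts.Length - 2)
    : "";
string topLevel = parts[parts.Length - 2] + "." + parts[parts.Length - 1];  // "example.com"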
This is just my opinion - and it's especially formed because I find it hard to maintain regex code like ^((http[s]?|ftp):\/\/)(([^.:\/\s]*)[\.]([^:\/\s]+))(:([^\/]*))?(((\/\w+)*\/)([\w\-\.]+[^#?\s]+)(\?([^#]*))?(#(.*))?)?$
You can use the Uri class to parse the strings. There are many properties available in addition to Segments:
Uri MyUri = new Uri("https://www.google.com.ar:8080/dir/1/2/search.html?arg=0-a&arg1=1-b&arg3-c#hash");

foreach (String Segment in MyUri.Segments)
    Response.Write(Segment + "<br />");
I think you should reconsider whether a regex is really needed in this case.
I think extracting the top-level domain from a URL is quite simple: in the case of "http://www.example.com/?blah=111" you can simply take the part before the 3rd slash, perform a String.Split('.'), and concat the last two array items. In the case of "http://www.example.com", it's even easier.
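A sketch of that approach:
string url = "http://www.example.com/?blah=111";

// The part before the 3rd slash is the host (scheme assumed present here).
string afterScheme = url.Substring(url.IndexOf("://") + 3);
int slash = afterScheme.IndexOf('/');
string host = slash >= 0 ? afterScheme.Substring(0, slash) : afterScheme;

// Concat the last two dot-separated items: "example.com"
string[] parts = host.Split('.');
string topLevel = parts[parts.Length - 2] + "." + parts[parts.Length - 1];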
Regex patterns are very error-prone and quite hard to maintain, and in my opinion you won't gain any advantage from one here. I recommend you get rid of the regex. Perhaps the result will be 2-3 more lines of code, but it will work, and your code will be much more readable and easier to understand.
I am implementing URL rewriting in ASP.net and my URLs are causing me a world of problems.
The URL is generated from a database of departments & categories. I want employees to be able to add items to the database with whatever special characters are appropriate without it breaking the site.
I am encoding the data before I construct the URLs.
There are several problems...
IIS decodes the URL before it reaches .NET, making it impossible to properly parse anything with a "/" in it.
ASP.NET gets confused by the URL, making "~" useless within certain pages.
I migrated from the built-in test server to my local IIS server (an XP machine), and any URL containing an encoded & (%26) gives me a "Bad Request" error.
UrlEncode leaves some breaking characters untouched such as '.'
I did have two other related posts on this subject; at the time I only saw the small problems, not the big problem upstream. I've found some registry tricks to solve the "Bad Request" issue, but I'm going to be deploying to a shared hosting environment, making that useless. I also know that this restriction is a fix for some security issue, so I don't necessarily want to bypass it without knowing what can of worms I'm opening.
Rather than trying to force .NET to pass me the raw URL or override IIS settings, I'd like to generate truly safe URLs in the first place.
I'll note I've tried AntiXss.UrlEncode, HttpUtility.UrlEncode, and Uri.EscapeDataString. I've even tried stupid things like double URL-encoding. Is there a utility that does what I need, or do I really need to roll my own? I'm even considering doing something hacky like replacing the % with an unusual string of characters. The end result should be at least readable, which was the point of using URL rewriting in the first place.
Sorry for the long post- I just wanted to make sure that I've included all the necessary details. I can't seem to find any relevant information on this, and it seems like it would be a common problem - so maybe I'm missing something big. Thanks for your help, and patience with the long explanation!
Edit for clarity:
When I say the URLs are being built from a database, what I mean is that the directory structure is constructed from the departments and categories in my database.
Some Example URLS -
Mystore/Refrigeration/Bar+Fridge.aspx
Mystore/Cooking+Equipment.aspx
Mystore/Kitchen/Cutting+Boards.aspx
The problems come in when I use a department like "Beverage & Bar" or "Pastry/Decorating" to construct my URL. Despite being encoded first these cause the aforementioned issues.
My handlers are already implemented and working fine except for the special character encoding issues.
You should consider having a table off of your category/department table which has a unique URL for each category. Then you can use a special routine to generate the URLs. This can be a SQL scalar function, or a CLR function, but one of the things it would do is normalize the URL for the web. You can convert "Beverage & Bar" to "Beverage-And-Bar" and "Pastry / Decorating" to "Pastry-Decorating". Mainly, the routine needs to replace all invalid HTTP URL characters with something else. An example is this:
public static class URL
{
    static readonly Regex feet    = new Regex(@"([0-9]\s?)'([^'])", RegexOptions.Compiled);
    static readonly Regex inch1   = new Regex(@"([0-9]\s?)''", RegexOptions.Compiled);
    static readonly Regex inch2   = new Regex(@"([0-9]\s?)""", RegexOptions.Compiled);
    static readonly Regex num     = new Regex(@"#([0-9]+)", RegexOptions.Compiled);
    static readonly Regex dollar  = new Regex(@"[$]([0-9]+)", RegexOptions.Compiled);
    static readonly Regex percent = new Regex(@"([0-9]+)%", RegexOptions.Compiled);
    static readonly Regex sep     = new Regex(@"[\s_/\\+:.]", RegexOptions.Compiled);
    static readonly Regex empty   = new Regex(@"[^-A-Za-z0-9]", RegexOptions.Compiled);
    static readonly Regex extra   = new Regex(@"[-]+", RegexOptions.Compiled);

    public static string PrepareURL(string str)
    {
        str = str.Trim().ToLower();
        str = str.Replace("&", "and");
        str = feet.Replace(str, "$1-ft-");
        str = inch1.Replace(str, "$1-in-");
        str = inch2.Replace(str, "$1-in-");
        str = num.Replace(str, "num-$1");
        str = dollar.Replace(str, "$1-dollar-");
        str = percent.Replace(str, "$1-percent-");
        str = sep.Replace(str, "-");
        str = empty.Replace(str, string.Empty);
        str = extra.Replace(str, "-");
        str = str.Trim('-');
        return str;
    }
}
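For the two example categories, the routine produces (a quick check, not part of the original answer):
Console.WriteLine(URL.PrepareURL("Beverage & Bar"));      // beverage-and-bar
Console.WriteLine(URL.PrepareURL("Pastry / Decorating")); // pastry-decorating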
You could make this a SQL function, or run URL generation as a separate process. Then, to implement the mapping, you would map the entire URL directly to a category ID. This approach is better in the long run for several reasons. First, you are not constantly regenerating URLs: you generate them once and they stay static, so you don't have to worry about your procedure changing and GoogleBot no longer being able to find old URLs. Also, if you get a collision, you may notice a potential duplicate category name, because a collision would only differ by special characters. Finally, you can always view your URLs from the database, without having to run the mapping function.
I have a URL rewrite that I implement in the global.asax file in the begin-authenticated-request event, as I have some security. This is where I take the raw URL and do the DB lookup; it then rewrites the path to the .aspx page, and all the parameters are passed through the query string. No encoding is necessary.
However, if you are using the URL to actually change data, then I can see that you will have huge problems, as you are effectively using an HTTP GET to change the database. That is usually considered a bad idea, and not something I do.
I only use a POST request to do any database manipulation. This keeps the URL clean, as all the data is in the page form.
The only issue I had was setting the correct URL on page.Form.Action, which in most cases is the raw URL.
If it's the category names that are causing the issue, then perhaps you should restrict the names to alphanumeric characters only and swap spaces for "-". IIS will throw a wobbly with periods "." as it looks for file names.
P.S.
IIS does not understand the tilde "~"; that is something ASP.NET resolves. So if you use it in an anchor tag it will not work as expected, and you should use the application root instead of the tilde.
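For example, you can resolve the application-relative path yourself before writing the anchor (VirtualPathUtility lives in System.Web):
// "~/products/list.aspx" becomes "/MyApp/products/list.aspx"
// when the site runs under the /MyApp application root.
string href = VirtualPathUtility.ToAbsolute("~/products/list.aspx");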
Edit:
OK, it looks like the problem is IIS having trouble with certain characters such as ".", "/", and "&". Even if you URL-encode these, IIS will still try to apply its own meanings.
As such consider removing them so:
Beverage & bar becomes BeverageBar
Pastry / decorating becomes PastryDecorating.
This will keep your URLs clean, but it does mean an extra column in the database so you can check the URL against this shortened category name.
I'm having the exact same problem. Thanks for writing it up so nicely. It actually helped me to understand the problem better.
I had some other considerations, however. One of my goals is to support the potential for any characters to be in the URL, which is based on the title of an article. Additionally, I want to ensure uniqueness in the encoding and a two-way encode/decode process.

So I did some manual encoding to solve the problem. This won't completely eliminate percent-encoding, but it greatly reduces it and keeps users from generating an inaccessible URL. My process starts with the Server.UrlEncode function, but that doesn't eliminate the problems in the URL. Because IIS decodes the URL and then passes it to the application, certain characters will break it with a dangerous-request exception. These characters include +, &, /, !, *, ., ( and ). So for those characters, plus others I would like to make more readable, I do a double encoding for a more usable URL.

Encoding is also hard because of the limited number of characters allowed in a URL. So prior to encoding I made all letters capital and then did the encoding in lower case. This keeps it from being totally decodable, but I can easily do a match in the database or in code by upper-casing the value I wish to match.
Well, here is my code. Feedback would be appreciated. Oh yeah, this is in VB, but things should transfer over to C# easily enough.
' The enclosing Function is illustrative; the original snippet showed only the body.
Private Function EncodeForUrl(ByVal strStringToEncode As String) As String
    Dim strReturn As String = Trim(strStringToEncode)
    strReturn = Server.UrlEncode(strReturn)
    ' Protect real hyphens first, then turn encoded spaces ("+") into hyphens.
    strReturn = strReturn.Replace("-", "dash").Replace("+", "-")
    ' Swap the troublesome percent-escapes (and a few raw characters) for readable tokens.
    strReturn = strReturn.Replace("%26", "and").
        Replace("%2f", "or").
        Replace("!", "excl").
        Replace("*", "star").
        Replace("%27", "apos").
        Replace("(", "lprn").
        Replace(")", "rprn").
        Replace("%3b", "semi").
        Replace("%3a", "coln").
        Replace("%40", "at").
        Replace("%3d", "eq").
        Replace("%2b", "plus").
        Replace("%24", "dols").
        Replace("%25", "pct").
        Replace("%2c", "coma").
        Replace("%3f", "query").
        Replace("%23", "hash").
        Replace("%5b", "lbrk").
        Replace("%5d", "rbrk").
        Replace(".", "dot").
        Replace("%3e", "gt").
        Replace("%3c", "lt")
    Return strReturn
End Function
I guess you are looking for HttpUtility.UrlEncode and HttpUtility.UrlDecode:
string url = "http://www.google.com/search?q=" + HttpUtility.UrlEncode("Example");
If I have a series of "pattern" Urls of the form:
http://{username}.sitename.com/
http://{username}.othersite.net/
http://mysite.com/{username}
and I have an actual Url of the form:
http://joesmith.sitename.com/
Is there any way that I can match a pattern Url and in turn use it to extract the username portion out the actual Url? I've thought of nasty ways to do it, but it just seems like there should be a more intuitive way to accomplish this.
ASP.NET MVC uses a similar approach to extract the various segments of the URL when it is building its routes. Given the example:
{controller}/{action}
So given a URL of the form Home/Index, it knows to call the Index action method on the Home controller.
Not sure I understand this question correctly but you can just use a regular expression to match anything between 'http://' and the first dot.
A very simple regex will do:
':https?://([a-z0-9\.-]*[a-z0-9])\.sitename\.com'
This will allow any subdomain that only contains valid subdomain characters. Example of allowed subdomains:
joesmith.sitename.com
joe.smith.sitename.com
joe-smith.sitename.com
a-very-long-subdomain.sitename.com
As you can see, you might want to complicate the regex slightly. For instance, you could limit it to only allow a certain amount of characters in the subdomain.
It seems the quickest and easiest solution is to go off of Machine's answer.
var givenUri = "http://joesmith.sitename.com/";
var patternUri = "http://{username}.sitename.com/";

patternUri = patternUri.Replace("{username}", @"([a-z0-9\.-]*[a-z0-9])");

var result = Regex.Match(givenUri, patternUri, RegexOptions.IgnoreCase).Groups;
if (!String.IsNullOrEmpty(result[1].Value))
    return result[1].Value;
Seems to work great.
Well, this "pattern URL" is a format you've made up, right? You basically you'll just need to process it.
If the format of it is:
anything inside "{ }" is a thing to capture, everything else must be as is
Then you'd just find the start/end indexes of those brackets and match everything else literally. When you get to a placeholder, make sure you only consume characters that don't match whatever literal 'token' comes after the closing '}'.
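A sketch of that idea in C#, using a named capture group for each placeholder (the helper name is made up):
// Escape the literal parts, then turn each "{token}" into a named group.
// "[^./]+" assumes a token can't contain dots or slashes.
static string PatternToRegex(string pattern)
{
    string body = Regex.Replace(
        Regex.Escape(pattern),          // "." -> "\.", "{" -> "\{", etc.
        @"\\\{(\w+)}",                  // find the escaped {token} placeholders
        m => "(?<" + m.Groups[1].Value + ">[^./]+)");
    return "^" + body + "$";
}

// Usage:
// Match m = Regex.Match("http://joesmith.sitename.com/",
//                       PatternToRegex("http://{username}.sitename.com/"));
// m.Groups["username"].Value  // "joesmith"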
There are definitely different ways - ultimately though your server must be configured to handle (and possibly route) these different subdomain requests.
What I would do would be to answer all subdomain requests (except maybe some reserved words, like 'www', 'mail', etc.) on sitename.com with a single handler or page (I'm assuming ASP.NET here based on your C# tag).
I'd use the request path, which is easy enough to get, with some simple string parsing/regex routines (remove the 'http://', grab the first token up until '.' or '/' or '\', etc.) and then use that in a session, making sure to observe URL changes.
Alternately, you could map certain virtual paths to request urls ('joesmith.sitename.com' => 'sitename.com/index.aspx?username=joesmith') via IIS but that's kind of nasty too.
Hope this helps!
I am currently looking to detect whether an URL is encoded or not. Here are some specific examples:
http://www.linxology.com/browse.php?u=Oi8vZXNwbnN0YXIuY29tL21lZGlhLXBsYXllci8%3D&b=13
http://www.linxology.com/browse.php?u=Oi8vZXNwbnN0YXIuY29tL290aGVyX2ZpbGVzL2VzcG5zdGFyL25hdl9iZy1vZmYucG5n&b=13
Can you please give me a Regular Expression for this?
Is there a self-learning regular expression generator out there that can converge on a perfect regex as the number of inputs increases?
If you are interested in the base64-encoded URLs, you can do it.
A little theory. If L, R are regular languages and T is a regular transducer, then LR (concatenation), L & R (intersection), L | R (union), TR(L) (image), TR^-1(L) (kernel) are all regular languages. Every regular language has a regular expression that generates it, and every regexp generates a regular language. URLs can be described by regular language (except if you need a subset of those that is not), almost every escaping scheme (and base64) is a regular transducer. Therefore, in theory, it's possible.
In practice, it gets rather messy.
A regex for valid base64 strings is ([A-Za-z0-9+/]{4})*(|[A-Za-z0-9+/]{2}(==|[A-Za-z0-9+/]=))
If it is embedded in a query parameter of a URL, it will probably be urlencoded. Let's assume only the = will be urlencoded (other characters can be too, but don't need to be).
This gets us to something like [?&][^?&#=;]+=([A-Za-z0-9+/]{4})*(|[A-Za-z0-9+/]{2}(%3D%3D|[A-Za-z0-9+/]%3D))
Another possibility is to consider only those base64-encoded URLs that have some property - in your case, they all begin with "://", which is fortunate, because that translates to exactly 4 characters: "Oi8v". Otherwise, it would be more complex.
This gets [?&][^?&#=;]+=Oi8v([A-Za-z0-9+/]{4})*(|[A-Za-z0-9+/]{2}(%3D%3D|[A-Za-z0-9+/]%3D))
As you can see, it gets messier and messier. Therefore, I'd recommend instead that you (see the sketch after this list):
break the URL into its parts (e.g. protocol, host, query string)
get the parameters from the query string, and urldecode them
try base64-decoding the values of the parameters
apply your criterion for "good encoded URLs"
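A sketch of those steps (HttpUtility is in System.Web; the "starts with ://" test is just the criterion from above):
Uri uri = new Uri("http://www.linxology.com/browse.php?u=Oi8vZXNwbnN0YXIuY29tL21lZGlhLXBsYXllci8%3D&b=13");

// ParseQueryString urldecodes each value for us.
var query = System.Web.HttpUtility.ParseQueryString(uri.Query);
foreach (string key in query)
{
    try
    {
        byte[] bytes = Convert.FromBase64String(query[key]);
        string decoded = System.Text.Encoding.UTF8.GetString(bytes);
        if (decoded.StartsWith("://"))
            Console.WriteLine("{0} looks base64-encoded: {1}", key, decoded);
    }
    catch (FormatException)
    {
        // Not valid base64; ignore this parameter.
    }
}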
Well, depending on what is in that encoded text, you might not even need a regular expression. If there are multiple querystring parameters in that one "u" key, perhaps you could just check the length of the text on each querystring value, and if it is over (say) 50, you can assume it's probably encoded. I doubt any unencoded single parameters would be as long as these, since those would have to be string data, and therefore they would probably need to be encoded!
This question may be harder than you realize. For example:
I could say that if a URL includes a question mark character, then what follows it is encoded.
Now, it may be simple encoding like "?year=2009" or complicated like in your examples.
Or
The site URLs could use URL rewriting (like this site does). Look at the URL of this question. The "615958" is encoded and... no question marks were used!
In fact, you could say that the entire URL is encoded!
Perhaps you need to better define what you mean by "encoded".
You can't reliably parse URLs using regex. (Is this an SO mantra yet?)
Here are some specific examples:
It's not clear what ‘encoded’ means — can you give some counter-examples of URLs you consider “not encoded”?
Are you talking about the Base64 encoding in the ‘u’ parameter? Whilst it is possible to say whether a string is a valid Base64 string, it's not possible to detect Base64 and distinguish it from anything else; for example the word “sausages” also happens to be valid Base64 (it decodes to '\xb1\xab\xacj\x07\xac').