Regular expression to extract domain name from any domain

Regular expression to extract domain name from any domain - c#

I'm trying to extract the domain name from a string in C#. You don't necessarily have to use a RegEx but we should be able to extract yourdomain.com from all of the following:
yourdomain.com
www.yourdomain.com
http://www.yourdomain.com
http://www.yourdomain.com/
store.yourdomain.com
http://store.yourdomain.com
whatever.youdomain.com
*.yourdomain.com
Also, any TLD is acceptable, so replace all the above with .net, .org, 'co'uk, etc.

If no scheme present (no colon in string), prepend "http://" to make it a valid URL.
Pass string to Uri constructor.
Access the Uri's Host property.
Now you have the hostname. What exactly you consider the ‘domain name’ of a given hostname is a debatable point. I'm guessing you don't simply mean everything after the first dot.
It's not possible to distinguish hostnames like ‘whatever.youdomain.com’ from domains-in-an-SLD like ‘warwick.ac.uk’ from just the strings. Indeed, there is even a bit of grey area about what is and isn't a public SLD, given the efforts of some registrars to carve out their own niches.
A common approach is to maintain a big list of SLDs and other suffixes used by unrelated entities. This is what web browsers do to stop unwanted public cookie sharing. Once you've found a public suffix, you could add the one nearest prefix in the host name split by dots to get the highest-level entity responsible for the given hostname, if that's what you want. Suffix lists are hell to maintain, but you can piggy-back on someone else's efforts.
Alternatively, if your app has the time and network connection to do it, it could start sniffing for information on the hostname. eg. it could do a whois query for the hostname, and keep looking at each parent until it got a result and that would be the domain name of the lowest-level entity responsible for the given hostname.
Or, if all that's too much work, you could try just chopping off any leading ‘www.’ present!

I would recommend trying this yourself. Using regulator and a regex cheat sheet.
http://sourceforge.net/projects/regulator/
http://regexlib.com/CheatSheet.aspx
Also find some good info on Regular Expressions at coding horror.

Have a look at this other answer. It was for PHP but you'll easily get the regex out of the 4-5 lines of PHP and you can benefit from the discussion that followed (see Alnitak's answer).

A regex doesn't really fit your requirement of "any TLD", since the format and number of TLDs is quite large and continually in flux. If you limited your scope to:
(?<domain>[^\.]+\.([A-Z]+$|co\.[A-Z]$))
You would catch .anything and .co.anything, which I imagine covers most realistic cases...

Related

Regex for google referrer validation

I am totally new to Regex and have been trying to do this with little success.
Basically what I want to do is to create a regex that matches any google domain such as Google.com, Google.co.uk, etc.
So far I have ^http://www.google\.com/.*$, but this only matches Google.com. How can I modify it to allow any extension besides com?
Thanks!

You could use alternation, but then you would have to supply all TLDs you want to allow:
^http://www\.google\.(?:com|co\.uk|de|es)/.*$
Add more options separated by pipes. Alternatively, you could allow any TLD (whether valid or not) with this:
^http://www\.google\.[a-z.]+/.*$
However this would also match something like http://www.google.myowndomain.com/. I don't think there would be any approach that allows only valid domains without listing them all.
By the way, if you want to make that slash and the path/query at the end optional, change that to one of the following:
^http://www\.google\.(?:com|co\.uk|de|es)(?:/.*)?$
^http://www\.google\.[a-z.]+(?:/.*)?$
And then you could go another step further and make the www. optional:
^http://(?:www\.)?google\.(?:com|co\.uk|de|es)(?:/.*)?$
^http://(?:www\.)?google\.[a-z.]+(?:/.*)?$
You see, matching all possible but valid URLs for a given problem is not an easy task, but one that needs careful consideration ;).
Depending on the language you are using there might be better options with built-in URL-parsing functions. In PHP for instance, this would be a much easier approach:
$domain = parse_url($urlStr, PHP_URL_HOST);
$isGoogle = preg_match('/^(?:www\.)?google\.[a-z.]+/', $domain);
Or (since this is not perfect anyway, as outlined above) you could abandon regex altogether and do the check like this:
$isGoogle = strpos($domain, 'google.') !== false;

Getting domain from subdomain

I get url as
http://orders.mealsandyou.com/default.php
i dont want to use string functions to use it to get the main domain ie
mealsandyou.com
is there any function in c# to do that, UrilAuthority and all gives subdomain too...
Suggestions welcome, not workarounds

.Net doesn't provide a built-in feature to extract specific parts from Uri.Host. You will have to use string manipulation or a regular expression yourself.

The only constant part of the domain string is the TLD. The TLD is the very last bit of the domain string, eg .com, .net, .uk etc. Everything else under that depends on the particular TLD for its position (so you can't assume the next to last part is the "domain name" as, for .co.uk it would be .co.
In any case I think you're taking the wrong approach. URL rewriting is far more suited to this sort of thing. Have a read of this: learn.iis.net/page.aspx/460/using-the-url-rewrite-module

How to recognize urls with similar patterns in C#?

I need a way to recognize urls with similar pattern, e.g. a function which returns true when matched
http://mysite.com/page/123
and
http://mysite.com/page/456
or
http://mysite.com/?page=123
and
http://mysite.com/?page=456
or
http://mysite.com/?page=123&param=2
and
http://mysite.com/?page=456&param=3
I don't need to check validity of urls here, only find out if the pattern is the same.
I probably need a regular expression for it, but can't figure out how to do it. Can anyone help? Thanks.

May be you can try levenshtein distance
http://www.dotnetperls.com/levenshtein, which is used to find similarity between strings.

Use a lowest common subsequence algorithm and divide by the length of either of the strings. If it's above an arbitrary number, they're common enough.

Not a specific answer, but I feel that if you want this to work well in a generalised sense, you will need to be content-aware, i.e. you need to break each URL into subsections:
Protocol
Domain
Path
Querystrings
... And process each separately. The level of acceptable fuzziness will control how much you need to break up the URL, but each section would (I feel) need quite specific inspection. The protocol and domain could be straight string matches, but the paths could perhaps be split by '/' and then after basic length checks, the elements could be compared one by one, only comparing items of equal depth (using direct equality or a "change distance" like the Levenshtein distance mentioned earlier). The querystrings could be broken up into dictionaries via a simple split on "&" then by "=", which you could sort and compare however you want. This would also satisfy #MarcGravell's question about reordered querystring parameters.

Regex to match subdomain?

I have the following so far:
^((http[s]?|ftp):\/\/)(([^.:\/\s]*)[\.]([^:\/\s]+))(:([^\/]*))?(((\/\w+)*\/)([\w\-\.]+[^#?\s]+)(\?([^#]*))?(#(.*))?)?$
Been testing against these:
https://www.google.com.ar:8080/dir/1/2/search.html?arg=0-a&arg1=1-b&arg3-c#hash
https://google.com.ar:8080/dir/1/2/search.html?arg=0-a&arg1=1-b&arg3-c#hash
https://google.com:8080/dir/1/2/search.html?arg=0-a&arg1=1-b&arg3-c#hash
http://www.foo.com
http://www.foo.com/
http://blog.foo.com/
http://blog.foo.com.ar/
http://foo.com
http://blog.foo.com
http://foo.com.ar
I'm using the following tool to test the regexes: regex tester
So far I've been able to yield the following groups:
full protocol
reduced protocol
full domain name
subdomain?
top level domain
port
port number
rest of the url
rest of the "directory"
no idea how to drop this group
page name
argument string
argument string
hash tag
hash tag
I will be using this regex to change the subdomain for my application for cross-domain redirect hyperlinks.
Using Request.Url as a parameter, I want to redirect from
http://example.com or http://www.example.com to http://blog.example.com
How can I achieve this?
I can't really tell what, if any, the current subdomain ( either nothing, www, blog, or forum, for instance) actually is...
What would be the best way to make this replacement?
What I actually need is some way to find out what the top level domain is. in either http://www.example.com, http://blog.example.com, or http://example.com I want to get example.com.

What would be the best way to make this replacement?
This may not be the answer you're looking for... but IMO the best way would be to make use of the System.Uri class.
The Uri class will easily extract the Host for you - and you can then split the host on "." delimiter - that should easily give you access to the current subdomain.
This is just my opinion - and its especially formed because I find it hard to maintain regex code like ^((http[s]?|ftp):\/\/)(([^.:\/\s]*)[\.]([^:\/\s]+))(:([^\/]*))?(((\/\w+)*\/)([\w\-\.]+[^#?\s]+)(\?([^#]*))?(#(.*))?)?$

You can use the Uri class to parse the strings. There are many properties available in addition to Segments:
Uri MyUri = new Uri("https://www.google.com.ar:8080/dir/1/2/search.html?arg=0-a&arg1=1-b&arg3-c#hash");
foreach (String Segment in MyUri.Segments)
Response.Write(Segment + "<br />");

I think you should reconsider whether usage of a RegEx is really needed in this case;
I think extracting the top level domain from an URL is quite simple; in case of "http://www.example.com/?blah=111" you can simply take the part before the 3rd slash and perform a String.Split('.') and concat the last two array items. In case of "http://www.example.com", even easier.
Regex-patterns are very error-prone and quite hard to maintain and according to me you won't get any advantage of it. I recommend you to get rid off the Regex. Perhaps the result will be 2 - 3 more lines of code, but it will work, your code will be much better readable and easier to understand.

How do I use a pattern Url to extract a segment from an actual Url?

If I have a series of "pattern" Urls of the form:
http://{username}.sitename.com/
http://{username}.othersite.net/
http://mysite.com/{username}
and I have an actual Url of the form:
http://joesmith.sitename.com/
Is there any way that I can match a pattern Url and in turn use it to extract the username portion out the actual Url? I've thought of nasty ways to do it, but it just seems like there should be a more intuitive way to accomplish this.
ASP.NET MVC uses a similar approach to extract the various segments of the URL when it is building its routes. Given the example:
{controller}/{action}
So given the Url of the form, Home/Index, it knows that it is the Home controller calling the Index action method.

Not sure I understand this question correctly but you can just use a regular expression to match anything between 'http://' and the first dot.

A very simple regex will do:
':https?://([a-z0-9\.-]*[a-z0-9])\.sitename\.com'
This will allow any subdomain that only contains valid subdomain characters. Example of allowed subdomains:
joesmith.sitename.com
joe.smith.sitename.com
joe-smith.sitename.com
a-very-long-subdomain.sitename.com
As you can see, you might want to complicate the regex slightly. For instance, you could limit it to only allow a certain amount of characters in the subdomain.

It seems the the quickest and easiest solution is going off of Machine's answer.
var givenUri = "http://joesmith.sitename.com/";
var patternUri = "http://{username}.sitename.com/";
patternUri = patternUri.Replace("{username}", #"([a-z0-9\.-]*[a-z0-9]");
var result = Regex.Match(givenUri, patternUri, RegexOptions.IgnoreCase).Groups;
if(!String.IsNullOrEmpty(result[1].Value))
return result[1].Value;
Seems to work great.

Well, this "pattern URL" is a format you've made up, right? You basically you'll just need to process it.
If the format of it is:
anything inside "{ }" is a thing to capture, everything else must be as is
Then you'd just find the start/end index of those brackets, and match everything else. Then when you get to a place where one is, make sure you only look for chars such that they don't match whatever 'token' comes after the next ending '}'.

There are definitely different ways - ultimately though your server must be configured to handle (and possibly route) these different subdomain requests.
What I would do would be to answer all subdomain requests (except maybe some reserved words, like 'www', 'mail', etc.) on sitename.com with a single handler or page (I'm assuming ASP.NET here based on your C# tag).
I'd use the request path, which is easy enough to get, with some simple string parsing/regex routines (remove the 'http://', grab the first token up until '.' or '/' or '\', etc.) and then use that in a session, making sure to observe URL changes.
Alternately, you could map certain virtual paths to request urls ('joesmith.sitename.com' => 'sitename.com/index.aspx?username=joesmith') via IIS but that's kind of nasty too.
Hope this helps!

Develop Reference

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.

Regular expression to extract domain name from any domain - c#

I would recommend trying this yourself. Using regulator and a regex cheat sheet. http://sourceforge.net/projects/regulator/ http://regexlib.com/CheatSheet.aspx Also find some good info on Regular Expressions at coding horror.

Have a look at this other answer. It was for PHP but you'll easily get the regex out of the 4-5 lines of PHP and you can benefit from the discussion that followed (see Alnitak's answer).

A regex doesn't really fit your requirement of "any TLD", since the format and number of TLDs is quite large and continually in flux. If you limited your scope to: (?<domain>[^\.]+\.([A-Z]+$|co\.[A-Z]$)) You would catch .anything and .co.anything, which I imagine covers most realistic cases...

Related

Regex for google referrer validation

Getting domain from subdomain

How to recognize urls with similar patterns in C#?

Regex to match subdomain?

How do I use a pattern Url to extract a segment from an actual Url?

Categories

Resources