RegEx to detect syntactically correct URL [duplicate]

This question already has answers here:
What is the best regular expression to check if a string is a valid URL?
(62 answers)
Closed 8 years ago.
I use ASP.NET 4 and C# Web Forms.
In my web application, users can add URLs using a TextBox.
I need to make sure that every value entered has a syntactically correct URL format (I do not need to check whether the URL really exists).
As a first rule, I would like to use a CustomValidator control to check that the user's input begins with the string "http://".
My questions:
Can you provide a RegEx for my CustomValidator control that accepts only strings beginning with "http://"?
Do you have any other RegEx-based rules to suggest?
What is your best practice for detecting a syntactically correct URL?
Thanks for your help.

An easier approach in many ways, and more flexible to later changes, is to just try it and see:
public static bool IsValidHttpUri(string uriString)
{
    Uri test = null;
    // Parse as an absolute URI first, then require the http scheme.
    return Uri.TryCreate(uriString, UriKind.Absolute, out test) && test.Scheme == "http";
}
Using Uri.IsWellFormedUriString is easier still, but doesn't check your requirement that the URI must be an HTTP one.
Edit: Whether this considers IRIs valid depends on configuration; see the section on "International Resource Identifier Support" at http://msdn.microsoft.com/en-us/library/system.uri.aspx. As a rule, whether you want them to be considered valid will match this configuration setting anyway, so this is actually a benefit in most cases.
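A minimal sketch of the same idea extended to accept https as well, which the question did not ask for but which most forms need. The class and method names (UriChecks, IsValidWebUri) are made up for illustration; Uri.TryCreate, Uri.UriSchemeHttp and Uri.UriSchemeHttps are standard members of System.Uri.

```csharp
using System;

public static class UriChecks
{
    // Returns true when the input parses as an absolute URI
    // whose scheme is http or https. Null input yields false.
    public static bool IsValidWebUri(string uriString)
    {
        Uri parsed;
        return Uri.TryCreate(uriString, UriKind.Absolute, out parsed)
            && (parsed.Scheme == Uri.UriSchemeHttp || parsed.Scheme == Uri.UriSchemeHttps);
    }
}
```

In a CustomValidator's ServerValidate handler, the result of such a call could be assigned to args.IsValid.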

Regex regx = new Regex("http://([\\w+?\\.\\w+])+([a-zA-Z0-9\\~\\!\\@\\#\\$\\%\\^\\&\\*\\(\\)_\\-\\=\\+\\\\\\/\\?\\.\\:\\;\\'\\,]*)?", RegexOptions.IgnoreCase);

In my (limited) experience, a regex wastes a lot of resources on such a simple task (checking that a string starts with http:// or https://).
You might also want to check whether the URL contains 'illegal' characters; read up on URL encoding for that.
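The prefix check described above needs no regex at all. A minimal sketch, with the hypothetical names PrefixCheck and HasHttpPrefix; it only tests the prefix and says nothing about whether the rest of the string is a well-formed URL.

```csharp
using System;

public static class PrefixCheck
{
    // True when the value starts with "http://" or "https://",
    // compared without regard to case. Null input yields false.
    public static bool HasHttpPrefix(string value)
    {
        if (value == null) return false;
        return value.StartsWith("http://", StringComparison.OrdinalIgnoreCase)
            || value.StartsWith("https://", StringComparison.OrdinalIgnoreCase);
    }
}
```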

Related

How can I check if a string is url friendly

I'm making an ecommerce application and I want users to be able to put content at a URL they have specified. If a user enters something like "/thank-you!", how can I either clean the string into a valid URL or check that it is in a valid URL format? I would want the URL to always be hyphenated between words, so "/thank-you" rather than "/thankyou". What's the best approach for achieving this? I'm working in C# with .NET MVC 4.

Alas, I cannot comment 'possible duplicate' yet (How to check whether a string is a valid HTTP URL?).
As this must be an answer, however: one way to validate a URL string is to use the Uri.TryCreate functionality. See also https://msdn.microsoft.com/en-us/library/system.uri.trycreate(v=vs.110).aspx
Uri is also the preferred data type for URLs, rather than string.
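Uri.TryCreate only tells you whether a string parses; it does not clean anything. For the cleaning half of the question, one possible sketch is a small slug helper. SlugHelper and ToSlug are hypothetical names, and the rule used here (lowercase, collapse every run of characters outside a-z and 0-9 into a single hyphen) is an assumption, not something the question specifies. Non-ASCII letters are simply dropped, and the helper cannot guess word boundaries, so "thankyou" stays "thankyou".

```csharp
using System.Text.RegularExpressions;

public static class SlugHelper
{
    // Lowercases the input, replaces each run of characters outside
    // a-z and 0-9 with one hyphen, then trims hyphens from both ends.
    public static string ToSlug(string input)
    {
        string lowered = (input ?? string.Empty).ToLowerInvariant();
        string hyphenated = Regex.Replace(lowered, "[^a-z0-9]+", "-");
        return hyphenated.Trim('-');
    }
}
```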

RegEx to Validate URL with optional Scheme

I want to validate a URL using regular expression. Following are my conditions to validate the URL:
Scheme is optional
Subdomains should be allowed
Port number should be allowed
Path should be allowed.
I was trying the following pattern:
((http|https)://)?([\w-]+\.)+[\w-]+(/[\w- ./?%&=]*)?
But I am not getting the desired results. Even an invalid URL like '*.example.com' is getting matched.
What is wrong with it?

Are you matching the entire string? You don't say what language you are using, but in Python it looks like you may be using search instead of match.
One way to fix this is to start your regexp with ^ and end it with $.
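To illustrate the effect of anchoring in C#, here is a small sketch that runs the question's pattern with and without ^ and $. The class and method names are hypothetical, and the hyphen inside the last character class has been moved to the end purely for readability. The pattern still has no port support; only the anchoring is being demonstrated.

```csharp
using System.Text.RegularExpressions;

public static class AnchoringDemo
{
    // The question's pattern, unanchored.
    private const string Core =
        @"((http|https)://)?([\w-]+\.)+[\w-]+(/[\w ./?%&=-]*)?";

    // Unanchored: succeeds if any substring of the input matches.
    public static bool MatchesAnywhere(string input)
    {
        return Regex.IsMatch(input, Core);
    }

    // Anchored: the whole input must match from start to end.
    public static bool MatchesWhole(string input)
    {
        return Regex.IsMatch(input, "^" + Core + "$");
    }
}
```

With "*.example.com", MatchesAnywhere succeeds because the substring "example.com" matches, while MatchesWhole fails because the leading "*" cannot be consumed.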

While parsing URLs is best left to a library (since I know Perl best, I would suggest something like http://search.cpan.org/dist/URI/), if you want help debugging that pattern, it might be best to try it in a debugger such as http://www.debuggex.com/.
I think one of the main reasons it matches is that you don't use start-of-string and end-of-string anchors. Without them, the regex does not have to match the entire input; it can simply match 'example.com' inside your string and ignore the rest.

With help from your input, I found a regular expression for my conditions:
^(http(s)?://)?[0-9a-zA-Z]([-.\w]*[0-9a-zA-Z])*(:[0-9]+)?(\/?)([a-zA-Z0-9\-\.\?\,\'\/\\\+&%\$#_]*)?$

The following code works for me in C#:
private static bool IsValidUrl(string url)
{
    // The first pattern checks the overall shape; the second rejects input that ends in a non-alphanumeric character.
    return new Regex(@"^(https?://)?([\w-]+\.)+[\w-]+[.\w]+(\[\?%&=]*)?").IsMatch(url) && !new Regex(@"[^a-zA-Z0-9]+$").IsMatch(url);
}
It allows "something.anything" (at least two characters after the period), with or without http(s) and www.

Regex to match a fragment of the URL

I have URL's like:
http://127.0.0.1:81/controller/verbOne/NXw4fDF8MXwxfDQ1?source=dddd
or
http://127.0.0.1:81/controller/verbTwo/NXw4fDF8MXwxfDQ1
I'd like to extract the part in bold (the final path segment, NXw4fDF8MXwxfDQ1). The host and port can change to anything (when I publish to a live server they will change). The controller never changes. For the verb part, there are two possibilities.
Can anyone help me with the regex?
Thanks

Instead of using a regex you could use the built in functionality of Uri
Uri uri = new Uri("http://127.0.0.1:81/controller/verbOne/NXw4fDF8MXwxfDQ1?source=dddd");
var lastSegment = uri.Segments.Last();

You're looking for the Uri and Path classes:
Path.GetFileName(new Uri(str).AbsolutePath)
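Both suggestions can be put side by side in a small sketch. The wrapper class and method names are hypothetical; Uri.Segments, Uri.AbsolutePath and Path.GetFileName are standard framework members. Neither value includes the query string, since Segments and AbsolutePath both stop before the "?".

```csharp
using System;
using System.IO;
using System.Linq;

public static class LastSegmentDemo
{
    // Last entry of Uri.Segments; the query string is not part of any segment.
    public static string ViaSegments(string url)
    {
        return new Uri(url).Segments.Last();
    }

    // Treats the URI path like a file path and takes the final name.
    public static string ViaPath(string url)
    {
        return Path.GetFileName(new Uri(url).AbsolutePath);
    }
}
```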

Why look for a regex? You can search for the two string elements "verbOne/" or "verbTwo/" and take the substring from there to the end. Then you can find the '?' and cut off everything from it onward.
I think this is faster than a regex.
krikit

Everyone else here is correct that a regex is not the best solution: a hand-written pattern can fail where a purpose-built parser will not. Still, I believe you could use the following regex:
(?<=http://127\.0\.0\.1:81/controller/verb(One|Two)/)[a-zA-Z0-9]*
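Because the question says the host and port will change on a live server, a variant that does not hard-code them may be more useful. This is a sketch under that assumption, with hypothetical names (TokenExtractor, ExtractToken); it swaps the lookbehind for a capture group anchored on the fixed "/controller/verbOne|verbTwo/" part.

```csharp
using System.Text.RegularExpressions;

public static class TokenExtractor
{
    // Matches the fixed controller and verb part, then captures the
    // run of letters and digits that follows it.
    private static readonly Regex Pattern =
        new Regex(@"/controller/verb(?:One|Two)/([A-Za-z0-9]+)");

    // Returns the captured token, or null when the URL does not match.
    public static string ExtractToken(string url)
    {
        Match m = Pattern.Match(url);
        return m.Success ? m.Groups[1].Value : null;
    }
}
```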

Regex to match subdomain?

I have the following so far:
^((http[s]?|ftp):\/\/)(([^.:\/\s]*)[\.]([^:\/\s]+))(:([^\/]*))?(((\/\w+)*\/)([\w\-\.]+[^#?\s]+)(\?([^#]*))?(#(.*))?)?$
Been testing against these:
https://www.google.com.ar:8080/dir/1/2/search.html?arg=0-a&arg1=1-b&arg3-c#hash
https://google.com.ar:8080/dir/1/2/search.html?arg=0-a&arg1=1-b&arg3-c#hash
https://google.com:8080/dir/1/2/search.html?arg=0-a&arg1=1-b&arg3-c#hash
http://www.foo.com
http://www.foo.com/
http://blog.foo.com/
http://blog.foo.com.ar/
http://foo.com
http://blog.foo.com
http://foo.com.ar
I'm using the following tool to test the regexes: regex tester
So far I've been able to yield the following groups:
full protocol
reduced protocol
full domain name
subdomain?
top level domain
port
port number
rest of the url
rest of the "directory"
no idea how to drop this group
page name
argument string
argument string
hash tag
hash tag
I will be using this regex to change the subdomain for my application for cross-domain redirect hyperlinks.
Using Request.Url as a parameter, I want to redirect from
http://example.com or http://www.example.com to http://blog.example.com
How can I achieve this?
I can't really tell what, if any, the current subdomain ( either nothing, www, blog, or forum, for instance) actually is...
What would be the best way to make this replacement?
What I actually need is some way to find out what the top-level domain is. In any of http://www.example.com, http://blog.example.com, or http://example.com, I want to get example.com.

What would be the best way to make this replacement?
This may not be the answer you're looking for... but IMO the best way would be to make use of the System.Uri class.
The Uri class will easily extract the Host for you - and you can then split the host on "." delimiter - that should easily give you access to the current subdomain.
This is just my opinion, formed mainly because I find it hard to maintain regex code like ^((http[s]?|ftp):\/\/)(([^.:\/\s]*)[\.]([^:\/\s]+))(:([^\/]*))?(((\/\w+)*\/)([\w\-\.]+[^#?\s]+)(\?([^#]*))?(#(.*))?)?$
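A rough sketch of the Host-splitting approach, using hypothetical names (SubdomainHelper, GetBaseDomain, WithSubdomain). It makes one loud assumption: the base domain is always the last two labels of the host. That holds for example.com but not for hosts such as foo.com.ar or for IP addresses, which would need a public-suffix list or special handling.

```csharp
using System;

public static class SubdomainHelper
{
    // Naive assumption: the base domain is the last two labels of the host.
    public static string GetBaseDomain(Uri uri)
    {
        string[] labels = uri.Host.Split('.');
        if (labels.Length <= 2) return uri.Host;
        return labels[labels.Length - 2] + "." + labels[labels.Length - 1];
    }

    // Rebuilds the URI with the given subdomain in front of the base domain,
    // keeping scheme, port, path and query as they were.
    public static Uri WithSubdomain(Uri uri, string subdomain)
    {
        var builder = new UriBuilder(uri);
        builder.Host = subdomain + "." + GetBaseDomain(uri);
        return builder.Uri;
    }
}
```

With Request.Url as input, WithSubdomain(Request.Url, "blog") would give the redirect target under the stated assumption.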

You can use the Uri class to parse the strings. There are many properties available in addition to Segments:
Uri MyUri = new Uri("https://www.google.com.ar:8080/dir/1/2/search.html?arg=0-a&arg1=1-b&arg3-c#hash");
foreach (String Segment in MyUri.Segments)
Response.Write(Segment + "<br />");

I think you should reconsider whether a RegEx is really needed in this case.
Extracting the top-level domain from a URL is quite simple: in the case of "http://www.example.com/?blah=111" you can take the part before the third slash, perform a String.Split('.'), and concatenate the last two array items. In the case of "http://www.example.com", it is even easier.
Regex patterns are very error-prone and quite hard to maintain, and in my view you gain nothing from one here. I recommend getting rid of the regex. The result may be two or three more lines of code, but it will work, and your code will be more readable and easier to understand.

How do I use a pattern Url to extract a segment from an actual Url?

If I have a series of "pattern" Urls of the form:
http://{username}.sitename.com/
http://{username}.othersite.net/
http://mysite.com/{username}
and I have an actual Url of the form:
http://joesmith.sitename.com/
Is there any way that I can match a pattern Url and in turn use it to extract the username portion out of the actual Url? I've thought of nasty ways to do it, but it just seems like there should be a more intuitive way to accomplish this.
ASP.NET MVC uses a similar approach to extract the various segments of the URL when it is building its routes. Given the example:
{controller}/{action}
So given the Url of the form, Home/Index, it knows that it is the Home controller calling the Index action method.

Not sure I understand this question correctly but you can just use a regular expression to match anything between 'http://' and the first dot.

A very simple regex will do:
':https?://([a-z0-9\.-]*[a-z0-9])\.sitename\.com'
This will allow any subdomain that only contains valid subdomain characters. Example of allowed subdomains:
joesmith.sitename.com
joe.smith.sitename.com
joe-smith.sitename.com
a-very-long-subdomain.sitename.com
As you can see, you might want to complicate the regex slightly. For instance, you could limit the number of characters allowed in the subdomain.

It seems the quickest and easiest solution builds on Machine's answer.
var givenUri = "http://joesmith.sitename.com/";
var patternUri = "http://{username}.sitename.com/";
patternUri = patternUri.Replace("{username}", @"([a-z0-9\.-]*[a-z0-9])");
var result = Regex.Match(givenUri, patternUri, RegexOptions.IgnoreCase).Groups;
if (!String.IsNullOrEmpty(result[1].Value))
    return result[1].Value;
Seems to work great.
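The same idea can be generalized so that any {name} placeholder becomes a named capture group and the literal parts of the pattern (such as the dots) are escaped. This is only a sketch: PatternUrl and Extract are hypothetical names, and the character set allowed inside a placeholder ([a-z0-9-]+, so no dots) is an assumption.

```csharp
using System.Text.RegularExpressions;

public static class PatternUrl
{
    // Turns "http://{username}.sitename.com/" into an anchored regex with a
    // named group per placeholder, then returns the requested group's value,
    // or null when the actual URL does not fit the pattern.
    public static string Extract(string patternUrl, string actualUrl, string name)
    {
        // Regex.Escape turns "{" into "\{" and "." into "\."; "}" is left as is.
        string escaped = Regex.Escape(patternUrl);

        // Replace each escaped "{placeholder}" with a named capture group.
        string body = Regex.Replace(escaped, @"\\\{(\w+)}", "(?<$1>[a-z0-9-]+)");

        Match m = Regex.Match(actualUrl, "^" + body + "$", RegexOptions.IgnoreCase);
        return m.Success ? m.Groups[name].Value : null;
    }
}
```

Trying each pattern Url in turn and taking the first non-null result would cover the three patterns listed in the question.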

Well, this "pattern URL" is a format you've made up, right? So you'll basically just need to process it yourself.
If the format is:
anything inside "{ }" is a thing to capture, and everything else must match as is
then you'd just find the start and end index of those brackets and match everything else literally. When you reach a placeholder, consume characters only until you hit whatever literal 'token' comes after its closing '}'.

There are definitely different ways - ultimately though your server must be configured to handle (and possibly route) these different subdomain requests.
What I would do would be to answer all subdomain requests (except maybe some reserved words, like 'www', 'mail', etc.) on sitename.com with a single handler or page (I'm assuming ASP.NET here based on your C# tag).
I'd use the request path, which is easy enough to get, with some simple string parsing/regex routines (remove the 'http://', grab the first token up until '.' or '/' or '\', etc.) and then use that in a session, making sure to observe URL changes.
Alternately, you could map certain virtual paths to request urls ('joesmith.sitename.com' => 'sitename.com/index.aspx?username=joesmith') via IIS but that's kind of nasty too.
Hope this helps!
