RegEx to Validate URL with optional Scheme - c#

I want to validate a URL using regular expression. Following are my conditions to validate the URL:
Scheme is optional
Subdomains should be allowed
Port number should be allowed
Path should be allowed.
I was trying the following pattern:
((http|https)://)?([\w-]+\.)+[\w-]+(/[\w- ./?%&=]*)?
But I am not getting the desired results. Even an invalid URL like '*.example.com' is getting matched.
What is wrong with it?

Are you matching the entire string? You don't say what language you are using, but in Python it looks like you may be using search instead of match.
One way to fix this is to start your regexp with ^ and end it with $.

While parsing URLs is best left to a library (since I know Perl best, I would suggest something like http://search.cpan.org/dist/URI/), if you want some help debugging that statement, it might be best to try it in a debugger, something like http://www.debuggex.com/.
I think one of the main reasons it is matching is that you don't use beginning and end of string anchors. Without them, the regex does not have to match the entire input; it can simply match the 'example.com' part of your string and ignore the rest.
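To make the effect of the missing anchors concrete, here is a minimal C# sketch. It uses only the host part of the question's pattern (a simplification made for this illustration), and the constant names are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

// Host part of the question's pattern only; simplified for illustration.
const string Unanchored = @"([\w-]+\.)+[\w-]+";
const string Anchored = "^" + Unanchored + "$";

string input = "*.example.com";

// Without anchors, IsMatch succeeds if ANY substring matches.
Console.WriteLine(Regex.IsMatch(input, Unanchored));      // True
Console.WriteLine(Regex.Match(input, Unanchored).Value);  // example.com

// With ^ and $ the whole input has to match, so the leading '*' is rejected.
Console.WriteLine(Regex.IsMatch(input, Anchored));        // False
```

The unanchored call reports a match because the engine is free to skip the leading "*." and match the remainder.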

Found the regular expression for my condition with help from your inputs
^(http(s)?://)?[0-9a-zA-Z]([-.\w]*[0-9a-zA-Z])*(:[0-9]*)*(\/?)([a-zA-Z0-9\-\.\?\,\'\/\\\+&%\$#_]*)?$

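The following code works for me in C#:
private static bool IsValidUrl(string url)
{
    // First check: optional http:// or https://, then dot-separated labels, then optional query characters.
    // Second check: reject input that ends in a non-alphanumeric character.
    return new Regex(@"^(https?://)?([\w-]+\.)+[\w-]+[.\w]+([?%&=]*)?").IsMatch(url)
        && !new Regex(@"[^a-zA-Z0-9]+$").IsMatch(url);
}
It allows "something.anything" (at least 2 letters after the period), with or without http(s) and www.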

Related

Matching and replacing function expressions

I need to do some very light parsing of C# (actually transpiled Razor code) to replace a list of function calls with textual replacements.
If given a set containing {"Foo.myFunc" : "\"def\"" } it should replace this code:
var res = "abc" + Foo.myFunc(foo, Bar.otherFunc( Baz.funk()));
with this:
var res = "abc" + "def"
I don't care about the nested expressions.
This seems fairly trivial and I think I should be able to avoid building an entire C# parser using something like this for every member of the mapping set:
find expression start (e.g. Foo.myFunc)
Push()/Pop() parentheses on a Stack until Count == 0.
Mark this as expression stop
replace everything from expression start until expression stop
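The steps above can be sketched roughly as follows. This is a minimal illustration, not a parser: the helper name ReplaceFirstCall is hypothetical, a depth counter stands in for the Stack, only the first occurrence is replaced, and parentheses inside string literals or comments are not recognized.

```csharp
using System;

// Hypothetical helper: replaces the first call to funcName, including its
// balanced argument list, with the given replacement text.
static string ReplaceFirstCall(string source, string funcName, string replacement)
{
    // Expression start: the function name directly followed by '('.
    int start = source.IndexOf(funcName + "(", StringComparison.Ordinal);
    if (start < 0) return source;

    int depth = 0;
    for (int i = start + funcName.Length; i < source.Length; i++)
    {
        if (source[i] == '(')
        {
            depth++;                       // "push"
        }
        else if (source[i] == ')' && --depth == 0)
        {
            // "pop" reached zero: i is the expression stop.
            return source.Substring(0, start) + replacement + source.Substring(i + 1);
        }
    }
    return source; // unbalanced parentheses: leave the input untouched
}

string code = "var res = \"abc\" + Foo.myFunc(foo, Bar.otherFunc( Baz.funk()));";
Console.WriteLine(ReplaceFirstCall(code, "Foo.myFunc", "\"def\""));
// var res = "abc" + "def";
```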
But maybe I don't need to ... Is there a (possibly built-in) .NET library that can do this for me? Counting is not possible in the family of languages that RE is in, but maybe the extended regex syntax in C# can handle this somehow using back references?
edit:
As the comments on this answer demonstrate, simply counting brackets will not be sufficient in general, since something like trollMe("(") will throw off such algorithms. Only true parsing would then suffice, I guess (?).
The trick for a normal string will be:
(?>"(\\"|[^"])*")
A verbatim string:
(?>@"(""|[^"])*")
Maybe this can help, but I'm not sure that this will work in all cases:
<func>(?=\()((?>/\*.*?\*/)|(?>@"(""|[^"])*")|(?>"(\\"|[^"])*")|\r?\n|[^()"]|(?<open>\()|(?<-open>\)))+?(?(open)(?!))
Replace <func> with your function name.
Needless to say, trollMe("\"(", "((", @"abc""de((f") works as expected.
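As a rough sketch of how the balancing-group pattern above might be applied from C#: the helper name ReplaceCalls is hypothetical, the function name is passed through Regex.Escape (an addition not present in the answer), and a MatchEvaluator is used so that the replacement text is inserted literally.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper built around the pattern from the answer.
static string ReplaceCalls(string source, string funcName, string replacement)
{
    string pattern =
        Regex.Escape(funcName) + @"(?=\()" +
        @"((?>/\*.*?\*/)" +             // block comment
        @"|(?>@""(""""|[^""])*"")" +    // verbatim string literal
        @"|(?>""(\\""|[^""])*"")" +     // regular string literal
        @"|\r?\n|[^()""]" +             // anything that is not a paren or quote
        @"|(?<open>\()" +               // push on '('
        @"|(?<-open>\)))+?" +           // pop on ')'
        @"(?(open)(?!))";               // fail while the stack is not empty
    return Regex.Replace(source, pattern, m => replacement);
}

string code = "var res = \"abc\" + Foo.myFunc(foo, Bar.otherFunc( Baz.funk()));";
Console.WriteLine(ReplaceCalls(code, "Foo.myFunc", "\"def\""));
// var res = "abc" + "def";
```

Because string literals are consumed as atomic units, parentheses inside them do not touch the open stack.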

regular expression get all hosts from html

I'm trying to get all URLs with one regular expression. Currently I'm using this pattern:
/^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$/
However that regex returns the pages/files, instead of hosts. So instead of having to run a second regular expression, I'm hoping someone here can help
This returns http://www.yoursite.com/index.html
I'm attempting to return yoursite.com.
Also, the regex will be parsing from HTML and the hosts will be checked afterwards, so 100% accuracy isn't crucial.
Assuming that your regex:
/^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$/
Actually does parse the Urls (I haven't checked it), you could easily use a capture group to get the host:
/^(https?:\/\/)?(?<host>([\da-z\.-]+)\.([a-z\.]{2,6}))([\/\w \.-]*)*\/?$/
When you get the Match result, you can examine Groups["host"] to get the host name.
But you're much better off, in my opinion, just using Uri.TryCreate, although you'll need a little logic to get around the possible lack of a scheme. That is:
// Prepend a scheme when the line does not already start with one.
if (!Regex.IsMatch(line, @"^https?://"))
    line = "http://" + line;
Uri uri;
if (Uri.TryCreate(line, UriKind.Absolute, out uri))
{
    // It's a valid URL.
    host = uri.Host;
}
Parsing Urls is a pretty tricky business. For example, no individual dotted segment can exceed 63 characters, and there's nothing preventing the last dotted segment from having numbers or hyphens. Nor is it limited to 6 characters. You're better off passing the entire string to Uri.TryCreate than you are trying to duplicate the craziness of URL parsing with a single regular expression.
It's possible that the rest of the Url (after the host name) could be trash. If you want to eliminate that bit causing a problem, then extract everything up to the end of the host name:
^https?:\/\/[^\/]*
Then run that through Uri.TryCreate.
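Putting the two steps together might look like the sketch below. TryGetHost is a hypothetical helper name, and the case-insensitive matching and the "http://" default are assumptions carried over from the snippet earlier in this answer.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper: returns true and sets host when the start of the
// line can be parsed as an absolute http(s) URL.
static bool TryGetHost(string line, out string host)
{
    host = "";

    // Add a scheme if it is missing so Uri.TryCreate sees an absolute URL.
    if (!Regex.IsMatch(line, @"^https?://", RegexOptions.IgnoreCase))
        line = "http://" + line;

    // Keep only the scheme and everything up to the first '/' after it.
    Match prefix = Regex.Match(line, @"^https?://[^/]*", RegexOptions.IgnoreCase);
    if (!prefix.Success || !Uri.TryCreate(prefix.Value, UriKind.Absolute, out var uri))
        return false;

    host = uri.Host;
    return true;
}

if (TryGetHost("http://www.yoursite.com/index.html", out string found))
    Console.WriteLine(found); // www.yoursite.com
```

Note that this yields the full host name, including any "www." prefix.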
To capture just the yoursite.com from sample text http://www.yoursite.com/index?querystring=value you could use this expression, however this does not validate the string:
^(https?:\/\/)?(?:[^.\/?]*[.])?([^.\/?]*[.][^.\/?]*)
Live demo: http://www.rubular.com/r/UNR7qiQ0Eq
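In C# the same expression could be used roughly like this. GetBareHost is a hypothetical helper name; the forward slashes are left unescaped because .NET regex strings have no pattern delimiters.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper: returns capture group 2 of the expression above,
// or an empty string when nothing matches.
static string GetBareHost(string url)
{
    Match m = Regex.Match(url, @"^(https?://)?(?:[^./?]*[.])?([^./?]*[.][^./?]*)");
    return m.Success ? m.Groups[2].Value : "";
}

Console.WriteLine(GetBareHost("http://www.yoursite.com/index?querystring=value"));
// yoursite.com
```

The expression skips at most one leading label, so hosts with deeper subdomains or multi-part suffixes will not come out as the registrable domain.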

Regex to match a fragment of the URL

I have URLs like:
http://127.0.0.1:81/controller/verbOne/NXw4fDF8MXwxfDQ1?source=dddd
or
http://127.0.0.1:81/controller/verbTwo/NXw4fDF8MXwxfDQ1
I'd like to extract that part in bold. The host and port can change to anything (when I publish it to a live server it will change). The controller never changes. And for the verb part, there are 2 possibilities.
Can anyone help me with the regex?
Thanks
Instead of using a regex you could use the built in functionality of Uri
Uri uri = new Uri("http://127.0.0.1:81/controller/verbOne/NXw4fDF8MXwxfDQ1?source=dddd");
var lastSegment = uri.Segments.Last();
You're looking for the Uri and Path classes:
Path.GetFileName(new Uri(str).AbsolutePath)
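A short sketch comparing the two suggestions on the URL from the question, assuming the part to extract is the last path segment. Segments.Last() needs System.Linq, and the variable names are only illustrative.

```csharp
using System;
using System.IO;
using System.Linq;

var uri = new Uri("http://127.0.0.1:81/controller/verbOne/NXw4fDF8MXwxfDQ1?source=dddd");

// Segments and AbsolutePath cover the path only; the query string is excluded.
string viaSegments = uri.Segments.Last();
string viaPath = Path.GetFileName(uri.AbsolutePath);

Console.WriteLine(viaSegments); // NXw4fDF8MXwxfDQ1
Console.WriteLine(viaPath);     // NXw4fDF8MXwxfDQ1
```

Neither call depends on the host or port, so the result stays the same when the site moves to a live server.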
Why do you want a regex? You can look for the two strings "verbOne/" or "verbTwo/" and take the substring that follows the match. Then you can cut off the part that starts with the '?'.
I think this is faster than a regex.
krikit
Though everyone else here is correct that regex is not the best solution, because it could fail when parsers already exist that should never fail due to their specialization, I believe you could use the following regex:
(?<=http://127\.0\.0\.1:81/controller/verb(One|Two)/)[a-zA-Z0-9]*
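Since the question says the host and port can change, a variant that anchors the lookbehind on the fixed path only may be more practical. The sketch below is such a variant, not the exact pattern above; ExtractToken is a hypothetical helper name, and + is used instead of * so that an empty match is not reported.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper: returns the run of letters and digits that follows
// /controller/verbOne/ or /controller/verbTwo/, or "" if there is none.
static string ExtractToken(string url)
{
    Match m = Regex.Match(url, @"(?<=/controller/verb(?:One|Two)/)[a-zA-Z0-9]+");
    return m.Success ? m.Value : "";
}

Console.WriteLine(ExtractToken("http://127.0.0.1:81/controller/verbOne/NXw4fDF8MXwxfDQ1?source=dddd"));
// NXw4fDF8MXwxfDQ1
```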

Regex Help (again)

I don't really know what to entitle this, but I need some help with regular expressions. Firstly, I want to clarify that I'm not trying to match HTML or XML, although it may look like it, it's not. The things below are part of a file format I use for a program I made to specify which details should be exported in that program. There is no hierarchy involved, just that each new line contains a 'tag':
<n>
This is matched with my program to find an enumeration, which tells my program to export the name value, anyway, I also have tags like this:
<adr:home>
This specifies the home address. I use the following regex:
<((?'TAG'.*):(?'SUBTAG'.*)?)?(\s+((\w+)=('|"")?(?'VALUE'.*[^'])('|"")?)?)?>
The problem is that the regex will split the adr:home tag fine, but fail to find the n tag because it lacks a colon, but when I add a ? or a *, it then doesn't split the adr:home and similar tags. Can anyone help? I'm sure it's only simple, it's just this is my first time at creating a regular expression. I'm working in C#, by the way.
Will this help?
<((?'TAG'.*?)(?::(?'SUBTAG'.*))?)?(\s+((\w+)=('|"")?(?'VALUE'.*[^'])('|"")?)?)?>
I've wrapped the : and the subtag capture in a non-capturing group and made the tag capture non-greedy.
Not entirely sure what your aim is but try this:
(?><)(?'TAG'[^:\s>]*)(:(?'SUBTAG'[^\s>:]*))?(\s\w+=['"](?'VALUE'[^'"]*)['"])?(?>>)
I find this site extremely useful for testing C# regex expressions.
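A quick way to check this pattern against the two tags from the question is sketched below. The double quotes are doubled because the pattern sits in a C# verbatim string, and the variable names are only illustrative.

```csharp
using System;
using System.Text.RegularExpressions;

var tagRegex = new Regex(
    @"(?><)(?'TAG'[^:\s>]*)(:(?'SUBTAG'[^\s>:]*))?(\s\w+=['""](?'VALUE'[^'""]*)['""])?(?>>)");

foreach (string line in new[] { "<n>", "<adr:home>" })
{
    Match m = tagRegex.Match(line);
    string tag = m.Groups["TAG"].Value;
    string subTag = m.Groups["SUBTAG"].Value; // empty when there is no colon part
    Console.WriteLine(line + ": TAG=" + tag + ", SUBTAG=" + subTag);
}
// <n>: TAG=n, SUBTAG=
// <adr:home>: TAG=adr, SUBTAG=home
```

Excluding ':' and '>' from the TAG character class is what lets the optional subtag group work for both forms.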
What if you put the colon as part of the second tag?
<((?'TAG'.*)(?'SUBTAG':.*)?)?(\s+((\w+)=('|"")?(?'VALUE'.*[^'])('|"")?)?)?>

How do I use a pattern Url to extract a segment from an actual Url?

If I have a series of "pattern" Urls of the form:
http://{username}.sitename.com/
http://{username}.othersite.net/
http://mysite.com/{username}
and I have an actual Url of the form:
http://joesmith.sitename.com/
Is there any way that I can match a pattern Url and in turn use it to extract the username portion out the actual Url? I've thought of nasty ways to do it, but it just seems like there should be a more intuitive way to accomplish this.
ASP.NET MVC uses a similar approach to extract the various segments of the URL when it is building its routes. Given the example:
{controller}/{action}
So given the Url of the form, Home/Index, it knows that it is the Home controller calling the Index action method.
Not sure I understand this question correctly but you can just use a regular expression to match anything between 'http://' and the first dot.
A very simple regex will do:
':https?://([a-z0-9\.-]*[a-z0-9])\.sitename\.com'
This will allow any subdomain that only contains valid subdomain characters. Example of allowed subdomains:
joesmith.sitename.com
joe.smith.sitename.com
joe-smith.sitename.com
a-very-long-subdomain.sitename.com
As you can see, you might want to complicate the regex slightly. For instance, you could limit it to only allow a certain amount of characters in the subdomain.
It seems the quickest and easiest solution is going off of Machine's answer.
var givenUri = "http://joesmith.sitename.com/";
var patternUri = "http://{username}.sitename.com/";
// Swap the placeholder for a capturing group (note the closing parenthesis).
patternUri = patternUri.Replace("{username}", @"([a-z0-9\.-]*[a-z0-9])");
var result = Regex.Match(givenUri, patternUri, RegexOptions.IgnoreCase).Groups;
if (!String.IsNullOrEmpty(result[1].Value))
    return result[1].Value;
Seems to work great.
Well, this "pattern URL" is a format you've made up, right? You'll basically just need to process it.
If the format of it is:
anything inside "{ }" is a thing to capture, everything else must be as is
Then you'd just find the start/end index of those brackets, and match everything else. Then when you get to a place where one is, make sure you only look for chars such that they don't match whatever 'token' comes after the next ending '}'.
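One way to sketch this idea is to turn the pattern URL into a regex: each {name} placeholder becomes a named capture group and everything else is escaped so it must match literally. BuildPatternRegex is a hypothetical helper, and the character class allowed for a placeholder value ([a-z0-9-]+, so no dots or slashes) is an assumption made for this illustration.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

// Hypothetical helper: converts a pattern URL such as
// "http://{username}.sitename.com/" into an anchored Regex.
static Regex BuildPatternRegex(string patternUrl)
{
    // Splitting on a capturing group keeps the {name} placeholders in the result.
    string[] parts = Regex.Split(patternUrl, @"(\{\w+\})");
    var sb = new StringBuilder("^");
    foreach (string part in parts)
    {
        if (part.Length > 2 && part[0] == '{' && part[part.Length - 1] == '}')
        {
            // Placeholder: capture it under its own name.
            string name = part.Substring(1, part.Length - 2);
            sb.Append("(?<" + name + ">[a-z0-9-]+)");
        }
        else
        {
            // Literal text: must appear exactly as written.
            sb.Append(Regex.Escape(part));
        }
    }
    sb.Append("$");
    return new Regex(sb.ToString(), RegexOptions.IgnoreCase);
}

Regex pattern = BuildPatternRegex("http://{username}.sitename.com/");
Match match = pattern.Match("http://joesmith.sitename.com/");
Console.WriteLine(match.Groups["username"].Value); // joesmith
```

Trying each pattern URL in turn and taking the first successful match would then cover the whole list.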
There are definitely different ways - ultimately though your server must be configured to handle (and possibly route) these different subdomain requests.
What I would do would be to answer all subdomain requests (except maybe some reserved words, like 'www', 'mail', etc.) on sitename.com with a single handler or page (I'm assuming ASP.NET here based on your C# tag).
I'd use the request path, which is easy enough to get, with some simple string parsing/regex routines (remove the 'http://', grab the first token up until '.' or '/' or '\', etc.) and then use that in a session, making sure to observe URL changes.
Alternately, you could map certain virtual paths to request urls ('joesmith.sitename.com' => 'sitename.com/index.aspx?username=joesmith') via IIS but that's kind of nasty too.
Hope this helps!
