RegEx Problem: Valid URL does not match, but should - c#

I posted a similar question earlier, but I now realize I should have been more thorough.
I've tested a number of the URL/URI expressions listed on regexlib.com, but I can't get any of them to work as desired on these test strings:
msn.com
msn-msn.net
yahoo.c!om
http://www.yahoo.com
msn msn
test ! number 1
Here is how I want them to behave:
msn.com (match)
msn-msn.net (match)
yahoo.c!om (fail)
http://www.yahoo.com (match)
msn msn (fail)
test ! number 1 (fail)
I'm using the tester here: http://regexlib.com/RETester.aspx before testing in my own app (C#, .NET 4.0)
The expression that is closest is this, but it doesn't match the http://www.yahoo.com one:
^[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(:[a-zA-Z0-9]*)?/?([a-zA-Z0-9\-\._\?\,\'/\\\+&%\$#\=~])*
Any help is appreciated. Additionally, somebody should come up with a more human-readable equivalent to RegEx...this stuff is a nightmare.
Thanks,
Beems

If you can't guarantee that the URL-esque pattern you're trying to match has a scheme/protocol, then the safest thing to do is match against top-level domains:
^(https?://)?[^/]*\.(possibly|really|long|list|of|valid|top|level|domains)
From your post it's evidently not necessary to go into the path, hash, or querystring parts of a URL, so that's it!

This one appears to work as desired:
[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(/\S*)?
Can anyone see any issues with this in regards to my original query? I don't need to validate whether the TLD is proper, so this isn't really an issue.
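One possible issue is that the unanchored pattern only needs to match somewhere inside the input, so a string with a valid-looking piece in the middle would still pass. The following is a minimal sketch, not a definitive validator, that anchors the pattern and makes the scheme optional. The class name `UrlLikeCheck` and method name `IsUrlLike` are hypothetical, and the assumption is that the whole input must look like a host, a 2 to 3 letter TLD, and an optional path.

```csharp
using System;
using System.Text.RegularExpressions;

public static class UrlLikeCheck
{
    // Assumption: the entire input must be [scheme]host.tld[/path].
    // The TLD is only checked for length (2-3 letters), not for validity.
    private static readonly Regex UrlLike = new Regex(
        @"^(https?://)?[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(/\S*)?$");

    public static bool IsUrlLike(string input)
    {
        return UrlLike.IsMatch(input);
    }
}
```

Under these assumptions, `UrlLikeCheck.IsUrlLike("http://www.yahoo.com")` returns true, while `UrlLikeCheck.IsUrlLike("yahoo.c!om")` and `UrlLikeCheck.IsUrlLike("msn msn")` return false.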

I agree with kojiro, but this does match your tests:
http://www.rubular.com/r/gUb4U6Pzux

Related

How to split a string into meaningful words

I want to split a string into its possible word sequences. What approach should I follow?
Given string: thisisapineapple
Solution 1: this is a pineapple
Solution 2: this is a pine apple
Please suggest and explain possible algorithms for getting the solutions above.
Thanks :)
To answer your question, the Knuth-Morris-Pratt algorithm is powerful and not terribly difficult to implement.
Use strings from /usr/share/dict/words or /usr/dict/words as the patterns.
You need a scanner-less, GLR parser. They can handle words run together like this and can return ambiguous results. My own NLP library (AboditNLP) does this. Wordnet is a good source for the words.
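Neither KMP nor a GLR parser is shown here. As a simpler illustration of the same idea (try every dictionary word as a prefix and keep all ambiguous results), here is a minimal backtracking sketch. The names `WordSplitter`, `Split`, and `Collect` are hypothetical, and the dictionary is assumed to be a small in-memory set; a real one could be loaded from a word list such as /usr/share/dict/words.

```csharp
using System;
using System.Collections.Generic;

public static class WordSplitter
{
    // Returns every way to split `text` into words found in `dictionary`,
    // each result joined with single spaces.
    public static List<string> Split(string text, ISet<string> dictionary)
    {
        var results = new List<string>();
        Collect(text, 0, dictionary, new List<string>(), results);
        return results;
    }

    private static void Collect(string text, int start, ISet<string> dictionary,
                                List<string> current, List<string> results)
    {
        if (start == text.Length)
        {
            // The whole string has been consumed: record this segmentation.
            results.Add(string.Join(" ", current));
            return;
        }
        for (int end = start + 1; end <= text.Length; end++)
        {
            string candidate = text.Substring(start, end - start);
            if (dictionary.Contains(candidate))
            {
                current.Add(candidate);
                Collect(text, end, dictionary, current, results);
                current.RemoveAt(current.Count - 1); // backtrack
            }
        }
    }
}
```

With a dictionary containing this, is, a, pine, apple, and pineapple, calling `WordSplitter.Split("thisisapineapple", dictionary)` yields both segmentations from the question. The running time can grow exponentially on long inputs, so memoizing results by start index would be a natural next step.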

C# the following RegEx too slow on large string

I am using the following regex to filter large string:
(?m)(?(^*(?=.*\\btrue\\b)(?=.*\\ba\\b).*\\r*$)(.*)|(?!))
It takes forever to do so. What am I doing wrong here? Is it a problem with my pattern, or is it the length of the string that causes the delay?
Please help me here.
Thanks in advance.
OK, I found that this regex works for multiple words (here I am using a two-word condition) combined with AND, and its speed is good compared to my old regex, which @Dispersia pointed out was wrong. Strangely, even the old one worked; it just took forever to produce a result:
(?i)(?m)^((?=.*\bword1\b)(?=.*\bword2\b)).*[\r\n]*$
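An alternative to one large multiline pattern is to split the text into lines and test each required word separately, which keeps every individual regex small. This is only a sketch of that swapped-in approach, not the asker's method; `LineFilter` and `LinesWithAllWords` are hypothetical names, and matching is assumed to be case-insensitive and whole-word, as in the pattern above.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class LineFilter
{
    // Returns the lines of `text` that contain every word in `words`
    // as a whole word (case-insensitive).
    public static List<string> LinesWithAllWords(string text, params string[] words)
    {
        // One small whole-word regex per required word.
        var patterns = new List<Regex>();
        foreach (string word in words)
        {
            patterns.Add(new Regex(@"\b" + Regex.Escape(word) + @"\b",
                                   RegexOptions.IgnoreCase));
        }

        var matches = new List<string>();
        foreach (string rawLine in text.Split('\n'))
        {
            string line = rawLine.TrimEnd('\r'); // tolerate Windows line endings
            bool hasAll = true;
            foreach (Regex pattern in patterns)
            {
                if (!pattern.IsMatch(line))
                {
                    hasAll = false;
                    break;
                }
            }
            if (hasAll)
            {
                matches.Add(line);
            }
        }
        return matches;
    }
}
```

For example, `LineFilter.LinesWithAllWords(text, "true", "a")` keeps only the lines that contain both words. Whether this is faster than a single lookahead pattern on a given input is something to measure rather than assume.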

URL regex - not getting it to work

I am using the following regex to find if there is a url present in a text, however it seems to miss some URLs like:
youtube.be/8P0BxJO
youtube.com/watch?v=VrmlFL
and also some bit.ly links (but not all)
Match m = Regex.Match(nc[i].InnerText,
    @"(http(s)?://)?([\w-]+\.)+[\w-]+(/\S\w[\w- ;,./?%&=]\S*)?");
if (m.Success)
{
    MessageBox.Show(nc[i].InnerText);
}
Any ideas on how to fix it?
See this related question, the first answer should help you out. The suggestion both finds links and then replaces them, so obviously just take what you need. This and this article are different approaches that should get you more or less the same result.
Another (perhaps more reliable) non-regex approach would be to tokenize the string by splitting on spaces and punctuation, and then check each token to see whether it is a valid URI using Uri.IsWellFormedUriString (which only works on well-formed URIs, as this question points out).
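If a regex is still preferred, one option is to loosen the path part so that anything non-whitespace after the first slash is accepted. The sketch below assumes that is acceptable for the use case; `UrlFinder` and `FindUrls` are hypothetical names, and the pattern is a simplification of the one in the question, not a complete URL grammar.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class UrlFinder
{
    // Optional http/https scheme, at least one dot-separated host label,
    // then an optional path made of any non-whitespace characters.
    private static readonly Regex UrlPattern = new Regex(
        @"(https?://)?([\w-]+\.)+[\w-]+(/\S*)?");

    // Returns every URL-like substring found in `text`.
    public static List<string> FindUrls(string text)
    {
        var found = new List<string>();
        foreach (Match m in UrlPattern.Matches(text))
        {
            found.Add(m.Value);
        }
        return found;
    }
}
```

A pattern this loose also matches things like "e.g" and will include trailing punctuation that directly follows a path, so the results may need post-filtering.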

Regex to match a fragment of the URL

I have URLs like:
http://127.0.0.1:81/controller/verbOne/NXw4fDF8MXwxfDQ1?source=dddd
or
http://127.0.0.1:81/controller/verbTwo/NXw4fDF8MXwxfDQ1
I'd like to extract that part in bold. The host and port can change to anything (when I publish it to a live server it will change). The controller never changes. And for the verb part, there are 2 possibilities.
Can anyone help me with the regex?
Thanks
Instead of using a regex you could use the built in functionality of Uri
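// Requires "using System.Linq;" for Last()
Uri uri = new Uri("http://127.0.0.1:81/controller/verbOne/NXw4fDF8MXwxfDQ1?source=dddd");
// Segments is built from the path only, so the query string is not included
var lastSegment = uri.Segments.Last();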
You're looking for the Uri and Path classes:
Path.GetFileName(new Uri(str).AbsolutePath)
Why are you looking for a regex? You can search for the two strings "verbOne/" and "verbTwo/" and take the substring that follows. Then you can look at the rest and cut off everything from the '?' onward.
I think this is faster than a regex.
krikit
Though everyone else here is correct that regex is not the best solution, because it could fail when parsers already exist that should never fail due to their specialization, I believe you could use the following regex:
(?<=http://127\.0\.0\.1:81/controller/verb(One|Two)/)[a-zA-Z0-9]*
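Because the question says the host and port can change, a pattern that hardcodes them will stop matching after deployment. The sketch below combines the two ideas from the answers: let Uri strip the host and query string, then match only the path. It assumes the wanted part is the token after the verb (the bold formatting from the question did not survive), and `TokenExtractor` and `ExtractToken` are hypothetical names.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TokenExtractor
{
    // Matches the path only, so host and port are irrelevant.
    private static readonly Regex TokenPattern = new Regex(
        @"^/controller/verb(?:One|Two)/([a-zA-Z0-9]+)$");

    // Returns the token after the verb, or null if the URL does not fit.
    public static string ExtractToken(string url)
    {
        Uri uri;
        if (!Uri.TryCreate(url, UriKind.Absolute, out uri))
        {
            return null;
        }
        // AbsolutePath excludes the scheme, host, port and query string.
        Match m = TokenPattern.Match(uri.AbsolutePath);
        return m.Success ? m.Groups[1].Value : null;
    }
}
```

Both example URLs from the question produce the same token with this approach, whatever host or port they are served from.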

Regular expression for validating a url

I'm a beginner with regexes. My requirement is to validate everything from simple URLs to URLs with query strings, square brackets, and so on. For example:
www.test.com?waa=[sample data]
The regex that I wrote only works for simple URLs. It fails for the one with square brackets. Any ideas?
Do you really need to use a regex?
bool isUri = Uri.IsWellFormedUriString("http://...", UriKind.RelativeOrAbsolute);
I would suggest taking a better look at the following site
http://www.regular-expressions.info/dotnet.html
Without actually seeing the regex you're using, I can't provide much insight, and giving you the answer wouldn't really teach you much either. Give a man a regex and you help him for a bit. Teach him regex and he's good for life.
Take a look at the following:
http://www.geekzilla.co.uk/view2D3B0109-C1B2-4B4E-BFFD-E8088CBC85FD.htm
Thanks a lot for the reply.
This is what I wrote. It works for query strings too, but it fails when I add []:
/^(https?|ftp)://(?#)(([a-z0-9$.+!*\'(),;\?&=-]|%[0-9a-f]{2})+(?#)(:([a-z0-9$.+!*\'(),;\?&=-]|%[0-9a-f]{2})+)?(?#)#)?(#)((([a-z0-9][a-z0-9-][a-z0-9].)(#)[a-z]{2}[a-z0-9-]a-z0-9|(\d|[1-9]\d|1\d{2}|2[0-4][0-9]|25[0-5].){3}(?#)(\d|[1-9]\d|1\d{2}|2[0-4][0-9]|25[0-5])(?#))(:\d+)?(?#))(((/+([a-z0-9$_.+!*\'(),;:#&=-]|%[0-9a-f]{2}))(?# )(\?([a-z0-9$_.+!*\'(),;:#&=-]|%[0-9a-f]{2}))(?#)?)?)?(?#)(#([a-z0-9$_.+!*\'(),;:#&=-]|%[0-9a-f]{2})*)?(?#)$/i
Use this if you want the URL to include http:
http(s)?://([\w-]+\.)+[\w-]+(/[\w- ./?%&=]*)?
If you don't want http in the URL, then go for:
([\w-]+\.)+[\w-]+(/[\w- ./?%&=]*)?
