URL regex - not getting it to work - c#

I am using the following regex to check whether a text contains a URL, but it seems to miss some URLs, such as:
youtube.be/8P0BxJO
youtube.com/watch?v=VrmlFL
and also some bit.ly links (but not all)
Match m = Regex.Match(nc[i].InnerText,
    @"(http(s)?://)?([\w-]+\.)+[\w-]+(/\S\w[\w- ;,./?%&=]\S*)?");
if (m.Success)
{
    MessageBox.Show(nc[i].InnerText);
}
Any ideas how to fix it?

See this related question, the first answer should help you out. The suggestion both finds links and then replaces them, so obviously just take what you need. This and this article are different approaches that should get you more or less the same result.
Another (perhaps more reliable) non-regex approach would be to tokenize the string by splitting on spaces and punctuation, and then check each token to see whether it is a valid URI using Uri.IsWellFormedUriString (which only works on well-formed URIs, as this question points out).
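A minimal sketch of that tokenizing approach, under some assumptions that are not in the answer: the class and method names are hypothetical, tokens are split on whitespace only (trailing punctuation is trimmed afterwards), a token without a scheme is treated as http, and a host must contain a dot to count. It will still accept some non-URLs that happen to contain a dot, so treat it as a starting point rather than a finished detector.

```csharp
using System;
using System.Collections.Generic;

public static class UrlTokenSketch
{
    // Returns the whitespace-separated tokens of text that look like web URLs.
    public static List<string> FindUrls(string text)
    {
        var found = new List<string>();
        char[] trailing = { '.', ',', ';', ':', '!', '?', ')', '"', '\'' };

        // Passing a null separator array splits on any whitespace.
        foreach (string raw in text.Split((char[])null, StringSplitOptions.RemoveEmptyEntries))
        {
            string token = raw.TrimEnd(trailing);
            if (token.Length == 0) continue;

            // Assumption: a token without a scheme is treated as http.
            string candidate = token.Contains("://") ? token : "http://" + token;

            if (Uri.IsWellFormedUriString(candidate, UriKind.Absolute)
                && Uri.TryCreate(candidate, UriKind.Absolute, out Uri uri)
                && (uri.Scheme == Uri.UriSchemeHttp || uri.Scheme == Uri.UriSchemeHttps)
                && uri.Host.Contains("."))
            {
                found.Add(token);
            }
        }
        return found;
    }
}
```

Calling UrlTokenSketch.FindUrls(nc[i].InnerText) and checking whether the list is non-empty would take the place of the m.Success test in the question.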

Related

RegEx doesn't work with .NET, but does with other RegEx implementations

I'm trying to match strings that look like this:
http://www.google.com
But not if it occurs in larger context like this:
http://www.google.com
The regex I've got that does the job in a couple different RegEx engines I've tested (PHP, ActionScript) looks like this:
(?<!["'>]\b*)((https?://)([A-Za-z0-9_=%&#?./-]+))\b
You can see it working here: http://regexr.com?36g0e
The problem is that that particular RegEx doesn't seem to work correctly under .NET.
private static readonly Regex fixHttp = new Regex(@"(?<![""'>]\b*)((https?://)([A-Za-z0-9_=%&#?./-]+))\b", RegexOptions.IgnoreCase);
private static readonly Regex fixWww = new Regex(@"(?<=[\s])\b((www\.)([A-Za-z0-9_=%&#?./-]+))\b", RegexOptions.IgnoreCase);
public static string FixUrls(this string s)
{
    s = fixHttp.Replace(s, "$1");
    s = fixWww.Replace(s, "$1");
    return s;
}
Specifically, .NET doesn't seem to be paying attention to the first \b*. In other words, it correctly fails to match this string:
http://www.google.com
But it incorrectly matches this string (note the extra spaces):
http://www.google.com
Any ideas as to what I'm doing wrong or how to work around it?

I was waiting for one of the folks who actually originally answered this question to pop the answer down here, but since they haven't, I'll throw it in.
I'm not precisely sure what was going wrong, but it turns out that in .NET, I needed to replace the \b* with a \s*. The \s* doesn't seem to work with other RegEx engines (I only did a little bit of testing), but it does work correctly with .NET. The documentation I've read around \b would lead me to believe that it should match whitespace leading up to a word as well, but perhaps I've misunderstood, or perhaps there are some weirdnesses around captures that different engines handle differently.
At any rate, this is my final RegEx:
(?<!["'>]\s*)((https?:\/\/)([A-Za-z0-9_=%&#\?\.\/\-]+))\b
I don't understand what was going wrong well enough to give any real context for why this change works, and I dislike RegExes enough that I can't quite justify the time figuring it out, but maybe it'll help someone else eventually :-).
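For anyone who wants to try the final pattern, here is a small sketch (the class and method names are made up for illustration). One possible explanation for the behaviour, offered tentatively: \b is a zero-width assertion that consumes no characters, so \b* cannot step over spaces, whereas \s* matches the whitespace itself. .NET allows variable-length patterns such as \s* inside a lookbehind, which many other engines do not, and that may be why the \s* version failed elsewhere.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LookbehindSketch
{
    // The answer's final pattern: reject a URL that is preceded by a quote
    // or '>' followed by optional whitespace.
    private static readonly Regex BareUrl = new Regex(
        @"(?<![""'>]\s*)((https?://)([A-Za-z0-9_=%&#?./-]+))\b",
        RegexOptions.IgnoreCase);

    // True when s contains a URL that is not preceded by a quote or '>'.
    public static bool HasBareUrl(string s)
    {
        return BareUrl.IsMatch(s);
    }
}
```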

Regex to match a fragment of the URL

I have URL's like:
http://127.0.0.1:81/controller/verbOne/NXw4fDF8MXwxfDQ1?source=dddd
or
http://127.0.0.1:81/controller/verbTwo/NXw4fDF8MXwxfDQ1
I'd like to extract the last path segment (NXw4fDF8MXwxfDQ1 in the examples above). The host and port can change to anything (when I publish it to a live server it will change). The controller never changes. And for the verb part, there are 2 possibilities.
Can anyone help me with the regex?
Thanks

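Instead of using a regex, you could use the built-in functionality of Uri:
// Last() needs "using System.Linq;"
Uri uri = new Uri("http://127.0.0.1:81/controller/verbOne/NXw4fDF8MXwxfDQ1?source=dddd");
// Segments covers only the path, so the query string is not included.
var lastSegment = uri.Segments.Last();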

You're looking for the Uri and Path classes:
Path.GetFileName(new Uri(str).AbsolutePath)

Why do you need a regex? You can search for the two string elements "verbOne/" or "verbTwo/" and take the substring that follows. Then you can look at the rest and cut off the part starting at the '?'.
I think this is faster than a regex.
krikit

Everyone else here is right that a regex is not the best solution, since purpose-built parsers already exist and are far less likely to fail. Still, I believe you could use the following regex:
(?<=http://127\.0\.0\.1:81/controller/verb(One|Two)/)[a-zA-Z0-9]*
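Since the question says the host and port will change, a variant of that regex that anchors only on the fixed /controller/verbOne/ or /controller/verbTwo/ part might look like the sketch below. The class and method names are hypothetical, and like the regex above it assumes the token is purely alphanumeric.

```csharp
using System;
using System.Text.RegularExpressions;

public static class FragmentSketch
{
    // Lookbehind on the fixed path prefix only, so any host and port will do.
    private static readonly Regex Token = new Regex(
        @"(?<=/controller/verb(?:One|Two)/)[A-Za-z0-9]+");

    // Returns the token after the verb, or null when the URL has a different shape.
    public static string ExtractToken(string url)
    {
        Match m = Token.Match(url);
        return m.Success ? m.Value : null;
    }
}
```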

Fastest way of removing unicode codes from a string

Hi I'm trying to figure out a way to remove the tags from the results returned from the Google Feed API. Specifically they are placing bold tags on titles and inside the description.
The codes that are being inserted are as follows:
\u003cb
\u003e
\u003c/b\u003e
Since it's a fixed set of codes, I did try doing a String.Replace() for each of these codes per string, but not surprisingly it resulted in bad performance. I'm not sure if RegEx would be better (or worse). Does anyone have an idea on how to remove these? Google does not supply an option to remove tags from the results.

You could remove the unicode codes using a regex like this one:
\\u[\d\w]{4}
var subject = @"\u003cb\u003e\u003c/b\u003e";
var result = Regex.Replace(subject, @"\\u[\d\w]{4}", String.Empty);
As for performance, this article seems to suggest that regex is much slower, but I would run your own tests with your own data, as the result might be wildly different. The regular expression itself plays a big part in performance, and I don't think that article states which regex is being used, so it's impossible to compare. The size and type of your data will also play a big part, so it's difficult to say which is better without understanding your data.
Also, you should try compiling the regex with the RegexOptions.Compiled flag to see if that boosts performance.
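One thing to watch: the \\u[\d\w]{4} pattern removes only the escape codes themselves, so the b and /b between them are left in the text. If the goal is to drop the whole escaped bold tag, a sketch along these lines may be closer to what is wanted. The class and method names are made up, and it assumes the input still contains the literal backslash sequences rather than decoded characters.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TagStripSketch
{
    // Matches an escaped <b> or </b>, i.e. the literal text \u003cb\u003e or \u003c/b\u003e.
    private static readonly Regex BoldTag = new Regex(
        @"\\u003c/?b\\u003e",
        RegexOptions.IgnoreCase | RegexOptions.Compiled);

    // Removes every escaped bold tag and leaves the rest of the string alone.
    public static string StripBold(string s)
    {
        return BoldTag.Replace(s, string.Empty);
    }
}
```

Keeping the Regex in a static field means it is built once and reused for every string, which matters more for throughput than the pattern itself.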

Regex to match subdomain?

I have the following so far:
^((http[s]?|ftp):\/\/)(([^.:\/\s]*)[\.]([^:\/\s]+))(:([^\/]*))?(((\/\w+)*\/)([\w\-\.]+[^#?\s]+)(\?([^#]*))?(#(.*))?)?$
Been testing against these:
https://www.google.com.ar:8080/dir/1/2/search.html?arg=0-a&arg1=1-b&arg3-c#hash
https://google.com.ar:8080/dir/1/2/search.html?arg=0-a&arg1=1-b&arg3-c#hash
https://google.com:8080/dir/1/2/search.html?arg=0-a&arg1=1-b&arg3-c#hash
http://www.foo.com
http://www.foo.com/
http://blog.foo.com/
http://blog.foo.com.ar/
http://foo.com
http://blog.foo.com
http://foo.com.ar
I'm using the following tool to test the regexes: regex tester
So far I've been able to yield the following groups:
full protocol
reduced protocol
full domain name
subdomain?
top level domain
port
port number
rest of the url
rest of the "directory"
no idea how to drop this group
page name
argument string
argument string
hash tag
hash tag
I will be using this regex to change the subdomain for my application for cross-domain redirect hyperlinks.
Using Request.Url as a parameter, I want to redirect from
http://example.com or http://www.example.com to http://blog.example.com
How can I achieve this?
I can't really tell what, if any, the current subdomain ( either nothing, www, blog, or forum, for instance) actually is...
What would be the best way to make this replacement?
What I actually need is some way to find out what the top level domain is. in either http://www.example.com, http://blog.example.com, or http://example.com I want to get example.com.

What would be the best way to make this replacement?
This may not be the answer you're looking for... but IMO the best way would be to make use of the System.Uri class.
The Uri class will easily extract the Host for you - and you can then split the host on "." delimiter - that should easily give you access to the current subdomain.
This is just my opinion, and it's formed mainly because I find it hard to maintain regex code like ^((http[s]?|ftp):\/\/)(([^.:\/\s]*)[\.]([^:\/\s]+))(:([^\/]*))?(((\/\w+)*\/)([\w\-\.]+[^#?\s]+)(\?([^#]*))?(#(.*))?)?$
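A rough sketch of the Host-splitting idea, with hypothetical names throughout. It assumes the caller knows how many trailing labels make up the site's own domain (2 for example.com, 3 for foo.com.ar), because that cannot be worked out from the string alone, and it does not handle IP-address hosts.

```csharp
using System;
using System.Linq;

public static class SubdomainSketch
{
    // Returns the last baseLabels labels of the host, e.g. "example.com".
    // Assumption: the caller supplies baseLabels (2 for example.com, 3 for foo.com.ar).
    public static string BaseDomain(Uri url, int baseLabels = 2)
    {
        string[] labels = url.Host.Split('.');
        int skip = Math.Max(0, labels.Length - baseLabels);
        return string.Join(".", labels.Skip(skip));
    }

    // Rebuilds the URL with the given subdomain in front of the base domain,
    // keeping scheme, port, path and query unchanged.
    public static Uri WithSubdomain(Uri url, string subdomain, int baseLabels = 2)
    {
        var builder = new UriBuilder(url)
        {
            Host = subdomain + "." + BaseDomain(url, baseLabels)
        };
        return builder.Uri;
    }
}
```

With this, SubdomainSketch.WithSubdomain(Request.Url, "blog") would give the redirect target whether the current host is example.com or www.example.com.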

You can use the Uri class to parse the strings. There are many properties available in addition to Segments:
Uri MyUri = new Uri("https://www.google.com.ar:8080/dir/1/2/search.html?arg=0-a&arg1=1-b&arg3-c#hash");
foreach (String Segment in MyUri.Segments)
    Response.Write(Segment + "<br />");

I think you should reconsider whether usage of a RegEx is really needed in this case;
I think extracting the top level domain from a URL is quite simple; in the case of "http://www.example.com/?blah=111" you can simply take the part before the 3rd slash, perform a String.Split('.') and concatenate the last two array items. In the case of "http://www.example.com" it is even easier.
Regex patterns are very error-prone and quite hard to maintain, and in my opinion you won't gain any advantage from using one here. I recommend that you get rid of the regex. Perhaps the result will be 2 - 3 more lines of code, but it will work, and your code will be much more readable and easier to understand.

Wikilinks - turn the text [[a]] into an internal link

I need to implement something similar to wikilinks on my site. The user is entering plain text and will enter [[asdf]] wherever there is an internal link. Only the first five examples are really applicable in the implementation I need.
Would you use regex, what expression would do this? Is there a library out there somewhere that already does this in C#?

On the pure regexp side, the expression would rather be:
\[\[([^\]\|\r\n]+?)\|([^\]\|\r\n]+?)\]\]([^\] ]\S*)
\[\[([^\]\|\r\n]+?)\]\]([^\] ]\S*)
By replacing the (.+?) suggested by David with ([^\]\|\r\n]+?), you ensure that you only capture legitimate wiki link texts, without closing square brackets or newline characters.
([^\] ]\S*) at the end ensures the wiki link expression is not followed by a closing square bracket either.
I am not sure whether there are C# libraries that already implement this kind of detection.
However, to make that kind of detection really foolproof with regexp, you should use the pushdown automaton (balancing groups) present in the .NET regexp engine, as illustrated here.

I don't know if there are existing libraries to do this, but if it were me I'd probably just use regexes:
match \[\[(.+?)\|(.+?)\]\](\S+) and replace with \1\3
match \[\[(.+?)\]\](\S+) and replace with \1\2
Or something like that, anyway.

Although this is an old question and already answered, I thought I'd add this as an addendum for anyone else coming along. The existing two answers do all the real work and got me 90% there, but here is the last bit for anyone looking for code to get straight on with trying:
string html = "Some text with a wiki style [[page2.html|link]]";
html = Regex.Replace(html, @"\[\[([^\]\|\r\n]+?)\|([^\]\|\r\n]+?)\]\]([^\] ]\S*)", @"$2$3");
html = Regex.Replace(html, @"\[\[([^\]\|\r\n]+?)\]\]([^\] ]\S*)", @"$1$2");
The only change to the actual regex is I think the original answer had the replacement parts the wrong way around, so the href was set to the display text and the link was shown on the page. I've therefore swapped them.
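As a further sketch, the two replacements can be folded into a single pass with a MatchEvaluator. Several things here are assumptions and not taken from the answers above: the class and method names, the anchor markup that is produced, and the HTML encoding of both parts. The trailing ([^\] ]\S*) group is also dropped, because it requires at least one character after the closing ]] and so would skip a link at the very end of the string.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class WikiLinkSketch
{
    // [[target]] or [[target|text]]; neither part may contain ']', '|' or a newline.
    private static readonly Regex WikiLink = new Regex(
        @"\[\[([^\]\|\r\n]+?)(?:\|([^\]\|\r\n]+?))?\]\]");

    // Replaces every wiki link in input with an HTML anchor.
    public static string ToHtml(string input)
    {
        return WikiLink.Replace(input, m =>
        {
            string target = m.Groups[1].Value;
            // Without a display text, the target doubles as the link text.
            string text = m.Groups[2].Success ? m.Groups[2].Value : target;
            // Assumed output markup; both parts are encoded so user text cannot inject HTML.
            return "<a href=\"" + WebUtility.HtmlEncode(target) + "\">"
                 + WebUtility.HtmlEncode(text) + "</a>";
        });
    }
}
```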
