Regex - Extract also URLs with www - c#

I use this regex to find URLs:
(http|ftp|https):\/\/([\w\-_]+(?:(?:\.[\w\-_]+)+))([\w\-\.,@?^=%&:/~\+#]*[\w\-\@?^=%&/~\+#])?
Problem is, that it doesn't find urls which start with www.
How can I solve this?
Here is my data source that I need to extract urls from.

This answer is based on the XML file you provided in your comment.
There are a couple of issues with your file. Besides URLs that start with https, http and www, it contains URLs that start with download.somedomain.com or marketplace.somedomain.com, so it is inconsistent. Another issue is how the URLs end: a URL can be followed directly by . or </ with no space after it, and the file has no pattern that would let you go through it line by line or chunk by chunk.
Finally, it contains duplicates.
I chose to solve this by splitting the regex into two parts:
The first part takes everything that starts like a valid URL, without looking at how it ends.
The second part extracts the valid URL from what the first part matched.
For the duplicates, I used a HashSet.
The solution does not consider specific tags or specific content in the XML; it only cares about URLs in the content.
Here is the solution:
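// Requires: using System.Collections.Generic; using System.Text.RegularExpressions;
// "input" is the text of the XML file.
HashSet<string> urls = new HashSet<string>();
// Part 1: anything that starts like a URL (scheme or www.), ignoring how it ends.
var beginWith = new Regex(@"\b(?:(http|ftp|https)?://|www\.)\S+\b", RegexOptions.Compiled | RegexOptions.IgnoreCase);
// Part 2: the valid URL inside each candidate (built once, outside the loop).
var endWith = new Regex(@"([\w\-_]+(?:(?:\.[\w\-_]+)+))([\w\-\.,@?^=%&:/~\+#]*[\w\-\@?^=%&/~\+#])?");
foreach (Match item in beginWith.Matches(input))
{
    foreach (Match url in endWith.Matches(item.Value))
    {
        urls.Add(url.Value); // the HashSet drops duplicates
    }
}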
The code can indeed be shortened and improved; I leave that to your imagination.
Here are the first 5 URLs of the final output for the file:
www.w3.org/2005/Atom
marketplace.xboxlive.com/resource/product/v1
www.xbox.com/live/accounts
download.xbox.com/content/images/66acd000-77fe-1000-9115-d802534307d4/1033/boxartlg.jpg
download.xbox.com/content/images/66acd000-77fe-1000-9115-d802534307d4/1033/boxartsm.jpg
etc.....

Well, just check whether your string starts with "https://" or "http://"; if it doesn't, add https:// at the beginning ^^
string url = "";
if (!url.StartsWith("https://") && !url.StartsWith("http://"))
{
    // Strings are immutable: Insert returns a new string, so assign the result.
    url = url.Insert(0, "https://");
}

Related

C# regex to parse /simple1/1.2-SNAPSHOT/

I need to find the last two values at the end of such a string, "simple1" and "1.2-SNAPSHOT" in the sample url below. But my code below (try to get simple1/1.2-SNAPSHOT/) doesn't work, can anyone help?
http://localhost:8060/nexus/service/local/repositories/snapshots/content/org/sonatype/mavenbook/simple1/1.2-SNAPSHOT/
List<string> artifacts = new List<string>(); // these are already folder URLs
// store all URLs of the artifacts to be deleted
artifacts = nexusAPI.findArtifacts(repository, contents, days, pattern);
var regex = new Regex(".*\\/(.*\\/.*\\/)$");
foreach (string url in artifacts)
{
Console.WriteLine("group/artifact: {0}", regex.Matches(url));
}
I would just split the string on '/' and get the last two parts. The regex isn't going to do anything more than that.
If you must use regex, you're running into the fact that regexes are greedy by default, which means each .* grabs as much as it possibly can. So your first step is to make the regex non-greedy. Simply use this as your pattern:
(.*?)/
Here's a simple test showing how this works.
This tells the regex to look for any characters up to the next slash, and then stop.
When you call Regex.Matches(url, "(.*?)/"), you get back a collection of matches. From there, you can just look at the last two elements.
Of course, as SledgeHammer mentioned, this is one case where regex is unnecessary and even cumbersome. Simply working with url.Split(new char[] {'/'}) will give you the results you need.
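The split approach could look roughly like the sketch below. The class and method names (UrlSegments, LastTwo) are made up for illustration, and it assumes the URL has at least two path segments and may end with a trailing slash, as in the question's sample.

```csharp
using System;

public static class UrlSegments
{
    // Returns the last two non-empty segments of a URL,
    // e.g. "simple1" and "1.2-SNAPSHOT" for the sample URL in the question.
    public static string[] LastTwo(string url)
    {
        // RemoveEmptyEntries drops the empty strings produced by "//" and by the trailing '/'.
        string[] parts = url.Split(new[] { '/' }, StringSplitOptions.RemoveEmptyEntries);
        if (parts.Length < 2)
        {
            throw new ArgumentException("URL has fewer than two segments.", nameof(url));
        }
        return new[] { parts[parts.Length - 2], parts[parts.Length - 1] };
    }
}
```

Inside the question's foreach loop, something like Console.WriteLine("group/artifact: {0}/{1}", pair[0], pair[1]) with pair = UrlSegments.LastTwo(url) would then print the two values.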

Finding reCaptcha ID with Regex

Alright, so I've been trying to pull the reCaptcha ID out of a web page source that I'm downloading. I was going to do this with Regex: pull out the line that contains it, then pull the ID from there [if that makes sense].
Here's how I'm doing it right now:
WebClient W = new WebClient();
W.Encoding = System.Text.Encoding.UTF8;
string pattern = "recaptcha_challenge_field";
string SourceCode = W.DownloadString("http://www.xtremetop100.com/in.php?COLLCC=4025385947&COLLCC=1765882190&site=1132330052");
foreach (string Match in Regex.Split(SourceCode, Environment.NewLine))
{
if (Regex.IsMatch(Match, pattern, RegexOptions.IgnoreCase))
{
MessageBox.Show(Match);
}
}
Problem being, that it just shows the whole page source besides the line with the "pattern" in it. I tried changing the encoding type because I thought it was returning the source as one big sentence, but I guess that's not the answer. Any help here guys? Thank you.
First of all, you have a terrible naming convention!
Local variables should have names starting with a lowercase letter, so "sourceCode", "match". A capital letter suggests that "Match" is a class, not a variable.
Secondly, why are you using Split from the Regex class only to split a string by new lines? Use the built-in string method:
foreach (string line in sourceCode.Split(new string[] { Environment.NewLine }, StringSplitOptions.None))
Now... if you notice, I changed the name of the variable, so now your code will look like
if (Regex.IsMatch(line, pattern, RegexOptions.IgnoreCase))
{
MessageBox.Show(line);
}
And now it is obvious that the code does exactly what you asked for: if a line matches the pattern, it shows the whole line.
Another thing: what is your regex pattern? It is more like a plain comparison that checks whether the text is present or not. Try reading more about regex. Your pattern should look more like
recaptcha_challenge_field=([0-9]+). I don't know exactly, because the link you posted contains only a "refresh" meta tag.
And try to use the Regex.Match method instead of Regex.IsMatch. It gives you more information: not only whether the string matches your pattern, but also which groups you captured.
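To illustrate the Regex.Match suggestion, here is a minimal sketch. The pattern is only the guess from above (the real page markup may well differ), and the names ChallengeFinder and FindId are hypothetical.

```csharp
using System.Text.RegularExpressions;

public static class ChallengeFinder
{
    // Assumed pattern, taken from the guess in the answer; adjust it to the actual page source.
    private static readonly Regex Pattern =
        new Regex(@"recaptcha_challenge_field=([0-9]+)", RegexOptions.IgnoreCase);

    // Returns the digits captured by group 1, or null when the pattern is not found.
    public static string FindId(string sourceCode)
    {
        Match m = Pattern.Match(sourceCode);
        return m.Success ? m.Groups[1].Value : null;
    }
}
```

Because the match runs over the whole downloaded string, there is no need to split the source into lines first.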

Regex to Replace the end of the Url

I have a URL that follows a pattern like the one below:
http://i.ebayimg.com/00/s/MTUw12323gxNTAw/$(KGr123qF,!p0F123Q~~60_12.JPG?set_id=88123231232F
I need a regex to find and replace the end of the URL, _12.JPG, with _14.JPG. So basically I need to capture the _[numbers only].JPG pattern and replace it with my value.
var regex = new Regex(@"_\d+\.JPG");
var newUrl = regex.Replace(url, "_14.JPG");
_[0-9]+\.JPG\?
works for the sample URL. You didn't really mention whether you wanted the
?set_id=88123231232F gone or not.
Basically, you shouldn't normally be concerned with periods anywhere else in the URL. It is possible, but the additional constraint of the jpg extension should limit anything returned with not much issue.
// Equivalent of /_(\d?\d)\.jpg/ig
var regex = new Regex(@"_(\d?\d)\.[Jj][Pp][Gg]");
That will capture one or two numbers between an underscore and .jpg
I will double check this, but it should work for both one digit and two digits.

Regular expression for recognizing url

I want to create a Regex for url in order to get all links from input string.
The Regex should recognize the following formats of the url address:
http(s)://www.webpage.com
http(s)://webpage.com
www.webpage.com
and also the more complicated urls like:
- http://www.google.pl/#sclient=psy&hl=pl&site=&source=hp&q=regex+url&pbx=1&oq=regex+url&aq=f&aqi=g1&aql=&gs_sm=e&gs_upl=1582l3020l0l3199l9l6l0l0l0l0l255l1104l0.2.3l5l0&bav=on.2,or.r_gc.r_pw.&fp=30a1604d4180f481&biw=1680&bih=935
I have the following one
((www\.|https?|ftp|gopher|telnet|file|notes|ms-help):((//)|(\\\\))+[\w\d:#@%/;$()~_?\+-=\\\.&]*)
but it does not recognize the following pattern: www.webpage.com. Can someone please help me to create an appropriate Regex?
EDIT:
It should find each link and, moreover, place the link at the appropriate index, like this:
private readonly Regex RE_URL = new Regex(@"((https?|ftp|gopher|telnet|file|notes|ms-help):((//)|(\\\\))+[\w\d:#@%/;$()~_?\+-=\\\.&]*)", RegexOptions.Multiline);
foreach (Match match in (RE_URL.Matches(new_text)))
{
// Copy raw string from the last position up to the match
if (match.Index != last_pos)
{
var raw_text = new_text.Substring(last_pos, match.Index - last_pos);
text_block.Inlines.Add(new Run(raw_text));
}
// Create a hyperlink for the match
var link = new Hyperlink(new Run(match.Value))
{
NavigateUri = new Uri(match.Value)
};
link.Click += OnUrlClick;
text_block.Inlines.Add(link);
// Update the last matched position
last_pos = match.Index + match.Length;
}
I don't know why your result in match is only http:// but I cleaned your regex a bit
((?:(?:https?|ftp|gopher|telnet|file|notes|ms-help):(?://|\\\\)(?:www\.)?|www\.)[\w\d:#@%/;$()~_?\+,\-=\\.&]+)
(?:) are non-capturing groups, which means there is only one capturing group left, and it contains the complete matched string.
(?:(?:https?|ftp|gopher|telnet|file|notes|ms-help):(?://|\\\\)(?:www\.)?|www\.) The link now has to start either with something from the first list followed by an optional www., or with www. alone.
[\w\d:#@%/;$()~_?\+,\-=\\.&] I added a comma to the list (otherwise your long example does not match), escaped the - (you were creating a character range) and unescaped the . (escaping is not needed in a character class).
See this here on Regexr, a useful tool to test regexes.
But URL matching is not a simple task, please see this question here
I've just written up a blog post on recognising URLs in most used formats such as:
www.google.com
http://www.google.com
mailto:somebody@google.com
somebody@google.com
www.url-with-querystring.com/?url=has-querystring
The regular expression used is /((([A-Za-z]{3,9}:(?:\/\/)?)(?:[-;:&=\+\$,\w]+@)?[A-Za-z0-9.-]+|(?:www.|[-;:&=\+\$,\w]+@)[A-Za-z0-9.-]+)((?:\/[\+~%\/.\w-_]*)?\??(?:[-\+=&;%@.\w_]*)#?(?:[\w]*))?)/ however I would recommend you go to http://blog.mattheworiordan.com/post/13174566389/url-regular-expression-for-links-with-or-without-the to see a complete working example along with an explanation of the regular expression in case you need to extend or tweak it.
The regex you give doesn't work for www. addresses because it expects a URI scheme (the bit before the URL, like http://). The 'www.' part in your regular expression doesn't help because it would only match www.:// (which is meaningless).
Try something like this instead:
((((https?|ftp|gopher|telnet|file|notes|ms-help):((//)|(\\\\))+)|(www\.))[\w\d:#@%/;$()~_?\+-=\\\.&]*)
This will match something with a valid URI scheme, or something beginning with 'www.'
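One thing to watch once www. links match: new Uri("www.webpage.com") throws a UriFormatException, because the string has no scheme and is not an absolute URI. A small helper along these lines could be used when setting NavigateUri. The name LinkHelper.ToAbsoluteUri is made up, and defaulting to http:// is an assumption (the site may prefer https://).

```csharp
using System;

public static class LinkHelper
{
    // Builds an absolute Uri from a matched link.
    // Matches that start with "www." get an assumed "http://" prefix;
    // everything else is passed to Uri unchanged.
    public static Uri ToAbsoluteUri(string matched)
    {
        string candidate = matched.StartsWith("www.", StringComparison.OrdinalIgnoreCase)
            ? "http://" + matched
            : matched;
        return new Uri(candidate);
    }
}
```

In the loop from the question this would replace new Uri(match.Value) with LinkHelper.ToAbsoluteUri(match.Value), while the visible text of the hyperlink stays match.Value.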

c# regex - matching optionals after a named group

I'm sure this has been asked numerous times, but though I've checked all the similar questions, I couldn't come up with a solution.
The problem is that I have input URLs similar to:
http://www.justin.tv/peacefuljay
http://www.justin.tv/peacefuljay#/w/778713616/3
http://de.justin.tv/peacefuljay#/w/778713616/3
I want to match the slug part of it (in above examples, it's peacefuljay).
The regexes I've tried so far are:
http://.*\.justin\.tv/(?<Slug>.*)(?:#.)?
http://.*\.justin\.tv/(?<Slug>.*)(?:#.)
But I can't come up with a solution. Either it fails on the first URL or on the others.
Help appreciated.
The easiest way of parsing a Uri is by using the Uri class:
string justin = "http://www.justin.tv/peacefuljay#/w/778713616/3";
Uri uri = new Uri(justin);
string s1 = uri.LocalPath; // "/peacefuljay"
string s2 = uri.Segments[1]; // "peacefuljay"
If you insist on a regex, you can try something a bit more specific:
Match mate = Regex.Match(str, @"http://(\w+\.)*justin\.tv(?:/(?<Slug>[^#]*))?");
(\w+\.)* - Ensures you match the domain, not anywhere else in the string (eg, hash or query string).
(?:/(?<Slug>[^#]*))? - Optional group with the string you need. [^#] limits the characters you expect to see in your slug, so it should eliminate the need of the extra group after it.
As I see it, there's no reason to deal with the parts after the "slug".
Therefore you only need to match all characters after the host that aren't "/" or "#".
http://.*\.justin\.tv/(?<Slug>[^/#]+)
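As a quick way to try that pattern against the URLs from the question, a sketch like this reads the named group. SlugParser and GetSlug are illustrative names only.

```csharp
using System.Text.RegularExpressions;

public static class SlugParser
{
    // Pattern from the answer: the slug is everything after the host up to the next '/' or '#'.
    private static readonly Regex SlugPattern =
        new Regex(@"http://.*\.justin\.tv/(?<Slug>[^/#]+)");

    // Returns the value of the named group "Slug", or null when the URL does not match.
    public static string GetSlug(string url)
    {
        Match m = SlugPattern.Match(url);
        return m.Success ? m.Groups["Slug"].Value : null;
    }
}
```

For all three sample URLs in the question, GetSlug should return peacefuljay.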
http://.*\.justin\.tv/(?<Slug>.*)#*?
or
http://.*\.justin\.tv/(?<Slug>.*)(#|$)
