Extracting image link through regex in C# - c#

I have a bunch of links in this format
http://imgur.com/a/bwBpM
http://imgur.com/a/bwBpM[/IMG]
[IMG]http://imgur.com/a/bwBpM
[IMG]http://imgur.com/a/bwBpM[/IMG]
The IMG tags are only supplied in some cases, and I want to extract the link, i.e. http://imgur.com/a/bwBpM in this case. Is there an easy way to do this through regex in C#?

If you're saying that you have the text in the question in some kind of list and they are always either in the format of:
Just the Url
the Url + partial or full tags
then the easiest thing to do is to run:
url = url.Replace("[IMG]", "").Replace("[/IMG]", "");
if there are no tags then there is no change, but if the tags are there they will be stripped out.
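As a rough sketch of that approach, the four sample links from the question can be run through a small helper. The name StripImgTags and the use of C# top-level statements are assumptions made for illustration only.

```csharp
using System;

// Hypothetical helper: strips optional [IMG] / [/IMG] wrappers.
// Input without tags is returned unchanged.
static string StripImgTags(string url) =>
    url.Replace("[IMG]", "").Replace("[/IMG]", "");

string[] samples =
{
    "http://imgur.com/a/bwBpM",
    "http://imgur.com/a/bwBpM[/IMG]",
    "[IMG]http://imgur.com/a/bwBpM",
    "[IMG]http://imgur.com/a/bwBpM[/IMG]"
};

foreach (string sample in samples)
{
    // Each of the four samples prints http://imgur.com/a/bwBpM
    Console.WriteLine(StripImgTags(sample));
}
```

Note that string.Replace is case-sensitive, so lowercase [img] tags would need separate handling.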

You could use this pattern:
^(?:\[IMG\])?([^[]*)(?:\[/IMG\])?$
You can get the output using:
var match = Regex.Match(input, @"^(?:\[IMG\])?([^[]*)(?:\[/IMG\])?$");
if (match.Success)
{
Console.WriteLine(match.Groups[1].Value); // http://imgur.com/a/bwBpM
}

Related

Regex match URL if not in html comment line

I want to match "https://www.mysite/embed/M7znk1c-ay0" only if it is not html comment.
So don't match this line
<!--<p><iframe src="https://www.mysite/embed/M7znk1c-ay0" width="854" height="480" frameborder="0" allowfullscreen="allowfullscreen"></iframe>-->
but match this line
<article class="art-post"><div class="art-postcontent clearfix"><div class="art-article"><p><iframe src="https://www.mysite/embed/M7znk1c-ay0" ></iframe></p>
I tried this pattern ^(?=<!--).*www.mysite\/embed\/+[\w\-]*
but it isn't quite working
You almost did it correctly. The correct regex is ^(?!<!--).*"(.*www.mysite\/embed\/+[\w\-]*).
HTML is not regular so using regular expressions to parse html might not be a good idea...
@csabinho's answer ^(?!<!--).*"(.*www.mysite\/embed\/+[\w\-]*) won't work if the URL you want to match is in the middle of a page; it simply checks that the line doesn't begin with a comment.
Best practice would be to create DOM and use XPath to query XML-like contents.
Edit:
By the way you can use following code first to remove comments.
using System.Text.RegularExpressions;
...
string pattern = @"(<!--(.+?)-->)";
var res = Regex.Replace(input, pattern, "", RegexOptions.Singleline);
and then use a simple pattern to extract the URL from result
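The two steps can be sketched together as follows. The helper name FindEmbedUrl and the second pattern are assumptions made for illustration, and the comment pattern uses .*? rather than .+? so that an empty comment is removed as well.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper: returns the first embed URL that is not inside an
// HTML comment, or an empty string when there is none.
static string FindEmbedUrl(string html)
{
    // Step 1: drop every <!-- ... --> block (Singleline lets '.' span newlines).
    string withoutComments = Regex.Replace(html, @"<!--.*?-->", "", RegexOptions.Singleline);

    // Step 2: look for the embed URL in whatever is left.
    Match match = Regex.Match(withoutComments, @"https://www\.mysite/embed/[\w\-]+");
    return match.Success ? match.Value : string.Empty;
}

string commented = @"<!--<p><iframe src=""https://www.mysite/embed/M7znk1c-ay0""></iframe>-->";
string live = @"<div class=""art-article""><p><iframe src=""https://www.mysite/embed/M7znk1c-ay0"" ></iframe></p>";

Console.WriteLine(FindEmbedUrl(commented).Length); // 0, the URL was inside a comment
Console.WriteLine(FindEmbedUrl(live));             // https://www.mysite/embed/M7znk1c-ay0
```

This is still regex-on-HTML, so it shares the caveats mentioned above; a DOM parser remains the more robust route.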

Regex for Removing Comma between <a> tag text C#

I have the following string. I have tried many regexes to remove the comma inside the <a> tag's text, but none of them worked. Whenever the text inside an <a> tag contains a comma, I want the comma replaced with an empty string.
Getty Center, Restaurant at the
I have tried this regex, but it is not working; here input is a string that contains HTML.
input = Regex.Replace(input, @"<a(\s+[^>]*)?>[^\w\s]</a(\s+[^>]*)?>", "");
Please help me out. Thank you.
You can use the Regex to find and modify the content of the tag like so.
var input = "Getty Center, Restaurant at the";
var regex = new Regex(@"<a[^>]*>(?<content>.*?)</a[^>]*>",
RegexOptions.Singleline);
var match = regex.Match(input);
while (match.Success) {
var group = match.Groups["content"];
input = input.Substring(0, group.Index)
+ group.Value.Replace(",", "")
+ input.Substring(group.Index + group.Length);
match = regex.Match(input, group.Index);
}
The loop is in place to catch multiple tags in the same string. The Regex however is fairly naive. It will mess with tags nested inside the A tag, and will parse incorrectly if a > is in any of the attributes. (Though that would probably be bad HTML anyway.) A proper HTML parser is recommended for this reason.
I would suggest using an HTML parser. There are plenty available that are open source and free. One of the best I have found is HtmlAgilityPack.
In a nutshell, the following code snippet will give you all img tags:
HtmlDocument myDoc = new HtmlDocument();
myDoc.Load(path);
// SelectNodes returns null when no node matches the XPath expression.
HtmlNodeCollection imgs = myDoc.DocumentNode.SelectNodes("//img");
Hope that helps
If you want to directly use the replace, you will have to match only the comma and not the text before or after the comma. You'd have to use look ahead and look behind to check if the comma is in the tag. Although this is doable, it is not advised to do this.
An alternative is to use matching groups to match the whole text in the tag and group the comma if it exists and replace the match.
<a[^>]+>[\w\s]*(,?)[\w\s]*<\/a>
The first capture group captures the comma if present. You can test it here: http://rubular.com/r/K2jjIaObty
The best option would be to use a html parser to capture contents of the a tag, search for comma and replace.
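If a regex is used anyway, one way to sketch the "match the whole tag, then edit its text" idea is Regex.Replace with a match evaluator. The helper name RemoveCommasInAnchors and the sample markup (including its href) are assumptions made for illustration, and the same caveats about nested tags apply.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper: removes commas from the text between <a ...> and </a>,
// leaving everything outside anchor tags untouched.
static string RemoveCommasInAnchors(string html)
{
    var anchor = new Regex(
        @"(?<open><a\b[^>]*>)(?<content>.*?)(?<close></a\s*>)",
        RegexOptions.Singleline | RegexOptions.IgnoreCase);

    // The evaluator rebuilds each match with the commas stripped from its content.
    return anchor.Replace(html, m =>
        m.Groups["open"].Value
        + m.Groups["content"].Value.Replace(",", "")
        + m.Groups["close"].Value);
}

// Assumed sample input; the href is made up.
string input = @"Visit <a href=""/getty"">Getty Center, Restaurant at the</a>, then go home.";

Console.WriteLine(RemoveCommasInAnchors(input));
// Visit <a href="/getty">Getty Center Restaurant at the</a>, then go home.
```

Because Regex.Replace walks all matches itself, no manual loop or index bookkeeping is needed for multiple tags in the same string.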

Regular expression for recognizing url

I want to create a Regex for url in order to get all links from input string.
The Regex should recognize the following formats of the url address:
http(s)://www.webpage.com
http(s)://webpage.com
www.webpage.com
and also the more complicated urls like:
- http://www.google.pl/#sclient=psy&hl=pl&site=&source=hp&q=regex+url&pbx=1&oq=regex+url&aq=f&aqi=g1&aql=&gs_sm=e&gs_upl=1582l3020l0l3199l9l6l0l0l0l0l255l1104l0.2.3l5l0&bav=on.2,or.r_gc.r_pw.&fp=30a1604d4180f481&biw=1680&bih=935
I have the following one
((www\.|https?|ftp|gopher|telnet|file|notes|ms-help):((//)|(\\\\))+[\w\d:#@%/;$()~_?\+-=\\\.&]*)
but it does not recognize the following pattern: www.webpage.com. Can someone please help me to create an appropriate Regex?
EDIT:
It should find each link and, moreover, place a hyperlink at the matching index, like this:
private readonly Regex RE_URL = new Regex(@"((https?|ftp|gopher|telnet|file|notes|ms-help):((//)|(\\\\))+[\w\d:#@%/;$()~_?\+-=\\\.&]*)", RegexOptions.Multiline);
foreach (Match match in (RE_URL.Matches(new_text)))
{
// Copy raw string from the last position up to the match
if (match.Index != last_pos)
{
var raw_text = new_text.Substring(last_pos, match.Index - last_pos);
text_block.Inlines.Add(new Run(raw_text));
}
// Create a hyperlink for the match
var link = new Hyperlink(new Run(match.Value))
{
NavigateUri = new Uri(match.Value)
};
link.Click += OnUrlClick;
text_block.Inlines.Add(link);
// Update the last matched position
last_pos = match.Index + match.Length;
}
I don't know why your result in match is only http:// but I cleaned your regex a bit
((?:(?:https?|ftp|gopher|telnet|file|notes|ms-help):(?://|\\\\)(?:www\.)?|www\.)[\w\d:#@%/;$()~_?\+,\-=\\.&]+)
(?:) are non capturing groups, that means there is only one capturing group left and this contains the complete matched string.
(?:(?:https?|ftp|gopher|telnet|file|notes|ms-help):(?://|\\\\)(?:www\.)?|www\.) The link now has to start either with something from the first list followed by an optional www., or with www. alone.
[\w\d:#@%/;$()~_?\+,\-=\\.&] I added a comma to the list (otherwise your long example does not match), escaped the - (you were creating a character range), and unescaped the . (escaping is not needed inside a character class).
You can try it on Regexr, a useful tool for testing regexes.
But URL matching is not a simple task; see the related question on that topic.
I've just written up a blog post on recognising URLs in most used formats such as:
www.google.com
http://www.google.com
mailto:somebody@google.com
somebody@google.com
www.url-with-querystring.com/?url=has-querystring
The regular expression used is /((([A-Za-z]{3,9}:(?:\/\/)?)(?:[-;:&=\+\$,\w]+@)?[A-Za-z0-9.-]+|(?:www.|[-;:&=\+\$,\w]+@)[A-Za-z0-9.-]+)((?:\/[\+~%\/.\w-_]*)?\??(?:[-\+=&;%@.\w_]*)#?(?:[\w]*))?)/ however I would recommend you go to http://blog.mattheworiordan.com/post/13174566389/url-regular-expression-for-links-with-or-without-the to see a complete working example along with an explanation of the regular expression in case you need to extend or tweak it.
The regex you give doesn't work for www. addresses because it is expecting a URI scheme (the bit before the URL, like http://). The 'www.' part in your regular expression doesn't work because it would only match www.:// (which is meaningless).
Try something like this instead:
((((https?|ftp|gopher|telnet|file|notes|ms-help):((//)|(\\\\))+)|(www\.))[\w\d:#@%/;$()~_?\+-=\\\.&]*)
This will match something with a valid URI scheme, or something beginning with 'www.'
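A quick way to check such a pattern is to run it over a sample sentence. The sketch below uses the cleaned-up pattern with non-capturing groups from the earlier answer; the helper names FindUrls and EnsureScheme and the sample text are assumptions made for illustration. EnsureScheme is included because new Uri("www.webpage.com") throws a UriFormatException when the string has no scheme, which matters for the NavigateUri assignment in the question.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper: returns every URL-like match in the text.
static List<string> FindUrls(string text)
{
    var urlPattern = new Regex(
        @"((?:(?:https?|ftp|gopher|telnet|file|notes|ms-help):(?://|\\\\)(?:www\.)?|www\.)[\w\d:#@%/;$()~_?\+,\-=\\.&]+)");

    var urls = new List<string>();
    foreach (Match match in urlPattern.Matches(text))
    {
        urls.Add(match.Value);
    }
    return urls;
}

// Hypothetical helper: prefixes scheme-less www. links so they can be passed to new Uri(...).
static string EnsureScheme(string url) =>
    url.StartsWith("www.", StringComparison.OrdinalIgnoreCase) ? "http://" + url : url;

string text = "see www.webpage.com and https://webpage.com/a?b=1 or http://www.webpage.com now";

foreach (string url in FindUrls(text))
{
    // Matches: www.webpage.com, https://webpage.com/a?b=1, http://www.webpage.com
    Console.WriteLine(new Uri(EnsureScheme(url)));
}
```

Since the character class contains '.' and ',', punctuation that directly follows a link in running text becomes part of the match, so real input may need the trailing characters trimmed.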

How do you match a regex to a google search correction string in javascript

I need to use regex to get the "Did you mean?" portion of the source code from a Google search. I am not aware of any difference between regex in C# and in JavaScript, but the JavaScript version does not match.
This is the regex I have in C#:
output = output.Replace("\"", "");
string regex = "Did you mean: </span><a href=/search.[a-zA-Z0-9=&;_-]{1,}q=[a-zA-Z0-9+-]{1,}";
This is what I have in JavaScript:
var response = this.responseText.replace("\"", "")
var regex = new RegExp("Did you mean: </span><a href=.search.[a-zA-Z0-9=&;_-]{1,}q=[a-zA-Z0-9+-]{1,}")
This is part of the response I am getting back from google:
style="color:#cc0000">Did you mean: </span><a href=/search?hl=en&safe=off&&sa=X&ei=DXtLTd2hKYjKgQfJ0sBD&ved=0CBIQBSgA&q=Linkin+Park-In+The+End&spell=1"class=spell>Linkin Park-In <b><i>The</i></b> End</a> <br></div><!--a--><h2 class=hd>Search Results</h2><div id=ires><ol><li class="g videobox" id=videobox><h3 class=r><a href="/search?q=Linkin+Park-In+Th+End&hl=en&
How can I make the javascript regex match correctly?
NOTE: I already know that the C# regex only matches once on the page.
Why use regex here?
If you look at the source code, the "Did you mean" section is inside a div with the ID topstuff, so you can simply read the innerHTML of that div.

C# extracting certain parts of a string

I have a console application which is parsing HTML documents via the WebRequest method (http). The issue is really with extracting data from the html code that is returned.
Below is a fragment of the html I am interested in:
<span class="header">Number of People:</span>
<span class="peopleCount">1001</span> <!-- this is the line we are interested in! -->
<span class="footer">As of June 2009.</span>
Assume that the above html is contained in a string called "responseHtml". I would like to just extract the 'People Count' value, (second line).
I've searched Stack Overflow and found some code that could work:
How do I extract text that lies between parentheses (round brackets)?
But when I implement it, it doesn't work - I don't think it likes the way I have placed HTML tags into the regex:
string responseHtml; // this is already filled with html code above ^^
string insideBrackets = null;
Regex regex = new Regex("\\<span class=\"peopleCount\">?<TextInsideBrackets>\\w+\\</span>");
Match match = regex.Match(responseHtml);
if (match.Success)
{
insideBrackets = match.Groups["TextInsideBrackets"].Value;
Console.WriteLine(insideBrackets);
}
The above just fails to work, is it something to do with the html span brackets? All I want is the text value in between the tags for that specific span.
Thanks in advance!
Try this one:
Regex regex = new Regex("class=\\\"peopleCount\\\"\\>(?<data>[^\\<]*)",
RegexOptions.CultureInvariant
| RegexOptions.Compiled
);
It should be a tad faster, as you are basically saying the data you are looking for starts after peopleCount"> and ends at the first <
(I changed the group name to data)
Cheers,
Florian
?<TextInsideBrackets> is incorrect
You need:
(?<TextInsideBrackets>...)
I assume you want to do a named capture.
You should use
Regex regex = new Regex("\\<span class=\"peopleCount\">(?<TextInsideBrackets>\\w+)\\</span>");
and not
Regex regex = new Regex("\\<span class=\"peopleCount\">?<TextInsideBrackets>\\w+\\</span>");
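To see the corrected named capture in action, here is a minimal sketch run against the fragment from the question. The helper name ExtractPeopleCount is an assumption made for illustration; the pattern is the corrected one, written as a verbatim string.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper: returns the text inside the peopleCount span,
// or an empty string when the span is not found.
static string ExtractPeopleCount(string html)
{
    var regex = new Regex(@"<span class=""peopleCount"">(?<TextInsideBrackets>\w+)</span>");
    Match match = regex.Match(html);
    return match.Success ? match.Groups["TextInsideBrackets"].Value : string.Empty;
}

string responseHtml =
    "<span class=\"header\">Number of People:</span>\n" +
    "<span class=\"peopleCount\">1001</span> <!-- this is the line we are interested in! -->\n" +
    "<span class=\"footer\">As of June 2009.</span>";

Console.WriteLine(ExtractPeopleCount(responseHtml)); // 1001
```

The angle brackets do not need escaping in a .NET regex; only the missing parentheses around the named group kept the original pattern from working.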
