Regular expression Google image - c#

I am writing an RSS reader and need to extract the image URL from a Google News RSS feed using a regular expression.
The RSS channel is https://news.google.com/?output=rss.
An image from the feed looks like this, for example:
<img src="//t0.gstatic.com/images?q=tbn:ANd9GcRfMZ3MOzznCthFKCdIan17n9B8vZvEE-tRSQVTcgJa5i1OPfdf90zi4mBuGzPfB7Bj2mwE0TE" alt="" border="1" width="80" height="80" />
By the way, this is the regex I currently use:
Regex regx = new Regex("\\s*(?:\"(?<1>[^\"]*)\"|(?<1>\\S+))",
RegexOptions.IgnoreCase);
Any advice?

First, you should not parse XML with regex; use XmlDocument, XmlReader, or another XML parser instead.
If you know what you are doing, here is the quick and dirty regex solution.
1. All image tags in your feed seem to be inside description tags, and they are of course XML-encoded (keep that in mind for the next few steps).
2. Next, look at some example img tags.
   Are you also looking for img tags without a src attribute, or with an empty src?
   Overall: define what you are looking for.
3. Design your regex.
   Because the feed is generated automatically, the tags seem to appear in the same order every time (we use that fact to keep the regex short).
   Each img tag starts with < (but keep point 1 in mind: it is XML-encoded as &lt;).
   Look for &lt; followed by img (current regex: &lt;img).
   Next comes at least one whitespace character (current regex: &lt;img\s+).
   The src attribute is always the first attribute (in this case) if present, so we select src=&quot; (current regex: &lt;img\s+src=&quot;).
   Next, select the URL itself with .* but be careful: the * quantifier is greedy, so we have to use the lazy quantifier .*? and finally close with &quot;.
   Final regex: &lt;img\s+src=&quot;(.*?)&quot; Make sure you put parentheses around the URL for easy group selection.
4. Last step: C# code
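using System.Diagnostics;
using System.IO;
using System.Net;
using System.Text.RegularExpressions;

// quick & dirty :-)
var url = "https://news.google.com/?output=rss";
// The raw feed is XML-encoded, so the pattern matches &lt; and &quot; rather than < and ".
var regex = @"&lt;img\s+src=&quot;(.*?)&quot;";
var RssContent = new StreamReader(((HttpWebRequest)HttpWebRequest.Create(url)).GetResponse().GetResponseStream()).ReadToEnd();
foreach (Match match in Regex.Matches(RssContent, regex))
{
    // print img urls
    Debug.WriteLine(match.Groups[1].Value);
}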
PS: If you are trying to write an RSS reader, you should NOT use regex to parse HTML at all! Try to find a way to transform the HTML into XAML and write your reader in WPF, or start by learning more about these problems by studying some open-source RSS readers.
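As a rough sketch of the parser-first approach recommended above: load the feed with XDocument (LINQ to XML), which decodes the entity-encoded description text, and only then apply the small regex to the decoded HTML. The inline sample feed, its img.example.com host, and the ExtractImageUrls helper are made up for illustration; a real reader would download the feed instead. It is written with top-level statements, so it assumes C# 9 or later.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;
using System.Xml.Linq;

// Hypothetical helper: returns the src of every <img> found in the <description> elements.
static List<string> ExtractImageUrls(string rssXml)
{
    var urls = new List<string>();
    var doc = XDocument.Parse(rssXml);
    foreach (var description in doc.Descendants("description"))
    {
        // .Value is already entity-decoded, so the pattern uses plain < and ".
        foreach (Match m in Regex.Matches(description.Value, "<img\\s+src=\"(.*?)\"", RegexOptions.IgnoreCase))
        {
            urls.Add(m.Groups[1].Value);
        }
    }
    return urls;
}

// Made-up sample standing in for the downloaded feed.
var sampleRss = @"<rss version=""2.0""><channel><item>
<title>Sample</title>
<description>&lt;img src=""//img.example.com/a.png"" alt="""" width=""80"" /&gt; Some text</description>
</item></channel></rss>";

foreach (var imageUrl in ExtractImageUrls(sampleRss))
{
    Console.WriteLine(imageUrl);
}
```

Because the XML parser does the decoding, the regex no longer has to know about &lt; or &quot;, which is one less thing to get wrong.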

You can use the regex pattern below:
/(.*\/images.*)/

Related

Regex match URL if not in html comment line

I want to match "https://www.mysite/embed/M7znk1c-ay0" only if it is not inside an HTML comment.
So don't match this line
<!--<p><iframe src="https://www.mysite/embed/M7znk1c-ay0" width="854" height="480" frameborder="0" allowfullscreen="allowfullscreen"></iframe>-->
but match this line
<article class="art-post"><div class="art-postcontent clearfix"><div class="art-article"><p><iframe src="https://www.mysite/embed/M7znk1c-ay0" ></iframe></p>
I tried this pattern ^(?=<!--).*www.mysite\/embed\/+[\w\-]*
but it isn't quite working

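You almost had it. (?=<!--) is a positive lookahead, so your pattern only matches lines that do start with a comment; use the negative lookahead (?!<!--) instead. The correct regex is ^(?!<!--).*"(.*www.mysite\/embed\/+[\w\-]*).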
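A minimal sketch of the negative-lookahead pattern applied to the two lines from the question. The FindEmbedUrl helper is hypothetical, the sample lines are shortened, and the dot in www.mysite is escaped here, which the pattern above does not do. It assumes C# 9 or later (top-level statements).

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper: returns the embed URL, or "" when the line starts with an HTML comment.
static string FindEmbedUrl(string line)
{
    // ^(?!<!--) rejects lines that begin with <!--; group 1 captures the URL after the opening quote.
    var m = Regex.Match(line, @"^(?!<!--).*""(.*www\.mysite/embed/+[\w\-]*)");
    return m.Success ? m.Groups[1].Value : "";
}

var commented = "<!--<p><iframe src=\"https://www.mysite/embed/M7znk1c-ay0\" width=\"854\"></iframe>-->";
var live = "<div class=\"art-article\"><p><iframe src=\"https://www.mysite/embed/M7znk1c-ay0\" ></iframe></p>";

Console.WriteLine(FindEmbedUrl(commented) == "" ? "(no match)" : FindEmbedUrl(commented));
Console.WriteLine(FindEmbedUrl(live));
```

The check only looks at the start of the line, so a comment that opens in the middle of a line is not detected.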

HTML is not a regular language, so using regular expressions to parse it might not be a good idea.
@csabinho's answer ^(?!<!--).*"(.*www.mysite\/embed\/+[\w\-]*) won't work if the URL you want to match is in the middle of a page; it simply checks that the line doesn't begin with a comment.
Best practice would be to build a DOM and use XPath to query the XML-like content.
Edit:
By the way, you can use the following code to remove the comments first:
using System.Text.RegularExpressions;
...
string pattern = @"(<!--(.+?)-->)";
var res = Regex.Replace(input, pattern, "", RegexOptions.Singleline);
and then use a simple pattern to extract the URL from the result.
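The two-step idea (strip comments, then extract) might look roughly like this. The FindLiveEmbedUrls helper, the simplified comment pattern, the extraction pattern, and the sample page are all assumptions for illustration, not code from the answer. It assumes C# 9 or later (top-level statements).

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper: drops HTML comments, then returns every embed URL that is left.
static string[] FindLiveEmbedUrls(string html)
{
    // Singleline lets . cross line breaks, so multi-line comments are removed too.
    string withoutComments = Regex.Replace(html, @"<!--.*?-->", "", RegexOptions.Singleline);
    var matches = Regex.Matches(withoutComments, @"https://www\.mysite/embed/[\w\-]+");
    var urls = new string[matches.Count];
    for (int i = 0; i < matches.Count; i++)
    {
        urls[i] = matches[i].Value;
    }
    return urls;
}

// Made-up page: one commented-out iframe in the middle of a line, one live iframe.
var page = "<p>intro</p><!--<iframe src=\"https://www.mysite/embed/OLD-video\"></iframe>-->\n"
         + "<p><iframe src=\"https://www.mysite/embed/M7znk1c-ay0\" ></iframe></p>";

foreach (var embedUrl in FindLiveEmbedUrls(page))
{
    Console.WriteLine(embedUrl);
}
```

Unlike the line-anchored lookahead, this also handles comments that begin mid-line or span several lines.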

Regex matching URL's by sub-folder

I am essentially trying to write an outbound URL matcher so I can rewrite the URLs in a stream of HTML to point to my CDN. I can't use the IIS URL Rewrite module as I am using compression. I currently have a regex that matches on a sub-folder for specific file types, i.e.
Regex ASSET_PATH = new Regex(@"(?i)assets/([A-Za-z0-9\-_/.]+)\.(jpg|jpeg|bmp|tiff|png|gif|js|css|mov|mp4|ogg|avi|mp3)", RegexOptions.Compiled | RegexOptions.CultureInvariant | RegexOptions.IgnoreCase );
This works great and allows me to manipulate anything in the string from that point onwards ( i.e. from "assets/" onwards to the right ). What I need to achieve is to manipulate the string to the left of the "assets/" sub-folder, without necessarily knowing the format? Here are some examples :
<img src="./assets/123/pig.jpg" />
<img src="http://mysite.blah/assets/123/pig.jpg" />
<img src="http://www.mysite.blah/assets/123/pig.jpg" />
<img src='assets/123/pig.jpg' />
in css / inline styles :
background-image : URL('assets/123/pig.jpg')
background-image : URL(http://www.mysite.blah/assets/123/pig.jpg)
anyway, I think you get the picture. I essentially want to be able to look to the "left" of the word "assets" until I can find the logical start point of the url and then manipulate it from there to point to my CDN.
I'm not sure this is possible in regex, so any suggestions using a combination of regex / c# /HTML Agility Pack are welcome

Is this what you're after?
(?<BeforeAssets>.*?(?:\/|^))assets\/(?<AfterAssets>[A-Za-z0-9\-_\/.]+)\.(?<FileExtension>jpg|jpeg|bmp|tiff|png|gif|js|css|mov|mp4|ogg|avi|mp3)
You can try this out here: http://regexstorm.net/tester
Or here: https://regex101.com/r/b8XxcF/1
NB: In the above regex I escaped the forward slash characters. .Net doesn't require this, but doesn't complain; and doing so makes this compatible with other Regex engines; which means it can be tested on Regex101.
When testing with those tools you'll need to specify the Multiline option to match the example where assets/ has nothing preceding it, since otherwise the ^ anchor only matches the start of the whole input, not the start of each line. This option may not be required in your code, i.e. if you're only matching one string at a time rather than a whole block of text.
Update
Apologies for misreading; you're parsing the full HTML page; not just the URIs returned from that page. To do this you could use something like:
["'\(](?<BeforeAssets>[^"'\(\)]*?)assets\/(?<AfterAssets>[A-Za-z0-9\-_\/.]+)\.(?<FileExtension>jpg|jpeg|bmp|tiff|png|gif|js|css|mov|mp4|ogg|avi|mp3)
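(Thankfully the characters ", ', ( and ) rarely appear unencoded inside these URLs: " is not a valid URI character at all, and ', ( and ) are reserved sub-delimiters, so they should be OK for detecting the start of a value: https://www.rfc-editor.org/rfc/rfc3986#section-2.2.)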
This isn't fool-proof; it's better to use an HTML parsing tool, then pull out the URIs from that; but if you are doing everything with regex, hopefully this will help.
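One way the second pattern could be wired into a replacement, as a sketch only. The RewriteToCdn helper, the extra Open group that preserves the leading delimiter, and the https://cdn.example.com root are assumptions added for illustration. It assumes C# 9 or later (top-level statements).

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper: rewrites asset URLs so they point at cdnBase (an assumed CDN root).
static string RewriteToCdn(string html, string cdnBase)
{
    var assetUrl = new Regex(
        @"(?<Open>[""'(])(?<BeforeAssets>[^""'()]*?)assets/(?<AfterAssets>[A-Za-z0-9\-_/.]+)\.(?<FileExtension>jpg|jpeg|bmp|tiff|png|gif|js|css|mov|mp4|ogg|avi|mp3)",
        RegexOptions.IgnoreCase | RegexOptions.CultureInvariant);

    // Keep the opening delimiter, drop whatever preceded "assets/", and prepend the CDN root.
    return assetUrl.Replace(html, m =>
        m.Groups["Open"].Value + cdnBase + "/assets/" +
        m.Groups["AfterAssets"].Value + "." + m.Groups["FileExtension"].Value);
}

var sample = "<img src=\"http://www.mysite.blah/assets/123/pig.jpg\" />";
Console.WriteLine(RewriteToCdn(sample, "https://cdn.example.com"));
```

Capturing the opening quote or parenthesis in its own group means the replacement can put it back unchanged, so single-quoted, double-quoted and unquoted CSS URLs all keep their original delimiters.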

Regex to parse and replace img src in C#/.NET?

Ahoy,
I have a problem, see; I have strings like:
<img width="594" height="392" src="/sites/it_kb/SiteAssets/Pages/exploding%20the%20VDI%20vDesktop/VDI3.PNG" alt="" style="margin:5px;width:619px;height:232px" />
They are not consistently formatted.
I need to parse strings like this, and return the following:
<img width="594" height="392" src="/exploding%20the%20VDI%20vDesktop-VDI3.PNG" alt="" style="margin:5px;width:619px;height:232px" />
Changes:
Remove everything except the immediate directory in which the image file lies.
Instead of that directory being a subdirectory, prepend it onto the file name.
So if the file is currently in /blabla/bla/blaaaaah/pickles/pickle.png
then I want the IMG SRC attribute to say pickles-pickle.png
Now, I've been trying to do this with regex, but after 3 hours, I've discovered something about myself... I am awful at regex. I could be at this for weeks, and I'd never get anywhere.
Thus, I am asking this wonderful community for two things:
How would you do this? Is regex even the right answer? I need to be able to parse any SRC attributes inside IMG tags (whether or not they have height/width or other attributes).
What resources would you recommend for me to learn regex with .NET?
Now for the problem at hand, I suppose I could do a string.replace where I....
Find the IMG tag, and get indexes of the surrounding '<' and '>'
Find index of 'SRC=' and ' ' (space) between those two instances
Find last index of '/' between the src and space indexes
Find second to last index of '/' between src and space indexes
Replace... er no, remove... everything before the second to last instance of '/'...
...String.Replace remaining '/' with '-'.
....I.. I think that'd do it?
But DAMN that is ugly. A regex would be so much prettier, don't you think?
Any advice?
Note: I tagged this as 'homework', but it's not homework. I'm volunteering for work after-hours to save the company like 200k. This is literally the last piece of an incredibly convoluted (to me) puzzle. Of course, I don't see a penny of that 200k, but I look good doing it.

To get the tag, I suggest using HtmlAgilityPack. It's just safer than to do regex on an entire HTML page.
Use something like this to get the image nodes:
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(html);
var imgs = doc.DocumentNode.SelectNodes("//img");
Use something like this to get/set the attributes:
foreach (var img in imgs)
{
string orig = img.Attributes["src"].Value;
//do replacements on orig to a new string, newsrc
img.SetAttributeValue("src",newsrc);
}
So, what kind of replacements should you do? I do agree that using a Regex is much more elegant. Things like these are what it's for after all!
Something like this should do the trick:
string s = @"/sites/it_kb/SiteAssets/Pages/exploding%20the%20VDI%20vDesktop/VDI3.PNG";
string n = Regex.Replace(s, @"(.*?)\/([^\/]*?)\/([^\/]*?)$", @"/$2-$3");
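The replacement can be wrapped in a small function and tried on both paths from the question. FlattenImagePath is a hypothetical name, and the pattern is the one above with the optional slash escapes dropped. It assumes C# 9 or later (top-level statements).

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper: keeps only the last directory and joins it to the file name with '-'.
static string FlattenImagePath(string src)
{
    // $2 = last directory, $3 = file name; everything before them is dropped.
    return Regex.Replace(src, @"(.*?)/([^/]*?)/([^/]*?)$", "/$2-$3");
}

// prints /exploding%20the%20VDI%20vDesktop-VDI3.PNG
Console.WriteLine(FlattenImagePath("/sites/it_kb/SiteAssets/Pages/exploding%20the%20VDI%20vDesktop/VDI3.PNG"));
// prints /pickles-pickle.png
Console.WriteLine(FlattenImagePath("/blabla/bla/blaaaaah/pickles/pickle.png"));
```

A path with fewer than two slashes does not match and is returned unchanged, so the function is safe to call on every src value found by the HTML parser.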
Some resources that you can use to learn C# Regexing:
dotnetperls Regex.Match
MSDN: Regex.Match method
MSDN Regex Cheat Sheet

(?<=src=)"[^" ]*\/(?=[^\/"]*\/)
Try this. Replace with an empty string.
http://regex101.com/r/dZ1vT6/50
I must warn you that it's a kind of hack. HTML should not be parsed with regex.

Replace this
(?i)(?<=<img\s[\s\S]*?src=")(?:[^"]*\/)+(?=[^"]*\/)([^\/]*)\/([^"]+)
To:
/$1-$2

regex to replace self closing html tags in c#

I have an XML document that also contains some HTML tags. When an iframe tag comes in, it breaks the page because it is self-closing. Something like:
<iframe width="420" height="315" src="//www.youtube.com/embed/6krfYKxJFqA" frameborder="0" />
I want to replace this and convert it to:
<iframe width="420" height="315" src="//www.youtube.com/embed/6krfYKxJFqA" frameborder="0"></iframe>
Can anyone provide C# code with a regex to do this? I tried:
tmp = tmp.Replace("(<iframe[^>]*)(\\s*/>)", "$1></iframe>");
and
tmp = new Regex(@"(<iframe[^>]*)(\\s*/>)").Replace(tmp, "$1></iframe>");
but with no result.
tmp is a string holding the XML, which contains a lot of code plus this iframe tag.

Try this as a match expression:
<iframe(.*?)(["\d\w\s])\/>
Note that you can use http://regexpal.com/ to test regexes; it's super convenient.

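In the second regex, you don't need the double backslash as you are using @ (a verbatim string): there, \\s* means a literal backslash followed by any number of s characters, so the pattern never matches. Write \s* instead.
Also, (<iframe[^>]*) is greedy and first swallows the trailing /, which forces the engine to backtrack; use the non-greedy ? operator: (<iframe[^>]*?)(\s*/>)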
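Putting both fixes together might look like the sketch below. ExpandSelfClosingIframes is a hypothetical helper name and the sample string is shortened from the question. Note that string.Replace, used in the question's first attempt, does plain text replacement and never interprets its argument as a regex, so Regex.Replace is needed. It assumes C# 9 or later (top-level statements).

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper: turns <iframe ... /> into <iframe ...></iframe>.
static string ExpandSelfClosingIframes(string markup)
{
    // Lazy [^>]*? keeps the whitespace before /> out of group 1.
    return Regex.Replace(markup, @"(<iframe[^>]*?)\s*/>", "$1></iframe>", RegexOptions.IgnoreCase);
}

var tmp = "<p>Video:</p><iframe width=\"420\" height=\"315\" src=\"//www.youtube.com/embed/6krfYKxJFqA\" frameborder=\"0\" />";
Console.WriteLine(ExpandSelfClosingIframes(tmp));
```

An iframe that already has a closing tag is left alone, because [^>] cannot cross the > that ends its opening tag.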

C# extracting certain parts of a string

I have a console application which is parsing HTML documents via the WebRequest method (http). The issue is really with extracting data from the html code that is returned.
Below is a fragment of the html I am interested in:
<span class="header">Number of People:</span>
<span class="peopleCount">1001</span> <!-- this is the line we are interested in! -->
<span class="footer">As of June 2009.</span>
Assume that the above html is contained in a string called "responseHtml". I would like to just extract the 'People Count' value, (second line).
I've searched Stack Overflow and found some code that could work:
How do I extract text that lies between parentheses (round brackets)?
But when I implement it, it doesn't work - I don't think it likes the way I have placed HTML tags into the regex:
string responseHtml; // this is already filled with html code above ^^
string insideBrackets = null;
Regex regex = new Regex("\\<span class=\"peopleCount\">?<TextInsideBrackets>\\w+\\</span>");
Match match = regex.Match(responseHtml);
if (match.Success)
{
insideBrackets = match.Groups["TextInsideBrackets"].Value;
Console.WriteLine(insideBrackets);
}
The above just fails to work, is it something to do with the html span brackets? All I want is the text value in between the tags for that specific span.
Thanks in advance!

Try this one:
Regex regex = new Regex("class=\\\"peopleCount\\\"\\>(?<data>[^\\<]*)",
RegexOptions.CultureInvariant
| RegexOptions.Compiled
);
It should be a tad faster, as you are basically saying the data you are looking for starts after peopleCount"> and ends at the first <
(I changed the group name to data)
Cheers,
Florian

?<TextInsideBrackets> is incorrect
You need:
(?<TextInsideBrackets>...)

I assume you want to do a named capture.
You should use
Regex regex = new Regex("\\<span class=\"peopleCount\">(?<TextInsideBrackets>\\w+)\\</span>");
and not
Regex regex = new Regex("\\<span class=\"peopleCount\">?<TextInsideBrackets>\\w+\\</span>");
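For completeness, a small self-contained sketch of the corrected named capture run against the fragment from the question. The ExtractPeopleCount helper is hypothetical, and the pattern is written as a verbatim string, so quotes are doubled instead of backslash-escaped. It assumes C# 9 or later (top-level statements).

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper: returns the text inside <span class="peopleCount">, or "" if absent.
static string ExtractPeopleCount(string html)
{
    var regex = new Regex(@"<span class=""peopleCount"">(?<TextInsideBrackets>\w+)</span>");
    Match match = regex.Match(html);
    return match.Success ? match.Groups["TextInsideBrackets"].Value : "";
}

var sampleHtml = "<span class=\"header\">Number of People:</span>\n"
               + "<span class=\"peopleCount\">1001</span>\n"
               + "<span class=\"footer\">As of June 2009.</span>";

// prints 1001
Console.WriteLine(ExtractPeopleCount(sampleHtml));
```

The parentheses around ?<TextInsideBrackets>\w+ are what make it a named group; without them the ? is parsed as a quantifier on the preceding > and nothing is captured.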
