Regex to extract img source from a string - c#

I have strings like this:
<img width="1" height="1" alt="" src="http://row.bc.yahoo.com.link">
What regex should I write in C# to extract the src portion of it? (The end result should be "http://row.bc.yahoo.com.link".)

If you are dealing with HTML you're better off using an HTML parser like the HTML Agility Pack.
Sample:
var doc = new HtmlDocument();
doc.LoadHtml(
"<img width=\"1\" height=\"1\" alt=\"\" src=\"http://row.bc.yahoo.com.link\">");
var anchor = doc.DocumentNode.Element("img");
Console.WriteLine(anchor.Attributes["src"].Value);
Update:
If you are already using the HTML Agility Pack and have selected all the img tags that carry a src attribute using XPath, you need to iterate over them and read the src attribute:
var imgs = doc.DocumentNode.SelectNodes("//img[@src]");
foreach (var node in imgs)
{
Console.WriteLine(node.Attributes["src"].Value);
}

This pattern should work: src="([^"]*)".
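For reference, here is a minimal sketch of how that pattern could be applied in C#. The class and method names (ImgSrcExtractor, ExtractSrc) are made up for illustration, and the sketch assumes the attribute is always written with double quotes and no spaces around the equals sign, exactly as in the question's sample.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ImgSrcExtractor
{
    // The pattern from the answer above: src="([^"]*)"
    // Group 1 captures everything between the double quotes.
    private static readonly Regex SrcPattern = new Regex(@"src=""([^""]*)""");

    // Hypothetical helper: returns the first src value, or null if none is found.
    public static string ExtractSrc(string html)
    {
        Match match = SrcPattern.Match(html);
        return match.Success ? match.Groups[1].Value : null;
    }

    public static void Main()
    {
        string html = "<img width=\"1\" height=\"1\" alt=\"\" src=\"http://row.bc.yahoo.com.link\">";
        Console.WriteLine(ExtractSrc(html)); // prints http://row.bc.yahoo.com.link
    }
}
```

Note that this will not match single-quoted or unquoted attribute values, which is one reason the parser-based answer is usually the safer choice.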

Related

C# replace the selection of regex by another selection of regex if a condition is met

I am parsing some HTML, and I am trying to replace the alt of an img with its src value (without the file extension) if and only if the alt is empty.
Example:
Input:
... some HTML here ....
<img src="my_image.jpg" alt="something_is_already_here" width="450" height="300">
... some HTML here ....
<img src="my_image2.jpg" alt="" width="450" height="300">
Output:
... some HTML here ....
<img src="my_image.jpg" alt="something_is_already_here" width="450" height="300">
... some HTML here ....
<img src="my_image2.jpg" alt="my_image2" width="450" height="300">
I've already written the regular expressions for src and alt, but I don't know how to use them to do exactly what I need.
//src=\"([^"]*)\.jpg\"
string srcPattern = "src=\\\"([^\"]*)\\.jpg\\\"";
//alt=\"([^"]*)\"
string altPattern = "alt=\\\"([^\"]*)\\\"";
Regex rSrc = new Regex(srcPattern);
Regex rAlt = new Regex(altPattern);
Here is how you can do it with an HTML parser (HtmlAgilityPack, installed as a NuGet package): you can pass either a URL or an HTML string to the HtmlAgilityPackPopulateAltWithSrcIfEmpty method, and the output will be the HTML string with populated alts in img tags.
The XPath used, //img[string-length(@alt) = 0], selects all img tags (//img) whose alt attribute value is empty ([string-length(@alt) = 0]).
The alt is only populated with part of src if src value ends with .jpg extension. Then, only the part before the extension is used to set the alt attribute.
public string HtmlAgilityPackPopulateAltWithSrcIfEmpty(string html)
{
HtmlAgilityPack.HtmlDocument hap;
Uri uriResult;
if (Uri.TryCreate(html, UriKind.Absolute, out uriResult) && (uriResult.Scheme == Uri.UriSchemeHttp || uriResult.Scheme == Uri.UriSchemeHttps))
{ // html is a URL
var doc = new HtmlAgilityPack.HtmlWeb();
hap = doc.Load(uriResult.AbsoluteUri);
}
else
{ // html is a string
hap = new HtmlAgilityPack.HtmlDocument();
hap.LoadHtml(html);
}
var nodes = hap.DocumentNode.SelectNodes("//img[string-length(@alt) = 0]");
if (nodes != null)
{
foreach (var node in nodes)
{
var val = node.GetAttributeValue("src", string.Empty);
if (val.ToUpper().EndsWith(".JPG"))
node.SetAttributeValue("alt", val.Substring(0, val.Length - 4));
}
}
return hap.DocumentNode.OuterHtml;
}
Use it like:
var s = "<img src=\"my_image.jpg\" alt=\"something_is_already_here\" width=\"450\" height=\"300\"><img src=\"my_image2.jpg\" alt=\"\" width=\"450\" height=\"300\">";
var new_html = HtmlAgilityPackPopulateAltWithSrcIfEmpty(s);
Result:
<img src="my_image.jpg" alt="something_is_already_here" width="450" height="300"><img src="my_image2.jpg" alt="my_image2" width="450" height="300">
You need to use Regex.Replace.
Because the replacement depends on a condition, you also need an if.
First, isolate the whole img tag, since you want to work with that tag's src and not just any src ;)
To test a string against a pattern, use Regex.IsMatch(text, pattern).
Example:
string text = Console.ReadLine();
string reg = @"^((([\w]+\.[\w]+)+)|([\w]+))@(([\w]+\.)+)([A-Za-z]{1,3})$";
if (Regex.IsMatch(text, reg))
{
Console.WriteLine("Email.");
}
Build a pattern for the img tag, then call IsMatch(imgTag, patternForAlt) to check whether the alt is empty; if it is, use Replace to write the src value into the alt.
If you try this and it doesn't work, post your attempt and I can help you further.
Edit
You can use https://regex101.com/ to test your regex easily before using it in the program :)
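As a rough sketch of the regex-only approach described above, the following uses Regex.Replace with a MatchEvaluator so that each img tag is inspected on its own. The names AltFiller and FillEmptyAlts are invented for this example, and it assumes double-quoted attributes and .jpg sources, as in the question; real-world HTML will often break these assumptions.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AltFiller
{
    // Matches one whole <img ...> tag.
    private static readonly Regex ImgTag = new Regex(@"<img\b[^>]*>", RegexOptions.IgnoreCase);
    // The question's src pattern: captures the file name without the .jpg suffix.
    private static readonly Regex Src = new Regex(@"src=""([^""]*)\.jpg""", RegexOptions.IgnoreCase);
    // Matches an empty alt attribute: alt=""
    private static readonly Regex EmptyAlt = new Regex(@"alt=""""", RegexOptions.IgnoreCase);

    // Hypothetical helper: fills empty alts with the src value minus ".jpg".
    public static string FillEmptyAlts(string html)
    {
        return ImgTag.Replace(html, tagMatch =>
        {
            string tag = tagMatch.Value;
            Match src = Src.Match(tag);
            // Leave the tag untouched unless it has a .jpg src and an empty alt.
            if (!src.Success || !EmptyAlt.IsMatch(tag))
                return tag;
            string name = src.Groups[1].Value;
            // Replace only the first empty alt inside this tag.
            return EmptyAlt.Replace(tag, m => "alt=\"" + name + "\"", 1);
        });
    }

    public static void Main()
    {
        string s = "<img src=\"my_image.jpg\" alt=\"something_is_already_here\" width=\"450\" height=\"300\">"
                 + "<img src=\"my_image2.jpg\" alt=\"\" width=\"450\" height=\"300\">";
        Console.WriteLine(FillEmptyAlts(s));
    }
}
```

The inner replacement uses a MatchEvaluator rather than a replacement string so that a "$" in a file name is not interpreted as a substitution token.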

C# - Get html contents

Hello, how can I get HTML content, such as a shoutbox or just the username of the connected user, in C#?
Example: <p><?php echo USER['name'] ?></p>
In C#, how can I get the value of the p element?
You should use an HTML parser such as HtmlAgilityPack. Regex is not a good choice for parsing HTML files, because HTML is neither strict nor regular in its format.
You can use the code below to retrieve it with HtmlAgilityPack:
HtmlDocument doc = new HtmlDocument();
doc.Load(yourStream);
// This XPath selects all p tags
var itemList = doc.DocumentNode.SelectNodes("//p")
.Select(p => p.InnerText)
.ToList();
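If this is an ASP.NET Web Forms page, give the p element an ID, mark it runat="server", and reference it from the code-behind:
string contents = yourId.InnerText;
with your HTML like this:
<p id="yourId" runat="server"></p>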

How to remove the <br> tag in my HTML string using HtmlAgilityPack in C#?

I have an HTML string and I am using HtmlAgilityPack for parsing HTML string.
This is my html string:
<p class="Normal-P" style="direction: ltr; unicode-bidi: normal;"><span class="Normal-H">sample<br/></span> <span class="Normal-H">texting<br></span></p>
This HTML string has a <br> tag in two places. How can I remove both of them?
It's as easy as:
1. loading the HTML fragment into an Agility Pack HtmlDocument
2. getting all <br /> tags using the "//br" XPath expression
3. removing the tags obtained in the previous step using the Remove() method
4. inspecting the result in the DocumentNode.OuterHtml property
Here it is in code:
const string htmlFragment =
@"<p class=""Normal-P"" style=""direction: ltr; unicode-bidi: normal;"">" +
@"<span class=""Normal-H"">sample<br/></span>" +
@"<span class=""Normal-H"">texting<br></span></p> ";
var document = new HtmlAgilityPack.HtmlDocument();
document.LoadHtml(htmlFragment);
foreach (var brTag in document.DocumentNode.SelectNodes("//br"))
brTag.Remove();
Console.WriteLine(document.DocumentNode.OuterHtml);
Alternatively, use a regular expression that matches both forms of the tag:
string html = ...;
html = Regex.Replace(html, @"<br\s*/?>", "", RegexOptions.IgnoreCase);

How do I use HTML Agility Pack to edit an HTML snippet

So I have an HTML snippet that I want to modify using C#.
<div>
This is a specialSearchWord that I want to link to
<img src="anImage.jpg" />
A hyperlink
Some more text and that specialSearchWord again.
</div>
and I want to transform it to this:
<div>
This is a <a class="special" href="http://mysite.com/search/specialSearchWord">specialSearchWord</a> that I want to link to
<img src="anImage.jpg" />
A hyperlink
Some more text and that <a class="special" href="http://mysite.com/search/specialSearchWord">specialSearchWord</a> again.
</div>
I'm going to use HTML Agility Pack based on the many recommendations here, but I don't know where I'm going. In particular,
1. How do I load a partial snippet as a string, instead of a full HTML document?
2. How do I edit it?
3. How do I then return the text string of the edited object?
1. The same way as a full HTML document. It doesn't matter.
2. There are two options: you can edit the InnerHtml property directly (or Text on text nodes), or modify the DOM tree using e.g. AppendChild, PrependChild, etc.
3. You can use the HtmlDocument.DocumentNode.OuterHtml property or the HtmlDocument.Save method (personally I prefer the second option).
As to parsing, I select the text nodes which contain the search term inside your div, and then just use string.Replace method to replace it:
var doc = new HtmlDocument();
doc.LoadHtml(html);
var textNodes = doc.DocumentNode.SelectNodes("/div/text()[contains(.,'specialSearchWord')]");
if (textNodes != null)
foreach (HtmlTextNode node in textNodes)
node.Text = node.Text.Replace("specialSearchWord", "<a class='special' href='http://mysite.com/search/specialSearchWord'>specialSearchWord</a>");
And saving the result to a string:
string result = null;
using (StringWriter writer = new StringWriter())
{
doc.Save(writer);
result = writer.ToString();
}
Answers:
1. There may be a way to do this but I don't know how. I suggest loading the entire document.
2. Use a combination of XPath and regular expressions.
3. See the code below for a contrived example. You may have other constraints not mentioned, but this code sample should get you started.
Note that your XPath expression may need to be more complex to find the div that you want.
HtmlDocument doc = new HtmlDocument();
doc.Load(yourHtmlFile);
HtmlNode divNode = doc.DocumentNode.SelectSingleNode("//div[2]");
string newDiv = Regex.Replace(divNode.InnerHtml, @"specialSearchWord",
"<a class='special' href='http://etc'>specialSearchWord</a>");
divNode.InnerHtml = newDiv;
Console.WriteLine(doc.DocumentNode.OuterHtml);

HTML Scraping using Html Agility Pack

I have an HTML which contains the following code
<div id="image_src" style="display: block; ">
<img id="captcha_img" src="" alt="image" onclick="imageClick(event)" style="cursor:crosshair;">
How can I read the src here using HTML Agility Pack?
From another question, I tried the following LINQ:
var urls = document.DocumentNode.Descendants("img")
.Select(e => e.GetAttributeValue("src", null))
.Where(s => !String.IsNullOrEmpty(s));
but I keep getting a null pointer exception here.
I have only one image tag in the entire HTML, as shown above.
Can somebody help me, please?
To troubleshoot the null pointer exception, break each Linq statement into its own line, like this:
var imgs = document.DocumentNode.Descendants("img");
var srcs = imgs.Select(e => e.GetAttributeValue("src", null));
var urls = srcs.Where(src => !String.IsNullOrEmpty(src));
Then, step through each line with the debugger, and see where it throws.
Using the HTML Agility Pack
HtmlAgilityPack.HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(html);
string imgValue = doc.DocumentNode.SelectSingleNode("//img[@id = \"captcha_img\"]").GetAttributeValue("src", "0");
