HTML Scraping using Html Agility Pack - c#

I have an HTML page that contains the following markup:
<div id="image_src" style="display: block; ">
<img id="captcha_img" src="" alt="image" onclick="imageClick(event)" style="cursor:crosshair;">
How can I read the src attribute of this image using the HTML Agility Pack?
Based on another question, I tried the following LINQ query:
var urls = document.DocumentNode.Descendants("img")
.Select(e => e.GetAttributeValue("src", null))
.Where(s => !String.IsNullOrEmpty(s));
but I keep getting a null pointer exception (NullReferenceException) here.
There is only one img tag in the entire HTML, as shown above.
Can somebody help me, please?

To troubleshoot the null pointer exception, break the LINQ chain into separate statements, one per line, like this:
var imgs = document.DocumentNode.Descendants("img");
var srcs = imgs.Select(e => e.GetAttributeValue("src", null));
var nonEmpty = srcs.Where(src => !String.IsNullOrEmpty(src));
Then step through each line with the debugger and see which one throws. Because LINQ is evaluated lazily, also enumerate the result (for example with ToList()) while debugging.

Using the HTML Agility Pack
HtmlAgilityPack.HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(html);
string imgValue = doc.DocumentNode.SelectSingleNode("//img[@id = \"captcha_img\"]").GetAttributeValue("src", "0");
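SelectSingleNode returns null when the XPath matches nothing, so chaining GetAttributeValue directly onto it can itself throw a NullReferenceException. Below is a minimal sketch of a null-safe lookup, assuming the HtmlAgilityPack NuGet package is referenced. The names CaptchaScraper and GetImageSrc are made up for illustration and do not come from the library.

```csharp
using System;
using HtmlAgilityPack;

public static class CaptchaScraper
{
    // Hypothetical helper: returns the src of the <img> with the given id,
    // or null if the element is missing or its src is absent or empty.
    public static string GetImageSrc(string html, string id)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectSingleNode returns null when nothing matches the XPath.
        HtmlNode img = doc.DocumentNode.SelectSingleNode("//img[@id='" + id + "']");
        if (img == null)
            return null;

        string src = img.GetAttributeValue("src", null);
        return string.IsNullOrEmpty(src) ? null : src;
    }

    public static void Main()
    {
        // Markup shaped like the question's snippet (closing </div> added).
        string html = "<div id=\"image_src\"><img id=\"captcha_img\" src=\"\" alt=\"image\"></div>";
        string src = GetImageSrc(html, "captcha_img");
        Console.WriteLine(src ?? "(no src found)");
    }
}
```

Note that the src attribute in the question's markup is empty, so a static parser has nothing to return there. If the page fills it in with JavaScript at runtime, the HTML Agility Pack will not see that value.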

Related

How to get href in HTML Agility Pack?

I want to get the "href" link from this HTML node.
I have already tried, but it still does not work.
This is the markup:
<a title="ASUS ROG" class="product-media__link js-tracker-product-link" href="https://www.bukalapak.com/p/komputer/laptop/8vl4vm-jual-asus-rog?search%5Bkeywords%5D=asus%20rog&from=omnisearch">
Here is some sample code to extract the href URL from the page:
HtmlDocument htmlDoc = new HtmlDocument();
htmlDoc.Load("yourpage.html");
var link = htmlDoc.DocumentNode
.Descendants("a")
.First(x => x.Attributes["title"] != null
&& x.Attributes["title"].Value == "ASUS ROG");
string hrefValue = link.Attributes["href"].Value;
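The snippet above throws if no matching anchor exists (First) or if the anchor has no href (the Attributes indexer returns null). A hedged variant that returns null instead is sketched below, assuming the HtmlAgilityPack NuGet package. LinkScraper, GetHrefByTitle, and the example.com URL are illustrative placeholders, not part of the original page.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

public static class LinkScraper
{
    // Hypothetical helper: returns the href of the first <a> whose title
    // matches exactly, or null if there is no such link or no href.
    public static string GetHrefByTitle(string html, string title)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // FirstOrDefault yields null instead of throwing when nothing matches.
        HtmlNode link = doc.DocumentNode
            .Descendants("a")
            .FirstOrDefault(a => a.GetAttributeValue("title", "") == title);

        return link == null ? null : link.GetAttributeValue("href", null);
    }

    public static void Main()
    {
        // Placeholder markup; the URL is an assumption for the demo.
        string html = "<a title=\"ASUS ROG\" href=\"https://example.com/asus-rog\">ASUS ROG</a>";
        Console.WriteLine(GetHrefByTitle(html, "ASUS ROG") ?? "(not found)");
    }
}
```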

C# - Get html contents

Hello, how can I get HTML content, such as a shoutbox or just the username of the connected user, in C#?
Example: <p><?php echo USER['name'] ?></p>
In C#, how can I get the value of the p element?
You should use an HTML parser such as HtmlAgilityPack. Regex is a poor choice for parsing HTML, because HTML is not a regular language and real-world markup is often not well-formed.
You can use the code below to retrieve it with HtmlAgilityPack (it needs using System.Linq; note that SelectNodes returns null when nothing matches):
HtmlDocument doc = new HtmlDocument();
doc.Load(yourStream);
// This XPath selects all p tags.
var itemList = doc.DocumentNode.SelectNodes("//p")
.Select(p => p.InnerText)
.ToList();
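Alternatively, if the page is an ASP.NET Web Forms page that you control, give the p element an ID, mark it runat="server", and reference it from the code-behind:
string contents = yourId.InnerText;
with your HTML like this:
<p id="yourId" runat="server"></p>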

Load p tag from form using HtmlAgilityPack

I'm trying to grab the p tag from inside a form tag, but the result is null:
string html = "<form id='foo123'> <p> loll </p> </form>";
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(html);
var node = doc.DocumentNode.SelectNodes("//form[contains(@id, 'foo')]"); //.Count = 1
var p = node[0].SelectSingleNode("./p"); // p is null
How do I fix this?
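This is a known issue where the Agility Pack incorrectly fixes the nesting of tags: by default it treats form as an element that cannot contain children. You can work around it by calling the following before loading the document: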
HtmlNode.ElementsFlags.Remove("form");
See: http://htmlagilitypack.codeplex.com/workitem/23074
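Put together, the workaround might look like the sketch below, assuming the HtmlAgilityPack NuGet package. FormExample and GetFormParagraph are hypothetical names. Newer library versions may already parse form differently, in which case the Remove call is simply a no-op.

```csharp
using System;
using HtmlAgilityPack;

public static class FormExample
{
    // Hypothetical helper: returns the trimmed text of the first <p>
    // directly inside the matching <form>, or null if it is not found.
    public static string GetFormParagraph(string html)
    {
        // ElementsFlags is a static table consulted while parsing,
        // so it has to be changed before LoadHtml is called.
        HtmlNode.ElementsFlags.Remove("form");

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        HtmlNode form = doc.DocumentNode.SelectSingleNode("//form[contains(@id, 'foo')]");
        HtmlNode p = form == null ? null : form.SelectSingleNode("./p");
        return p == null ? null : p.InnerText.Trim();
    }

    public static void Main()
    {
        string html = "<form id='foo123'> <p> loll </p> </form>";
        Console.WriteLine(GetFormParagraph(html) ?? "(p not found)");
    }
}
```

Because ElementsFlags is static, the change affects every document parsed afterwards in the same process.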

Html Agility Pack: how to parse a webresponse and get a specified html element in c#

I googled my problem and found the Html Agility Pack for parsing HTML in C#, but I could not find good examples and have not managed to use it for my purpose. I have an HTML document that contains a part like this:
<div class="pray-times-holder">
<div class="pray-time">
<div class="labels">
Time1:</div>
04:28:24
</div>
<div class="pray-time">
<div class="labels">
Time2:</div>
06:04:41
</div>
</div>
I want to get the values for Time1 and Time2, i.e. 04:28:24 for Time1 and 06:04:41 for Time2. Can you help me, please?
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(html);
var results = doc.DocumentNode
.Descendants("div")
.Where(n => n.Attributes["class"] != null && n.Attributes["class"].Value == "pray-time")
.Select(n => n.InnerText.Replace("\r\n","").Trim())
.ToArray();
This console application code:
HtmlDocument doc = new HtmlDocument();
doc.Load(yourHtml);
foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//div[@class = 'labels']"))
{
Console.WriteLine(node.NextSibling.InnerText.Trim());
}
will output this:
04:28:24
06:04:41
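If the label is needed alongside each time, the NextSibling approach can be extended to collect pairs. This is only a sketch, assuming the HtmlAgilityPack NuGet package and markup shaped like the question's. PrayTimes and Parse are made-up names.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public static class PrayTimes
{
    // Hypothetical helper: maps each label ("Time1", "Time2", ...) to the
    // text that follows its <div class="labels"> element.
    public static Dictionary<string, string> Parse(string html)
    {
        var result = new Dictionary<string, string>();
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null (not an empty list) when nothing matches.
        HtmlNodeCollection labels = doc.DocumentNode.SelectNodes("//div[@class='labels']");
        if (labels == null)
            return result;

        foreach (HtmlNode label in labels)
        {
            // The time is the text node right after the label div.
            HtmlNode next = label.NextSibling;
            if (next == null)
                continue;

            string name = label.InnerText.Trim().TrimEnd(':');
            result[name] = next.InnerText.Trim();
        }
        return result;
    }

    public static void Main()
    {
        string html =
            "<div class=\"pray-times-holder\">" +
            "<div class=\"pray-time\"><div class=\"labels\">\nTime1:</div>\n04:28:24\n</div>" +
            "<div class=\"pray-time\"><div class=\"labels\">\nTime2:</div>\n06:04:41\n</div>" +
            "</div>";

        foreach (KeyValuePair<string, string> pair in Parse(html))
            Console.WriteLine(pair.Key + " = " + pair.Value);
    }
}
```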

Regex to extract img source from a string

I have strings like this:
<img width="1" height="1" alt="" src="http://row.bc.yahoo.com.link">
What regex should I have to write in C# to extract src portion of it ? (end result should be "http://row.bc.yahoo.com.link" )
If you are dealing with HTML, you're better off using an HTML parser like the HTML Agility Pack.
Sample:
var doc = new HtmlDocument();
doc.LoadHtml(
"<img width=\"1\" height=\"1\" alt=\"\" src=\"http://row.bc.yahoo.com.link\">");
var img = doc.DocumentNode.Element("img");
Console.WriteLine(img.Attributes["src"].Value);
Update:
If you are already using the HTML Agility Pack, select all img tags that have a src attribute using XPath, then iterate over them and read the attribute from each node:
var imgs = doc.DocumentNode.SelectNodes("//img[@src]");
foreach (var node in imgs)
{
Console.WriteLine(node.Attributes["src"].Value);
}
This pattern should work: src="([^"]*)".
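For completeness, the pattern above can be applied with System.Text.RegularExpressions as in the sketch below. SrcRegexExample and ExtractSrc are illustrative names. The pattern is brittle: it only handles double-quoted values and would also match inside attributes such as data-src, which is why a parser is usually the safer choice.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SrcRegexExample
{
    // Hypothetical helper: returns the first double-quoted src value,
    // or null if the pattern does not match.
    public static string ExtractSrc(string html)
    {
        Match m = Regex.Match(html, "src=\"([^\"]*)\"");
        return m.Success ? m.Groups[1].Value : null;
    }

    public static void Main()
    {
        string html = "<img width=\"1\" height=\"1\" alt=\"\" src=\"http://row.bc.yahoo.com.link\">";
        Console.WriteLine(ExtractSrc(html)); // prints http://row.bc.yahoo.com.link
    }
}
```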
