How to get href in HTML Agility Pack? - c#

I want to get the "href" link from this HTML node.
I have already tried, but it still does not work.
This is the HTML:
<a title="ASUS ROG" class="product-media__link js-tracker-product-link" href="https://www.bukalapak.com/p/komputer/laptop/8vl4vm-jual-asus-rog?search%5Bkeywords%5D=asus%20rog&from=omnisearch">

Here is some sample code to extract the href URL from the page:
HtmlDocument htmlDoc = new HtmlDocument();
htmlDoc.Load("yourpage.html");
var link = htmlDoc.DocumentNode
.Descendants("a")
.First(x => x.Attributes["title"] != null
&& x.Attributes["title"].Value == "ASUS ROG");
string hrefValue = link.Attributes["href"].Value;
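Note that First throws if no anchor matches, and the attribute indexer returns null when the attribute is missing. A minimal sketch of a more defensive variant, assuming the HtmlAgilityPack NuGet package is referenced; the names HrefLookup and FindHref and the example.com URL are hypothetical, used only for illustration:

```csharp
using System;
using HtmlAgilityPack;

public static class HrefLookup
{
    // Returns the href of the first <a> whose title matches, or null if no such anchor exists.
    public static string FindHref(string html, string title)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectSingleNode returns null when nothing matches.
        // The title is spliced into the XPath, so it must not contain a single quote.
        HtmlNode link = doc.DocumentNode.SelectSingleNode("//a[@title='" + title + "']");

        // GetAttributeValue falls back to the supplied default when the attribute is absent.
        return link?.GetAttributeValue("href", string.Empty);
    }

    public static void Main()
    {
        // Hypothetical sample markup, shortened from the question.
        string html = "<a title=\"ASUS ROG\" href=\"https://example.com/p/asus-rog\">ASUS ROG</a>";
        Console.WriteLine(FindHref(html, "ASUS ROG") ?? "(not found)");
    }
}
```

The method returns null for a missing anchor and an empty string for an anchor without href, so the caller can tell the two cases apart.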

Related

How to remove a tag link a href without removing the link text in Html Agility Pack?

I have to replace the <a> tag using HAP (HTML Agility Pack), removing the link without removing the link text. For example, in this case:
<p>This is <a href="...">the link</a></p>
I want to replace the link, and the desired result should be:
<p>This is <span>the link</span></p>
I wrote this function, which takes an HTML string as input.
public string CleanLinks(string input) {
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(input);
var links = doc.DocumentNode.SelectNodes("//a");
if (links == null) return input;
foreach (HtmlNode tb in links)
{
HtmlNode lbl = doc.CreateElement("span");
lbl.InnerHtml = tb.InnerHtml;
tb.ParentNode.ReplaceChild(lbl, tb);
}
return doc.DocumentNode.OuterHtml;
}

Load p tag from form using HtmlAgilityPack

I'm trying to grab the p tag from inside a form tag, but it is null:
string html = "<form id='foo123'> <p> loll </p> </form>";
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(html);
var node = doc.DocumentNode.SelectNodes("//form[contains(@id, 'foo')]"); //.Count = 1
var p = node[0].SelectSingleNode("./p"); // p is null
How do I fix this?
This is a known issue where the Agility Pack incorrectly fixes the nesting of tags. You can work around it by calling the following before loading the document:
HtmlNode.ElementsFlags.Remove("form");
See: http://htmlagilitypack.codeplex.com/workitem/23074
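The ordering matters here, because the flag has to be removed before the HTML is parsed. A minimal sketch of the workaround applied to the snippet from the question, assuming the HtmlAgilityPack NuGet package is referenced; the class name FormChildDemo is hypothetical, and the exact default handling of form may differ between library versions:

```csharp
using System;
using HtmlAgilityPack;

public static class FormChildDemo
{
    public static void Main()
    {
        // ElementsFlags is a static table shared by all documents,
        // so it is changed once, before LoadHtml is called.
        HtmlNode.ElementsFlags.Remove("form");

        var doc = new HtmlDocument();
        doc.LoadHtml("<form id='foo123'> <p> loll </p> </form>");

        HtmlNode form = doc.DocumentNode.SelectSingleNode("//form[contains(@id, 'foo')]");
        HtmlNode p = form?.SelectSingleNode("./p");

        // Report whether the <p> was found as a child of the form.
        Console.WriteLine(p == null ? "p not found" : p.InnerText.Trim());
    }
}
```

Because the table is static, the change affects every document parsed afterwards in the same process.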

Html Agility Pack: how to parse a webresponse and get a specified html element in c#

I googled my problem and found Html Agility Pack for parsing HTML in C#. But there are no good examples, and I can't get it to work for my purpose. I have an HTML document that contains a part like this:
<div class="pray-times-holder">
<div class="pray-time">
<div class="labels">
Time1:</div>
04:28:24
</div>
<div class="pray-time">
<div class="labels">
Time2:</div>
06:04:41
</div>
</div>
I want to get the values for Time1 and Time2: Time1 has the value 04:28:24 and Time2 has the value 06:04:41. Can you help me, please?
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(html);
var results = doc.DocumentNode
.Descendants("div")
.Where(n => n.Attributes["class"] != null && n.Attributes["class"].Value == "pray-time")
.Select(n => n.InnerText.Replace("\r\n","").Trim())
.ToArray();
This console application code:
HtmlDocument doc = new HtmlDocument();
doc.Load(yourHtml);
foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//div[@class = 'labels']"))
{
Console.WriteLine(node.NextSibling.InnerText.Trim());
}
will output this:
04:28:24
06:04:41
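If the labels are needed along with the values, the same NextSibling idea can be sketched as a lookup table. This assumes the HtmlAgilityPack NuGet package is referenced and that each value is the text node directly after its labels div, as in the question's markup; the names PrayTimes and Parse are hypothetical:

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public static class PrayTimes
{
    // Maps each label (without the trailing colon) to the text that follows it.
    public static Dictionary<string, string> Parse(string html)
    {
        var result = new Dictionary<string, string>();
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null when nothing matches.
        var labels = doc.DocumentNode.SelectNodes("//div[@class='labels']");
        if (labels == null) return result;

        foreach (HtmlNode label in labels)
        {
            string name = label.InnerText.Trim().TrimEnd(':');
            // The value is the text node right after the label div, if there is one.
            string value = label.NextSibling?.InnerText.Trim() ?? string.Empty;
            result[name] = value;
        }
        return result;
    }

    public static void Main()
    {
        string html =
            "<div class=\"pray-times-holder\">" +
            "<div class=\"pray-time\"><div class=\"labels\">Time1:</div> 04:28:24 </div>" +
            "<div class=\"pray-time\"><div class=\"labels\">Time2:</div> 06:04:41 </div>" +
            "</div>";

        foreach (var pair in Parse(html))
            Console.WriteLine(pair.Key + " = " + pair.Value);
    }
}
```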

HTML Scraping using Html Agility Pack

I have an HTML which contains the following code
<div id="image_src" style="display: block; ">
<img id="captcha_img" src="" alt="image" onclick="imageClick(event)" style="cursor:crosshair;">
How can I read the src from this using HTML Agility Pack?
From another question, I tried using the following LINQ:
var urls = document.DocumentNode.Descendants("img")
.Select(e => e.GetAttributeValue("src", null))
.Where(s => !String.IsNullOrEmpty(s));
but I keep getting a null pointer exception here.
I have only one image tag in the entire HTML, as shown above.
Can somebody help me, please?
To troubleshoot the null pointer exception, break each LINQ call into its own statement, like this:
var img = document.DocumentNode.Descendants("img");
var s = img.Select(e => e.GetAttributeValue("src", null));
var w = s.Where(x => !String.IsNullOrEmpty(x));
Then, step through each line with the debugger, and see where it throws.
Using the HTML Agility Pack:
HtmlAgilityPack.HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(html);
string imgValue = doc.DocumentNode.SelectSingleNode("//img[@id = \"captcha_img\"]").GetAttributeValue("src", "0");

How to extract html links from html file in C#?

Can anyone help me by explaining how to extract URLs/links from an HTML file in C#?
Look at Html Agility Pack:
HtmlDocument doc = new HtmlDocument();
doc.Load("file.htm");
foreach(HtmlNode link in doc.DocumentNode.SelectNodes("//a[@href]"))
{
HtmlAttribute att = link.Attributes["href"];
yourList.Add(att.Value);
}
doc.Save("file.htm");
Use HTMLAgility Pack...
private List<string> ParseLinks(string html)
{
var doc = new HtmlDocument();
doc.LoadHtml(html);
var nodes = doc.DocumentNode.SelectNodes("//a[@href]");
// SelectNodes returns null when nothing matches; collect only the href values (requires System.Linq).
return nodes == null ? new List<string>() : nodes.Select(a => a.GetAttributeValue("href", "")).ToList();
}
It works for me.
You can use an HTQL COM object and query the page using the query:
<a>:href
HTQLCOMLib.HtqlControl h = new HTQLCOMLib.HtqlControl();
string page = "<html><body><a href='test1.html'>test1</a><a href='test2.html'>test2</a> </body></html>";
h.setSourceData(page, page.Length);
h.setQuery("<a>: href ");
for (h.moveFirst(); 0 == h.isEOF(); h.moveNext() )
{
MessageBox.Show(h.getValueByIndex(1));
}
It will show messages of:
test1.html
test2.html
