I want to use the Html Agility Pack to parse image and href links from an HTML page, but I don't know much about XML or XPath. I have looked through help documents on many websites, but I still can't solve the problem. I am using C# in Visual Studio 2005. English is not my first language, so my sincere thanks to anyone who can write some helpful code.
The first example on the home page does something very similar, but consider:
HtmlDocument doc = new HtmlDocument();
doc.Load("file.htm"); // use doc.LoadHtml(htmlSource) if it is not a file
foreach (HtmlNode link in doc.DocumentNode.SelectNodes("//a[@href]"))
{
    string href = link.Attributes["href"].Value;
    // store href somewhere
}
So you can imagine that for img/@src, you just replace each a with img, and href with src.
You might even be able to simplify to:
foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//a/@href | //img/@src"))
{
    // HtmlAgilityPack returns the owning element for attribute selections,
    // so read the attribute explicitly:
    list.Add(node.GetAttributeValue(node.Name == "img" ? "src" : "href", null));
}
For relative url handling, look at the Uri class.
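To expand on the Uri suggestion: a minimal sketch of resolving a relative href against the page it was found on (the page URL here is a made-up example):

```csharp
using System;

// Resolve a relative link against the page's own address.
// "https://example.com/articles/page.html" is a hypothetical page URL.
var baseUri = new Uri("https://example.com/articles/page.html");
var absolute = new Uri(baseUri, "../images/logo.png");
Console.WriteLine(absolute); // https://example.com/images/logo.png
```

The `Uri(Uri, string)` constructor follows the normal RFC 3986 resolution rules, so `..` segments and root-relative paths both come out right.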
The example and the accepted answer are wrong: they don't compile with the latest version. I tried something else:
private List<string> ParseLinks(string html)
{
var doc = new HtmlDocument();
doc.LoadHtml(html);
var nodes = doc.DocumentNode.SelectNodes("//a[@href]");
return nodes == null
    ? new List<string>()
    : nodes.Select(n => n.Attributes["href"].Value).ToList();
}
This works for me.
Maybe I am too late to post an answer here. The following worked for me to read a single attribute:
var mainImageSrc = MainImageNode.Attributes.FirstOrDefault(a => a.Name == "src");
var htmlDoc = new HtmlDocument();
htmlDoc.LoadHtml(html);
string name = htmlDoc.DocumentNode
.SelectNodes("//td/input")
.First()
.Attributes["value"].Value;
Source:
https://html-agility-pack.net/select-nodes
You also need to take into account the document base URL element (<base>) and protocol-relative URLs (for example, //www.foo.com/bar/).
For more information check:
<base>: The Document Base URL element page on MDN
The Protocol-relative URL article by Paul Irish
What are the recommendations for html tag? discussion on StackOverflow
Uri Constructor (Uri, Uri) page on MSDN
Uri class doesn't handle the protocol-relative URL discussion on Stack Overflow
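A sketch of how those two cases can be handled together, assuming `html` is the downloaded page source and the fallback URL is a hypothetical stand-in for the page's real address:

```csharp
using System;
using HtmlAgilityPack;

var doc = new HtmlDocument();
doc.LoadHtml(html); // 'html' is the downloaded page source

// Prefer the <base href> if the page declares one, otherwise
// fall back to the URL the page was fetched from.
var baseNode = doc.DocumentNode.SelectSingleNode("//base[@href]");
var baseUri = new Uri(baseNode?.GetAttributeValue("href", null)
                      ?? "https://example.com/"); // hypothetical page URL

// Uri(base, relative) also resolves protocol-relative references
// like //www.foo.com/bar/ by borrowing the base URI's scheme.
var resolved = new Uri(baseUri, "//www.foo.com/bar/");
Console.WriteLine(resolved); // https://www.foo.com/bar/
```

Note that `<base>` may legitimately be absent, so the fallback is not optional in real code.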
A late post, but here's a 2021 update to the accepted answer (it fixes the refactoring that HtmlAgilityPack made):
var htmlDoc = new HtmlDocument();
htmlDoc.LoadHtml(html);
// The XPath below gets images.
// It is specific to one site; yours will vary ...
string command = "//a[contains(concat(' ', @class, ' '), ' product-card ')]//img";
List<string> listImages = new();
foreach (HtmlNode node in htmlDoc.DocumentNode.SelectNodes(command))
{
    // Using "data-src" below, but it may be "src" for you
    listImages.Add(node.Attributes["data-src"].Value);
}
I've seen that the Html Agility Pack can come in handy, but I don't understand how it works. This is the code I have right now; at the moment it extracts the headings' content successfully, but it also picks up unneeded content.
driver.Manage().Window.Maximize();
driver.Navigate().GoToUrl(response);
String sourcePage = driver.PageSource;
Regex regexHeadings = new Regex("(?<=\\>)(?!\\<)(.*)(?=\\<)(?<!\\>)");
foreach (Match match in regexHeadings.Matches(sourcePage))
{
h1Keywords.Add(match.Value);
colorOutput(ConsoleColor.White, match.Value);
}
I'd recommend using the Html Agility Pack with the help of XPath / CSS selectors.
See this cheatsheet for help: https://devhints.io/xpath
Quick example:
var url = "https://devhints.io/xpath";
var web = new HtmlWeb();
var doc = web.Load(url);
foreach (var heading in doc.DocumentNode.SelectNodes("//h1"))
{
Console.WriteLine(heading.InnerText);
}
How can I parse a complete HTML web page, not just specific nodes, using the Html Agility Pack or any other technique?
I am using this code, but it only parses a specific node; I need the complete page parsed into neat and clean content.
List<string> list = new List<string>();
string url = "https://www.google.com";
HtmlWeb web = new HtmlWeb();
HtmlAgilityPack.HtmlDocument doc = web.Load(url);
foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//a"))
{
list.Add(node.InnerText);
}
To get all descendant text nodes use something like
var textNodes = doc.DocumentNode.SelectNodes("//text()")
    .Select(t => t.InnerText);
To get all non empty descendant text nodes
var textNodes = doc.DocumentNode
    .SelectNodes("//text()[normalize-space()]")
    .Select(t => t.InnerText);
Do SelectNodes("*"). The asterisk is the wildcard selector: it selects every child element of the context node. To get every element in the whole document, use SelectNodes("//*").
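A quick sketch of the difference, assuming `doc` is an already-loaded HtmlDocument:

```csharp
using HtmlAgilityPack;

var doc = new HtmlDocument();
doc.LoadHtml("<html><body><div><p>hi</p></div></body></html>");

// "*" selects only the direct child elements of the context node
// (here, just <html>):
var children = doc.DocumentNode.SelectNodes("*");

// "//*" selects every element anywhere in the document
// (html, body, div, p):
var everything = doc.DocumentNode.SelectNodes("//*");
```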
I'm new to Stack Overflow and I hope my question isn't odd.
I want to download just the text inside the svalue of the sindex element, and also the content of another <p> tag. This is its hierarchy:
/html/body/table/tbody/tr[2]/td/table/tbody/tr/td/table/tbody/tr[3]/td/table/tbody/tr/td/table/tbody/tr/td/table[4]/tbody/tr[2]/td[2]/table/tbody/tr/td/div/span/span/p/span/sindex
is it possible to download the content by its hierarchy? with HtmlAgilityPack for example, or in another way?
Thanks
WebClient client = new WebClient();
string url = "http://www.google.com";
var content = client.DownloadString(url);
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(content);
// ?
Update after @MSI's answer, I use this:
var value = doc.DocumentNode
.SelectSingleNode("//html/body/div/div/a/div");
But the return value is always null. Maybe I got the hierarchy wrong. I used Firebug and looked at the HTML tab for the hierarchy; is that wrong?
Can't you use something along the lines of the following?
*Considering svalue to be an attribute:
doc.DocumentNode
.SelectSingleNode("//html/element1/element2")
.Attributes["svalue"].Value;
or for element,
doc.DocumentNode
.SelectSingleNode("//html/element1/element2/svalue").InnerText;
EDIT:
Re. SelectSingleNode returning null for my previous examples: with google.com.au as the reference HTML source, use the following method to get the desired result.
doc.DocumentNode
.SelectSingleNode(".//element1/element2/svalue").InnerText;
DocumentNode refers to the root node of the HTML document, and .// makes the query relative to the node you call it on.
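To illustrate that point: assuming `someNode` is an HtmlNode somewhere deep in the document, a leading // ignores the context node while .// respects it:

```csharp
using HtmlAgilityPack;

var doc = new HtmlDocument();
doc.LoadHtml("<div id='a'><p>in</p></div><div id='b'><p>out</p></div>");
HtmlNode someNode = doc.DocumentNode.SelectSingleNode("//div[@id='a']");

// Searches the whole document, regardless of where someNode sits:
var allParagraphs = someNode.SelectNodes("//p");      // both <p> elements

// Searches only among someNode's descendants:
var nestedParagraphs = someNode.SelectNodes(".//p");  // only <p>in</p>
```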
Why does this pick all of my <li> elements in my document?
HtmlWeb web = new HtmlWeb();
HtmlDocument doc = web.Load(url);
var travelList = new List<Page>();
var liOfTravels = doc.DocumentNode.SelectSingleNode("//div[@id='myTrips']")
    .SelectNodes("//li");
What I want is to get all <li> elements in the <div> with an id of "myTrips".
It's a bit confusing because you expect it to run SelectNodes only within the div with id "myTrips"; however, if you then call SelectNodes("//li"), it performs another search from the top of the document.
I fixed this by combining the statements into one, but that only works on a page with a single div with the id "myTrips". The query looks like this:
doc.DocumentNode.SelectNodes("//div[@id='myTrips']//li");
var liOfTravels = doc.DocumentNode.SelectSingleNode("//div[@id='myTrips']")
    .SelectNodes(".//li");
Note the dot in the second line. In this regard HtmlAgilityPack relies entirely on XPath syntax, but the result is unintuitive, because these queries are effectively the same:
doc.DocumentNode.SelectNodes("//li");
some_deeper_node.SelectNodes("//li");
Creating a new node can be beneficial in some situations and lets you use XPath more intuitively. I've found this useful in a couple of places.
var myTripsDiv = doc.DocumentNode.SelectSingleNode("//div[@id='myTrips']");
var myTripsNode = HtmlNode.CreateNode(myTripsDiv.InnerHtml);
var liOfTravels = myTripsNode.SelectNodes("//li");
You can do this with a Linq query:
HtmlWeb web = new HtmlWeb();
HtmlDocument doc = web.Load(url);
var travelList = new List<HtmlNode>();
foreach (var matchingDiv in doc.DocumentNode.DescendantNodes().Where(n=>n.Name == "div" && n.Id == "myTrips"))
{
travelList.AddRange(matchingDiv.DescendantNodes().Where(n=> n.Name == "li"));
}
I hope it helps.
This seems counterintuitive to me as well; if you run the SelectNodes method on a particular node, I thought it would only search underneath that node, not the document in general.
Anyway, OP, if you change this line:
var liOfTravels =
doc.DocumentNode.SelectSingleNode("//div[@id='myTrips']").SelectNodes("//li");
TO:
var liOfTravels =
doc.DocumentNode.SelectSingleNode("//div[@id='myTrips']").SelectNodes("li");
I think you'll be OK; I've just had the same issue and that fixed it for me. Note that with "li" the elements must be direct children of the node; use ".//li" to match any descendant.