How to download the content of an element by its hierarchy - C#

I'm new to Stack Overflow and I hope my question isn't odd.
I just want to download the text inside svalue of the sindex element, and also the content of another <p> tag. This is its hierarchy:
/html/body/table/tbody/tr[2]/td/table/tbody/tr/td/table/tbody/tr[3]/td/table/tbody/tr/td/table/tbody/tr/td/table[4]/tbody/tr[2]/td[2]/table/tbody/tr/td/div/span/span/p/span/sindex
Is it possible to download the content by its hierarchy, with HtmlAgilityPack for example, or in another way?
Thanks
WebClient client = new WebClient();
string url = "http://www.google.com";
var content = client.DownloadString(url);
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(content);
// ?
Update: after @MSI's answer, I use this:
var value = doc.DocumentNode
.SelectSingleNode("//html/body/div/div/a/div");
But the return value is always null. Maybe I got the hierarchy wrong. I use Firebug and look at the HTML tab for the hierarchy; is that wrong?

Can't you use something along the lines of the following?
Considering svalue to be an attribute:
doc.DocumentNode
.SelectSingleNode("//html/element1/element2")
.Attributes["svalue"].Value;
or, for an element:
doc.DocumentNode
.SelectSingleNode("//html/element1/element2/svalue").InnerText;
EDIT:
Re: SelectSingleNode returning null for my previous examples: with google.com.au as the reference HTML source, use the following to get the desired result.
doc.DocumentNode
.SelectSingleNode(".//element1/element2/svalue").InnerText;
DocumentNode refers to the HTML document's root node, and .// is relative to that.
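For example, a minimal sketch (the URL and the XPath are placeholders; adjust them to your page) that selects a node relative to DocumentNode and checks for a missing match:
using System;
using System.Net;
using HtmlAgilityPack;

class Program
{
    static void Main()
    {
        // Example URL only; substitute the page you are actually scraping.
        var html = new WebClient().DownloadString("http://www.google.com");

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Relative XPath from the document root; adjust the path to your page's hierarchy.
        var node = doc.DocumentNode.SelectSingleNode(".//body//p/span");

        // SelectSingleNode returns null when nothing matches, so check before using it.
        Console.WriteLine(node != null ? node.InnerText : "No match found");
    }
}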

Related

HtmlAgilityPack doesn't get nodes

I am using HtmlAgilityPack in C#. When I try to select the images in a div at the URL below, the selector finds nothing, although I think I wrote the selector correctly.
The code is in the fiddle below. Thanks.
https://dotnetfiddle.net/NNIC3X
var url = "https://fotogaleri.haberler.com/unlu-sarkici-imaj-degistirdi-gorenler-gozlerine/";
//I will get the images src values in .col-big div at this url.
var web = new HtmlWeb();
var doc = web.Load(url);
var htmlNodes = doc.DocumentNode.SelectNodes("//div[@class='col-big']//*/img");
//I am selecting all images in div.col-big, but nothing is found.
foreach (var node in htmlNodes)
{
Console.WriteLine(node.Attributes["src"].Value);
}
Your XPath is wrong because there is no div tag whose class attribute has the value 'col-big'. There is, however, a div tag whose class attribute has the value 'col-big pull-left'. So try:
var htmlNodes = doc.DocumentNode.SelectNodes("//div[@class='col-big pull-left']//*/img");
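If you want the selector to keep working when additional classes are added, a sketch along these lines (continuing the snippet above; the token-matching XPath is just one common pattern) matches any div whose class list contains col-big:
var web = new HtmlWeb();
var doc = web.Load(url);

// Match 'col-big' as a whole token inside the class attribute, regardless of other classes.
var htmlNodes = doc.DocumentNode.SelectNodes(
    "//div[contains(concat(' ', normalize-space(@class), ' '), ' col-big ')]//img");

if (htmlNodes != null) // SelectNodes returns null when nothing matches
{
    foreach (var node in htmlNodes)
    {
        // Fall back to an empty string if an image has no src attribute.
        Console.WriteLine(node.GetAttributeValue("src", string.Empty));
    }
}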

Parse Complete Web Page

How do I parse a complete HTML web page, not just specific nodes, using HTML Agility Pack or any other technique?
I am using the code below, but it only parses specific nodes; I need to parse the complete page into neat and clean content.
List<string> list = new List<string>();
string url = "https://www.google.com";
HtmlWeb web = new HtmlWeb();
HtmlAgilityPack.HtmlDocument doc = web.Load(url);
foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//a"))
{
list.Add(node.InnerText);
}
To get all descendant text nodes, use something like:
var textNodes = doc.DocumentNode.SelectNodes("//text()").
Select(t=>t.InnerText);
To get all non-empty descendant text nodes:
var textNodes = doc.DocumentNode.
SelectNodes("//text()[normalize-space()]").
Select(t=>t.InnerText);
Do SelectNodes("*") to get every child node of the current node, or SelectNodes("//*") to get every element on the page; '*' (asterisk) is the wildcard selector.
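Putting those pieces together, a minimal sketch (the URL is only an example) that loads a page and prints all of its non-empty text content might look like this:
using System;
using HtmlAgilityPack;

class Program
{
    static void Main()
    {
        var web = new HtmlWeb();
        var doc = web.Load("https://www.google.com"); // example URL

        // Every text node that is not pure whitespace.
        var nodes = doc.DocumentNode.SelectNodes("//text()[normalize-space()]");
        if (nodes != null) // SelectNodes returns null when nothing matches
        {
            foreach (var t in nodes)
            {
                // Note: this also includes <script> and <style> contents unless you filter them out.
                Console.WriteLine(t.InnerText.Trim());
            }
        }
    }
}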

How do I use HTML Agility Pack to edit an HTML snippet

So I have an HTML snippet that I want to modify using C#.
<div>
This is a specialSearchWord that I want to link to
<img src="anImage.jpg" />
A hyperlink
Some more text and that specialSearchWord again.
</div>
and I want to transform it to this:
<div>
This is a <a class="special" href="http://mysite.com/search/specialSearchWord">specialSearchWord</a> that I want to link to
<img src="anImage.jpg" />
A hyperlink
Some more text and that <a class="special" href="http://mysite.com/search/specialSearchWord">specialSearchWord</a> again.
</div>
I'm going to use HTML Agility Pack based on the many recommendations here, but I don't know where to start. In particular:
How do I load a partial snippet as a string, instead of a full HTML document?
How do I edit it?
How do I then return the text string of the edited object?
The same way as a full HTML document; it doesn't matter.
There are two options: you may edit the InnerHtml property directly (or Text on text nodes), or modify the DOM tree using e.g. AppendChild, PrependChild, etc.
You may use the HtmlDocument.DocumentNode.OuterHtml property or the HtmlDocument.Save method (personally I prefer the second option).
As for the parsing, I select the text nodes inside your div which contain the search term, and then just use the string.Replace method to replace it:
var doc = new HtmlDocument();
doc.LoadHtml(html);
var textNodes = doc.DocumentNode.SelectNodes("/div/text()[contains(.,'specialSearchWord')]");
if (textNodes != null)
foreach (HtmlTextNode node in textNodes)
node.Text = node.Text.Replace("specialSearchWord", "<a class='special' href='http://mysite.com/search/specialSearchWord'>specialSearchWord</a>");
And saving the result to a string:
string result = null;
using (StringWriter writer = new StringWriter())
{
doc.Save(writer);
result = writer.ToString();
}
Answers:
1. There may be a way to do this but I don't know how. I suggest loading the entire document.
2. Use a combination of XPath and regular expressions.
3. See the code below for a contrived example. You may have other constraints not mentioned, but this code sample should get you started.
Note that your XPath expression may need to be more complex to find the div that you want.
HtmlDocument doc = new HtmlDocument();
doc.Load(yourHtmlFile);
HtmlNode divNode = doc.DocumentNode.SelectSingleNode("//div[2]");
string newDiv = Regex.Replace(divNode.InnerHtml, @"specialSearchWord",
"<a class='special' href='http://etc'>specialSearchWord</a>");
divNode.InnerHtml = newDiv;
Console.WriteLine(doc.DocumentNode.OuterHtml);

How to get img/src or a/hrefs using Html Agility Pack?

I want to use the HTML Agility Pack to parse image and href links from an HTML page, but I just don't know much about XML or XPath. Though I have looked up help documents on many web sites, I just can't solve the problem. In addition, I use C# in Visual Studio 2005. I can't speak English fluently, so I will give my sincere thanks to anyone who can write some helpful code.
The first example on the home page does something very similar, but consider:
HtmlDocument doc = new HtmlDocument();
doc.Load("file.htm"); // would need doc.LoadHtml(htmlSource) if it is not a file
// Note: DocumentElement and the link["href"] indexer come from an older
// HtmlAgilityPack API; newer versions use DocumentNode (see the answers below).
foreach (HtmlNode link in doc.DocumentElement.SelectNodes("//a[@href]"))
{
string href = link["href"].Value;
// store href somewhere
}
So you can imagine that for img@src, just replace each a with img, and href with src.
You might even be able to simplify to:
foreach (HtmlNode node in doc.DocumentElement
    .SelectNodes("//a/@href | //img/@src"))
{
list.Add(node.Value);
}
For relative URL handling, look at the Uri class.
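For instance, a minimal sketch (the URLs are only examples) of resolving a relative src against the page's base address with the Uri class:
using System;

class Program
{
    static void Main()
    {
        // Example values only; in practice the base comes from the page you loaded
        // and the relative path from the src/href attribute you extracted.
        var baseUri = new Uri("http://example.com/articles/page.html");
        var relativeSrc = "images/photo.jpg";

        // The Uri(Uri, string) constructor resolves the relative reference.
        var absolute = new Uri(baseUri, relativeSrc);
        Console.WriteLine(absolute); // http://example.com/articles/images/photo.jpg
    }
}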
The example and the accepted answer are wrong. They don't compile with the latest version. I tried something else:
private List<string> ParseLinks(string html)
{
var doc = new HtmlDocument();
doc.LoadHtml(html);
var nodes = doc.DocumentNode.SelectNodes("//a[@href]");
// Flattens every attribute value of each matched <a> into a single list
// (for plain <a href="..."> anchors that is just the href values).
return nodes == null ? new List<string>() : nodes.ToList().ConvertAll(
    r => r.Attributes.ToList().ConvertAll(
        i => i.Value)).SelectMany(j => j).ToList();
}
This works for me.
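A quick hypothetical usage of the ParseLinks helper above (the URL is just an example; WebClient needs System.Net):
var html = new WebClient().DownloadString("http://example.com");
foreach (var href in ParseLinks(html))
{
    Console.WriteLine(href);
}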
Maybe I am too late here to post an answer. The following worked for me:
var MainImageString = MainImageNode.Attributes.Where(i => i.Name == "src").FirstOrDefault();
var htmlDoc = new HtmlDocument();
htmlDoc.LoadHtml(html);
string name = htmlDoc.DocumentNode
.SelectNodes("//td/input")
.First()
.Attributes["value"].Value;
Source:
https://html-agility-pack.net/select-nodes
You also need to take into account the document base URL element (<base>) and protocol-relative URLs (for example //www.foo.com/bar/); see the sketch after the list below.
For more information check:
<base>: The Document Base URL element page on MDN
The Protocol-relative URL article by Paul Irish
What are the recommendations for html tag? discussion on StackOverflow
Uri Constructor (Uri, Uri) page on MSDN
Uri class doesn't handle the protocol-relative URL discussion on StackOverflow
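A minimal sketch of one way to handle those cases (the helper name and URLs are purely illustrative, not part of any library):
// Hypothetical helper: resolves a raw href/src against a base URI,
// handling protocol-relative references like //www.foo.com/bar/ explicitly.
static Uri ResolveUrl(Uri baseUri, string rawUrl)
{
    if (rawUrl.StartsWith("//"))
    {
        // Protocol-relative: reuse the scheme of the base document.
        return new Uri(baseUri.Scheme + ":" + rawUrl);
    }
    // Otherwise let Uri resolve absolute and relative references.
    return new Uri(baseUri, rawUrl);
}

// Example (values are illustrative only):
// ResolveUrl(new Uri("https://example.com/page"), "//www.foo.com/bar/") -> https://www.foo.com/bar/
// ResolveUrl(new Uri("https://example.com/page"), "images/a.png")       -> https://example.com/images/a.png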
Late post, but here's a 2021 update to the accepted answer (it accounts for the refactoring that HtmlAgilityPack made):
var htmlDoc = new HtmlDocument();
htmlDoc.LoadHtml(html);
string command = "";
// The Xpath below gets images.
// It is specific to a site. Yours will vary ...
command = "//a[contains(concat(' ', @class, ' '), 'product-card')]//img";
List<string> listImages = new();
foreach (HtmlNode node in htmlDoc.DocumentNode.SelectNodes(command))
{
// Using "data-src" below, but it may be "src" for you
listImages.Add(node.Attributes["data-src"].Value);
}
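If some images carry src rather than data-src, a slightly more defensive variant (only a sketch, under the same assumptions as the snippet above) falls back between the two attributes, skips missing values, and tolerates an empty result:
// Enumerable.Empty requires using System.Linq.
foreach (HtmlNode node in htmlDoc.DocumentNode.SelectNodes(command) ?? Enumerable.Empty<HtmlNode>())
{
    // Prefer data-src (often used for lazy-loaded images), then fall back to src.
    var src = node.GetAttributeValue("data-src", "");
    if (string.IsNullOrEmpty(src))
        src = node.GetAttributeValue("src", "");
    if (!string.IsNullOrEmpty(src))
        listImages.Add(src);
}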
