li in HtmlAgilityPack (C#)

I want to get the label and strong values from each li in the following markup:
<div class="property-summary">
<h3>Listing summary</h3>
<ul>
<li>
<label>Reference</label>
<strong>BR-S-4301</strong>
</li>
<li>
<label>Type</label>
<strong>Apartment</strong>
</li>
<li>
<label>City</label>
<strong>Dubai</strong>
</li>
<li>
<label>Community</label>
<strong>Palm Jumeirah</strong>
</li>
<li>
<label>Subcommunity</label>
<strong>Tiara Residences</strong>
</li>
</ul>
</div>
Here is my C# code:
var dataNode = rootNode.SelectNodes("//div[normalize-space(@class)='property-summary']");
How do I get the values from here? The following is not working for me:
var Node = dataNode.SelectSingleNode(".//li/strong");

There are a couple of ways to do it.
1
var labelNodes = doc.DocumentNode.SelectNodes("//div[normalize-space(@class)='property-summary']/ul/li/label");
var strongNodes = doc.DocumentNode.SelectNodes("//div[normalize-space(@class)='property-summary']/ul/li/strong");
foreach (var node in labelNodes)
{
Debug.WriteLine(node.InnerText.Trim());
}
foreach (var node in strongNodes)
{
Debug.WriteLine(node.InnerText.Trim());
}
2
var liNodes = doc.DocumentNode.SelectNodes("//div[normalize-space(@class)='property-summary']/ul/li");
foreach (var node in liNodes)
{
Debug.WriteLine(node.SelectSingleNode("label").InnerText.Trim());
Debug.WriteLine(node.SelectSingleNode("strong").InnerText.Trim());
}
Check that the nodes exist before writing any real code: SelectNodes and SelectSingleNode return null when nothing matches.
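A minimal sketch of that existence check, applied to the second approach. The class and method names (SummaryReader, ReadSummary) are hypothetical and only serve the illustration; the HtmlAgilityPack NuGet package is assumed to be referenced.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

// Hypothetical helper, not part of HtmlAgilityPack.
public static class SummaryReader
{
    // Returns (label, value) pairs; an empty list when the markup is missing.
    public static List<KeyValuePair<string, string>> ReadSummary(string html)
    {
        var result = new List<KeyValuePair<string, string>>();
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        var liNodes = doc.DocumentNode.SelectNodes(
            "//div[normalize-space(@class)='property-summary']/ul/li");
        if (liNodes == null)
            return result;

        foreach (var li in liNodes)
        {
            var label = li.SelectSingleNode("label");
            var strong = li.SelectSingleNode("strong");
            if (label == null || strong == null)
                continue; // skip items that lack either element

            result.Add(new KeyValuePair<string, string>(
                label.InnerText.Trim(), strong.InnerText.Trim()));
        }
        return result;
    }
}
```

Returning an empty list instead of throwing keeps the caller's loop simple; whether that is the right behavior depends on how strictly the page layout should be enforced.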

If you want to get all the label tags, you can use
IEnumerable<HtmlNode> labels = dataNode.Descendants("label");
And same for strong tags
IEnumerable<HtmlNode> strongs = dataNode.Descendants("strong");
You can also use:
var dataNode = doc.DocumentNode.SelectNodes("//div[normalize-space(@class)='property-summary']")[0];
HtmlNodeCollection strongs = dataNode.SelectNodes(".//li/strong");
HtmlNodeCollection labels = dataNode.SelectNodes(".//li/label");
To get text from strongs or labels use:
foreach (var strong in strongs)
{
string strongText = strong.InnerText.Trim();
}

You may consider switching to one of these HTML parsing libraries, which provide jQuery-like selector features.
http://nsoup.codeplex.com/
http://github.com/jamietre/csquery

Related

How to get a particular text inside HTML using c#?

How do I get the text "Attractions" from the HTML below?
<li class="product">
<strong>
Attractions
</strong>
<span></span>
</li>
I usually do this with the code below when I need the text inside a span, but I need some help with the situation above.
foreach (HtmlNode selectNode in htmlDocument.DocumentNode.SelectNodes("//span[@class='cityName']"))
{
Result = selectNode.InnerHtml;
}
How can I do this?
Result = htmlDocument.DocumentNode.SelectSingleNode("//li[@class='product']/strong").InnerText.Trim();
You can also loop over the results of SelectNodes with foreach, as in your existing code.

HTML Agility Pack - Get div without class or id (C#)

I've been trying to follow some solutions here on StackOverflow but I need some help.
This is the source HTML:
<div class="myclass">
<div style="font-size:2em;"> STRING_N1 </div>
<div> STRING_N2 </div>
</div>
And this is my current code:
var doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(htmlcode);
var res = doc.DocumentNode.SelectNodes("//div[@class='myclass']");
foreach (var item in res)
{
var firstDiv = item.SelectSingleNode("div");
var content1 = firstDiv.ChildNodes[0].InnerText.Trim();
richTextBox1.AppendText(content1.ToString());
}
So far so good: I can extract "STRING_N1" without a problem. However, I can't figure out how to extract STRING_N2, since its div has no class or id.
Thank you.
You can use LINQ to get descendant divs:
var divs = doc.DocumentNode.SelectNodes("//div[@class='myclass']")
.SelectMany(x => x.Descendants("div"));
var contents = divs.Select(x => x.InnerText.Trim());
richTextBox1.AppendText(string.Join(Environment.NewLine, contents));

Computing multiple node sets

var cats = doc.DocumentNode.SelectNodes("xpath1 | xpath2");
I use the | operator to combine multiple XPath expressions, and HtmlAgilityPack puts all the results in a single node collection. How do I know whether a node is a result of xpath1 or xpath2?
Example:
var cats = doc.DocumentNode.SelectNodes("//*[contains(@name,'thename')]/../../div/ul/li/a | //*[contains(@name,'thename')]/../../div/ul/li/a/../div/ul/li/a");
I am trying to build a tree-like structure from the results. The first XPath returns a single element and the second returns one or more elements; the first is a main tree node and the second are the children of that node. I want to build a List<string,List<string>> from the inner text of the results.
To make it simpler, consider the following HTML:
<ul>
<li>
<h1>Node1</h1>
<ul>
<li>Node1_1</li>
<li>Node1_2</li>
<li>Node1_3</li>
<li>Node1_4</li>
</ul>
</li>
<li>
<h1>Node2</h1>
<ul>
<li>Node2_1</li>
<li>Node2_2</li>
</ul>
</li>
<li>
<h1>Node3</h1>
<ul>
<li>Node3_1</li>
<li>Node3_2</li>
<li>Node3_3</li>
</ul>
</li>
</ul>
var cats = doc.DocumentNode.SelectNodes("//ul/li/h1 | //ul/li/ul/li");
Why not just do:
var head = doc.DocumentNode.SelectSingleNode("xpath1");
var children = head.SelectNodes("xpath2");
?
For the code in the example you would do:
var containerNodes = doc.DocumentNode.SelectNodes("//ul/li");
foreach(var n in containerNodes)
{
var headNode = n.SelectSingleNode("h1");
var subNodes = n.SelectNodes("ul/li");
}
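The loop above can be extended into the head-to-children mapping the question asks for. This is a sketch only: a Dictionary<string, List<string>> stands in for the List<string,List<string>> mentioned in the question (which is not a valid C# type), the names TreeBuilder and BuildTree are hypothetical, and the XPath is narrowed to //ul/li[h1] so that the inner li items are not treated as containers.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

// Hypothetical helper, not part of HtmlAgilityPack.
public static class TreeBuilder
{
    // Maps each <h1> text to the texts of the <li> items nested under it.
    public static Dictionary<string, List<string>> BuildTree(string html)
    {
        var tree = new Dictionary<string, List<string>>();
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Only <li> elements that have an <h1> child are treated as containers.
        var containerNodes = doc.DocumentNode.SelectNodes("//ul/li[h1]");
        if (containerNodes == null)
            return tree; // no containers found

        foreach (var n in containerNodes)
        {
            var headNode = n.SelectSingleNode("h1");
            var subNodes = n.SelectNodes("ul/li"); // null when there are no children

            var children = subNodes == null
                ? new List<string>()
                : subNodes.Select(s => s.InnerText.Trim()).ToList();

            // Duplicate head texts overwrite earlier entries in this sketch.
            tree[headNode.InnerText.Trim()] = children;
        }
        return tree;
    }
}
```

Because each container is queried relative to its own node, there is no need to work out afterwards which half of a combined "xpath1 | xpath2" expression produced a given result.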

search for node using contains in c#

I am trying to find all nodes whose id starts with searchResult (searchResult1, searchResult2, up to searchResult10) in an HTML input from my C# program. Here's my code:
var results = hdoc.DocumentNode
.Descendants("div")
.Where(x => x.Attributes.Contains("id") &&
x.Attributes["id"].Value.Contains("\"searchResult")).ToList();
for (int i = 0; i < results.Count; i++)
{
rawdata[i] = results[i].InnerHtml.Trim();
}
My HTML looks like this:
<div id="searchResultTable" class="searchReturnData"> some junk html
<li id="searchResult1" class="searchResult searchResultsData_OFF"> searchResult1 html </li>
<li id="searchResult2" class="searchResult searchResultsData_OFF">searchResult2 html </li>
<li id="searchResult3" class="searchResult searchResultsData_OFF">searchResult3 html </li>
</div>
I want to print only the searchResult1, searchResult2 and searchResult3 HTML, not the junk HTML. How can I do this?
Thanks
Rashmi
If you can use HtmlAgilityPack to parse the HTML, you can do something like this:
HtmlDocument doc = new HtmlDocument();
doc.Load(@"C:\file.html");
var root = doc.DocumentNode;
var a_nodes = root.Descendants("li").Where(c=>c.GetAttributeValue("id","")
.Contains("searchResult")).ToList();

Select node based on sibling properties - HtmlAgilityPack - C#

I have an HTML-document that is structured as follows
<ul class="beverageFacts">
<li>
<span>Vintage</span>
<strong>2007 </strong>
</li>
<li>
<span>ABV</span>
<strong>13,0 %</strong>
</li>
<li>
<span>Sugar</span>
<strong>5 gram/liter</strong>
</li>
I need to parse the values of the <strong> tags into corresponding strings, depending on the value of the <span> tag.
I have the following:
String vintage;
String sugar;
String abv;
As of now, I am looping through each child node of the beverageFacts-node checking the values to parse it to the correct corresponding string.
The code I have so far to get the "Vintage"-value is the following, though the result is always null.
HtmlNodeCollection childNodes = bevFactNode.ChildNodes;
foreach (HtmlNode subNode in childNodes)
{
if (subNode.InnerText.TrimStart() == "Vintage")
vintage = subNode.NextSibling.InnerText.Trim();
}
I believe my selection of the nodes is incorrect, but I cannot figure out how to properly do it in the most efficient way.
Is there an easy way to achieve this?
Edit 2013-07-29
I have tried to remove the whitespaces as suggested by enricoariel in the comments using the following code
HtmlAgilityPack.HtmlDocument page = new HtmlWeb().Load("http://www.systembolaget.se/" + articleID);
string cleanDoc = Regex.Replace(page.DocumentNode.OuterHtml, @"\s*(?<capture><(?<markUp>\w+)>.*<\/\k<markUp>>)\s*", "${capture}", RegexOptions.Singleline);
HtmlDocument cleanPage = new HtmlDocument();
cleanPage.LoadHtml(cleanDoc);
The result is still
String vintage = null;
Looking at the HTML markup, I realized I didn't go deep enough into the nodes.
Also, as enricoariel pointed out, there are whitespace nodes that I did not clean up properly. By skipping the sibling that holds the whitespace and jumping to the one after it, I get the correct result.
foreach (HtmlNode bevFactNode in bevFactsNodes)
{
HtmlNodeCollection childNodes = bevFactNode.ChildNodes;
foreach (HtmlNode node in childNodes)
{
foreach(HtmlNode subNode in node.ChildNodes)
{
if (subNode.InnerText.Trim() == "Årgång")
vintage = HttpUtility.HtmlDecode(subNode.NextSibling.NextSibling.InnerText.Trim());
}
}
}
Console.WriteLine("Vintage: " + vintage);
will output
Vintage: 2007
I decoded the HTML to get the result formatted correctly.
Lessons learned!
To summarize, I think the best solution is to strip all whitespace between tags with a regex before retrieving the NextSibling value:
string myHtml =
@"
<ul class='beverageFacts'>
<li>
<span>Vintage</span>
<strong>2007 </strong>
</li>
<li>
<span>ABV</span>
<strong>13,0 %</strong>
</li>
<li>
<span>Sugar</span>
<strong>5 gram/liter</strong>
</li>";
// Remove whitespace before tags and collapse whitespace after tags to a single space
myHtml = Regex.Replace(myHtml, @"\s+<", "<", RegexOptions.Multiline | RegexOptions.Compiled);
myHtml = Regex.Replace(myHtml, @">\s+", "> ", RegexOptions.Compiled | RegexOptions.Multiline);
HtmlDocument doc = new HtmlDocument();
// Set parser options before loading; they have no effect afterwards
doc.OptionFixNestedTags = true;
// Drop any remaining line breaks (spaces are kept so attributes and values stay intact)
doc.LoadHtml(myHtml.Replace("\r", "").Replace("\n", ""));
HtmlNodeCollection vals = doc.DocumentNode.SelectNodes("//ul[@class='beverageFacts']//span");
var myNodeContent = string.Empty;
foreach (HtmlNode val in vals)
{
if (val.InnerText == "Vintage")
{
myNodeContent = val.NextSibling.InnerText;
}
}
return myNodeContent;
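As an alternative to cleaning the markup first (this is a different technique from the regex approach above, not the answerer's method), the XPath following-sibling axis can select the strong element directly, which skips the whitespace text nodes that make NextSibling unreliable. A sketch, with the hypothetical names FactReader and ReadFact, and assuming the label text contains no single quotes:

```csharp
using System;
using HtmlAgilityPack;

// Hypothetical helper, not part of HtmlAgilityPack.
public static class FactReader
{
    // Returns the trimmed <strong> text that follows the <span> with the
    // given label, or null when no such pair exists.
    public static string ReadFact(string html, string label)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Assumption: label has no single quotes, since it is spliced into the XPath.
        // following-sibling::strong[1] picks the first <strong> element after the
        // matching <span>, ignoring any whitespace text nodes in between.
        var node = doc.DocumentNode.SelectSingleNode(
            "//ul[@class='beverageFacts']/li/span[normalize-space(.)='" + label +
            "']/following-sibling::strong[1]");

        return node == null ? null : HtmlEntity.DeEntitize(node.InnerText).Trim();
    }
}
```

With the markup from the question, ReadFact(html, "Vintage") would be called once per field (Vintage, ABV, Sugar) instead of looping over child nodes.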
