Parse a div with HTML Agility Pack - c#

I have this HTML code:
<div class="singolo-contenuto link_azure">
<p><img src="" class="left pad2 field_foto" alt="" /><p> Message </p>
</div>
I need to "capture" "Message".
I'm trying with:
String message = htmlDoc.DocumentNode.SelectSingleNode("//div[@class='singolo-contenuto link_azure']").InnerText;
but it doesn't work: I get back a large part of the full page. What's wrong?

The XPath expression you have just gets you to the <div> tag. You need to get deeper into the last <p> tag. This will work:
var message = htmlDoc.DocumentNode.SelectSingleNode("//div[@class='singolo-contenuto link_azure']//p[last()]").InnerText;

Related

Finding and extracting an email address using HtmlAgilityPack

Within HTML code, I am trying to extract an email address only if it is provided by the user. Here is a sample of the HTML:
<div class="header">
<div class="details">
<span>
<!-- Random description here which MAY contain an email address -->
</span>
</div>
</div>
I have managed to get to the <span> by using HTML Agility Pack as follows:
var getWeb = new HtmlWeb();
var pageHtml = getWeb.Load("website here");
IEnumerable<string> listItemHtml = pageHtml.DocumentNode.SelectNodes(
@"//div[@class='header']
/div[@class='details']
/span").Select(span => span.InnerText);
My next challenge is to search through this text and check if an email has been provided, which I am unable to figure out. Could someone please help me with this?
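One possible approach, sketched under assumptions: run a regular expression over each InnerText string. The EmailFinder class and the pattern below are illustrative and not part of HTML Agility Pack. The pattern is deliberately simple, so it catches common addresses rather than every form the email specifications allow.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class EmailFinder
{
    // Simplified pattern (an assumption): local part, '@', domain, dot, TLD of 2+ letters.
    private static readonly Regex EmailPattern = new Regex(
        @"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}",
        RegexOptions.Compiled);

    // Returns the distinct addresses found in the text, in order of first appearance.
    public static List<string> FindEmails(string text)
    {
        if (string.IsNullOrEmpty(text))
            return new List<string>();

        return EmailPattern.Matches(text)
            .Cast<Match>()
            .Select(m => m.Value)
            .Distinct(StringComparer.OrdinalIgnoreCase)
            .ToList();
    }
}
```

With the listItemHtml sequence from the question, something like listItemHtml.SelectMany(EmailFinder.FindEmails) would give every address across all spans. An empty result means the user did not provide one.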

Retrieving specific URLs with HtmlAgilityPack C#

I'm currently attempting to use HtmlAgilityPack to extract specific links from an html page. I tried using plain C# to force my way in but that turned out to be a real pain. The links are all inside of <div> tags that all have the same class. Here's what I have:
HtmlWeb web = new HtmlWeb();
HtmlDocument html = web.Load(url);
//this should select only the <div> tags with the class acTrigger
foreach (HtmlNode node in html.DocumentNode.SelectNodes("//div[@class='acTrigger']"))
{
//not sure how to dig further in to get the href values from each of the <a> tags
}
and the site's markup looks along the lines of this:
<li>
<div class="acTrigger">
<a href="/16014988/d/" onclick="return queueRefinementAnalytics('Category','Battery')">
Battery <em> (1)</em>
</a>
</div>
</li>
<li>
<div class="acTrigger">
<a href="/15568540/d/" onclick="return queueRefinementAnalytics('Category','Brakes')">
Brakes <em> (2)</em>
</a>
</div>
</li>
<li>
<div class="acTrigger">
<a href="/11436914/d/1979-honda-ct90-cables-lines" onclick="return queueRefinementAnalytics('Category','Cables/Lines')">
Cables/Lines <em> (1)</em>
</a>
</div>
</li>
There are a lot of links on this page, but the hrefs I need are contained inside of those <a> tags, which are nested inside of the <div class="acTrigger"> tags. It would be simple if each <a> had its own class, but unfortunately only the <div> tags have classes. What I need to do is grab each one of those hrefs and store them so I can retrieve them later, go to each page, and retrieve more information from each page. I just need a nudge in the right direction to get over this hump, then I should be able to do the other pages as well. I have no previous experience with HtmlAgilityPack, and all the examples I find seem to want to extract all the URLs from a page, not specific ones. I just need a link to an example or documentation; any help is greatly appreciated.
You should be able to change your select to include the <a> tag: //div[@class='acTrigger']/a. That way your HtmlNode is your <a> tag instead of the div.
To store the links you can use GetAttributeValue.
foreach (HtmlNode node in html.DocumentNode.SelectNodes("//div[@class='acTrigger']/a"))
{
// Get the value of the HREF attribute.
string hrefValue = node.GetAttributeValue( "href", string.Empty );
// Then store hrefValue for later.
}
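To make the "store for later" step concrete, here is a minimal sketch that collects the hrefs into a list and resolves them against the page address, since the hrefs in the sample are relative. LinkCollector, CollectLinks, and baseUri are illustrative names, not part of the library. It also guards against SelectNodes returning null, which, as far as I know, is what it does when nothing matches.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public static class LinkCollector
{
    // Collects the href of every <a> directly inside <div class="acTrigger">,
    // resolved against baseUri (the address the page was loaded from).
    public static List<Uri> CollectLinks(HtmlDocument html, Uri baseUri)
    {
        var links = new List<Uri>();

        // Guard against a null result when no node matches the XPath.
        var anchors = html.DocumentNode.SelectNodes("//div[@class='acTrigger']/a[@href]");
        if (anchors == null)
            return links;

        foreach (HtmlNode a in anchors)
        {
            string href = a.GetAttributeValue("href", string.Empty);

            // Turn a relative href such as "/16014988/d/" into an absolute URI.
            if (Uri.TryCreate(baseUri, href, out Uri absolute))
                links.Add(absolute);
        }

        return links;
    }
}
```

Each stored Uri can later be passed to web.Load(uri.AbsoluteUri) to fetch the linked page.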

HtmlAgilityPack (C#) can't read past hidden text

Using the following URL:
link to search results page
I am trying to scrape the text of the <a> tag from this HTML, which is what Firebug shows when I inspect the page:
<div id="search-results" class="search_results">
<div class="searchResultItem">
<div class="searchResultImage photo">
<h3 class="black">
<a class="linkmed " href="/content/1/2484243.html">加州旱象不减 开源节流声声急</a>
</h3>
<p class="resultPubDate">15.10.2014 06:08 </p>
<p class="resultText">
</div>
</div>
<p class="more-results">
But what I get back when I scrape the page is:
<div class="search_results" id="search-results">
<input type="hidden" name="ctl00$ctl00$cpAB$cp1$hidSearchType" id="hidSearchType">
</div>
<p class="more-results">
Is there any way to view the source the way Firebug does?
How are you scraping the page? Use something like Fiddler and check the request and the response for dynamic pages like this one. Firebug sees more because all of the dynamic elements have already loaded by the time you view the page in your browser, whereas your scraper only receives one piece of the puzzle (the initial HTML).
Hint: For this search page, you will see that the request for the results data is actually a) a separate GET request with b) a long query string and c) a cookie in the header, which returns a JSON object containing the data. This is why the link you posted just gives me "undefined": it does not contain the search data.

HTML agility pack get all divs with class

I am trying to scrape a complicated HTML page. I need to get some text from divs with a certain class.
What I am trying to do is have the HTML Agility Pack go over the whole HTML and find all divs whose class contains "listevent" and return those to me.
When I searched online I found that it is possible if I map out the full path, but some of these divs are nested under so many other divs that I am looking for an easier way.
The HTML looks like this:
<div>
<div>
<table>
<tr>
<td>
<div class="thisone listevent"></td>
<td>
<div class="thisone listevent"></td>
</tr>
</table>
</div>
</div>
You could use the SelectNodes method:
foreach(HtmlNode div in document.DocumentNode.SelectNodes("//div[contains(@class,'listevent')]"))
{
}
If you are more familiar with CSS-style selectors, try Fizzler and do this:
document.DocumentNode.QuerySelectorAll("div.listevent");
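One caveat worth illustrating: a plain contains() test on the class attribute is a substring match, so it would also match a class such as "listevents". A common XPath 1.0 idiom pads the attribute with spaces and looks for the whole token. The sketch below assumes that idiom; ClassQuery and DivsWithClass are made-up names, and className is assumed to contain no quote characters.

```csharp
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class ClassQuery
{
    // Returns every <div> whose class attribute contains className as a whole
    // space-separated token, at any depth in the document.
    public static IEnumerable<HtmlNode> DivsWithClass(HtmlDocument document, string className)
    {
        // Pad the normalized class list with spaces, then search for " token ".
        string xpath =
            "//div[contains(concat(' ', normalize-space(@class), ' '), ' " + className + " ')]";

        HtmlNodeCollection nodes = document.DocumentNode.SelectNodes(xpath);

        // Return an empty sequence instead of null when nothing matches.
        if (nodes == null)
            return Enumerable.Empty<HtmlNode>();
        return nodes;
    }
}
```

Calling ClassQuery.DivsWithClass(document, "listevent") should behave like the CSS selector div.listevent, without needing the extra Fizzler dependency.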

Getting value from string using specific conditions

I have HTML data in a string, from which I need to get only the paragraph values. Below is a sample of the HTML.
<html>
<head>
<title>
<script>
<div>
Some contents
</div>
<div>
<p> This is what i want </p>
<p> Select all data from p </p>
<p> Upto this is required </p>
</div>
<div>
Other html elements
</div>
How can I get the data from the paragraphs using string manipulation?
Desired Output
<Div>
<p> This is what i want </p>
<p> Select all data from p </p>
<p> Upto this is required </p>
</div>
Give the div an ID, e.g.
<div id="test">
<p> This is what i want </p>
<p> Select all data from p </p>
<p> Upto this is required </p>
</div>
then use //div[@id='test']/p.
The solution broken down:
//div - All div elements
[@id='test'] - With an ID attribute whose value is test
/p - The p elements that are direct children of that div
I have used HTML Agility Pack for something like this. Then you can use LINQ to get what you want.
XPath is the obvious answer (if the HTML is decent, has a root, etc.); failing that, use a third-party library like Chilkat.
If you use HTML Agility Pack as mentioned in the other posts, you can get all paragraph elements in the HTML by using:
HtmlDocument doc = new HtmlDocument();
// LoadHtml parses an HTML string; Load expects a file path or stream.
doc.LoadHtml("your html string");
var pNodes = doc.DocumentNode.SelectNodes("//div[@id='id of the div']/p");
Since you are using .NET Framework 2.0, you would want an older version of Agility Pack, which can be found here: HTML Agility Pack
If you want just the text inside the paragraph, you can use
var pNodes = doc.DocumentNode.SelectNodes("//div[@id='id of the div']/p/text()");
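Putting those pieces together, a minimal sketch might look like the following. ParagraphReader, ReadParagraphs, and divId are illustrative names, and the code assumes the div has been given an id as suggested in the earlier answer. It reads InnerText from each p node rather than selecting text() nodes, which is a design choice and not the only option.

```csharp
using System.Collections.Generic;
using HtmlAgilityPack;

public static class ParagraphReader
{
    // Returns the trimmed text of each <p> directly under the div with the given id.
    public static List<string> ReadParagraphs(string html, string divId)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html); // parse from a string, not from a file

        var result = new List<string>();

        // divId is assumed to contain no quote characters.
        var pNodes = doc.DocumentNode.SelectNodes("//div[@id='" + divId + "']/p");
        if (pNodes == null)
            return result; // no matching div or paragraphs

        foreach (HtmlNode p in pNodes)
        {
            // Decode entities such as &amp; and drop surrounding whitespace.
            result.Add(HtmlEntity.DeEntitize(p.InnerText).Trim());
        }

        return result;
    }
}
```

For the sample above, ReadParagraphs(html, "test") would yield one string per paragraph, which can then be joined or wrapped back in a div as needed.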
