HTML agility pack get all divs with class - c#

I am trying to scrape a complicated HTML page. I need to get some text from divs with a certain class.
What I want is for the Html Agility Pack to go over the whole HTML, find all divs whose class contains "listevent", and return those to me.
When I searched online I found that it is possible if I map out the full path to each div, but some of these divs are nested under so many other divs that I am looking for an easier way.
The HTML looks like this:
<div>
<div>
<table>
<tr>
<td>
<div class="thisone listevent"></div></td>
<td>
<div class="thisone listevent"></div></td>
</tr>
</table>
</div>
</div>

You could use the SelectNodes method:
foreach (HtmlNode div in document.DocumentNode.SelectNodes("//div[contains(@class,'listevent')]"))
{
    // process each matching div here
}
If you are more familiar with CSS-style selectors, try Fizzler and do this:
document.DocumentNode.QuerySelectorAll("div.listevent");
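One caveat worth illustrating: contains(@class,'listevent') is a substring test, so it would also match a class such as "listevent-header". A common XPath 1.0 idiom pads the attribute with spaces and looks for the whole token. The sketch below is only an illustration: it uses the .NET built-in XmlDocument on a small well-formed sample (the class name ClassTokenXPath and the sample markup are made up here) so that it runs without extra packages. The same XPath string should also be usable with the Agility Pack's SelectNodes, assuming it evaluates standard XPath 1.0 functions.

```csharp
using System;
using System.Xml;

public static class ClassTokenXPath
{
    // Matches divs whose class attribute contains the whole token "listevent".
    public const string ExactToken =
        "//div[contains(concat(' ', normalize-space(@class), ' '), ' listevent ')]";

    // The substring test from the answer, kept for comparison.
    public const string Substring = "//div[contains(@class,'listevent')]";

    // Loads a well-formed XML string and counts the nodes selected by xpath.
    public static int Count(string xml, string xpath)
    {
        var doc = new XmlDocument();
        doc.LoadXml(xml);
        return doc.SelectNodes(xpath).Count;
    }

    public static void Main()
    {
        // Hypothetical sample: one real match, one near miss, one unrelated div.
        const string sample =
            "<root>" +
            "<div class='thisone listevent'/>" +
            "<div class='listevent-header'/>" +
            "<div class='other'/>" +
            "</root>";

        Console.WriteLine(Count(sample, Substring));   // counts the near miss too
        Console.WriteLine(Count(sample, ExactToken));  // counts whole-token matches only
    }
}
```

If the class names on the page never share a prefix, the shorter contains(@class, ...) form is probably good enough.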

Related

Retrieving specific URLs with HtmlAgilityPack C#

I'm currently attempting to use HtmlAgilityPack to extract specific links from an html page. I tried using plain C# to force my way in but that turned out to be a real pain. The links are all inside of <div> tags that all have the same class. Here's what I have:
HtmlWeb web = new HtmlWeb();
HtmlDocument html = web.Load(url);
//this should select only the <div> tags with the class acTrigger
foreach (HtmlNode node in html.DocumentNode.SelectNodes("//div[@class='acTrigger']"))
{
//not sure how to dig further in to get the href values from each of the <a> tags
}
and the site's code looks something like this:
<li>
<div class="acTrigger">
<a href="/16014988/d/" onclick="return queueRefinementAnalytics('Category','Battery')">
Battery <em> (1)</em>
</a>
</div>
</li>
<li>
<div class="acTrigger">
<a href="/15568540/d/" onclick="return queueRefinementAnalytics('Category','Brakes')">
Brakes <em> (2)</em>
</a>
</div>
</li>
<li>
<div class="acTrigger">
<a href="/11436914/d/1979-honda-ct90-cables-lines" onclick="return queueRefinementAnalytics('Category','Cables/Lines')">
Cables/Lines <em> (1)</em>
</a>
</div>
</li>
There are a lot of links on this page, but the hrefs I need are contained in those <a> tags, which are nested inside the <div class="acTrigger"> tags. It would be simple if each <a> had its own class, but unfortunately only the <div> tags have classes. What I need to do is grab each of those hrefs and store them so I can retrieve them later, go to each page, and retrieve more information from it. I just need a nudge in the right direction to get over this hump; then I should be able to do the other pages as well. I have no previous experience with HtmlAgilityPack, and all the examples I find seem to extract every URL from a page, not specific ones. I just need a link to an example or documentation. Any help is greatly appreciated.
You should be able to change your select to include the <a> tag: //div[@class='acTrigger']/a. That way your HtmlNode is the <a> tag instead of the div.
To store the links you can use GetAttributeValue.
foreach (HtmlNode node in html.DocumentNode.SelectNodes("//div[@class='acTrigger']/a"))
{
    // Get the value of the href attribute.
    string hrefValue = node.GetAttributeValue("href", string.Empty);
    // Then store hrefValue for later.
}
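Since the href values in the sample are site-relative (for example /16014988/d/), they need to be resolved against the page address before each page can be visited later. A minimal sketch of that step, using only System.Uri, is shown here. The LinkResolver class and the example.com address are hypothetical, and the href strings stand in for whatever GetAttributeValue returned.

```csharp
using System;
using System.Collections.Generic;

public static class LinkResolver
{
    // Turns relative href values into absolute URLs against the page they came from.
    public static List<Uri> Resolve(Uri pageUri, IEnumerable<string> hrefs)
    {
        var result = new List<Uri>();
        foreach (string href in hrefs)
        {
            // Skip empty values, e.g. the string.Empty default from GetAttributeValue.
            if (string.IsNullOrWhiteSpace(href))
                continue;

            // TryCreate returns false instead of throwing on a malformed href.
            if (Uri.TryCreate(pageUri, href, out var absolute))
                result.Add(absolute);
        }
        return result;
    }

    public static void Main()
    {
        // Hypothetical page address; replace it with the URL passed to web.Load.
        var page = new Uri("https://example.com/parts/index.html");
        var hrefs = new[] { "/16014988/d/", "", "/15568540/d/" };

        foreach (Uri link in Resolve(page, hrefs))
            Console.WriteLine(link.AbsoluteUri);
    }
}
```

Keeping the results in a List<Uri> makes it straightforward to loop over them afterwards and load each page in turn.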

Not able to display image and text side by side using float

I want to display an image and text side by side (image on the left and text on the right, which is ordered using ).
I achieved this in an .html page in Visual Studio.
For the image div I used float:left, and for the text div I used float:right.
I want to achieve the same by assigning the HTML to a string variable and displaying the output in a WebBrowser control.
E.g.:
string html = @"<div style=""float:left"">
<img src=""smiley.jpg"" alt=""smiley"" />
<div style=""float:right;font-family:Calibri"">
<h2>Displaying Image and text</h2>
</div>
</div>";
webBrowser1.DocumentText = html;
But here the image and text are not displayed side by side; instead, the text is displayed on the next line.
float:right is not working as expected here. How can I resolve this?
You may do it like this:
<div style="float:left">
<img style="display:inline-block;vertical-align:middle" src="http://www.smilys.net/lachende_smilies/smiley7702.jpg" alt="smiley" />
<div style="display:inline-block;font-family:Calibri">
<h2>Displaying Image and text</h2>
</div>
</div>
<img> is an inline element, while <div> and <h2> are block elements, and you are mixing the two. Also, you can't set "float" on an inline element.
Maybe you can try putting the img inside the heading, like this:
<div>
<div style=""font-family:Calibri"">
<h2>
<img src=""smiley.jpg"" alt=""smiley""/>
Displaying Image and text
</h2>
</div>
</div>
This is a slightly different take on the problem, but it is really simple and needs no additional styles.
If you want the image on the left and the text on the right, you are nearly there: just assign the float:left style to the image instead of the outer block, like this:
string html = @"<div>
<img src=""smiley.jpg"" alt=""smiley"" style=""float:left""/>
<div style=""float:left;font-family:Calibri"">
<h2>Displaying Image and text</h2>
</div>
</div>";
I got the expected output by placing the image in one div, the text in another div, and both of those divs inside one outer div. It worked :)
Like this:
<div>
<div><img></div>
<div>text</div>
</div>
You could use a table.
<table>
<tr>
<td> <img src="smiley.jpg" alt="smiley"/> </td>
<td> Displaying Image and text </td>
</tr>
</table>

Parse a div with HTML Agility Pack

I have this HTML code:
<div class="singolo-contenuto link_azure">
<p><img src="" class="left pad2 field_foto" alt="" /><p> Message </p>
</div>
I need to capture "Message".
I'm trying with:
String message = htmlDoc.DocumentNode.SelectSingleNode("//div[@class='singolo-contenuto link_azure']").InnerText;
but it doesn't work. I get a large part of the full page back. What's wrong?
The XPath expression you have just gets you to the <div> tag. You need to get deeper into the last <p> tag. This will work:
var message = htmlDoc.DocumentNode.SelectSingleNode("//div[@class='singolo-contenuto link_azure']//p[last()]").InnerText;

dynamically add html tag in asp.net

I need to add HTML tags dynamically in ASP.NET code-behind. Here is the scenario:
List<Product> products = ProductManager.GetProductList(); // read data from the database
Product is a class containing the information I need to show on the website, such as product name, image, and so on.
To add the HTML I tried this, to show 5 products per page:
String finalHtmlCodeContainer = "";
for (int i = 0; i < 5; i++)
{
    String myBox = "<div class=\"box\"> </div>";
    finalHtmlCodeContainer += myBox;
}
boxWrapper.InnerHtml = finalHtmlCodeContainer;
boxWrapper is a div that will contain the info for our 5 products. Up to this point everything is OK, but a problem appears when, instead of "<div class=\"box\"> </div>", myBox contains a long run of HTML. The real myBox should include this:
<div class="boxWrapper">
<div class="box">
<div class="rightHeader">rightHeader</div>
<div class="leftHeader">leftHeader</div>
<div class="article">
<img src="../Image/cinemaHeader.jpg" />
<p>
some text here <!--product.content-->
</p>
</div><!--end of article-->
<div class="rightFooter"><span>rightFooter</span>
<div class="info">
<ul>
<li>item1</li>
<li>item2</li>
<li>item3</li>
</ul>
</div>
</div><!--end of rightFooter-->
<div class="leftFooter"><span>leftFooter</span>
</div><!--end of leftFooter-->
</div><!--end of box-->
</div><!--end of boxWrapper-->
As you can see, if I add the product attributes to the snippet above, it becomes messier, harder to debug, and less scalable. What is your solution?
It depends on the underlying technology. If you are using Web Forms, I'd suggest a data-bound control that supports templating, like the ListView or Repeater. It gives you full control over the UI and also supports reloading the UI from ViewState. In your scenario, you'd have to rebuild the UI on every postback, whereas the ListView or Repeater tracks it in ViewState.
If you are using MVC, a helper method can make this easier: you can put the item template in the helper, then use a foreach and render the helper for each item.
If you are talking about doing it in JavaScript, there are templating components for doing the same thing.
I suggest adding a Repeater inside your <p> tag:
<p>
<asp:Repeater runat="server" ID="rpDetail">
<ItemTemplate>
<div class="box">
<%# Eval("YourDataField") %>
</div>
</ItemTemplate>
</asp:Repeater>
</p>
Then set your products list as the Repeater's data source in the code-behind:
rpDetail.DataSource = products ;
rpDetail.DataBind();

Getting value from string using specific conditions

I have HTML data in a string, from which I need to get only the paragraph values. Below is a sample of the HTML.
<html>
<head>
<title>
<script>
<div>
Some contents
</div>
<div>
<p> This is what i want </p>
<p> Select all data from p </p>
<p> Upto this is required </p>
</div>
<div>
Other html elements
</div>
So how can I get the data from the paragraphs using string manipulation?
Desired output:
<div>
<p> This is what i want </p>
<p> Select all data from p </p>
<p> Upto this is required </p>
</div>
Give the div an ID, e.g.
<div id="test">
<p> This is what i want </p>
<p> Select all data from p </p>
<p> Upto this is required </p>
</div>
then use //div[@id='test']/p.
The solution broken down:
//div - All div elements
[@id='test'] - With an id attribute whose value is test
/p - The p elements that are direct children of that div
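To see the expression at work, here is a small sketch that runs //div[@id='test']/p against a cut-down, well-formed version of the sample. It uses the built-in XmlDocument only so that it runs without extra packages; real-world HTML with unclosed tags (like the full sample in the question) would need a tolerant parser such as the Html Agility Pack. The ParagraphXPath class name and the trimmed sample are made up for this illustration.

```csharp
using System;
using System.Xml;

public static class ParagraphXPath
{
    // Returns the trimmed text of every <p> directly under <div id="test">.
    public static string[] ParagraphTexts(string xml)
    {
        var doc = new XmlDocument();
        doc.LoadXml(xml);

        XmlNodeList nodes = doc.SelectNodes("//div[@id='test']/p");
        var texts = new string[nodes.Count];
        for (int i = 0; i < nodes.Count; i++)
            texts[i] = nodes[i].InnerText.Trim();
        return texts;
    }

    public static void Main()
    {
        // Hypothetical, well-formed reduction of the HTML in the question.
        const string sample =
            "<body>" +
            "<div>Some contents</div>" +
            "<div id='test'>" +
            "<p> This is what i want </p>" +
            "<p> Select all data from p </p>" +
            "<p> Upto this is required </p>" +
            "</div>" +
            "<div><p>Other html elements</p></div>" +
            "</body>";

        foreach (string text in ParagraphTexts(sample))
            Console.WriteLine(text);
    }
}
```

The paragraph in the last div is not selected, because the [@id='test'] predicate restricts the match to the one div that carries that id.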
I have used the Html Agility Pack for something like this. Once the document is loaded, you can use LINQ to get what you want.
XPath is the obvious answer (if the HTML is decent, has a root element, etc.). Failing that, use a third-party component such as Chilkat.
If you use the Html Agility Pack as mentioned in the other posts, you can get all paragraph elements in the div by using:
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml("your html string");
var pNodes = doc.DocumentNode.SelectNodes("//div[@id='id of the div']/p");
Since you are using .NET Framework 2.0, you will want an older version of the Agility Pack, which can be found here: HTML Agility Pack
If you want just the text inside the paragraphs, you can use
var pNodes = doc.DocumentNode.SelectNodes("//div[@id='id of the div']/p/text()");
