Select items from parsed HTML - C#

I want to select some items from an HTML webpage and put them into a list. Each item will be an instance of this class:
public class shopItem
{
private String itemName;
private String itemImageLink;
private Double itemPrice;
public shopItem(String itemName, String itemImageLink, Double itemPrice)
{
this.itemName = itemName;
this.itemImageLink = itemImageLink;
this.itemPrice = itemPrice;
}
public String getItemName()
{
return this.itemName;
}
public String getItemImageLink()
{
return this.itemImageLink;
}
public Double getItemPrice()
{
return this.itemPrice;
}
}
The HTML is this thing:
<div class="list_categorie_product">
<!-- Products list -->
<ul id="product_list_grid" class="categorie_product clear">
</li>
<li class="ajax_block_product alternate_item clearfix">
<p>
<a href="http://thefrogco.com/polos/12-polo-2.html" class="product_img_link" title="Gris-Burdeos">
<img src="http://thefrogco.com/12-111-large/polo-2.jpg" alt="Gris-Burdeos" width="174" height="261" />
</a>
</p>
<h3>
Gris-Burdeos
</h3>
<p id="p1">
<!--<span class="new_product">
</span>-->
<span class="new_product">
<span class="price"><!--<strike>30,00 €</strike>--><br />24,00 €</span>
</span>
</p>
</li>
<li class="ajax_block_product item clearfix">
<p>
<a href="http://thefrogco.com/polos/14-polo-4.html" class="product_img_link" title="Blanco-Marino">
<img src="http://thefrogco.com/14-114-large/polo-4.jpg" alt="Blanco-Marino" width="174" height="261" />
</a>
</p>
<h3>
Blanco-Marino
</h3>
<p id="p2">
<!--<span class="new_product">
</span>-->
<span class="new_product">
<span class="price"><!--<strike>30,00 €</strike>--><br />24,00 €</span>
</span>
</p>
</li>
<li class="ajax_block_product last_item alternate_item clearfix">
<p>
<a href="http://thefrogco.com/polos/15-marron-turquesa.html" class="product_img_link" title="Marrón-Turquesa">
<img src="http://thefrogco.com/15-126-large/marron-turquesa.jpg" alt="Marrón-Turquesa" width="174" height="261" />
</a>
</p>
<h3>
Marrón-Turquesa
</h3>
<p id="p3">
<!--<span class="new_product">
</span>-->
<span class="new_product">
<span class="price"><!--<strike>30,00 €</strike>--><br />24,00 €</span>
</span>
</p>
</li>
</ul>
As you can see, I want to store each polo shirt. I use HtmlAgilityPack and I don't know how to pick them out. This is as far as I can get:
List<shopItem> itemsList = new List<shopItem>();
HtmlDocument htmlDocument = new HtmlDocument();
htmlDocument.LoadHtml("http://thefrogco.com/14-polos");
foreach (HtmlNode selectNode in htmlDocument.DocumentNode.Elements("//div/div/li[@class='ajax_block_product last_item alternate_item clearfix']"))
{
foreach(HtmlNde)
{
//I suppose i have to iterate all inside nodes...
}
shopItem detectedItem = new shopItem();
itemsList.Add(selectNode.);
}
THANK YOU SO MUCH!

Something like this:
HtmlDocument doc = new HtmlDocument();
doc.Load(myDocHtm);
// get all LI elements with a CLASS attribute that starts with 'ajax_block_product'
foreach (HtmlNode selectNode in doc.DocumentNode.SelectNodes("//li[starts-with(@class,'ajax_block_product')]"))
{
// from the current node, get recursively the first A element with a CLASS attribute set to 'product_link'
// (the HTML excerpt in the question shows no such link; there the name sits in the H3, so ".//h3" would be the selector to use)
HtmlNode name = selectNode.SelectSingleNode(".//a[@class='product_link']");
// from the current node, get recursively the first IMG element with a non empty SRC attribute
HtmlNode img = selectNode.SelectSingleNode(".//img[@src]");
// from the current node, get recursively the first SPAN element with a CLASS attribute set to 'price'
// and get the child text node from it
HtmlNode price = selectNode.SelectSingleNode(".//span[@class='price']/text()");
shopItem item = new shopItem(
name.InnerText,
img.GetAttributeValue("src", null),
double.Parse(price.InnerText, NumberStyles.Any)
);
itemsList.Add(item);
}
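One thing to watch in the snippet above: the price text in the question's HTML looks like "24,00 €", and double.Parse without an explicit format depends on the machine's current culture. Below is a minimal sketch of a culture-independent parse. PriceParser and TryParsePrice are hypothetical names, and the assumption (taken only from the sample HTML) is that prices use a comma as decimal separator, a dot for thousands, and a trailing euro sign.

```csharp
using System;
using System.Globalization;

public static class PriceParser
{
    // Assumption: "," is the decimal separator and "." the thousands separator,
    // as in the "24,00 €" values shown in the question's HTML.
    private static readonly NumberFormatInfo CommaDecimal = new NumberFormatInfo
    {
        NumberDecimalSeparator = ",",
        NumberGroupSeparator = "."
    };

    // Strips the euro sign and surrounding whitespace, then parses the number.
    // Returns false instead of throwing when the text is not a valid price.
    public static bool TryParsePrice(string raw, out double price)
    {
        price = 0;
        if (string.IsNullOrWhiteSpace(raw))
        {
            return false;
        }

        string cleaned = raw
            .Replace("€", string.Empty)
            .Replace('\u00A0', ' ') // non-breaking space, in case the page uses one
            .Trim();

        return double.TryParse(cleaned, NumberStyles.Number, CommaDecimal, out price);
    }
}
```

In the loop this could replace the double.Parse call, for example by calling PriceParser.TryParsePrice(price.InnerText, out double value) and skipping the item when it returns false.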

Related

HTML Agility Pack cannot find specific node

In my html I have node named music-detail-header.
My HTML:
<music-detail-header image-src="https://m.media-amazon.com/images/I/51du+vdj5WL.jpg" image-dimen="1:1" primary-text="Curated by Amazon’s Music Experts and Updated Fridays" secondary-text="The biggest songs in the world. Cover: Miley Cyrus." tertiary-text="50 SONGS • 2 HOURS AND 43 MINUTES" image-kind="square" style="contain: layout; margin-top: 24px; display: block;" label="playlist" headline="All Hits" class="hydrated">
<div slot="icons"><span class="_7Ge9Z_A8glJAPCM8OQ1xf">
<music-button id="detailHeaderButton1" slot="icons" size="medium" icon-name="shuffle" title="Shuffle" aria-label="Shuffle" role="button" aria-disabled="false" tabindex="0" variant="solid" refinement="none" class="hydrated">Shuffle</music-button>
</span>
<span class="_7Ge9Z_A8glJAPCM8OQ1xf">
<music-button id="detailHeaderButton2" slot="icons" size="small" icon-only="" icon-name="add" title="Add to My Music" aria-label="Add to My Music" role="button" aria-disabled="false" tabindex="0" variant="primary" refinement="none" class="hydrated">
</music-button>
</span>
</div>
<div slot="icons" class="_12yl6gY7_evjyhHEj91p3J">
<span class="_7Ge9Z_A8glJAPCM8OQ1xf">
<music-button id="detailHeaderButton3" slot="icons" size="small" icon-only="" icon-name="shareandroid" title="Share this playlist" aria-label="Share this playlist" role="button" aria-disabled="false" tabindex="0" variant="primary" refinement="none" class="hydrated">
</music-button>
</span>
</div>
</music-detail-header>
But when I try to get it from the page like this:
public class HtmlScrapper
{
public HtmlDocument? HtmlDocument { get; private set; }
/// <param name = "URL"> URL of website to parse </param>
public HtmlScrapper(string URL)
{
var web = new HtmlWeb();
HtmlDocument = web.Load(URL);
}
public Playlist GetPlaylist()
{
var p = new Playlist();
string Avatar = String.Empty;
var node = HtmlDocument.DocumentNode.Descendants("//div/music-detail-header").FirstOrDefault();
Avatar = node.GetAttributeValue("image-src", null);
using (var ms = new MemoryStream(new WebClient().DownloadData(Avatar)))
{
p.Avatar = Image.FromStream(ms);
}
return p;
}
}
My node remains null.
And then, of course, I get an exception here:
Avatar = node.GetAttributeValue("image-src", null);
I've already tried other ways of getting that element from the document, such as
HtmlDocument.DocumentNode.DescendantNodes, HtmlDocument.DocumentNode.SelectNodes, HtmlDocument.DocumentNode.SelectSingleNode and others, but the result is the same in every case.
I would really appreciate your help!
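A possible starting point, offered as a sketch rather than a confirmed fix: in HtmlAgilityPack, Descendants takes an element name, not an XPath expression, so "//div/music-detail-header" is compared literally against node names and matches nothing. The helper below (HeaderImageFinder and FindImageSrc are hypothetical names) looks the element up by name and returns null instead of throwing when it is missing. It works on an HTML string and needs the HtmlAgilityPack package. Note also that class="hydrated" suggests the element may be created by JavaScript in the browser; if so, it will not be present in what HtmlWeb.Load downloads, and no selector will find it.

```csharp
using System.Linq;
using HtmlAgilityPack;

public static class HeaderImageFinder
{
    // Returns the image-src attribute of the first <music-detail-header>,
    // or null when the document contains no such element.
    public static string FindImageSrc(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Descendants expects a plain element name here.
        // The XPath form would be doc.DocumentNode.SelectSingleNode("//music-detail-header").
        HtmlNode node = doc.DocumentNode
            .Descendants("music-detail-header")
            .FirstOrDefault();

        // The null-conditional call avoids the exception when nothing was found.
        return node?.GetAttributeValue("image-src", null);
    }
}
```

Checking the return value for null before downloading the image makes it easy to tell a wrong selector from markup that is simply absent from the downloaded page.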

Get the href link of an element with classname having spaces

I have been trying to get the link of an element using the class name, but I always get an error that no element was found:
List<IWebElement> LinkElements = Selenium.Selenium.driver.FindElementsByClassName("column.wrap-text").ToList();
I somehow managed to get the links I want using the code below, but I know that is not a good approach.
try
{
Selenium.Selenium.driver.Navigate().GoToUrl(txt_url.Text);
List<IWebElement> LinkElements = Selenium.Selenium.driver.FindElementsByTagName("a").ToList();
List<string> ValidLinks = new List<string>();
foreach (IWebElement LinkElement in LinkElements)
{
string LinkString = LinkElement.GetAttribute("href");
if (LinkString != null)
{
if (LinkString.Contains("documents"))
{
list.Items.Add(LinkString);
}
}
}
}
catch (Exception)
{ }
Below is the HTML for the element whose href link ("/view/garnimii#/Testing%20Folder/MyFile.txt") I want to extract, together with the title. I have tried every way I can think of, but I cannot read the element with FindElementsByClassName or with an XPath (which is very vague here). Can anyone please help me with this?
<div class="wrapper fluid-element">
<div class="wrapper fluid-element">
<div class="wrapper fluid-element">
<div class="column wrap-text">
<a title="MyFile.txt" href="https://drive.corp.amazon.com/documents/garnimii#/Testing%20Folder/MyFile.txt">MyFile.txt</a>
</div>
</div>
<div class="column actions resource-actions-view">
<a data-turbolink="true" href="/view/garnimii#/Testing%20Folder/MyFile.txt"><i class="fa fa-external-link"></i> View
</a></div>
<div class="column actions resource-actions-share">
<a data-target="#resource-modal-share" data-toggle="modal"
href="/share/garnimii#/Testing%20Folder/MyFile.txt">
<i class="fa fa-share-alt"></i> Share
</a>
</div>
<div class="column actions resource-actions-rename resource-header-actions">
<a data-resource-basename="MyFile.txt" data-resource-id="8a520062-5dbe-46ba-b4b0-b672f6481c17"
data-root-path="/" data-target="#resource-modal-rename" data-toggle="modal" href="#resource-modal-rename">
<i class="fa fa-pencil"></i> Rename
</a>
</div>
</div>
</div>
Update
foreach (IWebElement LinkElement in LinkElements)
{
// read the title from the element, then take the href from the same element
string LinkString = LinkElement.GetAttribute("title");
if (LinkString != null)
{
if (LinkString.Contains("MyFile.txt"))
{
list.Items.Add(LinkElement.GetAttribute("href"));
}
}
}
You can also try the //a XPath.
List<IWebElement> LinkElements = Selenium.Selenium.driver.FindElementsByXPath("//a").ToList();
List<string> ValidLinks = new List<string>();
foreach (IWebElement LinkElement in LinkElements){
Console.WriteLine(LinkElement.GetAttribute("href"));
}
Print every href first. If the output contains all the links you expect, you can go on to add them to the other list.
Update:
string LinkString = Selenium.Selenium.driver.FindElementByXPath("//a[@title='MyFile.txt']").GetAttribute("href");
FindElementsByClassName can locate elements by a single class name only.
For multiple class names you should use XPath or CSS selector.
So instead of
List<IWebElement> LinkElements = Selenium.Selenium.driver.FindElementsByClassName("column.wrap-text").ToList();
Try using
List<IWebElement> LinkElements = Selenium.Selenium.driver.FindElementsByCssSelector("div.column.wrap-text").ToList();
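To make the rule above concrete, here is a small sketch that turns a space-separated class attribute such as "column wrap-text" into the compound CSS selector used in the last line. CssSelectorBuilder and FromClassAttribute are hypothetical names, not part of Selenium, and the sketch assumes plain class names that need no CSS escaping.

```csharp
using System;
using System.Linq;

public static class CssSelectorBuilder
{
    // Builds "tag.class1.class2" from a tag name and a class attribute value.
    // Assumption: class names contain no characters that require CSS escaping.
    public static string FromClassAttribute(string tag, string classAttribute)
    {
        // Passing a null separator array splits on any whitespace.
        string[] tokens = (classAttribute ?? string.Empty)
            .Split((char[])null, StringSplitOptions.RemoveEmptyEntries);

        // Each class name becomes one ".name" segment of the selector.
        return (tag ?? string.Empty) + string.Concat(tokens.Select(t => "." + t));
    }
}
```

For the element in the question, CssSelectorBuilder.FromClassAttribute("div", "column wrap-text") yields "div.column.wrap-text", which can then be passed to FindElementsByCssSelector.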

get div information with html agility pack

Hi, I want to process information on an HTML page. With the following code I can get the information.
This is the order in which it is received:
new-link-1
new-link-2
new-link-3
But when it reaches the new-link-no-title section, the order breaks and changes to:
new-link-3
new-link-1
new-link-2
At the end, the program stops with an ArgumentOutOfRangeException:
HtmlWeb web = new HtmlWeb();
HtmlDocument doc = await web.LoadFromWebAsync(Link);
foreach ((var item, int index) in doc.DocumentNode.SelectNodes(".//div[@class='new-link-1']").WithIndex())
{
var x = item.SelectNodes("//div[@class='new-link-2']")[index].InnerText;
var xx = item.SelectNodes("//div[@class='new-link-3']//a")[index];
MessageBox.Show(item.InnerText);
MessageBox.Show(x);
MessageBox.Show(xx.Attributes["href"].Value);
}
And the HTML:
<div id="new-link">
<ul>
<li>
<div class="new-link-1"> فصل پنجم</div>
<div class="new-link-2"> تکمیل شده</div>
<div class="new-link-3">
دانلود با لینک مستقیم
</div>
</li>
<li class="new-link-no-titel">
<div class="new-link-1"> فصل ششم</div>
<div class="new-link-2"> درحال پخش</div>
<div class="new-link-3">
<i class="fa fa-arrow-down" title="حال پخش">
</i>
</div>
</li>
<li>
<div class="new-link-1"> قسمت 1</div>
<div class="new-link-2"> پخش شده</div>
<div class="new-link-3">
دانلودلینک مستقیم
</div>
</li>
<li>
<div class="new-link-1"> قسمت 7</div>
<div class="new-link-2"> پخش شده</div>
<div class="new-link-3">
دانلود با لینک مستقیم
</div>
</li>
</ul>
</div>
This is what I found to be the issue with your code.
foreach ((var item, int index) in doc.DocumentNode.SelectNodes(".//div[@class='new-link-1']").WithIndex()) // -> gives 4 values for index
item.SelectNodes("//div[@class='new-link-2']") // -> this produces 4 nodes
item.SelectNodes("//div[@class='new-link-3']//a") // -> this produces only 3 nodes
Issue:
When you search with //div, you search all nodes in the document, not just those under the item you are currently on. Because only 3 links exist, indexing the link list with the fourth index throws the ArgumentOutOfRangeException.
Solution/Suggestion: Your current code searches all a elements starting from the root node. If you prefix the expression with a dot, only the descendants of the current node are considered. (Excerpt from here)
foreach (HtmlNode item in doc.DocumentNode.SelectNodes(".//li"))
{
// the leading dot scopes each query to the current li
var x0 = item.SelectSingleNode(".//div[@class='new-link-1']");
var x = item.SelectSingleNode(".//div[@class='new-link-2']");
var xx = item.SelectSingleNode(".//a");
if (x0 != null)
MessageBox.Show(x0.InnerText);
if (x != null)
MessageBox.Show(x.InnerText);
// an li without a link (such as new-link-no-titel) yields null here
if (xx != null && xx.Attributes["href"] != null)
MessageBox.Show(xx.Attributes["href"].Value);
}
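The difference between "//" and ".//" can be checked without any extra package. The sketch below uses System.Xml.Linq and its XPath extensions instead of HtmlAgilityPack (a swapped-in parser that only accepts well-formed markup), on the assumption that the scoping rule is the same one the answer describes. XPathScopeDemo, Sample and CountPerItem are hypothetical names, and the sample markup is a shortened stand-in for the question's list.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;
using System.Xml.XPath;

public static class XPathScopeDemo
{
    // Three list items, each with exactly one new-link-2 div.
    public const string Sample =
        "<ul>" +
        "<li><div class='new-link-1'>a</div><div class='new-link-2'>x</div></li>" +
        "<li><div class='new-link-1'>b</div><div class='new-link-2'>y</div></li>" +
        "<li><div class='new-link-1'>c</div><div class='new-link-2'>z</div></li>" +
        "</ul>";

    // For every <li>, evaluates innerXPath with that <li> as the context node
    // and returns how many elements it matched.
    public static List<int> CountPerItem(string xml, string innerXPath)
    {
        XDocument doc = XDocument.Parse(xml);
        return doc.XPathSelectElements("//li")
                  .Select(li => li.XPathSelectElements(innerXPath).Count())
                  .ToList();
    }
}
```

With this sample, CountPerItem(XPathScopeDemo.Sample, "//div[@class='new-link-2']") returns 3 for every item because the search restarts at the document root, while ".//div[@class='new-link-2']" returns 1 for every item.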

Parse through Each li tag in browser using 'WatiN'

I am using the WatiN DLL to browse through a web page: click a link in an li tag, go to the next page, fetch some data, go back to the previous page, and click the link in the next li tag.
I can do this for one link in an li tag. I want to get all the li tags under the ul <classname>, click each link, and perform the procedure above. How can I get all the li elements and loop through each page?
The HTML of the page looks like this:
<ul id="ul_classname" class="search-result-set">
<li class="">
<div class="Div_Classname">
<h3 class="standard_font">
<a class="a class_name" href="link to be clicked">text to be displayed</a>
</h3>
<p class="word-wrap"></p>
</div>
</li>
<li class="">
<div class="Div_Classname">
<h3 class="standard_font">
<a class="a class_name" href="link to be clicked">text to be displayed</a>
</h3>
<p class="word-wrap"></p>
</div>
</li>
</ul>
HTH!
private void CrawlSite()
{
int idx = 0;
do
{
idx = this.ClickLink(idx);
}
while (idx != -1);
}
private int ClickLink(int idx)
{
WatiN.Core.Browser browser = GetBrowser();
ListItemCollection listItems = browser.List("ul_classname").ListItems;
if (idx > listItems.Count - 1)
return -1;
Link lnk = listItems[idx].Link(Find.ByClass("a class_name"));
lnk.Click();
//TODO: get your data
browser.Back();
return idx + 1;
}
You can try this code (LINQ to XML):
var xdoc = XDocument.Load(yourFile);
var terms= from term in xdoc.Descendants("ul")
select new
{
Class= term.Attribute("class").Value
};
foreach(var li in terms)
{
Console.Write(li.Class);
}
Try this:
LinkCollection links = ie.Links;
foreach (var link in links)
{
link.Click();
// Do something
ie.Back();
}

How to getelement by class?

I am trying to use webBrowser1 to get hold of a download link via its href, but the problem is that I must find it using its class name.
<body>
<iframe scrolling="no" frameborder="0" allowtransparency="true" tabindex="0" name="twttrHubFrame" style="position: absolute; top: -9999em; width: 10px; height: 10px;" src="http://platform.twitter.com/widgets/hub.html">
<div id="main">
<div id="header">
<div style="float:left;">
<div id="content">
<h1 style="background-image:url('http://static.mp3skull.com/img/bgmen.JPG'); background-repeat:repeat-x;">Rush Mp3 Download</h1>
<a id="bitrate" onclick="document.getElementById('ofrm').submit(); return false;" rel="nofollow" href="">
<form id="ofrm" method="POST" action="">
<div id="song_html" class="show1">
<div class="left">
<div id="right_song">
<div style="font-size:15px;">
<div style="clear:both;"></div>
<div style="float:left;">
<div style="float:left; height:27px; font-size:13px; padding-top:2px;">
<div style="float:left; width:27px; text-align:center;">
<div style="margin-left:8px; float:left;">
<a style="color:green;" target="_blank" rel="nofollow" href="http://dc182.4shared.com/img/1011303409/865387c9/dlink__2Fdownload_2F6QmedN8H_3Ftsid_3D20111211-54337-a79f8d10/preview.mp3">Download</a>
</div>
<div style="margin-left:8px; float:left;">
<div style="margin-left:8px; float:left;">
<div style="clear:both;"></div>
</div>
<div id="player155580779" class="player" style="float:left; margin-left:10px;"></div>
</div>
<div style="clear:both;"></div>
</div>
<div style="clear:both;"></div>
</div>
I searched all over Google, but I only found PHP examples.
I understand you would do something along the lines of this:
HtmlElement downloadlink = webBrowser1.Document.GetElementById("song_html").All[0];
URL = downloadlink.GetAttribute("href");
but I do not understand how to do it by the class "show1".
Please point me in the right direction with examples and/or a website I can visit to learn how to do this; I have searched and have no clue.
EDIT: I pretty much need the href link ("http://dc182.4shared.com/img/1011303409/865387c9/dlink__2Fdownload_2F6QmedN8H_3Ftsid_3D20111211-54337-a79f8d10/preview.mp3"), so how would I obtain it?
There is nothing built-in in the WebBrowser control to retrieve an element by class name. Since you know it is going to be an a element the best you can do is get all a elements and search for the one you want:
var links = webBrowser1.Document.GetElementsByTagName("a");
foreach (HtmlElement link in links)
{
if (link.GetAttribute("className") == "show1")
{
//do something
}
}
Extension method for HtmlDocument
Returns a list of elements with a particular tag whose class matches the given className.
It can also be used to select elements by tag only or by class name only.
internal static class Utils
{
internal static List<HtmlElement> getElementsByTagAndClassName(this HtmlDocument doc, string tag = "", string className = "")
{
List<HtmlElement> lst = new List<HtmlElement>();
bool empty_tag = String.IsNullOrEmpty(tag);
bool empty_cn = String.IsNullOrEmpty(className);
if (empty_tag && empty_cn) return lst;
HtmlElementCollection elmts = empty_tag ? doc.All : doc.GetElementsByTagName(tag);
if (empty_cn)
{
lst.AddRange(elmts.Cast<HtmlElement>());
return lst;
}
for (int i = 0; i < elmts.Count; i++)
{
if (elmts[i].GetAttribute("className") == className)
{
lst.Add(elmts[i]);
}
}
return lst;
}
}
Usage:
WebBrowser wb = new WebBrowser();
List<HtmlElement> lst_div = wb.Document.getElementsByTagAndClassName("div");// all div elements
List<HtmlElement> lst_err_elmnts = wb.Document.getElementsByTagAndClassName(String.Empty, "error"); // all elements with "error" class
List<HtmlElement> lst_div_err = wb.Document.getElementsByTagAndClassName("div", "error"); // all div's with "error" class
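One caveat with the exact == comparison on "className" used in these answers: it only matches when the element carries that single class and nothing else, so an element with class="show1 active" would be missed. A minimal sketch of a token-based check follows. ClassMatcher and HasClass are hypothetical names; the string passed in is assumed to be whatever GetAttribute("className") returned.

```csharp
using System;
using System.Linq;

public static class ClassMatcher
{
    // Returns true when className is one of the whitespace-separated
    // tokens in classAttribute (case-sensitive, like CSS class names in HTML).
    public static bool HasClass(string classAttribute, string className)
    {
        if (string.IsNullOrWhiteSpace(classAttribute) || string.IsNullOrWhiteSpace(className))
        {
            return false;
        }

        // Passing a null separator array splits on any whitespace.
        return classAttribute
            .Split((char[])null, StringSplitOptions.RemoveEmptyEntries)
            .Contains(className, StringComparer.Ordinal);
    }
}
```

Inside the loops above, the condition could then read ClassMatcher.HasClass(elmts[i].GetAttribute("className"), className) instead of the equality test.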
I followed these answers and wrote my own method to hide a div by class name.
I am sharing it for anyone who needs it.
public void HideDivByClassName(WebBrowser browser, string classname)
{
if (browser.Document != null)
{
var byTagName = browser.Document.GetElementsByTagName("div");
foreach (HtmlElement element in byTagName)
{
if (element.GetAttribute("className") == classname)
{
element.Style = "display:none";
}
}
}
}
