Select items from parsed html

I want to select some items from an HTML web page and put them into a list. Each item will be an instance of this class:
public class shopItem
{
private String itemName;
private String itemImageLink;
private Double itemPrice;
// constructor: no return type, so that new shopItem(name, link, price) compiles
public shopItem(String itemName, String itemImageLink, Double itemPrice)
{
this.itemName = itemName;
this.itemImageLink = itemImageLink;
this.itemPrice = itemPrice;
}
public String getItemName()
{
return this.itemName;
}
public String getItemImageLink()
{
return this.itemImageLink;
}
public Double getItemPrice()
{
return this.itemPrice;
}
}
The HTML looks like this:
<div class="list_categorie_product">
<!-- Products list -->
<ul id="product_list_grid" class="categorie_product clear">
</li>
<li class="ajax_block_product alternate_item clearfix">
<p>
<a href="http://thefrogco.com/polos/12-polo-2.html" class="product_img_link" title="Gris-Burdeos">
<img src="http://thefrogco.com/12-111-large/polo-2.jpg" alt="Gris-Burdeos" width="174" height="261" />
</a>
</p>
<h3>
Gris-Burdeos
</h3>
<p id="p1">
<!--<span class="new_product">
</span>-->
<span class="new_product">
<span class="price"><!--<strike>30,00 €</strike>--><br />24,00 €</span>
</span>
</p>
</li>
<li class="ajax_block_product item clearfix">
<p>
<a href="http://thefrogco.com/polos/14-polo-4.html" class="product_img_link" title="Blanco-Marino">
<img src="http://thefrogco.com/14-114-large/polo-4.jpg" alt="Blanco-Marino" width="174" height="261" />
</a>
</p>
<h3>
Blanco-Marino
</h3>
<p id="p2">
<!--<span class="new_product">
</span>-->
<span class="new_product">
<span class="price"><!--<strike>30,00 €</strike>--><br />24,00 €</span>
</span>
</p>
</li>
<li class="ajax_block_product last_item alternate_item clearfix">
<p>
<a href="http://thefrogco.com/polos/15-marron-turquesa.html" class="product_img_link" title="Marrón-Turquesa">
<img src="http://thefrogco.com/15-126-large/marron-turquesa.jpg" alt="Marrón-Turquesa" width="174" height="261" />
</a>
</p>
<h3>
Marrón-Turquesa
</h3>
<p id="p3">
<!--<span class="new_product">
</span>-->
<span class="new_product">
<span class="price"><!--<strike>30,00 €</strike>--><br />24,00 €</span>
</span>
</p>
</li>
</ul>
As you can see, I want to store each polo shirt. I am using HtmlAgilityPack and I don't know how to pick them out. This is as far as I can get:
List<shopItem> itemsList = new List<shopItem>();
HtmlDocument htmlDocument = new HtmlDocument();
htmlDocument.LoadHtml("http://thefrogco.com/14-polos");
foreach (HtmlNode selectNode in htmlDocument.DocumentNode.Elements("//div/div/li[@class='ajax_block_product last_item alternate_item clearfix']"))
{
foreach(HtmlNde)
{
//I suppose i have to iterate all inside nodes...
}
shopItem detectedItem = new shopItem();
itemsList.Add(selectNode.);
}
Thank you so much!

Something like this:
HtmlDocument doc = new HtmlDocument();
doc.Load(myDocHtm);
// get all LI elements with a CLASS attribute that starts with 'ajax_block_product'
foreach (HtmlNode selectNode in doc.DocumentNode.SelectNodes("//li[starts-with(@class,'ajax_block_product')]"))
{
// from the current node, get recursively the first A element with a CLASS attribute set to 'product_img_link'
// (the class used in the markup shown in the question)
HtmlNode name = selectNode.SelectSingleNode(".//a[@class='product_img_link']");
// from the current node, get recursively the first IMG element with a non empty SRC attribute
HtmlNode img = selectNode.SelectSingleNode(".//img[@src]");
// from the current node, get recursively the first SPAN element with a CLASS attribute set to 'price'
// and get the child text node from it
HtmlNode price = selectNode.SelectSingleNode(".//span[@class='price']/text()");
shopItem item = new shopItem(
// the A element only wraps the image, so the product name is read from its TITLE attribute
name.GetAttributeValue("title", null),
img.GetAttributeValue("src", null),
// NumberStyles lives in System.Globalization; parsing uses the current culture
double.Parse(price.InnerText, NumberStyles.Any)
);
itemsList.Add(item);
}
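One caveat about the price: double.Parse with only a NumberStyles argument uses the current culture of the machine, so "24,00 €" is read as 24 only where the comma is the decimal separator. Below is a minimal sketch of a parse that does not depend on the machine's culture. It assumes the shop always formats prices with a comma for decimals, a dot for thousands, and a trailing € sign. The PriceParser class and ParseEuroPrice method are hypothetical names used for illustration, not part of the answer above.

```csharp
using System;
using System.Globalization;

public static class PriceParser
{
    // Explicit format: comma as decimal separator, dot as group separator, euro sign.
    // These values are assumptions about how the shop prints its prices.
    private static readonly NumberFormatInfo EuroFormat = new NumberFormatInfo
    {
        CurrencySymbol = "€",
        CurrencyDecimalSeparator = ",",
        CurrencyGroupSeparator = ".",
        NumberDecimalSeparator = ",",
        NumberGroupSeparator = "."
    };

    // Parses text such as "24,00 €" regardless of the machine's current culture.
    public static double ParseEuroPrice(string text)
    {
        return double.Parse(text.Trim(), NumberStyles.Currency, EuroFormat);
    }

    public static void Main()
    {
        double price = ParseEuroPrice("24,00 €");
        Console.WriteLine(price.ToString(CultureInfo.InvariantCulture));
    }
}
```

In the loop above, ParseEuroPrice(price.InnerText) could then stand in for the double.Parse call. If the page separates the amount and the symbol with a non-breaking space entity, the text may need decoding first.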

Related

HTML Agility Pack cannot find specific node

My HTML contains a node named music-detail-header.
My HTML:
<music-detail-header image-src="https://m.media-amazon.com/images/I/51du+vdj5WL.jpg" image-dimen="1:1" primary-text="Curated by Amazon’s Music Experts and Updated Fridays" secondary-text="The biggest songs in the world. Cover: Miley Cyrus." tertiary-text="50 SONGS • 2 HOURS AND 43 MINUTES" image-kind="square" style="contain: layout; margin-top: 24px; display: block;" label="playlist" headline="All Hits" class="hydrated">
<div slot="icons"><span class="_7Ge9Z_A8glJAPCM8OQ1xf">
<music-button id="detailHeaderButton1" slot="icons" size="medium" icon-name="shuffle" title="Shuffle" aria-label="Shuffle" role="button" aria-disabled="false" tabindex="0" variant="solid" refinement="none" class="hydrated">Shuffle</music-button>
</span>
<span class="_7Ge9Z_A8glJAPCM8OQ1xf">
<music-button id="detailHeaderButton2" slot="icons" size="small" icon-only="" icon-name="add" title="Add to My Music" aria-label="Add to My Music" role="button" aria-disabled="false" tabindex="0" variant="primary" refinement="none" class="hydrated">
</music-button>
</span>
</div>
<div slot="icons" class="_12yl6gY7_evjyhHEj91p3J">
<span class="_7Ge9Z_A8glJAPCM8OQ1xf">
<music-button id="detailHeaderButton3" slot="icons" size="small" icon-only="" icon-name="shareandroid" title="Share this playlist" aria-label="Share this playlist" role="button" aria-disabled="false" tabindex="0" variant="primary" refinement="none" class="hydrated">
</music-button>
</span>
</div>
</music-detail-header>
But when I try to get it from the page like this:
public class HtmlScrapper
{
public HtmlDocument? HtmlDocument { get; private set; }
/// <param name = "URL"> URL of website to parse </param>
public HtmlScrapper(string URL)
{
var web = new HtmlWeb();
HtmlDocument = web.Load(URL);
}
public Playlist GetPlaylist()
{
var p = new Playlist();
string Avatar = String.Empty;
var node = HtmlDocument.DocumentNode.Descendants("//div/music-detail-header").FirstOrDefault();
Avatar = node.GetAttributeValue("image-src", null);
using (var ms = new MemoryStream(new WebClient().DownloadData(Avatar)))
{
p.Avatar = Image.FromStream(ms);
}
return p;
}
}
my node is always null.
And then, of course, I get an exception here:
Avatar = node.GetAttributeValue("image-src", null);
I have already tried other ways of getting that element from the document, such as
HtmlDocument.DocumentNode.DescendantNodes, HtmlDocument.DocumentNode.SelectNodes, HtmlDocument.DocumentNode.SelectSingleNode and others, but the result is the same in every case.
I would really appreciate your help!

Get the href link of an element with classname having spaces

I have been trying to get the link of an element by its class name, but I always get an error saying that no element was found:
List<IWebElement> LinkElements = Selenium.Selenium.driver.FindElementsByClassName("column.wrap-text").ToList();
I managed to get the links I want with the code below, but I know it is not a good approach.
try
{
Selenium.Selenium.driver.Navigate().GoToUrl(txt_url.Text);
List<IWebElement> LinkElements = Selenium.Selenium.driver.FindElementsByTagName("a").ToList();
List<string> ValidLinks = new List<string>();
foreach (IWebElement LinkElement in LinkElements)
{
string LinkString = LinkElement.GetAttribute("href");
if (LinkString != null)
{
if (LinkString.Contains("documents"))
{
list.Items.Add(LinkString);
}
}
}
}
catch (Exception)
{ }
Below is the HTML for the element whose href link ("/view/garnimii#/Testing%20Folder/MyFile.txt") I want to extract, identified by the title name in it. I have tried every way I could think of, but I cannot read the element with FindElementsByClassName or by XPath (which is very vague here). Can anyone please help me with this?
<div class="wrapper fluid-element">
<div class="wrapper fluid-element">
<div class="wrapper fluid-element">
<div class="column wrap-text">
<a title="MyFile.txt" href="https://drive.corp.amazon.com/documents/garnimii#/Testing%20Folder/MyFile.txt">MyFile.txt</a>
</div>
</div>
<div class="column actions resource-actions-view">
<a data-turbolink="true" href="/view/garnimii#/Testing%20Folder/MyFile.txt"><i class="fa fa-external-link"></i> View
</a></div>
<div class="column actions resource-actions-share">
<a data-target="#resource-modal-share" data-toggle="modal"
href="/share/garnimii#/Testing%20Folder/MyFile.txt">
<i class="fa fa-share-alt"></i> Share
</a>
</div>
<div class="column actions resource-actions-rename resource-header-actions">
<a data-resource-basename="MyFile.txt" data-resource-id="8a520062-5dbe-46ba-b4b0-b672f6481c17"
data-root-path="/" data-target="#resource-modal-rename" data-toggle="modal" href="#resource-modal-rename">
<i class="fa fa-pencil"></i> Rename
</a>
</div>
</div>
</div>
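Update
foreach (IWebElement LinkElement in LinkElements)
{
string LinkString = LinkElement.GetAttribute("title");
if (LinkString != null)
{
// Contains is case-sensitive, so match the title exactly as it appears in the HTML
if (LinkString.Contains("MyFile.txt"))
{
// read href from the element; the title string itself has no GetAttribute method
list.Items.Add(LinkElement.GetAttribute("href"));
}
}
}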

You can also try the //a XPath:
List<IWebElement> LinkElements = Selenium.Selenium.driver.FindElementsByXPath("//a").ToList();
List<string> ValidLinks = new List<string>();
foreach (IWebElement LinkElement in LinkElements)
{
Console.WriteLine(LinkElement.GetAttribute("href"));
}
Print every href first. If the output contains all the links you expect, you can go on to add them to the other list.
Update:
string LinkString = Selenium.Selenium.driver.FindElementByXPath("//a[@title='MyFile.txt']").GetAttribute("href");

FindElementsByClassName can locate elements by a single class name only.
To match several class names at once, use an XPath or CSS selector.
So instead of
List<IWebElement> LinkElements = Selenium.Selenium.driver.FindElementsByClassName("column.wrap-text").ToList();
Try using
List<IWebElement> LinkElements = Selenium.Selenium.driver.FindElementsByCssSelector("div.column.wrap-text").ToList();

get div information with html agility pack

Hi, I want to process information from an HTML page. With the code below I can get the information.
This is the order in which it is received:
new-link-1
new-link-2
new-link-3
But when it reaches the new-link-no-title section, the order breaks and changes to:
new-link-3
new-link-1
new-link-2
At the end, the program stops with an ArgumentOutOfRangeException:
HtmlWeb web = new HtmlWeb();
HtmlDocument doc = await web.LoadFromWebAsync(Link);
foreach ((var item, int index) in doc.DocumentNode.SelectNodes(".//div[@class='new-link-1']").WithIndex())
{
var x = item.SelectNodes("//div[@class='new-link-2']")[index].InnerText;
var xx = item.SelectNodes("//div[@class='new-link-3']//a")[index];
MessageBox.Show(item.InnerText);
MessageBox.Show(x);
MessageBox.Show(xx.Attributes["href"].Value);
}
And the HTML:
<div id="new-link">
<ul>
<li>
<div class="new-link-1"> فصل پنجم</div>
<div class="new-link-2"> تکمیل شده</div>
<div class="new-link-3">
دانلود با لینک مستقیم
</div>
</li>
<li class="new-link-no-titel">
<div class="new-link-1"> فصل ششم</div>
<div class="new-link-2"> درحال پخش</div>
<div class="new-link-3">
<i class="fa fa-arrow-down" title="حال پخش">
</i>
</div>
</li>
<li>
<div class="new-link-1"> قسمت 1</div>
<div class="new-link-2"> پخش شده</div>
<div class="new-link-3">
دانلودلینک مستقیم
</div>
</li>
<li>
<div class="new-link-1"> قسمت 7</div>
<div class="new-link-2"> پخش شده</div>
<div class="new-link-3">
دانلود با لینک مستقیم
</div>
</li>
</ul>
</div>

This is what I found to be the issue with your code.
foreach ((var item, int index) in doc.DocumentNode.SelectNodes(".//div[@class='new-link-1']").WithIndex()) // -> yields 4 values for index
item.SelectNodes("//div[@class='new-link-2']") // -> This produces 4 nodes
item.SelectNodes("//div[@class='new-link-3']//a") // -> This produces only 3 nodes
Issue:
An XPath that starts with //div searches all nodes in the document, not just those under the item you are currently on. Since only 3 a elements exist, indexing that result with the fourth index throws the ArgumentOutOfRangeException.
Solution/Suggestion: Your current code searches all a elements starting from the root node. If you prefix the expression with a dot, only the descendants of the current node are considered.
foreach (HtmlNode item in doc.DocumentNode.SelectNodes(".//li"))
{
// every lookup is relative to the current li; SelectSingleNode returns null when nothing matches
var x0 = item.SelectSingleNode(".//div[@class='new-link-1']");
var x = item.SelectSingleNode(".//div[@class='new-link-2']");
var xx = item.SelectSingleNode(".//a");
if (x0 != null)
MessageBox.Show(x0.InnerText);
if (x != null)
MessageBox.Show(x.InnerText);
// the new-link-no-titel item has no a element, so xx can be null
if (xx != null && xx.Attributes["href"] != null)
MessageBox.Show(xx.Attributes["href"].Value);
}
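The difference between //a and .//a can be reproduced without HtmlAgilityPack. The sketch below is an illustration only: it uses the XPath support in System.Xml.Linq rather than the library from the question, and the markup, the XPathScopeDemo class and the CountFromItem method are made up for the example. The second list item has no a element, like the new-link-no-titel item above.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;
using System.Xml.XPath;

public static class XPathScopeDemo
{
    // Three list items; the middle one has no link (hypothetical sample markup).
    private const string Markup =
        "<ul>" +
        "<li><div class='new-link-1'>one</div><div class='new-link-3'><a href='/a'>get</a></div></li>" +
        "<li><div class='new-link-1'>two</div><div class='new-link-3'><i>no link</i></div></li>" +
        "<li><div class='new-link-1'>three</div><div class='new-link-3'><a href='/c'>get</a></div></li>" +
        "</ul>";

    // Evaluates the XPath with the li at itemIndex as the context node
    // and returns how many elements it selects.
    public static int CountFromItem(int itemIndex, string xpath)
    {
        XDocument doc = XDocument.Parse(Markup);
        XElement item = doc.Root.Elements("li").ElementAt(itemIndex);
        return item.XPathSelectElements(xpath).Count();
    }

    public static void Main()
    {
        // "//a" is absolute: it searches the whole document even though
        // it is called on the second li.
        Console.WriteLine(CountFromItem(1, "//a"));
        // ".//a" is relative: it only searches below the second li.
        Console.WriteLine(CountFromItem(1, ".//a"));
    }
}
```

Called on the item without a link, the absolute expression still finds the two links of its siblings, while the relative one finds none. That is why indexing the absolute result by item position goes out of range.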

Parse through Each li tag in browser using 'WatiN'

I am using the WatiN DLL to browse a web page: click the link in an li tag, go to the next page, fetch some data, go back to the previous page, and click the link in the next li tag.
I can do this for a single link. I want to get all the li tags under the ul with a given class name, click each link, and perform the procedure above. How can I get all the li elements and loop through each page?
The HTML of the page looks like this:
<ul id="ul_classname" class="search-result-set">
<li class="">
<div class="Div_Classname">
<h3 class="standard_font">
<a class="a class_name" href="link to be clicked">text to be displayed</a>
</h3>
<p class="word-wrap"></p>
</div>
</li>
<li class="">
<div class="Div_Classname">
<h3 class="standard_font">
<a class="a class_name" href="link to be clicked">text to be displayed</a>
</h3>
<p class="word-wrap"></p>
</div>
</li>
</ul>

HTH!
private void CrawlSite()
{
int idx = 0;
do
{
idx = this.ClickLink(idx);
}
while (idx != -1);
}
private int ClickLink(int idx)
{
WatiN.Core.Browser browser = GetBrowser();
ListItemCollection listItems = browser.List("ul_classname").ListItems;
if (idx > listItems.Count - 1)
return -1;
Link lnk = listItems[idx].Link(Find.ByClass("a class_name"));
lnk.Click();
//TODO: get your data
browser.Back();
return idx + 1;
}

You can try this code (LINQ to XML):
var xdoc = XDocument.Load(yourFile);
var terms= from term in xdoc.Descendants("ul")
select new
{
Class= term.Attribute("class").Value
};
foreach(var li in terms)
{
Console.Write(li.Class);
}

Try this:
LinkCollection links = ie.Links;
foreach (var link in links)
{
link.Click();
// Do something
ie.Back();
}

How to getelement by class?

I am trying to use webBrowser1 to get hold of a download link via its href, but the problem is that I must find it using its class name.
<body>
<iframe scrolling="no" frameborder="0" allowtransparency="true" tabindex="0" name="twttrHubFrame" style="position: absolute; top: -9999em; width: 10px; height: 10px;" src="http://platform.twitter.com/widgets/hub.html">
<div id="main">
<div id="header">
<div style="float:left;">
<div id="content">
<h1 style="background-image:url('http://static.mp3skull.com/img/bgmen.JPG'); background-repeat:repeat-x;">Rush Mp3 Download</h1>
<a id="bitrate" onclick="document.getElementById('ofrm').submit(); return false;" rel="nofollow" href="">
<form id="ofrm" method="POST" action="">
<div id="song_html" class="show1">
<div class="left">
<div id="right_song">
<div style="font-size:15px;">
<div style="clear:both;"></div>
<div style="float:left;">
<div style="float:left; height:27px; font-size:13px; padding-top:2px;">
<div style="float:left; width:27px; text-align:center;">
<div style="margin-left:8px; float:left;">
<a style="color:green;" target="_blank" rel="nofollow" href="http://dc182.4shared.com/img/1011303409/865387c9/dlink__2Fdownload_2F6QmedN8H_3Ftsid_3D20111211-54337-a79f8d10/preview.mp3">Download</a>
</div>
<div style="margin-left:8px; float:left;">
<div style="margin-left:8px; float:left;">
<div style="clear:both;"></div>
</div>
<div id="player155580779" class="player" style="float:left; margin-left:10px;"></div>
</div>
<div style="clear:both;"></div>
</div>
<div style="clear:both;"></div>
</div>
I searched all over Google, but I only found PHP examples.
I understand you would do something along the lines of this:
HtmlElement downloadlink = webBrowser1.Document.GetElementById("song_html").All[0];
URL = downloadlink.GetAttribute("href");
but I do not understand how to do it by the class "show1".
Please point me in the right direction with examples and/or a website where I can learn how to do this; I have searched and have no clue.
EDIT: I basically need the href link ("http://dc182.4shared.com/img/1011303409/865387c9/dlink__2Fdownload_2F6QmedN8H_3Ftsid_3D20111211-54337-a79f8d10/preview.mp3"), so how would I obtain it?

There is nothing built-in in the WebBrowser control to retrieve an element by class name. Since you know it is going to be an a element the best you can do is get all a elements and search for the one you want:
var links = webBrowser1.Document.GetElementsByTagName("a");
foreach (HtmlElement link in links)
{
if (link.GetAttribute("className") == "show1")
{
//do something
}
}
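Note that comparing the className value with == only matches elements whose class attribute is exactly "show1". An element declared with several classes, for example class="show1 active", would be missed. A minimal sketch of a token-based check follows. The ClassNameMatcher class and HasClass method are hypothetical helpers, not part of the WebBrowser API.

```csharp
using System;
using System.Linq;

public static class ClassNameMatcher
{
    // Returns true when className is one of the whitespace-separated
    // tokens in the value of a class attribute.
    public static bool HasClass(string classAttribute, string className)
    {
        if (string.IsNullOrWhiteSpace(classAttribute))
            return false;

        // A null separator array makes Split use whitespace as the delimiter.
        return classAttribute
            .Split((char[])null, StringSplitOptions.RemoveEmptyEntries)
            .Contains(className, StringComparer.Ordinal);
    }

    public static void Main()
    {
        Console.WriteLine(HasClass("show1 active", "show1"));
        Console.WriteLine(HasClass("show10", "show1"));
    }
}
```

Inside the loop, the condition could then be written as HasClass(link.GetAttribute("className"), "show1").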

Extension method for HtmlDocument
Returns a list of elements that have the given tag and whose class matches the given className.
It can also be used to select elements by tag only, or by class name only.
internal static class Utils
{
internal static List<HtmlElement> getElementsByTagAndClassName(this HtmlDocument doc, string tag = "", string className = "")
{
List<HtmlElement> lst = new List<HtmlElement>();
bool empty_tag = String.IsNullOrEmpty(tag);
bool empty_cn = String.IsNullOrEmpty(className);
if (empty_tag && empty_cn) return lst;
HtmlElementCollection elmts = empty_tag ? doc.All : doc.GetElementsByTagName(tag);
if (empty_cn)
{
lst.AddRange(elmts.Cast<HtmlElement>());
return lst;
}
for (int i = 0; i < elmts.Count; i++)
{
if (elmts[i].GetAttribute("className") == className)
{
lst.Add(elmts[i]);
}
}
return lst;
}
}
Usage:
WebBrowser wb = new WebBrowser();
List<HtmlElement> lst_div = wb.Document.getElementsByTagAndClassName("div");// all div elements
List<HtmlElement> lst_err_elmnts = wb.Document.getElementsByTagAndClassName(String.Empty, "error"); // all elements with "error" class
List<HtmlElement> lst_div_err = wb.Document.getElementsByTagAndClassName("div", "error"); // all div's with "error" class

Following these answers, I wrote a method that hides div elements by class name.
I am sharing it for anyone who needs it.
public void HideDivByClassName(WebBrowser browser, string classname)
{
if (browser.Document != null)
{
var byTagName = browser.Document.GetElementsByTagName("div");
foreach (HtmlElement element in byTagName)
{
if (element.GetAttribute("className") == classname)
{
element.Style = "display:none";
}
}
}
}
