I want to process each node in the collection, but when I call SelectSingleNode inside the loop I keep getting the same element back; only node.Id changes.
I am trying to read the web response of a given site and extract some information (values, links, and so on) from specific elements.
int offSet = 0;
string address = "http://www.testsite.de/ergebnisliste.html?offset=" + offSet;
HtmlWeb web = new HtmlWeb();
//web.OverrideEncoding = Encoding.UTF8;
HtmlDocument doc = web.Load(address);
HtmlNodeCollection collection = doc.DocumentNode.SelectNodes("//div[@itemtype='http://schema.org/Posting']");
foreach (HtmlNode node in collection) {
    string id = HttpUtility.HtmlDecode(node.Id);
    string cpname = HttpUtility.HtmlDecode(node.SelectSingleNode("//span[@itemprop='name']").InnerText);
    string cptitle = HttpUtility.HtmlDecode(node.SelectSingleNode("//span[@itemprop='title']").InnerText);
    string cpaddress = HttpUtility.HtmlDecode(node.SelectSingleNode("//span[@itemprop='addressLocality']").InnerText);
    string date = HttpUtility.HtmlDecode(node.SelectSingleNode("//div[@itemprop='datePosted']").InnerText);
    string link = "http://www.testsite.de" + HttpUtility.HtmlDecode(node.SelectSingleNode("//div[@class='h3 title']//a[@href]").GetAttributeValue("href", "default"));
}
For example, this is the HTML for one iteration:
<div id="66666" itemtype="http://schema.org/Posting">
<div>
<a>
<img />
</a>
</div>
<div>
<div class="h3 title">
<a href="/test.html" title="Test">
<span itemprop="title">Test</span>
</a>
</div>
<div>
<span itemprop="name">TestName</span>
</div>
</div>
<div>
<div>
<div>
<div>
<span itemprop="address">Test</span>
</div>
<span>
<a>
<span><!-- --></span>
<span></span>
</a>
</span>
</div>
</div>
<div itemprop="date">
<time datetime="2013-03-01">01.03.13</time>
</div>
</div>
Writing
node.SelectSingleNode("//span[@itemprop='name']").InnerText
is equivalent to writing
doc.DocumentNode.SelectSingleNode("//span[@itemprop='name']").InnerText
because an XPath that starts with // always searches from the document root, no matter which node you call it on.
To do what you want, write it like this: node.SelectSingleNode(".//span[@itemprop='name']").InnerText.
The leading dot (period) tells the query to search within the current node, which is node, instead of the whole doc.
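The difference between the two forms can be checked with a small sketch. It assumes the HtmlAgilityPack NuGet package is referenced; the class name RelativeXPathDemo, the method name, and the relative flag are hypothetical and exist only for illustration.

```csharp
using System.Collections.Generic;
using HtmlAgilityPack; // NuGet package "HtmlAgilityPack"

// Hypothetical helper, not part of HtmlAgilityPack.
public static class RelativeXPathDemo
{
    // Collects the itemprop='name' text found for each posting div.
    // With relative == false the XPath starts at the document root, so every
    // posting yields the first matching span of the whole document.
    public static List<string> NamesPerPosting(string html, bool relative)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        string xpath = relative
            ? ".//span[@itemprop='name']"  // searches below the current node
            : "//span[@itemprop='name']";  // searches the whole document

        var names = new List<string>();
        HtmlNodeCollection postings =
            doc.DocumentNode.SelectNodes("//div[@itemtype='http://schema.org/Posting']");
        if (postings == null)
        {
            return names; // SelectNodes returns null when nothing matches
        }

        foreach (HtmlNode posting in postings)
        {
            HtmlNode span = posting.SelectSingleNode(xpath);
            if (span != null)
            {
                names.Add(span.InnerText.Trim());
            }
        }
        return names;
    }
}
```

With two postings whose names differ, the relative form should return one name per posting, while the absolute form should repeat the first name for every posting, which matches the behaviour described in the question.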
Related
I want to scrape data with Html Agility Pack.
I used this:
string url = @"https://mobile.bet365.gr/#type=Coupon;key=1-1-13-40-141-0-0-0-1-0-0-4100-0-0-1-0-0-0-0-0-0;ip=0;lng=5;anim=1";
var webGet = new HtmlWeb();
var document = webGet.Load(url);
var nodes = document.DocumentNode.SelectNodes("//*[@id='Coupon']/div[1]/div[2]/div[1]/div/div[1]/div[1]/span");
int i = 0;
foreach (var node in nodes)
{
dataGridView1.Rows.Add();
dataGridView1.Rows[i].Cells[0].Value = i + 1;
dataGridView1.Rows[i].Cells[1].Value = node.InnerHtml;
i++;
}
The XPath is taken from FireXPath but nothing appears.
The HTML snippet is this:
<div id="Coupon" class="C4 C4_1">
<div class="liveAlertKey enhancedPod cc_12_7" data-sportskey="1-1-13-40-141-0-0-0-1-0-0-4100-0-0-1-0-0-0-0-0-0" data-alertkey="NPower Champs">
<h1><em>Αγγλία - Τσάμπιονσιπ</em></h1>
<div class="podHeaderRow">
<div class="wideLeftColumn">Παρ 29 Σεπ</div>
<div class="priceColumn"><em>1</em></div>
<div class="priceColumn"><em>X</em></div>
<div class="priceColumn"><em>2</em></div>
</div>
<div data-fixtureid="67185688" data-plbtid="40" class="podEventRow cc_12_4 ippg-Market " data-nav="rw_spl_sc_1-1-8-67185688-3-0-0-0-1-0-0-0-0-0-1-0-0-0-0-0-0,MarketCount,1-1-8-67185688-3-0-0-0-1-0-0-0-0-0-1-0-0-0-0-0-0,False,1">
<div class="wideLeftColumn hasStatsIcon">
<div class="ippg-Market_GameDetail">
<div class="ippg-Market_GameItem ">
<div class="ippg-Market_CompetitorName">
<span class="ippg-Market_Truncator">ΚΠΡ</span>
</div>
<div class="ippg-Market_CompetitorScores">
<span class="ippg-PointNode"></span>
</div>
</div>
<div class="ippg-Market_GameItem ">
<div class="ippg-Market_CompetitorName">
<span class="ippg-Market_Truncator">Φούλαμ</span>
</div>
<div class="ippg-Market_CompetitorScores">
<span class="ippg-PointNode"></span>
</div>
</div>
<div class="ippg-Market_MetaContainer ">
<div class="ippg-Market_GameStartTime">20:45</div>
<div class="ippg-Market_GameInfo "></div>
<div class="ippg-Market_MarketCount">109</div>
<div id="FixtureIconsContainer">
<img src="/grfx/V6/Misc/pixel.gif" class="VideoIcon SSP-7">
</div>
<div id="StatsIconContainer">
<a class="icon-stats" target="_blank" data-nav="externalLink" href="http://www.stats.betradar.com/s4/?clientid=259&matchid=11868244&language=el"></a>
</div>
</div>
</div>
</div>
<div class="ippg-Market_Topic priceColumn" data-nav="pt=N#o=9/4#f=67185688#fp=1410316836#so=0#c=1#" data-inplaytopic="" data-pgfpid="1410316836" data-inplaymarkettopic="" data-inplayaltmarkettopic="">
<span class="ippg-Market_Odds">3.25</span>
</div>
<div class="ippg-Market_Topic priceColumn" data-nav="pt=N#o=13/5#f=67185688#fp=1410316839#so=0#c=1#" data-inplaytopic="" data-pgfpid="1410316839" data-inplaymarkettopic="" data-inplayaltmarkettopic="">
<span class="ippg-Market_Odds">3.60</span>
</div>
<div class="ippg-Market_Topic priceColumn" data-nav="pt=N#o=5/4#f=67185688#fp=1410316841#so=0#c=1#" data-inplaytopic="" data-pgfpid="1410316841" data-inplaymarkettopic="" data-inplayaltmarkettopic="">
<span class="ippg-Market_Odds">2.25</span>
</div>
</div>
</div>
</div>
Could anyone help me find the correct XPath? I have used this technique on other sites and got the results I wanted, but on this site I am having trouble finding the correct XPath.
You can get your teams and odds from the HTML snippet like this:
HtmlDocument document = new HtmlDocument();
document.Load(Server.MapPath("xpath.html"));
// Teams
HtmlNodeCollection teamNodes = document.DocumentNode.SelectNodes("//div[@class='ippg-Market_CompetitorName']");
List<string> teams = new List<string>();
foreach (HtmlNode n in teamNodes)
{
    HtmlNode nodeTeam = n.SelectSingleNode(".//span[@class='ippg-Market_Truncator']");
    if (nodeTeam != null)
    {
        teams.Add(nodeTeam.InnerText);
    }
}
// Odds
HtmlNodeCollection oddNodes = document.DocumentNode.SelectNodes("//span[@class='ippg-Market_Odds']");
List<string> odds = new List<string>();
foreach (HtmlNode o in oddNodes)
{
    odds.Add(o.InnerText);
}
I'm starting to write a crawler in C#, and I heard that HtmlAgilityPack is the best solution for this.
I can't find a working usage example, so maybe someone here can help me with my issue.
In one class I use a method to get the part of the markup I want, for example the ul with class "testable ul":
public static string GetElement(string url, string element, string type, string name)
{
    HtmlWeb web = new HtmlWeb();
    HtmlDocument doc = web.Load(url);
    string rate = doc.DocumentNode.SelectSingleNode("//"+ element +"[@"+ type +"='"+ name +"']").OuterHtml;
    return rate;
}
so I am running
string content = SiteMethods.GetElement(startPage, "ul", "class", "testable ul");
Then there is a part where I do some background work, but in the end I load that string into HtmlAgilityPack again:
HtmlDocument html = new HtmlDocument();
html.OptionOutputAsXml = true;
html.LoadHtml(content);
HtmlNode document = html.DocumentNode;
And here I have a problem.
The structure inside the content string looks like this:
<ul class="testable ul">
<li>
<a href="http://www.veryimportant.link">
<div class="img">
<img src="http://image.so.important/">
</div>
<div class="info">
<span class="name">
NAME
</span>
<span class="price">10</span>
<span class="price2">8</span>
<span class="grade">C</span>
</div>
<p class="tips">tips</p>
</a>
</li>
<li>
<a href="http://www.veryimportant.link/2">
<div class="img">
<img src="http://image.so.important/2">
</div>
<div class="info">
<span class="name">
NAME2
</span>
<span class="price">3</span>
<span class="price2">4</span>
<span class="grade">A</span>
</div>
<p class="tips">tips2</p>
</a>
</li>
</ul>
So the questions are:
How do I get every <li> into a separate object for further processing?
Is it possible to get the links http://www.veryimportant.link and http://www.veryimportant.link/2, or for example the images http://image.so.important/ and http://image.so.important/2, with one simple command? How do I get them?
How do I get NAME and NAME2 into a list?
Is it possible to map the whole structure of the HTML to a list?
Please include some examples; with them, the rest of the learning will be really easy.
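One possible way to approach these questions is sketched below: select each <li> once, then run relative XPath queries (starting with .//) inside it and store the results in a small object. The ListItem type and its property names are hypothetical and only illustrate the idea; the sketch assumes the HtmlAgilityPack NuGet package is referenced.

```csharp
using System.Collections.Generic;
using HtmlAgilityPack; // NuGet package "HtmlAgilityPack"

// Hypothetical container for the data of one <li>; names are illustrative.
public sealed class ListItem
{
    public string Link { get; set; }
    public string Image { get; set; }
    public string Name { get; set; }
}

public static class ListItemParser
{
    // Maps every <li> of the "testable ul" list to one ListItem.
    public static List<ListItem> Parse(string content)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(content);

        var items = new List<ListItem>();
        HtmlNodeCollection liNodes =
            doc.DocumentNode.SelectNodes("//ul[@class='testable ul']/li");
        if (liNodes == null)
        {
            return items; // SelectNodes returns null when nothing matches
        }

        foreach (HtmlNode li in liNodes)
        {
            // The leading dot keeps each query inside the current <li>.
            HtmlNode a = li.SelectSingleNode(".//a[@href]");
            HtmlNode img = li.SelectSingleNode(".//img[@src]");
            HtmlNode name = li.SelectSingleNode(".//span[@class='name']");

            items.Add(new ListItem
            {
                Link = a == null ? null : a.GetAttributeValue("href", ""),
                Image = img == null ? null : img.GetAttributeValue("src", ""),
                Name = name == null ? null : name.InnerText.Trim()
            });
        }
        return items;
    }
}
```

Once the items are in a list, all links or all names can be read with a single LINQ projection, for example items.Select(i => i.Link).ToList().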
I want to extract some data from a website using HtmlAgilityPack. The data is stored in an element with the attribute class="translateTxt". I use this code, but it returns null:
C# code:
HtmlAgilityPack.HtmlDocument doc = hw.Load(Url);
HtmlNodeCollection nodes1 = doc.DocumentNode.SelectNodes("//div[@class='translateTxt']");
foreach (HtmlNode node in nodes1)
{
string Txt = node.InnerText;
}
html code:
<div id="trans" class="tap_mt">
<div class="tr_brst clearfix">
<div class="tr_instyle">
<div class="tr_ext clearfix">
<div class="translateTxt">
hi
</div>
</div>
</div>
</div>
</div>
Try the following to get all descendant div tags whose class attribute contains translateTxt:
var findclasses = doc.DocumentNode.Descendants("div").Where(d =>
d.Attributes.Contains("class") && d.Attributes["class"].Value.Contains("translateTxt"));
Then loop over your findclasses variable.
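A minimal sketch of that loop follows, assuming the HtmlAgilityPack NuGet package is referenced. It is a small variation on the snippet above: it compares whole class tokens instead of using a substring test, so a class such as translateTxtExtra would not match. The ClassSearch class and TextsByClass method are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack; // NuGet package "HtmlAgilityPack"

// Hypothetical helper, not part of HtmlAgilityPack.
public static class ClassSearch
{
    // Returns the trimmed text of every <div> whose class list contains className.
    public static List<string> TextsByClass(string html, string className)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Descendants never returns null, so no null check is needed here,
        // unlike SelectNodes, which returns null when nothing matches.
        IEnumerable<HtmlNode> matches = doc.DocumentNode
            .Descendants("div")
            .Where(d => d.GetAttributeValue("class", "")
                         .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)
                         .Contains(className));

        var texts = new List<string>();
        foreach (HtmlNode node in matches)
        {
            texts.Add(node.InnerText.Trim());
        }
        return texts;
    }
}
```

If this also finds nothing on the live page, the element may not be present in the downloaded markup at all, which is worth checking by inspecting the raw response.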
I am attempting to parse a html web page to get information from it. Here is a sample of the source:
<div class="market_listing_row market_recent_listing_row listing_2107979855708535333" id="listing_2107979855708535333">
<div class="market_listing_item_img_container"> <img id="listing_2107979855708535333_image" src="asdgfasdfgasgasgdasgasdgsdasgsadg" style="border-color: #D2D2D2;" class="market_listing_item_img" alt="" /> </div>
<div class="market_listing_right_cell market_listing_action_buttons">
<div class="market_listing_buy_button">
<a href="javascript:BuyMarketListing('listing', '2107979855708535333', 570, '2', '508690045')" class="item_market_action_button item_market_action_button_green">
<span class="item_market_action_button_edge item_market_action_button_left"></span>
<span class="item_market_action_button_contents">
Buy Now </span>
<span class="item_market_action_button_edge item_market_action_button_right"></span>
<span class="item_market_action_button_preload"></span>
</a>
</div>
</div>
<div class="market_listing_right_cell market_listing_their_price">
<span>
<span class="market_listing_price market_listing_price_with_fee">
0,03 pуб. </span>
<span class="market_listing_price market_listing_price_without_fee">
0,01 pуб. </span>
<br/>
</span>
Basically I need to get the part that is enclosed in the
<span class="market_listing_price market_listing_price_with_fee">
0,03 pуб. </span>
I have attempted to use HtmlAgilityPack, but can't seem to figure it out.
You can use HtmlAgilityPack:
var doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(html);
var node = doc.DocumentNode
    .SelectSingleNode("//span[@class='market_listing_price market_listing_price_with_fee']");
var text = WebUtility.HtmlDecode(node.InnerText);
I figured out that you cannot just pass a URL to doc.LoadHtml, because it expects the HTML markup itself. You have to download the page first, for example with an HttpWebRequest and its response.
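The distinction can be sketched as follows, assuming the HtmlAgilityPack NuGet package is referenced. LoadHtml parses a string that already contains markup, while HtmlWeb.Load (used in other snippets here as an alternative to HttpWebRequest) downloads a URL and parses the result. The PriceReader class and its method names are hypothetical.

```csharp
using System.Net;
using HtmlAgilityPack; // NuGet package "HtmlAgilityPack"

// Hypothetical helper, not part of HtmlAgilityPack.
public static class PriceReader
{
    // Parses markup that is already in memory.
    public static string PriceFromHtml(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html); // expects HTML text, not a URL
        return ExtractPrice(doc);
    }

    // Downloads the page first; HtmlWeb.Load takes a URL.
    public static string PriceFromUrl(string url)
    {
        HtmlDocument doc = new HtmlWeb().Load(url);
        return ExtractPrice(doc);
    }

    // Returns the decoded, trimmed price with fee, or null if it is missing.
    private static string ExtractPrice(HtmlDocument doc)
    {
        HtmlNode node = doc.DocumentNode.SelectSingleNode(
            "//span[contains(@class, 'market_listing_price_with_fee')]");
        return node == null ? null : WebUtility.HtmlDecode(node.InnerText).Trim();
    }
}
```

The contains(@class, ...) test is used here so the query does not depend on the exact order or spacing of the class names; that is a design choice of this sketch, not something the answer above requires.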
I want to get the link, title, and price from this HTML (this is one result out of ten):
<div class="listing-item">
<div class="block item-title">
<h3 id="title">
<span style="direction: ltr" class="title">
<a xtcltype="S" xtclib="listing_list_1_title_link" href="http://dubai.dubizzle.com/motors/used-cars/ford/explorer/2013/7/1/ford-explorer-2012-new-model-expat-leaving-2/?back=ZHViYWkuZHViaXp6bGUuY29tL21vdG9ycy91c2VkLWNhcnMv&pos=1">FORD EXPLORER - 2012 - NEW MODEL - EXPAT LEAV...</a>
</span>
</h3>
<div class="price">
AED 118,000
<br>
</div>
</div>
</div>
Here is my code
var allCarResults = rootNode.SelectNodes("//div[normalize-space(@class)='listing-item']");
foreach (var carResult in allCarResults)
{
    var dataNode = carResult.SelectSingleNode(".//div[@class='block item-title']");
    var carNameNode = dataNode.SelectSingleNode(".//h3/a");
    string carName = carNameNode.InnerText.Trim();
}
This gives me an object reference error when I try to get carName. What am I doing wrong here?
dataNode.SelectSingleNode(".//h3/a"); tries to select an <a> node directly under the <h3> that is somewhere under that dataNode.
However, in your case there is a <span> in between, so the query finds nothing and returns null. Use dataNode.SelectSingleNode(".//h3//a"); (note the // between h3 and a) to get an <a> node anywhere below an <h3>.
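The difference between / (direct child) and // (any descendant) can be checked with a short sketch, assuming the HtmlAgilityPack NuGet package is referenced. The ChildVsDescendantDemo class and FirstLinkText method are hypothetical names used only for illustration.

```csharp
using HtmlAgilityPack; // NuGet package "HtmlAgilityPack"

// Hypothetical helper, not part of HtmlAgilityPack.
public static class ChildVsDescendantDemo
{
    // Returns the trimmed text of the first node matched by xpath,
    // or null when the query matches nothing.
    public static string FirstLinkText(string html, string xpath)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        HtmlNode link = doc.DocumentNode.SelectSingleNode(xpath);

        // SelectSingleNode returns null on no match; reading InnerText
        // without this check is what causes the object reference error.
        return link == null ? null : link.InnerText.Trim();
    }
}
```

For markup where the <a> sits inside a <span> inside the <h3>, calling this with "//h3/a" should return null, while "//h3//a" should return the link text.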