How to get specific data using HtmlAgilityPack

How to get specific data using HtmlAgilityPack - c#

I am using HtmlAgilityPack for scrapping data.
Here is the link that i am using to scrap data
This Link
The structure is something like that
<div id="left">
<h2>
<i id="bn7483" class="fa fa-volume-up fa-lg in au" title="Speak!"/>
<span class="in">(dhaarmika) </span>
<div class="row">
...
I need two data from there one is "(dhaarmika)" and another is the id from that is "bn7483" using this code
HtmlAgilityPack.HtmlDocument doc2 = web2.Load("http://www.shabdkosh.com/bn/translate/ধার্মিক");
HtmlNodeCollection nodes = doc2.DocumentNode.SelectNodes("//span[#class='in']");
I was able to get the first one data that is "(dhaarmika)".
But i couldn't get the second data.
Could anyone tell me how to get the second data???

Another possible way is by selecting preceding sibling of the <span> you already found :
var doc2 = new HtmlWeb().Load("http://www.shabdkosh.com/bn/translate/ধার্মিক");
var span = doc2.DocumentNode.SelectSingleNode("//span[#class='in']");
var i = node.SelectSingleNode("preceding-sibling::i[#id]")
.Attributes["id"]
.Value;

Related

c# substring - parse all text in between

trying to parse all text (mainly the url) from the html code below. but i would only like to grab the url between these div tags (result-firstline-title) and (result-url js-result-url) for each(all) occurrences.
to be clear, i am able to grab all the url from the html source below, but the problem is it is also grabbing the url almost 3 times. and for that i have a fix which to remove duplicate urls, however, if you look carefully to the html source, you will see that it also grabs the 3rd url.
<div class="result js-result card-mobile ">
<div class="result-firstline-container">
<div class="result-firstline-title">
<a
class="result-title js-result-title"
href="https://www.lifewire.com/top-social-networking-sites-people-are-using-3486554"
>
The Top Social Networking Sites People Are Using
</a>
</div>
</div>
<a
class="result-url js-result-url"
href="https://www.lifewire.com/top-social-networking-sites-people-are-using-3486554">https://www.lifewire.com/top-<b>social-networking-sites</b>-people-are...
</a>
<p class="result-snippet">
The Top
</p>
</div>
<div class="result js-result card-mobile ">
<div class="result-firstline-container">
<div class="result-firstline-title">
<a
class="result-title js-result-title"
href="http://www.ebizmba.com/articles/social-networking- websites"
>
Top 15 Most Popular Social Networking Sites | January 2019
</a>
</div>
</div>
<a
class="result-url js-result-url"
href="http://www.ebizmba.com/articles/social-networking- websites">www.ebizmba.com/articles/<b>social-networking</b>-<b>websites</b>
</a>
<p class="result-snippet">
Top 15 Most
</p>
</div>
i have tried the following c# code to grab the text between the div tags but it grabs everything, which i dont want.
int urlTagFrom = rawHTMLFromSource.IndexOf("result-firstline-title") + "result-firstline-title".Length;
int urlTagTo = rawHTMLFromSource.LastIndexOf("result-url js-result-url");
urlTagCollection = rawHTMLFromSource.Substring(urlTagFrom, urlTagTo - urlTagFrom);
to grab url i am using the following:
var regexURLParser = new Regex(#"(http|ftp|https):\/\/([\w\-_]+(?:(?:\.[\w\-_]+)+))([\w\-\.,#?^=%&:/~\+#]*[\w\-\#?^=%&/~\+#])?", RegexOptions.Singleline | RegexOptions.CultureInvariant);
what i want is to grab is the url from these:
<a
class="result-title js-result-title"
href="https://www.lifewire.com/top-social-networking-sites-people-are-using-3486554"
>
<a
class="result-title js-result-title"
href="http://www.ebizmba.com/articles/social-networking-websites"
>
so that the outcome shows only:
https://www.lifewire.com/top-social-networking-sites-people-are-using-3486554
http://www.ebizmba.com/articles/social-networking-websites

You can make it more easier by using HTMLAgilityPack just include it in your project using NuGet.
To add HTMLAgilityPack using NuGet
go to the Package Manager Console and type Install-Package HtmlAgilityPack -Version 1.11.3
after the installation you can extract Urls like below.
var doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(#"put html string here");
var listOfUrls = new List<string>();
doc.DocumentNode.SelectNodes("//a").ToList()
.ForEach(x=>
{
//Use HasClass method to filter elements
if (!string.IsNullOrEmpty(x.GetAttributeValue("href", ""))
&& x.HasClass("result-title") && x.HasClass("js-result-title"))
{
listOfUrls.Add(x.GetAttributeValue("href", ""));
}
});
listOfUrls.ForEach(x => Console.WriteLine(x));
EDIT
Added && x.HasClass("result-title") && x.HasClass("js-result-title") to shows only those elements which has the class result-title and js-result-title.
Another way
shorter and another way to get filtered values.
var doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(#"put html string here");
var listOfUrls = doc.DocumentNode.Descendants("a")
.Where(x => x.Attributes["class"] != null
&& x.Attributes["class"].Value == "result-title js-result-title")
.Select(x => x.GetAttributeValue("href", "")).ToList();

XPath retrieving values from multiple tags inside a node

I'm currently creating a crawler and I'm at the point where I need to abstract data in a set so I can send it to a database as a single row, nice and neat.
Here is a snip-it of my program, it correctly goes to each page so far and retrieves the correct corresponding url
int tempflag = 0;
//linkValueList is full of sub urls previously crawled in the program
foreach (string str in linkValueList)
{
string tempURL = baseURL + str;
HtmlWeb tempWeb = new HtmlWeb();
HtmlDocument tempHtml = tempWeb.Load(tempURL);
foreach (HtmlNode node in tempHtml.DocumentNode.SelectNodes("//article[#itemprop='product']"))
{
//get the category from the linkNameList
string tempCategory = linkNameList.ElementAt(tempflag);
//grab url
string tempHref = node.GetAttributeValue("data-itemurl", string.Empty);
//grab image url
//grab brand
//grab name
//grab price
//send to database via INSERT
}
tempflag++;
}
Here is the site code I am working with, this is an example of one item, each item looks similar
<article .... itemprop="product" data-itemurl="Item's url">
<figure>
<a ....>
<img .... src="item's image source" ...>
</a>
<div ...>
<a>....</a>
</div>
</figure>
<div ...>
<a ....>
<div class="brand" itemprop="brand>Item's Brand</div>
<div class="title" itemprop="name">Item's Name</div>
</a>
<div ....>
<div class="msrp"></div>
<div class="price" itemprop="price">$18.99 - $119.99</div>
<span ...> ... </span>
</div>
</div>
</article>
As you can see I have already used XPath to get myself inside of the <article> tag to get the data-itemurl to retrieve the item's url. My question is now that I am already inside of the <article> tag, is there an easy way to now access the other tags nested inside?
I need to get to the <img> tag for the image's url, <div itemprop="brand"> for the brand, <div itemprop="name"> for the item name, and <div itemprop="price"> for the price.
As I mentioned before, I am trying to get all of that information in one go around so I can query it all into a database as a single insert statement at the end of each loop.

Sure you can use another XPath to query within a given element. One thing to note, which many have been troubled with, never start a relative XPath with /, for it will search the entire document instead, start with ./ if you need to, for example (SelectSingleNode() assumed to always find the target element here, otherwise you need to check whether the result is not null first) :
foreach (HtmlNode node in tempHtml.DocumentNode.SelectNodes("//article[#itemprop='product']"))
{
img = node.SelectSingleNode(".//img").GetAttributeValue("src","");
brand = node.SelectSingleNode(".//div[#itemprop='brand']").InnerText.Trim();
.....
}

Sure you can use node.Descendants("img") or node.Descendants("div").Where(d => d.Attributes.Contains("itemprop") && d.Attributes["itemprop"].Value.Equals("price"))
Hope it helps.

Cannot get content of specific div with html agility pack

I'm using html agility pack for take some data from a website, now there is a bit problem. I want get some data from this div:
<div class="container middle">
<div class="details clearfix">
<dl>
<dt>Gara</dt>
<dd>Super League</dd>
<dt>Data</dt>
<dd><span class='timestamp' data-value='1467459300' data-format='d mmmm yyyy'>2 luglio 2016</span></dd>
<dt>Game week</dt>
<dd>15</dd>
<dt>calcio di inizio</dt>
<dd>
<span class='timestamp' data-value='1467459300' data-format='HH:MM'>13:35</span>
(<span class="game-minute">FP'</span>)
</dd>
</dl>
</div>
the problem's that there are two div with the class container middle and details clearfix, I want get the content onlhy of the specific div pasted above. This div have a dl tag for each tag.
This is my code:
var url = "http://it.soccerway.com/matches/2016/07/02/china-pr/csl/henan-jianye/beijing-guoan-football-club/2207361/";
var doc = new HtmlDocument();
doc.LoadHtml(new WebClient().DownloadString(url));
var infoDiv = doc.DocumentNode.SelectSingleNode("//div[#class='block_match_info real-content clearfix ']");
var matchDetails = infoDiv.SelectSingleNode("//div[#class='container middle']");
and this return a wrong result, in particular this:
<div class="container middle">
<h3 class="thick scoretime score-orange">
0 - 0
</h3>
this is the complete source code.

Well, you could do the following, for this particular web-page:
var matchDetails = infoDiv.SelectNodes(".//div[#class='container middle']");
Console.WriteLine(matchDetails[1].InnerHtml);
and working with HtmlNode via matchDetails[1]. To retrieve other data you can use similar xpath requests, like:
var infoDiv = doc.DocumentNode.SelectSingleNode("//div[#class='block_match_info real-content clearfix ']");
var matchDetails = infoDiv.SelectNodes(".//div[#class='container middle']");
var dl = matchDetails[1].SelectSingleNode(".//dl");
var dt = dl.SelectNodes(".//dt");
var dd = dl.SelectNodes(".//dd");
for (int i = 0; i < dt.Count; i++) {
var name = dt[i].InnerHtml;
var value = dd[i].InnerHtml;
Console.WriteLine(name + ": " + value);
}
Of course, you need some check for the NullReference and stuff

Query div with class details clearfix should return the target div element. There is one crucial detail you need to be aware of though,
that a . before / is needed to make the XPath relative to the context element referenced by infoDiv, otherwise the XPath will be evaluated on the root document context (as if it was called on doc.DocumentNode instead of on infoDiv) :
var infoDiv = doc.DocumentNode.SelectSingleNode("//div[#class='block_match_info real-content clearfix ']");
var matchDetails = infoDiv.SelectSingleNode(".//div[#class='details clearfix']");

How to get all attributes with the same name?

let's say this is the xml file
<div Pictures>
<span Pic1>
<a title="pic1" class="thumb" image="LinkToImage.com">
</a>
</span >
<span Pic2>
<a title="pic2" class="thumb-small" image="LinkToImage2.com">
</a>
</span >
</div >
How do I get all image attributes from this page? I know I need to use the XPath syntax //#image but I can't find the code to collect all these attributes and put them in a foreach. I've tried something like this but that didn't work
var WebgetME_ = new HtmlWeb();
var docME_ = WebgetME_.Load(MEURL_);
foreach (HtmlAttribute HA_ME in docME_.DocumentNode.Attributes["//#image"])) { ;}
How do you get all attribute info with the same attribute name from a page?

How do I get all image attributes from this page?
I believe you can do the following:
var images =
from link in docME_.Descendants("a")
where link.Attributes.Contains("image")
select link.Attributes["image"].Value;
This
gets all <a .../> nodes
filters them by having an an "image" attribute
retrieves the value for this attribute.

How to get a particular text inside HTML using c#?

How to get the text "Attractions" from the below HTML ?
<li class="product">
<strong>
Attractions
</strong>
<span></span>
</li>
I usually get this done by the below code, when i need the text inside span. But need some help for the above situation.
foreach (HtmlNode selectNode in htmlDocument.DocumentNode.SelectNodes("//span[#class='cityName']"))
{
Result = selectNode.InnerHtml;
}
How can i do this ?

Result = htmlDocument.DocumentNode.SelectSingleNode("//li[#class='product']/strong/a").InnerText;
You can also do a foreach using SelectNodes like what you did up there.

Develop Reference

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.

How to get specific data using HtmlAgilityPack - c#

Related

c# substring - parse all text in between

XPath retrieving values from multiple tags inside a node

Cannot get content of specific div with html agility pack

How to get all attributes with the same name?

How to get a particular text inside HTML using c#?

Categories

Resources