How to get all attributes with the same name? - c#

Let's say this is the XML file:
<div Pictures>
<span Pic1>
<a title="pic1" class="thumb" image="LinkToImage.com">
</a>
</span >
<span Pic2>
<a title="pic2" class="thumb-small" image="LinkToImage2.com">
</a>
</span >
</div >
How do I get all image attributes from this page? I know I need to use the XPath syntax //@image, but I can't find the code to collect all these attributes and loop over them in a foreach. I've tried something like this, but it didn't work:
var WebgetME_ = new HtmlWeb();
var docME_ = WebgetME_.Load(MEURL_);
foreach (HtmlAttribute HA_ME in docME_.DocumentNode.Attributes["//@image"])) { ;}
How do you get all attribute info with the same attribute name from a page?

How do I get all image attributes from this page?
I believe you can do the following:
var images =
from link in docME_.DocumentNode.Descendants("a")
where link.Attributes.Contains("image")
select link.Attributes["image"].Value;
This gets all <a .../> nodes, keeps only those that have an "image" attribute, and retrieves the value of that attribute.
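The same filtering can also be expressed as an XPath query. The following is a minimal sketch, assuming HtmlAgilityPack is referenced; the class and method names (ImageAttributeSketch, GetImageValues) are hypothetical and not part of the original answer. It selects the <a> elements that carry an image attribute rather than the attributes themselves, then reads each value.

```csharp
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

// Hypothetical helper, for illustration only.
public static class ImageAttributeSketch
{
    // Returns the value of every "image" attribute found on <a> elements.
    public static List<string> GetImageValues(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        var anchors = doc.DocumentNode.SelectNodes("//a[@image]");
        if (anchors == null)
        {
            return new List<string>();
        }

        return anchors
            .Select(a => a.GetAttributeValue("image", string.Empty))
            .ToList();
    }
}
```

The result can then be walked with an ordinary foreach, for example foreach (string image in ImageAttributeSketch.GetImageValues(html)) { ... }.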

Related

c# substring - parse all text in between

I am trying to parse text (mainly the URL) from the HTML code below, but I would only like to grab the URL between the div tags (result-firstline-title) and (result-url js-result-url), for every occurrence.
To be clear, I am able to grab all the URLs from the HTML source below, but the problem is that each URL gets grabbed almost three times. I have a fix for that, which removes duplicate URLs; however, if you look carefully at the HTML source, you will see that it also grabs the third URL.
<div class="result js-result card-mobile ">
<div class="result-firstline-container">
<div class="result-firstline-title">
<a
class="result-title js-result-title"
href="https://www.lifewire.com/top-social-networking-sites-people-are-using-3486554"
>
The Top Social Networking Sites People Are Using
</a>
</div>
</div>
<a
class="result-url js-result-url"
href="https://www.lifewire.com/top-social-networking-sites-people-are-using-3486554">https://www.lifewire.com/top-<b>social-networking-sites</b>-people-are...
</a>
<p class="result-snippet">
The Top
</p>
</div>
<div class="result js-result card-mobile ">
<div class="result-firstline-container">
<div class="result-firstline-title">
<a
class="result-title js-result-title"
href="http://www.ebizmba.com/articles/social-networking-websites"
>
Top 15 Most Popular Social Networking Sites | January 2019
</a>
</div>
</div>
<a
class="result-url js-result-url"
href="http://www.ebizmba.com/articles/social-networking-websites">www.ebizmba.com/articles/<b>social-networking</b>-<b>websites</b>
</a>
<p class="result-snippet">
Top 15 Most
</p>
</div>
I have tried the following C# code to grab the text between the div tags, but it grabs everything, which I don't want.
int urlTagFrom = rawHTMLFromSource.IndexOf("result-firstline-title") + "result-firstline-title".Length;
int urlTagTo = rawHTMLFromSource.LastIndexOf("result-url js-result-url");
urlTagCollection = rawHTMLFromSource.Substring(urlTagFrom, urlTagTo - urlTagFrom);
To grab the URLs I am using the following:
var regexURLParser = new Regex(@"(http|ftp|https):\/\/([\w\-_]+(?:(?:\.[\w\-_]+)+))([\w\-\.,@?^=%&:/~\+#]*[\w\-\@?^=%&/~\+#])?", RegexOptions.Singleline | RegexOptions.CultureInvariant);
What I want is to grab the URLs from these:
<a
class="result-title js-result-title"
href="https://www.lifewire.com/top-social-networking-sites-people-are-using-3486554"
>
<a
class="result-title js-result-title"
href="http://www.ebizmba.com/articles/social-networking-websites"
>
So that the output shows only:
https://www.lifewire.com/top-social-networking-sites-people-are-using-3486554
http://www.ebizmba.com/articles/social-networking-websites
You can make this easier by using HtmlAgilityPack; just include it in your project using NuGet.
To add HtmlAgilityPack using NuGet, go to the Package Manager Console and type Install-Package HtmlAgilityPack -Version 1.11.3
After the installation you can extract the URLs like below.
var doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(@"put html string here");
var listOfUrls = new List<string>();
doc.DocumentNode.SelectNodes("//a").ToList()
.ForEach(x=>
{
//Use HasClass method to filter elements
if (!string.IsNullOrEmpty(x.GetAttributeValue("href", ""))
&& x.HasClass("result-title") && x.HasClass("js-result-title"))
{
listOfUrls.Add(x.GetAttributeValue("href", ""));
}
});
listOfUrls.ForEach(x => Console.WriteLine(x));
EDIT
Added && x.HasClass("result-title") && x.HasClass("js-result-title") to show only those elements that have both the result-title and js-result-title classes.
Another way
A shorter way to get the filtered values:
var doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(@"put html string here");
var listOfUrls = doc.DocumentNode.Descendants("a")
.Where(x => x.Attributes["class"] != null
&& x.Attributes["class"].Value == "result-title js-result-title")
.Select(x => x.GetAttributeValue("href", "")).ToList();

XPath retrieving values from multiple tags inside a node

I'm currently creating a crawler and I'm at the point where I need to extract data as a set so I can send it to a database as a single row, nice and neat.
Here is a snippet of my program. So far it correctly goes to each page and retrieves the corresponding URL:
int tempflag = 0;
//linkValueList is full of sub urls previously crawled in the program
foreach (string str in linkValueList)
{
string tempURL = baseURL + str;
HtmlWeb tempWeb = new HtmlWeb();
HtmlDocument tempHtml = tempWeb.Load(tempURL);
foreach (HtmlNode node in tempHtml.DocumentNode.SelectNodes("//article[@itemprop='product']"))
{
//get the category from the linkNameList
string tempCategory = linkNameList.ElementAt(tempflag);
//grab url
string tempHref = node.GetAttributeValue("data-itemurl", string.Empty);
//grab image url
//grab brand
//grab name
//grab price
//send to database via INSERT
}
tempflag++;
}
Here is the site code I am working with, this is an example of one item, each item looks similar
<article .... itemprop="product" data-itemurl="Item's url">
<figure>
<a ....>
<img .... src="item's image source" ...>
</a>
<div ...>
<a>....</a>
</div>
</figure>
<div ...>
<a ....>
<div class="brand" itemprop="brand">Item's Brand</div>
<div class="title" itemprop="name">Item's Name</div>
</a>
<div ....>
<div class="msrp"></div>
<div class="price" itemprop="price">$18.99 - $119.99</div>
<span ...> ... </span>
</div>
</div>
</article>
As you can see I have already used XPath to get myself inside of the <article> tag to get the data-itemurl to retrieve the item's url. My question is now that I am already inside of the <article> tag, is there an easy way to now access the other tags nested inside?
I need to get to the <img> tag for the image's url, <div itemprop="brand"> for the brand, <div itemprop="name"> for the item name, and <div itemprop="price"> for the price.
As I mentioned before, I am trying to get all of that information in one go around so I can query it all into a database as a single insert statement at the end of each loop.
Sure, you can use another XPath to query within a given element. One thing to note, which has tripped up many people: never start a relative XPath with /, because it will search the entire document instead. Start with ./ if you need to (or .// to match descendants at any depth). For example (SelectSingleNode() is assumed to always find the target element here; otherwise you need to check that the result is not null first):
foreach (HtmlNode node in tempHtml.DocumentNode.SelectNodes("//article[@itemprop='product']"))
{
// the leading dot keeps each query inside the current <article>
string img = node.SelectSingleNode(".//img").GetAttributeValue("src", "");
string brand = node.SelectSingleNode(".//div[@itemprop='brand']").InnerText.Trim();
.....
}
You can also use node.Descendants("img") or node.Descendants("div").Where(d => d.Attributes.Contains("itemprop") && d.Attributes["itemprop"].Value.Equals("price"))
Hope it helps.
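Since SelectSingleNode() returns null when nothing matches, the per-field lookups can be wrapped in small null-safe helpers. This is a minimal sketch, assuming HtmlAgilityPack is referenced; the names ProductFieldSketch, TextOrEmpty and AttributeOrEmpty are hypothetical and not part of either answer.

```csharp
using HtmlAgilityPack;

// Hypothetical helpers, for illustration only.
public static class ProductFieldSketch
{
    // Trimmed inner text of the first node matching the relative XPath,
    // or an empty string when no node matches.
    public static string TextOrEmpty(HtmlNode scope, string relativeXPath)
    {
        HtmlNode match = scope.SelectSingleNode(relativeXPath);
        return match == null ? string.Empty : match.InnerText.Trim();
    }

    // Value of the named attribute on the first matching node,
    // or an empty string when the node or the attribute is missing.
    public static string AttributeOrEmpty(HtmlNode scope, string relativeXPath, string attributeName)
    {
        HtmlNode match = scope.SelectSingleNode(relativeXPath);
        return match == null ? string.Empty : match.GetAttributeValue(attributeName, string.Empty);
    }
}
```

Inside the loop over the <article> nodes, the fields could then be read as ProductFieldSketch.AttributeOrEmpty(node, ".//img", "src") and ProductFieldSketch.TextOrEmpty(node, ".//div[@itemprop='price']"), so a missing element yields an empty string instead of a NullReferenceException.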

How to get specific data using HtmlAgilityPack

I am using HtmlAgilityPack for scraping data.
Here is the link that I am using to scrape data:
This Link
The structure is something like this:
<div id="left">
<h2>
<i id="bn7483" class="fa fa-volume-up fa-lg in au" title="Speak!"/>
<span class="in">(dhaarmika) </span>
<div class="row">
...
I need two pieces of data from there: one is "(dhaarmika)" and the other is the id, which is "bn7483". Using this code
HtmlAgilityPack.HtmlDocument doc2 = web2.Load("http://www.shabdkosh.com/bn/translate/ধার্মিক");
HtmlNodeCollection nodes = doc2.DocumentNode.SelectNodes("//span[@class='in']");
I was able to get the first piece of data, "(dhaarmika)".
But I couldn't get the second one.
Could anyone tell me how to get the second piece of data?
Another possible way is to select the preceding sibling of the <span> you already found:
var doc2 = new HtmlWeb().Load("http://www.shabdkosh.com/bn/translate/ধার্মিক");
var span = doc2.DocumentNode.SelectSingleNode("//span[@class='in']");
// start from the <span> found above and step back to the sibling <i> that has an id
var i = span.SelectSingleNode("preceding-sibling::i[@id]")
.Attributes["id"]
.Value;

How to get a particular text inside HTML using c#?

How to get the text "Attractions" from the below HTML ?
<li class="product">
<strong>
Attractions
</strong>
<span></span>
</li>
I usually get this done with the code below when I need the text inside a span, but I need some help for the situation above.
foreach (HtmlNode selectNode in htmlDocument.DocumentNode.SelectNodes("//span[@class='cityName']"))
{
Result = selectNode.InnerHtml;
}
How can I do this?
Result = htmlDocument.DocumentNode.SelectSingleNode("//li[@class='product']/strong").InnerText.Trim();
You can also do a foreach using SelectNodes like what you did up there.
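The foreach variant could look like the sketch below. It assumes HtmlAgilityPack is referenced and targets the <strong> element exactly as it appears in the markup shown in the question; the names ProductTextSketch and GetProductNames are hypothetical.

```csharp
using System.Collections.Generic;
using HtmlAgilityPack;

// Hypothetical helper, for illustration only.
public static class ProductTextSketch
{
    // Collects the trimmed text of every <strong> directly under <li class="product">.
    public static List<string> GetProductNames(HtmlDocument htmlDocument)
    {
        var names = new List<string>();

        // SelectNodes returns null when nothing matches, so guard before looping.
        HtmlNodeCollection nodes =
            htmlDocument.DocumentNode.SelectNodes("//li[@class='product']/strong");
        if (nodes == null)
        {
            return names;
        }

        foreach (HtmlNode strongNode in nodes)
        {
            // InnerText includes the surrounding line breaks, hence the Trim().
            names.Add(strongNode.InnerText.Trim());
        }

        return names;
    }
}
```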

How to unwrap an element if it exists with CsQuery?

I'm using CsQuery to read values of HTML elements.
In advance, I don't know if the <a> element contains a <font> element or not.
Is there a way to read the InnerText of an anchor regardless of whether it contains a font element or not?
Scenario 1: Text inside font element
<div class="link">
<a href="http://www.example.com/1">
<font>Foo</font>
</a>
</div>
Scenario 2: Text without font element
<div class="link">
<a href="http://www.example.com/2">
Foo
</a>
</div>
I've got the following working solution:
var dom = CQ.CreateFromUrl("http://www.myurl.com");
var a = new CQ(dom.Select("div.link a").InnerHTML);
var font = a.Select("font");
var myValue = a.Count() > 0 ? font[0].InnerText : a[0].InnerText;
But it's a bit messy, and I'd rather always remove the font element, if present, so I could go for the anchor value right away. Something like Contents() in combination with UnWrap(), but I haven't managed to make it work. Any ideas?
var dom = CQ.CreateFromUrl("http://www.myurl.com");
string result = dom[".link a"].Text();
