How to get a particular text inside HTML using C#?

How can I get the text "Attractions" from the HTML below?
<li class="product">
<strong>
Attractions
</strong>
<span></span>
</li>
I usually do this with the code below when I need the text inside a span, but I need some help with the situation above.
foreach (HtmlNode selectNode in htmlDocument.DocumentNode.SelectNodes("//span[@class='cityName']"))
{
    Result = selectNode.InnerHtml;
}
How can I do this?

Result = htmlDocument.DocumentNode.SelectSingleNode("//li[@class='product']/strong/a").InnerText;
You can also loop with foreach over SelectNodes, as you did above.

Related

find link with multiple keywords in c# with HTML Agility Pack

I am writing a program that parses a website.
I managed to find a link on the site, but I had to pass the exact InnerText to find it.
I'm looking for a way to do the same thing, but matching on partial inner text.
Example:
the inner text is: "hi my name is"
I want to be able to find it by passing only
"hi my"
foreach (var title in htmlNodes)
{
if (keywords == title.SelectSingleNode("div/h1").InnerText)
{
if (color == title.SelectSingleNode("div/p").InnerText)
{
Console.WriteLine(title.SelectSingleNode("div/p/a").GetAttributeValue("href", "pas d'addresse"));
}
}
}
Here, keywords needs to match the inner text of div/h1 exactly. I want the match to be partial.
Here is the HTML:
<article>
<div class="inner-article">
<a style = "height:150px;" href="/shop/shirts/c712g63kx/p1us9bkh7">
<img width = "150" height="150" src="//assets.supremenewyork.com/146319/vi/qW2Nur88W30.jpg" alt="Qw2nur88w30">
</a>
<h1>
<a class="name-link" href="/shop/shirts/c712g63kx/p1us9bkh7">Tiger Stripe Rayon Shirt</a>
</h1>
<p>
<a class="name-link" href="/shop/shirts/c712g63kx/p1us9bkh7">Teal</a>
</p>
</div>
</article>
Thank you all for your answers!
I found out how to solve my problem. It was actually quite simple. Here is the code:
if (title.SelectSingleNode("div/h1").InnerText.Contains(keywords))
Now the problem is making the match case-insensitive.
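One way to make that check case-insensitive is sketched below. It relies only on string.IndexOf with StringComparison.OrdinalIgnoreCase from the base class library. The class and method names (PartialMatchDemo, ContainsIgnoreCase) are made up for illustration, and the sample strings stand in for the InnerText and keywords values.

```csharp
using System;

public static class PartialMatchDemo
{
    // Hypothetical helper: true when 'text' contains 'keywords', ignoring case.
    // IndexOf(string, StringComparison) returns -1 when there is no match.
    public static bool ContainsIgnoreCase(string text, string keywords)
    {
        if (text == null || keywords == null) return false;
        return text.IndexOf(keywords, StringComparison.OrdinalIgnoreCase) >= 0;
    }

    public static void Main()
    {
        // Sample inner text taken from the HTML in the question.
        string innerText = "Tiger Stripe Rayon Shirt";
        Console.WriteLine(ContainsIgnoreCase(innerText, "tiger stripe")); // True
        Console.WriteLine(ContainsIgnoreCase(innerText, "denim"));        // False
    }
}
```

In the loop from the question, the condition would then read ContainsIgnoreCase(title.SelectSingleNode("div/h1").InnerText, keywords).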

How to get specific data using HtmlAgilityPack

I am using HtmlAgilityPack to scrape data.
Here is the link I am scraping data from:
This Link
The structure is something like this:
<div id="left">
<h2>
<i id="bn7483" class="fa fa-volume-up fa-lg in au" title="Speak!"/>
<span class="in">(dhaarmika) </span>
<div class="row">
...
I need two pieces of data from it: one is "(dhaarmika)" and the other is the id "bn7483". Using this code
HtmlAgilityPack.HtmlDocument doc2 = web2.Load("http://www.shabdkosh.com/bn/translate/ধার্মিক");
HtmlNodeCollection nodes = doc2.DocumentNode.SelectNodes("//span[@class='in']");
I was able to get the first one, "(dhaarmika)",
but I couldn't get the second.
Could anyone tell me how to get it?
Another possible way is to select the preceding sibling of the <span> you already found:
var doc2 = new HtmlWeb().Load("http://www.shabdkosh.com/bn/translate/ধার্মিক");
var span = doc2.DocumentNode.SelectSingleNode("//span[@class='in']");
// Start from the span found above and walk back to the sibling <i> that carries an id.
var i = span.SelectSingleNode("preceding-sibling::i[@id]")
            .Attributes["id"]
            .Value;
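The preceding-sibling axis can be tried out without any extra package. The sketch below swaps HtmlAgilityPack for System.Xml.XmlDocument so that it runs on the base class library alone; the XPath expressions are the same, but XmlDocument needs well-formed XML, so the markup from the question is trimmed and closed here. The class and method names are hypothetical.

```csharp
using System;
using System.Xml;

public static class PrecedingSiblingDemo
{
    // Trimmed, well-formed version of the markup from the question (an assumption
    // made for the demo; the real page is HTML and may not parse as XML).
    public const string Sample =
        "<div id='left'><h2>" +
        "<i id='bn7483' class='fa fa-volume-up fa-lg in au' title='Speak!'/>" +
        "<span class='in'>(dhaarmika) </span>" +
        "</h2></div>";

    // Returns the id of the <i> element that precedes the first span.in, or null.
    public static string IconIdBeforeSpan(string xml)
    {
        var doc = new XmlDocument();
        doc.LoadXml(xml);
        XmlNode span = doc.SelectSingleNode("//span[@class='in']");
        if (span == null) return null;
        // Evaluated relative to the span: earlier siblings named <i> that have an id.
        XmlNode icon = span.SelectSingleNode("preceding-sibling::i[@id]");
        return icon == null ? null : icon.Attributes["id"].Value;
    }

    public static void Main()
    {
        Console.WriteLine(IconIdBeforeSpan(Sample)); // bn7483
    }
}
```

The null checks matter in practice: SelectSingleNode returns null when nothing matches, so chaining .Attributes["id"].Value directly throws on pages that lack the icon.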

XPATH: how to get child nodes

I have the following HTML:
<ul class="enh-toggle">
<li>
Design<sup>1</sup><span class="accordion"></span>
<ul id="design">
<li>
<strong>Dimensions</strong>
<ul><li>length:12.3cm</li></ul>
</li>
</ul>
</li>
</ul>
I use the following code to get ul[id='design']:
HtmlNode node = doc.DocumentNode.SelectSingleNode("//ul[@class='enh-toggle']//ul[@id='design']");
This works perfectly.
Now my question is how I can get the text of the strong tag. I use the following code, but it doesn't work:
string text = node.SelectSingleNode("/li/strong").InnerText;
A variation on the "li/strong" answers:
string text = node.SelectSingleNode("./li/strong").InnerText;
A leading single slash in XPath means the root of the document. You just want to select the direct children of the node you already have, so use a relative path without the leading slash:
string text = node.SelectSingleNode("li/strong").InnerText;
I think it should just be:
string text = node.SelectSingleNode("li/strong").InnerText;
...without the leading /.
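The difference between the absolute and the relative path can be checked with a small sketch. It uses System.Xml.XmlDocument rather than HtmlAgilityPack so that it runs without extra packages (a swap made for the demo; the XPath rules are the same), and the markup is a trimmed, well-formed copy of the HTML from the question. The class and method names are hypothetical.

```csharp
using System;
using System.Xml;

public static class RelativeXPathDemo
{
    // Evaluates 'xpath' with ul#design as the context node and returns the
    // matched node's text, or null when nothing matches.
    public static string TextFromDesignList(string xpath)
    {
        var doc = new XmlDocument();
        doc.LoadXml(
            "<ul class='enh-toggle'><li>Design" +
            "<ul id='design'><li><strong>Dimensions</strong>" +
            "<ul><li>length:12.3cm</li></ul></li></ul></li></ul>");
        XmlNode design = doc.SelectSingleNode("//ul[@class='enh-toggle']//ul[@id='design']");
        XmlNode match = design.SelectSingleNode(xpath);
        return match == null ? null : match.InnerText;
    }

    public static void Main()
    {
        // Relative paths start at ul#design.
        Console.WriteLine(TextFromDesignList("li/strong"));   // Dimensions
        Console.WriteLine(TextFromDesignList("./li/strong")); // Dimensions
        // An absolute path starts at the document root, whose only child is <ul>.
        Console.WriteLine(TextFromDesignList("/li/strong") ?? "(no match)"); // (no match)
    }
}
```

The absolute form returns null, which is why calling .InnerText on it in the question fails.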

Html parsing with Agility

Could someone please help me parse sequential HTML tags with the Agility Pack in C#? I have two questions, listed below.
I want to parse the following HTML and store it in a structure (list, stack, etc.) so I can use the data effectively.
<h3> header </h3>
<p> paragraph 1</p>
<p>
Google
Gizmodo
</p>
<ul>
<li> something is here with a download
link
</li>
<li> hello
<img src="www.imagesource.com"/>
</li>
</ul>
How do I parse this data sequentially?
If I use var ParaTags = HtmlDocument.DocumentNode.Descendants("p");,
then I only get the "p" tags. I don't know how to get "h3" and then "p" in sequence, because "p" is not inside "h3".
The following code returns all the hyperlinks:
var links =
    from paras in document.DocumentNode.Descendants("p")
    from hyperLinks in paras.Descendants("a").Where(x => x.Attributes["href"].Value != "")
    select hyperLinks;
What's the best way to parse and store this mixed content of strings, hyperlinks, and images
so I can output it efficiently later? A list, a stack?
In other words, I want to store all the content from the HTML and preserve its format if possible, so I can reassemble the content in the proper format once I reload it into the app.
Thank you!
If you want to extract all href and src attributes you may try this:
using System;
using System.Linq;
using HtmlAgilityPack;
public class Program
{
    static void Main()
    {
        var document = new HtmlDocument();
        document.Load("test.html");
        var links =
            from element in document.DocumentNode.Descendants()
            let href = element.Attributes["href"]
            let src = element.Attributes["src"]
            where href != null || src != null
            select href != null ? href.Value : src.Value;
        foreach (var link in links)
        {
            Console.WriteLine(link);
        }
    }
}
outputs:
www.google.com
www.gizmodo.com
www.google.com
www.imagesource.com

Get all <li> elements from inside a certain <div> with C#

I have a web page consisting of several <div> elements.
I would like to write a program that prints all the li elements inside a <div> after a certain <h4> header. Could anyone give me some help or sample code?
<div id="content">
<h4>Header</h4>
<ul>
<li><a href...></a> THIS IS WHAT I WANT TO GET</li>
</ul>
</div>
When it comes to parsing HTML in C#, don't try to write your own parser. The HTML Agility Pack is almost certainly capable of doing what you want!
Which parts are constant?
The 'id' of the DIV?
The h4?
Searching a complete HTML document and reacting to the H4 alone is likely to be a mess, whereas if you know the DIV has the ID "content", then just look for that!
// Requires: using System.Linq;
var doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(yourHtml);
if ( doc.DocumentNode != null )
{
    // The inner lambda needs its own parameter name; reusing 'e' does not compile.
    var divs = doc.DocumentNode
        .SelectNodes("//div")
        .Where(e => e.Descendants().Any(d => d.Name == "h4"));
    // You now have all of the divs with an 'h4' inside of them.
    // The rest of the element structure, if constant, needs to be examined to get
    // the rest of the content you're after.
}
If it's a web page, why would you need to parse the HTML? Wouldn't the technology you are using to build the page give you access to all of its elements? For example, if you are using ASP.NET, you could assign ids to your UL and LI (with runat="server") and they would be available in the code-behind.
Could you explain what you are trying to do? If you are making a web request and downloading the HTML as a string, then scraping the HTML would make sense.
EDIT
I think this should work:
HtmlDocument doc = new HtmlDocument();
doc.Load(myHtmlFile);
// Note: SelectNodes returns null when nothing matches, so guard it on real pages.
foreach (HtmlNode div in doc.DocumentNode.SelectNodes("//div"))
{
    // GetAttributeValue avoids a NullReferenceException on divs without an id.
    if (div.GetAttributeValue("id", "") == "content")
    {
        foreach (HtmlNode ul in div.SelectNodes("./ul"))
        {
            // Keep only lists whose nearest preceding element is the <h4>Header</h4>.
            HtmlNode header = ul.SelectSingleNode("preceding-sibling::*[1][self::h4]");
            if (header != null && header.InnerText.Trim() == "Header")
            {
                foreach (HtmlNode li in ul.SelectNodes("./li"))
                {
                    // li is one <li> item of the list
                }
            }
        }
    }
}
If all you want is the content of every <li></li> that sits underneath the <div id="content"> tag and comes right after an <h4> tag, then this should suffice:
//Load your document first.
//Load() accepts a Stream, a TextReader, or a string path to the file on your computer
//If the entire document is loaded into a string, then use .LoadHtml() instead.
HtmlDocument mainDoc = new HtmlDocument();
//Verbatim string: in a regular literal, "\f" would be read as a form-feed escape.
mainDoc.Load(@"c:\foobar.html");
//Select all the <li> nodes that are inside of an element with the id of "content"
// and come directly after an <h4> tag.
//The leading "." keeps the search inside the "content" element; a bare "//" would
// search the whole document.
HtmlNodeCollection processMe = mainDoc.GetElementbyId("content")
    .SelectNodes(".//h4/following-sibling::*[1]//li");
//Iterate through each <li> node and print the inner text to the console
foreach (HtmlNode listElement in processMe)
{
    Console.WriteLine(listElement.InnerText);
}
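One detail worth checking in a query like this is how it is scoped: in XPath, an expression that begins with // starts from the document root even when it is evaluated on a child node, while .// starts from the current node. The sketch below illustrates that with System.Xml.XmlDocument instead of HtmlAgilityPack (a swap made so it runs on the base class library alone). The sample markup, including the extra "other" div, and all class and method names are assumptions made for the demo.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

public static class ListAfterHeaderDemo
{
    // Hypothetical page with two divs, each holding an <h4> followed by a list.
    public const string Sample =
        "<body>" +
        "<div id='other'><h4>Other</h4><ul><li>unrelated</li></ul></div>" +
        "<div id='content'><h4>Header</h4><ul><li>wanted</li></ul></div>" +
        "</body>";

    // Evaluates 'xpath' with div#content as the context node and returns the item texts.
    public static List<string> ItemsAfterHeader(string xml, string xpath)
    {
        var doc = new XmlDocument();
        doc.LoadXml(xml);
        XmlNode content = doc.SelectSingleNode("//div[@id='content']");
        var items = new List<string>();
        foreach (XmlNode li in content.SelectNodes(xpath))
        {
            items.Add(li.InnerText.Trim());
        }
        return items;
    }

    public static void Main()
    {
        // ".//" stays inside div#content.
        Console.WriteLine(string.Join(",", ItemsAfterHeader(Sample, ".//h4/following-sibling::*[1]//li"))); // wanted
        // "//" restarts at the document root and also picks up the other div.
        Console.WriteLine(string.Join(",", ItemsAfterHeader(Sample, "//h4/following-sibling::*[1]//li"))); // unrelated,wanted
    }
}
```

On a page with a single <h4> the two forms return the same nodes, so the difference only shows up once other sections use the same header tag.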
