Html Agility Pack get specific content from a div - c#

I'm trying to pull text from a "div" and to exclude everything else. Can you help me please ?!
<div class="article">
<div class="date">01.01.2000</div>
<div class="news-type">Breaking News</div>
"Here is the location of the text i would like to pull"
</div>
When I pull "article" class i get everything, but i'm unable/don't know how to exclude class="date", class="news-type", and everything in it.
Here is the code i use:
HtmlDocument doc = web.Load(url);
foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//div[contains(#class,'article')]"))
{
name_text.text += node.InnerHtml.Trim();
}
Thank you!

Another way would be using XPath /text()[normalize-space()] to get non-empty, direct-child text nodes from the div elements :
var divs = doc.DocumentNode.SelectNodes("//div[contains(#class,'article')]");
foreach (HtmlNode div in divs)
{
var node = div.SelectSingleNode("text()[normalize-space()]");
Console.WriteLine(node.InnerText.Trim());
}
dotnetfiddle demo
output :
"Here is the location of the text i would like to pull"

You want the ChildNodes that are type HtmlTextNode. Untested suggested code:
var textNodes = node.ChildNodes.OfType<HtmlTextNode>();
if (textNodes.Any())
{
name_text.text += string.Join(string.Empty, textNodes.Select(tn => tn.InnerHtml));
}

Related

Get innerText from <div class> with an <a href> child

I am working with a webBrowser in C# and I need to get the text from the link. The link is just a href without a class.
its like this
<div class="class1" title="myfirstClass">
<a href="link.php">text I want read in C#
<span class="order-level"></span>
Shouldn't it be something like this?
HtmlElementCollection theElementCollection = default(HtmlElementCollection);
theElementCollection = webBrowser1.Document.GetElementsByTagName("div");
foreach (HtmlElement curElement in theElementCollection)
{
if (curElement.GetAttribute("className").ToString() == "class1")
{
HtmlElementCollection childDivs = curElement.Children.GetElementsByName("a");
foreach (HtmlElement childElement in childDivs)
{
MessageBox.Show(childElement.InnerText);
}
}
}
This is how you get the element by tag name:
String elem = webBrowser1.Document.GetElementsByTagName("div");
And with this you should extract the value of the href:
var hrefLink = XElement.Parse(elem)
.Descendants("a")
.Select(x => x.Attribute("href").Value)
.FirstOrDefault();
If you have more then 1 "a" tag in it, you could also put in a foreach loop if that is what you want.
EDIT:
With XElement:
You can get the content including the outer node by calling element.ToString().
If you want to exclude the outer tag, you can call String.Concat(element.Nodes()).
To get innerHTML with HtmlAgilityPack:
Install HtmlAgilityPack from NuGet.
Use this code.
HtmlWeb web = new HtmlWeb();
HtmlDocument dc = web.Load("Your_Url");
var s = dc.DocumentNode.SelectSingleNode("//a[#name="a"]").InnerHtml;
I hope it helps!
Here I created the console app to extract the text of anchor.
static void Main(string[] args)
{
string input = "<div class=\"class1\" title=\"myfirstClass\"><a href=\"link.php\">text I want read in C#<span class=\"order-level\"></span>";
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(input);
foreach (HtmlNode item in doc.DocumentNode.Descendants("div"))
{
var link = item.Descendants("a").First();
var text = link.InnerText.Trim();
Console.Write(text);
}
Console.ReadKey();
}
Note this is htmlagilitypack question so tag the question properly.

How to find a link in HTML under a certain header AND parse it

I am currently attempting to parse a link from an HTML doc based off the header above it, but no matter what I try, the program is unable to find it.
Here is the method I have that isn't working:
public string findMajorURL(string collegeURL, string major)
{
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(collegeURL);
var root = doc.DocumentNode;
var htmlNodes = root.Descendants();
//Find html node containing the major heading
foreach(HtmlNode node in htmlNodes)
{
if (node.InnerText == major)
{
HtmlNode target = node.NextSibling;
List<string> links = target.Descendants("a").Select(a => a.Attributes["href"].Value).ToList();
return links.First()+ "__IT WORKED__";
}
}
return "Major not found";
}
This is what the HTML looks like that I am attempting to parse:
<div style="padding-left: 20px">
<h3 id="ent1629">Biological Sciences </h3>
Go to information for this department.
<br>
<p>...</p>
<div id="data_c_1629" style="display: none">...</div>
<!--script language="javascript">hideshow(data_c_1630)</script-->
The major the user inputs is supposed to match the heading, Biological Sciences. Based off of the header, I want to get the link under it, which in this case is preview_entity.php?catoid=5&ent_oid=1629&returnto=818
WARNING: I cannot use XPath withthe version of Visual Studio that I have, so I'm assuming using LINQ somehow would be the best way to go, but again I'm not sure.
EDIT It turns out that the Inner Text is not matching the major, however, I don't see how that's possible, as I took it directly from the html code. Any ideas as to what's wrong?
According to the HTML snippet posted, node inside your if block references <h3> element and target references next sibling of <h3> which is <a>. That said, you don't need to do target.Descendants("a"). Just get href attribute from target directly :
if (node.InnerText == major)
{
HtmlNode target = node.NextSibling;
return target.GetAttributeValue("href", "")+ "__IT WORKED__";
}

text of node's inner text and first child nodes text

I have multiple links in page of structure like this:
<a ....>
<b>Text I Need</b>
Also Text I need
</a>
And i want to extract string for example from code above "Text I NeedAlso Text I need"
I successfully extract second part, but I'm not sure how to select text inside b tags as well, currently I'm using this:
var link_list = doc.DocumentNode.SelectNodes(#"/a/text()");
foreach (var link in link_list)
{
Console.WriteLine(link.InnerText);
}
Should i perhaps instead get not text but html of a and remove tags with regex and extract text then, or is there some other ways?
Accessing InnerText property of <a> should give you all text nodes at once :
var html = #"<a ....>
<b>Text I Need</b>
Also Text I need
</a>";
var doc = new HtmlDocument();
doc.LoadHtml(html);
var link_list = doc.DocumentNode.SelectNodes("/a");
foreach (var link in link_list)
{
Console.WriteLine(link.InnerText);
}
or if you really need to get only direct child text nodes and grand child text nodes, try this way :
var link_list = doc.DocumentNode.SelectNodes("/a");
foreach (var link in link_list)
{
var texts = link.SelectNodes("text() | */text()");
Console.WriteLine(String.Join("", texts.Select(o => o.InnerText)));
}
output :
Text I Need
Also Text I need

Select link inside div tag

I would like to get a link (URL to be specific) inside a div class. This is the code I have that gets me the text inside div class (Some text...).
foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//div[#class='content']"))
{
//saves text (node.InnerText) in array
}
This is the HTML from the site. I would like to get www.google.com
<div class="content">
<p>Some text...
LINK
</p>
</div>
foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//div[#class='novica']/p/a[#href='www.google.com']"))
{
//saves text (node.InnerText) in array
}
That code is not valid based on your writing, however you have 2 options:
Once you have the node for the div, use .GetElementsByTagName("a") or the children to pull out the link, then get it's href attribute.
Amend your SelectNodes() XPath to get the a tag instead: //div[#class='novica']/p/a.
The first is obviously better if you do need the .InnerText of that element to get Some text..., however the second would be faaster.
foreach (var node in doc.DocumentNode.SelectNodes("//div[#class='novica']"))
{
var links = node.Descendants("a").Select(n => n.InnerText).ToList();
}

Get all <li> elements from inside a certain <div> with C#

I have a web page consisting of several <div> elements.
I would like to write a program that prints all the li elements inside a <div> after a certain <h4> header. Could anyone give me some help or sample code?
<div id="content">
<h4>Header</h4>
<ul>
<li><a href...></a> THIS IS WHAT I WANT TO GET</li>
</ul>
</div>
When it come to parsing HTML in C#, don't try to write your own. The HTML Agility Pack is almost certainly capable of doing what you want!
What parts are constant:
The 'id' in the DIV?
The h4
Searching a complete HTML document and reacting on H4 alone is likely to be a mess, whereas if you know the DIV has the ID of "content" then just look for that!
var doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(yourHtml);
if ( doc.DocumentNode != null )
{
var divs = doc.DocumentNode
.SelectNodes("//div")
.Where(e => e.Descendants().Any(e => e.Name == "h4"));
// You now have all of the divs with an 'h4' inside of it.
// The rest of the element structure, if constant needs to be examined to get
// the rest of the content you're after.
}
If its a web page why would you need to do HTML Parsing. Would not the technology that you are using to build the web page would give access to all the element of the page. For example if you are using ASP.NET, you could assign id's to your UL and LI(with runat server tag) and they would be available in code behind ?
Could you explain your scenario what you are trying to do ? If you trying to make a web request, download the html as string, then scrapping the HTML would make sense
EDIT
Think this should work
HtmlDocument doc = new HtmlDocument();
doc.Load(myHtmlFile);
foreach (HtmlNode p in doc.DocumentNode.SelectNodes("//div"))
{
if(p.Attributes["id"].Value == "content")
{
foreach(HtmlNode child in p.ChildNodes.SelectNodes("//ul"))
{
if(p.PreviousSibling.InnerText() == "Header")
{
foreach(HtmlNode liNodes in p.ChildNodes)
{
//liNodes represent all childNode
}
}
}
}
If all you want is the stuff that's between all <li></li> tags underneath the <div id="content"> tag and comes right after a <h4> tag, then this should suffice:
//Load your document first.
//Load() accepts a Stream, a TextReader, or a string path to the file on your computer
//If the entire document is loaded into a string, then use .LoadHtml() instead.
HtmlDocument mainDoc = new HtmlDocument();
mainDoc.Load("c:\foobar.html");
//Select all the <li> nodes that are inside of an element with the id of "content"
// and come directly after an <h4> tag.
HtmlNodeCollection processMe = mainDoc.GetElementbyId("content")
.SelectNodes("//h4/following-sibling::*[1]//li");
//Iterate through each <li> node and print the inner text to the console
foreach (HtmlNode listElement in processMe)
{
Console.WriteLine(listElement.InnerText);
}

Categories

Resources