I have a web page consisting of several <div> elements.
I would like to write a program that prints all the li elements inside a <div> after a certain <h4> header. Could anyone give me some help or sample code?
<div id="content">
<h4>Header</h4>
<ul>
<li><a href...></a> THIS IS WHAT I WANT TO GET</li>
</ul>
</div>
When it comes to parsing HTML in C#, don't try to write your own parser. The HTML Agility Pack is almost certainly capable of doing what you want!
What parts are constant:
The 'id' in the DIV?
The h4
Searching a complete HTML document and reacting to every H4 is likely to be a mess, whereas if you know the DIV has the ID "content", then just look for that!
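// Requires: using System.Linq;
var doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(yourHtml);
// SelectNodes returns null (not an empty collection) when nothing matches.
var allDivs = doc.DocumentNode.SelectNodes("//div");
if (allDivs != null)
{
    // The inner lambda needs its own parameter name; reusing 'e' does not compile.
    var divs = allDivs.Where(div => div.Descendants().Any(d => d.Name == "h4"));
    // You now have all of the divs with an 'h4' inside of them.
    // The rest of the element structure, if constant, needs to be examined to get
    // the rest of the content you're after.
}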
If it's a web page, why would you need to do HTML parsing? Wouldn't the technology you are using to build the web page give you access to all the elements of the page? For example, if you are using ASP.NET, you could assign IDs to your UL and LI elements (with the runat="server" attribute), and they would be available in the code-behind.
Could you explain your scenario and what you are trying to do? If you are making a web request and downloading the HTML as a string, then scraping the HTML would make sense.
EDIT
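I think this should work:
HtmlDocument doc = new HtmlDocument();
doc.Load(myHtmlFile);
foreach (HtmlNode div in doc.DocumentNode.Descendants("div"))
{
    // GetAttributeValue avoids a NullReferenceException on divs without an id.
    if (div.GetAttributeValue("id", "") != "content")
        continue;
    foreach (HtmlNode ul in div.Elements("ul"))
    {
        // Skip whitespace text nodes to reach the element just before the <ul>.
        HtmlNode previous = ul.PreviousSibling;
        while (previous != null && previous.NodeType != HtmlNodeType.Element)
            previous = previous.PreviousSibling;
        if (previous != null && previous.Name == "h4" && previous.InnerText.Trim() == "Header")
        {
            foreach (HtmlNode li in ul.Elements("li"))
            {
                // li is one <li> item of the list that follows the header
            }
        }
    }
}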
If all you want is the content of the <li></li> tags that sit underneath the <div id="content"> tag and come right after an <h4> tag, then this should suffice:
//Load your document first.
//Load() accepts a Stream, a TextReader, or a string path to the file on your computer.
//If the entire document is loaded into a string, then use .LoadHtml() instead.
HtmlDocument mainDoc = new HtmlDocument();
//Verbatim string (@), so that \f is not read as a form-feed escape.
mainDoc.Load(@"c:\foobar.html");
//Select all the <li> nodes that are inside of the element with the id of "content"
// and come directly after an <h4> tag. The leading "." keeps the search inside
// that element; a bare "//" would search the whole document.
HtmlNodeCollection processMe = mainDoc.GetElementbyId("content")
    .SelectNodes(".//h4/following-sibling::*[1]//li");
//SelectNodes returns null when nothing matches.
if (processMe != null)
{
    //Iterate through each <li> node and print the inner text to the console
    foreach (HtmlNode listElement in processMe)
    {
        Console.WriteLine(listElement.InnerText);
    }
}
I have some text like the following:
<span style="font-weight: 700;">Aanbod wielen (banden + velgen) </span>
<br><br>
<span style="font-weight: 500;">lichtmetalen originele Volvo set met winterbanden:<br>origineel:</span> Volvo<br>
<b>inch maat:</b> 15''<br>
<p>steek:</p> 5x108mm<br>
<span style="font-weight: 700;">naafgat:</span>
I need to find each span tag with an inline font-weight style and replace it with a <b> tag in C#, replacing the closing tag with </b> as well. I need the text to look like this:
<b>Aanbod wielen (banden + velgen)</b>
<br><br>
<b>lichtmetalen originele Volvo set met winterbanden:<br>origineel:</b> Volvo <br>
<b>inch maat:</b> 15''<br>
<p>steek:</p> 5x108mm<br>
<b>naafgat:</b>
So how can we identify those spans? Please help.
You can replace your span with b by using HtmlAgilityPack, which is free and open source.
You can install HtmlAgilityPack from NuGet: Install-Package HtmlAgilityPack -Version 1.8.9
public string ReplaceSpanByB()
{
HtmlDocument doc = new HtmlDocument();
string htmlContent = File.ReadAllText(@"C:\Users\xxx\source\repos\ConsoleApp4\ConsoleApp4\Files\HTMLPage1.html");
doc.LoadHtml(htmlContent);
if (doc.DocumentNode.SelectNodes("//span") != null)
{
foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//span"))
{
var attributes = node.Attributes;
foreach (var item in attributes)
{
if (item.Name.Equals("style") && item.Value.Contains("font-weight"))
{
HtmlNode b = doc.CreateElement("b");
b.InnerHtml = node.InnerHtml;
node.ParentNode.ReplaceChild(b, node);
}
}
}
}
return doc.DocumentNode.OuterHtml;
}
Output:
1st: Don't use Regex. Though it is possible and seems logical to use,
it is mostly wrong and full of pain.
A happy post about it can be found HERE
2nd:
use an HTML parser such as https://html-agility-pack.net/ to traverse the tree
(you can use xPath to easily find all the span elements you want to replace)
and replace any span elements with a b (don't forget to set the new b element contents)
Side note: as far as I recall, the b tag is discouraged,
so if you only need the span text to be bold...
it already is, because of "font-weight:bold".
On https://developer.mozilla.org/en-US/docs/Web/HTML/Element/b :
"Historically, the <b> element was meant to make text boldface. Styling information has been deprecated since HTML4, so the meaning of the <b> element has been changed." and "The HTML Bring Attention To element (<b>) is used to draw the reader's attention to the element's contents, which are not otherwise granted special importance." – Thanks @Richardissimo
I am currently attempting to parse a link from an HTML doc based off the header above it, but no matter what I try, the program is unable to find it.
Here is the method I have that isn't working:
public string findMajorURL(string collegeURL, string major)
{
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(collegeURL);
var root = doc.DocumentNode;
var htmlNodes = root.Descendants();
//Find html node containing the major heading
foreach(HtmlNode node in htmlNodes)
{
if (node.InnerText == major)
{
HtmlNode target = node.NextSibling;
List<string> links = target.Descendants("a").Select(a => a.Attributes["href"].Value).ToList();
return links.First()+ "__IT WORKED__";
}
}
return "Major not found";
}
This is what the HTML looks like that I am attempting to parse:
<div style="padding-left: 20px">
<h3 id="ent1629">Biological Sciences </h3>
<a href="preview_entity.php?catoid=5&ent_oid=1629&returnto=818">Go to information for this department.</a>
<br>
<p>...</p>
<div id="data_c_1629" style="display: none">...</div>
<!--script language="javascript">hideshow(data_c_1630)</script-->
The major the user inputs is supposed to match the heading, Biological Sciences. Based off of the header, I want to get the link under it, which in this case is preview_entity.php?catoid=5&ent_oid=1629&returnto=818
WARNING: I cannot use XPath with the version of Visual Studio that I have, so I'm assuming that using LINQ somehow would be the best way to go, but again I'm not sure.
EDIT It turns out that the Inner Text is not matching the major, however, I don't see how that's possible, as I took it directly from the html code. Any ideas as to what's wrong?
According to the HTML snippet posted, node inside your if block references <h3> element and target references next sibling of <h3> which is <a>. That said, you don't need to do target.Descendants("a"). Just get href attribute from target directly :
if (node.InnerText == major)
{
HtmlNode target = node.NextSibling;
return target.GetAttributeValue("href", "")+ "__IT WORKED__";
}
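Two things may explain why the comparison in the question never succeeds: in the posted HTML the heading text is "Biological Sciences " with a trailing space, and Html Agility Pack keeps the whitespace between tags as text nodes, so NextSibling can be a text node rather than the link. A minimal LINQ-only sketch under those assumptions (the class MajorLinkFinder and the method FindMajorUrl are hypothetical names, and the html argument is assumed to hold the downloaded page source, not its URL):

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

public static class MajorLinkFinder
{
    // Returns the href of the first <a> sibling that follows the <h3> whose
    // text matches 'major', or null when no such heading or link exists.
    public static string FindMajorUrl(string html, string major)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html); // LoadHtml expects markup, not a URL

        // Decode entities and trim before comparing, so that
        // "Biological Sciences " still matches "Biological Sciences".
        HtmlNode heading = doc.DocumentNode
            .Descendants("h3")
            .FirstOrDefault(h =>
                HtmlEntity.DeEntitize(h.InnerText).Trim() == major.Trim());
        if (heading == null)
            return null;

        // Walk the following siblings, skipping text and comment nodes,
        // until an <a> element is found.
        for (HtmlNode sibling = heading.NextSibling;
             sibling != null;
             sibling = sibling.NextSibling)
        {
            if (sibling.NodeType == HtmlNodeType.Element && sibling.Name == "a")
                return sibling.GetAttributeValue("href", "");
        }
        return null;
    }
}
```

Calling MajorLinkFinder.FindMajorUrl(pageSource, "Biological Sciences") would then return the href of the link that follows the heading, provided the link is a sibling of the <h3> as in the posted snippet.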
I'm trying to pull text from a "div" and to exclude everything else. Can you help me please ?!
<div class="article">
<div class="date">01.01.2000</div>
<div class="news-type">Breaking News</div>
"Here is the location of the text i would like to pull"
</div>
When I pull "article" class i get everything, but i'm unable/don't know how to exclude class="date", class="news-type", and everything in it.
Here is the code i use:
HtmlDocument doc = web.Load(url);
foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//div[contains(@class,'article')]"))
{
name_text.text += node.InnerHtml.Trim();
}
Thank you!
Another way would be using XPath /text()[normalize-space()] to get non-empty, direct-child text nodes from the div elements :
var divs = doc.DocumentNode.SelectNodes("//div[contains(@class,'article')]");
foreach (HtmlNode div in divs)
{
var node = div.SelectSingleNode("text()[normalize-space()]");
Console.WriteLine(node.InnerText.Trim());
}
dotnetfiddle demo
output :
"Here is the location of the text i would like to pull"
You want the ChildNodes that are type HtmlTextNode. Untested suggested code:
var textNodes = node.ChildNodes.OfType<HtmlTextNode>();
if (textNodes.Any())
{
name_text.text += string.Join(string.Empty, textNodes.Select(tn => tn.InnerHtml));
}
I would like to get a link (URL to be specific) inside a div class. This is the code I have that gets me the text inside div class (Some text...).
foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//div[@class='content']"))
{
//saves text (node.InnerText) in array
}
This is the HTML from the site. I would like to get www.google.com
<div class="content">
<p>Some text...
<a href="www.google.com">LINK</a>
</p>
</div>
foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//div[@class='novica']/p/a[@href='www.google.com']"))
{
//saves text (node.InnerText) in array
}
That code is not valid for the HTML you posted; however, you have 2 options:
Once you have the node for the div, use .Descendants("a") or its child nodes to pull out the link, then get its href attribute.
Amend your SelectNodes() XPath to get the a tag instead: //div[@class='novica']/p/a.
The first is obviously better if you also need the .InnerText of that element to get Some text...; however, the second would be faster.
foreach (var node in doc.DocumentNode.SelectNodes("//div[@class='novica']"))
{
    //Collect the href of every <a> inside the div ("" when an <a> has no href)
    var links = node.Descendants("a").Select(n => n.GetAttributeValue("href", "")).ToList();
}
Could someone please help me resolve a problem parsing sequential HTML tags with the Agility Pack in C#? I have 2 questions, listed below.
In this case, I want to parse the following HTML and store it in a structure (list, stack, etc.) so I can use the data effectively.
<h3> header </h3>
<p> paragraph 1</p>
<p>
<a href="www.google.com">Google</a>
<a href="www.gizmodo.com">Gizmodo</a>
</p>
<ul>
<li> something is here with a download
<a href="www.google.com">link</a>
</li>
<li> hello
<img src="www.imagesource.com"/>
</li>
</ul>
How do I parse this data in a sequential manner?
If I use var ParaTags = HtmlDocument.DocumentNode.Descendants("p");,
then I only get the "p" tags. But I don't know how to get the "h3" and then the "p" in sequence, because "p" is not inside "h3".
The following code returns all the hyperlinks:
var links =
from paras in document.DocumentNode.Descendants("p")
from hyperLinks in paras.Descendants("a").Where(x => x.Attributes["href"].Value != "")
select hyperLinks;
What's the best way to parse and store this mixed content of strings, hyperlinks, and images
so I can output it later in an efficient way? List, stack?
In other words, I want to store every possible piece of content from the HTML and preserve its format if possible, so I can reassemble the content in the proper format once I reload it into the app.
Thank you!
If you want to extract all href and src attributes you may try this:
using System;
using System.Linq;
using HtmlAgilityPack;
public class Program
{
static void Main()
{
var document = new HtmlDocument();
document.Load("test.html");
var links =
from element in document.DocumentNode.Descendants()
let href = element.Attributes["href"]
let src = element.Attributes["src"]
where href != null || src != null
select href != null ? href.Value : src.Value;
foreach (var link in links)
{
Console.WriteLine(link);
}
}
}
outputs:
www.google.com
www.gizmodo.com
www.google.com
www.imagesource.com
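For the first part of the question (reading the h3, p and ul content in sequence), one possible sketch is to walk the tree once and keep only the tags of interest. Descendants() yields nodes in document order, so the h3 comes out before the paragraphs that follow it. The ContentItem type, its fields, and the set of tag names below are illustrative assumptions, not part of Html Agility Pack:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

// Hypothetical record of one piece of content, kept in document order.
public class ContentItem
{
    public string Tag;   // "h3", "p", "li", "a" or "img"
    public string Text;  // visible text, trimmed
    public string Url;   // href or src; empty when the tag has neither
}

public static class SequentialParser
{
    // Assumed set of tags worth storing; adjust to the real page.
    private static readonly HashSet<string> Wanted =
        new HashSet<string> { "h3", "p", "li", "a", "img" };

    public static List<ContentItem> Parse(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        return doc.DocumentNode
            .Descendants()                        // pre-order, i.e. document order
            .Where(n => Wanted.Contains(n.Name))  // text nodes are named "#text" and drop out
            .Select(n => new ContentItem
            {
                Tag = n.Name,
                Text = HtmlEntity.DeEntitize(n.InnerText).Trim(),
                // Prefer href, fall back to src, else empty.
                Url = n.GetAttributeValue("href", n.GetAttributeValue("src", ""))
            })
            .ToList();
    }
}
```

A List keeps the items in the order they appear on the page, which is usually what is needed to rebuild the content later. Note that the text of a p or li item includes the text of any link inside it, and that link also appears as its own item right after its parent.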