I would like to get a link (URL to be specific) inside a div class. This is the code I have that gets me the text inside div class (Some text...).
foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//div[@class='content']"))
{
//saves text (node.InnerText) in array
}
This is the HTML from the site. I would like to get www.google.com
<div class="content">
<p>Some text...
<a href="www.google.com">LINK</a>
</p>
</div>
foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//div[@class='novica']/p/a[@href='www.google.com']"))
{
//saves text (node.InnerText) in array
}
That code is not valid as written; however, you have 2 options:
Once you have the node for the div, use .Descendants("a") or its child nodes to pull out the link, then get its href attribute.
Amend your SelectNodes() XPath to select the a tag instead: //div[@class='novica']/p/a.
The first is the better choice if you also need the .InnerText of that element to get Some text...; the second would be faster, however.
foreach (var node in doc.DocumentNode.SelectNodes("//div[@class='novica']"))
{
var links = node.Descendants("a").Select(n => n.InnerText).ToList();
}
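Since the question asks for the URL rather than the link text, here is a minimal sketch of reading the href attribute instead of InnerText. The class name LinkExtractor, the method GetHrefs, and its parameters are hypothetical names used only for illustration, and the sketch assumes the anchor in the posted HTML carries an href attribute.

```csharp
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

// Hypothetical helper, not part of HtmlAgilityPack.
public static class LinkExtractor
{
    // Returns the href of every <a> found anywhere inside a <div>
    // whose class attribute is exactly divClass.
    public static List<string> GetHrefs(string html, string divClass)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // With default settings SelectNodes returns null when nothing matches.
        var anchors = doc.DocumentNode.SelectNodes(
            "//div[@class='" + divClass + "']//a[@href]");
        if (anchors == null)
            return new List<string>();

        // GetAttributeValue falls back to the default instead of throwing.
        return anchors
            .Select(a => a.GetAttributeValue("href", string.Empty))
            .ToList();
    }
}
```

Calling LinkExtractor.GetHrefs(html, "content") on the markup from the question should give a list holding the single link target.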
I've been using HtmlAgilityPack in order to parse some html in a web page. The current html looks like this:
<div class="price__child price__price flex-child__auto tooltip-container">
<div class="price__min-order tooltip-container js-minOrder">
<i>⚠️</i>
<div class="price__min-order-tooltip tooltip">
Minimum order of $15.00.
</div>
</div>
$1.75
</div>
I only want to retrieve the text of the price at the very end, in this case the $1.75. Doing something like the code below returns that number, but also all of the other text within the larger div.
return node
.SelectSingleNode(".//div[contains(@class, 'price__child price__price')]")
.InnerText
.Trim().Replace(" ", "")
.TrimStart('$');
Is there a way to exclude/not grab the innertext from the price__min-order tooltip-container js-minOrder and also the price__min-order-tooltip tooltip, and only grab the 1.75 from the larger div?
I found a way to do it. If you select the unwanted child node and call Remove() on it, its text no longer appears in the parent's InnerText.
var priceNode = node
.SelectSingleNode(".//div[contains(@class, 'price__child price__price')]")
?.ChildNodes[1];
priceNode?.Remove();
return node
.SelectSingleNode(".//div[contains(@class, 'price__child price__price')]")
.InnerText
.Trim().Replace(" ", "")
.TrimStart('$');
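An alternative that leaves the document untouched is to read only the text nodes that are direct children of the price div. The following is a sketch, not a tested drop-in: PriceReader and GetOwnText are hypothetical names, and it assumes the price is the only non-whitespace text sitting directly inside the div.

```csharp
using System.Linq;
using HtmlAgilityPack;

// Hypothetical helper, not part of HtmlAgilityPack.
public static class PriceReader
{
    // Joins the non-empty text nodes that are direct children of container,
    // ignoring any text that belongs to nested elements.
    public static string GetOwnText(HtmlNode container)
    {
        var parts = container.ChildNodes
            .Where(n => n.NodeType == HtmlNodeType.Text)
            .Select(n => n.InnerText.Trim())
            .Where(t => t.Length > 0);
        return string.Join(" ", parts);
    }
}
```

With this, PriceReader.GetOwnText(priceDiv).TrimStart('$') would replace the Remove() call and does not depend on the tooltip being at ChildNodes[1].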
I have some text like the following:
<span style="font-weight: 700;">Aanbod wielen (banden + velgen) </span>
<br><br>
<span style="font-weight: 500;">lichtmetalen originele Volvo set met winterbanden:<br>origineel:</span> Volvo<br>
<b>inch maat:</b> 15''<br>
<p>steek:</p> 5x108mm<br>
<span style="font-weight: 700;">naafgat:</span>
I need to find every span tag that has an inline font-weight style and replace it with a <b> tag, replacing the closing </span> with </b> as well, in C#. The result should look like this:
<b>Aanbod wielen (banden + velgen)</b>
<br><br>
<b>lichtmetalen originele Volvo set met winterbanden:<br>origineel:</b> Volvo <br>
<b>inch maat:</b> 15''<br>
<p>steek:</p> 5x108mm<br>
<b>naafgat:</b>
How can I identify these tags? Please help.
You can replace your span with b by using HtmlAgilityPack, which is free and open source.
You can install HtmlAgilityPack from NuGet: Install-Package HtmlAgilityPack -Version 1.8.9
public string ReplaceSpanByB()
{
HtmlDocument doc = new HtmlDocument();
string htmlContent = File.ReadAllText(@"C:\Users\xxx\source\repos\ConsoleApp4\ConsoleApp4\Files\HTMLPage1.html");
doc.LoadHtml(htmlContent);
if (doc.DocumentNode.SelectNodes("//span") != null)
{
foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//span"))
{
var attributes = node.Attributes;
foreach (var item in attributes)
{
if (item.Name.Equals("style") && item.Value.Contains("font-weight"))
{
HtmlNode b = doc.CreateElement("b");
b.InnerHtml = node.InnerHtml;
node.ParentNode.ReplaceChild(b, node);
}
}
}
}
return doc.DocumentNode.OuterHtml;
}
1st: Don't use Regex. It is possible, and it may seem like the logical tool, but for HTML
it is mostly wrong and full of pain.
A happy post about it can be found HERE.
2nd:
Use an HTML parser such as https://html-agility-pack.net/ to traverse the tree
(you can use XPath to easily find all the span elements you want to replace)
and replace each span element with a b element (don't forget to copy the span's contents into the new b element).
Side note: As far as I recall, the b tag is discouraged,
so if you only need the span text to be bold,
it already is, because of "font-weight:bold".
On https://developer.mozilla.org/en-US/docs/Web/HTML/Element/b :
"Historically, the element was meant to make text boldface. Styling information has been deprecated since HTML4, so the meaning of the element has been changed." and "The HTML Bring Attention To element (<b>) is used to draw the reader's attention to the element's contents, which are not otherwise granted special importance." – Thanks @Richardissimo
I am currently attempting to parse a link from an HTML doc based off the header above it, but no matter what I try, the program is unable to find it.
Here is the method I have that isn't working:
public string findMajorURL(string collegeURL, string major)
{
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(collegeURL);
var root = doc.DocumentNode;
var htmlNodes = root.Descendants();
//Find html node containing the major heading
foreach(HtmlNode node in htmlNodes)
{
if (node.InnerText == major)
{
HtmlNode target = node.NextSibling;
List<string> links = target.Descendants("a").Select(a => a.Attributes["href"].Value).ToList();
return links.First()+ "__IT WORKED__";
}
}
return "Major not found";
}
This is what the HTML looks like that I am attempting to parse:
<div style="padding-left: 20px">
<h3 id="ent1629">Biological Sciences </h3>
<a href="preview_entity.php?catoid=5&ent_oid=1629&returnto=818">Go to information for this department.</a>
<br>
<p>...</p>
<div id="data_c_1629" style="display: none">...</div>
<!--script language="javascript">hideshow(data_c_1630)</script-->
The major the user inputs is supposed to match the heading, Biological Sciences. Based off of the header, I want to get the link under it, which in this case is preview_entity.php?catoid=5&ent_oid=1629&returnto=818
WARNING: I cannot use XPath with the version of Visual Studio that I have, so I'm assuming that using LINQ would be the best way to go, but again I'm not sure.
EDIT: It turns out that the InnerText is not matching the major. However, I don't see how that's possible, as I took it directly from the HTML code. Any ideas as to what's wrong?
According to the HTML snippet posted, node inside your if block references the <h3> element, and target references the next sibling of <h3>, which is <a>. That said, you don't need target.Descendants("a"). Just get the href attribute from target directly:
if (node.InnerText == major)
{
HtmlNode target = node.NextSibling;
return target.GetAttributeValue("href", "")+ "__IT WORKED__";
}
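Regarding the EDIT in the question: the heading in the posted HTML reads "Biological Sciences " with a trailing space, so an exact == comparison against the user's input would fail. The sketch below trims both sides before comparing and then walks forward through the siblings until it reaches an a element, because NextSibling can also be a whitespace text node. It uses LINQ only, no XPath. MajorLinkFinder and FindLinkAfterHeading are hypothetical names, and the sketch assumes the HTML has already been downloaded into a string (LoadHtml parses markup; it does not fetch a URL).

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

// Hypothetical helper, not part of HtmlAgilityPack.
public static class MajorLinkFinder
{
    // Returns the href of the first <a> sibling that follows the <h3>
    // whose trimmed text equals heading, or null when nothing matches.
    public static string FindLinkAfterHeading(string html, string heading)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Decode entities and trim so "Biological Sciences " still matches.
        HtmlNode h3 = doc.DocumentNode.Descendants("h3")
            .FirstOrDefault(n => HtmlEntity.DeEntitize(n.InnerText).Trim()
                .Equals(heading.Trim(), StringComparison.OrdinalIgnoreCase));
        if (h3 == null)
            return null;

        // Skip text nodes and other elements until an <a> is found.
        for (HtmlNode s = h3.NextSibling; s != null; s = s.NextSibling)
        {
            if (s.Name == "a")
                return s.GetAttributeValue("href", null);
        }
        return null;
    }
}
```

The case-insensitive comparison is a design choice for user-typed input; switch to StringComparison.Ordinal if an exact match is required.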
I'm trying to pull the text from a div and exclude everything else. Can you help me, please?
<div class="article">
<div class="date">01.01.2000</div>
<div class="news-type">Breaking News</div>
"Here is the location of the text i would like to pull"
</div>
When I pull the "article" class I get everything, but I don't know how to exclude class="date", class="news-type", and everything in them.
Here is the code I use:
HtmlDocument doc = web.Load(url);
foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//div[contains(@class,'article')]"))
{
name_text.text += node.InnerHtml.Trim();
}
Thank you!
Another way would be to use the XPath text()[normalize-space()] to get the non-empty, direct-child text nodes of the div elements:
var divs = doc.DocumentNode.SelectNodes("//div[contains(@class,'article')]");
foreach (HtmlNode div in divs)
{
var node = div.SelectSingleNode("text()[normalize-space()]");
Console.WriteLine(node.InnerText.Trim());
}
dotnetfiddle demo
Output:
"Here is the location of the text i would like to pull"
You want the ChildNodes that are type HtmlTextNode. Untested suggested code:
var textNodes = node.ChildNodes.OfType<HtmlTextNode>();
if (textNodes.Any())
{
name_text.text += string.Join(string.Empty, textNodes.Select(tn => tn.InnerHtml));
}
I have a web page consisting of several <div> elements.
I would like to write a program that prints all the li elements inside a <div> after a certain <h4> header. Could anyone give me some help or sample code?
<div id="content">
<h4>Header</h4>
<ul>
<li><a href...></a> THIS IS WHAT I WANT TO GET</li>
</ul>
</div>
When it comes to parsing HTML in C#, don't try to write your own parser. The HTML Agility Pack is almost certainly capable of doing what you want!
What parts are constant:
The 'id' in the DIV?
The h4
Searching a complete HTML document and reacting on H4 alone is likely to be a mess, whereas if you know the DIV has the ID of "content" then just look for that!
var doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(yourHtml);
if ( doc.DocumentNode != null )
{
var divs = doc.DocumentNode
.SelectNodes("//div")
// the inner lambda uses its own parameter name so it does not clash with the outer one
.Where(e => e.Descendants().Any(d => d.Name == "h4"));
// You now have all of the divs with an 'h4' inside of it.
// The rest of the element structure, if constant needs to be examined to get
// the rest of the content you're after.
}
If it's your own web page, why would you need to parse the HTML? Wouldn't the technology you are using to build the page give you access to all of its elements? For example, if you are using ASP.NET, you could assign ids to your UL and LI (with the runat="server" attribute) and they would be available in the code-behind.
Could you explain your scenario and what you are trying to do? If you are making a web request and downloading the HTML as a string, then scraping the HTML would make sense.
EDIT
I think this should work:
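HtmlDocument doc = new HtmlDocument();
doc.Load(myHtmlFile);
foreach (HtmlNode div in doc.DocumentNode.Descendants("div"))
{
    // GetAttributeValue avoids a NullReferenceException on divs without an id
    if (div.GetAttributeValue("id", "") == "content")
    {
        foreach (HtmlNode ul in div.Elements("ul"))
        {
            // Walk back past whitespace text nodes to the preceding element
            HtmlNode previous = ul.PreviousSibling;
            while (previous != null && previous.NodeType != HtmlNodeType.Element)
                previous = previous.PreviousSibling;
            if (previous != null && previous.Name == "h4" && previous.InnerText.Trim() == "Header")
            {
                foreach (HtmlNode li in ul.Elements("li"))
                {
                    // li.InnerText holds the text of each list item
                }
            }
        }
    }
}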
If all you want is the stuff that's between all <li></li> tags underneath the <div id="content"> tag and comes right after a <h4> tag, then this should suffice:
//Load your document first.
//Load() accepts a Stream, a TextReader, or a string path to the file on your computer
//If the entire document is loaded into a string, then use .LoadHtml() instead.
HtmlDocument mainDoc = new HtmlDocument();
mainDoc.Load(@"c:\foobar.html");
//Select all the <li> nodes that are inside of an element with the id of "content"
// and come directly after an <h4> tag.
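//The leading "." keeps the search inside the "content" element; a bare "//" would search the whole document.
HtmlNodeCollection processMe = mainDoc.GetElementbyId("content")
.SelectNodes(".//h4/following-sibling::*[1]//li");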
//Iterate through each <li> node and print the inner text to the console
foreach (HtmlNode listElement in processMe)
{
Console.WriteLine(listElement.InnerText);
}
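If the list has to follow one particular h4 rather than any h4, a variant along these lines may help. It is a sketch under assumptions: ListAfterHeader, GetItems, and the parameter names are hypothetical, the header is matched by its trimmed text, and the list is taken to be the first element sibling after that header. It also guards against the null values that GetElementbyId and SelectNodes return when nothing is found.

```csharp
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

// Hypothetical helper, not part of HtmlAgilityPack.
public static class ListAfterHeader
{
    // Returns the trimmed text of every <li> in the element that directly
    // follows the <h4> with the given text, inside the element with containerId.
    public static List<string> GetItems(string html, string containerId, string headerText)
    {
        var result = new List<string>();
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        HtmlNode container = doc.GetElementbyId(containerId);
        if (container == null)
            return result;

        HtmlNode header = container.Descendants("h4")
            .FirstOrDefault(h => h.InnerText.Trim() == headerText);
        if (header == null)
            return result;

        // Relative XPath: first element sibling after the header, then its <li> nodes.
        HtmlNodeCollection items = header.SelectNodes("following-sibling::*[1]//li");
        if (items == null)
            return result;

        foreach (HtmlNode li in items)
            result.Add(li.InnerText.Trim());
        return result;
    }
}
```

For the markup in the question, ListAfterHeader.GetItems(html, "content", "Header") would return the text of the single list item.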