Html Agility Pack - Replace HTML between two comments - c#

I'm using Html Agility Pack to build a library with several features.
One of them is:
find every part of the HTML contained between a "start comment tag" and an "end comment tag"
replace the part that matches a search string
For example:
I need to find the HTML parts contained between a <!-- data-example-start comment and a <!-- data-example-end comment. Both are keywords (the comments start with those keywords).
The HTML part to replace is the one that contains the keyword "hello".
<body>
<p>Title
</p>
<!-- data-example-start-try_1 -->
<div>
</div>
<span id="hello"> Hi
</span>
<!-- data-example-end-try_1 -->
<!-- data-example-start-goodbye 2-->
<div>
<span id="bye"> Bye
</span>
</div>
<p>
</p>
<!-- data-example-end-goodbye 2-->
</body>
In this case I expect to replace the first HTML part, contained between <!-- data-example-start-try_1 --> and <!-- data-example-end-try_1 -->, because it contains the search word "hello" that I'm looking for.
How can I select, with Html Agility Pack, the HTML parts contained between two HTML comments?
Thanks in advance

Here is an online example that shows how to get the nodes between two comments:
https://dotnetfiddle.net/JlkMot
var doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(html);
var docNode = doc.DocumentNode.InnerHtml;
var descendants = doc.DocumentNode.Descendants().ToList();
var startNode = descendants.FindIndex(x => x.InnerHtml == "<!-- data-example-start-try_1 -->");
var endEnd = descendants.FindIndex(x => x.InnerHtml == "<!-- data-example-end-try_1 -->");
if (startNode != -1 && endEnd != -1)
{
var betweenNodes = descendants.GetRange(startNode + 1, endEnd - startNode - 1);
foreach (var node in betweenNodes)
{
// show 2 times "Hi", once for the span, once for the text
Console.WriteLine(node.InnerHtml);
}
}
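The snippet above only lists the nodes between two known comments. A minimal sketch of the replacement step the question asks for might look like the following. It assumes the start and end comments are siblings under the same parent and that the replacement HTML is a single element. The class name CommentBlockReplacer and the method ReplaceBlock are hypothetical names used for illustration, not part of Html Agility Pack.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

// Hypothetical helper, for illustration only.
static class CommentBlockReplacer
{
    const string StartKeyword = "<!-- data-example-start";
    const string EndKeyword = "<!-- data-example-end";

    public static string ReplaceBlock(string html, string keyword, string replacementHtml)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Materialize the start comments first, because the tree is modified below.
        var startComments = doc.DocumentNode.Descendants()
            .OfType<HtmlCommentNode>()
            .Where(c => c.Comment.StartsWith(StartKeyword, StringComparison.Ordinal))
            .ToList();

        foreach (var start in startComments)
        {
            // Collect the siblings that follow the start comment until an end comment is found.
            var between = new List<HtmlNode>();
            HtmlNode end = null;
            for (var node = start.NextSibling; node != null; node = node.NextSibling)
            {
                if (node is HtmlCommentNode comment &&
                    comment.Comment.StartsWith(EndKeyword, StringComparison.Ordinal))
                {
                    end = node;
                    break;
                }
                between.Add(node);
            }

            // Skip blocks with no end comment or without the search word.
            if (end == null || !between.Any(n => n.OuterHtml.Contains(keyword)))
                continue;

            // Remove the old nodes and put the replacement just before the end comment.
            foreach (var node in between)
                node.Remove();
            end.ParentNode.InsertBefore(HtmlNode.CreateNode(replacementHtml), end);
        }

        return doc.DocumentNode.OuterHtml;
    }
}
```

Called as CommentBlockReplacer.ReplaceBlock(html, "hello", "<p>New content</p>"), this should replace only the block whose markup contains "hello" and leave the other blocks untouched. Note that the keyword check is a plain substring match on the markup, so it also matches attribute values such as id="hello".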

Related

How can I remove the spaces in html tags if tags only contain whitespace? using HTMLAgility C#

<p style="text-align:right;margin-bottom:0pt;margin-top:0pt;text-indent:0%;font-family:Times New Roman;font-size:11pt;"> </p>
Here you can see the space inside the p tag. I want to remove this space throughout the whole HTML document.
I am already using the HTML Agility Pack to remove a few HTML characters, but I'm not sure how to remove this whitespace.
Here is an example of how to do that: it searches for all paragraph elements whose inner text consists only of whitespace and replaces them with empty paragraphs.
var doc = new HtmlDocument();
doc.LoadHtml(
@"<body>
<p> </p>
<span>My span text ! </span>
<p> </p>
</body>");
//Using HtmlAgilityPack.CssSelectors.NetCore
// ToList() materializes the matches once, so the query is not re-run while the document is being modified
var ps = doc.QuerySelectorAll("p").Where(p => p.InnerText.All(char.IsWhiteSpace)).ToList();
foreach (var p in ps)
{
    var newP = HtmlNode.CreateNode("<p></p>");
    p.ParentNode.ReplaceChild(newP, p);
}
doc.Save("demo.html");

c# substring - parse all text in between

I'm trying to parse text (mainly the URL) from the HTML code below, but I would only like to grab the URL between the div tags (result-firstline-title) and (result-url js-result-url), for every occurrence.
To be clear, I am able to grab all the URLs from the HTML source below, but the problem is that each URL is grabbed almost three times. I have a fix for that, which is to remove duplicate URLs; however, if you look carefully at the HTML source, you will see that it also grabs the third URL.
<div class="result js-result card-mobile ">
<div class="result-firstline-container">
<div class="result-firstline-title">
<a
class="result-title js-result-title"
href="https://www.lifewire.com/top-social-networking-sites-people-are-using-3486554"
>
The Top Social Networking Sites People Are Using
</a>
</div>
</div>
<a
class="result-url js-result-url"
href="https://www.lifewire.com/top-social-networking-sites-people-are-using-3486554">https://www.lifewire.com/top-<b>social-networking-sites</b>-people-are...
</a>
<p class="result-snippet">
The Top
</p>
</div>
<div class="result js-result card-mobile ">
<div class="result-firstline-container">
<div class="result-firstline-title">
<a
class="result-title js-result-title"
href="http://www.ebizmba.com/articles/social-networking-websites"
>
Top 15 Most Popular Social Networking Sites | January 2019
</a>
</div>
</div>
<a
class="result-url js-result-url"
href="http://www.ebizmba.com/articles/social-networking-websites">www.ebizmba.com/articles/<b>social-networking</b>-<b>websites</b>
</a>
<p class="result-snippet">
Top 15 Most
</p>
</div>
I have tried the following C# code to grab the text between the div tags, but it grabs everything, which I don't want.
int urlTagFrom = rawHTMLFromSource.IndexOf("result-firstline-title") + "result-firstline-title".Length;
int urlTagTo = rawHTMLFromSource.LastIndexOf("result-url js-result-url");
urlTagCollection = rawHTMLFromSource.Substring(urlTagFrom, urlTagTo - urlTagFrom);
To grab the URLs I am using the following:
var regexURLParser = new Regex(@"(http|ftp|https):\/\/([\w\-_]+(?:(?:\.[\w\-_]+)+))([\w\-\.,@?^=%&:/~\+#]*[\w\-\@?^=%&/~\+#])?", RegexOptions.Singleline | RegexOptions.CultureInvariant);
What I want is to grab the URLs from these:
<a
class="result-title js-result-title"
href="https://www.lifewire.com/top-social-networking-sites-people-are-using-3486554"
>
<a
class="result-title js-result-title"
href="http://www.ebizmba.com/articles/social-networking-websites"
>
so that the outcome shows only:
https://www.lifewire.com/top-social-networking-sites-people-are-using-3486554
http://www.ebizmba.com/articles/social-networking-websites
You can make this easier by using HtmlAgilityPack; just add it to your project with NuGet.
To add HtmlAgilityPack using NuGet,
go to the Package Manager Console and type Install-Package HtmlAgilityPack -Version 1.11.3
After the installation you can extract the URLs as shown below.
var doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(@"put html string here");
var listOfUrls = new List<string>();
doc.DocumentNode.SelectNodes("//a").ToList()
.ForEach(x=>
{
//Use HasClass method to filter elements
if (!string.IsNullOrEmpty(x.GetAttributeValue("href", ""))
&& x.HasClass("result-title") && x.HasClass("js-result-title"))
{
listOfUrls.Add(x.GetAttributeValue("href", ""));
}
});
listOfUrls.ForEach(x => Console.WriteLine(x));
EDIT
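Added && x.HasClass("result-title") && x.HasClass("js-result-title") to show only the elements that have both the result-title and js-result-title classes.
Another way
A shorter way to get the filtered values. Note that it compares the whole class attribute, so it only matches when the value is exactly "result-title js-result-title".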
var doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(@"put html string here");
var listOfUrls = doc.DocumentNode.Descendants("a")
.Where(x => x.Attributes["class"] != null
&& x.Attributes["class"].Value == "result-title js-result-title")
.Select(x => x.GetAttributeValue("href", "")).ToList();

C# - Append color code before HTML tags

I have an HTML color code reader that takes in HTML (as a string) like this:
var str = @"<html><head><title> HTML highlight test page </title> </head> <body> This is text in the body.<br><h1> This is a heading </h1><p> This is a paragraph.</p> There is more text in the body after the paragraph. <p> So is this.</p> </body> </html>";
I would like to, for example, take all the <p> tags and put \color[DARKGRAY] in front of each one, turning
<p>This is a paragraph.</p>
into
\color[DARKGRAY]<p>This is a paragraph.</p>
I am using the HTML Agility Pack like this:
var html = doc.DocumentNode.SelectNodes("//p");
if (html != null)
{
foreach (HtmlAgilityPack.HtmlNode item in html)
{
item.Name = "\color[RED]<p>";
}
}
But that is really wrong. How can I prepend the text?
You have already selected the paragraph nodes, so in your loop use InsertBefore on the parent node to add the text in front of each one.
item.ParentNode.InsertBefore(doc.CreateTextNode(@"\color[RED]"), item);
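Put together with the selection loop from the question, the InsertBefore approach could be sketched as follows. The class name ColorPrefixer and the method PrefixParagraphs are hypothetical names chosen for this example.

```csharp
using System;
using HtmlAgilityPack;

// Hypothetical helper, for illustration only.
static class ColorPrefixer
{
    public static string PrefixParagraphs(string html, string prefix)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        var paragraphs = doc.DocumentNode.SelectNodes("//p");
        if (paragraphs != null)
        {
            foreach (var p in paragraphs)
            {
                // Insert a text node holding the prefix right before each <p>.
                p.ParentNode.InsertBefore(doc.CreateTextNode(prefix), p);
            }
        }

        return doc.DocumentNode.OuterHtml;
    }
}
```

For example, ColorPrefixer.PrefixParagraphs(str, @"\color[DARKGRAY]") should return the markup with the prefix text in front of every paragraph. The verbatim string (@"...") keeps the backslash from being read as an escape sequence.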

Get all <li> elements from inside a certain <div> with C#

I have a web page consisting of several <div> elements.
I would like to write a program that prints all the li elements inside a <div> after a certain <h4> header. Could anyone give me some help or sample code?
<div id="content">
<h4>Header</h4>
<ul>
<li><a href...></a> THIS IS WHAT I WANT TO GET</li>
</ul>
</div>
When it comes to parsing HTML in C#, don't try to write your own parser. The HTML Agility Pack is almost certainly capable of doing what you want!
What parts are constant:
The 'id' in the DIV?
The h4
Searching a complete HTML document and reacting on H4 alone is likely to be a mess, whereas if you know the DIV has the ID of "content" then just look for that!
var doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(yourHtml);
if ( doc.DocumentNode != null )
{
var divs = doc.DocumentNode
.SelectNodes("//div")
.Where(e => e.Descendants().Any(d => d.Name == "h4"));
// You now have all of the divs with an 'h4' inside of it.
// The rest of the element structure, if constant needs to be examined to get
// the rest of the content you're after.
}
If it's a web page, why would you need to parse the HTML? Wouldn't the technology you are using to build the page give you access to all of its elements? For example, if you are using ASP.NET, you could assign ids to your UL and LI elements (with runat="server") and they would be available in the code-behind.
Could you explain what you are trying to do? If you are making a web request and downloading the HTML as a string, then scraping the HTML would make sense.
EDIT
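I think this should work:
HtmlDocument doc = new HtmlDocument();
doc.Load(myHtmlFile);
// SelectNodes returns null when nothing matches, so guard before iterating
var divs = doc.DocumentNode.SelectNodes("//div[@id='content']");
if (divs != null)
{
    foreach (HtmlNode div in divs)
    {
        // ".//ul" is relative to the current div; "//ul" would search the whole document
        var lists = div.SelectNodes(".//ul");
        if (lists == null) continue;
        foreach (HtmlNode ul in lists)
        {
            // Skip whitespace text nodes to find the element just before the <ul>
            HtmlNode previous = ul.PreviousSibling;
            while (previous != null && previous.NodeType != HtmlNodeType.Element)
                previous = previous.PreviousSibling;
            if (previous != null && previous.Name == "h4" && previous.InnerText.Trim() == "Header")
            {
                foreach (HtmlNode li in ul.Elements("li"))
                {
                    // li is each <li> that follows the "Header" h4
                    Console.WriteLine(li.InnerText);
                }
            }
        }
    }
}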
If all you want is the stuff that's between all <li></li> tags underneath the <div id="content"> tag and comes right after a <h4> tag, then this should suffice:
//Load your document first.
//Load() accepts a Stream, a TextReader, or a string path to the file on your computer
//If the entire document is loaded into a string, then use .LoadHtml() instead.
HtmlDocument mainDoc = new HtmlDocument();
mainDoc.Load(@"c:\foobar.html");
//Select all the <li> nodes that are inside of an element with the id of "content"
// and come directly after an <h4> tag.
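// The leading "." keeps the XPath relative to the "content" element;
// a bare "//h4" would search the whole document.
HtmlNodeCollection processMe = mainDoc.GetElementbyId("content")
    .SelectNodes(".//h4/following-sibling::*[1]//li");
//Iterate through each <li> node and print the inner text to the console
//(SelectNodes returns null when nothing matches)
if (processMe != null)
{
    foreach (HtmlNode listElement in processMe)
    {
        Console.WriteLine(listElement.InnerText);
    }
}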

How to get all input elements in a form with HtmlAgilityPack without getting a null reference error

Example HTML:
<html><body>
<form id="form1">
<input name="foo1" value="bar1" />
<!-- Other elements -->
</form>
<form id="form2">
<input name="foo2" value="bar2" />
<!-- Other elements -->
</form>
</body></html>
Test code:
HtmlDocument doc = new HtmlDocument();
doc.Load(@"D:\test.html");
foreach (HtmlNode node in doc.GetElementbyId("form2").SelectNodes(".//input"))
{
Console.WriteLine(node.Attributes["value"].Value);
}
The statement doc.GetElementbyId("form2").SelectNodes(".//input") gives me a null reference.
Did I do anything wrong? Thanks.
You can do the following:
HtmlNode.ElementsFlags.Remove("form");
HtmlDocument doc = new HtmlDocument();
doc.Load(@"D:\test.html");
HtmlNode secondForm = doc.GetElementbyId("form2");
foreach (HtmlNode node in secondForm.Elements("input"))
{
HtmlAttribute valueAttribute = node.Attributes["value"];
if (valueAttribute != null)
{
Console.WriteLine(valueAttribute.Value);
}
}
By default, HTML Agility Pack parses forms as empty nodes because they are allowed to overlap other HTML elements. The first line (HtmlNode.ElementsFlags.Remove("form");) disables this behavior, allowing you to get the input elements inside the second form.
Update:
Example of form elements overlap:
<table>
<form>
<!-- Other elements -->
</table>
</form>
The form element begins inside a table but is closed outside the table element. Markup like this turns up in real-world pages, and HTML Agility Pack has to deal with it.
Just select them all into a collection:
HtmlNodeCollection resultCollection = doc.DocumentNode.SelectNodes("//*[@type='text']");
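Since the question is about avoiding the null reference, here is a minimal null-safe sketch that combines the ElementsFlags fix with explicit null checks. The class name FormInputs and the method GetInputValues are hypothetical names used for illustration. The sketch relies on SelectNodes returning null when nothing matches.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

// Hypothetical helper, for illustration only.
static class FormInputs
{
    public static List<string> GetInputValues(string html, string formId)
    {
        // Let <form> keep its child nodes; Remove is harmless if the key is already absent.
        HtmlNode.ElementsFlags.Remove("form");

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // GetElementbyId returns null for an unknown id, and SelectNodes
        // returns null when no node matches, so both results are checked.
        HtmlNode form = doc.GetElementbyId(formId);
        HtmlNodeCollection inputs = form?.SelectNodes(".//input");
        if (inputs == null)
            return new List<string>();

        // Skip inputs that have no value attribute.
        return inputs
            .Select(input => input.Attributes["value"]?.Value)
            .Where(value => value != null)
            .ToList();
    }
}
```

With the example HTML from the question, GetInputValues(html, "form2") should yield the value of the foo2 input, and an unknown form id should yield an empty list rather than an exception.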
