How to add text inside XElement value that contains other elements? - c#

I have an XML element like this:
string markup = @"<a href='#'>
<span>
outer content
<span>inner content</span>
</span>
</a>";
XElement element = XDocument.Parse(markup).Root;
I want to add brackets to the outer span so it becomes:
<a href='#'>
<span>
(outer content
<span>inner content</span>)
</span>
</a>
I tried modifying the Value property, but it strips away the inner element and replaces it with only text:
element.Element("span").Value = "(" + element.Element("span").Value + ")";

You would need to replace the span's child nodes with the existing nodes plus your text on either side. Something approximately like this:
var span = element.Element("span");
span.ReplaceNodes(
new XText("("),
span.Nodes(),
new XText(")"));
It will get a little trickier if the whitespace must match what you've specified. You'd have to iterate through span.Nodes() to work out where to insert your XText nodes.
As an aside, there exists XElement.Parse, so your parsing could be written as:
var element = XElement.Parse(markup);
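For the whitespace-sensitive case mentioned above, here is a minimal sketch of one way to work out where the brackets go. It assumes the markup was parsed with LoadOptions.PreserveWhitespace, so that whitespace-only text nodes survive parsing. The names BracketHelper and WrapContentInBrackets are hypothetical and not part of LINQ to XML.

```csharp
using System.Linq;
using System.Xml.Linq;

public static class BracketHelper
{
    // Hypothetical helper: puts "(" before the first non-whitespace content
    // of `parent` and ")" after the last, so that existing leading and
    // trailing whitespace stays outside the brackets.
    public static void WrapContentInBrackets(XElement parent)
    {
        // Snapshot the child nodes, skipping whitespace-only text nodes.
        var significant = parent.Nodes()
            .Where(n => !(n is XText t && string.IsNullOrWhiteSpace(t.Value)))
            .ToList();
        if (significant.Count == 0) return; // only whitespace: nothing to wrap

        XNode first = significant[0];
        XNode last = significant[significant.Count - 1];

        if (first is XText firstText)
        {
            // Keep the leading whitespace and put "(" right before the text.
            string trimmed = firstText.Value.TrimStart();
            string leading = firstText.Value.Substring(0, firstText.Value.Length - trimmed.Length);
            firstText.Value = leading + "(" + trimmed;
        }
        else
        {
            first.AddBeforeSelf(new XText("("));
        }

        if (last is XText lastText)
        {
            // Keep the trailing whitespace and put ")" right after the text.
            string trimmed = lastText.Value.TrimEnd();
            string trailing = lastText.Value.Substring(trimmed.Length);
            lastText.Value = trimmed + ")" + trailing;
        }
        else
        {
            last.AddAfterSelf(new XText(")"));
        }
    }
}
```

To use it, parse with XElement.Parse(markup, LoadOptions.PreserveWhitespace), call BracketHelper.WrapContentInBrackets(element.Element("span")), and serialize with ToString(SaveOptions.DisableFormatting) so the writer does not re-indent the result.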

For the VB'ers that might come across this.
Dim markup As XElement
markup = <a href='#'>
<span>
outer content
<span>inner content</span>
</span>
</a>
Dim newmarkup As XElement = New XElement(markup)
newmarkup.<span>.DescendantNodes.Remove()
newmarkup.<span>.Value = "("
For Each el As XNode In markup.<span>.Nodes
newmarkup.<span>.Nodes.LastOrDefault.AddAfterSelf(el)
Next
newmarkup.<span>.Nodes.LastOrDefault.AddAfterSelf(")")

Related

How can I remove the spaces in html tags if tags only contain whitespace? using HTMLAgility C#

<p style="text-align:right;margin-bottom:0pt;margin-top:0pt;text-indent:0%;font-family:Times New Roman;font-size:11pt;"> </p>
Here you can see the space inside the p tag. I want to remove this space throughout the whole HTML document.
I am already using HtmlAgilityPack to remove a few HTML characters, but I am not sure how to remove this whitespace.
Here is an example of how to do that. It finds all paragraph elements whose inner text consists only of whitespace and replaces them with empty paragraphs.
var doc = new HtmlDocument();
doc.LoadHtml(
@"<body>
<p> </p>
<span>My span text ! </span>
<p> </p>
</body>");
// QuerySelectorAll comes from HtmlAgilityPack.CssSelectors.NetCore; Where and ToList need System.Linq.
// ToList() materializes the matches once, so the query is not re-run while the tree is being modified.
var ps = doc.QuerySelectorAll("p").Where(p => p.InnerText.All(char.IsWhiteSpace)).ToList();
foreach (var p in ps)
{
var newP = HtmlNode.CreateNode("<p></p>");
p.ParentNode.ReplaceChild(newP, p);
}
doc.Save("demo.html");

How to replace span with inline style tag to b tag in c#?

I have some text like the following:
<span style="font-weight: 700;">Aanbod wielen (banden + velgen) </span>
<br><br>
<span style="font-weight: 500;">lichtmetalen originele Volvo set met winterbanden:<br>origineel:</span> Volvo<br>
<b>inch maat:</b> 15''<br>
<p>steek:</p> 5x108mm<br>
<span style="font-weight: 700;">naafgat:</span>
In C#, I need to find each span tag that has an inline font-weight style and replace it with a <b> tag, replacing the closing </span> with </b> as well. The result should look like this:
<b>Aanbod wielen (banden + velgen)</b>
<br><br>
<b>lichtmetalen originele Volvo set met winterbanden:<br>origineel:</b> Volvo <br>
<b>inch maat:</b> 15''<br>
<p>steek:</p> 5x108mm<br>
<b>naafgat:</b>
How can I identify and replace these tags?
You can replace your span elements with b elements by using HtmlAgilityPack, which is free and open source.
You can install HtmlAgilityPack from NuGet: Install-Package HtmlAgilityPack -Version 1.8.9
public string ReplaceSpanByB()
{
HtmlDocument doc = new HtmlDocument();
string htmlContent = File.ReadAllText(@"C:\Users\xxx\source\repos\ConsoleApp4\ConsoleApp4\Files\HTMLPage1.html");
doc.LoadHtml(htmlContent);
if (doc.DocumentNode.SelectNodes("//span") != null)
{
foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//span"))
{
var attributes = node.Attributes;
foreach (var item in attributes)
{
if (item.Name.Equals("style") && item.Value.Contains("font-weight"))
{
HtmlNode b = doc.CreateElement("b");
b.InnerHtml = node.InnerHtml;
node.ParentNode.ReplaceChild(b, node);
}
}
}
}
return doc.DocumentNode.OuterHtml;
}
1st: Don't use regex. Although it is possible and may seem like the logical choice,
it is mostly wrong and full of pain.
A happy post about it can be found HERE
2nd:
Use an HTML parser such as https://html-agility-pack.net/ to traverse the tree
(you can use XPath to easily find all the span elements you want to replace)
and replace each span element with a b element (don't forget to set the new b element's contents).
Side note: As far as I recall, the b tag is discouraged,
so if you only need the span text to be bold,
it already is, because of "font-weight:bold".
On https://developer.mozilla.org/en-US/docs/Web/HTML/Element/b :
"Historically, the <b> element was meant to make text boldface. Styling information has been deprecated since HTML4, so the meaning of the <b> element has been changed." and "The HTML Bring Attention To element (<b>) is used to draw the reader's attention to the element's contents, which are not otherwise granted special importance." – Thanks @Richardissimo
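As a rough sketch of the XPath approach described above, the style check can be moved into the query itself so that only the relevant spans are selected. It assumes the HtmlAgilityPack NuGet package is referenced. The names SpanToBold and ReplaceStyledSpans are made up for illustration.

```csharp
using HtmlAgilityPack;

public static class SpanToBold
{
    // Hypothetical helper: replaces every <span> whose style attribute
    // mentions font-weight with a <b> element holding the same contents.
    public static string ReplaceStyledSpans(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // The XPath predicate filters on the style attribute, so no
        // attribute loop is needed afterwards.
        var spans = doc.DocumentNode.SelectNodes("//span[contains(@style, 'font-weight')]");
        if (spans == null) return html; // SelectNodes returns null when nothing matches

        foreach (HtmlNode span in spans)
        {
            HtmlNode b = doc.CreateElement("b");
            b.InnerHtml = span.InnerHtml; // carry the span's contents over
            span.ParentNode.ReplaceChild(b, span);
        }
        return doc.DocumentNode.OuterHtml;
    }
}
```

One limitation of this sketch: a styled span nested inside another styled span is copied as-is into the outer b element, so nested matches would need a second pass.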

How to unwrap an element if it exists with CsQuery?

I'm using CsQuery to read values of HTML elements.
In advance, I don't know if the <a> element contains a <font> element or not.
Is there a way to read the InnerText of an anchor regardless of whether it contains a <font> element?
Scenario 1: Text inside font element
<div class="link">
<a href="http://www.example.com/1">
<font>Foo</font>
</a>
</div>
Scenario 2: Text without font element
<div class="link">
<a href="http://www.example.com/2">
Foo
</a>
</div>
I've got the following working solution:
var dom = CQ.CreateFromUrl("http://www.myurl.com");
var a = new CQ(dom.Select("div.link a").InnerHTML);
var font = a.Select("font");
var myValue = a.Count() > 0 ? font[0].InnerText : a[0].InnerText;
But it's a bit messy, and I'd rather always remove the font element, if present, so I could go for the anchor value right away. Something like Contents() in combination with UnWrap(), but I haven't managed to make it work. Any ideas?
var dom = CQ.CreateFromUrl("http://www.myurl.com");
string result = dom[".link a"].Text();

Wrapping an HTML element with another element?

I am writing a program that parses a bit of HTML. Specifically, I am looking for underlined elements within a list, and turning those underlined elements into hyperlinks.
Here's an example of the pre-converted HTML:
<ul>
<li>
<u>Mode selector </u>
</li>
<li>
<u>LAND ALT</u>
</li>
<li>
<u>FLT ALT</u>
</li>
</ul>
Here's what I want the result to look like:
<ul>
<li>
<a id="triv14522" onclick="TxtLinkAction(15627,15673)">
<span style="color: rgb(102, 204, 255); font-size: 11pt;">
<u>Mode selector</u>
</span>
</a>
</li>
<li>
<a id="triv14523" onclick="TxtLinkAction(15627,15674)">
<span style="color: rgb(102, 204, 255); font-size: 11pt;">
<u>LAND ALT</u>
</span>
</a>
</li>
<li>
<a id="triv14887" onclick="TxtLinkAction(15627,15679)">
<span style="color: rgb(102, 204, 255); font-size: 11pt;">
<u>FLT ALT</u>
</span>
</a>
</li>
</ul>
In my program, I've already built the anchor and span elements for each underlined element. Just for reference, here's how I've done this:
TrivId = trivId;
ActionItemId = actionItemId;
TextLayerId = textLayerId;
var trivIdText = "id=\"triv" + TrivId + "\"";
var onClickText = "onclick=\"TxtLinkAction(" + TextLayerId + "," + ActionItemId + ")\"";
var anchor = "<a " + trivIdText + " " + onClickText + ">";
var span = "<span style=\"color: rgb(102, 204, 255); font-size: 11pt;\">";
So, my main problem is I don't exactly know how to "wrap" each underlined element in the list with my anchor and span elements. If this were XML, I could add my XML element by using AddBeforeSelf. Can I do something similar with HTML?
NOTE: I notice that the C# tag has been removed, and Javascript tag added. I should clarify: This is a C# program that is parsing a PowerPoint document. One of the values that is being brought in is in HTML format. I am not using Javascript at all, since this isn't an actual webpage. I'm just grabbing this particular value from the PowerPoint slide, which happens to be in HTML format.
For further clarification, here's the C# method that I'm using. The resulting, modified HTML will be written out to an XML file. The resulting HTML will be stored in an XML tag, <RTF>, with the valid HTML as that tag's value.
public Hyperlink(int textLayerId, int runGroupId)
{
TrivId = LectoraTitle.GetId();
ActionItemId = LectoraTitle.GetId();
TextLayerId = textLayerId;
var trivIdText = "id=\"triv" + TrivId + "\"";
var onClickText = "onclick=\"TxtLinkAction(" + TextLayerId + "," + ActionItemId + ")\"";
var styleText = "style=\"" + Settings.Default.Style + "\"";
// build anchor/span and determine where to insert into text.text
var anchor = "<a " + trivIdText + " " + onClickText + " " + styleText + ">";
var span = "<span style=\"color: rgb(102, 204, 255); font-size: 11pt;\">";
ActionItem = new ActionItem { ActionType = ActionType.rungroup, TargetId = runGroupId };
}
Further explanation: I'm assuming that I can iterate over my HTML elements with a foreach loop, using something like the below code:
// note: this is pseudocode
var nodes = htmlSnippet;
foreach (var node in nodes)
{
// if node is underline element
// surround node with generated anchor
// and span elements.
}
I'm just not quite sure how to get my HTML snippet into an enumerable state so that I can iterate over it, and then wrap a particular element with my generated elements.
NEW EDIT:
So, after looking at HtmlAgilityPack, I've incorporated it into my program and am iterating over the Html like so (The variable text contains the HTML value (see first example above)):
htmlDocument.LoadHtml(text);
var nodes = htmlDocument.DocumentNode.SelectNodes("//u");
foreach (var node in nodes)
{
// insert code here to wrap the
// underline element with the generated
// anchor/span elements
}
So, now I'm able to parse the HTML and get only the underline elements. I now need to figure out how to surround these underline elements with my generated anchor/span elements. I was hoping I could do something like node.AddParent(anchor).
To iterate over the HTML, you may want to use HTML Agility Pack:
http://htmlagilitypack.codeplex.com/
Examples here:
http://htmlagilitypack.codeplex.com/wikipage?title=Examples
A decent how-to here:
http://www.codeproject.com/Articles/659019/Scraping-HTML-DOM-elements-using-HtmlAgilityPack-H
You can install it using NuGet.
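For the wrapping step itself, one possible approach is to build the anchor and span as a single wrapper node, put a copy of the underline element inside it, and swap the wrapper in for the original. The sketch below assumes the HtmlAgilityPack NuGet package is referenced. UnderlineWrapper, WrapUnderlines and the nextId callback are hypothetical stand-ins for the question's own classes (for example LectoraTitle.GetId()).

```csharp
using System;
using HtmlAgilityPack;

public static class UnderlineWrapper
{
    // Hypothetical helper: wraps every <u> element in
    // <a ...><span ...>...</span></a>. nextId stands in for whatever
    // generates the ids in the real program.
    public static string WrapUnderlines(string html, int textLayerId, Func<int> nextId)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var underlines = doc.DocumentNode.SelectNodes("//u");
        if (underlines == null) return html; // SelectNodes returns null when nothing matches

        foreach (HtmlNode u in underlines)
        {
            int trivId = nextId();
            int actionItemId = nextId();

            // Build the wrapper with an empty span inside the anchor.
            HtmlNode anchor = HtmlNode.CreateNode(
                $"<a id=\"triv{trivId}\" onclick=\"TxtLinkAction({textLayerId},{actionItemId})\">" +
                "<span style=\"color: rgb(102, 204, 255); font-size: 11pt;\"></span></a>");
            HtmlNode span = anchor.FirstChild;

            // Put a deep copy of the <u> into the span, then swap the
            // wrapper in where the original <u> was.
            span.AppendChild(u.CloneNode(true));
            u.ParentNode.ReplaceChild(anchor, u);
        }
        return doc.DocumentNode.OuterHtml;
    }
}
```

Copying the element with CloneNode(true) avoids moving a node that is still attached to its old parent. The span style is copied from the question and could just as well be passed in as a parameter.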

Inner text of Node ignoring inner text of children

Pardon me if this sounds too simple to ask here, but this is my very first day with html-agility-pack. I can't work out how to select the inner text that is a direct child of a node while ignoring the inner text of its child nodes.
For example
<div id="div1">
<div class="h1"> this needs to be selected
<small> and not this</small>
</div>
</div>
Currently I am trying this:
HtmlDocument page = new HtmlWeb().Load(url);
var s = page.DocumentNode.SelectSingleNode("//div[@id='div1']//div[@class='h1']");
string selText = s.InnerText;
which returns the whole text (e.g. "this needs to be selected and not this").
Any suggestions?
The div could possibly have multiple text nodes if there is text before and after its children. As I similarly indicated here, I think the best way to get all the direct text content of a node is to do something like:
HtmlDocument page = new HtmlWeb().Load(url);
var nodes = page.DocumentNode.SelectNodes("//div[@id='div1']//div[@class='h1']/text()");
StringBuilder sb = new StringBuilder();
foreach(var node in nodes)
{
sb.Append(node.InnerText);
}
string content = sb.ToString();
You can use the /text() option to get all text nodes directly under a specific tag. If you only need the first one, add [1] to it:
page.LoadHtml(text);
var s = page.DocumentNode.SelectSingleNode("//div[@id='div1']//div[@class='h1']/text()[1]");
string selText = s.InnerText;
