Add div in HTML after <body ...> in C#

Requirement:
Add custom HTML right after the opening body tag in a string.
I solved it with HtmlAgilityPack like this:
StringBuilder sb = new StringBuilder();
sb.Append(customStringWithHtmlContent);
HtmlDocument htmlDoc = new HtmlDocument();
htmlDoc.LoadHtml(sb.ToString());
// Create new node from newcontent
HtmlNode newNode = HtmlNode.CreateNode("<div>" + someStringWithAdditionalContent + "</div>");
// Get body node
HtmlNode body = htmlDoc.DocumentNode.SelectSingleNode("//body");
if (body != null)
{// Add new node as first child of body
body.PrependChild(newNode);
}
var docContent = htmlDoc.DocumentNode.InnerHtml;
This looks good, but on some HTML pages the structure gets changed: closing div tags are moved and the page renders badly.
Second solution:
if (sb.ToString().Contains("<body>"))
{
sb.Replace("<body>", "<body><div>" + someStringWithAdditionalContent + "</div>");
}
This looks good too, but it does not work for a body tag with attributes, like
<body style="someAttr:value ..." ...>
Any ideas? Other solutions?

RegEx? There's probably a more elegant way but the basic idea:
string input = "<body style=\"someAttr\"><tag>sdsdsa</tag></body>";
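// Singleline lets '.' match newlines, so HTML spanning several lines is handled too
Regex Pattern = new Regex(@"(<body.*?>)(.*?)(<\/body>)", RegexOptions.Compiled | RegexOptions.Singleline);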
var updatedText = Pattern.Replace(input, match =>
{
string newMatch = match.Groups[2].Value;
string newContent = "<div>" + "someStringWithAdditionalContent" + "</div>";
return match.Groups[1].Value + newContent + newMatch + match.Groups[3].Value;
});
Console.WriteLine(updatedText);
Output:
<body style="someAttr"><div>someStringWithAdditionalContent</div><tag>sdsdsa</tag></body>
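A slightly narrower variant of the same idea, as a rough sketch: since the new content only has to go right after the opening tag, it may be enough to match just that tag and leave the rest of the document alone. The class and method names here (BodyInjector, InsertAfterBodyTag) are made up for illustration, and the pattern assumes no attribute value contains a literal > character.

```csharp
using System;
using System.Text.RegularExpressions;

public static class BodyInjector
{
    // Matches only the opening <body> tag, with or without attributes.
    // Assumption: no attribute value contains a literal '>'.
    private static readonly Regex BodyOpenTag =
        new Regex(@"<body\b[^>]*>", RegexOptions.IgnoreCase);

    public static string InsertAfterBodyTag(string html, string extraHtml)
    {
        // Keep the matched tag verbatim and append the new div after it.
        // The final argument limits the replacement to the first match.
        return BodyOpenTag.Replace(html, m => m.Value + "<div>" + extraHtml + "</div>", 1);
    }

    public static void Main()
    {
        string input = "<html><body style=\"color:red\">\n<p>text</p>\n</body></html>";
        Console.WriteLine(InsertAfterBodyTag(input, "hello"));
    }
}
```

If the input has no body tag at all, the string comes back unchanged, which matches the Contains check in the second solution from the question.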

Related

Selenium C# - Finding div elements text by span element in it

I would like to get the text of a <div> element.
The only thing I can locate it by is the <span> element inside this <div>.
<div>
<span id="lblName" class="fieldTitle">Name</span>
John
</div>
How can I get John using lblName or Name?
You can use XPath:
IWebElement span = driver.FindElement(By.Id("lblName"));
IWebElement div = span.FindElement(By.XPath(".."));
You can try: //span[@id = 'lblName']/parent::div/text()
Something like this:
string xml = "<?xml version=\"1.0\"?>" +
"<div>" +
"<span id=\"lblName\" class=\"fieldTitle\">Name</span>" +
"</div>";
XDocument xdoc = XDocument.Parse(xml);
var parent = xdoc.Descendants().First(el => el.Name == "span" &&
el.Attribute("id") != null &&
el.Attribute("id").Value == "lblName").Parent;
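Building on the LINQ to XML idea above, here is a minimal sketch of how the div's own text could be read once the parent is found. It assumes the markup is available as a well-formed XML string (which real-world HTML often is not), and the names DivTextReader and GetOwnText are hypothetical.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

public static class DivTextReader
{
    // Returns the text that sits directly inside the span's parent element,
    // ignoring text that belongs to child elements such as the span itself.
    public static string GetOwnText(string xml, string spanId)
    {
        XElement span = XDocument.Parse(xml)
            .Descendants("span")
            .FirstOrDefault(el => (string)el.Attribute("id") == spanId);

        if (span == null || span.Parent == null)
            return null;

        // Only XText nodes that are direct children of the parent are joined.
        return string.Concat(span.Parent.Nodes().OfType<XText>().Select(t => t.Value)).Trim();
    }

    public static void Main()
    {
        string xml = "<div><span id=\"lblName\" class=\"fieldTitle\">Name</span>John</div>";
        Console.WriteLine(GetOwnText(xml, "lblName")); // prints "John"
    }
}
```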
Selenium cannot return a text node directly. Try the JavaScript workaround below:
IWebElement span = driver.FindElement(By.Id("lblName"));
IWebElement div = span.FindElement(By.XPath(".."));
string script = "var nodes = arguments[0].childNodes;" +
"var text = '';" +
"for (var i = 0; i < nodes.length; i++) {" +
" if (nodes[i].nodeType == Node.TEXT_NODE) {" +
" text += nodes[i].textContent;" +
" }" +
"}" +
"return text;";
string text = ((IJavaScriptExecutor)driver).ExecuteScript(script, div).ToString();

How to get links from google html search results in c#?

I got this code that brings me the search results from Google as an HTML string:
WebClient webClient = new WebClient();
string htmlString = webClient.DownloadString("http://www.google.com/search?q=" + searchQuery);
Any idea how to extract only the links from it?
I guess I could do a string search, but that doesn't look very elegant...
I found this code:
HtmlAgilityPack.HtmlDocument htmlDoc = new HtmlAgilityPack.HtmlDocument();
htmlDoc.LoadHtml(htmlString);
var selectNodes = htmlDoc.DocumentNode.SelectNodes("//li[@class='g']");
foreach (var node in selectNodes)
{
//node.InnerText will give you the text content of the li tags ...
}
But I'm getting an exception because var selectNodes = htmlDoc.DocumentNode.SelectNodes("//li[@class='g']"); is null...
HtmlDocument doc = new HtmlDocument();
doc.Load("file.htm");
HtmlNodeCollection links = doc.DocumentNode.SelectNodes("//*[@background or @lowsrc or @src or @href]");
// SelectNodes returns null (not an empty collection) when nothing matches
if (links != null)
{
foreach (HtmlNode link in links)
{
if (link.Attributes["background"] != null)
link.Attributes["background"].Value = _newPath + link.Attributes["background"].Value;
if (link.Attributes["href"] != null)
link.Attributes["href"].Value = _newPath + link.Attributes["href"].Value;
if (link.Attributes["lowsrc"] != null)
link.Attributes["lowsrc"].Value = _newPath + link.Attributes["lowsrc"].Value;
if (link.Attributes["src"] != null)
link.Attributes["src"].Value = _newPath + link.Attributes["src"].Value;
}
}

HtmlAgilityPack : Issues getting content of anchor tag within a string

I have the section of HTML listed below, and I need the content of the anchor tag.
HtmlDocument newHtml = new HtmlDocument();
newHtml.OptionOutputAsXml = true;
var content = @"<div class=""business-name-container"">
<span class=""tier_info""></span>
<h3 class=""title fn org"">
<a class=""url link"">Foo</a>
</h3>
</div>";
newHtml.Load(content);
HtmlNode doc = newHtml.DocumentNode;
var findContent = doc.SelectNodes("//a[@class='url link']");
foreach (var acontent in findContent)
{
if (acontent.InnerHtml != null)
{
Console.WriteLine("Content: " + acontent.InnerHtml);
}
}
But I'm not getting any results.
I want the output to be "Content: Foo".
Replace
Console.WriteLine("Content: " + acontent.InnerHtml);
With
Console.WriteLine("Content: " + acontent.InnerText);
Or even better, something like this:
var result = newHtml.DocumentNode
.Descendants("a")
.Where(x => x.GetAttributeValue("class", "") == "url link")
.Select(x => x.InnerText);

How to obtain inner tags value in XML?

XDocument coordinates = XDocument.Load("http://feeds.feedburner.com/TechCrunch");
System.IO.StreamWriter StreamWriter1 = new System.IO.StreamWriter(DestFile);
XNamespace nsContent = "http://purl.org/rss/1.0/modules/content/";
string pchild = null;
foreach (var item in coordinates.Descendants("item"))
{
string link = item.Element("guid").Value;
//string content = item.Element(nsContent + "encoded").Value;
foreach (var child in item.Descendants(nsContent + "encoded"))
{
pchild = pchild + child.Element("p").Value;
}
StreamWriter1.WriteLine(link + Environment.NewLine + Environment.NewLine + pchild + Environment.NewLine);
}
StreamWriter1.Close();
If I use the commented-out line (string content = item.Element(nsContent + "encoded").Value;) instead of the inner loop, it fetches the value of the <content:encoded> element, but that contains all the links, images, etc., and I want only the text.
To filter that, I tried the inner loop, but it throws this error:
Object reference not set to an instance of an object.
Please suggest code that stores only the text and strips the links, <img> tags, etc.
The content of item.Element(nsContent + "encoded").Value is HTML, not XML. You should parse it accordingly, for example with Html Agility Pack.
See the example below:
string content = item.Element(nsContent + "encoded").Value;
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.Load(new StringReader(content));
var text = String.Join(Environment.NewLine + Environment.NewLine,
doc.DocumentNode
.Descendants("p")
.Select(n => "\t" + System.Web.HttpUtility.HtmlDecode(n.InnerText))
);
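To see why the inner loop in the question throws, here is a small self-contained sketch. It assumes the feed wraps its HTML in a CDATA section (typical for content:encoded, but not verified for this particular feed); the sample RSS string and the names EncodedContentDemo, HasParagraphElement and GetEncodedText are invented for illustration. From LINQ to XML's point of view the encoded element then holds plain text, so Element("p") returns null.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

public static class EncodedContentDemo
{
    public static readonly XNamespace Content = "http://purl.org/rss/1.0/modules/content/";

    // True only if <content:encoded> has a real <p> child element in the XML tree.
    public static bool HasParagraphElement(XElement item)
    {
        XElement encoded = item.Element(Content + "encoded");
        return encoded != null && encoded.Element("p") != null;
    }

    // The cast returns null instead of throwing when the element is missing.
    public static string GetEncodedText(XElement item)
    {
        return (string)item.Element(Content + "encoded");
    }

    public static void Main()
    {
        // Hypothetical sample feed, not the real TechCrunch data.
        string rss =
            "<rss xmlns:content=\"http://purl.org/rss/1.0/modules/content/\"><channel><item>" +
            "<guid>http://example.com/1</guid>" +
            "<content:encoded><![CDATA[<p>Hello <a href=\"#\">world</a></p>]]></content:encoded>" +
            "</item></channel></rss>";

        XElement item = XDocument.Parse(rss).Descendants("item").First();
        Console.WriteLine(HasParagraphElement(item)); // prints "False"
        Console.WriteLine(GetEncodedText(item));      // the raw HTML string
    }
}
```

The string returned by GetEncodedText is what would then be handed to an HTML parser, as in the answer above.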
Firstly, I would start by using a StringBuilder:
StringBuilder sb = new StringBuilder();
Then, I suspect that sometimes, the "child" doesn't have a "p" element, so you can check before using it:
foreach (var child in item.Descendants(nsContent + "encoded"))
{
if (child.Element("p") != null)
{
sb.Append(child.Element("p").Value);
}
}
StreamWriter1.WriteLine(link + Environment.NewLine + Environment.NewLine + sb.ToString() + Environment.NewLine);
Does that work for you?

Modifying InnerXml of a text XmlNode

I traverse an HTML document with SGML and XmlDocument. When I find an XmlNode whose type is Text, I need to change its value so that it contains an XML element. I can't change InnerXml because it's read-only. I tried changing InnerText, but then the tag delimiters < and > are encoded as &lt; and &gt;. For example:
<p>
This is a text that will be highlighted.
<anothertag />
<......>
</p>
I'm trying to change to:
<p>
This is a text that will be <span class="highlighted">highlighted</span>.
<anothertag />
<......>
</p>
What is the easiest way to modify the value of a text XmlNode?
I have a workaround. I don't know whether it is a proper solution, but it produces the result I want. Please comment on whether this code is a worthwhile solution:
private void traverse(ref XmlNode node)
{
XmlNode prevOldElement = null;
XmlNode prevNewElement = null;
var element = node.FirstChild;
do
{
if (prevNewElement != null && prevOldElement != null)
{
prevOldElement.ParentNode.ReplaceChild(prevNewElement, prevOldElement);
prevNewElement = null;
prevOldElement = null;
}
if (element.NodeType == XmlNodeType.Text)
{
var el = doc.CreateElement("text");
//Here is the manipulation of the InnerXml.
el.InnerXml = element.Value.Replace(a_search_term, "<b>" + a_search_term + "</b>");
//I don't replace element right now, because element.NextSibling will be null.
//So I replace the new element after getting the next sibling.
prevNewElement = el;
prevOldElement = element;
}
else if (element.HasChildNodes)
traverse(ref element);
}
while ((element = element.NextSibling) != null);
if (prevNewElement != null && prevOldElement != null)
{
prevOldElement.ParentNode.ReplaceChild(prevNewElement, prevOldElement);
}
}
Also, I remove <text> and </text> strings after the traverse function:
doc = new XmlDocument();
doc.PreserveWhitespace = true;
doc.XmlResolver = null;
doc.Load(sgmlReader);
var html = doc.FirstChild;
traverse(ref html);
textBox1.Text = doc.OuterXml.Replace("<text>", String.Empty).Replace("</text>", String.Empty);
using System;
using System.Xml;
public class Sample {
public static void Main() {
XmlDocument doc = new XmlDocument();
doc.LoadXml(
"<p>" +
"This is a text that will be highlighted." +
"<br />" +
"<img />" +
"</p>");
string ImpossibleMark = "_*_";
XmlNode elem = doc.DocumentElement.FirstChild;
string thewWord ="highlighted";
if(elem.NodeType == XmlNodeType.Text){
string OriginalXml = elem.ParentNode.InnerXml;
while(OriginalXml.Contains(ImpossibleMark)) ImpossibleMark += ImpossibleMark;
elem.InnerText = elem.InnerText.Replace(thewWord, ImpossibleMark);
string replaceString = "<span class=\"highlighted\">" + thewWord + "</span>";
elem.ParentNode.InnerXml = elem.ParentNode.InnerXml.Replace(ImpossibleMark, replaceString);
}
Console.WriteLine(doc.DocumentElement.InnerXml);
}
}
The InnerText property will give you the text content of all the child nodes of the XmlNode. What you really want to set is the InnerXml property, which will be construed as XML, not as text.
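Another option that avoids string replacement on InnerXml altogether is to split the text node and insert a real element through the DOM. The following is only a sketch under simple assumptions: it wraps the first occurrence of the word only, and the names TextNodeHighlighter and WrapFirstMatch are hypothetical. It relies on XmlText.SplitText, which cuts a text node in two and keeps both halves in the tree as siblings.

```csharp
using System;
using System.Xml;

public static class TextNodeHighlighter
{
    // Wraps the first occurrence of 'word' inside the given text node
    // in <span class="highlighted">...</span>.
    public static void WrapFirstMatch(XmlText text, string word)
    {
        int index = text.Value.IndexOf(word, StringComparison.Ordinal);
        if (index < 0)
            return;

        // After this call 'text' holds everything before the word,
        // and 'match' holds the word plus whatever follows it.
        XmlText match = text.SplitText(index);
        // Cut off the remainder so that 'match' holds exactly the word.
        match.SplitText(word.Length);

        XmlElement span = text.OwnerDocument.CreateElement("span");
        span.SetAttribute("class", "highlighted");

        // Put the span where the word was, then move the word into the span.
        match.ParentNode.ReplaceChild(span, match);
        span.AppendChild(match);
    }

    public static void Main()
    {
        var doc = new XmlDocument();
        doc.LoadXml("<p>This is a text that will be highlighted.<br /></p>");
        WrapFirstMatch((XmlText)doc.DocumentElement.FirstChild, "highlighted");
        Console.WriteLine(doc.DocumentElement.OuterXml);
    }
}
```

Because the span is created as an element rather than pasted in as text, nothing gets entity-encoded and no placeholder marker is needed.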
