So I have an HTML snippet that I want to modify using C#.
<div>
This is a specialSearchWord that I want to link to
<img src="anImage.jpg" />
A hyperlink
Some more text and that specialSearchWord again.
</div>
and I want to transform it to this:
<div>
This is a <a class="special" href="http://mysite.com/search/specialSearchWord">specialSearchWord</a> that I want to link to
<img src="anImage.jpg" />
A hyperlink
Some more text and that <a class="special" href="http://mysite.com/search/specialSearchWord">specialSearchWord</a> again.
</div>
I'm going to use HTML Agility Pack based on the many recommendations here, but I don't know where I'm going. In particular,
1. How do I load a partial snippet as a string, instead of a full HTML document?
2. How do I edit it?
3. How do I then return the text string of the edited object?
1. The same way as a full HTML document. It doesn't matter.
2. There are two options: you may edit the InnerHtml property directly (or Text on text nodes), or modify the DOM tree using e.g. AppendChild, PrependChild, etc.
3. You may use the HtmlDocument.DocumentNode.OuterHtml property or the HtmlDocument.Save method (personally I prefer the second option).
As to parsing, I select the text nodes which contain the search term inside your div, and then just use string.Replace method to replace it:
var doc = new HtmlDocument();
doc.LoadHtml(html);
var textNodes = doc.DocumentNode.SelectNodes("/div/text()[contains(.,'specialSearchWord')]");
if (textNodes != null)
foreach (HtmlTextNode node in textNodes)
node.Text = node.Text.Replace("specialSearchWord", "<a class='special' href='http://mysite.com/search/specialSearchWord'>specialSearchWord</a>");
And saving the result to a string:
string result = null;
using (StringWriter writer = new StringWriter())
{
doc.Save(writer);
result = writer.ToString();
}
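The replacement markup above is hard-coded for one word. A minimal sketch of building the same anchor for an arbitrary search word, assuming the URL pattern and the "special" class from the question; SearchLinks and BuildSearchLink are hypothetical names, and only the base class library is used:

```csharp
using System;
using System.Net;

public static class SearchLinks
{
    // Hypothetical helper: builds the anchor markup for one search word.
    // The base URL and the "special" class come from the question.
    public static string BuildSearchLink(string word)
    {
        // Escape the word for use in a URL path segment.
        string href = "http://mysite.com/search/" + Uri.EscapeDataString(word);
        // HTML-encode the visible text so characters like < or & stay literal.
        return "<a class=\"special\" href=\"" + href + "\">"
             + WebUtility.HtmlEncode(word) + "</a>";
    }

    public static void Main()
    {
        Console.WriteLine(BuildSearchLink("specialSearchWord"));
    }
}
```

The returned string could then be passed as the second argument to node.Text.Replace in the loop above.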
Answers:
1. There may be a way to do this but I don't know how. I suggest loading the entire document.
2. Use a combination of XPath and regular expressions.
3. See the code below for a contrived example. You may have other constraints not mentioned, but this code sample should get you started.
Note that your Xpath expression may need to be more complex to find the div that you want.
HtmlDocument doc = new HtmlDocument();
doc.Load(yourHtmlFile);
HtmlNode divNode = doc.DocumentNode.SelectSingleNode("//div[2]");
string newDiv = Regex.Replace(divNode.InnerHtml, @"specialSearchWord",
"<a class='special' href='http://etc'>specialSearchWord</a>");
divNode.InnerHtml = newDiv;
Console.WriteLine(doc.DocumentNode.OuterHtml);
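A plain pattern such as "specialSearchWord" also matches inside longer words, and when it is run over InnerHtml it matches inside attribute values too. A minimal sketch of a whole-word replacement meant for plain text (for example the content of a single text node); WholeWordLinker and LinkWholeWord are hypothetical names, and the href is passed in by the caller:

```csharp
using System;
using System.Text.RegularExpressions;

public static class WholeWordLinker
{
    // Wraps each whole-word occurrence of `word` in an anchor.
    // Intended for plain text, not for markup that contains tags.
    public static string LinkWholeWord(string text, string word, string href)
    {
        // Regex.Escape keeps characters like '.' or '+' in the word literal;
        // \b restricts matches to word boundaries.
        string pattern = @"\b" + Regex.Escape(word) + @"\b";
        // A MatchEvaluator avoids '$' in the replacement being read as a group reference.
        return Regex.Replace(text, pattern,
            m => "<a class='special' href='" + href + "'>" + m.Value + "</a>");
    }

    public static void Main()
    {
        Console.WriteLine(LinkWholeWord(
            "a specialSearchWord, not specialSearchWords",
            "specialSearchWord", "http://etc"));
    }
}
```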
Related
What I need to do: extract the From, To, Cc and Subject information and remove it from an HTML file, without using any third-party library (HTMLAgilityPack, etc.).
What I am having trouble with: what should my approach be to get the From, To, Subject and Cc values from the HTML tags?
Steps I tried: I tried to get the index of <p class=MsoNormal> and the last index of the email domain @sampleemail.com, but I think that is a bad approach, since some HTML files contain many "<p class=MsoNormal>" tags. To remove From, To, Cc and Subject, I used string.Remove(indexOf, count), where count is the number of characters from indexOf to lastIndexOf, and it worked.
Sample tag containing information of from:
<p class=MsoNormal style='margin-left:120.0pt;text-indent:-120.0pt;tab-stops:120.0pt;mso-layout-grid align:none;text-autospace:none'><b><span style='color:black'>From:<span style='mso-tab-count:1'></span></span></b><span style='color:black'>1234@sampleemail.com<o:p></o:p></span></p>
HTMLAgilityPack is your friend. Simply use an XPath expression like //p[@class='MsoNormal'] to get the content of the matching tags:
public static void Main()
{
    var html =
        @"<p class=MsoNormal style='margin-left:120.0pt;text-indent:-120.0pt;tab-stops:120.0pt;mso-layout-grid align:none;text-autospace:none'><b><span style='color:black'>From:<span style='mso-tab-count:1'></span></span></b><span style='color:black'>1234@sampleemail.com<o:p></o:p></span></p> ";
    var htmlDoc = new HtmlDocument();
    htmlDoc.LoadHtml(html);
    // SelectNodes returns null when nothing matches.
    var nodes = htmlDoc.DocumentNode.SelectNodes("//p[@class='MsoNormal']");
    if (nodes != null)
        foreach (var node in nodes)
            Console.WriteLine(node.InnerText);
}
Result:
From:1234@sampleemail.com
Update
We may use Regex to write this simple parser. But remember that it cannot handle every case in a complicated HTML document.
public static void MainFunc()
{
string str = @"<p class='MsoNormal' style='margin-left:120.0pt;text-indent:-120.0pt;tab-stops:120.0pt;mso-layout-grid align:none;text-autospace:none'><b><span style='color:black'>From:<span style='mso-tab-count:1'></span></span></b><span style='color:black'>1234@sampleemail.com<o:p></o:p></span></p> ";
var result = Regex.Replace(str, "<(?:\"[^\"]*\"['\"]*|'[^']*'['\"]*|[^'\">])+>", "");
Console.WriteLine(result);
}
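Stripping tags gives the text, but the question also asks to pull out the individual fields and remove them. A rough sketch of one way to do both with Regex only, assuming each field sits in its own <p> element and starts with its label as in the sample; MailHeaderExtractor and ExtractAndRemove are hypothetical names, and the same caveat about complicated HTML applies:

```csharp
using System;
using System.Collections.Generic;
using System.Net;
using System.Text.RegularExpressions;

public static class MailHeaderExtractor
{
    // One <p>...</p> element (assumes paragraphs are not nested).
    static readonly Regex Paragraph = new Regex(
        @"<p\b[^>]*>(.*?)</p>", RegexOptions.IgnoreCase | RegexOptions.Singleline);
    // Any tag inside the paragraph.
    static readonly Regex Tag = new Regex(@"<[^>]+>");
    // "Label: value" once the tags are gone.
    static readonly Regex Header = new Regex(
        @"^\s*(From|To|Cc|Subject)\s*:\s*(.*?)\s*$",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Stores each header found in `headers` and returns the HTML
    // with the header paragraphs removed.
    public static string ExtractAndRemove(string html, IDictionary<string, string> headers)
    {
        return Paragraph.Replace(html, m =>
        {
            string text = WebUtility.HtmlDecode(Tag.Replace(m.Groups[1].Value, ""));
            Match h = Header.Match(text);
            if (!h.Success) return m.Value;   // not a header: keep the paragraph
            headers[h.Groups[1].Value] = h.Groups[2].Value;
            return string.Empty;              // header: drop the paragraph
        });
    }

    public static void Main()
    {
        string html = "<p class=MsoNormal><b><span style='color:black'>From:</span></b>"
                    + "<span style='color:black'>1234@sampleemail.com<o:p></o:p></span></p>"
                    + "<p>Body text</p>";
        var headers = new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase);
        string cleaned = ExtractAndRemove(html, headers);
        Console.WriteLine(headers["From"]);
        Console.WriteLine(cleaned);
    }
}
```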
I have some text like as below
<span style="font-weight: 700;">Aanbod wielen (banden + velgen) </span>
<br><br>
<span style="font-weight: 500;">lichtmetalen originele Volvo set met winterbanden:<br>origineel:</span> Volvo<br>
<b>inch maat:</b> 15''<br>
<p>steek:</p> 5x108mm<br>
<span style="font-weight: 700;">naafgat:</span>
I need to identify each span tag with an inline font-weight style and replace it with a <b> tag, and likewise replace its closing tag with </b>, in C#. I need the text to look like this:
<b>Aanbod wielen (banden + velgen)</b>
<br><br>
<b>lichtmetalen originele Volvo set met winterbanden:<br>origineel:</b> Volvo <br>
<b>inch maat:</b> 15''<br>
<p>steek:</p> 5x108mm<br>
<b>naafgat:</b>
How can I identify these spans? Please help.
You can replace your span elements with b elements using HtmlAgilityPack, which is free and open source.
You can also install HtmlAgilityPack from NuGet: Install-Package HtmlAgilityPack -Version 1.8.9
public string ReplaceSpanByB()
{
    HtmlDocument doc = new HtmlDocument();
    string htmlContent = File.ReadAllText(@"C:\Users\xxx\source\repos\ConsoleApp4\ConsoleApp4\Files\HTMLPage1.html");
    doc.LoadHtml(htmlContent);

    // Select only the spans whose inline style mentions font-weight.
    // SelectNodes returns null when nothing matches.
    var spans = doc.DocumentNode.SelectNodes("//span[contains(@style, 'font-weight')]");
    if (spans != null)
    {
        foreach (HtmlNode node in spans)
        {
            // Build a <b> with the same content and swap it in for the span.
            HtmlNode b = doc.CreateElement("b");
            b.InnerHtml = node.InnerHtml;
            node.ParentNode.ReplaceChild(b, node);
        }
    }
    return doc.DocumentNode.OuterHtml;
}
1st: Don't use Regex. Though it is possible and seems logical,
it is mostly wrong and full of pain.
A happy post about it can be found HERE.
2nd:
Use an HTML parser such as https://html-agility-pack.net/ to traverse the tree
(you can use XPath to easily find all the span elements you want to replace)
and replace each span element with a b (don't forget to set the new b element's contents).
Side note: As far as I recall, the b tag is discouraged,
so if you only need the span text to be bold,
it already is, because of "font-weight:bold".
On https://developer.mozilla.org/en-US/docs/Web/HTML/Element/b :
"Historically, the <b> element was meant to make text boldface. Styling information has been deprecated since HTML4, so the meaning of the <b> element has been changed." and "The HTML Bring Attention To element (<b>) is used to draw the reader's attention to the element's contents, which are not otherwise granted special importance." – Thanks @Richardissimo
I'd like to scrape a table within a comment using HTMLAgilityPack. For example, on the page
http://www.baseball-reference.com/register/team.cgi?id=f72457e4
there is a table with id="team_pitching". I can get this comment as a block of text with:
var tags = doc.DocumentNode.SelectSingleNode("//comment()[contains(., 'team_pitching')]");
however my preference would be to select the rows from the table with something like:
var tags = doc.DocumentNode.SelectNodes("//comment()[contains(., 'team_pitching')]//table//tbody//tr");
or
var tags = doc.DocumentNode.SelectNodes("//comment()//table[@id = 'team_pitching']//tbody//tr");
but these both return null. Is there a way to do this so I don't have to parse the text manually to get all of the table data?
Sample HTML - I'm looking to find nodes inside <!-- ... -->:
<p>not interesting HTML here</p>
<!-- <table id=team_pitching>
<tbody><tr>...</tr>...</tbody>...</table> -->
The content of a comment is not parsed into DOM nodes, so a single XPath cannot search both outside and inside the comment.
You can get the InnerHtml of the comment node, trim the comment delimiters, load the result into a new HtmlDocument and query that. Something like this should work:
var commentNode = doc.DocumentNode
.SelectSingleNode("//comment()[contains(., 'team_pitching')]");
var commentHtml = commentNode.InnerHtml.TrimStart('<', '!', '-').TrimEnd('-', '>');
var commentDoc = new HtmlDocument();
commentDoc.LoadHtml(commentHtml);
var tags = commentDoc.DocumentNode.SelectNodes("//table//tbody//tr");
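One caveat with the trimming above: TrimStart and TrimEnd take a set of characters, so when the markup follows the delimiter without a space (as in "<!--<table>"), TrimStart('<', '!', '-') also removes the leading '<' of the table tag. A minimal sketch of a stricter variant that removes only the exact delimiters; CommentText and StripCommentDelimiters are hypothetical names:

```csharp
using System;

public static class CommentText
{
    // Removes one leading "<!--" and one trailing "-->" and nothing else.
    public static string StripCommentDelimiters(string comment)
    {
        const string open = "<!--";
        const string close = "-->";
        string s = comment.Trim();
        if (s.StartsWith(open, StringComparison.Ordinal))
            s = s.Substring(open.Length);
        if (s.EndsWith(close, StringComparison.Ordinal))
            s = s.Substring(0, s.Length - close.Length);
        return s.Trim();
    }

    public static void Main()
    {
        Console.WriteLine(StripCommentDelimiters("<!--<table id=team_pitching></table>-->"));
    }
}
```

Its result could be passed to commentDoc.LoadHtml in place of commentHtml.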
I'm new to XML/HTML-parsing. Don't even know the right words to do a proper search for duplicates.
I have this HTML file which looks like this:
<body id="s1" style="s1">
<div xml:lang="uk">
<p begin="00:00:00" end="00:00:29">
<span fontFamily="SchoolHouse Cursive B" fontSize="18">I'm great!</span>
</p>
Now I need 00:00:00, 00:00:29 and I'm great! from it. I could read it like this:
XmlTextReader reader = new XmlTextReader(file);
while (reader.Read())
{
if (reader.NodeType != XmlNodeType.Element)
continue;
if (reader.LocalName != "p")
continue;
var a = reader.GetAttribute(0);
var b = reader.GetAttribute(1);
if (reader.LocalName == "span")
{
XmlDocument doc = new XmlDocument();
doc.Load(reader);
XmlNode elem = doc.DocumentElement.FirstChild;
var c = elem.InnerText;
}
}
I get values in variables a, b and c. But there was a slight change in HTML format. Now the HTML looks like this:
<body id="s1" style="s1">
<div xml:lang="uk">
<p begin="00:00:00" end="00:00:29">I'm great! </p>
In this scenario how do I parse out 00:00:00, 00:00:29 and I'm great! ? I tried this:
XmlTextReader reader = new XmlTextReader(file);
while (reader.Read())
{
if (reader.NodeType != XmlNodeType.Element)
continue;
if (reader.LocalName != "p")
continue;
var a = reader.GetAttribute(0);
var b = reader.GetAttribute(1);
XmlDocument doc = new XmlDocument();
doc.Load(reader);
XmlNode elem = doc.DocumentElement.FirstChild;
var c = elem.InnerText;
}
But I get this error: This document already has a 'DocumentElement' node. at line doc.Load(reader). How to read correctly and what's causing the trouble? I am using .NET 2.0
It looks like you have HTML that you want to parse with an XML parser. That may also be the reason why you get the This document already has a 'DocumentElement' node. exception: because you have more than one root node, which is allowed (or better: tolerated) in HTML, but not in XML.
Use an HTML parser instead. Unfortunately, there is nothing built into the .NET Framework, so you have to use a third-party library for that. A very good one is the HTML Agility Pack, which oleksii already mentioned in his comment.
Edit:
From your comments, I get the feeling you're not familiar with the fact that there is no direct relation between HTML and XML. The graphic taken from here illustrates this quite well:
Neither is XML a subset of HTML, nor the other way around. Only if you have strict XHTML (rarely the case) do you have an HTML document that can be parsed with an XML parser. But be aware that if there is some mistake in the code of such an XHTML document, the parser will fail, while a common browser will continue to display the page. Also, the future of XHTML is quite unclear, now that HTML5 is coming to life slowly but steadily...
To sum up: To avoid all those pitfalls, take the easy road and go for an HTML parser.
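For the case where the file does turn out to be well-formed XML, here is a hedged sketch that handles both layouts from the question, since InnerText returns the text whether or not it is wrapped in a span. The closing tags and the second p element in the sample are assumptions added to make the fragment complete; CaptionReader and ReadCaptions are hypothetical names. Loading the whole document once also avoids calling doc.Load(reader) from the middle of the stream.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

public static class CaptionReader
{
    // Returns one { begin, end, text } triple per p element.
    public static List<string[]> ReadCaptions(string xml)
    {
        var doc = new XmlDocument();
        doc.LoadXml(xml);
        var result = new List<string[]>();
        // local-name() makes the query work whether or not p is in a namespace.
        foreach (XmlElement p in doc.SelectNodes("//*[local-name()='p']"))
        {
            result.Add(new[]
            {
                p.GetAttribute("begin"),
                p.GetAttribute("end"),
                p.InnerText.Trim()   // text of the p, with or without a span
            });
        }
        return result;
    }

    public static void Main()
    {
        string xml =
            "<body id='s1' style='s1'><div xml:lang='uk'>" +
            "<p begin='00:00:00' end='00:00:29'>" +
            "<span fontFamily='SchoolHouse Cursive B' fontSize='18'>I'm great!</span></p>" +
            "<p begin='00:00:30' end='00:00:45'>I'm great! </p>" +
            "</div></body>";
        foreach (string[] c in ReadCaptions(xml))
            Console.WriteLine(c[0] + " " + c[1] + " " + c[2]);
    }
}
```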
Since you are wanting to parse HTML, you could use WebClient (or WebBrowser) to load the page and then use the HTML DOM to navigate through it. You need to add a reference to Microsoft HTML Object Library (COM) for the following code example:
string html;
WebClient webClient = new WebClient();
using (Stream stream = webClient.OpenRead(new Uri("http://www.google.com")))
using (StreamReader reader = new StreamReader(stream))
{
html = reader.ReadToEnd();
}
IHTMLDocument2 doc = (IHTMLDocument2)new HTMLDocument();
doc.write(html);
foreach (IHTMLElement el in doc.all)
Console.WriteLine(el.tagName);
I have tried loading HTML into XML before, and it's all too hard: fixing up unclosed tags (like <BR>), putting quotes around attributes, giving attributes without values a value, etc. Since I then wanted to use XSLT against it, I loaded it into the HTML DOM and navigated through it, creating the relevant XML node for each HTML node. Then I had a proper XML representation of the HTML.
How can I select every paragraph in a div tag? For example:
<div id="body_text">
<p>Hi</p>
<p>Help Me Please</P>
<p>Thankyou</P>
I have the Html Agility Pack downloaded and referenced in my program; all I need is the paragraphs. There may be a variable number of paragraphs, and there are loads of different div tags, but I only need the content within body_text. I assume this can then be stored as a string, which I want to write to a .txt file for later reference. Thank you.
The valid XPath for your case is //div[@id='body_text']/p
foreach (HtmlNode node in yourHTMLAgilityPackDocument.DocumentNode.SelectNodes("//div[@id='body_text']/p"))
{
    string text = node.InnerText; // that's the text you are looking for
}
Here's a solution that grabs the paragraphs as an enumeration of HtmlNodes:
HtmlDocument doc = new HtmlDocument();
doc.Load("your.html");
var div = doc.GetElementbyId("body_text");
var paragraphs = div.ChildNodes.Where(item => item.Name == "p");
Without explicit Linq:
var paragraphs = doc.GetElementbyId("body_text").Elements("p");