C# Strip HTML Markup in XML - c#

i really hope someone can help me with this issue. The solution should be on C#.
I have a xml file with the size of 36 MB and with 900k lines. On some nodes it has a lot of html markup and some invalid markup like
<Obs><p>
<jantes -="" .="" 22.000="" apenas="" exclusive="" kms.="" leve="" liga="" o=""> </jantes></p>
I've tried different ways to clean this file but only one way is able to perform the task, however, as this is being executed on a web application it's blocking the application and taking around 6 minutes to finish the task and consuming around 450MB in memory.
As this file is an invalid xml i cannot use XmlTextReader.
Using XLST, based on Strip HTML-like characters (not markup) from XML with XSLT? ,strangely i'm also with problems with HTML Entities.
The process that worked (with some tweaks) is the following on http://www.codeproject.com/Articles/19652/HTML-Tag-Stripper
Thanks
Edit:
Following Kevin's suggestions. I'm trying to build a solution using HTML Agility Pack.
At least to do some benchmarks.
I'm stuck however. Imagine the following xml node:
<Obs><p> I WANT THIS TEXT<jantes -="" .="" 22.000="" apenas="" exclusive="" kms.="" leve="" liga="" o=""> </jantes></p></Obs>
How can i strip the tags inside "obs" tag, keep the tag "obs" and also keep the text "I WANT THIS TEXT" ? Basically this:
<Obs>I WANT THIS TEXT</Obs>
For now this is the code i have:
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(text);
Queue<HtmlNode> nodes = new Queue<HtmlNode>(doc.DocumentNode.SelectNodes("./*|./text()"));
while (nodes.Count > 0)
{
HtmlNode node = nodes.Dequeue();
HtmlNode parentNode = node.ParentNode;
HtmlNodeCollection childNodes = node.SelectNodes("./*|./text()");
if (childNodes != null)
{
foreach (HtmlNode child in childNodes)
{
if (child.Name != "obs")
{
nodes.Enqueue(child);
}
else
{
childNodes = child.SelectNodes("//p|//jantes");
foreach (HtmlNode nodeToStrip in childNodes)
nodeToStrip.ParentNode.RemoveChild(nodeToStrip);
}
}
}
}
string s = doc.DocumentNode.InnerHtml;
Thanks :)
EDIT 2
Ok, i was able to complete the task. However this is taking too much time. About 3 hours and consuming 800MB in memory.
Still needing help!
Here is the code, it might help someone.
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(text);
Queue<HtmlNode> nodes = new Queue<HtmlNode>(doc.DocumentNode.SelectNodes("./*|./text()"));
while (nodes.Count > 0)
{
HtmlNode node = nodes.Dequeue();
HtmlNode parentNode = node.ParentNode;
HtmlNodeCollection childNodes = node.SelectNodes("./*|./text()");
if (childNodes != null)
{
foreach (HtmlNode child in childNodes)
{
if (child.Name != "obs")
{
nodes.Enqueue(child);
}
else
{
childNodes = child.SelectNodes("//p|//jantes");
if (childNodes != null)
{
foreach (HtmlNode nodeToStrip in childNodes)
{
var replacement = doc.CreateTextNode(nodeToStrip.InnerText);
nodeToStrip.ParentNode.ReplaceChild(replacement, nodeToStrip);
}
}
}
}
}
}
string s = doc.DocumentNode.InnerHtml;

Have you tried Html Agility Pack? Among its claims:
the parser is very tolerant with "real world" malformed HTML
you can fix a page the way you want, modify the DOM, add nodes, copy nodes, well... you name it

Related

Is it possible to get element from webpage?

In my program, I'm using webbrowser (C#) and I want to get all element from the current page to text. Can anyone help me?
Code:
HtmlElement htmlelement = webBrowser1.Document.GetElementById("html");
if (htmlelement == null)
{
}
else
{
richTextBox1.Text = webBrowser1.Document.GetElementById("html").OuterText;
}
Ps. OuterHtml can use on this?
You can use the HTML Agility Pack
HtmlWeb web = new HtmlWeb();
HtmlDocument doc = web.Load("http://www.example.com/");
HtmlNodeCollection tags = doc.DocumentNode.SelectNodes("//tag1//tag2");
In Java i use the below code, some reformating should fetch you C# code.
List<WebElement> webPageElements = driver.findElementsByTagName(webHTMLTagName);
// Loop Over All WebPage Elements with same Tag Type
for (WebElement webElement : webPageElements) {
System.out.println(webElement.getAttribute("type"));
System.out.println(webElement.getAttribute("name"));
System.out.println(webElement.getAttribute("id"));
}
webHTMLTagName can be your html tag.(for eg., "input")

Parse Html Document Get All input fields with ID and Value

I have several thousand (ASP.net - messy html) html generated invoices that I'm trying to parse and save into a database.
Basically like:
foreach(var htmlDoc in HtmlFolder)
{
foreach(var inputBox in htmlDoc)
{
//Make Collection of ID and Values Insert to DB
}
}
From all the other questions I've read the best tool for this type of problem is the HtmlAgilityPack, however for the life of me I can't get the documentation .chm file to work. Any ideas on how I could accomplish this with or without the Agility Pack ?
Thanks in advance
An newer alternative to HtmlAgilityPack is CsQuery. See this later question on its relative performance merits, but its use of CSS selectors can't be beat:
var doc = CQ.CreateDocumentFromFile(htmldoc); //load, parse the file
var fields = doc["input"]; //get input fields with CSS
var pairs = fields.Select(node => new Tuple<string, string>(node.Id, node.Value()))
//get values
To get the CHM to work, you probably need to view the properties in Windows Explorer and uncheck the "Unblock Content" checkbox.
The HTML Agility Pack is quite easy when you know your way around Linq-to-XML or XPath.
Basics you'll need to know:
//import the HtmlAgilityPack
using HtmlAgilityPack;
HtmlDocument doc = new HtmlDocument();
// Load your data
// -----------------------------
// Load doc from file:
doc.Load(pathToFile);
// OR
// Load doc from string:
doc.LoadHtml(contentsOfFile);
// -----------------------------
// Find what you're after
// -----------------------------
// Finding things using Linq
var nodes = doc.DocumentNode.DescendantsAndSelf("input")
.Where(node => !string.IsNullOrWhitespace(node.Id)
&& node.Attributes["value"] != null
&& !string.IsNullOrWhitespace(node.Attributes["value"].Value));
// OR
// Finding things using XPath
var nodes = doc.DocumentNode
.SelectNodes("//input[not(#id='') and not(#value='')]");
// -----------------------------
// looping through the nodes:
// the XPath interfaces can return null when no nodes are found
if (nodes != null)
{
foreach (var node in nodes)
{
var id = node.Id;
var value = node.Attributes["value"].Value;
}
}
The easiest way to add the HtmlAgility Pack is using NuGet:
PM> Install-Package HtmlAgilityPack
Hah, looks like the ideal time to make a shameless plug of a library I wrote!
This should be rather easy to accomplish with this library (that's built on top of HtmlAgility pack by the way!) : https://github.com/amoerie/htmlbuilders
(You can find the Nuget package here: https://www.nuget.org/packages/HtmlBuilders/ )
Code samples:
const string html = "<div class='invoice'><input type='text' name='abc' value='123'/><input id='ohgood' type='text' name='def' value='456'/></div>";
var htmlDocument = new HtmlDocument {OptionCheckSyntax = false}; // avoid exceptions when html is invalid
htmlDocument.Load(new StringReader(html));
var tag = HtmlTag.Parse(htmlDocument); // if there is a root tag
var tags = HtmlTag.ParseAll(htmlDocument); // if there is no root tag
// find looks recursively through the entire DOM tree
var inputFields = tag.Find(t => string.Equals(t.TagName, "input"));
foreach (var inputField in inputFields)
{
Console.WriteLine(inputField["type"]);
Console.WriteLine(inputField["value"]);
if(inputField.HasAttribute("id"))
Console.WriteLine(inputField["id"]);
}
Note that inputField[attribute] will throw a 'KeyNotFoundException' if that field does not have the specified attribute name. That's because HtmlTag implements and reuses IDictionary logic for its attributes.
Edit: If you're not running this code in a web environment, you'll need to add a reference to System.Web. That's because this library makes use of the HtmlString class which can be found in System.Web. Just choose 'Add reference' and then you can find it under 'Assemblies > Framework'
You can download HtmlAgilityPack Documents CHM file from here.
If chm file contents are not visible then un-check Always ask before opening this file check-box as shown in screen shot
Note: The above dialog appears for unsigned files
Source: HtmlAgilityPack Documentation

HTML Agility Pack - How can append element at the top of Head element?

I'm trying to use HTML Agility Pack to append a script element into the top of the HEAD section of my html. The examples I have seen so far just use the AppendChild(element) method to accomplish this. I need the script that I am appending to the head section to come before some other scripts. How can I specify this?
Here's what I'm trying:
HtmlDocument htmlDocument = new HtmlDocument();
htmlDocument.Load(filePath);
HtmlNode head = htmlDocument.DocumentNode.SelectSingleNode("/html/head");
HtmlNode stateScript = htmlDocument.CreateElement("script");
head.AppendChild(stateScript);
stateScript.SetAttributeValue("id", "applicationState");
stateScript.InnerHtml = "'{\"uid\":\"testUser\"}'";
I would like a script tag to be added toward the top of HEAD rather than appended at the end.
Realizing that this is an old question, there is also the possibility of prepending child elements that might not have existed then.
// Load content as new Html document
HtmlDocument html = new HtmlDocument();
html.LoadHtml(oldContent);
// Wrapper acts as a root element
string newContent = "<div>" + someHtml + "</div>";
// Create new node from newcontent
HtmlNode newNode = HtmlNode.CreateNode(newContent);
// Get body node
HtmlNode body = html.DocumentNode.SelectSingleNode("//body");
// Add new node as first child of body
body.PrependChild(newNode);
// Get contents with new node
string contents = html.DocumentNode.InnerHtml;
Got it..
HtmlNode has the following methods:
HtmlNode.InsertBefore(node, refNode)
HtmlNode.InsertAfter(nodeToAdd, refNode)

How can I retrieve all the text nodes of a HTMLDocument in the fastest way in C#?

I need to perform some logic on all the text nodes of a HTMLDocument. This is how I currently do this:
HTMLDocument pageContent = (HTMLDocument)_webBrowser2.Document;
IHTMLElementCollection myCol = pageContent.all;
foreach (IHTMLDOMNode myElement in myCol)
{
foreach (IHTMLDOMNode child in (IHTMLDOMChildrenCollection)myElement.childNodes)
{
if (child.nodeType == 3)
{
//Do something with textnode!
}
}
}
Since some of the elements in myCol also have children, which themselves are in myCol, I visit some nodes more than once! There must be some better way to do this?
It might be best to iterate over the childNodes (direct descendants) within a recursive function, starting at the top-level, something like:
HtmlElementCollection collection = pageContent.GetElementsByTagName("HTML");
IHTMLDOMNode htmlNode = (IHTMLDOMNode)collection[0];
ProcessChildNodes(htmlNode);
private void ProcessChildNodes(IHTMLDOMNode node)
{
foreach (IHTMLDOMNode childNode in node.childNodes)
{
if (childNode.nodeType == 3)
{
// ...
}
ProcessChildNodes(childNode);
}
}
You could access all the text nodes in one shot using XPath in HTML Agility Pack.
I think this would work as shown, but have not tried this out.
using HtmlAgilityPack;
HtmlDocument htmlDoc = new HtmlDocument();
// filePath is a path to a file containing the html
htmlDoc.Load(filePath);
HtmlNodeCollection coll = htmlDoc.DocumentNode.SelectNodes("//text()");
foreach (HTMLNode node in coll)
{
// do the work for a text node here
}

Encoding error when using HTML Agility Pack

I'm trying to parse a html doc
using some code I found from this actual site
but I keep getting a parsing error
HtmlAgilityPack.HtmlDocument htmlDoc = new HtmlAgilityPack.HtmlDocument();
// There are various options, set as needed
htmlDoc.OptionFixNestedTags = true;
// filePath is a path to a file containing the html
htmlDoc.Load(#"C:\Documents and Settings\Mine\My Documents\Random.html");
// Use: htmlDoc.LoadXML(xmlString); to load from a string
// ParseErrors is an ArrayList containing any errors from the Load statement
if (htmlDoc.ParseErrors != null && htmlDoc.ParseErrors.Count > 0)
{
// Handle any parse errors as required
MessageBox.Show("Oh no");
}
else
{
if (htmlDoc.DocumentNode != null)
{
HtmlAgilityPack.HtmlNode bodyNode = htmlDoc.DocumentNode.SelectSingleNode("//head");
if (bodyNode != null)
{
MessageBox.Show("Hello");
}
}
}
Any help would be appreciated :)
In the wild, HTML is likely to be non-conformant, non-compliant, and non-validating. Only XHTML or very simple HTML will go without populating ParseErrors. I've noticed that the HTML Agility Pack is fairly robust and will still build a decent DOM tree from most HTML sources, even if ParseErrors are generated. Drop the else, and let that else block execute normally.
If it did not build the DOM tree, then you should investigate the ParseError(s) that were generated. If it only built a partial tree, try recursing over the nodes, printing or messagebox'ing to see which parts of the DOM tree got built or not. You might not need the whole tree.

Categories

Resources