C# HtmlDocument Extract Classes - c#

I am writing some code to loop through every element in an HTML page and extract all IDs and classes.
My current code is able to extract the IDs, but I can't see a way to get the classes. Does anybody know where I can access these?
private void ParseElements()
{
    // GET: Document from Browser
    HtmlDocument ThisDocument = Browser.Document;
    // DECLARE: List of IDs
    List<string> ListIdentifiers = new List<string>();
    // LOOP: Through Each Element
    for (int LoopA = 0; LoopA < ThisDocument.All.Count; LoopA += 1)
    {
        // DETERMINE: Whether ID Exists in Element
        if (ThisDocument.All[LoopA].Id != null)
        {
            // ADD: Identifier to List
            ListIdentifiers.Add(ThisDocument.All[LoopA].Id);
        }
    }
}

You could get the inner HTML of each node and use a regular expression to get the class. Alternatively, you could try the HTML Agility Pack.
Something like this:
// Re-parse the browser's current markup with HTML Agility Pack
HtmlAgilityPack.HtmlDocument AgilePack = new HtmlAgilityPack.HtmlDocument();
AgilePack.LoadHtml(ThisDocument.Body.OuterHtml);
// Select every element node in the document
HtmlNodeCollection Nodes = AgilePack.DocumentNode.SelectNodes(@"//*");
foreach (HtmlAgilityPack.HtmlNode Node in Nodes)
{
    // Only elements that carry a class attribute
    if (Node.Attributes["class"] != null)
        MessageBox.Show(Node.Attributes["class"].Value);
}
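
One detail worth noting: a class attribute can hold several whitespace-separated names, so the value read above may need splitting before it goes into a list. A minimal sketch using only the base class library; the class and method names (ClassAttributeHelper, SplitClassNames) are hypothetical helpers, not part of either HtmlDocument API:

```csharp
using System;
using System.Collections.Generic;

public static class ClassAttributeHelper
{
    // Hypothetical helper: splits a raw class attribute value such as
    // "btn  btn-primary" into individual class names, skipping duplicates.
    public static List<string> SplitClassNames(string classAttributeValue)
    {
        var names = new List<string>();
        if (string.IsNullOrWhiteSpace(classAttributeValue))
            return names;

        // HTML separates class names with whitespace (spaces, tabs, newlines).
        char[] separators = { ' ', '\t', '\r', '\n', '\f' };
        foreach (string name in classAttributeValue.Split(separators, StringSplitOptions.RemoveEmptyEntries))
        {
            if (!names.Contains(name))
                names.Add(name);
        }
        return names;
    }
}
```

Inside the loop above it could be called as ClassAttributeHelper.SplitClassNames(Node.Attributes["class"].Value), adding the results to a list kept alongside ListIdentifiers.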

Related

How to extract text from structure elements in a tagged pdf using itext7

I want to read a tagged PDF, traverse the structure tree, and extract the text for each element. The final output would be something like:
- document
  - div
    - H1
      - "The title of the document"
    - P
      - "The contents of the paragraph"
I can traverse the tree using this code:
if (doc.IsTagged())
{
    var root = doc.GetStructTreeRoot();
    var stack = new Stack<iText.Kernel.Pdf.Tagging.IStructureNode>();
    var stack2 = new Stack<iText.Kernel.Pdf.Tagging.IStructureNode>();
    stack.Push(root);
    while (stack.Count > 0)
    {
        var currentNode = stack.Pop();
        stack2.Push(currentNode);
        var kids = currentNode.GetKids();
        if (kids != null)
        {
            foreach (var kid in kids)
            {
                stack.Push(kid);
            }
        }
    }
    while (stack2.Count > 0)
    {
        var currentNode = stack2.Pop();
        var role = currentNode.GetRole()?.ToString();
        if (currentNode is iText.Kernel.Pdf.Tagging.PdfMcrDictionary mcr) {
            // this is where I want to extract the text from the structured node
        }
    }
}
I am not sure how to get the actual text that would go inside the structure node, e.g. the contents of H1, P and other tags.
There is an out-of-the-box solution for reading the document tag structure: it's called TaggedPdfReaderTool. It allows you to parse the tag structure, including the textual content of each element, and create an XML file with that content.
Example of how to use the tool (this snippet uses the Java API):
FileOutputStream xmlOut = new FileOutputStream(outXmlPath);
new TaggedPdfReaderTool(pdfDocument).setRootTag("root").convertToXml(xmlOut);
If the XML structure does not work well for you, you can look at the implementation for inspiration: the class is self-contained and includes the logic for extracting the text from tags.
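
Since the question's code is C#, the same tool should be reachable through the .NET port of iText 7. A minimal sketch, assuming the class lives in the iText.Kernel.Utils namespace with PascalCase method names (SetRootTag, ConvertToXml) that mirror the Java calls above; the wrapper class, its method name, and both path parameters are hypothetical:

```csharp
using System.IO;
using iText.Kernel.Pdf;
using iText.Kernel.Utils; // assumed namespace of TaggedPdfReaderTool in the .NET port

public static class TaggedPdfExport
{
    // Hypothetical wrapper: writes the tag structure of a tagged PDF,
    // including the text under each tag, to an XML file.
    public static void WriteTagStructureAsXml(string pdfPath, string outXmlPath)
    {
        var pdfDocument = new PdfDocument(new PdfReader(pdfPath));
        try
        {
            using (var xmlOut = new FileStream(outXmlPath, FileMode.Create))
            {
                new TaggedPdfReaderTool(pdfDocument)
                    .SetRootTag("root")
                    .ConvertToXml(xmlOut);
            }
        }
        finally
        {
            // Release the reader whether or not the conversion succeeded.
            pdfDocument.Close();
        }
    }
}
```

The resulting XML can then be walked with any XML API to rebuild the document / div / H1 / P outline from the question.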

Select a child node in XML with specific name using C#

I am trying to find a child element with the tag name Reason.
I have an XML document that basically contains a bunch of elements named Entity.
The Reason tag is somewhere inside each Entity (along with other elements).
void IParseResponse.ParseResponseData(XmlDocument responseDocument)
{
    List<string> reasons = new List<string>();
    var reasonValue = "";
    var entityList = responseDocument.GetElementsByTagName("Entity");
    if (entityList != null)
    {
        foreach (XmlNode reason in entityList)
        {
            reasonValue = //look into current Entity element, find Reason in it and get its inner text.
            reasons.Add(reasonValue);
        }
    }
}
This is the location of the Reason element:
<Entity>
    <WatchList>
        <Match ID="1">
            <MatchDetails>
                <Reason>
Does anybody have experience with this?
Here's how you can get all the Reason elements.
var xml = "<Entity> <WatchList><Match ID=\"1\"><MatchDetails><Reason>asdasd</Reason></MatchDetails></Match></WatchList></Entity>";
var x = XDocument.Parse(xml);
var reasons = x.Descendants("Reason").ToList();
foreach (var reason in reasons)
{
    Console.WriteLine(reason.Value);
}
If you give us a more complete example of your XML I can improve the answer.
Edit:
If you want to use XmlDocument instead, you could do this:
XmlNodeList nodes = responseDocument.GetElementsByTagName("Reason");
for (int i = 0; i < nodes.Count; i++)
{
    Console.WriteLine(nodes[i].InnerText);
}
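
Both snippets above collect every Reason in the document. If the Reason has to be looked up inside each Entity, as the loop in the question does, a relative XPath from the current Entity node is one way to do it. A minimal sketch, assuming the document uses no XML namespaces; the class and method names are hypothetical:

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

public static class ReasonReader
{
    // Hypothetical helper: returns the text of the first Reason found
    // anywhere below each Entity element, skipping entities without one.
    public static List<string> GetReasonsPerEntity(XmlDocument responseDocument)
    {
        var reasons = new List<string>();
        foreach (XmlNode entity in responseDocument.GetElementsByTagName("Entity"))
        {
            // ".//Reason" searches only the descendants of this Entity.
            XmlNode reason = entity.SelectSingleNode(".//Reason");
            if (reason != null)
                reasons.Add(reason.InnerText.Trim());
        }
        return reasons;
    }
}
```

The leading dot matters: "//Reason" without it would search from the document root on every iteration and return the same first Reason each time.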

MVC StackOverflowException with larger html data

I have the following method (I'm using the HtmlAgilityPack):
public DataTable tableIntoTable(HtmlDocument doc)
{
    var nodes = doc.DocumentNode.SelectNodes("//table");
    var table = new DataTable("MyTable");
    table.Columns.Add("raw", typeof(string));
    foreach (var node in nodes)
    {
        if (
            (!node.InnerHtml.Contains("pldefault"))
            && (!node.InnerHtml.Contains("ntdefault"))
            && (!node.InnerHtml.Contains("bgtabon"))
        )
        {
            table.Rows.Add(node.InnerHtml);
        }
    }
    return table;
}
It accepts HTML grabbed using this:
public HtmlDocument getDataWithGet(string url)
{
    using (var wb = new WebClient())
    {
        string response = wb.DownloadString(url);
        var doc = new HtmlDocument();
        doc.LoadHtml(response);
        return doc;
    }
}
Everything works fine with an HTML document that is 3294 lines long.
When I feed it some HTML that is 33960 lines long, I get:
StackOverflowException was unhandled, at the if statement in the tableIntoTable method, as seen in this image:
http://imgur.com/Q2FnIgb
I thought it might be related to the MaxHttpCollectionKeys limit of 1000, so I tried putting this in my Web.config, but it still doesn't work:
<add key="aspnet:MaxHttpCollectionKeys" value="9999" />
I'm not really sure where to go from here; it only breaks with larger HTML documents.
Assuming the values in your if statement are contained in some attribute value of some descendant of a table, you can do the filtering in the XPath itself:
var xpath = @"//table[not(.//*[contains(@*,'pldefault') or
    contains(@*,'ntdefault') or
    contains(@*,'bgtabon')])]";
var tables = doc.DocumentNode.SelectNodes(xpath);
Update: More accurately, based on your comments:
@"//table[not(.//td[contains(@class,'pldefault') or
    contains(@class,'ntdefault') or
    contains(@class,'bgtabon')])]";
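
To see what that predicate keeps and what it drops, it can be tried on a small fragment. The sketch below uses System.Xml from the base class library rather than HtmlAgilityPack, purely so it runs without extra packages; it assumes the input is well-formed markup, and the class name, method name, and id attributes are hypothetical:

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

public static class TableFilterDemo
{
    // Same expression as in the update above: keep a table only if none of
    // its descendant td elements has a class containing one of the markers.
    public const string Xpath =
        "//table[not(.//td[contains(@class,'pldefault') or " +
        "contains(@class,'ntdefault') or " +
        "contains(@class,'bgtabon')])]";

    // Hypothetical helper: returns the id attribute of every table that
    // survives the filter, in document order.
    public static List<string> SelectTableIds(string wellFormedMarkup)
    {
        var doc = new XmlDocument();
        doc.LoadXml(wellFormedMarkup);
        var ids = new List<string>();
        foreach (XmlElement table in doc.SelectNodes(Xpath))
            ids.Add(table.GetAttribute("id"));
        return ids;
    }
}
```

Note that `.//td` looks at all descendants, so an outer table that wraps a marked table is dropped as well.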

How do I get the text from a nested <p> tag in an external HTML page using HTML Agility Pack?

I am trying to get some text from an external site. The text I am trying to get is nested in a paragraph tag. The div has a class value.
HTML code snippet:
<div class="discription"><p>this is the text I want to grab</p></div>
Current C# code:
public String getDiscription(string url)
{
    var web = new HtmlWeb();
    var doc = web.Load(url);
    var nodes = doc.DocumentNode.SelectNodes("//div[@class='discription']");
    if (nodes != null)
    {
        foreach (var node in nodes)
        {
            // Return the markup of the first matching div
            string Description = node.InnerHtml;
            return Description;
        }
    }
    // Reached when nothing matched, so every code path returns a value
    string error = "could not find text";
    return error;
}
What I don't understand is the syntax of the XPath //div[@class='discription']. I know it is wrong; what should the XPath be?
Use //div[@class='discription']/p.
Breakdown:
//div - All div elements
[@class='discription'] - With a class attribute whose value is discription
/p - Select the child p elements
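
Put together, the lookup might look like the sketch below. It assumes the HtmlAgilityPack package is referenced and parses an HTML string instead of loading a URL so it can be tried offline; the class and method names are hypothetical:

```csharp
using HtmlAgilityPack;

public static class DescriptionReader
{
    // Hypothetical helper: returns the text of the first <p> that is a direct
    // child of <div class="discription">, or null when nothing matches.
    public static string GetDescriptionText(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectSingleNode returns null when the XPath matches nothing.
        HtmlNode paragraph = doc.DocumentNode.SelectSingleNode("//div[@class='discription']/p");
        return paragraph != null ? paragraph.InnerText : null;
    }
}
```

InnerText is used here instead of InnerHtml so that any markup inside the paragraph is left out. The [@class='discription'] test is an exact string comparison, so it would not match a div whose class attribute lists several names.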

Removing commented lines from InnerText

I'm currently using the code below, which extracts the InnerText. However, I'm stuck with a bunch of commented-out lines of HTML (<!--). How do I remove these using the code below?
HtmlWeb hwObject = new HtmlWeb();
HtmlAgilityPack.HtmlDocument htmldocObject = hwObject.Load(htmlURL);
foreach (var script in htmldocObject.DocumentNode.Descendants("script").ToArray())
    script.Remove();
HtmlNode body = htmldocObject.DocumentNode.SelectSingleNode("//body");
resultingHTML = body.InnerText.ToString();
Just filter the descendants down to comment nodes and call Remove on them.
var rootNode = doc.DocumentNode;
var query = rootNode.Descendants().OfType<HtmlCommentNode>().ToList();
foreach (var comment in query)
{
    comment.Remove();
}
This is probably a better answer:
public static void RemoveComments(HtmlNode node)
{
    foreach (var n in node.ChildNodes.ToArray())
        RemoveComments(n);
    if (node.NodeType == HtmlNodeType.Comment)
        node.Remove();
}
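
A third option is to select the comment nodes with XPath and remove them before reading InnerText. A minimal sketch, assuming the HtmlAgilityPack package is referenced and that its SelectNodes accepts the comment() node test; the class and method names are hypothetical, and the HTML is passed in as a string so the example can be tried offline:

```csharp
using HtmlAgilityPack;

public static class CommentStripper
{
    // Hypothetical helper: removes every comment node, then returns the
    // text of <body> (or of the whole document when there is no body).
    public static string GetBodyTextWithoutComments(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null, not an empty collection, when nothing matches.
        HtmlNodeCollection comments = doc.DocumentNode.SelectNodes("//comment()");
        if (comments != null)
        {
            foreach (HtmlNode comment in comments)
                comment.Remove();
        }

        HtmlNode body = doc.DocumentNode.SelectSingleNode("//body");
        return body != null ? body.InnerText : doc.DocumentNode.InnerText;
    }
}
```

The script-removal loop from the question can run before or after this step; the two do not interfere with each other.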
