Checking Descendants for matches in C#

I'm working on a project where I scrape content from a web page using the HTML Agility Pack until I get the nodes I need, using the code below:
foreach (HtmlNode tr in bodyNode.SelectNodes("//tr")) {
    //tr.Descendants("span").Where(n => n.InnerHtml.Contains("display:none")).ToList()
    //  .ForEach(n => n.Remove());
}
After each loop I get HTML content similar to what I have pasted here:
http://pastebin.com/DLVykNXU
What I need to do is check the node for matches. Basically I need to remove
<span style="display:none">whatever value here</span>
<div style="display:none">whatever value here</div>
and some other divs and spans with various ID and class names. But the code I've written above fails me, since it deletes the entire node rather than only the hidden tags and the content inside them.
I would be very thankful if an expert could help me with this. I'm still a student learning C#, so I'm sorry if the code isn't perfect.
Thank you.

As you said... you'll have to do some tricks.
foreach (var tr in doc.DocumentNode.SelectNodes("//tr")) {
    var style = tr.SelectSingleNode(".//style");
    // Find the classes and ids with {display:none}
    var matches = Regex.Matches(style.InnerText, @"(\.|#)(.+?)\s*{\s*display\s*:\s*none");
    // Here we will store the classes & ids we'll need to remove
    List<string> classes = new List<string>();
    List<string> ids = new List<string>();
    // Storing the ids and classes
    foreach (Match m in matches) {
        var type = m.Groups[1].Value;
        if (type == ".") {
            classes.Add(m.Groups[2].Value);
        }
        else {
            ids.Add(m.Groups[2].Value);
        }
    }
    foreach (var n in tr.SelectNodes(".//*")) {
        if (Remove(n, classes, ids)) {
            n.Remove();
        }
    }
    var proxy = tr.SelectSingleNode("./td[2]/span").InnerText;
    var port = tr.SelectSingleNode("./td[3]").InnerText.Trim('\r', '\n', ' ');
}
and add the following method:
// Remove the ones that have {display:none}, and the ones with the ids & classes provided.
static bool Remove(HtmlNode x, IList<string> classes, IList<string> ids) {
    var classAttr = x.GetAttributeValue("class", "");
    var idAttr = x.GetAttributeValue("id", "");
    return (x.Name == "span" && x.GetAttributeValue("style", "") == "display:none") ||
           (x.Name == "div" && x.GetAttributeValue("style", "") == "display:none") ||
           (x.Name == "span" && classes.Contains(classAttr)) ||
           (x.Name == "div" && classes.Contains(classAttr)) ||
           (x.Name == "span" && ids.Contains(idAttr)) ||
           (x.Name == "div" && ids.Contains(idAttr)) ||
           (x.Name == "style");
}
You could add more filters in the Remove method. Since you're writing C# code, you can check anything you want there, not only conditions expressible in XPath.
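The Regex step above can be exercised on its own, independent of HtmlAgilityPack. A minimal sketch against a made-up style block (the selectors here are invented for the demo):

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Same pattern as above, run against a made-up style block.
var css = ".x15 {display:none} #p7 { display : none } .visible { color: red }";
var classes = new List<string>();
var ids = new List<string>();
foreach (Match m in Regex.Matches(css, @"(\.|#)(.+?)\s*{\s*display\s*:\s*none"))
{
    if (m.Groups[1].Value == ".")
        classes.Add(m.Groups[2].Value);   // class selector: .name
    else
        ids.Add(m.Groups[2].Value);       // id selector: #name
}
Console.WriteLine(string.Join(",", classes)); // x15
Console.WriteLine(string.Join(",", ids));     // p7
```

Note that `.visible` is skipped because the pattern only matches rules whose first declaration is display:none.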
That code gave me the following list of proxies:
190.199.36.220:8080
177.139.137.107:3128
103.247.23.90:8080
222.124.130.203:8080
14.140.241.242:8080
175.103.37.10:8080
110.77.183.113:3128
54.243.51.203:8118
200.90.179.90:8080
213.152.173.137:8080
187.17.212.162:8080
62.201.207.14:8080
77.123.76.178:8080
189.76.212.254:3128
89.218.224.234:9090
221.179.173.170:8080
187.84.56.42:3128
118.99.79.13:8080
211.86.157.110:3128
189.38.3.122:3128
2.135.238.178:9090
2.135.238.2:9090
122.50.38.128:3128
217.11.185.251:3128
82.200.254.2:9090
37.59.82.253:8080
83.111.38.131:3128
85.118.227.76:3128
182.30.249.13:8080
124.88.154.3:6673
111.13.87.150:80
190.85.37.90:8080
219.117.232.133:3128
211.100.47.138:8990
46.32.21.195:8080
107.18.121.126:8080
118.97.191.203:8080
119.195.32.211:3128
2.133.92.242:9090
202.164.217.18:8080
222.124.214.194:3128
79.140.17.253:3128
61.138.104.30:1080
201.45.116.138:3128
190.98.209.168:3128
190.204.222.183:8080
200.199.173.122:3128
197.159.16.58:8080
223.4.233.164:3128
212.93.195.229:3128

If you want to extract the content, it would be easier if you just select it with
"//tr//span[contains(@style, 'display:none')]/text()"
Similarly for spans with a specific id or class attribute:
"//tr//span[@id='123']/text()"
or
"//tr//span[contains(@class, 'cl17')]/text()"
If you want, you can combine all these conditions:
"//tr//span[contains(@style, 'display:none') or @id='123' or contains(@class, 'cl17')]/text()"

Related

Replace multiple p tags containing line breaks or non breaking spaces with a single line break using HTML Agility Pack

How can I remove multiple "empty p tags", "p tags containing a non-breaking space", or "p tags containing a line break" and replace them with a single "p tag containing a line break"? I assume something like the HTML Agility Pack is a better solution than Regex, but I am open to suggestions.
For example the following HTML:
<p>Test</p><p>&nbsp;</p><p>&nbsp;</p><p></p><p></p><p>&nbsp;</p><p>Test 2</p>
Or the following more complex example:
<p>Test</p><p>&nbsp;</p><p><br/></p><p><p></p><br data-mce-bogus="1"></p><p></p><p>Test 2</p>
Would get replaced with the following:
<p>Test</p><p><br></p><p>Test 2</p>
So effectively anything that could cause multiple line breaks in the HTML code would get replaced with just a single line break.
The HTML can be added and edited from multiple sources (i.e. web application, iOS app, Android app) and multiple rich-text editor types, so the way the line breaks have been added is not necessarily consistent, hence the need to find and replace multiple types of line break with a single one.
With a little bit of help from ChatGPT I have come up with the following code:
// Load the HTML document
var doc = new HtmlDocument();
doc.LoadHtml(value);
// Select all the p tags
var pTags = doc.DocumentNode.SelectNodes("//p");
// If no p tags found then return the value
if (pTags == null || pTags.Count <= 0)
    return value;
// Iterate p tags
for (int i = 0; i < pTags.Count; i++)
{
    // Check if current p tag
    if (pTags[i].InnerHtml.Trim() == "&nbsp;" || // Contains only a &nbsp;
        String.IsNullOrWhiteSpace(pTags[i].InnerHtml) || // Or whitespace
        (pTags[i].ChildNodes.Any(x => x.Name == "br") && pTags[i].ChildNodes.Where(x => x.Name != "br").All(x => x.InnerHtml.Trim() == "&nbsp;" || String.IsNullOrWhiteSpace(x.InnerHtml)))) // Or contains only a "br" (and possibly whitespace either side)
    {
        // Change to a break
        pTags[i].InnerHtml = "<br>";
    }
    else
        continue;
    // If this is not the first p tag
    if (i > 0)
    {
        // If current tag and previous tag both contain a line break, remove the current tag
        // (remove the node itself; it is not a direct child of DocumentNode)
        if (pTags[i].InnerHtml == "<br>" && pTags[i - 1].InnerHtml == "<br>")
            pTags[i].Remove();
    }
}
// Return the modified html
return doc.DocumentNode.OuterHtml;
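For input that happens to be well-formed, the same collapsing rule can be sketched with LINQ to XML instead of HtmlAgilityPack. This is only a sketch: it assumes the fragment parses as XML, and the wrapper <div> is scaffolding for the demo:

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

// A <p> counts as blank when its text is whitespace-only and it has no
// child elements other than <br/>.
static bool IsBlank(XElement p) =>
    string.IsNullOrWhiteSpace(string.Concat(p.Nodes().OfType<XText>().Select(t => t.Value)))
    && p.Elements().All(e => e.Name.LocalName == "br");

var root = XElement.Parse(
    "<div><p>Test</p><p> </p><p><br/></p><p></p><p>Test 2</p></div>");

bool previousBlank = false;
foreach (var p in root.Elements("p").ToList())
{
    if (!IsBlank(p)) { previousBlank = false; continue; }
    if (previousBlank) { p.Remove(); continue; }  // drop repeated blanks
    p.ReplaceNodes(new XElement("br"));           // normalize the surviving blank
    previousBlank = true;
}
// root now holds: <p>Test</p><p><br /></p><p>Test 2</p>
```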

Is it possible to convert this foreach loop into a LINQ-to-XML loop?

I originally asked this question (Can we automatically add a child element to an XElement in a sortable manner?) and it was closed as a duplicate (how to add XElement in specific location in XML Document).
This is the code that I have at the moment:
bool bInserted = false;
foreach (var weekDB in xdoc.Root.Elements())
{
    DateTime datWeekDB = DateTime.ParseExact(weekDB.Name.LocalName, "WyyyyMMdd", CultureInfo.InvariantCulture);
    if (datWeekDB != null && datWeekDB.Date > historyWeek.Week.Date)
    {
        // Insert here
        weekDB.AddBeforeSelf(xmlHistoryWeek);
        bInserted = true;
        break;
    }
}
if (!bInserted)
    xdoc.Root.Add(xmlHistoryWeek);
It works fine. But I wondered if I can use LINQ to achieve the same thing? The linked answer suggests:
Search for the element you want and use the Add method as shown below:
xDoc.Element("content")
    .Elements("item")
    .Where(item => item.Attribute("id").Value == "2").FirstOrDefault()
    .AddAfterSelf(new XElement("item", "C", new XAttribute("id", "3")));
But I don't understand how to arrive at logic like that based on my code's logic.
I think the best way to think about it is: you are going to use LINQ to find a specific element. Then you will insert into the document based on what you found.
So something like this:
var targetElement = xdoc.Root.Elements()
    .Where(weekDB => {
        // DateTime is a value type, so there is no null to check for;
        // ParseExact throws on a malformed name instead.
        DateTime datWeekDB = DateTime.ParseExact(weekDB.Name.LocalName, "WyyyyMMdd", CultureInfo.InvariantCulture);
        return datWeekDB.Date > historyWeek.Week.Date;
    })
    .FirstOrDefault();
if (targetElement == null)
{
    xdoc.Root.Add(xmlHistoryWeek);
}
else
{
    targetElement.AddBeforeSelf(xmlHistoryWeek);
}
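A quick way to convince yourself is to run the query against a throwaway document; the element names below are invented but follow the WyyyyMMdd convention from the question:

```csharp
using System;
using System.Globalization;
using System.Linq;
using System.Xml.Linq;

// Throwaway document; element names follow the WyyyyMMdd convention.
var xdoc = new XDocument(new XElement("History",
    new XElement("W20240101"),
    new XElement("W20240115")));

var xmlHistoryWeek = new XElement("W20240108");
DateTime newDate = DateTime.ParseExact("W20240108", "WyyyyMMdd", CultureInfo.InvariantCulture);

// First element dated later than the new one, if any.
var target = xdoc.Root.Elements()
    .FirstOrDefault(w =>
        DateTime.ParseExact(w.Name.LocalName, "WyyyyMMdd", CultureInfo.InvariantCulture).Date > newDate.Date);

if (target == null)
    xdoc.Root.Add(xmlHistoryWeek);
else
    target.AddBeforeSelf(xmlHistoryWeek);

Console.WriteLine(string.Join(" ", xdoc.Root.Elements().Select(e => e.Name.LocalName)));
// W20240101 W20240108 W20240115
```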

Better way to add a style attribute to Html using HtmlAgilityPack

I am using the HtmlAgilityPack. I am searching through all P tags and adding a "margin-top: 0px" to the style within the P tag.
As you can see it is kinda "brute forcing" the margin-top attribute. It seems there has to be a better way to do this using the
HtmlAgilityPack but I could not find it, and the HtmlAgilityPack documentation is non-existent.
Anybody know a better way?
HtmlNodeCollection pTagNodes = node.SelectNodes("//p[not(contains(@style,'margin-top'))]");
if (pTagNodes != null && pTagNodes.Any())
{
    foreach (HtmlNode pTagNode in pTagNodes)
    {
        if (pTagNode.Attributes.Contains("style"))
        {
            string styles = pTagNode.Attributes["style"].Value;
            pTagNode.SetAttributeValue("style", styles + "; margin-top: 0px");
        }
        else
        {
            pTagNode.Attributes.Add("style", "margin-top: 0px");
        }
    }
}
UPDATE: I have modified the code based on Alex's suggestions. I would still like to know if there is some built-in functionality in HtmlAgilityPack that will handle the style attributes in a more "DOM" manner.
const string margin = "; margin-top: 0px";
HtmlNodeCollection pTagNodes = node.SelectNodes("//p[not(contains(@style,'margin-top'))]");
if (pTagNodes != null && pTagNodes.Any())
{
    foreach (var pTagNode in pTagNodes)
    {
        string styles = pTagNode.GetAttributeValue("style", "");
        pTagNode.SetAttributeValue("style", styles + margin);
    }
}
You could simplify your code a little bit by using the HtmlNode.GetAttributeValue method, and by making your "margin-top" magic string a constant:
const string margin = "margin-top: 0";
foreach (var pTagNode in pTagNodes)
{
    var styles = pTagNode.GetAttributeValue("style", null);
    var separator = (styles == null ? null : "; ");
    pTagNode.SetAttributeValue("style", styles + separator + margin);
}
Not a very significant improvement, but this code is simpler, at least to me.
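The trick in that snippet is that C# string concatenation treats null operands as empty strings, so no conditional append is needed. In isolation (AppendMargin is an invented name for the demo):

```csharp
using System;

// styles may be null; null + null + margin is just margin, because
// string concatenation treats null operands as empty strings.
static string AppendMargin(string styles)
{
    const string margin = "margin-top: 0";
    var separator = (styles == null ? null : "; ");
    return styles + separator + margin;
}

Console.WriteLine(AppendMargin(null));         // margin-top: 0
Console.WriteLine(AppendMargin("color: red")); // color: red; margin-top: 0
```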
First of all, are you sure you need more than what you asked for? Alex's solution should work fine for your current problem; if it's always that "simple", why bother adding more complexity?
Anyway, the Agility Pack doesn't have that kind of function, but the .NET Framework surely does. Note this is all for .NET 4; if you're using an earlier version, things might be a bit different.
System.Web.dll comes with the CssStyleCollection class, which already has everything built in that you could want for parsing inline CSS. There's just one catch: its constructor is internal, so the solution is a bit "hacky".
To construct an instance of the class, all you need is a bit of reflection; the code for that has already been written here. Just keep in mind that this works now, but could break in a future version of .NET.
All that's left is really easy
CssStyleCollection css = CssStyleTools.Create();
css.Value = "border-top:1px dotted #BBB;margin-top: 0px;font-size:12px";
Console.WriteLine(css["margin-top"]); //prints "0px"
If you can't for some reason add a reference to System.Web (which would be the case if you're using .NET 4 Client Profile), there's always the possibility of using Reflector.
Personally I'd go with Alex's solution, but it's up to you to decide. :)
Just use the following extension method for HtmlNode:
public static void AddOrUpdateCssValue(this HtmlNode htmlNode, string cssKey, string cssValue)
{
    string style = htmlNode.GetAttributeValue("style", "");
    string newStyle = addOrUpdateCssStyleKeyValue(style: style, newKey: cssKey, newValue: cssValue);
    htmlNode.SetAttributeValue(name: "style", value: newStyle);
}
private static string addOrUpdateCssStyleKeyValue(string style, string newKey, string newValue)
{
    if (String.IsNullOrEmpty(newKey)) return style;
    if (String.IsNullOrEmpty(newValue)) return style;
    // An empty style simply becomes the new declaration
    if (String.IsNullOrEmpty(style)) return String.Format("{0}: {1}", newKey, newValue);
    List<string> keyValue = style.Split(';').Where(x => String.IsNullOrEmpty(x) == false).Select(x => x.Trim()).ToList();
    bool found = false;
    List<string> updatedStyles = new List<string>();
    foreach (string keyValuePair in keyValue)
    {
        if (String.IsNullOrEmpty(keyValuePair) == true) continue;
        if (keyValuePair.Contains(':') == false) continue;
        List<string> splitted = keyValuePair.Split(':').Where(x => String.IsNullOrEmpty(x) == false).Select(x => x.Trim()).ToList();
        if (splitted.Count < 2) continue;
        string key = splitted[0];
        string value = splitted[1];
        if (key == newKey)
        {
            value = newValue;
            found = true;
        }
        updatedStyles.Add(String.Format("{0}: {1}", key, value));
    }
    if (found == false)
    {
        updatedStyles.Add(String.Format("{0}: {1}", newKey, newValue));
    }
    string result = String.Join("; ", updatedStyles);
    return result;
}
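The string-merging half of this can be exercised without HtmlAgilityPack at all. Below is a condensed re-sketch of the same key/value logic (MergeCss is an invented name, not part of the extension above):

```csharp
using System;
using System.Linq;

// Parse "key: value" pairs, replace the key if present, append otherwise.
static string MergeCss(string style, string key, string value)
{
    var pairs = (style ?? "")
        .Split(';')
        .Select(s => s.Split(':'))
        .Where(a => a.Length == 2)
        .Select(a => (Key: a[0].Trim(), Value: a[1].Trim()))
        .ToList();

    int i = pairs.FindIndex(p => p.Key == key);
    if (i >= 0) pairs[i] = (key, value);   // update existing declaration
    else pairs.Add((key, value));          // or append a new one

    return string.Join("; ", pairs.Select(p => $"{p.Key}: {p.Value}"));
}

Console.WriteLine(MergeCss("color: red; margin-top: 5px", "margin-top", "0px"));
// color: red; margin-top: 0px
Console.WriteLine(MergeCss("", "margin-top", "0px"));
// margin-top: 0px
```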

Speed up parsing in HTML Agility Pack

This is a method I use to grab certain tags with the HTML Agility Pack. I use this method to do rankings with Google Local. It seems to take quite a bit of time and be memory intensive; does anyone have any suggestions to make it better?
private void findGoogleLocal(HtmlNode node) {
    String name = String.Empty;
    if (node.Attributes["id"] != null) {
        if (node.Attributes["id"].Value.Contains("panel_") && node.Attributes["id"].Value != "panel__")
        {
            GoogleLocalResults.Add(new Result(URLGoogleLocal, Listing, node, SearchEngine.Google, SearchType.Local, ResultType.GooglePlaces));
        }
    }
    if (node.HasChildNodes) {
        foreach (HtmlNode children in node.ChildNodes) {
            findGoogleLocal(children);
        }
    }
}
Why does this method have to be recursive? Just get all the nodes in one go (example using the LINQ support in HAP):
var results = node.Descendants()
    .Where(x => x.Attributes["id"] != null &&
                x.Attributes["id"].Value.Contains("panel_") &&
                x.Attributes["id"].Value != "panel__")
    .Select(x => new Result(URLGoogleLocal, Listing, x, SearchEngine.Google, SearchType.Local, ResultType.GooglePlaces));
I just want to add another clean, simple and fast solution: using XPath.
var results = node
    .SelectNodes(@"//*[contains(@id, 'panel_') and @id != 'panel__']")
    .Select(x => new Result(URLGoogleLocal, Listing, x, SearchEngine.Google, SearchType.Local, ResultType.GooglePlaces));
foreach (var result in results)
    GoogleLocalResults.Add(result);
Fizzler: CSS selector engine for HAP
http://code.google.com/p/fizzler/

What is the best way to parse html in C#? [closed]

As it currently stands, this question is not a good fit for our Q&A format. We expect answers to be supported by facts, references, or expertise, but this question will likely solicit debate, arguments, polling, or extended discussion. If you feel that this question can be improved and possibly reopened, visit the help center for guidance.
Closed 11 years ago.
Locked. This question and its answers are locked because the question is off-topic but has historical significance. It is not currently accepting new answers or interactions.
I'm looking for a library/method to parse an html file with more html specific features than generic xml parsing libraries.
Html Agility Pack
This is an agile HTML parser that builds a read/write DOM and supports plain XPath or XSLT (you actually don't HAVE to understand XPath or XSLT to use it, don't worry...). It is a .NET code library that allows you to parse "out of the web" HTML files. The parser is very tolerant of "real world" malformed HTML. The object model is very similar to what System.Xml proposes, but for HTML documents (or streams).
You could use TidyNet.Tidy to convert the HTML to XHTML, and then use an XML parser.
Another alternative would be to use the built-in engine mshtml:
using mshtml;
...
object[] oPageText = { html };
HTMLDocument doc = new HTMLDocumentClass();
IHTMLDocument2 doc2 = (IHTMLDocument2)doc;
doc2.write(oPageText);
This allows you to use JavaScript-style functions like getElementById().
I found a project called Fizzler that takes a jQuery/Sizzler approach to selecting HTML elements. It's based on HTML Agility Pack. It's currently in beta and only supports a subset of CSS selectors, but it's pretty damn cool and refreshing to use CSS selectors over nasty XPath.
http://code.google.com/p/fizzler/
You can do a lot without going nuts on 3rd-party products and mshtml (i.e. interop): use the System.Windows.Forms.WebBrowser. From there, you can do things like "GetElementById" on an HtmlDocument or "GetElementsByTagName" on HtmlElements. If you want to actually interface with the browser (simulate button clicks for example), you can use a little reflection (imo a lesser evil than interop) to do it:
var wb = new WebBrowser();
... tell the browser to navigate (tangential to this question). Then on the DocumentCompleted event you can simulate clicks like this:
var doc = wb.Document;
var elem = doc.GetElementById(elementId);
object obj = elem.DomElement;
System.Reflection.MethodInfo mi = obj.GetType().GetMethod("click");
mi.Invoke(obj, new object[0]);
You can do similar reflection stuff to submit forms, etc.
Enjoy.
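The reflection pattern in that snippet is general-purpose; stripped of the WebBrowser specifics, it is just late-bound invocation. A sketch with an invented Counter class standing in for the DOM element:

```csharp
using System;
using System.Reflection;

object obj = new Counter();
// Same shape as the WebBrowser example: look the method up by name, then invoke it.
MethodInfo mi = obj.GetType().GetMethod("Click");
mi.Invoke(obj, new object[0]);
Console.WriteLine(((Counter)obj).Clicks); // 1

class Counter
{
    public int Clicks { get; private set; }
    public void Click() => Clicks++;  // stands in for the DOM element's click()
}
```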
I've written some code that provides "LINQ to HTML" functionality. I thought I would share it here. It is based on Majestic 12. It takes the Majestic-12 results and produces LINQ XML elements. At that point you can use all your LINQ to XML tools against the HTML. As an example:
IEnumerable<XNode> auctionNodes = Majestic12ToXml.Majestic12ToXml.ConvertNodesToXml(byteArrayOfAuctionHtml);
foreach (XElement anchorTag in auctionNodes.OfType<XElement>().DescendantsAndSelf("a")) {
    if (anchorTag.Attribute("href") == null)
        continue;
    Console.WriteLine(anchorTag.Attribute("href").Value);
}
I wanted to use Majestic-12 because I know it has a lot of built-in knowledge with regards to HTML that is found in the wild. What I've found though is that to map the Majestic-12 results to something that LINQ will accept as XML requires additional work. The code I'm including does a lot of this cleansing, but as you use this you will find pages that are rejected. You'll need to fix up the code to address that. When an exception is thrown, check exception.Data["source"] as it is likely set to the HTML tag that caused the exception. Handling the HTML in a nice manner is at times not trivial...
So now that expectations are realistically low, here's the code :)
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using Majestic12;
using System.IO;
using System.Xml.Linq;
using System.Diagnostics;
using System.Text.RegularExpressions;
namespace Majestic12ToXml {
public class Majestic12ToXml {
static public IEnumerable<XNode> ConvertNodesToXml(byte[] htmlAsBytes) {
HTMLparser parser = OpenParser();
parser.Init(htmlAsBytes);
XElement currentNode = new XElement("document");
HTMLchunk m12chunk = null;
int xmlnsAttributeIndex = 0;
string originalHtml = "";
while ((m12chunk = parser.ParseNext()) != null) {
try {
Debug.Assert(!m12chunk.bHashMode); // popular default for Majestic-12 setting
XNode newNode = null;
XElement newNodesParent = null;
switch (m12chunk.oType) {
case HTMLchunkType.OpenTag:
// Tags are added as a child to the current tag,
// except when the new tag implies the closure of
// some number of ancestor tags.
newNode = ParseTagNode(m12chunk, originalHtml, ref xmlnsAttributeIndex);
if (newNode != null) {
currentNode = FindParentOfNewNode(m12chunk, originalHtml, currentNode);
newNodesParent = currentNode;
newNodesParent.Add(newNode);
currentNode = newNode as XElement;
}
break;
case HTMLchunkType.CloseTag:
if (m12chunk.bEndClosure) {
newNode = ParseTagNode(m12chunk, originalHtml, ref xmlnsAttributeIndex);
if (newNode != null) {
currentNode = FindParentOfNewNode(m12chunk, originalHtml, currentNode);
newNodesParent = currentNode;
newNodesParent.Add(newNode);
}
}
else {
XElement nodeToClose = currentNode;
string m12chunkCleanedTag = CleanupTagName(m12chunk.sTag, originalHtml);
while (nodeToClose != null && nodeToClose.Name.LocalName != m12chunkCleanedTag)
nodeToClose = nodeToClose.Parent;
if (nodeToClose != null)
currentNode = nodeToClose.Parent;
Debug.Assert(currentNode != null);
}
break;
case HTMLchunkType.Script:
newNode = new XElement("script", "REMOVED");
newNodesParent = currentNode;
newNodesParent.Add(newNode);
break;
case HTMLchunkType.Comment:
newNodesParent = currentNode;
if (m12chunk.sTag == "!--")
newNode = new XComment(m12chunk.oHTML);
else if (m12chunk.sTag == "![CDATA[")
newNode = new XCData(m12chunk.oHTML);
else
throw new Exception("Unrecognized comment sTag");
newNodesParent.Add(newNode);
break;
case HTMLchunkType.Text:
currentNode.Add(m12chunk.oHTML);
break;
default:
break;
}
}
catch (Exception e) {
var wrappedE = new Exception("Error using Majestic12.HTMLChunk, reason: " + e.Message, e);
// the original html is copied for tracing/debugging purposes
originalHtml = new string(htmlAsBytes.Skip(m12chunk.iChunkOffset)
.Take(m12chunk.iChunkLength)
.Select(B => (char)B).ToArray());
wrappedE.Data.Add("source", originalHtml);
throw wrappedE;
}
}
while (currentNode.Parent != null)
currentNode = currentNode.Parent;
return currentNode.Nodes();
}
static XElement FindParentOfNewNode(Majestic12.HTMLchunk m12chunk, string originalHtml, XElement nextPotentialParent) {
string m12chunkCleanedTag = CleanupTagName(m12chunk.sTag, originalHtml);
XElement discoveredParent = null;
// Get a list of all ancestors
List<XElement> ancestors = new List<XElement>();
XElement ancestor = nextPotentialParent;
while (ancestor != null) {
ancestors.Add(ancestor);
ancestor = ancestor.Parent;
}
// Check if the new tag implies a previous tag was closed.
if ("form" == m12chunkCleanedTag) {
discoveredParent = ancestors
.Where(XE => m12chunkCleanedTag == XE.Name)
.Take(1)
.Select(XE => XE.Parent)
.FirstOrDefault();
}
else if ("td" == m12chunkCleanedTag) {
discoveredParent = ancestors
.TakeWhile(XE => "tr" != XE.Name)
.Where(XE => m12chunkCleanedTag == XE.Name)
.Take(1)
.Select(XE => XE.Parent)
.FirstOrDefault();
}
else if ("tr" == m12chunkCleanedTag) {
discoveredParent = ancestors
.TakeWhile(XE => !("table" == XE.Name
|| "thead" == XE.Name
|| "tbody" == XE.Name
|| "tfoot" == XE.Name))
.Where(XE => m12chunkCleanedTag == XE.Name)
.Take(1)
.Select(XE => XE.Parent)
.FirstOrDefault();
}
else if ("thead" == m12chunkCleanedTag
|| "tbody" == m12chunkCleanedTag
|| "tfoot" == m12chunkCleanedTag) {
discoveredParent = ancestors
.TakeWhile(XE => "table" != XE.Name)
.Where(XE => m12chunkCleanedTag == XE.Name)
.Take(1)
.Select(XE => XE.Parent)
.FirstOrDefault();
}
return discoveredParent ?? nextPotentialParent;
}
static string CleanupTagName(string originalName, string originalHtml) {
string tagName = originalName;
tagName = tagName.TrimStart(new char[] { '?' }); // for nodes <?xml >
if (tagName.Contains(':'))
tagName = tagName.Substring(tagName.LastIndexOf(':') + 1);
return tagName;
}
static readonly Regex _startsAsNumeric = new Regex(@"^[0-9]", RegexOptions.Compiled);
static bool TryCleanupAttributeName(string originalName, ref int xmlnsIndex, out string result) {
result = null;
string attributeName = originalName;
if (string.IsNullOrEmpty(originalName))
return false;
if (_startsAsNumeric.IsMatch(originalName))
return false;
//
// transform xmlns attributes so they don't actually create any XML namespaces
//
if (attributeName.ToLower().Equals("xmlns")) {
attributeName = "xmlns_" + xmlnsIndex.ToString();
xmlnsIndex++;
}
else {
if (attributeName.ToLower().StartsWith("xmlns:")) {
attributeName = "xmlns_" + attributeName.Substring("xmlns:".Length);
}
//
// trim trailing \"
//
attributeName = attributeName.TrimEnd(new char[] { '\"' });
attributeName = attributeName.Replace(":", "_");
}
result = attributeName;
return true;
}
static Regex _weirdTag = new Regex(@"^<!\[.*\]>$"); // matches "<![if !supportEmptyParas]>"
static Regex _aspnetPrecompiled = new Regex(@"^<%.*%>$"); // matches "<%# ... %>"
static Regex _shortHtmlComment = new Regex(@"^<!-.*->$"); // matches "<!-Extra_Images->"
static XElement ParseTagNode(Majestic12.HTMLchunk m12chunk, string originalHtml, ref int xmlnsIndex) {
if (string.IsNullOrEmpty(m12chunk.sTag)) {
if (m12chunk.sParams.Length > 0 && m12chunk.sParams[0].ToLower().Equals("doctype"))
return new XElement("doctype");
if (_weirdTag.IsMatch(originalHtml))
return new XElement("REMOVED_weirdBlockParenthesisTag");
if (_aspnetPrecompiled.IsMatch(originalHtml))
return new XElement("REMOVED_ASPNET_PrecompiledDirective");
if (_shortHtmlComment.IsMatch(originalHtml))
return new XElement("REMOVED_ShortHtmlComment");
// Nodes like "<br <br>" will end up with a m12chunk.sTag==""... We discard these nodes.
return null;
}
string tagName = CleanupTagName(m12chunk.sTag, originalHtml);
XElement result = new XElement(tagName);
List<XAttribute> attributes = new List<XAttribute>();
for (int i = 0; i < m12chunk.iParams; i++) {
if (m12chunk.sParams[i] == "<!--") {
// an HTML comment was embedded within a tag. This comment and its contents
// will be interpreted as attributes by Majestic-12... skip this attributes
for (; i < m12chunk.iParams; i++) {
if (m12chunk.sParams[i] == "--" || m12chunk.sParams[i] == "-->")
break;
}
continue;
}
if (m12chunk.sParams[i] == "?" && string.IsNullOrEmpty(m12chunk.sValues[i]))
continue;
string attributeName = m12chunk.sParams[i];
if (!TryCleanupAttributeName(attributeName, ref xmlnsIndex, out attributeName))
continue;
attributes.Add(new XAttribute(attributeName, m12chunk.sValues[i]));
}
// If attributes are duplicated with different values, we complain.
// If attributes are duplicated with the same value, we remove all but 1.
var duplicatedAttributes = attributes.GroupBy(A => A.Name).Where(G => G.Count() > 1);
foreach (var duplicatedAttribute in duplicatedAttributes) {
if (duplicatedAttribute.GroupBy(DA => DA.Value).Count() > 1)
throw new Exception("Attribute value was given different values");
attributes.RemoveAll(A => A.Name == duplicatedAttribute.Key);
attributes.Add(duplicatedAttribute.First());
}
result.Add(attributes);
return result;
}
static HTMLparser OpenParser() {
HTMLparser oP = new HTMLparser();
// The code+comments in this function are from the Majestic-12 sample documentation.
// ...
// This is optional, but if you want high performance then you may
// want to set chunk hash mode to FALSE. This would result in tag params
// being added to string arrays in HTMLchunk object called sParams and sValues, with number
// of actual params being in iParams. See code below for details.
//
// When TRUE (and its default) tag params will be added to hashtable HTMLchunk (object).oParams
oP.SetChunkHashMode(false);
// if you set this to true then original parsed HTML for given chunk will be kept -
// this will reduce performance somewhat, but may be desireable in some cases where
// reconstruction of HTML may be necessary
oP.bKeepRawHTML = false;
// if set to true (it is false by default), then entities will be decoded: this is essential
// if you want to get strings that contain final representation of the data in HTML, however
// you should be aware that if you want to use such strings into output HTML string then you will
// need to do Entity encoding or same string may fail later
oP.bDecodeEntities = true;
// we have option to keep most entities as is - only replace stuff like &nbsp;
// this is called Mini Entities mode - it is handy when HTML will need
// to be re-created after it was parsed, though in this case really
// entities should not be parsed at all
oP.bDecodeMiniEntities = true;
if (!oP.bDecodeEntities && oP.bDecodeMiniEntities)
oP.InitMiniEntities();
// if set to true, then in case of Comments and SCRIPT tags the data set to oHTML will be
// extracted BETWEEN those tags, rather than include complete RAW HTML that includes tags too
// this only works if auto extraction is enabled
oP.bAutoExtractBetweenTagsOnly = true;
// if true then comments will be extracted automatically
oP.bAutoKeepComments = true;
// if true then scripts will be extracted automatically:
oP.bAutoKeepScripts = true;
// if this option is true then whitespace before start of tag will be compressed to single
// space character in string: " ", if false then full whitespace before tag will be returned (slower)
// you may only want to set it to false if you want exact whitespace between tags, otherwise it is just
// a waste of CPU cycles
oP.bCompressWhiteSpaceBeforeTag = true;
// if true (default) then tags with attributes marked as CLOSED (/ at the end) will be automatically
// forced to be considered as open tags - this is no good for XML parsing, but I keep it for backwards
// compatibility for my stuff as it makes it easier to avoid checking for same tag which is both closed
// or open
oP.bAutoMarkClosedTagsWithParamsAsOpen = false;
return oP;
}
}
}
The Html Agility Pack has been mentioned before - if you are going for speed, you might also want to check out the Majestic-12 HTML parser. Its handling is rather clunky, but it delivers a really fast parsing experience.
I think @Erlend's use of HTMLDocument is the best way to go. However, I have also had good luck using this simple library:
SgmlReader
A solution with no 3rd-party lib, using the WebBrowser class, that can run in a console app or ASP.NET:
using System;
using System.Collections.Generic;
using System.Text;
using System.Windows.Forms;
using System.Threading;

class ParseHTML
{
    public ParseHTML() { }
    private string ReturnString;
    public string doParsing(string html)
    {
        Thread t = new Thread(TParseMain);
        t.SetApartmentState(ApartmentState.STA);
        t.Start((object)html);
        t.Join();
        return ReturnString;
    }
    private void TParseMain(object html)
    {
        WebBrowser wbc = new WebBrowser();
        wbc.DocumentText = "feces of a dummy"; // magic words
        HtmlDocument doc = wbc.Document.OpenNew(true);
        doc.Write((string)html);
        this.ReturnString = doc.Body.InnerHtml + " do here something";
        return;
    }
}
usage:
string myhtml = "<HTML><BODY>This is a new HTML document.</BODY></HTML>";
Console.WriteLine("before:" + myhtml);
myhtml = (new ParseHTML()).doParsing(myhtml);
Console.WriteLine("after:" + myhtml);
The trouble with parsing HTML is that it isn't an exact science. If it was XHTML that you were parsing, then things would be a lot easier (as you mention you could use a general XML parser). Because HTML isn't necessarily well-formed XML you will come into lots of problems trying to parse it. It almost needs to be done on a site-by-site basis.
I've used ZetaHtmlTidy in the past to load random websites and then hit against various parts of the content with XPath (e.g. /html/body//p[@class='textblock']). It worked well, but there were some exceptional sites that it had problems with, so I don't know if it's the absolute best solution.
You could use a HTML DTD, and the generic XML parsing libraries.
Use WatiN if you need to see the impact of JS on the page [and you're prepared to start a browser]
Depending on your needs you might go for the more feature-rich libraries. I tried most/all of the solutions suggested, but what stood out head & shoulders was Html Agility Pack. It is a very forgiving and flexible parser.
Try this script.
http://www.biterscripting.com/SS_URLs.html
When I use it with this url,
script SS_URLs.txt URL("http://stackoverflow.com/questions/56107/what-is-the-best-way-to-parse-html-in-c")
It shows me all the links on the page for this thread.
http://sstatic.net/so/all.css
http://sstatic.net/so/favicon.ico
http://sstatic.net/so/apple-touch-icon.png
.
.
.
You can modify that script to check for images, variables, whatever.
I wrote some classes for parsing HTML tags in C#. They are nice and simple if they meet your particular needs.
You can read an article about them and download the source code at http://www.blackbeltcoder.com/Articles/strings/parsing-html-tags-in-c.
There's also an article about a generic parsing helper class at http://www.blackbeltcoder.com/Articles/strings/a-text-parsing-helper-class.
