I am trying to parse this page, but there isn't much unique information I can use to identify the sections I want.
Basically, I am trying to get most of the data next to the flash video. So:
Alternating Floor Press
Type: Strength
Main Muscle Worked: Chest
Other Muscles: Abdominals, Shoulders, Triceps
Equipment: Kettlebells
Mechanics Type: Compound
Level: Beginner
Sport: No
Force: N/A
And also the image links that shows before and after states.
Right now I use this:
HtmlAgilityPack.HtmlWeb web = new HtmlWeb ();
HtmlAgilityPack.HtmlDocument doc = web.Load ( "http://www.bodybuilding.com/exercises/detail/view/name/alternating-floor-press" );
IEnumerable<HtmlNode> threadLinks = doc.DocumentNode.Descendants ( "a" );
foreach ( var link in threadLinks )
{
string str = link.InnerHtml;
Console.WriteLine ( str );
}
This gives me a lot of stuff I don't need, but it also prints what I need. Should I be parsing this printed output to work out where my target data sits inside it?
You can select the nodes you are interested in by their id:
HtmlAgilityPack.HtmlWeb web = new HtmlWeb();
HtmlAgilityPack.HtmlDocument doc = web.Load("http://www.bodybuilding.com/exercises/detail/view/name/alternating-floor-press");
IEnumerable<HtmlNode> threadLinks = doc.DocumentNode.SelectNodes("//*[@id=\"exerciseDetails\"]");
foreach (var link in threadLinks)
{
string str = link.InnerText;
Console.WriteLine(str);
}
Console.ReadKey();
For a given <a> node, to get the text shown, try .InnerText.
Right now you are using the contents of all <a> tags within the document. Try narrowing down to only the ones you need. Look for other elements which contain the particular <a> tags you are after. For example, do they all sit inside a <div> with a certain class?
E.g. if you find the <a> tags you are interested in all sit within <div class="foolinks"> then you can do something like:-
IEnumerable<HtmlNode> threadLinks = doc.DocumentNode.Descendants("div")
.First(dn => dn.GetAttributeValue("class", "") == "foolinks").Descendants("a");
--UPDATE--
Given the information in your comment, I would try:-
IEnumerable<HtmlNode> threadLinks = doc.DocumentNode.Descendants("div")
.First(dn => dn.Id == "exerciseDetails").Descendants("a");
--UPDATE--
If you are having trouble getting it to work, try splitting it up into variable assignments and stepping through the code, inspecting each variable to see if it holds what you expect.
E.g.:
var divs = doc.DocumentNode.Descendants("div");
var div = divs.FirstOrDefault(dn => dn.Id == "exerciseDetails");
if (div == null)
{
// couldn't find the node - do whatever is appropriate, e.g. throw an exception
}
IEnumerable<HtmlNode> threadLinks = div.Descendants("a");
BTW - I'm not sure if the .Id property maps to the id attribute of the node as you suggest it does. If not, you could try dn => dn.GetAttributeValue("id", "") == "exerciseDetails" instead.
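Putting that fallback together, a small sketch (untested, assuming the standard HtmlAgilityPack API; GetAttributeValue returns the supplied default when the attribute is missing, so there is no null check to worry about):

```csharp
// Sketch: find the details div by its id *attribute* rather than the .Id
// property. GetAttributeValue returns "" if the attribute is absent.
var div = doc.DocumentNode
    .Descendants("div")
    .FirstOrDefault(dn => dn.GetAttributeValue("id", "") == "exerciseDetails");

IEnumerable<HtmlNode> threadLinks = div != null
    ? div.Descendants("a")
    : Enumerable.Empty<HtmlNode>();
```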
Related
I'd like to scrape a table within a comment using HTMLAgilityPack. For example, on the page
http://www.baseball-reference.com/register/team.cgi?id=f72457e4
there is a table with id="team_pitching". I can get this comment as a block of text with:
var tags = doc.DocumentNode.SelectSingleNode("//comment()[contains(., 'team_pitching')]");
however my preference would be to select the rows from the table with something like:
var tags = doc.DocumentNode.SelectNodes("//comment()[contains(., 'team_pitching')]//table//tbody//tr");
or
var tags = doc.DocumentNode.SelectNodes("//comment()//table[@id = 'team_pitching']//tbody//tr");
but these both return null. Is there a way to do this so I don't have to parse the text manually to get all of the table data?
Sample HTML - I'm looking to find nodes inside <!-- ... -->:
<p>not interesting HTML here</p>
<!-- <table id=team_pitching>
<tbody><tr>...</tr>...</tbody>...</table> -->
The content of a comment is not parsed into DOM nodes, so you can't search both outside and inside a comment with a single XPath.
You can get the InnerHtml of the comment node, trim off the comment delimiters, load it into a new HtmlDocument, and query that. Something like this should work:
var commentNode = doc.DocumentNode
.SelectSingleNode("//comment()[contains(., 'team_pitching')]");
var commentHtml = commentNode.InnerHtml.TrimStart('<', '!', '-').TrimEnd('-', '>');
var commentDoc = new HtmlDocument();
commentDoc.LoadHtml(commentHtml);
var tags = commentDoc.DocumentNode.SelectNodes("//table//tbody//tr");
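From there, pulling the cell text out of each row is straightforward (a sketch; the ./td selector assumes ordinary data rows like the ones in the sample HTML above):

```csharp
// Sketch: print each recovered table row as pipe-separated cell text.
foreach (var row in tags)
{
    var cells = row.SelectNodes("./td");
    if (cells != null)
        Console.WriteLine(string.Join(" | ", cells.Select(c => c.InnerText.Trim())));
}
```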
For example, I want to extract the first definition from http://www.urbandictionary.com/define.php?term=potato . It's raw text, though.
var html = new HtmlDocument();
html.LoadHtml(new WebClient().DownloadString("http://www.urbandictionary.com/define.php?term=potato"));
var root = html.DocumentNode;
var p = root.Descendants()
.Where(n => n.GetAttributeValue("class", "").Equals("meaning"))
.Single()
.Descendants("")
.Single();
var content = p.InnerText;
This is the code I use to try and extract the meaning class. This doesn't work at all, though... How do I extract the class from Urban Dictionary?
If you change your code as below, it works:
var html = new HtmlDocument();
html.LoadHtml(new WebClient().DownloadString("http://www.urbandictionary.com/define.php?term=potato"));
var root = html.DocumentNode;
var p = root.SelectNodes("//div[@class='meaning']").First();
var content = p.InnerText;
The text I'm using in SelectNodes is XPath and means all div elements with a class named meaning. You need to use First or FirstOrDefault, as the page contains multiple div elements with that class name, so Single would throw an exception.
Alternatively you can use, if you wanted to use the same "style" as the link you were using.
var p = root.Descendants()
.Where(n => n.GetAttributeValue("class", "").Equals("meaning"))
.FirstOrDefault();
Though Tone's answer is more elegant, as one-liners are usually better.
Why does this pick all of my <li> elements in my document?
HtmlWeb web = new HtmlWeb();
HtmlDocument doc = web.Load(url);
var travelList = new List<Page>();
var liOfTravels = doc.DocumentNode.SelectSingleNode("//div[@id='myTrips']")
.SelectNodes("//li");
What I want is to get all <li> elements in the <div> with an id of "myTrips".
It's a bit confusing because you'd expect SelectNodes to search only within the div with id "myTrips"; however, if you then call SelectNodes("//li"), it will perform another search from the top of the document.
I fixed this by combining the statements into one, but that only works on a page that has a single div with the id "myTrips". The query looks like this:
doc.DocumentNode.SelectNodes("//div[@id='myTrips']//li");
var liOfTravels = doc.DocumentNode.SelectSingleNode("//div[@id='myTrips']")
.SelectNodes(".//li");
Note the dot in the second line. Basically, HtmlAgilityPack relies entirely on XPath syntax here, but the result is non-intuitive, because these queries are effectively the same:
doc.DocumentNode.SelectNodes("//li");
some_deeper_node.SelectNodes("//li");
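A minimal sketch of the difference (assuming the usual HtmlAgilityPack behaviour described above; only the dotted form is anchored to the context node):

```csharp
var div = doc.DocumentNode.SelectSingleNode("//div[@id='myTrips']");

// "//li" is an absolute path: it searches from the document root,
// no matter which node you call it on.
var everyLiInDocument = div.SelectNodes("//li");

// ".//li" is relative: it searches only within div's subtree.
var lisInsideDiv = div.SelectNodes(".//li");
```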
Creating a new node can be beneficial in some situations and lets you use the xpaths more intuitively. I've found this useful in a couple of places.
var myTripsDiv = doc.DocumentNode.SelectSingleNode("//div[@id='myTrips']");
var myTripsNode = HtmlNode.CreateNode(myTripsDiv.InnerHtml);
var liOfTravels = myTripsNode.SelectNodes("//li");
You can do this with a Linq query:
HtmlWeb web = new HtmlWeb();
HtmlDocument doc = web.Load(url);
var travelList = new List<HtmlNode>();
foreach (var matchingDiv in doc.DocumentNode.DescendantNodes().Where(n=>n.Name == "div" && n.Id == "myTrips"))
{
travelList.AddRange(matchingDiv.DescendantNodes().Where(n=> n.Name == "li"));
}
I hope it helps
This seems counter-intuitive to me as well; if you run the SelectNodes method on a particular node, you'd think it would only search underneath that node, not in the document in general.
Anyway, OP, if you change this line:
var liOfTravels =
doc.DocumentNode.SelectSingleNode("//div[@id='myTrips']").SelectNodes("//li");
TO:
var liOfTravels =
doc.DocumentNode.SelectSingleNode("//div[@id='myTrips']").SelectNodes("li");
I think you'll be OK; I've just had the same issue and that fixed it for me. I'm not sure, though, whether the li has to be a direct child of the node you have.
Can anyone tell me the regex pattern which checks for empty span tags and replaces them with a <br/> tag?
Something like the below :
string io = Regex.Replace(res,"" , RegexOptions.IgnoreCase);
I don't know what pattern to pass in!
This pattern will find all empty span tags, such as <span/> and <span></span>:
<span\s*/>|<span>\s*</span>
So this code should replace all your empty span tags with br tags:
string io = Regex.Replace(res, #"<span\s*/>|<span>\s*</span>", "<br/>");
The code of Jeff Mercado has errors at these lines:
.Where(e => e.Name.Equals("span", StringComparison.OrdinalIgnoreCase)
&& n.Name.Equals("span", StringComparison.OrdinalIgnoreCase)
Error message: Member 'object.Equals(object, object)' cannot be accessed with an instance reference; qualify it with a type name instead
It didn't work when I tried replacing with other objects!
My favourite answer to this problem is this one: RegEx match open tags except XHTML self-contained tags
You should parse it, searching for the empty span elements and replace them. Here's how you can do it using LINQ to XML. Just note that depending on the actual HTML, it may require tweaks to get it to work since it is an XML parser, not HTML.
// parse it
var doc = XElement.Parse(theHtml);
// find the target elements
var targets = doc.DescendantNodes()
.OfType<XElement>()
.Where(e => e.Name.Equals("span", StringComparison.OrdinalIgnoreCase)
&& e.IsEmpty
&& !e.HasAttributes)
.ToList(); // need a copy since the contents will change
// replace them all
foreach (var span in targets)
span.ReplaceWith(new XElement("br"));
// get back the html string
theHtml = doc.ToString();
Otherwise, here's some code showing how you can use the HTML Agility Pack to do the same (written in a way that mirrors the other version).
// parse it
var doc = new HtmlDocument();
doc.LoadHtml(theHtml);
// find the target elements
var targets = doc.DocumentNode
.DescendantNodes()
.Where(n => n.NodeType == HtmlNodeType.Element
&& n.Name.Equals("span", StringComparison.OrdinalIgnoreCase)
&& !n.HasChildNodes && !n.HasAttributes)
.ToList(); // need a copy since the contents will change
// replace them all
foreach (var span in targets)
{
var br = HtmlNode.CreateNode("<br />");
span.ParentNode.ReplaceChild(br, span);
}
// get back the html string
using (StringWriter writer = new StringWriter())
{
doc.Save(writer);
theHtml = writer.ToString();
}
I've been finding that, for something I consider pretty important, there is very little information, and there are very few libraries, on how to deal with this problem.
I found this while searching. I really don't know all the million ways a hacker could try to insert dangerous tags.
I have a rich HTML editor, so I need to keep non-dangerous tags but strip out bad ones.
So is this script missing anything?
It uses html agility pack.
public string ScrubHTML(string html)
{
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(html);
//Remove potentially harmful elements
HtmlNodeCollection nc = doc.DocumentNode.SelectNodes("//script|//link|//iframe|//frameset|//frame|//applet|//object|//embed");
if (nc != null)
{
foreach (HtmlNode node in nc)
{
node.ParentNode.RemoveChild(node, false);
}
}
//remove hrefs to java/j/vbscript URLs
nc = doc.DocumentNode.SelectNodes("//a[starts-with(translate(@href, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz'), 'javascript')]|//a[starts-with(translate(@href, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz'), 'jscript')]|//a[starts-with(translate(@href, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz'), 'vbscript')]");
if (nc != null)
{
foreach (HtmlNode node in nc)
{
node.SetAttributeValue("href", "#");
}
}
//remove img with refs to java/j/vbscript URLs
nc = doc.DocumentNode.SelectNodes("//img[starts-with(translate(@src, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz'), 'javascript')]|//img[starts-with(translate(@src, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz'), 'jscript')]|//img[starts-with(translate(@src, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz'), 'vbscript')]");
if (nc != null)
{
foreach (HtmlNode node in nc)
{
node.SetAttributeValue("src", "#");
}
}
//remove on<Event> handlers from all tags
nc = doc.DocumentNode.SelectNodes("//*[@onclick or @onmouseover or @onfocus or @onblur or @onmouseout or @ondoubleclick or @onload or @onunload]");
if (nc != null)
{
foreach (HtmlNode node in nc)
{
node.Attributes.Remove("onFocus");
node.Attributes.Remove("onBlur");
node.Attributes.Remove("onClick");
node.Attributes.Remove("onMouseOver");
node.Attributes.Remove("onMouseOut");
node.Attributes.Remove("onDoubleClick");
node.Attributes.Remove("onLoad");
node.Attributes.Remove("onUnload");
}
}
// remove any style attributes that contain the word expression (IE evaluates this as script)
nc = doc.DocumentNode.SelectNodes("//*[contains(translate(@style, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz'), 'expression')]");
if (nc != null)
{
foreach (HtmlNode node in nc)
{
node.Attributes.Remove("style");
}
}
return doc.DocumentNode.WriteTo();
}
Edit
Two people have suggested whitelisting. I actually like the idea of whitelisting, but I never actually did it, because no one could tell me how to do it in C#, and I couldn't even really find tutorials for doing it in C# (the last time I looked; I will check again).
How do you make a white list? Is it just a list collection?
How do you actually parse out all the HTML tags, script tags, and every other tag?
Once you have the tags, how do you determine which ones are allowed? Compare them to your list collection? But what happens if the content comes in with, say, 100 tags and you have 50 allowed? You have to compare each of those 100 tags against 50 allowed tags. That's quite a bit to go through and could be slow.
Once you find an invalid tag, how do you remove it? I don't really want to reject a whole block of text just because one tag is invalid. I'd rather remove that tag and keep the rest.
Should I be using html agility pack?
That code is dangerous -- you should be whitelisting elements, not blacklisting them.
In other words, make a small list of tags and attributes you want to allow, and don't let any others through.
EDIT: I'm not familiar with HTML agility pack, but I see no reason why it wouldn't work for this. Since I don't know the framework, I'll give you pseudo-code for what you need to do.
doc.LoadHtml(html);
var validTags = new List<string>(new string[] {"b", "i", "u", "strong", "em"});
var nodes = doc.DocumentNode.SelectAllNodes();
foreach(HtmlNode node in nodes)
if(!validTags.Contains(node.Tag.ToLower()))
node.Parent.ReplaceNode(node, node.InnerHtml);
Basically, for each tag, if it's not contained in the whitelist, replace the tag with just its inner HTML. Again, I don't know your framework, so I can't really give you specifics, sorry. Hopefully this gets you started in the right direction.
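For what it's worth, here is a rough translation of that pseudo-code into actual HtmlAgilityPack calls. This is a sketch, not a vetted sanitizer: the tag list is just an example, and a real sanitizer would also whitelist attributes, as the other answers point out.

```csharp
// Sketch only: strip any element not on the whitelist, keeping its children.
var validTags = new HashSet<string> { "b", "i", "u", "strong", "em", "p", "br" };

var doc = new HtmlDocument();
doc.LoadHtml(html);

// Snapshot with ToList() because we mutate the tree while iterating.
foreach (var node in doc.DocumentNode.Descendants().ToList())
{
    if (node.NodeType == HtmlNodeType.Element &&
        !validTags.Contains(node.Name.ToLowerInvariant()))
    {
        // Second argument true = keep grandchildren: the node's children
        // are promoted to its parent before the node itself is removed.
        node.ParentNode.RemoveChild(node, true);
    }
}
string clean = doc.DocumentNode.OuterHtml;
```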
Yes, I can already see you're missing onmousedown, onmouseup, onchange, onsubmit, etc. This is part of why you should use whitelisting for both tags and attributes. Even if you had a perfect blacklist now (very unlikely), new tags and attributes are added fairly often.
See Why use a whitelist for HTML sanitizing?.