How to use Html Agility Pack for HTML validations - c#

I am using HTML Agility Pack for validating my html. Below is what I am using,
public class MarkupErrors
{
public string ErrorCode { get; set; }
public string ErrorReason { get; set; }
}
public static List<MarkupErrors> IsMarkupValid(string html)
{
var document = new HtmlAgilityPack.HtmlDocument();
document.OptionFixNestedTags = true;
document.LoadHtml(html);
var parserErrors = new List<MarkupErrors>();
foreach(var error in document.ParseErrors)
{
parserErrors.Add(new MarkupErrors
{
ErrorCode = error.Code.ToString(),
ErrorReason = error.Reason
});
}
return parserErrors;
}
So say my input is something like the one shown below :
<h1>Test</h1>
Hello World</h2>
<h3>Missing close h3 tag
So my current function return a list of following errors
- Start tag <h2> was not found
- End tag </h3> was not found
which is fine...
My problem is that I want the entire html to be valid, that is with a proper <head> and <body> tags, because this html will later be available for preview, download as .html files.
So I was wondering if I could check for this using HTML Agility Pack ?
Any ideas or other options will be appreciated. Thanks

You can check there is a HEAD element or a BODY element under an HTML element like this for example:
bool hasHead = doc.DocumentNode.SelectSingleNode("html/head") != null;
bool hasBody = doc.DocumentNode.SelectSingleNode("html/body") != null;
These would fail if there is no HTML element, or if there is no BODY element under the HTML element.
Note I don't use this kind of XPATH expression "//head" because it would give a result even if the head was not directly under the HTML element.

Related

Parsing Nodes with HTML AgilityPack

I'm trying to get information from that page : http://www.wowhead.com/transmog-sets?filter=3;5;0#transmog-sets
rows look like this when inspecting elements :
I've tried this code but it return me null every time on any nodes:
public class ItemSetsTransmog
{
public string ItemSetName { get; set; }
public string ItemSetId { get; set; }
}
public partial class Fmain : Form
{
DataTable Table;
HtmlWeb web = new HtmlWeb();
public Fmain()
{
InitializeComponent();
initializeItemSetTransmogTable();
}
private async void Fmain_Load(object sender, EventArgs e)
{
int PageNum = 0;
var itemsets = await ItemSetTransmogFromPage(0);
while (itemsets.Count > 0)
{
foreach (var itemset in itemsets)
Table.Rows.Add(itemset.ItemSetName, itemset.ItemSetId);
itemsets = await ItemSetTransmogFromPage(PageNum++);
}
}
private async Task<List<ItemSetsTransmog>> ItemSetTransmogFromPage(int PageNum)
{
String url = "http://www.wowhead.com/transmog-sets?filter=3;5;0#transmog-sets";
if (PageNum != 0)
url = "http://www.wowhead.com/transmog-sets?filter=3;5;0#transmog-sets:75+" + PageNum.ToString();
var doc = await Task.Factory.StartNew(() => web.Load(url));
var NameNodes = doc.DocumentNode.SelectNodes("//*[#id=\"tab - transmog - sets\"]//div//table//tr//td//div//a");
var IdNodes = doc.DocumentNode.SelectNodes("//*[#id=\"tab - transmog - sets\"]//div//table//tr//td//div//a");
// if these are null it means the name/score nodes couldn't be found on the html page
if (NameNodes == null || IdNodes == null)
return new List<ItemSetsTransmog>();
var ItemSetNames = NameNodes.Select(node => node.InnerText);
var ItemSetIds = IdNodes.Select(node => node.InnerText);
return ItemSetNames.Zip(ItemSetIds, (name, id) => new ItemSetsTransmog() { ItemSetName = name, ItemSetId = id }).ToList();
}
private void initializeItemSetTransmogTable()
{
Table = new DataTable("ItemSetTransmogTable");
Table.Columns.Add("ItemSetName", typeof(string));
Table.Columns.Add("ItemSetId", typeof(string));
ItemSetTransmogDataView.DataSource = Table;
}
}
}
why does my script doesn't load any of theses nodes ? how can i fix it ?
Your code does not load these nodes because they do not exist in the HTML that is pulled back by HTML Agility Pack. This is probably because a large majority of the markup you have shown is generated by JavaScript. Just try inspecting the doc.ParsedText property in your ItemSetTransmogFromPage() method.
Html Agility Pack is an HTTP Client/Parser, it will not run scripts. If you really need to get the data using this process then you will need to use a "headless browser" such as Optimus to retrieve the page (caveat: I have not used this library, though a nuget package appears to exist) and then probably use HTML Agility Pack to parse/query the markup.
The other alternative might be to try to parse the JSON that exists on this page (if this provides you with the data that you need, although this appears unlikely).
Small note - I think the id in you xpath should be "tab-transmog-sets" instead of "tab - transmog - sets"

How to get URL from the XPATH?

I've tried to check other answers on this site, but none of them worked for me. I have following HTML code:
<h3 class="x-large lheight20 margintop5">
<strong>some textstring</strong>
</h3>
I am trying to get # from this document with following code:
string adUrl = Doc.DocumentNode.SelectSingleNode("//*[#id=\"offers_table\"]/tbody/tr["+i+ "]/td/table/tbody/tr[1]/td[2]/div/h3/a/#href").InnerText;
I've also tried to do that without #href. Also tried with a[contains(#href, 'searchString')]. But all of these lines gave me just the name of the link - some textstring
Attributes doesn't have InnerText.You have to use the Attributes collection instead.
string adUrl = Doc.DocumentNode.SelectSingleNode("//*[#id=\"offers_table\"]/tbody/tr["+i+ "]/td/table/tbody/tr[1]/td[2]/div/h3/a")
.Attributes["href"].Value;
Why not just use the XDocument class?
private string GetUrl(string filename)
{
var doc = XDocument.Load(filename)
foreach (var h3Element in doc.Elements("h3").Where(e => e.Attribute("class"))
{
var classAtt = h3Element.Attribute("class");
if (classAtt == "x-large lheight20 margintop5")
{
h3Element.Element("a").Attribute("href").value;
}
}
}
The code is not tested so use with caution.

Selenium WebDriver and C# (VS): Look for a specific header string

I've been trying, without luck, to use IJavaScriptExecutor to find a specific header string in a page. Here's the html code form the page:
<div class="wrap">
<h2>Edit Page <a href="http://www.webtest.bugrit.net/wordpress/wp-admin/post-
new.php?post_type=page" class="add-new-h2">Add New</a></h2>
<div id...
The text I need to check for is the "Edit Page" string.
This is the closest I've come, which isn't very close:
var element = FFDriver.Instance.FindElements(By.ClassName("add-new-h2"));
IJavaScriptExecutor js = FFDriver.Instance as IJavaScriptExecutor;
if (js != null) {
string innerHtml = (string)js.ExecuteScript("return arguments[0].innerHTML;", element);
//System.Windows.Forms.MessageBox.Show(innerHtml);
if (innerHtml.Equals("Edit Page")) {
return true;
} else {
return false;
}
}
Now, I realize that the text I should expect to get from that code isn't the exact string "Edit Page". But shouldn't it return something? When I enable the MessageBox line, the innerHtml string is empty.
Or, of couse - if someone knows another, possible easier, way to check for the existance of a specific string inside a specific html tag, I'm all ears.
Your element returns you <a> element, not <h2>. Your <a> doesn't contain Edit Page string.
Try find your element like this to the parent element <h2> (only if class name add-new-h2 is unique, otherwise you will get the first matching one):
var element = FFDriver.Instance.FindElement(By.XPath(".//a[#class='add-new-h2']/.."));
var containsText = element.Text.Contains("Edit Page");

Need some HTML element with HTMLAgilityPack in C# - how to do it?

I have the following scenario:
Some text <b>is bolded</b> some is <b>not</b>
Now, how do I get the "test.com" part and the anchor of the text, without having the bolded parts?
Assuming the following markup:
<html>
<head>
<title>Test</title>
</head>
<body>
Some text <b>is bolded</b> some is <b>not</b>
</body>
</html>
You could perform the following:
class Program
{
static void Main()
{
var doc = new HtmlDocument();
doc.Load("test.html");
var anchor = doc.DocumentNode.SelectSingleNode("//a");
Console.WriteLine(anchor.Attributes["href"].Value);
Console.WriteLine(anchor.InnerText);
}
}
prints:
test.com
Some text is bolded some is not
Of course you probably wanna adjust your SelectSingleNode XPath selector by providing an unique id or a classname to the anchor you are trying to fetch:
// assuming Some text <b>is bolded</b> some is <b>not</b>
var anchor = doc.GetElementbyId("foo");

How to verify a Hyperlink exists on a webpage?

I have a need to verify a specific hyperlink exists on a given web page. I know how to download the source HTML. What I need help with is figuring out if a "target" url exists as a hyperlink in the "source" web page.
Here is a little console program to demonstrate the problem:
public static void Main()
{
var sourceUrl = "http://developer.yahoo.com/search/web/V1/webSearch.html";
var targetUrl = "http://developer.yahoo.com/ypatterns/";
Console.WriteLine("Source contains link to target? Answer = {0}",
SourceContainsLinkToTarget(
sourceUrl,
targetUrl));
Console.ReadKey();
}
private static bool SourceContainsLinkToTarget(string sourceUrl, string targetUrl)
{
string content;
using (var wc = new WebClient())
content = wc.DownloadString(sourceUrl);
return content.Contains(targetUrl); // Need to ensure this is in a <href> tag!
}
Notice the comment on the last line. I can see if the target URL exists in the HTML of the source URL, but I need to verify that URL is inside of a <href/> tag. This way I can validate it's actually a hyperlink, instead of just text.
I'm hoping someone will have a kick-ass regular expression or something I can use.
Thanks!
Here is the solution using the HtmlAgilityPack:
private static bool SourceContainsLinkToTarget(string sourceUrl, string targetUrl)
{
var doc = (new HtmlWeb()).Load(sourceUrl);
foreach (var link in doc.DocumentNode.SelectNodes("//a[#href]"))
if (link.GetAttributeValue("href",
string.Empty).Equals(targetUrl))
return true;
return false;
}
The best way is to use a web scraping library with a built in DOM parser, which will build an object tree out of the HTML and let you explore it programmatically for the link entity you are looking for. There are many available - for example Beautiful Soup (python) or scrapi (ruby) or Mechanize (perl). For .net, try the HTML agility pack. http://htmlagilitypack.codeplex.com/

Categories

Resources