HtmlAgilityPack search url link - c#

I am writing a Windows Forms application for a group of friends, using HtmlAgilityPack.
I need to find all versions of the TacO addon, listed as links like this:
<li><a href='https://www.dropbox.com/s/nks140nf794tx77/GW2TacO_034r.zip?dl=0'>Download Build 034.1866r</a></li>
I also need to determine the latest version so the application can download that file. Here is my code so far:
public static bool Tacoisuptodate(string Version)
{
    // Load the HTML document
    var doc = new HtmlWeb().Load("http://www.gw2taco.com/");
    var body = doc.DocumentNode.SelectNodes("//body").Single();

    // Walk the document to find the part that interests us
    foreach (var node in doc.DocumentNode.SelectNodes("//div"))
    {
        // Read attributes defensively (they may be missing)
        var classValue = node.Attributes["class"]?.Value;
        var idValue = node.Attributes["id"]?.Value;
        var hrefValue = node.Attributes["href"]?.Value;

        // We are looking for <div class="widget LinkList" id="LinkList1"> on the home page
        if (classValue == "widget LinkList" && idValue == "LinkList1")
        {
            foreach (HtmlNode content in node.SelectNodes("//li"))
            {
                Debug.Write(content.GetAttributeValue("href", false));
            }
        }
    }
    return false;
}
If somebody could help me, I would really appreciate it.

A single XPath query is enough.
var xpath = "//h2[text()='Downloads']/following-sibling::div[@class='widget-content']/ul/li/a";
var doc = new HtmlAgilityPack.HtmlWeb().Load("http://www.gw2taco.com/");
var downloads = doc.DocumentNode.SelectNodes(xpath)
    .Select(li => new
    {
        href = li.Attributes["href"].Value,
        name = li.InnerText
    })
    .ToList();
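Building on that XPath, here is a minimal sketch of the up-to-date check itself (System.Linq assumed). The parsing of the link text is an assumption based on the "Download Build 034.1866r" format shown in the question, and the newest build is assumed to be the first list item:

public static bool Tacoisuptodate(string Version)
{
    var xpath = "//h2[text()='Downloads']/following-sibling::div[@class='widget-content']/ul/li/a";
    var doc = new HtmlAgilityPack.HtmlWeb().Load("http://www.gw2taco.com/");
    var latest = doc.DocumentNode.SelectNodes(xpath)?.FirstOrDefault();
    if (latest == null)
        return false; // layout changed or site unreachable
    // "Download Build 034.1866r" -> "034.1866r" (format assumed from the question)
    var latestVersion = latest.InnerText.Replace("Download Build ", "").Trim();
    return latestVersion == Version;
}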

Related

Iterate through web pages and download PDFs

I have code that crawls through all the PDF files on a web page and downloads them to a folder. However, it has now started to throw an error:

System.NullReferenceException HResult=0x80004003
Message=Object reference not set to an instance of an object.
Source=NW Crawler
StackTrace: at NW_Crawler.Program.Main(String[] args) in C:\Users\PC\source\repos\NW Crawler\NW Crawler\Program.cs:line 16

The exception points to ProductListPage in foreach (HtmlNode src in ProductListPage).
Is there any hint on how to fix this issue? I have tried to implement async/await with no success; maybe I was doing something wrong there.
Here is the process to be done:
Go to https://www.nordicwater.com/products/waste-water/
List all links in the related-products section. They look like this: <a class="ap-area-link" href="https://www.nordicwater.com/product/mrs-meva-multi-rake-screen/">MRS MEVA multi rake screen</a>
Proceed to each link and search for PDF files. PDF files are in:
<div class="dl-items">
<a href="https://www.nordicwater.com/wp-content/uploads/2016/04/S1126-MRS-brochure-EN.pdf" download="">
Here is my full code for testing:
using HtmlAgilityPack;
using System;
using System.Net;

namespace NW_Crawler
{
    class Program
    {
        static void Main(string[] args)
        {
            HtmlDocument htmlDoc = new HtmlWeb().Load("https://www.nordicwater.com/products/waste-water/");
            HtmlNodeCollection ProductListPage = htmlDoc.DocumentNode.SelectNodes("//a[@class='ap-area-link']//a");
            Console.WriteLine("Here are the links:" + ProductListPage);
            foreach (HtmlNode src in ProductListPage)
            {
                htmlDoc = new HtmlWeb().Load(src.Attributes["href"].Value);
                // Thread.Sleep(5000); // wait some time
                HtmlNodeCollection LinkTester = htmlDoc.DocumentNode.SelectNodes("//div[@class='dl-items']//a");
                if (LinkTester != null)
                {
                    foreach (var dllink in LinkTester)
                    {
                        string LinkURL = dllink.Attributes["href"].Value;
                        Console.WriteLine(LinkURL);
                        string ExtractFilename = LinkURL.Substring(LinkURL.LastIndexOf("/"));
                        var DLClient = new WebClient();
                        // Thread.Sleep(5000); // wait some time
                        DLClient.DownloadFileAsync(new Uri(LinkURL), @"C:\temp\" + ExtractFilename);
                    }
                }
            }
        }
    }
}
Made a couple of changes to cover the errors you might be seeing.
Changes
Use src.GetAttributeValue("href", string.Empty) instead of src.Attributes["href"].Value; if the href attribute is not present or is null, the latter throws "Object reference not set to an instance of an object".
Check that ProductListPage is valid and not null.
ExtractFilename included a leading / in the name. Use + 1 in the Substring call to skip past the last / found by LastIndexOf.
Move on to the next iteration if the href is null in either of the loops.
Changed the product-list query to //a[@class='ap-area-link'] from //a[@class='ap-area-link']//a. You were searching for an <a> within the <a> tag, which yields null. Still, if you want to query it that way, the first if statement checking ProductListPage != null will take care of the error.
HtmlDocument htmlDoc = new HtmlWeb().Load("https://www.nordicwater.com/products/waste-water/");
HtmlNodeCollection ProductListPage = htmlDoc.DocumentNode.SelectNodes("//a[@class='ap-area-link']");
if (ProductListPage != null)
    foreach (HtmlNode src in ProductListPage)
    {
        string href = src.GetAttributeValue("href", string.Empty);
        if (string.IsNullOrEmpty(href))
            continue;
        htmlDoc = new HtmlWeb().Load(href);
        HtmlNodeCollection LinkTester = htmlDoc.DocumentNode.SelectNodes("//div[@class='dl-items']//a");
        if (LinkTester != null)
            foreach (var dllink in LinkTester)
            {
                string LinkURL = dllink.GetAttributeValue("href", string.Empty);
                if (string.IsNullOrEmpty(LinkURL))
                    continue;
                string ExtractFilename = LinkURL.Substring(LinkURL.LastIndexOf("/") + 1);
                new WebClient().DownloadFileAsync(new Uri(LinkURL), @"C:\temp\" + ExtractFilename);
            }
    }
The XPath that you used seems to be incorrect. I tried loading the web page in a browser and searched for that XPath, with no results. I replaced it with //a[@class='ap-area-link'] and was able to find matching elements.
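One caveat the snippet above does not cover: DownloadFileAsync returns immediately and the WebClient instances are never awaited or disposed, so a console app may exit before the downloads finish. As a sketch, a blocking variant of the download line (same LinkURL and ExtractFilename as above):

using (var client = new WebClient())
{
    // DownloadFile blocks until the file is fully written,
    // so Main cannot exit mid-transfer.
    client.DownloadFile(new Uri(LinkURL), @"C:\temp\" + ExtractFilename);
}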

Looping through HtmlNodes and collecting data gives me the same result every time

I have an async method which calls a mapper to turn an HTML string into an IEnumerable:
public async Task<IEnumerable<MovieRatingScrape>> GetMovieRatingsAsync(string username, int page)
{
    var response = await _httpClient.GetAsync($"/betyg/{username}?p={page}");
    response.EnsureSuccessStatusCode();
    var html = await response.Content.ReadAsStringAsync();
    return new MovieRatingsHtmlMapper().Map(html);
}
...
public class MovieRatingsHtmlMapper : HtmlMapperBase<IEnumerable<MovieRatingScrape>>
{
    // In reality, this method belongs to the base class with signature T Map(string html)
    public IEnumerable<MovieRatingScrape> Map(string html)
    {
        var htmlDocument = new HtmlDocument();
        htmlDocument.LoadHtml(html);
        return Map(htmlDocument);
    }

    public override IEnumerable<MovieRatingScrape> Map(HtmlDocument item)
    {
        var movieRatings = new List<MovieRatingScrape>();
        var nodes = item.DocumentNode.SelectNodes("//table[@class='list']/tr");
        foreach (var node in nodes)
        {
            var title = node.SelectSingleNode("//td[1]/a")?.InnerText;
            movieRatings.Add(new MovieRatingScrape
            {
                Date = DateTime.Parse(node.SelectSingleNode("//td[2]")?.InnerText),
                Slug = node.SelectSingleNode("//td[1]/a[starts-with(@href, '/film/')]")?
                    .GetAttributeValue("href", null)?
                    .Replace("/film/", string.Empty),
                SwedishTitle = title,
                Rating = node.SelectNodes($"//td[3]/i[{XPathHasClass("fa-star")}]").Count
            });
        }
        return movieRatings;
    }
}
The resulting list movieRatings contains copies of the same object, but when I look at the HTML and when I debug and inspect the HtmlNode node, the nodes differ as they are supposed to.
Either I'm blind to something really obvious, or I'm hitting some async issue I don't grasp. Any ideas? I should be getting 50 unique objects out of this call; instead I'm getting the first object 50 times.
Thank you in advance, Viktor.
Edit: Adding some screenshots to show my predicament (omitted here). Look at the locals InnerHtml (node) and title for items 1 and 2 of the foreach loop.
Edit 2: Managed to reproduce on .NET Fiddle: https://dotnetfiddle.net/A2I4CQ
You need to use .// and not //
Here is the fixed Fiddle: https://dotnetfiddle.net/dZkSRN
// will search anywhere in the document
.// will search anywhere in the current node
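Applied to the mapper above, the row-relative queries become (same logic; only the XPath prefixes change):

foreach (var node in nodes)
{
    // ".//" anchors each query to the current <tr> instead of the whole document
    var title = node.SelectSingleNode(".//td[1]/a")?.InnerText;
    movieRatings.Add(new MovieRatingScrape
    {
        Date = DateTime.Parse(node.SelectSingleNode(".//td[2]")?.InnerText),
        Slug = node.SelectSingleNode(".//td[1]/a[starts-with(@href, '/film/')]")?
            .GetAttributeValue("href", null)?
            .Replace("/film/", string.Empty),
        SwedishTitle = title,
        Rating = node.SelectNodes($".//td[3]/i[{XPathHasClass("fa-star")}]").Count
    });
}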
I am not super sure how to describe this, but your issue is here (I think):
//table[@class='list']/tr
Specifically, the //.
I experienced the same thing while looking for a span; I had to use something similar:
var nodes = htmlDoc.DocumentNode.SelectNodes("//li[@class='itemRow productItemWrapper']");
foreach (HtmlNode node in nodes)
{
    var nodeDoc = new HtmlDocument();
    nodeDoc.LoadHtml(node.InnerHtml);
    string name = nodeDoc.DocumentNode.SelectSingleNode("//span[@class='productDetailTitle']").InnerText;
}

Parsing Nodes with HTML AgilityPack

I'm trying to get information from this page: http://www.wowhead.com/transmog-sets?filter=3;5;0#transmog-sets
Rows look like this when inspecting elements (screenshot omitted).
I've tried this code, but it returns null every time on any of the nodes:
public class ItemSetsTransmog
{
    public string ItemSetName { get; set; }
    public string ItemSetId { get; set; }
}

public partial class Fmain : Form
{
    DataTable Table;
    HtmlWeb web = new HtmlWeb();

    public Fmain()
    {
        InitializeComponent();
        initializeItemSetTransmogTable();
    }

    private async void Fmain_Load(object sender, EventArgs e)
    {
        int PageNum = 0;
        var itemsets = await ItemSetTransmogFromPage(0);
        while (itemsets.Count > 0)
        {
            foreach (var itemset in itemsets)
                Table.Rows.Add(itemset.ItemSetName, itemset.ItemSetId);
            itemsets = await ItemSetTransmogFromPage(PageNum++);
        }
    }

    private async Task<List<ItemSetsTransmog>> ItemSetTransmogFromPage(int PageNum)
    {
        String url = "http://www.wowhead.com/transmog-sets?filter=3;5;0#transmog-sets";
        if (PageNum != 0)
            url = "http://www.wowhead.com/transmog-sets?filter=3;5;0#transmog-sets:75+" + PageNum.ToString();
        var doc = await Task.Factory.StartNew(() => web.Load(url));
        var NameNodes = doc.DocumentNode.SelectNodes("//*[@id=\"tab - transmog - sets\"]//div//table//tr//td//div//a");
        var IdNodes = doc.DocumentNode.SelectNodes("//*[@id=\"tab - transmog - sets\"]//div//table//tr//td//div//a");
        // if these are null it means the name/score nodes couldn't be found on the html page
        if (NameNodes == null || IdNodes == null)
            return new List<ItemSetsTransmog>();
        var ItemSetNames = NameNodes.Select(node => node.InnerText);
        var ItemSetIds = IdNodes.Select(node => node.InnerText);
        return ItemSetNames.Zip(ItemSetIds, (name, id) => new ItemSetsTransmog() { ItemSetName = name, ItemSetId = id }).ToList();
    }

    private void initializeItemSetTransmogTable()
    {
        Table = new DataTable("ItemSetTransmogTable");
        Table.Columns.Add("ItemSetName", typeof(string));
        Table.Columns.Add("ItemSetId", typeof(string));
        ItemSetTransmogDataView.DataSource = Table;
    }
}
Why doesn't my code load any of these nodes, and how can I fix it?
Your code does not load these nodes because they do not exist in the HTML that is pulled back by HTML Agility Pack. This is probably because a large majority of the markup you have shown is generated by JavaScript. Just try inspecting the doc.ParsedText property in your ItemSetTransmogFromPage() method.
Html Agility Pack is an HTTP Client/Parser, it will not run scripts. If you really need to get the data using this process then you will need to use a "headless browser" such as Optimus to retrieve the page (caveat: I have not used this library, though a nuget package appears to exist) and then probably use HTML Agility Pack to parse/query the markup.
The other alternative might be to parse the JSON that exists on the page, if it provides the data you need (although this appears unlikely).
Small note - I think the id in your XPath should be "tab-transmog-sets" instead of "tab - transmog - sets".
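A quick way to verify the JavaScript suspicion is to search the markup that was actually downloaded for the corrected id, along these lines:

var web = new HtmlWeb();
var doc = web.Load("http://www.wowhead.com/transmog-sets?filter=3;5;0#transmog-sets");
// If this prints False, the set table is built client-side by JavaScript
// and will never appear in what Html Agility Pack receives.
Console.WriteLine(doc.ParsedText.Contains("tab-transmog-sets"));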

System.ArgumentNullException when trying to access span with Xpath (C#)

So I've been trying to get a program working where I get info from Google Finance regarding different stock stats. So far I have not been able to get information out of the spans. As of now I have hardcoded direct access to the Apple stock.
Link to Apple stock: https://www.google.com/finance?q=NASDAQ%3AAAPL&ei=NgItWIG1GIftsAHCn4zIAg
What I can't understand is that I receive correct output when I try it in the Chrome console with the following command:
$x("//*[@id=\"appbar\"]//div//div//div//span");
This is my current code in Visual Studio 2015 with Html Agility Pack installed (I suspect a fault in currDocNodeCompanyName):
class StockDataAccess
{
    HtmlWeb web = new HtmlWeb();
    private List<string> testList;

    public void FindStock()
    {
        var histDoc = web.Load("https://www.google.com/finance/historical?q=NASDAQ%3AAAPL&ei=q9IsWNm4KZXjsAG-4I7oCA.html");
        var histDocNode = histDoc.DocumentNode.SelectNodes("//*[@id=\"prices\"]//table//tr//td");
        var currDoc = web.Load("https://www.google.com/finance?q=NASDAQ%3AAAPL&ei=CdcsWMjNCIe0swGd3oaYBA.html");
        var currDocNodeCurrency = currDoc.DocumentNode.SelectNodes("//*[@id=\"ref_22144_elt\"]//div//div");
        var currDocNodeCompanyName = currDoc.DocumentNode.SelectNodes("//*[@id=\"appbar\"]//div//div//div//span");
        var histDocText = histDocNode.Select(node => node.InnerText);
        var currDocCurrencyText = currDocNodeCurrency.Select(node => node.InnerText);
        var currDocCompanyName = currDocNodeCompanyName.Select(node => node.InnerText);
        List<String> result = new List<string>(histDocText.Take(6));
        result.Add(currDocCurrencyText.First());
        result.Add(currDocCompanyName.Take(2).ToString());
        testList = result;
    }

    public List<String> ReturnStock()
    {
        return testList;
    }
}
I have been trying the XPath expression [text] and received an output I can work with when using the Chrome console, but not in VS. I have also experimented with a foreach loop, which a few people have suggested to others.
class StockDataAccess
{
    HtmlWeb web = new HtmlWeb();
    private List<string> testList;

    public void FindStock()
    {
        // same as before
        var currDoc = web.Load("https://www.google.com/finance?q=NASDAQ%3AAAPL&ei=CdcsWMjNCIe0swGd3oaYBA.html");
        HtmlNodeCollection currDocNodeCompanyName = currDoc.DocumentNode.SelectNodes("//*[@id=\"appbar\"]//div//div//div//span");
        // same as before
        List<string> blaList = new List<string>();
        foreach (HtmlNode x in currDocNodeCompanyName)
        {
            blaList.Add(x.InnerText);
        }
        List<String> result = new List<string>(histDocText.Take(6));
        result.Add(currDocCurrencyText.First());
        result.Add(blaList[1]);
        result.Add(blaList[2]);
        testList = result;
    }

    public List<String> ReturnStock()
    {
        return testList;
    }
}
I would really appreciate it if anyone could point me in the right direction.
If you check the contents of currDoc.DocumentNode.InnerHtml, you will notice that there is no element with the id "appbar", so the result is correct: the XPath doesn't return anything.
I suspect that the HTML element you're trying to find is generated by a script (JS, for example), which explains why you can see it in the browser but not in the HtmlDocument object; HtmlAgilityPack does not render scripts, it only downloads and parses the raw source code.
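The ArgumentNullException itself comes from calling Select on the null collection that SelectNodes returns when nothing matches. A defensive sketch of that part of FindStock (System.Linq assumed):

var currDocNodeCompanyName = currDoc.DocumentNode.SelectNodes("//*[@id=\"appbar\"]//div//div//div//span");
// SelectNodes returns null, not an empty collection, when there is no match,
// so guard before projecting the nodes.
var currDocCompanyName = currDocNodeCompanyName == null
    ? Enumerable.Empty<string>()
    : currDocNodeCompanyName.Select(node => node.InnerText);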

Dynamic tree control

I am using the control named 'tree' from http://www.jeasyui.com. In my case I want to load only the highest level at first, and then load a node's child nodes when it is clicked.
<ul id="tt" checkbox="true" animate="true"></ul>
$(function () {
    $('#tt').tree({
        data: @Html.Raw(Model.Tree)
    });
});
In this function I fetch the child nodes for the selected node from the DB:
$(function () {
    $('#tt').tree({
        onBeforeExpand: function (node) {
            var hospitalId = node.id;
            $.getJSON('@Url.Action("LoadDepartments")', { hospitalId: hospitalId }, function () {
            });
        }
    });
});
[HttpGet]
public ActionResult LoadDepartments(Guid hospitalId)
{
    LoadHospitals();
    var departments = _templateAccessor.GetDepartments(hospitalId);
    var hospital = tree.Where(obj => obj.id == hospitalId.ToString()).FirstOrDefault();
    if (hospital != null)
    {
        foreach (var department in departments)
        {
            DataTreeModel dep = new DataTreeModel();
            dep.id = department.Id.ToString();
            dep.text = department.Name;
            dep.state = "closed";
            hospital.children.Add(dep);
            hospital.state = "open";
        }
    }
    var result = SerializeToJsonString(tree);
    return Json(result, JsonRequestBehavior.AllowGet);
}
In the LoadDepartments method I build the correct structure, but the tree doesn't show the new elements. The question is how to clear the previous content of the tree and fill it with the new content. Maybe I am doing something wrong?
jqTree
This tree should have the features you require; it's very lightweight and easy to use. I'm not advertising, but I was in the same situation a couple of weeks ago and this offered me a solution. It will also keep track of opened nodes, etc.
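If you would rather stay with the EasyUI tree, one likely cause is the empty $.getJSON success callback: the departments returned by LoadDepartments are never handed to the widget. A sketch of appending them, assuming EasyUI's documented tree 'append' method and that the action returns a plain JSON array (note that serializing to a string first and then wrapping it in Json(...) double-encodes the payload; returning the object list directly avoids this):

$('#tt').tree({
    onBeforeExpand: function (node) {
        $.getJSON('@Url.Action("LoadDepartments")', { hospitalId: node.id }, function (data) {
            // Attach the fetched departments under the node being expanded.
            $('#tt').tree('append', {
                parent: node.target,
                data: data
            });
        });
    }
});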
