HtmlAgilityPack ArgumentOutOfRangeException - C#

I'm trying to parse a website's content on Windows Phone using the HtmlAgilityPack. My current code is:
HtmlWeb.LoadAsync(url, DownloadCompleted);
...
void DownloadCompleted(object sender, HtmlDocumentLoadCompleted e)
{
    if (e.Error == null)
    {
        HtmlDocument doc = e.Document;
        if (doc != null)
        {
            string test = doc.DocumentNode.Element("html").Element("body").Element("form")
                .Elements("div").ElementAt(2).Element("table").Element("tbody")
                .Elements("tr").ElementAt(4).Element("td").Element("center")
                .Element("div").InnerText.ToString();
            System.Diagnostics.Debug.WriteLine(test);
        }
    }
}
Currently, when I run the above, I get an ArgumentOutOfRangeException at string test = doc.DocumentNode.Element("html").Element("body").Element("form").Elements("div").ElementAt(2).Element("table").Element("tbody").Elements("tr").ElementAt(4).Element("td").Element("center").Element("div").InnerText.ToString();.
doc.DocumentNode.Element("html").InnerText.ToString() seems to give me the source code for the entire page.
The URL of the website I'm trying to parse is: http://polyclinic.singhealth.com.sg/Webcams/QimgPage.aspx?Loc_Code=BDP

It looks like you're after a specific div; if I'm not mistaken, the one you're after has a unique identifier: <td class="queueNo"><center><div id="divRegPtwVal">0</div></center></td>.
Why not simply use doc.DocumentNode.SelectSingleNode("//div[@id='divRegPtwVal']") or doc.DocumentNode.Descendants("div").Where(div => div.Id == "divRegPtwVal").FirstOrDefault()?
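For example, inside the DownloadCompleted handler from the question, the whole chain collapses to a single id-based query (a sketch; divRegPtwVal comes from the markup above):
void DownloadCompleted(object sender, HtmlDocumentLoadCompleted e)
{
    if (e.Error != null || e.Document == null)
        return;
    // Query by id instead of walking the tree position by position,
    // which throws as soon as the page layout shifts.
    var node = e.Document.DocumentNode.SelectSingleNode("//div[@id='divRegPtwVal']");
    if (node != null)
        System.Diagnostics.Debug.WriteLine(node.InnerText);
}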
Select the image source for a specific image with id:
var attrib = doc.DocumentNode.SelectSingleNode("//img[@id='imgCam2']/@src");
// I suspect it might be a slightly different property, I can't check right now
string src = attrib.InnerText;
Or:
var img = doc.DocumentNode.Descendants("img").FirstOrDefault(i => i.Id == "imgCam2");
string src = img.Attributes["src"].Value;
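A slightly more defensive variant (a sketch; GetAttributeValue returns the fallback value instead of throwing when the attribute is missing):
var img = doc.DocumentNode.Descendants("img").FirstOrDefault(i => i.Id == "imgCam2");
string src = img != null ? img.GetAttributeValue("src", string.Empty) : string.Empty;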

Related

Iterate through web pages and download PDFs

I have code that crawls through all PDF files on a web page and downloads them to a folder. However, it has now started to throw an error:
System.NullReferenceException HResult=0x80004003 Message=Object reference not set to an instance of an object. Source=NW Crawler StackTrace: at NW_Crawler.Program.Main(String[] args) in C:\Users\PC\source\repos\NW Crawler\NW Crawler\Program.cs:line 16
The error points to ProductListPage in foreach (HtmlNode src in ProductListPage).
Is there any hint on how to fix this issue? I have tried to implement async/await with no success. Maybe I was doing something wrong, though...
Here is the process to be done:
Go to https://www.nordicwater.com/products/waste-water/
List all links in the related products section. They look like: <a class="ap-area-link" href="https://www.nordicwater.com/product/mrs-meva-multi-rake-screen/">MRS MEVA multi rake screen</a>
Proceed to each link and search for PDF files. The PDF links are in:
<div class="dl-items">
<a href="https://www.nordicwater.com/wp-content/uploads/2016/04/S1126-MRS-brochure-EN.pdf" download="">
Here is my full code for testing:
using HtmlAgilityPack;
using System;
using System.Net;

namespace NW_Crawler
{
    class Program
    {
        static void Main(string[] args)
        {
            HtmlDocument htmlDoc = new HtmlWeb().Load("https://www.nordicwater.com/products/waste-water/");
            HtmlNodeCollection ProductListPage = htmlDoc.DocumentNode.SelectNodes("//a[@class='ap-area-link']//a");
            Console.WriteLine("Here are the links:" + ProductListPage);
            foreach (HtmlNode src in ProductListPage)
            {
                htmlDoc = new HtmlWeb().Load(src.Attributes["href"].Value);
                // Thread.Sleep(5000); // wait some time
                HtmlNodeCollection LinkTester = htmlDoc.DocumentNode.SelectNodes("//div[@class='dl-items']//a");
                if (LinkTester != null)
                {
                    foreach (var dllink in LinkTester)
                    {
                        string LinkURL = dllink.Attributes["href"].Value;
                        Console.WriteLine(LinkURL);
                        string ExtractFilename = LinkURL.Substring(LinkURL.LastIndexOf("/"));
                        var DLClient = new WebClient();
                        // Thread.Sleep(5000); // wait some time
                        DLClient.DownloadFileAsync(new Uri(LinkURL), @"C:\temp\" + ExtractFilename);
                    }
                }
            }
        }
    }
}
Made a couple of changes to cover the errors you might be seeing.
Changes
Use src.GetAttributeValue("href", string.Empty) instead of src.Attributes["href"].Value; if the href attribute is not present, the latter throws "Object reference not set to an instance of an object".
Check that ProductListPage is valid and not null.
ExtractFilename includes the / in the name. Use + 1 in the Substring call to skip past the last /.
Move on to the next iteration if the href is null in either of the loops.
Changed the product list query to //a[@class='ap-area-link'] from //a[@class='ap-area-link']//a. You were searching for an <a> within the <a> tag, which is null. Still, if you want to query it this way, the first if statement checking ProductListPage != null will take care of the errors.
HtmlDocument htmlDoc = new HtmlWeb().Load("https://www.nordicwater.com/products/waste-water/");
HtmlNodeCollection ProductListPage = htmlDoc.DocumentNode.SelectNodes("//a[@class='ap-area-link']");
if (ProductListPage != null)
{
    foreach (HtmlNode src in ProductListPage)
    {
        string href = src.GetAttributeValue("href", string.Empty);
        if (string.IsNullOrEmpty(href))
            continue;
        htmlDoc = new HtmlWeb().Load(href);
        HtmlNodeCollection LinkTester = htmlDoc.DocumentNode.SelectNodes("//div[@class='dl-items']//a");
        if (LinkTester != null)
        {
            foreach (var dllink in LinkTester)
            {
                string LinkURL = dllink.GetAttributeValue("href", string.Empty);
                if (string.IsNullOrEmpty(LinkURL))
                    continue;
                string ExtractFilename = LinkURL.Substring(LinkURL.LastIndexOf("/") + 1);
                new WebClient().DownloadFileAsync(new Uri(LinkURL), @"C:\temp\" + ExtractFilename);
            }
        }
    }
}
The XPath that you used seems to be incorrect. I tried loading the web page in a browser and searching for that XPath, and got no results. I replaced it with //a[@class='ap-area-link'] and was able to find matching elements.
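Since the question mentions trying async/await without success, here is a minimal sketch of an awaitable download helper (assuming .NET 4.5+ and a using System.Threading.Tasks; directive; the method name and target folder are illustrative):
static async Task DownloadPdfAsync(string linkUrl)
{
    string fileName = linkUrl.Substring(linkUrl.LastIndexOf("/") + 1);
    using (var client = new WebClient())
    {
        // DownloadFileTaskAsync returns a Task, so the caller can await
        // each download instead of firing and forgetting.
        await client.DownloadFileTaskAsync(new Uri(linkUrl), @"C:\temp\" + fileName);
    }
}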

Parsing Nodes with HTML AgilityPack

I'm trying to get information from that page : http://www.wowhead.com/transmog-sets?filter=3;5;0#transmog-sets
The rows look like this when inspecting elements in the browser (screenshot omitted).
I've tried this code, but it returns null every time, for any of the nodes:
public class ItemSetsTransmog
{
    public string ItemSetName { get; set; }
    public string ItemSetId { get; set; }
}

public partial class Fmain : Form
{
    DataTable Table;
    HtmlWeb web = new HtmlWeb();

    public Fmain()
    {
        InitializeComponent();
        initializeItemSetTransmogTable();
    }

    private async void Fmain_Load(object sender, EventArgs e)
    {
        int PageNum = 0;
        var itemsets = await ItemSetTransmogFromPage(0);
        while (itemsets.Count > 0)
        {
            foreach (var itemset in itemsets)
                Table.Rows.Add(itemset.ItemSetName, itemset.ItemSetId);
            itemsets = await ItemSetTransmogFromPage(PageNum++);
        }
    }

    private async Task<List<ItemSetsTransmog>> ItemSetTransmogFromPage(int PageNum)
    {
        String url = "http://www.wowhead.com/transmog-sets?filter=3;5;0#transmog-sets";
        if (PageNum != 0)
            url = "http://www.wowhead.com/transmog-sets?filter=3;5;0#transmog-sets:75+" + PageNum.ToString();
        var doc = await Task.Factory.StartNew(() => web.Load(url));
        var NameNodes = doc.DocumentNode.SelectNodes("//*[@id=\"tab - transmog - sets\"]//div//table//tr//td//div//a");
        var IdNodes = doc.DocumentNode.SelectNodes("//*[@id=\"tab - transmog - sets\"]//div//table//tr//td//div//a");
        // if these are null it means the name/score nodes couldn't be found on the html page
        if (NameNodes == null || IdNodes == null)
            return new List<ItemSetsTransmog>();
        var ItemSetNames = NameNodes.Select(node => node.InnerText);
        var ItemSetIds = IdNodes.Select(node => node.InnerText);
        return ItemSetNames.Zip(ItemSetIds, (name, id) => new ItemSetsTransmog() { ItemSetName = name, ItemSetId = id }).ToList();
    }

    private void initializeItemSetTransmogTable()
    {
        Table = new DataTable("ItemSetTransmogTable");
        Table.Columns.Add("ItemSetName", typeof(string));
        Table.Columns.Add("ItemSetId", typeof(string));
        ItemSetTransmogDataView.DataSource = Table;
    }
}
Why doesn't my script load any of these nodes? How can I fix it?
Your code does not load these nodes because they do not exist in the HTML that is pulled back by HTML Agility Pack. This is probably because a large majority of the markup you have shown is generated by JavaScript. Just try inspecting the doc.ParsedText property in your ItemSetTransmogFromPage() method.
HTML Agility Pack is an HTTP client/parser; it will not run scripts. If you really need to get the data using this process, then you will need to use a "headless browser" such as Optimus to retrieve the page (caveat: I have not used this library, though a NuGet package appears to exist) and then probably use HTML Agility Pack to parse/query the markup.
The other alternative might be to try to parse the JSON that exists on this page (if this provides you with the data that you need, although this appears unlikely).
Small note - I think the id in your XPath should be "tab-transmog-sets" instead of "tab - transmog - sets".
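To verify this, a quick check along those lines (a sketch; the corrected id comes from the note above):
var doc = new HtmlWeb().Load("http://www.wowhead.com/transmog-sets?filter=3;5;0#transmog-sets");
// If the rows are generated by JavaScript, they will not appear in the raw markup.
Console.WriteLine(doc.ParsedText.Contains("tab-transmog-sets"));
var nodes = doc.DocumentNode.SelectNodes("//*[@id='tab-transmog-sets']//a");
Console.WriteLine(nodes == null ? "no nodes found" : nodes.Count + " nodes found");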

Parsing HTML with HtmlAgilityPack in an .htaccess-protected section of a website

I am struggling to parse an HTML webpage in an .htaccess-protected area of my website through an .aspx file in C# (the .aspx file is within the protected area). By debugging the code, I can see that the raw page comes back through the HtmlWeb().Load method, but when it comes to getting the HTML node (the SelectSingleNode method), I get a null value.
Here is the sample code I am testing:
protected void Page_Load(object sender, EventArgs e)
{
    lbl.Text = getTextFromPage();
}

private string getTextFromPage()
{
    var web = new HtmlWeb();
    var doc = web.Load("html_address");
    HtmlNode node = doc.DocumentNode.SelectSingleNode("//div[@id='reference_to_id']");
    if (node != null)
    {
        return node.InnerText;
    }
    else
    {
        return "nothing found";
    }
}
I always get the "nothing found" response, since the node object is null. If I remove the .htaccess file (thus removing the protection), everything works perfectly fine, so I suppose something needs to be done about the .htaccess protection. What shall I do?
EDIT:
I have included the content of the node in the question:
<div id="reference_to_id"><p>test_test_test</p></div>
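If the protection is HTTP Basic authentication, the request HtmlWeb sends is probably carrying no credentials, so the server returns an error page instead of the real one. One possible approach is HtmlWeb's PreRequest hook, which lets you modify the outgoing HttpWebRequest. A hedged sketch (the user name and password are placeholders):
var web = new HtmlWeb();
web.PreRequest = request =>
{
    // Hypothetical credentials for the protected area; replace with real ones.
    request.Credentials = new System.Net.NetworkCredential("user", "password");
    return true; // returning true lets the request proceed
};
var doc = web.Load("html_address");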

Get GeckoFx firefox browser control iframe html not accessible

I am using the GeckoFX 22 C# web browser control but cannot manage to access tags within an iframe. When I check the Gecko InnerHtml, it seems that although the iframe tag shows in the HTML, its contents do not.
This is the code I used to get the inner html of the browser control which just shows the iframe tag as empty (when it should have another doc inside of it):
GeckoHtmlElement element = null;
var geckoDomElement = webBrowser.Document.DocumentElement;
if (geckoDomElement is GeckoHtmlElement)
{
    element = (GeckoHtmlElement)geckoDomElement;
    var innerHtml = element.InnerHtml;
}
Previously I used code similar to the code below to access individual elements which works fine:
GeckoDocument checkDoc = (GeckoDocument)webBrowser.Window.Document;
var x = (checkDoc.GetElementsByTagName("a").Where(b => b.Id == "ipt-form-format-aside").First());
I am able to get individual elements and change their values/trigger events etc. without problems in the main HTML document, but it is impossible to get the elements of anything in an iframe. I think perhaps the iframe has not been loaded yet, or something like that. Is there a way to force the control to wait for the iframe to load before attempting to access its elements?
string content = null;
var iframe = webBrowser.Document.GetElementsByTagName("iframe").FirstOrDefault() as Gecko.DOM.GeckoIFrameElement;
if (iframe != null)
{
    var html = iframe.ContentDocument.DocumentElement as GeckoHtmlElement;
    if (html != null)
        content = html.OuterHtml;
}
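To make sure the iframe's document has actually loaded before querying it, one option is to run this from the browser's DocumentCompleted event. A sketch (the exact event-args type varies between GeckoFX versions, so treat the signature as an assumption):
webBrowser.DocumentCompleted += (s, args) =>
{
    var iframe = webBrowser.Document.GetElementsByTagName("iframe").FirstOrDefault() as Gecko.DOM.GeckoIFrameElement;
    if (iframe != null && iframe.ContentDocument != null)
    {
        var html = iframe.ContentDocument.DocumentElement as GeckoHtmlElement;
        if (html != null)
            content = html.OuterHtml; // same extraction as above, now safely after load
    }
};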
I'm just posting this for anyone else that might get this problem
foreach (GeckoIFrameElement _E in geckoWebBrowser1.Document.GetElementsByTagName("iframe"))
{
    if (_E.GetAttribute("class") == "testClass")
    {
        var innerHTML = _E.ContentDocument;
        foreach (GeckoHtmlElement _A in innerHTML.GetElementsByTagName("input"))
        {
            _A.SetAttribute("value", "Test");
        }
    }
}
I had a similar problem, so I used
checkDoc.Window.Frames(1)
instead of
checkDoc.GetElementsByTagName("iframe")
The value within the parentheses (1 here) depends on the index of your frame.

Extracting Twitter PIN info, WP7

I've been following this great tutorial:
http://buildmobile.com/twitter-in-a-windows-phone-7-app/#fbid=o0eLp-OipGa
But it seems that the pin extraction method used in it doesn't work for me or is out of date. I'm not an expert on HTML scraping and was wondering if someone could help me find a solution for extracting the pin. The method used by the tutorial is:
private void BrowserNavigated(object sender, NavigationEventArgs e)
{
    if (AuthenticationBrowser.Visibility == Visibility.Collapsed)
    {
        AuthenticationBrowser.Visibility = Visibility.Visible;
    }
    if (e.Uri.AbsoluteUri.ToLower().Replace("https://", "http://") == AuthorizeUrl)
    {
        var htmlString = AuthenticationBrowser.SaveToString();
        var pinFinder = new Regex(@"<DIV id=oauth_pin>(?<pin>[A-Za-z0-9_]+)</DIV>", RegexOptions.IgnoreCase);
        var match = pinFinder.Match(htmlString);
        if (match.Length > 0)
        {
            var group = match.Groups["pin"];
            if (group.Length > 0)
            {
                pin = group.Captures[0].Value;
                if (!string.IsNullOrEmpty(pin))
                {
                    RetrieveAccessToken();
                }
            }
        }
        if (string.IsNullOrEmpty(pin))
        {
            Dispatcher.BeginInvoke(() => MessageBox.Show("Authorization denied by user"));
        }
        // Make sure pin is reset to null
        pin = null;
        AuthenticationBrowser.Visibility = Visibility.Collapsed;
    }
}
When running through that code, "match" always ends up empty and the pin is never found. Everything else in the tutorial works, but I have no idea how to adapt this code to extract the pin, given the new structure of the page.
I really appreciate the time,
Mike
I have found that Twitter has 2 different PIN pages, and I think they determine which page to redirect you to depending on your browser.
Something as simple as string parsing will work for you. The first PIN page I came across has the PIN wrapped in a <code> tag, so simply look for <code> and parse it out:
if (innerHtml.Contains("<code>"))
{
    // Skip the 6 characters of "<code>" and take the 7-character PIN.
    pin = innerHtml.Substring(innerHtml.IndexOf("<code>") + 6, 7);
}
The other page I came across (which looks like the one in the tutorial you are using) has it wrapped in an element with id="oauth_pin", if I recall correctly. So, just parse that as well:
else if (innerHtml.Contains("oauth_pin"))
{
    // Skip the 9 characters of "oauth_pin" plus one more, then take the 7-character PIN.
    pin = innerHtml.Substring(innerHtml.IndexOf("oauth_pin") + 10, 7);
}
innerHtml is a string that contains the body of the page, which corresponds to var htmlString = AuthenticationBrowser.SaveToString(); in your code.
I use both of these in my C# program and they work great, full snippet:
private void WebBrowser1DocumentCompleted(object sender, WebBrowserDocumentCompletedEventArgs e)
{
    var innerHtml = webBrowser1.Document.Body.InnerHtml.ToLower();
    var code = string.Empty;
    if (innerHtml.Contains("<code>"))
    {
        code = innerHtml.Substring(innerHtml.IndexOf("<code>") + 6, 7);
    }
    else if (innerHtml.Contains("oauth_pin"))
    {
        code = innerHtml.Substring(innerHtml.IndexOf("oauth_pin") + 10, 7);
    }
    textBox1.Text = code;
}
Let me know if you have any questions, and I hope this helps!
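One caveat with the fixed offsets in these snippets: the hard-coded +6/+10 skips and the fixed 7-character length will silently grab the wrong text if Twitter changes the page. A slightly more defensive sketch using a regex (assuming the PIN is numeric; Regex comes from System.Text.RegularExpressions):
private static string ExtractPin(string innerHtml)
{
    // First variant: PIN wrapped in a <code> tag.
    var match = Regex.Match(innerHtml, @"<code>\s*(\d+)\s*</code>", RegexOptions.IgnoreCase);
    if (match.Success)
        return match.Groups[1].Value;
    // Second variant: PIN inside the element with id="oauth_pin".
    match = Regex.Match(innerHtml, @"oauth_pin[^>]*>\s*(\d+)", RegexOptions.IgnoreCase);
    return match.Success ? match.Groups[1].Value : string.Empty;
}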
I needed to change the code suggested by Toma A to this one:
var innerHtml = webBrowser1.SaveToString();
var code = string.Empty;
if (innerHtml.Contains("<code>"))
{
    code = innerHtml.Substring(innerHtml.IndexOf("<code>") + 6, 7);
}
else if (innerHtml.Contains("oauth_pin"))
{
    code = innerHtml.Substring(innerHtml.IndexOf("oauth_pin") + 10, 7);
}
because this line doesn't work on Windows Phone:
var innerHtml = webBrowser1.Document.Body.InnerHtml.ToLower();
