Extract HTML element attribute value with Html Agility Pack - C#

I need to retrieve a form anti-forgery token from an HTML page.
To do so, I'm using the Html Agility Pack, but I'm fairly new to it.
This is my code:
var page = new HtmlDocument();
page.LoadHtml(htmlPage);
var tokenNode = page.DocumentNode.SelectSingleNode("/html/body/div[3]/div[2]/form/input").Attributes["value"].Value;
The 'tokenNode' variable is returning null.
I've managed to track down my problem to this method:
page.DocumentNode.SelectSingleNode("/html/body/div[3]/div[2]/form/input");
If I simply use page.DocumentNode.SelectSingleNode("/html/body/div[3]") it returns a value. However, when I add the second div to my XPath, it starts returning null.
What am I missing here?
Edit: Got the XPath using Chrome developer tools.
Edit 2: It turned out the problem was in the XPath I got from Chrome.
TL;DR: The HTML in the browser was different from the one my HTTP request retrieved, so the XPath was wrong.
Here's a more thorough explanation
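A quick way to catch this kind of mismatch (a sketch): dump the HTML your request actually received and build the XPath from that, rather than from the browser's DOM, which reflects the page after scripts have run.
var page = new HtmlDocument();
page.LoadHtml(htmlPage);
// Save what the server actually sent, then compare it with what
// the browser's developer tools show.
System.IO.File.WriteAllText("fetched.html", page.DocumentNode.OuterHtml);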

To get the anti-forgery token from a page, you could just call the GetElementbyId method, passing the element's id.
For example
var page = new HtmlDocument();
page.LoadHtml(htmlPage);
string token = page.GetElementbyId("__RequestVerificationToken").GetAttributeValue("value", "");
You don't need to walk the nested path.
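One caveat (a sketch of a guarded version): GetElementbyId returns null when no element carries that id, so checking first avoids a NullReferenceException if the token is missing.
var page = new HtmlDocument();
page.LoadHtml(htmlPage);
// Null when the page has no element with this id.
var tokenElement = page.GetElementbyId("__RequestVerificationToken");
string token = tokenElement != null
    ? tokenElement.GetAttributeValue("value", "")
    : null;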

Related

How to parse a Tumblr search results page?

There is a Tumblr page with search results, e.g. https://www.tumblr.com/search/fruit+apple
I need to scrape at least 10 results from it and parse them. How can I do that?
It seems like the Tumblr API doesn't have an appropriate method. And the
HtmlWeb web = new HtmlWeb();
HtmlDocument doc = web.Load("https://www.tumblr.com/search/fruit+apple");
Console.WriteLine(doc.DocumentNode.OuterHtml); //<--it freezes here
approach from HtmlAgilityPack doesn't work with this web address (or I'm doing something wrong).
Is it possible to get the search results? Please help (or say that it's impossible). Thanks in advance.
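One thing worth ruling out (a sketch, not a confirmed fix): a request that hangs with no timeout looks like a freeze. HtmlWeb lets you adjust the underlying HttpWebRequest before it is sent, so you can set a browser-like User-Agent and a bounded timeout; the values below are illustrative.
HtmlWeb web = new HtmlWeb();
web.UserAgent = "Mozilla/5.0"; // some sites behave differently without one
web.PreRequest = delegate(HttpWebRequest request)
{
    request.Timeout = 30000; // fail after 30 seconds instead of hanging
    return true;             // true lets the request proceed
};
HtmlDocument doc = web.Load("https://www.tumblr.com/search/fruit+apple");
Console.WriteLine(doc.DocumentNode.OuterHtml);
If the search results are rendered by JavaScript, HtmlAgilityPack will not see them at all, and a browser-driving tool would be needed instead.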

Get Image Absolute URL From Some Node in HtmlAgilityPack.HtmlDocument

I want to fetch a webpage from the internet and get the absolute URLs of some images on the page using HtmlAgilityPack in C#.
The problem is...
The website first redirects the URL to some other one, and the src attribute in the <img> tag is a relative URL.
Currently, I have some code like this:
using HtmlAgilityPack;
HtmlDocument webpageDocument = new HtmlWeb().Load("http://xyz.example.com/");
HtmlNodeCollection nodes = webpageDocument.DocumentNode.SelectNodes("//img");
String url = nodes[0].Attributes["src"].Value;
The code above fetches a webpage from the given example URL, gets an <img> element from the DOM tree, and reads its src attribute.
It works if the <img> has an absolute URL. But unfortunately the website I want to handle gives me a relative URI (e.g. /img/01.png). I need the absolute URL so that I can do more with the image.
So I need to know which URL is the base URL for the given src, but I haven't managed to find it. In other words, I don't know how to get the location of the webpage after the redirect.
Server side is not mine (I have no control to it).
Consider ResponseUri, and to avoid a second call, give the Html Agility parser the string with the content of the page instead.
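A sketch of that suggestion, using the example URL from the question: make the HTTP call yourself, keep the post-redirect address from ResponseUri, and hand the downloaded string to HtmlAgilityPack so the page is fetched only once. Relative paths such as /img/01.png can then be resolved against that base.
using System;
using System.IO;
using System.Net;
using HtmlAgilityPack;

HttpWebRequest request = (HttpWebRequest)WebRequest.Create("http://xyz.example.com/");
string html;
Uri baseUri;
using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
using (StreamReader reader = new StreamReader(response.GetResponseStream()))
{
    baseUri = response.ResponseUri; // the URL after any redirects
    html = reader.ReadToEnd();
}

HtmlDocument webpageDocument = new HtmlDocument();
webpageDocument.LoadHtml(html); // no second request needed

HtmlNodeCollection nodes = webpageDocument.DocumentNode.SelectNodes("//img");
string src = nodes[0].GetAttributeValue("src", "");
Uri absolute = new Uri(baseUri, src); // resolves e.g. /img/01.png to an absolute URL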

Html Agility Pack - reading div InnerText in table

My problem is that I can't get a div's InnerText from a table. I have successfully extracted other kinds of data, but I don't know how to read a div from a table.
In the following picture I've highlighted the div, and I need to get its InnerText, in this case the number 3.
[Screenshot: calendar table with the highlighted div]
I'm trying to accomplish this using the following path:
"//div[@class='kal']//table//tr[2]/td[1]/div[@class='cipars']"
But I'm getting the following error:
[Screenshot: error message]
Assuming the rest of the code is written correctly, could anyone point me in the right direction? I have been trying to figure this one out, but I can't get any results.
So your problem is that you are relying on positions within your XPath. Whilst this can be OK in some cases, it is not here, because you are expecting the first td in a given tr to have a div with that class.
Looking at the source in Chrome shows this is not always the case. You can see this by comparing the "1" element in the calendar to "2" and "3". You'll notice the "1" element has a number of elements around it which the others don't.
Your original XPath query does not return an element, this is why you are getting the error. In the event the XPath query you give HtmlAgilityPack does not result in a DOM element, it will return null.
Now, because you've not shown your entire code, I don't know how this code is being run. However, I am guessing you are trying to loop through all of the calendar items. Regardless, you have multiple ways of doing this, but I will show you that with the descendant XPath selector, you can just grab the whole lot in one go:
//div[@class='kal']//table//descendant::div[@class='cipars']
This will return all of the calendar items (ie 1 through 30).
However, to get all the items in a particular row, you can just stick that tr into the query:
//div[@class='kal']//table//tr[3]/descendant::div[@class='cipars']
This would return 2 to 8 (the second row of calendar items).
To target a specific one, well, you'll have to make an assumption about the source code of the website. It looks like every "cipars" div has an ancestor td with the class "datums"... so to get the "3" value from your question:
//div[@class='kal']//table//tr[3]//td[@class='datums'][2]/div[@class='cipars']
Hopefully this is enough to show the issue at least.
Edit
Although you do have an XPath problem, you also have another issue.
The site is created very strangely. The calendar is loaded in a strange way. When I hit that URL, the calendar is created by some Javascript calling an XML web service (written in PHP) that then calculates the full table to be used for the calendar.
Due to the fact this is Javascript (client side code), HtmlAgilityPack won't execute it. Therefore, HtmlAgilityPack doesn't even "see" the table. Hence the queries against it come back as "not found" (null).
There are ways around this. 1) Use a tool that will execute the scripts. By this, I mean loading up a browser. A great tool for this is Selenium. This will probably be the better overall solution because it means all the scripting used by the site will actually be run. You can still use XPath with it, so your queries will not change.
2) Send a request off to the same web service that the page does. This basically gets back the same HTML the page is getting, which we can then use with HtmlAgilityPack. How do we do that?
Well, you can easily POST data to a web service using C#. Just for ease of use I've stolen the code from this SO question. With this, we can send the same request the page is, and get the same HTML back.
So to send some POST data, we write a method like so:
using System.IO;
using System.Net;
using System.Text;

public static string SendPost(string url, string postData)
{
    string webpageContent = string.Empty;
    byte[] byteArray = Encoding.UTF8.GetBytes(postData);

    HttpWebRequest webRequest = (HttpWebRequest)WebRequest.Create(url);
    webRequest.Method = "POST";
    webRequest.ContentType = "application/x-www-form-urlencoded";
    webRequest.ContentLength = byteArray.Length;

    // Write the form-encoded POST body to the request stream.
    using (Stream webpageStream = webRequest.GetRequestStream())
    {
        webpageStream.Write(byteArray, 0, byteArray.Length);
    }

    // Read the response body back as a string.
    using (HttpWebResponse webResponse = (HttpWebResponse)webRequest.GetResponse())
    using (StreamReader reader = new StreamReader(webResponse.GetResponseStream()))
    {
        webpageContent = reader.ReadToEnd();
    }

    return webpageContent;
}
We can call it like so:
string responseBody = SendPost("http://lekcijas.va.lv/lekcijas_request.php", "nodala=IT&kurss=1&gads=2013&menesis=9&c_dala=");
How did I get this? Well, the PHP file we are calling is the web service the page uses, and the POST data is the same too. The way I found out what data it sends to the service was by debugging the JavaScript (using Chrome's developer console), but you may notice it's pretty much the same thing that is in the URL. That seems to be intentional.
The responseBody that is returned is the physical HTML of just the table for the calendar.
What do we do with it now? We load that up into HtmlAgilityPack, because it is able to accept pure HTML.
var document = new HtmlDocument();
document.LoadHtml(responseBody);
Now, we stick that original XPath in:
var node = document.DocumentNode.SelectSingleNode("//div[@class='kal']//table//tr[3]//td[@class='datums'][2]/div[@class='cipars']");
Now, we print out what should hopefully be "3":
Console.WriteLine(node.InnerText);
My output, running it locally, is indeed: 3.
However, although this would get you over the problem you are having, I am assuming the rest of the site is like this. If this is the case, you may still be able to work around it using the technique above, but tools like Selenium were created for this very reason.

Is there a way to authenticate a windows authenticated user in Html Agility Pack?

In HttpTests there's a way to authenticate using
request.Credentials = CredentialCache.DefaultCredentials;
is there something similar in Html Agility Pack? I want to test my localhost project but it's receiving a:
HTTP Error 401.2 - Unauthorized You are not authorized to view this page
I found a blog by Jon Gallant: http://blog.jongallant.com/2012/07/htmlagilitypack-windows-authentication.html#.UJEQam8xol8
It creates a new instance of HtmlWeb, registers a PreRequest delegate, creates a new WebProxy with UseDefaultCredentials set to true, and then loads the URL with a GET request, passing in the application's default network credentials.
using System;
using System.Linq;
using System.Net;
using HtmlAgilityPack;

var web = new HtmlWeb();
web.PreRequest = delegate(HttpWebRequest webRequest)
{
    webRequest.Timeout = 1200000;
    return true;
};

var proxy = new WebProxy() { UseDefaultCredentials = true };
var document = web.Load("http://localhost:2120", "GET", proxy,
    CredentialCache.DefaultNetworkCredentials);

var linksOnPage = from lnks in document.DocumentNode.Descendants()
                  where lnks.Name == "a" &&
                        lnks.Attributes["href"] != null &&
                        lnks.InnerText.Trim().Length > 0
                  select new
                  {
                      Url = lnks.Attributes["href"].Value,
                      Text = lnks.InnerText
                  };

linksOnPage.All(t => { Console.WriteLine(t.Text + " : " + t.Url); return true; });
Is there a way to authenticate a windows authenticated user in Html Agility Pack?
No. Html Agility Pack is only a parser; its own description makes that clear:
What is exactly the Html Agility Pack (HAP)?
This is an agile HTML parser that builds a read/write DOM and supports plain XPATH or XSLT.
It is a .NET code library that allows you to parse "out of the web" HTML files. The parser is very tolerant with "real world" malformed HTML. The object model is very similar to what System.Xml proposes, but for HTML documents (or streams).
Html Agility Pack now supports LINQ to Objects (via a LINQ to XML-like interface).
Sample applications:
Page fixing or generation: You can fix a page the way you want, modify the DOM, add nodes, copy nodes, well... you name it.
Web scanners: You can easily get to img/src or a/hrefs with a bunch of XPATH queries.
Web scrapers: You can easily scrape any existing web page into an RSS feed, for example, with just an XSLT file serving as the binding. An example of this is provided.
Html Agility Code Examples

Check whether a URL is text/html or another file type such as an image

I am writing my own C# 4.0 WPF web crawler. Currently I am using HtmlAgilityPack to process HTML documents.
Below is how I am downloading the pages:
HtmlWeb hwWeb = new HtmlWeb
{
    AutoDetectEncoding = false,
    OverrideEncoding = Encoding.GetEncoding("iso-8859-9"),
};
hwWeb.UserAgent = lstAgents[GenerateRandomValue.GenerateRandomValueMin(irAgentsCount, 0)];
hwWeb.PreRequest = OnPreRequest;

HtmlDocument hdMyDoc = hwWeb.Load(srPageUrl);

private static bool OnPreRequest(HttpWebRequest request)
{
    // Follow redirects automatically.
    request.AllowAutoRedirect = true;
    return true;
}
Now, my question is: I want to be able to determine whether a given URL is text/html (crawlable content) or some other type such as an image or PDF. How can I do that?
Thank you very much for the answers.
C# 4.0, WPF application
Rather than relying on HtmlAgilityPack to download the page for you, you can download it with HttpWebRequest; the resulting HttpWebResponse has a ContentType property you can check. This would allow you to perform your check before attempting to parse the content.
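A minimal sketch of that idea, assuming the server supports HEAD requests: ask for the headers only, then decide whether the body is worth downloading and parsing.
using System;
using System.Net;

static bool IsHtml(string url)
{
    HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
    request.Method = "HEAD"; // headers only, no body is transferred
    using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
    {
        // ContentType is e.g. "text/html; charset=utf-8" for pages
        return response.ContentType.StartsWith("text/html", StringComparison.OrdinalIgnoreCase);
    }
}
Then only crawlable pages go to the parser: if (IsHtml(srPageUrl)) { hdMyDoc = hwWeb.Load(srPageUrl); }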
You want to read the Content-Type from the response header. From my experience with Html Agility Pack, I do not think it can be done with the pack alone.
I've never used Html Agility Pack, but I went ahead and looked at the documentation.
I see that you're setting the PreRequest field on the HtmlWeb object to a PreRequestHandler delegate. There's also a PostResponse field that takes a PostResponseHandler delegate. It looks like the HtmlWeb object will pass that delegate the actual response it gets from the server, in the form of an HttpWebResponse object.
However, when your code in that delegate finishes, it looks like the Agility Pack will continue to do whatever it would've done. Does it throw an exception when it encounters non-HTML? You may have to throw your own exception from your PostResponse function and catch it when you call Load().
As I said, I didn't try any of this. Hope it gets you started in the right direction.
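A sketch of that suggestion, assuming the two-argument PostResponseHandler signature described above; here the captured Content-Type is checked after Load returns rather than by throwing.
using System.Net;
using HtmlAgilityPack;

HtmlWeb web = new HtmlWeb();
string contentType = null;

// Capture the Content-Type header as the response comes back.
web.PostResponse = delegate(HttpWebRequest request, HttpWebResponse response)
{
    contentType = response.ContentType;
};

HtmlDocument doc = web.Load(srPageUrl); // srPageUrl as in the question
if (contentType != null && !contentType.StartsWith("text/html"))
{
    // Not HTML: skip this document, or raise your own exception here.
}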
