How to parse a Tumblr search results page? - C#

There is a Tumblr page with the search results, e.g. https://www.tumblr.com/search/fruit+apple
I need to scrape at least 10 results from it and parse them. How can I do that?
It seems like the Tumblr API doesn't have an appropriate method, and the
HtmlWeb web = new HtmlWeb();
HtmlDocument doc = web.Load("https://www.tumblr.com/search/fruit+apple");
Console.WriteLine(doc.DocumentNode.OuterHtml); //<--it freezes here
method from HtmlAgilityPack doesn't work with this URL (or I'm doing something wrong).
Is it possible to get the search results? Please help (or say that it's impossible). Thanks in advance.
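In case it helps with the hang itself: the call may simply be waiting on a slow response. A small sketch that shortens the timeout and sends a browser-like User-Agent via PreRequest, so Load() fails fast and you can at least see what Tumblr actually returns (note that HtmlAgilityPack does not execute JavaScript, so results injected client-side still won't appear in the HTML):
HtmlWeb web = new HtmlWeb();
web.UserAgent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"; // look like a browser
web.PreRequest = request =>
{
    request.Timeout = 15000; // fail after 15 s instead of appearing to freeze
    return true;
};

HtmlDocument doc = web.Load("https://www.tumblr.com/search/fruit+apple");
Console.WriteLine(doc.DocumentNode.OuterHtml);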

Related

Extract HTML element attribute value with Html Agility Pack

I need to retrieve a form anti-forgery token from an HTML page.
To do so, I'm using the Html Agility Pack, but I'm fairly new to it.
This is my code:
var page = new HtmlDocument();
page.LoadHtml(htmlPage);
var tokenNode = page.DocumentNode.SelectSingleNode("/html/body/div[3]/div[2]/form/input").Attributes["value"].Value;
The 'tokenNode' variable is returning null.
I've managed to track down my problem to this method:
page.DocumentNode.SelectSingleNode("/html/body/div[3]/div[2]/form/input");
If I simply use page.DocumentNode.SelectSingleNode("/html/body/div[3]") it returns a value. However, when I add the second div to my XPath, it starts returning null.
What am I missing here?
Edit: I got the XPath using Chrome developer tools.
Edit 2: After all, the problem was in the XPath I got from Chrome.
TL;DR: The HTML in the browser was different from the one my HTTP request retrieved, so the XPath was wrong.
Here's a more thorough explanation
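A quick way to confirm that kind of mismatch is to save the HTML your request actually received and inspect it; any XPath copied from the browser has to be checked against that file, not against what Chrome's developer tools show:
var page = new HtmlDocument();
page.LoadHtml(htmlPage);

// Dump the HTML the server actually returned so it can be compared
// with the DOM shown in the browser's developer tools.
System.IO.File.WriteAllText("retrieved.html", page.DocumentNode.OuterHtml);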
To get the anti-forgery token from a page, you could just call the GetElementbyId method, passing the element's id.
For example:
var page = new HtmlDocument();
page.LoadHtml(htmlPage);
string token = page.GetElementbyId("__RequestVerificationToken").GetAttributeValue("value", "");
You don't need to walk the nested path.

C# ASP.NET: Using WebClient, is there a way to get a web page's rendered HTML?

Is there a way to get the fully rendered html of a web page using WebClient instead of the page source? I'm trying to scrape some data from the page's html. My current code is like this:
WebClient client = new WebClient();
var result = client.DownloadString("https://somepageoutthere.com/");
//using CsQuery
CQ dom = result;
var someElementHtml = dom["body > main"];
WebClient will only return the source of the URL you requested. It will not run any JavaScript on the page (which runs on the client), so if JavaScript is changing the page DOM in any way, you will not get that through WebClient.
You are better off using some other tools. Look for those that will render the HTML and JavaScript in the page.
I don't know what you mean by "fully rendered", but if you mean "with all data loaded by AJAX calls", the answer is: no, you can't.
The data that is not present in the initial HTML page is loaded through JavaScript in the browser; WebClient has no idea what JavaScript is and cannot interpret it, only browsers do.
To get this kind of data, you need to identify these calls (if you don't know the URL of the data web service, you can use a tool like Fiddler), simulate/replay them from your application, and then, if successful, get the response data and extract what you need from it (this will be easy if the data comes as JSON, and more tricky if it comes as HTML).
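As a rough sketch of that replay approach (the endpoint URL and JSON shape below are made up; you would substitute whatever request Fiddler shows the page making, and this assumes the Json.NET package for parsing):
using (var client = new WebClient())
{
    // Mimic the browser request captured in Fiddler.
    client.Headers[HttpRequestHeader.Accept] = "application/json";

    // Hypothetical data endpoint discovered with Fiddler - replace with the real one.
    string json = client.DownloadString("https://somepageoutthere.com/api/items?page=1");

    // Extract the fields you need from the JSON payload (Json.NET).
    var items = JObject.Parse(json)["items"];
    foreach (var item in items)
        Console.WriteLine((string)item["title"]);
}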
You're better off using http://html-agility-pack.net
It has all the functionality to scrape web data, and there is good help on the site.

Check whether a URL is text/html or another file type such as an image

I am writing my own web crawler in C# 4.0 (a WPF application). Currently I am using HtmlAgilityPack to process HTML documents.
Below is the way I am downloading the pages:
HtmlWeb hwWeb = new HtmlWeb
{
    AutoDetectEncoding = false,
    OverrideEncoding = Encoding.GetEncoding("iso-8859-9"),
    UserAgent = lstAgents[GenerateRandomValue.GenerateRandomValueMin(irAgentsCount, 0)],
    PreRequest = OnPreRequest
};

HtmlDocument hdMyDoc = hwWeb.Load(srPageUrl);

private static bool OnPreRequest(HttpWebRequest request)
{
    request.AllowAutoRedirect = true;
    return true;
}
Now my question: I want to be able to determine whether a given URL is text/html (crawlable content) or some other type such as an image or a PDF. How can I do that?
Thank you very much for the answers.
C# 4.0, WPF application
Rather than relying on HtmlAgilityPack to download the page for you, you can download it with HttpWebRequest, whose HttpWebResponse exposes a property you can check. This allows you to perform your check before attempting to parse the content.
You want to read the Content-Type in the response header. From my experience with HtmlAgilityPack, I do not think it can be done with the library itself.
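A minimal sketch of that idea (srPageUrl is the variable from the question; a HEAD request is used so the body is only downloaded once the type is known to be HTML):
var request = (HttpWebRequest)WebRequest.Create(srPageUrl);
request.Method = "HEAD"; // headers only; fall back to GET if the server rejects HEAD

using (var response = (HttpWebResponse)request.GetResponse())
{
    // ContentType is e.g. "text/html; charset=utf-8", "image/png", "application/pdf"
    bool isHtml = response.ContentType.StartsWith("text/html", StringComparison.OrdinalIgnoreCase);

    if (isHtml)
    {
        // Only now download and parse the page with HtmlAgilityPack.
        HtmlDocument hdMyDoc = new HtmlWeb().Load(srPageUrl);
        // ... crawl as usual
    }
}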
I've never used html agility pack, but I went ahead and looked at the documentation.
I see that you're setting the PreRequest field on the HtmlWeb object to a PreRequestHandler delegate. There's also a PostResponse field that takes a PostResponseHandler delegate. It looks like the HtmlWeb object will pass that delegate the actual response it gets from the server, in the form of a HttpWebResponse object.
However, when your code in that delegate finishes, it looks like the Agility Pack will continue to do whatever it would have done. Does it throw an exception when it encounters non-HTML? You may have to throw your own exception from your PostResponse function and catch it when you call Load().
As I said, I didn't try any of this. Hope it gets you started in the right direction.
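A rough, untested sketch of that suggestion (the PostResponse delegate signature shown here matches the classic HtmlAgilityPack builds, which pass the HttpWebRequest and HttpWebResponse; it may differ in newer versions):
HtmlWeb hwWeb = new HtmlWeb();
hwWeb.PreRequest = OnPreRequest;
hwWeb.PostResponse = (request, response) =>
{
    // Bail out before the library tries to parse a non-HTML body.
    if (!response.ContentType.StartsWith("text/html", StringComparison.OrdinalIgnoreCase))
        throw new InvalidOperationException("Not HTML: " + response.ContentType);
};

try
{
    HtmlDocument hdMyDoc = hwWeb.Load(srPageUrl);
    // ... process the document
}
catch (InvalidOperationException)
{
    // Skip this URL - it was an image, PDF, or some other non-crawlable type.
}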

Crawling/scraping search-form-based webpages

I want to crawl/scrape a webpage which has a form
To be precise, the following is the URL:
http://lafayetteassessor.com/propertysearch.cfm
The problem is, I want to perform a search and save the result page.
My search string will always give a unique page, so the result count won't be a problem.
The search there doesn't put the query in the URL (unlike, e.g., a Google search URL, which contains the search parameters). How can I submit a search from the starting page (as above) and get the result page?
Please give me some idea.
I am using C#/.NET.
If you look at the forms on that page, you will notice that they use the POST method, rather than the GET method. As I'm sure you know, GET forms pass their parameters as part of the URL, e.g. mypage?arg1=value&arg2=value
However, for POST requests, you need to pass the parameters as the request body. It takes the same format, it's just passed in differently. To do this, use code similar to this:
HttpWebRequest myRequest = (HttpWebRequest)WebRequest.Create(theURL);
myRequest.Method = "POST";
using (TextWriter body = new StreamWriter(myRequest.GetRequestStream())) {
    body.Write("arg1=value1&arg2=value2");
}
WebResponse theResponse = myRequest.GetResponse();
//do stuff with the response
Don't forget that you still need to escape the arguments, etc.
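For instance, Uri.EscapeDataString (or HttpUtility.UrlEncode, if System.Web is referenced) takes care of the escaping; the parameter names here are just placeholders:
string search = "fruit & vegetables";
string body = "arg1=" + Uri.EscapeDataString(search) +
              "&arg2=" + Uri.EscapeDataString("value2");
// body is now "arg1=fruit%20%26%20vegetables&arg2=value2"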

Page posting issue when screen scraping

I am working on screen scraping and have done it successfully on three websites; I have an issue with the last one.
Here is my URL. When I hit it with my parameter, it shows the result on the next page: it simply posts to another page and shows the result fine there.
Here is My Test
However, when I hit it from my application, I don't have an option to post, so it only fetches the HTML of the requested page, i.e. my above-mentioned test link, which actually has the parameter in the URL to get the result.
How can I handle this situation?
Please give me a hint.
Thanks
Here is my C# code; I am using HtmlAgilityPack:
string url = "http://mysampleURL";
HtmlWeb hw = new HtmlWeb();
HtmlDocument doc = hw.Load(url);
Use the WebClient class to post the form of the first page with the expected input values. The input values can be found in the source of the first page, but it's also possible to capture them using Fiddler, which is, imho, a great tool for these scenarios.
Example:
NameValueCollection values = new NameValueCollection();
values.Add("action","hotelPackageWizard#searchHotelOnly");
values.Add("packageType","HOTEL_ONLY");
// etc..
WebClient webclient = new WebClient();
webclient.Headers.Add("Content-Type","application/x-www-form-urlencoded");
byte[] responseArray = webclient.UploadValues("http://www.expedia.com/Hotels?rfrr=-905&","POST", values);
string response = System.Text.Encoding.ASCII.GetString(responseArray);
If the resource requires a POST, then you MUST submit a POST.
This is a fairly simple task. Here is an example from Rick Strahl's blog. The code is a bit rustic but works and will get you heading in the right direction:
string lcUrl = "http://www.west-wind.com/testpage.wwd";
HttpWebRequest loHttp = (HttpWebRequest)WebRequest.Create(lcUrl);

// *** Send any POST data
string lcPostData =
    "Name=" + HttpUtility.UrlEncode("Rick Strahl") +
    "&Company=" + HttpUtility.UrlEncode("West Wind ");

loHttp.Method = "POST";
byte[] lbPostBuffer = System.Text.Encoding.GetEncoding(1252).GetBytes(lcPostData);
loHttp.ContentLength = lbPostBuffer.Length;

Stream loPostData = loHttp.GetRequestStream();
loPostData.Write(lbPostBuffer, 0, lbPostBuffer.Length);
loPostData.Close();

HttpWebResponse loWebResponse = (HttpWebResponse)loHttp.GetResponse();

Encoding enc = System.Text.Encoding.GetEncoding(1252);
StreamReader loResponseStream = new StreamReader(loWebResponse.GetResponseStream(), enc);

string lcHtml = loResponseStream.ReadToEnd();
loWebResponse.Close();
loResponseStream.Close();
For screen-scraping tasks that involve posting forms (such as log-ins), maintaining cookies, and taking care of XSRF tokens, one solution is to use cURL, but it is not easy.
I then explored Selenium, and I love it. There are two parts: 1) install Selenium IDE (works only in Firefox) and 2) install the Selenium RC Server.
After starting Selenium IDE, go to the site that you are trying to automate and start recording the events you perform on it. Think of it as recording a macro in the browser. Afterwards, you get the code output for the language you want.
Just so you know, Browsermob uses Selenium for load testing and for automating tasks in the browser.
I've uploaded a PPT that I made a while back; it should save you a good amount of time: http://www.4shared.com/get/tlwT3qb_/SeleniumInstructions.html
In the above link, select the regular download option.
I spent a good amount of time figuring this out, so I thought it might save somebody else's time.
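For what it's worth, the same record-and-replay idea with the current Selenium WebDriver C# bindings looks roughly like this (the URL and element names are placeholders; inspect the real form to get them):
using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;

using (IWebDriver driver = new ChromeDriver())
{
    driver.Navigate().GoToUrl("http://mysampleURL");

    // Hypothetical field and button names - use the real ones from the form.
    driver.FindElement(By.Name("searchField")).SendKeys("my parameter");
    driver.FindElement(By.Name("submitButton")).Click();

    // The browser has run any JavaScript, so this is the rendered result page.
    string resultHtml = driver.PageSource;
}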
