I'm currently working on a web scraper for a website that displays a table of data. The problem I'm running into is that the website doesn't filter my searches by state on the first search; I have to do it through the drop-down menu on the second page when it loads. I load the first page with what I believe to be a WebClient POST request. I get the proper HTML response and can parse through it, but when I load the more filtered search, the HTML I get back is incorrect compared to what I see in the Chrome developer tools.
Here's my code:
// The website I'm looking at.
public string url = "https://www.missingmoney.com/Main/Search.cfm";

// The POST body for the working search; it doesn't filter by state.
public string myPara1 = "hJava=Y&SearchFirstName=Jacob&SearchLastName=Smith&HomeState=MN&frontpage=1&GO.x=19&GO.y=18&GO=Go";

// The POST body that also filters by state, but doesn't return the correct HTML that I need to parse.
public string myPara2 = "hJava=Y&SearchLocation=1&SearchFirstName=Jacob&SearchMiddleName=&SearchLastName=Smith&SearchCity=&SearchStateID=MN&GO.x=17&GO.y=14&GO=Go";

// I save the two HTML responses in these.
public string htmlResult1;
public string htmlResult2;

public void LoadHtml(string firstName, string lastName)
{
    using (WebClient client = new WebClient())
    {
        client.Headers[HttpRequestHeader.ContentType] = "application/x-www-form-urlencoded";
        htmlResult1 = client.UploadString(url, myPara1);
        htmlResult2 = client.UploadString(url, myPara2);
    }
}
I'm just trying to figure out why the request works when I pass in the first set of parameters but not the second.
Thank you for the time you spent looking at this!!!
I simply forgot to add the cookie to the new search. Using Google Chrome or Fiddler you can see the web traffic. All I needed to do was add
client.Headers.Add(HttpRequestHeader.Cookie, "cookie");
to my code right before the upload. Doing so gave me the right HTML response, and I can now parse through my data.
@derloopkat pointed it out, credits to that individual!!!
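For reference, a minimal sketch of capturing the cookie from the first response instead of hard-coding it. This assumes the server sets it via a standard Set-Cookie header; note that real Set-Cookie values carry attributes (Path, Expires, etc.) that you would strip down to name=value pairs before echoing them back:

public void LoadHtml(string firstName, string lastName)
{
    using (WebClient client = new WebClient())
    {
        client.Headers[HttpRequestHeader.ContentType] = "application/x-www-form-urlencoded";
        htmlResult1 = client.UploadString(url, myPara1);

        // Whatever cookie the server set on the first response...
        string cookie = client.ResponseHeaders[HttpResponseHeader.SetCookie];

        // ...gets sent back with the second, filtered search.
        // (WebClient resets some headers between requests, so set both again to be safe.)
        client.Headers[HttpRequestHeader.ContentType] = "application/x-www-form-urlencoded";
        client.Headers.Add(HttpRequestHeader.Cookie, cookie);
        htmlResult2 = client.UploadString(url, myPara2);
    }
}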
I want to scrape the HTML of a website. When I access this website with my browser (no matter whether Chrome or Firefox), I have no problem accessing the website and its HTML.
When I try to fetch the HTML with C# using HttpWebRequest and HtmlAgilityPack, the website redirects me to another site, so I end up parsing the HTML of the redirected site.
Any idea how to solve this problem?
I thought the site recognized my program as a program and redirected immediately, so I tried using Selenium with a ChromeDriver and a FirefoxDriver, but no luck; I get redirected immediately.
The Website: https://www.jodel.city/7700#!home
private void bt_load_Click(object sender, EventArgs e)
{
    var url = @"https://www.jodel.city/7700#!home";
    var req = (HttpWebRequest)WebRequest.Create(url);
    req.AllowAutoRedirect = false;
    // req.Referer = "http://www.muenchen.de/";

    var resp = req.GetResponse();
    StreamReader sr = new StreamReader(resp.GetResponseStream());
    String returnedContent = sr.ReadToEnd();
    Console.WriteLine(returnedContent);
}
And of course, cookies are to blame again, because cookies are great and amazing.
So, let's look at what happens in Chrome the first time you visit the site (I went to https://www.jodel.city/7700#!home):
Yes, I got a 302 redirect, but I was also told by the server to set a __cfduid cookie (twice, actually).
When you visit the site again, you are correctly let in.
Notice how this time the __cfduid cookie was sent along? That's the key here.
Your C# code needs to:
Go to the site once, get redirected, but obtain the cookie value from the response header.
Go BACK to the site with the correct cookie value in the request header.
You can go to the first link in this post to see an example of how to set cookie values for requests.
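As a rough sketch of those two steps, letting a CookieContainer carry the __cfduid value between requests instead of copying headers by hand:

var cookies = new CookieContainer();
var url = @"https://www.jodel.city/7700#!home";

// First visit: we get the 302, but the container records the __cfduid cookie.
var first = (HttpWebRequest)WebRequest.Create(url);
first.CookieContainer = cookies;
first.AllowAutoRedirect = false;
using (first.GetResponse()) { }

// Second visit: the same container sends the cookie back automatically.
var second = (HttpWebRequest)WebRequest.Create(url);
second.CookieContainer = cookies;
using (var resp = second.GetResponse())
using (var sr = new StreamReader(resp.GetResponseStream()))
{
    Console.WriteLine(sr.ReadToEnd());
}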
I'm really new to web development, and I don't really have a good grip on the main concepts of the web. However, I've been tasked with writing an ASP.NET application where users can search documents by querying an external RESTful web service. Requests to this REST service must be authenticated by HTTP Basic authentication.
So far so good: I've been able to query the service using HttpWebRequest and HttpWebResponse, adding the encoded user:pass to the request's authorization header, deserializing the JSON response, and producing a list of strings with URLs to the PDF documents resulting from the search.
So now I'm programmatically adding HyperLink elements to the page with these urls:
foreach (string url in urls) {
    HyperLink link = new HyperLink();
    link.Text = url;
    link.NavigateUrl = url;
    Page.Controls.Add(link);
}
The problem is that requests for these documents have to be authorized with the same Basic HTTP authentication and the same user:pass as when querying the REST service, and since I'm just creating links for the user to click, and not creating any HttpWebRequest objects, I don't know how to authenticate the request that results from a user clicking a link.
Any pointers to how I can accomplish this is very much appreciated. Thanks in advance!
You probably want to do the request server-side, as I think you're already doing, and then show the results embedded in your own pages, or just stream the result directly back to the users.
It's a bit unclear what exactly you need (what the links are, what you show the users, etc.), so this is the best suggestion I can make based on the info you give.
Update:
I would create a HttpHandler (an .ashx file in an ASP.NET project), and link to that, with arguments so you can make the request to the REST service and get the correct file, then stream the data directly back to the visitor. Here's a simple example:
public class DocumentHandler : IHttpHandler {

    public Boolean IsReusable {
        get { return true; }
    }

    public void ProcessRequest(HttpContext context) {
        // TODO: Get the URL of the document for the REST request,
        //       e.g. from context.Request
        // TODO: Make the request to the REST service
        // Some pseudo-code for you:
        context.Response.ContentType = "application/pdf";
        Byte[] buffer = new WebClient().DownloadData(url); // 'url' comes from the TODOs above
        context.Response.OutputStream.Write(buffer, 0, buffer.Length);
        context.Response.End();
    }
}
I hope you can fill in the blanks yourself.
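To make the blanks concrete, here is a hedged sketch of the REST call itself. The query-string argument name doc and the credentials are illustrative; the point is that the handler sends the same Basic authorization header you already use for the search, so the credentials never reach the user's browser:

public void ProcessRequest(HttpContext context) {
    // Hypothetical link format: DocumentHandler.ashx?doc=<url-encoded document URL>
    string documentUrl = context.Request.QueryString["doc"];

    // The same Basic auth header used when querying the REST service.
    string credentials = Convert.ToBase64String(Encoding.ASCII.GetBytes("user:pass"));

    using (WebClient client = new WebClient()) {
        client.Headers[HttpRequestHeader.Authorization] = "Basic " + credentials;
        byte[] buffer = client.DownloadData(documentUrl);

        context.Response.ContentType = "application/pdf";
        context.Response.OutputStream.Write(buffer, 0, buffer.Length);
    }
}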
I'm using C# to download the HTML of a webpage, but when I compare the actual code of the web page with my downloaded code, they are completely different. Here is the code:
public static string getSourceCode(string url) {
    HttpWebRequest req = (HttpWebRequest)WebRequest.Create(url);
    req.Method = "GET";
    using (HttpWebResponse resp = (HttpWebResponse)req.GetResponse())
    using (StreamReader sr = new StreamReader(resp.GetResponseStream(), Encoding.UTF8)) {
        // return the data
        return sr.ReadToEnd();
    }
}
private void button1_Click(object sender, EventArgs e) {
    string url = "http://www.booking.com/hotel/tr/nena.en-gb.html?label=gog235jc-hotel-en-tr-mina-nobrand-tr-com-T002-1;sid=fcc1c6c78f188a42870dcbe1cabf2fb4;dcid=1;origin=disamb;srhash=3938286438;srpos=5";
    string sourceCode = Finder.getSourceCode(url);

    // Here the saved code is completely different from the web page code.
    StreamWriter sw = new StreamWriter("HotelPrice.txt");
    sw.Write(sourceCode);
    sw.Close();

    #region Get Score Value
    int startIndex = sourceCode.IndexOf("<strong id=\"rsc_total\">") + 23;
    sourceCode = sourceCode.Substring(startIndex, 3);
    #endregion
}
Most likely the cause of the difference is that when you use the browser to request the page, it's part of a session, which is not established when you request the same page using WebRequest.
Looking at the URL, it seems the query parameter sid is a session identifier or a nonce of some sort. The page probably verifies it against the actual session id, and when it determines that they are different, it gives you some sort of "Oops... wrong session" response.
In order to mimic the browser's request you will have to make sure you generate the proper request which may need to include one or more of the following:
cookies (previously sent to you by the webserver)
a valid/proper user agent
some specific query parameters (again depending on what the page expects)
potentially a referrer URL
authentication credentials
The best way to determine what you need is to follow a conversation between your browser and the web server serving that page from start to finish, and see exactly which pages are requested, in what order, and what information is passed back and forth. You can accomplish this using Wireshark or Fiddler - both free tools!
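Once you know what the page expects, a rough sketch of supplying those pieces with HttpWebRequest (every value below is a placeholder you would replace with what you captured):

// Reuse one container across requests so cookies set by the server are sent back.
CookieContainer cookies = new CookieContainer();

HttpWebRequest req = (HttpWebRequest)WebRequest.Create(url);
req.CookieContainer = cookies;
req.UserAgent = "Mozilla/5.0 (placeholder browser-like user agent)";
req.Referer = "http://www.booking.com/";                   // placeholder referrer
req.Credentials = new NetworkCredential("user", "pass");   // only if the site requires it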
I ran into the same problem when trying to use HttpWebRequest to crawl a page, and the page used ajax to load all the data I was after. In order to get the ajax calls to occur I switched to the WebBrowser control.
This answer provides an example of how to use the control outside of a WinForms app. You'll want to hook up the browser's DocumentCompleted event before parsing the page. Be warned: this event may fire multiple times before the page is ready to be parsed. You may want to add something like this
if(browser.ReadyState == WebBrowserReadyState.Complete)
to your event handler, to know when the page is completely done loading.
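Putting that together, a minimal sketch assuming a WebBrowser instance named browser (the control needs an STA thread and a message loop to run):

browser.DocumentCompleted += (s, e) =>
{
    // DocumentCompleted can fire once per frame; only parse when the
    // whole page has finished loading.
    if (browser.ReadyState != WebBrowserReadyState.Complete)
        return;

    // The live DOM, after the ajax calls have run.
    string html = browser.Document.Body.OuterHtml;
    // ... parse 'html' here ...
};
browser.Navigate("http://example.com"); // placeholder URL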
I want to crawl/scrape a webpage which has a form.
To be precise, the following is the URL:
http://lafayetteassessor.com/propertysearch.cfm
The problem is, I want to perform a search and save the resulting page.
My search string will always give a unique page, so the result count won't be a problem.
The search there doesn't put its parameters in the URL (unlike, e.g., a Google search, whose URL contains the search parameters). How can I search from the starting page (above) and get the result page?
Please give me some idea.
I am using C#/.NET.
If you look at the forms on that page, you will notice that they use the POST method rather than the GET method. As I'm sure you know, GET forms pass their parameters as part of the URL, e.g. mypage?arg1=value&arg2=value
However, for POST requests, you need to pass the parameters in the request body. It takes the same format; it's just passed differently. To do this, use code similar to this:
HttpWebRequest myRequest = (HttpWebRequest)WebRequest.Create(theURL);
myRequest.Method = "POST";
myRequest.ContentType = "application/x-www-form-urlencoded";
using (TextWriter body = new StreamWriter(myRequest.GetRequestStream())) {
    body.Write("arg1=value1&arg2=value2");
}
WebResponse theResponse = myRequest.GetResponse();
// do stuff with the response
Don't forget that you still need to escape the arguments, etc.
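For instance, a quick sketch of escaping each value before building the body (HttpUtility.UrlEncode would work just as well here):

string body = "arg1=" + Uri.EscapeDataString("value with spaces & ampersands")
            + "&arg2=" + Uri.EscapeDataString("value2");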
I am working on screen scraping and have done it successfully on 3 websites, but I have an issue with the last one.
Here is my URL. When I submit it with my parameter, the site shows the result on the next page: it simply posts to another page and shows the result fine there.
Here is my test.
However, when I request it from my application, since I don't have an option to post there, it only fetches the HTML of the requested page, i.e. the test link mentioned above, which actually has the parameters in the URL to get the result.
How can I handle this situation?
Please give me a hint.
Thanks
Here is my C# code; I am using HtmlAgilityPack:
HtmlWeb hw = new HtmlWeb();
string url = "http://mysampleURL";
HtmlDocument doc = hw.Load(url);
Use the WebClient class for posting the form of the first page with the expected input values. The input values can be found in the source of the first page, but it's also possible to capture them using Fiddler, which is imho a great tool for these scenarios.
Example:
NameValueCollection values = new NameValueCollection();
values.Add("action", "hotelPackageWizard#searchHotelOnly");
values.Add("packageType", "HOTEL_ONLY");
// etc..

WebClient webclient = new WebClient();
webclient.Headers.Add("Content-Type", "application/x-www-form-urlencoded");
byte[] responseArray = webclient.UploadValues("http://www.expedia.com/Hotels?rfrr=-905&", "POST", values);
string response = System.Text.Encoding.ASCII.GetString(responseArray);
If the resource requires a POST, then you MUST submit a POST.
This is a fairly simple task. Here is an example from Rick Strahl's blog. The code is a bit rustic but works and will get you heading in the right direction:
string lcUrl = "http://www.west-wind.com/testpage.wwd";
HttpWebRequest loHttp = (HttpWebRequest)WebRequest.Create(lcUrl);

// *** Send any POST data
string lcPostData =
    "Name=" + HttpUtility.UrlEncode("Rick Strahl") +
    "&Company=" + HttpUtility.UrlEncode("West Wind ");
loHttp.Method = "POST";
byte[] lbPostBuffer = System.Text.Encoding.GetEncoding(1252).GetBytes(lcPostData);
loHttp.ContentLength = lbPostBuffer.Length;

Stream loPostData = loHttp.GetRequestStream();
loPostData.Write(lbPostBuffer, 0, lbPostBuffer.Length);
loPostData.Close();

HttpWebResponse loWebResponse = (HttpWebResponse)loHttp.GetResponse();
Encoding enc = System.Text.Encoding.GetEncoding(1252);
StreamReader loResponseStream = new StreamReader(loWebResponse.GetResponseStream(), enc);

string lcHtml = loResponseStream.ReadToEnd();
loWebResponse.Close();
loResponseStream.Close();
For screen scraping tasks that involve posting forms such as log-ins, maintaining cookies, and taking care of XSRF tokens, one solution is to use cURL. But it is not easy.
I then explored Selenium, and I love it. There are two things: 1) install the Selenium IDE (works only in Firefox); 2) install the Selenium RC server.
After starting the Selenium IDE, go to the site that you are trying to automate and start recording the events you perform on the site. Think of it as recording a macro in the browser. Afterwards, you get the code output in the language you want.
Just so you know, Browsermob uses Selenium for load testing and for automating tasks in the browser.
I've uploaded a ppt that I made a while back. This should save you a good amount of time: http://www.4shared.com/get/tlwT3qb_/SeleniumInstructions.html
In the above link, select the regular download option.
I spent a good amount of time figuring it out, so I thought it might save somebody some time.
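For orientation, here is a minimal sketch using the Selenium WebDriver C# bindings, which have since superseded the RC server mentioned above. The URL and field names are placeholders:

using OpenQA.Selenium;
using OpenQA.Selenium.Firefox;

class LoginScrape
{
    static void Main()
    {
        using (IWebDriver driver = new FirefoxDriver())
        {
            driver.Navigate().GoToUrl("http://example.com/login");    // placeholder URL
            driver.FindElement(By.Name("username")).SendKeys("user"); // placeholder field names
            driver.FindElement(By.Name("password")).SendKeys("pass");
            driver.FindElement(By.Name("submit")).Click();

            // Cookies and XSRF tokens are handled by the real browser session.
            string html = driver.PageSource;
        }
    }
}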