Page posting issue when working on screen scraping - C#

I am working on screen scraping and have done it successfully on three websites, but I have an issue with the last one.
Here is my URL: when I hit it with my parameters, it posts to another page and shows the result there just fine.
Here is My Test
However, when I hit it from my application, I don't have an option to post, so it only fetches the HTML of the requested page; that is obviously my test link above, which actually has the parameters in the URL to get the result.
How can I handle this situation?
Please give me a hint.
Thanks
Here is my C# code; I am using HtmlAgilityPack:
HtmlWeb hw = new HtmlWeb();
string url = "http://mysampleURL";
HtmlDocument doc = hw.Load(url);

Use the WebClient class to post the form of the first page with the expected input values. The input values can be found in the source of the first page, but it's also possible to capture them using Fiddler, which is, in my opinion, a great tool for these scenarios.
Example:
using System.Collections.Specialized;
using System.Net;

NameValueCollection values = new NameValueCollection();
values.Add("action", "hotelPackageWizard#searchHotelOnly");
values.Add("packageType", "HOTEL_ONLY");
// etc..
WebClient webclient = new WebClient();
webclient.Headers.Add("Content-Type", "application/x-www-form-urlencoded");
byte[] responseArray = webclient.UploadValues("http://www.expedia.com/Hotels?rfrr=-905&", "POST", values);
string response = System.Text.Encoding.ASCII.GetString(responseArray);
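Since you're already using HtmlAgilityPack, you can then feed the POSTed response straight into an HtmlDocument instead of re-fetching the page with HtmlWeb.Load. A minimal sketch, reusing the response string from the snippet above:
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(response); // parse the result page returned by the POST
// Query it as usual, e.g.:
// HtmlNodeCollection rows = doc.DocumentNode.SelectNodes("//table//tr");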

If the resource requires a POST, then you MUST submit a POST.
This is a fairly simple task. Here is an example from Rick Strahl's blog. The code is a bit rustic but it works and will get you heading in the right direction:
using System.IO;
using System.Net;
using System.Text;
using System.Web; // for HttpUtility

string lcUrl = "http://www.west-wind.com/testpage.wwd";
HttpWebRequest loHttp = (HttpWebRequest)WebRequest.Create(lcUrl);

// *** Send any POST data
string lcPostData =
    "Name=" + HttpUtility.UrlEncode("Rick Strahl") +
    "&Company=" + HttpUtility.UrlEncode("West Wind ");
loHttp.Method = "POST";
byte[] lbPostBuffer = Encoding.GetEncoding(1252).GetBytes(lcPostData);
loHttp.ContentLength = lbPostBuffer.Length;
Stream loPostData = loHttp.GetRequestStream();
loPostData.Write(lbPostBuffer, 0, lbPostBuffer.Length);
loPostData.Close();

HttpWebResponse loWebResponse = (HttpWebResponse)loHttp.GetResponse();
Encoding enc = Encoding.GetEncoding(1252);
StreamReader loResponseStream = new StreamReader(loWebResponse.GetResponseStream(), enc);
string lcHtml = loResponseStream.ReadToEnd();
loWebResponse.Close();
loResponseStream.Close();
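If you are on a newer framework, the same POST can be written more compactly with HttpClient. Here is a rough, untested equivalent of the snippet above; FormUrlEncodedContent takes care of the URL encoding for you:
using System.Collections.Generic;
using System.Net.Http;
using System.Threading.Tasks;

static async Task<string> PostFormAsync()
{
    using (var client = new HttpClient())
    {
        // Same form fields as the example above; the content class handles encoding
        var form = new FormUrlEncodedContent(new Dictionary<string, string>
        {
            { "Name", "Rick Strahl" },
            { "Company", "West Wind " }
        });
        HttpResponseMessage resp = await client.PostAsync("http://www.west-wind.com/testpage.wwd", form);
        return await resp.Content.ReadAsStringAsync(); // the resulting HTML
    }
}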

For screen scraping tasks that involve posting forms (such as log-ins), maintaining cookies, and taking care of XSRF tokens, one solution is to use cURL. But it is not easy.
I then explored Selenium and I love it. There are two things: 1) install Selenium IDE (works only in Firefox); 2) install Selenium RC Server.
After starting Selenium IDE, go to the site that you are trying to automate and start recording the events you perform on the site. Think of it as recording a macro in the browser. Afterwards, you get the code output for the language you want (see the sketch below this answer).
Just so you know, Browsermob uses Selenium for load testing and for automating tasks in the browser.
I've uploaded a PPT that I made a while back. It should save you a good amount of time: http://www.4shared.com/get/tlwT3qb_/SeleniumInstructions.html
In the above link select the regular download option.
I spent a good amount of time figuring it out, so I thought it might save somebody else's time.
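For a rough idea of what the exported C# ends up looking like, here is a minimal hand-written sketch using the modern WebDriver API rather than the RC API mentioned above; the URL, field name, and button name are hypothetical:
using OpenQA.Selenium;
using OpenQA.Selenium.Firefox;

using (IWebDriver driver = new FirefoxDriver())
{
    driver.Navigate().GoToUrl("http://example.com/search");        // placeholder URL
    driver.FindElement(By.Name("q")).SendKeys("my search string"); // hypothetical field name
    driver.FindElement(By.Name("go")).Click();                     // hypothetical submit button
    string html = driver.PageSource; // the result page, after the form post
}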

Related

HTTPWebRequest acts differently from same request sent through browser

I am working with the shift4shop web API. These guys used to be known as threeDCart, if that helps anyone. It's an eCommerce platform.
We are trying to apply a promotion code to an open cart.
Support has verified there is no API way to do that.
There is a URL that will apply the promotion. This is often emailed to customers so they can apply the promo if they choose to.
We can paste the correct URL into Chrome, Brave, Edge, or Firefox and it correctly applies the promotion.
We used private tabs for the different browser tests and the browsers were 'cold': we launched the browser and immediately entered the URL.
We think this eliminates the possibility that cookies are necessary.
https://www.mywebsite.com/continue_order.asp?orderkey=CDC886A7O4Srgyn278668&ApplyPromo=40pro
However, when I try to do this in C#, I get a response that is redirected to a page that says 'The cart is empty'.
The promotion is not applied.
Here is the C# code I am using:
using System.IO;
using System.Net;

// I really create this using my data, but this is the resulting URL
string url = "https://www.mywebsite.com/continue_order.asp?orderkey=CDC886A7O4Srgyn278668&ApplyPromo=40pro";
HttpWebRequest req = (HttpWebRequest)WebRequest.Create(url);
HttpWebResponse response = (HttpWebResponse)req.GetResponse();
string result = "";
using (StreamReader rdr = new StreamReader(response.GetResponseStream()))
{
    result = rdr.ReadToEnd();
}
You can also call "view_cart.asp" with the same parameters and the browsers will cause the promo to be applied.
I have tried leaving the request method unset and setting it to GET and get.
There has to be something about the request settings that is preventing this from working.
I do not know what else to try.
Any thoughts are appreciated.
As per shift4shop support, continue_order.asp returns a 302.
The browsers land on continue_order.asp and process that page.
They then continue on to view_order.asp.
The two pages together perform functionality that you cannot get by just calling continue_order.asp.
Thanks to Savoy with shift4Shop for helping on that.
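Building on that, one likely fix (untested; the URL is the question's placeholder) is to give the request a CookieContainer so the session cookie set by continue_order.asp survives the 302 to view_order.asp:
using System.IO;
using System.Net;

string url = "https://www.mywebsite.com/continue_order.asp?orderkey=CDC886A7O4Srgyn278668&ApplyPromo=40pro";
HttpWebRequest req = (HttpWebRequest)WebRequest.Create(url);
req.CookieContainer = new CookieContainer(); // without this, cookies set during the redirect are dropped
req.AllowAutoRedirect = true;                // follow through to view_order.asp, like a browser does
string result;
using (HttpWebResponse response = (HttpWebResponse)req.GetResponse())
using (StreamReader rdr = new StreamReader(response.GetResponseStream()))
{
    result = rdr.ReadToEnd(); // should now be the processed view_order.asp page
}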

Parsing web site using HtmlAgilityPack does not return values as seen on browser

When parsing the site https://holfuy.com/en/weather/1284, HtmlAgilityPack returns "-" for the relevant data.
string url = "https://holfuy.com/en/weather/1284";
var web = new HtmlWeb();
web.PreRequest += request =>
{
    request.CookieContainer = new System.Net.CookieContainer();
    return true;
};
HtmlDocument doc = web.Load(url);
// Note: the XPath attribute selector is @id, not #id
string data = doc.DocumentNode.SelectNodes("//*[@id=\"j_pressure\"]")[0].InnerText;
Console.WriteLine(data);
What is the reason behind this?
It seems that the data is dynamically loaded into the page, and if you need to parse it you need to hook in a real browser, for example through Selenium with one of its available drivers, or, if you don't want to include all of Selenium, drive a headless browser like PhantomJS directly. Once you do that, just set a small delay for the data to render, load the page, and parse.
You can see more information here:
Running Scripts in HtmlAgilityPack
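For this particular page, a sketch of the Selenium route might look like this, assuming the Selenium.WebDriver and Selenium.Support packages plus a local ChromeDriver; the element id comes from the question's XPath:
using System;
using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;
using OpenQA.Selenium.Support.UI;

using (IWebDriver driver = new ChromeDriver())
{
    driver.Navigate().GoToUrl("https://holfuy.com/en/weather/1284");
    // Wait until the dynamically loaded value replaces the "-" placeholder
    var wait = new WebDriverWait(driver, TimeSpan.FromSeconds(10));
    wait.Until(d => d.FindElement(By.Id("j_pressure")).Text != "-");
    Console.WriteLine(driver.FindElement(By.Id("j_pressure")).Text);
}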

Scraping multiple lists from a website

I'm currently working on a web scraper for a website that displays a table of data. The problem I'm running into is that the website doesn't sort my searches by state on the first search; I have to do that through the drop-down menu on the second page once it loads. I load the first page with what I believe to be a WebClient POST request. I get the proper HTML response and can parse through it, but when I try to load the more filtered search, the HTML I get back is incorrect compared to the HTML I see in the Chrome developer tab.
Here's my code:
// The website I'm looking at.
public string url = "https://www.missingmoney.com/Main/Search.cfm";
// The POST body for the working search, which doesn't filter by state
public string myPara1 = "hJava=Y&SearchFirstName=Jacob&SearchLastName=Smith&HomeState=MN&frontpage=1&GO.x=19&GO.y=18&GO=Go";
// The POST body that also filters by state, but doesn't return the correct html that I would need to parse
public string myPara2 = "hJava=Y&SearchLocation=1&SearchFirstName=Jacob&SearchMiddleName=&SearchLastName=Smith&SearchCity=&SearchStateID=MN&GO.x=17&GO.y=14&GO=Go";
// I save the two html responses in these
public string htmlResult1;
public string htmlResult2;

public void LoadHtml(string firstName, string lastName)
{
    using (WebClient client = new WebClient())
    {
        client.Headers[HttpRequestHeader.ContentType] = "application/x-www-form-urlencoded";
        htmlResult1 = client.UploadString(url, myPara1);
        htmlResult2 = client.UploadString(url, myPara2);
    }
}
I'm just trying to figure out why the first set of parameters works but the second one doesn't.
Thank you for the time you spent looking at this!!!
I simply forgot to add the cookie to the new search. Using Google Chrome's developer tools or Fiddler you can see the web traffic. All I needed to do was add
client.Headers.Add(HttpRequestHeader.Cookie, "cookie");
to my code right before the upload. Doing so gave me the right HTML response and I can now parse through my data.
@derloopkat pointed it out; credit to that individual!
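If you would rather not hard-code the cookie captured in Fiddler, one possible variation is to grab the Set-Cookie header from the first response and replay it on the second request. A simplified sketch reusing the fields from the question (it naively resends the whole header value; a robust version would strip cookie attributes such as Path and Expires):
using (WebClient client = new WebClient())
{
    client.Headers[HttpRequestHeader.ContentType] = "application/x-www-form-urlencoded";
    htmlResult1 = client.UploadString(url, myPara1);

    // WebClient clears request headers after each call, so re-add them,
    // along with the session cookie issued by the first search.
    string setCookie = client.ResponseHeaders[HttpResponseHeader.SetCookie];
    client.Headers[HttpRequestHeader.ContentType] = "application/x-www-form-urlencoded";
    client.Headers[HttpRequestHeader.Cookie] = setCookie;
    htmlResult2 = client.UploadString(url, myPara2);
}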

How to retrieve HTML Page without getting redirected?

I want to scrape the HTML of a website. When I access this website with my browser (no matter whether it is Chrome or Firefox), I have no problem accessing the website and its HTML.
When I try to fetch and parse the HTML with C# using classes like HttpWebRequest and HtmlAgilityPack, the website redirects me to another website, so I end up parsing the HTML of the redirected site.
Any idea how to solve this problem?
I thought the site recognises my program as a program and redirects immediately, so I tried using Selenium with the GoogleDriver and FireFoxDriver, but no luck either; I get redirected immediately.
The Website: https://www.jodel.city/7700#!home
private void bt_load_Click(object sender, EventArgs e)
{
    var url = @"https://www.jodel.city/7700#!home";
    var req = (HttpWebRequest)WebRequest.Create(url);
    req.AllowAutoRedirect = false;
    // req.Referer = "http://www.muenchen.de/";
    var resp = req.GetResponse();
    StreamReader sr = new StreamReader(resp.GetResponseStream());
    string returnedContent = sr.ReadToEnd();
    Console.WriteLine(returnedContent);
    return;
}
And of course, cookies are to blame again, because cookies are great and amazing.
So, let's look at what happens in Chrome the first time you visit the site (I went to https://www.jodel.city/7700#!home): yes, I got a 302 redirect, but the server also told me to set a __cfduid cookie (twice, actually).
When you visit the site again, you are correctly let into the site.
Notice how this time the __cfduid cookie was sent along? That's the key here.
Your C# code needs to:
1) go to the site once, get redirected, but grab the cookie value from the response header;
2) go BACK to the site with the correct cookie value in the request header.
You can go to the first link in this post to see an example of how to set cookie values for requests.
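A minimal sketch of that two-step dance, sharing a CookieContainer between the two requests so the __cfduid value from the first response is sent automatically on the second:
using System;
using System.IO;
using System.Net;

var cookies = new CookieContainer();
var url = @"https://www.jodel.city/7700#!home";

// First visit: we get redirected, but the container captures the __cfduid cookie
var first = (HttpWebRequest)WebRequest.Create(url);
first.CookieContainer = cookies;
using (first.GetResponse()) { }

// Second visit: the stored cookie is sent along, so we are let into the site
var second = (HttpWebRequest)WebRequest.Create(url);
second.CookieContainer = cookies;
using (var resp = second.GetResponse())
using (var sr = new StreamReader(resp.GetResponseStream()))
{
    Console.WriteLine(sr.ReadToEnd());
}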

Crawling / scraping a search-form-based webpage

I want to crawl/scrape a webpage which has a form.
To be precise, the URL is:
http://lafayetteassessor.com/propertysearch.cfm
The problem is, I want to make a search and save the resulting page.
My search string will always give a unique page, so the result count won't be a problem.
The search there doesn't put anything in the URL (unlike, e.g., a Google search, whose URL contains the search parameters). How can I submit a search from the starting page (above) and get the result page?
Please give me some ideas.
I am using C#/.NET.
If you look at the forms on that page, you will notice that they use the POST method rather than the GET method. As I'm sure you know, GET forms pass their parameters as part of the URL, e.g. mypage?arg1=value&arg2=value.
However, for POST requests, you need to pass the parameters in the request body. It takes the same format; it's just passed in differently. To do this, use code similar to this:
HttpWebRequest myRequest = (HttpWebRequest)WebRequest.Create(theURL);
myRequest.Method = "POST";
myRequest.ContentType = "application/x-www-form-urlencoded"; // tell the server it's a form post
using (TextWriter body = new StreamWriter(myRequest.GetRequestStream()))
{
    body.Write("arg1=value1&arg2=value2");
}
WebResponse theResponse = myRequest.GetResponse();
// do stuff with the response
Don't forget that you still need to escape the arguments, etc.
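For the escaping, HttpUtility.UrlEncode (in System.Web) does the job. A small sketch; the field names here are hypothetical, so inspect the actual form's input names on the page:
using System.Web; // reference System.Web for HttpUtility

// Hypothetical field names; check the form's <input name="..."> attributes
string body = "OwnerName=" + HttpUtility.UrlEncode("Smith, John") +
              "&TaxYear=" + HttpUtility.UrlEncode("2012");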
