crawling / scraping a search form based webpages

crawling / scraping a search form based webpages - c#

I want to crawl/scrape a webpage which has a form
to be precise following is the URL
http://lafayetteassessor.com/propertysearch.cfm
The problem is, i want to make a search and save the result in a webpage.
my search string will always give a unique page, so result count won't be a problem.
the search over there doesn't search on URL (e.g. google searching url contains parameters to search). How can i search from starting page (as above) and get the result page ?
please give me some idea.
I am using C#/.NET.

If you look at the forms on that page, you will notice that they use the POST method, rather than the GET method. As I'm sure you know, GET forms pass their parameters as part of the URL, eg mypage?arg1=value&arg2=value
However, for POST requests, you need to pass the parameters as the request body. It takes the same format, it's just passed in differently. To do this, use code similar to this:
HttpRequest myRequest = (HttpRequest)WebRequest.Create(theURL);
myRequest.Method = "post";
using(TextWriter body = new StreamWriter(myRequest.GetRequestStream())) {
body.Write("arg1=value1&arg2=value2");
}
WebResponse theResponse = myRequest.GetResponse();
//do stuff with the response
Don't forget that you still need to escape the arguments, etc.

Related

Scraping multiple lists from a website.

I'm currently working on a web scraper for a website that displays a table of data. The problem I'm running into is that the website doesn't sort my searches by state on the first search. I have to do it though the drop down menu on the second page when it loads. The way I load the first page is with what I believe to be a WebClient POST request. I get the proper html response and can parse though it, but I want to load the more filtered search, but the html I get back is incorrect when I compare it to the html I see in the chrome developers tab.
Here's my code
// The website I'm looking at.
public string url = "https://www.missingmoney.com/Main/Search.cfm";
// The POST requests for the working search, but doesn't filter by states
public string myPara1 = "hJava=Y&SearchFirstName=Jacob&SearchLastName=Smith&HomeState=MN&frontpage=1&GO.x=19&GO.y=18&GO=Go";
// The POST request that also filters by state, but doesn't return the correct html that I would need to parse
public string myPara2 = "hJava=Y&SearchLocation=1&SearchFirstName=Jacob&SearchMiddleName=&SearchLastName=Smith&SearchCity=&SearchStateID=MN&GO.x=17&GO.y=14&GO=Go";
// I save the two html responses in these
public string htmlResult1;
public string htmlResult2;
public void LoadHtml(string firstName, string lastName)
{
using (WebClient client = new WebClient())
{
client.Headers[HttpRequestHeader.ContentType] = "application/x-www-form-urlencoded";
htmlResult1 = client.UploadString(url, myPara1);
htmlResult2 = client.UploadString(url, myPara2);
}
}
Just trying to figure out why the first time I pass in my parameters it works and when I do it in the second one it doesn't.
Thank you for the time you spent looking at this!!!

I simply forgot to add the cookie to the new search. Using google chrome or fiddler you can see the web traffic. All I needed to do was add
client.Headers.Add(HttpRequestHeader.Cookie, "cookie");
to my code right before it uploaded it. Doing so gave me the right html response and I can now parse though my data.
#derloopkat pointed it out, credits to that individual!!!

<form action="./"> and working out params for web scraping

I'm trying to do some web scraping from a simple form in C#.
My issue is trying to figure out the action to post to and how to work out the post params.
The form I am trying to submit has:
<form method="post" action="./"
As the page sits at www.foobar.com I am creating a WebRequest object in my C# code and posting to this address.
The other issue with this is that I am not sure of the post values as the inputs only have ids not names:
<input name="ctl00$MainContent$txtSearchName" type="text" maxlength="8" id="MainContent_txtSearchName" class="input-large input-upper">
So I read this: c# - programmatically form fill and submit login, amongst others and my code looks like this:
var httpRequest = WebRequest.Create("https://www.foobar.com/");
var values = "SearchName=Foo&SearchLastName=Bar";
byte[] send = Encoding.Default.GetBytes(values);
httpRequest.Method = "POST";
httpRequest.ContentType = "application/x-www-form-urlencoded";
httpRequest.ContentLength = send.Length;
Stream sout = httpRequest.GetRequestStream();
sout.Write(send, 0, send.Length);
sout.Flush();
sout.Close();
WebResponse res = httpRequest.GetResponse();
StreamReader sr = new StreamReader(res.GetResponseStream());
string returnvalue = sr.ReadToEnd();
File.WriteAllText(#"C:\src\test.html", returnvalue);
However, the resulting html page that is created does not show the search results, it shows the initial search form.
I am assuming the post is failing. My questions are around post I am making.
Does action="./" mean it posts back to the same page?
Do I need to submit all the form values (or can I get away with only submitting one or two)?
Is there any way to infer what the correct post parameter names are from the form?
Or am I missing something completely about web scraping and submitting forms in server side code?

What I would suggest is not doing all of this work manually, but letting your computer take a bit of the workload. You can use a tool such as Fiddler and the Fiddler Request To Code Plugin in order to programmatically generate the C# code for duplicating the web request. You can then modify it to take whatever dynamic input you may need.
If this isn't the route you'd like to take, you should make sure that you are requesting this data with the correct cookies (if applicable) and that you are supplying ALL POST data, no matter how menial it may seem.

Html Agility Pack - reading div InnerText in table

My problem is that I can't get div InnerText from table. I have successfully extraced different kind of data, but i don't know how to read div from table.
In following picture I've highlighted div, and I need to get InnerText from it, in this case - number 3.
Click here for first picture
I'm trying to accomplish this using following path:
"//div[#class='kal']//table//tr[2]/td[1]/div[#class='cipars']"
But I'm getting following Error:
Click here for Error message picture
Assuming that rest of the code is written correctly, could anyone point me in the right direction ? I have been trying to figure this one out, but i can't get any results.

So your problem is that you are relying on positions within your XPath. Whilst this can be OK in some cases, it is not here, because you are expecting the first td in a given tr to have a div with the class.
Looking at the source in Chrome, it shows this is not always the case. You can see this by comparing the "1" element in the calendar, to "2" and "3". You'll notice the "1" element has a number of elements around it, which the others don't.
Your original XPath query does not return an element, this is why you are getting the error. In the event the XPath query you give HtmlAgilityPack does not result in a DOM element, it will return null.
Now, because you've not shown your entire code, I don't know how this code is being run. However, I am guessing you are trying to loop through all of the calendar items. Regardless, you have multiple ways of doing this, but I will show you that with the descendant XPath selector, you can just grab the whole lot in one go:
//div[#class='kal']//table//descendant::div[#class='cipars']
This will return all of the calendar items (ie 1 through 30).
However, to get all the items in a particular row, you can just stick that tr into the query:
//div[#class='kal']//table//tr[3]/descendant::div[#class='cipars']
This would return 2 to 8 (the second row of calendar items).
To target a specific one, well, you'll have to make an assumption on the source code of the website. It looks like that every "cipars" div has an ancestor of a td with a class datums....so to get the "3" value from your question:
//div[#class='kal']//table//tr[3]//td[#class='datums'][2]/div[#class='cipars']
Hopefully this is enough to show the issue at least.
Edit
Although you do have an XPath problem, you also have another issue.
The site is created very strangely. The calendar is loaded in a strange way. When I hit that URL, the calendar is created by some Javascript calling an XML web service (written in PHP) that then calculates the full table to be used for the calendar.
Due to the fact this is Javascript (client side code), HtmlAgilityPack won't execute it. Therefore, HtmlAgilityPack doesn't even "see" the table. Hence the queries against it come back as "not found" (null).
Ways around this: 1) Use a tool that will call the scripts. By this, I mean load up a Browser. A great tool to use for this is called Selenium. This will probably be the better overall solution because it means all the scripting used by the site will actually be called. You can still use XPath with it, so your queries will not change.
The second way is to send a request off to the same web service that the page does. This is to basically get back the same HTML that the page is getting, and using that with HtmlAgilityPack. How do we do that?
Well, you can easily POST data to a web service using C#. Just for ease of use I've stolen the code from this SO question. With this, we can send the same request the page is, and get the same HTML back.
So to send some POST data, we generate a method like so.....
public static string SendPost(string url, string postData)
{
string webpageContent = string.Empty;
byte[] byteArray = Encoding.UTF8.GetBytes(postData);
HttpWebRequest webRequest = (HttpWebRequest)WebRequest.Create(url);
webRequest.Method = "POST";
webRequest.ContentType = "application/x-www-form-urlencoded";
webRequest.ContentLength = byteArray.Length;
using (Stream webpageStream = webRequest.GetRequestStream())
{
webpageStream.Write(byteArray, 0, byteArray.Length);
}
using (HttpWebResponse webResponse = (HttpWebResponse)webRequest.GetResponse())
{
using (StreamReader reader = new StreamReader(webResponse.GetResponseStream()))
{
webpageContent = reader.ReadToEnd();
}
}
return webpageContent;
}
We can call it like so:
string responseBody = SendPost("http://lekcijas.va.lv/lekcijas_request.php", "nodala=IT&kurss=1&gads=2013&menesis=9&c_dala=");
How did I get this? Well the php file we are calling is the web service the page is, and the POST data is too. The way I found out what data it sends to the service is by debugging the Javascript (using Chrome's Developer console), but you may notice it's pretty much the same thing that is in the URL. That seems to be intentional.
The responseBody that is returned is the physical HTML of just the table for the calendar.
What do we do with it now? We load that up into HtmlAgilityPack, because it is able to accept pure HTML.
var document = new HtmlDocument();
document.LoadHtml(webpageContent);
Now, we stick that original XPath in:
var node = document.DocumentNode.SelectSingleNode("//div[#class='kal']//table//tr[3]//td[#class='datums'][2]/div[#class='cipars']");
Now, we print out what should hopefully be "3":
Console.WriteLine(node.InnerText);
My output, running it locally, is indeed: 3.
However, although this would get you over the problem you are having, I am assuming the rest of the site is like this. If this is the case, you may still be able to work around it using technique above, but tools like Selenium were created for this very reason.

unshortening urls

I am trying to unshorten urls and have not been able to find code (vb.net/c#) to do this. These are the twitter shortened urls and I guess I could try and access one of the web services available and do a httpwebrequest but would prefer to find some programmatic way of doing this.

You can get it directly from response of the shortened url since it will return a status code MovedPermanently and the location for the real url.(This should work for most of the sites without the need for navigating to the real url)
HttpWebRequest req = (HttpWebRequest)WebRequest.Create("http://t.co/xqbLEi6s");
req.AllowAutoRedirect = false;
var resp = req.GetResponse();
string realUrl = resp.Headers["Location"];
Other test data: http://goo.gl/zdf2n , http://tinyurl.com/8xc9vca , http://x.co/iEup, http://is.gd/vTOlz6 , http://bit.ly/FUA4YU

There is no magic way to unshorten a URL without asking the service which created the URL (and the way to ask will be different for each service), or more pragmatically, just opening the URL and watching where it redirects to.

HTTP Post in C# console app doesn't return the same thing as a browser request

I have a C# console app (.NET 2.0 framework) that does an HTTP post using the following code:
StringBuilder postData = new StringBuilder(100);
postData.Append("post.php?");
postData.Append("Key1=");
postData.Append(val1);
postData.Append("&Key2=");
postData.Append(val2);
byte[] dataArray = Encoding.UTF8.GetBytes(postData.ToString());
HttpWebRequest httpRequest = (HttpWebRequest)WebRequest.Create("http://example.com/");
httpRequest.Method = "POST";
httpRequest.ContentType = "application/x-www-form-urlencoded";
httpRequest.ContentLength = dataArray.Length;
Stream requestStream = httpRequest.GetRequestStream();
requestStream.Write(dataArray, 0, dataArray.Length);
requestStream.Flush();
requestStream.Close();
HttpWebResponse webResponse = (HttpWebResponse)httpRequest.GetResponse();
if (httpRequest.HaveResponse == true) {
Stream responseStream = webResponse.GetResponseStream();
StreamReader responseReader = new System.IO.StreamReader(responseStream, Encoding.UTF8);
String responseString = responseReader.ReadToEnd();
}
The outputs from this are:
webResponse.ContentLength = -1
webResponse.ContentType = text/html
webResponse.ContentEncoding is blank
The responseString is HTML with a title and body.
However, if I post the same URL into a browser (http://example.com/post.php?Key1=some_value&Key2=some_other_value), I get a small XML snippet like:
<?xml version="1.0" ?>
<RESPONSE RESULT="SUCCESS"/>
with none of the same HTML as in the application. Why are the responses so different? I need to parse the returned result which I am not getting in the HTML. Do I have to change how I do the post in the application? I don't have control over the server side code that accepts the post.

If you are indeed supposed to use the POST HTTP method, you have a couple things wrong. First, this line:
postData.Append("post.php?");
is incorrect. You want to post to post.php, you don't want post the value "post.php?" to the page. Just remove this line entirely.
This piece:
... WebRequest.Create("http://example.com/");
needs post.php added to it, so...
... WebRequest.Create("http://example.com/post.php");
Again this is assuming you are actually supposed to be POSTing to the specified page instead of GETing. If you are supposed to be using GET, then the other answers already supplied apply.

You'll want to get an HTTP sniffer tool like Fiddler and compare the headers that are being sent from your app to the ones being sent by the browser. There will be something different that is causing the server to return a different response. When you tweak your app to send the same thing browser is sending you should get the same response. (It could be user-agent, cookies, anything, but something is surely different.)

I've seen this in the past.
When you run from a browser, the "User-Agent" in the header is "Mozilla ...".
When you run from a program, it's different and generally specific to the language used.

I think you need to use a GET request, instead of POST. If the url you're using has querystring values (like ?Key1=some_value&Key2=some_other_value) then it's expecting a GET. Instead of adding post values to your webrequest, just put this data in the querystring.
HttpWebRequest httpRequest = (HttpWebRequest)WebRequest.Create("http://example.com/?val1=" + val1 + "&val2=" + val2);
httpRequest.Method = "GET";
httpRequest.ContentType = "application/x-www-form-urlencoded";
....
So, the result you're getting is different when you POST the data from your app because the server-side code has a different output when it can't read the data it's expecting in the querystring.

In your code you a specify the POST method which sends the data to the PHP file without putting the data in the web address. When you put the information in the address bar, that is not the POST method, that is the GET method. The name may be confusing, but GET just means that the data is being sent to the PHP file through the web address, instead of behind the scenes, not that it is supposed to get any information. When you put the address in the browser it is using a GET.
Create a simple html form and specify POST as the method and your url as the action. You will see that the information is sent without appearing in the address bar.
Then do the same thing but specify GET. You will see the information you sent in the address bar.

I believe the problem has something to do with the way your headers are set up for the WebRequest.
I have seen strange cases where attempting to simulate a browser by changing headers in the request makes a difference to the server.
The short answer is that your console application is not a web browser and the web server of example.com is expecting to interact with a browser.
You might also consider changing the ContentType to be "multipart/form-data".
What I find odd is that you are essentially posting nothing. The work is being done by the query string. Therefore, you probably should be using a GET instead of a POST.

Is the form expecting a cookie? That is another possible reason why it works in the browser and not from the console app.

Develop Reference

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.

crawling / scraping a search form based webpages - c#

Related

Scraping multiple lists from a website.

<form action="./"> and working out params for web scraping

Html Agility Pack - reading div InnerText in table

unshortening urls

HTTP Post in C# console app doesn't return the same thing as a browser request

Categories

Resources