I'm trying to do some web scraping from a simple form in C#.
My issue is trying to figure out the action to post to and how to work out the post params.
The form I am trying to submit has:
<form method="post" action="./"
As the page sits at www.foobar.com I am creating a WebRequest object in my C# code and posting to this address.
The other issue with this is that I am not sure of the post values as the inputs only have ids not names:
<input name="ctl00$MainContent$txtSearchName" type="text" maxlength="8" id="MainContent_txtSearchName" class="input-large input-upper">
So I read this: c# - programmatically form fill and submit login, amongst others and my code looks like this:
var httpRequest = WebRequest.Create("https://www.foobar.com/");
var values = "SearchName=Foo&SearchLastName=Bar";
byte[] send = Encoding.Default.GetBytes(values);
httpRequest.Method = "POST";
httpRequest.ContentType = "application/x-www-form-urlencoded";
httpRequest.ContentLength = send.Length;
Stream sout = httpRequest.GetRequestStream();
sout.Write(send, 0, send.Length);
sout.Flush();
sout.Close();
WebResponse res = httpRequest.GetResponse();
StreamReader sr = new StreamReader(res.GetResponseStream());
string returnvalue = sr.ReadToEnd();
File.WriteAllText(@"C:\src\test.html", returnvalue);
However, the resulting html page that is created does not show the search results, it shows the initial search form.
I am assuming the post is failing. My questions are about the post I am making.
Does action="./" mean it posts back to the same page?
Do I need to submit all the form values (or can I get away with only submitting one or two)?
Is there any way to infer what the correct post parameter names are from the form?
Or am I missing something completely about web scraping and submitting forms in server side code?
What I would suggest is not doing all of this work manually, but letting your computer take a bit of the workload. You can use a tool such as Fiddler and the Fiddler Request To Code Plugin in order to programmatically generate the C# code for duplicating the web request. You can then modify it to take whatever dynamic input you may need.
If this isn't the route you'd like to take, you should make sure that you are requesting this data with the correct cookies (if applicable) and that you are supplying ALL POST data, no matter how menial it may seem.
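As a rough sketch of what "supplying ALL POST data" can look like: form fields are posted under their name attribute, not their id, so the text box in the question would be sent as ctl00$MainContent$txtSearchName. An action of "./" is resolved relative to the page's own URL, so for a page at https://www.foobar.com/ the post goes back to the same address. The code below assumes the page carries hidden inputs that must be echoed back (typical of ASP.NET WebForms, for example __VIEWSTATE, though the question does not show them). The class and method names are made up for illustration, and the regex expects double-quoted attributes, so treat it as a starting point rather than a robust parser.

```csharp
using System;
using System.Collections.Generic;
using System.Net;
using System.Net.Http;
using System.Text.RegularExpressions;
using System.Threading.Tasks;

// Hypothetical helper, not taken from the question or the answer.
public static class FormPostSketch
{
    // Collects name/value pairs from <input type="hidden"> tags.
    // Simplification: only handles double-quoted attributes.
    public static Dictionary<string, string> ExtractHiddenFields(string html)
    {
        var fields = new Dictionary<string, string>();
        foreach (Match tag in Regex.Matches(html, "<input\\b[^>]*>", RegexOptions.IgnoreCase))
        {
            string t = tag.Value;
            if (!Regex.IsMatch(t, "\\btype=\"hidden\"", RegexOptions.IgnoreCase))
                continue;
            Match name = Regex.Match(t, "\\bname=\"([^\"]*)\"", RegexOptions.IgnoreCase);
            Match value = Regex.Match(t, "\\bvalue=\"([^\"]*)\"", RegexOptions.IgnoreCase);
            if (name.Success)
            {
                string v = value.Success ? value.Groups[1].Value : "";
                fields[WebUtility.HtmlDecode(name.Groups[1].Value)] = WebUtility.HtmlDecode(v);
            }
        }
        return fields;
    }

    public static async Task<string> SearchAsync(string pageUrl, string searchName)
    {
        // The handler keeps cookies from the GET and sends them with the POST.
        var handler = new HttpClientHandler { CookieContainer = new CookieContainer() };
        using (var client = new HttpClient(handler))
        {
            // 1. Load the form so the hidden fields are current.
            string html = await client.GetStringAsync(pageUrl);
            Dictionary<string, string> fields = ExtractHiddenFields(html);

            // 2. Add the visible field under its name attribute (from the question).
            fields["ctl00$MainContent$txtSearchName"] = searchName;
            // Assumption: the submit button's name/value may also be required;
            // check the real browser request in Fiddler and add it here.

            // 3. Resolve action="./" against the page URL and post.
            Uri postUrl = new Uri(new Uri(pageUrl), "./");
            HttpResponseMessage response =
                await client.PostAsync(postUrl, new FormUrlEncodedContent(fields));
            response.EnsureSuccessStatusCode();
            return await response.Content.ReadAsStringAsync();
        }
    }
}
```

It would be called along the lines of `string html = await FormPostSketch.SearchAsync("https://www.foobar.com/", "Foo");`. Comparing the fields it sends with what Fiddler shows for the browser is still the quickest way to spot anything missing.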
I am trying to get specific information from a website. Right now I have the HTML source of the page in "responseText", as you can see in my code. I know I could pick it apart with if statements, but that would be really tedious. I'm a newbie, so I have no idea what I'm doing here, and I'm sure there must be an easier way to retrieve information from a website. This is C# for Windows Store, so I can't use WebClient. This code gets the string, but isn't there a way I can remove the HTML and leave only the variables or something? I only need this for one webpage, and I know which variables I want because I looked at the page's HTML. Isn't there a way to request a list of variables and their values from the website? I'm just kind of lost here. So basically, I want to get specific information from a website in C# for a Windows Store app.
StringBuilder sb = new StringBuilder();
// used on each read operation
byte[] buf = new byte[8192];
// prepare the web page we will be asking for
HttpClient searchClient;
searchClient = new HttpClient();
searchClient.MaxResponseContentBufferSize = 256000;
HttpResponseMessage response = await searchClient.GetAsync(url);
response.EnsureSuccessStatusCode();
responseText = await response.Content.ReadAsStringAsync();
This code gets the string, but isn't there a way I can remove the HTML and leave only the variables or something?
What "variables"? You get the HTML - that's the response from the web server. If you want to strip that HTML, that's up to you. You might want to use HTML Tidy to make it more pleasant to work with, but the business of extracting relevant information from HTML is up to you. HTML isn't designed to be machine-readable as a raw information source - it's meant to be mark-up to present to humans.
You should investigate whether the information is available in a more machine-friendly source, with no presentation information etc. For example, there may be some way of getting the data as JSON or XML.
I have been having a hell of a time sorting out PayPal's documentation, as all of it applies to ASP but not MVC (including their otherwise-handy Integration Wizard). I have seen the oft-referenced guide by Rick Strahl, but it is also for ASP, and I have no Webforms experience to translate into MVC.
I am stuck on one part, and have a security concern about another.
First: how do you actually submit the request to the paypal api? The documentation tells you to use a form with your password in it.
<form method=post action=https://api-3t.sandbox.paypal.com/nvp>
<input type=hidden name=USER value=API_username>
<input type=hidden name=PWD value=API_password>
<input type=hidden name=SIGNATURE value=API_signature>
<input type=hidden name=VERSION value=XX.0>
<input type=hidden name=PAYMENTREQUEST_0_PAYMENTACTION
value=Sale>
<input name=PAYMENTREQUEST_0_AMT value=19.95>
<input type=hidden name=RETURNURL
value=https://www.YourReturnURL.com>
<input type=hidden name=CANCELURL
value=https://www.YourCancelURL.com>
<input type=submit name=METHOD value=SetExpressCheckout>
</form>
Surely this form isn't going into the View, where anyone with the sense to check your source could steal your login info? I would assume this needs to be done from the controller, but I don't know how to do this from the controller. HttpWebRequest and WebClient look promising, but I don't know how to actually add a form to them.
Second: even if I did make this form and api call from inside the controller where the user can't see it, anyone with access to the source code (like the web host, or other developers) would be able to see the password. This doesn't seem like good security. What's the practice here? How can this be made secure?
EDIT
For the people who come looking, this is how I eventually submitted the initial request (condensed the code into one block for readability)
public static string GetResponse(RequestContext context, decimal price)
{
HttpWebRequest request = (HttpWebRequest)WebRequest.Create("https://api-3t.sandbox.paypal.com/nvp");
//HttpWebRequest request = (HttpWebRequest)WebRequest.Create("https://api-3t.sandbox.paypal.com/nvp");
request.Method = "POST";
UrlHelper url = new UrlHelper(context);
string urlBase = string.Format("{0}://{1}", context.HttpContext.Request.Url.Scheme, context.HttpContext.Request.Url.Authority);
string formContent = "USER=" + System.Configuration.ConfigurationManager.AppSettings["paypalUser"] +
"&PWD=" + System.Configuration.ConfigurationManager.AppSettings["paypalPassword"] +
"&SIGNATURE=" + System.Configuration.ConfigurationManager.AppSettings["paypalSignature"] +
"&VERSION=84.0" +
"&PAYMENTREQUEST_0_PAYMENTACTION=Sale" +
"&PAYMENTREQUEST_0_AMT=" + String.Format("{0:0.00}", price) +
// URLs contain ':' and '/', so encode them before placing them in the form body.
"&RETURNURL=" + HttpUtility.UrlEncode(urlBase + url.Action("Confirm", "Checkout")) +
"&CANCELURL=" + HttpUtility.UrlEncode(urlBase + url.Action("Canceled", "Checkout")) +
"&METHOD=SetExpressCheckout";
byte[] byteArray = Encoding.UTF8.GetBytes(formContent);
request.ContentType = "application/x-www-form-urlencoded";
request.ContentLength = byteArray.Length;
Stream dataStream = request.GetRequestStream();
dataStream.Write(byteArray, 0, byteArray.Length);
dataStream.Close();
WebResponse response = request.GetResponse();
dataStream = response.GetResponseStream();
StreamReader reader = new StreamReader(dataStream);
string responseFromServer = HttpUtility.UrlDecode(reader.ReadToEnd());
reader.Close();
dataStream.Close();
response.Close();
return responseFromServer;
}
AFAIK, PayPal also provides a web service, as an alternative to just posting data.
You can make a POST request from your controller, which lets you hide the sensitive data (all those hidden values) from users.
Here you can see an example of posting your data from code: http://msdn.microsoft.com/en-us/library/debx8sh9.aspx
As for your second concern, you can keep the sensitive parameters encrypted in web.config, so that they are only readable at runtime.
PayPal also provides a Sandbox for testing your integration, so at that stage you could leave these values unencrypted. Once you move your app to production, replace the test parameters with your encrypted production credentials.
Referring to your reply to Ashok Padmanabhan:
I have, but he seems to pass right over this section, instead focusing on the IPN handling. I also tried to find the code from the video, but couldn't
This was what I asked you to Google for: the code for MvcStoreFront by Rob Conery. And here is the link
My previous answer was meant to let you know that even if you do get the source code, I doubt you could learn much from it. At least, that was the case for me; my fault for assuming the same for everyone else. The reason is that it targets a different version of MVC, and I ran into various complications because of the differences between the code in the video and the final source code.
I am struggling to implement PayPal myself. I have given up hope on IPN and PDT, and I'm now working on integrating the normal return URL. I think I will go with Romias' idea of encrypting the credentials in web.config (although I don't quite understand this method yet; I hope I will soon).
Hope this is a more constructive answer :)
I want to crawl/scrape a webpage that has a form.
To be precise, this is the URL:
http://lafayetteassessor.com/propertysearch.cfm
The problem is that I want to run a search and save the result page.
My search string will always give a unique page, so the result count won't be a problem.
The search there doesn't put its parameters in the URL (the way a Google search URL does). How can I search from the starting page (as above) and get the result page?
Please give me some ideas.
I am using C#/.NET.
If you look at the forms on that page, you will notice that they use the POST method, rather than the GET method. As I'm sure you know, GET forms pass their parameters as part of the URL, eg mypage?arg1=value&arg2=value
However, for POST requests, you need to pass the parameters as the request body. It takes the same format, it's just passed in differently. To do this, use code similar to this:
HttpWebRequest myRequest = (HttpWebRequest)WebRequest.Create(theURL);
myRequest.Method = "POST";
// Tell the server the body is URL-encoded form data.
myRequest.ContentType = "application/x-www-form-urlencoded";
using (TextWriter body = new StreamWriter(myRequest.GetRequestStream())) {
body.Write("arg1=value1&arg2=value2");
}
WebResponse theResponse = myRequest.GetResponse();
//do stuff with the response
Don't forget that you still need to escape the arguments, etc.
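One possible way to do that escaping is to run every key and value through Uri.EscapeDataString before joining them. This is only a sketch; the class name, method name and sample arguments are invented for illustration. Note that EscapeDataString encodes a space as %20 rather than the + that browsers send for forms; servers normally accept both.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper for building a form-encoded request body.
public static class FormEncoding
{
    // Escapes each key and value, then joins the pairs with '&'.
    public static string BuildBody(IEnumerable<KeyValuePair<string, string>> args)
    {
        return string.Join("&", args.Select(p =>
            Uri.EscapeDataString(p.Key) + "=" + Uri.EscapeDataString(p.Value)));
    }

    public static void Main()
    {
        var args = new[]
        {
            new KeyValuePair<string, string>("arg1", "value 1"),
            new KeyValuePair<string, string>("arg2", "a&b=c"),
        };
        // Prints: arg1=value%201&arg2=a%26b%3Dc
        Console.WriteLine(BuildBody(args));
    }
}
```

The resulting string is what would be passed to body.Write in place of the hard-coded "arg1=value1&arg2=value2".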
I'm trying to login to a website using C# and the WebRequest class. This is the code I wrote up last night to send POST data to a web page:
public string login(string URL, string postData)
{
Stream webpageStream;
WebResponse webpageResponse;
StreamReader webpageReader;
byte[] byteArray = Encoding.UTF8.GetBytes(postData);
_webRequest = WebRequest.Create(URL);
_webRequest.Method = "POST";
_webRequest.ContentType = "application/x-www-form-urlencoded";
_webRequest.ContentLength = byteArray.Length;
webpageStream = _webRequest.GetRequestStream();
webpageStream.Write(byteArray, 0, byteArray.Length);
webpageResponse = _webRequest.GetResponse();
webpageStream = webpageResponse.GetResponseStream();
webpageReader = new StreamReader(webpageStream);
string responseFromServer = webpageReader.ReadToEnd();
webpageReader.Close();
webpageStream.Close();
webpageResponse.Close();
return responseFromServer;
}
and it works fine, but I have no idea how I can modify it to send POST data to a login script and then save a cookie(?) and log in.
I have looked at my network transfers using Firebug on the websites login page and it is sending POST data to a URL that looks like this:
accountName=myemail%40gmail.com&password=mypassword&persistLogin=on&app=com-sc2
As far as I'm aware, to be able to use my account with this website in my C# app I need to save the cookie that the web server sends, and then use it on every request? Is this right? Or can I get away with no cookie at all?
Any help is greatly appreciated, thanks! :)
The login process depends on the concrete web site. If it uses cookies, you need to use them.
I recommend using Firefox with an HTTP-header-watching plugin to see exactly which headers are sent to your particular web site, and then sending the same ones from C#. I answered a very similar question the day before yesterday, including an example with cookies. Look here.
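If the site does use cookies, one common pattern with WebRequest is to attach the same CookieContainer to every request, so that cookies set by the login response are sent back automatically afterwards. The sketch below is an assumption about how that could be wired up, not the code from the linked answer; the CookieSession class and its method names are hypothetical.

```csharp
using System;
using System.IO;
using System.Net;
using System.Text;

// Hypothetical helper: every request it creates shares one CookieContainer.
public class CookieSession
{
    private readonly CookieContainer _cookies = new CookieContainer();

    public HttpWebRequest CreateRequest(string url, string method)
    {
        var request = (HttpWebRequest)WebRequest.Create(url);
        request.Method = method;
        // Cookies received on earlier responses are stored here and resent.
        request.CookieContainer = _cookies;
        return request;
    }

    public string Post(string url, string postData)
    {
        byte[] body = Encoding.UTF8.GetBytes(postData);
        HttpWebRequest request = CreateRequest(url, "POST");
        request.ContentType = "application/x-www-form-urlencoded";
        request.ContentLength = body.Length;
        using (Stream stream = request.GetRequestStream())
        {
            stream.Write(body, 0, body.Length);
        }
        return ReadBody(request);
    }

    public string Get(string url)
    {
        return ReadBody(CreateRequest(url, "GET"));
    }

    private static string ReadBody(HttpWebRequest request)
    {
        using (WebResponse response = request.GetResponse())
        using (var reader = new StreamReader(response.GetResponseStream()))
        {
            return reader.ReadToEnd();
        }
    }
}
```

With this, you would call Post once with the login URL and the accountName/password string from the question, then use Get or Post on the same CookieSession instance for the pages that require the login.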
I've found more luck using the HtmlElement class to manipulate around websites.
Here is a cross-post to an example of how logging in through code would work (provided you're using a WebBrowser control)
I have a C# console app (.NET 2.0 framework) that does an HTTP post using the following code:
StringBuilder postData = new StringBuilder(100);
postData.Append("post.php?");
postData.Append("Key1=");
postData.Append(val1);
postData.Append("&Key2=");
postData.Append(val2);
byte[] dataArray = Encoding.UTF8.GetBytes(postData.ToString());
HttpWebRequest httpRequest = (HttpWebRequest)WebRequest.Create("http://example.com/");
httpRequest.Method = "POST";
httpRequest.ContentType = "application/x-www-form-urlencoded";
httpRequest.ContentLength = dataArray.Length;
Stream requestStream = httpRequest.GetRequestStream();
requestStream.Write(dataArray, 0, dataArray.Length);
requestStream.Flush();
requestStream.Close();
HttpWebResponse webResponse = (HttpWebResponse)httpRequest.GetResponse();
if (httpRequest.HaveResponse == true) {
Stream responseStream = webResponse.GetResponseStream();
StreamReader responseReader = new System.IO.StreamReader(responseStream, Encoding.UTF8);
String responseString = responseReader.ReadToEnd();
}
The outputs from this are:
webResponse.ContentLength = -1
webResponse.ContentType = text/html
webResponse.ContentEncoding is blank
The responseString is HTML with a title and body.
However, if I post the same URL into a browser (http://example.com/post.php?Key1=some_value&Key2=some_other_value), I get a small XML snippet like:
<?xml version="1.0" ?>
<RESPONSE RESULT="SUCCESS"/>
with none of the same HTML as in the application. Why are the responses so different? I need to parse the returned result which I am not getting in the HTML. Do I have to change how I do the post in the application? I don't have control over the server side code that accepts the post.
If you are indeed supposed to use the POST HTTP method, you have a couple things wrong. First, this line:
postData.Append("post.php?");
is incorrect. You want to post to post.php; you don't want to post the value "post.php?" to the page. Just remove this line entirely.
This piece:
... WebRequest.Create("http://example.com/");
needs post.php added to it, so...
... WebRequest.Create("http://example.com/post.php");
Again, this assumes you are actually supposed to be POSTing to the specified page instead of GETting. If you are supposed to be using GET, then the other answers already supplied apply.
You'll want to get an HTTP sniffer tool like Fiddler and compare the headers being sent from your app to the ones being sent by the browser. There will be something different that is causing the server to return a different response. When you tweak your app to send the same thing the browser is sending, you should get the same response. (It could be the user-agent, cookies, anything, but something is surely different.)
I've seen this in the past.
When you run from a browser, the "User-Agent" in the header is "Mozilla ...".
When you run from a program, it's different and generally specific to the language used.
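If the User-Agent does turn out to be the difference, HttpWebRequest lets you set it (and the Accept header) explicitly. The header values below are placeholders chosen for illustration, and the helper name is hypothetical; copying the exact values your browser sends, as seen in a sniffer, is the safer choice.

```csharp
using System.Net;

// Hypothetical helper that makes a request look more like a browser's.
public static class BrowserLikeRequest
{
    public static HttpWebRequest Create(string url)
    {
        var request = (HttpWebRequest)WebRequest.Create(url);
        // Placeholder values; replace with what your browser actually sends.
        request.UserAgent = "Mozilla/5.0 (compatible; ExampleClient/1.0)";
        request.Accept = "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8";
        return request;
    }
}
```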
I think you need to use a GET request, instead of POST. If the url you're using has querystring values (like ?Key1=some_value&Key2=some_other_value) then it's expecting a GET. Instead of adding post values to your webrequest, just put this data in the querystring.
HttpWebRequest httpRequest = (HttpWebRequest)WebRequest.Create("http://example.com/post.php?Key1=" + val1 + "&Key2=" + val2);
httpRequest.Method = "GET";
....
So, the result you're getting is different when you POST the data from your app because the server-side code has a different output when it can't read the data it's expecting in the querystring.
In your code you specify the POST method, which sends the data to the PHP file without putting it in the web address. When you put the information in the address bar, that is not the POST method; that is the GET method. The name may be confusing, but GET just means that the data is sent to the PHP file through the web address instead of behind the scenes, not that it is supposed to get any information. When you put the address in the browser, it is using a GET.
Create a simple html form and specify POST as the method and your url as the action. You will see that the information is sent without appearing in the address bar.
Then do the same thing but specify GET. You will see the information you sent in the address bar.
I believe the problem has something to do with the way your headers are set up for the WebRequest.
I have seen strange cases where attempting to simulate a browser by changing headers in the request makes a difference to the server.
The short answer is that your console application is not a web browser and the web server of example.com is expecting to interact with a browser.
You might also consider changing the ContentType to be "multipart/form-data".
What I find odd is that you are essentially posting nothing. The work is being done by the query string. Therefore, you probably should be using a GET instead of a POST.
Is the form expecting a cookie? That is another possible reason why it works in the browser and not from the console app.