I have a URL like:
http://www.matweb.com/search/DataSheet.aspx?MatGUID=849e2916ab1541be9ff6a17b78f95c82
I want to download the source code from that page using this code:
private static string urlTemplate = @"http://www.matweb.com/search/DataSheet.aspx?MatGUID=";

static string GetSource(string guid)
{
    try
    {
        Uri url = new Uri(urlTemplate + guid);
        HttpWebRequest webRequest = (HttpWebRequest)WebRequest.Create(url);
        webRequest.Method = "GET";
        HttpWebResponse webResponse = (HttpWebResponse)webRequest.GetResponse();
        Stream responseStream = webResponse.GetResponseStream();
        StreamReader responseStreamReader = new StreamReader(responseStream);
        String result = responseStreamReader.ReadToEnd();
        return result;
    }
    catch (Exception ex)
    {
        return null;
    }
}
When I do so I get:
You do not seem to have cookies enabled. MatWeb Requires cookies to be enabled.
OK, that I understand, so I added these lines:
CookieContainer cc = new CookieContainer();
webRequest.CookieContainer = cc;
I got:
Your IP Address has been restricted due to excessive use. The problem may be compounded when an IP address may be shared by many people in a company or through an internet service provider. We apologize for any inconvenience.
I can understand this, but I don't get this message when I visit the page in a web browser. What can I do to get the source code? Are some cookies or HTTP headers needed?
It probably doesn't like your UserAgent. Try this:
webRequest.UserAgent = "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.2.13) Gecko/20101203 Firefox/3.6.13 (.NET CLR 3.5.30729)"; //maybe substitute your own in here
It looks like you're doing something that the company doesn't like, if you got an "excessive use" response.
You are downloading pages too fast.
When you use a browser you might get up to one page per second. Using an application you can fetch several pages per second, and that is probably what their web server is detecting. Hence the excessive usage.
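As a rough illustration of the throttling idea (my own sketch, not part of the original answer; the one-second pause and the guids list are placeholders):

// Sketch: fetch pages one at a time and pause between requests so the
// crawl rate stays close to what a person clicking in a browser produces.
foreach (string guid in guids)              // guids: an assumed list of MatGUID values
{
    string source = GetSource(guid);        // GetSource as defined in the question
    // ... parse/store the page here ...
    System.Threading.Thread.Sleep(1000);    // placeholder: roughly one request per second
}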
I have an ASP.NET website. When I call the URL 'http://example.org/worktodo.ashx' from a browser, it works OK.
I have created an Android app, and if I call the above URL from the Android app, it also works OK.
I have created a Windows app in C#, and if I call the above URL from that Windows app, it fails with the error 403 Forbidden.
Following is the C# code.
try
{
    bool TEST_LOCAL = false;

    //
    // One way to call the url
    //
    WebClient client = new WebClient();
    string url = TEST_LOCAL ? "http://localhost:1805/webfolder/worktodo.ashx" : "http://example.org/worktodo.ashx";
    string status = client.DownloadString(url);
    MessageBox.Show(status, "WebClient Response");

    //
    // Another way to call the url
    //
    WebRequest request = WebRequest.Create(url);
    request.Method = "GET";
    request.Headers.Add("Accept:text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8");
    request.Headers.Add("Connection:keep-alive");
    request.Headers.Add("User-Agent:Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36");
    request.Headers.Add("Upgrade-Insecure-Requests:1");
    request.Headers.Add("Accept-Encoding:gzip, deflate, sdch");
    request.ContentType = "text/json";
    WebResponse response = request.GetResponse();
    string responseString = new System.IO.StreamReader(response.GetResponseStream()).ReadToEnd();
    MessageBox.Show(responseString, "WebRequest Response");
}
catch (WebException ex)
{
    string error = ex.Status.ToString();
}
The exception thrown is:
The remote server returned an error: (403) Forbidden.
StatusCode value is 'Forbidden'
StatusDescription value is 'ModSecurity Action'
Following is the Android app code (it uses the org.apache.http library):
Handler handler = new Handler() {
    Context ctx = context; // save context for use inside handleMessage()

    @SuppressWarnings("deprecation")
    public void handleMessage(Message message) {
        switch (message.what) {
            case HttpConnection.DID_START: {
                break;
            }
            case HttpConnection.DID_SUCCEED: {
                String response = (String) message.obj;
                JSONObject jobjdata = null;
                try {
                    JSONObject jobj = new JSONObject(response);
                    jobjdata = jobj.getJSONObject("data");
                    String status = URLDecoder.decode(jobjdata.getString("status"));
                    Toast.makeText(ctx, status, Toast.LENGTH_LONG).show();
                } catch (Exception e1) {
                    Toast.makeText(ctx, "Unexpected error encountered", Toast.LENGTH_LONG).show();
                    // e1.printStackTrace();
                }
            }
        }
    }
};
final ArrayList<NameValuePair> params1 = new ArrayList<NameValuePair>();
if (RUN_LOCALLY)
    new HttpConnection(handler).post(LOCAL_URL, params1);
else
    new HttpConnection(handler).post(WEB_URL, params1);
}
Efforts / Research done so far to solve the issue:
I found the following solutions that fixed the 403 Forbidden error for others, but they could not fix my problem:
Someone said the file needs appropriate 'rwx' permissions set, so I set 'rwx' permissions for the file.
Someone said specifying the USER-AGENT worked; I tried it (ref. "Another way to call the url").
Someone said a valid header fixed it and used Fiddler to find the valid header to set; I used Chrome Developer Tools and set a valid header (ref. "Another way to call the url").
Someone configured ModSecurity to fix it, but I don't have ModSecurity installed for my website, so that is not an option for me.
Many were having the problem with MVC and fixed it there, but I don't use MVC, so those solutions are not for me.
The ModSecurity Reference Manual says that to remove it from a website, add <modules><remove name="ModSecurityIIS" /></modules> to web.config. I did, but it didn't fix the issue.
My questions are:
Why does the C# Windows app fail whereas the Android app succeeds?
Why doesn't the Android app encounter the 'ModSecurity Action' exception?
Why does the C# Windows app encounter the 'ModSecurity Action' exception?
How do I fix the C# code?
Please help me solve the issue. Thank you all.
I found the answer. Below is the code that works as expected.
bool TEST_LOCAL = false;
string url = TEST_LOCAL ? "http://localhost:1805/webfolder/worktodo.ashx" : "http://example.org/worktodo.ashx";
HttpWebRequest request = (HttpWebRequest)HttpWebRequest.Create(url);
request.Method = "GET";
request.Accept = "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8";
request.UserAgent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36";
request.ContentType = "text/json";
WebResponse response = request.GetResponse();
string responseString = new System.IO.StreamReader(response.GetResponseStream()).ReadToEnd();
MessageBox.Show(responseString, "WebRequest Response");
NOTE: requires using System.Net;
I have the following code for getting a website, and it works fine. The problem comes up when I try to get a web page developed in Angular.
HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
request.UserAgent = "Mozilla/5.0 (Windows; U; Windows NT 6.1; rv:2.2) Gecko/20110201";
request.Method = "GET";
request.Timeout = 30000;
HttpWebResponse response = (HttpWebResponse)request.GetResponse();
Stream flujo = response.GetResponseStream();
Encoding encode = Encoding.GetEncoding("utf-8");
StreamReader readStream = new StreamReader(flujo, encode);
String html;
try
{
    html = readStream.ReadToEnd();
}
catch (System.IO.IOException)
{
    return;
}
response.Close();
readStream.Close();
HtmlAgilityPack.HtmlDocument DOM = new HtmlAgilityPack.HtmlDocument();
DOM.LoadHtml(html);
I know Angular first supplies the skeleton of the page and then, on the client side, fetches the info and displays it.
When I try to get some info using HtmlAgilityPack, I get nothing.
My question is whether it's possible to set up HttpWebRequest or HttpWebResponse, or any other class, to wait until the JavaScript is done before getting the content, or something similar.
Also, I tried to get the content using WebBrowser with the loadCompleted event, and I had the same problem.
Any help?
Thanks.
Basically, I am making a chat app for my university's students only, and for that I have to make sure they are genuine by checking their details on the UMS (university management system) and getting their basic details so they chat genuinely. I am nearly done with my chat app; only the login is left.
So I want to log in to my UMS page via my website from a generic handler,
and then navigate to another page in it to access their basic info, keeping the session alive.
I did research on HttpWebRequest and failed to log in with my credentials.
https://ums.lpu.in/lpuums
(made in asp.net)
I tried code from other posts for logging in.
I am a novice at this part, so bear with me; any help will be appreciated.
Without an actual handshake with UMS via a defined API, you would end up scraping UMS HTML, which is bad for various reasons.
I would suggest you read up on Single Sign On (SSO).
A few articles on SSO and ASP.NET -
1. Codeproject
2. MSDN
3. asp.net forum
Edit 1
Although I think this is a bad idea, since you say you are out of options, here is a link that shows how Html Agility Pack can help in scraping web pages.
Beware of the drawbacks of screen scraping: changes from UMS will not be communicated to you, and you will suddenly find your application not working.
public string Scrap(string Username, string Password)
{
    string Url1 = "https://www.example.com";             //first url
    string Url2 = "https://www.example.com/login.aspx";  //secret url to post request to
    string responseData;

    //first request
    CookieContainer jar = new CookieContainer();
    HttpWebRequest request1 = (HttpWebRequest)WebRequest.Create(Url1);
    request1.CookieContainer = jar;

    //Get the response from the server and save the cookies from the first request..
    HttpWebResponse response1 = (HttpWebResponse)request1.GetResponse();

    //second request
    string postData = "***viewstate here***"; //VIEWSTATE
    HttpWebRequest request2 = (HttpWebRequest)WebRequest.Create(Url2);
    request2.CookieContainer = jar;
    request2.KeepAlive = true;
    request2.Referer = Url2;
    request2.Method = WebRequestMethods.Http.Post;
    request2.Accept = "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8";
    request2.UserAgent = "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/535.2 (KHTML, like Gecko) Chrome/15.0.874.121 Safari/535.2";
    request2.ContentType = "application/x-www-form-urlencoded";
    request2.AllowWriteStreamBuffering = true;
    request2.ProtocolVersion = HttpVersion.Version11;
    request2.AllowAutoRedirect = true;

    byte[] byteArray = Encoding.ASCII.GetBytes(postData);
    request2.ContentLength = byteArray.Length;
    Stream newStream = request2.GetRequestStream(); //open connection
    newStream.Write(byteArray, 0, byteArray.Length); // Send the data.
    newStream.Close();

    HttpWebResponse response2 = (HttpWebResponse)request2.GetResponse();
    using (StreamReader sr = new StreamReader(response2.GetResponseStream()))
    {
        responseData = sr.ReadToEnd();
    }
    return responseData;
}
This is the code that works for me. Anyone can add their own links and viewstate for the ASP.NET websites they want to scrape, and you need to take care of the cookies too.
Other (non-ASP.NET) websites don't require the viewstate.
Use Fiddler to find the things you need to add to the headers, and the viewstate or cookie; a hedged sketch of the viewstate step follows.
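To make the viewstate step concrete, here is a rough sketch of how the hidden ASP.NET fields could be pulled out of the first response with Html Agility Pack instead of pasting them into postData by hand. The form field names txtUsername, txtPassword, and btnLogin are hypothetical and must be checked against the real login form with Fiddler or the browser dev tools:

// Sketch: read the first response, extract the hidden ASP.NET fields,
// then build the postData string for the second request from them.
string html;
using (StreamReader reader = new StreamReader(response1.GetResponseStream()))
{
    html = reader.ReadToEnd();
}

HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(html);
string viewState = doc.GetElementbyId("__VIEWSTATE").GetAttributeValue("value", "");
string eventValidation = doc.GetElementbyId("__EVENTVALIDATION").GetAttributeValue("value", "");

// Field names below are hypothetical; inspect the real form to find them.
string postData =
    "__VIEWSTATE=" + Uri.EscapeDataString(viewState) +
    "&__EVENTVALIDATION=" + Uri.EscapeDataString(eventValidation) +
    "&txtUsername=" + Uri.EscapeDataString(Username) +
    "&txtPassword=" + Uri.EscapeDataString(Password) +
    "&btnLogin=Login";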
Hope this helps if someone is having the same problem. :)
Is there a way to spoof a web request from C# code so it doesn't look like a bot or spam hitting the site? I am trying to web scrape my website, but I keep getting blocked after a certain number of calls. I want to act like a real browser. I am using this code, from HTML Agility Pack.
var web = new HtmlWeb();
web.UserAgent = "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.97 Safari/537.11";
I do way too much web scraping, but here are the options:
I have a default list of headers I add as all of these are expected from a browser:
wc.Headers[HttpRequestHeader.UserAgent] = "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.97 Safari/537.11";
wc.Headers[HttpRequestHeader.ContentType] = "application/x-www-form-urlencoded";
wc.Headers[HttpRequestHeader.Accept] = "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8";
wc.Headers[HttpRequestHeader.AcceptEncoding] = "gzip,deflate,sdch";
wc.Headers[HttpRequestHeader.AcceptLanguage] = "en-GB,en-US;q=0.8,en;q=0.6";
wc.Headers[HttpRequestHeader.AcceptCharset] = "ISO-8859-1,utf-8;q=0.7,*;q=0.3";
(wc is my WebClient.)
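One caveat of my own (not from the original answer): the header list above advertises gzip/deflate, so responses may come back compressed. A minimal sketch of a WebClient subclass that opts into automatic decompression is below; the class name is mine, and the same two lines could just as well go into the CookieWebClient shown next:

// Sketch: decompress gzip/deflate responses automatically.
public class DecompressingWebClient : WebClient
{
    protected override WebRequest GetWebRequest(Uri address)
    {
        HttpWebRequest request = (HttpWebRequest)base.GetWebRequest(address);
        request.AutomaticDecompression = DecompressionMethods.GZip | DecompressionMethods.Deflate;
        return request;
    }
}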
As further help, here is my WebClient class that keeps cookies stored, which is also a massive help:
public class CookieWebClient : WebClient
{
    public CookieContainer m_container = new CookieContainer();
    public WebProxy proxy = null;

    protected override WebRequest GetWebRequest(Uri address)
    {
        try
        {
            ServicePointManager.DefaultConnectionLimit = 1000000;
            WebRequest request = base.GetWebRequest(address);
            request.Proxy = proxy;
            HttpWebRequest webRequest = request as HttpWebRequest;
            webRequest.Pipelined = true;
            webRequest.KeepAlive = true;
            if (webRequest != null)
            {
                webRequest.CookieContainer = m_container;
            }
            return request;
        }
        catch
        {
            return null;
        }
    }
}
Here is my usual use for it. Add a static copy to your base site class, alongside the parsing functions you likely have:
protected static CookieWebClient wc = new CookieWebClient();
And call it as such:
public HtmlDocument Download(string url)
{
    HtmlDocument hdoc = new HtmlDocument();
    HtmlNode.ElementsFlags.Remove("option");
    HtmlNode.ElementsFlags.Remove("select");
    Stream read = null;
    try
    {
        read = wc.OpenRead(url);
    }
    catch (ArgumentException)
    {
        read = wc.OpenRead(HttpHelper.HTTPEncode(url));
    }
    hdoc.Load(read, true);
    return hdoc;
}
The other main reason you may be crashing out is that the server is closing the connection because you have had it open for too long. You can prove this by adding a try/catch around the download part as above; if it fails, reset the WebClient and try the download again:
HtmlDocument d = new HtmlDocument();
try
{
    d = this.Download(prp.PropertyUrl);
}
catch (WebException e)
{
    this.Msg(Site.ErrorSeverity.Severe, "Error connecting to " + this.URL + " : Resubmitting..");
    wc = new CookieWebClient();
    d = this.Download(prp.PropertyUrl);
}
This saves my ass all the time. Even if it was the server rejecting you, this can re-jig the lot: cookies are cleared and you're free to roam again. If worse truly comes to worst, add proxy support and get a new proxy applied every 50-ish requests.
That should be more than enough for you to kick your own and any other site's arse.
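For the proxy idea, a rough sketch of my own (the 50-request threshold is the ballpark figure from above, and GetNextProxyAddress() is a hypothetical helper that returns the next address from your own proxy list):

// Sketch: swap in a fresh CookieWebClient and a new proxy every ~50 requests.
private static int requestCount = 0;

private static void RotateClientIfNeeded()
{
    requestCount++;
    if (requestCount % 50 == 0)
    {
        wc = new CookieWebClient();                     // fresh cookies
        wc.proxy = new WebProxy(GetNextProxyAddress()); // hypothetical helper
    }
}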
Use a regular browser and Fiddler (if the developer tools are not up to scratch) and take a look at the request and response headers.
Build up your requests and request headers to match what the browser sends (you can use a couple of different browsers to assess whether this makes a difference).
In regards to "getting blocked after a certain amount of calls" - throttle your calls. Only make one call every x seconds. Behave nicely to the site and it will behave nicely to you.
Chances are good that they simply look at the number of calls from your IP address per second and if it passes a threshold, the IP address gets blocked.
I requested 100 pages that all return 404. I wrote:
{
    var s = DateTime.Now;
    for (int i = 0; i < 100; i++)
        DL.CheckExist("http://google.com/lol" + i.ToString() + ".jpg");
    var e = DateTime.Now;
    var d = e - s;
    d = d;
    Console.WriteLine(d);
}
static public bool CheckExist(string url)
{
    HttpWebRequest wreq = null;
    HttpWebResponse wresp = null;
    bool ret = false;
    try
    {
        wreq = (HttpWebRequest)WebRequest.Create(url);
        wreq.KeepAlive = true;
        wreq.Method = "HEAD";
        wresp = (HttpWebResponse)wreq.GetResponse();
        ret = true;
    }
    catch (System.Net.WebException)
    {
    }
    finally
    {
        if (wresp != null)
            wresp.Close();
    }
    return ret;
}
Two runs show it takes 00:00:30.7968750 and 00:00:26.8750000. Then I tried Firefox and used the following code:
<html>
<body>
<script type="text/javascript">
for(var i=0; i<100; i++)
document.write("<img src=http://google.com/lol" + i + ".jpg><br>");
</script>
</body>
</html>
Using my computer's clock and counting, it took roughly 4 seconds. 4 seconds is 6.5-7.5x faster than my app. I plan to scan through thousands of files, so taking 3.75 hours instead of 30 minutes would be a big problem. How can I make this code faster? I know someone will say Firefox caches the images, but I want to point out that 1) it still needs to check the headers from the remote server to see whether each image has been updated (which is what I want my app to do), and 2) I am not receiving the body; my code should only be requesting the header. So, how do I solve this?
I noticed that an HttpWebRequest hangs on the first request. I did some research and what seems to be happening is that the request is configuring or auto-detecting proxies. If you set
request.Proxy = null;
on the web request object, you might be able to avoid an initial delay.
With proxy auto-detect:
using (var response = (HttpWebResponse)request.GetResponse()) //6,956 ms
{
}
Without proxy auto-detect:
request.Proxy = null;
using (var response = (HttpWebResponse)request.GetResponse()) //154 ms
{
}
Change your code to use an asynchronous GetResponse:
public override WebResponse GetResponse()
{
    // ...
    IAsyncResult asyncResult = BeginGetResponse(null, null);
    // ...
    return EndGetResponse(asyncResult);
}
Async Get
Probably Firefox issues multiple requests at once whereas your code does them one by one. Perhaps adding threads will speed up your program.
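A rough sketch of that idea (my addition, not part of the original answer): run the HEAD checks with Parallel.ForEach instead of a sequential loop. The degree of parallelism is an arbitrary example, and you may also need to raise ServicePointManager.DefaultConnectionLimit when hitting the same host in parallel; it assumes System.Threading.Tasks and System.Collections.Concurrent:

// Sketch: issue the HEAD checks concurrently instead of one by one.
ServicePointManager.DefaultConnectionLimit = 100;   // allow parallel connections
var results = new ConcurrentDictionary<string, bool>();
Parallel.ForEach(
    urls,                                           // assumed IEnumerable<string> of URLs
    new ParallelOptions { MaxDegreeOfParallelism = 8 },
    url => results[url] = DL.CheckExist(url));      // CheckExist from the question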
The answer is changing HttpWebRequest/HttpWebResponse to WebRequest/WebResponse only. That fixed the problem.
Have you tried opening the same URL in IE on the machine that your code is deployed to? If it is a Windows Server machine, then sometimes it's because the URL you're requesting is not in IE's list of secure sites (which HttpWebRequest works off). You'll just need to add it.
Do you have more info you could post? I've been doing something similar and have run into tons of problems with HttpWebRequest before, all unique, so more info would help.
BTW, calling it using the async methods won't really help in this case. It doesn't shorten the download time; it just doesn't block your calling thread, that's all.
Close the response stream when you are done: in your CheckExist(), add wresp.Close() after wresp = (HttpWebResponse)wreq.GetResponse();
OK, if you are getting status code 404 for all web pages, then it is due to not specifying credentials. So you need to add:
wreq.Credentials = CredentialCache.DefaultCredentials;
Then you may also come across status code 500; for that you need to specify a User-Agent, which looks something like the line below:
wreq.UserAgent = "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:2.0) Gecko/20100101 Firefox/4.0";
"A WebClient instance does not send optional HTTP headers by default. If your request requires an optional header, you must add the header to the Headers collection. For example, to retain queries in the response, you must add a user-agent header. Also, servers may return 500 (Internal Server Error) if the user agent header is missing."
reference: https://msdn.microsoft.com/en-us/library/system.net.webclient(v=vs.110).aspx
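In code, that quoted advice amounts to something like the sketch below (the User-Agent string is just an example, and url is assumed to be the address being checked):

// Sketch: WebClient sends no optional headers by default, so set one explicitly.
using (WebClient client = new WebClient())
{
    client.Headers[HttpRequestHeader.UserAgent] =
        "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:2.0) Gecko/20100101 Firefox/4.0";
    string body = client.DownloadString(url);   // url as in the question
}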
To improve the performance of the HttpWebRequest you need to add:
wreq.Proxy = null;
Now the code will look like:
static public bool CheckExist(string url)
{
    HttpWebRequest wreq = null;
    HttpWebResponse wresp = null;
    bool ret = false;
    try
    {
        wreq = (HttpWebRequest)WebRequest.Create(url);
        wreq.Credentials = CredentialCache.DefaultCredentials;
        wreq.Proxy = null;
        wreq.UserAgent = "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:2.0) Gecko/20100101 Firefox/4.0";
        wreq.KeepAlive = true;
        wreq.Method = "HEAD";
        wresp = (HttpWebResponse)wreq.GetResponse();
        ret = true;
    }
    catch (System.Net.WebException)
    {
    }
    finally
    {
        if (wresp != null)
            wresp.Close();
    }
    return ret;
}
Setting the cookie matters, and you must add AspxAutoDetectCookieSupport=1 as in this code:
req.CookieContainer = new CookieContainer();
req.CookieContainer.Add(new Cookie("AspxAutoDetectCookieSupport", "1") { Domain = target.Host });
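For context, here is a minimal sketch of my own showing how those two lines fit into a complete request; the target URL is a placeholder, and it assumes using System.IO and System.Net:

// Sketch: full GET request with the AspxAutoDetectCookieSupport cookie attached.
Uri target = new Uri("http://example.org/worktodo.ashx");   // placeholder URL
HttpWebRequest req = (HttpWebRequest)WebRequest.Create(target);
req.CookieContainer = new CookieContainer();
req.CookieContainer.Add(new Cookie("AspxAutoDetectCookieSupport", "1") { Domain = target.Host });

using (HttpWebResponse resp = (HttpWebResponse)req.GetResponse())
using (StreamReader reader = new StreamReader(resp.GetResponseStream()))
{
    string body = reader.ReadToEnd();
}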