I started off with the simple code below to grab the HTML from web pages into a string for later processing. For some sites, like Digikey, it works, but for others, like Mouser, it doesn't.
I have tried adding headers and a user agent to the WebClient, and converting the URL to a Uri, with no success. Does anybody have any other suggestions of what I could try? Or could anybody try to get the code to work and let me know how it goes?
String url = "http://www.mouser.com/ProductDetail/Vishay-Thin-Film/PCNM2512E1000BST5/?
qs=sGAEpiMZZMu61qfTUdNhG6MW4lgzyHBgo9k7HJ54G4u10PG6pMa7%252bA%3d%3d"
WebClient web = new WebClient();
String html = web.DownloadString(url);
MessageBox.Show(html);
EDIT: The URL should lead to the Mouser product detail page shown in the code above.
EDIT: I tried the following chunk of code with no luck:
String url = "http://www.mouser.com/ProductDetail/Vishay-Thin-Film/PCNM2512E1000BST5/?qs=sGAEpiMZZMu61qfTUdNhG6MW4lgzyHBgo9k7HJ54G4u10PG6pMa7%252bA%3d%3d";
WebClient web = new WebClient();
web.Headers[HttpRequestHeader.UserAgent] = "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.2 (KHTML, like Gecko) Chrome/15.0.874.121 Safari/535.2";
String html = web.DownloadString(url);
MessageBox.Show(html);
You need to download Fiddler; it's free (it was originally developed at Microsoft) and it lets you record browser sessions. Launch it, open Chrome or whatever your browser is, and go through the steps. Once you're done, you can stop it and look at every request and response and the raw data sent.
That makes it easy to spot the difference between your code and the browser.
There are also many free tools that will take your request/response data and generate the C# code for you, such as Request To Code. That is not the only one; I'm not at work and I can't recall the one I use there, but there are plenty to choose from.
Hope this helps.
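For example, once you have diffed the browser session against what WebClient sends, the fix usually amounts to copying the missing headers over before calling DownloadString. A minimal sketch of that idea, assuming the header values were captured from your own browser session (the values below are illustrative, not necessarily the exact set Mouser checks):
using System;
using System.Net;

string url = "http://www.mouser.com/ProductDetail/Vishay-Thin-Film/PCNM2512E1000BST5/?qs=sGAEpiMZZMu61qfTUdNhG6MW4lgzyHBgo9k7HJ54G4u10PG6pMa7%252bA%3d%3d";

using (var web = new WebClient())
{
    // Headers copied from a browser session recorded in Fiddler (illustrative values).
    web.Headers[HttpRequestHeader.UserAgent] =
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.106 Safari/537.36";
    web.Headers[HttpRequestHeader.Accept] =
        "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8";
    web.Headers[HttpRequestHeader.AcceptLanguage] = "en-US,en;q=0.8";

    string html = web.DownloadString(url);
    Console.WriteLine(html.Length);
}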
Related
I wrote an XML grabber to receive and decode XML files from a website. It mostly works fine, but it always returns the error:
"The remote server returned an error: (403) Forbidden."
for the site http://w1.weather.gov/xml/current_obs/KSRQ.xml
My code is:
// Path is the URL string shown above; xmldoc is an XmlDocument declared elsewhere.
CookieContainer cookies = new CookieContainer();
HttpWebRequest webRequest = (HttpWebRequest)HttpWebRequest.Create(Path);
webRequest.Method = "GET";
webRequest.CookieContainer = cookies;
using (HttpWebResponse webResponse = (HttpWebResponse)webRequest.GetResponse())
{
    using (StreamReader streamReader = new StreamReader(webResponse.GetResponseStream()))
    {
        string xml = streamReader.ReadToEnd();
        xmldoc.LoadXml(xml);
    }
}
The exception is thrown in the GetResponse method. How can I find out what happened?
It could be that your request is missing a header that is required by the server. I requested the page in a browser, recorded the exact request using Fiddler and then removed the User-Agent header and reissued the request. This resulted in a 403 response.
This is often used by servers in an attempt to prevent scripting of their sites just like you are doing ;o)
In this case, the server header in the 403 response is "AkamaiGHost" which indicates an edge node from some cloud security solution from Akamai. Maybe a WAF rule to prevent bots is triggering the 403.
It seems like adding any value to the User-Agent header will work for this site. For example I set it to "definitely-not-a-screen-scraper" and that seems to work fine.
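Applied to the code in the question, that just means assigning the UserAgent property before calling GetResponse(); a sketch (the value itself doesn't seem to matter for this site):
using System.IO;
using System.Net;
using System.Xml;

var xmldoc = new XmlDocument();
var cookies = new CookieContainer();

var webRequest = (HttpWebRequest)WebRequest.Create("http://w1.weather.gov/xml/current_obs/KSRQ.xml");
webRequest.Method = "GET";
webRequest.CookieContainer = cookies;
// Any non-empty value appears to be enough to avoid the 403 here.
webRequest.UserAgent = "definitely-not-a-screen-scraper";

using (var webResponse = (HttpWebResponse)webRequest.GetResponse())
using (var streamReader = new StreamReader(webResponse.GetResponseStream()))
{
    string xml = streamReader.ReadToEnd();
    xmldoc.LoadXml(xml);
}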
In general, when you have this kind of problem, it very often helps to look at the actual HTTP requests and responses using browser tools or a proxy like Fiddler. As Scott Hanselman says:
The internet is not a black box
http://www.hanselman.com/blog/TheInternetIsNotABlackBoxLookInside.aspx
Clearly, the URL works from a browser. It just doesn't work from the code. It would appear that the server is accepting/rejecting requests based on the user agent, probably as a very basic way of trying to prevent crawlers.
To get through, just set the UserAgent property to something it will recognize, for instance:
webRequest.UserAgent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.106 Safari/537.36";
That does seem to work.
In my particular case, it was not the UserAgent header, but the Accept header that the server didn't like.
request.Accept = "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8";
You can use the Network tab of the browser's dev tools to see what the correct headers should be.
Is your request going through a proxy server? If yes, add the following line before your GetResponse() call.
webRequest.Proxy.Credentials = System.Net.CredentialCache.DefaultCredentials;
I've been looking around trying to find out how I can go to a specified GitHub page and get the last commit value, then bind that into a value in my application. Nothing seems to make sense, there aren't many (if any) good examples to base anything on, and nobody seems to want to share their knowledge on this topic.
I'm trying to get only the last commit value from a GitHub page and use it as a value in my application. Can someone give me an example of how to do this? I am using C# with a WPF project type.
If you want to clone the repository locally and inspect it, you could use GitSharp or libgit2sharp. If that is not an option for you, then you can use the GitHub API. The URL you are after is:
https://api.github.com/repos/<repo_path>/commits
e.g. https://api.github.com/repos/NancyFx/Nancy/commits
// Requires System.Net.Http and Newtonsoft.Json.Linq (for JArray).
using (var client = new HttpClient())
{
    // GitHub's API rejects requests without a User-Agent header.
    client.DefaultRequestHeaders.Add("User-Agent",
        "Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.2; WOW64; Trident/6.0)");
    using (var response = client.GetAsync("https://api.github.com/repos/NancyFx/Nancy/commits").Result)
    {
        var json = response.Content.ReadAsStringAsync().Result;
        dynamic commits = JArray.Parse(json);
        string lastCommit = commits[0].commit.message;
    }
}
As mentioned in the comments, this will couple your implementation to GitHub, so be sure that your app doesn't need to work with other Git hosts in the future if you choose the second option.
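If you do go the cloning route instead, a minimal sketch using the LibGit2Sharp package might look like this (the local path is just an illustrative placeholder):
using System;
using System.Linq;
using LibGit2Sharp;

// Clone the repository to a local working directory (placeholder path).
string repoPath = Repository.Clone("https://github.com/NancyFx/Nancy.git", @"C:\temp\Nancy");

using (var repo = new Repository(repoPath))
{
    // The commit log starts at HEAD, newest first, so First() is the latest commit.
    Commit last = repo.Commits.First();
    Console.WriteLine(last.MessageShort);
}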
I'm experiencing a strange issue with WebClient.DownloadString that I can't seem to solve. My code:
Dim client As New WebClient()
Dim html = client.DownloadString("http://www.btctrade.com/")
The content doesn't seem to be fully AJAX-generated, so it can't be that. Is it because the web page is in Chinese? I'm guessing the HTML is just served as HTML, so it can't really be that either. The URL is fine when I go to it in a browser, and there seem to be no redirects to https either.
Anyone know why this is happening?
You must set cookies and a user agent in the WebClient headers. This works:
client.Headers.Add(HttpRequestHeader.UserAgent, "Mozilla/5.0 (Windows NT 5.1; rv:14.0) Gecko/20100101 Firefox/14.0.1");
client.Headers.Add(HttpRequestHeader.Cookie, "USER_PW=9b1283bfe37ac47b243a1e0c9c1c9e52; PHPSESSID=f692406a0c84dba2605a7065d55a3b53");
And if you want the request to do all of this work itself, you have to use HttpWebRequest, save all of the response's headers, and use them in a new request, as sketched below.
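A rough sketch of that idea: let a CookieContainer on HttpWebRequest capture whatever cookies the server sets on the first response, then reuse the same container for the next request so those cookies are sent back automatically. (The URL is the one from the question; the user agent is the one above.)
using System;
using System.IO;
using System.Net;

var cookies = new CookieContainer();
const string url = "http://www.btctrade.com/";
const string userAgent = "Mozilla/5.0 (Windows NT 5.1; rv:14.0) Gecko/20100101 Firefox/14.0.1";

// First request: the server's Set-Cookie headers are stored in the container.
var first = (HttpWebRequest)WebRequest.Create(url);
first.UserAgent = userAgent;
first.CookieContainer = cookies;
using (var response = (HttpWebResponse)first.GetResponse())
using (var reader = new StreamReader(response.GetResponseStream()))
{
    reader.ReadToEnd(); // body ignored; we only want the cookies
}

// Second request: the same container sends those cookies back.
var second = (HttpWebRequest)WebRequest.Create(url);
second.UserAgent = userAgent;
second.CookieContainer = cookies;
using (var response = (HttpWebResponse)second.GetResponse())
using (var reader = new StreamReader(response.GetResponseStream()))
{
    string html = reader.ReadToEnd();
    Console.WriteLine(html.Length);
}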
WebClient is not buggy, so probably the server is returning data you did not expect. Use Fiddler to watch what happens when you go to the site in a web browser.
When I executed your code the web site returned no data. When I visited the site in a web browser it returned data. Probably, the site is detecting that you are a bot and denying you access. Fake being a browser by mimicking what you see in Fiddler.
Is it possible to make an exactly identical POST with HttpWebRequest in C# as a browser would, without the page being able to detect that it is not actually a browser?
If so, where could I read up more on that?
Download and become familiar with a tool like Fiddler. It allows you to inspect web requests made from applications, like a normal browser, and see exactly what is being sent. You can then emulate the data being sent with a request created in C#, providing values for headers, cookies, etc.
I think this is doable.
Browser detection is done based on a header in the request. All you need to do is set that header. With HttpWebRequest you don't set it through the Headers collection but rather through the .UserAgent property.
Eg:
.UserAgent = "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)";
There is quite a lot to user agents. Check this link for the complete list of User-Agents
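As a rough sketch, here is what a browser-like POST with HttpWebRequest can look like. The URL, form fields, and header values are placeholders for illustration, not anything from the question:
using System;
using System.IO;
using System.Net;
using System.Text;

// Placeholder URL and form data.
var request = (HttpWebRequest)WebRequest.Create("http://example.com/login");
request.Method = "POST";
request.ContentType = "application/x-www-form-urlencoded";
request.UserAgent = "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)";
request.Accept = "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8";
request.Referer = "http://example.com/";
request.CookieContainer = new CookieContainer(); // keep cookies across requests, as a browser would

byte[] body = Encoding.UTF8.GetBytes("username=foo&password=bar");
request.ContentLength = body.Length;
using (Stream stream = request.GetRequestStream())
{
    stream.Write(body, 0, body.Length);
}

using (var response = (HttpWebResponse)request.GetResponse())
using (var reader = new StreamReader(response.GetResponseStream()))
{
    Console.WriteLine(reader.ReadToEnd());
}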
Useful Links:
How to create a simple proxy in C#?
Is WebRequest The Right C# Tool For Interacting With Websites?
http://codehelp.smartdev.eu/2009/05/08/improve-webclient-by-adding-useragent-and-cookies-to-your-requests/
Here's the code I'm trying to run:
var wc = new WebClient();
var stream = wc.OpenRead(
"http://en.wikipedia.org/wiki/List_of_communities_in_New_Brunswick");
But I keep getting a 403 Forbidden error, and I don't understand why. It worked fine for other pages, and I can open the page fine in my browser. How can I fix this?
I wouldn't normally use OpenRead(); try DownloadData() or DownloadString() instead.
Also, it might be that Wikipedia is deliberately blocking your request because you have not provided a user agent string:
WebClient client = new WebClient();
client.Headers.Add("user-agent",
"Mozilla/5.0 (Windows; Windows NT 5.1; rv:1.9.2.4) Gecko/20100611 Firefox/3.6.4");
I use WebClient quite often, and learned quite quickly that websites can and will block your request if you don't provide a user agent string that matches a known web browser. Also, if you make up your own user agent string (eg "my super cool web scraper") you will also be blocked.
[Edit]
I changed my example user agent string to that of a modern version of Firefox. The original example I gave was the user agent string for IE6, which is not a good idea. Why? Some websites may perform filtering based on IE6 and send anyone with that browser a message, or redirect them to a different page, saying "Please update your browser" - this means you will not get the content you wanted to get.
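Putting that together for the Wikipedia page, a minimal sketch (the user agent string is the example from above):
using System;
using System.Net;

using (var client = new WebClient())
{
    // Provide a user agent string that matches a known browser.
    client.Headers.Add("user-agent",
        "Mozilla/5.0 (Windows; Windows NT 5.1; rv:1.9.2.4) Gecko/20100611 Firefox/3.6.4");

    string html = client.DownloadString(
        "http://en.wikipedia.org/wiki/List_of_communities_in_New_Brunswick");
    Console.WriteLine(html.Length);
}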