WebClient forbids opening wikipedia page? - c#

Here's the code I'm trying to run:
var wc = new WebClient();
var stream = wc.OpenRead(
"http://en.wikipedia.org/wiki/List_of_communities_in_New_Brunswick");
But I keep getting a 403 Forbidden error, and I don't understand why. It worked fine for other pages, and I can open the page in my browser. How can I fix this?

I wouldn't normally use OpenRead(); try DownloadData() or DownloadString() instead.
Also, Wikipedia might be deliberately blocking your request because you have not provided a user agent string:
WebClient client = new WebClient();
client.Headers.Add("user-agent",
"Mozilla/5.0 (Windows; Windows NT 5.1; rv:1.9.2.4) Gecko/20100611 Firefox/3.6.4");
I use WebClient quite often, and learned quickly that websites can and will block your request if you don't provide a user agent string that matches a known web browser. Some sites will also block you if you make up your own user agent string (e.g. "my super cool web scraper").
[Edit]
I changed my example user agent string to that of a modern version of Firefox. The original example I gave was the user agent string for IE6, which is not a good idea. Why? Some websites filter on IE6 and either show anyone with that browser a "Please update your browser" message or send them to a different page, which means you will not get the content you wanted.
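Putting the two suggestions together (a user agent header plus DownloadString()), a minimal sketch might look like the following. The class and method names (WikiFetch, CreateBrowserLikeClient) are made up for illustration, and the user agent is just the example string from above; whether a given site accepts it is up to that site.

```csharp
using System;
using System.Net;

static class WikiFetch
{
    // Hypothetical helper: returns a WebClient that sends a browser-like User-Agent header.
    public static WebClient CreateBrowserLikeClient()
    {
        var client = new WebClient();
        client.Headers.Add("user-agent",
            "Mozilla/5.0 (Windows; Windows NT 5.1; rv:1.9.2.4) Gecko/20100611 Firefox/3.6.4");
        return client;
    }

    static void Main()
    {
        using (var client = CreateBrowserLikeClient())
        {
            try
            {
                string html = client.DownloadString(
                    "http://en.wikipedia.org/wiki/List_of_communities_in_New_Brunswick");
                Console.WriteLine("Downloaded {0} characters", html.Length);
            }
            catch (WebException ex)
            {
                // A 403 (or any other HTTP error status) surfaces as a WebException.
                Console.WriteLine("Request failed: " + ex.Message);
            }
        }
    }
}
```

Catching WebException makes it easier to see whether the server is still rejecting the request after the header has been added.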

Related

C# WPF WebClient.DownloadString() not returning anything

I started off with the simple code below in order to grab the html from webpages into a string to later process. For some sites like Digikey it works but for others like Mouser it doesn't.
I have tried putting headers and userAgents onto the WebClient along with converting the url to a Uri with no success. Does anybody have any other suggestions of what I could try? Or could anybody try to get the code to work and let me know how it goes?
String url = "http://www.mouser.com/ProductDetail/Vishay-Thin-Film/PCNM2512E1000BST5/?qs=sGAEpiMZZMu61qfTUdNhG6MW4lgzyHBgo9k7HJ54G4u10PG6pMa7%252bA%3d%3d";
WebClient web = new WebClient();
String html = web.DownloadString(url);
MessageBox.Show(html);
EDIT : The link should lead here: link
EDIT : I tried the following chunk of code with no luck:
String url = "http://www.mouser.com/ProductDetail/Vishay-Thin-Film/PCNM2512E1000BST5/?qs=sGAEpiMZZMu61qfTUdNhG6MW4lgzyHBgo9k7HJ54G4u10PG6pMa7%252bA%3d%3d";
WebClient web = new WebClient();
web.Headers[HttpRequestHeader.UserAgent] = "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.2 (KHTML, like Gecko) Chrome/15.0.874.121 Safari/535.2";
String html = web.DownloadString(url);
MessageBox.Show(html);
You need to download Fiddler. It's free (it was originally developed at Microsoft) and it lets you record browser sessions. Launch it, open Chrome or whatever your browser is, and go through the steps. Once you are done, stop the capture and look at every request and response, including the raw data sent.
That makes it easy to spot the difference between your code and the browser.
There are also many free tools that will take your request/response data and generate the C# code for you, such as Request To Code. That is not the only one; I'm not at work and can't recall the one I use there, but there are plenty to choose from.
Hope this helps.

WebClient.DownloadString returning no data

I'm experiencing a strange issue with WebClient.DownloadString that I can't seem to solve. My code:
Dim client As New WebClient()
Dim html = client.DownloadString("http://www.btctrade.com/")
The content doesn't seem to be fully AJAX, so it can't be that. Is it due to the web page being in Chinese? I'm guessing HTML is just served as HTML, so it can't really be that either. The URL is fine when I go to it, and there seem to be no redirects to https either.
Anyone know why this is happening?
You must set the cookies and the user agent in the WebClient headers. This works (the cookie values come from one browser session, so substitute your own):
client.Headers.Add(HttpRequestHeader.UserAgent, "Mozilla/5.0 (Windows NT 5.1; rv:14.0) Gecko/20100101 Firefox/14.0.1");
client.Headers.Add(HttpRequestHeader.Cookie, "USER_PW=9b1283bfe37ac47b243a1e0c9c1c9e52; PHPSESSID=f692406a0c84dba2605a7065d55a3b53");
If you want the code to obtain the cookies by itself, you have to use HttpWebRequest: make a first request, save the cookies from the response headers, and send them with a new request.
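The second suggestion, letting the request collect the cookies itself, could be sketched roughly as below. This is only an illustration under assumptions: it uses a CookieContainer shared between two HttpWebRequest calls instead of copying headers by hand, the class and method names (CookieFetch, CreateRequest, ReadBody) are made up, and whether the site accepts the result depends on what else it checks.

```csharp
using System;
using System.IO;
using System.Net;

static class CookieFetch
{
    const string UserAgent =
        "Mozilla/5.0 (Windows NT 5.1; rv:14.0) Gecko/20100101 Firefox/14.0.1";

    // Hypothetical helper: builds a GET request that shares the given cookie jar.
    public static HttpWebRequest CreateRequest(string url, CookieContainer cookies)
    {
        var request = (HttpWebRequest)WebRequest.Create(url);
        request.UserAgent = UserAgent;
        request.CookieContainer = cookies;
        return request;
    }

    // Sends the request and returns the response body as text.
    static string ReadBody(HttpWebRequest request)
    {
        using (var response = (HttpWebResponse)request.GetResponse())
        using (var reader = new StreamReader(response.GetResponseStream()))
        {
            return reader.ReadToEnd();
        }
    }

    static void Main()
    {
        var cookies = new CookieContainer();

        // First request: cookies set by the server are stored in the container.
        ReadBody(CreateRequest("http://www.btctrade.com/", cookies));

        // Second request: the stored cookies are sent back automatically.
        string html = ReadBody(CreateRequest("http://www.btctrade.com/", cookies));
        Console.WriteLine("Downloaded {0} characters", html.Length);
    }
}
```

Reusing one CookieContainer keeps the session cookies out of the source code, so nothing needs updating when the server issues new ones.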
WebClient is not buggy, so probably the server is returning data you did not expect. Use Fiddler to watch what happens when you go to the site in a web browser.
When I executed your code the web site returned no data. When I visited the site in a web browser it returned data. Probably, the site is detecting that you are a bot and denying you access. Fake being a browser by mimicking what you see in Fiddler.

HttpWebRequest POST data

Is it possible to make exactly the same POST with HttpWebRequest in C# as a browser would, without the page being able to detect that it is not actually a browser?
If so, where could I read up more on that?
Download and become familiar with a tool like Fiddler. It allows you to inspect web requests made from applications, like a normal browser, and see exactly what is being sent. You can then emulate the data being sent with a request created in C#, providing values for headers, cookies, etc.
I think this is doable.
Browser detection is done based on a header in the request. All you need to do is set that header. With HttpWebRequest you don't set it through the Headers collection but through the .UserAgent property.
E.g.:
.UserAgent = "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)";
There is quite a lot to user agents. Check this link for the complete list of User-Agents.
Useful Links:
How to create a simple proxy in C#?
Is WebRequest The Right C# Tool For Interacting With Websites?
http://codehelp.smartdev.eu/2009/05/08/improve-webclient-by-adding-useragent-and-cookies-to-your-requests/
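As a rough illustration of the advice above, a form POST with a browser-like user agent might be put together as follows. The URL, the form field names, and the class and method names (FormPost, CreatePost, Send) are placeholders invented for this sketch; in practice the headers and body would be copied from what Fiddler shows the browser sending.

```csharp
using System;
using System.IO;
using System.Net;
using System.Text;

static class FormPost
{
    // Hypothetical helper: prepares a form POST that identifies itself like a browser.
    public static HttpWebRequest CreatePost(string url)
    {
        var request = (HttpWebRequest)WebRequest.Create(url);
        request.Method = "POST";
        request.ContentType = "application/x-www-form-urlencoded";
        request.UserAgent = "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)";
        return request;
    }

    // Writes the form body, sends the request and returns the response text.
    public static string Send(HttpWebRequest request, string formBody)
    {
        byte[] payload = Encoding.UTF8.GetBytes(formBody);
        request.ContentLength = payload.Length;
        using (Stream body = request.GetRequestStream())
        {
            body.Write(payload, 0, payload.Length);
        }
        using (var response = (HttpWebResponse)request.GetResponse())
        using (var reader = new StreamReader(response.GetResponseStream()))
        {
            return reader.ReadToEnd();
        }
    }

    static void Main()
    {
        // Placeholder URL and fields; replace with the values captured in Fiddler.
        var request = CreatePost("http://example.com/login");
        Console.WriteLine(Send(request, "user=alice&lang=en"));
    }
}
```

Other headers the browser sends, such as the referrer or cookies, can be added in the same place if the comparison in Fiddler shows they matter.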

Http Post with Partial URL in C#.NET

I have a web application (which I have no control over) that I need to send an HTTP POST to programmatically. Currently I'm using HttpWebRequest like this:
HttpWebRequest request = (HttpWebRequest)WebRequest.Create("https://someserver.com/blah/blah.aspx");
However the application was returning a "Unknown Server Error (not the IIS error, a custom application error page)" when posting to data. Using Fiddler to compare my Post vs IE post I can see the only difference is in the POST line of the request:
In Internet Explorer Fiddler (RAW view) shows traffic
POST /blah/blah.aspx HTTP/1.1
In my C# program fiddler (RAW view) records traffic as
POST https://someserver.com/blah/blah.aspx HTTP/1.1
This is the only difference between the two requests.
From what I've researched so far, it seems there is no way to make HttpWebRequest.Create post the relative URL. Note: I see many posts on "how to use relative URLs", but these suggestions do not work, as the actual post is still done using an absolute URL (when you sniff the HTTP traffic).
What is simplest way to accomplish this post with relative URL?
(Traffic is NOT going through a proxy)
Update: For the time being I'm using IE automation to do scheduled perf test, instead of method above. I might look at another scripting language as I did want to test without any browser.
No, you can't do a POST without the server in the URL.
One possible reason your program fails is that it does not use the correct proxy and as a result can't resolve the server name.
Note: Fiddler shows the path and the host separately in the view you are talking about.
Configure your program to use Fiddler as a proxy (127.0.0.1:8888) and compare the requests you are making with the browser's. Don't forget to switch Fiddler to "show all processes".
Here is an article on configuring Fiddler for different types of environment, including C# code: Fiddler: Configuring clients
objRequest = (HttpWebRequest)WebRequest.Create(url);
objRequest.Proxy = new WebProxy("127.0.0.1", 8888);

Download js generated html with C#

There is a reports website whose content I want to parse in C#. I tried downloading the HTML with WebClient, but then I don't get the complete source, since most of it is generated via JS when I visit the website.
I tried using WebBrowser but couldn't get it to work in a console app, even after using Application.Run() and SetApartmentState(ApartmentState.STA).
Is there another way to access this generated html? I also took a look into mshtml but couldn't figure it out.
Thanks
The JavaScript is executed by the browser. If your console app gets the JS, then it is working as expected, and what you really need is for your console app to execute the JS code that was downloaded.
You can use a headless browser; XBrowser may serve.
If not, try HtmlUnit as described in this blog post.
Just a comment here. There shouldn't be any difference between performing an HTTP request with some C# code and the request generated by a browser. If the target web page is getting confused and not generating the correct markup because it can't make heads or tails of the type of browser it thinks it's serving, then maybe all you have to do is set the user agent like so:
((HttpWebRequest)myWebClientRequest).UserAgent = "<a valid user agent>";
For example, my current user agent is:
Mozilla/5.0 (Windows NT 6.1; WOW64; rv:9.0.1) Gecko/20100101 Firefox/9.0.1
Maybe once you do that the page will work correctly. There may be other factors at work here, such as the referrer and so on, but I would try this first and see if it works.
Your best bet is to abandon the console app route and build a Windows Forms application. In that case the WebBrowser will work without any work needed.
