C# NET.WebClient DownloadString() Issue - Page redirects - c#

I have a problem: I am writing a simple web spider and it works well so far. The problem is that the site I am crawling has the nasty habit of redirecting or appending things to the address sometimes. On some pages it appends "/about" after you load them, and on some it redirects to another page entirely.
The WebClient gets confused: it downloads the HTML and starts parsing the links, but since many of them are relative, in the format "../../something", it crashes after a while, because it resolves each link against the first given address (the one before the redirect or the added "/about"). When the badly resolved page comes out of the queue, it throws a 404 Not Found exception (surprise).
I could just append "/about" to every page myself, but annoyingly, the website itself doesn't always add it...
I would appreciate any ideas.
Thank you for your time and all the best!

If you want to get the redirected URI of a page so you can resolve the links inside it, use a subclass of WebClient like this:
class MyWebClient : WebClient
{
    private Uri _responseUri;

    public Uri ResponseUri
    {
        get { return _responseUri; }
    }

    protected override WebResponse GetWebResponse(WebRequest request)
    {
        WebResponse response = base.GetWebResponse(request);
        _responseUri = response.ResponseUri;
        return response;
    }
}
Now use MyWebClient instead of WebClient and resolve the parsed links against ResponseUri.
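For example, once a page has been downloaded, each relative link can be resolved against ResponseUri with the Uri(Uri, string) constructor instead of the originally requested address. A minimal sketch (the URLs and the sample relative link are illustrative, not from the real site):

```csharp
using System;
using System.Net;

class MyWebClient : WebClient
{
    private Uri _responseUri;
    public Uri ResponseUri { get { return _responseUri; } }

    protected override WebResponse GetWebResponse(WebRequest request)
    {
        WebResponse response = base.GetWebResponse(request);
        _responseUri = response.ResponseUri; // the post-redirect address
        return response;
    }
}

class Program
{
    static void Main()
    {
        var client = new MyWebClient();
        string html = client.DownloadString("http://example.com/page");

        // Suppose the server redirected us to http://example.com/a/b/page/about.
        // Resolving a relative link against the *final* URI gives the right target:
        Uri link = new Uri(client.ResponseUri, "../../something");
        Console.WriteLine(link);
    }
}
```

The Uri(Uri, string) constructor performs standard relative-reference resolution, so "../../something" is interpreted relative to the page the server actually served, not the page that was originally requested.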

Related

Scraping multiple lists from a website.

I'm currently working on a web scraper for a website that displays a table of data. The problem I'm running into is that the website doesn't sort my searches by state on the first search; I have to do it through the drop-down menu on the second page when it loads. I load the first page with what I believe to be a WebClient POST request. I get the proper HTML response and can parse through it, but when I try to load the more filtered search, the HTML I get back is incorrect compared to what I see in the Chrome developer tools.
Here's my code
// The website I'm looking at.
public string url = "https://www.missingmoney.com/Main/Search.cfm";

// The POST data for the search that works, but doesn't filter by state.
public string myPara1 = "hJava=Y&SearchFirstName=Jacob&SearchLastName=Smith&HomeState=MN&frontpage=1&GO.x=19&GO.y=18&GO=Go";

// The POST data that also filters by state, but doesn't return the correct HTML that I would need to parse.
public string myPara2 = "hJava=Y&SearchLocation=1&SearchFirstName=Jacob&SearchMiddleName=&SearchLastName=Smith&SearchCity=&SearchStateID=MN&GO.x=17&GO.y=14&GO=Go";

// I save the two HTML responses in these.
public string htmlResult1;
public string htmlResult2;

public void LoadHtml(string firstName, string lastName)
{
    using (WebClient client = new WebClient())
    {
        client.Headers[HttpRequestHeader.ContentType] = "application/x-www-form-urlencoded";
        htmlResult1 = client.UploadString(url, myPara1);
        htmlResult2 = client.UploadString(url, myPara2);
    }
}
I'm just trying to figure out why it works the first time I pass in my parameters, but not the second time.
Thank you for the time you spent looking at this!

I simply forgot to add the cookie to the new search. Using Google Chrome or Fiddler you can see the web traffic. All I needed to do was add
client.Headers.Add(HttpRequestHeader.Cookie, "cookie");
to my code right before the second upload. Doing so gave me the right HTML response and I can now parse through my data.
@derloopkat pointed it out, credits to that individual!
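A sketch of that flow, assuming the server hands back a single session cookie via Set-Cookie on the first response. Note that WebClient resets its request headers after each call, so both the content type and the cookie have to be set again before the second upload, and the raw Set-Cookie value should be trimmed of attributes such as "; path=/" before being replayed:

```csharp
using System;
using System.Net;

class Program
{
    static void Main()
    {
        string url = "https://www.missingmoney.com/Main/Search.cfm";
        string myPara1 = "hJava=Y&SearchFirstName=Jacob&SearchLastName=Smith&HomeState=MN&frontpage=1&GO.x=19&GO.y=18&GO=Go";
        string myPara2 = "hJava=Y&SearchLocation=1&SearchFirstName=Jacob&SearchMiddleName=&SearchLastName=Smith&SearchCity=&SearchStateID=MN&GO.x=17&GO.y=14&GO=Go";

        using (var client = new WebClient())
        {
            client.Headers[HttpRequestHeader.ContentType] = "application/x-www-form-urlencoded";
            string htmlResult1 = client.UploadString(url, myPara1);

            // Grab the session cookie the server set on the first response and
            // strip attributes such as "; path=/" before sending it back.
            string setCookie = client.ResponseHeaders[HttpResponseHeader.SetCookie];
            string cookie = setCookie.Split(';')[0];

            // WebClient clears request headers after each call, so re-set both.
            client.Headers[HttpRequestHeader.ContentType] = "application/x-www-form-urlencoded";
            client.Headers[HttpRequestHeader.Cookie] = cookie;
            string htmlResult2 = client.UploadString(url, myPara2);
        }
    }
}
```

If the server sets several cookies, ResponseHeaders folds them into one comma-separated string, so a real implementation would need to split and re-join them; the simple first-value trim above is only safe for a single cookie.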

How to retrieve HTML Page without getting redirected?

I want to scrape the HTML of a website. When I access this website with my browser (no matter whether it is Chrome or Firefox), I have no problem accessing the website and its HTML.
When I try to fetch the HTML with C# using methods like HttpWebRequest and HtmlAgilityPack, the website redirects me to another page, and thus I parse the HTML of the redirected page.
Any idea how to solve this problem?
I thought the site recognises my program as a bot and redirects immediately, so I tried using Selenium with a ChromeDriver and a FirefoxDriver, but no luck; I get redirected immediately.
The Website: https://www.jodel.city/7700#!home
private void bt_load_Click(object sender, EventArgs e)
{
    var url = @"https://www.jodel.city/7700#!home";
    var req = (HttpWebRequest)WebRequest.Create(url);
    req.AllowAutoRedirect = false;
    // req.Referer = "http://www.muenchen.de/";
    var resp = req.GetResponse();
    StreamReader sr = new StreamReader(resp.GetResponseStream());
    string returnedContent = sr.ReadToEnd();
    Console.WriteLine(returnedContent);
    return;
}
And of course, cookies are to blame again, because cookies are great and amazing.
So, let's look at what happens in Chrome the first time you visit the site:
(I went to https://www.jodel.city/7700#!home):
Yes, I got a 302 redirect, but I also got told by the server to set a __cfduid cookie (twice actually).
When you visit the site again, you are correctly let into the site:
Notice how this time a __cfduid cookie was sent along? That's the key here.
Your C# code needs to:
Go to the site once, get redirected, but obtain the cookie value from the response header.
Go BACK to the site with the correct cookie value in the request header.
You can go to the first link in this post to see an example of how to set cookie values for requests.
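A sketch of that two-step dance, using HttpWebRequest with a shared CookieContainer so the __cfduid cookie from the first (redirected) response is automatically replayed on the second request. The URL is the one from the question; this is untested against the live site:

```csharp
using System;
using System.IO;
using System.Net;

class Program
{
    static void Main()
    {
        var url = "https://www.jodel.city/7700#!home";
        var cookies = new CookieContainer(); // shared across both requests

        // First request: we get the 302 plus the Set-Cookie headers;
        // the container stores the __cfduid cookie for us.
        var first = (HttpWebRequest)WebRequest.Create(url);
        first.CookieContainer = cookies;
        first.AllowAutoRedirect = false;
        first.GetResponse().Close();

        // Second request: the stored cookie is sent along, so the server
        // lets us through to the real page.
        var second = (HttpWebRequest)WebRequest.Create(url);
        second.CookieContainer = cookies;
        using (var resp = second.GetResponse())
        using (var sr = new StreamReader(resp.GetResponseStream()))
        {
            Console.WriteLine(sr.ReadToEnd());
        }
    }
}
```

Using a CookieContainer avoids copying header values by hand: any cookie the server sets on one request is automatically attached to later requests for matching domains and paths.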

Downloading a file from a redirection page

How would I go about downloading a file from a redirection page (which itself does some calculations based on the user)?
For example, if I wanted the user to download a game, I would use WebClient and do something like:
client.DownloadFile("http://game-side.com/downloadfetch/");
It's not as simple as doing
client.DownloadFile("http://game-side.com/download.exe");
But if the user were to click on the first link in a browser, it would redirect and download the file.
As far as I know, this isn't possible with DownloadFile().
You could use this
HttpWebRequest myHttpWebRequest = (HttpWebRequest)WebRequest.Create("http://game-side.com/downloadfetch/");
myHttpWebRequest.MaximumAutomaticRedirections = 1;
myHttpWebRequest.AllowAutoRedirect = true;
HttpWebResponse myHttpWebResponse = (HttpWebResponse)myHttpWebRequest.GetResponse();
See also
Download file through code that has a redirect?
I think you should go with a slightly customized WebClient class like this. It will follow 3xx redirects:
public class MyWebClient : WebClient
{
    protected override WebResponse GetWebResponse(WebRequest request)
    {
        (request as HttpWebRequest).AllowAutoRedirect = true;
        WebResponse response = base.GetWebResponse(request);
        return response;
    }
}
...
WebClient client = new MyWebClient();
client.DownloadFile("http://game-side.com/downloadfetch/", "download.zip");

How to download file from url that redirects?

I'm trying to download a file from a link that doesn't contain the file itself; instead it redirects to another (temporary) link that contains the actual file. The objective is to get an updated copy of the program without having to open a browser. The link is:
http://www.bleepingcomputer.com/download/minitoolbox/dl/65/
I've tried to use WebClient, but it won't work:
private void Button1_Click(object sender, EventArgs e)
{
    WebClient webClient = new WebClient();
    webClient.DownloadFileCompleted += new AsyncCompletedEventHandler(Completed);
    webClient.DownloadFileAsync(new Uri("http://www.bleepingcomputer.com/download/minitoolbox/dl/65/"), @"C:\Downloads\MiniToolBox.exe");
}
After searching and trying many things, I found this solution that involves using HttpWebRequest.AllowAutoRedirect:
Download file through code that has a redirect?
// Create a new HttpWebRequest object for the mentioned URL.
HttpWebRequest myHttpWebRequest = (HttpWebRequest)WebRequest.Create("http://www.contoso.com");
myHttpWebRequest.MaximumAutomaticRedirections = 1;
myHttpWebRequest.AllowAutoRedirect = true;
HttpWebResponse myHttpWebResponse = (HttpWebResponse)myHttpWebRequest.GetResponse();
It seems that's exactly what I'm looking for, but I simply don't know how to use it. :/
I guess the link is a parameter of WebRequest.Create, but how can I save the file to my directory? Yes, I'm a noob... Thanks in advance for your help.
I switched from a WebClient-based approach to an HttpWebRequest too, because auto redirects didn't seem to be working with WebClient. I was using code similar to yours but could never get it to work; it never redirected to the actual file. Looking in Fiddler, I could see I wasn't actually getting the final redirect.
Then I came across some code for a custom version of WebClient in this question:
class CustomWebclient : WebClient
{
    public CookieContainer cookieContainer = new CookieContainer();

    [System.Security.SecuritySafeCritical]
    public CustomWebclient() : base()
    {
    }

    protected override WebRequest GetWebRequest(Uri myAddress)
    {
        WebRequest request = base.GetWebRequest(myAddress);
        if (request is HttpWebRequest)
        {
            (request as HttpWebRequest).CookieContainer = cookieContainer;
            (request as HttpWebRequest).AllowAutoRedirect = true;
        }
        return request;
    }
}
The key part in that code is AllowAutoRedirect = true, it's supposed to be on by default according to the documentation, which states:
AllowAutoRedirect is set to true in WebClient instances.
but that didn't seem to be the case when I was using it.
I also needed the CookieContainer part for this to work with the SharePoint external URLs we were trying to access.
I guess the easy option is simply this (after what you've got there, and with the URL you provided in place of http://www.contoso.com):
using (var responseStream = myHttpWebResponse.GetResponseStream())
{
    using (var fileStream = new FileStream(Path.Combine("folder_here", "filename_here"), FileMode.Create))
    {
        responseStream.CopyTo(fileStream);
    }
}
EDIT:
In fact, this won't work. It isn't an HTTP redirect that triggers the download. Look at the source of that page and you'll see this:
<meta http-equiv="refresh" content="3; url=http://download.bleepingcomputer.com/dl/1f92ae2ecf0ba549294300363e9e92a8/52ee41aa/windows/security/security-utilities/m/minitoolbox/MiniToolBox.exe">
It basically uses the browser's meta refresh to do the redirect, so following HTTP redirects alone won't get you the file.
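That said, a meta refresh can still be followed by hand: download the page, pull the url= value out of the <meta http-equiv="refresh"> tag with a regular expression, and download that. A hedged sketch (the regex only handles the simple tag shape shown above, and the download path is illustrative):

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

class Program
{
    static void Main()
    {
        using (var client = new WebClient())
        {
            string page = client.DownloadString("http://www.bleepingcomputer.com/download/minitoolbox/dl/65/");

            // Look for: <meta http-equiv="refresh" content="3; url=...">
            Match m = Regex.Match(page,
                @"http-equiv=""refresh""\s+content=""\d+;\s*url=(?<url>[^""]+)""",
                RegexOptions.IgnoreCase);
            if (m.Success)
            {
                string fileUrl = m.Groups["url"].Value;
                client.DownloadFile(fileUrl, @"C:\Downloads\MiniToolBox.exe");
            }
        }
    }
}
```

This is brittle by nature: attribute order, quoting, and whitespace all vary in the wild, so an HTML parser like HtmlAgilityPack would be a sturdier way to find the meta tag.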

URL Shortener is not previewing the target page

I'm creating an internal (links only from our site) URL shortening service. When I use a service like bit.ly or tinyurl and then post the shortened link to Facebook, a preview of the target (the full link) is displayed. When I try to do this with my own page, it displays the redirection page instead.
For example, http://youtu.be/2323 would map to http://www.youtube.com/watch?v=123456, but my link
http://exam.pl/2323 will show http://exam.pl/Redirect.aspx instead of the actual page in the database. Do I need to do the redirection on the server itself or something?
Thanks
UPDATE: Solved with an HttpHandler as in the answer below. I changed the response because apparently Response.Redirect automatically sends a 302 status, whereas 301 is more correct.
context.Response.Status = "301 Moved Permanently";
context.Response.AddHeader("Location", httplocation);
context.Response.End();
I recommend using an HTTP handler instead of an actual page to do the redirect: http://support.microsoft.com/kb/308001
I also recommend you return a proper 301 HTTP status: http://en.wikipedia.org/wiki/HTTP_301
Update: (this is purely from memory and may not compile as-is)
public class IISHandler1 : IHttpHandler
{
    public bool IsReusable
    {
        get { return true; }
    }

    public void ProcessRequest(HttpContext context)
    {
        string url = context.Request.Url.ToString();
        string newUrl = Translate(url); // look the short code up in your database

        // Response.Redirect would force a 302, so set the 301 status
        // and Location header explicitly.
        context.Response.Status = "301 Moved Permanently";
        context.Response.AddHeader("Location", newUrl);
        context.Response.End();
    }
}
Your modules get processed after handlers, so you should handle the request in a handler. If it's not possible to handle, then just ignore it and let it pass through.
