Fastest HTML Downloader - c#

I want the fastest way to download the HTML source of a given URL.
Is there any solution beyond the usual C# options (the WebClient download methods or HttpWebRequest/HttpWebResponse)
that speeds up fetching the HTML source?

I normally just use this function when downloading and viewing html.
string getHtml(string url)
{
    HttpWebRequest myWebRequest = (HttpWebRequest)WebRequest.Create(url);
    myWebRequest.Method = "GET";
    // Make the request; the using blocks release the connection even if reading fails
    using (HttpWebResponse myWebResponse = (HttpWebResponse)myWebRequest.GetResponse())
    using (StreamReader myWebSource = new StreamReader(myWebResponse.GetResponseStream()))
    {
        return myWebSource.ReadToEnd();
    }
}
http://www.devasp.net/net/articles/display/994.html

Related

Getting the Redirected URLs from the Original URL [duplicate]

Using the WebClient class I can get the title of a website easily enough:
WebClient x = new WebClient();
string source = x.DownloadString(s);
string title = Regex.Match(source,
@"\<title\b[^>]*\>\s*(?<Title>[\s\S]*?)\</title\>",
RegexOptions.IgnoreCase).Groups["Title"].Value;
I want to store the URL and the page title. However when following a link such as:
http://tinyurl.com/dbysxp
I'm clearly going to want to get the Url I'm redirected to.
QUESTIONS
Is there a way to do this using the WebClient class?
How would I do it using HttpResponse and HttpRequest?
If I understand the question, it's much easier than people are saying - if you want to let WebClient do all the nuts and bolts of the request (including the redirection), but then get the actual response URI at the end, you can subclass WebClient like this:
class MyWebClient : WebClient
{
Uri _responseUri;
public Uri ResponseUri
{
get { return _responseUri; }
}
protected override WebResponse GetWebResponse(WebRequest request)
{
WebResponse response = base.GetWebResponse(request);
_responseUri = response.ResponseUri;
return response;
}
}
Just use MyWebClient everywhere you would have used WebClient. After you've made whatever WebClient call you needed to do, then you can just use ResponseUri to get the actual redirected URI. You'd need to add a similar override for GetWebResponse(WebRequest request, IAsyncResult result) too, if you were using the async stuff.
I know this is already an answered question, but this works pretty well for me:
HttpWebRequest request = (HttpWebRequest)WebRequest.Create("http://tinyurl.com/dbysxp");
request.AllowAutoRedirect = false;
HttpWebResponse response = (HttpWebResponse)request.GetResponse();
string redirUrl = response.Headers["Location"];
response.Close();
//Show the redirected url
MessageBox.Show("You're being redirected to: "+redirUrl);
Cheers.! ;)
With an HttpWebRequest, you would set the AllowAutoRedirect property to false. When this happens, any response with a status code between 300-399 will not be automatically redirected.
You can then get the new url from the response headers and then create a new HttpWebRequest instance to the new url.
With the WebClient class, I doubt you can change it out-of-the-box so that it does not allow redirects. What you could do is derive a class from the WebClient class and then override the GetWebRequest and the GetWebResponse methods to alter the WebRequest/WebResponse instances that the base implementation returns; if it is an HttpWebRequest, then set the AllowAutoRedirect property to false. On the response, if the status code is in the range of 300-399, then issue a new request.
However, I don't know that you can issue a new request from within the GetWebRequest/GetWebResponse methods, so it might be better to just have a loop that executes with HttpWebRequest/HttpWebResponse until all the redirects are followed.
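The loop described above could be sketched roughly as follows. This is only an illustration under some assumptions: the names RedirectFollower, GetLocation and Resolve are made up for this example, the hop limit is arbitrary, and the lookup of the Location header is passed in as a delegate so that the loop itself can be exercised without touching the network.

```csharp
using System;
using System.Net;

static class RedirectFollower
{
    // Hypothetical helper: performs one request without auto-redirect and
    // returns the Location header, or null if the response is not a 3xx.
    public static string GetLocation(Uri uri)
    {
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(uri);
        request.AllowAutoRedirect = false;
        using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
        {
            int status = (int)response.StatusCode;
            return (status >= 300 && status <= 399) ? response.Headers["Location"] : null;
        }
    }

    // Follows redirects one hop at a time until a hop yields no Location header.
    // maxHops is an assumed safety limit that guards against redirect loops.
    public static Uri Resolve(Uri start, Func<Uri, string> getLocation, int maxHops)
    {
        Uri current = start;
        for (int hop = 0; hop < maxHops; hop++)
        {
            string location = getLocation(current);
            if (string.IsNullOrEmpty(location))
                return current;
            // The Location value may be relative, so resolve it against the current URI.
            current = new Uri(current, location);
        }
        throw new InvalidOperationException("Too many redirects.");
    }
}
```

A call such as RedirectFollower.Resolve(new Uri(url), RedirectFollower.GetLocation, 10) would then return the final URI, assuming every hop answers a plain GET request.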
I got the Uri for the redirected page and the page contents.
HttpWebRequest request = (HttpWebRequest)WebRequest.Create(strUrl);
request.AllowAutoRedirect = true;
HttpWebResponse response = (HttpWebResponse)request.GetResponse();
Stream dataStream = response.GetResponseStream();
strLastRedirect = response.ResponseUri.ToString();
StreamReader reader = new StreamReader(dataStream);
string strResponse = reader.ReadToEnd();
response.Close();
In case you are only interested in the redirect URI you can use this code:
public static string GetRedirectUrl(string url)
{
HttpWebRequest request = (HttpWebRequest) HttpWebRequest.Create(url);
request.AllowAutoRedirect = false;
using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
{
return response.Headers["Location"];
}
}
The method will return
null - if there is no redirect
the value of the Location header, which may be a relative or an absolute URL - if there is a redirect
Please note: The using statement (or a final response.Close()) is essential. See MSDN Library for details. Otherwise you may run out of connections or get a timeout when executing this code multiple times.
HttpWebRequest.AllowAutoRedirect can be set to false. Then you'd have to manually handle HTTP status codes in the 300 range.
// Create a new HttpWebRequest Object to the mentioned URL.
HttpWebRequest myHttpWebRequest=(HttpWebRequest)WebRequest.Create("http://www.contoso.com");
myHttpWebRequest.MaximumAutomaticRedirections=1;
myHttpWebRequest.AllowAutoRedirect=true;
HttpWebResponse myHttpWebResponse=(HttpWebResponse)myHttpWebRequest.GetResponse();
The WebClient class follows redirects by default, so you should be fine.
Ok this is really hackish, but the key is to use the HttpWebRequest and then set the AllowAutoRedirect property to true.
Here's a VERY hacked together example
HttpWebRequest req = (HttpWebRequest)WebRequest.Create("http://tinyurl.com/dbysxp");
req.Method = "GET";
req.AllowAutoRedirect = true;
WebResponse response = req.GetResponse();
Stream responseStream = response.GetResponseStream();
// Content-Length header is not trustable, but makes a good hint.
// Responses longer than int size will throw an exception here!
int length = (int)response.ContentLength;
const int bufSizeMax = 65536; // max read buffer size conserves memory
const int bufSizeMin = 8192; // min size prevents numerous small reads
// Use Content-Length if between bufSizeMax and bufSizeMin
int bufSize = bufSizeMin;
if (length > bufSize)
bufSize = length > bufSizeMax ? bufSizeMax : length;
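// Allocate buffers and StringBuilder for reading response
byte[] buf = new byte[bufSize];
char[] chars = new char[Encoding.UTF8.GetMaxCharCount(bufSize)];
StringBuilder sb = new StringBuilder(bufSize);
// A stateful decoder keeps multi-byte UTF-8 sequences intact across reads
Decoder decoder = Encoding.UTF8.GetDecoder();
// Read response stream until end
while ((length = responseStream.Read(buf, 0, buf.Length)) != 0)
{
    int charCount = decoder.GetChars(buf, 0, length, chars, 0);
    sb.Append(chars, 0, charCount);
}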
string source = sb.ToString();
string title = Regex.Match(source,
@"\<title\b[^>]*\>\s*(?<Title>[\s\S]*?)\</title\>", RegexOptions.IgnoreCase).Groups["Title"].Value;

Open a WebResponse on Browser

I make a request to a page, and it works fine. I can also view the response page with Fiddler.
But how do I open this response in my browser?
Currently what I have:
Cookie cookie = new Cookie("test","this");
cookie.Domain = "foobar";
HttpWebRequest request = (HttpWebRequest) HttpWebRequest.Create("http://foobar/ReportServer/");
request.CookieContainer = new CookieContainer();
request.CookieContainer.Add(cookie);
WebResponse response = request.GetResponse();
Stream sr = response.GetResponseStream();
StreamReader sre = new StreamReader(sr);
string s = sre.ReadToEnd();
Response.Write(s);
Save it to an HTML file and open the browser with the path to that file.
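A rough sketch of that suggestion, assuming a desktop program where launching the default browser is acceptable. The class and method names (HtmlPreview, SaveToTempFile, OpenInDefaultBrowser) and the choice of the temp directory are made up for this example.

```csharp
using System;
using System.Diagnostics;
using System.IO;

static class HtmlPreview
{
    // Writes the HTML to a uniquely named .html file in the temp directory
    // and returns the full path of that file.
    public static string SaveToTempFile(string html)
    {
        string path = Path.Combine(Path.GetTempPath(), Guid.NewGuid().ToString("N") + ".html");
        File.WriteAllText(path, html);
        return path;
    }

    // Asks the operating system shell to open the file with whatever
    // application is registered for .html, usually the default browser.
    public static void OpenInDefaultBrowser(string path)
    {
        Process.Start(new ProcessStartInfo(path) { UseShellExecute = true });
    }
}
```

With the string s from the question, HtmlPreview.OpenInDefaultBrowser(HtmlPreview.SaveToTempFile(s)) would show the page. Relative links and images in the saved file will not resolve, because the file no longer lives on the original server.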
Because you have access to the response Stream, you can use the NavigateToStream method:
http://msdn.microsoft.com/en-us/library/system.windows.controls.webbrowser.navigatetostream(v=vs.100).aspx
this.webBrowser.NavigateToStream(sr);
You can use the System.Windows.Forms.WebBrowser control.

Web file image/video header

In C#, is it possible to detect if the web address of a file is an image, or a video? Is there such a header value for this?
I have the following code that gets the filesize of a web file:
System.Net.WebRequest req = System.Net.HttpWebRequest.Create("http://test.png");
req.Method = "HEAD";
using (System.Net.WebResponse resp = req.GetResponse())
{
int ContentLength;
if(int.TryParse(resp.Headers.Get("Content-Length"), out ContentLength))
{
//Do something useful with ContentLength here
}
}
Can this code be modified to see if a file is an image or a video?
Thanks in advance
What you're looking for is the "Content-Type" header
string uri = "http://assets3.parliament.uk/iv/main-large//ImageVault/Images/id_7382/scope_0/ImageVaultHandler.aspx.jpg";
HttpWebRequest request = (HttpWebRequest)WebRequest.Create(uri);
request.Method = "HEAD";
using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
{
var contentType = response.Headers["Content-Type"];
Console.WriteLine(contentType);
}
You can check resp.Headers.Get("Content-Type") on the response.
For example, it will be image/jpeg for a JPEG file.
See list of available content types.
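Once the Content-Type value is available, telling images from videos comes down to checking the top-level media type. A minimal sketch, assuming the server reports the type honestly. The MediaTypeCheck class and its return strings are made up for this example.

```csharp
using System;

static class MediaTypeCheck
{
    // Returns "image", "video", "other", or "unknown" for a raw Content-Type value.
    public static string Classify(string contentType)
    {
        if (string.IsNullOrEmpty(contentType))
            return "unknown";

        // Drop parameters such as "; charset=utf-8" and surrounding whitespace.
        string mediaType = contentType.Split(';')[0].Trim();

        if (mediaType.StartsWith("image/", StringComparison.OrdinalIgnoreCase))
            return "image";
        if (mediaType.StartsWith("video/", StringComparison.OrdinalIgnoreCase))
            return "video";
        return "other";
    }
}
```

In the HEAD request above, this would be called as MediaTypeCheck.Classify(response.Headers["Content-Type"]). Some servers do not return the same headers for HEAD as for GET, so a missing value is reported as "unknown" rather than treated as an error.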

Omit images from webpage requested through HttpWebRequest

I fetch webpages in order to feed data to my application. However, the pages contain a lot of images which I don't require at all. I only need the text data.
My problem is that the web requests take an unacceptable amount of time. I think the images are also fetched during a web request. Is there any way to eliminate the images and download only the text data?
The following is the code that I am using currently.
var httpWebRequest = HttpWebRequest.Create(url) as HttpWebRequest;
httpWebRequest.Method = "GET";
httpWebRequest.ProtocolVersion = HttpVersion.Version11;
httpWebRequest.Headers.Add(HttpRequestHeader.AcceptEncoding, "gzip,deflate");
httpWebRequest.AutomaticDecompression = DecompressionMethods.GZip | DecompressionMethods.Deflate;
httpWebRequest.Proxy = null;
httpWebRequest.KeepAlive = true;
httpWebRequest.Accept = "text/html";
string responseString = null;
var httpWebResponse = httpWebRequest.GetResponse() as HttpWebResponse;
using (var responseStream = httpWebResponse.GetResponseStream())
{
using (var streamReader = new StreamReader(responseStream))
{
responseString = streamReader.ReadToEnd();
}
}
Also, any other optimization suggestions are most welcome.
Your assumption is incorrect: the images are not fetched.
HttpWebRequest does not know anything about HTML or images; it just sends raw HTTP requests.
You can use Fiddler to see exactly what's going on.

How do you login to a webpage and retrieve its content in C#?

How do you login to a webpage and retrieve its content in C#?
That depends on what's required to log in. You could use a WebClient to send the login credentials to the server's login page (via whatever method is required, GET or POST), but by default that wouldn't persist a cookie. There is a way to get a WebClient to handle cookies, so you could just POST the login info to the server, then request the page you want with the same WebClient, then do whatever you want with the page.
Look at System.Net.WebClient, or for more advanced requirements System.Net.HttpWebRequest/System.Net.HttpWebResponse.
As for actually applying these: you'll have to study the html source of each page you want to scrape in order to learn exactly what Http requests it's expecting.
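One common way to get a WebClient to handle cookies is to derive from it and attach a shared CookieContainer to every request it creates. The sketch below shows that idea only. The class name CookieAwareWebClient is made up for this example, and the login form fields a real site expects still have to be read from its HTML.

```csharp
using System;
using System.Net;

class CookieAwareWebClient : WebClient
{
    // One container shared by all requests, so cookies set by the login
    // response are sent again on later requests from the same client.
    private readonly CookieContainer _cookies = new CookieContainer();

    public CookieContainer Cookies
    {
        get { return _cookies; }
    }

    protected override WebRequest GetWebRequest(Uri address)
    {
        WebRequest request = base.GetWebRequest(address);
        HttpWebRequest http = request as HttpWebRequest;
        if (http != null)
        {
            // Only HTTP requests carry cookies; other schemes are left untouched.
            http.CookieContainer = _cookies;
        }
        return request;
    }
}
```

Typical use would be to call UploadValues on the login URL with the form fields, then DownloadString on the protected page using the same instance.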
How do you mean "login"?
If the subfolder is protected on the OS level, and the browser pops up a login dialog when you go there, you will need to set the Credentials property on the HttpWebRequest.
If the website has its own cookie-based membership/login system, you will have to use HttpWebRequest to first post to the login form.
string postData = "userid=ducon";
postData += "&username=camarche" ;
byte[] data = Encoding.ASCII.GetBytes(postData);
WebRequest req = WebRequest.Create(
URL);
req.Method = "POST";
req.ContentType = "application/x-www-form-urlencoded";
req.ContentLength = data.Length;
Stream newStream = req.GetRequestStream();
newStream.Write(data, 0, data.Length);
newStream.Close();
StreamReader reader = new StreamReader(req.GetResponse().GetResponseStream(), System.Text.Encoding.GetEncoding("iso-8859-1"));
string coco = reader.ReadToEnd();
Use the WebClient class.
Dim Html As String
Using Client As New System.Net.WebClient()
Html = Client.DownloadString("http://www.google.com")
End Using
You can use the built-in WebClient object instead of creating the request yourself.
WebClient wc = new WebClient();
wc.Credentials = new NetworkCredential("username", "password");
string url = "http://foo.com";
try
{
using (Stream stream = wc.OpenRead(new Uri(url)))
{
using (StreamReader reader = new StreamReader(stream))
{
return reader.ReadToEnd();
}
}
}
catch (WebException e)
{
// Error handling
}
Try this:
public string GetContent(string url)
{
using (System.Net.WebClient client =new System.Net.WebClient())
{
return client.DownloadString(url);
}
}
