This is a little bit tricky but this is how it goes.
Page loads
It executes some JavaScript which generates more HTML, and that generated source is the one I need.
Now I see I can't just use an HTML parser, because a parser has no way to actually run the code.
Using plain HTTP I can get the initial source, but the JavaScript is never executed, so I never get the source I need.
What is the best way to retrieve that code generated afterwards?
Edit: I am trying to avoid using a hidden web browser. It is actually possible with one, since it works as a JavaScript interpreter here, but it is a very slow and very ugly way.
Edit2: Added code
static private string _InetReadEx(string sUrl)
{
    HttpWebRequest webReq = (HttpWebRequest)WebRequest.Create(sUrl);
    try
    {
        webReq.CookieContainer = new CookieContainer();
        webReq.Method = "GET";
        using (WebResponse response = webReq.GetResponse())
        using (Stream stream = response.GetResponseStream())
        using (StreamReader reader = new StreamReader(stream))
        {
            return reader.ReadToEnd();
        }
    }
    catch (Exception)
    {
        return string.Empty;
    }
}
Unless you use the WebBrowser control, which you mentioned you want to avoid, there is no other convenient way.
You can mimic the behavior of the JavaScript that runs, execute it yourself, and then format the result as the WebBrowser does, but this will not be dynamic formatting and is thus much less desirable.
Related
I use the following code to download text (json):
var request = WebRequest.Create(url);
using (var response = request.GetResponse())
{
    string charset = null;
    var httpResponse = response as HttpWebResponse;
    if (httpResponse != null)
    {
        if (httpResponse.StatusCode != HttpStatusCode.OK)
        {
            throw new System.Net.WebException("Status code was: " + httpResponse.StatusCode);
        }
        charset = httpResponse.CharacterSet;
    }
    Encoding enc = charset != null ? Encoding.GetEncoding(charset) : null;
    using (var reader = new StreamReader(response.GetResponseStream(), enc, true))
    {
        return reader.ReadToEnd();
    }
}
On Windows (.NET) it works fine. On Linux (Mono runtime) it sometimes returns truncated data: the JSON parser crashes because it can't find the closing delimiter for strings, and similar errors. It is not a problem with the parser: I have tried two different ones. It does not seem to be a problem with encoding either, because it sometimes works and sometimes doesn't for the exact same downloaded data.
Why would mono behave this way and how can I avoid this problem?
Edit: I added a console print for debugging purposes. The string coming directly from the code above is definitively truncated.
Edit2: Here is how I use the result:
string json = DownloadTextFile(url);
dynamic obj = Json.Decode(json);//Decoding fails here, because string is truncated.
The problem occurs much less frequently when I let the program run on a server with a very good connection to the net (after a few thousand downloads, instead of after a few hundred). That is good enough for my purposes.
Checking the content length does not help much, because it is -1 more often than not. It is sad that network code is implemented so poorly in Mono. (On .NET the same code works flawlessly even with a bad connection.)
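A workaround worth trying (my sketch, not from the thread): read the raw bytes yourself until Read returns 0 and decode everything in one pass at the end. This may or may not dodge the Mono issue, but it at least rules out StreamReader's buffering and partial multi-byte characters as causes. The MemoryStream here only makes the loop testable; in the real code the input would be `response.GetResponseStream()`:

```csharp
using System;
using System.IO;
using System.Text;

static class DownloadHelper
{
    // Reads a stream to the end in fixed-size chunks and decodes the
    // accumulated bytes once, so no partial multi-byte character or
    // early reader exit can silently truncate the result.
    public static string ReadAllText(Stream input, Encoding encoding)
    {
        using (var buffered = new MemoryStream())
        {
            var buffer = new byte[8192];
            int read;
            // Read returning 0 is the only reliable end-of-stream signal.
            while ((read = input.Read(buffer, 0, buffer.Length)) > 0)
            {
                buffered.Write(buffer, 0, read);
            }
            return encoding.GetString(buffered.ToArray());
        }
    }
}
```

Call it with the response stream and the encoding derived from `CharacterSet` as in the code above.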
Pretty long question;
How can I do the following in C#:
Open a web page (Preferably not visible)
Check whether the page redirects to a different page (Site is down, 404, etc.)
Check if the title is not equal to a said string
Then separately, (They need to click a confirm button)
open their browser, and go to the address of the first (It'll be the only one) hyperlink on the site.
I literally have been looking on Google for ages and haven't found anything similar to what I need.
Whether you give me a link to a site with a tutorial on this area of programming or actual source code doesn't make a difference to me.
Check out the WebRequest class; it can do redirection :) Then you can just parse the HTML and find the title tag using XPath or something.
Sort of like this:
using System.Xml;
using System.Xml.XPath;
using System.Xml.Linq;
using System.Net;
...
HttpWebRequest myReq = (HttpWebRequest)WebRequest.Create("http://www.contoso.com/");
myReq.AllowAutoRedirect = true;
myReq.MaximumAutomaticRedirections = 5;

XNode result;
using (var responseStream = myReq.GetResponse().GetResponseStream())
{
    result = XElement.Load(responseStream);
}

var title = result.XPathSelectElement("//title").Value;
Obviously your XPath can (and probably should) be more sophisticated :) You can find out more on XPath here.
On a similar note, you can use XPath on the XML you get back to find the links and pick out the first one:
var links = result.XPathSelectElements( "//a" ).Select( linktag => linktag.Attribute( "href" ).Value );
When you eventually find the URL you want to open, you can use
System.Diagnostics.Process.Start( links.First() );
to get it to open in the browser. A nice aspect of this is that it will open whatever browser is the default for the client. It does have security implications, though: you should make sure that it's a URL and not an exe file or something.
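A hedged sketch of such a check (the helper name is mine, not from the answer): accept only absolute http/https URLs before handing anything to Process.Start:

```csharp
using System;

static class LinkGuard
{
    // Accepts only absolute http/https URLs, so a link pointing at a
    // local .exe or a file:// path is rejected before Process.Start.
    public static bool IsSafeWebUrl(string candidate)
    {
        Uri uri;
        return Uri.TryCreate(candidate, UriKind.Absolute, out uri)
            && (uri.Scheme == Uri.UriSchemeHttp || uri.Scheme == Uri.UriSchemeHttps);
    }
}
```

Then only call `Process.Start(url)` when `LinkGuard.IsSafeWebUrl(url)` is true.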
Also, it's possible that the HTML uses different capitalization for its elements; you'd have to deal with that when looking for links.
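One way to deal with that (a sketch assuming the page parses as well-formed XHTML) is to compare element names case-insensitively with LINQ to XML instead of relying on a case-sensitive XPath:

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

// XPath is case-sensitive, so <A HREF="..."> would be missed by "//a".
// Comparing local names with OrdinalIgnoreCase catches both spellings.
var doc = XElement.Parse(
    "<html><body><A HREF='http://example.com/first'>one</A>" +
    "<a href='http://example.com/second'>two</a></body></html>");

var links = doc.Descendants()
    .Where(e => string.Equals(e.Name.LocalName, "a", StringComparison.OrdinalIgnoreCase))
    .Select(e => e.Attributes()
        .First(a => string.Equals(a.Name.LocalName, "href", StringComparison.OrdinalIgnoreCase))
        .Value);

Console.WriteLine(links.First()); // http://example.com/first
```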
You could use WebRequest or HttpWebRequest, but if you want a browser UI you will need to use the WebBrowser control: http://msdn.microsoft.com/en-us/library/system.windows.forms.webbrowser.aspx
You will need to handle the completion event from the Navigate call which will load the page for you:
WebBrowser myWebBrowser = new WebBrowser();
myWebBrowser.DocumentCompleted += new WebBrowserDocumentCompletedEventHandler(myWebBrowser_DocumentCompleted);
myWebBrowser.Navigate("http://myurl.com/mypage.htm");
You can then implement your handler as follows, and interact with the WebBrowser UI as necessary... the DocumentText property contains the HTML of the currently loaded web page:
private void myWebBrowser_DocumentCompleted(object sender, WebBrowserDocumentCompletedEventArgs e)
{
    CheckHTMLConfirmAndRedirect(myWebBrowser.DocumentText);
}
Use HttpWebRequest and parse the response:
private static void method1()
{
    string strWORD = "pain";
    const string WORDWEBURI = "http://www.wordwebonline.com/search.pl?w=";

    HttpWebRequest request = (HttpWebRequest)WebRequest.Create(WORDWEBURI + strWORD.ToUpper());
    request.UserAgent = @"Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0)";
    request.ContentType = "text/html";

    HttpWebResponse response = (HttpWebResponse)request.GetResponse();
    StringBuilder sb = new StringBuilder();
    Stream resStream = response.GetResponseStream();
    byte[] buffer = new byte[8192];
    string tempString = null;
    int count = 0;
    do
    {
        // fill the buffer with data
        count = resStream.Read(buffer, 0, buffer.Length);

        // make sure we read some data
        if (count != 0)
        {
            // decode the bytes as UTF-8 text (note: a multi-byte character
            // split across two reads would be mangled; a StreamReader avoids that)
            tempString = Encoding.UTF8.GetString(buffer, 0, count);

            // continue building the string
            sb.Append(tempString);
        }
    }
    while (count > 0); // any more data to read?
    Console.Write(sb.ToString());
}
I have this method:
public RasImage Load(Stream stream);
if I want to load a url like:
string _url = "http://localhost/Application1/Images/Icons/hand.jpg";
How can I make this url in to a stream and pass it into my load method?
Here's one way. I don't really know if it's the best way or not, but it works.
// requires System.Net namespace
WebRequest request = WebRequest.Create(_url);
using (var response = request.GetResponse())
using (var stream = response.GetResponseStream())
{
    RasImage image = Load(stream);
}
UPDATE: It looks like in Silverlight, the WebRequest class has no GetResponse method; you've no choice but to do this asynchronously.
Below is some sample code illustrating how you might go about this. (I warn you: I wrote this just now, without putting much thought into how sensible it is. How you choose to implement this functionality would likely be quite different. Anyway, this should at least give you a general idea of what you need to do.)
WebRequest request = WebRequest.Create(_url);

IAsyncResult getResponseResult = request.BeginGetResponse(
    result =>
    {
        using (var response = request.EndGetResponse(result))
        using (var stream = response.GetResponseStream())
        {
            RasImage image = Load(stream);
            // Do something with image.
        }
    },
    null
);

Console.WriteLine("Waiting for response from '{0}'...", _url);
getResponseResult.AsyncWaitHandle.WaitOne();

Console.WriteLine("The stream has been loaded. Press Enter to quit.");
Console.ReadLine();
Dan's answer is a good one, though you're requesting from localhost. Is this a file you can access from the filesystem? If so, I think you should be able to just pass in a FileStream:
FileStream stream = new FileStream(@"\path\to\file", FileMode.Open);
I've have some code similar to this:
HttpWebRequest req;
HttpWebResponse response;
Stream receiveStream = null;
StreamReader readStream = null;

try
{
    req = (HttpWebRequest)WebRequest.Create("someUrl");
    req.Credentials = CredentialCache.DefaultCredentials;
    req.Method = "GET";

    response = (HttpWebResponse)req.GetResponse();
    receiveStream = response.GetResponseStream();
    readStream = new StreamReader(receiveStream, Encoding.Default);
    return readStream.ReadToEnd();
}
catch
{
    return "Error";
}
finally
{
    readStream = null;
    receiveStream = null;
    response = null;
    req = null;
}
Should this code have readStream.Dispose() and responseStream.Dispose() instead of setting both to null?
It's almost always a mistake to set local variables to null, unless you want to actually use that value later on. It doesn't force garbage collection any earlier - if you're not going to read from the variable later, the garbage collector can ignore the reference (when not in debug mode).
However, it's almost always correct to close streams - ideally in a using statement for simplicity.
It's also almost always wrong to have a bare "catch" block like that. Do you really want to handle anything going wrong, including things like OutOfMemoryException?
I would rewrite your code as:
HttpWebRequest req = (HttpWebRequest)WebRequest.Create("someUrl");
req.Credentials = CredentialCache.DefaultCredentials;
req.Method = "GET";

using (WebResponse response = req.GetResponse())
{
    using (StreamReader reader = new StreamReader(response.GetResponseStream(),
                                                  Encoding.Default))
    {
        return reader.ReadToEnd();
    }
}
Now if something goes wrong, the exception will be propagated to the caller. You might want to catch a few specific exceptions, but it's generally not a good idea to represent errors using a value which could have been a valid "normal" response.
Finally, are you really sure you want Encoding.Default? That's the default encoding of the local machine - you normally want the encoding indicated by the response itself.
It should have using [which calls Dispose()].
Yes, Dispose() them.
Even better to do something like
using (HttpWebResponse response = (HttpWebResponse)req.GetResponse())
using (Stream receiveStream = response.GetResponseStream())
using (StreamReader readStream = new StreamReader(receiveStream, Encoding.Default))
{
    return readStream.ReadToEnd();
}
A using(x) {} block will be rewritten (by the compiler)
as a try {} finally {x.Dispose();}
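You can see that equivalence concretely with a tiny disposable type (the names here are illustrative, not part of the answer above):

```csharp
using System;

// The two forms below are equivalent: the compiler expands the
// using block into exactly the try/finally written out after it.
var a = new DisposeProbe();
using (a) { /* work */ }

var b = new DisposeProbe();
try { /* work */ }
finally
{
    if (b != null) b.Dispose();
}

Console.WriteLine(a.Disposed); // True
Console.WriteLine(b.Disposed); // True

class DisposeProbe : IDisposable
{
    public bool Disposed;
    public void Dispose() { Disposed = true; }
}
```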
Note that the WebRequest is not IDisposable.
Also note that the following lines accomplish the same thing as all of your code:
using (var client = new System.Net.WebClient())
{
    client.Encoding = ...;
    client.Credentials = ...;
    return client.DownloadString("SomeUrl");
}
Yes. Pretty much anything that implements a Dispose() method must have its Dispose() method called. You can call it implicitly with the 'using' statement:
using (StreamReader stream = GetStream())
{
    stream.DoStuff();
}
Yes.
When you set to null, it only nulls the reference. It doesn't run any cleanup code the creator of the Stream class wrote.
You may also want to consider the using(){ } statement which handles this for you on IDisposable types.
Example:
using (MyDisposableObject mdo = new MyDisposableObject())
{
    // Do some stuff with mdo
    // mdo will automatically get disposed by the using clause
}
No, you should call Dispose or Close
Safest method:
try {
    HttpWebRequest request = (HttpWebRequest) WebRequest.Create("someUrl");
    request.Credentials = CredentialCache.DefaultCredentials;
    request.Method = "GET";
    using (HttpWebResponse response = (HttpWebResponse) request.GetResponse()) {
        using (StreamReader reader = new StreamReader(response.GetResponseStream(), Encoding.Default)) {
            return reader.ReadToEnd();
        }
    }
} catch {
    return "Error";
}
There's no need to dispose of the response.GetResponseStream() stream explicitly because the attached StreamReader will dispose it for you.
EDIT: I agree with the other answers - catching exceptions like that is very bad practice. I just left it in for comparison purposes. :-)
Yes - you should explicitly call Dispose() on classes that implement IDisposable after you have used them - this ensures all their resources get cleaned up in a timely fashion. Wrapping the variable in a using() does the same thing (which adds wrapping code that calls Dispose for you):
using (StreamReader reader = new StreamReader(path)) {
    // do stuff with reader
}
There are a few gotchas in the .NET libraries. Stream is one, and much of the imaging API is another. These types hold system resources that the garbage collector does not clean up for you.
If anything uses the IDisposable API, the best thing to do is wrap it in a using block, as people have pointed out above.
Read up on "using", and keep it in mind whenever you're dealing with file handles or images.
Really, the question has been answered but I do want to elaborate on one thing.
Any time an object implements the IDisposable interface, you should dispose it with the Dispose method or (even better) use the using statement.
If you are ever faced with this question in the future, just find out which interfaces it implements. Once you see IDisposable you know to dispose it.
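For example, you can check for IDisposable in code as well as in the documentation, either at compile time with `is` or at runtime by inspecting the type's interface list:

```csharp
using System;
using System.IO;
using System.Linq;

// "is" answers the question for a value you already have in hand.
Stream stream = new MemoryStream();
Console.WriteLine(stream is IDisposable); // True

// GetInterfaces() answers it for a type, including inherited interfaces.
bool needsDispose = typeof(MemoryStream).GetInterfaces().Contains(typeof(IDisposable));
Console.WriteLine(needsDispose); // True
```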
Setting the variable to null only clears the reference; if your application no longer requires the stream, call Dispose() instead.
I need to download a large file (2 GB) over HTTP in a C# console application. Problem is, after about 1.2 GB, the application runs out of memory.
Here's the code I'm using:
WebClient request = new WebClient();
request.Credentials = new NetworkCredential(username, password);
byte[] fileData = request.DownloadData(baseURL + fName);
As you can see... I'm reading the file directly into memory. I'm pretty sure I could solve this if I were to read the data back from HTTP in chunks and write it to a file on disk.
How could I do this?
If you use WebClient.DownloadFile you could save it directly into a file.
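For instance (the URL, path, and credential values are placeholders, not from the question):

```csharp
using System.Net;

static class Downloader
{
    // DownloadFile streams the response body straight to disk, so the
    // 2 GB payload never has to fit into a single byte[] in memory the
    // way DownloadData's return value does.
    public static void DownloadToDisk(string url, string outputPath, string user, string pass)
    {
        using (var client = new WebClient())
        {
            client.Credentials = new NetworkCredential(user, pass);
            client.DownloadFile(url, outputPath);
        }
    }
}
```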
The WebClient class is the one for simplified scenarios. Once you get past simple scenarios (and you have), you'll have to fall back a bit and use WebRequest.
With WebRequest, you'll have access to the response stream, and you'll be able to loop over it, reading a bit and writing a bit, until you're done.
From the Microsoft documentation:
We don't recommend that you use WebRequest or its derived classes for
new development. Instead, use the System.Net.Http.HttpClient class.
Source: learn.microsoft.com/WebRequest
Example:
public void MyDownloadFile(Uri url, string outputFilePath)
{
const int BUFFER_SIZE = 16 * 1024;
using (var outputFileStream = File.Create(outputFilePath, BUFFER_SIZE))
{
var req = WebRequest.Create(url);
using (var response = req.GetResponse())
{
using (var responseStream = response.GetResponseStream())
{
var buffer = new byte[BUFFER_SIZE];
int bytesRead;
do
{
bytesRead = responseStream.Read(buffer, 0, BUFFER_SIZE);
outputFileStream.Write(buffer, 0, bytesRead);
} while (bytesRead > 0);
}
}
}
}
Note that if WebClient.DownloadFile works, then I'd call it the best solution. I wrote the above before the "DownloadFile" answer was posted. I also wrote it way too early in the morning, so a grain of salt (and testing) may be required.
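Since the documentation quoted above points to HttpClient for new code, roughly the same streaming download might look like this sketch (the class and method names are mine):

```csharp
using System.IO;
using System.Net.Http;
using System.Threading.Tasks;

static class HttpDownloader
{
    // HttpClient is intended to be shared across requests.
    static readonly HttpClient Client = new HttpClient();

    public static async Task DownloadFileAsync(string url, string outputFilePath)
    {
        // ResponseHeadersRead hands the stream back as soon as the headers
        // arrive, instead of buffering the whole body in memory first.
        using (var response = await Client.GetAsync(url, HttpCompletionOption.ResponseHeadersRead))
        {
            response.EnsureSuccessStatusCode();
            using (var body = await response.Content.ReadAsStreamAsync())
            using (var file = File.Create(outputFilePath))
            {
                await body.CopyToAsync(file);
            }
        }
    }
}
```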
You need to get the response stream and then read in blocks, writing each block to a file to allow memory to be reused.
As you have written it, the whole response, all 2 GB, needs to be in memory. Even on a 64-bit system that will hit the 2 GB limit for a single .NET object.
Update: easier option. Get WebClient to do the work for you: with its DownloadFile method which will put the data directly into a file.
WebClient.OpenRead returns a Stream, just use Read to loop over the contents, so the data is not buffered in memory but can be written in blocks to a file.
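A minimal sketch of that loop (the names are mine, not from the answer):

```csharp
using System.IO;
using System.Net;

static class BlockDownloader
{
    public static void Download(string url, string outputPath)
    {
        using (var client = new WebClient())
        using (Stream source = client.OpenRead(url))   // the response stream
        using (Stream target = File.Create(outputPath))
        {
            var buffer = new byte[8192];
            int read;
            // Each block goes to disk immediately, so memory use stays at
            // the buffer size no matter how large the file is.
            while ((read = source.Read(buffer, 0, buffer.Length)) > 0)
            {
                target.Write(buffer, 0, read);
            }
        }
    }
}
```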
I would use something like this.
The connection can be interrupted, so it is better to download the file in small chunks.
Akka Streams can help download the file in small chunks from a System.IO.Stream using multithreading. https://getakka.net/articles/intro/what-is-akka.html
The Download method will append the bytes to the file, starting at fileStart. If the file does not exist, fileStart must be 0.
using Akka.Actor;
using Akka.IO;
using Akka.Streams;
using Akka.Streams.Dsl;
using Akka.Streams.IO;

private static Sink<ByteString, Task<IOResult>> FileSink(string filename)
{
    return Flow.Create<ByteString>()
        .ToMaterialized(FileIO.ToFile(new FileInfo(filename), FileMode.Append), Keep.Right);
}

private async Task Download(string path, Uri uri, long fileStart)
{
    using (var system = ActorSystem.Create("system"))
    using (var materializer = system.Materializer())
    {
        HttpWebRequest request = WebRequest.Create(uri) as HttpWebRequest;
        request.AddRange(fileStart);

        using (WebResponse response = request.GetResponse())
        using (Stream stream = response.GetResponseStream())
        {
            await StreamConverters.FromInputStream(() => stream, chunkSize: 1024)
                .RunWith(FileSink(path), materializer);
        }
    }
}