Retrieve web page content like a browser

Retrieve web page content like a browser - c#

After learning a bit about different technologies, I wanted to build a small project using UWP + NoSQL: a small UWP app that grabs the horoscope and displays it on my Raspberry Pi every morning.
So I took a WebClient and did the following:
WebClient client = new WebClient();
client.Headers[HttpRequestHeader.UserAgent] = "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.2 (KHTML, like Gecko) Chrome/15.0.874.121 Safari/535.2";
string downloadString = client.DownloadString("http://www.horoscope.com/us/horoscopes/general/horoscope-general-daily-today.aspx?sign=2");
But the site seems to detect that this request isn't coming from a browser, since the interesting part is missing from the content (and when I check with the browser, it is in the initial HTML, according to Fiddler).
I also tried ScrapySharp, but I got the same result. Any idea why?
(I've already done the UWP part, so I don't want to change the topic of my personal project just because it is detected as a "bot".)
EDIT
It seems I wasn't clear enough. The issue is not that I'm unable to parse the HTML; the issue is that I don't receive the expected HTML when using ScrapySharp/WebClient.
EDIT2
Here is what I retrieve: http://pastebin.com/sXi4JJRG
For example, I don't get the "Star ratings by domain" section or the related images for each star.

You can read the entire content of the web page using the code snippet shown below:
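internal static string ReadText(string Url, int TimeOutSec)
{
    // Synchronous wrapper: blocks the calling thread until the download finishes.
    using (HttpClient _client = new HttpClient() { Timeout = TimeSpan.FromSeconds(TimeOutSec) })
    {
        _client.DefaultRequestHeaders.Accept.Add(new System.Net.Http.Headers.MediaTypeWithQualityHeaderValue("text/html"));
        // GetAsync returns a Task, so wait for the response before using it.
        using (HttpResponseMessage _responseMsg = _client.GetAsync(Url).GetAwaiter().GetResult())
        using (HttpContent content = _responseMsg.Content)
        {
            // HttpContent exposes ReadAsStringAsync, not ReadAsString.
            return content.ReadAsStringAsync().GetAwaiter().GetResult();
        }
    }
}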
Or in a simple way:
public static void DownloadString (string address)
{
WebClient client = new WebClient ();
string reply = client.DownloadString (address);
Console.WriteLine (reply);
}
(re: https://msdn.microsoft.com/en-us/library/fhd1f0sw(v=vs.110).aspx)

Yes, WebClient won't give you the expected result. Many sites use scripts to load content, so to emulate a browser you also have to run the page's scripts.
I have never done anything similar, so my answer is purely theoretical.
To solve the problem you need a "headless browser".
I know of two projects for this (I have never tried either of them):
http://webkitdotnet.sourceforge.net/ - it seems to be outdated
http://www.awesomium.com/

Ok, I think I know what's going on: I compared the real output (no fancy user agent strings) to the output as supplied by your pastebin and found something interesting. On line 213, your pastebin has:
<li class="dropdown"><a href="/us/profiles/zodiac/index-profile-zodiac-sign.aspx" class="dropdown-toggle" data-hov...ck">Forecast Tarot Readings</div>
Mind the data-hov...ck near the end. In the real output, this was:
<li class="dropdown">Astrology
followed by about 600 lines of code, including the aforementioned 'interesting part'. On line 814, it says:
<div class="bot-explore-col-subtitle f14 blocksubtitle black">Forecast Tarot Readings</div>
which, starting with the ck in black, matches up with the rest of the pastebin output. So either Pastebin condensed the output, or the original output was already condensed.
I created a new console application, inserted your code, and got the result I expected, including the 600 lines of html you seem to miss:
static void Main(string[] args)
{
    WebClient client = new WebClient();
    client.Headers[HttpRequestHeader.UserAgent] = "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.2 (KHTML, like Gecko) Chrome/15.0.874.121 Safari/535.2";
    string downloadString = client.DownloadString("http://www.horoscope.com/us/horoscopes/general/horoscope-general-daily-today.aspx?sign=2");
    File.WriteAllText(@"D:\Temp\source-mywebclient.html", downloadString);
}
My WebClient is from System.Net. Changing the UserAgent has hardly any effect; only a couple of links are slightly different.
So, to sum it up: your problem has nothing to do with content that is inserted dynamically after the initial GET, but possibly with WebClient combined with UWP. There's another question on the site about WebClient and UWP, "(UWP) WebClient and downloading data from URL in", which states that you should use HttpClient instead. Maybe that's a solution?
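A minimal sketch of the HttpClient route suggested above, assuming a plain GET with the same User-Agent string is all that is needed. The names PageFetcher, BuildRequest and FetchAsync are made up for illustration and are not from the original posts; whether this behaves differently from WebClient under UWP is untested here.

```csharp
using System.Net.Http;
using System.Threading.Tasks;

public static class PageFetcher
{
    // One shared HttpClient instance is reused for all requests.
    private static readonly HttpClient Client = new HttpClient();

    // Builds a GET request carrying a browser-like User-Agent and an HTML Accept header.
    public static HttpRequestMessage BuildRequest(string url, string userAgent)
    {
        var request = new HttpRequestMessage(HttpMethod.Get, url);
        request.Headers.UserAgent.ParseAdd(userAgent);
        request.Headers.Accept.ParseAdd("text/html");
        return request;
    }

    // Sends the request and returns the response body as a string.
    public static async Task<string> FetchAsync(string url, string userAgent)
    {
        using (HttpRequestMessage request = BuildRequest(url, userAgent))
        using (HttpResponseMessage response = await Client.SendAsync(request))
        {
            response.EnsureSuccessStatusCode(); // throws on 4xx/5xx status codes
            return await response.Content.ReadAsStringAsync();
        }
    }
}
```

From an async method this would be called as `string html = await PageFetcher.FetchAsync(url, userAgent);`, where url and userAgent are the values from the question.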

Some time ago I used http://www.nrecosite.com/phantomjs_wrapper_net.aspx and it worked well. As Anton mentioned, it is a headless browser. Maybe it will be of some help.

I'm wondering whether the 'interesting part' you expect to see 'in the content' consists entirely of images. Are you aware that you have to retrieve any images separately? The fact that an HTML page contains <img .../> tags does not magically download them as well. As you can see with Fiddler, after retrieving a page the browser then retrieves all the images, style sheets, JavaScript files and other items that are referenced by the page but not included in it. (You might need to clear the browser cache to see this happen.)
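To illustrate the point about separate retrieval, here is a rough sketch that collects the image URLs a downloaded page refers to, so each can be requested on its own. The ImageLinks class and FindImageUrls method are hypothetical names, and the regular expression is a simplification that only handles quoted src attributes; a real HTML parser would be more robust.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class ImageLinks
{
    // Finds <img src="..."> values in the HTML and resolves them against the page URL.
    public static List<Uri> FindImageUrls(string html, Uri pageUri)
    {
        var urls = new List<Uri>();
        var pattern = @"<img\b[^>]*?\bsrc\s*=\s*[""']([^""']+)[""']";
        foreach (Match m in Regex.Matches(html, pattern, RegexOptions.IgnoreCase))
        {
            Uri absolute;
            // Relative paths such as "../images/x.png" become absolute URLs.
            if (Uri.TryCreate(pageUri, m.Groups[1].Value, out absolute))
                urls.Add(absolute);
        }
        return urls;
    }
}
```

Each returned URL still needs its own request (for example with WebClient.DownloadData) before the image bytes are available.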

Related

Webclient 404 protocol error on valid url c#

I have a WebClient that calls a URL that works fine when I view it in a browser, which led me to believe I would need to add headers to my call.
I have done this, but I am still getting the error.
I do have other calls to the same API that work fine, and I have checked that all the parameters I am passing across are exactly as expected (case, spelling).
using (var wb = new WebClient())
{
wb.Proxy = proxy;
wb.Headers.Add("Accept-Language", " en-US");
wb.Headers.Add("Accept", " text/html, application/xhtml+xml, */*");
wb.Headers.Add("User-Agent", "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)");
byte[] response = wb.UploadValues("http://myserver/api/account/GetUser",
new NameValueCollection()
{
{ "email", register.Email },
});
userDetails = Encoding.UTF8.GetString(response);
}
Does anyone have an idea why I am still getting the protocol error on a call that works perfectly fine in a browser?

UploadValues uses an HTTP POST. Are you sure that is what you want? If you are viewing it in a browser it is likely a GET, unless you are filling out some sort of web form.
One might surmise that what you are trying to do is GET this response: "http://myserver/api/account/GetUser?email=blah@blah.com"
in which case you would just formulate that URL, with query parameters, and execute a GET using one of the DownloadString overloads.
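using (var wb = new WebClient())
{
    wb.Proxy = proxy;
    // Escape the value so characters such as '@' or '+' survive in the query string.
    userDetails = wb.DownloadString("http://myserver/api/account/GetUser?email=" + Uri.EscapeDataString(register.Email));
}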
The Wikipedia article on REST has a nice table that outlines the semantics of each HTTP verb, which may help choosing the appropriate WebClient method to use for your use cases.

C# Crawler Moving single thread WebClient call to multi threading

We currently have a web crawler for our clients that do not have the ability to provide us with an XML file. The list is growing, so I need a more efficient way to crawl these sites. The logic of the crawler is simply:
Pass in www.example.com/widgets
Store the html and pass it to crawler function
crawl widgets page 1
IF widgets page 1 is the end of their product list
stop
else
go to widgets page 2
This repeats for every site in the queue. As you can see, if Site 1 has 5000 products, Site 2 cannot proceed until Site 1 is done. What would be the best way to multi-thread this so that I can limit how many requests I make to each site, but grab multiple sites at one time? I tried Parallel.ForEach, but the results were very sporadic and unpredictable. Currently we handle this by having "groups" of stores fire off at the same time using Windows Task Manager. Here is some example code:
foreach (var site in ListofSites)
{
    int page = 1;
    bool continue_crawling = true;
    var htmlWeb = new HtmlWeb();
    htmlWeb.UserAgent = "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/31.0.1650.57 Safari/537.36";
    while (continue_crawling)
    {
        // URL is the listing URL for the current site (its definition is not shown here).
        HtmlDocument doc = htmlWeb.Load(URL + page);
        string html = doc.DocumentNode.OuterHtml;
        // Parse returns true once the end of the product list has been reached.
        continue_crawling = !Parse(html);
        page++;
    }
}
private bool Parse(string html)
{
    //parse the file and see if we have enough data
    return endofproduct;
}

All C# HTTP requests will go through the ServicePoint for the domain of the request URL. The ServicePoint inherits its ConnectionLimit from ServicePointManager.DefaultConnectionLimit. The default value is 2, in accordance with the 'good client practice' of RFC 2616:
A single-user client SHOULD NOT maintain more than 2 connections with any server or proxy.
What all that translates to is that launching a gazillion HTTP requests to URLs in the same domain will send at most 2 HTTP requests at a time, and another one will not start until one of them finishes. To achieve higher concurrency you need to increase the ServicePoint connection limit for the domain of interest. Doing concurrent IO using threads (including the TPL) is rather primitive, but should work if you fix the limitation. Doing async IO would be preferable, of course.
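A rough sketch of how the two ideas could be combined: raise the connection limit, then crawl several sites at once while each site's pages are still fetched one after another. CrawlScheduler, CrawlSitesAsync, maxSitesAtOnce and the crawlSite delegate are hypothetical names, and the limit of 10 is an arbitrary example value. Note that ServicePointManager applies to the HttpWebRequest/WebClient stack on .NET Framework; newer runtimes may ignore it or flag it as obsolete.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Net;
using System.Threading;
using System.Threading.Tasks;

public static class CrawlScheduler
{
    // Crawls all sites, with at most maxSitesAtOnce running concurrently.
    // crawlSite is expected to walk one site's pages sequentially.
    public static async Task CrawlSitesAsync(
        IEnumerable<string> sites, int maxSitesAtOnce, Func<string, Task> crawlSite)
    {
        // Raise the default per-host limit of 2 (example value, tune as needed).
        ServicePointManager.DefaultConnectionLimit = 10;

        using (var gate = new SemaphoreSlim(maxSitesAtOnce))
        {
            List<Task> work = sites.Select(async site =>
            {
                await gate.WaitAsync();          // wait for a free slot
                try { await crawlSite(site); }   // one site, pages in order
                finally { gate.Release(); }      // free the slot for the next site
            }).ToList();

            await Task.WhenAll(work);
        }
    }
}
```

If the per-site crawl is blocking code such as HtmlWeb.Load, it would need to be wrapped, for example as `site => Task.Run(() => CrawlOneSite(site))` with CrawlOneSite being your own method; otherwise the sites would still run one after another.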

Webservice sourced images show as ? in certain versions of stock Android Browser

I am attempting to dynamically source images, using an ID rendered into the path when the page binds its data. However, the images are showing as blue question marks in a box [?]. The images load fine on iOS, Mobile Chrome, older versions of the Android browser (2.3), newer versions of the Android browser (4.2.2) and IE/Firefox/Chrome on desktop. This issue appears (so far) only on Android 4.0 and 4.1.
This is how I'm trying to load the images:
Ex. <img src="../services/getImage?id=f6c799b2-ff31-4fbc-abc9-31f20d5e69c8">
This request hits a .NET webservice (IHttpAsyncHandler implementation) which looks like this
public virtual UploadedImage getImage(Guid imageId) {
    string eTag;
    Entities.Image.DTO image = null;
    if
    (
        image = //get image entity
    )
    {
        eTag = Delta.Crypto.CreateMD5Hash(image.ModifiedDate.ToEpoch().ToString());
        if (Request.Headers[HTTPRequestHeaderKeys.IfNoneMatch].IsNotNullOrEmpty() && Request.Headers[HTTPRequestHeaderKeys.IfNoneMatch] == eTag)
        {
            this.RespondWithNoUpdate();
            return null;
        }
        if (image.ImageUrl.IsNullOrEmpty() || image.ImageContent == null || image.ImageContent.Length == 0)
        {
            this.RespondWithNotFound();
            return null;
        }
        Response.AddHeader(HTTPResponseHeaderKeys.ETag, eTag);
        return new UploadedImage()
        {
            contentType = "image/" + System.IO.Path.GetExtension(image.ImageUrl).ToLower().Substring(1),
            fileContents = image.ImageContent,
            fileName = image.ImageUrl
        };
    }
    return null;
}
So we're setting the MIME type using the file extension, which is maybe not 100% reliable, but I have confirmed it to be correct in these cases.
Here is a copy of the Request and successful Response on my desktop Chrome browser
Request:
Accept:image/webp,*/*;q=0.8
Accept-Encoding:gzip,deflate,sdch
Accept-Language:en-US,en;q=0.8,es;q=0.6
Connection:keep-alive
Host:localhost
Referer:http://localhost/delta/events/bigevent/app/event.html
User-Agent:Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/32.0.1700.107 Safari/537.36
Response:
Cache-Control:no-cache
Content-Disposition:inline; filename="2dab739b-a06c-4579-8555-0598d738f858_eventApplayoutContainerEventApplicationlandingScreenImageContainer_background-image.png"
Content-Length:236
Content-Type:image/png
Date:Tue, 11 Feb 2014 19:53:31 GMT
ETag:1c79507d4969ea7534f3068ca1e60be4
Expires:-1
Pragma:no-cache
My only guess is that when requesting an image in this way, the img control does not know the mime type when rendered, and thus is complaining.
Note: The request does succeed on the Android browser when accessing directly (in a separate tab).
Does anyone have any idea what may be causing the [?] and a potential solution? I haven't been able to locate much, if any documentation on the stock browser. If you have a link to some documentation, that would also be much appreciated. Thanks!
EDIT: I should note that resource images with relative paths are loading fine
Ex. <img src="../images/EmptyProfile.png">

I was actually able to figure this one out.
The root of the issue is that the Android browser on those versions does not send an Accept header with the request.
My webservice tries to negotiate a content delivery type based upon the client preferences. There was no default.
Hope this helps someone in the future!
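The fix described above could look roughly like the following sketch, which treats a missing Accept header as "anything is acceptable". ContentNegotiation and ChooseContentType are hypothetical names, not the actual web service code, and the matching is deliberately simplified (quality values such as q=0 are ignored).

```csharp
using System;
using System.Linq;

public static class ContentNegotiation
{
    // Returns the content type to send, or null if the Accept header rules it out.
    public static string ChooseContentType(string acceptHeader, string availableType)
    {
        // No Accept header (as with the affected Android browsers): use the default.
        if (string.IsNullOrWhiteSpace(acceptHeader))
            return availableType;

        string family = availableType.Split('/')[0] + "/*"; // e.g. "image/*"
        bool accepted = acceptHeader
            .Split(',')
            .Select(part => part.Split(';')[0].Trim())      // drop ";q=..." parameters
            .Any(range => range == "*/*"
                       || range.Equals(family, StringComparison.OrdinalIgnoreCase)
                       || range.Equals(availableType, StringComparison.OrdinalIgnoreCase));

        return accepted ? availableType : null;
    }
}
```

With the desktop Chrome header from the question, `ChooseContentType("image/webp,*/*;q=0.8", "image/png")` matches on `*/*`, while a null header falls through to the default instead of failing.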

ASP Classic VBScript to ASP.NET C# Conversion

I am familiar with ASP.NET, but not with Visual Basic.
Here is the Visual Basic code:
myxml="http://api.ipinfodb.com/v3/ip-city/?key="&api_key&"&ip=" &UserIPAddress&"&format=xml"
set xml = server.CreateObject("MSXML2.DOMDocument.6.0")
xml.async = "false"
xml.resolveExternals = "false"
xml.setProperty "ServerHTTPRequest", true
xml.load(myxml)
response.write "<p><strong>First result</strong><br />"
for i=0 to 10
response.write xml.documentElement.childNodes(i).nodename & " : "
response.write xml.documentElement.childNodes(i).text & "<br/>"
NEXT
response.write "</p>"
What is going on in this code?
How can I convert this to ASP.NET (C#)?

Based on a quick glance at the site you linked to in a comment, it looks like the intended functionality is to make a request to a URL and receive the response. The first example given on that site is:
http://api.ipinfodb.com/v3/ip-city/?key=<your_api_key>&ip=74.125.45.100
You can probably use something like the System.Net.WebClient object to make an HTTP request and receive the response. The example on MSDN can be modified for your URL. Maybe something like this:
string result;
using (var client = new WebClient())
{
    client.Headers.Add("user-agent", "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2; .NET CLR 1.0.3705;)");
    // The using blocks close the stream and reader even if reading fails.
    using (var data = client.OpenRead(@"http://api.ipinfodb.com/v3/ip-city/?key=<your_api_key>&ip=74.125.45.100"))
    using (var reader = new StreamReader(data))
    {
        result = reader.ReadToEnd();
    }
}
(There's also the WebRequest class, which appears to share roughly the same functionality.)
At that point the result variable contains the response from the API, which you can handle however you need to.

From the looks of the Visual Basic code, I think you should create two methods to "convert" this to an ASP.NET C# web page:
LoadXmlData method - use an XmlDocument to load from the URL via the XmlDocument's Load function. Read ASP.net load XML file from URL for an example.
BuildDisplay method - use an ASP.NET PlaceHolder or Panel to create a container to inject the paragraph tag and individual results into.
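As a rough sketch of those two steps: the VBScript loads the XML returned by the URL and writes each child element of the root as "name : text". The C# below does the same with XmlDocument. XmlDump and DescribeChildren are hypothetical names, and unlike the original loop, which is hard-coded to indexes 0 to 10, this iterates over however many child nodes exist.

```csharp
using System.Collections.Generic;
using System.Xml;

public static class XmlDump
{
    // Mirrors the VBScript loop: one "nodename : text" line per child of the root element.
    public static List<string> DescribeChildren(XmlDocument doc)
    {
        var lines = new List<string>();
        foreach (XmlNode node in doc.DocumentElement.ChildNodes)
            lines.Add(node.Name + " : " + node.InnerText);
        return lines;
    }
}
```

To load from the API, create the document with `var doc = new XmlDocument(); doc.Load(myxml);` where myxml is the URL built as in the VBScript, then write each returned line into the page's container control.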

How to get the content of web page? [duplicate]

This question already exists:
Closed 11 years ago.
Possible Duplicate:
Reading web page by sending username & password?
My problem is this: there is a site whose data is frequently updated, and I would like to fetch it at regular intervals for later reporting.
To get that data I have to provide a user ID and password.
I have used HttpWebRequest to get the data, but the response text is "Your browser doesn't support frame" instead of the data I want.
How can I get it?

Most likely you are having this problem because you are not setting the user-agent in your request. For example, with a WebClient:
using(WebClient wc = new WebClient())
{
wc.Headers.Add("user-agent", "Mozilla/5.0 (Windows; Windows NT 5.1; rv:1.9.2.4) Gecko/20100611 Firefox/3.6.4");
string htmlResult = wc.DownloadString(someUrl);
}

You can make use of the WebBrowser control to solve your problem. The approach works like this: first, load the web page into the WebBrowser control, then check whether the document has finished loading. Once it has, you can retrieve the web page stream using the DocumentStream property.
Hope this helps.
