Webclient.DownloadString() does not retrieve current whole page - c#

I know there is another question with a practically identical title here: Webclient.DownloadString does not retrieve the whole page
But its solution doesn't help me, and maybe somebody else has the same problem.
I'm trying to get the html code of this URL:
https://cubebrush.co/?freebies=true
To achieve that, I'm using the following code in C#:
WebClient webClient = new WebClient();
string webString = webClient.DownloadString("https://cubebrush.co/?freebies=true");
But the retrieved html lacks some information, for example, all the button tags inside the website. This can be quickly checked using the library HtmlAgilityPack and checking for all the tags inside the website with the following code:
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(webString);
HashSet<string> hs = new HashSet<string>();
foreach (var dec in doc.DocumentNode.Descendants())
{
hs.Add(dec.Name);
}
If we run this, it will show 26 tags, but none of them will be a button tag. This makes sense, since the initial webString also lacks that "button information".
I've tried copying webString into a file to check whether, as the question linked above suggests, the problem was with the visualizer, but it isn't: the visualizer and the file look exactly the same.
Can somebody tell me what I'm doing wrong? Thanks!

Related

Extract image sources in C# from web page using JS [duplicate]

This question already has answers here:
C# .NET: Scraping dynamic (JS) websites
(1 answer)
htmlagilitypack and dynamic content issue
(3 answers)
Closed 4 years ago.
UPDATE #2 Continuing on this effort (see original and update #1 below). ScrapySharp had potential but no matter what I tried, the process consumed all available memory and didn't produce anything. I did find that, due to jQuery, in a test WebBrowser control, the correct web page does not load. It seems the site has a function that determines how you got to the page requested, validates something about your browser, and redirects you to a generic sign-up page.
Thoughts on how to appease the gate keeper?
UPDATE #1 (details of original, not unique question below):
First of all - THANK YOU! For the suggestions!
I tried to use ScrapySharp as @John pointed out as:
ScrapingBrowser Browser = new ScrapingBrowser();
WebPage PageResult = Browser.NavigateToPage(new Uri("http://www.example-site.com"));
HtmlNode rawHTML = PageResult.Html;
Console.WriteLine(rawHTML.InnerHtml);
Console.ReadLine();
However, it resulted in a memory leak. To get a sense of how it works, I tried:
ScrapingBrowser Browser = new ScrapingBrowser();
WebPage PageResult = Browser.NavigateToPage(new Uri(url));
HtmlNode rawHTML = PageResult.Html;
var imgNodes = rawHTML.SelectNodes("//img");
Which also created a memory leak. What am I missing with my implementation of it?
ORIGINAL QUESTION:
I'm trying to get my application to grab specific images from a web site. So far I've been using HtmlAgilityPack but it only grabs the basic HTML. I don't know how to explain the tags from the missing elements other than they show up when using Inspect in Chrome (but Regex and HtmlAgilityPack can't seem to access/see them), and they have a "data-v-??????" identifier inside the tag. Here's an example:
<div data-v-1a7a6550="" class="product-extra-images"><img data-v-1a7a6550="" src="https://vw-xerophane.storage.googleapis.com:443/thumbnails/products/34160_1Earring5th-5-18_1.jpg.100x100_q85_crop_upscale.jpg" width="50"><img data-v-1a7a6550="" src="https://vw-xerophane.storage.googleapis.com:443/thumbnails/products/34160_2Earring5th-5-18_1.jpg.100x100_q85_crop_upscale.jpg" width="50"></div>
Please let me know if you need additional details. Here's a sample of my latest code that couldn't extract the elements (in case it helps):
var htmlDoc = new HtmlAgilityPack.HtmlDocument()
{
OptionFixNestedTags = true,
OptionAutoCloseOnEnd = true,
OptionReadEncoding = false
};
var imgNodes = htmlDoc.DocumentNode.SelectNodes("//div");
foreach (var imgNode in imgNodes)
{
//decode the string
var img = HttpUtility.HtmlDecode(imgNode.InnerText).Trim();
imagesouces.Add(img);
}
File.WriteAllLines(@"C:\Users\user\Desktop\WriteText.txt", imagesouces);

C# htmlAgility Webscrape html node inside the first Html node

I am trying to access these nodes
on this website.
http://bcres.paragonrels.com/publink/default.aspx?GUID=2033c143-cdf1-46b4-9aac-2e27371de22d&Report=Yes
however they appear to be in a secondary Html document within the initial one.
I am confused how I access the secondary html path and then parse through for the
this is an example of one of the nodes.
<div style="top:219px;left:555px;width:45px;height:14px;" id="" class="mls29">2</div>
I am using HtmlAgilityPack and I receive null whenever I try to access the div.
I tried working my way down the nodes, but it didn't work.
Any help or a place to look up the necessary information to figure this out would be appreciated
var webGet = new HtmlWeb();
var document = webGet.Load("http://bcres.paragonrels.com/publink/default.aspx?GUID=d27a1d95-623d-4f6a-9e49-e2e46ede136c&Report=Yes");
var divTags = document.DocumentNode.SelectNodes("/html");
var text = document.DocumentNode.InnerText;
MessageBox.Show(text);
You will be able to scrape the data if you access the following url:
http://bcres.paragonrels.com/publink/Report.aspx?outputtype=HTML&GUID=2033c143-cdf1-46b4-9aac-2e27371de22d&ListingID=262103824:0&Report=Yes&view=29&layout_id=63
As an example, scraping the Presented By field:
HtmlWeb w = new HtmlWeb();
var hd = w.Load("http://bcres.paragonrels.com/publink/Report.aspx?outputtype=HTML&GUID=2033c143-cdf1-46b4-9aac-2e27371de22d&ListingID=262103824:0&Report=Yes&view=29&layout_id=63");
// CssSelect (a ScrapySharp extension) returns a sequence, so take the first match and check that for null
var presentedBy = hd.DocumentNode.CssSelect(".mls23.at-phone-link").FirstOrDefault();
if (presentedBy != null)
{
    Console.WriteLine(presentedBy.InnerText);
}
Remarks:
I use ScrapySharp nuget package along with HtmlAgilityPack, so I can scrape using css selectors instead of xpath expressions - something I find easier to do.
The url you are scraping from is your problem. I am scraping from the last GET request that is performed after the page is loaded, which I found by using the Firefox developer tools to analyze the site's network requests and responses.
I could not yet identify what triggers this http request in the end (it may be javascript code, or it may be one of the frame htmls that are requested in the main, frame-enabled document).
If you only have a couple of urls like this to scrape, then even manually extracting the correct url will be an option.
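For a handful of listings, the manual extraction can be sketched as a small helper that copies the GUID out of the publink url and fills in the rest of the Report.aspx query string. This is only a sketch: the class and method names are made up for illustration, and it assumes the other parameters (ListingID, view, layout_id) have already been read from the network traffic for each listing, as they were for the url above. Whether the same pattern holds for every listing is not confirmed.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper: rebuilds the Report.aspx url from a publink url.
public static class ReportUrlBuilder
{
    public static string Build(string publinkUrl, string listingId, int view, int layoutId)
    {
        // Copy the GUID query parameter out of the original publink url.
        Match m = Regex.Match(publinkUrl, @"[?&]GUID=([^&]+)", RegexOptions.IgnoreCase);
        if (!m.Success)
        {
            throw new ArgumentException("No GUID parameter found in the url.", nameof(publinkUrl));
        }

        // Parameter names and order follow the Report.aspx url quoted above.
        return "http://bcres.paragonrels.com/publink/Report.aspx?outputtype=HTML"
             + "&GUID=" + m.Groups[1].Value
             + "&ListingID=" + listingId
             + "&Report=Yes"
             + "&view=" + view
             + "&layout_id=" + layoutId;
    }
}
```

Calling ReportUrlBuilder.Build with the publink url from the question, "262103824:0", 29 and 63 reproduces the Report.aspx url used above, which can then be passed to HtmlWeb.Load.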

How do I recover the full html of a page, including what is generated by javascript

How do I recover the full html of a page, including what is generated by javascript? The problem is that I want to access the contents of the select tag, but it is coming back empty, probably because it is being generated dynamically. Please help, I'm about to give up!
I have posted only a piece of the code because it is very large; if necessary, I can post the whole code.
res = (HttpWebResponse)req.GetResponse();
res.Cookies = req.CookieContainer.GetCookies(req.RequestUri);
cookieContainer.Add(res.Cookies);
sr = new StreamReader(res.GetResponseStream());
getHtml = sr.ReadToEnd();
viewstate = rxViewstate.Match(getHtml).Groups[1].Value;
EventValdidation = rxEventValidation.Match(getHtml).Groups[1].Value;
viewstate = HttpUtility.UrlEncode(viewstate);
EventValdidation = HttpUtility.UrlEncode(EventValdidation);
//Here I should take the contents of the select tag.
getHtml = rxDropDownMenu.Match(getHtml).Groups[2].Value;
You can't do this with HttpWebRequest alone; all it does is download the raw HTML, and none of the linked JavaScript files.
It also wouldn't run the JavaScript or give you any kind of DOM to inspect.
You'd really need to use WebBrowser or perhaps something like Awesomium.

asp.net C# get final page source of a webpage

I'm trying to get the FINAL source of a webpage. I am using the WebClient OpenRead method, but it only returns the initial page source. After the source downloads, JavaScript runs and collects the data that I need in a different format, so my method ends up looking for something that has completely changed.
What I am talking about is exactly like the difference between:
right-click on a webpage -> select view source
access the developer tools
Look at this site to see what I am talking about: http://www.augsburg.edu/history/fac_listing.html and watch how any of the emails is displayed using each option. I think what is happening is that the first shows the initial load of the page, while the second shows the final page html. The WebClient only lets me do option #1.
Here is the code, which only returns option #1. Also, I need to do this from a console application. Thank you!
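// Requires: using System.IO; using System.Net;
private static string GetReader(string site)
{
    try
    {
        // Declare everything locally; the using blocks dispose the client, stream and reader
        using (WebClient client = new WebClient())
        using (Stream data = client.OpenRead(site))
        using (StreamReader reader = new StreamReader(data))
        {
            // Returns the initial page source only; no JavaScript is executed
            return reader.ReadToEnd();
        }
    }
    catch
    {
        return "";
    }
}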
I've found a solution to my problem.
I ended up using the Selenium WebDriver PageSource property. It worked beautifully!
Learn about Selenium and WebDriver. It is easy to learn, and it helps both with testing and with this!
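A minimal sketch of that approach from a console application might look like the following. It assumes the Selenium.WebDriver NuGet package, a local Chrome installation and a matching chromedriver; the class and method names are made up for illustration and are not taken from the original post.

```csharp
using System;
using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;

public static class RenderedPage
{
    // Loads the page in a real browser and returns the source as the browser sees it.
    public static string GetFinalSource(string site)
    {
        var options = new ChromeOptions();
        options.AddArgument("--headless"); // assumption: no visible browser window is wanted

        // Disposing the driver also shuts the browser down.
        using (IWebDriver driver = new ChromeDriver(options))
        {
            driver.Navigate().GoToUrl(site);
            return driver.PageSource;
        }
    }

    public static void Main()
    {
        Console.WriteLine(GetFinalSource("http://www.augsburg.edu/history/fac_listing.html"));
    }
}
```

If the scripts on a page keep modifying it after the initial load, reading PageSource immediately may still be too early; in that case an explicit wait for a known element before reading it is probably needed.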

Trouble Scraping .HTM File

I have just begun scraping basic text off web pages, and am currently using the HTMLAgilityPack C# library. I had some success with boxscores off rivals.yahoo.com (sports is my thing so why not scrape something interesting?) but I am stuck on NHL's game summary pages. I think this is kind of an interesting problem so I would post it here.
The page I am testing is:
http://www.nhl.com/scores/htmlreports/20102011/GS020079.HTM
Upon first glance, it seems like basic text with no ajax or stuff to mess up a basic scraper. Then I realize I can't right click due to some javascript, so I work around that. I right click in firefox and get the xpath of the home team using XPather and I get:
/html/body/table[@id='MainTable']/tbody/tr[1]/td/table[@id='StdHeader']/tbody/tr/td/table/tbody/tr/td[3]/table[@id='Home']/tbody/tr[3]/td
When I try to grab that node / inner text, htmlagilitypack won't find it. Does anyone see anything strange in the page's source code that might be stopping me?
I am new to this and still learning how people might stop me from scraping, any tips or tricks are gladly appreciated!
p.s. I observe all site rules regarding bots, etc, but I noticed this strange behavior and saw it as a challenge.
Ok so it appears that my xpaths have tbody's in them. When I remove these tbodys manually from the xpath, HTMLAgilityPack can handle it fine.
I'd still like to know why I am getting invalid xpaths, but for now I have answered my question.
I think, unless my xpath knowledge is heaps flawed (probably), the problem is with the /tbody node in your xpath expression.
When I do
string test = string.Empty;
StreamReader sr = new StreamReader(@"C:\gs.htm");
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.Load(sr);
sr.Close();
sr = null;
string xpath = @"//table[@id='Home']/tr[3]/td";
test = doc.DocumentNode.SelectSingleNode(xpath).InnerText;
That works fine. It returns
"COLUMBUS BLUE JACKETSGame 5 Home Game 3"
which I hope is the string you wanted.
Examining the html I couldn't find a /tbody.
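The likely reason for the mismatch is that browsers insert implicit tbody elements into the DOM when they parse a table, so xpaths copied from browser tools contain steps that do not exist in the raw markup HtmlAgilityPack parses. A minimal sketch of stripping those steps automatically, instead of by hand, is shown below; the class and method names are hypothetical, and it assumes the page's markup contains no explicit tbody elements (as in the page discussed here).

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper: removes tbody steps from an xpath copied from a browser.
public static class XPathCleaner
{
    public static string StripTbody(string xpath)
    {
        // Matches "/tbody" with an optional positional predicate such as [1],
        // but only as a whole step (followed by "/" or the end of the path).
        return Regex.Replace(xpath, @"/tbody(\[\d+\])?(?=/|$)", "");
    }
}

public static class Program
{
    public static void Main()
    {
        string fromBrowser = "/html/body/table[@id='MainTable']/tbody/tr[1]/td/table[@id='Home']/tbody/tr[3]/td";
        // prints /html/body/table[@id='MainTable']/tr[1]/td/table[@id='Home']/tr[3]/td
        Console.WriteLine(XPathCleaner.StripTbody(fromBrowser));
    }
}
```

The cleaned string can then be passed to SelectSingleNode as in the answer above.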
