Extract image sources in C# from web page using JS [duplicate] - c#

This question already has answers here:
C# .NET: Scraping dynamic (JS) websites
(1 answer)
htmlagilitypack and dynamic content issue
(3 answers)
Closed 4 years ago.
UPDATE #2 Continuing on this effort (see original and update #1 below). ScrapySharp had potential but no matter what I tried, the process consumed all available memory and didn't produce anything. I did find that, due to jQuery, in a test WebBrowser control, the correct web page does not load. It seems the site has a function that determines how you got to the page requested, validates something about your browser, and redirects you to a generic sign-up page.
Any thoughts on how to appease the gatekeeper?
UPDATE #1 (details of original, not unique question below):
First of all - THANK YOU! For the suggestions!
I tried to use ScrapySharp, as @John pointed out, like this:
ScrapingBrowser Browser = new ScrapingBrowser();
WebPage PageResult = Browser.NavigateToPage(new Uri("http://www.example-site.com"));
HtmlNode rawHTML = PageResult.Html;
Console.WriteLine(rawHTML.InnerHtml);
Console.ReadLine();
However, it resulted in a memory leak. To get a sense of how it works, I tried:
ScrapingBrowser Browser = new ScrapingBrowser();
WebPage PageResult = Browser.NavigateToPage(new Uri(url));
HtmlNode rawHTML = PageResult.Html;
var imgNodes = rawHTML.SelectNodes("//img");
Which also created a memory leak. What am I missing with my implementation of it?
ORIGINAL QUESTION:
I'm trying to get my application to grab specific images from a web site. So far I've been using HtmlAgilityPack but it only grabs the basic HTML. I don't know how to explain the tags from the missing elements other than they show up when using Inspect in Chrome (but Regex and HtmlAgilityPack can't seem to access/see them), and they have a "data-v-??????" identifier inside the tag. Here's an example:
<div data-v-1a7a6550="" class="product-extra-images"><img data-v-1a7a6550="" src="https://vw-xerophane.storage.googleapis.com:443/thumbnails/products/34160_1Earring5th-5-18_1.jpg.100x100_q85_crop_upscale.jpg" width="50"><img data-v-1a7a6550="" src="https://vw-xerophane.storage.googleapis.com:443/thumbnails/products/34160_2Earring5th-5-18_1.jpg.100x100_q85_crop_upscale.jpg" width="50"></div>
Please let me know if you need additional details. Here's a sample of my latest code that couldn't extract the elements (in case it helps):
var htmlDoc = new HtmlAgilityPack.HtmlDocument()
{
    OptionFixNestedTags = true,
    OptionAutoCloseOnEnd = true,
    OptionReadEncoding = false
};
var imgNodes = htmlDoc.DocumentNode.SelectNodes("//div");
foreach (var imgNode in imgNodes)
{
    // decode the string
    var img = HttpUtility.HtmlDecode(imgNode.InnerText).Trim();
    imagesouces.Add(img);
}
File.WriteAllLines(@"C:\Users\user\Desktop\WriteText.txt", imagesouces);
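For reference, here is a minimal sketch of reading the `src` attributes once the rendered markup is available. It assumes the JavaScript has already run and the resulting HTML (like the `product-extra-images` snippet above) was captured as a string by some browser-based tool; HtmlAgilityPack does not execute scripts itself. The class and method names are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack; // NuGet package: HtmlAgilityPack

public static class ImageSourceExtractor
{
    // Returns the src of every <img> that is a direct child of a div
    // whose class attribute contains "product-extra-images".
    // Assumption: renderedHtml is the DOM *after* scripts have run.
    public static List<string> ExtractImageSources(string renderedHtml)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(renderedHtml);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        var imgNodes = doc.DocumentNode.SelectNodes(
            "//div[contains(@class, 'product-extra-images')]/img[@src]");
        if (imgNodes == null)
        {
            return new List<string>();
        }

        return imgNodes
            .Select(img => img.GetAttributeValue("src", string.Empty))
            .Where(src => src.Length > 0)
            .ToList();
    }
}
```

Note that the original loop reads `InnerText` of `div` nodes, which is empty for a `div` that only contains `img` elements; the image URLs live in the `src` attribute.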

Related

Webclient.DownloadString() does not retrieve current whole page

I know there is another question with practically identical title here: Webclient.DownloadString does not retrieve the whole page
But the solution there doesn't help me; maybe somebody else has the same problem.
I'm trying to get the html code of this URL:
https://cubebrush.co/?freebies=true
To achieve that, I'm using the following code in C#:
WebClient webClient = new WebClient();
string webString = webClient.DownloadString("https://cubebrush.co/?freebies=true");
But the retrieved html lacks some information, for example, all the button tags inside the website. This can be quickly checked using the library HtmlAgilityPack and checking for all the tags inside the website with the following code:
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(webString);
HashSet<string> hs = new HashSet<string>();
foreach (var dec in doc.DocumentNode.Descendants())
{
hs.Add(dec.Name);
}
If we run this, it will show 26 tags, but none of them will be a button tag. This makes sense, since the initial webString also lacks that "button information".
I've tried copying webString into a file to check whether, as the comment on the original post suggests, it was a problem with the visualizer, but it isn't: the visualizer and the file look exactly the same.
Can somebody tell me what I'm doing wrong? Thanks!

C# htmlAgility Webscrape html node inside the first Html node

I am trying to access these nodes
on this website.
http://bcres.paragonrels.com/publink/default.aspx?GUID=2033c143-cdf1-46b4-9aac-2e27371de22d&Report=Yes
however they appear to be in a secondary Html document within the initial one.
I am confused how I access the secondary html path and then parse through for the
this is an example of one of the nodes.
<div style="top:219px;left:555px;width:45px;height:14px;" id="" class="mls29">2</div>
I am using HtmlAgilityPack and I receive null whenever I try to access the div.
I tried working my way down the nodes, but it didn't work.
Any help, or a place to look up the necessary information to figure this out, would be appreciated.
var webGet = new HtmlWeb();
var document = webGet.Load("http://bcres.paragonrels.com/publink/default.aspx?GUID=d27a1d95-623d-4f6a-9e49-e2e46ede136c&Report=Yes");
var divTags = document.DocumentNode.SelectNodes("/html");
var text = document.DocumentNode.InnerText;
MessageBox.Show(text);
You will be able to scrape the data if you access the following URL:
http://bcres.paragonrels.com/publink/Report.aspx?outputtype=HTML&GUID=2033c143-cdf1-46b4-9aac-2e27371de22d&ListingID=262103824:0&Report=Yes&view=29&layout_id=63
As an example, scraping the Presented By field:
HtmlWeb w = new HtmlWeb();
var hd = w.Load("http://bcres.paragonrels.com/publink/Report.aspx?outputtype=HTML&GUID=2033c143-cdf1-46b4-9aac-2e27371de22d&ListingID=262103824:0&Report=Yes&view=29&layout_id=63");
// CssSelect comes from ScrapySharp.Extensions; FirstOrDefault() yields null when nothing matches
var presentedBy = hd.DocumentNode.CssSelect(".mls23.at-phone-link").FirstOrDefault();
if (presentedBy != null)
{
    Console.WriteLine(presentedBy.InnerText);
}
Remarks:
I use the ScrapySharp NuGet package along with HtmlAgilityPack so that I can scrape using CSS selectors instead of XPath expressions, which I find easier.
The URL you are scraping from is the problem. I am scraping from the last GET request performed after the page loads, which I found by analyzing the site's network requests and responses with the Firefox developer tools.
I have not yet identified what triggers this HTTP request in the end; it may be JavaScript code, or it may be one of the frame HTML documents requested by the main (frame-enabled) document.
If you only have a couple of URLs like this to scrape, then even manually extracting the correct URL is an option.
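If the secondary document does turn out to be referenced by a `frame` or `iframe` element in the outer page (an assumption; as noted above, the request may instead be triggered by JavaScript), a sketch like the following could list the candidate URLs to load next. The class and method names are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack; // NuGet package: HtmlAgilityPack

public static class FrameFinder
{
    // Collects the src of every <frame> and <iframe> in the outer document
    // and resolves each one against the URL the outer page was loaded from.
    public static List<Uri> FindFrameUrls(string outerHtml, Uri pageUri)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(outerHtml);

        return doc.DocumentNode.Descendants("frame")
            .Concat(doc.DocumentNode.Descendants("iframe"))
            .Select(node => node.GetAttributeValue("src", string.Empty))
            .Where(src => src.Length > 0)
            // Relative src values are resolved against the page URL.
            .Select(src => new Uri(pageUri, src))
            .ToList();
    }
}
```

Each returned URL can then be loaded with `HtmlWeb.Load` and queried as in the answer above. If the list comes back empty, the browser's network tab remains the way to find the real request.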

HTMLAgilityPack load AJAX content for scraping

I'm trying to scrape a webpage using HtmlAgilityPack in a C# WebForms project.
All the solutions I've seen for doing this use a WebBrowser control. However, from what I can determine, this is only available in WinForms projects.
At present I'm calling the required page via this code:
var getHtmlWeb = new HtmlWeb();
var document = getHtmlWeb.Load(inputUri);
HtmlAgilityPack.HtmlNodeCollection nodes = document.DocumentNode.SelectNodes("//div[@class=\"nav\"]");
An example of the code I've seen that uses the WebBrowser control:
if (this.webBrowser1.Document.GetElementsByTagName("html")[0] != null)
_htmlAgilityPackDocument.LoadHtml(this.webBrowser1.Document.GetElementsByTagName("html")[0].OuterHtml);
Any suggestions / pointers as to how to grab the page once AJAX has been loaded, will be appreciated.
It seems that using HTMLAgilityPack it is only possible to scrape content that is loaded via the html itself. Thus anything loaded via AJAX will not be visible to HTMLAgilityPack.
Perhaps the easiest option, where feasible, is to use a browser-based tool such as Firebug to determine the source of the data loaded by AJAX, and then request and process that source data directly. An added advantage of this might be the ability to scrape a larger dataset.
I struggled all day to get this right so here is a FedEx tracking example of what the accepted answer is referring to (I think):
Dim body As String
body = "data={""TrackPackagesRequest"":{""appType"":""WTRK"",""appDeviceType"":""DESKTOP"",""supportHTML"":true,""supportCurrentLocation"":true,""uniqueKey"":"""",""processingParameters"":{},""trackingInfoList"":[{""trackNumberInfo"":{""trackingNumber"":" & Chr(34) & "YOUR TRACKING NUMBER HERE" & Chr(34) & ",""trackingQualifier"":"""",""trackingCarrier"":""""}}]}}"
body = body & "&action=trackpackages&locale=en_US&version=1&format=json"
With CreateObject("MSXML2.XMLHTTP")
.Open("POST", "https://www.fedex.com/trackingCal/track", False)
.setRequestHeader("Referer", "https://www.fedex.com/apps/fedextrack/?tracknumbers=YOUR TRACKING NUMBER HERE")
.setRequestHeader("User-Agent", "Mozilla/5.0")
.setRequestHeader("X-Requested-With", "XMLHttpRequest")
.setRequestHeader("Content-Type", "application/x-www-form-urlencoded; charset=UTF-8")
.send(body)
Dim Reply = .responseText
End With
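For readers working in C#, the same idea (posting directly to the endpoint the page's AJAX call uses) might look roughly like the sketch below. This is not a FedEx client: the endpoint, referer, and payload are whatever you observe in the browser's network tab, and the `data` and `format` field names simply mirror the VB example above. The class and method names are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Net.Http;
using System.Text.Json;
using System.Threading.Tasks;

public static class AjaxEndpointClient
{
    private static readonly HttpClient Http = new HttpClient();

    // Builds a form-encoded POST that imitates the browser's AJAX call.
    // Assumption: the endpoint expects a JSON document in a "data" field.
    public static HttpRequestMessage BuildRequest(Uri endpoint, Uri referer, object payload)
    {
        var form = new Dictionary<string, string>
        {
            ["data"] = JsonSerializer.Serialize(payload),
            ["format"] = "json"
        };

        var request = new HttpRequestMessage(HttpMethod.Post, endpoint)
        {
            // Sets Content-Type: application/x-www-form-urlencoded
            Content = new FormUrlEncodedContent(form)
        };
        request.Headers.Referrer = referer;
        request.Headers.UserAgent.ParseAdd("Mozilla/5.0");
        request.Headers.TryAddWithoutValidation("X-Requested-With", "XMLHttpRequest");
        return request;
    }

    // Sends the request and returns the raw response body (typically JSON).
    public static async Task<string> FetchAsync(HttpRequestMessage request)
    {
        using (var response = await Http.SendAsync(request))
        {
            response.EnsureSuccessStatusCode();
            return await response.Content.ReadAsStringAsync();
        }
    }
}
```

Separating `BuildRequest` from `FetchAsync` makes it easy to compare the headers and body against what the browser sends before making any network call.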
Alternatively have you considered building a browser into your application using Cefsharp.net and then using Dev Tools through the .net interface?
You may have noticed that even dynamically AJAX/JS generated HTML can be found using e.g. Inspect Element option in Firefox. So that code is sitting on your computer even if you can't scrape it using traditional HTML scraping methods.
Another option to consider.
https://cefsharp.github.io/

What's the most efficient way to visit a .html page?

I have a .html page that just has 5 characters on it (4 numbers and a period).
The only way I know of is to make a webbrowser that navigates to a URL, then use
browser.GetElementByID();
However that uses IE so I'm sure it's slow. Is there any better way (without using an API, something built into C#) to simply visit a webpage in a fashion that you can read off of it?
Try these 2 lines:
var wc = new System.Net.WebClient();
string html = wc.DownloadString("http://google.com"); // Your page will be in that html variable
It appears that you want to download a url, parse it as html then to find an element and read its inner text, right? Use nuget to grab a reference to HtmlAgilityPack, then:
using (var wc = new System.Net.WebClient())
{
    string html = wc.DownloadString("http://foo.com");
    var doc = new HtmlAgilityPack.HtmlDocument();
    doc.LoadHtml(html);
    var el = doc.GetElementbyId("foo");
    if (el != null)
    {
        var text = el.InnerText;
        Console.WriteLine(text);
    }
}
Without using any APIs? You're in the .NET framework, so you're already using an abstraction layer to some extent. But if you want pure C# without any addons, you could just open a TCP socket to the site and download the contents (it's just a formatted string, after all) and read the data.
Here's a similar question: How to get page via TcpClient?

Trouble Scraping .HTM File

I have just begun scraping basic text off web pages, and am currently using the HTMLAgilityPack C# library. I had some success with boxscores off rivals.yahoo.com (sports is my thing so why not scrape something interesting?) but I am stuck on NHL's game summary pages. I think this is kind of an interesting problem so I would post it here.
The page I am testing is:
http://www.nhl.com/scores/htmlreports/20102011/GS020079.HTM
Upon first glance, it seems like basic text with no ajax or stuff to mess up a basic scraper. Then I realize I can't right click due to some javascript, so I work around that. I right click in firefox and get the xpath of the home team using XPather and I get:
/html/body/table[@id='MainTable']/tbody/tr[1]/td/table[@id='StdHeader']/tbody/tr/td/table/tbody/tr/td[3]/table[@id='Home']/tbody/tr[3]/td
When I try to grab that node / inner text, htmlagilitypack won't find it. Does anyone see anything strange in the page's source code that might be stopping me?
I am new to this and still learning how people might stop me from scraping, any tips or tricks are gladly appreciated!
p.s. I observe all site rules regarding bots, etc, but I noticed this strange behavior and saw it as a challenge.
Ok so it appears that my xpaths have tbody's in them. When I remove these tbodys manually from the xpath, HTMLAgilityPack can handle it fine.
I'd still like to know why I am getting invalid xpaths, but for now I have answered my question.
Unless my XPath knowledge is badly flawed (which is possible), I think the problem is the /tbody node in your XPath expression.
When I do
string test = string.Empty;
StreamReader sr = new StreamReader(@"C:\gs.htm");
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.Load(sr);
sr.Close();
sr = null;
string xpath = @"//table[@id='Home']/tr[3]/td";
test = doc.DocumentNode.SelectSingleNode(xpath).InnerText;
That works fine. It returns
"COLUMBUS BLUE JACKETSGame 5 Home Game 3"
which I hope is the string you wanted.
Examining the html I couldn't find a /tbody.
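As for why the XPath is "invalid": browsers add a `tbody` element to the DOM when a table's markup omits it, so XPaths generated by browser tools include a step that is not in the raw HTML that HtmlAgilityPack parses. A small sketch of removing those steps before querying follows; the helper name is hypothetical, and it should only be used when the page source really has no `tbody` tags.

```csharp
using System.Text.RegularExpressions;

public static class XPathCleaner
{
    // Removes "/tbody" steps (with an optional positional predicate such as
    // "/tbody[1]") from an XPath copied out of a browser tool.
    public static string RemoveTbodySteps(string xpath)
    {
        // The lookahead keeps element names that merely start with "tbody" intact.
        return Regex.Replace(xpath, @"/tbody(\[\d+\])?(?=/|$)", string.Empty);
    }
}
```

The cleaned expression can then be passed to `SelectSingleNode` as in the answer above.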
