C# scrape correct web content following jQuery

I've been using HtmlAgilityPack for a while, but the web resource I'm working with now appears to use jQuery to decide what the browser loads. What I expect to load is a product page, but what actually loads (verified with a WebBrowser control and WebClient.DownloadString) is a redirect asking the visitor to select a consultant and sign up with them.
In other words, using Chrome's Inspect >> Elements tool, I get:
<div data-v-1a7a6550="" class="product-extra-images">
<img data-v-1a7a6550="" src="https://vw-xerophane.storage.googleapis.com:443/thumbnails/products/10174_1MainImage-White-9-14_1.jpg.100x100_q85_crop_upscale.jpg" width="50">
<img data-v-1a7a6550="" src="https://vw-xerophane.storage.googleapis.com:443/thumbnails/products/10174_2Image2-White-9-14_1.jpg.100x100_q85_crop_upscale.jpg" width="50">
But WebBrowser and HTMLAgilityPack only get:
<div class="container content">
<div class="alert alert-danger " role="alert">
<button type="button" class="close" data-dismiss="alert">
<span aria-hidden="true">×</span>
</button>
<h2 style="text-align: center; background: none; padding-bottom: 0;">It looks like you haven't selected a Consultant yet!</h2>
<p style="text-align: center;"><span>...were you just wanting to browse or were you looking to shop and pick a Consultant to shop under?</span></p>
<div class="text-center">
<form action="/just-browsing/" method="POST" class="form-inline">
...
After digging into the class definitions in the head, I found that the page uses jQuery to handle the initial loading and to handle actions (scrolling, resizing, hovering over images, selecting other images, etc.) while the visitor browses the page. Here is the header comment of the jQuery file:
/*!
* jQuery JavaScript Library v2.1.4
* http://jquery.com/
*
* Includes Sizzle.js
* http://sizzlejs.com/
*
* Copyright 2005, 2014 jQuery Foundation, Inc. and other contributors
* Released under the MIT license
* http://jquery.org/license
*
* Date: 2015-04-28T16:01Z
*/
I tried ScrapySharp as described here:
C# .NET: Scraping dynamic (JS) websites
But that just ended up consuming all available memory and never producing anything.
Also this:
htmlagilitypack and dynamic content issue
Loaded the incorrect redirect as noted above.
I can provide more of the source I'm trying to extract from, including the complete jQuery if needed.

Use CaptureRedirect = false to bypass the redirection page. This worked for me with the page you mentioned:
var web = new HtmlWeb();
web.CaptureRedirect = false;
web.BrowserTimeout = TimeSpan.FromSeconds(15);
Now keep retrying until the text "Product Description" appears on the page:
var doc = web.LoadFromBrowser(url, html =>
{
return html.Contains("Product Description");
});
Recent versions of HtmlAgilityPack can run a browser in the background, so you don't really need another library like ScrapySharp to scrape dynamic content.
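Putting the pieces together, a minimal end-to-end sketch (the URL is a placeholder and the XPath for the product images is an assumption based on the markup shown in the question; adjust both for the actual page):

```csharp
// Sketch: load a JS-rendered page with HtmlAgilityPack's built-in browser.
// CaptureRedirect = false skips the consultant-selection redirect page;
// LoadFromBrowser keeps rendering until the predicate returns true.
using System;
using HtmlAgilityPack;

class Program
{
    static void Main()
    {
        var web = new HtmlWeb
        {
            CaptureRedirect = false,
            BrowserTimeout = TimeSpan.FromSeconds(15)
        };

        var doc = web.LoadFromBrowser("https://example.com/product-page", // placeholder URL
            html => html.Contains("Product Description"));

        // Hypothetical XPath matching the product-extra-images div from the question.
        var imgs = doc.DocumentNode.SelectNodes("//div[@class='product-extra-images']//img");
        if (imgs != null)
            foreach (var img in imgs)
                Console.WriteLine(img.GetAttributeValue("src", ""));
    }
}
```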

Related

Is it possible to give name to Quasar Uploader in <form>

I have a <form> where I am using Vue.js and Quasar, and it submits as an HTML form to ASP.NET Core 3.1.
The problem is when I use the Quasar Uploader (https://quasar.dev/vue-components/uploader).
The functionality works fine, but when I submit the form (POST to C# / .NET Core),
I can't get the file in the controller.
As you can see from this example: https://codepen.io/cbrown___/pen/GRZwpxw
When the uploader is rendered, its file input has no name attribute. From the example above you get this input:
<input tabindex="-1" type="file" title="" class="q-uploader__input overflow-hidden absolute-full">.
I guess that is why I can't get it in my controller. How can I solve this when I am using .NET Core 3.1 to submit this form?
And I see no good solution in letting people upload files through my API before the record is created.
Is there an option here I do not see?
EDIT:
The <input> is under the plus icon, so by inspecting elements you can see that no name attribute is present.
EXAMPLE CODE:
HTML
<div id="q-app">
<div class="q-pa-md">
<div class="q-gutter-sm row items-start">
<q-uploader
url="http://localhost:4444/upload"
style="max-width: 300px"
></q-uploader>
<q-uploader
url="http://localhost:4444/upload"
color="teal"
flat
bordered
style="max-width: 300px"
></q-uploader>
<q-uploader
url="http://localhost:4444/upload"
label="Upload files"
color="purple"
square
flat
bordered
style="max-width: 300px"
></q-uploader>
<q-uploader
url="http://localhost:4444/upload"
label="No thumbnails"
color="amber"
text-color="black"
no-thumbnails
style="max-width: 300px"
></q-uploader>
</div>
</div>
</div>
JS:
new Vue({
el: '#q-app'
})
If anybody else needs this, I ended up using:
:name="'file_description[' + index + ']'"
That way each uploader gets its own name, and from another array I knew how many file_description entries there would be.
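For reference, on the server side the indexed names can be bound to a collection. A minimal sketch of a .NET Core 3.1 controller action, assuming the file_description field prefix used above (controller and action names are hypothetical):

```csharp
// Sketch: accept files posted as file_description[0], file_description[1], ...
using System.Collections.Generic;
using Microsoft.AspNetCore.Http;
using Microsoft.AspNetCore.Mvc;

public class UploadController : Controller
{
    [HttpPost]
    public IActionResult Submit(List<IFormFile> file_description)
    {
        // Each IFormFile exposes the uploaded file's name, length and stream.
        foreach (var file in file_description)
        {
            // process file.FileName / file.OpenReadStream() here
        }
        return Ok(file_description.Count);
    }
}
```

ASP.NET Core's model binder maps the indexed field names onto the `List<IFormFile>` parameter because the parameter name matches the prefix.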

How do I download the HTML code of the url with the images NOT being hidden

I am trying to do some web scraping, but when I download the HTML of the URL the images are hidden: the class is "user-ad-row__image image image--is-hidden" instead of "user-ad-row__image image image--is-visible", which is what I see in my browser. I was checking whether WebClient changed anything. I am using the HtmlAgilityPack.
var url = "https://www.gumtree.com.au/s-motorcycles-scooters/wa/drz+400/k0c18322l3008845";
// Note: this fetches the HTML before any JavaScript has run
var httpClient = new HttpClient();
var html = await httpClient.GetStringAsync(url);
var htmlDocument = new HtmlAgilityPack.HtmlDocument();
htmlDocument.LoadHtml(html);
<div class="user-ad-row__main-image-wrapper user-ad-row__main-image-wrapper--has-image"><img class="user-ad-row__image image image--is-hidden" src="" alt="Suzuki Drz-400E"></div>
</div>
<div class="user-ad-row__details">
<div class="user-ad-row__info">
<p class="user-ad-row__title">Suzuki Drz-400E</p>
<div class="user-ad-price user-ad-price--row"><span class="user-ad-price__price">$4,250</span>
<!-- -->
<!-- --><span class="user-ad-price__price-negotiable user-ad-price__price-negotiable--with-price">Negotiable</span>
<!-- -->
<!-- -->
<!-- -->
</div>
<ul class="user-ad-attributes">
<li class="user-ad-attributes__attribute">Learner Approved</li>
<li class="user-ad-attributes__attribute">6000 km</li>
</ul>
<p id="user-ad-desc-MAIN-1228533281" class="user-ad-row__description user-ad-row__description--regular">For sale 2008 Drz-400E excellent condition, well looked after starts first time evertime serviced about a month ago. Just paid 3 months rego. Call or text </p>
</div>
<div class="user-ad-row__extra-info">
<div class="user-ad-row__location"><span class="user-ad-row__location-area">Perth City Area</span>Perth<span class="user-ad-row__distance"> </span></div>
<p class="user-ad-row__age">15/09/2019</p>
</div>
</div>
<button id="" type="button" class="user-ad-row__watchlist-heart-wrapper watchlist-heart Button__buttonBase--3YR6h Button__button--2NsdC Button__buttonBasic--3CSBx" role=""><span class="" aria-hidden="true"><span class="icon-heart heart"></span></span>
</button>
The website you provided loads its images using JavaScript, and HtmlAgilityPack only parses HTML; it is unable to run JavaScript.
Some solutions would be:
WebBrowser Class
It's kind of tricky to mix with the HtmlAgilityPack, but it provides decent performance.
Selenium
You can use Selenium plus a WebDriver for your preferred browser (Chrome, Firefox, PhantomJS). It is somewhat slow but very flexible.
Javascript.Net
It allows you to run scripts using Chrome's V8 JavaScript engine.
Near the bottom of the page there is something like <script src="/latest/resources/react/app.full.something.js"></script>. If you can figure out how that loads the images, you should be able to get all of them.
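As an example of the Selenium route, here is a minimal sketch using the Selenium.WebDriver and Selenium.WebDriver.ChromeDriver NuGet packages; the CSS selector mirrors the visible-image class from the question:

```csharp
// Sketch: render the page in headless Chrome so its JS runs, then read the DOM.
using System;
using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;

class Program
{
    static void Main()
    {
        var options = new ChromeOptions();
        options.AddArgument("--headless");

        using (IWebDriver driver = new ChromeDriver(options))
        {
            driver.Navigate().GoToUrl(
                "https://www.gumtree.com.au/s-motorcycles-scooters/wa/drz+400/k0c18322l3008845");

            // After the scripts run, the images should carry the --is-visible class.
            var images = driver.FindElements(By.CssSelector("img.image--is-visible"));
            foreach (var img in images)
                Console.WriteLine(img.GetAttribute("src"));
        }
    }
}
```

If you prefer HtmlAgilityPack's XPath queries, you can also hand `driver.PageSource` to an `HtmlDocument.LoadHtml` call after the page has rendered.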

HtmlAgilityPack (C#) can't read past hidden text

using the following url:
link to search results page
I am first trying to scrape the text of the <a> tag from this HTML, which is visible in the source code when viewed with Firebug:
<div id="search-results" class="search_results">
<div class="searchResultItem">
<div class="searchResultImage photo">
<h3 class="black">
<a class="linkmed " href="/content/1/2484243.html">加州旱象不减 开源节流声声急</a>
</h3>
<p class="resultPubDate">15.10.2014 06:08 </p>
<p class="resultText">
</div>
</div>
<p class="more-results">
But what I get back when I scrape the page is:
<div class="search_results" id="search-results">
<input type="hidden" name="ctl00$ctl00$cpAB$cp1$hidSearchType" id="hidSearchType">
</div>
<p class="more-results">
Is there any way to view the source the way Firebug does?
How are you scraping the page? Use something like Fiddler and check the request and the response for dynamic pages like these. Firebug sees more because all of the dynamic elements have already loaded by the time you view the page in your browser, whereas your scraping method only captures one piece of the puzzle (the initial HTML).
Hint: For this search page, you will see that the request for the results data is actually a) a separate GET request with b) a long query string and c) a cookie on the header, which returns a JSON object containing the data. This is why the link you posted just gives me "undefined," because it does not contain the search data.
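In code, replaying that captured request looks roughly like this; the query string and cookie values are placeholders you would copy from Fiddler (the URL and header values here are assumptions, not the site's real ones):

```csharp
// Sketch: replay the data request captured in Fiddler instead of scraping the HTML.
using System;
using System.Net.Http;
using System.Threading.Tasks;

class Program
{
    static async Task Main()
    {
        using (var client = new HttpClient())
        {
            // Copy the real query string and cookie from the captured request.
            var request = new HttpRequestMessage(HttpMethod.Get,
                "https://example.com/search/results?query=...");  // placeholder
            request.Headers.Add("Cookie", "session=...");          // placeholder

            var response = await client.SendAsync(request);
            var json = await response.Content.ReadAsStringAsync();
            Console.WriteLine(json); // parse with System.Text.Json or Json.NET
        }
    }
}
```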

Parsing with Async, HtmlAgilityPack, and XPath

I have run into a rather strange problem. It's very hard to explain so please bear with me, but basically here is a brief introduction:
I am new to Async programming but couldn't locate a problem in my code
I have used HtmlAgilityPack before, but never the .NET 4.5 version.
This is a learning project, I am not trying to scrape or anything like that.
Basically, what is happening is this: I am retrieving a page from the internet, loading it via stream into an HtmlDocument, then retrieving certain HtmlNodes from it using XPath expressions. Here is a piece of simplified code:
myStream = await httpClient.GetStreamAsync(string.Format("{0}{1}", SomeString, AnotherString));
using (myStream)
{
myDocument.Load(myStream);
}
The HTML is being retrieved correctly, but the HtmlNodes extracted by XPath are getting their HTML mangled. Here is a sample piece of HTML from a response captured in Fiddler:
<div id="menu">
<div id="splash">
<div id="menuItem_1" class="ScreenTitle" >Horse Racing</div>
<div id="menuItem_2" class="Title" >Wednesday Racing</div>
<div id="subMenu_2">
<div id="menuItem_3" class="Level2" >» 21.51 Britannia Way</div>
<div id="menuItem_4" class="Level2" >» 21.54 Britannia Way</div>
<div id="menuItem_5" class="Level2" >» 21.57 Britannia Way</div>
<div id="menuItem_6" class="Level2" >» 22.00 Britannia Way</div>
<div id="menuItem_7" class="Level2" >» 22.03 Britannia Way</div>
<div id="menuItem_8" class="Level2" >» 22.06 Britannia Way</div>
</div>
</div>
</div>
The XPath I am using is 100% correct because it works in the browser on the same page, but here is an example of a tag it retrieved from the previously shown page:
1.54 Britannia Way</
And here is the original which I copied from above for simplicity:
21.54 Britannia Way</div>
As you can see, the InnerText has changed considerably and so has the URL. Obviously my program doesn't work, but I don't know why. What can cause this? Is it a bug in HtmlAgilityPack? Please advise! Thanks for reading!
Don't assume that an XPath expression that works in your browser (after DOM conversion, possibly with data loaded via AJAX, ...) will also match the raw page source. This seems to be a site serving betting quotes; I'd guess they load the data with some JavaScript calls.
Verify whether your XPath expression matches the page's source code (as fetched using wget, or via "View Source" in your browser – don't use Firebug/... for this!).
If the site is using AJAX to load the data, you might have luck using Firebug to monitor which resources get fetched while the page loads. Often these are JSON or XML files that are very easy to parse, and working with them is even easier than parsing a website's horrible mess of HTML.
Update: In this special case, the site forwards users not sending an Accept-Language header to a language-selection-page. Send such a header to receive the same contents as the browser does. In curl, it would look like this:
curl -H "Accept-Language: en-US;q=0.6,en;q=0.4" https://mobile.bet365.com/sport/splash/Default.aspx?Sport
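The same header can be sent from C#, for example with HttpClient (a sketch; the URL is the one from the curl command above):

```csharp
// Sketch: send an Accept-Language header so the site serves the real content
// instead of redirecting to its language-selection page.
using System;
using System.Net.Http;
using System.Threading.Tasks;

class Program
{
    static async Task Main()
    {
        using (var client = new HttpClient())
        {
            client.DefaultRequestHeaders.AcceptLanguage.ParseAdd("en-US;q=0.6");
            client.DefaultRequestHeaders.AcceptLanguage.ParseAdd("en;q=0.4");

            var html = await client.GetStringAsync(
                "https://mobile.bet365.com/sport/splash/Default.aspx?Sport");
            Console.WriteLine(html.Length);
        }
    }
}
```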
After many hours of guessing and debugging, the problem turned out to be an HtmlDocument that I was re-using. I solved the problem by creating a new HtmlDocument each time I wanted to load a new page, instead of using the same one.
I hope this saves you time that I lost!
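In code, the fix amounts to this pattern (a sketch; the URLs and XPath are illustrative placeholders):

```csharp
// Sketch: create a fresh HtmlDocument for every page instead of re-using one.
using System;
using System.Net.Http;
using System.Threading.Tasks;
using HtmlAgilityPack;

class Program
{
    static async Task Main()
    {
        var httpClient = new HttpClient();
        var urls = new[] { "https://example.com/a", "https://example.com/b" }; // placeholders

        foreach (var url in urls)
        {
            // New document each iteration: state from the previous page
            // (nodes, ids, encoding detection) cannot leak into this one.
            var doc = new HtmlDocument();
            using (var stream = await httpClient.GetStreamAsync(url))
            {
                doc.Load(stream);
            }
            var nodes = doc.DocumentNode.SelectNodes("//div[@class='Level2']");
            Console.WriteLine(nodes?.Count ?? 0);
        }
    }
}
```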

Passing ASP.NET values into an external web server's iframe using GET

I will be putting a big bounty on this question as it's a service I think people would really be able to leverage.
I recently wrote a script using ASP.NET MVC/C#. It is simple enough: it displays a frontend for a database of users on a webpage. The entire code is posted below; it doesn't really need to be read in detail, as it is quite simple, but it gives the full background of what I am doing -
Here is the main page of the site :
@model IEnumerable<WhoIs.Models.Employee>
@{
ViewBag.Title = "Contoso Employees";
}
@section featured {
<section class="featured">
<div class="content-wrapper">
<hgroup class="title">
<h1>@ViewBag.Title.</h1>
</hgroup>
</div>
</section>
}
<label for="filter">Enter employee details here : </label>
<input type="text" name="filter" value="" id="filter" />
<h2><strong>Users</strong> (create new)</h2>
<br />
<div style="min-height: 150px; font-size: 1.25em">
<div style="margin-bottom: .5em">
<table><thead><tr><th>Name</th><th>Branch</th><th>Phone No.</th><th>Username</th><th>Email</th></tr></thead>
<tbody>
@foreach ( var prod in Model )
{
<tr>
<td>@prod.FullName</td>
<td>@prod.Branch</td>
<td>@prod.PhoneNo</td>
<td>@prod.DomainAC</td>
<td>@prod.Email</td>
@if (User.IsInRole(@"Admins") || User.Identity.Name == prod.DomainAC) {
<td>edit</td>
} else {
<td>User => @User.ToString()</td>
}
<td><input type="checkbox" name="message" value="@prod.PhoneNo">Message<br></td>
</tr>
}
</tbody>
</table>
</div>
</div>
<div id="iframe2">
<br />
<iframe src="http://webaddress.com/web/login.htm" width="400" height="400" />
</div>
This renders a simple page with a list of employees at Contoso, giving admins and the users themselves the ability to edit their details. Below that, I have an iframe of a remote web server I do not control. The iframe has some input boxes which pass values into a PHP function via GET. When the checkboxes above are selected in my ASP.NET MVC view, I would like to pass the value of each checkbox (the user's phone number) into my URL/GET command.
How can I do this ?
For completeness, although this cannot be edited, the HTML of the iframe looks like :
echo "<body>";
echo "<form action=d.php method=get>";
echo "<input type=hidden name=u value=$u>"; <!--Username passed from previous page of iFrame-->
echo "<input type=hidden name=p value=$p>";<!--Password passed from previous page of iFrame-->
echo "<input type=text name=s value=$s>";<!--List of phone numbers-->
The PHP in d.php simplifies to :
$u = $_GET['u'];
$p = $_GET['p'];
$s = $_GET['s'];
So if I put in some values, the URL changes to :
http://webaddress.com/web/d.php?u=112233&p=1234&s=12345678910&m=test
What I want to do is, for each checkbox selected above, append to &s a comma followed by the phone number of the row of the selected user.
So, for instance, if I have this in my ASP.NET MVC
Full Name Branch PhoneNo DomainAC Email Checkbox
Test1 Test1 7777777 DOMAIN\TEST1 test1@test1.com Ticked
I would like to run http://webaddress.com/web/d.php?u=112233&p=1234&s=7777777&m=test
If I have another user named "John" with a phone number of 121212, then :
http://webaddress.com/web/d.php?u=112233&p=1234&s=7777777,121212&m=test
Is this possible? How can I do this ?
If I understand your question correctly, you are trying to manipulate the values of input text fields and checkboxes inside the iframe.
The simple answer is:
If the remote domain is different from your hosting domain, it is not possible to call methods or access the iframe's content document directly using javascript. You can use cross-document messaging, but that won't work in this scenario as you cannot change the remote site's source.
If both were on the same domain, this could be accomplished by using javascript. An example of accessing the iframe's document would be:
var iframe_doc = document.getElementById('iframe_id').contentWindow.document;
From there you could traverse the DOM using JS as you normally would.
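Since the remote page reads everything from its query string, a workaround that does not touch the iframe's DOM at all is to build the GET URL server-side from the checked phone numbers and use it as the iframe's src. A sketch (the helper name is hypothetical; the u/p/s/m parameters are taken from the question's example URLs, and the values are assumed to be URL-safe digits):

```csharp
// Sketch: build the d.php URL from the checked phone numbers.
using System;
using System.Collections.Generic;

static class MessageUrl
{
    public static string Build(string baseUrl, string u, string p, IEnumerable<string> phones)
    {
        // Comma-separate the selected numbers, as in s=7777777,121212
        var s = string.Join(",", phones);
        return $"{baseUrl}?u={u}&p={p}&s={s}&m=test";
    }
}
```

For example, `MessageUrl.Build("http://webaddress.com/web/d.php", "112233", "1234", new[] { "7777777", "121212" })` produces `http://webaddress.com/web/d.php?u=112233&p=1234&s=7777777,121212&m=test`, matching the URL shape from the question.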
