HtmlAgilityPack (C#) can't read past hidden text - c#

Using the following URL:
link to search results page
I am trying to scrape the text of the a tag from this HTML, which is what Firebug shows for the page:
<div id="search-results" class="search_results">
<div class="searchResultItem">
<div class="searchResultImage photo">
<h3 class="black">
<a class="linkmed " href="/content/1/2484243.html">加州旱象不减 开源节流声声急</a>
</h3>
<p class="resultPubDate">15.10.2014 06:08 </p>
<p class="resultText">
</div>
</div>
<p class="more-results">
But what I get back when I scrape the page is:
<div class="search_results" id="search-results">
<input type="hidden" name="ctl00$ctl00$cpAB$cp1$hidSearchType" id="hidSearchType">
</div>
<p class="more-results">
Is there any way to get the source the way Firebug shows it?

How are you scraping the page? For dynamic pages like this one, use a tool such as Fiddler to inspect the requests and responses. Firebug shows more because by the time you view the page in your browser, all of the dynamic elements have already loaded, whereas your scraper only retrieves one piece of the puzzle: the initial HTML.
Hint: for this search page, you will see that the results data comes from a) a separate GET request with b) a long query string and c) a cookie in the request headers, and that the response is a JSON object containing the data. This is why the link you posted just gives me "undefined": on its own it does not carry the search data.
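As a rough illustration of the hint above, here is a minimal C# sketch that rebuilds such a request by hand. The endpoint URL, the query parameter names and the cookie value are all placeholders (they are not taken from the site in question); copy the real ones from the request you see in Fiddler.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Net.Http;
using System.Text.Json;
using System.Threading.Tasks;

public static class SearchApiSketch
{
    // Builds the GET request for the results data: endpoint + escaped query string + cookie.
    // All three inputs are assumptions; take the real values from the captured browser request.
    public static HttpRequestMessage BuildRequest(
        string endpoint, IDictionary<string, string> query, string cookieHeader)
    {
        var queryString = string.Join("&", query.Select(kv =>
            Uri.EscapeDataString(kv.Key) + "=" + Uri.EscapeDataString(kv.Value)));

        var request = new HttpRequestMessage(HttpMethod.Get, endpoint + "?" + queryString);
        request.Headers.Add("Cookie", cookieHeader);
        request.Headers.Add("Accept", "application/json");
        return request;
    }

    // Sends the request and parses the JSON body; the caller decides which properties to read.
    public static async Task<JsonDocument> FetchResultsAsync(HttpClient client, HttpRequestMessage request)
    {
        using (var response = await client.SendAsync(request))
        {
            response.EnsureSuccessStatusCode();
            var json = await response.Content.ReadAsStringAsync();
            return JsonDocument.Parse(json);
        }
    }
}
```

When sending it, it is probably safest to create the client as new HttpClient(new HttpClientHandler { UseCookies = false }) so the handler's own cookie container does not interfere with the hand-written Cookie header. The shape of the returned JSON depends entirely on the site, so the sketch stops at JsonDocument.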

Related

Is it possible to give name to Quasar Uploader in <form>

I have a <form> built with Vue.js and Quasar that is submitted as a regular HTML form to ASP.NET Core 3.1.
The problem arises when I use Quasar Uploader (https://quasar.dev/vue-components/uploader).
The uploader itself works fine, but when I submit the form (a POST to C# and .NET Core), I can't get the file in the controller.
As you can see from this example: https://codepen.io/cbrown___/pen/GRZwpxw
When the Uploader is rendered, its input does not have a name attribute. The example above produces this input:
<input tabindex="-1" type="file" title="" class="q-uploader__input overflow-hidden absolute-full">.
I guess that is why I can't get it from my controller. How can I solve this when I am using .NET Core 3.1 to submit this form?
I also don't see a good solution in letting people upload files through my API before the record is created.
Is there an option here that I do not see?
EDIT:
The <input> sits on the plus icon, so if you inspect that element you should see that it has no name attribute.
EXAMPLE CODE:
HTML
<div id="q-app">
<div class="q-pa-md">
<div class="q-gutter-sm row items-start">
<q-uploader
url="http://localhost:4444/upload"
style="max-width: 300px"
></q-uploader>
<q-uploader
url="http://localhost:4444/upload"
color="teal"
flat
bordered
style="max-width: 300px"
></q-uploader>
<q-uploader
url="http://localhost:4444/upload"
label="Upload files"
color="purple"
square
flat
bordered
style="max-width: 300px"
></q-uploader>
<q-uploader
url="http://localhost:4444/upload"
label="No thumbnails"
color="amber"
text-color="black"
no-thumbnails
style="max-width: 300px"
></q-uploader>
</div>
</div>
</div>
JS:
new Vue({
el: '#q-app'
})
If anybody else needs this, I ended up using this:
:name="'file_description[' + index + ']'"
That way each uploader gets its own name, and from another array I know how many file_description entries to expect.

How do I download the HTML code of the url with the images NOT being hidden

I am trying to do some web scraping, but when I download the HTML of the URL the images are hidden: the markup contains "user-ad-row__image image image--is-hidden" instead of the "user-ad-row__image image image--is-visible" that I see in my browser. I was also checking whether WebClient changed anything. I am using HtmlAgilityPack.
var url = "https://www.gumtree.com.au/s-motorcycles-scooters/wa/drz+400/k0c18322l3008845";
var httpClient = new HttpClient();
var html = await httpClient.GetStringAsync(url);
var htmlDocument = new HtmlAgilityPack.HtmlDocument();
WebClient client = new WebClient();
htmlDocument.LoadHtml(client.DownloadString(url));
<div class="user-ad-row__main-image-wrapper user-ad-row__main-image-wrapper--has-image"><img class="user-ad-row__image image image--is-hidden" src="" alt="Suzuki Drz-400E"></div>
</div>
<div class="user-ad-row__details">
<div class="user-ad-row__info">
<p class="user-ad-row__title">Suzuki Drz-400E</p>
<div class="user-ad-price user-ad-price--row"><span class="user-ad-price__price">$4,250</span>
<!-- -->
<!-- --><span class="user-ad-price__price-negotiable user-ad-price__price-negotiable--with-price">Negotiable</span>
<!-- -->
<!-- -->
<!-- -->
</div>
<ul class="user-ad-attributes">
<li class="user-ad-attributes__attribute">Learner Approved</li>
<li class="user-ad-attributes__attribute">6000 km</li>
</ul>
<p id="user-ad-desc-MAIN-1228533281" class="user-ad-row__description user-ad-row__description--regular">For sale 2008 Drz-400E excellent condition, well looked after starts first time evertime serviced about a month ago. Just paid 3 months rego. Call or text </p>
</div>
<div class="user-ad-row__extra-info">
<div class="user-ad-row__location"><span class="user-ad-row__location-area">Perth City Area</span>Perth<span class="user-ad-row__distance"> </span></div>
<p class="user-ad-row__age">15/09/2019</p>
</div>
</div>
<button id="" type="button" class="user-ad-row__watchlist-heart-wrapper watchlist-heart Button__buttonBase--3YR6h Button__button--2NsdC Button__buttonBasic--3CSBx" role=""><span class="" aria-hidden="true"><span class="icon-heart heart"></span></span>
</button>
The website you provided loads its images using JavaScript. HtmlAgilityPack only parses the HTML it is given and cannot run JavaScript, so the image src attributes are never filled in.
Some solutions would be:
WebBrowser Class
It's kind of tricky if you want to mix it with the HtmlAgilityPack but provides decent performance.
Selenium
You can use Selenium plus a WebDriver for your preferred browser (Chrome, Firefox, PhantomJS). It is somewhat slow but very flexible.
Javascript.Net
It allows you to run scripts using Chrome's V8 JavaScript engine. Near the bottom of the page there will be something like <script src="/latest/resources/react/app.full.something.js"></script>
If you are able to figure out how that loads then you should be able to get all of the images.
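For the Selenium option, a minimal sketch might look like the following. It assumes the Selenium.WebDriver and Selenium.Support NuGet packages and a matching ChromeDriver are installed; the CSS selector is taken from the class names in the question, and the 15-second timeout is an arbitrary choice.

```csharp
using System;
using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;
using OpenQA.Selenium.Support.UI;

public static class RenderedPageSketch
{
    // Loads the page in headless Chrome, waits for JavaScript to reveal at least one image,
    // then hands the rendered markup to HtmlAgilityPack for the usual XPath queries.
    public static HtmlAgilityPack.HtmlDocument LoadRendered(string url)
    {
        var options = new ChromeOptions();
        options.AddArgument("--headless");

        using (var driver = new ChromeDriver(options))
        {
            driver.Navigate().GoToUrl(url);

            // Assumption: a visible ad image carries the "image--is-visible" class.
            var wait = new WebDriverWait(driver, TimeSpan.FromSeconds(15));
            wait.Until(d => d.FindElements(By.CssSelector("img.image--is-visible")).Count > 0);

            var doc = new HtmlAgilityPack.HtmlDocument();
            doc.LoadHtml(driver.PageSource);
            return doc;
        }
    }
}
```

If the site only loads images as they scroll into view, the page may also need to be scrolled before every src is populated; the sketch does not attempt that.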

C# scrape correct web content following jquery

I've been using HtmlAgilityPack for a while, but the web resource I am working with now appears to rely on jQuery to decide what the browser loads. What I expect to load is a product page, but what actually loads (verified with both a WebBrowser control and WebClient.DownloadString) is a redirect page asking the visitor to select a consultant and sign up with them.
In other words, using Chrome's Inspect >> Elements tool, I get:
<div data-v-1a7a6550="" class="product-extra-images">
<img data-v-1a7a6550="" src="https://vw-xerophane.storage.googleapis.com:443/thumbnails/products/10174_1MainImage-White-9-14_1.jpg.100x100_q85_crop_upscale.jpg" width="50">
<img data-v-1a7a6550="" src="https://vw-xerophane.storage.googleapis.com:443/thumbnails/products/10174_2Image2-White-9-14_1.jpg.100x100_q85_crop_upscale.jpg" width="50">
But WebBrowser and HtmlAgilityPack only get:
<div class="container content">
<div class="alert alert-danger " role="alert">
<button type="button" class="close" data-dismiss="alert">
<span aria-hidden="true">×</span>
</button>
<h2 style="text-align: center; background: none; padding-bottom: 0;">It looks like you haven't selected a Consultant yet!</h2>
<p style="text-align: center;"><span>...were you just wanting to browse or were you looking to shop and pick a Consultant to shop under?</span></p>
<div class="text-center">
<form action="/just-browsing/" method="POST" class="form-inline">
...
After digging into the class definitions in the head, I found that the page does use jQuery to handle proper loading and to handle actions (scrolling, resizing, hovering over images, selecting other images, etc.) while the visitor browses the page. Here is the header of the jQuery file:
/*!
* jQuery JavaScript Library v2.1.4
* http://jquery.com/
*
* Includes Sizzle.js
* http://sizzlejs.com/
*
* Copyright 2005, 2014 jQuery Foundation, Inc. and other contributors
* Released under the MIT license
* http://jquery.org/license
*
* Date: 2015-04-28T16:01Z
*/
I tried ScrapySharp as described here:
C# .NET: Scraping dynamic (JS) websites
But that just ended up consuming all available memory and never producing anything.
Also this:
htmlagilitypack and dynamic content issue
Loaded the incorrect redirect as noted above.
I can provide more of the source I'm trying to extract from, including the complete jQuery if needed.
Set CaptureRedirect = false to bypass the redirection page. This worked for me with the page you mentioned:
var web = new HtmlWeb();
web.CaptureRedirect = false;
web.BrowserTimeout = TimeSpan.FromSeconds(15);
Then keep polling until the text "Product Description" appears on the page:
var doc = web.LoadFromBrowser(url, html =>
{
return html.Contains("Product Description");
});
Recent versions of HtmlAgilityPack can drive a browser in the background, so we don't really need another library like ScrapySharp to scrape dynamic content.

ASP.NET MVC C# show image from text url in database

I have been searching for an answer to this but can't seem to find one.
<div class="text-center">
<img src="@Html.DisplayFor(model => model.Image)" class="img-thumbnail">
</div>
This is the section I'm using. Basically I'm inputting the url as text into the database and when I use the above code (in the details page) it shows the image in Internet Explorer, but not in Chrome.
I'm VERY new to this so I'm unsure how to do this.
Thanks
Use Url.Content as shown below since Model.Image has the URL:
<div class="text-center">
<img src="@Url.Content(Model.Image)" class="img-thumbnail">
</div>

How can I pull HTML block from a external location and render it with Razor?

Is there any way I can do the equivalent of the following in Razor?
<div>
<c:import url="http://hostName/HTML-file-name/" />
</div>
I would like to pull HTML from a given location and render it on a page. This should be possible...
Hope this makes sense...
In Razor, no. In HTML yes:
<div>
<iframe src="http://hostName/HTML-file-name/"></iframe>
</div>
Well actually you could use server side code to send an HTTP request to the remote resource and display the result inline:
<div>
@Html.Raw(new System.Net.WebClient().DownloadString("http://hostName/HTML-file-name/"))
</div>
But bear in mind that this fetches only the content located at the specified address. If that is, for example, an HTML page referencing external CSS and JavaScript files, those files will not be retrieved.
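If you would rather not make a blocking call from inside the view, one possible variation is to fetch the fragment in the controller and pass the string to the view. The helper below is only a sketch: the names RemoteFragment and FetchOrEmptyAsync are made up for illustration, and returning an empty string on failure is one choice among several.

```csharp
using System.Net.Http;
using System.Threading.Tasks;

public static class RemoteFragment
{
    // One shared HttpClient, reused across requests.
    private static readonly HttpClient SharedClient = new HttpClient();

    // Returns the remote HTML, or an empty string if the request fails or times out,
    // so that a problem on the remote host does not break the whole page.
    public static async Task<string> FetchOrEmptyAsync(string url, HttpClient client = null)
    {
        try
        {
            return await (client ?? SharedClient).GetStringAsync(url);
        }
        catch (HttpRequestException)
        {
            return string.Empty;
        }
        catch (TaskCanceledException)
        {
            return string.Empty;
        }
    }
}
```

The controller action would await RemoteFragment.FetchOrEmptyAsync(...), put the result on the model, and the view would render it with Html.Raw as above. The same caveat applies: only the markup at that address is fetched, not the CSS or scripts it references.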
