I have run into a rather strange problem. It's hard to explain, so please bear with me; here is a brief introduction:
I am new to async programming, but I couldn't locate a problem in my code.
I have used HtmlAgilityPack before, but never the .NET 4.5 version.
This is a learning project; I am not trying to scrape or anything like that.
Basically, what is happening is this: I am retrieving a page from the internet, loading it via stream into an HtmlDocument, then retrieving certain HtmlNodes from it using XPath expressions. Here is a piece of simplified code:
myStream = await httpClient.GetStreamAsync(string.Format("{0}{1}", SomeString, AnotherString));
using (myStream)
{
    myDocument.Load(myStream);
}
The HTML is being retrieved correctly, but the HtmlNodes extracted by XPath have their HTML mangled. Here is a sample piece of HTML from a response captured in Fiddler:
<div id="menu">
<div id="splash">
<div id="menuItem_1" class="ScreenTitle" >Horse Racing</div>
<div id="menuItem_2" class="Title" >Wednesday Racing</div>
<div id="subMenu_2">
<div id="menuItem_3" class="Level2" >» 21.51 Britannia Way</div>
<div id="menuItem_4" class="Level2" >» 21.54 Britannia Way</div>
<div id="menuItem_5" class="Level2" >» 21.57 Britannia Way</div>
<div id="menuItem_6" class="Level2" >» 22.00 Britannia Way</div>
<div id="menuItem_7" class="Level2" >» 22.03 Britannia Way</div>
<div id="menuItem_8" class="Level2" >» 22.06 Britannia Way</div>
</div>
</div>
</div>
The XPath I am using is 100% correct because it works in the browser on the same page, but here is an example of a tag which it is retrieving from the previously shown page:
1.54 Britannia Way</
And here is the original which I copied from above for simplicity:
21.54 Britannia Way</div>
As you can see, the InnerText has changed considerably, and so has the URL. Obviously my program doesn't work, but I don't know why. What can cause this? Is it a bug in HtmlAgilityPack? Please advise! Thanks for reading!
Don't assume that an XPath expression that works in your browser will also match the raw HTML: the browser queries the DOM after conversion, possibly after data has been loaded with AJAX. This seems to be a site giving betting quotes; I'd guess they're loading the data with some JavaScript calls.
Verify whether your XPath expression matches the page's actual source code (as fetched using wget, or by clicking "View Source" in your browser; don't use Firebug or similar for this!).
If the site is using AJAX to load the data, you might have luck using Firebug to monitor which resources get fetched while the page loads. Often these are JSON or XML files that are very easy to parse, and it's easier to work with them than to parse a horrible mess of HTML.
Update: In this particular case, the site forwards users who don't send an Accept-Language header to a language-selection page. Send such a header to receive the same content as the browser does. In curl, it would look like this:
curl -H "Accept-Language: en-US;q=0.6,en;q=0.4" https://mobile.bet365.com/sport/splash/Default.aspx?Sport
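The same header can be sent from C# with HttpClient. A sketch (the header value is copied from the curl call above):

```csharp
// Sketch: send the same Accept-Language header from HttpClient,
// so the site serves the localized page instead of redirecting
// to the language-selection page.
using System;
using System.Net.Http;
using System.Threading.Tasks;

class AcceptLanguageExample
{
    static async Task Main()
    {
        using (var httpClient = new HttpClient())
        {
            httpClient.DefaultRequestHeaders.Add(
                "Accept-Language", "en-US;q=0.6,en;q=0.4");

            string html = await httpClient.GetStringAsync(
                "https://mobile.bet365.com/sport/splash/Default.aspx?Sport");
            Console.WriteLine(html.Length);
        }
    }
}
```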
After many hours of guessing and debugging, the problem turned out to be the HtmlDocument I was re-using. I solved the problem by creating a new HtmlDocument each time I wanted to load a new page, instead of using the same one.
I hope this saves you the time that I lost!
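The fix can be sketched like this (assuming HtmlAgilityPack's HtmlDocument and an existing HttpClient; the method name is mine):

```csharp
// Sketch of the fix: construct a fresh HtmlDocument for every page
// instead of reusing one instance across loads.
async Task<HtmlDocument> LoadPageAsync(HttpClient httpClient, string url)
{
    var document = new HtmlDocument(); // new document per page
    using (var stream = await httpClient.GetStreamAsync(url))
    {
        document.Load(stream);
    }
    return document;
}
```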
Related
I have a <form> using Vue.js and Quasar that submits as an HTML form in ASP.NET Core 3.1.
The problem occurs when I use the Quasar Uploader (https://quasar.dev/vue-components/uploader).
The functionality works fine, but when I submit the form (POST in C# and .NET Core),
I can't get the file in the controller.
As you can see from this example: https://codepen.io/cbrown___/pen/GRZwpxw
When the Uploader is rendered, its file input does not have a name attribute. From the example above you get this input:
<input tabindex="-1" type="file" title="" class="q-uploader__input overflow-hidden absolute-full">
I guess that is why I can't get it from my controller. How can I solve this when I am using .NET Core 3.1 to submit this form?
And I see no good solution in letting people upload files through my API before the record is created.
Is there an option here I do not see?
EDIT:
The <input> is on the plus icon, so by using inspect element you should be able to see that no name attribute is present.
EXAMPLE CODE:
HTML
<div id="q-app">
<div class="q-pa-md">
<div class="q-gutter-sm row items-start">
<q-uploader
url="http://localhost:4444/upload"
style="max-width: 300px"
></q-uploader>
<q-uploader
url="http://localhost:4444/upload"
color="teal"
flat
bordered
style="max-width: 300px"
></q-uploader>
<q-uploader
url="http://localhost:4444/upload"
label="Upload files"
color="purple"
square
flat
bordered
style="max-width: 300px"
></q-uploader>
<q-uploader
url="http://localhost:4444/upload"
label="No thumbnails"
color="amber"
text-color="black"
no-thumbnails
style="max-width: 300px"
></q-uploader>
</div>
</div>
</div>
JS:
new Vue({
el: '#q-app'
})
If anybody else needs this, I ended up using:
:name="'file_description[' + index + ']'"
That way each uploader got its own name... and from another array I knew how many file_description entries there would be.
I've been using HtmlAgilityPack for a while, but the web resource I am working with now has what seems like a jQuery-driven loading step that the browser passes through. What I expect to load is a product page, but what actually loads (verified with a WebBrowser control and a WebClient DownloadString) is a redirect asking the visitor to select a consultant and sign up with them.
In other words, using Chrome's Inspect >> Elements tool, I get:
<div data-v-1a7a6550="" class="product-extra-images">
<img data-v-1a7a6550="" src="https://vw-xerophane.storage.googleapis.com:443/thumbnails/products/10174_1MainImage-White-9-14_1.jpg.100x100_q85_crop_upscale.jpg" width="50">
<img data-v-1a7a6550="" src="https://vw-xerophane.storage.googleapis.com:443/thumbnails/products/10174_2Image2-White-9-14_1.jpg.100x100_q85_crop_upscale.jpg" width="50">
But WebBrowser and HTMLAgilityPack only get:
<div class="container content">
<div class="alert alert-danger " role="alert">
<button type="button" class="close" data-dismiss="alert">
<span aria-hidden="true">×</span>
</button>
<h2 style="text-align: center; background: none; padding-bottom: 0;">It looks like you haven't selected a Consultant yet!</h2>
<p style="text-align: center;"><span>...were you just wanting to browse or were you looking to shop and pick a Consultant to shop under?</span></p>
<div class="text-center">
<form action="/just-browsing/" method="POST" class="form-inline">
...
After digging into the class definitions in the head, I found the page does use jQuery to handle proper loading, and to handle actions (scrolling, resizing, hovering over images, selecting other images, etc.) while the visitor browses the page. Here's the header of the jQuery:
/*!
* jQuery JavaScript Library v2.1.4
* http://jquery.com/
*
* Includes Sizzle.js
* http://sizzlejs.com/
*
* Copyright 2005, 2014 jQuery Foundation, Inc. and other contributors
* Released under the MIT license
* http://jquery.org/license
*
* Date: 2015-04-28T16:01Z
*/
I tried ScrapySharp as described here:
C# .NET: Scraping dynamic (JS) websites
But that just ended up consuming all available memory and never produced anything.
Also this:
htmlagilitypack and dynamic content issue
That loaded the incorrect redirect noted above.
I can provide more of the source I'm trying to extract from, including the complete jQuery if needed.
Use CaptureRedirect = false; to bypass redirection page. This worked for me with the page you mentioned:
var web = new HtmlWeb();
web.CaptureRedirect = false;
web.BrowserTimeout = TimeSpan.FromSeconds(15);
Then keep retrying until the text "Product Description" appears on the page:
var doc = web.LoadFromBrowser(url, html =>
{
return html.Contains("Product Description");
});
The latest versions of HtmlAgilityPack can run a browser in the background, so we don't really need another library like ScrapySharp for scraping dynamic content.
Using HtmlAgilityPack, I am trying to generate a list of clickable objects with the function FindElementsByXPath, based on the structure below.
<div class = "table-container">
<div>
<strong>
<a>Txt<a/>
</strong>
</div>
<Table class="sc" style="display: None;">
</Table>
</div>
The problem however is that I only want to include the deepest-level a-tag if the table has the style-attribute set to "display: None;" (note that if the table is already expanded, the style attribute does not exist).
I am trying to generate an XPath expression that would help me achieve this. So far, I have made this:
//*[#class='table-container' and table[contains(#style,'display: None;')]]/div/strong/a
However, this is not working. I tried to search for the solution online and experimented with various settings, but no luck so far. I am new to XPath selectors and find myself stuck at this moment. Any help would be appreciated.
Solution
The following query should work:
//*[#class='table-container' and Table[contains(#style,'display: None;')]]/div/strong/a
It's very close to what you had.
Testing
I tested it on the following Xml:
<?xml version="1.0" encoding="UTF-8"?>
<root>
<div class="table-container">
<div>
<strong>
<a>Txt</a>
</strong>
</div>
<Table class="sc" style="display: None;"/>
</div>
<div class="table-container">
<div>
<strong>
<a>Txt2</a>
</strong>
</div>
<Table class="sc"/>
</div>
</root>
and it returns
<a>Txt</a>
Notes
Your query was basically correct. Note the following.
Xml parsers can be really finicky. Check the case of the items in the selectors. For example, table might not match, but Table might.
Xml parsers can be really fragile. Check that the markup you're trying to parse is valid. In the posted snippet we had <a>Txt<a/>, which caused my parser to barf. Once I changed it to <a>Txt</a>, it was fine.
There are often many different ways to do the same thing. The most appropriate will depend heavily on the structure of your actual Xml. For example, //div[Table[#style='display: None;']]//a works fine on the test data, but might not work "in real life". For example, if the Xml you're actually using varied between display:None and display: None (with a space after the colon) that would cause another problem.
I found the answer after returning from work and looking at it anew. It turns out that if you hadn't clicked on the text contained in the a-tag, the table was simply not "there" as far as the XML was concerned. Only once you clicked on it did it become visible in Firebug, with a distinguishing style equal to either "display: None;" or empty. For my application, I thus had to check whether the table was present and, if not, click the a-tag. The definitive XPath was:
//*[#class='table-container' and not(Table)]/div/strong/a
Credit does have to go to Ezra for pointing out the nuances of XPath!
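For completeness, a minimal HtmlAgilityPack sketch running that final expression (the html variable stands in for the real page source):

```csharp
var doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(html); // html holds the page source as a string

// Select the <a> tags only for containers whose Table is absent,
// i.e. the rows that still need to be clicked.
var links = doc.DocumentNode.SelectNodes(
    "//*[@class='table-container' and not(Table)]/div/strong/a");

// SelectNodes returns null (not an empty list) when nothing matches.
if (links != null)
{
    foreach (var link in links)
        Console.WriteLine(link.InnerText);
}
```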
Using the following URL:
link to search results page
I am trying to first scrape the text of the a tag from this HTML, which can be seen in the source code when viewed with Firebug:
<div id="search-results" class="search_results">
<div class="searchResultItem">
<div class="searchResultImage photo">
<h3 class="black">
<a class="linkmed " href="/content/1/2484243.html">加州旱象不减 开源节流声声急</a>
</h3>
<p class="resultPubDate">15.10.2014 06:08 </p>
<p class="resultText">
</div>
</div>
<p class="more-results">
But what I get back when I scrape the page is:
<div class="search_results" id="search-results">
<input type="hidden" name="ctl00$ctl00$cpAB$cp1$hidSearchType" id="hidSearchType">
</div>
<p class="more-results">
Is there any way to view the source the way Firebug does?
How are you scraping the page? Use something like Fiddler and check the request and the response for dynamic pages like these. The reason Firebug sees more is that all of the dynamic elements have already loaded by the time you view the page in your browser, whereas your scraping method only gets one piece of the puzzle (the initial HTML).
Hint: For this search page, you will see that the request for the results data is actually a) a separate GET request with b) a long query string and c) a cookie on the header, which returns a JSON object containing the data. This is why the link you posted just gives me "undefined," because it does not contain the search data.
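A sketch of calling such a JSON endpoint directly; the URL, query string, and cookie value below are placeholders, so substitute the real ones captured in Fiddler:

```csharp
// Sketch: replay the search request that the page makes via AJAX.
// All endpoint and cookie values here are placeholders.
using System;
using System.Net.Http;
using System.Threading.Tasks;

class SearchApiExample
{
    static async Task Main()
    {
        using (var client = new HttpClient())
        {
            // Placeholders: copy the real query string and cookie
            // from the GET request captured in Fiddler.
            var request = new HttpRequestMessage(HttpMethod.Get,
                "https://example.com/search/api?q=PLACEHOLDER&page=1");
            request.Headers.Add("Cookie", "session=PLACEHOLDER");

            var response = await client.SendAsync(request);
            string json = await response.Content.ReadAsStringAsync();
            Console.WriteLine(json); // parse with a JSON library of choice
        }
    }
}
```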
I have been using HAP for a pretty long time. And now I have a really simple question.
How to correctly load a webpage?
The reason I'm asking is that there is a website with a specific part of the formatting that messes up HAP:
<div class="like-bar">
<div class="g-bar"><div class="green-bar" style="width:55.47%"/></div></div>
<div class="like-descr">76 Likes, 61 Dislikes</div>
</div>
So the part I'm having a problem with is style="width:55.47%"/></div></div>. There is a closing tag for the g-bar div and a closing tag for the green-bar div, but the green-bar div itself is already self-closed (/>). As you can imagine, this screws up the whole structure and makes it impossible to parse.
When I use inspect in any browser, the /> is just not there. How can I figure out what writes it? I download the page using the Load method of the HtmlWeb class.
Update #1
For some really strange reason, the following does not work:
<div class="like-bar">
<div class="g-bar">
<div class="green-bar" style="width:55.474452554745%"></div>
</div>
<div class="like-descr">
<span class="bold">76</span><span>Likes</span>, <span class="bold">61</span><span>Dislikes</span>
</div>
</div>
The last closing tag is not associated with the like-bar element; instead it closes a parent.
What's wrong with this?
Thank you for your attention!
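One way to see where the parser trips (a diagnostic sketch, not a confirmed fix) is to check HtmlAgilityPack's ParseErrors collection after loading; it records the position and reason for each piece of malformed markup it encountered:

```csharp
var web = new HtmlAgilityPack.HtmlWeb();
var doc = web.Load(url); // url points at the problem page

// Each entry reports where and why the parser saw broken markup,
// e.g. a tag closed by "/>" that the browser silently repairs.
foreach (var error in doc.ParseErrors)
{
    Console.WriteLine(
        $"{error.Line},{error.LinePosition}: {error.Code} - {error.Reason}");
}
```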