HtmlAgilityPack reading HTML in a wrong way? - c#

I have been using HAP (HtmlAgilityPack) for a long time, and now I have a really simple question.
How do I correctly load a webpage?
I'm asking because one website has a specific piece of markup that trips up HAP:
<div class="like-bar">
<div class="g-bar"><div class="green-bar" style="width:55.47%"/></div></div>
<div class="like-descr">76 Likes, 61 Dislikes</div>
</div>
The part I'm having a problem with is "style="width:55.47%"/></div></div>". There is a closing tag for the g-bar div and a closing tag for the green-bar div, but the green-bar div is also self-closed (/>). As you can imagine, this throws off the nesting and makes the page impossible to parse.
When I inspect the page in any browser, the "/>" is just not there. How can I figure out what writes it? I download the page using the Load method of the HtmlWeb class.
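One possible explanation, not confirmed anywhere in this thread: under HTML parsing rules a trailing slash on a non-void element such as div is ignored, so a browser keeps the element open and never shows the "/>" in its inspector. The sketch below uses Python's standard html.parser (not HtmlAgilityPack) to list non-void elements written with "/>" in the raw source. The names SelfClosedFinder and find_self_closed are made up for illustration.

```python
from html.parser import HTMLParser

# Elements that HTML defines as void (they never take a closing tag).
VOID_ELEMENTS = {
    "area", "base", "br", "col", "embed", "hr", "img",
    "input", "link", "meta", "source", "track", "wbr",
}


class SelfClosedFinder(HTMLParser):
    """Collects non-void elements written with XML-style '/>' syntax."""

    def __init__(self):
        super().__init__()
        self.found = []

    def handle_startendtag(self, tag, attrs):
        # html.parser calls this hook for any start tag that ends in '/>'.
        if tag not in VOID_ELEMENTS:
            self.found.append((tag, dict(attrs)))


def find_self_closed(html):
    """Return (tag, attributes) pairs for self-closed non-void elements."""
    finder = SelfClosedFinder()
    finder.feed(html)
    finder.close()
    return finder.found


# The markup quoted in the question.
sample = (
    '<div class="like-bar">'
    '<div class="g-bar"><div class="green-bar" style="width:55.47%"/></div></div>'
    '<div class="like-descr">76 Likes, 61 Dislikes</div>'
    '</div>'
)
print(find_self_closed(sample))
```

If the list is non-empty, the "/>" is present in the downloaded source, and the difference from the browser view comes from how each parser interprets it.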
Update #1
For some really strange reason, the following does not work:
<div class="like-bar">
<div class="g-bar">
<div class="green-bar" style="width:55.474452554745%"></div>
</div>
<div class="like-descr">
<span class="bold">76</span><span>Likes</span>, <span class="bold">61</span><span>Dislikes</span>
</div>
</div>
The last </div> is not associated with the like-bar div; instead it is linked to a parent.
What's wrong with this?
Thank you for your attention!

Related

asp.net core html not updating

My HTML code:
<div class="header"></div>
<div class="tiles"></div>
<div class="list"></div>
<div class="footer"></div>
changed to:
<div class="header"></div>
<div class="content">
<div class="tiles"></div>
<div class="list"></div>
</div>
<div class="footer"></div>
I wrapped tiles and list in a content div.
My problem is that after doing this, the HTML won't update in my browser.
And yes, I already cleared my browser cache and disabled caching in DevTools.
I added a no-cache meta tag. I rebuilt the solution and cleaned the solution, with no luck. Whatever code I add or delete, nothing happens. My CSS is updated, but not the HTML.
I tested it by deleting my content, and when I ran the project all of the content still showed in my browser. What else can I do to fix this?

Remove invalid/incorrectly placed tags from html string

I'm wondering if there is a good (or good enough) way to remove invalid or incorrectly placed HTML tags from an HTML string in C#?
Example 1: <div> </div> </div> should be changed to <div> </div>
Example 2: <div> </section> </div> should be changed to <div> </div>
Basically the transformed html string should be W3C validated markup. I understand that this may be a bit difficult to do, perhaps there is a library that does the job well?
Thanks!
I'd recommend using HTMLTidy.
Since you're using C#, there's the tidy.net project. I think there are dlls that you can just reference and use in your C# code.
Or, you can just use the command line stuff for HTMLTidy.
I ended up fixing the root issue that generated the invalid HTML string. In a scenario like this, it is far better to fix the underlying problem, if possible, than to treat the symptoms.
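For the two examples in the question, the idea can be sketched without any library. This is a minimal illustration in Python's standard html.parser, not HTML Tidy and not a validator: it drops any closing tag that does not match the innermost open element and closes whatever is still open at the end. TagBalancer and drop_unmatched_closing_tags are hypothetical names.

```python
from html.parser import HTMLParser

# Void elements never get pushed on the open-element stack.
VOID_ELEMENTS = {
    "area", "base", "br", "col", "embed", "hr", "img",
    "input", "link", "meta", "source", "track", "wbr",
}


class TagBalancer(HTMLParser):
    """Re-emits HTML, skipping closing tags that do not match the innermost open tag."""

    def __init__(self):
        # Keep entities as written instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.out = []
        self.open_tags = []

    def handle_starttag(self, tag, attrs):
        self.out.append(self.get_starttag_text())
        if tag not in VOID_ELEMENTS:
            self.open_tags.append(tag)

    def handle_startendtag(self, tag, attrs):
        # A tag written as <x/> is emitted as-is and treated as already closed.
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if self.open_tags and self.open_tags[-1] == tag:
            self.open_tags.pop()
            self.out.append(f"</{tag}>")
        # Otherwise the closing tag is stray or misplaced, so it is dropped.

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")


def drop_unmatched_closing_tags(html):
    balancer = TagBalancer()
    balancer.feed(html)
    balancer.close()
    # Close any elements that were left open, innermost first.
    while balancer.open_tags:
        balancer.out.append(f"</{balancer.open_tags.pop()}>")
    return "".join(balancer.out)


print(drop_unmatched_closing_tags("<div> </div> </div>"))
print(drop_unmatched_closing_tags("<div> </section> </div>"))
```

The output is not guaranteed to pass W3C validation. A real HTML5 tree builder applies many more recovery rules than this stack check.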

HtmlAgilityPack (C#) can't read past hidden text

Using the following URL:
link to search results page
I am trying to first scrape the text of the a tag from this HTML, which is what the source looks like when viewed with Firebug:
<div id="search-results" class="search_results">
<div class="searchResultItem">
<div class="searchResultImage photo">
<h3 class="black">
<a class="linkmed " href="/content/1/2484243.html">加州旱象不减 开源节流声声急</a>
</h3>
<p class="resultPubDate">15.10.2014 06:08 </p>
<p class="resultText">
</div>
</div>
<p class="more-results">
But what I get back when I scrape the page is:
<div class="search_results" id="search-results">
<input type="hidden" name="ctl00$ctl00$cpAB$cp1$hidSearchType" id="hidSearchType">
</div>
<p class="more-results">
Is there any way to view the source the way Firebug does?
How are you scraping the page? Use something like Fiddler and check the request and the response for dynamic pages like this one. Firebug sees more because all of the dynamic elements have already loaded by the time you view the page in your browser, whereas your scraping method only retrieves one piece of the puzzle (the initial HTML).
Hint: For this search page, you will see that the request for the results data is actually a) a separate GET request with b) a long query string and c) a cookie in the header, which returns a JSON object containing the data. This is why the link you posted just gives me "undefined": it does not contain the search data.
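The request described in the hint could be reproduced roughly as follows. This is a sketch in Python's standard library under stated assumptions: the endpoint, parameter names, and cookie value are placeholders, and the real ones have to be copied from what Fiddler shows for the page.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_results_request(endpoint, params, cookie):
    """Build a GET request with a query string and a Cookie header."""
    url = f"{endpoint}?{urlencode(params)}"
    return Request(url, headers={"Cookie": cookie, "Accept": "application/json"})


def fetch_results(request, timeout=10):
    """Send the request and decode the JSON body of the response."""
    with urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Placeholder values for illustration only; none of these come from the real site.
request = build_results_request(
    "https://example.invalid/search/results",
    {"q": "query text", "page": 1},
    "session=PLACEHOLDER",
)
print(request.full_url)
# data = fetch_results(request)  # run only against the real endpoint
```

The search data would then be read from the decoded JSON rather than from the HTML of the results page.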

Parsing with Async, HtmlAgilityPack, and XPath

I have run into a rather strange problem. It's very hard to explain so please bear with me, but basically here is a brief introduction:
I am new to Async programming but couldn't locate a problem in my code
I have used HtmlAgilityPack before, but never the .NET 4.5 version.
This is a learning project, I am not trying to scrape or anything like that.
Basically, what is happening is this: I am retrieving a page from the internet, loading it via stream into an HtmlDocument, then retrieving certain HtmlNodes from it using XPath expressions. Here is a piece of simplified code:
myStream = await httpClient.GetStreamAsync(string.Format("{0}{1}", SomeString, AnotherString));
using (myStream)
{
myDocument.Load(myStream);
}
The HTML is being retrieved correctly, but the HtmlNodes extracted by XPath are getting their HTML mangled. Here is a sample piece of HTML which I got in a response taken from Fiddler:
<div id="menu">
<div id="splash">
<div id="menuItem_1" class="ScreenTitle" >Horse Racing</div>
<div id="menuItem_2" class="Title" >Wednesday Racing</div>
<div id="subMenu_2">
<div id="menuItem_3" class="Level2" >» 21.51 Britannia Way</div>
<div id="menuItem_4" class="Level2" >» 21.54 Britannia Way</div>
<div id="menuItem_5" class="Level2" >» 21.57 Britannia Way</div>
<div id="menuItem_6" class="Level2" >» 22.00 Britannia Way</div>
<div id="menuItem_7" class="Level2" >» 22.03 Britannia Way</div>
<div id="menuItem_8" class="Level2" >» 22.06 Britannia Way</div>
</div>
</div>
</div>
The XPath I am using is 100% correct because it works in the browser on the same page, but here is an example of a tag it is retrieving from the page shown above:
1.54 Britannia Way</
And here is the original which I copied from above for simplicity:
21.54 Britannia Way</div>
As you can see, the InnerText has changed considerably and so has the URL. Obviously my program doesn't work, but I don't know why. What can cause this? Is it a bug in HtmlAgilityPack? Please advise! Thanks for reading!
Don't assume that an XPath expression that works in your browser (after DOM conversion, possibly after loading data with AJAX, ...) will also work on the raw page source. This seems to be a site giving betting quotes; I'd guess they're loading the data with some JavaScript calls.
Verify whether your XPath expression matches the page's source code (as fetched using wget or by clicking "View Source Code" in your browser; don't use Firebug or similar for this!).
If the site is using AJAX to load the data, you might have luck using Firebug to monitor which resources get fetched while the page loads. Often these are JSON or XML files that are very easy to parse, and working with them is even easier than parsing a horrible mess of HTML.
Update: In this special case, the site forwards users not sending an Accept-Language header to a language-selection-page. Send such a header to receive the same contents as the browser does. In curl, it would look like this:
curl -H "Accept-Language: en-US;q=0.6,en;q=0.4" https://mobile.bet365.com/sport/splash/Default.aspx?Sport
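The same header can be set from code. Here is a small sketch using Python's standard library, where build_request is a hypothetical helper and the header value is the one from the curl command above:

```python
from urllib.request import Request, urlopen


def build_request(url, accept_language="en-US;q=0.6,en;q=0.4"):
    """Build a GET request that carries an Accept-Language header."""
    return Request(url, headers={"Accept-Language": accept_language})


def fetch_html(request, timeout=10):
    """Send the request and return the response body as text."""
    with urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


# Placeholder URL; substitute the page you are actually loading.
request = build_request("https://example.invalid/page")
print(request.header_items())
# html = fetch_html(request)  # performs the real network call
```

Any HTTP client works the same way. The point is only that the header has to be present before the server decides which page to send.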
After many hours of guessing and debugging, the problem turned out to be an HtmlDocument that I was re-using. I solved the problem by creating a new HtmlDocument each time I wanted to load a new page, instead of using the same one.
I hope this saves you time that I lost!

"Element is not currently visible and so may not be interacted with" but another is?

I've created another question which I think is the cause for this error: Why does the Selenium Firefox Driver consider my modal not displayed when the parent has overflow:hidden?
Selenium version 2.33.0
Firefox driver
The code that causes the error:
System.Threading.Thread.Sleep(5000);
var dimentions = driver.Manage().Window.Size;
var field = driver.FindElement(By.Id("addEmployees-password")); //displayed is true
field.Click(); //works fine
var element = driver.FindElement(By.Id(buttonName)); //displayed is false
element.Click(); //errors out
The button that it's trying to click:
<div id="addEmployees" class="modal hide fade" tabindex="-1" role="dialog" aria-labelledby="addEmployeesLabel" aria-hidden="true">
<div class="modal-header">
<button type="button" class="close" data-dismiss="modal" aria-hidden="true">×</button>
<h3>Add Employee</h3>
</div>
<div class="modal-body">
<p class="alert alert-info">
<input name="addEmployees-username" id="addEmployees-username" />
<input name="addEmployees-password" id="addEmployees-password" type="password" />
<input name="addEmployees-employee" id="addEmployees-employee" />
</p>
</div>
<div class="modal-footer">
<button name="addEmployees-add" id="addEmployees-add" type="button" class="btn" data-ng-click="submit()">Add</button>
</div>
</div>
If I change the call to FindElements then I get ONE element, so there isn't anything else on the page.
If I FindElement on a field that occurs right before the button, say addEmployees-employee, then addEmployees-employee is displayed
In the browser itself, it shows up fine. All I need to do is actually click the button and the desired behavior executes, but the webdriver refuses to consider the element displayed.
How is it that one field can be considered displayed and the other is not?
The modal with the add button in the bottom right, all the other elements are displayed = true
The window size is 1200x645 per driver.Manage().Window.Size;
The element location is: 800x355y per driver.FindElement(By.Id(buttonName)).Location
The element dimensions are: 51x30 per driver.FindElement(By.Id(buttonName)).Size
The password element location is: 552x233y per driver.FindElement(By.Id("addEmployees-password")).Size
Brian's response was right: use an explicit wait versus Thread.Sleep(). Sleep() is generally brittle, you're losing five seconds needlessly, and moreover it's just a really rotten practice for automated testing. (It took me a long, LONG time to learn that, so you're not alone there.)
Avoid implicit waits. They generally work for new items being added to the DOM, not for transitions for things like a modal to become active.
Explicit waits have a great set of ExpectedConditions (detailed in the Javadocs) which can get you past these problems. Use the ExpectedCondition which matches the state you need for your next action.
Also, see Ian Rose's great blogpost on the topic, too.
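To illustrate why an explicit wait beats a fixed sleep, here is a sketch of the polling idea in plain Python. It is not Selenium's own implementation (the Python bindings ship a WebDriverWait class for real tests), and wait_until is a hypothetical helper.

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Returns that value as soon as it appears, or raises TimeoutError
    once `timeout` seconds have passed without success.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll_interval)


# Stand-in for "the modal has finished its transition": true on the third check.
checks = {"count": 0}


def modal_is_ready():
    checks["count"] += 1
    return checks["count"] >= 3


wait_until(modal_is_ready, timeout=2.0, poll_interval=0.01)
print(checks["count"])
```

Unlike Thread.Sleep(5000), the wait ends as soon as the condition holds, and it fails loudly if the condition never does.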
Selenium WebDriver does not just check for opacity != 0, visibility = true, height > 0 and display != none on the current element in question, but it also searches up the DOM's ancestor chain to ensure that there are no parent elements that also match these checkers.
(UPDATE After looking at the JSON wire code that all the bindings refer back to, SWD also requires overflow != hidden, as well as a few other cases.)
I would do two things before restructuring the code as @Brian suggests.
1. Ensure that the "div.modal-footer" element does not have any reason for SWD to consider it not visible.
2. Inject some JavaScript to highlight the element in question in your browser so you know for certain that you have selected the right element. You can use this gist as a starting point. If the button is highlighted with a yellow border, then you know you have the right element selected. If not, the selected element is located elsewhere in the DOM. In that case, you probably don't have unique IDs as you would expect, which makes manipulating the DOM very confusing.
If I had to guess, I would say that number two is what you are running into. This has happened to me as well, where a Dev reused an element ID, causing contention in which element you're supposed to find.
After discussing this with you in chat, I think the best solution (for now, at least) is to move the button out of the footer for your modal and into the body of it.
This is what you want (for now):
<div class="modal-body">
<p class="alert alert-info">
<input name="addEmployees-username" id="addEmployees-username" />
<input name="addEmployees-password" id="addEmployees-password" type="password" />
<input name="addEmployees-employee" id="addEmployees-employee" />
<button name="addEmployees-add" id="addEmployees-add" type="button" class="btn" data-ng-click="submit()">Add</button>
</p>
</div>
And not this:
<div class="modal-body">
<p class="alert alert-info">
<input name="addEmployees-username" id="addEmployees-username" />
<input name="addEmployees-password" id="addEmployees-password" type="password" />
<input name="addEmployees-employee" id="addEmployees-employee" />
</p>
</div>
<div class="modal-footer">
<button name="addEmployees-add" id="addEmployees-add" type="button" class="btn" data-ng-click="submit()">Add</button>
</div>
I had the same issue of an element not being visible and so not being interactable. It just got solved: I updated my Selenium standalone server. The previous version was 2.33.0 and now it is 2.35.0.
In my case the element was already present in the page but it was disabled,
so this didn't work (python):
wait.until(lambda driver: driver.find_element_by_id("myBtn"))
driver.find_element_by_id("myBtn").click()
It failed with the error:
"Element is not currently visible and so may not be interacted with"
To solve my problem, I had to wait a couple of seconds (time.sleep(5)) until the element became visible.
You can also enable the element using JavaScript. A Python example:
driver.execute_script("document.getElementById('myBtn').disabled='' ")
driver.execute_script("document.getElementById('myBtn').click() ")
