C# / HtmlAgilityPack parses differently than Firebug

I'm using HtmlAgilityPack to parse HTML nodes, and Firebug to find the attributes of the nodes I'm looking for, like a div with class name "ABC".
I've noticed that sometimes I get no result for the div I'm looking for. I debugged this and saw that the XPath from Firebug and from HtmlAgilityPack is different for the same node:
/html[1]/body[1]/div[2]/div[3]/table[1]/tr[1]/td[1]/table[1]/tr[1]/td[1]/table[1]/tr[1]/td[1]/div[1]/table[1]/tr[1]/td[1]/div[1]/table[1]/tr[2]/td[1]/div[2]/table[1]/tr[1]/td[1]/div[1]/td[1]/div[1]
/html/body/div[3]/div[3]/table/tbody/tr/td/table/tbody/tr/td/table/tbody/tr/td/div/table/tbody/tr/td/div/table/tbody/tr[2]/td/div[2]/table/tbody/tr/td/div/div/table/tbody/tr[3]/td/table/tbody/tr/td[2]/div
The first one is from Firebug. Anyone know where I'm going wrong?

There are two possible reasons:
HTML Agility Pack is not parsing the HTML correctly.
The web page has been altered by client script after the page was loaded. When you view it with Firebug, you are looking at the DOM, not the HTML source. HAP can only work with the HTML source.
You will notice in the paths you have shown that (for example) there are no TBODY tags in the HAP version. TBODY is optional in HTML markup, but still a required tag in a complete DOM. Browser HTML parsers will always add TBODY if it's missing; HAP will not. This can result in paths that work in a browser failing in HAP.
An alternative to HAP is CsQuery (on NuGet), which uses a standards-compliant HTML parser (actually, the same parser as Firefox). CsQuery is a C# jQuery port; it works with CSS selectors (not XPath). It should give you a DOM that matches the one the browser shows. This will not change anything if the problem is simply that JavaScript is altering the DOM, though.
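A rough sketch of what a CsQuery query looks like (the markup here is a placeholder, not HTML from the question):

using System;
using CsQuery;

// CsQuery builds a browser-grade DOM and is queried with CSS selectors,
// much like jQuery. The HTML string below is only a placeholder.
CQ dom = CQ.Create("<div class='ABC'>hello</div><div class='XYZ'>other</div>");
CQ matches = dom["div.ABC"];          // equivalent of $("div.ABC")
Console.WriteLine(matches.Text());    // prints "hello"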

Html Agility Pack only concentrates on markup. It has no idea of how things will be rendered. Firebug, I think, relies on the current in-Firefox-memory DOM, which can be dramatically different. That's why you see elements such as TBODY that only exist in the DOM, not in the markup (where they are optional).
Add to that the fact that there are infinitely many possible XPath expressions for a given XML node.
Anyway, in general, queries done with Html Agility Pack don't need the full XPath expression that a tool would give you. You just need to focus on discriminants, for example specific attributes (like the class), the id, etc. Your code will be much more resistant to changes. But it means you need to learn a bit about XPath (this is a good starting point: XPath Tutorial). So you really want to build XPath expressions such as this:
//div[@class = 'ABC']
which will get all DIV elements whose class attribute equals 'ABC'.
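For example, a rough sketch with Html Agility Pack (the URL is a placeholder; the class name is the one from the question):

using System;
using HtmlAgilityPack;

// Minimal sketch: load a page and query by a discriminant (the class attribute)
// instead of a full positional path. "http://example.com" is a placeholder URL.
HtmlDocument doc = new HtmlWeb().Load("http://example.com");
var divs = doc.DocumentNode.SelectNodes("//div[@class = 'ABC']");
if (divs != null)                        // SelectNodes returns null when nothing matches
{
    foreach (HtmlNode div in divs)
        Console.WriteLine(div.InnerText.Trim());
}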

Related

ASP.NET Core: Load the actually rendered text from a URL

I'm looking for a simple way to get a string from a URL that contains all text actually displayed to the user.
I.e. anything loaded with a delay (using JavaScript) should be included. Also, the result should ideally be free from HTML tags etc.
A straightforward approach with WebClient.DownloadString() and a subsequent HTML regex is pretty much pointless, because most content in modern web apps is not contained in the initial HTML document.
Most probably you can use Selenium WebDriver to fully load the page and then dump the full DOM.
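A rough sketch of that approach with the Selenium C# bindings (the URL is a placeholder, and a real test would explicitly wait for the delayed content instead of reading immediately):

using System;
using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;

// Load the page in a real browser so client-side script runs, then read the
// rendered result. "https://example.com" is a placeholder URL.
using (IWebDriver driver = new ChromeDriver())
{
    driver.Navigate().GoToUrl("https://example.com");

    // Visible text of the page after JavaScript has run, free of HTML tags.
    string visibleText = driver.FindElement(By.TagName("body")).Text;

    // Or the full rendered DOM as markup, if tags are still needed.
    string renderedHtml = driver.PageSource;

    Console.WriteLine(visibleText);
}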

Selenium: Finding if a Webelement is contained inside another Webelement?

I'm currently learning Selenium by building a test framework up to test Trello - for no good reason other than it's free and reasonably complex.
The problem I have right now is that I want to check whether a card is in a certain column or not.
I have the card & column as WebElements so what I'm doing right now is:
column.FindElements(By.LinkText(_card.Text)).Count > 0;
But this doesn't work if there's no text and seems pretty brittle.
What I want is something like:
column.Contains(_card)
I've searched on SO but I've only seen solutions which pass an XPath - I don't have an XPath, I have the WebElement.
Any ideas?
Two things:
Relative XPath is fairly easy to learn and could probably take care of this for you.
CSS selectors should also easily identify the container regardless of the text. Without seeing the code, I can't help much more.
You should be able to find all elements matching a certain CSS selector; a rough sketch follows below.
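As a quick sketch, the search can be scoped to the column element itself; the ".list-card" class used here is only a placeholder guess for whatever marks a card in your markup:

using System.Linq;
using OpenQA.Selenium;

// Locators called on an element (rather than on the driver) only search that
// element's descendants, so the result is already scoped to the column.
// "column" and "_card" are the WebElements from the question; ".list-card" is a guess.
bool cardIsInColumn = column
    .FindElements(By.CssSelector(".list-card"))
    .Any(el => el.Equals(_card));   // element references are comparable within a session

// XPath equivalent; the leading "." makes the expression relative to the column:
// column.FindElements(By.XPath(".//*[contains(@class, 'list-card')]"))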
Using Firefox with the Firebug extension, right-click your element and go to Inspect Element with Firebug. Then, when the HTML of your element comes up in the window, right-click the element and select Copy XPath. Now you have an XPath to use.
To use the CSS selectors that others are talking about, you can select Copy CSS Path instead of Copy XPath.
Hope this helps.

How to navigate through a website and "mine information"

I want to "simulate" navigation through a website and parse the responses.
I just want to make sure I am doing something reasonable before I start. I see two options for doing so:
Using the WebBrowser class.
Using the HttpWebRequest class.
So my initial thought was to use HttpWebRequest and just parse the response.
What do you guys think?
Also wanted to ask: I use C# because it's my strongest language, but what languages are commonly used for this kind of website mining?
If you start doing it manually, you will probably end up hard-coding lots of cases. Try Html Agility Pack or something else that supports XPath expressions.
There are a lot of mining and ETL tools out there for serious data mining needs.
For "user simulation" I would suggest using Selenum web driver or PhantomJS, which is much faster but has some limitations in browser emulation, while Selenium provides almost 100% browser features support.
If you're going to mine data from a website, there is something you must do first in order to be 'polite' to the websites you are mining from: you have to obey the rules set in that website's robots.txt, which is almost always located at www.example.com/robots.txt.
Then use HTML Agility Pack to traverse the website.
Or convert the HTML document to XHTML using html2xhtml, then use an XML parser to traverse the website.
Remember to:
Check for duplicate pages. (The general idea is to hash the HTML doc at each URL; look up (super)shingles.)
Respect the robots.txt.
Get the absolute URL from each page.
Filter duplicate URLs from your queue.
Keep track of the URLs you have visited (e.g. with a timestamp).
Parse your HTML doc and keep your queue updated; a rough sketch of this loop follows below.
Keywords: robots.txt, absolute URL, html parser, URL normalization, mercator scheme.
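A minimal sketch of the fetch-and-extract loop with Html Agility Pack (the start URL is a placeholder; robots.txt checks, politeness delays and duplicate-page hashing from the checklist above are deliberately left out):

using System;
using System.Collections.Generic;
using HtmlAgilityPack;

// Download each page, pull out its links, normalize them to absolute URLs
// and de-duplicate. "http://example.com/" is a placeholder start URL; a real
// crawler would also restrict itself to the target site and honor robots.txt.
var visited = new HashSet<string>();
var queue = new Queue<string>();
queue.Enqueue("http://example.com/");

while (queue.Count > 0)
{
    string url = queue.Dequeue();
    if (!visited.Add(url))                      // already seen this URL
        continue;

    HtmlDocument doc = new HtmlWeb().Load(url);
    var anchors = doc.DocumentNode.SelectNodes("//a[@href]");
    if (anchors == null)
        continue;

    foreach (HtmlNode a in anchors)
    {
        // Resolve relative hrefs against the current page to get absolute URLs.
        if (Uri.TryCreate(new Uri(url), a.GetAttributeValue("href", ""), out Uri absolute))
            queue.Enqueue(absolute.AbsoluteUri);
    }
}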
Have fun.

Navigate HTML Source while performing WatiN tests

I am performing actions on the page during WatiN tests. What is the neatest method for asserting certain elements are where they should be by evaluating the HTML source? I am scraping the source but looking for a clean way to navigate the tags pulled back.
UPDATE: Right now I am thinking about grabbing certain elements within HTML source using regular expressions and then analysing that to see if other elements exist within. Other thoughts appreciated.
Is IE.ContainsText("myText") not enough for your scenario?
I would use XPath to navigate tags in HTML without using regexps.
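For example, a rough sketch with Html Agility Pack (the element ids and the assertion are placeholders for whatever your test suite uses):

using HtmlAgilityPack;
using NUnit.Framework;   // or whichever assertion library the WatiN tests already use

// Load the scraped source and assert with XPath instead of regular expressions.
// "pageSource" is the HTML already grabbed during the test; the ids are placeholders.
var doc = new HtmlDocument();
doc.LoadHtml(pageSource);

HtmlNode submit = doc.DocumentNode.SelectSingleNode(
    "//form[@id='checkout']//input[@type='submit']");
Assert.IsNotNull(submit, "Expected a submit button inside the checkout form.");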

Html Parser & Object Model for .net/C#

I'm looking to parse html using .net for the purposes of testing or asserting its content.
i.e.
HtmlDocument doc = GetDocument("some html")
List forms = doc.Forms()
Link link = doc.GetLinkByText("New Customer")
the idea is to allow people to write tests in c# similar to how they do in webrat (ruby).
i.e.
visits('/')
fills_in "Name", "mick"
clicks "save"
I've seen the Html Agility Pack, SgmlReader etc., but has anyone created an object model for this, i.e. a set of classes representing the HTML elements, such as form, button, etc.?
Cheers.
Here is a good library for HTML parsing. Objects like HtmlButton and HtmlInput are not created, but it is a good starting point for creating them yourself if you don't want to use the HTML DOM.
The closest thing to an HTML DOM in .NET, as far as I can tell, is the HTML DOM.
You can use the Windows Forms WebBrowser control, load it with your HTML, then access the DOM from the outside.
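A rough sketch of that approach (WinForms; it has to run on an STA thread with a message loop, for example inside a Form, and the sample markup is a placeholder):

using System;
using System.Windows.Forms;

// Let the WebBrowser control build a real DOM from the markup, then query it
// once DocumentCompleted fires. The HTML below is only a placeholder.
var browser = new WebBrowser { ScriptErrorsSuppressed = true };
browser.DocumentCompleted += (sender, args) =>
{
    foreach (HtmlElement form in browser.Document.GetElementsByTagName("form"))
        Console.WriteLine("Form: " + form.GetAttribute("id"));

    foreach (HtmlElement link in browser.Document.Links)
        Console.WriteLine("Link: " + link.InnerText);
};
browser.DocumentText = "<html><body><form id='newCustomer'><a href='#'>New Customer</a></form></body></html>";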
BTW, this is .NET. Any code that works for VB.NET would work for C#.
You have two major options:
Use a browser engine (e.g. Internet Explorer) that will parse the HTML for you and then give you access to the generated DOM. This option requires some interop with the browser engine (in the case of IE it's simple COM).
Use a lightweight parser like HtmlAgilityPack.
It sounds to me like you are trying to do HTML unit tests. Have you looked into Selenium? It even has a C# library, so you can write your HTML unit tests in C#, assert that elements exist and have the correct values, and even click on links. It also works with JavaScript / AJAX sites.
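As a rough sketch of what such a test might look like (NUnit here; the URL, ids and link text are placeholders rather than a real site):

using NUnit.Framework;
using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;

// A small HTML test with the Selenium C# bindings, loosely mirroring the
// webrat flow from the question. URL, ids and link text are placeholders.
[TestFixture]
public class NewCustomerTests
{
    [Test]
    public void Saving_a_name_shows_the_new_customer_link()
    {
        using (IWebDriver driver = new ChromeDriver())
        {
            driver.Navigate().GoToUrl("https://example.com/");
            driver.FindElement(By.Id("Name")).SendKeys("mick");
            driver.FindElement(By.Id("save")).Click();

            Assert.That(driver.FindElements(By.LinkText("New Customer")).Count,
                        Is.GreaterThan(0));
        }
    }
}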
The best parser for HTML is the HTQL COM component. You can use HTQL queries to retrieve HTML content.
