HTML parsers not finding table element on a web page

I'm trying to reach this element: //*[@id="table-matches"]/table on this page: http://www.oddsportal.com/matches/soccer/20140221/
I want the table that contains the matches; it starts under the "Kick off time" tab. The element I'm looking for is 'table class=" table-main"', and it sits inside the element 'div id="table-matches" style="display: block;"'.
I tried loading this document with HtmlAgilityPack in C#. I can find the 'div' element, but it reports no child nodes (there should be a table child node), and if I query for the table directly, the result is null. Here is the code:
var webGet = new HtmlWeb();
var document = webGet.Load("http://www.oddsportal.com/matches/soccer/20140221/");
var div = document.DocumentNode.SelectNodes("//div[@id='table-matches']");
var table = document.DocumentNode.SelectNodes("//*[@id='table-matches']/table");
var table2 = document.DocumentNode.SelectNodes("//table");
So the div variable contains the div element (but with no child nodes), the table variable is null, and table2 contains 4 elements, none of which is the desired table.
I suspected a problem with HtmlAgilityPack, so I fetched the whole page with Python and saved it to a text file. Searching that file, I can find the div element, but it is empty: there is no table element inside. Why is that? Why can I see the table element in Chrome or Internet Explorer, but not in the downloaded HTML?
Here is the Python code:
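import urllib  # Python 2; in Python 3 use urllib.request.urlopen

url = urllib.urlopen("http://www.oddsportal.com/matches/")
document = url.read()
htmlOddsPortal = open("htmlOddsPortal.txt", "w")
htmlOddsPortal.write(document)
htmlOddsPortal.close()  # make sure the saved page is flushed to disk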
Here is the element in the final text document:
<div id="table-matches"></div> <!-- END PAGE BODY -->

The table is loaded with JavaScript (probably via AJAX), so you won't get it with webGet.Load(). You only get the HTML that the server returns in its response.
You can verify this in Chrome: open the developer tools (F12), click on Settings, check Disable JavaScript, then refresh the page. The content will be blank.
I had the same problem in Java and solved it with HtmlUnit. There is probably a similar tool for C#; alternatively, check whether HtmlAgilityPack can make the asynchronous call itself, or use something like the WebBrowser component.
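
The empty div can be reproduced offline. The following is a minimal Python sketch (standard library only) that lists the direct child elements of the div with id table-matches, first in the HTML the server returned (copied from the question) and then in markup resembling what the browser shows after its scripts have run. The ChildLister class, the direct_children helper and the RENDERED string are made up for illustration; the real rendered table is much larger.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not change the depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class ChildLister(HTMLParser):
    """Collects the tag names of the direct children of the element with a given id."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.found = False    # True once the target element has been seen
        self.depth = None     # nesting depth inside the target; None when outside it
        self.children = []    # tag names of the target's direct element children

    def handle_starttag(self, tag, attrs):
        if self.depth is not None:
            if self.depth == 0:
                self.children.append(tag)
            if tag not in VOID_TAGS:
                self.depth += 1
        elif not self.found and dict(attrs).get("id") == self.target_id:
            self.found = True
            self.depth = 0

    def handle_endtag(self, tag):
        if self.depth is None or tag in VOID_TAGS:
            return
        if self.depth == 0:
            self.depth = None   # this closes the target element itself
        else:
            self.depth -= 1


def direct_children(html, target_id):
    """Return (found, children) for the element with the given id."""
    parser = ChildLister(target_id)
    parser.feed(html)
    parser.close()
    return parser.found, parser.children


# What the server sends (taken from the question).
SERVED = '<div id="table-matches"></div> <!-- END PAGE BODY -->'

# Hypothetical stand-in for the DOM after JavaScript has filled the div.
RENDERED = ('<div id="table-matches" style="display: block;">'
            '<table class=" table-main"><tr><td>match</td></tr></table></div>')

print(direct_children(SERVED, "table-matches"))    # (True, [])
print(direct_children(RENDERED, "table-matches"))  # (True, ['table'])
```

No parser setting changes the first result, because the table is simply not in the response. The data has to come from something that executes the page's scripts, or from the request the page itself makes to fetch the table.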

Related

Xpath seems to exist but the element cannot be found through code

I am trying to fetch an element using HTML Agility Pack (C#) but it keeps returning null.
The HTML looks like this (sorry for the image).
I can successfully find the first node I marked with the arrow using the following xpath:
var node = htmlDoc.DocumentNode.SelectSingleNode("//div[@id='listing-container']");
But it says this node has only one "text" child, while in the HTML it clearly has other div elements that I need to access (as in the image).
How can I access the node with class = c-listing?
It does not seem to be in a frame or anything like that.
Thanks for your help; I am not experienced in this field.
Cheers

Action on SharePoint New Item form in selenium 2 using C# is not working

I am trying to automate the new item form on a SharePoint site, but whatever method I try, the element is reported as not found.
I tried SwitchTo() with a new iframe, window...
I tried this code, which finds the outer content:
IWebElement table1 = WebElement.FindElement(By.XPath("//table[@class=\"s4-wpTopTable\"]"));
int table1count = WebElement.FindElements(By.XPath("//table[@class=\"s4-wpTopTable\"]")).Count;
MessageBox.Show(table1count.ToString());
The above code displays a table count of 2. Searching beyond this element does not find anything.
I am using IE as the browser.
Using XPath, I could identify elements up to the red mark but nothing beyond it. I am trying to identify the elements marked in green.
var iframecount = driver.FindElement(By.XPath("//html/body/form/div[8]/div/div[4]/div[2]/div[2]/div/div/table/tbody/tr/td
Here is the XPath I used, taken from Firebug:
var iframecount = driver.FindElement(By.XPath("//html/body/form/div[8]/div/div[4]/div[2]/div[2]/div/div/table/tbody/tr/td/div/span/table/tbody/tr/td[2]/span/span/input"));

I have found the answer to this.
The SharePoint new item form (i.e. the modal pop-up) contains 3 iframes without an id or name, so switching to the iframe by index with the code below works:
driver.SwitchTo().Frame(2);
Frame indexes start at 0.

Any DOM parsers that do not modify the DOM?

I need to write a page (PHP or .NET are both options) that displays the unmodified HTML for an element of another page.
The other page may not have valid HTML, but we want it returned unmodified. We will not be selecting on the invalid elements themselves; we will select their parent element and need its contents returned unmodified.
An example HTML page that my page will be fetching:
<body>
<div>
<p>test1</p>
<br>
<p>test2
<p>test3</p>
</div>
</body>
So far everything I have tried attempts to fix the HTML: it makes the br in the example self-closing and closes the second paragraph tag.
Is there anything out there that can do this?
Thanks!

Element Visible on the page but shows null

I am not able to identify an element in a page; the lookup gives null. I want to identify an element (a textbox) inside the iframe. I used Selenium WebDriver to click on the element, but it cannot find it.
1) My HTML code is as shown below:
<html>
<head>
<body>
<iframe id="iframeOne">
</iframe>
</body>
</head>
</html>
2) I used JavaScript to identify the textbox, e.g. document.getElementById('textbox'), but it returns null.
3) I also tried Selenium WebDriver:
IWebElement ClickElement = Wait.Until((d) => webDriver.FindElement(By.Id(parameter1))); // gives an object reference error
ClickElement.Click();

You cannot put HTML inside an iframe tag; an iframe is there to load another page inside the current page. Your input tag should also specify the type of the control. Check the HTML validation errors as well.

The HTML you put inside the iframe tag is rendered if and only if the browser does not support the iframe tag. So probably never, unless you're using an old Netscape Navigator or IE 4.
Add a src attribute to the iframe pointing to the URL you want to load. Then you can access the elements inside it this way:
var frame = document.getElementById('iframeOne');
var frameDocument = frame.contentDocument;
var element = frameDocument.getElementById('xxxx');
There is one thing to take into account, though: when the iframe's src is cross-domain, the browser's same-origin policy blocks access and contentDocument is null.

Logic for Implementing a Dynamic Web Scraper in C#

I am looking to develop a web scraper in C# Windows Forms. What I am trying to accomplish is as follows:
Get the URL from the user.
Load the web page in the IE UI control (embedded browser) in WinForms.
Allow the user to select a piece of text (contiguous and small, not exceeding 50 characters) from the loaded web page.
When the user wishes to persist the location (the HTML DOM location), it has to be saved to the DB so that the user can use that location to fetch the data there on subsequent visits.
Assume that the loaded website is a price-listing site and the quoted rate keeps changing; the idea is to persist the DOM hierarchy so that I can traverse it next time.
I would be able to do this if all the HTML elements had id attributes. Where the id is null, I am not able to accomplish this.
Could someone suggest a workable approach (with a bare-minimum code snippet if possible)?
Pointers to online resources would also be helpful.
thanks,
vijay

One approach is to build a stack of tags/styles/ids down to the element you want to select.
From the element you want, traverse up to the nearest element that has an id. This lets you skip most of the top-level markup such as the header. Then build a sequence to look for.
Example:
<html>
<body>
<!-- lots of html -->
<div id="main">
<div>
<span>
<div class="pricearea">
<table> <!-- with price data -->
For the example, you would store in your DB a sequence of: [id=main],div,span,div,table or perhaps div[class=pricearea],table.
Styles/classes can also be used to create your path. It's your choice whether to look for a tag, an attribute of a tag, or a combination. You want the path to be as accurate as possible with as few elements as possible, to keep it robust.
If the layout seldom changes, this would let you navigate to the same location each time.
I would also suggest you perhaps use HTML Agility Pack or something similar for the DOM parsing, as the IE control is slow.
Screen scraping is fun, but it's difficult to get it 100% for all pages. Good luck!
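
The path idea above can be sketched as follows. This is an illustrative Python version using only the standard library, not HTML Agility Pack: build_path walks up from the chosen element to the nearest ancestor with an id and records each tag together with its position among same-named siblings, and resolve_path replays that sequence on a later visit. Both function names, the sibling index and the sample pages are assumptions made for this example, and the markup is assumed to be well-formed so that xml.etree can parse it; real pages usually need a tolerant HTML parser.

```python
import xml.etree.ElementTree as ET


def build_path(root, target):
    """Return (anchor_id, steps) leading from the nearest id'd ancestor to target."""
    # ElementTree has no parent pointers, so build a child -> parent map first.
    parent = {child: node for node in root.iter() for child in node}
    steps = []
    node = target
    while node.get("id") is None:
        up = parent.get(node)
        if up is None:
            break  # reached the root without finding an id
        same_tag = [c for c in up if c.tag == node.tag]
        steps.append((node.tag, same_tag.index(node)))
        node = up
    return node.get("id"), list(reversed(steps))


def resolve_path(root, anchor_id, steps):
    """Replay a stored path; return None if the layout no longer matches."""
    node = root
    if anchor_id is not None:
        node = next((e for e in root.iter() if e.get("id") == anchor_id), None)
    for tag, index in steps:
        if node is None:
            return None
        same_tag = [c for c in node if c.tag == tag]
        node = same_tag[index] if index < len(same_tag) else None
    return node


# Hypothetical page modelled on the example above.
FIRST_VISIT = """<html><body>
<div id="header"><span>menu</span></div>
<div id="main"><div><span>
  <div class="pricearea"><table><tr><td>42.50</td></tr></table></div>
</span></div></div>
</body></html>"""

# Same layout on a later visit, with a new price.
LATER_VISIT = FIRST_VISIT.replace("42.50", "43.10")

root = ET.fromstring(FIRST_VISIT)
table = root.find(".//div[@class='pricearea']/table")
anchor_id, steps = build_path(root, table)
print(anchor_id, [tag for tag, _ in steps])   # main ['div', 'span', 'div', 'table']

later = resolve_path(ET.fromstring(LATER_VISIT), anchor_id, steps)
print(later.find(".//td").text)               # 43.10
```

Only anchor_id and steps need to be stored in the DB. If the layout changes so that a step no longer matches, resolve_path returns None instead of silently returning a different element.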

After a bit of googling, I found a fairly simple solution. A sample snippet is below.
// Requires a reference to Microsoft.mshtml (using mshtml;).
if (webBrowser.Document != null)
{
    // Get the underlying MSHTML document from the WebBrowser control.
    IHTMLDocument2 HtmlDoc = (IHTMLDocument2)webBrowser.Document.DomDocument;
    // The user's current selection in the page.
    IHTMLSelectionObject selection = HtmlDoc.selection;
    IHTMLTxtRange range = (IHTMLTxtRange)selection.createRange();
    // The element that contains the selected text.
    IHTMLElement parentElement = range.parentElement();
    targetSourceIndex = parentElement.sourceIndex; // value to persist
    //dataLocation = range.parentElement().id;
    MessageBox.Show(range.text);
}
I used an embedded WebBrowser control in a WinForms application, which loads the HTML DOM of the current web page.
The IHTMLElement interface exposes a property named sourceIndex, which gives the element's ordinal position, in source order, within the document's all collection, so it is unique within the loaded page.
You can store this sourceIndex in the DB and later query for the content at that location using the following code.
if (webBrowser.Document != null)
{
    IHTMLDocument2 HtmlDoc = (IHTMLDocument2)webBrowser.Document.DomDocument;
    IHTMLElement targetElement = null;
    // node.InnerText holds the sourceIndex persisted earlier in the XML file.
    int storedIndex = int.Parse(node.InnerText);
    foreach (IHTMLElement domElement in HtmlDoc.all)
    {
        if (domElement.sourceIndex == storedIndex)
        {
            targetElement = domElement;
            break;
        }
    }
    // Guard against a stored index that no longer exists in the page.
    if (targetElement != null)
    {
        MessageBox.Show(targetElement.innerText);
    }
}
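
The store-then-look-up idea can be imitated outside the browser. The sketch below is Python with the standard library, not the mshtml API: it numbers elements in document order, which is what sourceIndex reports, and then looks an element up by that number. The helper names and the two sample pages are invented for illustration, and the snippets are assumed to be well-formed so that xml.etree can parse them.

```python
import xml.etree.ElementTree as ET


def source_index(root, element):
    """Position of element in document order, or -1 if it is not in the tree."""
    for i, node in enumerate(root.iter()):
        if node is element:
            return i
    return -1


def element_at(root, index):
    """Element at the given document-order position, or None if out of range."""
    for i, node in enumerate(root.iter()):
        if i == index:
            return node
    return None


# Hypothetical page on the first visit and after the site added a banner.
BEFORE = "<html><body><div><p>intro</p><span>19.99</span></div></body></html>"
AFTER = ("<html><body><div><p>intro</p><p>new banner</p>"
         "<span>21.50</span></div></body></html>")

before = ET.fromstring(BEFORE)
span = before.find(".//span")
stored = source_index(before, span)
print(stored)                           # 4
print(element_at(before, stored).text)  # 19.99

after = ET.fromstring(AFTER)
print(element_at(after, stored).text)   # new banner
```

Because the index is only a position in document order, it points at a different element as soon as the page inserts or removes anything earlier in the markup. The stored value is reliable only while the layout stays the same.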
