I want to find a div with the class name XYZ, then loop through a bunch of elements named ABC inside it and grab the links (the a href values) and possibly other information.
How do I find the div with XYZ from webBrowser1.Document.Links, along with any subitems I want?
First you said you want to find a div with the class name XYZ, so why are you looking in webBrowser1.Document.Links? Find the div first, then get to the links within it.
HtmlDocument doc = webBrowser.Document;
HtmlElementCollection col = doc.GetElementsByTagName("div");
foreach (HtmlElement element in col)
{
    string cls = element.GetAttribute("className");
    if (String.IsNullOrEmpty(cls) || !cls.Equals("XYZ"))
        continue;

    HtmlElementCollection childDivs = element.Children.GetElementsByName("ABC");
    foreach (HtmlElement childElement in childDivs)
    {
        // grab links and other information the same way
    }
}
Also note the use of "className" instead of "class": it returns the actual class name, whereas "class" returns an empty string. This is documented on MSDN for SetAttribute, but not for GetAttribute, which causes a bit of confusion.
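To make the difference concrete, here is a tiny sketch (assuming element wraps a hypothetical <div class="XYZ"> in the WebBrowser control):

```csharp
// Hypothetical element: <div class="XYZ"> obtained from the WebBrowser control.
string wrong = element.GetAttribute("class");     // returns "" - not what you want
string right = element.GetAttribute("className"); // returns "XYZ"
```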
This may be a simple question for Selenium users:
I know some of the attributes we can use when finding an element: Name, TagName, Css, etc.
But can we use something like "link=-----" in C# to find an element based on that attribute?
I'm not familiar with Selenium IDE, so I'll assume here that link=601-800 students means something like <a href='something'>601-800 students</a>.
Then you can use By.XPath to locate the link by its text, or use By.LinkText, or even By.PartialLinkText.
driver.FindElement(By.XPath("//a[text()='601-800 students']"));
//driver.FindElement(By.LinkText("601-800 students"));
EDIT:
If you have several links with the same text, try identifying their unique ancestors.
E.g.
var headLink = driver.FindElement(By.XPath("//*[@id='header']//a[text()='601-800 students']"));
var mainLink = driver.FindElement(By.XPath("//*[@id='main']//a[text()='601-800 students']"));
If that's not possible, get them all with FindElements (note: this is not FindElement), then index into the result.
IList<IWebElement> links = driver.FindElements(By.XPath("//a[text()='601-800 students']"));
//IList<IWebElement> links = driver.FindElements(By.LinkText("601-800 students"));
var firstLink = links[0];
var secondLink = links[1];
foreach (IWebElement link in links)
{
    // stuff to do with each link
}
Okay, so I have this list of URLs on a webpage, and I am wondering how I grab the URLs and add them to an ArrayList.
http://www.animenewsnetwork.com/encyclopedia/anime.php?list=A
I only want the URLs that are in the list (look at the page to see what I mean). I tried doing it myself, and for whatever reason it takes all of the other URLs except for the ones I need.
http://pastebin.com/a7hJnXPP
Using Html Agility Pack
using (var wc = new WebClient())
{
    HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
    doc.LoadHtml(wc.DownloadString("http://www.animenewsnetwork.com/encyclopedia/anime.php?list=A"));
    var links = doc.DocumentNode.SelectSingleNode("//div[@class='lst']")
                   .Descendants("a")
                   .Select(x => x.Attributes["href"].Value)
                   .ToArray();
}
If you want only the ones in the list, then the following code should work (this assumes you have the page loaded into an HtmlDocument already):
List<string> hrefList = new List<string>(); // Make a list, cause lists are cool.
foreach (HtmlNode node in animePage.DocumentNode.SelectNodes("//a[contains(@href, 'id=')]"))
{
    // Append animenewsnetwork.com to the beginning of the href value
    // and add it to the list.
    hrefList.Add("http://www.animenewsnetwork.com" + node.GetAttributeValue("href", "null"));
}
Breaking the XPath //a[contains(@href, 'id=')] down:
//a selects all <a> nodes...
[contains(@href, 'id=')] ...that have an href attribute containing the text id=.
That should be enough to get you going.
As an aside, I would suggest not listing each link in its own message box, considering there are around 500 links on that page. 500 links = 500 message boxes :(
I have created an HtmlElement picker (DOM) using the default .NET WebBrowser control.
The user can pick (select) an HtmlElement by clicking on it.
I want to get the HtmlAgilityPack.HtmlNode corresponding to the HtmlElement.
The easiest way (in my mind) would be doc.DocumentNode.SelectSingleNode(EXACTHTMLTEXT), but that does not really work, because the function only accepts XPath expressions.
How can I do this?
A sample HtmlElement selected by a user looks like this (its OuterHtml):
<a onmousedown="return wow" class="l" href="http://site.com"><em>Great!!!</em> <b>come and see more</b></a>
Of course, any element can be selected; that's why I need a general way to get the HtmlNode.
Same concept, but a bit simpler, because you don't have to know the element type:
HtmlNode n = doc.DocumentNode.Descendants().FirstOrDefault(d => d.OuterHtml.Equals(text, StringComparison.InvariantCultureIgnoreCase));
I came up with a solution. I don't know if it's the best one (I would appreciate it if somebody who knows a better way would let me know).
Here is the method that will get the HtmlNode:
public HtmlNode GetNode(string text)
{
    if (!text.StartsWith("<"))
        throw new Exception("Invalid HTML element supplied. The selected HTML element must start with <");

    // Get the type of the element (a, p, div etc.). The tag name runs until
    // the first space or '>' (elements without attributes have no space).
    int end = text.IndexOfAny(new[] { ' ', '>' }, 1);
    string type = text.Substring(1, end - 1);

    // Check whether any node of that type has an OuterHtml equal to the
    // HtmlElement's OuterHtml. If such a node exists, that's the node we want.
    HtmlNode n = doc.DocumentNode.SelectNodes("//" + type)
                    ?.FirstOrDefault(x => x.OuterHtml == text);
    if (n == null)
        throw new Exception("Cannot find the HTML element in the HTML page");
    return n;
}
The idea is that you pass the OuterHtml of the HtmlElement. Example:
HtmlElement el=....
HtmlNode N = GetNode(el.OuterHtml);
Consider the following html code:
<div id='x'><div id='y'>Y content</div>X content</div>
I'd like to extract only the content of 'x'. However, its innerText property includes the content of 'y' as well. I tried iterating over its children and all its properties, but they only return the inner tags.
How can I access through the IHTMLElement interface only the actual data of 'x'?
Thanks
Use something like:
function getText(el) {
    var txt = el.innerHTML;
    // replace returns a new string, so reassign it; the greedy match strips
    // everything from the first '<' to the last '>', i.e. the whole inner div.
    txt = txt.replace(/<(.)*>/g, "");
    return txt;
}
Since the element's innerHTML returns
<div id='y'>Y content</div>X content
the function getText would return
X content
Maybe this'll help.
Use the childNodes collection to return child elements and text nodes.
You need to QueryInterface IHTMLDOMNode from IHTMLElement for that.
Here is the final code as suggested by Sheng (just a part of the sample, of course):
mshtml.IHTMLElementCollection c = ((mshtml.HTMLDocumentClass)(wbBrowser.Document)).getElementsByTagName("div");
foreach (IHTMLElement div in c)
{
    if (div.className == "lyricbox")
    {
        IHTMLDOMNode divNode = (IHTMLDOMNode)div;
        IHTMLDOMChildrenCollection children = (IHTMLDOMChildrenCollection)divNode.childNodes;
        foreach (IHTMLDOMNode child in children)
        {
            // Text nodes expose their text in nodeValue; element nodes return null here.
            Console.WriteLine(child.nodeValue);
        }
    }
}
Since innerText doesn't work this way in IE, I guess there is no direct way.
Maybe try solving the issue server-side by creating the content the following way:
<div id='x'><div id='y'>Y content</div>X content</div>
<div id='x-plain'>_plain X content_</div>
"Plain X content" represents your c# generated content for the element.
Now you gain access to the element by refering to getObject('x-plan').innerHTML().
I want to use the HTML Agility Pack to parse tables from complex web pages, but I am somewhat lost in the object model.
I looked at the link example, but did not find any table data that way.
Can I use XPath to get the tables? Having loaded the data, I am basically lost as to how to get at the tables. I have done this in Perl before (HTML::TableParser), and it was a bit clumsy, but it worked.
I would also be happy if someone could just shed some light on the right order of objects for the parsing.
How about something like:
Using HTML Agility Pack
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(@"<html><body><p><table id=""foo""><tr><th>hello</th></tr><tr><td>world</td></tr></table></body></html>");
foreach (HtmlNode table in doc.DocumentNode.SelectNodes("//table"))
{
    Console.WriteLine("Found: " + table.Id);
    foreach (HtmlNode row in table.SelectNodes("tr"))
    {
        Console.WriteLine("row");
        foreach (HtmlNode cell in row.SelectNodes("th|td"))
        {
            Console.WriteLine("cell: " + cell.InnerText);
        }
    }
}
Note that you can make it prettier with LINQ-to-Objects if you want:
var query = from table in doc.DocumentNode.SelectNodes("//table").Cast<HtmlNode>()
            from row in table.SelectNodes("tr").Cast<HtmlNode>()
            from cell in row.SelectNodes("th|td").Cast<HtmlNode>()
            select new { Table = table.Id, CellText = cell.InnerText };

foreach (var cell in query)
{
    Console.WriteLine("{0}: {1}", cell.Table, cell.CellText);
}
The simplest way I've found to get the XPath for a particular element is to install the Firebug extension for Firefox, go to the page, and press F12 to bring up Firebug. Right-click the element on the page that you want to query and select "Inspect Element"; Firebug will select the element in its IDE. Then right-click the element in Firebug and choose "Copy XPath". That gives you the exact XPath query you need to get the element using the HTML Agility Pack.
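Once copied, the XPath drops straight into the Agility Pack (the path below is just an example of what Firebug might produce, not from any real page):

```csharp
// Example XPath as produced by Firebug's "Copy XPath" (hypothetical path).
// Note: Firefox inserts <tbody> elements into tables when rendering, so if
// the raw HTML has no <tbody>, remove it from the copied path before using it.
HtmlAgilityPack.HtmlNode node =
    doc.DocumentNode.SelectSingleNode("/html/body/div[2]/table/tr[1]/td[3]");
```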
I know this is a pretty old question, but this was my solution, which helped with visualizing the table so you can create a class structure. It also uses the HTML Agility Pack.
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(@"<html><body><p><table id=""foo""><tr><th>hello</th></tr><tr><td>world</td></tr></table></body></html>");
var table = doc.DocumentNode.SelectSingleNode("//table");
var tableRows = table.SelectNodes("tr");
var columns = tableRows[0].SelectNodes("th/text()");

for (int i = 1; i < tableRows.Count; i++)
{
    for (int e = 0; e < columns.Count; e++)
    {
        var value = tableRows[i].SelectSingleNode($"td[{e + 1}]");
        Console.Write(columns[e].InnerText + ":" + value.InnerText);
    }
    Console.WriteLine();
}
In my case, there is a single table, which happens to be a device list from a router. If you wish to read the table as TR/TH/TD (row, header, data) elements instead of as a matrix as mentioned above, you can do something like the following:
List<TableRow> deviceTable =
    (from table in document.DocumentNode.SelectNodes(XPathQueries.SELECT_TABLE)
     from row in table?.SelectNodes(HtmlBody.TR)
     where row.FirstChild.OriginalName != null && row.FirstChild.OriginalName.Equals(HtmlBody.T_HEADER)
     select new TableRow
     {
         Header = row.SelectSingleNode(HtmlBody.T_HEADER)?.InnerText,
         Data = row.SelectSingleNode(HtmlBody.T_DATA)?.InnerText
     }).ToList();
TableRow is just a simple object with Header and Data as properties.
The approach takes care of null-ness and this case:
<tr>
<td width="28%"> </td>
</tr>
which is a row without a header. The HtmlBody object, with the constants hanging off it, can probably be readily deduced, but I apologize for it all the same. I come from a world where any literal string in your code should either be a constant or localizable.
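For reference, a minimal sketch of what the TableRow and HtmlBody helpers used above might look like (the names come from the answer, but the exact definitions are my guess):

```csharp
// Hypothetical definitions matching the names used in the query above.
public class TableRow
{
    public string Header { get; set; }
    public string Data { get; set; }
}

public static class HtmlBody
{
    public const string TR = "tr";       // table row
    public const string T_HEADER = "th"; // header cell
    public const string T_DATA = "td";   // data cell
}
```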
Line from above answer:
HtmlDocument doc = new HtmlDocument();
This no longer works in VS 2015 C#; you cannot construct an HtmlDocument that way any more.
Another MS "feature" that makes things more difficult to use. Try HtmlAgilityPack.HtmlWeb and check out this link for some sample code.