get div information with html agility pack

Hi, I want to extract information from an HTML page. With the following code I can read the values, and at first they arrive in this order:
new-link-1
new-link-2
new-link-3
But when the loop reaches the new-link-no-titel item, the results get out of step and the order changes to:
new-link-3
new-link-1
new-link-2
Finally, the program stops with an ArgumentOutOfRangeException. This is my code:
HtmlWeb web = new HtmlWeb();
HtmlDocument doc = await web.LoadFromWebAsync(Link);
foreach ((var item, int index) in doc.DocumentNode.SelectNodes(".//div[@class='new-link-1']").WithIndex())
{
    var x = item.SelectNodes("//div[@class='new-link-2']")[index].InnerText;
    var xx = item.SelectNodes("//div[@class='new-link-3']//a")[index];
    MessageBox.Show(item.InnerText);
    MessageBox.Show(x);
    MessageBox.Show(xx.Attributes["href"].Value);
}
And this is the HTML:
<div id="new-link">
<ul>
<li>
<div class="new-link-1"> فصل پنجم</div>
<div class="new-link-2"> تکمیل شده</div>
<div class="new-link-3">
دانلود با لینک مستقیم
</div>
</li>
<li class="new-link-no-titel">
<div class="new-link-1"> فصل ششم</div>
<div class="new-link-2"> درحال پخش</div>
<div class="new-link-3">
<i class="fa fa-arrow-down" title="حال پخش">
</i>
</div>
</li>
<li>
<div class="new-link-1"> قسمت 1</div>
<div class="new-link-2"> پخش شده</div>
<div class="new-link-3">
دانلودلینک مستقیم
</div>
</li>
<li>
<div class="new-link-1"> قسمت 7</div>
<div class="new-link-2"> پخش شده</div>
<div class="new-link-3">
دانلود با لینک مستقیم
</div>
</li>
</ul>
</div>

This is what I found to be the issue with your code:
foreach ((var item, int index) in doc.DocumentNode.SelectNodes(".//div[@class='new-link-1']").WithIndex()) // gives 4 values for index
item.SelectNodes("//div[@class='new-link-2']") // produces 4 nodes
item.SelectNodes("//div[@class='new-link-3']//a") // produces only 3 nodes
Issue:
An XPath that starts with // searches the whole document, not just the item you are currently on. Because one new-link-3 div has no a element, the third list is shorter than the other two, so the indexes stop lining up and the last index is out of range.
Solution/Suggestion: Your current code searches all a elements starting from the root node. If you prefix the expression with a dot (.//), only the descendants of the current node are considered. Querying each li separately then looks like this:
foreach (HtmlNode item in doc.DocumentNode.SelectNodes(".//li"))
{
    // The leading dot restricts each query to the current <li>.
    var x0 = item.SelectSingleNode(".//div[@class='new-link-1']");
    var x = item.SelectSingleNode(".//div[@class='new-link-2']");
    var xx = item.SelectSingleNode(".//a"); // null when this <li> has no link
    if (x0 != null)
        MessageBox.Show(x0.InnerText);
    if (x != null)
        MessageBox.Show(x.InnerText);
    // Null checks replace the empty catch block, so real errors are not hidden.
    var href = xx?.Attributes["href"]?.Value;
    if (href != null)
        MessageBox.Show(href);
}
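The difference between // and .// can be checked without HtmlAgilityPack. The following is a minimal sketch that uses System.Xml on a small, well-formed fragment; the class name XPathScopeDemo and the method CountFromFirstLi are made up for this illustration, and the fragment is a simplified stand-in for the page in the question.

```csharp
using System;
using System.Xml;

public static class XPathScopeDemo
{
    // Evaluates `xpath` with the first <li> as the context node
    // and returns how many nodes it selects.
    public static int CountFromFirstLi(string xml, string xpath)
    {
        var doc = new XmlDocument();
        doc.LoadXml(xml);
        XmlNode firstLi = doc.SelectSingleNode("//li");
        return firstLi.SelectNodes(xpath).Count;
    }
}
```

With a fragment that has two li elements, each holding one div of class new-link-2, the query "//div[@class='new-link-2']" selects both divs even though it is called on the first li, while ".//div[@class='new-link-2']" selects only the one inside that li.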

Related

Get the href link of an element with classname having spaces

I have been trying to get the link of an element using its class name, but I always get an error saying that no element was found.
List<IWebElement> LinkElements = Selenium.Selenium.driver.FindElementsByClassName("column.wrap-text").ToList();
I managed to get the links I want with the code below, but I know it is not a good approach.
try
{
Selenium.Selenium.driver.Navigate().GoToUrl(txt_url.Text);
List<IWebElement> LinkElements = Selenium.Selenium.driver.FindElementsByTagName("a").ToList();
List<string> ValidLinks = new List<string>();
foreach (IWebElement LinkElement in LinkElements)
{
string LinkString = LinkElement.GetAttribute("href");
if (LinkString != null)
{
if (LinkString.Contains("documents"))
{
list.Items.Add(LinkString);
}
}
}
}
catch (Exception)
{ }
Below is the HTML for the element. I want to extract the href ("/view/garnimii#/Testing%20Folder/MyFile.txt") of the link that belongs to the file named in the title attribute. I have tried every way I could think of, but I cannot read the element by class name or by XPath (my XPath attempts were very vague). Can anyone please help me with this?
<div class="wrapper fluid-element">
  <div class="wrapper fluid-element">
    <div class="wrapper fluid-element">
      <div class="column wrap-text">
        <a title="MyFile.txt" href="https://drive.corp.amazon.com/documents/garnimii#/Testing%20Folder/MyFile.txt">MyFile.txt</a>
      </div>
    </div>
    <div class="column actions resource-actions-view">
      <a data-turbolink="true" href="/view/garnimii#/Testing%20Folder/MyFile.txt"><i class="fa fa-external-link"></i> View
      </a>
    </div>
    <div class="column actions resource-actions-share">
      <a data-target="#resource-modal-share" data-toggle="modal"
         href="/share/garnimii#/Testing%20Folder/MyFile.txt">
        <i class="fa fa-share-alt"></i> Share
      </a>
    </div>
    <div class="column actions resource-actions-rename resource-header-actions">
      <a data-resource-basename="MyFile.txt" data-resource-id="8a520062-5dbe-46ba-b4b0-b672f6481c17"
         data-root-path="/" data-target="#resource-modal-rename" data-toggle="modal" href="#resource-modal-rename">
        <i class="fa fa-pencil"></i> Rename
      </a>
    </div>
  </div>
</div>
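Update:
foreach (IWebElement LinkElement in LinkElements)
{
    string title = LinkElement.GetAttribute("title");
    // Compare case-insensitively: the title in the page is "MyFile.txt".
    if (title != null && title.IndexOf("myfile.txt", StringComparison.OrdinalIgnoreCase) >= 0)
    {
        // Read href from the element, not from the title string.
        list.Items.Add(LinkElement.GetAttribute("href"));
    }
}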

You can also try the //a XPath:
List<IWebElement> LinkElements = Selenium.Selenium.driver.FindElementsByXPath("//a").ToList();
List<string> ValidLinks = new List<string>();
foreach (IWebElement LinkElement in LinkElements)
{
    Console.WriteLine(LinkElement.GetAttribute("href"));
}
Print every href first. If the output contains all the links you expect, we can move on to adding them to the other list.
Update:
string LinkString = Selenium.Selenium.driver.FindElementByXPath("//a[@title='MyFile.txt']").GetAttribute("href");

FindElementsByClassName accepts only a single class name.
To match several class names at once, use an XPath or a CSS selector.
So instead of
List<IWebElement> LinkElements = Selenium.Selenium.driver.FindElementsByClassName("column.wrap-text").ToList();
Try using
List<IWebElement> LinkElements = Selenium.Selenium.driver.FindElementsByCssSelector("div.column.wrap-text").ToList();

Get specific href values or link from email which is parsed as html in c#

I am processing emails in a C# service and need to extract certain links from them to store in a database. I am using HtmlAgilityPack. The div and p tags turn out to be interchangeable in the parsed email. I have to extract the links that appear below the labels 'Scheduler Link', 'Data Path' and 'Link'. After cleaning, a sample looks like this:
<html>
<body>
......//contains some other tags which i dont need, may include hrefs but
//i dont need them
<div align="justify" style="margin:0;"></div>
<div align="justify" style="margin:0;"></div>
<div align="justify" style="margin:0;">Scheduler link :</div>
<div align="justify" style="margin:0;"></div>
<div style="margin:0;"><a href="https://something.com/requests/26428">
https://something.com/requests/26428</a>
</div>
<div style="margin:0;"></div>
<div style="margin:0;"></div>
<div style="margin:0;"></div>
<div align="justify" style="margin:0;">Data path :</div>
<div align="left" style="text-align:justify;margin:0;"><a
href="file:///\\mycompany.com\ABC\OPQ1234\jui\tui245.5t_2rtfg_tyui">
\\mycompany.com\ABC\OPQ1234\jui\tui245.5t_2rtfg_tyui</a>
</div>
<div align="left" style="text-align:justify;margin:0;"><a
href="file:///\\mycompany.com\ABC\OPQ1234\tui245.5t_2rtfg_tyui">
\\mycompany.com\ABC\OPQ1234\tui245.5t_2rtfg_tyui</a>
</div>
<div align="justify" style="margin:0;"></div>
<div align="justify" style="margin:0;">Link :</div>
<div align="justify" style="margin:0;"><a
href="https://Thisisanotherlink.abcdef/sites/this/498592/rkjfb/3874y">
This is some text</a></div>
<div align="justify" style="margin:0 0 5pt 0;">This is another text</div>
......//contains some other tags which i dont need
</body>
</html>
I locate the div for 'Scheduler Link', 'Data Path' and 'Link' using regular expressions as follows:
HtmlNode schedulerLink = doc.DocumentNode.SelectSingleNode("//*[text()[contains(.,'" + Regex.Match(body, _keyValuePairs["scheduler"]).Value.ToString() + "')]]");
HtmlNode dataPath = doc.DocumentNode.SelectSingleNode("//*[text()[contains(.,'" + Regex.Match(body, _keyValuePairs["datapath"]).Value.ToString() + "')]]");
HtmlNode link = doc.DocumentNode.SelectSingleNode("//*[text()[contains(.,'" + Regex.Match(body, _keyValuePairs["link"]).Value.ToString() + "')]]");
These calls return the respective nodes. The number of links under each of the three labels varies from email to email, and so does the order of the labels. I need to collect the links under each label in a list. I am using the following code:
foreach (HtmlNode link in schedulerLink.Descendants())
{
string hrefValue = link.GetAttributeValue("href", string.Empty);
if (!(link.InnerText.Contains("\r\n")))
{
if (link.InnerText.Contains("/"))
{
schedulersList.Add(link.InnerText.Trim());
}
}
}
Descendants() sometimes does not return the correct number of nodes. Also, how do I collect the links under each of the three labels into three separate lists, given that Descendants() returns every node below the current one?

If I understand correctly, you want to capture the content of the first href attribute after a specific string such as "Scheduler link". I don't know HtmlAgilityPack, but my approach would be to search the email body with a regex like this:
Scheduler link(?:\s|\S)*?href="([^"]+)
This regex captures the content of the first href attribute after every occurrence of "Scheduler link" in the mail.
You can try it on Regex101.
To find the other types of links, replace the Scheduler link part with the respective string.
I hope this is helpful.
Additional info about the regex:
Scheduler link matches the string literally
(?:\s|\S)*?href=" is a non-capturing group that lazily matches any character, including newlines, up to the first occurrence of the literal string href="
([^"]+) captures everything up to, but not including, the next " character
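As a rough C# sketch of the regex approach described above, the pattern can be wrapped in a small helper. The names LabelLinkExtractor and FirstHrefAfter are hypothetical, and the sketch assumes the href values are delimited by double quotes, as in the sample email.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class LabelLinkExtractor
{
    // Returns the first href value that follows each occurrence of `label` in `body`.
    public static List<string> FirstHrefAfter(string body, string label)
    {
        // Regex.Escape makes the label match literally;
        // [\s\S]*? lazily matches any characters, including newlines.
        string pattern = Regex.Escape(label) + @"[\s\S]*?href=""([^""]+)";
        var hrefs = new List<string>();
        foreach (Match m in Regex.Matches(body, pattern))
        {
            hrefs.Add(m.Groups[1].Value);
        }
        return hrefs;
    }
}
```

Calling LabelLinkExtractor.FirstHrefAfter(body, "Scheduler link") on the sample would yield one entry per occurrence of the label. Note that this only captures the first link after each label, so a label followed by several links (such as "Data path" in the sample) would need a different pattern or an HTML parser.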

Since the three kinds of href in your question differ from each other,
one way to do it is to select the links by their href content:
var html = @"<html> <body> <div align='justify' style='margin:0;'></div> <div align='justify' style='margin:0;'></div> <div align='justify' style='margin:0;'>Scheduler link :</div> <div align='justify' style='margin:0;'></div> <div style='margin:0;'><a href='https://something.com/requests/26428'> https://something.com/requests/26428</a> </div> <div style='margin:0;'></div> <div style='margin:0;'></div> <div style='margin:0;'></div> <div align='justify' style='margin:0;'>Data path :</div> <div align='left' style='text-align:justify;margin:0;'><a href='file:///\\mycompany.com\ABC\OPQ1234\jui\tui245.5t_2rtfg_tyui'> \\mycompany.com\ABC\OPQ1234\jui\tui245.5t_2rtfg_tyui</a> </div> <div align='left' style='text-align:justify;margin:0;'><a href='file:///\\mycompany.com\ABC\OPQ1234\tui245.5t_2rtfg_tyui'> \\mycompany.com\ABC\OPQ1234\tui245.5t_2rtfg_tyui</a> </div> <div align='justify' style='margin:0;'></div> <div align='justify' style='margin:0;'>Link :</div> <div align='justify' style='margin:0;'><a href='https://Thisisanotherlink.abcdef/sites/this/498592/rkjfb/3874y'> This is some text</a></div> <div align='justify' style='margin:0 0 5pt 0;'>This is another text</div> </body></html>";
var document = new HtmlDocument();
document.LoadHtml(html);
var schedulerNodes = document.DocumentNode.SelectNodes("//a[contains(@href, \"something\")]");
var dataPathNodes = document.DocumentNode.SelectNodes("//a[contains(@href, \"mycompany\")]");
var linkNodes = document.DocumentNode.SelectNodes("//a[contains(@href, \"Thisisanotherlink\")]");
foreach (var item in schedulerNodes)
{
Debug.WriteLine(item.GetAttributeValue("href", ""));
Debug.WriteLine(item.InnerText);
}
foreach (var item in dataPathNodes)
{
Debug.WriteLine(item.GetAttributeValue("href", ""));
Debug.WriteLine(item.InnerText);
}
foreach (var item in linkNodes)
{
Debug.WriteLine(item.GetAttributeValue("href", ""));
Debug.WriteLine(item.InnerText);
}
Hope that helps!
EDIT:
// Select all non-empty text nodes and all a tags, in document order.
var result = document.DocumentNode.SelectNodes("//div//text()[normalize-space()] | //a");
string sch = "Scheduler link :";
string dataLink = "Data path :";
string linkpath = "Link :";
var labels = new[] { sch, dataLink, linkpath };
foreach (var label in labels)
{
    Debug.WriteLine("==================== " + label + " ====================");
    // Skip everything up to and including the text node that holds this label.
    var processResult = result.SkipWhile(x => !x.InnerText.Trim().Equals(label)).Skip(1);
    foreach (var subitem in processResult)
    {
        // Stop as soon as the next label appears, whichever one it is,
        // so the order of the labels in the email does not matter.
        if (labels.Contains(subitem.InnerText.Trim()))
        {
            break;
        }
        var hrefValue = subitem.GetAttributeValue("href", "");
        if (hrefValue.Length > 0)
        {
            Debug.WriteLine(hrefValue); // TODO: add to the list for this label
        }
    }
}
The logic is explained in the code comments.
Hope that helps.

Html nodes issue with HtmlAgilityPack

I'm having a lot of trouble parsing this HTML with the HtmlAgilityPack library.
From this piece of markup I would like to retrieve only the URL (href) that refers to uploaded.net, but I can't determine which URL belongs to it.
<div class='downloads' id='download_block'>
<h5 style='text-align:center'>FREE DOWNLOAD LINKS</h5>
<h4>uploadable.ch</h4>
<ul class='parts'>
<li>
text here
</li>
</ul>
<h4>uploaded.net</h4>
<ul class='parts'>
<li>
text here
</li>
</ul>
<h4>novafile.com</h4>
<ul class='parts'>
<li>
text here
</li>
</ul>
</div>
And this is what I have:
nodes = myHrmlDoc.DocumentNode.SelectNodes(".//div[@class='downloads']/ul[@class='parts']")
I can't just use an array-index to determine the position like:
nodes(0) = uploadable.ch node
nodes(1) = uploaded.net node
nodes(2) = novafile.com node
...because the number of nodes and the order of the hosts could change.
Note also that the URLs do not contain the host names; they are redirections like:
http://xxxxxx/r/YEHUgL44xONfQAnCNUVw_aYfY5JYAy0DT-i--
What could I do, in C# or VB.NET?

This should do it (untested):
doc.DocumentNode.SelectSingleNode("//h4[contains(text(),'uploaded.net')]/following-sibling::ul//a").Attributes["href"].Value
It uses contains() because the heading text may include surrounding whitespace.
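To see the following-sibling idea in isolation, here is a minimal sketch that runs the same kind of XPath with System.Xml instead of HtmlAgilityPack. The class SiblingLinkDemo and the method HrefForHost are names invented for this example, and it assumes well-formed markup in which each ul actually contains an a element with an href. It also adds [1] so that only the ul directly after the matching h4 is searched.

```csharp
using System;
using System.Xml;

public static class SiblingLinkDemo
{
    // Returns the href of the first <a> inside the <ul> that follows
    // the <h4> whose text contains `host`, or null if nothing matches.
    public static string HrefForHost(string xml, string host)
    {
        var doc = new XmlDocument();
        doc.LoadXml(xml);
        // `host` is concatenated into the XPath, so it must not contain
        // a single quote; acceptable for a fixed literal in a sketch.
        string xpath = "//h4[contains(text(),'" + host + "')]"
                     + "/following-sibling::ul[1]//a/@href";
        XmlNode attr = doc.SelectSingleNode(xpath);
        return attr?.Value;
    }
}
```

Because the lookup starts from the h4 text rather than from a position, it keeps working when hosts are added, removed, or reordered.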

The only way I see this working is a two-step approach. Sorry, I don't have HtmlAgilityPack at hand, but here is an example using the standard XmlDocument. Even though you said you can't use array indexes, this process lets you do so safely, because the correct index is computed dynamically.
void Main()
{
var xml = @"
<div class=""downloads"" id=""download_block"">
<h5 style=""text-align:center"">FREE DOWNLOAD LINKS</h5>
<h4>uploadable.ch</h4>
<ul class=""parts"">
<li>
text here
</li>
</ul>
<h4>uploaded.net</h4>
<ul class=""parts"">
<li>
text here
</li>
</ul>
<h4>novafile.com</h4>
<ul class=""parts"">
<li>
text here
</li>
</ul>
</div>";
var xmlDocument = new XmlDocument();
xmlDocument.LoadXml(xml);
var nav = xmlDocument.CreateNavigator();
var index = nav.Evaluate("count(//h4[text()='uploaded.net']/preceding-sibling::h4)+1").ToString();
var text = xmlDocument.SelectSingleNode("//ul["+index +"]//a/@href").InnerText;
Console.WriteLine(text);
}
Basically, it gets the index of the uploaded.net h4 and then uses that index to select the matching ul tag and read the URL from the anchor tag inside it.
Sorry for the rough and error-prone code, but it should point you in the right direction.

Given the snippet you supplied, this will help you get started.
var page = "<div class=\"downloads\" id=\"download_block\"> <h5 style=\"text-align:center\">FREE DOWNLOAD LINKS</h5> <h4>uploadable.ch</h4> <ul class=\"parts\"> <li> text here </li> </ul> <h4>uploaded.net</h4> <ul class=\"parts\"> <li> text here </li> </ul> <h4>novafile.com</h4> <ul class=\"parts\"> <li> text here </li> </ul></div>";
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(page);
var nodes = doc.DocumentNode.Descendants("h4").Where(n => n.InnerText.Contains("uploadable"));
foreach (var node in nodes)
{
var attr = node.NextSibling.NextSibling.Descendants().Where(x=> x.Name == "a").FirstOrDefault().Attributes["href"];
attr.Value.Dump();
}

Fetch data from website using HtmlAgilityPack

I am developing an app in which I have to fetch data from a website. The markup of the site looks like this:
<div id="id1" class="class1">
<ol class="cls_ol">
<li>
<div class="class2">Content 1</div>
<div class="cls_img">
*** Code for some image ***
</div>
Content 2
</li>
<li> *** Same like above <li> *** </li>
<li> *** Same like above <li> *** </li>
</ol>
</div>
I use this code to fetch it:
protected void Button1_Click(object sender, EventArgs e)
{
    var obj = new HtmlWeb();
    var document = obj.Load(" ** url of a website ** ");
    var bold = document.DocumentNode.SelectNodes("//div[@class='class1']");
    foreach (var i in bold)
    {
        Response.Write(i.InnerHtml);
    }
}
The problem is that this code also fetches the images inside <div class="cls_img"></div>, which I don't need. How can I fetch all the content of <div id="id1" class="class1"> without the images from <div class="cls_img">?

Step 1 - select and remove the images inside <div class="cls_img"> within the <div class="class1"> tag:
var images = document.DocumentNode.SelectNodes(
    "//div[@class='class1']//*//div[@class='cls_img']//img"
);
// SelectNodes returns null when no nodes are found, so guard before iterating
if (images != null)
{
    foreach (var image in images)
    {
        image.Remove();
    }
}
Step 2 - select the <div class="class1"> elements (as you already do), now without those images:
var bold = document.DocumentNode.SelectNodes("//div[@class='class1']");
foreach (var node in bold)
{
    Console.Write(node.InnerHtml);
}

Alternatively, loop through the nodes, find each node whose class attribute is cls_img, and remove it from its parent:
node.ParentNode.RemoveChild(node);

How to getelement by class?

I am trying to use webBrowser1 to get hold of a download link via its href, but the problem is that I must find it by class name.
<body>
<iframe scrolling="no" frameborder="0" allowtransparency="true" tabindex="0" name="twttrHubFrame" style="position: absolute; top: -9999em; width: 10px; height: 10px;" src="http://platform.twitter.com/widgets/hub.html">
<div id="main">
<div id="header">
<div style="float:left;">
<div id="content">
<h1 style="background-image:url('http://static.mp3skull.com/img/bgmen.JPG'); background-repeat:repeat-x;">Rush Mp3 Download</h1>
<a id="bitrate" onclick="document.getElementById('ofrm').submit(); return false;" rel="nofollow" href="">
<form id="ofrm" method="POST" action="">
<div id="song_html" class="show1">
<div class="left">
<div id="right_song">
<div style="font-size:15px;">
<div style="clear:both;"></div>
<div style="float:left;">
<div style="float:left; height:27px; font-size:13px; padding-top:2px;">
<div style="float:left; width:27px; text-align:center;">
<div style="margin-left:8px; float:left;">
<a style="color:green;" target="_blank" rel="nofollow" href="http://dc182.4shared.com/img/1011303409/865387c9/dlink__2Fdownload_2F6QmedN8H_3Ftsid_3D20111211-54337-a79f8d10/preview.mp3">Download</a>
</div>
<div style="margin-left:8px; float:left;">
<div style="margin-left:8px; float:left;">
<div style="clear:both;"></div>
</div>
<div id="player155580779" class="player" style="float:left; margin-left:10px;"></div>
</div>
<div style="clear:both;"></div>
</div>
<div style="clear:both;"></div>
</div>
I searched all over Google, but I only found PHP examples.
I understand you would do something along the lines of this:
HtmlElement downloadlink = webBrowser1.Document.GetElementById("song_html").All[0];
URL = downloadlink.GetAttribute("href");
but I do not understand how to do it by the class "show1".
Please point me in the right direction with examples or a website where I can learn how to do this; I have searched and have no clue.
EDIT: I pretty much need the href link ("http://dc182.4shared.com/img/1011303409/865387c9/dlink__2Fdownload_2F6QmedN8H_3Ftsid_3D20111211-54337-a79f8d10/preview.mp3"), so how would I obtain it?

There is nothing built-in in the WebBrowser control to retrieve an element by class name. Since you know it is going to be an a element the best you can do is get all a elements and search for the one you want:
var links = webBrowser1.Document.GetElementsByTagName("a");
foreach (HtmlElement link in links)
{
if (link.GetAttribute("className") == "show1")
{
//do something
}
}

Extension Method for HtmlDocument
It returns a list of the elements that have a particular tag and whose class matches the given className.
It can also be used to select elements by tag only, or by class name only.
internal static class Utils
{
internal static List<HtmlElement> getElementsByTagAndClassName(this HtmlDocument doc, string tag = "", string className = "")
{
List<HtmlElement> lst = new List<HtmlElement>();
bool empty_tag = String.IsNullOrEmpty(tag);
bool empty_cn = String.IsNullOrEmpty(className);
if (empty_tag && empty_cn) return lst;
HtmlElementCollection elmts = empty_tag ? doc.All : doc.GetElementsByTagName(tag);
if (empty_cn)
{
lst.AddRange(elmts.Cast<HtmlElement>());
return lst;
}
for (int i = 0; i < elmts.Count; i++)
{
if (elmts[i].GetAttribute("className") == className)
{
lst.Add(elmts[i]);
}
}
return lst;
}
}
Usage:
WebBrowser wb = new WebBrowser();
List<HtmlElement> lst_div = wb.Document.getElementsByTagAndClassName("div");// all div elements
List<HtmlElement> lst_err_elmnts = wb.Document.getElementsByTagAndClassName(String.Empty, "error"); // all elements with "error" class
List<HtmlElement> lst_div_err = wb.Document.getElementsByTagAndClassName("div", "error"); // all div's with "error" class

I followed these answers and wrote my own method to hide a div by class name.
I am sharing it for anyone who needs it.
public void HideDivByClassName(WebBrowser browser, string classname)
{
if (browser.Document != null)
{
var byTagName = browser.Document.GetElementsByTagName("div");
foreach (HtmlElement element in byTagName)
{
if (element.GetAttribute("className") == classname)
{
element.Style = "display:none";
}
}
}
}
