I have an HTML file that looks like this:
<div class="user_meals">
<div class="name">Name Surname</div>
<div class="day_meals">
<div class="meal">First Meal</div>
</div>
<div class="day_meals">
<div class="meal">Second Meal</div>
</div>
<div class="day_meals">
<div class="meal">Third Meal</div>
</div>
<div class="day_meals">
<div class="meal">Fourth Meal</div>
</div>
<div class="day_meals">
<div class="meal">Fifth Meal</div>
</div>
This block repeats a few times.
I want to get the name and surname, which sit inside the <div> tag with class "name".
This is my code using HtmlAgilityPack:
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(@"C:\workspace\file.html");
foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//div[@class='name']"))
{
    string value = node.InnerText;
}
But it doesn't work. Visual Studio throws an exception:
An unhandled exception of type 'System.NullReferenceException'.
You are using the wrong method to load HTML from a path: LoadHtml expects the HTML markup itself, not the location of a file. Use Load instead.
The error you are getting is quite misleading, because none of the properties involved are null and the standard tips from "What is a NullReferenceException, and how do I fix it?" don't apply.
Essentially, LoadHtml parses the path string as if it were markup, so no element matches the query; SelectNodes then correctly returns null, and foreach throws when it tries to enumerate null.
Fixed code:
HtmlDocument doc = new HtmlDocument();
// either doc.Load(@"C:\workspace\file.html") or pass HTML:
doc.LoadHtml("<div class='user_meals'><div class='name'>Name Surname</div></div> ");
var nodes = doc.DocumentNode.SelectNodes("//div[@class='name']");
// SelectNodes returns null if nothing found - may need to check
if (nodes == null)
{
    throw new InvalidOperationException("Where all my nodes???");
}
foreach (HtmlNode node in nodes)
{
    string value = node.InnerText;
    // Dump() is a LINQPad extension; use Console.WriteLine(value) elsewhere
    value.Dump();
}
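If an empty result is not an error in your case, one option is to turn the null into an empty sequence so that foreach simply does nothing. This is only a minimal sketch: the extension method SelectNodesOrEmpty is a hypothetical helper of my own, not part of HtmlAgilityPack, and it assumes the HtmlAgilityPack package is referenced.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

static class HtmlNodeExtensions
{
    // Hypothetical helper (not an HtmlAgilityPack API): wraps SelectNodes
    // so that callers get an empty sequence instead of null.
    public static IEnumerable<HtmlNode> SelectNodesOrEmpty(this HtmlNode node, string xpath)
    {
        IEnumerable<HtmlNode> nodes = node.SelectNodes(xpath);
        return nodes ?? Enumerable.Empty<HtmlNode>();
    }
}

class Program
{
    static void Main()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml("<div class='user_meals'><div class='name'>Name Surname</div></div>");

        // One match: the loop body runs once.
        foreach (HtmlNode node in doc.DocumentNode.SelectNodesOrEmpty("//div[@class='name']"))
        {
            Console.WriteLine(node.InnerText);
        }

        // No match: the loop body is skipped instead of throwing.
        foreach (HtmlNode node in doc.DocumentNode.SelectNodesOrEmpty("//div[@class='missing']"))
        {
            Console.WriteLine(node.InnerText);
        }
    }
}
```

Whether to throw or to silently skip depends on whether a missing node means the page layout changed; throwing, as in the fixed code above, makes that failure visible.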
Related
I have this html document:
<div class="link1">
link1
</div>
<div class="link2">
link2
</div>
<div class="link3">
link3
</div>
<div class="link3">
link4
</div>
<div class="link5">
link4
</div>
I want to show the elements with class "link3" in a WebBrowser control by getting them by class name.
This code works, but if two elements have the same class name it shows nothing!
foreach (HtmlElement elm in webBrowser1.Document.All)
if (elm.GetAttribute("className") == "link3")
{
HtmlDocument doc = webBrowser1.Document;
doc.Body.InnerHtml = elm.InnerHtml;
}
Use this instead:
StringBuilder sb=new StringBuilder();
foreach (HtmlElement elm in webBrowser1.Document.All)
if (elm.GetAttribute("className") == "link3")
sb.Append(elm.InnerHtml);
HtmlDocument doc = webBrowser1.Document;
doc.Body.InnerHtml=sb.ToString();
First, sorry about my bad English.
My question is: how can I scrape a div inside a div with HtmlAgilityPack in C#?
This is my test HTML:
<html>
<div class="all_ads">
<div class="ads__item">
<div class="test">
test 1
</div>
</div>
<div class="ads__item">
<div class="test">
test 2
</div>
</div>
<div class="ads__item">
<div class="test">
test 3
</div>
</div>
</div>
</html>
How do I write a loop that gets all the ads, and then an inner loop that reads the test div inside each ad?
You can select all the ads__item nodes inside the all_ads element as follows:
var res = doc.DocumentNode.SelectNodes("//div[@class='all_ads']/div[@class='ads__item']");
//div[@class='all_ads']/div[@class='ads__item'] selects every div with class ads__item that is a direct child of the div with class all_ads.
To reach the inner content, use this path: //div[contains(@class, 'test')]
It selects every div whose class attribute contains test.
Then read the inner HTML of each selected div, like this:
using System;
using System.IO;
using System.Linq;
using HtmlAgilityPack;

class Program
{
    static void Main(string[] args)
    {
        string html = File.ReadAllText(@"Path to your html file");
        HtmlDocument doc = new HtmlDocument();
        doc.LoadHtml(html);
        // Trimmed inner HTML of every div whose class contains "test"
        var innerContent = doc.DocumentNode.SelectNodes("//div[contains(@class, 'test')]").Select(x => x.InnerHtml.Trim());
        foreach (var item in innerContent)
            Console.WriteLine(item);
        Console.ReadLine();
    }
}
Output:
test 1
test 2
test 3
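If you need the two-level loop the question asks for (first over the ads, then over the test div inside each ad), it could be sketched like this. The class name AdScraper and the method ReadAds are made up for illustration, and the sketch assumes the class attributes match the test HTML exactly.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

static class AdScraper
{
    // Hypothetical helper: returns the trimmed text of every "test" div,
    // visiting the ads one by one.
    public static List<string> ReadAds(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        var results = new List<string>();

        // Outer loop: every ads__item directly under all_ads.
        var ads = doc.DocumentNode.SelectNodes("//div[@class='all_ads']/div[@class='ads__item']");
        if (ads == null) return results; // SelectNodes returns null when nothing matches

        foreach (HtmlNode ad in ads)
        {
            // Inner loop: the leading "." keeps the search inside this ad.
            var tests = ad.SelectNodes(".//div[@class='test']");
            if (tests == null) continue;

            foreach (HtmlNode test in tests)
            {
                results.Add(test.InnerText.Trim());
            }
        }
        return results;
    }
}

class Program
{
    static void Main()
    {
        string html =
            "<html><div class=\"all_ads\">" +
            "<div class=\"ads__item\"><div class=\"test\"> test 1 </div></div>" +
            "<div class=\"ads__item\"><div class=\"test\"> test 2 </div></div>" +
            "</div></html>";

        foreach (string text in AdScraper.ReadAds(html))
        {
            Console.WriteLine(text);
        }
    }
}
```

The important detail is the relative path in the inner query: without the leading dot, //div[@class='test'] would search the whole document again for every ad.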
I am processing emails in my C# service and need to extract certain links from them to add to a database. I am using HtmlAgilityPack. The div and p tags turn out to be interchangeable in the parsed email. I have to extract the links that appear below the labels 'Scheduler Link', 'Data Path' and 'Link'. After cleanup, a sample looks like this:
<html>
<body>
......//contains some other tags which i dont need, may include hrefs but
//i dont need them
<div align="justify" style="margin:0;"></div>
<div align="justify" style="margin:0;"></div>
<div align="justify" style="margin:0;">Scheduler link :</div>
<div align="justify" style="margin:0;"></div>
<div style="margin:0;"><a href="https://something.com/requests/26428">
https://something.com/requests/26428</a>
</div>
<div style="margin:0;"></div>
<div style="margin:0;"></div>
<div style="margin:0;"></div>
<div align="justify" style="margin:0;">Data path :</div>
<div align="left" style="text-align:justify;margin:0;"><a
href="file:///\\mycompany.com\ABC\OPQ1234\jui\tui245.5t_2rtfg_tyui">
\\mycompany.com\ABC\OPQ1234\jui\tui245.5t_2rtfg_tyui</a>
</div>
<div align="left" style="text-align:justify;margin:0;"><a
href="file:///\\mycompany.com\ABC\OPQ1234\tui245.5t_2rtfg_tyui">
\\mycompany.com\ABC\OPQ1234\tui245.5t_2rtfg_tyui</a>
</div>
<div align="justify" style="margin:0;"></div>
<div align="justify" style="margin:0;">Link :</div>
<div align="justify" style="margin:0;"><a
href="https://Thisisanotherlink.abcdef/sites/this/498592/rkjfb/3874y">
This is some text</a></div>
<div align="justify" style="margin:0 0 5pt 0;">This is another text</div>
......//contains some other tags which i dont need
</body>
</html>
I am locating the div tags for 'Scheduler Link', 'Data Path' and 'Link' using regular expressions as follows:
HtmlNode schedulerLink = doc.DocumentNode.SelectSingleNode("//*[text()[contains(.,'" + Regex.Match(body, _keyValuePairs["scheduler"]).Value.ToString() + "')]]");
HtmlNode dataPath = doc.DocumentNode.SelectSingleNode("//*[text()[contains(.,'" + Regex.Match(body, _keyValuePairs["datapath"]).Value.ToString() + "')]]");
HtmlNode link = doc.DocumentNode.SelectSingleNode("//*[text()[contains(.,'" + Regex.Match(body, _keyValuePairs["link"]).Value.ToString() + "')]]");
These calls return the respective nodes. The number of links under each of the three labels varies from email to email, and so does the order of the labels. I need to capture the links under each label in a list. I am using the following code:
foreach (HtmlNode link in schedulerLink.Descendants())
{
string hrefValue = link.GetAttributeValue("href", string.Empty);
if (!(link.InnerText.Contains("\r\n")))
{
if (link.InnerText.Contains("/"))
{
schedulersList.Add(link.InnerText.Trim());
}
}
}
Descendants() sometimes does not return the expected number of nodes. Also, how do I get the links under each of the three labels into three separate lists, given that Descendants() returns all the nodes below?
If I understand correctly, you want to capture the content of the first href attribute after a specific string such as Scheduler link. I don't know HtmlAgilityPack, but my approach would be to search the email body with a regex like this:
Scheduler link(?:\s|\S)*?href="([^"]+)
This regex should capture the content of the first href attribute after every occurrence of "Scheduler link" in the mail.
You can try it here: Regex101
To find the other types of links just replace the Scheduler link part with the respective string.
I hope this is helpful.
Additional info about the regex:
Scheduler link matches the string literally
(?:\s|\S)*?href=" non-capturing group that lazily matches any characters (including newlines) up to the first occurrence of the literal string href="
([^"]+) captures everything up to, but not including, the next " character
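Applied from C#, the pattern above might be used as in the following sketch. The class HrefFinder and the method FirstHrefAfter are hypothetical names for illustration; only System.Text.RegularExpressions from the base class library is used. Note that the pattern assumes double-quoted href attributes and yields only the first link after each occurrence of the label.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class HrefFinder
{
    // Hypothetical helper: returns the first href value that follows
    // each occurrence of the given label in the mail body.
    public static List<string> FirstHrefAfter(string body, string label)
    {
        // Regex.Escape guards against labels that contain regex metacharacters.
        string pattern = Regex.Escape(label) + @"(?:\s|\S)*?href=""([^""]+)";

        var hrefs = new List<string>();
        foreach (Match m in Regex.Matches(body, pattern))
        {
            hrefs.Add(m.Groups[1].Value); // group 1 is the href content
        }
        return hrefs;
    }
}

class Program
{
    static void Main()
    {
        string body =
            "<div>Scheduler link :</div>\r\n" +
            "<div><a href=\"https://something.com/requests/26428\">" +
            "https://something.com/requests/26428</a></div>";

        foreach (string href in HrefFinder.FirstHrefAfter(body, "Scheduler link"))
        {
            Console.WriteLine(href);
        }
    }
}
```

Because only the first href after a label is captured, a label followed by several links (as with Data path in the sample) needs a different approach, such as walking the parsed document.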
Since the hrefs in your question differ by host, one way of doing it is to select the links by a substring of the href:
var html = @"<html> <body> <div align='justify' style='margin:0;'></div> <div align='justify' style='margin:0;'></div> <div align='justify' style='margin:0;'>Scheduler link :</div> <div align='justify' style='margin:0;'></div> <div style='margin:0;'><a href='https://something.com/requests/26428'> https://something.com/requests/26428</a> </div> <div style='margin:0;'></div> <div style='margin:0;'></div> <div style='margin:0;'></div> <div align='justify' style='margin:0;'>Data path :</div> <div align='left' style='text-align:justify;margin:0;'><a href='file:///\\mycompany.com\ABC\OPQ1234\jui\tui245.5t_2rtfg_tyui'> \\mycompany.com\ABC\OPQ1234\jui\tui245.5t_2rtfg_tyui</a> </div> <div align='left' style='text-align:justify;margin:0;'><a href='file:///\\mycompany.com\ABC\OPQ1234\tui245.5t_2rtfg_tyui'> \\mycompany.com\ABC\OPQ1234\tui245.5t_2rtfg_tyui</a> </div> <div align='justify' style='margin:0;'></div> <div align='justify' style='margin:0;'>Link :</div> <div align='justify' style='margin:0;'><a href='https://Thisisanotherlink.abcdef/sites/this/498592/rkjfb/3874y'> This is some text</a></div> <div align='justify' style='margin:0 0 5pt 0;'>This is another text</div> </body></html>";
var document = new HtmlDocument();
document.LoadHtml(html);
var schedulerNodes = document.DocumentNode.SelectNodes("//a[contains(@href, \"something\")]");
var dataPathNodes = document.DocumentNode.SelectNodes("//a[contains(@href, \"mycompany\")]");
var linkNodes = document.DocumentNode.SelectNodes("//a[contains(@href, \"Thisisanotherlink\")]");
foreach (var item in schedulerNodes)
{
Debug.WriteLine(item.GetAttributeValue("href", ""));
Debug.WriteLine(item.InnerText);
}
foreach (var item in dataPathNodes)
{
Debug.WriteLine(item.GetAttributeValue("href", ""));
Debug.WriteLine(item.InnerText);
}
foreach (var item in linkNodes)
{
Debug.WriteLine(item.GetAttributeValue("href", ""));
Debug.WriteLine(item.InnerText);
}
Hope that helps !!
EDIT ::
var result = document.DocumentNode.SelectNodes("//div//text()[normalize-space()] | //a");
// select all textnodes and a tags
string sch = "Scheduler link :";
string dataLink = "Data path :";
string linkpath = "Link :";
foreach (var item in result)
{
if (item.InnerText.Trim().Contains(sch))
{
var processResult = result.SkipWhile(x => !x.InnerText.Trim().Equals(sch)).Skip(1);
// skip the results until we reach the Scheduler label.
Debug.WriteLine("====================Scheduler link=========================");
foreach (var subitem in processResult)
{
Debug.WriteLine(subitem.GetAttributeValue("href", ""));
// if href then add to list TODO
if (subitem.InnerText.Contains(dataLink)) // break when data link appears.
{
break;
}
}
}
if (item.InnerText.Trim().Contains(dataLink))
{
var processResult = result.SkipWhile(x => !x.InnerText.Trim().Equals(dataLink)).Skip(1);
Debug.WriteLine("====================Data link=========================");
foreach (var subitem in processResult)
{
Debug.WriteLine(subitem.GetAttributeValue("href", ""));
if (subitem.InnerText.Contains(linkpath)) // break when the "Link :" label appears.
{
break;
}
}
}
if (item.InnerText.Trim().Contains("Link :"))
{
var processResult = result.SkipWhile(x => !x.InnerText.Trim().Equals(linkpath)).Skip(1);
Debug.WriteLine("====================Link=========================");
foreach (var subitem in processResult)
{
var hrefValue = subitem.GetAttributeValue("href", "");
Debug.WriteLine(hrefValue);
if (subitem.InnerText.Contains(dataLink))
{
break;
}
}
}
}
I have explained the logic in the code comments.
Hope that helps
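Since the question says the order of the labels varies, the same idea can be written without assuming which label comes next: walk the document once in document order, remember the most recent label seen, and file each href under it. This is only a sketch under stated assumptions. LinkGrouper and GroupLinksByLabel are hypothetical names, labels are matched against the trimmed text of a text node (case-insensitively), and any link that appears after the last label is attributed to that label, so unrelated trailing links would need extra filtering.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

static class LinkGrouper
{
    // Hypothetical helper: groups href values by the label that precedes them.
    public static Dictionary<string, List<string>> GroupLinksByLabel(string html, IEnumerable<string> labels)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var groups = new Dictionary<string, List<string>>(StringComparer.OrdinalIgnoreCase);
        foreach (string label in labels)
        {
            groups[label] = new List<string>();
        }

        string current = null; // label of the section we are currently in

        // Descendants() yields nodes in document order.
        foreach (HtmlNode node in doc.DocumentNode.Descendants())
        {
            if (node.NodeType == HtmlNodeType.Text)
            {
                string text = node.InnerText.Trim();
                if (groups.ContainsKey(text))
                {
                    current = text; // a new section starts here
                }
            }
            else if (node.Name == "a" && current != null)
            {
                string href = node.GetAttributeValue("href", "");
                if (href.Length > 0)
                {
                    groups[current].Add(href);
                }
            }
        }
        return groups;
    }
}

class Program
{
    static void Main()
    {
        string html =
            "<div>Data path :</div>" +
            "<div><a href='file:///one'>one</a></div>" +
            "<div><a href='file:///two'>two</a></div>" +
            "<div>Scheduler link :</div>" +
            "<div><a href='https://something.com/requests/26428'>request</a></div>";

        var groups = LinkGrouper.GroupLinksByLabel(html, new[] { "Scheduler link :", "Data path :", "Link :" });
        foreach (var pair in groups)
        {
            Console.WriteLine(pair.Key + " " + string.Join(", ", pair.Value));
        }
    }
}
```

A single pass also avoids re-scanning the node list with SkipWhile for every label.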
I'm working on a page that loads dynamically: more data is added as you scroll. To read the properties of an item I have located its parent div, and to get the address I need an XPath from that parent div down to the span elements.
Below is my DOM structure:
<div class = "parentdiv">
<div class = "search">
<div class="header">
<div class="data"></div>
<div class="address-data">
<div class="address" itemprop="address">
<a itemprop="url" href="/search/Los-Angeles-CA-90025">
<span itemprop="streetAddress">
Avenue
</span>
<br>
<span itemprop="Locality">Los Angeles</span>
<span itemprop="Region">CA</span>
</a>
</div>
</div>
</div>
</div>
</div>
</div>
Here I want to locate the three spans, starting from the parent div where I currently am.
Can someone explain how to locate an element using XPath relative to a particular div?
You can try the following XPaths. The spans sit several levels below parentdiv, so the paths use // to skip the intermediate divs.
To locate the street address:
//div[@class="parentdiv"]//a/span[@itemprop="streetAddress"]
To locate the locality/city:
//div[@class="parentdiv"]//a/span[@itemprop="Locality"]
To locate the state:
//div[@class="parentdiv"]//a/span[@itemprop="Region"]
To print the text of the <span> elements (such as Avenue) under the div with class parentdiv, you can use the following block of code:
IList<IWebElement> myList = Driver.FindElements(By.CssSelector("div.parentdiv div.address > a[itemprop=url] > span"));
foreach (IWebElement element in myList)
{
string my_add = element.GetAttribute("innerHTML");
Console.WriteLine(my_add);
}
Your DOM might become fairly large, since it adds elements while scrolling, so using CSS selectors might be quicker.
To get all the span tags in the div, use:
div[class='address'] span
To get a specific span by using the itemprop attribute use:
div[class='address'] span[itemprop='streetAddress']
div[class='address'] span[itemprop='Locality']
div[class='address'] span[itemprop='Region']
You can store the elements in a variable like so:
var streetAddress = driver.FindElement(By.CssSelector("div[class='address'] span[itemprop='streetAddress']"));
var locality = driver.FindElement(By.CssSelector("div[class='address'] span[itemprop='Locality']"));
var region = driver.FindElement(By.CssSelector("div[class='address'] span[itemprop='Region']"));
I have tried many approaches but could not solve my problem. I don't know how to write the path to my node for the SelectSingleNode method. I built an XPath to reach the node in my C# code, but when I run it I get a NullReferenceException because of that path. How should I build the path, or is there another solution?
This is an example of the HTML:
<html>
<body>
<div id="container">
<div id="box">
<div class="box">
<div class="boxContent">
<div class="userBox">
<div class="userBoxContent">
<div class="userBoxElement">
<ul id ="namePart">
<li>
<span class ="namePartContent">
</span>
</li>
</ul>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</body>
</html>
And this is my C# code:
namespace AgilityTrial
{
class Program
{
static void Main(string[] args)
{
Uri url = new Uri("https://....");
WebClient client = new WebClient();
client.Encoding = Encoding.UTF8;
string html = client.DownloadString(url);
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(html);
string path = @"//html/body/div[@id='container']/div[@id='classifiedDetail']"+
"/div[@class='classifiedDetail']/div[@class='classifiedDetailContent']"+
"/div[@class='classifiedOtherBoxes']/div[@class='classifiedUserBox']"+
"/div[@class='classifiedUserContent']/ul[@id='phoneInfoPart']/li"+
"/span[@class='pretty-phone-part show-part']";
var tds = doc.DocumentNode.SelectSingleNode(path);
var date = tds.InnerHtml;
Console.WriteLine(date);
}
}
}
Take your namePartContent span node as an example. To fetch its data you can simply do this:
doc.DocumentNode.SelectSingleNode(".//span[@class='namePartContent']")?.InnerText;
This searches for a single span node with namePartContent as its class, beginning at the root node (in your case <html>). The ?. operator yields null instead of throwing when no node matches.