How to use HtmlAgilityPack to get this two content? - c#

Html code:
<div>
<div>Name</div>
Date
</div>
How to use HtmlAgilityPack to get the Name and Date values?

HtmlDocument doc = new HtmlDocument();
doc.Load("file.htm");
foreach(HtmlNode div in doc.DocumentElement.SelectNodes("//div"])
{
string parent = div.InnerText; //this will give you Name
foreach (HtmlNode child in div.ChildNodes)
{
string childDiv = child.InnerText; //this will give you the child
}
}

var doc = new HtmlDocument();
doc.LoadHtml("<div><div>Name</div>Date</div>");
var nodes = doc.DocumentNode
.DescendantNodes()
.Where(n => n.NodeType == HtmlNodeType.Text);
foreach (var node in nodes)
{
var value = node.InnerText;
}

Hard-coded:
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml("<div><div>Name</div>Date</div>");
Console.WriteLine(((HtmlTextNode)doc.DocumentNode.ChildNodes[0].ChildNodes[0]).Text);
Console.WriteLine(((HtmlTextNode)doc.DocumentNode.ChildNodes[1]).Text);
If what you're looking for, is all the text nodes (as in Kevin Babcock's answer!). You have to remember that text in Xml (and with HtmlAgility) are nodes.

Related

Get innerText from <div class> with an <a href> child

I am working with a webBrowser in C# and I need to get the text from the link. The link is just a href without a class.
its like this
<div class="class1" title="myfirstClass">
<a href="link.php">text I want read in C#
<span class="order-level"></span>
Shouldn't it be something like this?
HtmlElementCollection theElementCollection = default(HtmlElementCollection);
theElementCollection = webBrowser1.Document.GetElementsByTagName("div");
foreach (HtmlElement curElement in theElementCollection)
{
if (curElement.GetAttribute("className").ToString() == "class1")
{
HtmlElementCollection childDivs = curElement.Children.GetElementsByName("a");
foreach (HtmlElement childElement in childDivs)
{
MessageBox.Show(childElement.InnerText);
}
}
}
This is how you get the element by tag name:
String elem = webBrowser1.Document.GetElementsByTagName("div");
And with this you should extract the value of the href:
var hrefLink = XElement.Parse(elem)
.Descendants("a")
.Select(x => x.Attribute("href").Value)
.FirstOrDefault();
If you have more then 1 "a" tag in it, you could also put in a foreach loop if that is what you want.
EDIT:
With XElement:
You can get the content including the outer node by calling element.ToString().
If you want to exclude the outer tag, you can call String.Concat(element.Nodes()).
To get innerHTML with HtmlAgilityPack:
Install HtmlAgilityPack from NuGet.
Use this code.
HtmlWeb web = new HtmlWeb();
HtmlDocument dc = web.Load("Your_Url");
var s = dc.DocumentNode.SelectSingleNode("//a[#name="a"]").InnerHtml;
I hope it helps!
Here I created the console app to extract the text of anchor.
static void Main(string[] args)
{
string input = "<div class=\"class1\" title=\"myfirstClass\"><a href=\"link.php\">text I want read in C#<span class=\"order-level\"></span>";
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(input);
foreach (HtmlNode item in doc.DocumentNode.Descendants("div"))
{
var link = item.Descendants("a").First();
var text = link.InnerText.Trim();
Console.Write(text);
}
Console.ReadKey();
}
Note this is htmlagilitypack question so tag the question properly.

Adding info to a xml file

I have a XML file which contains about 850 XML nodes. Like this:
<NameValueItem>
<Text>Test</Text>
<Code>Test</Code>
</NameValueItem>
........ 849 more
And I want to add a new Childnode inside each and every Node. So I end up like this:
<NameValueItem>
<Text>Test</Text>
<Code>Test</Code>
<Description>TestDescription</Description>
</NameValueItem>
........ 849 more
I've tried the following:
XmlDocument doc = new XmlDocument();
doc.Load(xmlPath);
XmlNodeList nodes = doc.GetElementsByTagName("NameValueItem");
Which gives me all of the nodes, but from here am stuck(guess I need to iterate over all of the nodes and append to each and every) Any examples?
You need something along the lines of this example below. On each of your nodes, you need to create a new element to add to it. I assume you will be getting different values for the InnerText property, but I just used your example.
foreach (var rootNode in nodes)
{
XmlElement element = doc.CreateElement("Description");
element.InnerText = "TestDescription";
root.AppendChild(element);
}
You should just be able to use a foreach loop over your XmlNodeList and insert the node into each XmlNode:
foreach(XmlNode node in nodes)
{
node.AppendChild(new XmlNode()
{
Name = "Description",
Value = [value to insert]
});
}
This can also be done with XDocument using LINQ to XML as such:
XDocument doc = XDocument.Load(xmlDoc);
var updated = doc.Elements("NameValueItem").Select(n => n.Add(new XElement() { Name = "Description", Value = [newvalue]}));
doc.ReplaceWith(updated);
If you don't want to parse XML using proper classes (i.e. XDocument), you can use Regex to find a place to insert your tag and insert it:
string s = #"<NameValueItem>
<Text>Test</Text>
<Code>Test</Code>
</NameValueItem>";
string newTag = "<Description>TestDescription</Description>";
string result = Regex.Replace(s, #"(?<=</Code>)", Environment.NewLine + newTag);
but the best solution is Linq2XML (it's much better, than simple XmlDocument, that is deprecated at now).
string s = #"<root>
<NameValueItem>
<Text>Test</Text>
<Code>Test</Code>
</NameValueItem>
<NameValueItem>
<Text>Test2</Text>
<Code>Test2</Code>
</NameValueItem>
</root>";
var doc = XDocument.Load(new StringReader(s));
var elms = doc.Descendants("NameValueItem");
foreach (var element in elms)
{
element.Add(new XElement("Description", "TestDescription"));
}
var text = new StringWriter();
doc.Save(text);
Console.WriteLine(text);

HtmlAgilityPack HtmlNodeCollection returning NULL , shouldn't

I made a simple program for fetching youtube users in comments.
This is the code
string html;
using (var client = new WebClient())
{
html = client.DownloadString("http://www.youtube.com/watch?v=ER5EnjskCvE");
}
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(html);
List<string> data = new List<string>();
HtmlNodeCollection nodeCollection = doc.DocumentNode.SelectNodes("//*[#id='comments-view']/ul[1]/li[1]/a/span/span/span/span/img");
foreach (HtmlNode node in nodeCollection)
{
data.Add(node.GetAttributeValue("alt",null));
}
But i have a problem that my nodeCollection is returning null.
For the XPath i used copy XPath option in chrome under F12
try this replace "*" , "div"
"/html/body//div[#id='comments-view']/ul[1]/li[1]/a/span/span/span/span/img"

How to extract html links from html file in C#?

Can anyone help me by explaining how to extract urls/links from HTML File in C#
look at Html Agility Pack
HtmlDocument doc = new HtmlDocument();
doc.Load("file.htm");
foreach(HtmlNode link in doc.DocumentNode.SelectNodes("//a[#href]"))
{
HtmlAttribute att = link.Attributes["href"];
yourList.Add(att.Value)
}
doc.Save("file.htm");
Use HTMLAgility Pack...
private List<string> ParseLinks(string html)
{
var doc = new HtmlDocument();
doc.LoadHtml(html);
var nodes = doc.DocumentNode.SelectNodes("//a[#href]");
return nodes == null ? new List<string>() : nodes.ToList().ConvertAll(r => r.Attributes.ToList().ConvertAll(i => i.Value)).SelectMany(j => j).ToList();
}
It works for me.
You can use an HTQL COM object and query the page using query:
<a>:href
HTQLCOMLib.HtqlControl h = new HTQLCOMLib.HtqlControl();
string page = "<html><body><a href='test1.html'>test1</a><a href='test2.html'>test2</a> </body></html>";
h.setSourceData(page, page.Length);
h.setQuery("<a>: href ");
for (h.moveFirst(); 0 == h.isEOF(); h.moveNext() )
{
MessageBox.Show(h.getValueByIndex(1));
}
It will show messages of:
test1.html
test2.html

HtmlAgilityPack selecting childNodes not as expected

I am attempting to use the HtmlAgilityPack library to parse some links in a page, but I am not seeing the results I would expect from the methods. In the following I have a HtmlNodeCollection of links. For each link I want to check if there is an image node and then parse its attributes but the SelectNodes and SelectSingleNode methods of linkNode seems to be searching the parent document not the childNodes of linkNode. What gives?
HtmlDocument htmldoc = new HtmlDocument();
htmldoc.LoadHtml(content);
HtmlNodeCollection linkNodes = htmldoc.DocumentNode.SelectNodes("//a[#href]");
foreach(HtmlNode linkNode in linkNodes)
{
string linkTitle = linkNode.GetAttributeValue("title", string.Empty);
if (linkTitle == string.Empty)
{
HtmlNode imageNode = linkNode.SelectSingleNode("/img[#alt]");
}
}
Is there any other way I could get the alt attribute of the image childnode of linkNode if it exists?
You should remove the forwardslash prefix from "/img[#alt]" as it signifies that you want to start at the root of the document.
HtmlNode imageNode = linkNode.SelectSingleNode("img[#alt]");
With an xpath query you can also use "." to indicate the search should start at the current node.
HtmlNode imageNode = linkNode.SelectSingleNode(".//img[#alt]");
Also, watch out for null checks; SelectNodes returns null instead of blank collection.
HtmlNodeCollection linkNodes = htmldoc.DocumentNode.SelectNodes("//a[#href]");
**if(linkNodes!=null)**
{
foreach(HtmlNode linkNode in linkNodes)
{
string linkTitle = linkNode.GetAttributeValue("title", string.Empty);
if (linkTitle == string.Empty)
{
**HtmlNode imageNode = linkNode.SelectSingleNode("img[#alt]");**
}
}
}

Categories

Resources