Given the snippet of HTML and code below, if you know part of the src (e.g. 'FileName'), how do you get the post ID of the parent div? The div could be higher up the DOM tree, and there could be 0, 1 or many src's with the same 'FileName'.
I'm after "postId_19701770"
I've attempted to follow this page and this page, but I get error CS1061: 'HtmlNodeCollection' does not contain a definition for 'ParentNode'.
namespace GetParent
{
class Program
{
static void Main(string[] args)
{
var html =
@"<body>
<div id='postId_19701770' class='b-post'>
<h1>This is <b>bold</b> heading</h1>
<p>This is <u>underlined</u> paragraph <div src='example.com/FileName_720p.mp4' </div></p>
</div>
</body>";
var htmlDoc = new HtmlDocument();
htmlDoc.LoadHtml(html);
string keyword = "FileName";
var node = htmlDoc.DocumentNode.SelectNodes("//*[text()[contains(., '" + keyword + "')]]");
var parentNode = node.ParentNode;
Console.WriteLine(parentNode.Name);
Console.ReadLine();
}
}
}
Your code is not working because you are looking up the ParentNode of a collection of nodes. You need to select a single node and then look up its parent.
You can also search for all nodes whose src contains the data you are looking for. Once you have the collection, you can inspect each node to find the one you need, or take the First() one from the collection and get its parent.
var html =
@"<body>
<div id='postId_19701770' class='b-post'>
<h1>This is <b>bold</b> heading</h1>
<p>This is <u>underlined</u> paragraph <div src='example.com/FileName_720p.mp4' </div></p>
</div>
</body>";
var htmlDoc = new HtmlDocument();
htmlDoc.LoadHtml(html);
string keyword = "FileName";
var node = htmlDoc.DocumentNode.SelectNodes("//*[contains(@src, '" + keyword + "')]");
var parent = node.First().ParentNode; //node is a collection so get the first node for ex.
Console.WriteLine(parent.GetAttributeValue("id", string.Empty));
// Prints
postId_19701770
Instead of looking up "all" nodes, you can search specifically for one node via the SelectSingleNode method:
var singleNode = htmlDoc.DocumentNode.SelectSingleNode(@"//*[contains(@src, '" + keyword + "')]");
Console.WriteLine(singleNode.ParentNode.GetAttributeValue("id", string.Empty));
// prints
postId_19701770
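The question also mentions that the post div may sit higher up the DOM tree and that there may be 0, 1 or many matches. A minimal sketch of one way to cover those cases follows. It assumes the wanted div can be recognised by an id prefix such as "postId_"; the names PostLookup, FindPostIds and idPrefix are hypothetical and do not come from the original posts. A keyword containing an apostrophe would break the XPath string and is not handled here.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

static class PostLookup
{
    // For every element whose src contains keyword, walk up the tree and
    // return the id of the nearest ancestor <div> whose id starts with idPrefix.
    // FindPostIds and idPrefix are hypothetical names used only in this sketch.
    public static List<string> FindPostIds(HtmlDocument doc, string keyword, string idPrefix)
    {
        // SelectNodes returns null (not an empty collection) when nothing matches.
        var matches = doc.DocumentNode.SelectNodes("//*[contains(@src, '" + keyword + "')]");
        if (matches == null)
            return new List<string>();

        return matches
            // Ancestors("div") yields the closest <div> first, then moves outward.
            .Select(n => n.Ancestors("div")
                .FirstOrDefault(a => a.GetAttributeValue("id", string.Empty)
                    .StartsWith(idPrefix, StringComparison.Ordinal)))
            .Where(a => a != null)
            .Select(a => a.GetAttributeValue("id", string.Empty))
            .Distinct()
            .ToList();
    }
}
```

Calling PostLookup.FindPostIds(htmlDoc, "FileName", "postId_") gives an empty list when nothing matches and one id per distinct post otherwise.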
Related
(Sorry about my English, I'm Brazilian.)
I'm trying to get the InnerText from an h4 tag using the HtmlAgilityPack. I managed to get that type of value from 3 of the 4 tags I need on the website, but the last one is the most important and it just returns an empty value.
Is it possible that the way the website was built requires a different way to get this value?
This is the specific h4 whose InnerText I'm trying to extract ("356.386.496,02"):
<h4 class="text-black--opacity-60 fs-20 fs-sm-42 fs-lg-40 w-100 mt-3">
<span class="align-middle fs-12 fs-lg-12 pr-4">R$</span>
"356.386.496,02"
</h4>
I've tried this:
HtmlDocument htmlDocument = new HtmlDocument();
htmlDocument.LoadHtml(data);
var nodes = htmlDocument.DocumentNode.SelectNodes("//h4[@class='text-black--opacity-60 fs-20 fs-sm-42 fs-lg-40 w-100 mt-3']");
foreach (var node in nodes)
{
Console.WriteLine(node.InnerText);
}
//Result in console:
//=>
Note that the SelectNodes method doesn't return null; it finds the h4 node perfectly, but the InnerText value is "".
Try replacing "356.386.496,02" with 356.386.496,02 or with ""356.386.496,02"".
This solution should work:
public static void Main()
{
var html =
@"<h4 class=""text-black--opacity-60 fs-20 fs-sm-42 fs-lg-40 w-100 mt-3"">
<span class=""align-middle fs-12 fs-lg-12 pr-4"">R$</span>
""356.386.496,02""
</h4>";
var htmlDoc = new HtmlDocument();
htmlDoc.LoadHtml(html);
var htmlNodes = htmlDoc.DocumentNode.SelectNodes("//h4[@class='text-black--opacity-60 fs-20 fs-sm-42 fs-lg-40 w-100 mt-3']");
foreach (var node in htmlNodes)
{
Console.WriteLine(node.InnerText);
}
}
I have some HTML that I'm parsing using C#.
The sample text is below; it is repeated about 150 times with different records.
<strong>Title</strong>: Mr<br>
<strong>First name</strong>: Fake<br>
<strong>Surname</strong>: Guy<br>
I'm trying to get the text in an array which will be like
customerArray [0,0] = Title
customerArray [0,1] = Mr
customerArray [1,0] = First Name
customerArray [1,1] = Fake
customerArray [2,0] = Surname
customerArray [2,1] = Guy
I can get the text into the array, but I'm having trouble getting the text after the closing STRONG tag up until the BR tag, and then finding the next STRONG tag.
Any help would be appreciated.
You can use the XPath following-sibling::text()[1] to get the text node located directly after each strong. Here is a minimal but complete example:
var raw = @"<div>
<strong>Title</strong>: Mr<br>
<strong>First name</strong>: Fake<br>
<strong>Surname</strong>: Guy<br>
</div>";
var doc = new HtmlDocument();
doc.LoadHtml(raw);
foreach(HtmlNode node in doc.DocumentNode.SelectNodes("//strong"))
{
var val = node.SelectSingleNode("following-sibling::text()[1]");
Console.WriteLine(node.InnerText + ", " + val.InnerText);
}
Output:
Title, : Mr
First name, : Fake
Surname, : Guy
You should be able to remove the ":" by doing simple string manipulation, if needed...
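As a rough sketch of that string manipulation, the value can be trimmed, stripped of its leading colon, and trimmed again. The sample markup is shortened from the question. The null check is an added precaution for a strong tag with no text after it, and the " = " output format is an assumption chosen to resemble the array layout asked for.

```csharp
using System;
using HtmlAgilityPack;

class Program
{
    static void Main()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml("<div><strong>Title</strong>: Mr<br><strong>First name</strong>: Fake<br></div>");

        foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//strong"))
        {
            // First text node that follows the <strong>, e.g. ": Mr"
            var val = node.SelectSingleNode("following-sibling::text()[1]");

            // Drop surrounding whitespace, then the leading ':', then the space after it.
            string value = val == null
                ? string.Empty
                : val.InnerText.Trim().TrimStart(':').Trim();

            Console.WriteLine(node.InnerText + " = " + value);
        }
    }
}
```

The label (node.InnerText) and the cleaned value can then be stored in the two columns of customerArray instead of being printed.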
<strong> is a common tag, so here is something more specific to the sample format you provided:
var html = @"
<div>
<strong>First name</strong><em>italic</em>: Fake<br>
<strong>Bold</strong> <a href='#'>hyperlink</a><br>.
<strong>bold</strong>
<strong>bold</strong> <br>
text
</div>
<div>
<strong>Title</strong>: Mr<BR>
<strong>First name</strong>: Fake<br>
<strong>Surname</strong>: Guy<br>
</div>";
var document = new HtmlDocument();
document.LoadHtml(html);
// 1. <strong>
var strong = document.DocumentNode.SelectNodes("//strong");
if (strong != null)
{
foreach (var node in strong.Where(
// 2. followed by non-empty text node
x => x.NextSibling is HtmlTextNode
&& !string.IsNullOrEmpty(x.NextSibling.InnerText.Trim())
// 3. followed by <br>
&& x.NextSibling.NextSibling is HtmlNode
&& x.NextSibling.NextSibling.Name.ToLower() == "br"))
{
Console.WriteLine("{0} {1}", node.InnerText, node.NextSibling.InnerText);
}
}
I'm building an app that crawls OkCupid matches. Their match result contains Html that looks like this.
<div id="match_results">
<div>person1</div>
<div>person2</div>
<div>person3</div>
</div>
I want to loop over each person's div inside the match_results div. However, something's not quite right with my C# code: matchesList only contains one element (the outer div itself, and not the divs inside it).
HtmlDocument matchesHtmlDoc = new HtmlDocument();
matchesHtmlDoc.LoadHtml(matches);
string matchResultDivId = "match_results";
// match results
HtmlNodeCollection matchesList = matchesHtmlDoc.DocumentNode.SelectNodes("//div[@id = '" + matchResultDivId + "']");
foreach (HtmlNode match in matchesList)
{
//test
Console.WriteLine(match.ToString());
}
You forgot to select child divs:
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(matches);
string matchResultDivId = "match_results";
string xpath = String.Format("//div[@id='{0}']/div", matchResultDivId);
var people = doc.DocumentNode.SelectNodes(xpath).Select(p => p.InnerText);
foreach(var person in people)
Console.WriteLine(person);
Output:
person1
person2
person3
In HtmlAgilityPack, I want to create an HtmlTextNode, which is an HtmlNode (it inherits from HtmlNode) that has a custom InnerText.
HtmlTextNode CreateHtmlTextNode(string name, string text)
{
HtmlDocument doc = new HtmlDocument();
HtmlTextNode textNode = doc.CreateTextNode(text);
textNode.Name = name;
return textNode;
}
The problem is that the textNode.OuterHtml and textNode.InnerHtml will be equal to "text" after the method above.
e.g. CreateHtmlTextNode("title", "blabla") will generate:
textNode.OuterHtml = "blabla" instead of <Title>blabla</Title>
Is there any better way to create HtmlTextNode?
The following lines create an element whose outer HTML includes its content:
var doc = new HtmlDocument();
// create html document
var html = HtmlNode.CreateNode("<html><head></head><body></body></html>");
doc.DocumentNode.AppendChild(html);
// select the <head>
var head = doc.DocumentNode.SelectSingleNode("/html/head");
// create a <title> element
var title = HtmlNode.CreateNode("<title>Hello world</title>");
// append <title> to <head>
head.AppendChild(title);
// returns Hello world
var inner = title.InnerHtml;
// returns <title>Hello world</title>
var outer = title.OuterHtml;
Hope it helps.
An HtmlTextNode contains just text, no tags.
It's like the following:
<div> - HTML Node
<span>text</span> - HTML Node
This is the Text Node - Text Node
<span>text</span> - HTML Node
</div>
You're looking for a standard HtmlNode.
HtmlDocument doc = new HtmlDocument();
HtmlNode textNode = doc.CreateElement("title");
textNode.InnerHtml = HtmlDocument.HtmlEncode(text);
Be sure to call HtmlDocument.HtmlEncode() on the text you're adding. That ensures that special characters are properly encoded.
I want to get some text from two div elements in an HTML file.
After some searching, I decided to use HtmlAgilityPack for this.
I wrote this code:
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(result);
HtmlNode node = doc.DocumentNode.SelectSingleNode("//*div[@class='item']");
string value = node.InnerText;
'result' is the content of my file.
But I get this exception: 'Expression must evaluate to a node-set'
And this is some of my file's content:
<div class="Clear" style="height:15px;"></div>
<div class='Container Select' id="Container_1">
<div class='Item'><div class='Part Lable'>موضوع : </div><div class='Part ...
Try either
"//*/div[@class='item']"
or simply
"//div[@class='item']"
Have you tried using XPath?
For example, if I wanted to find whether a node is selected in my example, I would do the following:
string xpath = null;
XmlNode configNode = configDom.DocumentElement;
// collect selected nodes in node list
XmlNodeList nodeList =
configNode.SelectNodes(@"//*[@status='checked']");
In your case you would do the following:
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(result);
HtmlNode node = doc.DocumentNode.SelectSingleNode("//*/div[@class='item']");
string value = node.InnerText;
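One more thing worth checking once the expression parses: XPath string comparisons are case-sensitive, and the sample markup uses class='Item' with a capital I, so a test against 'item' may match nothing. SelectSingleNode then returns null and node.InnerText throws. A minimal sketch of a case-insensitive match using the XPath 1.0 translate() function, with a null check, follows. The shortened markup and the upper and lower constants are assumptions for illustration.

```csharp
using System;
using HtmlAgilityPack;

class Program
{
    static void Main()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml("<div class='Container Select'><div class='Item'>text</div></div>");

        // XPath 1.0 has no lower-case function, so translate() maps A-Z to a-z.
        const string upper = "ABCDEFGHIJKLMNOPQRSTUVWXYZ";
        const string lower = "abcdefghijklmnopqrstuvwxyz";
        string xpath = "//div[translate(@class, '" + upper + "', '" + lower + "') = 'item']";

        HtmlNode node = doc.DocumentNode.SelectSingleNode(xpath);

        // SelectSingleNode returns null when nothing matches, so guard before use.
        string value = node == null ? string.Empty : node.InnerText;
        Console.WriteLine(value);
    }
}
```

If the casing of the class value is known and stable, writing it exactly ("//div[@class='Item']") is simpler than translating it.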