I am trying to parse html page and I am facing a problem which is that I want to get the inner text of a node segmented i.e iterate on html node children assuming each text segment as a in child:
<node1>
This text I WANT on iterate#1
<innernode>This text I WANT on iterate#2</innernode>
This text I WANT on iterate#3
<innernode>This text I WANT on iterate#4</innernode>
This text I WANT on iterate#5
</node1>
I am using htmlagilitypack as a parser but I think that I will face this problem with any other html parser
Depending on your .NET version, you could use an extension method that works on the node you want.
I havent used the html agility pack, so this is a mix of C# and psuedo-code.
eg
public static List<string> GetTextSegments(this HtmlNode node)
{
string nodesText = ... // get the nodes text
yield nodesText;
List<HtmlNode> innerNodes = ... // get the list of inner nodes with a
// query like node.SelectNodes("//innerNodes")
foreach(HtmlNode iNode in innerNodes)
{
string iNodeText = ... // get iNodes text
yield iNodeText;
}
}
You could then call this like so:
HtmlNode nodeOfTypeNode1 = ... //
foreach(string text : nodeOfTypeNode1.getTextSegments())
{
Console.WriteLine(text);
}
To get your goal, use SelectNodes with XPath.
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(content);//content is the variable containing your html.
var items = doc.DocumentNode.SelectNodes("/node1//text()");
foreach (var item in items)
{
Console.WriteLine(item.OuterHtml.Replace("\r\n",""));
}
Related
I have multiple links in page of structure like this:
<a ....>
<b>Text I Need</b>
Also Text I need
</a>
And i want to extract string for example from code above "Text I NeedAlso Text I need"
I successfully extract second part, but I'm not sure how to select text inside b tags as well, currently I'm using this:
var link_list = doc.DocumentNode.SelectNodes(#"/a/text()");
foreach (var link in link_list)
{
Console.WriteLine(link.InnerText);
}
Should i perhaps instead get not text but html of a and remove tags with regex and extract text then, or is there some other ways?
Accessing InnerText property of <a> should give you all text nodes at once :
var html = #"<a ....>
<b>Text I Need</b>
Also Text I need
</a>";
var doc = new HtmlDocument();
doc.LoadHtml(html);
var link_list = doc.DocumentNode.SelectNodes("/a");
foreach (var link in link_list)
{
Console.WriteLine(link.InnerText);
}
or if you really need to get only direct child text nodes and grand child text nodes, try this way :
var link_list = doc.DocumentNode.SelectNodes("/a");
foreach (var link in link_list)
{
var texts = link.SelectNodes("text() | */text()");
Console.WriteLine(String.Join("", texts.Select(o => o.InnerText)));
}
output :
Text I Need
Also Text I need
How to parse complete HTML web page not specific nodes using HTML Agility Pack or any other technique?
I am using this code, but this code only parse specific node, but I need complete page to parse with neat and clear contents
List<string> list = new List<string>();
string url = "https://www.google.com";
HtmlWeb web = new HtmlWeb();
HtmlAgilityPack.HtmlDocument doc = web.Load(url);
foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//a"))
{
list.Add(node.InnerText);
}
To get all descendant text nodes use something like
var textNodes = doc.DocumentNode.SelectNodes("//text()").
Select(t=>t.InnerText);
To get all non empty descendant text nodes
var textNodes = doc.DocumentNode.
SelectNodes("//text()[normalize-space()]").
Select(t=>t.InnerText);
Do SelectNodes("*") . '*' (asterisk) Is the wild card selector and will get every node on the page.
I have a variable in my program that contains HTML data as a string. The variable, htmlText, contains something like the following:
<ul><li><u>Mode selector </u></li><li><u>LAND ALT</u></li>
I'd like to iterate through this HTML, using the HtmlAgilityPack, but every example I see tries to load the HTML as a document. I already have the HTML that I want to parse within the variable htmlText. Can someone show me how to parse this, without loading it as a document?
The example I'm looking at right now looks like this:
static void Main(string[] args)
{
var web = new HtmlWeb();
var doc = web.Load("http://www.stackoverflow.com");
var nodes = doc.DocumentNode.SelectNodes("//a[#href]");
foreach (var node in nodes)
{
Console.WriteLine(node.InnerHtml);
}
}
I want to convert this to use my htmlText and find all underline elements within. I just don't want to load this as a document since I already have the HTML that I want to parse stored in a variable.
You can use the LoadHtml method of HtmlDocument class
Document is simply a name, it's not really a document (or doesn't have to be).
var doc = New HtmlAgilityPack.HtmlDocument;
string myHTML = "<ul><li><u>Mode selector </u></li><li><u>LAND ALT</u></li>";
doc.LoadHtml(myHTML);
foreach (var node in doc.DocumentNode.SelectNodes("//a[#href]")) {
Console.WriteLine(node.InnerHtml);
}
I've used this exact same thing to parse html chunks in variables.
I am trying to get some text from an external site. The text I am trying to get is nested in a paragraph tag. The div has has a class value
html code snippet:
<div class="discription"><p>this is the text I want to grab</p></div>
current c# code:
public String getDiscription(string url)
{
var web = new HtmlWeb();
var doc = web.Load(url);
var nodes = doc.DocumentNode.SelectNodes("//div[#class='discription']");
if (nodes != null)
{
foreach (var node in nodes)
{
string Description = node.InnerHtml;
return Description;
}
} else
{
string error = "could not find text";
return error;
}
}
what I dont understand is the syntax of the xpath //div[#class='discription'] I know it is wrong what should the xpath be?
use //div[#class='discription']/p.
Breakdown:
//div - All div elements
[#class='discription'] - With a class attribute whose value is discription
/p - Select the child p elements
How can I select every paragraph in a div tag for example.
<div id="body_text">
<p>Hi</p>
<p>Help Me Please</P>
<p>Thankyou</P>
I have got Html Agility downloaded and referenced in my program, All I need is the paragraphs. There may be a variable number of paragraphs and there are loads of different div tags but I only need the content within the body_text. Then I assume this can be stored as a string which I then want to write to a .txt file for later reference. Thankyou.
The valid XPATH for your case is //div[#id='body_text']/p
foreach(HtmlNode node in yourHTMLAgilityPackDocument.DocumentNode.SelectNodes("//div[#id='body_text']/p")
{
string text = node.InnerText; //that's the text you are looking for
}
Here's a solution that grabs the paragraphs as an enumeration of HtmlNodes:
HtmlDocument doc = new HtmlDocument();
doc.Load("your.html");
var div = doc.GetElementbyId("body_text");
var paragraphs = div.ChildNodes.Where(item => item.Name == "p");
Without explicit Linq:
var paragraphs = doc.GetElementbyId("body_text").Elements("p");