Getting text from elements without id or class name - c#

I am trying to parse HTML code using Html Agility Pack. Is there any tutorial available, or can someone tell me how I can get the text from a <td> that has no id and no class?
<table id="results-table">
<tr class="row1">
<td>Diode Zener Single 12V 5% 1W 2-Pin DO-41 Bulk</td>
...
Each row contains 10 different <td>. Thanks!

You can try using this XPATH to query all the tds within your table having id="results-table"
//table[@id='results-table']/tr/td
Firepath for Firefox can help you in formulating XPATH and you can manipulate it from there.
Sample code below
HtmlDocument doc = new HtmlDocument();
var fileName = @"..\..\..\docs\10960189.htm";
doc.Load(fileName);
var nodes = doc.DocumentNode.SelectNodes("//table[@id='results-table']/tr/td");
// SelectNodes returns null (not an empty collection) when nothing matches
if (nodes != null)
{
    foreach (var node in nodes)
    {
        Debug.WriteLine(node.InnerText);
    }
}
HTH

Here is a link that explains how to use XPath:
http://www.w3schools.com/xpath/

I guess some of your td tags will have a class or id. Use the following code, which I wrote in LINQPad:
void Main()
{
    var webGet = new HtmlAgilityPack.HtmlDocument();
    // Web page/string that needs to be parsed
    webGet.LoadHtml("<table id='results-table'>" +
        "<tr class='row1'>" +
        "<td class='testclass'>test td with class</td>" +
        "<td id='testid'>test td with id</td>" +
        "<td>Diode Zener Single 12V 5% 1W 2-Pin DO-41 Bulk</td>" +
        "<td>test td without class or id</td>" +
        "</tr></table>"
    );
    // Keep only non-empty <td> elements that have neither a class nor an id attribute
    var tableOnPage = (from tds in webGet.DocumentNode.Descendants()
                       where tds.Name == "td" &&
                             tds.Attributes["class"] == null && tds.Attributes["id"] == null &&
                             tds.ParentNode.InnerText.Trim().Length > 0 && tds.InnerText.Trim().Length > 0
                       select new
                       {
                           // InnerText also works when a cell contains nested markup
                           td = tds.InnerText.Trim(),
                       });
    // Loop through each item
    foreach (var item in tableOnPage)
    {
        Console.WriteLine(item.td);
    }
}
Output will be
Diode Zener Single 12V 5% 1W 2-Pin DO-41 Bulk
test td without class or id
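The same filter can also be written directly in XPath with not(@id) and not(@class). The following is only a sketch under assumptions: the class name PlainCellsExample, the method GetPlainCells and the sample markup are made up for illustration, and the expression is scoped to the results-table from the question.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class PlainCellsExample
{
    // Sample markup modelled on the question; the extra cells are illustrative only.
    public const string Html =
        "<table id='results-table'><tr class='row1'>" +
        "<td class='testclass'>test td with class</td>" +
        "<td id='testid'>test td with id</td>" +
        "<td>Diode Zener Single 12V 5% 1W 2-Pin DO-41 Bulk</td>" +
        "<td>test td without class or id</td>" +
        "</tr></table>";

    // Returns the trimmed text of every <td> that has neither an id nor a class attribute.
    public static List<string> GetPlainCells(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var cells = doc.DocumentNode.SelectNodes(
            "//table[@id='results-table']//td[not(@id) and not(@class)]");

        // SelectNodes returns null when nothing matches.
        if (cells == null)
        {
            return new List<string>();
        }
        return cells.Select(td => td.InnerText.Trim()).ToList();
    }

    public static void Main()
    {
        foreach (var text in GetPlainCells(Html))
        {
            Console.WriteLine(text);
        }
    }
}
```

For the sample markup this should print the same two lines as the LINQ version above.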

Related

Parsing RSS feed from XML document

I'm trying to read an RSS feed, but I can't get it to work. I'm trying to get the content of the td tags, but the code always throws a NullReferenceException while parsing the table rows. Any help is appreciated.
Code:
public void readRss()
{
string Url = "mylink.com";
HtmlWeb web = new HtmlWeb();
HtmlDocument doc = web.Load(Url);
var table = doc.DocumentNode.SelectSingleNode("//table");
var rows = table.SelectNodes("//tr");
if (rows != null && rows.Count > 0)
{
foreach (var row in rows)
{
var cells = row.SelectNodes("//td");
//do stuff
}
}
}
XML file is formatted like this:
<![CDATA[<table>
<tr>
<td>Name</td>
<td>LastName</td>
<td>Age</td>
<tr>
</table>
]]>
Does web.Load(Url) actually return XML like the example? If it does, selecting nodes inside the CDATA section will simply not work. The content of a CDATA section is treated as text only, and none of it becomes part of the document node tree. Therefore, your first SelectSingleNode("//table") will always give you a null result.
BTW: you should test for null after setting the doc and table variables, just as you do for rows. Both of these can be null.
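One possible workaround is to read the feed as XML first, take the CDATA content as a plain string, and only then hand that string to Html Agility Pack. This is a sketch under assumptions: the feed layout and the description element name are hypothetical (the question does not show them), and the names CdataTableExample and ReadCells are made up for illustration.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;
using HtmlAgilityPack;

public static class CdataTableExample
{
    // Hypothetical feed: the <description> element and overall layout are assumptions.
    public const string Feed =
        "<rss><channel><item><description><![CDATA[<table>" +
        "<tr><td>Name</td><td>LastName</td><td>Age</td></tr>" +
        "</table>]]></description></item></channel></rss>";

    // Returns the text of every <td> found inside the first CDATA-wrapped table.
    public static string[] ReadCells(string feedXml)
    {
        // Step 1: parse the feed as XML. XElement.Value returns the CDATA content as text.
        var xml = XDocument.Parse(feedXml);
        var description = xml.Descendants("description").FirstOrDefault();
        if (description == null)
        {
            return Array.Empty<string>();
        }

        // Step 2: parse that text as HTML so the table becomes part of a node tree.
        var html = new HtmlDocument();
        html.LoadHtml(description.Value);

        var cells = html.DocumentNode.SelectNodes("//td");
        if (cells == null)
        {
            return Array.Empty<string>();
        }
        return cells.Select(c => c.InnerText.Trim()).ToArray();
    }

    public static void Main()
    {
        foreach (var cell in ReadCells(Feed))
        {
            Console.WriteLine(cell);
        }
    }
}
```

When iterating row by row instead, note that an XPath starting with // searches from the document root even when called on a row node; a relative expression such as ".//td" or "td" keeps the search inside the current row.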

HTML AGILITY PACK Parsing Div Blocks

I need to parse items from an internet-shop—I need their name and price. Each item-block is located in a different div within a div-catalog of these items.
So I tried this, and it kinda works, but I would prefer to parse both name and price in 1 loop. How might I do so? Thanks!
var url = "http://bestaqua.com.ua/catalog/filtry-obratnogo-osmosa";
HtmlWeb web = new HtmlWeb();
HtmlDocument HtmlDoc = web.Load(url);
var RootNode = HtmlDoc.DocumentNode;
foreach (HtmlNode node in
    HtmlDoc.DocumentNode.SelectNodes("//div[@class='catalog_blocks']"))
{
    foreach (HtmlNode item_name in
        node.SelectNodes("//div[@class='catalog_blocks-item-name']"))
    {
        string name = item_name.InnerText;
        System.Diagnostics.Debug.Write("NAME :" + name + "\n");
    }
    foreach (HtmlNode item_price in
        node.SelectNodes("//span[@class='price-new']"))
    {
        string price = item_price.InnerText;
        System.Diagnostics.Debug.Write("PRICE: " + price + "\n");
    }
}
Since SelectNodes takes an XPath expression, you could combine both class filters with the union operator "|", which will result in a single collection to loop over.
Note that you would then still need to check, inside the foreach loop, which kind of element you have actually selected.
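A minimal sketch of the union approach, under assumptions: the sample markup is invented (the real page structure is not shown in the question), the names UnionExample and ReadItems are hypothetical, and the pairing logic assumes that the union result arrives in document order with each name followed by its price.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public static class UnionExample
{
    // Invented sample markup; only the class names come from the question.
    public const string Html =
        "<div class='catalog_blocks'>" +
        "<div><div class='catalog_blocks-item-name'>Filter A</div>" +
        "<span class='price-new'>100</span></div>" +
        "<div><div class='catalog_blocks-item-name'>Filter B</div>" +
        "<span class='price-new'>200</span></div>" +
        "</div>";

    // Pairs each item name with the price node that follows it in the union result.
    public static List<KeyValuePair<string, string>> ReadItems(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var items = new List<KeyValuePair<string, string>>();
        var nodes = doc.DocumentNode.SelectNodes(
            "//div[@class='catalog_blocks-item-name'] | //span[@class='price-new']");
        if (nodes == null)
        {
            return items;
        }

        string currentName = null;
        foreach (var node in nodes)
        {
            // Check which kind of element was selected.
            if (node.GetAttributeValue("class", "") == "catalog_blocks-item-name")
            {
                currentName = node.InnerText.Trim();
            }
            else if (currentName != null)
            {
                items.Add(new KeyValuePair<string, string>(currentName, node.InnerText.Trim()));
                currentName = null;
            }
        }
        return items;
    }

    public static void Main()
    {
        foreach (var item in ReadItems(Html))
        {
            Console.WriteLine("NAME: " + item.Key + " PRICE: " + item.Value);
        }
    }
}
```

If every item has its own wrapper element, an alternative is to select the wrappers and query name and price with relative XPath expressions (starting with ".//") inside a single loop.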

HTML Agility Pack - Grab Text after a node

I have some HTML that I'm parsing using C#
The sample text is below, though this is repeated about 150 times with different records
<strong>Title</strong>: Mr<br>
<strong>First name</strong>: Fake<br>
<strong>Surname</strong>: Guy<br>
I'm trying to get the text in an array which will be like
customerArray [0,0] = Title
customerArray [0,1] = Mr
customerArray [1,0] = First Name
customerArray [1,1] = Fake
customerArray [2,0] = Surname
customerArray [2,1] = Guy
I can get the text into the array, but I'm having trouble getting the text after the closing STRONG tag up until the BR tag, and then finding the next STRONG tag.
Any help would be appreciated.
You can use the XPath following-sibling::text()[1] to get the text node located directly after each strong. Here is a minimal but complete example:
var raw = @"<div>
<strong>Title</strong>: Mr<br>
<strong>First name</strong>: Fake<br>
<strong>Surname</strong>: Guy<br>
</div>";
var doc = new HtmlDocument();
doc.LoadHtml(raw);
foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//strong"))
{
    // First text node that follows this <strong> among its siblings
    var val = node.SelectSingleNode("following-sibling::text()[1]");
    Console.WriteLine(node.InnerText + ", " + val.InnerText);
}
Output:
Title, : Mr
First name, : Fake
Surname, : Guy
You should be able to remove the ":" by doing simple string manipulation, if needed...
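As a sketch of that string manipulation, the following fills a two-dimensional array shaped like the customerArray from the question. The names LabelValueExample and Parse are made up for illustration, and the code assumes every value starts with a colon as in the sample.

```csharp
using System;
using HtmlAgilityPack;

public static class LabelValueExample
{
    public const string Html =
        "<div><strong>Title</strong>: Mr<br>" +
        "<strong>First name</strong>: Fake<br>" +
        "<strong>Surname</strong>: Guy<br></div>";

    // Returns an array where [i, 0] is the label and [i, 1] is the value without the colon.
    public static string[,] Parse(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var labels = doc.DocumentNode.SelectNodes("//strong");
        if (labels == null)
        {
            return new string[0, 2];
        }

        var result = new string[labels.Count, 2];
        for (int i = 0; i < labels.Count; i++)
        {
            // The text node right after <strong> looks like ": Mr".
            var value = labels[i].SelectSingleNode("following-sibling::text()[1]");
            result[i, 0] = labels[i].InnerText.Trim();
            result[i, 1] = value == null
                ? string.Empty
                : value.InnerText.Trim().TrimStart(':').Trim();
        }
        return result;
    }

    public static void Main()
    {
        var customerArray = Parse(Html);
        for (int i = 0; i < customerArray.GetLength(0); i++)
        {
            Console.WriteLine(customerArray[i, 0] + " = " + customerArray[i, 1]);
        }
    }
}
```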
<strong> is a common tag, so here is something more specific to the sample format you provided:
var html = @"
<div>
<strong>First name</strong><em>italic</em>: Fake<br>
<strong>Bold</strong> <a href='#'>hyperlink</a><br>.
<strong>bold</strong>
<strong>bold</strong> <br>
text
</div>
<div>
<strong>Title</strong>: Mr<BR>
<strong>First name</strong>: Fake<br>
<strong>Surname</strong>: Guy<br>
</div>";
var document = new HtmlDocument();
document.LoadHtml(html);
// 1. <strong>
var strong = document.DocumentNode.SelectNodes("//strong");
if (strong != null)
{
foreach (var node in strong.Where(
// 2. followed by non-empty text node
x => x.NextSibling is HtmlTextNode
&& !string.IsNullOrEmpty(x.NextSibling.InnerText.Trim())
// 3. followed by <br>
&& x.NextSibling.NextSibling is HtmlNode
&& x.NextSibling.NextSibling.Name.ToLower() == "br"))
{
Console.WriteLine("{0} {1}", node.InnerText, node.NextSibling.InnerText);
}
}

Regular expression to match everything, except HTML tags

<tr><td>Di, 12.04.16</td><td>1</td><td>D</td><td>D</td><td>255</td><td>ABC</td><tr>
I want to match only ABC, or anything else that stands between
<td>
</td> (before and after ABC)
This pattern doesn't work for me:
((?!<tr><td>[D-M][i-r],[' ][0-3][0-9]\\.[0-1][0-9]\\.[0-9][0-9]</td><td>[1-9][0-2]?</td><td>[A-Z]?[A-Z]?[A-Z]?[A-Z]?[1-5]?</td><td>(---|[A-Z]?[A-Z]?[A-Z]?[A-Z]?[1-5]?)</td><td>).*(?!</td></tr>))
Do you have any idea?
Thx for help
As Amy said, don't use regex to parse HTML. You can install Html Agility Pack from NuGet and use System.Linq Namespace to parse it.
For example here:
string html = "<html><head></head><body><p class='testclass'>This is a paragraph.</p><table><tr><td>Di, 12.04.16</td><td>1</td><td>D</td><td>D</td><td>255</td><td>ABC</td><tr></table></body></html>";
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(html);
var programmes = doc.DocumentNode.Descendants().Where(d => d.GetAttributeValue("class", "") == "testclass");
var trs = doc.DocumentNode.Descendants("tr"); // Give you all the trs
foreach (var tr in trs)
{
var tds = tr.Descendants("td").ToArray(); // Get all the tds
//Sample, show the result in a TextBlock
foreach (var td in tds)
{
txt.Text = txt.Text + " " + td.InnerText;
}
}

Grab all text from html with Html Agility Pack

Input
<html><body><p>foo <a href='http://www.example.com'>bar</a> baz</p></body></html>
Output
foo
bar
baz
I know of htmldoc.DocumentNode.InnerText, but it returns all the text in one string. I want to get each piece of text separately, not all at once.
XPATH is your friend :)
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(@"<html><body><p>foo <a href='http://www.example.com'>bar</a> baz</p></body></html>");
foreach(HtmlNode node in doc.DocumentNode.SelectNodes("//text()"))
{
Console.WriteLine("text=" + node.InnerText);
}
var root = doc.DocumentNode;
var sb = new StringBuilder();
foreach (var node in root.DescendantNodesAndSelf())
{
if (!node.HasChildNodes)
{
string text = node.InnerText;
if (!string.IsNullOrEmpty(text))
sb.AppendLine(text.Trim());
}
}
This does what you need, but I am not sure if this is the best way. Maybe you should iterate through something other than DescendantNodesAndSelf for optimal performance.
I was in the need of a solution that extracts all text but discards the content of script and style tags. I could not find it anywhere, but I came up with the following which suits my own needs:
// Text nodes only, skipping anything directly inside <script> or <style>
IEnumerable<HtmlNode> nodes = doc.DocumentNode.Descendants().Where(n =>
    n.NodeType == HtmlNodeType.Text &&
    n.ParentNode.Name != "script" &&
    n.ParentNode.Name != "style");
foreach (HtmlNode node in nodes)
{
    Console.WriteLine(node.InnerText);
}
var pageContent = "{html content goes here}";
var pageDoc = new HtmlDocument();
pageDoc.LoadHtml(pageContent);
var pageText = pageDoc.DocumentNode.InnerText;
The specified example for html content:
<html><body><p>foo <a href='http://www.example.com'>bar</a> baz</p></body></html>
will produce the following output:
foo bar baz
public string html2text(string html) {
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml("<html><body>" + html + "</body></html>");
return doc.DocumentNode.SelectSingleNode("//body").InnerText;
}
This workaround is based on Html Agility Pack. You can also install it via NuGet (package name: HtmlAgilityPack).
https://github.com/jamietre/CsQuery
Have you tried CsQuery? Though it is not actively maintained, it's still my favorite for converting HTML to text. Here's a one-liner that shows how simple it is to get the text from HTML:
var text = CQ.CreateDocument(htmlText).Text();
Here's a complete console application:
using System;
using CsQuery;
public class Program
{
public static void Main()
{
var html = "<div><h1>Hello World <p> some text inside h1 tag under p tag </p> </h1></div>";
var text = CQ.CreateDocument(html).Text();
Console.WriteLine(text); // Output: Hello World some text inside h1 tag under p tag
}
}
I understand that the OP asked for HtmlAgilityPack only, but CsQuery is a less well-known alternative and one of the best solutions I've found, so I wanted to share it in case someone finds it helpful. Cheers!
I just changed and fixed some people's answers to work better:
var document = new HtmlDocument();
document.LoadHtml(result);
var sb = new StringBuilder();
foreach (var node in document.DocumentNode.DescendantsAndSelf())
{
if (!node.HasChildNodes && node.Name == "#text" && node.ParentNode.Name != "script" && node.ParentNode.Name != "style")
{
string text = node.InnerText?.Trim();
// Skip empty text and anything that still looks like raw markup
if (!string.IsNullOrEmpty(text) && !text.StartsWith("<") && !text.EndsWith(">"))
sb.AppendLine(System.Web.HttpUtility.HtmlDecode(text));
}
}
