Parsing RSS feed from XML document - c#

I'm trying to read an RSS feed, but I can't get it to work. I want the content of the td tags, but the code always throws a NullReferenceException while parsing the table rows. Any help is appreciated.
Code:
public void readRss()
{
string Url = "mylink.com";
HtmlWeb web = new HtmlWeb();
HtmlDocument doc = web.Load(Url);
var table = doc.DocumentNode.SelectSingleNode("//table");
var rows = table.SelectNodes("//tr");
if (rows != null && rows.Count > 0)
{
foreach (var row in rows)
{
var cells = row.SelectNodes("//td");
//do stuff
}
}
}
XML file is formatted like this:
<![CDATA[<table>
<tr>
<td>Name</td>
<td>LastName</td>
<td>Age</td>
<tr>
</table>
]]>

Does web.Load(Url) actually return the XML shown in your example? If it does, selecting nodes inside the CDATA section will simply not work. Everything inside <![CDATA[...]]> is treated as text only, and none of it becomes part of the document's node tree. Your first SelectSingleNode("//table") will therefore always return null.
BTW: you should test for null after setting the doc and table variables, just as you do for rows. Both of these can be null.
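One way around the CDATA limitation is to read the feed as XML first and only then hand the embedded markup to HtmlAgilityPack. The following is a minimal sketch, not a definitive implementation. It assumes the table sits inside the feed's <description> elements, which the question does not state; the names FeedTableReader and ReadCells are made up for illustration.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;
using HtmlAgilityPack;

public static class FeedTableReader
{
    // Hypothetical helper: returns the text of every <td> found in the HTML
    // embedded in each <description> element (an assumed location).
    public static List<string> ReadCells(string feedXml)
    {
        var cells = new List<string>();
        var feed = XDocument.Parse(feedXml);

        foreach (var description in feed.Descendants("description"))
        {
            // XElement.Value returns the CDATA payload as plain text,
            // so it can be parsed as a separate HTML fragment.
            var fragment = new HtmlDocument();
            fragment.LoadHtml(description.Value);

            // SelectNodes returns null (not an empty list) when nothing matches.
            var tds = fragment.DocumentNode.SelectNodes("//td");
            if (tds == null) continue;

            cells.AddRange(tds.Select(td => td.InnerText.Trim()));
        }

        return cells;
    }
}
```

Here "//td" is safe because each fragment is its own small document; when querying from a row node inside a larger document, a relative path such as ".//td" is the better choice.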

Related

Scrape Table Inside Comment With HTMLAgilityPack

I'd like to scrape a table within a comment using HTMLAgilityPack. For example, on the page
http://www.baseball-reference.com/register/team.cgi?id=f72457e4
there is a table with id="team_pitching". I can get this comment as a block of text with:
var tags = doc.DocumentNode.SelectSingleNode("//comment()[contains(., 'team_pitching')]");
however my preference would be to select the rows from the table with something like:
var tags = doc.DocumentNode.SelectNodes("//comment()[contains(., 'team_pitching')]//table//tbody//tr");
or
var tags = doc.DocumentNode.SelectNodes("//comment()//table[@id = 'team_pitching']//tbody//tr");
but these both return null. Is there a way to do this so I don't have to parse the text manually to get all of the table data?
Sample HTML - I'm looking to find nodes inside <!-- ... -->:
<p>not interesting HTML here</p>
<!-- <table id=team_pitching>
<tbody><tr>...</tr>...</tbody>...</table> -->
The content of a comment is not parsed into DOM nodes, so a single XPath cannot search both outside and inside the comment.
You can get the InnerHtml of the comment node, trim the comment delimiters, load the result into a new HtmlDocument and query that. Something like this should work:
var commentNode = doc.DocumentNode
.SelectSingleNode("//comment()[contains(., 'team_pitching')]");
var commentHtml = commentNode.InnerHtml.TrimStart('<', '!', '-').TrimEnd('-', '>');
var commentDoc = new HtmlDocument();
commentDoc.LoadHtml(commentHtml);
var tags = commentDoc.DocumentNode.SelectNodes("//table//tbody//tr");

MVC StackOverflowException with larger html data

I have the following method (I'm using HtmlAgilityPack):
public DataTable tableIntoTable(HtmlDocument doc)
{
var nodes = doc.DocumentNode.SelectNodes("//table");
var table = new DataTable("MyTable");
table.Columns.Add("raw", typeof(string));
foreach (var node in nodes)
{
if (
(!node.InnerHtml.Contains("pldefault"))
&& (!node.InnerHtml.Contains("ntdefault"))
&& (!node.InnerHtml.Contains("bgtabon"))
)
{
table.Rows.Add(node.InnerHtml);
}
}
return table;
}
It accepts html grabbed using this:
public HtmlDocument getDataWithGet(string url)
{
using (var wb = new WebClient())
{
string response = wb.DownloadString(url);
var doc = new HtmlDocument();
doc.LoadHtml(response);
return doc;
}
}
All works fine with an html document that is 3294 lines long.
When I feed it some html that is 33960 lines long I get:
StackOverflowException was unhandled at the IF statement in the tableIntoTable method as seen in this image:
http://imgur.com/Q2FnIgb
I thought it might be related to the MaxHttpCollectionKeys limit of 1000 so I tried putting this in my Web.config and it still doesn't work:
add key="aspnet:MaxHttpCollectionKeys" value="9999"
I'm not really sure where to go from here, it only breaks with larger html documents.
Assuming the values in your if statement appear in an attribute value of some descendant of a table, you can filter the tables in the XPath itself:
var xpath = @"//table[not(.//*[contains(@*,'pldefault') or
contains(@*,'ntdefault') or
contains(@*,'bgtabon')])]";
var tables = doc.DocumentNode.SelectNodes(xpath);
Update: more precisely, based on your comments:
@"//table[not(.//td[contains(@class,'pldefault') or
contains(@class,'ntdefault') or
contains(@class,'bgtabon')])]";

Use HtmlAgilityPack to parse HTML variable, not HTML document?

I have a variable in my program that contains HTML data as a string. The variable, htmlText, contains something like the following:
<ul><li><u>Mode selector </u></li><li><u>LAND ALT</u></li>
I'd like to iterate through this HTML, using the HtmlAgilityPack, but every example I see tries to load the HTML as a document. I already have the HTML that I want to parse within the variable htmlText. Can someone show me how to parse this, without loading it as a document?
The example I'm looking at right now looks like this:
static void Main(string[] args)
{
var web = new HtmlWeb();
var doc = web.Load("http://www.stackoverflow.com");
var nodes = doc.DocumentNode.SelectNodes("//a[@href]");
foreach (var node in nodes)
{
Console.WriteLine(node.InnerHtml);
}
}
I want to convert this to use my htmlText and find all underline elements within. I just don't want to load this as a document since I already have the HTML that I want to parse stored in a variable.
You can use the LoadHtml method of the HtmlDocument class.
"Document" is simply a name; the input doesn't have to be a complete document.
var doc = new HtmlAgilityPack.HtmlDocument();
string myHTML = "<ul><li><u>Mode selector </u></li><li><u>LAND ALT</u></li>";
doc.LoadHtml(myHTML);
// "//u" selects every underline element in the fragment
foreach (var node in doc.DocumentNode.SelectNodes("//u")) {
Console.WriteLine(node.InnerHtml);
}
I've used this same approach to parse HTML chunks held in variables.

Getting text from elements without id or class name

I am trying to parse HTML code using Html Agility Pack. Is there a tutorial available, or can someone tell me how to get the text from a <td> that has no id and no class?
<table id="results-table">
<tr class="row1">
<td>Diode Zener Single 12V 5% 1W 2-Pin DO-41 Bulk</td>
...
Each row contains 10 different <td>. Thanks!
You can try this XPath to query all the tds within the table that has id="results-table":
//table[@id='results-table']/tr/td
Firepath for Firefox can help you formulate the XPath, and you can adjust it from there.
Sample code below
HtmlDocument doc = new HtmlDocument();
var fileName = @"..\..\..\docs\10960189.htm";
doc.Load(fileName);
var nodes = doc.DocumentNode.SelectNodes("//table[@id='results-table']/tr/td");
foreach (var node in nodes)
{
Debug.WriteLine(node.InnerText);
}
HTH
Here is a link that explain how to use XPath:
http://www.w3schools.com/xpath/
I guess some of your td tags will have a class or id. The following code skips those; I wrote it in LINQPad.
void Main()
{
var webGet = new HtmlAgilityPack.HtmlDocument();
//web page/string that need to be parsed
webGet.LoadHtml(@"<table id='results-table'>" +
"<tr class='row1'>" +
"<td class='testclass'>test td with class</td>" +
"<td id='testid'>test td with id</td>" +
"<td>Diode Zener Single 12V 5% 1W 2-Pin DO-41 Bulk</td>" +
"<td>test td without class or id</td>" +
"</tr></table>"
);
var tableOnPage = (from tds in webGet.DocumentNode.Descendants()
where tds.Name == "td" &&
tds.Attributes["class"] == null && tds.Attributes["id"] == null &&
tds.ParentNode.InnerText.Trim().Length > 0 && tds.InnerText.Trim().Length > 0
select new
{
td = tds.DescendantNodes().SingleOrDefault().InnerHtml.Trim(),
});
//looping through each items
foreach (var item in tableOnPage)
{
Console.WriteLine(item.td);
}
}
Output will be
Diode Zener Single 12V 5% 1W 2-Pin DO-41 Bulk
test td without class or id

SelectSingleNode returns the wrong result on a foreach

HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(content);
var nodes = doc.DocumentNode.SelectNodes("//div[@class=\"noprint res\"]/div");
if (nodes != null)
{
foreach (HtmlNode data in nodes)
{
// Works but not what I want
MessageBox.Show(data.InnerHtml);
// Should work ? but does not ?
MessageBox.Show(data.SelectSingleNode("//span[@class=\"pp-place-title\"]").InnerText);
}
}
I am trying to parse the results of an HTML page. The initial SelectNodes call for the foreach works as expected and gives me 10 items, which matches what I need.
Inside the foreach, outputting the inner HTML of the data item displays the correct data, but the SelectSingleNode call always displays the data from the first item. Is that normal behavior, or am I doing something wrong?
To resolve the issue I had to create a new HtmlDocument inside the foreach for every data item, like this:
HtmlAgilityPack.HtmlDocument innerDoc = new HtmlAgilityPack.HtmlDocument();
innerDoc.LoadHtml(data.InnerHtml);
// Select what I need
MessageBox.Show(innerDoc.DocumentNode.SelectSingleNode("//span[@class=\"pp-place-title\"]").InnerText);
Then I get the correct per item data.
The page I was trying to get data from was http://maps.google.com/maps?q=consulting+loc:+US if you want to try it and see what happens for yourself.
Basically I am reading the left side column for company names and the above happens.
By starting your XPath expression with //, you're searching the entire document that contains the data node, not just the data node itself.
You should be able to use ".//span[...]" to check only nodes within data.
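The difference between "//" and ".//" can be sketched as follows. This is a hedged illustration rather than the asker's actual code: the class and method names (PlaceTitles, Read) are hypothetical, and the markup it expects is assumed from the XPath expressions in the question.

```csharp
using System.Collections.Generic;
using HtmlAgilityPack;

public static class PlaceTitles
{
    // Hypothetical helper: collects one title per result item.
    public static List<string> Read(string html)
    {
        var titles = new List<string>();
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var items = doc.DocumentNode.SelectNodes("//div[@class='noprint res']/div");
        if (items == null) return titles;

        foreach (var item in items)
        {
            // The leading dot anchors the search at `item`; without it,
            // "//span[...]" would start again from the document root and
            // return the first match in the whole page on every iteration.
            var title = item.SelectSingleNode(".//span[@class='pp-place-title']");
            if (title != null) titles.Add(title.InnerText.Trim());
        }

        return titles;
    }
}
```

With the relative path there is no need to reload each item's InnerHtml into a separate HtmlDocument.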
