Parsing results from HtmlAgilityPack - C#

I'm trying to parse the Yahoo Finance page for a list of stock symbols and company names. The URL I'm using is: http://uk.finance.yahoo.com/q/cp?s=%5EFTSE
The code I'm using is:
HtmlAgilityPack.HtmlDocument page = new HtmlWeb().Load("http://uk.finance.yahoo.com/q/cp?s=%5EFTSE");
var titles = page.DocumentNode.SelectNodes("//td[@class='yfnc_tabledata1']");
// Selects every td cell with the class yfnc_tabledata1.
foreach (var title in titles)
{
txtLog.AppendText(title.InnerHtml + System.Environment.NewLine);
}
The txtLog.AppendText line was just me testing. The code correctly gets each td node that has the class yfnc_tabledata1. Now, inside the foreach loop, I need to parse title to grab the symbol and company name from the following HTML:
<b>GLEN.L</b>
GLENCORE XSTRAT
<b>343.95</b> <nobr><small>3 May 16:35</small></nobr>
<img width="10" height="14" style="margin-right:-2px;" border="0"
src="http://l.yimg.com/os/mit/media/m/base/images/transparent-1093278.png"
class="pos_arrow" alt="Up"> <b style="color:#008800;">12.80</b>
<b style="color:#008800;"> (3.87%)</b> 68,086,160
Is it possible to parse the results of a parsed document? I'm a little unsure on where to start.

You just need to continue the XPath extraction from where you are. There are many ways to do it. The difficulty is that all the yfnc_tabledata1 nodes are at the same level. Here is one way (a console app example that dumps the list of symbols and companies):
HtmlAgilityPack.HtmlDocument page = new HtmlWeb().Load("http://uk.finance.yahoo.com/q/cp?s=%5EFTSE");
// Get the symbols directly: recursively search each yfnc_tabledata1 TD for an A element that has an HREF attribute.
var symbols = page.DocumentNode.SelectNodes("//td[@class='yfnc_tabledata1']//a[@href]");
foreach (var symbol in symbols)
{
// From the current A element, go up two levels and take the next TD element.
var company = symbol.SelectSingleNode("../../following-sibling::td").InnerText.Trim();
Console.WriteLine(symbol.InnerText + ": " + company);
}
More on XPATH axes here: XPATH Axes

Related

Looping a node collection gives me unique nodes, but selecting nodes inside them gives me the results of the first loop item

Context: Using the HtmlAgilityPack library, I'm looping over an HtmlNodeCollection. Printing the HTML of each node gives me the data that I need, but when I select nodes inside that HTML, every one of them gives me the result from the first item.
Writing each node's HTML with node.InnerHtml gives me the unique HTML for each one, all correct, but when I call SelectSingleNode, all of them give me the same data.
Due to the project, I cannot disclose the website. What I can say is that there are 17 nodes, each of them a div with the class k-user-item, and all of them have different content.
Thanks for the help!
Code:
var nodes = w.DocumentNode.SelectNodes("//div[contains(@class, 'k-user-item')]");
List<Sales> saleList = new List<Sales>();
foreach (HtmlNode node in nodes)
{
// This line prints the correct HTML, but selecting single nodes always gives me the data of the first item in the loop.
//Debug.WriteLine(node.InnerHtml);
string payout = node.SelectSingleNode("//*[@class=\"k-item--buy-date\"]").InnerText;
string size = node.SelectSingleNode("//*[@class=\"k-panel-title\"]").SelectNodes("//span")[1].InnerText;
var trNodes = node.SelectNodes("//tr");
string status = trNodes[1].SelectSingleNode("//b").InnerText;
string orderId = trNodes[2].SelectNodes("//td")[1].SelectSingleNode("//span").InnerHtml;
string sellDate = node.SelectSingleNode("//*[@class=\"k-panel-heading\"]").SelectNodes("//small")[1].InnerHtml;
}
This issue was solved by adding a "." to the start of each XPath (for example, ".//tr" instead of "//tr").
Without the leading dot, the XPath is evaluated against the whole document rather than just the current node, which is why every item returned the first match.
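A minimal sketch of the difference, assuming the HtmlAgilityPack NuGet package is referenced. The two-item HTML fragment and the ScopedXPathDemo class are invented for illustration; only the class names k-user-item and k-item--buy-date come from the question.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public static class ScopedXPathDemo
{
    // Hypothetical markup standing in for the undisclosed site.
    const string Html =
        "<div class='k-user-item'><span class='k-item--buy-date'>first</span></div>" +
        "<div class='k-user-item'><span class='k-item--buy-date'>second</span></div>";

    // Runs the given XPath once per k-user-item node and collects the text found.
    public static List<string> BuyDates(string xpath)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(Html);
        var results = new List<string>();
        var items = doc.DocumentNode.SelectNodes("//div[contains(@class, 'k-user-item')]");
        foreach (HtmlNode item in items)
        {
            results.Add(item.SelectSingleNode(xpath).InnerText);
        }
        return results;
    }

    public static void Main()
    {
        // "//" starts at the document root, so every item finds the same first match.
        Console.WriteLine(string.Join(", ", BuyDates("//*[@class='k-item--buy-date']")));
        // ".//" starts at the current node, so each item finds its own value.
        Console.WriteLine(string.Join(", ", BuyDates(".//*[@class='k-item--buy-date']")));
    }
}
```

With this fragment the first call should print "first, first" and the second "first, second". The same rule applies to every chained call in the question's loop, including the SelectNodes calls made on trNodes[1] and trNodes[2].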

Extracting data from HTML file using c# script

What I need to do: extract the From, To, Cc and Subject information from an HTML file and remove it from the file, without using any third-party library (HtmlAgilityPack, etc.).
What I am having trouble with: what approach should I take to get these values (From, To, Cc, Subject) from the HTML tags?
Steps I tried: I took the index of <p class=MsoNormal> and the last index of the email domain @sampleemail.com, but I think that is a bad approach, since some HTML files contain many "<p class=MsoNormal>" tags. To remove the From, To, Cc and Subject values I used string.Remove(indexOf, count), where count is the number of characters from indexOf to lastIndexOf, and it worked.
Sample tag containing information of from:
<p class=MsoNormal style='margin-left:120.0pt;text-indent:-120.0pt;tab-stops:120.0pt;mso-layout-grid align:none;text-autospace:none'><b><span style='color:black'>From:<span style='mso-tab-count:1'></span></span></b><span style='color:black'>1234@sampleemail.com<o:p></o:p></span></p>
HtmlAgilityPack is your friend. An XPath such as //p[@class='MsoNormal'] selects those tags so you can read their content:
public static void Main()
{
var html =
@"<p class=MsoNormal style='margin-left:120.0pt;text-indent:-120.0pt;tab-stops:120.0pt;mso-layout-grid align:none;text-autospace:none'><b><span style='color:black'>From:<span style='mso-tab-count:1'></span></span></b><span style='color:black'>1234@sampleemail.com<o:p></o:p></span></p> ";
var htmlDoc = new HtmlDocument();
htmlDoc.LoadHtml(html);
var nodes = htmlDoc.DocumentNode.SelectNodes("//p[@class='MsoNormal']");
foreach(var node in nodes)
Console.WriteLine(node.InnerText);
}
Result:
From:1234@sampleemail.com
Update
If third-party libraries are off the table, a Regex can serve as a simple parser. But remember that it will not handle every case in a complicated HTML document.
public static void MainFunc()
{
string str = @"<p class='MsoNormal' style='margin-left:120.0pt;text-indent:-120.0pt;tab-stops:120.0pt;mso-layout-grid align:none;text-autospace:none'><b><span style='color:black'>From:<span style='mso-tab-count:1'></span></span></b><span style='color:black'>1234@sampleemail.com<o:p></o:p></span></p> ";
var result = Regex.Replace(str, "<(?:\"[^\"]*\"['\"]*|'[^']*'['\"]*|[^'\">])+>", "");
Console.WriteLine(result);
}
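Since the question asks for both extraction and removal without third-party libraries, the regex idea above can be pushed a step further. The following is only a sketch under stated assumptions: each header sits in its own <p>...</p> block and its visible text starts with the label. The HeaderExtractor class, its Extract method and the shortened sample HTML are hypothetical, and a body paragraph that happens to start with "To:" would be removed as well.

```csharp
using System;
using System.Collections.Generic;
using System.Net;
using System.Text.RegularExpressions;

public static class HeaderExtractor
{
    // Matches one <p>...</p> block; Singleline lets '.' span line breaks.
    static readonly Regex Paragraph =
        new Regex(@"<p\b[^>]*>.*?</p>", RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Crude tag stripper; adequate for this sample, not for arbitrary HTML.
    static readonly Regex Tag = new Regex(@"<[^>]+>");

    // A header is a known label, a colon, then the value.
    static readonly Regex Header =
        new Regex(@"^\s*(From|To|Cc|Subject)\s*:\s*(.*?)\s*$",
                  RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Fills 'headers' and returns the HTML with the header paragraphs removed.
    public static string Extract(string html, IDictionary<string, string> headers)
    {
        return Paragraph.Replace(html, m =>
        {
            string text = WebUtility.HtmlDecode(Tag.Replace(m.Value, ""));
            Match h = Header.Match(text);
            if (!h.Success) return m.Value;   // not a header: keep the paragraph
            headers[h.Groups[1].Value] = h.Groups[2].Value;
            return "";                        // header: drop the paragraph
        });
    }

    public static void Main()
    {
        // Shortened version of the sample tag from the question, plus one body paragraph.
        string html =
            "<p class=MsoNormal><b><span>From:<span></span></span></b>" +
            "<span>1234@sampleemail.com<o:p></o:p></span></p><p>Body text</p>";
        var headers = new Dictionary<string, string>();
        string rest = Extract(html, headers);
        Console.WriteLine(headers["From"]);   // the extracted address
        Console.WriteLine(rest);              // the HTML without the header paragraph
    }
}
```

Using a MatchEvaluator keeps extraction and removal in a single pass, so there is no index arithmetic to keep in sync with string.Remove.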

Scrape Table Inside Comment With HTMLAgilityPack

I'd like to scrape a table within a comment using HTMLAgilityPack. For example, on the page
http://www.baseball-reference.com/register/team.cgi?id=f72457e4
there is a table with id="team_pitching". I can get this comment as a block of text with:
var tags = doc.DocumentNode.SelectSingleNode("//comment()[contains(., 'team_pitching')]");
however my preference would be to select the rows from the table with something like:
var tags = doc.DocumentNode.SelectNodes("//comment()[contains(., 'team_pitching')]//table//tbody//tr");
or
var tags = doc.DocumentNode.SelectNodes("//comment()//table[@id = 'team_pitching']//tbody//tr");
but these both return null. Is there a way to do this so I don't have to parse the text manually to get all of the table data?
Sample HTML - I'm looking to find nodes inside <!-- ... -->:
<p>not interesting HTML here</p>
<!-- <table id=team_pitching>
<tbody><tr>...</tr>...</tbody>...</table> -->
The content of a comment is not parsed into DOM nodes, so a single XPath cannot cross from outside the comment to inside it.
You can take the InnerHtml of the comment node, trim the comment delimiters, load the result into a new HtmlDocument, and query that. Something like this should work:
var commentNode = doc.DocumentNode
.SelectSingleNode("//comment()[contains(., 'team_pitching')]");
var commentHtml = commentNode.InnerHtml.TrimStart('<', '!', '-').TrimEnd('-', '>');
var commentDoc = new HtmlDocument();
commentDoc.LoadHtml(commentHtml);
var tags = commentDoc.DocumentNode.SelectNodes("//table//tbody//tr");

How to read the content of a span tag using HtmlAgilityPack?

I'm using HtmlAgilityPack to scrape data from a link (site). The site has many p, header and span tags. I need to scrape data from one particular span tag.
var webGet = new HtmlWeb();
var document = webGet.Load(URL);
foreach (HtmlNode node in document.DocumentNode.SelectNodes("\\span"))
{
string strData = node.InnerText.Trim();
}
I tried keying on the parent tag, but that did not work for every kind of URL.
Please help me to fix it.
What is the error?
You can start by fixing this:
foreach (HtmlNode node in document.DocumentNode.SelectNodes("\\span"))
it should be:
foreach (HtmlNode node in document.DocumentNode.SelectNodes("//span"))
But I want exact data. For example, the source contains many span tags, such as <span>abc</span>, <span>def</span>, <span>pqr</span>, <span>xyz</span>, and I want the result "pqr". Is there any option to get it by the count of a particular tag or by index?
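If you want to get, for example, the third span tag in document order:
doc.DocumentNode.SelectSingleNode("(//span)[3]")
Note that //span[3] without the parentheses matches a span that is the third span child of its own parent, which gives the same result only when the spans are siblings.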
If you want to get the node containing the text "pqr":
doc.DocumentNode.SelectSingleNode("//span[contains(text(),'pqr')]");
You can use SelectNodes for the latter to get all span tags containing "pqr" in the text.
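The difference between indexing per parent and indexing in document order is easy to miss, so here is a small sketch, assuming the HtmlAgilityPack NuGet package is referenced. The markup and the SpanSelectionDemo class are made up for illustration; the spans are deliberately split across two parents.

```csharp
using System;
using HtmlAgilityPack;

public static class SpanSelectionDemo
{
    // Hypothetical markup: four spans, two under each div.
    const string Html =
        "<div><span>abc</span><span>def</span></div>" +
        "<div><span>pqr</span><span>xyz</span></div>";

    // Returns the InnerText of the first node matching xpath, or null if nothing matches.
    public static string FirstText(string xpath)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(Html);
        HtmlNode node = doc.DocumentNode.SelectSingleNode(xpath);
        return node == null ? null : node.InnerText;
    }

    public static void Main()
    {
        // Third span counted across the whole document.
        Console.WriteLine(FirstText("(//span)[3]"));
        // Third span child of a single parent; no parent here has three.
        Console.WriteLine(FirstText("//span[3]") ?? "(no match)");
        // Selection by text content instead of position.
        Console.WriteLine(FirstText("//span[contains(text(), 'pqr')]"));
    }
}
```

With this markup the three lines should read "pqr", "(no match)" and "pqr". SelectSingleNode returns null when nothing matches, which is why the helper checks before reading InnerText.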

C#, Html Agility, Selecting every paragraph within a div tag

How can I select every paragraph in a div tag? For example:
<div id="body_text">
<p>Hi</p>
<p>Help Me Please</p>
<p>Thankyou</p>
</div>
I have Html Agility Pack downloaded and referenced in my program. All I need is the paragraphs. There may be a variable number of paragraphs, and there are lots of other div tags, but I only need the content within body_text. I assume this can then be stored as a string, which I want to write to a .txt file for later reference. Thank you.
A valid XPath for your case is //div[@id='body_text']/p
foreach (HtmlNode node in yourHTMLAgilityPackDocument.DocumentNode.SelectNodes("//div[@id='body_text']/p"))
{
string text = node.InnerText; //that's the text you are looking for
}
Here's a solution that grabs the paragraphs as an enumeration of HtmlNodes:
HtmlDocument doc = new HtmlDocument();
doc.Load("your.html");
var div = doc.GetElementbyId("body_text");
var paragraphs = div.ChildNodes.Where(item => item.Name == "p");
Without explicit Linq:
var paragraphs = doc.GetElementbyId("body_text").Elements("p");
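The question also asks about saving the text to a .txt file, which none of the answers cover. A minimal sketch, assuming the HtmlAgilityPack NuGet package is referenced; the ParagraphExport class, the inline HTML and the output path paragraphs.txt are placeholders, not part of the original answers.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using HtmlAgilityPack;

public static class ParagraphExport
{
    // Returns the trimmed text of each <p> directly under the element with the given id.
    public static List<string> ParagraphTexts(string html, string divId)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        HtmlNode div = doc.GetElementbyId(divId);
        if (div == null) return new List<string>();   // id not present in this document
        return div.Elements("p")
                  .Select(p => HtmlEntity.DeEntitize(p.InnerText).Trim())
                  .ToList();
    }

    public static void Main()
    {
        string html =
            "<div id=\"body_text\"><p>Hi</p><p>Help Me Please</p><p>Thank you</p></div>";
        List<string> lines = ParagraphTexts(html, "body_text");

        // "paragraphs.txt" is a placeholder path; one paragraph is written per line.
        File.WriteAllLines("paragraphs.txt", lines);
        Console.WriteLine(string.Join(Environment.NewLine, lines));
    }
}
```

Returning a list rather than one joined string leaves the choice of separator to the caller; string.Join or File.WriteAllText can be used instead if a single string is preferred.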
