How to get data from a specific HTML class with Html Agility Pack - C#

I want to extract not the whole web page but only the text from one class. I want the text from the td with class="result-neutral", and I don't know what is wrong with my code. This is the element:
<td class="result-neutral" xseid="xz1nBfht">3 - 2 </td>
And this is C# code:
HtmlAgilityPack.HtmlDocument doc = new HtmlDocument();
HtmlWeb hw = new HtmlWeb();
doc = hw.Load("htt
var scoreNodes = doc.DocumentNode.Descendants("td").Where(d =>d.Attributes.Contains("class")&&d.Attributes["class"].Value.Contains("result-neutral"));
foreach (var item in scoreNodes)
{
result += item.OuterHtml + Environment.NewLine;
}
Info.Text = result;
}

OuterHtml returns the element's HTML including its start and end tags. Don't you want InnerHtml or InnerText instead?
EDIT:
This snippet works for me:
const string html = @"<html><body><table><tr><td class='result-neutral' xseid='xz1nBfht'><a href='/hockey/russia/khl/ska-st-petersburg-metallurg-magnitogorsk-xz1nBfht/'>3 - 2</a></td></tr></table></body></html>";
var doc = new HtmlDocument();
doc.LoadHtml(html);
var scoreNodes = doc.DocumentNode.Descendants("td").Where(d => d.Attributes.Contains("class") && d.Attributes["class"].Value.Contains("result-neutral"));
string result = "";
foreach (var item in scoreNodes) {
result += item.InnerText + Environment.NewLine;
}
result = result.TrimEnd(); // the result is "3 - 2"
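
Note that Contains("result-neutral") is a substring test, so it would also match a class such as result-neutral-big. The following is a minimal sketch of a stricter, token-based comparison. The sample markup and the ScoreExtractor/ExtractScores names are made up for illustration, and it assumes the HtmlAgilityPack NuGet package is referenced.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

static class ScoreExtractor
{
    // Hypothetical helper: returns the trimmed text of every <td> whose
    // class attribute contains the exact token "result-neutral".
    public static List<string> ExtractScores(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        return doc.DocumentNode.Descendants("td")
            .Where(td => td.GetAttributeValue("class", "")
                .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)
                .Contains("result-neutral"))
            .Select(td => td.InnerText.Trim())
            .ToList();
    }

    static void Main()
    {
        // Made-up sample: the second cell only shares a prefix with the wanted class.
        const string html = "<table><tr>"
            + "<td class='result-neutral'>3 - 2 </td>"
            + "<td class='result-neutral-big'>ignored</td>"
            + "</tr></table>";

        foreach (var score in ExtractScores(html))
        {
            Console.WriteLine(score);
        }
    }
}
```

GetAttributeValue with a default of "" avoids the separate Attributes.Contains("class") check, because a missing attribute simply yields an empty token list.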

Related

Scrape data with HtmlAgilityPack for a tag which doesn't have class

Here is my C# code. I am trying to scrape data from a website using HtmlAgilityPack, but it prints "nothing found" every time. I don't know what I am doing wrong.
HtmlAgilityPack.HtmlWeb webb = new HtmlAgilityPack.HtmlWeb();
ServicePointManager.SecurityProtocol = SecurityProtocolType.Tls12;
HtmlAgilityPack.HtmlDocument doc = webb.Load("mywebsite");
HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//ul[@class='unstyled']//li//a");
if (nodes != null)
{
foreach (HtmlNode n in nodes)
{
q = n.InnerText;
q = System.Net.WebUtility.HtmlDecode(q);
q = q.Trim();
Console.WriteLine(q);
}
}
else
{
Console.WriteLine("nothing found");
}
Here is a picture of the markup I am trying to capture data from. I need the data from the <a> tag.
The XPath used to select the tag is incorrect.
HtmlNodeCollection nodes =
doc.DocumentNode.SelectNodes("//ul[@class='unstyled']/li/a");
This should select all the anchor nodes; you can then loop through them to read each node's InnerHtml.
A working sample is shown below:
string s = "<ul class='unstyle no-overflow'><li><ul class='unstyled'><li><a href='http://www.smsconnexion.com'>SMS ConneXion</a></li></ul><ul class='unstyled'><li><a href='http://www.celusion.com'>Celusion</a></li></ul></li></ul>";
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(s);
HtmlNodeCollection nodes =
doc.DocumentNode.SelectNodes("//ul[@class='unstyled']/li/a");
foreach(var node in nodes)
{
Console.WriteLine(node.Attributes["href"].Value);
}
Console.ReadLine();

NodeSet exception while using agilitypack

private void ShowStatistics_Click(object sender, RoutedEventArgs e)
{
HtmlAgilityPack.HtmlDocument doc = new HtmlDocument();
HtmlWeb hw = new HtmlWeb();
doc = hw.Load("http://www.gamerankings.com/browse.html");
HtmlNodeCollection nodes= doc.DocumentNode.SelectNodes("//a/");
string result = "";
foreach (var item in nodes)
{
result += item.InnerText+Environment.NewLine;
}
Info.ItemsSource = result;
}
When the button is pressed, I want to show information from the web page in a text box called Info.
After pressing the button I get an exception saying that the result of the expression should be a NodeSet. What should I do? I'm using Html Agility Pack.
Your XPath is wrong: the trailing slash in "//a/" makes it an invalid expression. You can use this instead if you want to get all hyperlink elements:
var nodes = doc.DocumentNode.Descendants("a");
In addition to @Hung Cao's answer, you can shorten this by looping over the result of SelectNodes directly:
foreach (HtmlAgilityPack.HtmlNode node in doc.DocumentNode.SelectNodes("Selector here")){
//your code here
}
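
One caveat when looping over SelectNodes directly: by default it returns null rather than an empty collection when nothing matches, so the foreach can throw a NullReferenceException. Here is a small sketch with a null check. The inline HTML is a made-up sample, and it assumes the HtmlAgilityPack NuGet package is referenced.

```csharp
using System;
using HtmlAgilityPack;

class Program
{
    static void Main()
    {
        var doc = new HtmlDocument();
        // Made-up sample markup: one link with an href and one without.
        doc.LoadHtml("<p><a href='/one'>One</a> <a>no href</a></p>");

        // "//a[@href]" is a valid XPath; "//a/" is not, because of the trailing slash.
        HtmlNodeCollection links = doc.DocumentNode.SelectNodes("//a[@href]");
        if (links == null)
        {
            Console.WriteLine("nothing found");
            return;
        }

        foreach (HtmlNode link in links)
        {
            // GetAttributeValue returns the supplied default if the attribute is missing.
            Console.WriteLine(link.InnerText + " -> " + link.GetAttributeValue("href", ""));
        }
    }
}
```

Descendants("a") never returns null, which is one reason to prefer it when no XPath predicate is needed.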

XML to string (C#)

I have an XML document loaded from a URL like this:
WebClient client = new WebClient();
client.Encoding = Encoding.UTF8;
try
{
string reply = client.DownloadString("http://Example.com/somefile.xml");
label1.Text = reply;
}
catch
{
label1.Text = "FAILED";
}
That XML belongs to an RSS feed. I want label1.Text to show just the titles from that XML. How can I achieve that?
Example of label1.Text:
This is my first title - This is my 2nd title - And this is my last title
You can load your XML into an XmlDocument and then use XPath to get the value of each node you're targeting.
XmlDocument doc = new XmlDocument();
doc.LoadXml(reply);
XmlNodeList nodes = doc.SelectNodes("//NodeToSelect");
foreach (XmlNode node in nodes)
{
//If the value you want is the content of the node
label1.Text = node.InnerText;
//If the value you want is an attribute of the node
label1.Text = node.Attributes["AttributeName"].Value;
}
If you are not familiar with XPath, you can always check here:
http://www.w3schools.com/xpath/xpath_syntax.asp
var xml= XElement.Parse(reply);
label1.Text = string.Join(Environment.NewLine, xml
.Descendants()
.Where (x => !string.IsNullOrEmpty(x.Value))
.Select(x=> string.Format("{0}: {1}", x.Name, x.Value))
.ToArray());
You probably need to parse the RSS XML manually to get the title. Here is some sample code for your reference:
private static List<FeedsItem> ParseFeeds(string feedsXml)
{
XDocument xDoc = XDocument.Parse(feedsXml);
XNamespace xmlns = "http://www.w3.org/2005/Atom";
var items = from entry in xDoc.Descendants(xmlns + "entry")
select new FeedsItem
{
Id = (string)entry.Element(xmlns + "id").Value,
Title = (string)entry.Element(xmlns + "title").Value,
AlternateLink = (string)entry.Descendants(xmlns + "link").Where(link => link.Attribute("rel").Value == "alternate").First().Attribute("href").Value
};
Console.WriteLine("Count = {0}", items.Count());
foreach(var i in items)
{
Console.WriteLine(i);
}
return items.ToList();
}
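
The sample above targets Atom entries. For the layout the question describes, here is a minimal sketch that joins the item titles with " - ". It assumes a plain RSS 2.0 feed whose item and title elements carry no XML namespace; the JoinTitles name and the sample feed are made up for illustration.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

class Program
{
    // Hypothetical helper: collects the <title> of every <item> and joins them.
    static string JoinTitles(string rssXml)
    {
        XDocument doc = XDocument.Parse(rssXml);

        var titles = doc.Descendants("item")
            // The explicit string conversion yields null if <title> is missing.
            .Select(item => (string)item.Element("title"))
            .Where(title => !string.IsNullOrWhiteSpace(title))
            .Select(title => title.Trim());

        return string.Join(" - ", titles);
    }

    static void Main()
    {
        // Made-up feed; the channel-level <title> is skipped because only
        // <item> elements are inspected.
        const string sample =
            "<rss version='2.0'><channel><title>Feed name</title>"
            + "<item><title>This is my first title</title></item>"
            + "<item><title>This is my 2nd title</title></item>"
            + "<item><title>And this is my last title</title></item>"
            + "</channel></rss>";

        // Prints: This is my first title - This is my 2nd title - And this is my last title
        Console.WriteLine(JoinTitles(sample));
    }
}
```

In the question's code, the result of JoinTitles(reply) could be assigned to label1.Text inside the try block.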

c# using html agility pack URI formats not supported

I am trying to use Html Agility Pack to get my program to read in a file and get all the image srcs from it. Here's what I have so far:
private ArrayList GetImageLinks(String html,String link)
{
//link = url of webpage
//html = a string of the html, just for testing will remove after
HtmlAgilityPack.HtmlDocument htmlDoc = new HtmlAgilityPack.HtmlDocument();
htmlDoc.OptionFixNestedTags = true;
htmlDoc.Load(link);
List<String> imgs = (from x in htmlDoc.DocumentNode.Descendants()
where x.Name.ToLower() == "img"
select x.Attributes["src"].Value).ToList<String>();
Console.Out.WriteLine("Hey");
ArrayList imageLinks = new ArrayList(imgs);
foreach (String element in imageLinks)
{
Console.WriteLine(element);
}
return imageLinks;
}
And this is the error I'm getting:
System.ArgumentException: URI formats are not supported.
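HtmlDocument.Load expects a local file path or a stream, not a URL. To load a page from the web, use HtmlWeb instead:
HtmlDocument docHtml = new HtmlWeb().Load(url);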
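
Once the document is loaded, the src values can be collected and resolved against the page address. This is only a sketch: the ImageLinks/FromHtml names and the sample markup are hypothetical, it uses List<string> in place of the question's ArrayList, and it assumes the HtmlAgilityPack NuGet package is referenced.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

static class ImageLinks
{
    // Hypothetical helper: returns absolute URLs for every <img> that has a src.
    public static List<string> FromHtml(string html, Uri pageUri)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        return doc.DocumentNode.Descendants("img")
            // A missing src yields "", which is filtered out below.
            .Select(img => img.GetAttributeValue("src", ""))
            .Where(src => src.Length > 0)
            // Resolve relative paths against the page address.
            .Select(src => new Uri(pageUri, src).AbsoluteUri)
            .ToList();
    }

    static void Main()
    {
        // Made-up page address and markup for illustration.
        var page = new Uri("http://example.com/gallery/index.html");
        const string html = "<img src='pics/a.png'><img><img src='/b.png'>";

        foreach (string link in FromHtml(html, page))
        {
            Console.WriteLine(link);
        }
    }
}
```

Reading src through GetAttributeValue also avoids the NullReferenceException that x.Attributes["src"].Value would throw for an img element without a src attribute.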

How to select specific node using Html Agility Pack?

<div class="form-field wide-80 normal">1997-09-15</div>
I am trying to select the date inside it, 1997-09-15. I tried this code, but it gives an "Xpath Exception was Unhandled" error. What is wrong with the code? Please help.
string Url = "http://whois.domaintools.com/google.com";
HtmlWeb web = new HtmlWeb();
HtmlAgilityPack.HtmlDocument doc = web.Load(Url);
var SpanNodes = doc.DocumentNode.SelectNodes("//div[@class=form-field wide-80 normal]");
if (SpanNodes != null)
{
foreach (HtmlNode SN in SpanNodes)
{
string text = SN.FirstChild.InnerText.Trim();
MessageBox.Show(text);
}
}
You forgot the single quotes (') around the attribute value:
var SpanNodes =
doc.DocumentNode.SelectNodes("//div[@class='form-field wide-80 normal']");
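
A quoted comparison like this matches the class attribute as one exact string, so it stops matching if the page reorders or adds classes. As a hedged alternative, the common XPath 1.0 idiom below matches a single class token instead. The sample markup is made up, and the sketch assumes the HtmlAgilityPack NuGet package is referenced.

```csharp
using System;
using HtmlAgilityPack;

class Program
{
    static void Main()
    {
        var doc = new HtmlDocument();
        // Made-up sample: the second div shares a class but in a different order.
        doc.LoadHtml(
            "<div class='form-field wide-80 normal'>1997-09-15</div>"
            + "<div class='normal form-field'>other</div>");

        // Exact string comparison against the whole attribute value.
        HtmlNodeCollection exact = doc.DocumentNode.SelectNodes(
            "//div[@class='form-field wide-80 normal']");

        // Token comparison: pad the normalized value with spaces, then
        // look for the class name surrounded by spaces.
        HtmlNodeCollection byToken = doc.DocumentNode.SelectNodes(
            "//div[contains(concat(' ', normalize-space(@class), ' '), ' form-field ')]");

        // SelectNodes returns null when nothing matches, hence the null checks.
        Console.WriteLine("exact: " + (exact == null ? 0 : exact.Count));
        Console.WriteLine("token: " + (byToken == null ? 0 : byToken.Count));
    }
}
```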
