How to get a text between nodes - c#

I have a problem extracting the text between nodes. My code prints the entire span node, but I would like to get only the hour values, e.g. 4:45, 5:15, etc.
var html = @"https://programtv.onet.pl/";
HtmlWeb web = new HtmlWeb();
var htmldoc=web.Load(html);
var findhours = htmldoc.DocumentNode.SelectNodes("//div[@id='boxTV1']//div[@class='hours']//span[@class='hour']");
if (findhours != null)
{
foreach (var x in findhours )
{
Console.WriteLine(x.OuterHtml);
}
}
else
{
Console.WriteLine("node = null");
}
Console.ReadLine();

You can simply use the InnerText property of your HtmlNode object, which returns the text inside the node without the surrounding tags. Check out the HtmlAgilityPack documentation for details.
foreach (var x in findhours )
{
Console.WriteLine(x.InnerText);
}
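
To try this without hitting the live site, here is a minimal sketch that parses an inline fragment. The markup is an assumption made up to match the XPath in the question; the real page structure may differ.

```csharp
using System;
using HtmlAgilityPack;

class InnerTextDemo
{
    static void Main()
    {
        // Hypothetical markup shaped like the XPath in the question.
        const string html =
            "<div id='boxTV1'><div class='hours'>" +
            "<span class='hour'>4:45</span><span class='hour'>5:15</span>" +
            "</div></div>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var hours = doc.DocumentNode.SelectNodes(
            "//div[@id='boxTV1']//div[@class='hours']//span[@class='hour']");

        if (hours == null) // SelectNodes returns null when nothing matches
        {
            Console.WriteLine("node = null");
            return;
        }

        foreach (var hour in hours)
        {
            // OuterHtml would include the <span> tags; InnerText is only the text.
            Console.WriteLine(hour.InnerText.Trim());
        }
    }
}
```

The Trim() call is optional; it only guards against whitespace around the text in real-world markup.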

Related

How can I extract all links from an HTML document using HtmlAgilityPack?

HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(s1);
foreach (HtmlNode link in doc.DocumentNode.SelectNodes("//a[@href]"))
{
count++;
HtmlAttribute att = link.Attributes["href"];
if (att.Value.StartsWith("http") && !listBox1.Items.Contains(att.Value))
listBox1.Items.Add(att.Value);
}
I'm getting, for example, 151 results, but in fact there are more than 300.
In many cases a single href value contains more than one link, for example:
href="http://www.test.com dfsdfgfg https://www.test1.com 656567 http://test2.com
In these cases I need to split the value so that it is shown and counted as 3 links, not one.
I tried changing att.Value.StartsWith("http") to att.Value.Contains("http"), but that's not the solution.
Here is what you can do:
foreach (HtmlNode link in doc.DocumentNode.SelectNodes("//a[@href]"))
{
count++;
HtmlAttribute att = link.Attributes["href"];
// Split the attribute on spaces so each URL inside it is handled separately.
foreach (var url in att.Value.Split(' ')) {
if (url.StartsWith("http") && !listBox1.Items.Contains(url))
listBox1.Items.Add(url);
}
}
This will find links in the <a href="..."> tags of the HTML document. If you need to find ALL links (including those in JavaScript code, styles, etc.), you can use a regular expression, something like this:
private static readonly Regex cHttpUrlsRegex = new Regex(@"(?<url>((http|https):[/][/]|www.)([a-z]|[A-Z]|[0-9]|[_/.=&?%-]|[~])*)", RegexOptions.IgnoreCase);
public static IEnumerable<string> ExtractHttpUrls(string aText, string aMatch = null)
{
if (String.IsNullOrEmpty(aText)) yield break;
var matches = cHttpUrlsRegex.Matches(aText);
var vMatcher = aMatch == null ? null : new Regex(aMatch);
foreach (Match match in matches)
{
var vUrl = HttpUtility.UrlDecode(match.Groups["url"].Value);
if (vMatcher == null || vMatcher.IsMatch(vUrl))
yield return vUrl;
}
}
foreach (var link in ExtractHttpUrls(s1))
{
count++;
if (link.StartsWith("http") && !listBox1.Items.Contains(link))
listBox1.Items.Add(link);
}

C# HtmlAgilityPack : startIndex cannot be larger than length of string

I'm trying to do something like this :
var document = htmlWeb.Load(searchUrl);
var hotels = document.DocumentNode.Descendants("div")
.Where(x => x.Attributes.Contains("class") &&
x.Attributes["class"].Value.Contains("listing-content"));
int count = 1;
foreach (var hotel in hotels)
{
HtmlDocument htmlDoc = new HtmlDocument();
htmlDoc.OptionFixNestedTags = true;
htmlDoc.Load(hotel.InnerText);
if (htmlDoc.DocumentNode != null)
{
var anchors = htmlDoc.DocumentNode.Descendants("div")
.Where(x => x.Attributes.Contains("class") &&
x.Attributes["class"].Value.Contains("srp-business-name")); // Error Occurring in here //
foreach (var anchor in anchors)
{
Console.WriteLine(anchor.InnerHtml);
}
}
}
I'm getting results like this :
New York Marriott Marquis
<span class="external-link">
<img height="15" src="/images/sprites/search/icon-link-external.png" width="16">
</span>
And
Courtyard by Marriott New York Manhattan/Times Square South
And so on.
Now I want the InnerHtml of the anchor tags that have class="url redbold mip-link". So I'm doing this:
var document = htmlWeb.Load(searchUrl);
var hotels = document.DocumentNode.Descendants("div")
.Where(x => x.Attributes.Contains("class") &&
x.Attributes["class"].Value.Contains("listing-content"));
int count = 1;
foreach (var hotel in hotels)
{
HtmlDocument htmlDoc = new HtmlDocument();
htmlDoc.OptionFixNestedTags = true;
htmlDoc.Load(hotel.InnerText);
if (htmlDoc.DocumentNode != null)
{
var anchors = htmlDoc.DocumentNode.Descendants("div")
.Where(x => x.Attributes.Contains("class") &&
x.Attributes["class"].Value.Contains("srp-business-name"));
foreach (var anchor in anchors)
{
htmlDoc.LoadHtml(anchor.InnerHtml);
var hoteltags = htmlDoc.DocumentNode.SelectNodes("//a");
foreach (var tag in hoteltags)
{
if (!string.IsNullOrEmpty(tag.InnerHtml) || !string.IsNullOrWhiteSpace(tag.InnerHtml))
{
Console.WriteLine(tag.InnerHtml);
}
}
}
}
}
I'm getting the first result properly, which is New York Marriott Marquis, but on the second result an error occurs:
startIndex cannot be larger than length of string. What am I doing wrong?
You are using the same DOM object for all your operations:
foreach (var hotel in hotels)
{
HtmlDocument htmlDoc = new HtmlDocument();
And after that you are using the same object for loading anchor tags:
foreach (var anchor in anchors)
{
htmlDoc.LoadHtml(anchor.InnerHtml);
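Calling LoadHtml on htmlDoc replaces the content that the anchors collection is still iterating over. Just use a separate document in the inner loop and it should work as expected.
foreach (var anchor in anchors)
{
var htmlDocAnchor = new HtmlDocument();
htmlDocAnchor.LoadHtml(anchor.InnerHtml); // and so on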
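
As an alternative sketch, you may not need a second HtmlDocument at all: each HtmlNode can be queried directly with a relative XPath (one that starts with "."). The class names below come from the question; the sample markup and hotel names are made up for illustration.

```csharp
using System;
using HtmlAgilityPack;

class RelativeQueryDemo
{
    static void Main()
    {
        // Hypothetical markup using the class names from the question.
        const string html =
            "<div class='listing-content'><div class='srp-business-name'>" +
            "<a class='url redbold mip-link' href='#'>Hotel A</a></div></div>" +
            "<div class='listing-content'><div class='srp-business-name'>" +
            "<a class='url redbold mip-link' href='#'>Hotel B</a></div></div>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var hotels = doc.DocumentNode.SelectNodes(
            "//div[contains(@class,'listing-content')]");
        if (hotels == null) return; // nothing matched

        foreach (var hotel in hotels)
        {
            // The leading "." keeps the search inside the current hotel node.
            var links = hotel.SelectNodes(
                ".//div[contains(@class,'srp-business-name')]//a");
            if (links == null) continue;

            foreach (var link in links)
            {
                if (!string.IsNullOrWhiteSpace(link.InnerText))
                    Console.WriteLine(link.InnerText.Trim());
            }
        }
    }
}
```

Because nothing is re-parsed, there is no document that can be overwritten while a collection derived from it is still being enumerated.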

Scraping the 3rd node using HtmlAgilityPack

In a webpage there are several nodes having class='inner', but I need only the 3rd node having class='inner'. If I use
string x = textBox1.Text;
string q = "";
HtmlWeb web = new HtmlWeb();
HtmlAgilityPack.HtmlDocument doc = web.Load("myweb_link" + x);
HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//div[@class='inner']");
if (nodes != null)
{
foreach (HtmlNode n in nodes)
{
q = n.InnerText;
q = System.Net.WebUtility.HtmlDecode(q);
q = q.Trim();
MessageBox.Show(q);
}
}
else
MessageBox.Show("nothing found ");
it gives me all the nodes having class='inner', which I expected.
But I want only the 3rd node. How can I get that?
Get the third node from the nodes variable using the indexer:
var thirdNode = nodes[2];
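
Note that nodes[2] throws if fewer than three nodes matched, so checking Count first is safer. An XPath positional predicate is another option. The sketch below shows both; the markup is made up for illustration.

```csharp
using System;
using HtmlAgilityPack;

class ThirdNodeDemo
{
    static void Main()
    {
        // Hypothetical markup with four matching nodes.
        const string html =
            "<div class='inner'>one</div><div class='inner'>two</div>" +
            "<div class='inner'>three</div><div class='inner'>four</div>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Option 1: index the collection (0-based), guarding against
        // a null result and against fewer than three matches.
        var nodes = doc.DocumentNode.SelectNodes("//div[@class='inner']");
        if (nodes != null && nodes.Count >= 3)
            Console.WriteLine(nodes[2].InnerText);

        // Option 2: XPath positions are 1-based. The parentheses make [3]
        // apply to the whole result set rather than to each parent's children.
        var third = doc.DocumentNode.SelectSingleNode("(//div[@class='inner'])[3]");
        if (third != null)
            Console.WriteLine(third.InnerText);
    }
}
```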

How to get table from Wikipedia

I want to save one table from Wikipedia into an XML file and then parse it in C#. Is it possible? If yes, can I save only the Title and Genre columns in the XML?
HtmlWeb web = new HtmlWeb();
HtmlDocument doc = web.Load("http://en.wikipedia.org/wiki/2012_in_film");
HtmlNode node = doc.DocumentNode.SelectSingleNode("//table[@class='wikitable']");
You can use the WinForms WebBrowser control:
// First navigate to your address
webBrowser1.Navigate("http://en.wikipedia.org/wiki/2012_in_film");
List<string> Genre = new List<string>();
List<string> Title = new List<string>();
// Run the following once the page has loaded (e.g. in the DocumentCompleted handler)
foreach (HtmlElement table in webBrowser1.Document.GetElementsByTagName("table"))
{
if (table.GetAttribute("className").Equals("wikitable"))
{
foreach (HtmlElement tr in table.GetElementsByTagName("tr"))
{
int columncount = 1;
foreach (HtmlElement td in tr.GetElementsByTagName("td"))
{
//Title
if (columncount == 4)
{
Title.Add(td.InnerText);
}
//Genre
if (columncount == 7)
{
Genre.Add(td.InnerText);
}
columncount++;
}
}
}
}
Now you have two lists (Genre and Title).
You can then convert them to an XML file.
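
One way to do that conversion is LINQ to XML. This is a minimal sketch that assumes the two lists line up index by index; the element names (Films, Film) and the output file name films.xml are placeholders, not something from the original answer.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

class ListsToXmlDemo
{
    static void Main()
    {
        // Stand-in data; in the answer these lists are filled from the table.
        var titles = new List<string> { "Title A", "Title B" };
        var genres = new List<string> { "Genre A", "Genre B" };

        // Pair each title with the genre at the same index.
        var xml = new XDocument(
            new XElement("Films",
                titles.Zip(genres, (t, g) =>
                    new XElement("Film",
                        new XElement("Title", t),
                        new XElement("Genre", g)))));

        xml.Save("films.xml");   // placeholder file name
        Console.WriteLine(xml);  // also print the generated XML
    }
}
```

Zip stops at the shorter list, so if some table rows lack a genre cell the lists can drift out of alignment; building one Film element per row while scraping would avoid that.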
You can also use the code below.
Find the HTML tag you are interested in, then use a regular expression to parse the rest of the data.
This code looks for tables whose width is 150 and extracts all the link URLs and titles from them.
HtmlElementCollection links = webBrowser1.Document.GetElementsByTagName("table"); //get collection in link
{
foreach (HtmlElement link_data in links) //parse for each collection
{
String width = link_data.GetAttribute("width");
{
if (width != null && width == "150")
{
Regex linkX = new Regex("<a[^>]*?href=\"(?<href>[\\s\\S]*?)\"[^>]*?>(?<Title>[\\s\\S]*?)</a>", RegexOptions.IgnoreCase);
MatchCollection category_urls = linkX.Matches(link_data.OuterHtml);
if (category_urls.Count > 0)
{
foreach (Match match in category_urls)
{
//rest of the code
}
}
}
}
}
}
Also consider using the Wikipedia API to zero in on a particular section of a Wikipedia page:
https://en.wikipedia.org/w/api.php?action=parse&page=2012_in_film&mobileformat=html&section=1&prop=wikitext
The API documentation describes how you can format the search results for subsequent parsing.

HTML Agility Pack - Get Text From 1st STRONG Tag Inside SPAN Tag

There are 5 STRONG Tags inside my SPAN Tag from my Html document.
I want to know how to get the text from the first STRONG Tag inside the SPAN TAG?
Here is my code so far.
var web = new HtmlWeb();
var doc = web.Load(url);
var nodes = doc.DocumentNode.SelectNodes("//span[@class='advisory_link']/strong");
foreach (var node in nodes)
{
richTextBox1.Text = node.InnerHtml;
}
var nodes = doc.DocumentNode.SelectNodes("//span[@class='advisory_link']//strong[1]");
if (nodes != null)
{
foreach (var node in nodes)
{
string Description = node.InnerHtml;
return Description;
}
}
return null;
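
If only the first match is needed, SelectSingleNode may be a simpler fit, since it returns the first matching node in document order (or null). A minimal sketch, with markup made up to match the question's description:

```csharp
using System;
using HtmlAgilityPack;

class FirstStrongDemo
{
    static void Main()
    {
        // Hypothetical markup: several <strong> tags inside one <span>.
        const string html =
            "<span class='advisory_link'><strong>first</strong>" +
            "<strong>second</strong><strong>third</strong></span>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Returns the first <strong> under the span, or null if none exists.
        var first = doc.DocumentNode.SelectSingleNode(
            "//span[@class='advisory_link']//strong");

        Console.WriteLine(first != null ? first.InnerText : "not found");
    }
}
```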
