How to get table from Wikipedia - c#

I want to save one table from Wikipedia to an XML file and then parse it in C#. Is that possible? If so, can I save only the Title and Genre columns to the XML file?
HtmlWeb web = new HtmlWeb();
HtmlDocument doc = web.Load("http://en.wikipedia.org/wiki/2012_in_film");
HtmlNode node = doc.DocumentNode.SelectSingleNode("//table[@class='wikitable']");

You can use a WebBrowser control:
// First navigate to your address
webBrowser1.Navigate("http://en.wikipedia.org/wiki/2012_in_film");
List<string> Genre = new List<string>();
List<string> Title = new List<string>();
// Run the rest once the page has loaded (e.g. in the DocumentCompleted handler)
foreach (HtmlElement table in webBrowser1.Document.GetElementsByTagName("table"))
{
    if (table.GetAttribute("className").Equals("wikitable"))
    {
        foreach (HtmlElement tr in table.GetElementsByTagName("tr"))
        {
            int columncount = 1;
            foreach (HtmlElement td in tr.GetElementsByTagName("td"))
            {
                // Title
                if (columncount == 4)
                {
                    Title.Add(td.InnerText);
                }
                // Genre
                if (columncount == 7)
                {
                    Genre.Add(td.InnerText);
                }
                columncount++;
            }
        }
    }
}
Now you have two lists (Genre and Title).
You can then convert them to an XML file.
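The conversion step could look roughly like the sketch below. It uses LINQ to XML from the standard library. The class name FilmXml and the element names films, film, title and genre are made up for illustration, and the sketch assumes the two lists line up by index.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class FilmXml
{
    // Pairs titles and genres by index and builds
    // <films><film><title>..</title><genre>..</genre></film>...</films>.
    // The element names are hypothetical; pick whatever suits your schema.
    public static XDocument Build(IList<string> titles, IList<string> genres)
    {
        // Guard against lists of different length by using the shorter one.
        int count = Math.Min(titles.Count, genres.Count);

        var films = Enumerable.Range(0, count).Select(i =>
            new XElement("film",
                // InnerText can be null for empty cells, so fall back to "".
                new XElement("title", (titles[i] ?? string.Empty).Trim()),
                new XElement("genre", (genres[i] ?? string.Empty).Trim())));

        return new XDocument(new XElement("films", films));
    }
}
```

With the lists from the answer above, FilmXml.Build(Title, Genre).Save("films.xml") would write the file, and XDocument.Load("films.xml") reads it back.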

You can use the code below.
Find the HTML tag you are interested in, then use a regular expression to parse the rest of the data.
This code looks for tables whose width is 150 and extracts every link URL (href) and link text inside them.
// Get all <table> elements on the page
HtmlElementCollection links = webBrowser1.Document.GetElementsByTagName("table");
foreach (HtmlElement link_data in links) // inspect each table
{
    String width = link_data.GetAttribute("width");
    if (width != null && width == "150")
    {
        Regex linkX = new Regex("<a[^>]*?href=\"(?<href>[\\s\\S]*?)\"[^>]*?>(?<Title>[\\s\\S]*?)</a>", RegexOptions.IgnoreCase);
        MatchCollection category_urls = linkX.Matches(link_data.OuterHtml);
        if (category_urls.Count > 0)
        {
            foreach (Match match in category_urls)
            {
                //rest of the code
            }
        }
    }
}

Also consider using the Wikipedia API to zero in on a particular section of a Wikipedia page:
https://en.wikipedia.org/w/api.php?action=parse&page=2012_in_film&mobileformat=html&section=1&prop=wikitext
The API documentation describes how you can format the results for subsequent parsing.
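As a rough illustration of the API approach, the sketch below only builds the request URL. The class and method names are hypothetical. It adds format=json, which asks the API for JSON output, and leaves out the mobileformat parameter from the URL above.

```csharp
using System;

public static class WikiApi
{
    // Builds an action=parse request for the wikitext of one section of a page.
    // BuildParseUrl is a made-up helper name, not part of any library.
    public static string BuildParseUrl(string page, int section)
    {
        return "https://en.wikipedia.org/w/api.php"
            + "?action=parse"
            + "&page=" + Uri.EscapeDataString(page) // escape spaces and other reserved characters
            + "&section=" + section
            + "&prop=wikitext"
            + "&format=json";
    }
}
```

The resulting string can be downloaded with, for example, HttpClient.GetStringAsync, and the wikitext of the table then parsed from the JSON response.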

Related

how to remove rows with specific character in c#

I am trying to export some links from a website's HTML file to a DataGridView. The problem is that some href attribute values in the HTML file are just #.
I want to delete the rows whose value is #. I tried the code below, but it doesn't work: nothing happens.
private void findsuburls(string str, DataGridView dgv)
{
    HtmlWeb web = new HtmlWeb();
    HtmlAgilityPack.HtmlDocument document = web.Load(str);
    foreach (HtmlNode pages in document.DocumentNode.SelectNodes("//ul[@class='pagination ']/li/a[@href]"))
    {
        dgv.Rows.Add(pages.Attributes["href"].Value);
    }
    foreach (DataGridViewRow row in dgv.Rows)
    {
        if (row.Cells[0].Value == "#")
            dgv.Rows.Remove(row);
    }
}
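Filter them out when adding the rows in the first place. (In your version, row.Cells[0].Value == "#" compares an object with a string, which C# treats as a reference comparison, so the check is unlikely ever to be true.)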
private void findsuburls(string str, DataGridView dgv)
{
    HtmlWeb web = new HtmlWeb();
    HtmlAgilityPack.HtmlDocument document = web.Load("http://goldtag.net" + str);
    foreach (HtmlNode pages in document.DocumentNode.SelectNodes("//ul[@class='pagination ']/li/a[@href]"))
    {
        var temp = pages.Attributes["href"].Value;
        if (temp != "#")
        {
            dgv.Rows.Add(temp);
        }
    }
}
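If the rows are already in the grid and have to be removed afterwards, a common pattern is to walk the collection backwards by index instead of using foreach, so that removing an item does not disturb the items still to be visited. The sketch below shows the idea on a plain List<string> so that it runs without a form. RowCleanup and RemoveMatching are made-up names.

```csharp
using System;
using System.Collections.Generic;

public static class RowCleanup
{
    // Removes every entry equal to `unwanted`, iterating from the end so that
    // RemoveAt never shifts an index that has not been visited yet.
    public static void RemoveMatching(List<string> rows, string unwanted)
    {
        for (int i = rows.Count - 1; i >= 0; i--)
        {
            // Compare string values, not object references.
            if (string.Equals(rows[i], unwanted, StringComparison.Ordinal))
                rows.RemoveAt(i);
        }
    }
}
```

The same loop shape should carry over to a DataGridView by indexing dgv.Rows, comparing Convert.ToString(dgv.Rows[i].Cells[0].Value) with "#", and calling dgv.Rows.RemoveAt(i). That part is an untested assumption here.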

How to get a text between nodes

I have a problem extracting the text between nodes. My code prints the entire span node, but I would like to get only the hour values, e.g. 4:45, 5:15, etc.
var html = @"https://programtv.onet.pl/";
HtmlWeb web = new HtmlWeb();
var htmldoc = web.Load(html);
var findhours = htmldoc.DocumentNode.SelectNodes("//div[@id='boxTV1']//div[@class='hours']//span[@class='hour']");
if (findhours != null)
{
    foreach (var x in findhours)
    {
        Console.WriteLine(x.OuterHtml);
    }
}
else
{
    Console.WriteLine("node = null");
}
Console.ReadLine();
You can simply use the InnerText property of your HtmlNode object, which returns the text without the surrounding tags. See the HtmlAgilityPack documentation for details.
foreach (var x in findhours)
{
    Console.WriteLine(x.InnerText);
}

How can I extract all links from an HTML document using HtmlAgilityPack?

HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(s1);
foreach (HtmlNode link in doc.DocumentNode.SelectNodes("//a[@href]"))
{
    count++;
    HtmlAttribute att = link.Attributes["href"];
    if (att.Value.StartsWith("http") && !listBox1.Items.Contains(att.Value))
        listBox1.Items.Add(att.Value);
}
I'm getting, for example, 151 results, but in fact there are more than 300.
In many cases an href that it finds contains more than one link, for example:
href="http://www.test.com dfsdfgfg https://www.test1.com 656567 http://test2.com"
In these cases I need to split the value so that it is shown and counted as 3 links, not one.
I tried changing att.Value.StartsWith("http") to att.Value.Contains("http"), but that's not the solution.
Here is what you can do:
foreach (HtmlNode link in doc.DocumentNode.SelectNodes("//a[@href]"))
{
    count++;
    HtmlAttribute att = link.Attributes["href"];
    // Split the attribute on spaces and keep only the parts that look like URLs
    foreach (var url in att.Value.Split(' '))
    {
        if (url.StartsWith("http") && !listBox1.Items.Contains(url))
            listBox1.Items.Add(url);
    }
}
This will find links in the <a href="..."> tags of the HTML document. If you need to find ALL links (including those in JavaScript code, styles, etc.), you can use a regular expression, something like this:
// HttpUtility lives in the System.Web namespace
private static readonly Regex cHttpUrlsRegex = new Regex(@"(?<url>((http|https):[/][/]|www\.)([a-z]|[A-Z]|[0-9]|[_/.=&?%-]|[~])*)", RegexOptions.IgnoreCase);
public static IEnumerable<string> ExtractHttpUrls(string aText, string aMatch = null)
{
    if (String.IsNullOrEmpty(aText)) yield break;
    var matches = cHttpUrlsRegex.Matches(aText);
    // Optional second pattern used to filter the URLs that were found
    var vMatcher = aMatch == null ? null : new Regex(aMatch);
    foreach (Match match in matches)
    {
        var vUrl = HttpUtility.UrlDecode(match.Groups["url"].Value);
        if (vMatcher == null || vMatcher.IsMatch(vUrl))
            yield return vUrl;
    }
}
foreach (var link in ExtractHttpUrls(s1))
{
    count++;
    if (link.StartsWith("http") && !listBox1.Items.Contains(link))
        listBox1.Items.Add(link);
}

HTML Agility Pack - Get Text From 1st STRONG Tag Inside SPAN Tag

There are 5 STRONG Tags inside my SPAN Tag from my Html document.
I want to know how to get the text from the first STRONG Tag inside the SPAN TAG?
Here is my code so far.
var web = new HtmlWeb();
var doc = web.Load(url);
var nodes = doc.DocumentNode.SelectNodes("//span[@class='advisory_link']/strong");
foreach (var node in nodes)
{
    richTextBox1.Text = node.InnerHtml;
}
var nodes = doc.DocumentNode.SelectNodes("//span[@class='advisory_link']//strong[1]");
if (nodes != null)
{
    foreach (var node in nodes)
    {
        string Description = node.InnerHtml;
        return Description;
    }
}
return null;

Substring without breaking html c#

I'm trying to take a description that was entered in a WYSIWYG editor and take a substring of it, e.g.:
This is some <span style="font-weight:bold;">text</span>
I'd like to limit the length of some descriptions without breaking the HTML. If I just take a substring and add "...",
it breaks the HTML tags.
I've tried:
string HtmlSubstring(string html, int maxlength)
{
    string htmltag = "</?\\w+((\\s+\\w+(\\s*=\\s*(?:\".*?\"|'.*?'|[^'\">\\s]+))?)+\\s*|\\s*)/?>";
    string emptytags = "<(\\w+)((\\s+\\w+(\\s*=\\s*(?:\".*?\"|'.*?'|[^'\">\\s]+))?)+\\s*|\\s*)/?></\\1>";
    var expression = new Regex(string.Format("({0})|(.?)", htmltag));
    MatchCollection matches = expression.Matches(html);
    int i = 0;
    StringBuilder content = new StringBuilder();
    foreach (Match match in matches)
    {
        if (match.Value.Length == 1 && i < maxlength)
        {
            content.Append(match.Value);
            i++;
        }
        else if (match.Value.Length > 1)
        {
            content.Append(match.Value);
        }
    }
    return Regex.Replace(content.ToString(), emptytags, string.Empty);
}
but it doesn't quite get me there!
Use the HTML Agility Pack to load the HTML and then get InnerText.
var document = new HtmlDocument();
document.LoadHtml("...");
string text = document.DocumentNode.InnerText; // plain text with the tags stripped
Also see C#: HtmlAgilityPack extract inner text
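If the markup has to be kept rather than stripped, one possible approach is to count only visible characters and close whatever tags are still open at the cut-off point. The following is a minimal sketch of that idea, not a full HTML parser. HtmlTruncator and Truncate are made-up names, the tag pattern is deliberately simple, and the sketch assumes well-nested input in which void elements are written self-closed (<br/>) and entities such as &amp; may be counted as several characters.

```csharp
using System;
using System.Collections.Generic;
using System.Text;
using System.Text.RegularExpressions;

public static class HtmlTruncator
{
    // Simplified pattern for an opening, closing, or self-closing tag.
    private static readonly Regex TagRegex = new Regex(@"<(/?)(\w+)[^>]*?(/?)>");

    public static string Truncate(string html, int maxLength)
    {
        var output = new StringBuilder();
        var openTags = new Stack<string>(); // tags opened but not yet closed
        int visible = 0;                    // number of text characters emitted
        int pos = 0;
        bool truncated = false;

        while (pos < html.Length)
        {
            Match tag = html[pos] == '<' ? TagRegex.Match(html, pos) : Match.Empty;
            if (tag.Success && tag.Index == pos)
            {
                // Tags are copied whole and never count towards the limit.
                string name = tag.Groups[2].Value;
                if (tag.Groups[1].Value == "/")
                {
                    if (openTags.Count > 0 && openTags.Peek() == name) openTags.Pop();
                }
                else if (tag.Groups[3].Value != "/")
                {
                    openTags.Push(name);
                }
                output.Append(tag.Value);
                pos += tag.Length;
            }
            else
            {
                // Ordinary text: stop once the limit has been reached.
                if (visible >= maxLength) { truncated = true; break; }
                output.Append(html[pos]);
                visible++;
                pos++;
            }
        }

        if (truncated) output.Append("...");
        // Close any tags left open, innermost first.
        while (openTags.Count > 0) output.Append("</").Append(openTags.Pop()).Append('>');
        return output.ToString();
    }
}
```

For the sample description in the question, Truncate(html, 15) cuts inside the span but still emits the closing </span> after the ellipsis.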
