How to remove rows with a specific character in C#

I am trying to export some links from a website's HTML file to a DataGridView. The problem is that some href attribute values in the HTML file are just "#".
I want to delete the rows whose value is "#". I tried the code below, but it doesn't work and nothing happens.
private void findsuburls(string str,DataGridView dgv)
{
HtmlWeb web = new HtmlWeb();
HtmlAgilityPack.HtmlDocument document = web.Load(str);
foreach (HtmlNode pages in document.DocumentNode.SelectNodes("//ul[@class='pagination ']/li/a[@href]"))
{
dgv.Rows.Add(pages.Attributes["href"].Value);
}
foreach (DataGridViewRow row in dgv.Rows)
{
if (row.Cells[0].Value == "#")
dgv.Rows.Remove(row);
}
}

I would filter them out when adding the rows in the first place:
private void findsuburls(string str,DataGridView dgv)
{
HtmlWeb web = new HtmlWeb();
HtmlAgilityPack.HtmlDocument document = web.Load("http://goldtag.net"+str);
foreach (HtmlNode pages in document.DocumentNode.SelectNodes("//ul[@class='pagination ']/li/a[@href]"))
{
var temp = pages.Attributes["href"].Value;
if (temp != "#")
{
dgv.Rows.Add(temp);
}
}
}
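If the rows are already in the grid and have to be removed afterwards, a sketch along the following lines may help. It assumes two things about why the question's loop did nothing: `row.Cells[0].Value` is typed as `object`, so `== "#"` compares references rather than text, and removing rows inside a `foreach` over `dgv.Rows` changes the collection while it is being enumerated. The helper name `RemoveHashRows` is made up for illustration.

```csharp
using System;
using System.Windows.Forms;

static class GridHelpers
{
    // Hypothetical helper: removes every row whose first cell holds the text "#".
    public static void RemoveHashRows(DataGridView dgv)
    {
        // Walk backwards so that removing a row does not shift rows not yet visited.
        for (int i = dgv.Rows.Count - 1; i >= 0; i--)
        {
            DataGridViewRow row = dgv.Rows[i];

            // Skip the blank placeholder row shown when AllowUserToAddRows is true.
            if (row.IsNewRow) continue;

            // Cell values are objects, so compare them as strings instead of using ==.
            if (string.Equals(row.Cells[0].Value as string, "#", StringComparison.Ordinal))
            {
                dgv.Rows.RemoveAt(i);
            }
        }
    }
}
```

Calling `GridHelpers.RemoveHashRows(dgv);` after the first loop of the question would take the place of the second `foreach`.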

Related

Add columns to DataTable with loop from html file

I want to add columns to my DataTable from my <th> tags using a foreach loop.
I have a problem with it: I don't understand why there is a null exception. In my HTML file I don't have any empty tags.
A fragment of my C# code:
DataTable dt = new DataTable();
int i = 0;
HtmlNode table = doc.DocumentNode.SelectSingleNode("//table[1]");
foreach (var row in table.SelectNodes("tr"))
{
var headers = row.SelectNodes("th");
foreach (var el in headers)
{
if (headers != null)
{
dt.Columns.Add(headers[i].InnerText);
i++;
}
}
}
Here is a fragment of my HTML file:
<table>
<colgroup><col/><col/><col/><col/><col/><col/><col/><col/><col/><col/><col/><col/><col/><col/><col/><col/><col/><col/><col/></colgroup>
<tr><th>id</th><th>inserted_at</th><th>DisplayName</th><th>DistinguishedName</th><th>Enabled</th><th>GivenName</th><th>HomeDirectory</th><th>Manager</th><th>Name</th><th>ObjectClass</th><th>ObjectGUID</th><th>SamAccountName</th><th>Surname</th><th>UserPrincipalName</th><th>RowError</th><th>RowState</th><th>Table</th><th>ItemArray</th><th>HasErrors</th></tr>
This works for your html:
var str = @"<table>
<colgroup><col/><col/><col/><col/><col/><col/><col/><col/><col/><col/><col/><col/><col/><col/><col/><col/><col/><col/><col/></colgroup>
<tr><th>id</th><th>inserted_at</th><th>DisplayName</th><th>DistinguishedName</th><th>Enabled</th><th>GivenName</th><th>HomeDirectory</th><th>Manager</th><th>Name</th><th>ObjectClass</th><th>ObjectGUID</th><th>SamAccountName</th><th>Surname</th><th>UserPrincipalName</th><th>RowError</th><th>RowState</th><th>Table</th><th>ItemArray</th><th>HasErrors</th></tr>";
var hdoc = new HtmlAgilityPack.HtmlDocument();
hdoc.LoadHtml(str);
var headerElements = hdoc.DocumentNode.Descendants("th");
foreach(var headerElement in headerElements)
{
Console.WriteLine(headerElement.InnerText);
}
I also needed to select the headers from a specific table, so this is what actually worked for me:
HtmlNode table = doc.DocumentNode.SelectSingleNode("//table[1]");
var headerElements = table.Descendants("th");
foreach (var headerElement in headerElements)
{
dt.Columns.Add(headerElement.InnerText, typeof(string));
}
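As for the null exception in the question, one likely cause (judging only from the posted fragment) is that HtmlAgilityPack's `SelectNodes` returns `null` rather than an empty collection when nothing matches, so any `<tr>` without `<th>` cells makes `foreach (var el in headers)` fail before the inner null check is reached. A minimal null-safe sketch follows; the class and method names are made up for illustration, and it assumes the header texts are unique, since `DataTable` rejects duplicate column names.

```csharp
using System.Data;
using HtmlAgilityPack;

static class HeaderReader
{
    // Hypothetical helper: builds one string column per <th> of the first table.
    public static DataTable BuildColumns(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var dt = new DataTable();

        HtmlNode table = doc.DocumentNode.SelectSingleNode("//table[1]");
        if (table == null) return dt;          // no table in the document

        // ".//th" is relative to 'table'; the result is null when there are no matches.
        HtmlNodeCollection headers = table.SelectNodes(".//th");
        if (headers == null) return dt;        // table without header cells

        foreach (HtmlNode th in headers)
        {
            dt.Columns.Add(th.InnerText.Trim(), typeof(string));
        }
        return dt;
    }
}
```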

Scrape data with HtmlAgilityPack for a tag which doesn't have class

Here is my C# code. I am trying to scrape data from a website using HtmlAgilityPack, but it prints "nothing found" every time. I don't know what I am doing wrong.
HtmlAgilityPack.HtmlWeb webb = new HtmlAgilityPack.HtmlWeb();
ServicePointManager.SecurityProtocol = SecurityProtocolType.Tls12;
HtmlAgilityPack.HtmlDocument doc = webb.Load("mywebsite");
HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//ul[@class='unstyled']//li//a");
if (nodes != null)
{
foreach (HtmlNode n in nodes)
{
string q = n.InnerText;
q = System.Net.WebUtility.HtmlDecode(q);
q = q.Trim();
Console.WriteLine(q);
}
}
else
{
Console.WriteLine("nothing found");
}
Here is a picture of the markup I am trying to capture data from; I need the data from the <a> tag.
The XPath used to select the tag is incorrect.
HtmlNodeCollection nodes =
doc.DocumentNode.SelectNodes("//ul[@class='unstyled']/li/a");
This should select all the anchor nodes; you can then loop through them to get the InnerHtml.
A working sample is shown below:
string s = "<ul class='unstyle no-overflow'><li><ul class='unstyled'><li><a href='http://www.smsconnexion.com'>SMS ConneXion</a></li></ul><ul class='unstyled'><li><a href='http://www.celusion.com'>Celusion</a></li></ul></li></ul>";
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(s);
HtmlNodeCollection nodes =
doc.DocumentNode.SelectNodes("//ul[@class='unstyled']/li/a");
foreach(var node in nodes)
{
Console.WriteLine(node.Attributes["href"].Value);
}
Console.ReadLine();
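Something else that may be worth checking when nothing is found: `[@class='unstyled']` only matches when the whole attribute value is exactly `unstyled`, so a list marked up as, say, `class="unstyled wide"` would be skipped. Whether that applies to the asker's page is not known; the sketch below only illustrates the usual XPath 1.0 idiom for matching one class token, and the class and method names are made up.

```csharp
using HtmlAgilityPack;

static class ClassTokenDemo
{
    // Matches "unstyled" as one whitespace-separated token of the class attribute.
    const string TokenXPath =
        "//ul[contains(concat(' ', normalize-space(@class), ' '), ' unstyled ')]/li/a";

    // Hypothetical helper: counts the anchors found under matching lists.
    public static int CountLinks(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes(TokenXPath);
        return nodes == null ? 0 : nodes.Count;
    }
}
```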

NodeSet exception while using agilitypack

private void ShowStatistics_Click(object sender, RoutedEventArgs e)
{
HtmlAgilityPack.HtmlDocument doc = new HtmlDocument();
HtmlWeb hw = new HtmlWeb();
doc = hw.Load("http://www.gamerankings.com/browse.html");
HtmlNodeCollection nodes= doc.DocumentNode.SelectNodes("//a/");
string result = "";
foreach (var item in nodes)
{
result += item.InnerText+Environment.NewLine;
}
Info.ItemsSource = result;
}
When the button is pressed, I want to show information from the web page in a text box called Info.
After pressing the button I get an exception saying that the result of the expression should be a NodeSet. What should I do? I'm using HtmlAgilityPack.
Your XPath is wrong: "//a/" ends with a trailing slash, so it is not a complete expression. You can use this instead if you want to get all hyperlink elements:
var nodes = doc.DocumentNode.Descendants("a");
In addition to @Hung Cao's answer, you can shorten this by looping over the result directly:
foreach (HtmlAgilityPack.HtmlNode node in doc.DocumentNode.SelectNodes("Selector here")){
//your code here
}
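For completeness, here is a small sketch of the corrected query with `SelectNodes`. The XPath `//a` (without the trailing slash) selects every anchor; because `SelectNodes` returns `null` when there is no match, the result is checked before looping. The class and method names are made up for illustration.

```csharp
using System.Collections.Generic;
using HtmlAgilityPack;

static class LinkTexts
{
    // Hypothetical helper: returns the trimmed text of every <a> element.
    public static List<string> GetLinkTexts(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var result = new List<string>();

        // "//a" is a complete XPath expression; "//a/" is not.
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//a");
        if (nodes == null) return result;   // no anchors in the document

        foreach (HtmlNode a in nodes)
        {
            result.Add(a.InnerText.Trim());
        }
        return result;
    }
}
```

A list like this can be assigned to an `ItemsSource` directly, with one entry per link.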

Like statement or removal of trailing blanks in html agility pack?

I'm trying to download data from a website into a DataTable. The problem is that I cannot access the right node because there seem to be blank spaces. Here is my code so far:
public static DataTable downloadtable()
{
DataTable dt = new DataTable();
string htmlCode = "";
using (WebClient client = new WebClient())
{
client.Headers.Add(HttpRequestHeader.UserAgent, "AvoidError");
htmlCode = client.DownloadString("https://www.eex.com/en/Market%20Data/Trading%20Data/Power/Hour%20Contracts%20%7C%20Spot%20Hourly%20Auction/Area%20Prices/spot-hours-area-table/2013-08-22");
}
//this is just to check the file structure from text file
System.IO.StreamWriter file = new System.IO.StreamWriter("c:\\temp\\test.txt");
file.WriteLine(htmlCode);
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(htmlCode);
dt = new DataTable();
foreach (HtmlNode table in doc.DocumentNode.SelectNodes("//table[@class='list electricity']/tr/th[@class='title'][.='Market Area']"))
{
//This is the problem name where I get the error
foreach (HtmlNode row in table.SelectNodes("//td[@class='title'][.=' 00-01 ']"))
{
foreach (var cell in row.SelectNodes("//td"))
{
//this is to check for correct result, final result would be to dump it into datatable
Console.WriteLine(cell.InnerText);
}
}
}
return dt;
}
I'm trying to download the hourly prices from the link in the code, but it seems to fail because of trailing blanks (I think).
Is there a "like" statement for the name of a node? Or can you remove trailing blanks?
I believe your problem is that you are trying to retrieve td's from inside a td node, which obviously doesn't contain any more td's.
<tr>
<td class="title"> 00-01 </td>
<td class="spacer"></td>
<td class="r">€/MWh</td>
<td class="spacer"></td>
<td>35.34</td>
<td class="spacer"></td>
<td>34.02</td>
<td class="spacer"></td>
<td>34.02</td>
</tr>
So if you try to iterate over the result of table.SelectNodes("//td[@class='title'][.=' 00-01 ']"), it will contain no td's inside of it.
If you want all the rows starting from 00-01 you can use this one:
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(htmlCode);
foreach (HtmlNode row in doc.DocumentNode.SelectNodes("//td[@class='title'][(normalize-space(.)='00-01')]/ancestor::table"))
{
foreach (var cell in row.SelectNodes("./tr/td"))
{
if (string.IsNullOrEmpty(cell.InnerText.Trim()))
continue;
Console.WriteLine(cell.InnerText.Trim());
}
}
If you want only the 00-01 row you can use this one:
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(htmlCode);
foreach (HtmlNode row in doc.DocumentNode.SelectNodes("//td[@class='title']"))
{
if (row.InnerText.Trim() == "00-01")
{
foreach (var cell in row.ParentNode.ChildNodes)
{
if (string.IsNullOrEmpty(cell.InnerText.Trim()))
continue;
Console.WriteLine(cell.InnerText.Trim());
}
}
}
Or you can let the XPath do the trimming with normalize-space:
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(htmlCode);
foreach (HtmlNode row in doc.DocumentNode.SelectNodes("//td[@class='title'][(normalize-space(.)='00-01')]"))
{
foreach (var cell in row.ParentNode.ChildNodes)
{
if (string.IsNullOrEmpty(cell.InnerText.Trim()))
continue;
Console.WriteLine(cell.InnerText.Trim());
}
}
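One more detail about the question's inner loops that may be useful: an XPath that begins with `//` is evaluated from the document root even when it is called on a child node, so `row.SelectNodes("//td")` looks at every cell in the document rather than at the cells of that row. The sketch below uses a relative axis (`following-sibling::td`) instead. The sample markup is a cut-down assumption of the real page, the helper name is made up, and `label` is assumed to contain no quote characters.

```csharp
using System.Collections.Generic;
using HtmlAgilityPack;

static class RelativeXPathDemo
{
    // Hypothetical helper: returns the non-empty cells that follow the title cell
    // whose trimmed text equals 'label'.
    public static List<string> ValuesAfter(string html, string label)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var values = new List<string>();

        // normalize-space(.) trims the blanks around " 00-01 " before comparing.
        HtmlNode title = doc.DocumentNode.SelectSingleNode(
            "//td[@class='title'][normalize-space(.)='" + label + "']");
        if (title == null) return values;

        // Relative axis: only the cells after 'title' inside the same <tr>.
        HtmlNodeCollection cells = title.SelectNodes("following-sibling::td");
        if (cells == null) return values;

        foreach (HtmlNode cell in cells)
        {
            string text = cell.InnerText.Trim();
            if (text.Length > 0) values.Add(text);
        }
        return values;
    }
}
```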

How to get table from Wikipedia

I want to put one table from Wikipedia into an XML file and then parse it in C#. Is it possible? If yes, can I save only the Title and Genre columns in the XML?
HtmlWeb web = new HtmlWeb();
HtmlDocument doc = web.Load("http://en.wikipedia.org/wiki/2012_in_film");
HtmlNode node = doc.DocumentNode.SelectSingleNode("//table[@class='wikitable']");
You can use a WebBrowser control:
//First navigate to your address
webBrowser1.Navigate("http://en.wikipedia.org/wiki/2012_in_film");
List<string> Genre = new List<string>();
List<string> Title = new List<string>();
//When page loaded
foreach (HtmlElement table in webBrowser1.Document.GetElementsByTagName("table"))
{
if (table.GetAttribute("className").Equals("wikitable"))
{
foreach (HtmlElement tr in table.GetElementsByTagName("tr"))
{
int columncount = 1;
foreach (HtmlElement td in tr.GetElementsByTagName("td"))
{
//Title
if (columncount == 4)
{
Title.Add(td.InnerText);
}
//Genre
if (columncount == 7)
{
Genre.Add(td.InnerText);
}
columncount++;
}
}
}
}
Now you have two lists (Genre and Title), which you can simply convert to an XML file.
You can use this code.
Find the HTML tag you are interested in and use a regular expression to parse the rest of the data.
This code looks for the table whose width is 150 and collects all of its URLs/navigation URLs.
HtmlElementCollection links = webBrowser1.Document.GetElementsByTagName("table"); //get collection in link
{
foreach (HtmlElement link_data in links) //parse for each collection
{
String width = link_data.GetAttribute("width");
{
if (width != null && width == "150")
{
Regex linkX = new Regex("<a[^>]*?href=\"(?<href>[\\s\\S]*?)\"[^>]*?>(?<Title>[\\s\\S]*?)</a>", RegexOptions.IgnoreCase);
MatchCollection category_urls = linkX.Matches(link_data.OuterHtml);
if (category_urls.Count > 0)
{
foreach (Match match in category_urls)
{
//rest of the code
}
}
}
}
}
}
Also consider looking at the Wikipedia API to zero in on a particular section of a Wikipedia page:
https://en.wikipedia.org/w/api.php?action=parse&page=2012_in_film&mobileformat=html&section=1&prop=wikitext
The API documentation describes how you can format the search results for subsequent parsing.
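Since the question already starts from HtmlAgilityPack, here is a rough sketch of how the Title and Genre columns might be pulled from the selected table and written as XML with `XElement`. It rests on several assumptions that the real page may not satisfy: the table carries the `wikitable` class, its header cells read exactly "Title" and "Genre", only the header row uses `<th>`, and no cells use rowspan or colspan. The class name and the element names `films`, `film`, `title` and `genre` are made up.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;
using HtmlAgilityPack;

static class WikiTableExport
{
    // Hypothetical helper: builds <films><film><title/><genre/></film>...</films>.
    public static XElement ToXml(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        var root = new XElement("films");

        HtmlNode table = doc.DocumentNode.SelectSingleNode("//table[contains(@class,'wikitable')]");
        if (table == null) return root;

        // Find the wanted columns by header text instead of hard-coded positions.
        var headers = table.Descendants("th")
                           .Select(th => HtmlEntity.DeEntitize(th.InnerText).Trim())
                           .ToList();
        int titleIdx = headers.IndexOf("Title");
        int genreIdx = headers.IndexOf("Genre");
        if (titleIdx < 0 || genreIdx < 0) return root;

        foreach (HtmlNode tr in table.Descendants("tr"))
        {
            var cells = tr.Elements("td").ToList();
            // Skips the header row (no <td>) and rows that are too short.
            if (cells.Count <= Math.Max(titleIdx, genreIdx)) continue;

            root.Add(new XElement("film",
                new XElement("title", HtmlEntity.DeEntitize(cells[titleIdx].InnerText).Trim()),
                new XElement("genre", HtmlEntity.DeEntitize(cells[genreIdx].InnerText).Trim())));
        }
        return root;
    }
}
```

The result can be written to disk with `WikiTableExport.ToXml(html).Save("films.xml");`, where the file name is only an example.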
