Like statement or removal of trailing blanks in html agility pack? - c#

I'm trying to download data from a website into a DataTable. The problem is that I cannot access the right node, apparently because of blank spaces. Here is my code so far:
public static DataTable downloadtable()
{
DataTable dt = new DataTable();
string htmlCode = "";
using (WebClient client = new WebClient())
{
client.Headers.Add(HttpRequestHeader.UserAgent, "AvoidError");
htmlCode = client.DownloadString("https://www.eex.com/en/Market%20Data/Trading%20Data/Power/Hour%20Contracts%20%7C%20Spot%20Hourly%20Auction/Area%20Prices/spot-hours-area-table/2013-08-22");
}
//this is just to check the file structure from text file
System.IO.StreamWriter file = new System.IO.StreamWriter("c:\\temp\\test.txt");
file.WriteLine(htmlCode);
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(htmlCode);
dt = new DataTable();
foreach (HtmlNode table in doc.DocumentNode.SelectNodes("//table[@class='list electricity']/tr/th[@class='title'][.='Market Area']"))
{
//This is the problem name where I get the error
foreach (HtmlNode row in table.SelectNodes("//td[@class='title'][.=' 00-01 ']"))
{
foreach (var cell in row.SelectNodes("//td"))
{
//this is to check for correct result, final result would be to dump it into datatable
Console.WriteLine(cell.InnerText);
}
}
}
return dt;
}
I'm trying to download the hourly prices from the link in the code, but it seems to fail because of trailing blanks (I think).
Is there a like statement for the name of a node? Or can you remove trailing blanks?

I believe your problem is that you are trying to retrieve td elements from inside a td node, which obviously doesn't contain any more td's:
<tr>
<td class="title"> 00-01 </td>
<td class="spacer"></td>
<td class="r">€/MWh</td>
<td class="spacer"></td>
<td>35.34</td>
<td class="spacer"></td>
<td>34.02</td>
<td class="spacer"></td>
<td>34.02</td>
</tr>
So if you try to iterate over the result of table.SelectNodes("//td[@class='title'][.=' 00-01 ']"), the matched node will contain no td's inside of it.
If you want all the rows of the table that contains the 00-01 cell, you can use this one:
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(htmlCode);
foreach (HtmlNode row in doc.DocumentNode.SelectNodes("//td[@class='title'][(normalize-space(.)='00-01')]/ancestor::table"))
{
foreach (var cell in row.SelectNodes("./tr/td"))
{
if (string.IsNullOrEmpty(cell.InnerText.Trim()))
continue;
Console.WriteLine(cell.InnerText.Trim());
}
}
If you want only the 00-01 row you can use this one:
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(htmlCode);
foreach (HtmlNode row in doc.DocumentNode.SelectNodes("//td[@class='title']"))
{
if (row.InnerText.Trim() == "00-01")
{
foreach (var cell in row.ParentNode.ChildNodes)
{
if (string.IsNullOrEmpty(cell.InnerText.Trim()))
continue;
Console.WriteLine(cell.InnerText.Trim());
}
}
}
Or you can use it as:
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(htmlCode);
foreach (HtmlNode row in doc.DocumentNode.SelectNodes("//td[@class='title'][(normalize-space(.)='00-01')]"))
{
foreach (var cell in row.ParentNode.ChildNodes)
{
if (string.IsNullOrEmpty(cell.InnerText.Trim()))
continue;
Console.WriteLine(cell.InnerText.Trim());
}
}
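On the "like statement" part of the question: XPath 1.0, which SelectNodes accepts, offers contains() and starts-with() for substring matches, alongside normalize-space() for trimming. Below is a minimal sketch, assuming the HtmlAgilityPack NuGet package is referenced; the names HourRowReader and ReadRow are made up for illustration and are not part of the library.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class HourRowReader
{
    // Returns the trimmed, non-empty cell texts of the first row whose
    // title cell contains the given label (similar to LIKE '%label%').
    public static List<string> ReadRow(string html, string label)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // contains() does the substring match; normalize-space() strips
        // leading/trailing blanks and collapses inner whitespace runs.
        // Assumption: label contains no apostrophe (it would break the literal).
        string xpath = "//td[@class='title'][contains(normalize-space(.), '" + label + "')]";

        HtmlNode titleCell = doc.DocumentNode.SelectSingleNode(xpath);
        if (titleCell == null)
            return new List<string>(); // nothing matched

        // Walk the sibling cells through the parent <tr>.
        return titleCell.ParentNode.Elements("td")
            .Select(td => HtmlEntity.DeEntitize(td.InnerText).Trim())
            .Where(text => text.Length > 0)
            .ToList();
    }
}
```

Calling HourRowReader.ReadRow(htmlCode, "00-01") on a row like the one shown above should return the title, the unit and the prices, with the empty spacer cells dropped.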

Related

Add columns to DataTable with loop from html file

I want to add columns to my DataTable by looping with foreach over my <th> tags.
I have a problem with it: I don't understand why there is a null exception. My HTML file doesn't contain any empty tags.
Fragment of my C# code:
DataTable dt = new DataTable();
int i = 0;
HtmlNode table = doc.DocumentNode.SelectSingleNode("//table[1]");
foreach (var row in table.SelectNodes("tr"))
{
var headers = row.SelectNodes("th");
foreach (var el in headers)
{
if (headers != null)
{
dt.Columns.Add(headers[i].InnerText);
i++;
}
}
}
Here is a fragment of my HTML file:
<table>
<colgroup><col/><col/><col/><col/><col/><col/><col/><col/><col/><col/><col/><col/><col/><col/><col/><col/><col/><col/><col/></colgroup>
<tr><th>id</th><th>inserted_at</th><th>DisplayName</th><th>DistinguishedName</th><th>Enabled</th><th>GivenName</th><th>HomeDirectory</th><th>Manager</th><th>Name</th><th>ObjectClass</th><th>ObjectGUID</th><th>SamAccountName</th><th>Surname</th><th>UserPrincipalName</th><th>RowError</th><th>RowState</th><th>Table</th><th>ItemArray</th><th>HasErrors</th></tr>
This works for your HTML:
var str = @"<table>
<colgroup><col/><col/><col/><col/><col/><col/><col/><col/><col/><col/><col/><col/><col/><col/><col/><col/><col/><col/><col/></colgroup>
<tr><th>id</th><th>inserted_at</th><th>DisplayName</th><th>DistinguishedName</th><th>Enabled</th><th>GivenName</th><th>HomeDirectory</th><th>Manager</th><th>Name</th><th>ObjectClass</th><th>ObjectGUID</th><th>SamAccountName</th><th>Surname</th><th>UserPrincipalName</th><th>RowError</th><th>RowState</th><th>Table</th><th>ItemArray</th><th>HasErrors</th></tr>";
var hdoc = new HtmlAgilityPack.HtmlDocument();
hdoc.LoadHtml(str);
var headerElements = hdoc.DocumentNode.Descendants("th");
foreach(var headerElement in headerElements)
{
Console.WriteLine(headerElement.InnerText);
}
I also need to select it from a specific table, so this is what actually worked for me:
HtmlNode table = doc.DocumentNode.SelectSingleNode("//table[1]");
var headerElements = table.Descendants("th");
foreach (var headerElement in headerElements)
{
dt.Columns.Add(headerElement.InnerText, typeof(string));
}
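As for the null exception in the question: SelectNodes returns null rather than an empty collection when nothing matches, so any row without <th> cells makes the inner foreach throw before the null check inside it is reached. That the real file contains such rows is an assumption, since only the header row is shown. A sketch that checks before iterating, assuming the HtmlAgilityPack package (HeaderColumns and FromFirstTable are illustrative names):

```csharp
using System.Data;
using HtmlAgilityPack;

public static class HeaderColumns
{
    // Builds a DataTable whose columns are named after the <th> cells
    // of the first table in the given HTML.
    public static DataTable FromFirstTable(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var dt = new DataTable();
        HtmlNode table = doc.DocumentNode.SelectSingleNode("//table[1]");
        if (table == null)
            return dt; // no table in the document

        foreach (HtmlNode row in table.Elements("tr"))
        {
            HtmlNodeCollection headers = row.SelectNodes("th");
            if (headers == null)
                continue; // data rows have no <th>; skip them instead of iterating null

            foreach (HtmlNode th in headers)
            {
                // Note: DataTable rejects duplicate column names.
                dt.Columns.Add(th.InnerText.Trim(), typeof(string));
            }
        }
        return dt;
    }
}
```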

how to remove rows with specific character in c#

I am trying to export some links from a website's HTML file to a DataGridView. The problem is that some href attribute values in the HTML file are: #.
I want to delete the rows with a value of #. I tried the code below, but it doesn't work and nothing happens.
private void findsuburls(string str,DataGridView dgv)
{
HtmlWeb web = new HtmlWeb();
HtmlAgilityPack.HtmlDocument document = web.Load(str);
foreach (HtmlNode pages in document.DocumentNode.SelectNodes("//ul[@class='pagination ']/li/a[@href]"))
{
dgv.Rows.Add(pages.Attributes["href"].Value);
}
foreach (DataGridViewRow row in dgv.Rows)
{
if (row.Cells[0].Value == "#")
dgv.Rows.Remove(row);
}
}
I would filter them out when adding the rows in the first place.
private void findsuburls(string str,DataGridView dgv)
{
HtmlWeb web = new HtmlWeb();
HtmlAgilityPack.HtmlDocument document = web.Load("http://goldtag.net"+str);
foreach (HtmlNode pages in document.DocumentNode.SelectNodes("//ul[@class='pagination ']/li/a[@href]"))
{
var temp = pages.Attributes["href"].Value;
if (temp != "#")
{
dgv.Rows.Add(temp);
}
}
}
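The original loop likely does nothing for two reasons: row.Cells[0].Value is typed as object, so == compares references rather than string contents, and removing rows from dgv.Rows inside a foreach over it changes the collection being enumerated. If the rows are already in the grid, one hedged alternative is to walk them backwards by index. GridCleanup and RemoveRowsWithValue are illustrative names, not part of any library:

```csharp
using System.Windows.Forms;

public static class GridCleanup
{
    // Removes every row whose first cell equals the given text.
    public static void RemoveRowsWithValue(DataGridView dgv, string unwanted)
    {
        // Iterate from the end so removals do not shift the rows still to be visited.
        for (int i = dgv.Rows.Count - 1; i >= 0; i--)
        {
            DataGridViewRow row = dgv.Rows[i];
            if (row.IsNewRow)
                continue; // the uncommitted "new row" placeholder cannot be removed

            // object.Equals compares string contents; == on object compares references.
            if (object.Equals(row.Cells[0].Value, unwanted))
                dgv.Rows.RemoveAt(i);
        }
    }
}
```

With this helper, the second loop in the question would become GridCleanup.RemoveRowsWithValue(dgv, "#").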

how to get data from exact chtml class agility pack

I don't want to extract the whole web page, only the text from one class: td class="result-neutral". I don't know what is wrong with my code. This is the HTML:
<td class="result-neutral" xseid="xz1nBfht">3 - 2 </td>
And this is the C# code:
HtmlAgilityPack.HtmlDocument doc = new HtmlDocument();
HtmlWeb hw = new HtmlWeb();
doc = hw.Load("htt
var scoreNodes = doc.DocumentNode.Descendants("td").Where(d =>d.Attributes.Contains("class")&&d.Attributes["class"].Value.Contains("result-neutral"));
foreach (var item in scoreNodes)
{
result += item.OuterHtml + Environment.NewLine;
}
Info.Text = result;
}
OuterHtml returns the element's HTML including its start and end tags. Don't you want InnerHtml or InnerText instead?
EDIT:
This snippet works for me:
const string html = @"<html><body><table><tr><td class='result-neutral' xseid='xz1nBfht'><a href='/hockey/russia/khl/ska-st-petersburg-metallurg-magnitogorsk-xz1nBfht/'>3 - 2</a></td></tr></table></body></html>";
var doc = new HtmlDocument();
doc.LoadHtml(html);
var scoreNodes = doc.DocumentNode.Descendants("td").Where(d => d.Attributes.Contains("class") && d.Attributes["class"].Value.Contains("result-neutral"));
string result = "";
foreach (var item in scoreNodes) {
result += item.InnerText + Environment.NewLine;
}
result = result.TrimEnd(); // the result is "3 - 2"
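Note that Value.Contains("result-neutral") is a substring test, so it would also match a class such as result-neutral-old. If that matters, here is a sketch of an exact class-token match, assuming the HtmlAgilityPack package (ClassFilter and WithClass are illustrative names):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class ClassFilter
{
    // Yields descendants of root with the given tag name whose class
    // attribute contains className as a whole whitespace-separated token.
    public static IEnumerable<HtmlNode> WithClass(HtmlNode root, string tag, string className)
    {
        char[] separators = { ' ', '\t', '\r', '\n' };
        return root.Descendants(tag).Where(node =>
            node.GetAttributeValue("class", string.Empty)   // "" when the attribute is missing
                .Split(separators, StringSplitOptions.RemoveEmptyEntries)
                .Contains(className));
    }
}
```

The scoreNodes query above could then be written as ClassFilter.WithClass(doc.DocumentNode, "td", "result-neutral").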

Adding HTML table to datagridview using HTML Agility pack

I am writing a simple app that parses an HTML table into a DataGridView with the help of HTML Agility Pack, but when I run the code it throws the error "This row already belongs to this table".
I need to parse a simple HTML table like the one below:
<html>
<head>
</head>
<body>
<table>
<tr>
<td>Alfreds Futterkiste</td>
<td>Maria Anders</td>
<td>Germany</td>
</tr>
<tr>
<td>Alfreds Futterkiste</td>
<td>Maria Anders</td>
<td>Germany</td>
</tr>
</table>
</body>
</html>
This is my code:
private void button1_Click(object sender, EventArgs e)
{
var htmlCode = richTextBox1.Text.Trim();
var doc = new HtmlDocument();
doc.LoadHtml(htmlCode);
dt = new DataTable();
dt.Columns.Add("Company", typeof (string));
dt.Columns.Add("Contact", typeof (string));
dt.Columns.Add("Country", typeof (string));
var count = 0;
foreach (var table in doc.DocumentNode.SelectNodes("//table"))
{
foreach (var row in table.SelectNodes("//tr"))
{
var dr = dt.NewRow();
var i = 0;
foreach (var cell in row.SelectNodes("//td"))
{
dr["Company"] = cell.InnerText.Replace(" ", "");
dr["Contact"] = cell.InnerText.Replace(" ", "");
dr["Country"] = cell.InnerText.Replace(" ", "");
}
dt.Rows.Add(dr);
}
grid.DataSource = dt;
}
}
I need a simple output on the DataGridView like this
How can I do this with HTML Agility Pack?
Move the following line
dt.Rows.Add(dr);
outside the foreach loop over the table cells. Otherwise you add the same row to the DataTable multiple times.
int i = 0;
// "td" is relative to the current row; "//td" would select every td in the document
foreach (var cell in row.SelectNodes("td"))
{
dr[i++] = cell.InnerText;
}
dt.Rows.Add(dr);
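Putting the pieces together, here is a sketch of the whole conversion, assuming the HtmlAgilityPack package and the three-column layout from the question (TableParser and ToDataTable are illustrative names). It uses a path relative to each row, since an XPath starting with // searches the entire document regardless of the node it is called on:

```csharp
using System.Data;
using HtmlAgilityPack;

public static class TableParser
{
    // Converts the <td> cells of every table row in the HTML into DataTable rows.
    public static DataTable ToDataTable(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var dt = new DataTable();
        dt.Columns.Add("Company", typeof(string));
        dt.Columns.Add("Contact", typeof(string));
        dt.Columns.Add("Country", typeof(string));

        HtmlNodeCollection rows = doc.DocumentNode.SelectNodes("//table//tr");
        if (rows == null)
            return dt; // SelectNodes returns null when nothing matches

        foreach (HtmlNode row in rows)
        {
            HtmlNodeCollection cells = row.SelectNodes("td"); // relative to this row only
            if (cells == null)
                continue; // e.g. a header row with only <th> cells

            DataRow dr = dt.NewRow();
            // Fill columns by position; ignore cells beyond the defined columns.
            for (int i = 0; i < cells.Count && i < dt.Columns.Count; i++)
                dr[i] = cells[i].InnerText.Trim();

            dt.Rows.Add(dr); // one Add per row, after all its cells are set
        }
        return dt;
    }
}
```

The returned table can then be assigned to grid.DataSource as in the question.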

Build HTMLTable from C# Serverside with InnerHtml string

I am trying to loop through my client side Html table contents on my c# server side. Setting the Html table to runat="server" is not an option because it conflicts with the javascript in use.
I use ajax to pass my client-side HTML table's InnerHtml to my server-side method. I thought I would be able to simply create an HtmlTable variable in C# and set its InnerHtml property, but I quickly realized this is not possible because I got the error {"'HtmlTable' does not support the InnerHtml property."}
For simplicity, let's say my InnerHtml string passed from client to server is:
string myInnerHtml = "<colgroup>col width="100"/></colgroup><tbody><tr><td>hello</td></tr></tbody>"
I followed a post from another stack overflow question but can not quite get it working.
Can someone point out my errors?
string myInnerHtml = "<colgroup>col width="100"/></colgroup><tbody><tr><td>hello</td></tr></tbody>"
HtmlTable table = new HtmlTable();
System.Text.StringBuilder sb = new System.Text.StringBuilder(myInnerHtml);
System.IO.StringWriter tw = new System.IO.StringWriter(sb);
HtmlTextWriter hw = new HtmlTextWriter(tw);
table.RenderControl(hw);
for (int i = 0; i < table.Rows.Count; i++)
{
for (int c = 0; c < table.Rows[i].Cells.Count; i++)
{
// get cell contents
}
}
Hope this helps:
string myInnerHtml = @"<table>
<colgroup>col width='100'/></colgroup>
<tbody>
<tr>
<td>hello 1</td><td>hello 2</td>
</tr>
<tr>
<td>hello 3</td><td>hello 4</td>
</tr>
</tbody>
</table>";
DataSet ds = new DataSet();
ds.ReadXml(new XmlTextReader(new StringReader(myInnerHtml)));
var tr = ds.Tables["tr"];
var td = ds.Tables["td"];
foreach (DataRow trRow in tr.Rows)
foreach(DataRow tdRow in td.AsEnumerable().Where(x => (int)x["tr_Id"] == (int)trRow["tr_Id"]))
Console.WriteLine( tdRow["tr_Id"] + " | " + tdRow["td_Text"]);
// Output:
// 0 | hello 1
// 0 | hello 2
// 1 | hello 3
// 1 | hello 4
I ended up using the HtmlAgilityPack. I found it a lot more readable and manageable for diverse situations.
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml("<html><body><table>" + innerHtml + "</table></body></html>");
foreach (HtmlNode table in doc.DocumentNode.SelectNodes("//table"))
{
// ".//tr" searches only inside the current table; "//tr" would search the whole document
foreach (HtmlNode row in table.SelectNodes(".//tr"))
{
foreach (HtmlNode cell in row.SelectNodes("td"))
{
var divs = cell.SelectNodes("div");
if (divs != null)
{
foreach (HtmlNode div in divs)
{
string test = div.InnerText + div.Attributes["data-id"].Value;
}
}
}
}
}
