Parse HTML table to a CSV file (colspan and rowspan) - c#

I want to parse an HTML table into a CSV file while keeping the right number of columns and rows for colspan and rowspan.
I'm using ";" as the cell delimiter. Thus, when a cell has a colspan of 2, for example, it should be followed by two ";" instead of one.
I can extract the content of the table and break lines where each tr element ends, but I don't know how to handle colspan and rowspan.
HtmlNodeCollection rows = tables[0].SelectNodes("tr");
// Aux vars
int i;
string ncolspan = "0"; // colspan of the current cell, kept as text
// For each row...
for (i = 0; i < rows.Count; ++i)
{
// For each cell in the col...
foreach (HtmlNode cell in rows[i].SelectNodes("th|td"))
{
/* Unsuccessful attempt to treat colspan
foreach (HtmlNode n_cell in rows[i].SelectNodes("//td[@colspan]"))
{
ncolspan = n_cell.Attributes["colspan"].Value;
}
*/
text.Write(System.Text.RegularExpressions.Regex.Replace(cell.InnerText, @"\s\s+", ""));
text.Write(";");
/*
for (int x = 0; x <= int.Parse(ncolspan); x++)
{
text.Write(";");
}
*/
}
text.WriteLine();
ncolspan = "0";
}
Any help, please? Thank you!
UPDATE: Here is a simple example table to use:
<table id="T123" border="1">
<tr>
<td colspan="3"><center><font color="red">Title</font></center></td>
</tr>
<tr>
<th>R1 C1</th>
<th>R1 C2</th>
<th>R1 C3</th>
</tr>
<tr>
<td>R2 C1</td>
<td>R2 C2</td>
<td>R2 C3</td>
</tr>
<tr>
<td colspan="2">R3 C1 e C2 with "</td>
<td>R3 C3</td>
</tr>
<tr>
<td>R4 C1</td>
<td colspan=2>R4 C2 e C3 without "</td>
</tr>
<tr>
<td>R5 C1</td>
<td>R5 C2</td>
<td>R5 C3</td>
</tr>
<tr>
<td rowspan ="2">R6/R7 C1: Two lines rowspan. Must leave the second line blank.</td>
<td>R6 C2</td>
<td>R6 C3</td>
</tr>
<tr>
<td>R7 C2</td>
<td>R7 C3</td>
</tr>
<tr>
<td>End</td>
</tr>
</table>

CSV doesn't handle rowspan or colspan values. It's a very simple format that has no concept of columns or rows beyond its delimiter and the end-of-line character.
If you want to try to preserve the rowspan and colspan, you will need an intermediate object model that stores the contents of each cell and its location before you export the model to CSV. Even then, the CSV format will not preserve the colspan and rowspan as you may be hoping (i.e. the way an Excel sheet would).
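A minimal sketch of such an intermediate model, assuming each cell has already been read into a small object holding its text, colspan and rowspan. The names `SpanCell` and `SpanCsv` are hypothetical and not part of any library. Spanned slots are filled with empty fields so every CSV line ends up with the same number of columns:

```csharp
using System;
using System.Collections.Generic;

// Hypothetical intermediate model: one table cell plus its span attributes.
public sealed class SpanCell
{
    public string Text { get; }
    public int ColSpan { get; }
    public int RowSpan { get; }

    public SpanCell(string text, int colSpan = 1, int rowSpan = 1)
    {
        Text = text;
        ColSpan = Math.Max(1, colSpan);
        RowSpan = Math.Max(1, rowSpan);
    }
}

public static class SpanCsv
{
    // Expands spanned cells into a rectangular grid and joins each row with the delimiter.
    public static string ToCsv(IList<IList<SpanCell>> rows, char delimiter = ';')
    {
        var grid = new Dictionary<(int Row, int Col), string>();
        int width = 0;

        for (int r = 0; r < rows.Count; r++)
        {
            int c = 0;
            foreach (SpanCell cell in rows[r])
            {
                // Skip slots already claimed by a rowspan from an earlier row.
                while (grid.ContainsKey((r, c))) c++;

                // Clamp the rowspan so it never runs past the last row.
                int rowSpan = Math.Min(cell.RowSpan, rows.Count - r);
                for (int dr = 0; dr < rowSpan; dr++)
                    for (int dc = 0; dc < cell.ColSpan; dc++)
                        grid[(r + dr, c + dc)] = (dr == 0 && dc == 0) ? cell.Text : "";

                c += cell.ColSpan;
                width = Math.Max(width, c);
            }
        }

        var lines = new List<string>();
        for (int r = 0; r < rows.Count; r++)
        {
            var fields = new string[width];
            for (int c = 0; c < width; c++)
                fields[c] = grid.TryGetValue((r, c), out string text) ? Quote(text, delimiter) : "";
            lines.Add(string.Join(delimiter.ToString(), fields));
        }
        return string.Join("\n", lines);
    }

    // Quotes a field only when it contains the delimiter, a quote, or a line break.
    private static string Quote(string text, char delimiter)
    {
        if (text.IndexOfAny(new[] { delimiter, '"', '\r', '\n' }) < 0) return text;
        return "\"" + text.Replace("\"", "\"\"") + "\"";
    }
}
```

With HtmlAgilityPack, the span values could be read per cell with something like `cell.GetAttributeValue("colspan", 1)` before building the `SpanCell` list. Only the top-left slot of a span keeps the text, which matches the "leave the second line blank" behaviour asked for in the question.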

It is true that you cannot put a rowspan or colspan in the CSV format. What worked for me is to leave empty cells where the span should exist.
It is not the best option, but visually it looks similar:
"";SEPTIEMBRE;;OCTUBRE;;NOVIEMBRE;;TOTAL;
PRODUCTOS;cantidad;monto;cantidad;monto;cantidad;monto;cantidad;monto

Related

Html Agility Pack Loop Through Table - Get cell value based on previous cell value

I have multiple tables, and the Location value appears at a different index in each of them.
How can I get the location value when the previous cell's text is "Location" while I loop through the table? In the example below it is cells[7], but in another table it will be cells[9]. How can I conditionally get the value that follows the cell whose inner text is "Location"? Basically: find the cell "Location", then get the inner text of the next cell.
Html Table:
<table class="tbfix FieldsTable">
<tbody>
<tr>
<td class="name">Last Movement</td>
<td class="value">Port Exit</td>
</tr>
<tr>
<td class="name">Date</td>
<td class="value">26/06/2017 00:00:00</td>
</tr>
<tr>
<td class="name">From</td>
<td class="value">HAMBURGE</td>
</tr>
<tr>
<td class="name">Location</td>
<td class="value">EUROGATE HAMBURG</td>
</tr>
<tr>
<td class="name">E/F</td>
<td class="value">E</td>
</tr>
</tbody>
</table>
Controller Loop Through:
foreach (var eachNode in driver.FindElements(By.XPath("//table[contains(descendant::*, 'Last Movement')]")))
{
var cells = eachNode.FindElements(By.XPath(".//td"));
cd = new Detail();
for (int i = 0; i < cells.Count(); i++)
{
cd.ActionType = cells[1].Text.Trim();
string s = cells[3].Text.Trim();
DateTime dt = Convert.ToDateTime(s);
if (_minDate > dt) _minDate = dt;
cd.ActionDate = dt;
}
}
In your foreach loop you could use this:
var location = eachNode.FindElement(By.XPath(".//td[contains(text(),'Location')]/following-sibling::td"));
Assuming your data is always structured like that, I would loop over all the tr tags and add the data to a dictionary.
Try something like this:
Dictionary<string,string> tableData = new Dictionary<string, string>();
var trNodes = eachNode.FindElements(By.TagName("tr"));
foreach (var trNode in trNodes)
{
var name = trNode.FindElement(By.CssSelector(".name")).Text.Trim();
var value = trNode.FindElement(By.CssSelector(".value")).Text.Trim();
tableData.Add(name,value);
}
var location = tableData["Location"]; // keys are case-sensitive by default
You would have to add validation and checks for the dictionary and the structure but that is the general idea.
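The validation mentioned above could look roughly like the following sketch. It works on plain name/value pairs rather than on Selenium elements so that it stays self-contained. The `FieldTable` class and its method names are hypothetical:

```csharp
using System;
using System.Collections.Generic;

public static class FieldTable
{
    // Builds a case-insensitive lookup from (name, value) pairs.
    public static Dictionary<string, string> Build(IEnumerable<KeyValuePair<string, string>> pairs)
    {
        var data = new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase);
        foreach (var pair in pairs)
        {
            string name = (pair.Key ?? "").Trim();
            if (name.Length == 0) continue; // skip rows without a label

            // The indexer overwrites duplicates, whereas Add would throw an ArgumentException.
            data[name] = (pair.Value ?? "").Trim();
        }
        return data;
    }

    // Returns the value stored for a label, or the fallback when the label is missing.
    public static string GetOrDefault(Dictionary<string, string> data, string name, string fallback = "")
    {
        return data.TryGetValue(name, out string value) ? value : fallback;
    }
}
```

Using `StringComparer.OrdinalIgnoreCase` means a lookup for "location" still finds the "Location" row, and `TryGetValue` avoids a `KeyNotFoundException` for tables that have no such row.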

HTMLAgilityPack - Detecting a blank table?

I'm using c# with htmlagilitypack. Everything works fine except when the table I'm looking for contains no rows. I'm trying to read only the data from the 1st table on the page. The problem is if the first table contains no rows, the htmlagilitypack seems to jump down to the 2nd table for some reason.
The html I'm trying to read looks something like this:
<table class='stats'>
<tr>
<td colspan='2'>This is the 1st table</td>
<tr>
<td>Column A</td>
<td>Column B</td>
</tr>
<tr>
<td>Value A</td>
<td>Value B</td>
</tr>
</table>
<table class='stats'>
<tr>
<td colspan='2'>This is the 2nd table</td>
<tr>
<td>Column 1</td>
<td>Column 2</td>
</tr>
<tr>
<td>Value 111</td>
<td>Value 222</td>
</tr>
</table>
I then retrieve the 1st table's values using the following line:
foreach (HtmlNode node in root.SelectNodes("//table[@class='stats']/tr[position() > 2]/td"))
How do I ensure the data I'm grabbing is only from the 1st table?
Thanks.
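You can make sure that you only select the first matching table by adding a position index [1] after the table selector. Note that an index written directly after the predicate is evaluated per parent element, so it works when the tables are siblings; to take the first match in the whole document, wrap the path in parentheses, as in (//table[...])[1].
Try the following:
"//table[@class='stats'][1]/tr[position()>2]/td"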
If the first table has no rows, then you will get null back so you should check for that before iterating in the foreach.
For example you might want to do the following:
var elements = root.SelectNodes("//table[@class='stats'][1]/tr[position()>2]/td");
if (elements != null)
{
foreach (HtmlNode node in elements)
{
// process the td node
}
}
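To try the XPath itself without HtmlAgilityPack, here is a small sketch that uses the XPath support in System.Xml.Linq as a stand-in. It assumes well-formed markup, which real HTML often is not, and the `FirstTable` class is a hypothetical name. It uses the parenthesized form so that the index applies to the whole document:

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;
using System.Xml.XPath;

public static class FirstTable
{
    // Returns the td texts from row 3 onwards of the first 'stats' table only.
    public static List<string> DataCells(string xml)
    {
        XDocument doc = XDocument.Parse(xml);
        return doc.XPathSelectElements("(//table[@class='stats'])[1]/tr[position() > 2]/td")
                  .Select(td => td.Value.Trim())
                  .ToList();
    }
}
```

Unlike HtmlAgilityPack's `SelectNodes`, `XPathSelectElements` returns an empty sequence rather than null when nothing matches, so no null check is needed in this variant.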
You need to have an id on the table or row which uniquely identifies it, and then use that id in the XPath.

Parse data/numbers from table cells C# or VisualBasic

I have a string which contains html code from a webpage. There's a table in the code I'm interested in. I want to parse the numbers present in the table cells and put them in textboxes, each number in its own textbox. Here's the table:
<table class="tblSkills">
<tr>
<th class="th_first">Strength</th><td class="align_center">15</td>
<th>Passing</th><td class="align_center">17</td>
</tr>
<tr>
<th class="th_first">Stamina</th><td class="align_center">16</td>
<th>Crossing</th><td class="align_center"><img src='/pics/star.png' alt='20' title='20' /></td>
</tr>
<tr>
<th class="th_first">Pace</th><td class="align_center"><img src='/pics/star_silver.png' alt='19' title='19' /></td>
<th>Technique</th><td class="align_center">16</td>
</tr>
<tr>
<th class="th_first">Marking</th><td class="align_center">15</td>
<th>Heading</th><td class="align_center">10</td>
</tr>
<tr>
<th class="th_first">Tackling</th><td class="align_center"><span class='subtle'>5</span></td>
<th>Finishing</th><td class="align_center">15</td>
</tr>
<tr>
<th class="th_first">Workrate</th><td class="align_center">16</td>
<th>Longshots</th><td class="align_center">8</td>
</tr>
<tr>
<th class="th_first">Positioning</th><td class="align_center">18</td>
<th>Set Pieces</th><td class="align_center"><span class='subtle'>2</span></td>
</tr>
</table>
As you can see there are 14 numbers. To make things worse numbers like 19 and 20 are replaced by images and numbers lower than 6 have a span class.
I know I could use HTML agility pack or something similar, but I'm not yet that good to figure how to do it by myself, so I need your help.
Your HTML sample also happens to be good XML. You could use any of .net's XML reading/parsing techniques.
Using LINQ to XML in C#:
var doc = XDocument.Parse(yourHtml);
var properties = new List<string>(
from th in doc.Descendants("th")
select th.Value);
var values = new List<int>(
from td in doc.Descendants("td")
let img = td.Element("img")
let textValue = img == null ? td.Value : img.Attribute("alt").Value
select int.Parse(textValue));
var dict = new Dictionary<string, int>();
for (var i = 0; i < properties.Count; i++)
{
dict[properties[i]] = values[i];
}

Create asp:table inside an asp table

I want to do something like the code below using asp:table dynamically.
I know how to use Table, TableRow and TableCell, but I don't know how I can add a Table inside a TableRow.
<table>
<tr>
<td>[Image]</td>
<td>
<table>
<tr>
<td>Name</td>
<td>Test</td>
</tr>
<tr>
<td>Month</td>
<td>January</td>
</tr>
<tr>
<td>Code</td>
<td>11100</td>
</tr>
<tr>
<td>Price</td>
<td>$100,00</td>
</tr>
</table>
</td>
</tr>
</table>
Might I suggest you create a single table that takes advantage of row spans instead? Basically you would have 3 columns, where the first cell has a rowspan of 4 and the cells in the 2nd and 3rd columns do not span.
Try this,
string table = "<table><tr><td>foo</td></tr></table>";
TableRow row = new TableRow();
TableCell cell = new TableCell();
cell.Text = table;
row.Cells.Add(cell);
Table1.Rows.Add(row);

Using jQuery for table manipulation - row count is off

I have a loop on the server ( C# ) that does this:
for(i=0;i<=Request.Files.Count - 1; i++)
{
// tell the client that the upload is about to happen
// and report useful information
Update(percent, message, i, 0);
System.Threading.Thread.Sleep(1000);
// tell the client that upload succeeded
// and report useful information
Update(percent, message, i, 1);
}
The function "Update" writes to the client-side javascript function "PublishUpdate".
The row parameter is the row in the table containing the uploading file. The 'status' tells us if the file is about to be uploaded (0) or completed (1).
THE PROBLEM is that I can't seem to get the count correct. The loop seems to
start 2 or 3 rows into the table or (after playing with the row value) it ends before the
final row. I am very new to jQuery. Does anything look obviously wrong to you?
function PublishUpdate(percent, message, row, status)
{
var bodyRows = $("#picTableDisplay tbody tr");
bodyRows.each(function(index){
if (index == row && status == 0)
$('#status'+index).html("<img alt='inproc' src='images/loading.gif' />");
else if (index == row && status == 1)
$('#status'+index).html("complete");
});
}
Finally, the table looks like this:
<table width="100%" cellspacing="0" cellpadding="3" border="0" align="center" id="picTableDisplay" summary="" class="x-pinfo-table">
<tbody id="files_list" class="scrollContent">
<tr class="evenRow">
<td class="numCol" id="num0">
</td>
<td class="fnameCol" id="fname0">
</td>
<td class="statusCol" nowrap="" id="status0">
</td>
<td class="remCol" id="rem0">
</td>
</tr>
<tr class="oddRow">
<td class="numCol" id="num1">
</td>
<td class="fnameCol" id="fname1">
</td>
<td class="statusCol" nowrap="" id="status1">
</td>
<td class="remCol" id="rem1">
</td>
</tr>
<tr class="evenRow">
<td class="numCol" id="num2">
</td>
<td class="fnameCol" id="fname2">
</td>
<td class="statusCol" nowrap="" id="status2">
</td>
<td class="remCol" id="rem2">
</td>
</tr>
AND SO ON ...
Thanks in advance.
The C# is using zero indexing, and typically HTML authors use indexing starting from one. Check to see if you need to correct the index from 0 to 1-based, like this:
$('#status' + (index + 1))
Also refactoring your code to something simpler can often fix hidden errors, or at least make the error more obvious. I'd suggest something along these lines:
if (index == row)
{
var html; // declared locally so it does not leak into the global scope
if (status == 0) {
html = "<img alt='inproc' src='images/loading.gif' />";
} else {
html = "complete";
}
$('#status'+index).html(html);
}
You should also use the idiomatic C# loop condition, i < x rather than i <= x - 1:
for(i=0; i < Request.Files.Count; i++)
I'm not entirely sure what you're trying to do as it's not clear where the #status elements are. However, assuming they're cells within the row it might be better to give them a class "status" and then write something like
function PublishUpdate(percent, message, row, status) {
$('#picTableDisplay tbody tr:eq('+row+') .status').html(
status==0 ? '<img alt="inproc" src="images/loading.gif"/>' : 'complete'
);
}
