Parsing an HTML file using HtmlAgilityPack - C#

I have an HTML (DTD HTML 4.0 Transitional) file generated by Oracle Reports.
Here is the source of the HTML file:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<HTML><HEAD><META content="IE=5.0000" http-equiv="X-UA-Compatible">
<META http-equiv="Content-Type" content="text/html; charset=windows-1251">
<META name="GENERATOR" content="MSHTML 11.00.9600.17801"></HEAD>
<BODY dir="LTR" bgcolor="#ffffff"> <!-- Created by Oracle Reports -->
<TABLE width="960" border="0" cellspacing="0" cellpadding="0">
<TBODY>
<TR valign="top">
<TD height="9">
<TD width="71" rowspan="3" colspan="3"><FONT face="Courier New"
size="1"><B><TT>Date</TT></B></FONT><BR>
<TD>
<TD width="89" rowspan="3" colspan="3"><FONT face="Courier New"
size="1"><B><TT>Target Number</TT></B></FONT>
<TD>
<TD width="143" rowspan="3" colspan="7"><FONT face="Courier New"
size="1"><B><TT>Description</TT></B></FONT>
<TD colspan="11">
<TD width="101" rowspan="3" colspan="4"><FONT face="Courier New"
size="1"><B><TT>Transaction </TT></B></FONT><BR><FONT face="Courier New" size="1"><B><TT>Sum</TT></B></FONT><BR>
<TD colspan="2">
<TD width="89" rowspan="3"><FONT face="Courier New"
size="1"><B><TT>Fee</TT></B></FONT>
<TD>
<TD width="113" rowspan="3" colspan="4"><FONT face="Courier New"
size="1"><B><TT>Sum</TT></B></FONT>
<TD>
<TD width="137" rowspan="3" colspan="2"><FONT face="Courier New"
size="1"><B><TT>Device </TT></B></FONT><BR><FONT face="Courier New" size="1"><B><TT>Id</TT></B></FONT><BR>
<TD>
<TR valign="top">
<TD height="9">
<TD>
<TD>
<TD colspan="3">
<TD width="40" colspan="5"><FONT face="Courier New"
size="1"><B><TT>Reference</TT></B></FONT>
<TD colspan="3">
<TD colspan="2">
<TD>
<TD>
<TD>
<TR valign="top">
<TD height="9">
<TD>
<TD>
<TD colspan="11">
<TD colspan="2">
<TD>
<TD>
<TD>
<TR valign="top">
<TD height="9">
<TD width="71" rowspan="2" colspan="3"><FONT face="Courier New"
size="1"><TT>03/09/2015</TT></FONT>
<TD>
<TD width="89" rowspan="2" colspan="3"><FONT face="Courier New"
size="1"><TT>4405641418</TT></FONT>
<TD>
<TD width="143" rowspan="2" colspan="7"><FONT face="Courier New"
size="1"><TT>WWW.EXAMPLE.COM</TT></FONT>
<TD>
<TD width="71" rowspan="2" colspan="9"><FONT face="Courier New"
size="1"><TT>524601231313</TT></FONT>
<TD>
<TD width="101" rowspan="2" colspan="4"><FONT face="Courier New"
size="1"><TT> 1 087,00</TT></FONT>
<TD colspan="2">
<TD width="89" rowspan="2"><FONT face="Courier New"
size="1"><TT>-26,09</TT></FONT>
<TD>
<TD width="113" rowspan="2" colspan="4"><FONT face="Courier New"
size="1"><TT> 1 060,91</TT></FONT>
<TD>
<TD width="137" rowspan="2" colspan="2"><FONT face="Courier New"
size="1"><TT>11055700</TT></FONT>
<TD>
<TR valign="top">
<TD height="9">
<TD>
<TD>
<TD>
<TD>
<TD colspan="2">
<TD>
<TD>
<TD>
<TR>
<TD height="5" colspan="43">
<TR valign="top">
<TD height="9">
<TD width="71" rowspan="2" colspan="3"><FONT face="Courier New"
size="1"><TT>03/09/2015</TT></FONT>
<TD>
<TD width="89" rowspan="2" colspan="3"><FONT face="Courier New"
size="1"><TT>4405641418</TT></FONT>
<TD>
<TD width="143" rowspan="2" colspan="7"><FONT face="Courier New"
size="1"><TT>WWW.EXAMPLE.COM</TT></FONT>
<TD>
<TD width="71" rowspan="2" colspan="9"><FONT face="Courier New"
size="1"><TT>524601231313</TT></FONT>
<TD>
<TD width="101" rowspan="2" colspan="4"><FONT face="Courier New"
size="1"><TT> 55,00</TT></FONT>
<TD colspan="2">
<TD width="89" rowspan="2"><FONT face="Courier New"
size="1"><TT>-1,32</TT></FONT>
<TD>
<TD width="113" rowspan="2" colspan="4"><FONT face="Courier New"
size="1"><TT> 53,68</TT></FONT>
<TD>
<TD width="137" rowspan="2" colspan="2"><FONT face="Courier New"
size="1"><TT>11055700</TT></FONT>
<TD>
</BODY></HTML>
I need to parse that HTML into my C# entities using the HTML Agility Pack, but I'm not able to access the TT tag inside a TD tag.
Here is the C# code:
var tds = DocumentNode.SelectSingleNode("//body").SelectNodes("//tr[td[contains(@width,'71') and contains(@colspan,'3')]]").Descendants("tt");
How can I access a TT tag?

I think Kent is on to something: your document has a lot of unclosed <td> tags, and that will cause issues when parsing. I suppose there is a reason even Oracle forces this to render in IE5-compatible mode.
When looking in the debugger, you will see that HtmlAgilityPack has added a whole lot of closing tags to the end of the document (check doc.DocumentNode.OuterHtml):
</td></td></td></td></td></td></td></td></td></td></td></td></td></td></td>
</td></td></tr></td></tr></td></td></td></td></td></td></td></td></td></tr>
</td></td></td></td></td></td></td></td></td></td></td></td></td></td></td>
</td></td></tr></td></td></td></td></td></td></td></td></tr></td></td></td>
</td></td></td></td></td></td></td></tr></td></td></td></td></td></td></td>
</td></td></td></td></td></td></td></td></tr></tbody></table></body></html>
These aren't closed where they're supposed to be. Unfortunately, setting OptionFixNestedTags doesn't seem to influence the parser here, as it still needs to close these tags somewhere. Neither does setting OptionAutoCloseOnEnd = false.
The next issue you're facing is that the SelectSingleNode and SelectNodes methods return null, not an empty collection, when nothing matches, so your code will throw a NullReferenceException whenever anything is not found (which is probably the case in your code; at least it is in my little test project). If you're using C# 6 you can at least use ?. to pre-empt the exception, but that won't fix the search itself.
Then, you're first calling SelectSingleNode("//body") followed by .SelectNodes("//..."). That second call should not use //, which is anchored at the document root; it should use .// to be anchored below the body tag. As it stands, you might as well remove the SelectSingleNode("//body") call.
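The difference between // and .// is a property of XPath itself, so it can be illustrated without the Agility Pack. The following is a minimal sketch using System.Xml on a small well-formed document; the class name and the sample markup are made up for illustration. Note that System.Xml returns an empty list rather than null when nothing matches, unlike HtmlAgilityPack.

```csharp
using System;
using System.Xml;

public static class XPathAnchorDemo
{
    // Counts <td> elements found from the same context node (<body>)
    // with an absolute path ("//td") and a relative path (".//td").
    public static (int fromRoot, int fromContext) CountCells()
    {
        var doc = new XmlDocument();
        // Hypothetical sample: one <td> outside <body>, one inside it.
        doc.LoadXml(
            "<html><head><td>outside</td></head>" +
            "<body><table><tr><td>inside</td></tr></table></body></html>");

        XmlNode body = doc.SelectSingleNode("//body");

        // "//td" starts at the document root, even though it is called on <body>.
        int fromRoot = body.SelectNodes("//td").Count;

        // ".//td" only searches below the context node.
        int fromContext = body.SelectNodes(".//td").Count;

        return (fromRoot, fromContext);
    }
}
```

Calling XPathAnchorDemo.CountCells() returns 2 for the absolute path and 1 for the relative one, which is why the leading SelectSingleNode("//body") has no effect when the second query starts with //.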
Due to the nesting issues, the XPath won't find any td directly under a tr that fits your description. That is because, as far as the Agility Pack is concerned, the td you're looking for is a child of the td that precedes it.
This is the structure as it is read:
<TR valign="top">
<TD height="9">
<TD width="71" rowspan="3" colspan="3"><FONT face="Courier New"
size="1"><B><TT>Date</TT></B></FONT><BR>
<TD></td>
</td>
</td>
</tr>
So in order to find your tt tags, you'll have to resort to:
var tds = doc.DocumentNode.SelectNodes("//body//tr//td[@width=71 and @colspan=3]");
Note that I also simplified the attribute lookups, as contains will cause issues if there are any cells with colspan=33 or width=171, for example.
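To see why contains is too loose here, the two predicates can be compared on a tiny fragment. This is only a sketch using System.Xml (not the Agility Pack); the class name and the two sample cells are invented for the comparison.

```csharp
using System.Xml;

public static class AttributeMatchDemo
{
    // Returns how many <td> elements the given XPath selects
    // in a hypothetical row with one wanted and one unwanted cell.
    public static int Count(string xpath)
    {
        var doc = new XmlDocument();
        doc.LoadXml(
            "<tr>" +
            "<td width='71' colspan='3'>wanted</td>" +
            "<td width='171' colspan='33'>unwanted</td>" +
            "</tr>");
        return doc.SelectNodes(xpath).Count;
    }
}
```

With this fragment, Count("//td[contains(@width,'71') and contains(@colspan,'3')]") selects both cells, while Count("//td[@width=71 and @colspan=3]") selects only the first one.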
Your best option is probably to go back to the source of the report and query the database directly, or to fix the document first by closing any empty <td> elements before parsing it further.
There may be ways of making the parser treat td and tr differently by changing the ElementsFlags for those elements before loading the document, but my attempts all ran into the same issues you're already encountering.
HtmlNode.ElementsFlags.Remove("td");
HtmlNode.ElementsFlags.Add("td", HtmlElementFlag.Closed | HtmlElementFlag.Empty);
HtmlNode.ElementsFlags.Remove("tr");
HtmlNode.ElementsFlags.Add("tr", HtmlElementFlag.Closed);
https://stackoverflow.com/a/293357/736079
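The idea of fixing the document before parsing could look roughly like the sketch below. It is not part of the Agility Pack: TableCellFixer and CloseOpenCells are hypothetical names, and the approach assumes there are no nested tables and that "<td" or "<tr" never appears inside comments or attribute values. It walks the table tags with a regular expression and inserts the closing </td> and </tr> tags that the report leaves out.

```csharp
using System.Text;
using System.Text.RegularExpressions;

public static class TableCellFixer
{
    // Hypothetical helper: closes any cell or row that is still open
    // when the next cell, row, or table-level closing tag starts.
    public static string CloseOpenCells(string html)
    {
        bool cellOpen = false;
        bool rowOpen = false;

        return Regex.Replace(html, @"<(/?)(td|tr|tbody|table)\b[^>]*>", match =>
        {
            bool isClosingTag = match.Groups[1].Value == "/";
            string name = match.Groups[2].Value.ToLowerInvariant();
            var result = new StringBuilder();

            if (name == "td" && !isClosingTag)
            {
                // A new cell starts: close the previous one first.
                if (cellOpen) result.Append("</td>");
                cellOpen = true;
            }
            else if (name == "td")
            {
                // The document closed this cell itself.
                cellOpen = false;
            }
            else if (name == "tr" && !isClosingTag)
            {
                // A new row starts: close the open cell and row first.
                if (cellOpen) result.Append("</td>");
                if (rowOpen) result.Append("</tr>");
                cellOpen = false;
                rowOpen = true;
            }
            else if (isClosingTag)
            {
                // </tr>, </tbody> or </table>: nothing may stay open inside.
                if (cellOpen) result.Append("</td>");
                if (name != "tr" && rowOpen) result.Append("</tr>");
                cellOpen = false;
                rowOpen = false;
            }

            return result.Append(match.Value).ToString();
        }, RegexOptions.IgnoreCase);
    }
}
```

The output of CloseOpenCells would then be passed to the parser instead of the raw file. Since the report has no closing </TBODY> or </TABLE>, the very last cell and row are still left for the parser to close at the end of the document.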

If it is only the TT tags you want:
HtmlNodeCollection tds = DocumentNode.SelectNodes("//body[@dir='LTR']//table//tbody//tr//td//tt");
This should give you all the TT tags.
Next time, could you give a shorter and more concrete HTML file? This one doesn't have closing tags for table or tbody.
Also, I think you have to set the option for fixing nested tags to true, or else it will ignore the td and tt tags.
HtmlAgilityPack.HtmlDocument htmlDoc = new HtmlAgilityPack.HtmlDocument();
htmlDoc.OptionFixNestedTags=true;

Related

C# XML Parse string [duplicate]

I would like to read in a dynamic URL what contains a HTML file, and read it like an XML file, based on nodes (HTML tags). Is this somehow possible?
I mean, there is this HTML code:
<table class="bidders" cellpadding="0" cellspacing="0">
<tr class="bidRow4">
<td>kucik (automata)</td>
<td class="right">9 374 Ft</td>
<td class="bidders_date">2010-06-10 18:19:52</td>
</tr>
<tr class="bidRow4">
<td>macszaf (automata)</td>
<td class="right">9 373 Ft</td>
<td class="bidders_date">2010-06-10 18:19:52</td>
</tr>
<tr class="bidRow2">
<td>kucik (automata)</td>
<td class="right">9 372 Ft</td>
<td class="bidders_date">2010-06-10 18:19:42</td>
</tr>
<tr class="bidRow2">
<td>macszaf (automata)</td>
<td class="right">9 371 Ft</td>
<td class="bidders_date">2010-06-10 18:19:42</td>
</tr>
<tr class="bidRow0">
<td>kucik (automata)</td>
<td class="right">9 370 Ft</td>
<td class="bidders_date">2010-06-10 18:19:32</td>
</tr>
<tr class="bidRow0">
<td>macszaf (automata)</td>
<td class="right">9 369 Ft</td>
<td class="bidders_date">2010-06-10 18:19:32</td>
</tr>
<tr class="bidRow8">
<td>kucik (automata)</td>
<td class="right">9 368 Ft</td>
<td class="bidders_date">2010-06-10 18:19:22</td>
</tr>
<tr class="bidRow8">
<td>macszaf (automata)</td>
<td class="right">9 367 Ft</td>
<td class="bidders_date">2010-06-10 18:19:22</td>
</tr>
<tr class="bidRow6">
<td>kucik (automata)</td>
<td class="right">9 366 Ft</td>
<td class="bidders_date">2010-06-10 18:19:12</td>
</tr>
<tr class="bidRow6">
<td>macszaf (automata)</td>
<td class="right">9 365 Ft</td>
<td class="bidders_date">2010-06-10 18:19:12</td>
</tr>
</table>
I want to parse this into a ListView (or a Grid) to create rows with the data contained. Each tr is a different row, and each td in a given tr is a column in that row.
I also want it to be as fast as possible, as it would update itself every 5 seconds.
Is there any library for this?
I recommend HTML Agility Pack. You'll have to handle the GUI part yourself. It doesn't require valid HTML, but creates an HtmlDocument similar to XmlDocument.
Sure, it's possible. But be warned: a compliant XML processor is supposed to treat anything that's not well-formed as a fatal error. That means it's only going to work on documents that are well-formed XML, such as valid XHTML Strict.
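As a rough illustration of the XML route, the table in the question happens to be well-formed, so LINQ to XML can read it directly. The sketch below uses hypothetical names (BidTableReader, ReadRows) and returns each row as an array of cell texts; anything that is not well-formed makes XElement.Parse throw an XmlException, which is the fatal error described above.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class BidTableReader
{
    // Reads every <tr> as one row and every <td> in it as one column.
    // Throws System.Xml.XmlException if the markup is not well-formed XML.
    public static List<string[]> ReadRows(string tableMarkup)
    {
        XElement table = XElement.Parse(tableMarkup);

        return table.Descendants("tr")
            .Select(tr => tr.Elements("td")
                            .Select(td => td.Value.Trim())
                            .ToArray())
            .ToList();
    }
}
```

Each string[] can then be turned into a ListViewItem or a grid row. For HTML that is not guaranteed to be well-formed, a tolerant parser such as HTML Agility Pack remains the safer choice.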
I normally use Fast XPath Reader in combination with LINQ to XML for the job. It is rather old (2007), though.
I wasn't aware of the HTML Agility Pack, so I can't say how it compares (in both performance and ease of use).
Why not just do string replacement to convert the HTML table into XML:
<table class="bidders" cellpadding="0" cellspacing="0">
becomes:
<?xml version="1.0" encoding="UTF-8"?>
and
<tr class="bidRow4">
becomes
<item>
and
<td class="right">
becomes
<field1>
etc
EDIT 1:
I think the DataSet class also has a:
.ReadXml
method such that you could then databind to that dataset:
DataSet ds = new DataSet();
ds.ReadXml("foo.xml");
DataGrid.DataSource = ds;
DataGrid.DataBind();
or something similar

Error in Dynamic Data filtering while upgrading .NET Framework 3.5 to 4.5

I'm getting the following error on my custom 'List' page that uses Dynamic Data Filtering.
"The DynamicControl/DynamicField needs to exist inside a data control that is bound to a data source that supports Dynamic Data."
I am upgrading from .NET Framework 3.5 to 4.5 and using the Catalyst.Web.DynamicData DLL.
I am using Visual Studio 2017.
<%@ Register Assembly="Catalyst.Web.DynamicData" Namespace="Catalyst.Web.DynamicData" TagPrefix="ddasp" %>
<%@ Register Src="~/DynamicDataa/Content/GridViewPager.ascx" TagName="GridviewPager" TagPrefix="asp" %>
<%@ Register Src="~/DynamicDataa/Content/FilterUserControl.ascx" TagName="DynamicFilter" TagPrefix="asp" %>
<ddasp:DynamicFilterForm ID="DynamicFilterForm1" DataSourceID="GridDataSource" runat="server">
<table>
<tr>
<td>
Nombre activo</td>
<td>
<asp:DynamicFilterControl ID="DynamicFilterControl3" runat="server" DataField="Name"
FilterMode="Contains" /></td>
<td>
Marca</td>
<td>
<asp:DynamicFilterControl ID="DynamicFilterControl1" runat="server" DataField="Brand"
FilterMode="Contains" /></td>
</tr>
<tr>
<td>
Modelo</td>
<td>
<asp:DynamicFilterControl ID="DynamicFilterControl2" runat="server" DataField="Model"
FilterMode="Contains" /></td>
<td>
Ubicacion</td>
<td>
<asp:DynamicFilterControl ID="DynamicFilterControl4" runat="server" DataField="Location"
FilterMode="Contains" /></td>
</tr>
<tr>
<td>
Serial</td>
<td>
<asp:DynamicFilterControl ID="DynamicFilterControl5" runat="server" DataField="Serial"
FilterMode="Contains" /></td>
<td>
Sucursal</td>
<td>
<asp:DynamicFilterControl ID="DynamicFilterControl6" runat="server" DataField="Branch"
FilterMode="Equals" /></td>
</tr>
<tr>
<td>
Tipo de Servicio</td>
<td>
<asp:DynamicFilterControl ID="DynamicFilterControl7" runat="server" DataField="ServiceType"
FilterMode="Equals" /></td>
<td>
Capacidad</td>
<td>
<asp:DynamicFilterControl ID="DynamicFilterControl8" runat="server" DataField="Capacity"
FilterMode="Contains" /></td>
</tr>
<tr>
<td colspan="4">
<asp:LinkButton ID="LinkButton1" runat="server" CommandName="Search" CausesValidation="false">Search</asp:LinkButton><br />
<asp:LinkButton ID="LinkButton2" runat="server" CommandName="Clear" CausesValidation="false">Clear</asp:LinkButton><br />
<asp:LinkButton ID="LinkButton3" runat="server" CommandName="Browse" CausesValidation="false">Browse</asp:LinkButton>
</td>
</td>
</tr>
</table>
</FilterTemplate>
</ddasp:DynamicFilterForm>

Why do the rows in my column suddenly jump to the right? And how do I fix that?

Good day everyone. While trying to add a few date fields to a popup window on a site I am making, I experienced something odd. Adding these 3 rows to the popup caused the column they were supposed to be in to jump to the right, and now I cannot get them to line up.
I do not know how important it is to note, but there is a textbox to the left of the column, and the boxes/rows I am adding will be below the textbox's height.
Below I have tried to take a slice of the code as an example; if it is not enough, I will attempt to add more:
<table style="float: left;">
<tr>
<div>
<tr>(the following code shows normally like it should)
<td align="left" valign="top" colspan="4">Lable:</td>
</tr>
<tr>
<td width="2px"></td>
<td align="left" valign="top">Label</td>
<td colspan="3">
<telerik:RadDatePicker ID="RDP1" runat="server"
Culture="Language"
DbSelectedDate='<%# (Container is GridEditFormInsertItem)? DateTime.Today : Eval("EVAL1") %>'
Width="145px">
<Calendar ID="Calendar3" runat="server" UseColumnHeadersAsSelectors="False" UseRowHeadersAsSelectors="False" ViewSelectorText="x">
</Calendar>
<DatePopupButton HoverImageUrl="" ImageUrl="" />
<DateInput ID="DateInput3" runat="server" DateFormat="dd-MM-yyyy" DisplayDateFormat="dd-MM-yyyy">
</DateInput>
</telerik:RadDatePicker>
</td>
</tr>
<tr>
<td></td>
<td colspan="3">Label</td>
</tr>
<tr>
<td></td>
<td align="left" valign="top">Label:</td>
<td align="left" valign="top" colspan="3">
<telerik:RadDatePicker ID="RDP2" runat="server" Culture="Language" DbSelectedDate='<%# Eval("EVAL2") %>' Width="170px">
<Calendar ID="Calendar5" runat="server" UseColumnHeadersAsSelectors="False" UseRowHeadersAsSelectors="False" ViewSelectorText="x">
</Calendar>
<DateInput ID="DateInput5" runat="server" DateFormat="dd-MM-yyyy" DisplayDateFormat="dd-MM-yyyy">
</DateInput>
</telerik:RadDatePicker>
<asp:ImageButton ID="btnDelete" runat="server" ImageUrl="url" OnClick="btnFunction_Click" ToolTip="Text" Style="vertical-align:middle;" />
</td>
</tr>
</div>
</tr>
For those who would like to see the CSS: there is none, at least none that would have an impact on my problem, as it targets the actual webpage and not the popup window.
Thank you in advance for your help and your time.
The problem occurs because your rows don't add up to an equal number of columns (<td> cells, counting colspan):
First row - 1 td with colspan 4 > total 4
Second row - 1 td + 1 td + 1 td with colspan 3 > total 5
Third row - 1 td + 1 td with colspan 3 > total 4
Fourth row - 1 td + 1 td + 1 td with colspan 3 > total 5
Your table is also not structured correctly (you have a <div> and further <tr> elements nested directly inside a <tr>).
Ensure your table is in the following structure:
<table>
<tr>
<td>
Also check that every row adds up to the same number of columns; use colspan="" to merge cells if needed.
<table>
<tr>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td colspan="4"></td>
</tr>
</table>
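The per-row totals listed above can also be checked mechanically. The following is a small sketch, assuming the table markup has first been reduced to well-formed XML with plain tr and td elements (server controls such as telerik:RadDatePicker would need to be stripped or given a namespace declaration first); ColumnCounter and ColumnsPerRow are hypothetical names.

```csharp
using System.Linq;
using System.Xml.Linq;

public static class ColumnCounter
{
    // For every <tr>, adds up the colspan of each <td>
    // (a missing colspan attribute counts as 1).
    public static int[] ColumnsPerRow(string tableMarkup)
    {
        return XElement.Parse(tableMarkup)
            .Descendants("tr")
            .Select(tr => tr.Elements("td")
                            .Sum(td => (int?)td.Attribute("colspan") ?? 1))
            .ToArray();
    }
}
```

If the returned numbers are not all equal, at least one row is wider than the others, and the browser will push some cells to the right to make room.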

Using a C# variable as a table element

What I have is this:
<table>
<tr bgcolor="#007ACC" style="color:White">
<td width="145">Account Group</td>
<td width="80"></td>
<td width="10">Active</td>
</tr>
<tr>
·
·
</tr>
</table>
What I need is for "Account Group" to change based on the user's treeview selection; i.e., if the user selects a child node, I need to change it to "Account Number".
Is it possible to change a table element on the fly like that? If so, how would I do this?
Place a label in the <td> to display the text, so that you can change it through the label's ID:
<td width="145">
<asp:Label Text="Account Group" ID="lblUserContent" runat="server" />
</td>
When the treeview selection changes, you can change the text using the following code:
if (/* your condition */)
    lblUserContent.Text = "Account Number";
else
    lblUserContent.Text = "Account Group";
The best way to do this will depend on how you're using your treeview, but here's a quick way to output the value of a C# variable into your table:
<table>
<tr bgcolor="#007ACC" style="color:White">
<td width="145"><%# Eval("MyCSharpVariable") %></td>
<td width="80"></td>
<td width="10">Active</td>
</tr>
<tr>
·
·
</tr>
</table>

Convert a html table into an rss feed

I have a table similar to the one below which I would like to convert into an RSS feed somehow. What is the best way to approach this? Should I be scraping the contents and trying to build up an RSS feed, or is there a much simpler and easier way (I'm hoping)? I'm using ASP.NET / C#. If anyone could point me to any tutorials that would help me achieve this, that would be great :)
<table align="left" cellspacing="0" cellpadding="0">
<tbody>
<tr>
<td align="left" valign="top" style="width: 125px; height: 125px;" colspan="1"><img title="Costa Rica" alt="Costa Rica" src="/CR_sq.jpg?n=4185" /></td>
<td align="left" valign="top" colspan="1"><strong><font color="#fff" size="2">Costa Rica <br /></font><span class="SubHeadingGrey_7_0">16 August 2012</span></strong><br /><br />Some Text Here <a title="...read on" href="/WorkArea/linkit.aspx?LinkIdentifier=id&ItemID=1234">...read on</a></td>
</tr>
<tr>
<td align="left" valign="top" style="width: 125px; height: 125px;"><img width="117" height="117" title="South Africa" style="width: 117px; height: 117px;" alt="AL 2012 Icon" src="/SA2012.jpg?width=117&height=117&mode=max" /></td>
<td align="left" valign="top"><p><strong><font color="#fff" size="2">South African Story<br /></font><span class="SubHeadingGrey_7_0">16 August 2012</span></strong></p>
<p>This is summary text <a title="... read on" href="/SA.aspx">... read on</a></p>
</td>
</tr>
<tr>
<td align="left" valign="top" style="width: 125px; height: 125px;"><img title="ITALY" alt="ITALY" src="/Italy.jpg?n=43" /></td>
<td align="left" valign="top"><strong><font color="#fff" size="2">Italian Article<br /></font><span class="SubHeadingGrey_7_0">15 August 2012</span></strong><br /><br />Italian Visit Article<a title="...read on" href="/WorkArea/linkit.aspx?LinkIdentifier=id&ItemID=1256">...read on</a></td>
</tr>
</tbody>
</table>
As long as the HTML is well-formed XML, you can read it in as XML and then use XSLT to convert it to an RSS feed using XslTransform. Here is a simple example of how to use XslTransform: http://www.xmlfiles.com/articles/cynthia/xslt/default.asp
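A minimal sketch of that idea is shown below. It uses XslCompiledTransform, the newer replacement for XslTransform, rather than the class named above. The channel title, link and description, the class name TableToRss, and the choice of taking the item title from the font element and the link from the first anchor in the second cell are all assumptions made for the example. Note that the table in the question contains raw & characters in its href values; these would have to be escaped as &amp; before the markup can be loaded as XML.

```csharp
using System.IO;
using System.Xml;
using System.Xml.Xsl;

public static class TableToRss
{
    // Hypothetical stylesheet: one RSS <item> per table row.
    private const string Stylesheet = @"<xsl:stylesheet version='1.0'
    xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>
  <xsl:output method='xml' indent='yes'/>
  <xsl:template match='/'>
    <rss version='2.0'>
      <channel>
        <title>Articles</title>
        <link>http://example.com/</link>
        <description>Feed generated from an HTML table</description>
        <xsl:for-each select='//tr'>
          <item>
            <title><xsl:value-of select='normalize-space(td[2]//font)'/></title>
            <link><xsl:value-of select='td[2]//a/@href'/></link>
          </item>
        </xsl:for-each>
      </channel>
    </rss>
  </xsl:template>
</xsl:stylesheet>";

    // Transforms well-formed table markup into an RSS 2.0 document.
    public static string Transform(string tableXml)
    {
        var xslt = new XslCompiledTransform();
        using (XmlReader styleReader = XmlReader.Create(new StringReader(Stylesheet)))
        {
            xslt.Load(styleReader);
        }

        using (XmlReader input = XmlReader.Create(new StringReader(tableXml)))
        using (var output = new StringWriter())
        {
            xslt.Transform(input, null, output);
            return output.ToString();
        }
    }
}
```

The dates in the table ("16 August 2012") are left out on purpose: an RSS pubDate is expected in RFC 822 format, so they would need converting first. The relative links would also have to be made absolute for most feed readers.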
