HtmlAgilityPack adding div elements to existing html file - c#

This is my original html:
<tr>
<td style="padding-left: 40pt;"><font style="background-color: lightgreen" color="black">Tove</font></td>
<td style="padding-left: 40pt;"><font style="background-color: lightgreen" color="black">To</font></td>
</tr>
And my goal is to have this:
<div class="select-me"><tr>...</tr></div>
I am using HtmlAgilityPack and essentially going through each font tag and checking whether its style is lightgreen. But I'm not sure how to jump back up to the table row tags and put a div tag around them.

You can wrap each matching row in a div like this (OuterHtml is read-only in HtmlAgilityPack, so the row is replaced with a new node instead of being assigned to):
foreach (var node in selectMe)
    node.ParentNode.ReplaceChild(
        HtmlNode.CreateNode("<div class=\"select-me\">" + node.OuterHtml + "</div>"),
        node);
You can also select the rows directly with XPath instead of checking the font tags one by one:
var selectMe = doc.DocumentNode.SelectNodes("//tr[.//font[contains(@style,'background-color: lightgreen')]]");

Related

Break out an html-element from within a table-element

I'm having problems finding a proper way of breaking out the H4-tag from the following code. Not only do I need to make it stay in the code, but I also need to delete the table it currently sits in.
So, how do I delete the whole table and keep the h4-tag where it is?
<table align="center" border="0" cellpadding="0" cellspacing="0">
<tr><td height="30" align="center" colspan="5"><h4>IMPORTANT HEADLINE ABOUT THIS PARTICULAR PAGE</h4></td></tr>
<tr>
<td><img name="contents" src="../figs/contents.gif" border="0" alt="" onload=""></td>
<td><img src="../figs/iauthori.gif" alt="" name="authorindex" width="120" height="20" border="0" onload=""></td>
<td><img src="../figs/isubji.gif" alt="" name="subjindex" width="120" height="20" border="0" onload=""></td>
<td><img src="../figs/isearch.gif" alt="" name="search" width="120" height="20" border="0" onload=""></td>
<td><img name="home" src="../figs/ihome.gif" border="0" alt="" onload=""></td>
</tr>
</table>
In addition, I have about 2500 HTML documents with a similar structure, but they were written in different versions of HTML and so use divs, tables, or other elements from version to version. So I need a method that can be adapted to those variations.
I have a document load ready, it loads all files in a list, so I will be feeding a method this list of filenames to open and parse. But I can't figure out how to use XPath for this one.
One way to solve the problem is to find all <h4> nodes, walk up each node's parent chain until you find a stop tag/node, and replace the stop tag/node with your <h4>:
Given some sample HTML that resides in an HTML file:
var html =
@"<!doctype html system 'html.dtd'>
<html><head></head>
<body>
<table align='center' border='0' cellpadding='0' cellspacing='0'>
<tr><td height='30' align='center' colspan='5'><h4>IMPORTANT HEADLINE ABOUT THIS PARTICULAR PAGE</h4></td></tr>
<tr>
<td><a href='index.html'><img name='contents' src='../figs/contents.gif' border='0' alt='' onload=''></a></td>
<td><a href='../page.html'><img src='../figs/iauthori.gif' alt='' name='authorindex' width='120' height='20' border='0' onload=''></a></td>
<td><a href='../page.html'><img src='../figs/isubji.gif' alt='' name='subjindex' width='120' height='20' border='0' onload=''></a></td>
<td><a href='../search.html'><img src='../figs/isearch.gif' alt='' name='search' width='120' height='20' border='0' onload=''></a></td>
<td><a href='../page.html'><img name='home' src='../figs/ihome.gif' border='0' alt='' onload=''></a></td>
</tr>
</table>
<div>
<h4>H4 nested in DIV</h4>
<p>Paragraph <strong>bold</strong> <a href=''>Hyperlink</a></p>
</div>
<p><h4>H4 nested in P</h4></p>
</body>
</html>";
Parse it with this method:
// Requires: using System.Linq; using HtmlAgilityPack;
public string ParseHtmlToString(string inputFilePath)
{
    var document = new HtmlDocument();
    document.Load(inputFilePath);
    // SelectNodes returns null (not an empty collection) when nothing matches
    var wantedNodes = document.DocumentNode.SelectNodes("//h4");
    if (wantedNodes == null) return document.DocumentNode.WriteTo();
    // stop at these tags while walking backwards up the chain
    var stopTags = new string[] { "table", "div", "p" };
    foreach (var node in wantedNodes)
    {
        HtmlNode testNode = node;
        HtmlNode parentNode;
        while ((parentNode = testNode.ParentNode) != null)
        {
            if (stopTags.Contains(parentNode.Name))
            {
                // replace the enclosing stop tag with the <h4> itself
                parentNode.ParentNode.ReplaceChild(node, parentNode);
            }
            testNode = parentNode;
        }
    }
    return document.DocumentNode.WriteTo();
}
Then you can assign the parsed HTML to a variable like this:
var parsedHtml = ParseHtmlToString(INPUT_FILE);
which returns the following value:
<!doctype html system 'html.dtd'>
<html><head></head>
<body>
<h4>IMPORTANT HEADLINE ABOUT THIS PARTICULAR PAGE</h4>
<h4>H4 nested in DIV</h4>
<h4>H4 nested in P</h4>
</body>
</html>
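For documents that happen to be well-formed XML, the same walk-up-and-replace idea can be sketched without HtmlAgilityPack, using System.Xml.Linq from the base class library. This is only an illustration under that assumption: the class name BreakOutDemo, the method Hoist, and the sample markup are made up for this example, and real-world HTML will usually still need a tolerant parser such as HtmlAgilityPack.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

public static class BreakOutDemo
{
    // Same stop tags as in the answer above.
    private static readonly string[] StopTags = { "table", "div", "p" };

    // Replaces the outermost stop-tag ancestor of each <h4> with the <h4> itself.
    // Assumes well-formed XML input (hypothetical helper, not part of the answer).
    public static string Hoist(string xml)
    {
        var root = XElement.Parse(xml);
        // ToList() takes a snapshot so the tree can be modified while iterating.
        foreach (var h4 in root.Descendants("h4").ToList())
        {
            // Ancestors() yields the nearest ancestor first, so the last match is the outermost.
            var container = h4.Ancestors()
                              .LastOrDefault(a => StopTags.Contains(a.Name.LocalName));
            // Skip containers that are the root or were already detached by an earlier replacement.
            if (container != null && container.Parent != null)
                container.ReplaceWith(h4);
        }
        return root.ToString(SaveOptions.DisableFormatting);
    }

    public static void Main()
    {
        var xml = "<body><table><tr><td><h4>A</h4></td></tr></table>"
                + "<div><h4>B</h4><p>x</p></div></body>";
        Console.WriteLine(Hoist(xml)); // <body><h4>A</h4><h4>B</h4></body>
    }
}
```

Note that a second <h4> inside the same container is dropped by this sketch, because its container has already been replaced by the time it is visited.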
This is an alternative solution. It worked for all the documents where the Kuujinbo solution failed; I ran the two side by side in a try/catch/finally construct, and together they worked pretty well across all 2500 HTML docs.
var doc = new HtmlDocument();
doc.Load(file);
var htmlBody = doc.DocumentNode.SelectSingleNode("//body");
var headerTables = doc.DocumentNode.SelectSingleNode("//body/table[1]");
var headerNode = doc.DocumentNode.SelectSingleNode("//h4[contains(text(),'Information Research, Vol')]");
htmlBody.ReplaceChild(headerNode, headerTables);
headerTables.Remove();
doc.Save(file);
Basically it was run as
try { ParseHtmlToString(file); }
catch (Exception ex) { Console.WriteLine(file + ":" + ex.Message); }
finally { myAlternateSolution(file); }
It worked due to the fact that the table was most of the time the first node after body, and it was also the first table in the document. Some manual editing had to be done, due to the fact that some documents had malformed HTML, and could not be repaired with HTMLTidy and similar.

Selenium XPath not recognizing text in table cell

I am trying to do some unit testing with selenium2 using the following code:
private const string TicketName = "Automated Test Ticket";
[Test]
public void EditTicketTest() {
var tableData = driver.FindElement(By.XPath("//td[contains(text(), '" + TicketName + "')]"));
}
The test fails with the following reason:
OpenQA.Selenium.NoSuchElementException : Unable to locate element: {"method":"xpath","selector":"//td[contains(text(), 'Automated Test Ticket')]"}
But when I look at the page and inspect the element, the text is definitely inside the tag. Is it possible that there is some excess spacing or something else that could be causing it to not recognize the text?
Here is the HTML:
<tr data-id="55">
<td>55</td>
<td class="ticket-title">
<span data-original-title="Automated Test Ticket" class="work-on-ticket-note-icon-tickets" data-toggle="tooltip" data-placement="right" data-trigger="hover" title=""></span> <span data-original-title="This is a work on ticket note for a company!" class="work-on-ticket-note-icon-companies" data-toggle="tooltip" data-placement="right" data-trigger="hover" title=""></span> Automated Test Ticket
</td>
<td>Medium</td>
<td>Active</td>
<td>8/25/2014<br> <small>(0 changes)</small></td>
<td></td>
<td>
<strong class="text-danger">None Assigned</strong>
</td>
<td>
<span class="glyphicon glyphicon-edit"></span>
<span class="glyphicon glyphicon-list-alt"></span>
<span class="glyphicon glyphicon-time"></span>
</td>
</tr>
The td in question contains several whitespace-only text nodes as children. In XPath 1.0, when text() is passed to a function that expects a string, it evaluates to the string value of the first matching node in document order, so this:
//td[contains(text(), 'Automated Test Ticket')]
Is evaluating to something like this:
//td[contains(" ", 'Automated Test Ticket')]
Which will always produce an empty nodeset.
Two options here are this:
//td[contains(., 'Automated Test Ticket')]
which will match any td that has a contiguous "Automated Test Ticket" anywhere within it, or this:
//td[text()[contains(., 'Automated Test Ticket')]]
which will match any td that has an immediate child text node containing the text "Automated Test Ticket".
I prefer the first option because it's cleaner and has a better chance of turning up a match if you're not completely sure what the td is going to contain.
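The difference between the three expressions can be checked outside Selenium with the XPath 1.0 engine in System.Xml. The sketch below is an illustration, not the asker's page: the class name XPathTextDemo and the sample row are made up, and the first text node of the cell is the visible string "Ticket:" rather than whitespace, so the result does not depend on how the parser treats whitespace-only nodes.

```csharp
using System;
using System.Xml;

public static class XPathTextDemo
{
    // Hypothetical, well-formed stand-in for the ticket cell: the first text node
    // ("Ticket:") does not contain the title, the last text node does.
    private const string Row =
        "<tr><td>55</td><td class=\"ticket-title\">Ticket:<span/><span/> Automated Test Ticket</td></tr>";

    // Returns how many nodes the XPath 1.0 expression selects in the sample row.
    public static int Count(string xpath)
    {
        var doc = new XmlDocument();
        doc.LoadXml(Row);
        return doc.SelectNodes(xpath).Count;
    }

    public static void Main()
    {
        // contains(text(), ...) only tests the first text node of each td.
        Console.WriteLine(Count("//td[contains(text(), 'Automated Test Ticket')]"));   // 0
        // contains(., ...) tests the whole string value of the td.
        Console.WriteLine(Count("//td[contains(., 'Automated Test Ticket')]"));        // 1
        // text()[contains(., ...)] tests every immediate child text node in turn.
        Console.WriteLine(Count("//td[text()[contains(., 'Automated Test Ticket')]]")); // 1
    }
}
```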

Retrieve data from HTML table in C#

I want to retrieve data from an HTML document.
I am scraping data from a web site and am almost done, but I ran into an issue when trying to retrieve data from a table.
Here is the HTML code:
<div id="middle_column">
<form action="url?" method="post" name="inquirydetail">
<input type="hidden" name="ServiceName" value="SurgeWebService">
<input type="hidden" name="TemplateName" value="Inpat_AvailableResponses.htm">
<input type="hidden" name="CurrentPage" value="inquirydetail">
<form method="post" action="url" name="ResponseSel" onSubmit="return EditPage(document.forms[3])">
<TABLE
<tBody
<table
....
</table
<table
....
</table
<table border="0" width="90%">
<tr>
<td width="10%" valign="bottom" class="content"> Service Number</td>
<td width="30%" valign="bottom" class="content"> Status</td>
<td width="50%" valign="bottom" class="content"> Status Date</td>
</tr>
<tr>
<td width="20%" bgcolor="white" class="subtitle">1</td>
<td width="40%" bgcolor="white" class="subtitle">Approved</td>
<td width="40%" bgcolor="white" class="subtitle">03042014</td>
</tr>
<tr>
<td></td>
</tr>
</table>
</tbody>
</TABle>
</div>
I have to retrieve the value of the Status field (here it is Approved) and write it to a SQL database.
There are many tables inside the form tag, and the tables do not have IDs. How can I get the correct table, row, and cell?
Here is my code:
HtmlElement tBody = WB.Document.GetElementById("middle_column");
if (tBody != null)
{
string sURL = WB.Url.ToString();
int iTableCount = tBody.GetElementsByTagName("table").Count;
}
for (int i = 0; i <= iTableCount; i++)
{
HtmlElement tb=tBody.GetElementsByTagName("table")[i];
}
Something is wrong here
Please help with this.
Do you have any control over the page being displayed in the WebBrowser control? If you do, it's better to add an id attribute to the status td. Then your life would be much easier.
Anyway, here's how you could search a value within a table.
HtmlElementCollection tables = this.WB.Document.GetElementsByTagName("table");
foreach (HtmlElement TBL in tables)
{
foreach (HtmlElement ROW in TBL.All)
{
foreach (HtmlElement CELL in ROW.All)
{
// Now you are looping through all cells in each table
// Here you could use CELL.InnerText to search for "Status" or "Approved"
}
}
}
But, this is not a good approach as you are looping through each table and each cell within each table to find your text. Keep this as the last option.
Hope this helps you to get an idea.
I prefer using the dynamic type and the DomElement property, but this requires .NET 4 or later.
For tables, the main advantage here is that you don't have to loop through everything. If you know the row and column that you are looking for, then you can just target the important data by row and column numbers instead of looping through the whole table.
The other big advantage is that you can basically use the entire DOM, reading more than just the contents of the table. Make sure you use lowercase property names as required in JavaScript, even though you are in C#.
HtmlElement myTableElement;
//Set myTableElement using any GetElement... method.
//Use a loop or square bracket index if the method returns an HtmlElementCollection.
dynamic myTable = myTableElement.DomElement;
for (int i = 0; i < myTable.rows.length; i++)
{
for (int j = 0; j < myTable.rows[i].cells.length; j++)
{
string CellContents = myTable.rows[i].cells[j].innerText;
//You are not limited to innerText; you have the whole DOM available.
//Do something with the CellContents.
}
}

Multiplying a textbox with a cell in a dynamically created table with JQuery

I have a dynamically created table with id called "editTable" that looks as follows:
<tbody>
@{var i = 0;}
@foreach (var item in Model)
{
<tr>
<td width="25%">
@Html.DisplayFor(modelItem => item.Product.Name)
</td>
<td width="25%">
@Html.DisplayFor(modelItem => item.Quantity)
</td>
<td width="25%">
<div class="editor-field">
@Html.EditorFor(modelItem => item.UnitPrice)
@Html.ValidationMessageFor(model => item.UnitPrice)
</div>
</td>
<td width="25%" id="total"></td>
</td>
</tr>
}
</tbody>
The 3rd td element contains a C# text box that is rendered as an <input> element in HTML.
Now I want to multiply the quantity by the unit price to display this value in the 4th td element next to it. This value should update every time the value in the textbox is adjusted. I am a newbie at JQuery / JavaScript and came up with the following code:
// Calculating quantity*unitprice
$('#editTable tr td:nth-child(3) input').each( function (event) {
var $quant = $('#editTable tr td:nth-child(2)', this).val();
var $unitPrice = $('#editTable tr td:nth-child(3) input', this).val();
$('#editTable tr td:nth-child(4)').text($quant * $unitPrice);
});
This doesn't work and only displays NaN in the 4th element. Can anyone help me updating this code to a working version? Any help would be very much appreciated.
I guessed you accidentally switched units and price, because it makes more sense to change the number of units than the price. I took your HTML and JavaScript and tried to change as little as possible to make it work (I'm not saying the solution is perfect, I just don't want to give you a totally different example of how to do it).
The html (The C# is irrelevant for this problem):
<table id="editTable">
<tbody>
<tr>
<td width="25%">
Product name
</td>
<td width="25%">
5
</td>
<td width="25%">
<div class="editor-field">
<input id="UnitPrice" name="UnitPrice" type="number" value="2" style="width:40px" />
</div>
</td>
<td width="25%" id="total"></td>
</tr>
</tbody>
</table>
The javascript/jquery (which should run on load):
$('#editTable tr td:nth-child(3) input').each(updateTotal);
$('#editTable tr td:nth-child(3) input').change(updateTotal);
var element;
function updateTotal(element)
{
var quantity = $(this).closest('tr').find('td:nth-child(2)').text();
var price = $(this).closest('tr').find('td:nth-child(3) input').val();
$(this).closest('tr').find('td:nth-child(4)').text(quantity * price);
}
The problems you had were with the jQuery. I've created a function that runs for an element (in our case your UnitPrice input); it grabs the closest ancestor of type tr (the row the input is in), and from there it does what you tried to do.
You used a jQuery selector that gets the 2nd cells of all table rows; closest('tr').find limits it to the current row.
You tried to use .val() on a td element; you should use either .text() or .html(). Alternatively, you can add a data-val="<%=value%>" attribute on the td and then use .data('val').
It would be better to take the input's value directly from $(this).val() instead of going up to the tr and then back down into the td and the input.
To see it working: http://jsfiddle.net/Ynsgf/1/
I hope I didn't cause you any confusion with my explanation and the options I gave you.
Here is another way to write the jquery part.
$('#editTable tr').each(function (i, row) {
var $quant = $(row).find('td:nth-child(2)').text();
var $unitPrice = $(row).find('.editor-field input').val();
$(row).find('td:nth-child(4)').text($quant * $unitPrice);
});

Remove line from string where certain HTML tag is found

So I have this HTML page that is exported to an Excel file through an MVC action. The action actually goes and renders this partial view, and then exports that rendered view with correct formatting to an Excel file. However, the view is rendered exactly how it is seen before I do the export, and that view contains an "Export to Excel" button, so when I export this, the button image appears as a red X in the top left corner of the Excel file.
I can intercept the string containing this HTML to render in the ExcelExport action, and it looks like this for one example:
<div id="summaryInformation" >
<img id="ExportToExcel" style=" cursor: pointer;" src="/Extranet/img/btn_user_export_excel_off.gif" />
<table class="resultsGrid" cellpadding="2" cellspacing="0">
<tr>
<td id="NicknameLabel" class="resultsCell">Nick Name</td>
<td id="NicknameValue" colspan="3">
Swap
</td>
</tr>
<tr>
<td id="EffectiveDateLabel" class="resultsCell">
<label for="EffectiveDate">Effective Date</label>
</td>
<td id="EffectiveDateValue" class="alignRight">
02-Mar-2011
</td>
<td id ="NotionalLabel" class="resultsCell">
<label for="Notional">Notional</label>
</td>
<td id="NotionalValue" class="alignRight">
<span>
USD
</span>
10,000,000.00
</td>
</tr>
<tr>
<td id="MaturityDateLabel" class="resultsCell">
<label for="MaturityDate">Maturity Date</label>
</td>
<td id="MaturityDateValue" class="alignRight">
02-Mar-2016
-
Modified Following
</td>
<td id="TimeStampLabel" class="resultsCell">
Rate Time Stamp
</td>
<td id="Timestamp" class="alignRight">
28-Feb-2011 16:00
</td>
</tr>
<tr >
<td id="HolidatCityLabel" class="resultsCell"> Holiday City</td>
<td id="ddlHolidayCity" colspan="3">
New York,
London
</td>
</tr>
</table>
</div>
<script>
$("#ExportToExcel").click(function () {
// ajax call to do the export
var actionUrl = "/Extranet/mvc/Indications.cfc/ExportToExcel";
var viewName = "/Extranet/Views/Indications/ResultsViews/SummaryInformation.aspx";
var fileName = 'SummaryInfo.xls';
GridExport(actionUrl, viewName, fileName);
});
</script>
That <img id="ExportToExcel" tag at the top is the one I want to remove just for the export. All of what you see is contained within a C# string. How would I go and remove that line from the string so it doesn't try and render the image in Excel?
EDIT: Would probably make sense also that we wouldn't need any of the <script> in the export either, but since that won't show up in Excel anyway I don't think that's a huge deal for now.
Remove all img tags:
string html2 = Regex.Replace(html, @"(<img\/?[^>]+>)", "",
    RegexOptions.IgnoreCase);
Include reference: using System.Text.RegularExpressions;
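As a narrower variant of the regex above, the pattern can be restricted to the one image whose id is ExportToExcel, so that any other images in the view survive the export. This is a rough sketch, not a general HTML parser: the class name ExportCleanup and the method RemoveImageById are made up for this example, and it assumes the id attribute is written with double quotes, as in the question.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ExportCleanup
{
    // Removes only the <img> tag whose id attribute equals the given value,
    // together with any whitespace that directly follows the tag.
    public static string RemoveImageById(string html, string id)
    {
        // \sid requires whitespace before "id", so attributes such as data-id are not matched.
        string pattern = @"<img\b[^>]*\sid\s*=\s*""" + Regex.Escape(id) + @"""[^>]*>\s*";
        return Regex.Replace(html, pattern, "", RegexOptions.IgnoreCase);
    }

    public static void Main()
    {
        string html = "<div>\n<img id=\"ExportToExcel\" src=\"a.gif\" />\n"
                    + "<img id=\"Logo\" src=\"b.gif\" />\n</div>";
        // Only the ExportToExcel image is removed; the Logo image is kept.
        Console.WriteLine(ExportCleanup.RemoveImageById(html, "ExportToExcel"));
    }
}
```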
If it's in a C# string then just:
myHTMLString = myHTMLString.Replace(@"<img id=""ExportToExcel"" style="" cursor: pointer;"" src=""/Extranet/img/btn_user_export_excel_off.gif"" />", "");
The safest way to do this will be to use the HTML Agility Pack to read in the HTML and then write code that removes the image node from the HTML.
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(htmlString);
HtmlNode image = doc.GetElementbyId("ExportToExcel");
if (image != null) image.Remove();
htmlString = doc.DocumentNode.WriteTo();
You can use similar code to remove the script tag and other img tags.
I'm just using this
private string RemoveImages(string html)
{
    StringBuilder retval = new StringBuilder();
    using (StringReader reader = new StringReader(html))
    {
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            // TrimStart so that indented <img> lines are skipped as well
            if (!line.TrimStart().StartsWith("<img"))
            {
                // AppendLine keeps the line breaks that ReadLine strips
                retval.AppendLine(line);
            }
        }
    }
    return retval.ToString();
}
