I'm having trouble finding a proper way to break the h4 tag out of the following code. I need to keep it in the document, but I also need to delete the table it currently sits in.
So, how do I delete the whole table and keep the h4 tag where it is?
<table align="center" border="0" cellpadding="0" cellspacing="0">
<tr><td height="30" align="center" colspan="5"><h4>IMPORTANT HEADLINE ABOUT THIS PARTICULAR PAGE</h4></td></tr>
<tr>
<td><img name="contents" src="../figs/contents.gif" border="0" alt="" onload=""></td>
<td><img src="../figs/iauthori.gif" alt="" name="authorindex" width="120" height="20" border="0" onload=""></td>
<td><img src="../figs/isubji.gif" alt="" name="subjindex" width="120" height="20" border="0" onload=""></td>
<td><img src="../figs/isearch.gif" alt="" name="search" width="120" height="20" border="0" onload=""></td>
<td><img name="home" src="../figs/ihome.gif" border="0" alt="" onload=""></td>
</tr>
</table>
I have about 2500 HTML documents that follow a similar structure but were written in different versions of HTML, so they use divs, tables or other elements depending on the version. I therefore need a method that can be adapted to each case.
I already have the document loading in place: it collects all files in a list, so I will be feeding this list of filenames to a method that opens and parses them. But I can't figure out how to use XPath for this one.
One way to solve the problem is to find all <h4> nodes, walk up each node's parent chain until you find a stop tag/node, and replace the stop tag/node with your <h4>:
Given some sample HTML that resides in an HTML file:
var html =
@"<!doctype html system 'html.dtd'>
<html><head></head>
<body>
<table align='center' border='0' cellpadding='0' cellspacing='0'>
<tr><td height='30' align='center' colspan='5'><h4>IMPORTANT HEADLINE ABOUT THIS PARTICULAR PAGE</h4></td></tr>
<tr>
<td><a href='index.html'><img name='contents' src='../figs/contents.gif' border='0' alt='' onload=''></a></td>
<td><a href='../page.html'><img src='../figs/iauthori.gif' alt='' name='authorindex' width='120' height='20' border='0' onload=''></a></td>
<td><a href='../page.html'><img src='../figs/isubji.gif' alt='' name='subjindex' width='120' height='20' border='0' onload=''></a></td>
<td><a href='../search.html'><img src='../figs/isearch.gif' alt='' name='search' width='120' height='20' border='0' onload=''></a></td>
<td><a href='../page.html'><img name='home' src='../figs/ihome.gif' border='0' alt='' onload=''></a></td>
</tr>
</table>
<div>
<h4>H4 nested in DIV</h4>
<p>Paragraph <strong>bold</strong> <a href=''>Hyperlink</a></p>
</div>
<p><h4>H4 nested in P</h4></p>
</body>
</html>";
Parse it with this method:
// requires: using System.Linq; using HtmlAgilityPack;
public string ParseHtmlToString(string inputFilePath)
{
    var document = new HtmlDocument();
    document.Load(inputFilePath);
    // SelectNodes returns null when the document contains no <h4>
    var wantedNodes = document.DocumentNode.SelectNodes("//h4");
    if (wantedNodes == null) return document.DocumentNode.WriteTo();
    // stop at these tags while walking backwards up the chain
    var stopTags = new string[] { "table", "div", "p" };
    foreach (var node in wantedNodes)
    {
        HtmlNode testNode = node;
        HtmlNode parentNode;
        while ((parentNode = testNode.ParentNode) != null)
        {
            if (stopTags.Contains(parentNode.Name))
            {
                // replace the stop tag with the <h4>, then stop walking
                parentNode.ParentNode.ReplaceChild(node, parentNode);
                break;
            }
            testNode = parentNode;
        }
    }
    return document.DocumentNode.WriteTo();
}
Then you can assign the parsed HTML to a variable like this:
var parsedHtml = ParseHtmlToString(INPUT_FILE);
which returns the following value:
<!doctype html system 'html.dtd'>
<html><head></head>
<body>
<h4>IMPORTANT HEADLINE ABOUT THIS PARTICULAR PAGE</h4>
<h4>H4 nested in DIV</h4>
<h4>H4 nested in P</h4>
</body>
</html>
This is an alternative solution. It worked for all those documents where the Kuujinbo solution failed; I ran them side by side in a try/catch/finally block, and it worked pretty well through all 2500 HTML docs.
var doc = new HtmlDocument();
doc.Load(file);
var htmlBody = doc.DocumentNode.SelectSingleNode("//body");
var headerTables = doc.DocumentNode.SelectSingleNode("//body/table[1]");
var headerNode = doc.DocumentNode.SelectSingleNode("//h4[contains(text(),'Information Research, Vol')]");
htmlBody.ReplaceChild(headerNode, headerTables);
headerTables.Remove();
doc.Save(file);
Basically it was run as
try { ParseHtmlToString(file); }
catch (Exception ex) { Console.WriteLine(file + ":" + ex.Message); }
finally { myAlternateSolution(file); }
It worked because the table was usually the first node after body, and it was also the first table in the document. Some manual editing was still needed, because some documents had malformed HTML that could not be repaired with HTMLTidy and similar tools.
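Note that a finally block runs whether or not the try block throws. If the intent is to run the alternative only for files where the first method fails, a minimal sketch could look like the following. FallbackRunner and RunWithFallback are hypothetical names introduced here for illustration, not part of the original solution.

```csharp
using System;

public static class FallbackRunner
{
    // Runs primary(file). If it throws, logs the error and runs fallback(file).
    // Returns true when the primary step succeeded, false when the fallback ran.
    // (Hypothetical helper, not from the original post.)
    public static bool RunWithFallback(string file, Action<string> primary, Action<string> fallback)
    {
        try
        {
            primary(file);
            return true;
        }
        catch (Exception ex)
        {
            Console.WriteLine(file + ": " + ex.Message);
            fallback(file);
            return false;
        }
    }
}
```

It could be called as FallbackRunner.RunWithFallback(file, f => ParseHtmlToString(f), f => myAlternateSolution(f)); an exception thrown by the fallback itself is not caught here and would propagate to the caller.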
I am trying to create an automation for downloading files from linked text.
Unfortunately I can't get it to work. I am new to Selenium.
Here is the site's HTML:
Ttnc-18p - 17.34 GB
<table class="table table-stripped">
<thead>
<tr>
<th>Name</th>
<th>Größe</th>
<th>DL</th>
</tr>
</thead>
<tbody>
<tr>
<td>Ttnc-18p.part01.rar
</td><td>500.00 MB</td>
<td><img width="16" height="16" src="//filer.net/media/images/ico_arrow_down.png?2018" alt="DL">
</td></tr>
<tr>
<td>Ttnc-18p.part01.rev
</td><td>500.00 MB</td>
<td><img width="16" height="16" src="//filer.net/media/images/ico_arrow_down.png?2018" alt="DL">
</td></tr>
<tr>
<td>Ttnc-18p.part02.rar
</td><td>500.00 MB</td>
<td><img width="16" height="16" src="//filer.net/media/images/ico_arrow_down.png?2018" alt="DL">
</td></tr>
<tr>
<td>Ttnc-18p.part03.rar
</td><td>500.00 MB</td>
<td><img width="16" height="16" src="//filer.net/media/images/ico_arrow_down.png?2018" alt="DL">
</td></tr>
I want it to download the link from: Ttnc-18p.part01.rar, Ttnc-18p.part02.rar, Ttnc-18p.part03.rar and so on...
I tried this:
ChromeDriver.FindElement(By.XPath("/table[contains(@class,'table table-stripped')]/tbody/tr/td/a[contains(text(),'" + "Ttnc-18p.part01.rar" + "')]")).Click();
It doesn't work and I can't figure out what to do. Everything else I tried fails as well.
The second thing I am trying to do is have the code generate an array of the links it needs to download, so I can feed it different websites that have different numbers of links.
Please help.
You can try the code below, but you would likely want to break it into separate methods for reusability:
public IEnumerable<string> DownloadLinks(string url)
{
    // for storing each href value as it is retrieved from the links
    var listOfLinks = new List<string>();
    // start your instance of ChromeDriver
    var driver = new ChromeDriver();
    // Navigate to the url you passed in
    driver.Navigate().GoToUrl(url);
    // Get a collection of all anchor ("a") tags.
    var anchorTags = driver.FindElements(By.TagName("a"));
    // Now for each anchor tag...
    foreach (var link in anchorTags)
    {
        // ...retrieve the value of its 'href' attribute (i.e. your link)...
        var l = link.GetAttribute("href");
        // ...add the link path to your listOfLinks
        // and append your url to the href since the href is only a partial
        // ( /get/vobqwunyrrxl2oo5 becomes https://yourwebsite.com/get/vobqwunyrrxl2oo5)
        listOfLinks.Add(url + l);
        // now click your link to simulate clicking the link and downloading the file
        link.Click();
    }
    // and finally return your list of links
    return listOfLinks;
}
This is my original html:
<tr>
<td style="padding-left: 40pt;"><font style="background-color: lightgreen" color="black">Tove</font></td>
<td style="padding-left: 40pt;"><font style="background-color: lightgreen" color="black">To</font></td>
</tr>
And my goal is to have this:
<div class="select-me"> <tr>...</tr> </div>
I am using HtmlAgilityPack and essentially going through each font tag and checking whether its style is lightgreen. But I'm not sure how to jump back to the table row tags and put a div tag around them.
You can use the following code to wrap them with a div:
foreach(var node in selectMe)
node.ParentNode.OuterHtml = "<div class=\"select-me\">" + node.ParentNode.InnerHtml + "</div>";
You can also build selectMe with a single query instead of checking the nodes one by one:
var selectMe = doc.DocumentNode.SelectNodes("//td[contains(@style,'background-color: lightgreen')]");
Using Windows Forms and C#.
For example...
<table id=tbl>
<tbody>
<tr>
<td>HELLO</td>
<td>YES</td>
<td>TEST</td>
</tr>
<tr>
<td>BLAH BLAH</td>
<td>YES</td>
<td>TEST</td>
</tr>
</tbody>
</table>
I load the page using the WebBrowser Control. The page loads perfectly.
The next thing I want to do is search through all the rows in the table and check whether they contain a specific value, for example YES in this instance.
If a row contains it, I want the row to be passed on to me so I can store it as a string.
But I want the row to be in HTML form (containing the tags).
How can I accomplish this?
Please help me.
You can use the HtmlAgilityPack to easily parse the html. For example, to get all of the TD elements, you can do this:
string value = @" <table id=tbl>
<tbody>
<tr>
<td>HELLO</td>
<td>YES</td>
<td>TEST</td>
</tr>
<tr>
<td>BLAH BLAH</td>
<td>YES</td>
<td>TEST</td>
</tr>
</tbody>
</table>";
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(value);
var nodes = doc.GetElementbyId("tbl").SelectNodes("tbody/tr/td");
foreach (var node in nodes)
{
    Debug.WriteLine(node.InnerText);
}
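To return whole rows as HTML rather than individual cell text, the same approach can be narrowed with an XPath predicate and the OuterHtml property. The following is a minimal sketch, assuming the HtmlAgilityPack NuGet package is referenced. RowFinder and RowsContaining are hypothetical names, and the cell text is assumed not to contain an apostrophe, since it is spliced directly into the XPath string.

```csharp
using System.Collections.Generic;
using HtmlAgilityPack; // NuGet package, assumed to be referenced

public static class RowFinder
{
    // Returns the outer HTML of every <tr> inside the element with the given id
    // that has at least one <td> whose trimmed text equals cellText.
    // (Hypothetical helper; cellText must not contain an apostrophe.)
    public static List<string> RowsContaining(string html, string tableId, string cellText)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var rows = new List<string>();
        var table = doc.GetElementbyId(tableId);
        if (table == null) return rows;

        // SelectNodes returns null (not an empty collection) when nothing matches.
        var matches = table.SelectNodes(".//tr[td[normalize-space(.)='" + cellText + "']]");
        if (matches == null) return rows;

        foreach (var row in matches)
            rows.Add(row.OuterHtml); // the row including its <tr> and <td> tags
        return rows;
    }
}
```

With the WebBrowser control, the html argument could come from the loaded page's markup, for example webBrowser1.DocumentText, where webBrowser1 is whatever the control is named in the form.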
You can use this: http://simplehtmldom.sourceforge.net/ . It's a really simple way to search in HTML files.
Just include simple_html_dom.php in your file and then follow this manual:
http://simplehtmldom.sourceforge.net/manual.htm
Your PHP code will look like this:
$html = file_get_html('File.html');
foreach ($html->find('td') as $element)
    echo $element->plaintext . '<br>';
Here is the Html code:
<table style="border:1px solid #000">
<tr style="background:#ddd;">
<td width="150">TableEle1</td>
<td width="150">TableEle2</td>
<td width="150">TableEle3</td>
<td width="150">TableEle4</td>
<td width="150">TableEle5</td>
<td width="150">TableEle6</td>
<td width="150">TableEle7</td>
<td width="150">TableEle8</td>
</tr>
And here is the code I use to extract table element 1 (without success):
htmlHelper.SetNode(@"//td/text()='TableEle1'");
Is there any advice for me?
You can use a blend of HtmlAgilityPack and Linq to get the desired td node.
HtmlDocument document = new HtmlDocument();
document.LoadHtml("[your HTML string]");
var node = document.DocumentNode.SelectNodes("//td/text()");
var tdNode = node.Where(s => s.InnerText == "TableEle1").Select(s => s);
Hope this helps!
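As a side note, the expression //td/text()='TableEle1' is a comparison, so in XPath it evaluates to a boolean rather than a node, which may be why the original attempt did not work. Moving the comparison into a predicate selects the td itself. Below is a minimal sketch, assuming the HtmlAgilityPack NuGet package is referenced. CellFinder and FindCell are hypothetical names, and the cell text is assumed not to contain an apostrophe.

```csharp
using HtmlAgilityPack; // NuGet package, assumed to be referenced

public static class CellFinder
{
    // Returns the first <td> whose trimmed text equals cellText,
    // or null when no cell matches.
    // (Hypothetical helper; cellText must not contain an apostrophe.)
    public static HtmlNode FindCell(string html, string cellText)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        return doc.DocumentNode.SelectSingleNode("//td[normalize-space(.)='" + cellText + "']");
    }
}
```

This returns the td element rather than its text node, so attributes such as width are available through the returned node's Attributes collection.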
So I have this HTML page that is exported to an Excel file through an MVC action. The action actually goes and renders this partial view, and then exports that rendered view with correct formatting to an Excel file. However, the view is rendered exactly how it is seen before I do the export, and that view contains an "Export to Excel" button, so when I export this, the button image appears as a red X in the top left corner of the Excel file.
I can intercept the string containing this HTML to render in the ExcelExport action, and it looks like this for one example:
<div id="summaryInformation" >
<img id="ExportToExcel" style=" cursor: pointer;" src="/Extranet/img/btn_user_export_excel_off.gif" />
<table class="resultsGrid" cellpadding="2" cellspacing="0">
<tr>
<td id="NicknameLabel" class="resultsCell">Nick Name</td>
<td id="NicknameValue" colspan="3">
Swap
</td>
</tr>
<tr>
<td id="EffectiveDateLabel" class="resultsCell">
<label for="EffectiveDate">Effective Date</label>
</td>
<td id="EffectiveDateValue" class="alignRight">
02-Mar-2011
</td>
<td id ="NotionalLabel" class="resultsCell">
<label for="Notional">Notional</label>
</td>
<td id="NotionalValue" class="alignRight">
<span>
USD
</span>
10,000,000.00
</td>
</tr>
<tr>
<td id="MaturityDateLabel" class="resultsCell">
<label for="MaturityDate">Maturity Date</label>
</td>
<td id="MaturityDateValue" class="alignRight">
02-Mar-2016
-
Modified Following
</td>
<td id="TimeStampLabel" class="resultsCell">
Rate Time Stamp
</td>
<td id="Timestamp" class="alignRight">
28-Feb-2011 16:00
</td>
</tr>
<tr >
<td id="HolidatCityLabel" class="resultsCell"> Holiday City</td>
<td id="ddlHolidayCity" colspan="3">
New York,
London
</td>
</tr>
</table>
</div>
<script>
$("#ExportToExcel").click(function () {
// ajax call to do the export
var actionUrl = "/Extranet/mvc/Indications.cfc/ExportToExcel";
var viewName = "/Extranet/Views/Indications/ResultsViews/SummaryInformation.aspx";
var fileName = 'SummaryInfo.xls';
GridExport(actionUrl, viewName, fileName);
});
</script>
That <img id="ExportToExcel" tag at the top is the one I want to remove just for the export. All of what you see is contained within a C# string. How would I go and remove that line from the string so it doesn't try and render the image in Excel?
EDIT: Would probably make sense also that we wouldn't need any of the <script> in the export either, but since that won't show up in Excel anyway I don't think that's a huge deal for now.
Remove all img tags:
string html2 = Regex.Replace( html, @"(<img\/?[^>]+>)", @"",
RegexOptions.IgnoreCase );
Include reference: using System.Text.RegularExpressions;
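If only the export button should go and other images should stay, the pattern can be narrowed to an img tag with a specific id. This is a minimal sketch using only System.Text.RegularExpressions. ExportCleanup and RemoveImageById are hypothetical names, and like any regex over HTML it assumes the tag contains no '>' inside an attribute value.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ExportCleanup
{
    // Removes every <img ...> tag whose id attribute equals the given id.
    // (Hypothetical helper; assumes no '>' appears inside an attribute value.
    // An attribute such as data-id with the same value would also match.)
    public static string RemoveImageById(string html, string id)
    {
        string pattern = @"<img\b[^>]*\bid\s*=\s*[""']" + Regex.Escape(id) + @"[""'][^>]*>";
        return Regex.Replace(html, pattern, string.Empty, RegexOptions.IgnoreCase);
    }
}
```

For the HTML in the question, the call would be ExportCleanup.RemoveImageById(html, "ExportToExcel").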
If it's in a C# string then just:
myHTMLString = myHTMLString.Replace(@"<img id=""ExportToExcel"" style="" cursor: pointer;"" src=""/Extranet/img/btn_user_export_excel_off.gif"" />", "");
The safest way to do this will be to use the HTML Agility Pack to read in the HTML and then write code that removes the image node from the HTML.
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(htmlString);
HtmlNode image = doc.GetElementbyId("ExportToExcel");
image.Remove();
htmlString = doc.DocumentNode.WriteTo();
You can use similar code to remove the script tag and other img tags.
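As a rough illustration of that last point, the sketch below removes every node matched by an XPath expression, so scripts and the export image can be dropped in one pass. It assumes the HtmlAgilityPack NuGet package is referenced. ExportHtmlCleaner and RemoveNodes are hypothetical names introduced here.

```csharp
using System.Linq;
using HtmlAgilityPack; // NuGet package, assumed to be referenced

public static class ExportHtmlCleaner
{
    // Removes every node matched by the XPath expression and returns
    // the remaining HTML as a string. (Hypothetical helper.)
    public static string RemoveNodes(string html, string xpath)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null when nothing matches. ToList() takes a
        // snapshot so nodes can be removed while iterating.
        var nodes = doc.DocumentNode.SelectNodes(xpath);
        if (nodes != null)
        {
            foreach (var node in nodes.ToList())
                node.Remove();
        }
        return doc.DocumentNode.OuterHtml;
    }
}
```

A call such as ExportHtmlCleaner.RemoveNodes(htmlString, "//script | //img[@id='ExportToExcel']") would strip both the script block and the export button from the question's markup.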
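I'm just using this:
// requires: using System; using System.IO; using System.Text;
private string RemoveImages(string html)
{
    StringBuilder retval = new StringBuilder();
    using (StringReader reader = new StringReader(html))
    {
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            // skip lines that start with an <img tag, ignoring leading whitespace
            if (!line.TrimStart().StartsWith("<img", StringComparison.OrdinalIgnoreCase))
            {
                // AppendLine keeps the line breaks that ReadLine strips
                retval.AppendLine(line);
            }
        }
    }
    return retval.ToString();
}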