How to download linked text - Selenium C#

I am trying to automate downloading files from linked text.
Unfortunately, I can't get it to work. I am new to Selenium.
Here is the HTML of the site:
Ttnc-18p - 17.34 GB
<table class="table table-stripped">
<thead>
<tr>
<th>Name</th>
<th>Größe</th>
<th>DL</th>
</tr>
</thead>
<tbody>
<tr>
<td>Ttnc-18p.part01.rar
</td><td>500.00 MB</td>
<td><img width="16" height="16" src="//filer.net/media/images/ico_arrow_down.png?2018" alt="DL">
</td></tr>
<tr>
<td>Ttnc-18p.part01.rev
</td><td>500.00 MB</td>
<td><img width="16" height="16" src="//filer.net/media/images/ico_arrow_down.png?2018" alt="DL">
</td></tr>
<tr>
<td>Ttnc-18p.part02.rar
</td><td>500.00 MB</td>
<td><img width="16" height="16" src="//filer.net/media/images/ico_arrow_down.png?2018" alt="DL">
</td></tr>
<tr>
<td>Ttnc-18p.part03.rar
</td><td>500.00 MB</td>
<td><img width="16" height="16" src="//filer.net/media/images/ico_arrow_down.png?2018" alt="DL">
</td></tr>
I want it to download the links for Ttnc-18p.part01.rar, Ttnc-18p.part02.rar, Ttnc-18p.part03.rar, and so on.
I tried this:
ChromeDriver.FindElement(By.XPath("/table[contains(@class,'table table-stripped')]/tbody/tr/td/a[contains(text(),'" + "Ttnc-18p.part01.rar" + "')]")).Click();
It doesn't work, and I can't figure out what to do. Everything else I tried fails as well.
The second thing I am trying to do is have the code generate an array of the links it needs to download, so that I can feed it different websites that have different numbers of links.
Please help.

You can try the code below, though you would likely want to break it into separate methods for reusability:
public IEnumerable<string> DownloadLinks(string url)
{
    // for storing each href value as it is retrieved from the links
    var listOfLinks = new List<string>();
    // start your instance of ChromeDriver
    var driver = new ChromeDriver();
    // navigate to the url you passed in
    driver.Navigate().GoToUrl(url);
    // get a collection of all anchor ("a") tags
    var anchorTags = driver.FindElements(By.TagName("a"));
    // now for each anchor tag...
    foreach (var link in anchorTags)
    {
        // ...retrieve the value of its 'href' attribute (i.e. your link)...
        var l = link.GetAttribute("href");
        // ...skip anchors that have no href...
        if (string.IsNullOrEmpty(l)) continue;
        // ...and add the link to your listOfLinks. Selenium normally returns href
        // already resolved against the page
        // ( /get/vobqwunyrrxl2oo5 comes back as https://yourwebsite.com/get/vobqwunyrrxl2oo5),
        // so the page url is not prepended here
        listOfLinks.Add(l);
        // now click the link to start the download; note that if a click navigates
        // away from the page, the remaining elements become stale
        link.Click();
    }
    // and finally return your list of links
    return listOfLinks;
}
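If the href values come from raw HTML rather than from the browser (for example /get/vobqwunyrrxl2oo5 read by a parser), gluing the page URL and the href together as strings only works when the page URL happens to be the site root. A minimal sketch of a safer approach with System.Uri follows; LinkResolver and ToAbsolute are hypothetical names used for illustration, not part of Selenium.

```csharp
using System;

public static class LinkResolver
{
    // Hypothetical helper: resolves an href against the page it was found on.
    public static string ToAbsolute(string pageUrl, string href)
    {
        var baseUri = new Uri(pageUrl, UriKind.Absolute);
        // Uri(Uri, string) resolves a relative reference against the base;
        // an href that is already absolute is kept unchanged.
        return new Uri(baseUri, href).AbsoluteUri;
    }
}
```

For example, LinkResolver.ToAbsolute("https://yourwebsite.com/folder/page", "/get/vobqwunyrrxl2oo5") gives https://yourwebsite.com/get/vobqwunyrrxl2oo5, whereas plain concatenation would produce a path under /folder/page.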

Related

Using HtmlAgilityPack with C# to find all href links within td elements in html page

I am attempting to use HtmlAgilityPack package to find each of the href links within td tags throughout an entire html page. The trick is that these tables start deep down into the html structure. I noticed with HtmlAgilityPack you can't just say get all tds that are within trs on a page. There is a parent div wrapped around each table with a class on it "table-group" that I am not showing in my sample below. Maybe I can use that as a starting point? The biggest trouble that I am dealing with is that there are several parent elements above everything in my sample below, but I want to skip all of that and start here.
Here is a sample of the structure I am trying to navigate:
<table>
<thead>
</thead>
<tbody>
<tr>
<td>Link 1</td>
<td>1</td>
</tr>
<tr>
<td>Link 2</td>
<td>2</td>
</tr>
<tr>
<td>Link 3</td>
<td>3</td>
</tr>
</tbody>
</table>
<table>
<thead>
</thead>
<tbody>
<tr>
<td>Link 4</td>
<td>4</td>
</tr>
<tr>
<td>Link 5</td>
<td>5</td>
</tr>
<tr>
<td>Link 6</td>
<td>6</td>
</tr>
</tbody>
</table>
I would like my end result to be:
https://path-to-pdf1
https://path-to-pdf2
https://path-to-pdf3
https://path-to-pdf4
https://path-to-pdf5
https://path-to-pdf6
Here is what I have tried:
var html = @"https://myurl.com";
HtmlWeb web = new HtmlWeb();
var htmlDoc = web.Load(html);
var nodes = htmlDoc.DocumentNode.SelectNodes("//table/tbody/tr/td/a[0]");
foreach (var item in nodes)
{
Console.WriteLine(item.Attributes["href"].Value);
}
Console.ReadKey();
Change
var nodes = htmlDoc.DocumentNode.SelectNodes("//table/tbody/tr/td/a[0]");
to
var nodes = htmlDoc.DocumentNode.SelectNodes("//table/tbody/tr/td[1]/a");
and you will get the result you want. XPath positions are 1-based, so a[0] never matches anything, while td[1] selects the first cell of each row. See the XPath documentation for more details.
I tested this in an MVC project with the same HTML file.
Update:
I copied the HTML into a local page and retrieved the nodes successfully.
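The 1-based indexing can be checked without HtmlAgilityPack. The sketch below uses the built-in System.Xml classes on a small well-formed snippet; the class name XPathIndexDemo and the sample markup passed to it are assumptions for illustration, since the anchors in the question's sample were not preserved.

```csharp
using System.Collections.Generic;
using System.Xml;

public static class XPathIndexDemo
{
    // Returns the href of every node matched by the given XPath expression.
    public static List<string> SelectHrefs(string xml, string xpath)
    {
        var doc = new XmlDocument();
        doc.LoadXml(xml);
        var hrefs = new List<string>();
        // XmlDocument.SelectNodes returns an empty list when nothing matches
        foreach (XmlNode node in doc.SelectNodes(xpath))
        {
            hrefs.Add(node.Attributes["href"].Value);
        }
        return hrefs;
    }
}
```

Called on a table whose first cell in each of two rows holds an a element, "//table/tbody/tr/td/a[0]" yields no hrefs and "//table/tbody/tr/td[1]/a" yields both. Unlike XmlDocument, HtmlAgilityPack's SelectNodes returns null when nothing matches, so the foreach in the question would throw instead of printing nothing.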

Break out an html-element from within a table-element

I'm having problems finding a proper way of breaking out the H4-tag from the following code. Not only do I need to make it stay in the code, but I also need to delete the table it currently sits in.
So, how do I delete the whole table and keep the h4-tag where it is?
<table align="center" border="0" cellpadding="0" cellspacing="0">
<tr><td height="30" align="center" colspan="5"><h4>IMPORTANT HEADLINE ABOUT THIS PARTICULAR PAGE</h4></td></tr>
<tr>
<td><img name="contents" src="../figs/contents.gif" border="0" alt="" onload=""></td>
<td><img src="../figs/iauthori.gif" alt="" name="authorindex" width="120" height="20" border="0" onload=""></td>
<td><img src="../figs/isubji.gif" alt="" name="subjindex" width="120" height="20" border="0" onload=""></td>
<td><img src="../figs/isearch.gif" alt="" name="search" width="120" height="20" border="0" onload=""></td>
<td><img name="home" src="../figs/ihome.gif" border="0" alt="" onload=""></td>
</tr>
</table>
Furthermore, I have about 2500 HTML documents that follow a similar structure but were written in different versions of HTML, so they use divs, tables, or other elements depending on the version. So I need a way to adapt this method accordingly.
I have a document load ready, it loads all files in a list, so I will be feeding a method this list of filenames to open and parse. But I can't figure out how to use XPath for this one.
One way to solve the problem is to find all <h4> nodes, walk up each node's parent chain until you find a stop tag/node, and replace that stop tag/node with your <h4>.
Given some sample HTML that resides in an HTML file:
var html =
@"<!doctype html system 'html.dtd'>
<html><head></head>
<body>
<table align='center' border='0' cellpadding='0' cellspacing='0'>
<tr><td height='30' align='center' colspan='5'><h4>IMPORTANT HEADLINE ABOUT THIS PARTICULAR PAGE</h4></td></tr>
<tr>
<td><a href='index.html'><img name='contents' src='../figs/contents.gif' border='0' alt='' onload=''></a></td>
<td><a href='../page.html'><img src='../figs/iauthori.gif' alt='' name='authorindex' width='120' height='20' border='0' onload=''></a></td>
<td><a href='../page.html'><img src='../figs/isubji.gif' alt='' name='subjindex' width='120' height='20' border='0' onload=''></a></td>
<td><a href='../search.html'><img src='../figs/isearch.gif' alt='' name='search' width='120' height='20' border='0' onload=''></a></td>
<td><a href='../page.html'><img name='home' src='../figs/ihome.gif' border='0' alt='' onload=''></a></td>
</tr>
</table>
<div>
<h4>H4 nested in DIV</h4>
<p>Paragraph <strong>bold</strong> <a href=''>Hyperlink</a></p>
</div>
<p><h4>H4 nested in P</h4></p>
</body>
</html>";
Parse it with this method:
public string ParseHtmlToString(string inputFilePath)
{
    var document = new HtmlDocument();
    document.Load(inputFilePath);
    // SelectNodes returns null when the document has no <h4>
    var wantedNodes = document.DocumentNode.SelectNodes("//h4");
    if (wantedNodes == null) return document.DocumentNode.WriteTo();
    // stop at these tags while walking backwards up the chain
    // (Contains below needs: using System.Linq;)
    var stopTags = new string[] { "table", "div", "p" };
    foreach (var node in wantedNodes)
    {
        HtmlNode testNode = node;
        HtmlNode parentNode;
        while ((parentNode = testNode.ParentNode) != null)
        {
            if (stopTags.Contains(parentNode.Name))
            {
                // replace the enclosing stop tag with the <h4>, then stop walking
                parentNode.ParentNode.ReplaceChild(node, parentNode);
                break;
            }
            testNode = parentNode;
        }
    }
    return document.DocumentNode.WriteTo();
}
Then you can assign the parsed HTML to a variable like this:
var parsedHtml = ParseHtmlToString(INPUT_FILE);
which returns the following value:
<!doctype html system 'html.dtd'>
<html><head></head>
<body>
<h4>IMPORTANT HEADLINE ABOUT THIS PARTICULAR PAGE</h4>
<h4>H4 nested in DIV</h4>
<h4>H4 nested in P</h4>
</body>
</html>
This is an alternative solution. It worked for all the documents where the Kuujinbo solution failed; I ran the two side by side in a try/catch/finally block, and it worked well across all 2500 HTML documents.
var doc = new HtmlDocument();
doc.Load(file);
var htmlBody = doc.DocumentNode.SelectSingleNode("//body");
var headerTables = doc.DocumentNode.SelectSingleNode("//body/table[1]");
var headerNode = doc.DocumentNode.SelectSingleNode("//h4[contains(text(),'Information Research, Vol')]");
htmlBody.ReplaceChild(headerNode, headerTables);
headerTables.Remove();
doc.Save(file);
Basically it was run as
try { ParseHtmlToString(file); }
catch (Exception ex) { Console.WriteLine(file + ":" + ex.Message); }
finally { myAlternateSolution(file); }
It worked because the table was usually the first node after body and also the first table in the document. Some manual editing was still needed, because some documents had malformed HTML that could not be repaired with HTMLTidy and similar tools.

List of items using Xpath and Selenium

Hello, I'm trying to fetch all connected friends on Facebook using XPath and Selenium. The problem is that when I try to locate the friends, it returns an empty list.
using System;
using System.Collections.Generic;
using OpenQA.Selenium;
using OpenQA.Selenium.Firefox;
using OpenQA.Selenium.Support.UI;
namespace Automation
{
class Program
{
static void Main(string[] args)
{
using (IWebDriver driver = new FirefoxDriver())
{
driver.Navigate().GoToUrl("https://mbasic.facebook.com");
IWebElement username = driver.FindElement(By.Name("email"));
username.SendKeys("email");
IWebElement password = driver.FindElement(By.Name("pass"));
password.SendKeys("password");
IWebElement submit = driver.FindElement(By.Name("login"));
submit.Submit();
var waitHomePage = new WebDriverWait(driver,TimeSpan.FromSeconds(10));
waitHomePage.Until(ExpectedConditions.ElementExists(By.PartialLinkText("Chat")));
IWebElement chat = driver.FindElement(By.XPath(".//*[@id='header']/div/a[6]"));
//driver.Manage().Timeouts().ImplicitlyWait(TimeSpan.FromSeconds(5));
chat.Click();
IList<IWebElement> friends = chat.FindElements(By.ClassName("m br bs"));
}
}
}
}
friends.Count returns 0.
Here is the HTML of the friends chat list
<div class="bo bp bq">
<table class="m br bs">
<tbody>
<tr>
<td class="t bt">
<a class="bu" href="/messages/read/?fbid=100002640428096&click_type=buddylist#fua">Friend Name</a>
</td>
<td class="n bv">
<img class="bw bx s" src="https://fbstatic-a.akamaihd.net/rsrc.php/v3/yo/r/DbsprgIuYE0.png" width="7" height="14"/>
</td>
</tr>
</tbody>
<table class="m br bs">
<table class="m br bs">
<table class="m br bs">
</div>
</div>
As far as I can see, there are several mistakes in your code (and the HTML you provided is incomplete and contains no actual data).
I think you are trying to search for friends via
IList<IWebElement> friends = chat.FindElements(By.ClassName("m br bs"));
but here you are using the chat object, which refers to an a tag, see:
IWebElement chat = driver.FindElement(By.XPath(".//*[@id='header']/div/a[6]"));
so I would use driver instead of the chat object (because I don't know what chat returns in your case).
Furthermore, you are trying to determine the number of friends by searching for a class name that, in your HTML sample, doesn't contain any information (empty tables). I tried to look it up myself, and the difficulty is that FB does not use any unique IDs for its tables. In my browser the friend list looks something like this:
<table class="l bs bt">
<tbody>
<tr>
<td class="s bu">
<a class="bv" href="/messages/read/fbid=111&click_type=buddylist#fua">Someuser1</a>
</td>
<td class="m bw"><img src="https://blabla" width="7" height="14"class="bx by r" /></td>
</tr>
</tbody>
</table>
It seems that FB gives each contact an a tag whose href contains click_type=buddylist, so I used this information to find the users with XPath:
.//*[contains(@href,'buddylist')]
so you could read the user list with
IList<IWebElement> friends = driver.FindElements(By.XPath(".//*[contains(@href,'buddylist')]"));
It works for me. I hope this helps, or at least gives you a hint.

How to search through html table rows?

Using Windows Forms and C#.
For example...
<table id=tbl>
<tbody>
<tr>
<td>HELLO</td>
<td>YES</td>
<td>TEST</td>
</tr>
<tr>
<td>BLAH BLAH</td>
<td>YES</td>
<td>TEST</td>
</tr>
</tbody>
</table>
I load the page using the WebBrowser Control. The page loads perfectly.
The next thing I want to do is search through all the rows in the table and check if they contain a specific value ; for example in this instance YES.
If they contain it I want the row to be passed on to me so I can store it as string.
But I want the row to be in HTML form. (containing the tags).
How can I accomplish this ?
Please help me.
You can use the HtmlAgilityPack to easily parse the html. For example, to get all of the TD elements, you can do this:
string value = @" <table id=tbl>
<tbody>
<tr>
<td>HELLO</td>
<td>YES</td>
<td>TEST</td>
</tr>
<tr>
<td>BLAH BLAH</td>
<td>YES</td>
<td>TEST</td>
</tr>
</tbody>
</table>";
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(value);
var nodes = doc.GetElementbyId("tbl").SelectNodes("tbody/tr/td");
foreach (var node in nodes)
{
Debug.WriteLine(node.InnerText);
}
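The loop above prints every cell, while the question asks for whole rows that contain a given value, with their tags intact. A minimal sketch of that filtering step follows. It swaps in System.Xml.Linq so it runs without extra packages, which assumes well-formed markup (quoted attributes, closed tags); RowFilter and RowsContaining are hypothetical names. With HtmlAgilityPack the same idea would select the tr nodes and read each node's OuterHtml.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class RowFilter
{
    // Returns the markup of every <tr> that has a <td> whose text equals 'wanted'.
    public static List<string> RowsContaining(string tableMarkup, string wanted)
    {
        var table = XElement.Parse(tableMarkup);
        return table.Descendants("tr")
            .Where(tr => tr.Elements("td").Any(td => td.Value.Trim() == wanted))
            // ToString on an XElement serializes the element including its tags
            .Select(tr => tr.ToString(SaveOptions.DisableFormatting))
            .ToList();
    }
}
```

Calling RowFilter.RowsContaining(markup, "YES") returns one string per matching row, each starting with <tr> and ending with </tr>.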
You can use this: http://simplehtmldom.sourceforge.net/ . It's a really simple way to search in HTML files.
Just include simple_html_dom.php in your file and then follow this manual:
http://simplehtmldom.sourceforge.net/manual.htm
Your PHP code will look like this:
$html = file_get_html('File.html');
foreach($html->find('td') as $element)
echo $element->plaintext . '<br>';

Remove line from string where certain HTML tag is found

So I have this HTML page that is exported to an Excel file through an MVC action. The action actually goes and renders this partial view, and then exports that rendered view with correct formatting to an Excel file. However, the view is rendered exactly how it is seen before I do the export, and that view contains an "Export to Excel" button, so when I export this, the button image appears as a red X in the top left corner of the Excel file.
I can intercept the string containing this HTML to render in the ExcelExport action, and it looks like this for one example:
<div id="summaryInformation" >
<img id="ExportToExcel" style=" cursor: pointer;" src="/Extranet/img/btn_user_export_excel_off.gif" />
<table class="resultsGrid" cellpadding="2" cellspacing="0">
<tr>
<td id="NicknameLabel" class="resultsCell">Nick Name</td>
<td id="NicknameValue" colspan="3">
Swap
</td>
</tr>
<tr>
<td id="EffectiveDateLabel" class="resultsCell">
<label for="EffectiveDate">Effective Date</label>
</td>
<td id="EffectiveDateValue" class="alignRight">
02-Mar-2011
</td>
<td id ="NotionalLabel" class="resultsCell">
<label for="Notional">Notional</label>
</td>
<td id="NotionalValue" class="alignRight">
<span>
USD
</span>
10,000,000.00
</td>
</tr>
<tr>
<td id="MaturityDateLabel" class="resultsCell">
<label for="MaturityDate">Maturity Date</label>
</td>
<td id="MaturityDateValue" class="alignRight">
02-Mar-2016
-
Modified Following
</td>
<td id="TimeStampLabel" class="resultsCell">
Rate Time Stamp
</td>
<td id="Timestamp" class="alignRight">
28-Feb-2011 16:00
</td>
</tr>
<tr >
<td id="HolidatCityLabel" class="resultsCell"> Holiday City</td>
<td id="ddlHolidayCity" colspan="3">
New York,
London
</td>
</tr>
</table>
</div>
<script>
$("#ExportToExcel").click(function () {
// ajax call to do the export
var actionUrl = "/Extranet/mvc/Indications.cfc/ExportToExcel";
var viewName = "/Extranet/Views/Indications/ResultsViews/SummaryInformation.aspx";
var fileName = 'SummaryInfo.xls';
GridExport(actionUrl, viewName, fileName);
});
</script>
That <img id="ExportToExcel" tag at the top is the one I want to remove just for the export. All of what you see is contained within a C# string. How would I go and remove that line from the string so it doesn't try and render the image in Excel?
EDIT: Would probably make sense also that we wouldn't need any of the <script> in the export either, but since that won't show up in Excel anyway I don't think that's a huge deal for now.
Remove all img tags:
string html2 = Regex.Replace( html, @"(<img\/?[^>]+>)", @"",
RegexOptions.IgnoreCase );
Include reference: using System.Text.RegularExpressions;
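The pattern above strips every img tag. If only the export button should go, the pattern can be narrowed to a single id. This is a rough sketch under the assumption that the id attribute is quoted and the tag contains no > inside attribute values; HtmlCleanup and RemoveImgById are hypothetical names.

```csharp
using System.Text.RegularExpressions;

public static class HtmlCleanup
{
    // Removes only the <img> tag whose id attribute equals the given id.
    public static string RemoveImgById(string html, string id)
    {
        // <img, any attributes, id="..." or id='...', any attributes, closing >
        string pattern = @"<img\b[^>]*\bid\s*=\s*[""']" + Regex.Escape(id) + @"[""'][^>]*>";
        return Regex.Replace(html, pattern, string.Empty, RegexOptions.IgnoreCase);
    }
}
```

For the markup in the question, HtmlCleanup.RemoveImgById(html, "ExportToExcel") drops the button image and leaves any other images in place.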
If it's in a C# string then just:
myHTMLString = myHTMLString.Replace(@"<img id=""ExportToExcel"" style="" cursor: pointer;"" src=""/Extranet/img/btn_user_export_excel_off.gif"" />", "");
The safest way to do this will be to use the HTML Agility Pack to read in the HTML and then write code that removes the image node from the HTML.
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(htmlString);
HtmlNode image = doc.GetElementbyId("ExportToExcel");
image.Remove();
htmlString = doc.DocumentNode.WriteTo();
You can use similar code to remove the script tag and other img tags.
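I'm just using this:
// needs: using System; using System.IO; using System.Text;
private string RemoveImages(string html)
{
    StringBuilder retval = new StringBuilder();
    using (StringReader reader = new StringReader(html))
    {
        string line;
        // read line by line until the end of the string
        while ((line = reader.ReadLine()) != null)
        {
            // ignore leading whitespace so indented <img> lines are dropped too
            if (!line.TrimStart().StartsWith("<img", StringComparison.OrdinalIgnoreCase))
            {
                // AppendLine restores the line break that ReadLine strips
                retval.AppendLine(line);
            }
        }
    }
    return retval.ToString();
}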
