How to extract a specific link in C#?

I'm using HtmlAgilityPack to extract some data from the following website:
<div class="pull-right">
<ul class="list-inline">
<li class="social">
<a target="_blank" href="https://www.facebook.com/wsat.a?ref=ts&fref=ts" class="">
<i class="icon fa fa-facebook" aria-hidden="true"></i>
</a>
</li>
<li class="social">
<a target="_blank" href="https://twitter.com/wsat_News" class="">
<i class="icon fa fa-twitter" aria-hidden="true"></i>
</a>
</li>
<li>
<a href="/user" class="hide">
<i class=" icon fa fa-user" aria-hidden="true"></i>
</a>
</li>
<li>
<a onclick="ga('send', 'event', 'PDF', 'Download', '');" href="https://wsat.com/pdf/issue15170/index.html" target="_blank" class="">
PDF
<i class="icon fa fa-file-pdf-o" aria-hidden="true"></i>
</a>
</li>
I've managed to write the code below, which extracts the first link in the HTML, which is https://www.facebook.com/wsat. However, all I want is the link to the PDF, which is
https://wsat.com/pdf/issue15170/index.html, and I've had no luck so far. How do I specify which link to extract?
var url = "https://wsat.com/";
var HttpClient = new HttpClient();
var html = await HttpClient.GetStringAsync(url);
var htmlDocument = new HtmlDocument();
htmlDocument.LoadHtml(html);
var links = htmlDocument.DocumentNode.Descendants("div").Where(node => node.GetAttributeValue("class", "").Equals("pull-right")).ToList();
var alink = links.First().Descendants("a").FirstOrDefault().ChildAttributes("href")?.FirstOrDefault().Value;
await Launcher.OpenAsync(alink);

Use an xpath expression as a selector:
var alink = htmlDocument.DocumentNode
.SelectSingleNode("//li/a[contains(@onclick, 'PDF')]")
.GetAttributeValue("href", "");
Explanation of the XPath (as requested):
Match an li tag at any depth in the document with an immediate child a tag that has an onclick attribute containing the string 'PDF'.
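To see what that expression matches without installing anything, here is a minimal sketch that runs the same XPath with the .NET base class library (System.Xml.Linq) instead of HtmlAgilityPack. It assumes a trimmed, well-formed copy of the markup; real pages are often not valid XML, which is why HtmlAgilityPack remains the better tool in practice. The names XPathDemo, Sample and FindPdfLink are made up for illustration.

```csharp
using System.Xml.Linq;
using System.Xml.XPath;

public static class XPathDemo
{
    // Trimmed, well-formed copy of the markup from the question.
    // Assumption: the real page may not parse as XML at all.
    public const string Sample = @"<ul class=""list-inline"">
  <li class=""social""><a target=""_blank"" href=""https://twitter.com/wsat_News"">Twitter</a></li>
  <li><a href=""/user"" class=""hide"">User</a></li>
  <li><a onclick=""ga('send', 'event', 'PDF', 'Download', '');"" href=""https://wsat.com/pdf/issue15170/index.html"">PDF</a></li>
</ul>";

    // Returns the href of the first <a> that is a direct child of an <li>
    // and whose onclick attribute contains 'PDF'; returns null if nothing matches.
    public static string FindPdfLink(string xml)
    {
        XDocument doc = XDocument.Parse(xml);
        XElement link = doc.XPathSelectElement("//li/a[contains(@onclick, 'PDF')]");
        return link?.Attribute("href")?.Value;
    }
}
```

Calling XPathDemo.FindPdfLink(XPathDemo.Sample) should return the PDF link, while the Twitter and /user links are skipped because their a tags have no onclick attribute.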

In your query, Descendants("a") selects all links inside the root div, and the following FirstOrDefault() returns just the first of them. Instead, you can map every link to its href and then filter the resulting collection of strings to find the one you need.
var alink = links.First().Descendants("a")
.Select(node => node.ChildAttributes("href").FirstOrDefault()?.Value)
.Where(s => !string.IsNullOrEmpty(s))
.ToList();
foreach (var l in alink)
{
Console.WriteLine(l);
}
Console.WriteLine();
var wsatCom = alink.FirstOrDefault(s => s.StartsWith("https://wsat.com"));
Console.WriteLine(wsatCom);
In addition, the ?. operator is needed after FirstOrDefault(), not before it, if you want to handle links without an href. I believe in that case ChildAttributes("href") returns an empty collection, FirstOrDefault() returns null, and you get a NullReferenceException.
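The placement of ?. can be checked with plain LINQ. This is a minimal sketch that uses a hypothetical Attr class as a stand-in for HtmlAgilityPack's attribute type; only the position of the operator matters here.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for an HTML attribute; not an HtmlAgilityPack type.
public sealed class Attr
{
    public string Value { get; set; }
}

public static class NullConditionalDemo
{
    // FirstOrDefault() returns null for an empty sequence of reference types,
    // so ?. has to follow it to avoid dereferencing null.
    public static string FirstValueOrNull(IEnumerable<Attr> attrs)
    {
        return attrs.FirstOrDefault()?.Value;
    }

    // Here ?. only guards against attrs itself being null. For an empty
    // sequence, FirstOrDefault() still returns null and .Value throws.
    public static string FirstValueUnsafe(IEnumerable<Attr> attrs)
    {
        return attrs?.FirstOrDefault().Value;
    }
}
```

With an empty list, FirstValueOrNull returns null, whereas FirstValueUnsafe throws a NullReferenceException.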

Could Regex help you here? I think it will be a lot easier than using the HTML agility pack to traverse through the links and feels a lot less like a lucky shot.
using System;
using System.Text.RegularExpressions;
public class Example
{
public static void Main()
{
string pattern = @"https:\/\/wsat\.com\/[\w\-\.]+[^#?\s][^""]+";
string input = @"<div class=""pull-right"">
<ul class=""list-inline"">
<li class=""social"">
<a target=""_blank"" href=""https://www.facebook.com/wsat.a?ref=ts&fref=ts"" class="""">
<i class=""icon fa fa-facebook"" aria-hidden=""true""></i>
</a>
</li>
<li class=""social"">
<a target=""_blank"" href=""https://twitter.com/wsat_News"" class="""">
<i class=""icon fa fa-twitter"" aria-hidden=""true""></i>
</a>
</li>
<li>
<a href=""/user"" class=""hide"">
<i class="" icon fa fa-user"" aria-hidden=""true""></i>
</a>
</li>
<li>
<a onclick=""ga('send', 'event', 'PDF', 'Download', '');"" href=""https://wsat.com/pdf/issue15170/index.html"" target=""_blank"" class="""">
PDF
<i class=""icon fa fa-file-pdf-o"" aria-hidden=""true""></i>
</a>
</li>";
RegexOptions options = RegexOptions.Multiline;
foreach (Match m in Regex.Matches(input, pattern, options))
{
Console.WriteLine("'{0}' found at index {1}.", m.Value, m.Index);
}
}
}

For this kind of job I'd recommend using AngleSharp.
It allows you to use CSS selectors to select whatever element you need.
var doc = new HtmlParser().ParseDocument(myHtml);
var pdfUrl = doc.QuerySelector("ul.list-inline li:nth-child(4) a").GetAttribute("href");
or
var links = doc.QuerySelectorAll("ul.list-inline a").Where(a=> a.GetAttribute("href").StartsWith("https://wsat.com/pdf/")).ToList();
A bonus is that you can always test your selector in any browser's developer console without having to write and compile any C#.

Related

How to select an unordered list in selenium C# .. Not able to catch

I am trying to find the ul with id "mainTimeSheet" and the Weekly Report item with id "timesheetReport", but I cannot locate it.
I am new to Selenium with C#.
Below is the code I am trying in C#:
//var drop = driver.FindElement(By.XPath("//ul[@id='mainTimeSheet']"));
var drop = driver.FindElement(By.Id("mainTimeSheet")).
FindElement(By.XPath(".//li[@id='timesheetReport']"));
drop.Click();
HTML Code:
<li class="dropdown">
<a href="javascript:;" data-toggle="collapse" data-target="#mainTimeSheet">
<i class="fa fa-fw fa-tasks"></i> Time Sheet <span class="fa arrow"></span>
</a>
<ul id="mainTimeSheet" class="nav nav-second-level collapse">
<li>
<a id="timesheetReport" href="/ShaeetsListsadas.aspx"><i class="fa fa-fw fa-tasks"></i> Weekly Report</a>
</li>
<li>
<a id="submitTimeSheet" href="/SheetsFormsads.aspx"><i class="fa fa-fw fa-tasks"></i> Submit Timesheet</a>
</li>
<li>
<a id="timeSheetDetailReport" href="/sheetsReportsasd.aspx"><i class="fa fa-fw fa-tasks"></i> Day Wise Report</a>
</li>
</ul>
</li>
Please check in the Chrome dev tools whether the element has a unique entry in the HTML DOM.
Steps to check:
Press F12 in Chrome -> go to the Elements tab -> press CTRL + F -> paste the XPath and see whether your desired element is highlighted as the 1 of 1 matching node.
XPath to check:
//li//a[@id='timesheetReport']
If it shows 1 of 1 matching node, then click it like this:
WebDriverWait wait = new WebDriverWait(driver, TimeSpan.FromSeconds(10));
IWebElement timeSheet = wait.Until(e => e.FindElement(By.XPath("//ul[@id='mainTimeSheet']")));
timeSheet.Click();
IWebElement weeklyReport = wait.Until(e => e.FindElement(By.XPath("//li//a[@id='timesheetReport']")));
weeklyReport.Click();
Update:
To resolve an ElementNotInteractableException, debug your code as follows:
Make sure the browser window is maximized:
driver.Manage().Window.Maximize();
Use the Actions class:
Actions actionProvider = new Actions(driver);
actionProvider.MoveToElement(new WebDriverWait(driver, TimeSpan.FromSeconds(10)).Until(e => e.FindElement(By.XPath("//ul[@id='mainTimeSheet']")))).Click().Build().Perform();
Use ExecuteScript:
WebDriverWait wait = new WebDriverWait(driver, TimeSpan.FromSeconds(10));
IWebElement timeSheet = wait.Until(e => e.FindElement(By.XPath("//ul[@id='mainTimeSheet']")));
((IJavaScriptExecutor)driver).ExecuteScript("arguments[0].click(); ", timeSheet);

How to create menu dynamically with item selected in MVC5 Razor View?

I have this:
<ul class="side-nav">
<li class="side-nav-item active">
<a @Url.Action("Dashboard", "Home")>
<i class="dripicons-meter"></i>
<span> Dashboard </span>
</a>
</li>
<li class="side-nav-item">
<a @Url.Action("New", "Ticket")>
<i class="dripicons-plus"></i>
<span> @_Layout.NewTicket </span>
</a>
</li>
<li class="side-nav-item">
@***** submenu ******@
<a href="#nav-settings" class="nav-app side-nav-link" data-toggle="collapse" aria-expanded="false">
<i class="dripicons-view-apps"></i>
<span class="mf-rotate"> @_Layout.Settings </span>
<span class="menu-arrow"></span>
</a>
<ul id="nav-settings" class="side-nav-second-level collapse">
<li>
@_Layout.ManageUsers
</li>
<li>
@_Layout.MailingListManager
</li>
</li>
</ul>
</li>
</ul>
I want to build the menu dynamically so that the selected item gets the active class, including items in the submenu.
What's the best way: via C# or JS?
With C# I found this:
public static class MenuExtensions
{
public static MvcHtmlString MenuItem(this HtmlHelper htmlHelper, string text, string action, string controller)
{
var a = new TagBuilder("a");
var routeData = htmlHelper.ViewContext.RouteData;
var currentAction = routeData.GetRequiredString("action");
var currentController = routeData.GetRequiredString("controller");
if (string.Equals(currentAction, action, StringComparison.OrdinalIgnoreCase) &&
string.Equals(currentController, controller, StringComparison.OrdinalIgnoreCase))
{
a.AddCssClass("active");
}
a.InnerHtml = htmlHelper.ActionLink(text, action, controller).ToHtmlString();
return MvcHtmlString.Create(a.ToString());
}
}
and it is used like this: @Html.MenuItem("Home", "Index", "Home")
But how do I insert the <i class="..."></i> and <span> tags?
Thanks

HtmlAgilityPack - How to select first a tag href while using selectnodes

I'm trying to select the first a tag and get its href value. The problem is that I'm using SelectNodes.
Here is the markup I want to select an href value from:
<li>
<a class="img" href="link1"></a>
<div class="m_text">
<a class="title" href="link2" rel="27418">A Story</a>
<p><span class="stars star45"></span><span class="rate">4.35</span></p>
<p class="info" title="Action"></p>
<p class="nowrap latest">A Story</span> 29</p>
</div>
</li>
<li>
<a class="img" href="link1"></a>
<div class="m_text">
<a class="title" href="link2" rel="27418">A Story</a>
<p><span class="stars star45"></span><span class="rate">4.35</span></p>
<p class="info" title="Action"></p>
<p class="nowrap latest">A Story</span> 29</p>
</div>
</li>
As you can see, I have to select the href value of the first a tag several times (once per li), and then I will loop over the results with foreach.
The HTML I want to get the value from is:
<a class="img" href="link1"></a>
My code:
var documentx = new HtmlWeb().Load(post.ExtLink);
var div = documentx.DocumentNode.SelectNodes("//div[@id='content']/*//ul[@class='list']//li");
var test = div.Descendants("a")
.Select(a => a.GetAttributeValue("href", null))
.Where(s => !String.IsNullOrEmpty(s))
.ToList();
My code works, but it gets all the a tag values, and I only want the href value of the first a tag.
Change
.Where(s=> !String.IsNullOrEmpty(s))
To
.FirstOrDefault(s=> !String.IsNullOrEmpty(s))
And remove the .ToList() at the end.
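Note that FirstOrDefault yields a single value for the whole node collection. If the goal is the first link of every li, one option is to query each li separately. This is a minimal sketch, assuming the HtmlAgilityPack NuGet package is referenced; the class and method names are illustrative, and the XPath is simplified to //li instead of the full path from the question.

```csharp
using System.Collections.Generic;
using HtmlAgilityPack;

public static class FirstLinkPerItem
{
    // Returns the href of the first direct <a> child of every <li> in the markup.
    public static List<string> GetFirstHrefs(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var hrefs = new List<string>();

        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection items = doc.DocumentNode.SelectNodes("//li");
        if (items == null) return hrefs;

        foreach (HtmlNode li in items)
        {
            // "./a" only looks at direct children of this li, so the nested
            // <a class="title"> inside the div is not considered.
            string href = li.SelectSingleNode("./a")?.GetAttributeValue("href", string.Empty);
            if (!string.IsNullOrEmpty(href)) hrefs.Add(href);
        }
        return hrefs;
    }
}
```

For the markup in the question this should produce one entry per li, taken from the a tag with class "img".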

Select innerHTML between tag

In the HTML page, I need to match every piece of innerHTML, one by one.
I wrote a regex which matches all the tags but not the innerHTML (BR tags are left out as well), but I cannot do the opposite...
([<][^br][^<]*[>])
You can see an example on this URL : https://regex101.com/r/h9tKHj/1
On this DOM :
<li class="product-faq-item">
<p class="product-faq-title">{{XXXXXXXXXXXX1}}</p>
<div class="product-faq-container">
<p class="product-faq-text">{{XXXXXXXXXXXX2}}<br>
{{XXXXXXXXXXXX3}}
</p>
</div>
</li>
<li class="product-faq-item">
<p class="product-faq-title">{{XXXXXXXXXXXX4}}</p>
<div class="product-faq-container">
<p class="product-faq-text">{{XXXXXXXXXXXX5}}</p>
</div>
</li>
My goal is to recover this :
Match 1 : {{XXXXXXXXXXXX1}}
Match 2 : {{XXXXXXXXXXXX2}}
Match 3 : {{XXXXXXXXXXXX3}}
Match 4 : {{XXXXXXXXXXXX4}}
Match 5 : {{XXXXXXXXXXXX5}}
Thanks in advance for your help !
Have a nice day,
Anthony,
If you want to replace each {{key}} with a value, maybe replace it like this:
var input = @"<li class='product-faq-item'>
<p class='product-faq-title'>{{XXXXXXXXXXXX1}}</p>
<div class='product-faq-container'>
<p class='product-faq-text'>{{XXXXXXXXXXXX2}}<br>
{{XXXXXXXXXXXX3}}
</p>
</div>
</li>
<li class='product-faq-item'>
<p class='product-faq-title'>{{XXXXXXXXXXXX4}}</p>
<div class='product-faq-container'>
<p class='product-faq-text'>{{XXXXXXXXXXXX5}}</p>
</div>
</li>";
var regex = new Regex("{{.*?}}");
var dic = new Dictionary<string, object>();
dic["XXXXXXXXXXXX1"] = "X1Val";
dic["XXXXXXXXXXXX2"] = "X2Val";
dic["XXXXXXXXXXXX3"] = "X3Val";
dic["XXXXXXXXXXXX4"] = "X4Val";
dic["XXXXXXXXXXXX5"] = "X5Val";
var output = regex.Replace(input, match => $"{dic[match.Value.Replace("{", "").Replace("}", "")]}");
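If the goal is to list the text between the tags rather than replace it, a capturing pattern with Regex.Matches is one option. This is a minimal sketch under the assumption that the markup is as simple as the sample (no attribute values containing angle brackets, no comments or scripts); for arbitrary HTML, a parser is more reliable. The class name InnerTextMatcher and the pattern are illustrative, not taken from the question.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class InnerTextMatcher
{
    // '>' followed by optional whitespace, then a captured run of text that
    // starts with a non-whitespace character and contains no angle brackets,
    // then optional whitespace and '<'.
    private static readonly Regex TextBetweenTags =
        new Regex(@">\s*([^<>\s][^<>]*?)\s*<");

    // Returns every non-empty piece of text found between two tags, trimmed.
    public static List<string> Extract(string html)
    {
        var result = new List<string>();
        foreach (Match m in TextBetweenTags.Matches(html))
        {
            result.Add(m.Groups[1].Value);
        }
        return result;
    }
}
```

Because a br tag is just another tag to this pattern, the text before and after it comes out as two separate matches, which is the behaviour the question asks for.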

HtmlAgilityPack work with structure

I'm starting to write a crawler in C#, and I heard that HtmlAgilityPack is the best solution for this.
I can't find a good usage example, so maybe someone here can help me with my issue.
In one class I use a method to get the part of the markup I want, for example the ul with class "testable ul":
public static string GetElement(string url, string element, string type, string name)
{
HtmlWeb web = new HtmlWeb();
HtmlDocument doc = web.Load(url);
string rate = doc.DocumentNode.SelectSingleNode("//"+ element +"[@"+ type +"='"+ name +"']").OuterHtml;
return rate;
}
so I am running
string content = SiteMethods.GetElement(startPage, "ul", "class", "testable ul");
Then there is a part where I do some background work, and in the end I load that string into HtmlAgilityPack again:
HtmlDocument html = new HtmlDocument();
html.OptionOutputAsXml = true;
html.LoadHtml(content);
HtmlNode document = html.DocumentNode;
And here I have a problem.
The structure inside content string is like that:
<ul class="testable ul">
<li>
<a href="http://www.veryimportant.link">
<div class="img">
<img src="http://image.so.important/">
</div>
<div class="info">
<span class="name">
NAME
</span>
<span class="price">10</span>
<span class="price2">8</span>
<span class="grade">C</span>
</div>
<p class="tips">tips</p>
</a>
</li>
<li>
<a href="http://www.veryimportant.link/2">
<div class="img">
<img src="http://image.so.important/2">
</div>
<div class="info">
<span class="name">
NAME2
</span>
<span class="price">3</span>
<span class="price2">4</span>
<span class="grade">A</span>
</div>
<p class="tips">tips2</p>
</a>
</li>
</ul>
So the questions are:
How do I get every <li> into a different object, for further processing?
Is it possible to get the links http://www.veryimportant.link and http://www.veryimportant.link/2, or for example the images http://image.so.important/ and http://image.so.important/2, with one simple command? How do I get them?
How do I get NAME and NAME2 into a list?
Is it possible to map the whole HTML structure to a list?
Please include some examples; with them, the rest of the learning will be really easy.
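One possible approach is to select every li with XPath and copy the values of interest into a small class. This is a minimal sketch, assuming the HtmlAgilityPack NuGet package is referenced and the markup has the structure shown above; Item, ItemParser and their members are hypothetical names introduced for illustration.

```csharp
using System.Collections.Generic;
using HtmlAgilityPack;

// Hypothetical container for the data of one <li>; property names are illustrative.
public sealed class Item
{
    public string Link { get; set; }
    public string Image { get; set; }
    public string Name { get; set; }
    public string Price { get; set; }
}

public static class ItemParser
{
    // Maps every <li> of the "testable ul" list to one Item.
    public static List<Item> Parse(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var items = new List<Item>();

        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//ul[@class='testable ul']/li");
        if (nodes == null) return items;

        foreach (HtmlNode li in nodes)
        {
            items.Add(new Item
            {
                // ".//" searches only inside the current li.
                Link = li.SelectSingleNode(".//a")?.GetAttributeValue("href", string.Empty),
                Image = li.SelectSingleNode(".//img")?.GetAttributeValue("src", string.Empty),
                // InnerText keeps the surrounding whitespace, so trim it.
                Name = li.SelectSingleNode(".//span[@class='name']")?.InnerText.Trim(),
                Price = li.SelectSingleNode(".//span[@class='price']")?.InnerText.Trim()
            });
        }
        return items;
    }
}
```

With the resulting list, the names alone can be obtained with LINQ, for example ItemParser.Parse(content).Select(i => i.Name).ToList(), and the links or image URLs in the same way.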
