I have to check the page's canonical tag, but I have a problem when the href is empty.
This is the code:
ChromeOptions chromeCapabilities = new ChromeOptions();
chromeCapabilities.AddArguments("disable-infobars");
IWebDriver webDriver = new ChromeDriver(chromeCapabilities);
webDriver.Manage().Window.Maximize();
webDriver.Navigate().GoToUrl("https://www.example.com/subpage/page");
List<IWebElement> linkElements = webDriver.FindElements(By.TagName("link")).ToList();
string canonicalHref = linkElements.Find(x => String.Compare(x.GetAttribute("rel"), "canonical") == 0).GetAttribute("href");
//debug
var html = linkElements.Find(x => String.Compare(x.GetAttribute("rel"), "canonical") == 0);
Console.WriteLine(html.GetAttribute("outerHTML")); //<link href="" rel="canonical" />
Console.WriteLine(html.GetAttribute("href")); // should be "" but I get https://www.example.com/subpage/page
Console.WriteLine(html.GetAttribute("rel")); //canonical
Console.WriteLine(canonicalHref); // should be "" but I get https://www.example.com/subpage/page
And I get the URL instead of an empty string... but why? Did I call the wrong attribute? Any idea how to get the real value?
The GetAttribute method returns the DOM property if it's present, or the HTML attribute if the property is missing. For href, the DOM property is the fully resolved URL, so an empty href attribute resolves to the current page's absolute URL.
To get the raw HTML attribute, you'll have to use script injection:
string href = (string)((IJavaScriptExecutor)webDriver).ExecuteScript(
    "return arguments[0].getAttribute('href') || '';",
    html);
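If you are on Selenium 4, the .NET bindings also expose this directly through GetDomAttribute; a minimal sketch, assuming the same html element found above:
// A minimal sketch (assumption: Selenium 4 .NET bindings, same element as above).
// GetDomAttribute reads the attribute as written in the HTML source, so an
// empty href comes back as "" instead of the resolved page URL.
string rawHref = html.GetDomAttribute("href");
Console.WriteLine(rawHref); // "" for <link href="" rel="canonical" />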
I noticed that some calls to GetAttribute return strings that are not fully decoded.
var element = driver.FindElement(By.CssSelector($"#lien_batiment_{id}"));
var href = element.GetAttribute("href");
Actual:
"javascript:f_affiche_contenu(9367718,'coopr_detail_batiment.php?id=9367718&filtre=32&contenuonly=1%27);"
Expected:
"javascript:f_affiche_contenu(9367718,'coopr_detail_batiment.php?id=9367718&filtre=32&contenuonly=1');"
If this were normal encoding, the first apostrophe should have been encoded too, and probably the & as well.
The expected value is also what the Chrome developer panel shows. Edge shows the same, and it seems logical to need correctly decoded JS in order to execute it with IJavaScriptExecutor.
For now I use href = href.Replace("%27", "'"), but that's not a serious option if the JS ever contains a literal %27 sequence.
Any ideas?
Dependencies:
Selenium.WebDriver.ChromeDriver 103.0.5060.13400
Selenium.WebDriver 4.3.0
Selenium.Support 4.3.0
.Net Framework 4.8
Here is some more information to reproduce:
var chromeOptions = new ChromeOptions();
var driverService = ChromeDriverService.CreateDefaultService();
using(var driver = new ChromeDriver(driverService, chromeOptions))
{
driver.Navigate().GoToUrl("https://france2.simagri.com/");
// have to login with account here
driver.FindElement(By.Name("Login")).SendKeys("login");
driver.FindElement(By.Name("Password")).SendKeys("password");
driver.FindElement(By.Name("FAccesCompte")).Submit();
driver.Navigate().GoToUrl("https://france2.simagri.com/liste_batiment.php");
// you need to have some bâtiments (buildings) in the list
var id = 9132371; // use an id matching one of yours
var element = driver.FindElement(By.CssSelector($"#lien_batiment_{id}"));
var href = element.GetAttribute("href");
}
As you can see, very simple...
Presumably you need to wait a bit for the element to render completely before you extract the href attribute, introducing a WebDriverWait for ElementIsVisible as follows:
var href = new WebDriverWait(driver, TimeSpan.FromSeconds(30))
    .Until(ExpectedConditions.ElementIsVisible(By.CssSelector($"#lien_batiment_{id}")))
    .GetAttribute("href");
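If the wait alone doesn't change the value, the script-injection approach from the earlier answer applies here as well: getAttribute in the browser returns the attribute exactly as it appears in the DOM, without the percent-encoding the resolved property may carry. A minimal sketch, assuming the same driver and element from the reproduction code:
// A minimal sketch (assumption: same driver and element as in the
// reproduction code above): read the raw HTML attribute via script
// injection instead of the resolved DOM property.
var rawHref = (string)((IJavaScriptExecutor)driver).ExecuteScript(
    "return arguments[0].getAttribute('href') || '';",
    element);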
Controller code:
[HttpGet]
public FileStreamResult GETPDF(string guid)
{
var stream = XeroHelper.GetXeroPdf(guid).Result;
stream.Position = 0;
var cd = new ContentDisposition
{
FileName = $"{guid}.pdf",
Inline = true
};
Response.AppendHeader("Content-Disposition", cd.ToString());
return File(stream, "application/pdf");
}
As you can see, the method's name is GETPDF. You can also see that I am configuring the file name in the ContentDisposition header. Below, the method name is used as the title in the viewer's toolbar rather than the file name.
The file name does get propagated: when I click "Download", the file name is the default value used in the file picker (note: I changed the name to hide the sensitive GUID).
If anyone has any ideas on how to rename the title of that toolbar, it would be greatly appreciated.
As an aside, this is NOT a duplicate of "C# MVC: Chrome using the action name to set inline PDF title": no answer was accepted there, and the only one with upvotes is already implemented in my method above and still does not work.
Edit: For clarification, I do not want to open the PDF in a new tab. I want to display it in a viewer in my page. That is already happening with the code I provided; it is just the title that is wrong, coming from my controller method name. Using the controller code, I then show it in the view like so:
<h1>Quote</h1>
<object data="@Url.Action("GETPDF", new { guid = Model.QuoteGuid })" type="application/pdf" width="800" height="650"></object>
Try something like this:
[HttpGet]
public FileResult GETPDF(string guid)
{
var stream = XeroHelper.GetXeroPdf(guid).Result;
using (MemoryStream ms = new MemoryStream())
{
    stream.Position = 0; // rewind before copying, as in the original code
    stream.CopyTo(ms);
    // Download:
    //return File(ms.ToArray(), "application/pdf", $"{guid}.pdf");
    // Open (use window.open in JS):
    return File(ms.ToArray(), "application/pdf");
}
}
UPDATE: based on the mention of a viewer.
To embed in a page you can try the <embed> tag or the <object> tag.
Here is an example: Recommended way to embed PDF in HTML?
e.g.:
<embed src="https://drive.google.com/viewerng/viewer?embedded=true&url=[YOUR ACTION]" width="500" height="375">
You might need to try the File method with the third parameter to see which works.
If the title comes from the file name, maybe this will display as the title.
(Not sure what a download will do though; it may just set the download link with the PDF name.)
UPDATE 2:
Another idea:
How are you calling the URL?
Are you specifying GETPDF?guid=XXXX?
Maybe try GETPDF/XXXX (you may need to adjust the routing for this, or call the parameter "id" if that is the default).
You could do this simply by adding your file name as part of the URL:
<object data="@Url.Action("GETPDF/MyFileName", new { guid = Model.QuoteGuid })" type="application/pdf" width="800" height="650"></object>
You should ignore MyFileName in the route config. Firefox's built-in viewer is PDF.js, and Chrome's viewer picks its title the same way: both try to extract a display name from the URL.
According to the PDF.js code, it uses the following function to extract the display name from the URL:
function pdfViewSetTitleUsingUrl(url) {
  this.url = url;
  var title = pdfjsLib.getFilenameFromUrl(url) || url;
  try {
    title = decodeURIComponent(title);
  } catch (e) {
    // decodeURIComponent may throw URIError;
    // fall back to using the unprocessed url in that case
  }
  this.setTitle(title);
}

function getFilenameFromUrl(url) {
  const anchor = url.indexOf("#");
  const query = url.indexOf("?");
  const end = Math.min(
    anchor > 0 ? anchor : url.length,
    query > 0 ? query : url.length
  );
  return url.substring(url.lastIndexOf("/", end) + 1, end);
}
As you can see, this code uses the last position of "/" (before any "?" or "#") to find the file name; for example, for /Quote/GETPDF/MyFileName?guid=XXXX it returns MyFileName.
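So any URL whose last path segment before the query string is the desired name will do. A hedged sketch of what "ignore MyFileName in the route config" could look like in ASP.NET MVC 5 (the route name and pattern are assumptions, not the poster's actual config):
// Hypothetical MVC 5 route (assumptions: registered in RouteConfig, names
// invented): the trailing fileName segment exists only so the PDF viewer
// can use it as the title; the action still binds guid from the query
// string and never reads fileName.
routes.MapRoute(
    name: "PdfWithTitle",
    url: "{controller}/GETPDF/{fileName}",
    defaults: new { action = "GETPDF" }
);
With such a route, @Url.Action("GETPDF", new { fileName = "Quote.pdf", guid = Model.QuoteGuid }) should produce a URL ending in /GETPDF/Quote.pdf?guid=..., from which getFilenameFromUrl extracts Quote.pdf.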
The following code is also from PDF.js; I don't know why PDF.js doesn't use it instead of getFilenameFromUrl. It looks for a NAME.pdf match in the path first, then falls back to the query string and the fragment to find the file name.
function getPDFFileNameFromURL(url, defaultFilename = "document.pdf") {
if (typeof url !== "string") {
return defaultFilename;
}
if (isDataSchema(url)) {
console.warn(
"getPDFFileNameFromURL: " +
'ignoring "data:" URL for performance reasons.'
);
return defaultFilename;
}
const reURI = /^(?:(?:[^:]+:)?\/\/[^\/]+)?([^?#]*)(\?[^#]*)?(#.*)?$/;
// SCHEME HOST 1.PATH 2.QUERY 3.REF
// Pattern to get last matching NAME.pdf
const reFilename = /[^\/?#=]+\.pdf\b(?!.*\.pdf\b)/i;
const splitURI = reURI.exec(url);
let suggestedFilename =
reFilename.exec(splitURI[1]) ||
reFilename.exec(splitURI[2]) ||
reFilename.exec(splitURI[3]);
if (suggestedFilename) {
suggestedFilename = suggestedFilename[0];
if (suggestedFilename.includes("%")) {
// URL-encoded %2Fpath%2Fto%2Ffile.pdf should be file.pdf
try {
suggestedFilename = reFilename.exec(
decodeURIComponent(suggestedFilename)
)[0];
} catch (ex) {
// Possible (extremely rare) errors:
// URIError "Malformed URI", e.g. for "%AA.pdf"
// TypeError "null has no properties", e.g. for "%2F.pdf"
}
}
}
return suggestedFilename || defaultFilename;
}
I am trying to get an element using an absolute XPath, but it returns null. This type of XPath works on other sites for me; only on about 2% of sites does it fail. I also tried the XPath copied from Chrome, but when my XPath doesn't work, Chrome's XPath doesn't work either.
public static void Main()
{
string url = "http://www.ndrf.gov.in/tender";
HtmlWeb web = new HtmlWeb();
var htmlDoc = web.Load(url);
var nodetest1 = htmlDoc.DocumentNode.SelectSingleNode("/html[1]/body[1]/section[2]/div[1]/div[1]/div[1]/div[1]/div[2]/table[1]"); // I want this type // not working
//var nodetest2 = htmlDoc.DocumentNode.SelectSingleNode("//*[@id=\"content\"]/div/div[1]/div[2]/table"); // from Google Chrome // not working
//var nodetest3 = htmlDoc.DocumentNode.SelectSingleNode("//*[@id=\"content\"]"); // by ID, but I don't want this type // working
Console.WriteLine(nodetest1.InnerText); // fails
//Console.WriteLine(nodetest2.InnerText); // fails
//Console.WriteLine(nodetest3.InnerText); // works properly, but I don't want this type
}
The answer that @QHarr suggested works perfectly. But the reason you get null with a correct XPath is that a JavaScript file in the header of the site adds a wrapper div around the table, and since HtmlAgilityPack does not load or execute JS, the XPath returns null.
What you observe after that JS runs is:
<div class="view-content">
<div class="guide-text">
...
</div>
<div class="scroll-table1">
<!-- Your table is here -->
</div>
</div>
But what you actually get without that JS is:
<div class="view-content">
<!-- Your table is here -->
</div>
Thus your XPath should be:
var nodetest1 = htmlDoc.DocumentNode.SelectSingleNode("/html[1]/body[1]/section[2]/div[1]/div[1]/div[1]/div[1]/table[1]");
Your XPath, when used in the browser, selects the entire table. You can shorten it and use it as follows (fiddle):
using System;
using HtmlAgilityPack;
public class Program
{
public static void Main()
{
string url = "http://www.ndrf.gov.in/tender";
HtmlWeb web = new HtmlWeb();
var htmlDoc = web.Load(url);
var nodetest1 = htmlDoc.DocumentNode.SelectSingleNode("//table");
Console.WriteLine(nodetest1.InnerText);
}
}
Use Fizzler.Systems.HtmlAgilityPack.
Details here: https://www.nuget.org/packages/Fizzler.Systems.HtmlAgilityPack/
This library adds extension methods called QuerySelector and QuerySelectorAll that take a CSS selector instead of XPath.
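A minimal sketch of the same tender-table lookup with Fizzler (assumption: the Fizzler.Systems.HtmlAgilityPack NuGet package is installed):
using System;
using HtmlAgilityPack;
using Fizzler.Systems.HtmlAgilityPack; // adds QuerySelector/QuerySelectorAll
public class Program
{
    public static void Main()
    {
        string url = "http://www.ndrf.gov.in/tender";
        HtmlWeb web = new HtmlWeb();
        var htmlDoc = web.Load(url);
        // CSS selector instead of XPath; returns the first match or null
        var table = htmlDoc.DocumentNode.QuerySelector("div.view-content table");
        Console.WriteLine(table?.InnerText);
    }
}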
Ali Bordbar caught it perfectly: this URL adds a wrapper div when I navigate to it in a WebBrowser control, where all the JavaScript files are loaded,
but when I load the URL using HtmlWeb, none of the JavaScript files are loaded.
HtmlWeb retrieves the static HTML response that the server sends and does not execute any JavaScript, whereas a WebBrowser does.
So an XPath built against the WebBrowser control's DOM and one built against HtmlWeb's DOM do not match.
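To work against the rendered DOM instead, one can feed the WebBrowser control's HTML into HtmlAgilityPack; a rough sketch (assumptions: a WinForms WebBrowser named browser whose DocumentCompleted event has already fired):
// Parse the rendered DOM, JavaScript-created wrappers included, with
// HtmlAgilityPack (browser is a hypothetical WinForms WebBrowser control).
var renderedDoc = new HtmlAgilityPack.HtmlDocument();
renderedDoc.LoadHtml(browser.Document.Body.Parent.OuterHtml); // the <html> element
var table = renderedDoc.DocumentNode.SelectSingleNode("//table");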
My code below works well for this situation:
// requires: using System.Text.RegularExpressions;
HtmlWeb web = new HtmlWeb();
web.AutoDetectEncoding = true;
HtmlAgilityPack.HtmlDocument theDoc1 = web.Load("http://www.ndrf.gov.in/tender");
var HtmlDoc = new HtmlAgilityPack.HtmlDocument();
var bodytag = theDoc1.DocumentNode.SelectSingleNode("//html");
HtmlDoc.LoadHtml(bodytag.OuterHtml);
var xpathHtmldata = HtmlDoc.DocumentNode.SelectSingleNode(savexpath); // savexpath is my first XPath, built from the WebBrowser control's DOM, which works for most URLs
if (xpathHtmldata == null)
{
    // take the last tag name from the first XPath
    string mainele = savexpath.Substring(savexpath.LastIndexOf("/") + 1);
    if (mainele.Contains("[")) { mainele = mainele.Remove(mainele.IndexOf("[")); }
    // collect all tags whose name is stored in the mainele variable
    var taglist = HtmlDoc.DocumentNode.SelectNodes("//" + mainele);
    foreach (var ele in taglist) // check the elements one by one
    {
        string htmltext1 = ele.InnerText;
        htmltext1 = Regex.Replace(htmltext1, @"\s", "");
        htmltext1 = htmltext1.Replace("&amp;", "&").Trim();
        htmltext1 = htmltext1.Replace("&nbsp;", "").Trim();
        string htmltext2 = saveInnerText; // the text of my previous XPath, from the WebBrowser control's DOM
        htmltext2 = Regex.Replace(htmltext2, @"\s", "");
        if (htmltext1 == htmltext2) // if the texts are equal, this element's XPath is the new one
        {
            savexpath = ele.XPath;
            break;
        }
    }
}
With the code below I have extracted all the desired text out of an HTML document:
private IWebDriver driver;

private void RunThroughSearch(string url)
{
    driver = new FirefoxDriver();
    INavigation nav = driver.Navigate();
    nav.GoToUrl(url);
    var div = driver.FindElement(By.Id("results"));
    var elements = driver.FindElements(By.ClassName("sa_wr"));
}
Though I need to refine the results into this structure:
Container
HEADER -> Title of a given block
Url -> Link to the relevant block
text -> body of a given block
/Container
As you can see in my code, I am able to get the value of the text part as a text value, and that was fine. But what if I want the value of the container as HTML rather than the extracted text?
<div class="container">
<div class="Header"> Title...</div>
<div class="Url"> www.example.co.il</div>
<div class="ResConent"> bla.. </div>
</div>
The container appears about 10 times in a page, and I need to extract its innerHTML.
Any ideas? (using Selenium)
This seemed to work for me, and is less code:
var element = driver.FindElement(By.ClassName("sa_wr"));
var innerHtml = element.GetAttribute("innerHTML");
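Since the container appears about 10 times on the page, the same call also works per element in a loop; a minimal sketch, assuming the sa_wr class from the question's code:
// Iterate all matching containers and read each one's inner HTML.
var containers = driver.FindElements(By.ClassName("sa_wr"));
foreach (var container in containers)
{
    Console.WriteLine(container.GetAttribute("innerHTML"));
}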
Find the element first, then use IJavaScriptExecutor to get the inner HTML.
var element = driver.FindElement(By.ClassName("sa_wr")); // a single element, not FindElements
IJavaScriptExecutor js = driver as IJavaScriptExecutor;
if (js != null)
{
    string innerHtml = (string)js.ExecuteScript("return arguments[0].innerHTML;", element);
}
I found the solution on SQA-SO:
IJavaScriptExecutor js = driver as IJavaScriptExecutor; // driver is an existing IWebDriver instance
js.ExecuteScript("document.getElementById('title').innerHTML = 'New text!';");
I'm doing a project in Umbraco and have encountered a problem in one case: when calling node.NiceUrl I get # as the result. What is weird, though, is that if I debug it, it somehow resolves to the correct URL.
var pages = Pages.Select((item, index) => new
{
Url = item.NiceUrl,
Selected = item.Id == currentPage.Id,
Index = index
}).ToList();
Where Pages is obtained from:
CurrentPage.Parent.ChildrenAsList
If I do it this way, it works, but I don't know why.
Url = new Node(item.Id).NiceUrl,
I've encountered this error and it was because the id belonged to a media node.
Media is treated differently from other content, and there's no easy way of getting the URL because different types of media store it in different ways depending on context. That's why the NiceUrl function doesn't work for media (according to the Umbraco developers).
My specific scenario was using images that had been selected with a media picker. I got the URL via the following code, wrapped in an extension method so you can consume it from a template in a convenient way.
public static string GetMediaPropertyUrl(this IPublishedContent thisContent, string alias, UmbracoHelper umbracoHelper = null)
{
string url = "";
if (umbracoHelper == null)
umbracoHelper = new UmbracoHelper(UmbracoContext.Current);
var property = thisContent.GetProperty(alias);
string nodeID = property != null ? property.Value.ToString() : "";
if (!string.IsNullOrWhiteSpace(nodeID))
{
//get the media via the umbraco helper
var media = umbracoHelper.TypedMedia(nodeID);
//if we got the media, return the url property
if (media != null)
url = media.Url;
}
return url;
}
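Consuming it from a template could look like this; a hypothetical sketch (the "mainImage" alias and Umbraco 7-style template model are assumptions):
// Hypothetical usage in an Umbraco 7 Razor template, where Model.Content
// is the current page's IPublishedContent and "mainImage" is an assumed
// media picker property alias:
var imageUrl = Model.Content.GetMediaPropertyUrl("mainImage");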
Try it like this:
Url = umbraco.library.NiceUrl(item.Id);