HtmlAgilityPack does not select subnodes with XPath [duplicate]

I just wrote up this test to see if I was crazy...
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using HtmlAgilityPack;

namespace HtmlAgilityPackFormBug
{
    class Program
    {
        static void Main(string[] args)
        {
            var doc = new HtmlDocument();
            doc.LoadHtml(@"
<!DOCTYPE html>
<html>
<head>
    <title>Form Test</title>
</head>
<body>
    <form>
        <input type=""text"" />
        <input type=""reset"" />
        <input type=""submit"" />
    </form>
</body>
</html>
");
            // Print the XPath of every element that is a direct child of <body>.
            var body = doc.DocumentNode.SelectSingleNode("//body");
            foreach (var node in body.ChildNodes.Where(n => n.NodeType == HtmlNodeType.Element))
                Console.WriteLine(node.XPath);
            Console.ReadLine();
        }
    }
}
And it outputs:
/html[1]/body[1]/form[1]
/html[1]/body[1]/input[1]
/html[1]/body[1]/input[2]
/html[1]/body[1]/input[3]
But, if I change <form> to <xxx> it gives me:
/html[1]/body[1]/xxx[1]
(As it should). So... it looks like those input elements are not contained within the form, but directly within the body, as if the <form> just closed itself off immediately. What's up with that? Is this a bug?
Digging through the source, I see:
ElementsFlags.Add("form", HtmlElementFlag.CanOverlap | HtmlElementFlag.Empty);
It has the "empty" flag, like META and IMG. Why?? Forms are most definitely not supposed to be empty.

This is also reported in this workitem. It contains a suggested workaround from DarthObiwan.
You can change this without recompiling. The ElementsFlags list is a static property on the HtmlNode class. The entry can be removed with
HtmlNode.ElementsFlags.Remove("form");
before doing the document load.
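The workaround quoted above can be sketched roughly as follows. The class and method names (FormFlagWorkaround, FormChildXPaths) are made up for illustration; only HtmlNode.ElementsFlags.Remove("form") comes from the quoted advice. Note that ElementsFlags is static, so the change applies to every document loaded afterwards in the same process.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

static class FormFlagWorkaround
{
    // Hypothetical helper: returns the XPath of every element that is a
    // direct child of the first <form> in the given HTML.
    public static string[] FormChildXPaths(string html)
    {
        // Drop the special-case entry for "form" so the parser treats it like
        // an ordinary container. Must run before the document is loaded.
        HtmlNode.ElementsFlags.Remove("form");

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        HtmlNode form = doc.DocumentNode.SelectSingleNode("//form");
        if (form == null)
            return new string[0];

        return form.ChildNodes
            .Where(n => n.NodeType == HtmlNodeType.Element)
            .Select(n => n.XPath)
            .ToArray();
    }
}
```

With the flag removed, the input elements from the test page should be reported as children of the form rather than of the body.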

Since I'm the original HAP author, I can explain why it's marked as empty :)
This is because when HAP was designed, back in 2000, HTML 3.2 was the standard. You're probably aware that tags can overlap in HTML. That is, <b>bold<i>italic and bold</b>italic</i> (rendered as "bold" in bold, "italic and bold" in bold italic, and "italic" in italic) is supported by all browsers (although it's not officially in the HTML specification). And the FORM tag can overlap in the same way.
Since HAP was designed to handle any HTML content, rather than break most pages that you could find at that time, we decided to handle overlapping tags as EMPTY (using the ElementsFlags property), so:
you can still load them
you can save them back without breaking the original HTML (if you don't need what's inside the form in any programmatic way).
The only thing you cannot do is work with them programmatically: through the API using the tree model, with XSL, and so on.
Today, with XHTML/XML almost everywhere, this sounds strange, but that's why I created the ElementsFlags :)

Related

How do ASP.NET Core's "asp-fallback-*" CDN tag helpers work?

I understand what the asp-fallback-* tag helpers do. What I don't understand is how.
For example:
<link rel="stylesheet"
href="//ajax.aspnetcdn.com/ajax/bootstrap/3.3.5/css/bootstrap.min.css"
asp-fallback-href="~/lib/bootstrap/dist/css/bootstrap.min.css"
asp-fallback-test-class="sr-only"
asp-fallback-test-property="position"
asp-fallback-test-value="absolute" />
This loads bootstrap from the CDN, and loads the local copy if the CDN is down.
But how does it decide to do that? I assume it checks asp-fallback-test-class, asp-fallback-test-property, and asp-fallback-test-value. But what do those attributes mean?
If I want to hook up some other library off a CDN, I'll need to supply something for those, but I'm not sure what to put there.
There are lots of examples of this in action, but I can't find explanations about how this works.
UPDATE
I'm not really trying to understand how the tag helpers work - how they render, and so on. I'm trying to understand how to choose values for those attributes. For example, the jQuery fallback script usually has asp-fallback-test="window.jQuery" which makes sense - it's a test to see if jQuery has loaded. But the ones I've shown above are quite different. How does one choose them? If I want to use some other CDN delivered library, I'll need to specify values for those attributes... what would I use? Why were those ones chosen for bootstrap?
UPDATE 2
To understand how the fallback process itself works, and how those tags are written, see @KirkLarkin's answer. To understand why those test values were used, see my answer.
UPDATE 3
In Bootstrap 5 the sr-only class was renamed to visually-hidden.

TL;DR:
A <meta> tag is added to the DOM that has a CSS class of sr-only.
Additional JavaScript is written to the DOM, which:
Locates said <meta> element.
Checks whether said element has a CSS property position that is set to absolute.
If no such property value is set, an additional <link> element is written to the DOM with a href of ~/lib/bootstrap/dist/css/bootstrap.min.css.
The LinkTagHelper class that runs against your <link> elements inserts a <meta> element in the output HTML that is given a CSS class of sr-only. The element ends up looking like this:
<meta name="x-stylesheet-fallback-test" content="" class="sr-only" />
The code that generates the element looks like this (source):
builder
    .AppendHtml("<meta name=\"x-stylesheet-fallback-test\" content=\"\" class=\"")
    .Append(FallbackTestClass)
    .AppendHtml("\" />");
Unsurprisingly, the value for FallbackTestClass is obtained from the <link>'s asp-fallback-test-class attribute.
Right after this element is inserted, a corresponding <script> block is also inserted (source). The code for that starts off like this:
// Build the <script /> tag that checks the effective style of <meta /> tag above and renders the extra
// <link /> tag to load the fallback stylesheet if the test CSS property value is found to be false,
// indicating that the primary stylesheet failed to load.
// GetEmbeddedJavaScript returns JavaScript to which we add '"{0}","{1}",{2});'
builder
    .AppendHtml("<script>")
    .AppendHtml(JavaScriptResources.GetEmbeddedJavaScript(FallbackJavaScriptResourceName))
    .AppendHtml("\"")
    .AppendHtml(JavaScriptEncoder.Encode(FallbackTestProperty))
    .AppendHtml("\",\"")
    .AppendHtml(JavaScriptEncoder.Encode(FallbackTestValue))
    .AppendHtml("\",");
There are a few things of interest here:
The last line of the comment, which refers to placeholders {0}, {1} and {2}.
FallbackJavaScriptResourceName, which represents a JavaScript resource that is output into the HTML.
FallbackTestProperty and FallbackTestValue, which are obtained from the attributes asp-fallback-test-property and asp-fallback-test-value respectively.
So, let's have a look at that JavaScript resource (source), which boils down to a function with the following signature:
function loadFallbackStylesheet(cssTestPropertyName, cssTestPropertyValue, fallbackHrefs, extraAttributes)
Combining this with the last line of the comment I called out earlier and the values of asp-fallback-test-property and asp-fallback-test-value, we can reason that this is invoked like so:
loadFallbackStylesheet('position', 'absolute', ...)
I won't dig into the fallbackHrefs and extraAttributes parameters as that should be somewhat obvious and easy to explore on your own.
The implementation of loadFallbackStylesheet does not do a great deal - I encourage you to explore the full implementation on your own. Here's the actual check from the source:
if (metaStyle && metaStyle[cssTestPropertyName] !== cssTestPropertyValue) {
    for (i = 0; i < fallbackHrefs.length; i++) {
        doc.write('<link href="' + fallbackHrefs[i] + '" ' + extraAttributes + '/>');
    }
}
The script obtains the relevant <meta> element (it's assumed to be directly above the <script> itself) and simply checks that it has a property of position that is set to absolute. If it does not, additional <link> elements are written to the output for each fallback URL.

OK, I think I get it now, by combining @KirkLarkin's answer and common sense.
The sr-only class is applied to a hidden <meta> element. If Bootstrap has loaded, that element gets a CSS value of position: absolute. That is what gets tested: if the value is there, Bootstrap has been loaded.
So for any library, you need to choose a good example of something only that library can do, and style a hidden <meta> tag accordingly, then specify which CSS property to test and what value you are expecting.
For JavaScript it's even easier, because you can just test for the library itself, which usually adds some well-known variable to window or something to the DOM. So for jQuery it's window.jQuery, and Bootstrap can be tested with window.jQuery && window.jQuery.fn && window.jQuery.fn.modal, and so on.

can you find html element by attribute with csquery

Can I use CsQuery to find an HTML element that has a certain attribute with a certain value?
So if I have a page containing something like this:
<html>
<body>
<div align="left">something</div>
</body>
</html>
Can I then get the whole line out by searching for a div with the attribute align set to left? Or even just get the HTML element, and then read the value of the attribute?
As always, thanks for the help and time.

I haven't used CsQuery myself, but judging by the docs you can use CSS selectors, so this should work:
div[align='left']
EDIT:
After being assured that this is in response to a client side operation, in the script it should look like this:
var rows = query["div[align='left']"];
This is how you look up elements by tag and attribute selectors: put the attribute name in square brackets after the tag name, together with the value to match.
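Putting the selector to work might look like the sketch below. It is written from memory of the CsQuery API (CQ.Create, the selector indexer, IDomObject.Render and GetAttribute), so treat those calls as assumptions to check against the version you have installed. The class and method names are invented for the example.

```csharp
using System;
using System.Collections.Generic;
using CsQuery;

static class AlignedDivFinder
{
    // Hypothetical helper: returns the markup of every <div align="left">.
    public static List<string> LeftAlignedDivMarkup(string html)
    {
        CQ dom = CQ.Create(html);
        CQ divs = dom["div[align='left']"];

        var result = new List<string>();
        foreach (IDomObject div in divs)
        {
            // Render() is assumed to return the element's own markup;
            // GetAttribute reads a single attribute value from it.
            string align = div.GetAttribute("align");
            if (align != null)
                result.Add(div.Render());
        }
        return result;
    }
}
```

To answer the second half of the question, the attribute value itself is available from the same loop via div.GetAttribute("align").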

Iteration through the HtmlDocument.All collection stops at the referenced stylesheet?

Since "bug in .NET" is often not the real cause of a problem, I wonder if I'm missing something here.
What I'm doing feels pretty simple. I'm iterating through the elements in a HtmlDocument called doc like this:
System.Diagnostics.Debug.WriteLine("*** " + doc.Url + " ***");
foreach (HtmlElement field in doc.All)
System.Diagnostics.Debug.WriteLine(string.Format("Tag = {0}, ID = {1} ", field.TagName, field.Id));
I then discovered the debug window output was this:
Tag = !, ID =
Tag = HTML, ID =
Tag = HEAD, ID =
Tag = TITLE, ID =
Tag = LINK, ID =
... when the actual HTML document looks like this:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<html>
<head>
<title>Protocol</title>
<link rel="Stylesheet" type="text/css" media="all" href="ProtocolStyle.css">
</head>
<body onselectstart="return false">
<table>
<!-- Misc. table elements and cell values -->
</table>
</body>
</html>
Commenting out the LINK tag solves the issue for me, and the document is completely parsed. The ProtocolStyle.css file exists on disk and is loaded properly, if that matters. Is this a bug in .NET 3.5 SP1, or what? For such a web-oriented framework, I find it hard to believe there would be such a major bug in it.
Update: By the way, this iteration was done in the WebBrowser control's Navigated event.

After a few years, I returned to this code and finally discovered that the problem was that I walked through the HtmlDocument.All collection in the WebBrowser.Navigated event handler. The proper way to do this is to walk through the elements in WebBrowser.DocumentCompleted.
This mistake also caused embedded script code to seemingly "halt" parsing, exactly like the aforementioned LINK tags. In reality, it wasn't halting -- it just hadn't finished rendering the entire document yet.
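A minimal sketch of the fix, assuming a Windows Forms WebBrowser control: the same loop from the question, moved into a DocumentCompleted handler. The BrowserElementLogger class and its Attach method are illustrative names, not part of the original code.

```csharp
using System;
using System.Diagnostics;
using System.Windows.Forms;

static class BrowserElementLogger
{
    // Call once, e.g. from the form's constructor, before navigating.
    public static void Attach(WebBrowser browser)
    {
        browser.DocumentCompleted += OnDocumentCompleted;
    }

    private static void OnDocumentCompleted(object sender, WebBrowserDocumentCompletedEventArgs e)
    {
        var browser = (WebBrowser)sender;
        HtmlDocument doc = browser.Document;
        if (doc == null)
            return;

        // By the time this event fires the document has finished loading,
        // so All is no longer cut short at LINK or SCRIPT elements.
        Debug.WriteLine("*** " + doc.Url + " ***");
        foreach (HtmlElement field in doc.All)
            Debug.WriteLine(string.Format("Tag = {0}, ID = {1}", field.TagName, field.Id));
    }
}
```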

Force browser to use new CSS

Is there a way to check if the user has a different version of the CSS cached by their browser and if so force their browser to pull the new version?

I don't know if it is correct usage, but I think you can force a reload of the CSS file using a query string:
<link href="mystyle.css?SOME_UNIQUE_TEXT" type="text/css" rel="stylesheet" />
I remember I used this method years ago to force a reload of a web-cam image, but time has probably moved on...

Without using JS, you can keep the CSS file name in a session variable. When a request is made to the main page, you simply compose the CSS link tag with the name stored in the session variable.
Because the CSS file name is different, you force the browser to download it without needing to check what was previously loaded in the browser.

As jeroen suggested, you can have something like:
<link href="StyleSelector.aspx?foo=bar&baz=foz" type="text/css" rel="stylesheet" />
Then your StyleSelector.aspx file should be something like this:
<%@ Page Language="cs" AutoEventWireup="false" Inherits="Demo.StyleSelector" Codebehind="StyleSelector.aspx.cs" %>
And your StyleSelector.aspx.cs like this:
using System;
using System.IO;

namespace Demo
{
    public partial class StyleSelector : System.Web.UI.Page
    {
        public StyleSelector()
        {
            // Wire the handler manually because AutoEventWireup is false.
            this.Load += new EventHandler(doLoad);
        }

        protected void doLoad(object sender, System.EventArgs e)
        {
            // Make sure you add this line
            Response.ContentType = "text/css";
            string cssFileName = Request.QueryString["foo"];
            // I'm assuming you have your CSS in a css/ folder
            Response.WriteFile("css/" + cssFileName + ".css");
        }
    }
}
This would send the user the contents of a CSS file (actually any file, see security note) based on query string arguments. Now the tricky part is doing the Conditional GET, which is the fancy name for checking if the user has the page in the cache or not.
First of all, I highly recommend reading HTTP Conditional GET for RSS hackers, a great article that explains the basics of the HTTP Conditional GET mechanism. It is a must-read, believe me.
I've posted a similar answer (but with PHP code, sorry) to the SO question can i use “http header” to check if a dynamic page has been changed. It should be easy to port the code from PHP to C# (I'll do it later if I have the time).
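In C#, the core decision of a Conditional GET could be sketched like this. It is not a port of the PHP answer mentioned above, just a minimal illustration that handles only the If-Modified-Since header (not ETag). The ConditionalGet class and IsNotModified method are hypothetical names.

```csharp
using System;
using System.Globalization;

static class ConditionalGet
{
    // Returns true when the client's cached copy is still current, i.e. the
    // server may answer "304 Not Modified" instead of resending the file.
    public static bool IsNotModified(string ifModifiedSince, DateTime lastModifiedUtc)
    {
        if (string.IsNullOrEmpty(ifModifiedSince))
            return false; // no validator sent: treat as a normal GET

        DateTime since;
        bool parsed = DateTime.TryParseExact(
            ifModifiedSince, "r", CultureInfo.InvariantCulture,
            DateTimeStyles.AssumeUniversal | DateTimeStyles.AdjustToUniversal,
            out since);
        if (!parsed)
            return false; // unreadable header: resend to be safe

        // HTTP dates have one-second resolution, so drop sub-second ticks.
        DateTime lastModified = new DateTime(
            lastModifiedUtc.Ticks - lastModifiedUtc.Ticks % TimeSpan.TicksPerSecond,
            DateTimeKind.Utc);

        return lastModified <= since;
    }
}
```

In the page above, the header would come from Request.Headers["If-Modified-Since"] and the timestamp from the CSS file on disk. When the method returns true, the page would set Response.StatusCode to 304 and skip the WriteFile call.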
Security note: doing something like ("css/" + cssFileName + ".css") is highly insecure, because a caller can pass a relative path string and make you send the contents of a different file. You should come up with a better way to decide which CSS file to send.
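One possible way to address the security note is an allow-list: map the query string value to a fixed set of known files and reject everything else. The sketch below is an assumption about how that could look; the StyleSheetResolver class and the entries in its table are invented for the example.

```csharp
using System;
using System.Collections.Generic;

static class StyleSheetResolver
{
    // Hypothetical allow-list: query-string value -> file under css/.
    private static readonly Dictionary<string, string> Known =
        new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase)
        {
            { "bar", "css/bar.css" },
            { "print", "css/print.css" },
        };

    // Returns the path for a known name, or null for anything else
    // (including null, "", or traversal attempts such as "../web.config").
    public static string Resolve(string requestedName)
    {
        if (requestedName == null)
            return null;

        string path;
        return Known.TryGetValue(requestedName, out path) ? path : null;
    }
}
```

The page would then call Response.WriteFile only when Resolve returns a non-null path, and answer with a 404 otherwise. Because the user-supplied string is never concatenated into a path, a relative path in the query string cannot reach other files.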
Design note: instead of an .aspx page you might want to use an IHttpModule or IHttpHandler, but this way works just fine.

Answer for question 1
You could write a Server Control inheriting from System.Web.UI.Control overriding the Render method:
public class CSSLink : System.Web.UI.Control
{
    protected override void Render(System.Web.UI.HtmlTextWriter writer)
    {
        if ( ... querystring params == ... )
            writer.WriteLine("<link href=\"/styles/css1.css\" type=\"text/css\" rel=\"stylesheet\" />");
        else
            writer.WriteLine("<link href=\"/styles/css2.css\" type=\"text/css\" rel=\"stylesheet\" />");
    }
}
and insert an instance of this class in your MasterPage:
<%@ Register TagPrefix="mycontrols" Namespace="MyNamespace" Assembly="MyAssembly" %>
...
<head runat="server">
...
<mycontrols:CSSLink id="masterCSSLink" runat="server" />
</head>
...

You should possibly just share a common ancestor class, then you can flick it with a single js command if need be.
<body class="style2">
<body class="style1">
etc.

I like jeroen's suggestion to add a querystring to the stylesheet URL. You could add the time stamp when the stylesheet file was last modified. It seems to me like a good candidate for a helper function or custom control that would generate the LINK tag for you.
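Such a helper could look roughly like the following sketch. The VersionedLink class and its method names are made up for illustration, and using the raw tick count as the version value is just one convenient choice.

```csharp
using System;
using System.Globalization;
using System.IO;

static class VersionedLink
{
    // Appends the last-modified time (as ticks) so the URL changes
    // whenever the file does.
    public static string AppendVersion(string href, DateTime lastWriteUtc)
    {
        string separator = href.Contains("?") ? "&" : "?";
        return href + separator + "v=" + lastWriteUtc.Ticks.ToString(CultureInfo.InvariantCulture);
    }

    // Builds the LINK tag for a stylesheet, reading the timestamp from disk.
    public static string BuildLinkTag(string href, string physicalPath)
    {
        DateTime stamp = File.GetLastWriteTimeUtc(physicalPath);
        return "<link href=\"" + AppendVersion(href, stamp) + "\" type=\"text/css\" rel=\"stylesheet\" />";
    }
}
```

Since the query string only changes when the file is rewritten, browsers keep using their cached copy until a new version is actually deployed.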

I know the question was specifically about C# and I assume from that Windows Server of some flavour. Since I don't know either of those technologies well, I'll give an answer that'll work in PHP and Apache, and you may get something from it.
As suggested earlier, just set an ID or a class on the body of the page depending on the specific query, e.g. (in PHP):
<?php
// !empty() avoids an "undefined index" notice when the parameter is absent.
if (!empty($_GET['admin_page'])) {
    $body_id = 'admin';
} else {
    $body_id = 'normal';
}
?>
...
<body id="<?php echo $body_id; ?>">
...
</body>
And your CSS can target this:
body#admin h1 {
    color: red;
}
body#normal h1 {
    color: blue;
}
etc
As for the forcing of CSS download, you could do this in Apache with the mod_expires or mod_headers modules - for mod_headers, this in .htaccess would stop css files being cached:
<FilesMatch "\.(css)$">
Header set Cache-Control "max-age=0, private, no-store, no-cache, must-revalidate"
</FilesMatch>
But since you're probably not using Apache, that won't help you much :(

As in the accepted answer, I use a similar method, with one difference:
<link href="mystyle.css?v=DIGIT" type="text/css" rel="stylesheet" />
For DIGIT you can use an actual number, set manually or automatically in your template. For example, in my projects I have a cache-clearing module in the admin panel, and each time the cache cleaner runs it increments DIGIT automatically.
