Iteration through the HtmlDocument.All collection stops at the referenced stylesheet? - c#

Since "bug in .NET" is often not the real cause of a problem, I wonder if I'm missing something here.
What I'm doing feels pretty simple. I'm iterating through the elements in an HtmlDocument called doc like this:
System.Diagnostics.Debug.WriteLine("*** " + doc.Url + " ***");
foreach (HtmlElement field in doc.All)
    System.Diagnostics.Debug.WriteLine(string.Format("Tag = {0}, ID = {1} ", field.TagName, field.Id));
I then discovered the debug window output was this:
Tag = !, ID =
Tag = HTML, ID =
Tag = HEAD, ID =
Tag = TITLE, ID =
Tag = LINK, ID =
... when the actual HTML document looks like this:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<html>
<head>
<title>Protocol</title>
<link rel="Stylesheet" type="text/css" media="all" href="ProtocolStyle.css">
</head>
<body onselectstart="return false">
<table>
<!-- Misc. table elements and cell values -->
</table>
</body>
</html>
Commenting out the LINK tag solves the issue for me, and the document is completely parsed. The ProtocolStyle.css file exists on disk and is loaded properly, if that matters. Is this a bug in .NET 3.5 SP1, or what? For such a web-oriented framework, I find it hard to believe there would be such a major bug in it.
Update: By the way, this iteration was done in the WebBrowser control's Navigated event.

After a few years, I returned to this code and finally discovered that the problem was that I walked through the HtmlDocument.All collection in the WebBrowser.Navigated event handler. The proper way to do this is to walk through the elements in WebBrowser.DocumentCompleted.
This mistake also caused embedded script code to seemingly "halt" parsing, exactly like the aforementioned LINK tags. In reality, it wasn't halting; the document simply hadn't finished loading when the handler ran.
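A minimal sketch of that fix, assuming a Windows Forms form that hosts the control. The class name ProtocolForm and the field name browser are hypothetical, since the question does not show how the control is declared:

```csharp
using System;
using System.Diagnostics;
using System.Windows.Forms;

public class ProtocolForm : Form
{
    // Hypothetical field name; not taken from the original post.
    private readonly WebBrowser browser = new WebBrowser { Dock = DockStyle.Fill };

    public ProtocolForm()
    {
        Controls.Add(browser);
        // Navigated fires once loading has begun; DocumentCompleted fires
        // when the document has finished loading.
        browser.DocumentCompleted += OnDocumentCompleted;
    }

    private void OnDocumentCompleted(object sender, WebBrowserDocumentCompletedEventArgs e)
    {
        // The event can also fire for frames, so wait for the top-level URL.
        if (e.Url != browser.Url || browser.Document == null)
            return;

        HtmlDocument doc = browser.Document;
        Debug.WriteLine("*** " + doc.Url + " ***");
        foreach (HtmlElement field in doc.All)
            Debug.WriteLine(string.Format("Tag = {0}, ID = {1}", field.TagName, field.Id));
    }
}
```

The frame check is an extra precaution that the original answer does not mention; it can be dropped for pages without frames.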

Related

How do I render special characters in HTML the same in WPF as in Winforms?

I'm working on an email notification project, where in a preexisting Winforms screen the client can edit an email template - adding html, text, etc. A very simplified example input:
<!DOCTYPE html>
<html>
<head>
<title>The Title</title>
</head>
<body bgcolor="#f2f2f2" style="margin: 0; padding: 0;">
<br />
<b>Please do not respond to this e-mail, as it is not monitored.</b>
<br/>
<br/>
“Foo bar baz.
<br/>
<br/>
Baz bar foo.”
<br/>
</body>
</html>
This is saved as a string. On the same screen, the user may then click a button which will raise a ShowDialog call on another Form. This form previews the user's html in a WebBrowser control:
this.webBrowser.DocumentText = theHtmlString;
And the results:
Problem:
I am creating a WPF screen related to the Winforms screens mentioned. It too needs the ability to preview the user's html. To do so I've used an attached behavior modified from this version. Essentially, this dialog also previews the user's html in a WebBrowser control:
webBrowser.NavigateToString(theHtmlString);
However, the results aren't correct, as highlighted below:
If this were my own html input, I'd simply remove the offending characters and replace them with standard quotations. But since this input is from the client, how do I get WPF to render the same as Winforms?
The reason this poses an issue:
In the old Winforms screen, the user creates/edits-existing email templates, previews them, is satisfied with the rendered example, saves changes.
In the new WPF screen, the user exports/imports existing email templates, previews them, is dissatisfied with the rendered example and strange characters, becomes confused when the other screen still renders correctly - calls to report a "bug" in the new screen.
Simple Reproduction Example: - Credit to Eser
var encoded = WebUtility.HtmlEncode(" “ Test ” "); //" “ Test ” "
var buf = Encoding.UTF8.GetBytes(" “ Test ” ");
var str = Encoding.GetEncoding("Windows-1252").GetString(buf); //" â€œ Test â€ "
Just add this meta tag to the <head> of your HTML:
<meta charset='utf-8'>
This will display the special characters correctly. I just tested your exact code with this and it works.
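Because the HTML comes from the client, the tag may have to be added in code rather than by hand. The following is a rough sketch under that assumption; HtmlPreview and EnsureUtf8Charset are hypothetical names, and the string checks are deliberately naive (any occurrence of "charset" counts as an existing declaration):

```csharp
using System;

public static class HtmlPreview
{
    // Hypothetical helper: inserts <meta charset='utf-8'> right after the
    // opening <head> tag unless the markup already mentions a charset.
    public static string EnsureUtf8Charset(string html)
    {
        if (html.IndexOf("charset", StringComparison.OrdinalIgnoreCase) >= 0)
            return html; // leave an existing declaration alone

        int headStart = html.IndexOf("<head", StringComparison.OrdinalIgnoreCase);
        if (headStart < 0)
            return html; // no <head> element to insert into

        int headEnd = html.IndexOf('>', headStart);
        if (headEnd < 0)
            return html; // malformed tag; return the input unchanged

        return html.Insert(headEnd + 1, "<meta charset='utf-8'>");
    }
}
```

The WPF preview call would then become webBrowser.NavigateToString(HtmlPreview.EnsureUtf8Charset(theHtmlString));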

HtmlAgilityPack does not select subnodes with XPath [duplicate]

I just wrote up this test to see if I was crazy...
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using HtmlAgilityPack;
namespace HtmlAgilityPackFormBug
{
    class Program
    {
        static void Main(string[] args)
        {
            var doc = new HtmlDocument();
            doc.LoadHtml(@"
<!DOCTYPE html>
<html>
<head>
<title>Form Test</title>
</head>
<body>
<form>
<input type=""text"" />
<input type=""reset"" />
<input type=""submit"" />
</form>
</body>
</html>
");
            // Print the XPath of every element that is a direct child of <body>.
            var body = doc.DocumentNode.SelectSingleNode("//body");
            foreach (var node in body.ChildNodes.Where(n => n.NodeType == HtmlNodeType.Element))
                Console.WriteLine(node.XPath);
            Console.ReadLine();
        }
    }
}
And it outputs:
/html[1]/body[1]/form[1]
/html[1]/body[1]/input[1]
/html[1]/body[1]/input[2]
/html[1]/body[1]/input[3]
But, if I change <form> to <xxx> it gives me:
/html[1]/body[1]/xxx[1]
(As it should). So... it looks like those input elements are not contained within the form, but directly within the body, as if the <form> just closed itself off immediately. What's up with that? Is this a bug?
Digging through the source, I see:
ElementsFlags.Add("form", HtmlElementFlag.CanOverlap | HtmlElementFlag.Empty);
It has the "empty" flag, like META and IMG. Why?? Forms are most definitely not supposed to be empty.
This is also reported in this workitem. It contains a suggested workaround from DarthObiwan.
You can change this without recompiling. The ElementsFlags list is a static property on the HtmlNode class. The entry can be removed with
HtmlNode.ElementsFlags.Remove("form");
before doing the document load.
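Put together, the workaround might look like the sketch below. It assumes the HtmlAgilityPack package is referenced; the markup is a shortened version of the test document from the question:

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

class Program
{
    static void Main()
    {
        // Drop the special-case flags for <form> before any document is loaded.
        // ElementsFlags is static, so this affects every later parse in the process.
        HtmlNode.ElementsFlags.Remove("form");

        var doc = new HtmlDocument();
        doc.LoadHtml("<html><body><form><input type=\"text\" /><input type=\"submit\" /></form></body></html>");

        // With the flag removed, the inputs should be reported as children of the form.
        var form = doc.DocumentNode.SelectSingleNode("//form");
        foreach (var node in form.ChildNodes.Where(n => n.NodeType == HtmlNodeType.Element))
            Console.WriteLine(node.XPath);
    }
}
```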
Since I'm the original HAP author, I can explain why it's marked as empty :)
This is because when HAP was designed, back in 2000, HTML 3.2 was the standard. You're probably aware that tags can perfectly overlap in HTML. That is, <b>bold<i>italic and bold</b>italic</i> is supported by all browsers (although it's not officially in the HTML specification). The FORM tag can overlap in the same way.
Since HAP was designed to handle any HTML content, rather than break most pages that you could find at that time, we decided to handle overlapping tags as EMPTY (using the ElementsFlags property), so that:
you can still load them
you can save them back without breaking the original HTML (if you don't need what's inside the form in any programmatic way).
The only thing you cannot do is work with them programmatically: not with the API using the tree model, not with XSL, not with anything else.
Today, with XHTML/XML almost everywhere, this sounds strange, but that's why I created the ElementsFlags :)

Navigating through objects of a web page, using an embedded web browser

I have a Windows Forms application that uses a WebBrowser control to display an embedded web page. The file is (successfully) loaded using:
webHelp.DocumentStream=
Assembly.GetExecutingAssembly()
.GetManifestResourceStream("MyAssembly.help.html");
In order for this to work (i.e. the file to be loaded/displayed) I set the webHelp.AllowNavigation = false;. I don't fully understand why, but if it's set to true, the page is not displayed.
In my HTML document (see below) I want to be able to navigate through different sections. But when I click on a link, the browser control does not go to the targeted element. The web page works fine in the stand-alone Internet Explorer 10, so it must have something to do with the control, more specifically the AllowNavigation property. MSDN didn't help much.
How can I achieve this navigation behavior? Is there another way of loading the HTML file without setting the AllowNavigation property to false?
This is my simple HTML file:
<!DOCTYPE html>
<html lang="en" xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta charset="utf-8" />
<title>Using this tool</title>
</head>
<body>
<h3>Description</h3>
<div><p id="contents">Contents</p></div>
<div>
<p id="general">Using the file converter</p>
<p>*converter description*</p>
<a href="#contents">Go To Top!</a>
</div>
<div class="divBlock" >
<p id="selectOption">Selecting a conversion action</p>
<p>*action selection*</p>
<a href="#contents">Go To Top!</a>
</div>
</body>
</html>
EDIT: After additional tests I found the root of the problem. The problem appeared after setting a value for the URL property, running the application and afterwards clearing this value. The embedded page is not loaded any more, unless the AllowNavigation property is set to false. There are two solutions, described in my answer below.
I also have my own WebBrowser. I've tested it and it loads your HTML file perfectly.
I simply used:
webBrowser1.Navigate("C:\\myPath\\SofNavigate.html");
When I click on links it goes to "#contents" without problems.
I am not sure why you need to use webHelp.DocumentStream instead of a simple Navigate.
By the way, when I turn off navigation, I am not able to go anywhere from the page that I started on. So navigation must be on in order to go anywhere from the "home page".
Try to debug that part, as it appears to be the bigger problem that you have.
Here is a good example on how to set up simple webBrowser. Try to use it as a base and see what you do differently that messes up your navigation.
[EDITED] Win8/IE10, your code works for me unmodified inside Form.Load event on a simple form which has just a single WebBrowser control with all default settings (and WebBrowser.AllowNavigation is true by default). Check the properties of your WebBrowser control in the Designer, you may have something wrong in there.
[/EDITED]
You're using HTML5, which handles anchor links via the id attribute (i.e. <p id="contents"> ... <a href="#contents">). By default, the WebBrowser control works in legacy IE7 mode with HTML5 disabled. You need to turn it on with the FEATURE_BROWSER_EMULATION feature control before the WebBrowser object gets created. The best place to do this is a static constructor of your form:
// Requires: using System; using System.Diagnostics; using Microsoft.Win32;
static MainForm()
{
    SetBrowserFeatureControl();
}
private static void SetBrowserFeatureControl()
{
    // http://msdn.microsoft.com/en-us/library/ee330730(v=vs.85).aspx#browser_emulation
    // FeatureControl settings are per-process
    var fileName = System.IO.Path.GetFileName(Process.GetCurrentProcess().MainModule.FileName);
    // make sure the control is not running inside Visual Studio Designer
    if (String.Compare(fileName, "devenv.exe", true) == 0 || String.Compare(fileName, "XDesProc.exe", true) == 0)
        return;
    // web pages containing standards-based !DOCTYPE directives are displayed in Standards mode
    using (var key = Registry.CurrentUser.CreateSubKey(
        @"Software\Microsoft\Internet Explorer\Main\FeatureControl\FEATURE_BROWSER_EMULATION",
        RegistryKeyPermissionCheck.ReadWriteSubTree))
    {
        key.SetValue(fileName, (UInt32)9000, RegistryValueKind.DWord);
    }
}
Try it and your links should work as expected. This solution does NOT require admin rights, the affected key is under HKEY_CURRENT_USER.
[UPDATE] There may be a better solution, it works at least for IE10 here on my side. Add <meta http-equiv="X-UA-Compatible" content="IE=edge" /> as below and leave the registry intact. If you see document.compatMode: CSS1Compat, document.documentMode: 10, you should be good to go, but test with older IE versions too.
<!DOCTYPE html>
<html lang="en">
<head>
<meta http-equiv="X-UA-Compatible" content="IE=edge" />
<title></title>
<script type="text/javascript">
window.onload = function () {
info.firstChild.data =
"document.compatMode: " + document.compatMode +
", document.documentMode: " + document.documentMode;
}
</script>
</head>
<body>
<pre id="info"> </pre>
</body>
</html>
EDIT: After finding the cause of the problem (see the edit to the question) I can now propose three solutions:
1. WebBrowser control replacement:
Simply delete the existing WebBrowser control and add a new one. This solution does not require any modification of the AllowNavigation property. DO NOT modify the URL property.
2. When deleting and adding a new WebBrowser control is not an option:
Since the AllowNavigation property was influencing the loading and displaying of the web page, there was no reason for it to be left to false afterwards. Setting back the property in the Shown event solved the navigation problem, without requiring other alterations (e.g. in the HTML file or the Registry):
private void helpForm_Shown(object sender, EventArgs e)
{
webHelp.AllowNavigation = true;
}
3. Resetting the Document
It seems that the Document property gets (automatically) initialized if the URL property is at one time set and reset. Adding webHelp.Document.OpenNew(true); before loading the resource stream solves the problem without the need for re-adding the WebBrowser and without modifying the AllowNavigation property.
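As a rough sketch of the third option, the load sequence could look like this. The HelpForm class and the LoadHelpPage method are hypothetical names, and the null check is an added precaution for the case where no document has been initialized yet:

```csharp
using System.Reflection;
using System.Windows.Forms;

public class HelpForm : Form
{
    private readonly WebBrowser webHelp = new WebBrowser { Dock = DockStyle.Fill };

    public HelpForm()
    {
        Controls.Add(webHelp);
        Load += (sender, e) => LoadHelpPage();
    }

    private void LoadHelpPage()
    {
        // Reset any document left over from an earlier URL value.
        // Document may be null when nothing has been loaded, so guard the call.
        if (webHelp.Document != null)
            webHelp.Document.OpenNew(true);

        // Resource name as given in the question.
        webHelp.DocumentStream = Assembly.GetExecutingAssembly()
            .GetManifestResourceStream("MyAssembly.help.html");
    }
}
```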

Force browser to use new CSS

Is there a way to check if the user has a different version of the CSS cached by their browser and if so force their browser to pull the new version?
I don't know if it is correct usage, but I think you can force a reload of the css file using a query string:
<link href="mystyle.css?SOME_UNIQUE_TEXT" type="text/css" rel="stylesheet" />
I remember I used this method years ago to force a reload of a web-cam image, but time has probably moved on...
Without using js, you can just keep the css filename in a session variable. When a request is made to the Main Page, you simply compose the css link tag with the session variable name.
Because the CSS file name is different, you force the browser to download it without needing to check what was previously loaded in the browser.
As jeroen suggested you can have something like:
<link href="StyleSelector.aspx?foo=bar&baz=foz" type="text/css" rel="stylesheet" />
Then your StyleSelector.aspx file should be something like this:
<%@ Page Language="cs" AutoEventWireup="false" Inherits="Demo.StyleSelector" Codebehind="StyleSelector.aspx.cs" %>
And your StyleSelector.aspx.cs like this:
using System;
namespace Demo
{
    public partial class StyleSelector : System.Web.UI.Page
    {
        public StyleSelector()
        {
            // Subscribe manually, since AutoEventWireup is false in the page directive.
            this.Load += new EventHandler(doLoad);
        }
        protected void doLoad(object sender, System.EventArgs e)
        {
            // Make sure you add this line
            Response.ContentType = "text/css";
            string cssFileName = Request.QueryString["foo"];
            // I'm assuming you have your CSS in a css/ folder
            Response.WriteFile("css/" + cssFileName + ".css");
        }
    }
}
This would send the user the contents of a CSS file (actually any file, see security note) based on query string arguments. Now the tricky part is doing the Conditional GET, which is the fancy name for checking if the user has the page in the cache or not.
First of all I highly recommend you reading HTTP Conditional GET for RSS hackers, a great article that explains the basics of HTTP Conditional GET mechanism. It is a must read, believe me.
I've posted a similar answer (but with PHP code, sorry) to the SO question can i use “http header” to check if a dynamic page has been changed. It should be easy to port the code from PHP to C# (I'll do it if later I have the time.)
Security note: it is highly insecure doing something like ("css/" + cssFileName + ".css"), as you may send a relative path string and thus you may send the user the content of a different file. You are to come up with a better way to find out what CSS file to send.
Design note: instead of an .aspx page you might want to use an IHttpModule or IHttpHandler, but this way works just fine.
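One way to address the security note is to map the query-string value onto a fixed list of known files instead of concatenating it into a path. The sketch below assumes that approach; the StylesheetMap class, its keys and its file names are all hypothetical:

```csharp
using System;
using System.Collections.Generic;

public static class StylesheetMap
{
    // Hypothetical keys and file names; only these files can ever be served.
    private static readonly Dictionary<string, string> Files =
        new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase)
        {
            { "default", "css/default.css" },
            { "print", "css/print.css" }
        };

    // Returns the path for a known key, or the default stylesheet otherwise,
    // so input such as "../web.config" never reaches the file system.
    public static string Resolve(string key)
    {
        string path;
        if (key != null && Files.TryGetValue(key, out path))
            return path;
        return Files["default"];
    }
}
```

In the page above, the call would then read Response.WriteFile(StylesheetMap.Resolve(Request.QueryString["foo"]));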
Answer for question 1
You could write a Server Control inheriting from System.Web.UI.Control overriding the Render method:
public class CSSLink : System.Web.UI.Control
{
    protected override void Render(System.Web.UI.HtmlTextWriter writer)
    {
        // Replace the placeholder condition with your own query-string check.
        if ( ... querystring params == ... )
            writer.WriteLine("<link href=\"/styles/css1.css\" type=\"text/css\" rel=\"stylesheet\" />");
        else
            writer.WriteLine("<link href=\"/styles/css2.css\" type=\"text/css\" rel=\"stylesheet\" />");
    }
}
and insert an instance of this class in your MasterPage:
<%@ Register TagPrefix="mycontrols" Namespace="MyNamespace" Assembly="MyAssembly" %>
...
<head runat="server">
...
<mycontrols:CSSLink id="masterCSSLink" runat="server" />
</head>
...
You should possibly just share a common ancestor class, then you can flick it with a single js command if need be.
<body class="style2">
<body class="style1">
etc.
I like jeroen's suggestion to add a querystring to the stylesheet URL. You could add the time stamp when the stylesheet file was last modified. It seems to me like a good candidate for a helper function or custom control that would generate the LINK tag for you.
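Such a helper could be sketched as follows, assuming the stylesheet is a file on the server's disk. StylesheetLink and Build are hypothetical names, and the href is assumed to need no HTML encoding:

```csharp
using System;
using System.IO;

public static class StylesheetLink
{
    // Hypothetical helper: href is the URL to emit, physicalPath is the
    // stylesheet's location on disk. The last-write time becomes the version,
    // so the URL changes whenever the file is modified.
    public static string Build(string href, string physicalPath)
    {
        long version = File.GetLastWriteTimeUtc(physicalPath).Ticks;
        return string.Format(
            "<link href=\"{0}?v={1}\" type=\"text/css\" rel=\"stylesheet\" />",
            href, version);
    }
}
```

In an ASP.NET page this could be called as StylesheetLink.Build("mystyle.css", Server.MapPath("~/mystyle.css")).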
I know the question was specifically about C# and I assume from that Windows Server of some flavour. Since I don't know either of those technologies well, I'll give an answer that'll work in PHP and Apache, and you may get something from it.
As suggested earlier, just set an ID or a class on the body of the page dependent on the specific query eg (in PHP)
<?php
if($_GET['admin_page']) {
$body_id = 'admin';
} else {
$body_id = 'normal';
}
?>
...
<body id="<?php echo $body_id; ?>">
...
</body>
And your CSS can target this:
body#admin h1 {
color: red;
}
body#normal h1 {
color: blue;
}
etc
As for the forcing of CSS download, you could do this in Apache with the mod_expires or mod_headers modules - for mod_headers, this in .htaccess would stop css files being cached:
<FilesMatch "\.(css)$">
Header set Cache-Control "max-age=0, private, no-store, no-cache, must-revalidate"
</FilesMatch>
But since you're probably not using Apache, that won't help you much :(
As in the correct answer, I use a similar method, but with some differences:
<link href="mystyle.css?v=DIGIT" type="text/css" rel="stylesheet" />
For DIGIT you can use a real number, set manually or automatically in your template. For example, in my projects I use a cache-clearing module in the admin panel, and each time the cache cleaner runs, it increments DIGIT automatically.
