Html Generation from Xml Tree (C#/.NET) - c#

I have an HTML document stored in memory as a LINQ to XML object tree. How can I serialize an XDocument as HTML, taking into account the idiosyncrasies of HTML?
For example, empty tags such as <br/> should be serialized as <br>, whereas an empty <div/> should be serialized as <div></div>.
HTML output is possible from an XSLT stylesheet, and XmlWriterSettings has an OutputMethod property which can be set to HTML - but the setter is internal, for use by XSLT or Visual Studio, and I can't seem to find a way to serialize arbitrary XML as HTML.
So, short of using XSLT solely for the HTML output capability (i.e. doing something like running the document through an otherwise pointless chain of XDocument->XmlReader->via XSLT, to HTML), is there a way to serialize a .NET XDocument to HTML?

No. The XDocument->XmlReader->XSLT is the approach you need.
What you are looking for is a specialised serialiser that arbitrarily attaches meaning to tag names such as br and div and renders each differently. One would also expect such a serialiser to work in both directions, in other words to be able to read HTML tag soup and generate an XDocument. Such a thing does not exist out of the box.
Feeding an XmlReader into XSLT seems simple enough for the job; ultimately it is just a chain of streams.
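The XDocument -> XmlReader -> XSLT chain described above could look roughly like the sketch below. It assumes an identity stylesheet whose only job is to select the HTML output method; the class name HtmlSerializer and the method ToHtml are made up for illustration, and the input elements are assumed to be in no namespace.

```csharp
using System.IO;
using System.Xml;
using System.Xml.Linq;
using System.Xml.Xsl;

// Hypothetical helper, not part of .NET.
public static class HtmlSerializer
{
    // Identity transform: copies every node unchanged. Its only purpose
    // is to carry <xsl:output method='html'/>.
    private const string IdentityHtmlXslt =
@"<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>
  <xsl:output method='html' indent='no'/>
  <xsl:template match='@*|node()'>
    <xsl:copy><xsl:apply-templates select='@*|node()'/></xsl:copy>
  </xsl:template>
</xsl:stylesheet>";

    public static string ToHtml(XDocument document)
    {
        var transform = new XslCompiledTransform();
        using (XmlReader sheet = XmlReader.Create(new StringReader(IdentityHtmlXslt)))
        {
            transform.Load(sheet);
        }

        // Transforming to a TextWriter lets the stylesheet's xsl:output
        // settings decide how the result is serialized.
        using (XmlReader input = document.CreateReader())
        using (var output = new StringWriter())
        {
            transform.Transform(input, null, output);
            return output.ToString();
        }
    }
}
```

Calling HtmlSerializer.ToHtml(XDocument.Parse("<div><br/><p/></div>")) should give a br without a closing slash and a p with an explicit end tag.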

Like you, I'm really surprised that the HTML output method isn't exposed, and I don't know of any way round it, other than the XSLT route you've already identified. When I faced the same problem a couple of years ago, I wrote an XmlWriter wrapper class, that forced calls to WriteEndElement to use WriteFullEndElement on the underlying XmlWriter if the tag being processed wasn't in the list {"area", "base", "basefont", "bgsound", "br", "col", "embed", "frame", "hr", "isindex", "image", "img", "input", "link", "meta", "param", "spacer", "wbr" }.
This fixed the <div/> problem and was sufficient for me as what I wanted to write was polyglot documents. I didn't find a method to make <br/> appear as <br> but apart from not being able to validate as HTML 4.01 this doesn't cause a real problem. I guess that if you really need this, and don't want to use the XSLT method, you'll have to write your own XmlWriter implementation.
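A minimal sketch of that idea follows. It is not the original wrapper: instead of wrapping an arbitrary XmlWriter it subclasses XmlTextWriter, which keeps the example short, and it uses a shortened version of the tag list above. The names FullEndTagWriter and Serialize are invented for this example.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Xml;
using System.Xml.Linq;

// Hypothetical class: forces <x></x> for every element that is not in the void list.
public class FullEndTagWriter : XmlTextWriter
{
    // Shortened list; extend it with the remaining names from the answer as needed.
    private static readonly HashSet<string> VoidElements = new HashSet<string>
    {
        "area", "base", "br", "col", "embed", "hr",
        "img", "input", "link", "meta", "param", "wbr"
    };

    // Names of the elements that are currently open, innermost on top.
    private readonly Stack<string> openElements = new Stack<string>();

    public FullEndTagWriter(TextWriter output) : base(output) { }

    public override void WriteStartElement(string prefix, string localName, string ns)
    {
        openElements.Push(localName);
        base.WriteStartElement(prefix, localName, ns);
    }

    public override void WriteEndElement()
    {
        string name = openElements.Pop();
        if (VoidElements.Contains(name))
            base.WriteEndElement();      // may self-close
        else
            base.WriteFullEndElement();  // always writes an explicit end tag
    }

    public override void WriteFullEndElement()
    {
        openElements.Pop();
        base.WriteFullEndElement();
    }

    // Convenience method for trying the writer out on an XElement.
    public static string Serialize(XElement root)
    {
        var text = new StringWriter();
        using (var writer = new FullEndTagWriter(text))
        {
            root.WriteTo(writer);
        }
        return text.ToString();
    }
}
```

As the answer notes, this only fixes the empty div case; void elements such as br are still written in self-closing XML form.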

Of course there is!
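// Assumes: XDocument document; string filename;
// Requires: using System.Reflection; using System.Xml; using System.Xml.Linq;
XmlWriterSettings settings = new XmlWriterSettings();
settings.Indent = true;
// OutputMethod has no public setter, so set the private field via reflection.
// The field name is an implementation detail and may differ between .NET versions.
typeof(XmlWriterSettings)
    .GetField("outputMethod", BindingFlags.NonPublic | BindingFlags.Instance)
    .SetValue(settings, XmlOutputMethod.Html);
using (XmlWriter xw = XmlWriter.Create(filename, settings))
{
    document.Save(xw);
}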

Related

Get XML from XPathDocument

I am working on a stylesheet and have some initial XML. However, the XML is manipulated a bit before styling, and I would like to get the final XML that is sent into .Transform(). For instance, ...
XslCompiledTransform.Transform( xpd, xslArg, output )
...I would like to get the XML content of xpd (as a string), so I can work on the stylesheet in other tools.
Is there a quick-and-dirty way to get this? Either in the VS2010 immediate window or as a quick C# line or two before the call to .Transform()?
EDIT: The .Transform() I'm using is
public void Transform(IXPathNavigable input,
XsltArgumentList arguments, TextWriter results);
...and xpd is an XPathDocument.
Edit: I misunderstood the intent of your question. The simple answer: to get the XML for any IXPathNavigable (which includes XPathDocument), you can do this:
string xml = xpd.CreateNavigator().OuterXml;
Below is my original answer, which explains how you could modify the XML from an XPathDocument in code before feeding it into a transform:
If xpd is an XPathDocument, you might be able to just get an XPathNavigator from the XPathDocument:
XPathNavigator xpn = xpd.CreateNavigator();
and use that to modify the XML. When you're done modifying it, you can pass either xpn or xpd into the Transform() method. On the other hand, MSDN says that XPathDocument's CreateNavigator() creates a read-only navigator, so that may be a bit of a hitch.
If it really is read-only, you should be able to do this:
XmlDocument doc = new XmlDocument();
doc.LoadXml(xpd.CreateNavigator().OuterXml);
then use doc to modify the XML and pass doc into the transform when you're done.

Prevent XslCompiledTransform from using self-closing tags

I am using XslCompiledTransform to convert an XML file to HTML. Is there a way I can prevent it from using self-closing tags?
e.g.
<span></span> <!-- I want this even if content empty -->
<span/> <!-- stop doing this! -->
The self-closing span tags are messing up my document no matter which browser I use. The output is valid XML; it's just that HTML does not allow span to be self-closing.
Is there a setting I can put in my xsl, or in my C#.Net code to prevent self-closing tags from being used?
Though I couldn't classify this as a direct solution (as it doesn't emit an empty element), the workaround I used was to put a space (using xsl:text) in the element -- since this is HTML markup, and if you are activating Standards mode (not quirks), the extra space doesn't change the rendered content. I also didn't have control over the invocation of the transform object.
<div class="clearBoth"><xsl:text> </xsl:text></div>
You can try <xsl:output method="html"/>; however, the result would no longer be a well-formed XML document.
Or, you can invoke the XslCompiledTransform.Transform() method passing as one of the parameters your own XmlWriter. In your implementation you are in full control and can implement any required serialization of the result tree.
The only solution I have been able to find is to add logic to the XSL file. Basically, if the elements I wanted to wrap a span around are empty, don't use the span element at all.
<xsl:if test="count(jar/beans) > 0">
<xsl:apply-templates select="jar/beans"/>
</xsl:if>
Not ideal to have to insert this everywhere in my xsl file, to compensate for the fact that even though I choose output method "html", it more than willingly will generate illegal HTML.
Sigh.
In your XSLT use <xsl:output method="html"/> and make sure the HTML result elements your stylesheet creates are in no namespace. Furthermore, depending on how you use XslCompiledTransform in your C# code, you need to make sure the xsl:output settings in the stylesheet are honoured. The easiest way is to transform to a file, Stream or TextWriter; in that case nothing more has to be done. However, if for some reason you transform to an XmlWriter, you need to ensure it is created with the proper settings, e.g.
XslCompiledTransform proc = new XslCompiledTransform();
proc.Load("sheet.xsl");
using (XmlWriter xw = XmlWriter.Create("result.html", proc.OutputSettings))
{
    proc.Transform("input.xml", null, xw);
}
But usually you should be fine by simply transforming to a Stream or TextWriter, in that case nothing in the C# code has to be done to honour the output method in the stylesheet.

Delay-load of XmlDocument

I'm writing an XML document based on a stream of data. This part has been accomplished using the XmlTextWriter and the XElement classes.
Now when I come to read in the document I want to be able to 'delay-load' the XML document so that certain nodes are skipped (i.e. the ones which contain large binary chunks.) and then load them when required.
Is this possible using the XmlDocument class? Or will I have to do things in a more manual way using the XmlTextReader class.
Thanks.
Nick.
This is not possible with XmlDocument, because the whole document has to be loaded into memory before it can be parsed into a tree.
A forward-only streaming reader such as XmlTextReader (the .NET analogue of SAX-style parsing) is the standard solution.
This is not possible with either XmlDocument or XDocument.
Note that if you want to use XmlTextReader, it is forward-only, i.e. once you have skipped a node, you can't come back to it.
see MSDN on this
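As a rough illustration of forward-only reading, the sketch below walks a document with XmlReader and uses Skip() to jump over one kind of element without reading its children. The class and method names, and the idea of collecting element names, are assumptions made only for the example.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Xml;

// Hypothetical helper illustrating XmlReader.Skip().
public static class SkippingReader
{
    // Returns the names of all elements, in document order, except
    // elements named skipName and everything nested inside them.
    public static List<string> ElementNamesExcept(string xml, string skipName)
    {
        var names = new List<string>();
        using (XmlReader reader = XmlReader.Create(new StringReader(xml)))
        {
            reader.MoveToContent();
            while (!reader.EOF)
            {
                if (reader.NodeType == XmlNodeType.Element && reader.LocalName == skipName)
                {
                    // Skip() moves past the element's end tag, so the reader is
                    // already on the next node; calling Read() here would lose it.
                    reader.Skip();
                    continue;
                }
                if (reader.NodeType == XmlNodeType.Element)
                {
                    names.Add(reader.LocalName);
                }
                reader.Read();
            }
        }
        return names;
    }
}
```

Because the reader cannot rewind, loading a skipped node later means opening a new reader and scanning forward to it again.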

Update XML using XSLT in C# - How to update the same file

My requirement is to update an XML file (some elements identified via a parameter, with new attribute values also identified via a parameter).
I am using XSLT to do the same via C# code.
My code is as below:
XslCompiledTransform xslt = new XslCompiledTransform();
xslt.Load(f_Xslt);
XmlReader xr = XmlReader.Create("SourceXML.xml");
XmlWriter xw = XmlWriter.Create("DestinationXML.xml");
XsltArgumentList argsList = new XsltArgumentList();
argsList.AddParam("", "", "");
...
...
...
xslt.Transform(xr, argsList, xw);
In my XSLT file, I first copy all elements, attributes. And then based on <xsl:template match = ... />, I update the elements, attr/values.
All this is saved to Destination.xml
What if I want all of this to happen on Source.xml itself.
Of course, the easiest solution(or my solution so far) is to replace the Source.XML with Destination.XML after I complete the XSLT.Transform successfully.
I think your transform-to-file-then-replace solution is as good as you're going to get. You don't want to overwrite the Source.XML file while reading it, even if .NET and the OS would let you.
In order to suggest a better alternative to transform-to-file-then-replace (TTFTR), I would ask, what is it about TTFTR that you feel is suboptimal?
The only alternative I can think of off-hand is to write the result of your transform to memory; and when the transform is finished, save the result from memory onto your source file. To transform to memory, pass a MemoryStream object as the argument to XmlWriter.Create().
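A sketch of that transform-to-memory alternative, assuming the stylesheet is already loaded into an XslCompiledTransform. The class name InPlaceTransform and the method Run are hypothetical.

```csharp
using System.IO;
using System.Xml;
using System.Xml.Xsl;

// Hypothetical helper: transform a file and write the result back to the same path.
public static class InPlaceTransform
{
    public static void Run(XslCompiledTransform xslt, XsltArgumentList args, string path)
    {
        using (var buffer = new MemoryStream())
        {
            // Read the source and write the result into memory. Both the reader
            // and the writer are disposed before the file is touched again.
            using (XmlReader reader = XmlReader.Create(path))
            using (XmlWriter writer = XmlWriter.Create(buffer, xslt.OutputSettings))
            {
                xslt.Transform(reader, args, writer);
            }

            // Only now overwrite the source file with the buffered result.
            File.WriteAllBytes(path, buffer.ToArray());
        }
    }
}
```

If the transform throws, the source file is left untouched, since nothing is written until the transform has finished. The whole result is held in memory, so this suits files of moderate size.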
You never should try to update in-place with XSLT. This is bad design and not in the spirit of a functional language.
This said, you can copy the source XML file in a temporary directory, then apply the transformation with an XmlWriter instance that is created to overwrite the original file.
As I said before, I wouldn't recommend this!

Protecting from XSLT injection

I use an XSL transform to convert an XML file to HTML in .NET. I transform the node values in the XML to HTML tag contents and attributes.
I compose the XML by using .NET DOM manipulation, setting the InnerText property of the nodes with the arbitrary and possibly malicious text.
Right now, maliciously crafted input strings will make my HTML unsafe. Unsafe in the sense that some JavaScript might come from the user and find its way to a link href attribute in the output HTML, for example.
The question is simple: what sanitizing, if any, do I have to do with my text before assigning it to the InnerText property? I thought that assigning to InnerText instead of InnerXml would do all the needed sanitization of the text, but that seems not to be the case.
Does my transform have to have any special characteristics to make this work safely? Any .NET-specific caveats that I should be aware of?
Thanks!
You should sanitize your XML before transforming it with XSLT. You probably will need something like:
string encoded = HttpUtility.HtmlEncode("<script>alert('hi')</script>");
XmlElement node = xml.CreateElement("code");
node.InnerText = encoded;
Console.WriteLine(encoded);
Console.WriteLine(node.OuterXml);
With this, you'll get
&lt;script&gt;alert('hi')&lt;/script&gt;
When you add this text into your node, you'll get
<code>&amp;lt;script&amp;gt;alert('hi')&amp;lt;/script&amp;gt;</code>
Now, if you run your XSLT, this encoded HTML will not cause any problems in your output.
It turns out that the problem came from the XSL itself, which used disable-output-escaping. Without that, the transform itself will do all the encoding necessary.
If you must use disable-output-escaping, you have to use the appropriate encoding function for each context: HtmlEncode for tag contents, HtmlAttributeEncode for attribute values, and UrlEncode for URL-valued HTML attributes (e.g. href).
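To make the three contexts concrete, here is a small sketch that maps each one to the corresponding System.Web.HttpUtility method. The wrapper class OutputEncoding and its method names are invented for illustration. UrlEncode is applied to a single URL component here (for example a query-string value), since encoding an entire URL would also escape its scheme and slashes.

```csharp
using System.Web;

// Hypothetical wrapper that names the context each encoder is meant for.
public static class OutputEncoding
{
    // Text placed between tags: <p>HERE</p>
    public static string ForElementContent(string text)
    {
        return HttpUtility.HtmlEncode(text);
    }

    // Text placed inside a quoted attribute: <p title="HERE">
    public static string ForAttributeValue(string text)
    {
        return HttpUtility.HtmlAttributeEncode(text);
    }

    // One component of a URL, such as a query-string value: href="page?q=HERE"
    public static string ForUrlComponent(string text)
    {
        return HttpUtility.UrlEncode(text);
    }
}
```

Encoding a component does not by itself stop a user-supplied value from being used as a whole href with a javascript: scheme; that case needs a separate check of the URL scheme, which this sketch does not attempt.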
