Parsing invalid XML or HTML with XSLT in C#


I recently had to write a class that would process a provided template and return the result. I chose XSLT as my templating language because of its industry-wide adoption. The problem I had, however, was that XSLT placed several restrictions on the provided template that were proving to be a pain. Here is an example of my code:
public string ProcessTemplate(string template, IEnumerable<Field> fields)
{
// Surround the supplied template with the required XML
template = @"<?xml version=""1.0"" encoding=""UTF-8""?>
<xsl:stylesheet version=""1.0"" xmlns:xsl=""http://www.w3.org/1999/XSL/Transform"">
<xsl:output method=""html"" version=""2.0"" encoding=""UTF-8"" indent=""yes""/>
<xsl:template match=""/entity"">"
+ template
+ "</xsl:template></xsl:stylesheet>";
// Turn our fields into XML with an "entity" tag as the root node
var t = GetTemplateXml(fields);
// Create a stringreader to read our template into memory
var sr = new StringReader(t.ToString());
var xr = new XmlTextReader(sr);
// Now create a XmlWriter attached to a StringBuilder to contain the transformed result
var sb = new StringBuilder();
var xws = new XmlWriterSettings()
{
ConformanceLevel = ConformanceLevel.Fragment
};
var xw = XmlWriter.Create(sb, xws);
// Create the transform object and an XmlReader for our template
var xsl = new XslCompiledTransform();
var xslr = XmlReader.Create(new StringReader(template));
// Load our template into the transform object, transform it, and put the result into our XmlWriter (and therefore into our StringBuilder)
xsl.Load(xslr);
xsl.Transform(xr, xw);
var res = sb.ToString();
return res;
}
The user would provide a number of Field objects, which to be valid XML had to share a root node. I called this root node "entity" but didn't want the users to have to select "entity" every time they access a field. So I surround the template with <xsl:template match="/entity">, which means I can select the fields directly. Unfortunately I still had several problems:
Firstly, the template I was providing had an HTML Doctype declaration at the top of the page. I started getting errors around an "unexpected DTD declaration" because the DOCTYPE was appearing inside the xsl:template node.
If the user supplied any invalid XML (such as a tag that wasn't self closed and had no matching closing tag) then the parser would throw an exception, even if the HTML would have worked in a browser. This seemed unacceptable because as much as I would like my users to supply perfectly formed XML/HTML templates, I don't want to move the burden onto the end user if it's something that I can fix.
In the HTML template I was testing against, the opening HTML tag was surrounded with IE conditional comments in order to supply a different class to the tag depending on the version of IE. For instance <!--[if lt IE 7 ]><html class="ie ie6" lang="en"> <![endif]-->. Because this is a proprietary IE syntax, XML just sees a comment and no html tag. The closing tag at the end of the document therefore has no matching start tag and throws an exception.
As I tackled each issue one at a time the solution became less and less acceptable to me and I sought a robust solution that would not force the user to cater for the weaknesses of my template method.

I ended up settling on a solution that would allow invalid HTML and pass that responsibility back to the browser where the user expects it to lie. Some might disagree with this approach, reasoning that it is better for the template to fail quickly and obviously so that it can be fixed. However as I said, I am catering for an end user who has certain expectations from writing HTML - one of those is that the page will render if they provide poorly formed HTML. Even if it renders incorrectly.
I decided that I could handle all three issues outlined above by writing a method that would prepare the template before the transform, and then a method that would clean up the result before returning it to the user. Here they are:
/// <summary>
/// This function replaces all chevrons with %lt; and %gt; Any xsl tags are left in place
/// so they can be processed, and the parent tags of xsl:attribute tags are also left
/// untouched in order that the attribute can be correctly assigned.
/// The tokens used are deliberately different from the standard tokens of &lt; and &gt;
/// because we will want to revert these tokens later without reverting the normal tokens.
/// </summary>
/// <param name="template"></param>
/// <returns></returns>
private string PrepareTemplate(string template)
{
template = Regex.Replace(template, "<", "%lt;");
template = Regex.Replace(template, ">", "%gt;");
template = Regex.Replace(template, "%lt;xsl:(.*?)%gt;", "<xsl:$1>");
template = Regex.Replace(template, "%lt;/xsl:(.*?)%gt;", "</xsl:$+>");
template = Regex.Replace(template, "%lt;(.[^%]*?)%gt;(.[^%]*?)<xsl:attribute(.*?)>", "<$1>$2<xsl:attribute$3>", RegexOptions.Singleline);
template = Regex.Replace(template, "</xsl:attribute>(.*?)%lt;/(.*?)%gt;", "</xsl:attribute>$1</$2>", RegexOptions.Singleline);
return template;
}
I will explain each line in turn:
1. The first line replaces opening chevrons with a %lt; token.
2. The next line replaces closing chevrons with a %gt; token.
3. This looks for any opening XSL tags and turns our tokens back into chevrons.
4. (Same as 3 but for closing XSL tags)
5. This looks for any <xsl:attribute> tag, and replaces the nearest preceding tokenised tag with a proper XML tag. An xsl:attribute tag applies to its parent element, so we have to transform our tokenised tag back into proper XML for it to work.
6. Similar to 5, this replaces any tokenised closing tag on the xsl:attribute parent node with the proper XML closing tag.
Further explanation of 5 & 6
Given the following template:
<a>
<xsl:attribute name="href">
<xsl:value-of select="url" />
</xsl:attribute>
<xsl:value-of select="linkText" />
</a>
We will end up with all non XSL nodes being tokenised like so:
%lt;a%gt;
<xsl:attribute name="href">
<xsl:value-of select="url" />
</xsl:attribute>
<xsl:value-of select="linkText" />
%lt;/a%gt;
Because the a tag is no longer a valid XML tag, the parser will not attach the attribute to the correct entity. Worse yet it will throw an exception when it tries to attach it to the root node. Looking for a preceding and following tag allows us to replace the a tag whilst leaving the rest of the document tokenised.
Problem
Unfortunately it has one limitation that I have noticed, which is that any non XSL tag within the a tag would cause the wrong tag to be replaced. Take the following example:
<a>
<xsl:attribute name="href">
<xsl:value-of select="url" />
</xsl:attribute>
<xsl:value-of select="linkText" />
<span> - Click here</span>
</a>
The regex would replace the closing span tag instead of the closing a tag, so we end up with this:
<a>
<xsl:attribute name="href">
<xsl:value-of select="url" />
</xsl:attribute>
<xsl:value-of select="linkText" />
%lt;span%gt; - Click here</span>
%lt;/a%gt;
Obviously this means the opening a tag and the closing span tag don't match, which throws an exception. The same is true if we put the span before the xsl:attribute tag except that the opening span is replaced instead of the opening a. My first thought was to look for a closing tag that matched the opening tag we found, or vice versa, but as the same tag can be nested it would involve counting the number of opening and closing tags to make sure they match up. That could get messy.
Solution
Thankfully this is easily solved by some normally invalid XSL. The reason for the xsl:attribute is to assign an attribute to an XML node - normally putting xml inside an attribute value is invalid and would throw an exception, but because we have tokenised our XML we can do so safely. So to add the attribute we just insert a normal xsl:value-of tag into the attribute like so:
<a href="<xsl:value-of select="logo" />">
<xsl:value-of select="linkText" />
<span> - Click here</span>
</a>
This would be invalid before we run PrepareTemplate, but afterwards it looks like this:
%lt;a href="<xsl:value-of select="logo" />"%gt;
<xsl:value-of select="linkText" />
%lt;span%gt; - Click here%lt;/span%gt;
%lt;/a%gt;
This is perfectly valid XML and has the added benefit of providing a more elegant approach (in my opinion) more akin to most templating languages. The xsl:attribute will still work, but only if there are no child elements (other than xsl tags) within the tag the attribute is being applied to.
Reverting
When we are finished processing the XML we then call the following method:
private string RevertTemplate(string template)
{
template = Regex.Replace(template, "%lt;", "<");
template = Regex.Replace(template, "%gt;", ">");
return template;
}
This converts our special chevron tokens back into proper chevrons while leaving normal &lt; and &gt; entities alone.
Summary
To prepare our template we call PrepareTemplate like so:
template = @"<?xml version=""1.0"" encoding=""UTF-8""?>
<xsl:stylesheet version=""1.0"" xmlns:xsl=""http://www.w3.org/1999/XSL/Transform"">
<xsl:output method=""html"" version=""2.0"" encoding=""UTF-8"" indent=""yes""/>
<xsl:template match=""/entity"">"
+ PrepareTemplate(template)
+ "</xsl:template></xsl:stylesheet>";
And after the XML has been transformed we revert the template like so:
var res = RevertTemplate(sb.ToString());
return res;
I hope this helps out someone facing a similar dilemma. While not a perfect solution it is a workable one. Obviously you should be careful if trying to use this in a high performance situation as regular expressions can slow your system right down if you are trying to process thousands of templates. It's worth doing some checking to see if the xsl:attribute tag is even present before trying to replace its parent tags, for instance.
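The check suggested above could be sketched roughly as follows. This is only an illustration: the class name TemplateTokenizer and the method name Prepare are hypothetical, plain string replacement is swapped in for the first two Regex.Replace calls, and the regular expressions are the ones from PrepareTemplate with $1 in place of $+.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper: a variant of PrepareTemplate that skips the two
// Singleline regexes when the template contains no xsl:attribute tag.
public static class TemplateTokenizer
{
    public static string Prepare(string template)
    {
        // Fixed tokens need no regex; string.Replace is sufficient here.
        template = template.Replace("<", "%lt;").Replace(">", "%gt;");

        // Restore opening and closing XSL tags so the processor still sees them.
        template = Regex.Replace(template, "%lt;xsl:(.*?)%gt;", "<xsl:$1>");
        template = Regex.Replace(template, "%lt;/xsl:(.*?)%gt;", "</xsl:$1>");

        // Only run the parent-tag replacements when they can match at all.
        if (template.Contains("<xsl:attribute"))
        {
            template = Regex.Replace(template,
                "%lt;(.[^%]*?)%gt;(.[^%]*?)<xsl:attribute(.*?)>",
                "<$1>$2<xsl:attribute$3>", RegexOptions.Singleline);
            template = Regex.Replace(template,
                "</xsl:attribute>(.*?)%lt;/(.*?)%gt;",
                "</xsl:attribute>$1</$2>", RegexOptions.Singleline);
        }
        return template;
    }
}
```

Whether this saves anything measurable depends on the templates, so it is worth profiling before relying on it.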
Good luck and feel free to offer any alternative suggestions or improvements.

Related

How to copy inner text and Xml into another tag

I'm working on a C# application that pulls apart two Xml documents, merges some of their tag content, and produces a third Xml document. I'm faced with a situation where I need to extract the value of one tag, including inner tags, and transfer it to another tag. I started out doing something like this:
var summaryElement = elementExternal.Element("summary");
var summaryValue = (string)summaryElement;
var summaryValueClean = ElementValueClean(summaryValue);
var result = new XElement("para", summaryValueClean)
Where the ElementValueClean function removes extraneous white space.
This works satisfactorily if the value of the summary tag contains only text. The rub comes when the summary tag contains child elements like:
<summary>
Notifies the context that a new link exists between the <paramref name="source" /> and <paramref name="target" /> objects
and that the link is represented via the source.<paramref name="sourceProperty" /> which is a collection.
The context adds this link to the set of newly created links to be sent to
the data service on the next call to SaveChanges().
</summary>
I would like to produce something like this:
<para>
Notifies the context that a new link exists between the <paramref name="source" /> and <paramref name="target" /> objects
and that the link is represented via the source.<paramref name="sourceProperty" /> which is a collection.
The context adds this link to the set of newly created links to be sent to
the data service on the next call to SaveChanges().
</para>
There are roughly a dozen possible embedded tags that could appear across my catalog of source tags whose content I must merge into output tags. So I would like a C# solution that I can generalize. However, an Xslt transform that I can apply to Xml fragment to produce the Xml fragment would work for me also if it is simple enough. My Xslt skills have diminished from disuse.

You could update the ElementValueClean() function to support inline nodes and accept an Element instead of its string value:
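// Clean only the text nodes; inline elements such as <paramref/> are left untouched.
foreach (XNode n in summaryElement.Nodes()) {
if (n is XText text) {
text.Value = ElementValueClean(text.Value); // do text cleanup
}
// else: keep the node as it is
}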
An XSLT to rewrap the element is really simple, but I think a C# solution still makes more sense because you already have a usable C# text cleanup solution.
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:xs="http://www.w3.org/2001/XMLSchema"
exclude-result-prefixes="xs"
version="2.0">
<xsl:template match="summary">
<para><xsl:apply-templates/></para>
</xsl:template>
<xsl:template match="node()|@*" priority="-1" mode="#default">
<xsl:copy>
<xsl:apply-templates select="node()|@*" mode="#current"/>
</xsl:copy>
</xsl:template>
</xsl:stylesheet>
Or you could do the whole thing in XSLT, including text cleanup. It's not clear what that function does, but this is how you'd start it in XSLT:
<xsl:template match="text()">
<xsl:value-of select="normalize-space(.)"/>
</xsl:template>
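For the C# route, a minimal LINQ to XML sketch of the rewrap might look like the following. The class name SummaryRewrapper and the method ToPara are hypothetical, and collapsing runs of whitespace is only a stand-in for whatever ElementValueClean really does.

```csharp
using System.Linq;
using System.Text.RegularExpressions;
using System.Xml.Linq;

// Hypothetical helper: copies the children of <summary> into a new <para>,
// cleaning text nodes and keeping inline elements such as <paramref/>.
public static class SummaryRewrapper
{
    public static XElement ToPara(XElement summary)
    {
        var nodes = summary.Nodes().Select(n =>
            n is XText t
                // Assumption: "cleanup" means collapsing whitespace runs to one space.
                ? (XNode)new XText(Regex.Replace(t.Value, @"\s+", " "))
                : n);

        // Nodes that already have a parent are cloned when added to the new element.
        return new XElement("para", nodes);
    }
}
```

Because the inline elements are copied as nodes rather than matched by name, the same method should cover the dozen or so embedded tags without listing them.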

Prevent XslCompiledTransform from using self-closing tags

I am using XslCompiledTransform to convert an XML file to HTML. Is there a way I can prevent it from using self-closing tags?
e.g.
<span></span> <!-- I want this even if content empty -->
<span/> <!-- stop doing this! -->
The self-closing tags on span's are messing up my document no matter which browser I use, though it is valid XML, it's just that 'span' is not allowed to have self-closing tags.
Is there a setting I can put in my xsl, or in my C#.Net code to prevent self-closing tags from being used?

Though I couldn't classify this as a direct solution (as it doesn't emit an empty element), the workaround I used was to put a space (using xsl:text) in the element -- since this is HTML markup, and if you are activating Standards mode (not quirks), the extra space doesn't change the rendered content. I also didn't have control over the invocation of the transform object.
<div class="clearBoth"><xsl:text> </xsl:text></div>

You can try <xsl:output method="html"/>; however, the result would no longer be a well-formed XML document.
Or, you can invoke the XslCompiledTransform.Transform() method passing as one of the parameters your own XmlWriter. In your implementation you are in full control and can implement any required serialization of the result tree.

The only solution I have been able to find is to add logic to the XSL file. Basically, if the elements I wanted to wrap the span around are empty, don't use the span element at all.
<xsl:if test="count(jar/beans) > 0">
<xsl:apply-templates select="jar/beans"/>
</xsl:if>
Not ideal to have to insert this everywhere in my xsl file, to compensate for the fact that even though I choose output method "html", it more than willingly will generate illegal HTML.
Sigh.

In your XSLT use <xsl:output method="html"/> and then make sure your HTML result elements your stylesheet creates are in no namespace. Furthermore depending on how you use XslCompiledTransform in your C# code you need to make sure the xsl:output settings in the stylesheet are honoured. You can easily achieve that by transforming to a file or stream or TextWriter, in that case nothing has to be done. However if you for some reasons transform to an XmlWriter then you need to ensure it is created with the proper settings e.g.
XslCompiledTransform proc = new XslCompiledTransform();
proc.Load("sheet.xsl");
using (XmlWriter xw = XmlWriter.Create("result.html", proc.OutputSettings))
{
proc.Transform("input.xml", null, xw);
}
But usually you should be fine by simply transforming to a Stream or TextWriter, in that case nothing in the C# code has to be done to honour the output method in the stylesheet.
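A minimal sketch of the TextWriter route, assuming the stylesheet and input arrive as strings (the class name HtmlTransform and the method Run are hypothetical):

```csharp
using System.IO;
using System.Xml;
using System.Xml.Xsl;

// Hypothetical helper: transforms to a TextWriter so that the xsl:output
// settings declared in the stylesheet control serialization.
public static class HtmlTransform
{
    public static string Run(string xslt, string xml)
    {
        var proc = new XslCompiledTransform();
        using (var sheet = XmlReader.Create(new StringReader(xslt)))
        {
            proc.Load(sheet);
        }

        using (var input = XmlReader.Create(new StringReader(xml)))
        using (var output = new StringWriter())
        {
            // No XmlWriter is involved, so no writer settings need to be passed.
            proc.Transform(input, null, output);
            return output.ToString();
        }
    }
}
```

With <xsl:output method="html"/> in the stylesheet and result elements in no namespace, an empty span should then be written with separate start and end tags.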

Force XML character entities into XmlDocument

I have some XML that looks like this:
<abc x="{"></abc>
I want to force XmlDocument to use the XML character entities of the brackets, ie:
<abc x="&#123;"></abc>
MSDN says this:
In order to assign an attribute value
that contains entity references, the
user must create an XmlAttribute node
plus any XmlText and
XmlEntityReference nodes, build the
appropriate subtree and use
SetAttributeNode to assign it as the
value of an attribute.
CreateEntityReference sounded promising, so I tried this:
XmlDocument doc = new XmlDocument();
doc.LoadXml("<abc />");
XmlAttribute x = doc.CreateAttribute("x");
x.AppendChild(doc.CreateEntityReference("#123"));
doc.DocumentElement.Attributes.Append(x);
And I get the exception Cannot create an 'EntityReference' node with a name starting with '#'.
Any reason why CreateEntityReference doesn't like the '#' - and more importantly how can I get the character entity into XmlDocument's XML? Is it even possible? I'm hoping to avoid string manipulation of the OuterXml...

You're mostly out of luck.
First off, what you're dealing with are called Character References, which is why CreateEntityReference fails. The sole reason for a character reference to exist is to provide access to characters that would be illegal in a given context or otherwise difficult to create.
Definition: A character reference
refers to a specific character in the
ISO/IEC 10646 character set, for
example one not directly accessible
from available input devices.
(See section 4.1 of the XML spec)
When an XML processor encounters a character reference, if it is referenced in the value of an attribute (that is, if the &#xxx format is used inside an attribute), it is set to "Included" which means its value is looked up and the text is replaced.
The string "AT&amp;T;" expands to
"AT&T;" and the remaining ampersand is
not recognized as an entity-reference
delimiter
(See section 4.4 of the XML spec)
This is baked into the XML spec and the Microsoft XML stack is doing what it's required to do: process character references.
The best I can see you doing is to take a peek at these old XML.com articles, one of which uses XSL to disable output escaping so &amp;#123; would turn into &#123; in the output.
http://www.xml.com/pub/a/2001/03/14/trxml10.html
<!DOCTYPE stylesheet [
<!ENTITY ntilde
"<xsl:text disable-output-escaping='yes'>&amp;ntilde;</xsl:text>">
]>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
version="1.0">
<xsl:output doctype-system="testOut.dtd"/>
<xsl:template match="test">
<testOut>
The Spanish word for "Spain" is "Espa&ntilde;a".
<xsl:apply-templates/>
</testOut>
</xsl:template>
</xsl:stylesheet>
And this one which uses XSL to convert specific character references into other text sequences (to accomplish the same goal as the previous link).
http://www.xml.com/lpt/a/1426
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
version="2.0">
<xsl:output use-character-maps="cm1"/>
<xsl:character-map name="cm1">
<xsl:output-character character="&#160;" string="&amp;nbsp;"/>
<xsl:output-character character="é" string="&amp;#233;"/> <!-- é -->
<xsl:output-character character="ô" string="&amp;#244;"/>
<xsl:output-character character="—" string="--"/>
</xsl:character-map>
<xsl:template match="@*|node()">
<xsl:copy>
<xsl:apply-templates select="@*|node()"/>
</xsl:copy>
</xsl:template>
</xsl:stylesheet>

You should always write your strings with the preceding @ like so: @"My /?.,<> STRING". I don't know if that will solve your issue though.
I would approach the problem using XmlNode class from the XmlDocument. You can use the Attributes property and it'll be way easier. Check it out here:
http://msdn.microsoft.com/en-us/library/system.xml.xmlnode.attributes.aspx

How to change my XSL stylesheet to properly allow carriage returns

Hey, I was wondering if anybody knew how to alter the following XSL stylesheet so that ANY text in my transformed XML will retain the carriage returns and line feeds (which will be \r\n as I feed it to the XML). I know I'm supposed to be using in some way but I can't seem to figure out how to get it working
<xsl:stylesheet xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\" version=\"1.0\">
<xsl:template match=\"/\"><xsl:apply-templates /></xsl:template><xsl:template match=\"\r\n\"><xsl:text>
</xsl:text></xsl:template><xsl:template match=\"*\">
<xsl:element name=\"{local-name()}\"><xsl:value-of select=\"text()\"/><xsl:apply-templates select=\"*\"/></xsl:element ></xsl:template></xsl:stylesheet>

In your code above you can't apply templates and expect this template to get called:
<xsl:template match="\r\n">
<xsl:text>
</xsl:text>
</xsl:template>
Unless you have a node in your XML named "\r\n" which is an illegal name anyhow. I think what you want to do is make this call explicitly when you want a carriage return:
<xsl:call-template name="crlf"/>
Here is an example of the template that could get called:
<xsl:template name="crlf">
<xsl:text>&#xD;</xsl:text>
<xsl:text>&#xA;</xsl:text>
<!--consult your system doc for appropriate carriage return coding -->
</xsl:template>

The answers from Chris and dkackman are on the mark but we also need to listen to the W3C every now and again:
XML parsed entities are often stored
in computer files which, for editing
convenience, are organized into lines.
These lines are typically separated by
some combination of the characters
carriage-return (#xD) and line-feed
(#xA).
This means that in your XSLT you can experiment with some combination of &#xD; and &#xA;. Remember that different operating systems have different line-ending strategies.

It's not completely clear what you are trying to accomplish but...
Any whitespace that you absolutely want to show up in the output stream I would wrap in <xsl:text></xsl:text>
I would also highly recommend specifying an <xsl:output/> to control the output formatting.

Your question sounds like you want to control the format of the output XML. My advice: just don't.
XML is data, not text. The format it is in should be completely irrelevant to your application. If it is not, then your application needs some reworking.
Within non-empty text nodes, XML will retain line breaks by definition. Within attribute nodes they are retained as well, unless the product you use does not adhere to the spec.
But outside of text nodes (or in those empty text nodes between elements) line breaks are considered irrelevant white space and you should not rely on them or waste your time trying to create or retain them.
There is <xsl:output indent="yes" />, which does some (XSLT processor-specific) pretty-printing, but your application should not rely on such things.

Have you tried the preserve white space tag?

Strip WordML from a string

I've been tasked with building an accessible RSS feed for my company's job listings. I already have an RSS feed from our recruiting partner, so I'm transforming their RSS XML into our own proxy RSS feed to add additional data as well as limit the number of items in the feed so we list only the latest jobs.
The RSS validates via feedvalidator.org (with warnings), but the problem is this: no matter how many times I tell them not to, my company's HR team copies and pastes their Word documents directly into our recruiting partner's CMS when inserting new job listings, leaving WordML in my feed. I believe this WordML is causing issues with Feedburner's BrowserFriendly feature, which we want to show up to make it easier for people to subscribe. Therefore, I need to remove the WordML markup from the feed.
Anybody have experience doing this? Can anyone point me to a good solution to this problem?
Preferably; I'd like to be pointed to a solution in .Net (VB or C# is fine) and/or XSL.
Any advice on this is greatly appreciated.
Thanks.

I haven't yet worked with WordML, but assuming that its elements are in a different namespace from RSS, it should be quite simple to do with XSLT.
Start with a basic identity transform (a stylesheet that adds all nodes from the input doc "as is" to the output tree). You need these two templates:
<!-- Copy all elements, and recur on their child nodes. -->
<xsl:template match="*">
<xsl:copy>
<xsl:apply-templates select="@*"/>
<xsl:apply-templates/>
</xsl:copy>
</xsl:template>
<!-- Copy all non-element nodes. -->
<xsl:template match="@*|text()|comment()|processing-instruction()">
<xsl:copy/>
</xsl:template>
A transformation using a stylesheet containing just the above two templates would exactly reproduce its input document on output, modulo those things that standards-compliant XML processors are permitted to change, such as entity replacement.
Now, add in a template that matches any element in the WordML namespace. Let's give it the namespace prefix 'wml' for the purposes of this example:
<!-- Do not copy WordML elements or their attributes to the
output tree; just recur on child nodes. -->
<xsl:template match="wml:*">
<xsl:apply-templates/>
</xsl:template>
The beginning and end of the stylesheet are left as an exercise for the coder.
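As one possible way to fill in that exercise and run it from .NET, the sketch below wraps the three templates in a complete stylesheet and applies it with XslCompiledTransform. The class name WordMlStripper is hypothetical, and the namespace URI urn:example:wordml is a placeholder: it has to be replaced with whatever namespace the pasted Word markup in the feed really uses.

```csharp
using System.IO;
using System.Xml;
using System.Xml.Xsl;

// Hypothetical helper: identity transform that unwraps elements in one namespace.
public static class WordMlStripper
{
    // Assumption: "urn:example:wordml" is a placeholder namespace URI.
    private const string Stylesheet = @"<xsl:stylesheet version=""1.0""
    xmlns:xsl=""http://www.w3.org/1999/XSL/Transform""
    xmlns:wml=""urn:example:wordml"">
  <xsl:output method=""xml"" omit-xml-declaration=""yes""/>
  <xsl:template match=""*"">
    <xsl:copy>
      <xsl:apply-templates select=""@*""/>
      <xsl:apply-templates/>
    </xsl:copy>
  </xsl:template>
  <xsl:template match=""@*|text()|comment()|processing-instruction()"">
    <xsl:copy/>
  </xsl:template>
  <xsl:template match=""wml:*"">
    <xsl:apply-templates/>
  </xsl:template>
</xsl:stylesheet>";

    public static string Strip(string xml)
    {
        var transform = new XslCompiledTransform();
        using (var sheet = XmlReader.Create(new StringReader(Stylesheet)))
        {
            transform.Load(sheet);
        }

        using (var input = XmlReader.Create(new StringReader(xml)))
        using (var output = new StringWriter())
        {
            transform.Transform(input, null, output);
            return output.ToString();
        }
    }
}
```

Note that xsl:copy also copies namespace declarations, so a declaration for the stripped namespace that sits on a retained element (for example the feed's root) may still appear in the output even though no element uses it.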

Jeff Atwood blogged about how to do this a while ago. His post contains some C# code that will clean the WordML.
http://www.codinghorror.com/blog/archives/000485.html

I would do something like this:
char[] charToRemove = { (char)8217, (char)8216, (char)8220, (char)8221, (char)8211 };
char[] charToAdd = { (char)39, (char)39, (char)34, (char)34, '-' };
string cleanedStr = "Your WordML filled Feed Text.";
for (int i = 0; i < charToRemove.Length; i++)
{
cleanedStr = cleanedStr.Replace(charToRemove.GetValue(i).ToString(), charToAdd.GetValue(i).ToString());
}
This looks for the listed characters (the Word special characters that mess everything up) and replaces them with their ASCII equivalents.
