Strip WordML from a string - c#

I've been tasked with building an accessible RSS feed for my company's job listings. I already have an RSS feed from our recruiting partner, so I'm transforming their RSS XML into our own proxy RSS feed to add additional data and to limit the number of items in the feed so that we list only the latest jobs.
The RSS validates via feedvalidator.org (with warnings), but here is the problem. No matter how many times I tell them not to, my company's HR team copies and pastes their Word documents directly into our recruiting partner's CMS when inserting new job listings, leaving WordML in my feed. I believe this WordML is causing issues with Feedburner's BrowserFriendly feature, which we want to show up to make it easier for people to subscribe. Therefore, I need to remove the WordML markup from the feed.
Does anybody have experience doing this? Can anyone point me to a good solution to this problem?
Preferably, I'd like to be pointed to a solution in .NET (VB or C# is fine) and/or XSL.
Any advice on this is greatly appreciated.
Thanks.

I haven't yet worked with WordML, but assuming that its elements are in a different namespace from RSS, it should be quite simple to do with XSLT.
Start with a basic identity transform (a stylesheet that adds all nodes from the input document "as is" to the output tree). You need these two templates:
<!-- Copy all elements, and recur on their attributes and child nodes. -->
<xsl:template match="*">
  <xsl:copy>
    <xsl:apply-templates select="@*"/>
    <xsl:apply-templates/>
  </xsl:copy>
</xsl:template>
<!-- Copy all non-element nodes. -->
<xsl:template match="@*|text()|comment()|processing-instruction()">
  <xsl:copy/>
</xsl:template>
A transformation using a stylesheet containing just the above two templates would exactly reproduce its input document on output, modulo those things that standards-compliant XML processors are permitted to change, such as entity replacement.
Now, add in a template that matches any element in the WordML namespace. Let's give it the namespace prefix 'wml' for the purposes of this example:
<!-- Do not copy WordML elements or their attributes to the
output tree; just recur on child nodes. -->
<xsl:template match="wml:*">
<xsl:apply-templates/>
</xsl:template>
The beginning and end of the stylesheet are left as an exercise for the coder.
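For readers who would rather see the pieces joined up, here is a minimal C# sketch that wraps the three templates above in a complete stylesheet and runs it with XslCompiledTransform. The class name WordMlStripper is hypothetical, and the namespace URI bound to the wml prefix is an assumption (it is the Word 2003 WordprocessingML URI); substitute whatever URI the offending elements in your feed actually declare.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;
using System.Xml.Xsl;

public static class WordMlStripper
{
    // ASSUMPTION: the wml prefix is bound to the Word 2003 namespace.
    // Replace the URI with the one your feed really uses.
    private const string Stylesheet = @"<?xml version=""1.0"" encoding=""UTF-8""?>
<xsl:stylesheet version=""1.0""
    xmlns:xsl=""http://www.w3.org/1999/XSL/Transform""
    xmlns:wml=""http://schemas.microsoft.com/office/word/2003/wordml""
    exclude-result-prefixes=""wml"">
  <!-- Identity transform: copy elements, attributes and other nodes. -->
  <xsl:template match=""*"">
    <xsl:copy>
      <xsl:apply-templates select=""@*""/>
      <xsl:apply-templates/>
    </xsl:copy>
  </xsl:template>
  <xsl:template match=""@*|text()|comment()|processing-instruction()"">
    <xsl:copy/>
  </xsl:template>
  <!-- Drop WordML elements and their attributes, keep their content. -->
  <xsl:template match=""wml:*"">
    <xsl:apply-templates/>
  </xsl:template>
</xsl:stylesheet>";

    public static string Strip(string xml)
    {
        var xslt = new XslCompiledTransform();
        using (var styleReader = XmlReader.Create(new StringReader(Stylesheet)))
        {
            xslt.Load(styleReader);
        }

        var output = new StringBuilder();
        using (var input = XmlReader.Create(new StringReader(xml)))
        using (var writer = XmlWriter.Create(output, xslt.OutputSettings))
        {
            xslt.Transform(input, writer);
        }
        return output.ToString();
    }
}
```

Note that xsl:copy also copies namespace nodes, so an xmlns declaration for the Word namespace on a copied element can still appear in the output even though no element uses it any more.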

Jeff Atwood blogged about how to do this a while ago. His post contains some C# code that will clean the WordML.
http://www.codinghorror.com/blog/archives/000485.html

I would do something like this:
// Word "smart" punctuation: right/left single quotes, left/right double quotes, en dash.
char[] charToRemove = { (char)8217, (char)8216, (char)8220, (char)8221, (char)8211 };
// ASCII replacements, in the same order.
char[] charToAdd = { (char)39, (char)39, (char)34, (char)34, '-' };
string cleanedStr = "Your WordML filled Feed Text.";
for (int i = 0; i < charToRemove.Length; i++)
{
    cleanedStr = cleanedStr.Replace(charToRemove[i], charToAdd[i]);
}
This looks for the characters listed in charToRemove (the Word special characters that mess everything up) and replaces them with their ASCII equivalents.

Related

Parsing invalid XML or HTML with XSLT in C#

I recently had to write a class that would process a provided template and return the result. I chose XSLT as my templating language due to the industry wide adoption it enjoys. The problem I had, however, was that the provided template had several restrictions that were proving to be a pain. Here is an example of my code:
public string ProcessTemplate(string template, IEnumerable<Field> fields)
{
// Surround the supplied template with the required XML
template = @"<?xml version=""1.0"" encoding=""UTF-8""?>
<xsl:stylesheet version=""1.0"" xmlns:xsl=""http://www.w3.org/1999/XSL/Transform"">
<xsl:output method=""html"" version=""2.0"" encoding=""UTF-8"" indent=""yes""/>
<xsl:template match=""/entity"">"
+ template
+ "</xsl:template></xsl:stylesheet>";
// Turn our fields into XML with an "entity" tag as the root node
var t = GetTemplateXml(fields);
// Create a stringreader to read our template into memory
var sr = new StringReader(t.ToString());
var xr = new XmlTextReader(sr);
// Now create a XmlWriter attached to a StringBuilder to contain the transformed result
var sb = new StringBuilder();
var xws = new XmlWriterSettings()
{
ConformanceLevel = ConformanceLevel.Fragment
};
var xw = XmlWriter.Create(sb, xws);
// Create the transform object and an XmlReader for our template
var xsl = new XslCompiledTransform();
var xslr = XmlReader.Create(new StringReader(template));
// Load our template into the transform object, transform it, and put the result into our XmlWriter (and therefore into our StringBuilder)
xsl.Load(xslr);
xsl.Transform(xr, xw);
var res = sb.ToString();
return res;
}
The user would provide a number of Field objects, which to be valid XML had to share a root node. I called this root node "entity" but didn't want the users to have to select "entity" every time they access a field. So I surround the template with <xsl:template match="/entity">, which means I can select the fields directly. Unfortunately I still had several problems:
Firstly, the template I was providing had an HTML Doctype declaration at the top of the page. I started getting errors around an "unexpected DTD declaration" because the DOCTYPE was appearing inside the xsl:template node.
If the user supplied any invalid XML (such as a tag that wasn't self closed and had no matching closing tag) then the parser would throw an exception, even if the HTML would have worked in a browser. This seemed unacceptable because as much as I would like my users to supply perfectly formed XML/HTML templates, I don't want to move the burden onto the end user if it's something that I can fix.
In the HTML template I was testing against, the opening HTML tag was surrounded with IE conditional comments in order to supply a different class to the tag depending on the version of IE. For instance <!--[if lt IE 7 ]><html class="ie ie6" lang="en"> <![endif]-->. Because this is a proprietary IE syntax, XML just sees a comment and no html tag. The closing tag at the end of the document therefore has no matching start tag and throws an exception.
As I tackled each issue one at a time the solution became less and less acceptable to me and I sought a robust solution that would not force the user to cater for the weaknesses of my template method.
I ended up settling on a solution that would allow invalid HTML and pass that responsibility back to the browser where the user expects it to lie. Some might disagree with this approach, reasoning that it is better for the template to fail quickly and obviously so that it can be fixed. However as I said, I am catering for an end user who has certain expectations from writing HTML - one of those is that the page will render if they provide poorly formed HTML. Even if it renders incorrectly.
I decided that I could handle all three issues outlined above by writing a method that would prepare the template before the transform, and then a method that would clean up the result before returning it to the user. Here they are:
/// <summary>
/// This function replaces all chevrons with %lt; and %gt;. Any xsl tags are left in place
/// so they can be processed, and the parent tags of xsl:attribute tags are also left
/// untouched in order that the attribute can be correctly assigned.
/// The tokens used are deliberately different from the standard entities &lt; and &gt;
/// because we will want to revert these tokens later without reverting the normal entities.
/// </summary>
/// <param name="template"></param>
/// <returns></returns>
private string PrepareTemplate(string template)
{
template = Regex.Replace(template, "<", "%lt;");
template = Regex.Replace(template, ">", "%gt;");
template = Regex.Replace(template, "%lt;xsl:(.*?)%gt;", "<xsl:$1>");
template = Regex.Replace(template, "%lt;/xsl:(.*?)%gt;", "</xsl:$+>");
template = Regex.Replace(template, "%lt;(.[^%]*?)%gt;(.[^%]*?)<xsl:attribute(.*?)>", "<$1>$2<xsl:attribute$3>", RegexOptions.Singleline);
template = Regex.Replace(template, "</xsl:attribute>(.*?)%lt;/(.*?)%gt;", "</xsl:attribute>$1</$2>", RegexOptions.Singleline);
return template;
}
I will explain each line in turn:
1. The first line replaces opening chevrons with a %lt; token.
2. The next line replaces closing chevrons with a %gt; token.
3. This looks for any opening XSL tags and turns our tokens back into chevrons.
4. Same as 3, but for closing XSL tags.
5. This looks for any <xsl:attribute> tag, and replaces the nearest preceding tokenised tag with a proper XML tag. An xsl:attribute tag applies to the closest ancestor node, so we have to transform our tokenised tag to proper XML for it to work.
6. Similar to 5, this replaces any tokenised closing tag on the xsl:attribute parent node with the proper XML closing tag.
Further explanation of 5 & 6
Given the following template:
<a>
<xsl:attribute name="href">
<xsl:value-of select="url" />
</xsl:attribute>
<xsl:value-of select="linkText" />
</a>
We will end up with all non XSL nodes being tokenised like so:
%lt;a%gt;
<xsl:attribute name="href">
<xsl:value-of select="url" />
</xsl:attribute>
<xsl:value-of select="linkText" />
%lt;/a%gt;
Because the a tag is no longer a valid XML tag, the parser will not attach the attribute to the correct entity. Worse yet it will throw an exception when it tries to attach it to the root node. Looking for a preceding and following tag allows us to replace the a tag whilst leaving the rest of the document tokenised.
Problem
Unfortunately it has one limitation that I have noticed, which is that any non XSL tag within the a tag would cause the wrong tag to be replaced. Take the following example:
<a>
<xsl:attribute name="href">
<xsl:value-of select="url" />
</xsl:attribute>
<xsl:value-of select="linkText" />
<span> - Click here</span>
</a>
The regex would replace the closing span tag instead of the closing a tag, so we end up with this:
<a>
<xsl:attribute name="href">
<xsl:value-of select="url" />
</xsl:attribute>
<xsl:value-of select="linkText" />
%lt;span%gt; - Click here</span>
%lt;/a%gt;
Obviously this means the opening a tag and the closing span tag don't match, which throws an exception. The same is true if we put the span before the xsl:attribute tag except that the opening span is replaced instead of the opening a. My first thought was to look for a closing tag that matched the opening tag we found, or vice versa, but as the same tag can be nested it would involve counting the number of opening and closing tags to make sure they match up. That could get messy.
Solution
Thankfully this is easily solved by some normally invalid XSL. The reason for the xsl:attribute is to assign an attribute to an XML node - normally putting xml inside an attribute value is invalid and would throw an exception, but because we have tokenised our XML we can do so safely. So to add the attribute we just insert a normal xsl:value-of tag into the attribute like so:
<a href="<xsl:value-of select="logo" />">
<xsl:value-of select="linkText" />
<span> - Click here</span>
</a>
This would be invalid before we run PrepareTemplate, but afterwards it looks like this:
%lt;a href="<xsl:value-of select="logo" />"%gt;
<xsl:value-of select="linkText" />
%lt;span%gt; - Click here%lt;/span%gt;
%lt;/a%gt;
This is perfectly valid XML and has the added benefit of providing a more elegant approach (in my opinion) more akin to most templating languages. The xsl:attribute will still work, but only if there are no child elements (other than xsl tags) within the tag the attribute is being applied to.
Reverting
When we are finished processing the XML we then call the following method:
private string RevertTemplate(string template)
{
template = Regex.Replace(template, "%lt;", "<");
template = Regex.Replace(template, "%gt;", ">");
return template;
}
This converts our special chevron tokens back into proper chevrons while leaving the normal &lt; and &gt; entities alone.
Summary
To prepare our template we call PrepareTemplate like so:
template = @"<?xml version=""1.0"" encoding=""UTF-8""?>
<xsl:stylesheet version=""1.0"" xmlns:xsl=""http://www.w3.org/1999/XSL/Transform"">
<xsl:output method=""html"" version=""2.0"" encoding=""UTF-8"" indent=""yes""/>
<xsl:template match=""/entity"">"
+ PrepareTemplate(template)
+ "</xsl:template></xsl:stylesheet>";
And after the XML has been transformed we revert the template like so:
var res = RevertTemplate(sb.ToString());
return res;
I hope this helps out someone facing a similar dilemma. While not a perfect solution it is a workable one. Obviously you should be careful if trying to use this in a high performance situation as regular expressions can slow your system right down if you are trying to process thousands of templates. It's worth doing some checking to see if the xsl:attribute tag is even present before trying to replace its parent tags, for instance.
Good luck and feel free to offer any alternative suggestions or improvements.
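The presence check suggested above might look something like the sketch below. It is only an illustration: the class name TemplateTokenizer is hypothetical, the opening and closing xsl: tags are restored with a single combined pattern rather than the two separate calls shown earlier, and plain string replacement stands in for the regexes that only match literal text.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TemplateTokenizer
{
    public static string PrepareTemplate(string template)
    {
        // Tokenise every chevron; literal replacement needs no regex.
        template = template.Replace("<", "%lt;").Replace(">", "%gt;");

        // Restore opening and closing xsl: tags in one pass.
        template = Regex.Replace(template, "%lt;(/?)xsl:(.*?)%gt;", "<${1}xsl:${2}>");

        // Only run the two costlier Singleline patterns when an
        // xsl:attribute tag is actually present.
        if (template.Contains("<xsl:attribute"))
        {
            template = Regex.Replace(template,
                "%lt;(.[^%]*?)%gt;(.[^%]*?)<xsl:attribute(.*?)>",
                "<$1>$2<xsl:attribute$3>", RegexOptions.Singleline);
            template = Regex.Replace(template,
                "</xsl:attribute>(.*?)%lt;/(.*?)%gt;",
                "</xsl:attribute>$1</$2>", RegexOptions.Singleline);
        }
        return template;
    }

    public static string RevertTemplate(string transformed)
    {
        // Turn the custom tokens back into chevrons after the transform.
        return transformed.Replace("%lt;", "<").Replace("%gt;", ">");
    }
}
```

Whether this saves anything measurable depends on how many templates contain xsl:attribute, so it is worth profiling before relying on it.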

Force XML character entities into XmlDocument

I have some XML that looks like this:
<abc x="{"></abc>
I want to force XmlDocument to use the XML character entities of the brackets, ie:
<abc x="&#123;"></abc>
MSDN says this:
In order to assign an attribute value
that contains entity references, the
user must create an XmlAttribute node
plus any XmlText and
XmlEntityReference nodes, build the
appropriate subtree and use
SetAttributeNode to assign it as the
value of an attribute.
CreateEntityReference sounded promising, so I tried this:
XmlDocument doc = new XmlDocument();
doc.LoadXml("<abc />");
XmlAttribute x = doc.CreateAttribute("x");
x.AppendChild(doc.CreateEntityReference("#123"));
doc.DocumentElement.Attributes.Append(x);
And I get the exception Cannot create an 'EntityReference' node with a name starting with '#'.
Any reason why CreateEntityReference doesn't like the '#' - and more importantly how can I get the character entity into XmlDocument's XML? Is it even possible? I'm hoping to avoid string manipulation of the OuterXml...
You're mostly out of luck.
First off, what you're dealing with are called Character References, which is why CreateEntityReference fails. The sole reason for a character reference to exist is to provide access to characters that would be illegal in a given context or otherwise difficult to create.
Definition: A character reference
refers to a specific character in the
ISO/IEC 10646 character set, for
example one not directly accessible
from available input devices.
(See section 4.1 of the XML spec)
When an XML processor encounters a character reference in the value of an attribute (that is, if the &#xxx; format is used inside an attribute), it is treated as "Included", which means its value is looked up and the text is replaced.
The string "AT&amp;T;" expands to
"AT&T;" and the remaining ampersand is
not recognized as an entity-reference
delimiter
(See section 4.4 of the XML spec)
This is baked into the XML spec and the Microsoft XML stack is doing what it's required to do: process character references.
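A small sketch makes the behaviour visible. The class and method names here are made up for illustration; the point is only that the character reference is already resolved by the time XmlDocument hands the attribute value back, so there is nothing left to preserve when the document is serialized.

```csharp
using System;
using System.Xml;

public static class CharacterReferenceDemo
{
    // Loads markup whose attribute uses a character reference and
    // returns the attribute value as XmlDocument exposes it.
    public static string LoadedAttributeValue()
    {
        var doc = new XmlDocument();
        doc.LoadXml("<abc x=\"&#123;\" />");
        return doc.DocumentElement.GetAttribute("x");
    }

    // Returns the serialized markup after the same load.
    public static string SerializedXml()
    {
        var doc = new XmlDocument();
        doc.LoadXml("<abc x=\"&#123;\" />");
        return doc.OuterXml;
    }
}
```

LoadedAttributeValue returns the literal brace character, and the serialized markup no longer contains the reference.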
The best I can suggest is to take a look at these old XML.com articles, one of which uses XSL to disable output escaping so that &amp;#123; in the stylesheet would turn into &#123; in the output.
http://www.xml.com/pub/a/2001/03/14/trxml10.html
<!DOCTYPE stylesheet [
<!ENTITY ntilde
"<xsl:text disable-output-escaping='yes'>&amp;ntilde;</xsl:text>">
]>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  version="1.0">
  <xsl:output doctype-system="testOut.dtd"/>
  <xsl:template match="test">
    <testOut>
      The Spanish word for "Spain" is "Espa&ntilde;a".
      <xsl:apply-templates/>
    </testOut>
  </xsl:template>
</xsl:stylesheet>
And this one which uses XSL to convert specific character references into other text sequences (to accomplish the same goal as the previous link).
http://www.xml.com/lpt/a/1426
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  version="2.0">
  <xsl:output use-character-maps="cm1"/>
  <xsl:character-map name="cm1">
    <xsl:output-character character="&#160;" string="&amp;nbsp;"/>
    <xsl:output-character character="é" string="&amp;#233;"/> <!-- é -->
    <xsl:output-character character="ô" string="&amp;#244;"/>
    <xsl:output-character character="—" string="--"/>
  </xsl:character-map>
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>
</xsl:stylesheet>
You should always prefix such string literals with @ (a verbatim string), like so: @"My /?.,<> STRING". I don't know if that will solve your issue though.
I would approach the problem using XmlNode class from the XmlDocument. You can use the Attributes property and it'll be way easier. Check it out here:
http://msdn.microsoft.com/en-us/library/system.xml.xmlnode.attributes.aspx

C#-Sorting XML elements -Possible?(without ADO.NET)

I just need to recreate an XML file after sorting on a key field element (say EmpID). The thing is, I should not use ADO.NET. What is the best way to do the sort? Which XML class do I need to use? Would LINQ be handy?
No need for C# to do this. You can do it via an XSL file:
<xsl:template match="/">
  <xsl:apply-templates select="yourelementnode">
    <xsl:sort select="EmpID" order="ascending" />
  </xsl:apply-templates>
</xsl:template>
LINQ to XML would probably be your best bet. You could either move the elements "in place" or (possibly more easily) create a new document with the re-ordered elements.
If you can give us some sample XML (input and desired output) it should be fairly easy to come up with some example code.
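In the absence of sample XML, here is a rough LINQ to XML sketch of the "create a new document" approach. It assumes, purely for illustration, that every child of the root element has an EmpID child holding an integer; the class name EmployeeSorter and the element names used in the usage note are hypothetical.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

public static class EmployeeSorter
{
    // Builds a new document whose root has the same name and attributes
    // as the source root, with the child elements ordered by EmpID.
    // ASSUMPTION: each child element contains an integer EmpID element.
    public static XDocument SortByEmpId(XDocument source)
    {
        XElement root = source.Root;
        var sorted = root.Elements()
                         .OrderBy(e => (int)e.Element("EmpID"));
        return new XDocument(
            new XElement(root.Name, root.Attributes(), sorted));
    }
}
```

Calling EmployeeSorter.SortByEmpId(XDocument.Load("employees.xml")).Save("sorted.xml") would then write the reordered file; the file names are placeholders. If EmpID is not numeric, order by the string value instead of casting to int.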

How to change my XSL stylesheet to properly allow carriage returns

Hey, I was wondering if anybody knew how to alter the following XSL stylesheet so that ANY text in my transformed XML will retain the carriage returns and line feeds (which will be \r\n as I feed it to the XML). I know I'm supposed to be using in some way but I can't seem to figure out how to get it working
<xsl:stylesheet xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\" version=\"1.0\">
<xsl:template match=\"/\"><xsl:apply-templates /></xsl:template><xsl:template match=\"\r\n\"><xsl:text>
</xsl:text></xsl:template><xsl:template match=\"*\">
<xsl:element name=\"{local-name()}\"><xsl:value-of select=\"text()\"/><xsl:apply-templates select=\"*\"/></xsl:element ></xsl:template></xsl:stylesheet>
In your code above you can't apply templates and expect this template to get called:
<xsl:template match="\r\n">
<xsl:text>
</xsl:text>
</xsl:template>
Unless you have a node in your XML named "\r\n" which is an illegal name anyhow. I think what you want to do is make this call explicitly when you want a carriage return:
<xsl:call-template name="crlf"/>
Here is an example of the template that could get called:
<xsl:template name="crlf">
  <xsl:text>&#xD;</xsl:text>
  <xsl:text>&#xA;</xsl:text>
  <!--consult your system doc for appropriate carriage return coding -->
</xsl:template>
The answers from Chris and dkackman are on the mark but we also need to listen to the W3C every now and again:
XML parsed entities are often stored
in computer files which, for editing
convenience, are organized into lines.
These lines are typically separated by
some combination of the characters
carriage-return (#xD) and line-feed
(#xA).
This means that in your XSLT you can experiment with some combination of &#xD; and &#xA;. Remember that different operating systems have different line-ending strategies.
It's not completely clear what you are trying to accomplish but...
Any whitespace that you absolutely want to show up in the output stream I would wrap in <xsl:text></xsl:text>
I would also highly recommend specifying an <xsl:output/> to control the output formatting.
Your question sounds like you want to control the format of the output XML. My advice: just don't.
XML is data, not text. The format it is in should be completely irrelevant to your application. If it is not, then your application needs some reworking.
Within non-empty text nodes, XML will retain line breaks by definition. Within attribute nodes they are retained as well, unless the product you use does not adhere to the spec.
But outside of text nodes (or in those empty text nodes between elements) line breaks are considered irrelevant white space and you should not rely on them or waste your time trying to create or retain them.
There is <xsl:output indent="yes" />, which does some (XSLT processor-specific) pretty-printing, but your application should not rely on such things.
Have you tried the xsl:preserve-space element?

Using C# Regular expression to replace XML element content

I'm writing some code that handles logging xml data and I would like to be able to replace the content of certain elements (eg passwords) in the document. I'd rather not serialize and parse the document as my code will be handling a variety of schemas.
Sample input documents:
doc #1:
<user>
<userid>jsmith</userid>
<password>myPword</password>
</user>
doc #2:
<secinfo>
<ns:username>jsmith</ns:username>
<ns:password>myPword</ns:password>
</secinfo>
What I'd like my output to be:
output doc #1:
<user>
<userid>jsmith</userid>
<password>XXXXX</password>
</user>
output doc #2:
<secinfo>
<ns:username>jsmith</ns:username>
<ns:password>XXXXX</ns:password>
</secinfo>
Since the documents I'll be processing could have a variety of schemas, I was hoping to come up with a nice generic regular expression solution that could find elements with password in them and mask the content accordingly.
Can I solve this using regular expressions and C# or is there a more efficient way?
This problem is best solved with XSLT:
<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:template match="@* | node()">
<xsl:copy>
<xsl:apply-templates select="@* | node()"/>
</xsl:copy>
</xsl:template>
<xsl:template match="//password">
<xsl:copy>
<xsl:text>XXXXX</xsl:text>
</xsl:copy>
</xsl:template>
</xsl:stylesheet>
This will work for both inputs as long as you handle the namespaces properly.
Edit : Clarification of what I mean by "handle namespaces properly"
Make sure that a source document using the ns prefix has a namespace defined for it, like so:
<?xml version="1.0" encoding="utf-8"?>
<secinfo xmlns:ns="urn:foo">
<ns:username>jsmith</ns:username>
<ns:password>XXXXX</ns:password>
</secinfo>
I'd say you're better off parsing the content with a .NET XmlDocument object and finding password elements using XPath, then changing their InnerXml properties. It has the advantage of being more correct (since XML isn't regular in the first place), and it's conceptually easy to understand.
From experience with systems that try to parse and/or modify XML without proper parsers, let me say: DON'T DO IT. Use an XML parser (There are other answers here that have ways to do that quickly and easily).
Using non-xml methods to parse and/or modify an XML stream will ALWAYS lead you to pain at some point in the future. I know, because I have felt that pain.
I know that it seems like it would be quicker-at-runtime/simpler-to-code/easier-to-understand/whatever if you use the regex solution. But you're just going to make someone's life miserable later.
You can use regular expressions if you know enough about what you are trying to match. For example if you are looking for any tag that has the word "password" in it with no inner tags this regex expression would work:
(<([^>]*?password[^>]*?)>)([^<]*?)(<\/\2>)
You could use the same C# replace statement in zowat's answer as well but for the replace string you would want to use "$1XXXXX$4" instead.
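As a rough sketch of how that pattern and replacement string fit together in C#, assuming the simple case described above (a tag name containing "password", no attributes on the tag, and no inner tags), something like the following could be used. The class name RegexPasswordMasker is made up for the example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RegexPasswordMasker
{
    // Group 1: the whole opening tag, group 2: its name,
    // group 3: the text content, group 4: the matching closing tag.
    private static readonly Regex PasswordElement =
        new Regex(@"(<([^>]*?password[^>]*?)>)([^<]*?)(<\/\2>)");

    public static string Mask(string xml)
    {
        // Keep the tags, replace only the content.
        return PasswordElement.Replace(xml, "$1XXXXX$4");
    }
}
```

Because group 2 captures everything between the chevrons, an opening tag that carries attributes will not match its closing tag and is left unmasked, which is one of the ways this approach can fail quietly.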
Regex is the wrong approach for this, I've seen it go so badly wrong when you least expect it.
XDocument is way more fun anyway:
XDocument doc = XDocument.Parse(@"
<user>
<userid>jsmith</userid>
<password>password</password>
</user>");
doc.Element("user").Element("password").Value = "XXXXX";
// Temp namespace just for the purposes of the example -
XDocument doc2 = XDocument.Parse(@"
<secinfo xmlns:ns='http://tempuru.org/users'>
<ns:userid>jsmith</ns:userid>
<ns:password>password</ns:password>
</secinfo>");
doc2.Element("secinfo").Element("{http://tempuru.org/users}password").Value = "XXXXX";
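Since the question asks for something that works across schemas, the same idea can be generalised by matching on the local name instead of a fixed path. This is only a sketch under that assumption; XDocumentPasswordMasker is a hypothetical name, and matching any local name containing "password" is a choice made for the example.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

public static class XDocumentPasswordMasker
{
    public static string Mask(string xml)
    {
        XDocument doc = XDocument.Parse(xml);

        // Materialise the matches first so the tree is not modified
        // while Descendants() is still being enumerated.
        var targets = doc.Descendants()
            .Where(e => e.Name.LocalName.IndexOf(
                "password", StringComparison.OrdinalIgnoreCase) >= 0)
            .ToList();

        foreach (XElement element in targets)
        {
            element.Value = "XXXXX";
        }
        return doc.ToString(SaveOptions.DisableFormatting);
    }
}
```

Using LocalName means the namespace prefix does not matter, so both sample documents are handled by the same call.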
Here is what I came up with when I went with XMLDocument, it may not be as slick as XSLT, but should be generic enough to handle a variety of documents:
//input is a String with some valid XML
XmlDocument doc = new XmlDocument();
doc.LoadXml(input);
XmlNodeList nodeList = doc.SelectNodes("//*");
foreach (XmlNode node in nodeList)
{
if (node.Name.ToUpper().Contains("PASSWORD"))
{
node.InnerText = "XXXX";
}
else if (node.Attributes.Count > 0)
{
foreach (XmlAttribute a in node.Attributes)
{
if (a.LocalName.ToUpper().Contains("PASSWORD"))
{
a.InnerText = "XXXXX";
}
}
}
}
The main reason XSLT exists is to transform XML structures; an XSLT is a type of stylesheet that can be used to alter the order of elements or change the content of elements. This is therefore a typical situation where it's highly recommended to use XSLT instead of parsing, as Andrew Hare said in a previous post.
