Creating a space ( ) in XSL - c#

Im trying to create automatic spacing in a XSL document in the following manner.
<td><xsl:value-of select="Name/First"/> <xsl:text disable-output-escaping="yes"><![CDATA[ ]]></xsl:text><xsl:value-of select="Name/Last"/> </td>
However, the rendered HTML is of the following form
<td>John&nbsp;Grisham</td>
Any idea on how I could fix this?

Your immediate problem is that while unicode 160 (hex 0xA0) is an HTML entity, it is not an XML entity.
Use   or   for non-breaking-space instead.
However for your larger problem, how to handle white space in XSL, the answer is simply this: Use <xsl:text>.
Every time you include ANY plain text, enclose it in <xsl:text> the text goes here </xsl:text> tags. If you don't you will be in for a world of pain the next time a clever text editor reformats your document.
You are already in for at least a Continent, or possibly if you are lucky a Country of pain, for expecting XML/XSL to conserve whitespace. Even geniuses who understand XSL to the nth degree still get County or at least Borough level pain from whitespace handling. (Borough level pain is encoded in the XML spec, "2.11 End-of-Line Handling" by its insane design decision to refuse to distinguish between LF and CRLF - so nobody can avoid that).
Just so you know what to expect: It isn't easy - you can get away without the <xsl:text> tags for a surprisingly long time, but if you just accept it, and put them in from the get-go, it will be easier in the long run.
Example WRONG:
<xsl:element name="MyElem">
<xsl:attribute name="fullPath">c:\base\Path\here\<xsl:value-of select="../parent/#relPath"/>\<xsl:value-of select="#fileName">
</xsl:attribute>
</xsl:element>
Example RIGHT:
<xsl:element name="MyElem">
<xsl:attribute name="fullPath">
<xsl:text>c:\base\Path\here\</xsl:text>
<xsl:value-of select="../parent/#relPath"/>
<xsl:text>\</xsl:text>
<xsl:value-of select="#fileName">
</xsl:attribute>
</xsl:element>
The thing is, they both produce exactly the same output.
But one of them will be messed up at some point in the future, yes, possibly by someone as yet unborn, the other will not.
The short-ish explanation is this: Nodes consisting ONLY OF WHITESPACE are ignored by default (unless you tweak the options). So that is anything consisting only of CR, LF, TAB and SPACE between > and <. Nodes consisting of non-whitespace text, with leading and trailing whitespace, may have whitespace "folded" - I.e. effed up.
So the difference between the Example RIGHT and this:
<xsl:element name="MyElem">
<xsl:attribute name="fullPath">
c:\base\Path\here\
<xsl:value-of select="../parent/#relPath"/>
\
<xsl:value-of select="#fileName">
</xsl:attribute>
</xsl:element>
is that one generates <MyElem fullPath="c:\base\Path\here\relative\path\filename.txt"/> and the other, depending on the DOM options in force, generates one of these:
<MyElem fullPath="c:\base\Path\here\relative\path\filename.txt"/>
<MyElem fullPath="c:\base\Path\here\ relative\path \ filename.txt"/>
<MyElem fullPath="c:\base\Path\here\
relative\path
\&10;filename.txt"/>
<MyElem fullPath="c:\base\Path\here\
relative\path
\ &10; filename.txt"/>
Only one of which was what you wanted... but any of which might be correct depending on the options in force...

Used this <xsl:text disable-output-escaping="yes">&</xsl:text>nbsp; and it worked!

Related

Remove Certain HTML tags in C#

I'm trying to remove a certain html tags in C# like this:
<div>
<blockquote style="font-size: 30px" width="300px">
For 50 years, WWF has been protecting the future of nature. The world's leading conservation organization, WWF works in 100 countries and is supported by 1.2 million members in the United States and close to 5 million globally.
</blockquote>
</div>
To be result as
<div>For 50 years, WWF has been protecting the future of nature. The world's leading conservation organization, WWF works in 100 countries and is supported by 1.2 million members in the United States and close to 5 million globally.</div>
So far, I'm trying to do the regex. (<.+?)\s+style\s*=\s*([""']).*?\2(.*?>) but this is only for removing the style but I'm not sure how can I able to achieve the result that I want.
Thanks!
As far as I can see, you want to remove the HTML elements that contain a style attribute, also remove their closing pairs. Unfortunately, there is no good way to do that with regexes. Without the 'also remove their closing pairs' clause, we could write an approximately good regex.
On the other hand, XSLT is the right tool for this, because it can handle the recursive nature of XML:
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:template match="//*[not(#style)]">
<xsl:copy>
<xsl:apply-templates select="#*|node()"/>
</xsl:copy>
</xsl:template>
</xsl:stylesheet>
What's happening here? The <xsl:template match="//*[not(#style)]"> part matches everything that does not have a style attribute. Then the <xsl:copy>...</xsl:copy> part copies them entirely. I.e. the items that have a style attribute, they will not be copied.
For the record, this is a slight variant of the XSLT identity transformation:
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:template match="#*|node()">
<xsl:copy>
<xsl:apply-templates select="#*|node()"/>
</xsl:copy>
</xsl:template>
</xsl:stylesheet>

Outputting carriage return from variable set in XSLT

I know it's been asked numerous times and while my question might seem to be a duplicate, I can't make the list of solutions work.
How come that doesn't seem to add "\r\n" or CRLF in my output file ?
<xsl:variable name="Newline">
</xsl:variable>
...
<xsl:text>UpperLine</xsl:text>
<xsl:value-of select="$Newline"/>
<xsl:text>LowerLine</xsl:text>
Even when looking a tthe hexadcimal representation of the file, I can't see the 0d0a equivalent of CRLF (I can only see the ones at the end)
I've also tried this with no success either :
<xsl:value-of select="concat('New Line', '
')"/>
change your variable declaration to this:
<xsl:variable name="Newline"><xsl:text>
</xsl:text></xsl:variable>

Force XML character entities into XmlDocument

I have some XML that looks like this:
<abc x="{"></abc>
I want to force XmlDocument to use the XML character entities of the brackets, ie:
<abc x="{"></abc>
MSDN says this:
In order to assign an attribute value
that contains entity references, the
user must create an XmlAttribute node
plus any XmlText and
XmlEntityReference nodes, build the
appropriate subtree and use
SetAttributeNode to assign it as the
value of an attribute.
CreateEntityReference sounded promising, so I tried this:
XmlDocument doc = new XmlDocument();
doc.LoadXml("<abc />");
XmlAttribute x = doc.CreateAttribute("x");
x.AppendChild(doc.CreateEntityReference("#123"));
doc.DocumentElement.Attributes.Append(x);
And I get the exception Cannot create an 'EntityReference' node with a name starting with '#'.
Any reason why CreateEntityReference doesn't like the '#' - and more importantly how can I get the character entity into XmlDocument's XML? Is it even possible? I'm hoping to avoid string manipulation of the OuterXml...
You're mostly out of luck.
First off, what you're dealing with are called Character References, which is why CreateEntityReference fails. The sole reason for a character reference to exist is to provide access to characters that would be illegal in a given context or otherwise difficult to create.
Definition: A character reference
refers to a specific character in the
ISO/IEC 10646 character set, for
example one not directly accessible
from available input devices.
(See section 4.1 of the XML spec)
When an XML processor encounters a character reference, if it is referenced in the value of an attribute (that is, if the &#xxx format is used inside an attribute), it is set to "Included" which means its value is looked up and the text is replaced.
The string "AT&T;" expands to "
AT&T;" and the remaining ampersand is
not recognized as an entity-reference
delimiter
(See section 4.4 of the XML spec)
This is baked into the XML spec and the Microsoft XML stack is doing what it's required to do: process character references.
The best I can see you doing is to take a peek at these old XML.com articles, one of which uses XSL to disable output escaping so &#123; would turn into { in the output.
http://www.xml.com/pub/a/2001/03/14/trxml10.html
<!DOCTYPE stylesheet [
<!ENTITY ntilde
"<xsl:text disable-output-escaping='yes'>&ntilde;</xsl:text>">
]>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
version="1.0">
<xsl:output doctype-system="testOut.dtd"/>
<xsl:template match="test">
<testOut>
The Spanish word for "Spain" is "España".
<xsl:apply-templates/>
</testOut>
</xsl:template>
</xsl:stylesheet>
And this one which uses XSL to convert specific character references into other text sequences (to accomplish the same goal as the previous link).
http://www.xml.com/lpt/a/1426
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
version="2.0">
<xsl:output use-character-maps="cm1"/>
<xsl:character-map name="cm1">
<xsl:output-character character=" " string="&nbsp;"/>
<xsl:output-character character="é" string="&233;"/> <!-- é -->
<xsl:output-character character="ô" string="&#244;"/>
<xsl:output-character character="—" string="--"/>
</xsl:character-map>
<xsl:template match="#*|node()">
<xsl:copy>
<xsl:apply-templates select="#*|node()"/>
</xsl:copy>
</xsl:template>
</xsl:stylesheet>
You should always manipulate your strings with the preceding # like so #"My /?.,<> STRING". I don't know if that will solve your issue though.
I would approach the problem using XmlNode class from the XmlDocument. You can use the Attributes property and it'll be way easier. Check it out here:
http://msdn.microsoft.com/en-us/library/system.xml.xmlnode.attributes.aspx

How to change my XSL stylesheet to properly allow carriage returns

Hey, I was wondering if anybody knew how to alter the following XSL stylesheet so that ANY text in my transformed XML will retain the carriage returns and line feeds (which will be \r\n as I feed it to the XML). I know I'm supposed to be using in some way but I can't seem to figure out how to get it working
<xsl:stylesheet xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\" version=\"1.0\">
<xsl:template match=\"/\"><xsl:apply-templates /></xsl:template><xsl:template match=\"\r\n\"><xsl:text>
</xsl:text></xsl:template><xsl:template match=\"*\">
<xsl:element name=\"{local-name()}\"><xsl:value-of select=\"text()\"/><xsl:apply-templates select=\"*\"/></xsl:element ></xsl:template></xsl:stylesheet>
In your code above you can't apply templates and expect this template to get called:
<xsl:template match="\r\n\">
<xsl:text>
</xsl:text>
</xsl:template>
Unless you have a node in your XML named "\r\n" which is an illegal name anyhow. I think what you want to do is make this call explicitly when you want a carriage return:
<xsl:call-template name="crlf"/>
Here is an example of the template that could get called:
<xsl:template name="crlf">
<xsl:text>
</xsl:text>
<xsl:text>
</xsl:text>
<!--consult your system doc for appropriate carriage return coding -->
</xsl:template>
The answers from Chris and dkackman are on the mark but we also need to listen to the W3C every now and again:
XML parsed entities are often stored
in computer files which, for editing
convenience, are organized into lines.
These lines are typically separated by
some combination of the characters
carriage-return (#xD) and line-feed
(#xA).
This means that in your XSLT you can experiment with some combination of
and 
. Remember that different operating systems have different line-ending strategies.
It's not completely clear what you are trying to accomplish but...
Any whitespace that you absolutely want to show up in the output stream I would wrap in <xsl:text></xsl:text>
I would also highly recommend specifying an <xsl:output/> to control the output formatting.
Your question sounds like you want to control the format of the output XML. My advice: just don't.
XML is data, not text. The format it is in should be completely irrelevant to your application. If it is not, then your application needs some reworking.
Within non-empty text nodes, XML will retain line breaks by definition. Within attribute nodes they are retained as well, unless the product you use does not adhere to the spec.
But outside of text nodes (or in those empty text nodes between elements) line breaks are considered irrelevant white space and you should not rely on them or waste your time trying to create or retain them.
There is <xsl:output indent="yes" />, which does some (XSLT processor-specific) pretty-printing, but your application should not rely on such things.
Have you tried the preserve white space tag?

Strip WordML from a string

I've been tasked with build an accessible RSS feed for my company's job listings. I already have an RSS feed from our recruiting partner; so I'm transforming their RSS XML to our own proxy RSS feed to add additional data as well limit the number of items in the feed so we list on the latest jobs.
The RSS validates via feedvalidator.org (with warnings); but the problem is this. Unfortunately, no matter how many times I tell them not to; my company's HR team directly copies and pastes their Word documents into our Recruiting partners CMS when inserting new job listings, leaving WordML in my feed. I believe this WordML is causing issues with Feedburner's BrowserFriendly feature; which we want to show up to make it easier for people to subscribe. Therefore, I need to remove the WordML markup in the feed.
Anybody have experience doing this? Can anyone point me to a good solution to this problem?
Preferably; I'd like to be pointed to a solution in .Net (VB or C# is fine) and/or XSL.
Any advice on this is greatly appreciated.
Thanks.
I haven't yet worked with WordML, but assuming that its elements are in a different namespace from RSS, it should be quite simple to do with XSLT.
Start with a basic identity transform (a stylesheet that add all nodes from the input doc "as is" to the output tree). You need these two templates:
<!-- Copy all elements, and recur on their child nodes. -->
<xsl:template match="*">
<xsl:copy>
<xsl:apply-templates select="#*"/>
<xsl:apply-templates/>
</xsl:copy>
</xsl:template>
<!-- Copy all non-element nodes. -->
<xsl:template match="#*|text()|comment()|processing-instruction()">
<xsl:copy/>
</xsl:template>
A transformation using a stylesheet containing just the above two templates would exactly reproduce its input document on output, modulo those things that standards-compliant XML processors are permitted to change, such as entity replacement.
Now, add in a template that matches any element in the WordML namespace. Let's give it the namespace prefix 'wml' for the purposes of this example:
<!-- Do not copy WordML elements or their attributes to the
output tree; just recur on child nodes. -->
<xsl:template match="wml:*">
<xsl:apply-templates/>
</xsl:template>
The beginning and end of the stylesheet are left as an exercise for the coder.
Jeff Attwood blogged about how to do this a while ago. His post contains some c# code that will clean the WordML.
http://www.codinghorror.com/blog/archives/000485.html
I would do something like this:
char[] charToRemove = { (char)8217, (char)8216, (char)8220, (char)8221, (char)8211 };
char[] charToAdd = { (char)39, (char)39, (char)34, (char)34, '-' };
string cleanedStr = "Your WordML filled Feed Text.";
for (int i = 0; i < charToRemove.Length; i++)
{
cleanedStr = cleanedStr.Replace(charToRemove.GetValue(i).ToString(), charToAdd.GetValue(i).ToString());
}
This would look for the characters in reference, (Which are the Word special characters that mess up everything and replaces them with their ASCII equivelents.

Categories

Resources