OpenXML escaping illegal characters

I am doing some string replacement within a Word Docx file using OpenXML Power Tools and it is working as expected. However things break when I have invalid characters in the substitution such as ampersand, so for instance "Harry & Sally" will break and produce an invalid document. According to this post illegal characters need to be converted to xHHHH.
I am having trouble finding the contents to the OOXML clause mentioned in the post and hence escaping characters appropriately.
I am hoping someone either has some code or insights into exactly what characters need to be escaped. I was also hopeful OpenXML Power Tools could do this for me in some way, but I cannot seem to find anything in there either.

The specification is just talking about the standard set of characters that have to be escaped in XML. The XML specification mentioned in the linked post is the one from the W3C, found here.
There are five characters that need to be escaped anywhere they appear in XML data (names, values, etc) unless they are part of a CDATA section. According to Section 2.4:
The ampersand character (&) and the left angle bracket (<) must not appear in their literal form, except when used as markup delimiters, or within a comment, a processing instruction, or a CDATA section. If they are needed elsewhere, they must be escaped using either numeric character references or the strings "&amp;" and "&lt;" respectively. The right angle bracket (>) may be represented using the string "&gt;", and must, for compatibility, be escaped using either "&gt;" or a character reference when it appears in the string "]]>" in content, when that string is not marking the end of a CDATA section.
To allow attribute values to contain both single and double quotes, the apostrophe or single-quote character (') may be represented as "&apos;", and the double-quote character (") as "&quot;".
In other words, escape the following characters:
' -> &apos;
" -> &quot;
> -> &gt;
< -> &lt;
& -> &amp;
Typically, you wouldn't encode these as xHHHH, you'd use the XML entities listed above, but either is allowed. You also don't need to encode quotes or the right-angle bracket in every case, only when they would otherwise represent XML syntax, but it's usually safer to do it all the time.
The XML specification also includes the list of every Unicode character that can appear in an XML document, in section 2.2:
Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
That list includes basically every Unicode character in the Basic plane (every one you're likely to run into), except for the control characters. Only the tab, CR, and LF characters are allowed. Any other character below ASCII 32 (space) cannot appear in an XML 1.0 document at all, not even as a numeric character reference, so it has to be removed or encoded by some other convention.
The big gap in the list (0xD800-0xDFFF) is for surrogate encoding values, which shouldn't appear by themselves anyway, as they're not valid characters. The last two, 0xFFFE and 0xFFFF, are also not valid characters.
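As a rough illustration of the Char rule quoted above, the following C# sketch tests code points against those ranges and drops anything outside them. The class and method names (XmlCharFilter, IsLegalXmlCodePoint, RemoveIllegalXmlChars) are made up for this example, and silently removing characters is only one possible policy.

```csharp
using System;
using System.Text;

public static class XmlCharFilter
{
    // True when the code point falls inside the ranges of the XML 1.0 Char production.
    public static bool IsLegalXmlCodePoint(int cp) =>
        cp == 0x9 || cp == 0xA || cp == 0xD ||
        (cp >= 0x20 && cp <= 0xD7FF) ||
        (cp >= 0xE000 && cp <= 0xFFFD) ||
        (cp >= 0x10000 && cp <= 0x10FFFF);

    // Hypothetical helper: removes every character that XML 1.0 does not allow.
    public static string RemoveIllegalXmlChars(string input)
    {
        if (string.IsNullOrEmpty(input)) return input;

        var sb = new StringBuilder(input.Length);
        for (int i = 0; i < input.Length; i++)
        {
            if (char.IsSurrogatePair(input, i))
            {
                // A well-formed surrogate pair always encodes 0x10000-0x10FFFF, which is legal.
                sb.Append(input[i]).Append(input[i + 1]);
                i++; // skip the low surrogate that was just copied
            }
            else if (IsLegalXmlCodePoint(input[i]))
            {
                // Lone surrogates fail this test (0xD800-0xDFFF is not in the list) and are dropped.
                sb.Append(input[i]);
            }
        }
        return sb.ToString();
    }
}
```

Run this filter before the entity escaping, since it deals with characters that no entity can represent.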

I created an extension method with help from Michael Edenfield's answer. Pretty self explanatory... just make sure you replace the ampersands first! Otherwise you will end up replacing your other escaped symbols by mistake.
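public static class StringExtensions
{
    public static string EscapeXmlCharacters(this string input)
    {
        // Null and empty strings have nothing to escape.
        if (string.IsNullOrEmpty(input))
        {
            return input;
        }

        // Replace the ampersand first so the entities produced below are not escaped again.
        return input.Replace("&", "&amp;")
                    .Replace("'", "&apos;")
                    .Replace("\"", "&quot;")
                    .Replace(">", "&gt;")
                    .Replace("<", "&lt;");
    }
}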
.NET Fiddle: https://dotnetfiddle.net/PCqffy
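For comparison, .NET also ships System.Security.SecurityElement.Escape, which replaces the same five characters. The sketch below wraps it and adds a deliberately wrong helper to show why the ampersand has to be handled first. The class and method names are invented for this example, and whether the built-in method suits an OpenXML Power Tools workflow is an assumption you would need to verify.

```csharp
using System.Security;

public static class XmlEscapeOrderDemo
{
    // Built-in escaping of <, >, ", ' and & (returns null for null input).
    public static string EscapeBuiltIn(string text) => SecurityElement.Escape(text);

    // Deliberately wrong order: the '&' introduced by "&lt;" is escaped a second time.
    public static string EscapeAmpersandLast(string text) =>
        text.Replace("<", "&lt;").Replace("&", "&amp;");
}
```

With these helpers, EscapeBuiltIn("Harry & Sally") returns "Harry &amp; Sally", while EscapeAmpersandLast("a < b") returns the double-escaped "a &amp;lt; b".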

Related

Why do I need to escape < and & when rendering an attribute?

I was reading the documentation for HtmlAttributeEncode, which as I understand it is intended for use when rendering HTML that appears within double quotes as an attribute, e.g.
<INPUT Value="This value must be escaped so that it doesn't contain any quotes">
As far as I can tell, the only character I would need to escape would be the double quote. The browser ought to be able to figure out everything else in that string belongs within the attribute.
Why, then, does the documentation say this?
The HtmlAttributeEncode method converts only quotation marks ("), ampersands (&), and left angle brackets (<) to equivalent character entities. It is considerably faster than the HtmlEncode method.
And in fact it does escape those, as can be seen by this poor guy.
Is there any reason to escape the < and & characters in this circumstance? Is it required by the HTML5 specification?
With my human eye I can easily see where the delimitation begins and ends in this character sequence:
<INPUT value="You & I can both easily see that 5 < 6!">
As long as the double quote sequence is properly closed (and double quotes are escaped) I don't understand why the other characters have to be HTML-encoded.

From the specs:
3.2.3.1 Attributes
Except where otherwise specified, attributes on HTML elements may have any string value, including the empty string. Except where explicitly stated, there is no restriction on what text can be specified in such attributes.
According to the HTML 4 specification, the content of the value attribute is of type CDATA.
From the HTML Document Representation:
5.3.2 Character entity references
Four character entity references deserve special mention since they are frequently used to escape special characters:
"&lt;" represents the < sign.
"&gt;" represents the > sign.
"&amp;" represents the & sign.
"&quot;" represents the " mark.
Authors wishing to put the "<" character in text should use "&lt;" (ASCII decimal 60) to avoid possible confusion with the beginning of a tag (start tag open delimiter). Similarly, authors should use "&gt;" (ASCII decimal 62) in text instead of ">" to avoid problems with older user agents that incorrectly perceive this as the end of a tag (tag close delimiter) when it appears in quoted attribute values.
Authors should use "&amp;" (ASCII decimal 38) instead of "&" to avoid confusion with the beginning of a character reference (entity reference open delimiter). Authors should also use "&amp;" in attribute values since character references are allowed within CDATA attribute values.
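The last point, that character references are allowed inside attribute values, is the practical reason an unescaped & is risky. The sketch below uses System.Net.WebUtility.HtmlDecode as a stand-in for the decoding a browser applies to an attribute value. That stand-in is an assumption made for illustration (a real HTML parser has more rules), and the class and method names are invented.

```csharp
using System.Net;

public static class AttributeEncodingDemo
{
    // Approximates what a browser does with a raw attribute value:
    // character references inside it are decoded.
    public static string AsSeenByBrowser(string rawAttributeValue) =>
        WebUtility.HtmlDecode(rawAttributeValue);

    // Encoding before output means the decoding step restores the original text.
    public static string RoundTrip(string text) =>
        WebUtility.HtmlDecode(WebUtility.HtmlEncode(text));
}
```

If the text to display is the literal string Type &lt; for less-than, writing it unescaped makes AsSeenByBrowser return "Type < for less-than", which is not what was intended. RoundTrip returns the original string unchanged because the & was encoded first.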

C# Troubles reading xml value [duplicate]

I've noticed that C# adds additional slashes (\) to paths. Consider the path C:\Test. When I inspect the string with this path in the text visualiser, the actual string is C:\\Test.
Why is this? It confuses me, as sometimes I may want to split the path up (using string.Split()), but have to wonder which string to use (one or two slashes).

The \\ is used because \ is the escape character, so a second backslash is needed to represent a single literal \.
In other words, the first \ is treated as the escape character and the second \ is taken as the actual value. Otherwise, the character after the first \ would be parsed as part of an escape sequence.
Here is a list of available escape characters:
\' - single quote, needed for character literals
\" - double quote, needed for string literals
\\ - backslash
\0 - Null
\a - Alert
\b - Backspace
\f - Form feed
\n - New line
\r - Carriage return
\t - Horizontal tab
\v - Vertical tab
\u - Unicode escape sequence for character
\U - Unicode escape sequence for surrogate pairs.
\x - Unicode escape sequence similar to "\u" except with variable length.
EDIT: To answer your question regarding Split, it should be no issue. Use Split as you would normally. The \\ will be treated as only the one character of \.

.NET is not adding anything to your string here. What you're seeing is an effect of how the debugger chooses to display strings. C# string literals can be written in two forms:
Verbatim strings: prefixed with an @ sign, which removes the need to escape \ characters
Normal strings: standard C-style strings in which \ characters need to be escaped as \\
The debugger displays a string as a normal string literal rather than a verbatim one. It's just an issue of display, though; it doesn't affect the string's underlying value.

Debugger visualizers display strings in the form in which they would appear in C# code. Since \ is used to escape characters in non-verbatim C# strings, \\ is the correct escaped form.
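A small sketch of the point made in these answers: the regular and the verbatim literal produce the same string, and that string contains a single backslash. The class and member names are invented for the example.

```csharp
public static class BackslashDemo
{
    // Regular literal: the backslash has to be escaped in source code.
    public const string Regular = "C:\\Test";

    // Verbatim literal: the backslash is written as-is.
    public const string Verbatim = @"C:\Test";

    // Splits on the single backslash character that the string actually contains.
    public static string[] SplitPath(string path) => path.Split('\\');
}
```

Both constants are equal and seven characters long, and SplitPath yields the two parts "C:" and "Test" for either one.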

Okay, so the answers above are not wholly correct. As such I am adding my findings for the next person who reads this post.
You cannot split a string using any of the chars in the table above if you are reading said string(s) from an external source.
i.e,
string[] splitStrings = File.ReadAllText([path]).Split((char)7);
will not split by those chars. However internally created strings work fine.
i.e.,
string[] splitStrings = "hello\agoodbye".Split((char)7);
This may not hold true for other methods of reading text from a file. I am unsure as I have not tested with other methods. With that in mind, it is probably best not to use those chars for delimiting strings!
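One likely explanation for this observation, offered as an assumption since the file contents were not shown: escape sequences are translated by the C# compiler only inside source-code literals. A text file in which someone typed a backslash followed by the letter a holds those two characters, not the BEL character (char 7). The sketch below mimics such file content with a verbatim literal; the names are invented for the example.

```csharp
public static class BellSplitDemo
{
    // The compiler turns \a into a single BEL character (char 7).
    public const string CompiledLiteral = "hello\agoodbye";

    // Stand-in for text read from a file where backslash and 'a' were typed literally.
    public const string AsTypedInFile = @"hello\agoodbye";

    // Number of parts obtained when splitting on BEL.
    public static int CountParts(string s) => s.Split((char)7).Length;
}
```

CountParts returns 2 for CompiledLiteral and 1 for AsTypedInFile. A file that really contains the byte 0x07 would be expected to split the same way as the compiled literal.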

what does \ do on non escape characters?

I asked another question poorly so i'll ask something else.
According to http://www.c-point.com/javascript_tutorial/special_characters.htm there are a few escape characters such as \n and \b. However / is not one of them. What happens in this case? (\/) is the \ ignored?
I have a string in javascript 'http:\/\/www.site.com\/user'. Note that this is a literal with ' so with " it would look like \\/. Anyway, I would like to escape this string, thus the question on what happens on non 'special' escape characters.
And another question: if I had name:\t me (or "name:\\t me"), is there a function to escape it so there is a tab? I am using C# and these strings come from a JSON file.

According to Mozilla:
For characters not listed [...] a preceding backslash is ignored, but this usage is deprecated and should be avoided.
https://developer.mozilla.org/en/JavaScript/Guide/Values%2c_Variables%2c_and_Literals#section_19
The \/ sequence is not listed but there're at least two common usages:
<1> It's required to escape literal slashes in regular expressions that use the /foo/ syntax:
var re = /^http:\/\//;
<2> It's required to avoid invalid HTML when you embed JavaScript code inside HTML:
<script type="text/javascript"><!--
alert('</p>')
//--></script>
... triggers: end tag for element "P" which is not open
<script type="text/javascript"><!--
alert('<\/p>')
//--></script>
... doesn't.

If a backslash is found before a character which is not meaningful as an escape sequence, it will be ignored, i.e. "\/" and "/" are the same string in Javascript.
The / character is the regular expression delimiter, so it only has to be escaped in a regex context:
/[a-z]/[0-9]/ // Invalid.
/[a-z]\/[0-9]/ // Matches a lowercase letter, followed by a slash,
// followed by a digit.
Finally, if you want to collapse a backslash followed by a character into the corresponding escape sequence, you'll have to replace the whole expression:
string expr = "name:\\t me"; // Backslash followed by `t`.
expr = expr.Replace("\\t", "\t"); // Tab character.

\ is evaluated as \ if \ + next character is not an escape sequence.
examples:
\t -> escape sequence t -> tab
\\t -> escape \ and t -> \t
\\ -> escape sequence \ -> \
\c -> \c (not an escape sequence)
\a -> escape sequence a -> ???
Note that there are escape sequences also on completely weird symbols, so be careful. IMHO there is no good standard between languages and operating systems.
And actually, it's even more non-standard: in basic C, '\y' -> y + warning, not \y. So this is very language dependent; be careful. (Disregard my comment below.)
br,
Juha
edit: What language are you using? Java and C have slightly different behavior.
C and java seem to have the same escapes and python has different:
http://en.csharp-online.net/CSharp_FAQ:_What_are_the_CSharp_character_escape_sequences
http://www.cerritos.edu/jwilson/cis_182/language_resources/java_escape_sequences.htm
http://www.java2s.com/Code/Python/String/EscapeCodesbtnar.htm

In C# you can use the backslash character to tell the compiler what you really want. After compiling though, these escape characters do not exist.
If you use string myString = "\t"; the string will actually contain a TAB character, not just represent one. You can test this by checking myString.Length which is 1.
If you want to send the characters "backslash" and "t" to your JSON client however, you'll have to tell the compiler to keep his hands off the backslash, by escaping the backslash:
string myString = "\\t"; will result in a string of two characters, the "backslash" and the "t".
Things get messy if you have to cross multiple layers of escaping and unescaping, try to debug through these layers to see what's really happening under the hood.
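Since the strings in the question come from a JSON file, one way to peel off the JSON layer is to let a JSON parser do the unescaping. The sketch below assumes System.Text.Json is available (it ships with .NET Core 3.0 and later); the class and method names are invented for the example.

```csharp
using System.Text.Json;

public static class JsonEscapeDemo
{
    // Takes a complete JSON string literal, including its surrounding quotes,
    // and returns the decoded text with all JSON escapes resolved.
    public static string DecodeJsonString(string jsonLiteral) =>
        JsonSerializer.Deserialize<string>(jsonLiteral);
}
```

Passing the C# literal "\"name:\\t me\"" (the JSON text "name:\t me") yields a string containing a real tab, and "\"http:\\/\\/www.site.com\\/user\"" yields http://www.site.com/user, because \/ is a valid JSON escape for a plain slash.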

What are the best characters to use for placeholders in file names (on Windows) AND URLs?

I am writing an application in C# that will need to find placeholders in URLs and/or filenames, and substitute in a value, much like this: C:\files\file{number} => C:\files\file1 Unfortunately for that example, curly braces are allowed in file names and URLs.
Can anyone please suggest some characters that I can use to denote placeholders in files and URLs? Thank you!

Windows rather helpfully tells you what characters aren't allowed in a filename when you try to use one of them:
A filename cannot contain any of the following characters:
\ / : * ? " < > |
See this support article for more information, including the list of allowed characters.
Characters that are valid for naming
files, folders, or shortcuts include
any combination of letters (A-Z) and
numbers (0-9), plus the following
special characters:
^ Accent circumflex (caret)
& Ampersand
' Apostrophe (single quotation mark)
@ At sign
{ Brace left
} Brace right
[ Bracket opening
] Bracket closing
, Comma
$ Dollar sign
= Equal sign
! Exclamation point
- Hyphen
# Number sign
( Parenthesis opening
) Parenthesis closing
% Percent
. Period
+ Plus
~ Tilde
_ Underscore
As for URLs, see section 2.2 of RFC 1738 for a description of allowed characters:
Thus, only alphanumerics, the
special characters "$-_.+!*'(),", and
reserved characters used for their reserved purposes may be used unencoded within a URL.
...also of interest, from the same section:
Characters can be unsafe for a number
of reasons. The space character is
unsafe because significant spaces may
disappear and insignificant spaces may
be introduced when URLs are
transcribed or typeset or subjected to
the treatment of word-processing
programs. The characters "<" and ">"
are unsafe because they are used as
the delimiters around URLs in free
text; the quote mark (""") is used to
delimit URLs in some systems. The
character "#" is unsafe and should
always be encoded because it is used
in World Wide Web and in other systems
to delimit a URL from a
fragment/anchor identifier that might
follow it. The character "%" is
unsafe because it is used for
encodings of other characters. Other
characters are unsafe because gateways
and other transport agents are known
to sometimes modify such characters.
These characters are "{", "}", "|",
"\", "^", "~", "[", "]", and "`".
All unsafe characters must always be encoded within a URL.
It looks like the double-quote and angle bracket characters ("<>) are good options.
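A minimal sketch of substitution with angle brackets as the placeholder delimiters, following the suggestion above. The placeholder syntax <name>, the class name, and the Fill method are assumptions made for this example, not an established convention.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class PlaceholderDemo
{
    // Replaces each <name> token with its value from the dictionary.
    // Tokens without a matching entry are left untouched.
    public static string Fill(string template, IDictionary<string, string> values) =>
        Regex.Replace(template, "<([^<>]+)>", m =>
            values.TryGetValue(m.Groups[1].Value, out var v) ? v : m.Value);
}
```

For instance, Fill(@"C:\files\file<number>", new Dictionary<string, string> { ["number"] = "1" }) returns C:\files\file1. Because < and > cannot occur in a Windows file name, a literal part of the path can never be mistaken for a placeholder.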

Cleaning strings to be valid JSON values

I want to clean strings that are retrieved from a database.
I ran into this issue where a property value (a name from a database) had an embedded TAB character, and Chrome gave me an invalid TOKEN error while trying to load the JSON object.
So now, I went to http://www.json.org/ and on the side it has a specification. But I'm having trouble understanding how to write a cleanser using this spec:
string
    ""
    " chars "
chars
    char
    char chars
char
    any-Unicode-character-except-"-or-\-or-control-character
    \"
    \\
    \/
    \b
    \f
    \n
    \r
    \t
    \u four-hex-digits
Given a string, how can I "clean" it such that I conform to this spec?
Specifically, I am confused: does the spec allow TAB (0x09) characters? If so, why did Chrome give an invalid TOKEN error?

Tab characters (actual 0x09, not escapes) cannot appear inside of quotes in JSON (though they are valid whitespace outside of quotes). You'll need to escape them with \t or \u0009 (the former being preferable).
json.org says an unescaped character of a string must be:
Any UNICODE character except " or \ or
control character
Tab counts as a control character.
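To make the rule concrete, here is a hand-written sketch of a cleanser that follows the grammar from json.org: quotes and backslashes get a backslash escape, the common control characters use their short forms, and any other character below 0x20 is written as \u plus four hex digits. The class and method names are invented; in real code a JSON serializer would normally do this for you.

```csharp
using System.Text;

public static class JsonStringCleaner
{
    // Escapes the content of a JSON string value (without the surrounding quotes).
    public static string Escape(string s)
    {
        var sb = new StringBuilder(s.Length);
        foreach (char c in s)
        {
            switch (c)
            {
                case '"':  sb.Append("\\\""); break;
                case '\\': sb.Append("\\\\"); break;
                case '\b': sb.Append("\\b"); break;
                case '\f': sb.Append("\\f"); break;
                case '\n': sb.Append("\\n"); break;
                case '\r': sb.Append("\\r"); break;
                case '\t': sb.Append("\\t"); break;
                default:
                    if (c < 0x20)
                        sb.Append("\\u").Append(((int)c).ToString("x4")); // other control characters
                    else
                        sb.Append(c); // everything else may appear unescaped
                    break;
            }
        }
        return sb.ToString();
    }
}
```

A name with an embedded tab, such as "John\tSmith" in C#, comes out as the ten characters John\tSmith plus the visible backslash and t, which Chrome accepts inside a quoted JSON value.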

This may be what you are looking for. It shows how to use the JavaScriptSerializer class in C#:
How to create JSON String in C#
