Cleaning strings to be valid JSON values - c#

I want to clean strings that are retrieved from a database.
I ran into this issue where a property value (a name from a database) had an embedded TAB character, and Chrome gave me an invalid TOKEN error while trying to load the JSON object.
So now, I went to http://www.json.org/ and on the side it has a specification. But I'm having trouble understanding how to write a cleanser using this spec:
string
    ""
    " chars "
chars
    char
    char chars
char
    any-Unicode-character-
        except-"-or-\-or-
        control-character
    \"
    \\
    \/
    \b
    \f
    \n
    \r
    \t
    \u four-hex-digits
Given a string, how can I "clean" it such that I conform to this spec?
Specifically, I am confused: does the spec allow TAB (0x09) characters? If so, why did Chrome give an invalid TOKEN error?

Tab characters (actual 0x09, not escapes) cannot appear inside of quotes in JSON (though they are valid whitespace outside of quotes). You'll need to escape them with \t or \u0009 (the former being preferable).
json.org says an unescaped character of a string must be:
Any UNICODE character except " or \ or control character
Tab counts as a control character.
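As a minimal sketch of the "cleanser" the question asks for, the rules quoted above can be applied character by character. The class and method names here (JsonStringCleaner, Escape) are hypothetical and not part of any library; in real code a JSON serializer would normally do this for you.

```csharp
using System.Globalization;
using System.Text;

// Hypothetical helper, shown only to illustrate the json.org string rules.
public static class JsonStringCleaner
{
    // Escapes a raw string so it can be placed between double quotes in JSON.
    public static string Escape(string value)
    {
        if (value == null) return null;

        var sb = new StringBuilder(value.Length + 8);
        foreach (char c in value)
        {
            switch (c)
            {
                case '"':  sb.Append("\\\""); break;
                case '\\': sb.Append("\\\\"); break;
                case '\b': sb.Append("\\b"); break;
                case '\f': sb.Append("\\f"); break;
                case '\n': sb.Append("\\n"); break;
                case '\r': sb.Append("\\r"); break;
                case '\t': sb.Append("\\t"); break;
                default:
                    if (c < ' ')
                    {
                        // Remaining control characters get the \uXXXX form.
                        sb.Append("\\u")
                          .Append(((int)c).ToString("x4", CultureInfo.InvariantCulture));
                    }
                    else
                    {
                        sb.Append(c);
                    }
                    break;
            }
        }
        return sb.ToString();
    }
}
```

With this sketch, a name containing an embedded tab comes out with the two characters \ and t in its place, which a JSON parser accepts.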

This may be what you are looking for; it shows how to use the JavaScriptSerializer class in C#:
How to create JSON String in C#

Related

how to replace doubleslash in form app in c# [duplicate]

I've noticed that C# adds additional slashes (\) to paths. Consider the path C:\Test. When I inspect the string with this path in the text visualiser, the actual string is C:\\Test.
Why is this? It confuses me, as sometimes I may want to split the path up (using string.Split()), but have to wonder which string to use (one or two slashes).
The \\ is used because \ is an escape character, so two of them are needed to represent a single \.
It tells the compiler to treat the first \ as an escape character and the second \ as the actual value. Otherwise, the character after the first \ would be parsed as part of an escape sequence.
Here is a list of available escape characters:
\' - single quote, needed for character literals
\" - double quote, needed for string literals
\\ - backslash
\0 - Null
\a - Alert
\b - Backspace
\f - Form feed
\n - New line
\r - Carriage return
\t - Horizontal tab
\v - Vertical tab
\u - Unicode escape sequence for character
\U - Unicode escape sequence for surrogate pairs.
\x - Unicode escape sequence similar to "\u" except with variable length.
EDIT: To answer your question regarding Split, it should be no issue. Use Split as you would normally. The \\ will be treated as only the one character of \.
.NET is not adding anything to your string here. What you're seeing is an effect of how the debugger chooses to display strings. C# strings can be represented in 2 forms:
Verbatim Strings: Prefixed with an @ sign, which removes the need to escape \ characters
Normal Strings: Standard C style strings where \ characters need to escape themselves
The debugger will display a string literal as a normal string rather than a verbatim string. It's just an issue of display, though; it doesn't affect its underlying value.
Debugger visualizers display strings in the form in which they would appear in C# code. Since \ is used to escape characters in non-verbatim C# strings, \\ is the correct escaped form.
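A small sketch of the point made in the answers above: the regular and verbatim literals produce the same string, and Split only ever sees a single backslash. The PathEscapeDemo class and its member names are made up for illustration.

```csharp
// Hypothetical demo class; the names are illustrative only.
public static class PathEscapeDemo
{
    // Regular literal: \\ is an escape sequence for one backslash.
    public static readonly string Regular = "C:\\Test";

    // Verbatim literal: the backslash is taken as written.
    public static readonly string Verbatim = @"C:\Test";

    // Splits on the single backslash character the string really contains.
    public static string[] SplitPath(string path)
    {
        return path.Split('\\');
    }
}
```

Both fields hold the same seven characters, and SplitPath(PathEscapeDemo.Regular) yields the two parts "C:" and "Test".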
Okay, so the answers above are not wholly correct. As such I am adding my findings for the next person who reads this post.
You cannot split a string using any of the chars in the table above if you are reading said string(s) from an external source.
i.e.,
string[] splitStrings = File.ReadAllText([path]).Split((char)7);
will not split by those chars. However internally created strings work fine.
i.e.,
string[] splitStrings = "hello\agoodbye".Split((char)7);
This may not hold true for other methods of reading text from a file. I am unsure as I have not tested with other methods. With that in mind, it is probably best not to use those chars for delimiting strings!
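One possible explanation for the observation above (an assumption, since the contents of the file are not shown) is that the file held the two characters \ and a rather than the single alert character. Escape sequences are translated by the C# compiler in string literals only; text read at run time is never unescaped. The sketch below, with a hypothetical EscapeSourceDemo helper, writes a string to a temporary file and counts the parts produced by splitting on (char)7.

```csharp
using System.IO;

// Hypothetical helper used only to illustrate the behaviour.
public static class EscapeSourceDemo
{
    // Round-trips the text through a temporary file, then splits on BEL (char 7).
    public static int CountPartsFromFile(string contents)
    {
        string path = Path.GetTempFileName();
        try
        {
            File.WriteAllText(path, contents);
            return File.ReadAllText(path).Split((char)7).Length;
        }
        finally
        {
            File.Delete(path);
        }
    }
}
```

Calling it with "hello\agoodbye" (a real alert character) gives two parts, while @"hello\agoodbye" (a backslash followed by the letter a) gives one, which matches the behaviour described if the file contained the latter.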

C# Troubles reading xml value [duplicate]


OpenXML escaping illegal characters

I am doing some string replacement within a Word Docx file using OpenXML Power Tools and it is working as expected. However things break when I have invalid characters in the substitution such as ampersand, so for instance "Harry & Sally" will break and produce an invalid document. According to this post illegal characters need to be converted to xHHHH.
I am having trouble finding the contents to the OOXML clause mentioned in the post and hence escaping characters appropriately.
I am hoping someone either has some code or insights into exactly what characters need to be escaped. I was also hopeful OpenXML Power Tools could do this for me in some way, but I cannot seem to find anything in there either.
The specification is just talking about the standard set of characters that have to be escaped in XML. The XML specification mentioned in the linked post is the one from the W3C, found here.
There are five characters that need to be escaped anywhere they appear in XML data (names, values, etc) unless they are part of a CDATA section. According to Section 2.4:
The ampersand character (&) and the left angle bracket (<) must not appear in their literal form, except when used as markup delimiters, or within a comment, a processing instruction, or a CDATA section. If they are needed elsewhere, they must be escaped using either numeric character references or the strings " &amp; " and " &lt; " respectively. The right angle bracket (>) may be represented using the string " &gt; ", and must, for compatibility, be escaped using either " &gt; " or a character reference when it appears in the string " ]]> " in content, when that string is not marking the end of a CDATA section.
To allow attribute values to contain both single and double quotes, the apostrophe or single-quote character (') may be represented as " &apos; ", and the double-quote character (") as " &quot; ".
In other words, escape the following characters:
' -> &apos;
" -> &quot;
> -> &gt;
< -> &lt;
& -> &amp;
Typically, you wouldn't encode these as xHHHH, you'd use the XML entities listed above, but either is allowed. You also don't need to encode quotes or the right-angle bracket in every case, only when they would otherwise represent XML syntax, but it's usually safer to do it all the time.
The XML specification also includes the list of every Unicode character that can appear in an XML document, in section 2.2:
Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
That list includes basically every Unicode character in the Basic plane (every one you're likely to run into), except for the control characters. Only the tab, CR, and LF characters are allowed -- any other character below ASCII 32 (space) needs to be escaped.
The big gap in the list (0xD800-0xDFFF) is for surrogate encoding values, which shouldn't appear by themselves anyway, as they're not valid characters. The last two, 0xFFFE and 0xFFFF, are also not valid characters.
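As a rough sketch, the Char production quoted above can be turned into a filter that drops anything outside the allowed ranges. The XmlCharFilter class and its method names are hypothetical, and dropping the characters (rather than encoding them some other way) is an assumption made for the example.

```csharp
using System.Text;

// Hypothetical helper that mirrors the XML 1.0 Char production.
public static class XmlCharFilter
{
    // True if the code point is allowed by the Char production.
    public static bool IsValidXmlCodePoint(int cp)
    {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Removes characters that cannot appear in an XML 1.0 document.
    public static string RemoveInvalidXmlChars(string input)
    {
        if (string.IsNullOrEmpty(input)) return input;

        var sb = new StringBuilder(input.Length);
        for (int i = 0; i < input.Length; i++)
        {
            char c = input[i];
            if (char.IsHighSurrogate(c) && i + 1 < input.Length
                && char.IsLowSurrogate(input[i + 1]))
            {
                // A well-formed surrogate pair encodes a code point >= 0x10000.
                sb.Append(c).Append(input[i + 1]);
                i++;
            }
            else if (IsValidXmlCodePoint(c))
            {
                // Lone surrogates fall in the gap and are dropped here.
                sb.Append(c);
            }
        }
        return sb.ToString();
    }
}
```

The filter only removes characters; the five markup characters listed earlier still need to be escaped separately.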
I created an extension method with help from Michael Edenfield's answer. Pretty self explanatory... just make sure you replace the ampersands first! Otherwise you will end up replacing your other escaped symbols by mistake.
public static string EscapeXmlCharacters(this string input)
{
    switch (input)
    {
        case null: return null;
        case "": return "";
        default:
        {
            // Replace ampersands first so the entities added below are not escaped again.
            input = input.Replace("&", "&amp;")
                         .Replace("'", "&apos;")
                         .Replace("\"", "&quot;")
                         .Replace(">", "&gt;")
                         .Replace("<", "&lt;");
            return input;
        }
    }
}
.NET Fiddle: https://dotnetfiddle.net/PCqffy

'\\' characters are being treated as one character when using the length property in c#

I am trying to get the length of a string that has \\ values.
e.g. "C:\\Dir1\\Dir2\\Dir3\\Dir4\\flower.bmp"
The length of the example is 38 characters.
When I use the Length property the length is 33; basically it is treating \\ as one character.
I have tried using StringInfo.LengthInTextElements and various other ways to try and get this working but with no joy.
Since the character \ is used to escape characters in a string, \\ actually represents the \ character literally.
Try a verbatim string if you want \\ to be treated as two characters:
@"C:\\Dir1\\Dir2\\Dir3\\Dir4\\flower.bmp"
MSDN Reference
My gut says you have a more fundamental problem, but have you tried wrapping it as a verbatim string literal?
string myString = @"C:\\Dir1\\Dir2\\Dir3\\Dir4\\flower.bmp";
It is one char. If you want it to be 2 chars, either use @ at the beginning or maybe write \\ twice (haven't tried; checking now).
That's because \\ in a C# string is known as an escape sequence. Your string in code:
"C:\\Dir1\\Dir2\\Dir3\\Dir4\\flower.bmp"
becomes this string on disk and in memory when the program is loaded.
"C:\Dir1\Dir2\Dir3\Dir4\flower.bmp"
So, the length of your example really is 33 characters. The original string, while it may be 38 characters in code, only represents 33 real characters.
33 is correct - \\ is indeed only one character, namely \. It's only the debugger that shows it escaped (\ has a special meaning for \n or \r, line feed and carriage return, respectively, for example).
The backslash \ is an escape character used to put special characters in your string, like \t for a tab and \n for a newline. A double backslash \\ will insert one backslash into the compiled string instead of the 2 you expected. The answer is either to put the C# @ prefix in front of your string, which prevents escaping, or to escape all of your backslashes, which would look like "C:\\\\Dir1\\\\Dir2\\\\Dir3\\\\Dir4\\\\flower.bmp"
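The two lengths discussed in this thread can be checked directly. This is only a sketch, and the BackslashLengthDemo class name is made up for the example.

```csharp
// Hypothetical demo class; the names are illustrative only.
public static class BackslashLengthDemo
{
    // Regular literal: each \\ compiles to one backslash, 33 characters in total.
    public static readonly string Escaped =
        "C:\\Dir1\\Dir2\\Dir3\\Dir4\\flower.bmp";

    // Verbatim literal: both backslashes are kept, 38 characters in total.
    public static readonly string Verbatim =
        @"C:\\Dir1\\Dir2\\Dir3\\Dir4\\flower.bmp";
}
```

Escaped.Length is 33 and Verbatim.Length is 38, so the choice of literal form decides which of the two strings you actually have.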

Special character encoding

I am working with JSON to communicate data between two systems. One of the properties in JSON is rich text. Most of the time there are no problems, but once in a blue moon special characters like curly quotes, which are not UTF-8 characters, make it into the rich text.
I want to replace these special characters with their UTF-8 equivalents. How can I achieve this in C Sharp?
Example of this string - “Cops bring lettuce & tomato, dispose of evidence,”. If I create a regular quote it's like this - "
Thanks
The quotes you posted are sometimes called "smart quotes" - “”. They are UTF-8, but are not proper JSON (and most programming language) quotes.
They are the kind of quotes produced from pasting code into Word.
The fix is to replace both characters with quotes that are valid for JSON (that is ").
If these appear in the JSON values, you need to escape them with a \ - so instead of " you will use \".
Also, take a look at this question and its answers - make sure that the server returns the JSON response as UTF-8 and not some other encoding.
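A minimal sketch of the replacement suggested above, assuming the goal is simply to map curly quotes to their straight ASCII equivalents before the text is serialized. The SmartQuoteNormalizer class is hypothetical, and handling the single curly quotes as well is an addition for the example.

```csharp
// Hypothetical helper; not part of any JSON library.
public static class SmartQuoteNormalizer
{
    // Maps curly double and single quotes to their straight ASCII forms.
    public static string ToStraightQuotes(string text)
    {
        if (text == null) return null;

        return text
            .Replace('\u201C', '"')   // left double quotation mark
            .Replace('\u201D', '"')   // right double quotation mark
            .Replace('\u2018', '\'')  // left single quotation mark
            .Replace('\u2019', '\''); // right single quotation mark
    }
}
```

The straight double quotes produced here still have to be escaped as \" when the text is written as a JSON string value; a JSON serializer normally takes care of that step.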
