I am working with JSON to communicate data between two systems. One of the properties in the JSON is rich text. Most of the time there are no problems, but once in a blue moon special characters such as curly quotes, which are not UTF-8 characters, make it into the rich text.
I want to replace these special characters with their UTF-8 equivalents. How can I achieve this in C#?
Example of this string - “Cops bring lettuce & tomato, dispose of evidence,”. If I create a regular quote it's like this - "
Thanks
The quotes you posted are sometimes called "smart quotes" - “”. They are valid Unicode characters and can be encoded in UTF-8, but they are not the quote character that JSON (and most programming languages) uses to delimit strings.
They are the kind of quotes produced from pasting code into Word.
The fix is to replace both characters with quotes that are valid for JSON (that is, ").
If these appear in the JSON values, you need to escape them with a \ - so instead of " you will use \".
Also, take a look at this question and its answers - make sure that the server returns the JSON response as UTF-8 and not some other encoding.
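A minimal sketch of the replacement step described above, assuming only the four common curly quotation marks need handling. The class and method names (SmartQuotes, Straighten) are made up for illustration and are not part of any library:

```csharp
using System;

static class SmartQuotes
{
    // Replaces the four common curly quotation marks with their ASCII counterparts.
    public static string Straighten(string text)
    {
        if (text == null) throw new ArgumentNullException(nameof(text));
        return text
            .Replace('\u201C', '"')   // left double quotation mark
            .Replace('\u201D', '"')   // right double quotation mark
            .Replace('\u2018', '\'')  // left single quotation mark
            .Replace('\u2019', '\''); // right single quotation mark
    }
}
```

If the result then goes into a JSON value, it is probably safest to let a JSON serializer add the \" escapes rather than building the JSON text by hand.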
I've noticed that C# adds additional slashes (\) to paths. Consider the path C:\Test. When I inspect the string with this path in the text visualiser, the actual string is C:\\Test.
Why is this? It confuses me, as sometimes I may want to split the path up (using string.Split()), but have to wonder which string to use (one or two slashes).
The \\ is used because \ is the escape character, so two of them are needed to represent a single \.
In other words, the first \ is treated as the escape character and the second \ is taken as the actual value. Otherwise, the character following the first \ would be parsed as part of an escape sequence.
Here is a list of available escape characters:
\' - single quote, needed for character literals
\" - double quote, needed for string literals
\\ - backslash
\0 – Null
\a - Alert
\b - Backspace
\f - Form feed
\n - New line
\r - Carriage return
\t - Horizontal tab
\v - Vertical tab
\u - Unicode escape sequence for character
\U - Unicode escape sequence for surrogate pairs.
\x - Unicode escape sequence similar to "\u" except with variable length.
EDIT: To answer your question regarding Split, it should be no issue. Use Split as you would normally. The \\ will be treated as only the one character of \.
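As a small illustration of the point about Split, here is a sketch showing that the doubled backslash exists only in the source code. The class and member names (PathDemo, Escaped, Verbatim, Parts) are hypothetical:

```csharp
using System;

static class PathDemo
{
    // "C:\\Test" in source code is the seven-character string C:\Test.
    public static readonly string Escaped = "C:\\Test";

    // The verbatim form spells the same string without doubling the backslash.
    public static readonly string Verbatim = @"C:\Test";

    // Splits on the single backslash character.
    public static string[] Parts(string path) => path.Split('\\');
}
```

Both fields hold the same value, and splitting either one on '\\' yields the two parts C: and Test.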
.NET is not adding anything to your string here. What you're seeing is an effect of how the debugger chooses to display strings. C# string literals can be written in two forms:
Verbatim strings: prefixed with an @ sign, which removes the need to escape \ characters
Normal strings: standard C-style strings, where a \ character must be escaped as \\
The debugger displays a string as a normal string rather than as a verbatim string. It's only a display issue, though; it doesn't affect the underlying value.
Debugger visualizers display strings in the form in which they would appear in C# code. Since \ is used to escape characters in non-verbatim C# strings, \\ is the correct escaped form.
Okay, so the answers above are not wholly correct. As such I am adding my findings for the next person who reads this post.
You cannot split a string using any of the chars in the table above if you are reading said string(s) from an external source.
i.e,
string[] splitStrings = File.ReadAllText([path]).Split((char)7);
will not split by those chars. However internally created strings work fine.
i.e.,
string[] splitStrings = "hello\agoodbye".Split((char)7);
This may not hold true for other methods of reading text from a file. I am unsure as I have not tested with other methods. With that in mind, it is probably best not to use those chars for delimiting strings!
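One possible explanation for this observation (an assumption, since the original file is not shown) is that the file contained the two literal characters \ and a rather than the single alert character. Escape sequences are translated by the C# compiler only inside source-code literals, never in text read at run time. The sketch below, with a hypothetical helper named RoundTripAndSplit, writes text to a temporary file, reads it back, and splits on (char)7:

```csharp
using System;
using System.IO;

static class BellSplitDemo
{
    // Writes text to a temporary file, reads it back, and splits on BEL (char 7).
    public static string[] RoundTripAndSplit(string text)
    {
        string path = Path.GetTempFileName();
        try
        {
            File.WriteAllText(path, text);
            return File.ReadAllText(path).Split((char)7);
        }
        finally
        {
            File.Delete(path); // clean up the temporary file
        }
    }
}
```

Passing "hello\agoodbye" (which contains a real alert character) produces two parts, while passing @"hello\agoodbye" (backslash followed by the letter a) produces one, because no alert character is present to split on.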
Is there a C# syntax with which I can express strings containing double quotes without having to escape them? I frequently copy and paste strings between C# source code to other apps, and it's frustrating to keep adding and removing backslashes.
Eg. presently for the following string (simple example)
"No," he said.
I write in C# "\"No,\" he said."
But I'd rather write something like Python '"No," he said.', or Ruby %q{"No," he said.}, so I can copy and paste the contents verbatim to other apps.
I frequently copy and paste strings between C# source code to other apps, and it's frustrating to keep adding and removing backslashes.
Then it sounds like you probably shouldn't have the strings within source code.
Instead, create text files which are embedded in your assembly, and load them dynamically... or create resource files so you can look up strings by key.
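A rough sketch of loading such an embedded text file, assuming the file has been added to the project as an embedded resource. The helper name (EmbeddedText.Load) is hypothetical, and the resource name depends on your project's default namespace and folder layout:

```csharp
using System.IO;
using System.Reflection;

static class EmbeddedText
{
    // Reads an embedded resource as text; resourceName is typically
    // "<DefaultNamespace>.<Folder>.<FileName>" (an assumption about project layout).
    public static string Load(Assembly assembly, string resourceName)
    {
        using (Stream stream = assembly.GetManifestResourceStream(resourceName))
        {
            if (stream == null)
                throw new FileNotFoundException("Embedded resource not found: " + resourceName);
            using (var reader = new StreamReader(stream))
                return reader.ReadToEnd();
        }
    }
}
```

The text file itself can then contain double quotes verbatim, so it can be copied to and from other apps without any escaping.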
There's no form of string literal in C# which would allow you to express a double-quote as just a single double-quote character in source code.
You could try this but you're still effectively escaping:
string s = @"""No,"" he said.";
Update 2022: C# 11 in Visual Studio 2022 version 17.2 (or later) supports raw string literals between """ https://learn.microsoft.com/en-us/dotnet/csharp/whats-new/csharp-11#raw-string-literals
Raw string literals are a new format for string literals. Raw string literals can contain arbitrary text, including whitespace, new lines, embedded quotes, and other special characters without requiring escape sequences. A raw string literal starts with at least three double-quote (""") characters. It ends with the same number of double-quote characters. Typically, a raw string literal uses three double quotes on a single line to start the string, and three double quotes on a separate line to end the string. The newlines following the opening quote and preceding the closing quote aren't included in the final content:
Example (note that StackOverflow doesn't yet highlight correctly)
string longMessage = """
    This is a long message.
    It has several lines.
        Some are indented
                more than others.
    Some should start at the first column.
    Some have "quoted text" in them.
    """;
I have a problem with my XML. When the tag values contain special characters, I need these special characters to be converted to UTF-8. Is there a namespace in C# for handling this?
Is this what you are looking for?
http://devproj20.blogspot.com/2008/02/writing-xml-with-utf-8-encoding-using.html
Look at the first comment for an alternative
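For readers who cannot follow the link, here is a hedged sketch of one common approach: writing XML through XmlWriter with an explicit UTF-8 encoding. This is not necessarily what the linked post does, and the class and method names (Utf8Xml, WriteElement) are made up for illustration:

```csharp
using System.IO;
using System.Text;
using System.Xml;

static class Utf8Xml
{
    // Writes a single element as a UTF-8 encoded XML document and returns it as text.
    public static string WriteElement(string name, string value)
    {
        var settings = new XmlWriterSettings
        {
            Encoding = new UTF8Encoding(false) // UTF-8 without a byte order mark
        };
        using (var stream = new MemoryStream())
        {
            using (XmlWriter writer = XmlWriter.Create(stream, settings))
            {
                writer.WriteStartDocument();
                writer.WriteElementString(name, value); // escapes &, < and > as needed
                writer.WriteEndDocument();
            }
            return Encoding.UTF8.GetString(stream.ToArray());
        }
    }
}
```

The writer takes care of XML escaping (for example & becomes &amp;), while characters such as curly quotes are simply encoded as UTF-8 bytes.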
I want to assign some XML code to a string variable.
I can do this without escaping single or double-quotes by using triple-quote in python.
Is there a similar way to do this in F# or C#?
F# 3.0 supports triple quoted strings. See Visual Studio F# Team Blog Post on 3.0 features.
The F# 3.0 Spec Strings and Characters section specifically mentions the XML scenario:
A triple-quoted string is specified by using three quotation marks
(""") to ensure that a string that includes one or more escaped
strings is interpreted verbatim. For example, a triple-quoted string
can be used to embed XML blobs:
As far as I know, there is no syntax corresponding to this in C# / F#. If you use @"str" then you have to replace each quote with two quotes, and if you just use "str" then you need to add a backslash.
In any case, there is some encoding of ":
var str = @"allows
multiline, but still need to encode "" as two chars";
var str = "need to use backslash \" here";
However, the best thing to do when you need to embed large strings (such as XML data) into your application is probably to use .NET resources (or store the data somewhere else, depending on your application). Embedding large string literals in a program is generally not recommended. Also, there used to be a plugin for pasting XML as a tree that constructs XElement objects for C#, but I'm not sure whether it still exists.
Although, I would personally vote to add """ as known from Python to F# - it is very useful, especially for interactive scripting.
In case someone ran into this question when looking for triple quote strings in C# (rather than F#), C#11 now has raw string literals and they're (IMO) better than Python's (due to how indentation is handled)!
Raw string literals are a new format for string literals. Raw string literals can contain arbitrary text, including whitespace, new lines, embedded quotes, and other special characters without requiring escape sequences. A raw string literal starts with at least three double-quote (""") characters. It ends with the same number of double-quote characters. Typically, a raw string literal uses three double quotes on a single line to start the string, and three double quotes on a separate line to end the string. The newlines following the opening quote and preceding the closing quote are not included in the final content:
string longMessage = """
    This is a long message.
    It has several lines.
        Some are indented
                more than others.
    Some should start at the first column.
    Some have "quoted text" in them.
    """;
Any whitespace to the left of the closing double quotes will be removed from the string literal. Raw string literals can be combined with string interpolation to include braces in the output text. Multiple $ characters denote how many consecutive braces start and end the interpolation:
var location = $$"""
   You are at {{{Longitude}}, {{Latitude}}}
   """;
The preceding example specifies that two braces start and end an interpolation. The third repeated opening and closing brace are included in the output string.
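To make the brace rule concrete, here is a small sketch that requires C# 11 or later. The class, method, and parameter names are illustrative only, and integers are used to keep the formatting independent of culture settings:

```csharp
static class RawInterpolationDemo
{
    // With $$, a run of two braces delimits an interpolation hole;
    // the extra third brace on each side is emitted literally.
    public static string Describe(int longitude, int latitude) =>
        $$"""You are at {{{longitude}}, {{latitude}}}""";
}
```

Calling RawInterpolationDemo.Describe(10, 20) yields the text You are at {10, 20}.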
https://devblogs.microsoft.com/dotnet/csharp-11-preview-updates/#raw-string-literals
https://learn.microsoft.com/en-us/dotnet/csharp/whats-new/csharp-11
As shoosh said, you want to use verbatim string literals in C#, where the string starts with @ and is enclosed in double quotation marks. The only exception is if you need to put a double quotation mark in the string, in which case you need to double it:
System.Console.WriteLine(@"Hello ""big"" world");
would output
Hello "big" world
http://msdn.microsoft.com/en-us/library/362314fe.aspx
In C# the syntax is @"some string"
see here
I want to clean strings that are retrieved from a database.
I ran into this issue where a property value (a name from a database) had an embedded TAB character, and Chrome gave me an invalid TOKEN error while trying to load the JSON object.
So now, I went to http://www.json.org/ and on the side it has a specification. But I'm having trouble understanding how to write a cleanser using this spec:
string
    ""
    " chars "
chars
    char
    char chars
char
    any-Unicode-character-
        except-"-or-\-or-
        control-character
    \"
    \\
    \/
    \b
    \f
    \n
    \r
    \t
    \u four-hex-digits
Given a string, how can I "clean" it such that I conform to this spec?
Specifically, I am confused: does the spec allow TAB (0x09) characters? If so, why did Chrome give an invalid TOKEN error?
Tab characters (actual 0x09, not escapes) cannot appear inside of quotes in JSON (though they are valid whitespace outside of quotes). You'll need to escape them with \t or \u0009 (the former being preferable).
json.org says an unescaped character of a string must be:
Any UNICODE character except " or \ or
control character
Tab counts as a control character.
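A minimal hand-rolled sketch of a "cleanser" that follows the string grammar quoted in the question: it escapes ", \ and every control character below U+0020. The names (JsonText, Escape) are hypothetical, and in practice a JSON serializer is usually the safer choice:

```csharp
using System.Text;

static class JsonText
{
    // Returns value as a quoted JSON string literal.
    public static string Escape(string value)
    {
        var sb = new StringBuilder(value.Length + 2);
        sb.Append('"');
        foreach (char c in value)
        {
            switch (c)
            {
                case '"':  sb.Append("\\\""); break;
                case '\\': sb.Append("\\\\"); break;
                case '\b': sb.Append("\\b");  break;
                case '\f': sb.Append("\\f");  break;
                case '\n': sb.Append("\\n");  break;
                case '\r': sb.Append("\\r");  break;
                case '\t': sb.Append("\\t");  break;
                default:
                    if (c < ' ')
                        sb.Append("\\u").Append(((int)c).ToString("x4")); // other control characters
                    else
                        sb.Append(c); // everything else passes through unchanged
                    break;
            }
        }
        sb.Append('"');
        return sb.ToString();
    }
}
```

With this, a name containing an embedded TAB comes out with \t in its place, which a browser's JSON parser accepts.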
This may be what you are looking for; it shows how to use the JavaScriptSerializer class in C#.
How to create JSON String in C#