I tried to phrase this as a generic question but realized I don't know enough, so here is the problem I'm having.
Here is a snippet from a console application:
public void Run()
{
Run(Console.Out);
}
public void Run(TextWriter writer)
{
DataTable customers = _quickBooksAdapter.GetTableData("Customer");
customers.WriteXml(writer);
}
Then I run it from the console and use ">" to put it in a file.
c:\> QuickBooksETL extract US > qb_us.xml
If I try to load the result as I would normally:
var x = XDocument.Load("qb_us.xml");
I get the error:
Invalid character in the given encoding. Line 8, position 26.
So I tried to determine what .NET "thinks" it is using:
string path = @"\\ad1\accounting$\Xml\qb_us.xml";
StreamReader sr = new StreamReader(path);
sr.CurrentEncoding.Dump();
Result:
System.Text.UTF8Encoding
BodyName utf-8
EncodingName Unicode (UTF-8)
HeaderName utf-8
WebName utf-8
WindowsCodePage 1200
IsBrowserDisplay True
IsBrowserSave True
IsMailNewsDisplay True
IsMailNewsSave True
IsSingleByte False
EncoderFallback EncoderReplacementFallback
System.Text.EncoderReplacementFallback
DefaultString �
MaxCharCount 1
DecoderFallback DecoderReplacementFallback
System.Text.DecoderReplacementFallback
DefaultString �
MaxCharCount 1
IsReadOnly True
CodePage 65001
Finally, I find by guessing that it works if I just explicitly say it's ASCII:
string path = @"\\ad1\accounting$\Xml\qb_us.xml";
StreamReader sr = new StreamReader(path, Encoding.ASCII);
var x = XDocument.Load(sr);
Any thoughts on where I am going wrong would be greatly appreciated. I admit I have never taken the "deep dive" on character encodings, but I'm willing to put in the effort to get this right.
The simple answer is not to get the console involved. Write directly to the file from your code:
public void Run(string filename)
{
DataTable customers = _quickBooksAdapter.GetTableData("Customer");
customers.WriteXml(filename);
}
or create the TextWriter or Stream yourself and pass that in, e.g.
public void Run(Stream output)
{
DataTable customers = _quickBooksAdapter.GetTableData("Customer");
customers.WriteXml(output);
}
Note that by reading it as ASCII, you'll basically be getting question marks for any non-ASCII character in the original data. IIRC, that's the default behaviour of an encoding when it encounters binary data it can't handle.
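The question-mark substitution is easy to see in isolation. The following is only a minimal sketch (the class and method names are made up for illustration): it encodes a string as UTF-8 and then decodes those bytes as ASCII, which is roughly what happens when the file is read with Encoding.ASCII.

```csharp
using System.Text;

public static class AsciiDecodeDemo
{
    // Encodes text as UTF-8 and then (incorrectly) decodes the bytes as ASCII.
    // Every byte above 0x7F is outside the ASCII range and comes back as '?'.
    public static string DecodeUtf8BytesAsAscii(string text)
    {
        byte[] utf8Bytes = Encoding.UTF8.GetBytes(text);
        return Encoding.ASCII.GetString(utf8Bytes);
    }
}
```

For example, "Zoë" is four bytes in UTF-8 (the "ë" takes two), so DecodeUtf8BytesAsAscii("Zoë") gives "Zo??". The load succeeds, but the original character is lost.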
Using a Stream it should default to writing out in UTF-8, and the XML declaration and the data within the file should match.
In my experience, if your data includes illegal characters (for example, character 12), the XML doesn't round trip unless you read the XML with an XmlTextReader with Normalization = false. I've been using XmlSerializer.Deserialize(), not XDocument.Load(). Still, you might try calling the Load(XmlReader) overload by passing in an XmlTextReader with Normalization = false.
I would add my voice to Jon's in suggesting that you write to your own stream, not Console.Out.
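As a rough sketch of what both answers suggest, the helper below writes the table to a file stream that the program opens itself and then loads the file back with XDocument.Load. The class name, method name and the idea of passing the path in are assumptions for illustration, not part of the original program. Note that DataTable.WriteXml requires the table's TableName to be set.

```csharp
using System.Data;
using System.IO;
using System.Xml.Linq;

public static class XmlRoundTrip
{
    // Writes the table as XML to a stream we own (no console redirection),
    // then parses the same file again to prove it is readable.
    public static XDocument WriteAndReload(DataTable table, string path)
    {
        using (FileStream output = File.Create(path))
        {
            table.WriteXml(output);
        }
        return XDocument.Load(path);
    }
}
```

Because the bytes never pass through the console's output encoding, non-ASCII values such as "Zoë" should come back unchanged.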
I need to convert a big XElement to a byte array so that it can be uploaded later to a fileshare. What is the correct method to call to do that?
Below you see the signature of a method fileShare.PutFile that is internal:
void PutFile(string folder, string fileName, byte[] content);
Then given an XElement xml, I tried converting it to a byte array by encoding its XElement.Value using Encoding.Default.GetBytes() as follows:
byte[] bytes = Encoding.Default.GetBytes(xml.Value);
fileShare.PutFile(folderName, blobName, bytes);
I am not sure that xml.Value (XElement.Value) is what the GetBytes method actually needs, though. Is this correct?
To test this, I spun up a console app and put in some fake data. I did this for the XElement:
XElement root = new XElement("Root",
new XElement("Child1", 1),
new XElement("Child2", 2),
new XElement("Child3", 3),
new XElement("Child4", 4),
new XElement("Child5", 5),
new XElement("Child6", 6)
);
Then I tried that line of code to convert it to a byte array:
byte[] bytes = Encoding.Default.GetBytes(root.Value);
When I step over that line and look at the Autos window, the bytes variable is a byte[6], and when I expand it I see that [0] = 49 and so on.
This may not mean that it is not working... or does it? How can I interpret the contents of the bytes array, to check whether it is correct?
Firstly, using Encoding.Default is not recommended. From the docs:
Warning
Different computers can use different encodings as the default, and the default encoding can change on a single computer. If you use the Default encoding to encode and decode data streamed between computers or retrieved at different times on the same computer, it may translate that data incorrectly. In addition, the encoding returned by the Default property uses best-fit fallback to map unsupported characters to characters supported by the code page. For these reasons, using the default encoding is not recommended. To ensure that encoded bytes are decoded properly, you should use a Unicode encoding, such as UTF8Encoding or UnicodeEncoding. You could also use a higher-level protocol to ensure that the same format is used for encoding and decoding.
Secondly, XElement.Value returns
A String that contains all of the text content of this element. If there are multiple text nodes, they will be concatenated.
Thus if you upload the Value you will be stripping away the entire XML markup structure from your file leaving only the plain text. While you might want to do that, it seems very unlikely. If you compare the value with the string returned by XElement.ToString() the difference should be clear.
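A small sketch makes the difference visible. The class and method names here are invented for illustration; only XElement.Value, XElement.ToString and SaveOptions come from the framework.

```csharp
using System.Xml.Linq;

public static class ValueVersusMarkup
{
    // A cut-down version of the test element from the question.
    public static XElement BuildSample() =>
        new XElement("Root",
            new XElement("Child1", 1),
            new XElement("Child2", 2));

    // Concatenated text content only: all markup is gone.
    public static string TextOnly(XElement element) => element.Value;

    // The full markup, written without indentation.
    public static string Markup(XElement element) =>
        element.ToString(SaveOptions.DisableFormatting);
}
```

For the sample, TextOnly returns "12" while Markup returns "<Root><Child1>1</Child1><Child2>2</Child2></Root>". This also explains the byte[6] seen in the debugger: with six children holding 1 to 6, Value is "123456", and 49 is the character code of '1'.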
Instead, to convert the XML contents of your XElement (including both markup and text) to a byte array, it would be better to write your XElement directly to a MemoryStream using an appropriately configured XmlWriterSettings and return the byte array thereby created. The following extension method does the job:
public static partial class XNodeExtensions
{
static Encoding DefaultEncoding { get; } = new UTF8Encoding(false); // Disable the BOM because XElement.ToString() does not include it.
public static byte [] ToByteArray(this XNode node, SaveOptions options = default, Encoding encoding = default)
{
// Emulate the settings of XElement.ToString() and XDocument.ToString()
// https://referencesource.microsoft.com/#System.Xml.Linq/System/Xml/Linq/XLinq.cs,2004
// I omitted the XML declaration because XElement.ToString() omits it, but you might want to include it, depending upon your needs.
var settings = new XmlWriterSettings { OmitXmlDeclaration = true, Indent = (options & SaveOptions.DisableFormatting) == 0, Encoding = encoding ?? DefaultEncoding };
if ((options & SaveOptions.OmitDuplicateNamespaces) != 0)
settings.NamespaceHandling |= NamespaceHandling.OmitDuplicates;
return node.ToByteArray(settings);
}
public static byte [] ToByteArray(this XNode node, XmlWriterSettings settings)
{
using var ms = new MemoryStream();
using (var writer = XmlWriter.Create(ms, settings))
node.WriteTo(writer);
return ms.ToArray();
}
}
Now you can format your XElement to a UTF8-encoded byte array by doing:
var bytes = root.ToByteArray();
The extension method has the added advantage that, if you really need to use some encoding other than UTF8, unsupported Unicode characters will be escaped rather than replaced with a fallback as explained in this answer to XmlDocument with Kanji text content is not encoded correctly to ISO-8859-1 using XmlTextWriter.
var bytes = root.ToByteArray(encoding : Encoding.Default);
To check for correctness, you could examine the contents of the byte array in the debugger or your console app by decoding it to a string as follows:
var resultString = Encoding.UTF8.GetString(bytes);
Console.WriteLine(resultString);
Or with the default encoding:
var resultString = Encoding.Default.GetString(bytes);
You could also assert that the contents of the byte array are correct by parsing the contents back to a new XElement and checking that the result is semantically identical to the original by using XNode.DeepEquals():
Assert.IsTrue(
XNode.DeepEquals(root,
XElement.Load(new StreamReader(new MemoryStream(bytes), encoding))));
Demo fiddle here.
I have a csv file.
When I try to read that file using a FileStream and ReadToEnd(), I get inverted commas and \r in many places, which breaks the number of rows in each column.
Is there a way to remove the inverted commas and \r?
I tried to replace
FileStream obj = new FileStream();
string a = obj.ReadToEnd();
a.Replace("\"","");
a.Replace("\r\"","");
When I visualize a, all \r and inverted commas are removed.
But when I read the file again from the beginning using ReadLine(), they appear again. Why?
First of all, a String is immutable. You might think this is not important for your question, but actually it's important whenever you are developing.
Looking at your code snippet, I suspect you are not familiar with immutable objects, so I advise you to make sure you fully understand the concept.
More information regarding immutable objects can be found: http://en.wikipedia.org/wiki/Immutable_object
Basically, it means one can never modify a string object. A string variable will always point to a new object whenever we change the value.
That's why the Replace method returns a value. Its documentation can be found here: https://msdn.microsoft.com/en-us/library/system.string.replace%28v=vs.110%29.aspx and states clearly that it returns a new string in which all occurrences of a specified string in the current instance are replaced with another specified string.
In your example, you aren't using the return value of the Replace function.
Could you show us that the values are actually being replaced in your a variable? I do not believe this is the case. When you visualize a string, carriage returns (\r) are not shown as visible characters; they are rendered as an actual carriage return. If you debug and take a look at the actual string value, you should still see the \r.
Take the following code snippet:
var someString = "Hello / world";
someString.Replace("/", "");
Console.WriteLine(someString);
You might think that the console will show "Hello world". However, on this fiddle you can see that it still logs "Hello / world": https://dotnetfiddle.net/cp59i3
What you have to do to correctly use String.Replace can be seen in this fiddle: https://dotnetfiddle.net/XCGtOu
Basically, you want to log the return value of the Replace function:
var a = "Some / Value";
var b = a.Replace("/", "");
Console.WriteLine(b);
Also, as mentioned by others in the comment section of your post, you are not replacing the contents of the file, only the string variable in memory.
If you want to save the new string, make sure to use the Write method of the FileStream (or any other way to write to a file). An explanation can be found here: How to Find And Replace Text In A File With C#
Apart from everything I have said in this answer: in most cases you should not remove both inverted commas and carriage returns from a file, as they are there for a reason, unless you have a specific reason to do so.
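At last I succeeded. Thanks to everybody. Here is the code I used.
string a;
FileStream obj = new FileStream(); // constructor arguments omitted in the original post
using(StreamReader csvr = new StreamReader(obj))
{
// Read through the StreamReader; FileStream itself has no ReadToEnd().
a = csvr.ReadToEnd();
// Replace returns a new string, so the result is assigned back.
a = a.Replace("\"","");
a = a.Replace("\r\"","");
} // disposing the reader also disposes the underlying FileStream
using(StreamWriter Wr = new StreamWriter(TempPath))
{
Wr.Write(a);
}
using(StreamReader Sr = new StreamReader(TempPath))
{
Sr.ReadLine();
}
I created a temp path on the system. After this, it was easy to enter the data into the database.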
Try something like this
StreamReader sReader = new StreamReader("filename");
string a = sReader.ReadToEnd();
a = a.Replace("\"", "");
a = a.Replace("\r\"", "");
StringReader reader = new StringReader(a);
string inputLine = "";
while ((inputLine = reader.ReadLine()) != null)
{
}
I've been struggling with a problem for a few days and have finally worked out what's going wrong but I've only been able to find contradicting answers on StackOverflow (et al) so would like to ask for an explanation of what's going on.
For example this link (in common with many other reference for example this one, or these seemingly go-to references on the topic by Jon Skeet here and here) states that "A string in C# is always UTF-16 [Unicode?], there is no way to "convert" it. The encoding is irrelevant as long as you manipulate the string in memory, it only matters if you write the string to a stream (file, memory stream, network stream...)."
The much simplified Test case I've built to demonstrate my issue is as below, it's probably not copy paste replicable as it depends on some of the strings to have a different encoding, but believe me the test passes as written. I'm using VS2012 Update 4.
The oddity is that the following two lines pass.
Assert.IsFalse(copiedFromXmlDoubleQuote == copiedFromXmlEscapedQuote);
Assert.AreNotEqual(copiedFromXmlDoubleQuote, copiedFromXmlEscapedQuote);
The identical strings fail equivalency as they are encoded differently (copiedFromXmlDoubleQuote had the \ replaced by " in the editor).
All this suggests that the Visual Studio editor is encoding aware, and the strings that the code declares are also encoding aware. My question is, have I done something stupid or can anyone please concur with my findings and if possible refer me to something that will help clarify what the story is with string encoding equivalence... As I'm going to be working in an Xml world a lot is it best practice to explicitly convert everything to Unicode at point of deserialization, and recode it as required when serializing out again?
[TestMethod]
public void EscapedCharacterDoesNotEqualLiteralString()
{
string actual = "\"";
Assert.AreEqual("\"", actual);
Assert.AreEqual(@"""", actual);
string typedEscapedQuote = "<?xml version=\"1.0\" encoding=\"utf-16\"?>";
string typedDoubleQuote = @"<?xml version=""1.0"" encoding=""utf-16""?>";
Assert.IsTrue(typedDoubleQuote == typedEscapedQuote);
Assert.AreEqual(typedDoubleQuote, typedEscapedQuote);
string copiedFromXmlEscapedQuote = "<?xml version=\"1.0\" encoding=\"utf-16\"?>";
string copiedFromXmlDoubleQuote = @"<?xml version=""1.0"" encoding=""utf-16""?>";
Assert.IsFalse(copiedFromXmlDoubleQuote == copiedFromXmlEscapedQuote);
Assert.AreNotEqual(copiedFromXmlDoubleQuote, copiedFromXmlEscapedQuote);
Assert.IsTrue(copiedFromXmlDoubleQuote.ToUnicode() == copiedFromXmlEscapedQuote.ToUnicode());
Assert.AreEqual(copiedFromXmlDoubleQuote.ToUnicode(), copiedFromXmlEscapedQuote.ToUnicode());
}
private static string BytesToString(byte[] bytes, Encoding encoding)
{
using (MemoryStream ms = new MemoryStream(bytes))
{
using (StreamReader sr = new StreamReader(ms, encoding))
{
string s = sr.ReadToEnd();
sr.Close();
return s;
}
}
}
public static string ToUnicode(this string s)
{
return BytesToString(new UnicodeEncoding().GetBytes(s), Encoding.Unicode);
}
I've loaded an example Vs2012 sln in a zip here
My initial check of your ZIP file shows that
static string copiedFromXmlEscapedQuote = "<?xml version=\"1.0\" encoding=\"utf-16\"?>";
static string copiedFromXmlDoubleQuote = @"<?xml version=""1.0"" encoding=""utf-16""?>";
? copiedFromXmlEscapedQuote.Length
39
? copiedFromXmlDoubleQuote.Length
40
The first check for string equivalence in the .net framework is length check - it doesn't bother checking the content if the strings are different lengths.
Further checking;
? copiedFromXmlDoubleQuote.Last()
62 '>'
? copiedFromXmlEscapedQuote.Last()
62 '>'
? copiedFromXmlEscapedQuote.First()
60 '<'
? copiedFromXmlDoubleQuote.First()
65279 ''
So it's the first char which is different. The value of 65279 is covered in this article. What is this char? 65279 ''.
It seems you are correct - it is the VS.net editor which is preserving the BOM char, and opening the program file in the binary editor shows these are different, so I'm guessing the use of @ in VS.net tells the compiler to open the following bytes using a different encoder.
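Whatever the route by which the character got into the literal, a leading U+FEFF (decimal 65279) can be removed before comparing. The helper below is a hypothetical sketch, not something from the test project in the question.

```csharp
public static class BomHelper
{
    // U+FEFF, decimal 65279: the byte order mark seen at index 0 above.
    public const char ByteOrderMark = '\uFEFF';

    // Removes any byte order mark characters from the start of the string.
    public static string StripLeadingBom(string s)
    {
        return s == null ? null : s.TrimStart(ByteOrderMark);
    }
}
```

With this, a 40-character string that starts with the BOM becomes the 39-character declaration, and the ordinary == comparison succeeds without any encoding round trip.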
I would like to read byte[] using C# with the current encoding of the file.
As written in MSDN the default encoding will be UTF-8 when the constructor has no encoding:
var reader = new StreamReader(new MemoryStream(data));
I have also tried this, but still get the file as UTF-8:
var reader = new StreamReader(new MemoryStream(data), true);
I need to read the byte[] with the current encoding.
A file has no encoding. A byte array has no encoding. A byte has no encoding. Encoding is something that transforms bytes to text and vice versa.
What you see in text editors and the like is actually program magic: The editor tries out different encodings and then guesses which one makes the most sense. This is also what you enable with the boolean parameter. If this does not produce what you want, then this magic fails.
var reader = new StreamReader(new MemoryStream(data), Encoding.Default);
will use the OS/Location specific default encoding. If that is still not what you want, then you need to be completely explicit, and tell the streamreader what exact encoding to use, for example (just as an example, you said you did not want UTF8):
var reader = new StreamReader(new MemoryStream(data), Encoding.UTF8);
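To make the point concrete, here is a small sketch showing that the same bytes produce different text depending on the encoding you name. The helper name and the sample bytes are assumptions for illustration; ISO-8859-1 simply stands in for whatever encoding the data was really written in.

```csharp
using System.IO;
using System.Text;

public static class ByteDecoding
{
    // Decodes the bytes with the encoding the caller names explicitly.
    // The 'true' argument only lets a byte order mark, if present, override it.
    public static string Decode(byte[] data, Encoding encoding)
    {
        using (var reader = new StreamReader(new MemoryStream(data), encoding, true))
        {
            return reader.ReadToEnd();
        }
    }
}
```

The bytes 0x63 0x61 0x66 0xE9 decode to "café" with Encoding.GetEncoding("iso-8859-1"), but with Encoding.UTF8 the last byte is not a valid sequence and comes back as the replacement character U+FFFD. Nothing in the bytes themselves says which reading is the intended one.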
I tried several ways of figuring out the encoding of a byte array, and it is not possible, because the byte array does not carry an encoding, as Jan mentions in his reply. However, you can always decode the bytes as UTF-8 or ASCII/Unicode and test the resulting string, for example after calling Encoding.GetString(byte[]):
public static bool IsUnicode(string input)
{
// ASCII uses one byte per character; UTF-8 needs more for anything outside ASCII,
// so differing counts mean the string contains non-ASCII characters.
var asciiBytesCount = Encoding.ASCII.GetByteCount(input);
var unicodeBytesCount = Encoding.UTF8.GetByteCount(input);
return asciiBytesCount != unicodeBytesCount;
}
I'm currently searching for an easy way to serialize objects (in C# 3).
I googled some examples and came up with something like:
MemoryStream memoryStream = new MemoryStream ( );
XmlSerializer xs = new XmlSerializer ( typeof ( MyObject) );
XmlTextWriter xmlTextWriter = new XmlTextWriter ( memoryStream, Encoding.UTF8 );
xs.Serialize ( xmlTextWriter, myObject);
string result = Encoding.UTF8.GetString(memoryStream .ToArray());
After reading this question I asked myself: why not use StringWriter? It seems much easier.
XmlSerializer ser = new XmlSerializer(typeof(MyObject));
StringWriter writer = new StringWriter();
ser.Serialize(writer, myObject);
serializedValue = writer.ToString();
Another problem was that the XML generated by the first example could not simply be written into an XML column of a SQL Server 2005 DB.
The first question is: Is there a reason why I shouldn't use StringWriter to serialize an Object when I need it as a string afterwards? I never found a result using StringWriter when googling.
The second is, of course: If you should not do it with StringWriter (for whatever reasons), which would be a good and correct way?
Addition:
As it was already mentioned by both answers, I'll further go into the XML to DB problem.
When writing to the Database I got the following exception:
System.Data.SqlClient.SqlException:
XML parsing: line 1, character 38,
unable to switch the encoding
For string
<?xml version="1.0" encoding="utf-8"?><test/>
I took the string created from the XmlTextWriter and just put as xml there. This one did not work (neither with manual insertion into the DB).
Afterwards I tried manual insertion (just writing INSERT INTO ... ) with encoding="utf-16" which also failed.
Removing the encoding totally worked then. After that result I switched back to the StringWriter code and voila - it worked.
Problem: I don't really understand why.
at Christian Hayter: With those tests I'm not sure that I have to use utf-16 to write to the DB. Wouldn't setting the encoding to UTF-16 (in the xml tag) work then?
One problem with StringWriter is that by default it doesn't let you set the encoding which it advertises - so you can end up with an XML document advertising its encoding as UTF-16, which means you need to encode it as UTF-16 if you write it to a file. I have a small class to help with that though:
public sealed class StringWriterWithEncoding : StringWriter
{
public override Encoding Encoding { get; }
public StringWriterWithEncoding (Encoding encoding)
{
Encoding = encoding;
}
}
Or if you only need UTF-8 (which is all I often need):
public sealed class Utf8StringWriter : StringWriter
{
public override Encoding Encoding => Encoding.UTF8;
}
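As a usage sketch, the helper below serializes a value through either a plain StringWriter or the Utf8StringWriter shown above, so the effect on the XML declaration can be compared. XmlStringSerializer, ToXml and the declareUtf8 flag are names invented for this illustration.

```csharp
using System.IO;
using System.Text;
using System.Xml.Serialization;

// Same class as above, repeated so the sketch is self-contained.
public sealed class Utf8StringWriter : StringWriter
{
    public override Encoding Encoding => Encoding.UTF8;
}

public static class XmlStringSerializer
{
    // Serializes value to an XML string. The declaration advertises whatever
    // encoding the writer reports through its Encoding property.
    public static string ToXml<T>(T value, bool declareUtf8)
    {
        var serializer = new XmlSerializer(typeof(T));
        using (StringWriter writer = declareUtf8 ? new Utf8StringWriter() : new StringWriter())
        {
            serializer.Serialize(writer, value);
            return writer.ToString();
        }
    }
}
```

With declareUtf8 set to false the result starts with encoding="utf-16"; with true it starts with encoding="utf-8". The string in memory is the same kind of .NET string either way; only the advertised encoding changes, which matters once the text is written out as bytes.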
As for why you couldn't save your XML to the database - you'll have to give us more details about what happened when you tried, if you want us to be able to diagnose/fix it.
When serialising an XML document to a .NET string, the encoding must be set to UTF-16. Strings are stored as UTF-16 internally, so this is the only encoding that makes sense. If you want to store data in a different encoding, you use a byte array instead.
SQL Server works on a similar principle; any string passed into an xml column must be encoded as UTF-16. SQL Server will reject any string where the XML declaration does not specify UTF-16. If the XML declaration is not present, then the XML standard requires that it default to UTF-8, so SQL Server will reject that as well.
Bearing this in mind, here are some utility methods for doing the conversion.
public static string Serialize<T>(T value) {
if(value == null) {
return null;
}
XmlSerializer serializer = new XmlSerializer(typeof(T));
XmlWriterSettings settings = new XmlWriterSettings()
{
Encoding = new UnicodeEncoding(false, false), // no BOM in a .NET string
Indent = false,
OmitXmlDeclaration = false
};
using(StringWriter textWriter = new StringWriter()) {
using(XmlWriter xmlWriter = XmlWriter.Create(textWriter, settings)) {
serializer.Serialize(xmlWriter, value);
}
return textWriter.ToString();
}
}
public static T Deserialize<T>(string xml) {
if(string.IsNullOrEmpty(xml)) {
return default(T);
}
XmlSerializer serializer = new XmlSerializer(typeof(T));
XmlReaderSettings settings = new XmlReaderSettings();
// No settings need modifying here
using(StringReader textReader = new StringReader(xml)) {
using(XmlReader xmlReader = XmlReader.Create(textReader, settings)) {
return (T) serializer.Deserialize(xmlReader);
}
}
}
First of all, beware of finding old examples. You've found one that uses XmlTextWriter, which is deprecated as of .NET 2.0. XmlWriter.Create should be used instead.
Here's an example of serializing an object into an XML column:
public void SerializeToXmlColumn(object obj)
{
using (var outputStream = new MemoryStream())
{
using (var writer = XmlWriter.Create(outputStream))
{
var serializer = new XmlSerializer(obj.GetType());
serializer.Serialize(writer, obj);
}
outputStream.Position = 0;
using (var conn = new SqlConnection(Settings.Default.ConnectionString))
{
conn.Open();
const string INSERT_COMMAND = @"INSERT INTO XmlStore (Data) VALUES (@Data)";
using (var cmd = new SqlCommand(INSERT_COMMAND, conn))
{
using (var reader = XmlReader.Create(outputStream))
{
var xml = new SqlXml(reader);
cmd.Parameters.Clear();
cmd.Parameters.AddWithValue("@Data", xml);
cmd.ExecuteNonQuery();
}
}
}
}
}
<TL;DR> The problem is rather simple, actually: you are not matching the declared encoding (in the XML declaration) with the datatype of the input parameter. If you manually added <?xml version="1.0" encoding="utf-8"?><test/> to the string, then declaring the SqlParameter to be of type SqlDbType.Xml or SqlDbType.NVarChar would give you the "unable to switch the encoding" error. Then, when inserting manually via T-SQL, since you switched the declared encoding to be utf-16, you were clearly inserting a VARCHAR string (not prefixed with an upper-case "N", hence an 8-bit encoding, such as UTF-8) and not an NVARCHAR string (prefixed with an upper-case "N", hence the 16-bit UTF-16 LE encoding).
The fix should have been as simple as:
1. In the first case, when adding the declaration stating encoding="utf-8": simply don't add the XML declaration.
2. In the second case, when adding the declaration stating encoding="utf-16": either
   a. simply don't add the XML declaration, OR
   b. simply add an "N" to the input parameter type: SqlDbType.NVarChar instead of SqlDbType.VarChar :-) (or possibly even switch to using SqlDbType.Xml)
(Detailed response is below)
All of the answers here are over-complicated and unnecessary (regardless of the 121 and 184 up-votes for Christian's and Jon's answers, respectively). They might provide working code, but none of them actually answer the question. The issue is that nobody truly understood the question, which ultimately is about how the XML datatype in SQL Server works. Nothing against those two clearly intelligent people, but this question has little to nothing to do with serializing to XML. Saving XML data into SQL Server is much easier than what is being implied here.
It doesn't really matter how the XML is produced as long as you follow the rules of how to create XML data in SQL Server. I have a more thorough explanation (including working example code to illustrate the points outlined below) in an answer on this question: How to solve “unable to switch the encoding” error when inserting XML into SQL Server, but the basics are:
1. The XML declaration is optional
2. The XML datatype stores strings always as UCS-2 / UTF-16 LE
3. If your XML is UCS-2 / UTF-16 LE, then you:
   a. pass in the data as either NVARCHAR(MAX) or XML / SqlDbType.NVarChar (maxsize = -1) or SqlDbType.Xml, or if using a string literal then it must be prefixed with an upper-case "N".
   b. if specifying the XML declaration, it must be either "UCS-2" or "UTF-16" (no real difference here)
4. If your XML is 8-bit encoded (e.g. "UTF-8" / "iso-8859-1" / "Windows-1252"), then you:
   a. need to specify the XML declaration IF the encoding is different than the code page specified by the default Collation of the database
   b. you must pass in the data as VARCHAR(MAX) / SqlDbType.VarChar (maxsize = -1), or if using a string literal then it must not be prefixed with an upper-case "N".
   c. Whatever 8-bit encoding is used, the "encoding" noted in the XML declaration must match the actual encoding of the bytes.
   d. The 8-bit encoding will be converted into UTF-16 LE by the XML datatype
With the points outlined above in mind, and given that strings in .NET are always UTF-16 LE / UCS-2 LE (there is no difference between those in terms of encoding), we can answer your questions:
Is there a reason why I shouldn't use StringWriter to serialize an Object when I need it as a string afterwards?
No, your StringWriter code appears to be just fine (at least I see no issues in my limited testing using the 2nd code block from the question).
Wouldn't setting the encoding to UTF-16 (in the xml tag) work then?
It isn't necessary to provide the XML declaration. When it is missing, the encoding is assumed to be UTF-16 LE if you pass the string into SQL Server as NVARCHAR (i.e. SqlDbType.NVarChar) or XML (i.e. SqlDbType.Xml). The encoding is assumed to be the default 8-bit Code Page if passing in as VARCHAR (i.e. SqlDbType.VarChar). If you have any non-standard-ASCII characters (i.e. values 128 and above) and are passing in as VARCHAR, then you will likely see "?" for BMP characters and "??" for Supplementary Characters as SQL Server will convert the UTF-16 string from .NET into an 8-bit string of the current Database's Code Page before converting it back into UTF-16 / UCS-2. But you shouldn't get any errors.
On the other hand, if you do specify the XML declaration, then you must pass into SQL Server using the matching 8-bit or 16-bit datatype. So if you have a declaration stating that the encoding is either UCS-2 or UTF-16, then you must pass in as SqlDbType.NVarChar or SqlDbType.Xml. Or, if you have a declaration stating that the encoding is one of the 8-bit options (i.e. UTF-8, Windows-1252, iso-8859-1, etc), then you must pass in as SqlDbType.VarChar. Failure to match the declared encoding with the proper 8 or 16 -bit SQL Server datatype will result in the "unable to switch the encoding" error that you were getting.
For example, using your StringWriter-based serialization code, I simply printed the resulting string of the XML and used it in SSMS. As you can see below, the XML declaration is included (because StringWriter does not have an option to OmitXmlDeclaration like XmlWriter does), which poses no problem so long as you pass the string in as the correct SQL Server datatype:
-- Upper-case "N" prefix == NVARCHAR, hence no error:
DECLARE @Xml XML = N'<?xml version="1.0" encoding="utf-16"?>
<string>Test ሴ😸</string>';
SELECT @Xml;
-- <string>Test ሴ😸</string>
As you can see, it even handles characters beyond standard ASCII, given that ሴ is BMP Code Point U+1234, and 😸 is Supplementary Character Code Point U+1F638. However, the following:
-- No upper-case "N" prefix on the string literal, hence VARCHAR:
DECLARE @Xml XML = '<?xml version="1.0" encoding="utf-16"?>
<string>Test ሴ😸</string>';
results in the following error:
Msg 9402, Level 16, State 1, Line XXXXX
XML parsing: line 1, character 39, unable to switch the encoding
Ergo, all of that explanation aside, the full solution to your original question is:
You were clearly passing the string in as SqlDbType.VarChar. Switch to SqlDbType.NVarChar and it will work without needing to go through the extra step of removing the XML declaration. This is preferred over keeping SqlDbType.VarChar and removing the XML declaration because this solution will prevent data loss when the XML includes non-standard-ASCII characters. For example:
-- No upper-case "N" prefix on the string literal == VARCHAR, and no XML declaration:
DECLARE @Xml2 XML = '<string>Test ሴ😸</string>';
SELECT @Xml2;
-- <string>Test ???</string>
As you can see, there is no error this time, but now there is data-loss 🙀.
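On the .NET side, the equivalent of the upper-case "N" prefix is the parameter type. The following is a minimal sketch, assuming the System.Data.SqlClient provider is referenced; the factory class and method names are hypothetical, and the command it would be attached to is not shown.

```csharp
using System.Data;
using System.Data.SqlClient;

public static class XmlParameterFactory
{
    // Builds an NVARCHAR(MAX) parameter (size -1) so that the UTF-16 .NET string
    // reaches SQL Server as 16-bit data, matching a utf-16 XML declaration.
    public static SqlParameter CreateUtf16XmlParameter(string parameterName, string xml)
    {
        return new SqlParameter(parameterName, SqlDbType.NVarChar, -1) { Value = xml };
    }
}
```

A call such as cmd.Parameters.Add(XmlParameterFactory.CreateUtf16XmlParameter("@Data", xml)) would then pass the string without converting it through the database's 8-bit code page. Using SqlDbType.Xml instead is the other option mentioned above.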
public static T DeserializeFromXml<T>(string xml)
{
T result;
XmlSerializerFactory serializerFactory = new XmlSerializerFactory();
XmlSerializer serializer =serializerFactory.CreateSerializer(typeof(T));
using (StringReader sr3 = new StringReader(xml))
{
XmlReaderSettings settings = new XmlReaderSettings()
{
CheckCharacters = false // default value is true;
};
using (XmlReader xr3 = XmlTextReader.Create(sr3, settings))
{
result = (T)serializer.Deserialize(xr3);
}
}
return result;
}
For anyone in need of an F# version of the approved answer:
type private Utf8StringWriter() =
inherit StringWriter()
override _.Encoding = System.Text.Encoding.UTF8
It may have been covered elsewhere, but simply changing the encoding line of the XML source to 'utf-16' allows the XML to be inserted into a SQL Server 'xml' data type.
using (DataSetTableAdapters.SQSTableAdapter tbl_SQS = new DataSetTableAdapters.SQSTableAdapter())
{
try
{
bodyXML = @"<?xml version=""1.0"" encoding=""UTF-8"" standalone=""yes""?><test></test>";
bodyXMLutf16 = bodyXML.Replace("UTF-8", "UTF-16");
tbl_SQS.Insert(messageID, receiptHandle, md5OfBody, bodyXMLutf16, sourceType);
}
catch (System.Data.SqlClient.SqlException ex)
{
Console.WriteLine(ex.Message);
Console.ReadLine();
}
}
The result is all of the XML text is inserted into the 'xml' data type field but the 'header' line is removed. What you see in the resulting record is just
<test></test>
Using the serialization method described in the "Answered" entry is a way of including the original header in the target field but the result is that the remaining XML text is enclosed in an XML <string></string> tag.
The table adapter in the code is a class automatically built using the Visual Studio 2013 "Add New Data Source" wizard. The five parameters to the Insert method map to fields in a SQL Server table.