[Case]
I have received a bunch of 'xml files' with metadata about a large number of documents in them. At least, that was what I requested. What I received were 'xml files' without a root element. They are structured something like this (I left out a bunch of elements):
<folder name = "abc"></folder>
<folder name = "abc/def">
<document name = "ghi1">
</document>
<document name = "ghi2">
</document>
</folder>
[Problem]
When I try to read the file in an XmlTextReader object it fails telling me that there is no root element.
[Current workaround]
Of course I can read the file as a stream, append <xmlroot> and </xmlroot>, write the stream to a new file and read that one in XmlTextReader. That is exactly what I am doing now, but I prefer not to 'tamper' with the original data.
[Requested solution]
I understand that I should use XmlTextReader for this, with the DocumentFragment option. However, this throws the following exception at run time:
An unhandled exception of type 'System.Xml.XmlException' occurred in
System.Xml.dll
Additional information: XmlNodeType DocumentFragment is not supported
for partial content parsing. Line 1, position 1.
[Faulty code]
using System.Diagnostics;
using System.Xml;
namespace XmlExample
{
class Program
{
static void Main(string[] args)
{
string file = @"C:\test.txt";
XmlTextReader tr = new XmlTextReader(file, XmlNodeType.DocumentFragment, null);
while(tr.Read())
Debug.WriteLine("NodeType: {0} NodeName: {1}", tr.NodeType, tr.Name);
}
}
}
This works:
using System.Diagnostics;
using System.Xml;
namespace XmlExample
{
class Program
{
static void Main(string[] args)
{
string file = @"C:\test.txt";
XmlReaderSettings settings = new XmlReaderSettings();
settings.ConformanceLevel = ConformanceLevel.Fragment;
using (XmlReader reader = XmlReader.Create(file, settings))
{
while (reader.Read())
Debug.WriteLine("NodeType: {0} NodeName: {1}", reader.NodeType, reader.Name);
}
}
}
}
Even though the XmlReader can be made to read the data using the ConformanceLevel.Fragment option as demonstrated by Martijn, it seems that XmlDataDocument does not like the idea of having multiple root elements.
I thought I'd try a different approach, much like the one you're currently using, but without the intermediate file. Most XML libraries (XmlDocument, XDocument, XmlDataDocument) can take a TextReader as an input, so I've implemented one of my own. It's used like so:
var dataDocument = new XmlDataDocument();
dataDocument.Load(new FakeRootStreamReader(File.OpenRead("test.xml")));
The code of the actual class:
public class FakeRootStreamReader : TextReader
{
private static readonly char[] _rootStart;
private static readonly char[] _rootEnd;
private readonly TextReader _innerReader;
private int _charsRead;
private bool _eof;
static FakeRootStreamReader()
{
_rootStart = "<root>".ToCharArray();
_rootEnd = "</root>".ToCharArray();
}
public FakeRootStreamReader(Stream stream)
{
_innerReader = new StreamReader(stream);
}
public FakeRootStreamReader(TextReader innerReader)
{
_innerReader = innerReader;
}
public override int Read(char[] buffer, int index, int count)
{
if (!_eof && _charsRead < _rootStart.Length)
{
// Prepend root element
return ReadFake(_rootStart, buffer, index, count);
}
if (!_eof)
{
// Normal reading operation
int charsRead = _innerReader.Read(buffer, index, count);
if (charsRead > 0) return charsRead;
// We've reached the end of the Stream
_eof = true;
_charsRead = 0;
}
// Append root element end tag at the end of the Stream
return ReadFake(_rootEnd, buffer, index, count);
}
private int ReadFake(char[] source, char[] buffer, int offset, int count)
{
int length = Math.Min(source.Length - _charsRead, count);
Array.Copy(source, _charsRead, buffer, offset, length);
_charsRead += length;
return length;
}
}
The first call to Read(...) returns only the <root> start tag. Subsequent calls read the stream as normal until its end is reached, at which point the </root> end tag is emitted.
The code is a bit... meh... mostly because I wanted to handle some never-gonna-happen cases where someone tries to read the stream less than 6 characters at a time.
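If an XElement is an acceptable in-memory representation, the fake root can also be added after parsing rather than in the character stream. The following is only a sketch under that assumption: FragmentLoader and LoadFragment are hypothetical names, not part of the original answers. It combines ConformanceLevel.Fragment with XNode.ReadFrom and leaves the source data untouched.

```csharp
using System.IO;
using System.Xml;
using System.Xml.Linq;

public static class FragmentLoader
{
    // Hypothetical helper: copies every top-level node of a rootless XML
    // stream into a new in-memory element. The input is never modified.
    public static XElement LoadFragment(TextReader input, string rootName = "root")
    {
        var settings = new XmlReaderSettings
        {
            ConformanceLevel = ConformanceLevel.Fragment, // allow several top-level elements
            IgnoreWhitespace = true                       // skip whitespace between them
        };
        var root = new XElement(rootName);
        using (XmlReader reader = XmlReader.Create(input, settings))
        {
            reader.MoveToContent();
            while (!reader.EOF)
            {
                // XNode.ReadFrom consumes the current node and leaves the
                // reader positioned on whatever follows it.
                root.Add(XNode.ReadFrom(reader));
            }
        }
        return root;
    }
}
```

It could be called as FragmentLoader.LoadFragment(File.OpenText(path)), after which the folder elements are available through root.Elements("folder").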
I'm writing a Windows app in C#. I have a custom data type that I need to write as raw data to a binary file (not text/string based), and then open that file later back into that custom data type.
For example:
Matrix<float> dbDescs = ConcatDescriptors(dbDescsList);
I need to save dbDescs to file blah.xyz and then restore it as Matrix<float> later. Anyone have any examples? Thanks!
As I've mentioned, the options are overwhelming and this question comes with a ton of opinions as to which one is best. That being said, BinaryFormatter could prove useful here, as it serializes and deserializes objects (along with graphs of connected objects) in binary.
Here's the MSDN link that explains the usage: https://msdn.microsoft.com/en-us/library/system.runtime.serialization.formatters.binary.binaryformatter(v=vs.110).aspx
Just in case that link fails down the line and because I'm too lazy to provide my own example, here's an example from MSDN:
using System;
using System.IO;
using System.Collections;
using System.Runtime.Serialization.Formatters.Binary;
using System.Runtime.Serialization;
public class App
{
[STAThread]
static void Main()
{
Serialize();
Deserialize();
}
static void Serialize()
{
// Create a hashtable of values that will eventually be serialized.
Hashtable addresses = new Hashtable();
addresses.Add("Jeff", "123 Main Street, Redmond, WA 98052");
addresses.Add("Fred", "987 Pine Road, Phila., PA 19116");
addresses.Add("Mary", "PO Box 112233, Palo Alto, CA 94301");
// To serialize the hashtable and its key/value pairs,
// you must first open a stream for writing.
// In this case, use a file stream.
FileStream fs = new FileStream("DataFile.dat", FileMode.Create);
// Construct a BinaryFormatter and use it to serialize the data to the stream.
BinaryFormatter formatter = new BinaryFormatter();
try
{
formatter.Serialize(fs, addresses);
}
catch (SerializationException e)
{
Console.WriteLine("Failed to serialize. Reason: " + e.Message);
throw;
}
finally
{
fs.Close();
}
}
static void Deserialize()
{
// Declare the hashtable reference.
Hashtable addresses = null;
// Open the file containing the data that you want to deserialize.
FileStream fs = new FileStream("DataFile.dat", FileMode.Open);
try
{
BinaryFormatter formatter = new BinaryFormatter();
// Deserialize the hashtable from the file and
// assign the reference to the local variable.
addresses = (Hashtable) formatter.Deserialize(fs);
}
catch (SerializationException e)
{
Console.WriteLine("Failed to deserialize. Reason: " + e.Message);
throw;
}
finally
{
fs.Close();
}
// To prove that the table deserialized correctly,
// display the key/value pairs.
foreach (DictionaryEntry de in addresses)
{
Console.WriteLine("{0} lives at {1}.", de.Key, de.Value);
}
}
}
Consider the Json.NET package (you can add it to your project via NuGet, which is the better way, or get it directly from their website).
JSON is just a string (text) that holds values for complex objects. It allows you to turn many (not all) objects into savable files easily which then can be pulled back. To serialize into JSON with JSON.net:
Product product = new Product();
product.Name = "Apple";
product.Expiry = new DateTime(2008, 12, 28);
product.Sizes = new string[] { "Small" };
string json = JsonConvert.SerializeObject(product);
And then to deserialize:
var product = JsonConvert.DeserializeObject<Product>(json);
To write the json to a file:
using (StreamWriter writer = new StreamWriter(@"C:/file.txt"))
{
writer.WriteLine(json);
}
I am not a web developer, so I am not sure that JSON is binary. Isn't it still text based? So here is what I know to be a binary answer. Hope this helps!
using System; // needed for Random and Console
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
namespace BinarySerializerSample
{
class Program
{
public static void WriteValues(string fName, double[] vals)
{
using (BinaryWriter writer = new BinaryWriter(File.Open(fName, FileMode.Create)))
{
int len = vals.Length;
for (int i = 0; i < len; i++)
writer.Write(vals[i]);
}
}
public static double[] ReadValues(string fName, int len)
{
double [] vals = new double[len];
using (BinaryReader reader = new BinaryReader(File.Open(fName, FileMode.Open)))
{
for (int i = 0; i < len; i++)
vals[i] = reader.ReadDouble();
}
return vals;
}
static void Main(string[] args)
{
const double MAX_TO_VARY = 100.0;
const int NUM_ITEMS = 100;
const string FILE_NAME = "dblToTestx.bin";
double[] dblToWrite = new double[NUM_ITEMS];
Random r = new Random();
for (int i = 0; i < NUM_ITEMS; i++)
dblToWrite[i] = r.NextDouble() * MAX_TO_VARY;
WriteValues(FILE_NAME, dblToWrite);
double[] dblToRead ;
dblToRead = ReadValues(FILE_NAME, NUM_ITEMS);
int j = 0;
bool areEqual = true;
while (areEqual && j < NUM_ITEMS)
{
areEqual = dblToRead[j] == dblToWrite[j];
++j;
}
if (areEqual)
Console.WriteLine("Test Passed: Press any Key to Exit");
else
Console.WriteLine("Test Failed: Press any Key to Exit");
Console.Read();
}
}
}
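The sample above requires the caller to know the element count when reading. For a two-dimensional type such as the Matrix<float> in the question, one option is to write the dimensions as a small header in front of the values. This is only a sketch: FloatMatrixFile is a hypothetical helper, and a plain float[,] stands in for Matrix<float>, whose API is not shown in the question.

```csharp
using System.IO;
using System.Text;

public static class FloatMatrixFile
{
    // Writes rows, cols (two Int32 values) followed by the values row by row.
    public static void Save(Stream output, float[,] matrix)
    {
        // leaveOpen: true, so the caller keeps ownership of the stream.
        using (var writer = new BinaryWriter(output, Encoding.UTF8, true))
        {
            int rows = matrix.GetLength(0);
            int cols = matrix.GetLength(1);
            writer.Write(rows);
            writer.Write(cols);
            for (int r = 0; r < rows; r++)
                for (int c = 0; c < cols; c++)
                    writer.Write(matrix[r, c]);
        }
    }

    // Reads the header first, so no external length argument is needed.
    public static float[,] Load(Stream input)
    {
        using (var reader = new BinaryReader(input, Encoding.UTF8, true))
        {
            int rows = reader.ReadInt32();
            int cols = reader.ReadInt32();
            var matrix = new float[rows, cols];
            for (int r = 0; r < rows; r++)
                for (int c = 0; c < cols; c++)
                    matrix[r, c] = reader.ReadSingle();
            return matrix;
        }
    }
}
```

With a file, this would be used as FloatMatrixFile.Save(File.Create("blah.xyz"), data) and FloatMatrixFile.Load(File.OpenRead("blah.xyz")), with each stream wrapped in a using block.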
There is an error in XML document (8, 20). Inner 1: Unexpected XML declaration. The XML declaration must be the first node in the document, and no white space characters are allowed to appear before it.
OK, I understand this error.
How I get it, however, is what perplexes me.
I create the document with Microsoft's Serialize tool. Then, I turn around and attempt to read it back, again, using Microsoft's Deserialize tool.
I am not in control of writing the XML file in the correct format - that I can see.
Here is the single routine I use to read and write.
private string xmlPath = System.Web.Hosting.HostingEnvironment.MapPath(WebConfigurationManager.AppSettings["DATA_XML"]);
private object objLock = new Object();
public string ErrorMessage { get; set; }
public StoredMsgs Operation(string from, string message, FileAccess access) {
StoredMsgs list = null;
lock (objLock) {
ErrorMessage = null;
try {
if (!File.Exists(xmlPath)) {
var root = new XmlRootAttribute(rootName);
var serializer = new XmlSerializer(typeof(StoredMsgs), root);
if (String.IsNullOrEmpty(message)) {
from = "Code Window";
message = "Created File";
}
var item = new StoredMsg() {
From = from,
Date = DateTime.Now.ToString("s"),
Message = message
};
using (var stream = File.Create(xmlPath)) {
list = new StoredMsgs();
list.Add(item);
serializer.Serialize(stream, list);
}
} else {
var root = new XmlRootAttribute("MessageHistory");
var serializer = new XmlSerializer(typeof(StoredMsgs), root);
var item = new StoredMsg() {
From = from,
Date = DateTime.Now.ToString("s"),
Message = message
};
using (var stream = File.Open(xmlPath, FileMode.Open, FileAccess.ReadWrite)) {
list = (StoredMsgs)serializer.Deserialize(stream);
if ((access == FileAccess.ReadWrite) || (access == FileAccess.Write)) {
list.Add(item);
serializer.Serialize(stream, list);
}
}
}
} catch (Exception error) {
var sb = new StringBuilder();
int index = 0;
sb.AppendLine(String.Format("Top Level Error: <b>{0}</b>", error.Message));
var err = error.InnerException;
while (err != null) {
index++;
sb.AppendLine(String.Format("\tInner {0}: {1}", index, err.Message));
err = err.InnerException;
}
ErrorMessage = sb.ToString();
}
}
return list;
}
Is something wrong with my routine? If Microsoft writes the file, it seems to me that it should be able to read it back.
It should be generic enough for anyone to use.
Here is my StoredMsg class:
[Serializable()]
[XmlType("StoredMessage")]
public class StoredMessage {
public StoredMessage() {
}
[XmlElement("From")]
public string From { get; set; }
[XmlElement("Date")]
public string Date { get; set; }
[XmlElement("Message")]
public string Message { get; set; }
}
[Serializable()]
[XmlRoot("MessageHistory")]
public class MessageHistory : List<StoredMessage> {
}
The file it generates doesn't look to me like it has any issues.
I saw the solution here:
Error: The XML declaration must be the first node in the document
But, in that case, it seems someone already had an XML document they wanted to read. They just had to fix it.
I have an XML document created by Microsoft, so it should be readable by Microsoft.
The problem is that you are adding to the file. You deserialize, then re-serialize to the same stream without rewinding and resizing to zero. This gives you multiple root elements:
<?xml version="1.0"?>
<StoredMessage>
</StoredMessage>
<?xml version="1.0"?>
<StoredMessage>
</StoredMessage>
Multiple root elements, and multiple XML declarations, are invalid according to the XML standard, thus the .NET XML parser throws an exception in this situation by default.
For possible solutions, see XML Error: There are multiple root elements, which suggests you either:
1. Enclose your list of StoredMessage elements in some synthetic outer element, e.g. StoredMessageList.
This would require you to load the list of messages from the file, add the new message, and then truncate the file and re-serialize the entire list when adding a single item. Thus the performance may be worse than in your current approach, but the XML will be valid.
2. When deserializing a file containing concatenated root elements, create an XML reader using XmlReaderSettings.ConformanceLevel = ConformanceLevel.Fragment and iteratively walk through the concatenated root node(s) and deserialize each one individually as shown, e.g., here. Using ConformanceLevel.Fragment allows the reader to parse streams with multiple root elements (although multiple XML declarations will still cause an error to be thrown).
Later, when adding a new element to the end of the file using XmlSerializer, seek to the end of the file and serialize using an XML writer returned from XmlWriter.Create(TextWriter, XmlWriterSettings) with XmlWriterSettings.OmitXmlDeclaration = true. This prevents output of multiple XML declarations as explained here.
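The first option (load, add, truncate, re-serialize) can be sketched roughly as follows. LogEntry and WholeFileAppender are hypothetical stand-ins for the question's types, and the sketch assumes a seekable stream opened for both reading and writing. The key step is rewinding and truncating before serializing again, which is what the original routine omits.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Xml.Serialization;

// Hypothetical message type standing in for StoredMessage.
public class LogEntry
{
    public string From { get; set; }
    public string Message { get; set; }
}

public static class WholeFileAppender
{
    private static readonly XmlSerializer Serializer =
        new XmlSerializer(typeof(List<LogEntry>));

    // Reads the whole list, adds one entry, then rewrites the stream so
    // that it always contains exactly one XML document.
    public static List<LogEntry> Append(Stream stream, LogEntry entry)
    {
        List<LogEntry> list = stream.Length == 0
            ? new List<LogEntry>()
            : (List<LogEntry>)Serializer.Deserialize(stream);
        list.Add(entry);

        stream.Position = 0;   // rewind to the start
        stream.SetLength(0);   // discard the old document
        Serializer.Serialize(stream, list);
        return list;
    }
}
```

The cost is that every append rewrites the entire file, which is the performance trade-off described for this option.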
For option #2, your Operation would look something like the following:
private string xmlPath = System.Web.Hosting.HostingEnvironment.MapPath(WebConfigurationManager.AppSettings["DATA_XML"]);
private object objLock = new Object();
public string ErrorMessage { get; set; }
const string rootName = "MessageHistory";
static readonly XmlSerializer serializer = new XmlSerializer(typeof(StoredMessage), new XmlRootAttribute(rootName));
public MessageHistory Operation(string from, string message, FileAccess access)
{
var list = new MessageHistory();
lock (objLock)
{
ErrorMessage = null;
try
{
using (var file = File.Open(xmlPath, FileMode.OpenOrCreate))
{
list.AddRange(XmlSerializerHelper.ReadObjects<StoredMessage>(file, false, serializer));
if (list.Count == 0 && String.IsNullOrEmpty(message))
{
from = "Code Window";
message = "Created File";
}
var item = new StoredMessage()
{
From = from,
Date = DateTime.Now.ToString("s"),
Message = message
};
if ((access == FileAccess.ReadWrite) || (access == FileAccess.Write))
{
file.Seek(0, SeekOrigin.End);
var writerSettings = new XmlWriterSettings
{
OmitXmlDeclaration = true,
Indent = true, // Optional; remove if compact XML is desired.
};
using (var textWriter = new StreamWriter(file))
{
if (list.Count > 0)
textWriter.WriteLine();
using (var xmlWriter = XmlWriter.Create(textWriter, writerSettings))
{
serializer.Serialize(xmlWriter, item);
}
}
}
list.Add(item);
}
}
catch (Exception error)
{
var sb = new StringBuilder();
int index = 0;
sb.AppendLine(String.Format("Top Level Error: <b>{0}</b>", error.Message));
var err = error.InnerException;
while (err != null)
{
index++;
sb.AppendLine(String.Format("\tInner {0}: {1}", index, err.Message));
err = err.InnerException;
}
ErrorMessage = sb.ToString();
}
}
return list;
}
Using the following static helper method adapted from Read nodes of a xml file in C#:
public partial class XmlSerializerHelper
{
public static List<T> ReadObjects<T>(Stream stream, bool closeInput = true, XmlSerializer serializer = null)
{
var list = new List<T>();
serializer = serializer ?? new XmlSerializer(typeof(T));
var settings = new XmlReaderSettings
{
ConformanceLevel = ConformanceLevel.Fragment,
CloseInput = closeInput,
};
using (var xmlTextReader = XmlReader.Create(stream, settings))
{
while (xmlTextReader.Read())
{ // Skip whitespace
if (xmlTextReader.NodeType == XmlNodeType.Element)
{
using (var subReader = xmlTextReader.ReadSubtree())
{
var logEvent = (T)serializer.Deserialize(subReader);
list.Add(logEvent);
}
}
}
}
return list;
}
}
Note that if you are going to create an XmlSerializer using a custom XmlRootAttribute, you must cache the serializer to avoid a memory leak.
Sample fiddle.
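The caching advice can be generalized beyond a single static field. The sketch below uses a hypothetical XmlSerializerCache class (not from the answer) and keys serializers by type and root name, so that each combination is constructed once and reused.

```csharp
using System;
using System.Collections.Concurrent;
using System.Xml.Serialization;

public static class XmlSerializerCache
{
    private static readonly ConcurrentDictionary<Tuple<Type, string>, XmlSerializer> Cache =
        new ConcurrentDictionary<Tuple<Type, string>, XmlSerializer>();

    // Returns a shared serializer for the given type and root element name.
    public static XmlSerializer Get(Type type, string rootName)
    {
        // Under contention the factory may run more than once, but only
        // one instance is stored and handed out afterwards.
        return Cache.GetOrAdd(
            Tuple.Create(type, rootName),
            key => new XmlSerializer(key.Item1, new XmlRootAttribute(key.Item2)));
    }
}
```

A call such as XmlSerializerCache.Get(typeof(StoredMessage), "MessageHistory") would then replace the direct constructor call.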
I'm trying to get a newline into a text node using XText from the Linq XML namespace.
I have a string which contains newline characters; however, I need to work out how to convert these to entity characters (i.e. &#10;) rather than just having them appear in the XML as new lines.
XElement element = new XElement( "NodeName" );
...
string example = "This is a string\nWith new lines in it\n";
element.Add( new XText( example ) );
The XElement is then written out using an XmlTextWriter which results in the file containing the newline rather than an entity replacement.
Has anyone come across this problem and found a solution?
EDIT:
The problem manifests itself when I load the XML into Excel, which doesn't seem to like the newline character but accepts the entity replacement. The result is that newlines aren't showing in Excel unless I replace them with &#10;
Nick.
Cheating:
XmlWriterSettings settings = new XmlWriterSettings();
settings.Indent = true;
settings.CheckCharacters = false;
settings.NewLineChars = "&#10;";
XmlWriter writer = XmlWriter.Create(..., settings);
element.WriteTo(writer);
writer.Flush();
UPDATE:
Complete program
using System;
using System.Xml;
using System.Xml.Linq;
namespace ConsoleApplication1
{
class Program
{
static void Main(string[] args)
{
XElement element = new XElement( "NodeName" );
string example = "This is a string\nWith new lines in it\n";
element.Add( new XText( example ) );
XmlWriterSettings settings = new XmlWriterSettings();
settings.Indent = true;
settings.CheckCharacters = false;
settings.NewLineChars = "&#10;";
XmlWriter writer = XmlWriter.Create(Console.Out, settings);
element.WriteTo(writer);
writer.Flush();
}
}
}
OUTPUT:
C:\Users\...\\ConsoleApplication1\bin\Release>ConsoleApplication1.exe
<?xml version="1.0" encoding="ibm850"?>&#10;<NodeName>This is a string&#10;With new lines in it&#10;</NodeName>
To any standard XML parser there is no difference between the entity &#10; and a newline character, as they are one and the same thing.
To illustrate this the following code shows that they are the same thing:
string s1 = "<root>Test&#10;Test2</root>";
string s2 = "<root>Test\nTest2</root>";
XDocument doc1 = XDocument.Parse(s1);
XDocument doc2 = XDocument.Parse(s2);
Console.WriteLine(doc1.ToString());
Console.WriteLine(doc2.ToString());
It's the XmlTextWriter which is responsible for outputting escaped entities. So if you do this, for example:
using (XmlTextWriter w = new XmlTextWriter("test.xml", Encoding.UTF8))
{
w.WriteString("&#10;");
}
You will get an escaped ampersand (&amp;#10;) in test.xml, which you don't want. You would like to keep the &#10; sequence raw, as is.
The solution I propose is to create a new StreamWriter implementation capable of detecting an escaped string like "&amp;#10;":
// A StreamWriter that does not escape &#10; characters
public class NonXmlEscapingStreamWriter : StreamWriter
{
private const string AmpToken = "amp";
private int _bufferState = 0; // used to keep state
// add other ctors overloads if needed
public NonXmlEscapingStreamWriter(string path)
: base(path)
{
}
// NOTE this code is based on the assumption that StreamWriter
// only overrides these 4 Write functions, which is true today but could change in the future
// and also on the assumption that the XmlTextWrite writes escaped values in a specific WriteXX calls sequence
public override void Write(char value)
{
if (value == '&')
{
if (_bufferState == 0)
{
_bufferState++;
return; // hold it
}
else
{
_bufferState = 0;
}
}
else if (value == ';')
{
if (_bufferState > 1)
{
_bufferState++;
return;
}
else
{
Write('&'); // release what's been held
Write(AmpToken);
_bufferState = 0;
}
}
else if (value == '\n') // detect non escaped \n
{
base.Write("&#10;");
return;
}
base.Write(value);
}
public override void Write(string value)
{
if (_bufferState > 0)
{
if (value == AmpToken)
{
_bufferState++;
return; // hold it
}
else
{
Write('&'); // release what's been held
_bufferState = 0;
}
}
base.Write(value);
}
public override void Write(char[] buffer, int index, int count)
{
if (_bufferState > 2)
{
_bufferState = 0;
base.Write('&'); // release this anyway
string replace;
if ((buffer != null) && ((replace = GetReplaceLength(buffer, index, count)) != null))
{
base.Write(replace);
base.Write(buffer, index + replace.Length, count - replace.Length);
return;
}
else
{
base.Write(AmpToken); // release this
base.Write(';'); // release this
}
}
base.Write(buffer, index, count);
}
public override void Write(char[] buffer)
{
Write(buffer, 0, buffer != null ? buffer.Length : 0);
}
private string GetReplaceLength(char[] buffer, int index, int count)
{
// this is specific to the 10 character but could be adapted
const string token = "#10;";
if ((index + count) < token.Length)
return null;
// we test the char array to avoid string allocations
for(int i = 0; i < token.Length; i++)
{
if (buffer[index + i] != token[i])
return null;
}
return token;
}
}
And you can use it like this:
using (XmlTextWriter w = new XmlTextWriter(new NonXmlEscapingStreamWriter("test.xml")))
{
element.WriteTo(w);
}
NOTE: Although it is capable of detecting lone \n characters, I suggest you ensure all \n are actually escaped in your original text, so you need to replace \n with &#10; before you actually output the XML, like this:
string example = "This is a string&#10;With new lines in it&#10;";
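If the text is written through an XmlWriter directly instead of via XElement.WriteTo, another possibility is XmlWriter.WriteCharEntity, which forces a character reference for a single character. The sketch below uses a hypothetical EntitizedNewlineWriter helper that is not taken from the answers above. Note that WriteCharEntity emits the hexadecimal form &#xA;, which is equivalent to &#10; for an XML parser; whether Excel treats both forms alike is an assumption that would need checking.

```csharp
using System.Text;
using System.Xml;

public static class EntitizedNewlineWriter
{
    // Writes <elementName>text</elementName>, emitting each '\n' in the
    // text as a character reference instead of a literal line break.
    public static string WriteElement(string elementName, string text)
    {
        var sb = new StringBuilder();
        var settings = new XmlWriterSettings { OmitXmlDeclaration = true };
        using (XmlWriter writer = XmlWriter.Create(sb, settings))
        {
            writer.WriteStartElement(elementName);
            int start = 0;
            for (int i = 0; i < text.Length; i++)
            {
                if (text[i] == '\n')
                {
                    // Flush the text before the newline, then the entity.
                    writer.WriteString(text.Substring(start, i - start));
                    writer.WriteCharEntity('\n');
                    start = i + 1;
                }
            }
            writer.WriteString(text.Substring(start)); // remaining tail
            writer.WriteEndElement();
        }
        return sb.ToString();
    }
}
```

Parsing the result back, for example with XElement.Parse, yields the original string including its newline characters.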
I am trying to parse XML messages which are sent to my C# application over TCP. Unfortunately, the protocol cannot be changed, the XML messages are not delimited, and no length prefix is used. Moreover, the character encoding is not fixed, but each message starts with an XML declaration <?xml>. The question is: how can I read one XML message at a time using C#?
Up to now, I have tried to read the data from the TCP stream into a byte array and use it through a MemoryStream. The problem is that the buffer might contain more than one XML message, or the first message may be incomplete. In these cases, I get an exception when trying to parse it with XmlReader.Read or XmlDocument.Load, but unfortunately the XmlException does not really allow me to distinguish the problem (except by parsing the localized error string).
I tried using XmlReader.Read and counting the number of Element and EndElement nodes. That way I know when I have finished reading the first, entire XML message.
However, there are several problems. If the buffer does not yet contain the entire message, how can I distinguish the XmlException from an actually invalid, non-well-formed message? In other words, if an exception is thrown before reading the first root EndElement, how can I decide whether to abort the connection with an error or to collect more bytes from the TCP stream?
If no exception occurs, the XmlReader is positioned at the start of the root EndElement. Casting the XmlReader to IXmlLineInfo gives me the current LineNumber and LinePosition; however, it is not straightforward to get the byte position where the EndElement really ends. In order to do that, I would have to convert the byte array into a string (with the encoding specified in the XML declaration), seek to LineNumber, LinePosition and convert that back to the byte offset. I tried to do that with StreamReader.ReadLine, but the stream reader gives no public access to the current byte position.
All this seems very inelegant and not robust. I wonder if you have ideas for a better solution. Thank you.
After looking around for some time I think I can answer my own question as follows (I might be wrong, corrections are welcome):
I found no way to make the XmlReader continue parsing a second XML message (at least not if the second message has an XmlDeclaration). XmlTextReader.ResetState could do something similar, but for that I would have to assume the same encoding for all messages. Therefore I could not connect the XmlReader directly to the TcpStream.
After closing the XmlReader, the buffer is not positioned at the reader's last position. So it is not possible to close the reader and use a new one to continue with the next message. I guess the reason for this is that the reader could not successfully seek on every possible input stream.
When XmlReader throws an exception, it cannot be determined whether it happened because of a premature EOF or because of non-well-formed XML. XmlReader.EOF is not set in case of an exception. As a workaround I derived my own memory stream, which returns the very last byte as a single byte. This way I know that the XmlReader was really interested in the last byte and the following exception is likely due to a truncated message (this is kind of sloppy, in that it might not detect every non-well-formed message; however, after appending more bytes to the buffer, sooner or later the error will be detected).
I could cast my XmlReader to the IXmlLineInfo interface, which gives access to the LineNumber and the LinePosition of the current node. So after reading the first message I remember these positions and use them to truncate the buffer. Here comes the really sloppy part, because I have to use the character encoding to get the byte position. I am sure you could find test cases for the code below where it breaks (e.g. internal elements with mixed encoding). But up to now it has worked for all my tests.
Here is the parser class I came up with; may it be useful (I know, it's very far from perfect...)
class XmlParser {
private byte[] buffer = new byte[0];
public int Length {
get {
return buffer.Length;
}
}
// Append new binary data to the internal data buffer...
public XmlParser Append(byte[] buffer2) {
if (buffer2 != null && buffer2.Length > 0) {
// I know, its not an efficient way to do this.
// The EofMemoryStream should handle a List<byte[]> ...
byte[] new_buffer = new byte[buffer.Length + buffer2.Length];
buffer.CopyTo(new_buffer, 0);
buffer2.CopyTo(new_buffer, buffer.Length);
buffer = new_buffer;
}
return this;
}
// MemoryStream which returns the last byte of the buffer individually,
// so that we know that the buffering XmlReader really looked at the last
// byte of the stream.
// Moreover there is an EOF marker.
private class EofMemoryStream: Stream {
public bool EOF { get; private set; }
private MemoryStream mem_;
public override bool CanSeek {
get {
return false;
}
}
public override bool CanWrite {
get {
return false;
}
}
public override bool CanRead {
get {
return true;
}
}
public override long Length {
get {
return mem_.Length;
}
}
public override long Position {
get {
return mem_.Position;
}
set {
throw new NotSupportedException();
}
}
public override void Flush() {
mem_.Flush();
}
public override long Seek(long offset, SeekOrigin origin) {
throw new NotSupportedException();
}
public override void SetLength(long value) {
throw new NotSupportedException();
}
public override void Write(byte[] buffer, int offset, int count) {
throw new NotSupportedException();
}
public override int Read(byte[] buffer, int offset, int count) {
count = Math.Min(count, Math.Max(1, (int)(Length - Position - 1)));
int nread = mem_.Read(buffer, offset, count);
if (nread == 0) {
EOF = true;
}
return nread;
}
public EofMemoryStream(byte[] buffer) {
mem_ = new MemoryStream(buffer, false);
EOF = false;
}
protected override void Dispose(bool disposing) {
mem_.Dispose();
}
}
// Parses the first xml message from the stream.
// If the first message is not yet complete, it returns null.
// If the buffer contains non-wellformed xml, it ~should~ throw an exception.
// After reading an xml message, it pops the data from the byte array.
public Message deserialize() {
if (buffer.Length == 0) {
return null;
}
Message message = null;
Encoding encoding = Message.default_encoding;
//string xml = encoding.GetString(buffer);
using (EofMemoryStream sbuffer = new EofMemoryStream (buffer)) {
XmlDocument xmlDocument = null;
XmlReaderSettings settings = new XmlReaderSettings();
int LineNumber = -1;
int LinePosition = -1;
bool truncate_buffer = false;
using (XmlReader xmlReader = XmlReader.Create(sbuffer, settings)) {
try {
// Read to the first node (skipping over whitespace and comments).
// Don't use MoveToContent here, because it would skip the
// XmlDeclaration too...
while (xmlReader.Read() &&
(xmlReader.NodeType==XmlNodeType.Whitespace ||
xmlReader.NodeType==XmlNodeType.Comment)) {
}
// Check for XML declaration.
// If the message has an XmlDeclaration, extract the encoding.
switch (xmlReader.NodeType) {
case XmlNodeType.XmlDeclaration:
while (xmlReader.MoveToNextAttribute()) {
if (xmlReader.Name == "encoding") {
encoding = Encoding.GetEncoding(xmlReader.Value);
}
}
xmlReader.MoveToContent();
xmlReader.Read();
break;
}
// Move to the first element.
xmlReader.MoveToContent();
if (xmlReader.EOF) {
return null;
}
// Read the entire document.
xmlDocument = new XmlDocument();
xmlDocument.Load(xmlReader.ReadSubtree());
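} catch (XmlException) {
// The parsing of the xml failed. If the XmlReader already looked
// at the last byte, the message is merely incomplete, so return null
// and wait for more data. Otherwise it is assumed that the XML is
// invalid and the exception is re-thrown.
if (sbuffer.EOF) {
return null;
}
// Plain "throw" preserves the original stack trace.
throw;
}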
{
// Try to deserialize into an internal data structure using XmlSerializer.
Type type = null;
try {
type = Type.GetType("my.namespace." + xmlDocument.DocumentElement.Name);
} catch (Exception) {
// No specialized data container for this class found...
}
if (type == null) {
message = new Message();
} else {
// TODO: reuse the serializer...
System.Xml.Serialization.XmlSerializer ser = new System.Xml.Serialization.XmlSerializer(type);
message = (Message)ser.Deserialize(new XmlNodeReader(xmlDocument));
}
message.doc = xmlDocument;
}
// At this point, the first XML message was successfully parsed.
// Remember the line position of the current end element.
IXmlLineInfo xmlLineInfo = xmlReader as IXmlLineInfo;
if (xmlLineInfo != null && xmlLineInfo.HasLineInfo()) {
LineNumber = xmlLineInfo.LineNumber;
LinePosition = xmlLineInfo.LinePosition;
}
// Try to read the rest of the buffer.
// If an exception is thrown, a second xml message follows.
// This way the xml parser can tell us that the message is finished here.
// That would be preferable, since truncating the buffer using the line info is sloppy.
try {
while (xmlReader.Read()) {
}
} catch {
// A second message follows. Needs the truncation workaround below.
truncate_buffer = true;
}
}
if (truncate_buffer) {
if (LineNumber < 0) {
throw new Exception("LineNumber not given. Cannot truncate xml buffer");
}
// Convert the buffer to a string using the encoding found before
// (or the default encoding).
string s = encoding.GetString(buffer);
// Seek to the line.
int char_index = 0;
while (--LineNumber > 0) {
// Recognize \r , \n , \r\n as newlines...
char_index = s.IndexOfAny(new char[] {'\r', '\n'}, char_index);
// char_index should not be -1 because LineNumber>0; otherwise the index check
// below throws an IndexOutOfRangeException, which is appropriate.
char_index++;
if (s[char_index-1]=='\r' && s.Length>char_index && s[char_index]=='\n') {
char_index++;
}
}
char_index += LinePosition - 1;
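// Escape the element name, since XML names may contain regex metacharacters such as '.'.
var rgx = new System.Text.RegularExpressions.Regex(System.Text.RegularExpressions.Regex.Escape(xmlDocument.DocumentElement.Name) + "[ \r\n\t]*\\>");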
System.Text.RegularExpressions.Match match = rgx.Match(s, char_index);
if (!match.Success || match.Index != char_index) {
throw new Exception("could not find EndElement to truncate the xml buffer.");
}
char_index += match.Value.Length;
// Convert the character offset back to the byte offset (for the given encoding).
int line1_boffset = encoding.GetByteCount(s.Substring(0, char_index));
// remove the bytes from the buffer.
buffer = buffer.Skip(line1_boffset).ToArray();
} else {
buffer = new byte[0];
}
}
return message;
}
}
You do not need to read into a MemoryStream first in order to use an XmlReader. You can attach the reader directly to the network stream and let it read as much as it needs to reach the end of the XML document. Wrapping the socket stream in a BufferedStream makes reading from it more efficient.
// Requires: using System; using System.IO; using System.Net.Sockets; using System.Xml;
string server = "myserver"; // host name only; TcpClient does not take a URI scheme
string message = "GetMyXml"; // request to send; sending it is not shown here
int port = 13000;
int bufferSize = 1024;
using (var client = new TcpClient(server, port))
using (var clientStream = client.GetStream())
using (var bufferedStream = new BufferedStream(clientStream, bufferSize))
using (var xmlReader = XmlReader.Create(bufferedStream))
{
try
{
// Do not call MoveToContent() before the loop: it would skip the XML declaration.
while (xmlReader.Read())
{
// Check for XML declaration.
if (xmlReader.NodeType != XmlNodeType.XmlDeclaration)
{
throw new Exception("Expected XML declaration.");
}
// Move to the first element.
xmlReader.Read();
xmlReader.MoveToContent();
// Read the root element.
// Hand this document to another method to process further.
// XmlDocument.Load is an instance method, so create the document first.
var xmlDocument = new XmlDocument();
xmlDocument.Load(xmlReader.ReadSubtree());
}
}
catch (XmlException ex)
{
// Record exception reading stream.
// Move reader to start of next document or rethrow exception to exit.
}
}
The key to making this work is the call to XmlReader.ReadSubtree() which creates a child reader on top of the parent reader, one that will treat the current element (in this case the root element) as the entire XML tree. This should allow you to parse document elements separately.
My code's a little sloppy around reading the document, especially as I ignore all the information in the XML declaration. I'm sure there's room for improvement, but hopefully this gets you on the right track.
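To illustrate the ReadSubtree() idea in isolation, here is a minimal sketch that reads several top-level elements from an in-memory string. It assumes the reader is created with ConformanceLevel.Fragment (as in the question's working example) so that more than one root element is accepted; the class and method names (SubtreeDemo, ReadRootNames) are made up for this illustration and are not part of the code above.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;

public static class SubtreeDemo
{
    // Loads each top-level element into its own XmlDocument and
    // returns the root element names in the order they were read.
    public static List<string> ReadRootNames(string xml)
    {
        var names = new List<string>();
        // Assumption: Fragment conformance, so multiple roots do not throw.
        var settings = new XmlReaderSettings { ConformanceLevel = ConformanceLevel.Fragment };
        using (var reader = XmlReader.Create(new StringReader(xml), settings))
        {
            // MoveToContent() skips whitespace and comments; the loop ends
            // as soon as the current node is not an element (e.g. end of input).
            while (reader.MoveToContent() == XmlNodeType.Element)
            {
                var doc = new XmlDocument();
                // The subtree reader treats the current element as a whole document.
                using (var subtree = reader.ReadSubtree())
                {
                    doc.Load(subtree);
                }
                names.Add(doc.DocumentElement.Name);
                // The parent reader now sits on the element's end tag
                // (or on the element itself if it was empty); step past it.
                reader.Read();
            }
        }
        return names;
    }
}
```

Calling SubtreeDemo.ReadRootNames("<a><x/></a> <b/>") should produce one document per top-level element. Over a socket the same loop applies, with the caveat that the parent reader may buffer bytes beyond the element it is currently parsing.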
Assuming that you can change the protocol, I'd suggest adding start and stop markers to the messages, so that when you read it all in as a text stream you can split it up into separate messages (leaving incomplete messages in an "incoming buffer" of some kind), clean up the markers and then you know that you've got exactly one message at a time.
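A minimal sketch of that marker idea follows. Everything in it is an assumption made for illustration: the MarkerFramer class, its Append method, and the choice of markers. The usage below uses the control characters U+0002 and U+0003, which are not legal characters in an XML 1.0 document and therefore cannot collide with the payload.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical helper: collects incoming text and cuts out complete
// messages that are wrapped in a start marker and a stop marker.
public sealed class MarkerFramer
{
    private readonly string start;
    private readonly string stop;
    private readonly StringBuilder incoming = new StringBuilder();

    public MarkerFramer(string start, string stop)
    {
        this.start = start;
        this.stop = stop;
    }

    // Appends newly received text and returns every complete message,
    // with the markers removed. Incomplete data stays in the buffer.
    public List<string> Append(string chunk)
    {
        incoming.Append(chunk);
        string text = incoming.ToString();
        var messages = new List<string>();
        int pos = 0;
        while (true)
        {
            int s = text.IndexOf(start, pos, StringComparison.Ordinal);
            if (s < 0) break;
            int e = text.IndexOf(stop, s + start.Length, StringComparison.Ordinal);
            if (e < 0) break; // stop marker not received yet
            int bodyStart = s + start.Length;
            messages.Add(text.Substring(bodyStart, e - bodyStart));
            pos = e + stop.Length;
        }
        // Keep only the unconsumed tail for the next call.
        incoming.Clear();
        incoming.Append(text, pos, text.Length - pos);
        return messages;
    }
}
```

With new MarkerFramer("\u0002", "\u0003"), each string returned by Append holds exactly one message and can be handed to XmlReader.Create(new StringReader(message)) without any of the end-of-document guessing above.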
The two issues that I found were:
XmlReader will only permit an XML declaration at the very beginning of its input. Since the reader can't be reset, it has to be recreated for each document.
Once XmlReader has done its work, it will usually have consumed additional characters after the end of the document, because it fills its buffer using the Read(char[], int, int) method.
My (brittle) workaround is to create a wrapper that only fills the array until a '>' is encountered. This keeps the XmlReader from consuming characters past the ending > of the document it was parsing:
public class SegmentingReader : TextReader {
private TextReader reader;
private char trigger;
public SegmentingReader(TextReader reader, char trigger) {
this.reader = reader;
this.trigger = trigger;
}
// Dispose omitted for brevity
public override int Peek() { return reader.Peek(); }
public override int Read() { return reader.Read(); }
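public override int Read(char[] buffer, int index, int count) {
int n = 0;
while (n < count) {
int c = reader.Read();
// -1 means end of input: stop instead of storing a bogus character.
if (c < 0) break;
char ch = (char)c;
buffer[index + n] = ch;
n++;
// Stop after the trigger so the caller never reads past it.
if (ch == trigger) break;
}
return n;
}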
}
Then it can be used as simply as:
// XmlSerializer is not IDisposable, so it cannot go in a using statement.
var serializer = new XmlSerializer(typeof(SerializedClass));
using (var inputReader = new SegmentingReader(/* TextReader from somewhere */, '>'))
{
while (inputReader.Peek() != -1)
{
// A fresh XmlReader per document, so each XML declaration is at the start.
using (var xmlReader = XmlReader.Create(inputReader)) {
xmlReader.MoveToContent();
var obj = serializer.Deserialize(xmlReader.ReadSubtree());
DoStuff(obj);
}
}
}