Split one xml file to multiple using c# [duplicate]

Split one xml file to multiple using c# [duplicate] - c#

[Case]
I have reveived a bunch of 'xml files' with metadata about a big number of documents in them. At least, that was what I requested. What I received where 'xml files' without a root element, they are structured something like this (i left out a bunch of elements):
<folder name = "abc"></folder>
<folder name = "abc/def">
<document name = "ghi1">
</document>
<document name = "ghi2">
</document>
</folder>
[Problem]
When I try to read the file in an XmlTextReader object it fails telling me that there is no root element.
[Current workaround]
Of course I can read the file as a stream, append < xmlroot> and < /xmlroot> and write the stream to a new file and read that one in XmlTextReader. Which is exactly what I am doing now, but I prefer not to 'tamper' with the original data.
[Requested solution]
I understand that I should use XmlTextReader for this, with the DocumentFragment option. However, this gives the compiletime error:
An unhandled exception of type 'System.Xml.XmlException' occurred in
System.Xml.dll
Additional information: XmlNodeType DocumentFragment is not supported
for partial content parsing. Line 1, position 1.
[Faulty code]
using System.Diagnostics;
using System.Xml;
namespace XmlExample
{
class Program
{
static void Main(string[] args)
{
string file = #"C:\test.txt";
XmlTextReader tr = new XmlTextReader(file, XmlNodeType.DocumentFragment, null);
while(tr.Read())
Debug.WriteLine("NodeType: {0} NodeName: {1}", tr.NodeType, tr.Name);
}
}
}

This works:
using System.Diagnostics;
using System.Xml;
namespace XmlExample
{
class Program
{
static void Main(string[] args)
{
string file = #"C:\test.txt";
XmlReaderSettings settings = new XmlReaderSettings();
settings.ConformanceLevel = ConformanceLevel.Fragment;
using (XmlReader reader = XmlReader.Create(file, settings))
{
while (reader.Read())
Debug.WriteLine("NodeType: {0} NodeName: {1}", reader.NodeType, reader.Name);
}
}
}
}

Even though the XmlReader can be made to read the data using the ConformanceLevel.Fragment option as demonstrated by Martijn, it seems that XmlDataDocument does not like the idea of having multiple root elements.
I thought I'd try a different approach, much like the one you're currently using, but without the intermediate file. Most XML libraries (XmlDocument, XDocument, XmlDataDocument) can take a TextReader as an input, so I've implemented one of my own. It's used like so:
var dataDocument = new XmlDataDocument();
dataDocument.Load(new FakeRootStreamReader(File.OpenRead("test.xml")));
The code of the actual class:
public class FakeRootStreamReader : TextReader
{
private static readonly char[] _rootStart;
private static readonly char[] _rootEnd;
private readonly TextReader _innerReader;
private int _charsRead;
private bool _eof;
static FakeRootStreamReader()
{
_rootStart = "<root>".ToCharArray();
_rootEnd = "</root>".ToCharArray();
}
public FakeRootStreamReader(Stream stream)
{
_innerReader = new StreamReader(stream);
}
public FakeRootStreamReader(TextReader innerReader)
{
_innerReader = innerReader;
}
public override int Read(char[] buffer, int index, int count)
{
if (!_eof && _charsRead < _rootStart.Length)
{
// Prepend root element
return ReadFake(_rootStart, buffer, index, count);
}
if (!_eof)
{
// Normal reading operation
int charsRead = _innerReader.Read(buffer, index, count);
if (charsRead > 0) return charsRead;
// We've reached the end of the Stream
_eof = true;
_charsRead = 0;
}
// Append root element end tag at the end of the Stream
return ReadFake(_rootEnd, buffer, index, count);
}
private int ReadFake(char[] source, char[] buffer, int offset, int count)
{
int length = Math.Min(source.Length - _charsRead, count);
Array.Copy(source, _charsRead, buffer, offset, length);
_charsRead += length;
return length;
}
}
The first call to Read(...) will return only the <root> element. Subsequent calls read the stream as normal, until the end of the stream is reached, then the end tag is outputted.
The code is a bit... meh... mostly because I wanted to handle some never-gonna-happen cases where someone tries to read the stream less than 6 characters at a time.

Related

error in XML document. Unexpected XML declaration. XML declaration must be the first node in the document

There is an error in XML document (8, 20). Inner 1: Unexpected XML declaration. The XML declaration must be the first node in the document, and no white space characters are allowed to appear before it.
OK, I understand this error.
How I get it, however, is what perplexes me.
I create the document with Microsoft's Serialize tool. Then, I turn around and attempt to read it back, again, using Microsoft's Deserialize tool.
I am not in control of writing the XML file in the correct format - that I can see.
Here is the single routine I use to read and write.
private string xmlPath = System.Web.Hosting.HostingEnvironment.MapPath(WebConfigurationManager.AppSettings["DATA_XML"]);
private object objLock = new Object();
public string ErrorMessage { get; set; }
public StoredMsgs Operation(string from, string message, FileAccess access) {
StoredMsgs list = null;
lock (objLock) {
ErrorMessage = null;
try {
if (!File.Exists(xmlPath)) {
var root = new XmlRootAttribute(rootName);
var serializer = new XmlSerializer(typeof(StoredMsgs), root);
if (String.IsNullOrEmpty(message)) {
from = "Code Window";
message = "Created File";
}
var item = new StoredMsg() {
From = from,
Date = DateTime.Now.ToString("s"),
Message = message
};
using (var stream = File.Create(xmlPath)) {
list = new StoredMsgs();
list.Add(item);
serializer.Serialize(stream, list);
}
} else {
var root = new XmlRootAttribute("MessageHistory");
var serializer = new XmlSerializer(typeof(StoredMsgs), root);
var item = new StoredMsg() {
From = from,
Date = DateTime.Now.ToString("s"),
Message = message
};
using (var stream = File.Open(xmlPath, FileMode.Open, FileAccess.ReadWrite)) {
list = (StoredMsgs)serializer.Deserialize(stream);
if ((access == FileAccess.ReadWrite) || (access == FileAccess.Write)) {
list.Add(item);
serializer.Serialize(stream, list);
}
}
}
} catch (Exception error) {
var sb = new StringBuilder();
int index = 0;
sb.AppendLine(String.Format("Top Level Error: <b>{0}</b>", error.Message));
var err = error.InnerException;
while (err != null) {
index++;
sb.AppendLine(String.Format("\tInner {0}: {1}", index, err.Message));
err = err.InnerException;
}
ErrorMessage = sb.ToString();
}
}
return list;
}
Is something wrong with my routine? If Microsoft write the file, it seems to me that it should be able to read it back.
It should be generic enough for anyone to use.
Here is my StoredMsg class:
[Serializable()]
[XmlType("StoredMessage")]
public class StoredMessage {
public StoredMessage() {
}
[XmlElement("From")]
public string From { get; set; }
[XmlElement("Date")]
public string Date { get; set; }
[XmlElement("Message")]
public string Message { get; set; }
}
[Serializable()]
[XmlRoot("MessageHistory")]
public class MessageHistory : List<StoredMessage> {
}
The file it generates doesn't look to me like it has any issues.
I saw the solution here:
Error: The XML declaration must be the first node in the document
But, in that case, it seems someone already had an XML document they wanted to read. They just had to fix it.
I have an XML document created my Microsoft, so it should be read back in by Microsoft.

The problem is that you are adding to the file. You deserialize, then re-serialize to the same stream without rewinding and resizing to zero. This gives you multiple root elements:
<?xml version="1.0"?>
<StoredMessage>
</StoredMessage
<?xml version="1.0"?>
<StoredMessage>
</StoredMessage
Multiple root elements, and multiple XML declarations, are invalid according to the XML standard, thus the .NET XML parser throws an exception in this situation by default.
For possible solutions, see XML Error: There are multiple root elements, which suggests you either:
Enclose your list of StoredMessage elements in some synthetic outer element, e.g. StoredMessageList.
This would require you to load the list of messages from the file, add the new message, and then truncate the file and re-serialize the entire list when adding a single item. Thus the performance may be worse than in your current approach, but the XML will be valid.
When deserializing a file containing concatenated root elements, create an XML writer using XmlReaderSettings.ConformanceLevel = ConformanceLevel.Fragment and iteratively walk through the concatenated root node(s) and deserialize each one individually as shown, e.g., here. Using ConformanceLevel.Fragment allows the reader to parse streams with multiple root elements (although multiple XML declarations will still cause an error to be thrown).
Later, when adding a new element to the end of the file using XmlSerializer, seek to the end of the file and serialize using an XML writer returned from XmlWriter.Create(TextWriter, XmlWriterSettings)
with XmlWriterSettings.OmitXmlDeclaration = true. This prevents output of multiple XML declarations as explained here.
For option #2, your Operation would look something like the following:
private string xmlPath = System.Web.Hosting.HostingEnvironment.MapPath(WebConfigurationManager.AppSettings["DATA_XML"]);
private object objLock = new Object();
public string ErrorMessage { get; set; }
const string rootName = "MessageHistory";
static readonly XmlSerializer serializer = new XmlSerializer(typeof(StoredMessage), new XmlRootAttribute(rootName));
public MessageHistory Operation(string from, string message, FileAccess access)
{
var list = new MessageHistory();
lock (objLock)
{
ErrorMessage = null;
try
{
using (var file = File.Open(xmlPath, FileMode.OpenOrCreate))
{
list.AddRange(XmlSerializerHelper.ReadObjects<StoredMessage>(file, false, serializer));
if (list.Count == 0 && String.IsNullOrEmpty(message))
{
from = "Code Window";
message = "Created File";
}
var item = new StoredMessage()
{
From = from,
Date = DateTime.Now.ToString("s"),
Message = message
};
if ((access == FileAccess.ReadWrite) || (access == FileAccess.Write))
{
file.Seek(0, SeekOrigin.End);
var writerSettings = new XmlWriterSettings
{
OmitXmlDeclaration = true,
Indent = true, // Optional; remove if compact XML is desired.
};
using (var textWriter = new StreamWriter(file))
{
if (list.Count > 0)
textWriter.WriteLine();
using (var xmlWriter = XmlWriter.Create(textWriter, writerSettings))
{
serializer.Serialize(xmlWriter, item);
}
}
}
list.Add(item);
}
}
catch (Exception error)
{
var sb = new StringBuilder();
int index = 0;
sb.AppendLine(String.Format("Top Level Error: <b>{0}</b>", error.Message));
var err = error.InnerException;
while (err != null)
{
index++;
sb.AppendLine(String.Format("\tInner {0}: {1}", index, err.Message));
err = err.InnerException;
}
ErrorMessage = sb.ToString();
}
}
return list;
}
Using the following extension method adapted from Read nodes of a xml file in C#:
public partial class XmlSerializerHelper
{
public static List<T> ReadObjects<T>(Stream stream, bool closeInput = true, XmlSerializer serializer = null)
{
var list = new List<T>();
serializer = serializer ?? new XmlSerializer(typeof(T));
var settings = new XmlReaderSettings
{
ConformanceLevel = ConformanceLevel.Fragment,
CloseInput = closeInput,
};
using (var xmlTextReader = XmlReader.Create(stream, settings))
{
while (xmlTextReader.Read())
{ // Skip whitespace
if (xmlTextReader.NodeType == XmlNodeType.Element)
{
using (var subReader = xmlTextReader.ReadSubtree())
{
var logEvent = (T)serializer.Deserialize(subReader);
list.Add(logEvent);
}
}
}
}
return list;
}
}
Note that if you are going to create an XmlSerializer using a custom XmlRootAttribute, you must cache the serializer to avoid a memory leak.
Sample fiddle.

How to avoid c# File.ReadLines First() locking file

I do not want to read the whole file at any point, I know there are answers on that question, I want t
o read the First or Last line.
I know that my code locks the file that it's reading for two reasons 1) The application that writes to the file crashes intermittently when I run my little app with this code but it never crashes when I am not running this code! 2) There are a few articles that will tell you that File.ReadLines locks the file.
There are some similar questions but that answer seems to involve reading the whole file which is slow for large files and therefore not what I want to do. My requirement to only read the last line most of the time is also unique from what I have read about.
I nead to know how to read the first line (Header row) and the last line (latest row). I do not want to read all lines at any point in my code because this file can become huge and reading the entire file will become slow.
I know that
line = File.ReadLines(fullFilename).First().Replace("\"", "");
... is the same as ...
FileStream fs = new FileStream(#fullFilename, FileMode.Open, FileAccess.Read, FileShare.Read);
My question is, how can I repeatedly read the first and last lines of a file which may be being written to by another application without locking it in any way. I have no control over the application that is writting to the file. It is a data log which can be appended to at any time. The reason I am listening in this way is that this log can be appended to for days on end. I want to see the latest data in this log in my own c# programme without waiting for the log to finish being written to.
My code to call the reading / listening function ...
//Start Listening to the "data log"
private void btnDeconstructCSVFile_Click(object sender, EventArgs e)
{
MySandbox.CopyCSVDataFromLogFile copyCSVDataFromLogFile = new MySandbox.CopyCSVDataFromLogFile();
copyCSVDataFromLogFile.checkForLogData();
}
My class which does the listening. For now it simply adds the data to 2 generics lists ...
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using MySandbox.Classes;
using System.IO;
namespace MySandbox
{
public class CopyCSVDataFromLogFile
{
static private List<LogRowData> listMSDataRows = new List<LogRowData>();
static String fullFilename = string.Empty;
static LogRowData previousLineLogRowList = new LogRowData();
static LogRowData logRowList = new LogRowData();
static LogRowData logHeaderRowList = new LogRowData();
static Boolean checking = false;
public void checkForLogData()
{
//Initialise
string[] logHeaderArray = new string[] { };
string[] badDataRowsArray = new string[] { };
//Get the latest full filename (file with new data)
//Assumption: only 1 file is written to at a time in this directory.
String directory = "C:\\TestDir\\";
string pattern = "*.csv";
var dirInfo = new DirectoryInfo(directory);
var file = (from f in dirInfo.GetFiles(pattern) orderby f.LastWriteTime descending select f).First();
fullFilename = directory + file.ToString(); //This is the full filepath and name of the latest file in the directory!
if (logHeaderArray.Length == 0)
{
//Populate the Header Row
logHeaderRowList = getRow(fullFilename, true);
}
LogRowData tempLogRowList = new LogRowData();
if (!checking)
{
//Read the latest data in an asynchronous loop
callDataProcess();
}
}
private async void callDataProcess()
{
checking = true; //Begin checking
await checkForNewDataAndSaveIfFound();
}
private static Task checkForNewDataAndSaveIfFound()
{
return Task.Run(() => //Call the async "Task"
{
while (checking) //Loop (asynchronously)
{
LogRowData tempLogRowList = new LogRowData();
if (logHeaderRowList.ValueList.Count == 0)
{
//Populate the Header row
logHeaderRowList = getRow(fullFilename, true);
}
else
{
//Populate Data row
tempLogRowList = getRow(fullFilename, false);
if ((!Enumerable.SequenceEqual(tempLogRowList.ValueList, previousLineLogRowList.ValueList)) &&
(!Enumerable.SequenceEqual(tempLogRowList.ValueList, logHeaderRowList.ValueList)))
{
logRowList = getRow(fullFilename, false);
listMSDataRows.Add(logRowList);
previousLineLogRowList = logRowList;
}
}
//System.Threading.Thread.Sleep(10); //Wait for next row.
}
});
}
private static LogRowData getRow(string fullFilename, bool isHeader)
{
string line;
string[] logDataArray = new string[] { };
LogRowData logRowListResult = new LogRowData();
try
{
if (isHeader)
{
//Asign first (header) row data.
//Works but seems to block writting to the file!!!!!!!!!!!!!!!!!!!!!!!!!!!
line = File.ReadLines(fullFilename).First().Replace("\"", "");
}
else
{
//Assign data as last row (default behaviour).
line = File.ReadLines(fullFilename).Last().Replace("\"", "");
}
logDataArray = line.Split(',');
//Copy Array to Generics List and remove last value if it's empty.
for (int i = 0; i < logDataArray.Length; i++)
{
if (i < logDataArray.Length)
{
if (i < logDataArray.Length - 1)
{
//Value is not at the end, from observation, these always have a value (even if it's zero) and so we'll store the value.
logRowListResult.ValueList.Add(logDataArray[i]);
}
else
{
//This is the last value
if (logDataArray[i].Replace("\"", "").Trim().Length > 0)
{
//In this case, the last value is not empty, store it as normal.
logRowListResult.ValueList.Add(logDataArray[i]);
}
else { /*The last value is empty, e.g. "123,456,"; the final comma denotes another field but this field is empty so we will ignore it now. */ }
}
}
}
}
catch (Exception ex)
{
if (ex.Message == "Sequence contains no elements")
{ /*Empty file, no problem. The code will safely loop and then will pick up the header when it appears.*/ }
else
{
//TODO: catch this error properly
Int32 problemID = 10; //Unknown ERROR.
}
}
return logRowListResult;
}
}
}

I found the answer in a combination of other questions. One answer explaining how to read from the end of a file, which I adapted so that it would read only 1 line from the end of the file. And another explaining how to read the entire file without locking it (I did not want to read the entire file but the not locking part was useful). So now you can read the last line of the file (if it contains end of line characters) without locking it. For other end of line delimeters, just replace my 10 and 13 with your end of line character bytes...
Add the method below to public class CopyCSVDataFromLogFile
private static string Reverse(string str)
{
char[] arr = new char[str.Length];
for (int i = 0; i < str.Length; i++)
arr[i] = str[str.Length - 1 - i];
return new string(arr);
}
and replace this line ...
line = File.ReadLines(fullFilename).Last().Replace("\"", "");
with this code block ...
Int32 endOfLineCharacterCount = 0;
Int32 previousCharByte = 0;
Int32 currentCharByte = 0;
//Read the file, from the end, for 1 line, allowing other programmes to access it for read and write!
using (FileStream reader = new FileStream(fullFilename, FileMode.Open, FileAccess.Read, FileShare.ReadWrite, 0x1000, FileOptions.SequentialScan))
{
int i = 0;
StringBuilder lineBuffer = new StringBuilder();
int byteRead;
while ((-i < reader.Length) /*Belt and braces: if there were no end of line characters, reading beyond the file would give a catastrophic error here (to be avoided thus).*/
&& (endOfLineCharacterCount < 2)/*Exit Condition*/)
{
reader.Seek(--i, SeekOrigin.End);
byteRead = reader.ReadByte();
currentCharByte = byteRead;
//Exit condition: the first 2 characters we read (reading backwards remember) were end of line ().
//So when we read the second end of line, we have read 1 whole line (the last line in the file)
//and we must exit now.
if (currentCharByte == 13 && previousCharByte == 10)
{
endOfLineCharacterCount++;
}
if (byteRead == 10 && lineBuffer.Length > 0)
{
line += Reverse(lineBuffer.ToString());
lineBuffer.Remove(0, lineBuffer.Length);
}
lineBuffer.Append((char)byteRead);
previousCharByte = byteRead;
}
reader.Close();
}

Read a 'fake' xml document (xml fragment) in a XmlTextReader object

[Case]
I have reveived a bunch of 'xml files' with metadata about a big number of documents in them. At least, that was what I requested. What I received where 'xml files' without a root element, they are structured something like this (i left out a bunch of elements):
<folder name = "abc"></folder>
<folder name = "abc/def">
<document name = "ghi1">
</document>
<document name = "ghi2">
</document>
</folder>
[Problem]
When I try to read the file in an XmlTextReader object it fails telling me that there is no root element.
[Current workaround]
Of course I can read the file as a stream, append < xmlroot> and < /xmlroot> and write the stream to a new file and read that one in XmlTextReader. Which is exactly what I am doing now, but I prefer not to 'tamper' with the original data.
[Requested solution]
I understand that I should use XmlTextReader for this, with the DocumentFragment option. However, this gives the compiletime error:
An unhandled exception of type 'System.Xml.XmlException' occurred in
System.Xml.dll
Additional information: XmlNodeType DocumentFragment is not supported
for partial content parsing. Line 1, position 1.
[Faulty code]
using System.Diagnostics;
using System.Xml;
namespace XmlExample
{
class Program
{
static void Main(string[] args)
{
string file = #"C:\test.txt";
XmlTextReader tr = new XmlTextReader(file, XmlNodeType.DocumentFragment, null);
while(tr.Read())
Debug.WriteLine("NodeType: {0} NodeName: {1}", tr.NodeType, tr.Name);
}
}
}

This works:
using System.Diagnostics;
using System.Xml;
namespace XmlExample
{
class Program
{
static void Main(string[] args)
{
string file = #"C:\test.txt";
XmlReaderSettings settings = new XmlReaderSettings();
settings.ConformanceLevel = ConformanceLevel.Fragment;
using (XmlReader reader = XmlReader.Create(file, settings))
{
while (reader.Read())
Debug.WriteLine("NodeType: {0} NodeName: {1}", reader.NodeType, reader.Name);
}
}
}
}

Even though the XmlReader can be made to read the data using the ConformanceLevel.Fragment option as demonstrated by Martijn, it seems that XmlDataDocument does not like the idea of having multiple root elements.
I thought I'd try a different approach, much like the one you're currently using, but without the intermediate file. Most XML libraries (XmlDocument, XDocument, XmlDataDocument) can take a TextReader as an input, so I've implemented one of my own. It's used like so:
var dataDocument = new XmlDataDocument();
dataDocument.Load(new FakeRootStreamReader(File.OpenRead("test.xml")));
The code of the actual class:
public class FakeRootStreamReader : TextReader
{
private static readonly char[] _rootStart;
private static readonly char[] _rootEnd;
private readonly TextReader _innerReader;
private int _charsRead;
private bool _eof;
static FakeRootStreamReader()
{
_rootStart = "<root>".ToCharArray();
_rootEnd = "</root>".ToCharArray();
}
public FakeRootStreamReader(Stream stream)
{
_innerReader = new StreamReader(stream);
}
public FakeRootStreamReader(TextReader innerReader)
{
_innerReader = innerReader;
}
public override int Read(char[] buffer, int index, int count)
{
if (!_eof && _charsRead < _rootStart.Length)
{
// Prepend root element
return ReadFake(_rootStart, buffer, index, count);
}
if (!_eof)
{
// Normal reading operation
int charsRead = _innerReader.Read(buffer, index, count);
if (charsRead > 0) return charsRead;
// We've reached the end of the Stream
_eof = true;
_charsRead = 0;
}
// Append root element end tag at the end of the Stream
return ReadFake(_rootEnd, buffer, index, count);
}
private int ReadFake(char[] source, char[] buffer, int offset, int count)
{
int length = Math.Min(source.Length - _charsRead, count);
Array.Copy(source, _charsRead, buffer, offset, length);
_charsRead += length;
return length;
}
}
The first call to Read(...) will return only the <root> element. Subsequent calls read the stream as normal, until the end of the stream is reached, then the end tag is outputted.
The code is a bit... meh... mostly because I wanted to handle some never-gonna-happen cases where someone tries to read the stream less than 6 characters at a time.

XDocument Text Node New Line

I'm trying to get a newline into a text node using XText from the Linq XML namespace.
I have a string which contains newline characters however I need to work out how to convert these to entity characters (i.e.
) rather than just having them appear in the XML as new lines.
XElement element = new XElement( "NodeName" );
...
string example = "This is a string\nWith new lines in it\n";
element.Add( new XText( example ) );
The XElement is then written out using an XmlTextWriter which results in the file containing the newline rather than an entity replacement.
Has anyone come across this problem and found a solution?
EDIT:
The problem manifests itself when I load the XML into EXCEL which doesn't seem to like the newline character but which accepts the entity replacement. The result is that newlines aren't showing in EXCEL unless I replace them with
Nick.

Cheating:
XmlWriterSettings settings = new XmlWriterSettings();
settings.Indent = true;
settings.CheckCharacters = false;
settings.NewLineChars = "
";
XmlWriter writer = XmlWriter.Create(..., settings);
element.WriteTo(writer);
writer.Flush();
UPDATE:
Complete program
using System;
using System.Xml;
using System.Xml.Linq;
namespace ConsoleApplication1
{
class Program
{
static void Main(string[] args)
{
XElement element = new XElement( "NodeName" );
string example = "This is a string\nWith new lines in it\n";
element.Add( new XText( example ) );
XmlWriterSettings settings = new XmlWriterSettings();
settings.Indent = true;
settings.CheckCharacters = false;
settings.NewLineChars = "
";
XmlWriter writer = XmlWriter.Create(Console.Out, settings);
element.WriteTo(writer);
writer.Flush();
}
}
}
OUTPUT:
C:\Users\...\\ConsoleApplication1\bin\Release>ConsoleApplication1.exe
<?xml version="1.0" encoding="ibm850"?>
<NodeName>This is a string
With new lines in it
</NodeName>

To any standard XML parser there is no difference between the entity
and a new line character, as they are one and the same thing.
To illustrate this the following code shows that they are the same thing:
string s1 = "<root>Test
Test2</root>";
string s2 = "<root>Test\nTest2</root>";
XDocument doc1 = XDocument.Parse(s1);
XDocument doc2 = XDocument.Parse(s2);
Console.WriteLine(doc1.ToString());
Console.WriteLine(doc2.ToString());

It's the XmlTextWriter which is responsible for outputting escaped entities. So if you do this, for example:
using (XmlTextWriter w = new XmlTextWriter("test.xml", Encoding.UTf8))
{
w.WriteString("");
}
You will also get an escaped ampersand output in text.xml &#x10, which you don't want. You would like to keep the  sequence raw, as is.
The solution I propose is to create a new StreamWriter implementation capable of detecting an escaped string like "&#x10;":
// A StreamWriter that does not escape
characters
public class NonXmlEscapingStreamWriter : StreamWriter
{
private const string AmpToken = "amp";
private int _bufferState = 0; // used to keep state
// add other ctors overloads if needed
public NonXmlEscapingStreamWriter(string path)
: base(path)
{
}
// NOTE this code is based on the assumption that StreamWriter
// only overrides these 4 Write functions, which is true today but could change in the future
// and also on the assumption that the XmlTextWrite writes escaped values in a specific WriteXX calls sequence
public override void Write(char value)
{
if (value == '&')
{
if (_bufferState == 0)
{
_bufferState++;
return; // hold it
}
else
{
_bufferState = 0;
}
}
else if (value == ';')
{
if (_bufferState > 1)
{
_bufferState++;
return;
}
else
{
Write('&'); // release what's been held
Write(AmpToken);
_bufferState = 0;
}
}
else if (value == '\n') // detect non escaped \n
{
base.Write("
");
return;
}
base.Write(value);
}
public override void Write(string value)
{
if (_bufferState > 0)
{
if (value == AmpToken)
{
_bufferState++;
return; // hold it
}
else
{
Write('&'); // release what's been held
_bufferState = 0;
}
}
base.Write(value);
}
public override void Write(char[] buffer, int index, int count)
{
if (_bufferState > 2)
{
_bufferState = 0;
base.Write('&'); // release this anyway
string replace;
if ((buffer != null) && ((replace = GetReplaceLength(buffer, index, count)) != null))
{
base.Write(replace);
base.Write(buffer, index + replace.Length, count - replace.Length);
return;
}
else
{
base.Write(AmpToken); // release this
base.Write(';'); // release this
}
}
base.Write(buffer, index, count);
}
public override void Write(char[] buffer)
{
Write(buffer, 0, buffer != null ? buffer.Length : 0);
}
private string GetReplaceLength(char[] buffer, int index, int count)
{
// this is specific to the 10 character but could be adapted
const string token = "#10;";
if ((index + count) < token.Length)
return null;
// we test the char array to avoid string allocations
for(int i = 0; i < token.Length; i++)
{
if (buffer[index + i] != token[i])
return null;
}
return token;
}
}
And you can use it like this:
using (XmlTextWriter w = new XmlTextWriter(new NonXmlEscapingStreamWriter("test.xml")))
{
element.WriteTo(w);
}
NOTE: Although it is capable of detecting lonely \n sequences, I suggest you ensure all \n are actually escaped in your original text, so, you need to replace \n by  before you actually output xml, like this:
string example = "This is a stringWith new lines in it";

Parsing concatenated, non-delimited XML messages from TCP-stream using C#

I am trying to parse XML messages which are send to my C# application over TCP. Unfortunately, the protocol can not be changed and the XML messages are not delimited and no length prefix is used. Moreover the character encoding is not fixed but each message starts with an XML declaration <?xml>. The question is, how can i read one XML message at a time, using C#.
Up to now, I tried to read the data from the TCP stream into a byte array and use it through a MemoryStream. The problem is, the buffer might contain more than one XML messages or the first message may be incomplete. In these cases, I get an exception when trying to parse it with XmlReader.Read or XmlDocument.Load, but unfortunately the XmlException does not really allow me to distinguish the problem (except parsing the localized error string).
I tried using XmlReader.Read and count the number of Element and EndElement nodes. That way I know when I am finished reading the first, entire XML message.
However, there are several problems. If the buffer does not yet contain the entire message, how can I distinguish the XmlException from an actually invalid, non-well-formed message? In other words, if an exception is thrown before reading the first root EndElement, how can I decide whether to abort the connection with error, or to collect more bytes from the TCP stream?
If no exception occurs, the XmlReader is positioned at the start of the root EndElement. Casting the XmlReader to IXmlLineInfo gives me the current LineNumber and LinePosition, however it is not straight forward to get the byte position where the EndElement really ends. In order to do that, I would have to convert the byte array into a string (with the encoding specified in the XML declaration), seek to LineNumber,LinePosition and convert that back to the byte offset. I try to do that with StreamReader.ReadLine, but the stream reader gives no public access to the current byte position.
All this seams very inelegant and non robust. I wonder if you have ideas for a better solution. Thank you.

After locking around for some time I think I can answer my own question as following (I might be wrong, corrections are welcome):
I found no method so that the XmlReader can continue parsing a second XML message (at least not, if the second message has an XmlDeclaration). XmlTextReader.ResetState could do something similar, but for that I would have to assume the same encoding for all messages. Therefor I could not connect the XmlReader directly to the TcpStream.
After closing the XmlReader, the buffer is not positioned at the readers last position. So it is not possible to close the reader and use a new one to continue with the next message. I guess the reason for this is, that the reader could not successfully seek on every possible input stream.
When XmlReader throws an exception it can not be determined whether it happened because of an premature EOF or because of a non-wellformed XML. XmlReader.EOF is not set in case of an exception. As workaround I derived my own MemoryBuffer, which returns the very last byte as a single byte. This way I know that the XmlReader was really interested in the last byte and the following exception is likely due to a truncated message (this is kinda sloppy, in that it might not detect every non-wellformed message. However, after appending more bytes to the buffer, sooner or later the error will be detected.
I could cast my XmlReader to the IXmlLineInfo interface, which gives access to the LineNumber and the LinePosition of the current node. So after reading the first message I remember these positions and use it to truncate the buffer. Here comes the really sloppy part, because I have to use the character encoding to get the byte position. I am sure you could find test cases for the code below where it breaks (e.g. internal elements with mixed encoding). But up to now it worked for all my tests.
Here is the parser class I came up with -- may it be useful (I know, its very far from perfect...)
class XmlParser {
private byte[] buffer = new byte[0];
public int Length {
get {
return buffer.Length;
}
}
// Append new binary data to the internal data buffer...
public XmlParser Append(byte[] buffer2) {
if (buffer2 != null && buffer2.Length > 0) {
// I know, its not an efficient way to do this.
// The EofMemoryStream should handle a List<byte[]> ...
byte[] new_buffer = new byte[buffer.Length + buffer2.Length];
buffer.CopyTo(new_buffer, 0);
buffer2.CopyTo(new_buffer, buffer.Length);
buffer = new_buffer;
}
return this;
}
// MemoryStream which returns the last byte of the buffer individually,
// so that we know that the buffering XmlReader really locked at the last
// byte of the stream.
// Moreover there is an EOF marker.
private class EofMemoryStream: Stream {
public bool EOF { get; private set; }
private MemoryStream mem_;
public override bool CanSeek {
get {
return false;
}
}
public override bool CanWrite {
get {
return false;
}
}
public override bool CanRead {
get {
return true;
}
}
public override long Length {
get {
return mem_.Length;
}
}
public override long Position {
get {
return mem_.Position;
}
set {
throw new NotSupportedException();
}
}
public override void Flush() {
mem_.Flush();
}
public override long Seek(long offset, SeekOrigin origin) {
throw new NotSupportedException();
}
public override void SetLength(long value) {
throw new NotSupportedException();
}
public override void Write(byte[] buffer, int offset, int count) {
throw new NotSupportedException();
}
public override int Read(byte[] buffer, int offset, int count) {
count = Math.Min(count, Math.Max(1, (int)(Length - Position - 1)));
int nread = mem_.Read(buffer, offset, count);
if (nread == 0) {
EOF = true;
}
return nread;
}
public EofMemoryStream(byte[] buffer) {
mem_ = new MemoryStream(buffer, false);
EOF = false;
}
protected override void Dispose(bool disposing) {
mem_.Dispose();
}
}
// Parses the first xml message from the stream.
// If the first message is not yet complete, it returns null.
// If the buffer contains non-wellformed xml, it ~should~ throw an exception.
// After reading an xml message, it pops the data from the byte array.
public Message deserialize() {
if (buffer.Length == 0) {
return null;
}
Message message = null;
Encoding encoding = Message.default_encoding;
//string xml = encoding.GetString(buffer);
using (EofMemoryStream sbuffer = new EofMemoryStream (buffer)) {
XmlDocument xmlDocument = null;
XmlReaderSettings settings = new XmlReaderSettings();
int LineNumber = -1;
int LinePosition = -1;
bool truncate_buffer = false;
using (XmlReader xmlReader = XmlReader.Create(sbuffer, settings)) {
try {
// Read to the first node (skipping over some element-types.
// Don't use MoveToContent here, because it would skip the
// XmlDeclaration too...
while (xmlReader.Read() &&
(xmlReader.NodeType==XmlNodeType.Whitespace ||
xmlReader.NodeType==XmlNodeType.Comment)) {
};
// Check for XML declaration.
// If the message has an XmlDeclaration, extract the encoding.
switch (xmlReader.NodeType) {
case XmlNodeType.XmlDeclaration:
while (xmlReader.MoveToNextAttribute()) {
if (xmlReader.Name == "encoding") {
encoding = Encoding.GetEncoding(xmlReader.Value);
}
}
xmlReader.MoveToContent();
xmlReader.Read();
break;
}
// Move to the first element.
xmlReader.MoveToContent();
if (xmlReader.EOF) {
return null;
}
// Read the entire document.
xmlDocument = new XmlDocument();
xmlDocument.Load(xmlReader.ReadSubtree());
} catch (XmlException e) {
// The parsing of the xml failed. If the XmlReader did
// not yet look at the last byte, it is assumed that the
// XML is invalid and the exception is re-thrown.
if (sbuffer.EOF) {
return null;
}
throw e;
}
{
// Try to serialize an internal data structure using XmlSerializer.
Type type = null;
try {
type = Type.GetType("my.namespace." + xmlDocument.DocumentElement.Name);
} catch (Exception e) {
// No specialized data container for this class found...
}
if (type == null) {
message = new Message();
} else {
// TODO: reuse the serializer...
System.Xml.Serialization.XmlSerializer ser = new System.Xml.Serialization.XmlSerializer(type);
message = (Message)ser.Deserialize(new XmlNodeReader(xmlDocument));
}
message.doc = xmlDocument;
}
// At this point, the first XML message was sucessfully parsed.
// Remember the lineposition of the current end element.
IXmlLineInfo xmlLineInfo = xmlReader as IXmlLineInfo;
if (xmlLineInfo != null && xmlLineInfo.HasLineInfo()) {
LineNumber = xmlLineInfo.LineNumber;
LinePosition = xmlLineInfo.LinePosition;
}
// Try to read the rest of the buffer.
// If an exception is thrown, another xml message appears.
// This way the xml parser could tell us that the message is finished here.
// This would be prefered as truncating the buffer using the line info is sloppy.
try {
while (xmlReader.Read()) {
}
} catch {
// There comes a second message. Needs workaround for trunkating.
truncate_buffer = true;
}
}
if (truncate_buffer) {
if (LineNumber < 0) {
throw new Exception("LineNumber not given. Cannot truncate xml buffer");
}
// Convert the buffer to a string using the encoding found before
// (or the default encoding).
string s = encoding.GetString(buffer);
// Seek to the line.
int char_index = 0;
while (--LineNumber > 0) {
// Recognize \r , \n , \r\n as newlines...
char_index = s.IndexOfAny(new char[] {'\r', '\n'}, char_index);
// char_index should not be -1 because LineNumber>0, otherwise an RangeException is
// thrown, which is appropriate.
char_index++;
if (s[char_index-1]=='\r' && s.Length>char_index && s[char_index]=='\n') {
char_index++;
}
}
char_index += LinePosition - 1;
var rgx = new System.Text.RegularExpressions.Regex(xmlDocument.DocumentElement.Name + "[ \r\n\t]*\\>");
System.Text.RegularExpressions.Match match = rgx.Match(s, char_index);
if (!match.Success || match.Index != char_index) {
throw new Exception("could not find EndElement to truncate the xml buffer.");
}
char_index += match.Value.Length;
// Convert the character offset back to the byte offset (for the given encoding).
int line1_boffset = encoding.GetByteCount(s.Substring(0, char_index));
// remove the bytes from the buffer.
buffer = buffer.Skip(line1_boffset).ToArray();
} else {
buffer = new byte[0];
}
}
return message;
}
}

Reading into a MemoryStream is not necessary to use an XmlReader. You can attach the reader more directly to the stream to read as much as you require to reach the end of the XML document. A BufferedStream can be utilized to improve the efficiency of reading from the socket directly.
string server = "tcp://myserver"
string message = "GetMyXml"
int port = 13000;
int bufferSize = 1024;
using(var client = new TcpClient(server, port))
using(var clientStream = client.GetStream())
using(var bufferedStream = new BufferedStream(clientStream, bufferSize))
using(var xmlReader = XmlReader.Create(bufferedStream))
{
xmlReader.MoveToContent();
try
{
while(xmlReader.Read())
{
// Check for XML declaration.
if(xmlReader.NodeType != XmlNodeType.XmlDeclaration)
{
throw new Exception("Expected XML declaration.");
}
// Move to the first element.
xmlReader.Read();
xmlReader.MoveToContent();
// Read the root element.
// Hand this document to another method to process further.
var xmlDocument = XmlDocument.Load(xmlReader.ReadSubtree());
}
}
catch(XmlException ex)
{
// Record exception reading stream.
// Move reader to start of next document or rethrow exception to exit.
}
}
The key to making this work is the call to XmlReader.ReadSubtree() which creates a child reader on top of the parent reader, one that will treat the current element (in this case the root element) as the entire XML tree. This should allow you to parse document elements separately.
My code's a little sloppy around reading the document, especially as I ignore all the information in the XML declaration. I'm sure there's room for improvement, but hopefully this gets you on the right track.

Assuming that you can change the protocol, I'd suggest adding start and stop markers to the messages, so that when you read it all in as a text stream you can split it up in separate messages (leaving incomplete messages in an "incoming buffer" of some kind), clean up the markers and then you know that you've got exactly one message at the time.

The 2 issues that I found were:
XmlReader will only permit an XML declaration at the very beginning. Since it can't be reset it needs to be recreated.
Once XmlReader has done its work it will usually have consumed additional characters after the end of the document because it uses the Read(char[], int, int) method.
My (brittle) workaround is to create a wrapper that only fills the array until a '>' is encountered. This keeps the XmlReader from consuming characters past the ending > of the document it was parsing:
public class SegmentingReader : TextReader {
private TextReader reader;
private char trigger;
public SegmentingReader(TextReader reader, char trigger) {
this.reader = reader;
this.trigger = trigger;
}
// Dispose omitted for brevity
public override int Peek() { return reader.Peek(); }
public override int Read() { return reader.Read(); }
public override int Read(char[] buffer, int index, int count) {
int n = 0;
while (n < count) {
char ch = (char)reader.Read();
buffer[index + n] = ch;
n++;
if (ch == trigger) break;
}
return n;
}
}
Then it can be used as simply as:
using(var inputReader = new SegmentingReader(/*TextReader from somewhere */))
using(var serializer = new XmlSerializer(typeof(SerializedClass)))
while (inputReader.Peek() != -1)
{
using (var xmlReader = XmlReader.Create(inputReader)) {
xmlReader.MoveToContent();
var obj = serializer.Deserialize(xmlReader.ReadSubtree());
DoStuff(obj);
}
}

Develop Reference

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.