XMLReader Invalid XML Character Exception

XMLReader Invalid XML Character Exception - c#

I am parsing a big XML file ~500MB, and it contains some invalid XML character 0x07 , so you can imagine what's happening, the XMLReader is throwing an Invalid XML character exception, to handle this we streamed the Stream into StreamReader and used Regex.Replace and wrote the result to memory using StreamWriter and stream the clean version back to XMLReader, now I would like to avoid this and skip this filthy tag from the XMLReader directly, my question is if there's anyway to achieve that, below is the code snippet where I try to do this but it's throwing the exception at this line
var node = (XElement)XNode.ReadFrom(xr);
protected override IEnumerable<XElement> StreamReader(Stream stream, string elementName)
{
var arrTag = elementName.Split('|').ToList();
using (var xr = XmlReader.Create(stream, new XmlReaderSettings { CheckCharacters = false }))
{
while (xr.Read())
{
if (xr.NodeType == XmlNodeType.Element && arrTag.Contains(xr.Name))
{
var node = (XElement)XNode.ReadFrom(xr);
node.ReplaceWith(node.Elements().Where(e => e.Name != "DaylightSaveInfo"));
yield return node;
}
}
xr.Close();
}
}
XML SAMPLE, the invalid attribute DaylightSaveInfo
<?xml version="1.0" encoding="ISO-8859-1"?>
<LATree>
<LA className="BTT00NE" fdn="NE=9739">
<attr name="fdn">NE=9739</attr>
<attr name="IP">10.157.144.100</attr>
<attr name="realLatitude">0D0&apos;0"S</attr>
<attr name="realLongitude">0D0&apos;0"W</attr>
<attr name="DaylightSaveInfo">NO</attr>
</LA>
</LATree>

I just saw that Jon Skeet wrote something about this, so I cannot take credit really, but since his stature on SO is way above mine, I could perhaps gain a point or two for writing it. :)
First I wrote a class that overloads the TextReader class.
(Some reference material in the links.)
https://www.w3.org/TR/xml/#NT-Char
https://github.com/Microsoft/referencesource/blob/master/mscorlib/system/io/textreader.cs
class FilterInvalidXmlReader : System.IO.TextReader
{
private System.IO.StreamReader _streamReader;
public System.IO.Stream BaseStream => _streamReader.BaseStream;
public FilterInvalidXmlReader(System.IO.Stream stream) => _streamReader = new System.IO.StreamReader(stream);
public override void Close() => _streamReader.Close();
protected override void Dispose(bool disposing) => _streamReader.Dispose();
public override int Peek()
{
var peek = _streamReader.Peek();
while (IsInvalid(peek, true))
{
_streamReader.Read();
peek = _streamReader.Peek();
}
return peek;
}
public override int Read()
{
var read = _streamReader.Read();
while (IsInvalid(read, true))
{
read = _streamReader.Read();
}
return read;
}
public static bool IsInvalid(int c, bool invalidateCompatibilityCharacters)
{
if (c == -1)
{
return false;
}
if (invalidateCompatibilityCharacters && ((c >= 0x7F && c <= 0x84) || (c >= 0x86 && c <= 0x9F) || (c >= 0xFDD0 && c <= 0xFDEF)))
{
return true;
}
if (c == 0x9 || c == 0xA || c == 0xD || (c >= 0x20 && c <= 0xD7FF) || (c >= 0xE000 && c <= 0xFFFD))
{
return false;
}
return true;
}
}
Then I created a console application and in the main I put:
using (var memoryStream = new System.IO.MemoryStream(System.Text.Encoding.UTF8.GetBytes("<Test><GoodAttribute>a\u0009b</GoodAttribute><BadAttribute>c\u0007d</BadAttribute></Test>")))
{
using (var xmlFilteredTextReader = new FilterInvalidXmlReader(memoryStream))
{
using (var xr = System.Xml.XmlReader.Create(xmlFilteredTextReader))
{
while (xr.Read())
{
if (xr.NodeType == System.Xml.XmlNodeType.Element)
{
var xe = System.Xml.Linq.XElement.ReadFrom(xr);
System.Console.WriteLine(xe.ToString());
}
}
}
}
}
Hopefully this could help, or at least provide some starter point.

Following xml linq code runs without errors. I used in xml file following "NO" :
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Xml;
using System.Xml.Linq;
using System.IO;
namespace ConsoleApplication108
{
class Program
{
const string FILENAME = #"c:\temp\test.xml";
static void Main(string[] args)
{
XmlReaderSettings settings = new XmlReaderSettings();
settings.CheckCharacters = false;
XmlReader reader = XmlReader.Create(FILENAME, settings);
XDocument doc = XDocument.Load(reader);
Dictionary<string, string> dict = doc.Descendants("attr")
.GroupBy(x => (string)x.Attribute("name"), y => (string)y)
.ToDictionary(x => x.Key, y => y.FirstOrDefault());
}
}
}

Related

ReadOuterXml is throwing OutOfMemoryException reading part of large (1 GB) XML file

I am working on a large XML file and while running the application, XmlTextReader.ReadOuterXml() method is throwing memory exception.
Lines of codes are like,
XmlTextReader xr = null;
try
{
xr = new XmlTextReader(fileName);
while (xr.Read() && success)
{
if (xr.NodeType != XmlNodeType.Element)
continue;
switch (xr.Name)
{
case "A":
var xml = xr.ReadOuterXml();
var n = GetDetails(xml);
break;
}
}
}
catch (Exception ex)
{
//Do stuff
}
Using:
private int GetDetails (string xml)
{
var rootNode = XDocument.Parse(xml);
var xnodes = rootNode.XPathSelectElements("//A/B").ToList();
//Then working on list of nodes
}
Now while loading the XML files, the application throwing exception on the xr.ReadOuterXml() line. What can be done to avoid this? The size of XML is almost 1 GB.

The most likely reason you are getting a OutOfMemoryException in ReadOuterXml() is that you are trying to read in a substantial portion of the 1 GB XML document into a string, and are hitting the Maximum string length in .Net.
So, don't do that. Instead load directly from the XmlReader using XDocument.Load() with XmlReader.ReadSubtree():
using (var xr = XmlReader.Create(fileName))
{
while (xr.Read() && success)
{
if (xr.NodeType != XmlNodeType.Element)
continue;
switch (xr.Name)
{
case "A":
{
// ReadSubtree() positions the reader at the EndElement of the element read, so the
// next call to Read() moves to the next node.
using (var subReader = xr.ReadSubtree())
{
var doc = XDocument.Load(subReader);
GetDetails(doc);
}
}
break;
}
}
}
And then in GetDetails() do:
private int GetDetails(XDocument rootDocument)
{
var xnodes = rootDocument.XPathSelectElements("//A/B").ToList();
//Then working on list of nodes
return xnodes.Count;
}
Not only will this use less memory, it will also be more performant. ReadOuterXml() uses a temporary XmlWriter to copy the XML in the input stream to an output StringWriter (which you then parse a second time). This version of the algorithm completely skips this extra work. It also avoids creating strings large enough to go on the large object heap which can cause additional performance issues.
If this is still using too much memory you will need to implement SAX-like parsing for your XML where you only load one element <B> at a time. First, introduce the following extension method:
public static partial class XmlReaderExtensions
{
public static IEnumerable<XElement> WalkXmlElements(this XmlReader xmlReader, Predicate<Stack<XName>> filter)
{
Stack<XName> names = new Stack<XName>();
while (xmlReader.Read())
{
if (xmlReader.NodeType == XmlNodeType.Element)
{
names.Push(XName.Get(xmlReader.LocalName, xmlReader.NamespaceURI));
if (filter(names))
{
using (var subReader = xmlReader.ReadSubtree())
{
yield return XElement.Load(subReader);
}
}
}
if ((xmlReader.NodeType == XmlNodeType.Element && xmlReader.IsEmptyElement)
|| xmlReader.NodeType == XmlNodeType.EndElement)
{
names.Pop();
}
}
}
}
Then, use it as follows:
using (var xr = XmlReader.Create(fileName))
{
Predicate<Stack<XName>> filter =
(stack) => stack.Peek().LocalName == "B" && stack.Count > 1 && stack.ElementAt(1).LocalName == "A";
foreach (var element in xr.WalkXmlElements(filter))
{
//Then working on the specific node.
}
}

using (var reader = XmlReader.Create(fileName))
{
XmlDocument oXml = new XmlDocument();
while (reader.Read())
{
oXml.Load(reader);
}
}
For me above code resolved the issue when we return it to XmlDocument through XmlDocument Load method

using XmlReader to get child nodes without knowing their names(in .net)

How do I get the the top level child nodes(unknownA) of root node with XmlReader in .net? Because their names are unknown, ReadToDescendant(string) and ReadToNextSibling(string)won't work.
<root>
<unknownA/>
<unknownA/>
<unknownA>
<unknownB/>
<unknownB/>
</unknownA>
<unknownA/>
<unknownA>
<unknownB/>
<unknownB>
<unknownC/>
<unknownC/>
</unknownB>
</unknownA>
<unknownA/>
</root>

You can walk through the file using XmlReader.Read(), checking the current Depth against the initial depth, until reaching an element end at the initial depth, using the following extension method:
public static class XmlReaderExtensions
{
public static IEnumerable<string> ReadChildElementNames(this XmlReader xmlReader)
{
if (xmlReader == null)
throw new ArgumentNullException();
if (xmlReader.NodeType == XmlNodeType.Element && !xmlReader.IsEmptyElement)
{
var depth = xmlReader.Depth;
while (xmlReader.Read())
{
if (xmlReader.Depth == depth + 1 && xmlReader.NodeType == XmlNodeType.Element)
yield return xmlReader.Name;
else if (xmlReader.Depth == depth && xmlReader.NodeType == XmlNodeType.EndElement)
break;
}
}
}
public static bool ReadToFirstElement(this XmlReader xmlReader)
{
if (xmlReader == null)
throw new ArgumentNullException();
while (xmlReader.NodeType != XmlNodeType.Element)
if (!xmlReader.Read())
return false;
return true;
}
}
Then it could be used as follows:
var xml = GetXml(); // Your XML string
using (var textReader = new StringReader(xml))
using (var xmlReader = XmlReader.Create(textReader))
{
xmlReader.ReadToFirstElement();
var names = xmlReader.ReadChildElementNames().ToArray();
Console.WriteLine(string.Join("\n", names));
}

Strategy for splitting a large JSON file

I'm trying to split very large JSON files into smaller files for a given array. For example:
{
"headerName1": "headerVal1",
"headerName2": "headerVal2",
"headerName3": [{
"element1Name1": "element1Value1"
},
{
"element2Name1": "element2Value1"
},
{
"element3Name1": "element3Value1"
},
{
"element4Name1": "element4Value1"
},
{
"element5Name1": "element5Value1"
},
{
"element6Name1": "element6Value1"
}]
}
...down to { "elementNName1": "elementNValue1" } where N is a large number
The user provides the name which represents the array to be split (in this example "headerName3") and the number of array objects per file, e.g. 1,000,000
This would result in N files each containing the top name:value pairs (headerName1, headerName3) and up to 1,000,000 of the headerName3 objects in each file.
I'm using the excellent Newtonsof JSON.net and understand that I need to do this using a stream.
So far I have looked a reading in JToken objects to establish where the PropertyName == "headerName3" occurs when reading in the tokens but what I would like to do is then read in the entire JSON object for each object in the array and not have to continue parsing JSON into JTokens;
Here's a snippet of the code I am building so far:
using (StreamReader oSR = File.OpenText(strInput))
{
using (var reader = new JsonTextReader(oSR))
{
while (reader.Read())
{
if (reader.TokenType == JsonToken.StartObject)
{
intObjectCount++;
}
else if (reader.TokenType == JsonToken.EndObject)
{
intObjectCount--;
if (intObjectCount == 1)
{
intArrayRecordCount++;
// Here I want to read the entire object for this record into an untyped JSON object
if( intArrayRecordCount % 1000000 == 0)
{
//write these to the split file
}
}
}
}
}
}
I don't know - and in fact, and am not concerned with - the structure of the JSON itself, and the objects can be of varying structures within the array. I am therefore not serializing to classes.
Is this the right approach? Is there a set of methods in the JSON.net library I can easily use to perform such operation?
Any help appreciated.

You can use JsonWriter.WriteToken(JsonReader reader, true) to stream individual array entries and their descendants from a JsonReader to a JsonWriter. You can also use JProperty.Load(JsonReader reader) and JProperty.WriteTo(JsonWriter writer) to read and write entire properties and their descendants.
Using these methods, you can create a state machine that parses the JSON file, iterates through the root object, loads "prefix" and "postfix" properties, splits the array property, and writes the prefix, array slice, and postfix properties out to new file(s).
Here's a prototype implementation that takes a TextReader and a callback function to create sequential output TextWriter objects for the split file:
enum SplitState
{
InPrefix,
InSplitProperty,
InSplitArray,
InPostfix,
}
public static void SplitJson(TextReader textReader, string tokenName, long maxItems, Func<int, TextWriter> createStream, Formatting formatting)
{
List<JProperty> prefixProperties = new List<JProperty>();
List<JProperty> postFixProperties = new List<JProperty>();
List<JsonWriter> writers = new List<JsonWriter>();
SplitState state = SplitState.InPrefix;
long count = 0;
try
{
using (var reader = new JsonTextReader(textReader))
{
bool doRead = true;
while (doRead ? reader.Read() : true)
{
doRead = true;
if (reader.TokenType == JsonToken.Comment || reader.TokenType == JsonToken.None)
continue;
if (reader.Depth == 0)
{
if (reader.TokenType != JsonToken.StartObject && reader.TokenType != JsonToken.EndObject)
throw new JsonException("JSON root container is not an Object");
}
else if (reader.Depth == 1 && reader.TokenType == JsonToken.PropertyName)
{
if ((string)reader.Value == tokenName)
{
state = SplitState.InSplitProperty;
}
else
{
if (state == SplitState.InSplitProperty)
state = SplitState.InPostfix;
var property = JProperty.Load(reader);
doRead = false; // JProperty.Load() will have already advanced the reader.
if (state == SplitState.InPrefix)
{
prefixProperties.Add(property);
}
else
{
postFixProperties.Add(property);
}
}
}
else if (reader.Depth == 1 && reader.TokenType == JsonToken.StartArray && state == SplitState.InSplitProperty)
{
state = SplitState.InSplitArray;
}
else if (reader.Depth == 1 && reader.TokenType == JsonToken.EndArray && state == SplitState.InSplitArray)
{
state = SplitState.InSplitProperty;
}
else if (state == SplitState.InSplitArray && reader.Depth == 2)
{
if (count % maxItems == 0)
{
var writer = new JsonTextWriter(createStream(writers.Count)) { Formatting = formatting };
writers.Add(writer);
writer.WriteStartObject();
foreach (var property in prefixProperties)
property.WriteTo(writer);
writer.WritePropertyName(tokenName);
writer.WriteStartArray();
}
count++;
writers.Last().WriteToken(reader, true);
}
else
{
throw new JsonException("Internal error");
}
}
}
foreach (var writer in writers)
using (writer)
{
writer.WriteEndArray();
foreach (var property in postFixProperties)
property.WriteTo(writer);
writer.WriteEndObject();
}
}
finally
{
// Make sure files are closed in the event of an exception.
foreach (var writer in writers)
using (writer)
{
}
}
}
This method leaves all the files open until the end in case "postfix" properties, appearing after the array property, need to be appended. Be aware that there is a limit of 16384 open files at one time, so if you need to create more split files, this won't work. If postfix properties are never encountered in practice, you can just close each file before opening the next and throw an exception in case any postfix properties are found. Otherwise you may need to parse the large file in two passes or close and reopen the split files to append them.
Here is an example of how to use the method with an in-memory JSON string:
private static void TestSplitJson(string json, string tokenName)
{
var builders = new List<StringBuilder>();
using (var reader = new StringReader(json))
{
SplitJson(reader, tokenName, 2, i => { builders.Add(new StringBuilder()); return new StringWriter(builders.Last()); }, Formatting.Indented);
}
foreach (var s in builders.Select(b => b.ToString()))
{
Console.WriteLine(s);
}
}
Prototype fiddle.

C# OpenXml Get DOCX WordStyle Property Simplified Code

Just curious if there is a more simplified version to check if the given body has the word style of "Heading3" applied given this sample C# code I wrote learning the OpenXML library. To be clear, I am just asking given a body element how can I determine if the given body element has what word style applied. I eventually have to write a program that process numerous .DOCX files and need to process them from a top to bottom approach.
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Wordprocessing;
using System.IO;
namespace docxparsing
{
class Program
{
static void Main()
{
string file_to_parse = #"C:\temp\sample.docx";
WordprocessingDocument doc = WordprocessingDocument.Open(file_to_parse,false);
Body body = doc.MainDocumentPart.Document.Body;
string fooStr
foreach( var foo in body )
{
fooStr = foo.InnerXml;
/*
these 2 comments represent 2 different xml snippets from 'fooStr'. the only way i figure out how to get the word style is by reading
this xml and doing checks for values. i don't know of any other approach in using the body element to check for the applied word style
<w:pPr xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main"><w:pStyle w:val="Heading2" />
<w:pPr xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main"><w:pStyle w:val="Heading3" />
*/
bool hasHeading3 = fooStr.Contains("pStyle w:val=\"Heading3\"");
if ( hasHeading3 )
{
Console.WriteLine("heading3 found");
}
}
doc.Close();
}
}
}
// -------------------------------------------------------------------------------
EDIT
Here is updated code of one way to do this. Still not overall happy with it but it works. Function to look at is getWordStyleValue(string x)
using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Wordprocessing;
using System;
using System.Diagnostics;
using System.IO;
using System.Text;
namespace docxparsing
{
class Program
{
// ************************************************
// grab the word style value
// ************************************************
static string getWordStyleValue(string x)
{
int p = 0;
p = x.IndexOf("w:pStyle w:val=");
if ( p == -1 )
{
return "";
}
p = p + 15;
StringBuilder sb = new StringBuilder();
while (true)
{
p++;
char c = x[p];
if (c != '"')
{
sb.Append(c);
}
else
{
break;
}
}
string s = sb.ToString();
return s;
}
// ************************************************
// Main
// ************************************************
static void Main(string[] args)
{
string theFile = #"C:\temp\sample.docx";
WordprocessingDocument doc = WordprocessingDocument.Open(theFile,false);
string body_table = "DocumentFormat.OpenXml.Wordprocessing.Table";
string body_paragraph = "DocumentFormat.OpenXml.Wordprocessing.Paragraph";
Body body = doc.MainDocumentPart.Document.Body;
StreamWriter sw1 = new StreamWriter("paragraphs.log");
foreach (var b in body)
{
string body_type = b.ToString();
if (body_type == body_paragraph)
{
string str = getWordStyleValue(b.InnerXml);
if (str == "" || str == "HeadingNon-TOC" || str == "TOC1" || str == "TOC2" || str == "TableofFigures" || str == "AcronymList" )
{
continue;
}
sw1.WriteLine(str + "," + b.InnerText);
}
if ( body_type == body_table )
{
// sw1.WriteLine("Table:\n{0}",b.InnerText);
}
}
doc.Close();
sw1.Close();
}
}
}

Yes. You could do something like this:
bool ContainsHeading3 = body.Descendants<ParagraphSytleId>().Any(psId => psId.Val == "Heading3");
This will look at all the ParagraphStyleId elements (w:pStyle in the xml) and see if any of them have the Val of Heading3.

Just pasting this Edit from original post so he has better visibility.
Here is one solution I came up with. Yes, it a little cody ( if that is a word ) but working LINQ ( my fav ) to optimize a more elegant solution.
--
Here is updated code of one way to do this. Still not overall happy with it but it works. Function to look at is getWordStyleValue(string x)
using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Wordprocessing;
using System;
using System.Diagnostics;
using System.IO;
using System.Text;
namespace docxparsing
{
class Program
{
// ************************************************
// grab the word style value
// ************************************************
static string getWordStyleValue(string x)
{
int p = 0;
p = x.IndexOf("w:pStyle w:val=");
if ( p == -1 )
{
return "";
}
p = p + 15;
StringBuilder sb = new StringBuilder();
while (true)
{
p++;
char c = x[p];
if (c != '"')
{
sb.Append(c);
}
else
{
break;
}
}
string s = sb.ToString();
return s;
}
// ************************************************
// Main
// ************************************************
static void Main(string[] args)
{
string theFile = #"C:\temp\sample.docx";
WordprocessingDocument doc = WordprocessingDocument.Open(theFile,false);
string body_table = "DocumentFormat.OpenXml.Wordprocessing.Table";
string body_paragraph = "DocumentFormat.OpenXml.Wordprocessing.Paragraph";
Body body = doc.MainDocumentPart.Document.Body;
StreamWriter sw1 = new StreamWriter("paragraphs.log");
foreach (var b in body)
{
string body_type = b.ToString();
if (body_type == body_paragraph)
{
string str = getWordStyleValue(b.InnerXml);
if (str == "" || str == "HeadingNon-TOC" || str == "TOC1" || str == "TOC2" || str == "TableofFigures" || str == "AcronymList" )
{
continue;
}
sw1.WriteLine(str + "," + b.InnerText);
}
if ( body_type == body_table )
{
// sw1.WriteLine("Table:\n{0}",b.InnerText);
}
}
doc.Close();
sw1.Close();
}
}
}

XDocument Text Node New Line

I'm trying to get a newline into a text node using XText from the Linq XML namespace.
I have a string which contains newline characters however I need to work out how to convert these to entity characters (i.e.
) rather than just having them appear in the XML as new lines.
XElement element = new XElement( "NodeName" );
...
string example = "This is a string\nWith new lines in it\n";
element.Add( new XText( example ) );
The XElement is then written out using an XmlTextWriter which results in the file containing the newline rather than an entity replacement.
Has anyone come across this problem and found a solution?
EDIT:
The problem manifests itself when I load the XML into EXCEL which doesn't seem to like the newline character but which accepts the entity replacement. The result is that newlines aren't showing in EXCEL unless I replace them with
Nick.

Cheating:
XmlWriterSettings settings = new XmlWriterSettings();
settings.Indent = true;
settings.CheckCharacters = false;
settings.NewLineChars = "
";
XmlWriter writer = XmlWriter.Create(..., settings);
element.WriteTo(writer);
writer.Flush();
UPDATE:
Complete program
using System;
using System.Xml;
using System.Xml.Linq;
namespace ConsoleApplication1
{
class Program
{
static void Main(string[] args)
{
XElement element = new XElement( "NodeName" );
string example = "This is a string\nWith new lines in it\n";
element.Add( new XText( example ) );
XmlWriterSettings settings = new XmlWriterSettings();
settings.Indent = true;
settings.CheckCharacters = false;
settings.NewLineChars = "
";
XmlWriter writer = XmlWriter.Create(Console.Out, settings);
element.WriteTo(writer);
writer.Flush();
}
}
}
OUTPUT:
C:\Users\...\\ConsoleApplication1\bin\Release>ConsoleApplication1.exe
<?xml version="1.0" encoding="ibm850"?>
<NodeName>This is a string
With new lines in it
</NodeName>

To any standard XML parser there is no difference between the entity
and a new line character, as they are one and the same thing.
To illustrate this the following code shows that they are the same thing:
string s1 = "<root>Test
Test2</root>";
string s2 = "<root>Test\nTest2</root>";
XDocument doc1 = XDocument.Parse(s1);
XDocument doc2 = XDocument.Parse(s2);
Console.WriteLine(doc1.ToString());
Console.WriteLine(doc2.ToString());

It's the XmlTextWriter which is responsible for outputting escaped entities. So if you do this, for example:
using (XmlTextWriter w = new XmlTextWriter("test.xml", Encoding.UTf8))
{
w.WriteString("");
}
You will also get an escaped ampersand output in text.xml &#x10, which you don't want. You would like to keep the  sequence raw, as is.
The solution I propose is to create a new StreamWriter implementation capable of detecting an escaped string like "&#x10;":
// A StreamWriter that does not escape
characters
public class NonXmlEscapingStreamWriter : StreamWriter
{
private const string AmpToken = "amp";
private int _bufferState = 0; // used to keep state
// add other ctors overloads if needed
public NonXmlEscapingStreamWriter(string path)
: base(path)
{
}
// NOTE this code is based on the assumption that StreamWriter
// only overrides these 4 Write functions, which is true today but could change in the future
// and also on the assumption that the XmlTextWrite writes escaped values in a specific WriteXX calls sequence
public override void Write(char value)
{
if (value == '&')
{
if (_bufferState == 0)
{
_bufferState++;
return; // hold it
}
else
{
_bufferState = 0;
}
}
else if (value == ';')
{
if (_bufferState > 1)
{
_bufferState++;
return;
}
else
{
Write('&'); // release what's been held
Write(AmpToken);
_bufferState = 0;
}
}
else if (value == '\n') // detect non escaped \n
{
base.Write("
");
return;
}
base.Write(value);
}
public override void Write(string value)
{
if (_bufferState > 0)
{
if (value == AmpToken)
{
_bufferState++;
return; // hold it
}
else
{
Write('&'); // release what's been held
_bufferState = 0;
}
}
base.Write(value);
}
public override void Write(char[] buffer, int index, int count)
{
if (_bufferState > 2)
{
_bufferState = 0;
base.Write('&'); // release this anyway
string replace;
if ((buffer != null) && ((replace = GetReplaceLength(buffer, index, count)) != null))
{
base.Write(replace);
base.Write(buffer, index + replace.Length, count - replace.Length);
return;
}
else
{
base.Write(AmpToken); // release this
base.Write(';'); // release this
}
}
base.Write(buffer, index, count);
}
public override void Write(char[] buffer)
{
Write(buffer, 0, buffer != null ? buffer.Length : 0);
}
private string GetReplaceLength(char[] buffer, int index, int count)
{
// this is specific to the 10 character but could be adapted
const string token = "#10;";
if ((index + count) < token.Length)
return null;
// we test the char array to avoid string allocations
for(int i = 0; i < token.Length; i++)
{
if (buffer[index + i] != token[i])
return null;
}
return token;
}
}
And you can use it like this:
using (XmlTextWriter w = new XmlTextWriter(new NonXmlEscapingStreamWriter("test.xml")))
{
element.WriteTo(w);
}
NOTE: Although it is capable of detecting lonely \n sequences, I suggest you ensure all \n are actually escaped in your original text, so, you need to replace \n by  before you actually output xml, like this:
string example = "This is a stringWith new lines in it";

Develop Reference

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.

XMLReader Invalid XML Character Exception - c#

Related

ReadOuterXml is throwing OutOfMemoryException reading part of large (1 GB) XML file

using XmlReader to get child nodes without knowing their names(in .net)

Strategy for splitting a large JSON file

C# OpenXml Get DOCX WordStyle Property Simplified Code

XDocument Text Node New Line

Categories

Resources