Can I change XmlReader.Settings? - c#

I have a library with a method that parses XML from a supplied XmlReader. The caller passes me an XmlReader instance (or an instance of any derived class), but I need to make sure whitespace is ignored. That is, I want to do this:
xmlReader.Settings.IgnoreWhitespace = true;
// Then do my parsing
// Finally, revert to whatever state xmlReader.Settings had prior to calling my method
However, if the caller didn't supply an XmlReaderSettings object when creating the XmlReader instance, I don't see how I can fix this myself.
For instance, if the caller used this code:
XmlReader reader = new XmlTextReader(File.OpenRead("file.xml"));
reader.Settings will remain null. This property is read-only so I can't assign it.
I'm not responsible for the caller and I don't force them to use this or that way of getting XmlReader instance and configuring it. I know XmlTextReader is deprecated but it's still available in .NET 4.6 and folks can use it.
Does this mean there is no way to work around this in my library, and that the caller must supply me with an already well-configured XmlReader?

You can wrap the provided XmlReader into a new one using XmlReader.Create():
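public void ReadMyXml(XmlReader reader)
{
    // Clone() so the caller's settings object is never mutated; the object
    // returned by reader.Settings may be read-only, and it is null for
    // readers such as XmlTextReader that were not built with XmlReader.Create().
    XmlReaderSettings settings = reader.Settings?.Clone() ?? new XmlReaderSettings();
    settings.IgnoreWhitespace = true;
    settings.CloseInput = false;
    using (XmlReader myReader = XmlReader.Create(reader, settings))
    {
        // use myReader to read the xml
    }
}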
Set settings.CloseInput = false if you want to avoid closing the original reader at the end (thanks to Jon Hanna for the comment)
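To check that the wrapper really filters whitespace for a reader whose Settings property is null, here is a minimal sketch. The class and method names (WhitespaceDemo, ReadNodeTypes) are made up for illustration, and it assumes C# 6 or later for the ?. operator. It clones any existing settings before changing them, since as far as I know the object returned by XmlReader.Settings cannot be modified directly.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Xml;

public static class WhitespaceDemo
{
    // Hypothetical helper: reads everything through a whitespace-ignoring
    // wrapper and returns the node types that the wrapper reported.
    public static List<XmlNodeType> ReadNodeTypes(XmlReader callerReader)
    {
        // Settings is null for readers not created via XmlReader.Create();
        // otherwise work on a copy so the caller's settings stay untouched.
        XmlReaderSettings settings = callerReader.Settings?.Clone() ?? new XmlReaderSettings();
        settings.IgnoreWhitespace = true;
        settings.CloseInput = false;

        var seen = new List<XmlNodeType>();
        using (XmlReader wrapped = XmlReader.Create(callerReader, settings))
        {
            while (wrapped.Read())
            {
                seen.Add(wrapped.NodeType);
            }
        }
        return seen;
    }
}
```

Calling WhitespaceDemo.ReadNodeTypes(new XmlTextReader(new StringReader("<a>\n  <b/>\n</a>"))) should return only element nodes, with no XmlNodeType.Whitespace entries, even though a bare XmlTextReader reports whitespace by default.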

Related

XML XSD validation. When it occurs

I cannot work out whether XML validation occurs on Load or on Validate. Here is the code:
XmlDocument doc = null;
try
{
    XmlReaderSettings settings = new XmlReaderSettings();
    settings.Schemas.Add("http://xxx/customs/DealFile/Common/ReleaseGoodsMessage",
        ConfigurationManager.AppSettings.Get("Schemas"));
    settings.ValidationType = ValidationType.Schema;
    using (XmlReader reader = XmlReader.Create(path, settings))
    {
        doc = new XmlDocument();
        doc.Load(reader);
    }
    ValidationEventHandler eventHandler = new ValidationEventHandler(ValidationEventHandler);
    doc.Validate(eventHandler);
}
catch (XmlSchemaException xmlErr)
{
    // Do something
}
I expect validation to occur on the line doc.Validate(eventHandler);
However, it always occurs on doc.Load(reader); and I get an exception if something is wrong with the XML.
XMLHelpers.LoadXML(@"C:\work\Xml2Db\Xml2Db\Data\Tests\BadData\01.xml")
Exception thrown: 'System.Xml.Schema.XmlSchemaValidationException' in System.Xml.dll
xmlErr.Message
"The 'http://xxx/customs/DealFile/Common/ReleaseGoodsMessage:governmentProcedureType' element is invalid -
The value 'a' is invalid according to its datatype 'Int' - The string 'a' is not a valid Int32 value."
And this is the code from Microsoft's example https://learn.microsoft.com/en-us/dotnet/api/system.xml.xmldocument.validate?view=netcore-3.1
try
{
    XmlReaderSettings settings = new XmlReaderSettings();
    settings.Schemas.Add("http://www.contoso.com/books", "contosoBooks.xsd");
    settings.ValidationType = ValidationType.Schema;
    XmlReader reader = XmlReader.Create("contosoBooks.xml", settings);
    XmlDocument document = new XmlDocument();
    document.Load(reader);
    ValidationEventHandler eventHandler = new ValidationEventHandler(ValidationEventHandler);
    // the following call to Validate succeeds.
    document.Validate(eventHandler);
    ...
It's essentially the same code.
But note the comment "// the following call to Validate succeeds." They also seem to expect validation to happen on the line document.Validate(eventHandler);
What's going on?
When your code sets up the settings object, it adds a schema and sets the validation type to ValidationType.Schema (i.e. validate against the schema).
When you then create the XmlReader with those settings, the reader itself validates against the schema as it reads, which is what causes your schema-based exception during Load.
The call to document.Validate(eventHandler); is therefore redundant: it will always succeed, because the XML has already been validated. The comment "the following call to Validate succeeds" is correct because the document has already been proved valid.
I suspect that you are failing to distinguish between XML that is well-formed and XML that is valid.
A well-formed XML document satisfies all of the rules of the XML specification. If it does not, you should get a well-formedness error from any XML parser.
If you also choose to
a) supply an XSD that describes your XML document and
b) tell your XML processor to validate against that XSD
then the XML processor will also check that the document satisfies the rules in the XML schema (an XML Schema is composed of one or more XSDs).
If you are still not sure, edit your question and supply the error message(s) that you are seeing. You don't need to include any confidential information - the error template is enough to tell which kind of error it is.
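To make the timing concrete, here is a small sketch in which Load uses a plain, non-validating reader, so only well-formedness is checked at that point, and schema errors surface only when Validate is called. The inline schema, the count element, and the names ValidationTimingDemo and LoadThenValidate are assumptions made up for this illustration; they are not from the question.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Xml;
using System.Xml.Schema;

public static class ValidationTimingDemo
{
    // Hypothetical inline schema: the root element <count> must hold an xs:int.
    private const string Xsd = @"<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>
  <xs:element name='count' type='xs:int'/>
</xs:schema>";

    // Loads without validation, then validates, returning any messages reported.
    public static List<string> LoadThenValidate(string xml)
    {
        var doc = new XmlDocument();

        // No ValidationType.Schema on this reader, so Load only checks
        // that the document is well-formed.
        using (XmlReader reader = XmlReader.Create(new StringReader(xml)))
        {
            doc.Load(reader);
        }

        // Attach the schema to the document rather than to the reader.
        using (XmlReader xsdReader = XmlReader.Create(new StringReader(Xsd)))
        {
            doc.Schemas.Add(null, xsdReader);
        }

        // Validation messages are collected here instead of being thrown.
        var errors = new List<string>();
        doc.Validate((sender, e) => errors.Add(e.Message));
        return errors;
    }
}
```

With this arrangement, LoadThenValidate("<count>a</count>") loads without an exception and reports the datatype problem from Validate, while a malformed input such as "<count>" would still fail in Load with an XmlException.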

Parsing XmlTextWriter object to String

I'm building a class library in C# that uses the XmlTextWriter class to build an XML document, which is then exported as an HTML document.
However when I save the file with a .html extension using the XmlTextWriter object as content, the resulting file only contains the text "System.Xml.XmlTextWriter"
This comes up in the method defined below, specifically in the final line:-
public void SaveAsHTML(string filepath)
{
    XmlTextWriter html;
    html = new XmlTextWriter(@"D:/HTMLWriter/XML/HTMLSaveAsConfig.xml", System.Text.Encoding.UTF8);
    html.WriteStartDocument();
    html.WriteStartElement("html");
    html.WriteRaw(Convert.ToString(Head));
    html.WriteRaw(Convert.ToString(Body));
    html.WriteEndElement();
    html.WriteEndDocument();
    html.Flush();
    html.Close();
    System.IO.File.WriteAllText(filepath, html.ToString());
}
For context, the variables Head and Body are also XmlTextWriter objects containing what will become the <head> and <body> elements of the html file respectively.
I've tried using Convert.ToString(), which causes the same issue.
I'm tempted to try overriding the ToString() method for my class as a fix, potentially using the XmlSerializer class. However I was wondering if there's a less noisy way of returning the Xml object as a string?
The following function will extract the string from the System.Xml.XmlTextWriter objects you've created as Head and Body.
private string XmlTextWriterToString(XmlTextWriter writer)
{
    // Ensure underlying stream is flushed.
    writer.Flush();
    // Reset position to beginning of stream.
    writer.BaseStream.Position = 0;
    using (var reader = new StreamReader(writer.BaseStream))
    {
        // Read and return content of stream as a single string
        var result = reader.ReadToEnd();
        return result;
    }
}
One caveat here is that the underlying System.IO.Stream object associated with the System.Xml.XmlTextWriter must support both 'read' and 'seek' operations (i.e., the Stream.CanRead and Stream.CanSeek properties must both return true).
I've made the following edits to your original code:
Replaced the Convert.ToString() calls with calls to this new function.
Made an assumption that you're intending to write to the file specified by the filepath parameter to your SaveAsHTML() function, and not the hard-coded path.
Wrapped the creation (and use, and disposal) of the System.Xml.XmlTextWriter in a using block (if you're not familiar, see What are the uses of “using” in C#?).
Following is your code with those changes.
public void SaveAsHTML(string filepath)
{
    using (var html = new XmlTextWriter(filepath, System.Text.Encoding.UTF8))
    {
        html.WriteStartDocument();
        html.WriteStartElement("html");
        html.WriteRaw(XmlTextWriterToString(Head));
        html.WriteRaw(XmlTextWriterToString(Body));
        html.WriteEndElement();
        html.WriteEndDocument();
        html.Flush();
        html.Close();
    }
}
Another thing of which to be mindful is that, not knowing for sure from the code provided how they're being managed, the lifetimes of Head and Body are subject to the same exception-based resource leak potential that html was before wrapping it in the using block.
A final thought: the page for System.Xml.XmlTextWriter notes the following: Starting with the .NET Framework 2.0, we recommend that you create XmlWriter instances by using the XmlWriter.Create method and the XmlWriterSettings class to take advantage of new functionality.
The last line writes the value of XmlTextWriter.ToString(), which does not return the text representation of the XML you wrote. Try leaving off the last line, it looks like your XmlTextWriter is already writing to a file.
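If the goal is simply to get the markup back as a string, one option is to point the writer at a StringWriter instead of a file and call ToString() on the StringWriter, not on the XML writer. The sketch below assumes a made-up helper (HtmlStringDemo.BuildHtml) and a made-up document shape. It uses XmlWriter.Create rather than the XmlTextWriter constructor.

```csharp
using System.IO;
using System.Xml;

public static class HtmlStringDemo
{
    // Hypothetical helper: builds a tiny HTML-like document in memory
    // and returns the markup as a string.
    public static string BuildHtml(string title, string bodyText)
    {
        var buffer = new StringWriter();
        // Omit the <?xml ...?> declaration, which is not wanted in an HTML file.
        var settings = new XmlWriterSettings { OmitXmlDeclaration = true };

        using (XmlWriter writer = XmlWriter.Create(buffer, settings))
        {
            writer.WriteStartElement("html");
            writer.WriteStartElement("head");
            writer.WriteElementString("title", title);
            writer.WriteEndElement(); // head
            writer.WriteStartElement("body");
            writer.WriteElementString("p", bodyText);
            writer.WriteEndElement(); // body
            writer.WriteEndElement(); // html
        }

        // The StringWriter holds the text; the XmlWriter's own ToString()
        // would only return its type name.
        return buffer.ToString();
    }
}
```

For example, HtmlStringDemo.BuildHtml("Hi", "Hello") returns <html><head><title>Hi</title></head><body><p>Hello</p></body></html>, which can then be passed to File.WriteAllText.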
@PhilBrubaker's solution seems to be on the right track. There are still a few bugs in my code that I'm working towards getting a fix for, but the good news is that the casting seems to be working now.
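protected string XmlToString(XmlWriter xmlBody)
{
    // Throws InvalidCastException if xmlBody is not actually an XmlTextWriter.
    XmlTextWriter textXmlBody = (XmlTextWriter)xmlBody;
    textXmlBody.BaseStream.Position = 0;
    // The using block disposes the reader, so no explicit Dispose() call is needed.
    using (var reader = new StreamReader(textXmlBody.BaseStream))
    {
        var result = reader.ReadToEnd();
        return result;
    }
}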
I've changed the input parameter type to XmlWriter and cast it explicitly to XmlTextWriter inside the method, so that the method also works when the Create() method is used instead of a constructor, as recommended since .NET 2.0. It's not 100% reliable at the moment, as an XmlWriter doesn't always cast correctly to XmlTextWriter (depending on the features), but that's out of scope for this thread and I'm investigating it separately.
Thanks for your help!
On a side note, the using block is something I haven't come across before, but it's provided so many solutions across the board for me. So thanks for that too!

Serialize to an XML document without overwriting previous data

I need to serialize to an XML document without overwriting the data that is currently in there. I have a method that does this and it will save to the xml file, but will delete whatever is currently in that file upon serializing. Below is the code.
public void SaveSubpart()
{
    SOSDocument doc = new SOSDocument();
    doc.ID = 1;
    doc.Subpart = txtSubpart.Text;
    doc.Title = txtTitle.Text;
    doc.Applicability = txtApplicability.Text;
    doc.Training = txtTraining.Text;
    doc.URL = txtUrl.Text;
    StreamWriter writer = new StreamWriter(Server.MapPath("~/App_Data/Contents.xml"));
    System.Xml.Serialization.XmlSerializer serializer;
    try
    {
        serializer = new System.Xml.Serialization.XmlSerializer(doc.GetType());
        serializer.Serialize(writer, doc);
    }
    catch (Exception ex)
    {
        // e-mail admin - serialization failed
    }
    finally
    {
        writer.Close();
    }
}
The contract for the StreamWriter constructor taking only a filename says that if the named file exists, it is overwritten. So this has nothing to do with serializing to XML, per se. You would get the same result if you wrote to the stream through some other means.
The way to do what you are looking for is to read the old XML file into memory, make whatever changes are necessary, and then serialize and write the result to disk.
And even if it was possible to transparently modify an on-disk XML file, that's almost certainly what would happen under the hood because it's the only way to really do it. Yes, you probably could fiddle around with seeking and writing directly on disk, but what if something caused the file to change on disk while you were doing that? If you do the read/modify/write sequence, then you lose out on the changes that were made after you read the file into memory; but if you modify the file directly on disk by seeking and writing, you would be almost guaranteed to end up with the file in an inconsistent state.
And of course, you could only do it if you could fit whatever changes you wanted to make into the bytes that were already on disk...
If concurrency is a problem, either use file locking or use a proper database with transactional support.
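As a rough sketch of the read/modify/write sequence described above: deserialize the existing list (if the file exists), add the new item, and serialize the whole list back. SubpartEntry is a made-up, simplified stand-in for the question's SOSDocument, and SubpartStore.Append is a hypothetical helper. The sketch also assumes the file stores a list of entries rather than a single document, and it does no locking.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Xml.Serialization;

// Hypothetical, simplified stand-in for the question's SOSDocument class.
public class SubpartEntry
{
    public int ID { get; set; }
    public string Title { get; set; }
}

public static class SubpartStore
{
    private static readonly XmlSerializer Serializer =
        new XmlSerializer(typeof(List<SubpartEntry>));

    // Adds one entry to the XML file at 'path' and returns the new entry count.
    public static int Append(string path, SubpartEntry entry)
    {
        List<SubpartEntry> entries;

        // Read: pull the existing entries into memory, if there are any.
        if (File.Exists(path))
        {
            using (FileStream input = File.OpenRead(path))
            {
                entries = (List<SubpartEntry>)Serializer.Deserialize(input);
            }
        }
        else
        {
            entries = new List<SubpartEntry>();
        }

        // Modify: add the new entry to the in-memory list.
        entries.Add(entry);

        // Write: the file is overwritten, but with the complete list.
        using (var output = new StreamWriter(path))
        {
            Serializer.Serialize(output, entries);
        }
        return entries.Count;
    }
}
```

Calling Append twice with the same path leaves a single well-formed document containing both entries, which plain appending to the file would not give you.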
Try this:
StreamWriter writer = new StreamWriter(Server.MapPath("~/App_Data/Contents.xml"), true);
The second argument tells the StreamWriter whether to append to the file:
true = append,
false = overwrite
More info: http://msdn.microsoft.com/en-us/library/36b035cb.aspx
So what you want is to serialize an object without overwriting the existing file. You can serialize to a string first:
XmlSerializer s = new XmlSerializer(doc.GetType());
TextWriter w = new StringWriter();
s.Serialize(w, doc);
var yourXMLstring = w.ToString();
Then you can process this xml string and append it to existing xml file if you want to.
XmlDocument xml = new XmlDocument();
xml.LoadXml(yourXMLstring );

Why does changing an XML schema and re-validating cause a memory leak?

I recently uncovered a memory leak in an application I maintain for work, and I'm confused as to why the code produces a leak. I've pulled out the relevant code (with slight modifications) and provided it below.
In our application, a given XML document could validate against one or more available schema files. Each schema file corresponds to a different version of the XML document as it has changed over time. We only care that the XML document validates against at least one schema. Each schema completely describes the contents of the XML document (they are not nested schema files).
According to the ANTS memory profiler, it looks like the XmlDocument object is hoarding references to the previous schemas, even after the schema set has been cleared. Commenting out the call to Validate(), leaving everything else the same, will stop the leak.
I fixed the leak in our application by loading the schemas once at application initialization time, and swapping out which schema file is associated with the XML document until we find one that validates.
The code below produces the memory leak, and I'm not sure why.
class Program
{
    private static XmlDocument xmlDocument_ = new XmlDocument();

    static void Main(string[] args)
    {
        using (StreamReader reader = new StreamReader("contents.xml"))
        {
            xmlDocument_.LoadXml(reader.ReadToEnd());
        }
        XmlReaderSettings xmlReaderSettings = new XmlReaderSettings();
        xmlReaderSettings.CloseInput = true;
        while (true)
        {
            xmlDocument_.Schemas = new XmlSchemaSet();
            XmlReader xmlReader = XmlReader.Create("schema.xsd", xmlReaderSettings);
            xmlDocument_.Schemas.Add(XmlSchema.Read(xmlReader, null));
            xmlReader.Close();
            xmlDocument_.Validate(null);
        }
    }
}
You have the memory leak because your XmlDocument reference is static and because of the SchemaInfo property, which is populated when you validate your XML. Since those properties hold references to objects from your compiled XSDs, you'll have those around for as long as you have the XmlDocument around, which could be quite a while (since it is static).
Some people may argue if indeed this is a leak or not: validating another XML with another set of XSDs will release previously held resources.
Try changing the while statement as below. I haven't tested this but it differs from the original code in that every while iteration disposes of the XmlReader.
The GC may eventually clean up the XmlReader instances, but I would not rely on it: XmlReader implements IDisposable, so code that uses an XmlReader should dispose of it deterministically (garbage collection is non-deterministic). Even if the GC does reclaim them, the loop may iterate thousands of times before that happens, and the memory used in the meantime could cripple the system anyway.
while (true)
{
    xmlDocument_.Schemas = new XmlSchemaSet();
    using (XmlReader xmlReader = XmlReader.Create("schema.xsd", xmlReaderSettings))
    {
        xmlDocument_.Schemas.Add(XmlSchema.Read(xmlReader, null));
    }
    xmlDocument_.Validate(null);
}
EDIT:
I read the MSDN page on XmlDocument.Validate, which provides a code sample that does this differently, using XmlReaderSettings to set validation options. Also, code in the OP assumes that the XML file is always encoded as UTF-8. Here's a rewrite that detects the text encoding and is based on the MSDN sample; this may fix the memory leak. This code is untested.
class Program
{
    private static XmlDocument xmlDocument_ = new XmlDocument();

    static void Main(string[] args)
    {
        XmlReaderSettings settings = new XmlReaderSettings();
        settings.ValidationType = ValidationType.Schema;
        settings.CloseInput = true;
        xmlDocument_.Load(XmlReader.Create("contents.xml", settings));
        while (true)
        {
            settings.Schemas = new XmlSchemaSet();
            settings.Schemas.Add(null, "schema.xsd");
            xmlDocument_.Validate(null);
        }
    }
}
You could try ILDASM to see what's inside XmlDocument.Validate.
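The workaround mentioned in the question (load each schema once, then swap which one is attached to the document until one validates) could look roughly like the sketch below. This is an illustration under assumptions, not the asker's actual code: SchemaSwapDemo, Compile and FirstMatchingSchema are made-up names, and the schemas are passed as strings to keep the example self-contained.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Xml;
using System.Xml.Schema;

public static class SchemaSwapDemo
{
    // Builds and compiles a schema set once, e.g. at application start-up.
    public static XmlSchemaSet Compile(string xsd)
    {
        var set = new XmlSchemaSet();
        using (XmlReader reader = XmlReader.Create(new StringReader(xsd)))
        {
            set.Add(null, reader);
        }
        set.Compile();
        return set;
    }

    // Returns the index of the first preloaded schema set the document
    // validates against, or -1 if none of them match.
    public static int FirstMatchingSchema(XmlDocument doc, IList<XmlSchemaSet> preloaded)
    {
        for (int i = 0; i < preloaded.Count; i++)
        {
            bool valid = true;
            // Swap in an already-compiled set instead of re-reading the XSD.
            doc.Schemas = preloaded[i];
            doc.Validate((sender, e) =>
            {
                if (e.Severity == XmlSeverityType.Error)
                {
                    valid = false;
                }
            });
            if (valid)
            {
                return i;
            }
        }
        return -1;
    }
}
```

Because the schema sets are created once and reused, repeated validation does not keep producing new compiled schema objects for the document to reference.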

Prevent DTD download when parsing XML

When using XmlDocument.Load , I am finding that if the document refers to a DTD, a connection is made to the provided URI. Is there any way to prevent this from happening?
After some more digging, maybe you should set the XmlResolver property of the XmlReaderSettings object to null.
'The XmlResolver is used to locate and open an XML instance document, or to locate and open any external resources referenced by the XML instance document. This can include entities, DTD, or schemas.'
So the code would look like this:
XmlReaderSettings settings = new XmlReaderSettings();
settings.XmlResolver = null;
settings.DtdProcessing = DtdProcessing.Parse;
XmlDocument doc = new XmlDocument();
using (StringReader sr = new StringReader(xml))
using (XmlReader reader = XmlReader.Create(sr, settings))
{
    doc.Load(reader);
}
The document being loaded HAS a DTD.
With:
settings.ProhibitDtd = true;
I see the following exception:
Service cannot be started. System.Xml.XmlException: For security reasons DTD is prohibited in this XML document. To enable DTD processing set the ProhibitDtd property on XmlReaderSettings to false and pass the settings into XmlReader.Create method.
So, it looks like ProhibitDtd MUST be set to false in this instance.
It looked like ValidationType would do the trick, but with:
settings.ValidationType = ValidationType.None;
I'm still seeing a connection to the DTD uri.
This is actually a flaw in the XML specifications. The W3C is bemoaning that people all hit their servers like mad to load schemas billions of times. Unfortunately just about no standard XML library gets this right, they all hit the servers over and over again.
The problem with DTDs is particularly serious, because DTDs may include general entity declarations (for things like &amp; -> &) which the XML file may actually rely upon. So if your parser chooses to forgo loading the DTD, and the XML makes use of general entity references, parsing may actually fail.
The only solution to this problem would be a transparent caching entity resolver, which would put the downloaded files into some archive in the library search path, so that this archive would be dynamically created and almost automatically bundled with any software distributions made. But even in the Java world there is not one decent such EntityResolver floating about, certainly not built-in to anything from apache foundation.
Try something like this:
XmlDocument doc = new XmlDocument();
using (StringReader sr = new StringReader(xml))
using (XmlReader reader = XmlReader.Create(sr, new XmlReaderSettings()))
{
    doc.Load(reader);
}
The thing to note here is that XmlReaderSettings has the ProhibitDtd property set to true by default.
Use an XmlReader to load the document and set the ValidationType property of the reader settings to None.
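Pulling the suggestions in this thread together, a sketch along the following lines should load a document that has a DOCTYPE without contacting the DTD's URI. It assumes .NET 4.0 or later, where DtdProcessing supersedes ProhibitDtd, and the names DtdDemo and LoadWithoutFetchingDtd are made up. Note that with DtdProcessing.Ignore the DTD is skipped entirely, so a document that relies on entities declared in the DTD would fail to parse, as pointed out above.

```csharp
using System.IO;
using System.Xml;

public static class DtdDemo
{
    // Hypothetical helper: loads XML text that may contain a DOCTYPE
    // without resolving (downloading) the external DTD.
    public static XmlDocument LoadWithoutFetchingDtd(string xml)
    {
        var settings = new XmlReaderSettings
        {
            // Skip the DOCTYPE instead of throwing or parsing it.
            DtdProcessing = DtdProcessing.Ignore,
            // No resolver, so no external resource can be fetched.
            XmlResolver = null
        };

        // Also clear the document's own resolver as a precaution.
        var doc = new XmlDocument { XmlResolver = null };

        using (var sr = new StringReader(xml))
        using (XmlReader reader = XmlReader.Create(sr, settings))
        {
            doc.Load(reader);
        }
        return doc;
    }
}
```

If the DTD's entity declarations are actually needed, DtdProcessing.Parse combined with a custom XmlResolver that serves a local copy of the DTD is the usual alternative.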
