DTD prohibited in xml document exception

DTD prohibited in xml document exception - c#

I'm getting this error when trying to parse through an XML document in a C# application:
"For security reasons DTD is prohibited in this XML document. To enable DTD processing set the ProhibitDtd property on XmlReaderSettings to false and pass the settings into XmlReader.Create method."
For reference, the exception occurred at the second line of the following code:
using (XmlReader reader = XmlReader.Create(uri))
{
reader.MoveToContent(); //here
while (reader.Read()) //(code to parse xml doc follows).
My knowledge of Xml is pretty limited and I have no idea what DTD processing is nor how to do what the error message suggests. Any help as to what may be causing this and how to fix it? thanks...

First, some background.
What is a DTD?
The document you are trying to parse contains a document type declaration; if you look at the document, you will find near the beginning a sequence of characters beginning with <!DOCTYPE and ending with the corresponding >. Such a declaration allows an XML processor to validate the document against a set of declarations which specify a set of elements and attributes and constrain what values or contents they can have.
Since entities are also declared in DTDs, a DTD allows a processor to know how to expand references to entities. (The entity pubdate might be defined to contain the publication date of a document, like "15 December 2012", and referred to several times in the document as &pubdate; -- since the actual date is given only once, in the entity declaration, this usage makes it easier to keep the various references to publication date in the document consistent with each other.)
What does a DTD mean?
The document type declaration has a purely declarative meaning: a schema for this document type, in the syntax defined in the XML spec, can be found at such and such a location.
Some software written by people with a weak grasp of XML fundamentals suffers from an elementary confusion about the meaning of the declaration; it assumes that the meaning of the document type declaration is not declarative (a schema is over there) but imperative (please validate this document). The parser you are using appears to be such a parser; it assumes that by handing it an XML document that has a document type declaration, you have requested a certain kind of processing. Its authors might benefit from a remedial course on how to accept run-time parameters from the user. (You see how hard it is for some people to understand declarative semantics: even the creators of some XML parsers sometimes fail to understand them and slip into imperative thinking instead. Sigh.)
What are these 'security reasons' they are talking about?
Some security-minded people have decided that DTD processing (validation, or entity expansion without validation) constitutes a security risk. Using entity expansion, it's easy to make a very small XML data stream which expands, when all entities are fully expanded, into a very large document. Search for information on what is called the "billion laughs attack" if you want to read more.
One obvious way to protect against the billion laughs attack is for those who invoke a parser on user-supplied or untrusted data to invoke the parser in an environment which limits the amount of memory or time the parsing process is allowed to consume. Such resource limits have been standard parts of operating systems since the mid-1960s. For reasons that remain obscure to me, however, some security-minded people believe that the correct answer is to run parsers on untrusted input without resource limits, in the apparent belief that this is safe as long as you make it impossible to validate the input against an agreed schema.
This is why your system is telling you that your data has a security issue.
To some people, the idea that DTDs are a security risk sounds more like paranoia than good sense, but I don't believe they are correct. Remember (a) that a healthy paranoia is what security experts need in life, and (b) that anyone really interested in security would insist on the resource limits in any case -- in the presence of resource limits on the parsing process, DTDs are harmless. The banning of DTDs is not paranoia but fetishism.
Now, with that background out of the way ...
How do you fix the problem?
The best solution is to complain bitterly to your vendor that they have been suckered by an old wive's tale about XML security, and tell them that if they care about security they should do a rational security analysis instead of prohibiting DTDs.
Meanwhile, as the message suggests, you can "set the ProhibitDtd property on XmlReaderSettings to false and pass the settings into XmlReader.Create method." If the input is in fact untrusted, you might also look into ways of giving the process appropriate resource limits.
And as a fallback (I do not recommend this) you can comment out the document type declaration in your input.

Note that settings.ProhibitDtd is now obsolete, use DtdProcessing instead: (new options of Ignore, Parse, or Prohibit)
XmlReaderSettings settings = new XmlReaderSettings();
settings.DtdProcessing = DtdProcessing.Parse;
and as stated in this post: How does the billion laughs XML DoS attack work?
you should add a limit to the number of characters to avoid DoS attacks:
XmlReaderSettings settings = new XmlReaderSettings();
settings.DtdProcessing = DtdProcessing.Parse;
settings.MaxCharactersFromEntities = 1024;

As far as fixing this, with a bit of looking around I found it was as simple as adding:
XmlReaderSettings settings = new XmlReaderSettings();
settings.ProhibitDtd = false;
and passing these settings into the create method.
[UPDATE 3/9/2017]
As some have pointed out, .ProhibitDTDT is now deprecated. Dr. Aaron Dishno's answer, below, shows the superseding solution

After trying all of the above answers without success I changing the service user from service#mydomain.com to service#mydomain.onmicrosoft.com and now the app works correctly while running in azure.
Alternatively if you run into this problem in an environment you have more control over; you can paste the following into your hosts file:
127.0.0.1 msoid.onmicrosoft.com
127.0.0.1 msoid.mydomain.com
127.0.0.1 msoid.mydomain.onmicrosoft.com
127.0.0.1 msoid.*.onmicrosoft.com

Related

C# validating XML with XSD without loading unvalidated XSD?

A security scan of our C# source reported "Missing XML Validation" as a possible injection flaw. It cited https://cwe.mitre.org/data/definitions/112.html and other sources.
Its recommendation was:
Always enable validation when you parse XML. If enabling validation
causes problems because the rules for defining a well-formed document
are Byzantine or altogether unknown, chances are good that there are
security errors nearby.
Example: The following code demonstrates how to enable validation when using XmlReader.XmlReader
Settings settings = new XmlReaderSettings();
settings.Schemas.Add(schema);
settings.ValidationType = ValidationType.Schema;
StringReader sr = new StringReader(xmlDoc);
XmlReader reader = XmlReader.Create(sr, settings);
I have an XSD schema available for validation. My question is, how do I load the XSD as an XmlSchema without duplicating the error of loading an XML file without validation?
If I read the XSD from the file system, I think I am just duplicating the same error (reading XML without validation). Is there a recommended way to do this?
Our first approach was to read the XSD from the file system, like:
XmlTextReader xsdReader = new XmlTextReader("MySchema.xsd"));
XmlSchema schema = XmlSchema.Read(xsdReader, ValidationCallback);
But, I believe this causes the same error, reading the XML (in this case the XSD) without validation.
The approach that we are using now (that I think will pass the security scan) is to load the XSD from an embedded resource.
Stream xsdStream = Assembly.GetAssembly(typeof(MyType))
.GetManifestResourceStream("MyNamespace.MySchema.xsd");
if (xsdStream == null) throw ...
XmlSchema schema = XmlSchema.Read(xsdStream, ValidationCallback);
We have not rescanned yet, but I suspect the embedded resource approach will pass. But, is there recommended or best practice approach to this?

Anyone who can write the phrase "If enabling validation causes problems because the rules for defining a well-formed document are Byzantine" is revealing that they know very little about XML; it seems they don't understand the difference between being valid and being well-formed, which is pretty fundamental. So you're having to find ways of getting around rules that aren't very smart. At this point you have to decide whether your objective is to make the system more secure, or to pass the security tests.
It's very hard to see what security vulnerabilities will be fixed by enabling validation.
Especially as you can write a schema that accepts any document as valid, and I bet your security tool will be happily content that you are obeying the rules even though you haven't increased security one iota.
When a schema processor loads a schema then it automatically validates that it is a valid schema. So there really isn't any risk. But whether your security scanner accepts that there isn't any risk is another matter entirely.

XmlTextReader vs. XDocument

I'm in the position to parse XML in .NET. Now I have the choice between at least XmlTextReader and XDocument. Are there any comparisons between those two (or any other XML parsers contained in the framework)?
Maybe this could help me to decide without trying both of them in depth.
The XML files are expected to be rather small, speed and memory usage are a minor issue compared to easiness of use. :-)
(I'm going to use them from C# and/or IronPython.)
Thanks!

If you're happy reading everything into memory, use XDocument. It'll make your life much easier. LINQ to XML is a lovely API.
Use an XmlReader (such as XmlTextReader) if you need to handle huge XML files in a streaming fashion, basically. It's a much more painful API, but it allows streaming (i.e. only dealing with data as you need it, so you can go through a huge document and only have a small amount in memory at a time).
There's a hybrid approach, however - if you have a huge document made up of small elements, you can create an XElement from an XmlReader positioned at the start of the element, deal with the element using LINQ to XML, then move the XmlReader onto the next element and start again.

XmlTextReader is kind of deprecated, do not use it.
From msdn blogs by XmlTeam
Effective Xml Part 1: Choose the right API
Avoid using XmlTextReader. It contains quite a few bugs that could not be fixed without breaking existing applications already using it.
The world has moved on, have you? Xml APIs you should avoid using.
Obsolete APIs are easy since the compiler helps identifying them but there are two more APIs you should avoid using – namely XmlTextReader and XmlTextWriter. We found a number of bugs in these classes which we could not fix without breaking existing applications. The easy route would be to deprecate these classes and ask people to use replacement APIs instead. Unfortunately these two classes cannot be marked as obsolete because they are part of ECMA-335 (Common Language Infrastructure) standard (http://www.ecma-international.org/publications/standards/Ecma-335.htm) – the companion CLILibrary.xml file which is a part of Partition IV).
The good news is that even though these classes are not deprecated there are replacement APIs for these in .NET Framework already and moving to them is relatively easy. First it is necessary to find the places where XmlTextReader or XmlTextWriter is being used (unfortunately it is a manual step). Now all the occurrences of XmlTextReader should be replaced with XmlReader and all the occurrences of XmlTextWriter should be replaced with XmlWriter (note that XmlTextReader derives from XmlReader and XmlTextWriter derives from XmlWriter so the app can already be using these e.g. as formal parameters). The last step is to change the way the XmlReader/XmlWriter objects are instantiated – instead of creating the reader/writer directly it is necessary to the static factory method .Create() present on both XmlReader and XmlWriter APIs.
Furthermore, intellisense in Visual Studio doesn't list XmlTextReader under System.Xml namespace. The class is defined as:
[EditorBrowsable(EditorBrowsableState.Never)]
public class XmlTextReader : XmlReader, IXmlLineInfo, IXmlNamespaceResolver
The XmlReader.Create factory methods return other internal implementations of the abstract class XmlReader depending on the settings passed.
For forward-only streaming API (i.e. that doesn't load the entire thing into memory), use XmlReader via XmlReader.Create method.
For an easier API to work with, go for XDocument aka LINQ To XML. Find XDocument vs XmlDocument here and here.

Should I use a Namespace of an XML file to identify its version

I'm using DataContractSerializer to serialize a class with DataContract and DataMember attributes to an XML file. My class could potentially change later, and thus the format of the serialized files could also change. I'd like to tag the files I'm saving with a version number so I at least know what version each file is from. I'm still deciding how and if I want to add functionality that will migrate files in older formats to later formats. But right now I'd be happy with just identifying a version mismatch.
Is the namespace of the XML file the correct place to store the version of the file? I was thinking of attributing my class with a DataContract attributes as follows.
[DataContract(Name="MyClass",Namespace="http://www.mycompany.com/MyProject/1.0
public class MyClass
...
Then later if MyClass changes I would change the namespace...
[DataContract(Name="MyClass",Namespace="http://www.mycompany.com/MyProject/2.0)]
public class MyClass
...
Is this the correct usage of XML namespaces, or is there another more prefered way to save the version of an XML file?

You can do it this way, but then the XML representation of your data becomes completely different from version to version from XML Infoset point of view (in which namespace is the part of the qualified name of the element), so you have neither backwards nor forwards compatibility.
Now, one advantage XML has is that it can be easily processed in a forward-compatible way with technologies such as XPath and XSLT - you just pick the elements you can interpret, and leave anything you don't recognize as is. But this requires elements with the same meaning to retain the same name (including namespace) between versions.
In general, it is best to make your schemas forward-compatible. If you can't achieve that, you might still want to provide as much compatibility as possible with existing tools (it is often easier to achieve compatibility against tools which only read data, rather than with those which also write it). Consequently, you avoid storing version number in such cases, and just try to parse whatever you're given, signalling an error if the input is definitely malformed.
If you come to the point where you absolutely must break compatibility in both directions and start from a clean slate, the suggested way of handling this for WCF data contracts is indeed by changing the namespace, as described in best practices on data contract versioning. There are a few minor variations there as well, such as using publication date instead of version number in the URL (W3C is quite fond of this for their schemas), but these are mostly stylistic.

Where can I find a list of all possible messages that an XmlException can contain?

I'm writing an XML code editor and I want to display syntax errors in the user interface. Because my code editor is strongly constrained to a particular problem domain and audience, I want to rewrite certain XMLException messages to be more meaningful for users. For instance, an exception message like this:
'"' is an unexpected token. The
expected token is '='. Line 30,
position 35
.. is very technical and not very informative to my audience. Instead, I'd like to rewrite it and other messages to something else. For completeness' sake that means I need to build up a dictionary of existing messages mapped to the new message I would like to display instead. To accomplish that I'm going to need a list of all possible messages XMLException can contain.
Is there such a list somewhere? Or can I find out the possible messages through inspection of objects in C#?
Edit: specifically, I am using XmlDocument.LoadXml to parse a string into an XmlDocument, and that method throws an XmlException when there are syntax errors. So specifically, my question is where I can find a list of messages applied to XmlException by XmlDocument.LoadXml. The discussion about there potentially being a limitless variation of actual strings in the Message property of XmlException is moot.
Edit 2: More specifically, I'm not looking for advice as to whether I should be attempting this; I'm just looking for any clues to a way to obtain the various messages. Ben's answer is a step in the right direction. Does anyone know of another way?

Technically there is no such thing, any class that throws an XmlException can set the message to any string. Really it depends on which classes you are using, and how they handle exceptions. It is perfectly possible you may be using a class that includes context specific information in the message, e.g. info about some xml node or attribute that is malformed. In that case the number of unqiue message strings could be infinite depending on the XML that was being processed. It is equally possible that a particular class does not work in this way and has a finite number of messages that occur under specific circumstances. Perhaps a better aproach would be to use try/catch blocks in specific parts of your code, where you understand the processing that is taking place and provide more generic error messages based on what is happening. E.g. in your example you could simply look at the line and character number and produce an error along the lines of "Error processing xml file LineX CharacterY" or even something as general as "error processing file".
Edit:
Further to your edit i think you will have trouble doing what you require. Essentially you are trying to change a text string to another text string based on certain keywords that may be in the string. This is likely to be messy and inconsistent. If you really want to do it i would advise using something like Redgate .net Reflector to reflect out the loadXML method and dig through the code to see how it handles different kinds of syntax errors in the XML and what kind of messages it generates based on what kind of errors it finds. This is likely to be time consuming and dificult. If you want to hide the technical errors but still provide useful info to the user then i would still recomend ignoring the error message and simply pointing the user to the location of the problem in the file.

Just my opinion, but ... spelunking the error messages and altering them before displaying them to the user seems like a really misguided idea.
First, The messages are different for each international language. Even if you could collect them for English, and you're willing to pay the cost, they'll be different for other languages.
Second, even if you are dealing with a single language, there's no way to be sure that an external package hasn't injected a novel XmlException into the scope of LoadXml.
Last, the list of messages is not stable. It may change from release to release.
A better idea is to just emit an appropriate message from your own app, and optionally display -- maybe upon demand -- the original error message contained in the XmlException.

Serialization for document storage

I write a desktop application that can open / edit / save documents.
Those documents are described by several objects of different types that store references to each other. Of course there is a Document class that that serves as the root of this data structure.
The question is how to save this document model into a file.
What I need:
Support for recursive structures.
It must be able to open files even if they were produced from slightly different classes. My users don't want to recreate every document after every release just because I added a field somewhere.
It must deal with classes that are not known at compile time (for plug-in support).
What I tired so far:
XmlSerializer -> Fails the first and last criteria.
BinarySerializer -> Fails the second criteria.
DataContractSerializer: Similar to XmlSerializer but with support for cyclic (recursive) references. Also it was designed with (forward/backward) compatibility in mind: Data Contract Versioning. [edit]
NetDataContractSerializer: While the DataContractSerializer still requires to know all types in advance (i.e. it can't work very well with inheritance), NetDataContractSerializer stores type information in the output. Other than that the two seem to be equivalent. [edit]
protobuf-net: Didn't have time to experiment with it yet, but it seems similar in function to DataContractSerializer, but using a binary format. [edit]
Handling of unknown types [edit]
There seem two be two philosophies about what to do when the static and dynamic type differ (if you have a field of type object but a, lets say, Person-object in it). Basically the dynamic type must somehow get stored in the file.
Use different XML tags for different dynamic types. But since the XML tag to be used for a particular class might not be equal to the class name, its only possible to go this route if the deserializer knows all possible types in advance (so that he can scan them for attributes).
Store the CLR type (class name, assembly name & version) during serialization. Use this info during deserialization to instantiate the right class. The types must not be known prior to deserialization.
The second one is simpler to use, but the resulting file will be CLR dependent (and less sensitive to code modifications). Thats probably why XmlSerializer and DataContractSerializer choose the first way. NetDataContractSerializer is not recomended because its using the second approch (So does BinarySerializer by the way).
Any ideas?

The one you haven't tried is DataContractSerializer. There is a constructor that takes a parameter bool preserveObjectReferences that should handle the first criteria.

The WCF data contract serializer is probably closest to your needs, although not perfect.
There is only limited support for backwards compatibility (i.e. whether old versions of the program can read documents generated with a newer version). New fields are supported (via IExtensibleDataObject), but new classes or new enum values not.

I would think the XmlSerializer is your best bet. You won't be able to support everything on your requirements list without a bit of work in your Document classes - but the XmlSerializer architecture gives you extensibility points which should allow you to tap into its mechanism deep enough to do just about anything.
Using the IXmlSerializable interface - by implementing that on your classes you want to store - you should be able to do just about anything, really.
The interface exposes basically two methods - ReadXml And WriteXml
public void WriteXml (XmlWriter writer)
{
// do what you need to do to write out your XML for this object
}
public void ReadXml (XmlReader reader)
{
// do what you need to do to read your object from XML
}
Using these two methods, you should be able to capture the necessary state information from just about any object you might want to store, and turn it into XML that can be persisted to disk - and deserialized back into an object when the time comes!

XmlSerializer can work for your first criteria, however you must provide the recursion for objects like the TreeView control.
BinaryFormatter can work for all 3 criteria. If a class changes, you may have to create a conversion tool to convert old format documents to a new format. Or recognize an older format, deserialize to the old, and then save to the new - keeping your old class format around for a little while.
This will help cover version tolerance which is what I think you're after: MSDN - Version Tolerant Serialization

Develop Reference

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.