Validating and repairing XML - C#

Is there a way to get more useful information about a validation error? XmlSchemaException provides the line number and position of the error, which makes little sense to me. An XML document, after all, is not about its transient textual representation. I'd like to get an enumerated error (or an error code) specifying what went wrong, and a node name (or an XPath) to locate the source of the problem, so that perhaps I can try to fix it.
Edit: I'm talking about valid XML documents - just not valid against a particular schema!

In my experience, you are lucky to get a line number and parse position.

You might consider validating via a DTD, which can sometimes give slightly more interesting errors. However, on a project I currently work on, we validate using XSLTs. The transform checks the syntax and reports errors as transform output text. I would consider that route if you want more friendly error checking. For us, an empty output means no errors; otherwise we get some nice detail from the XSLT processing on what the error was and where.

You can accomplish this, sort of, by setting up an XmlReader whose XmlReaderSettings contain the schema and then using it to read through the input stream node by node. You can keep track of the last node read and have a pretty good idea of where you are in the document when a validation error happens.
I think that if you try this exercise, you'll discover that there are a lot of validation errors (e.g. required element missing) where the concept of the error node doesn't make much sense. Yes, the parent element is clearly what's in error in that case, but what really triggered the error was the reader encountering the end tag without ever seeing the required element, which is why the error line and position point at the end tag.
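
A minimal sketch of the reader-based approach described above, assuming the schema and the document are both available as strings. The class name PathTrackingValidator and its Validate method are hypothetical names used for illustration. The handler runs while the reader is still advancing, so the path it reports is that of the elements already open when the problem was detected, which for a missing required element is the parent whose end tag triggered the error.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;
using System.Xml.Schema;

public static class PathTrackingValidator
{
    // Returns one message per validation problem, prefixed with the path of
    // the elements that were open when the problem was reported.
    public static List<string> Validate(string xml, string xsd)
    {
        var errors = new List<string>();
        var path = new List<string>();   // names of the currently open elements

        var settings = new XmlReaderSettings { ValidationType = ValidationType.Schema };
        settings.Schemas.Add(null, XmlReader.Create(new StringReader(xsd)));
        // With a handler attached, validation errors are reported here
        // instead of being thrown as XmlSchemaException.
        settings.ValidationEventHandler += (sender, e) =>
            errors.Add("/" + string.Join("/", path) + ": " + e.Message);

        using (XmlReader reader = XmlReader.Create(new StringReader(xml), settings))
        {
            while (reader.Read())
            {
                if (reader.NodeType == XmlNodeType.Element)
                {
                    path.Add(reader.Name);
                    // An empty element (<x/>) has no separate end tag to pop it.
                    if (reader.IsEmptyElement) path.RemoveAt(path.Count - 1);
                }
                else if (reader.NodeType == XmlNodeType.EndElement)
                {
                    path.RemoveAt(path.Count - 1);
                }
            }
        }
        return errors;
    }
}
```

The path is only a location hint: it does not distinguish between siblings with the same name, and errors on attributes are reported against the enclosing element rather than the element that carries the attribute.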

Personally, I'm not sure how to get a more detailed error. Typically, if you open the document and go to the location mentioned, you can easily find the error.
If the code isn't able to parse the file as valid XML, it is pretty hard for it to give an XPath or other named XML detail.

It seems this is no easy task. Robert Rossney's answer comes closest to programmatically solving my problem, so I'll accept that for now. I'll continue using the XSL solution. Anyone finding a better way to resolve validation errors can respond to this thread.

Related

DTD prohibited in xml document exception

I'm getting this error when trying to parse through an XML document in a C# application:
"For security reasons DTD is prohibited in this XML document. To enable DTD processing set the ProhibitDtd property on XmlReaderSettings to false and pass the settings into XmlReader.Create method."
For reference, the exception occurred at the second line of the following code:
using (XmlReader reader = XmlReader.Create(uri))
{
    reader.MoveToContent(); // here
    while (reader.Read()) // (code to parse XML doc follows)
My knowledge of Xml is pretty limited and I have no idea what DTD processing is nor how to do what the error message suggests. Any help as to what may be causing this and how to fix it? thanks...

First, some background.
What is a DTD?
The document you are trying to parse contains a document type declaration; if you look at the document, you will find near the beginning a sequence of characters beginning with <!DOCTYPE and ending with the corresponding >. Such a declaration allows an XML processor to validate the document against a set of declarations which specify a set of elements and attributes and constrain what values or contents they can have.
Since entities are also declared in DTDs, a DTD allows a processor to know how to expand references to entities. (The entity pubdate might be defined to contain the publication date of a document, like "15 December 2012", and referred to several times in the document as &pubdate; -- since the actual date is given only once, in the entity declaration, this usage makes it easier to keep the various references to publication date in the document consistent with each other.)
What does a DTD mean?
The document type declaration has a purely declarative meaning: a schema for this document type, in the syntax defined in the XML spec, can be found at such and such a location.
Some software written by people with a weak grasp of XML fundamentals suffers from an elementary confusion about the meaning of the declaration; it assumes that the meaning of the document type declaration is not declarative (a schema is over there) but imperative (please validate this document). The parser you are using appears to be such a parser; it assumes that by handing it an XML document that has a document type declaration, you have requested a certain kind of processing. Its authors might benefit from a remedial course on how to accept run-time parameters from the user. (You see how hard it is for some people to understand declarative semantics: even the creators of some XML parsers sometimes fail to understand them and slip into imperative thinking instead. Sigh.)
What are these 'security reasons' they are talking about?
Some security-minded people have decided that DTD processing (validation, or entity expansion without validation) constitutes a security risk. Using entity expansion, it's easy to make a very small XML data stream which expands, when all entities are fully expanded, into a very large document. Search for information on what is called the "billion laughs attack" if you want to read more.
One obvious way to protect against the billion laughs attack is for those who invoke a parser on user-supplied or untrusted data to invoke the parser in an environment which limits the amount of memory or time the parsing process is allowed to consume. Such resource limits have been standard parts of operating systems since the mid-1960s. For reasons that remain obscure to me, however, some security-minded people believe that the correct answer is to run parsers on untrusted input without resource limits, in the apparent belief that this is safe as long as you make it impossible to validate the input against an agreed schema.
This is why your system is telling you that your data has a security issue.
To some people, the idea that DTDs are a security risk sounds more like paranoia than good sense, but I don't believe they are correct. Remember (a) that a healthy paranoia is what security experts need in life, and (b) that anyone really interested in security would insist on the resource limits in any case -- in the presence of resource limits on the parsing process, DTDs are harmless. The banning of DTDs is not paranoia but fetishism.
Now, with that background out of the way ...
How do you fix the problem?
The best solution is to complain bitterly to your vendor that they have been suckered by an old wives' tale about XML security, and tell them that if they care about security they should do a rational security analysis instead of prohibiting DTDs.
Meanwhile, as the message suggests, you can "set the ProhibitDtd property on XmlReaderSettings to false and pass the settings into XmlReader.Create method." If the input is in fact untrusted, you might also look into ways of giving the process appropriate resource limits.
And as a fallback (I do not recommend this) you can comment out the document type declaration in your input.

Note that settings.ProhibitDtd is now obsolete; use DtdProcessing instead (the new options are Ignore, Parse, or Prohibit):
XmlReaderSettings settings = new XmlReaderSettings();
settings.DtdProcessing = DtdProcessing.Parse;
And, as stated in the post "How does the billion laughs XML DoS attack work?", you should add a limit on the number of characters to avoid DoS attacks:
XmlReaderSettings settings = new XmlReaderSettings();
settings.DtdProcessing = DtdProcessing.Parse;
settings.MaxCharactersFromEntities = 1024;
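
For completeness, here is a sketch of how such settings might be passed into XmlReader.Create. The class DtdAwareReader and its ReadRootText method are hypothetical names, and setting XmlResolver to null is an added assumption (it fits when the document declares its entities internally and no external DTD needs to be fetched).

```csharp
using System.IO;
using System.Xml;

public static class DtdAwareReader
{
    // Reads the text content of the root element, allowing a DTD but capping
    // how many characters entity expansion may produce.
    public static string ReadRootText(string xml)
    {
        var settings = new XmlReaderSettings
        {
            DtdProcessing = DtdProcessing.Parse,   // replaces ProhibitDtd = false
            MaxCharactersFromEntities = 1024,      // limit taken from the answer above
            XmlResolver = null                     // assumption: no external DTDs or entities needed
        };

        // The settings only take effect when handed to XmlReader.Create.
        using (XmlReader reader = XmlReader.Create(new StringReader(xml), settings))
        {
            reader.MoveToContent();
            return reader.ReadElementContentAsString();
        }
    }
}
```

When the expansion limit is exceeded, the reader is expected to throw an XmlException, so callers handling untrusted input would normally wrap the call in a try/catch.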

As far as fixing this, with a bit of looking around I found it was as simple as adding:
XmlReaderSettings settings = new XmlReaderSettings();
settings.ProhibitDtd = false;
and passing these settings into the create method.
[UPDATE 3/9/2017]
As some have pointed out, .ProhibitDtd is now deprecated. Dr. Aaron Dishno's answer shows the superseding solution.

After trying all of the above answers without success, I changed the service user from service#mydomain.com to service#mydomain.onmicrosoft.com, and now the app works correctly while running in Azure.
Alternatively, if you run into this problem in an environment you have more control over, you can paste the following into your hosts file:
127.0.0.1 msoid.onmicrosoft.com
127.0.0.1 msoid.mydomain.com
127.0.0.1 msoid.mydomain.onmicrosoft.com
127.0.0.1 msoid.*.onmicrosoft.com

Determining content from StreamReader?

I am forced to work with a crappy 3rd party API where there is no consistency with the return type. So I submit a programmatic web request, grab the Stream back and the underlying content might be an error message (worse still because it can be either raw text, or xml they return) or it returns a binary file. I have no means of knowing what format to expect with any given request so I need a way to introspect this at runtime.
How should I go about tackling this? The stream is non-seekable so I can't do anything other than read it. I usually try not to use exception handling for flow control but it seems like that might be the best way to handle it. Always treat it like it should be the expected binary file type and if anything blows up then catch the exception and try to extract what should be an error message

One thing that comes to mind is to examine the first x number of bytes in the stream. If the first bit is well formed xml, then it's probably xml. The problem is trying to determine the difference between raw text or binary.
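
A rough sketch of that idea, assuming a UTF-8 or ASCII text payload. The names PayloadKind and PayloadSniffer are hypothetical, and the rules used here (control bytes suggest binary, a leading '<' suggests XML) are heuristics, not a reliable format detector. Because the stream cannot be rewound, the bytes that were read are returned so the caller can process them together with the rest of the stream.

```csharp
using System;
using System.IO;

public enum PayloadKind { Xml, Text, Binary }

public static class PayloadSniffer
{
    // Reads up to maxBytes from the stream and guesses the payload kind.
    // The caller keeps 'prefix' because a non-seekable stream cannot be rewound.
    public static PayloadKind Sniff(Stream stream, int maxBytes, out byte[] prefix)
    {
        var buffer = new byte[maxBytes];
        int total = 0, read;
        // A single Read call may return fewer bytes than requested, so loop.
        while (total < maxBytes && (read = stream.Read(buffer, total, maxBytes - total)) > 0)
            total += read;

        prefix = new byte[total];
        Array.Copy(buffer, prefix, total);
        return Classify(prefix);
    }

    public static PayloadKind Classify(byte[] prefix)
    {
        int i = 0;
        // Skip a UTF-8 byte order mark if present.
        if (prefix.Length >= 3 && prefix[0] == 0xEF && prefix[1] == 0xBB && prefix[2] == 0xBF) i = 3;

        // Heuristic: control bytes other than tab, LF and CR suggest binary data.
        for (int j = i; j < prefix.Length; j++)
        {
            byte b = prefix[j];
            if (b < 0x20 && b != 0x09 && b != 0x0A && b != 0x0D) return PayloadKind.Binary;
        }

        // Heuristic: a '<' as the first non-whitespace character suggests XML.
        while (i < prefix.Length &&
               (prefix[i] == 0x20 || prefix[i] == 0x09 || prefix[i] == 0x0A || prefix[i] == 0x0D)) i++;
        return (i < prefix.Length && prefix[i] == (byte)'<') ? PayloadKind.Xml : PayloadKind.Text;
    }
}
```

UTF-16 text would be misclassified as binary by this sketch because of its zero bytes, and a payload classified as Xml still has to be parsed to confirm that it is well formed.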

How can I programmatically determine the XML elements that can be inserted next?

When I am editing an XML document that has an XmlSchema, how can I programmatically determine the elements that can be inserted next? I am using C# and I already know which element I am in. Is there an MSXML method I can call or something else? Thanks.

Sounds like you are after the .NET Schema Object Model (SOM).
Schema Object Model
Here is an article on how to work with the SOM.
Example 1

Tarzan,
As I understand it, you are trying to determine the legal XML that can be added at a specific place in the document, based on the schema being used. If that is correct, it is a very difficult problem to solve. If you have an "any" element in your XSD, your complexity increases because it can literally be any element! Also, XSD types can be subclassed (i.e., an element definition structure based on another structure), which introduces more complexity. There are only a couple of products (Oxygen, Visual Studio) that have attempted this with any success (that I know of).
If your schema is fairly simple, and doesn't include any of these deal breakers, you might be able to use the Schema Object Model to find the legal elements at your current location, but only if you know what portion of the XSD applies to your current element.
Does this make sense?
Erick
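
For the simple case described above, a sketch using the Schema Object Model might look like the following. The class SchemaChildLister and its DeclaredChildren method are hypothetical names. It only lists the element names declared in the content model of a global element; it ignores order, occurrence counts and what is already present in the document, so it answers "what may appear somewhere inside this element" and not "what may be inserted next".

```csharp
using System.Collections.Generic;
using System.IO;
using System.Xml;
using System.Xml.Schema;

public static class SchemaChildLister
{
    // Lists the names of the element particles declared in the content model
    // of a global element with the given local name.
    public static List<string> DeclaredChildren(string xsd, string elementName)
    {
        var set = new XmlSchemaSet();
        set.Add(null, XmlReader.Create(new StringReader(xsd)));
        set.Compile();   // post-compilation properties are only populated after this

        var names = new List<string>();
        foreach (XmlSchemaElement element in set.GlobalElements.Values)
        {
            if (element.QualifiedName.Name != elementName) continue;
            var type = element.ElementSchemaType as XmlSchemaComplexType;
            if (type != null) Collect(type.ContentTypeParticle, names);
        }
        return names;
    }

    private static void Collect(XmlSchemaParticle particle, List<string> names)
    {
        var element = particle as XmlSchemaElement;
        if (element != null) { names.Add(element.QualifiedName.Name); return; }

        var any = particle as XmlSchemaAny;
        if (any != null) { names.Add("(any)"); return; }   // wildcard: cannot be enumerated

        var group = particle as XmlSchemaGroupBase;          // sequence, choice or all
        if (group != null)
            foreach (XmlSchemaObject item in group.Items)
                Collect(item as XmlSchemaParticle, names);
    }
}
```

For position-aware answers, the XmlSchemaValidator class and its GetExpectedParticles method may be worth investigating, since it tracks validation state as elements are fed to it; that route is not shown here.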

Where can I find a list of all possible messages that an XmlException can contain?

I'm writing an XML code editor and I want to display syntax errors in the user interface. Because my code editor is strongly constrained to a particular problem domain and audience, I want to rewrite certain XmlException messages to be more meaningful for users. For instance, an exception message like this:
'"' is an unexpected token. The expected token is '='. Line 30, position 35.
... is very technical and not very informative to my audience. Instead, I'd like to rewrite it and other messages to something else. For completeness' sake, that means I need to build up a dictionary of existing messages mapped to the new messages I would like to display instead. To accomplish that, I'm going to need a list of all possible messages XmlException can contain.
Is there such a list somewhere? Or can I find out the possible messages through inspection of objects in C#?
Edit: specifically, I am using XmlDocument.LoadXml to parse a string into an XmlDocument, and that method throws an XmlException when there are syntax errors. So specifically, my question is where I can find a list of messages applied to XmlException by XmlDocument.LoadXml. The discussion about there potentially being a limitless variation of actual strings in the Message property of XmlException is moot.
Edit 2: More specifically, I'm not looking for advice as to whether I should be attempting this; I'm just looking for any clues to a way to obtain the various messages. Ben's answer is a step in the right direction. Does anyone know of another way?

Technically there is no such list: any class that throws an XmlException can set the message to any string. It really depends on which classes you are using and how they handle exceptions. It is perfectly possible that you are using a class that includes context-specific information in the message, e.g. information about some XML node or attribute that is malformed. In that case the number of unique message strings could be infinite, depending on the XML being processed. It is equally possible that a particular class does not work this way and has a finite number of messages that occur under specific circumstances. Perhaps a better approach would be to use try/catch blocks in specific parts of your code, where you understand the processing that is taking place, and provide more generic error messages based on what is happening. E.g. in your example you could simply look at the line and character number and produce an error along the lines of "Error processing xml file LineX CharacterY" or even something as general as "error processing file".
Edit:
Further to your edit, I think you will have trouble doing what you require. Essentially you are trying to change one text string into another based on certain keywords that may be in the string. This is likely to be messy and inconsistent. If you really want to do it, I would advise using something like Redgate .NET Reflector to decompile the LoadXml method and dig through the code to see how it handles different kinds of syntax errors in the XML and what messages it generates for each kind of error. This is likely to be time-consuming and difficult. If you want to hide the technical errors but still provide useful information to the user, then I would still recommend ignoring the error message and simply pointing the user to the location of the problem in the file.

Just my opinion, but ... spelunking the error messages and altering them before displaying them to the user seems like a really misguided idea.
First, the messages are different for each international language. Even if you could collect them for English, and you're willing to pay the cost, they'll be different for other languages.
Second, even if you are dealing with a single language, there's no way to be sure that an external package hasn't injected a novel XmlException into the scope of LoadXml.
Last, the list of messages is not stable. It may change from release to release.
A better idea is to just emit an appropriate message from your own app, and optionally display -- maybe upon demand -- the original error message contained in the XmlException.
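
One way this could look in practice, as a sketch: the class FriendlyXmlLoader, its TryLoad method and the wording of the message are all hypothetical. It relies only on the LineNumber, LinePosition and Message properties of XmlException and does not try to interpret the message text.

```csharp
using System.Xml;

public static class FriendlyXmlLoader
{
    // Returns null when the text parses; otherwise an app-specific message
    // built from the error position, optionally followed by the original text.
    public static string TryLoad(string xml, bool includeOriginal)
    {
        try
        {
            new XmlDocument().LoadXml(xml);
            return null;
        }
        catch (XmlException ex)
        {
            string message = "The document has a syntax problem near line " + ex.LineNumber +
                             ", column " + ex.LinePosition + ".";
            // The original message is shown only on demand, as suggested above.
            return includeOriginal ? message + " Details: " + ex.Message : message;
        }
    }
}
```

An editor could use the line and column to move the caret to the problem and keep the original message behind a "details" link.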

Parsing NHibernate exception text

I'm trying to get NHibernate to load some records for me (it's been partially set up, and is used for some other parts of the app already), and while working on an <any> mapping, I got this exception:
[InvalidOperationException: any types do not have a unique referenced persister]
Can somebody help me parse what they mean by this? I can think of many completely different meanings for this sentence. I can interpret the first part as:
- types declared with <any> are not allowed to have a URP, but yours do
- types declared with <any> must have a URP, but yours don't
- any of your program's types should ...
And with any of these, I can see the second part as:
- you have more than one persister, but only one is allowed
- you have no persister, but one is required
- you have one, but failed to reference it correctly
(Yeah, I'm unclear on much of their terminology still, but usually when I'm unclear on some parts, error messages are at least clear enough that I can figure out what they mean by context. And the exception points to the entry point into NHibernate, not a bad mapping in my .hbm.xml file or a property in a specific class.)
I've looked at the API docs, but they seem completely unhelpful here.
thanks!

I interpret that as your first bullet point does; I do not understand your question about the "second part".
