XPathDocument with XPathNavigator VS Xmlreader?

XPathDocument with XPathNavigator VS Xmlreader? - c#

I saw this question :
XPathDocument vs. XmlDocument
But it doesnt have the info which im looking for:
my question :
I know that XPathDocument Loads the complete xml into the memory :
My question is from the stage where the xml is already loaded :
which one of them will faster find the desired elements :
XPathDocument with XPathNavigator
or
xmlReader with If's conditions

If by
"the xml is already loaded"
you mean that it's already going to be loaded into an XPathDocument or XmlDocument, then the performance of using either the XPathNavigator or XmlReader will be the same. Both will be traversing already parsed, in-memory nodes representing the XPath data model.
The main difference between the two is the XmlReader will provide forward-only access whereas the XPathNavigator provides cursor access to the document. Directly interacting with XmlReader is very useful when you don't want to incur the cost of loading entire document in-memory. It's not so useful otherwise.
I'd strongly suggest using the XPathNavigator.
There are two primary ways you can interact with the XPathNavigator:
Build your state machine (using if/elses). One huge plus of doing this with the XPathNavigator rather than the XmlReader is, due to the cursor access model of XPathNavigator, your state machine will be vastly simpler. Ex: Need to see if the parent has a specific attribute? Just navigate to it and take a look.
Use XPath queries to find the data you're looking for. May not be as fast, but will probably will be less error prone than building your own state machine. Of course this requires you to be versed in XPath.

Related

Difference between XmlTextWriter and XmlSerializer?

I was using XmlSerializer when I came across someone using XmlTextWriter.
What is the difference between those two?
To me, they serve the same function which is to create XML files. Microsoft website said that XmlTextWriter provides a fast, non-cached, forward-only way of generating streams but I don't really know what that means.

The XmlTextWriter class is an object that knows XML. You can use it to generate arbitrary XML documents. It doesn't matter where the data's coming from; you can pull data for XML elements, attributes, and contents along with the actual structure of the XML document from whatever source you see fit, and it doesn't need to match any particular object's structure or data.
On the other hand XmlSerializer is an object that knows types. It has the features necessary to analyze a type, extract the important information, and write that information out. It happens to be able to use an XmlTextWriter object to perform the actual I/O; you can provide your own, or at some level it will always create a similar object to handle the actual I/O. In other words, the serializer object doesn't really know XML per se, nor does it need to. It delegates that work to another object.
Microsoft website said that XmlTextWriter provides a fast, non-cached, forward-only way of generating streams but I don't really know what that means.
"fast": not slow
"non-cached": important pieces of information are not stored in memory longer than absolutely necessary
"forward-only": you cannot revisit parts of the XML document you've already created
That is in contrast to other methods for generating XML documents in which the entire document structure is held in memory as its constructed, and written to a file only once the entire document has been constructed. This is often described as the "document object model", or DOM.
The writer approach tends to be more efficient in performance because the XML data is being generated on the fly, as needed, directly from other in-memory data structures you already have. Because the DOM approach requires the entire file's data and structure to be represented in memory at once, it will usually use more memory, which in some cases can reduce performance (though, frankly, on modern computers and for typical XML documents, this is usually a complete non-issue).

DTD prohibited in xml document exception

I'm getting this error when trying to parse through an XML document in a C# application:
"For security reasons DTD is prohibited in this XML document. To enable DTD processing set the ProhibitDtd property on XmlReaderSettings to false and pass the settings into XmlReader.Create method."
For reference, the exception occurred at the second line of the following code:
using (XmlReader reader = XmlReader.Create(uri))
{
reader.MoveToContent(); //here
while (reader.Read()) //(code to parse xml doc follows).
My knowledge of Xml is pretty limited and I have no idea what DTD processing is nor how to do what the error message suggests. Any help as to what may be causing this and how to fix it? thanks...

First, some background.
What is a DTD?
The document you are trying to parse contains a document type declaration; if you look at the document, you will find near the beginning a sequence of characters beginning with <!DOCTYPE and ending with the corresponding >. Such a declaration allows an XML processor to validate the document against a set of declarations which specify a set of elements and attributes and constrain what values or contents they can have.
Since entities are also declared in DTDs, a DTD allows a processor to know how to expand references to entities. (The entity pubdate might be defined to contain the publication date of a document, like "15 December 2012", and referred to several times in the document as &pubdate; -- since the actual date is given only once, in the entity declaration, this usage makes it easier to keep the various references to publication date in the document consistent with each other.)
What does a DTD mean?
The document type declaration has a purely declarative meaning: a schema for this document type, in the syntax defined in the XML spec, can be found at such and such a location.
Some software written by people with a weak grasp of XML fundamentals suffers from an elementary confusion about the meaning of the declaration; it assumes that the meaning of the document type declaration is not declarative (a schema is over there) but imperative (please validate this document). The parser you are using appears to be such a parser; it assumes that by handing it an XML document that has a document type declaration, you have requested a certain kind of processing. Its authors might benefit from a remedial course on how to accept run-time parameters from the user. (You see how hard it is for some people to understand declarative semantics: even the creators of some XML parsers sometimes fail to understand them and slip into imperative thinking instead. Sigh.)
What are these 'security reasons' they are talking about?
Some security-minded people have decided that DTD processing (validation, or entity expansion without validation) constitutes a security risk. Using entity expansion, it's easy to make a very small XML data stream which expands, when all entities are fully expanded, into a very large document. Search for information on what is called the "billion laughs attack" if you want to read more.
One obvious way to protect against the billion laughs attack is for those who invoke a parser on user-supplied or untrusted data to invoke the parser in an environment which limits the amount of memory or time the parsing process is allowed to consume. Such resource limits have been standard parts of operating systems since the mid-1960s. For reasons that remain obscure to me, however, some security-minded people believe that the correct answer is to run parsers on untrusted input without resource limits, in the apparent belief that this is safe as long as you make it impossible to validate the input against an agreed schema.
This is why your system is telling you that your data has a security issue.
To some people, the idea that DTDs are a security risk sounds more like paranoia than good sense, but I don't believe they are correct. Remember (a) that a healthy paranoia is what security experts need in life, and (b) that anyone really interested in security would insist on the resource limits in any case -- in the presence of resource limits on the parsing process, DTDs are harmless. The banning of DTDs is not paranoia but fetishism.
Now, with that background out of the way ...
How do you fix the problem?
The best solution is to complain bitterly to your vendor that they have been suckered by an old wive's tale about XML security, and tell them that if they care about security they should do a rational security analysis instead of prohibiting DTDs.
Meanwhile, as the message suggests, you can "set the ProhibitDtd property on XmlReaderSettings to false and pass the settings into XmlReader.Create method." If the input is in fact untrusted, you might also look into ways of giving the process appropriate resource limits.
And as a fallback (I do not recommend this) you can comment out the document type declaration in your input.

Note that settings.ProhibitDtd is now obsolete, use DtdProcessing instead: (new options of Ignore, Parse, or Prohibit)
XmlReaderSettings settings = new XmlReaderSettings();
settings.DtdProcessing = DtdProcessing.Parse;
and as stated in this post: How does the billion laughs XML DoS attack work?
you should add a limit to the number of characters to avoid DoS attacks:
XmlReaderSettings settings = new XmlReaderSettings();
settings.DtdProcessing = DtdProcessing.Parse;
settings.MaxCharactersFromEntities = 1024;

As far as fixing this, with a bit of looking around I found it was as simple as adding:
XmlReaderSettings settings = new XmlReaderSettings();
settings.ProhibitDtd = false;
and passing these settings into the create method.
[UPDATE 3/9/2017]
As some have pointed out, .ProhibitDTDT is now deprecated. Dr. Aaron Dishno's answer, below, shows the superseding solution

After trying all of the above answers without success I changing the service user from service#mydomain.com to service#mydomain.onmicrosoft.com and now the app works correctly while running in azure.
Alternatively if you run into this problem in an environment you have more control over; you can paste the following into your hosts file:
127.0.0.1 msoid.onmicrosoft.com
127.0.0.1 msoid.mydomain.com
127.0.0.1 msoid.mydomain.onmicrosoft.com
127.0.0.1 msoid.*.onmicrosoft.com

XmlTextReader vs. XDocument

I'm in the position to parse XML in .NET. Now I have the choice between at least XmlTextReader and XDocument. Are there any comparisons between those two (or any other XML parsers contained in the framework)?
Maybe this could help me to decide without trying both of them in depth.
The XML files are expected to be rather small, speed and memory usage are a minor issue compared to easiness of use. :-)
(I'm going to use them from C# and/or IronPython.)
Thanks!

If you're happy reading everything into memory, use XDocument. It'll make your life much easier. LINQ to XML is a lovely API.
Use an XmlReader (such as XmlTextReader) if you need to handle huge XML files in a streaming fashion, basically. It's a much more painful API, but it allows streaming (i.e. only dealing with data as you need it, so you can go through a huge document and only have a small amount in memory at a time).
There's a hybrid approach, however - if you have a huge document made up of small elements, you can create an XElement from an XmlReader positioned at the start of the element, deal with the element using LINQ to XML, then move the XmlReader onto the next element and start again.

XmlTextReader is kind of deprecated, do not use it.
From msdn blogs by XmlTeam
Effective Xml Part 1: Choose the right API
Avoid using XmlTextReader. It contains quite a few bugs that could not be fixed without breaking existing applications already using it.
The world has moved on, have you? Xml APIs you should avoid using.
Obsolete APIs are easy since the compiler helps identifying them but there are two more APIs you should avoid using – namely XmlTextReader and XmlTextWriter. We found a number of bugs in these classes which we could not fix without breaking existing applications. The easy route would be to deprecate these classes and ask people to use replacement APIs instead. Unfortunately these two classes cannot be marked as obsolete because they are part of ECMA-335 (Common Language Infrastructure) standard (http://www.ecma-international.org/publications/standards/Ecma-335.htm) – the companion CLILibrary.xml file which is a part of Partition IV).
The good news is that even though these classes are not deprecated there are replacement APIs for these in .NET Framework already and moving to them is relatively easy. First it is necessary to find the places where XmlTextReader or XmlTextWriter is being used (unfortunately it is a manual step). Now all the occurrences of XmlTextReader should be replaced with XmlReader and all the occurrences of XmlTextWriter should be replaced with XmlWriter (note that XmlTextReader derives from XmlReader and XmlTextWriter derives from XmlWriter so the app can already be using these e.g. as formal parameters). The last step is to change the way the XmlReader/XmlWriter objects are instantiated – instead of creating the reader/writer directly it is necessary to the static factory method .Create() present on both XmlReader and XmlWriter APIs.
Furthermore, intellisense in Visual Studio doesn't list XmlTextReader under System.Xml namespace. The class is defined as:
[EditorBrowsable(EditorBrowsableState.Never)]
public class XmlTextReader : XmlReader, IXmlLineInfo, IXmlNamespaceResolver
The XmlReader.Create factory methods return other internal implementations of the abstract class XmlReader depending on the settings passed.
For forward-only streaming API (i.e. that doesn't load the entire thing into memory), use XmlReader via XmlReader.Create method.
For an easier API to work with, go for XDocument aka LINQ To XML. Find XDocument vs XmlDocument here and here.

Best way to manipulate XML in .NET

I need to manipulate an existing XML document, and create a new one from it, removing a few nodes and attributes, and perhaps adding new ones, what would be the best group of classes to accomplish this?
There are a lot of .NET classes for XML manipulation, and I'm not sure what would be the optimal way to do it.

If it is a really huge XML which cannot fit into memory you should use XmlReader/XmlWriter. If not LINQ to XML is very easy to use. If you don't have .NET 3.5 you could use XmlDocument.
Here's an example of removing a node:
using System.Xml.Linq;
using System.Xml.XPath;
var doc = XElement.Load("test.xml");
doc.XPathSelectElement("//customer").Remove();
doc.Save("test.xml");

Use Linq to XML You can see the XDocument class here

Parsing the document with XML Style Sheets might be the easiest option if it is just a conversion process.
Here is how to use XSLT in .NET.
and
Here is an introduction to XSLT.
It confused me a bit at first, but now I pretty much use XSLT to do all my XML conversions.

If you have an official schema, you can use the XmlSerializer. Otherwise it is best to use the XmlDocument, XmlNode, XmlElement etc classes.
Otherwise it could also depend on what you are using the xml for, i.e. marking up some document, representing objects etc.

DeSerializing an XML file to a C# class

Does anyone know what advantages (memory/speed) there are by using a class generated by the XSD tool to explore a deserialized XML file as opposed to XPATH?

I'd say the advantage is that you get a strongly typed class which is more convenient to use, and also the constructor for the class will throw an exception if the XML data in the file is invalid for creating the object, so you get a minimal data validation for free.

If you don't want to write boilerplate code, and you need to check ANY values of your XML on the way through, you can't go wrong with the XSD.exe generated classes.

The two are very different; but XmlSerializer will always deserialize entire objects; with XPath you can pick and choose. I'd use XmlSerializer personally, though - harder to get wrong.
XPath, however, is a complex beast that depends on the back-end. For example, XmlDocument (mutable) will behave differently to XPathDocument (read-only, optimized for query).

Develop Reference

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.