In one of the applications we are developing, we do a lot of XML processing. Currently we use DOM and XPath for most of it, and we are not very happy with the performance.
We are now considering moving the XML processing logic to LINQ to XML, and our initial investigations suggest that it performs much better than DOM.
Before making these changes I would like to know how others feel about this. Is using LINQ to XML a better option? Are there any disadvantages?
Thanks,
Shamika
Thank you very much for your answers. I did some performance tests and, as expected, XmlReader outperformed both XmlDocument and LINQ to XML. Please note that this is only for XML reading.
Also, if you need the ease of use of LINQ, you can implement LINQ to XML processing on top of some XmlReader features and get much better performance than with XmlDocument. Please refer to rwwilden's comments for more information.
Thanks.
Using DOM (i.e. System.Xml.XmlDocument) is likely to be slower because of its rich navigation support (all those references start to add up), and this overhead becomes more significant as the number of nodes increases.
Simpler object models (System.Xml.Linq.XDocument and System.Xml.XPath.XPathDocument) don't have such complex structures, but allow navigation by other means. This might add to CPU overhead but should save memory.
In the end you need to profile (time and space) in your case, and also consider how much real (user perceived) difference it makes.
But, for ultimate performance don't load the whole document into memory at all: use System.Xml.XmlReader and System.Xml.XmlWriter and do everything in a stream. Of course this adds development cost.
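To make the streaming idea concrete, here is a minimal sketch, assuming all you need is to count elements with a given name. The class and method names are made up for illustration; only XmlReader itself comes from the framework.

```csharp
using System.IO;
using System.Xml;

public static class StreamingCount
{
    // Hypothetical helper: counts elements with the given local name
    // without ever building an in-memory tree.
    public static int CountElements(TextReader input, string localName)
    {
        int count = 0;
        using (XmlReader reader = XmlReader.Create(input))
        {
            // Read() advances one node at a time; only the current node is held.
            while (reader.Read())
            {
                if (reader.NodeType == XmlNodeType.Element &&
                    reader.LocalName == localName)
                {
                    count++;
                }
            }
        }
        return count;
    }
}
```

Memory use stays roughly constant regardless of document size, but anything that needs to look backwards or sideways in the tree has to be tracked by hand, which is where the extra development cost comes from.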
.NET has a rich (maybe too rich) set of XML APIs. Which one is best (or at least, least worst) for you can only be determined by you, by making the trade-offs that fit your case.
Personally I would avoid XmlDocument and use either XPathDocument (especially to read, and query with XPath) or XDocument (especially to create) wherever XmlReader/XmlWriter does not give enough of a performance boost to justify the extra development cost.
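For the read-and-query case, a rough sketch of what XPathDocument usage might look like. The helper name and the idea of returning string values are assumptions made for the example, not anything prescribed above.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Xml.XPath;

public static class XPathDocumentQuery
{
    // Hypothetical helper: loads the input into a read-only XPathDocument
    // and returns the string value of every node the expression selects.
    public static List<string> SelectValues(TextReader input, string xpath)
    {
        var doc = new XPathDocument(input);
        XPathNavigator navigator = doc.CreateNavigator();

        var results = new List<string>();
        XPathNodeIterator iterator = navigator.Select(xpath);
        while (iterator.MoveNext())
        {
            results.Add(iterator.Current.Value);
        }
        return results;
    }
}
```

Note that this still loads the whole document; it only swaps the editable DOM for a lighter read-only store.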
I'm not sure you would notice a very large performance improvement using LINQ2XML instead of DOM/XPath. For both DOM and LINQ2XML, the document that you iterate over is represented as an in-memory tree.
If performance really is an issue and you have rather large XML documents, you could take a look at the rudimentary XML streaming support that is implemented in the framework (via XStreamingElement). Also check this Microsoft XML team blog entry.
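As far as I understand it, XStreamingElement helps on the output side: it holds on to a query instead of a materialized tree and evaluates it while saving. A small sketch under that assumption (the helper name and the items/item element names are invented for the example):

```csharp
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Xml.Linq;

public static class StreamingWrite
{
    // Hypothetical helper: writes <items><item>n</item>...</items> to the writer.
    public static void WriteItems(IEnumerable<int> values, TextWriter output)
    {
        // The Select query is not evaluated here. XStreamingElement keeps a
        // reference to it and enumerates it while Save is writing, so the
        // full set of <item> elements is never held in memory as one tree.
        var root = new XStreamingElement("items",
            values.Select(v => new XElement("item", v)));
        root.Save(output);
    }
}
```

For streaming on the input side you would still need something based on XmlReader.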
My take on it is that LINQ -> XML is leaps and bounds easier to use than DOM. It's more intuitive to me and much easier to read IMO.
Related
There are multiple ways of reading XML and performing business logic on it.
The business object may be reading, writing, editing and getting required values many times. The XML files can also be very large: sometimes gigabytes, but mostly megabytes.
Based on performance, which of these approaches suits best?
XmlReader
XML serialization
LINQ to XML
StreamReader
XML DOM parsing
Probably doesn't matter ...
Unless you're using huge ... and I mean HUGE ... xml's ... or doing it for sooooooooooo many times that that is the bottleneck ...
You could always benchmark them yourself as well ...
The other question is ...what do you need?
If you need to read and write, one might be more relevant than the other. If you just need to read, the one that also offers writing will be "slow" in comparison ...
It's a very wide question you ask ... without sufficient details...
Well, it depends on the specific application where you are trying to use it.
But in theory, Linq is the best as far as performance is concerned.
I heard that LINQ to XML has some performance issues, and some of my friends recommended that I not use it in my app. I couldn't find anything relevant on MSDN and I do not want to rely on "some internet blog". Does anyone know of an official point of view on this issue, or some trustworthy source?
Using LINQ to XML will read the entire file into memory.
If you're reading an enormous XML file (hundreds of megabytes), this is a problem.
Instead, you can use a raw XmlReader, which provides a forward-only view of an XML file and will not read the entire file at once.
If you're dealing with normal-sized XML files, LINQ to XML will be fine.
LINQ to XML is several orders of magnitude easier to use than XmlReader.
You should only use XmlReader if you know that you'll be dealing with 200MB XML files, or if you've measured your performance and proved that loading the XDocument is too slow.
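For a normal-sized file, the LINQ to XML version tends to be only a few lines. A minimal sketch, assuming a made-up books/book/title layout with an optional price attribute (none of these names come from the question):

```csharp
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Xml.Linq;

public static class LinqToXmlQuery
{
    // Hypothetical helper: titles of all <book> elements priced below maxPrice.
    public static List<string> TitlesCheaperThan(TextReader input, decimal maxPrice)
    {
        // Load reads the entire document into memory.
        XDocument doc = XDocument.Load(input);

        return doc.Descendants("book")
                  // The (decimal?) cast yields null when the attribute is
                  // missing, and null < maxPrice is false, so such books
                  // are skipped rather than throwing.
                  .Where(b => (decimal?)b.Attribute("price") < maxPrice)
                  .Select(b => (string)b.Element("title"))
                  .ToList();
    }
}
```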
Check MSDN:Performance (LINQ to XML) and Performance of LINQ to XML by Eric White - Microsoft
Just google "linq vs xmlreader" and you will find it.
The top result, http://www.nearinfinity.com/blogs/joe_ferner/performance_linq_to_sql_vs.html, concludes that it's slower compared to XmlReader (of course, since LINQ to XML is built on top of XmlReader), but IMHO it is far better than acceptable, as you gain flexibility and code that is easier to read and write.
What is the best way to work with an XML file that represents a tree?
The XML size is 70 MB.
LINQ to XML is currently the easiest way to work with XML, but it will typically load the entire tree into memory, which in your case with a 70 MB file may not be ideal.
However, there are ways around this, as demonstrated in this blog post from Mike Taulty.
The answer depends on what you want to do with the XML. Generally with files that size you wouldn't want to read it all in one go. As such the following page makes an interesting read, providing a means to mine data from the file without loading it in memory. It allows you to combine the speed of XmlReader with the flexibility of Linq:
http://msdn.microsoft.com/en-us/library/bb387035.aspx
And quite an interesting article based on this technique:
Link
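A rough sketch of the XmlReader plus LINQ to XML combination, as I understand the technique: walk the file with a reader and materialize only one matching element at a time as an XElement. The class and method names are hypothetical; XNode.ReadFrom is the framework call doing the work.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Xml;
using System.Xml.Linq;

public static class StreamedElements
{
    // Hypothetical helper: lazily yields each element with the given local
    // name as a small XElement, without loading the rest of the document.
    public static IEnumerable<XElement> Stream(TextReader input, string elementName)
    {
        using (XmlReader reader = XmlReader.Create(input))
        {
            reader.MoveToContent();
            while (!reader.EOF)
            {
                if (reader.NodeType == XmlNodeType.Element &&
                    reader.LocalName == elementName)
                {
                    // ReadFrom consumes the whole element and leaves the
                    // reader on the node after it, so no extra Read() here;
                    // calling Read() would skip an adjacent match.
                    yield return (XElement)XNode.ReadFrom(reader);
                }
                else
                {
                    reader.Read();
                }
            }
        }
    }
}
```

The result can be queried with ordinary LINQ operators (Where, Select and so on), as long as each query only needs the element it is currently looking at.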
If you want to read data from a large xml file XmlTextReader is the way to go.
For .NET 3.5 and up, I prefer using LINQ to XML for all my work towards XML files.
LINQ to XML is probably a good bet if you wish to query the document in memory, but if you find that your memory footprint is becoming a problem you could use an XmlReader.
LINQ to XML:
  Slower for larger documents (large memory footprint)
  Queryable
XmlTextReader:
  Fast
  Only one node at a time, so no querying
Since you are already using a DOM, an alternative XML parser you could try is a SAX parser. Instead of loading the entire tree into memory, a SAX parser is event-driven and handles nodes, etc. as it encounters them.
Further Reading: http://www.saxproject.org/event.html
What is the "best" way to search in xml?
Xpath or Linq2xml.
I'm asking this because we need to do a lot of searching in xml.
I have always used XPath (since .NET 1.1). But with the introduction of LINQ you can easily use Linq2Xml.
Regards,
M.
I use both extensively, and XSL as well.
They have very different uses, IMO.
XPath is great for manipulating XML documents, whereas Linq2Xml is great for mapping them into object collections.
In other words, I regularly have applications that involve both.
For instance, parsing CSV into a given XML structure is almost cherry-picked for XSLT and XPath, whereas Linq2Xml will give you problems if you have an XML document that has optional elements. So I tend to use XPath to really lock down the XML format so that it is explicit, and to keep my Linq2Xml mapping very, very simple.
The result is a lot fewer bugs and much faster development.
No idea why the guy is talking about Linq2Xsd ... it's a discontinued project that has very, very little documentation. Stay away from it.
XDocument is an object that is actually enjoyable to work with ... XmlDocument is one that is just fiddly, IMO. Obviously it depends on the task at hand, but the lack of XPath 2.0 makes me tend to use XPath as a data cleanser and then let Linq2Xml do the real work.
As far as search goes, you can do everything that Linq2Xml does in XPath; the thing is that syntactically I far prefer to use Linq2Xml and play with strongly typed collections than mess about with XPath. It's much easier to come back to at a later date and adapt. Also, you don't have to worry about syntax differences between XPath implementations, and especially between regex implementations.
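To show the two styles side by side, here is a small sketch that runs the same search both ways over an XDocument. The customer/name/country layout is invented for the example, and the two methods only agree as long as each customer has at most one name element.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;
using System.Xml.XPath;

public static class SearchBothWays
{
    // XPath flavour: XPathSelectElements is an extension method from
    // System.Xml.XPath that works directly on LINQ to XML nodes.
    public static List<string> ViaXPath(XDocument doc)
    {
        return doc.XPathSelectElements("//customer[@country='NL']/name")
                  .Select(e => e.Value)
                  .ToList();
    }

    // LINQ to XML flavour: the casts return null for missing attributes
    // or elements, so optional parts have to be filtered out explicitly.
    public static List<string> ViaLinq(XDocument doc)
    {
        return doc.Descendants("customer")
                  .Where(c => (string)c.Attribute("country") == "NL")
                  .Select(c => (string)c.Element("name"))
                  .Where(n => n != null)
                  .ToList();
    }
}
```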
Either. It depends.
Depending on your (and your team's) knowledge (i.e. XPath will not be effective if no one knows XPath, but all know LINQ to XML). Also, some operations can be easier in one or the other.
You will need to define criteria first to judge what is best. And you need to decide whether you want to compare XPath 1.0 or 2.0 with LINQ to XML. Microsoft does not support XPath 2.0 but third party solutions exist, like Saxon 9 or like XQSharp.
I need to do some processing on fairly large XML files (large here being potentially upwards of a gigabyte) in C#, including performing some complex XPath queries. The problem I have is that the standard way I would normally do this, through the System.Xml libraries, likes to load the whole file into memory before it does anything with it, which can cause memory problems with files of this size.
I don't need to be updating the files at all just reading them and querying the data contained in them. Some of the XPath queries are quite involved and go across several levels of parent-child type relationship - I'm not sure whether this will affect the ability to use a stream reader rather than loading the data into memory as a block.
One way I can see of making it work is to perform the simple analysis using a stream-based approach and perhaps wrapping the XPath statements into XSLT transformations that I could run across the files afterward, although it seems a little convoluted.
Alternately, I know that there are some elements that the XPath queries will not run across, so I guess I could break the document up into a series of smaller fragments based on its original tree structure, which could perhaps be small enough to process in memory without causing too much havoc.
I've tried to explain my objective here so if I'm barking up totally the wrong tree in terms of general approach I'm sure you folks can set me right...
XPathReader is the answer. It isn't part of the .NET Framework, but it is available for download from Microsoft. Here is an MSDN article.
If you construct an XPathReader with an XmlTextReader you get the efficiency of a streaming read with the convenience of XPath expressions.
I haven't used it on gigabyte sized files, but I have used it on files that are tens of megabytes, which is usually enough to slow down DOM based solutions.
Quoting from the below: "The XPathReader provides the ability to perform XPath over XML documents in a streaming manner".
Download from Microsoft
Gigabyte XML files! I don't envy you this task.
Is there any way that the files could be sent in a better way? E.g. Are they being sent over the net to you - if they are then a more efficient format might be better for all concerned. Reading the file into a database isn't a bad idea but it could be very time consuming indeed.
I wouldn't try and do it all in memory by reading the entire file - unless you have a 64bit OS and lots of memory. What if the file becomes 2, 3, 4GB?
One other approach could be to read in the XML file and use SAX to parse the file and write out smaller XML files according to some logical split. You could then process these with XPath. I've used XPath on 20-30MB files and it is very quick. I was originally going to use SAX but thought I would give XPath a go and was surprised how quick it was. I saved a lot of development time and probably only lost 250ms per query. I was using Java for my parsing but I suspect there would be little difference in .NET.
I did read that XML::Twig (A Perl CPAN module) was written explicitly to handle SAX based XPath parsing. Can you use a different language?
This might also help https://web.archive.org/web/1/http://articles.techrepublic%2ecom%2ecom/5100-10878_11-1044772.html
http://msdn.microsoft.com/en-us/library/bb387013.aspx has a relevant example leveraging XStreamingElement.
In order to perform XPath queries with the standard .NET classes the whole document tree needs to be loaded in memory which might not be a good idea if it can take up to a gigabyte. IMHO the XmlReader is a nice class for handling such tasks.
It seems that you already tried using XPathDocument and could not accommodate the parsed XML document in memory.
If this is the case, before starting to split the file (which is ultimately the right decision!) you may try using the Saxon XSLT/XQuery processor. It has a very efficient in-memory representation of a loaded XML document (the "tinytree" model). In addition, Saxon SA (the schema-aware version, which isn't free) has some streaming extensions. Read more about this here.
How about just reading the whole thing into a database and then work with the temp database? That might be better because then your queries can be done more efficiently using TSQL.
I think the best solution is to write your own XML parser that can read small chunks rather than the whole file, or to split the large file into small files and use the .NET classes on those.
The problem is that you cannot parse some of the data until the whole of it is available, so I recommend using your own parser rather than the .NET classes.
Have you been trying XPathDocument?
This class is optimized for handling XPath queries efficiently.
If you cannot handle your input documents efficiently using XPathDocument you might consider preprocessing and/or splitting up your input documents using an XmlReader.
You've outlined your choices already.
Either you need to abandon XPath and use XmlTextReader, or you need to break the document up into manageable chunks on which you can use XPath.
If you choose the latter, use XPathDocument; its read-only restriction allows better use of memory.
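A sketch of the chunking option, assuming the XPath queries never need to cross the boundary of some repeating element: stream with XmlReader, and load each chunk on its own into a small XPathDocument. The helper name and parameters are hypothetical.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Xml;
using System.Xml.XPath;

public static class ChunkedXPath
{
    // Hypothetical helper: runs an XPath expression against each chunkName
    // element in turn, holding only one chunk in memory at a time.
    public static List<string> QueryChunks(TextReader input, string chunkName, string xpath)
    {
        var results = new List<string>();
        using (XmlReader reader = XmlReader.Create(input))
        {
            // ReadToFollowing moves forward to the next element with this name.
            while (reader.ReadToFollowing(chunkName))
            {
                // ReadSubtree exposes just this element as its own document;
                // disposing it leaves the outer reader at the end of the chunk.
                using (XmlReader subtree = reader.ReadSubtree())
                {
                    var chunk = new XPathDocument(subtree);
                    XPathNodeIterator it = chunk.CreateNavigator().Select(xpath);
                    while (it.MoveNext())
                    {
                        results.Add(it.Current.Value);
                    }
                }
            }
        }
        return results;
    }
}
```

Within each chunk the chunk element is the document root, so an expression such as /order/item[@qty > 1]/@sku is evaluated relative to a single order, not the whole file.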
Since in your case the data size can run into gigabytes, have you considered using ADO.NET with XML as a database? That way the memory footprint would not be huge.
Another approach would be using LINQ to XML with classes like XStreamingElement. Hope this helps.