LINQ: How to search data from a large XML file? - c#

I have an XML file of about 500 MB, and I'm using LINQ with C# to query that file, but it's very slow because it loads everything into memory. Is there any way I can query the file without loading it all into memory?
Thanks

This article should get you up and running. Take a look at the SimpleStreamAxis method, which is very handy for finding nodes in large XML files. I've successfully used a variant of this method on 5 GB XML files without loading the whole file into memory.

You can use the technique described on MSDN's page about XNode.ReadFrom to generate an IEnumerable of XNodes (in the example they provide, XElements) from an XmlReader.
Note that when you read an XElement from a Stream or XmlReader, the entire contents of that element must be read too, so you'll still need a little custom logic in the IEnumerator to ensure that the right XElements get returned. For instance, if you return the root element, you might as well parse the entire document right away, since the root element contains almost everything anyhow. The XNode.ReadFrom example contains such logic too.
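For illustration, here is a minimal sketch of that streaming pattern, assuming a hypothetical huge.xml whose repeating element is <Customer> (substitute your own file, element, and attribute names):

    using System;
    using System.Collections.Generic;
    using System.Linq;
    using System.Xml;
    using System.Xml.Linq;

    class StreamingExample
    {
        // Lazily yields each matching element; only the current subtree
        // is ever held in memory, never the whole document.
        static IEnumerable<XElement> StreamElements(string path, string elementName)
        {
            using (XmlReader reader = XmlReader.Create(path))
            {
                reader.MoveToContent();
                while (!reader.EOF)
                {
                    if (reader.NodeType == XmlNodeType.Element &&
                        reader.LocalName == elementName)
                    {
                        // ReadFrom consumes the element (and its children) and
                        // leaves the reader positioned after it, so no Read() here.
                        yield return (XElement)XNode.ReadFrom(reader);
                    }
                    else
                    {
                        reader.Read();
                    }
                }
            }
        }

        static void Main()
        {
            // Hypothetical file/element/attribute names for the example.
            var ukCustomers = StreamElements("huge.xml", "Customer")
                .Where(c => (string)c.Attribute("Country") == "UK");
            foreach (var c in ukCustomers)
                Console.WriteLine((string)c.Element("Name"));
        }
    }

Because the method is a lazy iterator, only one matching subtree is materialized at a time, and since XNode.ReadFrom leaves the reader positioned just past the element it consumed, the loop only calls Read() when it is not sitting on a match.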

No, it's not possible with plain LINQ to XML: it loads a model of the full XML document into memory so you can have access through the tree structure.
If you want fast access without loading the file into memory, you can use the XmlReader class.
This class gives you a fast, forward-only XML parser that keeps only the current node in memory.
Here is some help on that: http://support.microsoft.com/kb/307548
Edit: Sorry, didn't know that it's possible to combine XmlReader with LINQ.
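A bare-bones sketch of that forward-only approach, with made-up file and element names:

    using System;
    using System.Xml;

    class ForwardOnlyExample
    {
        static void Main()
        {
            // XmlReader keeps only the current node in memory, so this works
            // the same on a 500 MB file as on a 5 KB one.
            using (XmlReader reader = XmlReader.Create("large.xml"))
            {
                while (reader.Read())
                {
                    if (reader.NodeType == XmlNodeType.Element &&
                        reader.LocalName == "Product")   // hypothetical element
                    {
                        Console.WriteLine(reader.GetAttribute("Id"));
                    }
                }
            }
        }
    }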

Related

Extracting a small subset of data from XMLs

I am writing a C# / VB program that is to be used for reporting data based upon information received in XMLs.
My situation is that I receive many XMLs per month (about 100-200), each ranging in size from 10 MB to 350 MB. For each of these XMLs, I only need a small subset of its data (less than 5% of any one file's entire data) so as to produce the necessary reports.
Also, that subset of data will always be held in the same key structure (it may exist within multiple keys and at differing levels down, perhaps, but it will always exist within the same key names, and the keys containing it will always have the same attributes, such as "name", etc.)
So, my current idea of how to go about this is:
Create a "scraper" that will pull the necessary data from the XMLs using XPath.
Store that small subset of necessary data in a SQL Server table, along with file-characteristic data in a separate table, so as to know which file the scraped data came from.
Query the data out into a program for reporting.
My main question here is really what is the best way to scrape that data out?
I am most familiar with XPath, but for multiple files of 200 MB in size, I'm afraid of performance issues from loading each entire file.
Other things I have seen / researched are:
Creating an XSLT file to transform / pull from the XML only the data I want
Using Linq to XML
Somehow linking the XMLs to SQL Server and then being able to query them directly
Using ADO to query the XMLs from within the program
Doing it using the XMLReader class (rather than loading in each XML entirely)
Maybe there is a native .NET component that does this very well already
Quite honestly, I just have no clue what the standard approach is, given the high number of XMLs and the large variance in file sizes. I'm also not familiar with the other options, such as linking the XMLs to SQL Server directly or using ADO to query the XML, and therefore don't know their possible benefits / drawbacks.
If any of you have been in a similar situation, I'd really appreciate any kind of pointers in the right direction / at least validation that my method isn't the worst one out there :)
Thanks!!!
As for the memory consumption and performance concerns, a nice feature of the .NET XML APIs is that you can combine XmlReader with XPathDocument, XmlDocument or XElement to selectively read only part of a document into memory, and then have the XPath or LINQ to XML features available on that part. LINQ to XML has XNode.ReadFrom (http://msdn.microsoft.com/en-us/library/system.xml.linq.xnode.readfrom%28v=vs.110%29.aspx) for doing that; DOM/XmlDocument has ReadNode (http://msdn.microsoft.com/en-us/library/system.xml.xmldocument.readnode%28v=vs.110%29.aspx). So, depending on your XML structure, you might be able to use an XmlReader to read forward through the XML quickly without consuming much memory and then, when you reach an element you are interested in, read it into an XElement (LINQ to XML) or XmlNode (DOM) and apply LINQ to XML and/or XPath to read out the details.
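As a rough sketch of that hybrid approach (the <Record> element, the Field/@name='Total' structure, and the file name are all invented for the example):

    using System;
    using System.Xml;
    using System.Xml.Linq;
    using System.Xml.XPath; // XPathSelectElements extension for XElement

    class HybridExample
    {
        static void Main()
        {
            // Skim through the file with XmlReader, lift only the interesting
            // elements into LINQ to XML, then run XPath over that small fragment.
            using (XmlReader reader = XmlReader.Create("monthly-feed.xml"))
            {
                reader.MoveToContent();
                while (!reader.EOF)
                {
                    if (reader.NodeType == XmlNodeType.Element &&
                        reader.LocalName == "Record")
                    {
                        var record = (XElement)XNode.ReadFrom(reader);
                        foreach (var f in record.XPathSelectElements(".//Field[@name='Total']"))
                            Console.WriteLine((string)f);
                    }
                    else
                    {
                        reader.Read();
                    }
                }
            }
        }
    }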

XML binary search through attributes

I have an XML document that contains tags with a numerical attribute, and they are sorted by this attribute. Is it worth finding these tags through binary search? Or is access to the nodes not constant time, so it would be better to copy the data into an array first? And is this implemented in any XML libraries?
It depends on how you are planning to read the document. There are two ways to handle XML documents:
The first, implemented by XmlReader, is stream-oriented: it processes one element at a time, and nothing more is stored in program memory.
The second follows the Document Object Model interface: it loads the entire document into memory and allows you to query it without going back to the file. The best way to use that in .NET is LINQ to XML.
Depending on the size of your document, one or the other may be the better choice, but be aware that anything other than a linear search is impossible with the stream-oriented API.
And as far as I know, you can't get binary search with LINQ to XML out of the box, because it exposes IEnumerable. You'd have to copy your elements into an array and then run the binary search on that array. Definitely not a difficult task to complete, anyway.
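For example, something along these lines, assuming a made-up items.xml whose <Item> elements are sorted by a numeric "id" attribute:

    using System;
    using System.Linq;
    using System.Xml.Linq;

    class BinarySearchExample
    {
        static void Main()
        {
            // Hypothetical document: <Items><Item id="1"/><Item id="5"/>...</Items>,
            // already sorted by the numeric "id" attribute.
            XDocument doc = XDocument.Load("items.xml");
            XElement[] items = doc.Root.Elements("Item").ToArray();

            // Project the sort keys once; the index found in the key array
            // maps straight back into the element array.
            int[] keys = items.Select(e => (int)e.Attribute("id")).ToArray();
            int index = Array.BinarySearch(keys, 42);

            Console.WriteLine(index >= 0 ? items[index].ToString()
                                         : "No element with id 42");
        }
    }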

Performance difference between LINQ to XML and XML serialization

I would like to create an XML file (100 lines, 5 namespaces and 30 different tags, 20 attributes total). I already have a hardcoded XML example, but I need to write some C# code to generate the XML dynamically and fill in the values, which of course can change. Performance is a concern. Should I:
Use LINQ to XML and create all the tags with XDocument and XElement, providing variables that contain the dynamic values, or
Since I already have an XML example, create a schema.xsd and provide the values to the generated object (XML serialization)?
The xml (the object stream) will be sent via HTTP POST every second to a web service.
I am going to time-test both versions, but I was just curious whether someone has already done that.
The LINQ to XML version should have better performance.
If you want to optimize even further, you could consider direct string concatenation (but that's not a best practice, and the performance gain won't be significant).
The next performance option is XmlTextWriter, probably the fastest way to write XML "correctly": it doesn't need to build an XML object model the way LINQ to XML does, so it should be significantly faster.
You can optimize serialization a bit if you cache the XmlSerializer instance rather than creating it every time. Then it will also be relatively fast, though definitely slower than direct XML writes.
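For reference, a minimal sketch of the direct-writer route mentioned above (XmlWriter.Create is the modern way to get an XmlTextWriter-style writer; the element and namespace names are made up):

    using System;
    using System.Text;
    using System.Xml;

    class WriterExample
    {
        static void Main()
        {
            var sb = new StringBuilder();
            var settings = new XmlWriterSettings { Indent = true };
            using (XmlWriter writer = XmlWriter.Create(sb, settings))
            {
                // One namespaced root plus a child element; no object model
                // is built, the markup goes straight to the buffer.
                writer.WriteStartElement("msg", "Message", "http://example.com/msg");
                writer.WriteAttributeString("version", "1.0");
                writer.WriteElementString("Body", "dynamic value here");
                writer.WriteEndElement();
            }
            Console.WriteLine(sb.ToString());
        }
    }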

Parsing XML with DataSet - Performance

As per my requirement, I want to display some product-related information in my UI.
All the information comes through one API URL; the API returns XML output. The XML may have more than 100 tags, but I need only 30 to 50 of them. I need to pass a parameter as input and get the product information back.
I am using a .asmx service as a wrapper service, and all the parsing is done there.
In the code-behind page, I consume the service and display the information.
How should I parse the XML? Currently I plan to read it into a DataSet (ds.ReadXml(xml)).
Does that affect performance? Is there any other way to do it? Please guide me.
If you want to bind the result to a control, then the DataSet approach (indicated by you) makes sense. However, if you just need the text values of those 30-50 tags, without caring what the parent/child nodes in between are, you can use XmlDocument/XPath.
I would use LINQ to XML
more info at
http://msdn.microsoft.com/en-us/library/bb387098.aspx
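For instance, a small sketch (the <Product> structure is invented; substitute the tags your API actually returns):

    using System;
    using System.Linq;
    using System.Xml.Linq;

    class LinqToXmlExample
    {
        static void Main()
        {
            // Stand-in for the XML the API would return.
            string xml = "<Products><Product><Name>Pen</Name><Price>1.20</Price>" +
                         "<Color>Blue</Color></Product></Products>";
            XDocument doc = XDocument.Parse(xml);

            // Pick out only the tags the UI needs and ignore the rest.
            var rows = from p in doc.Descendants("Product")
                       select new
                       {
                           Name = (string)p.Element("Name"),
                           Price = (decimal)p.Element("Price")
                       };

            foreach (var row in rows)
                Console.WriteLine("{0}: {1}", row.Name, row.Price);
        }
    }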
For older versions of the framework, use the XmlTextReader.
Use the XmlTextReader class to process large XML documents in an efficient, forward-only manner. XmlTextReader uses small amounts of memory.
Avoid using the DOM, because the DOM reads the entire XML document into memory. If the entire XML document is read into memory, the scalability of your application is limited. Using XmlTextReader in combination with the XmlTextWriter class permits you to handle much larger documents than a DOM-based XmlDocument can.
http://msdn.microsoft.com/en-us/library/ff647804.aspx

import - xpath or convert to .csv?

I'm looking for some advice on how I should go about a solution. I have an import to write using C#. The data comes from an XML file containing ~30,000 records, each with ~10 nodes for different data. My initial thought was to build a node list of record IDs (one of the nodes is a unique ID), then loop through the node list and use XPath to get the rest of the data for each record. My other thought was to convert the XML file to .csv format and read it that way. Before I dive head first into one or the other: any advice, pros/cons, or suggestions? Thanks in advance
Go with whichever you feel more comfortable with.
Personally, I would use XDocument and LINQ to XML to query the XML directly.
Transforming to CSV has its own pitfalls, if you don't adhere to the rules (quoting fields, line breaks within fields etc...).
I agree with the above poster that you want to use LINQ to XML if possible. However, if you are on an older version of the framework, you could use an XmlDocument and the SelectNodes/SelectSingleNode methods. If you do that, make sure you use an XmlNamespaceManager, or your calls won't return anything unless your XML has no namespaces.
That got me a bunch of times.
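To illustrate the gotcha (the namespace URI and element names are made up): with a default namespace on the document, SelectNodes("//Record") silently returns nothing; you have to register a prefix for the namespace and use it in the XPath.

    using System;
    using System.Xml;

    class NamespaceManagerExample
    {
        static void Main()
        {
            var doc = new XmlDocument();
            doc.LoadXml("<root xmlns='http://example.com/import'>" +
                        "<Record id='1'/><Record id='2'/></root>");

            // Without this, doc.SelectNodes("//Record") finds no nodes at all,
            // because an unprefixed XPath name means "no namespace".
            var ns = new XmlNamespaceManager(doc.NameTable);
            ns.AddNamespace("i", "http://example.com/import");

            foreach (XmlNode node in doc.SelectNodes("//i:Record", ns))
                Console.WriteLine(node.Attributes["id"].Value);
        }
    }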
