XML binary search through attributes - C#

I have an XML document which contains tags with a numerical attribute, and they are sorted by this attribute. Is it worth finding these tags through binary search? Or is access to the nodes not constant-time, so that it would be better to copy the data into an array first? And is this implemented in any XML libraries?

It depends on how you are planning to read the document. There are two ways to handle XML documents:
The first, implemented by XmlReader, is stream-oriented: it processes one element at a time, and nothing more is stored in program memory.
The second follows the Document Object Model interface: it loads the entire document into memory and allows you to query it without going back to the file. The best way to do that in .NET is LINQ to XML.
Depending on the size of your document, one or the other may be the better choice, but be aware that anything other than a linear search is impossible with the stream-oriented API.
And as far as I know, you can't get binary search with LINQ to XML out of the box, because it exposes results as IEnumerable. You'd have to copy your elements into an array and then implement the binary search on that array, as sketched below. Definitely not a difficult task anyway.
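
Here is a minimal sketch of that approach. It assumes sibling elements like <item id="42"/> sorted ascending by a numeric "id" attribute; the file name and all element/attribute names are placeholders for illustration.

    using System;
    using System.Linq;
    using System.Xml.Linq;

    class Program
    {
        static void Main()
        {
            XDocument doc = XDocument.Load("data.xml"); // hypothetical file

            // Copy the already-sorted elements into an array once: array
            // indexing is O(1), so binary search over it is O(log n).
            XElement[] items = doc.Root.Elements("item").ToArray();

            XElement match = FindById(items, 42);
            Console.WriteLine(match);
        }

        // Classic binary search keyed on the numeric "id" attribute.
        static XElement FindById(XElement[] items, int target)
        {
            int lo = 0, hi = items.Length - 1;
            while (lo <= hi)
            {
                int mid = lo + (hi - lo) / 2;
                int key = (int)items[mid].Attribute("id");
                if (key == target) return items[mid];
                if (key < target) lo = mid + 1;
                else hi = mid - 1;
            }
            return null; // not found
        }
    }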

Related

Indexing a large XML file

Given a large (74 GB) XML file, I need to read specific XML nodes by a given alphanumeric ID. It takes too long to read the file from top to bottom looking for the ID.
Is there an analogue of an index for XML files, like there is for relational databases? I imagine a small index file where the alphanumeric ID is quick to find, and which points to the location in the larger file.
Do index files for XML exist? How can they be implemented in C#?
XML databases such as BaseX, eXistDB, or MarkLogic do what you are looking for: they load XML documents into a persistent form on disk and allow fast access to parts of the document by use of indexes.
Some XML databases are optimized for handling many small documents, others are able to handle a small number of large documents, so choose your product carefully (I can't advise you on this), and consider breaking the document up into smaller parts as it is loaded.
If you need to split the large document into lots of small documents, consider a streaming XSLT 3.0 processor such as Saxon-EE. I would expect processing 75 GB to take about an hour, depending, obviously, on the speed of your machine.
No, that is beyond the scope of what XML tries to achieve. If the XML does not change often and you read from it a lot, I would propose rewriting its content into a local SQLite DB once per change and then reading from the database instead. When doing the rewriting, remember that SAX-style XML reading is your friend for huge files like this; a sketch follows below.
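A minimal sketch of that once-per-change rewrite, streaming with XmlReader so the 74 GB file never sits in memory. The element and attribute names ("record", "id"), the file names, and the use of the Microsoft.Data.Sqlite package are assumptions for illustration; it also assumes the records are not nested inside each other.

    using System.Xml;
    using Microsoft.Data.Sqlite;

    using var conn = new SqliteConnection("Data Source=index.db");
    conn.Open();

    using (var create = conn.CreateCommand())
    {
        create.CommandText =
            "CREATE TABLE IF NOT EXISTS records (id TEXT PRIMARY KEY, xml TEXT)";
        create.ExecuteNonQuery();
    }

    // One transaction for the whole load: dramatically faster in SQLite.
    using var tx = conn.BeginTransaction();
    using var insert = conn.CreateCommand();
    insert.Transaction = tx;
    insert.CommandText = "INSERT INTO records (id, xml) VALUES ($id, $xml)";
    var pId = insert.Parameters.Add("$id", SqliteType.Text);
    var pXml = insert.Parameters.Add("$xml", SqliteType.Text);

    using (var reader = XmlReader.Create("huge.xml")) // hypothetical file
    {
        // Forward-only scan: only the current record is ever in memory.
        while (reader.ReadToFollowing("record"))
        {
            pId.Value = reader.GetAttribute("id");
            pXml.Value = reader.ReadOuterXml(); // the record's subtree as text
            insert.ExecuteNonQuery();
        }
    }
    tx.Commit();

Afterwards, each lookup becomes a simple SELECT xml FROM records WHERE id = $id against the primary-key index.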
Theoretically, you could create a sort-of index by remembering the locations of already discovered IDs and then parsing the file on your own, but that would be very brittle. XML is not simple enough for you to parse it yourself and hope to stay standards-compliant.
Of course, I assume here that you can't do anything about the larger design itself: as others noted, the size of that file suggests an architectural problem.

C#: compare two XML files

I would like to write a function to compare two XML files. Based on the following cases, the diff could be due to:
the order of the nodes, and
some nodes or attributes having been added or removed.
I have found a solution to traverse the XML using a recursive function or LINQ to XML, so basically I can get all nodes and attributes. I have read about the XML Diff and Patch Tool, but I'm trying to avoid dependencies in my project. An added complexity is determining which line the diff occurred on, but that is optional for now.
I'm currently thinking of storing the nodes and attributes of the two XML files in a data structure (e.g. a dictionary) and comparing the dictionaries later, but I'm not quite sure how to do this. Can you share some ideas?
There is XNode.DeepEquals (https://msdn.microsoft.com/en-us/library/system.xml.linq.xnode.deepequals%28v=vs.110%29.aspx), which lets you do XNode.DeepEquals(XDocument.Load("file1.xml"), XDocument.Load("file2.xml")), but that will only give you a boolean result, not an indication of where the difference is.
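If you do need to know where the documents differ, a minimal hand-rolled sketch is below. It pairs child elements by name (so sibling order is ignored, per the first case above) and prints the path of each mismatch. Real diff tools handle cases this naive version doesn't, such as repeated siblings with the same name.

    using System;
    using System.Linq;
    using System.Xml.Linq;

    static class XmlDiff
    {
        public static void Compare(XElement a, XElement b, string path)
        {
            // Attributes missing on one side, or holding different values.
            var names = a.Attributes().Select(x => x.Name)
                         .Union(b.Attributes().Select(x => x.Name));
            foreach (var n in names)
                if ((string)a.Attribute(n) != (string)b.Attribute(n))
                    Console.WriteLine($"{path}/@{n}: '{(string)a.Attribute(n)}' vs '{(string)b.Attribute(n)}'");

            // Pair child elements by name, ignoring document order.
            var left = a.Elements().OrderBy(e => e.Name.ToString()).ToList();
            var right = b.Elements().OrderBy(e => e.Name.ToString()).ToList();
            if (left.Count != right.Count)
            {
                Console.WriteLine($"{path}: {left.Count} vs {right.Count} child elements");
                return;
            }
            for (int i = 0; i < left.Count; i++)
            {
                if (left[i].Name != right[i].Name)
                {
                    Console.WriteLine($"{path}: <{left[i].Name}> vs <{right[i].Name}>");
                    continue;
                }
                Compare(left[i], right[i], $"{path}/{left[i].Name}");
            }

            // Leaf text comparison.
            if (!a.HasElements && !b.HasElements && a.Value != b.Value)
                Console.WriteLine($"{path}: '{a.Value}' vs '{b.Value}'");
        }
    }

    // Usage:
    // var d1 = XDocument.Load("file1.xml");
    // var d2 = XDocument.Load("file2.xml");
    // XmlDiff.Compare(d1.Root, d2.Root, "/" + d1.Root.Name);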

Extracting a small subset of data from XMLs

I am writing a C# / VB program that is to be used for reporting data based upon information received in XMLs.
My situation is that I receive many XMLs per month (about 100-200), each ranging in size from 10 MB to 350 MB. For each of these XMLs, I only need a small subset of its data (less than 5% of any one file's entire data) to produce the necessary reports.
Also, that subset of data will always be held in the same key structure (it may exist within multiple keys and at differing levels down, but it will always exist within the same key names, and the keys containing it will always have the same attributes, such as "name", etc.).
So, my current idea of how to go about this is to:
Create a "scraper" that will pull the necessary data from the XMLs using XPath.
Store that small subset of necessary data in a SQL Server table, along with file characteristics stored in a separate table, so as to know which file the scraped data came from.
Query the data back out into a program for reporting.
My main question here is really: what is the best way to scrape that data out?
I am most familiar with XPath, but for multiple files of 200 MB in size, I'm afraid of the performance hit of loading each entire file into memory.
Other things I have seen / researched are:
Creating an XSLT file to transform / pull from the XML only the data I want
Using LINQ to XML
Somehow linking the XMLs to SQL server and then being able to query them directly
Using ADO to query the XMLs from within the program
Doing it with the XmlReader class (rather than loading each XML entirely)
Maybe there is a native .Net component that does this very well already
Quite honestly, I just have no clue what the standard approach is, given the high number of XMLs and the large variance in file sizes. I'm also not familiar with some of these options, such as linking the XMLs to SQL Server directly or using ADO to query them, so I don't know their possible benefits and drawbacks.
If any of you have been in a similar situation, I'd really appreciate any kind of pointers in the right direction / at least validation that my method isn't the worst one out there :)
Thanks!!!
As for the memory consumption and performance concerns, a nice feature of the .NET XML APIs is that you can combine XmlReader with XPathDocument, XmlDocument, or XElement to selectively read only part of a document into memory, and then have the XPath or LINQ to XML features available on that part. LINQ to XML has XNode.ReadFrom (http://msdn.microsoft.com/en-us/library/system.xml.linq.xnode.readfrom%28v=vs.110%29.aspx) for doing that; DOM/XmlDocument has XmlDocument.ReadNode (http://msdn.microsoft.com/en-us/library/system.xml.xmldocument.readnode%28v=vs.110%29.aspx). So, depending on your XML structure, you can use an XmlReader to read forward through the XML quickly without consuming much memory and then, when you reach an element you are interested in, read it into an XElement (LINQ to XML) or XmlNode (DOM) and apply LINQ to XML and/or XPath to read out the details, as sketched below.
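
A minimal sketch of that combination, following the pattern in the XNode.ReadFrom documentation. The file name "report.xml" and the element/attribute names are placeholders for illustration.

    using System;
    using System.Xml;
    using System.Xml.Linq;

    using (XmlReader reader = XmlReader.Create("report.xml"))
    {
        reader.MoveToContent();
        while (!reader.EOF)
        {
            if (reader.NodeType == XmlNodeType.Element && reader.Name == "Record")
            {
                // Materialise only this element's subtree. ReadFrom advances
                // the reader past the element, so no extra Read() here.
                var record = (XElement)XNode.ReadFrom(reader);

                // LINQ to XML (or XPath via XPathSelectElement) now works
                // on the small in-memory fragment.
                Console.WriteLine((string)record.Attribute("name"));
            }
            else
            {
                reader.Read();
            }
        }
    }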

Parsing XML with DataSet - Performance

As per my requirement, I want to display some product-related information in my UI.
All the information comes through one API URL, and the API returns XML output. The XML may have more than 100 tags, but as per my requirement I only want 30 to 50 of them. I need to pass a parameter as input and get the product information back.
I am using an .asmx service as a wrapper service, and all the parsing is done there.
In the code-behind page, I consume the service and display the information.
How should I parse the XML? Currently I plan to read the XML into a DataSet (ds.ReadXml(xml)).
Does it affect performance? Is there any other way to do it? Please guide me.
If you want to bind the result to a control, then the DataSet approach (indicated by you) makes sense. However, if you just need the text values of those 30-50 tags, regardless of what the parent/child nodes in between are, you can use XmlDocument/XPath.
I would use LINQ to XML.
More info at http://msdn.microsoft.com/en-us/library/bb387098.aspx, and a minimal sketch below.
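This sketch assumes the API response is already in a string, and that the tags of interest are named "Product", "Name", and "Price"; all of those names, and the grid control, are placeholders for illustration.

    using System.Linq;
    using System.Xml.Linq;

    XDocument doc = XDocument.Parse(apiXml); // apiXml: the XML string from the API

    // Project just the tags you need into a simple shape for binding.
    var rows = doc.Descendants("Product")
                  .Select(p => new
                  {
                      Name = (string)p.Element("Name"),
                      Price = (decimal?)p.Element("Price"),
                  })
                  .ToList();

    productsGrid.DataSource = rows; // e.g. a GridView in the code-behind
    productsGrid.DataBind();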
For older versions of the framework, use the XmlTextReader.
Use the XmlTextReader class to process large XML documents in an efficient, forward-only manner. XmlTextReader uses small amounts of memory.
Avoid using the DOM, because the DOM reads the entire XML document into memory. If the entire XML document is read into memory, the scalability of your application is limited. Using XmlTextReader in combination with the XmlTextWriter class permits you to handle much larger documents than a DOM-based XmlDocument can.
http://msdn.microsoft.com/en-us/library/ff647804.aspx
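For completeness, a minimal forward-only sketch in the spirit of that guidance, using XmlReader.Create (the modern entry point; XmlTextReader works the same way on older framework versions). The file and element names are placeholders.

    using System;
    using System.Xml;

    using (XmlReader reader = XmlReader.Create("products.xml"))
    {
        // Skips forward node by node; no DOM is ever built.
        while (reader.ReadToFollowing("Price"))
        {
            Console.WriteLine(reader.ReadElementContentAsString());
        }
    }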

LINQ: How to search data from a large XML file?

I have an XML file of about 500 MB, and I'm using LINQ with C# to query that file, but it's very slow because it loads everything into memory. Is there any way I can query the file without loading it all into memory?
Thanks
This article should get you up and running. Take a look at the SimpleStreamAxis method, which is very handy for finding nodes in large XML files. I've successfully used a variant of this method on 5 GB XML files without loading the file into memory.
You can use the technique described on MSDN's page about XNode.ReadFrom to generate an IEnumerable of XNodes (in the example they provide, XElements) from an XmlReader.
Note that when you read an XElement from a Stream or XmlReader, the entire contents of that element must be read too, so you'll still need a little custom logic in the enumerator to ensure that the right XElements get returned: for instance, if you return the root element, you might as well just parse the entire document right away, since the root element contains almost everything anyhow. The XNode.ReadFrom example contains such logic too, and the sketch below follows the same shape.
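A minimal streaming-axis sketch modelled on that XNode.ReadFrom example: the iterator yields one matching element at a time, so ordinary LINQ operators can run over a file that is never fully in memory. The file and element names in the usage lines are placeholders for illustration.

    using System;
    using System.Collections.Generic;
    using System.Linq;
    using System.Xml;
    using System.Xml.Linq;

    static IEnumerable<XElement> StreamElements(string path, string elementName)
    {
        using (XmlReader reader = XmlReader.Create(path))
        {
            reader.MoveToContent();
            while (!reader.EOF)
            {
                if (reader.NodeType == XmlNodeType.Element && reader.Name == elementName)
                {
                    // Materialise just this subtree; ReadFrom moves the
                    // reader past it, so no extra Read() here.
                    yield return (XElement)XNode.ReadFrom(reader);
                }
                else
                {
                    reader.Read();
                }
            }
        }
    }

    // Usage: ordinary LINQ over a 500 MB file, one element in memory at a time.
    var expensive = StreamElements("catalog.xml", "Item")
        .Where(e => (decimal?)e.Element("Price") > 100m)
        .Select(e => (string)e.Attribute("name"));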
No, it's not possible when using LINQ: LINQ to XML loads a model of the full XML into memory so you can have access through the tree structure.
If you want fast access without loading the file into memory, you could use the XmlReader class.
This class gives you a fast, forward-only XML parser that keeps only the current node in memory.
Here is some help on that: http://support.microsoft.com/kb/307548
Edit: Sorry, I didn't know that it's possible to combine XmlReader with LINQ.
