Extracting a small subset of data from XMLs - C#

I am writing a C# / VB program that is to be used for reporting data based upon information received in XMLs.
My situation is that I receive many XMLs per month (about 100-200), each ranging in size from 10 MB to 350 MB. From each of these XMLs, I only need a small subset of its data (less than 5% of any one file's entire data) to produce the necessary reports.
Also, that subset of data will always be held in the same key structure (it may exist within multiple keys and at differing levels down, but it will always exist within the same key names, and the keys containing it will always have the same attributes, such as "name", etc.).
So, my current idea of how to go about doing this is to:
Create a "scraper" that will pull the necessary data from the XMLs using XPath.
Store that small subset of necessary data in a SQL Server table, along with file characteristic data stored in a separate table, so as to know which file the scraped data came from.
Query the data out into a program for reporting.
My main question here is really what is the best way to scrape that data out?
I am most familiar with XPath, but for multiple files of 200 MB in size, I'm afraid of performance issues from loading each entire file into memory.
Other things I have seen / researched are:
Creating an XSLT file to transform / pull from the XML only the data I want
Using Linq to XML
Somehow linking the XMLs to SQL Server and then being able to query them directly
Using ADO to query the XMLs from within the program
Doing it using the XMLReader class (rather than loading in each XML entirely)
Maybe there is a native .Net component that does this very well already
Quite honestly, I just have no clue what the standard approach is given the high number of XMLs and the large variance in file sizes. I'm also not familiar with some of the other ways of doing this - such as linking the XMLs to SQL Server directly or using ADO to query the XML - and, therefore, don't know their possible benefits / drawbacks.
If any of you have been in a similar situation, I'd really appreciate any kind of pointers in the right direction / at least validation that my method isn't the worst one out there :)
Thanks!!!

As for the memory consumption and performance concerns, a nice feature of the .NET XML APIs is that you can combine XmlReader with XPathDocument, XmlDocument, or XElement to selectively read only part of a document into memory and then have the XPath or LINQ to XML features available on that part. LINQ to XML has XNode.ReadFrom (http://msdn.microsoft.com/en-us/library/system.xml.linq.xnode.readfrom%28v=vs.110%29.aspx) for doing that; DOM/XmlDocument has XmlDocument.ReadNode (http://msdn.microsoft.com/en-us/library/system.xml.xmldocument.readnode%28v=vs.110%29.aspx). So, depending on your XML structure, you might be able to use an XmlReader to read forward through the XML quickly without consuming much memory and then, when you reach an element you are interested in, read it into an XElement (LINQ to XML) or XmlNode (DOM) and apply LINQ to XML and/or XPath to pull out the details.
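For example, here is a minimal sketch of that pattern: stream forward with XmlReader and hand each matching element to LINQ to XML. The element name "Record" and the "name" attribute are placeholders for whatever actually holds the data to be scraped.

```csharp
using System.Collections.Generic;
using System.Xml;
using System.Xml.Linq;

class XmlScraper
{
    // Stream forward through a large file and materialize only the elements
    // with the given name; the rest of the document never enters memory.
    static IEnumerable<XElement> StreamElements(string path, string elementName)
    {
        using (XmlReader reader = XmlReader.Create(path))
        {
            reader.MoveToContent();
            while (!reader.EOF)
            {
                if (reader.NodeType == XmlNodeType.Element && reader.Name == elementName)
                {
                    // ReadFrom consumes the element and leaves the reader
                    // positioned on the node that follows it.
                    yield return (XElement)XNode.ReadFrom(reader);
                }
                else
                {
                    reader.Read();
                }
            }
        }
    }

    static void Main()
    {
        foreach (XElement record in StreamElements(@"C:\data\big.xml", "Record"))
        {
            string name = (string)record.Attribute("name");
            // ... extract the other values and write them to SQL Server here
        }
    }
}
```

Each yielded XElement can then be queried with LINQ to XML or XPathSelectElements, while the file itself is only ever read once, forward-only.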

Related

Indexing a large XML file

Given a large (74GB) XML file, I need to read specific XML nodes by a given Alphanumeric ID. It takes too long to read from top-to-bottom of the file looking for the ID.
Is there an analogue of an index for XML files like there is for relational databases? I imagine a small index file where the alphanumeric ID is quick to find and which points to the location in the larger file.
Do index files for XML exist? How can they be implemented in C#?
XML databases such as BaseX, eXistDB, or MarkLogic do what you are looking for: they load XML documents into a persistent form on disk and allow fast access to parts of the document by use of indexes.
Some XML databases are optimized for handling many small documents, others are able to handle a small number of large documents, so choose your product carefully (I can't advise you on this), and consider breaking the document up into smaller parts as it is loaded.
If you need to split the large document into lots of small documents, consider a streaming XSLT 3.0 processor such as Saxon-EE. I would expect that processing 75 GB should take about an hour, depending, obviously, on the speed of your machine.
No, that is beyond the scope of what XML tries to achieve. If the XML does not change often and you read from it a lot, I would propose rewriting its content into a local SQLite DB once per change and then reading from the database instead. When doing the rewriting, remember that SAX-style XML reading is your friend in the case of huge files like this.
Theoretically, you can create a sort-of index by remembering the locations of already discovered IDs and then parsing on your own, but that would be very brittle. XML is not simple enough for you to parse on your own and hope you will be standards compliant.
Of course, I am assuming here that you can't do anything about the larger design itself: as others noted, the size of that file suggests that there is an architectural problem.
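A rough sketch of that once-per-change rewrite, assuming the Microsoft.Data.Sqlite package and a hypothetical <record id="..."> layout for the big file:

```csharp
using System.Xml;
using Microsoft.Data.Sqlite;

class XmlToSqlite
{
    static void Rewrite(string xmlPath, string dbPath)
    {
        using (var conn = new SqliteConnection($"Data Source={dbPath}"))
        {
            conn.Open();
            using (var create = conn.CreateCommand())
            {
                create.CommandText =
                    "CREATE TABLE IF NOT EXISTS records (id TEXT PRIMARY KEY, xml TEXT)";
                create.ExecuteNonQuery();
            }

            // Forward-only, SAX-style pass over the huge file: only the
            // current <record> fragment is ever held in memory.
            using (var reader = XmlReader.Create(xmlPath))
            {
                reader.MoveToContent();
                while (!reader.EOF)
                {
                    if (reader.NodeType == XmlNodeType.Element && reader.Name == "record")
                    {
                        string id = reader.GetAttribute("id");
                        string fragment = reader.ReadOuterXml(); // advances past the element

                        using (var insert = conn.CreateCommand())
                        {
                            insert.CommandText =
                                "INSERT OR REPLACE INTO records (id, xml) VALUES (@id, @xml)";
                            insert.Parameters.AddWithValue("@id", id);
                            insert.Parameters.AddWithValue("@xml", fragment);
                            insert.ExecuteNonQuery();
                        }
                    }
                    else
                    {
                        reader.Read();
                    }
                }
            }
        }
    }
}
```

For a file of that size you would also want to batch the inserts inside transactions to keep the load time reasonable; lookups afterwards become a plain SELECT xml FROM records WHERE id = @id.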

Is it better to have SQL Server parse a large multi document XML or send it each document separately

I need to request the XML from PubMed like
https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&id=27087788,28322247,26158412&retmode=xml
The example has 3 IDs but the request can be as many as 200 at a time. The request is being done by a .NET web service. I am looking for the most efficient way to process the XML files. I know that the terms "best" or "efficient" are very subjective and dependent upon many things, but:
Is it better to send the entire string to the SQL Server database (if it is even possible because of length or possible nesting levels) and let it parse the documents and save them to the database, or is it better to parse the document in the web service using an XmlTextReader or XML Document Object and send each document? Each document needs to be saved as a separate record.
Thanks for your information.
My first thought was: Why SQL-Server? Why send all this data around? Do the parsing in C#!
But, at second glance: if I understand this correctly, you want to read many different XML files and store them in your database.
Now I'd rather ask: when do you need to retrieve data from these XMLs, and do you need to store extracted data in relational tables? Would it be a possible approach for you to store all these XMLs as-is in XML-typed columns and read them on demand?
You can pass your XML as a C# string (which is Unicode) and insert it directly into an XML-typed column. To avoid any hassle you should cut away the first lines (the <?xml?> declaration and the DOCTYPE) and start with <PubmedArticleSet>.
The rest should be easily transferred and stored in SQL Server.
If you need help on how to read this, just come back with another, more pointed question.
About your "which is faster" question, you might read this.
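A minimal sketch of that insert, assuming System.Data.SqlClient and a hypothetical table PubmedDocs with an XML-typed column Doc:

```csharp
using System.Data;
using System.Data.SqlClient;

class PubmedLoader
{
    static void SaveDocument(string connectionString, string rawResponse)
    {
        // Cut away everything before the root element (the <?xml?> declaration
        // and the DOCTYPE), then store the remainder as-is.
        string xml = rawResponse.Substring(rawResponse.IndexOf("<PubmedArticleSet"));

        using (var conn = new SqlConnection(connectionString))
        using (var cmd = new SqlCommand("INSERT INTO PubmedDocs (Doc) VALUES (@doc)", conn))
        {
            cmd.Parameters.Add("@doc", SqlDbType.Xml).Value = xml;
            conn.Open();
            cmd.ExecuteNonQuery();
        }
    }
}
```

Once the set is stored, splitting it into one row per PubmedArticle (if you still need that) can be done on the SQL Server side against the XML column.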

Fastest efficient way to create XML document .NET/Oracle and return to web client

Does anyone know which is the fastest, most efficient way to create XML documents in an Oracle/.NET environment?
There are two philosophies:
Do the coding in an Oracle package and use Oracle's native XML abilities to return an XML document: after querying the data from the DB, create the document by looping through your query result and setting nodes like so: addXMLNode(doc, nContact, 'ROW_ID', rec_con.ROW_ID);
Just query the data from Oracle and use .NET on the data, looping through the data reader and creating your XML document using the .NET XML classes. Essentially, the DB serves the data, and the DOM XML creation is done in .NET.
Assuming no knowledge difference between the two practices, does someone know if one is more efficient, faster, or better than the other? Please don't give me your "favorite" way to handle it. An "our query was slow so we moved it into the code" (or vice versa) real-world example would give me some direction for code refactoring and application performance improvement.
Thanks.

Performance difference between linq to xml and xml serialization

I would like to create an XML file (100 lines, 5 namespaces and 30 different tags, 20 attributes total). I already have a hardcoded XML example, but I need to write some C# code to generate the XML dynamically and fill in the values, which of course can change. Performance is a concern.
Should I use LINQ to XML and create all the tags with XDocument and XElement, providing variables that contain the dynamic values?
Or, since I already have an XML example, create a schema.xsd and provide the values to the object (XML serialization)?
The XML (the object stream) will be sent via HTTP POST every second to a web service.
I am going to time-test both versions, but I was just curious if someone has already done that.
The LINQ to XML version should have better performance.
If you want to optimize it even more you probably should consider direct string concatenation (but that's not a best practice and the performance gain won't be significant).
The next performance option would be XmlTextWriter. It is probably the fastest way to write XML "correctly" - it doesn't need to create an XML object model like LINQ to XML, so it should be significantly faster.
You can optimize serialization a bit if you cache the XmlSerializer instance and don't create it every time. Then it will also be relatively fast, though definitely slower than direct XML writes.
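A small sketch of the caching point, using a hypothetical Payload class to stand in for whatever you actually serialize:

```csharp
using System.IO;
using System.Xml.Serialization;

public class Payload
{
    public string Name { get; set; }
    public int Value { get; set; }
}

public static class PayloadXml
{
    // Created once: constructing an XmlSerializer generates a serialization
    // assembly, which is the expensive part you don't want to repeat per call.
    private static readonly XmlSerializer Serializer = new XmlSerializer(typeof(Payload));

    public static string ToXml(Payload payload)
    {
        using (var writer = new StringWriter())
        {
            Serializer.Serialize(writer, payload);
            return writer.ToString();
        }
    }
}
```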

Parsing XML with DataSet- Performance

As per my requirement, I want to display some product-related information in my UI.
All the information comes through one API URL. The API returns XML output. The XML may have more than 100 tags, but as per my requirement I want only 30 to 50 tags. I need to pass a parameter as input and get the product information.
I am using an .asmx service as a wrapper service, and all the parsing is done there.
In the code-behind page, I consume the service and display the information.
How should I parse the XML? Currently my plan is to load the XML into a DataSet (ds.ReadXml(xml)).
Does it affect performance? Is there any other way to do it? Please guide me.
If you want to bind the result to a control, then the DataSet approach (indicated by you) makes sense. However, if you just need the text values of those 30-50 tags, regardless of what the parent/child nodes in between are, you can use XmlDocument/XPath.
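For the XmlDocument/XPath route, a minimal sketch (the tag names are placeholders, and the API response is assumed to already be in a string):

```csharp
using System.Xml;

class ProductParser
{
    static void Extract(string apiXml)
    {
        var doc = new XmlDocument();
        doc.LoadXml(apiXml);

        // Grab just the text of the tags you care about, wherever they sit in the tree.
        string name  = doc.SelectSingleNode("//ProductName")?.InnerText;
        string price = doc.SelectSingleNode("//Price")?.InnerText;
    }
}
```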
I would use LINQ to XML
more info at
http://msdn.microsoft.com/en-us/library/bb387098.aspx
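Along those lines, a small sketch (again with placeholder element names) that pulls just the wanted fields into a list:

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

class ProductQuery
{
    static List<(string Name, decimal? Price)> ExtractProducts(string apiXml)
    {
        var doc = XDocument.Parse(apiXml);

        // Only the handful of fields the UI needs; everything else in the
        // 100+ tag response is simply ignored.
        return doc.Descendants("Product")
                  .Select(p => ((string)p.Element("Name"),
                                (decimal?)p.Element("Price")))
                  .ToList();
    }
}
```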
For older versions of the framework, use the XmlTextReader.
Use the XmlTextReader class to process large XML documents in an efficient, forward-only manner. XmlTextReader uses small amounts of memory.
Avoid using the DOM because the DOM reads the entire XML document into memory. If the entire XML document is read into memory, the scalability of your application is limited. Using XmlTextReader in combination with an XmlTextWriter class permits you to handle much larger documents than a DOM-based XmlDocument class.
http://msdn.microsoft.com/en-us/library/ff647804.aspx
