First, what I describe here is a small part of a larger ETL process that is already in place. So, please no suggestions to port to SSIS or some other environment because I can't.
In this ETL process, for each table in the SQL Server database that is being inserted into, I am:
loading all of the relevant XML into an XElement object
then transforming the XML into a typed dataset DataTable
then using a SqlBulkCopy object to quickly insert the data into the SQL Server table.
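For context, a rough sketch of that per-table flow (the table name, columns, and element names below are hypothetical placeholders, not the real schema):

using System.Data;
using System.Data.SqlClient;
using System.Xml.Linq;

static void LoadTable(string xmlPath, string connectionString)
{
    // Loads the whole file into memory - this is the step that blows up for the one oversized table.
    XElement root = XElement.Load(xmlPath);

    // Stand-in for the typed dataset DataTable.
    var table = new DataTable();
    table.Columns.Add("Id", typeof(int));
    table.Columns.Add("Name", typeof(string));

    foreach (XElement row in root.Elements("row"))
        table.Rows.Add((int)row.Element("Id"), (string)row.Element("Name"));

    using (var bulk = new SqlBulkCopy(connectionString) { DestinationTableName = "dbo.MyTable" })
        bulk.WriteToServer(table);
}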
But, for one table, when I create the XElement, I get an OutOfMemoryException.
So, I now need to iteratively process the data in chunks, but I'm not sure of the best way to do this. The xml file is stored on the same machine that is running the ETL process.
Thanks for any help.
UPDATE
I'm getting started reading about the XmlReader class, which I've never used. If someone thinks this is the answer, please say so and provide any guidance that you can.
Don't build the whole document as an XElement - use .NET's streaming, pull-based XML parser and never materialize all of the objects in memory at once. Simple. There is an API for that.
Basically, use an XmlTextReader (i.e., an XmlReader).
In addition to plain use of XmlReader, it can be useful to know about the XNode.ReadFrom method. It works particularly well when the XML is more like a very long list of entities than a deeply nested hierarchy.
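As an illustration, here is a minimal sketch of that pattern, assuming the big file is essentially a flat list of <row> elements (the element name is a placeholder):

using System.Collections.Generic;
using System.Xml;
using System.Xml.Linq;

// Yields one <row> element at a time; only the current element is ever materialized.
static IEnumerable<XElement> StreamRows(string xmlPath)
{
    using (XmlReader reader = XmlReader.Create(xmlPath))
    {
        reader.MoveToContent();
        while (!reader.EOF)
        {
            if (reader.NodeType == XmlNodeType.Element && reader.Name == "row")
            {
                // ReadFrom builds just this subtree and advances the reader past it.
                yield return (XElement)XNode.ReadFrom(reader);
            }
            else
            {
                reader.Read();
            }
        }
    }
}

From there the original flow can stay largely unchanged: fill the typed DataTable from, say, 10,000 streamed elements at a time and call SqlBulkCopy.WriteToServer once per batch, so memory usage stays flat regardless of file size.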
Related
I need to request the XML from PubMed like
https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&id=27087788,28322247,26158412&retmode=xml
The example has 3 IDs, but the request can include as many as 200 at a time. The request is being done by a .NET web service. I am looking for the most efficient way to process the XML files. I know that terms like "best" or "efficient" are very subjective and depend on many things, but:
Is it better to send the entire string to the SQL Server database (if that is even possible, given the length and possible nesting levels) and let it parse the document and save it to the database, or is it better to parse the document in the web service using an XmlTextReader or an XmlDocument object and send each document? Each document needs to be saved as a separate record.
Thanks for your information.
My first thought was: Why SQL-Server? Why send all this data around? Do the parsing in C#!
But - at second glance: if I understand this correctly, you want to read many different XML files and store them in your database.
Now I'd rather ask: when do you need to retrieve data from these XMLs, and do you need to store the extracted data in relational tables? Would it be a possible approach for you to store all of these XMLs as-is in XML-typed columns and read them on demand?
You can pass your XML as a C# string (which is Unicode) and insert it directly into an XML-typed column. To avoid any hassle you should cut away the first lines (the XML declaration and the DOCTYPE) and start with <PubmedArticleSet>.
The rest should be easy to transfer and store in SQL Server.
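For instance, a minimal sketch of that insert from C#, assuming a hypothetical table dbo.PubmedXml with an xml-typed column named Payload:

using System.Data;
using System.Data.SqlClient;

// Saves one PubMed response (already trimmed to start at <PubmedArticleSet>) into an xml column.
static void SaveArticleSet(string connectionString, string articleSetXml)
{
    using (var conn = new SqlConnection(connectionString))
    using (var cmd = new SqlCommand("INSERT INTO dbo.PubmedXml (Payload) VALUES (@xml);", conn))
    {
        cmd.Parameters.Add("@xml", SqlDbType.Xml).Value = articleSetXml;
        conn.Open();
        cmd.ExecuteNonQuery();
    }
}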
If you need help on how to read this, just come back with another, more pointed question.
About your "which is faster" question, you might read this.
Does anyone know which is the fastest, most efficient way to create XML documents in an Oracle/.NET environment?
There are two philosophies:
Do the coding in an Oracle package and use Oracle's native XML abilities to return an XML document: after querying the data from the DB, create the document by looping through your query result and setting nodes like so: addXMLNode(doc, nContact, 'ROW_ID', rec_con.ROW_ID);
Just query the data from Oracle and build the XML in .NET, looping through the data reader and creating your XML document using the .NET XML classes. Essentially, the DB serves the data and the DOM/XML creation is done in .NET.
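To make the second option concrete, a rough sketch of the .NET side (the query, table, and element names are invented, and it assumes the managed ODP.NET provider):

using System.Text;
using System.Xml;
using Oracle.ManagedDataAccess.Client; // assumed provider; the unmanaged Oracle.DataAccess.Client works the same way

// Builds the XML document entirely in .NET by looping through a data reader.
static string BuildContactsXml(string connectionString)
{
    var sb = new StringBuilder();
    using (var conn = new OracleConnection(connectionString))
    using (var cmd = new OracleCommand("SELECT row_id, name FROM contacts", conn)) // hypothetical query
    using (var xml = XmlWriter.Create(sb))
    {
        conn.Open();
        xml.WriteStartElement("CONTACTS");
        using (var reader = cmd.ExecuteReader())
        {
            while (reader.Read())
            {
                xml.WriteStartElement("ROW");
                xml.WriteElementString("ROW_ID", reader["ROW_ID"].ToString());
                xml.WriteElementString("NAME", reader["NAME"].ToString());
                xml.WriteEndElement();
            }
        }
        xml.WriteEndElement();
    }
    return sb.ToString();
}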
Assuming equal familiarity with the two practices, does someone know whether one is more efficient, faster, or otherwise better than the other? Please don't give me your "favorite" way to handle it. A real-world example along the lines of "our query was slow so we moved it into the code" (or vice versa) would give me some direction for code refactoring and application performance improvement.
Thanks.
I am writing a C# / VB program that is to be used for reporting data based upon information received in XMLs.
My situation is that I receive many XMLs per month (about 100-200), each ranging in size from 10 MB to 350 MB. For each of these XMLs, I only need a small subset of its data (less than 5% of any one file's entire data) in order to produce the necessary reports.
Also, that subset of data will always be held in the same key structure (it may exist within multiple keys and at differing levels down, but it will always exist under the same key names, and the keys containing it will always have the same attributes, such as "name", etc.).
So, my current idea of how to go about doing this is to:
Create a "scraper" that will pull the necessary data from the XMLs using XPath.
Store that small subset of necessary data in a SQL Server table, along with file characteristic data stored in a separate table, so I know which file the scraped data came from.
Query out the data into a program for reporting it.
My main question here is really what is the best way to scrape that data out?
I am most familiar with XPath, but for multiple files of 200MB in size, I'm afraid of performance issues loading in the entire file.
Other things I have seen / researched are:
Creating an XSLT file to transform / pull from the XML only the data I want
Using Linq to XML
Somehow linking the XMLs to SQL Server and then being able to query them directly
Using ADO to query the XMLs from within the program
Doing it using the XmlReader class (rather than loading in each XML entirely)
Maybe there is a native .Net component that does this very well already
Quite honestly, I just have no clue what the standard approach is, given the high number of XMLs and the large variance in file sizes. I'm also not familiar with any of the other ways of doing this - such as linking the XMLs to SQL Server directly or using ADO to query the XML - and therefore don't know their possible benefits and drawbacks.
If any of you have been in a similar situation, I'd really appreciate any kind of pointers in the right direction / at least validation that my method isn't the worst one out there :)
Thanks!!!
As for the memory consumption and performance concerns, a nice feature of the .NET XML APIs is that you can combine XmlReader with XPathDocument, XmlDocument, or XElement to selectively read only part of a document into memory and then have the XPath or LINQ to XML features available on that part. LINQ to XML has XNode.ReadFrom (http://msdn.microsoft.com/en-us/library/system.xml.linq.xnode.readfrom%28v=vs.110%29.aspx) for doing that; DOM/XmlDocument has XmlDocument.ReadNode (http://msdn.microsoft.com/en-us/library/system.xml.xmldocument.readnode%28v=vs.110%29.aspx). So, depending on your XML structure, you might be able to use an XmlReader to read forward through the XML quickly without consuming much memory and then, when you reach an element you are interested in, read it into an XElement (LINQ to XML) or XmlNode (DOM) and apply LINQ to XML and/or XPath to read out the details.
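For example, a small sketch of that combination for the scraping scenario above (the element name and the XPath expression are placeholders for whatever the real structure looks like):

using System;
using System.Xml;
using System.Xml.Linq;
using System.Xml.XPath; // XPathSelectElements extension for XElement

// Skims the file with XmlReader and only materializes the subtrees of interest.
static void Scrape(string xmlPath)
{
    using (XmlReader reader = XmlReader.Create(xmlPath))
    {
        reader.MoveToContent();
        while (!reader.EOF)
        {
            if (reader.NodeType == XmlNodeType.Element && reader.Name == "record")
            {
                var fragment = (XElement)XNode.ReadFrom(reader); // only this subtree is in memory
                foreach (var hit in fragment.XPathSelectElements(".//item[@name='wanted']"))
                    Console.WriteLine(hit.Value); // or write to the SQL Server staging table instead
            }
            else
            {
                reader.Read();
            }
        }
    }
}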
Can someone please guide me with this problem?
In my institution, we process huge XML files (up to 1 GB) and insert the details into a database table. Per the current design, we parse the XML file with XmlReader and build an XML string with the required data, which is then passed to a stored procedure (xml data type) to insert the details into the DB.
Now the problem is that we are not sure whether there is a better approach, so please suggest whether there are any features in .NET 3.5 and/or SQL Server 2005 that would handle this better than our current approach.
Any help in this regard would be highly appreciated.
Thanks.
Do you care at all what is in the XML file? If not, you can just use a StreamReader, get the text from the XML, and pass it along to the database.
If you need to validate that the XML is correct, it is a good idea to use XmlReader.
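A rough sketch of both suggestions, assuming the goal is simply to hand the raw text on to the database layer:

using System.IO;
using System.Xml;

// Returns the raw text of the file, optionally checking first that it is well-formed XML.
static string ReadXmlText(string path, bool checkWellFormed)
{
    if (checkWellFormed)
    {
        // XmlReader throws an XmlException on malformed input; the content itself is discarded here.
        using (XmlReader reader = XmlReader.Create(path))
            while (reader.Read()) { }
    }

    using (var sr = new StreamReader(path))
        return sr.ReadToEnd();
}

Keep in mind that for a 1 GB file the returned string is itself a very large in-memory allocation.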
However, just dumping 1 GB of XML into your database seems a bit weird; what is the purpose of this XML data? Is it a lot of nested elements? Maybe you could de-serialize it and store each object in the appropriate table instead, which would, in my opinion, lead to a more easily understandable design.
There are a couple of things you can think of to make the design of your software easier/better:
Does more than one XML file occur in the database at once?
How is the data shared between applications?
Have you considered using MemoryMappedFile?
Is it possible to de-serialize the XML into entities instead and store them appropriately?
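If the de-serialization route is viable, here is a minimal sketch with XmlSerializer (the Order/OrderBatch classes and element names are entirely made up):

using System.IO;
using System.Xml.Serialization;

// Hypothetical entity classes matching the repeated elements in the feed.
public class Order
{
    public int Id { get; set; }
    public string Customer { get; set; }
}

[XmlRoot("orders")]
public class OrderBatch
{
    [XmlElement("order")]
    public Order[] Orders { get; set; }
}

public static class OrderLoader
{
    public static OrderBatch Load(string path)
    {
        var serializer = new XmlSerializer(typeof(OrderBatch));
        using (var stream = File.OpenRead(path))
            return (OrderBatch)serializer.Deserialize(stream);
    }
}

Each Order can then be written to its appropriate table; for files near the 1 GB mark you would still want to combine this with XmlReader so the whole batch is never in memory at once.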
I suspect that if there are any performance issues, they will be with the stored procedure and the database side of things rather than with reading the file.
Why are you storing the XML file in a database table? I would suggest using a different solution would be appropriate, but without knowing more details about exactly what it is you are trying to do it is hard to advise.
If each first-level element in the xml is a record, i.e.
<rootNode>
<row>...</row>
<row>...</row>
<row>...</row>
</rootNode>
Then you could create an IDataReader implementation that reads the xml (via XmlReader) and presents each <row> as a record, to be imported using SqlBulkCopy. Pretty much like my old answer here.
Advantages:
SqlBulkCopy is the fastest way to get data into a database
stripping it into records makes appropriate use of a database, allowing indexing and proper typing
it doesn't rely on a huge BLOB going over the wire in an atomic way (necessary for the xml data type)
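A condensed sketch of that pattern, assuming each record looks like <row Id="..." Name="..."/> with just two columns (names invented); the IDataReader members that SqlBulkCopy does not need here are simply stubbed out:

using System;
using System.Data;
using System.Data.SqlClient;
using System.Xml;

// Forward-only IDataReader over <row> elements, intended only to feed SqlBulkCopy.
sealed class XmlRowReader : IDataReader
{
    private readonly XmlReader _xml;
    private readonly string[] _columns = { "Id", "Name" };
    private object[] _current;

    public XmlRowReader(string path) { _xml = XmlReader.Create(path); }

    // Advances to the next <row> element and captures its values.
    public bool Read()
    {
        while (_xml.Read())
        {
            if (_xml.NodeType == XmlNodeType.Element && _xml.Name == "row")
            {
                _current = new object[] { int.Parse(_xml.GetAttribute("Id")), _xml.GetAttribute("Name") };
                return true;
            }
        }
        return false;
    }

    public int FieldCount => _columns.Length;
    public object GetValue(int i) => _current[i];
    public string GetName(int i) => _columns[i];
    public int GetOrdinal(string name) => Array.IndexOf(_columns, name);
    public int GetValues(object[] values) { _current.CopyTo(values, 0); return _current.Length; }
    public bool IsDBNull(int i) => _current[i] == null;

    public void Dispose() => _xml.Dispose();
    public void Close() => _xml.Close();
    public bool IsClosed => _xml.ReadState == ReadState.Closed;
    public int Depth => 0;
    public int RecordsAffected => -1;
    public bool NextResult() => false;

    // Remaining members are not exercised by SqlBulkCopy in this sketch.
    public DataTable GetSchemaTable() => throw new NotSupportedException();
    public object this[int i] => GetValue(i);
    public object this[string name] => GetValue(GetOrdinal(name));
    public Type GetFieldType(int i) => _current[i].GetType();
    public string GetDataTypeName(int i) => GetFieldType(i).Name;
    public bool GetBoolean(int i) => (bool)GetValue(i);
    public byte GetByte(int i) => (byte)GetValue(i);
    public long GetBytes(int i, long fieldOffset, byte[] buffer, int bufferOffset, int length) => throw new NotSupportedException();
    public char GetChar(int i) => (char)GetValue(i);
    public long GetChars(int i, long fieldOffset, char[] buffer, int bufferOffset, int length) => throw new NotSupportedException();
    public IDataReader GetData(int i) => throw new NotSupportedException();
    public DateTime GetDateTime(int i) => (DateTime)GetValue(i);
    public decimal GetDecimal(int i) => (decimal)GetValue(i);
    public double GetDouble(int i) => (double)GetValue(i);
    public float GetFloat(int i) => (float)GetValue(i);
    public Guid GetGuid(int i) => (Guid)GetValue(i);
    public short GetInt16(int i) => (short)GetValue(i);
    public int GetInt32(int i) => (int)GetValue(i);
    public long GetInt64(int i) => (long)GetValue(i);
    public string GetString(int i) => (string)GetValue(i);
}

Usage is then just: using (var reader = new XmlRowReader("data.xml")) using (var bulk = new SqlBulkCopy(connectionString) { DestinationTableName = "dbo.MyTable" }) bulk.WriteToServer(reader);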
I have a 60 MB XML file that has a list of products, approximately 8k of them. I need to get all the products from this XML file into a SQL table. The XML file has a static name, so I know what to look for. I guess I want to know about the process: what makes the most sense and has the least overhead.
How/what is the best way to do this?
When do I parse the XML - do I have SQL handle it, or use some other method? In the past I have used a parser in a stored proc, but the old XML files were smaller, like 1-5 MB; I'm not sure if a 60 MB XML file will work.
Thoughts, Ideas?
Create an SSIS package so that you can rerun it. Have SQL handle the parsing by including the schema within the XML file.
It would probably be best to write a short program in a language that has both an XML parser and a DB interface. C#, Perl, Python, Java, whatever you know best.