Approach to process huge xml files in C#

Approach to process huge xml files in C# - c#

Can someone please guide me with this problem?
In my institution, we process xml files of huge size(max 1 GB) and insert the details into a database table. Per current design, we are parsing xml file with XmlReader and form a xml string with required data, which will then be passed into a stored procedure (xml data type) to insert the details into db.
Now the problem is we are not sure if there would be a better approach other than this ? so please suggest if are any new features available with .Net 3.5 and/or sql server 2005 to handle this in a way better than our approach.
Any help in this reagrd would be highly appreciated.
Thanks.

Do you care at all what is in the XML-file? If not, you can just use a StreamReader and get the text from the XML and just pass it along to the database.
If you need to validate that the XML is correct, it is a good idea to use XmlReader.
However, just dumping 1GB of XML into your database seems a bit weird, what is the purpose of this XML data? Is it a lot of nested elements? Maybe you could de-serialize it and store each object in the appropriet table instead, which would imo lead to a easier understandable design.
There are a couple of things you can think of to make the design of your software easier/better:
Does more than one XML file occure in the database at once?
How is the data shared between applications?
Have you considered using MemoryMappedFile?
Is it possible to de-serialize the XML into entities instead and store them approprietly?

I suspect that if there are any performance issues it will be with the stored procedure and the database side of things rather that reading the file.
Why are you storing the XML file in a database table? I would suggest using a different solution would be appropriate, but without knowing more details about exactly what it is you are trying to do it is hard to advise.

If each first-level element in the xml is a record, i.e.
<rootNode>
<row>...</row>
<row>...</row>
<row>...</row>
</rootNode>
Then you could create an IDataReader implemention that reads the xml (via XmlReader) and presents each as a record, to be imported using SqlBulkCopy. Pretty much like my old answer here.
Advantages:
SqlBulkCopy is the fastest way to get data into a database
stripping it into records makes appropriate use of a database, allowing indexing and proper typing
it doesn't rely on a huge BLOB going over the wire in an atomic way (necessary for the xml data type)

Related

Is it better to have SQL Server parse a large multi document XML or send it each document separately

I need to request the XML from PubMed like
https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&id=27087788,28322247,26158412&retmode=xml
The example has 3 IDs but the request can be as much as 200 at a time. The request is being done by a .NET web service. I am looking for the most efficient way to process the XML files. I know that the the term "best" or "efficient" is very subjective and dependent upon many things but:
Is it better to send the entire string to the SQL Server database (if it is even possible because of length or possible nesting levels) and let it parse the document and save it to the database or is it better to parse the document in the web service using a XMLTextReader or XML Document Object and send each document? Each document needs to be saved as a separate record.
Thanks for your information.

My first thought was: Why SQL-Server? Why send all this data around? Do the parsing in C#!
But - on the second sight: If I understand this correctly, you want to read many different XML files and store them in your database.
Now I'd rather ask: When do you need to retrieve data from these XMLs and do you need to store extracted data in relational tables? Would it be a possible approach for you to store alle these XMLs as-is in XML typed columns and read them on demand?
You can pass your XML as C#-string (which is unicode) and insert this directly into an XML-typed column. To avoid any hassel you should cut away the first lines (<xml>declaration and DOCTYPE) and start with <PubmedArticleSet>.
The rest should be easily transfered and stored in SQL-Server.
If you need help on how to read this? Just come back with another more pointed question.
About your Which is faster question you might read this.

Fastest efficient way to create XML document .NET/Oracle and return to web client

Does anyone know which is the fastest most efficient way to create XML documents in an Oracle/.NET environment.
There are two philosophies:
Do the coding in an Oracle Package and Use Oracles native XML abilities to return an XML document after querying the data from the DB, create the document by then looping through your Query result setting nodes like so (addXMLNode(doc, nContact, 'ROW_ID', rec_con.ROW_ID);)
Just query the Data from Oracle and use .NET on the data looping through the data reader and create your XML document using .NET XML classes. Essentially letting the DB serve the data, and DOM XML creation is done in .NET.
Assuming no knowledge difference in the two practices, does someone know if one is more efficient, faster, or better than the other one? Please don't give me your "favorite" way to handle it. An "our query was slow so we moved it into the code", or vice versa real world example would give me some direction for code refactoring and application performance improvement.
Thanks.

SQL Server: variable length XML storage

I'm in the process of writing an application that interacts with a third party application.
The third party application will be passing my application several raw XML requests. I would like to save each of these requests in a communications log in my DB.
What's the most efficient way to store this variable-length data? VARCHAR(MAX)? NVARCHAR(MAX)?
If one is a better choice than the other (or there is another option I'm missing), please explain why it's the best choice.

Since you're using SQL Server 2K5 the best data type to store XML data is xml.
This provides parsing and schema validation features. It also allows you to index the XML data later if need be.

XML seams to be the obvious data type of choice when dealing with XML but not always.
Have a look at this article by Robert Sheldon. Working with the XML Data Type in SQL
In some cases, you shouldn’t use the XML data type, but instead use large object storage—VARCHAR(MAX), NVARCHAR(MAX), or VARBINARY(MAX). For example, if you simply store your XML documents in the database and retrieve and update those documents as a whole—that is, if you never need to query or modify the individual XML components—you should consider using one of the large object data types. The same goes for XML files that you want to preserve in their original form, such as legal documents. If you need to retain an exact textual copy, use large object storage.

I have used the XML Datatype for this type of thing MSDN link - XML DataType 2005
Native and allows you to do some normal angle bracket things to the actual data.
Big plus is that I am not converting or messing around with the actual data, and introducing subtle bugs with the actual XML.
Big plus if you want to do anything with the XML like render it.
Downside is that you have to be aware the column is XML data and you need to code for it in upstream apps.

Adding to Yuck's answer:
VARCHAR(MAX)
XML means Unicode, and if you choose non-Unicode storage then data loss is almost certain
NVARCHAR(MAX)
appropriate if you only want to log the XML data
XML
if you want to query XML content later
typed XML with XML SCHEMA COLLECTION
only if you have a fixed xml schema (XSD) which will never ever change. (ALTER XML SCHEMA COLLECTION does not support update or delete of XML entities as I understand)

Processing Large XML file into SQL Server with C#

First, what I describe here is a small part of a larger ETL process that is already in place. So, please no suggestions to port to SSIS or some other environment because I can't.
In this ETL process, for each table in the SQL server database that is being inserted into, I am:
loading all of the relevant xml into an XElement object
then transforming the xml into a typed dataset datatable
then using a SqlBulkCopy object to quickly insert the data into the sql server table.
But, for one table, when I create the XElement, I get an OutOfMemory exception.
So, I now need to iteratively process the data in chunks, but I'm not sure of the best way to do this. The xml file is stored on the same machine that is running the ETL process.
Thanks for any help.
UPDATE
I'm getting started reading about the XmlReader class, which I've never used. If someone thinks this is the answer, please say so and provide any guidance that you will.

Don't use XmlElement - use the .NET SAX based parser to parse the XML stream. NEVER materialize the objects in memory. Simple. There is an API for that.
Basically, use an XmlTextReader.

In addition to plain use of XmlReader it could be useful to know about method XNode.ReadFrom. It works particularly well if XML is more like a very long list of entities as opposed to deep-nested hierarchy.

Is it a good idea to store serialized objects in a Database instead of multiple xml text files?

I am currently working on a web application that requires certain requests by users to be persisted. I have three choices:
Serialize each request object and store it as an xml text file.
Serialize the request object and store this xml text in a DB using CLOB.
Store the requests in separate tables in the DB.
In my opinion I would go for option 2 (storing the serialized objects' xml text in the DB). I would do this because it would be so much easier to read from 1 column and then deserialize the objects to do some processing on them. I am using c# and asp .net MVC to write this application. I am fairly new to software development and would appreciate any help I can get.

Short answer: If option 2 fits your needs well, use it. There's nothing wrong with storing your data in the database.

The answer for this really depends on the details. What kind of data are storing? How do you need to query it? How often will you need to query it?
Generally, I would say it's not a good idea to do both 1 and 2. The problem with option 2 is that you it will be much harder to query for specific fields. If you're going to do a LIKE query and have it search a really long string, it's going to be an expensive operation and you'll likely run into perf issues later on.
If you really want to stay away from having to write code to read multiple columns to load your data, look into using an ORM like Linq to SQL. That will help load database tables into objects for you.

I have designed a number of systems where storing 'some' object as serialized xml in the db has proven the better choice. I also learned lessons where storing objects in the db as xml ended up causing more headaches down the road. So I came up with some questions that you have to answer yes to in order to be comfortable in doing:
Does the object need to be portable?
Is the data in the object encapsulated i.e. not part of something else, and not made up of something else.
In the future can number 2 change?
In SQL you can always create a table view using XQuery, but I would only recommend you do this if a) its too late to change your mind b) you don't have that many objects to manage.
Serializing and storing objects in XML has some real benefits, especially for extensibilty and agile development.

If the number of this kind of objects is large and the size of it isn't very large. I think that using the database is a good idea.
Whether store it in a separate table or store it in the original table depends on how would you use this CLOB data with the original table.
Go with option 2 if you will always need the CLOB data when you access the original table.
Otherwise go with option 3 to improve performance.

You need to also think about security and n-tier architecture. Storing serialized data in a database means your data will be on another server, ideal if the data needs to be secure, but will alos give you network latency, whereas storing the data in the filesystem will give you quicker IO access, but very limited searching ability.
I have a situiation like this and I use the database. It also gets backed up properly with the rest of the related data.

Develop Reference

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.

Approach to process huge xml files in C# - c#

Related

Is it better to have SQL Server parse a large multi document XML or send it each document separately

Fastest efficient way to create XML document .NET/Oracle and return to web client

SQL Server: variable length XML storage

Processing Large XML file into SQL Server with C#

Is it a good idea to store serialized objects in a Database instead of multiple xml text files?

Categories

Resources