I have a 60 MB XML file that contains a list of products, approximately 8,000 of them. I need to get all the products from this XML file into a SQL table. The XML file has a static name, so I know what to look for. I guess I want to know about the process: what makes the most sense and has the least overhead.
What is the best way to do this?
When do I parse the XML? Do I have SQL handle it, or use some other method? In the past I have used a parser in a stored proc, but those old XML files were smaller, around 1-5 MB; I'm not sure a 60 MB XML file will work the same way.
Thoughts, Ideas?
Create an SSIS package so that you can rerun it. Have SQL handle the parsing by including the schema within the XML file.
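If you go the "SQL handles the parsing" route, the core of it (minus the SSIS part) is just passing the whole document to the server as an xml parameter and shredding it with nodes(). A rough sketch; the table, element, and connection names are all invented:

using System.Data;
using System.Data.SqlClient;
using System.IO;

class LoadProducts
{
    static void Main()
    {
        // Read the whole document; 60 MB is large but still fits in memory.
        string xml = File.ReadAllText(@"C:\feeds\products.xml");

        // Let SQL Server shred the xml parameter into rows.
        const string sql = @"
            INSERT INTO dbo.Products (Sku, Name, Price)
            SELECT p.value('(Sku)[1]',   'nvarchar(50)'),
                   p.value('(Name)[1]',  'nvarchar(200)'),
                   p.value('(Price)[1]', 'money')
            FROM @xml.nodes('/Products/Product') AS t(p);";

        using (var conn = new SqlConnection("Server=.;Database=Demo;Integrated Security=SSPI"))
        using (var cmd = new SqlCommand(sql, conn))
        {
            cmd.Parameters.Add("@xml", SqlDbType.Xml).Value = xml;
            conn.Open();
            cmd.ExecuteNonQuery();
        }
    }
}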
It would probably be best to write a short program in a language that has both an XML parser and a DB interface. C#, Perl, Python, Java, whatever you know best.
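For example, in C# such a program might be little more than an XmlReader loop feeding a parameterized INSERT. This is only a sketch, and it assumes (purely for illustration) that each product is a <Product> element with sku and name attributes:

using System;
using System.Data;
using System.Data.SqlClient;
using System.Xml;

class ImportProducts
{
    static void Main()
    {
        using (var conn = new SqlConnection("Server=.;Database=Demo;Integrated Security=SSPI"))
        using (var cmd = new SqlCommand(
            "INSERT INTO dbo.Products (Sku, Name) VALUES (@sku, @name)", conn))
        {
            var sku = cmd.Parameters.Add("@sku", SqlDbType.NVarChar, 50);
            var name = cmd.Parameters.Add("@name", SqlDbType.NVarChar, 200);
            conn.Open();

            // Stream the file so the 60 MB document is never fully in memory.
            using (var reader = XmlReader.Create(@"C:\feeds\products.xml"))
            {
                while (reader.Read())
                {
                    if (reader.NodeType == XmlNodeType.Element && reader.Name == "Product")
                    {
                        sku.Value = (object)reader.GetAttribute("sku") ?? DBNull.Value;
                        name.Value = (object)reader.GetAttribute("name") ?? DBNull.Value;
                        cmd.ExecuteNonQuery(); // one insert per product; wrap in a transaction for speed
                    }
                }
            }
        }
    }
}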
I am writing a C# / VB program that is to be used for reporting data based upon information received in XMLs.
My situation is that I receive many XMLs per month (about 100-200), each ranging in size from 10 MB to 350 MB. For each of these XMLs, I only need a small subset of its data (less than 5% of any one file's entire contents) in order to produce the necessary reports.
Also, that subset of data will always be held in the same key structure (it may exist within multiple keys and at differing depths, but it will always exist within the same key names, and the keys containing it will always have the same attributes, such as "name", etc.).
So, my current idea of how to go about doing this is to:
Create a "scraper" that will pull the necessary data from the XMLs using XPath.
Store that small subset of necessary data in a SQL Server table, with file characteristic data stored in a separate table so I know which file each piece of scraped data came from.
Query the data out into a program for reporting.
My main question here is really what is the best way to scrape that data out?
I am most familiar with XPath, but for multiple files of 200 MB in size, I'm afraid of the performance impact of loading each entire file into memory.
Other things I have seen / researched are:
Creating an XSLT file to transform / pull from the XML only the data I want
Using Linq to XML
Somehow linking the XMLs to SQL server and then being able to query them directly
Using ADO to query the XMLs from within the program
Using the XmlReader class (rather than loading in each XML entirely)
Maybe there is a native .Net component that does this very well already
Quite honestly, I just have no clue what the standard approach is given the high number of XMLs and the large variance in file sizes. I'm also not familiar with some of these other options, such as linking the XMLs to SQL Server directly or using ADO to query them, and therefore don't know their possible benefits and drawbacks.
If any of you have been in a similar situation, I'd really appreciate any kind of pointers in the right direction / at least validation that my method isn't the worst one out there :)
Thanks!!!
As for the memory consumption and performance concerns, a nice feature of the .NET XML APIs is that you can combine XmlReader with XPathDocument, XmlDocument, or XElement to selectively read only part of a document into memory, and then have the XPath or LINQ to XML features available on that part. LINQ to XML has XNode.ReadFrom (http://msdn.microsoft.com/en-us/library/system.xml.linq.xnode.readfrom%28v=vs.110%29.aspx) for doing that; DOM/XmlDocument has XmlDocument.ReadNode (http://msdn.microsoft.com/en-us/library/system.xml.xmldocument.readnode%28v=vs.110%29.aspx). So, depending on your XML structure, you might be able to use an XmlReader to read forward through the XML quickly without consuming much memory and then, when you reach an element you are interested in, read it into an XElement (LINQ to XML) or XmlNode (DOM) and apply LINQ to XML and/or XPath to pull out the details.
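Here is a sketch of that XmlReader + XNode.ReadFrom combination; the file path and element name are placeholders:

using System;
using System.Collections.Generic;
using System.Xml;
using System.Xml.Linq;

static class XmlStreaming
{
    // Yields one matching XElement at a time; only the current element is in memory.
    public static IEnumerable<XElement> StreamElements(string path, string elementName)
    {
        using (XmlReader reader = XmlReader.Create(path))
        {
            reader.MoveToContent();
            while (!reader.EOF)
            {
                if (reader.NodeType == XmlNodeType.Element && reader.Name == elementName)
                {
                    // ReadFrom consumes the element and leaves the reader just past it,
                    // so we must not call Read() again in this branch.
                    yield return (XElement)XNode.ReadFrom(reader);
                }
                else
                {
                    reader.Read();
                }
            }
        }
    }
}

class Demo
{
    static void Main()
    {
        foreach (XElement item in XmlStreaming.StreamElements(@"C:\data\big.xml", "item"))
        {
            // Full LINQ to XML / XPath is available on the small fragment.
            Console.WriteLine((string)item.Attribute("name"));
        }
    }
}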
Can someone please guide me with this problem?
In my institution, we process XML files of huge size (max 1 GB) and insert the details into a database table. Per the current design, we parse the XML file with XmlReader and build an XML string containing just the required data, which is then passed to a stored procedure (xml data type) that inserts the details into the DB.
Now the problem is that we are not sure whether there is a better approach than this, so please suggest whether there are any features of .NET 3.5 and/or SQL Server 2005 that would handle this better than our approach.
Any help in this regard would be highly appreciated.
Thanks.
Do you care at all what is in the XML file? If not, you can just use a StreamReader to get the text from the XML and pass it along to the database.
If you need to validate that the XML is correct, it is a good idea to use XmlReader.
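For example, assuming (hypothetically) you have an XSD for the document, XmlReader can validate while it streams:

using System;
using System.Xml;
using System.Xml.Schema;

class ValidateXml
{
    static void Main()
    {
        var settings = new XmlReaderSettings();
        settings.ValidationType = ValidationType.Schema;
        settings.Schemas.Add(null, @"C:\data\orders.xsd"); // null = use the schema's target namespace
        settings.ValidationEventHandler +=
            (sender, e) => Console.WriteLine("{0}: {1}", e.Severity, e.Message);

        // Reading through the document validates it without loading it all into memory.
        using (var reader = XmlReader.Create(@"C:\data\orders.xml", settings))
        {
            while (reader.Read()) { }
        }
    }
}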
However, just dumping 1 GB of XML into your database seems a bit weird. What is the purpose of this XML data? Is it a lot of nested elements? Maybe you could de-serialize it and store each object in the appropriate table instead, which would, IMO, lead to a more understandable design.
There are a couple of things you can think of to make the design of your software easier/better:
Does more than one XML file occur in the database at once?
How is the data shared between applications?
Have you considered using MemoryMappedFile?
Is it possible to de-serialize the XML into entities instead and store them appropriately?
I suspect that if there are any performance issues, they will be with the stored procedure and the database side of things rather than with reading the file.
Why are you storing the XML file in a database table? I would suggest that a different solution would be more appropriate, but without knowing more about exactly what you are trying to do, it is hard to advise.
If each first-level element in the XML is a record, i.e.
<rootNode>
<row>...</row>
<row>...</row>
<row>...</row>
</rootNode>
Then you could create an IDataReader implementation that reads the XML (via XmlReader) and presents each row as a record, to be imported using SqlBulkCopy; see the sketch after the list below. Pretty much like my old answer here.
Advantages:
SqlBulkCopy is the fastest way to get data into a database
stripping it into records makes appropriate use of a database, allowing indexing and proper typing
it doesn't rely on a huge BLOB going over the wire in an atomic way (necessary for the xml data type)
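Here is a compressed sketch of that IDataReader idea. The file, element names, and destination table are invented; SqlBulkCopy only needs Read, FieldCount, and GetValue here, so everything else is stubbed out:

using System;
using System.Data;
using System.Data.SqlClient;
using System.Xml;
using System.Xml.Linq;

class XmlDataReader : IDataReader
{
    readonly XmlReader _xml;
    readonly string[] _columns;
    readonly object[] _values;

    public XmlDataReader(string path, params string[] columns)
    {
        _xml = XmlReader.Create(path);
        _columns = columns;
        _values = new object[columns.Length];
    }

    // Advance to the next <row> and capture its child element values as strings.
    public bool Read()
    {
        while (!_xml.EOF)
        {
            if (_xml.NodeType == XmlNodeType.Element && _xml.Name == "row")
            {
                var row = (XElement)XNode.ReadFrom(_xml); // moves the reader past </row>
                for (int i = 0; i < _columns.Length; i++)
                    _values[i] = (object)(string)row.Element(_columns[i]) ?? DBNull.Value;
                return true;
            }
            _xml.Read();
        }
        return false;
    }

    public int FieldCount { get { return _columns.Length; } }
    public object GetValue(int i) { return _values[i]; }
    public int GetOrdinal(string name) { return Array.IndexOf(_columns, name); }
    public string GetName(int i) { return _columns[i]; }
    public bool IsDBNull(int i) { return _values[i] is DBNull; }
    public string GetString(int i) { return (string)_values[i]; }
    public Type GetFieldType(int i) { return typeof(string); }
    public int GetValues(object[] values) { _values.CopyTo(values, 0); return _values.Length; }
    public object this[int i] { get { return GetValue(i); } }
    public object this[string name] { get { return GetValue(GetOrdinal(name)); } }
    public void Close() { _xml.Close(); }
    public void Dispose() { Close(); }
    public bool IsClosed { get { return _xml.ReadState == ReadState.Closed; } }
    public int Depth { get { return 0; } }
    public int RecordsAffected { get { return -1; } }
    public bool NextResult() { return false; }

    // SqlBulkCopy never calls the rest, so stub them out:
    public DataTable GetSchemaTable() { throw new NotSupportedException(); }
    public bool GetBoolean(int i) { throw new NotSupportedException(); }
    public byte GetByte(int i) { throw new NotSupportedException(); }
    public long GetBytes(int i, long o, byte[] b, int bo, int l) { throw new NotSupportedException(); }
    public char GetChar(int i) { throw new NotSupportedException(); }
    public long GetChars(int i, long o, char[] b, int bo, int l) { throw new NotSupportedException(); }
    public IDataReader GetData(int i) { throw new NotSupportedException(); }
    public string GetDataTypeName(int i) { throw new NotSupportedException(); }
    public DateTime GetDateTime(int i) { throw new NotSupportedException(); }
    public decimal GetDecimal(int i) { throw new NotSupportedException(); }
    public double GetDouble(int i) { throw new NotSupportedException(); }
    public float GetFloat(int i) { throw new NotSupportedException(); }
    public Guid GetGuid(int i) { throw new NotSupportedException(); }
    public short GetInt16(int i) { throw new NotSupportedException(); }
    public int GetInt32(int i) { throw new NotSupportedException(); }
    public long GetInt64(int i) { throw new NotSupportedException(); }
}

class Program
{
    static void Main()
    {
        // Columns must line up with the destination table's column order.
        using (var rows = new XmlDataReader(@"C:\data\rows.xml", "Id", "Name", "Price"))
        using (var bulk = new SqlBulkCopy("Server=.;Database=Demo;Integrated Security=SSPI"))
        {
            bulk.DestinationTableName = "dbo.ImportedRows";
            bulk.WriteToServer(rows);
        }
    }
}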
We have a need to save some of our configuration items into files. I have been told this is for some localization features we are going to use. From what I have been told, it is much faster to read from a binary file than a straight XML file. Is this true, and is it ideal to save xml data in binary format or is there another way I should save the data to pull it into my web application?
I would like to be able to read the data using LINQ or by casting it to an object. Also, what is the best way to parse the file for specific data? Any suggestions would be greatly appreciated.
I would recommend using XML over binary simply because it's easier to work with. Binary is faster but I doubt you would notice that speed gain in your application, especially if you cache the values you read from the file.
The easiest way to parse the XML file would be to deserialize it to an object using the XmlSerializer class. This is absolutely the most painless way to parse XML in my opinion.
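For example (the Settings class and file name are just placeholders for whatever your configuration looks like):

using System;
using System.IO;
using System.Xml.Serialization;

// Matches XML like: <Settings><Language>en-US</Language><Theme>Dark</Theme></Settings>
public class Settings
{
    public string Language { get; set; }
    public string Theme { get; set; }
}

class Program
{
    static void Main()
    {
        var serializer = new XmlSerializer(typeof(Settings));
        using (var stream = File.OpenRead(@"C:\config\settings.xml"))
        {
            var settings = (Settings)serializer.Deserialize(stream);
            Console.WriteLine(settings.Language); // use it as a plain object, and cache it
        }
    }
}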
First, what I describe here is a small part of a larger ETL process that is already in place. So, please no suggestions to port to SSIS or some other environment because I can't.
In this ETL process, for each table in the SQL Server database that is being inserted into, I am:
loading all of the relevant XML into an XElement object,
then transforming the XML into a typed DataSet DataTable,
then using a SqlBulkCopy object to quickly insert the data into the SQL Server table.
But, for one table, when I create the XElement, I get an OutOfMemoryException.
So, I now need to iteratively process the data in chunks, but I'm not sure of the best way to do this. The xml file is stored on the same machine that is running the ETL process.
Thanks for any help.
UPDATE
I'm getting started reading about the XmlReader class, which I've never used. If someone thinks this is the answer, please say so and provide any guidance you can.
Don't use XElement for the whole file; use .NET's streaming parser to read the XML instead. NEVER materialize the whole document in memory. Simple. There is an API for that.
Basically, use an XmlTextReader.
In addition to plain use of XmlReader, it can be useful to know about the XNode.ReadFrom method. It works particularly well if the XML is more like a very long list of entities than a deeply nested hierarchy.
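To make that concrete for the chunking problem above, here is a rough sketch (the element, table, and batch size are invented) that streams one record at a time with XNode.ReadFrom and flushes batches to SQL Server with SqlBulkCopy, so no single XElement ever holds the whole file:

using System;
using System.Data;
using System.Data.SqlClient;
using System.Xml;
using System.Xml.Linq;

class ChunkedEtl
{
    const int BatchSize = 5000;

    static void Main()
    {
        var batch = new DataTable();
        batch.Columns.Add("Id", typeof(int));
        batch.Columns.Add("Name", typeof(string));

        using (var bulk = new SqlBulkCopy("Server=.;Database=Demo;Integrated Security=SSPI"))
        using (var reader = XmlReader.Create(@"C:\data\huge.xml"))
        {
            bulk.DestinationTableName = "dbo.Target";
            reader.MoveToContent();
            while (!reader.EOF)
            {
                if (reader.NodeType == XmlNodeType.Element && reader.Name == "record")
                {
                    // Only one <record> is materialized at a time.
                    var record = (XElement)XNode.ReadFrom(reader);
                    batch.Rows.Add((int)record.Element("Id"), (string)record.Element("Name"));

                    if (batch.Rows.Count >= BatchSize)
                    {
                        bulk.WriteToServer(batch); // flush this chunk
                        batch.Clear();
                    }
                }
                else
                {
                    reader.Read();
                }
            }
            if (batch.Rows.Count > 0)
                bulk.WriteToServer(batch); // flush the final partial chunk
        }
    }
}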
I am creating an RSS reader as a hobby project, and I am at the point where the user adds his own URLs.
I was thinking of two things.
A plaintext file where each URL is a single line
SQLite, where I can have unique IDs and descriptions alongside the URL
Is the SQLite idea too much overhead, or is there a better way to do things like this?
What about an OPML file? It's XML, so if you need to store more data than the OPML specification supplies, you can always add your own namespace.
Additionally, importing and exporting from other RSS readers is all done via OPML, and there is often library support for it. If you're interested in having users switch to your reader, then you have to support OPML. Thanks to jamesh for bringing that point up.
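For reference, a minimal OPML file looks something like this (the URLs are placeholders):

<opml version="2.0">
  <head>
    <title>My subscriptions</title>
  </head>
  <body>
    <outline text="Example blog" type="rss"
             xmlUrl="http://example.com/feed.xml"
             htmlUrl="http://example.com/" />
  </body>
</opml>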
Why not XML?
If you're dealing with RSS anyway, you may as well :)
Do you plan to store just URLs? Or do you plan to add data like last_fetch_time and so on?
If it's just a simple URL list that your program will read line by line to download data, store it in a file, or even better, as a serialized object written to a file.
If you plan to extend it, add comments/time of last fetch, etc, I'd go for SQLite, it's not that much overhead.
If it's a single user application that only has one instance, SQLite might be overkill.
You've got a few options as I see it:
SQLite / database layer. Increases the dependencies your code needs to run, but allows concurrent access.
Roll your own text parser. Complexity increases as you want to save more data, and you're re-inventing the wheel. Fewer dependencies, and initially, while your data is simple, it's trivial for a novice user of your application to edit.
Use XML. It's well formed & defined and text editable. Could be overkill for storing just a URL though.
Use something like pickle to serialize your objects and save them to disk. Changes to your data structure means "upgrading" the pickle files. Not very intuitive to edit for a novice user, but extremely easy to implement.
I'd go with the XML text file option. You can use the XSD tool built into Visual Studio to create a DataTable out of the XML data, and it easily serializes back into the file when needed.
The other caveat is that I'm sure you're going to want the end user to be able to categorize their RSS feeds and potentially search/sort them, and having that kind of DataTable structure will help with this.
You'll get easy file storage and access, the benefit of a "database" structure, but not quite the overhead of SQLite.
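A quick sketch of that approach, using an untyped DataSet for brevity (the XSD tool would give you a typed one); the file, table, and column names are assumptions:

using System;
using System.Data;

class FeedStore
{
    static void Main()
    {
        var ds = new DataSet();
        ds.ReadXml(@"C:\data\feeds.xml");   // infers a table per repeating element

        DataTable feeds = ds.Tables["feed"];
        // The DataTable gives you search/sort over the feed list for free.
        DataRow[] news = feeds.Select("category = 'news'", "title ASC");
        Console.WriteLine("{0} news feeds", news.Length);

        ds.WriteXml(@"C:\data\feeds.xml");  // serialize back to the file when needed
    }
}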