Best approach to reading large files - C#

I'm currently working on a program that reads and writes an XML file. While this is a simple task, I'm concerned about future issues.
My code reads the streamed data from the XML and checks every <x> element until one that matches a criterion is found. This works quite fast, since the file currently has about 100 <x> elements, but as more elements are added the task will get much slower, especially if the matching element is the last one in a very large file.
What approach should I take to minimize the impact of this?
I was thinking about splitting the file into smaller ones (containing up to 1000 elements each) and reading from several of those files at the same time. Is this a proper approach to this?
I'm coding in C#, in case it's relevant for a language-specific approach.

You should use one of the available XML APIs of .NET. Which one depends on the size of the XML files: the usual trade-off is between XDocument (LINQ to XML) and XmlReader. To summarize: if your file fits in memory, use XDocument. If not, use XmlReader.
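If it helps, here is a minimal sketch of the XmlReader route for a layout like yours, assuming the repeating element really is called <x> and that a delegate captures your match criterion:

using System;
using System.Xml;
using System.Xml.Linq;

class XmlSearch
{
    // Stream the document and materialize only one <x> element at a time as
    // an XElement. The element name "x" and the match delegate are
    // placeholders for whatever your file actually uses.
    static XElement FindFirstMatch(string path, Func<XElement, bool> matches)
    {
        using (XmlReader reader = XmlReader.Create(path))
        {
            reader.MoveToContent();
            while (!reader.EOF)
            {
                if (reader.NodeType == XmlNodeType.Element && reader.Name == "x")
                {
                    // ReadFrom consumes the element and advances the reader past it.
                    var element = (XElement)XNode.ReadFrom(reader);
                    if (matches(element))
                        return element;
                }
                else
                {
                    reader.Read();
                }
            }
        }
        return null; // nothing matched
    }
}

Because only one <x> element is ever held in memory at a time, memory use stays roughly flat however many elements the file grows to, so there should be no need to split it into multiple files.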

This sounds like a batch process in your case. Maybe this link will help you: https://www.codeproject.com/Articles/1155341/Batch-Processing-Patterns-with-Taskling. I've never done this in C#, only in Java, but it's a good way to solve this kind of task. Hope it helps.

Related

Is it better to decode a file and read all, or part, of it?

I have an XML that is entirely encoded in Base64, not just the node text. For some functionality of my program, I only need to get a couple of nodes from the file. The XML file could contain a couple of hundred nodes, so I'm wondering whether it would be more efficient to decode the file and read the few nodes or to read everything for when (if) it is needed later in the program?
EDIT: When I say the XML file contains a couple of hundred nodes, there aren't many sub-nodes, and the file is likely to contain about that number of lines.
UPDATE: Of course, it is not just about how long it takes! What effect would it have on memory if I'm storing upwards of 500 strings in RAM that may not even be used?
I'm quite sure there won't be a big difference if there are that few nodes in your XML.
But in general you could say that it depends on the use case. If you need to work with the nodes a lot, it may be more efficient to load them all into a faster-accessible data container (a dictionary, for example).
The reason is that a Dictionary uses a hash table to store the data. The good thing about a hash table is that lookups are O(1), which always beats the O(n) worst case of iterating over the XML each time.
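A minimal sketch of that idea, assuming the decoded XML has nodes shaped something like <item id="..."> (both names are assumptions about your file's layout):

using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

static class NodeIndex
{
    // Parse the decoded XML once and index the elements by an attribute,
    // so later lookups are O(1) instead of re-scanning the document.
    public static Dictionary<string, XElement> Build(string decodedXml)
    {
        XDocument doc = XDocument.Parse(decodedXml);
        return doc.Descendants("item")
                  .Where(e => e.Attribute("id") != null)
                  .ToDictionary(e => (string)e.Attribute("id"));
    }
}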

XML serialization or reading from XML Objects?

I have different XML files that I will need to read. I'm wondering if I should deserialize the files into custom objects or just read the data using XDocument objects and Linq-to-XML.
The files range in size from 1-2kb to 3mb+, and the different objects also range in complexity (some have attributes, some have children, some both, some none).
I figure it would be easier to work with the objects as opposed to Linq-to-XML, but creating those objects would require some time up front. Are there any rules of thumb or suggestions about when to deserialize as opposed to Linq?
Thanks for any help!
It really depends on what you are doing with the data. If you are not using all of the information that is provided by the XML document, then a LINQ based approach is probably easiest. Think of taking an RSS feed, and only keeping track of the article dates, and nothing else. In this case using a deserialization technique doesn't really do anything for you.
If you are using just about every last bit of data in the XML document, and its structure reflects that of your object model, then certainly deserialize it. This is something that I do all of the time for things like settings files, and even simple file formats.
In your case it sounds like it already exists, and was created by some external source, and you don't have an object representation of the data in your code already, so I would suggest using a LINQ based approach. Additionally, you mention a lot of variation in the files so the flexibility of LINQ would again come in handy. That is a wild guess based on your description though.
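For the RSS-style case above, a LINQ-only sketch that pulls just the publication dates and ignores the rest of the feed might look like this (standard RSS 2.0 element names assumed):

using System.Linq;
using System.Xml.Linq;

static class FeedDates
{
    // Grab only the <pubDate> text of each <item>; no object model for the
    // rest of the feed is needed.
    public static string[] Get(string rssPath)
    {
        XDocument feed = XDocument.Load(rssPath);
        return feed.Descendants("item")
                   .Select(i => (string)i.Element("pubDate"))
                   .Where(d => d != null)
                   .ToArray();
    }
}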
You could use the xsd.exe tool, which can generate those classes for you from a given XML file:
C:\work>xsd test.xml
C:\work>xsd /classes test.xsd
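Once xsd.exe has produced the classes, deserializing is only a few lines. A sketch, assuming the generated class ended up named test (xsd.exe names it after the root element, so adjust to match the generated code):

using System.IO;
using System.Xml.Serialization;

static class Loader
{
    // Deserialize the whole file into the xsd.exe-generated type.
    public static test Load(string path)
    {
        var serializer = new XmlSerializer(typeof(test));
        using (FileStream stream = File.OpenRead(path))
        {
            return (test)serializer.Deserialize(stream);
        }
    }
}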
There is really no rule of thumb. Personally, I prefer working with strongly typed objects unless the file sizes become large, in which case I switch to XmlReader.

Quickest/best way to read XML

I need to read potentially large (~300mb) XML files, and edit some of the nodes. Basically I need to:
Read the XML from the start
Whenever I find a node called trgt
Add some text to it
What's the best way to approach this in C#? Which XML classes should I use to find and edit the nodes I need to change?
TIA
VTD-XML is the only XML parsing library that supports a feature called incremental update. It is also memory-efficient and performant, but it requires you to download it as a third-party library.
From my experience of transforming some very large (2GB+) XML files (don't ask!), I found XSL transforms to be the quickest - the engines involved are heavily optimised for such tasks, compared to any manual looping you might try.
You can use LINQ to XML: in short, read with XDocument, then parse and add data with LINQ. This will not be the fastest code, but it will probably be the quickest to write.
If you have memory constraints, you will probably have to parse it manually (i.e. load only part of it into memory, process that part, then replace it in the file).
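For the case where the file does fit in memory, a sketch of that XDocument route for the question's trgt nodes (the appended text is a placeholder, and the whole document is held in memory, as noted above):

using System.Xml.Linq;

static class TrgtEditor
{
    // Load the document, append text to every <trgt> node, and save it back.
    public static void AppendToTargets(string path, string textToAdd)
    {
        XDocument doc = XDocument.Load(path);
        foreach (XElement trgt in doc.Descendants("trgt"))
        {
            trgt.Add(textToAdd); // appends a text node to the element's content
        }
        doc.Save(path);
    }
}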
If it's a fairly simple operation similar to find-and-replace, you could try treating it as a normal text file instead of an xml document. I imagine that might be faster than all the xml parsing.

What is the best way to work with xml?

What is the best way to work with an XML file that represents a tree?
The XML size is 70 MB.
LINQ to XML is currently the easiest way to work with XML, but it will typically load the entire tree into memory, which in your case with a 70 MB file may not be ideal.
However, there are ways around this, as demonstrated in this blog post from Mike Taulty.
The answer depends on what you want to do with the XML. Generally with files that size you wouldn't want to read it all in one go. As such the following page makes an interesting read, providing a means to mine data from the file without loading it in memory. It allows you to combine the speed of XmlReader with the flexibility of Linq:
http://msdn.microsoft.com/en-us/library/bb387035.aspx
And quite an interesting article based on this technique:
Link
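A sketch of the technique from the MSDN page above: expose the file as a lazy sequence of XElements so LINQ queries run over it without building the whole tree (the element name is whatever repeats in your document):

using System.Collections.Generic;
using System.Xml;
using System.Xml.Linq;

static class XmlStreaming
{
    // Lazily yield one element at a time; the reader stays open only while
    // the sequence is being enumerated.
    public static IEnumerable<XElement> StreamElements(string path, string elementName)
    {
        using (XmlReader reader = XmlReader.Create(path))
        {
            reader.MoveToContent();
            while (!reader.EOF)
            {
                if (reader.NodeType == XmlNodeType.Element && reader.Name == elementName)
                {
                    yield return (XElement)XNode.ReadFrom(reader);
                }
                else
                {
                    reader.Read();
                }
            }
        }
    }
}

A query along the lines of StreamElements("data.xml", "node").Where(e => (string)e.Attribute("type") == "foo") then runs in roughly constant memory.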
If you want to read data from a large xml file XmlTextReader is the way to go.
For .NET 3.5 and up, I prefer using LINQ to XML for all my work towards XML files.
LINQ to XML is probably a good bet if you wish to query the document in memory, but if you find that your memory footprint is becoming a problem, you could use an XmlReader instead.

LINQ to XML:
- Slower for larger documents (large memory footprint)
- Queryable

XmlTextReader:
- Fast
- Reads one node at a time, so no querying
Since you are already using a DOM, an alternative XML parser you could try is a SAX parser. Instead of loading the entire tree into memory, a SAX parser is event-driven and handles nodes, etc. as it encounters them.
Further Reading: http://www.saxproject.org/event.html

How best to use XPath with very large XML files in .NET?

I need to do some processing on fairly large XML files (large here being potentially upwards of a gigabyte) in C#, including performing some complex XPath queries. The problem I have is that the standard way I would normally do this through the System.Xml libraries likes to load the whole file into memory before it does anything with it, which can cause memory problems with files of this size.
I don't need to be updating the files at all just reading them and querying the data contained in them. Some of the XPath queries are quite involved and go across several levels of parent-child type relationship - I'm not sure whether this will affect the ability to use a stream reader rather than loading the data into memory as a block.
One way I can see of making it work is to perform the simple analysis using a stream-based approach and perhaps wrapping the XPath statements into XSLT transformations that I could run across the files afterward, although it seems a little convoluted.
Alternatively, I know that there are some elements the XPath queries will not run across, so I guess I could break the document up into a series of smaller fragments based on its original tree structure, which could perhaps be small enough to process in memory without causing too much havoc.
I've tried to explain my objective here so if I'm barking up totally the wrong tree in terms of general approach I'm sure you folks can set me right...
XPathReader is the answer. It isn't part of the C# runtime, but it is available for download from Microsoft. Here is an MSDN article.
If you construct an XPathReader with an XmlTextReader you get the efficiency of a streaming read with the convenience of XPath expressions.
I haven't used it on gigabyte sized files, but I have used it on files that are tens of megabytes, which is usually enough to slow down DOM based solutions.
Quoting from the below: "The XPathReader provides the ability to perform XPath over XML documents in a streaming manner".
Download from Microsoft
Gigabyte XML files! I don't envy you this task.
Is there any way that the files could be sent in a better way? E.g. are they being sent over the net to you? If they are, then a more efficient format might be better for all concerned. Reading the file into a database isn't a bad idea, but it could be very time-consuming indeed.
I wouldn't try and do it all in memory by reading the entire file - unless you have a 64bit OS and lots of memory. What if the file becomes 2, 3, 4GB?
One other approach could be to read in the XML file and use SAX to parse the file and write out smaller XML files according to some logical split. You could then process these with XPath. I've used XPath on 20-30MB files and it is very quick. I was originally going to use SAX but thought I would give XPath a go and was surprised how quick it was. I saved a lot of development time and probably only lost 250ms per query. I was using Java for my parsing but I suspect there would be little difference in .NET.
I did read that XML::Twig (A Perl CPAN module) was written explicitly to handle SAX based XPath parsing. Can you use a different language?
This might also help https://web.archive.org/web/1/http://articles.techrepublic%2ecom%2ecom/5100-10878_11-1044772.html
http://msdn.microsoft.com/en-us/library/bb387013.aspx has a relevant example leveraging XStreamingElement.
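A hedged sketch of the XStreamingElement idea from that example: pair a lazy XmlReader-backed input with deferred output, so neither the input nor the output document is fully materialized. The element and attribute names here are made up:

using System.Collections.Generic;
using System.Linq;
using System.Xml;
using System.Xml.Linq;

static class StreamingTransform
{
    // Lazily yield each <record> element from the input file.
    // "record", "status" and "activeRecords" are placeholder names.
    static IEnumerable<XElement> StreamRecords(string path)
    {
        using (XmlReader reader = XmlReader.Create(path))
        {
            reader.MoveToContent();
            while (!reader.EOF)
            {
                if (reader.NodeType == XmlNodeType.Element && reader.Name == "record")
                    yield return (XElement)XNode.ReadFrom(reader);
                else
                    reader.Read();
            }
        }
    }

    // Filter the input into a new document; XStreamingElement defers writing
    // until Save enumerates the query, so elements are pulled and written one
    // at a time.
    public static void WriteActive(string inputPath, string outputPath)
    {
        var active = StreamRecords(inputPath)
                     .Where(r => (string)r.Attribute("status") == "active");

        new XStreamingElement("activeRecords", active).Save(outputPath);
    }
}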
In order to perform XPath queries with the standard .NET classes the whole document tree needs to be loaded in memory which might not be a good idea if it can take up to a gigabyte. IMHO the XmlReader is a nice class for handling such tasks.
It seems that you have already tried using XPathDocument and could not accommodate the parsed XML document in memory.
If this is the case, before starting to split the file (which is ultimately the right decision!) you may try using the Saxon XSLT/XQuery processor. It has a very efficient in-memory representation of a loaded XML document (the "tinytree" model). In addition, Saxon-SA (the schema-aware version, which isn't free) has some streaming extensions. Read more about this here.
How about just reading the whole thing into a database and then working with the temp database? That might be better, because then your queries can be done more efficiently using T-SQL.
I think the best solution is to write your own XML parser that can read small chunks rather than the whole file, or you can split the large file into smaller files and use the .NET classes with those files.
The problem is that you cannot parse some of the data until the whole thing is available, so I recommend using your own parser rather than the .NET classes.
Have you tried XPathDocument?
This class is optimized for handling XPath queries efficiently.
If you cannot handle your input documents efficiently using XPathDocument you might consider preprocessing and/or splitting up your input documents using an XmlReader.
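For reference, the XPathDocument route looks roughly like this; it still loads the whole document, but into a compact read-only structure tuned for XPath evaluation (the query string is a made-up example):

using System;
using System.Xml.XPath;

static class XPathQuery
{
    // Load into an XPathDocument and run an XPath query via an XPathNavigator.
    public static void Run(string path)
    {
        var document = new XPathDocument(path);
        XPathNavigator navigator = document.CreateNavigator();

        XPathNodeIterator results = navigator.Select("//order[customer/@id = '42']");
        while (results.MoveNext())
        {
            Console.WriteLine(results.Current.GetAttribute("total", ""));
        }
    }
}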
You've outlined your choices already.
Either you need to abandon XPath and use XmlTextReader, or you need to break the document up into manageable chunks on which you can use XPath.
If you choose the latter, use XPathDocument; its read-only restriction allows better use of memory.
Since in your case the data size can run into gigabytes, have you considered using ADO.NET with XML as a database? In addition, the memory footprint would not be huge.
Another approach would be using LINQ to XML with streaming elements like XStreamingElement. Hope this helps.
