What is the best approach for reading a large number of XML files (I need to read 8000 of them) and doing some computations on them, with the best possible speed? Is it OK to use an XmlReader and return the nodes I'm interested in in a list? Or is it faster to do the computations on each node as it is read? So far I have been returning the nodes in a list, as values, because I wanted to keep my application as modular as possible. I am using C#, but that is probably not relevant.
Thank you.
"Is it OK to use an XmlReader and return the nodes I'm interested in in a list? Or is it faster to do the computations on each node as it is read?"
I can't say whether returning a list is OK or not, because I don't know how large each file is, which matters more here than the number of XML documents.
However, it could certainly be very expensive if a given document, and hence the list produced from it, were very large.
Conversely, reading each node and calculating as you go will start producing results sooner and use less memory. How much faster that makes it ranges from negligible to so considerable that any other approach becomes infeasible, depending on just how large the source data is. It's the approach I take if I either have a strong concern about performance or good reason to expect a very large dataset.
Somewhere between the two is the approach of an IEnumerable<T> implementation that yields objects as it reads, along the lines of:
public IEnumerable<SomeObject> ExtractFromXml(XmlReader rdr)
{
    using (rdr)
    {
        while (rdr.Read())
        {
            if (rdr.NodeType == XmlNodeType.Element && rdr.LocalName == "thatElementYouReallyCareAbout")
            {
                var current = /* code to create a SomeObject from the XML goes here */
                yield return current;
            }
        }
    }
}
As with producing a list, this separates the code doing the calculation from the code that parses the XML, but because you can start enumerating through it with a foreach before parsing has finished, memory use can be lower, as can the time before the calculation starts. This makes little difference with small documents, but a lot with large ones.
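For example, the 8000 files could then be consumed one element at a time, without ever building a full list; a minimal sketch, where inputFolder is a placeholder for your input directory and the body of the inner loop is your computation:
foreach (var file in Directory.EnumerateFiles(inputFolder, "*.xml"))
{
    foreach (var item in ExtractFromXml(XmlReader.Create(file)))
    {
        // do your computation on 'item' here, one element at a time
    }
}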
The best solution I have personally come up with for dealing with XML files is to take advantage of .NET's XmlSerializer class. You can define a model for your XML and keep your XML data in a List of that model, then write it out like this:
using (StreamWriter sw = new StreamWriter("OutPutPath")) {
    new XmlSerializer(typeof(List<Model>)).Serialize(sw, Models);
    sw.WriteLine();
}
To read the file back, you can deserialize the data and assign it back to the model by calling the Deserialize method.
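A minimal sketch of the read side, assuming the same Model type and output path as above:
using (StreamReader sr = new StreamReader("OutPutPath"))
{
    // Deserialize returns object, so cast back to the list type you serialized
    var models = (List<Model>)new XmlSerializer(typeof(List<Model>)).Deserialize(sr);
    // work with 'models' here
}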
Related
I have data stored in several separate text files that I parse and analyze afterwards.
The size of the data processed differs a lot, ranging from a few hundred megabytes (or less) to more than 10 gigabytes.
I started out storing the parsed data in a List<DataItem> because I wanted to perform a BinarySearch() during the analysis. However, the program throws an OutOfMemoryException if too much data is parsed. The exact amount the parser can handle depends on the fragmentation of the memory: sometimes it's just 1.5 GB of files, other times it's 3 GB.
Currently I'm using a List<List<DataItem>> with a limited number of entries per inner list, because I thought it would change things for the better. There weren't any significant improvements, though.
Another way I tried was serializing the parsed data and then deserializing it when needed. The result of that approach was even worse; the whole process took much longer.
I have looked into memory-mapped files, but I don't really know whether they could help me, because I have never used them before. Would they?
So how can I quickly access the data from all the files, without the danger of an OutOfMemoryException, and find DataItems depending on their attributes?
EDIT: The parser roughly works like this:
void Parse() {
    LoadFile();
    for (int currentLine = 1; currentLine < MAX_NUMBER_OF_LINES; ++currentLine) {
        string line = GetLineOfFile(currentLine);
        string[] tokens = SplitLineIntoTokens(line);
        DataItem data = PutTokensIntoDataItem(tokens);
        try {
            dataItems.Add(data);   // dataItems is the List<DataItem> member
        } catch (OutOfMemoryException) { }
    }
}
void LoadFile() {
    DirectoryInfo di = new DirectoryInfo(Path);
    FileInfo[] fileList = di.GetFiles();
    foreach (FileInfo fi in fileList)
    {
        //...
        StreamReader file = new StreamReader(fi.FullName);
        //...
        while (!file.EndOfStream)
            strHelp = file.ReadLine();
        //...
    }
}
There is no single right answer here, I believe. The implementation depends on many factors whose pros and cons only you can weigh.
If your primary purpose is to parse a large number of large files, then keeping them all in memory, irrespective of how much RAM is available, should be a secondary option, for various reasons, e.g. persistence in case an unhandled exception occurs.
Although profiling under initial conditions may encourage you to load everything into memory and keep it there for manipulation and search, this will soon change as the number of files grows, and in no time your application's supporters will start pushing back on the approach.
I would do the following:
Read each file and store its content in a document database such as RavenDB.
Run your parse routine over these documents and store the relevant relations in an RDBMS, if that is a requirement.
Search at will, full-text or otherwise, on either the document DB (the raw data) or the relational store (your parse output).
By doing this you take advantage of the research the creators of these systems have done into managing memory efficiently with a focus on performance.
I realise this may not be the answer for you, but for someone who thinks it suits their situation, it may well be.
If the code in your question is representative of the actual code, it looks like you're reading all of the data from all of the files into memory, and then parsing. That is, you have:
Parse()
LoadFile();
for each line
....
And your LoadFile loads all of the files into memory. Or so it seems. That's very wasteful because you maintain a list of all the un-parsed lines in addition to the objects created when you parse.
You could instead load only one line at a time, parse it, and then discard the unparsed line. For example:
void Parse()
{
    foreach (var line in GetFileLines())
    {
        // parse the line and store the resulting DataItem here
    }
}

IEnumerable<string> GetFileLines()
{
    foreach (var fileName in Directory.EnumerateFiles(Path))
    {
        foreach (var line in File.ReadLines(fileName))
        {
            yield return line;
        }
    }
}
That limits the amount of memory you use to hold the file names and, more importantly, the amount of memory occupied by un-parsed lines.
Also, if you have an upper limit to the number of lines that will be in the final data, you can pre-allocate your list so that adding to it doesn't cause a re-allocation. So if you know that your file will contain no more than 100 million lines, you can write:
void Parse()
{
    var dataItems = new List<DataItem>(100000000);
    foreach (var line in GetFileLines())
    {
        var data = PutTokensIntoDataItem(SplitLineIntoTokens(line));
        dataItems.Add(data);
    }
}
This reduces fragmentation and out of memory errors because the list is pre-allocated to hold the maximum number of lines you expect. If the pre-allocation works, then you know you have enough memory to hold references to the data items you're constructing.
If you still run out of memory, then you'll have to look at the structure of your data items. Perhaps you're storing too much information in them, or there are ways to reduce the amount of memory used to store those items. But you'll need to give us more information about your data structure if you need help reducing its footprint.
You can use:
Data Parallelism (Task Parallel Library)
Write a Simple Parallel.ForEach
I think it will reduce the memory exceptions and make file handling faster.
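A minimal sketch of that suggestion, reusing the names from the question (Path, SplitLineIntoTokens and PutTokensIntoDataItem); note that any shared collection must be thread-safe:
// requires System.Collections.Concurrent, System.IO and System.Threading.Tasks
var dataItems = new ConcurrentBag<DataItem>();
Parallel.ForEach(Directory.EnumerateFiles(Path), file =>
{
    foreach (var line in File.ReadLines(file))
    {
        var data = PutTokensIntoDataItem(SplitLineIntoTokens(line));
        dataItems.Add(data);   // ConcurrentBag is safe to add to from multiple threads
    }
});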
Is there some way I can combine two XmlDocuments without holding the first in memory?
I have to cycle through a list of up to a hundred large (~300 MB) XML files, appending up to 1000 nodes to each, and repeating the whole process several times (as the new node list is cleared to save memory). Currently I load the whole XmlDocument into memory before appending new nodes, which is no longer tenable.
What would you say is the best way to go about this? I have a few ideas but I'm not sure which is best:
Never load the whole XmlDocument; instead use XmlReader and XmlWriter simultaneously to write to a temp file which is subsequently renamed.
Make an XmlDocument for the new nodes only, and then manually write it to the existing file (i.e. file.WriteLine("<node>\n")).
Something else?
Any help will be much appreciated.
Edit Some more details in answer to some of the comments:
The program parses several large logs into XML, grouping them into different files by source. It only needs to run once a day, and once the XML is written there is a lightweight proprietary reader program which reports on the data. Because it only needs to run once a day it can be slow, but it runs on a server which performs other actions, mainly file compression and transfer, which cannot be affected too much.
A database would probably be easier, but the company isn't going to do this any time soon!
As it is, the program runs on the dev machine using a few GB of memory at most, but throws out-of-memory exceptions when run on the server.
Final Edit
The task is quite low-priority, which is why it would only cost extra to get a database (though I will look into Mongo).
The file will only be appended to, and won't grow indefinitely - each final file is only for a day's worth of the log, and then new files are generated the following day.
I'll probably use the XmlReader/Writer method since it will be easiest to ensure XML validity, but I have taken all your comments/answers into consideration. I know that having XML files this large is not a particularly good solution, but it's what I'm limited to, so thanks for all the help given.
If you wish to be completely certain of the XML structure, using XmlWriter and XmlReader is the best way to go.
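A hedged sketch of that route (option 1 in the question): copy the existing document node by node to a temp file with XmlWriter.WriteNode, slip the new nodes in before the closing root tag, then swap the files. The method and parameter names here are placeholders, and if your root element carries attributes or namespaces those would need to be copied across as well:
void AppendNodes(string xmlPath, IEnumerable<XmlNode> newNodes)
{
    string tempPath = xmlPath + ".tmp";
    using (var reader = XmlReader.Create(xmlPath))
    using (var writer = XmlWriter.Create(tempPath))
    {
        reader.MoveToContent();                        // positioned on the root element
        writer.WriteStartElement(reader.LocalName);    // re-open the root in the temp file
        reader.Read();                                 // step inside the root
        while (!reader.EOF && !(reader.NodeType == XmlNodeType.EndElement && reader.Depth == 0))
        {
            writer.WriteNode(reader, true);            // copies the current node/subtree and advances
        }
        foreach (XmlNode node in newNodes)             // append the new nodes before the closing root tag
        {
            node.WriteTo(writer);
        }
        writer.WriteEndElement();                      // close the root
    }
    File.Delete(xmlPath);
    File.Move(tempPath, xmlPath);
}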
However, for the absolute highest performance, you may be able to recreate this quickly using direct string functions. You could do this, although you'd lose the ability to verify the XML structure; if one file had an error you wouldn't be able to correct it:
using (StreamWriter sw = new StreamWriter("out.xml")) {
    foreach (string filename in files) {
        sw.Write(String.Format(@"<inputfile name=""{0}"">", filename));
        using (StreamReader sr = new StreamReader(filename)) {
            // Using .NET 4's CopyTo(); alternatively try http://bit.ly/RiovFX
            if (max_performance) {
                sw.Flush();                        // flush buffered text before copying raw bytes
                sr.BaseStream.CopyTo(sw.BaseStream);
            } else {
                string line;
                while ((line = sr.ReadLine()) != null) {
                    // parse the line and make any modifications you want
                    sw.Write(line);
                    sw.Write("\n");
                }
            }
        }
        sw.Write("</inputfile>");
    }
}
Depending on the way your input XML files are structured, you might opt to remove the XML headers, maybe the document element, or a few other unnecessary structures. You could do that by parsing the file line by line.
I need to iterate through a large XML file (~2GB) and selectively copy certain nodes to one or more separate XML files.
My first thought is to use XPath to iterate through matching nodes and for each node test which other file(s) the node should be copied to, like this:
var doc = new XPathDocument(@"C:\Some\Path.xml");
var nav = doc.CreateNavigator();
var nodeIter = nav.Select("//NodesOfInterest");
while (nodeIter.MoveNext())
{
    foreach (Thing thing in ThingsThatMightGetNodes)
    {
        if (thing.AllowedToHaveNode(nodeIter.Current))
        {
            thing.WorkingXmlDoc.AppendChild(... nodeIter.Current ...);
        }
    }
}
In this implementation, Thing defines a public System.Xml.XmlDocument WorkingXmlDoc to hold the nodes that it is AllowedToHave(). What I don't understand, though, is how to create a new XmlNode that is a copy of nodeIter.Current.
If there's a better approach I would be glad to hear it as well.
Evaluation of an XPath expression requires that the whole XML document (XML Infoset) be in RAM.
For an XML file whose textual representation exceeds 2GB, typically more than 10GB of RAM should be available just to hold the XML document.
Therefore, while not impossible, it may be preferable (especially on a server that must keep resources quickly available to many requests) to use another technique.
The XmlReader-based classes are an excellent tool for this scenario. They are fast, forward-only, and don't require the read nodes to be retained in memory. Also, your logic will remain almost the same.
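A hedged sketch of that approach, reusing the shape from the question; it assumes AllowedToHaveNode can accept an XmlNode rather than an XPathNavigator, and that each WorkingXmlDoc already has a root element to append under:
var scratch = new XmlDocument();   // used only as a factory for the copied nodes
using (var reader = XmlReader.Create(@"C:\Some\Path.xml"))
{
    reader.MoveToContent();
    while (!reader.EOF)
    {
        if (reader.NodeType == XmlNodeType.Element && reader.LocalName == "NodesOfInterest")
        {
            // ReadNode materialises just this element and its subtree,
            // and advances the reader past it
            XmlNode node = scratch.ReadNode(reader);
            foreach (Thing thing in ThingsThatMightGetNodes)
            {
                if (thing.AllowedToHaveNode(node))
                {
                    // ImportNode makes a copy owned by the target document
                    thing.WorkingXmlDoc.DocumentElement.AppendChild(
                        thing.WorkingXmlDoc.ImportNode(node, true));
                }
            }
        }
        else
        {
            reader.Read();
        }
    }
}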
You should consider LINQ to XML. Check this blog post for details and examples:
http://james.newtonking.com/archive/2007/12/11/linq-to-xml-over-large-documents.aspx
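The usual pattern for LINQ to XML over large documents, along the lines of that post, is a small helper that combines XmlReader with XNode.ReadFrom and yields one XElement at a time; a minimal sketch, where path and elementName are whatever applies to your document:
// requires System.Collections.Generic, System.Xml and System.Xml.Linq
static IEnumerable<XElement> StreamElements(string path, string elementName)
{
    using (var reader = XmlReader.Create(path))
    {
        reader.MoveToContent();
        while (!reader.EOF)
        {
            if (reader.NodeType == XmlNodeType.Element && reader.LocalName == elementName)
                // ReadFrom builds just this element and advances the reader past it
                yield return (XElement)XNode.ReadFrom(reader);
            else
                reader.Read();
        }
    }
}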
Try an XQuery processor that implements document projection (an idea first published by Marion and Simeon). It's implemented in a number of processors including Saxon-EE. Basically, if you run a query such as //x, it will filter the input event stream and build a tree that only contains the information needed to handle this query; it will then execute the query in the normal way, but against a much smaller tree. If this is a small part of the total document, you can easily reduce the memory requirement by 95% or so.
I want to search for an element value in all the XML files (assume 200+) in a folder using C#.
My scenario is that each file will contain multiple item tags, so I have to check all item tags for the user-selected search value, e.g. ABC123.
Currently I am using a foreach loop and it's taking a long time.
Could you please suggest a better option to get the result much faster?
Following is my current code implementation:
string[] arrFiles = Directory.GetFiles(temFolder, "*.xml");
foreach (string file in arrFiles)
{
    XmlDocument doc = new XmlDocument();
    doc.Load(file);
    XmlNodeList lstEquip = doc.SelectNodes("scene/PackedUnit/Items/ItemCode");
    foreach (XmlNode xnEquip in lstEquip)
    {
        if (xnEquip.InnerText.ToUpper() == equipCode.ToUpper())
        {
            String[] strings = file.Split('\\');
            string fileName = strings[strings.Count() - 1];
            fileName = fileName.Replace(".xml", "");
            lstSubContainers.Add(fileName);
            break;
        }
    }
}
Well, the first thing to work out is why it's taking a long time; without profiling it's hard to say what's going on.
One option is to parallelize the operation, using a pool of tasks, each working on a single document at a time. In an ideal world you'd probably read the files on a single thread (to prevent thrashing) and supply them to the pool as you read them, but just reading in multiple threads is probably a good starting point. Using .NET 4's Parallel Extensions libraries would make this reasonably straightforward.
Personally I like the LINQ to XML API for querying, rather than the "old" XmlElement etc. API, but it's up to you; I wouldn't expect it to make much difference. Using XmlReader instead could be faster, avoiding creating as much garbage, but I would try to find out where the time is going in the "simple" code first. (I personally find XmlReader rather harder to use correctly than the "whole document in memory" APIs.)
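Putting those two suggestions together, a hedged sketch of the search, reusing temFolder and equipCode from the question and building the lstSubContainers list in one go; Descendants("ItemCode") here is a simplification of the scene/PackedUnit/Items/ItemCode path:
// requires System.Linq, System.Xml.Linq and System.IO
var lstSubContainers = Directory.EnumerateFiles(temFolder, "*.xml")
    .AsParallel()                                      // spread the files across cores
    .Where(file => XDocument.Load(file)
        .Descendants("ItemCode")
        .Any(e => string.Equals(e.Value, equipCode, StringComparison.OrdinalIgnoreCase)))
    .Select(f => Path.GetFileNameWithoutExtension(f))  // keep just the file name, as before
    .ToList();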
If you're doing forward-only reading and not manipulating the XML in any way, switching to an XmlReader should speed up the processing, although I can't imagine it will really make a massive difference (maybe a second or two at most) with the file sizes you have.
I recently had to parse a 250 MB XML file using LINQ to XML in Silverlight (a test app) and that took only seconds. What spec is your machine?
I'm trying to do a dump to XML of a very large database (many gigabytes). I'm using Linq-to-SQL to get the data out of the database and Linq-to-XML to generate XML. I'm using XStreamingElement to keep memory use low. The job still allocates all available memory, however, before keeling over without having written any XML. The structure looks like this:
var foo =
    new XStreamingElement("contracts",
        <LinqtoSQL which fetches data>.Select(d =>
            new XElement("contract",
                ... generate attributes etc ...)));

using (StreamWriter sw = new StreamWriter("contracts.xml"))
{
    using (XmlWriter xw = XmlWriter.Create(sw))
    {
        foo.WriteTo(xw);
    }
}
I've also tried saving with:
foo.Save("contracts.xml", SaveOptions.DisableFormatting);
...to no avail.
Any clues?
How complex is the data? I'm not overly familiar with XStreamingElement, but I wonder if you might have more joy using XmlWriter directly? Especially for writing similar data in a loop, it can be used pretty easily.
I would, however, have concerns over XML as the choice for this data. Is this a requirement? Or simply a conveniently available format? In particular, it can be hard to parse XML of that size conveniently, as you'd have to use XmlReader (which is harder to get right than XmlWriter).
If you can use other formats, I'd advise it... a few leap to mind, but I won't babble on unless you mention that you'd be interested.
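A minimal sketch of the XmlWriter idea, assuming a hypothetical GetContracts() that streams rows out of LINQ to SQL rather than buffering them, and a hypothetical Id property on each row:
using (var xw = XmlWriter.Create("contracts.xml"))
{
    xw.WriteStartElement("contracts");
    foreach (var c in GetContracts())                  // must stream results, not load them all
    {
        xw.WriteStartElement("contract");
        xw.WriteAttributeString("id", c.Id.ToString()); // hypothetical attribute
        // ... write the remaining attributes/elements for this row ...
        xw.WriteEndElement();                          // </contract>
    }
    xw.WriteEndElement();                              // </contracts>
}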
Sure, you only need one clue for that: don't do it. :-)
XML is not an adequate format for database dumps because it does not handle large amounts of data well.
All databases have some sort of "dump" utility to export their data in a format that can then be read into another database - that would be the way to go.
Right, "solved" the problem by chunking my data into sets of 10,000 items and writing them to separate XML files. Will ponder other data exchange format and buy a larger server.
I would still be mighty interesting if someone had figured out how to properly take advantage of XStreamingElement.