I have a directory with about 30 randomly named XML files. So the name is no clue about their content. And I need to merge all of these files into a single file according to predefined rules. And unfortunately, it is too complex to use simple stylesheets.
Each file can have up to 15 different elements within its root. So I have 15 different methods that each take an XDocument as a parameter and search for a specific element in the XML, then process that data. Because I call these methods in a specific order, I can ensure that all data is processed in the correct order.
Example nodes are e.g. a list of products, a list of prices for specific product codes, a list of translations for product names, a list of countries, a list of discounts on product in specific country and much, much more. And no, these aren't very simple structures either.
Right now, I'm doing something like this:
List<XDocument> files = ImportFolder
    .EnumerateFiles("*.xml", SearchOption.TopDirectoryOnly)
    .Select(f => XDocument.Load(f.FullName))
    .ToList();
files.ForEach(FileInformation);
files.ForEach(ParseComments);
files.ForEach(ParsePrintOptions);
files.ForEach(ParseTranslations);
files.ForEach(ParseProducts);
// etc.
MyXml.Save(ExportFile.FullName);
I wonder if I can do this in a way that reads less into memory and generates the result faster. Speed is more important than memory. The solution above works; I just need something faster that uses less memory.
Any suggestions?
One approach would be to create a separate List<XElement> for each of the different data types. For example:
List<XElement> Comments = new List<XElement>();
List<XElement> Options = new List<XElement>();
// etc.
Then for each document you can go through the elements in that document and add them to the appropriate lists. Or, in pseudocode:
for each document
    for each element in document
        add element to the appropriate list
This way you don't have to load all of the documents into memory at the same time. In addition, you only do a single pass over each document.
Once you've read all of the documents, you can concatenate the different elements into your single MyXml document. That is:
MyXml = create empty document
Add Comments list to MyXml
Add Options list to MyXml
// etc.
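A minimal C# sketch of this collect-then-merge approach. The element names ("Comment", "Option") and the root name are placeholders for your actual node types, and in-memory strings stand in for the ~30 files:

```csharp
using System;
using System.Collections.Generic;
using System.Xml.Linq;

class MergeSketch
{
    static void Main()
    {
        // Stand-ins for the input documents (normally XDocument.Load per file).
        var inputs = new[]
        {
            "<Root><Comment>a</Comment><Option>x</Option></Root>",
            "<Root><Option>y</Option><Comment>b</Comment></Root>",
        };

        var comments = new List<XElement>();
        var options = new List<XElement>();

        foreach (var xml in inputs)
        {
            var doc = XDocument.Parse(xml);
            foreach (var element in doc.Root.Elements())
            {
                switch (element.Name.LocalName)
                {
                    // Clone each element so the source document can be collected.
                    case "Comment": comments.Add(new XElement(element)); break;
                    case "Option": options.Add(new XElement(element)); break;
                    // etc. for the other element types
                }
            }
        }

        // Concatenate the buckets, in the required order, into one document.
        var myXml = new XDocument(new XElement("Root", comments, options));
        Console.WriteLine(myXml);
    }
}
```

Only one source document is ever fully loaded at a time; the buckets hold just the elements you keep.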
Another benefit of this approach is that if the total amount of data is larger than will fit in memory, those lists of elements could be files. You'd write all of the Comment elements to the Comments file, the Options to the Options file, etc. And once you've read all of the input documents and saved the individual elements to files, you can then read each of the element files to create the final XML document.
Depending on the complexity of your rules, and how interdependent the data is between the various files, you could probably process each file in parallel (or at least certain chunks of it).
Given that the XDocuments aren't being changed during the read, you could almost certainly gather your data in parallel, which would likely offer a speed advantage.
See https://msdn.microsoft.com/en-us/library/dd460693%28v=vs.110%29.aspx
You should examine the data you're loading in, and whether you can work on that in any special way to keep memory-usage low (and even gain some speed).
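A sketch of parallel extraction with PLINQ. The "Product" element is a placeholder, and in-memory strings stand in for the loaded files:

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

class ParallelGather
{
    static void Main()
    {
        var inputs = new[]
        {
            "<Root><Product code='1'/></Root>",
            "<Root><Product code='2'/><Product code='3'/></Root>",
        };

        // Parse and extract in parallel; each document is touched by one thread only.
        var products = inputs
            .AsParallel()
            .SelectMany(xml => XDocument.Parse(xml).Descendants("Product"))
            .ToList();

        Console.WriteLine(products.Count); // 3
    }
}
```

Note that PLINQ does not preserve input order by default; if your merge rules depend on it, add `.AsOrdered()` after `.AsParallel()`.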
Related
What is the best approach for reading large numbers of XML files (I need to read 8000 of them) and doing some computations on them, with the best speed? Is it OK to use an XmlReader and return the nodes I'm interested in in a list? Or is it faster, when reading a node, to also do some computations on it right away? I tried the second approach (returning the nodes in a list, as values), because I tried to write my application with as many modules as possible. I am using C#, but this is not relevant.
Thank you.
Is it OK to use an XmlReader and return the nodes I'm interested in in a list? Or is it faster, when reading a node, to also do some computations on it right away?
I can't say whether returning a list is ok or not, because I don't know how large each file is, which would be more important in this regard than the number of XML documents.
However, it certainly could be very expensive, if an XML document, and hence the list produced, were very large.
Conversely, reading the node and calculating as you go will certainly start producing results sooner and use less memory, and hence be faster, to a degree ranging from negligible to so considerable that other approaches become infeasible, depending on just how large the source data is. It's the approach I take if I either have a strong concern about performance or a good reason to expect such a large dataset.
Somewhere between the two, is the approach of an IEnumerable<T> implementation that yields objects as it reads, along the lines of:
public IEnumerable<SomeObject> ExtractFromXml(XmlReader rdr)
{
    using (rdr)
    {
        while (rdr.Read())
        {
            if (rdr.NodeType == XmlNodeType.Element && rdr.LocalName == "thatElementYouReallyCareAbout")
            {
                var current = /* code to create a SomeObject from the XML goes here */;
                yield return current;
            }
        }
    }
}
As with producing a list, this separates the code doing the calculation from that which parses the XML, but because you can start enumerating through it with a foreach before it has finished that parsing, the memory use can be less, as will the time to start the calculation. This makes little difference with small documents, but a lot if they are large.
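A self-contained example of consuming such an iterator. The `<price>` element and the running total are illustrative, not from the question:

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.IO;
using System.Xml;

class StreamingExample
{
    // Yields the value of each <price> element as it is read;
    // nothing else from the document is kept in memory.
    static IEnumerable<decimal> ExtractPrices(XmlReader rdr)
    {
        using (rdr)
        {
            rdr.MoveToContent();
            while (!rdr.EOF)
            {
                if (rdr.NodeType == XmlNodeType.Element && rdr.LocalName == "price")
                {
                    // ReadElementContentAsString consumes the element and advances the reader.
                    yield return decimal.Parse(rdr.ReadElementContentAsString(),
                                               CultureInfo.InvariantCulture);
                }
                else
                {
                    rdr.Read();
                }
            }
        }
    }

    static void Main()
    {
        const string xml = "<prices><price>1.50</price><price>2.25</price></prices>";
        decimal total = 0;
        foreach (var p in ExtractPrices(XmlReader.Create(new StringReader(xml))))
            total += p; // the calculation starts before the document is fully parsed
        Console.WriteLine(total); // 3.75
    }
}
```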
The best solution I have personally come up with for dealing with XML files is to take advantage of .NET's XmlSerializer class. You can define a model for your XML and create a List of that model in which you keep your XML data, then:
using (StreamWriter sw = new StreamWriter("OutPutPath"))
{
    new XmlSerializer(typeof(List<Model>)).Serialize(sw, Models);
    sw.WriteLine();
}
You can then read the file back, deserialize the data, and assign it to the model by calling the Deserialize method.
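A full round trip with a hypothetical `Model` class (the class shape and field values are illustrative):

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml.Serialization;

public class Model
{
    public string Name { get; set; }
    public decimal Price { get; set; }
}

class SerializerRoundTrip
{
    static void Main()
    {
        var models = new List<Model> { new Model { Name = "Widget", Price = 9.99m } };
        var serializer = new XmlSerializer(typeof(List<Model>));

        // Serialize to a string (a StreamWriter over a file works the same way).
        var sw = new StringWriter();
        serializer.Serialize(sw, models);
        string xml = sw.ToString();

        // Deserialize back into the model list.
        var roundTripped = (List<Model>)serializer.Deserialize(new StringReader(xml));
        Console.WriteLine(roundTripped[0].Name); // Widget
    }
}
```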
I have a folder with 400k+ XML documents, and many more to come. Each file is named 'ID'.xml, and each belongs to a specific user. In a SQL Server database I have the 'ID' from the XML file matched with a user ID, which is how I interconnect the XML document with the user. A user can have any number of XML documents attached (but let's say at most ~10k documents).
All XML-documents have a few common elements, but the structure can vary a little.
Now, each user will need to search the XML documents belonging to her, and what I've tried so far (looping through each file and reading it with a StreamReader) is too slow. I don't care whether it reads and matches the whole file, attributes and all, or just the text in each element. What should be returned in the first place is a list of the IDs from the filenames.
What is the fastest and smartest methods here, if any?
I think LINQ-to-XML is probably the direction you want to go.
Assuming you know the names of the tags that you want, you would be able to do a search for those particular elements and return the values.
var xDoc = XDocument.Load("yourFile.xml");
var result = from dec in xDoc.Descendants()
             where dec.Name == "tagName"
             select dec.Value;
result would then contain an IEnumerable of the values of every XML tag whose name matches "tagName".
The query could also be written like this:
var result = from dec in xDoc.Descendants("tagName")
             select dec.Value;
or this:
var result = xDoc.Descendants("tagName").Select(tag => tag.Value);
The output would be the same, it is just a different way to filter based on the element name.
You'll have to open each file that contains relevant data, and if you don't know which files contain it, you'll have to open all that may match. So the only performance gain would be in the parsing routine.
When parsing XML, if speed is the requirement, you could use XmlReader, as it performs far better than the other parsers (most read the entire XML file before you can query them). The fact that it is forward-only should not be a limitation in this case.
If parsing takes about as long as the disk I/O, you could try parsing files in parallel, so one thread could wait for a file to be read while the other parses the loaded data. I don't think you can make that big a win there, though.
Also what is "too slow" and what is acceptable? Would this solution of many files become slower over time?
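A sketch of the XmlReader approach for this search: scan each file forward-only and bail out on the first match, so no DOM is ever built. The element names and search term are illustrative, and an in-memory string stands in for one of the files:

```csharp
using System;
using System.IO;
using System.Xml;

class XmlSearch
{
    // Returns true as soon as any text node contains the search term.
    static bool ContainsText(XmlReader rdr, string term)
    {
        using (rdr)
        {
            while (rdr.Read())
            {
                if (rdr.NodeType == XmlNodeType.Text && rdr.Value.Contains(term))
                    return true;
            }
        }
        return false;
    }

    static void Main()
    {
        const string xml = "<doc><name>Alpha</name><note>hello world</note></doc>";
        bool found = ContainsText(XmlReader.Create(new StringReader(xml)), "world");
        Console.WriteLine(found); // True

        // For the real folder: loop over the user's file IDs, open each file with
        // XmlReader.Create(path), and collect the IDs for which ContainsText returns true.
    }
}
```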
Use LINQ to XML.
Check out this article over at MSDN.
XDocument doc = XDocument.Load(@"C:\file.xml");
And don't forget that reading so many files will always be slow, you may try writing a multi-threaded program...
If I understood correctly, you don't want to open each XML file for a particular user because it's too slow, whether you are using LINQ to XML or some other method.
Have you considered saving some values (tags) in both the XML file and a relational database, together with the XML ID? In that case you could search for the values in the DB first and select only the XML files that contain the searched values.
For example:

ID        | tagName1 | tagName2
xmlDocID  | value1   | value2
My other question is: why have you chosen to store the XML documents in the file system? If you are using SQL Server 2005/2008, it has very good support for storing and searching XML columns (it can even index some values in the XML).
Are you just looking for files that have a specific string in the content somewhere?
WARNING - Not a pure .NET solution. If this scares you, then stick with the other answers. :)
If that's what you're doing, another alternative is to get something like grep to do the heavy lifting for you. Shell out to that with the "-l" argument to specify that you are only interested in filenames and you are on to a winner. (for more usage examples, see this link)
L.B has already made a valid point.
This is a case where Lucene.Net (or any indexer) would be a must. It would give you steady (very fast) performance across all searches. Handling very large amounts of arbitrary data is one of the primary benefits of indexers.
Or is there any reason why you wouldn't use Lucene?
Lucene.NET (and Lucene) support incremental indexing. If you can re-open the index for reading every so often, then you can keep adding documents to the index all day long -- your searches will be up-to-date with the last time you re-opened the index for searching.
I have an issue where I need to load a fixed-length file, process some of the fields, generate a few others, and finally output a new file. The difficult part is that the file contains part numbers, and some of the products are superseded by other products (which can themselves also be superseded). What I need to do is follow the supersession trail to get the information I need to replace some of the fields in the row I am looking at. So how can I best handle about 200,000 lines from a file and the need to move up and down within the given products? I thought about using a collection to hold the data, or a dataset, but I just don't think this is the right way. Here is an example of what I am trying to do:
Before

Part Number   List Price   Description       Superseding Part Number
0913982                                      3852943
3852943       0006710      CARRIER,BEARING

After

Part Number   List Price   Description       Superseding Part Number
0913982       0006710      CARRIER,BEARING   3852943
3852943       0006710      CARRIER,BEARING
As usual any help would be appreciated, thanks.
Wade
Create a structure for the given fields.
Read the file and put the structures in a collection. You may use the part number as the key in a hashtable to provide the fastest searching.
Scan the collection and fix the data.
200,000 objects built from the given lines will fit easily in memory.
For example, if your structure size is 50 bytes, you will need only about 10 MB of memory. That is nothing for a modern PC.
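A sketch of that approach, using a Dictionary keyed by part number and following each supersession chain to its end. The Part class and field names are illustrative, and the sample rows are the two from the question:

```csharp
using System;
using System.Collections.Generic;

class Part
{
    public string Number;
    public string ListPrice;
    public string Description;
    public string SupersededBy; // empty when the part is current
}

class SupersessionFix
{
    static void Main()
    {
        // Normally populated by reading the fixed-length file line by line.
        var parts = new Dictionary<string, Part>
        {
            ["0913982"] = new Part { Number = "0913982", SupersededBy = "3852943" },
            ["3852943"] = new Part { Number = "3852943", ListPrice = "0006710",
                                     Description = "CARRIER,BEARING", SupersededBy = "" },
        };

        foreach (var part in parts.Values)
        {
            // Follow the supersession chain to its end and copy the fields back.
            // A real implementation should also guard against circular chains.
            var current = part;
            while (!string.IsNullOrEmpty(current.SupersededBy)
                   && parts.TryGetValue(current.SupersededBy, out var next))
                current = next;

            if (current != part)
            {
                part.ListPrice = current.ListPrice;
                part.Description = current.Description;
            }
        }

        Console.WriteLine(parts["0913982"].Description); // CARRIER,BEARING
    }
}
```

Each lookup in the dictionary is O(1), so fixing all 200,000 rows is roughly linear in the total length of the supersession chains.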
I need to iterate through a large XML file (~2GB) and selectively copy certain nodes to one or more separate XML files.
My first thought is to use XPath to iterate through matching nodes and for each node test which other file(s) the node should be copied to, like this:
var doc = new XPathDocument(@"C:\Some\Path.xml");
var nav = doc.CreateNavigator();
var nodeIter = nav.Select("//NodesOfInterest");
while (nodeIter.MoveNext())
{
foreach (Thing thing in ThingsThatMightGetNodes)
{
if (thing.AllowedToHaveNode(nodeIter.Current))
{
thing.WorkingXmlDoc.AppendChild(... nodeIter.Current ...);
}
}
}
In this implementation, Thing defines public System.Xml.XmlDocument WorkingXmlDoc to hold the nodes it is allowed to have. What I don't understand, though, is how to create a new XmlNode that is a copy of nodeIter.Current.
If there's a better approach I would be glad to hear it as well.
Evaluation of an XPath expression requires that the whole XML document (XML Infoset) be in RAM.
For an XML file whose textual representation exceeds 2GB, typically more than 10GB of RAM should be available just to hold the XML document.
Therefore, while not impossible, it may be preferable (especially on a server that must keep resources quickly available for many requests) to use another technique.
The XmlReader-based classes are an excellent tool for this scenario. They are fast, forward-only, and don't require retaining the read nodes in memory. Also, your logic will remain almost the same.
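With XmlReader, the node currently being read can be copied straight into a destination via XmlWriter.WriteNode, without ever materializing the whole document. A sketch (the element name is a placeholder, and an in-memory string stands in for the 2GB file):

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;

class SelectiveCopy
{
    static void Main()
    {
        const string source =
            "<data><NodeOfInterest id='1'/><other/><NodeOfInterest id='2'/></data>";

        var output = new StringBuilder();
        using (var reader = XmlReader.Create(new StringReader(source)))
        using (var writer = XmlWriter.Create(output,
                   new XmlWriterSettings { OmitXmlDeclaration = true }))
        {
            writer.WriteStartElement("copied");
            while (reader.Read())
            {
                if (reader.NodeType == XmlNodeType.Element
                    && reader.LocalName == "NodeOfInterest")
                {
                    // WriteNode copies the whole subtree to the writer and
                    // advances the reader past it.
                    writer.WriteNode(reader.ReadSubtree(), false);
                }
            }
            writer.WriteEndElement();
        }
        Console.WriteLine(output);
    }
}
```

In the question's setting you would test each matched node against the Things and write it to the XmlWriter of every Thing that is allowed to have it.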
You should consider LINQ to XML. Check this blog post for details and examples:
http://james.newtonking.com/archive/2007/12/11/linq-to-xml-over-large-documents.aspx
Try an XQuery processor that implements document projection (an idea first published by Marian and Siméon). It's implemented in a number of processors, including Saxon-EE. Basically, if you run a query such as //x, it will filter the input event stream and build a tree that contains only the information needed to answer this query; it will then execute the query in the normal way, but against a much smaller tree. If this is a small part of the total document, you can easily reduce the memory requirement by 95% or so.
I need to write a code in C# that will select a list of file names from a data table and delete every file in a folder that is not in this list.
One possibility would be to have both ordered by name, and then loop through my table results, and for each result, loop through my files and delete them until I find a file that matches the current result or is alphabetically bigger, and then move to the next result without resetting the current file index.
I haven't tried to actually implement this, but seems to me that this would be an O(n) since each list would be looped through just once (ignoring the sorting both lists part). The only thing I'm not sure about is whether I can be 100% sure both the file system and the database engine will sort exactly the same way (will they both consider "_" smaller than "-" and stuff like that). If not, the algorithm above just wouldn't work at all. (By the way this is a Jet Engine database.)
But since this is probably not such an uncommon problem, you guys might already know a better solution. I tried searching the web but couldn't find anything. Perhaps a more effective solution would be to put each list into a HashSet and find their difference.
Get the folder content into folderFiles (IEnumerable<string>)
Get the files you want to keep in filesToKeep (IEnumerable<string>)
Get a list of "not in list" files.
Delete these files.
Code Sample :
IEnumerable<FileInfo> folderFiles = new List<FileInfo>(); // Fill me.
IEnumerable<string> filesToKeep = new List<string>(); // Fill me.
foreach (string fileToDelete in folderFiles.Select(fi => fi.FullName).Except(filesToKeep))
{
File.Delete(fileToDelete);
}
Here is my suggestion for you. Assuming filesInDatabase contains a list of files which are in the database and pathOfDirectory contains the path of the directory where the files to compare are contained.
foreach (var fileToDelete in Directory.EnumerateFiles(pathOfDirectory).Where(item => !filesInDatabase.Contains(item)))
{
File.Delete(fileToDelete);
}
EDIT:
This requires using System.Linq;, because it uses LINQ.
I think hashing is the way to go, but you don't really need two HashSets. Only one HashSet is needed to store the standardized file names from the datatable; the other container can be any collection data type.
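A sketch of the single-HashSet idea, assuming file names should be compared case-insensitively (an array stands in for Directory.EnumerateFiles + Path.GetFileName):

```csharp
using System;
using System.Collections.Generic;

class DeleteMissing
{
    static void Main()
    {
        // Names from the data table; one HashSet gives O(1) membership tests,
        // so no sorting or culture-aware ordering is needed at all.
        var keep = new HashSet<string>(StringComparer.OrdinalIgnoreCase)
        {
            "a.txt", "b.txt"
        };

        // Stand-in for enumerating the folder.
        var onDisk = new[] { "a.txt", "B.TXT", "c.txt" };

        foreach (var name in onDisk)
        {
            if (!keep.Contains(name))
                Console.WriteLine("would delete " + name); // File.Delete in real code
        }
    }
}
```

This sidesteps the question of whether the file system and the database engine sort identically, because no ordering is relied upon.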
First off, .NET allows you to define cultures that can be used in sorting, but I'm not all that familiar with the mechanism, so I'll let Google give you pointers on the subject.
Second, to avoid the whole culture mess, you can use a different algorithm with an idea similar to radix sort (only without the sort); time complexity is O(n * length_longest_file_name). File name lengths are limited (as far as I know, almost no file system allows a file name longer than 256 characters), so I'm assuming that n is dramatically larger than the file name lengths. If n is smaller than the max file name length, just use an O(n^2) method and avoid the extra work (iterating over lists this small is near-instant anyway).
Note: This method does not require sorting.
The idea is to create an array of the symbols that can be used as file name chars (about 60-70 chars, if this is a case-sensitive search), and another flag array with a flag for each char in the first array.
Now, you loop over each char position in the file names of the list from the DB (from 1 to length_longest_file_name).
In each iteration (i) you go over the i-th char of each file name in the DB list. For every char you see, you set its relevant flag to true.
When all flags are set, you go over the second list and delete every file for which the i-th char of its name is not flagged.
Implementation might be complex, and the overhead of the two arrays might make it slower when n is small, but you can optimize this (for instance, stop iterating over files whose names are shorter than the current i by removing them from both lists).
Hope this helps
I have another idea that might be faster.
var filesToDelete = new List<string>(Directory.GetFiles(directoryPath));
foreach (var databaseFile in databaseFileList)
{
filesToDelete.Remove(databaseFile);
}
foreach (var fileToDelete in filesToDelete)
{
File.Delete(fileToDelete);
}
Explanation: first, get all files contained in the directory. Then remove from that list every file that is in the database. Finally, delete all files remaining in the filesToDelete list.