I'm looking at classes to use to read a large xml file. A fast implementation of the C# XmlReader class, XmlTextReader, provides "forward-only access." What does this mean?
"forward-only" means just that - you can only go forward through data. The main benefits of such approach are no need to store previous information (leading to low memory usage) and ability to read from non-seekable sources like TCP stream (where you can't seek back unlike with file stream that allow random access).
"Forward-only" is very easy to see for table-based structures (like reading from database) - "forward-only" reader will let you only check "current" record or move to the next row. There will be no way to access data from already seen rows via such reader (you have to save data outside of reader to be able to access it).
For XmlReader it is slightly more confusing as it produces tree structure out of stream of text. From stream reading point of view "forward-only" means you will not be able to get any data that reader already looked at (like root node that is basically first line of the file or parent node of current one as it had to be earlier in the file).
But from XML tree generation point of view "forward-only" may be confusing - it produces elements in depth-first order (because that how they are present in the text of the XML) meaning that "next" element is not necessary the one you'd like to see in the tree (especially if you expect breadth-first access like "names of all authors of this book").
Note that XmlReader allows you to access all attributes of current node at any time as it considers them part of the "current element".
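To make the depth-first, forward-only behaviour concrete, here is a minimal sketch. The class name ForwardOnlyDemo, the method ListElements and the "id" attribute are made up for illustration; only XmlReader and its members come from the framework.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Xml;

public static class ForwardOnlyDemo
{
    // Returns one line per element, in the order the reader encounters them
    // (document order, i.e. depth-first), indented by nesting depth.
    public static List<string> ListElements(TextReader input)
    {
        var seen = new List<string>();
        using (var reader = XmlReader.Create(input))
        {
            // Read() only ever moves forward; there is no way to step back.
            while (reader.Read())
            {
                if (reader.NodeType != XmlNodeType.Element) continue;

                // Attributes of the current element are available at any time.
                string id = reader.GetAttribute("id"); // null if absent
                string indent = new string(' ', reader.Depth * 2);
                seen.Add(indent + reader.Name + (id != null ? " id=" + id : ""));
            }
        }
        return seen;
    }
}
```

Once the reader has moved past an element, anything you still need from it (such as the name of the parent) has to have been copied into your own variables.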
Related
The methods of the .NET DirectorySecurity class (e.g. GetAccessRules()) are far too slow for my purposes. Instead, I wish to directly query the NTFS $Secure metafile (or, alternatively, its $SDS stream) in order to retrieve a list of local accounts and their associated permissions for each file system object.
My plan is to first read the $MFT metafile (which I've already figured out how to do) - and then, for each entry therein, look up the appropriate security descriptor in the metafile (or stream).
The ideal code block would look something like this:
// I've already successfully written code for MFTReader:
var mftReader = new MFTReader(driveToAnalyze, RetrieveMode.All);
IEnumerable<INode> nodes = mftReader.GetNodes(driveToAnalyze.Name);
foreach (NodeWrapper node in nodes)
{
    // Now I wish to return security information for each file system object
    // WITHOUT needing to traverse the directory tree.
    // This is where I need help:
    var securityInfo = GetSecurityInfoFromMetafile(node.FullName, node.SecurityID);
    yield return Tuple.Create(node.FullName, securityInfo.PrincipalName, DecodeAccessMask(securityInfo.AccessMask));
}
And I would like my output to look like this:
c:\Folder1\File1.txt jane_smith Read, Write, Execute
c:\Folder1\File1.txt bill_jones Read, Execute
c:\Folder1\File2.txt john_brown Full Control
etc.
I am running .NET version 4.7.1 on Windows 10.
There's no API to read directly from $Secure, just like there is no API to read directly from $MFT. (There's FSCTL_QUERY_FILE_LAYOUT but that just gives you an abstracted interpretation of the MFT contents.)
Since you said you can read $MFT, it sounds like you must be using a volume handle to read directly from the volume, just like chkdsk and similar tools. That allows you to read whatever you want provided you know how to interpret the on-disk structures. So your question reduces to how to correctly interpret the $Secure file.
I will not give you code snippets or exact data structures, but I will give you some very good hints. There are actually two approaches possible.
The first approach is to scan forward through $SDS. All of the security descriptors are there, in ascending SecurityId order. At various 16-byte-aligned offsets you'll find a 20-byte header that includes the SecurityId among other information, followed by the security descriptor in serialized form. Also, every alternate 256K region in $SDS is a mirror of the previous 256K region, so to cut the work in half, only consider the regions 0..256K-1, 512K..768K-1, etc.
The second approach is to make use of the $SII index, also part of the $Secure file. The structure of this is a B-tree very similar to how directories are structured in NTFS. The index entries in $SII have SecurityId as the index for lookups, and also contain the byte offset you can go to in $SDS to find the corresponding header and security descriptor. This approach will be more performant than scanning $SDS, but requires you to know how to interpret a lot more structures.
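As a rough sketch of the first approach, the following scans an in-memory copy of the $SDS stream. The header layout used here is an assumption, not something stated above: it follows publicly circulated reverse-engineered NTFS notes (such as those used by the ntfs-3g project), which describe the 20-byte header as a 4-byte hash, a 4-byte SecurityId, an 8-byte offset of the entry within $SDS, and a 4-byte entry length that includes the header. The class and method names are made up, and the code assumes a little-endian machine.

```csharp
using System;
using System.Collections.Generic;

public static class SdsScanner
{
    const int HeaderSize = 20;          // assumed: hash(4) + id(4) + offset(8) + length(4)
    const int BlockSize = 256 * 1024;   // every second block is a mirror

    // Yields the SecurityId plus the position and length of each serialized
    // security descriptor found in the given $SDS contents.
    public static IEnumerable<(uint SecurityId, int DescriptorOffset, int DescriptorLength)> Scan(byte[] sds)
    {
        int pos = 0;
        while (pos + HeaderSize <= sds.Length)
        {
            // Skip the mirror blocks (256K..512K-1, 768K..1024K-1, ...).
            if ((pos / BlockSize) % 2 == 1)
            {
                pos = (pos / BlockSize + 1) * BlockSize;
                continue;
            }

            uint securityId = BitConverter.ToUInt32(sds, pos + 4);
            long offset = BitConverter.ToInt64(sds, pos + 8);
            uint length = BitConverter.ToUInt32(sds, pos + 16);

            // Assumed sanity check: a real entry records its own offset.
            bool plausible = length >= HeaderSize && offset == pos && pos + length <= sds.Length;
            if (!plausible)
            {
                // Treat as padding at the end of a block; move to the next block.
                pos = (pos / BlockSize + 1) * BlockSize;
                continue;
            }

            yield return (securityId, pos + HeaderSize, (int)length - HeaderSize);

            // Entries start on 16-byte boundaries.
            pos = (pos + (int)length + 15) & ~15;
        }
    }
}
```

Each yielded range can then be handed to whatever decodes the self-relative descriptor, for example the RawSecurityDescriptor(byte[], int) constructor.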
Craig pretty much covered everything. I would like to clarify a few points. Like Craig, no code here.
Navigate to the node number 9 which corresponds to $Secure.
Get all the streams and get all the fragments of the $SDS stream.
Read the content and extract each security descriptor.
Use IsValidSecurityDescriptor to make sure the SD is valid and stop when you reach an invalid SD.
Remember that $Secure stores the security descriptors in self-relative format.
Are you using FSCTL_QUERY_FILE_LAYOUT? The only real source of how to use this function I have found is here:
https://wimlib.net/git/?p=wimlib;a=blob;f=src/win32_capture.c;h=d62f7d07ef20c08c9bec93f261131033e39b159b;hb=HEAD
It looks like he solves the problem with security descriptors like this:
He gets basically all information about files from the MFT, but not security descriptors. For those he gets the SecurityId field from the MFT and looks in a hash table to see whether he already has a mapping from this ID to the ACL. If he has, he just returns it; otherwise he uses NtQuerySecurityObject and caches the result in the hash table. This should drastically reduce the number of calls. It assumes that there are few distinct security descriptors and that the SecurityId field correctly represents the single instancing of the descriptors.
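The caching idea described above can be sketched in a few lines. This is not wimlib's code; the class name and the query delegate are hypothetical, and the delegate stands in for whatever slow call you use to fetch a descriptor for a path.

```csharp
using System;
using System.Collections.Generic;

// Caches one security descriptor per SecurityId so the slow query runs
// only once for each distinct descriptor on the volume.
public sealed class SecurityDescriptorCache<TDescriptor>
{
    private readonly Dictionary<uint, TDescriptor> _byId = new Dictionary<uint, TDescriptor>();
    private readonly Func<string, TDescriptor> _query;

    // Number of times the slow query was actually invoked.
    public int QueryCount { get; private set; }

    public SecurityDescriptorCache(Func<string, TDescriptor> query)
    {
        _query = query;
    }

    public TDescriptor Get(string path, uint securityId)
    {
        if (!_byId.TryGetValue(securityId, out TDescriptor descriptor))
        {
            descriptor = _query(path);   // slow path: ask the OS
            QueryCount++;
            _byId[securityId] = descriptor;
        }
        return descriptor;
    }
}
```

On a typical volume many files share a handful of descriptors, so the number of slow queries is bounded by the number of distinct SecurityId values rather than the number of files.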
I have a folder with 400k+ XML documents and many more to come. Each file is named 'ID'.xml, and each belongs to a specific user. In a SQL Server database I have the 'ID' from the XML file matched with a userID, which is how I connect the XML document to the user. A user can have an unlimited number of XML documents attached (but let's say more than 10k documents at most).
All XML-documents have a few common elements, but the structure can vary a little.
Now, each user will need to search the XML documents belonging to her, and what I've tried so far (looping through each file and reading it with a StreamReader) is too slow. I don't care whether it reads and matches the whole file including attributes and so on, or just the text in each element. What should be returned in the first place is a list of the IDs from the filenames.
What is the fastest and smartest method here, if any?
I think LINQ-to-XML is probably the direction you want to go.
Assuming you know the names of the tags that you want, you would be able to do a search for those particular elements and return the values.
var xDoc = XDocument.Load("yourFile.xml");
var result = from dec in xDoc.Descendants()
where dec.Name == "tagName"
select dec.Value;
result would then contain an IEnumerable of the values of all XML elements whose name matches "tagName".
The query could also be written like this:
var result = from dec in xDoc.Descendants("tagName")
             select dec.Value;
or this:
var result = xDoc.Descendants("tagName").Select(tag => tag.Value);
The output would be the same, it is just a different way to filter based on the element name.
You'll have to open each file that contains relevant data, and if you don't know which files contain it, you'll have to open all that may match. So the only performance gain would be in the parsing routine.
When parsing XML, if speed is the requirement, you could use XmlReader, as it performs far better than the other parsers (most of which read the entire XML file before you can query it). The fact that it is forward-only should not be a limitation in this case.
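A minimal sketch of such an XmlReader-based check, assuming the search only needs to know whether any text content contains a term. The class and method names are invented for the example, and case-insensitive matching is an arbitrary choice.

```csharp
using System;
using System.IO;
using System.Xml;

public static class XmlTextSearch
{
    // Returns true as soon as any text or CDATA node contains the term,
    // without building a tree for the document.
    public static bool ContainsText(TextReader input, string term)
    {
        using (var reader = XmlReader.Create(input))
        {
            while (reader.Read())
            {
                bool isText = reader.NodeType == XmlNodeType.Text
                           || reader.NodeType == XmlNodeType.CDATA;
                if (isText && reader.Value.IndexOf(term, StringComparison.OrdinalIgnoreCase) >= 0)
                {
                    return true; // stop early; the rest of the file is never parsed
                }
            }
        }
        return false;
    }
}
```

For files on disk you would pass File.OpenText(path) and collect the file names for which the method returns true.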
If parsing takes about as long as the disk I/O, you could try parsing files in parallel, so one thread could wait for a file to be read while the other parses the loaded data. I don't think you can make that big a win there, though.
Also what is "too slow" and what is acceptable? Would this solution of many files become slower over time?
Use LINQ to XML.
Check out this article over at MSDN.
XDocument doc = XDocument.Load(@"C:\file.xml");
And don't forget that reading so many files will always be slow, you may try writing a multi-threaded program...
If I understood correctly, you don't want to open each XML file for a particular user because it's too slow, whether you are using LINQ to XML or some other method.
Have you considered saving some values (tags) both in the XML file and in a relational database, together with the XML ID?
In that case you could search for the values in the DB first and select only the XML files that contain the searched values.
for example:
ID, tagName1, tagName2
xmlDocID, value1, value2
My other question is: why have you chosen to store the XML documents in the file system? If you are using SQL Server 2005/2008, it has very good support for storing and searching XML columns (even indexing some values in the XML).
Are you just looking for files that have a specific string in the content somewhere?
WARNING - Not a pure .NET solution. If this scares you, then stick with the other answers. :)
If that's what you're doing, another alternative is to get something like grep to do the heavy lifting for you. Shell out to that with the "-l" argument to specify that you are only interested in filenames and you are on to a winner. (for more usage examples, see this link)
L.B has already made a valid point.
This is a case, where Lucene.Net(or any indexer) would be a must. It would give you a steady (very fast) performance in all searches. And it is one of the primary benefits of indexers, to handle a very large amount of arbitrary data.
Or is there any reason, why you wouldn't use Lucene?
Lucene.NET (and Lucene) support incremental indexing. If you can re-open the index for reading every so often, then you can keep adding documents to the index all day long -- your searches will be up-to-date with the last time you re-opened the index for searching.
I need to iterate through a large XML file (~2GB) and selectively copy certain nodes to one or more separate XML files.
My first thought is to use XPath to iterate through matching nodes and for each node test which other file(s) the node should be copied to, like this:
var doc = new XPathDocument(@"C:\Some\Path.xml");
var nav = doc.CreateNavigator();
var nodeIter = nav.Select("//NodesOfInterest");
while (nodeIter.MoveNext())
{
    foreach (Thing thing in ThingsThatMightGetNodes)
    {
        if (thing.AllowedToHaveNode(nodeIter.Current))
        {
            thing.WorkingXmlDoc.AppendChild(... nodeIter.Current ...);
        }
    }
}
In this implementation, Thing defines public System.Xml.XmlDocument WorkingXmlDoc to hold nodes that it is AllowedToHave(). I don't understand, though, how to create a new XmlNode that is a copy of nodeIter.Current.
If there's a better approach I would be glad to hear it as well.
Evaluation of an XPath expression requires that the whole XML document (XML Infoset) be in RAM.
For an XML file whose textual representation exceeds 2GB, typically more than 10GB of RAM should be available just to hold the XML document.
Therefore, while not impossible, it may be preferable (especially on a server that must have resources quickly available to many requests) to use another technique.
XmlReader (and the classes derived from it) is an excellent tool for this scenario. It is fast, forward-only, and doesn't need to retain the nodes it has read in memory. Also, your logic will remain almost the same.
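One common way to combine the two ideas is to stream with XmlReader and materialize only the matching elements as XElement objects via XNode.ReadFrom. The sketch below assumes the nodes of interest can be identified by local name; the class and method names are made up, and nested matches inside a matched element are returned only as part of their outermost ancestor.

```csharp
using System.Collections.Generic;
using System.Xml;
using System.Xml.Linq;

public static class XmlStreaming
{
    // Lazily yields each element with the given local name. Only one matched
    // element is held in memory at a time; the rest of the file is skipped over.
    public static IEnumerable<XElement> StreamElements(XmlReader reader, string localName)
    {
        reader.MoveToContent();
        while (!reader.EOF)
        {
            if (reader.NodeType == XmlNodeType.Element && reader.LocalName == localName)
            {
                // ReadFrom consumes the whole element and leaves the reader on
                // the node that follows it, so no extra Read() is needed here.
                yield return (XElement)XNode.ReadFrom(reader);
            }
            else
            {
                reader.Read();
            }
        }
    }
}
```

Each yielded XElement can be tested against every Thing and, where allowed, written to that Thing's output with element.WriteTo(writer), which avoids keeping per-target XmlDocument instances for very large outputs.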
You should consider LINQ to XML. Check this blog post for details and examples:
http://james.newtonking.com/archive/2007/12/11/linq-to-xml-over-large-documents.aspx
Try an XQuery processor that implements document projection (an idea first published by Marion and Simeon). It's implemented in a number of processors including Saxon-EE. Basically, if you run a query such as //x, it will filter the input event stream and build a tree that only contains the information needed to handle this query; it will then execute the query in the normal way, but against a much smaller tree. If this is a small part of the total document, you can easily reduce the memory requirement by 95% or so.
I am developing a program to log data from incoming serial communication. I have to prompt the serial box by sending a command in order to receive something. All this works fine, but I have a problem.
The program has to run on a netbook (approx. 1.5 GHz, 2 GB RAM), and it can't keep up when I ask it to save this information to an XML file.
I am only getting communication every 5 seconds, and I am not reading the file anywhere else.
I use xml.Save(string filename) to save the file.
Is there another, better way to save the information to my XML file, or should I use an alternative?
If I should use an alternative, which should it be?
Edit:
Added some code:
XmlDocument xml = new XmlDocument();
xml.Load(logFile);
XmlNode p = xml.GetElementsByTagName("records")[0];
for (int i = 0; i < newDat.Length; i++)
{
    XmlNode q = xml.CreateElement("record");
    XmlNode a = xml.CreateElement("time");
    XmlNode b = xml.CreateElement("temp");
    XmlNode c = xml.CreateElement("addr");
    a.AppendChild(xml.CreateTextNode(outDat[i, 0]));
    b.AppendChild(xml.CreateTextNode(outDat[i, 1]));
    c.AppendChild(xml.CreateTextNode(outDat[i, 2]));
    sendTime = outDat[i, 0];
    points.Add(outDat[i, 2], outDat[i, 1]);
    q.AppendChild(a);
    q.AppendChild(b);
    q.AppendChild(c);
    p.AppendChild(q);
}
xml.AppendChild(p);
xml.Save(this.logFile);
This is the XML-related code, running once every 5 seconds. I am reading the file (I get no error), adding some children, and then saving it again. It is when I save that I get the error.
You may want to look at using an XmlWriter and building the XML file by hand. That would allow you to open a file and keep it open for the duration of the logging, appending one XML fragment at a time as you read in data. The XmlWriter class is optimized for forward-only writing to an XML stream.
The above approach should be much faster when compared to using the Save method to serialize (save) a full XML document each time you read data and when you really only want to append a new fragment at the end.
EDIT
Based on the code sample you posted, it's the Load and Save that's causing the unnecessary performance bottleneck. Every time you're adding a log entry you're essentially loading the full XML document and behind the scenes parsing it into a full-blown XML tree. Then you modify the tree (by adding nodes) and then serialize it all to disk again. This is very very counter productive.
My proposed solution is really the way to go: create and open the log file only once; then use an XmlWriter to write out the XML elements one by one, each time you read new data. This way you're not holding the full contents of the XML log in memory and you're only appending small chunks of data at the end of a file, which should be unnoticeable in terms of overhead. At the end, simply close the root XML tag, close the XmlWriter and close the file. That's it! This should not noticeably slow down your UI even if you implement it synchronously, on the UI thread.
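A minimal sketch of that approach, reusing the element names from the question ("records", "record", "time", "temp", "addr"). The class name XmlLogWriter and its members are invented for the example.

```csharp
using System;
using System.IO;
using System.Xml;

// Opens the log once, appends one <record> per call, and closes the root
// element only when disposed.
public sealed class XmlLogWriter : IDisposable
{
    private readonly XmlWriter _writer;

    public XmlLogWriter(TextWriter output)
    {
        _writer = XmlWriter.Create(output, new XmlWriterSettings { Indent = true });
        _writer.WriteStartDocument();
        _writer.WriteStartElement("records");
    }

    public void Append(string time, string temp, string addr)
    {
        _writer.WriteStartElement("record");
        _writer.WriteElementString("time", time);
        _writer.WriteElementString("temp", temp);
        _writer.WriteElementString("addr", addr);
        _writer.WriteEndElement();
        _writer.Flush(); // push the new record out without closing the document
    }

    public void Dispose()
    {
        _writer.WriteEndElement();   // </records>
        _writer.WriteEndDocument();
        _writer.Dispose();
    }
}
```

In the logging program you would pass a StreamWriter over the log file and keep the XmlLogWriter alive for the whole session. Note that the file is not well-formed XML until Dispose has written the closing root tag, so a crash leaves a truncated document.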
While not a direct answer to your question, it sounds like you're doing everything in a very linear way:
1. Receive command
2. Modify in-memory XML
3. Save in-memory XML to disk
4. Go to 1
I would suggest you look into using some threading, or possibly Tasks, to make this more asynchronous. This would certainly be more difficult, and you would have to wrestle with task synchronization, but in the long run it's going to perform a lot better.
I would look at having a thread (possibly the main thread, not sure if you're using WinForms, a console app or what) that receives the command, and posts the "changes" to a holding class. Then have a second thread, which periodically polls this holding class and checks it for a "Dirty" state. When it detects this state, it grabs a copy of the XML and saves it to disk.
This allows your serial communication to continue uninterrupted, regardless of how poorly the hardware you're running on performs.
Normally for log files one picks an append-friendly format; otherwise you have to re-parse the whole file every time you need to append a new record and save the result. Plain-text CSV is likely the simplest option.
Another option, if you need an XML-like file, is to store a list of XML fragments instead of a full XML document. This way you can still use the XML API (XmlReader can read fragments when you specify ConformanceLevel.Fragment in the XmlReaderSettings passed to XmlReader.Create), but you don't need to re-read the whole document to append a new entry: a simple file-level append is enough. For example, WCF logs are written this way.
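A small sketch of the fragment-based variant, assuming one record element per line in the log file. The class name FragmentLog, its methods and the record layout are made up for the example.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;
using System.Xml.Linq;

public static class FragmentLog
{
    // Appends one <record> fragment to the end of the file. Nothing that is
    // already in the file is read or parsed.
    public static void Append(string path, string time, string temp)
    {
        var record = new XElement("record",
            new XElement("time", time),
            new XElement("temp", temp));
        File.AppendAllText(path, record.ToString(SaveOptions.DisableFormatting) + Environment.NewLine);
    }

    // Reads all fragments back. ConformanceLevel.Fragment lets the reader
    // accept a file with several top-level elements and no single root.
    public static List<XElement> ReadAll(string path)
    {
        var settings = new XmlReaderSettings { ConformanceLevel = ConformanceLevel.Fragment };
        var records = new List<XElement>();
        using (var reader = XmlReader.Create(path, settings))
        {
            reader.MoveToContent();
            while (!reader.EOF)
            {
                if (reader.NodeType == XmlNodeType.Element)
                    records.Add((XElement)XNode.ReadFrom(reader)); // advances past the element
                else
                    reader.Read(); // skip whitespace between fragments
            }
        }
        return records;
    }
}
```

The trade-off is that the file is not a well-formed XML document on its own, so any consumer has to read it in fragment mode or wrap it in a root element first.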
The answer from @Miky Dinescu is one technique for doing this if your output must be an XML-formatted file. The reason your current code is slow is that you are asking it to completely load and reparse the entire XML file every single time you add another entry. Loading and parsing the XML file becomes more and more I/O-, memory- and CPU-intensive the bigger the file gets, so it doesn't take long before the overhead overwhelms any hardware when it must run within a very limited time frame. Otherwise you need to re-think your whole process; for example, you could simply buffer all the data in memory and write it out (flush) at a much more leisurely pace.
I made this work, however I do not believe that it is the "best practice" method.
I have another class in which I keep my XmlDocument alive at all times, and it tries to save every time data is added. If the save fails, it simply waits and saves the next time.
I would suggest that others look at Miky Dinescu's suggestion. I just felt that I was in too deep to change how the data is saved.
I'm in the very early stages of working on a tag editor for mp4 files and more specifically iTunes AAC ones. After doing some snooping around it seems that the file's structure is not as complicated as I first thought and is built in a sort of tree like the following
4 Bytes [Atom Length] 4 Bytes [Atom Name] X Bytes [Atom Data]
An atom's data is as large as the length indicates and can contain either data (information) or another atom. What I am trying to work out is how one determines whether the data is information or an actual atom. Any insight would be much appreciated.
After a lot of snooping around it seems the only way to determine if a node leads to data or another node is by knowing the data structure. As I am only interested in the tags contained the structure is pretty easy to figure out. All the tags are contained in the following hierarchy:
moov.udta.meta.ilst
When delving into the ilst node, each tag is represented as a child atom whose name determines what data it contains. As for the actual data, each child atom carries a child of its own, which contains the actual information and a flag indicating what sort of information it is (e.g. text or numbers), so all in all it looks something like this:
moov.udta.meta.ilst.[atom size][atom name].[data]
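Since the format itself does not say which atoms contain children, a walker has to carry its own list of known containers. The sketch below hard-codes the hierarchy given above and lists the dotted path of every atom it reaches. Several details are assumptions rather than things stated above: sizes are read as big-endian and include the 8-byte header, the meta atom is assumed to carry 4 extra bytes (version and flags) before its children, and the special size values 0 and 1 are not handled. The class name AtomWalker is made up.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class AtomWalker
{
    // Atoms known (by convention, not by anything in the file) to hold child atoms.
    static readonly HashSet<string> Containers = new HashSet<string> { "moov", "udta", "meta", "ilst" };

    // Atom names may contain bytes above 0x7F (e.g. 0xA9), so decode as Latin-1.
    static readonly Encoding Latin1 = Encoding.GetEncoding("iso-8859-1");

    public static List<string> ListPaths(byte[] data)
    {
        var paths = new List<string>();
        Walk(data, 0, data.Length, "", paths);
        return paths;
    }

    static void Walk(byte[] data, int start, int end, string parent, List<string> paths)
    {
        int pos = start;
        while (pos + 8 <= end)
        {
            // 4-byte big-endian length, then a 4-character name.
            long size = ((long)data[pos] << 24) | ((long)data[pos + 1] << 16)
                      | ((long)data[pos + 2] << 8) | data[pos + 3];
            if (size < 8 || pos + size > end) break; // malformed or unsupported size

            string name = Latin1.GetString(data, pos + 4, 4);
            string path = parent.Length == 0 ? name : parent + "." + name;
            paths.Add(path);

            // Children of ilst are the tag atoms, each wrapping a "data" atom.
            bool isTagAtom = parent.EndsWith(".ilst", StringComparison.Ordinal);
            if (Containers.Contains(name) || isTagAtom)
            {
                int childStart = pos + 8 + (name == "meta" ? 4 : 0);
                Walk(data, childStart, pos + (int)size, path, paths);
            }
            pos += (int)size;
        }
    }
}
```

For a real file you would read the headers from a stream and seek past atoms you do not descend into, rather than loading the whole file into a byte array.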
Of course this still leaves the issue with self made tags stored in the uuid atom node which companies like Sony use to add more information to the file. I would imagine that each child in the uuid stores its children in the same way ilst does but I can't be sure.