I would like to write a function to compare two XML files. Based on the cases I've seen, a diff could be due to:
- the order of the nodes, and
- nodes or attributes having been added or removed.
I have found a way to traverse the XML by using a recursive function or LINQ to XML, so basically I can get all nodes and attributes. I have read about the XML Diff and Patch Tool, but I'm trying to avoid dependencies in my project. An added complexity is determining which line the diff occurred on, but this is optional for now.
I'm currently thinking of storing the nodes and attributes of the two XML files in a data structure (e.g. a dictionary) and comparing the dictionaries later, but I'm not quite sure how to do this. Can you share some ideas?
There is XNode.DeepEquals (https://msdn.microsoft.com/en-us/library/system.xml.linq.xnode.deepequals%28v=vs.110%29.aspx), which lets you do XNode.DeepEquals(XDocument.Load("file1.xml"), XDocument.Load("file2.xml")), but that will only give you a boolean result, not an indication of where the difference is.
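To flesh out the dictionary idea from the question: below is a minimal sketch (not a library API; XmlIndex and Diff are names I made up) that keys each element by its path from the root and flattens its attributes into the value, so differences are reported independently of node order. Repeated element names collapse onto one key here, which real code would have to disambiguate (e.g. by a key attribute).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

static class XmlCompare
{
    // Flatten a document into path -> "text|attributes" pairs so two
    // files can be compared regardless of the order of their nodes.
    static Dictionary<string, string> XmlIndex(XDocument doc)
    {
        var index = new Dictionary<string, string>();
        foreach (var e in doc.Descendants())
        {
            string path = string.Join("/",
                e.AncestorsAndSelf().Reverse().Select(a => a.Name.LocalName));
            string attrs = string.Join(" ",
                e.Attributes()
                 .OrderBy(a => a.Name.LocalName)
                 .Select(a => $"{a.Name.LocalName}={a.Value}"));
            // Last write wins for repeated paths; a real diff would need
            // a disambiguation strategy here.
            index[path] = $"{e.Value}|{attrs}";
        }
        return index;
    }

    public static void Diff(string file1, string file2)
    {
        var left = XmlIndex(XDocument.Load(file1));
        var right = XmlIndex(XDocument.Load(file2));
        foreach (var key in left.Keys.Union(right.Keys))
        {
            left.TryGetValue(key, out var l);
            right.TryGetValue(key, out var r);
            if (l != r)
                Console.WriteLine($"{key}: '{l ?? "<missing>"}' vs '{r ?? "<missing>"}'");
        }
    }
}
```

XNode.DeepEquals still makes a cheap first pass: if it returns true, you can skip the indexing entirely.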
Related
I have an XML document that contains tags with a numerical attribute, and they are sorted by this attribute. Is it worth finding these tags through binary search? Or is access to the nodes not constant-time, so it would be better to copy the data into an array? And is this implemented in any XML libraries?
It depends on how you are planning to read the document. There are two ways to handle XML documents:
The first, implemented by XmlReader, is stream-oriented: it processes one element at a time, and nothing more is stored in program memory.
The second follows the Document Object Model: it loads the entire document into memory and allows you to query it without going back to the file. The best way to do that in .NET is LINQ to XML.
Depending on the size of your document, one or the other may be better, but you have to be aware that anything other than a linear scan is not possible with a stream-oriented API.
And as far as I know, you can't get binary search with LINQ to XML out of the box, because it works through IEnumerable. You'd have to copy your elements into an array and then run the binary search on that array; definitely not a difficult task (see the sketch below).
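For illustration, a sketch of that approach, assuming elements named item sorted by an id attribute (both names invented for the example):

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

var doc = XDocument.Load("data.xml");

// Materialize the (already sorted) elements once, plus a parallel
// array of their numeric keys.
XElement[] items = doc.Descendants("item").ToArray();
int[] keys = items.Select(e => (int)e.Attribute("id")).ToArray();

// Array.BinarySearch returns the index when found, a negative value
// otherwise; this relies on the file really being sorted by "id".
int index = Array.BinarySearch(keys, 42);
Console.WriteLine(index >= 0 ? items[index].ToString() : "not found");
```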
In my project I am using an XML file for data storage. I access that file with LINQ to XML queries. I actually created the XML file from my SQL Server database, but as the tables in SQL contained more than 50 columns, the resulting XML file also has more than 50 elements...
When applying queries, I first load the XML file into an XDocument object and then run the queries against that.
My main problem is that, as the document contains more than 50 elements, it is very difficult to write queries without IntelliSense support. Why is IntelliSense not supported here? What have I done wrong? What can I do to get IntelliSense support?
LINQ to XML is based on strings and isn't confined to documents that follow some schema. That's the reason you don't get IntelliSense: VS has no information about the schema.
If this is really important to you, using something like xsd.exe to generate classes that represent the schema might work better for you.
It's not possible to get IntelliSense for LINQ to XML.
This is because you load a file at runtime and expect compile-time IntelliSense for it. If you loaded a different file at runtime, would you then get a compile-time error?
What you could do is generate classes from your XML file and then deserialize the XML file into those classes. Then you can use LINQ to Objects to access the data (see the sketch below).
Here is some documentation for creating your classes.
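For illustration, a sketch of that workflow. The Records/Record classes below stand in for what xsd.exe would generate from your schema (run xsd.exe data.xml, then xsd.exe data.xsd /classes); all names and the file path are placeholders:

```csharp
using System;
using System.IO;
using System.Linq;
using System.Xml.Serialization;

[XmlRoot("Records")]
public class Records
{
    [XmlElement("Record")]
    public Record[] Items { get; set; }
}

public class Record
{
    [XmlElement("Name")]
    public string Name { get; set; }

    [XmlElement("Amount")]
    public decimal Amount { get; set; }
}

class Program
{
    static void Main()
    {
        // Deserialize the XML file into the generated classes once...
        var serializer = new XmlSerializer(typeof(Records));
        using (var stream = File.OpenRead("data.xml"))
        {
            var records = (Records)serializer.Deserialize(stream);

            // ...then query with LINQ to Objects, with full IntelliSense
            // on the strongly-typed properties.
            var names = records.Items
                               .Where(r => r.Amount > 100m)
                               .Select(r => r.Name);
            foreach (var name in names)
                Console.WriteLine(name);
        }
    }
}
```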
I have a little utility that runs through looking for certain things in XML files using LINQ. It processes a MASSIVE collection of them rather quickly and nicely. However, about 20% of a certain batch of files fail to be read and are skipped, failing because of the degree symbol's presence as &deg; in the files. This is the "Reference to undeclared entity 'deg'." error that a previous question was about.
The solutions offered in the previous question cannot be directly applied here. I am not at liberty to go around modifying the files, and making copies of them with the instances replaced or declarations inserted seems inefficient. What would be the best way to get LINQ to ignore the undeclared entities, which have absolutely no bearing on what my program does anyway? Or is there perhaps a good way of feeding XDocument.Load some entity declarations beforehand?
Unfortunately, entities form part of the well-formedness rules for XML (2.1 Well-Formed XML Documents). It seems you want XDocument.Load to load what is notionally an XML file but does not in fact conform to those rules, which it won't do, quite reasonably.
If your users are passing you what are supposed to be XML files but that have undefined entities, then either you have to get them to provide the files in a valid format, or you have to manage the incorrectness yourself at load time, in the ways that have been suggested.
It seems to me, given your restrictions, that the neatest approach would be to follow the linked example and create settings to pass into the XmlReader, along the lines of (Validating an XML Document in the DOM).
If there are entities which aren't defined and aren't listed in public DTDs, you'll need to supply your own DTD that declares all the entities you need. So, create generic settings for the XmlReader that reference your own, custom DTD. Add the necessary entity declarations as individual files fail to load, and you'll build up a list of all the entities that need to be defined for the XML files to be valid.
Then, for each document you try to load, create an XmlReader for the file using the settings above and call the XDocument.Load(XmlReader) overload.
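One way I've seen to supply such declarations without touching the files on disk is an XmlParserContext with an internal DTD subset. A sketch, assuming the files carry no DOCTYPE of their own (the doctype name and file path are placeholders):

```csharp
using System.Xml;
using System.Xml.Linq;

static XDocument LoadWithEntities(string path)
{
    // Allow the injected internal subset to be processed.
    var settings = new XmlReaderSettings { DtdProcessing = DtdProcessing.Parse };

    // Hand the reader a synthetic DOCTYPE that declares the missing
    // entities; extend the internal subset as more files fail to load.
    var context = new XmlParserContext(null, null, null, XmlSpace.Default)
    {
        DocTypeName = "doc",                         // placeholder name
        InternalSubset = "<!ENTITY deg \"&#176;\">"  // &deg; -> degree sign
    };

    using (var reader = XmlReader.Create(path, settings, context))
        return XDocument.Load(reader);
}
```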
I'm looking for some advice on how I should go about a solution. I have an import to write using C#. The data comes from an XML file containing ~30,000 records, each with ~10 nodes of different data. My initial thought was to build a node list of record ids (one of the nodes is a unique id), then loop through the node list and use XPath to get the rest of the data for each record. My other thought was to convert the XML file into .csv format and read it that way. Before I dive head first into one or the other: any advice, pros/cons, or suggestions? Thanks in advance
Go with whichever you feel more comfortable with.
Personally, I would use XDocument and LINQ to XML to query the XML directly.
Transforming to CSV has its own pitfalls if you don't adhere to the rules (quoting fields, line breaks within fields, etc.).
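For what it's worth, a minimal sketch of the direct-query route, assuming each of the ~30,000 records looks like a record element with an id attribute and child nodes (all names invented):

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

// Load once, then project each record into a small shape; no XPath
// or CSV round-trip needed.
var doc = XDocument.Load("import.xml");

var records = doc.Descendants("record")
                 .Select(r => new
                 {
                     Id = (string)r.Attribute("id"),
                     Name = (string)r.Element("name")
                 });

foreach (var rec in records)
    Console.WriteLine($"{rec.Id}: {rec.Name}");
```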
I agree with the above poster that you want to use LINQ to XML if possible. However, if you are on an older version of the framework, you could use an XmlDocument and the SelectNodes/SelectSingleNode methods. If you do that, make sure you use an XmlNamespaceManager, or those methods won't return anything unless your XML has no namespaces.
That got me a bunch of times.
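A sketch of that XmlDocument route with the namespace manager wired up; the prefix, namespace URI, and element names are placeholders:

```csharp
using System;
using System.Xml;

var doc = new XmlDocument();
doc.Load("import.xml");

// XPath has no notion of a default namespace, so register the
// document's namespace under a prefix of our own; without this,
// SelectNodes below would silently return an empty list.
var ns = new XmlNamespaceManager(doc.NameTable);
ns.AddNamespace("x", "http://example.com/import"); // placeholder URI

XmlNodeList records = doc.SelectNodes("//x:record", ns);
foreach (XmlNode record in records)
    Console.WriteLine(record.Attributes?["id"]?.Value);
```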
I have an XML file of about 500 MB, and I'm using LINQ with C# to query that file, but it's very slow because it loads everything into memory. Is there any way I can query that file without loading it all into memory?
Thanks
This article should get you up and running. Take a look at the SimpleStreamAxis method, which is very handy for finding nodes in large XML files. I've successfully used a variant of this method on 5GB XML files without loading the file into memory.
You can use the technique described on MSDN's page about XNode.ReadFrom to generate an IEnumerable of XNodes (in the example they provide, XElements) from an XmlReader.
Note that when you read an XElement from a Stream or XmlReader, the entire contents of that element must be read too, so you'll still need a little custom logic in the iterator to ensure that the right XElements get returned. For instance, if you return the root element, you might as well just parse the entire document right away, since the root element contains almost everything anyhow. The XNode.ReadFrom example contains such logic too; a sketch follows.
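Along the lines of that MSDN example, here is a hedged sketch of such a streaming axis; "record" stands in for whatever element you're after. One subtlety: XNode.ReadFrom leaves the reader positioned after the element it consumed, so the loop must not call Read() again in that branch, or it would skip an adjacent sibling.

```csharp
using System.Collections.Generic;
using System.Xml;
using System.Xml.Linq;

static IEnumerable<XElement> StreamElements(string path, string elementName)
{
    using (var reader = XmlReader.Create(path))
    {
        reader.MoveToContent();
        while (!reader.EOF)
        {
            if (reader.NodeType == XmlNodeType.Element && reader.Name == elementName)
            {
                // ReadFrom materializes just this element and advances
                // the reader past it, so only one element is in memory.
                if (XNode.ReadFrom(reader) is XElement el)
                    yield return el;
            }
            else
            {
                reader.Read();
            }
        }
    }
}
```

Consumed as foreach (var el in StreamElements("big.xml", "record")) { ... }, this keeps memory usage flat even on multi-gigabyte files.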
No, it's not possible when using LINQ alone. LINQ to XML loads a model of the full XML into memory so that you can access it through the tree structure.
If you want fast access without loading the file into memory, you could use the XmlReader class.
This class gives you a fast, forward-only XML parser that keeps only the current node in memory.
Here is some help on that: http://support.microsoft.com/kb/307548
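For completeness, a minimal forward-only pass with XmlReader; the element and attribute names ("record", "id") are invented for the example:

```csharp
using System;
using System.Xml;

using (var reader = XmlReader.Create("big.xml"))
{
    // ReadToFollowing jumps from match to match without building a
    // tree; only the reader's current position is held in memory.
    while (reader.ReadToFollowing("record"))
    {
        Console.WriteLine(reader.GetAttribute("id"));
    }
}
```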
Edit: Sorry, I didn't know that it's possible to combine XmlReader with LINQ.