I have different XML files that I will need to read. I'm wondering if I should deserialize the files into custom objects or just read the data using XDocument objects and Linq-to-XML.
The files range in size from 1-2 KB to 3 MB+, and the different objects also range in complexity (some have attributes, some have children, some both, some neither).
I figure it would be easier to work with the objects as opposed to Linq-to-XML, but creating those objects would require some time up front. Are there any rules of thumb or suggestions about when to deserialize as opposed to Linq?
Thanks for any help!
It really depends on what you are doing with the data. If you are not using all of the information that is provided by the XML document, then a LINQ based approach is probably easiest. Think of taking an RSS feed, and only keeping track of the article dates, and nothing else. In this case using a deserialization technique doesn't really do anything for you.
If you are using just about every last bit of data in the XML document, and its structure reflects that of your object model, then certainly deserialize it. This is something that I do all of the time for things like settings files, and even simple file formats.
In your case it sounds like the XML already exists and was created by some external source, and you don't already have an object representation of the data in your code, so I would suggest a LINQ-based approach. Additionally, you mention a lot of variation in the files, so the flexibility of LINQ would again come in handy. That is a wild guess based on your description, though.
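For what it's worth, here is a minimal LINQ to XML sketch of the kind of thing I mean; the file, element and attribute names are invented, since I don't know your schema:

    using System;
    using System.Linq;
    using System.Xml.Linq;

    // Load the file and pull out only the bits you actually care about.
    XDocument doc = XDocument.Load("input.xml");
    var titles = from item in doc.Descendants("item")
                 where (string)item.Attribute("lang") == "en"
                 select (string)item.Element("title");

    foreach (string title in titles)
        Console.WriteLine(title);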
You could use the xsd.exe tool, which can generate those classes for you given an XML file:
C:\work>xsd test.xml
C:\work>xsd /classes test.xsd
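Once the classes exist, you just point XmlSerializer at the file. A rough sketch; the class name depends on the root element of your XML, so Catalog here is only a placeholder:

    using System.IO;
    using System.Xml.Serialization;

    var serializer = new XmlSerializer(typeof(Catalog));   // Catalog = class generated by xsd.exe
    using (var stream = File.OpenRead("test.xml"))
    {
        var catalog = (Catalog)serializer.Deserialize(stream);
    }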
There is no real rule of thumb. Personally I prefer working with strongly typed objects, unless the file sizes become large, in which case I switch to XmlReader.
I am using XSD.exe to convert a pretty complex XML Schema (XSD file) into C# classes. I am then using XmlSerializer to read the XML into memory and work with the data.
In the future, the XSD will change. So there will be a new version. I will have to create a new cs file with XSD.exe. But I still want to support the old versions of XML files as well.
What is the best way to go about this and support both the old and new versions of XML files? Obviously, the classes XSD.exe creates will have the same names. So I can't really just generate another cs file in parallel with XSD.exe.
Any ideas are welcome. Thanks in advance!
XML Data Binding has the advantage of enabling you to code against strongly typed classes rather than untyped nodes, but this can make versioning tricky.
Information about this can be found in the 'Schema Versioning' section of Liquid XML Data Binder 2021 - Getting Started documentation.
Data binding technologies (that convert XSD definitions into types in a strongly-typed programming language) are an absolute pain when the schema is large, complex, or changing. My strong advice would be, find a different approach. I've earned a lot of consulting money helping people dig themselves out of this hole.
Use technologies that are better at coping with change and variety. XSLT, XQuery, LINQ, or even DOM if you must. XSLT and XQuery come with schema-awareness as an option so you can get some of the benefits (having your program code checked against the schema) without the heavy price of rebuilding and retesting your application every time there's a change.
Thank you for your answers.
For now, I placed the code generated by XSD.exe into separate namespaces and had the generated classes derive from a base class.
This way, I can use either one class or the other for generating/reading the XML. It appears to be working for me right now, as the schema will not change without a new version; any changes will go into a new version.
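For anyone curious, the arrangement looks roughly like this; the /namespace switch on xsd.exe puts each generated set of classes into its own namespace, and the schema file names, version detection and the Document class are just placeholders:

    C:\work>xsd schema_v1.xsd /classes /namespace:MyApp.Schema.V1
    C:\work>xsd schema_v2.xsd /classes /namespace:MyApp.Schema.V2

    // Pick the generated class matching the file's version (how you detect the version is up to you).
    XmlSerializer serializer = version == 1
        ? new XmlSerializer(typeof(MyApp.Schema.V1.Document))
        : new XmlSerializer(typeof(MyApp.Schema.V2.Document));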
I'm saving a huge list of objects to a file and later deserializing them. The resulting XML file can be about 3 GB in size.
I want the deserialization to be super fast, so I tried all three approaches (XML, binary, compressed).
Obviously, deserializing a compressed file takes much longer than an XML one. But I also saw binary deserialization taking a lot more time than XML deserialization. Is that normal? Shouldn't both XML and binary take pretty much the same time to deserialize the object?
Also, what do you think would be the best option in terms of a good balance between file size and deserialization speed?
In this performance comparison between all sorts of serialization methods that come with .NET (BinaryFormatter, XmlSerializer, DataContractSerializer, etc.) and protobuf, the protobuf serializer seems to be way ahead of the serializers that come with .NET. The resulting size appears to be smaller as well. If the protobuf format is an option for you, I strongly recommend you have a look at it. :-)
Another option: if deserializing is slow, only deserialize the parts you really need. Create an index file that tells you the offsets of the objects you write to the data file, so you can quickly deserialize the objects you need in a random-access fashion.
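A sketch of that idea; the file names are made up, objects is whatever list you are saving, and it assumes a serializer that writes self-delimiting records (protobuf-net's SerializeWithLengthPrefix is shown here), since concatenated XML documents wouldn't seek-and-read as cleanly:

    using System.IO;
    using ProtoBuf;

    // Write the data file and a parallel index of byte offsets.
    using (var data = File.Create("objects.dat"))
    using (var index = new BinaryWriter(File.Create("objects.idx")))
    {
        foreach (var obj in objects)
        {
            index.Write(data.Position);   // remember where this record starts
            Serializer.SerializeWithLengthPrefix(data, obj, PrefixStyle.Base128);
        }
    }
    // Later: read the i-th offset from objects.idx, Seek() the data stream to it,
    // and call DeserializeWithLengthPrefix for just that one object.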
Customise your serialisation, either fully or by implementing ISerializable and then using binary (though custom XML may also be worth experimenting with). Don't serialise memoised fields, but just the key field their value is based upon. Look for other areas where you can reduce size by serialising enough information to build part of the graph, rather than the full representation of the graph.
Then use deflate with that.
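A minimal sketch of the deflate part, assuming your types are marked [Serializable] and you go the BinaryFormatter route (objects is whatever you're saving, the file name is invented):

    using System.IO;
    using System.IO.Compression;
    using System.Runtime.Serialization.Formatters.Binary;

    using (var file = File.Create("objects.bin.deflate"))
    using (var deflate = new DeflateStream(file, CompressionMode.Compress))
    {
        new BinaryFormatter().Serialize(deflate, objects);
    }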
In the thread What’s your favorite “programmer ignorance” pet peeve?, the following answer appears, with a large amount of upvotes:
Programmers who build XML using string concatenation.
My question is, why is building XML via string concatenation (such as a StringBuilder in C#) bad?
I've done this several times in the past, as it's sometimes the quickest way for me to get from point A to point B when it comes to the data structures/objects I'm working with. So far, I have come up with a few reasons why this isn't the greatest approach, but is there something I'm overlooking? Why should this be avoided?
Probably the biggest reason I can think of is that you need to escape your strings manually, and most new programmers (and even some experienced programmers) will forget this. It will work great for them when they test it, but then "randomly" their apps will fail when someone throws an & symbol into their input somewhere. OK, I'll buy this, but it's really easy to prevent the problem (SecurityElement.Escape, to name one way).
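For example, that helper is a one-liner to use (the string is just made up to show the escaping):

    // SecurityElement.Escape lives in System.Security and handles &, <, >, " and '.
    string safe = System.Security.SecurityElement.Escape("Fish & Chips <cheap>");
    // safe == "Fish &amp; Chips &lt;cheap&gt;"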
When I do this, I usually omit the XML declaration (i.e. <?xml version="1.0"?>). Is this harmful?
Performance penalties? If you stick with proper string concatenation (i.e. StringBuilder), is this anything to be concerned about? Presumably, a class like XmlWriter will also need to do a bit of string manipulation...
There are more elegant ways of generating XML, such as using XmlSerializer to automatically serialize/deserialize your classes. Ok sure, I agree. C# has a ton of useful classes for this, but sometimes I don't want to make a class for something really quick, like writing out a log file or something. Is this just me being lazy? If I am doing something "real" this is my preferred approach for dealing w/ XML.
You can end up with invalid XML, but you will not find out until you parse it again - and then it is too late. I learned this the hard way.
I think readability, flexibility and scalability are important factors. Consider the following piece of Linq-to-Xml:
    XDocument doc = new XDocument(new XDeclaration("1.0", "UTF-8", "yes"),
        new XElement("products",
            from p in collection
            select new XElement("product",
                new XAttribute("guid", p.ProductId),
                new XAttribute("title", p.Title),
                new XAttribute("version", p.Version))));
Can you find an easier way to do it than this? I can output it to a browser, save it to a document, add attributes/elements in seconds, and so on, just by adding a couple of lines of code. I can do practically anything with it without much effort.
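And, for instance, persisting or printing the result really is one more line:

    doc.Save("products.xml");        // write it to disk
    string xml = doc.ToString();     // or grab the markup as a string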
Actually, I find the biggest problem with string concatenation is not getting it right the first time, but rather keeping it right during code maintenance. All too often, a perfectly-written piece of XML using string concat is updated to meet a new requirement, and string concat code is just too brittle.
As long as the alternatives were XML serialization and XmlDocument, I could see the simplicity argument in favor of string concat. However, ever since XDocument et al., there is just no reason to use string concat to build XML anymore. See Sander's answer for the best way to write XML.
Another benefit of XDocument is that XML is actually a rather complex standard, and most programmers simply do not understand it. I'm currently dealing with a person who sends me "XML", complete with unquoted attribute values, missing end tags, improper case sensitivity, and incorrect escaping. But because IE accepts it (as HTML), it must be right! Sigh... Anyway, the point is that string concatenation lets you write anything, but XDocument will force standards-complying XML.
I wrote a blog entry back in 2006 moaning about XML generated by string concatenation; the simple point is that if an XML document fails to validate (encoding issues, namespace issues and so on) it is not XML and cannot be treated as such.
I have seen multiple problems with XML documents that can be directly attributed to generating XML documents by hand using string concatenation, and nearly always around the correct use of encoding.
Ask yourself this; what character set am I currently encoding my document with ('ascii7', 'ibm850', 'iso-8859-1' etc)? What will happen if I write a UTF-16 string value into an XML document that has been manually declared as 'ibm850'?
Given the richness of the XML support in .NET with XmlDocument and now especially with XDocument, there would have to be a seriously compelling argument for not using these libraries over basic string concatenation IMHO.
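To make the encoding point concrete: if you let XmlWriter own the declaration, the declared encoding and the bytes actually written stay in sync. A small sketch; the file name and content are invented:

    using System.Text;
    using System.Xml;

    var settings = new XmlWriterSettings { Encoding = Encoding.GetEncoding("iso-8859-1") };
    using (var writer = XmlWriter.Create("note.xml", settings))
    {
        writer.WriteStartElement("note");
        writer.WriteString("crème fraîche & more");   // escaped and transcoded for us
        writer.WriteEndElement();
    }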
I think the problem is that you aren't looking at the XML file as a logical data store, but as a simple text file into which you write strings.
It's obvious that those libraries do string manipulation for you, but reading/writing XML should be treated more like saving data to a database, or something logically similar.
If you need trivial XML then it's fine. It's just that the maintainability of string concatenation breaks down as the XML becomes larger or more complex. You pay either at development time or at maintenance time. The choice is always yours, but history suggests that maintenance is the more costly of the two, so anything that makes it easier is generally worthwhile.
You need to escape your strings manually. That's right. But is that all? Sure, you can put the XML spec on your desk and double-check every time that you've considered every possible corner-case when you're building an XML string. Or you can use a library that encapsulates this knowledge...
Another point against using string concatenation is that the hierarchical structure of the data is not clear when reading the code. In Sander's LINQ to XML example above, for instance, it's clear which parent element the "product" element belongs to, which element the "title" attribute applies to, etc.
As you said, it's just awkward to build XML correctly using string concatenation, especially now that you have LINQ to XML, which allows for simple construction of an XML graph and will get namespaces, etc., correct.
Obviously context and how it is being used matter; in the logging example, for instance, string.Format can be perfectly acceptable.
But too often people ignore these alternatives when working with complex XML graphs and just use a StringBuilder.
The main reason is DRY: Don't Repeat Yourself.
If you use string concat to do XML, you will constantly be repeating the functions that keep your string as a valid XML document. All the validation would be repeated, or not present. Better to rely on a class that is written with XML validation included.
I've always found creating an XML file to be more of a chore than reading one in. I've never gotten the hang of serialization - it never seems to work for my classes - and instead of spending a week trying to get it to work, I can create an XML file using strings in a mere fraction of the time and write it out.
And then I load it back in using an XmlReader. If the XML file doesn't read as valid, I go back, find the problem in my saving routines, and correct it. Until I have a working save/load system, I refuse to perform mission-critical work, because I need to know my tools are solid.
I guess it comes down to programmer preference. Sure, there are different ways of doing things, but for developing/testing/researching/debugging, this would be fine. However, I would also clean up my code and comment it before handing it off to another programmer.
Because regardless of whether you're using StringBuilder or XmlNode objects to save/read your file, if it is all a gibberish mess, nobody is going to understand how it works.
Maybe it won't ever happen, but what if your environment switches to XML 2.0 someday? Your string-concatenated XML may or may not be valid in the new environment, but XDocument will almost certainly do the right thing.
Okay, that's a reach, but especially if your not-quite-standards-compliant XML doesn't specify an XML version declaration... just saying.
I'm currently doing XML serialization; however, it is very slow. I'm looking for a way to save/load information from a file very quickly, and I'm not really interested in how it looks on disk (if anything, I want it to be obscured, since I don't want manual editing).
I'm thinking of a binary format; however, I'm not sure whether it would be able to serialize properties that may be of a custom type, etc.
Any ideas?
You can try using SQLite. It is very fast, and will give you a complete database implementation with SQL queries on a file.
If you are thinking of trying binary formats, I suggest you try this first.
It can also be used with an ORM, and the file can be compressed and encrypted.
What exactly is the data?
With XML, the obvious answer would be to use something like GZipStream to compress it - making it smaller and obscure. You could use BinaryFormatter, but it is brittle and IMO unsuitable for long-term storage. I would say "protocol buffers" (maybe protobuf-net), but it depends on what the "custom data" is. However, if you are using XmlSerializer at the moment, protobuf-net may work virtually without changes (maybe add a few attributes) - and it is (in every case I've seen to date) both smaller and faster than BinaryFormatter.
Here's the steep learning curve (see also: Getting Started):
    [ProtoContract]
    public class Person {
        [ProtoMember(1)]
        public int Id { get; set; }
        [ProtoMember(2)]
        public string Name { get; set; }
        //...
    }
To be fair, it can get a little trickier if you are using inheritance - not much though. In many cases you can actually use your existing attributes - it'll work with xml / wcf attributes if an explicit element order is included.
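And the calls to actually write it out and read it back are just as small; person is an instance of the class above and the file name is made up:

    using System.IO;
    using ProtoBuf;

    using (var file = File.Create("person.bin"))
    {
        Serializer.Serialize(file, person);
    }
    using (var file = File.OpenRead("person.bin"))
    {
        Person clone = Serializer.Deserialize<Person>(file);
    }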
Binary serialization certainly works with properties of Custom Types and typically produces smaller files than XML serialization. It's certainly an approach you should consider if file size is an important factor for your situation.
I agree with Am about using an embedded database like SQLite. It comes with significant benefits. The ability to layer an ORM on top of it is probably the most significant.
XML Serialization is handy, particularly when you need to be able to edit the XML by hand or process it with other XML tools like XSLT etc, but it also has some unavoidable performance problems. One important technique when using XML Serialization in .Net is to cache the XML Serializers. Or to have them created by sgen on build.
The reason to cache the XmlSerializer is that the .NET runtime will automatically generate, compile, and load an assembly containing a serializer if it can't find one in an already loaded assembly. This process can be really slow. Constructing a new XmlSerializer instance can also be quite slow, which is why you should cache it. Be careful when caching the serializer, though, as different XmlSerializer constructors can produce different serializer implementations which behave differently, particularly with respect to namespaces, etc.
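A common way to do that is just a static cache keyed by type. A minimal sketch, only meant for serializers created with the simple XmlSerializer(Type) constructor, for the reason noted above:

    using System;
    using System.Collections.Concurrent;
    using System.Xml.Serialization;

    public static class SerializerCache
    {
        private static readonly ConcurrentDictionary<Type, XmlSerializer> Cache =
            new ConcurrentDictionary<Type, XmlSerializer>();

        // XmlSerializer instances are thread-safe for Serialize/Deserialize, so sharing them is fine.
        public static XmlSerializer For(Type type)
        {
            return Cache.GetOrAdd(type, t => new XmlSerializer(t));
        }
    }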
Then of course there are the usual performance implications of parsing a lot of text. Unfortunately that isn't easy to avoid with XML.
One of the reasons SQLite is a better choice than XML is related to the fact that it is, at its core, a fixed length record storage system. Any binary file with fixed length records is going to be fast to read, index and scan. Fixed block size file formats are almost always screamingly fast to read and write. I would recommend implementing one at some point for your own education.
If you still want a text based format (for ease of interoperability) and don't need the benefits of an ORM then consider using the FileHelpers library.
I need to do some processing on fairly large XML files (large here being potentially upwards of a gigabyte) in C#, including performing some complex XPath queries. The problem I have is that the standard way I would normally do this through the System.Xml libraries likes to load the whole file into memory before it does anything with it, which can cause memory problems with files of this size.
I don't need to update the files at all, just read them and query the data contained in them. Some of the XPath queries are quite involved and go across several levels of parent-child relationships - I'm not sure whether this will affect the ability to use a stream reader rather than loading the data into memory as a block.
One way I can see of making it work is to perform the simple analysis using a stream-based approach and perhaps wrapping the XPath statements into XSLT transformations that I could run across the files afterward, although it seems a little convoluted.
Alternatively, I know that there are some elements that the XPath queries will not run across, so I guess I could break the document up into a series of smaller fragments based on its original tree structure, which could perhaps be small enough to process in memory without causing too much havoc.
I've tried to explain my objective here so if I'm barking up totally the wrong tree in terms of general approach I'm sure you folks can set me right...
XPathReader is the answer. It isn't part of the C# runtime, but it is available for download from Microsoft. Here is an MSDN article.
If you construct an XPathReader with an XmlTextReader you get the efficiency of a streaming read with the convenience of XPath expressions.
I haven't used it on gigabyte sized files, but I have used it on files that are tens of megabytes, which is usually enough to slow down DOM based solutions.
Quoting from the download page below: "The XPathReader provides the ability to perform XPath over XML documents in a streaming manner."
Download from Microsoft
Gigabyte XML files! I don't envy you this task.
Is there any way the files could be sent to you in a better format? E.g. are they being sent over the net? If they are, then a more efficient format might be better for all concerned. Reading the file into a database isn't a bad idea, but it could be very time-consuming indeed.
I wouldn't try to do it all in memory by reading the entire file - unless you have a 64-bit OS and lots of memory. What if the file becomes 2, 3, or 4 GB?
One other approach could be to read in the XML file and use SAX to parse the file and write out smaller XML files according to some logical split. You could then process these with XPath. I've used XPath on 20-30MB files and it is very quick. I was originally going to use SAX but thought I would give XPath a go and was surprised how quick it was. I saved a lot of development time and probably only lost 250ms per query. I was using Java for my parsing but I suspect there would be little difference in .NET.
I did read that XML::Twig (A Perl CPAN module) was written explicitly to handle SAX based XPath parsing. Can you use a different language?
This might also help https://web.archive.org/web/1/http://articles.techrepublic%2ecom%2ecom/5100-10878_11-1044772.html
http://msdn.microsoft.com/en-us/library/bb387013.aspx has a relevant example leveraging XStreamingElement.
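In the spirit of that example, here is a rough sketch of combining a streaming XmlReader input with XStreamingElement output; the element name, attribute and file names are placeholders for whatever your documents actually contain:

    using System;
    using System.Collections.Generic;
    using System.Linq;
    using System.Xml;
    using System.Xml.Linq;

    static IEnumerable<XElement> StreamItems(string path)
    {
        using (XmlReader reader = XmlReader.Create(path))
        {
            reader.MoveToContent();
            while (!reader.EOF)
            {
                if (reader.NodeType == XmlNodeType.Element && reader.Name == "item")
                    yield return (XElement)XNode.ReadFrom(reader);   // consumes the element and advances the reader
                else
                    reader.Read();
            }
        }
    }

    // Neither the input nor the output document is ever fully held in memory.
    var filtered = new XStreamingElement("filtered",
        from item in StreamItems("huge.xml")
        where (string)item.Attribute("type") == "wanted"
        select item);
    filtered.Save("filtered.xml");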
In order to perform XPath queries with the standard .NET classes, the whole document tree needs to be loaded into memory, which might not be a good idea if the file can be up to a gigabyte. IMHO the XmlReader is a nice class for handling such tasks.
It seems that you have already tried XPathDocument and could not accommodate the parsed XML document in memory.
If this is the case, before starting to split the file (which is ultimately the right decision!) you may try using the Saxon XSLT/XQuery processor. It has a very efficient in-memory representation of a loaded XML document (the "tinytree" model). In addition, Saxon SA (the schema-aware version, which isn't free) has some streaming extensions. Read more about this here.
How about just reading the whole thing into a database and then working with the temp database? That might be better, because then your queries can be done more efficiently using T-SQL.
I think the best solution is to write your own XML parser that can read small chunks, not the whole file, or you can split the large file into smaller files and use the .NET classes on those.
The problem is that you cannot parse some of the data until all of it is available, which is why I recommend using your own parser rather than the .NET classes.
Have you tried XPathDocument?
This class is optimized for handling XPath queries efficiently.
If you cannot handle your input documents efficiently using XPathDocument you might consider preprocessing and/or splitting up your input documents using an XmlReader.
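For reference, it is only a couple of lines to try; the file name and XPath expression here are placeholders:

    using System;
    using System.Xml.XPath;

    var doc = new XPathDocument("input.xml");
    XPathNavigator nav = doc.CreateNavigator();
    foreach (XPathNavigator node in nav.Select("/catalog/book[@lang='en']/title"))
        Console.WriteLine(node.Value);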
You've outlined your choices already.
Either you need to abandon XPath and use XmlTextReader, or you need to break the document up into manageable chunks on which you can use XPath.
If you choose the latter, use XPathDocument; its read-only restriction allows better use of memory.
Since in your case the data size can run into gigabytes, have you considered using ADO.NET with XML as a database? On top of that, the memory footprint would not be huge.
Another approach would be to use LINQ to XML with streaming elements like XStreamingElement. Hope this helps.