Microsoft warns against using BinaryFormatter (they write that there is no way to make the de-serialization safe).
Applications should stop using BinaryFormatter as soon as possible,
even if they believe the data they're processing to be trustworthy.
I don't want to use XML or Json-based solutions (which are what they refer to). I am concerned about file size and preserving the object graph.
If I were to write my own methods to traverse my object graph and convert the objects to binary, could that be made safe? Or is there something specific about converting from binary that makes it inherently more dangerous than text?
Are there binary (non-XML and non-JSON) alternatives to BinaryFormatter?
This question feels like it leads to answers that will be more opinion-based.
I'm sure there are a lot of libraries out there, but perhaps the best known alternative is Protocol Buffers (protobuf). It's a Google library, so it gets plenty of development and attention. However, not everyone agrees that using protobuf for generic binary serialization is the best thing to do.
Follow this discussion about BinaryFormatter in the dotnet GitHub repository if you want more info; it discusses the general problem with BinaryFormatter, as well as using protobuf as an alternative.
Can I create my own secure binary serialization system?
Yes. That said, the real question should be: 'is it worth my time to do so?'
See this link for the wind-down plan for BinaryFormatter:
https://github.com/dotnet/designs/pull/141/commits/bd0a0661f9d248ed31a354d27ad026efd6719690
At the very bottom you will find:
Why not make BinaryFormatter safe for untrusted payloads?
The BinaryFormatter protocol works by specifying the values of an
object's raw instance fields. In other words, the entire point of
BinaryFormatter is to bypass an object's typical constructor and to
use private reflection to set the instance fields to the contents that
came in over the wire. Bypassing the constructor in this fashion means
that the object cannot perform any validation or otherwise guarantee
that its internal invariants are satisfied. One consequence of this is
that BinaryFormatter is unsafe even for seemingly innocuous types
such as Exception or List<T> or Dictionary<TKey, TValue>,
regardless of the actual types of T, TKey, or TValue.
Restricting deserialization to a list of allowed types will not
resolve this issue.
The security issue isn't with binary serialization as a concept; the issue is with how BinaryFormatter was implemented.
You could design a secure binary deserialization system, if you wanted. If you have very few messages being sent, and you can tightly control which types are deserialized, perhaps it's not too much effort to make a secure system.
However, for a system flexible enough to handle many different use cases (e.g. many different types that can be deserialized), you may find that it takes a lot of effort to build in enough safety checks.
FWIW, you likely will never reach the performance levels of BinaryFormatter with a secure system that offers the same widespread utility (use cases), since BinaryFormatter's speed comes (in part) from having very few safety features. You might approach such performance levels with a targeted, small system with a narrow set of use cases.
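To give a sense of what a narrowly scoped, hand-rolled approach can look like, here is a minimal sketch (the Person type and field layout are made up for illustration) that writes and reads one known type with BinaryWriter/BinaryReader. Because the reader only ever constructs a type it already knows about, runs its normal constructors, and never reads type names from the payload, the incoming data cannot choose what gets instantiated:

using System.IO;
using System.Text;

public class Person
{
    public int Id { get; set; }
    public string Name { get; set; }
}

public static class PersonSerializer
{
    // The format is fixed by this code, not by the payload:
    // a 32-bit id followed by a length-prefixed string.
    public static void Serialize(Stream stream, Person person)
    {
        using var writer = new BinaryWriter(stream, Encoding.UTF8, leaveOpen: true);
        writer.Write(person.Id);
        writer.Write(person.Name ?? string.Empty);
    }

    public static Person Deserialize(Stream stream)
    {
        using var reader = new BinaryReader(stream, Encoding.UTF8, leaveOpen: true);
        // Only a Person is ever constructed; no type information comes over the wire.
        return new Person
        {
            Id = reader.ReadInt32(),
            Name = reader.ReadString()
        };
    }
}

The trade-off is obvious: every type and every field has to be handled by hand, which is exactly the effort the rest of this answer is warning about.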
Related
I am involved in an effort to replace System.Web.Caching.Cache with Redis. The issue I am running into is that while System.Web.Caching.Cache seems to be able to cache just about whatever object I pass to it, Redis is string-based. This means that I have to worry about serialization myself.
The two approaches I've tried are 1) to use JSON.NET to serialize my objects to a string, and 2) to use BinaryFormatter. The JSON.NET approach can probably be made to work, but of course there are many configuration points (around what to serialize/ignore, how to handle reference loops, etc.) and making this approach work has turned into a fair amount of work.
The BinaryFormatter approach, I had suspected, was probably closer to what System.Web.Caching.Cache was doing internally. Having gone down that path a bit, though, I am not so sure of that anymore. Many types I'm trying to cache are not marked [Serializable], which seems to rule out BinaryFormatter.
I am wondering if anyone else has faced similar issues, or knows what System.Web.Caching.Cache is doing internally so that I can emulate it. Thanks.
System.Web.Caching.Cache seems to be able to cache just about whatever object I pass to it. This means that I have to worry about serialization myself.
That is exactly because System.Web.Caching.Cache stores object references while the application is running.
So that I can emulate it
Unfortunately, you cannot. Redis is a remote service. When you take advantage of Redis (e.g. distributed caching, reducing the memory footprint of your local machine), you have to pay the price: handling serialization yourself.
The good news is that there are a couple of serialization libraries available:
Newtonsoft.Json
protobuf-net
MessagePack-CSharp
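For the Newtonsoft.Json route, a rough sketch (assuming StackExchange.Redis as the Redis client; the key and the MyCacheItem type are placeholders) of serializing to a string before handing the value to Redis might look like this:

using Newtonsoft.Json;
using StackExchange.Redis;

public class MyCacheItem
{
    public int Id { get; set; }
    public string Value { get; set; }
}

public static class RedisCacheDemo
{
    public static void Demo()
    {
        var redis = ConnectionMultiplexer.Connect("localhost");
        var db = redis.GetDatabase();

        // Serialize the object graph to a string before storing it.
        var item = new MyCacheItem { Id = 1, Value = "hello" };
        db.StringSet("cache:item:1", JsonConvert.SerializeObject(item));

        // Deserialize on the way back out.
        var roundTripped = JsonConvert.DeserializeObject<MyCacheItem>(db.StringGet("cache:item:1"));
    }
}

protobuf-net and MessagePack-CSharp slot into the same pattern, except they produce a byte[] rather than a string, which Redis accepts just as happily.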
I'm currently doing XML serialization; however, it is very slow. I'm looking for a way to save/load information to and from a file very quickly. I'm not really interested in how it looks on disk (if anything I want it to be obscured, as I don't want manual editing).
I'm thinking of a binary format; however, I am not sure if it would be able to serialize properties which may be of a custom type, etc.
Any ideas?
You can try using SQLite. It is very fast, and will give you a complete database implementation with SQL queries on a file.
If you are thinking of trying binary formats, I suggest you try this first.
It can also be used with an ORM, and it can be compressed and encrypted.
What exactly is the data?
With xml, the obvious answer would be to use something like GZipStream to compress it - making it smaller and obscure. You could use BinaryFormatter but it is brittle and IMO unsuitable for long-term storage. I would say "protocol buffers" (maybe protobuf-net), but it depends what the "custom data" is. But if you are using XmlSerializer at the moment, protobuf-net may work virtually without changes (maybe add a few attributes) - and it is (in every case I've seen to date) both smaller and faster than BinaryFormatter.
Here's the steep learning curve (see also: Getting Started):
[ProtoContract]
public class Person {
    [ProtoMember(1)]
    public int Id {get;set;}
    [ProtoMember(2)]
    public string Name {get;set;}
    //...
}
To be fair, it can get a little trickier if you are using inheritance - not much though. In many cases you can actually use your existing attributes - it'll work with xml / wcf attributes if an explicit element order is included.
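Round-tripping the type above is then a couple of static calls; a minimal sketch (the file name is arbitrary):

using System.IO;
using ProtoBuf;

class Demo {
    static void Main() {
        var person = new Person { Id = 12345, Name = "Fred" };

        // Write the object to disk in the protocol buffers wire format.
        using (var file = File.Create("person.bin"))
        {
            Serializer.Serialize(file, person);
        }

        // Read it back.
        Person clone;
        using (var file = File.OpenRead("person.bin"))
        {
            clone = Serializer.Deserialize<Person>(file);
        }
    }
}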
Binary serialization certainly works with properties of Custom Types and typically produces smaller files than XML serialization. It's certainly an approach you should consider if file size is an important factor for your situation.
I agree with Am about using an embedded database like SQLite. It comes with significant benefits. The ability to layer an ORM on top of it is probably the most significant.
XML Serialization is handy, particularly when you need to be able to edit the XML by hand or process it with other XML tools like XSLT etc, but it also has some unavoidable performance problems. One important technique when using XML Serialization in .Net is to cache the XML Serializers. Or to have them created by sgen on build.
The reason to cache the XML Serializer is related to the fact that the .Net runtime will automatically generate, compile and load an assembly containing a serializer if it can't find one in an already loaded assembly. This process can be really slow. Also, constructing a new XmlSerializer instance can be quite slow, hence why you should cache it. Be careful when caching the serializer though, as different XmlSerializer constructors can produce different serializer implementations which behave differently, particularly with respect to namespaces, etc.
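A minimal sketch of the caching idea (the Person type is just a stand-in for whatever you are serializing): keep one XmlSerializer per type instead of constructing one per call.

using System.IO;
using System.Xml.Serialization;

public static class PersonXmlStore
{
    // Created once and reused; repeated construction (especially with the
    // constructor overloads that take extra arguments) can cause serialization
    // assemblies to be generated and loaded again and again.
    private static readonly XmlSerializer Serializer = new XmlSerializer(typeof(Person));

    public static void Save(Stream stream, Person person) => Serializer.Serialize(stream, person);

    public static Person Load(Stream stream) => (Person)Serializer.Deserialize(stream);
}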
Then of course there is the usual performance implications of parsing a lot of text. Unfortunately that isn't easy to avoid with XML.
One of the reasons SQLite is a better choice than XML is related to the fact that it is, at its core, a fixed length record storage system. Any binary file with fixed length records is going to be fast to read, index and scan. Fixed block size file formats are almost always screamingly fast to read and write. I would recommend implementing one at some point for your own education.
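To make the fixed-length idea concrete, here is a tiny sketch (the record layout is invented for illustration): every record is exactly RecordSize bytes, so record N lives at offset N * RecordSize and can be read with a single seek.

using System;
using System.IO;
using System.Text;

public static class FixedRecordFile
{
    // Invented layout: a 4-byte int id plus a 60-byte name field = 64 bytes per record.
    private const int NameBytes = 60;
    private const int RecordSize = sizeof(int) + NameBytes;

    public static void Write(Stream stream, int index, int id, string name)
    {
        stream.Seek((long)index * RecordSize, SeekOrigin.Begin);
        var writer = new BinaryWriter(stream);
        writer.Write(id);
        // Naive fixed-width name field; assumes short, ASCII-ish names for illustration.
        var nameField = new byte[NameBytes];
        Encoding.UTF8.GetBytes(name, 0, Math.Min(name.Length, NameBytes), nameField, 0);
        writer.Write(nameField);
        writer.Flush();
    }

    public static (int Id, string Name) Read(Stream stream, int index)
    {
        stream.Seek((long)index * RecordSize, SeekOrigin.Begin);
        var reader = new BinaryReader(stream);
        int id = reader.ReadInt32();
        string name = Encoding.UTF8.GetString(reader.ReadBytes(NameBytes)).TrimEnd('\0');
        return (id, name);
    }
}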
If you still want a text based format (for ease of interoperability) and don't need the benefits of an ORM then consider using the FileHelpers library.
I have a library (written in C#) for which I need to read/write representations of my objects to disk (or to any Stream) in a particular binary format (to ensure compatibility with C/Java library implementations). The format requires a fair amount of bit-packing and some DEFLATE'd bytestreams. I would like my library to be as idiomatic .NET as possible, however, and so would like to provide an API as close as possible to the normal binary serialization process. I'm aware of the ability to implement the IFormatter interface, but given that I really am unable to reuse any part of the built-in serialization stack, is it worth doing this, or will it just bring unnecessary overhead? In other words:
Implement IFormatter and co.
OR
Just provide "Serialize"/"Deserialize" methods that act on a Stream?
A good point brought up below about needing the serialization semantics for any case involving Remoting. In a case where using MarshalByRef objects is feasible, I'm pretty sure this won't be an issue, so leaving that aside: are there any benefits or drawbacks to using ISerializable/IFormatter versus a custom stack (or am I understanding remoting incorrectly)?
I have always gone with the latter. There isn't much use in reusing the serialization framework if all you're doing is writing a file in a specific format. The only place I've run into any issues with a custom serialization framework is when remoting; there you have to make your objects serializable.
This may not help you since you have to write to a specific format, but protobuf and sqlite are good tools for doing custom serialization.
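For what the "just provide Serialize/Deserialize methods on a Stream" shape might look like, here is a sketch (the MyObject type and its field layout are made up; DeflateStream is shown only because the question mentions DEFLATE'd bytestreams):

using System.IO;
using System.IO.Compression;

public class MyObject
{
    public int Id { get; set; }
    public string Name { get; set; }
}

public static class MyFormatSerializer
{
    public static void Serialize(Stream output, MyObject value)
    {
        using var deflate = new DeflateStream(output, CompressionMode.Compress, leaveOpen: true);
        using var writer = new BinaryWriter(deflate);
        writer.Write(value.Id);
        writer.Write(value.Name ?? string.Empty);
    }

    public static MyObject Deserialize(Stream input)
    {
        using var deflate = new DeflateStream(input, CompressionMode.Decompress, leaveOpen: true);
        using var reader = new BinaryReader(deflate);
        return new MyObject
        {
            Id = reader.ReadInt32(),
            Name = reader.ReadString()
        };
    }
}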
I'd do the former. There's not much to the interface, and so if you're mimicking the structure anyway adding an ": IFormatter" and the other code necessary to get full compatibility won't take much.
Writing your own serialization code is error prone and time consuming.
As a thought - have you considered existing open-source portable formats, for example "protocol buffers"? This is a high density binary serialization format that underpins much of Google's data transfer etc. Versions are available in a wide range of languages - including Java/C++ etc (in the core Google distribution), and a vast range of others.
In particular, for .NET-idiomatic usage, protobuf-net looks a lot like XmlSerializer or DataContractSerializer (indeed, it can even work purely with xml/wcf attributes if it includes an order on each element) - or can use the specific protobuf-net attributes:
[ProtoContract]
class Person {
    [ProtoMember(1)]
    public string Name {get;set;}
}
If you want to guarantee portability to other implementations, the recommendation is to start "contract first", with a ".proto" file - in this case, something like:
message person {
    required string name = 1;
}
This .proto file can then be used to generate any language-specific variant; so with protobuf-net you'd run it through "protogen" (included in protobuf-net; and a VS2008 add-on is in progress); or for Java/C++ etc you'd run it through "protoc" (included in Google's protobuf). "protogen" in protobuf-net can currently emit C# and VB, but it is pretty easy to add another language if you want to use F# etc - it just involves writing (or migrating) an xslt.
There is also another .NET version that is a more direct port of the Java version; as such it is less .NET idiomatic. This is dotnet-protobufs.
Which serialization should I use?
I need to store a large Dictionary with 100,000+ elements, and I just need to save and load this data directly, without caring whether it's binary or how it's formatted.
Right now I am using the BinarySerializer, but I'm not sure if it's the most effective option.
Please suggest better alternatives in the .NET standard libraries or an external library, preferably free.
EDIT: This is to serialize to disk and from it. The app is single threaded too.
Well, it will depend on what's in the dictionary - but if Protocol Buffers is flexible enough for you (you have to define your own types to serialize - it doesn't do all .NET types or anything like that), it's pretty darned fast.
For example, in protocol buffers I'd represent the dictionary as a message with a repeated key/value pair field. For ultimate speed you could use the CodedOutputStream and CodedInputStream to serialize/deserialize the dictionary directly rather than reading it all into memory separately first. Again, it'll depend on what the key/value types are though.
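As an illustration of how little code the dictionary case can be, here is a sketch using protobuf-net rather than Jon's port, assuming a version that handles dictionaries of simple types directly (it writes them as repeated key/value pairs on the wire); the key/value types and file name are just examples:

using System.Collections.Generic;
using System.IO;
using ProtoBuf;

class DictionaryDemo
{
    static void Demo()
    {
        var data = new Dictionary<int, string> { { 1, "one" }, { 2, "two" } };

        // Written as repeated key/value pairs in the protocol buffers wire format.
        using (var file = File.Create("data.bin"))
        {
            Serializer.Serialize(file, data);
        }

        Dictionary<int, string> loaded;
        using (var file = File.OpenRead("data.bin"))
        {
            loaded = Serializer.Deserialize<Dictionary<int, string>>(file);
        }
    }
}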
This is entirely a guess since I haven't profiled it (profiling is what you should do to truly get your answer).
But my guess is that the binary serializer would give you the best performance. Both in size and speed.
This is a bit of an open-ended question. Are you storing this in memory or writing it to disk? Does this execute in a multi-threaded (and perhaps multi-concurrent-access) environment? Context is important.
BinarySerializer is generally going to be pretty fast, and there are external libs that provide better compression such as ProtoBuffers. I've personally had good success with DataContractSerializer.
The great thing about all these options is that you can try all of them (relatively pain free) to learn for yourself what works in your environment and operation.
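If you do want to measure rather than guess, a trivial harness along these lines is usually enough to compare size and speed on your own data (the serializeToStream delegate stands in for whichever serializer you are testing):

using System;
using System.Diagnostics;
using System.IO;

public static class SerializerBenchmark
{
    public static void Measure(string name, string path, Action<Stream> serializeToStream)
    {
        var stopwatch = Stopwatch.StartNew();
        using (var file = File.Create(path))
        {
            serializeToStream(file);
        }
        stopwatch.Stop();

        // Report elapsed time and the size of the file produced.
        Console.WriteLine("{0}: {1} ms, {2} bytes", name, stopwatch.ElapsedMilliseconds, new FileInfo(path).Length);
    }
}

For example: SerializerBenchmark.Measure("protobuf-net", "dict.bin", s => Serializer.Serialize(s, dictionary));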
Is Protocol Buffers for .NET going to be more lightweight/faster than Remoting (the SerializationFormat.Binary)? Will there be first-class support for it in language/framework terms? i.e. is it handled transparently, as with Remoting/WebServices?
I very much doubt that it will ever have direct language support or even framework support - it's the kind of thing which is handled perfectly well with 3rd party libraries.
My own port of the Java code is explicit - you have to call methods to serialize/deserialize. (There are RPC stubs which will automatically serialize/deserialize, but no RPC implementation yet.)
Marc Gravell's project fits in very nicely with WCF though - as far as I'm aware, you just need to tell it (once) to use protocol buffers for serialization, and the rest is transparent.
In terms of speed, you should look at Marc Gravell's benchmark page. My code tends to be slightly faster than his, but both are much, much faster than the other serialization/deserialization options in the framework. It should be pointed out that protocol buffers are much more limited as well - they don't try to serialize arbitrary types, only the supported ones. We're going to try to support more of the common data types (decimal, DateTime etc) in a portable way (as their own protocol buffer messages) in future.
Some performance and size metrics are on this page. I haven't got Jon's stats on there at the moment, just because the page is a little old (Jon: we must fix that!).
Re being transparent: protobuf-net can hook into WCF via the contract; note that it plays nicely with MTOM over basic-http too. This doesn't work with Silverlight, though, since Silverlight lacks the injection point. If you use svcutil, you also need to add an attribute to the class (via a partial class).
Re BinaryFormatter (remoting): yes, this has full support; you can do this simply via a trivial ISerializable implementation (i.e. just call the Serializer method with the same args). If you use protogen to create your classes, then it can do this for you: you can enable it at the command line via arguments (it isn't enabled by default, as BinaryFormatter doesn't work on all frameworks [CF, etc]).
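The "trivial ISerializable implementation" typically looks something like the following sketch, assuming the protobuf-net build in use exposes the SerializationInfo overloads of Serializer.Serialize/Merge:

using System;
using System.Runtime.Serialization;
using ProtoBuf;

[Serializable, ProtoContract]
public class Person : ISerializable
{
    [ProtoMember(1)]
    public int Id {get;set;}
    [ProtoMember(2)]
    public string Name {get;set;}

    public Person() { }

    // Deserialization constructor used by BinaryFormatter/remoting;
    // the payload it unpacks is the protobuf-encoded blob.
    protected Person(SerializationInfo info, StreamingContext context)
    {
        Serializer.Merge(info, this);
    }

    void ISerializable.GetObjectData(SerializationInfo info, StreamingContext context)
    {
        Serializer.Serialize(info, this);
    }
}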
Note that for very small objects (single instances, etc) on local remoting (IPC), the raw BinaryFormatter performance is actually better - but for non-trivial graphs or remote links (network remoting) protobuf-net can out-perform it pretty well.
I should also note that the protocol buffers wire format doesn't directly support inheritance; protobuf-net can spoof this (while retaining wire-compatibility), but like with XmlSerializer, you need to declare the sub-classes up-front.
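Declaring the sub-classes up-front is done with [ProtoInclude] on the base type; a minimal sketch (the tag numbers are arbitrary, as long as they don't clash with the base type's own members):

[ProtoContract]
[ProtoInclude(10, typeof(Student))]
public class Person {
    [ProtoMember(1)]
    public string Name {get;set;}
}

[ProtoContract]
public class Student : Person {
    [ProtoMember(1)]
    public string School {get;set;}
}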
Why are there two versions?
The joys of open source, I guess ;-p Jon and I have worked on joint projects before, and have discussed merging these two, but the fact is that they target two different scenarios:
dotnet-protobufs (Jon's) is a port of the existing java version. This means it has a very familiar API for anybody already using the java version, and it is built on typical java constructs (builder classes, immutable data classes, etc) - with a few C# twists.
protobuf-net (Marc's) is a ground-up re-implementation following the same binary format (indeed, a critical requirement is that you can interchange data between different formats), but using typical .NET idioms:
mutable data classes (no builders)
the serialization member specifics are expressed in attributes (comparable to XmlSerializer, DataContractSerializer, etc)
If you are working on java and .NET clients, Jon's is probably a good choice for the familiar API on both sides. If you are pure .NET, protobuf-net has advantages - the familiar .NET style API, but also:
you aren't forced to be contract-first (although you can, and a code-generator is supplied)
you can re-use your existing objects (in fact, [DataContract] and [XmlType] classes can often be used without any changes at all)
it has full support for inheritance (which it achieves on the wire by spoofing encapsulation) (possibly unique for a protocol buffers implementation? note that sub-classes have to be declared in advance)
it goes out of its way to plug into and exploit core .NET tools (BinaryFormatter, XmlSerializer, WCF, DataContractSerializer) - allowing it to work directly as a remoting engine. This would presumably be quite a big split from the main java trunk for Jon's port.
Re merging them; I think we'd both be open to it, but it seems unlikely you'd want both feature sets, since they target such different requirements.