protobuf-net and de/serializing objects - c#

I am checking whether protobuf-net can be a drop-in replacement for DataContracts. Besides the excellent performance, it is a really neat library. The only issue I have is that the .NET serializers make no assumptions about what they are currently de/serializing. Members that are typed as plain object are especially a problem.
[DataMember(Order = 3)]
public object Tag1 // The DataContract used to contain an object, which now becomes a SimulatedObject
{
    get;
    set;
}
I tried to mimic object with protocol buffers via a little generic helper which stores each possible type in a separate, strongly typed field.
Is this a recommended approach to deal with fields which de/serialize into a number of different, unrelated types?
Below is the sample code for a SimulatedObject which can hold up to 10 different types.
using System;
using System.Collections.Generic;
using System.IO;
using System.Runtime.Serialization;
using ProtoBuf;
using System.Diagnostics;
[DataContract]
public class SimulatedObject<T1, T2, T3, T4, T5, T6, T7, T8, T9, T10>
{
    [DataMember(Order = 20)]
    byte FieldHasValue; // the number indicates which field actually has a value

    [DataMember(Order = 1)]
    T1 I1;
    [DataMember(Order = 2)]
    T2 I2;
    [DataMember(Order = 3)]
    T3 I3;
    [DataMember(Order = 4)]
    T4 I4;
    [DataMember(Order = 5)]
    T5 I5;
    [DataMember(Order = 6)]
    T6 I6;
    [DataMember(Order = 7)]
    T7 I7;
    [DataMember(Order = 8)]
    T8 I8;
    [DataMember(Order = 9)]
    T9 I9;
    [DataMember(Order = 10)]
    T10 I10;

    public object Data
    {
        get
        {
            switch (FieldHasValue)
            {
                case 0: return null;
                case 1: return I1;
                case 2: return I2;
                case 3: return I3;
                case 4: return I4;
                case 5: return I5;
                case 6: return I6;
                case 7: return I7;
                case 8: return I8;
                case 9: return I9;
                case 10: return I10;
                default:
                    throw new NotSupportedException(String.Format("The FieldHasValue field has an invalid value {0}. This indicates corrupt data or incompatible data layout changes.", FieldHasValue));
            }
        }
        set
        {
            I1 = default(T1);
            I2 = default(T2);
            I3 = default(T3);
            I4 = default(T4);
            I5 = default(T5);
            I6 = default(T6);
            I7 = default(T7);
            I8 = default(T8);
            I9 = default(T9);
            I10 = default(T10);
            if (value != null)
            {
                Type t = value.GetType();
                if (t == typeof(T1))
                {
                    FieldHasValue = 1;
                    I1 = (T1)value;
                }
                else if (t == typeof(T2))
                {
                    FieldHasValue = 2;
                    I2 = (T2)value;
                }
                else if (t == typeof(T3))
                {
                    FieldHasValue = 3;
                    I3 = (T3)value;
                }
                else if (t == typeof(T4))
                {
                    FieldHasValue = 4;
                    I4 = (T4)value;
                }
                else if (t == typeof(T5))
                {
                    FieldHasValue = 5;
                    I5 = (T5)value;
                }
                else if (t == typeof(T6))
                {
                    FieldHasValue = 6;
                    I6 = (T6)value;
                }
                else if (t == typeof(T7))
                {
                    FieldHasValue = 7;
                    I7 = (T7)value;
                }
                else if (t == typeof(T8))
                {
                    FieldHasValue = 8;
                    I8 = (T8)value;
                }
                else if (t == typeof(T9))
                {
                    FieldHasValue = 9;
                    I9 = (T9)value;
                }
                else if (t == typeof(T10))
                {
                    FieldHasValue = 10;
                    I10 = (T10)value;
                }
                else
                {
                    throw new NotSupportedException(String.Format("The type {0} is not supported for serialization. Please add the type to the SimulatedObject generic argument list.", t.FullName));
                }
            }
        }
    }
}
[DataContract]
class Customer
{
    /*
    [DataMember(Order = 3)]
    public object Tag1 // The DataContract used to contain an object, which now becomes a SimulatedObject
    {
        get;
        set;
    }
    */
    [DataMember(Order = 3)]
    public SimulatedObject<bool, Other, Other, Other, Other, Other, Other, Other, Other, SomethingDifferent> Tag1 // Can contain up to 10 different types
    {
        get;
        set;
    }

    [DataMember(Order = 4)]
    public List<string> Strings
    {
        get;
        set;
    }
}
[DataContract]
public class Other
{
    [DataMember(Order = 1)]
    public string OtherData
    {
        get;
        set;
    }
}

[DataContract]
public class SomethingDifferent
{
    [DataMember(Order = 1)]
    public string OtherData
    {
        get;
        set;
    }
}
class Program
{
    static void Main(string[] args)
    {
        Customer c = new Customer
        {
            Strings = new List<string> { "First", "Second", "Third" },
            Tag1 = new SimulatedObject<bool, Other, Other, Other, Other, Other, Other, Other, Other, SomethingDifferent>
            {
                Data = new Other { OtherData = "String value " }
            }
        };
        const int Runs = 1000 * 1000;
        var stream = new MemoryStream();
        var sw = Stopwatch.StartNew();
        Serializer.Serialize<Customer>(stream, c); // warm-up run
        sw = Stopwatch.StartNew();
        for (int i = 0; i < Runs; i++)
        {
            stream.Position = 0;
            stream.SetLength(0);
            Serializer.Serialize<Customer>(stream, c);
        }
        sw.Stop();
        Console.WriteLine("Data Size with Protocol buffer Serializer: {0}, {1} objects did take {2}s", stream.ToArray().Length, Runs, sw.Elapsed.TotalSeconds);
        stream.Position = 0;
        var newCustw = Serializer.Deserialize<Customer>(stream);
        sw = Stopwatch.StartNew();
        for (int i = 0; i < Runs; i++)
        {
            stream.Position = 0;
            var newCust = Serializer.Deserialize<Customer>(stream);
        }
        sw.Stop();
        Console.WriteLine("Read object with Protocol buffer deserializer: {0} objects did take {1}s", Runs, sw.Elapsed.TotalSeconds);
    }
}

No, this solution would be hard to maintain in the long term.
I recommend that you prepend the full name of the serialized type to the serialized data during serialization, and read the type name back at the beginning of deserialization (no need to change the protobuf-net source code).
As a side note, you should try to avoid mixing object types in the deserialization process. I'm assuming you are updating an existing .NET application and can't redesign it.
Update: Sample code
public byte[] Serialize(object myObject)
{
    using (var ms = new MemoryStream())
    {
        // write "Full.Type.Name|" as an ASCII prefix, then the protobuf payload
        Type type = myObject.GetType();
        var id = System.Text.Encoding.ASCII.GetBytes(type.FullName + '|');
        ms.Write(id, 0, id.Length);
        Serializer.Serialize(ms, myObject);
        var bytes = ms.ToArray();
        return bytes;
    }
}
public object Deserialize(byte[] serializedData)
{
    // requires: using System.Reflection;
    StringBuilder sb = new StringBuilder();
    using (var ms = new MemoryStream(serializedData))
    {
        // read characters up to the '|' separator to recover the type name
        while (true)
        {
            var currentChar = (char)ms.ReadByte();
            if (currentChar == '|')
            {
                break;
            }
            sb.Append(currentChar);
        }
        string typeName = sb.ToString();
        // assuming that the calling assembly contains the desired type.
        // You can include additional assembly information if necessary.
        Type deserializationType = Assembly.GetCallingAssembly().GetType(typeName);
        // invoke Serializer.Deserialize<T>(Stream) with T = deserializationType
        MethodInfo mi = typeof(Serializer).GetMethod("Deserialize");
        MethodInfo genericMethod = mi.MakeGenericMethod(new[] { deserializationType });
        return genericMethod.Invoke(null, new object[] { ms });
    }
}

I'm working on something similar now, and I have already published a first version of the library:
http://bitcare.codeplex.com/
The current version doesn't support generics yet, but I plan to add that support soon.
I have only uploaded the source code there so far; when generics support is ready I will prepare a binary release as well.
It assumes both sides (client and server) know what they serialize/deserialize, so there is no reason to embed full metadata. Because of this, the serialization results are very small and the generated serializers work very fast. It has data dictionaries, uses smart data storage (in short, it stores only the important bits) and does a final compression pass when necessary. If you need it, just try whether it solves your problem.
The license is GPL, but I will soon change it to a less restrictive one (free for commercial usage as well, but at your own risk, as with the GPL).

The version I uploaded to CodePlex is working in one of my products. It's tested with various sets of unit tests, of course. They are not uploaded there, because I ported the library to VS2012 and .NET 4.5 and decided to create new sets of test cases for the upcoming release.
I don't deal with open (unbound) generics; I process closed (parameterized) generics. From a data-contract point of view, a parameterized generic is just a specialized class, so I can process it as usual (like any other class); the difference is only in object construction and in storage optimizations.
When I store the information that a Nullable<> holds null, it takes only one bit in the storage stream; if it is not null, I serialize the value according to the type provided as the generic parameter (a DateTime, for instance, can take anywhere from one bit, for the so-called default value, to a few bytes). The goal was to generate serialization code from the current knowledge of the data contracts on the classes, instead of doing it on the fly and wasting memory and processing power. When, during code generation, I see a property in some class based on some generic, I know all the properties of that generic and the type of every property. From this point of view it is a concrete class.
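To illustrate the single-bit idea: this is not the library's actual code (its internals aren't shown here), just a minimal sketch of spending one bit on the null flag; the BitWriter type and its members are invented for this example.

using System.Collections.Generic;

// Minimal bit-level writer, invented for illustration only.
class BitWriter
{
    private readonly List<byte> buffer = new List<byte>();
    private int bitPos = 8; // forces allocation of the first byte

    public void WriteBit(bool bit)
    {
        if (bitPos == 8) { buffer.Add(0); bitPos = 0; }
        if (bit) buffer[buffer.Count - 1] |= (byte)(1 << bitPos);
        bitPos++;
    }

    // A null Nullable<int> costs exactly one bit; a value costs the flag bit
    // plus a naive fixed-width payload (a real library would presumably use a
    // smarter variable-length encoding here).
    public void WriteNullableInt32(int? value)
    {
        WriteBit(value.HasValue);
        if (value.HasValue)
        {
            for (int i = 0; i < 32; i++)
            {
                WriteBit((value.Value & (1 << i)) != 0);
            }
        }
    }

    public byte[] ToArray() { return buffer.ToArray(); }
}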
I will change the license soon; I first have to figure out how, because while it's possible to choose from a list of provided license types, I can't supply my own license text. The license of Newtonsoft.Json is what I would like to have as well, but I don't know yet how to change to it.
The documentation is not there yet, but in short, it's easy to prepare your own serialization tests. You compile an assembly with the types you want to store/serialize efficiently, then create *.tt files in your serialization library (e.g. for a Person class; it checks dependencies and generates code for the other, dependent classes as well) and save the files (saving them generates all the code needed to cooperate with the serialization library). You can also create a task in your build configuration to regenerate the source code from the .tt files on every build (Entity Framework probably generates its code in a similar way during the build).
You can then compile your serialization library and measure the performance and the size of the results.
I need this serialization library for my framework for the effective use of entities with Azure Table and Blob storage, so I plan to finish the initial release soon.

Related

ProtoBuf Net Dictionary Duplicate Keys Handling (Map)

I have some data which I serialized with protobuf.net. The data is a map, and contains some duplicates (which happened as my key didn't implement IEquatable)
I want to deserialize the data into a dictionary and ignore duplicates.
There seems to be an attribute for that, i.e. [ProtoMap(DisableMap=false)], about which the documentation says:
Disable "map" handling; dictionaries will use .Add(key, value) instead of [key] = value. ...
Basically I want the behavior to be [key] = value, but apparently, the attribute is ignored.
Am I doing anything wrong? Is there any way to achieve the desired (and documented) behavior of ignoring duplicates?
Example code:
1. Produce data with duplicates:
// The following part generated the bytes; it requires the key NOT to implement IEquatable
var cache = new MyTestClass() { Dictionary = new Dictionary<MyTestKey, string>() };
cache.Dictionary[new MyTestKey { Value = "X" }] = "A";
cache.Dictionary[new MyTestKey { Value = "X" }] = "B";
var bytes = cache.Serialize();
var bytesStr = string.Join(",", bytes); // "10,8,10,3,10,1,88,18,1,65,10,8,10,3,10,1,88,18,1,66"
//..
[DataContract]
public class MyTestKey
{
    [DataMember(Order = 1)]
    public string Value { get; set; }
}

[DataContract]
public class MyTestClass
{
    [DataMember(Order = 1)]
    [ProtoMap(DisableMap = false)]
    public Dictionary<MyTestKey, string> Dictionary { get; set; }
}
2. Try to deserialize the data, with the key type now implementing IEquatable<MyTestKey>, which fails:
[DataContract]
public class MyTestKey : IEquatable<MyTestKey>
{
    [DataMember(Order = 1)]
    public string Value { get; set; }

    public bool Equals(MyTestKey other)
    {
        if (ReferenceEquals(null, other)) return false;
        if (ReferenceEquals(this, other)) return true;
        return Value == other.Value;
    }

    public override bool Equals(object obj)
    {
        if (ReferenceEquals(null, obj)) return false;
        if (ReferenceEquals(this, obj)) return true;
        if (obj.GetType() != this.GetType()) return false;
        return Equals((MyTestKey)obj);
    }

    public override int GetHashCode()
    {
        return (Value != null ? Value.GetHashCode() : 0);
    }
}
//..
var bytesStr2 = "10,8,10,3,10,1,88,18,1,65,10,8,10,3,10,1,88,18,1,66";
var bytes2 = bytesStr2.Split(',').Select(byte.Parse).ToArray();
var cache = bytes2.DeserializeTo<MyTestClass>();
Exception: An item with the same key has already been added.
public static class SerializationExtensions
{
    public static T DeserializeTo<T>(this byte[] bytes)
    {
        if (bytes == null)
            return default(T);
        using (var ms = new MemoryStream(bytes))
        {
            return Serializer.Deserialize<T>(ms);
        }
    }

    public static byte[] Serialize<T>(this T setup)
    {
        using (var ms = new MemoryStream())
        {
            Serializer.Serialize(ms, setup);
            return ms.ToArray();
        }
    }
}
There are a few different things going on here; "map" mode is actually the one you want here - so it isn't that you're trying to disable map, but rather that you're trying to force it on (it is now on by default in most common dictionary scenarios).
There are some complications:
1. the library only processes [ProtoMap(...)] when processing a [ProtoMember(...)] for a [ProtoContract(...)]
2. even then, it only processes [ProtoMap(...)] for key-types that are valid as "map" keys in the proto specification
3. you can turn it on manually (not via the attributes), but in v2.* it enforces the same check as #2 at runtime, which means it will fail
The manual enable from #3 works in v3.* (currently in alpha):
RuntimeTypeModel.Default[typeof(MyTestClass)][1].IsMap = true;
however, this is obviously inelegant, and today requires using an alpha build (we've been using it in production here at Stack Overflow for an extended period; I just need to get a release together - docs, etc).
Given that it works, I'm tempted to propose that we should soften #2 in v3.*, such that while the default behavior remains the same, it would still check for [ProtoMap(...)] for custom types, and enable that mode. I'm on the fence about whether to soften #1.
I'd be interested in your thoughts on these things!
But to confirm: the following works fine in v3.* and outputs "B" (minor explanation of the code: in protobuf, append===merge for root objects, so serializing two payloads one after the other has the same effect as serializing a dictionary with the combined content, so the two Serialize calls spoofs a payload with two identical keys):
static class P
{
    static void Main()
    {
        using var ms = new MemoryStream();
        var key = new MyTestKey { Value = "X" };
        RuntimeTypeModel.Default[typeof(MyTestClass)][1].IsMap = true;
        Serializer.Serialize(ms, new MyTestClass() { Dictionary =
            new Dictionary<MyTestKey, string> { { key, "A" } } });
        Serializer.Serialize(ms, new MyTestClass() { Dictionary =
            new Dictionary<MyTestKey, string> { { key, "B" } } });
        ms.Position = 0;
        var val = Serializer.Deserialize<MyTestClass>(ms).Dictionary[key];
        Console.WriteLine(val); // B
    }
}
I think what I'd like is if, in v3.*, it worked without the IsMap = true line, with:
[ProtoContract]
public class MyTestClass
{
    [ProtoMember(1)]
    [ProtoMap] // explicit enable here, because not a normal map type
    public Dictionary<MyTestKey, string> Dictionary { get; set; }
}

UseImplicitZeroDefaults for generated protobuf classes

The default values for classes generated with protogen don't seem to be serialized, even when UseImplicitZeroDefaults = false.
I have a small .proto file:
package protobuf;
option java_package = "com.company.protobuf";
option java_outer_classname = "Test";
message TestMessage{
optional string Message = 1;
optional bool ABool = 2;
optional int32 AnInt = 3;
}
Using protogen.exe, I've generated a TestMessage class that I'm trying to send back and forth across the wire to a Java app. I can't seem to get protobuf-net to serialize a value of zero for AnInt or false for ABool, even with UseImplicitZeroDefaults=false. However, using annotated classes for serialization with that setting works. Here's an equivalent class to the one I generated:
[ProtoContract]
class Test2
{
    [ProtoMember(1)]
    public string Message { get; set; }
    [ProtoMember(2)]
    public bool ABool { get; set; }
    [ProtoMember(3)]
    public int AnInt { get; set; }
}
Initializing the two classes with the same data and serializing to byte[] shows that four extra bytes are coming from the annotated class.
...
private static readonly RuntimeTypeModel serializer;

static Program()
{
    serializer = TypeModel.Create();
    serializer.UseImplicitZeroDefaults = false;
    Console.WriteLine(serializer.UseImplicitZeroDefaults); // prints false
}

static void SendMessages(ITopic topic, ISession session)
{
    Console.WriteLine(serializer.UseImplicitZeroDefaults);
    TestMessage t = new TestMessage();
    t.ABool = false;
    t.AnInt = 0;
    t.Message = "Test Message";
    using (var o = new MemoryStream())
    {
        serializer.Serialize(o, t);
        Console.WriteLine(string.Format("Tx: Message={0} ABool={1} AnInt={2}", t.Message, t.ABool, t.AnInt));
        Console.WriteLine(o.ToArray().Length);
    }
    Test2 t2 = new Test2();
    t2.ABool = false;
    t2.AnInt = 0;
    t2.Message = "Test Message";
    using (var o = new MemoryStream())
    {
        serializer.Serialize(o, t2);
        Console.WriteLine(string.Format("Tx: Message={0} ABool={1} AnInt={2}", t2.Message, t2.ABool, t2.AnInt));
        Console.WriteLine(o.ToArray().Length);
    }
}
Output:
False
Tx: Message=Test Message ABool=False AnInt=0
14
Tx: Message=Test Message ABool=False AnInt=0
18
Is there a setting I'm missing? Or do classes generated from .proto files use a different mechanism for serialization? In an ideal world, I would expect the UseImplicitZeroDefaults setting to get picked up by both the annotated and generated classes on their way through the serializer.
If you add -p:detectMissing to your call to protogen, it should emit code following a different pattern that allows for better tracking of these. Basically, it should do what you want then.
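For reference, protobuf-net also honors the XmlSerializer-style *Specified naming convention for tracking whether a member is present, which is essentially the kind of pattern detectMissing emits; a hand-written sketch might look like this (the member names are illustrative, not protogen's exact output, so verify against your version):

[ProtoContract]
class Test3
{
    [ProtoMember(1)]
    public string Message { get; set; }

    [ProtoMember(2)]
    public bool ABool { get; set; }
    // By naming convention, this tells the serializer whether ABool should be
    // written, and is set during deserialization when the field is present.
    public bool ABoolSpecified { get; set; }
}

With this pattern, setting ABool = false and ABoolSpecified = true forces the false value onto the wire, which is the behavior the question is after.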

Deep copy of List<T>

I'm trying to make a deep copy of a generic list, and am wondering if there is any other way than creating the copying method and actually copying over each member one at a time. I have a class that looks somewhat like this:
public class Data
{
    private string comment;
    public string Comment
    {
        get { return comment; }
        set { comment = value; }
    }

    private List<double> traceData;
    public List<double> TraceData
    {
        get { return traceData; }
        set { traceData = value; }
    }
}
And I have a list of the above data, i.e. List<Data>. What I'm trying to do is plot the trace data of a subset of the list onto a graph, possibly with some scaling or sweeping of the data. I obviously don't need to plot everything in the list, because it doesn't all fit on the screen.
I initially tried getting the subset of the list using the List.GetRange() method, but it seems that the underlying List<double> is being shallow copied instead of deep copied. When I get the subset again using List.GetRange(), I get previously modified data, not the raw data retrieved elsewhere.
Can anyone give me a direction on how to approach this? Thanks a lot.
The idiomatic way to approach this in C# is to implement ICloneable on your Data, and write a Clone method that does the deep copy (and then presumably an Enumerable.CloneRange method that can clone part of your list at once). There isn't any built-in trick or framework method to make it easier than that.
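A sketch of that approach for the Data class above (CloneRange is not a framework method; this is one possible shape for it):

using System;
using System.Collections.Generic;
using System.Linq;

public class Data : ICloneable
{
    public string Comment { get; set; }
    public List<double> TraceData { get; set; }

    // Deep copy: the string is immutable, but the list must be duplicated.
    public object Clone()
    {
        return new Data
        {
            Comment = Comment,
            TraceData = TraceData == null ? null : new List<double>(TraceData)
        };
    }
}

public static class ListCloneExtensions
{
    // Clones a sub-range of the list, deep-copying each element.
    public static List<T> CloneRange<T>(this List<T> source, int index, int count)
        where T : ICloneable
    {
        return source.GetRange(index, count).Select(x => (T)x.Clone()).ToList();
    }
}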
Unless memory and performance are a real concern, I suggest that you try hard to redesign it to operate on immutable Data objects, though, instead. It'll wind up much simpler.
You can try this:
public static object DeepCopy(object obj)
{
    if (obj == null)
        return null;
    Type type = obj.GetType();
    if (type.IsValueType || type == typeof(string))
    {
        return obj;
    }
    else if (type.IsArray)
    {
        Type elementType = type.GetElementType();
        var array = obj as Array;
        Array copied = Array.CreateInstance(elementType, array.Length);
        for (int i = 0; i < array.Length; i++)
        {
            copied.SetValue(DeepCopy(array.GetValue(i)), i);
        }
        return copied;
    }
    else if (type.IsClass)
    {
        object toret = Activator.CreateInstance(obj.GetType());
        FieldInfo[] fields = type.GetFields(BindingFlags.Public |
            BindingFlags.NonPublic | BindingFlags.Instance);
        foreach (FieldInfo field in fields)
        {
            object fieldValue = field.GetValue(obj);
            if (fieldValue == null)
                continue;
            field.SetValue(toret, DeepCopy(fieldValue));
        }
        return toret;
    }
    else
        throw new ArgumentException("Unknown type");
}
Thanks to DetoX83's article on Code Project.
If the ICloneable way is too tricky for you, I suggest converting to something and back. It can be done with BinaryFormatter or a JSON converter like ServiceStack.Text, since it is the fastest one in .NET.
The code should be something like this:
MyClass mc = new MyClass();
string json = mc.ToJson();
MyClass mcCloned = json.FromJson<MyClass>();
mcCloned will not reference mc.
The easiest (but dirty) way is to implement ICloneable on your class and use the following extension method:
public static IEnumerable<T> Clone<T>(this IEnumerable<T> collection) where T : ICloneable
{
    return collection.Select(item => (T)item.Clone());
}
Usage:
var list = new List<Data> { new Data { Comment = "comment", TraceData = new List<double> { 1, 2, 3 } } };
var newList = list.Clone();
Another thing you can do is mark your class as serializable and use binary serialization.
Here is a working example:
public class Program
{
    [Serializable]
    public class Test
    {
        public int Id { get; set; }

        public Test()
        {
        }
    }

    public static void Main()
    {
        // create a list of 10 Test objects with Ids 0-9
        List<Test> firstList = Enumerable.Range(0, 10).Select(x => new Test { Id = x }).ToList();
        using (var stream = new System.IO.MemoryStream())
        {
            var binaryFormatter = new System.Runtime.Serialization.Formatters.Binary.BinaryFormatter();
            binaryFormatter.Serialize(stream, firstList); // serialize to stream
            stream.Position = 0;
            // deserialize from stream
            List<Test> secondList = binaryFormatter.Deserialize(stream) as List<Test>;
        }
        Console.ReadKey();
    }
}
If you make your objects immutable, you don't need to worry about passing around copies of them; then you could do something like:
var toPlot = list.Where(d => d.ShouldBePlotted());
Since your collection is mutable, you need to implement the deep copy programmatically:
public class Data
{
    public string Comment { get; set; }
    public List<double> TraceData { get; set; }

    public Data DeepCopy()
    {
        return new Data
        {
            Comment = this.Comment,
            TraceData = this.TraceData != null
                ? new List<double>(this.TraceData)
                : null
        };
    }
}
The Comment field can be shallow copied because string is already immutable. You need to create a new list for TraceData, but the elements themselves are value types and require no special handling to copy.
When I get the subset again using List.GetRange(), I get previously modified data, not the raw data retrieved elsewhere.
Use your new DeepCopy method like this (note that GetRange must come before Select, since Select returns an IEnumerable<T>):
var pointsInRange = dataPoints
    .GetRange(start, length)
    .Select(x => x.DeepCopy())
    .ToList();
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;

namespace DeepListCopy_testingSome
{
    class Program
    {
        static void Main(string[] args)
        {
            List<int> list1 = new List<int>();
            List<int> list2 = new List<int>();
            // populate list1
            for (int i = 0; i < 20; i++)
            {
                list1.Add(1);
            }
            Console.WriteLine("\n int in each list1 element is:\n");
            foreach (int i in list1)
            {
                Console.WriteLine(" list1 elements: {0}", i);
                list2.Add(1);
            }
            Console.WriteLine("\n int in each list2 element is:\n");
            foreach (int i in list2)
            {
                Console.WriteLine(" list2 elements: {0}", i);
            }
            for (int i = 0; i < list2.Count; i++)
            {
                list2[i] = 2;
            }
            Console.WriteLine("\n Printing list1 and list2 respectively to show\n"
                + " there are two independent lists, i.e., two different"
                + "\n memory locations, after modifying list2\n\n");
            foreach (int i in list1)
            {
                Console.WriteLine(" Printing list1 elements: {0}", i);
            }
            Console.WriteLine("\n\n");
            foreach (int i in list2)
            {
                Console.WriteLine(" Printing list2 elements: {0}", i);
            }
            Console.ReadKey();
        } // end of static void Main
    } // end of class
}
One quick and generic way to deep-clone an object is to serialize it with JSON.NET. The following extension method allows cloning a list of arbitrary objects, and is able to skip Entity Framework navigation properties, since these may lead to circular dependencies and unwanted data fetches.
Method
public static List<T> DeepClone<T>(this IList<T> list, bool ignoreVirtualProps = false)
{
    JsonSerializerSettings settings = new JsonSerializerSettings();
    if (ignoreVirtualProps)
    {
        // custom resolver; a sketch of one possible implementation is shown below
        settings.ContractResolver = new IgnoreNavigationPropsResolver();
        settings.PreserveReferencesHandling = PreserveReferencesHandling.None;
        settings.ReferenceLoopHandling = ReferenceLoopHandling.Ignore;
        settings.Formatting = Formatting.Indented;
    }
    var serialized = JsonConvert.SerializeObject(list, settings);
    return JsonConvert.DeserializeObject<List<T>>(serialized);
}
Usage
var clonedList = list.DeepClone();
By default, JSON.NET serializes only public properties. If private properties must be also cloned, this solution can be used.
This method allows for quick (de)serialization of complex hierarchies of objects.
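The IgnoreNavigationPropsResolver referenced in the method above isn't defined in the original post; a minimal sketch, assuming EF navigation properties are declared virtual (as they are for lazy-loading proxies), might look like this:

using System.Reflection;
using Newtonsoft.Json;
using Newtonsoft.Json.Serialization;

public class IgnoreNavigationPropsResolver : DefaultContractResolver
{
    protected override JsonProperty CreateProperty(
        MemberInfo member, MemberSerialization memberSerialization)
    {
        JsonProperty property = base.CreateProperty(member, memberSerialization);
        // Skip virtual properties, the usual marker for EF navigation properties.
        var propInfo = member as PropertyInfo;
        if (propInfo != null)
        {
            MethodInfo getter = propInfo.GetGetMethod();
            if (getter != null && getter.IsVirtual && !getter.IsFinal)
            {
                property.ShouldSerialize = instance => false;
            }
        }
        return property;
    }
}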

Serializing to XML via DataContract: custom output?

I have a custom Fraction class, which I'm using throughout my whole project. It's simple: a single constructor that accepts two ints and stores them. I'd like to use the DataContractSerializer to serialize the objects used in my project, some of which include Fractions as fields. Ideally, I'd like to be able to serialize such objects like this:
<Object>
    ...
    <Frac>1/2</Frac> <!-- "1/2" would get converted back into a Fraction on deserialization -->
    ...
</Object>
As opposed to this:
<Object>
    ...
    <Frac>
        <Numerator>1</Numerator>
        <Denominator>2</Denominator>
    </Frac>
    ...
</Object>
Is there any way to do this using DataContracts?
I'd like to do this because I plan on making the XML files user-editable (I'm using them as input for a music game, and they act as notecharts, essentially), and want to keep the notation as terse as possible for the end user, so they won't need to deal with as many walls of text.
EDIT: I should also note that I currently have my Fraction class as immutable (all fields are readonly), so being able to change the state of an existing Fraction wouldn't be possible. Returning a new Fraction object would be OK, though.
If you add a property that represents the Frac element, and apply the DataMember attribute to it rather than to the other properties, you will get what you want, I believe:
[DataContract]
public class MyObject
{
    Int32 _Numerator;
    Int32 _Denominator;

    public MyObject(Int32 numerator, Int32 denominator)
    {
        _Numerator = numerator;
        _Denominator = denominator;
    }

    public Int32 Numerator
    {
        get { return _Numerator; }
        set { _Numerator = value; }
    }

    public Int32 Denominator
    {
        get { return _Denominator; }
        set { _Denominator = value; }
    }

    [DataMember(Name = "Frac")]
    public String Fraction
    {
        get { return _Numerator + "/" + _Denominator; }
        set
        {
            String[] parts = value.Split(new char[] { '/' });
            _Numerator = Int32.Parse(parts[0]);
            _Denominator = Int32.Parse(parts[1]);
        }
    }
}
DataContractSerializer will use a custom IXmlSerializable implementation if one is provided in place of a DataContractAttribute. This allows you to customize the XML formatting in any way you need, but you will have to hand-code the serialization and deserialization process for your class.
public class Fraction : IXmlSerializable
{
    private Fraction()
    {
    }

    public Fraction(int numerator, int denominator)
    {
        this.Numerator = numerator;
        this.Denominator = denominator;
    }

    public int Numerator { get; private set; }
    public int Denominator { get; private set; }

    public XmlSchema GetSchema()
    {
        return null; // by convention, IXmlSerializable implementations return null here
    }

    public void ReadXml(XmlReader reader)
    {
        var content = reader.ReadInnerXml();
        var parts = content.Split('/');
        Numerator = int.Parse(parts[0]);
        Denominator = int.Parse(parts[1]);
    }

    public void WriteXml(XmlWriter writer)
    {
        writer.WriteRaw(this.ToString());
    }

    public override string ToString()
    {
        return string.Format("{0}/{1}", Numerator, Denominator);
    }
}

[DataContract(Name = "Object", Namespace = "")]
public class MyObject
{
    [DataMember]
    public Fraction Frac { get; set; }
}
class Program
{
    static void Main(string[] args)
    {
        var myobject = new MyObject
        {
            Frac = new Fraction(1, 2)
        };
        var dcs = new DataContractSerializer(typeof(MyObject));
        string xml = null;
        using (var ms = new MemoryStream())
        {
            dcs.WriteObject(ms, myobject);
            xml = Encoding.UTF8.GetString(ms.ToArray());
            Console.WriteLine(xml);
            // <Object><Frac>1/2</Frac></Object>
        }
        using (var ms = new MemoryStream(Encoding.UTF8.GetBytes(xml)))
        {
            ms.Position = 0;
            var obj = dcs.ReadObject(ms) as MyObject;
            Console.WriteLine(obj.Frac);
            // 1/2
        }
    }
}
This MSDN article describes IDataContractSurrogate Interface which:
Provides the methods needed to substitute one type for another by the
DataContractSerializer during serialization, deserialization, and
export and import of XML schema documents.
Although this comes way too late, it may still help someone. This approach actually allows changing the XML for ANY class.
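A minimal sketch of the surrogate approach for the Fraction case (the surrogate and provider names are invented for this example, and it assumes Fraction exposes a (numerator, denominator) constructor and an "n/d" ToString as described in the question; note the result is still wrapped in an element, e.g. <Frac><Value>1/2</Value></Frac>, rather than the bare <Frac>1/2</Frac> form):

using System;
using System.CodeDom;
using System.Collections.ObjectModel;
using System.Reflection;
using System.Runtime.Serialization;

[DataContract]
public class FractionSurrogate
{
    [DataMember]
    public string Value { get; set; }
}

public class FractionSurrogateProvider : IDataContractSurrogate
{
    // Tell the serializer to treat Fraction as FractionSurrogate.
    public Type GetDataContractType(Type type)
    {
        return type == typeof(Fraction) ? typeof(FractionSurrogate) : type;
    }

    public object GetObjectToSerialize(object obj, Type targetType)
    {
        var frac = obj as Fraction;
        return frac == null ? obj : new FractionSurrogate { Value = frac.ToString() };
    }

    public object GetDeserializedObject(object obj, Type targetType)
    {
        var s = obj as FractionSurrogate;
        if (s == null) return obj;
        var parts = s.Value.Split('/');
        return new Fraction(int.Parse(parts[0]), int.Parse(parts[1]));
    }

    // The remaining members only matter for schema export/import.
    public object GetCustomDataToExport(MemberInfo memberInfo, Type dataContractType) { return null; }
    public object GetCustomDataToExport(Type clrType, Type dataContractType) { return null; }
    public void GetKnownCustomDataTypes(Collection<Type> customDataTypes) { }
    public Type GetReferencedTypeOnImport(string typeName, string typeNamespace, object customData) { return null; }
    public CodeTypeDeclaration ProcessImportedType(CodeTypeDeclaration typeDeclaration, CodeCompileUnit compileUnit) { return typeDeclaration; }
}

It would be wired up via the DataContractSerializer constructor overload that accepts a surrogate:
var dcs = new DataContractSerializer(typeof(MyObject), Type.EmptyTypes, int.MaxValue, false, false, new FractionSurrogateProvider());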
You can do this with the DataContractSerializer, albeit in a way that feels hacky to me. You can take advantage of the fact that data members can be private variables, and use a private string as your serialized member. The data contract serializer will also execute methods at certain points in the process that are marked with [On(De)Serializ(ed|ing)] attributes - inside of those, you can control how the int fields are mapped to the string, and vice-versa. The downside is that you lose the automatic serialization magic of the DataContractSerializer on your class, and now have more logic to maintain.
Anyways, here's what I would do:
[DataContract]
public class Fraction
{
    [DataMember(Name = "Frac")]
    private string serialized;

    public int Numerator { get; private set; }
    public int Denominator { get; private set; }

    [OnSerializing]
    public void OnSerializing(StreamingContext context)
    {
        // This gets called just before the DataContractSerializer begins.
        serialized = Numerator.ToString() + "/" + Denominator.ToString();
    }

    [OnDeserialized]
    public void OnDeserialized(StreamingContext context)
    {
        // This gets called after the DataContractSerializer finishes its work.
        var nums = serialized.Split('/');
        Numerator = int.Parse(nums[0]);
        Denominator = int.Parse(nums[1]);
    }
}
You'll have to switch back to the XmlSerializer to do that. The DataContractSerializer is a bit more restrictive in terms of being able to customise the output.
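For completeness, a sketch of what that looks like with XmlSerializer, reusing the proxy-property trick from the first answer (the property names are illustrative, and it assumes the same Fraction class as above):

using System;
using System.Xml.Serialization;

public class MyObject
{
    [XmlIgnore]
    public Fraction Frac { get; set; }

    // Serialized in place of Frac; converts "1/2" to and from the Fraction.
    [XmlElement("Frac")]
    public string FracText
    {
        get { return Frac == null ? null : Frac.ToString(); }
        set
        {
            var parts = value.Split('/');
            Frac = new Fraction(int.Parse(parts[0]), int.Parse(parts[1]));
        }
    }
}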

c# serialized data

I have been using BinaryFormatter to serialise data to disk, but it doesn't seem very scalable. I've created a 200MB data file but am unable to read it back in (End of Stream encountered before parsing was completed). It tries to deserialise for about 30 minutes and then gives up. This is on a fairly decent quad-CPU box with 8GB RAM.
I'm serialising a fairly large complicated structure.
htCacheItems is a Hashtable of CacheItems. Each CacheItem has several simple members (strings + ints etc) and also contains a Hashtable and a custom implementation of a linked list. The sub-hashtable points to CacheItemValue structures which is currently a simple DTO which contains a key and a value. The linked list items are also equally simple.
The data file that fails contains about 400,000 CacheItemValues.
Smaller datasets work well (though it takes longer than I'd expect to deserialize, and uses a hell of a lot of memory).
public virtual bool Save(String sBinaryFile)
{
    bool bSuccess = false;
    FileStream fs = new FileStream(sBinaryFile, FileMode.Create);
    try
    {
        BinaryFormatter formatter = new BinaryFormatter();
        formatter.Serialize(fs, htCacheItems);
        bSuccess = true;
    }
    catch (Exception e)
    {
        bSuccess = false;
    }
    finally
    {
        fs.Close();
    }
    return bSuccess;
}
public virtual bool Load(String sBinaryFile)
{
    bool bSuccess = false;
    FileStream fs = null;
    GZipStream gzfs = null;
    try
    {
        fs = new FileStream(sBinaryFile, FileMode.OpenOrCreate);
        if (sBinaryFile.EndsWith("gz"))
        {
            gzfs = new GZipStream(fs, CompressionMode.Decompress);
        }
        // add the event handler
        ResolveEventHandler resolveEventHandler = new ResolveEventHandler(AssemblyResolveEventHandler);
        AppDomain.CurrentDomain.AssemblyResolve += resolveEventHandler;
        BinaryFormatter formatter = new BinaryFormatter();
        htCacheItems = (Hashtable)formatter.Deserialize(gzfs != null ? (Stream)gzfs : (Stream)fs);
        // remove the event handler
        AppDomain.CurrentDomain.AssemblyResolve -= resolveEventHandler;
        bSuccess = true;
    }
    catch (Exception e)
    {
        Logger.Write(new ExceptionLogEntry("Failed to populate cache from file " + sBinaryFile + ". Message is " + e.Message));
        bSuccess = false;
    }
    finally
    {
        if (fs != null)
        {
            fs.Close();
        }
        if (gzfs != null)
        {
            gzfs.Close();
        }
    }
    return bSuccess;
}
The resolveEventHandler is just a workaround, because I'm serialising the data in one application and loading it in another (http://social.msdn.microsoft.com/Forums/en-US/netfxbcl/thread/e5f0c371-b900-41d8-9a5b-1052739f2521).
The question is: how can I improve this? Is data serialisation always going to be this inefficient? Am I better off writing my own routines?
I would personally try to avoid the need for the assembly-resolve; that has a certain smell about it. If you must use BinaryFormatter, then I'd simply put the DTOs into a separate library (dll) that can be used in both applications.
If you don't want to share the dll, then IMO you shouldn't be using BinaryFormatter - you should be using a contract-based serializer, such as XmlSerializer or DataContractSerializer, or one of the "protocol buffers" implementations (and to repeat Jon's disclaimer: I wrote one of the others).
200MB does seem pretty big, but I wouldn't have expected it to fail. One possible cause here is the object tracking it does for the references; but even then, this surprises me.
I'd love to see a simplified object model to see if it is a "fit" for any of the above.
Here's an example that attempts to mirror your setup from the description using protobuf-net. Oddly enough there seems to be a glitch working with the linked-list, which I'll investigate; but the rest seems to work:
using System;
using System.Collections.Generic;
using System.IO;
using ProtoBuf;

[ProtoContract]
class CacheItem
{
    [ProtoMember(1)]
    public int Id { get; set; }
    [ProtoMember(2)]
    public int AnotherNumber { get; set; }
    private readonly Dictionary<string, CacheItemValue> data
        = new Dictionary<string, CacheItemValue>();
    [ProtoMember(3)]
    public Dictionary<string, CacheItemValue> Data { get { return data; } }

    //[ProtoMember(4)] // commented out while I investigate...
    public ListNode Nodes { get; set; }
}

[ProtoContract]
class ListNode // I'd probably expose this as a simple list, though
{
    [ProtoMember(1)]
    public double Head { get; set; }
    [ProtoMember(2)]
    public ListNode Tail { get; set; }
}

[ProtoContract]
class CacheItemValue
{
    [ProtoMember(1)]
    public string Key { get; set; }
    [ProtoMember(2)]
    public float Value { get; set; }
}

static class Program
{
    static void Main()
    {
        // invent 400k CacheItemValue records
        Dictionary<string, CacheItem> htCacheItems = new Dictionary<string, CacheItem>();
        Random rand = new Random(123456);
        for (int i = 0; i < 400; i++)
        {
            string key;
            CacheItem ci = new CacheItem
            {
                Id = rand.Next(10000),
                AnotherNumber = rand.Next(10000)
            };
            while (htCacheItems.ContainsKey(key = rand.NextString())) { }
            htCacheItems.Add(key, ci);
            for (int j = 0; j < 1000; j++)
            {
                while (ci.Data.ContainsKey(key = rand.NextString())) { }
                ci.Data.Add(key,
                    new CacheItemValue
                    {
                        Key = key,
                        Value = (float)rand.NextDouble()
                    });
                int tail = rand.Next(1, 50);
                ListNode node = null;
                while (tail-- > 0)
                {
                    node = new ListNode
                    {
                        Tail = node,
                        Head = rand.NextDouble()
                    };
                }
                ci.Nodes = node;
            }
        }
        Console.WriteLine(GetChecksum(htCacheItems));
        using (Stream outfile = File.Create("raw.bin"))
        {
            Serializer.Serialize(outfile, htCacheItems);
        }
        htCacheItems = null;
        using (Stream inFile = File.OpenRead("raw.bin"))
        {
            htCacheItems = Serializer.Deserialize<Dictionary<string, CacheItem>>(inFile);
        }
        Console.WriteLine(GetChecksum(htCacheItems));
    }

    static int GetChecksum(Dictionary<string, CacheItem> data)
    {
        int chk = data.Count;
        foreach (var item in data)
        {
            chk += item.Key.GetHashCode()
                + item.Value.AnotherNumber + item.Value.Id;
            foreach (var subItem in item.Value.Data.Values)
            {
                chk += subItem.Key.GetHashCode()
                    + subItem.Value.GetHashCode();
            }
        }
        return chk;
    }

    static string NextString(this Random random)
    {
        const string alphabet = "abcdefghijklmnopqrstuvwxyz0123456789 ";
        int len = random.Next(4, 10);
        char[] buffer = new char[len];
        for (int i = 0; i < len; i++)
        {
            buffer[i] = alphabet[random.Next(0, alphabet.Length)];
        }
        return new string(buffer);
    }
}
Serialization is tricky, particularly when you want to have some degree of flexibility when it comes to versioning.
Usually there's a trade-off between portability and flexibility of what you can serialize. For example, you might want to use Protocol Buffers (disclaimer: I wrote one of the C# ports) as a pretty efficient solution with good portability and versioning - but then you'll need to translate whatever your natural data structure is into something supported by Protocol Buffers.
Having said that, I'm surprised that binary serialization is failing here - at least in that particular way. Can you get it to fail with a large file with a very, very simple piece of serialization code? (No resolution handlers, no compression etc.)
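A stripped-down repro along the lines Jon suggests might look like this (the sizes are rough; the point is to take the resolve handler, compression and custom types out of the picture and see whether plain BinaryFormatter round-trips a file of this magnitude):

using System;
using System.Collections;
using System.IO;
using System.Runtime.Serialization.Formatters.Binary;

static class MinimalRepro
{
    static void Main()
    {
        // Roughly a couple of hundred MB of simple payload.
        var table = new Hashtable();
        for (int i = 0; i < 400000; i++)
        {
            table["key" + i] = new string('x', 400);
        }
        var formatter = new BinaryFormatter();
        using (var fs = File.Create("repro.bin"))
        {
            formatter.Serialize(fs, table);
        }
        using (var fs = File.OpenRead("repro.bin"))
        {
            var roundTripped = (Hashtable)formatter.Deserialize(fs);
            Console.WriteLine(roundTripped.Count); // expect 400000
        }
    }
}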
Something that could help is cascade serialization.
You call mainHashtable.serialize(), which returns an XML string, for example. This method calls everyItemInYourHashtable.serialize(), and so on.
You do the same with a static method in every class, called unserialize(String xml), which deserializes your objects and returns an object, or a list of objects.
You get the idea?
Of course, you need to implement this method in every class you want to be serializable.
Take a look at the ISerializable interface, which represents exactly what I'm describing. IMO, this interface looks too "Microsoft" (no use of DOM, etc.), so I created my own, but the principle is the same: cascade.
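For reference, a minimal example of the ISerializable contract being described (the CacheEntry type and its member names are invented for illustration):

using System;
using System.Runtime.Serialization;

[Serializable]
public class CacheEntry : ISerializable
{
    public string Key { get; private set; }
    public double Value { get; private set; }

    public CacheEntry(string key, double value)
    {
        Key = key;
        Value = value;
    }

    // Called during serialization; contained ISerializable objects are
    // handled the same way, cascading down the object graph.
    public void GetObjectData(SerializationInfo info, StreamingContext context)
    {
        info.AddValue("k", Key);
        info.AddValue("v", Value);
    }

    // Special constructor called during deserialization.
    protected CacheEntry(SerializationInfo info, StreamingContext context)
    {
        Key = info.GetString("k");
        Value = info.GetDouble("v");
    }
}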
