C# serialized data

I have been using BinaryFormatter to serialise data to disk, but it doesn't seem very scalable. I've created a 200 MB data file but am unable to read it back in ("End of Stream encountered before parsing was completed"). It tries for about 30 minutes to deserialise and then gives up. This is on a fairly decent quad-CPU box with 8 GB of RAM.
I'm serialising a fairly large complicated structure.
htCacheItems is a Hashtable of CacheItems. Each CacheItem has several simple members (strings, ints, etc.) and also contains a Hashtable and a custom implementation of a linked list. The sub-hashtable points to CacheItemValue structures, each of which is currently a simple DTO containing a key and a value. The linked-list items are equally simple.
The data file that fails contains about 400,000 CacheItemValues.
Smaller datasets work well (though they take longer than I'd expect to deserialise, and use a hell of a lot of memory).
public virtual bool Save(String sBinaryFile)
{
    bool bSuccess = false;
    FileStream fs = new FileStream(sBinaryFile, FileMode.Create);
    try
    {
        BinaryFormatter formatter = new BinaryFormatter();
        formatter.Serialize(fs, htCacheItems);
        bSuccess = true;
    }
    catch (Exception e)
    {
        bSuccess = false;
    }
    finally
    {
        fs.Close();
    }
    return bSuccess;
}

public virtual bool Load(String sBinaryFile)
{
    bool bSuccess = false;
    FileStream fs = null;
    GZipStream gzfs = null;
    try
    {
        fs = new FileStream(sBinaryFile, FileMode.OpenOrCreate);
        if (sBinaryFile.EndsWith("gz"))
        {
            gzfs = new GZipStream(fs, CompressionMode.Decompress);
        }
        //add the event handler
        ResolveEventHandler resolveEventHandler = new ResolveEventHandler(AssemblyResolveEventHandler);
        AppDomain.CurrentDomain.AssemblyResolve += resolveEventHandler;
        BinaryFormatter formatter = new BinaryFormatter();
        htCacheItems = (Hashtable)formatter.Deserialize(gzfs != null ? (Stream)gzfs : (Stream)fs);
        //remove the event handler
        AppDomain.CurrentDomain.AssemblyResolve -= resolveEventHandler;
        bSuccess = true;
    }
    catch (Exception e)
    {
        Logger.Write(new ExceptionLogEntry("Failed to populate cache from file " + sBinaryFile + ". Message is " + e.Message));
        bSuccess = false;
    }
    finally
    {
        if (fs != null)
        {
            fs.Close();
        }
        if (gzfs != null)
        {
            gzfs.Close();
        }
    }
    return bSuccess;
}
The resolveEventHandler is just a workaround because I'm serialising the data in one application and loading it in another (http://social.msdn.microsoft.com/Forums/en-US/netfxbcl/thread/e5f0c371-b900-41d8-9a5b-1052739f2521)
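The handler itself is essentially of this shape (a sketch only; the real one isn't shown here):
// Sketch only - not the original handler. It maps whatever assembly name
// BinaryFormatter asks for back to an assembly this application has already
// loaded, so the shared types can be found.
static System.Reflection.Assembly AssemblyResolveEventHandler(object sender, ResolveEventArgs args)
{
    // args.Name is the full name of the assembly being requested.
    return System.Reflection.Assembly.GetExecutingAssembly();
}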
The question is: how can I improve this? Is data serialisation always going to be inefficient, or am I better off writing my own routines?

I would personally try to avoid the need for the assembly-resolve; that has a certain smell about it. If you must use BinaryFormatter, then I'd simply put the DTOs into a separate library (dll) that can be used in both applications.
If you don't want to share the dll, then IMO you shouldn't be using BinaryFormatter - you should be using a contract-based serializer, such as XmlSerializer or DataContractSerializer, or one of the "protocol buffers" implementations (and to repeat Jon's disclaimer: I wrote one of the others).
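For illustration, a DataContractSerializer version of the Save method might look something like this (a sketch only, assuming the cache is exposed as a typed Dictionary and the DTOs are marked with [DataContract]/[DataMember]):
using System.Collections.Generic;
using System.IO;
using System.Runtime.Serialization;

// Sketch: contract-based save of a strongly typed cache. The generic
// Dictionary replaces the non-generic Hashtable, which contract-based
// serializers handle much less gracefully.
public bool Save(string path, Dictionary<string, CacheItem> cache)
{
    var serializer = new DataContractSerializer(typeof(Dictionary<string, CacheItem>));
    using (var fs = File.Create(path))
    {
        serializer.WriteObject(fs, cache);
    }
    return true;
}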
200MB does seem pretty big, but I wouldn't have expected it to fail. One possible cause here is the object tracking it does for the references; but even then, this surprises me.
I'd love to see a simplified object model to see if it is a "fit" for any of the above.
Here's an example that attempts to mirror your setup from the description using protobuf-net. Oddly enough there seems to be a glitch working with the linked-list, which I'll investigate; but the rest seems to work:
using System;
using System.Collections.Generic;
using System.IO;
using ProtoBuf;
[ProtoContract]
class CacheItem
{
    [ProtoMember(1)]
    public int Id { get; set; }
    [ProtoMember(2)]
    public int AnotherNumber { get; set; }
    private readonly Dictionary<string, CacheItemValue> data
        = new Dictionary<string, CacheItemValue>();
    [ProtoMember(3)]
    public Dictionary<string, CacheItemValue> Data { get { return data; } }
    //[ProtoMember(4)] // commented out while I investigate...
    public ListNode Nodes { get; set; }
}

[ProtoContract]
class ListNode // I'd probably expose this as a simple list, though
{
    [ProtoMember(1)]
    public double Head { get; set; }
    [ProtoMember(2)]
    public ListNode Tail { get; set; }
}

[ProtoContract]
class CacheItemValue
{
    [ProtoMember(1)]
    public string Key { get; set; }
    [ProtoMember(2)]
    public float Value { get; set; }
}

static class Program
{
    static void Main()
    {
        // invent 400k CacheItemValue records
        Dictionary<string, CacheItem> htCacheItems = new Dictionary<string, CacheItem>();
        Random rand = new Random(123456);
        for (int i = 0; i < 400; i++)
        {
            string key;
            CacheItem ci = new CacheItem
            {
                Id = rand.Next(10000),
                AnotherNumber = rand.Next(10000)
            };
            while (htCacheItems.ContainsKey(key = rand.NextString())) { }
            htCacheItems.Add(key, ci);
            for (int j = 0; j < 1000; j++)
            {
                while (ci.Data.ContainsKey(key = rand.NextString())) { }
                ci.Data.Add(key,
                    new CacheItemValue
                    {
                        Key = key,
                        Value = (float)rand.NextDouble()
                    });
                int tail = rand.Next(1, 50);
                ListNode node = null;
                while (tail-- > 0)
                {
                    node = new ListNode
                    {
                        Tail = node,
                        Head = rand.NextDouble()
                    };
                }
                ci.Nodes = node;
            }
        }
        Console.WriteLine(GetChecksum(htCacheItems));
        using (Stream outfile = File.Create("raw.bin"))
        {
            Serializer.Serialize(outfile, htCacheItems);
        }
        htCacheItems = null;
        using (Stream inFile = File.OpenRead("raw.bin"))
        {
            htCacheItems = Serializer.Deserialize<Dictionary<string, CacheItem>>(inFile);
        }
        Console.WriteLine(GetChecksum(htCacheItems));
    }

    static int GetChecksum(Dictionary<string, CacheItem> data)
    {
        int chk = data.Count;
        foreach (var item in data)
        {
            chk += item.Key.GetHashCode()
                + item.Value.AnotherNumber + item.Value.Id;
            foreach (var subItem in item.Value.Data.Values)
            {
                chk += subItem.Key.GetHashCode()
                    + subItem.Value.GetHashCode();
            }
        }
        return chk;
    }

    static string NextString(this Random random)
    {
        const string alphabet = "abcdefghijklmnopqrstuvwxyz0123456789 ";
        int len = random.Next(4, 10);
        char[] buffer = new char[len];
        for (int i = 0; i < len; i++)
        {
            buffer[i] = alphabet[random.Next(0, alphabet.Length)];
        }
        return new string(buffer);
    }
}

Serialization is tricky, particularly when you want to have some degree of flexibility when it comes to versioning.
Usually there's a trade-off between portability and flexibility of what you can serialize. For example, you might want to use Protocol Buffers (disclaimer: I wrote one of the C# ports) as a pretty efficient solution with good portability and versioning - but then you'll need to translate whatever your natural data structure is into something supported by Protocol Buffers.
Having said that, I'm surprised that binary serialization is failing here - at least in that particular way. Can you get it to fail with a large file with a very, very simple piece of serialization code? (No resolution handlers, no compression etc.)
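Something along these lines would do as a starting point (a minimal sketch, not tested against the original data):
// Minimal repro sketch: serialise a large Hashtable with no compression and
// no assembly-resolve handlers, then immediately deserialise it from the
// same file in the same process.
using System;
using System.Collections;
using System.IO;
using System.Runtime.Serialization.Formatters.Binary;

static class Repro
{
    static void Main()
    {
        var table = new Hashtable();
        for (int i = 0; i < 400000; i++)
            table[i.ToString()] = new string('x', 500); // roughly 200 MB of payload

        var formatter = new BinaryFormatter();
        using (var fs = File.Create("repro.bin"))
            formatter.Serialize(fs, table);

        using (var fs = File.OpenRead("repro.bin"))
        {
            var roundTripped = (Hashtable)formatter.Deserialize(fs);
            Console.WriteLine(roundTripped.Count);
        }
    }
}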

Something that could help is cascade serializing.
You call mainHashtable.serialize(), which returns an XML string, for example. This method calls everyItemInYourHashtable.serialize(), and so on.
You do the same with a static method in every class, called 'unserialize(String xml)', which deserializes your objects and returns an object, or a list of objects.
You get the point?
Of course, you need to implement these methods in every class you want to be serializable.
Take a look at the ISerializable interface, which represents exactly what I'm describing. IMO, this interface looks too "Microsoft" (no use of DOM, etc.), so I created my own, but the principle is the same: cascade.
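A minimal sketch of the built-in ISerializable cascade (illustrative only; the class and members here are invented):
using System;
using System.Runtime.Serialization;

// GetObjectData is called on the parent, and the formatter recurses into its
// children; the special constructor runs the same cascade in reverse on
// deserialisation.
[Serializable]
class CacheItem : ISerializable
{
    public string Name { get; private set; }
    public CacheItem Child { get; private set; } // cascades automatically

    public CacheItem(string name, CacheItem child)
    {
        Name = name;
        Child = child;
    }

    // Called during serialisation.
    public void GetObjectData(SerializationInfo info, StreamingContext context)
    {
        info.AddValue("Name", Name);
        info.AddValue("Child", Child); // the formatter recurses into Child
    }

    // Called during deserialisation.
    protected CacheItem(SerializationInfo info, StreamingContext context)
    {
        Name = info.GetString("Name");
        Child = (CacheItem)info.GetValue("Child", typeof(CacheItem));
    }
}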

Related

C# comfortable way to construct byte array from different objects

I am looking for a fast and simple way to implement this paradigm:
MyByteArray mb = new MyByteArray();
mb.Add<byte>(bytevalue);
mb.Add<float>(floatvalue);
mb.Add<string>(str);
mb.Add<MyClass>(myObject);
And then get byte[] from mb to send it as a byte packet via RPC call (to be decoded on the other side using the same technique).
I've found MemoryStream, but it looks like too much overhead for this simple operation.
Can you help me? Thank you.
What you are looking for is BinaryWriter. It still needs a Stream to write to, though, and the only Stream that fits your need is MemoryStream.
Are you afraid of the performance overhead? You can create your MemoryStream from an existing byte array:
byte[] buffer = new byte[1024];
using (var memoryStream = new MemoryStream(buffer))
{
    using (var binaryWriter = new BinaryWriter(memoryStream))
    {
        binaryWriter.Write(1.2F);  // float
        binaryWriter.Write(1.9D);  // double
        binaryWriter.Write(1);     // integer
        binaryWriter.Write("str"); // string
    }
}
// buffer is filled with your data now.
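Since the question mentions decoding on the other side, the matching read looks like this (a sketch; BinaryWriter/BinaryReader have no self-describing format, so both sides must agree on the layout):
// Read the values back in exactly the order they were written.
using (var memoryStream = new MemoryStream(buffer))
using (var binaryReader = new BinaryReader(memoryStream))
{
    float f = binaryReader.ReadSingle();  // 1.2F
    double d = binaryReader.ReadDouble(); // 1.9D
    int i = binaryReader.ReadInt32();     // 1
    string s = binaryReader.ReadString(); // "str"
}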
A tricky way to achieve this is to use a combination of built-in classes in .NET:
class Program
{
    static void Main(string[] args)
    {
        Program program = new Program();
        var listBytes = new List<byte>();
        // CastToBytes returns a byte[], so append with AddRange rather than Add
        listBytes.AddRange(program.CastToBytes("test"));
        listBytes.AddRange(program.CastToBytes(5));
    }
Note: for a custom object you have to define an implicit operator specifying how the properties, or the object as a whole, should be converted (see the sketch after the method below).
    public byte[] CastToBytes<T>(T value)
    {
        // this will cover most of the primitive types
        if (typeof(T).IsValueType)
        {
            return BitConverter.GetBytes((dynamic)value);
        }
        if (typeof(T) == typeof(string))
        {
            return Encoding.UTF8.GetBytes((dynamic)value);
        }
        // for a custom object you have to define the rules
        else
        {
            var formatter = new BinaryFormatter();
            var memoryStream = new MemoryStream();
            formatter.Serialize(memoryStream, value);
            // ToArray() copies only the bytes written; GetBuffer() would also
            // return the unused capacity of the underlying buffer
            return memoryStream.ToArray();
        }
    }
}
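For the custom-object case from the note above, the implicit operator might look like this (a sketch; MyClass and its layout are invented for illustration):
// Hypothetical example: a fixed layout of 4 bytes of Id followed by
// 4 bytes of Score, exposed as an implicit conversion to byte[].
class MyClass
{
    public int Id;
    public float Score;

    public static implicit operator byte[](MyClass value)
    {
        var bytes = new byte[8];
        BitConverter.GetBytes(value.Id).CopyTo(bytes, 0);
        BitConverter.GetBytes(value.Score).CopyTo(bytes, 4);
        return bytes;
    }
}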
This looks like a case for Protocol Buffers; you could look at protobuf-net.
First, let's decorate the classes.
[ProtoContract]
class User
{
    [ProtoMember(1)]
    public int Id { get; set; }
    [ProtoMember(2)]
    public string Name { get; set; }
}

[ProtoContract]
class Message
{
    [ProtoMember(1)]
    public byte Type { get; set; }
    [ProtoMember(2)]
    public float Value { get; set; }
    [ProtoMember(3)]
    public User Sender { get; set; }
}
Then we create our message.
var msg = new Message
{
    Type = 1,
    Value = 1.1f,
    Sender = new User
    {
        Id = 8,
        Name = "user"
    }
};
And now, we can use ProtoBuf's serializer to do all our work.
// memory
using (var mem = new MemoryStream())
{
    Serializer.Serialize<Message>(mem, msg);
    var bytes = mem.ToArray(); // ToArray() returns only the bytes written, unlike GetBuffer()
}
// file
using (var file = File.Create("message.bin")) Serializer.Serialize<Message>(file, msg);
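Reading it back is symmetric (a short sketch):
// Deserialise the same message from the file written above.
Message roundTripped;
using (var file = File.OpenRead("message.bin"))
{
    roundTripped = Serializer.Deserialize<Message>(file);
}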

Is there a size limit for a property to be serialized?

I'm working against an interface that requires an XML document. So far I've been able to serialize most of the objects using XmlSerializer. However, there is one property that is proving problematic. It is supposed to be a collection of objects that wrap a document. The document itself is encoded as a base64 string.
The basic structure is like this:
//snipped out of a parent object
public List<Document> DocumentCollection { get; set; }
//end snip

public class Document
{
    public string DocumentTitle { get; set; }
    public Code DocumentCategory { get; set; }
    /// <summary>
    /// Base64 encoded file
    /// </summary>
    public string BinaryDocument { get; set; }
    public string DocumentTypeText { get; set; }
}
The problem is that smaller values work fine, but if the document is too big the serializer just skips over that document item in the collection.
Is there some limitation that I'm bumping up against?
Update: I changed
public string BinaryDocument { get; set; }
to
public byte[] BinaryDocument { get; set; }
and I'm still getting the same result. The smaller document (~150kb) is serializing just fine, but the rest aren't. To be clear, it's not just the value of the property, it's the entire containing Document object that gets dropped.
UPDATE 2:
Here's the serialization code with a simple repro, out of a console project I put together. The problem is that this code works fine in the test project. I'm having difficulty packing the full object structure in here because it's near impossible to use the actual objects in a test case, given the complexity of filling the fields, so I tried to cut down the code in the main application. The populated object goes into the serialization code with the DocumentCollection filled with four Documents and comes out with one Document.
using System.Collections.Generic;
using System.IO;
using System.Text;
using System.Xml;
using System.Xml.Serialization;

namespace ConsoleApplication2
{
    class Program
    {
        static void Main(string[] args)
        {
            var container = new DocumentContainer();
            var docs = new List<Document>();
            foreach (var f in Directory.GetFiles(@"E:\Software Projects\DA\Test Documents"))
            {
                var fileStream = new MemoryStream(File.ReadAllBytes(f));
                var doc = new Document
                {
                    BinaryDocument = fileStream.ToArray(),
                    DocumentTitle = Path.GetFileName(f)
                };
                docs.Add(doc);
            }
            container.DocumentCollection = docs;
            var serializer = new XmlSerializer(typeof(DocumentContainer));
            var ms = new MemoryStream();
            var writer = XmlWriter.Create(ms);
            serializer.Serialize(writer, container);
            writer.Flush();
            ms.Seek(0, SeekOrigin.Begin);
            var reader = new StreamReader(ms, Encoding.UTF8);
            File.WriteAllText(@"C:\temp\testexport.xml", reader.ReadToEnd());
        }
    }

    public class Document
    {
        public string DocumentTitle { get; set; }
        public byte[] BinaryDocument { get; set; }
    }

    // test class
    public class DocumentContainer
    {
        public List<Document> DocumentCollection { get; set; }
    }
}
XmlSerializer has no limit on the length of a string it can serialize.
.NET, however, has a maximum string length of int.MaxValue. Furthermore, since a string is internally implemented as a contiguous memory buffer, in a 32-bit process you're likely to be unable to allocate a string anywhere near that large due to process-space fragmentation. And since a C# base64 string requires roughly 2.67 times the memory of the byte[] array from which it was created (1.33 for the encoding, times 2 since the .NET char type is actually two bytes), you might be getting an OutOfMemoryException while encoding a large binary document as a complete base64 string, then swallowing and ignoring it, leaving the BinaryDocument property null.
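To put rough numbers on that 2.67x figure (illustrative arithmetic, not from the original answer):
// base64 emits 4 chars per 3 input bytes, and each .NET char is 2 bytes of UTF-16.
long documentBytes = 300L * 1024 * 1024;               // a hypothetical 300 MB document
long base64Chars = 4 * ((documentBytes + 2) / 3);      // ~420M characters
long stringBytes = base64Chars * sizeof(char);         // ~800 MB of string memory
Console.WriteLine((double)stringBytes / documentBytes); // ~2.67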
That being said, there is no reason for you to manually encode your binary documents into base64, because XmlSerializer does this for you automatically. I.e. if I serialize the following class:
public class Document
{
    public string DocumentTitle { get; set; }
    public Code DocumentCategory { get; set; }
    public byte[] BinaryDocument { get; set; }
    public string DocumentTypeText { get; set; }
}
I get the following XML:
<Document>
    <DocumentTitle>my title</DocumentTitle>
    <DocumentCategory>Default</DocumentCategory>
    <BinaryDocument>AAECAwQFBgcICQoLDA0ODxAREhM=</BinaryDocument>
    <DocumentTypeText>document text type</DocumentTypeText>
</Document>
As you can see, BinaryDocument is base64 encoded. Thus you should be able to keep your binary documents in a more compact byte [] representation and still get the XML output you want.
Even better, under the covers, XmlWriter uses System.Xml.Base64Encoder to do this. This class encodes its inputs in chunks, thereby avoiding the excessive memory use and potential out-of-memory exceptions described above.
I can't reproduce the problem you are having. Even with individual files as large as 267 MB to 1.92 GB, I'm not seeing any elements being skipped. The only problem I am seeing is that the temporary var ms = new MemoryStream(); exceeds its 2 GB buffer limit eventually, whereupon an exception gets thrown. I replaced this with a direct stream, and that problem went away:
using (var stream = File.Open(outputPath, FileMode.Create, FileAccess.ReadWrite))
That being said, your design will eventually run up against memory limits for a sufficiently large number of sufficiently large files, since you load all of them into memory before serializing. If this is happening, somewhere in your production code you may be catching and swallowing the OutOfMemoryException without realizing it, leading to the problem you are seeing.
As an alternative, I would suggest a streaming solution where you incrementally copy each file's contents to the XML output from within XmlSerializer by making your Document class implement IXmlSerializable:
public class Document : IXmlSerializable
{
    public string DocumentPath { get; set; }

    public string DocumentTitle
    {
        get
        {
            if (DocumentPath == null)
                return null;
            return Path.GetFileName(DocumentPath);
        }
    }

    const string DocumentTitleName = "DocumentTitle";
    const string BinaryDocumentName = "BinaryDocument";

    #region IXmlSerializable Members

    System.Xml.Schema.XmlSchema IXmlSerializable.GetSchema()
    {
        return null;
    }

    void ReadXmlElement(XmlReader reader)
    {
        if (reader.Name == DocumentTitleName)
            DocumentPath = reader.ReadElementContentAsString();
    }

    void IXmlSerializable.ReadXml(XmlReader reader)
    {
        reader.ReadXml(null, ReadXmlElement);
    }

    void IXmlSerializable.WriteXml(XmlWriter writer)
    {
        writer.WriteElementString(DocumentTitleName, DocumentTitle ?? "");
        if (DocumentPath != null)
        {
            try
            {
                using (var stream = File.OpenRead(DocumentPath))
                {
                    // Write the start element if the file was successfully opened
                    writer.WriteStartElement(BinaryDocumentName);
                    try
                    {
                        var buffer = new byte[6 * 1024];
                        int read;
                        while ((read = stream.Read(buffer, 0, buffer.Length)) > 0)
                            writer.WriteBase64(buffer, 0, read);
                    }
                    finally
                    {
                        // Write the end element even if an error occurred while streaming the file.
                        writer.WriteEndElement();
                    }
                }
            }
            catch (Exception ex)
            {
                // You could log the exception as an element or as a comment, as you prefer.
                // Log as a comment
                writer.WriteComment("Caught exception with message: " + ex.Message);
                writer.WriteComment("Exception details:");
                writer.WriteComment(ex.ToString());
                // Log as an element.
                writer.WriteElementString("ExceptionMessage", ex.Message);
                writer.WriteElementString("ExceptionDetails", ex.ToString());
            }
        }
    }

    #endregion
}
// test class
public class DocumentContainer
{
    public List<Document> DocumentCollection { get; set; }
}

public static class XmlSerializationExtensions
{
    public static void ReadXml(this XmlReader reader, Action<IList<XAttribute>> readXmlAttributes, Action<XmlReader> readXmlElement)
    {
        if (reader.NodeType != XmlNodeType.Element)
            throw new InvalidOperationException("reader.NodeType != XmlNodeType.Element");
        if (readXmlAttributes != null)
        {
            var attributes = new List<XAttribute>(reader.AttributeCount);
            while (reader.MoveToNextAttribute())
            {
                attributes.Add(new XAttribute(XName.Get(reader.Name, reader.NamespaceURI), reader.Value));
            }
            // Move the reader back to the element node.
            reader.MoveToElement();
            readXmlAttributes(attributes);
        }
        if (reader.IsEmptyElement)
        {
            reader.Read();
            return;
        }
        reader.ReadStartElement(); // Advance to the first sub element of the wrapper element.
        while (reader.NodeType != XmlNodeType.EndElement)
        {
            if (reader.NodeType != XmlNodeType.Element)
                // Comment, whitespace
                reader.Read();
            else
            {
                using (var subReader = reader.ReadSubtree())
                {
                    while (subReader.NodeType != XmlNodeType.Element) // Read past XmlNodeType.None
                        if (!subReader.Read())
                            break;
                    if (readXmlElement != null)
                        readXmlElement(subReader);
                }
                reader.Read();
            }
        }
        // Move past the end of the wrapper element
        reader.ReadEndElement();
    }
}
Then use it as follows:
public static void SerializeFilesToXml(string directoryPath, string xmlPath)
{
    var docs = from file in Directory.GetFiles(directoryPath)
               select new Document { DocumentPath = file };
    var container = new DocumentContainer { DocumentCollection = docs.ToList() };
    using (var stream = File.Open(xmlPath, FileMode.Create, FileAccess.ReadWrite))
    using (var writer = XmlWriter.Create(stream, new XmlWriterSettings { Indent = true, IndentChars = " " }))
    {
        new XmlSerializer(container.GetType()).Serialize(writer, container);
    }
    Debug.WriteLine("Wrote " + xmlPath);
}
Using the streaming solution, when serializing 4 files of around 250 MB each, my memory use went up by 0.8 MB. Using the original classes, my memory went up by 1022 MB.
Update
If you need to write your XML to a memory stream, be aware that the C# MemoryStream has a hard maximum stream length of int.MaxValue (i.e. 2 GB) because the underlying memory is simply a byte array. On a 32-bit process the effective max length will be much smaller; see OutOfMemoryException while populating MemoryStream: 256MB allocation on 16GB system.
To programmatically check to see if your process is actually 32 bit, see How to determine programmatically whether a particular process is 32-bit or 64-bit. To change to 64 bit, see What is the purpose of the “Prefer 32-bit” setting in Visual Studio 2012 and how does it actually work?.
If you are sure you are running in 64 bit mode and are still exceeding the hard size limits of a MemoryStream, perhaps see alternative to MemoryStream for large data volumes or MemoryStream replacement?.

protobuf-net and de/serializing objects

I am checking whether protobuf-net can be an in-place replacement for DataContracts. Besides the excellent performance, it is a really neat library. The only issue I have is that the .NET serializers do not make any assumptions about what they are currently de/serializing; objects that contain a reference typed as object are a particular problem.
[DataMember(Order = 3)]
public object Tag1 // The DataContract did contain an object which now becomes a SimulatedObject
{
    get;
    set;
}
I tried to mimic object with Protocol Buffers using a little generic helper that stores each possible type in a different, strongly typed field.
Is this a recommended approach to dealing with fields that de/serialize into a number of different, unrelated types?
Below is the sample code for a SimulatedObject which can hold up to 10 different types.
using System;
using System.Collections.Generic;
using System.IO;
using System.Runtime.Serialization;
using ProtoBuf;
using System.Diagnostics;
[DataContract]
public class SimulatedObject<T1, T2, T3, T4, T5, T6, T7, T8, T9, T10>
{
    [DataMember(Order = 20)]
    byte FieldHasValue; // the number indicates which field actually has a value

    [DataMember(Order = 1)]
    T1 I1;
    [DataMember(Order = 2)]
    T2 I2;
    [DataMember(Order = 3)]
    T3 I3;
    [DataMember(Order = 4)]
    T4 I4;
    [DataMember(Order = 5)]
    T5 I5;
    [DataMember(Order = 6)]
    T6 I6;
    [DataMember(Order = 7)]
    T7 I7;
    [DataMember(Order = 8)]
    T8 I8;
    [DataMember(Order = 9)]
    T9 I9;
    [DataMember(Order = 10)]
    T10 I10;

    public object Data
    {
        get
        {
            switch (FieldHasValue)
            {
                case 0: return null;
                case 1: return I1;
                case 2: return I2;
                case 3: return I3;
                case 4: return I4;
                case 5: return I5;
                case 6: return I6;
                case 7: return I7;
                case 8: return I8;
                case 9: return I9;
                case 10: return I10;
                default:
                    throw new NotSupportedException(String.Format("The FieldHasValue field has an invalid value {0}. This indicates corrupt data or incompatible data layout changes.", FieldHasValue));
            }
        }
        set
        {
            I1 = default(T1);
            I2 = default(T2);
            I3 = default(T3);
            I4 = default(T4);
            I5 = default(T5);
            I6 = default(T6);
            I7 = default(T7);
            I8 = default(T8);
            I9 = default(T9);
            I10 = default(T10);
            if (value != null)
            {
                Type t = value.GetType();
                if (t == typeof(T1))
                {
                    FieldHasValue = 1;
                    I1 = (T1)value;
                }
                else if (t == typeof(T2))
                {
                    FieldHasValue = 2;
                    I2 = (T2)value;
                }
                else if (t == typeof(T3))
                {
                    FieldHasValue = 3;
                    I3 = (T3)value;
                }
                else if (t == typeof(T4))
                {
                    FieldHasValue = 4;
                    I4 = (T4)value;
                }
                else if (t == typeof(T5))
                {
                    FieldHasValue = 5;
                    I5 = (T5)value;
                }
                else if (t == typeof(T6))
                {
                    FieldHasValue = 6;
                    I6 = (T6)value;
                }
                else if (t == typeof(T7))
                {
                    FieldHasValue = 7;
                    I7 = (T7)value;
                }
                else if (t == typeof(T8))
                {
                    FieldHasValue = 8;
                    I8 = (T8)value;
                }
                else if (t == typeof(T9))
                {
                    FieldHasValue = 9;
                    I9 = (T9)value;
                }
                else if (t == typeof(T10))
                {
                    FieldHasValue = 10;
                    I10 = (T10)value;
                }
                else
                {
                    throw new NotSupportedException(String.Format("The type {0} is not supported for serialization. Please add the type to the SimulatedObject generic argument list.", t.FullName));
                }
            }
        }
    }
}
[DataContract]
class Customer
{
    /*
    [DataMember(Order = 3)]
    public object Tag1 // The DataContract did contain an object which now becomes a SimulatedObject
    {
        get;
        set;
    }
    */

    [DataMember(Order = 3)]
    public SimulatedObject<bool, Other, Other, Other, Other, Other, Other, Other, Other, SomethingDifferent> Tag1 // Can contain up to 10 different types
    {
        get;
        set;
    }

    [DataMember(Order = 4)]
    public List<string> Strings
    {
        get;
        set;
    }
}

[DataContract]
public class Other
{
    [DataMember(Order = 1)]
    public string OtherData
    {
        get;
        set;
    }
}

[DataContract]
public class SomethingDifferent
{
    [DataMember(Order = 1)]
    public string OtherData
    {
        get;
        set;
    }
}
class Program
{
    static void Main(string[] args)
    {
        Customer c = new Customer
        {
            Strings = new List<string> { "First", "Second", "Third" },
            Tag1 = new SimulatedObject<bool, Other, Other, Other, Other, Other, Other, Other, Other, SomethingDifferent>
            {
                Data = new Other { OtherData = "String value " }
            }
        };
        const int Runs = 1000 * 1000;
        var stream = new MemoryStream();
        var sw = Stopwatch.StartNew();
        Serializer.Serialize<Customer>(stream, c);
        sw = Stopwatch.StartNew();
        for (int i = 0; i < Runs; i++)
        {
            stream.Position = 0;
            stream.SetLength(0);
            Serializer.Serialize<Customer>(stream, c);
        }
        sw.Stop();
        Console.WriteLine("Data Size with Protocol buffer Serializer: {0}, {1} objects did take {2}s", stream.ToArray().Length, Runs, sw.Elapsed.TotalSeconds);
        stream.Position = 0;
        var newCustw = Serializer.Deserialize<Customer>(stream);
        sw = Stopwatch.StartNew();
        for (int i = 0; i < Runs; i++)
        {
            stream.Position = 0;
            var newCust = Serializer.Deserialize<Customer>(stream);
        }
        sw.Stop();
        Console.WriteLine("Read object with Protocol buffer deserializer: {0} objects did take {1}s", Runs, sw.Elapsed.TotalSeconds);
    }
}
}
No, this solution will be hard to maintain in the long term.
I recommend that you prepend the full name of the serialized type to the serialized data during serialization, and read the type name back at the beginning of deserialization (no need to change the protobuf-net source code).
As a side note, you should try to avoid mixing object types in the deserialization process. I'm assuming you are updating an existing .NET application and can't re-design it.
Update: Sample code
public byte[] Serialize(object myObject)
{
    using (var ms = new MemoryStream())
    {
        Type type = myObject.GetType();
        var id = System.Text.ASCIIEncoding.ASCII.GetBytes(type.FullName + '|');
        ms.Write(id, 0, id.Length);
        Serializer.Serialize(ms, myObject);
        var bytes = ms.ToArray();
        return bytes;
    }
}

public object Deserialize(byte[] serializedData)
{
    StringBuilder sb = new StringBuilder();
    using (var ms = new MemoryStream(serializedData))
    {
        while (true)
        {
            var currentChar = (char)ms.ReadByte();
            if (currentChar == '|')
            {
                break;
            }
            sb.Append(currentChar);
        }
        string typeName = sb.ToString();
        // assuming that the calling assembly contains the desired type.
        // You can include additional assembly information if necessary
        Type deserializationType = Assembly.GetCallingAssembly().GetType(typeName);
        MethodInfo mi = typeof(Serializer).GetMethod("Deserialize");
        MethodInfo genericMethod = mi.MakeGenericMethod(new[] { deserializationType });
        return genericMethod.Invoke(null, new[] { ms });
    }
}
I'm working on something similar now and have already provided the first version of the lib:
http://bitcare.codeplex.com/
The current version doesn't support generics yet, but I plan to add that in the near future.
I uploaded source code only there; when I'm ready with generics I'll prepare a binary version as well...
It assumes both sides (client and server) know what they serialize/deserialize, so there is no reason to embed full metadata. Because of this, serialization results are very small and the generated serializers work very fast. It has data dictionaries, uses smart data storage (in short, it stores only the important bits) and does final compression when necessary. If you need it, just try whether it solves your problem.
The license is GPL, but I will soon change it to a less restrictive one (free for commercial usage as well, but at your own risk, as with the GPL).
The version I uploaded to CodePlex is working with some of my products. It's tested with different sets of unit tests, of course. They are not uploaded there because I ported it to VS2012 and .NET 4.5 and decided to create new sets of test cases for the upcoming release.
I don't deal with abstract (so-called open) generics; I process parameterized generics. From a data contract point of view, parameterized generics are just specialized classes, so I can process them as usual (like other classes); the difference is only in object construction and storage optimizations.
When I store information about a null value in a Nullable<> it takes only one bit in the storage stream; if the value is not null, I serialize according to the type provided as the generic parameter (so serializing a DateTime, for instance, can take anything from one bit, for the so-called default value, to a few bytes). The goal was to generate serialization code according to the current knowledge about data contracts on classes, instead of doing it on the fly and wasting memory and processing power. When I see a property in some class based on some generic during code generation, I know all the properties of that generic and the type of every property :) From this point of view it's a concrete class.
I will change the license soon. I first have to figure out how to do it :), because I see it's possible to choose from a list of provided license types but I can't supply my own license text. The license of Newtonsoft.Json is what I would like to have as well, but I don't know yet how to change to it...
The documentation has not been provided there yet, but in short it's easy to prepare your own serialization tests. You compile an assembly with the types you want to store/serialize efficiently, then create *.tt files in your serialization library (as for a Person class - it checks dependencies and generates code for other dependent classes as well) and save the files (when you save them, all the code for cooperation with the serialization library is generated). You can also create a task in your build config to regenerate the source code from the tt files every time you build the solution (Entity Framework probably generates its code in a similar way during the build).
You can now compile your serialization library and measure the performance and size of the results.
I need this serialization library for my framework for efficient usage of entities with Azure Table and Blob storage, so I plan to finish the initial release soon...

Use Protobuf-net to stream large data files as IEnumerable

I'm trying to use Protobuf-net to save and load data to disk but got stuck.
I have a portfolio of assets that I need to process, and I want to be able to do that as fast as possible. I can already read from a CSV but it would be faster to use a binary file, so I'm looking into Protobuf-Net.
I can't fit all assets into memory so I want to stream them, not load them all into memory.
So what I need to do is expose a large set of records as an IEnumerable. Is this possible with Protobuf-Net? I've tried a couple of things but haven't been able to get it running.
Serializing seems to work, but I haven't been able to read the assets back in again; I get 0 assets back. Could someone point me in the right direction, please? I've looked at the methods in the Serializer class but can't find any that covers this case. Is this use-case supported by Protobuf-net? I'm using v2, by the way.
Thanks in advance,
Gert-Jan
Here's some sample code I tried:
public partial class MainWindow : Window
{
    // Generate x Assets
    IEnumerable<Asset> GenerateAssets(int Count)
    {
        var rnd = new Random();
        for (int i = 1; i < Count; i++)
        {
            yield return new Asset
            {
                ID = i,
                EAD = i * 12345,
                LGD = (float)rnd.NextDouble(),
                PD = (float)rnd.NextDouble()
            };
        }
    }

    // write assets to file
    private void Write(string path, IEnumerable<Asset> assets)
    {
        using (var file = File.Create(path))
        {
            Serializer.Serialize<IEnumerable<Asset>>(file, assets);
        }
    }

    // read assets from file
    IEnumerable<Asset> Read(string path)
    {
        using (var file = File.OpenRead(path))
        {
            return Serializer.DeserializeItems<Asset>(file, PrefixStyle.None, -1);
        }
    }

    // try it
    private void Test()
    {
        Write("Data.bin", GenerateAssets(100)); // this creates a file with binary gibberish that I assume are the assets
        var x = Read("Data.bin");
        MessageBox.Show(x.Count().ToString()); // returns 0 instead of 100
    }

    public MainWindow()
    {
        InitializeComponent();
    }

    private void button2_Click(object sender, RoutedEventArgs e)
    {
        Test();
    }
}

[ProtoContract]
class Asset
{
    [ProtoMember(1)]
    public int ID { get; set; }
    [ProtoMember(2)]
    public double EAD { get; set; }
    [ProtoMember(3)]
    public float LGD { get; set; }
    [ProtoMember(4)]
    public float PD { get; set; }
}
Figured it out. To deserialize, use PrefixStyle.Base128, which apparently is the default.
Now it works like a charm!
GJ
using (var file = File.Create("Data.bin"))
{
    Serializer.Serialize<IEnumerable<Asset>>(file, Generate(10));
}

using (var file = File.OpenRead("Data.bin"))
{
    var ps = Serializer.DeserializeItems<Asset>(file, PrefixStyle.Base128, 1);
    int i = ps.Count(); // got them all back :-)
}
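For reference, the writing side can also emit items one at a time with the matching length-prefix API, so neither side needs the whole set materialised (a sketch, assuming protobuf-net v2's SerializeWithLengthPrefix):
// Write each asset individually with the same Base128 prefix and field
// number 1, which DeserializeItems pairs with on the read side.
using (var file = File.Create("Data.bin"))
{
    foreach (var asset in GenerateAssets(10))
    {
        Serializer.SerializeWithLengthPrefix(file, asset, PrefixStyle.Base128, 1);
    }
}

using (var file = File.OpenRead("Data.bin"))
{
    foreach (var asset in Serializer.DeserializeItems<Asset>(file, PrefixStyle.Base128, 1))
    {
        // process each asset as it streams in, without holding them all in memory
    }
}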

Serializing to XML via DataContract: custom output?

I have a custom Fraction class, which I'm using throughout my whole project. It's simple: a single constructor accepts two ints and stores them. I'd like to use the DataContractSerializer to serialize the objects used in my project, some of which include Fractions as fields. Ideally, I'd like to be able to serialize such objects like this:
<Object>
    ...
    <Frac>1/2</Frac> <!-- "1/2" would get converted back into a Fraction on deserialization. -->
    ...
</Object>
As opposed to this:
<Object>
    ...
    <Frac>
        <Numerator>1</Numerator>
        <Denominator>2</Denominator>
    </Frac>
    ...
</Object>
Is there any way to do this using DataContracts?
I'd like to do this because I plan on making the XML files user-editable (I'm using them as input for a music game, and they act as notecharts, essentially), and want to keep the notation as terse as possible for the end user, so they won't need to deal with as many walls of text.
EDIT: I should also note that I currently have my Fraction class as immutable (all fields are readonly), so being able to change the state of an existing Fraction wouldn't be possible. Returning a new Fraction object would be OK, though.
If you add a property that represents the Frac element and apply the DataMember attribute to it rather than the other properties you will get what you want I believe:
[DataContract]
public class MyObject
{
    Int32 _Numerator;
    Int32 _Denominator;

    public MyObject(Int32 numerator, Int32 denominator)
    {
        _Numerator = numerator;
        _Denominator = denominator;
    }

    public Int32 Numerator
    {
        get { return _Numerator; }
        set { _Numerator = value; }
    }

    public Int32 Denominator
    {
        get { return _Denominator; }
        set { _Denominator = value; }
    }

    [DataMember(Name = "Frac")]
    public String Fraction
    {
        get { return _Numerator + "/" + _Denominator; }
        set
        {
            String[] parts = value.Split(new char[] { '/' });
            _Numerator = Int32.Parse(parts[0]);
            _Denominator = Int32.Parse(parts[1]);
        }
    }
}
DataContractSerializer will use a custom IXmlSerializable implementation if it is provided in place of a DataContractAttribute. This will allow you to customize the XML formatting in any way you need... but you will have to hand-code the serialization and deserialization process for your class.
public class Fraction : IXmlSerializable
{
    private Fraction()
    {
    }

    public Fraction(int numerator, int denominator)
    {
        this.Numerator = numerator;
        this.Denominator = denominator;
    }

    public int Numerator { get; private set; }
    public int Denominator { get; private set; }

    public XmlSchema GetSchema()
    {
        throw new NotImplementedException();
    }

    public void ReadXml(XmlReader reader)
    {
        var content = reader.ReadInnerXml();
        var parts = content.Split('/');
        Numerator = int.Parse(parts[0]);
        Denominator = int.Parse(parts[1]);
    }

    public void WriteXml(XmlWriter writer)
    {
        writer.WriteRaw(this.ToString());
    }

    public override string ToString()
    {
        return string.Format("{0}/{1}", Numerator, Denominator);
    }
}
[DataContract(Name = "Object", Namespace = "")]
public class MyObject
{
    [DataMember]
    public Fraction Frac { get; set; }
}

class Program
{
    static void Main(string[] args)
    {
        var myobject = new MyObject
        {
            Frac = new Fraction(1, 2)
        };
        var dcs = new DataContractSerializer(typeof(MyObject));
        string xml = null;
        using (var ms = new MemoryStream())
        {
            dcs.WriteObject(ms, myobject);
            xml = Encoding.UTF8.GetString(ms.ToArray());
            Console.WriteLine(xml);
            // <Object><Frac>1/2</Frac></Object>
        }
        using (var ms = new MemoryStream(Encoding.UTF8.GetBytes(xml)))
        {
            ms.Position = 0;
            var obj = dcs.ReadObject(ms) as MyObject;
            Console.WriteLine(obj.Frac);
            // 1/2
        }
    }
}
This MSDN article describes the IDataContractSurrogate interface, which:
Provides the methods needed to substitute one type for another by the
DataContractSerializer during serialization, deserialization, and
export and import of XML schema documents.
Although this comes way too late, it may still help someone. It actually allows you to change the XML for ANY class.
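A sketch of what such a surrogate could look like for the Fraction case above (illustrative; FractionSurrogate is an invented name, and the Fraction members are assumed from the question):
using System;
using System.CodeDom;
using System.Collections.ObjectModel;
using System.Reflection;
using System.Runtime.Serialization;

// Swaps Fraction for a plain "n/d" string on the wire and back again.
class FractionSurrogate : IDataContractSurrogate
{
    public Type GetDataContractType(Type type)
    {
        // Tell the serializer to treat Fraction as string in the contract.
        return type == typeof(Fraction) ? typeof(string) : type;
    }

    public object GetObjectToSerialize(object obj, Type targetType)
    {
        var frac = obj as Fraction;
        return frac != null ? frac.Numerator + "/" + frac.Denominator : obj;
    }

    public object GetDeserializedObject(object obj, Type targetType)
    {
        if (targetType == typeof(Fraction) && obj is string)
        {
            var parts = ((string)obj).Split('/');
            return new Fraction(int.Parse(parts[0]), int.Parse(parts[1]));
        }
        return obj;
    }

    // The remaining members relate to schema export/import and can be inert here.
    public object GetCustomDataToExport(MemberInfo memberInfo, Type dataContractType) { return null; }
    public object GetCustomDataToExport(Type clrType, Type dataContractType) { return null; }
    public void GetKnownCustomDataTypes(Collection<Type> customDataTypes) { }
    public Type GetReferencedTypeOnImport(string typeName, string typeNamespace, object customData) { return null; }
    public CodeTypeDeclaration ProcessImportedType(CodeTypeDeclaration typeDeclaration, CodeCompileUnit compileUnit) { return typeDeclaration; }
}
One of the DataContractSerializer constructor overloads accepts the surrogate as its final argument, e.g. new DataContractSerializer(typeof(MyObject), null, int.MaxValue, false, false, new FractionSurrogate()).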
You can do this with the DataContractSerializer, albeit in a way that feels hacky to me. You can take advantage of the fact that data members can be private variables, and use a private string as your serialized member. The data contract serializer will also execute methods at certain points in the process that are marked with [On(De)Serializ(ed|ing)] attributes - inside of those, you can control how the int fields are mapped to the string, and vice-versa. The downside is that you lose the automatic serialization magic of the DataContractSerializer on your class, and now have more logic to maintain.
Anyways, here's what I would do:
[DataContract]
public class Fraction
{
    [DataMember(Name = "Frac")]
    private string serialized;

    public int Numerator { get; private set; }
    public int Denominator { get; private set; }

    [OnSerializing]
    public void OnSerializing(StreamingContext context)
    {
        // This gets called just before the DataContractSerializer begins.
        serialized = Numerator.ToString() + "/" + Denominator.ToString();
    }

    [OnDeserialized]
    public void OnDeserialized(StreamingContext context)
    {
        // This gets called after the DataContractSerializer finishes its work
        var nums = serialized.Split('/');
        Numerator = int.Parse(nums[0]);
        Denominator = int.Parse(nums[1]);
    }
}
You'll have to switch back to the XmlSerializer to do that. The DataContractSerializer is a bit more restrictive in terms of being able to customise the output.
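For instance, a sketch of the XmlSerializer approach using a shadow [XmlText] property (illustrative only, and it relaxes the immutability mentioned in the question):
using System;
using System.Xml.Serialization;

// XmlSerializer maps the fraction to a single text node through a shadow
// property; the int properties themselves are not serialized.
public class Fraction
{
    [XmlIgnore]
    public int Numerator { get; set; }
    [XmlIgnore]
    public int Denominator { get; set; }

    // Serialized as the element content, e.g. <Frac>1/2</Frac>.
    [XmlText]
    public string Text
    {
        get { return Numerator + "/" + Denominator; }
        set
        {
            var parts = value.Split('/');
            Numerator = int.Parse(parts[0]);
            Denominator = int.Parse(parts[1]);
        }
    }
}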
