protobuf-net stream objects from disk - c#

Consider I have a very large collection (millions) of objects serialized according to the proto wire format. Is it possible to stream these items from the file? I tried serializing the objects as a List<T> and then deserializing a single T item, but it only read the very last item from the stream. I also tried serializing each instance individually to the stream, with the same effect: upon deserialization, only the last item was read.
I suspect the solution requires me to know the size of each serialized item, read that many bytes from the stream, and pass that span of bytes to the protobuf serializer for deserialization. I wanted to make sure there isn't an easier mechanism, one that doesn't require knowing the length of each individual item (which may differ for each instance of the object), to accomplish this task.
Another thought I had was including the size of each upcoming object as its own entry in the stream, for example:
0: meta-information for the first object, including type/length in bytes
1: object defined in 0
2: meta-information for the second object, including type/length in bytes
3: object defined in 2
4: ...etc
Version information:
I'm currently using dotnet core 3.1 and protobuf-net version 2.4.4

In protobuf, the root object is not terminated by default, with the intent being to allow "merge" === "append". This conflicts with the very common scenario you are describing. Fortunately, many libraries provide a mechanism to encode the length before the object for exactly this reason. What you are looking for is the SerializeWithLengthPrefix and DeserializeWithLengthPrefix methods.
If the data already exists as flat appends, and cannot be rewritten: there are still ways to recover it, by using the reader API. A bit more complex, but I've recovered such data in the past for people when needed.
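For illustration, a minimal sketch of the length-prefix approach; MyItem and Process are hypothetical placeholders for your own [ProtoContract] type and per-item handler:

using System.Collections.Generic;
using System.IO;
using ProtoBuf;

static void WriteAll(IEnumerable<MyItem> items, string path)
{
    using (var file = File.Create(path))
    {
        foreach (var item in items)
        {
            // field-number 1, base-128 prefix; the stream remains valid protobuf
            Serializer.SerializeWithLengthPrefix(file, item, PrefixStyle.Base128, 1);
        }
    }
}

static IEnumerable<MyItem> ReadAll(string path)
{
    using (var file = File.OpenRead(path))
    {
        // DeserializeItems yields lazily: millions of items are streamed
        // one at a time, never all in memory at once
        foreach (var item in Serializer.DeserializeItems<MyItem>(file, PrefixStyle.Base128, 1))
        {
            yield return item;
        }
    }
}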

Related

Using protobuf-net to write fixed size objects in parts and read them one-by-one

Use Case Description
I receive the collections in chunks from a server, and I want to write them to a file in a way that lets me read them back one-by-one later. My objects are fixed size, meaning the class only contains fields of type double, long and DateTime.
I already serialize and deserialize objects using the methods below at different places in my project:
public static T Deserialize<T>(byte[] buffer)
{
    using (MemoryStream stream = new MemoryStream(buffer))
    {
        return Serializer.Deserialize<T>(stream);
    }
}

public static byte[] Serialize<T>(T message)
{
    using (MemoryStream stream = new MemoryStream())
    {
        Serializer.Serialize(stream, message);
        return stream.ToArray();
    }
}
But even if this could work, I still think it would produce a larger output file, because I believe protobuf stores some information about field names (in its own way). By contrast, I could create the byte[] using BinaryWriter without any field-name information. I know I would need to make sure I read the fields back in the right order, but this could still make a meaningful difference to the output file size, especially when the number of objects in the collection is really huge.
Do you think there is a way to efficiently write collections in parts, read them back one-by-one, and keep both the output file size and the memory footprint while reading to a minimum? My collections are really large, containing years of market data that I need to read and process. I only need to read each object once, process it, and forget about it; I have no need to keep objects in memory.
Protobuf doesn't store field names, but it does use a field prefix that is an encoded integer. For storing multiple objects, you would typically use the *WithLengthPrefix overloads; in particular, DateTime has no reliable fixed length encoding.
However! In your case, perhaps a serializer isn't the right tool. I would consider:
creating a readonly struct composed of a double and two longs (or three longs if you need high-precision epoch time)
using a memory-mapped file to access the file system directly
creating a Span<byte> over the memory-mapped file (or a section thereof)
coercing the Span<byte> to a Span<YourStruct> using MemoryMarshal.Cast
et voila, direct access to your values all the way down to the file system. A sketch of these steps follows below.
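A rough sketch of those steps, assuming a hypothetical MarketTick struct and compilation with /unsafe; note a single Span is limited to about 2 GB, so a very large file would be processed in windowed sections rather than one view:

using System;
using System.IO.MemoryMappedFiles;
using System.Runtime.InteropServices;

[StructLayout(LayoutKind.Sequential)]
readonly struct MarketTick
{
    public readonly double Price;        // 8 bytes
    public readonly long Volume;         // 8 bytes
    public readonly long TimestampTicks; // DateTime stored as ticks, 8 bytes
}

static unsafe void ProcessAll(string path)
{
    using (var mmf = MemoryMappedFile.CreateFromFile(path))
    using (var accessor = mmf.CreateViewAccessor())
    {
        byte* ptr = null;
        accessor.SafeMemoryMappedViewHandle.AcquirePointer(ref ptr);
        try
        {
            int length = (int)accessor.SafeMemoryMappedViewHandle.ByteLength;
            // reinterpret the raw bytes as structs: no copying, no parsing
            Span<MarketTick> ticks = MemoryMarshal.Cast<byte, MarketTick>(
                new Span<byte>(ptr, length));
            foreach (ref readonly var tick in ticks)
            {
                // process tick, then forget it
            }
        }
        finally
        {
            accessor.SafeMemoryMappedViewHandle.ReleasePointer();
        }
    }
}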

Generate truncated JSON using JSON.NET

Given a .NET object, I would like to serialize it to a JSON string, but truncated to a specific length (e.g. 100 characters).
Is there an efficient way of doing that which does not involve serializing the entire object (which might be huge)?
Edited to make things clearer:
The result need not be a valid JSON string. It should be equivalent to:
JsonConvert.SerializeObject(obj).Substring(0, 100);
... but without traversing the entire object graph.
No serializer is going to expect this scenario, because usually their job is to make valid data that can be reliably parsed. However, many serializers have options to take a TextWriter (or if not that, then: a Stream) as an output target. You could write a custom subclass of those which either silently discards data after the chosen amount (although the serializer will still walk the entire object graph), or deliberately throws an exception once the desired amount has been reached (this exception would interrupt the serializer, allowing you to avoid most of the unnecessary work).
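A rough sketch of the "deliberately throw" variant with JSON.NET; TruncatingWriter and TruncatedException are illustrative names, not library types:

using System;
using System.IO;
using Newtonsoft.Json;

sealed class TruncatedException : Exception { }

sealed class TruncatingWriter : StringWriter
{
    private readonly int limit;
    public TruncatingWriter(int limit) { this.limit = limit; }

    private void ThrowIfFull()
    {
        if (GetStringBuilder().Length >= limit) throw new TruncatedException();
    }

    public override void Write(char value) { ThrowIfFull(); base.Write(value); }
    public override void Write(string value) { ThrowIfFull(); base.Write(value); }
    public override void Write(char[] buffer, int index, int count)
    {
        ThrowIfFull();
        base.Write(buffer, index, count);
    }
}

static string SerializeTruncated(object obj, int limit)
{
    var writer = new TruncatingWriter(limit);
    try
    {
        new JsonSerializer().Serialize(writer, obj);
    }
    catch (TruncatedException)
    {
        // expected once the limit is reached; the partial JSON is in the writer
    }
    var s = writer.ToString();
    // a single Write call may overshoot, so trim to the exact limit
    return s.Length <= limit ? s : s.Substring(0, limit);
}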

How to read back appended objects using protobuf-net?

I'm appending real-time events to a file stream using protobuf-net serialization. How can I stream all saved objects back for analysis? I don't want to use an in-memory collection (because it would be huge).
private IEnumerable<Activity> Read() {
    using (var iso = new IsolatedStorageFileStream(storageFilename, FileMode.OpenOrCreate, FileAccess.Read, this.storage))
    using (var sr = new StreamReader(iso)) {
        while (!sr.EndOfStream) {
            yield return Serializer.Deserialize<Activity>(iso); // doesn't work
        }
    }
}
public void Append(Activity activity) {
    using (var iso = new IsolatedStorageFileStream(storageFilename, FileMode.Append, FileAccess.Write, this.storage)) {
        Serializer.Serialize(iso, activity);
    }
}
First, I need to discuss the protobuf format (via Google, not specific to protobuf-net). By design, it is appendable but with append===merge. For lists this means "append as new items", but for single objects this means "combine the members". Secondly, as a consequence of the above, the root object in protobuf is never terminated - the "end" is simply: when you run out of incoming data. Thirdly, and again as a direct consequence - fields are not required to be in any specific order, and generally will overwrite. So: if you just use Serialize lots of times, and then read the data back: you will have exactly one object, which will have basically the values from the last object on the stream.
What you want to do, though, is a very common scenario. So protobuf-net helps you out by including the SerializeWithLengthPrefix and DeserializeWithLengthPrefix methods. If you use these instead of Serialize / Deserialize, then it is possible to correctly parse individual objects. Basically, the length-prefix restricts the data so that only the exact amount per-object is read (rather than reading to the end of the file).
I strongly suggest (as parameters) using tag===field-number===1, and the base-128 prefix-style (an enum). As well as making the data fully protobuf compliant throughout (including the prefix data), this will make it easy to use an extra helper method: DeserializeItems. This exposes each consecutive object via an iterator-block, making it efficient to read huge files without needing everything in memory at once. It even works with LINQ.
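Applied to the code in the question, that might look like the following sketch; note that Read no longer needs a StreamReader at all:

private IEnumerable<Activity> Read() {
    using (var iso = new IsolatedStorageFileStream(storageFilename, FileMode.OpenOrCreate, FileAccess.Read, this.storage)) {
        // lazily yields one Activity at a time; nothing is buffered in memory
        foreach (var activity in Serializer.DeserializeItems<Activity>(iso, PrefixStyle.Base128, 1)) {
            yield return activity;
        }
    }
}
public void Append(Activity activity) {
    using (var iso = new IsolatedStorageFileStream(storageFilename, FileMode.Append, FileAccess.Write, this.storage)) {
        Serializer.SerializeWithLengthPrefix(iso, activity, PrefixStyle.Base128, 1);
    }
}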
There is also a way to use the API to selectively parse/skip different objects in the file - for example, to skip the first 532 records without processing the data. Let me know if you need an example of that.
If you already have lots of data that was already stored with Serialize rather than SerializeWithLengthPrefix - then it is probably still possible to decipher the data, by using ProtoReader to detect when the field-numbers loop back around : meaning, given fields "1, 2, 4, 5, 1, 3, 2, 5" - we can probably conclude there are 3 objects there and decipher accordingly. Again, let me know if you need a specific example.

Yet Another Take on Measuring Object Size

Searching Google and StackOverFlow comes up with a lot of references to this question. Including for example:
Ways to determine size of complex object in .NET?
How to get object size in memory?
So let me say at the start that I understand that it is not generally possible to get an accurate measurement. However I am not that concerned about that - I am looking for something that give me relative values rather than absolute. So if they are off a bit one way or the other it does not matter.
I have a complex object graph. It is made up of a single parent (T) with children that may have children of their own, and so on. All the objects in the graph derive from the same base class. The children are in the form of List<T>.
I have tried both the serializing method and the unsafe method to calculate size. They give different answers but the 'relative' problem is the same in both cases.
I made an assumption that the size of a parent object would be larger than the sum of the sizes of the children. This has turned out not to be true. I calculated the size of the parent. Then summed the size of the children. In some cases this appeared to make sense but in others the sum of the children far exceeded the size determined for the parent.
So my question is: why can serializing an object result in a size that is less than the sum of the sizes of its children? The only answer I have come up with is that each serialized object has a fixed overhead (which I guess is self-evident), and the sum of these can exceed the 'own size' of the parent. If that is so, is there any way to determine what that overhead might be, so that I can take account of it?
Many thanks in advance for any suggestions.
EDIT
Sorry, I forgot to say that all objects are marked serializable; the serialization method is:
var bf = new BinaryFormatter();
var ms = new MemoryStream();
bf.Serialize(ms, testObject);
byte[] array = ms.ToArray();
return array.Length;
It will really depend on which serialization mechanism you use to serialize the objects. It's possible that it's not serializing the child elements, which is one reason why you'd see the parent size smaller than the sum of the children (possibly even smaller than each individual child).
If you want to know the relative size of an object, make sure that you're serializing all the fields of all objects in your graph.
Edit: so, if you're using the binary formatter, then you must look at the specification for the format used by that serializer to understand the overhead. The format specification is public, and can be found at http://msdn.microsoft.com/en-us/library/cc236844(prot.20).aspx. It's not very easy to digest, but if you're willing to put the time to understand it, you'll find exactly how much overhead each object will have in its serialized form.
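To see the overhead concretely, a small helper along the lines of the question's own code can compare the parent measured whole against its children measured individually (a sketch using the same BinaryFormatter approach; parent.Children is a hypothetical collection):

using System.IO;
using System.Runtime.Serialization.Formatters.Binary;

static long SerializedSize(object obj)
{
    var bf = new BinaryFormatter();
    using (var ms = new MemoryStream())
    {
        bf.Serialize(ms, obj);
        return ms.Length;
    }
}

// Measuring the parent serializes the whole graph once, so shared type
// metadata is written once; measuring each child separately repeats that
// per-object overhead, which is why the children can sum to more than
// the parent:
// long parentSize = SerializedSize(parent);
// long childSum = parent.Children.Sum(c => SerializedSize(c));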

Transfer objects on per field basis over network

I need to transfer .NET objects (with hierarchy) over the network (multiplayer game). To save bandwidth, I'd like to transfer only fields (and/or properties) that change, so fields that won't change aren't transferred.
I also need some mechanism to match proper objects on the other client side (global object identifier...something like object ID?)
I need some suggestions how to do it.
Would you use reflection? (performance is critical)
I also need mechanism to transfer IList deltas (added objects, removed objects).
How is MMO networking done, do they transfer whole objects?
(maybe my idea of per field transfer is stupid)
EDIT:
To make it clear: I've already got a mechanism to track changes (let's say every field has a property whose setter adds the field to some sort of list or dictionary that contains the changes; the structure is not final yet).
I don't know how to serialize this list and then deserialize it on the other client, and more importantly, how to do it efficiently and how to update the proper objects.
There are about one hundred objects, so I'm trying to avoid a situation where I would have to write a special function for each object. Decorating fields or properties with attributes would be fine (for example, to specify a serializer, field ID or something similar).
More about the objects: each object has 5 fields on average. Some objects inherit from others.
Thank you for all the answers.
Another approach; don't try to serialize complex data changes: instead, send just the actual commands to apply (in a terse form), for example:
move 12432 134, 146
remove 25727
(which would move 1 object and remove another).
You would then apply the commands at the receiver, allowing for a full resync if they get out of sync.
I don't propose you would actually use text for this - that is just to make the example clearer.
One nice thing about this: it also provides "replay" functionality for free.
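A rough binary form of those two commands might look like this (Command, WriteMove and WriteRemove are illustrative, not from any library):

using System.IO;

enum Command : byte { Move = 1, Remove = 2 }

static void WriteMove(BinaryWriter w, int objectId, short x, short y)
{
    w.Write((byte)Command.Move); // 1-byte opcode
    w.Write(objectId);           // 4 bytes: which object to move
    w.Write(x);                  // 2 bytes each for the new coordinates
    w.Write(y);
}

static void WriteRemove(BinaryWriter w, int objectId)
{
    w.Write((byte)Command.Remove);
    w.Write(objectId);
}

// "move 12432 134, 146" becomes 9 bytes; "remove 25727" becomes 5 bytes.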
The cheapest way to track dirty fields is to have it as a key feature of your object model, i.e. with a "fooDirty" field for every data field "foo", which you set to true in the "set" (if the value differs). This could also be twinned with conditional serialization, perhaps the "ShouldSerializeFoo()" pattern observed by a few serializers. I'm not aware of any libraries that match exactly what you describe (unless we include DataTable, but... think of the kittens!)
Perhaps another issue is the need to track all the objects for merge during deserialization; that by itself doesn't come for free.
All things considered, though, I think you could do something along the above lines (fooDirty/ShouldSerializeFoo) and use protobuf-net as the serializer, because (importantly) it supports both conditional serialization and merge. I would also suggest an interface like:
interface ISomeName {
    int Key {get;}
    bool IsDirty {get;}
}
The IsDirty would allow you to quickly check all your objects for those with changes, then add the key to a stream, followed by the (conditional) serialization. The caller would read the key, obtain the object needed (or allocate a new one with that key), and then use the merge-enabled deserialize (passing in the existing/new object).
Not a full walk-through, but if it was me, that is the approach I would be looking at. Note: the addition/removal/ordering of objects in child-collections is a tricky area, that might need thought.
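To make the fooDirty/ShouldSerializeFoo idea concrete, a sketch with protobuf-net; Player and Position are illustrative names, and protobuf-net recognizes the ShouldSerialize* pattern:

using ProtoBuf;

[ProtoContract]
class Player : ISomeName
{
    public int Key { get; set; } // written to the stream separately, as described
    public bool IsDirty { get { return positionDirty; } } // OR together all *Dirty flags

    private int position;
    private bool positionDirty;

    [ProtoMember(1)]
    public int Position
    {
        get { return position; }
        set
        {
            if (value != position) { position = value; positionDirty = true; }
        }
    }

    // conditional serialization: Position is only written when it has changed
    public bool ShouldSerializePosition() { return positionDirty; }

    public void MarkClean() { positionDirty = false; }
}

On the receiving side, Serializer.Merge(stream, existing) applies just the transmitted fields onto the existing instance with that key, which is what makes the delta approach work.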
I'll just say up front that Marc Gravell's suggestion is really the correct approach. He glosses over some minor details, like conflict resolution (you might want to read up on Leslie Lamport's work; he has basically spent his whole career describing different approaches to dealing with conflict resolution in distributed systems), but the idea is sound.
If you do want to transmit state snapshots, instead of procedural descriptions of state changes, then I suggest you look into building snapshot diffs as prefix trees. The basic idea is that you construct a hierarchy of objects and fields. When you change a group of fields, any common prefix they have is only included once. This might look like:
world -> player 1 -> lives: 1
... -> points: 1337
... -> location -> X: 100
... -> Y: 32
... -> player 2 -> lives: 3
(everything in a "..." is only transmitted once).
It is not practical to transfer only changed fields, because you would waste time detecting which fields changed and which didn't, and working out how to reconstruct them on the receiver's side, which would add a lot of latency to your game and make it unplayable online.
My proposed solution is to decompose your objects into minimal pieces and send those small objects, which is fast. Also, you can use compression to reduce bandwidth usage.
For the object ID, you can use a static counter that is incremented each time you construct a new object.
Hope this answer helps.
You will need to do this by hand. Automatically keeping track of property and instance changes in a hierarchy of objects is going to be very slow compared to anything crafted by hand.
If you decide to try it out anyway, I would try to map your objects to a DataSet and use its built-in modification tracking mechanisms.
I still think you should do this by hand, though.
