I'm appending real-time events to a file stream using protobuf-net serialization. How can I stream all saved objects back for analysis? I don't want to use an in-memory collection (because it would be huge).
private IEnumerable<Activity> Read() {
    using (var iso = new IsolatedStorageFileStream(storageFilename, FileMode.OpenOrCreate, FileAccess.Read, this.storage))
    using (var sr = new StreamReader(iso)) {
        while (!sr.EndOfStream) {
            yield return Serializer.Deserialize<Activity>(iso); // doesn't work
        }
    }
}

public void Append(Activity activity) {
    using (var iso = new IsolatedStorageFileStream(storageFilename, FileMode.Append, FileAccess.Write, this.storage)) {
        Serializer.Serialize(iso, activity);
    }
}
First, I need to discuss the protobuf format (via Google, not specific to protobuf-net). By design, it is appendable but with append===merge. For lists this means "append as new items", but for single objects this means "combine the members".

Secondly, as a consequence of the above, the root object in protobuf is never terminated - the "end" is simply: when you run out of incoming data.

Thirdly, and again as a direct consequence - fields are not required to be in any specific order, and generally will overwrite.

So: if you just use Serialize lots of times, and then read the data back: you will have exactly one object, which will have basically the values from the last object on the stream.
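To see this in action, here is a minimal sketch (assuming Activity has, say, a string Name member marked for serialization) showing that two consecutive Serialize calls merge into a single object on read-back:

using (var ms = new MemoryStream())
{
    Serializer.Serialize(ms, new Activity { Name = "first" });
    Serializer.Serialize(ms, new Activity { Name = "second" });
    ms.Position = 0;

    // reads to the end of the stream and merges both payloads into one object
    var merged = Serializer.Deserialize<Activity>(ms);
    // merged.Name == "second": one Activity, with the later value winning
}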
What you want to do, though, is a very common scenario. So protobuf-net helps you out by including the SerializeWithLengthPrefix and DeserializeWithLengthPrefix methods. If you use these instead of Serialize / Deserialize, then it is possible to correctly parse individual objects. Basically, the length-prefix restricts the data so that only the exact amount per-object is read (rather than reading to the end of the file).
I strongly suggest (as parameters) using tag===field-number===1, and the base-128 prefix-style (an enum). As well as making the data fully protobuf compliant throughout (including the prefix data), this will make it easy to use an extra helper method: DeserializeItems. This exposes each consecutive object via an iterator-block, making it efficient to read huge files without needing everything in memory at once. It even works with LINQ.
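Putting that together with your Read/Append pair, a sketch (keeping your field names, and using field-number 1 with PrefixStyle.Base128 as suggested) might look like:

private IEnumerable<Activity> Read() {
    using (var iso = new IsolatedStorageFileStream(storageFilename, FileMode.OpenOrCreate, FileAccess.Read, this.storage)) {
        // DeserializeItems yields one Activity per length-prefixed record, lazily
        foreach (var activity in Serializer.DeserializeItems<Activity>(iso, PrefixStyle.Base128, 1)) {
            yield return activity;
        }
    }
}

public void Append(Activity activity) {
    using (var iso = new IsolatedStorageFileStream(storageFilename, FileMode.Append, FileAccess.Write, this.storage)) {
        Serializer.SerializeWithLengthPrefix(iso, activity, PrefixStyle.Base128, 1);
    }
}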
There is also a way to use the API to selectively parse/skip different objects in the file - for example, to skip the first 532 records without processing the data. Let me know if you need an example of that.
If you already have lots of data that was already stored with Serialize rather than SerializeWithLengthPrefix - then it is probably still possible to decipher the data, by using ProtoReader to detect when the field-numbers loop back around : meaning, given fields "1, 2, 4, 5, 1, 3, 2, 5" - we can probably conclude there are 3 objects there and decipher accordingly. Again, let me know if you need a specific example.
Use Case Description
I receive the collections in chunks from a server, and I want to write them to a file in a way that lets me read them back one-by-one later. My objects are fixed size, meaning the class only contains fields of type double, long and DateTime.
I already serialize and deserialize objects using the methods below in different places in my project:
public static T Deserialize<T>(byte[] buffer)
{
    using (MemoryStream stream = new MemoryStream(buffer))
    {
        return Serializer.Deserialize<T>(stream);
    }
}

public static byte[] Serialize<T>(T message)
{
    using (MemoryStream stream = new MemoryStream())
    {
        Serializer.Serialize(stream, message);
        return stream.ToArray();
    }
}
But even if this could work, I think it would produce a larger output file, because I believe protobuf stores some information about field names (in its own way). I could instead create the byte[] using BinaryWriter, without any field-name information. I know I would need to read the fields back in the same order, but this could still have a meaningful impact on the output file size, especially when the number of objects in the collection is really huge.
Is there a way to efficiently write collections in parts, read them back one-by-one, and keep both the output file size and the memory footprint while reading to a minimum? My collections are really large, containing years of market data that I need to read and process. I only need to read each object once, process it, and forget about it; I have no need to keep objects in memory.
Protobuf doesn't store field names, but it does use a field prefix that is an encoded integer. For storing multiple objects, you would typically use the *WithLengthPrefix overloads. Note also that DateTime has no reliable fixed-length encoding, so your records are not truly fixed size once serialized.
However! In your case, perhaps a serializer isn't the right tool. I would consider:
creating a readonly struct composed of a double and two longs (or three longs if you need high-precision epoch time)
using a memory-mapped file to access the file system directly
creating a Span<byte> over the memory-mapped file (or a section thereof)
coercing the Span<byte> to a Span<YourStruct> using MemoryMarshal.Cast
et voila, direct access to your values all the way to the file system.
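A rough sketch of that idea, under assumptions of my own (a hypothetical Tick layout of one double plus two longs, an unsafe context for the pointer work, and the System.IO.MemoryMappedFiles / System.Runtime.InteropServices namespaces), might be:

[StructLayout(LayoutKind.Sequential, Pack = 1)]
readonly struct Tick
{
    public readonly double Price;
    public readonly long Volume;
    public readonly long EpochTicks;
}

static class TickFile
{
    public static unsafe void Process(string path, Action<Tick> handle)
    {
        long byteLength = new FileInfo(path).Length;
        using (var mmf = MemoryMappedFile.CreateFromFile(path, FileMode.Open))
        using (var view = mmf.CreateViewAccessor())
        {
            byte* ptr = null;
            view.SafeMemoryMappedViewHandle.AcquirePointer(ref ptr);
            try
            {
                // wrap the mapped memory in a span, then reinterpret the bytes as Tick records
                var bytes = new Span<byte>(ptr, checked((int)byteLength));
                foreach (var tick in MemoryMarshal.Cast<byte, Tick>(bytes))
                {
                    handle(tick); // read once, process, forget - nothing is buffered
                }
            }
            finally
            {
                view.SafeMemoryMappedViewHandle.ReleasePointer();
            }
        }
    }
}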
Consider I have a very large collection (millions) of objects serialized according to the proto wire format. Is it possible to stream these items from the file? I tried serializing the objects as a List<T> and then deserializing a single T item, but it ended up only reading the very last item from the stream. I also tried serializing each instance individually to the stream, with the same effect: upon deserialization, only the last item was read.
I suspect the solution requires me to know the size of each serialized item, read that many bytes from the stream, and pass that span of bytes to the protobuf serializer for deserialization. I wanted to make sure there isn't an easier mechanism, one that doesn't require knowledge of the length of each individual item (which may differ per instance), to accomplish this task.
Another thought I had was including the size of each upcoming object as its own object in the stream, for example:
0: meta-information for the first object, including type/length in bytes
1: object defined in 0
2: meta-information for the second object, including type/length in bytes
3: object defined in 2
4: ...etc
Version information:
I'm currently using dotnet core 3.1 and protobuf-net version 2.4.4
In protobuf the root object is not terminated by default, with the intent being to allow "merge" === "append". This conflicts with the very common scenario you are describing. Fortunately, many libraries provide a mechanism to encode the length before each object for exactly this reason. What you are looking for is the SerializeWithLengthPrefix and DeserializeWithLengthPrefix methods.
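A brief sketch of that pattern (Item, items, Process and the file name are placeholders for your own type, source and processing step); DeserializeItems streams each length-prefixed record lazily, so only the current item is ever in memory:

using (var file = File.Create("items.bin"))
{
    foreach (var item in items)
    {
        Serializer.SerializeWithLengthPrefix(file, item, PrefixStyle.Base128, 1);
    }
}

using (var file = File.OpenRead("items.bin"))
{
    foreach (var item in Serializer.DeserializeItems<Item>(file, PrefixStyle.Base128, 1))
    {
        Process(item); // read once, process, forget
    }
}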
If the data already exists as flat appends, and cannot be rewritten: there are still ways to recover it, by using the reader API. A bit more complex, but I've recovered such data in the past for people when needed.
I am using the StackExchange.Redis client to insert a dictionary of key-value pairs using a batch, as below:
private static StackExchange.Redis.IDatabase _database;

public void SetAll<T>(Dictionary<string, T> data, int cacheTime)
{
    lock (_database)
    {
        TimeSpan expiration = new TimeSpan(0, cacheTime, 0);
        var list = new List<Task<bool>>();
        var batch = _database.CreateBatch();

        foreach (var item in data)
        {
            string serializedObject = JsonConvert.SerializeObject(item.Value, Formatting.Indented,
                new JsonSerializerSettings { ContractResolver = new SerializeAllContractResolver(), ReferenceLoopHandling = ReferenceLoopHandling.Ignore });

            var task = batch.StringSetAsync(item.Key, serializedObject, expiration);
            list.Add(task);
            serializedObject = null;
        }

        batch.Execute();
        Task.WhenAll(list.ToArray());
    }
}
My problem: it takes around 7 seconds to set just 350 dictionary items.
My question: Is this the right way to set bulk items into Redis or is there a quicker way to do this?
Any help is appreciated. Thanks.
"just" is a very relative term, and doesn't really make sense without more context, in particular: how big are these payloads?
however, to clarify a few points to help you investigate:
there is no need to lock an IDatabase unless that is purely for your own purposes; SE.Redis deals with thread safety internally and is intended to be used by competing threads
at the moment, your timing of this will include all the serialization code (JsonConvert.SerializeObject); this will add up, especially if your objects are big; to get a decent measure, I strongly suggest you time the serialization and redis times separately
the batch.Execute() method uses a pipeline API and does not wait for responses between calls, so: the time you're seeing is not the cumulative effect of latency; that leaves just local CPU (for serialization), network bandwidth, and server CPU; the client library tools can't impact any of those things
there is a StringSet overload that accepts a KeyValuePair<RedisKey, RedisValue>[]; you could choose to use this instead of a batch, but the only difference here is that it is the variadic MSET rather than multiple SET; either way, you'll be blocking the connection for other callers for the duration (since the purpose of batch is to make the commands contiguous); a sketch of this overload follows this list
you don't actually need to use CreateBatch here, especially since you're locking the database (but I still suggest you don't need to do this); the purpose of CreateBatch is to make a sequence of commands contiguous, but I don't see that you need that here; you could just use _database.StringSetAsync for each command in turn, which has the added advantage that serialization of the next item runs in parallel with the previous command being sent, overlapping serialization (CPU bound) with the redis work (IO bound) at no cost other than deleting the CreateBatch call; it also means you don't monopolize the connection for other callers
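For completeness, a hedged sketch of that multi-key StringSet overload (serializer settings omitted for brevity); note that this form takes no per-entry expiry, so it only fits if you can live without the expiration:

var pairs = data.Select(kvp => new KeyValuePair<RedisKey, RedisValue>(
    kvp.Key, JsonConvert.SerializeObject(kvp.Value))).ToArray();
_database.StringSet(pairs); // single variadic MSET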
So, the first thing I would do is remove some code:
private static StackExchange.Redis.IDatabase _database;
static JsonSerializerSettings _redisJsonSettings = new JsonSerializerSettings
{
    ContractResolver = new SerializeAllContractResolver(),
    ReferenceLoopHandling = ReferenceLoopHandling.Ignore
};

public void SetAll<T>(Dictionary<string, T> data, int cacheTime)
{
    TimeSpan expiration = new TimeSpan(0, cacheTime, 0);
    var list = new List<Task<bool>>();

    foreach (var item in data)
    {
        string serializedObject = JsonConvert.SerializeObject(
            item.Value, Formatting.Indented, _redisJsonSettings);

        list.Add(_database.StringSetAsync(item.Key, serializedObject, expiration));
    }

    Task.WhenAll(list.ToArray());
}
The second thing I would do would be to time the serialization separately to the redis work.
The third thing I would do would be to see if I can serialize to a MemoryStream instead, ideally one that I can re-use, to avoid the string allocation and UTF-8 encode:
var jsonSerializer = JsonSerializer.Create(_redisJsonSettings);
jsonSerializer.Formatting = Formatting.Indented;

using (var ms = new MemoryStream())
{
    foreach (var item in data)
    {
        ms.Position = 0;
        ms.SetLength(0); // erase existing data
        // JsonConvert has no stream-based overload, so write via a JsonSerializer;
        // leaveOpen keeps the MemoryStream alive for re-use, and the BOM-less
        // UTF8Encoding avoids prepending a byte-order mark to each value
        using (var writer = new StreamWriter(ms, new UTF8Encoding(false), 1024, leaveOpen: true))
        {
            jsonSerializer.Serialize(writer, item.Value);
        } // disposing the writer flushes it into the MemoryStream
        list.Add(_database.StringSetAsync(item.Key, ms.ToArray(), expiration));
    }
}
This second answer is kinda tangential, but based on the discussion it sounds as though the main cost is serialization:
The object in this context is big with huge infos in string props and many nested classes.
One thing you could do here is not store JSON. JSON is relatively large, and being text-based is relatively expensive to process both for serialization and deserialization. Unless you're using rejson, redis just treats your data as an opaque blob, so it doesn't care what the actual value is. As such, you can use more efficient formats.
I'm hugely biased, but we make use of protobuf-net in our redis storage. protobuf-net is optimized for:
small output (dense binary without redundant information)
fast binary processing (absurdly optimized with contextual IL emit, etc)
good cross-platform support (it implements Google's "protobuf" wire format, which is available on just about every platform)
designed to work well with existing C# code, not just brand new types generated from a .proto schema
I suggest protobuf-net rather than Google's own C# protobuf library because of the last bullet point, meaning: you can use it with the data you already have.
To illustrate why, I'll use this image from https://aloiskraus.wordpress.com/2017/04/23/the-definitive-serialization-performance-guide/:
Notice in particular that the output size of protobuf-net is half that of Json.NET (reducing the bandwidth cost), and the serialization time is less than one fifth (reducing local CPU cost).
You would need to add some attributes to your model to help protobuf-net out (as per How to convert existing POCO classes in C# to google Protobuf standard POCO), but then this would be just:
using (var ms = new MemoryStream())
{
    foreach (var item in data)
    {
        ms.Position = 0;
        ms.SetLength(0); // erase existing data
        ProtoBuf.Serializer.Serialize(ms, item.Value);
        list.Add(_database.StringSetAsync(item.Key, ms.ToArray(), expiration));
    }
}
As you can see, the code change to your redis code is minimal. Obviously you would need to use Deserialize<T> when reading the data back.
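For illustration, a hedged sketch with hypothetical Customer/Order types (your real model will differ, and "key" is whatever key you stored against), showing both the attributes and the read-back path:

[ProtoContract]
public class Customer
{
    [ProtoMember(1)] public string Name { get; set; }
    [ProtoMember(2)] public List<Order> Orders { get; set; }
}

[ProtoContract]
public class Order
{
    [ProtoMember(1)] public int Id { get; set; }
    [ProtoMember(2)] public decimal Value { get; set; }
}

// reading back from redis:
byte[] blob = (byte[])_database.StringGet(key);
using (var ms = new MemoryStream(blob))
{
    var customer = ProtoBuf.Serializer.Deserialize<Customer>(ms);
}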
If your data is text based, you might also consider running the serialization through GZipStream or DeflateStream; if your data is dominated by text, it will compress very well.
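For example, a sketch layered on the protobuf version above (GZipStream lives in System.IO.Compression):

using (var ms = new MemoryStream())
{
    foreach (var item in data)
    {
        ms.Position = 0;
        ms.SetLength(0); // erase existing data
        using (var gzip = new GZipStream(ms, CompressionLevel.Fastest, leaveOpen: true))
        {
            ProtoBuf.Serializer.Serialize(gzip, item.Value);
        } // disposing the GZipStream flushes the compressed tail into ms
        list.Add(_database.StringSetAsync(item.Key, ms.ToArray(), expiration));
    }
}

When reading, wrap the incoming bytes in a GZipStream in decompression mode before calling Deserialize<T>.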
I'm fairly new to C# and am getting into XNA a bit.
So far all is fairly simple and I can find info on it, but one thing that I've been struggling with is finding good tips/tutorials on how to create game save functionality.
I don't really want to use XML for saving either the configuration or the game, since it makes changing the values too easy. So I decided to go for binary files, since that adds a layer of complexity.
Sadly I wasn't able to find much information on how to do that.
I saw some posts suggesting users to create a structure, then saving it as a binary file.
This seems fairly simple (I can see that being done with the controls, for example, since there aren't that many variables), but I can't seem to find info on how to convert the actual
public struct controls {
    public int forward;
    public int back;
}
structure ... well, to a binary file really.
Another question is saving game data.
Should I go for the same approach and create a structure that will hold variables like player health, position etc. and just load it up when I want to load the game?
I guess what I want to ask is - is it possible to "freeze" the game state (amount of enemies around, items dropped etc.) and load it up later?
Any tips, pushes and nods towards the right direction will be much appreciated.
Thank you very much!
Well, the simple answer is yes, you can store game state. But this mainly depends on the actual game implementation. You have to implement one or several data classes which will store the data vital for recreating the game state. I think you can't just easily dump your game memory to restore the state; you have to recreate the game scene using the values you saved earlier.
So you can use these simple methods to convert virtually any class marked with the [Serializable] attribute to a byte array:
public static byte[] ToBytes(object data)
{
    using (var ms = new MemoryStream())
    {
        // create a binary formatter:
        var bnfmt = new BinaryFormatter();

        // serialize the data to the memory stream
        bnfmt.Serialize(ms, data);

        // return a byte[] representation of the binary data
        // (ToArray, not GetBuffer, so we don't include unused trailing bytes)
        return ms.ToArray();
    }
}

public static T FromBytes<T>(byte[] input)
{
    using (var ms = new MemoryStream(input))
    {
        var bnfmt = new BinaryFormatter();

        // deserialize the data from the memory stream
        var value = bnfmt.Deserialize(ms);
        return (T)value;
    }
}
Also, you must know the rules of binary serialization: which types can be serialized out of the box and which need some workaround.
Then you can optionally apply encryption/decryption to that byte sequence and save/load it using System.IO.File.
// read
var data = File.ReadAllBytes("test.dat");

// write
using (var file = File.OpenWrite("test.dat"))
{
    file.Write(data, 0, data.Length);
}
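Putting it together, a small sketch (the GameState shape is hypothetical, not from the question) that BinaryFormatter handles out of the box:

[Serializable]
public class GameState
{
    public float PlayerHealth;
    public float PlayerX, PlayerY;
    public List<string> DroppedItems = new List<string>();
}

// save ("state" is your current GameState instance)
File.WriteAllBytes("save.dat", ToBytes(state));

// load
var restored = FromBytes<GameState>(File.ReadAllBytes("save.dat"));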
In this situation, there's no real "correct" answer. If you just want to "encrypt" data, why not create the XML in memory and then apply your preferred cryptographic function to protect it before saving?
Of course, this is not a catch-all rule: saving game data in binary format results in less space occupied on disk and maybe faster load times. A large number such as 123456789 can be stored using only 4 bytes, whereas saving it in XML adds overhead from the XML tags and the conversion between string and int.
A good approach for your project is to create a helper library with serializers/deserializers. Every struct will have its own; when called on a specific structure, the function converts the structure's fields into their binary representation, concatenates them and writes them to file. This is why every structure needs its own deserializer: it's up to you to choose the order of fields, the binary encoding, etc.
Finally, the above problem can be solved more elegantly with an OOP approach, perhaps with every "storable" class implementing a serialization interface and providing its own ad hoc serialization methods, as sketched below.
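One possible shape for that (the names are mine, not a standard API): each storable type implements a small interface and decides its own field order and encoding, e.g. for the controls struct from the question:

public interface IBinaryStorable
{
    void Save(BinaryWriter writer);
    void Load(BinaryReader reader);
}

public struct Controls : IBinaryStorable
{
    public int Forward;
    public int Back;

    public void Save(BinaryWriter writer)
    {
        writer.Write(Forward);
        writer.Write(Back);
    }

    public void Load(BinaryReader reader)
    {
        // must read in exactly the order Save wrote
        Forward = reader.ReadInt32();
        Back = reader.ReadInt32();
    }
}

// usage:
using (var writer = new BinaryWriter(File.Create("controls.bin")))
{
    new Controls { Forward = 87, Back = 83 }.Save(writer); // e.g. key codes for W and S
}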
I am trying to use the EndianBinaryReader and EndianBinaryWriter that Jon Skeet wrote as part of his misc utils lib. It works great for the two uses I have made of it.
The first is reading from a network stream (TcpClient), where I sit in a loop reading the data as it comes in. I can create a single EndianBinaryReader and then just dispose of it on shutdown of the application. I construct the EndianBinaryReader by passing in TcpClient.GetStream().
I am now trying to do the same thing when reading from a UdpClient, but this does not have a stream as it is connectionless. So I get the data like so:
byte[] data = udpClientSnapShot.Receive(ref endpoint);
I could put this data into a memory stream
var memoryStream = new MemoryStream(data);
and then create the EndianBinaryReader
var endianbinaryReader = new EndianBinaryReader(
    new BigEndianBitConverter(), memoryStream, Encoding.ASCII);
but this means I have to create a new endian reader every time I do a read. Is there a way I can create a single stream that I can keep updating with the data from the UdpClient?
I can't remember whether EndianBinaryReader buffers - you could overwrite a single MemoryStream? But to be honest there is very little overhead from an extra object here. How big are the packets? (putting it into a MemoryStream will clone the byte[]).
I'd be tempted to use the simplest thing that works and see if there is a real problem. Probably the one change I would make is to introduce using (since they are IDisposable):
using (var memoryStream = new MemoryStream(data))
using (var endianbinaryReader = ..blah..) {
    // use it
}
Your best option is probably an override of the .NET Stream class to provide your custom functionality. The class is designed to be overridable with custom behavior.
It may look daunting because of the number of members, but it is easier than it looks. There are a number of boolean properties like "CanWrite", etc. Override them and have them all return "false" except for the functionality that your reader needs (probably CanRead is the only one you need to be true.)
Then, just override all of the members documented with the phrase "When overridden in a derived class" in the help for Stream, and have the unsupported methods throw a NotSupportedException (instead of the default NotImplementedException).
Implement the Read method to return data from your buffered UDP packets using perhaps a linked list of buffers, setting used buffers to "null" as you read past them so that the memory footprint doesn't grow unbounded.
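A rough sketch of that idea (the class name and Push method are mine, not an existing API); a real implementation would likely block or await in Read when no packet is buffered, rather than returning 0:

public class UdpPacketStream : Stream
{
    private readonly Queue<byte[]> _packets = new Queue<byte[]>();
    private byte[] _current;
    private int _offset;

    // call this with each udpClient.Receive(...) result
    public void Push(byte[] packet) { lock (_packets) _packets.Enqueue(packet); }

    public override int Read(byte[] buffer, int offset, int count)
    {
        if (_current == null || _offset == _current.Length)
        {
            lock (_packets)
            {
                if (_packets.Count == 0) return 0; // nothing buffered (see note above)
                _current = _packets.Dequeue();     // old buffer becomes unreachable
                _offset = 0;
            }
        }
        int copied = Math.Min(count, _current.Length - _offset);
        Buffer.BlockCopy(_current, _offset, buffer, offset, copied);
        _offset += copied;
        return copied;
    }

    public override bool CanRead { get { return true; } }
    public override bool CanSeek { get { return false; } }
    public override bool CanWrite { get { return false; } }
    public override long Length { get { throw new NotSupportedException(); } }
    public override long Position
    {
        get { throw new NotSupportedException(); }
        set { throw new NotSupportedException(); }
    }
    public override void Flush() { }
    public override long Seek(long offset, SeekOrigin origin) { throw new NotSupportedException(); }
    public override void SetLength(long value) { throw new NotSupportedException(); }
    public override void Write(byte[] buffer, int offset, int count) { throw new NotSupportedException(); }
}

With something like this, you could construct the EndianBinaryReader once over a single UdpPacketStream and Push each Receive result into it as it arrives.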