How can I speed up MongoDB deserialization for C#

When returning many results from a query, the code takes a really long time to convert the data into .NET objects. These are basic objects, with a few strings as fields. I'm not sure, but I think it's using reflection to create the instances, which is slow. Is there a way to speed this up?

The 10gen driver doesn't use reflection on a per-object basis. It uses reflection once per type to generate a serializer using Reflection.Emit, so serialization or deserialization of the first object might be slow, but any objects afterward are (relatively) fast.
Your question - is there any way to speed this up?
If your objects are simple (no nested documents, a few public fields, etc.), there probably isn't much you can do. You could implement a custom serializer for the class to eke out a little performance, but I doubt it would be more than a few percent.
I haven't looked into it, and Robert Stam (who answered this question as well) would be the authority on it, but there may be some performance to be gained on multicore or multiprocessor systems by parallelizing deserialization in the driver. I haven't looked at the driver code from that perspective yet, so it may be something Robert has already pursued.
On a general note, I think 30,000 objects in 10 seconds is pretty standard for just about any platform - SQL, Mongo, XML, etc. - that isn't storing objects directly as memory blobs (as you could in a language like C++).
EDIT:
It looks like the 10gen driver performs deserialization before it returns a cursor for you to enumerate. So if your query returns 30,000 results, all 30,000 objects have to be deserialized before the driver makes a cursor available for enumeration. I haven't looked at the jmongo driver, but I expect that it does the opposite, and defers deserialization until after an object is enumerated in the cursor.
The net result is that while both probably take the same total time to enumerate and deserialize 30,000 objects, deserialization in the jmongo driver is spread across the entire enumeration, whereas in the C# driver it is front-loaded.
The difference is subtle, but likely to explain what you are seeing.
The bad news is that the "fix" would be a driver change. One thing you could do is break your query up into chunks, querying for 10 or 100 objects at a time.
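For illustration, a rough sketch of that chunking with the 1.x driver is below; the items collection, the Item class, and the query are placeholders, not something from the question.
using MongoDB.Bson;
using MongoDB.Driver;
using MongoDB.Driver.Builders;

public class Item
{
    public ObjectId Id;      // stand-in fields for whatever your documents contain
    public string Category;
    public string Name;
}

public static class ChunkedReader
{
    public static void ProcessInChunks(MongoCollection<Item> items)
    {
        const int chunkSize = 100;
        int skip = 0;
        while (true)
        {
            // SetSkip/SetLimit page through the result set so only one chunk
            // has to be deserialized up front per round trip.
            var cursor = items.Find(Query.EQ("Category", "books"))
                              .SetSkip(skip)
                              .SetLimit(chunkSize);
            int count = 0;
            foreach (var item in cursor)
            {
                count++;
                // process item ...
            }
            if (count < chunkSize) break;
            skip += count;
        }
    }
}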

Not sure how you are measuring. When the C# driver gets back a batch of documents from the server it deserializes them all at once, so there might be a lag on the first document but then the rest of the documents are really fast. What really matters is the total throughput in terms of documents per second and whether it is fast enough to saturate the network link, which it should be.
While there are hardcoded serializers for many of the standard .NET classes, serialization of POCOs is typically handled through class maps. Reflection is used to build the class maps, but reflection is no longer needed while doing the serialization/deserialization.
You could speed up serialization/deserialization a little by writing your own handcoded serializers for your classes (or by making your classes implement IBsonSerializable), but since the bottleneck is probably the network anyway, it likely isn't worth it.
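If you do want to experiment with bypassing the class map, a minimal sketch is below; it reads raw BsonDocuments with the 1.x driver and maps the fields by hand. The Product class and field names are placeholders.
using System.Collections.Generic;
using MongoDB.Bson;
using MongoDB.Driver;

public class Product
{
    public ObjectId Id;
    public string Name;
    public string Category;
}

public static class ProductLoader
{
    public static List<Product> LoadProducts(MongoCollection collection)
    {
        var result = new List<Product>();
        // Map the fields manually instead of going through the generated class-map serializer.
        foreach (var doc in collection.FindAllAs<BsonDocument>())
        {
            result.Add(new Product
            {
                Id = doc["_id"].AsObjectId,
                Name = doc["Name"].AsString,
                Category = doc["Category"].AsString
            });
        }
        return result;
    }
}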

Here is what I am using:
Read only the fields you need.
Cache objects in memory that are needed often but rarely change.
When I need to read many objects by some rule (e.g. products matching filter criteria), I store them all in a single filter object and read them at once (see the sketch below). The drawback is having to recalculate this cache whenever something changes.
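A minimal sketch of that caching idea, reusing the hypothetical Product class from the earlier sketch and a loader delegate you supply yourself:
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;

public static class ProductCache
{
    private static readonly ConcurrentDictionary<string, List<Product>> Cache =
        new ConcurrentDictionary<string, List<Product>>();

    // Load from MongoDB only on a cache miss; "filterKey" identifies the filter object.
    public static List<Product> Get(string filterKey, Func<string, List<Product>> load)
    {
        return Cache.GetOrAdd(filterKey, load);
    }

    // Call this whenever the underlying products change, so the cache gets rebuilt.
    public static void Invalidate(string filterKey)
    {
        List<Product> removed;
        Cache.TryRemove(filterKey, out removed);
    }
}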

Related

C# : Serialize objects to XML without reflection

In an application, we can save the current state of the application and its configuration (which can be huge). We are using the XmlSerializer.
We now have only what we need in the XML (all XmlIgnore attributes are in place), and it's VERY slow to store the whole configuration (a file of ~50-100 MB).
We NEED to keep storing this configuration as XML, but we would like to avoid:
Reflection, which is too slow
Having to implement the IXmlSerializable interface
The idea was to have a method to implement in each object, in which we register which fields/properties we want to serialize, and then a SerializationManager that reads what we want to serialize and writes it out.
This way, objects don't know the format (XML) in which they will be rendered, and if one day we want binary serialization (or the possibility to serialize to a different format), we can.
But we don't want to reinvent the wheel, and I don't know whether a library already exists, whether something like LINQ to XML can help, or whether this is natively possible...
So how do you think I can achieve this?
"The reflection, which is to slow"
Except, it doesn't use reflection at runtime. It performs metaprogramming on the first run (assuming you are using new XmlSerializer(type)) to inspect the type and generate static code that will work on the given type. Therefore, any volume-related performance issue is not related to reflection. There is a chance that the metaprogramming itself can take a measurable time, but a: this is unlikely unless your model is really complicated, and b: it can be avoided by using the sgen.exe tool to pre-generate the serialization assembly.
Any performance issue, therefore, is most likely due to the size of the model and the overhead of xml.
If you want to try a different serializer, consider something like protobuf-net. You won't be able to read the data (it will not be XML), but the output will be much smaller and faster.
As you mentioned
In an application, we can save the current state of the application and its configuration
State, especially when it is big (100 MB is huge!), requires its own approach to serialization. Many of us know and hate the slow saving/loading of game saves from the past. Even now, game developers distinguish quicksaves from ordinary saves: the quicksave is optimized to happen faster than an ordinary save (for example, by caching part of the most recently performed quicksave).
The first question is: why XML? BinaryFormatter is faster, but for sizes like this you are better off with manual serialization (or, as Marc Gravell suggested, protobuf - it is superior to just about anything else here).
The second question is: do you really need to serialize the data (change its format)? The fastest way of saving state is to dump memory. Imagine all your data kept in one block of memory; dumping that block into a file is a very quick save. You may (I am not sure, but it should be doable) lay out your data in such a way that overwriting this memory is effectively loading the state. That is much faster than any conversion.
If you go with dumping, then consider compressing it (e.g. into a zip). Compressing and saving 10 MB should be faster than saving an unpacked 100 MB (assuming you are not using a packing algorithm that is too slow or too aggressive), since memory operations and the CPU are much faster than an SSD.
To save the configuration, you can still serialize it as usual. If you want everything in a single file, then define your own format for that file, for example:
config_stream, separator ["<<<>>>>"], memory block [100 MB]
Serialize the configuration with XmlSerializer into memory, create the file, write the config, then the separator, then the dump (see the sketch below).
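A rough sketch of that layout, with GZip compression on top; the AppConfig type is a made-up stand-in for your real configuration class, not something from the question.
using System.IO;
using System.IO.Compression;
using System.Text;
using System.Xml.Serialization;

public class AppConfig
{
    public string Name; // stand-in for your real configuration fields
}

public static class StateWriter
{
    // Layout: [xml config][separator "<<<>>>>"][raw memory block], all gzip-compressed.
    public static void Save(string path, AppConfig config, byte[] memoryBlock)
    {
        byte[] separator = Encoding.ASCII.GetBytes("<<<>>>>");
        using (var file = File.Create(path))
        using (var gzip = new GZipStream(file, CompressionMode.Compress))
        {
            new XmlSerializer(typeof(AppConfig)).Serialize(gzip, config);
            gzip.Write(separator, 0, separator.Length);
            gzip.Write(memoryBlock, 0, memoryBlock.Length);
        }
    }
}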

Memory-mapped file IList implementation, for storing large datasets "in memory"?

I need to perform operations chronologically on huge time series implemented as IList. The data is ultimately stored into a database, but it would not make sense to submit tens of millions of queries to the database.
Currently the in-memory IList triggers an OutOfMemoryException when trying to store more than 8 million (small) objects, though I would need to deal with tens of millions.
After some research, it looks like the best way to do it would be to store data on disk and access it through an IList wrapper.
Memory-mapped files (introduced in .NET 4.0) seem the right interface to use, but I wonder what is the best way to write a class that should implement IList (for easy access) and internally deal with a memory-mapped file.
I am also curious to hear if you know about other ways! I thought, for example, of an IList wrapper using data from db4o (someone mentioned here using a memory-mapped file as the IoAdapterFile, though using db4o probably adds a performance cost vs. dealing directly with the memory-mapped file).
I have come across this question asked in 2009, but it did not yield useful answers or serious ideas.
I found this PersistentDictionary<>, but it only works with strings, and by reading the source code I am not sure it was designed for very large datasets.
More scalable (up to 16 TB), the ESENT PersistentDictionary<> uses the ESENT database engine present in Windows (XP+) and can store all serializable objects containing simple types.
Disk Based Data Structures, including Dictionary, List and Array with an "intelligent" serializer, looked exactly like what I was looking for, but it did not run smoothly with extremely large datasets, especially as it does not make use of the "native" .NET MemoryMappedFiles yet, and support for 32-bit systems is experimental.
Update 1: I ended up implementing my own version that makes extensive use of .NET MemoryMappedFiles; it is very fast and I will probably release it on Codeplex once I have made it better for more general purpose usages.
Update 2: TeaFiles.Net also worked great for my purpose. Highly recommended (and free).
I see several options:
"in-memory-DB"
for example, SQLite can be used this way - no setup needed; just deploy the DLL (1 or 2) together with the app, and the rest can be done programmatically (see the sketch after this list)
Load all data into temporary table(s) in the DB; with unknown (but big) amounts of data I found that this pays off really fast (and processing can usually be done inside the DB, which is even better!)
use a MemoryMappedFile and a fixed structure size (array-like access via offsets), but beware that physical memory is the limit unless you use some sort of "sliding window" to map only parts into memory
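A minimal sketch of the in-memory SQLite option, assuming the System.Data.SQLite provider; the table and columns are placeholders for your own time-series data.
using System.Data.SQLite;

public static class InMemoryDb
{
    public static SQLiteConnection Create()
    {
        // ":memory:" keeps the whole database in RAM; no file, no setup.
        var conn = new SQLiteConnection("Data Source=:memory:");
        conn.Open();
        using (var cmd = conn.CreateCommand())
        {
            cmd.CommandText = "CREATE TABLE ticks (ts INTEGER, price REAL)";
            cmd.ExecuteNonQuery();
        }
        return conn;
    }

    // Parameterized insert; wrap many of these in a single transaction for speed.
    public static void Insert(SQLiteConnection conn, long ts, double price)
    {
        using (var cmd = conn.CreateCommand())
        {
            cmd.CommandText = "INSERT INTO ticks (ts, price) VALUES (@ts, @price)";
            cmd.Parameters.AddWithValue("@ts", ts);
            cmd.Parameters.AddWithValue("@price", price);
            cmd.ExecuteNonQuery();
        }
    }
}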
Memory-mapped files are a nice way to do it, but it is going to be very slow if you need to access things randomly.
Your best bet is probably to come up with a fixed structure size when saving to memory (if you can) and then use the offset as the list item id; a rough sketch of that follows. However, deletes/sorting are always a problem.
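The sketch below shows the fixed-record, offset-based idea; the Tick struct and record size are hypothetical, and it only covers indexed reads/writes rather than the full IList<T> surface.
using System;
using System.IO;
using System.IO.MemoryMappedFiles;

public struct Tick
{
    public long Timestamp;
    public double Price;
}

public sealed class MappedTickStore : IDisposable
{
    private const int RecordSize = 16; // sizeof(long) + sizeof(double)
    private readonly MemoryMappedFile _file;
    private readonly MemoryMappedViewAccessor _accessor;

    public MappedTickStore(string path, long capacityInRecords)
    {
        _file = MemoryMappedFile.CreateFromFile(
            path, FileMode.OpenOrCreate, null, capacityInRecords * RecordSize);
        _accessor = _file.CreateViewAccessor();
    }

    // Array-like access: the index becomes a byte offset into the mapping.
    public Tick this[long index]
    {
        get { Tick t; _accessor.Read(index * RecordSize, out t); return t; }
        set { var t = value; _accessor.Write(index * RecordSize, ref t); }
    }

    public void Dispose()
    {
        _accessor.Dispose();
        _file.Dispose();
    }
}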

JSON as an Encoding Protocol in a Large Distributed Application

I'm on a project that processes and reports on large sets of aggregatable row-based data. There is a primary aggregation service and then many clients who can subscribe to different views of the data from that server. The objects are passed back and forth between the Java server and the C# clients encoded in JSON. We're noticing that parsing the objects takes a lot of time and is somewhat memory-intensive. Have others used JSON for this purpose or seen similar behavior?
We used to use straight XML across the wire and had to use custom (i.e. manual) serialization for a lot of the objects. While that isn't JSON, we did take performance hits due to this constraint. Once we migrated all our tech to a similar architecture, we were able to switch to binary serialization, which worked much better.
However, for the objects where we had performance issues due to size, we made some modifications. Since we had access to the code on both ends (and both were C#), we were able to binary-serialize the payload and then base64-encode it, since it had to be text across the wire. It helped a good bit in terms of object size, and the serialization ran a bit faster.
Since you are going from Java to C#, you won't really have that luxury. So the only thing I can think of in your case would be to try to optimize your parsing of the JSON response. You may be able to use code profiling tools to identify the portions that are causing performance issues and then try to optimize those. Also, when you build the JSON string to be deserialized, make sure you use a StringBuilder; standard concatenation operations will kill performance as well.
Also, you might want to check around; I have seen several JSON serializers written for C# on the web, and some may be faster than what you are using, who knows.
Not sure if that helps all that much, but that is some of what we have seen with string-based message passing.
UPDATE: Just saw this on DotNetKicks: JSON.Net - it's an update from James to the Json.NET serializers. It may help out.
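Not the poster's approach, but for reference: with Json.NET you can deserialize straight from the response stream instead of building one big string first. The RowDto type and its properties are placeholders for whatever the server actually sends.
using System.Collections.Generic;
using System.IO;
using Newtonsoft.Json;

public class RowDto
{
    public string Key { get; set; }
    public double Value { get; set; }
}

public static class JsonReaderHelper
{
    public static List<RowDto> ReadRows(Stream responseStream)
    {
        var serializer = new JsonSerializer();
        using (var text = new StreamReader(responseStream))
        using (var json = new JsonTextReader(text))
        {
            // Reading from the stream avoids materializing the whole payload
            // as a single large string before parsing it.
            return serializer.Deserialize<List<RowDto>>(json);
        }
    }
}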
I know that for Java there are any number of open-source JSON serializers and deserializers. We use FlexJSON.
JSON can be expensive to decode. If performance is an issue try using something like Hessian.

Storing large amounts of data in files. What is the most performant option?

Currently I am doing XML serialization; however, it is very slow. I am looking for a way to save/load information from a file very quickly. I am not really interested in how it looks on disk (if anything, I want it to be obscured, as I don't want manual editing).
I am thinking of a binary format; however, I am not sure whether it would be able to serialize properties that may be of a custom type, etc.
Any ideas?
You can try using SQLite. It is very fast, and will give you a complete database implementation with SQL queries over a file.
If you are thinking of trying binary formats, I suggest you try this first.
It can also be used with an ORM, and it can be compressed and encrypted.
What exactly is the data?
With XML, the obvious answer would be to use something like GZipStream to compress it - making it smaller and obscured. You could use BinaryFormatter, but it is brittle and IMO unsuitable for long-term storage. I would say "protocol buffers" (maybe protobuf-net), but it depends on what the "custom data" is. But if you are using XmlSerializer at the moment, protobuf-net may work virtually without changes (maybe add a few attributes) - and it is (in every case I've seen to date) both smaller and faster than BinaryFormatter.
Here's the steep learning curve (see also: Getting Started):
[ProtoContract]
public class Person {
    [ProtoMember(1)]
    public int Id { get; set; }
    [ProtoMember(2)]
    public string Name { get; set; }
    //...
}
To be fair, it can get a little trickier if you are using inheritance - not much though (see the sketch below). In many cases you can actually use your existing attributes - it'll work with XML/WCF attributes if an explicit element order is included.
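For the inheritance case, the extra step is roughly this: the base type declares its subtypes with [ProtoInclude]. The Employee class and the tag number here are made-up examples, not from the question.
using ProtoBuf;

[ProtoContract]
[ProtoInclude(10, typeof(Employee))]
public class Person {
    [ProtoMember(1)]
    public int Id { get; set; }
    [ProtoMember(2)]
    public string Name { get; set; }
}

[ProtoContract]
public class Employee : Person {
    // Member numbers restart within the derived type's own field space.
    [ProtoMember(1)]
    public string Department { get; set; }
}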
Binary serialization certainly works with properties of Custom Types and typically produces smaller files than XML serialization. It's certainly an approach you should consider if file size is an important factor for your situation.
I agree with Am about using an embedded database like SQLite. It comes with significant benefits. The ability to layer an ORM on top of it is probably the most significant.
XML Serialization is handy, particularly when you need to be able to edit the XML by hand or process it with other XML tools like XSLT etc, but it also has some unavoidable performance problems. One important technique when using XML Serialization in .Net is to cache the XML Serializers. Or to have them created by sgen on build.
The reason to cache the XmlSerializer is that the .NET runtime will automatically generate, compile and load an assembly containing a serializer if it can't find one in an already loaded assembly. This process can be really slow, and constructing a new XmlSerializer instance can be quite slow as well. Hence you should cache it. Be careful when caching the serializer, though, as different XmlSerializer constructors produce different serializer implementations which behave differently, particularly with respect to namespaces, etc.
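A minimal sketch of the caching pattern; note that the framework only caches the generated assemblies for the XmlSerializer(Type) and XmlSerializer(Type, String) constructors, so serializers created through other overloads especially need caching by hand.
using System;
using System.Collections.Concurrent;
using System.Xml.Serialization;

public static class SerializerCache
{
    private static readonly ConcurrentDictionary<Type, XmlSerializer> Cache =
        new ConcurrentDictionary<Type, XmlSerializer>();

    // Reuse one serializer per type instead of constructing a new one per call.
    public static XmlSerializer For(Type type)
    {
        return Cache.GetOrAdd(type, t => new XmlSerializer(t));
    }
}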
Then, of course, there are the usual performance implications of parsing a lot of text. Unfortunately that isn't easy to avoid with XML.
One of the reasons SQLite is a better choice than XML is related to the fact that it is, at its core, a fixed length record storage system. Any binary file with fixed length records is going to be fast to read, index and scan. Fixed block size file formats are almost always screamingly fast to read and write. I would recommend implementing one at some point for your own education.
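As a toy illustration of a fixed-length record file (the layout - one long plus one double per record - is made up): record i starts at i * RecordSize, so a lookup is a single Seek.
using System.IO;

public static class FixedRecordFile
{
    private const int RecordSize = 16; // one long (8 bytes) + one double (8 bytes)

    public static void ReadRecord(string path, long index, out long key, out double value)
    {
        using (var stream = File.OpenRead(path))
        using (var reader = new BinaryReader(stream))
        {
            // Jump straight to the record; no scanning or parsing required.
            stream.Seek(index * RecordSize, SeekOrigin.Begin);
            key = reader.ReadInt64();
            value = reader.ReadDouble();
        }
    }
}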
If you still want a text based format (for ease of interoperability) and don't need the benefits of an ORM then consider using the FileHelpers library.

Best Serialization for a scenario where performance is paramount and data form is unimportant in .NET?

Which serialization should I use?
I need to store a large Dictionary with 100000+ elements, and I just need to save and load this data directly without caring whether it's binary or whether it's formatted or not.
Right now I am using the BinaryFormatter, but I'm not sure whether it's the most efficient option.
Please suggest better alternatives in the .NET standard libraries or an external library, preferably free.
EDIT: This is to serialize to disk and from it. The app is single threaded too.
Well, it will depend on what's in the dictionary - but if Protocol Buffers is flexible enough for you (you have to define your own types to serialize - it doesn't do all .NET types or anything like that), it's pretty darned fast.
For example, in protocol buffers I'd represent the dictionary as a message with a repeated key/value pair field. For ultimate speed you could use the CodedOutputStream and CodedInputStream to serialize/deserialize the dictionary directly rather than reading it all into memory separately first. Again, it'll depend on what the key/value types are though.
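If you'd rather stay in C# than drop down to the raw CodedInputStream/CodedOutputStream API, protobuf-net can round-trip a dictionary of simple types directly; a rough sketch follows (the int/string key and value types are just examples).
using System.Collections.Generic;
using System.IO;
using ProtoBuf;

public static class DictionaryStore
{
    public static void Save(string path, Dictionary<int, string> data)
    {
        using (var file = File.Create(path))
        {
            // protobuf-net writes the dictionary as repeated key/value pairs.
            Serializer.Serialize(file, data);
        }
    }

    public static Dictionary<int, string> Load(string path)
    {
        using (var file = File.OpenRead(path))
        {
            return Serializer.Deserialize<Dictionary<int, string>>(file);
        }
    }
}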
This is entirely a guess, since I haven't profiled it (profiling is what you should do to truly get your answer).
But my guess is that the binary serializer would give you the best performance. Both in size and speed.
This is a bit of an open-ended question. Are you storing this in memory or writing it to disk? Does this execute in a multi-threaded (and perhaps multi-concurrent-access) environment? Context is important.
BinaryFormatter is generally going to be pretty fast, and there are external libs that provide better compression, such as protocol buffers (e.g. protobuf-net). I've personally had good success with DataContractSerializer.
The great thing about all these options is that you can try all of them (relatively pain free) to learn for yourself what works in your environment and operation.
