Problem:
I have a large number of big messages to serialize and send over the network.
I would like to maximize performance, so I'm thinking about creating multiple threads for message serialization and one thread for sending the data. The idea is to dynamically determine the number of serialization threads based on network performance: if data is sent fast and serialization is the bottleneck, add more threads to boost serialization; if the network is slow, use fewer threads, and stop completely if the send buffer is full.
Any ideas how to do that?
What would be the best algorithm to decide whether more or fewer threads are needed?
How to correctly concatenate serialization results from multiple threads?
Please answer any of these questions. Thanks!
It can be treated as a producer/consumer problem; in Fx4 you can use a BlockingCollection.
But frankly I would expect the (network) I/O to be the bottleneck, not the serialization. You will have to measure.
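If you do go the producer/consumer route, a minimal sketch of that shape could look like the following, where GetMessages, Serialize and Send are hypothetical stand-ins for your own code; the bounded capacity makes Add block when the sender falls behind, which throttles the serializers without any explicit thread-count tuning:
var queue = new BlockingCollection<byte[]>(boundedCapacity: 100);

// serializer thread - start as many as measurement justifies
var serializer = new Thread(() =>
{
    foreach (var msg in GetMessages())      // your message source
        queue.Add(Serialize(msg));          // blocks while the queue is full
    queue.CompleteAdding();                 // with several serializer threads, call
                                            // this only once they have all finished
});

// single sender thread
var sender = new Thread(() =>
{
    foreach (var bytes in queue.GetConsumingEnumerable())
        Send(bytes);                        // e.g. networkStream.Write(...)
});

serializer.Start();
sender.Start();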
You can chunk the data into packets and place the packets into a queue. The process that looks at network performance would look at the queue and decide how many packets to send. The downside to this implementation is that you will need to assemble the packets on the receiving end where they may not be received in the proper order.
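For the reassembly concern, one common approach (an assumption here, not a prescription) is to prefix each packet with a sequence number so the receiver can reorder; a minimal sketch:
// requires: using System;
struct Packet
{
    public int SequenceNumber;   // position of this chunk within the message
    public byte[] Payload;       // the chunk itself

    // serialize as [4-byte sequence number][payload] for transmission
    public byte[] ToBytes()
    {
        var bytes = new byte[4 + Payload.Length];
        BitConverter.GetBytes(SequenceNumber).CopyTo(bytes, 0);
        Payload.CopyTo(bytes, 4);
        return bytes;
    }
}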
I am researching the development of real-time messaging with SignalR, using WebSockets as the transport.
My application will generate multiple messages at high rate and one question I came across is whether it would be a good idea to consider batching multiple messages before sending them out to the clients.
I have looked at the streaming functionality SignalR offers, but I don't think it fits well in this case.
The messages will have variable sizes, from just a few bytes up to kilobytes.
As I understand it, if messages are batched, less time will be spent on serialization?
Of course this will depend on the serializer being used and may vary depending on message size.
Also, there will be fewer round trips between client and server?
So the question is whether there would be performance gains by batching multiple messages before sending them out to clients.
I understand that it would be hard to give a conclusive answer, but I would still like to hear some ideas on the topic.
I have an application that needs to transfer many small (1-3 byte) messages via a TCP connection (WLAN).
I know that sending 100 kB at once is much faster than sending it in very small pieces (nearly byte-wise).
Nevertheless, the information I need to transfer is only 1-3 bytes in size. Collecting data before sending would increase throughput, but it is important that the small pieces of data are transferred as early/fast as possible. So gathering data before sending is the problem.
So I ask: what would be the best way to, on the one hand, not send every message individually, and on the other hand, not delay their transmission longer than necessary?
Now I'm thinking about creating a little buffer. When the first message is added, I start a timer with a 1 ms timeout. After that millisecond, the data will be transferred, regardless of how many bytes are in the queue. But I don't know if this is a good solution.
Isn't there a way for the TcpClient/TcpListener classes of .NET to solve such a problem themselves? I mean, they should know when the current transmission is finished. In the meantime they could accumulate all send requests and send them out as one transaction.
The TCP stack by default will already buffer the data internally in order to reduce overhead when sending. See the Nagle algorithm for details. Just make sure that you have this algorithm enabled, i.e. set NoDelay to false.
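For example (TcpClient exposes this directly, and false is already the default, so this just documents the intent):
tcpClient.NoDelay = false;   // leave Nagle's algorithm enabled: small writes get coalesced
// tcpClient.NoDelay = true; // would disable Nagle and send each tiny write immediately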
Let's say I have a static list List<string> dataQueue, where data keeps getting added at random intervals and also at a varying rate (1-1000 entries/second).
My main objective is to send the data from the list to the server, I'm using a TcpClient class.
What I've done so far: I'm sending the data synchronously via the TcpClient from a single thread:
byte[] bytes = Encoding.ASCII.GetBytes(message);
tcpClient.GetStream().Write(bytes, 0, bytes.Length);
//The client is already connected at the start
And I remove the entry from the list, once the data is sent.
This works fine, but the data isn't being sent fast enough; the list keeps growing and consumes more memory as it gets iterated and sent one entry at a time.
My question is: can I use the same tcpClient object to write concurrently from another thread, or can I use another tcpClient object with a new connection to the same server in another thread? What is the most efficient (quickest) way to send this data to the server?
PS: I don't want to use UDP
Right; this is a fun topic which I think I can opine about. It sounds like you are sharing a single socket between multiple threads - perfectly valid as long as you do it very carefully. A TCP socket is a logical stream of bytes, so you can't use it concurrently as such, but if your code is fast enough, you can share the socket very effectively, with each message being consecutive.
Probably the very first thing to look at is: how are you actually writing the data to the socket? What is your framing/encoding code like? If this code is simply bad/inefficient, it can probably be improved. For example, is it indirectly creating a new byte[] per string via a naive Encode call? Are there multiple buffers involved? Is it calling Send multiple times while framing? How is it approaching the issue of packet fragmentation? Etc.
As a very first thing to try - you could avoid some buffer allocations:
var enc = Encoding.ASCII;
byte[] bytes = ArrayPool<byte>.Shared.Rent(enc.GetMaxByteCount(message.Length));
try
{
    // note: leased buffers can be oversized; and in general, GetMaxByteCount will
    // also be oversized; so it is *very* important to track how many bytes you've used
    int byteCount = enc.GetBytes(message, 0, message.Length, bytes, 0);
    tcpClient.GetStream().Write(bytes, 0, byteCount);
}
finally
{
    ArrayPool<byte>.Shared.Return(bytes); // return even if Write throws, so the pool doesn't leak
}
This uses a leased buffer to avoid creating a byte[] each time - which can massively improve GC impact. If it was me, I'd also probably be using a raw Socket rather than the TcpClient and Stream abstractions, which frankly don't gain you a lot. Note: if you have other framing to do: include that in the size of the buffer you rent, use appropriate offsets when writing each piece, and only write once - i.e. prepare the entire buffer once - avoid multiple calls to Send.
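For example, assuming a simple 4-byte little-endian length prefix (purely illustrative - use whatever framing your protocol actually specifies), the whole-frame-in-one-write version might look like:
var enc = Encoding.ASCII;
byte[] buffer = ArrayPool<byte>.Shared.Rent(4 + enc.GetMaxByteCount(message.Length));
try
{
    // payload first, so we know the real length for the header
    int payloadBytes = enc.GetBytes(message, 0, message.Length, buffer, 4);
    // assumed framing: 4-byte little-endian length prefix
    buffer[0] = (byte)payloadBytes;
    buffer[1] = (byte)(payloadBytes >> 8);
    buffer[2] = (byte)(payloadBytes >> 16);
    buffer[3] = (byte)(payloadBytes >> 24);
    // one write for the whole frame - header and payload together
    tcpClient.GetStream().Write(buffer, 0, 4 + payloadBytes);
}
finally
{
    ArrayPool<byte>.Shared.Return(buffer);
}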
Right now, it sounds like you have a queue and dedicated writer; i.e. your app code appends to the queue, and your writer code dequeues things and writes them to the socket. This is a reasonable way to implement things, although I'd add some notes:
List<T> is a terrible way to implement a queue - removing things from the start requires a reshuffle of everything else (which is expensive); if possible, prefer Queue<T>, which is implemented perfectly for your scenario
it will require synchronization, meaning you need to ensure that only one thread alters the queue at a time - this is typically done via a simple lock, i.e. lock(queue) {queue.Enqueue(newItem);} and SomeItem next; lock(queue) { next = queue.Count == 0 ? null : queue.Dequeue(); } if (next != null) {...write it...}.
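Putting those notes together, a minimal sketch of the queue-plus-dedicated-writer shape (names like WriteToSocket and running are illustrative, not prescriptive):
readonly Queue<string> queue = new Queue<string>();

void EnqueueMessage(string message)       // called from app threads
{
    lock (queue) { queue.Enqueue(message); }
}

void WriterLoop()                         // runs on the dedicated writer thread
{
    while (running)
    {
        string next = null;
        lock (queue)
        {
            if (queue.Count > 0) next = queue.Dequeue();
        }
        if (next != null)
            WriteToSocket(next);          // the encode-and-send code from above
        else
            Thread.Sleep(1);              // or better: Monitor.Wait/Pulse to avoid spinning
    }
}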
This approach is simple, and has some advantages in terms of avoiding packet fragmentation - the writer can use a staging buffer, and only actually write to the socket when a certain threshold is buffered, or when the queue is empty, for example - but it has the possibility of creating a huge backlog when stalls occur.
However! The fact that a backlog has occurred indicates that something isn't keeping up; this could be the network (bandwidth), the remote server (CPU) - or perhaps the local outbound network hardware. If this is only happening in small blips that then resolve themselves - fine (especially if it happens when some of the outbound messages are huge), but: one to watch.
If this kind of backlog is recurring, then frankly you need to consider that you're simply saturated for the current design, so you need to unblock one of the pinch points:
making sure your encoding code is efficient is step zero
you could move the encode step into the app-code, i.e. prepare a frame before taking the lock, encode the message, and only enqueue an entirely prepared frame; this means that the writer thread doesn't have to do anything except dequeue, write, recycle - but it makes buffer management more complex (obviously you can't recycle buffers until they've been completely processed)
reducing packet fragmentation may help significantly, if you're not already taking steps to achieve that
otherwise, you might need (after investigating the blockage):
better local network hardware (NIC) or physical machine hardware (CPU etc)
multiple sockets (and queues/workers) to round-robin between, distributing load
perhaps multiple server processes, with a port per server, so your multiple sockets are talking to different processes
a better server
multiple servers
Note: in any scenario that involves multiple sockets, you want to be careful not to go mad and have too many dedicated worker threads; if that number goes above, say, 10 threads, you probably want to consider other options - perhaps involving async IO and/or pipelines (below).
For completeness, another basic approach is to write from the app-code; this approach is even simpler, and avoids the backlog of unsent work, but: it means that now your app-code threads themselves will back up under load. If your app-code threads are actually worker threads, and they're blocked on a sync/lock, then this can be really bad; you do not want to saturate the thread-pool, as you can end up in the scenario where no thread-pool threads are available to satisfy the IO work required to unblock whichever writer is active, which can land you in real problems. This is not usually a scheme that you want to use for high load/volume, as it gets problematic very quickly - and it is very hard to avoid packet fragmentation since each individual message has no way of knowing whether more messages are about to come in.
Another, more recent, option to consider is "pipelines"; this is a new IO framework in .NET that is designed for high-volume networking, giving particular attention to things like async IO, buffer re-use, and a well-implemented buffer/back-log mechanism that makes it possible to use the simple writer approach (synchronize while writing) and have that not translate into direct sends - it manifests as an async writer with access to a backlog, which makes packet-fragmentation avoidance simple and efficient. This is quite an advanced area, but it can be very effective. The problematic part for you will be: it is designed for async usage throughout, even for writes - so if your app-code is currently synchronous, this could be a pain to implement. But: it is an area to consider. I have a number of blog posts talking about this topic, and a range of OSS examples and real-life libraries that make use of pipelines that I can point you at, but: this isn't a "quick fix" - it is a radical overhaul of your entire IO layer. It also isn't a magic bullet - it can only remove overhead due to local IO processing costs.
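As an illustration only (the method name is mine, and this assumes the System.IO.Pipelines package plus the span-based GetBytes overload from .NET Core 2.1+), writing one message through a PipeWriter looks roughly like:
// requires: using System; using System.IO.Pipelines; using System.Text; using System.Threading.Tasks;
static async Task WriteMessageAsync(PipeWriter writer, string message)
{
    var enc = Encoding.ASCII;
    int byteCount = enc.GetByteCount(message);
    // lease space from the pipe's internal buffer instead of allocating
    Memory<byte> buffer = writer.GetMemory(byteCount);
    enc.GetBytes(message.AsSpan(), buffer.Span);
    writer.Advance(byteCount);
    // FlushAsync hands the data to whatever is draining the pipe (e.g. the socket)
    await writer.FlushAsync();
}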
I am looking for advice on how best to architect a buffer structure that can handle a massive amount of incoming data that is processed more slowly than it arrives.
I programmed a customized binary reader that can stream up to 12 million byte arrays per second on a single thread, and I am looking to process the byte array stream in a separate structure on the same machine and a different thread. The problem is that the consuming structure cannot keep up with the incoming data from the producer, and thus I believe I need some sort of buffer to handle this properly. I am most interested in advice regarding the overall architecture rather than code examples. I target .NET 4.0. Here is more information about my current setup and requirements.
Producer: Runs on a dedicated thread and reads byte arrays from files on a physical storage medium (SSD, OCZ Vertex 3 Max IOPS). Approximate throughput is 12 million byte arrays per second. Each array is only 16 bytes in size. Fully implemented.
Consumer: Supposed to run on a separate thread from the producer. Consumes byte arrays, but must parse them into several primitive data types before processing the data, so the processing speed is significantly slower than the producer's publishing speed. The consumer structure is fully implemented.
In between: Looking to set up a buffered structure that the producer can publish to and the consumer can, well, consume from. Not implemented.
I would be happy if some of you could comment, from your own experience or expertise, on what is best to consider in order to handle such a structure. Should the buffer implement a throttling algorithm that only requests new data from the producer when the buffer/queue is half empty or so? How are locking and blocking handled? I am sorry, I have very limited experience in this space; so far I have handled it through the implementation of a messaging bus, but any messaging-bus technology I looked at is definitely unable to handle the throughput I am looking for. Any comments very welcome!
Edit: Forgot to mention, the data is only consumed by one single consumer. Also the order in which the arrays are published does matter; the order needs to be preserved such that the consumer consumes in the same order.
16 bytes (call it 16B) is too small for efficient inter-thread comms. Queueing up such small buffers will result in more CPU spent on inter-thread communication than on actual useful processing of the data.
So, chunk them up.
Declare some buffer class (C16B, say) that contains a nice, big array of these 16Bs - at least 4K's worth - and a 'count' int to show how many are loaded (the last buffer loaded from a file will probably not be full). It will help if you place a cache-line-sized empty byte array just in front of this 16B array - it helps to avoid false sharing. You can maybe put the code that processes the 16Bs in as a method, 'Process16B', say, and perhaps the code that loads the array too - taking a file descriptor as a parameter. This class can now be efficiently loaded up and queued to other threads.
You need a producer-consumer queue class - C# already has one in the BlockingCollection class.
You need flow-control in this app. I would do it by creating a pool of C16B's - create a blocking queue and create/add a big pile of C16B's in a loop. 1024 is a nice, round number. Now you have a 'pool queue' that provides flow-control, avoids the need to new() any C16B's and you don't need them to be continually garbage-collected.
Once you have this, the rest is easy. In your loader thread, continually dequeue C16Bs from the pool queue, load them up with data from the files, and Add() them off to the processing thread/s on a '16Bprocess' blocking queue. In the processing threads, Take() from the 16Bprocess queue and process each C16B instance by calling its Process16B method. When the 16Bs are processed, Add() the C16B back to the pool queue for re-use.
The recycling of the C16B's via the pool queue provides end-to-end flow-control. If the producer is the fastest link, the pool will eventually empty and the producer will block there until the consumer/s returns some C16B's.
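A sketch of that scheme, using BlockingCollection for both queues (the sizes and names are illustrative, and LoadFromFile stands in for your reader):
// requires: using System.Collections.Concurrent;
class C16B
{
    public const int Capacity = 4096;              // 4K entries of 16 bytes each
    public readonly byte[] Data = new byte[Capacity * 16];
    public int Count;                              // how many entries are actually loaded

    public void Process16B() { /* parse/process the Count loaded entries */ }
}

var pool = new BlockingCollection<C16B>();         // the 'pool queue'
for (int i = 0; i < 1024; i++) pool.Add(new C16B());

var toProcess = new BlockingCollection<C16B>();    // the '16Bprocess' queue

// loader thread:    var c = pool.Take(); LoadFromFile(c); toProcess.Add(c);
//                   (Take blocks when the pool is empty - that's the flow-control)
// processor thread: var c = toProcess.Take(); c.Process16B(); pool.Add(c);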
If the processing takes too much time, you could always add another processing thread, if you have spare cores available. The snag with such schemes is that the data will get processed out of order. This may, or may not, matter. If it does, the data flow might need 'straightening out' later, e.g. using sequence numbers and a buffer-list.
I would advise dumping the pool-queue count (and maybe the 16Bprocess queue count as well) to a status component or command line on a timer. This provides a useful snapshot of where the C16B instances are, and you can see the bottlenecks and any C16B leaks without 3rd-party tools (the ones that slow the whole app down to a crawl and issue spurious leak reports on shutdown).
You can use a BlockingCollection; it will block the producer from adding items to the collection as long as the consumer hasn't consumed enough items.
There are other concurrent collection classes as well, e.g. ConcurrentQueue.
IMO a blocking Queue of some kind may solve your problem. Essentially the Producer thread will block if the queue has no more capacity. Look at this Creating a blocking Queue<T> in .NET?
Why bother with a buffer at all? Use the disk files as a buffer. When the consumer starts processing a byte array, have the reader read the next one and that's it.
EDIT: after the request to decouple the consumer and producer.
You can have a coordinator that tells the producer to produce X byte arrays, and supplies X byte arrays to the consumer. The three parts can act like this:
Coordinator tells producer to produce X byte arrays.
Producer produces X byte arrays
And now do this in a loop:
Coordinator tells consumer to consume X byte arrays
Coordinator tells producer to produce X byte arrays
Consumer tells coordinator it's done consuming
Loop until there are no more byte arrays
The producer and coordinator can run in the same thread. The consumer should have its own thread.
You will have almost no locking (I think you can do this with no locking at all, just a single wait handle the consumer uses to notify the coordinator it's done), and your coordinator is extremely simple.
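A rough sketch of that handshake, with ProduceArrays and StartConsumerThread as placeholders for your own code:
// requires: using System.Threading;
var consumerDone = new AutoResetEvent(false);
int x = 10000;                                   // 'X', the batch size

byte[][] batch = ProduceArrays(x);               // first batch
while (batch.Length > 0)
{
    StartConsumerThread(batch, consumerDone);    // consumer signals the handle when done
    batch = ProduceArrays(x);                    // produce the next batch in parallel
    consumerDone.WaitOne();                      // wait before handing over the next batch
}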
RE-EDIT: Another, really decoupled, option
Use ZeroMQ for handling the communications. The producer reads byte arrays and posts each array to a ZeroMQ socket. The consumer reads arrays from a ZeroMQ socket.
ZeroMQ is very efficient and fast, and handles all the technicalities (thread synchronization, buffering, etc.) internally. When used on the same computer, you won't suffer any data loss either (which might happen when using UDP between two different machines).
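In C# this would typically go through a binding such as NetMQ; one possible shape (the address and the ReadArraysFromFile/Process names are illustrative), where push/pull sockets preserve ordering with a single producer and single consumer, matching the requirement that arrays be consumed in publish order:
// requires: using NetMQ; using NetMQ.Sockets;

// producer thread - '@' binds the address
using (var push = new PushSocket("@inproc://arrays"))
{
    foreach (byte[] array in ReadArraysFromFile())
        push.SendFrame(array);                    // ZeroMQ buffers and synchronizes internally
}

// consumer thread - '>' connects
using (var pull = new PullSocket(">inproc://arrays"))
{
    while (true)
    {
        byte[] array = pull.ReceiveFrameBytes();  // blocks until data arrives
        Process(array);
    }
}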
We need to develop some kind of buffer management for an application we are developing using C#.
Essentially, the application receives messages from devices as and when they come in (there could be many in a short space of time). We need to queue them up in some kind of buffer pool so that we can process them in a managed fashion.
We were thinking of allocating a block of memory in 256 byte chunks (all messages are less than that) and then using buffer pool management to have a pool of available buffers that can be used for incoming messages and a pool of buffers ready to be processed.
So the flow would be "Get a buffer" (process it) "Release buffer" or "Leave it in the pool". We would also need to know when the buffer was filling up.
Potentially, we would also need a way to "peek" into the buffers to see what the highest priority buffer in the pool is rather than always getting the next buffer.
Is there already support for this in .NET or is there some open source code that we could use?
C#'s memory management is actually quite good, so instead of having a pool of buffers, you could just allocate exactly what you need and stick it into a queue. Once you are done with a buffer, just let the garbage collector handle it.
One other option (knowing only very little about your application), is to process the messages minimally as you get them, and turn them into full fledged objects (with priorities and all), then your queue could prioritize them just by investigating the correct set of attributes or methods.
If your messages come in too fast even for minimal processing, you could have a two-queue system. One is just a queue of unprocessed buffers, and the other is the queue of message objects built from the buffers.
I hope this helps.
#grieve: Networking is native, meaning that when buffers are used to receive/send data on the network, they are pinned in memory. See my comments below for elaboration.
Why wouldn't you just receive the messages, create a DeviceMessage (for lack of a better name) object, and put that object into a Queue? If prioritization is important, implement a PriorityQueue class that handles it automatically (by placing the DeviceMessage objects in priority order as they're inserted into the queue). This seems like a more OO approach, and it would simplify maintenance over time with regard to prioritization.
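A minimal sketch of such a PriorityQueue, assuming a DeviceMessage with an int Priority property (lower value = more urgent), keeping FIFO order within each priority level:
// requires: using System.Collections.Generic;
class PriorityQueue
{
    readonly SortedDictionary<int, Queue<DeviceMessage>> levels =
        new SortedDictionary<int, Queue<DeviceMessage>>();

    public void Enqueue(DeviceMessage msg)    // not thread-safe: lock externally
    {
        Queue<DeviceMessage> q;
        if (!levels.TryGetValue(msg.Priority, out q))
            levels[msg.Priority] = q = new Queue<DeviceMessage>();
        q.Enqueue(msg);
    }

    public DeviceMessage Dequeue()
    {
        foreach (var pair in levels)          // first key = highest priority
        {
            DeviceMessage msg = pair.Value.Dequeue();
            if (pair.Value.Count == 0) levels.Remove(pair.Key);
            return msg;                       // returning immediately keeps enumeration safe
        }
        return null;                          // queue is empty
    }
}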
I know this is an old post, but I think you should take a look at the memory pool implemented in the ILNumerics project. I think they did exactly what you need and it is a very nice piece of code.
Download the code at http://ilnumerics.net/ and take a look at the file ILMemoryPool.cs
I'm doing something similar. I have messages coming in on MTA threads that need to be serviced on STA threads.
I used a BlockingCollection (part of the Parallel Extensions) that is monitored by several STA threads (configurable, but defaults to xr * the number of cores). Each thread tries to pop a message off the queue; they either time out and try again, or successfully pop a message off and service it.
I've got it wired with perfmon counters to keep track of idle time, job lengths, incoming messages, etc, which can be used to tweak the queue's settings.
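The worker loop is roughly as follows (Service, shuttingDown and Message are placeholders for your own code):
Message msg;
while (!shuttingDown)
{
    // the timeout lets the thread wake periodically, e.g. to check for
    // shutdown or to pump the perfmon counters mentioned above
    if (queue.TryTake(out msg, TimeSpan.FromMilliseconds(250)))
        Service(msg);
}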
You'd have to implement a custom collection, or perhaps extend BC, to implement queue item priorities.
One of the reasons why I implemented it this way is that, as I understand it, queueing theory generally favors a single line with multiple servers (why do I feel like I'm going to catch crap about that?).