I'm developing a gaming TCP server and I'm planning that it should withstand up to about 500 connections. Data transfer occurs constantly (movement of the character, attack, etc.).
I started develop with the TCPListener, but I got a performance problem.
I used await XXXSync methods to accept clients, receive data, and sending messages.
I created a program that simulates the actions of players. It allowed to create only 30-40 connections and this caused 100% of the CPU. To test my server, I used two machines: one was a simulator of players, on the other - a server.
These 30-40 clients load the CPU on the server by 10-20% and very much (at 100%) while simultaneously disconnecting all TCP clients.
I was looking for information about optimizing the TCPListener, but nothing found. I've seen topics where people have written that they own a server with 1,000+ connections and this works well.
I thought that the reason could be in the logic of my server, but the profiling tools showed that the main part of the CPU load is getting and sending data (await client.ReadAsync / client.WriteAsync).
Picture 1 | Picture 2
I started looking for possible alternatives and found SocketAsyncEventArgs. At the forum, people wrote that this method can increase productivity by 2 times. I found the source codes of some servers written in C # and took from there the implementation of SocketAsyncEventArgs.
This really greatly increased the productivity by 10-15 times. For 30 simulations, this causes now only 0.5-2%.
I rewrote the emulator of the players at SocketAsyncEventArgs and this allowed me to create 100 simulations with 20% CPU.
After 70 players the load on the CPU on the server grew and is 7-13%.
Profiling tools show that the main use of the CPU is the sending of messages.
For some reason, the most productive option for me was BeginWrite. Write was not suitable because it blocks the stream. WriteAsync was slower, or I just misused it. Picture 3 | Picture 4
I can not find information about optimizing Sockets on c #. Everyone everywhere writes that they are all well even on the TCPListener, which for me is simply unbearable.
Is there any other way to improve performance?
P.S. I tested the server on a computer with i7 seventh generation.
Related
I know this topic is already asked sometimes, and I have read almost all threads and comments, but I'm still not finding the answer to my problem.
I'm working on a high-performance network library that must have TCP server and client, has to be able to accept even 30000+ connections, and the throughput has to be as high as possible.
I know very well I have to use async methods, and I have already implemented all kinds of solutions that I have found and tested them.
In my benchmarking, only the minimal code was used to avoid any overhead in the scope, I have used profiling to minimize the CPU load, there is no more room for simple optimization, on the receiving socket the buffer data was always read, counted and discarded to avoid socket buffer fill completely.
The case is very simple, one TCP Socket listens on localhost, another TCP Socket connects to the listening socket (from the same program, on the same machine oc.), then one infinite loop starts to send 256kB sized packets with the client socket to the server socket.
A timer with 1000ms interval prints a byte counter from both sockets to the console to make the bandwidth visible then resets them for the next measurement.
I've realized the sweet-spot for packet size is 256kB and the socket's buffer size is 64kB to have the maximum throughput.
With the async/await type methods I could reach
~370MB/s (~3.2gbps) on Windows, ~680MB/s (~5.8gbps) on Linux with mono
With the BeginReceive/EndReceive/BeginSend/EndSend type methods I could reach
~580MB/s (~5.0gbps) on Windows, ~9GB/s (~77.3gbps) on Linux with mono
With the SocketAsyncEventArgs/ReceiveAsync/SendAsync type methods I could reach
~1.4GB/s (~12gbps) on Windows, ~1.1GB/s (~9.4gbps) on Linux with mono
Problems are the following:
async/await methods were the slowest, so I will not work with them
BeginReceive/EndReceive methods started new async thread together with the BeginAccept/EndAccept methods, under Linux/mono every new instance of the socket was extremely slow (when there was no more thread in the ThreadPool mono started up new threads, but to create 25 instance of connections did take about 5 mins, creating 50 connections was impossible (program just stopped doing anything after ~30 connections).
Changing the ThreadPool size did not help at all, and I would not change it (it was just a debug move)
The best solution so far is SocketAsyncEventArgs, and that makes the highest throughput on Windows, but in Linux/mono it is slower than the Windows, and it was the opposite before.
I've benchmarked both my Windows and Linux machine with iperf,
Windows machine produced ~1GB/s (~8.58gbps), Linux machine produced ~8.5GB/s (~73.0gbps)
The weird thing is iperf could make a weaker result than my application, but on Linux, it is much higher.
First of all, I would like to know if the results are normal, or can I get better results with a different solution?
If I decide to use the BeginReceive/EndReceive methods (they produced relatively the highest result on Linux/mono) then how can I fix the threading problem, to make the connection instance creating fast, and eliminate the stalled state after creating multiple instances?
I continue making further benchmarks and will share the results if there is any new.
================================= UPDATE ==================================
I promised code snippets, but after many hours of experimenting the overall code is kind of a mess, so I would just share my experience in case it can help someone.
I had to realize under Window 7 the loopback device is slow, could not get higher result than 1GB/s with iperf or NTttcp, only Windows 8 and newer versions have fast loopback, so I don't care anymore about Windows results until I can test on newer version. SIO_LOOPBACK_FAST_PATH should be enabled via Socket.IOControl, but it throws exception on Windows 7.
It turned out the most powerful solution is the Completion event based SocketAsyncEventArgs implementation both on Windows and Linux/Mono. Creating a few thousand instances of the clients never messed up the ThreadPool, the program did not stop suddenly as I mentioned above. This implementation is very nice to the threading.
Creating 10 connections to the listening socket and feeding data from 10 separate thread from the ThreadPool with the clients together could produce ~2GB/s data traffic on Windows, and ~6GB/s on Linux/Mono.
Increasing the client connection count did not improve the overall throughput, but the total traffic became distributed among the connections, this might be because the CPU load was 100% on all cores/threads even with 5, 10 or 200 clients.
I think overall performance is not bad, 100 clients could produce around ~500mbit/s traffic each. (Of course this is measured in local connections, real life scenario on network would be different.)
The only observation I would share: experimenting with both the Socket in/out buffer sizes and with the program read/write buffer sizes/loop cycles highly affected the performance and very differently on Windows and on Linux/Mono.
On Windows the best performance has been reached with 128kB socket-receive, 32kB socket-send, 16kB program-read and 64kB program-write buffers.
On Linux the previous settings produced very weak performance, but 512kB socket-receive and -send both, 256kB program-read and 128kB program-write buffer sizes worked the best.
Now my only problem is if I try create 10000 connecting sockets, after around 7005 it just stops creating the instances, does not throw any exceptions, and the program is running as there was no any problem, but I don't know how can it quit from a specific for loop without break, but it does.
Any help would be appreciated regarding anything I was talking about!
Because this question gets a lot of views I decided to post an "answer", but technically this isn't an answer, but my final conclusion for now, so I will mark it as answer.
About the approaches:
The async/await functions tend to produce awaitable async Tasks assigned to the TaskScheduler of the dotnet runtime, so having thousands of simultaneous connections, therefore thousands or reading/writing operations will start up thousands of Tasks. As far as I know this creates thousands of StateMachines stored in ram and countless context switchings in the threads they are assigned to, resulting in very high CPU overhead. With a few connections/async calls it is better balanced, but as the awaitable Task count grows it gets slow exponentially.
The BeginReceive/EndReceive/BeginSend/EndSend socket methods are technically async methods with no awaitable Tasks, but with callbacks on the end of the call, which actually optimizes more the multithreading, but still the limitation of the dotnet design of these socket methods are poor in my opinion, but for simple solutions (or limited count of connections) it is the way to go.
The SocketAsyncEventArgs/ReceiveAsync/SendAsync type of socket implementation is the best on Windows for a reason. It utilizes the Windows IOCP in the background to achieve the fastest async socket calls and use the Overlapped I/O and a special socket mode. This solution is the "simplest" and fastest under Windows. But under mono/linux, it never will be that fast, because mono emulates the Windows IOCP by using linux epoll, which actually is much faster than IOCP, but it has to emulate the IOCP to achieve dotnet compatibility, this causes some overhead.
About buffer sizes:
There are countless ways to handle data on sockets. Reading is straightforward, data arrives, You know the length of it, You just copy bytes from the socket buffer to Your application and process it.
Sending data is a bit different.
You can pass Your complete data to the socket and it will cut it to chunks, copy the chucks to the socket buffer until there is no more to send and the sending method of the socket will return when all data is sent (or when error happens).
You can take Your data, cut it to chunks and call the socket send method with a chunk, and when it returns then send the next chunk until there is no more.
In any cases You should consider what socket buffer size You should choose. If You are sending large amount of data, then the bigger the buffer is, the less chunks has to be sent, therefore less calls in Your (or in the socket's internal) loop has to be called, less memory copy, less overhead.
But allocating large socket buffers and program data buffers will result in large memory usage, especially if You are having thousands of connections, and allocating (and freeing up) large memory multiple times is always expensive.
On sending side 1-2-4-8kB socket buffer size is ideal for most cases, but if You are preparing to send large files (over few MB) regularly then 16-32-64kB buffer size is the way to go. Over 64kB there is usually no point to go.
But this has only advantage if the receiver side has relatively large receiving buffers too.
Usually over the internet connections (not local network) no point to get over 32kB, even 16kB is ideal.
Going under 4-8kB can result in exponentially incremented call count in the reading/writing loop, causing large CPU load and slow data processing in the application.
Go under 4kB only if You know Your messages will usually be smaller than 4kB, or just very rarely over 4KB.
My conclusion:
Regarding my experiments built-in socket class/methods/solutions in dotnet are OK, but not efficient at all. My simple linux C test programs using non-blocking sockets could overperform the fastest and "high-performance" solution of dotnet sockets (SocketAsyncEventArgs).
This does not mean it is impossible to have fast socket programming in dotnet, but under Windows I had to make my own implementation of Windows IOCP by directly communicating with the Windows Kernel via InteropServices/Marshaling, directly calling Winsock2 methods, using a lot of unsafe codes to pass the context structs of my connections as pointers between my classes/calls, creating my own ThreadPool, creating IO event handler threads, creating my own TaskScheduler to limit the count of simultaneous async calls to avoid pointlessly much context switches.
This was a lot of job with a lot of research, experiment, and testing. If You want to do it on Your own, do it only if You really think it worth it. Mixing unsafe/unmanage code with managed code is a pain in the ass, but the end it worth it, because with this solution I could reach with my own http server about 36000 http request/sec on a 1gbit lan, on Windows 7, with an i7 4790.
This is such a high performance that I never could reach with dotnet built-in sockets.
When running my dotnet server on an i9 7900X on Windows 10, connected to a 4c/8t Intel Atom NAS on Linux, via 10gbit lan, I can use the complete bandwidth (therefore copying data with 1GB/s) no matter if I have only 1 or 10000 simultaneous connections.
My socket library also detects if the code is running on linux, and then instead of Windows IOCP (obviously) it is using linux kernel calls via InteropServices/Marshalling to create, use sockets, and handle the socket events directly with linux epoll, managed to max out the performance of the test machines.
Design tip:
As it turned out it is difficult to design a networking library from scatch, especially one, that is likely very universal for all purposes. You have to design it to have many settings, or especially to the task You need.
This means finding the proper socket buffer sizes, the I/O processing thread count, the Worker thread count, the allowed async task count, these all has to be tuned to the machine the application running on and to the connection count, and data type You want to transfer through the network. This is why the built-in sockets are not performing that good, because they must be universal, and they do not let You set these parameters.
In my case assingning more than 2 dedicated threads to I/O event processing actually makes the overall performance worse, because using only 2 RSS Queues, and causing more context switching than what is ideal.
Choosing wrong buffer sizes will result in performance loss.
Always benchmark different implementations for the simulated task You need to find out which solution or setting is the best.
Different settings may produce different performance results on different machines and/or operating systems!
Mono vs Dotnet Core:
Since I've programmed my socket library in a FW/Core compatible way I could test them under linux with mono, and with core native compilation. Most interestingly I could not observe any remarkable performance differences, both were fast, but of course leaving mono and compiling in core should be the way to go.
Bonus performance tip:
If Your network card is capable of RSS (Receive Side Scaling) then enable it in Windows in the network device settings in the advanced properties, and set the RSS Queue from 1 to as high you can/as high is the best for your performance.
If it is supported by Your network card then it is usually set to 1, this assigns the network event to process only by one CPU core by the kernel. If You can increment this queue count to higher numbers then it will distribute the network events between more CPU cores, and will result in much better performance.
In linux it is also possible to set this up, but in different ways, better to search for Your linux distro/lan driver information.
I hope my experience will help some of You!
I had the same problem. You should take a look into:
NetCoreServer
Every thread in the .NET clr threadpool can handle one task at one time. So to handle more async connects/reads etc., you have to change the threadpool size by using:
ThreadPool.SetMinThreads(Int32, Int32)
Using EAP (event based asynchronous pattern) is the way to go on Windows. I would use it on Linux too because of the problems you mentioned and take the performance plunge.
The best would be io completion ports on Windows, but they are not portable.
PS: when it comes to serialize objects, you are highly encouraged to use protobuf-net. It binary serializes objects up to 10x times faster than the .NET binary serializer and saves a little space too!
I am working in a full production environment that has a range of PLCS around our production mill, each of these PLC's talk back through a 'DataHighway +' network back to a special PC on our LAN Network called the MicroLinks PC. This has the ROCKWELL OPC RSLinx Classic server software on it.
So, recently I have put together a piece of .NET software in c# using the OPC .NET API to read to ROCKWELL OPC server on the Microlinks PC and sync data back into our MYSQL database that is sat on our WINDOWS R2 server PC
Ever since turning on the .net software, the engineers on site have experienced a massive slow down in developing new PLC scripts and fault finding.
Some of the reports are even as bad as 10 second lags.
Consequently, we have had to turn of the .NET software to sync the data to allow the Engineers to do their work swiftly without issues.
So i am looking for some advice on where or what i should look for, any resources to read for this type of problem etc. As PLC and networks are way out of my depth, I am just the .NET programmer.
Here is the structure of our network:
I'm not sure which type of rockwell PLCs you are using. I'm most familiar with the ControlLogix platform so I'll talk about that.
The ethernet card in a controllogix PLC connects at 100Mb/s but the card can't actually handle 100Mb/s continuously. A 1756-ENBt card can handle about 5000 packets per second, the EN2T roughly double that. There are formulas in the rockwell docs explain how to calculate packets per second but another option when you have a running system is to connect the 'Logix5000 task monitor' that comes with RS Logix and verify the CPU usage of the Ethernet card I think Rockwell recommends you keep it under 60%. If you are requesting too many packets then this CPU won't keep up
The PLC itself can starve communications. Controllogix has a "overhead time slice" setting which is the percent of time the PLC spends servicing communication tasks as opposed to running its own logic. Increasing this percentage can improve comms a bit.
It sounds like your program is putting a large burden on the PLC. Does it get better if you slow down your app so that it is not pulling as much data as fast?
One easy way to reduce the number of packets required to retrieve a block of data without slowing down the update rate is to put it all in one array. RSLinx will then be able to optimize the request instead of pulling individual tags
I have had plenty of troubles using Rockwell RSLinx on my local PC trying to find the IP address of a PLC plugged directly into my ethernet port. Using the "Autobrowse" option, it completely locks up my PC trying to scan the ports and IP addresses for targets.
It might just be poorly optimized Rockwell software causing issues. You also may be exchanging a whole lot of data and your server PC is struggling to keep up.
I would contact Rockwell/Allen Bradley support for help with this. They will probably want some cash to help you.
You're almost deinifitely over-polling the PLC. Try polling less and less frequently until find a value that doesn't slow down the network For example, if you're requesting data every 100ms now, change that once per second. Then once per minute. Then once every 15 minutes. At each step check the comms speed for the programming terminals.
We are trying to build an OpenRtb bidder on Azure env. We use redis instances that deployed on Linux VM to store real time data (tracking keys, requests/bidwins counts, etc). We use WebApi based website (standard plan/large instances/scaling by performance) as the end-point. In webapi bidder controller we use async methods, all requests to DBs and redis also async. Json.net to ser/der json req/resp.
Currently we have issues with latency. We should have possibility to receive more then 10000 req per sec and latency should be <100ms.
Could some one to share experience with me? Is this tech stack good for building apps like rtb bidders. Currently I'm trying to find the best strategy to store request context (query, request body, headers, etc) for each request. So, I need the way to insert a lot of (>10000) big messages very fast. I'm thinking about:
storing to log files and copying them to HDFS and parsing by Hadoop MapReduce tasks ( HDInsight)
using some queue such as AzureQueue or ServiceBus or maybe RabbitMQ and send req messages to queue, and some services (self-made or such as LogStash) will receive them and store to some storage as well.
Maybe some one could show me directions how to optimize latency and performance, because currently we have issues with this. Maybe some basic pitfalls?
Interesting.
I have little experience with Azure, but building rtb bidders with high throughput and low latency can be tricky if you involve things like queues or any kind of IO. Use those only if necessary, and preferably out of the bid path. Cache everything or use a high speed in memory store (Aerospike, Couchbase, etc .. even redis if you're smart about it).
Async is great but careful with having too many threads to schedule. that can induce high latency, if you can use epoll, non blocking IO.
I've build a bidder running on aws, with netty (async non blocking io on the jvm) wrapped on Twitter's Finagle (twitter RPC system), chronicle map as model caching and aerospike for data serving. We could handle 20k req/sec on one box of m4.2xlarge (8 cores/ 32 Go) with a 99 percentile tail latency of around 45 ms and 16 ms on average.
You tend to read that you should onlyuse bare metal for bidders, but it's more about architecture that virtualisation or not.
We had the same problem with our implementation and actually we decided to check possible alternatives. We benchmarked different langs and libs.
source Yandex.Tank response per second(ubuntu vm 8 cores)
target ubuntu vm 8 cores
golang fast http 30k+
nginx 20k
golang http 20k-
haskell wai warp 15k+
clojure http-kit 15k-
node.js 7k
rust hyper 10k+
rust iron 10k-
fsharp suave.io 4k+ (best result ever for .net web servers)
asp.net 5 kestrel coreclr/mono ??? 400-
So after that we started to use golang + fasthttp + redis. Actually it was enought to handle 40k rps on a single instance with a 99 percentile less than 11 ms. Actually currently asp.net 5 shows good results in speed you could check here
We are currently investigating the most efficient way of communicating between 120-140 embedded hardware devices running on the .NET Micro framework and a server.
Each embedded device needs to send to, and request information from the server on a fairly regular basis all in real time through TCP.
My question is this: Would it be better to initialise 140 TCP connections to the server, and then hang on to these connections, or initialise a new connection for each requests to and from the devices? Would holding on to and managing 140 TCP connections put a lot of strain on the server?
When the server detects new data in the database it needs to send this new info to 1..* devices (information is targeted to specific devices), if I held on to the 140 connections I would need to do a lookup for the correct connection each time I needed to send information instead of just sending to an IP:PORT associated with the new data.
I guess another possibly stupid question would be is it actually possibly to hang on to 140 TCP connections on a single port?
Any suggestions/comments are appreciated!
In general you are better maintaining the connections for as long as possible. If you have each device opening a connection each time it sends a message you can end up effectively DoS'ing the server as it ends up with lots of sockets in the TIME_WAIT state taking up space in it's tables.
I worked on a system where there were a bunch of clients talking to a server and while they could be turned on and off regularly, it was still better to maintain the connection (and re-establish it when it had dropped and a new message needed to be sent). You may end up needing to write slightly more complex code, but I've found it to be well worth the effort for the reduced load on the server.
Modern operating systems may have bigger buffers than the ones I actually encountered the DoS effect on, but it's fundamentally not the best idea to be using lots of connections like that.
Things can get relatively complicated on the client side, especially when the device tends to go to sleep transparently to the application because that means connections will time out while the app thinks they are still open. When we did this we ended up with relatively complex network code because we needed to deal with the fact that the sockets could (and would) fail as a matter of course and we simply needed to setup a new connection and re-attempt sending the message. You just tuck this code away into your libraries and forget about it once it's done though.
In actual fact in practice our initial application had even more complex code because it was dealing with a network library that was semi-aware of the stop start nature of the devices and tried to resend failed messages, sometimes meaning that the same message got sent twice. We ended up doing an extra layer of communication on top in order to ensure duplicates got rejected. If you're using C# or regular BSD style sockets you shouldn't have that problem though I'm guessing. This was a proprietary library that managed the reconnects but caused headaches with the resends and it's inappropriate default time-outs.
You usually can connect much more than 140 "clients" to a server (that is with decent network / HW / RAM)...
I recommend always to test this sort of thing with real scenarios (load etc.) to decide since there are aspects like network (performance, stability...), HW (server RAM etc.) and SW (what does the server exactly do?) that can only be checked by you.
Depending on the protocol you could/should even put some timeout/reconnect mechanism in there.
The lookup you mean would be really fast - just use ConcurrentDictionary to hold the needed information with IP:PORT as the key (assuming the server runs on a full .NET 4).
For some references see:
http://msdn.microsoft.com/en-us/library/dd287191.aspx
http://geekswithblogs.net/BlackRabbitCoder/archive/2011/02/17/c.net-little-wonders-the-concurrentdictionary.aspx
EDIT - as per comments:
Holding on to a TCP/IP connection doesn't take much processing client-side... it costs a bit of memory. I would recommend to do a small test (1-2 clients) to check this assumption for your specific case.
If you are talking about a system with hardware devices then I suggest to go with closing the connection every time the client finishes sending data.
To make sure the client gets some update from the server, the client can wait for a 5 second period for any data to arrive from the server. If the data is received within/before this timeframe, then close the connection and process the data. If not, close the connection and wait after sending next set of data.
This way scaling becomes much easier. Keeping the connections open always leads to strain on the resources and in my opinion is not necessary unless it is some life-saving device like heart rate monitor, oxygen supply monitor etc.,
I have a series of systems on a LAN running a synchronized display routine. For example, think of a chorus line. The program they ran is fixed. I have each "client" download the entire routine, and then contact the central "server" at fixed points in the routine for synchronization. The routine itself is mundane with, perhaps, 20 possible instructions.
Each client runs the same routine, but they can be doing completely different things at any one time. One part of the chorus line can be kicking left, another part kicking right, but all in time with each other. Clients can join and drop out at any time, but they're all assigned a part. If no-one is there to run the part, it just doesn't get run.
This is all coded in C# .Net.
The client display is a Windows Forms application. The server accepts TCP connections, and then services them round-robin fashion, keeping a master clock of what's going on. The clients send a signal that says "I've reached sync-point 32" (or 19, or 5, or whatever) and waits for the server to acknowledge and then moves on. Or the server can say "No, you need to start at sync-point 15".
This all works great. There is a minor bit of delay between the first and last clients to hit a sync-point, but it's hardly noticeable. Ran for months.
Then the Specification changed.
Now the clients need to respond to near real-time instructions from the server -- it's no longer a pre-set dance program. The server is going to be sending instructions out and the dance program is made up on the fly. I get the fun job of re-designing the protocol, the servicing loops, and the programming instructions.
My toolkit includes anything in a standard .Net 3.5 toolbox. Installing new software is a pain in the arse, since so many systems (clients) can be involved.
I'm looking for suggestions on keeping the clients synced (some sort of latching system? UDP? Broadcast?), distribution of the "dance program", anything that might make this easier than a traditional Client/Server TCP arrangement.
Keep in mind that there are time/speed limitations going on as well. I could put the dance program in a network database, but I'd have to shove instructions in fairly quickly and there'd be a lot of readers using a rather thick protocol (DBI, SqlClient, etc..) to get a small bit of text. That seems overly complex. And I still need something to keep them all displaying in sync.
Suggestions? Opinions? Wild-ass speculation? Code examples?
PS: Answers may not get marked as "correct" (since this isn't a "correct" answer), but +1 votes for good suggestions for sure.
I did something similar (quite a while back) with synchronizing a bank of 4 displays, each run by a single system, receiving messages from a central server.
The architecture we finally settled on after a fair amount of testing involved having one "master" machine. In your case, this would be having one of your 20 clients that acts as the master, and have it connect to the server via TCP.
The server then would send the entire series of commands for the series through to that one machine.
That machine then used UDP to broadcast real-time instructions to each of the other machines (the 19 other clients on its LAN) to keep their displays up to date. We used UDP for a couple reasons here - there was lower overhead involved, which helped keep the total resource usage down. Also, since you're updating in real-time, if one or two "frames" was out of sync, it was never noticable, at least not noticeable enough for our purposes (having a human sitting and interacting with the system).
The key point to this working smoothly, though, is having an intelligent communication means between the main server and the "master" machine - you want to keep the bandwidth as low as possible. In a case like yours, I'd probably come up with a single binary blob that had the current instruction set for the 20 machines, in its smallest form. (Maybe something like 20 bytes, or 40 bytes if you need it, etc). The "master" machine would then worry about translating this out to the other 19 machines and itself.
There are some nice things about this - the server has a much easier time transmitting to one machine in the cluster instead of every machine in the cluster. This let us, for example, have one single, centralized server "drive" multiple clusters efficiently, without having ridiculous hardware requirements anywhere. It also keeps the client code very, very simple. It just has to listen for a UDP datagram and do whatever it says - in your case, it sounds like it would have one of 20 commands, so the client becomes very, very simple.
The "master" server is the trickiest. In our implementation, we actually had the same client code on it as the other 19 (as a separate proces) and one "translation" process that took the blob, broke it into 20 pieces, and transmitted them. It was fairly simple to write, and worked very well.