Efficient & Scalable connected TCP Windows Service using C# .Net 4.5


Requirements:
The need is for a Windows-service-based, C# .NET 4.5, always-connected (or at least long-lived) TCP server architecture that scales both vertically and horizontally, with each server handling as many connections as possible. Clients can be any IoT (Internet of Things) device.
I am aware of the limitations on ports, but I still wonder why these limitations exist in this era of tech (we always have limits, but why still the old ones?!). Temporary TCP/HTTP connections would scale fine, but they are not a requirement here.
Design:
Single thread per server for async-accept new connections (lifetime of server).
code: rawTcpClient = await tcpListener.AcceptTcpClientAsync();
One thread per client connection (loop) to hold the client connection?
(see my question below)
A Task for performing client operations (short-term, intermittent operations).
My question on optimization (if possible):
How can I hold all the client connections in a set of threads or a thread pool instead of one thread per connection, given that each connection lasts for the client's lifetime, which may be long?
Example: per server, only 50 thread-based tasks allocated to hold connected clients, so that they don't get disconnected while waiting for client data?

Efficient & Scalable
The very first thing you need to decide is how efficient you want to be. The socket APIs can get extremely complex if efficiency is your top priority. However, efficiency is almost never the top priority, even though a lot of people think it is. The problem is that complexity can increase exponentially with efficiency/scalability, and if you simply maximize efficiency/scalability, you'll end up with an almost unmaintainable system. So you'll need to decide where to draw the line on that scale.
Particularly if you have horizontal scaling, you probably don't need to use the extreme-efficiency socket APIs.
I am aware of the limitations on ports but still wonder why these limitations in this era of tech (we always have limits but why still the old ones?!).
Compatibility. Ports in particular are represented by a 16-bit value. The only way this would change is if a new standard came out and everything upgraded: NICs, gateways, ISPs, and IoT devices. That's a tall order, and will probably never happen.
Single thread per server for async-accept new connections (lifetime of server).
That's fine. If you have a large amount of connection turnover, you can have multiple accept threads, too. Just keep your backlog high (it should be high by default on Windows Server OSes).
One thread per client connection (loop) to hold client connection?
Er, no.
You'll want to use asynchronous I/O, for sure.
You should have continuous (asynchronous) reads going on all connected clients, and then do (asynchronous) writes as necessary. Also, if the protocol permits it, you should periodically write heartbeat messages to each connected client; otherwise, you'll need a timer for each client to drop the connection. Depending on the nature of your writes, you may need to have a queue of pending writes per client.
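The "queue of pending writes" and the heartbeat can be sketched roughly as follows. This is only an illustration, not the answer author's code: the `ClientWriter` class, its method names, and the heartbeat payload are hypothetical, and a `SemaphoreSlim` stands in for an explicit queue by letting only one write per client run at a time.

```csharp
using System;
using System.IO;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper: serializes writes to one client so that messages
// never interleave on the wire, and sends a periodic heartbeat.
public sealed class ClientWriter
{
    private readonly Stream _stream;
    private readonly SemaphoreSlim _gate = new SemaphoreSlim(1, 1);

    public ClientWriter(Stream stream) { _stream = stream; }

    // Callers may invoke this concurrently; writes run one at a time.
    public async Task WriteAsync(byte[] message)
    {
        await _gate.WaitAsync().ConfigureAwait(false);
        try
        {
            await _stream.WriteAsync(message, 0, message.Length).ConfigureAwait(false);
        }
        finally
        {
            _gate.Release();
        }
    }

    // Writes the heartbeat every 'interval' until the token is cancelled.
    public async Task HeartbeatLoopAsync(byte[] heartbeat, TimeSpan interval, CancellationToken token)
    {
        try
        {
            while (true)
            {
                await Task.Delay(interval, token).ConfigureAwait(false);
                await WriteAsync(heartbeat).ConfigureAwait(false);
            }
        }
        catch (OperationCanceledException)
        {
            // Normal shutdown path: the connection is being torn down.
        }
    }
}
```

No thread is blocked while a client is idle: both the delay and the write are asynchronous, so thousands of these loops can share the thread pool.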
a Task for performing client operation (short term, intermittent operations)
If you use asynchronous tasks, then all your actual code will just run on whatever threadpool thread is available. No need for dedicated tasks at all.
You may find my TCP/IP .NET Sockets FAQ helpful.
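To make the "no thread per connection" point concrete, here is a minimal sketch of an asynchronous accept loop with one asynchronous read loop per client. It is an assumption-laden illustration rather than the answer's code: the class name, the backlog of 1000, the 8 kB buffer, and the echo behaviour are all placeholders for a real protocol.

```csharp
using System;
using System.IO;
using System.Net.Sockets;
using System.Threading.Tasks;

public static class AsyncTcpServerSketch
{
    // One logical accept loop. No thread is held while waiting:
    // each await returns the thread to the pool.
    public static async Task RunAsync(TcpListener listener)
    {
        listener.Start(1000); // large backlog, as suggested above (value is arbitrary)
        while (true)
        {
            TcpClient client;
            try
            {
                client = await listener.AcceptTcpClientAsync().ConfigureAwait(false);
            }
            catch (InvalidOperationException) { break; } // listener stopped/disposed
            catch (SocketException) { break; }           // accept aborted

            // Fire and forget: the connection is "held" by a pending read,
            // not by a dedicated thread.
            var ignored = HandleClientAsync(client);
        }
    }

    private static async Task HandleClientAsync(TcpClient client)
    {
        using (client)
        {
            var buffer = new byte[8192];
            try
            {
                NetworkStream stream = client.GetStream();
                while (true)
                {
                    int read = await stream.ReadAsync(buffer, 0, buffer.Length).ConfigureAwait(false);
                    if (read == 0) break; // peer closed the connection

                    // Placeholder protocol: echo the bytes back.
                    await stream.WriteAsync(buffer, 0, read).ConfigureAwait(false);
                }
            }
            catch (IOException) { }             // connection reset or timed out
            catch (ObjectDisposedException) { } // socket closed underneath us
        }
    }
}
```

With this shape, 50 or 50000 idle clients cost memory for their buffers and sockets, but no threads; thread pool threads are only used briefly when data actually arrives.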

Related

High-performance TCP Socket programming in .NET C#

I know this topic has been asked before, and I have read almost all the threads and comments, but I still can't find the answer to my problem.
I'm working on a high-performance network library that must have a TCP server and client, must be able to accept 30000+ connections, and must have throughput as high as possible.
I know very well I have to use async methods, and I have already implemented all kinds of solutions that I have found and tested them.
In my benchmarking, only minimal code was used to avoid any overhead, and I used profiling to minimize the CPU load; there is no more room for simple optimization. On the receiving socket the buffered data was always read, counted and discarded so that the socket buffer never filled completely.
The case is very simple: one TCP socket listens on localhost, another TCP socket connects to the listening socket (from the same program, on the same machine of course), then an infinite loop starts sending 256kB packets from the client socket to the server socket.
A timer with a 1000ms interval prints a byte counter from both sockets to the console to make the bandwidth visible, then resets them for the next measurement.
I found that the sweet spot for maximum throughput is a 256kB packet size with a 64kB socket buffer.
With the async/await type methods I could reach
~370MB/s (~3.2gbps) on Windows, ~680MB/s (~5.8gbps) on Linux with mono
With the BeginReceive/EndReceive/BeginSend/EndSend type methods I could reach
~580MB/s (~5.0gbps) on Windows, ~9GB/s (~77.3gbps) on Linux with mono
With the SocketAsyncEventArgs/ReceiveAsync/SendAsync type methods I could reach
~1.4GB/s (~12gbps) on Windows, ~1.1GB/s (~9.4gbps) on Linux with mono
Problems are the following:
async/await methods were the slowest, so I will not work with them.
BeginReceive/EndReceive methods started a new async thread together with the BeginAccept/EndAccept methods. Under Linux/mono every new socket instance was extremely slow: when there were no more threads in the ThreadPool, mono started up new threads, but creating 25 connections took about 5 minutes, and creating 50 connections was impossible (the program just stopped doing anything after ~30 connections).
Changing the ThreadPool size did not help at all, and I would not change it anyway (it was just a debugging move).
The best solution so far is SocketAsyncEventArgs, which gives the highest throughput on Windows, but on Linux/mono it is slower than on Windows, whereas with the other methods it was the opposite.
I've benchmarked both my Windows and Linux machine with iperf,
Windows machine produced ~1GB/s (~8.58gbps), Linux machine produced ~8.5GB/s (~73.0gbps)
The weird thing is that on Windows iperf produced a weaker result than my application, but on Linux its result is much higher.
First of all, I would like to know if the results are normal, or can I get better results with a different solution?
If I decide to use the BeginReceive/EndReceive methods (they produced relatively the highest result on Linux/mono) then how can I fix the threading problem, to make the connection instance creating fast, and eliminate the stalled state after creating multiple instances?
I continue making further benchmarks and will share the results if there is any new.
================================= UPDATE ==================================
I promised code snippets, but after many hours of experimenting the overall code is kind of a mess, so I would just share my experience in case it can help someone.
I came to realize that under Windows 7 the loopback device is slow: I could not get a higher result than 1GB/s with iperf or NTttcp. Only Windows 8 and newer versions have fast loopback, so I don't care about Windows results anymore until I can test on a newer version. SIO_LOOPBACK_FAST_PATH should be enabled via Socket.IOControl, but it throws an exception on Windows 7.
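For reference, enabling the loopback fast path through Socket.IOControl might look like the sketch below. The helper class and method names are hypothetical; the control code 0x98000010 is the SIO_LOOPBACK_FAST_PATH value from the Winsock headers. As far as I know the option has to be set on both the listening and the connecting socket before they listen/connect, and on platforms that do not support it the call throws, so the sketch reports failure instead of crashing.

```csharp
using System;
using System.Net.Sockets;

public static class LoopbackFastPath
{
    // SIO_LOOPBACK_FAST_PATH (Winsock vendor-specific control code).
    private const int SioLoopbackFastPath = unchecked((int)0x98000010);

    // Returns true if the option was applied, false if the platform
    // rejected it (e.g. Windows 7, or a non-Windows OS).
    public static bool TryEnable(Socket socket)
    {
        try
        {
            // The input value is a 32-bit boolean: 1 enables the fast path.
            socket.IOControl(SioLoopbackFastPath, BitConverter.GetBytes(1), null);
            return true;
        }
        catch (SocketException)
        {
            return false;
        }
        catch (PlatformNotSupportedException)
        {
            return false;
        }
    }
}
```

Usage would be something like calling `LoopbackFastPath.TryEnable(socket)` right after constructing each socket and simply continuing without the fast path when it returns false.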
It turned out the most powerful solution is the Completion event based SocketAsyncEventArgs implementation both on Windows and Linux/Mono. Creating a few thousand instances of the clients never messed up the ThreadPool, the program did not stop suddenly as I mentioned above. This implementation is very nice to the threading.
Creating 10 connections to the listening socket and feeding data from 10 separate thread from the ThreadPool with the clients together could produce ~2GB/s data traffic on Windows, and ~6GB/s on Linux/Mono.
Increasing the client connection count did not improve the overall throughput, but the total traffic became distributed among the connections, this might be because the CPU load was 100% on all cores/threads even with 5, 10 or 200 clients.
I think overall performance is not bad, 100 clients could produce around ~500mbit/s traffic each. (Of course this is measured in local connections, real life scenario on network would be different.)
The only observation I would share: experimenting with both the Socket in/out buffer sizes and with the program read/write buffer sizes/loop cycles highly affected the performance and very differently on Windows and on Linux/Mono.
On Windows the best performance has been reached with 128kB socket-receive, 32kB socket-send, 16kB program-read and 64kB program-write buffers.
On Linux the previous settings produced very weak performance, but 512kB socket-receive and -send both, 256kB program-read and 128kB program-write buffer sizes worked the best.
Now my only problem is that if I try to create 10000 connecting sockets, after around 7005 it just stops creating instances. It does not throw any exceptions, and the program keeps running as if there were no problem. I don't know how it can quit a for loop without a break, but it does.
Any help would be appreciated regarding anything I was talking about!

Because this question gets a lot of views I decided to post an "answer", but technically this isn't an answer, but my final conclusion for now, so I will mark it as answer.
About the approaches:
The async/await functions produce awaitable Tasks that are scheduled by the TaskScheduler of the dotnet runtime, so thousands of simultaneous connections, and therefore thousands of read/write operations, start up thousands of Tasks. As far as I know, this creates thousands of state machines stored in RAM and countless context switches on the threads they are assigned to, resulting in very high CPU overhead. With a few connections/async calls it is better balanced, but as the awaitable Task count grows, it slows down dramatically.
The BeginReceive/EndReceive/BeginSend/EndSend socket methods are technically async methods with no awaitable Tasks, but with callbacks at the end of the call, which actually suits multithreading better. In my opinion the dotnet design of these socket methods is still poor, but for simple solutions (or a limited number of connections) it is the way to go.
The SocketAsyncEventArgs/ReceiveAsync/SendAsync type of socket implementation is the best on Windows for a reason. It uses Windows IOCP in the background to achieve the fastest async socket calls, together with Overlapped I/O and a special socket mode. This solution is the "simplest" and fastest under Windows. But under mono/linux it will never be that fast, because mono emulates Windows IOCP by using linux epoll. epoll is actually much faster than IOCP, but having to emulate IOCP for dotnet compatibility causes some overhead.
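A minimal receive loop in the Completed-event style might look like this. It is a sketch under assumptions, not the library described here: the `SaeaReceiver` class is hypothetical, it only counts bytes (as in the benchmark), and a real server would pool the SocketAsyncEventArgs instances and buffers rather than allocate one per connection.

```csharp
using System;
using System.Net.Sockets;
using System.Threading;

// Hypothetical receiver: reads continuously from one connected socket
// using a single reusable SocketAsyncEventArgs and counts the bytes.
public sealed class SaeaReceiver
{
    private readonly Socket _socket;
    private readonly SocketAsyncEventArgs _args = new SocketAsyncEventArgs();
    private long _bytesReceived;

    public long BytesReceived
    {
        get { return Interlocked.Read(ref _bytesReceived); }
    }

    public SaeaReceiver(Socket connectedSocket, int bufferSize)
    {
        _socket = connectedSocket;
        _args.SetBuffer(new byte[bufferSize], 0, bufferSize);
        // Raised on an I/O thread when a receive completes asynchronously.
        _args.Completed += (sender, e) => { if (ProcessReceive(e)) StartReceive(); };
    }

    public void StartReceive()
    {
        while (true)
        {
            bool pending;
            try
            {
                pending = _socket.ReceiveAsync(_args);
            }
            catch (ObjectDisposedException)
            {
                return; // socket was closed elsewhere
            }

            if (pending) return; // Completed will fire later and resume the loop

            // ReceiveAsync returned false: it completed synchronously and
            // Completed is NOT raised, so process the result inline.
            if (!ProcessReceive(_args)) return;
        }
    }

    // Returns false when the connection is finished.
    private bool ProcessReceive(SocketAsyncEventArgs e)
    {
        if (e.SocketError != SocketError.Success || e.BytesTransferred == 0)
        {
            _socket.Close();
            return false;
        }
        Interlocked.Add(ref _bytesReceived, e.BytesTransferred);
        return true;
    }
}
```

The detail that is easy to miss is the `false` return value of ReceiveAsync: ignoring it stalls the connection, and handling it recursively instead of in a loop can overflow the stack under heavy load.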
About buffer sizes:
There are countless ways to handle data on sockets. Reading is straightforward, data arrives, You know the length of it, You just copy bytes from the socket buffer to Your application and process it.
Sending data is a bit different.
You can pass Your complete data to the socket and it will cut it into chunks and copy the chunks to the socket buffer until there is no more to send; the sending method of the socket returns when all data is sent (or when an error happens).
You can take Your data, cut it into chunks and call the socket send method with a chunk, and when it returns, send the next chunk until there is no more.
In any case You should consider what socket buffer size to choose. If You are sending a large amount of data, then the bigger the buffer is, the fewer chunks have to be sent, and therefore the fewer iterations Your (or the socket's internal) loop has to make: less memory copying, less overhead.
But allocating large socket buffers and program data buffers will result in large memory usage, especially if You are having thousands of connections, and allocating (and freeing up) large memory multiple times is always expensive.
On sending side 1-2-4-8kB socket buffer size is ideal for most cases, but if You are preparing to send large files (over few MB) regularly then 16-32-64kB buffer size is the way to go. Over 64kB there is usually no point to go.
But this has only advantage if the receiver side has relatively large receiving buffers too.
Usually over the internet connections (not local network) no point to get over 32kB, even 16kB is ideal.
Going under 4-8kB can result in exponentially incremented call count in the reading/writing loop, causing large CPU load and slow data processing in the application.
Go under 4kB only if You know Your messages will usually be smaller than 4kB, or just very rarely over 4KB.
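The second sending approach (cutting the data into chunks yourself) together with explicit socket buffer sizes can be sketched like this. The `ChunkedSender` class and its method names are hypothetical, and the sizes passed in are meant to be tuned by benchmarking, not copied. The OS is free to adjust the buffer sizes it is asked for.

```csharp
using System;
using System.Net.Sockets;

public static class ChunkedSender
{
    // Requests kernel-side socket buffer sizes; the OS may round or cap them.
    public static void Configure(Socket socket, int receiveBufferBytes, int sendBufferBytes)
    {
        socket.ReceiveBufferSize = receiveBufferBytes;
        socket.SendBufferSize = sendBufferBytes;
    }

    // Sends 'data' in pieces of at most 'chunkSize' bytes and returns the
    // number of Send calls made, which shows how smaller chunks mean more calls.
    public static int SendInChunks(Socket socket, byte[] data, int chunkSize)
    {
        if (chunkSize <= 0) throw new ArgumentOutOfRangeException("chunkSize");

        int calls = 0;
        int offset = 0;
        while (offset < data.Length)
        {
            int count = Math.Min(chunkSize, data.Length - offset);
            // Send may accept fewer bytes than requested, so advance by its result.
            int sent = socket.Send(data, offset, count, SocketFlags.None);
            offset += sent;
            calls++;
        }
        return calls;
    }
}
```

For example, 1 MB sent with a 1 kB chunk size needs at least 1024 Send calls, while a 64 kB chunk size needs at least 16; that call count is the overhead the paragraphs above are describing.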
My conclusion:
Based on my experiments, the built-in socket classes/methods/solutions in dotnet are OK, but not efficient at all. My simple linux C test programs using non-blocking sockets could outperform the fastest "high-performance" solution of dotnet sockets (SocketAsyncEventArgs).
This does not mean it is impossible to have fast socket programming in dotnet, but under Windows I had to make my own implementation of Windows IOCP by directly communicating with the Windows kernel via InteropServices/Marshaling, directly calling Winsock2 methods, using a lot of unsafe code to pass the context structs of my connections as pointers between my classes/calls, creating my own ThreadPool, creating IO event handler threads, and creating my own TaskScheduler to limit the count of simultaneous async calls and avoid pointlessly many context switches.
This was a lot of work with a lot of research, experimentation, and testing. If You want to do it on Your own, do it only if You really think it is worth it. Mixing unsafe/unmanaged code with managed code is a pain in the ass, but in the end it is worth it, because with this solution I could reach about 36000 HTTP requests/sec with my own HTTP server on a 1gbit LAN, on Windows 7, with an i7 4790.
This is a level of performance that I never could reach with the dotnet built-in sockets.
When running my dotnet server on an i9 7900X on Windows 10, connected to a 4c/8t Intel Atom NAS on Linux via a 10gbit LAN, I can use the complete bandwidth (therefore copying data at 1GB/s) no matter if I have only 1 or 10000 simultaneous connections.
My socket library also detects if the code is running on linux, and then instead of Windows IOCP (obviously) it uses linux kernel calls via InteropServices/Marshalling to create and use sockets and to handle the socket events directly with linux epoll, which managed to max out the performance of the test machines.
Design tip:
As it turned out, it is difficult to design a networking library from scratch, especially one that is meant to be universal for all purposes. You have to design it either with many settings, or specifically for the task You need.
This means finding the proper socket buffer sizes, the I/O processing thread count, the worker thread count, and the allowed async task count; these all have to be tuned to the machine the application is running on, to the connection count, and to the type of data You want to transfer through the network. This is why the built-in sockets do not perform that well: they must be universal, and they do not let You set these parameters.
In my case, assigning more than 2 dedicated threads to I/O event processing actually makes the overall performance worse, because I am using only 2 RSS queues, and more threads cause more context switching than is ideal.
Choosing wrong buffer sizes will result in performance loss.
Always benchmark different implementations for the simulated task You need to find out which solution or setting is the best.
Different settings may produce different performance results on different machines and/or operating systems!
Mono vs Dotnet Core:
Since I've programmed my socket library in a Framework/Core compatible way, I could test it under linux with mono and with Core native compilation. Interestingly, I could not observe any remarkable performance difference; both were fast, but of course leaving mono and compiling with Core should be the way to go.
Bonus performance tip:
If Your network card is capable of RSS (Receive Side Scaling), then enable it in Windows in the advanced properties of the network device settings, and raise the RSS queue count from 1 to as high as You can, or as high as is best for Your performance.
If RSS is supported by Your network card, the queue count is usually set to 1, which makes the kernel process network events on only one CPU core. If You can raise this queue count, network events are distributed between more CPU cores, which results in much better performance.
On linux it is also possible to set this up, but in different ways; it is better to search for information specific to Your linux distro/LAN driver.
I hope my experience will help some of You!

I had the same problem. You should take a look into:
NetCoreServer
Every thread in the .NET CLR thread pool can handle one task at a time. So to handle more async connects/reads etc., you have to change the thread pool size by using:
ThreadPool.SetMinThreads(Int32, Int32)
Using EAP (event based asynchronous pattern) is the way to go on Windows. I would use it on Linux too because of the problems you mentioned and take the performance plunge.
The best would be io completion ports on Windows, but they are not portable.
PS: when it comes to serializing objects, you are highly encouraged to use protobuf-net. It serializes objects to binary up to 10 times faster than the .NET binary serializer and saves a little space too!
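As an illustration of the ThreadPool.SetMinThreads call mentioned above, a small helper could raise the minimums without ever lowering them. The helper name and the values are hypothetical; whether raising the minimum helps at all depends on the workload (it mostly matters when callbacks block), so treat it as something to benchmark rather than a fixed recommendation.

```csharp
using System;
using System.Threading;

public static class ThreadPoolTuning
{
    // Raises the minimum worker and I/O completion thread counts to at least
    // the requested values, never lowering the current minimums.
    // Returns false if the runtime rejected the values (e.g. above the maximum).
    public static bool RaiseMinThreads(int minWorker, int minIo)
    {
        int currentWorker, currentIo;
        ThreadPool.GetMinThreads(out currentWorker, out currentIo);

        return ThreadPool.SetMinThreads(
            Math.Max(currentWorker, minWorker),
            Math.Max(currentIo, minIo));
    }
}
```

A call such as `ThreadPoolTuning.RaiseMinThreads(64, 64)` at startup lets the pool create up to 64 threads on demand without its usual throttled ramp-up.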

How to preserve the message sent order at TCP server side with multiple clients

I have two PCs connected by a direct Ethernet cable over a 1Gbps link. One of them acts as the TCP server and the other acts as the TCP client(s). Now I would like to achieve the maximum possible network throughput between these two.
Options I tried:
Creating multiple clients on PC-1 with different port numbers, connecting to the TCP server. The reason for creating multiple clients is to increase the network throughput, but here I have an issue.
I have a buffer queue of events to be sent to the server. There will be multiple messages with the same event number. The server has to acquire all the messages and then sort them based on the event number. Each client dequeues a message from a concurrent queue and sends it to the server, then repeats. I have put a constraint on the client side that Event-2 will not be sent until all messages labelled Event-1 have been sent. Hence, the order in which events are sent is correct, and the TCP server continuously receives from all the clients.
Now lets come to the problem:
The server is receiving the data in a somewhat random order, as I have shown in the image. The disorder between two successive events gets worse after some time of acquisition. I think this random behaviour is due to parallel worker threads executing the IO completion callbacks.
Technology used: F# Socket Async with SocketAsyncEventArgs
Solution I tried: instead of receiving from all the clients at once on the server side, I tried polling for the next available client with pending data. This ensured the correct order, but its performance is not at all comparable to the earlier approach.
I want to receive in the same order, or nearly the same order (but without non-deterministic randomness), as sent by the clients. Is there any way I can preserve the order and also maintain better throughput? What are the best ways to achieve nearly 100% network throughput between two PCs?

As others have pointed out in the comments, a single TCP connection is likely to give you the highest throughput, if it's TCP you want to use.
You can possibly achieve slightly (really marginally) higher throughput with UDP, but then you have the hassle of recreating all the goodies TCP gives you for free.
If you want bidirectional high volume high speed throughput (as opposed to high volume just one way at a time), then it's possible one connection for each direction is easier to cope with, but I don't have that much experience with it.
Design tips
You should keep the connection open. The client will need to ask "are you still there?" at regular intervals if no other communication goes on. (On second thought, I realize that the only purpose of this is to allow a quick response and the possibility for the server to initiate a message transaction. So I revise it to: keep the connection open for a full transaction at least.)
Also, you should split up large messages, meaning messages over a certain size. Keep the number of bytes you send in each chunk to a maximum round hex number, typically 8K, 16K, 32K or 64K on a local network. Experiment with sizes. The suggested max sizes have been optimal since Windows 3 at least. You need some sort of protocol with a chunk consisting of a fixed header (typically a magic number for check and resynch, a chunk number also for check and for analysis, and a total packet length) followed by the data.
You can possibly further improve throughput with compression (usually low, quick compression); it depends very much on the data, and on whether you're on a fast or slow network.
Then there's this hassle that one typically runs into, problems with the Nagle algorithm, and I no longer remember enough of the details there. I believe I used to overcome that by sending an acknowledgement in return for each chunk sent, and I suspect by doing that you satisfy the design requirements, and so avoid waiting for the last bytes to come in. But do google this.
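The chunk format described above (fixed header with a magic number, a chunk number, and a total length, followed by the data) can be sketched as follows. The magic value, the 12-byte layout, the big-endian byte order, and all names are assumptions made for illustration, not a format the answer prescribes.

```csharp
using System;

public static class ChunkFraming
{
    public const uint Magic = 0xC0DEFEED; // arbitrary marker for check/resync
    public const int HeaderSize = 12;     // magic (4) + chunk number (4) + total length (4)

    // Builds one chunk: header followed by the payload bytes.
    public static byte[] Encode(uint chunkNumber, byte[] payload)
    {
        var frame = new byte[HeaderSize + payload.Length];
        WriteUInt32(frame, 0, Magic);
        WriteUInt32(frame, 4, chunkNumber);
        WriteUInt32(frame, 8, (uint)frame.Length); // total length including header
        Buffer.BlockCopy(payload, 0, frame, HeaderSize, payload.Length);
        return frame;
    }

    // Validates the header and extracts the payload. Returns false if the
    // magic number or the length does not match.
    public static bool TryDecode(byte[] frame, out uint chunkNumber, out byte[] payload)
    {
        chunkNumber = 0;
        payload = null;
        if (frame == null || frame.Length < HeaderSize) return false;
        if (ReadUInt32(frame, 0) != Magic) return false;
        if (ReadUInt32(frame, 8) != (uint)frame.Length) return false;

        chunkNumber = ReadUInt32(frame, 4);
        payload = new byte[frame.Length - HeaderSize];
        Buffer.BlockCopy(frame, HeaderSize, payload, 0, payload.Length);
        return true;
    }

    // Big-endian (network byte order) helpers.
    private static void WriteUInt32(byte[] buffer, int offset, uint value)
    {
        buffer[offset] = (byte)(value >> 24);
        buffer[offset + 1] = (byte)(value >> 16);
        buffer[offset + 2] = (byte)(value >> 8);
        buffer[offset + 3] = (byte)value;
    }

    private static uint ReadUInt32(byte[] buffer, int offset)
    {
        return ((uint)buffer[offset] << 24) | ((uint)buffer[offset + 1] << 16)
             | ((uint)buffer[offset + 2] << 8) | buffer[offset + 3];
    }
}
```

On the receiving side, one would read exactly HeaderSize bytes first, take the total length from the header, and then read the remaining bytes before calling TryDecode; TCP itself gives no message boundaries.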

Measure RoundTrip TCP latency without changes to application protocol

Is there any way (preferably in C#) how to regularly measure connection layer latency (roundtrip) without changing the application protocol and without creating separate dedicated connection - e.g. using some similar SYN-ACK trick like tcping do but without closing/opening connection?
I'm connecting to the servers via given ASCII based protocol (and always using TCP_NODELAY). Servers send me large amount of discrete messages and I'm regularly sending 'heartbeat' payload (but there is no response payload to the heartbeat).
I cannot change the protocol and in many cases I also cannot create more than one physical connection to the server.

Keep in mind that TCP does windowing, so this could cause issues when trying to implement an elegant SEQ/ACK solution. (you would want sequence, not synchronize)
[EDIT: Snipped a very overcomplicated and confusing explanation.]
I'd have to say the best way is to use a simple stopwatch method: start a timer, make a very thin request or poll, and measure the time until the reply comes back. If that query really is the lightest you can make it, then that should give you the minimum amount of time you can reasonably expect to wait, which is sometimes more valuable than the ping (which can be misleading).
If you really absolutely need just the network time to machine and back, just use an ICMP ping.
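Both suggestions can be sketched in a few lines. The `RoundTripTimer` class and its method names are hypothetical, and the "thin request" is whatever lightweight request/response exchange the existing protocol already offers; the caller supplies it as a delegate. The ICMP variant uses the framework's Ping class and may need extra privileges on some systems.

```csharp
using System;
using System.Diagnostics;
using System.Net.NetworkInformation;

public static class RoundTripTimer
{
    // Times one execution of the supplied request (send + wait for reply).
    public static TimeSpan Measure(Action thinRequest)
    {
        Stopwatch stopwatch = Stopwatch.StartNew();
        thinRequest();
        stopwatch.Stop();
        return stopwatch.Elapsed;
    }

    // Takes several samples and keeps the smallest, which is the closest
    // estimate of the pure network + minimal processing time.
    public static TimeSpan MeasureMin(Action thinRequest, int samples)
    {
        if (samples <= 0) throw new ArgumentOutOfRangeException("samples");

        TimeSpan best = TimeSpan.MaxValue;
        for (int i = 0; i < samples; i++)
        {
            TimeSpan elapsed = Measure(thinRequest);
            if (elapsed < best) best = elapsed;
        }
        return best;
    }

    // ICMP alternative: returns the round-trip time in milliseconds,
    // or null if the host did not answer within the timeout.
    public static long? IcmpRoundTripMs(string host, int timeoutMs)
    {
        using (var ping = new Ping())
        {
            PingReply reply = ping.Send(host, timeoutMs);
            return reply.Status == IPStatus.Success ? (long?)reply.RoundtripTime : null;
        }
    }
}
```

Note that the stopwatch number includes the server's processing time for the thin request, while the ICMP number measures only the path to the machine, which is the distinction the answer draws.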

High performance socket server (like MMO)

I'm working on a server project (and I want to make an MMO server with it). I have created something, but I don't know whether it is a good system. Is there an MMO server tutorial or book on creating a high-performance socket server? I'll send/receive 5kB of data to/from every connected client (because this is an MMO system) and the server must handle ~2000 clients/s. Can anybody show me a good starting point?

There: C# SocketAsyncEventArgs High Performance Socket Code; based on things I have learnt from this (and some other resources) I have written a high performance TCP server which is handling more than 7000 clients.
Edit: Other good .NET code bases I studied to some extent are fracture (F#), SocketAwaitable and SuperSocket. I especially like fracture because of its simple (not naive) and smart buffer pool handling, but (in the version I've worked with) it does not provide a separate pool for acceptors, which I easily added myself based on the pool already provided.

Persisting 140 TCP connections?

We are currently investigating the most efficient way of communicating between 120-140 embedded hardware devices running on the .NET Micro framework and a server.
Each embedded device needs to send to, and request information from the server on a fairly regular basis all in real time through TCP.
My question is this: would it be better to initialise 140 TCP connections to the server and then hang on to these connections, or to initialise a new connection for each request to and from the devices? Would holding on to and managing 140 TCP connections put a lot of strain on the server?
When the server detects new data in the database, it needs to send this new info to 1..* devices (information is targeted to specific devices). If I held on to the 140 connections, I would need to do a lookup for the correct connection each time I needed to send information, instead of just sending to an IP:PORT associated with the new data.
I guess another possibly stupid question would be: is it actually possible to hang on to 140 TCP connections on a single port?
Any suggestions/comments are appreciated!

In general you are better off maintaining the connections for as long as possible. If you have each device opening a connection each time it sends a message, you can end up effectively DoS'ing the server, as it ends up with lots of sockets in the TIME_WAIT state taking up space in its tables.
I worked on a system where there were a bunch of clients talking to a server and while they could be turned on and off regularly, it was still better to maintain the connection (and re-establish it when it had dropped and a new message needed to be sent). You may end up needing to write slightly more complex code, but I've found it to be well worth the effort for the reduced load on the server.
Modern operating systems may have bigger buffers than the ones I actually encountered the DoS effect on, but it's fundamentally not the best idea to be using lots of connections like that.
Things can get relatively complicated on the client side, especially when the device tends to go to sleep transparently to the application because that means connections will time out while the app thinks they are still open. When we did this we ended up with relatively complex network code because we needed to deal with the fact that the sockets could (and would) fail as a matter of course and we simply needed to setup a new connection and re-attempt sending the message. You just tuck this code away into your libraries and forget about it once it's done though.
In practice, our initial application had even more complex code, because it was dealing with a network library that was semi-aware of the stop-start nature of the devices and tried to resend failed messages, sometimes meaning that the same message got sent twice. We ended up adding an extra layer of communication on top in order to ensure duplicates got rejected. If you're using C# or regular BSD-style sockets you shouldn't have that problem though, I'm guessing. This was a proprietary library that managed the reconnects but caused headaches with the resends and its inappropriate default time-outs.
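The duplicate-rejection layer mentioned above could, for instance, tag every message with a unique ID on the sender and remember recently seen IDs on the receiver. This is only a guess at the general shape, not the system described in this answer: the `DuplicateFilter` class, the use of Guid IDs, and the bounded history are all assumptions.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical receiver-side filter: accepts each message ID once and
// remembers only the most recent 'capacity' IDs to bound memory use.
public sealed class DuplicateFilter
{
    private readonly HashSet<Guid> _seen = new HashSet<Guid>();
    private readonly Queue<Guid> _order = new Queue<Guid>();
    private readonly object _gate = new object();
    private readonly int _capacity;

    public DuplicateFilter(int capacity)
    {
        if (capacity <= 0) throw new ArgumentOutOfRangeException("capacity");
        _capacity = capacity;
    }

    // Returns true the first time an ID is seen, false for a resend.
    public bool TryAccept(Guid messageId)
    {
        lock (_gate)
        {
            if (!_seen.Add(messageId)) return false; // duplicate: reject

            _order.Enqueue(messageId);
            if (_order.Count > _capacity)
            {
                // Forget the oldest ID so the history stays bounded.
                _seen.Remove(_order.Dequeue());
            }
            return true;
        }
    }
}
```

The capacity has to be large enough to cover the window in which a resend can still arrive; an ID that has already been evicted would be accepted again.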

You usually can connect much more than 140 "clients" to a server (that is with decent network / HW / RAM)...
I recommend always to test this sort of thing with real scenarios (load etc.) to decide since there are aspects like network (performance, stability...), HW (server RAM etc.) and SW (what does the server exactly do?) that can only be checked by you.
Depending on the protocol you could/should even put some timeout/reconnect mechanism in there.
The lookup you mean would be really fast - just use ConcurrentDictionary to hold the needed information with IP:PORT as the key (assuming the server runs on a full .NET 4).
For some references see:
http://msdn.microsoft.com/en-us/library/dd287191.aspx
http://geekswithblogs.net/BlackRabbitCoder/archive/2011/02/17/c.net-little-wonders-the-concurrentdictionary.aspx
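A lookup table along those lines might be sketched like this. The generic `ConnectionRegistry` class and its method names are hypothetical; the connection type is left open (it could be a TcpClient, a Socket, or a wrapper), and the key is simply the endpoint's "IP:PORT" string form.

```csharp
using System.Collections.Concurrent;
using System.Net;

// Hypothetical registry: maps a remote "IP:PORT" to the open connection
// object so the server can find the right device quickly.
public sealed class ConnectionRegistry<TConnection>
{
    private readonly ConcurrentDictionary<string, TConnection> _connections =
        new ConcurrentDictionary<string, TConnection>();

    public int Count
    {
        get { return _connections.Count; }
    }

    // IPEndPoint.ToString() yields "address:port" for IPv4 endpoints.
    private static string KeyFor(IPEndPoint endPoint)
    {
        return endPoint.ToString();
    }

    // Adds or replaces the connection for this endpoint (a device that
    // reconnects from the same address replaces its stale entry).
    public void Register(IPEndPoint endPoint, TConnection connection)
    {
        _connections[KeyFor(endPoint)] = connection;
    }

    public bool TryGet(IPEndPoint endPoint, out TConnection connection)
    {
        return _connections.TryGetValue(KeyFor(endPoint), out connection);
    }

    public bool Remove(IPEndPoint endPoint)
    {
        TConnection removed;
        return _connections.TryRemove(KeyFor(endPoint), out removed);
    }
}
```

One caveat worth checking for a given deployment: a device's source port usually changes on every reconnect, so keying by a stable device identifier sent in the first message may turn out to be more practical than IP:PORT.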
EDIT - as per comments:
Holding on to a TCP/IP connection doesn't take much processing client-side... it costs a bit of memory. I would recommend to do a small test (1-2 clients) to check this assumption for your specific case.

If you are talking about a system with hardware devices, then I suggest closing the connection every time the client finishes sending data.
To make sure the client gets any update from the server, the client can wait for a 5 second period for data to arrive from the server. If data is received within this timeframe, close the connection and process the data. If not, close the connection and wait until after sending the next set of data.
This way scaling becomes much easier. Keeping the connections open always puts strain on resources and in my opinion is not necessary unless it is some life-saving device like a heart rate monitor, oxygen supply monitor etc.
