Winsock IOCP Weird Behaviour On Disconnect Flood

I'm writing a Socket/Client/Server library for C#, since I do a lot of cross-platform programming and I didn't find Mono/.NET/.NET Core efficient enough at high-performance socket handling.
Since Linux epoll unarguably won the performance and usability "fight", I decided to use an epoll-like interface as the common API, so I'm trying to emulate it on Windows (Windows socket performance is not as important to me as Linux performance, but the interface is). To achieve this I call the Winsock2 and Kernel32 APIs directly via marshaling, and I use IOCP.
Almost everything works fine, with one exception. When I create a TCP server with Winsock and connect to it with more than 10000 connections (from the local machine or from a remote machine over the LAN, it does not matter), all connections are accepted without problems. When all clients send data (flood) to the server, there is no problem either: the server receives all packets. But when I disconnect all clients at the same time, the server does not recognize all the disconnection events (i.e. 0-byte read/available); usually 1500-8000 clients get stuck. The completion event is not triggered, so I cannot detect the connection loss.
The server does not crash; it continues to accept new connections and everything else works as expected. Only the lost connections go unrecognized.
I've read that, because overlapped I/O needs a pre-allocated read buffer, IOCP locks these buffers on read and releases the locks on completion, and that if too many events happen at the same time it cannot lock all the affected buffers because of an OS limit, which causes IOCP to hang indefinitely.
I've read that the solution to this buffer-lock problem is to post the read with a zero-sized buffer and a null buffer pointer, so that the pending read does not lock anything, and to use a real buffer only when there is real data to read.
I've implemented the above workaround and it works, but the original problem remains: after disconnecting many thousands of clients at the same time, a few thousand get stuck.
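For reference, the zero-byte read pattern described above can be sketched in managed code roughly as follows. This is only an illustration of the pattern, not the asker's P/Invoke implementation. The class and method names (ZeroByteReceive, ReceiveUntilClosedAsync) and the 4096-byte buffer size are assumptions made for the example, and it uses the task-based Socket.ReceiveAsync overloads rather than raw Winsock calls.

```csharp
using System;
using System.Buffers;
using System.Net.Sockets;
using System.Threading.Tasks;

public static class ZeroByteReceive
{
    // Hypothetical helper: reads until the peer closes the connection
    // and returns the total number of bytes received.
    public static async Task<long> ReceiveUntilClosedAsync(Socket socket, Action<byte[], int> onData)
    {
        long total = 0;
        var empty = new ArraySegment<byte>(Array.Empty<byte>());
        while (true)
        {
            // Zero-byte read: no real buffer is tied up while the connection is idle.
            // It completes once data is available or the peer has closed.
            await socket.ReceiveAsync(empty, SocketFlags.None);

            // Only now borrow a real buffer for the actual read.
            byte[] buffer = ArrayPool<byte>.Shared.Rent(4096);
            try
            {
                int n = await socket.ReceiveAsync(new ArraySegment<byte>(buffer), SocketFlags.None);
                if (n == 0)
                {
                    return total; // 0 bytes on a real read: graceful close by the remote side
                }
                onData(buffer, n);
                total += n;
            }
            finally
            {
                ArrayPool<byte>.Shared.Return(buffer);
            }
        }
    }
}
```

The zero-byte completion alone cannot distinguish "data arrived" from "peer closed", which is why the sketch always follows it with a real read and treats a 0-byte result of that read as the disconnect.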
Of course I accept the possibility that my code is wrong, so I made a basic server with .NET's built-in SocketAsyncEventArgs class (as the official example describes), which basically does the same thing using IOCP, and the results are the same.
Everything works fine, except that when thousands of clients disconnect at the same time, a few thousand of the disconnection (read on disconnect) events are not recognized.
I know I should perform an I/O operation and check the return value to see whether the socket can still perform I/O, and disconnect it if not. The problem is that in some cases I have nothing to send to the socket, I just receive data, and doing it periodically would be almost the same as polling: it would cause high load and wasted CPU work with thousands of connections.
(I have tried numerous ways of closing the clients, from graceful disconnection to proper TCP socket closing, on both Windows and Linux clients; the results are always the same.)
My questions:
Is there any known solution to this problem?
Is there any efficient way to recognize TCP (graceful) connection closing by remote?
Can I somehow set a read-timeout to overlapped socket read?
Any help would be appreciated, thank you!

Related

Deadlocked when both endpoints do Socket.SendAsync

So I have written this client/server socket application that uses the SocketAsyncEventArgs "method" for doing async sockets.
Using the same library I have used for many other applications, I now for the first time experience a situation that I never anticipated.
Our new client/server application, when started, begins to send lots of data in both directions.
When done in unit-tests using mock-objects (without delays) to mimic normal socket operations, it all works well.
But in real situations using real sockets, we get a sort of deadlock where both endpoints are stuck in a Socket.SendAsync() operation (yes it returned true, was not synchronously handled)
My idea is that the receive buffers of both parties are full, and the TCP stack is no longer acknowledging any frames. (Both are connected to 127.0.0.1.)
So I made the receivebuffer twice as large as the sendbuffer, but unfortunately it is not that simple due to the nature of our "protocol", and how we determine to send or receive.
I now have to re-think the method that determines when to start sending and when to start receiving.
A complicating factor is that the purpose of this connection is to multiplex multiple bi-directional general-purpose communication channels over this one socket connection. That means that there is no pre-determined sequence of communication; all channels may have their own protocols.
Of course, there is a TLS initiation, handshake and authentication, which all work well, but when the connection becomes operational and the channels start their own communications, the only sure thing is that received data has a size and channel number as a header.
After each operation, I check whether there is any data waiting in the receive buffer, by checking Socket.Available.
Combining this with a measurement of how much data was received since the last send operation, and how full the transmit buffer is getting, I decide to receive more, start sending, or do nothing and poll again in xx ms.
I now realize that this is wrong.
Am I trying to accomplish something that is simply not possible using only one socket connection?
Has anyone ever tried to accomplish something similar, or does anyone know a safe way of doing this that does not introduce these odd lock-ups?
Thanks,
Theo.
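One possible way around the mutual-send stall described in this question, sketched here under assumptions, is to keep a receive permanently outstanding and independent of the send path, so that each side always drains what the other side is sending. The sketch relies on the fact that a Socket can have one send and one receive in progress at the same time. The names DuplexPump, ReceiveLoopAsync, SendAllAsync and RunAsync are hypothetical, and the shutdown after sending exists only so the demonstration terminates; a long-lived multiplexed connection would keep sending instead.

```csharp
using System;
using System.Net.Sockets;
using System.Threading.Tasks;

public static class DuplexPump
{
    // Drains the socket continuously until the peer shuts down its sending side.
    public static async Task<long> ReceiveLoopAsync(Socket socket)
    {
        var buffer = new byte[64 * 1024];
        long total = 0;
        while (true)
        {
            int n = await socket.ReceiveAsync(new ArraySegment<byte>(buffer), SocketFlags.None);
            if (n == 0)
            {
                return total; // peer finished sending
            }
            total += n; // a real implementation would parse the size/channel header here
        }
    }

    // Sends the whole payload, then signals end-of-stream (demonstration only).
    public static async Task SendAllAsync(Socket socket, byte[] payload)
    {
        int sent = 0;
        while (sent < payload.Length)
        {
            var slice = new ArraySegment<byte>(payload, sent, payload.Length - sent);
            sent += await socket.SendAsync(slice, SocketFlags.None);
        }
        socket.Shutdown(SocketShutdown.Send);
    }

    // The receive loop is started first and never waits for the sender.
    public static async Task<long> RunAsync(Socket socket, byte[] payload)
    {
        Task<long> receiving = ReceiveLoopAsync(socket);
        await SendAllAsync(socket, payload);
        return await receiving;
    }
}
```

With this shape there is no decision about "when to send and when to receive": both directions progress independently, and back-pressure is left to TCP flow control rather than to polling Socket.Available.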

Scalability test - Connections dropped after many connections are established

I am programming with sockets (TcpListener and TcpClient actually) in C#. I wrote a server that accepts client connections and streams data to them.
In order to test scalability, I wrote a test harness that creates a certain number of connections (say 1000) in a loop, connects to the server, and writes whatever data is received to the console.
After the server receives about 1300 connections, the clients' connection attempts start failing with a regular "No connection could be made because the target machine actively refused it" exception. If the clients keep trying, some connections get through, but there are still many of them that don't. I even tried putting in delays, e.g. three simultaneous clients each opening one connection per second to the server, but the problem remains.
My guess was that the listen backlog was becoming full, but given the delays I introduced, I now doubt it. How can this behaviour be explained and solved?
Edit: before anyone else jumps on this question and marks it as duplicate without having read it...
I am using asynchronous sockets using the Asynchronous Programming Model. That's the old BeginXXX/EndXXX pattern, not the new async/await pattern. The APM uses the Thread Pool underneath, so this is not a naive one-thread-per-connection model. The connections are dormant most of the time unless I/O occurs. In that case, the .NET Framework automatically allocates threads to handle this.
Edit 2: The gist of this question, for those who thought it was too [insert silly adjective here], is: why does a server drop connections when under a heavy load? The error message I quoted usually occurs when a connection cannot be established (i.e. when you got the ip/port wrong), but this clearly isn't the case.
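If the listen backlog is suspected, one inexpensive check is to start the listener with an explicit backlog instead of the default. The sketch below only shows how that value is passed; the helper name StartListener and the value 512 used in the usage note are assumptions, the operating system may clamp whatever is requested, and this does not establish that the backlog is the actual cause here.

```csharp
using System.Net;
using System.Net.Sockets;

public static class BacklogExample
{
    // Starts a TcpListener with an explicit backlog hint.
    // Passing port 0 lets the OS pick a free port.
    public static TcpListener StartListener(IPAddress address, int port, int backlog)
    {
        var listener = new TcpListener(address, port);
        listener.Start(backlog); // maximum length of the pending-connections queue
        return listener;
    }
}
```

A call such as BacklogExample.StartListener(IPAddress.Any, 5000, 512) would request a queue of 512 pending connections. The queue only holds connections that have not been accepted yet, so the accept loop still has to keep up with the arrival rate.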

What happens to a socket on suspend/resume in windows

I have a C# .NET 4 application that listens on a socket using BeginReceiveFrom and EndReceiveFrom. All works as expected until I put the machine to sleep and then resume.
At that point EndReceiveFrom executes and throws an exception (Cannot access a disposed object). It appears that the socket is disposed when the machine is suspended, but I'm not sure how to handle this.
Do I presume that all sockets have been disposed and recreate them all from scratch? I'm having problems tracking down the exact issue as remote debugging also breaks on suspend/resume.

What happens during suspend/resume very much depends on your hardware and networking setup. If your network card is not disabled during suspend, and the suspend is brief, open connections will survive suspend/resume without any problem (open TCP connections can time out on the other end of course).
However, if your network adapter is disabled during the sleep, or it is a USB adapter that gets disabled because it is connected to a disabled hub, or your computer gets a new IP address from DHCP, or your wireless adapter gets reconnected to a different access point, etc., then all current connections are going to be dropped, listening sockets will no longer be valid, etc.
This is not specific to sleep/resume. Network interfaces can come up and go down at any time, and your code must handle it. You can easily simulate this with a USB network adapter, e.g. yank it out of your computer and your code must handle it.

I've had similar issues with suspend/resume and sockets (under .NET 4 and Windows 8, but I suspect not limited to these).
Specifically, I had a client socket application which only received data. Reading was done via BeginReceive with a call-back. Code in the call-back handled typical failure cases (e.g. remote server closes connection either gracefully or not).
When the client machine went to sleep (and this probably applies to the newer Windows 8 Fast Start mode too which is really just a kind of sleep/hibernate) the server would close the connection after a few seconds. When the client woke up however the async read call-back was not getting called (which I would expect to occur as it should get called when the socket has an error condition/is closed in addition to when there is data). I explicitly added code on a timer to the client to periodically check for this condition and recover, however even here (and using a combination of Poll, Available and Connected to check if the connection was up) the socket on the client side STILL appeared to be connected, so the recovery code never ran. I think if I had tried sending data then I would have received an error, but as I said this was strictly one-way.
The solution I ended up using was to detect the resume from sleep condition and close and re-establish my socket connections when this occurred. There are quite a few ways of detecting resume; in my case I was writing a Windows Service, so I could simply override the ServiceBase.OnPowerEvent method.
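A minimal sketch of the approach described above, for a Windows Service, might look like the following. ServiceBase.OnPowerEvent, CanHandlePowerEvent and PowerBroadcastStatus are the real System.ServiceProcess members; the class name ReconnectingService, the IsResume helper and the reconnect callback are hypothetical, and which resume statuses to react to is an assumption. This is Windows-only and is not the answerer's actual code.

```csharp
using System;
using System.ServiceProcess;

public class ReconnectingService : ServiceBase
{
    // Hypothetical callback that closes and re-establishes the socket connections.
    private readonly Action _reconnect;

    public ReconnectingService(Action reconnect)
    {
        _reconnect = reconnect;
        CanHandlePowerEvent = true; // without this the service is not notified of power events
    }

    // Assumption: both resume notifications should trigger a reconnect.
    public static bool IsResume(PowerBroadcastStatus status)
    {
        return status == PowerBroadcastStatus.ResumeSuspend
            || status == PowerBroadcastStatus.ResumeAutomatic;
    }

    protected override bool OnPowerEvent(PowerBroadcastStatus powerStatus)
    {
        if (IsResume(powerStatus))
        {
            _reconnect();
        }
        return base.OnPowerEvent(powerStatus);
    }
}
```

More than one resume notification may arrive for a single wake-up, so the reconnect callback should tolerate being called repeatedly.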

.net Tcp Server receives bytes in large clumps every few minutes

Background:
C# .net synchronous Tcp server
a TcpClient object is assigned by blocking on a TcpListener with the AcceptTcpClient method
once there's a TcpClient object, I pass it to a thread that invokes the client's GetStream method to create a NetworkStream
this NetworkStream is looped over, in each iteration doing a networkStream.Read(someBuffer, 0, 4096)
right now client and server are located on the same network, with no congestion to speak of
my server has plenty of memory to spare
if I load my server software onto another machine, the problem goes away
the kicker: traffic from a network Linux box gets through fine and on time
My server has been functioning just fine for several months. However, over the past weekend, instead of receiving small groups of bytes in quick succession, the place where the process begins ( tcpListener.AcceptTcpClient() ) only returns every couple of minutes. So my server sits idle, then gets 30-50 client requests all bundled into one huge block of bytes. Needless to say, this causes a huge delay and puts strain on my server. If the clump of client requests is big enough, it can take my server 30 minutes to catch up.
In logging built into my clients, I can see them do network writes, and flush between each one. So the clients are functioning correctly.
This reeks of some kind of system intervention. Is my Tcp server (as described above) bad, or is something in Windows interfering with my traffic, and how can I tell?
Thanks guys.

You might want to install some packet capture software at each Tcp endpoint. You'd be surprised. I'm suffering with a similar problem now, almost completely identical actually.
When I put capture software in place I noted that traffic between endpoints was fast and on time as expected.

Are TCP Connections resource intensive?

I have a TCP server that gets data from one (and only one) client. When this client sends the data, it makes a connection to my server, sends one (logical) message and then does not send any more on that connection.
It will then make another connection to send the next message.
I have a co-worker who says that this is very bad from a resources point of view. He says that making a connection is resource intensive and takes a while. He says that I need to get this client to make a connection and then just keep using it for as long as we need to communicate (or until there is an error).
One benefit of using separate connections is that I can probably multi-thread them and get more throughput on the line. I mentioned this to my co-worker and he told me that having lots of sockets open will kill the server.
Is this true? Or can I just allow it to make a separate connection for each logical message that needs to be sent. (Note that by logical message I mean an xml file that is of variable length.)

It depends entirely on the number of connections that you are intending to open and close and the rate at which you intend to open them.
Unless you go out of your way to avoid the TIME_WAIT state by aborting the connections rather than closing them gracefully, you will accumulate sockets in TIME_WAIT state on either the client or the server. With a single client it doesn't actually matter where these accumulate, as the issue will be the same. If the rate at which you use your connections is faster than the rate at which your TIME_WAIT connections close, then you will eventually get to a point where you cannot open any new connections, because you have no ephemeral ports left: all of them are in use by sockets that are in TIME_WAIT.
I write about this in much more detail here: http://www.serverframework.com/asynchronousevents/2011/01/time-wait-and-its-design-implications-for-protocols-and-scalable-servers.html
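The port-exhaustion argument above comes down to simple arithmetic: in steady state, roughly (connection rate) x (TIME_WAIT duration) ports are tied up at once. A small sketch of that calculation follows; the names TimeWaitBudget and MaxSustainedRate are hypothetical, and the inputs in the usage note are illustrative values rather than the defaults of any particular system.

```csharp
using System;

public static class TimeWaitBudget
{
    // Highest steady connection rate (connections per second) from one client address
    // to one server endpoint before all ephemeral ports are held by TIME_WAIT sockets.
    public static double MaxSustainedRate(int ephemeralPorts, double timeWaitSeconds)
    {
        if (ephemeralPorts <= 0) throw new ArgumentOutOfRangeException(nameof(ephemeralPorts));
        if (timeWaitSeconds <= 0) throw new ArgumentOutOfRangeException(nameof(timeWaitSeconds));
        // Each connection holds its port for timeWaitSeconds after closing,
        // so the port range is consumed at rate * timeWaitSeconds.
        return ephemeralPorts / timeWaitSeconds;
    }
}
```

For example, assuming 16384 ephemeral ports and a 120-second TIME_WAIT, MaxSustainedRate(16384, 120) gives about 136.5 connections per second; above that rate new connections eventually start failing.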
In general I would suggest that you keep a single connection and simply reopen it if it gets reset. The logic may appear to be a little more complex but the system will scale far better; you may only have one client now and the rate of connections may be such that you do not expect to suffer from TIME_WAIT issues but these facts may not stay the same for the life of your system...

The initiation sequence of a TCP connection is a very simple three-way handshake, which has very low overhead. There is no need to maintain a constant connection.
Also, having many TCP connections won't kill your server that quickly. Modern hardware and operating systems can handle hundreds of concurrent TCP connections, unless you are worried about denial-of-service attacks, which are obviously out of the scope of this question.

If your server has only a single client, I can't imagine in practice there'd be any issues with opening a new TCP socket per message. Sounds like your co-worker likes to prematurely optimize.
However, if you're flooding the server with messages, it may become an issue. But still, with a single client, I wouldn't worry about it.
Just make sure you close the socket when you're done with it. No need to be rude to the server :)

In addition to what everyone said, consider UDP. It's perfect for small messages where no response is expected, and on a local network (as opposed to Internet) it's practically reliable.

From the server's perspective, it is not a problem to have a very large number of connections open.
See: How many socket connections can a web server handle?
From the client's perspective, if measuring shows you need to avoid the time it takes to initiate connections and you want parallelism, you could create a connection pool. Multiple threads can re-use each of the connections and release them back into the pool when they're done. That does raise the complexity level, so once again, make sure you need it. You could also have logic to shrink and grow the pool based on activity; it would be a shame to hold connections open to the server overnight while the app is just sitting there idle.
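The pooling idea above can be sketched with a small generic pool. This is a minimal illustration under assumptions, not a production design: the type SimplePool and its Rent/Return methods are hypothetical names, the pool never shrinks, and it does not check whether a returned connection is still usable.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;

public sealed class SimplePool<T> where T : class
{
    private readonly ConcurrentBag<T> _idle = new ConcurrentBag<T>();
    private readonly SemaphoreSlim _slots;   // caps how many items are rented at once
    private readonly Func<T> _factory;       // creates a new item, e.g. opens a connection

    public SimplePool(int maxSize, Func<T> factory)
    {
        _slots = new SemaphoreSlim(maxSize, maxSize);
        _factory = factory;
    }

    // Blocks while maxSize items are already rented, then reuses an idle item or creates one.
    public T Rent()
    {
        _slots.Wait();
        try
        {
            return _idle.TryTake(out T item) ? item : _factory();
        }
        catch
        {
            _slots.Release(); // creation failed: give the slot back
            throw;
        }
    }

    // Puts an item back so another thread can reuse it.
    public void Return(T item)
    {
        _idle.Add(item);
        _slots.Release();
    }
}
```

For TCP, the factory could be something like () => new TcpClient(host, port), with host and port supplied by the application. A caller that hits an error on a connection should dispose of it instead of returning it, and would then need a way to release the slot without adding the item back, which this sketch leaves out.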
