So I've already written a TCP server with SocketAsyncEventArgs and the Socket.*Async methods. But I do love the await/async way of writing code using the Stream.ReadAsync and Stream.WriteAsync methods.
Is there any difference performance- or memory-wise, or is it simply a difference in syntax?
The advantages of using SocketAsyncEventArgs are to reduce strain on the garbage collector and to keep the socket buffers grouped together to prevent out-of-memory exceptions.
By using a pool of SocketAsyncEventArgs, you can eliminate the need to create and collect the IAsyncResult for every call to read and write.
Secondly, each SocketAsyncEventArgs can be assigned a portion of a large byte[] block. This prevents memory fragmentation, which helps reduce the memory footprint. Although this can technically also be done with the Begin/End calls, it is easier to permanently assign each SocketAsyncEventArgs its own buffer instead of assigning a block every time a read or write is called with the Begin/End methods.
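For illustration, a minimal sketch of that buffer-block idea; the segment size, pool size and pool type here are arbitrary assumptions, not something prescribed by the API:

using System.Collections.Generic;
using System.Net.Sockets;

static class SaeaBufferPool
{
    const int SegmentSize = 8 * 1024;      // assumed per-connection buffer size
    const int MaxConnections = 1000;       // assumed pool size

    // one big allocation; each SocketAsyncEventArgs keeps its own slice forever
    static readonly byte[] BigBlock = new byte[SegmentSize * MaxConnections];
    public static readonly Stack<SocketAsyncEventArgs> Pool = Build();

    static Stack<SocketAsyncEventArgs> Build()
    {
        var pool = new Stack<SocketAsyncEventArgs>();
        for (int i = 0; i < MaxConnections; i++)
        {
            var args = new SocketAsyncEventArgs();
            args.SetBuffer(BigBlock, i * SegmentSize, SegmentSize);  // permanent slice
            pool.Push(args);
        }
        return pool;
    }
}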
One last difference that can be important is that, according to the MSDN Library, the Begin/End Send methods may block while the socket buffer is full. SocketAsyncEventArgs will not block; instead, only a certain number of bytes will be written, depending on the amount of space in the socket's buffer, and the callback will report the number of bytes written.
The advantages are really not noticeable except in edge cases. Unless you run into one of the problems mentioned, I suggest you choose whatever is most comfortable and maintainable for you.
Related
I hope the question title isn't too imprecise, but it may turn out that a direct replacement isn't available and some code restructuring becomes inevitable.
My task is to stream audio frames from HTTP, pipe them through ffmpeg and then shove them into some audio buffer.
Now, a classical approach would probably involve multiple threads and lots of garbage collection, which I want to avoid.
My modern attempt was to use an async IAsyncEnumerable<Memory<byte>>, where I'd basically read content from HTTP into a fixed-size byte array that is allocated once.
Then I'd yield return the Memory struct ("pointer"), which would cause the caller to immediately consume and transform that content via ffmpeg.
With that, I'd stay in the lock-step defined by the chunk reads from the HttpClient. That way, all the processing would happen on one thread, and I would only have one big, long-lived byte array as the data store.
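For reference, a rough sketch of the async-stream shape described above (hypothetical method name and buffer size; this is the part that Unity's Mono cannot run, as noted below):

using System;
using System.Collections.Generic;
using System.IO;
using System.Net.Http;

static class AudioChunks
{
    // Reads into one fixed buffer and yields a Memory<byte> view per chunk;
    // the caller must consume each chunk before the next read overwrites it.
    public static async IAsyncEnumerable<Memory<byte>> ReadChunksAsync(HttpClient http, string url)
    {
        byte[] buffer = new byte[16 * 1024];                 // allocated once
        using Stream stream = await http.GetStreamAsync(url);
        int read;
        while ((read = await stream.ReadAsync(buffer, 0, buffer.Length)) > 0)
            yield return buffer.AsMemory(0, read);           // a "pointer", no copy
    }
}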
The problem with this is that Unity's Mono version doesn't support the C# 8.0 feature of async streams (i.e. awaiting the async enumerable), so I need to come up with a replacement.
I thought about using System.Threading.Channels, but those come with a few caveats in how the control flow is handled: with Channels, I cannot guarantee that a written Memory<T> is read immediately, so it can happen that the HTTP side overwrites the backing buffer before the other end has read its content. Avoiding that would mean copying lots of data, causing garbage.
An old-school alternative would be to maintain some kind of ring buffer, with a write pointer and a read pointer that each move as their end writes/reads. Hand-rolling that felt dumb though; maybe there is a roughly equivalent and elegant API?
Also, would you rather just have two threads and let them busy-wait? Or can I maybe just accept the garbage-collector pressure and use some regular queue structure, one that potentially even uses notify/wait to wake up waiting threads so they don't have to SpinWait/busy-wait?
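To make that last option concrete, here is a minimal sketch of the notify/wait queue idea using BlockingCollection<T> (the chunk type and capacity are assumptions; this variant copies chunks, so it does accept some GC pressure):

using System.Collections.Concurrent;
using System.Threading.Tasks;

static class ChunkQueueSketch
{
    public static void Run()
    {
        // bounded: the producer blocks instead of overwriting unconsumed data
        var queue = new BlockingCollection<byte[]>(boundedCapacity: 4);

        var consumer = Task.Run(() =>
        {
            foreach (byte[] chunk in queue.GetConsumingEnumerable())
            {
                // hand chunk to ffmpeg / the audio buffer here
            }
        });

        // producer side, e.g. inside the HTTP read loop:
        queue.Add(new byte[4096]);   // blocks (no busy wait) when the bound is hit
        queue.CompleteAdding();      // signal end of stream
        consumer.Wait();
    }
}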
Let's say I have a static list List<string> dataQueue, where data keeps getting added at random intervals and also at a varying rate (1-1000 entries/second).
My main objective is to send the data from the list to the server, I'm using a TcpClient class.
What I've done so far is send the data synchronously through the TcpClient on a single thread:
byte[] bytes = Encoding.ASCII.GetBytes(message);
tcpClient.GetStream().Write(bytes, 0, bytes.Length);
//The client is already connected at the start
And I remove the entry from the list once the data is sent.
This works fine, but the sending is not fast enough: the list keeps growing and consuming more memory while it is iterated and sent one entry at a time.
My question is: can I use the same tcpClient object to write concurrently from another thread, or can I use another tcpClient object with a new connection to the same server in another thread? What is the most efficient (quickest) way to send this data to the server?
PS: I don't want to use UDP
Right; this is a fun topic which I think I can opine about. It sounds like you are sharing a single socket between multiple threads - perfectly valid as long as you do it very carefully. A TCP socket is a logical stream of bytes, so you can't use it concurrently as such, but if your code is fast enough, you can share the socket very effectively, with each message being consecutive.
Probably the very first thing to look at is: how are you actually writing the data to the socket? what is your framing/encoding code like? If this code is simply bad/inefficient: it can probably be improved. For example, is it indirectly creating a new byte[] per string via a naive Encode call? Are there multiple buffers involved? Is it calling Send multiple times while framing? How is it approaching the issue of packet fragmentation? etc
As a very first thing to try - you could avoid some buffer allocations:
var enc = Encoding.ASCII;
byte[] bytes = ArrayPool<byte>.Shared.Rent(enc.GetMaxByteCount(message.Length));
// note: leased buffers can be oversized; and in general, GetMaxByteCount will
// also be oversized; so it is *very* important to track how many bytes you've used
int byteCount = enc.GetBytes(message, 0, message.Length, bytes, 0);
tcpClient.GetStream().Write(bytes, 0, byteCount);
ArrayPool<byte>.Shared.Return(bytes);
This uses a leased buffer to avoid creating a byte[] each time - which can massively improve GC impact. If it was me, I'd also probably be using a raw Socket rather than the TcpClient and Stream abstractions, which frankly don't gain you a lot. Note: if you have other framing to do: include that in the size of the buffer you rent, use appropriate offsets when writing each piece, and only write once - i.e. prepare the entire buffer once - avoid multiple calls to Send.
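To make that concrete, here is a sketch that folds a simple 4-byte length prefix (an assumption - use whatever framing your protocol actually has) into one rented buffer and writes it in a single call:

using System;
using System.Buffers;
using System.Buffers.Binary;
using System.Net.Sockets;
using System.Text;

static class Framing
{
    public static void SendFrame(NetworkStream stream, string message)
    {
        var enc = Encoding.ASCII;
        int maxPayload = enc.GetMaxByteCount(message.Length);
        byte[] buffer = ArrayPool<byte>.Shared.Rent(4 + maxPayload);   // header + payload
        try
        {
            // encode the payload after the 4-byte length-prefix slot
            int payloadLen = enc.GetBytes(message, 0, message.Length, buffer, 4);
            BinaryPrimitives.WriteInt32LittleEndian(buffer.AsSpan(0, 4), payloadLen);

            // single write for the whole frame: less chance of packet fragmentation
            stream.Write(buffer, 0, 4 + payloadLen);
        }
        finally
        {
            ArrayPool<byte>.Shared.Return(buffer);
        }
    }
}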
Right now, it sounds like you have a queue and dedicated writer; i.e. your app code appends to the queue, and your writer code dequeues things and writes them to the socket. This is a reasonable way to implement things, although I'd add some notes:
List<T> is a terrible way to implement a queue - removing things from the start requires a reshuffle of everything else (which is expensive); if possible, prefer Queue<T>, which is implemented perfectly for your scenario
it will require synchronization, meaning you need to ensure that only one thread alters the queue at a time - this is typically done via a simple lock, i.e. lock(queue) {queue.Enqueue(newItem);} and SomeItem next; lock(queue) { next = queue.Count == 0 ? null : queue.Dequeue(); } if (next != null) {...write it...}.
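As a sketch of that queue + dedicated-writer shape (the frame type and the Monitor-based wake-up are my assumptions; the original point only requires a lock):

using System.Collections.Generic;
using System.Net.Sockets;
using System.Threading;

class QueuedWriter
{
    private readonly Queue<byte[]> _queue = new Queue<byte[]>();
    private readonly NetworkStream _stream;

    public QueuedWriter(NetworkStream stream) { _stream = stream; }

    // app-code: enqueue an already-encoded frame
    public void Enqueue(byte[] frame)
    {
        lock (_queue)
        {
            _queue.Enqueue(frame);
            Monitor.Pulse(_queue);          // wake the writer if it is waiting
        }
    }

    // dedicated writer thread: only this thread ever touches the socket
    public void WriteLoop()
    {
        while (true)
        {
            byte[] next;
            lock (_queue)
            {
                while (_queue.Count == 0) Monitor.Wait(_queue);
                next = _queue.Dequeue();
            }
            _stream.Write(next, 0, next.Length);
        }
    }
}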
This approach is simple, and has some advantages in terms of avoiding packet fragmentation - the writer can use a staging buffer, and only actually write to the socket when a certain threshold is buffered, or when the queue is empty, for example - but it has the possibility of creating a huge backlog when stalls occur.
However! The fact that a backlog has occurred indicates that something isn't keeping up; this could be the network (bandwidth), the remote server (CPU) - or perhaps the local outbound network hardware. If this is only happening in small blips that then resolve themselves - fine (especially if it happens when some of the outbound messages are huge), but: one to watch.
If this kind of backlog is recurring, then frankly you need to consider that you're simply saturated for the current design, so you need to unblock one of the pinch points:
making sure your encoding code is efficient is step zero
you could move the encode step into the app-code, i.e. prepare a frame before taking the lock, encode the message, and only enqueue an entirely prepared frame; this means that the writer thread doesn't have to do anything except dequeue, write, recycle - but it makes buffer management more complex (obviously you can't recycle buffers until they've been completely processed)
reducing packet fragmentation may help significantly, if you're not already taking steps to achieve that
otherwise, you might need (after investigating the blockage):
better local network hardware (NIC) or physical machine hardware (CPU etc)
multiple sockets (and queues/workers) to round-robin between, distributing load
perhaps multiple server processes, with a port per server, so your multiple sockets are talking to different processes
a better server
multiple servers
Note: in any scenario that involves multiple sockets, you want to be careful not to go mad and have too many dedicated worker threads; if that number goes above, say, 10 threads, you probably want to consider other options - perhaps involving async IO and/or pipelines (below).
For completeness, another basic approach is to write from the app-code; this approach is even simpler, and avoids the backlog of unsent work, but: it means that now your app-code threads themselves will back up under load. If your app-code threads are actually worker threads, and they're blocked on a sync/lock, then this can be really bad; you do not want to saturate the thread-pool, as you can end up in the scenario where no thread-pool threads are available to satisfy the IO work required to unblock whichever writer is active, which can land you in real problems. This is not usually a scheme that you want to use for high load/volume, as it gets problematic very quickly - and it is very hard to avoid packet fragmentation since each individual message has no way of knowing whether more messages are about to come in.
A more recent option to consider is "pipelines"; this is a new IO framework in .NET that is designed for high-volume networking, giving particular attention to things like async IO, buffer re-use, and a well-implemented buffer/backlog mechanism that makes it possible to use the simple writer approach (synchronize while writing) and have that not translate into direct sends - it manifests as an async writer with access to a backlog, which makes packet-fragmentation avoidance simple and efficient. This is quite an advanced area, but it can be very effective. The problematic part for you will be: it is designed for async usage throughout, even for writes - so if your app-code is currently synchronous, this could be a pain to implement. But: it is an area to consider. I have a number of blog posts on this topic, and a range of OSS examples and real-life libraries that make use of pipelines that I can point you at, but: this isn't a "quick fix" - it is a radical overhaul of your entire IO layer. It also isn't a magic bullet - it can only remove overhead due to local IO processing costs.
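For a flavour of what the pipelines shape looks like (a sketch only, using System.IO.Pipelines; the framing and consumption details are placeholders):

using System;
using System.IO.Pipelines;
using System.Net.Sockets;
using System.Threading.Tasks;

static class PipeSketch
{
    // producer side: copy a prepared frame into the pipe's buffer and flush
    public static async Task WriteFrameAsync(PipeWriter writer, ReadOnlyMemory<byte> frame)
    {
        Memory<byte> target = writer.GetMemory(frame.Length);
        frame.CopyTo(target);
        writer.Advance(frame.Length);
        await writer.FlushAsync();          // back-pressure kicks in here when the backlog grows
    }

    // one pump loop drains the accumulated backlog and does the actual sends
    public static async Task PumpAsync(PipeReader reader, Socket socket)
    {
        while (true)
        {
            ReadResult result = await reader.ReadAsync();
            foreach (ReadOnlyMemory<byte> segment in result.Buffer)
                await socket.SendAsync(segment, SocketFlags.None);
            reader.AdvanceTo(result.Buffer.End);   // everything read has been sent
            if (result.IsCompleted) break;
        }
    }
}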
Is there a valid reason to not use TcpListener for implementing a high performance/high throughput TCP server instead of SocketAsyncEventArgs?
I've already implemented this high-performance/high-throughput TCP server using SocketAsyncEventArgs and went through all sorts of headaches handling those pinned buffers, using a big pre-allocated byte array and pools of SocketAsyncEventArgs for accepting and receiving, put together with some low-level stuff and shiny smart code with some TPL Dataflow and some Rx, and it works perfectly; almost textbook in this endeavor - actually I've learnt more than 80% of this stuff from other people's code.
However there are some problems and concerns:
Complexity: I cannot delegate any sort of modification to this server to another member of the team. That ties me to this sort of task, and I cannot pay enough attention to other parts of other projects.
Memory Usage (pinned byte arrays): With SocketAsyncEventArgs the pools need to be pre-allocated. So to handle 100,000 concurrent connections (the worst case, even across different ports), a big pile of RAM uselessly sits there, pre-allocated (even if those conditions are only met occasionally, the server should be able to handle 1 or 2 such peaks every day).
TcpListener actually works well: I actually put TcpListener to the test (with some tricks, like using AcceptTcpClient on a dedicated thread rather than the async version, then handing the accepted connections to a ConcurrentQueue instead of creating Tasks in-place, and the like), and with the latest version of .NET it worked very well - almost as well as SocketAsyncEventArgs, with no data loss and a low memory footprint, which helps avoid wasting too much RAM on the server, and no pre-allocation is needed.
So why do I not see TcpListener being used anywhere, and why is everybody (including myself) using SocketAsyncEventArgs? Am I missing something?
I see no evidence that this question is about TcpListener at all. It seems you are only concerned with the code that deals with a connection that already has been accepted. Such a connection is independent of the listener.
SocketAsyncEventArgs is a CPU-load optimization. I'm convinced you can achieve a higher rate of operations per second with it. How significant is the difference to normal APM/TAP async IO? Certainly less than an order of magnitude. Probably between 1.2x and 3x. Last time I benchmarked loopback TCP transaction rate I found that the kernel took about half of the CPU usage. That means your app can get at most 2x faster by being infinitely optimized.
Remember that SocketAsyncEventArgs was added to the BCL back in .NET Framework 3.5 (2007), when CPUs were far less capable.
Use SocketAsyncEventArgs only when you have evidence that you need it. It makes you far less productive and creates more potential for bugs.
Here's the template that your socket processing loop should look like:
while (ConnectionEstablished())
{
    var someData = await ReadFromSocketAsync(socket);
    await ProcessDataAsync(someData);
}
Very simple code. No callbacks thanks to await.
In case you are concerned about managed heap fragmentation: Allocate a new byte[1024 * 1024] on startup. When you want to read from a socket read a single byte into some free portion of this buffer. When that single-byte read completes you ask how many bytes are actually there (Socket.Available) and synchronously pull the rest. That way you only pin a single rather small buffer and still can use async IO to wait for data to arrive.
This technique does not require polling. Since Socket.Available can only increase while we are not reading from the socket, we do not risk accidentally performing a read that is too small.
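A minimal sketch of that read pattern (simplified here to a tiny per-connection buffer rather than a slice of the shared one):

using System;
using System.Net.Sockets;
using System.Threading.Tasks;

class SingleByteReader
{
    private readonly byte[] _oneByte = new byte[1];   // the only buffer pinned during the await
    private readonly Socket _socket;
    private readonly NetworkStream _stream;

    public SingleByteReader(Socket socket)
    {
        _socket = socket;
        _stream = new NetworkStream(socket);
    }

    public async Task<byte[]> ReadChunkAsync()
    {
        int got = await _stream.ReadAsync(_oneByte, 0, 1);
        if (got == 0) return Array.Empty<byte>();       // remote side closed

        int pending = _socket.Available;                // bytes already buffered locally
        var data = new byte[1 + pending];
        data[0] = _oneByte[0];
        if (pending > 0)
            _socket.Receive(data, 1, pending, SocketFlags.None);   // synchronous; data is already there
        return data;
    }
}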
Alternatively, you can combat managed heap fragmentation by allocating few very big buffers and handing out chunks.
Or, if you don't find this to be a problem you don't need to do anything.
Most people seem to build a listener socket and will include "events" to be invoked for processing, e.g. SocketConnected, DataReceived. The programmer initializes a listener and binds methods to those "events" to receive socket events and build the service.
I feel that, in a large-scale implementation, it would be more efficient to avoid delegates in the listener and to complete all the processing in the callback methods, possibly using different callbacks for receiving data based on knowing which command is coming next (this is part of my Message Frame Structure).
I have looked around for highly scalable examples, but I only find the standard MSDN implementations for asynchronous sockets or variations from other programmers that replicate the MSDN example.
Does anyone have any good experience that could point me in the right direction?
Note> The service will hold thousands of clients, and in most cases the clients stay connected; updates received by the service will be sent out to all other connected clients. It is a synchronized P2P-type system for an object-oriented database.
The difference between an event call and a callback is negligible. A callback is just the invocation of a delegate (or a function pointer). You can't do asynchronous operation without some sort of callback and expect to get results of any kind.
With events, they can be multicast. This means multiple callback invocations--so that would be more costly, because you're calling multiple methods. But if you're doing that, you probably need to--the alternative is to have multiple delegates and call them manually, so there'd be no real benefit. Events often include sender/eventargs, so you've got that extra object and the creation of the EventArgs instance; but I've never seen a situation where that affected performance.
Personally, I don't use the event-based asynchronous pattern--I've found (prior to .NET 4.5) the asynchronous programming model to be more ubiquitous. In .NET 4.5 I much prefer the task asynchronous pattern (single methods that end in Async, instead of one method starting with Begin and one starting with End) because it can be used with async/await and is less wordy.
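For example (a sketch, with a placeholder buffer size), the APM pair BeginReceive/EndReceive can be wrapped with Task.Factory.FromAsync so it composes with await:

using System.Net.Sockets;
using System.Threading.Tasks;

static class TapWrapper
{
    public static Task<int> ReceiveTaskAsync(Socket s, byte[] buffer)
    {
        // wrap the Begin/End pair as a Task so it can be awaited
        return Task.Factory.FromAsync(
            s.BeginReceive(buffer, 0, buffer.Length, SocketFlags.None, null, null),
            s.EndReceive);
    }

    public static async Task ReadLoopAsync(Socket s)
    {
        var buffer = new byte[4096];
        int read;
        while ((read = await ReceiveTaskAsync(s, buffer)) > 0)
        {
            // process buffer[0..read) here
        }
    }
}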
Now, if the question is the difference between new AsyncCallback(Async_Send_Receive.Read_Callback) e.g.:
s.BeginReceive(so.buffer, 0, StateObject.BUFFER_SIZE, 0,
new AsyncCallback(Async_Send_Receive.Read_Callback), so);
and just Async_Send_Receive.Read_Callback e.g.:
s.BeginReceive(so.buffer, 0, StateObject.BUFFER_SIZE, 0,
Async_Send_Receive.Read_Callback, so);
The second is just shorthand for the first; the AsyncCallback delegate is still created under the covers.
But, as with most things, even if it's generally accepted not to be noticeably different in performance: test and measure. If one way has more benefits (including performance) than another, use that one.
My only advice to you is this: Go with the style that provides the most clarity.
Eliminating an entire language feature because of an unmeasured speed difference would be premature. The cost of method calls/delegate invocations is highly unlikely to be the bottleneck in your code. Sure, you could benchmark the relative cost of one versus the other, but if your program only spends 1% of its time setting up method invocations, then even huge differences won't really affect it.
My best advice to you, if you really want to juice your server: just make sure that all your IO happens asynchronously, and never run long-running tasks on the thread pool. .NET 4.5 async/await really simplifies all of this; consider using it for more maintainable code.
I have worked with live betting systems using sockets and with two-way active messaging. It is really easier to work with a framework that handles the socket layer, like WCF P2P. It handles all the connection problems for you, and you can concentrate on your business logic.
I have an unmanaged method that takes high CPU when executed. Is it safe to say that unmanaged calls naturally take high CPU?
Following is the code:
public void ReadAt(long ulOffset, IntPtr pv, int cb, out UIntPtr pcbRead)
{
    // copy whatever was read into the managed 'buffer' field out to the caller's unmanaged memory
    Marshal.Copy(buffer, 0, pv, bytesRead);
    pcbRead = new UIntPtr((uint)bytesRead);
    bytesRead = 0;
    if (streamClosed)
        buffer = null;      // release the managed buffer once the stream is done
}
No, it's not safe to generalize this. Both managed and unmanaged methods take whatever CPU they need to execute their code.
When someone says unmanaged calls may be expensive they usually mean the overhead from switching between managed and unmanaged. This particular cost will only matter if you do unmanaged calls in tight loops like per-pixel processing on a large image.
Some of the overhead of unmanaged calls can be removed by proper attributes, in particular it is possible to move the security checks from per-call to assembly-load-time. This is of course already done for all unmanaged functions in the .NET framework.
The best guess (without more context) about why you are spending so much time in that function is that you are either (a) copying a very large array or (b) you are calling the method very often in a loop.
In the first case the overhead from switching between managed and unmanaged for Marshal.Copy is negligible; copying a large memory block will always saturate the CPU (i.e. 100% usage of one core). There is nothing you can do except eliminate the copy operation completely (which may or may not be possible, depending on how you use the buffers).
If you are in the second case and the arrays are very small, it may be worth switching to a purely managed copy. But don't do this without measuring; it's easy to guess wrong, and the unmanaged implementation of Marshal.Copy can pull more tricks than the managed JIT to make up for the overhead.
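For what it's worth, a sketch of such a managed copy into the same unmanaged destination (assumes Span<T> is available, e.g. via System.Memory, and an unsafe context; measure before adopting):

using System;

static class CopyAlternatives
{
    public static unsafe void CopyManaged(byte[] source, int count, IntPtr destination)
    {
        // wrap the native destination in a Span and let the managed copy do the work
        var dst = new Span<byte>((void*)destination, count);
        source.AsSpan(0, count).CopyTo(dst);
    }
}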
PS: You might want to read this - high CPU usage by itself is not a bad thing; the computer is just trying to get things done as fast as possible. Managed or unmanaged does not matter; if your usage is below 100% (per core), it just means your computer has nothing left to do.