OpenRTB bidder on Azure, is it feasible? - C#

We are trying to build an OpenRTB bidder on Azure. We use Redis instances deployed on Linux VMs to store real-time data (tracking keys, request/bid-win counts, etc.). The endpoint is a Web API-based website (standard plan, large instances, scaling by performance). In the Web API bidder controller we use async methods, and all requests to databases and Redis are also async. We use Json.NET to serialize/deserialize the JSON requests/responses.
Currently we have issues with latency. We need to be able to receive more than 10,000 requests per second with latency under 100 ms.
Could someone share their experience with me? Is this tech stack a good fit for building apps like RTB bidders? I'm also trying to find the best strategy to store the request context (query, request body, headers, etc.) for each request, so I need a way to insert a lot (>10,000/sec) of big messages very fast. I'm thinking about:
storing them in log files, copying those to HDFS, and parsing them with Hadoop MapReduce jobs (HDInsight)
using a queue such as Azure Queue storage, Service Bus, or maybe RabbitMQ, sending request messages to the queue, and having some service (self-made, or something like Logstash) receive them and write them to storage (a minimal buffering sketch follows below).
Could someone also point me in the right direction for optimizing latency and throughput, since we currently have issues with both? Any basic pitfalls?
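For the queue option, the rough shape I have in mind is to keep the bid path non-blocking and flush batches from a background worker; a minimal sketch (all names hypothetical, and the flush delegate would wrap Azure Queue storage, Service Bus, or RabbitMQ):

using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading.Tasks;

// Bounded in-memory buffer: the bid path only does a non-blocking TryAdd;
// a background task drains batches to whatever sink is configured.
class RequestLogBuffer
{
    private readonly BlockingCollection<string> _buffer =
        new BlockingCollection<string>(boundedCapacity: 100000);

    public void Log(string requestContextJson)
    {
        // Never block the bid path; on overflow the item is dropped
        // (a production version would count the drops).
        _buffer.TryAdd(requestContextJson);
    }

    public Task StartDrainingAsync(Func<IReadOnlyList<string>, Task> flushBatch)
    {
        return Task.Run(async () =>
        {
            var batch = new List<string>(1000);
            foreach (var item in _buffer.GetConsumingEnumerable())
            {
                batch.Add(item);
                if (batch.Count >= 1000)
                {
                    await flushBatch(batch);  // e.g. one queue message or blob block per batch
                    batch.Clear();
                }
            }
            // A production version would also flush partial batches on a timer.
        });
    }
}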

Interesting.
I have little experience with Azure, but building RTB bidders with high throughput and low latency can be tricky if you involve things like queues or any kind of IO. Use those only if necessary, and preferably out of the bid path. Cache everything, or use a high-speed in-memory store (Aerospike, Couchbase, etc., even Redis if you're smart about it).
Async is great, but be careful about having too many threads to schedule; that can induce high latency. If you can, use epoll-style non-blocking IO.
I've built a bidder running on AWS, with Netty (async non-blocking IO on the JVM) wrapped in Twitter's Finagle (Twitter's RPC system), Chronicle Map for model caching, and Aerospike for data serving. We could handle 20k req/sec on one m4.2xlarge box (8 cores / 32 GB) with a 99th-percentile tail latency of around 45 ms and 16 ms on average.
You tend to read that you should only use bare metal for bidders, but it's more about architecture than about virtualization or not.
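On the "cache everything" point: a common pattern is a snapshot cache, where the bid path reads from memory only and a background timer refreshes the snapshot from Redis (or any other store). A minimal C# sketch, with illustrative names:

using System;
using System.Collections.Concurrent;
using System.Threading;

// The bid path calls TryGetBudget and never touches the network;
// the refresh swaps in a whole new snapshot with one reference assignment.
class CampaignCache
{
    private volatile ConcurrentDictionary<string, decimal> _budgets =
        new ConcurrentDictionary<string, decimal>();
    private readonly Timer _refresh;

    public CampaignCache(Func<ConcurrentDictionary<string, decimal>> loadFromStore)
    {
        // loadFromStore is your Redis/DB read, executed off the bid path.
        _refresh = new Timer(_ => _budgets = loadFromStore(),
                             null, TimeSpan.Zero, TimeSpan.FromSeconds(1));
    }

    public bool TryGetBudget(string campaignId, out decimal budget)
    {
        return _budgets.TryGetValue(campaignId, out budget);
    }
}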

We had the same problem with our implementation, so we decided to check possible alternatives. We benchmarked different languages and libraries.
Load source: Yandex.Tank (Ubuntu VM, 8 cores)
Target: Ubuntu VM, 8 cores
Responses per second:

golang fasthttp        30k+
nginx                  20k
golang net/http        20k-
haskell wai/warp       15k+
clojure http-kit       15k-
node.js                7k
rust hyper             10k+
rust iron              10k-
fsharp Suave.io        4k+  (best result ever for .NET web servers)
asp.net 5 kestrel (coreclr/mono ???)   400-
So after that we started to use golang + fasthttp + redis. That was enough to handle 40k rps on a single instance with a 99th percentile under 11 ms. That said, ASP.NET 5 currently shows good speed results too; you can check here.

Related

High-performance TCP Socket programming in .NET C#

I know this topic has been asked about before, and I have read almost all the threads and comments, but I still haven't found the answer to my problem.
I'm working on a high-performance network library that must have a TCP server and client, has to be able to accept 30000+ connections, and the throughput has to be as high as possible.
I know very well that I have to use async methods, and I have already implemented and tested all kinds of solutions that I found.
In my benchmarking, only minimal code was used to avoid any overhead, and I used profiling to minimize the CPU load; there is no more room for simple optimization. On the receiving socket, the buffer data was always read, counted, and discarded to avoid the socket buffer filling up completely.
The case is very simple: one TCP socket listens on localhost, another TCP socket connects to the listening socket (from the same program, on the same machine of course), then one infinite loop starts to send 256 kB packets from the client socket to the server socket.
A timer with a 1000 ms interval prints the byte counters from both sockets to the console to make the bandwidth visible, then resets them for the next measurement.
I've realized the sweet spot is a 256 kB packet size and a 64 kB socket buffer size for maximum throughput.
With the async/await type methods I could reach
~370 MB/s (~3.2 Gbps) on Windows, ~680 MB/s (~5.8 Gbps) on Linux with Mono
With the BeginReceive/EndReceive/BeginSend/EndSend type methods I could reach
~580 MB/s (~5.0 Gbps) on Windows, ~9 GB/s (~77.3 Gbps) on Linux with Mono
With the SocketAsyncEventArgs/ReceiveAsync/SendAsync type methods I could reach
~1.4 GB/s (~12 Gbps) on Windows, ~1.1 GB/s (~9.4 Gbps) on Linux with Mono
Problems are the following:
async/await methods were the slowest, so I will not work with them
The BeginReceive/EndReceive methods start a new async thread together with the BeginAccept/EndAccept methods, and under Linux/Mono every new socket instance was extremely slow (when there were no more threads in the ThreadPool, Mono started up new ones, but creating 25 connection instances took about 5 minutes, and creating 50 connections was impossible: the program just stopped doing anything after ~30 connections).
Changing the ThreadPool size did not help at all, and I would not change it anyway (it was just a debugging move).
The best solution so far is SocketAsyncEventArgs, which produces the highest throughput on Windows, but under Linux/Mono it is slower than on Windows, whereas before it was the opposite.
I've benchmarked both my Windows and Linux machines with iperf:
the Windows machine produced ~1 GB/s (~8.58 Gbps), the Linux machine produced ~8.5 GB/s (~73.0 Gbps)
The weird thing is that on Windows iperf produced a weaker result than my application, while on Linux iperf's result was much higher.
First of all, I would like to know if the results are normal, or can I get better results with a different solution?
If I decide to use the BeginReceive/EndReceive methods (they produced the relatively highest result on Linux/Mono), how can I fix the threading problem so that connection instances are created quickly, and eliminate the stalled state after creating multiple instances?
I will continue making further benchmarks and will share the results if there is anything new.
================================= UPDATE ==================================
I promised code snippets, but after many hours of experimenting the overall code is kind of a mess, so I'll just share my experience in case it helps someone.
I had to realize that under Windows 7 the loopback device is slow; I could not get a result higher than 1 GB/s with iperf or NTttcp. Only Windows 8 and newer versions have fast loopback, so I don't care about Windows results anymore until I can test on a newer version. SIO_LOOPBACK_FAST_PATH should be enabled via Socket.IOControl, but it throws an exception on Windows 7.
It turned out the most powerful solution is the completion-event-based SocketAsyncEventArgs implementation, both on Windows and Linux/Mono. Creating a few thousand client instances never messed up the ThreadPool, and the program did not stop suddenly as mentioned above. This implementation is very gentle on threading.
Creating 10 connections to the listening socket and feeding data from 10 separate ThreadPool threads through the clients could produce about 2 GB/s of traffic on Windows, and about 6 GB/s on Linux/Mono.
Increasing the client connection count did not improve overall throughput; the total traffic just became distributed among the connections. This might be because the CPU load was 100% on all cores/threads even with 5, 10, or 200 clients.
I think the overall performance is not bad: 100 clients could produce around 500 Mbit/s of traffic each. (Of course this was measured on local connections; a real-life scenario over a network would be different.)
The only observation I would share: experimenting with both the socket in/out buffer sizes and the program read/write buffer sizes/loop cycles highly affected performance, and very differently on Windows and on Linux/Mono.
On Windows the best performance was reached with 128 kB socket-receive, 32 kB socket-send, 16 kB program-read, and 64 kB program-write buffers.
On Linux the previous settings produced very weak performance; 512 kB socket-receive and -send, 256 kB program-read, and 128 kB program-write buffer sizes worked best.
Now my only problem is that if I try to create 10000 connecting sockets, after around 7005 it just stops creating instances: it does not throw any exceptions, and the program keeps running as if there were no problem. I don't know how it can exit that particular for loop without a break, but it does.
Any help would be appreciated regarding anything I was talking about!
Because this question gets a lot of views, I decided to post an "answer", though technically this isn't an answer but my final conclusion for now, so I will mark it as the answer.
About the approaches:
The async/await functions tend to produce awaitable async Tasks assigned to the TaskScheduler of the .NET runtime, so having thousands of simultaneous connections, and therefore thousands of read/write operations, will start up thousands of Tasks. As far as I know, this creates thousands of state machines stored in RAM and countless context switches in the threads they are assigned to, resulting in very high CPU overhead. With a few connections/async calls it is better balanced, but as the awaitable Task count grows it slows down exponentially.
The BeginReceive/EndReceive/BeginSend/EndSend socket methods are technically async methods with no awaitable Tasks, just callbacks at the end of the call, which actually optimizes the multithreading better. Still, the .NET design of these socket methods is limiting in my opinion, but for simple solutions (or a limited number of connections) it is the way to go.
The SocketAsyncEventArgs/ReceiveAsync/SendAsync type of socket implementation is the best on Windows for a reason: it utilizes Windows IOCP in the background to achieve the fastest async socket calls, using overlapped I/O and a special socket mode. This solution is the "simplest" and fastest under Windows. But under Mono/Linux it will never be that fast, because Mono emulates Windows IOCP on top of Linux epoll; epoll itself is actually much faster than IOCP, but emulating IOCP for .NET compatibility adds some overhead.
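For reference, the core of a SocketAsyncEventArgs receive loop looks roughly like this (a minimal sketch: buffer pooling, an args pool, and protection against deep recursion on synchronous completions are all omitted):

using System;
using System.Net.Sockets;

static class Receiver
{
    public static void Start(Socket socket)
    {
        var args = new SocketAsyncEventArgs();
        args.SetBuffer(new byte[64 * 1024], 0, 64 * 1024);
        args.UserToken = socket;          // carry the socket along with the args
        args.Completed += OnCompleted;
        PostReceive(socket, args);
    }

    static void PostReceive(Socket socket, SocketAsyncEventArgs args)
    {
        // ReceiveAsync returns false when it completed synchronously;
        // Completed will not fire in that case, so handle it inline.
        if (!socket.ReceiveAsync(args))
            OnCompleted(socket, args);
    }

    static void OnCompleted(object sender, SocketAsyncEventArgs args)
    {
        var socket = (Socket)args.UserToken;
        if (args.SocketError != SocketError.Success || args.BytesTransferred == 0)
        {
            socket.Close();               // error, or the peer closed the connection
            args.Dispose();
            return;
        }
        // ... consume args.Buffer[args.Offset .. args.Offset + args.BytesTransferred) ...
        PostReceive(socket, args);        // post the next receive
    }
}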
About buffer sizes:
There are countless ways to handle data on sockets. Reading is straightforward: data arrives, You know its length, You just copy the bytes from the socket buffer to Your application and process them.
Sending data is a bit different.
You can pass Your complete data to the socket, and it will cut it into chunks and copy the chunks to the socket buffer until there is no more to send; the socket's send method returns when all data is sent (or when an error happens).
You can take Your data, cut it into chunks, and call the socket's send method with one chunk at a time; when it returns, send the next chunk until there is no more.
In either case You should consider what socket buffer size to choose. If You are sending a large amount of data, then the bigger the buffer is, the fewer chunks have to be sent, therefore fewer calls in Your (or the socket's internal) loop have to be made: less memory copying, less overhead.
But allocating large socket buffers and program data buffers will result in large memory usage, especially if You have thousands of connections, and allocating (and freeing) large memory blocks repeatedly is always expensive.
On the sending side a 1-2-4-8 kB socket buffer size is ideal for most cases, but if You are preparing to send large files (over a few MB) regularly, then a 16-32-64 kB buffer size is the way to go. Over 64 kB there is usually no point in going.
But this only has an advantage if the receiving side has relatively large receive buffers too.
Usually over internet connections (not the local network) there is no point in going over 32 kB; even 16 kB is ideal.
Going under 4-8 kB can result in an exponentially increasing call count in the reading/writing loop, causing large CPU load and slow data processing in the application.
Go under 4 kB only if You know Your messages will usually be smaller than 4 kB, or only very rarely over it.
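For reference, the socket-level sizes above map directly to Socket properties, while the program buffers are simply the arrays passed to the read/write calls. Using the Windows numbers from the tests above as an illustration:

using System.Net.Sockets;

var socket = new Socket(AddressFamily.InterNetwork, SocketType.Stream, ProtocolType.Tcp);
socket.ReceiveBufferSize = 128 * 1024;     // socket-receive buffer
socket.SendBufferSize = 32 * 1024;         // socket-send buffer
byte[] readBuffer = new byte[16 * 1024];   // program-read buffer
byte[] writeBuffer = new byte[64 * 1024];  // program-write buffer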
My conclusion:
Based on my experiments, the built-in socket classes/methods/solutions in .NET are OK but not efficient at all. My simple Linux C test programs using non-blocking sockets could outperform the fastest and "high-performance" solution of .NET sockets (SocketAsyncEventArgs).
This does not mean it is impossible to have fast socket programming in .NET, but under Windows I had to make my own implementation of Windows IOCP: communicating directly with the Windows kernel via InteropServices/marshaling, calling Winsock2 methods directly, using a lot of unsafe code to pass the context structs of my connections as pointers between my classes/calls, creating my own ThreadPool, creating IO event handler threads, and creating my own TaskScheduler to limit the number of simultaneous async calls and avoid pointlessly many context switches.
This was a lot of work, with a lot of research, experimenting, and testing. If You want to do it on Your own, do it only if You really think it's worth it. Mixing unsafe/unmanaged code with managed code is a pain in the ass, but in the end it was worth it, because with this solution I could reach about 36000 HTTP requests/sec with my own HTTP server on a 1 Gbit LAN, on Windows 7, with an i7 4790.
This is such high performance that I could never reach with the .NET built-in sockets.
When running my .NET server on an i9 7900X on Windows 10, connected to a 4c/8t Intel Atom NAS on Linux via 10 Gbit LAN, I can use the complete bandwidth (therefore copying data at 1 GB/s) no matter whether I have 1 or 10000 simultaneous connections.
My socket library also detects whether the code is running on Linux, and then instead of Windows IOCP (obviously) it uses Linux kernel calls via InteropServices/marshalling to create and use sockets, and handles the socket events directly with Linux epoll, managing to max out the performance of the test machines.
Design tip:
As it turned out, it is difficult to design a networking library from scratch, especially one that is meant to be universal for all purposes. You have to design it to have many settings, or tailor it to the task You need.
This means finding the proper socket buffer sizes, the I/O processing thread count, the worker thread count, and the allowed async task count; all of these have to be tuned to the machine the application is running on, to the connection count, and to the data type You want to transfer through the network. This is why the built-in sockets don't perform that well: they must be universal, and they do not let You set these parameters.
In my case, assigning more than 2 dedicated threads to I/O event processing actually makes the overall performance worse, because with only 2 RSS queues it causes more context switching than is ideal.
Choosing the wrong buffer sizes will result in performance loss.
Always benchmark different implementations against a simulation of the task You need, to find out which solution or setting is best.
Different settings may produce different performance results on different machines and/or operating systems!
Mono vs Dotnet Core:
Since I've programmed my socket library in an FW/Core-compatible way, I could test it under Linux both with Mono and with Core native compilation. Most interestingly, I could not observe any remarkable performance differences; both were fast, but of course leaving Mono and compiling with Core should be the way to go.
Bonus performance tip:
If Your network card is capable of RSS (Receive Side Scaling), then enable it in Windows in the network device settings under advanced properties, and set the RSS queue count from 1 to as high as you can / as high as is best for your performance.
If it is supported by Your network card, it is usually set to 1, which assigns the network events to be processed by only one CPU core in the kernel. If You can increase this queue count to a higher number, the network events will be distributed among more CPU cores, resulting in much better performance.
In Linux it is also possible to set this up, but in different ways; it's best to search for Your Linux distro/NIC driver information.
I hope my experience will help some of You!
I had the same problem. You should take a look at:
NetCoreServer
Every thread in the .NET CLR ThreadPool can handle one task at a time. So to handle more async connects/reads etc., you have to change the ThreadPool size by using:
ThreadPool.SetMinThreads(Int32, Int32)
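For example (the values are illustrative and must be tuned; the first argument is worker threads, the second is IO completion port threads):

using System.Threading;

// Raise the floor so bursts of async completions don't wait on the
// ThreadPool's slow thread-injection heuristic.
ThreadPool.SetMinThreads(workerThreads: 512, completionPortThreads: 512);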
Using EAP (the event-based asynchronous pattern) is the way to go on Windows. I would use it on Linux too, because of the problems you mentioned, and take the performance hit.
The best would be IO completion ports on Windows, but they are not portable.
PS: when it comes to serializing objects, you are highly encouraged to use protobuf-net. It binary-serializes objects up to 10x faster than the .NET binary serializer and saves a little space too!
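A minimal protobuf-net sketch (the type and its members are hypothetical; the attributes define the wire contract):

using System.IO;
using ProtoBuf;

[ProtoContract]
public class PlayerState
{
    [ProtoMember(1)] public int Id { get; set; }
    [ProtoMember(2)] public float X { get; set; }
    [ProtoMember(3)] public float Y { get; set; }
}

// Serialize to a byte[] ready to be written to a socket.
static byte[] Pack(PlayerState state)
{
    using (var ms = new MemoryStream())
    {
        Serializer.Serialize(ms, state);
        return ms.ToArray();
    }
}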

C# Socket Read/Write performance

I'm developing a gaming TCP server and planning for it to withstand up to about 500 connections. Data transfer occurs constantly (movement of the character, attacks, etc.).
I started developing with TcpListener, but I ran into a performance problem.
I used the await ...Async methods to accept clients, receive data, and send messages.
I created a program that simulates player actions. It could create only 30-40 connections before its own CPU hit 100%. To test my server, I used two machines: one as the player simulator, the other as the server.
These 30-40 clients load the server's CPU by 10-20%, and it spikes to 100% when all TCP clients disconnect simultaneously.
I looked for information about optimizing TcpListener but found nothing. I've seen topics where people write that they run a server with 1,000+ connections and it works well.
I thought the reason could be in my server's logic, but the profiling tools showed that the main part of the CPU load is receiving and sending data (await client.ReadAsync / client.WriteAsync).
Picture 1 | Picture 2
I started looking for possible alternatives and found SocketAsyncEventArgs. On forums, people wrote that this approach can double performance. I found the source code of some servers written in C# and took the SocketAsyncEventArgs implementation from there.
This really greatly increased performance, by 10-15 times. With 30 simulated players, the CPU load is now only 0.5-2%.
I rewrote the player emulator with SocketAsyncEventArgs as well, and this allowed me to create 100 simulated players at 20% CPU.
Beyond 70 players, the CPU load on the server grew to 7-13%.
Profiling tools show that the main use of the CPU is sending messages.
For some reason, the most productive option for me was BeginWrite. Write was not suitable because it blocks the thread. WriteAsync was slower, or I just misused it. Picture 3 | Picture 4
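For context, the BeginWrite variant looks roughly like this (client is a connected TcpClient and BuildMessage stands in for my own message framing; error handling trimmed):

using System.IO;
using System.Net.Sockets;

NetworkStream stream = client.GetStream();
byte[] payload = BuildMessage();
stream.BeginWrite(payload, 0, payload.Length, ar =>
{
    try { stream.EndWrite(ar); }             // EndWrite must always be called
    catch (IOException) { client.Close(); }  // peer went away mid-send
}, null);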
I cannot find information about optimizing sockets in C#. Everyone everywhere writes that everything works well for them even with TcpListener, which for me is simply unattainable.
Is there any other way to improve performance?
P.S. I tested the server on a computer with a 7th-generation i7.

Azure - C# Concurrency - Best Practices

We are scraping a Web-based API using Microsoft Azure. The issue is that there is SO much data to retrieve (there are combinations/permutations involved).
If we use a standard WebJob approach, we calculated it would take about 200 years to process all the data we want to get, and we would like our data to be refreshed every week.
Each request/response from the API takes about 0.5-1.0 seconds to process. The average request size is 20,000 bytes and the average response is 35,000 bytes. I believe the total number of requests is in the millions.
Another way to think about this question would be: how would you use Azure to web-scrape, while making sure you don't overload (in terms of memory and network) the VM it's running on? (I don't think you need much CPU processing in this case.)
What we have tried so far:
Used Service Bus queues/worker roles scaled to 8 small VMs, but this caused a lot of network errors to occur (there must be some network limit on how much EACH worker role VM can handle).
Used Service Bus queues/continuous WebJobs scaled to 8 small VMs, but this seems to work slower, and even scaled, it doesn't give us much control over what's happening behind the scenes. (We don't REALLY know how many VMs are up.)
It seems that these things are built for CPU calculation - not for Web/API scraping.
Just to clarify: I throw my requests into a queue, which then get picked up by my multiple VMs for processing to get the responses. That's how I was using the queues. Each VM was using the ServiceBusTrigger class as prescribed by Microsoft.
Is it better to have a lot of small VMs or a few massive VMs?
What C# classes should we be looking at?
What are the technical best practices when trying to do something like this on Azure?
Actually, a web scraper is something that I have had up and running in Azure for quite some time now :-)
AFAIK there is no 'magic bullet'. Scraping a lot of sources with deadlines is quite hard.
How it works (the most important things):
I use worker roles and C# code for the code itself.
For scheduling, I use queue storage. I put crawling tasks on the queue with a visibility timeout (i.e., 'when to crawl next') and have the scraper pull them off. You can put triggers on the queue size to ensure you meet deadlines in terms of speed; personally, I don't need them.
SQL Azure is slow, so I don't use that. Instead, I only use table storage for storing the scraped items. Note that updating data might be quite complex.
Don't use too much threading; instead, use async IO for all network traffic.
Also, you might have to consider that extra threads require extra memory (parse trees can become quite big), so there's a trade-off there... I do recall using some threads, but really just a few.
Note that this will probably require you to re-design and re-implement your complete web scraper if you're currently using a threaded approach. Then again, there are some benefits:
Table storage and queue storage are cheap.
I currently use a single Extra Small VM to scrape well over a thousand web sources.
Inbound network traffic is free.
As such, the result is quite cheap as well; I'm sure it's much less than the alternatives.
As for the classes I use... well, that's a bit of a long list. I'm using HttpWebRequest for the async HTTP requests, plus the Azure SDK; all the rest is hand-crafted (and not open source).
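For the async IO point, the HttpWebRequest usage has roughly this shape (a minimal sketch; retries, timeouts, and error handling omitted):

using System.IO;
using System.Net;
using System.Threading.Tasks;

static async Task<string> FetchAsync(string url)
{
    var request = (HttpWebRequest)WebRequest.Create(url);
    using (var response = (HttpWebResponse)await request.GetResponseAsync())
    using (var reader = new StreamReader(response.GetResponseStream()))
        return await reader.ReadToEndAsync();
}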
P.S.: This doesn't just hold for Azure; most of this also holds for on-premise scrapers.
I have some experience with scraping so I will share my thoughts.
It seems that these things are built for CPU calculation - not for Web/API scraping.
They are built for dynamic scaling, which given your task is not something you really need.
How to make sure you don't overload the VM?
Measure the response times and error rates and tune your code to lower them.
I don't think you need too much CPU processing in this case.
That depends on how much data is coming in each second and what you are doing with it. More complex parsing of quickly incoming data (if you decide to do it on the same machine) will eat up CPU pretty quickly.
8 small VMs caused a lot of network errors to occur (there must be some network limit)
The smaller the VMs, the fewer shared resources they get. There are throughput limits, and then there is the issue of your neighbors sharing the actual hardware with you. Often, the smaller your instance size, the more trouble you run into.
Is it better to have a lot small VMs or few massive VMs?
In my experience, smaller VMs are too crippled. However, your mileage may vary; it all depends on the particular task and how it's implemented. Really, you have to measure for yourself in your environment.
What C# classes should we be looking at?
What are the technical best practices when trying to do something like this on Azure?
With high-throughput scraping you should be looking at infrastructure. You will see different latencies in different Azure datacenters, and different network latency/sustained throughput at different VM sizes, depending on who in particular is sharing the hardware with you. The best practice is to try and find what works best for you: change datacenters, change VM sizes, and otherwise experiment.
Azure may not be the best solution to this problem (unless you are on a spending spree). 8 small VMs cost $450 a month. That is enough to pay for an unmanaged dedicated server with 256 GB of RAM, 40 hardware threads, and 500 Mbps - 1 Gbps (or even up to several Gbps in bursts) of quality network bandwidth without latency issues.
For your budget, you will have a dedicated server that you cannot overload. You will have more than enough RAM to deal with async pinning (if you decide to go async), or enough hardware threads for multi-threaded synchronous IO, which gives the best throughput (if you choose to go synchronous with a fixed-size threadpool).
On a side note, depending on the API specifics, it might turn out that your main issue will be the API owner simply throttling you down to a crawl once you start putting too much pressure on the API endpoints.

Best way to throttle an external application's CPU Usage

Ok - here is the scenario:
I host a server application on Amazon AWS hosted Windows instances. (I do not have access to the source code, so I cannot resolve the issues from within the application's source code.)
These specific instances are able to build up CPU credits during times of idle CPU (less than 10-20% usage) and then spend those credits during times of increased compute requirements.
My server application, however, typically runs at around 15-20% CPU usage when no clients are connected. During this time I would rather throttle the CPU usage down to around 5%, maintaining just enough CPU throughput to accept a TCP socket from incoming clients.
When a connected client is detected, I would like to remove the throttle and allow full access to the reserve of AWS CPU Credits.
I have code in place that can suspend and resume processes from C# using Windows API calls.
I am, however, a bit fuzzy on how to accurately attain a target CPU usage for that process.
What I am doing so far, which is having moderate success:
Looping inside another application
check the CPU usage of the server application using performance counters (I don't like these: they require a 100-1000 ms wait in order to return a % value)
I determine whether the current value is above or below the target; if above, I increase an int value called 'sleep' by 10 ms.
If below, 'sleep' is decreased by 10 ms.
Then the application will call:
Process.Suspend();   // my helper, implemented via Windows API calls
Thread.Sleep(sleep);
Process.Resume();
Like I said - this is having moderate success.
But there are several reasons I don't like it:
1. It requires a semi-rapid loop in an external application: this might end up just shifting CPU usage to that application.
2. I'm sure there are better mathematical solutions to work out the ideal sleep time.
I came across this application : http://mion.faireal.net/BES/
It seems to do everything I want, except that I need to be able to control it, and I am not a C++ developer.
It also seems to be able to achieve accurate CPU throttling without consuming much CPU itself.
Can someone suggest CPU throttling techniques?
Remember: I cannot modify the source code of the application being throttled. At most, I could inject code into it, but it occurs to me that if I inject suspend code into it, then the resume code could never fire.
An external agent program might be the best way to go.
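If you go the external-agent route, the core of a duty-cycle throttler can be sketched as below. NtSuspendProcess/NtResumeProcess are undocumented (but long-stable) ntdll exports, which is the same approach BES appears to take; the process name and cycle values here are hypothetical:

using System;
using System.Diagnostics;
using System.Runtime.InteropServices;
using System.Threading;

class Throttler
{
    [DllImport("ntdll.dll")] static extern int NtSuspendProcess(IntPtr processHandle);
    [DllImport("ntdll.dll")] static extern int NtResumeProcess(IntPtr processHandle);

    static void Main()
    {
        var target = Process.GetProcessesByName("MyServerApp")[0]; // hypothetical name
        const int periodMs = 100;   // length of one duty cycle
        double dutyCycle = 0.25;    // run 25% of the time, roughly quartering CPU use

        while (!target.HasExited)
        {
            int runMs = (int)(periodMs * dutyCycle);
            NtResumeProcess(target.Handle);
            Thread.Sleep(runMs);                // let the target run
            NtSuspendProcess(target.Handle);
            Thread.Sleep(periodMs - runMs);     // keep it frozen for the rest of the cycle
            // dutyCycle could be adjusted here from a performance counter reading,
            // or set to 1.0 as soon as a client connection is detected.
        }
        NtResumeProcess(target.Handle);         // never leave the target suspended
    }
}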

What is the best way to scale out work to multiple machines?

We're developing a .NET app that must make up to tens of thousands of small web service calls to a 3rd-party web service. We would prefer a more 'chunky' call, but the 3rd party does not support it. We've designed the client to use a configurable number of worker threads, and through testing we have code that is fairly well optimized for one multicore machine. However, we still want to improve the speed and are looking at spreading the work across multiple machines. We're well versed in typical client/server/database apps, but new to designing for multiple machines. So, a few questions related to that:
Is there any other client-side optimization, besides multithreading, that we should look at that could improve the speed of an HTTP request/response? (I should note this is a non-standard web service, so it is implemented using WebClient, not a WCF or SOAP client.)
Our current thinking is to use WCF to publish chunks of work to MSMQ and run clients on one or more machines to pull work off the queue. We have experience with WCF + MSMQ, but we want to be sure we're not missing better options. Are there other, better ways to do this today?
I've seen some 3rd-party tools like DigiPede and Microsoft's HPC offerings, but these seem like overkill. Any experience with those products, or reasons we should consider them over rolling our own?
Sounds like your goal is to execute all these web service calls as quickly as you can and tabulate the results. Given that, your greatest efficiency lever is going to be scaling the number of concurrent requests you can make.
Be sure to look at your client-side connection limits. By default, the limit is 2 connections per host. I haven't tried this myself, but by upping the number of connections with this property, you should theoretically see a multiplier effect in terms of generating more requests by generating more connections from a single machine. There's more info on the MS forums.
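The property in question is most likely ServicePointManager.DefaultConnectionLimit; raising it lets one machine open more parallel HTTP connections to the same host:

using System.Net;

// HTTP/1.1 clients default to 2 concurrent connections per host,
// which badly caps request concurrency from a single machine.
ServicePointManager.DefaultConnectionLimit = 48;  // illustrative value; tune and measure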
The MSMQ option works well. I'm running that configuration myself. ActiveMQ is also a fine solution, but MSMQ is already on the server.
You have a good starting point. Get that in operation, then move on to performance and throughput.
At CodeMash this year, Wesley Faler did an interesting presentation on this sort of problem. His solution was to store "jobs" in a DB, then use clients to pull down work and mark status when complete.
He then pushed the whole infrastructure up to Amazon's EC2.
Here are his slides from the presentation; they should give you the basic idea:
I've done something similar with multiple PCs locally; the basics of managing the workload were similar to Faler's approach.
If you have already optimized the code, you could look into optimizing the network side to minimize the number of packets sent (a small example follows this list):
reuse HTTP sessions (i.e., send multiple transactions over one session by keeping the connection open, which reduces TCP overhead)
reduce the number of HTTP headers in the request to the minimum to save bandwidth
if the server supports it, use gzip to compress the body of the request (you need to balance the CPU cost of compression against the bandwidth saved)
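As an illustration of the first and third points with HttpWebRequest (note the third point is about compressing the request body, which additionally requires writing a GZipStream to the request stream with a Content-Encoding header the server accepts; shown here is the more common response-side compression):

using System.Net;

var req = (HttpWebRequest)WebRequest.Create(url);          // url: the service endpoint
req.KeepAlive = true;                                      // reuse the TCP connection across calls
req.AutomaticDecompression = DecompressionMethods.GZip;    // sends Accept-Encoding: gzip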
You might want to consider Rhino Service Bus instead of MSMQ. The source is available here.
