I'm building a server application that needs to perform a lot of HTTP requests to a couple of other servers on an ongoing basis. Currently, I set up about 30 threads and continuously run HttpWebRequests synchronously on each thread, achieving a throughput of about 30 requests per second.
I am indeed setting the ServicePoint ConnectionLimit in the app.config, so that's not the limiting factor.
I need to scale this up drastically. At the very least I'll need some more CPU horsepower, but I'm wondering if I would gain any advantage by using the async methods of the HttpWebRequest object (e.g. .BeginGetResponse()) as opposed to creating threads myself and using the synchronous methods (e.g. .GetResponse()) on those threads.
If I go with the async methods, I obviously have to significantly redesign my app, so I'm wondering if anyone might have some insight before I go and recode everything, in case I'm out to lunch.
Thanks!
If you are on Windows NT, the System.Net.Sockets.Socket class always uses I/O completion ports for async operations. And HttpWebRequest in async mode uses async sockets, hence it will be using IOCP.
Without doing detailed benchmarking, it is difficult to say whether the bottleneck is inside HttpWebRequest, up the stack in your application, or on the remote side, in the server. But offhand, async will surely give you better performance, because it ends up using IOCP under the covers. And reimplementing the app for async is not that difficult.
So, I would suggest that you first change your app architecture to async. Then see how much max throughput you are getting. Then you can start benchmarking and finding out where the bottleneck is, and removing that.
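For a sense of the shape involved, here is a minimal sketch of the Begin/End pattern on HttpWebRequest (an illustration only, with a placeholder URL parameter and callback body, not anyone's production code):

using System;
using System.Net;

static class AsyncFetcher
{
    public static void Fetch(string url)
    {
        var request = (HttpWebRequest)WebRequest.Create(url);
        // BeginGetResponse returns immediately; the callback runs on an
        // I/O completion thread when the response arrives. No thread
        // blocks while the request is in flight.
        request.BeginGetResponse(ar =>
        {
            var req = (HttpWebRequest)ar.AsyncState;
            using (var response = (HttpWebResponse)req.EndGetResponse(ar))
            {
                // Read and process the response stream here.
            }
        }, request);
    }
}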
The fastest result so far for me is 75 threads running synchronous HttpWebRequest:
about 140 requests per second on a Windows Server 2003 machine, 4-core 3 GHz, 100 Mbit connection.
Async HttpWebRequest/Winsock got stuck at about 30-50 req/sec. I did not test synchronous Winsock, but I guess it would give about the same result as HttpWebRequest.
Tests were run against 1,200,000 blog feeds.
I've been struggling with this for the last month, so it would be interesting to know if anyone has managed to squeeze more out of .NET.
EDIT
New test: got 350 req/sec with the xfserver IOCP component. It took a bunch of threads with one instance each before any greater result. The "client part" of the lib had a couple of really annoying bugs that made implementation harder than the "server part". Not what you're asking for and not recommended, but a step of some kind.
Next: the earlier Winsock test did not use the .NET 3.5 SocketAsyncEventArgs; that will be next.
ANSWER
The answer to your question: no, it will not be worth the effort.
The async HttpWebRequest methods offload the main thread while keeping the download in the background; they do not improve the number or scalability of requests. (At least not in 3.5; it might be different in 4.0.)
However, what might be worth looking at is building your own wrapper around async sockets/SocketAsyncEventArgs, where IOCP works, and perhaps implementing a Begin/End pattern similar to HttpWebRequest (for the easiest possible integration into existing code). The improvement is really enormous.
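To give a feel for what such a wrapper would sit on, here is a minimal sketch of the SocketAsyncEventArgs receive loop that rides on IOCP (my illustration; it assumes an already-connected socket and omits everything beyond basic error checks):

using System;
using System.Net.Sockets;

static class SaeaReceiver
{
    public static void Start(Socket socket)
    {
        var args = new SocketAsyncEventArgs();
        args.SetBuffer(new byte[64 * 1024], 0, 64 * 1024);
        args.Completed += (s, e) => Process(socket, e);
        if (!socket.ReceiveAsync(args))   // false = completed synchronously
            Process(socket, args);
    }

    static void Process(Socket socket, SocketAsyncEventArgs e)
    {
        // Loop to handle synchronous completions without recursion.
        while (true)
        {
            if (e.SocketError != SocketError.Success || e.BytesTransferred == 0)
                return;                   // connection closed or failed
            // Consume e.Buffer[0 .. e.BytesTransferred) here.
            if (socket.ReceiveAsync(e))   // true = pending; Completed fires later
                return;
        }
    }
}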
Related
I know this topic has been asked about before, and I have read almost all the threads and comments, but I still haven't found the answer to my problem.
I'm working on a high-performance network library that must have a TCP server and client, has to be able to accept even 30,000+ connections, and the throughput has to be as high as possible.
I know very well I have to use async methods, and I have already implemented all kinds of solutions that I have found and tested them.
In my benchmarking, only minimal code was used to avoid any overhead in scope; I used profiling to minimize the CPU load; there was no more room for simple optimization; and on the receiving socket the buffered data was always read, counted and discarded to keep the socket buffer from filling up completely.
The case is very simple: one TCP socket listens on localhost, another TCP socket connects to the listening socket (from the same program, on the same machine of course), then one infinite loop starts to send 256 kB packets from the client socket to the server socket.
A timer with a 1000 ms interval prints the byte counters from both sockets to the console to make the bandwidth visible, then resets them for the next measurement.
I've found the sweet spot is a 256 kB packet size with a 64 kB socket buffer size for maximum throughput.
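The harness, in rough sketch form (a reconstruction from the description above, not the actual benchmark code; synchronous Send/Receive calls are used only to keep the measurement skeleton short, while the real tests used the async variants listed next):

using System;
using System.Net;
using System.Net.Sockets;
using System.Threading;

class LoopbackBench
{
    static long received;

    static void Main()
    {
        var listener = new Socket(AddressFamily.InterNetwork, SocketType.Stream, ProtocolType.Tcp);
        listener.Bind(new IPEndPoint(IPAddress.Loopback, 9000));
        listener.Listen(1);

        var client = new Socket(AddressFamily.InterNetwork, SocketType.Stream, ProtocolType.Tcp);
        client.Connect(new IPEndPoint(IPAddress.Loopback, 9000));
        Socket server = listener.Accept();

        // Server side: always read, count and discard, so the socket
        // buffer never fills up.
        new Thread(() =>
        {
            var buf = new byte[64 * 1024];
            int n;
            while ((n = server.Receive(buf)) > 0)
                Interlocked.Add(ref received, n);
        }) { IsBackground = true }.Start();

        // Print the byte counter every 1000 ms, then reset it.
        var timer = new Timer(_ =>
            Console.WriteLine("{0} MB/s", Interlocked.Exchange(ref received, 0) / (1024 * 1024)),
            null, 1000, 1000);

        var packet = new byte[256 * 1024];   // the 256 kB sweet spot
        while (true)
            client.Send(packet);
    }
}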
With the async/await type methods I could reach
~370 MB/s (~3.2 Gbps) on Windows, ~680 MB/s (~5.8 Gbps) on Linux with Mono.
With the BeginReceive/EndReceive/BeginSend/EndSend type methods I could reach
~580 MB/s (~5.0 Gbps) on Windows, ~9 GB/s (~77.3 Gbps) on Linux with Mono.
With the SocketAsyncEventArgs/ReceiveAsync/SendAsync type methods I could reach
~1.4 GB/s (~12 Gbps) on Windows, ~1.1 GB/s (~9.4 Gbps) on Linux with Mono.
Problems are the following:
The async/await methods were the slowest, so I will not work with them.
The BeginReceive/EndReceive methods started a new async thread together with the BeginAccept/EndAccept methods. Under Linux/Mono every new instance of the socket was extremely slow (when there were no more threads in the ThreadPool, Mono started up new threads, but creating 25 connection instances took about 5 minutes, and creating 50 connections was impossible: the program just stopped doing anything after ~30 connections).
Changing the ThreadPool size did not help at all, and I would not want to change it anyway (it was just a debugging move).
The best solution so far is SocketAsyncEventArgs, which makes the highest throughput on Windows, but under Linux/Mono it is slower than on Windows, whereas it used to be the other way around.
I've benchmarked both my Windows and Linux machines with iperf:
the Windows machine produced ~1 GB/s (~8.58 Gbps), the Linux machine produced ~8.5 GB/s (~73.0 Gbps).
The weird thing is that on Windows iperf produced a weaker result than my application, while on Linux it produced a much higher one.
First of all, I would like to know if these results are normal, or whether I can get better results with a different solution.
If I decide to use the BeginReceive/EndReceive methods (they produced relatively the highest result on Linux/Mono), then how can I fix the threading problem, so that connection instances are created quickly and the stalled state after creating multiple instances is eliminated?
I will continue making further benchmarks and will share the results if there is anything new.
================================= UPDATE ==================================
I promised code snippets, but after many hours of experimenting the overall code is kind of a mess, so I will just share my experience in case it helps someone.
I had to realize that under Windows 7 the loopback device is slow: I could not get a result higher than 1 GB/s with iperf or NTttcp. Only Windows 8 and newer versions have fast loopback, so I don't care about Windows results anymore until I can test on a newer version. SIO_LOOPBACK_FAST_PATH should be enabled via Socket.IOControl, but it throws an exception on Windows 7.
It turned out the most powerful solution is the Completed-event-based SocketAsyncEventArgs implementation, both on Windows and on Linux/Mono. Creating a few thousand instances of the clients never messed up the ThreadPool, and the program did not stop suddenly as I mentioned above. This implementation is very easy on the threading.
Creating 10 connections to the listening socket and feeding data from 10 separate ThreadPool threads together with the clients could produce ~2 GB/s of data traffic on Windows, and ~6 GB/s on Linux/Mono.
Increasing the client connection count did not improve the overall throughput, but the total traffic became distributed among the connections; this might be because the CPU load was 100% on all cores/threads even with 5, 10 or 200 clients.
I think the overall performance is not bad: 100 clients could produce around ~500 Mbit/s of traffic each. (Of course this was measured over local connections; a real-life scenario on a network would be different.)
The only other observation I would share: experimenting with both the socket in/out buffer sizes and with the program read/write buffer sizes/loop cycles highly affected the performance, and very differently on Windows and on Linux/Mono.
On Windows the best performance was reached with 128 kB socket-receive, 32 kB socket-send, 16 kB program-read and 64 kB program-write buffers.
On Linux the previous settings produced very weak performance, but 512 kB socket-receive and -send buffers with 256 kB program-read and 128 kB program-write buffer sizes worked best.
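Applying, say, the Linux-tuned values is then just a matter of setting the socket options and sizing your own buffers (a sketch assuming a Socket variable named socket; the Windows set would swap in the numbers from the previous paragraph):

// Socket-level buffers (the values that worked best under Linux/Mono above).
socket.ReceiveBufferSize = 512 * 1024;
socket.SendBufferSize    = 512 * 1024;

// Application-level buffers used by the read/write loops.
var readBuffer  = new byte[256 * 1024];
var writeBuffer = new byte[128 * 1024];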
Now my only problem is that if I try to create 10,000 connecting sockets, after around 7,005 it just stops creating the instances. It does not throw any exceptions, and the program keeps running as if there were no problem at all, but I don't know how it can exit that particular for loop without a break, yet it does.
Any help would be appreciated regarding anything I was talking about!
Because this question gets a lot of views, I decided to post an "answer", though technically this isn't an answer but my final conclusion for now, so I will mark it as the answer.
About the approaches:
The async/await functions tend to produce awaitable async Tasks assigned to the TaskScheduler of the dotnet runtime, so having thousands of simultaneous connections, and therefore thousands of reading/writing operations, will start up thousands of Tasks. As far as I know, this creates thousands of state machines stored in RAM and countless context switches in the threads they are assigned to, resulting in very high CPU overhead. With a few connections/async calls it is better balanced, but as the awaitable Task count grows it slows down exponentially.
The BeginReceive/EndReceive/BeginSend/EndSend socket methods are technically async methods with no awaitable Tasks, but with callbacks at the end of the call, which actually optimizes the multithreading more. Still, the dotnet design of these socket methods is limiting in my opinion, but for simple solutions (or a limited count of connections) it is the way to go.
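For comparison, the Begin/End socket pattern referred to here looks roughly like this generic sketch (not the benchmark code; error handling omitted):

using System;
using System.Net.Sockets;

class ApmReceiver
{
    readonly byte[] buffer = new byte[64 * 1024];

    public void BeginRead(Socket socket)
    {
        // Returns immediately; OnReceive runs as the completion callback.
        socket.BeginReceive(buffer, 0, buffer.Length, SocketFlags.None, OnReceive, socket);
    }

    void OnReceive(IAsyncResult ar)
    {
        var socket = (Socket)ar.AsyncState;
        int n = socket.EndReceive(ar);
        if (n == 0) return;             // remote side closed the connection
        // Consume buffer[0 .. n) here, then queue the next read.
        BeginRead(socket);
    }
}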
The SocketAsyncEventArgs/ReceiveAsync/SendAsync type of socket implementation is the best on Windows for a reason. It utilizes the Windows IOCP in the background to achieve the fastest async socket calls, using Overlapped I/O and a special socket mode. This solution is the "simplest" and fastest under Windows. But under Mono/Linux it will never be that fast, because Mono emulates the Windows IOCP with Linux epoll, which actually is much faster than IOCP itself, but Mono has to emulate IOCP for dotnet compatibility, and this causes some overhead.
About buffer sizes:
There are countless ways to handle data on sockets. Reading is straightforward: data arrives, you know its length, you just copy the bytes from the socket buffer into your application and process them.
Sending data is a bit different.
You can pass your complete data to the socket, and it will cut it into chunks and copy the chunks to the socket buffer until there is no more to send; the sending method of the socket returns when all the data is sent (or when an error happens).
Or you can take your data, cut it into chunks yourself, and call the socket send method with one chunk at a time, sending the next chunk when the call returns, until there is no more.
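The second strategy in sketch form (chunkSize is whatever value the buffer-size discussion below settles on; the usual System and System.Net.Sockets usings are assumed):

// Push 'data' out one chunk at a time on a blocking socket.
static void SendChunked(Socket socket, byte[] data, int chunkSize)
{
    int offset = 0;
    while (offset < data.Length)
    {
        int size = Math.Min(chunkSize, data.Length - offset);
        // Send returns how many bytes were actually accepted.
        offset += socket.Send(data, offset, size, SocketFlags.None);
    }
}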
In either case you should consider what socket buffer size to choose. If you are sending a large amount of data, then the bigger the buffer, the fewer chunks have to be sent, which means fewer calls in your (or the socket's internal) loop, less memory copying, less overhead.
But allocating large socket buffers and program data buffers results in large memory usage, especially if you have thousands of connections, and allocating (and freeing) large blocks of memory multiple times is always expensive.
On the sending side a 1-2-4-8 kB socket buffer size is ideal for most cases, but if you are preparing to send large files (over a few MB) regularly, then a 16-32-64 kB buffer size is the way to go. Over 64 kB there is usually no point in going further.
But this is only an advantage if the receiving side has relatively large receive buffers too.
Usually over internet connections (not the local network) there is no point going over 32 kB; even 16 kB is ideal.
Going under 4-8 kB can result in an exponentially increasing call count in the reading/writing loop, causing a large CPU load and slow data processing in the application.
Go under 4 kB only if you know your messages will usually be smaller than 4 kB, or only very rarely over 4 kB.
My conclusion:
Based on my experiments, the built-in socket classes/methods/solutions in dotnet are OK, but not efficient at all. My simple Linux C test programs using non-blocking sockets could outperform the fastest "high-performance" solution of dotnet sockets (SocketAsyncEventArgs).
This does not mean it is impossible to have fast socket programming in dotnet, but under Windows I had to make my own implementation of Windows IOCP: communicating directly with the Windows kernel via InteropServices/Marshaling, calling Winsock2 methods directly, using a lot of unsafe code to pass the context structs of my connections as pointers between my classes/calls, creating my own ThreadPool, creating I/O event handler threads, and creating my own TaskScheduler to limit the count of simultaneous async calls and avoid pointlessly many context switches.
This was a lot of work, with a lot of research, experimenting and testing. If you want to do it on your own, do it only if you really think it is worth it. Mixing unsafe/unmanaged code with managed code is a pain, but in the end it is worth it, because with this solution I could reach about 36,000 HTTP requests/sec with my own HTTP server on a 1 Gbit LAN, on Windows 7, with an i7 4790.
This is a level of performance that I never could reach with the dotnet built-in sockets.
When running my dotnet server on an i9 7900X on Windows 10, connected to a 4c/8t Intel Atom NAS on Linux via a 10 Gbit LAN, I can use the complete bandwidth (therefore copying data at 1 GB/s) no matter whether I have only 1 or 10,000 simultaneous connections.
My socket library also detects whether the code is running on Linux, and then instead of Windows IOCP it (obviously) uses Linux kernel calls via InteropServices/Marshalling to create and use the sockets, and handles the socket events directly with Linux epoll, which maxed out the performance of the test machines.
Design tip:
As it turned out, it is difficult to design a networking library from scratch, especially one that is universal for all purposes. You have to design it to have many settings, or tailor it to the task you need.
This means finding the proper socket buffer sizes, the I/O processing thread count, the worker thread count, and the allowed async task count; all of these have to be tuned to the machine the application is running on, to the connection count, and to the data type you want to transfer through the network. This is why the built-in sockets do not perform that well: they must be universal, and they do not let you set these parameters.
In my case, assigning more than 2 dedicated threads to I/O event processing actually made the overall performance worse, because with only 2 RSS queues in use it caused more context switching than is ideal.
Choosing wrong buffer sizes will result in performance loss.
Always benchmark different implementations against a simulation of the task you need, to find out which solution or setting is best.
Different settings may produce different performance results on different machines and/or operating systems!
Mono vs Dotnet Core:
Since I've programmed my socket library in an FW/Core-compatible way, I could test it under Linux both with Mono and with Core native compilation. Most interestingly, I could not observe any remarkable performance differences; both were fast, but of course leaving Mono and compiling with Core should be the way to go.
Bonus performance tip:
If your network card is capable of RSS (Receive Side Scaling), then enable it in Windows in the network device settings under the advanced properties, and set the RSS queue count from 1 to as high as you can / as high as is best for your performance.
If it is supported by your network card, it is usually set to 1, which means the kernel assigns the network events to be processed by only one CPU core. If you can increase this queue count, the network events will be distributed between more CPU cores, resulting in much better performance.
In Linux it is also possible to set this up, but in different ways; it is best to search for your Linux distro/LAN driver information.
I hope my experience will help some of You!
I had the same problem. You should take a look at:
NetCoreServer
Every thread in the .NET CLR ThreadPool can handle one task at a time. So to handle more async connects/reads etc., you have to change the ThreadPool size by using:
ThreadPool.SetMinThreads(Int32, Int32)
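For example (the numbers are purely illustrative, not a recommendation; measure before settling on values):

// Raise the pool's minimums so bursts of async work don't wait on the
// pool's slow thread-injection rate; returns false if the values are invalid.
ThreadPool.SetMinThreads(workerThreads: 200, completionPortThreads: 200);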
Using EAP (the event-based asynchronous pattern) is the way to go on Windows. I would use it on Linux too, because of the problems you mentioned, and accept the performance hit.
The best would be I/O completion ports on Windows, but they are not portable.
PS: when it comes to serializing objects, you are highly encouraged to use protobuf-net. It binary-serializes objects up to 10x faster than the .NET binary serializer and saves a little space too!
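A minimal protobuf-net sketch (the Person type and its field numbers are made up for illustration):

using System.IO;
using ProtoBuf;   // protobuf-net NuGet package

[ProtoContract]
class Person
{
    [ProtoMember(1)] public int Id { get; set; }
    [ProtoMember(2)] public string Name { get; set; }
}

class SerializationDemo
{
    static Person RoundTrip(Person p)
    {
        using (var ms = new MemoryStream())
        {
            Serializer.Serialize(ms, p);   // compact binary encoding
            ms.Position = 0;
            return Serializer.Deserialize<Person>(ms);
        }
    }
}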
Here's a problem I'm currently facing:
A WCF service exposes a large number of methods, some of which can take a longer amount of time.
The client is a WinRT (Metro-style) application (so some .NET classes are unavailable).
The timeout on the client has already been increased to 1.5 minutes.
Despite the increased timeout, some operations can take longer still (but not always).
If a timeout happens, the service continues on its merry way. The result of the requested operation is lost. Even worse, if the operation succeeds, the client won't get the data it requires, and the server won't "roll back".
All operations are already implemented using the async pattern on the client. I could use an event-based implementation but, as far as I'm aware, the timeouts will still occur then.
Increasing the timeout value is definitely an option, but it feels like a very dirty solution - it feels like pushing the problem away rather than solving it.
Implementing a WS transaction flow on the server seems impossible: I don't have access to the TransactionScope class when designing WinRT apps.
WS-AtomicTransaction seems like overkill as well (it also requires a lot more setup, and I'm willing to bet the limited capabilities of WinRT applications will prove a big hassle to overcome).
So far my only idea (albeit one with a lot more moving parts, which sort of feels like reinventing the wheel) is to create two service methods: one which begins a long-running operation and returns some kind of "task ID", then runs the operation in the background and saves its result (be it error or success) into a DB / storage under that task ID. The client can then poll for the operation's result using that task ID via the second service method every once in a while, until such a result is available (be it a success or an error).
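Sketched as a contract, that idea might look something like the following (all names here are hypothetical, not an existing API):

using System;
using System.Runtime.Serialization;
using System.ServiceModel;

[ServiceContract]
public interface ILongOperationService
{
    // Starts the operation server-side and returns a task ID immediately.
    [OperationContract]
    Guid StartOperation(string parameters);

    // Returns null until a result (success or error) has been stored.
    [OperationContract]
    OperationResult TryGetResult(Guid taskId);
}

[DataContract]
public class OperationResult
{
    [DataMember] public bool Succeeded { get; set; }
    [DataMember] public string Error { get; set; }
    [DataMember] public string Data { get; set; }
}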
This approach also has its drawbacks:
long operations become even longer, as the client needs to poll for the results
lots of new moving parts, potentially making the whole thing less stable
What else could I possibly try to solve this issue?
PS. The actual service side is also not without limitations: it's an MS DAX service, which likely comes with its own set of potential pitfalls and traps.
EDIT:
It appears my question has some similarity to this SO question... however, given the WinRT nature of the client and the MS DAX nature of the service, I'm not sure anything in that answer is really useful to me.
We are scraping a web-based API using Microsoft Azure. The issue is that there is SO much data to retrieve (there are combinations/permutations involved).
If we use a standard Web Job approach, we calculated it would take about 200 years to process all the data we want to get - and we would like our data to be refreshed every week.
Each request/response from the API takes about 0.5-1.0 seconds to process. The request size is 20,000 bytes on average and the average response is 35,000 bytes. I believe the total number of requests is in the millions.
Another way to think about this question would be: how would you use Azure to web-scrape - and make sure you don't overload (in terms of memory + network) the VM it's running on? (I don't think you need much CPU processing in this case.)
What we have tried so far:
Used Service Bus queues / worker roles scaled to 8 small VMs - but this caused a lot of network errors (there must be some limit to how much network traffic EACH worker role VM can handle).
Used Service Bus queues / a continuous Web Job scaled to 8 small VMs - but this seems to work more slowly, and even scaled, it doesn't give us much control over what's happening behind the scenes. (We don't REALLY know how many VMs are up.)
It seems that these things are built for CPU calculation - not for Web/API scraping.
Just to clarify: I throw my requests into a queue, and they then get picked up by my multiple VMs for processing to get the responses. That's how I was using the queues. Each VM was using the ServiceBusTrigger class as prescribed by Microsoft.
Is it better to have a lot of small VMs or a few massive VMs?
What C# classes should we be looking at?
What are the technical best practices when trying to do something like this on Azure?
Actually, a web scraper is something that I have had up and running in Azure for quite some time now :-)
AFAIK there is no 'magic bullet'. Scraping a lot of sources with deadlines is quite hard.
How it works (the most important things):
I use worker roles, and C# for the code itself.
For scheduling, I use queue storage. I put crawling tasks on the queue with a timeout (e.g. 'when to crawl next') and have the scraper pull them off. You can put triggers on the queue size to ensure you meet deadlines in terms of speed - personally I don't need them.
SQL Azure is slow, so I don't use that. Instead, I only use table storage for storing the scraped items. Note that updating data might be quite complex.
Don't use too much threading; instead, use async IO for all network traffic.
Also you might have to consider that extra threads require extra memory (parse trees can become quite big) - so there's a trade-off there... I do recall using some threads, but it's really just a few.
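To illustrate the async-IO point: on .NET 4 one common shape is to wrap HttpWebRequest's Begin/End pair in a Task, as in this sketch (the response-stream read is left synchronous for brevity; parsing and retries omitted):

using System.IO;
using System.Net;
using System.Threading.Tasks;

static class Downloader
{
    // Wraps the APM pair in a Task; no thread blocks while the
    // response is in flight, so many downloads can overlap cheaply.
    public static Task<string> DownloadAsync(string url)
    {
        var request = (HttpWebRequest)WebRequest.Create(url);
        return Task.Factory
            .FromAsync(request.BeginGetResponse, request.EndGetResponse, null)
            .ContinueWith(t =>
            {
                using (var response = t.Result)
                using (var reader = new StreamReader(response.GetResponseStream()))
                    return reader.ReadToEnd();
            });
    }
}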
Note that this probably requires you to re-design and re-implement your complete web scraper if you're currently using a threaded approach... then again, there are some benefits:
Table storage and queue storage are cheap.
I currently use a single Extra Small VM to scrape well over a thousand web sources.
Inbound network traffic is free.
As such, the result is quite cheap as well; I'm sure it's much less than the alternatives.
As for the classes that I use... well, that's a bit of a long list: I'm using HttpWebRequest for the async HTTP requests, plus the Azure SDK - but all the rest is hand-crafted (and not open source).
P.S.: This doesn't just hold for Azure; most of this also holds for on-premise scrapers.
I have some experience with scraping so I will share my thoughts.
It seems that these things are built for CPU calculation - not for Web/API scraping.
They are built for dynamic scaling which given your task is not something you really need.
How to make sure you don't overload the VM?
Measure the response times and error rates and tune your code to lower them.
I don't think you need too much CPU processing in this case.
Depends on how much data is coming in each second and what you are doing with it. More complex parsing on quickly incoming data (if you decide to do it on the same machine) will eat up CPU pretty quickly.
8 small VMs caused a lot of network errors to occur (there must be some network limit)
The smaller the VMs, the fewer shared resources they get. There are throughput limits, and then there is the issue of your neighbors sharing the actual hardware with you. Often, the smaller your instance size, the more trouble you run into.
Is it better to have a lot small VMs or few massive VMs?
In my experience, smaller VMs are too crippled. However, your mileage may vary and it all depends on the particular task and its solution implementation. Really, you have to measure yourself in your environment.
What C# classes should we be looking at?
What are the technical best practices when trying to do something like this on Azure?
With high-throughput scraping you should be looking at infrastructure. Different Azure datacenters have different latency, and you will have a different experience with network latency/sustained throughput at different VM sizes, depending on who in particular is sharing the hardware with you. The best practice is to try and find what works best for you - change datacenters, VM sizes, and otherwise experiment.
Azure may not be the best solution to this problem (unless you are on a spending spree). 8 small VMs cost $450 a month. That is enough to pay for an unmanaged dedicated server with 256 GB of RAM, 40 hardware threads and 500 Mbps - 1 Gbps (or even up to several Gbps bursts) of quality network bandwidth without latency issues.
For your budget, you would have a dedicated server that you cannot overload. You would have more than enough RAM to deal with async pinning (if you decide to go async), or enough hardware threads for multi-threaded synchronous I/O, which gives the best throughput (if you choose to go synchronous with a fixed-size thread pool).
On a side note, depending on the API specifics, it might turn out that your main issue will be the API owner simply throttling you down to a crawl once you start putting too much pressure on the API endpoints.
As I read the MSDN article Using Asynchronous Methods in ASP.NET MVC 4, I drew the conclusion that I should always use async/await for I/O-bound operations.
Consider the following code, where movieManager exposes the async methods of an ORM like Entity Framework.
using System.Threading.Tasks;
using System.Web.Mvc;

public class MovieController : Controller
{
    // fields and constructors

    public async Task<ActionResult> Index()
    {
        // The request thread returns to the pool while the query runs.
        var movies = await movieManager.ListAsync();
        return View(movies);
    }

    public async Task<ActionResult> Details(int id)
    {
        var movie = await movieManager.FindAsync(id);
        return View(movie);
    }
}
Will this always give me better scalability and/or performance?
How can I measure this?
Why isn't this used in the "real world"?
How about context synchronization?
Is it that bad, that I shouldn't use async I/O in ASP.NET MVC?
I know these are a lot of questions, but literature on this topic draws conflicting conclusions. Some say you should always use async for I/O-bound tasks; others say you shouldn't use async in ASP.NET applications at all.
Will this always give me better scalability and/or performance?
It may. If you only have a single database server as your backend, then your database could be your scalability bottleneck, and in that case scaling your web server won't have any effect on your service as a whole.
How can I measure this?
With load testing. If you want a simple proof-of-concept, you can check out this gist of mine.
Why isn't this used in the "real world" a lot?
It is. Asynchronous request handlers before .NET 4.5 were quite painful to write, and a lot of companies just threw more hardware at the problem instead. Now that .NET 4.5 and async/await are gaining a lot of momentum, asynchronous request handling will become much more common.
How about context synchronization?
It's handled for you by ASP.NET. I have an async intro on my blog that explains how await will capture the current SynchronizationContext when you await a task. In this case it's an AspNetSynchronizationContext that represents the request, so things like HttpContext.Current, culture, etc. all get preserved across await points automatically.
Is it that bad, that I shouldn't use async I/O in ASP.NET MVC?
As a general rule, if you're on .NET 4.5, you should use async to handle any request that requires I/O. If the request is simple (i.e., does not hit a database or call another service), then just keep it synchronous.
Will this always give me better scalability and/or performance?
You answered it yourself: you need to measure and find out. Typically async is something to add later on, because it adds complexity, which is the #1 concern in your code base until you have a specific problem.
How can I measure this?
Build it both ways and see which is faster (preferably for a large number of operations).
Why isn't this used in the "real world" a lot?
Because complexity is the biggest problem in software development. Complex code is more error-prone and harder to debug, and harder-to-fix bugs are not a good trade-off for a potential performance advantage.
How about context synchronization?
I am assuming you mean the ASP.NET context; if so, you should not need any synchronization. Make sure only one thread is hitting your context and communicate through it.
Is it that bad, that I shouldn't use async I/O in ASP.NET MVC?
Introducing async just to then have to deal with synchronization is a loss unless you really need the performance.
Putting asynchronous code in a website has a lot of negative sides:
You'll get into trouble when there are dependencies between the pieces of data, as you cannot make those asynchronous.
Asynchronous work is often done for things like API requests. Have you considered that you shouldn't be doing these in a webpage? If the external service goes down, so goes your site. That doesn't scale.
Doing things asynchronously may speed up your site in some cases, but you're basically introducing trouble. You always end up waiting for the slowest one, and since resources sometimes slow down for whatever reason, the risk of something slowing down your site increases by a factor equal to the number of asynchronous jobs you're using. You'll have to introduce timeouts to deal with these, then error-handling code, and so on.
When scaling to multiple web servers because the CPU load is getting too heavy, the asynchronous work will hurt you. Everything you used to put in asynchronous code now fires simultaneously the moment the user clicks a link, and then eases down. This doesn't only apply to CPU load, but also to database load and even API requests. You will see an awful utilization pattern across all system resources: spikes of heavy usage, then it goes down again. That doesn't scale well. Synchronous code doesn't have this problem: a job only starts after the previous one is done.
Asynchronous work for websites is a trap: don't go there!
Put your heavy code in a worker (or cron job) that does these things before the user asks for them. You'll have the results in a database, and you can keep adding features to your site without having to worry about firing too many asynchronous jobs and whatnot.
Performance for websites is seriously overrated. Sure, it's nice if your page renders in 50ms, but if it takes 250ms people really won't notice (to test this: put a Sleep(200) in your code).
Your code becomes a lot more scalable if you just offload the work to another process and make the website an interface to only your database. Don't make your web server do heavy work that it shouldn't do; it doesn't scale. You can have a hundred machines spending a total of 1 CPU hour per webpage - but at least it scales in a way where the page still loads in 200ms. Good luck achieving that with asynchronous code.
I would like to add a side note here. While my opinion on asynchronous code might seem strong, it's mostly an opinion about programmers. Asynchronous code is awesome and can make a performance difference that proves all of the points I outlined wrong. However, it needs a lot of fine-tuning in your code to avoid the pitfalls I mention in this post, and most programmers just can't handle that.
I want a certain action request to trigger a set of e-mail notifications. The user does something, and it sends the emails. However, I do not want the user to wait for the page response until the system generates and sends the e-mails. Should I use multithreading for this? Will this even work in ASP.NET MVC? I want the user to get a page response back and the system to finish sending the e-mails at its own pace. I'm not even sure if this is possible or what the code would look like. (PS: Please don't offer me an alternative solution for sending e-mails; I don't have time for that kind of reconfiguration.)
SmtpClient.SendAsync is probably a better bet than manual threading, though multithreading will work fine with the usual caveats.
http://msdn.microsoft.com/en-us/library/x5x13z6h.aspx
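In sketch form (the host, addresses, and message content are placeholders), that approach looks like:

using System.Net.Mail;

static class Notifier
{
    // SendAsync returns immediately; SendCompleted fires later on a
    // background thread when the send succeeds, fails, or is cancelled.
    public static void Send()
    {
        var client = new SmtpClient("smtp.example.com");
        var message = new MailMessage("noreply@example.com", "user@example.com",
                                      "Subject", "Body");
        client.SendCompleted += (sender, e) =>
        {
            // e.Error is the only place a failure will ever surface,
            // so log it; the page has already been returned by now.
            message.Dispose();
        };
        client.SendAsync(message, null);   // userToken can carry per-send state
    }
}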
As other people have pointed out, success/failure cannot be indicated deterministically when the page returns before the send is actually complete.
A couple of observations when using asynchronous operations:
1) They will come back to bite you in one way or another. It's a risk-versus-benefit discussion. I like the SendAsync() method I proposed because it means forms can return instantly even if the email server takes a few seconds to respond. However, because it doesn't throw an exception, you can have a broken form and not even know it.
Of course unit testing should address this initially, but what if the production configuration file gets changed to point to a broken mail server? You won't know it, you won't see it in your logs; you only discover it when someone asks you why you never responded to the form they filled out. I speak from experience on this one. There are ways around this, but in practice, async is always more work to test, debug, and maintain.
2) Threading in ASP.NET works in some situations if you understand the ThreadPool, app domain recycles, locking, etc. I find that it is most useful for executing several operations at once to increase performance where the end result is deterministic, i.e. the application waits for all threads to complete. This way, you gain the performance benefits while still having a clear indication of the results.
3) Threading/async operations do not increase performance, only perceived performance. There may be some edge cases where that is not true (such as processor optimizations), but it's a good rule of thumb. Improperly used, threading can hurt performance or introduce instability.
The better scenario is out-of-process execution. For enterprise applications, I often move things out of the ASP.NET thread pool and into an execution service.
See this SO thread: Designing an asynchronous task library for ASP.NET
I know you are not looking for alternatives, but using a message queue (such as MSMQ) could be a good solution for this problem in the future. Using multithreading in ASP.NET is normally discouraged, but in your current situation I don't see why you shouldn't. It is definitely possible, but beware of the pitfalls related to multithreading (stolen from here):
• There is a runtime overhead associated with creating and destroying threads. When your application creates and destroys threads frequently, this overhead affects the overall application performance.
• Having too many threads running at the same time decreases the performance of your entire system. This is because your system is attempting to give each thread a time slot to operate inside.
• You should design your application well when you are going to use multithreading, or otherwise your application will be difficult to maintain and extend.
• You should be careful when you implement a multithreading application, because threading bugs are difficult to debug and resolve.
At the risk of violating your no-alternative-solution prime directive, I suggest that you write the email requests to a SQL Server table and use SQL Server's Database Mail feature. You could also write a Windows service that monitors the table and sends the emails, logging successes and failures in another table that you can view through a separate ASP.NET page.
You can probably use ThreadPool.QueueUserWorkItem.
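For example, as a sketch (SendAllEmails and Log stand in for your own routines):

// Hand the work to a pool thread and return the page immediately.
// Caveat: queued work dies silently if the app domain recycles first.
ThreadPool.QueueUserWorkItem(_ =>
{
    try { SendAllEmails(); }          // your existing synchronous send logic
    catch (Exception ex) { Log(ex); } // no page left to surface errors to
});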
Yes, this is an appropriate time to use multithreading.
One thing to look out for, though, is how you will tell the user when the email sending ultimately fails. Not blocking the user is a good step toward improving your UI, but it still must not give a false sense of success when the send ultimately fails at a later time.
I don't know if any of the above links mentioned it, but don't forget to keep an eye on the request timeout values; the queued items will still need to complete within that time period.