I have a C# server developed with both Visual Studio 2010 and MonoDevelop 2.8, targeting .NET Framework 4.0.
It looks like this server behaves much better (in terms of scalability) on Windows than on Linux.
I tested the server's scalability on native Windows (12 physical cores), and on 8- and 12-core Windows and Ubuntu virtual machines, using Apache's ab tool.
The Windows response time is pretty much flat; it only starts climbing when the concurrency level approaches or exceeds the number of cores.
For some reason the Linux response times are much worse. They grow roughly linearly starting from a concurrency level of 5, and the 8- and 12-core Linux VMs behave similarly.
So my question is: why does it perform worse on Linux? (And how can I fix that?)
Please take a look at the attached graph: it shows the average time to fulfill 75% of the requests as a function of request concurrency (the range bars are set at 50% and 100%).
I have a feeling this might be due to Mono's garbage collector. I tried playing around with the GC settings but had no success. Any suggestions?
Some additional background information: the server is based on an HTTP listener that quickly parses the requests and queues them on a thread pool, and the thread pool takes care of replying to those requests with some intensive math (computing an answer takes roughly 10 seconds).
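For reference, a minimal sketch of the pattern described above, assuming an HttpListener front end handing work to the ThreadPool; the names and the endpoint are illustrative, not the actual server code:

using System;
using System.Net;
using System.Text;
using System.Threading;

class MathServer
{
    static void Main()
    {
        var listener = new HttpListener();
        listener.Prefixes.Add("http://+:8080/");   // illustrative endpoint
        listener.Start();

        while (true)
        {
            // Accept and parse the request quickly on the listener thread...
            HttpListenerContext context = listener.GetContext();

            // ...then queue the heavy work (~10 s of math) on the thread pool.
            ThreadPool.QueueUserWorkItem(_ =>
            {
                string answer = ComputeAnswer(context.Request);   // CPU-bound work
                byte[] body = Encoding.UTF8.GetBytes(answer);
                context.Response.OutputStream.Write(body, 0, body.Length);
                context.Response.Close();
            });
        }
    }

    static string ComputeAnswer(HttpListenerRequest request)
    {
        // Placeholder for the intensive math mentioned above.
        return "42";
    }
}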
You need to isolate where the problem is first. Start by monitoring your memory usage with HeapShot. If it's not memory, then profile your code to pinpoint the time-consuming methods.
This page, Performance Tips: Writing better performing .NET and Mono applications, contains some useful information, including how to use the Mono profiler.
Excessive string manipulation and boxing are often 'hidden' culprits in code that doesn't scale well.
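To make that concrete, a hedged illustration (not taken from the code in question) of the two patterns:

using System.Collections;
using System.Collections.Generic;
using System.Text;

static class ScalingPitfalls
{
    // Boxing: every int added to a non-generic ArrayList is boxed on the heap.
    static ArrayList Boxed(int count)
    {
        var list = new ArrayList();
        for (int i = 0; i < count; i++) list.Add(i);      // boxes each value
        return list;
    }

    // No boxing: the generic list stores the ints directly.
    static List<int> Unboxed(int count)
    {
        var list = new List<int>(count);
        for (int i = 0; i < count; i++) list.Add(i);
        return list;
    }

    // Excessive string manipulation: += in a loop allocates a new string each pass.
    static string Concat(IEnumerable<string> parts)
    {
        string result = "";
        foreach (var p in parts) result += p;             // quadratic allocation cost
        return result;
    }

    // A StringBuilder keeps a single growing buffer instead.
    static string Build(IEnumerable<string> parts)
    {
        var sb = new StringBuilder();
        foreach (var p in parts) sb.Append(p);
        return sb.ToString();
    }
}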
Try the sgen garbage collector (and for that, Mono 2.11.x is recommended). Look at the mono man page for more details.
I don't believe it's because of the GC. AFAIK the GC side effects should be more or less evenly distributed across the threads.
My blind guess: you may be able to fix it by playing with the ThreadPool.SetMinThreads/SetMaxThreads API.
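Something along these lines, called once at startup; the multiplier is just a placeholder to experiment with, not a recommended value:

using System;
using System.Threading;

static class PoolTuning
{
    public static void Apply()
    {
        int workers, iocp;
        ThreadPool.GetMinThreads(out workers, out iocp);

        // Above the minimum, the pool injects new threads only slowly
        // (roughly one every 500 ms), which can hurt bursty workloads.
        ThreadPool.SetMinThreads(Environment.ProcessorCount * 2, iocp);
    }
}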
I would strongly recommend doing some profiling of the code to see how long individual methods are running. It's quite likely that you're seeing some locking or similar multithreading difficulties that are not being handled perfectly by Mono. CPU and RAM usage figures would also help.
I believe this may be the same problem we tracked down involving the thread pool and its startup behavior for new threads, as well as a bug in Mono's implementation of SetMinThreads. Please see my answer on that thread for more information: https://stackoverflow.com/a/12371795/1663096
If your code throws a lot of exceptions, then Mono is 10x faster than .NET.
Related
I have a C# MVC application with a WCF service running on Azure. At first it was of course hosted on the Free tier, but once I had it running smoothly I wanted to see how it performed on either Basic or Standard, which as far as I know should be dedicated servers.
To my surprise the code ran significantly slower once it was changed from Free to either Standard or Basic. I chose the smallest instance, but still expected them to perform better than the Free option?
From my performance logging I can see that the code that runs especially slowly is something started asynchronously from Task.Run. Initially it was old-school Thread.Start(), but I wondered whether that might spawn the work on some lower-priority thread, so I changed it to Task.Run. That changed nothing, so perhaps it has nothing to do with it, but it might, so now you know.
The code that runs really slowly basically works on an XML document through XDocument, XElement, etc. It loops through it and uses some LINQ, but nothing too fancy. Still, it is 5-10 times slower on Basic and Standard than on the Free version: for the exact same request the Free version takes around 1000 ms, whereas Basic and Standard take 8000-10000 ms.
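For context, the work is roughly of this shape (a simplified, hypothetical stand-in, not the actual code; the element names are made up):

using System.Linq;
using System.Xml.Linq;

static class XmlWork
{
    // CPU-bound: parse the document and aggregate values with LINQ to XML.
    public static decimal SumAmounts(string xml)
    {
        XDocument doc = XDocument.Parse(xml);
        return doc.Descendants("item")
                  .Select(e => (decimal?)e.Element("amount") ?? 0m)
                  .Sum();
    }
}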
I have run each test 5-10 times without any decrease in response times. I wondered whether I needed to wait a few hours before the Basic/Standard instance was fully functional, but each time I switch back, the Free version just outperforms it from the get-go.
Any suggestions? Is the Free version for some strange reason more powerful than Basic or Standard or do I need to configure something differently once I get up and running on Basic or Standard?
The notable difference between the Free and Basic/Standard tiers is that Free uses an undisclosed number of shared cores, whereas Basic/Standard has a defined number of CPU cores (1-4 based on how much you pay). Related to this is the fact that Free is a shared instance while Basic/Standard is a private instance.
My best guess based on this is that since the Free servers you would be on host multiple different users and applications, they probably have pretty beefy specs. Their CPUs are probably 8-core Xeons and there might even be multiple CPUs. Most likely, Azure isn't enforcing any hard caps but is instead relying on quotas (60 CPU minutes per day for the Free tier) and overall demand on the server to restrict CPU use. In other words, if your site happened to be the only one doing anything at the moment (unlikely, of course, but for the sake of example), you could potentially be utilizing all 8+ cores on the box, whereas when you move over to Basic/Standard you are hard-limited to 1-4. Processing XML is very CPU-heavy, so this lines up with my assumptions.
More than likely, this is a fluke. Perhaps your site currently resides on a relatively newly provisioned server that hasn't been filled up with tenants yet. Maybe you just happen to be sharing with tenants that aren't doing much. Who knows? But if the server ever comes under real load, I'd imagine you'd see a much worse response time on the Free tier than on even Basic/Standard.
I'm using Spring.NET 1.2 and the Spark view engine for my web application running on the .NET 3.5 runtime. Recently I have been investigating the performance of my application running under load on a multicore processor. I noticed that under load an AOP-proxied method takes much, much longer to complete, with high context switching but low CPU utilization. I profiled my application using the VS2010 resource contention profiler and it showed lock contention happening in every part of the application. I was wondering where this could be going wrong; is it because of the Spring framework we use?
We have identified the root of the problem. Our application uses slot-based thread-local storage, which, based on our proof-of-concept testing, performs badly under concurrent load. A good reference from the Spring.NET side: http://piers7.blogspot.com/2005/11/threadstatic-callcontext-and_02.html. The VS2010 resource contention profiler helped us identify the problem. Coming from a Java background, I didn't believe the problem could be thread-local storage until we did a POC.
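For anyone hitting the same thing, the difference is roughly this (illustrative snippet, not our production code): slot-based TLS goes through Thread.GetData/Thread.SetData on every access, whereas a [ThreadStatic] field is a much cheaper per-thread access.

using System;
using System.Threading;

static class TlsComparison
{
    // Slot-based TLS: every access is a lookup through the data slot.
    static readonly LocalDataStoreSlot Slot = Thread.AllocateNamedDataSlot("counter");

    public static void BumpSlot()
    {
        object current = Thread.GetData(Slot);
        Thread.SetData(Slot, current == null ? 1 : (int)current + 1);
    }

    // [ThreadStatic] field: resolved as a per-thread static, no slot lookup.
    [ThreadStatic]
    static int counter;

    public static void BumpField()
    {
        counter++;
    }
}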
Are there any tips, tricks, and techniques to prevent or minimize slowdowns or temporary freezes of an app caused by the .NET GC?
Maybe something along the lines of:
Try to use structs if you can, unless the data is too large or will be mostly used inside other classes, etc.
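As a hedged example of that point, with illustrative types only: an array of structs is a single allocation the GC never has to trace element by element, while the class version allocates every element separately on the managed heap.

struct PointValue
{
    public double X;
    public double Y;
}

class PointObject
{
    public double X;
    public double Y;
}

static class AllocationDemo
{
    public static void Allocate()
    {
        var values = new PointValue[1000000];    // one allocation, values stored inline
        var objects = new PointObject[1000000];  // one array plus a million small objects
        for (int i = 0; i < objects.Length; i++) objects[i] = new PointObject();
    }
}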
The description of your app does not fit the usual meaning of "realtime". Realtime is commonly used for software with a maximum latency in the millisecond range or less.
You have a requirement of responsiveness to the user, meaning you could probably tolerate an incidental delay of 500 ms or more. 100 ms won't be noticed.
Luckily for you, the GC won't cause delays that long. And if it did you could use the Server (background) version of the GC, but I know little about the details.
But if your "user experience" does suffer, it probably won't be the GC.
IMHO, if the performance of your application is being affected noticeably by the GC, something is wrong. The GC is designed to work without intervention and without significantly affecting your application. In other words, you shouldn't have to code with the details of the GC in mind.
I would examine the structure of your application and see where the bottlenecks are, maybe using a profiler. Maybe there are places where you could reduce the number of objects that are being created and destroyed.
If parts of your application really need to be real-time, perhaps they should be written in another language that is designed for that sort of thing.
Another trick is to use GC.RegisterForFullGCNotification on the back end.
Let's say you have a load-balancing server and N app servers. When the load balancer receives notice that a full GC may be imminent on one of the servers, it forwards requests to the other servers for a while, so the SLA is not affected by the GC (this is especially useful on x64 boxes, where more than 4 GB can be addressed).
Updated
No, unfortunately I don't have code of my own, but there is a very simple example on MSDN with dummy methods like RedirectRequests and AcceptRequests, which can be found here: Garbage Collection Notifications
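The MSDN sample boils down to a loop like this; it is only a sketch, and RedirectRequests/AcceptRequests are the same dummy placeholders used in that article:

using System;

static class GcNotificationWorker
{
    public static void Run()
    {
        // Ask the runtime to raise notifications when a full (gen 2) GC is near.
        GC.RegisterForFullGCNotification(10, 10);

        while (true)
        {
            if (GC.WaitForFullGCApproach() == GCNotificationStatus.Succeeded)
            {
                RedirectRequests();   // tell the load balancer to drain this node
                GC.Collect();         // optionally trigger the collection right away
            }

            if (GC.WaitForFullGCComplete() == GCNotificationStatus.Succeeded)
            {
                AcceptRequests();     // put the node back into rotation
            }
        }
    }

    // Dummy placeholders, as in the MSDN sample.
    static void RedirectRequests() { }
    static void AcceptRequests() { }
}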
I have a large multi-threaded C# application running on a multi-core 4-way server. Currently we're using "server mode" garbage collection. However testing has shown that workstation mode GC is quicker.
MSDN says:
Managed code applications that use the server API receive significant benefits from using the server-optimized garbage collector (GC) instead of the default workstation GC.
Workstation is the default GC mode and the only one available on single-processor computers. Workstation GC is hosted in console and Windows Forms applications. It performs full (generation 2) collections concurrently with the running program, thereby minimizing latency. This mode is useful for client applications, where perceived performance is usually more important than raw throughput.
The server GC is available only on multiprocessor computers. It creates a separate managed heap and thread for each processor and performs collections in parallel. During collection, all managed threads are paused (threads running native code are paused only when the native call returns). In this way, the server GC mode maximizes throughput (the number of requests per second) and improves performance as the number of processors increases. Performance especially shines on computers with four or more processors.
But we're not seeing performance shine!!!! Has anyone got any advice?
It's not explained very well, but as far as I can tell, the server mode is synchronous per core, while the workstation mode is asynchronous.
In other words, the workstation mode is intended for a small number of long running applications that need consistent performance. The garbage collection tries to "stay out of the way" but, as a result, is less efficient on average.
The server mode is intended for applications where each "job" is relatively short-lived and handled by a single core (edit: think multi-threaded web server). The idea is that each "job" gets all the CPU power and gets done quickly, but that occasionally the core stops handling requests and cleans up memory. So in this case the hope is that the GC is more efficient on average, but the core is unavailable while it's running, so the application needs to be able to adapt to that.
In your case it sounds like, because you have a single application whose threads are relatively coupled, you're fitting better into the model expected by the first mode rather than the second.
But that's all just after-the-fact justification. Measure your system's performance (as ammoQ said, not your GC performance, but how well your application behaves) and use whichever mode measures best.
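One simple way to do that measurement: time the workload end to end and, as a secondary signal, count how many collections of each generation occurred. A sketch, assuming you have some delegate that drives your real workload:

using System;
using System.Diagnostics;

static class GcModeBenchmark
{
    public static void Measure(Action runWorkload)
    {
        int gen0 = GC.CollectionCount(0);
        int gen1 = GC.CollectionCount(1);
        int gen2 = GC.CollectionCount(2);

        var sw = Stopwatch.StartNew();
        runWorkload();                 // the application-level work you actually care about
        sw.Stop();

        Console.WriteLine("Elapsed: {0} ms, GC counts (gen0/1/2): {1}/{2}/{3}",
            sw.ElapsedMilliseconds,
            GC.CollectionCount(0) - gen0,
            GC.CollectionCount(1) - gen1,
            GC.CollectionCount(2) - gen2);
    }
}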
.NET 4.5 introduces concurrent server garbage collection.
http://msdn.microsoft.com/en-us/library/ee787088.aspx
specify <gcServer enabled="true"/>
specify <gcConcurrent enabled="true"/> (this is the default so can be omitted)
And there is the new SustainedLowLatency mode:
In the .NET Framework 4.5, SustainedLowLatency mode is available for both workstation and server GC. To turn it on, set the GCSettings.LatencyMode property to GCLatencyMode.SustainedLowLatency.
Server: Your program is the only significant application on the machine and needs the lowest possible latency for GCs.
Workstation: You have a UI or share the machine with other important processes.
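In code, assuming .NET 4.5, that looks like the sketch below; GCSettings also lets you confirm which GC flavour the config actually gave you:

using System;
using System.Runtime;

static class GcModeCheck
{
    public static void Apply()
    {
        // Confirm whether the process is running the server GC at all.
        Console.WriteLine("Server GC: " + GCSettings.IsServerGC);

        // Opt in to sustained low latency for the latency-critical phase.
        GCSettings.LatencyMode = GCLatencyMode.SustainedLowLatency;
    }
}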
I have a test of my DB engine on .NET 6 where I compare performance under different GC settings. In general, I never saw more than a 300% improvement, and that was with 6 heaps.
https://vimeo.com/711964445
runtimeconfig.template.json
{
  "configProperties": {
    "System.GC.HeapHardLimit": 8000000000,
    "System.GC.Server": true,
    "System.GC.HeapCount": 6
  }
}
I've been researching .NET performance counters a bit, but couldn't find anything detailed (unlike the rest of the framework).
My question is: which are the most relevant performance counters for gauging the maximum computing power of the machine? Well, of the .NET virtual machine, that is.
Thank you,
Chuck Mysak
You haven't described what you mean by "computing power". Here are some of the things you can get through performance counters that might be relevant:
Number of SQL queries executed.
Number of IIS requests completed.
Number of distributed transactions committed.
Bytes of I/O performed (disk, network, etc.).
There are also relative numbers, such as percentages of processor and memory in use which can give an indication of how much of the "power" of your system is actually being used.
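If you do want to read a few of those counters from code, the standard PerformanceCounter class covers it. A sketch; the category and counter names below are the built-in Windows and .NET CLR ones:

using System;
using System.Diagnostics;
using System.Threading;

static class CounterSample
{
    public static void Read()
    {
        var cpu = new PerformanceCounter("Processor", "% Processor Time", "_Total");
        var mem = new PerformanceCounter("Memory", "Available MBytes");
        var gc  = new PerformanceCounter(".NET CLR Memory", "% Time in GC", "_Global_");

        // The first CPU sample is always 0; take two readings a second apart.
        cpu.NextValue();
        Thread.Sleep(1000);

        Console.WriteLine("CPU: {0:0.0} %", cpu.NextValue());
        Console.WriteLine("Available memory: {0} MB", mem.NextValue());
        Console.WriteLine("Time in GC: {0:0.0} %", gc.NextValue());
    }
}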
However, you will not find anything that correlates cleanly with raw computing "power". In fact, who is to say that the machine's full "power" is being taken advantage of at the time you look at the counters? It sounds like what you really want to do is run a benchmark, which includes the exact work to be performed and the collection of measurements to be taken. You can search the Internet for various benchmark applications. Some of these run directly in .NET while the most popular are native applications which you could shell out to execute.