I'm currently getting into PerfView for performance analysis of my (C#) apps.
But those apps typically make a lot of database calls.
So I asked myself questions like:
- How much time is spent in Repositories?
- (How much time is spent waiting for SQL Queries to return?) -> I don't know if this is even possible to discover with PerfView
But my traces give me barely any useful results. The "Any Stacks" view tells me (when I group on my repository) that 1.5 seconds are spent in my repository (the whole call takes about 45 seconds). I know this can't be right, because the repository calls the database A LOT.
Is it just that no CPU samples are captured while waiting for SQL queries to complete, because the CPU has nothing to do during that time, so my numbers only include data transformation and similar work inside the repository?
Thanks for any help!
EDIT:
What I missed was turning on the Thread Time option to capture the time spent in blocked code (which is what happens during database calls, I suppose). I get all the stacks now and just have to filter out the uninteresting parts, but I don't seem to get anywhere.
What's especially interesting to me when using "Thread Time" is BLOCKED_TIME, but the numbers look off to me. The screenshot shows a CPU_TIME of 28,384, which is milliseconds (AFAIK), but BLOCKED_TIME is 2,314,732, which can't be milliseconds for a single 70-second call. So the percentage for CPU_TIME comes out very low at 1.2%, yet 28 out of 70 seconds is still a lot. The inclusive percentage seems to be comparing apples and oranges here. Can somebody explain?
So, I succeeded.
What I missed (and Vance Morrison actually explains it in his video tutorial) is this: when doing a wall clock time analysis with PerfView, BLOCKED_TIME accumulates the waiting time of all threads that were "sitting around". That means for a 70-second trace, the finalizer thread alone adds 70 seconds to BLOCKED_TIME, because it was doing nothing (or almost nothing, in my case) the whole time. That is also why BLOCKED_TIME can reach millions of milliseconds: 2,314,732 ms is roughly 33 threads' worth of 70-second waits.
So when doing wall clock time analysis it is important to filter down to what you're interested in. For example, find the thread that used the most CPU time, include only that one in your analysis, and walk down its stack to find the pieces of your code that are expensive (and that may also lead to DB or service calls). As soon as you do the analysis from the point of view of a method, you really get the time spent in that method, and the accumulated BLOCKED_TIME of idle threads drops out of the picture.
What I found most useful was searching for methods in my own code that "seemed time consuming" and switching to the Callers view for that method, which shows where it is called from, and the Callees view, which shows what consumes the time further down the stack (a DB call in a repository, or service calls fetching some data).
It's a little hard to explain, but as soon as I really understood the basics of wall clock time analysis, it all made sense.
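As a quick cross-check outside PerfView, you can also wrap a single repository/database call in a Stopwatch; the wall-clock time it reports includes the blocked portion that a pure CPU trace misses. A minimal hedged sketch (the connection string, table, and query are placeholders I made up, not part of the original setup):

```csharp
using System;
using System.Data.SqlClient;
using System.Diagnostics;

class DbWaitSketch
{
    static void Main()
    {
        var sw = Stopwatch.StartNew();
        using (var conn = new SqlConnection("Server=.;Database=MyDb;Integrated Security=true"))
        using (var cmd = new SqlCommand("SELECT COUNT(*) FROM SomeTable", conn))
        {
            conn.Open();
            var count = cmd.ExecuteScalar();   // the thread blocks here while SQL Server does the work
            Console.WriteLine($"Rows: {count}");
        }
        sw.Stop();

        // Wall-clock time of the call, including the waiting (blocked) part
        // that a CPU-only sample set would not show.
        Console.WriteLine($"Repository call took {sw.ElapsedMilliseconds} ms");
    }
}
```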
I recommend this video tutorial: http://channel9.msdn.com/Series/PerfView-Tutorial
Again, great and very powerful tool!
I'm trying not to panic here, so please, bear with me! :S
I've spent a considerable amount of time rewriting a big chunk of code from sync (i.e. thread-blocking) to async (i.e. async/await). The code in question runs inside an ASP.NET application and spans everything from low-level ADO.NET DB access to a higher-level unit-of-work pattern, and finally a custom async HTTP handler for public API access - the full server-side stack, so to speak. The primary purpose of the rewrite wasn't optimization, but untangling, general clean-up, and bringing the code up to something that resembles a modern and deliberate design. Naturally, an optimization gain was implicitly assumed.
Everything in general is great and I'm very satisfied with the overall quality of the new code, as well as the improved scalability it's shown so far in the last couple of weeks of real-world tests. The CPU and memory loads on the server have fallen drastically!
So what's the problem, you might ask?
Well, I've recently been tasked with optimizing a simple data import that is still utilizing the old sync code. Naturally, it didn't take me long before I tried changing it to the new async code-base to see what would happen.
Given everything, the import code is quite simple. It's basically a loop that reads items from a list that's been previously read into memory, adds each of them individually to a unit of work, and saves it to a SQL database by means of an INSERT statement. After the loop is done, the unit of work is committed, which makes the changes permanent (i.e. the DB transaction is committed as well).
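To make the shape concrete, here is a hedged sketch of such an import loop, reduced to plain ADO.NET (the real code uses a unit-of-work abstraction; the table, column, and connection string here are placeholders): one awaited INSERT per item, one commit at the end.

```csharp
using System;
using System.Collections.Generic;
using System.Data.SqlClient;
using System.Threading.Tasks;

class ImportSketch
{
    static async Task ImportAsync(IReadOnlyList<string> items, string connectionString)
    {
        using (var conn = new SqlConnection(connectionString))
        {
            await conn.OpenAsync();
            using (var tx = conn.BeginTransaction())
            {
                foreach (var item in items)
                {
                    using (var cmd = new SqlCommand("INSERT INTO Items (Name) VALUES (@name)", conn, tx))
                    {
                        cmd.Parameters.AddWithValue("@name", item);
                        await cmd.ExecuteNonQueryAsync();   // awaited per item -> one round trip each
                    }
                }
                tx.Commit();   // makes the whole batch permanent
            }
        }
    }

    static Task Main() =>
        ImportAsync(new[] { "a", "b", "c" }, "Server=.;Database=MyDb;Integrated Security=true");
}
```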
The problem is that the new code takes about 20 times as long as the old one, when the expectation was quite the opposite! I've checked and double-checked and there is no obvious overhead in the new code that would warrant such sluggishness.
To be specific: the old code imports a steady 1100 items/sec, while the new one manages 40 items/sec AT BEST (on average it's even less, because the rate falls slightly over time)! If I run the same test over a VPN, so that the network cost outweighs everything else, the throughput is around 25 items/sec for sync and 20 for async.
I've read about multiple cases here on SO which report a 10-20% slowdown when switching from sync to async in similar situations and I was prepared to deal with that for tight loops such as mine. But a 20-fold penalty in a non-networked scenario?! That's completely unacceptable!
What is my best course of action here? How do I tackle this unexpected problem?
UPDATE
I've run the import under a profiler, as suggested.
I'm not sure what to make of the results, though. It would seem that the process spends more than 80% of its time just... waiting. See for yourselves:
The 14% spent inside the custom HTTP handler corresponds to IDataReader.Read, which comes from a tiny remainder of the old sync API. This is still subject to optimization and is likely to be reduced in the near future. Regardless, it's dwarfed by the WaitAny cost, which definitely isn't there in the all-sync version!
What's curious is that the report isn't showing any direct calls from my code to WaitAny, which makes me think this is probably part of the async/await infrastructure. Am I wrong in this conclusion? I kind of hope I am!
What worries me is that I might be reading this all wrong. I know that async costs are much harder to reason about than single-threaded costs. In the end, the WaitAny might be nothing more than the equivalent of the "System Idle Process" on Windows - an artificial entry that reflects the free percentage of the CPU resource.
Can anyone shed some light here for me, please?
I'm new to profiling. I'm trying to profile a C# application that connects to an SQLite database and retrieves data. The database contains 146,856,400 rows and the SELECT query retrieves 428,800 rows.
- On the first execution, the main thread takes 246,686 ms.
- On the second execution of the same code, the main thread takes only 4,296 ms.
After restarting the system:
- On the first execution, the main thread takes 244,533 ms.
- On the second execution of the same code, the main thread takes only 4,053 ms.
Questions:
1) Why is there such a big difference between the first execution time and the second execution time?
2) After restarting the system, why am I not getting the same (fast) results on the first execution?
Please help.
You are experiencing the difference between cold and warm execution of your query. Cold means the first time; warm means all subsequent invocations of your DB query.
The first time, everything is "cold":
OS file system cache is empty.
SQLite cache is empty.
ORM dynamic query compilation is not done and cached yet.
ORM Mapper cache is empty.
Garbage Collector needs to tune your working set
....
When you execute your query a second time, all these first-time initializations (caching) are done, and you are measuring the effects of the different cache levels, as long as there is enough memory available to cache a substantial amount of the requested data.
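A minimal sketch of how you can observe the cold/warm difference yourself, assuming the Microsoft.Data.Sqlite provider (the database path, table, and query are placeholders):

```csharp
using System;
using System.Diagnostics;
using Microsoft.Data.Sqlite;   // assumed SQLite provider; System.Data.SQLite works similarly

class ColdWarmSketch
{
    static long TimeQuery(string dbPath, string sql)
    {
        var sw = Stopwatch.StartNew();
        using (var conn = new SqliteConnection($"Data Source={dbPath}"))
        using (var cmd = conn.CreateCommand())
        {
            conn.Open();
            cmd.CommandText = sql;
            using (var reader = cmd.ExecuteReader())
            {
                while (reader.Read()) { /* drain all rows so the full cost is measured */ }
            }
        }
        sw.Stop();
        return sw.ElapsedMilliseconds;
    }

    static void Main()
    {
        const string db = "big.db";                                       // placeholder database file
        const string sql = "SELECT * FROM Readings WHERE SensorId = 42";  // placeholder query
        Console.WriteLine($"Cold run: {TimeQuery(db, sql)} ms");           // OS/SQLite caches still empty
        Console.WriteLine($"Warm run: {TimeQuery(db, sql)} ms");           // mostly served from cache
    }
}
```

The first run pays for opening the file and filling the OS and SQLite page caches; the second run is served mostly from memory, which is where the large difference comes from.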
A performance difference between 4 minutes and 4s is impressive. Both numbers are valid. Measuring something is easy. Telling someone else what exactly you have measured and how the performance can be improved by changing this or that is much harder.
The performance game goes often like this:
Customer: It is slow
Dev: I cannot repro your issue.
Customer: Here is my scenario ....
Dev: I still cannot repro it. Can you give me data set you use and the exact steps you did perform?
Customer: Sure. Here is the data and the test steps.
Dev: Ahh I see. I can make it 10 times faster.
Customer: That is great. Can I have the fix?
Dev: Sure here it is.
Customer: **Very Angry** It has become faster yes. But I cannot read my old data!
Dev: Oops. We need to migrate all your old data to the new, much more efficient format.
We need to develop a conversion tool, which will take 3 weeks, and your site will
have 3 days of downtime while the conversion tool is running.
Or
We keep the old inefficient data format. But then we can make it only 9 times faster.
Customer: I want to access my data faster without data conversion!
Dev: Here is the fix which is 10% slower with no schema changes.
Customer: Finally. The fix does not break anything but it has not become faster?
Dev: I have measured your use case. It is only slow for the first time.
All later data retrievals are 9 times faster than before.
Customer: Did I mention that in my use case I always read different data?
Dev: No, you did not.
Customer: Fix it!
Dev: That is not really possible without a major rewrite of large portions of our software.
Customer: The data I want to access is stored in a list. I want to process it sequentially.
Dev: In that case we can preload the data in the background while you are working on the current data set. You will only experience a delay for the first data set on each working day.
Customer: Can I have the fix?
Dev: Sure here it is.
Customer: Perfect. It works!
Performance is hard to grasp, since most of the time you are dealing with perceived performance, which is subjective. Bringing it down to quantitative measurements is a good start, but you need to tune your metrics to reflect actual customer use cases, or you will likely optimize in the wrong places, as above. A complete understanding of customer requirements and use cases is a must. On the other hand, you need to understand your complete system (profile it like hell) to be able to tell the difference between cold and warm query execution and to see where you can tune the whole thing. These caches become useless if you query for different data all the time (not likely). Perhaps you need a different index to speed up queries, or you buy an SSD, or you keep all of the data in memory and do all subsequent queries in memory...
I am trying to test some web services that I have exposed. I have created some web performance tests that emulate user activity. I have put these into a load test, with a step load pattern, with the intent of loading on users to discover the number x of concurrent users that would make the response time go above 10 seconds.
I have tried doing this, but the results... are unexpected: over 1,000 HTTP 500 Internal Server Errors. Weirdly, however, my average response time stays largely the same, at an extremely low value (the blue line in this graph), while the number of users increases to a maximum of 200 (red line) and the number of requests/sec also increases (green line). Surely this is incorrect, as page response time should go up along with these.
Can anyone offer any insight as to what might be going on here, and how I might fix it?
The data set I am working from is a test data set that is tiny, so my only theory is that perhaps all requests are being cached, explaining the snappy response time, but the server is still being inundated, hence the errors.
Apologies for the lack of details - I am new to performance testing. Any questions will be answered straight away. Many thanks :)
I have a method in my C# code-behind that executes 10,000+ lines of code from assemblies as well as in child group methods. My question is: how can I optimize it? It takes more than 40 seconds to load 500 rows into my page's GridView, which I designed myself.
Profile your code. That will help you identify where it's slow. From reading your post, optimizing might take you a long time, since you have a lot of code and data.
Virtualize as much as you can. Instead of loading 500 rows, can you try loading 50 rows first, showing your UI, and then loading the remaining 450 rows asynchronously? This doesn't speed up your application, but at least it feels much quicker than waiting 40 seconds.
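A rough, self-contained sketch of that idea (LoadRows stands in for the real data access and just simulates per-row cost; in a real WebForms page you would use paging or an async postback rather than a console app):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

class IncrementalLoadSketch
{
    // Stand-in for the real data access; in the actual app this would hit the database.
    static List<string> LoadRows(int offset, int count)
    {
        Thread.Sleep(count * 10);   // simulate per-row cost
        return Enumerable.Range(offset, count).Select(i => $"row {i}").ToList();
    }

    static async Task Main()
    {
        // Show the first 50 rows immediately...
        var firstPage = LoadRows(0, 50);
        Console.WriteLine($"UI ready with {firstPage.Count} rows");

        // ...then fetch the remaining 450 in the background and refresh the view.
        var rest = await Task.Run(() => LoadRows(50, 450));
        Console.WriteLine($"Full set available: {firstPage.Count + rest.Count} rows");
    }
}
```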
This method is very simple, but it can pinpoint the activities that would profit most from optimizing.
If there is a way to speed up your program, the thing to fix is taking some fraction of the time, like 60%.
If you randomly interrupt it, under a debugger, you have a 60% chance of catching it in the act.
If you examine the stack, and maybe some of the state variables, it will tell you with great precision just what the problem is.
If you do it 10 times, you will see the problem on about 6 samples.
I've not been coding long, so I'm not familiar with which technique is quickest. I was wondering if there is a way to do this in VS or with a 3rd-party tool?
Thanks
Profilers are great for measuring.
But your question was "How can I determine where the slow parts of my code are?".
That is a different problem. It is diagnosis, not measurement.
I know this is not a popular view, but it's true.
It is like a business that is trying to cut costs.
One approach (top down) is to measure the overall finances, then break it down by categories and departments, and try to guess what could be eliminated. That is measurement.
Another approach (bottom up) is to walk into an office at random, pick someone at random, and ask them what they are doing at that moment and (importantly) why, in detail.
Do this more than once.
That is what Harry Truman did at the outbreak of WW2, in the US defense industry, and immediately uncovered massive fraud and waste, by visiting several sites. That is diagnosis.
In code you can do this in a very simple way: "Pause" it and ask it why it is spending that particular cycle. Usually the call stack tells you why, in detail.
Do this more than once.
This is sampling. Some profilers sample the call stack. But then for some reason they insist on summarizing time spent in each function, inclusive and exclusive. That is like summarizing by department in business, inclusive and exclusive.
It loses the information you need, which is the fine-grain detail that tells if the cycles are necessary.
To answer your question:
Just pause your program several times, and capture the call stack each time. If your code is very slow, the wasteful function calls will be on nearly every stack. They will point with precision to the "slow parts of your code".
ADDED: RedGate ANTS is getting there. It can give you cost-by-line, and it is quite spiffy. So if you're in .NET, and can spare 3 figures, and don't mind waiting around to install & learn it, it can tell you much of what your Pause key can tell you, and be much prettier about it.
Profiling.
RedGate has a product.
JetBrains has a product.
I've used ANTS Profiler and I can join the others in recommending it.
The price is NEGLIGIBLE when you compare it with the number of dev hours it will save you.
If you're a developer for a living and your company won't buy it for you, either change companies or buy it for yourself.
For profiling large, complex UI applications, you often need a set of tools and approaches. I'll outline the approach and tools I used recently on a project to improve the performance of a .Net 2.0 UI application.
First of all I interviewed users and worked through the use cases myself to come up with a list of target use cases that highlighted the system's worst-performing areas. I.e., I didn't want to spend n man-days optimising a feature that was hardly ever used but very slow; I would want to spend time, however, optimising a feature that was a little bit sluggish but invoked a thousand times a day, etc.
Once the candidate use cases were identified, I instrumented my code with my own lightweight logging class (I used some high-performance timers and a custom logging solution because I needed sub-millisecond accuracy). You might, however, be able to get away with log4net and timestamps. The reason I instrumented the code is that it is sometimes easier to read your own logs than the profiler's output. I needed both for a variety of reasons (e.g. measuring .Net user control layouts is not always straightforward using the profiler).
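A minimal sketch of that kind of lightweight instrumentation, using Stopwatch for sub-millisecond resolution (the class name and the Console sink are illustrative, not the actual logging class used on the project):

```csharp
using System;
using System.Diagnostics;

// A disposable scope that logs its elapsed time when disposed.
sealed class TimedScope : IDisposable
{
    private readonly string _name;
    private readonly Stopwatch _sw = Stopwatch.StartNew();

    public TimedScope(string name) { _name = name; }

    public void Dispose()
    {
        _sw.Stop();
        // Console is a stand-in for the real log sink (file, log4net, ...).
        Console.WriteLine($"{_name}: {_sw.Elapsed.TotalMilliseconds:F3} ms");
    }
}

class Demo
{
    static void Main()
    {
        using (new TimedScope("LayoutMainGrid"))   // wrap any suspect operation
        {
            System.Threading.Thread.Sleep(25);     // stands in for the real work
        }
    }
}
```

Wrapping suspect operations in a disposable scope keeps the instrumentation to one line per call site and makes it easy to remove later.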
I then ran my instrumented code with the ANTS profiler and profiled the use case. By combining the ANTS profile and my own log files I was very quickly able to discover problems with our application.
We also profiled the server as well as the UI and were able to work out breakdowns for time spent in the UI, time spent on the wire, time spent on the server etc.
Also worth noting is that one run isn't enough, and the first run is usually worth throwing away. Let me explain: PC load, network traffic, JIT compilation status, etc. can all affect the time a particular operation takes. A simple strategy is to measure an operation n times (say 5), throw away the slowest and fastest runs, then analyse the remaining profiles.
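A hedged sketch of that strategy (the measured action is a placeholder):

```csharp
using System;
using System.Diagnostics;
using System.Linq;

class RepeatMeasureSketch
{
    static double MeasureTrimmedMean(Action action, int runs = 5)
    {
        var timings = new double[runs];
        for (int i = 0; i < runs; i++)
        {
            var sw = Stopwatch.StartNew();
            action();
            sw.Stop();
            timings[i] = sw.Elapsed.TotalMilliseconds;
        }
        // Throw away the slowest and fastest run, average what remains.
        return timings.OrderBy(t => t).Skip(1).Take(runs - 2).Average();
    }

    static void Main()
    {
        double ms = MeasureTrimmedMean(() => System.Threading.Thread.Sleep(10));
        Console.WriteLine($"Trimmed mean: {ms:F1} ms");
    }
}
```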
Eqatec profiler is a cute small profiler that is free and easy to use. It probably won't come anywhere near the "wow" factor of Ants profiler in terms of features but it still is very cool IMO and worth a look.
Use a profiler. ANTS costs money but is very nice.
I just set breakpoints; Visual Studio will tell you how many milliseconds have passed between breakpoints, so you can find it manually.
ANTS Profiler is very good.
If you don't want to pay, the newer VS versions come with a profiler, but to be honest it doesn't seem very good. ATI/AMD make a free profiler... but it's not very user-friendly (to me; I couldn't get any useful info out of it).
The advice I would give is to time function calls yourself with code. If they are fast and you do not have a high-precision timer, or the calls vary in slowness for a number of reasons (e.g. building some kind of cache every x calls), try running each one 10,000 times or so, then divide the result accordingly. This may not be perfect for some sections of code, but if you are unable to find a good, free, 3rd-party solution, it's pretty much what's left unless you want to pay.
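A minimal sketch of that approach (the measured call is a placeholder; the accumulator just keeps the work from being optimized away):

```csharp
using System;
using System.Diagnostics;

class ManualTimingSketch
{
    static void Main()
    {
        const int iterations = 10000;
        double sink = 0;                       // prevents the work from being optimized away

        var sw = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
        {
            sink += Math.Sqrt(i);              // stand-in for the call you suspect is slow
        }
        sw.Stop();

        Console.WriteLine($"sink = {sink}");
        Console.WriteLine($"Average per call: {sw.Elapsed.TotalMilliseconds / iterations:F5} ms");
    }
}
```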
Yet another option is Intel's VTune.