I am working on a Visio add-in in VS2010 Professional and am looking for hot spots (specifically around a COM object) while debugging the application. I have found a number of profilers that can profile existing .NET applications, but none that I have seen supports debugging. Furthermore, because this is a .NET add-in rather than a full standalone executable, I'm not sure how they'd fare.
Profilers I've looked into:
EQATEC
Slimtune
CLR Profiler
nprof
VS2010 Performance Profiler -- Note that this one requires Ultimate or Premium while I am using Professional.
Has anyone found a profiler that can be used during a VS2010 debug session?
I've made this point before on SO, and so have others.
If your object is to improve performance, as measured by wall-clock time, by far the best tool is just the debugger itself, and its "Pause" button.
Let me show you why.
First, let's look at a good profiler
Among profilers, ANTS is probably as good as they come.
When I run it on an app, the top of the screen asks you to choose a time span to look at, and whether you want to look at CPU time or File I/O time.
Within that time span, you then see a call tree that tries to show what ANTS thinks is the "hot path", considering only CPU time.
Of course it emphasizes inclusive "Time With Children (%)", and that's good.
In a big code base like this, notice how extremely small the self-time "Time (%)" is?
That's typical, and you can see why.
What this says is that you should certainly ignore functions that have low inclusive percent, because even if you could reduce them to no-ops, your overall time in that interval would go down by no more than their inclusive percent.
So you look at the functions with high inclusive percent, and you try to find something in them to make them take less time, generally by either a) having them make fewer calls to sub-functions, or b) having the function itself be called less.
If you find something and fix it, you get a certain percent speedup. Then you can try it all again.
When you cannot find anything to fix, you declare victory and put away your profiler for another day.
Notice that there might have been additional problems that you could have fixed for more speedup, but if the profiler didn't help you find them, you've assumed they are not there.
These can be really big sleepers.
Now let's take some manual samples
I just randomly paused the app six times during the phase that was bugging me because it was making me wait.
Each time I took a snapshot of the call stack, and I took a good long look at what the program was doing and why it was doing it.
Three of the samples looked like this:
External Code
Core.Types.ResourceString.getStringFromResourceFile Line 506
Core.Types.ResourceString.getText Line 423
Core.Types.ResourceString.ToString Line 299
External Code
Core.Types.ResourceString.getStringFromResourceFile Line 528
Core.Types.ResourceString.getText Line 423
Core.Types.ResourceString.ToString Line 299
Core.Types.ResourceString.implicit operator string Line 404
SplashForm.pluginStarting Line 149
Services.Plugins.PluginService.includePlugin Line 737
Services.Plugins.PluginService.loadPluginList Line 1015
Services.Plugins.PluginService.loadPluginManifests Line 1074
Services.Plugins.PluginService.DoStart Line 95
Core.Services.ServiceBase.Start Line 36
Core.Services.ServiceManager.startService Line 1452
Core.Services.ServiceManager.startService Line 1438
Core.Services.ServiceManager.loadServices Line 1328
Core.Services.ServiceManager.Initialize Line 346
Core.Services.ServiceManager.Start Line 298
AppStart.Start Line 95
AppStart.Main Line 42
Here is what it is doing.
It is reading a resource file (that's I/O, so looking at CPU time would not see it).
The reason it is reading it is to get the name of a plugin.
The reason the name of the plugin is in a resource file is that there might be a future requirement to internationalize that string.
Anyway, the reason it is being fetched is so the name can be displayed on a splash screen during the loading of the plugin.
Presumably the reason for this is, if the user is wondering what is taking so long, the splash screen will show them what's happening.
Those six samples proved that if the name was not displayed, or if it was displayed but was gotten in some more efficient way, then startup speed of the app would approximately double.
I hope you can see that no profiler that works by showing measurements could have yielded this insight this quickly.
Even if the profiler showed inclusive percent by wall-clock time rather than CPU time, it would still have left the user trying to puzzle out just what was going on, because in summarizing the times over routines it loses almost all of the explanatory context that tells you whether what the program is doing is necessary.
The human tendency when looking only at summary statistics, and looking at the code, is to say "I can see what it's doing, but I don't see any way to improve it."
So what about "statistical significance"?
I hear this all the time, and it comes from naiveté about statistics.
If three out of six samples show a problem, that means the most likely actual percent used by the problem is 3/6=50%.
It also means if you did this many times, on average the cost would be (3+1)/(6+2) which is also 50%.
If you save 50% of time, that gives a 2x speedup.
There is a probability that the cost could be as small as 20%, in which case the speedup would be only 1.25x.
There is an equal probability that the cost could be as large as 80%, in which case the speedup would be 5x (!).
So yes, it is a gamble.
The speedup could be less than estimated, but it will not be zero, and it is equally likely to be dramatically large.
If more precision is required, more samples can be taken, but if one sacrifices the insight that comes from examining samples to get statistical precision, the speedups may well not be found.
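To make the arithmetic behind those numbers explicit, here it is written out (a sketch assuming a uniform prior over the unknown time fraction; the figures are the ones from the samples above):

    % If k of n random-time samples land in the code of interest, the posterior
    % for its true time fraction f under a uniform prior is Beta(k+1, n-k+1):
    \[
      \hat{f} \;=\; E[f \mid k, n] \;=\; \frac{k+1}{n+2}
              \;=\; \frac{3+1}{6+2} \;=\; 0.5,
      \qquad
      \text{speedup} \;=\; \frac{1}{1-\hat{f}} \;=\; 2\times
    \]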
P.S. This link shows the key importance of finding all the problems - not missing any.
After digging in the Extension Manager in VS2010, I found dotTrace. This gives the ability to attach to a running process (Visio in my case) during debugging.
This tool's 10-day trial helped out, but at $400 it still feels a little steep. I still hope to find a cheaper way to accomplish the same task.
Assume I have a CPU running at a constant rate, pulling an equal amount of energy per instruction. I also have two functionally identical programs, which result in the same output, except one has been optimized to execute only 100 instructions, while the other program executes 200 instructions. Is the 100 instruction program necessarily faster than the 200 instruction program? Does a program with fewer instructions draw less power than a program with more instructions?
Things are much more complex than this.
For example, execution speed is in many cases dominated by memory access. As a practical example, one piece of code could process the pixels of an image in two passes, first by rows and then by columns... a different piece of code could be more complex but process rows and columns in a single pass.
The second version could execute more instructions because of the more complex housekeeping of the data, but I wouldn't be surprised if it were faster because of how memory is organized: reading an image one column at a time is going to "thrash the cache", and it's very possible that, despite being simpler, the code working that way could be a LOT slower than the more complex code doing the processing in a memory-friendly way. The simpler code may end up "stalled" a lot, waiting for cache lines to be filled or flushed to external memory.
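To make the cache point concrete, here is a small C sketch (illustrative only; the flat image layout and the function names are made up). Both loops touch every pixel exactly once, but the column-first loop strides across rows and typically runs far slower on cached hardware:

    #include <stddef.h>

    /* Row-major traversal: consecutive iterations touch adjacent bytes,
     * so every cache line that is fetched gets fully used. */
    void brighten_by_rows(unsigned char *img, size_t width, size_t height)
    {
        for (size_t y = 0; y < height; y++)
            for (size_t x = 0; x < width; x++)
                img[y * width + x] += 1;
    }

    /* Column-major traversal: consecutive iterations jump 'width' bytes,
     * so on a large image nearly every access can miss the cache. */
    void brighten_by_columns(unsigned char *img, size_t width, size_t height)
    {
        for (size_t x = 0; x < width; x++)
            for (size_t y = 0; y < height; y++)
                img[y * width + x] += 1;
    }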
This is just an example, but what happens inside a modern CPU when code executes is a very complex process: instructions are broken into micro-operations, registers are renamed, parts of the code are executed speculatively based on what the branch predictors guess before the program counter even reaches them, and so on. Today, in many cases, the only way to know for sure whether something is faster or slower is to try it with real data and measure.
Is the 100 instruction program necessarily faster than the 200 instruction program?
No. Firstly, on some architectures (such as x86) different instructions can take different numbers of cycles. Secondly, there are effects (such as cache misses, page faults and branch mispredictions) that complicate the picture further.
From this it follows that the answer to your headline question is "not necessarily".
Further reading.
I found a paper from 2017 comparing the energy usage, speed, and memory consumption of various programming languages. There is a clear positive correlation: faster languages also tend to use less energy.
I've been using OxyPlot for a month now and I'm pretty happy with what it delivers. I'm getting data from an oscilloscope and, after some fast processing, plotting it in real time on a graph.
However, if I compare my application's CPU usage to that of the application provided by the oscilloscope manufacturer, I'm loading the CPU a lot more. Maybe they're using some GPU-based plotter, but I think I can reduce my CPU usage with some modifications.
I'm capturing 10,000 samples per second and adding them to a LineSeries. I'm not plotting all of that data; I'm decimating it to a constant number of points, say 80 points for a 20-second measurement, so I have 4 points/sec while fully zoomed out and a bit more detail if I zoom in to a specific range.
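For reference, that decimation step amounts to a bucketed reduction, roughly like the sketch below (written in C for brevity, with made-up identifiers; the real code is C# against OxyPlot, but the logic is the same). Keeping the min and max of each bucket instead of the average is a common variant that preserves spikes when zoomed out.

    #include <stddef.h>

    /* Reduce n raw samples to n_out representative points by averaging
     * each bucket of consecutive samples. */
    void decimate(const double *in, size_t n, double *out, size_t n_out)
    {
        for (size_t b = 0; b < n_out; b++)
        {
            size_t lo = b * n / n_out;          /* first sample of the bucket */
            size_t hi = (b + 1) * n / n_out;    /* one past the last sample   */
            double sum = 0.0;

            for (size_t i = lo; i < hi; i++)
                sum += in[i];
            out[b] = (hi > lo) ? sum / (double)(hi - lo) : 0.0;
        }
    }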
With the aid of ReSharper, I've noticed that the application (I have 6 different plots) is calling the IsValidPoint method a huge number of times (something like 400,000,000 calls), which is taking a lot of time.
I think the problem is that, when I add new points to the series, it checks every point in the series for validity instead of only the newly added values.
Also, it spends a lot of time in the MeasureText/DrawText methods.
My question is: is there a way to override those methods and adapt them to my needs? I'm adding 10,000 new values each second, but the earlier ones remain the same, so there's no need to re-validate them. Also, the text shown doesn't change.
Thank you in advance for any advice you can give me. Have a good day!
I have instrumented my application with "stop watches". There is usually one such stop watch per (significant) function. These stop watches measure real time, thread time (and process time, but process time seems less useful), and call count. I can obviously sort the individual stop watches using any of the four values as a key. However, that is not always useful, and it requires me to, e.g., disregard top-level functions when looking for optimization opportunities, since top-level functions/stop watches measure pretty much all of the application's run time.
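For concreteness, each stop watch is conceptually something like the following (a minimal C sketch using POSIX clocks; the type and function names are illustrative, not the actual instrumentation):

    #include <time.h>

    typedef struct {
        const char     *name;        /* function being measured          */
        unsigned long   calls;       /* call count                       */
        double          real_sec;    /* accumulated wall-clock time      */
        double          thread_sec;  /* accumulated per-thread CPU time  */
        struct timespec real_t0, thread_t0;
    } stopwatch_t;

    static double elapsed(struct timespec a, struct timespec b)
    {
        return (b.tv_sec - a.tv_sec) + (b.tv_nsec - a.tv_nsec) / 1e9;
    }

    void sw_start(stopwatch_t *sw)
    {
        clock_gettime(CLOCK_MONOTONIC, &sw->real_t0);
        clock_gettime(CLOCK_THREAD_CPUTIME_ID, &sw->thread_t0);
    }

    void sw_stop(stopwatch_t *sw)
    {
        struct timespec real_t1, thread_t1;

        clock_gettime(CLOCK_MONOTONIC, &real_t1);
        clock_gettime(CLOCK_THREAD_CPUTIME_ID, &thread_t1);
        sw->real_sec   += elapsed(sw->real_t0, real_t1);
        sw->thread_sec += elapsed(sw->thread_t0, thread_t1);
        sw->calls++;
    }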
I wonder if there is any research regarding a score or heuristic that would point out the functions/stop watches that are worth looking at and optimizing?
The goal is to find code worth optimizing, and that's good, but
the question presupposes what many people think, which is that they are looking for "slow methods".
However there are other ways for programs to spend time unnecessarily than by having certain methods that are recognizably in need of optimizing.
What's more, you can't ignore them, because however much time they take will become a larger and larger fraction of the total if you find and fix other problems.
In my experience performance tuning, measuring time can tell if what you fixed helped, but it is not much use for telling you what to fix.
For example, there are many questions on SO from people trying to make sense of profiler output.
The method I rely on is outlined here.
I am working on a point of sale (POS vending machine) project which has many images on the screen where the customer is expected to browse almost all of them. Here are my questions:
Can you suggest test cases for testing image load time?
What is an acceptable load time for these images on screen?
Are there any standards for testing this kind of acceptable load time?
"What is an acceptable loading time?" is a very broad question, one that has been studied as a research question for human computer interaction issues. In general the answer depends on:
How predictable is the loading time? (Does it vary, e.g. according to time of day, from 9am to 2am? Unpredictability is usually the single most annoying thing about waiting.)
How good is the feedback to the user? (Does it look like it's broken, or is there a nice progress bar during the wait? Knowing it's nearly there can help ease the pain, even if the loading times are always consistent.)
Who are the users, and what other systems have they used previously? If it was all written down in a book before, then waiting 2 minutes for images is going to feel positively slow. If you're replacing something that took 3 minutes, then it's pretty fast.
Ancillary input issues: e.g., does it buffer presses whilst loading, and do items move around on the display, so people press before it's finished and accidentally press the wrong thing? Does it annoyingly eat input soon after you've started entering it, so you have to type/scan it again?
In terms of testing, I'm assuming you're not planning on observing users and asking them how frustrated they are. What you can realistically test is how the system copes under realistic loads and how accurate your loading-time predictions are.
We've got a fairly large application running on VxWorks 5.5.1 that's been developed and modified for around 10 years now. We have some simple home-grown tools to show that we are not using too much memory or too much processor, but we don't have a good feel for how much headroom we actually have. It's starting to make it difficult to do estimates for future enhancements.
Does anybody have any suggestions on how to profile such a system? We've never had much luck getting the Wind River tools to work.
For bonus points: the other complication is that our system has very different behaviors at different times; during start-up it does a lot of stuff, then it sits relatively idle except for brief bursts of activity. If there is a profiler with some programmatic way to have it record state information, I think that'd be very useful too.
FWIW, this is compiled with GCC and written entirely in C.
I've done a lot of performance tuning of various kinds of software, including embedded applications. I won't discuss memory profiling - I think that is a different issue.
I can only guess where the "well-known" idea originated that to find performance problems you need to measure performance of various parts. That is a top-down approach, similar to the way governments try to control budget waste, by subdividing. IMHO, it doesn't work very well.
Measurement is OK for seeing if what you did made a difference, but it is poor at telling you what to fix.
What is good at telling you what to fix is a bottom-up approach, in which you examine a representative sample of the microscopic units of time being spent and find out the full explanation of why each one is being spent. This works for a simple statistical reason: if there is a reason why some percentage of the time (for example 40%) can be saved, then on average 40% of the samples will show it, and it doesn't require a huge number of samples. It does require that you examine each sample carefully, and not just aggregate them into bigger bunches.
As a historical example, this is what Harry Truman did at the outbreak of the U.S. involvement in WW II. There was terrific waste in the defense industry. He just got in his car, drove out to the factories, and interviewed the people standing around. Then he went back to the U.S. Senate, explained what the problems were exactly, and got them fixed.
Maybe this is more of an answer than you wanted. Specifically, this is the method I use, and this is a blow-by-blow example of it.
ADDED: I guess the idea of finding-by-measuring is simply natural. Around '82 I was working on an embedded system, and I needed to do some performance tuning. The hardware engineer offered to put a timer on the board that I could read (giving from his plenty). IOW he assumed that finding performance problems required timing. I thanked him and declined, because by that time I knew and trusted the random-halt technique (done with an in-circuit emulator).
If you have the Auxiliary Clock available, you could use the SPY utility (configurable via the config.h file) which does give you a very rough approximation of which tasks are using the CPU.
The nice thing about it is that it does not require being attached to the Tornado environment and you can use it from the Kernel shell.
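If I recall correctly, enabling it is roughly a matter of defining it in config.h and rebuilding the kernel, then driving it from the shell; the exact routine names and arguments vary by VxWorks version, so treat this as a sketch and check the spyLib reference for your release:

    /* In the BSP's config.h (rebuild the image afterwards): */
    #define INCLUDE_SPY        /* pulls spyLib into the kernel */

    /* Then, from the kernel shell (spyLib routines, from memory):
     *   spy (freq, ticksPerSec)   report every 'freq' seconds, sampling
     *                             at 'ticksPerSec' auxiliary-clock ticks
     *   spyStop ()                stop the reports
     */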
Otherwise, btpierre's suggestion of using taskHookAdd has been used successfully in the past.
I've worked on systems that have had luck using locally-built monitoring utilities based on taskSwitchHookAdd and related functions (delete hook, etc).
"Simply" use this to track the number of ticks a given task runs. I realize that this is fairly gross scale information for profiling, but it can be useful depending on your needs.
To see how much cpu% each task is using, calculate the percentage of ticks assigned to each task.
To see how much headroom you have, add a lowest priority "idle" task that just does "while(1){}", and see how much cpu% it is assigned to it. Roughly speaking, that's your headroom.
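A rough C sketch of that idea, assuming the VxWorks 5.x task hook API (taskHookLib) and the tick counter; the hook signature and the details are from memory, so verify them against your headers before relying on this:

    #include <vxWorks.h>
    #include <taskLib.h>
    #include <taskHookLib.h>
    #include <tickLib.h>

    #define MAX_TRACKED 64

    typedef struct { int tid; ULONG ticks; } TASK_TICKS;

    static TASK_TICKS tickTable[MAX_TRACKED];
    static ULONG      lastSwitch;

    /* Switch hook: charge the ticks since the previous switch to the task
     * being switched out.  (In VxWorks 5.x the task ID is the address of
     * its TCB.)  Keep this short; it runs on every context switch. */
    static void switchHook (WIND_TCB *pOldTcb, WIND_TCB *pNewTcb)
    {
        ULONG now = tickGet ();
        int   tid = (int) pOldTcb;
        int   i;

        for (i = 0; i < MAX_TRACKED; i++)
        {
            if (tickTable[i].tid == tid || tickTable[i].tid == 0)
            {
                tickTable[i].tid    = tid;
                tickTable[i].ticks += now - lastSwitch;
                break;
            }
        }
        lastSwitch = now;
    }

    /* Lowest-priority "idle" task: the share of ticks it accumulates is
     * roughly the CPU headroom. */
    static int idleTask (void)
    {
        while (1)
            ;
        return OK;   /* never reached */
    }

    void cpuMeterStart (void)
    {
        lastSwitch = tickGet ();
        taskSwitchHookAdd ((FUNCPTR) switchHook);
        taskSpawn ("tIdleMeter", 255, 0, 4096, (FUNCPTR) idleTask,
                   0, 0, 0, 0, 0, 0, 0, 0, 0, 0);
    }

To read the results, dump tickTable periodically and divide each task's tick count by the total; the share charged to the tIdleMeter task is the headroom figure described above.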