Profiling / Code optimizing tool [closed] - optimization

It's difficult to tell what is being asked here. This question is ambiguous, vague, incomplete, overly broad, or rhetorical and cannot be reasonably answered in its current form. For help clarifying this question so that it can be reopened, visit the help center.
Closed 11 years ago.
Please let me know which tool (GNU or 3rd party tool) is the best we can make use for profiling and optimizing our code. Is gprof an effective tool? Do we have dtrace tool ported in Linux?

You're not alone in conflating the terms "profiling" and "optimizing", but they're really very different. As different as weighing a book versus reading it.
As far as gprof goes, here are some issues.
Among profilers, the most useful are the ones that
Sample the whole call stack (not just the program counter) or at least as much of it as contains your code.
Samples on wall-clock time (not just CPU time). If you're losing time in I/O or other blocking calls, a CPU-only profiler will simply not see it.
Tells you by line-of-code (not just by function) the percent of stack samples containing that line. That's important because any such line of code that you could eliminate would save you that percent of overall time, and you don't have to eyeball functions to guess where it is.
A good profiler for linux that does this is Zoom. I'm sure there are others.
(Don't get confused about what matters. Efficiency and/or timing accuracy of a profiler is not important for helping you locate speed bugs. What's important is that it draws your attention to the right things.)
Personally, I use the random-pausing method, because I find it's the most effective.
Not only is it simple and requires no investment, but it finds speedup opportunities that are not localized to particular routines or lines of code, as in this example.
This is reflected in the speedup factors that can be achieved.

gprof is better than nothing. But not much. It estimates time spent not only in a function, but also in all of the functions called by the function - but beware it is an estimate, not a direct measurement. It does not make the distinction that some two callers of the same subroutine may have widely differing times spent inside it, per call. To do better than that you need a real call graph profiler, one that looks at several levels of the stack on a timer tick.
dtrace is good for what it does.
If you are doing low level performance optimization on x86, you should consider Intel's Vtune tool. Not only does it provide the best access I am aware of to low level performance measurement hardware on the chip, the so-called EMON (Event Monitoring) system (some of which I designed), but Vtune also has some pretty good high level tools. Such as call graph profiling that, I believe, is better than gprof. On the low level, I like doing things like generating profiles of the leading suspects, like branch mispredictions and cache misses, and looking at the code to see if there is something that can be done. Sometimes simple stuff, such as making an array size 255 rather than 256, helps a lot.
Generic Linux oprofile, http://oprofile.sourceforge.net/about/, is almost as good as Vtune, and better in some ways. And available for x86 and ARM. I haven't used it much, but I particularly like that you can use it in an almost completely non-intrusive manner, with no need to create the special -pg binary that gprof needs.

Their are many tools from where you can optimize your code,
For web application their are different tools to optimize the code i.e jzip compressors
e.g YUI Compressor etc.
For desktop application optimizing compiler is good.

Related

VHDL optimization tips [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Want to improve this question? Update the question so it can be answered with facts and citations by editing this post.
Closed 7 years ago.
Improve this question
I am quite new in VHDL, and by using different IP cores (by different providers) can see that sometimes they differ massively as per the space that they occupy or timing constraints.
I was wondering if there are rules of thumb for optimization in VHDL (like there are in C, for example; unroll your loops, etc.).
Is it related to the synthesis tools I am using (like the different compilers are using other methods of optimization in C, so you need to learn to read the feedback asm files they return), or is it dependent on my coding skills?
Is it related to the synthesis tools I am using (like the different compilers are using other methods of optimization in C, so you need to learn to read the feedback asm files they return), or is it dependent on my coding skills?
The answer is "yes." When you are coding in HDL, you are actually describing hardware (go figure). Instead of the code being converted into machine code (like it would with C) it is synthesized to logical functions (AND, NOT, OR, XOR, etc) and memory elements (RAM, ROM, FF...).
VHDL can be used in many different ways. You can use VHDL in a purely structural sense where at the base level you are calling our primitives of the underlying technology that you are targeting. For example, you literally instantiate every AND, OR, NOT, and Flip Flop in your design. While this can give you a lot of control, it is not an efficient use of time in 99% of cases.
You can also implement hardware using behavioral constructs with VHDL. Instead of explicitly calling out each logic element, you describe a function to be implemented. For example, if this, do this, otherwise, do something else. You can describe state machines, math operations, and memories all in a behavioral sense. There are huge advantages to describing hardware in a behavioral sense:
Easier for humans to understand
Easier for humans to maintain
More portable between synthesis tools and target hardware
When using behavioral constructs, knowing your synthesis tool and your target hardware can help in understanding how what you write will actually be implemented. For example, if you describe a memory element with a asynchronous reset the implementation in hardware will be different for architectures with a dedicated asynchronous reset input to the memory element and one without.
Synthesis tools will generally publish in their reference manual or user guide a list of suggested HDL constructs to use in order to obtain some desired implementation result. For basic cases, they will be what you would expect. For more elaborate behavior models (e.g. a dual port RAM) there may be some form that you need to follow for the tool to "recognize" what you are describing.
In summary, for best use of your target device:
Know the device you are targeting. How are the programmable elements laid out? How many inputs and outputs are there from lookup tables? Read the device user manual to find out.
Know your synthesis engine. What types of behavioral constructs will be recognized and how will they be implemented? Read the synthesis tool user guide or reference manual to find out. Additionally, experiment by synthesizing small constructs to see how it gets implemented (via RTL or technology viewer, if available).
Know VHDL. Understand the differences between signals and variables. Be able to recognize statements that will generate many levels of logic in your design.
I was wondering if there are rules of thumb for optimization in VHDL
Now that you know the hardware, synthesis tool, and VHDL... Assuming you want to design for maximum performance, the following concepts should be adhered to:
Pipeline, pipeline, pipeline. The more levels of logic you have between synchronous elements, the more difficulty you are going to have making your timing constraint/goal.
Pipeline some more. Having additional stages of registers can provide additional wiggle-room in the future if you need to add more processing steps to your algorithm without affecting the overall latency/timeline.
Be careful when operating on the boundaries of the normal fabric. For example, if interfacing with an IO pin, dedicated multiplies, or other special hardware, you will take more significant timing hits. Additional memory elements should be placed here to avoid critical paths forming.
Review your synthesis and implementation reports frequently. You can learn a lot from reviewing these frequently. For example, if you add a new feature, and your timing takes a hit, you just introduced a critical path. Why? How can you alleviate this issue?
Take care with your "global" structures -- such as resets. Logic that must be widely distributed in your design deserves special care, since it needs to reach across your whole device. You may need special pipeline stages, or timing constraints on this type of logic. If at all possible, avoid "global" structures, unless truly a requirement.
While synthesis tools have design goals to focus on area, speed or power, the designer's choices and skills is the major contributor for the quality of the output. A designer should have a goal to maximize speed or minimize area and it will greatly influence his choices. A design optimized for speed can be made smaller by asking the tool to reduce the area, but not nearly as much as the same design thought for area in the first place.
However, it is more complicated than that. IP cores often target several FPGA technologies as well as ASIC. This can only be achieved by using general VHDL constructs, (or re-writing the code for each target, which non-critical IP providers don't do). FPGA and ASIC vendor have primitives that will improve speed/area when used, but if you write code to use a primitive for a technology, it doesn't mean that the resulting code will be optimized if you change the technology. Both Xilinx and Altera have DSP blocks to speed multiplication and whatnot, but they don't work exactly the same and writing code that uses the full potential of both is very challenging.
Synthesis tools are notorious for doing exactly what you ask them to, even if a more optimized solution is simple, for example:
a <= (x + y) + z; -- This is a 2 cascaded 2-input adder
b <= x + y + z; -- This is a 3-input adder
Will likely lead a different path from xyz to b/c. In the end, the designer need to know what he wants, and he has to verify that the synthesis tool understands his intent.

Evolution Strategies [closed]

It's difficult to tell what is being asked here. This question is ambiguous, vague, incomplete, overly broad, or rhetorical and cannot be reasonably answered in its current form. For help clarifying this question so that it can be reopened, visit the help center.
Closed 10 years ago.
What is the basic idea behind self adaptive evolution strategies? What are the strategy parameters and how are they manipulated during the run of the algorithm?
There's an excellent article on scholarpedia on the Evolution Strategy. I can also recommend the excellent journal article: Beyer, H.-G. & Schwefel, H.-P. Evolution Strategies - A Comprehensive Introduction. Natural Computing, 2002, 1, 3-52.
In the history of ES there have been several ways of adopting strategy parameters. The target of the adaptation generally is the shape and size of the sampling region around the current solution. The first one was the 1/5th success rule, then came the sigma self-adaptation and finally covariance matrix adaptation (CMA-ES). Why is this important? To put it simple: Adaptation of the mutation strength is necessary to maintain the evolution progress in all stages of the search. The closer you come to the optimum, the less you want to mutate your vector.
The advantage of CMA-ES over sigma self-adaptation is that it also adapts the shape of the region. Sigma self-adaption is restricted to axis-parallel adaptions only.
To get a larger picture, the book Introduction to Evolutionary Computing has a great chapter (#8) on parameter control, which self adaptation is part of.
Here is a quote taken from the introductory section:
Globally, we distinguish two major forms of setting parameter values:
parameter tuning and parameter control. By parameter tuning we mean
the commonly practised approach that amounts to finding good values
for the parameters Wont the run of the algorithm and then running the
algorithm using these values, which remain fixed during the run. Later
on in this section we give arguments that any static set of parameters
having the values fixed during an EA run seems to be inappropriate.
Parameter control forms an alternative. as it amounts to starting a
run with initial parameter values that are changed during the run.
Parameter tuning is a typical approach to algorithm design. Such
tuning is done by experimenting with different values and selecting
the ones that give the best results on the test problems at. hand.
However, the number of possible parameters and their different values
means that this is a very time-consuming activity
[Parameter control] is based on the observation that finding good
parameter values for an evolutionary algorithm is a poorly structured,
ill-defined, complex problem. This is exactly the kind of problem on
which EAs are often considered to perform better than other methods.
It is thus a natural idea to use an EA for tuning an EA to a
particular problem. This could be done using two EAs: one for problem
solving and another one - the so-called meta-EA - to tune the first
one. It could also be done by using only one EA that
tunes itself to a given problem, while solving that problem.
Self-adaptation, as introduced in evolution strategies for varying the
mutation parameters, falls within this category
It is then followed by concrete examples and further details.
well the goal behind self adapting in evolutionary computation in general is that algorithms should be general and require as less problem knowledge in form of input parameters you have to specify as possible.
self adapting makes an algorithm more general without the need of problem knowledge to choose the correct parametrisation.

What's the difference between code written for a desktop machine and a supercomputer?

Hypothetically speaking, if my scientific work was leading toward the development of functions/modules/subroutines (on a desktop), what would I need to know to incorporate it into a large-scale simulation to be run on a supercomputer (which might simulate molecules, fluids, reactions, and so on)?
My impression is that it has to do with taking advantage of certain libraries (e.g., BLAS, LAPLACK) where possible, revising algorithms (reducing iteration), profiling, parallelizing, considering memory-hard disk-processor use/access... I am aware of the adage, "want to optimize your code? don't do it", but if one were interested in learning about writing efficient code, what references might be available?
I think this question is language agnostic, but since many number-crunching packages for biomolecular simulation, climate modeling, etc. are written in some version of Fortran, this language would probably be my target of interest (and I have programmed rather extensively in Fortran 77).
Profiling is a must at any level of machinery. In common usage, I've found that scaling to larger and larger grids requires a better understanding of the grid software and the topology of the grid. In that sense, everything you learn about optimizing for one machine is still applicable, but understanding the grid software gets you additional mileage. Hadoop is one of the most popular and widespread grid systems, so learning about the scheduler options, interfaces (APIs and web interfaces), and other aspects of usage will help. Although you may not use Hadoop for a given supercomputer, it is one of the less painful methods for learning about distributed computing. For parallel computing, you may pursue MPI and other systems.
Additionally, learning to parallelize code on a single machine, across multiple cores or processors, is something you can begin learning on a desktop machine.
Recommendations:
Learn to optimize code on a single machine:
Learn profiling
Learn to use optimized libraries (after profiling: so that you see the speedup)
Be sure you know algorithms and data structures very well (*)
Learn to do embarrassingly parallel programming on multiple core machines.
Later: consider multithreaded programming. It's harder and may not pay off for your problem.
Learn about basic grid software for distributed processing
Learn about tools for parallel processing on a grid
Learn to program for alternative hardware, e.g. GPUs, various specialized computing systems.
This is language agnostic. I have had to learn the same sequence in multiple languages and multiple HPC systems. At each step, take a simpler route to learn some of the infrastructure and tools; e.g. learn multicore before multithreaded, distributed before parallel, so that you can see what fits for the hardware and problem, and what doesn't.
Some of the steps may be reordered depending on local computing practices, established codebases, and mentors. If you have a large GPU or MPI library in place, then, by all means, learn that rather than foist Hadoop onto your collaborators.
(*) The reason to know algorithms very well is that as soon as your code is running on a grid, others will see it. When it is hogging up the system, they will want to know what you're doing. If you are running a process that is polynomial and should be constant, you may find yourself mocked. Others with more domain expertise may help you find good approximations for NP-hard problems, but you should know that the concept exists.
Parallelization would be the key.
Since the problems you cited (e.g. CFD, multiphysics, mass transfer) are generally expressed as large-scale linear algebra problems, you need matrix routines that parallelize well. MPI is a standard for those types of problems.
Physics can influence as well. For example, it's possible to solve some elliptical problems efficiently using explicit dynamics and artificial mass and damping matricies.
3D multiphysics means coupled differential equations with varying time scales. You'll want a fine mesh to resolve details in both space and time, so the number of degrees of freedom will rise rapidly; time steps will be governed by the stability requirements of your problem.
If someone ever figures out how to run linear algebra as a map-reduce problem they'll have it knocked.
Hypothetically speaking, if my scientific work was leading toward the development of functions/modules/subroutines (on a desktop), what would I need to know to incorporate it into a large-scale simulation to be run on a supercomputer (which might simulate molecules, fluids, reactions, and so on)?
First, you would need to understand the problem. Not all problems can be solved in parallel (and I'm using the term parallel in as wide meaning as it can get). So, see how the problem is now solved. Can it be solved with some other method quicker. Can it be divided in independent parts ... and so on ...
Fortran is the language specialized for scientific computing, and during the recent years, along with the development of new language features, there has also been some very interesting development in terms of features that are aiming for this "market". The term "co-arrays" could be an interesting read.
But for now, I would suggest reading first into a book like Using OpenMP - OpenMP is a simpler model but the book (fortran examples inside) explains nicely the fundamentals. Message parsing interface (for friends, MPI :) is a larger model, and one of often used. Your next step from OpenMP should probably go in this direction. Books on the MPI programming are not rare.
You mentioned also libraries - yes, some of those you mentioned are widely used. Others are also available. A person who does not know exactly where the problem in performance lies should IMHO never try to undertake the task of trying to rewrite library routines.
Also there are books on parallel algorithms, you might want to check out.
I think this question is language agnostic, but since many number-crunching packages for biomolecular simulation, climate modeling, etc. are written in some version of Fortran, this language would probably be my target of interest (and I have programmed rather extensively in Fortran 77).
In short it comes down to understanding the problem, learning where the problem in performance is, re-solving the whole problem again with a different approach, iterating a few times, and by that time you'll already know what you're doing and where you're stuck.
We're in a position similar to yours.
I'm most in agreement with #Iterator's answer, but I think there's more to say.
First of all, I believe in "profiling" by the random-pausing method, because I'm not really interested in measuring things (It's easy enough to do that) but in pinpointing code that is causing time waste, so I can fix it. It's like the difference between a floodlight and a laser.
For one example, we use LAPACK and BLAS. Now, in taking my stack samples, I saw a lot of the samples were in the routine that compares characters. This was called from a general routine that multiplies and scales matrices, and that was called from our code. The matrix-manipulating routine, in order to be flexible, has character arguments that tell it things like, if a matrix is lower-triangular or whatever. In fact, if the matrices are not very large, the routine can spend more than 50% of its time just classifying the problem. Of course, the next time it is called from the same place, it does the same thing all over again. In a case like that, a special routine should be written. When it is optimized by the compiler, it will be as fast as it reasonably can be, and will save all that classifying time.
For another example, we use a variety of ODE solvers. These are optimized to the nth degree of course. They work by calling user-provided routines to calculate derivatives and possibly a jacobian matrix. If those user-provided routines don't actually do much, samples will indeed show the program counter in the ODE solver itself. However, if the user-provided routines do much more, samples will find the lower end of the stack in those routines mostly, because they take longer, while the ODE code takes roughly the same time. So, optimization should be concentrated in the user-provided routines, not the ODE code.
Once you've done several of the kind of optimization that is pinpointed by stack sampling, which can speed things up by 1-2 orders of magnitude, then by all means exploit parallelism, MPI, etc. if the problem allows it.

Profiling a VxWorks system

We've got a fairly large application running on VxWorks 5.5.1 that's been developed and modified for around 10 years now. We have some simple home-grown tools to show that we are not using too much memory or too much processor, but we don't have a good feel for how much headroom we actually have. It's starting to make it difficult to do estimates for future enhancements.
Does anybody have any suggestions on how to profile such a system? We've never had much luck getting the Wind River tools to work.
For bonus points: the other complication is that our system has very different behaviors at different times; during start-up it does a lot of stuff, then it sits relatively idle except for brief bursts of activity. If there is a profiler with some programmatic way to have to record state information, I think that'd be very useful too.
FWIW, this is compiled with GCC and written entirely in C.
I've done a lot of performance tuning of various kinds of software, including embedded applications. I won't discuss memory profiling - I think that is a different issue.
I can only guess where the "well-known" idea originated that to find performance problems you need to measure performance of various parts. That is a top-down approach, similar to the way governments try to control budget waste, by subdividing. IMHO, it doesn't work very well.
Measurement is OK for seeing if what you did made a difference, but it is poor at telling you what to fix.
What is good at telling you what to fix is a bottom-up approach, in which you examine a representative sample of microscopic units of what is being spent, and finding out the full explanation of why each one is being spent. This works for a simple statistical reason. If there is a reason why some percent (for example 40%) of samples can be saved, on average 40% of samples will show it, and it doesn't require a huge number of samples. It does require that you examine each sample carefully, and not just sort of aggregate them into bigger bunches.
As a historical example, this is what Harry Truman did at the outbreak of the U.S. involvement in WW II. There was terrific waste in the defense industry. He just got in his car, drove out to the factories, and interviewed the people standing around. Then he went back to the U.S. Senate, explained what the problems were exactly, and got them fixed.
Maybe this is more of an answer than you wanted. Specifically, this is the method I use, and this is a blow-by-blow example of it.
ADDED: I guess the idea of finding-by-measuring is simply natural. Around '82 I was working on an embedded system, and I needed to do some performance tuning. The hardware engineer offered to put a timer on the board that I could read (providing from his plenty). IOW he assumed that finding performance problems required timing. I thanked him and declined, because by that time I knew and trusted the random-halt technique (done with an in-circuit-emulator).
If you have the Auxiliary Clock available, you could use the SPY utility (configurable via the config.h file) which does give you a very rough approximation of which tasks are using the CPU.
The nice thing about it is that it does not require being attached to the Tornado environment and you can use it from the Kernel shell.
Otherwise, btpierre's suggestion of using taskHookAdd has been used successfully in the past.
I've worked on systems that have had luck using locally-built monitoring utilities based on taskSwitchHookAdd and related functions (delete hook, etc).
"Simply" use this to track the number of ticks a given task runs. I realize that this is fairly gross scale information for profiling, but it can be useful depending on your needs.
To see how much cpu% each task is using, calculate the percentage of ticks assigned to each task.
To see how much headroom you have, add a lowest priority "idle" task that just does "while(1){}", and see how much cpu% it is assigned to it. Roughly speaking, that's your headroom.

Compiler optimizations: Where/how can I get a feel for what the payoff is for different optimizations?

In my independent study of various compiler books and web sites, I am learning about many different ways that a compiler can optimize the code that is being compiled, but I am having trouble figuring out how much of a benefit each optimization will tend to give.
How do most compiler writers go about deciding which optimizations to implement first? Or which optimizations are worth the effort or not worth the effort? I realize that this will vary between types of code and even individual programs, but I'm hoping that there is enough similarity between most programs to say, for instance, that one given technique will usually give you a better performance gain than another technique.
I found when implementing textbook compiler optimizations that some of them tended to reverse the improvements made by other optimizations. This entailed a lot of work trying to find the right balance between them.
So there really isn't a good answer to your question. Everything is a tradeoff. Many optimizations work well on one type of code, but are pessimizations for other types. It's like designing a house - if you make the kitchen bigger, the pantry gets smaller.
The real work in building an optimizer is trying out the various combinations, benchmarking the results, and, like a master chef, picking the right mix of ingredients.
Tongue in cheek:
Hubris
Benchmarks
Embarrassment
More seriously, it depends on your compiler's architecture and goals. Here's one person's experience...
Go for the "big payoffs":
native code generation
register allocation
instruction scheduling
Go for the remaining "low hanging fruit":
strength reduction
constant propagation
copy propagation
Keep bennchmarking.
Look at the output; fix anything that looks stupid.
It is usually the case that combining optimizations, or even repeating optimization passes, is more effective than you might expect. The benefit is more than the sum of the parts.
You may find that introduction of one optimization may necessitate another. For example, SSA with Briggs-Chaitin register allocation really benefits from copy propagation.
Historically, there are "algorithmical" optimizations from which the code should benefit in most of the cases, like loop unrolling (and compiler writers should implement those "documented" and "tested" optimizations first).
Then there are types of optimizations that could benefit from the type of processor used (like using SIMD instructions on modern CPUs).
See Compiler Optimizations on Wikipedia for a reference.
Finally, various type of optimizations could be tested profiling the code or doing accurate timing of repeated executions.
I'm not a compiler writer, but why not just incrementally optimize portions of your code, profiling all the while?
My optimization scheme usually goes:
1) make sure the program is working
2) find something to optimize
3) optimize it
4) compare the test results with what came out from 1; if they are different, then the optimization is actually a breaking change.
5) compare the timing difference
Incrementally, I'll get it faster.
I choose which portions to focus on by using a profiler. I'm not sure what extra information you'll garner by asking the compiler writers.
This really depends on what you are compiling. There is was a reasonably good discussion about this on the LLVM mailing list recently, it is of course somewhat specific to the optimizers they have available. They use abbreviations for a lot of their optimization passes, if you not familiar with any of acronyms they are tossing around you can look at their passes page for documentation. Ultimately you can spend years reading academic papers on this subject.
This is one of those topics where academic papers (ACM perhaps?) may be one of the better sources of up-to-date information. The best thing to do if you really want to know could be to create some code in unoptimized form and some in the form that the optimization would take (loops unrolled, etc) and actually figure out where the gains are likely to be using a compiler with optimizations turned off.
It is worth noting that in many cases, compiler writers will NOT spend much time, if any, on ensuring that their libraries are optimized. Benchmarks tend to de-emphasize or even ignore library differences, presumably because you can just use different libraries. For example, the permutation algorithms in GCC are asymptotically* less efficient than they could be when trying to permute complex data. This relates to incorrectly making deep copies during calls to swap functions. This will likely be corrected in most compilers with the introduction of rvalue references (part of the C++0x standard). Rewriting the STL to be much faster is surprisingly easy.
*This assumes the size of the class being permuted is variable. E.g. permutting a vector of vectors of ints would slow down if the vectors of ints were larger.
One that can give big speedups but is rarely done is to insert memory prefetch instructions. The trick is to figure out what memory the program will be wanting far enough in advance, never ask for the wrong memory and never overflow the D-cache.