Do lex and yacc provide optimized code? - yacc

Do Lex and Yacc provide optimized code or is it required that we write our own code manually for higher performance?

The code you write has a substantial effect on the speed, especially on the lexer side. Some versions of Flex come with half a dozen (or so) different word counters, most written with Flex, and a few written by hand -- the code gives a pretty good idea of how to optimize scanning speed (and a fairly reasonable comparison of what you can expect in hand-written vs. machine generated lexers).
On the parser side, you're generally a bit more constrained -- you can't make as many changes without affecting the semantics. Here, a great deal depends on the parser generator you use -- for example, some algorithms for less constrained grammars consistently produce relatively slow parsers, while other algorithms only slow down for the less constrained constructs, and run relatively fast as long as the input doesn't use the more complex constructs.

Yacc produces a table-driven parser, which can never be as fast as a well-written hand-coded one. I don't know if the same applies to lex.

Related

GCC optimization flags for matrix/vector operations

I am performing matrix operations using C. I would like to know what are the various compiler optimization flags to improve speed of execution of these matrix operations for double and int64 data - like Multiplication, Inverse, etc. I am not looking for hand optimized code, I just want to make the native code more faster using compiler flags and learn more about these flags.
The flags that I have found so far which improve matrix code.
-O3/O4
-funroll-loops
-ffast-math
First of all, I don't recommend using -ffast-math for the following reasons:
It has been proved that the performance actually degrades when
using this option in most (if not all) cases. So "fast math" is
not actually that fast.
This option breaks strict IEEE compliance on floating-point
operations which ultimately results in accumulation of computational
errors of unpredictable nature.
You may well get different results in different environments and the difference may be
substantial. The term environment (in this case) implies the combination of: hardware,
OS, compiler. Which means that the diversity of situations when you can get unexpected
results has exponential growth.
Another sad consequence is that programs which link against the
library built with this option might
expect correct (IEEE compliant) floating-point math, and this is
where their expectations break, but it will be very tough to figure
out why.
Finally, have a look at this article.
For the same reasons you should avoid -Ofast (as it includes the evil -ffast-math). Extract:
-Ofast
Disregard strict standards compliance. -Ofast enables all -O3 optimizations. It also enables optimizations that are not valid for all standard-compliant programs. It turns on -ffast-math and the Fortran-specific -fno-protect-parens and -fstack-arrays.
There is no such flag as -O4. At least I'm not aware of that one, and there is no trace of it in the official GCC documentation. So the maximum in this regard is -O3 and you should be definitely using it, not only to optimize math, but in release builds in general.
-funroll-loops is a very good choice for math routines, especially involving vector/matrix operations where the size of the loop can be deduced at compile-time (and as a result unrolled by the compiler).
I can recommend 2 more flags: -march=native and -mfpmath=sse. Similarly to -O3, -march=native is good in general for release builds of any software and not only math intensive. -mfpmath=sse enables use of XMM registers in floating point instructions (instead of stack in x87 mode).
Furthermore, I'd like to say that it's a pity that you don't want to modify your code to get better performance as this is the main source of speedup for vector/matrix routines. Thanks to SIMD, SSE Intrinsics, and Vectorization, the heavy-linear-algebra code can be orders of magnitude faster than without them. However, proper application of these techniques requires in-depth knowledge of their internals and quite some time/effort to modify (actually rewrite) the code.
Nevertheless, there is one option that could be suitable in your case. GCC offers auto-vectorization which can be enabled by -ftree-vectorize, but it is unnecessary since you are using -O3 (because it includes -ftree-vectorize already). The point is that you should still help GCC a little bit to understand which code can be auto-vectorized. The modifications are usually minor (if needed at all), but you have to make yourself familiar with them. So see the Vectorizable Loops section in the link above.
Finally, I recommend you to look into Eigen, the C++ template-based library which has highly efficient implementation of most common linear algebra routines. It utilizes all the techniques mentioned here so far in a very clever way. The interface is purely object-oriented, neat, and pleasing to use. The object-oriented approach looks very relevant to linear algebra as it usually manipulates the pure objects such as matrices, vectors, quaternions, rotations, filters, and so on. As a result, when programming with Eigen, you never have to deal with such low level concepts (as SSE, Vectorization, etc.) yourself, but just enjoy solving your specific problem.

What's the difference between code written for a desktop machine and a supercomputer?

Hypothetically speaking, if my scientific work was leading toward the development of functions/modules/subroutines (on a desktop), what would I need to know to incorporate it into a large-scale simulation to be run on a supercomputer (which might simulate molecules, fluids, reactions, and so on)?
My impression is that it has to do with taking advantage of certain libraries (e.g., BLAS, LAPLACK) where possible, revising algorithms (reducing iteration), profiling, parallelizing, considering memory-hard disk-processor use/access... I am aware of the adage, "want to optimize your code? don't do it", but if one were interested in learning about writing efficient code, what references might be available?
I think this question is language agnostic, but since many number-crunching packages for biomolecular simulation, climate modeling, etc. are written in some version of Fortran, this language would probably be my target of interest (and I have programmed rather extensively in Fortran 77).
Profiling is a must at any level of machinery. In common usage, I've found that scaling to larger and larger grids requires a better understanding of the grid software and the topology of the grid. In that sense, everything you learn about optimizing for one machine is still applicable, but understanding the grid software gets you additional mileage. Hadoop is one of the most popular and widespread grid systems, so learning about the scheduler options, interfaces (APIs and web interfaces), and other aspects of usage will help. Although you may not use Hadoop for a given supercomputer, it is one of the less painful methods for learning about distributed computing. For parallel computing, you may pursue MPI and other systems.
Additionally, learning to parallelize code on a single machine, across multiple cores or processors, is something you can begin learning on a desktop machine.
Recommendations:
Learn to optimize code on a single machine:
Learn profiling
Learn to use optimized libraries (after profiling: so that you see the speedup)
Be sure you know algorithms and data structures very well (*)
Learn to do embarrassingly parallel programming on multiple core machines.
Later: consider multithreaded programming. It's harder and may not pay off for your problem.
Learn about basic grid software for distributed processing
Learn about tools for parallel processing on a grid
Learn to program for alternative hardware, e.g. GPUs, various specialized computing systems.
This is language agnostic. I have had to learn the same sequence in multiple languages and multiple HPC systems. At each step, take a simpler route to learn some of the infrastructure and tools; e.g. learn multicore before multithreaded, distributed before parallel, so that you can see what fits for the hardware and problem, and what doesn't.
Some of the steps may be reordered depending on local computing practices, established codebases, and mentors. If you have a large GPU or MPI library in place, then, by all means, learn that rather than foist Hadoop onto your collaborators.
(*) The reason to know algorithms very well is that as soon as your code is running on a grid, others will see it. When it is hogging up the system, they will want to know what you're doing. If you are running a process that is polynomial and should be constant, you may find yourself mocked. Others with more domain expertise may help you find good approximations for NP-hard problems, but you should know that the concept exists.
Parallelization would be the key.
Since the problems you cited (e.g. CFD, multiphysics, mass transfer) are generally expressed as large-scale linear algebra problems, you need matrix routines that parallelize well. MPI is a standard for those types of problems.
Physics can influence as well. For example, it's possible to solve some elliptical problems efficiently using explicit dynamics and artificial mass and damping matricies.
3D multiphysics means coupled differential equations with varying time scales. You'll want a fine mesh to resolve details in both space and time, so the number of degrees of freedom will rise rapidly; time steps will be governed by the stability requirements of your problem.
If someone ever figures out how to run linear algebra as a map-reduce problem they'll have it knocked.
Hypothetically speaking, if my scientific work was leading toward the development of functions/modules/subroutines (on a desktop), what would I need to know to incorporate it into a large-scale simulation to be run on a supercomputer (which might simulate molecules, fluids, reactions, and so on)?
First, you would need to understand the problem. Not all problems can be solved in parallel (and I'm using the term parallel in as wide meaning as it can get). So, see how the problem is now solved. Can it be solved with some other method quicker. Can it be divided in independent parts ... and so on ...
Fortran is the language specialized for scientific computing, and during the recent years, along with the development of new language features, there has also been some very interesting development in terms of features that are aiming for this "market". The term "co-arrays" could be an interesting read.
But for now, I would suggest reading first into a book like Using OpenMP - OpenMP is a simpler model but the book (fortran examples inside) explains nicely the fundamentals. Message parsing interface (for friends, MPI :) is a larger model, and one of often used. Your next step from OpenMP should probably go in this direction. Books on the MPI programming are not rare.
You mentioned also libraries - yes, some of those you mentioned are widely used. Others are also available. A person who does not know exactly where the problem in performance lies should IMHO never try to undertake the task of trying to rewrite library routines.
Also there are books on parallel algorithms, you might want to check out.
I think this question is language agnostic, but since many number-crunching packages for biomolecular simulation, climate modeling, etc. are written in some version of Fortran, this language would probably be my target of interest (and I have programmed rather extensively in Fortran 77).
In short it comes down to understanding the problem, learning where the problem in performance is, re-solving the whole problem again with a different approach, iterating a few times, and by that time you'll already know what you're doing and where you're stuck.
We're in a position similar to yours.
I'm most in agreement with #Iterator's answer, but I think there's more to say.
First of all, I believe in "profiling" by the random-pausing method, because I'm not really interested in measuring things (It's easy enough to do that) but in pinpointing code that is causing time waste, so I can fix it. It's like the difference between a floodlight and a laser.
For one example, we use LAPACK and BLAS. Now, in taking my stack samples, I saw a lot of the samples were in the routine that compares characters. This was called from a general routine that multiplies and scales matrices, and that was called from our code. The matrix-manipulating routine, in order to be flexible, has character arguments that tell it things like, if a matrix is lower-triangular or whatever. In fact, if the matrices are not very large, the routine can spend more than 50% of its time just classifying the problem. Of course, the next time it is called from the same place, it does the same thing all over again. In a case like that, a special routine should be written. When it is optimized by the compiler, it will be as fast as it reasonably can be, and will save all that classifying time.
For another example, we use a variety of ODE solvers. These are optimized to the nth degree of course. They work by calling user-provided routines to calculate derivatives and possibly a jacobian matrix. If those user-provided routines don't actually do much, samples will indeed show the program counter in the ODE solver itself. However, if the user-provided routines do much more, samples will find the lower end of the stack in those routines mostly, because they take longer, while the ODE code takes roughly the same time. So, optimization should be concentrated in the user-provided routines, not the ODE code.
Once you've done several of the kind of optimization that is pinpointed by stack sampling, which can speed things up by 1-2 orders of magnitude, then by all means exploit parallelism, MPI, etc. if the problem allows it.

Voicexml how many words in grammar

I want to have a dynamic grammar in my voicexml file (read single products and create the grammar with php)
my question is, if there is any advice or experience how many words should be writte into the source from where I read the products.
I don't know much about the structure or pronunciation of the words, so let's say
a) the words are rather different from each other
b) the words rather have the same structre or pronunciation
c) a mix of a) and b)
thanks in advance
I'm assuming you mean SRGS grammars when you indicate a dynamic grammar for VoiceXML.
Unfortunately, you're going to have to do performance testing under a reasonable load to really know for sure. I've successfully transmitted 1M+ grammars under certain conditions. I've also done 10,000 name lists. I've also come across platforms that can only utilize a few dozen entries.
The speech recognition (ASR) and VoiceXML platform are going to have a significant impact on your results. And, the number of concurrent recognitions with this grammar will also be relevant along with the overall recognition load.
The factors you mention do have an impact on recognition performance and cpu load, but I've typically found size of grammar and length/variability of entries to matter more. For example, yes/no grammars typically have a much higher cpu load then complex menu grammars (short phrases tend to require more passes and leave open a larger number of possibilities when processing). I've seen some horrible numbers from wide ranging digit grammars (9-31 digit grammars). The sounds are short and difficult to disambiguate. The variability in components, again, creates large number of paths that have to be continuously checked for a solution. Most menu or natural speaking phrases have longer words that sound significantly different so that many paths can be quickly excluded.
Some tips:
Most enterprise class ASR systems support a cache. If you can identify grammars with URL parameters and set any HTTP header information the ASR needs (don't assume they follow the standards), you may see a significant performance boost.
Prompts can often hide grammar loading/compiling phases. If you have a relatively long prompt where people will tend to barge in, you'll find that you can hide some fairly large grammar fetches. Again, not all platforms do a good job of processing these tasks in parallel. Note, most ASR engines can collect audio and perform end-pointing, while still fetching and compiling the grammar. This buys you more time, but you'll see the impact in longer latencies.
Most ASR engines provide tools that let you analyze a grammar with sample audio. The tools will usually give you a cpu resource indicators. I've rarely found that you can calculate/predict overall performance due to the complexities around recognition concurrency, but they can give you a comparative impact with other grammars. I have yet to find an engine that makes it easy to track grammar processing times, it can be difficult to even roughly guess concurrency challenges. In most cases, large scale testing has been necessary.
After grammar load/compile times, recognition concurrency is the most significant performance impact. I've seen a few applications that have highly complex grammars near the beginning of the call. There were high levels of recognition concurrency without an opportunity to cache (platform issue at the time), which lead to scaling challenges (intermittent, large latencies in recognition processing).

When is optimisation premature?

As Knuth said,
We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil.
This is something which often comes up in Stack Overflow answers to questions like "which is the most efficient loop mechanism", "SQL optimisation techniques?" (and so on). The standard answer to these optimisation-tips questions is to profile your code and see if it's a problem first, and if it's not, then therefore your new technique is unneeded.
My question is, if a particular technique is different but not particularly obscure or obfuscated, can that really be considered a premature optimisation?
Here's a related article by Randall Hyde called The Fallacy of Premature Optimization.
Don Knuth started the literate programming movement because he believed that the most important function of computer code is to communicate the programmer's intent to a human reader. Any coding practice that makes your code harder to understand in the name of performance is a premature optimization.
Certain idioms that were introduced in the name of optimization have become so popular that everyone understands them and they have become expected, not premature. Examples include
Using pointer arithmetic instead of array notation in C, including the use of such idioms as
for (p = q; p < lim; p++)
Rebinding global variables to local variables in Lua, as in
local table, io, string, math
= table, io, string, math
Beyond such idioms, take shortcuts at your peril.
All optimization is premature unless
A program is too slow (many people forget this part).
You have a measurement (profile or similar) showing that the optimization could improve things.
(It's also permissible to optimize for memory.)
Direct answer to question:
If your "different" technique makes the program harder to understand, then it's a premature optimization.
EDIT: In response to comments, using quicksort instead of a simpler algorithm like insertion sort is another example of an idiom that everyone understands and expects. (Although if you write your own sort routine instead of using the library sort routine, one hopes you have a very good reason.)
IMHO, 90% of your optimization should occur at design stage, based on percieved current, and more importantly, future requirements. If you have to take out a profiler because your application doesn't scale to the required load you have left it too late, and IMO will waste a lot of time and effort while failing to correct the problem.
Typically the only optimizations that are worthwhile are those that gain you an order of magnitude performance improvement in terms of speed, or a multiplier in terms of storage or bandwidth. These types of optimizations typically relate to algorithm selection and storage strategy, and are extremely difficult to reverse into existing code. They may go as deep as influencing the decision on the language in which you implement your system.
So my advice, optimize early, based on your requirements, not your code, and look to the possible extended life of your app.
If you haven't profiled, it's premature.
My question is, if a particular
technique is different but not
particularly obscure or obfuscated,
can that really be considered a
premature optimisation?
Um... So you have two techniques ready at hand, identical in cost (same effort to use, read, modify) and one is more efficient. No, using the more efficient one would not, in that case, be premature.
Interrupting your code-writing to look for alternatives to common programming constructs / library routines on the off-chance that there's a more efficient version hanging around somewhere even though for all you know the relative speed of what you're writing will never actually matter... That's premature.
Here's the problem I see with the whole concept of avoiding premature optimization.
There's a disconnect between saying it and doing it.
I've done lots of performance tuning, squeezing large factors out of otherwise well-designed code, seemingly done without premature optimization.
Here's an example.
In almost every case, the reason for the suboptimal performance is what I call galloping generality, which is the use of abstract multi-layer classes and thorough object-oriented design, where simple concepts would be less elegant but entirely sufficient.
And in the teaching material where these abstract design concepts are taught, such as notification-driven architecture, and information-hiding where simply setting a boolean property of an object can have an unbounded ripple effect of activities, what is the reason given? Efficiency.
So, was that premature optimization or not?
First, get the code working. Second, verify that the code is correct. Third, make it fast.
Any code change that is done before stage #3 is definitely premature. I am not entirely sure how to classify design choices made before that (like using well-suited data structures), but I prefer to veer towards using abstractions taht are easy to program with rather than those who are well-performing, until I am at a stage where I can start using profiling and having a correct (though frequently slow) reference implementation to compare results with.
From a database perspective, not to consider optimal design at the design stage is foolhardy at best. Databases do not refactor easily. Once they are poorly designed (this is what a design that doesn't consider optimization is no matter how you might try to hide behind the nonsense of premature optimization), is almost never able to recover from that becasue the database is too basic to the operation of the whole system. It is far less costly to design correctly considering the optimal code for the situation you expect than to wait until the there are a million users and people are screaming becasue you used cursors throughout the application. Other optimizations such as using sargeable code, selecting what look to be the best possible indexes, etc. only make sense to do at design time. There is a reason why quick and dirty is called that. Because it can't work well ever, so don't use quickness as a substitute for good code. Also frankly when you understand performance tuning in databases, you can write code that is more likely to perform well in the same time or less than it takes to write code which doesn't perform well. Not taking the time to learn what is good performing database design is developer laziness, not best practice.
What you seem to be talking about is optimization like using a hash-based lookup container vs an indexed one like an array when a lot of key lookups will be done. This is not premature optimization, but something you should decide in the design phase.
The kind of optimization the Knuth rule is about is minimizing the length the most common codepaths, optimizing the code that is run most by for example rewriting in assembly or simplifying the code, making it less general. But doing this has no use until you are certain which parts of code need this kind of optimization and optimizing will (could?) make the code harder to understand or maintain, hence "premature optimization is the root of all evil".
Knuth also says it is always better to, instead of optimizing, change the algorithms your program uses, the approach it takes to a problem. For example whereas a little tweaking might give you a 10% increase of speed with optimization, changing fundamentally the way your program works might make it 10x faster.
In reaction to a lot of the other comments posted on this question: algorithm selection != optimization
The point of the maxim is that, typically, optimization is convoluted and complex. And typically, you the architect/designer/programmer/maintainer need clear and concise code in order to understand what is going on.
If a particular optimization is clear and concise, feel free to experiment with it (but do go back and check whether that optimization is effective). The point is to keep the code clear and concise throughout the development process, until the benefits of performance outweigh the induced costs of writing and maintaining the optimizations.
Optimization can happen at different levels of granularity, from very high-level to very low-level:
Start with a good architecture, loose coupling, modularity, etc.
Choose the right data structures and algorithms for the problem.
Optimize for memory, trying to fit more code/data in the cache. The memory subsystem is 10 to 100 times slower than the CPU, and if your data gets paged to disk, it's 1000 to 10,000 times slower. Being cautious about memory consumption is more likely to provide major gains than optimizing individual instructions.
Within each function, make appropriate use of flow-control statements. (Move immutable expressions outside of the loop body. Put the most common value first in a switch/case, etc.)
Within each statement, use the most efficient expressions yielding the correct result. (Multiply vs. shift, etc)
Nit-picking about whether to use a divide expression or a shift expression isn't necessarily premature optimization. It's only premature if you do so without first optimizing the architecture, data structures, algorithms, memory footprint, and flow-control.
And of course, any optimization is premature if you don't define a goal performance threshold.
In most cases, either:
A) You can reach the goal performance threshold by performing high-level optimizations, so it's not necessary to fiddle with the expressions.
or
B) Even after performing all possible optimizations, you won't meet your goal performance threshold, and the low-level optimizations don't make enough difference in performance to justify the loss of readability.
In my experience, most optimization problems can be solved at either the architecture/design or data-structure/algorithm level. Optimizing for memory footprint is often (though not always) called for. But it's rarely necessary to optimize the flow control & expression logic. And in those cases where it actually is necessary, it's rarely sufficient.
I try to only optimise when a performance issue is confirmed.
My definition of premature optimisation is 'effort wasted on code that is not known to be a performance problem.' There is most definitely a time and place for optimisation. However, the trick is to spend the extra cost only where it counts to the performance of the application and where the additional cost outweighs the performance hit.
When writing code (or a DB query) I strive to write 'efficient' code (i.e. code that performs its intended function, quickly and completely with simplest logic reasonable.) Note that 'efficient' code is not necessarily the same as 'optimised' code. Optimisations often introduce additional complexity into code which increases both the development and maintenance cost of that code.
My advice: Try to only pay the cost of optimisation when you can quantify the benefit.
When programming, a number of parameters are vital. Among these are:
Readability
Maintainability
Complexity
Robustness
Correctness
Performance
Development time
Optimisation (going for performance) often comes at the expense of other parameters, and must be balanced against the "loss" in these areas.
When you have the option of choosing well-known algorithms that perform well, the cost of "optimising" up-front is often acceptable.
Norman's answer is excellent. Somehow, you routinely do some "premature optimization" which are, actually, best practices, because doing otherwise is known to be totally inefficient.
For example, to add to Norman's list:
Using StringBuilder concatenation in Java (or C#, etc.) instead of String + String (in a loop);
Avoiding to loop in C like: for (i = 0; i < strlen(str); i++) (because strlen here is a function call walking the string each time, called on each loop);
It seems in most JavaScript implementations, it is faster to do too for (i = 0 l = str.length; i < l; i++) and it is still readable, so OK.
And so on. But such micro-optimizations should never come at the cost of readability of code.
The need to use a profiler should be left for extreme cases. The engineers of the project should be aware of where performance bottlenecks are.
I think "premature optimisation" is incredibly subjective.
If I am writing some code and I know that I should be using a Hashtable then I will do that. I won't implement it in some flawed way and then wait for the bug report to arrive a month or a year later when somebody is having a problem with it.
Redesign is more costly than optimising a design in obvious ways from the start.
Obviously some small things will be missed the first time around but these are rarely key design decisions.
Therefore: NOT optimising a design is IMO a code smell in and of itself.
It's worth noting that Knuth's original quote came from a paper he wrote promoting the use of goto in carefully selected and measured areas as a way to eliminate hotspots. His quote was a caveat he added to justify his rationale for using goto in order to speed up those critical loops.
[...] again, this is a noticeable saving in the overall running speed,
if, say, the average value of n is about 20, and if the search routine
is performed about a million or so times in the program. Such loop
optimizations [using gotos] are not difficult to learn and, as I have
said, they are appropriate in just a small part of a program, yet they
often yield substantial savings. [...]
And continues:
The conventional wisdom shared by many of today's software engineers
calls for ignoring efficiency in the small; but I believe this is
simply an overreaction to the abuses they see being practiced by
pennywise-and-pound-foolish programmers, who can't debug or maintain
their "optimized" programs. In established engineering disciplines a
12% improvement, easily obtained, is never considered marginal; and I
believe the same viewpoint should prevail in software engineering. Of
course I wouldn't bother making such optimizations on a oneshot job,
but when it's a question of preparing quality programs, I don't want
to restrict myself to tools that deny me such efficiencies [i.e., goto
statements in this context].
Keep in mind how he used "optimized" in quotes (the software probably isn't actually efficient). Also note how he isn't just criticizing these "pennywise-and-pound-foolish" programmers, but also the people who react by suggesting you should always ignore small inefficiencies. Finally, to the frequently-quoted part:
There is no doubt that the grail of efficiency leads to abuse.
Programmers waste enormous amounts of time thinking about, or worrying
about, the speed of noncritical parts of their programs, and these
attempts at efficiency actually have a strong negative impact when
debugging and maintenance are considered. We should forgot about small
efficiencies, say 97% of the time; premature optimization is the root
of all evil.
... and then some more about the importance of profiling tools:
It is often a mistake to make a priori judgments about what parts of a
program are really critical, since the universal experience of
programmers who have been using measurement tools has been that their
intuitive guesses fail. After working with such tools for seven years,
I've become convinced that all compilers written from now on should be
designed to provide all programmers with feedback indicating what
parts of their programs are costing the most; indeed, this feedback
should be supplied automatically unless it has been specifically
turned off.
People have misused his quote all over the place, often suggesting that micro-optimizations are premature when his entire paper was advocating micro-optimizations! One of the groups of people he was criticizing who echo this "conventional wisdom" as he put of always ignoring efficiencies in the small are often misusing his quote which was originally directed, in part, against such types who discourage all forms of micro-optimization.
Yet it was a quote in favor of appropriately applied micro-optimizations when used by an experienced hand holding a profiler. Today's analogical equivalent might be like, "People shouldn't be taking blind stabs at optimizing their software, but custom memory allocators can make a huge difference when applied in key areas to improve locality of reference," or, "Handwritten SIMD code using an SoA rep is really hard to maintain and you shouldn't be using it all over the place, but it can consume memory much faster when applied appropriately by an experienced and guided hand."
Any time you're trying to promote carefully-applied micro-optimizations as Knuth promoted above, it's good to throw in a disclaimer to discourage novices from getting too excited and blindly taking stabs at optimization, like rewriting their entire software to use goto. That's in part what he was doing. His quote was effectively a part of a big disclaimer, just like someone doing a motorcycle jump over a flaming fire pit might add a disclaimer that amateurs shouldn't try this at home while simultaneously criticizing those who try without proper knowledge and equipment and get hurt.
What he deemed "premature optimizations" were optimizations applied by people who effectively didn't know what they were doing: didn't know if the optimization was really needed, didn't measure with proper tools, maybe didn't understand the nature of their compiler or computer architecture, and most of all, were "pennywise-and-pound-foolish", meaning they overlooked the big opportunities to optimize (save millions of dollars) by trying to pinch pennies, and all while creating code they can no longer effectively debug and maintain.
If you don't fit in the "pennywise-and-pound-foolish" category, then you aren't prematurely optimizing by Knuth's standards, even if you're using a goto in order to speed up a critical loop (something which is unlikely to help much against today's optimizers, but if it did, and in a genuinely critical area, then you wouldn't be prematurely optimizing). If you're actually applying whatever you're doing in areas that are genuinely needed and they genuinely benefit from it, then you're doing just great in the eyes of Knuth.
Premature optimization to me means trying to improve the efficiency of your code before you have a working system, and before you have actually profiled it and know where the bottleneck is. Even after that, readability and maintainability should come before optimization in many cases.
I don't think that recognized best practices are premature optimizations. It's more about burning time on the what ifs that are potential performance problems depending on the usage scenarios. A good example: If you burn a week trying to optimize reflecting over an object before you have proof that it is a bottleneck you are prematurely optimizing.
Unless you find that you need more performance out of your application, due to either a user or business need, there's little reason to worry about optimizing. Even then, don't do anything until you've profiled your code. Then attack the parts which take the most time.
The way I see it is, if you optimize something without knowing how much performance you can gain in different scenario IS a premature optimization. The goal of code should really making it easiest for human to read.
As I posted on a similar question, the rules of optimisation are:
1) Don't optimise
2) (for experts only) Optimise later
When is optimisation premature? Usually.
The exception is perhaps in your design, or in well encapsulated code that is heavily used. In the past I've worked on some time critical code (an RSA implementation) where looking at the assembler that the compiler produced and removing a single unnecessary instruction in an inner loop gave a 30% speedup. But, the speedup from using more sophisticated algorithms was orders of magnitude more than that.
Another question to ask yourself when optimising is "am I doing the equivalent of optimising for a 300 baud modem here?". In other words, will Moore's law make your optimisation irrelevant before too long. Many problems of scaling can be solved just by throwing more hardware at the problem.
Last but not least it's premature to optimise before the program is going too slowly. If it's web application you're talking about, you can run it under load to see where the bottlenecks are - but the likelihood is that you will have the same scaling problems as most other sites, and the same solutions will apply.
edit: Incidentally, regarding the linked article, I would question many of the assumptions made. Firstly it's not true that Moore's law stopped working in the 90s. Secondly, it's not obvious that user's time is more valuable than programmer's time. Most users are (to say the least) not frantically using every CPU cycle available anyhow, they are probably waiting for the network to do something. Plus there is an opportunity cost when programmer's time is diverted from implementing something else, to shaving a few milliseconds off something that the program does while the user is on the phone. Anything longer than that isn't usually optimisation, it's bug fixing.

Compiler optimizations: Where/how can I get a feel for what the payoff is for different optimizations?

In my independent study of various compiler books and web sites, I am learning about many different ways that a compiler can optimize the code that is being compiled, but I am having trouble figuring out how much of a benefit each optimization will tend to give.
How do most compiler writers go about deciding which optimizations to implement first? Or which optimizations are worth the effort or not worth the effort? I realize that this will vary between types of code and even individual programs, but I'm hoping that there is enough similarity between most programs to say, for instance, that one given technique will usually give you a better performance gain than another technique.
I found when implementing textbook compiler optimizations that some of them tended to reverse the improvements made by other optimizations. This entailed a lot of work trying to find the right balance between them.
So there really isn't a good answer to your question. Everything is a tradeoff. Many optimizations work well on one type of code, but are pessimizations for other types. It's like designing a house - if you make the kitchen bigger, the pantry gets smaller.
The real work in building an optimizer is trying out the various combinations, benchmarking the results, and, like a master chef, picking the right mix of ingredients.
Tongue in cheek:
Hubris
Benchmarks
Embarrassment
More seriously, it depends on your compiler's architecture and goals. Here's one person's experience...
Go for the "big payoffs":
native code generation
register allocation
instruction scheduling
Go for the remaining "low hanging fruit":
strength reduction
constant propagation
copy propagation
Keep bennchmarking.
Look at the output; fix anything that looks stupid.
It is usually the case that combining optimizations, or even repeating optimization passes, is more effective than you might expect. The benefit is more than the sum of the parts.
You may find that introduction of one optimization may necessitate another. For example, SSA with Briggs-Chaitin register allocation really benefits from copy propagation.
Historically, there are "algorithmical" optimizations from which the code should benefit in most of the cases, like loop unrolling (and compiler writers should implement those "documented" and "tested" optimizations first).
Then there are types of optimizations that could benefit from the type of processor used (like using SIMD instructions on modern CPUs).
See Compiler Optimizations on Wikipedia for a reference.
Finally, various type of optimizations could be tested profiling the code or doing accurate timing of repeated executions.
I'm not a compiler writer, but why not just incrementally optimize portions of your code, profiling all the while?
My optimization scheme usually goes:
1) make sure the program is working
2) find something to optimize
3) optimize it
4) compare the test results with what came out from 1; if they are different, then the optimization is actually a breaking change.
5) compare the timing difference
Incrementally, I'll get it faster.
I choose which portions to focus on by using a profiler. I'm not sure what extra information you'll garner by asking the compiler writers.
This really depends on what you are compiling. There is was a reasonably good discussion about this on the LLVM mailing list recently, it is of course somewhat specific to the optimizers they have available. They use abbreviations for a lot of their optimization passes, if you not familiar with any of acronyms they are tossing around you can look at their passes page for documentation. Ultimately you can spend years reading academic papers on this subject.
This is one of those topics where academic papers (ACM perhaps?) may be one of the better sources of up-to-date information. The best thing to do if you really want to know could be to create some code in unoptimized form and some in the form that the optimization would take (loops unrolled, etc) and actually figure out where the gains are likely to be using a compiler with optimizations turned off.
It is worth noting that in many cases, compiler writers will NOT spend much time, if any, on ensuring that their libraries are optimized. Benchmarks tend to de-emphasize or even ignore library differences, presumably because you can just use different libraries. For example, the permutation algorithms in GCC are asymptotically* less efficient than they could be when trying to permute complex data. This relates to incorrectly making deep copies during calls to swap functions. This will likely be corrected in most compilers with the introduction of rvalue references (part of the C++0x standard). Rewriting the STL to be much faster is surprisingly easy.
*This assumes the size of the class being permuted is variable. E.g. permutting a vector of vectors of ints would slow down if the vectors of ints were larger.
One that can give big speedups but is rarely done is to insert memory prefetch instructions. The trick is to figure out what memory the program will be wanting far enough in advance, never ask for the wrong memory and never overflow the D-cache.