Does optimizing code in TI-BASIC actually make a difference? - optimization

I know in TI-BASIC, the convention is to optimize obsessively and to save as many bytes as possible (which is pretty fun, I admit).
For example,
DelVar Z
Prompt X
If X=0
Then
Disp "X is zero"
End //28 bytes
would be cleaned up as
DelVar ZPrompt X
If not(X
"X is zero //20 bytes
But does optimizing code this way actually make a difference? Does it noticeably run faster or save memory?

Yes. Optimizing your TI-Basic code makes a difference, and that difference is much larger than you would find for most programming languages.
In my opinion, the most important optimization to TI-Basic programs is size (making them as small as possible). This is important to me since I have dozens of programs on my calculator, which only has 24 kB of user-accessible RAM. In this case, it isn't really necessary to spend lots of time trying to save a few bytes of space; instead, I simply advise learning the shortest and most efficient ways to do things, so that when you write programs, they will naturally tend to be small.
Additionally, TI-Basic programs should be optimized for speed. Examples off the top of my head include the quirk with the unclosed For( loop, calculating a value once before a loop instead of recalculating it in every iteration (when possible), and using quickly-accessed variables such as Ans and the finance variables whenever a variable must be accessed a large number of times (e.g. 1000+).
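The "calculate it once" point is not TI-Basic specific; here is a minimal C sketch of the same idea, with hypothetical names. The slow version recomputes an invariant value on every pass, the fast one hoists it out of the loop:

#include <math.h>
#include <stdio.h>

/* Sketch of loop-invariant hoisting: sqrt(scale) never changes inside
   the loop, so compute it once instead of on every iteration. */

double sum_slow(const double *data, int n, double scale) {
    double sum = 0.0;
    for (int i = 0; i < n; i++)
        sum += data[i] * sqrt(scale);   /* recomputed every iteration */
    return sum;
}

double sum_fast(const double *data, int n, double scale) {
    double k = sqrt(scale);             /* computed once, before the loop */
    double sum = 0.0;
    for (int i = 0; i < n; i++)
        sum += data[i] * k;
    return sum;
}

int main(void) {
    double data[] = {1.0, 2.0, 3.0};
    printf("%f %f\n", sum_slow(data, 3, 4.0), sum_fast(data, 3, 4.0));
    return 0;
}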
A third possible optimization is for run-time memory usage. Every loop, function call, etc. has an overhead that must be stored in the memory stack in order to return to the original location, calculate values, etc. during the program's execution. It is important to avoid memory leaks (such as breaking out of a loop with Goto).
It is up to you to decide how you balance these optimizations. I prefer to:
First and foremost, guarantee that there are no memory leaks or incorrectly nested loops in my program.
Take advantage of any size optimizations that have little or no impact on the program's speed.
Consider speed optimizations, and decide if the added speed is worth the increase in program size.

TI-BASIC is an interpreted language, which usually means there is a huge overhead on every single operation.
The way an interpreted language works is that instead of compiling the program into code that runs on the CPU directly, each operation is a function call into the interpreter, which looks at what needs to be done and then calls functions to complete those subtasks. In most cases the overhead is a factor of two or so in speed, and often in stack memory usage as well; non-stack memory usage, however, is usually about the same.
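To make that dispatch overhead concrete, here is a toy bytecode interpreter in C. It is purely illustrative (not how the TI-OS interpreter is actually implemented), but it shows how every operation, no matter how trivial, pays the same fetch/decode/dispatch cost:

#include <stdio.h>

/* Toy interpreter: each operation goes through the same fetch, decode
   and dispatch machinery, which is the per-token overhead a compiled
   program would not pay. */

enum op { OP_PUSH, OP_ADD, OP_PRINT, OP_HALT };

struct instr { enum op op; double arg; };

static void run(const struct instr *code) {
    double stack[64];
    int sp = 0;
    for (int pc = 0; ; pc++) {              /* fetch */
        switch (code[pc].op) {              /* decode + dispatch */
        case OP_PUSH:  stack[sp++] = code[pc].arg; break;
        case OP_ADD:   sp--; stack[sp - 1] += stack[sp]; break;
        case OP_PRINT: printf("%g\n", stack[sp - 1]); break;
        case OP_HALT:  return;
        }
    }
}

int main(void) {
    /* equivalent of the single expression 2 + 3 */
    struct instr prog[] = {
        { OP_PUSH, 2 }, { OP_PUSH, 3 }, { OP_ADD, 0 },
        { OP_PRINT, 0 }, { OP_HALT, 0 }
    };
    run(prog);
    return 0;
}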
In your example above you are doing exactly the same number of operations, so the two versions should run just as fast. What you should optimize are things like i = i + 1 (four operations) into i++ (two operations), to borrow a C-style example; TI-BASIC doesn't actually have a ++ operator.
This does not mean that all operations take exactly the same time; internally, an operation may call hundreds of other functions, or it may be as simple as updating a single variable. The programmers of the interpreter may also have implemented various peephole optimizations for very specific language constructs, e.g. for(int i = 0; i < count; i++) could be implemented as a collection of expensive interpreter functions that treat i as a generic value, or it could be optimized into a tight loop that just updates the variable i and re-evaluates count.
Now, not all interpreted languages are doomed to this pale existence. For example, JavaScript used to be one, but these days all major js engines JIT compile the code to run directly on the CPU.
UPDATE: Clarified that not all operations are created equal.

Absolutely, it makes a difference. I wrote a full-scale color RPG for the TI-84+CSE, and let me tell you, without optimizing my code the game would flat out not run. At present, on the CSE, Sorcery of Uvutu can only run if every other program is archived and everything else is moved out of RAM. The programs and data storage alone take up 20 KB of RAM, just 1 KB under the total available user memory. With all the variables in use, free memory drops to dangerously low levels. At points during development, poor optimization meant I couldn't even start the game without getting a memory error. I had plans to implement various extra features, but due to space and speed concerns it was impossible to do so. And that's only the space side of things.
In the speed department, the game became, and still is, slow in the overworld. Walking around the overworld is painfully slow compared to other games, and that's because of what the code has to do: check for collisions, check if the user is moving to a new map, check if they pressed a key that should elicit a response, check if a battle should start, and more. I was able to make only slight optimizations to the walking speed, but even those made a noticeable improvement. It was still pretty awfully slow (at least compared to every other port I've made), but a little more tolerable.
In summary, from my own experience building a large project, I can say that in TI-Basic, optimizing code does make a difference. Other answers mentioned this, but TI-Basic is an interpreted language. The code isn't compiled into faster, lower-level code; instead, whatever you put in the program is read as it executes, interpreted, dispatched to the subroutines needed to carry out each command, and only then does execution move on to the next line. As a result of that, and the fact that the CPU of the TI-84+ series, the Zilog Z80, was designed in 1976, you get a rather slow interpreter, especially for this day and age. So the fewer commands you run, and the more you take advantage of system quirks (such as Ans being the fastest variable and also the one that can hold the most data types: numbers, strings, lists, matrices, etc.), the better the performance you're going to get.
Sources: My own experiences, documented here: https://codewalr.us/index.php?topic=778.msg27190#msg27190
TI-84+CSE RAM numbers came from here: https://education.ti.com/en/products/calculators/graphing-calculators/ti-84-plus-c-se?category=specifications
Information about the Z80 came from here: http://segaretro.org/Zilog_Z80

Depends. If it's just a basic math program, then no. For big games, then YES. The TI-84 has only 3.5 MB of space available, along with the combo of an ancient Z80 processor and a whopping 128 KB of RAM. TI-BASIC is also quite slow because it's interpreted (look it up for further information), so if you want to make fast-running games, then YES, optimization is very important.

Related

Is faster code also more power efficient?

Assume I have a CPU running at a constant rate, pulling an equal amount of energy per instruction. I also have two functionally identical programs, which result in the same output, except one has been optimized to execute only 100 instructions, while the other program executes 200 instructions. Is the 100 instruction program necessarily faster than the 200 instruction program? Does a program with fewer instructions draw less power than a program with more instructions?
Things are much more complex than this.
For example, execution speed is in many cases dominated by memory access. As a practical example, some code could process the pixels of an image first in rows and then in columns; a different version could be more complex but process rows and columns in a single pass.
The second version could execute more instructions because of the more complex housekeeping of the data, but I wouldn't be surprised if it were faster because of how memory is organized: reading an image one column at a time is going to thrash the cache, and it's very possible that, despite being simpler, the code working that way could be a LOT slower than the more complex version doing the processing in a memory-friendly way. The simpler code may end up stalled much of the time, waiting for cache lines to be filled or flushed to external memory.
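As a rough illustration of that point, here is a minimal C sketch (hypothetical image size and function names): both functions do the same number of additions, but one walks memory sequentially while the other strides across it.

#include <stdlib.h>

#define W 4096
#define H 4096

/* Same arithmetic, very different memory behavior: row-wise access is
   sequential and cache-friendly, column-wise access jumps W bytes
   between elements and tends to thrash the cache. */

static long sum_by_rows(const unsigned char *img) {
    long sum = 0;
    for (int y = 0; y < H; y++)
        for (int x = 0; x < W; x++)
            sum += img[y * W + x];      /* consecutive addresses */
    return sum;
}

static long sum_by_columns(const unsigned char *img) {
    long sum = 0;
    for (int x = 0; x < W; x++)
        for (int y = 0; y < H; y++)
            sum += img[y * W + x];      /* stride of W bytes each step */
    return sum;
}

int main(void) {
    unsigned char *img = calloc((size_t)W * H, 1);
    if (!img) return 1;
    long a = sum_by_rows(img);
    long b = sum_by_columns(img);
    free(img);
    return (a == b) ? 0 : 1;            /* same result, different speed */
}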
This is just an example. In reality, what happens inside a CPU when code is executed is, for many of today's powerful processors, a very complex process: instructions are broken down into micro-operations, registers are renamed, there is speculative execution of parts of the code depending on what the branch predictors guess even before the program counter reaches a certain instruction, and so on. Today, the only way to know for sure whether something is faster or slower is, in many cases, to try it with real data and measure.
Is the 100 instruction program necessarily faster than the 200 instruction program?
No. Firstly, on some architectures (such as x86) different instructions can take a different number of cycles. Secondly, there are effects, such as cache misses, page faults, and branch mispredictions, that complicate the picture further.
From this it follows that the answer to your headline question is "not necessarily".
Further reading.
I found a paper from 2017 comparing the energy usage, speed, and memory consumption of various programming languages. There is a clear positive correlation: faster languages also tend to use less energy.

gfortran change/find out write buffer size

I have this molecular dynamics program that writes atom positions and velocities to a file every n steps of simulation. The actual writing is taking something like 90% of the running time! (I checked by eliminating the writes.) So I desperately need to optimize that.
I see that some Fortran compilers have an extension to change the write buffer size (called I/O block size) and the "number of blocks" at the OPEN statement, but it appears that gfortran doesn't. Also, I read somewhere that gfortran uses an 8192-byte write buffer.
I even tried to do an FSTAT (right after opening; is that right?) to see what block size and number of blocks it is using, but it returns -1 for both. (I'm compiling for 64-bit Windows.)
Isn't there a way to enlarge the write buffer for a file in gfortran? Will it be different when compiling for Linux rather than Windows?
I'd really, really rather stay in Fortran, but as a desperate measure, isn't there a way to do so by adding some C routine?
thanks!
IanH's question is key. Unformatted I/O is MUCH faster than formatted. The conversion from base 2 to base 10 is very CPU intensive. If you don't need the values to be human readable, then use unformatted I/O. If you want to be able to read the values from another language, then use access='stream'.
Another approach would be to add your own buffering. Replace the write statement with a call to a subroutine. Have that subroutine store values and write only when it has accumulated M values. You'll also need a "flush" call to the subroutine to write out the last values if there are fewer than M.
If gcc C is faster at IO, you could mix Fortran and C with Fortran's ISO_C_Binding: https://stackoverflow.com/questions/tagged/fortran-iso-c-binding. There are examples of the use of the ISO C Binding in the gfortran manual under "Mixed Language Programming".
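As a sketch of the do-it-yourself buffering idea above, written on the C side (hypothetical names) so that a Fortran program could call it through ISO_C_BINDING, something along these lines would accumulate values and hand them to fwrite in large chunks:

#include <stddef.h>
#include <stdio.h>

/* Buffer M values, then write: values accumulate in a large buffer and
   are passed to fwrite in big chunks, amortizing the per-call overhead.
   Names and buffer size are illustrative, not a real API. */

#define BUF_VALUES (1 << 20)          /* about 8 MB worth of doubles */

static double buf[BUF_VALUES];
static size_t nbuf = 0;
static FILE *out = NULL;

void traj_open(const char *path) { out = fopen(path, "wb"); }

void traj_write(const double *vals, size_t n) {
    if (!out) return;
    for (size_t i = 0; i < n; i++) {
        buf[nbuf++] = vals[i];
        if (nbuf == BUF_VALUES) {     /* buffer full: flush it in one call */
            fwrite(buf, sizeof(double), nbuf, out);
            nbuf = 0;
        }
    }
}

void traj_close(void) {               /* flush the remainder and close */
    if (out && nbuf) fwrite(buf, sizeof(double), nbuf, out);
    if (out) fclose(out);
}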
If you spend 90% of your runtime writing coords/vels every n timesteps, the obvious quick fix would be to write data less often, say every 100*n timesteps instead. But I'm sure you already thought of that yourself.
But yes, gfortran has a fixed 8k buffer, whose size cannot be changed except by modifying the libgfortran source and rebuilding it. The reason for the buffering is to amortize the syscall overhead; (simplistic) tests on Linux showed that 8k is sufficient and more than that goes far into diminishing returns territory. That being said, if you have some substantiated claims that bigger buffers are useful on some I/O patterns and/or OS, there's no reason why the buffer can't be made larger in a future release.
As for your performance issues: as already mentioned, unformatted I/O is a lot faster than formatted I/O. Additionally, gfortran has rather high per-I/O-statement overhead. You can amortize that by writing whole arrays (or array sections) rather than individual elements (this matters mostly for unformatted I/O; for formatted I/O there is so much other work to do that it doesn't help as much).
I am thinking that if the cost of I/O is comparable to, or even larger than, the cost of the simulation itself, then it probably isn't such a good idea to store all this data to disk in the first place. It is better to do whatever processing you intend to do directly during the simulation, instead of saving lots of intermediate data and later reading it back in to do the processing.
Moreover, MD is an inherently highly parallelizable problem, and with IO you will severely cripple the efficiency of parallelization! I would avoid IO whenever possible.
For individual trajectories, normally you just need to store the initial condition of each trajectory, along with its key statistics, or important snapshots at a small number of time values. When you need one specific trajectory plotted, you can regenerate exactly the same trajectory (or section of trajectory) from the initial condition or the closest snapshot, at a cost similar to reading it from disk.

How fast is VB .net compared to native code for arithmetic?

I need to write software that will do a lot of math. Mostly it will be matrix multiplication with integers to compute DCT. How much faster should I expect the code to run in native c as compared to VB .Net? Factor of 2, factor of 10, factor of 1000...? Has someone tried and collected statistics on this?
.Net code is JIT-compiled to native code before execution, so it should not be slower than native code in general. I'd expect a factor < 10.
Moreover, adaptive optimization techniques profile the code as it runs, gaining more information than a typical static compiler has, so the JIT can make more informed decisions about further optimizations.
The .NET code is compiled into native code by the JIT compiler, so you get native code in both cases.
The difference is that the C code has somewhat less overhead around the calculations, so you should perhaps expect a performance difference of a factor of 2.
VB is 93.7% as fast as C. If you pick the right scenario.
Actually, if your 'native C' includes regular calls to malloc() and free(), any kind of garbage-collected language like VB.NET is going to run circles around it. GC allocation can be 10x faster than malloc in your inner loops.
If you do go with C, try to reuse structures that you declared just once instead of making new ones, to avoid this problem. This may help even in VB if your solution lends itself to it. However, it will be harder to program, and GC is already very fast.
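A minimal C sketch of that reuse pattern (hypothetical names): the scratch buffer is allocated once outside the hot loop and simply reset on each iteration, instead of being malloc'd and free'd every time.

#include <stdlib.h>
#include <string.h>

#define SCRATCH_SIZE 4096

/* The scratch space is passed in, not allocated here, so the hot loop
   below performs no malloc/free at all. */
static double process(const double *input, int n, double *scratch) {
    memset(scratch, 0, SCRATCH_SIZE * sizeof(double));  /* cheap reset */
    double acc = 0.0;
    for (int i = 0; i < n && i < SCRATCH_SIZE; i++) {
        scratch[i] = input[i] * 2.0;
        acc += scratch[i];
    }
    return acc;
}

int main(void) {
    double input[128] = {0};
    double *scratch = malloc(SCRATCH_SIZE * sizeof(double)); /* once, up front */
    if (!scratch) return 1;

    double total = 0.0;
    for (int iter = 0; iter < 100000; iter++)   /* hot loop: reuse, don't allocate */
        total += process(input, 128, scratch);

    free(scratch);                              /* once, at the end */
    return total >= 0.0 ? 0 : 1;
}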
As far as bounds/overflow checks: if speed is important, and testing has revealed they don't trigger, and you're not risking lives or millions of dollars from an error or abend, then they are a waste of time. But if you can't get rid of them, your time is likely still more valuable in a language you can program more quickly in.
If you expect serious size and usage, it pays to split the task with a controlling program and store the allocated 'task definitions' in a shared directory with a file per task solver, or in a database. Then you can run one solver per processor (two per hyper-threaded CPU), or spread them across networked computers. Be wary of queue structures; it's tough to atomically 'mark-taken-and-get-data-if-not-taken'. You know how many task solvers you're going to start. I did this with an imaging utility I develop; it was much easier than expected, and it creamed the previous version. Plus, if you use multiple processes with a properly dividable problem domain, you avoid the slight-to-significant programming burden of multithreading. Or convincing your coworkers that your curly braces are in the right place. Peace.

basic operations cpu time cost

I was wondering how to optimize loops for systems with very limited resources. Let's say we have a basic for loop, like this (written in JavaScript):
for (var i = someArr.length - 1; i > -1; i--)
{
    someArr[i];
}
I honestly don't know, isn't != cheaper than > ?
I would be grateful for any resources covering computing cost in context of basic operators, like the aforementioned, >>, ~, !, and so on.
Performance on a modern CPU is far from trivial. Here are a couple of things that complicate it:
Computers are fast. Your CPU can execute upwards of 6 billion instructions per second, so even the slowest instruction can be executed millions of times per second, meaning it only really matters if you use it very often.
Modern CPUs have hundreds of instructions in flight simultaneously. They are pipelined, meaning that while one instruction is being read, another is reading from registers, a third one is executing, and a fourth one is writing back to a register. Modern CPUs have 15-20 such stages. On top of this, they can execute 3-4 instructions at the same time on each of these stages, and they can reorder those instructions: if the multiplication unit is busy with another instruction, perhaps an addition instruction can be executed instead, for example. So even if you have some slow instructions mixed in, their cost can usually be hidden very well by executing other instructions while waiting for the slow one to finish.
Memory is hundreds of times slower than the CPU. The instructions being executed don't really matter if their cost is dwarfed by retrieval of data from memory. And even this isn't reliable, because the CPU has its own onboard caches to attempt to hide this cost.
So the short answer is "don't try to outsmart the compiler". If you are able to choose between two equivalent expressions, the compiler is probably able to do the same, and will pick the most efficient one. The cost of an instruction varies, depending on all the above factors. Which other instructions are executing, what data is in the CPU's cache, which precise CPU model is the code running on, and so on. Code that is super efficient in one case may be very inefficient in other cases. The compiler will try to pick the most generally efficient instructions, and schedule them as well as possible. Unless you know more than the compiler about this, you're unlikely to be able to do a better job of it.
Don't try such microoptimizations unless you really know what you're doing. As the above shows, low-level performance is a ridiculously complex subject, and it's very easy to write "optimizations" that result in far slower code. Or which just sacrifice readability on something that makes no difference at all.
Further, most of your code simply doesn't have a measurable impact on performance.
People generally love quoting (or misquoting) Knuth on this subject:
We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil
People often interpret this as "don't bother trying to optimize your code". If you actually read the full quote, some much more interesting consequences should become clear:
Most of the time, we should forget about microoptimizations. Most code is executed so rarely that optimizations won't matter. Keeping in mind the number of instructions a CPU can execute per second, it is obvious that a block of code has to be executed very often for optimizations in it to have any effect. So about 97% of the time, your optimizations will be a waste of time. But he also says that sometimes (3% of the time), your optimizations will matter. And obviously, looking for those 3% is a bit like looking for a needle in a haystack. If you just decide to "optimize your code" in general, you're going to waste your time on the first 97%. Instead, you need to first locate the 3% that actually need optimizing. In other words, run your code through a profiler, and let it tell you which code takes up the most CPU time. Then you know where to optimize. And then your optimizations are no longer premature.
It is extraordinarily unlikely that such micro-optimizations will make a noticeable difference to your code in any but the most extreme (real time embedded systems?) circumstances. Your time would probably be better served worrying about making your code readable and maintainable.
When in doubt, always begin by asking Donald Knuth:
http://shreevatsa.wordpress.com/2008/05/16/premature-optimization-is-the-root-of-all-evil/
Or, for a slightly less high-brow take on micro-optimization:
http://www.codinghorror.com/blog/archives/000185.html
Most comparisons have the same cost, because the processor simply performs the comparison and then makes a decision based on the flags it generated, so the particular comparison operator doesn't matter at all. But some architectures try to accelerate this process for specific values, like comparisons against 0.
As far as I know, bitwise operations are the cheapest operations, slightly faster than addition and subtraction. Multiplication and division are a little more expensive, and comparison is the most expensive operation.
That's like asking for a fish, when I would rather teach you to fish.
There are simple ways to see for yourself how long things take. My favorite is to just copy the code 10 times, wrap it in a loop that runs 10^8 times, and look at my watch while it runs: since that's 10^9 executions in total, the number of seconds it takes translates directly to nanoseconds per execution.
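A minimal C sketch of that measurement trick, assuming a snippet as trivial as an increment. The volatile keeps the compiler from deleting the work, and because the snippet runs 10^9 times, the elapsed seconds read directly as nanoseconds per execution:

#include <stdio.h>
#include <time.h>

int main(void) {
    volatile long x = 0;                /* volatile keeps the work from being optimized away */
    clock_t start = clock();

    for (long i = 0; i < 100000000L; i++) {
        /* the snippet under test, copied 10 times */
        x += 1; x += 1; x += 1; x += 1; x += 1;
        x += 1; x += 1; x += 1; x += 1; x += 1;
    }

    double seconds = (double)(clock() - start) / CLOCKS_PER_SEC;
    /* 10^9 executions total, so elapsed seconds == nanoseconds per execution */
    printf("%.2f s elapsed => roughly %.2f ns per operation\n", seconds, seconds);
    return (int)(x & 1);
}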
Saying don't do premature optimization is a "don't be". If you want a "do be" you could try a proactive performance tuning technique like this.
BTW my favorite way of coding your loop is:
for (i = N; --i >= 0;){...}
Premature optimization can be dangerous. The best approach is to write your application without worrying about it, then find the slow points and optimize those. If you are really worried about performance, use a lower-level language. An interpreted language like JavaScript will cost you some processing power compared to a lower-level language like C.
In this particular case, > vs. != is probably not a performance issue. HOWEVER, > is generally the safer choice, because it prevents cases where a later modification to the code sends the loop off into the weeds, stuck in an infinite loop.

When is optimisation premature?

As Knuth said,
We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil.
This is something which often comes up in Stack Overflow answers to questions like "which is the most efficient loop mechanism", "SQL optimisation techniques?" (and so on). The standard answer to these optimisation-tips questions is to profile your code and see whether it's a problem first, and if it's not, then your new technique is unneeded.
My question is, if a particular technique is different but not particularly obscure or obfuscated, can that really be considered a premature optimisation?
Here's a related article by Randall Hyde called The Fallacy of Premature Optimization.
Don Knuth started the literate programming movement because he believed that the most important function of computer code is to communicate the programmer's intent to a human reader. Any coding practice that makes your code harder to understand in the name of performance is a premature optimization.
Certain idioms that were introduced in the name of optimization have become so popular that everyone understands them and they have become expected, not premature. Examples include
Using pointer arithmetic instead of array notation in C, including the use of such idioms as
for (p = q; p < lim; p++)
Rebinding global variables to local variables in Lua, as in
local table, io, string, math
= table, io, string, math
Beyond such idioms, take shortcuts at your peril.
All optimization is premature unless
A program is too slow (many people forget this part).
You have a measurement (profile or similar) showing that the optimization could improve things.
(It's also permissible to optimize for memory.)
Direct answer to question:
If your "different" technique makes the program harder to understand, then it's a premature optimization.
EDIT: In response to comments, using quicksort instead of a simpler algorithm like insertion sort is another example of an idiom that everyone understands and expects. (Although if you write your own sort routine instead of using the library sort routine, one hopes you have a very good reason.)
IMHO, 90% of your optimization should occur at the design stage, based on perceived current and, more importantly, future requirements. If you have to take out a profiler because your application doesn't scale to the required load, you have left it too late, and IMO you will waste a lot of time and effort while failing to correct the problem.
Typically the only optimizations that are worthwhile are those that gain you an order of magnitude performance improvement in terms of speed, or a multiplier in terms of storage or bandwidth. These types of optimizations typically relate to algorithm selection and storage strategy, and are extremely difficult to reverse into existing code. They may go as deep as influencing the decision on the language in which you implement your system.
So my advice, optimize early, based on your requirements, not your code, and look to the possible extended life of your app.
If you haven't profiled, it's premature.
My question is, if a particular technique is different but not particularly obscure or obfuscated, can that really be considered a premature optimisation?
Um... So you have two techniques ready at hand, identical in cost (same effort to use, read, modify) and one is more efficient. No, using the more efficient one would not, in that case, be premature.
Interrupting your code-writing to look for alternatives to common programming constructs / library routines on the off-chance that there's a more efficient version hanging around somewhere even though for all you know the relative speed of what you're writing will never actually matter... That's premature.
Here's the problem I see with the whole concept of avoiding premature optimization.
There's a disconnect between saying it and doing it.
I've done lots of performance tuning, squeezing large factors out of otherwise well-designed code, seemingly done without premature optimization.
Here's an example.
In almost every case, the reason for the suboptimal performance is what I call galloping generality, which is the use of abstract multi-layer classes and thorough object-oriented design, where simple concepts would be less elegant but entirely sufficient.
And in the teaching material where these abstract design concepts are taught, such as notification-driven architecture, and information-hiding where simply setting a boolean property of an object can have an unbounded ripple effect of activities, what is the reason given? Efficiency.
So, was that premature optimization or not?
First, get the code working. Second, verify that the code is correct. Third, make it fast.
Any code change made before stage #3 is definitely premature. I am not entirely sure how to classify design choices made before that (like using well-suited data structures), but I prefer to veer towards abstractions that are easy to program with rather than ones that perform well, until I reach a stage where I can start profiling and have a correct (though frequently slow) reference implementation to compare results against.
From a database perspective, not considering optimal design at the design stage is foolhardy at best. Databases do not refactor easily. Once they are poorly designed (and a design that doesn't consider optimization is exactly that, no matter how you might try to hide behind the nonsense of "premature optimization"), you are almost never able to recover, because the database is too fundamental to the operation of the whole system. It is far less costly to design correctly, considering the optimal code for the situation you expect, than to wait until there are a million users and people are screaming because you used cursors throughout the application. Other optimizations, such as writing sargable code and selecting what look to be the best possible indexes, only make sense at design time. There is a reason why quick and dirty is called that: because it can't ever work well, don't use quickness as a substitute for good code. Also, frankly, when you understand performance tuning in databases, you can write code that is likely to perform well in the same time or less than it takes to write code that performs badly. Not taking the time to learn what good-performing database design looks like is developer laziness, not best practice.
What you seem to be talking about is optimization like using a hash-based lookup container vs an indexed one like an array when a lot of key lookups will be done. This is not premature optimization, but something you should decide in the design phase.
The kind of optimization the Knuth rule is about is minimizing the length of the most common code paths: optimizing the code that runs most often, for example by rewriting it in assembly or simplifying it, making it less general. Doing this has no use until you are certain which parts of the code need that kind of optimization, and such optimization will (or at least could) make the code harder to understand or maintain, hence "premature optimization is the root of all evil".
Knuth also says it is always better to change the algorithms your program uses, the approach it takes to a problem, rather than to micro-optimize. For example, whereas a little tweaking might give you a 10% increase in speed, fundamentally changing the way your program works might make it 10x faster.
In reaction to a lot of the other comments posted on this question: algorithm selection != optimization
The point of the maxim is that, typically, optimization is convoluted and complex. And typically, you the architect/designer/programmer/maintainer need clear and concise code in order to understand what is going on.
If a particular optimization is clear and concise, feel free to experiment with it (but do go back and check whether that optimization is effective). The point is to keep the code clear and concise throughout the development process, until the benefits of performance outweigh the induced costs of writing and maintaining the optimizations.
Optimization can happen at different levels of granularity, from very high-level to very low-level:
Start with a good architecture, loose coupling, modularity, etc.
Choose the right data structures and algorithms for the problem.
Optimize for memory, trying to fit more code/data in the cache. The memory subsystem is 10 to 100 times slower than the CPU, and if your data gets paged to disk, it's 1000 to 10,000 times slower. Being cautious about memory consumption is more likely to provide major gains than optimizing individual instructions.
Within each function, make appropriate use of flow-control statements. (Move immutable expressions outside of the loop body. Put the most common value first in a switch/case, etc.)
Within each statement, use the most efficient expressions yielding the correct result. (Multiply vs. shift, etc.; see the sketch below.)
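As a tiny sketch of that last point (hypothetical function names): for unsigned values, dividing by 8 and shifting right by 3 give the same result, and a modern compiler will usually emit the same code for both, so choose whichever reads better.

#include <stdio.h>

/* Two ways to write the same unsigned scaling; correctness is identical
   and most compilers reduce the division to a shift anyway. */
unsigned scale_down_divide(unsigned x) { return x / 8; }
unsigned scale_down_shift(unsigned x)  { return x >> 3; }

int main(void) {
    for (unsigned x = 0; x < 64; x++)
        if (scale_down_divide(x) != scale_down_shift(x)) {
            printf("mismatch at %u\n", x);
            return 1;
        }
    puts("divide by 8 and shift right by 3 agree for unsigned values");
    return 0;
}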
Nit-picking about whether to use a divide expression or a shift expression isn't necessarily premature optimization. It's only premature if you do so without first optimizing the architecture, data structures, algorithms, memory footprint, and flow-control.
And of course, any optimization is premature if you don't define a goal performance threshold.
In most cases, either:
A) You can reach the goal performance threshold by performing high-level optimizations, so it's not necessary to fiddle with the expressions.
or
B) Even after performing all possible optimizations, you won't meet your goal performance threshold, and the low-level optimizations don't make enough difference in performance to justify the loss of readability.
In my experience, most optimization problems can be solved at either the architecture/design or data-structure/algorithm level. Optimizing for memory footprint is often (though not always) called for. But it's rarely necessary to optimize the flow control & expression logic. And in those cases where it actually is necessary, it's rarely sufficient.
I try to only optimise when a performance issue is confirmed.
My definition of premature optimisation is 'effort wasted on code that is not known to be a performance problem.' There is most definitely a time and place for optimisation. However, the trick is to spend the extra cost only where it counts to the performance of the application and where the performance benefit outweighs the additional cost.
When writing code (or a DB query) I strive to write 'efficient' code (i.e. code that performs its intended function quickly and completely, with the simplest reasonable logic). Note that 'efficient' code is not necessarily the same as 'optimised' code. Optimisations often introduce additional complexity into code, which increases both the development and maintenance cost of that code.
My advice: Try to only pay the cost of optimisation when you can quantify the benefit.
When programming, a number of parameters are vital. Among these are:
Readability
Maintainability
Complexity
Robustness
Correctness
Performance
Development time
Optimisation (going for performance) often comes at the expense of other parameters, and must be balanced against the "loss" in these areas.
When you have the option of choosing well-known algorithms that perform well, the cost of "optimising" up-front is often acceptable.
Norman's answer is excellent. Somehow, you routinely do some "premature optimizations" which are actually best practices, because doing otherwise is known to be totally inefficient.
For example, to add to Norman's list:
Using StringBuilder concatenation in Java (or C#, etc.) instead of String + String (in a loop);
Avoiding loops in C like for (i = 0; i < strlen(str); i++) (because strlen here is a function call that walks the whole string, and it is called on every iteration; see the sketch after this list);
Similarly, in most JavaScript implementations it is faster to do for (var i = 0, l = str.length; i < l; i++), and it is still readable, so OK.
And so on. But such micro-optimizations should never come at the cost of readability of code.
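A minimal C sketch of the strlen case mentioned above (hypothetical names): the first version re-walks the string on every iteration, making the loop quadratic; the second computes the length once.

#include <stdio.h>
#include <string.h>

void count_slow(const char *str) {
    int vowels = 0;
    for (size_t i = 0; i < strlen(str); i++)      /* strlen re-run each pass */
        if (strchr("aeiou", str[i]))
            vowels++;
    printf("%d\n", vowels);
}

void count_fast(const char *str) {
    int vowels = 0;
    size_t len = strlen(str);                     /* length computed once */
    for (size_t i = 0; i < len; i++)
        if (strchr("aeiou", str[i]))
            vowels++;
    printf("%d\n", vowels);
}

int main(void) {
    count_slow("premature optimization");
    count_fast("premature optimization");
    return 0;
}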
The need to use a profiler should be left for extreme cases. The engineers of the project should be aware of where performance bottlenecks are.
I think "premature optimisation" is incredibly subjective.
If I am writing some code and I know that I should be using a Hashtable then I will do that. I won't implement it in some flawed way and then wait for the bug report to arrive a month or a year later when somebody is having a problem with it.
Redesign is more costly than optimising a design in obvious ways from the start.
Obviously some small things will be missed the first time around but these are rarely key design decisions.
Therefore: NOT optimising a design is IMO a code smell in and of itself.
It's worth noting that Knuth's original quote came from a paper he wrote promoting the use of goto in carefully selected and measured areas as a way to eliminate hotspots. His quote was a caveat he added to justify his rationale for using goto in order to speed up those critical loops.
[...] again, this is a noticeable saving in the overall running speed, if, say, the average value of n is about 20, and if the search routine is performed about a million or so times in the program. Such loop optimizations [using gotos] are not difficult to learn and, as I have said, they are appropriate in just a small part of a program, yet they often yield substantial savings. [...]
And continues:
The conventional wisdom shared by many of today's software engineers calls for ignoring efficiency in the small; but I believe this is simply an overreaction to the abuses they see being practiced by pennywise-and-pound-foolish programmers, who can't debug or maintain their "optimized" programs. In established engineering disciplines a 12% improvement, easily obtained, is never considered marginal; and I believe the same viewpoint should prevail in software engineering. Of course I wouldn't bother making such optimizations on a oneshot job, but when it's a question of preparing quality programs, I don't want to restrict myself to tools that deny me such efficiencies [i.e., goto statements in this context].
Keep in mind how he used "optimized" in quotes (the software probably isn't actually efficient). Also note how he isn't just criticizing these "pennywise-and-pound-foolish" programmers, but also the people who react by suggesting you should always ignore small inefficiencies. Finally, to the frequently-quoted part:
There is no doubt that the grail of efficiency leads to abuse. Programmers waste enormous amounts of time thinking about, or worrying about, the speed of noncritical parts of their programs, and these attempts at efficiency actually have a strong negative impact when debugging and maintenance are considered. We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil.
... and then some more about the importance of profiling tools:
It is often a mistake to make a priori judgments about what parts of a program are really critical, since the universal experience of programmers who have been using measurement tools has been that their intuitive guesses fail. After working with such tools for seven years, I've become convinced that all compilers written from now on should be designed to provide all programmers with feedback indicating what parts of their programs are costing the most; indeed, this feedback should be supplied automatically unless it has been specifically turned off.
People have misused his quote all over the place, often suggesting that micro-optimizations are premature, when his entire paper was advocating micro-optimizations! One of the groups he was criticizing (those who echo this "conventional wisdom", as he put it, of always ignoring efficiencies in the small) often misuse his quote, even though it was originally directed, in part, against exactly such types who discourage all forms of micro-optimization.
Yet it was a quote in favor of appropriately applied micro-optimizations when used by an experienced hand holding a profiler. Today's analogical equivalent might be like, "People shouldn't be taking blind stabs at optimizing their software, but custom memory allocators can make a huge difference when applied in key areas to improve locality of reference," or, "Handwritten SIMD code using an SoA rep is really hard to maintain and you shouldn't be using it all over the place, but it can consume memory much faster when applied appropriately by an experienced and guided hand."
Any time you're trying to promote carefully-applied micro-optimizations as Knuth promoted above, it's good to throw in a disclaimer to discourage novices from getting too excited and blindly taking stabs at optimization, like rewriting their entire software to use goto. That's in part what he was doing. His quote was effectively a part of a big disclaimer, just like someone doing a motorcycle jump over a flaming fire pit might add a disclaimer that amateurs shouldn't try this at home while simultaneously criticizing those who try without proper knowledge and equipment and get hurt.
What he deemed "premature optimizations" were optimizations applied by people who effectively didn't know what they were doing: didn't know if the optimization was really needed, didn't measure with proper tools, maybe didn't understand the nature of their compiler or computer architecture, and most of all, were "pennywise-and-pound-foolish", meaning they overlooked the big opportunities to optimize (save millions of dollars) by trying to pinch pennies, and all while creating code they can no longer effectively debug and maintain.
If you don't fit in the "pennywise-and-pound-foolish" category, then you aren't prematurely optimizing by Knuth's standards, even if you're using a goto in order to speed up a critical loop (something which is unlikely to help much against today's optimizers, but if it did, and in a genuinely critical area, then you wouldn't be prematurely optimizing). If you're actually applying whatever you're doing in areas that are genuinely needed and they genuinely benefit from it, then you're doing just great in the eyes of Knuth.
Premature optimization to me means trying to improve the efficiency of your code before you have a working system, and before you have actually profiled it and know where the bottleneck is. Even after that, readability and maintainability should come before optimization in many cases.
I don't think that recognized best practices are premature optimizations. It's more about burning time on the what ifs that are potential performance problems depending on the usage scenarios. A good example: If you burn a week trying to optimize reflecting over an object before you have proof that it is a bottleneck you are prematurely optimizing.
Unless you find that you need more performance out of your application, due to either a user or business need, there's little reason to worry about optimizing. Even then, don't do anything until you've profiled your code. Then attack the parts which take the most time.
The way I see it, if you optimize something without knowing how much performance you can gain in different scenarios, it IS a premature optimization. The goal of code should really be to make it as easy as possible for a human to read.
As I posted on a similar question, the rules of optimisation are:
1) Don't optimise
2) (for experts only) Optimise later
When is optimisation premature? Usually.
The exception is perhaps in your design, or in well encapsulated code that is heavily used. In the past I've worked on some time critical code (an RSA implementation) where looking at the assembler that the compiler produced and removing a single unnecessary instruction in an inner loop gave a 30% speedup. But, the speedup from using more sophisticated algorithms was orders of magnitude more than that.
Another question to ask yourself when optimising is "am I doing the equivalent of optimising for a 300 baud modem here?". In other words, will Moore's law make your optimisation irrelevant before too long. Many problems of scaling can be solved just by throwing more hardware at the problem.
Last but not least, it's premature to optimise before the program is actually running too slowly. If it's a web application you're talking about, you can run it under load to see where the bottlenecks are; but the likelihood is that you will have the same scaling problems as most other sites, and the same solutions will apply.
edit: Incidentally, regarding the linked article, I would question many of the assumptions made. Firstly, it's not true that Moore's law stopped working in the 90s. Secondly, it's not obvious that users' time is more valuable than programmers' time. Most users are (to say the least) not frantically using every CPU cycle available anyhow; they are probably waiting for the network to do something. Plus, there is an opportunity cost when programmer time is diverted from implementing something else to shaving a few milliseconds off something that the program does while the user is on the phone. Anything longer than that isn't usually optimisation, it's bug fixing.