Locality and code optimization

Can optimizers get rid of bad uses of spatial locality? I'm maintaining some code written by somebody else, and many of their arrays are declared in haphazard orders and iterated over differently every time they are used.
Because of the complexity of the code, it would take a long time to rework every place where the arrays are traversed. I'm not skilled enough at reading assembly language to tell exactly what's different at varying levels of optimization, but my question is:
Is locality important when writing programs, or does that get optimized away so that I don't have to worry about it?

Getting locality right is important, because it can make a difference of two orders of magnitude in runtime (5-6 orders of magnitude if you have page faults).
Apart from the fact that real compilers usually don't handle this automatically (as Joel Falcou said), even a hypothetical compiler would have a very hard time doing such a thing. In many cases, it may not even be valid for the compiler to do such a thing, and it is very hard to predict when it is or when it is not.
Say, for example, you have vertex data that you calculate on the CPU, and which you upload to a graphics API such as OpenGL or DirectX. You've agreed with that API a certain vertex data layout. Now the compiler figures that it is more efficient to rearrange the layout in some way. Bang, you're dead.
How was the compiler supposed to know?
Say you have a few arrays and a few pointers, and some pointers alias others, or some point into the middle of an array for some reason, others point at the beginning. The compiler figures that it's more efficient to do certain operations in a different order, overwriting one result with another.
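To make the aliasing hazard concrete, here is a minimal Python sketch (my own illustration, not something from a real compiler). Two overlapping `memoryview` slices of one `bytearray` stand in for a pointer into the middle of an array; the function names are made up. The same element-by-element copy gives different results depending on the iteration order, which is why a compiler may not reorder such a loop unless it can prove the pointers don't alias.

```python
def copy_forward(dst, src):
    """Copy src into dst element by element, lowest index first."""
    for i in range(len(src)):
        dst[i] = src[i]


def copy_backward(dst, src):
    """Same copy, highest index first (a seemingly harmless reordering)."""
    for i in reversed(range(len(src))):
        dst[i] = src[i]


def run(copy):
    buf = bytearray([1, 2, 3, 4, 5])
    view = memoryview(buf)
    src = view[0:4]   # aliases buf[0..3]
    dst = view[1:5]   # aliases buf[1..4], overlapping src
    copy(dst, src)
    return list(buf)


forward = run(copy_forward)
backward = run(copy_backward)
print(forward)   # [1, 1, 1, 1, 1]
print(backward)  # [1, 1, 2, 3, 4]
```

Only the backward order preserves the original values here; with the overlap shifted the other way, it would be the forward order.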
The data corruption issue left aside, let's say those arrays are "somewhat big", so they're most certainly going to be dynamically allocated rather than being on the stack. Which means their start addresses are "non-deterministic" or even "random" from the compiler's point of view. How is the compiler going to make decisions -- at compile time -- not knowing half of the details?
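As a rough illustration of why traversal order matters, here is a toy cache model in Python. Everything in it is an assumption made for the sketch: 64-byte cache lines, 8-byte elements, a tiny fully associative LRU cache of 64 lines, and a 256 x 256 array stored row by row. It only counts misses in the model; it does not measure real hardware.

```python
from collections import OrderedDict

LINE_BYTES = 64      # assumed cache line size
ELEM_BYTES = 8       # assumed element size (e.g. a double)
CACHE_LINES = 64     # assumed capacity: 64 lines, fully associative, LRU


def count_misses(addresses):
    """Count misses for a sequence of byte addresses in the toy LRU cache."""
    cache = OrderedDict()
    misses = 0
    for addr in addresses:
        line = addr // LINE_BYTES
        if line in cache:
            cache.move_to_end(line)        # mark as most recently used
        else:
            misses += 1
            cache[line] = True
            if len(cache) > CACHE_LINES:
                cache.popitem(last=False)  # evict the least recently used line
    return misses


def traversal(rows, cols, row_by_row):
    """Yield byte addresses of a row-major rows x cols array in the chosen loop order."""
    if row_by_row:
        for r in range(rows):
            for c in range(cols):
                yield (r * cols + c) * ELEM_BYTES
    else:
        for c in range(cols):
            for r in range(rows):
                yield (r * cols + c) * ELEM_BYTES


ROWS = COLS = 256
row_misses = count_misses(traversal(ROWS, COLS, True))
col_misses = count_misses(traversal(ROWS, COLS, False))
print(row_misses, col_misses)  # 8192 65536
```

In this model the column-by-column loop misses on every single access, eight times as often as the row-by-row loop over the same data. Real caches are larger and have prefetchers, so the actual ratio has to be measured.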

Few if any compilers handle data layout for locality. It's still an active research area.

Related

Does optimizing code in TI-BASIC actually make a difference?

I know in TI-BASIC, the convention is to optimize obsessively and to save as many bits as possible (which is pretty fun, I admit).
For example,
DelVar Z
Prompt X
If X=0
Then
Disp "X is zero"
End //28 bytes
would be cleaned up as
DelVar ZPrompt X
If not(X
"X is zero //20 bytes
But does optimizing code this way actually make a difference? Does it noticeably run faster or save memory?

Yes. Optimizing your TI-Basic code makes a difference, and that difference is much larger than you would find for most programming languages.
In my opinion, the most important optimization to TI-Basic programs is size (making them as small as possible). This is important to me since I have dozens of programs on my calculator, which only has 24 kB of user-accessible RAM. In this case, it isn't really necessary to spend lots of time trying to save a few bytes of space; instead, I simply advise learning the shortest and most efficient ways to do things, so that when you write programs, they will naturally tend to be small.
Additionally, TI-Basic programs should be optimized for speed. Examples off of the top of my head include the quirk with the unclosed For( loop, calculating a value once instead of calculating it in every iteration of a loop (if possible), and using quickly-accessed variables such as Ans and the finance variables whenever the variable must be accessed a large number of times (e.g. 1000+).
A third possible optimization is for run-time memory usage. Every loop, function call, etc. has an overhead that must be stored in the memory stack in order to return to the original location, calculate values, etc. during the program's execution. It is important to avoid memory leaks (such as breaking out of a loop with Goto).
It is up to you to decide how you balance these optimizations. I prefer to:
First and foremost, guarantee that there are no memory leaks or incorrectly nested loops in my program.
Take advantage of any size optimizations that have little or no impact on the program's speed.
Consider speed optimizations, and decide if the added speed is worth the increase in program size.
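The "calculate a value once instead of in every iteration" advice is not specific to TI-Basic. Here is a minimal sketch of it in Python (used only for illustration; the function names are made up):

```python
import math


def scaled_slow(values, angle_degrees):
    """Recomputes the same scale factor on every iteration."""
    out = []
    for v in values:
        out.append(v * math.cos(math.radians(angle_degrees)))
    return out


def scaled_fast(values, angle_degrees):
    """Computes the loop-invariant scale factor once, before the loop."""
    scale = math.cos(math.radians(angle_degrees))
    out = []
    for v in values:
        out.append(v * scale)
    return out


data = [1.0, 2.0, 3.0]
same = scaled_slow(data, 60) == scaled_fast(data, 60)
print(same)  # True
```

Both versions return the same list; the second simply does the expensive part once. In an interpreter with a high per-operation cost, that kind of change tends to pay off more than it would in compiled code.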

TI-BASIC is an interpreted language, which usually means there is a huge overhead on every single operation.
The way an interpreted language works is that, instead of compiling the program into code that runs directly on the CPU, each operation is a function call into the interpreter, which looks at what needs to be done and then calls functions to complete those subtasks. In most cases, the overhead is a factor of two in speed, and often in stack memory usage as well. However, non-stack memory usage is usually the same.
In your example above, you are doing exactly the same number of operations, which should mean that both versions run equally fast. What you should optimize are things like i = i + 1, which is 4 operations, into i++, which is 2 operations (this is only an example; TI-BASIC doesn't support the ++ operator).
This does not mean that all operations take exactly the same time. Internally, an operation may call hundreds of other functions, or it may be as simple as updating a single variable. The programmers of the interpreter may also have implemented various peephole optimizations that optimize very specific language constructs. For example, for(int i = 0; i < count; i++) could either be implemented as a collection of expensive interpreter functions that treat i as a generic variable, or it could be optimized into a compiled loop that only has to update the variable i and reevaluate the count.
Now, not all interpreted languages are doomed to this pale existence. For example, JavaScript used to be one, but these days all major JavaScript engines JIT-compile the code to run directly on the CPU.
UPDATE: Clarified that not all operations are created equal.
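To illustrate the per-operation overhead described above, here is a toy stack-machine interpreter in Python. The opcodes are invented for this sketch and have nothing to do with how the TI-BASIC interpreter is really built, so the exact counts will differ on a real system. It only shows that a fused operation pays the dispatch cost fewer times than the generic sequence.

```python
def run(program, env):
    """Execute a list of (opcode, argument) pairs; return the number of dispatches."""
    stack = []
    dispatches = 0
    for op, arg in program:
        dispatches += 1                 # every operation pays the dispatch cost
        if op == "LOAD":
            stack.append(env[arg])
        elif op == "PUSH":
            stack.append(arg)
        elif op == "ADD":
            b = stack.pop()
            a = stack.pop()
            stack.append(a + b)
        elif op == "STORE":
            env[arg] = stack.pop()
        elif op == "INC":               # fused "increment variable" operation
            env[arg] += 1
        else:
            raise ValueError(f"unknown opcode: {op}")
    return dispatches


generic = [("LOAD", "i"), ("PUSH", 1), ("ADD", None), ("STORE", "i")]
fused = [("INC", "i")]

env_a = {"i": 0}
env_b = {"i": 0}
generic_cost = run(generic, env_a)
fused_cost = run(fused, env_b)
print(generic_cost, fused_cost)  # 4 1
```

Both programs leave i equal to 1, but the generic form goes through the dispatch loop four times and the fused form once.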

Absolutely, it makes a difference. I wrote a full-scale color RPG for the TI-84+CSE, and let me tell you, without optimizing any of my code, the game would flat out not run. At present, on the CSE, Sorcery of Uvutu can only run if every other program is archived and everything else is cleared out of RAM. The programs and data storage alone take up 20 KB of RAM, just 1 KB short of all available user memory. With all the variables in use, free memory gets dangerously low. I had points in my development where, due to poor optimization, I couldn't even start the game without getting a "memory all gone" error. I had plans to implement various extra things, but due to space and speed concerns, it was impossible to do so. And that's only the space consideration.
In the speed department, the game became, and still is, slow in the overworld. Walking around in the overworld is painfully slow compared to other games, and that's because of what I have to do in that code: I have to check for collisions, check if the user is moving to a new map, check if they pressed a key that should elicit a response, check if a battle should start, and more. I was only able to make slight optimizations to the walking speed, but even then, I could clearly tell I had made improvements. It was still pretty awfully slow (at least compared to every other port I've made), but I made it a little more tolerable.
In summary, through my own experiences crafting a large project, I can say that in TI-Basic, optimizing code does make a difference. Other answers mentioned this, but TI-Basic is an interpreted language. This means the code isn't compiled into faster, lower level code, but the stuff that you put in the program is read straight out as it executes, is interpreted by the interpreter, calls the subroutines and other stuff it needs to to execute the commands, and then returns back to read the next line. As a result of that, and the fact that the TI-84+ series CPU, the Zilog Z80, was designed in 1976, you get a rather slow interpreter, especially for this day and age. As such, the fewer the commands you run, and the more you take advantage of system weirdness such as Ans being the fastest variable that can also hold the most types of data (integers/floats, strings, lists, matrices, etc), the better the performance you're gonna get.
Sources: My own experiences, documented here: https://codewalr.us/index.php?topic=778.msg27190#msg27190
TI-84+CSE RAM numbers came from here: https://education.ti.com/en/products/calculators/graphing-calculators/ti-84-plus-c-se?category=specifications
Information about the Z80 came from here: http://segaretro.org/Zilog_Z80

It depends. If it's just a basic math program, then no. For big games, YES. The TI-84 has only 3.5MB of space available and combines an ancient Z80 processor with a whopping 128KB of RAM. TI-BASIC is also quite slow because it's interpreted (look it up for further information), so if you want to make fast-running games, then YES, optimization is very important.

Usage of frame pointer optimizations

This is related to, but NOT the same as, frame pointer omission. Is there any risk?
I am trying to follow this old (but still relevant) article:
http://blogs.msdn.com/b/larryosterman/archive/2007/03/12/fpo.aspx
Larry (the author) writes:
machines got sufficiently faster since 1995 that the performance
improvements that were achieved by FPO weren't sufficient to counter
the pain in debugging and analysis that FPO caused
However, in the discussion further down the page, one user writes:
Disabling FPO can have both serious code size and performance impact.
Tail call optimizations have to be disabled when a frame pointer is
present, leading to much greater stack usage in affected paths. Small
functions are also disproportionately affected by prolog/epilog code.
Third, although there are still six registers available with a frame
pointer on X86, only three of them are nonvolatile with respect to
nested calls: EBX, ESI, and EDI. Opening up a fourth register can drop
out a bunch of spill code.
I have a couple of questions.
Spill code == Register spillage?
Is the author correct that FPO is generally considered a pain and that
the gain does not outweigh the cost?
Is FPO still relevant today on the x64 architecture, since there are a
LOT more registers to play with?
Do you use FPO? What for (if yes) and does it make a difference to
you?
Finally in this article
http://www.altdevblogaday.com/2012/05/24/x64-abi-intro-to-the-windows-x64-calling-convention/
The author says
[with respect to the Windows x64 calling convention].....
All parameters have space reserved on the stack, even the ones passed in registers. In fact, there’s stack space for 4 parameters
even if your function doesn’t have any params. Those parameters are 8
bytes so that’s at least 32 bytes on the stack for every function
(every function actually has at least 48 bytes on the stack…I’ll
explain that another time). This stack area is called the home space.
There are few reasons behind this home space:
If the registers need to be used for something else, the called function can store the data in the home space without moving the stack
pointer.
It keeps the stack structure easy to determine. That’s very handy for debugging, and perhaps necessary for x64’s stack metadata (another
point I’ll come back to another time). ...... The compiler can use it
for whatever it wants, and an optimized build will likely make great
use of it.
Wouldn't an optimized build optimize the excess allocation away?

1. Spill code == Register spillage?
Almost. Strictly speaking, spill code is the code added by the compiler to implement a register spill. The spill itself is the decision to tag a live range as not able to be placed in a register.
2. Is the author correct that FPO is generally considered a pain and that the gain does not outweigh the cost?
The author is probably correct that on modern processor architectures, the set of functions where FPO generates a significant performance gain is smaller than in the past. Yet FPO does make code smaller, reducing cache pressure. It does reduce register pressure. These can be important in some settings. It does speed up prolog and epilog code by a few instructions. The point about debuggers not working well without the FP is noteworthy. It means core dumps are less useful for post-mortems on production-optimized code. You'd never want to use FPO during development except for final testing.
3. Is FPO still relevant today on the x64 architecture, since there are a LOT more registers to play with?
Modern processors are so various and complex that you just about never know what's "relevant" until you try it and measure.
4. Do you use FPO? What for (if yes) and does it make a difference to you?
I have written a medium-size C library (20K SLOC) where it made a small (~5%) difference in run time overall under gcc. This was a native language extension to a scripting language that had to compile under both gcc and Visual C. Using it would have split the build path. I decided 5% was not worth it for the purpose the extension served. But if it had been a dynamic fluid simulation to predict the weather, 5% could have been worth many millions of dollars. The decision would have been different.
5. Wouldn't an optimized build optimize the excess allocation away?
That's entirely up to the compiler and optimizer designer. It looks from the MS documentation here that MS has defined the ABI to require home space for all data even if its whole lifetime is spent in a register.

1) When you need to use a register and don't have any unused ones, you need to write code to save some register value on the stack and later restore it.
2) FPO was a pain back when unwinding was primarily done by walking the stack. Nowadays standard unwind ABIs exist anyway (e.g. to enable exception handling), so the information already exists, and is organized more efficiently (away from the hot code), so there's no pain. Sure, there would be some pain if you wrote all your machine code by hand, but that's not the typical use case.
3) Typical x86_64 ABIs don't use frame pointers at all (except when absolutely necessary, like for variable-length arrays in C).
4) I'm not a compiler. My compiler doesn't generate frame pointers.
Optimize excess away) Not sure what your question is. The space consumption for the home area isn't a problem. The benefit of not having to adjust any stack pointers is a big advantage, since you need a lot less code. The same goes for the red zone just beyond the stack frame, which allows leaf code to use a lot of memory without ever needing any stack pointer gymnastics.

Generic advice on reducing GC time in GHC

Are there any generic rules to follow in order to discover the cause when a GHC-compiled program spends too much time doing garbage collection? And what would generally be considered too much? For example, in general, is 60% productivity acceptable or is it an indication that something is likely wrong with the code?

Here's a quick and very incomplete list:
1. Test and benchmark. One of Haskell's few weaknesses is the difficulty in predicting time and space costs. If you don't have test data you've got nothing.
2. Use better algorithms. This sounds too simple, but optimizing inefficient algorithms is like wrapping s**t in gold.
3. Strategically make some data more strict. Test and Benchmark! The goal is to store the physically smaller WHNF value rather than the thunk that produces it, thereby cleaning up more garbage in the most efficient first pass. Look for complicated functions that produce simple data.
4. Strategically make some data less strict. Test and Benchmark! The goal is to delay production of a large amount of data until just before it is used and discarded, thereby cleaning up more garbage in the most efficient first pass. Look for simple functions that produce large complex data. See also comonads.
5. Strategically make use of arrays and unboxed types; in particular, see #2 with regard to the ST monad. Test and Benchmark! All of these fit more raw data into smaller, more compact memory. There is less garbage to collect.
6. Fiddle with the RTS settings (GHC specific). Test and Benchmark! The goal is to "impedance match" the GC with the memory needs of your program. I get even more lost here than in 1-5, so ask the experts on this one.
Better garbage collection has a fairly simple premise: create less garbage, collect it sooner, produce fewer memory allocations/deallocations. Anything you can do that might result in one of these three effects is worth a shot. Test and Benchmark!

Compiler optimizations: Where/how can I get a feel for what the payoff is for different optimizations?

In my independent study of various compiler books and web sites, I am learning about many different ways that a compiler can optimize the code that is being compiled, but I am having trouble figuring out how much of a benefit each optimization will tend to give.
How do most compiler writers go about deciding which optimizations to implement first? Or which optimizations are worth the effort or not worth the effort? I realize that this will vary between types of code and even individual programs, but I'm hoping that there is enough similarity between most programs to say, for instance, that one given technique will usually give you a better performance gain than another technique.

I found when implementing textbook compiler optimizations that some of them tended to reverse the improvements made by other optimizations. This entailed a lot of work trying to find the right balance between them.
So there really isn't a good answer to your question. Everything is a tradeoff. Many optimizations work well on one type of code, but are pessimizations for other types. It's like designing a house - if you make the kitchen bigger, the pantry gets smaller.
The real work in building an optimizer is trying out the various combinations, benchmarking the results, and, like a master chef, picking the right mix of ingredients.

Tongue in cheek:
Hubris
Benchmarks
Embarrassment
More seriously, it depends on your compiler's architecture and goals. Here's one person's experience...
Go for the "big payoffs":
native code generation
register allocation
instruction scheduling
Go for the remaining "low hanging fruit":
strength reduction
constant propagation
copy propagation
Keep benchmarking.
Look at the output; fix anything that looks stupid.
It is usually the case that combining optimizations, or even repeating optimization passes, is more effective than you might expect. The benefit is more than the sum of the parts.
You may find that introduction of one optimization may necessitate another. For example, SSA with Briggs-Chaitin register allocation really benefits from copy propagation.
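To give a feel for what one of the "low hanging fruit" passes looks like, here is a minimal sketch of constant folding, the simplest relative of the constant propagation listed above. It is written against Python's standard `ast` module (and needs Python 3.9+ for `ast.unparse`) purely for illustration. A real compiler would do this on its own intermediate representation, and the class and function names here are made up.

```python
import ast
import operator

# Operators this sketch knows how to fold.
FOLDABLE = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
}


class ConstantFolder(ast.NodeTransformer):
    """Replace binary operations on two numeric literals with their result."""

    def visit_BinOp(self, node):
        self.generic_visit(node)  # fold the operands first (bottom-up)
        fold = FOLDABLE.get(type(node.op))
        if (
            fold is not None
            and isinstance(node.left, ast.Constant)
            and isinstance(node.right, ast.Constant)
            and isinstance(node.left.value, (int, float))
            and isinstance(node.right.value, (int, float))
        ):
            value = fold(node.left.value, node.right.value)
            return ast.copy_location(ast.Constant(value=value), node)
        return node


def fold_constants(source):
    """Parse an expression, fold constant subexpressions, return new source."""
    tree = ast.parse(source, mode="eval")
    folded = ast.fix_missing_locations(ConstantFolder().visit(tree))
    return ast.unparse(folded)


print(fold_constants("x * (2 + 3) + 4 * 5"))  # x * 5 + 20
```

Note how little the pass does on its own: it cannot touch `x * 5 + 20` any further. That is part of why combining and repeating passes, as described above, tends to pay off.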

Historically, there are "algorithmic" optimizations from which the code should benefit in most cases, like loop unrolling (and compiler writers should implement those "documented" and "tested" optimizations first).
Then there are types of optimizations that could benefit from the type of processor used (like using SIMD instructions on modern CPUs).
See Compiler Optimizations on Wikipedia for a reference.
Finally, the various types of optimizations can be tested by profiling the code or by accurately timing repeated executions.

I'm not a compiler writer, but why not just incrementally optimize portions of your code, profiling all the while?
My optimization scheme usually goes:
1) make sure the program is working
2) find something to optimize
3) optimize it
4) compare the test results with what came out from 1; if they are different, then the optimization is actually a breaking change.
5) compare the timing difference
Incrementally, I'll get it faster.
I choose which portions to focus on by using a profiler. I'm not sure what extra information you'll garner by asking the compiler writers.
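The steps above can be sketched in Python roughly like this. The two functions are just stand-ins for "the working version" and "the optimized version"; their names and the closed-form formula are my own example, not anything from a real project. Step 2 (finding what to optimize) is where a profiler such as the standard `cProfile` module would come in.

```python
import timeit


def sum_of_squares_reference(n):
    """Step 1: a straightforward version that is known to work."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def sum_of_squares_optimized(n):
    """Step 3: candidate optimization using the closed-form formula."""
    return (n - 1) * n * (2 * n - 1) // 6


def check_and_time(n=10_000, repeats=20):
    # Step 4: the results must match, otherwise the "optimization" is a bug.
    expected = sum_of_squares_reference(n)
    actual = sum_of_squares_optimized(n)
    if actual != expected:
        raise AssertionError(f"optimization changed the result: {actual} != {expected}")
    # Step 5: compare timings.
    t_ref = timeit.timeit(lambda: sum_of_squares_reference(n), number=repeats)
    t_opt = timeit.timeit(lambda: sum_of_squares_optimized(n), number=repeats)
    return t_ref, t_opt


t_ref, t_opt = check_and_time()
print(f"reference: {t_ref:.6f}s  optimized: {t_opt:.6f}s")
```

The timings will vary from machine to machine, so the only thing the sketch treats as a hard failure is a changed result.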

This really depends on what you are compiling. There was a reasonably good discussion about this on the LLVM mailing list recently; it is of course somewhat specific to the optimizers they have available. They use abbreviations for a lot of their optimization passes, so if you are not familiar with any of the acronyms they are tossing around, you can look at their passes page for documentation. Ultimately, you can spend years reading academic papers on this subject.

This is one of those topics where academic papers (ACM perhaps?) may be one of the better sources of up-to-date information. The best thing to do if you really want to know could be to create some code in unoptimized form and some in the form that the optimization would take (loops unrolled, etc) and actually figure out where the gains are likely to be using a compiler with optimizations turned off.

It is worth noting that in many cases, compiler writers will NOT spend much time, if any, on ensuring that their libraries are optimized. Benchmarks tend to de-emphasize or even ignore library differences, presumably because you can just use different libraries. For example, the permutation algorithms in GCC are asymptotically* less efficient than they could be when trying to permute complex data. This relates to incorrectly making deep copies during calls to swap functions. This will likely be corrected in most compilers with the introduction of rvalue references (part of the C++0x standard). Rewriting the STL to be much faster is surprisingly easy.
*This assumes the size of the class being permuted is variable. E.g. permuting a vector of vectors of ints would slow down if the vectors of ints were larger.

One that can give big speedups but is rarely done is to insert memory prefetch instructions. The trick is to figure out what memory the program will be wanting far enough in advance, never ask for the wrong memory and never overflow the D-cache.

Overhead of using bignums

I have hit upon the problem of whether to use bignums as the default datatype in my language whenever numbers are involved. I've evaluated this myself and reduced it to a convenience-and-comfort vs. performance question. The answer to that question depends on how large the performance hit is in programs that aren't getting optimized.
How small is the overhead of using bignums in places where a fixnum or integer would have sufficed? How small can it be in the best implementations? What kind of implementations reach the smallest overhead, and what kind of additional tradeoffs do they result in?
What kind of hit to overall language performance can I expect if I make my language default to bignums?

You can perhaps look at how Lisp does it. It will almost always do exactly the right thing and implicitly convert the types as it becomes necessary. It has fixnums ("normal" integers), bignums, ratios (reduced fractions represented as a pair of integers) and floats (in different sizes). Only floats have a precision error, and they are contagious, i.e. once a calculation involves a float, the result is a float, too. "Practical Common Lisp" has a good description of this behaviour.

To be honest, the best answer is "try it and see".
Clearly bignums can't be as efficient as native types, which typically fit in a single CPU register, but every application is different - if yours doesn't do a whole load of integer arithmetic then the overhead could be negligible.

Come to think of it... I don't think it will cost much performance at all.
Bignums, by nature, use a very large base, say 65536 or larger, which is usually the maximum possible value of a traditional fixnum or integer.
I don't know how large you would make the bignum's base, but if you make it large enough, a bignum used in place of a fixnum or integer would never exceed its first bignum-digit, so the operation would be nearly identical to a normal fixnum/int operation.
This opens an opportunity for optimization: for a bignum that never grows beyond its first bignum-digit, you could use an uber-fast one-bignum-digit operation.
You would then switch over to the n-digit algorithms when the second bignum-digit is needed.
This could be implemented with a bit flag and a check on every arithmetic operation. Roughly, you could use the highest-order bit to signify a bignum: if a data block has its highest-order bit set to 0, process it as a normal fixnum/int, but if it is set to 1, parse the block as a bignum structure and use the bignum algorithms from there.
That should avoid performance hits on simple loop iterator variables, which I think are the most likely source of performance hits.
It's just my rough thinking though, a suggestion since you should know better than me :-)
p.s. sorry, forgot what the technical terms of bignum-digit and bignum-base were
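Here is a rough Python sketch of that fast-path idea. Several things in it are assumptions for illustration only: a 62-bit fixnum range (a 64-bit word minus two tag bits), a range check instead of an actual tag bit, and Python's built-in int standing in for the n-digit bignum routines. The counters just show which path each addition took.

```python
FIXNUM_BITS = 62                      # assumed: 64-bit word minus 2 tag bits
FIXNUM_MAX = (1 << (FIXNUM_BITS - 1)) - 1
FIXNUM_MIN = -(1 << (FIXNUM_BITS - 1))

path_counts = {"fixnum": 0, "bignum": 0}


def is_fixnum(value):
    return FIXNUM_MIN <= value <= FIXNUM_MAX


def bignum_add(a, b):
    """Slow path placeholder; a real runtime would run its n-digit algorithm here."""
    path_counts["bignum"] += 1
    return a + b  # Python's built-in int stands in for the bignum library


def add(a, b):
    """Add two integers, using the single-word path whenever it is safe."""
    if is_fixnum(a) and is_fixnum(b):
        result = a + b
        if is_fixnum(result):         # the overflow check the fast path must pay
            path_counts["fixnum"] += 1
            return result
    return bignum_add(a, b)


total = 0
for i in range(1000):                 # a typical small loop counter workload
    total = add(total, i)
big = add(FIXNUM_MAX, 1)              # overflows the fixnum range
print(total, path_counts)             # 499500 {'fixnum': 1000, 'bignum': 1}
```

In this sketch the loop never leaves the fast path, so the extra cost per addition is the range check, and only the one overflowing addition falls back to the slow path. How expensive that check is in a real implementation is something you would have to measure.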

your reduction is correct, but the choice depends on the performance characteristics of your language, which we cannot possibly know!
once you have your language implemented, you can measure the performance difference, and perhaps offer the programmer a directive to choose the default

You will never know the actual performance hit until you create your own benchmark, as the results will vary per language, per language revision, and per CPU. There's no language-independent way to measure this, except for the obvious fact that a 32-bit integer uses twice the memory of a 16-bit integer.

How small is the overhead of using bignums in places where a fixnum or integer would have sufficed? How small can it be in the best implementations?
The bad news is that even in the best possible software implementation, BigNum is going to be slower than the built-in arithmetic by orders of magnitude (i.e. anything from a factor of 10 up to a factor of 1000).
I don't have exact numbers but I don't think exact numbers will help very much in such a situation: If you need big numbers, use them. If not, don't. If your language uses them by default (which language does? some dynamic languages do …), think whether the disadvantage of switching to another language is compensated for by the gain in performance (which it should rarely be).
(Which could roughly be translated to: there's a huge difference but it shouldn't matter. If (and only if) it matters, use another language because even with the best possible implementation, this language evidently isn't well-suited for the task.)

I totally doubt that it would be worth it, unless it is very domain-specific.
The first thing that comes to mind are all the little for loops throughout programs, are the little iterator variables all gonna be bignums? That's scary!
But if your language is rather functional... then maybe not.
