Semantics of GCC hot attribute - optimization

Assume I have a compilation unit consisting of three functions, A, B, and C. A is invoked once from a function external to the compilation unit (e.g. it's an entry point or callback); B is invoked many times by A (e.g. it's invoked in a tight loop); C is invoked once by each invocation of B (e.g. it's a library function).
The entire path through A (passing through B and C) is performance-critical, though the performance of A itself is non-critical (as most time is spent in B and C).
What is the minimal set of functions which one should annotate with __attribute__ ((hot)) to effect more aggressive optimization of this path? Assume we cannot use -fprofile-generate.
Equivalently: Does __attribute__ ((hot)) mean "optimize the body of this function", "optimize calls to this function", "optimize all descendant calls this function makes", or some combination thereof?
The GCC info page does not clearly address these questions.

Official documentation:
hot
The hot attribute on a function is used to inform the compiler that the function is a hot spot of the compiled program. The function is optimized more aggressively and on many target it is placed into special subsection of the text section so all hot functions appears close together improving locality.
When profile feedback is available, via -fprofile-use, hot functions are automatically detected and this attribute is ignored.
The hot attribute on functions is not implemented in GCC versions earlier than 4.3.
The hot attribute on a label is used to inform the compiler that path following the label are more likely than paths that are not so annotated. This attribute is used in cases where __builtin_expect cannot be used, for instance with computed goto or asm goto.
The hot attribute on labels is not implemented in GCC versions earlier than 4.8.
2007:
__attribute__((hot))
Hint that the marked function is "hot" and should be optimized more aggresively and/or placed near other "hot" functions (for cache locality).
Gilad Ben-Yossef:
As their name suggests, these function attributes are used to hint the compiler that the corresponding functions are called often in your code (hot) or seldom called (cold).
The compiler can then order the code in branches, such as if statements, to favour branches that call these hot functions and disfavour functions cold functions, under the assumption that it is more likely that that the branch that will be taken will call a hot function and less likely to call a cold one.
In addition, the compiler can choose to group together functions marked as hot in a special section in the generated binary, on the premise that since data and instruction caches work based on locality, or the relative distance of related code and data, putting all the often called function together will result in better caching of their code for the entire application.
Good candidates for the hot attribute are core functions which are called very often in your code base. Good candidates for the cold attribute are internal error handling functions which are called only in case of errors.
So, according to these sources, __attribute__ ((hot)) means:
optimize calls to this function
optimize the body of this function
put body of this function to .hot section (to group all hot code in one location)
After source code analysis we can say that "hot" attribute is checked with (lookup_attribute ("hot", DECL_ATTRIBUTES (current_function_decl)); and when it is true, the functions's node->frequency is set to NODE_FREQUENCY_HOT (predict.c, compute_function_frequency()).
If the function has frequency as NODE_FREQUENCY_HOT,
If there is no profile information and no likely/unlikely on branches, maybe_hot_frequency_p will return true for the function (== "...frequency FREQ is considered to be hot."). This turns value of maybe_hot_bb_p into true for all Basic Blocks (BB) in the function ("BB can be CPU intensive and should be optimized for maximal performance.") and maybe_hot_edge_p true for all edges in function. In turn in non -Os-modes these BB and edges and also loops will be optimized for speed, not for size.
For all outbound call edges from this function, cgraph_maybe_hot_edge_p will return true ("Return true if the call can be hot."). This flag is used in IPA (ipa-inline.c, ipa-cp.c, ipa-inline-analysis.c) and influence inline and cloning decisions

Related

Module expansion of goals passed to library meta-predicates

Using SWI-Prolog (Multi-threaded, 64 bits, Version 7.3.5),
we proceed step by step:
Define dcg nonterminal a//1 in module dcgAux (pronounced: "di-SEE-goh"):
:- module(dcgAux,[a//1]).
a(0) --> [].
a(s(N)) --> [a], a(N).
Run the following queries—using phrase/2 and apply:foldl/4:
?- use_module([library(apply),dcgAux]).
true.
?- phrase( foldl( a,[s(0),s(s(0))]),[a,a,a]).
true.
?- phrase( foldl(dcgAux:a,[s(0),s(s(0))]),[a,a,a]).
true.
?- phrase(apply:foldl(dcgAux:a,[s(0),s(s(0))]),[a,a,a]).
true.
?- phrase(apply:foldl( a,[s(0),s(s(0))]),[a,a,a]).
ERROR: apply:foldl_/4: Undefined procedure: apply:a/3
нет! Quite a surprise—and not a good one. Have we been missing some unknown unknowns?
To get rid of above irritating behavior, we must first find out the reason(s) causing it:
?- import_module(apply,M), M=user.
false.
?- phrase(apply:foldl(a,[s(0),s(s(0))]),[a,a,a]).
ERROR: apply:foldl_/4: Undefined procedure: apply:a/3
?- add_import_module(apply,user,end).
true.
?- import_module(apply,M), M=user. % sic!
M = user. % `?- import_module(apply,user).` fails!
?- phrase(apply:foldl(a,[s(0),s(s(0))]),[a,a,a]).
true.
What's going on? The way I see it is this:
Module expansion of the goal passed to foldl/4 is limited.
Quoting from the SWI-Prolog manual page on import_module/2:
All normal modules only import from user, which imports from system.
SWI's library(apply) only "inherits" from system, but not user.
If we clone module apply to applY (and propagate the new module name), we observe:
?- use_module(applY).
true.
?- phrase(applY:foldl(a,[s(0),s(s(0))]),[a,a,a]). % was: ERROR
true. % now: OK!
Please share your ideas on how I could/should proceed!
(I have not yet run a similar experiment with other Prolog processors.)
This is an inherent feature/bug of predicate based module systems in the Quintus tradition. That is, this module system was first developed for Quintus Prolog. It was adopted subsequently by SICStus (after 0.71), then (more or less) by 13211-2, then by YAP, and (with some modifications) by SWI.
The problem here is what exactly an explicit qualification means. As long as the goal is no meta-predicate, things are trivially resolvable: Take the module of the innermost qualification. However, once you have meta-predicates, the meta-arguments need to be informed of that module ; or not. If the meta-arguments are informed, we say that the colon sets the calling context, if not, then some other means is needed for that purpose.
In Quintus tradition, the meta-arguments are taken into account. With the result you see. As a consequence you cannot compare two implementations of the same meta-predicate in the same module directly. There are other approaches most notably IF and ECLiPSe that do not change the calling context via the colon. This has advantages and disadvantages. The best is to compare them case by case.
Here is a recent case. Take lambdas and how they are put into a module
in SICStus, in SWI, and in ECLiPSe.
As for the Quintus/SICStus/YAP/SWI module system, I'd rather use it in the most conservative manner possible. That is:
no explicit qualification, consider the infix : as something internal
clean, checkable meta-declarations - insert on purpose an undefined predicate just to see whether or not cross-referencing is able to detect the problem (in SWI that's check or make).
use the common subset, avoid the many bells and whistles - there are many well meant extensions...
do more versatile things the pedestrian way: reexport by adding an appropriate module and add a dummy definition. Similarly, instead of renaming, import things from an interface module.
be always aware that module systems have inherently some limits. No matter how you twist or turn it. There is no completely seamless module system, for the very purpose of modules is to separate code and concerns.
1: To be precise, SICStus' adaptation of Quintus modules did only include : for module sensitive arguments in meta_predicate declarations. The integers 0..9, which are so important for higher-order programming based on call/N were only introduced about 20 years later in
4.2.0 released 2011-03-08.

How does gcc push local variables on to the stack?

void
f
()
{
int a[1];
int b;
int c;
int d[1];
}
I have found that these local variables, for this example, are not pushed on to the stack in order. b and c are pushed in the order of their declaration, but, a and d are grouped together. So the compiler is allocating arrays differently from any other built in type or object.
Is this a C/C++ requirement or gcc implementation detail?
The C standard says nothing about the order in which local variables are allocated. It doesn't even use the word "stack". It only requires that local variables have a lifetime that begins on entry to the nearest enclosing block (basically when execution reaches the {) and ends on exit from that block (reaching the }), and that each object has a unique address. It does acknowledge that two unrelated variables might happen to be adjacent in memory (for obscure technical reasons involving pointer arithmetic), but doesn't say when this might happen.
The order in which variables are allocated is entirely up to the whim of the compiler, and you should not write code that depends on any particular ordering. A compiler might lay out local variables in the order in which they're declared, or alphabetically by name, or it might group some variables together if that happens to result in faster code.
If you need to variables to be allocated in a particular order, you can wrap them in an array or a structure.
(If you were to look at the generated machine code, you'd most likely find that the variables are not "pushed onto the stack" one by one. Instead, the compiler will probably generate a single instruction to adjust the stack pointer by a certain number of bytes, effectively allocating a single chunk of memory to hold all the local variables for the function or block. Code that accesses a given variable will then use its offset within the stack frame.)
And since your function doesn't do anything with its local variables, the compiler might just not bother allocating space for them at all, particularly if you request optimization with -O3 or something similar.
The compiler can order the local variables however it wants. It may even choose to either not allocate them at all (for example, if they're not used, or are optimized away through propagation/ciscizing/keeping in register/etc) or allocate the same stack location for multiple locals that have disjoint live ranges.
There is no common implementation detail to outline how a particular compiler does it, as it may change at any time.
Typically, compilers will try to group similar sized variables (and/or alignments) together to minimize wasted space through "gaps", but there are so many other factors involved.
structs and arrays have slightly different requirements, but that's beyond the scope of this question I believe.

Variable Encapsulation in Case Statement

While modifying an existing program's CASE statement, I had to add a second block where some logic is repeated to set NetWeaver portal settings. This is done by setting values in a local variable, then assigning that variable to a Changing parameter. I copied over the code and did a Pretty Print, expecting to compiler to complain about the unknown variable. To my surprise however, this code actually compiles just fine:
CASE i_actionid.
WHEN 'DOMIGO'.
DATA: ls_portal_actions TYPE powl_follow_up_sty.
CLEAR ls_portal_actions.
ls_portal_actions-bo_system = 'SAP_ECC_Common'.
" [...]
c_portal_actions = ls_portal_actions.
WHEN 'EBELN'.
ls_portal_actions-bo_system = 'SAP_ECC_Common'.
" [...]
C_PORTAL_ACTIONS = ls_portal_actions.
ENDCASE.
As I have seen in every other programming language, the DATA: declaration in the first WHEN statement should be encapsulated and available only inside that switch block. Does SAP ignore this encapsulation to make that value available in the entire CASE statement? Is this documented anywhere?
Note that this code compiles just fine and double-clicking the local variable in the second switch takes me to the data declaration in the first. I have however not been able to test that this code executes properly as our testing environment is down.
In short you cannot do this. You will have the following scopes in an abap program within which to declare variables (from local to global):
Form routine: all variables between FORM and ENDFORM
Method: all variables between METHOD and ENDMETHOD
Class - all variables between CLASS and ENDCLASS but only in the CLASS DEFINITION section
Function module: all variables between FUNCTION and ENDFUNCTION
Program/global - anything not in one of the above is global in the current program including variables in PBO and PAI modules
Having the ability to define variables locally in a for loop or if is really useful but unfortunately not possible in ABAP. The closest you will come to publicly available documentation on this is on help.sap.com: Local Data in the Subroutine
As for the compile process do not assume that ABAP will optimize out any variables you do not use it won't, use the code inspector to find and remove them yourself. Since ABAP works the way it does I personally define all my variables at the start of a modularization unit and not inline with other code and have gone so far as to modify the pretty printer to move any inline definitions to the top of the current scope.
Your assumption that a CASE statement defines its own scope of variables in ABAP is simply wrong (and would be wrong for a number of other programming languages as well). It's a bad idea to litter your code with variable declarations because that makes it awfully hard to read and to maintain, but it is possible. The DATA statements - as well as many other declarative statements - are only evaluated at compile time and are completely ignored at runtime. You can find more information about the scopes in the online documentation.
The inline variable declarations are now possible with the newest version of SAP Netweaver. Here is the link to the documentation DATA - inline declaration. Here are also some guidelines of a good and bad usage of this new feature
Here is a quote from this site:
A declaration expression with the declaration operator DATA declares a variable var used as an operand in the current writer position. The declared variable is visible statically in the program from DATA(var) and is valid in the current context. The declaration is made when the program is compiled, regardless of whether the statement is actually executed.
Personally have not had time to check it out yet, because of lack of access to such system.

Use of recursion in Scala when run in the JVM

From searching elsewhere on this site and the web, tail call optimization is not supported by the JVM. Does that therefore mean that tail recursive Scala code such as the following, which may run on very large input lists, should not be written if it is to run on the JVM?
// Get the nth element in a list
def nth[T](n : Int, list : List[T]) : T = list match {
case Nil => throw new IllegalArgumentException
case _ if n == 0 => throw new IllegalArgumentException
case _ :: tail if n == 1 => list.head
case _ :: tail => nth(n - 1, tail)
}
Martin Odersky's Scala by Example contains the following paragragh which seems to suggests that there are circumstances or other environments where recursion is appropriate:
In principle, tail calls can always re-use the stack frame of the calling
function. However, some run-time environments (such as the Java VM) lack the
primitives to make stack frame re-use for tail calls efficient. A production quality
Scala implementation is therefore only required to re-use the stack frame of a di-
rectly tail-recursive function whose last action is a call to itself. Other tail calls might
be optimized also, but one should not rely on this across implementations.
Can anyone explain what this middle two sentences of this paragraph mean?
Thank you!
Since direct tail recursion is equivalent to a while loop, your example will run efficiently on the JVM because the Scala compiler can compile this to a loop under the hood, simply using a jump. General TCO however is not supported on the JVM, although there is available the tailcall() method which supports tail calls using compiler-generated trampolines.
To ensure that the compiler can correctly optimize a tail-recursive function to a loop, you can use the scala.annotation.tailrec annotation, which will cause a compiler error if the compiler cannot make the desired optimization:
import scala.annotation.tailrec
#tailrec def nth[T](n : Int, list : List[T]) : Option[T] = list match {
case Nil => None
case _ if n == 0 => None
case _ :: tail if n == 1 => list.headOption
case _ :: tail => nth(n - 1, tail)
}
(screw IllegalArgmentException!)
In principle, tail calls can always re-use the stack frame of the calling
function. However, some runtime environments (such as the Java VM) lack the
primitives to make stack frame re-use for tail calls efficient. A production quality
Scala implementation is therefore only required to re-use the stack frame of a di
rectly tail-recursive function whose last action is a call to itself. Other tail calls might
be optimized also, but one should not rely on this across implementations.
Can anyone explain what this middle two sentences of this paragraph mean?
Tail recursion is a special case of a tail call. Direct tail recursion is a special case of tail recursion. Only direct tail recursion is guaranteed to be optimized. Others may be optimized, too, but that's basically just a compiler optimization. As a language feature, Scala only guarantees direct tail recursion elimination.
So, what's the difference?
Well, a tail call is simply the last call in a subroutine:
def a = {
b
c
}
In this case, the call to c is a tail call, the call to b is not.
Tail recursion is when a tail call calls a subroutine that was already called before:
def a = {
b
}
def b = {
a
}
This is tail recursion: a calls b (a tail call), which in turn calls a again. (In contrast to the direct tail recursion described below, this is sometimes called indirect tail recursion.)
However, none of the two examples will get optimized by Scala. Or, more precisely: a Scala implementation is allowed to optimize them, but it is not required to do so. This is in contrast to, e.g. Scheme, where the language specification guarantees that all of the above cases will take O(1) stack space.
The Scala Language Specification only guarantees that direct tail recursion is optimized, i.e. when a subroutine directly calls itself with no other calls in between:
def a = {
b
a
}
In this case, the call to a is a tail call (because it is the last call in the subroutine), it is tail recursion (because it calls itself again) and most importantly it is direct tail recursion, because a directly calls itself without going through another call first.
Note that there are many subtle things that may lead to a method not being directly tail-recursive. For example, if a is overloaded, then the recursion may actually go through different overloads, and thus would no longer be direct.
In practice, this means two things:
you cannot perform an Extract Method Refactoring on a tail-recursive method, at least not including the tail call, because this would turn a directly tail-recursive method (which will get optimized) into an indirectly tail-recursive method (which will not get optimized).
You can only use direct tail recursion. A tail-recursive descent parser, or a state machine, which can be very elegantly expressed using indirect tail recursion, are out.
The main reason for this is that when your underlying execution engine lacks powerful control flow manipulation features such as GOTO, continuations, first-class mutable stack or proper tail calls, then you need to either implement your own stack on top of it, use trampolines, make a global CPS transform or something similarly nasty, in order to provide generalized proper tail calls. All of these have either severe impact on performance or interoperability with other code on the same platform.
Or, as Rich Hickey, the creator of Clojure, said when he was facing the same problem: "Performance, Java interop, tail calls. Pick two." Both Clojure and Scala chose to compromise on tail calls and provide only tail recursion and not full tail calls.
To cut a long story short: yes, the specific example you posted will be optimized, since it is direct tail recursion. You can test this by putting an #tailrec annotation on the method. The annotation does not change whether or not the method gets optimized, it does however guarantee that you will get a compile error if the method can not be optimized.
Due to the above-mentioned subtleties, it is generally a good idea to put an #tailrec annotation on methods that you need to be optimized, both in order to get a compile error, but also as a hint to other developers maintaining your code.
The Scala compiler will attempt to optimize tail calls by "flattening" them into a loop that won't cause a continually expanding stack.
Of course, your code has to be optimizable for it to do so. If you use the annotation #tailrec before your method however (scala.annotation.tailrec) the compiler will REQUIRE the method be optimizable or fail to compile.
Martin's remark is saying that only directly self-recursive calls are candidates (other criteria being met) for the TCO optimization. Indirect, mutually recursive method pairs (or larger sets of recursive methods) cannot be so optimized.
Note that there are JVMs that support tail call optimization (IIRC, IBM's J9 does), it's just not a requirement in the JLS, and Oracle's implementation doesn't do it.

Difference between State, ST, IORef, and MVar

I am working through Write Yourself a Scheme in 48 Hours (I'm up to about 85hrs) and I've gotten to the part about Adding Variables and Assignments. There is a big conceptual jump in this chapter, and I wish it had been done in two steps with a good refactoring in between rather then jumping at straight to the final solution. Anyway…
I've gotten lost with a number of different classes that seem to serve the same purpose: State, ST, IORef, and MVar. The first three are mentioned in the text, while the last seems to be the favored answer to a lot of StackOverflow questions about the first three. They all seem to carry a state between consecutive invocations.
What are each of these and how do they differ from one another?
In particular these sentences don't make sense:
Instead, we use a feature called state threads, letting Haskell manage the aggregate state for us. This lets us treat mutable variables as we would in any other programming language, using functions to get or set variables.
and
The IORef module lets you use stateful variables within the IO monad.
All this makes the line type ENV = IORef [(String, IORef LispVal)] confusing - why the second IORef? What will break if I'll write type ENV = State [(String, LispVal)] instead?
The State Monad : a model of mutable state
The State monad is a purely functional environment for programs with state, with a simple API:
get
put
Documentation in the mtl package.
The State monad is commonly used when needing state in a single thread of control. It doesn't actually use mutable state in its implementation. Instead, the program is parameterized by the state value (i.e. the state is an additional parameter to all computations). The state only appears to be mutated in a single thread (and cannot be shared between threads).
The ST monad and STRefs
The ST monad is the restricted cousin of the IO monad.
It allows arbitrary mutable state, implemented as actual mutable memory on the machine. The API is made safe in side-effect-free programs, as the rank-2 type parameter prevents values that depend on mutable state from escaping local scope.
It thus allows for controlled mutability in otherwise pure programs.
Commonly used for mutable arrays and other data structures that are mutated, then frozen. It is also very efficient, since the mutable state is "hardware accelerated".
Primary API:
Control.Monad.ST
runST -- start a new memory-effect computation.
And STRefs: pointers to (local) mutable cells.
ST-based arrays (such as vector) are also common.
Think of it as the less dangerous sibling of the IO monad. Or IO, where you can only read and write to memory.
IORef : STRefs in IO
These are STRefs (see above) in the IO monad. They don't have the same safety guarantees as STRefs about locality.
MVars : IORefs with locks
Like STRefs or IORefs, but with a lock attached, for safe concurrent access from multiple threads. IORefs and STRefs are only safe in a multi-threaded setting when using atomicModifyIORef (a compare-and-swap atomic operation). MVars are a more general mechanism for safely sharing mutable state.
Generally, in Haskell, use MVars or TVars (STM-based mutable cells), over STRef or IORef.
Ok, I'll start with IORef. IORef provides a value which is mutable in the IO monad. It's just a reference to some data, and like any reference, there are functions which allow you to change the data it refers to. In Haskell, all of those functions operate in IO. You can think of it like a database, file, or other external data store - you can get and set the data in it, but doing so requires going through IO. The reason IO is necessary at all is because Haskell is pure; the compiler needs a way to know which data the reference points to at any given time (read sigfpe's "You could have invented monads" blogpost).
MVars are basically the same thing as an IORef, except for two very important differences. MVar is a concurrency primitive, so it's designed for access from multiple threads. The second difference is that an MVar is a box which can be full or empty. So where an IORef Int always has an Int (or is bottom), an MVar Int may have an Int or it may be empty. If a thread tries to read a value from an empty MVar, it will block until the MVar gets filled (by another thread). Basically an MVar a is equivalent to an IORef (Maybe a) with extra semantics that are useful for concurrency.
State is a monad which provides mutable state, not necessarily with IO. In fact, it's particularly useful for pure computations. If you have an algorithm that uses state but not IO, a State monad is often an elegant solution.
There is also a monad transformer version of State, StateT. This frequently gets used to hold program configuration data, or "game-world-state" types of state in applications.
ST is something slightly different. The main data structure in ST is the STRef, which is like an IORef but with a different monad. The ST monad uses type system trickery (the "state threads" the docs mention) to ensure that mutable data can't escape the monad; that is, when you run an ST computation you get a pure result. The reason ST is interesting is that it's a primitive monad like IO, allowing computations to perform low-level manipulations on bytearrays and pointers. This means that ST can provide a pure interface while using low-level operations on mutable data, meaning it's very fast. From the perspective of the program, it's as if the ST computation runs in a separate thread with thread-local storage.
Others have done the core things, but to answer the direct question:
All this makes the line type ENV =
IORef [(String, IORef LispVal)]
confusing. Why the second IORef? What
will break if I do type ENV = State
[(String, LispVal)] instead?
Lisp is a functional language with mutable state and lexical scope. Imagine you've closed over a mutable variable. Now you've got a reference to this variable hanging around inside some other function -- say (in haskell-style pseudocode) (printIt, setIt) = let x = 5 in (\ () -> print x, \y -> set x y). You now have two functions -- one prints x, and one sets its value. When you evaluate printIt, you want to lookup the name of x in the initial environment in which printIt was defined, but you want to lookup the value that name is bound to in the environment in which printIt is called (after setIt may have been called any number of times).
There are ways besids the two IORefs to do this, but you certainly need more than the latter type you've proposed, which doesn't allow you to alter the values that names are bound to in a lexically-scoped fashion. Google the "funargs problem" for a whole lot of interesting prehistory.