OCaml modules and performance

Some functions are really easy to implement in OCaml (for example, map over a list), but you can also use the version in the OCaml standard library: List.map.
However, we can wonder which code will be more efficient. Calling into a separate compilation unit (a library) may rule out some optimizations. I read on the fa.caml newsgroup that calls to library functions go through closures.
I have OCaml code in production that uses modules and functors for generic programming. For historical reasons my code is monolithic: all in one file. Now that I have more time, I would like to split the code into one file per module. However, I'm afraid I might lose performance, as it took me a while to get it right. For example, I have modules that wrap complex objects with numbers, so that I enforce a unique representation and get fast comparison. I use those wrapped objects with generic Maps and Sets, and build caches on top of them.
The questions are:
Am I going to lose performance if I move to separate files?
Does OCaml do many optimizations on code full of modules, functors, etc.?
In C++, if you define a class method in a .h file, the compiler may end up inlining short methods, etc. Is it possible to achieve that in OCaml using separate files?

You may lose some performance. However, there are two mitigating factors:
The OCaml native-code compiler can do cross-module inlining, so code can be inlined even across separate compilation units (with a couple of caveats: recursive functions and function arguments are not inlined across modules[1]).
The code will still quite possibly be fast enough, and the gains in readability and maintainability will quite possibly outweigh any (marginal) performance cost.
I do not know whether OCaml defunctorizes code when the functors are defined in the same source file. If it does not, then splitting into separate modules shouldn't add any performance hit beyond that already incurred by the functors.
In general, it is my opinion that it is best to write straightforward, readable, maintainable code and not worry too much about microscopic performance characteristics like this unless the code proves to be too slow in practice.
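For concreteness, here is a minimal sketch of the kind of code the question describes: a functor that wraps values with unique integer tags so that comparison is a cheap integer comparison, and a Map built on the wrapped type. All names (Intern, Make, Name, NameMap) are hypothetical, and everything is shown in one file. In a split layout, Intern would live in its own .ml file; as far as I know, ocamlopt can only inline across units when the corresponding .cmx files are available at compile time.

```ocaml
(* Hypothetical sketch: in a split layout this module would be intern.ml. *)
module Intern = struct
  module type HASHED = sig
    type t
    val equal : t -> t -> bool
    val hash : t -> int
  end

  module Make (H : HASHED) = struct
    module Tbl = Hashtbl.Make (H)

    (* A wrapped value carries a unique integer tag. *)
    type t = { id : int; value : H.t }

    let table : t Tbl.t = Tbl.create 64
    let next_id = ref 0

    (* Return the existing wrapper for an equal value, or create a new one. *)
    let intern v =
      match Tbl.find_opt table v with
      | Some w -> w
      | None ->
        let w = { id = !next_id; value = v } in
        incr next_id;
        Tbl.add table v w;
        w

    (* Comparison only looks at the integer tag. *)
    let compare a b = Int.compare a.id b.id
    let equal a b = a.id = b.id
  end
end

(* Client code; in a split layout this would be a different file. *)
module Name = Intern.Make (struct
  type t = string
  let equal = String.equal
  let hash = Hashtbl.hash
end)

module NameMap = Map.Make (Name)

let alpha = Name.intern "alpha"
let beta = Name.intern "beta"
let alpha_again = Name.intern "alpha"
let cache = NameMap.(empty |> add alpha 1 |> add beta 2)
```

Whether the short compare function gets inlined into the Map code after a split is exactly the thing to measure, rather than assume, before and after moving the modules into separate files.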


OOP Performance after compilation

I got into a conversation with someone about OOP, who said that OOP costs too much performance. Now I know that in some cases it might, but as I see it, it would depend on different things.
Language execution. In languages that use an interpreter, I can see that it could be a possibility. But what about a compiled language like C++, or a half-compiled one like Java? If anything, it would just slow down compilation compared to C, but for native code or bytecode I would think the compilers would have optimized it to the point where this is not a problem.
Language structure. If we take PHP as an example, it is quite a flexible language with few rules. Java, on the other hand, uses strict naming schemes and strict file structure rules, and is strict about data types. This speeds up lookup quite a bit. What if we used the same rules in PHP? If we made it 100% OOP and adopted the same rules as Java, would this not speed up PHP?
I found a really great OOP example, but it does not prove the upside of OOP so much as the upside of overview and structure. It's no problem to do the same with procedural programming (PP), at least not in PHP.
OOP is a very moot term and that's why your question is moot as well.
On the most generic level, OOP is about objects (let's not dive into what they are) encapsulating some state and passing each other messages to enquire or change that state. As you can see, these objects might be processes running on separate network-connected machines and message passing might be done quite literally—by passing messages of some application-level protocol over that network; this is the one extreme. The opposite edge of this spectrum is, say, C or C++ or Object Pascal etc which are compiled down to machine instructions and in which objects are just memory regions. I reckon the only "interesting" topic is a language on this side of the OOP spectrum, right?
At this down-to-the-machine level, the only relevant slowdown I see is dynamic dispatch, which is what is typically used to implement implementation inheritance (class Bar extends class Foo, as in PHP) and which allows you to pass objects of derived classes to code expecting objects of their base class. This typically requires a lookup through a table of methods at runtime to select the relevant method.
Note that this is not somehow inherent to the concept of OOP. For instance, dynamic lookup like this has been routinely used in plain C code even before C++ came to existence, and C is not an OOP language.
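As a rough illustration of that point, here is the table-of-functions technique with no classes involved. C programs typically do this with a struct of function pointers; the sketch below uses an OCaml record of functions instead, and every name in it (shape_ops, shape, area) is made up for the example.

```ocaml
(* A "method table": a record of functions, selected per value at run time. *)
type shape_ops = {
  name : string;
  area : float -> float;  (* computes the area from a single size parameter *)
}

let square_ops = { name = "square"; area = (fun side -> side *. side) }
let circle_ops = { name = "circle"; area = (fun r -> Float.pi *. r *. r) }

(* An "object" is just data plus a reference to its table. *)
type shape = { ops : shape_ops; size : float }

(* The call goes through the table, so which function runs is only
   known at run time. This indirection is the cost being discussed. *)
let area s = s.ops.area s.size

let shapes =
  [ { ops = square_ops; size = 2.0 }; { ops = circle_ops; size = 1.0 } ]

let areas = List.map area shapes
```

The indirect call in area is the same kind of lookup a virtual method call performs, which is why the cost belongs to the dispatch mechanism rather than to OOP as such.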
What I'm leading you to is that some ways of accessing data and code cost more than others in terms of performance but provide powerful programming tools. Picking such an algorithm while weighing the resulting tradeoff is not at all peculiar to implementing OOP concepts; it happens in any computing field and any computing paradigm, or combination of them.
In the end, I would say that the most visible slowdowns will come not from the code running on a CPU but rather from the runtime system. For instance, PHP is known for its ability to dynamically load code at runtime. Does this count as a feature of it being an OOP-enabled language? On the one hand, in these days of heavy frameworks, when PHP loads something it usually loads definitions of classes. On the other hand, if these frameworks were, say, purely procedural, the same performance cost would be incurred (as most of the loading time is spent waiting on I/O). Interpreted or JIT-compiled languages have to interpret or compile the code they execute, and this incurs performance hits. Does this depend on some of these languages implementing OOP concepts? Unlikely, IMO.

Compiler code optimization: AST vs. IR

Here I define IR as a 3-address-code type of representation (I realize that the term can also be used for an AST representation).
It is my understanding that, when writing a best-practice compiler for an imperative language, code optimization happens both on the AST (probably best using a Visitor Pattern), and on the IR produced from the AST. 
(a) Is that correct?
(b) Which type of optimization steps are best handled on the AST before even producing an IR? (reference to an article/a list online welcome too as long as it deals with an imperative language) 
The compiler I'm working on is for Decaf (which some might know) which has a fairly deep CFG up to (single) class inheritance; I'll add features not part of it such as type coercion. It will be completely hand-coded (using no tools whatsoever). This is not homework; writing it for fun. 
(a) Yes.
(b) Constant folding is one example; common subexpression elimination (CSE) is another; in fact, almost anything to do with expression evaluation. IR-phase optimizations are more about what results from flow analysis.
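As a minimal sketch of what constant folding directly on the AST can look like, here is a tiny expression type and a bottom-up fold. The expr type and its constructors are invented for the example and are far smaller than a real Decaf AST; the sketch is in OCaml, but the same traversal maps onto a visitor in other languages.

```ocaml
(* A toy expression AST; a real Decaf AST would have many more node kinds. *)
type expr =
  | Int of int
  | Var of string
  | Add of expr * expr
  | Mul of expr * expr

(* Fold bottom-up: simplify the children first, then the node itself. *)
let rec fold e =
  match e with
  | Int _ | Var _ -> e
  | Add (l, r) ->
    (match fold l, fold r with
     | Int a, Int b -> Int (a + b)
     | l', r' -> Add (l', r'))
  | Mul (l, r) ->
    (match fold l, fold r with
     | Int a, Int b -> Int (a * b)
     | l', r' -> Mul (l', r'))

(* x + (2 * 3) becomes x + 6 *)
let example = fold (Add (Var "x", Mul (Int 2, Int 3)))
```

Note that this only folds subtrees made entirely of constants; rewriting something like (x + 2) + 3 would need reassociation, which is a separate step.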
An IR is a form of AST (often it is "flattened", but there are deep tree IRs as well). It may not be easy to distinguish one from the other, especially if the compiler is implemented as a sequence of very small rewrites from the original AST all the way down to a final IR suitable for instruction selection.
Optimisations may happen anywhere on this chain, but some representations are more suitable for a wide range of optimisations, most notably, an SSA form, used by most of the modern compilers to do nearly all the optimisations.
It's never too early to optimise (to coin a phrase). So there are optimisations performed before and during AST creation, on the AST itself, on the IR (if you have one) and on the code as it is generated. In C-like languages and those that compile to machine code, the effort goes into the later stages. In compilers targeting a VM I think there is less room for improvements at that stage.
Some early optimisations obviously work better than others. I don't know much about Decaf, but there are the obvious things like constant folding and constant expression evaluation. If you get the whole program in tree form before you have to generate any code you can find common subexpressions, do code migration, eliminate dead code/dead stores, hoist invariants, eliminate tail recursion and some kinds of strength reduction.
A lot of it depends on how hard you want to work and what your target is. You didn't say much about that.

Why do tools like yacc and ANTLR generate source code?

These tools basically input a grammar and output code which processes a series of tokens into something more useful, like a syntax tree. But could these tools be written in the form of a library instead? What is the reason for generating source code as output? Is there a performance gain? Is it more flexible for the end user? Easier to implement for the authors of yacc and ANTLR?
Sorry if the question is too vague, I'm just curious about the historical reasons behind the decisions the authors made, and what purpose auto-generated code has in today's environment.
There's a big performance advantage achieved by the parser generator working out the interactions of the grammar rules with respect to one another, and compiling the result to code.
One could build interpreters that simply accepted grammars and did the parsing; there are parser types (Earley) that would actually be relatively good at that, and one could compute the grammar interactions at runtime (Earley parsers kind of do this anyway) rather than offline and then execute the parsing algorithm.
But you would pay a parsing performance penalty of 10 to 100x slowdown, and probably a big storage demand.
If you are parsing using only very small grammars, or you are parsing only very small documents, this might not matter. But the grammars that many parser generators get applied to end up being fairly big (people keep wanting to add things to what you can say in a language), and they often end up processing pretty big documents. So performance now matters, and voilà, people build code-generating parser generators.
Once you have a tool, it is often easier to use even in simple cases. So now that you have parser generators, you can even apply them to little grammars or to parsing little documents.
EDIT: Addendum. The historical reason is probably driven by space and time demands. Earlier systems did not have a lot of room (32Kb in 1975), didn't run very fast (1 MIPS in the same time frame), and people had big source files already. Parser generators tended to help with this set of problems; interpreted grammars would have had intolerably bad performance.
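To make the difference between interpreting a grammar and running generated code tangible, here is a toy sketch. It is a PEG-style recognizer for balanced parentheses, not how yacc or ANTLR actually work, and all names (sym, grammar, interp, parens) are invented for the example. The first half walks a grammar data structure at run time; the second half is the kind of direct code a generator could emit for the same rule.

```ocaml
(* Grammar as data: what a library that interprets grammars would consume. *)
type sym =
  | Lit of char      (* match one literal character *)
  | Ref of string    (* refer to a named rule *)
  | Seq of sym list  (* match all parts, in order *)
  | Alt of sym list  (* ordered choice: first alternative that matches *)

(* parens := '(' parens ')' parens | <empty> *)
let grammar =
  [ ("parens",
     Alt [ Seq [ Lit '('; Ref "parens"; Lit ')'; Ref "parens" ]; Seq [] ]) ]

(* Interpreter: inspects the grammar data at every step.
   Returns the position just after the match, if any. *)
let rec interp rules sym s pos =
  match sym with
  | Lit c ->
    if pos < String.length s && s.[pos] = c then Some (pos + 1) else None
  | Ref name -> interp rules (List.assoc name rules) s pos
  | Seq syms ->
    List.fold_left
      (fun acc sy ->
         match acc with
         | None -> None
         | Some p -> interp rules sy s p)
      (Some pos) syms
  | Alt alts ->
    List.fold_left
      (fun acc sy ->
         match acc with
         | Some _ -> acc
         | None -> interp rules sy s pos)
      None alts

(* Generated-style code for the same rule: the grammar is gone,
   only straight-line tests and calls remain. *)
let rec parens s pos =
  if pos < String.length s && s.[pos] = '(' then
    (match parens s (pos + 1) with
     | Some p when p < String.length s && s.[p] = ')' -> parens s (p + 1)
     | _ -> Some pos (* first alternative failed: use the empty one *))
  else Some pos

let accepts_interp s = interp grammar (Ref "parens") s 0 = Some (String.length s)
let accepts_gen s = parens s 0 = Some (String.length s)
```

Both recognizers accept the same strings, but the interpreter pays for a rule lookup and a pattern match on the grammar at every step, while the direct version has those decisions baked in ahead of time.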
Ira Baxter gave you one set of reasons for not handling the grammar parsing at runtime.
There is another reason too. Associated with each rule in the grammar is the appropriate action. The action is normally a fragment of a separate language (for example, C or C++). All actions in a grammar interpreted at runtime would have to be mappable to something appropriate in the program. In general, that's a losing proposition. The fragments can do all sorts of things, referencing parts of the stack ($$, $1, etc.) and invoking actions (YYACCEPT, etc.). Designing the runtime system so that it could be reliably used with such fragments would be tough. You'd likely end up creating source code, compiling that into a DSO (dynamic shared object) or DLL (dynamic link library), and loading it. That requires a compiler on the customer's machine, where the customer may have deliberately designed their production system to be compiler-free.

Why is Clojure dynamically typed?

One thing I like very much is reading about different programming languages. Currently, I'm learning Scala but that doesn't mean I'm not interested in Groovy, Clojure, Python, and many others. All these languages have a unique look and feel and some characteristic features. In the case of Clojure I don't understand one of these design decisions. As far as I know, Clojure puts great emphasis on its functional paradigm and pretty much forces you to use immutable "variables" wherever possible. So if half of your values are immutable, why is the language dynamically typed?
The Clojure website says:
First and foremost, Clojure is dynamic. That means that a Clojure program is not just something you compile and run, but something with which you can interact.
Well, that sounds completely strange. If a program is compiled you can't change it anymore. Sure you can "interact" with it, that's what UIs are used for but the website certainly doesn't mean a neat "dynamic" GUI.
How does Clojure benefit from dynamic typing?
I mean the special case of Clojure and not general advantages of dynamic typing.
How does the dynamic type system help improve functional programming?
Again, I know the pleasure of not spilling "int a;" all over the source code but type inference can ease a lot of the pain. Therefore I would just like to know how dynamic typing supports the concepts of a functional language.
If a program is compiled you can't change it anymore.
This is wrong. In image-based systems, like Lisp (Clojure can be seen as a Lisp dialect) and Smalltalk, you can change the compiled environment. Development in such a language typically means working on a running system, adding and changing function definitions, macro definitions, parameters etc. (adding means compiling and loading into the image).
This has a lot of benefits. For one, all the tools can interact directly with the program and do not need to guess at the system's behaviour. You also do not have any long compilation pauses, because each compiled unit is very small (it is very rare to recompile everything). NASA's JPL once corrected a running Lisp system on a probe hundreds of thousands of kilometres away in space.
For such a system, it is very natural to have type information available at runtime (that is what dynamic typing means). Of course, nothing hinders you from also doing type inference and type checks at compilation time. These concepts are orthogonal. Modern Lisp implementations typically can do both.
Well, first of all, Clojure is a Lisp, and Lisps have traditionally been dynamically typed.
Second, as the excerpt you quoted says, Clojure is a dynamic language. This means, among other things, that you can define new functions at runtime, evaluate arbitrary code at runtime and so on. All of these things are hard or impossible to do in statically typed languages (without plastering casts all over the place).
Another reason is that macros might complicate debugging type errors immensely. I imagine that generating meaningful error messages for type errors produced by macro-generated code would be quite a task for the compiler.
I agree, a purely functional language can still have an interactive read-eval-print-loop, and would have an easier time with type inference. I assume Clojure wanted to attract lisp programmers by being "lisp for the jvm", and chose to be dynamic like other lisps. Another factor is that type systems need to be designed as the very first step of the language, and it's faster for language implementors to just skip that step.
(I'm rephrasing the original answer since it generated too much misunderstanding)
One of the reasons to keep Clojure (and any Lisp) dynamically typed is to simplify the creation of macros. In short, macros deal with abstract syntax trees (ASTs), which can contain nodes of many, many different types (usually, any objects at all). In theory, it's possible to build a fully statically typed macro system, but in practice such systems are usually limited and not widespread. Please see the examples below and the extended discussion in the thread.
EDIT 2020: Wow, 9 years passed from the time I posted this answer, and people still add comments. What a legacy we all have left!
Some people noted in comments that having a statically typed language doesn't prevent you from expressing code as a data structure. And, strictly speaking, that's true: union types allow you to express data structures of any complexity, including the syntax of a language. However, I claim that to express the syntax, you must either reduce expressiveness or use such wide unions that you lose all the advantages of static typing. To prove this claim I will use another language, Julia.
Julia is optionally typed - you can constrain any function or struct field to have a particular type, and Julia will check it. The language supports AST as a first class citizen using Expr and Symbol types. Expression definition looks something like this:
struct Expr
    head::Symbol
    args::Vector{Any}
end
An expression consists of a head, which is always a symbol, and a list of arguments, which may have any type. Julia also supports a special Union type which can constrain the arguments to specific types, e.g. Symbols and other Exprs:
struct Expr
    head::Symbol
    args::Vector{Union{Symbol, Expr}}
end
Which is sufficient to express e.g. :(x + y):
dump(:(x + y))
Expr
  head: Symbol call
  args: Array{Any}((3,))
    1: Symbol +
    2: Symbol x
    3: Symbol y
But Julia also supports a number of other types in expressions. One obvious and helpful example is literals:
:(x + 1)
Moreover, you can use interpolation or construct expressions manually to put any object to AST:
obj = create_some_object()
ex1 = :(x + $obj)
ex2 = Expr(:call, :+, :x, obj)
These examples are not just fun experiments; they are actively used in real code, especially in macros. So you cannot constrain expression arguments to a specific union of types: expressions may contain any values.
Of course, when designing a new language you can put any restrictions on it. Perhaps restricting Expr to contain only Symbol, Expr and some literals would be useful in some contexts. But it goes against the principles of simplicity and flexibility in both Julia and Clojure, and would significantly reduce the usefulness of macros.
Because that's what the world/market needed. No sense in building what's already built.
I hear the JVM already has a statically typed language ;)

Code Optimization with Scala

What structures of Scala can be used more efficiently than in Java, to increase execution speed? I don't know if this is possible, but to clear my doubts :)
The Scala @specialized annotation can generate multiple versions of a class, fine-tuned for specific primitive types. You can write all of that out in Java, but you probably don't want to.
To expand on Ross's answer, you can use @specialized to generate specific versions of a collection. For instance, in Java you'd generally use fastutil or Apache Primitives for collections of primitives. Scala's @specialized will generate these variants for you and hide them automatically, like so:
class MyLinkedList[@specialized T](args: T*) {
  // whatever it does
}
Other than that, actors make it easier to write concurrent applications. Coming up in 2.9 are parallel collections, which can apply higher-order functions in parallel across collections, speeding up any place you'd have the Scala equivalent of a Java loop (fold, foreach, etc). See this ScalaDays talk for the nitty-gritty on this.
As of 2.9, the parallel collections library is slated to be part of the standard distribution. This will allow extremely simple distribution of so-called "embarrassingly parallel" problems over multiple cores. Doing so in Java takes considerably more effort.
As a general rule, Scala benchmarks range from moderately slower than Java to slightly faster, depending on the problem and coding techniques.
I'll refrain from speculation on how the resulting performance might differ from an equivalent Java construct, but Scala does closure elimination, which might make a measurable difference, modulo HotSpot tricks.
Also stay tuned for Iulian's thesis which should be out soon and will provide a lot more information on the subject of Scala optimization.