Imperative languages with static, structural typing and global type inference

I know of languages like Haskell being statically typed and having type inference. But are there non-functional languages that have global type inference, i.e. the equivalent of something like C with type inference and structural typing?

OCaml is the only language I know of that can be used imperatively/object-orientedly while being statically typed, garbage-collected, and supporting global type inference and structural typing, though it is essentially a functional language.
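As a small illustration of that combination, here is a minimal OCaml sketch (every name is invented for the example): the describe function gets a structural object type inferred from how it uses its argument, with no annotations anywhere.

(* A minimal sketch of OCaml's structural object typing plus inference;
   every name here is made up for the example. *)

(* No class or interface declaration needed: describe works on any object
   that provides a string name and a float area. *)
let describe shape =
  Printf.sprintf "%s has area %.2f" shape#name shape#area

let circle r = object
  method name = "circle"
  method area = 3.14159 *. r *. r
  method radius = r    (* extra methods are fine: the typing is structural *)
end

let square s = object
  method name = "square"
  method area = s *. s
end

let () =
  print_endline (describe (circle 1.0));
  print_endline (describe (square 2.0))

The inferred type of describe is < area : float; name : string; .. > -> string, i.e. "any object providing at least these two methods", which is exactly the structural part.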
Scala isn't a functional language like OCaml but an imperative/object-oriented language; it supports structural typing, but not the kind of type inference you're looking for. It still supports functional constructs, though.
If by "non-functional" you mean a language that doesn't support functional programming at all, then I don't think there is one.

OCaml isn't the only contender anymore. A number of structurally-typed imperative languages have appeared in recent years:
F# is, like OCaml, a multiparadigm language that supports complex pattern matching and both imperative and functional programming. As an OCaml derivative, it is so similar to OCaml that, barring minor feature differences, the two are practically source-compatible. The main [dis?]advantage is that it runs on .NET.
Go is the lovechild of the original Unix / Plan 9 / Inferno team since they joined Google, building on their decades of work on the compilers for those systems. Go supports structural polymorphism in the sense that object composition is its primary subtyping mechanism and that method interfaces do not need to be explicitly implemented.
Haxe is an ActionScript derivative made to compile to an impressive variety of platforms, including C++ (!). It supports fully structural types and enums (equivalent to OCaml unions) alongside C#-style object hierarchies, and boasts a sophisticated macro system.

There's also Crystal, but it's in pre-alpha stage:
https://github.com/manastech/crystal

Related

Interpreter semantics: clarifying the steps an interpreter makes

This question is about definitions and semantics.
I understand the general concept of interpretation, translating source to machine code in real-time, or into an intermediate cache which is later "compiled" in real time or just before run time, etc.
Is there a semantic distinction made between the source > byte code translation step and the byte code > machine code translation step? Do people typically refer to the first part as "interpretation" and the second step as "compilation"? Please don't misunderstand, I am not asking for a definition of compilation outside the scope of dynamic languages. That is another topic.
Additionally, is it futile to make a semantic distinction between these two steps, due to the large number of interpreters that implement so many different techniques?
Typically, interpretation means the execution of a program in an arbitrary form (plain sourcecode, abstract syntax tree (AST), bytecode, ...) by an interpreter.
Some virtual machines make heavy use of JITs (just in time compilers) which translate (compile) the intermediate representation of a program to native machine code. This is definitely a form of compilation.
Also, some VMs do several phases of compilation: At first, an AST is compiled to bytecode, which can later on be compiled to machine code.
I would say compilation basically means a transformation from one intermediate representation to the next.
The steps an interpreter takes are usually programmed in a loop similar to the following (a small sketch in code follows the list):
get next instruction
parse and interpret its components
dispatch its translation
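A minimal sketch of such a loop in OCaml, for an invented stack-based instruction set (nothing here is any particular VM's bytecode):

(* A sketch of the fetch/dispatch loop; instructions and the sample
   program are made up for illustration. *)
type instr = Push of int | Add | Print | Halt

let run program =
  let stack = ref [] in
  let pc = ref 0 in
  let running = ref true in
  while !running do
    let i = program.(!pc) in              (* get the next instruction *)
    incr pc;
    match i with                          (* dispatch on its kind *)
    | Push n -> stack := n :: !stack
    | Add ->
        (match !stack with
         | a :: b :: rest -> stack := (a + b) :: rest
         | _ -> failwith "stack underflow")
    | Print ->
        (match !stack with
         | x :: _ -> Printf.printf "%d\n" x
         | _ -> failwith "stack underflow")
    | Halt -> running := false
  done

let () = run [| Push 2; Push 3; Add; Print; Halt |]   (* prints 5 *)

Each iteration fetches the next instruction and dispatches on its kind, which corresponds directly to the three steps above.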
The definitions and semantics of the language are only implemented by an interpreter; they are defined elsewhere.
The answer to your question lies in the formal, operational and axiomatic semantic definitions of the language being either interpreted or compiled. In both cases, the semantics of the formal language definition must be preserved and consistent for any interpretation or compilation regardless of the implementation techniques employed.
Implementations of languages such as interpreters and compilers are tested against test suites which test the implementation of each language construct in the language against its formal semantic definition.
A language designer generates the formal definition of a language in a symbolic form such as denotational semantics. This definition is very abstract from a mathematical point-of-view.
A compiler or interpreter implementer is more interested in the operational semantic definition of the language which is more directly related to building the compiler or interpreter to run on a target machine.
A user of the language is more interested in the axiomatic definition of the language which informs programmers how to use the language's constructs to create programs.

Why is Clojure dynamically typed?

One thing I like very much is reading about different programming languages. Currently, I'm learning Scala but that doesn't mean I'm not interested in Groovy, Clojure, Python, and many others. All these languages have a unique look and feel and some characteristic features. In the case of Clojure I don't understand one of these design decisions. As far as I know, Clojure puts great emphasis on its functional paradigm and pretty much forces you to use immutable "variables" wherever possible. So if half of your values are immutable, why is the language dynamically typed?
The Clojure website says:
First and foremost, Clojure is dynamic. That means that a Clojure program is not just something you compile and run, but something with which you can interact.
Well, that sounds completely strange. If a program is compiled you can't change it anymore. Sure, you can "interact" with it, that's what UIs are for, but the website certainly doesn't mean a neat "dynamic" GUI.
How does Clojure benefit from dynamic typing?
I mean the special case of Clojure and not general advantages of dynamic typing.
How does the dynamic type system help improve functional programming?
Again, I know the pleasure of not spilling "int a;" all over the source code, but type inference can ease a lot of that pain. Therefore I would just like to know how dynamic typing supports the concepts of a functional language.
If a program is compiled you can't change it anymore.
This is wrong. In image-based systems, like Lisp (Clojure can be seen as a Lisp dialect) and Smalltalk, you can change the compiled environment. Development in such a language typically means working on a running system, adding and changing function definitions, macro definitions, parameters etc. (adding means compiling and loading into the image).
This has a lot of benefits. For one, all the tools can interact directly with the program and do not need to guess at the system's behaviour. You also do not have any long compilation pauses, because each compiled unit is very small (it is very rare to recompile everything). The NASA JPL once corrected a running Lisp system on a probe hundreds of thousands of kilometres away in space.
For such a system, it is very natural to have type information available at runtime (that is what dynamic typing means). Of course, nothing hinders you from also doing type inference and type checks at compilation time. These concepts are orthogonal. Modern Lisp implementations typically can do both.
Well, first of all, Clojure is a Lisp, and Lisps have traditionally always been dynamically typed.
Second, as the excerpt you quoted says, Clojure is a dynamic language. This means, among other things, that you can define new functions at runtime, evaluate arbitrary code at runtime, and so on. All of these things are hard or impossible to do in statically typed languages (without plastering casts all over the place).
Another reason is that macros might complicate debugging type errors immensely. I imagine that generating meaningful error messages for type errors produced by macro-generated code would be quite a task for the compiler.
I agree, a purely functional language can still have an interactive read-eval-print-loop, and would have an easier time with type inference. I assume Clojure wanted to attract lisp programmers by being "lisp for the jvm", and chose to be dynamic like other lisps. Another factor is that type systems need to be designed as the very first step of the language, and it's faster for language implementors to just skip that step.
(I'm rephrasing the original answer since it generated too much misunderstanding)
One of the reasons to keep Clojure (and any Lisp) dynamically typed is to simplify the creation of macros. In short, macros deal with abstract syntax trees (ASTs), which can contain nodes of many, many different types (usually, any objects at all). In theory, it's possible to make a fully statically typed macro system, but in practice such systems are usually limited and rarely encountered. Please see the examples below and the extended discussion in the thread.
EDIT 2020: Wow, 9 years have passed since I posted this answer, and people still add comments. What a legacy we have all left!
Some people noted in the comments that having a statically typed language doesn't prevent you from expressing code as a data structure. And, strictly speaking, that's true: union types allow you to express data structures of any complexity, including the syntax of a language. However, I claim that to express the syntax you must either reduce expressiveness or use such wide unions that you lose all the advantages of static typing. To prove this claim I will use another language, Julia.
Julia is optionally typed: you can constrain any function argument or struct field to have a particular type, and Julia will check it. The language supports ASTs as first-class citizens using the Expr and Symbol types. The expression definition looks something like this:
struct Expr
    head::Symbol
    args::Vector{Any}
end
An expression consists of a head, which is always a symbol, and a list of arguments, which may have any types. Julia also supports a special Union type which can constrain the arguments to specific types, e.g. Symbols and other Exprs:
struct Expr
    head::Symbol
    args::Vector{Union{Symbol, Expr}}
end
This is sufficient to express e.g. :(x + y):
dump(:(x + y))
Expr
  head: Symbol call
  args: Array{Any}((3,))
    1: Symbol +
    2: Symbol x
    3: Symbol y
But Julia also supports a number of other types in expressions. One obvious and helpful example is literals:
:(x + 1)
Moreover, you can use interpolation or construct expressions manually to put any object into an AST:
obj = create_some_object()
ex1 = :(x + $obj)
ex2 = Expr(:+, :x, obj)
These examples are not just funny experiments; they are actively used in real code, especially in macros. So you cannot constrain expression arguments to a specific union of types: expressions may contain any values.
Of course, when designing a new language you can put any restrictions on it. Perhaps restricting Expr to contain only Symbols, Exprs and some literals would be useful in some contexts. But it goes against the principles of simplicity and flexibility in both Julia and Clojure, and would significantly reduce the usefulness of macros.
Because that's what the world/market needed. No sense in building what's already built.
I hear the JVM already has a statically typed language ;)

Why do almost all OO languages compile to bytecode?

Of the object-oriented languages I know, pretty much all but C++ and Objective-C compile to bytecode running on some sort of virtual machine. Why have so many different languages settled on compiling to bytecode, as opposed to machine code? Is it possible in principle to have a high-level, memory-managed OOP language that compiles to machine code?
Edit: I'm aware that multiplatform support is often advanced as an advantage of this approach. However, it's quite possible to compile natively on multiple platforms without making a new compiler per platform. One can, for example, emit C code and then compile that with GCC.
There's no reason, in fact; it is a kind of coincidence. OOP is now the leading concept in "big" programming, and so are virtual machines.
Also note that there are two distinct parts of a traditional virtual machine, the garbage collector and the bytecode interpreter/JIT compiler, and these parts can exist separately. For example, the Common Lisp implementation SBCL compiles programs to native code but makes heavy use of garbage collection at runtime.
This is done to give a VM or JIT compiler the chance to compile the code on demand, optimally for the architecture on which the code is executed. It also allows cross-platform bytecode to be created once and then executed on multiple hardware architectures, while hardware-specific optimizations can still be placed into the compiled code.
Since byte code is not limited to a microarchitecture, it can be smaller than machine code. Complex instructions can be represented, versus the much more primitive instructions available in modern CPUs, since the constraints in the design of CPU instructions are very different from the constraints in designing a bytecode architecture.
Then there's the issue of security. The bytecode can be verified and analyzed prior to execution (i.e., no buffer overflows, variables of a certain type being accessed as something they are not), etc...
Java uses bytecode because two of its initial design goals were portability and compactness. Those both came from the initial vision of a language for embedded devices, where fragments of code could be downloaded on the fly.
Python, Ruby, Smalltalk, JavaScript, awk and so on use bytecode because writing a native compiler is a lot of work, but a textual interpreter is too slow; bytecode hits a sweet spot of being fairly easy to write but also satisfactorily quick to run.
I have no idea why the Microsoft languages use bytecode, since for them neither portability nor compactness is a big deal. A lot of the thinking behind the CLR came out of computer scientists in Cambridge, so I imagine considerations like ease of program analysis and verification were involved.
Note that as well as C++ and Objective C, Eiffel, Ada 9X, Vala and Go are OO languages (of varying vintage) that are compiled straight to native code.
All in all, I'd say that OO and bytecode do not go hand in hand. Rather, we have a coincidental convergence of several streams of development: the traditional bytecoded interpreters of scripting languages like Python and Ruby, the mad Gosling masterplan of Java, and whatever Microsoft's motives are.
The biggest reason why most interpreted languages (not specifically OO languages) are compiled to bytecode is for performance. The most expensive part of interpreting code is transforming text source to an intermediate representation. For instance, to perform something like:
foo + bar;
The interpreter would have to scan 10 characters, transform them into 4 tokens, build an AST for the operation, resolve three symbols (+ is a symbol, which depends on the types of foo and bar), all before it can perform any action that actually depends on the run-time state of the program. None of this can change from run to run, and so many languages try to store some form of intermediate representation.
Bytecode, rather than a stored AST, has a few advantages. For one, bytecodes are easy to serialize, so the IR can be written to disk and reused on the next invocation, further reducing interpretation time. Another is that bytecode often takes up less actual RAM. Most significantly, bytecode representations are often easy to just-in-time compile, because they are often structurally similar to typical machine code.
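As a rough sketch of that "parse once, cache the IR" idea, here is a minimal OCaml example (the AST, opcodes and file name are all invented for illustration): the expression foo + bar is compiled to a flat instruction list and marshalled to disk so that a later run can skip scanning and parsing entirely.

(* Invented AST and opcodes for the expression  foo + bar. *)
type expr = Var of string | Num of int | Add of expr * expr
type instr = Load of string | Push of int | AddOp

(* The one-time work: flatten the tree into an instruction sequence. *)
let rec compile = function
  | Num n -> [Push n]
  | Var x -> [Load x]
  | Add (a, b) -> compile a @ compile b @ [AddOp]

(* Cache the compiled form on disk so a later run skips scanning/parsing.
   The file name "expr.byc" is just an example. *)
let save_bytecode file code =
  let oc = open_out_bin file in
  Marshal.to_channel oc code [];
  close_out oc

let load_bytecode file : instr list =
  let ic = open_in_bin file in
  let code = (Marshal.from_channel ic : instr list) in
  close_in ic;
  code

let () =
  save_bytecode "expr.byc" (compile (Add (Var "foo", Var "bar")));
  Printf.printf "cached %d instructions\n" (List.length (load_bytecode "expr.byc"))

Whether the cached form is an AST or a flat bytecode list, the point is the same: the per-run cost drops to deserialization plus execution.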
As another data point, the D programming language is GC'ed, OO, and a lot higher level than C++ while still being compiled to native code.
Bytecode is a significantly more flexible medium than machine code. First, it provides the basis for platform portability without the need for a compiler or shipping source code. So a developer can distribute a single version of the application without needing to give up the source, require complex developer tools, or anticipate potential target platforms. While the latter is not always practical, it does happen, especially with developer libraries: say I distribute a library that I've only tested on Windows, but someone else uses it on Linux or Android. It happens quite frequently actually, and most of the time it works as expected.
Byte code is also generally more optimized than what an interpreter works with, because it's closer to machine instructions and therefore faster to translate to machine instructions. Not all OO languages are compiled: Ruby, Python, and even JavaScript are interpreted, so they aren't compiled to anything; the Ruby interpreter has to take a very flexible language and turn it into instructions, and that flexibility comes at a price paid at runtime: parse text, generate an AST, translate the AST to machine code, etc. Byte code also makes it easy to do optimizations like JIT, where byte code is translated directly to machine code, and it even opens the possibility of creating optimizations for specific hardware.
Finally, just because one language compiles to bytecode doesn't preclude other languages from taking advantage of that byte code. Any optimization using that byte code can now be applied to other languages that know how to translate themselves to it. That makes the byte code a very important layer of reusability for other languages.
OO and byte code compilation go back to the 70s with Smalltalk, and I'm sure someone will say LISP as early as the 50s/60s. But it really wasn't until the 90s that it started to be used in production systems on a large scale.
Native compilation sounds like the optimal path, and is probably why our industry spent 20 years or more thinking it was THE ANSWER to all our problems, but over the last 15 years we've seen byte code compilation take the stage, and it's been a significant advantage over what we did before. Looking back, we realize how much time was wasted natively compiling everything, mostly by hand.
I agree with Chubbard's answer, and I'd add that in OO languages type information can be very important for enabling optimizations by virtual machines or last-level compilers.
It is easier to develop an interpreter than a compiler.
Effort in development of...:
interpreter < bytecode-interpreter < bytecode-jit-compiler < compiler-to-platform-independent-language < compiler-to-multiple-machine-dependent-assembler.
It is a general trend to stop development at JIT compilers because of platform independence. Only the languages preferred for performance, or pushed by research in theoretical computer science, are and will be developed in all possible directions, including new bytecode interpreters, even though good and advanced compilers to platform-independent languages and to various machine-dependent assemblers already exist.
The research in OOP languages is pretty, let's say, dull compared to functional languages, because really new language and compiler technologies are more easily expressed with/in/using mathematical category theory and mathematical descriptions of Turing-complete type systems. In other words, such research is nearly functional in itself, while imperative languages are nearly only assembler front-ends with some syntactic sugar. OOP languages tend to be imperative languages, because functional languages already have closures and lambdas. There are other ways to implement Java-like "interfaces" in functional languages, and there is just no need for additional object-oriented features.
In Haskell, for instance, adding OOP-like programming as a feature would probably be more than just a few steps back in technology; there would be no point in using it. (That is not only my opinion; have you ever heard of GADTs or multi-parameter type classes?) There might even be better ways to dynamically create objects with interfaces to communicate with OOP languages than changing the language itself. But there are other functional languages, too, that explicitly combine functional and OOP aspects. There is just more research around mainly functional languages than around non-functional OO languages.
OO languages cannot easily be compiled to other OO languages if they are in some way more "advanced". Usually they have features like stack protection, advanced debugging abilities, abstract and inspectable multi-threading, dynamic object loading from files or from the internet... Many of these features are not, or not easily, realisable with C or C++ as a compiler backend. The functional language LISP (which is 50 years old!) was, AFAIK, the first with a garbage collector. As a compiler backend, LISP used a hacked version of the language C, because plain C did not allow some of the things that assembler does, e.g. proper tail calls or tables-next-to-code. C-- allows that.
Another aspect: imperative languages are intended to run on a specific architecture, i.e. C and C++ programs run only on the architectures they are compiled for. Java is more extreme: it runs on only a single architecture, a virtual one, which itself runs on others.
Functional languages are usually by design pretty architecture-independent: LISP was developed to be so immensely architecture-unspecific that it could be compiled to genetic code in some distant future. Yes, like programs running in living biological cells.
With LLVM bytecode, functional languages will most likely be compiled to bytecode in the future, too. Most imperative languages will most likely still have the same inherited problems they have now from not abstracting far enough. Well, I'm not that sure about Clang and D, but those two are not "most" anyway.

Do high-level programming languages tend to be object-oriented while low-level languages are procedurally oriented?

I'm just getting a bit confused about all the language types out there. What's the difference - if there is one - between the high level / low level languages distinction compared to the object-oriented / procedural distinction? A lot of the analogies seem similar.
The high/low level distinction is more about abstraction than paradigm. Typically, the "lower" you are, the more you have to know about the machine you're running on - its memory, file system, and even processor instruction set.
A high-level language puts a layer of abstraction between you and the machine. It handles the gory details. This is both good and bad. Abstraction takes away some worry but also takes away control.
A high-level language can be procedural, object oriented, functional, etc...
Lower-level languages may not provide concepts like object orientation, because object orientation is an abstraction.
High level/low level refers to the perceived 'closeness' of the language to assembler and machine code (assembler is low-level, C is seen as lower level than C++ or Java, etc).
OO and procedural programming are language facilities provided to support certain ways of designing programs (called programming paradigms). They have nothing to do with whether the language is high or low level, beyond the fact that an OO language tends not to be low level, since assembler doesn't know about objects and classes. There are a lot of other paradigms out there as well, such as functional programming.
Not really.
C++, for example, is object oriented, and it is fairly low level.
"High level" and "low level" are somewhat vague terms that people can disagree about. You can take a look at the amount of abstraction a programming language provides by how much code you have to write to accomplish a particular task, then call the languages which need less code higher level. Of course, then you need a way to measure code size.
There isn't necessarily any causation across these two axes ("paradigm" and "level"), but I think the correlation is that logic and functional languages tend to be highest level, followed shortly by object-oriented languages, with procedural languages typically lower-level.
And not part of the question, but I also think that correlationally, dynamically-typed languages tend to be higher-level than statically-typed ones.
I think it might be an interesting visualization for someone to do a three-dimensional scatter plot of programming languages across the three axes: paradigm (logic/functional/oo/procedural) typing (static/dynamic) and level (see e.g. 'Code Complete' for various metrics on measuring level) .
I think a good analogy to make here is this:
Languages which are object oriented do tend to be higher level than purely procedural ones. Look at C++ and C. Yeah, C++ is still pretty low level, as mentioned by docesam, but C++ is still higher level than its purely procedural older brother, C.
No, it's not always quite that simple, as object orientation isn't the only thing that makes a language high level, but it's definitely an indicator, since object orientation means more abstraction over the real raw machine instructions.
But object orientation isn't enough to determine which language is the highest level.
I'd look at the following things:
Does the language have static or dynamic typing? (JavaScript & Python vs Java & C++)
Object oriented or not? (C vs C++)
Pure text macros or templates? (C vs C++)
Dynamic binding vs static binding (again, JavaScript & Python vs Java & C++)
Does the language support named functions or do you have to use line jumps?
Does the language allow for things like comments?
many more
I like to say that it all boils down to the machine instruction set. So, regardless of how high-level something is represented, it will still boil down to machine instructions. High-level languages are abstractions of ideas, while low-level ones twiddle closer to the hardware.
The analogies are similar because it all boils down to one thing: machine code!

Inversion of Control in Compilers

Has anyone out there actually used inversion of control containers within compiler implementations yet? I know that by design, compilers need to be very fast, but I've always been curious about how IoC/DI could affect the construction of a programming language--hot-swappable syntaxes, anyone?
Lisp-style languages often do this. Reader macros are pieces of user-written code which extend the reader (and hence, the syntax) of a language. Plain-old macros are pieces of user-written code which also extend the language.
The entire syntax isn't hot-swappable, but certain pieces are extensible in various ways.
All this isn't a new idea. Before it was deemed worthy of a three-letter acronym, IoC was known as "late binding", and pretty much agreed on as a Good Idea.
LR(k) grammars typically use a generic parser system driven by tables (action/goto, shift/reduce tables), so you use a table-generator tool which produces these tables and feed them to the generic parser system, which can then parse your input using the tables. In general, these parser systems then signal you that a non-terminal has been reduced. See for example the GoldParser system, which is free.
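As a minimal sketch of that division of labour, here is a hand-written table-driven shift-reduce parser in OCaml for the toy grammar E -> E '+' NUM | NUM (in a real system a tool like GoldParser or yacc derives the tables from the grammar; everything below is simplified for illustration). The generic engine owns the loop and only consults the tables and the grammar's reduce actions:

(* Toy grammar:  rule 1: E -> E '+' NUM    rule 2: E -> NUM
   The ACTION and GOTO tables below were written by hand; a parser
   generator would normally derive them from the grammar. *)
type token = NUM of int | PLUS | EOF
type action = Shift of int | Reduce of int | Accept | Error

(* ACTION[state, lookahead] *)
let action state tok =
  match state, tok with
  | 0, NUM _ -> Shift 1
  | 1, (PLUS | EOF) -> Reduce 2
  | 2, PLUS -> Shift 3
  | 2, EOF -> Accept
  | 3, NUM _ -> Shift 4
  | 4, (PLUS | EOF) -> Reduce 1
  | _ -> Error

(* GOTO[state, E]: only state 0 has a goto on E in this tiny grammar. *)
let goto = function 0 -> 2 | _ -> failwith "no goto"

(* The generic engine: it owns the loop; the grammar only shows up through
   the tables and the semantic actions of the two Reduce cases. *)
let parse tokens =
  let rec loop states values toks =
    let state = List.hd states in
    let tok = match toks with t :: _ -> t | [] -> EOF in
    match action state tok with
    | Shift s ->
        let v = match tok with NUM n -> n | _ -> 0 in
        loop (s :: states) (v :: values) (List.tl toks)
    | Reduce 1 ->                        (* E -> E '+' NUM: pop 3, push sum *)
        (match values, states with
         | n :: _ :: e :: vs, _ :: _ :: _ :: ss ->
             loop (goto (List.hd ss) :: ss) ((e + n) :: vs) toks
         | _ -> failwith "corrupt stack")
    | Reduce 2 ->                        (* E -> NUM: pop 1 *)
        (match values, states with
         | n :: vs, _ :: ss -> loop (goto (List.hd ss) :: ss) (n :: vs) toks
         | _ -> failwith "corrupt stack")
    | Accept -> List.hd values
    | _ -> failwith "parse error"
  in
  loop [0] [] tokens

let () = Printf.printf "%d\n" (parse [NUM 1; PLUS; NUM 2; PLUS; NUM 3])   (* prints 6 *)

Swapping in different tables and reduce actions changes the language the same engine parses, which is the inversion-of-control aspect of this design.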
I wouldn't really call it inversion of control because it's natural for compilers. They are usually a series of passes that transform code in the input language to code in the output language. You can, of course, swap in a different pass (for example, gcc compiles multiple languages by using a different frontend).