Good introductory text about GHC implementation? - optimization

When programming in Haskell (and especially when solving Project Euler problems, where suboptimal solutions tend to stress the CPU or memory needs) I'm often puzzled why the program behaves the way it is. I look at profiles, try to introduce some strictness, chose another data structure, ... but mostly it's groping in the dark, because I lack a good intuition.
Also, while I know how Lisp, Prolog and imperative languages are typically implemented, I have no idea about implementing a lazy language. I'm a bit curious too.
Hence I would like to know more about the whole chain from program source to execution model.
Things I wonder about:
what typical optimizations are applied?
what is the execution order when there are multiple candidates for evaluation (while I know it's driven from the needed outputs, there may still be big performance differences between first evaluating A and then B, or evaluating B first to detect that you don't need A at all)
how are thunks represented?
how are the stack and the heap used?
what is a CAF? (profiling indicates sometimes that the hotspot is there, but I have no clue)

The majority of the technical information about the architecture and approach of the GHC system is in their wiki. I'll link to the key pieces, and some related papers that people may not know about.
What typical optimizations are applied?
The key paper on this is: A transformation-based optimiser for Haskell,
SL Peyton Jones and A Santos, 1998, which describes the model GHC uses of applying type-preserving transformations (refactorings) of a core Haskell-like language to improve time and memory use. This process is called "simplification".
Typical things that are done in a Haskell compiler include:
Inlining;
Beta reduction;
Dead code elimination;
Transformation of conditions: case-of-case, case elimiation.
Unboxing;
Constructed product return;
Full laziness transformation;
Specialization;
Eta expansion;
Lambda lifting;
Strictness analysis.
And sometimes:
The static argument transformation;
Build/foldr or stream fusion;
Common sub-expression elimination;
Constructor specialization.
The above-mentioned paper is the key place to start to understand most of these optimizations. Some of the simpler ones are given in the earlier book, Implementing Functional Languages, Simon Peyton Jones and David Lester.
What is the execution order when there are multiple candidates for evaluation
Assuming you're on a uni-processor, then the answer is "some order that the compiler picks statically based on heuristics, and the demand pattern of the program". If you're using speculative evaluation via sparks, then "some non-deterministic, out-of-order execution pattern".
In general, to see what the execution order is, look at the core, with, e.g. the ghc-core tool. An introduction to Core is in the RWH chapter on optimizations.
How are thunks represented?
Thunks are represented as heap-allocated data with a code pointer.
See the layout of heap objects.
Specifically, see how thunks are represented.
How are the stack and the heap used?
As determined by the design of the Spineless Tagless G-machine, specifically, with many modifications since that paper was released. Broadly, the execution model:
(boxed) objects are allocated on the global heap;
every thread object has a stack, consisting of frames with the same layout as heap objects;
when you make a function call, you push values onto the stack and jump to the function;
if the code needs to allocate e.g. a constructor, that data is placed on the heap.
To deeply understand the stack use model, see "Push/Enter versus Eval/Apply".
What is a CAF?
A "Constant Applicative Form". E.g. a top level constant in your program allocated for the lifetime of your program's execution. Since they're allocated statically, they have to be treated specially by the garbage collector.
References and further reading:
The GHC Commentary
The Spinless Tagless G-machine
Compilation via Transformation
Push/Enter vs Eval/Apply
Unboxed Values as First-Class Citizens
Secrets of the Inliner
Runtime Support for Multicore Haskell

This is probably not what you had in mind in terms of an introductory text, but Edward Yang has an ongoing series of blog posts discussing the Haskell heap, how thunks are implemented, etc.
It's entertaining, both with the illustrations and also by virtue of explicating things without delving into too much detail for someone new to Haskell. The series covers many of your questions:
The Haskell heap and how thunks are stored - the first post in the series
Bindings and CAFs
How the IO monad gets translated into primitives
On a more technical level, there are a number of papers that cover (in concert with other things), parts of what you're wanting to know.:
A paper by SPJ, Simon Marlow et al on GC in Haskell - I haven't read it, but since GC often represents a good porton of the work Haskell does, it should give insight.
The Haskell 2010 report - I'm sure you'll have heard of this, but it's too good not to link to. Can make for dry reading in places, but one of the best ways to understand what makes Haskell the way it is, at least the portions I've read.
A history of Haskell - is more technical than the name would suggest, and offers some very interesting views into Haskell's design, and the decisions behind the design. You can't help but better understand Haskell's implementation after reading it.

Related

OOP Performance after compilation

I got into a conversation with someone about OOP, who said that OOP costs to much performance. Now I know that in some cases it might, but as I see it, it would depend on different things.
Language execution. In languages using an interpretor, I can see that it could be a possibility. But what about compiled language like C++ or half compiled like Java? In any case it would just slow down the compilation vs. C, but as native or byte code I would think that the compilers would have optimized it to a point where this is not a problem.
Language structure. If we take PHP as an example, it is quite a flexible language with little rules. Java on the other hand uses strict naming schemes, strict file structure rules and is strict about data types. This speeds up lookup quite a bit. What if we used the same rules in PHP? Made it 100% OOP and adapted the same rules as Java has, would this not speed up PHP?
I found a really great OOP example, but this example does not prove the upside of OOP, but rather the upside of overview and structure. It's no problem using PP to do the same, at least not in PHP.
OOP is a very moot term and that's why your question is moot as well.
On the most generic level, OOP is about objects (let's not dive into what they are) encapsulating some state and passing each other messages to enquire or change that state. As you can see, these objects might be processes running on separate network-connected machines and message passing might be done quite literally—by passing messages of some application-level protocol over that network; this is the one extreme. The opposite edge of this spectrum is, say, C or C++ or Object Pascal etc which are compiled down to machine instructions and in which objects are just memory regions. I reckon the only "interesting" topic is a language on this side of the OOP spectrum, right?
In this, down-to-machine, level, the only relevant slow down I perceive is dynamic dispatch which is what is typically used to implement implementation inheritance (class Bar extends class Foo as in PHP) which allows you to pass objects of derived classes to the code expecting objects of their base class. This is typically requires a lookup through the table of methods at runtime to select a relevant method.
Note that this is not somehow inherent to the concept of OOP. For instance, dynamic lookup like this has been routinely used in plain C code even before C++ came to existence, and C is not an OOP language.
What I'm leading you to, is that some ways to access data and code cost more than others in terms of performance but provide powerful programming tools. Picking such an algorythm while considering a resulting tradeoff is not at all peculiar to implementing OOP concepts and happens in any computing field and any computing paradigm or a combinations of them.
In the end, I would say that the most visible slow-downs will come not from the code running on a CPU but rather from the runtime system. For instance, PHP is known for its ability to dynamically load code at runtime. Does this count as a feature of it being an OOP-enabled language? On the one hand, in these days of heavy frameworks, when PHP loads something it usually loads definitions of classes. On the other hand, if these frameworks were, say, purely procedural the same performance cost would be incurred (as the most loading time is spent waiting on I/O). Interpreted or JIT-compiled languages have to interpret or compile the code they execute and this incurs pefrormance hits. Does this depend on some of these languages implementing OOP concepts? Unlikely, IMO.

Build compiler Object Oriented

I'm currently working on a project for my class. I'm building a compiler with Flex (lex) and Bison (YACC) and C. I have done just a little bit of semantic an syntax analysis but i have been thinking how im going to implement the object oriented part. That is, how can I handle classes, overloading, polymorphism and heritage.
I can't seem to find something useful on google and the dragon book its just too low level. I mean too focused on building a compiler from scratch. So I was hoping that someone could point me to a good book, tutorials,example, something that can help me to clear my doubts.
Thanks in advance for the help, and im sorry if someone thinks this is asking to have my homework done.
I agree with the first comment that this question is far too broad to be answered. But I'll try anyway.
There are several aspects to your question:
What are the semantics of the commonly used concepts of object-oriented programming?
How can they be implemented in a compiler?
How are they usually implemented in other compilers?
What are good resources for further studies?
Semantics
Varies widely between languages and there also is quite a bit of confusion/controversity about what OOP actually means (a nice presentation on that topic: http://www.infoq.com/presentations/It-Is-Possible-to-Do-OOP-in-Java which also has some examples of implementing OOP-features). Just pick one model and look up a reference that defines the semantics such as a language specification or a scientific paper on the model.
Javascript probably is the easiest model to implement as it very directly maps to the implementation without much of a necessary surrounding framework in the compiler. A static version of the Java model (compile time class compilation instead of runtime classloading) shouldn't be too hard either. More sophisticated models would be C++ (which allows multiple inheritance) and Smalltalk or Common Lisp/CLOS (with Meta-object-protocols).
Possible Implementations
Again a wide range of choices. As the semantics are fixed and mostly rather straightforward the implementation effort most strongly depends on the performance you want to archive and the existing infrastructure of your compiler. Storing everything in lists and scanning them for the first entry that satisfies the rules probably is the easiest implementation.
Usual Implementation
Most programming languages out of the Java/C#/C++ area do static compile-time name/signature lookups to find the definitions of the things referred to and use a http://en.wikipedia.org/wiki/Virtual_method_table to resolve polymorphic calls. They also use the Vtable pointer for instanceof-checks and for checking down-casts.
Resources
While only 30 pages are directly concerned with objects I still think Lisp in Small Pieces (LiSP) is a great book for learning to work at that level within a compiler. It focusses on implementing language features, trade-offs in implementations and fitting the pieces together. (if (you can get over the the syntax used) (it's great)).

Compiler Optimization of Deterministic Functions

I was reading about Deterministic Execution, which is that for the same input, you have the same output. I was wondering whether any compiler writer has thought about optimizing deterministic functions at runtime.
For example, take the factorial function. If at runtime, it is detected that it is continuously being called with the same input value, the compiler can cache the output value and instead of executing the factorial function, can directly use that output value. Seems like a nice research topic. Are there any papers or work on this topic?
This is usually called memoization, and is a fairly common optimization in functional languages.
It can be done but as far as I know, it's not common for compilers to do it. The trouble is that users can define as many types as they like and equality in any way that they like, and with heap allocation and stuff it's very, very difficult to prove such a thing. Basically, it could be done, but only if your function involves straight numerical computation, which is rare, and thus it's usually not of high value.
You're talking about referential transparency. And it's a big part of functional programming.
http://en.wikipedia.org/wiki/Referential_transparency_(computer_science)
http://blogs.msdn.com/b/vcblog/archive/2008/11/12/pogo.aspx talks about profile guided optimization.
doesnot answer your questions per se but in general talks about using runtime behavior to optimize assembly

Significant Challengers to OOP

From what I understand, OOP is the most commonly used paradigm for large scale projects. I also know that some smaller subsets of big systems use other paradigms (e.g. SQL, which is declarative), and I also realize that at lower levels of computing OOP isn't really feasible. But it seems to me that usually the pieces of higher level solutions are almost always put together in a OOP fashion.
Are there any scenarios where a truly non-OOP paradigm is actually a better choice for a largescale solution? Or is that unheard of these days?
I've wondered this ever since I've started studying CS; it's easy to get the feeling that OOP is some nirvana of programming that will never be surpassed.
In my opinion, the reason OOP is used so widely isn't so much that it's the right tool for the job. I think it's more that a solution can be described to the customer in a way that they understand.
A CAR is a VEHICLE that has an ENGINE. That's programming and real world all in one!
It's hard to comprehend anything that can fit the programming and real world quite so elegantly.
Linux is a large-scale project that's very much not OOP. And it wouldn't have a lot to gain from it either.
I think OOP has a good ring to it, because it has associated itself with good programming practices like encapsulation, data hiding, code reuse, modularity et.c. But these virtues are by no means unique to OOP.
You might have a look at Erlang, written by Joe Armstrong.
Wikipedia:
"Erlang is a general-purpose
concurrent programming language and
runtime system. The sequential subset
of Erlang is a functional language,
with strict evaluation, single
assignment, and dynamic typing."
Joe Armstrong:
“Because the problem with
object-oriented languages is they’ve
got all this implicit environment that
they carry around with them. You
wanted a banana but what you got was a
gorilla holding the banana and the
entire jungle.”
The promise of OOP was code reuse and easier maintenance. I am not sure it delivered. I see things such as dot net as being much the same as the C libraries we used to get fro various vendors. You can call that code reuse if you want. As for maintenance bad code is bad code. OOP did not help.
I'm the biggest fan of OOP, and I practice OOP every day.
It's the most natural way to write code, because it resembles the real life.
Though, I realize that the OOP's virtualization might cause performance issues.
Of course that depends on your design, the language and the platform you chose (systems written in Garbage collection based languages such as Java or C# might perform worse than systems which were written in C++ for example).
I guess in Real-time systems, procedural programming may be more appropriate.
Note that not all projects that claim to be OOP are in fact OOP. Sometimes the majority of the code is procedural, or the data model is anemic, and so on...
Zyx, you wrote, "Most of the systems use relational databases ..."
I'm afraid there's no such thing. The relational model will be 40 years old next year and has still never been implemented. I think you mean, "SQL databases." You should read anything by Fabian Pascal to understand the difference between a relational dbms and an SQL dbms.
" ... the relational model is usually chosen due to its popularity,"
True, it's popular.
" ... availability of tools,"
Alas without the main tool necessary: an implementation of the relational model.
" support,"
Yup, the relational model has fine support, I'm sure, but it's entirely unsupported by a dbms implementation.
" and the fact that the relational model is in fact a mathematical concept,"
Yes, it's a mathematical concept, but, not being implemented, it's largely restricted to the ivory towers. String theory is also a mathematical concept but I wouldn't implement a system with it.
In fact, despite it's being a methematical concept, it is certainly not a science (as in computer science) because it lacks the first requirement of any science: that it is falsifiable: there's no implementation of a relational dbms against which we can check its claims.
It's pure snake oil.
" ... contrary to OOP."
And contrary to OOP, the relational model has never been implemented.
Buy a book on SQL and get productive.
Leave the relational model to unproductive theorists.
See this and this. Apparently you can use C# with five different programming paradigms, C++ with three, etc.
Software construction is not akin to Fundamental Physics. Physics strive to describe reality using paradigms which may be challenged by new experimental data and/or theories. Physics is a science which searches for a "truth", in a way that Software construction doesn't.
Software construction is a business. You need to be productive, i.e. to achieve some goals for which someone will pay money. Paradigms are used because they are useful to produce software effectively. You don't need everyone to agree. If I do OOP and it's working well for me, I don't care if a "new" paradigm would potentially be 20% more useful to me if I had the time and money to learn it and later rethink the whole software structure I'm working in and redesign it from scratch.
Also, you may be using another paradigm and I'll still be happy, in the same way that I can make money running a Japanese food restaurant and you can make money with a Mexican food restaurant next door. I don't need to discuss with you whether Japanese food is better than Mexican food.
I doubt OOP is going away any time soon, it just fits our problems and mental models far too well.
What we're starting to see though is multi-paradigm approaches, with declarative and functional ideas being incorporated into object oriented designs. Most of the newer JVM languages are a good example of this (JavaFX, Scala, Clojure, etc.) as well as LINQ and F# on the .net platform.
It's important to note that I'm not talking about replacing OO here, but about complementing it.
JavaFX has shown that a declarative
solution goes beyond SQL and XSLT,
and can also be used for binding
properties and events between visual
components in a GUI
For fault tolerant and highly
concurrent systems, functional
programming is a very good fit,
as demonstrated by the Ericsson
AXD301 (programmed using Erlang)
So... as concurrency becomes more important and FP becomes more popular, I imagine that languages not supporting this paradigm will suffer. This includes many that are currently popular such as C++, Java and Ruby, though JavaScript should cope very nicely.
Using OOP makes the code easier to manage (as in modify/update/add new features) and understand. This is especially true with bigger projects. Because modules/objects encapsulate their data and operations on that data it is easier to comprehend the functionality and the big picture.
The benefit of OOP is that it is easier to discuss (with other developers/management/customer) a LogManager or OrderManager, each of which encompass specific functionality, then describing 'a group of methods that dump the data in file' and 'the methods that keep track of order details'.
So I guess OOP is helpful especially with big projects but there are always new concepts turning up so keep on lookout for new stuff in the future, evaluate and keep what is useful.
People like to think of various things as "objects" and classify them, so no doubt that OOP is so popular. However, there are some areas where OOP has not gained a bigger popularity. Most of the systems use relational databases rather than objective. Even if the second ones hold some notable records and are better for some types of tasks, the relational model is unsually chosen due to its popularity, availability of tools, support and the fact that the relational model is in fact a mathematical concept, contrary to OOP.
Another area where I have never seen OOP is the software building process. All the configuration and make scripts are procedural, partially because of the lack of the support for OOP in shell languages, partially because OOP is too complex for such tasks.
Slightly controversial opinion from me but I don't find OOP, at least of a kind that is popularly applied now, to be that helpful in producing the largest scale software in my particular domain (VFX, which is somewhat similar in scene organization and application state as games). I find it very useful on a medium to smaller scale. I have to be a bit careful here since I've invited some mobs in the past, but I should qualify that this is in my narrow experience in my particular type of domain.
The difficulty I've often found is that if you have all these small concrete objects encapsulating data, they now want to all talk to each other. The interactions between them can get extremely complex, like so (except much, much more complex in a real application spanning thousands of objects):
And this is not a dependency graph directly related to coupling so much as an "interaction graph". There could be abstractions to decouple these concrete objects from each other. Foo might not talk to Bar directly. It might instead talk to it through IBar or something of this sort. This graph would still connect Foo to Bar since, albeit being decoupled, they still talk to each other.
And all this communication between small and medium-sized objects which make up their own little ecosystem, if applied to the entire scale of a large codebase in my domain, can become extremely difficult to maintain. And it becomes so difficult to maintain because it's hard to reason about what happens with all these interactions between objects with respect to things like side effects.
Instead what I've found useful is to organize the overall codebase into completely independent, hefty subsystems that access a central "database". Each subsystem then inputs and outputs data. Some other subsystems might access the same data, but without any one system directly talking to each other.
... or this:
... and each individual system no longer attempts to encapsulate state. It doesn't try to become its own ecosystem. It instead reads and writes data in the central database.
Of course in the implementation of each subsystem, they might use a number of objects to help implement them. And that's where I find OOP very useful is in the implementation of these subsystems. But each of these subsystems constitutes a relatively medium to small-scale project, not too large, and it's at that medium to smaller scale that I find OOP very useful.
"Assembly-Line Programming" With Minimum Knowledge
This allows each subsystem to just focus on doing its thing with almost no knowledge of what's going on in the outside world. A developer focusing on physics can just sit down with the physics subsystem and know little about how the software works except that there's a central database from which he can retrieve things like motion components (just data) and transform them by applying physics to that data. And that makes his job very simple and makes it so he can do what he does best with the minimum knowledge of how everything else works. Input central data and output central data: that's all each subsystem has to do correctly for everything else to work. It's the closest thing I've found in my field to "assembly line programming" where each developer can do his thing with minimum knowledge about how the overall system works.
Testing is still also quite simple because of the narrow focus of each subsystem. We're no longer mocking concrete objects with dependency injection so much as generating a minimum amount of data relevant to a particular system and testing whether the particular system provides the correct output for a given input. With so few systems to test (just dozens can make up a complex software), it also reduces the number of tests required substantially.
Breaking Encapsulation
The system then turns into a rather flat pipeline transforming central application state through independent subsystems that are practically oblivious to each other's existence. One might sometimes push a central event to the database which another system processes, but that other system is still oblivious about where that event came from. I've found this is the key to tackling complexity at least in my domain, and it is effectively through an entity-component system.
Yet it resembles something closer to procedural or functional programming at the broad scale to decouple all these subsystems and let them work with minimal knowledge of the outside world since we're breaking encapsulation in order to achieve this and avoid requiring the systems to talk to each other. When you zoom in, then you might find your share of objects being used to implement any one of these subsystems, but at the broadest scale, the systems resembles something other than OOP.
Global Data
I have to admit that I was very hesitant about applying ECS at first to an architectural design in my domain since, first, it hadn't been done before to my knowledge in popular commercial competitors (3DS Max, SoftImage, etc), and second, it looks like a whole bunch of globally-accessible data.
I've found, however, that this is not a big problem. We can still very effectively maintain invariants, perhaps even better than before. The reason is due to the way the ECS organizes everything into systems and components. You can rest assured that an audio system won't try to mutate a motion component, e.g., not even under the hackiest of situations. Even with a poorly-coordinated team, it's very improbable that the ECS will degrade into something where you can no longer reason about which systems access which component, since it's rather obvious on paper and there are virtually no reasons whatsoever for a certain system to access an inappropriate component.
To the contrary it often removed many of the former temptations for hacky things with the data wide open since a lot of the hacky things done in our former codebase under loose coordination and crunch time was done in hasty attempts to x-ray abstractions and try to access the internals of the ecosystems of objects. The abstractions started to become leaky as a result of people, in a hurry, trying to just get and do things with the data they wanted to access. They were basically jumping through hoops trying to just access data which lead to interface designs degrading quickly.
There is something vaguely resembling encapsulation still just due to the way the system is organized since there's often only one system modifying a particular type of components (two in some exceptional cases). But they don't own that data, they don't provide functions to retrieve that data. The systems don't talk to each other. They all operate through the central ECS database (which is the only dependency that has to be injected into all these systems).
Flexibility and Extensibility
This is already widely-discussed in external resources about entity-component systems but they are extremely flexible at adapting to radically new design ideas
in hindsight, even concept-breaking ones like a suggestion for a creature which is a mammal, insect, and plant that sprouts leaves under sunlight all at once.
One of the reasons is because there are no central abstractions to break. You introduce some new components if you need more data for this or just create an entity which strings together the components required for a plant, mammal, and insect. The systems designed to process insect, mammal, and plant components then automatically pick it up and you might get the behavior you want without changing anything besides adding a line of code to instantiate an entity with a new combo of components. When you need whole new functionality, you just add a new system or modify an existing one.
What I haven't found discussed so much elsewhere is how much this eases maintenance even in scenarios when there are no concept-breaking design changes that we failed to anticipate. Even ignoring the flexibility of the ECS, it can really simplify things when your codebase reaches a certain scale.
Turning Objects Into Data
In a previous OOP-heavy codebase where I saw the difficulty of maintaining a codebase closer to the first graph above, the amount of code required exploded because the analogical Car in this diagram:
... had to be built as a completely separate subtype (class) implementing multiple interfaces. So we had an explosive number of objects in the system: a separate object for point lights from directional lights, a separate object for a fish eye camera from another, etc. We had thousands of objects implementing a few dozen abstract interfaces in endless combinations.
When I compared it to ECS, that required only hundreds and we were able to do the exact same things before using a small fraction of the code, because that turned the analogical Car entity into something that no longer requires its class. It turns into a simple collection of component data as a generalized instance of just one Entity type.
OOP Alternatives
So there are cases like this where OOP applied in excess at the broadest level of the design can start to really degrade maintainability. At the broadest birds-eye view of your system, it can help to flatten it and not try to model it so "deep" with objects interacting with objects interacting with objects, however abstractly.
Comparing the two systems I worked on in the past and now, the new one has more features but takes hundreds of thousands of LOC. The former required over 20 million LOC. Of course it's not the fairest comparison since the former one had a huge legacy, but if you take a slice of the two systems which are functionally quite equal without the legacy baggage (at least about as close to equal as we might get), the ECS takes a small fraction of the code to do the same thing, and partly because it dramatically reduces the number of classes there are in the system by turning them into collections (entities) of raw data (components) with hefty systems to process them instead of a boatload of small/medium objects.
Are there any scenarios where a truly non-OOP paradigm is actually a
better choice for a largescale solution? Or is that unheard of these
days?
It's far from unheard of. The system I'm describing above, for example, is widely used in games. It's quite rare in my field (most of the architectures in my field are COM-like with pure interfaces, and that's the type of architecture I worked on in the past), but I've found that peering over at what gamers are doing when designing an architecture made a world of difference in being able to create something that still remains very comprehensible at it grows and grows.
That said, some people consider ECS to be a type of object-oriented programming on its own. If so, it doesn't resemble OOP of a kind most of us would think of, since data (components and entities to compose them) and functionality (systems) are separated. It requires abandoning encapsulation at the broad system level which is often considered one of the most fundamental aspects of OOP.
High-Level Coding
But it seems to me that usually the pieces of higher level solutions
are almost always put together in a OOP fashion.
If you can piece together an application with very high-level code, then it tends to be rather small or medium in scale as far as the code your team has to maintain and can probably be assembled very effectively using OOP.
In my field in VFX, we often have to do things that are relatively low-level like raytracing, image processing, mesh processing, fluid dynamics, etc, and can't just piece these together from third party products since we're actually competing more in terms of what we can do at the low-level (users get more excited about cutting-edge, competitive production rendering improvements than, say, a nicer GUI). So there can be lots and lots of code ranging from very low-level shuffling of bits and bytes to very high-level code that scripters write through embedded scripting languages.
Interweb of Communication
But there comes a point with a large enough scale with any type of application, high-level or low-level or a combo, that revolves around a very complex central application state where I've found it no longer useful to try to encapsulate everything into objects. Doing so tends to multiply complexity and the difficulty to reason about what goes on due to the multiplied amount of interaction that goes on between everything. It no longer becomes so easy to reason about thousands of ecosystems talking to each other if there isn't a breaking point at a large enough scale where we stop modeling each thing as encapsulated ecosystems that have to talk to each other. Even if each one is individually simple, everything taken in as a whole can start to more than overwhelm the mind, and we often have to take a whole lot of that in to make changes and add new features and debug things and so forth if you try to revolve the design of an entire large-scale system solely around OOP principles. It can help to break free of encapsulation at some scale for at least some domains.
At that point it's not necessarily so useful anymore to, say, have a physics system encapsulate its own data (otherwise many things could want to talk to it and retrieve that data as well as initialize it with the appropriate input data), and that's where I found this alternative through ECS so helpful, since it turns the analogical physics system, and all such hefty systems, into a "central database transformer" or a "central database reader which outputs something new" which can now be oblivious about each other. Each system then starts to resemble more like a process in a flat pipeline than an object which forms a node in a very complex graph of communication.

Does procedural programming have any advantages over OOP?

[Edit:] Earlier I asked this as a perhaps poorly-framed question about when to use OOP versus when to use procedural programming - some responses implied I was asking for help understanding OOP. On the contrary, I have used OOP a lot but want to know when to use a procedural approach. Judging by the responses, I take it that there is a fairly strong consensus that OOP is usually a better all-round approach but that a procedural language should be used if the OOP architecture will not provide any reuse benefits in the long term.
However my experience as a Java programmer has been otherwise. I saw a massive Java program that I architected rewritten by a Perl guru in 1/10 of the code that I had written and seemingly just as robust as my model of OOP perfection. My architecture saw a significant amount of reuse and yet a more concise procedural approach had produced a superior solution.
So, at the risk of repeating myself, I'm wondering in what situations should I choose a procedural over an object-oriented approach. How would you identify in advance a situation in which an OOP architecture is likely to be overkill and a procedural approach more concise and efficient.
Can anyone suggest examples of what those scenarios would look like?
What is a good way to identify in advance a project that would be better served by a procedural programming approach?
I like Glass' rules of 3 when it comes to Reuse (which seems to be what you're interested in).
1) It is 3 times as difficult to
build reusable components as single
use components 2) A reusable
component should be tried out in three
different applications before it will
be sufficiently general to accept into
a reuse library
From this I think you can extrapolate these corollaries
a) If you don't have the budget
for 3 times the time it would take you
to build a single use component, maybe
you should hold off on reuse. (Assuming Difficulty = Time)
b) If
you don't have 3 places where you'd
use the component you're building,
maybe you should hold off on building
the reusable component.
I still think OOP is useful for building the single use component, because you can always refactor it into something that is really reusable later on. (You can also refactor from PP to OOP but I think OOP comes with enough benefits regarding organization and encapsulation to start there)
Reusability (or lack of it) is not bound to any specific programming paradigm. Use object oriented, procedural, functional or any other programming as needed. Organization and reusability come from what you do, not from the tool.
Those who religiously support OOP don't have any facts to justify their support, as we see here in these comments as well. They are trained (or brain washed) in universities to use and praise OOP and OOP only and that is why they support it so blindly. Have they done any real work in PP at all? Other then protecting code from careless programmers in a team environment, OOP doesn't offer much. Personally working both in PP and OOP for years, I find that PP is simple, straight forward and more efficient, and I agree with the following wise men and women:
(Reference: http://en.wikipedia.org/wiki/Object-oriented_programming):
A number of well-known researchers and programmers have criticized OOP. Here is an incomplete list:
Luca Cardelli wrote a paper titled “Bad Engineering Properties of Object-Oriented Languages”.
Richard Stallman wrote in 1995, “Adding OOP to Emacs is not clearly an improvement; I used OOP when working on the Lisp Machine window systems, and I disagree with the usual view that it is a superior way to program.”
A study by Potok et al. has shown no significant difference in productivity between OOP and procedural approaches.
Christopher J. Date stated that critical comparison of OOP to other technologies, relational in particular, is difficult because of lack of an agreed-upon and rigorous definition of OOP. A theoretical foundation on OOP is proposed which uses OOP as a kind of customizable type system to support RDBMS.
Alexander Stepanov suggested that OOP provides a mathematically-limited viewpoint and called it “almost as much of a hoax as Artificial Intelligence” (possibly referring to the Artificial Intelligence projects and marketing of the 1980s that are sometimes viewed as overzealous in retrospect).
Paul Graham has suggested that the purpose of OOP is to act as a “herding mechanism” which keeps mediocre programmers in mediocre organizations from “doing too much damage”. This is at the expense of slowing down productive programmers who know how to use more powerful and more compact techniques.
Joe Armstrong, the principal inventor of Erlang, is quoted as saying “The problem with object-oriented languages is they’ve got all this implicit environment that they carry around with them. You wanted a banana but what you got was a gorilla holding the banana and the entire jungle.”
Richard Mansfield, author and former editor of COMPUTE! magazine, states that “like countless other intellectual fads over the years (“relevance”, communism, “modernism”, and so on—history is littered with them), OOP will be with us until eventually reality asserts itself. But considering how OOP currently pervades both universities and workplaces, OOP may well prove to be a durable delusion. Entire generations of indoctrinated programmers continue to march out of the academy, committed to OOP and nothing but OOP for the rest of their lives.” and also is quoted as saying “OOP is to writing a program, what going through airport security is to flying”.
You gave the answer yourself - big projects simply need OOP to prevent getting too messy.
From my point of view, the biggest advantage of OOP is code organization. This includes the principles of DRY and encapsulation.
I would suggest using the most concise, standards-based approach that you can find for any given problem. Your colleague who used Perl demonstrated that a good developer who knows a particular tool well can achieve great results regardless of the methodology. Rather than compare your Java-versus-Perl projects as a good example of the procedural-versus-OOP debate, I would like to see a face-off between Perl and a similarly concise language such as Ruby, which happens to also have the benefits of object orientation. Now that's something I'd like to see. My guess is Ruby would come out on top but I'm not interested in provoking a language flame-war here - my point is only that you choose the appropriate tool for the job - whatever approach can accomplish the task in the most efficient and robust way possible. Java may be robust because of its object orientation but as you and your colleague and many others who are converting to dynamic languages such as Ruby and Python are finding these days, there are much more efficient solutions out there, whether procedural or OOP.
I think DRY principle (Don't Repeat Yourself) combined with a little Agile is a good approach. Build your program incrementally starting with the simplest thing that works then add features one by one and re-factor your code as necessary as you go along.
If you find yourself writing the same few lines of code again and again - maybe with different data - it's time to think about abstractions that can help separate the stuff that changes from the stuff that stays the same.
Create thorough unit tests for each iteration so that you can re-factor with confidence.
It's a mistake to spend too much time trying to anticipate which parts of your code need to be reusable. It will soon become apparent once the system starts to grow in size.
For larger projects with multiple concurrent development teams you need to have some kind of architectural plan to guide the development, but if you are working on your own or in small cooperative team then the architecture will emerge naturally if you stick to the DRY principle.
Another advantage of this approach is that whatever you do is based on real world experience. My favourite analogy - you have to play with the bricks before you can imagine how the building might be constructed.
I think you should use procedural style when you have a very well specified problem, the specification won't change and you want a very fast running program for it. In this case you may trade the maintainability for performance.
Usually this is the case when you write a game engine or a scientific simulation program. If your program calculate something more than million times per second it should be optimized to the edge.
You can use very efficient algorithms but it won't be fast enough until you optimize the cache usage. It can be a big performance boost your data is cached. This means the CPU don't need fetch bytes from the RAM, it know them. To achieve this you should try to store your data close to each other, your executable and data size should be minimal, and try using as less pointers as you can (use static global fixed sized arrays where you can afford).
If you use pointers you are continuously jumping in the memory and your CPU need to reload the cache every time. OOP code is full of pointers: every object is stored by its memory address. You call new everywhere which spread your objects all over the memory making the cache optimization almost impossible (unless you have an allocator or a garbage collector that keeps things close to each other). You call callbacks and virtual functions. The compiler usually can't inline the virtual functions and a virtual function call is relatively slow (jump to the VMT, get the address of the virtual function, call it [this involves pushing the parameters and local variables on the stack, executing the function then popping everything]). This matters a lot when you have a loop running from 0 to 1000000 25 times in every second. By using procedural style there aren't virtual function and the optimizar can inline everything in those hot loops.
If the project is so small that it would be contained within one class and is not going to be used for very long, I would consider using functions. Alternatively if the language you are using does not support OO (e.g. c).
"The problem with object-oriented languages is they’ve got all this implicit environment that they carry around with them. You wanted a banana but what you got was a gorilla holding the banana and the entire jungle.” —Joe Armstrong
Do you want the jungle?
I think the suitability of OOP depends more on the subject area you're working in than the size of the project. There are some subject areas (CAD, simulation modeling, etc.) where OOP maps naturally to the concepts involved. However, there are a lot of other domains where the mapping ends up being clumsy and incongruous. Many people using OOP for everything seem to spend a lot of time trying to pound square pegs into round holes.
OOP has it's place, but so do procedural programming, functional programming, etc. Look at the problem you're trying to solve, then choose a programming paradigm that allows you to write the simplest possible program to solve it.
Procedural programs can be simpler for a certain type of program. Typically, these are the short script-like programs.
Consider this scenario:
Your code is not OO. You have data structures and many functions throughout your progam that operate on the data structures. Each function takes a data structure as a parameter and does different things depending on a "data_type" field in the data structure.
IF all is working and not going to be changed, who cares if it's OO or not? It's working. It's done. If you can get to that point faster writing procedurally, then maybe that's the way to go.
But are you sure it's not going to be changed? Let's say you're likely to add new types of data structures. Each time you add a new data structure type that you want those functions to operate on, you have to make sure you find and modify every one of those functions to add a new "else if" case to check for and add the behavior you want to affect the new type of data structure. The pain of this increases as the program gets larger and more complicated. The more likely this is, the better off you would be going with the OO approach.
And - are you sure that it's working with no bugs? More involved switching logic creates more complexity in testing each unit of code. With polymorphic method calls, the language handles the switching logic for you and each method can be simpler and more straightforward to test.
The two concepts are not mutually exclusive, it is very likely that you will use PP in conjunction with OOP, I can't see how to segregate them.
I believe Grady Booch said once that you really start to benefit a lot from OOP at 10000+ lines of code.
However, I'd always go the OO-way. Even for 200 lines. It's a superior approach in a long term, and the overhead is just an overrated excuse. All the big things start small.
One of the goals of OOP was to make reusability easier however it is not the only purpose. The key to learning to use objects effectively is Design Patterns.
We are all used to the idea of algorithms which tell us how to combine different procedures and data structures to perform common tasks. Conversely look at Design Patterns by the Gang of Four for ideas on how to combine objects to perform common tasks.
Before I learned about Design Patterns I was pretty much in the dark about how to use objects effectively other than as a super type structure.
Remember that implementing Interfaces is just as important if not more important than inheritance. Back in the day C++ was leading example of object oriented programming and using interfaces are obscured compared to inheritance (virtual functions, etc). The C++ Legacy meant a lot more emphasis was placed on reusing behavior in the various tutorials and broad overviews. Since then Java, C#, and other languages have moved interface up to more a focus.
What interfaces are great for is precisely defining how two object interact with each. It is not about reusing behavior. As it turns out much of our software is about how the different parts interact. So using interface gives a lot more productivity gain than trying to make reusable components.
Remember that like many other programming ideas Objects are a tool. You will have to use your best judgment as to how well they work for your project. For my CAD/CAM software for metal cutting machines there are important math functions that are not placed in objects because there is no reason for them be in objects. Instead they are exposed from library and used by the object that need them. Then there is are some math function that were made object oriented as their structure naturally lead to this setup. (Taking a list of points and transforming it in on of several different types of cutting paths). Again use your best judgment.
Part of your answer depends on what language you're using. I know that in Python, it's pretty simple to move procedural code into a class, or a more formal object.
One of my heuristics is a based on how the "state" of the situation is. If the procedure pollutes the namespace, or could possibly affect the global state (in a bad, or unpredictable way), then encapsulating that function in an object or class is probably wise.
My two cents...
Advantages of procedural programming
Simple designing (fast proof of concept, battle with dramatically
dynamic requirements)
Simple inter-project communications
Natural when temporal order matters
Less overhead at runtime
The more Procedural code become good the closer it's to Functional. And advantages of FP are well known.
I always begin designing in a top-down fashion and in the top parts it's much easier to think in OOP terms. But when comes the time to code some little specific parts you are much more productive with just procedure programming.
OOP is cool in designing and in shaping the project, so that the divide-et-impera paradigm can be applied. But you cannot apply it in every aspect of your code, as it were a religion :)
If you "think OO" when you're programming, then I'm not sure it makes sense to ask "when should I revert to procedural programming?" This is equivalent to asking java programmers what they can't do as well because java requires classes. (Ditto .NET languages).
If you have to make an effort to get past thinking procedurally, then I'd advise asking about how you can overcome that (if you care to); otherwise stay with procedural. If it's that much effort to get into OOP-mode, your OOP code probably won't work very well anyway (until you get further along the learning curve.)
IMHO, the long term benefits of OOP outweigh the time saved in the short term.
Like AZ said, using OOP in a procedural fashion (which I do quite a bit), is a good way to go (for smaller projects). The bigger the project, the more OOP you should employ.
You can write bad software in both concepts. Still, complex software are much easier to write, understand and maintain in OO languages than in procedural. I wrote highly complex ERP applications in procedural language (Oracle PL/SQL) and then switched to OOP (C#). It was and still is a breath of fresh air.
To this point, the arguments of using OO for DRY and encapsulation is just adding unnecessary complexity in terms of how implicit it is and just sheer of how many layers that a class can inherit a lot of properties and methods into it.
not to mention that it's really hard to design a good OO cause you'd end up adding unrelated/unnecessary things that are going to be inherited throughout the whole layers of classes that inherits them. which is really bad if one parent class gets messy, the whole codebase is messy. and gets refactored.
also the fact that those inherited properties are not specifically fit into the use case to the class that inherits it which requires to be overridden. and to the ones that don't need them at all just have them for no good reason.
for something that does not need to be shared, sure there's abstract properties. but you'd end up having to implement them in all the instances that tries to inherits them.
this inheritance is just too magicky and gets dangerous.
but I'd give OO credit on how it's good at enforcing of what should be available. but then again it's too much power that is really easy to be wrongly used.
In my opinion, final class should be the default. and you need to deliberately choose if you want to allow it to inheritance.
Most studies have found that OO code is more concise than procedural code. If you look at projects that re-wrote existing C code in C++ (not something I necessarily advise, BTW) , you normally see reductions in code size of between 50 and 75 percent.
So the answer is - always use OO!