Serializing Running Programs in a Functional Interpreter - serialization

I am writing an interpreter implemented functionally using a variations of the Cont Monad. Inspired by Smalltalk's use of images to capture a running program, I am investigating how to serialize the executing hosted program and need help determining how to accomplish this at a high level.
Problem Statement
Using the Cont monad, I can capture the current continuation of a running program in my interpreter. Storing the current continuation allows resuming interpreter execution by calling the continuation. I would like to serialize this continuation so that the state of a running program can be saved to disk or loaded by another interpreter instance. However, my language (I am both targeting and working in Javascript) does not support serializing functions this way.
I would be interested in an approach that can be used to build up the continuation at a given point of execution given some metadata without running the entire program again until it reaches that point. Preferably, minimal changes to the implementation of the interpreter itself would be made.
Considered Approach
One approach that may work is to push all control flow logic into the program state. For example, I currently express a C style for loop using the host language's recursion for the looping behavior:
var forLoop = function(init, test, update, body) {
var iter = function() {
// When test is true, evaluate the loop body and start iter again
// otherwise, evaluate an empty statement and finish
return branch(test,
next(
next(body, update),
iter()),
empty);
};
return next(
init,
iter());
};
This is a nice solution but if I pause the program midway though a for loop, I don't know how I can serialize the continuation that has been built up.
I know I can serialize a transformed program using jumps however, and the for loop can be constructed from jump operations. A first pass of my interpreter would generate blocks of code and save these in the program state. These blocks would capture some action in the hosted language and potentially execute other blocks. The preprocessed program would look something like this:
Label Actions (Block of code, there is no sequencing)
-----------------------------------
start: init, GOTO loop
loop: IF test GOTO loop_body ELSE GOTO end
loop_body: body, GOTO update
update: update, GOTO loop
end: ...
This makes each block of code independent, only relying on values stored in the program state.
To serialize, I would save off the current label name and the state when it was entered. Deserialization would preprocess the input code to build the labels again and then resume at the given label with the given state. But now I have to think in terms of these blocks when implementing my interpreter. Even using composition to hide some of this seems kind of ugly.
Question
Are there any good existing approaches for addressing this problem? Am I thinking about serializing a program the entirely wrong way? Is this even possible for structures like this?

After more research, I have some thoughts on how I would do this. However, I'm not sure that adding serialization is something I want to do at this point as it would effect the rest of the implementation so much.
I'm not satisfied with this approach and would greatly like to hear any alternatives.
Problem
As I noted, transforming the program into a list of statements makes serialization easier. The entire program can be transformed into something like assembly language, but I wanted to avoid this.
Keeping a concept of expressions, what I didn't originally consider is that function calls can occur inside of deeply nested expressions. Take this program for example:
function id(x) { return x; }
10 + id(id(5)) * id(3);
The serializer should be able to serialize the program at any statement, but the statement could potentially be evaluated inside of an expression.
Host Functions In the State
The reason the program state cannot be easily serialized is that it contains host functions for continuations. These continuations must be transformed into data structures that can be serialized and independently reconstructed into the action the original continuation represented. Defunctionalization is most often used to express a higher order language in a first order language, but I believe it would also enable serialization.
Not all continuations can be easily defunctionalized without drastically rewriting the interpreter. As we are only interested in serialization at specific points, serialization at these points requires the entire continuation stack be defunctionalized. So all statements and expressions must be defunctionalized, but internal logic can remain unchanged in most cases because we don't want to allow serialization partway though an internal operation.
However, to my knowledge, defunctionalization does not work the Cont Monad because of bind statements. The lack of a good abstraction makes it difficult to work with.
Thoughts on a Solution
The goal is to create an object made up of only simple data structures that can be used to reconstruct the entire program state.
First, to minimize the amount of work required, I would rewrite the statements level interpreter to use something more like a state machine that can be more easily serialized. Then, I would defunctionalize expressions. Function calls would push the defunctionlized continuation for the remaining expression onto an internal stack for the state machine.
Making Program State a Serializable Object
Looking at how statements work, I'm not convinced that the Cont Monad is the best approach for chaining statements together (I do think it works pretty well at the expression level and for internal operations however). A state machine seems more natural approach, and this would also be easier to serialize.
A machine that steps between statements would be written. All other structures in the state would also be made serializable. Builtin functions would have to use serializable handles to identify them so that no functions are in the state.
Handling Expressions
Expressions would be rewritten to pass defunctionalized continuations instead of host function continuations. When a function call is encountered in an expression, it captures the defunctionalized current continuation and pushes it onto the statement machine's internal stack (This would only happen for hosted functions, not builtin ones), creating a restore point where computation can resume.
When the function returns, the defunctionalized continuation is passed the result.
Concerns
Javascript also allows hosted functions to be evaluated inside almost any expression (getters, setters, type conversion, higher order builtins) and this may complicate things if we allow serialization inside of those functions.
Defunctionalization seems to require working directly with continuations and would make the entire interpreter less flexible.

Related

Should I avoid unwrap in production application?

It's easy to crash at runtime with unwrap:
fn main() {
c().unwrap();
}
fn c() -> Option<i64> {
None
}
Result:
Compiling playground v0.0.1 (file:///playground)
Running `target/debug/playground`
thread 'main' panicked at 'called `Option::unwrap()` on a `None` value', ../src/libcore/option.rs:325
note: Run with `RUST_BACKTRACE=1` for a backtrace.
error: Process didn't exit successfully: `target/debug/playground` (exit code: 101)
Is unwrap only designed for quick tests and proofs-of-concept?
I can not affirm "My program will not crash here, so I can use unwrap" if I really want to avoid panic! at runtime, and I think avoiding panic! is what we want in a production application.
In other words, can I say my program is reliable if I use unwrap? Or must I avoid unwrap even if the case seems simple?
I read this answer:
It is best used when you are positively sure that you don't have an error.
But I don't think I can be "positively sure".
I don't think this is an opinion question, but a question about Rust core and programming.
While the whole “error handling”-topic is very complicated and often opinion based, this question can actually be answered here, because Rust has rather narrow philosophy. That is:
panic! for programming errors (“bugs”)
proper error propagation and handling with Result<T, E> and Option<T> for expected and recoverable errors
One can think of unwrap() as converting between those two kinds of errors (it is converting a recoverable error into a panic!()). When you write unwrap() in your program, you are saying:
At this point, a None/Err(_) value is a programming error and the program is unable to recover from it.
For example, say you are working with a HashMap and want to insert a value which you may want to mutate later:
age_map.insert("peter", 21);
// ...
if /* some condition */ {
*age_map.get_mut("peter").unwrap() += 1;
}
Here we use the unwrap(), because we can be sure that the key holds a value. It would be a programming error if it didn't and even more important: it's not really recoverable. What would you do when at that point there is no value with the key "peter"? Try inserting it again ... ?
But as you may know, there is a beautiful entry API for the maps in Rust's standard library. With that API you can avoid all those unwrap()s. And this applies to pretty much all situations: you can very often restructure your code to avoid the unwrap()! Only in a very few situation there is no way around it. But then it's OK to use it, if you want to signal: at this point, it would be a programming bug.
There has been a recent, fairly popular blog post on the topic of “error handling” whose conclusion is similar to Rust's philosophy. It's rather long but worth reading: “The Error Model”. Here is my try on summarizing the article in relation to this question:
deliberately distinguish between programming bugs and recoverable errors
use a “fail fast” approach for programming bugs
In summary: use unwrap() when you are sure that the recoverable error that you get is in fact unrecoverable at that point. Bonus points for explaining “why?” in a comment above the affected line ;-)
In other words, can I say my program is reliable if I use unwrap? Or must I avoid unwrap even if the case seems simple?
I think using unwrap judiciously is something you have to learn to handle, it can't just be avoided.
My rhetorical question barrage would be:
Can I say my program is reliable if I use indexing on vectors, arrays or slices?
Can I say my program is reliable if I use integer division?
Can I say my program is reliable if I add numbers?
(1) is like unwrap, indexing panics if you make a contract violation and try to index out of bounds. This would be a bug in the program, but it doesn't catch as much attention as a call to unwrap.
(2) is like unwrap, integer division panics if the divisor is zero.
(3) is unlike unwrap, addition does not check for overflow in release builds, so it may silently result in wraparound and logical errors.
Of course, there are strategies for handling all of these without leaving panicky cases in the code, but many programs simply use for example bounds checking as it is.
There are two questions folded into one here:
is the use of panic! acceptable in production
is the use of unwrap acceptable in production
panic! is a tool that is used, in Rust, to signal irrecoverable situations/violated assumptions. It can be used to either crash a program that cannot possibly continue in the face of this failure (for example, OOM situation) or to work around the compiler knowing it cannot be executed (at the moment).
unwrap is a convenience, that is best avoided in production. The problem about unwrap is that it does not state which assumption was violated, it is better instead to use expect("") which is functionally equivalent but will also give a clue as to what went wrong (without opening the source code).
unwrap() is not necessarily dangerous. Just like with unreachable!() there are cases where you can be sure some condition will not be triggered.
Functions returning Option or Result are sometimes just suited to a wider range of conditions, but due to how your program is structured those cases might never occur.
For example: when you create an iterator from a Vector you buid yourself, you know its exact length and can be sure how long invoking next() on it returns a Some<T> (and you can safely unwrap() it).
unwrap is great for prototyping, but not safe for production. Once you are done with your initial design you go back and replace unwrap() with Result<Value, ErrorType>.

Erlang serialization

I need to serialize a function in Erlang, send it over to another note, deserialize and execute it there. The problem I am having is with files. If the function reads from a file which is not in the second node I am getting an error. Is there a way how I can differentiate between serializable and not serializable constructs in Erlang? Thus if a function makes use of a file or pid, then it fails to serialize?
Thanks
First of all, if you are sending anonymous functions, be extremely careful with that. Or, rather, just don't do it
There are a couple of cases when this function won't even be executed or will be executed in a completely wrong way.
Every function in Erlang, even an anonymous one, belongs to some module, in the one it's been constructed inside, to be precise. If this function has been built in REPL, it's bound to erl_eval module, which is even more dangerous (I'll explain further, why).
Say, you start two nodes, one of them has a module named 'foo', and one doesn't have such a module loaded (and cannot load it). If you construct a lambda inside the module 'foo', send it to the second node and try to call it, you'll fail with {error, undef}.
There can be another funny problem. Try to make two versions of module 'foo', implement a 'bar' function inside each of them and implement a lambda inside of which (but the lambdas will be different). You'll get yet another error when trying to call a sent lambda.
I think, there could possibly be other tricky parts of sending lambdas to different nodes, but trust me, that's already quite a lot.
Secondly, there are tons of way you can get a process or a port inside a lambda without knowing it in advance
Even though there is a way of catching closured variables from a lambda (if you take a look at a binarized lambda, all the external variables that are being used inside it are listed starting from the 2nd byte), they are not the only source of potential pids or ports.
Consider an easy example: you call a self() function inside your lambda. What will it return? Right, a pid. Okay, we can probably parse the binary and catch this function call, along with a dozen of other built-in functions. But what will you do when you are calling some external function? ets:lookup(sometable, somekey)? some_module:some_function_that_returns_god_knows_what()? You don't know what they are going to return.
Now, to what you can actually do here
When working with files, always send filenames, not descriptors. If you need file's position or something, send it as well. File descriptors shouldn't be known outside the process they've been opened.
As I mentioned, do everything to avoid lambdas to be sent to other nodes. It's hard to tell how to avoid that, since I don't know your exact task. Maybe, you can send a list of functions to execute, like:
[{module1, parse_query},
{module1, dispatch_parsed_query},
{module2, validate_response},
{module2, serialize_query}]
and pass arguments through this functions sequence (be sure all the modules exist everywhere). Maybe, you can actually stick to some module that is going to be frequently changed and deployed over the entire cluster. Maybe, you might want to switch to JS/Lua and use externally started ports (Riak is using spidermonkey to process JS-written lambdas for Map/Reduce requests). Finally, you can actually get module's object code, send it over to another node and load it there. Just keep in mind it's not safe too. You can break some running processes, lose some constructed lambdas, and so on.

Coroutine what is the difference with C++ inline function?

As I understood it, coroutine is a sequence of function calls, with the difference that function calls do not allocate stack each call (they are resumed for the point they were stoped), or we can say that these function are executed in shared stack.
So the main advantage of coroutine is execution speed. Isn't it just inline functions from C++? (when instead of call, function body is inserted during compilation).
A "coroutine", in the commonly used sense, is basically a function that -- once started -- can be envisioned as running alongside the caller. That is, when the coroutine "yields" (a semi-special kind of return), it isn't necessarily done -- and "calling" it again will have the coroutine pick up right where it left off, with all its state intact, rather than starting from the beginning. The calls can thus be seen as kinda passing messages between the two functions.
Few languages fully and natively do this. (Stack-based languages tend to have a hard time with it, absent some functionality like Windows's "fibers".) Ruby does, apparently, and Python uses a limited version of it. I believe they call it a "generator", and it's basically used kinda like an iterable collection (whose iterator generates its next "element" on the fly). C# can semi do that as well (they call it an "iterator"), but the compiler actually turns the function into a class that implements a state machine of sorts.

STM32 programming tips and questions

I could not find any good document on internet about STM32 programming. STM's own documents do not explain anything more than register functions. I will greatly appreciate if anyone can explain my following questions?
I noticed that in all example programs that STM provides, local variables for main() are always defined outside of the main() function (with occasional use of static keyword). Is there any reason for that? Should I follow a similar practice? Should I avoid using local variables inside the main?
I have a gloabal variable which is updated within the clock interrupt handle. I am using the same variable inside another function as a loop condition. Don't I need to access this variable using some form of atomic read operation? How can I know that a clock interrupt does not change its value in the middle of the function execution? Should I need to cancel clock interrupt everytime I need to use this variable inside a function? (However, this seems extremely ineffective to me as I use it as loop condition. I believe there should be better ways of doing it).
Keil automatically inserts a startup code which is written in assembly (i.e. startup_stm32f4xx.s). This startup code has the following import statements:
IMPORT SystemInit
IMPORT __main
.In "C", it makes sense. However, in C++ both main and system_init have different names (e.g. _int_main__void). How can this startup code can still work in C++ even without using "extern "C" " (I tried and it worked). How can the c++ linker (armcc --cpp) can associate these statements with the correct functions?
you can use local or global variables, using local in embedded systems has a risk of your stack colliding with your data. with globals you dont have that problem. but this is true no matter where you are, embedded microcontroller, desktop, etc.
I would make a copy of the global in the foreground task that uses it.
unsigned int myglobal;
void fun ( void )
{
unsigned int myg;
myg=myglobal;
and then only use myg for the rest of the function. Basically you are taking a snapshot and using the snapshot. You would want to do the same thing if you are reading a register, if you want to do multiple things based on a sample of something take one sample of it and make decisions on that one sample, otherwise the item can change between samples. If you are using one global to communicate back and forth to the interrupt handler, well I would use two variables one foreground to interrupt, the other interrupt to foreground. yes, there are times where you need to carefully manage a shared resource like that, normally it has to do with times where you need to do more than one thing, for example if you had several items that all need to change as a group before the handler can see them change then you need to disable the interrupt handler until all the items have changed. here again there is nothing special about embedded microcontrollers this is all basic stuff you would see on a desktop system with a full blown operating system.
Keil knows what they are doing if they support C++ then from a system level they have this worked out. I dont use Keil I use gcc and llvm for microcontrollers like this one.
Edit:
Here is an example of what I am talking about
https://github.com/dwelch67/stm32vld/tree/master/stm32f4d/blinker05
stm32 using timer based interrupts, the interrupt handler modifies a variable shared with the foreground task. The foreground task takes a single snapshot of the shared variable (per loop) and if need be uses the snapshot more than once in the loop rather than the shared variable which can change. This is C not C++ I understand that, and I am using gcc and llvm not Keil. (note llvm has known problems optimizing tight while loops, very old bug, dont know why they have no interest in fixing it, llvm works for this example).
Question 1: Local variables
The sample code provided by ST is not particularly efficient or elegant. It gets the job done, but sometimes there are no good reasons for the things they do.
In general, you use always want your variables to have the smallest scope possible. If you only use a variable in one function, define it inside that function. Add the "static" keyword to local variables if and only if you need them to retain their value after the function is done.
In some embedded environments, like the PIC18 architecture with the C18 compiler, local variables are much more expensive (more program space, slower execution time) than global. On the Cortex M3, that is not true, so you should feel free to use local variables. Check the assembly listing and see for yourself.
Question 2: Sharing variables between interrupts and the main loop
People have written entire chapters explaining the answers to this group of questions. Whenever you share a variable between the main loop and an interrupt, you should definitely use the volatile keywords on it. Variables of 32 or fewer bits can be accessed atomically (unless they are misaligned).
If you need to access a larger variable, or two variables at the same time from the main loop, then you will have to disable the clock interrupt while you are accessing the variables. If your interrupt does not require precise timing, this will not be a problem. When you re-enable the interrupt, it will automatically fire if it needs to.
Question 3: main function in C++
I'm not sure. You can use arm-none-eabi-nm (or whatever nm is called in your toolchain) on your object file to see what symbol name the C++ compiler assigns to main(). I would bet that C++ compilers refrain from mangling the main function for this exact reason, but I'm not sure.
STM's sample code is not an exemplar of good coding practice, it is merely intended to exemplify use of their standard peripheral library (assuming those are the examples you are talking about). In some cases it may be that variables are declared external to main() because they are accessed from an interrupt context (shared memory). There is also perhaps a possibility that it was done that way merely to allow the variables to be watched in the debugger from any context; but that is not a reason to copy the technique. My opinion of STM's example code is that it is generally pretty poor even as example code, let alone from a software engineering point of view.
In this case your clock interrupt variable is atomic so long as it is 32bit or less so long as you are not using read-modify-write semantics with multiple writers. You can safely have one writer, and multiple readers regardless. This is true for this particular platform, but not necessarily universally; the answer may be different for 8 or 16 bit systems, or for multi-core systems for example. The variable should be declared volatile in any case.
I am using C++ on STM32 with Keil, and there is no problem. I am not sure why you think that the C++ entry points are different, they are not here (Keil ARM-MDK v4.22a). The start-up code calls SystemInit() which initialises the PLL and memory timing for example, then calls __main() which performs global static initialisation then calls C++ constructors for global static objects before calling main(). If in doubt, step through the code in the debugger. It is important to note that __main() is not the main() function you write for your application, it is a wrapper with different behaviour for C and C++, but which ultimately calls your main() function.

Can someone explain to me why I would need functional programming instead of OOP? [duplicate]

This question already has answers here:
Closed 12 years ago.
Possible Duplicate:
Functional programming vs Object Oriented programming
Can someone explain to me why I would need functional programming instead of OOP?
E.g. why would I need to use Haskell instead of C++ (or a similar language)?
What are the advantages of functional programming over OOP?
One of the big things I prefer in functional programming is the lack of "spooky action at a distance". What you see is what you get – and no more. This makes code far easier to reason about.
Let's use a simple example. Let's say I come across the code snippet X = 10 in either Java (OOP) or Erlang (functional). In Erlang I can know these things very quickly:
The variable X is in the immediate context I'm in. Period. It's either a parameter passed in to the function I'm reading or it's being assigned the first (and only—c.f. below) time.
The variable X has a value of 10 from this point onward. It will not change again within the block of code I'm reading. It cannot.
In Java it's more complicated:
The variable X might be defined as a parameter.
It might be defined somewhere else in the method.
It might be defined as part of the class the method is in.
Whatever the case is, since I'm not declaring it here, I'm changing its value. This means I don't know what the value of X will be without constantly scanning backward through the code to find the last place it was assigned or modified explicitly or implicitly (like in a for loop).
When I call another method, if X happens to be a class variable it may change out from underneath me with no way for me to know this without inspecting the code of that method.
In the context of a threading program it's even worse. X can be changed by something I can't even see in my immediate environment. Another thread may be calling the method in #5 that modifies X.
And Java is a relatively simple OOP language. The number of ways that X can be screwed around with in C++ is even higher and potentially more obscure.
And the thing is? This is just a simple example of how a common operation can be far more complicated in an OOP (or other imperative) language than in a functional. It also doesn't address the benefits of functional programming that don't involve mutable state, etc. like higher order functions.
There are three things about Haskell that I think are really cool:
1) It's a statically-typed language that is extremely expressive and lets you build highly maintainable and refactorable code quickly. There's been a big debate between statically typed languages like Java and C# and dynamic languages like Python and Ruby. Python and Ruby let you quickly build programs, using only a fraction of the number of lines required in a language like Java or C#. So, if your goal is to get to market quickly, Python and Ruby are good choices. But, because they're dynamic, refactoring and maintaining your code is tough. In Java, if you want to add a parameter to a method, it's easy to use the IDE to find all instances of the method and fix them. And if you miss one, the compiler catches it. With Python and Ruby, refactoring mistakes will only be caught as run-time errors. So, with traditional languages, you get to choose between quick development and lousy maintainability on the one hand and slow development and good maintainability on the other hand. Neither choice is very good.
But with Haskell, you don't have to make this type of choice. Haskell is statically typed, just like Java and C#. So, you get all the refactorability, potential for IDE support, and compile-time checking. But at the same time, types can be inferred by the compiler. So, they don't get in your way like they do with traditional static languages. Plus, the language offers many other features that allow you to accomplish a lot with only a few lines of code. So, you get the speed of development of Python and Ruby along with the safety of static languages.
2) Parallelism. Because functions don't have side effects, it's much easier for the compiler to run things in parallel without much work from you as a developer. Consider the following pseudo-code:
a = f x
b = g y
c = h a b
In a pure functional language, we know that functions f and g have no side effects. So, there's no reason that f has to be run before g. The order could be swapped, or they could be run at the same time. In fact, we really don't have to run f and g at all until their values are needed in function h. This is not true in a traditional language since the calls to f and g could have side effects that could require us to run them in a particular order.
As computers get more and more cores on them, functional programming becomes more important because it allows the programmer to easily take advantage of the available parallelism.
3) The final really cool thing about Haskell is also possibly the most subtle: lazy evaluation. To understand this, consider the problem of writing a program that reads a text file and prints out the number of occurrences of the word "the" on each line of the file. Suppose you're writing in a traditional imperative language.
Attempt 1: You write a function that opens the file and reads it one line at a time. For each line, you calculate the number of "the's", and you print it out. That's great, except your main logic (counting the words) is tightly coupled with your input and output. Suppose you want to use that same logic in some other context? Suppose you want to read text data off a socket and count the words? Or you want to read the text from a UI? You'll have to rewrite your logic all over again!
Worst of all, what if you want to write an automated test for your new code? You'll have to build input files, run your code, capture the output, and then compare the output against your expected results. That's do-able, but it's painful. Generally, when you tightly couple IO with logic, it becomes really difficult to test the logic.
Attempt 2: So, let's decouple IO and logic. First, read the entire file into a big string in memory. Then, pass the string to a function that breaks the string into lines, counts the "the's" on each line, and returns a list of counts. Finally, the program can loop through the counts and output them. It's now easy to test the core logic since it involves no IO. It's now easy to use the core logic with data from a file or from a socket or from a UI. So, this is a great solution, right?
Wrong. What if someone passes in a 100GB file? You'll blow out your memory since the entire file must be loaded into a string.
Attempt 3: Build an abstraction around reading the file and producing results. You can think of these abstractions as two interfaces. The first has methods nextLine() and done(). The second has outputCount(). Your main program implements nextLine() and done() to read from the file, while outputCount() just directly prints out the count. This allows your main program to run in constant memory. Your test program can use an alternate implementation of this abstraction that has nextLine() and done() pull test data from memory, while outputCount() checks the results rather than outputting them.
This third attempt works well at separating the logic and the IO, and it allows your program to run in constant memory. But, it's significantly more complicated than the first two attempts.
In short, traditional imperative languages (whether static or dynamic) frequently leave developers making a choice between
a) Tight coupling of IO and logic (hard to test and reuse)
b) Load everything into memory (not very efficient)
c) Building abstractions (complicated, and it slows down implementation)
These choices come up when reading files, querying databases, reading sockets, etc. More often than not, programmers seem to favor option A, and unit tests suffer as a consequence.
So, how does Haskell help with this? In Haskell, you would solve this problem exactly like in Attempt 2. The main program loads the whole file into a string. Then it calls a function that examines the string and returns a list of counts. Then the main program prints the counts. It's super easy to test and reuse the core logic since it's isolated from the IO.
But what about memory usage? Haskell's lazy evaluation takes care of that for you. So, even though your code looks like it loaded the whole file contents into a string variable, the whole contents really aren't loaded. Instead, the file is only read as the string is consumed. This allows it to be read one buffer at a time, and your program will in fact run in constant memory. That is, you can run this program on a 100GB file, and it will consume very little memory.
Similarly, you can query a database, build a resulting list containing a huge set of rows, and pass it to a function to process. The processing function has no idea that the rows came from a database. So, it's decoupled from its IO. And under-the-covers, the list of rows will be fetched lazily and efficiently. So, even though it looks like it when you look at your code, the full list of rows is never all in memory at the same time.
End result, you can test your function that processes the database rows without even having to connect to a database at all.
Lazy evaluation is really subtle, and it takes a while to get your head around its power. But, it allows you to write nice simple code that is easy to test and reuse.
Here's the final Haskell solution and the Approach 3 Java solution. Both use constant memory and separate IO from processing so that testing and reuse are easy.
Haskell:
module Main
where
import System.Environment (getArgs)
import Data.Char (toLower)
main = do
(fileName : _) <- getArgs
fileContents <- readFile fileName
mapM_ (putStrLn . show) $ getWordCounts fileContents
getWordCounts = (map countThe) . lines . map toLower
where countThe = length . filter (== "the") . words
Java:
import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.Reader;
class CountWords {
public interface OutputHandler {
void handle(int count) throws Exception;
}
static public void main(String[] args) throws Exception {
BufferedReader reader = null;
try {
reader = new BufferedReader(new FileReader(new File(args[0])));
OutputHandler handler = new OutputHandler() {
public void handle(int count) throws Exception {
System.out.println(count);
}
};
countThe(reader, handler);
} finally {
if (reader != null) reader.close();
}
}
static public void countThe(BufferedReader reader, OutputHandler handler) throws Exception {
String line;
while ((line = reader.readLine()) != null) {
int num = 0;
for (String word: line.toLowerCase().split("([.,!?:;'\"-]|\\s)+")) {
if (word.equals("the")) {
num += 1;
}
}
handler.handle(num);
}
}
}
If we compare Haskell and C++, functional programming makes debugging extremely easy, because there's no mutable state and variables like ones found in C, Python etc. which you should always care about, and it's ensured that, given some arguments, a function will always return the same results in spite the number of times you evaluate it.
OOP is orthogonal to any programming paradigm, and there are lanugages which combine FP with OOP, OCaml being the most popular, several Haskell implementations etc.