Erlang serialization - serialization

I need to serialize a function in Erlang, send it over to another note, deserialize and execute it there. The problem I am having is with files. If the function reads from a file which is not in the second node I am getting an error. Is there a way how I can differentiate between serializable and not serializable constructs in Erlang? Thus if a function makes use of a file or pid, then it fails to serialize?
Thanks

First of all, if you are sending anonymous functions, be extremely careful with that. Or, rather, just don't do it
There are a couple of cases when this function won't even be executed or will be executed in a completely wrong way.
Every function in Erlang, even an anonymous one, belongs to some module, in the one it's been constructed inside, to be precise. If this function has been built in REPL, it's bound to erl_eval module, which is even more dangerous (I'll explain further, why).
Say, you start two nodes, one of them has a module named 'foo', and one doesn't have such a module loaded (and cannot load it). If you construct a lambda inside the module 'foo', send it to the second node and try to call it, you'll fail with {error, undef}.
There can be another funny problem. Try to make two versions of module 'foo', implement a 'bar' function inside each of them and implement a lambda inside of which (but the lambdas will be different). You'll get yet another error when trying to call a sent lambda.
I think, there could possibly be other tricky parts of sending lambdas to different nodes, but trust me, that's already quite a lot.
Secondly, there are tons of way you can get a process or a port inside a lambda without knowing it in advance
Even though there is a way of catching closured variables from a lambda (if you take a look at a binarized lambda, all the external variables that are being used inside it are listed starting from the 2nd byte), they are not the only source of potential pids or ports.
Consider an easy example: you call a self() function inside your lambda. What will it return? Right, a pid. Okay, we can probably parse the binary and catch this function call, along with a dozen of other built-in functions. But what will you do when you are calling some external function? ets:lookup(sometable, somekey)? some_module:some_function_that_returns_god_knows_what()? You don't know what they are going to return.
Now, to what you can actually do here
When working with files, always send filenames, not descriptors. If you need file's position or something, send it as well. File descriptors shouldn't be known outside the process they've been opened.
As I mentioned, do everything to avoid lambdas to be sent to other nodes. It's hard to tell how to avoid that, since I don't know your exact task. Maybe, you can send a list of functions to execute, like:
[{module1, parse_query},
{module1, dispatch_parsed_query},
{module2, validate_response},
{module2, serialize_query}]
and pass arguments through this functions sequence (be sure all the modules exist everywhere). Maybe, you can actually stick to some module that is going to be frequently changed and deployed over the entire cluster. Maybe, you might want to switch to JS/Lua and use externally started ports (Riak is using spidermonkey to process JS-written lambdas for Map/Reduce requests). Finally, you can actually get module's object code, send it over to another node and load it there. Just keep in mind it's not safe too. You can break some running processes, lose some constructed lambdas, and so on.

Related

How can I have a "private" Erlang module?

I prefer working with files that are less than 1000 lines long, so am thinking of breaking up some Erlang modules into more bite-sized pieces.
Is there a way of doing this without expanding the public API of my library?
What I mean is, any time there is a module, any user can do module:func_exported_from_the_module. The only way to really have something be private that I know of is to not export it from any module (and even then holes can be poked).
So if there is technically no way to accomplish what I'm looking for, is there a convention?
For example, there are no private methods in Python classes, but the convention is to use a leading _ in _my_private_method to mark it as private.
I accept that the answer may be, "no, you must have 4K LOC files."
The closest thing to a convention is to use edoc tags, like #private and #hidden.
From the docs:
#hidden
Marks the function so that it will not appear in the
documentation (even if "private" documentation is generated). Useful
for debug/test functions, etc. The content can be used as a comment;
it is ignored by EDoc.
#private
Marks the function as private (i.e., not part of the public
interface), so that it will not appear in the normal documentation.
(If "private" documentation is generated, the function will be
included.) Only useful for exported functions, e.g. entry points for
spawn. (Non-exported functions are always "private".) The content can
be used as a comment; it is ignored by EDoc.
Please note that this answer started as a comment to #legoscia's answer
Different visibilities for different methods is not currently supported.
The current convention, if you want to call it that way, is to have one (or several) 'facade' my_lib.erl module(s) that export the public API of your library/application. Calling any internal module of the library is playing with fire and should be avoided (call them at your own risk).
There are some very nice features in the BEAM VM that rely on being able to call exported functions from any module, such as
Callbacks (funs/anonymous funs), MFA, erlang:apply/3: The calling code does not need to know anything about the library, just that it's something that needs to be called
Behaviours such as gen_server need the previous point to work
Hot reloading: You can upgrade the bytecode of any module without stopping the VM. The code server inside the VM maintains at most two versions of the bytecode for any module, redirecting external calls (those with the Module:) to the most recent version and the internal calls to the current version. That's why you may see some ?MODULE: calls in long-running servers, to be able to upgrade the code
You'd be able to argue that these points'd be available with more fine-grained BEAM-oriented visibility levels, true. But I don't think it would solve anything that's not solved with the facade modules, and it'd complicate other parts of the VM/code a great deal.
Bonus
Something similar applies to records and opaque types, records only exist at compile time, and opaque types only at dialyzer time. Nothing stops you from accessing their internals anywhere, but you'll only find problems if you go that way:
You insert a new field in the record, suddenly, all your {record_name,...} = break
You use a library that returns an opaque_adt(), you know that it's a list and use like so. The library is upgraded to include the size of the list, so now opaque_adt() is a tuple() and chaos ensues
Only those functions that are specified in the -export attribute are visible to other modules i.e "public" functions. All other functions are private. If you have specified -compile(export_all) only then all functions in module are visible outside. It is not recommended to use -compile(export_all).
I don't know of any existing convention for Erlang, but why not adopt the Python convention? Let's say that "library-private" functions are prefixed with an underscore. You'll need to quote function names with single quotes for that to work:
-module(bar).
-export(['_my_private_function'/0]).
'_my_private_function'() ->
foo.
Then you can call it as:
> bar:'_my_private_function'().
foo
To me, that communicates clearly that I shouldn't be calling that function unless I know what I'm doing. (and probably not even then)

How to pass a variable from one require() to another?

I have a variable currently set globally, that I am using in two different require() functions. How can I have it passed from first require() to the other?
As far as I know: you don't. The only way to make them read the same data is by putting it somewhere so they both can read it. Because both require() functions are placed inside a normal JavaScript file, the only scope they share is the global scope.
The other possibility is by passing it to the callback function of the require(), but that means you have to write a "module" that returns your data (or use dojo/text plugin to read a file read-only).
It's not recommended to have two require() statements at all. If we're still talking about the same stopwatch (I've read your code a bit), you should put all event handlers in one require() block. Then you can put the data they both need inside that require() block and they will all be able to read it.
Some reasons why having multiple require() blocks is not beneficial:
Harder to read/maintain: If you have multiple require() blocks, it's confusing. Most people don't use it anyways (recommended or not).
More duplicates: If you're using common modules like dojo/query and dojo/dom, you will have to add them to your module list every time. I don't think it will download the file twice, but it's still more to write/maintain.
Scope issues: As you just noticed, you will have scoping issues like this when multiple require() blocks needs to access the same data. This means you have to put them in global scope (which is another bad practice), write a module with just plain data (not really useful) or you have to wrap both of them in a function, which you have to execute manually (confusing)
Seperation concerns: If you have a valid reason to seperate them, you should be seperating them into modules.

Is there a way in elisp to make variable access trigger a function call?

I am programming in elisp, and I would like to associate a symbol with a function such that an attempt to access the variable instead calls the function. In particular, I want to trigger a special error message when lisp code attempts to access a certain variable. Is there a way to do this?
Example: suppose I want the variable current-time to evaluate to whatever (current-time-string) evaluates to at the time the variable is accessed. Is this possible?
Note that I do not control the code that attempts to access the variable, so that code could be compiled, so walking the tree and manually replacing variable accesses with function calls is not really an option.
You are looking for the Common Lisp define-symbol-macro.
Emacs Lisp lacks this feature, you cannot accomplish what you are trying to do.
However, not all is lost if you just want an error on accessing a variable.
Just makunbound it and access will error out (unless, of course, someone else nds it first).
I don't think you can do that.
As Sam says, define-symbol-macro would be the closest thing in Lisp (tho you make it sound like the accesses might be compiled beforehand, in which case even define-symbol-macro would be powerless). The closest thing in Elisp would be cl-symbol-macrolet, but that is even more limiting than define-symbol-macro since it has to be placed lexically around the accesses.

Serializing Running Programs in a Functional Interpreter

I am writing an interpreter implemented functionally using a variations of the Cont Monad. Inspired by Smalltalk's use of images to capture a running program, I am investigating how to serialize the executing hosted program and need help determining how to accomplish this at a high level.
Problem Statement
Using the Cont monad, I can capture the current continuation of a running program in my interpreter. Storing the current continuation allows resuming interpreter execution by calling the continuation. I would like to serialize this continuation so that the state of a running program can be saved to disk or loaded by another interpreter instance. However, my language (I am both targeting and working in Javascript) does not support serializing functions this way.
I would be interested in an approach that can be used to build up the continuation at a given point of execution given some metadata without running the entire program again until it reaches that point. Preferably, minimal changes to the implementation of the interpreter itself would be made.
Considered Approach
One approach that may work is to push all control flow logic into the program state. For example, I currently express a C style for loop using the host language's recursion for the looping behavior:
var forLoop = function(init, test, update, body) {
var iter = function() {
// When test is true, evaluate the loop body and start iter again
// otherwise, evaluate an empty statement and finish
return branch(test,
next(
next(body, update),
iter()),
empty);
};
return next(
init,
iter());
};
This is a nice solution but if I pause the program midway though a for loop, I don't know how I can serialize the continuation that has been built up.
I know I can serialize a transformed program using jumps however, and the for loop can be constructed from jump operations. A first pass of my interpreter would generate blocks of code and save these in the program state. These blocks would capture some action in the hosted language and potentially execute other blocks. The preprocessed program would look something like this:
Label Actions (Block of code, there is no sequencing)
-----------------------------------
start: init, GOTO loop
loop: IF test GOTO loop_body ELSE GOTO end
loop_body: body, GOTO update
update: update, GOTO loop
end: ...
This makes each block of code independent, only relying on values stored in the program state.
To serialize, I would save off the current label name and the state when it was entered. Deserialization would preprocess the input code to build the labels again and then resume at the given label with the given state. But now I have to think in terms of these blocks when implementing my interpreter. Even using composition to hide some of this seems kind of ugly.
Question
Are there any good existing approaches for addressing this problem? Am I thinking about serializing a program the entirely wrong way? Is this even possible for structures like this?
After more research, I have some thoughts on how I would do this. However, I'm not sure that adding serialization is something I want to do at this point as it would effect the rest of the implementation so much.
I'm not satisfied with this approach and would greatly like to hear any alternatives.
Problem
As I noted, transforming the program into a list of statements makes serialization easier. The entire program can be transformed into something like assembly language, but I wanted to avoid this.
Keeping a concept of expressions, what I didn't originally consider is that function calls can occur inside of deeply nested expressions. Take this program for example:
function id(x) { return x; }
10 + id(id(5)) * id(3);
The serializer should be able to serialize the program at any statement, but the statement could potentially be evaluated inside of an expression.
Host Functions In the State
The reason the program state cannot be easily serialized is that it contains host functions for continuations. These continuations must be transformed into data structures that can be serialized and independently reconstructed into the action the original continuation represented. Defunctionalization is most often used to express a higher order language in a first order language, but I believe it would also enable serialization.
Not all continuations can be easily defunctionalized without drastically rewriting the interpreter. As we are only interested in serialization at specific points, serialization at these points requires the entire continuation stack be defunctionalized. So all statements and expressions must be defunctionalized, but internal logic can remain unchanged in most cases because we don't want to allow serialization partway though an internal operation.
However, to my knowledge, defunctionalization does not work the Cont Monad because of bind statements. The lack of a good abstraction makes it difficult to work with.
Thoughts on a Solution
The goal is to create an object made up of only simple data structures that can be used to reconstruct the entire program state.
First, to minimize the amount of work required, I would rewrite the statements level interpreter to use something more like a state machine that can be more easily serialized. Then, I would defunctionalize expressions. Function calls would push the defunctionlized continuation for the remaining expression onto an internal stack for the state machine.
Making Program State a Serializable Object
Looking at how statements work, I'm not convinced that the Cont Monad is the best approach for chaining statements together (I do think it works pretty well at the expression level and for internal operations however). A state machine seems more natural approach, and this would also be easier to serialize.
A machine that steps between statements would be written. All other structures in the state would also be made serializable. Builtin functions would have to use serializable handles to identify them so that no functions are in the state.
Handling Expressions
Expressions would be rewritten to pass defunctionalized continuations instead of host function continuations. When a function call is encountered in an expression, it captures the defunctionalized current continuation and pushes it onto the statement machine's internal stack (This would only happen for hosted functions, not builtin ones), creating a restore point where computation can resume.
When the function returns, the defunctionalized continuation is passed the result.
Concerns
Javascript also allows hosted functions to be evaluated inside almost any expression (getters, setters, type conversion, higher order builtins) and this may complicate things if we allow serialization inside of those functions.
Defunctionalization seems to require working directly with continuations and would make the entire interpreter less flexible.

Determine if a function is async-signal-safe (can be called inside a signal handler)

My questions are:
Is there a way to conclusively determine if a function is async-signal-safe if you don't have access to its implementation?
If not, is there a way to test if function would be async-signal-safe enough to call from a signal handler?
If you reads the man pages of signal() or sigaction(), you get a list of async-signal-safe functions (functions that can be safely called inside a signal handler). However, I believe that this list is not exhaustive. For example, the following page http://linux.die.net/man/7/signal, under the Async-signal-safe functions header, reads:
POSIX.1-2004 (also known as POSIX.1-2001 Technical Corrigendum 2) requires an implementation to guarantee that the following functions can be safely called inside a signal handler:
And then it proceeds to list the normal async-signal-safe functions listed in the man pages above. As I read it, it says "it requires", not "these are the only ones".
For example, this site says that back_trace_symbols_fd() is async-signal safe. That function obtains is data from dladdr() and it doesn't use malloc() like back_trace_symbols(), so it looks like it may be safe. Also, I did some testing, and the output struct of dladdr() contains char* variables, but these are NOT malloc'ed at runtime. The char string they point to exists at run-time even before dladdr() is called.
Any thoughts or ideas that can point me in the right direction are appreciated.
If you don't have access to the function's implementation, you can look at the manual page. If the manual page doesn't say it is async-safe, and the POSIX standard doesn't say it is async-safe, the only safe conclusion is "it is not async-safe" (coupled with "do not use it").
There is no 100% reliable way to test whether a function is async-safe. Remember, testing can only show the presence of bugs, not their absence (Dijkstra). The mere fact that you don't manage to tickle the function into misbehaving under test may simply mean that your testing is not adequate (but rest assured, the important customer who you can't afford to offend will immediately and accidentally devise a devastatingly effective test that demonstrates that the function is not async-safe almost as soon as you release the code with the faulty assumption).
What are you hoping to achieve in the signal handler? You should consider whether it is the right place for it. It is probably best to follow the advice of the man page:
In general, signal handlers should do little more
than set a flag; most other actions are not safe.