What gives Smalltalk the ability to do image persistence, and why can't languages like Ruby/Python serialize themselves? - serialization

In Smalltalk, you're able to save the state of the world into an image file. I assume this has to do with Smalltalk's ability to "serialize" itself -- that is, objects can produce their own source code.
1) Is this an accurate understanding?
2) What is the challenge in adding this ability to modern languages (non-lisp, obviously)?
3) Is "serialization" the right word? What's the correct jargon?

It's much simpler than "serializing". A Smalltalk image is simply a snapshot of the object memory. It takes the whole RAM contents (after garbage collecting) and dumps it into a file. On startup, it loads that snapshot from disk into RAM and continues processing where it left off. There are some hooks to perform special actions on snapshot and when resuming, but basically this is how it works.
(added: see Lukas Renggli's comment below for a crucial design choice that makes it so simple compared to other environments)

Extending upon Bert Freudenberg’s excellent answer.
1) Is this (i.e., objects' ability to serialize their own source code) an accurate understanding?
Nope. As Bert pointed out, a Smalltalk image is simply a memory snapshot. The single source of truth for both Smalltalk objects and Smalltalk programs is their memory representation. This is a huge difference from other languages, where programs are represented as text files.
2) What is the challenge in adding this ability to modern languages (non-lisp, obviously)?
Technically, bootstrapping an application from a memory snapshot should be possible for most languages. If I am not mistaken, there are solutions that use this approach to speed up startup times for Java applications. You'd have to agree on a canonical memory representation, though, and you'd need to take care to reinitialize native resources upon restarting the program. For example, in Smalltalk, open files and network connections are reopened. And also, there's a startup hook to fix the endianness of numbers. (A rough illustration of this reinitialization step follows after this answer.)
3) Is "serialization" the right word? What's the correct jargon?
Hibernation is the term.
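To illustrate the "reinitialize native resources on restart" point from answer 2) above, here is a rough Python sketch; this is an analogy, not how Smalltalk does it. A wrapper drops its OS-level file handle when serialized and reopens it when deserialized. The class name and file name are made up.

```python
import pickle

class LogFile:
    """Wraps a file handle so the wrapper survives pickling.

    The OS-level handle itself cannot be serialized, so we drop it in
    __getstate__ and reopen the file in __setstate__, mirroring the way
    a resumed image reopens files and sockets.
    """

    def __init__(self, path):
        self.path = path
        self.handle = open(path, "a")

    def __getstate__(self):
        state = self.__dict__.copy()
        del state["handle"]          # the native resource is not portable
        return state

    def __setstate__(self, state):
        self.__dict__.update(state)
        self.handle = open(self.path, "a")   # "startup hook": reacquire it

log = LogFile("demo.log")
restored = pickle.loads(pickle.dumps(log))   # handle is transparently reopened
print(restored.handle.name)
```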

The crucial difference is that Smalltalk treats a program as just a bunch of objects. And the IDE is just a bunch of editors editing those objects, so when you save and load an image, all your code is there just as when you left off.
For other languages it could be possible to do the same, but I guess there would be more fiddling, depending on how much reflection there is. In most other languages, reflection comes as an add-on, or even an afterthought, but in Smalltalk it is at the heart of the system.

This happens already in a way when you put your computer to sleep, right? The kernel writes running programs to disk and loads them up again later? Presumably the kernel could move a running program to a new machine over a network, assuming the same architecture on the other end? Java can serialize all objects too because of the JVM, right? Maybe the hurdle is just that different architectures imply varied memory layouts?
Edit: But I guess you're interested in using this functionality from the program itself. So I think it's just a matter of implementing the feature in the Python/Ruby interpreter and stdlib, and having some kind of virtual machine if you want to be able to move to a different hardware architecture.

Related

How does Smalltalk image handle IO?

I just started to learn Smalltalk, went through its syntax, but haven't done any real coding with it. While reading some introductory articles and some SO questions like:
What gives Smalltalk the ability to do image persistence, and why can't languages like Ruby/Python serialize themselves?
What is a Smalltalk “image”?
One question always comes into my mind: How does Smalltalk image handle IO?
A Smalltalk program can resume from where it exited, using information stored in the image. Say I have some open TCP connections (not to mention all sorts of buffers); how do they get recovered? There seems to be no way other than reopening them (confirmed by this answer). And if Smalltalk does reopen those connections, isn't it going against the idea of "resume execution of the program at a later time exactly from where you left off"? Or is there some magic behind it?
I don't mind if the answer is specific to certain dialects, say Pharo.
I would also be interested in resources to learn more about this topic.
As you have noted some resources are not part of the memory heap and therefore will not be recovered just by loading the image back in memory. In particular this applies to all kinds of resources managed by the operating system, and cross-platform Smalltalks where you can copy the image from one OS to another and restart the image even have to restore such resources differently than they were before.
The trick in the Smalltalks I have seen is that all classes receive a message immediately after the image resumed. By implementing a method for that message they can restore any transient resources (sockets, connections, foreign handles, ...) that their instances might need. To find all instances some Smalltalks provide messages such as allInstances, or you must maintain a registry of the relevant objects yourself.
And if Smalltalk does reopen those connections, isn't it going against the idea of "resume execution of the program at a later time exactly from where you left off"?
From a user perspective, after that reinitialization and reallocation of resources, everything still looks like "exactly where you left off", even though some technical details have changed under the hood. Of course this won't be the case if it is impossible to restore the resources (no network, for example). Some limits cannot be overcome by Smalltalk magic.
How does the Smalltalk image handle IO?
To make that resumption described above possible, all external resources are wrapped and represented as some kind of Smalltalk object. The wrapper objects will be persisted in the image, although they will be disconnected from the outside world when Smalltalk is shut down. The wrappers can then restore the external resources after the image has been started up again.
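As a rough cross-language analogy (not actual Smalltalk code), the wrapper-plus-startup-broadcast idea could be sketched in Python like this; the registry, class, and method names are invented for illustration.

```python
import socket

# Hypothetical sketch: external resources are held behind wrapper objects
# that register themselves, and a "startUp" broadcast after the image
# resumes asks each wrapper to reacquire its resource.

_resumable = []          # plays the role of allInstances / a manual registry

class TcpWrapper:
    def __init__(self, host, port):
        self.host, self.port = host, port
        self.sock = None
        _resumable.append(self)

    def startUp(self):
        # Reconnect; the old OS-level socket died with the previous session.
        self.sock = socket.create_connection((self.host, self.port), timeout=5)

    def shutDown(self):
        if self.sock:
            self.sock.close()
            self.sock = None

def resume_image():
    for wrapper in _resumable:   # the broadcast sent after the image resumes
        wrapper.startUp()
```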
It might be useful to add a small history lesson: Smalltalk was created in the 1970s at Xerox's Palo Alto Research Center (PARC). At the same time, in the same place, Personal Computing was invented. Also at the same time, in the same place, Ethernet was invented.
Smalltalk was a single integrated system: it was at the same time the IDE, the GUI, the shell, the kernel, the OS; even the microcode for the CPU was part of the Smalltalk system. Smalltalk didn't have to deal with non-Smalltalk resources from outside the image, because for all intents and purposes, there was no "outside". It was possible to re-create the exact machine state, since there wasn't really any boundary between the virtual machine and the machine. (Almost all of the system was implemented in Smalltalk. There were only a couple of tiny bits of microcode, assembly, and Mesa. Even what we would consider device drivers nowadays were written in Smalltalk.)
There was no need to persist network connections to other computers, because nobody outside of a few labs had networks. Heck, almost no organization even had more than one computer. There was no need to interact with the host OS because Smalltalk machines didn't have an OS; Smalltalk was the OS. (You probably know the famous quote from Dan Ingalls' Design Principles Behind Smalltalk: "An operating system is a collection of things that don't fit into a language. There shouldn't be one.") Because Smalltalk was the OS, there was no need for a filesystem, all data was simply objects.
Smalltalk cannot control what is outside of Smalltalk. This is a general property that is not unique to Smalltalk. You can break encapsulation in Java by editing the compiled bytecode. You can break type safety in Haskell by editing the compiled machine code. You can break memory safety in Rust by editing the compiled machine code.
So, all the guarantees, features, and properties of Smalltalk are only available as long as you don't leave Smalltalk.
Here's an even simpler example that does not involve networking or moving the image to a different machine: open a file in the host filesystem. Suspend the image. Delete the file. Resume the image. There is no possible way to resume the image in the same state.
All Smalltalk can do is approximate the state of external resources as well as it possibly can. It can attempt to re-open the file. If the file is gone, it can maybe attempt to create one with the same name. It can try to resume a network connection. If that fails, it can try to re-establish the connection, i.e., create a new connection to the same address.
But ultimately, everything outside the image is outside of the control of Smalltalk, and there is nothing Smalltalk can do about it.
Note that this impedance mismatch between the inside of the image and the "outside world" is one of the major criticisms that is typically leveled at Smalltalk. And if you look at Smalltalk systems that try to integrate deeply with the outside world, they often have to compromise. E.g. GNU Smalltalk, which is specifically designed to integrate deeply into a Unix system, actually gives up on the image and persistence.
I'll add one more angle to the nice answers of Joerg and JayK.
What is important to understand is the context of the time and age in which Smalltalk was created. (Joerg already pointed out the important aspect of everything being Smalltalk.) We are talking about the time right after ARPANET.
I think they were not expecting the collaboration and interconnection we have nowadays. The image was meant as a record of a session of a single programmer without any external communication. Times changed, and now you naturally ask the IO question. As JayK noted, you can re-init the image, so you will get an image similar to the point where you ended your session.
The real issue, the reason I've decided to add my 2c, is collaboration among multiple developers. This is where the image, the original idea, has in my opinion outlived its usefulness. There is no way to share an image among multiple developers so that they could develop at the same time and share the code.
Imagine: wouldn't it be great if you could have one central image, and developers would have only diffs and would open their environment where they left off, with everyone's new code incorporated? Sounds familiar? This is the kind of VCS we have, like Mercurial, Git, etc., but without the image, only the code. Sadly, such image re-construction does not exist.
Smalltalk is trying to catch up with the standard versioning tooling we use nowadays.
(Side note: Smalltalk dialects had their own "versioning" systems, but they are rather lacking in many ways compared to current VCSs. The ones used are Monticello (Pharo), ENVY (VA Smalltalk), and Store (VisualWorks).)
Pharo is now trying to catch the train and implement Git functionality via Iceberg. The Smalltalk/X-jv branch has integrated decent Mercurial support. Squeak has Squot for Git (for now) [thank you, JayK].
Now "only" (that is a BIG only) to add support for central/diff images :).
For some practical code dealing with image startUp and shutDown, take a look at the class side of ZnServer. SessionManager is the class providing all the functionality you need to deal with giving up system resources on shutdown and re-acquiring them again on startup.
Need to chime in on the source control discussion a bit.
The solution I have seen with VS(E)/VA is that you work with Envy/PVCS and share this repository with the developers.
Every developer has his/her own image with all the pros and cons of images.
One company I was working for was discussing whether it wouldn't make sense to rebuild the development image from scratch every couple of weeks in order to get rid of everything that might dilute the code quality (open file handles, global variables, pool dictionary entries; you name it, you will get it, and it will crash your code at run-time).
When it comes to building a "run-time", you take the plain tiny standard image, and all your code comes from bind files and SLLs.

Possible to share information between an add-on to an existing program and a standalone application? [duplicate]

I'm looking at building a Cocoa application on the Mac with a back-end daemon process (really just a mostly-headless Cocoa app, probably), along with 0 or more "client" applications running locally (although if possible I'd like to support remote clients as well; the remote clients would only ever be other Macs or iPhone OS devices).
The data being communicated will be fairly trivial, mostly just text and commands (which I guess can be represented as text anyway), and maybe the occasional small file (an image possibly).
I've looked at a few methods for doing this but I'm not sure which is "best" for the task at hand. Things I've considered:
Reading and writing to a file (…yes), very basic but not very scalable.
Pure sockets (I have no experience with sockets, but I seem to think I can use them to send data locally and over a network). Though it seems cumbersome if doing everything in Cocoa.
Distributed Objects: seems rather inelegant for a task like this
NSConnection: I can't really figure out what this class even does, but I've read of it in some IPC search results
I'm sure there are things I'm missing, but I was surprised to find a lack of resources on this topic.
I am currently looking into the same questions. For me the possibility of adding Windows clients later makes the situation more complicated; in your case the answer seems to be simpler.
About the options you have considered:
Control files: While it is possible to communicate via control files, you have to keep in mind that the files need to be communicated via a network file system among the machines involved. So the network file system serves as an abstraction of the actual network infrastructure, but does not offer the full power and flexibility the network normally has. Implementation: Practically, you will need to have at least two files for each pair of client/servers: a file the server uses to send a request to the client(s) and a file for the responses. If each process can communicate both ways, you need to duplicate this. Furthermore, both the client(s) and the server(s) work on a "pull" basis, i.e., they need to revisit the control files frequently and see if something new has been delivered.
The advantage of this solution is that it minimizes the need for learning new techniques. The big disadvantage is that it has huge demands on the program logic; a lot of things need to be taken care of by you (Will the files be written in one piece or can it happen that any party picks up inconsistent files? How frequently should checks be implemented? Do I need to worry about the file system, like caching, etc? Can I add encryption later without toying around with things outside of my program code? ...)
If portability were an issue (which, as far as I understood from your question, is not the case), then this solution would be easy to port to different systems and even different programming languages. However, I don't know of any network file system for iPhone OS, though I am not familiar with that platform.
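A minimal Python sketch of the control-file pattern just described (the file names and message format are made up); the write-to-temp-then-rename step is one way to avoid a reader picking up a half-written file:

```python
import json, os, time

REQUEST = "ipc_request.json"     # hypothetical paths, one file per direction
RESPONSE = "ipc_response.json"

def client_send(command):
    tmp = REQUEST + ".tmp"
    with open(tmp, "w") as f:    # write to a temp file, then rename, so the
        json.dump(command, f)    # server never sees a half-written request
    os.replace(tmp, REQUEST)

def server_poll():
    while True:
        if os.path.exists(REQUEST):
            with open(REQUEST) as f:
                command = json.load(f)
            os.remove(REQUEST)
            with open(RESPONSE + ".tmp", "w") as f:
                json.dump({"result": f"handled {command}"}, f)
            os.replace(RESPONSE + ".tmp", RESPONSE)
        time.sleep(0.5)          # "pull" basis: revisit the file regularly
```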
Sockets: The programming interface is certainly different; depending on your experience with socket programming it may mean that you have more work learning it first and debugging it later. Implementation: Practically, you will need a similar logic as before, i.e., client(s) and server(s) communicating via the network. A definite plus of this approach is that the processes can work on a "push" basis, i.e., they can listen on a socket until a message arrives which is superior to checking control files regularly. Network corruption and inconsistencies are also not your concern. Furthermore, you (may) have more control over the way the connections are established rather than relying on things outside of your program's control (again, this is important if you decide to add encryption later on).
The advantage is that a lot of things are taken off your shoulders that would bother an implementation in 1. The disadvantage is that you still need to change your program logic substantially in order to make sure that you send and receive the correct information (file types etc.).
In my experience portability (i.e., ease of transitioning to different systems and even programming languages) is very good since anything even remotely compatible to POSIX works.
[EDIT: In particular, as soon as you communicate binary numbers, endianness becomes an issue and you have to take care of this problem manually - this is a common (!) special case of the "correct information" issue I mentioned above. It will bite you e.g. when you have a PowerPC talking to an Intel Mac. This special case disappears with solutions 3.+4., together with all of the other "correct information" issues.]
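For illustration, here is how the endianness issue is typically handled by hand when rolling your own socket protocol; this Python sketch packs values in network byte order on both ends (C or Objective-C code would use htonl/ntohl and friends in the same spirit):

```python
import struct

# Sender and receiver both agree on network byte order ("!"), so a
# little-endian Intel Mac and a big-endian PowerPC read the same bytes
# identically.
payload = struct.pack("!If", 42, 3.14)        # one uint32, one float32
number, value = struct.unpack("!If", payload)
print(number, value)
```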
3.+4. Distributed objects: The NSProxy class cluster is used to implement distributed objects. NSConnection is responsible for setting up remote connections as a prerequisite for sending information around, so once you understand how to use this system, you also understand distributed objects. ;^)
The idea is that your high-level program logic does not need to be changed (i.e., your objects communicate via messages and receive results and the messages together with the return types are identical to what you are used to from your local implementation) without having to bother about the particulars of the network infrastructure. Well, at least in theory. Implementation: I am also working on this right now, so my understanding is still limited. As far as I understand, you do need to setup a certain structure, i.e., you still have to decide which processes (local and/or remote) can receive which messages; this is what NSConnection does. At this point, you implicitly define a client/server architecture, but you do not need to worry about the problems mentioned in 2.
There is an introduction with two explicit examples at the Gnustep project server; it illustrates how the technology works and is a good starting point for experimenting:
http://www.gnustep.org/resources/documentation/Developer/Base/ProgrammingManual/manual_7.html
Unfortunately, the disadvantages are a total loss of compatibility with other systems (although you will still do fine with the Mac and iPhone/iPad-only setup you mentioned) and a loss of portability to other languages. Gnustep with Objective-C is at best code-compatible, but there is no way to communicate between Gnustep and Cocoa, see my edit to question number 2 here: CORBA on Mac OS X (Cocoa)
[EDIT: I just came across another piece of information that I was unaware of. While I have checked that NSProxy is available on the iPhone, I did not check whether the other parts of the distributed objects mechanism are. According to this link: http://www.cocoabuilder.com/archive/cocoa/224358-big-picture-relationships-between-nsconnection-nsinputstream-nsoutputstream-etc.html (search the page for the phrase "iPhone OS") they are not. This would exclude this solution if you demand to use iPhone/iPad at this moment.]
So to conclude, there is a trade-off between the effort of learning (and implementing and debugging) new technologies on the one hand, and hand-coding lower-level communication logic on the other. While the distributed object approach takes most of the load off your shoulders and incurs the smallest changes in program logic, it is the hardest to learn and also (unfortunately) the least portable.
Disclaimer: Distributed Objects are not available on iPhone.
Why do you find distributed objects inelegant? They sound like a good match here:
transparent marshalling of fundamental types and Objective-C classes
it doesn't really matter whether clients are local or remote
not much additional work for Cocoa-based applications
The documentation might make it sound like more work than it actually is, but all you basically have to do is to use protocols cleanly and export, or respectively connect to, the server's root object.
The rest should happen automagically behind the scenes for you in the given scenario.
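Distributed Objects itself is Objective-C/Cocoa-specific, but the "vend a root object, then message it as if it were local" pattern can be sketched as an analogy with Python's standard xmlrpc module (the host, port, class, and method names are arbitrary):

```python
import threading
from xmlrpc.server import SimpleXMLRPCServer
from xmlrpc.client import ServerProxy

# "Server" side: vend a root object whose methods clients can call.
class Root:
    def echo(self, text):
        return "server saw: " + text

server = SimpleXMLRPCServer(("localhost", 8000), logRequests=False)
server.register_instance(Root())
threading.Thread(target=server.serve_forever, daemon=True).start()

# "Client" side: connect to the vended root object and message it as if local.
root = ServerProxy("http://localhost:8000")
print(root.echo("hello"))      # -> 'server saw: hello'
```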
We are using ThoMoNetworking and it works fine and is fast to set up. Basically it allows you to send NSCoding-compliant objects in the local network, but of course it also works if client and server are on the same machine. As a wrapper around the Foundation classes, it takes care of pairing, reconnections, etc.

Matching a virtual machine design with its primary programming language

As background for a side project, I've been reading about different virtual machine designs, with the JVM of course getting the most press. I've also looked at BEAM (Erlang), GHC's RTS (kind of but not quite a VM) and some of the JavaScript implementations. Python also has a bytecode interpreter that I know exists, but have not read much about.
What I have not found is a good explanation of why particular virtual machine design choices are made for a particular language. I'm particularly interested in design choices that would fit with concurrent and/or very dynamic (Ruby, JavaScript, Lisp) languages.
Edit: In response to a comment asking for specificity, here is an example. The JVM uses a stack machine rather than a register machine, which was very controversial when Java was first introduced. It turned out that the engineers who designed the JVM had done so intending platform portability, and converting a stack machine back into a register machine was easier and more efficient than overcoming an impedance mismatch where the virtual machine had too many or too few registers.
Here's another example: for Haskell, the paper to look at is Implementing lazy functional languages on stock hardware: the Spineless Tagless G-machine. This is very different from any other type of VM I know about. And in point of fact GHC (the premier implementation of Haskell) does not run the STG machine live, but uses it as an intermediate step in compilation. Peyton Jones lists no fewer than 8 other virtual machines that didn't work. I would like to understand why some VMs succeed where others fail.
I'll answer your question from a different tack: what is a VM? A VM is just a specification for an "interpreter" of a lower-level language than the source language. Here I'm using the black-box meaning of the word "interpreter". I don't care how a VM gets implemented (as a bytecode interpreter, a JIT compiler, whatever). When phrased that way, from a design point of view the VM isn't the interesting thing; it's the low-level language.
The ideal VM language will do two things. One, it will make it easy to compile the source language into it. And two, it will make it easy to interpret on the target platform(s) (where again the interpreter could be implemented very naively or could be some really sophisticated JIT like HotSpot or V8).
Obviously there's a tension between those two desirable properties, but they do more or less form two end points on a line through the design space of all possible VMs. (Or, perhaps some more complicated shape than a line because this isn't a flat Euclidean space, but you get the idea). If you build your VM language far outside of that line then it won't be very useful. That's what constrains VM design: putting it somewhere into that ideal line.
That line is also why high level VMs tend to be very language specific while low level VMs are more language agnostic but don't provide many services. A high level VM is by its nature close to the source language which makes it far from other, different source languages. A low level VM is by its nature close to the target platform thus close to the platform end of the ideal lines for many languages but that low level VM will also be pretty far from the "easy to compile to" end of the ideal line of most source languages.
Now, more broadly, conceptually any compiler can be seen as a series of transformations from the source language to intermediate forms that themselves can be seen as languages for VMs. VMs for the intermediate languages may never be built, but they could be. A compiler eventually emits the final form. And that final form will itself be a language for a VM. We might call that VM "JVM", "V8"...or we might call that VM "x86", "ARM", etc.
Hope that helps.
One of the techniques for deriving a VM is to just go down the compilation chain, transforming your source language into more and more low-level intermediate languages. Once you spot a low-level-enough language suitable for a flat representation (i.e., one which can be serialised into a sequence of "instructions"), this is pretty much your VM. And your VM interpreter or JIT compiler would just continue your transformation chain from the point you selected for serialisation.
Some serialisation techniques are very common - e.g., using a pseudo-stack representation for expression trees (like in .NET CLR, which is not a "real" stack machine at all). Otherwise you may want to use an SSA-form for serialisation, as in LLVM, or simply a 3-address VM with an infinite number of registers (as in Dalvik). It does not really matter which way you take, since it is only a serialisation and it would be de-serialised later to carry on with your normal way of compilation.
It is a bit of a different story if you intend to interpret your VM code immediately instead of compiling it. There is currently no consensus on what kind of VMs are better suited for interpretation. Both stack-based (or, I'd dare to say, Forth-based) VMs and register-based VMs have proven to be efficient.
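As a toy illustration of the "serialise the tree into a sequence of instructions, then interpret it" step, here is a minimal Python sketch of lowering an expression tree to a flat stack bytecode and running it (the instruction set is invented):

```python
# Expression: 2 + (3 * 4), written as a small nested tuple tree.
expr = ("add", ("const", 2), ("mul", ("const", 3), ("const", 4)))

def compile_expr(node, code):
    """Lower the tree to a flat list of stack-machine instructions."""
    op = node[0]
    if op == "const":
        code.append(("PUSH", node[1]))
    else:
        compile_expr(node[1], code)
        compile_expr(node[2], code)
        code.append(("ADD",) if op == "add" else ("MUL",))
    return code

def run(code):
    """Interpret the serialised instruction sequence with an operand stack."""
    stack = []
    for ins in code:
        if ins[0] == "PUSH":
            stack.append(ins[1])
        elif ins[0] == "ADD":
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
        elif ins[0] == "MUL":
            b, a = stack.pop(), stack.pop()
            stack.append(a * b)
    return stack.pop()

print(run(compile_expr(expr, [])))   # 14
```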
I found this book to be helpful. It discusses many of the points you are asking about. (note I'm not in any way affiliated with Amazon, nor am I promoting Amazon; just was the easiest place to link from).
http://www.amazon.com/dp/1852339691/

When implementing an interpreter, is it a good or bad to piggyback off the host language's garbage collector?

Let's say you are implementing an interpreter for a GCed language in a language that is GCed.
It seems to me you'd get garbage collection for free as long as you are reasonably careful about your design.
Is this generally how it's done? Are there good reasons not to do this?
Language and runtime are two different things. They are not really related IMHO.
Therefore, if your existing runtime offers a GC already, there must be a good reason to extend the runtime with another GC. In the good old days, when memory allocations in the OS were slow and expensive, apps brought their own heap managers, which were more efficient in dealing with small chunks of data. That was one reason for adding another memory management layer to an existing runtime (or OS). But if you're talking Java, .NET or so - those should be good and efficient enough for most tasks at hand.
However, you may want to create a proper interface/API for memory and object management tasks (and others), so that your language ("guest") runtime could be implemented on top of another host runtime later on.
For an interpreter, there should be no problem with using the host GC, IMHO, particularly at first. As always, the goal should be to get something working, then make it work right, then make it fast. This is particularly true for Domain Specific Languages (DSLs), where the goal is a small language. For these, implementing a full GC would be overkill.
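For a concrete picture of piggybacking on the host GC: if every guest-language value is represented by an ordinary host object, guest garbage simply becomes host garbage. A minimal Python sketch (all names are invented):

```python
# Guest-language values are plain host objects; the interpreter never frees
# anything itself, it just drops references and lets the host GC collect them.
class GuestPair:
    def __init__(self, head, tail):
        self.head, self.tail = head, tail

class GuestEnv:
    def __init__(self, parent=None):
        self.bindings = {}
        self.parent = parent

    def define(self, name, value):
        self.bindings[name] = value

env = GuestEnv()
env.define("xs", GuestPair(1, GuestPair(2, None)))
env.define("xs", GuestPair(3, None))   # the old list becomes unreachable and
                                       # is reclaimed by the host GC
```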

How to save a program's progress, and resume later?

You may know a lot of programs, e.g. some password-cracking programs: we can stop them while they're running, and when we run the program again (with or without entering the same input), they will be able to continue from where they left off. I wonder what kind of technique those programs are using?
[Edit] I am writing a program mainly based on recursive functions. Within my knowledge, I think it is incredibly difficult to save such states in my program. Is there any technique that somehow saves the stack contents, function calls, and data involved in my program, so that when it is restarted, it can run as if it hadn't been stopped? This is just a concept I have in my mind, so please forgive me if it doesn't make sense...
It's going to be different for every program. For something as simple as, say, a brute-force password cracker, all that would really need to be saved is the last password tried. For other apps you may need to store several data points, but that's really all there is to it: saving and loading the minimum amount of information needed to reconstruct where you were.
Another common technique is to save an image of the entire program state. If you've ever played with a game console emulator with the ability to save state, this is how they do it. A similar technique exists in Python with pickling. If the environment is stable enough (i.e., no varying pointers) you simply copy the entire app's memory state into a binary file. When you want to resume, you copy it back into memory and begin running again. This gives you near-perfect state recovery, but whether or not it's at all possible is highly environment/language dependent. (For example: most C++ apps couldn't do this without help from the OS, or unless they were built VERY carefully with this in mind.)
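Here is a minimal Python sketch of the "save only the minimum needed to reconstruct where you were" approach described above, using pickle for the checkpoint file (the file name and loop bounds are made up):

```python
import os, pickle

STATE_FILE = "cracker_state.pkl"       # hypothetical checkpoint file

def load_state():
    if os.path.exists(STATE_FILE):
        with open(STATE_FILE, "rb") as f:
            return pickle.load(f)      # resume from the last checkpoint
    return {"next_index": 0}           # fresh start

def save_state(state):
    tmp = STATE_FILE + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp, STATE_FILE)        # atomic swap: never a half-written file

state = load_state()
for i in range(state["next_index"], 1_000_000):
    # ... try candidate number i here ...
    if i % 10_000 == 0:                # checkpoint periodically
        state["next_index"] = i
        save_state(state)
```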
Use Persistence.
Persistence is a mechanism through which the life of an object extends beyond the program's execution lifetime.
Store the state of the objects involved in the process on the local hard drive using serialization.
Implement Persistent Objects with Java Serialization
To achieve this, you need to continually save state (i.e. where you are in your calculation). This way, if you interrupt the program, when it restarts it will know it is in the middle of a calculation, and where it was in that calculation.
You also probably want to have your main calculation in a separate thread from your user interface - this way you can respond to "close / interrupt" requests from your user interface and handle them appropriately by stopping / pausing the thread.
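A small Python sketch of the "calculation in a separate thread, stopped cleanly from the user interface" idea (the names and the fake workload are made up); on exit, the surviving state can be persisted with whatever mechanism you chose:

```python
import threading, time

stop_requested = threading.Event()     # set by the UI / signal handler

def worker(state):
    # The calculation checks the stop flag regularly so it can pause cleanly.
    while state["i"] < 10_000_000 and not stop_requested.is_set():
        state["i"] += 1                # the actual calculation goes here

state = {"i": 0}
t = threading.Thread(target=worker, args=(state,))
t.start()
time.sleep(0.2)
stop_requested.set()                   # e.g. triggered by a "Close" button
t.join()
print("stopped at", state["i"])        # save this, resume from it next run
```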
For linux, there is a project named CRIU, which supports process-level save and resume. It is quite like hibernation and resuming of the OS, but the granularity is broken down to processes. It also supports container technologies, specifically Docker. Refer to http://criu.org/ for more information.