Difference between State, ST, IORef, and MVar - variables

I am working through Write Yourself a Scheme in 48 Hours (I'm up to about 85hrs) and I've gotten to the part about Adding Variables and Assignments. There is a big conceptual jump in this chapter, and I wish it had been done in two steps with a good refactoring in between rather then jumping at straight to the final solution. Anyway…
I've gotten lost with a number of different classes that seem to serve the same purpose: State, ST, IORef, and MVar. The first three are mentioned in the text, while the last seems to be the favored answer to a lot of StackOverflow questions about the first three. They all seem to carry a state between consecutive invocations.
What are each of these and how do they differ from one another?
In particular these sentences don't make sense:
Instead, we use a feature called state threads, letting Haskell manage the aggregate state for us. This lets us treat mutable variables as we would in any other programming language, using functions to get or set variables.
and
The IORef module lets you use stateful variables within the IO monad.
All this makes the line type ENV = IORef [(String, IORef LispVal)] confusing - why the second IORef? What will break if I'll write type ENV = State [(String, LispVal)] instead?

The State Monad : a model of mutable state
The State monad is a purely functional environment for programs with state, with a simple API:
get
put
Documentation in the mtl package.
The State monad is commonly used when needing state in a single thread of control. It doesn't actually use mutable state in its implementation. Instead, the program is parameterized by the state value (i.e. the state is an additional parameter to all computations). The state only appears to be mutated in a single thread (and cannot be shared between threads).
The ST monad and STRefs
The ST monad is the restricted cousin of the IO monad.
It allows arbitrary mutable state, implemented as actual mutable memory on the machine. The API is made safe in side-effect-free programs, as the rank-2 type parameter prevents values that depend on mutable state from escaping local scope.
It thus allows for controlled mutability in otherwise pure programs.
Commonly used for mutable arrays and other data structures that are mutated, then frozen. It is also very efficient, since the mutable state is "hardware accelerated".
Primary API:
Control.Monad.ST
runST -- start a new memory-effect computation.
And STRefs: pointers to (local) mutable cells.
ST-based arrays (such as vector) are also common.
Think of it as the less dangerous sibling of the IO monad. Or IO, where you can only read and write to memory.
IORef : STRefs in IO
These are STRefs (see above) in the IO monad. They don't have the same safety guarantees as STRefs about locality.
MVars : IORefs with locks
Like STRefs or IORefs, but with a lock attached, for safe concurrent access from multiple threads. IORefs and STRefs are only safe in a multi-threaded setting when using atomicModifyIORef (a compare-and-swap atomic operation). MVars are a more general mechanism for safely sharing mutable state.
Generally, in Haskell, use MVars or TVars (STM-based mutable cells), over STRef or IORef.

Ok, I'll start with IORef. IORef provides a value which is mutable in the IO monad. It's just a reference to some data, and like any reference, there are functions which allow you to change the data it refers to. In Haskell, all of those functions operate in IO. You can think of it like a database, file, or other external data store - you can get and set the data in it, but doing so requires going through IO. The reason IO is necessary at all is because Haskell is pure; the compiler needs a way to know which data the reference points to at any given time (read sigfpe's "You could have invented monads" blogpost).
MVars are basically the same thing as an IORef, except for two very important differences. MVar is a concurrency primitive, so it's designed for access from multiple threads. The second difference is that an MVar is a box which can be full or empty. So where an IORef Int always has an Int (or is bottom), an MVar Int may have an Int or it may be empty. If a thread tries to read a value from an empty MVar, it will block until the MVar gets filled (by another thread). Basically an MVar a is equivalent to an IORef (Maybe a) with extra semantics that are useful for concurrency.
State is a monad which provides mutable state, not necessarily with IO. In fact, it's particularly useful for pure computations. If you have an algorithm that uses state but not IO, a State monad is often an elegant solution.
There is also a monad transformer version of State, StateT. This frequently gets used to hold program configuration data, or "game-world-state" types of state in applications.
ST is something slightly different. The main data structure in ST is the STRef, which is like an IORef but with a different monad. The ST monad uses type system trickery (the "state threads" the docs mention) to ensure that mutable data can't escape the monad; that is, when you run an ST computation you get a pure result. The reason ST is interesting is that it's a primitive monad like IO, allowing computations to perform low-level manipulations on bytearrays and pointers. This means that ST can provide a pure interface while using low-level operations on mutable data, meaning it's very fast. From the perspective of the program, it's as if the ST computation runs in a separate thread with thread-local storage.

Others have done the core things, but to answer the direct question:
All this makes the line type ENV =
IORef [(String, IORef LispVal)]
confusing. Why the second IORef? What
will break if I do type ENV = State
[(String, LispVal)] instead?
Lisp is a functional language with mutable state and lexical scope. Imagine you've closed over a mutable variable. Now you've got a reference to this variable hanging around inside some other function -- say (in haskell-style pseudocode) (printIt, setIt) = let x = 5 in (\ () -> print x, \y -> set x y). You now have two functions -- one prints x, and one sets its value. When you evaluate printIt, you want to lookup the name of x in the initial environment in which printIt was defined, but you want to lookup the value that name is bound to in the environment in which printIt is called (after setIt may have been called any number of times).
There are ways besids the two IORefs to do this, but you certainly need more than the latter type you've proposed, which doesn't allow you to alter the values that names are bound to in a lexically-scoped fashion. Google the "funargs problem" for a whole lot of interesting prehistory.

Related

Is the term immutable variable just a convention?

In Rust variables are immutable by default, i.e., they don't vary but are not constants (as noted here).
Do they retain the name "variable" just by convention, or is there another reason why the term "variable" is maintained?
It should be noted that the term mut in Rust was hotly debated before stabilization with some arguing that it should be called excl or uniq. The matter is that the mut in in let mut x and &mut x are two completely different things.
let mut x declares that x is mutable, in the sense that it can be re-assigned, but also that one can take a &mut reference of it; which is best called an exclusive or unique reference. It is quite possible in Rust in some cases to mutate through a shared reference in the case of std::cell::Cell, for instance, and not all operations that require an exclusive reference involve mutation. An operation that requires an exclusive reference is simply one that would be unsafe with a shared one; Cell is designed in such a way that it is not, by strictly controlling under what conditions mutation can occur.
In theory, the two functions of let mut x could have different keywords, but they are compressed into one for simplicity. Rust could in theory be designed with mut and excl being different keywords, and allowing for let excl x, which would be a variable wherefrom one could take an exclusive reference, but not mutate.
One can also have variables that are not declared with mut, in particular in function calls. In a signature like fn func ( x : u32 ), x is not mutable, but it is variable, because it a different x can be passed every single time.
The let mut x type of "mutable" is purely a lint and, in theory, unnecessary for Rust to work — any currently working Rust program will continue to work if all non-mutable variables be made mutable. It's simply considered bad practice to do so and the compiler will warn the programmer whenever he make a variable mutable that isn't necessary to be mutable; this helps catching unintended bugs. This is absolutely not the case with exclusive and shared references, which are necessary to be distinguished and more than just a lint.
Here "variable" means "factor involved in computation" not "varying". This is from the mathematical principle where expressions like f(x) include x, a variable, as a part of the equation.
In Rust, like with other languages, you'll need variables (e.g. input) that affects how the program runs, otherwise your program would only ever behave in a singular, specific way, producing the same output each time.
You'll need to think of what variables change during processing and which do not. Those that do not need to change do not need to be declared mutable.
Regardless of if or when they change, they're still considered variables.
In C++ you'll have things like const int x which is a constant (read-only) variable, so the term can take on all sorts of specific meanings.
Is the term immutable variable just a convention?
By definition every... definition of a word is a convention, language, meaning of the word, change by time, is unique for every people that live, you can take 100 peoples and end with 100 difference definition of 1 word. That why we often start scientific paper by defining word that could be miss understand in the paper. Trying to clarify as much as possible. Rust does not differs that why we have The Reference
We have a specific section for variable
A variable is a component of a stack frame, either a named function
parameter, an anonymous temporary, or a named local variable.
A local variable (or stack-local allocation) holds a value directly,
allocated within the stack's memory. The value is a part of the stack
frame.
Local variables are immutable unless declared otherwise. For example:
let mut x = ....
Function parameters are immutable unless declared with mut. The mut
keyword applies only to the following parameter. For example: |mut x,
y| and fn f(mut x: Box, y: Box) declare one mutable variable
x and one immutable variable y.
Local variables are not initialized when allocated. Instead, the
entire frame worth of local variables are allocated, on frame-entry,
in an uninitialized state. Subsequent statements within a function may
or may not initialize the local variables. Local variables can be used
only after they have been initialized through all reachable control
flow paths.
So there is not much to add, variable in rust is clearly defined, it doesn't matter if your definition doesn't match or you find a definition of variable that doesn't match Rust one. In the context of Rust, variable is that. If you want to ask about opinion about this choice then it's off topic as opinion oriented. But, wiki definition make Rust definition quite standard both from mathematics view than computer science:
Variable (computer science), a symbolic name associated with a value and whose associated value may be changed
Variable (mathematics), a symbol that represents a quantity in a mathematical expression, as used in many sciences

When does clojure remove a variable?

I was looking at the source for the memoize.
Coming from languages like C++/Python, this part hit me hard:
(let [mem (atom {})] (fn [& args] (if-let [e (find #mem args)] ...
I realize that memoize returns a function, but for storing state, it uses a local "variable" mem. But after memoize returns the function, shouldn't that outer let vanish from scope. How can the function still refer to the mem.
Why doesn't Clojure delete that outer variable, and how does it manage variable names. Like suppose, I make another memoized function, then memoize uses another mem. Doesn't that name clash with the earlier mem?
P.S.: I was thinking that there must be something much be happening in there, that prevents that, so I wrote myself a easier version, that goes like http://ideone.com/VZLsJp , but that still works like the memoize.
Objects are garbage collectable if no thread can access them, as per usual for JVM languages. If a thread has a reference to the function returned by memoize and the function has a reference to the atom in mem then transitively the atom is still accessible.
But after memoize returns the function, shouldn't that outer let vanish from scope. How can the function still refer to the mem.
This is what is called a closure. If a function is defined using a name from its environment, it keeps a reference to that value afterwards - even if the defining environment is gone and the function is the only thing that has access any more.
Like suppose, I make another memoized function, then memoize uses another mem. Doesn't that name clash with the earlier mem?
No, except possibly by confusing programmers. Having multiple scopes each declare their own name mem is very much possible and the usual rules of lexical scoping are used to determine which is meant when mem is read. There are some trickier edge cases such as
(let[foo 2]
(let[foo (fn[] foo)] ;; In the function definition, foo has the value from the outer scope
;; because the second let has not yet bound the name
(foo))) ;; => 2.
but generally the idea is pretty simple - the value of a name is the one given in the definition closest in the program text to the place it is used - either in the local scope or in the closest outer scope.
Different invocations of memoize create different closures so that the name mem refers to different atoms in each returned function.

How does gcc push local variables on to the stack?

void
f
()
{
int a[1];
int b;
int c;
int d[1];
}
I have found that these local variables, for this example, are not pushed on to the stack in order. b and c are pushed in the order of their declaration, but, a and d are grouped together. So the compiler is allocating arrays differently from any other built in type or object.
Is this a C/C++ requirement or gcc implementation detail?
The C standard says nothing about the order in which local variables are allocated. It doesn't even use the word "stack". It only requires that local variables have a lifetime that begins on entry to the nearest enclosing block (basically when execution reaches the {) and ends on exit from that block (reaching the }), and that each object has a unique address. It does acknowledge that two unrelated variables might happen to be adjacent in memory (for obscure technical reasons involving pointer arithmetic), but doesn't say when this might happen.
The order in which variables are allocated is entirely up to the whim of the compiler, and you should not write code that depends on any particular ordering. A compiler might lay out local variables in the order in which they're declared, or alphabetically by name, or it might group some variables together if that happens to result in faster code.
If you need to variables to be allocated in a particular order, you can wrap them in an array or a structure.
(If you were to look at the generated machine code, you'd most likely find that the variables are not "pushed onto the stack" one by one. Instead, the compiler will probably generate a single instruction to adjust the stack pointer by a certain number of bytes, effectively allocating a single chunk of memory to hold all the local variables for the function or block. Code that accesses a given variable will then use its offset within the stack frame.)
And since your function doesn't do anything with its local variables, the compiler might just not bother allocating space for them at all, particularly if you request optimization with -O3 or something similar.
The compiler can order the local variables however it wants. It may even choose to either not allocate them at all (for example, if they're not used, or are optimized away through propagation/ciscizing/keeping in register/etc) or allocate the same stack location for multiple locals that have disjoint live ranges.
There is no common implementation detail to outline how a particular compiler does it, as it may change at any time.
Typically, compilers will try to group similar sized variables (and/or alignments) together to minimize wasted space through "gaps", but there are so many other factors involved.
structs and arrays have slightly different requirements, but that's beyond the scope of this question I believe.

How a programmers solve the dilemma of using old variables instead of new variables?

For example:
... some code
int sizeOfSomeObject = someObject.length();
... some code, sizeOfSomeObject is not need anymore
now I need other int variable for other action(for example, for position in some object), and i have the dilemma: create a new variable or use sizeOfSomeObject for this. In the first case I will keep readability, but lose performance. In the second case - on the contrary. What usually do programmers in this situation?
In the first case I will keep readability, but lose performance. In the second case - on the contrary.
So did you benchmark it? I suspect no, you didn't. Most modern compilers do a lot of agressive analysis during register allocation, so if the optimizer perceives that there's a variable that's not used anymore, but there's a new variable of the same type, it will just merge the two variables to the same memory region or processor register. No need to worry about performance penalties.
And anyway, don't do premature optimization (which this is). In 90% of the cases, readability is more important than "performance".
All in all, go ahead and create a new variable with an appropriate, different, descriptive name. And just for fun, compile this version and the version in which you used the same variable name, and look at the generated assembly (or bytecode, or...) - and find out that they're identical.
I would use different named variables for different things.
In terms of something like this, I don't think just one variable would cause a massive performance hit. In most languages you have the option to clear variables from memory in some way when they are no longer in use, so I would recommend doing that so that the code means something to you or others when read at a later date.
In C++, you can use blocks for objects to be destroyed as soon as they are not needed anymore:
void some_function () {
{
MyClass c;
// ... here we use c ...
}
// now c has been destroyed
{
MyClass d;
// ... here we use d ...
}
// now d has been destroyed
}
In your example (with int variables), there is no reason to worry about performance. The worst thing that could probably happen is memory for two variables being used instead of one, but (i) that's negligible and (ii) int's will probably live in a CPU register, anyway. If you really worry, use the block approach for your int example.
It depends how often such an int would be initialized. If it's not in some hugely nested for loop, most (all) programmers will go for the first. Besides, most modern programming languages have a garbage collector, which cleans up left over objects.
Decent compiler will optimize out your second variable, so that shouldn't be an issue.
That said, there are situations where variable reuse makes sense. E.g., you might have some variable that holds a generic output populated from call to some external API. According to the context and parameters passed to the API you'll process the data differently but it's probably better (more readable etc.) to reuse the same data variable.
For example, something like this:
void* data = getSomeData(params);
//process data
//change params
data = getSomeData(params);
//process data
//change params
data = getSomeData(params);

Most appropriate data structure for dynamic languages field access

I'm implementing a dynamic language that will compile to C#, and it's implementing its own reflection API (.NET's is too slow, and the DLR is limited only to more recent and resourceful implementations).
For this, I've implemented a simple .GetField(string f) and .SetField(string f, object val) interface. Until recently, the implementation just switches over all possible field string values and makes the corresponding action.
Also, this dynamic language has the possibility to define anonymous objects. For those anonymous objects, at first, I had implemented a simple hash algorithm.
By now, I am looking for ways to optimize the dynamic parts of the language, and I have come across the fact that a hash algorithm for anonymous objects would be overkill. This is because the objects are usually small. I'd say the objects contain 2 or 3 fields, normally. Very rarely, they would contain more than 15 fields. It would take more time to actually hash the string and perform the lookup than if I would test for equality between them all. (This is not tested, just theoretical).
The first thing I did was to -- at compile-time -- create a red-black tree for each anonymous object declaration and have it laid onto an array so that the object can look for it in a very optimized way.
I am still divided, though, if that's the best way to do this. I could go for a perfect hashing function. Even more radically, I'm thinking about dropping the need for strings and actually work with a struct of 2 longs.
Those two longs will be encoded to support 10 chars (A-za-z0-9_) each, which is mostly a good prediction of the size of the fields. For fields larger than this, a special function (slower) receiving a string will also be provided.
The result will be that strings will be inlined (not references), and their comparisons will be as cheap as a long comparison.
Anyway, it's a little hard to find good information about this kind of optimization, since this is normally thought on a vm-level, not a static language compilation implementation.
Does anyone have any thoughts or tips about the best data structure to handle dynamic calls?
Edit:
For now, I'm really going with the string as long representation and a linear binary tree lookup.
I don't know if this is helpful, but I'll chuck it out in case;
If this is compiling to C#, do you know the complete list of fields at compile time? So as an idea, if your code reads
// dynamic
myObject.foo = "some value";
myObject.bar = 32;
then during the parse, your symbol table can build an int for each field name;
// parsing code
symbols[0] == "foo"
symbols[1] == "bar"
then generate code using arrays or lists;
// generated c#
runtimeObject[0] = "some value"; // assign myobject.foo
runtimeObject[1] = 32; // assign myobject.bar
and build up reflection as a separate array;
runtimeObject.FieldNames[0] == "foo"; // Dictionary<int, string>
runtimeObject.FieldIds["foo"] === 0; // Dictionary<string, int>
As I say, thrown out in the hope it'll be useful. No idea if it will!
Since you are likely to be using the same field and method names repeatedly, something like string interning would work well to quickly generate keys for your hash tables. It would also make string equality comparisons constant-time.
For such a small data set (expected upper bounds of 15) I think almost any hashing will be more expensive then a tree or even a list lookup, but that is really dependent on your hashing algorithm.
If you want to use a dictionary/hash then you'll need to make sure the objects you use for the key return a hash code quickly (perhaps a single constant hash code that's built once). If you can prevent collisions inside of an object (sounds pretty doable) then you'll gain the speed and scalability (well for any realistic object/class size) of a hash table.
Something that comes to mind is Ruby's symbols and message passing. I believe Ruby's symbols act as a constant to just a memory reference. So comparison is constant, they are very lite, and you can use symbols like variables (I'm a little hazy on this and don't have a Ruby interpreter on this machine). Ruby's method "calling" really turns into message passing. Something like: obj.func(arg) turns into obj.send(:func, arg) (":func" is the symbol). I would imagine that symbol makes looking up the message handler (as I'll call it) inside the object pretty efficient since it's hash code most likely doesn't need to be calculated like most objects.
Perhaps something similar could be done in .NET.