COMP 311 Principles of Programming Languages (Fall 2004)

Section 8: Assignment and Mutability

Assignment and Mutability

Practical programming languages include operations for mutating the values of program variables and data structures. To model this feature in LC, we will add an assignment operation to the language with syntax

(set! x M)

and the abstract representation

(define-structure (setter lhs rhs))

where x is any lambda-bound identifier. set! allows us to model changing events in the real world.

There are two defensible models of how assignment should behave. In languages like Scheme and Java, an assignment changes the binding of an identifier. Conceptually, assignment to an identifier destructively modifies the value field of the binding for the specified identifier. In this model, only the value of an identifer can be passed as an argument to a procedure. Hence, assignment to a formal parameter has no effect on the corresponding actual parameter.

In the other model, only fields of data structures can be modified. "Assignable variables" are simply identifiers bound to modifiable structures called "cells" or refs. These structures have a single updatable field. ML supports this model of assignment.

Typically, a single set! enables us to create cycles and graphs, while multiple set!s model state in the modeled universe.

Consider the following code fragment:

((lambda (x)
 ...
 (set! x 6)
 ...)
 5)

What does the set! do to x? Initially, x is bound to the value 5. But the 5 is a constant, so it can't directly be changed. To understand this issue more clearly, think about the difference between the Java values of type int and type Integer. Since each int value is simply a bit pattern, it is by definition immutable. Every instance of a particular int (say 5) in Java program simply repeats the bit pattern.

In contrast, each Integer value is an object allocated on the heap. An Integer object contains a value field that holds an int specifying the value of the Integer. In principle, the value field in an object Integer(5) on the heap could be modified to hold the bit pattern for int 6 instead of 5. The standard Java library prohibits this modification by declaring the value field private and not providing a ``setter'' method for changing it. Otherwise, an Integer object x could ``mysteriously'' refer to a different integer value during program execution even though x is never explicitly updated. (If another Integer variable y is bound to the same object; any operation that modifies y will also modify x.)

To change the value associated with a variable x, we must bind a different value to the variable x. We can accomplish this by including a clause in MEval case-split of the form:

((setter? M) ``change the environment'')

But how do we do this?

To make variables assignable, we need to change what they stand for. They cannot be directly associated with values; rather, they must be associated with an object which can be modified to hold a differnt value. What kind of object can we use? In Scheme, a particularly apt choice is to use a box to hold the value of each variable. (Another possible choice is to view the environment as a list of binding structs which have two fields: a symbol specifying the variable name and a value. Then we can use mutation on Scheme structs to change the value of the second field. As we will see in a moment, this second choice imposes some serious limitations on subsequent language extensions.)

Moral: Variables must stand for boxes.

Where do we associate boxes with values? Since every use of a procedure is supposed to create a new copy of the body with all occurrences of the parameter replaced by the argument value, it is natural to create a box right after the argument is evaluated and to put the value of the argument into that box. Put technically, we change the interpretation of application as follows:

((app? M)
  (MApply
    (MEval (app-fp M) env)
    (box (MEval (app-ap M) env))))

This in turn induces another change in the interpreter: since every variable is boxed, we now have to unbox it to get its actual value.

((var? M)
  (unbox (lookup (var-name M) env)))

At this point, there is no LC program we can write to distinguish this implementation of LC from the one we had before.

We are now ready to interpret a set! expression:

((setter? M)
  (set-box! (lookup (setter-lhs M) env)
    (MEval (setter-rhs M) env)))

Why is the value returned by MEval not itself boxed? A set! expression only changes the value that a variable is associated with; it does not introduce a new variable.

Call-by-Value and Call-by-Reference

Consider this program, which contains a mutation:

(let ((f (lambda (x) (set! x 5))))
  (let ((y 10))
    (let ((_ (f y)))
      y)))

What result is produced by evalating this program? The value of y, which is 10, is placed in a new box when f is applied; this new box is thrown away after the procedure body (including the set!) has been evaluated, so the eventual value is that of y, which is still 10. This is call-by-value: we passed the value of y, not the capability to change its value. Therefore, in this language, we cannot write a swap procedure.

Assume we want that the following expression

(let ((f (lambda (x y)
	   (let ((t x))
	     (set! x y)
	     (set! y t)))))
  (let ((a 5) (b 6))
    (f a b)
    (cons a b)))

evaluate to (cons 6 5). Then we have to pass references to variables a and b, not their value alone. These references are simply the boxes holding the variables' values. We can support this capability with a small change change to our LC interpreter based on the following observation: when the argument expression in an application is already a variable, it is associated with a box in the environment. Hence, we can pass this box to the procedure and don't need to create a new one:

((app? M)
  (MApply (... fp ...)
    (if (var? (app-rand M))
      (lookup (var-name (app-rand M)) env)
      (...))))

This new mode of parameter-passing is called call-by-reference. Pascal and Fortran use variants of this parameter-passing technique. While passing references enables programmers to write procedures like swap, it also introduces a new phenomenon into the language: variable aliasing. Variable aliasing occurs when two syntactially distinct variables refer to the same mutable location in the environment. In Scheme such a coincidence is impossible; in Pascal it is common.

The binding object approach to supporting assignment does not support call-by-refernce because the value field of a binding object cannot be shared! Each binding object has a distinct value field.

The absence of variable aliasing in Scheme does not mean that Scheme escapes the aliasing problem. Scheme only guarantees that distinct varible names do not refer to the same location (box). Scheme allows data aliasing, where more than selction path refers to the same location. For example, two elements of a vector can be exactly the same box. All interesting programming languages permit data aliasing.

Variations on Call-by-Reference: Call-by-name

The original Algol 60 language supported call-by-value and call-by-name, but not call-by-reference. In imperative languages (languages supporting mutable state), call-by-name has the same semantics as it does in functional languages, assuming that we equate left-hand-evaluation in imperative languages with evaluation in functional languages. As a result, call-by-name is a baroque, expensive alternative to call-by-reference. Each actual parameter passed by reference is a suspension that yields a box (also called a reference or a location) when it is evaluated. In essence, call-by-name repeatedly evaluates the actual parameter to produce a box every time the corresponding formal parameter is referenced. If the suspension produces the same location each time, then call-by-name is equivalent to call-by-reference. But the suspension can contain references to variables that change (from assignment) while the procedure body is executed. In the special case where the actual parameter does not left-hand evaluate to a box (e.g., an actual parameter that is a constant like 10), the calling program generates a dummy box and copies the value into the box. The following Algol-like code (written in C syntax)exploits this property.

procedure Sum(int x, int y, int n) {
  // actual y must be an expression in which actual x occurs free
  int sum = 0;
  for (x = 0; x < n, x++) sum = sum + y;
  return sum;
}
  
int j, sum = 0;

int[10] a;

// initialize a
for (int i = 0; i < 10; i++) a[i] = i;

// computer the sum
sum = Sum(j, a[j], 10));

// print the result
print(j, sum);

The ugly convention of passing j and a[j] by name and using modifications to the formal parameter for j to determine different values for the formal parameter corresponding to a[j] is called Jensen's device. In the imperative world, the call-by-need optimization of call-by-name does not work because reevaluations of the suspension for a call-by-name parameter does not necessarily produce the same result! In the code above, the suspension for x passed to Sum always evaluates to the box containing the global variable j, but the suspension for y contains a free reference to j which is synonymous with x. Hence, the assignments to x in the body of Sum change the contents of the global variable j; the procedure call Sum(j,a[j],10) sums the elements of a. Modern procedural languages have dropped call-by-name in favor of call-by-reference eliminating the obscure interactions between call-by-name evaluation and modifications to variables.

Variations on Call-by-Reference: Call by Value-Result

Another alternative to call-by-reference is called call-by-value-result. When an actual parameter is passed by value-result, the calling procedure left-hand-evaluates the actual parameter exactly as it would for call-by-reference. It passes the address of the box to the called procedure which saves it, creates a new local variable (a box) for the corresponding formal parameter and copies the contents of the passed box into the local box. During the execution of the procedure body, the local copy is used whenever the formal parameter is accessed. On exit from the called procedure, the called procedure copies the contents of the local box into the corresponding actual parameter box. In essence, call-by-value-result creates a temporary copy of the actual parameter box and copies the contents of this copy into the actual parameter box on exit. Languages that support value-result parameter passing often support call-by-result also. Call-by-result differs from call-by-value-result by not initializing the contents of the local box bound to the formal parameter. For the sake of safety, it must confirm that the contents denote a legal data value.

An Example Illustrating Parameter Passing Mechanisms

Consider the Algol-like code fragment involving the procedure Sum above. Let's determine what values will be printed for each of the different parameter passing mechanisms that we have discussed:

call-by-value: in the body of Sum the local variables x,y,n are new boxes with contents 0,0,10, respectively. Hence, the loop repeatedly adds 0 to sum which is initially to 0. Hence, the returned result is 0 and the program prints
```
 
0 0
```
call-by-name: in the body of Sum the local variables x,y,n are bound to suspensions with bodies j, a[j], temp closed over the calling environment where temp is a variable generated by the language processor (compiler or interpreter) holding the value 10. In the body of Sum, the loop repeatedly evaluates the suspensions for x, y. x always evaluates to the box corresponding the variable j, but y evaluates to the box corresponding to array element a[j] which depends on j. Hence, the loop sums the integers between 0 and 9 and increments variable j on each iteration. The program prints
```
 
10 45
```
call-by-reference: in the body of Sum the local variables x,y,n are bound to the boxes corresponding to the variable j, the array element a[0], and a dummy box created to hold the actual parameter 10. (We are assuming that the language boxes constants passed by reference instead of declaring the program syntactically incorrect.) In the body of Sum, the loop repeatedly adds 0 to sum, incrementing j by one on each iteration. Hence, the program prints
```
 
10 0
```
call-by-value-result: in the body of Sum the local variables x,y,n are bound to the boxes that are copies of the boxes corresponding to the variable j, the array element a[0], and a dummy box temp created to hold the actual parameter 10. (We are assuming that the language boxes constants passed by value-result instead of declaring the program syntactically incorrect.) In the body of Sum, the loop repeatedly adds 0 to sum, incrementing the local box for x on each iteration. On exit, the values of x,y,n are copied into the variables (boxes) j,a[0],temp. Hence, the program prints
```
10 0
```
just like call-by-reference. The code in this program does not depend on whether the local variables are copies or originals.
call-by-result: the parameter passing convention does not make sense in this example because it fails to initialize the formal parameters x, y, n. The behavior of the program is unpredictable. It may generate a subscripting error in the loop in Sum because n is not initialized!

Mutation: An Alternate Design

ML supports an appealing variant of pass-by-value with nearly the same expressive capabilities as call-by-reference. ML forces a programmer to associate variables that are used for modeling cycles or state change with reference cells (or boxes). Put differently, it makes references into values and can thus turn pass-by-value into the parameter passing of references exactly when needed.

To model ML we introduce three new classes of expressions:

(ref M), which creates a ref cell, which is a record with one slot that holds the value of M;
(! M), which assumes that M will evaluate to a ref cell; and,
(:= M1 M2), which assumes M1 will evaluate to a ref cell, and replaces the contents of that ref cell with the value M2 reduces to.

In LC-ML, ref cells are a new class of values, distinct from numbers and procedures (closures). Interpreting the new expressions requires the addition of three new lines to the original interpreter of LC:

(define MEval
  (lambda (M env)
    (cond
      ((var? M) (lookup (var-name M) env))
      ((num? M) (num-num M))
      ((add? M) (+ (MEval (add-left M) env)
                  (MEval (add-right M) right)))
      ((proc? M) (make-closure M env))
      ((app? M) (MApply (MEval (app-rator M) env)
                  (MEval (app-rand M) env)))
      ((ref? M)
	(box (MEval (ref-init M) env)))
      ((!? M)
	(unbox (MEval (!-expr M) env)))
      ((:=? M)
	(set-box! (MEval (:=-lhs M) env)
	  (MEval (:=-rhs M) env))))))

(define MApply
  (lambda (val-of-fp val-of-arg-p)
    (val-of-fp val-of-arg-p)))

Exercise: Write the function swap in LC-ML.

Orthogonality

What did we have to change here? We added one line for each new feature, but were able to leave the old code untouched. This is a very elegant design: adding the new set of features does not change the underlying interpreter for the old language. This is called orthogonality.

Exercise: What is the price that programmers pay for the orthogonal design of ML?

Values in Algol-like Languages versus Mostly Funtional Languages

In Scheme, ML, and Java, mutable objects are considered values. Hence, when a vector (array) is passed as a value to a procedure (method), the corresponding formal parameter refers to the passed object. In Algol-like languages (including C), values are immutable! Hence, when any object such as an array or structure is passed by value, a copy is constructed of the bitstring representing the object. In these languages, data sharing is achieved by using explicit pointers and explicitly dereferencing them. Pointers are passed by value, just like integers. If a procedure is passed a pointer to an object, it can explicitly dereference the pointer to access the shared object. Hence, to pass an array object (rather than a copy), the calling code must pass a pointer to the array or alternatively pass the array by reference. (To prevent inefficient array copying, C/C++ treats values of type T[] as values of type *T in most contexts).

Simulating Mutation in Purely Functional Interpreters

In our interpreter, we defined the semantics of assignment and data mutation in LC in terms of data mutation in Scheme. What did this reduction accomplish? We reduced the meaning of all possible assignment and data mutation operations in LC to the meaning of a single Scheme program relying on a few uses of mutation. Can we do better? In particular, can we reduce the mutation to purely functional form (which can be understood in terms of familiar concepts of algebra)? The answer is yes, but the reduction introduces extra complexity into our representation of environments.

To simulate the modification of an environment in a purely functional interpreter, we might try to model assignment by constructing a new environment identical to the current environment except for the specified binding change. But this simple strategy obviously fails because constructing such an environment has NO effect on the environment embedded in interpreter invocations at higher levels in the interpreter calling chain. (MEval repeatedly calls itself passing an environment argument; constructing a new environment does not modify the value of any environments in stack of pending calls on MEval.

We could try to patch this defect by modifying MEval to return a pair consisting of the expression value and the ``updated'' environment. Then the returned environment could be used by higher level invocations of MEval. But even this ``fix'' does not work because the environment lists are shared. A given variable binding can appear in many different environments (every extension of the environment where the variable was introduced), many of which do not use the ``current'' environment (e.g., closures!). Hence, returning a modified environment to the calling invocation of MEval does not communicate the change to every context in the pending evaluation (e.g., a closure referring to the afffected variable) where the variable appears.

The common solution to this problem is to use brute force and pass the parameters required to simulate an idealized shared memory system. From a mathematical perspective, an environment is function mapping symbols to values. We can partition this function into two stages: the mapping of symbols to numeric locations and the mapping of locations to values. The first part is called the ``location environment'' (or environment for short) and the second part is called the ``store''. Note that the second part looks exactly like an idealized computer memory where the contents of memory locations are arbitrary values instead of bitstrings.

By using such a ``two-level'' characterization of the environment, we can explicitly specify which variable bindings are shared (refer to the same box). Two variables share a box iff the location environment maps them to the same numeric location. In such an interpreter, every evaluation of a procedure application must generate unique location names for the new local variable introduced by the procedure. But the generation of these locations requires access either to the entire database (pool) of environments that currently exist in the interpretation process or a counter that is incremented every time a variable is allocated (and used to generate a fresh name). Since the interpreter is purely functional, such a counter much passed as an argument to MEval and any subfunction that can directly or indirectly allocate new variables.

Of course, both pieces of the environment, called the ``location'' environment and the store, must be passed to MEval and every subfunction that refers to the environment.

Exercise: Rewrite the interpreter from Section 3 to use a ``two level environment'' consisting of a ``location environment'' and a store.

Back to course website