Principles of Programming Languages: Outline of Lecture 8

Outline of Lecture 8, Comp 311

Assignment and Mutability

Practical programming languages include operations for mutating the values of program variables and data structures. To model this feature in LC, we will add an assignment operation to the language with concrete syntax

(set! x M)

and the abstract representation (AST)

(define-structure (setter lhs rhs))

where x is be any lambda-bound identifier. set! allows us to model changing events in the real world.

There are two defensible models of how assignment should be defined. In some languages like Scheme and Java, an assignment changes the binding of an identifier. Conceptually, assignment to an identifier destructively modifies the value field of the binding for the specified identifier in the various program environments that include that variable. In this interpretation of assignment, only the value of an identifer can be passed as an argument to a procedure. Hence, assignment to a formal parameter has no effect on the corresponding actual parameter.

In other languages, notably ML, only designated fields of data structures can be modified. "Assignable variables" are simply identifiers bound to modifiable structures called "cells" or refs. These structures have a single updatable field.

Some languages like Pascal and C++ support a hybrid of these two approaches in which variables implicitly reside in cells but the cell stuctures are revealed only when variables are passed "by reference". We will defer discussing this hybrid approach until we talk about the various schemes for passing parameters to procedures in imperative languages.

Typically, a single set! enables us to create cycles and graphs, while multiple set!s model state in the modeled universe.

Consider the following code fragment:

((lambda (x)
 ...
 (set! x 6)
 ...)
 5)

What does the set! do to x? Initially, x is bound to the value 5. But the 5 is a constant, so it can't directly be changed.

To understand this issue more clearly, think about the difference between the Java values of type int and type Integer. Since each int value is simply a bit pattern, it is by definition immutable. In contrast, each Integer value is an object allocated on the heap. An Integer object contains a value field that holds an int. In principle, the value field in an object Integer(5) on the heap could be modified to hold the bit pattern for int 6 instead of 5, but Java wisely prohibits this modification by declaring the value field private final. To change the value associated with a variable x in LC, we must bind a different value to the variable x in the environment. We can accomplish this by including a clause in MEval case-split of the form:

((setter? M) ``change the environment'')

But how do we do this?

To make variables assignable, we need to change what references to the variable mean (as determined by the lookup function). To enable changing the values of variables, the environment must bind variables to mutable (modifiable) objects rather than values. What kind of object can we use? In Scheme, a particularly apt choice is a box that holds the value of the variable.

Moral: Variables must stand for boxes.

Where do we associate boxes with values? Since every invocation of a procedure is supposed to create a new copy of the body with all occurrences of the parameter replaced by the argument value, it is natural to create a box immediately after the argument is evaluated and to put the value of the argument into that box. Hence, the clause for processing app expressions becomes:

((app? M)
  (MApply
    (MEval (app-fp M) env)
    (box (MEval (app-ap M) env))))

This change forces another change in the interpreter: since every variable is boxed, we now have to unbox it to get its actual value.

((var? M)
  (unbox (lookup (var-name M) env)))

At this point, there is no LC program we can write to distinguish this implementation of LC from the one we had before.

We are now ready to interpret a set! expression:

((setter? M)
  (set-box! (lookup (setter-lhs M) env)
    (MEval (setter-rhs M) env)))

Exercise: how could we extend our substitution-based evaluation rules to handle assignment? Hint: recall the rules for evaluations involving assignment from Comp 210.

Call-by-Value and Call-by-Reference

Consider this program, which contains a mutation:

(let ((f (lambda (x) (set! x 5))))
  (let ((y 10))
    (let ((_ (f y)))
      y)))

What result is produced by evalating this program? The value of y, which is 10, is placed in a new box when f is applied; this new box is thrown away after the procedure body (including the set!) has been evaluated, so the value of the expression is the value of y, which never changed. This parameter-passing mechanism is called call-by-value: we passed the value of y, not the capability to change the value in the box for y. Therefore, in this language, we cannot write a swap procedure.

Assume we want that the following expression

(let ((f (lambda (x y)
	   (let ((t x))
	     (set! x y)
	     (set! y t)))))
  (let ((a 5) (b 6))
    (f a b)
    (cons a b)))

evaluate to (cons 6 5). Then we have to pass references to variables a and b, not their value alone. These references are simply the boxes holding the variables' values. We can support this capability with a small (but ill-advised) change to our LC interpreter based on the following observation: when the argument expression in an application is already a variable, it is associated with a box in the environment. Hence, we can pass this box to the procedure and don't need to create a new one:

((app? M)
  (MApply (... fp ...)
    (if (var? (app-rand M))
      (lookup (var-name (app-rand M)) env)
      (...))))

This new mode of parameter-passing is called call-by-reference. Pascal and Fortran use variants of this parameter-passing technique. While passing references enables programmers to write procedures like swap, it also introduces a new phenomenon into the language: variable aliasing. Variable aliasing occurs when two syntactially distinct variables refer to the same mutable location in the environment. In Scheme such a coincidence is impossible; in Pascal it is common.

The absence of variable aliasing in Scheme does not mean that Scheme escapes the aliasing problem. Scheme only guarantees that distinct varible names do not refer to the same location (box). Scheme allows data aliasing, where more than selction path refers to the same location. For example, two elements of a vector can be exactly the same box. All interesting programming languages permit data aliasing.

Exercise: Why is the preceding change to the LC interpreter ill-advised? Hint: consider the semantics of the function calls (f x), (f 1), and (f ((lambda (y) y) x)). Extending LC to cleanly support call-by-reference is a challenging problem.

Variations on Call-by-Reference

Call-by-name

The original Algol 60 language supported call-by-value and call-by-name, but not call-by-reference. In imperative languages (languages supporting mutable state), call-by-name has the same semantics as it does in functional languages, assuming that we equate left-hand-evaluation in imperative languages with evaluation in functional languages. As a result, call-by-name is a baroque, expensive alternative to call-by-reference. Each actual parameter passed by reference is a suspension that yields a box (also called a reference or a location) when it is evaluated. In essence, call-by-name repeatedly evaluates the actual parameter to produce a box every time the corresponding formal parameter is referenced. If the suspension produces the same location each time, then call-by-name is equivalent to call-by-reference. But the suspension can contain references to variables that change (from assignment) while the procedure body is executed. In the special case where the actual parameter does not left-hand evaluate to a box (e.g., an actual parameter that is a constant like 10), the calling program generates a dummy box and copies the value into the box. The following Algol-like code (written in C syntax)exploits this property.

procedure Sum(int x, int y, int n) {
  // actual y must be an expression in which actual x occurs free
  int sum = 0;
  for (x = 0; x < n, x++) sum = sum + y;
  return sum;
}
  
int j = 0;

int[10] a;

// initialize a
for (int i = 0; i < 10; i++) a[i] = i;

print(j,Sum(j,a[j],10));

The ugly convention of passing j and a[j]j to determine different values for the formal parameter corresponding to a[j] is called Jensen's device. In the imperative world, the call-by-need optimization of call-by-name does not work because the reevaluation of a suspension does not necessarily produce the same result! In the code above, the suspension for x passed to Sum always evaluates to the box containing the global variable j, but the suspension for y contains a free reference to j which is synonymous with x. Hence, the assignments to x in the body of Sum change the contents of the global variable j; the procedure call Sum(j,a[j],10) sums the elements of a. Modern languages have dropped call-by-name in favor of call-by-reference eliminating the obscure interactions between call-by-name evaluation and modifications to variables.

Call by Value-Result

Another alternative to call-by-reference is called call-by-value-result. When an actual parameter is passed by value-result, the calling procedure left-hand-evaluates the actual parameter exactly as it would for call-by-reference. It passes the address of the box to the called procedure which saves it, creates a new local variable (a box) for the corresponding formal parameter and copies the contents of the passed box into the local box. During the execution of the procedure body, the local copy is used whenever the formal parameter is accessed. On exit from the called procedure, the called procedure copies the contents of the local box into the corresponding actual parameter box. In essence, call-by-value-result creates a temporary copy of the actual parameter box and copies the contents of this copy into the actual parameter box on exit. Languages that support value-result parameter passing often support call-by-result also. Call-by-result differs from call-by-value-result by not initializing the contents of the local box bound to the formal parameter. For the sake of safety, it must confirm that the contents denote a legal data value.

An Example Illustrating Parameter Passing Mechanisms

Consider the Algol-like code fragment involving the procedure Sum above. Let's determine what values will be printed for each of the different parameter passing mechanisms that we have discussed:

call-by-value: in the body of Sum the local variables x,y,n are new boxes with contents 0,0,10, respectively. Hence, the loop repeatedly adds 0 to sum which is initially to 0. Hence, the returned result is 0 and the program prints
```
 
0 0
```
call-by-name: in the body of Sum the local variables x,y,n are bound to suspensions with bodies j, a[j], temp closed over the calling environment where temp is a variable generated by the language processor (compiler or interpreter) holding the value 10. In the body of Sum, the loop repeatedly evaluates the suspensions for x, y. x always evaluates to the box corresponding the variable j, but y evaluates to the box corresponding to array element a[j] which depends on j. Hence, the loop sums the integers between 0 and 9 and increments variable j on each iteration. The program prints
```
 
10 45
```
call-by-reference: in the body of Sum the local variables x,y,n are bound to the boxes corresponding to the variable j, the array element a[0], and a dummy box created to hold the actual parameter 10. (We are assuming that the language boxes constants passed by reference instead of declaring the program syntactically incorrect.) In the body of Sum, the loop repeatedly adds 0 to sum, incrementing j by one on each iteration. Hence, the program prints
```
 
10 0
```
call-by-value-result: in the body of Sum the local variables x,y,n are bound to the boxes that are copies of the boxes corresponding to the variable j, the array element a[0], and a dummy box temp created to hold the actual parameter 10. (We are assuming that the language boxes constants passed by value-result instead of declaring the program syntactically incorrect.) In the body of Sum, the loop repeatedly adds 0 to sum, incrementing the local box for x on each iteration. On exit, the values of x,y,n are copied into the variables (boxes) j,a[0],temp. Hence, the program prints
```
10 0
```
just like call-by-reference. The code in this program does not depend on whether the local variables are copies or originals.
call-by-result: the parameter passing convention does not make sense in this example because it fails to initialize the formal parameters x, y, n. The behavior of the program is unpredictable. It may generate a subscripting error in the loop in Sum because n is not initialized!

Mutation: An Alternate Design

ML supports an appealing variant of pass-by-value with nearly the same expressive capabilities as call-by-reference. ML forces a programmer to associate variables that are used for modeling cycles or state change with reference cells (or boxes). Put differently, it makes references into values and can thus turn pass-by-value into the parameter passing of references exactly when needed.

To model ML we introduce three new classes of expressions:

(ref M), which creates a ref cell, which is a record with one slot that holds the value of M;
(! M), which assumes that M will evaluate to a ref cell; and,
(:= M1 M2), which assumes M1 will evaluate to a ref cell, and replaces the contents of that ref cell with the value M2 reduces to.

In LC-ML, ref cells are a new class of values, distinct from numbers and procedures (closures). Interpreting the new expressions requires the addition of three new lines to the original interpreter of LC:

(define MEval
  (lambda (M env)
    (cond
      ((var? M) (lookup (var-name M) env))
      ((num? M) (num-num M))
      ((add? M) (+ (MEval (add-left M) env)
                  (MEval (add-right M) right)))
      ((proc? M) (make-closure M env))
      ((app? M) (MApply (MEval (app-rator M) env)
                  (MEval (app-rand M) env)))
      ((ref? M)
	(box (MEval (ref-init M) env)))
      ((!? M)
	(unbox (MEval (!-expr M) env)))
      ((:=? M)
	(set-box! (MEval (:=-lhs M) env)
	  (MEval (:=-rhs M) env))))))

(define MApply
  (lambda (val-of-fp val-of-arg-p)
    (val-of-fp val-of-arg-p)))

Exercise: Write the function swap in LC-ML.

Orthogonality

What did we have to change here? We added one line for each new feature, but were able to leave the old code untouched. This is a very elegant design: adding the new set of features does not change the underlying interpreter for the old language. This is called orthogonality.

Exercise: What is the price that programmers pay for the orthogonal design of ML?

Values in Algol-like Languages versus Mostly Funtional Languages

In Scheme, ML, and Java, mutable objects are considered values. Hence, when a vector (array) is passed as a value to a procedure (method), the corresponding formal parameter refers to the passed object. In Algol-like languages (including C), values are immutable! Hence, when an array or structure is passed by value, a copy is constructed of the bitstring representing the array or structure. In these languages, data sharing is achieved by using explicit pointers and explicitly dereferencing them. Pointers are passed by value, just like integers. If a procedure is passed a pointer to an object, it can explicitly dereference the pointer to access the shared object. Hence, to pass an array object (rather than a copy), the calling code must pass a pointer to the array or alternatively pass the array by reference. (To prevent inefficient array copying, C/C++ treats values of type T[] as values of type *T in most contexts).

Simulating Mutation in Purely Functional Interpreters

In our interpreter, we defined the semantics of assignment and data mutation in LC in terms of data mutation in Scheme. What did this reduction accomplish? We reduced the meaning of all possible assignment and data mutation operations in LC to the meaning of a single Scheme program relying on a few uses of mutation. Can we do better? In particular, can we reduce the mutation to purely functional form (which can be understood in terms of familiar concepts of algebra)? The answer is yes, but the reduction introduces extra complexity into our representation of environments.

To simulate the modification of an environment in a purely functional interpreter, we might try to model assignment by constructing a new environment identical to the current environment except for the specified binding change. But this simple strategy obviously fails because constructing such an environment has NO effect on the environment embedded in interpreter invocations at higher levels in the interpreter calling chain. (MEval repeatedly calls itself passing an environment argument; constructing a new environment does not modify the value of any environments in stack of pending calls on MEval.

We could try to patch this defect by modifying MEval to return a pair consisting of the expression value and the ``updated'' environment. Then the returned environment could be used by higher level invocations of MEval. But even this ``fix'' does not work because the environment lists are shared. A given variable binding can appear in many different environments (every extension of the environment where the variable was introduced), many of which do not use the ``current'' environment (e.g., closures!). Hence, returning a modified environment to the calling invocation of MEval does not communicate the change to every context in the pending evaluation (e.g., a closure referring to the afffected variable) where the variable appears.

The common solution to this problem is to use brute force and pass the parameters required to simulate an idealized shared memory system. From a mathematical perspective, an environment is function mapping symbols to values. We can partition this function into two stages: the mapping of symbols to numeric locations and the mapping of locations to values. The first part is called the ``location environment'' (or environment for short) and the second part is called the ``store''. Note that the second part looks exactly like an idealized computer memory where the contents of memory locations are arbitrary values instead of bitstrings.

By using such a ``two-level'' characterization of the environment, we can explicitly specify which variable bindings are shared (refer to the same box). Two variables share a box iff the location environment maps them to the same numeric location. In such an interpreter, every evaluation of a procedure application must generate unique location names for the new local variable introduced by the procedure. But the generation of these locations requires access either to the entire database (pool) of environments that currently exist in the interpretation process or a counter that is incremented every time a variable is allocated (and used to generate a fresh name). Since the interpreter is purely functional, such a counter much passed as an argument to MEval and any subfunction that can directly or indirectly allocate new variables.

Of course, both pieces of the environment, called the ``location'' environment and the store, must be passed to MEval and every subfunction that refers to the environment.

Exercise

Rewrite the interpreter from Lecture 3 to use a ``two level environment'' consisting of a ``location environment'' and a store.

cork@cs.rice.edu (adapted from shriram@cs.rice.edu)