Section 13: Types and datatype





Types are the thought-police of the programming language world.

In the first part of this course, we discussed how to model the traditional class of language facilities with interpreters. In the second part, we are studying how to understand typed languages. Remember that types impose syntactic restrictions on programmers to prevent simple mistakes. Typically these mistakes re-inforce the abstraction boundaries a programmer thinks about as he programs, e.g., that some variable ranges over integer values and another over integer functions. In practice, we will restrict programmers by making certain program phrases illegal, and not accepting such phrases.

Typed languages require two extensions:

  1. A sub-language for talking about types.
  2. A mechanism for stating what the types of certain primitive phrases are, and rules for inferring the rules for the remaining phrases in a program. In our study we have followed the usual strategy of specifying the types for all binding occurrences of of identifiers and of inferring the types of the other phrases in a program with a type checker.

Next we take a closer look at what kind of restrictions our simple type system imposed on TLC, and how to get this expressive power of programming back.

Types are Restrictive

Consider the Scheme procedure assq, which takes two inputs, x of type sym and list of type (sym value) list. It returns a value (x,v) if x is associated with v in l, or #f otherwise. We cannot write assq in TLC; the essence of the problem can be described in the following expression:

(if P

where P is some expression whose type is bool. This expression cannot be typed in TLC because both branches of a conditional expression in TLC have to have the same type. Likewise, we cannot use lists with assq unless all the elements have the same type. Both restrictions are severe considering the flexibility of Scheme's assq.

We could avoid these problems if we added a new kind of type constructor to our type language:

t :== ... | (t + t)

which is known as a union type. A type (t1 + t2) indicates that the corresponding expressions have either type t1 or type t2.

Now consider the following program, which sums up the leaves of a binary tree:

(define Sigma
  (lambda (tea : tree)
    (if (is-a-leaf? tea)
      (+ (Sigma (left tea)) (Sigma (right tea))))))

Can this program be typed in TLC?

Clearly, tea is of type tree; but we want to view it as a number or as a combination of two trees. In Scheme, we would write a data description that looked like this:

A Binary Tree is either
  a number, or
  a pair of Binary Trees.

But with the current type system, we cannot write down such a description for arbitrarily large binary trees; we can only construct trees of fixed, finite depth.

What we really need is the ability to make a component of a type refer to a part of itself, ie, a recursive type. In C, we would solve this problem by using pointers, but this solution is extremely low-level. Pointers are to data structures what labels and goto's are for program structures: they are all low-level methods that make it easy for the programmer to make mistakes. Indeed, pointers are typically worse than goto's: few languages provide ways to jump to computed labels, but many languages, especially C, provide ways to dereference computed pointers.

A datatype declaration has the following format:

(datatype tid
  ((constr-1 type-11 type-12 ...)
   (constr-n type-n1 type-n2 ...))

This declaration creates a new type whose name is tid. The type can have several forms, called variants, which are described by the entries in the declaration. Each variant has a constructor, whose name is the first field of that variant's entry. Each n-place constructor is followed in the entry by n types, which specify the types of the arguments allowed for the constructor. The datatype declaration provides a means for declaring union types; most crucially, the scope of tid is the entire datatype declaration, so recursive types can also be created in this manner.

For example, consider the declaration of the type of binary trees, of the sort we summed above:

(datatype BT
  ((leaf int)
   (node BT BT))

However, if we try writing code that uses this type declaration, we soon find that we are missing some key items, such as a means for telling the different variants apart (predicates) and means for getting at the individual fields in a component (selectors). Hence, we extend the syntax of a datatype declaration to include these:

(datatype tid
  ((constr-1 pred-1 (type-11 sel-11) (type-12 sel-12) ...)
   (constr-n pred-n (type-n1 sel-n1) (type-n2 sel-n2) ...))

datatype combines union types with recursive types. After all, the latter only make sense with the former, and the former can easily be introduced with datatype. For example, one might write the return type for assq as

(datatype assoc-type
  ((SOME int)

Question: Every Scheme program can be prefixed with a single datatype declaration, and then be explicitly typed in terms of that declaration. What would this datatype look like?

Typechecking datatype

We have thusfar appealed to intuition to understand the typing properties of datatype. This intuitive understanding is not sufficient to answer questions such as, ``How many types are created if we place a datatype statement inside a loop?'' Hence, we shall construct a typing rule for datatype.

Clearly, the type of the (datatype ...) expression should be that of the result expression contained therein. So we can write

     ... tenv' |- exp : t
tenv |- (datatype ... exp) : t

where tenv' is possibly an augmented version of tenv. However, it would be quite useless if the type environments tenv and tenv' were the same, since the declarations would then have had no effect at all. In fact, the declarations augment the type environment in the expected manner. Each of the constructors, selectors and predicates is typed --

constr-1 : type-11 ... type-1m -> tid 
pred-1 : tid -> bool
sel-11 : tid -> type-11

-- and these types are then incorporated into tenv', shadowing type bindings in tenv as appropriate:

tenv + [constr-1 : type-11 ... type-1m -> tid,
        pred-1 : tid -> bool,
        sel-11 : tid -> type-11,
        constr-2 : type-21 ... type-2k -> tid,
        ... ] |- exp : t
        tenv |- (datatype ... exp) : t

This seems sufficient: we augment the type environment with the types of the procedures created by the datatype declaration, and then, using their types, type the contained expression, returning its type as that of the overall expression. What is wrong with this judgement?

One reasonable question is, ``What if the value returned were of type tid?'' It would not be a very meaningful value, since tid's extent is the lexical (datatype ...) body; not only would there be no way of inspecting or using the returned value, it would not even have a meaningful type. Thus, we should prevent against this possibility. (Incidentally, the naïve approach is the one taken by ML.)

We therefore impose the restriction that tid cannot be free in the type of the expression portion of the datatype declaration. However, this is unfortunate, since there are, in fact, situations where it is meaningful to return an object of type tid along with some procedures that operate on such an object. This cannot be done with our typing restrictions. The types in such a program would, per force, have to be anticipated, and the declaration would have to be moved into a sufficiently global scope such that the creation and all uses of objects of that type are encompassed.

We are still not done! (This example should illustrate the extremely subtle problem of designing type systems.) Here is a ``poem'' that illustrates one remaining problem:

(datatype A
  ((cons-1 (int sel-1)))
 (datatype _
   ((cons-2 (bool sel-2)))
  (if (sel-2 (cons-1 5)) ... ...)))

Depending on what _ is, we may or may not notice the problem. We do not if it were, say, B; but we definitely do if it is A.

Question: Explain what the problem is, and show how it can be resolved.

Back to course website