[PLT logo] Lecture 1, Comp 311

The Boring Part of Programming Languages.

Every programming language represents programs as sequence of symbols. To facilitate the construction of programs on a typical computer, these symbols are represented as sequences of characters that can be typed on a keyboard. Since the ASCII character is the standard international character set, all commercially viable languages use ASCII to represent the symbols forming their syntax.

To perform any computational process on the programs written in a language (e.g., execute those programs), we must take ASCII strings representing program text and build higher-level tree representations for the text. This process, called parsing, conveniently breaks into two parts: lexical analysis and context-free parsing (which is often simply called parsing).


Lexical Analysis

Consider this sequence of characters:

m anbi te sd og

It doesn't take long to recognize the above characters as spelling out the sentence, ``man bites dog''. The process of taking an input stream of characters and converting it into a sequence of distinct, recognizable symbols (called tokens or lexemes) is called tokenizing (or lexical analysis). Likewise, in Scheme, we might tokenize

(set! x (+ x 1))

as (, set!, x, (, +, x, 1, ), and ).

The process of tokenizing is fairly straight-forward; assembling a structural description of a program (typically a tree) from a sequence of tokens is harder. Consider the C statement

x = x + 1 ;

We have, in order, an identifier, the assignment symbol, another identifier, a binary operator, a constant and a terminator. We expect the assignment operator to be followed by an expression. An expression, in turn, can be two expressions separated by a binary operator, such as +. Finally, statements end with a terminator. Reasoning thus, we can conclude that the above sequence of characters represents a legal program statement.


Parsing: Rejecting Illegal Programs

In general, legal tokens don't always group into legal expressions, just as in English. The act of determining which sequence of tokens are put together legally, and of classifying these groups, is called parsing.

The potential inputs (PI) are all sequences that can be typed in. Of these, we want to determine the subset of syntactically legal phrases. (However, we aren't considering whether these phrases make sense yet.) We will use the following conventions: Num represents any number and empty represents the empty sequence of tokens. Then PI is defined by the following equation:

PI ::= empty
     | T PI 

where T is defined by:

T ::= Num 
     | + 

The set of legal inputs AE is defined by the equation:

AE ::= Num 
     | Num + AE

For example, the input

1 2

is a PI but not an AE. A parser is responsible for determining this fact. The following program performs this determination assuming pi is a Scheme list containing the sequence of tokens in the given input:

(define parse
  (lambda (pi)
    (cond
      ((null? pi) #f)
      ((number? (car pi)) 
       (or (null? (cdr pi))
           (and (pair? (cdr pi)) (eq? (cadr pi) '+) (parse (cddr pi)))))
      (else #f))) 

Simply recognizing whether a sequence of tokens forms a legal program is not very informative. To understand (e.g., compile or interpret) a program, we must understand its structure. We will use this structure as a basis for defining the meaning of programs.


Parsing: Assembling Sequences of Tokens into Abstract Syntax Trees

Consider the following syntaxes for assignment:

x = x + 1
x := x + 1
(set! x (+ x 1))

Which of these do we prefer?

The answer is that it really doesn't (and shouldn't) matter very much. All three perform the same abstract operation and differ only in surface syntax. Thus, when we study the operations, we would like to do so without paying attention to such details. A minor argument against the first form of syntax (as in C/C++) is that the = symbol is misleading notation for assignment because it denotes the equality predicate in standard mathematical discourse. As a result, programmers will try to write

x = y

to test the equality of x and y.

To eliminate the irrelevant syntactic details, we can create a data representation that formulates program syntax as trees, or equivalently, words in a free term algebra. For instance, the abstract syntax for the assignment code given above could be

(make-assignment <representation of "x"> <representation of "(x + 1)">)
in Scheme notation or, equivalently,
make-assignment(<representation of "x">, <representation of "x + 1">)
in algebraic notation.

The idea is to create abstract syntax that represents the program structurally as a tree instead of a sequence of tokens. In contrast to conventional surface syntax, it typically disambiguates different semantic uses of the same token.

For example, we might want to distinguish between unary and binary operations. Some operators such as - are both, so merely looking at the operator doesn't tell us what kind of operation we have; we would have to see how many operands it has been given. This is both inconvenient and inefficient. Instead, we can immediately distinguish between the uses of - by creating different abstract syntax representations:

-1     is represented as  (apply (make-unary-op '-) 1)
1 - 2  is represented as  (apply (make-binary-op '-) 1 2)

On demand, surface syntax for the program can be generated from the abstract syntax (although its formatting may differ slightly rom the original input).

Let us consider a concrete example of parsing a surface syntax into abstract syntax. Assume the following potential inputs (PI) and legal programs (L):

PI ::= empty
     | T PI
T ::= Num 
    | Sym
    | (
    | )
L ::= Num
    | Var
    | (L L)
    | (lambda Var L)

where Sym is the set of Scheme symbols, Var is the set Sym less the elements 'lambda, left-paren, and right-paren. Names that are specified explicitly (like lambda) are called keywords. They are markers that distinguish a group of tokens.

We will use Scheme's data definition language to design abstract representations for program components. Here is the set of data constructors that we will use:

(define-structure (num n))
(define-structure (var s))
(define-structure (proc param body))
(define-structure (app rator rand))

corresponding to Num, Var, applications (L L), and procedures ('lambda Var L), respectively.

A simple strategy for parsing a sequence of input tokens into abstract syntax is to determine how the string is generated from the inductive definition by matching the clauses of the definition against the input stream proceeding left-to-right and using using K symbol lookahead to determine which choice is made at each point in the generation process. In our sample syntax above, this involves matching the clauses for L against the input stream. This strategy is called "recursive descent" or LL(K) parsing. Unfortunately, this strategy fails miserably if the inductive definition has left recursive clauses such as

L ::= L + Num
because it will infinitely loop. In addition, some inductive definitions of syntax will not accommodate this approach because K symbol lookahead is not sufficient to determine which clause of the definition applies.

The following Scheme function parse uses recursive descent to parse the input stream into a program expression L:

;; structures defining abstract syntax trees
(define-struct proc (var body))
(define-struct var (symbol))
(define-struct app (rator rand))
(define-struct num (number))

(define parse
  (lambda (stream)
    (local 
      ((define token (stream)))
      (if (eof-object? token)     ; empty stream
          (error 'parse)
          (local ((define result (parse-L token stream))
		  (define extra (stream)))
            ; check for excess input
            (if (eof-object? extra)
                result
                (error 'parse "unexpected input: ~a" extra)))))))

(define parse-L ; first token has already been read
  (lambda (token stream)
    (cond 
      ((number? token) (make-num token))
      ((eq? token LEFT-PAREN)
       (local ((define token (stream)))
         (if (eq? token 'lambda) 
             (parse-proc stream)          ; consumes right-paren
             (parse-app token stream))))
      ((symbol? token) (make-var token))  ; consumes right-paren
      (else (error 'parse "urecognised token: ~a~n" token)))))

(define parse-proc ; 'lambda has already been read
                   ; must consume right-paren
  (lambda (stream)  
    (local ((define bound-var (stream)))
      (cond
        ((symbol? bound-var)
         (local ((define body (parse-L (stream) stream))
		 (define delimiter (stream))) ; consume right-paren
	   (if (eq? delimiter RIGHT-PAREN)
	       (make-proc bound-var body)
	       (error 'parse "unexpected input at end of proc: ~a" delimiter))))
        (else (error 'parse "bad variable in lambda: ~a" bound-var))))))

(define parse-app        
  (lambda (rator-first stream) ; first symbol of rator has already been read
    (local ((define rator (parse-L rator-first stream))
	    (define rand-first (stream)) ; first token of rand
	    (define rand (parse-L rand-first stream))
	    (define delimiter (stream)))
      (if (eq? delimiter RIGHT-PAREN)
          (make-app rator rand)
          (error 'parse "application too long: ~a" delimiter)))))

;; funny syntax for the symbol left paren
;; and the symbol right paren
(define LEFT-PAREN '|(|)
(define RIGHT-PAREN '|)|)

;; some test code
;; requires lexer.ss library to run the tests
(define-struct test (input output)) 
(map (lambda (t) (make-test t (parse (make-string-lexer t))))
     (list "f" "1/2" "(f 1)" "(lambda x x)"
	   "((lambda x (x x)) (lambda x (x x)))"))

Be sure to understand why we call error rather than return #f when we encounter illegal syntax.


Exercise: How can we construct a parser that signals more than one error?

We can now summarize our view of syntax:

If you design or build programming languages, you will have to understand much more about constructing and parsing surface syntax, and on the psychological impact of your choices. Our approach has some failings:

These problems are considered in detail in courses such as Comp 412. For our purposes, this treatment is sufficient.

cork@cs.rice.edu/ (adapted from shriram@cs.rice.edu)