Section 2: Parsing

Syntax: The Boring Part of Programming Languages

Every programming language represents programs as sequences of symbols. To facilitate the construction and manipulation of programs on a typical computer, these symbols are represented as sequences of characters that can be typed on a keyboard. Since ASCII is the standard international character set, all commercially viable languages use ASCII to represent the symbols forming their syntax.

More recently, a new standard called Unicode, which accommodates the character sets for languages like Chinese and Arabic as well as languages written in the Roman alphabet, has been embraced by some computer vendors. A few languages, notably Java, support both Unicode and ASCII.

To perform any computational process on the programs written in a language (e.g., execute those programs), we must take ASCII strings representing program text and build higher-level tree representations for the text. This process, called parsing, conveniently breaks into two parts: lexical analysis and context-free parsing (which is often simply called parsing).

Lexical Analysis

Consider this sequence of characters:

begin middle end

As readers of English, we immediately decompose this sentence into a sequence of three words: "begin", "middle", and "end". These words are the smallest meaningful pieces of syntax in the phrase; shorter sequences of characters like "egi" and "beg" are not meaningful in the context of the given phrase. The process of taking an input stream of characters and converting it into a sequence of distinct, meaningful symbols (called tokens or lexemes) is called tokenizing, lexing or lexical analysis. A program that converts a character stream to a token stream is called a tokenizer or a lexer.

We tokenize

(set! x (+ x 1))

into the following sequence of nine tokens (typically represented as Scheme symbols):

( set! x ( + x 1 ) )
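
To make this concrete, here is a small OCaml sketch of such a tokenizer for the parenthesized syntax above. It is an illustration under simple assumptions (whitespace separates tokens, and each parenthesis is a single-character token), not the lexer used later in these notes:

let tokenize (s : string) : string list =
  let buf = Buffer.create 16 in
  let toks = ref [] in
  (* move any accumulated characters from buf into the token list *)
  let flush_buf () =
    if Buffer.length buf > 0 then begin
      toks := Buffer.contents buf :: !toks;
      Buffer.clear buf
    end in
  String.iter
    (fun c -> match c with
       ' ' | '\t' | '\n' -> flush_buf ()
     | '(' | ')' -> flush_buf (); toks := String.make 1 c :: !toks
     | _ -> Buffer.add_char buf c)
    s;
  flush_buf ();
  List.rev !toks;;

(* tokenize "(set! x (+ x 1))"
   = ["("; "set!"; "x"; "("; "+"; "x"; "1"; ")"; ")"] *)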

Similarly, in Java, we tokenize

System.out.println("Hello World!");

as the sequence of nine tokens (typically represented using some notion of a Token class)

System . out . println ( "Hello World!" ) ;

The process of tokenizing is straightforward for most languages because it can be performed by a finite automaton (Fortran is a well-known exception!). The rules governing this process are (a very boring) part of the language definition. The task of assembling a structural description of a program (typically a tree) from a sequence of tokens is harder. Consider the Java statement

x = x + 1 ;

where x is an int variable.

We have, in order, an identifier x, the assignment symbol, the identifier x again, the binary operator +, the constant 1, and the terminator symbol ;. The grammatical rules of Java stipulate that the assignment operator must be preceded by an identifier and followed by an expression. An expression, in turn, may be two expressions separated by a binary operator, such as +. Finally, an assignment expression can function as a statement if it is followed by the terminator symbol. Hence, we can deduce from the grammatical rules of Java that the above sequence of tokens is a legal program statement.
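
In outline, the relevant rules look roughly like this (a simplified sketch for illustration, not the official Java grammar):

Stmt ::= Expr ;
Expr ::= Id = Expr | Expr + Expr | Id | Num

Matching x = x + 1 ; against these rules shows that it is a legal Stmt.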

Parsing: Rejecting Illegal Programs

In general, sequences of legal tokens don't always form legal program phrases, just as sequences of legal English words don't always form legal sentences. The rules that determine which sequences of tokens are legal program phrases are called a grammar. The act of applying the rules of a grammar to determine which sequences of tokens are legal, and of classifying these groups, is called parsing.

In practice, parsing does more than simply accept or reject sequences of tokens. It constructs a tree structure representing the structure of the program as analyzed by the rules of the grammar.

Let PI (for Potential Inputs) denote the set of all sequences of tokens that can be produced by the lexer for a trivial programming language which we will specify below. We want to determine the subset of PI consisting of all syntactically legal phrases. (Note: we aren't considering whether these phrases make sense yet; we are only concerned with whether they have proper structure.) We will use the following conventions: Num represents any number and empty represents the empty sequence of tokens. Then PI is defined by the following equation:

PI ::= empty | T PI 

where | means "or" and T is defined by:

T ::= Num | + 

The set of legal inputs AE is defined by the equation:

AE ::= Num | Num + AE

In OCaml, the abstract syntax for this language could be represented by a type declaration:

type ae = Num of int | Plus of int * ae

We could use a very similar representation in Scheme, but of course, our type "definition" would merely be a comment. We would need to introduce a structure to represent Plus constructions.

For example, the input

1 2

is a PI but not an AE. A parser is responsible for determining this fact.

In Scheme, the following function performs this computation assuming a-pi is bound to a Scheme list containing the sequence of tokens (symbols) in the given input:

(define AE?
  (lambda (a-pi)
    (cond
      ((null? a-pi) false)
      ((number? (first a-pi)) 
       (or (null? (rest a-pi))
           (and (eq? (first (rest a-pi)) '+) (AE? (rest (rest a-pi))))))
      (else false))))

The preceding code does not construct a parse tree for the recognized expression.

In OCaml, we could implement the corresponding function ae_Q (assuming a is bound to an OCaml list of tokens) as follows:

(* numberQ is an assumed helper predicate that tests whether a token
   string denotes a number *)
let rec ae_Q a =
  match a with
    [] -> false
  | x :: xs ->
      if numberQ x then
        (match xs with
           [] -> true
         | y :: ys -> (y = "+") && ae_Q ys)
      else false;;

Simply recognizing whether a sequence of tokens forms a legal program is not very informative. To understand (e.g., compile or interpret) a program, we must understand its structure. We will use this structure as a basis for defining the meaning of programs.
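
As a preview of this structure-building step, here is a hedged OCaml sketch that extends ae_Q to construct the ae tree defined earlier instead of merely answering true or false (numberQ is the same assumed helper predicate):

(* parse a token list into an ae tree, or fail *)
let rec parse_ae toks =
  match toks with
    [n] when numberQ n -> Num (int_of_string n)
  | n :: "+" :: rest when numberQ n -> Plus (int_of_string n, parse_ae rest)
  | _ -> failwith "not an AE";;

(* parse_ae ["1"; "+"; "2"]  evaluates to  Plus (1, Num 2) *)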

Parsing: Assembling Sequences of Tokens into Abstract Syntax Trees

Consider the following ways to express an assignment operation:


x = x + 1

x := x + 1

(set! x (+ x 1))

Which of these do we prefer?

The answer is that it really doesn't (and shouldn't) matter very much. All three perform the same abstract operation and differ only in surface syntax (also called concrete syntax). Thus, when we study the operations, we would like to do so without paying attention to such details. A minor argument against the first form of syntax (as in C/C++/Java) is that the = symbol is misleading notation for assignment because it denotes the equality predicate in standard mathematical discourse. As a result, programmers may mistakenly write

x = y 

to test the equality of x and y.

To eliminate the irrelevant syntactic details, we can create a data representation that formulates program syntax as trees, or equivalently, words in a free term algebra. For instance, the abstract syntax for the assignment code given above could be

(make-assignment <representation of "x"> <representation of "(x + 1)">)

in Scheme notation or, equivalently,

make-assignment(<representation of "x">, <representation of "x + 1">)

in algebraic notation. A typical representation in OCaml would be


Assign (Var "x", ... parse "x + 1"...).
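
For concreteness, one possible OCaml rendering is sketched below; the notes do not fix these type and constructor names, so they are assumptions introduced only for this example:

(* hypothetical abstract syntax for expressions and assignments *)
type expr = Num of int | Var of string | Plus of expr * expr
type stmt = Assign of expr * expr

(* x = x + 1  is represented as: *)
let example = Assign (Var "x", Plus (Var "x", Num 1));;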

The idea is to create abstract syntax that represents the program structurally as a tree instead of a sequence of tokens. In contrast to conventional surface syntax, it typically disambiguates different semantic uses of the same token.

For example, we might want to distinguish between unary and binary operations. Some operators such as - are both, so merely looking at the operator doesn't tell us what kind of operation we have; we would have to see how many operands it has been given. This is both inconvenient and inefficient. Instead, we can immediately distinguish between the uses of - by creating different abstract syntax representations:


-1     is represented as  (apply (make-unary-op '-) 1)

1 - 2  is represented as  (apply (make-binary-op '-) 1 2)
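
In OCaml, the same disambiguation might be expressed with distinct constructors (the type and constructor names here are assumptions for illustration):

type mexp =
    Lit of int
  | Neg of mexp             (* unary minus *)
  | Sub of mexp * mexp      (* binary minus *)

(* -1     is represented as  Neg (Lit 1)
   1 - 2  is represented as  Sub (Lit 1, Lit 2) *)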

Note that abstract syntax is a much more powerful program representation than surface syntax. It is trivial to generate the surface syntax for a program represented in abstract syntax by walking the tree representation (although the formatting of the generated surface syntax may differ slightly from the original input).

The converse is much harder; it is parsing! Technically, it is parsing augmented by abstract syntax tree generation, but plain parsing--accepting or rejecting potential inputs--is a pointless exercise in most contexts.

Let us consider a concrete example of parsing a surface syntax into abstract syntax. Assume the following potential inputs (PI) and legal programs (E):

PI ::= empty | T PI

T ::= Num | Sym | ( | )

E ::= Num | Var | (E E) | (lambda Var E)

where Sym is the set of Scheme symbols, Var is the set Sym less the elements lambda, left-paren, and right-paren, and lambda is the symbol lambda (not the name of a syntactic category like E). Names that are specified explicitly (like lambda) in the syntax are called keywords; they are name tokens with a special meaning.

In OCaml, the following datatype can be used to define an abstract syntax for this language:

type exp =
      Num of int
    | Var of string
    | App of exp*exp
    | Proc of string*exp;;

We can use Scheme's data definition language to design a similar abstract representation for program components. Here is the set of data constructors that we will use:

;; exp := (make-num number) + (make-var symbol) + (make-app exp exp) + (make-proc symbol exp)
(define-structure (num n))            
(define-structure (var s))            
(define-structure (app rator rand))   
(define-structure (proc param body))  ;; param is a symbol not a var!

where exp corresponds to E in the concrete syntax, num corresponds to Num, var corresponds to Var, app corresponds to (E E), and proc corresponds to

(lambda Var E)

Since the param field in a proc must be a variable, we represent it by a Scheme symbol rather than a var structure.

A simple strategy for parsing a sequence of input tokens into abstract syntax is to determine how the string is generated from the inductive definition by matching the clauses of the definition against the input stream, proceeding left-to-right and using k-symbol lookahead to determine which choice is made at each point in the generation process. In our sample syntax above, this involves matching the clauses for E against the input stream. This parsing strategy is called LL(k) parsing. Unfortunately, LL(k) parsing only works for some unambiguous context-free grammars. In particular, it fails miserably if the inductive definition has left-recursive clauses such as

E ::= E + Num

because it will infinitely loop.
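
The standard remedy, mentioned here only in passing, is to replace the left-recursive rule with a right-recursive one that generates the same strings:

E  ::= Num E'
E' ::= empty | + Num E'

An LL(1) parser can now choose between the two clauses for E' by peeking at a single token: a + selects the second clause, and anything else selects the first.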

In addition, some inductive definitions of syntax will not accommodate this approach because k-symbol lookahead is not sufficient to determine which clause of the definition applies. In practice, it is easy to design readable concrete syntaxes that can be parsed using only one symbol of lookahead (LL(1) parsing). Some existing languages, however, are not amenable to LL(k) parsing for any value of k. For such languages it is often possible to tweak either the lexer or the rules of the grammar slightly so that the parser accepts the same language (differently tokenized) or a slightly larger language which can be filtered later in the translation process.

Given an LL(k) grammar, it is easy to write a parser by hand using a programming model called "recursive descent". A recursive-descent parser includes a separate procedure to recognize each syntactic category in the grammar. Each such procedure matches the input stream by some combination of (i) directly inspecting the tokens in the stream and (ii) calling the procedures corresponding to the syntactic categories on the right-hand side of the rule for the category. Lookahead can be handled in one of two ways. First, the parser can read ahead and keep the tokens that have been read but not yet processed in a small k-element buffer; this approach is particularly convenient if k = 1 because the buffer reduces to a scalar variable. Second, the lexer can provide explicit peek operations allowing the parser to look ahead in the input stream without consuming any tokens. (In essence, the lexer has to do some buffering in this approach.)

The following Scheme function parse uses recursive descent to parse the input stream into a program expression Exp.

It does not presume that the lexer provides a lookahead facility; it performs its own buffering (a single token since k = 1 for this grammar) instead.

;; Exp := (make-num number) + (make-var symbol) + (make-app Exp Exp) + 
;;        (make-proc symbol Exp)
(define-structure (num n))            
(define-structure (var s))            
(define-structure (app rator rand))   
(define-structure (proc param body))  ;; param is a symbol not a var!

;; Note: make-proc takes a symbol as its first argument rather than a var because
;;       the make-var wrapper would be redundant; the first argument must be a variable

;; PI := (-> Symbol) with the side effect of advancing the cursor to
;;                   the next symbol

;; PI -> Exp | (error)
;; Purpose: (parse s) returns the tree in Exp corresponding to s
;;               if one exists;
;;           otherwise aborts by invoking (error ...)

(define parse
  (lambda (stream)
    (local 
      ((define token (stream)))
      (if (eof-object? token)     ; empty stream
          (error 'parse "unexpected end of input")
          (local ((define result (parse-Exp token stream))
		  (define extra (stream)))
            ; check for excess input
            (if (eof-object? extra)
                result
                (error 'parse "unexpected input: ~a" extra)))))))

;; Symbol PI -> Exp | (error)
;; Purpose: (parse-Exp t s) returns the tree in Exp corresponding to the
;;              longest matching prefix of t,s;
;;          aborts by invoking (error ...) if no such prefix exists

(define parse-Exp 
  (lambda (token stream)
    (cond 
      ((number? token) (make-num token))
      ((eq? token LEFT-PAREN)
       (local ((define token (stream)))
         (if (eq? token 'lambda) 
             (parse-proc stream)          ; consumes trailing right-paren
             (parse-app token stream))))  ; consumes trailing right-paren
      ((symbol? token) (make-var token))
      (else (error 'parse "unrecognized token: ~a~n" token)))))

;; PI -> proc | (error)
;; Purpose: (parse-proc s) returns the proc tree corresponding to
;;               the longest matching prefix of 'lambda,s
;;          aborts by invoking (error ...) if no such prefix exists

(define parse-proc
  (lambda (stream)  
    (local ((define bound-var (stream)))
      (cond
        ((symbol? bound-var)
         (local ((define body (parse-Exp (stream) stream))
		 (define delimiter (stream))) ; consume right-paren
	   (if (eq? delimiter RIGHT-PAREN)
	       (make-proc bound-var body)
	       (error 'parse "unexpected input at end of proc: ~a" delimiter))))
        (else (error 'parse "bad variable in lambda: ~a" bound-var))))))

;; Symbol PI -> app | (error)
;; Purpose: (parse-app r s) returns the app tree corresponding to
;;               the longest matching prefix of the input stream left-paren,r,s;
;;          aborts by invoking (error ...) if no such prefix exists

(define parse-app        
  (lambda (rator-first stream) ; first symbol of rator has already been read
    (local ((define rator (parse-Exp rator-first stream))
	    (define rand-first (stream)) ; first token of rand
	    (define rand (parse-Exp rand-first stream))
	    (define delimiter (stream)))
      (if (eq? delimiter RIGHT-PAREN)
          (make-app rator rand)
          (error 'parse "application too long: ~a" delimiter)))))

;; definitions of the symbols for left-paren and right-paren
(define LEFT-PAREN '|(|)
(define RIGHT-PAREN '|)|)

;; some test code
;; requires lexer.ss library to run the tests
(define-struct test (input output)) 
(map (lambda (t) (make-test t (parse (make-string-lexer t))))
     (list "f" "1/2" "(f 1)" "(lambda x x)"
	   "((lambda x (x x)) (lambda x (x x)))"))

Be sure to understand why we call error rather than return false when we encounter illegal syntax.

Parsing in OCaml

The corresponding parser can be implemented in OCaml as follows:

exception ParserError of string;;
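
(* numberQ and symbolQ are assumed helper predicates on token strings,
   analogous to Scheme's number? and symbol?; they are not defined here *)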

let rec parse stream = 
    match stream with 
      [] -> raise (ParserError "Reached EOF unexpectedly")
    | token::stream ->
      let (result,stream) = parse_exp token stream in
      (match stream with 
        [] -> result
      | extra::_ -> raise (ParserError ("unexpected input: " ^ extra)))

and parse_exp token stream =
  match token with 
    x when  (numberQ x) -> (Num (int_of_string token), stream)
  | "("  -> 
   (match stream with 
    | [] -> raise (ParserError "No closing paren")
    | token::stream ->
	if token = "lambda" then
	  parse_proc stream
	else
	  parse_app token stream)
  | x when (symbolQ x) -> (Var x, stream)
  | _ -> raise (ParserError "Unrecognized token")

and parse_proc stream =
  match stream with
    bound_var :: token :: stream ->
      if symbolQ bound_var then
        let (body, rest) = parse_exp token stream in
        (match rest with
           ")" :: stream -> (Proc (bound_var, body), stream)
         | _ -> raise (ParserError "Unexpected input at end of proc"))
      else
        raise (ParserError "Bad variable in lambda")
  | _ -> raise (ParserError "Unexpected end of input in lambda")

and parse_app rator_first stream =
  let (rator, rest) = parse_exp rator_first stream in
  match rest with
    rand_first :: rest ->
      let (rand, rest) = parse_exp rand_first rest in
      (match rest with
         ")" :: stream -> (App (rator, rand), stream)
       | _ -> raise (ParserError "Application too long"))
  | [] -> raise (ParserError "Unexpected end of input in application")


(* Example:

# parse ["(";"f";"1";")"];;
- : exp = App (Var "f", Num 1)

*)

Significant Exercise: How can we construct a parser that detects and signals more than one error?

Summary

We can now summarize our view of syntax:

If you design or build programming languages, you will have to understand much more about constructing and parsing surface syntax, and on the psychological impact of your choices. Our approach has some failings:

These problems are considered in detail in courses such as COMP 412. For our purposes, this treatment is sufficient.

