Programming Assignment 1:
Parsing and Abstract Syntax

 

 

 

Assigned: Wednesday, August 26, 2009
Due: 12:00 noon, Wednesday, September 9, 2009

Files

Supporting Code for

Some Simple Test Inputs

Overview

Preparation   We suggest that you run the Owlnet register command for comp311, which will make the latest versions of drscheme and drjava available as commands in your Owlnet environment.

For more information about DrScheme, read the documentation in www.drscheme.org/tour/. and the DrScheme Help menu. You can download DrScheme Version 208 from drscheme.org. DrScheme supports a variety of different Scheme dialects. For this class, please use the "Pretty Big" dialect of Scheme under Professional Languages. You can start DrScheme on Owlnet by typing drscheme after you have registered.

DrJava is distributed at the web site drjava.org. You can run DrJava on Owlnet by typing drjava.

Instructions for using O'Caml will follow.

Every student should create a comp311 directory in his or her home directory. Each programming team should submit their solution under the userid one team member. We recommend creating a a programs/1 directory within the directory comp311. All of your files for this assignment should be stored in the programs/1 subdirectory.

Teamwork   You are strongly encouraged to do this assignment and all other programming assignments using pair programming. When you submit your assignment, indicate the composition of your team (names, student ids, and email addresses) in your README file.

If you cannot find a partner and would like one, send an email to comp311@rice.edu and we will try to find one for you. Teams of more than two members are not permitted.

Testing   You are expected to write unit tests for all of your non-trivial methods or functions, depending on whether you write your assignments in Java or Scheme. DrJava provides integrated support for JUnit which makes this task easy. DrScheme Version 208 supports a similar unit testing system for Scheme called (surprise!) SchemeUnit. It is available for download from http://sourceforge.net/projects/schematics/ under the name schemeunit in the list of file releases. The downloadable file is a "plt" file that you open using DrScheme. When you open it, DrScheme adds the library to the collection of libraries maintained by DrScheme. Documentation for SchemeUnit is available at http://schematics.sourceforge.net/schemeunit/schemeunit.html. Chapter 2 entitled "Quick Start" (a single page!) provides enough information to write unit tests for this assignment.

Note that unit testing conflicts in some cases with the name hiding recommendations given for writing Scheme code in Comp 210 because lexically nested function definitions cannot be tested. In your Scheme programs all non-trivial functions must be accessible at the top-level.

If you are using O'Caml, you can get the O'Caml unit testing framework oUnit at http://www.xs4all.nl/~mmzeeman/ocaml/. Since oUnit is available as source only, you will need to follow the included README.

Note if you get the error "Reference to undefined global 'Unix'" when you try to use O'Caml, you will need to recompile oUnit after adding the following line at the top of oUnit.ml:

#load "unix.cma"

Task   Your task is to write a parser in Generic Java, Scheme, or O'Caml that recognizes expressions in the illustrative language Jam, which will be defined subsequently. The course staff has provided lexers in both Scheme and Java for this assignment. These lexers return a larger set of symbols than what appears in the Jam definition. These extra symbols anticipate future extensions to Jam. At this point, your parser should simply reject any program containing one of these extra symbols as ill-formed, just like it would for any other syntax error.

Scheme Version   In Scheme, the parser's input language is a stream of tokens implemented as a parameterless "read" procedure p that encodes a token stream. Given the procedure p, the application (p) returns the next token in the encoded stream, which is either a Scheme symbol, a Scheme integer, or a special end-of-file object that is recognized by the predicate eof-object?. The Scheme library file lexer.ss implements a scanner (lexical analyzer) that converts either a file or a string to a corresponding token stream. Given a string s specifying a file name, the application (make-file-lexer s) returns the corresponding token stream.

The parser's output language is Jam abstract syntax, which is defined in the library file abs.ss. If the parser determines that some given input does not represent a legal Jam program, it should print an appropriate error message and terminate the parsing process using the error primitive.

Note that lexer.ss and abs.ss are Teachpack (library) files and should be loaded into DrScheme using the AddTeachpack command on the Language menu. You can load both libraries into DrScheme as a single TeachPack library by placing both library files in your program directory and designating the file lexer.ss as the TeachPack to be loaded. The module lexer.ss requires the module abs.ss, which forces it to be loaded. These library files are available on the web in the directory www.cs.rice.edu/~javaplt/311/Assignments/1/scheme.

Your parser should be expressed as a function parse that takes a string specifying a file name as its only argument and returns an abstract syntax tree for the Jam expression in the specified file. Your Scheme program should be stored in the file parser.ss.

Java Version   In Java, the parser's input token stream is provided by a Lexer object supporting the parameterless method readToken(). The readToken() method produces a stream of Token objects belonging to various subclasses of the Token interface. The Lexer class supports two constructors: a zero-ary constructor that converts standard input to a Lexer and a unary constructor Lexer(String fileName) that converts the specified file to a Lexer. The file lexer.java contains the definition of the Lexer class and the collection of classes representing tokens (the Java interface Token) and abstract syntax trees (the java interface AST).

All classes should be in the default (top-level) package.

Your parser should be expressed as a class Parser with constructors Parser(Reader stream) and Parser(String filename) which create parsers reading the specified stream or file. The Reader form of these constructors is very convenient when you are testing your parser from the DrJava read-eval-print loop or from unit test methods.

The Parser class should contain a public method AST parse() that returns the abstract syntax tree for the Jam expression in the input stream. Our grading program will print the output using the method System.out.println, which invokes the toString method on its argument. All of the requisite abstract syntax classes including toString methods are defined in the file lexer.java. This library files are available on the web in the directory http://www.cs.rice.edu/~javaplt/311/Assignments/1/java.

If your parser encounters an erroneous input, simply print an explanation of the syntax error and abort parsing the file by throwing a ParseException. Recovering from a syntax error to continue parsing is a complex problem with many special cases. It is not a realistic goal for this assignment.

O'Caml Version   The O'Caml version is similiar to the Scheme version. The parser's input language is a stream of tokens implemented by a series of "pop" or "peek" operations. To peek the next token you would need to call lexer("peek"), and to pop the next token lexer "pop". You can check for the end-of-file with the test "if (token == EOF) then ..."

make_file_lexer : string -> string -> token = <fun> make_string_lexer : string -> string -> token = <fun>

In ocaml, you must support the following interfaces:

parse_file : string -> exp = <fun>
parse_string : string -> exp = <fun>

The support code can be found at http://www.cs.rice.edu/~javaplt/311/Assignments/1/ocaml.

Choice of Language   Generic Java, Scheme, and O'Caml are the only languages permitted for this assignment.

Input Language Specification

The following paragraphs define the parser's input language and the subset that your parser is supposed to recognize a legal Jam program.

The Meta Language The parser's input language is described using Backus-Naur Form (BNF). When reading the notation, keep the following in mind:

The Input Language   The parser's input language is Input where:

Input ::= Token*
Token ::= AlphaOther AlphaOtherNumeric* | Delimiter | Operator | Int

The lexer recognizes keywords, delimiters (parentheses, commas, semicolons), primitive operations, identifiers, and numbers.

Adjacent tokens must be separated by whitespace (a non-empty sequence of spaces, tabs, and newlines) unless one of the tokens is a delimiter or operator. In identifying operators, the lexer chooses the longest possible match. Hence, <= is interpreted as a single token rather than < followed by =.

Lower             ::= a | b | c | d | ... | z
Upper             ::= "A" | "B" | "C" | "D" | ... | "Z"
Other             ::= ? | _
Digit             ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
Alpha             ::= Upper | Lower
AlphaOther        ::= Alpha | Other
AlphaOtherNumeric ::= AlphaOther | Digit
Delimiter         ::= ( | ) | [ | ] | , | ;
Operator          ::= "+" | - | ~ | "*" | / | = | != | < | > | <= | >= | & | "|" | :=

Jam   The set of legitimate Jam phrases is the subset of the potential inputs consisting the expressions Exp defined by the following grammar:

Expressions

Exp         ::= Term { Binop Exp }
              | if Exp then Exp else Exp
              | let Def+ in Exp
              | map IdList to Exp
Term        ::= Unop Term
              | Factor { ( ExpList ) }
              | Null
              | Int
              | Bool
Factor      ::= ( Exp ) | Prim | Id
ExpList     ::= { PropExpList }
PropExpList ::= Exp | Exp , PropExpList
IdList      ::= { PropIdList }
PropIdList  ::= Id | Id , PropIdList

Definitions

Def ::= Id := Exp ;

Primitive Constants, Operators, and Operations

Null  ::= null
Bool  ::= true | false
Unop  ::= Sign | ~
Sign  ::= "+" | -
Binop ::= Sign | "*" | / | = | != | < | > | <= | >= | & | "|"
Prim  ::= number? | function? | list? | null? | cons? | cons | first | rest | arity

Identifiers

Id ::= AlphaOther {AlphaOther | Digit}*

Please note that Id does not contain the anything that matches Prim or the keywords if, then, else, map, to, let, in, null, true, and false.

Numbers

Int ::= Digit+

The preceding grammar requires one symbol lookahead in a few situations. The Scheme lexer in the file lexer.ss supports a peek operation precisely for this purpose. Given a token stream p, an application of the form (p x) where x is any argument (e.g., the symbol 'peek) returns a copy of the next token in p without removing it from (or otherwise changing) p. In the Java lexer, the method Token peek() behaves exactly like the method Token readToken() except for the fact that it leaves the the scanned token at the front of the input stream (in contrast readToken() removes the scanned token from the input stream). The O'Caml code will provide a similar facility.

Abstract Syntax Trees   As mentioned above, the Scheme library lexer.ss defines the set of constructors used to build abstract syntax trees in Scheme. There is one constructor for each different form of expression in the definition of Jam syntax given above. In addition, there is one abstract syntax form for Jam definitions, which can easily express both concrete forms given above. For example, suppose the Jam program under consideration is

f(x) + (x * 12))

The abstract syntax representation for this program phrase is:

(make-biop-exp '+
  (make-app-exp (make-id-exp 'f) (list (make-id-exp 'x)))
  (make-biop-exp '* (make-id-exp 'x) (make-num-exp 12)))
  

in the context of the following data descriptor definitions:

(define-struct biop-exp (rator rand1 rand2))
(define-struct app-exp (rator l-of-rands))
(define-struct id-exp (arg))
(define-struct num-exp (arg))
  

The Java file lexer.java defines all of the abstract syntax classes required to represent Jam programs. There is one constructor for each different form of expression in the definition of Jam syntax given above. Some primitive (non-recursive) abstract syntax classes are the same the corresponding token classes. There is one abstract syntax form for Jam definitions, which can easily express both concrete forms given above.

Testing and Submitting Your Program

The directory www.cs.rice.edu/~javaplt/311/Assignments/1/tests contains a sample input programs. Create a README file in the your directory program/1 that

Make sure that your README file is structured correctly:

  1. It has to be named README (all upper case, no .txt or anything)
  2. It has to be in the assignment's "root directory", e.g. ~/comp311/programs/1
  3. The first two lines should only contain your usernames and nothing else. The third line should be blank (if you're working alone, leave lines 2 and 3 blank), e.g.:
    mgricken
    dmp

    This is where the readme text starts.

Your test data files should be stored in the programs/1 directory.

Each procedure or method in your program should include a short comment stating precisely what it does. For routines that parse a particular form of program phrase, give grammar rules describing that form.

To submit your program, make sure that everything that you want to submit is located in the directory programs/1 and type the command

turnin311 -t[scheme|java|ocaml] 1

from within the comp311/programs directory. For example, if you have an O'Caml version you would execute

turnin311 -tocaml 1

The command will inform you whether or not your submission succeeded. Only submit one copy of your program per team. If you need to resubmit an improved version your program, submit it from the same account as the original so that the old version is replaced.

Note if turnin311 is not on your path, you will need to add /home/comp311/bin to your path:

export PATH=$PATH:/home/comp311/bin

Implementation Hints

Use an "unparser" to print a concrete representation for an abstract syntax tree. Then you can directly compare test input strings and output strings (up to differences in whitespace and parentheses using for grouping).

Back to course website