Programming Assignment 1:
Parsing and Abstract Syntax

 

 

 

Assigned: Wednesday, August 25, 2010
Due: 12:00 noon, Wednesday, September 8, 2010
Points: 100

Files

Overview

Preparation   We suggest that you immediately download and install the latest version of the Java 6 JDK and the latest version of DrJava on your preferred computing platform. On Windows and Linux, the best Java 6 JDK is provided by Sun Microsystems. On Macs, the Java JDK is part of the supported system. (If you have an older Mac [pre Mac OS X 10.5], you can use the Java 5 SDK since Java 6 is only supported on machines with 64-bit Intel processors running Mac OS X 10.5. You may occasionally trip over the fact that some new Java 6 library class or method is not available in Java 5, but we will figure out a workaround if that situation arises.)

The latest version of the Sun Java 6 SE JDK systems (for platforms other than Macs) can be obtained at http://java.sun.com/javase/downloads/index.jsp. We recommend downloading and installing only the JDK, not the JDK with NetBeans (which greatly increases the size of the download). The latest version of DrJava can be obtained from http://drjava.org.

We will try to maintain the latest versions of the Java 6 JDK and DrJava on CLEAR, but CLEAR is under the control of Rice IT, which limits our ability to configure what software is available. We recommend using CLEAR only to submit your programs. On CLEAR, we are providing a turnin311 script, which submits your program to our repository. Unfortunately, the IT submission facilities will not permit us to run a script testing that submitted programs pass the trivial unit tests in Assign1Test.java (including in our supporting library files.

On CLEAR, you may need to use the path ~comp311/bin/turnin311 to access the turnin command.

Please create a comp311 directory in your home directory on both your development machine and on CLEAR. Each programming team should submit their solution under the userid one team member. For this assignment, please create a programs/1 directory within the directory comp311. All of your files for this assignment should be stored in the programs/1 subdirectory.

DrJava is distributed as a Java .jar file at the web site drjava.org. You can run DrJava on a Unix system (with Java JDK 5.0 or 6.0 installed) by typing

java -jar drjava.jar

assuming that drjava.jar is located in the current directory. On Windows machines (with a properly Java JDK 5.0/6.0 installed), you can run the drjava.jar file simply by double-clicking on the icon. If the Registry has the wrong file type associated the .jar files, then double-clicking .jar icons will not work. In this case, you can download an executable drjava.exe file for Windows from the DrJava web site. This file wraps the drjava.jar in a script that runs the .jar file.

Documentation for DrJava is available at drjava.org. If you have further questions about DrJava, send email to javaplt@rice.edu

All of your files for this assignment should be stored in the programs/1 subdirectory.

Teamwork   You are strongly encouraged to do this assignment and all other programming assignments using pair programming. When you submit your assignment, indicate the composition of your team (names, student ids, and email addresses) in your README file.

If you cannot find a partner and would like one, send an email to comp311@rice.edu and we will try to find one for you. Teams of more than two members are not permitted.

Testing   You are expected to write unit tests for all of your non-trivial methods or functions. DrJava provides integrated support for JUnit which makes this task easy.

Task   Your task is to write a parser in Generic Java (Java 6) that recognizes expressions in the pedagogic functionallanguage Jam, which will be defined subsequently. The course staff has provided a lexer in Java for this assignment. This lexer returns a larger set of symbols than what appears in the Jam definition. These extra symbols anticipate future extensions to Jam. At this point, your parser should simply reject any program containing one of these extra symbols as ill-formed, just like it would for any other syntax error.

Java Support Library   In our Java library, the parser's input token stream is provided by a Lexer object supporting the parameterless method readToken(). The readToken() method produces a stream of Token objects belonging to various subclasses of the Token interface. The Lexer class supports two constructors: a zero-ary constructor that converts standard input to a Lexer and a unary constructor Lexer(String fileName) that converts the specified file to a Lexer. The file lexer.java contains the definition of the Lexer class and the collection of classes representing tokens (the Java interface Token) and abstract syntax trees (the java interface AST).

All classes should be in the default (top-level) package.

Your parser should be expressed as a class Parser with constructors Parser(Reader stream) and Parser(String filename) which create parsers reading the specified stream or file. The Reader form of these constructors is very convenient when you are testing your parser from the DrJava read-eval-print loop or from unit test methods.

The Parser class should contain a public method AST parse() that returns the abstract syntax tree for the Jam expression in the input stream. Our grading program will print the output using the method System.out.println, which invokes the toString method on its argument. All of the requisite abstract syntax classes including toString methods are defined in the file lexer.java. This library files are available on the web in the directory http://www.cs.rice.edu/~javaplt/311/Assignments/1/java.

If your parser encounters an erroneous input, simply print an explanation of the syntax error and abort parsing the file by throwing a ParseException. Recovering from a syntax error to continue parsing is a complex problem with many special cases. It is not a realistic goal for this assignment.

Input Language Specification

The following paragraphs define the parser's input language and the subset that your parser should recognize as legal Jam programs.

The Meta Language The parser's input language is described using Backus-Naur Form (BNF). When reading the notation, keep the following in mind:

The Input Language   The parser's input language is Input where:

Input ::= Token*
Token ::= AlphaOther AlphaOtherNumeric* | Delimiter | Operator | Int

The lexer recognizes keywords, delimiters (parentheses, commas, semicolons), primitive operations, identifiers, and numbers.

Adjacent tokens must be separated by whitespace (a non-empty sequence of spaces, tabs, and newlines) unless one of the tokens is a delimiter or operator. In identifying operators, the lexer chooses the longest possible match. Hence, <= is interpreted as a single token rather than < followed by =.

Lower             ::= a | b | c | d | ... | z
Upper             ::= "A" | "B" | "C" | "D" | ... | "Z"
Other             ::= ? | _
Digit             ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
Alpha             ::= Upper | Lower
AlphaOther        ::= Alpha | Other
AlphaOtherNumeric ::= AlphaOther | Digit
Delimiter         ::= ( | ) | [ | ] | , | ;
Operator          ::= "+" | - | ~ | "*" | / | = | != | < | > | <= | >= | & | "|" | :=

Jam   The set of legitimate Jam phrases is the subset of the potential inputs consisting the expressions Exp defined by the following grammar:

Expressions

Exp         ::= Term { Binop Exp }
              | if Exp then Exp else Exp
              | let Def+ in Exp
              | map IdList to Exp
Term        ::= Unop Term
              | Factor { ( ExpList ) }
              | Null
              | Int
              | Bool
Factor      ::= ( Exp ) | Prim | Id
ExpList     ::= { PropExpList }
PropExpList ::= Exp | Exp , PropExpList
IdList      ::= { PropIdList }
PropIdList  ::= Id | Id , PropIdList

Definitions

Def ::= Id := Exp ;

Primitive Constants, Operators, and Operations

Null  ::= null
Bool  ::= true | false
Unop  ::= Sign | ~
Sign  ::= "+" | -
Binop ::= Sign | "*" | / | = | != | < | > | <= | >= | & | "|"
Prim  ::= number? | function? | list? | null? | cons? | cons | first | rest | arity

Identifiers

Id ::= AlphaOther {AlphaOther | Digit}*

Please note that Id does not contain the anything that matches Prim or the keywords if, then, else, map, to, let, in, null, true, and false.

Numbers

Int ::= Digit+

The preceding grammar requires one symbol lookahead in a few situations. The Java lexer in our support code includes a method Token peek() that behaves exactly like the method Token readToken() except for the fact that it leaves the the scanned token at the front of the input stream (in contrast readToken() removes the scanned token from the input stream).

The Java file lexer.java defines all of the abstract syntax classes required to represent Jam programs using the composite design pattern. There is one constructor for each fundamentally different form of expression in the definition of Jam syntax given above. Some primitive (non-recursive) abstract syntax classes are the same the corresponding token classes.

For example, suppose the Jam program under consideration is

f(x) + (x * 12))

The abstract syntax representation for this program phrase is:

new BinOpApp(
       new Op("+"),
       new App(new Variable("f"), new AST[] { new Variable("x") }),
       new BinOpApp(new Op("*"), new Variable("x"), new IntConstant(12))
     )

This constrution does not tell the whole story. The lexer is guaranteed to always return the same object for a given operator. Similarly, there is only copy of each variable encountered by the lexer. Hence, there is only one copy of the Operator "+", one copy of the Operator "*", one copy of the Variable "x", etc. Note that the Operators "+" and "-" can be used either as unary or binary operators.

Testing and Submitting Your Program

The directory www.cs.rice.edu/~javaplt/311/Assignments/1/tests contains a sample input programs.

Create a README file in the your directory program/1 that

Make sure that your README file is structured correctly:

  1. It has to be named README (all upper case, no .txt or anything)
  2. It has to be in the assignment's "root directory", e.g. ~/comp311/programs/1
  3. The first two lines should only contain your usernames and nothing else. The third line should be blank (if you're working alone, leave lines 2 and 3 blank), e.g.:
    mgricken
    dmp

    This is where the readme text starts.

Your test data files should be stored in the programs/1 directory.

Each procedure or method in your program should include a short comment stating precisely what it does. For routines that parse a particular form of program phrase, give grammar rules describing that form.

To submit your program, make sure that everything that you want to submit is located in the directory programs/1 and type the command

turnin311 1

from within the comp311/programs directory.

The command will inform you whether or not your submission succeeded. Only submit one copy of your program per team. If you need to resubmit an improved version your program, submit it from the same account as the original so that the old version is replaced.

Note if turnin311 is not on your path, you will need to add /home/comp311/bin to your path:

export PATH=$PATH:/home/comp311/bin

Implementation Hints

Use an "unparser" to print a concrete representation for an abstract syntax tree. Then you can directly compare test input strings and output strings (up to differences in whitespace and parentheses using for grouping).

Extra Credit (10 points)

The preceding grammar for Jam has an annoying flaw. To make the language easy to parse using recursive descent (top-down) parsing, we used right recursion in the definition of arithmetic expressions. (Left recursion forces infinite looping.) But right recursion yields right-associative ASTs (where computation must be performed right to left), while the usual mathematical convention for arithmetic expression is for binary operators to be left-associative (where computation must be performed left to right). There is a widely used trick in writing grammars for top-down parsing that avoids this problem. The trick is replace right recursion by iteration. In our Jam grammar, the right recursive rule (production):

Exp         ::= Term { Binop Exp }
can be written using iteration instead of recursion as follows:
Exp         ::= Term { Binop Term }*
Then the code for recogizing program text conforming to this rule can use a help function with an accumulator or an explicit loop to build a left associative AST for such an Exp.

For extra credit, revise your parser to produce left-associative ASTs for arithmetic expressions.

Back to course website