Programming Assignment 1:
Parsing and Abstract Syntax

 

 

 

Points: 100

Files

Overview

Preparation   If you are using a Windows, Mac OS X, or Linux machine, we suggest that you download and install the latest version of the Amazon Corretto 8 distribution of the Java 8 OpenJDK and the latest version of DrJava in ".jar" form from http://drjava.org. If you already have a Java 8 OpenJDK installed in Linux or Mac OS X, you can use that disrtribution of the OpenJDK as well. DrJava is not as comprehensive as professional IDEs so Alternatively you can use a professional IDE like IntelliJ or Eclipse if you prefer. But beware of the fact that the assignments you submit must run from the command line. DrJava assumes the same program file organization as java command line tools (like java and javac); professional IDEs typically do not. We do not recommend using later versions (9-13) of Java which completely re-organize the JDK; we will grade all assignments using Java 8. The DrJava IDE does not run on Java 9+. Moreover, Java 8 is the version of Java installed on CLEAR. Nothing in this course depends on Java 8. A Java 7 JDK will suffice, but the performance of Java 7 is not quite as good as Java 8. The DrJava interactions pane does not yet support the language extensions added in Java 8. But you can still use these extensions in any source files that you compile because the Java 8 compiler translates all Java 8 source code to standard Java byte code.

We will try to maintain the latest versions of the Java 8 OpenJDK and DrJava on CLEAR, but CLEAR is under the control of Rice IT, which limits our ability to configure what software is available. We recommend using CLEAR only to submit your programs. On CLEAR, we are providing a turnin411 script, which submits your program to our repository. Unfortunately, the IT submission facilities will not permit us to run a script testing that submitted programs pass the trivial unit tests in Assign1Test.java. As a result, we rely on our own "turnin411" script.

On CLEAR, you may need to use the path ~comp411/bin/turnin411 to access the turnin411 command.

Please create a comp411 directory in your home directory on both your development machine and on CLEAR. Each programming team should submit their solution under the userid of one team member. For this assignment, please create a programs/1 directory within the directory comp411. All of your files for this assignment should be stored in the programs/1 subdirectory.

DrJava is distributed as a Java .jar file at the web site http://drjava.org. You can run DrJava on a Linux or Mac system (with Java JDK 7/8 installed) by typing

java -jar drjava.jar

assuming that drjava.jar is located in the current directory. On Windows machines (with a properly installed Java JDK 7/8), you can run the drjava.jar file simply by double-clicking on the icon. If the Registry has the wrong file type associated the .jar files, then double-clicking .jar icons will not work. In this case, you can download an executable drjava.exe file for Windows from the DrJava web site. This file wraps the drjava.jar in a script that runs the .jar file.

Documentation for DrJava is available at drjava.org. If you have further questions about DrJava, ask a question on Piazza or send email to javaplt@rice.edu

All of your files for this assignment should be stored in the programs/1 subdirectory.

Teamwork   You are strongly encouraged to do this assignment and all other programming assignments using pair programming. When you submit your assignment, indicate the composition of your team (names, student ids, and email addresses) in your README file.

If you cannot find a partner and would like one, post a message to our Piazza page and we will try to find one for you. Teams of more than two members are not permitted.

Testing   You are expected to write unit tests for all of your non-trivial methods or functions and submit your test code as part of your program. DrJava provides integrated support for JUnit which makes this task easy.

Task   Your task is to write a parser in Generic Java (Java 7/8) that recognizes expressions in the pedagogic functionallanguage Jam, which will be defined subsequently. The course staff has provided a lexer in Java for this assignment. This lexer returns a larger set of symbols than what appears in the Jam definition. These extra symbols anticipate future extensions to Jam. At this point, your parser should simply reject any program containing one of these extra symbols as ill-formed, just like it would for any other syntax error.

Java Support Library   In our Java library, the parser's input token stream is provided by a Lexer object supporting the parameterless method readToken(). The readToken() method produces a stream of Token objects belonging to various subclasses of the Token interface. The Lexer class supports two constructors: a zero-ary constructor that converts standard input to a Lexer and a unary constructor Lexer(String fileName) that converts the specified file to a Lexer. The file lexer.java contains the definition of the Lexer class and the collection of classes representing tokens (the Java interface Token) and abstract syntax trees (the java interface AST).

The API That Your Program Must Support

All classes should be in the default (top-level) package.

Your parser should be expressed as a class Parser with constructors Parser(Reader stream) and Parser(String filename) which create parsers reading the specified stream or file. The Reader form of these constructors is very convenient when you are testing your parser from the DrJava read-eval-print loop or from unit test methods.

The Parser class should contain a public method AST parse() that returns the abstract syntax tree for the Jam expression in the input stream. Our grading program will print the output using the method System.out.println, which invokes the toString method on its argument. All of the requisite abstract syntax classes including toString methods are defined in the file lexer.java. This library files are available on the web in the directory ../../Assignments/1/java.

If your parser encounters an erroneous input (including illegal tokens), simply print an explanation of the syntax error and abort parsing the file by throwing a ParseException. Recovering from a syntax error to continue parsing is a complex problem with many special cases. It is not a realistic goal for this assignment.

Input Language Specification

The following paragraphs define the parser's input language and the subset that your parser should recognize as legal Jam programs.

The Meta Language The parser's input language is described using Extended Backus-Naur Form (EBNF). When reading the notation, keep the following in mind:

The Input Language   The parser's input language is Input where:

Input ::= Token*
Token ::= AlphaOther AlphaOtherNumeric* | Delimiter | Operator | Int

The lexer recognizes keywords, delimiters (parentheses, commas, semicolons), primitive operations, identifiers, and numbers.

Adjacent tokens must be separated by whitespace (a non-empty sequence of spaces, tabs, and newlines) unless one of the tokens is a delimiter or operator. In identifying operators, the lexer chooses the longest possible match. Hence, <= is interpreted as a single token rather than < followed by =.

Lower             ::= a | b | c | d | ... | z
Upper             ::= 'A' | 'B' | 'C' | 'D' | ... | 'Z'      
Other             ::= ? | _
Digit             ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
Alpha             ::= Upper | Lower
AlphaOther        ::= Alpha | Other
AlphaOtherNumeric ::= AlphaOther | Digit
Delimiter         ::= ( | ) | [ | ] | , | ;
Operator          ::= '+' | - | ~ | '*' | / | = | != | < | > | <= | >= | & | '|' | :=

Note: The terminal symbols 'A', 'B', ... are enclosed in single quotation marks because A, B, ..., would automatically be the names of non-terminal symbols in our notation system. For essentially the same reason, we enclose the symbols '+', '*', and '|' in single quotation marks; these symbols without quotation marks are metasymbols in the notation scheme described above.

Jam   The set of legitimate Jam phrases is the subset of the potential inputs consisting the expressions Exp defined by the following grammar:

Expressions

Exp         ::= Term { Binop Exp }
              | if Exp then Exp else Exp
              | let Def+ in Exp
              | map IdList to Exp
Term        ::= Unop Term
              | Factor { ( ExpList ) }
              | Null
              | Int
              | Bool
Factor      ::= ( Exp ) | Prim | Id
ExpList     ::= { PropExpList }
PropExpList ::= Exp { , Exp }*
IdList      ::= { PropIdList }
PropIdList  ::= Id { , Id}*

Definitions

Def ::= Id := Exp ;

Primitive Constants, Operators, and Operations

Null  ::= null
Bool  ::= true | false
Unop  ::= Sign | ~
Sign  ::= '+' | -
Binop ::= Sign | '*' | / | = | != | < | > | <= | >= | & | '|'
Prim  ::= number? | function? | list? | null? | cons? | cons | first | rest | arity

Identifiers

Id ::= AlphaOther {AlphaOther | Digit}*

Please note that Id does not contain the anything that matches Prim or the keywords if, then, else, map, to, let, in, null, true, and false.

Numbers

Int ::= Digit+

The preceding grammar requires one symbol lookahead in a few situations. The Java lexer in our support code includes a method Token peek() that behaves exactly like the method Token readToken() except for the fact that it leaves the the scanned token at the front of the input stream (in contrast readToken() removes the scanned token from the input stream).

The Java file lexer.java defines all of the abstract syntax classes required to represent Jam programs using the composite design pattern. There is one constructor for each fundamentally different form of expression in the definition of Jam syntax given above. Some primitive (non-recursive) abstract syntax classes are the same the corresponding token classes.

For example, suppose the Jam program under consideration is

f(x) + (x * 12))

The abstract syntax representation for this program phrase is:

new BinOpApp(
       new Op("+"),
       new App(new Variable("f"), new AST[] { new Variable("x") }),
       new BinOpApp(new Op("*"), new Variable("x"), new IntConstant(12))
     )

This construction does not tell the whole story. The lexer is guaranteed to always return the same object for a given operator. Similarly, there is only copy of each variable encountered by the lexer. Hence, there is only one copy of the Operator "+", one copy of the Operator "*", one copy of the Variable "x", etc. As a result, the == operator in Java can be used to compare operators and variables. Note that the Operators "+" and "-" can be used either as unary or binary operators.

The file OriginalSyntaxDiagrams contains a collection of syntax diagrams corresponding to the grammar given above where simple right recursive productions (PropExpList and PropIdList) are converted to iterative form, but the complex right recursive productions for Exp are left in right-recursive form. In addition, some productions of the grammar have been "inlined" in the diagrams to make them simpler (e.g., PropExpList and PropIdList).

Left-associative Binary Operators

The preceding grammar for Jam has an annoying flaw. To make the language easy to parse using recursive descent (top-down) parsing, we used right recursion in the definition of arithmetic expressions (Exp). (Left recursion forces infinite looping.) But right recursion yields right-associative ASTs (where computation must be performed right to left), while the usual mathematical convention in arithmetic expressions is that binary operators are left-associative (where computation must be performed left to right). There is a widely-used trick in writing grammars for top-down parsing using syntax diagrams that avoids this problem. The trick is replace right recursion in a simple production (only one alternative on the right hand side) by iteration: an edge looping back to the input edge of the diagram. We used this trick in generating the syntax diagrams for ExpList and IdList for our original grammar. Given our original Jam grammar, we can convert the production

Exp         ::= Term { Binop Exp } | ...
to
Exp         ::= BinaryExp | ...
BinaryExp   ::= Term { Binop BinaryExp }
at the cost of excluding chains of binary operations terminating in an expression other than a term. Then this revised grammar can be represented using syntax diagrams that define BinaryExp iteratively (just like PropExpList in the original grammar).

The preceding definition of BinaryExp generates the same set of strings as the extended rule:

BinaryExp ::= Term { BinOp Term }*
which is ambiguous about left or right associativity. When iteration is used to recognize a BinaryExp, the parsing process will construct a left-associative tree since the tree is constructed left-to-right (each new Binop adds a new root to the AST).

The file RevisedSyntaxDiagrams contains iterative syntax diagrams corresponding to the left associative version of the language, which precisely describes the syntax of Jam and how we want the parser to behave. Of course, this language eliminates a few quirky inputs from the language based on our original grammmar. Your parser should treat these quirky inputs as erroneous. The web page RevisedSyntaxDiagrams gives an equivalent set of syntax diagrams with more non-terminals (syntactic categories) which you may find more convenient for structuring your parser. Programs written in a functional style tend to be easier to read and to understand when each function definition is short.

It is possible to write syntax diagrams that treat binary expressions iteratively and still allow arbitrary expressions at the end of the binary expression. But the corresponding grammar and syntax diagrams are much more complex. So we have not followed this design choice in creating the revised syntax for Jam. In fact, we view our Jam grammar as a case study showing that context-free grammars are not quite the right formalism for expressing easily parsed program syntax. Syntax diagrams where iteration is always left-associative (equivalent to extended context-free-grammars where iteration always expands into left-associative parse trees) are much better. I am not aware of a rigorous exposition of such extended BNF grammars in the literature; I suspect that most theoretical computer scientists would not find such an exposition very interesting even though the standard notion of parsing language specified by a context-free grammar is an imperfect formalization of practical parsing.

Write your parser to produce left-associative ASTs for arithmetic expressions. If your parser implements either of the "revised" syntax diagram definitions given above (which are equivalent) using while loops for iteration and constructs the corresponding abstract syntax trees in a functional style (no mutation of trees in theconstruction process), your parser will naturally build syntax trees that are left-associative.

Testing and Submitting Your Program

The directory tests/ contains a sample input programs.

Create a README file in the your directory program/1 that

We have provided a sample README for your reference. Make sure that your README file is structured correctly:

  1. It has to be named README (all upper case, no .txt or anything)
  2. It has to be in the assignment's "root directory", e.g. ~/comp411/programs/1
  3. The first two lines should only contain your usernames and nothing else. The third line should be blank (if you're working alone, leave lines 2 and 3 blank), e.g.:
    mgricken
    dmp

    This is where the readme text starts.

Your test data files should be stored in the programs/1 directory. All JUnit test classes submitted for grading should be compatible with a generic solution that conforms to the public interface defined in the assignment. In other words, we should be able to compile and run your unit tests against a reference solution used for grading, which has the same public interface. Your test classes should be defined in files matching the pattern "*Assign#Test.java", where "#" is the the current assignment number. For example, "MyAssign2Test.java" would be a valid name for Assignment 2 unit tests. If you want to include other tests that you do not want graded (e.g., to test your private interfaces), then simply use a different suffix for the class names (e.g., "Assign1PrivateTest.java"). If you have additional utility classes that are required dependences of your unit test classes, please name those classes using the pattern "*Assign#TestSupport.java", which signals that the class is required to support your unit tests, but is not a unit test class that should be independently launched. You may also include test support files with names matching the pattern "*AssignAllTestSupport.java", which can contain common test support code that can be reused across multiple assignment subissions without needing to rename the file.

Each procedure or method in your program should include a short comment stating precisely what it does. For routines that parse a particular form of program phrase, give grammar rules describing that form.

To submit your program, make sure that everything that you want to submit is located in the directory programs/1 and type the command

~comp411/bin/turnin411 1

from within the comp411/programs directory.

The command will inform you whether or not your submission succeeded. Only submit one copy of your program per team. If you need to resubmit an improved version your program, submit it from the same account as the original so that the old version is replaced.

Implementation Hints

The toString() methods for each AST class effectively provide an unparser for ASts. you can directly compare test input strings and the output AST by unparsing the AST (using toString()). The two strings should match up to differences in whitespace, new lines, and parentheses using for grouping.

Back to course website