# COMP 202: Principles of Object-Oriented Programming II

Design Patterns for Parsing (Part 2)

# Extended Visitors and Recursive Descent Parsing

You may download the supporting code for today here. It also includes a `ToModelAlgo` visitor that transforms the parse tree into an AST with addition, multiplication, integer and variable leaves.

Here is a simpler example using the extended visitor framework: ASTs with addition, multiplication, integer and variable leaves.

## Simple Infix Arithmetic Expressions

Consider a program to read in from a text file a simple arithmetic expression in in-fix format, such as "abc + 45 + x + y", and build the corresponding parse tree.  We shall restrict ourselves to arithmetic expressions containing integers, strings (representing variables) and addition.  The syntax of such arithmetic expression can be expressed as a grammar with the following 4 rules.

The format below is called "Baukus-Naur Form" or "BNF" for short.  It is an example of a "context-free grammar" that is "read Left-to-right with a Left-most derivation on a 1 token look-ahead" or, more simply, "LL(1)".  That means that you a) read the line from left to right, b) process the information from left to right and c) need only look at one token at any given point to know where you are in the grammar.

E     ::=  F E1
E1    ::=  E1a | Empty
E1a   ::=  + E
F     ::=  num | id

The above grammar can be separated into two distinct sets of symbols:

• Terminal symbols:  num, id, empty, +
• Non-terminal symbols: E, F, E1, E1a

Terminal symbols are atomic entities that are never defined in terms of other symbols, i.e. they only appear on the right hand side of the rules.   Non-terminal symbols, however are defined in terms of other symbols and hence always appear once on the left-hand side of the rules.  E is a special non-terminal symbol called the "Start Symbol" and represents the form of the grammar when one first encounters something in that grammar.  Likewise, Empty is a special symbol representing the lack of symbols at the end of a sequence of symbols.

It is important to note that there are really only two types of rules above:

• Sequence: A non-terminal symbol defined as one or more symbols in an ordered set.  For instance, E is a sequence that describes a sequence of two symbols, F followed by E1.   Likewise, E1a is a sequence of + followed by E.   Note the net recursive/circular definition of E here.
• Combination:  A non-terminal symbol defined as one of two or more possible symbols.  A vertical line (the symbol of the logical OR operation) is used to separate the possibilities.  E1 and F are both combinations above.

Here is how the above grammar generates the string "abc + 45".

E -> F E1              // E is the start symbol and has only one rule: E ::= F E1

-> id E1             // Apply the rule F ::= id because "abc" is an id.

-> id E1a            // Apply the rule E1 ::= E1a because there are more tokens

-> id + E            // Apply the rule E1a ::= + E

-> id + F E1         // Apply the rule E ::= F E1

-> id + num E1        // Apply the rule F ::= num because "45" is a number

-> id + num Empty    // Apply the rule E1 ::= Empty because there are no more tokens

-> abc + 45          // the lexeme of id is "abc" and the lexeme of num is "45"

The above derivation of the input string "abc + 45" from the grammar rules can be represented by a tree structure called the parse tree as follows.

E
|
------------------------------------
|                                    |
F                                   E1
|                                    |
id                                  E1a
"abc"                                 |
---------------
|              |
E
"+"             |
----------
|         |
F        E1
|         |
num      Empty
"45"

First, let us consider the modeling of the tokens.   We want "intelligent" tokens that understand what they are, so they can be used in algorithms that depend on what type of tokens are being used.  In particular, when we are parsing a stream of incoming tokens, we simply want to run some algorithm that determines whether or not the token was the appropriate one for that moment as well as then proceed to process the subsequent tokens correctly, based on that one token (remember that we are only considering LL(1) grammars, so one token is all it takes).

Tokens are thus hosts for extended visitors, calling the case on the visitor corresponding to the name of the token.  A factory is used to hide the details of constructing tokens from the rest of the system.

Now consider the grammar symbols.   There are really only three distinct grammar symbols: empty, terminal and sequence.  A symbol for the combination rule is unnecessary because once a token stream has been parsed in terms of a grammar, the choices of a combination no longer exist and that point in the grammar can be represented by a single symbol, not a combination.

Once again, these grammar symbols are hosts for extended visitors so that algorithms can be run on them.

Parsing thus involves running an algorithm on the stream of tokens from the tokenizer (actually, only the first one because the algorithm itself will pull the rest of the tokens from the tokenizer).  This algorithm will produce a grammar symbol (the "start symbol") which is at the root of a tree of grammar symbols (the parse tree).

In recursive descent parsing (RDP), we use a "tokenizer" to scan and "tokenize" the input from left to right, and build the parse tree  from the top down, based on the value of the tokens.  Building a parse tree from the top down means building it by first looking at the grammar rule(s) for the start symbol E and try to create the object  that represents E.  The task of building E is called "parsing E" and is to be carried out by an appropriate token visitor.

We can see from the above diagram that in order to parse E, we need to parse the sequence of parse F and E1.   But that means that we need to know how to parse F and E1:

• To parse F, we need to know how to parse num and id.  To create the visitor to parse F, the visitors to parse num and id need to be combined together so that the current token can pick out the correct visitor when parsing F:  when parsing F, the id token will execute the visitor that parses the id symbol, while the num token will execute the visitor that parses the num symbol.
• To parse E1, we need to know how to parse E1a and Empty.   Just like for the F symbol, to create the visitor to parse E1, the visitors to parse E1a and Empty need to be combined together so that the current token can pick out the correct visitor when parsing E1.
• To parse E1a, we need to to parse the sequence of the "+" symbol then E, which completes the recursive call.
• To parse Empty, we put back the token just consumed.

Writing the concrete visitors to parse each of the non-terminal symbols will require, in general, performing a global analysis looking deeper than necessary  into the grammar structure.  One must analyze the details of the grammar to the level of knowing all possible tokens at that stage.  This analysis may require one to look beyond the immediate relationships a class may have with any other classes.  For instance, to find the possible tokens that could be the start of an E term, one must look down to the F term.  However, the code to process an E term should only be concerned with the fact that an E term is composed of a F term and an E1 term, and should not be coupled to the internal nature of the F, E1 or any other terms.  Likewise, when parsing a branch, such as F, one must create the union of all the visitors that parse the branch’s subclasses.

To remedy this problem, one must re-think the instantiation process of the visitors.  In particular, in order to decouple the construction of the visitor for one term from the details of other terms, one must abstract and encapsulate the construction process.  This is done by the abstract factory design pattern.  Using factories to instantiate the parsing visitors

·          enables each term to be decoupled from any other term by hiding the instantiation details.

·          enables the construction of the union of visitors by chaining, which is used to implement branches.

·          enables the lazy construction of already installed visitors which is needed to create circular relationships.

Each non-terminal symbol (and its corresponding class) is associated with a factory that constructs its parsing visitor.

All factories adhere to a basic factory interface which provides the methods to instantiate the parsing visitors.  For convenience, all the factories are derived from an abstract factory, ATokVisitorFact, which provides access to the tokenizer.

The major components of the parsing system are:

• ITokVisFact :  The top-level interface that defines the factory that creates parsing visitors, which are visitors on Tokens.  More specifically, the visitors are ATokisitorVisitor<IGrammarSymbol, Object>'s, which use a lambda-based Extended Visitor implementation.
• ATokVisFact : An abstract mid-level implementation of ITokVisitorFact that contains the tokenizer and provides services to manage it.
• TerminalSymbolFact : A concrete implementation of ATokVisitotFact whose instances parse specific Tokens, as specified by a given an String, the name of the corresponding Token..  There is a one-to-one correspondence between the instances of this class and the distinct types of tokens generated by the tokenizer.
• CombinationFact :  A concrete implementation of ATokVisFact that produces ATokVisitors whose cases can process the symbols created by either of two given ATokVisitors.  The cases of the visitors produced by this factory are thus the union of the cases of the given visitors.  The visitors created by this factory utilize the ITokVisitorCmds copied from the given visitors.   The default case is that specified by the latter of  the two given factories' visitors.  The restriction to LL(1) grammars means that no two cases from the contributing visitors will ever be for the same Token.
• SequenceFact : A concrete implementation of ATokVisitorFact that produces ATokVisitors whose cases can process a binary sequence of the symbols created by two given ATokVisitors. Thus, m>every case in the first of the given visitors is augmented (decorated) to first execute normally and then to execute the second of the given visitors on the next token.  The default case is also so decorated.
• MTSymbolFact : A concrete implementation of ATokVisitorFact that produces ATokVisitors whose cases parse the empty token/symbol defined at the end of a recursive structure of symbols in the grammar.   Thus, only the default case is defined.   All the visitors created by this factory do is to push the token back into the tokenizer.  Note that due to the way in which the CombinationFact works, the factory must always be the last when combining two or more visitors.
• ProxyFact : This concrete implementation of ITokVisitorFact is simply a proxy to a given ITokVisitorFact.   This enables a specification of a given factory to be used but without specifying the usage of a particular instance of the factory.    In particular, this factory is very useful when a recursive loop is encountered.

In practice, the following convenience classes are often useful:

• MultiSequenceFact :  This concrete implementation of ITokVisitorFact simply utilizes multiple SequenceFact's to create a resultant factory whose visitors can parse an arbitrarily long sequence of symbols.
• MultiCombinationFact :  This concrete implementation of ITokVisitorFact simply utilizes multiple CombinationFact's to create a resultant factory whose visitors can parse a token chosen from an arbitrarily large set of symbols.

The code to create a visitor to parse the above grammar is thus:

Note how the process of  creating a visitor for a particular grammar is purely a run-time operation and does not require any custom classes at all.

## Parse Tree-to-Model Translation

The parse tree created by the parser can be a bit unwieldy to work with. It is reasonable to transform it to a simpler representation. The file source code contains a `ToModelAlgo class` that performs this transformation. It is a visitor on grammar symbols and returns an `AAST`.

Consider the different cases for a parse tree:

1. If you have just a number, it will be the parse tree
```|_ parser.SequenceSymbol
|_ parser.TerminalSymbol: 7.0 (Num)
|_ parser.MTSymbol```
2. Similarly, if you just have a variable name, it will be the parse tree
```|_ parser.SequenceSymbol
|_ parser.TerminalSymbol: a (Id)
|_ parser.MTSymbol```
3. If you have something more complex, the parse tree will look like
```|_ parser.SequenceSymbol
|_ parser.TerminalSymbol: 3.0
|_ parser.SequenceSymbol
|_ parser.TerminalSymbol: +
|_ parser.SequenceSymbol
|_ parser.TerminalSymbol: 7.0
|_ parser.MTSymbol```

So in all cases will it start with a `SequenceSymbol`. Therefore, the `ToModelAlgo` visitor rejects anything that is not a `SequenceSymbol`. If it encounters a Sequence, then the first symbol must be a terminal symbol, namely a Num or an Id (check all three cases above). The `ToModelAlgo` visitor therefore executes the `NumOrIdHelper` on the first symbol and passes its return value (an `AAST`) to the visitor that deals with the second symbol, `SequenceHelper`.

`NumOrIdHelper` is simple. It rejects anything that is not a Num or an Id. If it encounters a Num or an Id, it returns the `AAST` for it.

`SequenceHelper` takes the result from `NumOrIdHelper` as inps[0]. Consider the legal cases for `SequenceHelper`: In cases 1) and 2) above, `SequenceHelper` may encounter the empty symbol. In that case, the `AAST` is just whatever was passed from `NumOrIdHelper` in inps[0]. Otherwise, `SequenceHelper` must encounter a sequence ("+ ...").

If `SequenceHelper` encounters a sequence, it should have a "+" or "*" in the first symbol and an arithmetic expression in the second symbol. `SequenceSymbol` therefore uses the main `ToModelAlgo` visitor on the second symbol, and then passes that return value to `OperatorHelper` as inps[1], along with whatever it got from `NumOrIdHelper` in `inps[0]`.

`OperatorHelper` is run on the first symbol of `SequenceHelper`'s host. `OperatorHelper` takes the `AAST` from the left side of the operator as `inps[0]` and the `AAST` from the right side as `inps[1]`. There are two legal cases for `OperatorHelper`: "+" and "*". It creates the correct `AAST`s for these, e.g. `new Add(inps[0], inps[1]);` and rejects anything else.

This code transforms the parse tree into a simple `AAST` that contains only the information that is pertinent to the model.

Design Patterns for Parsing (Part 2)

URL: http://www.cs.rice.edu/~mgricken/teaching/202/08-fall/lectures/parsing2/index.shtml