Lab 1: Lexer and Parser


Overview

In this course, you will design and implement a compiler Tiger, for a non-trivial subset of the Java programming language MiniJava, as described in the appendix of the Tiger book. You will use Java as the implementation language (so you are following the chicken and egg tradition in compiler design community). By the end of this course, you will not only learn the theory and practice of compilers, but also gain a deeper understanding about Java.

There are 7 labs and a final project planned:

Finally, there is a final project, in which you are required to propose your own ideas and to do some nontrivial projects.

This is the lab 1. In this lab, you will implement a frond-end, i.e., a lexer and a parser, for Tiger.

Software Setup

In this lab, you will use Java as your implementation language, please first check the reference page to install necessary software, if you have not installed the Java platform. Make sure to install the latest Java compiler and runtime (version 22), other versions of Java do NOT work as the Tiger compiler makes use of some latest Java feature in version 22 (e.g., pattern matching or string templates). You can use any Java IDEs (e.g., IntelliJ, or VS code).

Version control and Git

In each lab, we will supply skeleton code for you to start with and modify. These files are controlled and distributed using the Git version control system. So, to access these files, you will first need a Git client installed on your machine. On Linux or Mac, it is easy to install a Git client:

On Windows, there are many Git clients available, among which msysgit is a very popular one, you can install and try it if you are using Windows.

For those who have no prior experience of using Git, do not forget to refer to the official manual when in doubt.

Getting Started

This year, we have put the skeleton code repository on GitHub (hence this courseware is open-source). To check out the code repository for lab 1 to your machine, you should run the following commands:

    $ git clone https://github.com/bjhua/tiger.git

which will create a new folder tiger in which reside all the source files just checked out for the lab 1 branch (the default branch).

Git is a distributed version control system, allowing one to work with the local version of repository on your machine. For instance, if you have finished the exercise 5 in lab 1, you can commit your changes by:

    $ git commit -am 'my solution to lab 1, exercise 5'

which will commit the changes to your local repository.

Now import the project into your favorite Java IDE and browse the source code:

    src/Tiger.java:   the "Main" class serving as compiler entry
    src/lexer/*:      the lexer
    src/parser/*:     the parser
    src/control/*:    to control the behavior of the compiler
    src/util/*:       utility classes
    src/slp/*:        the SLP interpreter and compiler
    test/*:           MiniJava programs as testcases (so do not compile them as compiler sources)

Now, build this project on your machine. When the building process finished, a new bin folder would be created. Note that the name of the output folder depends on your specific IDE configurations. For example, the folder may be named out or something else. But we will always name it bin in the following and subsequent labs.

If there are no building errors during the building process, you can run the Tiger compiler:

    $ java --enable-preview -cp bin Tiger -help

which will output something like:

The Tiger compiler. Copyright (C) 2013-, SSE of USTC.
Usage: java Tiger [options] <filename>

Available options:
    -dump {token}   dump tokens from lexical analysis
    -help           show this help information

Lab Requirement

There are two kinds of exercises: normal exercises and challenge ones. Normal exercises are all required, whereas all challenge exercises are optional. Challenge exercises may not be that hard, but may involve substantial design efforts or code hacking.

Hand-in Procedure

When you finished your lab, zip you solutions and submit to the USTC online teaching system.


Part A: SLP and Its Interpreter and Compiler

In this part of the lab, you will do some programming exercises to warm you up. To be specific, you will write an interpreter and a compiler for a small programming language SLP, which stands for Straight-Line Programming language. In other words, the SLP language has no control-flow constructs. The syntax for SLP is given in chapter 1 of the Tiger book. You should refer to the Tiger book for the syntax details.

This part of the lab serves as an introduction to environments (symbol tables mapping variable names to information about the variables); to abstract syntax trees (data structures representing the syntactic structure of programs); to recursion over tree data structures which is useful in many parts of a compiler; and to a functional programming style without assignment statements.

SLP Abstract Syntax Trees

SLP contains just two syntactic forms: statements and expressions.

Exercise 1. Read the SLP syntax in the chapter 1 of the Tiger book, make sure you understand how an SLP program is formed. As a specific example, explain to yourself why the following program is legal according to SLP syntax:
        a := 5+3;
        b := (print(a, a-1), 10*a);
        print(b)
What is this program's output?

The key idea of a compiler is to encode the language syntax with tree-like data structures. We will delve into this topic in lab 2. For the purpose of this lab, we have offered you some Java classes encoding the SLP syntax. To be specific, Java (since version 14) has introduced a nice feature called record to encode an aggregated data structure, which is amenable for compiler implementation. Technically speaking, a Java record is similar to algebraic datatypes from functional programming.

To put the above discussion into perspective, let us consider the following sample grammar:

    T ::= B1 | B2 | ... | Bn

to encode this grammar using Java record, we can declare a sealed interface for the left non-terminal T, and n records B1 to Bn for the n right-hands, respectively, all implementing the interface T:


    interface sealed T permits B1, B2, ..., Bn {}
    record B1(...) implements T{...}
    record B2(...) implements T{...}
    ...
    record Bn(...) implements T{...}

Such data structures are normally called abstract syntax trees, which we will discuss in depth in lab 2.

Exercise 2. Read the class definitions in src/slp/Slp.java, make sure you understand how these classes define the SLP syntax. As a concrete example, make sure you understand how the object sample1 in the source file src/slp/SamplesPrograms.java encodes the SLP program in exercise 1. You do not need to write any code for this exercise.
Exercise 3. For this SLP program:
        a := 1;
        b := 0;
        print(a/b)
Use the data structures for SLP, to define a new program sample2 in the source file src/slp/SamplesPrograms.java.

Pretty Printer

In compiler design, we often need to dump the internal data structures to inspect the status of these structures. Such inspections can not only aid debugging the compiler, but also guarantee the implementation correctness. A well-established technique finish this task is called pretty printing, which is also a classical algorithmic implementation topic.

The implementation of pretty printing, as well as implementation of similar operations on compiler's data structures (e.g., abstract syntax trees), is highly dependent on the implementation languages. To be specific, for the Java language as we used in this course, the standard way to accomplish this task is to use a special design pattern called visitor pattern. You should refer to the chapter 4 of the Tiger book for an excellent introduction to the visitor pattern and especially it usages in compilers. And you may also refer to books on design pattern to understand how visitor pattern works in general. Furthermore, it should also be noted that design patterns are important for other object-oriented language (e.g., C++, or Python), besides Java.

While design patterns are well-understood and established techniques to implement operations on data structures in an object-oriented style, we will use pattern matching, another novel feature from Java (since version 17), to implement these operations in a functional programming style, which is much easier to write and understand than visitor patterns.

Consider the aforementioned sample grammar, we can define an operation op on the type T as the following node demonstrated:


        void op(T x){
            switch(x){
                case B1(...) -> {...}
                case B2(...) -> {...}
                ...
                case Bn(...) -> {...}
            }
        }

this code is essentially a case analysis, which inspects all possible shapes of the type T and perform actions based on the shape, respectively.

Exercise 4. Read the code in src/slp/PrettyPrint.java, make sure you understand how this code implement the pretty printing through pattern matching. Then run the code src/slp/Main.java:
        $ java --enable-preview -cp bin slp.Main
to pretty print the sample programs. The output looks like:
        a = 5+3;
        b = (print(a, a-1, ), 10*a);
        print(b, )
As you may notice, there is a "bug" in the implementation of printing the print statement: it will always print a redundant , at the end of the argument list. For example, it will print print(a, b, ) for a print statement with two arguments, instead of print(a, b). Fix this bug.

Maximum Argument Numbers

A program may contain zero, or more print statements, each of which taking one or more arguments (that is, the print function is overloaded). For instance, the above SLP program in exercise 1 has two print statements: the former one has 2 arguments whereas the latter one has just 1. We define the maximum argument numbers of program as the maximum number of arguments among all print statements. For example, the maximum argument number for the above program is 2.

Exercise 5. Finish the code in src/slp/MaxArgument.java, to calculate the maximum number of arguments with a given program. Do not forget to test your code when finished, and fix any bugs before continuing.

An Interpreter

In this part, you will write an interpreter for SLP. An interpreter runs a given program online, that is, it analyzes the program and mimic the execution behaviour of the program. As SLP has an assignment statement x:=e, so the interpreter must have an abstract memory, which keeps track of the current value of a variable x.

Exercise 6. Finish the Java code insrc/slp/Interpreter.java that "interprets" a program in SLP. You may need to modify the types of the corresponding methods. Do not forget to test your code when finished, and fix any bugs before continuing. For example, if we run the above sample SLP program in exercise 1,
        $ java --enable-preview -cp bin slp.Main
which should output:
        8 7
        80
note that the number 7 is followed by a newline character.

A Compiler from SLP to x64

In this part, you will be building a compiler for SLP targeting x64. Do not worry, if you are not familiar with x64, as we will not make use too much complex features of it. And also we will discuss x64 in details in lab 5. You may want to refer to some instruction set manual to familiarize yourself with x64. You may also want to take a look at the x64 calling conventions, to understand how a function call happens.

Exercise 7. First read the code in src/slp/Compiler.java, to understand how an SLP program is compiled to x64 assembly. Your job is to extend the compiler to support the subtraction - and the division /. Do not forget to test your compiler using the test cases in src/slp/SampleProgram.java, as well as your own test cases.

Part B: The Lexer

A lexer reads as input the program source files and outputs a series of tokens recognized. In this part of the lab, you will design and implement a lexer for Tiger.

The first step to design and implement the lexer is to design good data structures to represent the input and output, respectively, which is, nevertheless to say, implementation language dependent. In Java, we can represent the input by some kind of (buffered) input stream established from the program source text file, and can represent recognized token by specific data structure. The data structure defining token can be found in the class src/lexer/Token.java. This data structure consists of the following key fields:

Exercise 8. You should read the MiniJava specification carefully, and determine what other tokens should be added by extending the enum Kind type.

To implement the lexer algorithm, we have two options: the handwritten approach and the automatic lexer generator approach. Both are very popular in compiler implementations. In this lab, we will use a handwritten approach to implement the lexer for Tiger, which is also used by production compilers such as GCC or LLVM.

Exercise 9. The method nextToken0() in the file lexer/Lexer.java recognize tokens. Read the MiniJava specification and supply the missing code for the method.

To this point, you have finished the lexer, and your Tiger compiler should compile all the test programs in the /test directory. Before you continue, you should run your Tiger compiler on these test cases and make sure your lexer works correctly. For instance, to test your compiler on the input file /test/Factorial.java):

    $ java --enable-preview -cp bin Tiger ./test/Factorial.java -dump token

which will not only lex your test case but also will dump the tokens recognized. If your compiler fails on any test cases, check your code to fix bugs and re-test your compiler. And you may want to write more test cases to test your Tiger compiler besides the test cases we offered.

Exercise 10. Your lexer must be fast enough to lex (reasonably) large Java programs. To test how fast your lexer is, you should use some relatively large Java test cases. Download this monster generator and compile it:
  $ javac MonsterGen.java
then run it to produce a Monster.java code:
  $ java MonsterGen 100000>Monster.java
And run your lexer on Monster.java to determin how long it will take to compile this program? How long does Oracle's javac compiler take compile it?
Challenge! We have used a row number rowNum and a column number colNum to represent the position of a token. Though simple and easy to implement, such position information is coarse-grained in that they only record where a token starts, but not where it ends. Propose a better data structure to represent a position in a fine-grained manner and implement your idea into the compiler.
Challenge! Modify your lexer to lex full Java, instead of MiniJava. After you finish this, you can use your lexer to lex some nontrivial Java projects, such as the Apache's source code.
Challenge! Use an automatic lexer generator to build your Tiger lexer. For example, you may try the JFlex, a very popular lexical analyzer generator. Is this approach better than the handwritten approach we have used in this lab?

Part C: The Parser

You will implement a parser in this part of the lab. A parser parses the input program, according to the given language syntax, and determines whether the target program is legal. The parser generates data structures to be processed by later phases, if the target program is legal.

Just as for the lexer, there are generally two approaches to implement a parser: handwritten or using an automatic parser generator. We will implement a handwritten recursive descendent parser in this part, that is, all production rules become a bunch of recursive functions with (approximately) one function for one non-terminal.

Exercise 11. Finish the recursive decedent parser in the file parser/Parser.java. We have offered most code for you, you should supply the missing code for those incomplete methods.

To this point, your parser should be able to parse all MiniJava programs. Remember to test your parser thoroughly using the test cases in directory /test as well as your own test cases. Fix any bugs before continuing.

Exercise 12. The current implementation of the error handling and error recovery method error() only reports an error message before exiting. Modify this method to report more accurate error messages, such as the file name, the line number, the column number, some diagnostic message, and the related source code. You may take a look at the diagnostic messages generated by the Clang compiler to get some idea. Finally, you may try to implement error recover, that is, to recover from errors so that all potential syntactic errors can be detected.
Challenge! Implement your parser with automatic parser generators. For instance, you may use CUP, or AntLR, or SableCC, etc. Again, this exercise may not be that difficult, but may require considerable modifications to current code base and programming efforts.

Hand-in

This completes the lab. Remember to submit your solution to the online teaching system.