In this course, you will design and implement a compiler Tiger, for MiniJava, a small but non-trivial subset of the Java programming language, as described in the appendix of the Tiger book. You will use Java as the implementation language (so you are following the chicken and egg tradition in the compiler design community). By the end of this course, you will not only learn the theory and practice of compilers, but also gain a deeper understanding about Java.
There are 7 labs and a final project planned:
Finally, there is a final project, in which you are required to propose your own ideas and to do some nontrivial projects.
This is the lab 1, which comprises three parts. In part A, you will learn how to create abstract syntax and process it using Java, by implementing a small but complete compiler (interpreter) for a small language SLP. In part B and C, you will implement a frond-end, i.e., a lexer and a parser, for Tiger, respectively.
In this and the following labs, you will use Java as your implementation language,
please first check
the reference page
to install necessary software, if you have not installed the
Java platform.
Make sure to install the latest
Java compiler and runtime (version 26), other versions
of Java do NOT work as the Tiger compiler makes use of some
features that are introduced in latest Java (e.g., implicit permits).
You can use any Java IDEs (e.g., IntelliJ, or VS code).
In each lab, we will supply skeleton code for you to start with and modify. These files are controlled and distributed using the Git version control system. So, to access these files, you will first need a Git client installed on your machine. On Linux or Mac, it is easy to install a Git client:
$ sudo apt install git
$ brew install git
On Windows, there are many Git clients available, among which msysgit is a very popular one, you can install and try it if you are using Windows.
For those who have no prior experience of using Git, do not forget to refer to the official manual when in doubt.
We will use gradle to build the project. So, you should install gradle on your machine, if you have not. On Linux or Mac, it is easy to install gradle:
$ sudo apt install gradle
$ brew install gradle
On Windows, you should download the gradle binary release first, and install it manually.
This year, we have put the skeleton code repository on GitHub (hence this courseware is open-source). To check out the code repository for lab 1 to your machine, you should run the following commands:
$ git clone https://github.com/bjhua/tiger.git
which will create a new folder tiger in
which reside all the source files just checked out for the
lab 1 branch (the default branch).
Git is a distributed version control system, allowing one to work with the local version of repository on your machine. For instance, if you have finished the exercise 5 in lab 1, you can commit your changes by:
$ git commit -am 'my solution to lab 1, exercise 5'
which will commit the changes to your local repository.
Now import the project into your favorite Java IDE and browse the source code:
src/Tiger.java: the "Main" class serving as compiler entry
src/lexer/*: the lexer
src/parser/*: the parser
src/control/*: to control the behavior of the compiler
src/util/*: utility classes
src/slp/*: the SLP interpreter and compiler
test/*: unit tests (requires JUnit)
benchmark/*: MiniJava programs as testcases (so do not compile them as compiler sources)
Now, build this project on your machine. When the
building process finished, a new build folder
would be created comprising the generated binaries.
Note that the name of the output folder
depends on your specific IDE configurations.
For example, the folder may be named out or bin or
something else. But we will always name it build
in the following and subsequent labs.
If there are no building errors during the building process,
you can run the Tiger compiler:
$ java -cp build/libs/Tiger-1.0.jar Tiger -help
which will output something like:
The Tiger compiler. Copyright (C) 2013-, SSE of USTC.
Usage: java Tiger [options] <filename>
Available options:
-dump {token} dump tokens from lexical analysis
-help show this help information
There are two kinds of exercises: normal exercises and challenge ones. Normal exercises are all required, whereas all challenge exercises are optional. Challenge exercises may not be that hard, but may involve substantial design efforts or code hacking.
When you finished your lab, zip you solutions and submit to the USTC online teaching system.
In this part of the lab, you will do some programming exercises
to warm you up. Specifically, you will write an interpreter
and a compiler for a small programming language SLP, which stands
for
Straight-Line Programming language.
In other words, the SLP language has no control-flow constructs like
if or while.
The syntax
for SLP is given in chapter 1 of the Tiger book. You should
refer to the Tiger book for the syntax details.
This part of the lab serves as an introduction to environments (symbol tables mapping variable names to information about the variables); to abstract syntax trees (data structures representing the syntactic structure of programs); to recursion over tree data structures which is useful in many parts of a compiler; and to a functional programming style without side effects like modifying the value of a variable.
SLP contains just two syntactic forms: statements and expressions.
a := 5+3;
b := (print(a, a-1), 10*a);
print(b)
What is this program's output?
The key idea of a compiler is to encode the language syntax with tree-like data structures. We will delve into this topic in lab 2. For the purpose of this lab, we have offered you some Java classes encoding the SLP syntax. To be specific, Java (since version 14) has introduced a nice feature called record to encode an aggregated data structure, which is amenable for compiler implementation. Technically speaking, a Java record is similar to algebraic datatypes from functional programming.
To put the above discussion into perspective, let us consider the following sample grammar:
T ::= B1 | B2 | ... | Bn
where the left side hand T is a nonterminal
with n productions
B1 to Bn as right side hands.
To encode this grammar using Java record, we can declare a sealed
interface for
the left non-terminal T, and n records
B1 to
Bn for the n right-hands,
respectively, all implementing the interface T:
public sealed interface T {
record B1(...) implements T{...}
record B2(...) implements T{...}
...
record Bn(...) implements T{...}
}
Note that earlier Java version (prior to version 22) requires
that the interfaces T should have
explicit permits clauses to specify
all amenable records.
The latest Java does not require that anymore, if the
interface and all the implementing records are in the same file.
Such data structures are normally called abstract syntax trees, which we will discuss in depth in lab 2.
src/slp/Slp.java, make
sure you understand how these classes define the
SLP syntax. As a
concrete example, make sure you understand how
the object sample1
in the source file src/slp/SamplesPrograms.java encodes the
SLP program in exercise 1.
a := 1;
b := 0;
print(a/b)
Use the data structures for SLP, to define a new program
sample2
in the source file src/slp/SamplesPrograms.java.
Recursion is crucial to visit the abstract syntax trees. Specifically, one can visit the trees in pre-, middle-, or post-orders. To make this discussion into perspective, we will discuss a special but important type of recursion: the pretty printer.
In compiler design, we often need to dump the internal data structures to inspect the status of these structures. Such inspections can not only help debug the compiler, but also guarantee the implementation correctness. A well-established technique to finish this task is called pretty printing, which is also a classical algorithmic implementation topic.
The implementation of pretty printing, as well as implementation of similar operations on compiler's data structures (e.g., abstract syntax trees), is highly dependent on the implementation languages. To be specific, for the Java language as we used in this course, the standard way to accomplish this task is to use a special design pattern called visitor pattern. You should refer to the chapter 4 of the Tiger book for an excellent introduction to the visitor pattern and especially it usages in compilers. And you may also refer to books on design pattern to understand how visitor pattern works in general. Furthermore, it should also be noted that design patterns are important for other object-oriented language (e.g., C++, or Python), besides Java.
While design patterns are well-understood and established techniques to implement operations on data structures in an object-oriented style, we will use pattern matching, another novel feature from Java (since version 17), to implement these operations in a functional programming style. Generally speaking, the pattern matching is more accessible and much easier to write and understand than visitor patterns.
Consider the aforementioned sample grammar, we can define
an operation op on the type T
as the following node demonstrated:
void op(T x){
switch(x){
case T.B1(...) -> {...}
case T.B2(...) -> {...}
...
case T.Bn(...) -> {...}
}
}
this code is essentially a type case analysis, which
inspects all possible shapes of the type T
and perform actions based on the corresponding shape.
src/slp/PrettyPrint.java, make
sure you understand how this code implement the pretty
printing by leveraging pattern matching.
Then run the code src/slp/PrettyPrint.java
to pretty print the sample programs.
The output should look like:
a = 5+3;
b = (print(a, a-1, ), 10*a);
print(b, )
As you may notice, there is an intentional "bug" in the implementation of
the print statement: it will always
print a redundant , at the end of the argument list.
For example, it will print
print(a, b, ) for a print statement with two
arguments a and b, instead of the desired output print(a, b).
Fix this bug.
A program may contain zero, or more print
statements, each of which taking one or more arguments (that is,
the print function is overloaded).
For instance, the
above SLP program in exercise 1 has two print
statements: among which the
former one has 2 arguments whereas the latter one has just 1 argument.
We define the maximum argument numbers of program as
the maximum number of arguments among all print statements.
For example, the maximum argument number for the above program
is 2.
src/slp/MaxArgument.java,
to calculate the maximum number of arguments with a
given program.
Do not forget to test your code when finished, and
fix any bugs before continuing.
In this part, you will write an interpreter for SLP.
An interpreter runs a given program
online, that is, it analyzes the program and mimic the
execution behavior of the program.
As SLP
has an assignment statement x:=e, so
the interpreter may design an abstract memory to keep
track of the current value v of a variable x.
src/slp/Interpreter.java
that "interprets" a program in SLP.
You may need to modify the types of the corresponding methods.
Do not forget to test your code when finished, and
fix any bugs before continuing.
For example, if we run the above sample SLP program in exercise 1,
it should output:
8 7
80
note that the number 7 is followed by an invisible
newline character \n.
In this part, you will be building a compiler for SLP targeting x64. Do not worry, if you are not familiar with x64, as we will not make use too much complex features of it. And also we will discuss x64 in details in lab 5. You may want to refer to some instruction set manual to familiarize yourself with x64. You may also want to take a look at the x64 calling conventions, to understand how a function call happens.
src/slp/Compiler.java,
to understand how an SLP
program is compiled to x64 assembly.
Your job is to extend the compiler to support the subtraction
- and the division /.
Do not forget to test your compiler using the test cases in
src/slp/SampleProgram.java, as well as your own
test cases.
A lexer reads as input the program source files and outputs a series of tokens recognized. In this part of the lab, you will design and implement a lexer for Tiger.
The first step to design and
implement the lexer is to design good data structures
to represent the input and output, respectively.
This step is, nevertheless to say,
implementation language dependent.
In Java, we can represent the
input by some kind of (buffered) input
stream established from the program source text file, and can
represent recognized
token by specific data structure.
The data structure defining token can be found in
the class src/lexer/Token.java. This
data structure consists of the following key fields:
kind: the kind of the token, which is an
enumerable type
as defined by enum Kind;
lexeme: extra information that the
token may carry, if any; and
coordinate: the coordinate of the token in the
source program.
For simplicity, we
use the file name, the row, and the column numbers to represent current
token's coordinate.
This
information is crucial in later phases of the compiler, such as
error diagnosing and profiling.
enum Kind type correspondingly.
To implement the lexer algorithm, we have two options: the handwritten approach and the automatic lexer generator approach. Both are very popular in compiler implementations. In this lab, we will use the handwritten approach to implement the lexer for Tiger, which is also used by production compilers such as GCC or LLVM.
nextToken0() in the file
lexer/Lexer.java recognize tokens.
Read the MiniJava specification and supply
the missing code for the method.
To this point, you have finished the lexer, and your Tiger
compiler should compile all the test programs in
the /benchmark directory. Before you continue,
you should run your Tiger compiler on
these test cases and make sure your lexer works correctly.
For instance, to test your compiler on the input file
/benchmark/Factorial.java):
$ java -cp build/libs/tiger-1.0.jar Tiger ./benchmark/Factorial.java -dump token
which will not only lex your test case but also will dump the tokens recognized. If your compiler fails on any test cases, check your code to fix bugs and re-test your compiler. And you may want to write more test cases to test your Tiger compiler besides the test cases we offered.
Monster.java file:
$ java MonsterGen.java 100000>Monster10w.javaAnd run your lexer on the generated
Monster10w.java file
to investigate how long it takes to compile this program.
Does the official javac compiler
compile that file? Why?
You will implement a parser in this part of the lab. A parser parses the input program, according to the given language syntax, and determines whether the target program is legal. The parser generates data structures to be processed by later phases, if the target program is legal.
Just as for the lexer, there are generally two approaches to implement a parser: handwritten or using an automatic parser generator. We will implement a handwritten recursive descendent parser in this part, that is, all production rules become a bunch of recursive functions with (approximately) one function for one non-terminal.
parser/Parser.java. We have offered most
code for you, you should supply the missing code for
those incomplete methods.
To this point, your parser should be able to parse all
MiniJava programs. Remember
to test your parser thoroughly using
the test cases in directory /benchmark as well as your
own test cases. Fix any bugs before continuing.
error()
reports just one error before exiting.
Modify this method to report all potential syntactic errors.
That is, your parser should recover from one error to detect
more errors.
This completes the lab. Remember to submit your solution to the online teaching system.