In this course, you will design and implement a compiler Tiger, for a non-trivial subset of the Java programming language MiniJava, as described in the appendix of the Tiger book. You will use Java as the implementation language (so you are following the chicken and egg tradition in compiler design community). By the end of this course, you will not only learn the theory and practice of compilers, but also gain a deeper understanding about Java.
There are 7 labs and a final project planned:
Finally, there is a final project, in which you are required to propose your own ideas and to do some nontrivial projects.
This is the lab 1. In this lab, you will implement a frond-end, i.e., a lexer and a parser, for Tiger.
In this lab, you will use Java as your implementation language, please first check the reference page to install necessary software, if you have not installed the Java platform. Make sure to install the latest Java compiler and runtime (version 22), other versions of Java do NOT work as the Tiger compiler makes use of some latest Java feature in version 22 (e.g., pattern matching or string templates). You can use any Java IDEs (e.g., IntelliJ, or VS code).
In each lab, we will supply skeleton code for you to start with and modify. These files are controlled and distributed using the Git version control system. So, to access these files, you will first need a Git client installed on your machine. On Linux or Mac, it is easy to install a Git client:
$ sudo apt install git
$ brew install git
On Windows, there are many Git clients available, among which msysgit is a very popular one, you can install and try it if you are using Windows.
For those who have no prior experience of using Git, do not forget to refer to the official manual when in doubt.
This year, we have put the skeleton code repository on GitHub (hence this courseware is open-source). To check out the code repository for lab 1 to your machine, you should run the following commands:
$ git clone https://github.com/bjhua/tiger.git
which will create a new folder tiger
in
which reside all the source files just checked out for the
lab 1 branch (the default branch).
Git is a distributed version control system, allowing one to work with the local version of repository on your machine. For instance, if you have finished the exercise 5 in lab 1, you can commit your changes by:
$ git commit -am 'my solution to lab 1, exercise 5'
which will commit the changes to your local repository.
Now import the project into your favorite Java IDE and browse the source code:
src/Tiger.java: the "Main" class serving as compiler entry src/lexer/*: the lexer src/parser/*: the parser src/control/*: to control the behavior of the compiler src/util/*: utility classes src/slp/*: the SLP interpreter and compiler test/*: MiniJava programs as testcases (so do not compile them as compiler sources)
Now, build this project on your machine. When the
building process finished, a new bin
folder
would be created. Note that the name of the output folder
depends on your specific IDE configurations.
For example, the folder may be named out
or
something else. But we will always name it bin
in the following and subsequent labs.
If there are no building errors during the building process,
you can run the Tiger
compiler:
$ java --enable-preview -cp bin Tiger -help
which will output something like:
The Tiger compiler. Copyright (C) 2013-, SSE of USTC. Usage: java Tiger [options] <filename> Available options: -dump {token} dump tokens from lexical analysis -help show this help information
There are two kinds of exercises: normal exercises and challenge ones. Normal exercises are all required, whereas all challenge exercises are optional. Challenge exercises may not be that hard, but may involve substantial design efforts or code hacking.
When you finished your lab, zip you solutions and submit to the USTC online teaching system.
In this part of the lab, you will do some programming exercises to warm you up. To be specific, you will write an interpreter and a compiler for a small programming language SLP, which stands for Straight-Line Programming language. In other words, the SLP language has no control-flow constructs. The syntax for SLP is given in chapter 1 of the Tiger book. You should refer to the Tiger book for the syntax details.
This part of the lab serves as an introduction to environments (symbol tables mapping variable names to information about the variables); to abstract syntax trees (data structures representing the syntactic structure of programs); to recursion over tree data structures which is useful in many parts of a compiler; and to a functional programming style without assignment statements.
SLP contains just two syntactic forms: statements and expressions.
a := 5+3; b := (print(a, a-1), 10*a); print(b)What is this program's output?
The key idea of a compiler is to encode the language syntax with tree-like data structures. We will delve into this topic in lab 2. For the purpose of this lab, we have offered you some Java classes encoding the SLP syntax. To be specific, Java (since version 14) has introduced a nice feature called record to encode an aggregated data structure, which is amenable for compiler implementation. Technically speaking, a Java record is similar to algebraic datatypes from functional programming.
To put the above discussion into perspective, let us consider the following sample grammar:
T ::= B1 | B2 | ... | Bn
to encode this grammar using Java record, we can declare a sealed
interface for
the left non-terminal T
, and n records
B1
to
Bn
for the n right-hands,
respectively, all implementing the interface T
:
interface sealed T permits B1, B2, ..., Bn {}
record B1(...) implements T{...}
record B2(...) implements T{...}
...
record Bn(...) implements T{...}
Such data structures are normally called abstract syntax trees, which we will discuss in depth in lab 2.
src/slp/Slp.java
, make
sure you understand how these classes define the
SLP syntax. As a
concrete example, make sure you understand how
the object sample1
in the source file src/slp/SamplesPrograms.java
encodes the
SLP program in exercise 1.
You do not need to write any code for this exercise.
a := 1; b := 0; print(a/b)Use the data structures for SLP, to define a new program
sample2
in the source file src/slp/SamplesPrograms.java
.
In compiler design, we often need to dump the internal data structures to inspect the status of these structures. Such inspections can not only aid debugging the compiler, but also guarantee the implementation correctness. A well-established technique finish this task is called pretty printing, which is also a classical algorithmic implementation topic.
The implementation of pretty printing, as well as implementation of similar operations on compiler's data structures (e.g., abstract syntax trees), is highly dependent on the implementation languages. To be specific, for the Java language as we used in this course, the standard way to accomplish this task is to use a special design pattern called visitor pattern. You should refer to the chapter 4 of the Tiger book for an excellent introduction to the visitor pattern and especially it usages in compilers. And you may also refer to books on design pattern to understand how visitor pattern works in general. Furthermore, it should also be noted that design patterns are important for other object-oriented language (e.g., C++, or Python), besides Java.
While design patterns are well-understood and established techniques to implement operations on data structures in an object-oriented style, we will use pattern matching, another novel feature from Java (since version 17), to implement these operations in a functional programming style, which is much easier to write and understand than visitor patterns.
Consider the aforementioned sample grammar, we can define
an operation op
on the type T
as the following node demonstrated:
void op(T x){
switch(x){
case B1(...) -> {...}
case B2(...) -> {...}
...
case Bn(...) -> {...}
}
}
this code is essentially a case analysis, which
inspects all possible shapes of the type T
and perform actions based on the shape, respectively.
src/slp/PrettyPrint.java
, make
sure you understand how this code implement the pretty
printing through
pattern matching.
Then run the code src/slp/Main.java
:
$ java --enable-preview -cp bin slp.Mainto pretty print the sample programs. The output looks like:
a = 5+3; b = (print(a, a-1, ), 10*a); print(b, )As you may notice, there is a "bug" in the implementation of printing the
print
statement: it will always
print a redundant ,
at the end of the argument list.
For example, it will print
print(a, b, )
for a print statement with two
arguments, instead of print(a, b)
.
Fix this bug.
A program may contain zero, or more print
statements, each of which taking one or more arguments (that is,
the print
function is overloaded).
For instance, the
above SLP program in exercise 1 has two print
statements: the
former one has 2 arguments whereas the latter one has just 1.
We define the maximum argument numbers of program as
the maximum number of arguments among all print
statements.
For example, the maximum argument number for the above program
is 2.
src/slp/MaxArgument.java
,
to calculate the maximum
number of arguments with a
given program.
Do not forget to test your code when finished, and
fix any bugs before continuing.
In this part, you will write an interpreter for SLP.
An interpreter runs a given program
online, that is, it analyzes the program and mimic the
execution behaviour of the program.
As SLP
has an assignment statement x:=e
, so
the interpreter must have an abstract memory, which keeps
track of the current value of a variable x
.
src/slp/Interpreter.java
that "interprets" a program in SLP.
You may need to modify the types of the corresponding methods.
Do not forget to test your code when finished, and
fix any bugs before continuing.
For example, if we run the above sample SLP program in exercise 1,
$ java --enable-preview -cp bin slp.Mainwhich should output:
8 7 80note that the number
7
is followed by a
newline character.
In this part, you will be building a compiler for SLP targeting x64. Do not worry, if you are not familiar with x64, as we will not make use too much complex features of it. And also we will discuss x64 in details in lab 5. You may want to refer to some instruction set manual to familiarize yourself with x64. You may also want to take a look at the x64 calling conventions, to understand how a function call happens.
src/slp/Compiler.java
,
to understand how an SLP
program is compiled to x64 assembly.
Your job is to extend the compiler to support the subtraction
-
and the division /
.
Do not forget to test your compiler using the test cases in
src/slp/SampleProgram.java
, as well as your own
test cases.
A lexer reads as input the program source files and outputs a series of tokens recognized. In this part of the lab, you will design and implement a lexer for Tiger.
The first step to design and
implement the lexer is to design good data structures
to represent the input and output, respectively,
which is, nevertheless to say,
implementation language dependent.
In Java, we can represent the
input by some kind of (buffered) input
stream established from the program source text file, and can
represent recognized
token by specific data structure.
The data structure defining token can be found in
the class src/lexer/Token.java
. This
data structure consists of the following key fields:
kind
: the kind of the token, which is an
enumerable type
as defined by enum Kind
;
lexeme
: extra information that the
token may carry, if any; and
rowNum, colNum
: the positions of the token in the
source program (this
information is crucial in later phase of the compiler, such as
error diagnosing and profiling).
For simplicity, we just
use row and column numbers to represent current
token's position.
enum Kind
type.
To implement the lexer algorithm, we have two options: the handwritten approach and the automatic lexer generator approach. Both are very popular in compiler implementations. In this lab, we will use a handwritten approach to implement the lexer for Tiger, which is also used by production compilers such as GCC or LLVM.
nextToken0()
in the file
lexer/Lexer.java
recognize tokens.
Read the MiniJava specification and supply
the missing code for the method.
To this point, you have finished the lexer, and your Tiger
compiler should compile all the test programs in
the /test
directory. Before you continue,
you should run your Tiger compiler on
these test cases and make sure your lexer works correctly.
For instance, to test your compiler on the input file
/test/Factorial.java
):
$ java --enable-preview -cp bin Tiger ./test/Factorial.java -dump token
which will not only lex your test case but also will dump the tokens recognized. If your compiler fails on any test cases, check your code to fix bugs and re-test your compiler. And you may want to write more test cases to test your Tiger compiler besides the test cases we offered.
$ javac MonsterGen.javathen run it to produce a
Monster.java
code:
$ java MonsterGen 100000>Monster.javaAnd run your lexer on
Monster.java
to determin how
long it will take to compile this program?
How long does Oracle's javac
compiler take
compile it?
rowNum
and a column
number colNum
to represent the position of a token.
Though simple and easy to implement, such position information is
coarse-grained in that they only record where a
token starts, but not where it ends.
Propose a better data
structure to represent a position in a fine-grained manner
and implement your idea into the compiler.
You will implement a parser in this part of the lab. A parser parses the input program, according to the given language syntax, and determines whether the target program is legal. The parser generates data structures to be processed by later phases, if the target program is legal.
Just as for the lexer, there are generally two approaches to implement a parser: handwritten or using an automatic parser generator. We will implement a handwritten recursive descendent parser in this part, that is, all production rules become a bunch of recursive functions with (approximately) one function for one non-terminal.
parser/Parser.java
. We have offered most
code for you, you should supply the missing code for
those incomplete methods.
To this point, your parser should be able to parse all
MiniJava programs. Remember
to test your parser thoroughly using
the test cases in directory /test
as well as your
own test cases. Fix any bugs before continuing.
error()
only reports an error message before exiting.
Modify this method to report more accurate error messages, such
as the file name, the line number, the column
number, some diagnostic
message, and the related source code. You may take a look
at the diagnostic messages generated by the
Clang
compiler to get some idea.
Finally, you may
try to implement error recover, that
is, to recover from errors so that all potential syntactic errors
can be detected.
This completes the lab. Remember to submit your solution to the online teaching system.