In the class, we discussed a specific style for defining program semantics: the operational semantics, in which a program is executed on some virtual machines. We also discussed the symbolic execution, which is similar to the operational semantics, except for the that the program is executed on a symbolic virtual machine and all program paths are explored in a systematic way. Finally, we discussed the concolic execution, in which the program is executed both concretely and symbolically.
In this assignment, we'll build several executors: the concrete executor, the symbolic executor, and the concolic executor. To be specific, this assignment is divided into four parts, each of which contains both some tutorials and problems. The first part is about concrete execution, in which you'll implement an executor based on the big-step operational semantics as we discussed in the class; the second part covers symbolic execution;The third part covers concolic execution, which combines symbolic execution and concrete execution; finally, the last part, we'll be getting some experience of symbolic execution application, by applying this technique to a well-understood security problem: SQL injection. Some problems are tagged with Exercise, which you should solve. And several problems are tagged with Challenge, which are optional. Download this code templates to start with.
Before getting started with this assignment, make sure you've finished Software Setup in the lab 1, and have Z3 and Python (3.12+ recommended) properly installed on your computer. For any problems, please feel free to contact us for help.
When you finished the lab, zip you code files with file name <ID>-lab-7.zip (e.g., SA19225789-lab-7.zip), and submit it to USTC Online Teaching Platform. Any late submission will NOT be accepted.
The idea of concrete execution is straightforward: we build a virtual machine to execute the program. And the execution rules for the VM follow the operational semantics rules (big or small step, depending on the style you choose). In many circumstances, such execution engine is often called an interpreter, as we interpret the code directly, instead of compiling them to native code.
In this part, we'll build a concrete executor (interpreter) for the MiniPy language.
The abstract syntax for MiniPy:
B ::= + | - | * | / | == | != | > | < | >= | <= E ::= n | x | E B E S ::= pass | x = E | seq(S, S) | f(E1, ..., En) | if(E, S, S) | while(E, S) F ::= f((x1, ..., xn), S, E)
where the non-terminal F stands for a function, the (x1, ..., xn) means the function's arguments, the S means the statements in the function, and the E means the return expression of the function; The non-terminal S stands for a statement; The non-terminal E stands for an expression; The non-terminal B stands for a binary operator.
The concrete executor (interpreter) you'll build will directly execute the source code of MiniPy. For this purpose, you'll first build data structures to represent the syntax for MiniPy, such data structures are called the abstract syntax trees (AST), as we have seen several times, for various languages, in the previous assignments.
In order to build the concrete executor, we first need to define the memory model, for this, we'll be using Python's dataclass model. We define, in the concrete.py file, a Python dataclass Memory which contains a definition of the concrete memory. The memory operations are defined as:
𝜌:𝑥→𝑛 concrete_memory: Dict[str, int] memory model data structure 𝜌[𝑥⊢𝑛] concrete_memory["x"] = n update the memory, mapping x to n 𝜌(𝑥): concrete_memory["x"] read the binding of variable x
To implement the concrete executor, we'll be using the big-step operational semantics rules. We list the rules for statements below:
ρ⊢E⟹n -------------------------- (E-Assign) ρ⊢x=E⟹ρ[x⊢n] ρ⊢E⟹1 ρ⊢S1⟹ρ1 ρ⊢E⟹0 ρ⊢S2⟹ρ2 -------------------------- (E-If1) ------------------------- (E-If2) ρ⊢if(E;S1;S2)⟹ρ1 ρ⊢if(E;S1;S2)⟹ρ2 ρ⊢E⟹0 ρ⊢E⟹1 ρ⊢S⟹ρ′ ρ′⊢while(E;S) -------------------------- (E-While1) ------------------------------------ (E-While2) ρ⊢while(E;S)⟹ρ ρ⊢while(E;S)⟹ρ′′
In order to generate test cases automatically for any given program, Lori A. Clarke and James C. King proposed symbolic execution in 1976. Compared to concrete execution, symbolic execution executes the program with the symbolic input and maintains the symbolic program state.
Symbolic execution can cover all possible paths of a program. For each path, it needs to maintain program states (symbolic memory, accumulated path conditions). For branching, the program states are forked. The following is an example if-statement:
def foo(): parent process if e1: / \ if e11: fork() / \ fork() ... / \ else: e11(e1) e12(~e1) ... / \ / \ else: fork() / \ fork() fork() / \ fork() if e12: / \ / \ ... (e1/\e11) (e1/\~e11) (~e1/\e12) (~e1/\~e12) else: ...
We'll be using Python's multiprocessing package to do process forking. You can refer to the online document for details. Here we give some basic usage, which should be enough for this lab's purpose:
# create a sub-process "p" which executes `func` function with two parameters 12 and 20: p = Process(target=func, args=(12, 20)) # start the sub-process p p.start() # wait sub-process p to finish p.join()
During the symbolic execution, we need to maintain symbolic memory for all execution paths, unlike memory in concrete execution, the symbolic execution will store AST nodes instead of actual values in the symbolic memory. Each execution branch maintains an independent storage area, a Python dataclass Memory which contains symbolic memory and path conditions.
The whole process for symbolic execution is very similar to concrete execution, except for that the symbolic memory is split for branching. Here is an example to demonstrate how symbolic memory splits for if statement:
def foo(x, y): x = x, y = y, z = x + y z = x+y / \ if x == 0: x == 0 / \ x != 0 z = z + 1 / \ return z x = x, y = y, z = z + 1 x = x, y = y, z = x + y
When symbolic execution finished, we should get path conditions and symbolic memory for each path, convert these path conditions to Z3 forms and send them to Z3 solver. The Z3 solver will calculate a model for these path conditions (if such a model does exist), and this model essentially stands for the concrete input that leads the program to this path. The whole process is:
programs ---> symbolic ---> path conditions ---> solver ---> model ---> concrete executor /obligation execution
Concolic execution is the combination of concrete execution and symbolic execution. The main idea is to maintain both symbolic memory and store path conditions during the concrete execution. By negating the path condition, the executor can generate an input which leads the program to a different path. Compared to symbolic execution, it has at least the following advantages:
Concolic execution maintains both a symbolic memory and a concrete memory. A Python dataclass Memory which contains symbolic memory and concrete memory is already defined in the concolic.py file.
One benefit of using concolic execution is that it avoids the non-termination loop problem, here is the typical process:
while e: <------- concrete execution, append to path condition list statements <------- concolic execution
Another benefit of using concolic execution is that it will be more flexible than symbolic execution when dealing with complex constraints. Here is an example:
def foo(x, y): m = x*x*x + x*x + x + 200 if m == y: s = 0 else: s = 1 return s
When we process the if-statement, we need to add the constraint x*x*x + y*y*(x-10) + 200 == y to the path conditions, as this constraint is nonlinear and may beyond the capability of current state-of-the-art solvers, say Z3. In concolic execution, we can weaken this constraint by replacing the symbolic value of m with its concrete value. Suppose the values of x and y are 10 and 1, the constraint can be simplified to 1300 == y.
All problems in this part are optional.
In this part of this assignment, we'll be getting some experience of symbolic execution application, by applying this technique to a well-understood security problem: SQL injection.
Most modern software systems, especially the online transaction processing systems (OLTP) make extensive use of databases. SQL injection is a code injection technique, used to attack such data-driven systems, in which malicious SQL statements are inserted for execution.
We have supplied you a very small database application, using the Python's sqlalchemy toolkit. Read the code in the sql-injection-sym.py file, run this code, you will see some output like this:
Please input a user name:
Basically, this file will create a local database victim.db (with SQLite3), with a predefine table users. And there have been some existing data in this table. (What are these data?)
Your searched the user: Bob Found a potential SQL injection vulnerability, you may trigger it with: [Bob = "hacker' OR '1=1"] Found the user: Bob [('Bob', 30, 'M')]as you can see, besides the normal response: [('Bob', 30, 'M')], the symbolic executor also tells you that it detects a SQL injection vulnerability. In order to trigger the vulnerability, you should input this (crazy username!):
hacker' OR '1=1what output do you get? You don't need to write any code in this exercise, but make sure you understand what SQL injection is and how it happens.
With the basic knowledge of SQL injection, the key problem we'd like to ask here is how we can find out such vulnerabilities in practical DB systems, both precisely and effectively. One promising way to do this is to use symbolic execution, the basic idea here is to run the target DB application symbolically (here in this problem: concolically), and before the DB command get executed, we check the symbolic values of the DB queries against some well-known payload patterns, to see whether the user-controlled input can match the payload pattern. As we'll discuss a little later, this syntactic pattern matching approach may be too constrained, but it may be the simplest one to implement.
To run the DB application symbolically, we use a new approach: instead of building an interpreter for the DB application from scratch, we intercept the Python's string APIs, so that these APIs will calculate symbolic values, besides actual values.
Nevertheless, by using more payload pattern, you can detect more SQL injections. For example, you can drop the table users with some injected SQL command.
This completes this lab. Remember to zip you homework with file name <ID>-lab-7.zip (e.g., SA19225789-lab-7.zip), and submit it to USTC Online Teaching Platform.
Happy hacking!