Synopsis
Lab 3: Datalog
Fall Semester 2020
Due: 26 October, 8:00 a.m. Eastern Time Corresponding Lecture: Lesson 7 (Constraint Based Analysis)
Writing a constraint-based static analysis for C programs with LLVM and Z3.
Objective
In this lab, you will implement a constraint-based analysis to determine reaching definitions in simple C programs. You will discover relevant facts about LLVM IR Instructions, feed those facts into the Z3 constraint solver, and implement the rules of reaching definitions in Z3’c C++ API.
Resources
● Z3 tutorial and C++ API
– https://rise4fun.com/z3/tutorial/guide
– https://z3prover.github.io/api/html/group__cppapi.html
– https://github.com/Z3Prover/z3/blob/master/examples/c%2B%2B/example.cpp
● Important classes
– https://z3prover.github.io/api/html/classz3_1_1fixedpoint.html – https://z3prover.github.io/api/html/classz3_1_1expr.html
Setup
Download LLVMDatalog.zip from Canvas and unzip in the home directory on the VM (note: there is an extraneous “datalog” directory on the VM already – you can delete or ignore the contents). This will create an LLVMDatalog directory.
The following commands setup the lab:
The above command will install the Z3 constraint solver, generate a Makefile using CMake, and finally generate an executable file constraint that calculates reaching definitions:
$ cd ~/LLVMDatalog
$ source oneTimeSetup.bash $ mkdir build && cd build $ cmake ..
$ make
$ cd ~/datalog/test
$ clang -emit-llvm -S -fno-discard-value-names -c simple0.c $ ../build/constraint simple0.ll
If you’ve done everything correctly up to this point you should see:
You can expect to re-run the command often as you work through the lab. You should not
need to re-run cmake or Lab Instructions
In this lab, you will design a reaching definition analysis using Z3. The main tasks are to design the analysis in the form of Datalog rules through the Z3 C++ API, and implement a function that extracts logical constraints in the form of Datalog facts for each LLVM instruction.
We will then feed these constraints, along with your datalog rules, into the Z3 solver. The main function of src/Constraint.cpp ties this logic together, and provides comments to explain how the main components work together.
In short, the lab consists of the following tasks:
1. Write Datalog rules in the initialize function in Extractor.cpp to define the reaching
definition analysis and taint analysis.
2. Write the function in Extractor.cpp that extracts Datalog facts
from LLVM IR .
Relations for Datalog Analysis. The skeleton code provides the definitions of necessary Datalog relations over LLVM IR in Extractor.h. In the following subsection, we will show how to represent LLVM IR program using these relations.
The relations for def and use of variables are as follows:
● Def(X,Y) and Use(X,Y) : Variable X is defined (rsp., used) at instruction Y.
The relations for the reaching definition analysis are as follows:
=== Reaching Definition (Out) ===
=== Reaching Definition (In) ===
make
oneTimeSetup.bash.
● ● ●
and Gen(X,Y) : Definition X is killed (rsp., generated) by instruction Y.
: Instruction Y is an immediate successor of instruction X.
and Out(X,Y) : Definition Y may reach the program point just before (rsp., after)
instruction X.
Kill(X,Y)
extractContraints
Instruction
Next(X,Y)
In(X,Y)
Note that the Kill relation can be derived by using the Def relation.
You will use these relations to build rules for the reaching definition analysis in Extractor.cpp.
Encoding LLVM Instruction in Datalog. Recall that, in LLVM IR, values and instructions are interchangable. Therefore, all variables X, Y, and Z are an instance of LLVM’s Value class.
Assume that input C programs do not have pointer variables. Therefore, we abuse pointer variables in LLVM IR as their dereference expressions. Consider the following simplified LLVM program from a simple C program int x = 0; int y = x;:
We apply these rules to create Def and Use facts:
● alloca instructions will represent variables
● Each instruction “S” defines its second argument “A”, creating
● Each instruction “L” uses its single argument “A”, creating
Our and facts for the above program are:
You will find more LLVM instruction types in the test programs, most importantly call and sdiv. You will need to reason about the ways to create the appropriate Def and facts for these instruction types.
Defining Datalog Rules from C++ API. You will write your Datalog rules in the function initialize using the above relations. Consider an example Datalog rule:
A(X, Y) :- B(X, Z), C(Z, Y).
This rule corresponds to the following formula: ∀X, Y , Z. B(X, Z) ⋀ C(Z, Y ) ⇒ A(X, Y ).
x = alloca i32 ;I0 y = alloca i32 ;I1 store i32 0, i32* x ;I2 a=loadi32,i32*x ;I3 store i32 a, i32* y ;I4
store
Def(A,S)
load
Use(A,L)
Def
Use
Def(I0, I2)
Use(I0,I3)
Def(I1, I4)
Use(I1,I4)
Use
In Z3, you can specify the formula in the following sequence of APIs in initialize. Assume ctx and Solver are configured as shown in Extractor.cpp. They represent instances of a Z3Context and a fixedpoint constraint solver, respectively.
/* Declare functions */ Z3::func_decl A = ctx.function(“A”, Z3::func_decl B = ctx.function(“B”, Z3::func_decl C = ctx.function(“C”, /* Declare quantified variables */ z3::expr X = ctx.bv_const(“X”, 32); z3::expr Y = ctx.bv_const(“Y”, 32); z3::expr Z = ctx.bv_const(“Z”, 32); /* Define and register rules */
ctx.bv_sort(32)); ctx.bv_sort(32)); ctx.bv_sort(32));
// encode X as a 32-bit bitvector (bv)
z3::expr R0 = z3::forall(X, Y, Z, z3::implies(B(Y,Z) && C(Z, Y), A(X,Y))); Solver->add_rule(R0, ctx.str_symbol(“R0”));
Study the above pattern with forall and implies closely, as you can adapt it to express all the Datalog relations required to complete this lab.
Extracting Datalog Facts. You will need to implement the function extractConstraints in Extractor.cpp to extract Datalog facts for each LLVM instruction. The skeleton code provides a couple of auxiliary functions in src/Extract.cpp and src/Utils.cpp help you with this task:
– void addX(const InstMapTy &InstMap, …)
– X denotes the name of a relation. These functions adds a fact of X to the solver. It takes InstMap that encodes each LLVM instruction as an integer. This map is initialized in the main function.
– vector
– Returns a set of predecessors of a given LLVM instruction I.
Miscellaneous. For convenience for easy debugging, you can use the print function in Extract.cpp. If the -d option is passed through the command line constraint simple0.ll -d), it will print out tuples of several relations. You can extend the function for your purpose.
Format of Input Programs
Input programs in this lab are assumed to use a strict subset of the C language as follows:
– All variables used or defined are signed integers
– Function calls have either zero or one arguments
– Control flow is limited to if, else, and while
Example Input and Output
Your analyzer should run on LLVM IR. For example:
Your analyzer should print out the corresponding LLVM instructions for Out and In relations. The output below is only a sample, many more facts will be printed from complete solutions.
Note that we will be grading your output, so please do not change the format of the output or produce any additional output in your submission.
$ cd ~/LLVMDatalog/test
$ clang -emit-llvm -S -fno-discard-value-names -c simple0.c $ ../build/constraint simple0.ll
=== Reaching Definitions (Out) ===
Out(“ret i32 0”, “store i32 0, i32* %retval, align 4”) Out(“ret i32 0”, “%call = call i32 (…) @tainted_input()”) Out(“ret i32 0”, “store i32 %call, i32* %x, align 4”)
=== Reaching Definitions (In) ===
In(“ret i32 0”, “store i32 0, i32* %retval, align 4”) In(“ret i32 0”, “%call = call i32 (…) @tainted_input()”) In(“ret i32 0”, “store i32 %call, i32* %x, align 4”)
Grading
Generally, credit for this lab is awarded as follows, very similar to Lab 2:
● The In and Out relations of each test program will be compared against the correct
values, with each equally weighted
● For each test program, your score will be [# expected tuples]−[# missing tuples]−[# extra tuples] % of
[# expected tuples]
the total possible points for that set. For example, an output with 10 expected tuples
where you found all 10 expected tuples but also two additional tuples would score
10−0−2 % = 80% of the possible credit 10
● We may use additional test programs for grading not supplied in the lab zip. If used, these additional test programs would be of similar size and complexity to the ones you have seen.
Items to Submit
Submit your single file Extractor.cpp via Canvas. Do not submit a zipfile, or any extraneous files.