Static Program Analysis
Part 1 – the TIP language
http://cs.au.dk/~amoeller/spa/
øller & . Schwartzbach Computer Science, Aarhus University
Copyright By PowCoder代写 加微信 powcoder
Questions about programs
• Does the program terminate on all inputs?
• How large can the heap become during execution?
• Can sensitive information leak to non-trusted users?
• Can non-trusted users affect sensitive information?
• Are buffer-overruns possible?
• Data races?
• SQL injections?
• XSS? •…
foo(p,x) {
if (*p==0) { f=1; }
q = alloc 10; *q = (*p)-1; f=(*p)*(x(q,x));
return f; }
Invariants:
any point in the program = any value of the PC
Program points
A property holds at a program point if it holds in any such state for any execution with any input
Questions about program points
• Will the value of x be read in the future?
• Can the pointer p be null?
• Which variables can p point to?
• Is the variable x initialized before it is read?
• What is a lower and upper bound on the value of the integer variable x?
• At which program points could x be assigned its current value?
• Do p and q point to disjoint structures in the heap?
• Can this assert statement fail?
Why are the answers interesting?
• Increaseefficiency
– resource usage
– compiler optimizations
• Ensurecorrectness – verify behavior
– catch bugs early
• Support program understanding
• Enablerefactorings
“Program testing can be used to show the presence of bugs, but never to show their absence.”
[Dijkstra, 1972]
Nevertheless, testing often takes 50% of the development cost
Programs that reason about programs
a program analyzer A
P always works correctly
a program P
P fails for some inputs
Requirements to the perfect program analyzer
SOUNDNESS (don’t miss any errors) COMPLETENESS (don’t raise false alarms)
TERMINATION (always give an answer)
Rice’s theorem, 1953
Rice’s theorem
Any non-trivial property of the behavior of programs in a Turing-complete language is undecidable!
Reduction to the halting problem
• Can we decide if a variable has a constant value? x = 17; if (TM( j)) x = 18;
• Here, x is constant if and only if the j’th Turing machine does not halt on empty input
Undecidability of program correctness
Is the FAIL state
unreachable in
the given program
(for any input)?
) from e(T)
(ignoring input
Build e(S T
Simulate T on input e(T)
Does M accept input e(M)?
If simulation reaches ACCEPT, then (without reaching FAIL)
goto Otherwise, just terminate or loop forever
Undecidability of program correctness
Is the FAIL state
unreachable in
the given program
(for any input)?
Does M accept input e(M)?
) from e(T)
Build e(S T
Simulate T on input e(T
If simulation reaches ACCEPT, then Otherwise, just terminate
(without reaching FAIL)
(Note: this proof works even if we only consider programs that always terminate!)
Approximation
• Approximate answers may be decidable!
• The approximation must be conservative:
– i.e. only err on “the safe side”
– which direction depends on the client application
• We’ll focus on decision problems
• More subtle approximations if not only “yes”/“no” – e.g. memory usage, pointer targets
False positives and false negatives
Example approximations
• Decide if a given function is ever called at runtime: – if “no”, remove the function from the code
– if “yes”, don’t do anything
– the “no” answer must always be correct if given
• Decide if a cast (A)x will always succeed: – if “yes”, don’t generate a runtime check
– if “no”, generate code for the cast
– the “yes” answer must always be correct if given
Beyond “yes”/“no” problems
• How much memory / time may be used in any execution?
• Which variables may be the targets of a pointer variable p?
The engineering challenge
• A correct but trivial approximation algorithm may just give the useless answer every time
• The engineering challenge is to give the useful answer often enough to fuel the client application
• … and to do so within reasonable time and space
• This is the hard (and fun) part of static analysis!
Bug finding
int main() {
char *p,*q;
printf(“%s”,p);
q = (char *)malloc(100);
p = (char *)malloc(100); p = (char *)malloc(100); q = p;
strcat(p,q);
gcc –Wall foo.c
lint foo.c
No errors!
https://en.wikipedia.org/wiki/Lint_(software)
Does anyone use static program analysis?
For optimization:
• every optimizing compiler and modern JIT
For verification or error detection:
A constraint-based approach
Conceptually separates the analysis specification from algorithmic aspects and implementation details
constraint solver
program to analyze
mathematical constraints
⟦p⟧ = &int ⟦q⟧ = &int ⟦alloc⟧ = &int ⟦x⟧ =
⟦&n⟧ = &int ⟦main⟧ = ()->int
Challenging features in modern programming language
• Higher-orderfunctions
• Mutable records or objects, arrays
• Integerorfloating-pointcomputations
• Dynamicdispatching
• Inheritance
• Exceptions
• Reflection •…
The TIP language
• TinyImperativeProgramminglanguage
• Example language used in this course:
– minimal C-style syntax
– cut down as much as possible
– enough features to make static analysis challenging and fun
• Scalaimplementationavailable
Expressions
Exp Int | Id
| Exp+Exp | Exp–Exp | Exp*Exp | Exp/Exp | Exp>Exp | Exp==Exp
| (Exp) | input
• I ∈ Int represents an integer literal
• X ∈ Id represents an identifier (x, y, z,…)
• input expression reads an integer from the input stream
• comparison operators yield 0 (false) or 1 (true)
Statements
Stm Id=Exp;
| outputExp;
| if (Exp) {Stm} [else {Stm}]?
| while (Exp) { Stm }
• In conditions, 0 is false, all other values are true
• The output statement writes an integer value to the output stream
• Functions take any number of arguments and return a single value:
• The optional var block declares a collection of uninitialized variables
• Function calls are an extra kind of expressions:
Fun Id(Id,…,Id){ [var Id, …, Id;]?
return Exp; }
Exp … | Id(Exp,…,Exp)
| alloc Exp
| * Exp | null
Stm … | *Exp = Exp;
(No pointer arithmetic)
Records are passed by value (like structs in C)
For simplicity, values of record fields cannot themselves be records
| { Id:Exp, …, Id:Exp }
| Id.Id=Exp;
| (*Exp).Id = Exp;
Functions as values
• Functions are first-class values
• The name of a function is like a variable that refers to
that function
• Generalizedfunctioncalls:
• Functionvaluessufficetoillustratethemainchallenges with methods (in object-oriented languages)
and higher-order functions (in functional languages)
Exp … | Exp( Exp, …, Exp )
• A program is a collection of functions
• Thefunctionnamedmaininitiatesexecution – its arguments are taken from the input stream
– its result is placed on the output stream
• Fun … Fun
An iterative factorial function
while (n>0) {
n = n-1; }
return f; }
A recursive factorial function
if (n==0) {
f=n*rec(n-1);
return f; }
An unnecessarily complicated function
foo(p,x) {
if (*p==0) {
q = alloc 0;
*q = (*p)-1;
f=(*p)*(x(q,x));
return f; }
n = input;
return foo(&n,foo);
Beyond TIP
Other common language features in mainstream languages:
• globalvariables • objects
• nested functions •…
Control flow graphs
ite(n) { var f; f = 1;
while (n>0) {
return f; }
Control flow graphs
• A control flow graph (CFG) is a directed graph:
– nodes correspond to program points
(either immediately before or after statements)
– edges represent possible flow of control
• A CFG always has
– a single point of entry
– a single point of exit
(think of them as no-op statements)
• Let v be a node in a CFG
– pred(v) is the set of predecessor nodes – succ(v) is the set of successor nodes
CFG construction (1/3)
• For the simple while fragment of TIP, CFGs are constructed inductively
• CFGs for simple statements etc.:
CFG construction (2/3)
For a statement sequence S1 S2:
– eliminate the exit node of S1 and the entry node of S2 – glue the statements together
CFG construction (3/3)
Similarly for the other control structures:
true false
false true
Normalization
Sometimes convenient to ensure that
each CFG node performs only one operation
Normalization: flatten nested expressions, using fresh variables
x = f(y+3)*5;
t2 = f(t1);
程序代写 CS代考 加微信: powcoder QQ: 1823890830 Email: powcoder@163.com