Dependence and Data Flow Models
(c) 2007 Mauro Pezzè & Michal Young Ch 6, slide 1
Why Data Flow Models?
• Models from Chapter 5 emphasized control • Control flow graph, call graph, finite state machines
• We also need to reason about dependence • Where does this value of x come from?
• What would be affected by changing this? • Where is this value used?
• Many program analyses and test design techniques use data flow information
– Often in combination with control flow
• Example: Taint analysis to prevent SQL injection attacks • Example: Dataflow test criteria (Ch.13)
(c) 2007 Mauro Pezzè & Michal Young Ch 6, slide 2
SQL Injection
• statement = “SELECT * FROM users WHERE name ='” + userName + “‘;”
– to attack, set username = ‘ OR ‘1’=’1
– result: SELECT * FROM users WHERE name = ” OR ‘1’=’1′; – gives the list of all users
• adding commands:
– username = 1;DROP TABLE users – will delete all users
• one of the top attacks on sensitive databases
(c) 2007 Mauro Pezzè & Michal Young Ch 6, slide 3
Learning objectives
• Understand basics of data-flow models and the related concepts (def-use pairs, dominators…)
• Understand some analyses that can be performed with the data-flow model of a program
– The data flow analyses to build models – Analyses that use the data flow models
• Understand basic trade-offs in modeling data flow
– variations and limitations of data-flow models and analyses, differing in precision and cost
(c) 2007 Mauro Pezzè & Michal Young Ch 6, slide 4
Def-Use Pairs (1)
• A def-use (du) pair associates a point in a program where a value is produced with a point where it is used
• Definition: where a variable gets a value
– Variable declaration (often the special value “uninitialized”)
– Variable initialization
– Assignment
– Values received by a parameter
• Use: extraction of a value from a variable
– Expressions
– Conditional statements
– Parameter passing
– Returns
(c) 2007 Mauro Pezzè & Michal Young
Ch 6, slide 5
Def-Use Pairs (2)
…
if (…) {
x = … ;
y = … + x + … ;
Def-Use path
…
if (…) {
x = … …
y = … + x + … …
Definition: x gets a value
Use: the value of x is extracted
… }
(c) 2007 Mauro Pezzè & Michal Young
Ch 6, slide 6
Def-Use Pairs (3)
/** Euclid’s algorithm */ public class GCD
{
public int gcd(int x, int y) {
int tmp; // A: def x, y while(y!=0){ //B:usey
tmp=x%y; //C:deftmp;usex,y
x = y;
y = tmp;
return x; }
// D: def x; use y // E: def y; use tmp
// F: use x
}
def = {x,y} use = {}
Figure 6.2, page 79
(c) 2007 Mauro Pezzè & Michal Young
Ch 6, slide 7
Def-Use Pairs (4)
• A definition-clear path is a path along the CFG from a definition to a use of the same variable without another definition of the same variable between
– If, instead, another definition is present on the path, then the latter definition kills the former
• A def-use pair is formed if and only if there is a definition-clear path between the definition and the use
(c) 2007 Mauro Pezzè & Michal Young Ch 6, slide 8
Definition-Clear or Killing
…
x=… //A:defx
q = …
x=y; // B:killx,defx z = …
y=f(x); //C:usex
A
…
Definition: x gets a value
Definition: x gets a new value, old value is killed
Use: the value ofxis extracted
Path A..C is
not definition-clear
Path B..C is definition-clear
B
x = …
x = y …
C y=f(x)
(c) 2007 Mauro Pezzè & Michal Young
Ch 6, slide 9
(Direct) Data Dependence Graph
• A direct data dependence graph is:
– –
Nodes: as in the control flow graph (CFG)
Edges: def-use (du) pairs, labelled with the variable name
(Figure 6.3, page 80)
x y
Dependence edges show this
x value could be the unchanged parameter or could be set at line D
(c) 2007 Mauro Pezzè & Michal Young
Ch 6, slide 10
Control dependence (1)
• Data dependence: Where did these values come from?
• Control dependence: Which statement controls whether this statement executes?
– Nodes: as in the CFG
– Edges: unlabelled, from entry/branching points to controlled blocks
(c) 2007 Mauro Pezzè & Michal Young Ch 6, slide 11
Dominators
• Pre-dominators in a rooted, directed graph can be used to make this intuitive notion of controlling decision precise.
• Node M dominates node N if every path from the root to N passes through M.
– A node will typically have many dominators, but except for the root, there is a unique immediate dominator of node N which is closest to N on any path from the root, and which is in turn dominated by all the other dominators of N.
– Because each node (except the root) has a unique immediate dominator, the immediate dominator relation forms a tree.
• Post-dominators: Calculated in the reverse of the control flow graph, using a special exit node as the root.
(c) 2007 Mauro Pezzè & Michal Young Ch 6, slide 12
Dominators (example)
A B
CE DF
G
• A pre-dominates all nodes; G post-dominates all nodes
• F and G post-dominate E
• G is the immediate post- dominator of B
– C does not post-dominate B
• B is the immediate pre- dominator of G
– F does not pre-dominate G
(c) 2007 Mauro Pezzè & Michal Young
Ch 6, slide 13
Control dependence (2)
• We can use post-dominators to give a more precise definition of control dependence:
– Consider again a node N that is reached on some but not all execution paths.
– There must be some node C with the following property:
• C has at least two successors in the control flow graph (i.e., it represents a control flow decision);
• C is not post-dominated by N
• there is a successor of C in the control flow graph that is post- dominated by N.
– When these conditions are true, we say node N is control- dependent on node C.
• Intuitively: C was the last decision that controlled whether N executed
(c) 2007 Mauro Pezzè & Michal Young Ch 6, slide 14
Control Dependence
A B
CE DF
G
Execution of F is not inevitable at B
Execution of F is inevitable at E
F is control-dependent on B, the last point at which its execution was not inevitable
(c) 2007 Mauro Pezzè & Michal Young
Ch 6, slide 15
These concepts are used in the root cause analysis
Data Flow Analysis
Computing data flow information
(c) 2007 Mauro Pezzè & Michal Young Ch 6, slide 16
Calculating def-use pairs
• Definition-use pairs can be defined in terms of paths in the program control flow graph:
– There is an association (d,u) between a definition of variable v at d and a use of variable v at u iff
• there is at least one control flow path from d to u
• with no intervening definition of v.
– vd reaches u (vd is a reaching definition at u).
– If a control flow path passes through another definition e of the same variable v, ve kills vd at that point.
• Even if we consider only loop-free paths, the number of paths in a graph can be exponentially larger than the number of nodes and edges.
• Practical algorithms therefore do not search every individual path. Instead, they summarize the reaching definitions at a node over all the paths reaching that node.
(c) 2007 Mauro Pezzè & Michal Young Ch 6, slide 17
Exponential paths (even without loops)
ABCDEFGV
2 paths from A to B 4 from A to C
8 from A to D
16 from A to E
…
128 paths from A to V
Tracing each path is not efficient, and we can do much better.
(c) 2007 Mauro Pezzè & Michal Young
Ch 6, slide 18
DF Algorithm
• An efficient algorithm for computing reaching definitions (and several other properties) is based on the way reaching definitions at one node are related to the reaching definitions at an adjacent node.
• Suppose we are calculating the reaching definitions of node n, and there is an edge (p,n) from an immediate predecessor node p.
– If the predecessor node p can assign a value to variable v, then the definition vp reaches n. We say the definition vp is generated at p.
– If a definition vp of variable v reaches a predecessor node p, and if v is not redefined at that node (in which case we say the vp is killed at that point), then the definition is propagated on from p to n.
(c) 2007 Mauro Pezzè & Michal Young Ch 6, slide 19
Equations of node E (y = tmp)
Calculate reaching definitions at E in terms of its immediate predecessor D
public class GCD {
public int gcd(int x, int y) {
int tmp; // A: def x, y while(y!=0){ //B:usey
tmp=x%y; //C:deftmp;usex,y
x = y;
y = tmp;
return x; }
// D: def x; use y
// E: def y; use tmp
// F: use x
}
Reach(E) = ReachOut(D)
ReachOut(E) = (Reach(E) \ {yA}) È {yE}
(c) 2007 Mauro Pezzè & Michal Young
Ch 6, slide 20
Equations of node B (while (y != 0))
This line has two predecessors: Before the loop, end of the loop
public class GCD {
public int gcd(int x, int y) {
int tmp; // A: def x, y, tmp while(y!=0){ //B:usey
tmp=x%y; //C:deftmp;usex,y
x = y;
y = tmp;
return x; }
// D: def x; use y // E: def y; use tmp
// F: use x
}
• Reach(B) = ReachOut(A) È ReachOut(E) • ReachOut(A) = gen(A) = {xA, yA, tmpA}
• ReachOut(E) = (Reach(E) \ {yA}) È {yE}
(c) 2007 Mauro Pezzè & Michal Young
Ch 6, slide 21
Summary
• Data flow models detect patterns on CFGs:
– Nodes initiating the pattern
– Nodes terminating it
– Nodes that may interrupt it
• Often, but not always, about flow of information (dependence)
• Pros:
– Can be implemented by efficient iterative algorithms
– Widely applicable (not just for classic “data flow” properties)
• Limitations:
– Unable to distinguish feasible from infeasible paths
– Analyses spanning whole programs (e.g., alias analysis) must trade off precision against computational cost
(c) 2007 Mauro Pezzè & Michal Young Ch 6, slide 22
Home reading
• Chapter 6, Sections 6.1 and 6.2 of the book Software Testing and Analysis, by Mauro Pezze and Michal Young
– Data Flow
(c) 2007 Mauro Pezzè & Michal Young Ch 1, slide 23