{HEADSHOT}
The field of software analysis is highly diverse: there are many different approaches, each with its own strengths and limitations in aspects such as soundness, completeness, scalability, and applicability.
We will learn about a dominant approach to software analysis called constraint-based analysis.
Constraint-based analysis follows a declarative paradigm: it is concerned with expressing “what” the analysis computes rather than “how” the analysis computes it.

In other words, constraint-based analysis is concerned with the specification of the analysis, rather than the implementation of the analysis.
The analysis specification takes the form of constraints over program facts, while the analysis implementation involves solving these constraints using an off-the-shelf constraint solver.
This separation of concerns has many benefits: it simplifies the design and understanding of the analysis, it allows analyses to be prototyped rapidly, and it lets us leverage continual performance improvements in constraint solvers.
We will illustrate these benefits on classical dataflow analysis problems using Datalog, a constraint programming language.

Designing an efficient program analysis is a challenging task. It involves dealing with both the specification of the analysis (that is, what information the analysis must compute) and the implementation of the analysis (that is, the details of how the analysis should compute that information efficiently).
For example, in a null-pointer dereference checking analysis, the specification might be, “No null pointer is dereferenced along any path in the program.” As for the implementation, there are several design choices that affect the efficiency of the analysis, such as whether to use a forward vs. a backward traversal of the program, whether to use symbolic vs. explicit representations of the program’s state, and many others.

Even the first choice, whether to traverse the program forward or backward, is a nontrivial decision.
Consider for instance a null-pointer dereference checking analysis.
A forward traversal involves starting at locations in the program where pointers are set to null and checking if they can flow to locations in the program where pointers are dereferenced. A backward traversal involves doing the opposite, that is, starting at locations in the program where pointers are dereferenced, and checking if locations where pointers are set to null can reach them.
It is easy to see that, if a program does not set any pointers to null, then forward traversal is more efficient. On the other hand, if the program does not dereference any pointers, then backward traversal is more efficient. In practice, programs contain a mix of both null pointer assignments and pointer dereferences, making it challenging to determine the most efficient traversal strategy.

In constraint-based analysis, the analysis designer defines the specification of the program analysis using what is called a constraint language, and a constraint solver automates the implementation of the analysis.

This approach to program analysis has several benefits.
Because the analysis specification is separated from the implementation, analysis designers can focus their efforts on specifying what information the analysis must compute, rather than implementing how the analysis should compute that information efficiently.
Another benefit of constraint-based analysis is that it yields natural program specifications: just like types in a type system, constraints are usually defined locally, and solving their conjunction captures global properties about the program.
Finally, the modularization of the program analysis task into a specification and an implementation sub-problem allows the specification to be agnostic of the implementation. In other words, we can “plug-and-play” powerful, off-the-shelf constraint solvers, giving us flexibility that would otherwise not be available.

{QUIZ SLIDE}
To illustrate the difference between the specification and the implementation of a program analysis, let’s look at the following quiz. Consider a dataflow analysis such as live variables analysis. If this analysis is expressed as a constraint-based analysis, which of the following must the analysis designer still decide upon?
– The order in which statements should be processed
– What the gen and kill sets for each kind of statement are
- In what language to implement the chaotic iteration algorithm
– Whether to take intersecBons or unions at merge points
Check all that apply.

{SOLUTION SLIDE}
Recall that, when using a constraint-based analysis, the user only needs to decide aspects of the specification, not the implementation. Therefore, the answers to this quiz are those that are part of the specification of live variables analysis instead of its implementation. Let’s consider each statement in turn.
The order in which statements should be processed: this is an implementation aspect, as changing the order in which statements are processed would not change the outcome of the analysis. Therefore this is an aspect the constraint solver would determine, so the analysis designer does not need to decide this.
What the gen and kill sets for each kind of statement are: this is a specification aspect. Choosing different gen and kill sets would affect the outcome of the analysis. Therefore the analysis designer needs to decide on this aspect.
In what language to implement the chaotic iteration algorithm: this choice won’t affect the final outcome of the analysis, so it’s another implementation aspect that the analysis designer is not responsible for.
Whether to take the intersection or union at merge points: switching between intersection and union changes the type of analysis that is being done, so this is a specification detail that the analysis designer needs to decide.

Here are the topics we will consider in the remainder of this lesson.
Next, you will learn a language called Datalog that can be used to specify constraint-based analyses.
Once you have learned the basics of Datalog, you will see how to use it to specify two kinds of static analyses:
First, you will see how to specify an intra-procedural analysis in Datalog, that is, an analysis that is restricted to a single procedure. In particular, you will see how to specify computing reaching definitions.
Then, you will see how to define an inter-procedural analysis in Datalog, that is, an analysis of a program involving multiple procedures. In particular, you will see how to specify computing points-to information. You will also see the extra complexities inherent in defining these kinds of analyses.

Datalog is a declarative logic programming language.
It is not a Turing-complete language: it can be viewed as a subset of Prolog, or as SQL with recursion. Efficient algorithms exist to evaluate programs in these languages, so there exist efficient algorithms to evaluate Datalog programs.
Datalog originated as a query language for deductive databases. It was later applied in many other domains, including software analysis, data mining, networking, security, knowledge representation, and cloud computing, among others.
There are many implementations of Datalog. Some of the implementations available include LogicBlox, bddbddb, IRIS, and Paddle.
You can learn more about Datalog using the resources linked in the Instructor Notes.
[http://www.utdallas.edu/~gupta/courses/acl/papers/datalog-paper.pdf and online book at webdam.inria.fr/Alice/]

We will now present the syntax of Datalog by means of an example program that computes reachability in a directed graph.
The problem of graph reachability is to determine all pairs of nodes in a graph that are connected by a path.
To express this problem as a program in Datalog, we need to define three things:
– the form of the input to the Datalog program,
– the form of the output of the Datalog program, and
– the rules of inference comprising the Datalog program that compute the output from the input.

A Datalog program’s inputs and outputs are defined in terms of relations, which are declarative statements that some number of objects are related in some way.
A relation is similar to a table in a relational database, and a tuple in a relation is similar to a row in the table: it asserts that the relation holds among some number of objects.

For the graph-reachability problem, the input is a single binary relation called edge(n:N, m:N), where n and m are variables of type Node, denoted by N, the set of all nodes. This relation encodes the edges in the input graph. For example, for the graph shown here, the edge relation contains tuples (0,1) and (2,3), but not tuples (3,4), (0,3), or (2,0).
The four tuples (0,1), (0,2), (2,3), and (2,4) are sufficient to establish the entire structure of the graph.
The output of this Datalog program is a single binary relation called path(n:N, m:N), which is true iff there is a directed path in the graph from n to m. So, for the graph shown, the path relation should contain tuples (0,4) and (0,3), but not tuples (3,0) or (1,4).
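To make the notion of a relation concrete, a relation can be modeled simply as a set of tuples. Here is a minimal Python sketch (the variable names are illustrative, not part of any Datalog system):

```python
# A relation is a set of tuples; each tuple asserts that the relation
# holds among those objects (like a row in a database table).
edge = {(0, 1), (0, 2), (2, 3), (2, 4)}  # input relation for the example graph

# Membership tests mirror the claims made about the example graph:
print((0, 1) in edge)  # True: the graph has an edge from 0 to 1
print((3, 4) in edge)  # False: no such edge
```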

In order for the Datalog program to compute the output relations from the input relations, we must provide rules of inference. These are deductive rules that hold universally. They specify logical “if-then” statements.

The rules of inference that we will define for this problem are (in English):
First: There is always a path from each node x to itself, which in Datalog syntax takes the form
path(x, x).
Second: If there is a path from node x to node y and an edge from node y to node z, then there is a path from node x to node z. In Datalog syntax, this rule takes the form
path(x, z) :- path(x, y), edge(y, z).
The rules of inference are written in the opposite order from how implications are typically written: the hypothesis of an implication is written on the right-hand side, and the conclusion is written on the left-hand side. Relations separated by a comma are ANDed together. The first inference rule, because it has no hypotheses, acts as an axiomatic statement. Finally, a period is used to end each inference rule.

Now that you’re familiar with the syntax of Datalog programs, I will illustrate the semantics of Datalog programs, using the graph-reachability example. Conceptually, we start out with the empty path relation, and apply each of these two rules, growing the path relation with each application. We stop when the path relation stops growing.
A slight variant of this algorithm is depicted here [point to box]. It starts out by applying the first rule, which involves adding to the path relation each tuple (x, x) for each node x in the graph, capturing the intent of this rule that there exists a path from each node to itself. It then repeatedly applies the second rule, which involves adding to the path relation each tuple (x, z) whenever there exists a node y such that tuple (x, y) exists in the current path relation and tuple (y, z) exists in the input edge relation. This captures the intent of the second rule, that there exists a path from node x to node z if there exists a path from node x to some node y, and there exists an edge from that node y to node z.
This naive algorithm is essentially the chaotic iteration algorithm used for dataflow analyses and pointer analysis. In practice, Datalog solvers have much more efficient algorithms for computing the output relations from the input relations and inference rules. The key is that if there are multiple rules, the order in which the rules are applied does not matter.
Additionally, the result of the algorithm, like that of chaotic iteration, is the least solution: the smallest path relation that satisfies all the rules. The least solution typically corresponds to what the user wants to compute in many problems. An example of a non-least solution to this problem would be that path(x,y) holds for all nodes x and y. While this relation doesn’t violate any rules, it contains many nonsensical paths that would not be desired by a user.
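The naive algorithm just described can be sketched as a toy Python evaluator for these two rules (this is for illustration only; real Datalog engines use far more efficient techniques such as semi-naive evaluation):

```python
def reachability(nodes, edge):
    """Compute the least solution of the two rules:
         path(x, x).
         path(x, z) :- path(x, y), edge(y, z).
       Relations are represented as sets of tuples."""
    path = {(x, x) for x in nodes}  # first rule: a path from each node to itself
    while True:
        # second rule: extend known paths by one edge
        new = {(x, z) for (x, y) in path for (y2, z) in edge if y == y2}
        if new <= path:  # path relation stopped growing: least fixpoint reached
            return path
        path |= new

edge = {(0, 1), (0, 2), (2, 3), (2, 4)}
path = reachability(range(5), edge)
print((0, 4) in path, (3, 0) in path)  # True False
```

Note that the loop terminates because the path relation only grows and is bounded above by the finite set of all node pairs.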

Let’s look at a run of this Datalog program on an example input.
Suppose the input is the following directed graph, encoded by the following edge relation, which contains four tuples: (0,1), (0,2), (2,3), and (2,4).
The output of this Datalog program on this input is as follows:

Applying the first rule, path(x,x), produces all paths of length 0, represented by the following tuples in the path relation: (0,0), (1,1), (2,2), (3,3), and (4,4).

Applying the second rule at this time yields all paths of length 1, represented by the following tuples in the path relation: (0,1), (0,2), (2,3), and (2,4).

Applying the second rule again yields all paths of length 2: (0,3) and (0,4).
Because the path relation doesn’t change after applying either of these rules again, the algorithm terminates, yielding the least solution seen here.

{QUIZ SLIDE}
Let’s work on expressing another computation in Datalog in the form of a quiz. Suppose we want to compute the relation scc (standing for strongly connected component) on a directed graph from the input relations edge and path (as we defined them earlier), and suppose we want our Datalog program to output scc(n1, n2) if and only if n2 is reachable from n1 and n1 is reachable from n2.
Select each of the inference rules below that will compute the correct output:
scc(n1, n2) :- edge(n1, n2), edge(n2, n1).
scc(n1, n2) :- path(n1, n2), path(n2, n1).
scc(n1, n2) :- path(n1, n3), path(n3, n2), path(n2, n4), path(n4, n1).
scc(n1, n2) :- path(n1, n3), path(n2, n3).

{SOLUTION SLIDE}
Two of these inference rules, the second and the third, correctly compute the relation scc. The second rule is minimal in its expression of the scc relation, but the third rule still computes the same relation. To see this, recall that path(x,x) holds for all nodes x; therefore, by taking n3 equal to n1 and n4 equal to n2, the hypothesis is true if and only if path(n1,n2) and path(n2,n1) are true.
While the first rule will not produce any incorrect tuples in relation scc, it will fail to produce scc(n1,n2) for any two nodes that are reachable from each other but which are not adjacent to each other.
Finally, the last rule could potentially produce incorrect tuples: the fact that there exists some node n3 such that there is a path from n1 to n3 and a path from n2 to n3 is neither a necessary nor a sufficient condition for nodes n1 and n2 to belong to the same strongly connected component.
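These claims can be checked concretely with a small Python sketch (the graph below is my own, chosen so that two nodes are mutually reachable but not adjacent):

```python
# A 3-node cycle (0 -> 1 -> 2 -> 0) plus a tail node 3.
edge = {(0, 1), (1, 2), (2, 0), (2, 3)}
nodes = range(4)

# path: reflexive-transitive closure of edge (same fixpoint idea as before).
path = {(x, x) for x in nodes}
while True:
    new = {(x, z) for (x, y) in path for (y2, z) in edge if y == y2}
    if new <= path:
        break
    path |= new

# Second rule: scc(n1, n2) :- path(n1, n2), path(n2, n1).
scc2 = {(a, b) for (a, b) in path if (b, a) in path}

# First rule only relates mutually *adjacent* nodes, so it misses (0, 2),
# which lie on the same cycle but have no direct edges in both directions.
scc1 = {(a, b) for (a, b) in edge if (b, a) in edge}

print((0, 2) in scc2, (0, 2) in scc1)  # True False
```

The third rule, with n3 taken equal to n1 and n4 equal to n2, collapses to the second rule, so it computes the same relation as scc2 above.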

Now that we have seen the syntax and semantics of Datalog programs, we will consider how to use Datalog to specify an intra-procedural dataflow analysis; specifically, reaching definitions analysis.

The specification of reaching definitions analysis is as follows:
OUT[n] = (IN[n] – KILL[n]) ∪ GEN[n]
IN[n] = ⋃_{n' ∈ predecessors(n)} OUT[n']
where KILL[n] is the set of definitions killed at program point n, GEN[n] is the set of definitions generated at program point n, and predecessors(n) is the set of program points that immediately precede program point n in the input procedure’s control-flow graph.

Let us describe the form of the input and output relations as well as the inference rules that would be used to specify reaching definitions analysis in Datalog.

The input relations for the analysis should capture all the information from the input procedure’s control-flow graph that is relevant to computing the IN and OUT sets for each program point. While we haven’t yet formally defined the inference rules for the analysis, by looking at the specification we see that in order to compute OUT[n], we need to know KILL[n] and GEN[n], and in order to compute IN[n], we need to know predecessors(n). Therefore, the input relations should give Datalog’s constraint solver knowledge of the contents of the KILL, GEN, and predecessors sets. Moreover, all three of these relations can be computed from the control-flow graph of the procedure to be analyzed.
Let us define the relation kill(n:N, d:D) to mean that the definition d is in the KILL set of program point n …

the relation gen(n:N, d:D) to mean that the definition d is in the GEN set of program point n …

and the relation next(n:N, m:N) to mean that program point m is an immediate successor of program point n, or equivalently, that program point n is an immediate predecessor of program point m.
(In these relations, N denotes the set of all program points and D denotes the set of all definitions in the given control-flow graph.)

In reaching definitions analysis, we want to compute the IN and OUT sets for each program point. So let us say that in(n:N, d:D) is the relation that asserts that the definition d is a member of the IN set of program point n, that is, definition d may reach the program point just before n …

… and let us define out(n:N, d:D) to mean that definition d is a member of the OUT set of program point n, that is, definition d may reach the program point just after n.

Lastly, we specify three rules of inference to compute the IN and OUT sets. These will be based on the formulas for OUT[n] and IN[n] shown here:
out(n, d) :- gen(n, d).
out(n, d) :- in(n, d), !kill(n, d).
in(m, d) :- out(n, d), next(n, m).
The first two rules map to the first rule in the specification, which is OUT[n] = (IN[n] – KILL[n]) ∪ GEN[n]. We use two separate rules to reflect the union of (IN[n] – KILL[n]) with GEN[n], and we use the ‘!’ character to mean that the relation kill(n, d) does not hold. This represents the fact that IN[n] – KILL[n] is the intersection of IN[n] with the complement of KILL[n].

Finally, the third rule maps to the second rule in the specification, which is IN[n] = ⋃_{n' ∈ predecessors(n)} OUT[n']. Because this expression is a union, we only need a single inference rule to ensure all definitions are correctly added to the appropriate IN set. For each predecessor n of program point m, each definition d in the OUT set of that predecessor n will satisfy the hypothesis of this third inference rule.
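These three rules can be evaluated with the same naive fixpoint strategy we used for graph reachability. Here is a Python sketch (the function name and representation are mine; the negation on kill is safe here because kill is an input relation, fully known before the loop starts):

```python
def reaching_defs(gen, kill, nxt):
    """Least solution of the three rules:
         out(n, d) :- gen(n, d).
         out(n, d) :- in(n, d), !kill(n, d).
         in(m, d)  :- out(n, d), next(n, m).
       All relations are sets of (program point, definition) tuples."""
    out, in_ = set(gen), set()
    while True:
        # third rule: propagate OUT facts along control-flow edges
        new_in = {(m, d) for (n, d) in out for (n2, m) in nxt if n == n2}
        # first two rules: gen'd definitions plus surviving incoming ones
        new_out = gen | {(n, d) for (n, d) in in_ if (n, d) not in kill}
        if new_in <= in_ and new_out <= out:
            return in_, out  # both relations stopped growing
        in_ |= new_in
        out |= new_out
```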

Let’s look at an example run of our reaching definitions analysis specified in Datalog. Consider this control-flow graph:
Program point 1 is the entry point to the procedure. It has a single transition to program point 2, which contains the statement x = 8. Program point 2 has a single transition to program point 3, which is a test of the boolean expression (x != 1). If this expression is true, control flows to program point 4, which contains the definition x = x – 1 and then transitions to program point 3 again. If the boolean expression at program point 3 is false, control flows to program point 5, which is the exit point of the procedure.

Recall that the inputs to reaching definitions analysis are the relations kill(n, d), gen(n, d), and next(n, m). Because each definition is associated with a program point, we will label the definitions by the number of the program point they appear at (so the definition x = 8 will be denoted by letting d equal 2 in the relations, and the definition x = x – 1 will be denoted by letting d equal 4).
There are only two tuples in the gen relation of this control-flow graph: (2, 2) and (4, 4), as no other program points establish any variable definitions.
The next relation can be computed from the directed edges in the control-flow graph. Each edge corresponds to a tuple in this relation. This relation thus contains the following tuples: (1, 2), (2, 3), (3, 4), (3, 5), and (4, 3).
The kill relation is a bit more difficult to compute. A definition d is in the KILL set of a program point n if there is a definition other than d generated at n and there is a directed path from the point where d is generated to n. Using the graph-reachability analysis we described earlier (for example), we can compute that the only such tuple that is applicable for this procedure is (4, 2), capturing the fact that the definition associated with program point 2 is killed at program point 4.

Given these input relations and the inference rules described previously, the output relations produced are as follows:
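The original slide is truncated at this point, but the output relations can be reconstructed by evaluating the three rules on these inputs with the naive fixpoint strategy. The sketch below is my own reconstruction, so it is worth double-checking against the lecture video:

```python
# Input relations for the example control-flow graph.
gen  = {(2, 2), (4, 4)}
kill = {(4, 2)}
nxt  = {(1, 2), (2, 3), (3, 4), (3, 5), (4, 3)}

# Naive fixpoint evaluation of the three inference rules.
in_, out = set(), set(gen)
while True:
    new_in  = {(m, d) for (n, d) in out for (n2, m) in nxt if n == n2}
    new_out = gen | {(n, d) for (n, d) in in_ if (n, d) not in kill}
    if new_in <= in_ and new_out <= out:
        break
    in_ |= new_in
    out |= new_out

print(sorted(in_))  # [(3, 2), (3, 4), (4, 2), (4, 4), (5, 2), (5, 4)]
print(sorted(out))  # [(2, 2), (3, 2), (3, 4), (4, 4), (5, 2), (5, 4)]
```

In words: definition 2 (x = 8) may reach points 3, 4, and 5, but it is killed at point 4 and therefore does not appear in OUT[4]; definition 4 (x = x – 1) may reach points 3, 4, and 5.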
