Statistical Debugging CS 6340
{HEADSHOT}
Despite our best efforts, even after extensive testing, we as software developers rarely put out bug-free code. In this lesson, we will look at a type of debugging technique used throughout the software industry: statistical debugging. Statistical debugging harnesses the power of the user base to catch the bugs that slip through the in-house testing process.
Based on dynamic analysis, this debugging technique collects data about the program’s behavior in user runs, and transmits this data back to a centralized server, where the developers of the program can analyze the collected data to deduce what program behaviors are predictors of program crashes, and focus bug triaging and bug fixing efforts accordingly.
In this lesson, we will look at the process that enables us to leverage this source of real-world testing data.
Motivation
• Bugs will escape in-house testing and analysis tools
– Dynamic analysis (i.e. testing) is unsound
– Static analysis is incomplete
– Limited resources (time, money, people)
• Software ships with unknown (and even known) bugs
We live in an imperfect world with imperfect software. Despite extensive resources devoted to testing and debugging software, you can safely bet that bugs will escape in-house testing and quality assurance.
There are theoretical and practical reasons for this situation. Dynamic analysis (that is, testing) of computer software is unsound in that it may miss real bugs. On the other hand, static analysis of source code is incomplete in that it may report false bugs. And software development is bounded by constraints on resources: time, money, and people.
Sometimes that means we don’t catch all the bugs in our software, and users end up finding bugs. And sometimes it means we have to ship software with known bugs because we just can’t address them all.
An Idea: Statistical Debugging
• Monitor Deployed Code
– Online: Collect information from user runs
– Offline: Analyze information to find bugs
• Effectively a “black box” for software
Statistical debugging is an idea that is becoming increasingly common in practice to address this problem.
When you’ve installed software such as Chrome or even an operating system such as OS X or Windows, you’ve likely seen messages like these — messages asking for your permission to send usage and crash reports to the company developing the software. Reports like these are an essential element of statistical debugging.
The idea is to use data from actual users’ runs of deployed code. This process has two stages: one online and the other offline. In the online stage, information is collected from the user’s runs: information about the machine’s state during the execution of the software as well as whether the run ends in success or failure. This information is then transmitted back to the development company. In the offline stage, the company analyzes the data transmitted by many different users to find bugs in the program.
Effectively, these runs are to software what “black boxes” or flight recorders are to airplanes: they record the data prior to a crash which can be analyzed later to determine the cause of the crash.
Benefits of Statistical Debugging
• Actual runs are a vast resource!
• Crowdsource-based Testing
– Number of real runs >> number of testing runs
• Reality-directed Debugging
– Real-world runs are the ones that matter most
The potential advantages of this method of “post-deployment” bug hunting are enticing.
Actual runs by users are a vast resource for two key reasons.
First, they allow us to effectively crowdsource testing to the user population, which not only saves the resources needed for testing software in-house, but also yields real runs that can greatly outnumber the test runs that can be performed in-house.
For reference, over 60 million licenses for Office XP were sold in its first year after release, which amounted to nearly 2 licenses per second. And the music-sharing software Kazaa was downloaded over 1.9 million times in a single week, amounting to 3 downloads per second. Imagine all of these users sending regular reports on usage statistics and crashes.
Secondly, actual runs allow debugging to be reality-directed. When allocating resources for testing and debugging before release, one is left to guess which features are most critical to test and which bugs are most critical to fix.
Post-deployment statistics provide data on which parts of the code and which bugs are most important to the users: the needs of the userbase can be used to define the allocation of resources.
Two Key Questions
• How do we get the data?
• What do we do with it?
There are two key questions that must be answered to effectively implement the idea of statistical debugging:
How can we get the data from the users in an efficient way?
Once we have the data, how can we analyze it in a way that helps us with our goal of finding and debugging the bugs that most affect users?
Practical Challenges
[Diagram: a remote client sends data over the Internet to a central server]
1. Complex systems
– Millions of lines of code
– Mix of controlled and uncontrolled code
2. Remote monitoring constraints
– Limited disk space, network bandwidth, power, etc.
3. Incomplete information
– Limit performance overhead
– Privacy and security
Answering these two questions requires addressing considerable practical challenges. Let’s look at each of these challenges one at a time.
First, the systems we are debugging are complex. They are typically millions of lines of code. They comprise many components, some of which we can monitor, but others that we cannot. Additionally, the code may be executed across several threads that interact in complex and unexpected ways. Despite these challenges, we must be able to collect data that helps us to efficiently determine which portions of the code and what elements of the program state are relevant to debugging.
Second, remote monitoring is constrained by a limited amount of disk space to store user data (for example, when we are monitoring an app on users' smartphones), by a limit on the bandwidth of the network through which users will need to send their data, by the power consumed while storing and sending the data, and so on. There might also be a cost to users for sending us lots of data over the network. So we must be smart in deciding what data to store and transmit.
Finally, we must live with the fact that we will have incomplete information to analyze. While it is possible in principle to store and send complete information about a program run, in practice this imposes severe performance overhead that makes the program unusable. Another reason why we are limited to incomplete information is that we must safeguard users’ privacy and security, which may be compromised if we transmit information over the network that includes sensitive data, such as passwords or credit card information.
The Approach
• Guess behaviors that are “potentially interesting”
– Compile-time instrumentation of program
• Collect sparse, fair subset of these behaviors
– Generic sampling framework
– Feedback profile + outcome label (success vs. failure) for each run
• Analyze behavioral changes in successful vs. failing runs to find bugs
– Statistical debugging
The strategy we will use to overcome these challenges consists of three steps.
In the first step, we will guess behaviors that are potentially interesting. A behavior is interesting for our purpose if it points us to a bug in the program being deployed. If we knew exactly which behaviors are interesting, there would be nothing left to do. The point is that we do not know exactly which behaviors are interesting. Therefore, we are going to start by guessing a large number of potentially interesting behaviors. We will achieve this by compile-time instrumentation of the program before deploying it. Instrumentation is the process of modifying the program’s source code or binary code with probes to monitor the potentially interesting behaviors.
In the second step, we will collect a sparse and fair subset of observations of these behaviors. We will design a generic framework for sampling whether or not to observe some aspect of the program’s state so that we do not overwhelm the users’ devices by storing and sending large amounts of data. The user’s report will then consist of a vector of these observations compressed to a very simple representation, called a feedback profile, plus a bit indicating whether the user’s run ended in success or failure, called the outcome label. This will be the entirety of the data we receive in the user report.
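To give a rough feel for the sampling idea, here is a minimal sketch in C. The sampling rate, the should_sample helper, and the names below are illustrative assumptions, not the actual framework the lesson builds.

#include <stdlib.h>

/* Hypothetical helper: returns 1 roughly once every SAMPLE_RATE checks,
   so most executions of an instrumentation site record nothing. */
#define SAMPLE_RATE 100
static int should_sample(void) {
    return rand() % SAMPLE_RATE == 0;
}

static unsigned site_7[2];      /* counts of a predicate observed false / true */

void instrumented_point(int pred) {
    if (should_sample())
        ++site_7[pred != 0];    /* record the observation only when sampled */
    /* ... original program code continues ... */
}

The key design point is that skipping most observations keeps the runtime and storage overhead low while still producing a fair, unbiased sample across many user runs.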
In the third and final step, we will analyze the collected data to find differences between behaviors of successful runs versus failing runs that help localize the causes of the failing runs. This is the step that gives the approach the name “statistical debugging”.
Overall Architecture
[Architecture diagram: Program Source is instrumented and sampled, then compiled into the Deployed Application; each user run sends back a feedback report (Profile + 😊 / 😡), and Statistical Debugging over these reports yields the top bugs with likely causes]
Here is the overall architecture of the approach we just outlined.
The source code of the program, after being written by the developer, is instrumented with statements to monitor behaviors that we have guessed are potentially interesting.
For reasons we just outlined, such as performance overhead and network limits, we will not collect these behaviors in every run. To allow skipping behaviors in a fair and efficient manner, the sampler injects code that at run time will probabilistically decide which behaviors to collect.
The instrumented code is then compiled and deployed to the user base.
During user runs of the program, behavior feedback profiles are created along with bits indicating the success or failure outcome of the run, and these profiles are transmitted back to the developer’s database.
The developer can then perform statistical analysis on this data to determine which bugs are the most important and what their likely causes are.
Notice that, unlike the other program analysis techniques in this course, this approach does not require a formal specification in order to find bugs: users merely need to indicate whether their run succeeded or failed. Such an indication can even come automatically instead of from the user. For instance, if the run violates an assertion or suffers a segmentation fault, the run can automatically be labeled as a failing run. As another example, if a run produces an output that the user deems incorrect, the user can simply label the run as failing.
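As one concrete (and simplified) possibility, a run could label itself as failing when it receives a segmentation fault. The sketch below is an assumption about how such automatic labeling might be wired up in C; the outcome.txt report file and the helper names are made up for illustration.

#include <fcntl.h>
#include <signal.h>
#include <unistd.h>

/* Write the outcome label to a (hypothetical) report file using only
   async-signal-safe calls, since this may run inside a signal handler. */
static void write_outcome(const char *label, unsigned len) {
    int fd = open("outcome.txt", O_CREAT | O_WRONLY | O_TRUNC, 0644);
    if (fd >= 0) { write(fd, label, len); close(fd); }
}

static void on_segfault(int sig) {
    (void)sig;
    write_outcome("failure\n", 8);   /* crash: label the run as failing */
    _exit(1);
}

int main(void) {
    signal(SIGSEGV, on_segfault);
    /* ... instrumented program logic runs here ... */
    write_outcome("success\n", 8);   /* normal exit: label the run as succeeding */
    return 0;
}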
A Model of Behavior
• Assume any interesting behavior is expressible as a predicate P on a program state at a particular program point.
Observation of behavior = observing P
• Instrument the program to observe each predicate
• Which predicates should we observe?
Let’s start moving into the first step of the process, deciding which program behaviors we will observe and collect from users.
Our model for an interesting behavior will be a boolean predicate that is either true or false based on the program’s state at a given point in the program.
An observation of such a behavior thus consists of an observation of whether the corresponding predicate is true or false. We will instrument the program to observe each such predicate and record their truth values for reporting later. The question remains, however: which predicates should we choose to observe?
Branches Are Interesting
++branch_17[p != 0];
if (p) … else …
Track predicates:
branch_17:    p == 0: 63    p != 0: 0
The first interesting program behavior we’ll keep track of is branch conditions. This will allow us to see if particular branches in the code are correlated with higher probabilities of program failure.
In particular, we will observe the truth value of the boolean expression in each branch condition in the program.
Suppose that this branch condition p appears at program point numbered 17. The instrumentation process will add a line of code that increments the count in the appropriate cell of a two-cell array called branch_17.
The 0th cell of this array will count the number of times the expression p was observed to be false (in this case 63), and the next cell will count the number of times the expression p was observed to be true (in this case 0).
We will maintain a unique such array for each branch condition in the program.
Note that branch conditions occur not only in if statements, such as in this shown example, but also in while statements, switch statements, and so on.
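Putting this together, here is a minimal sketch in C of what the instrumented branch site might look like; the surrounding function and its parameter are illustrative assumptions.

static unsigned branch_17[2];    /* [0]: times p was false (p == 0), [1]: times p was true (p != 0) */

void site_17(int p) {
    ++branch_17[p != 0];         /* p != 0 evaluates to 0 or 1 and indexes the array */
    if (p) {
        /* ... then-branch ... */
    } else {
        /* ... else-branch ... */
    }
}

With the counts shown on the slide, this array would hold 63 in cell 0 and 0 in cell 1 after the run, since p was false every one of the 63 times this branch was reached.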
Return Values Are Interesting
n = fopen(…);
++call_41[(n==0)+(n>=0)];
Track predicates:
call_41:    n < 0: 23    n > 0: 0    n == 0: 90
Another program behavior we’ll keep track of is the return value from function calls that return an integer data type.
In particular, we will keep track of the return value’s relationship to 0. This information is potentially useful because such return values often convey information about the success or failure of an operation that the called function performed. In particular, a return value of 0 may be used to indicate a successful operation, and a non-zero return value may be used to indicate a failure of the operation. Programmers often presume that the operation will succeed, and neglect to check for the case in which the operation fails. Tracking the relationship of the return value to 0 will help us to pinpoint the cause of runs that fail due to such programmer errors.
Let’s look at an example. Suppose this call to the fopen function appears at program point 41. The instrumentation process will add a line of code that increments the count in the appropriate cell of a three-cell array called call_41.
The 0th cell of this array will count the number of times n was observed to be less than 0 (in this case 23), the next cell will count the number of times n was observed to be greater than 0 (in this case 0), and the last cell will count the number of times n was observed to be equal to 0 (in this case 90).
We will maintain a unique such array for each integer-returning function call in the program.
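Here is a minimal sketch of such a call site in C. Since fopen actually returns a pointer, the sketch uses open, which returns an int, as an illustrative stand-in; the file name and the site function are assumptions.

#include <fcntl.h>

static unsigned call_41[3];    /* [0]: n < 0, [1]: n > 0, [2]: n == 0 */

void site_41(void) {
    int n = open("data.txt", O_RDONLY);    /* hypothetical integer-returning call */
    /* Index arithmetic:  n < 0  -> (n==0)+(n>=0) = 0 + 0 = 0
                          n > 0  -> (n==0)+(n>=0) = 0 + 1 = 1
                          n == 0 -> (n==0)+(n>=0) = 1 + 1 = 2 */
    ++call_41[(n == 0) + (n >= 0)];
    /* ... rest of the original code uses n ... */
}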
What Other Behaviors Are Interesting?
• Depends on the problem you wish to solve!
• Examples:
– Number of times each loop runs
– Scalar relationships between variables (e.g. i < j, i > 42)
– Pointer relationships (e.g. p == q, p != null)
What other behaviors might we want to keep track of?
As you might have guessed, the answer to this question depends on the context of the problem you’re solving.
Other examples of behaviors we might wish to keep track of are:
– The number of times each loop runs. This can help us with debugging performance issues, as programs spend most of their time in loops.
– Scalar relationships between variables, such as whether one variable is greater than another (this might help us detect array index out-of-bounds errors).
– And pointer relationships, such as whether two pointers are equal or whether a pointer is not null.
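As a sketch of how such predicates could be instrumented in the same style as branches and return values (the site numbers, array names, and observation function below are made up for illustration):

#include <stddef.h>

static unsigned scalar_23[3];    /* [0]: i < j, [1]: i == j, [2]: i > j */
static unsigned ptr_24[2];       /* [0]: p == NULL, [1]: p != NULL */

void observe_site(int i, int j, void *p) {
    ++scalar_23[(i == j) + 2 * (i > j)];   /* i < j -> 0, i == j -> 1, i > j -> 2 */
    ++ptr_24[p != NULL];
}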
QUIZ: Identify the Predicates
List all predicates tracked for this program, assuming only branches are potentially interesting:
void main() {
  int z;
  for (int i = 0; i < 3; i++) {
    char c = getc();
    if (c == 'a')
      z = 0;
    else
      z = 1;
    assert(z == 1);
  }
}
{QUIZ SLIDE}
Let’s look at the following program. To check your understanding, please type in these boxes which predicates we will keep track of, assuming that we are presently only interested in branch conditions. (So, for example, you can ignore the predicate z == 1, since it does not indicate a branch condition.)
QUIZ: Identify the Predicates
List all predicates tracked for this program, assuming only branches are potentially interesting:
void main() {
  int z;
  for (int i = 0; i < 3; i++) {
    char c = getc();
    if (c == 'a')
      z = 0;
    else
      z = 1;
    assert(z == 1);
  }
}
{SOLUTION SLIDE}
Here are the four branch condition predicates we can extract from the program.
One branch condition is at the if-statement, which gives us two predicates: c == ‘a’ and its negation c != ‘a’.
Another branch condition that might have been less obvious is the one that occurs in the for-loop.
At the end of each iteration of the for-loop, we check whether i is less than 3 to determine whether to iterate through the loop again. So this gives us the predicate i < 3 and its negation i >= 3.
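To make this concrete, here is a sketch of what the quiz program might look like after branch instrumentation. The site numbers and counter names are made up, and getc is given its stream argument so the sketch compiles.

#include <assert.h>
#include <stdio.h>

static unsigned branch_1[2];   /* for-loop condition: [0] i >= 3, [1] i < 3 */
static unsigned branch_2[2];   /* if condition:       [0] c != 'a', [1] c == 'a' */

int main(void) {
    int z;
    for (int i = 0; ++branch_1[i < 3], i < 3; i++) {
        char c = getc(stdin);
        ++branch_2[c == 'a'];
        if (c == 'a')
            z = 0;
        else
            z = 1;
        assert(z == 1);
    }
    return 0;
}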
Summarization and Reporting
branch_17:    p == 0: 63    p != 0: 0
call_41:      n < 0: 23     n > 0: 0     n == 0: 90

Feedback profile:    p == 0: 63    p != 0: 0    n < 0: 23    n > 0: 0    n == 0: 90
When a run finishes, we combine the arrays of predicate counts that we were tracking at different instrumentation sites in the program into a single array, the feedback profile, in preparation for reporting it. Notice that this feedback profile consists of data from two instrumentation sites, branch_17 and call_41, but five predicates.
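A minimal sketch of this combination step in C, continuing the running example; the function name and the fixed five-slot layout are assumptions for illustration.

static unsigned branch_17[2];    /* p == 0, p != 0 */
static unsigned call_41[3];      /* n < 0, n > 0, n == 0 */

/* Lay the per-site counter arrays out as one vector of predicate counts. */
void build_feedback_profile(unsigned profile[5]) {
    profile[0] = branch_17[0];   /* p == 0  */
    profile[1] = branch_17[1];   /* p != 0  */
    profile[2] = call_41[0];     /* n < 0   */
    profile[3] = call_41[1];     /* n > 0   */
    profile[4] = call_41[2];     /* n == 0  */
}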
Summarization and Reporting
• Feedback report per run is:
– Vector of predicate states: ‒, 0, 1, *
– Success/failure outcome
• No time dimension, for good or ill

p == 0: 1    p != 0: 0    n < 0: __    n > 0: __    n == 0: __
We report the feedback profile along with the outcome label indicating whether the run succeeded or failed. It is up to you how you wish to define the state of each predicate in the feedback profile. One possibility is to report the counts themselves. To keep things simple, however, we will abstract these counts to one of only four states: -, 0, 1, and *:
• The - state means that the predicate was never observed (neither as true nor as false) during the run.
• The state 0 means that the predicate was observed at least once to be false, and never observed to be true.
• The state 1 means that the predicate was observed at least once to be true, and never observed to be false.
• Finally, the * state means that the predicate was observed at least once to be false and at least once to be true.
Let’s apply this state abstraction to these predicate counts. Since the predicates at each site are related to each other, remember that we must look at the counts of all the predicates at each site before deciding the state of each predicate at that site.
Let’s look at the first site consisting of these two predicates (gesture at p==0 and p!=0). We infer that the state of predicate p == 0 is 1 because p==0 was observed at least once to be true (in fact 63 times), and never observed to be false. On the other hand, the state of predicate p != 0 is 0 because it was observed at least once to be false (in fact 63 times), and never observed to be true.
Using the same procedure, we will fill in the states of these three predicates in a quiz shortly. (gesture at n < 0, n > 0, and n == 0).
Notice that since there are only four possible abstract states, only two bits suffice for reporting the state of each predicate in the feedback profile, compared to say 32 or 64 bits for a count.
Also, notice that we are abstracting away the time dimension in our report. Like all abstraction schemes, this has its benefits (such as reducing complexity) and drawbacks (such as losing potentially useful debugging information).
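Here is a minimal sketch of this abstraction as a C function, assuming each predicate's observations at a site have already been summarized as the number of times it was seen true and the number of times it was seen false; the enum and function names are made up.

typedef enum { NEVER_OBSERVED, ALWAYS_FALSE, ALWAYS_TRUE, BOTH } PredState;   /* -, 0, 1, * */

PredState abstract_state(unsigned true_count, unsigned false_count) {
    if (true_count == 0 && false_count == 0) return NEVER_OBSERVED;   /* - */
    if (true_count == 0)                     return ALWAYS_FALSE;     /* 0 */
    if (false_count == 0)                    return ALWAYS_TRUE;      /* 1 */
    return BOTH;                                                      /* * */
}

For the predicate p == 0 above, the true count is branch_17[0] = 63 and the false count is branch_17[1] = 0, so the function returns ALWAYS_TRUE, matching the state 1.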
QUIZ: Abstracting Predicate Counts
p == 0: 1    p != 0: 0    n < 0: __    n > 0: __    n == 0: __
‒ Never observed
0 False at least once, never true
1 True at least once, never false
* Both true and false at least once
{QUIZ SLIDE}
Now go ahead and complete this table to abstract the predicate counts for n < 0, n > 0, n == 0.
You can use the box on the left if you need to remind yourself what each of the four abstract states means.
QUIZ: Abstracting Predicate Counts
p == 0: 1    p != 0: 0    n < 0: *    n > 0: 0    n == 0: *
‒ Never observed
0 False at least once, never true
1 True at least once, never false
* Both true and false at least once
{SOLUTION SLIDE}
We infer that the state of n < 0 is *, because it was observed to be true 23 times and false 90 times. The state of n > 0 is 0, because it was observed to be false but never true. And the state of n == 0 is *, because it was observed to be true 90 times and false 23 times.