
DOI:10.1145/3188720
For a static analysis project to succeed, developers must feel they benefit from and enjoy using it.
BY CAITLIN SADOWSKI, EDWARD AFTANDILIAN, ALEX EAGLE, LIAM MILLER-CUSHON, AND CIERA JASPAN


Lessons from Building Static Analysis Tools at Google
SOFTWARE BUGS COST developers and software companies a great deal of time and money. For example, in 2014, a bug in a widely used SSL implementation (“goto fail”) caused it to accept invalid SSL certificates,36 and a bug related to date formatting caused a large-scale Twitter outage.23 Such bugs are often statically detectable and are, in fact, obvious upon reading the code or documentation yet still make it into production software.
Previous work has reported on experience applying bug-detection tools to production software.6,3,7,29 Although there are many such success stories for developers using static analysis tools, there are also reasons engineers do not always use static analysis tools or ignore their warnings,6,7,26,30 including:
Not integrated. The tool is not integrated into the developer's workflow or takes too long to run;
Not actionable. The warnings are not actionable;
Not trustworthy. Users do not trust the results due to, say, false positives;
Not manifest in practice. The reported bug is theoretically possible, but the problem does not actually manifest in practice;
key insights
- Static analysis authors should focus on the developer and listen to their feedback.
- Careful developer workflow integration is key for static analysis tool adoption.
- Static analysis tools can scale by crowdsourcing analysis development.

Too expensive to fix. Fixing the detected bug is too expensive or risky; and
Warnings not understood. Users do not understand the warnings.
Here, we describe how we have applied the lessons from Google's previous experience with FindBugs Java analysis, as well as from the academic literature, to build a successful static analysis infrastructure used daily by most software engineers at Google. Google's tooling detects thousands of problems per day that are fixed by engineers, by their own choice, before the problematic code is checked into Google's companywide codebase.
Scope. We focus on static analysis tools that have become part of the core developer workflow at Google and are used by a large fraction of Google's developers. Many of the static analysis tools deployed at the scale of Google's two-billion-line codebase32 are relatively simple; running more sophisticated analyses at scale is not yet considered a priority.
Note that developers outside of Google working in specialized fields (such as aerospace13 and medical devices21) may use additional static analysis tools and workflows. Likewise, developers working on specific types of projects (such as kernel code and device drivers4) may run ad hoc analyses. There has been lots of great work on static analysis, and we do not claim the lessons we report here are unique, but we do believe that collating and sharing what has worked to improve code quality and the developer experience at Google is valuable.
Terminology. We use the following terms: analysis tools run one or more "checks" over source code and identify "issues" that may or may not represent actual software faults. We consider an issue to be an "effective false positive" if developers did not take positive action after seeing the issue.35 If an analysis incorrectly reports an issue, but developers make the fix anyway to improve code readability or maintainability, that is not an effective false positive. If an analysis reports an actual fault, but the developer did not understand the fault and therefore took no action, that is an effective false positive. We make this distinction to emphasize the importance of developer perception. Developers, not tool authors, will determine and act on a tool's perceived false-positive rate.
How Google builds software. Here, we outline key aspects of Google's software-development process. At Google, nearly all developer tools (with the exception of the development environment) are centralized and standardized. Many parts of the infrastructure are built from scratch and owned by internal teams, giving the flexibility to experiment.
Source control and code ownership. Google has developed and uses a single source-control system and a single monolithic source code repository that holds (nearly) all Google proprietary source code.a Developers use "trunk-based" development, with limited use of branches, typically for releases, not for features. Any engineer can change any piece of code, subject to approval by the code's owners. Code ownership is path-based; an owner of a directory implicitly owns all subdirectories as well.
a Google's large open source projects (such as Android and Chrome) use separate infrastructure and their own workflows.
Build system. All code in Google's repository builds with a customized version of the Bazel build system,5 requiring that builds be hermetic; that is, all inputs must be explicitly declared and stored in source control so the builds are easily distributed and parallelized. In Google's build system, Java rules depend on the Java Development Kit and Java compiler that are checked into source control, and such binaries can be updated for all users simply by checking in new versions. Builds are generally from source (at head), with few binary artifacts checked into the repository. Since all developers use the same build system, it is the source of truth for whether any given piece of code compiles without errors.
Analysis tools. The static analysis tools Google uses are typically not complex. Google does not have infrastructure support to run interprocedural or whole-program analysis at Google scale, nor does it use advanced static analysis techniques (such as separation logic7) at scale. Even simple checks have required analysis infrastructure supporting workflow integration to make them successful. The types of analyses deployed as part of the general developer workflow include:
Style checkers (such as Checkstyle,10 Pylint,34 and Golint18);
Bug-finding tools that may extend the compiler (such as Error Prone,15 ClangTidy,12 Clang Thread Safety Analysis,11 Govet,17 and the Checker Framework), including, but not limited to, abstract-syntax-tree pattern-match tools, type-based checks, and unused variable analysis;
Analyzers that make calls to production services (such as to check whether an employee mentioned in a code comment is still employed at Google); and
Analyzers that examine properties of build outputs (such as the size of binaries).
The "goto fail" bug36 would have been caught by Google's C++ linter that checks whether if statements are followed by braces. The code that caused the Twitter outage23 would not compile at Google because of an Error Prone compiler error, a pattern-based check that identifies date-formatting misuses. Google developers also use dynamic analysis tools (such as AddressSanitizer) to find buffer overruns and ThreadSanitizer to find data races.14 These tools are run during testing and sometimes also with production traffic.
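To make the date-formatting misuse concrete, here is a minimal sketch of the classic week-year bug that this class of check catches; the class name is ours, and we assume the common form of the bug, using "YYYY" (the week year) where "yyyy" (the calendar year) is intended. In the open source Error Prone release, the MisusedWeekYear check flags this pattern.

    import java.text.SimpleDateFormat;
    import java.util.Calendar;
    import java.util.GregorianCalendar;
    import java.util.Locale;

    public class WeekYearDemo { // hypothetical demo class
      public static void main(String[] args) {
        // Dec. 29, 2014 falls in a week that contains Jan. 1, 2015, so its
        // "week year" is 2015 even though the calendar year is 2014.
        Calendar cal = new GregorianCalendar(2014, Calendar.DECEMBER, 29);
        // Bug: uppercase "YYYY" selects the week year.
        SimpleDateFormat buggy = new SimpleDateFormat("YYYY-MM-dd", Locale.US);
        // Fix: lowercase "yyyy" selects the calendar year.
        SimpleDateFormat fixed = new SimpleDateFormat("yyyy-MM-dd", Locale.US);
        System.out.println(buggy.format(cal.getTime())); // prints 2015-12-29
        System.out.println(fixed.format(cal.getTime())); // prints 2014-12-29
      }
    }

The bug is invisible for most of the year and then misdates everything near the year boundary, which is why a compile-time pattern match is far cheaper than catching it in production.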
Integrated Development Environments (IDEs). An obvious workflow integration point to show static analysis issues early in the development process is within an IDE. However, Google developers use a wide variety of editors, making it difficult to consistently detect bugs by all developers prior to invoking the build tool. Although Google does use analyses integrated with popular internal IDEs, requiring a particular IDE with analyses enabled is a non-starter.
Testing. Nearly all Google code includes corresponding tests, ranging from unit tests all the way to large-scale integration tests. Tests are integrated as a first-class concept in the build system and are hermetic and distributed, just like builds. For most projects, developers write and maintain the tests for their code; projects typically have no separate testing or quality-assurance group. Google's continuous build-and-test system runs tests on every commit and notifies a developer if the developer's change broke the build or caused a test to fail. It also supports testing a change before committing to avoid breaking downstream projects.
Code review. Every commit to Google's codebase goes through code review first. Although any developer can propose a change to any part of Google's code, an owner of the code must review and approve the change before submission. In addition, even owners must have their code reviewed before committing a change. Code review happens through a centralized, web-based tool that is tightly integrated with other development infrastructure. Static analysis results are surfaced in code review.
Releasing code. Google teams release frequently, with much of the release validation and deployment process automated through a "push on green" methodology,27 meaning an arduous, manual release-validation process is not possible. If Google engineers find a bug in a production service, a new release can be cut and deployed to production servers at relatively low cost compared with applications that must be shipped to users.
What We Learned from FindBugs
Earlier research, from 2008 to 2010, on static analysis at Google focused on Java analysis with FindBugs2,3: a standalone tool created by William Pugh of the University of Maryland and David Hovemeyer of York College of Pennsylvania that analyzes compiled Java class files and identifies patterns of code that lead to bugs. As of January 2018, FindBugs was available at Google only as a command-line tool used by few engineers. A small Google team, called "BugBot," worked with Pugh on three failed attempts to integrate FindBugs into the Google developer workflow.
From these attempts, we learned several lessons:
Attempt 1. Bug dashboard. Initially, in 2006, FindBugs was integrated as a centralized tool that ran nightly over the entire Google codebase, producing a database of findings engineers could examine through a dashboard. Although FindBugs found hundreds of bugs in Google's Java codebase, the dashboard saw little use because a bug dashboard was outside the developers' usual workflow, and distinguishing between new and existing static-analysis issues was distracting.
Attempt 2. Filing bugs. The BugBot team then began to manually triage new issues found by each nightly FindBugs run, filing bug reports for the most important ones. In May 2009, hundreds of Google engineers participated in a companywide "Fixit" week, focusing on addressing FindBugs warnings.3 They reviewed a total of 3,954 such warnings (42% of 9,473 total), but only 16% (640) were actually fixed, despite the fact that 44% of reviewed issues (1,746) resulted in a bug report being filed. Although the Fixit validated that many issues found by FindBugs were actual bugs, a significant fraction were not important enough to fix in practice. Manually triaging issues and filing bug reports is not sustainable at a large scale.
Attempt 3. Code review integration. The BugBot team then implemented a system in which FindBugs automatically ran when a proposed change was sent for review, posting results as comments on the code-review thread, something the code-review team was already doing for style/formatting issues. Google developers could suppress false positives and apply FindBugs' confidence in the result to filter comments. The tooling further attempted to show only new FindBugs warnings but sometimes miscategorized issues as new. Such integration was discontinued when the code-review tool was replaced in 2011, for two main reasons: the presence of effective false positives caused developers to lose confidence in the tool, and developer customization resulted in an inconsistent view of analysis results.
Make It a Compiler Workflow
Concurrent with FindBugs experimentation, the C++ workflow at Google was improving with the addition of new checks to the Clang compiler. The Clang team implemented new compiler checks, along with suggested fixes, then used ClangMR38 to run the updated compiler in a distributed way over the entire Google codebase, refine checks, and programmatically fix all existing instances of a problem in the codebase. Once the codebase was cleansed of an issue, the Clang team enabled the new diagnostic as a compiler error (not a warning, which the Clang team found Google developers ignored) to break the build, a report difficult to disregard. The Clang team was very successful improving the codebase through this strategy.
We followed this design and built a simple pattern-based static analysis for Java called Error Prone15 on top of the javac Java compiler.1 The first check rolled out, called PreconditionsCheckNotNull,b detects cases in which a runtime precondition check trivially succeeds because the arguments in the method call are transposed, as when, say, checkNotNull("uid was null", uid) is called instead of checkNotNull(uid, "uid was null").
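To illustrate the transposition, here is a minimal sketch of both forms, assuming Guava's Preconditions.checkNotNull; the normalize method and the uid value are hypothetical stand-ins.

    import com.google.common.base.Preconditions;

    public class CheckNotNullDemo { // hypothetical demo class
      // uid is a hypothetical, possibly null argument.
      static String normalize(String uid) {
        // Bug: the arguments are transposed. The string literal can never be
        // null, so this check always passes, even when uid is null, and the
        // failure surfaces later as an unexplained NullPointerException.
        Preconditions.checkNotNull("uid was null", uid);

        // Correct: the value under test comes first, the message second.
        Preconditions.checkNotNull(uid, "uid was null");

        return uid.trim();
      }
    }

Because both calls compile and the buggy one never fails at the check site, this is exactly the kind of fault a pattern-based compile-time check catches far more cheaply than debugging ever could.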
In order to launch checks like PreconditionsCheckNotNull without breaking any continuous builds, the Error Prone team runs such checks over the whole codebase using a javac-based MapReduce program, analogous to ClangMR, called JavacFlume, built using FlumeJava.8 JavacFlume emits a collection of suggested fixes, represented as diffs, that are then applied to produce a whole-codebase change. The Error Prone team uses an internal tool, Rosie,32 to split the large-scale change into small changes that each affect a single project, test those changes, and send them for code review to the appropriate team. The team reviews only those fixes that apply to its code, and, when they approve them, Rosie commits the change. All changes are eventually approved, the existing issues are fixed, and the team enables the compiler error.
When we surveyed developers who received these patches, 57% of those who received a proposed fix to checked-in code were happy to have received it, and 41% were neutral. Only 2% responded negatively, saying, "It just created busywork for me."
Value of compiler checks. Compiler errors are displayed early in the development process and integrated into the developer workflow. We have found expanding the set of compiler checks to be effective for improving code quality at Google. Because checks in Error Prone are self-contained and written against the javac abstract syntax tree, rather than bytecode (unlike FindBugs), it is relatively easy for developers outside the team to contribute checks. Leveraging these contributions is vital in increasing Error Prone's overall impact. As of January 2018, 733 checks had been contributed by 162 authors.
b http://errorprone.info/bugpattern/PreconditionsCheckNotNull
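To give a sense of what such a contribution looks like, the following is a simplified, hypothetical check written against Error Prone's public BugChecker API, loosely modeled on the PreconditionsCheckNotNull example earlier; the class name and diagnostic message are ours, and annotation and matcher details vary across Error Prone releases.

    import com.google.errorprone.BugPattern;
    import com.google.errorprone.BugPattern.SeverityLevel;
    import com.google.errorprone.VisitorState;
    import com.google.errorprone.bugpatterns.BugChecker;
    import com.google.errorprone.bugpatterns.BugChecker.MethodInvocationTreeMatcher;
    import com.google.errorprone.fixes.SuggestedFix;
    import com.google.errorprone.matchers.Description;
    import com.google.errorprone.matchers.Matcher;
    import com.google.errorprone.matchers.method.MethodMatchers;
    import com.sun.source.tree.ExpressionTree;
    import com.sun.source.tree.MethodInvocationTree;
    import com.sun.source.tree.Tree;

    @BugPattern(
        name = "CheckNotNullLiteralArg", // hypothetical check name
        summary = "First argument to checkNotNull is a string literal;"
            + " the arguments are likely transposed",
        severity = SeverityLevel.ERROR)
    public class CheckNotNullLiteralArg extends BugChecker
        implements MethodInvocationTreeMatcher {

      private static final Matcher<ExpressionTree> CHECK_NOT_NULL =
          MethodMatchers.staticMethod()
              .onClass("com.google.common.base.Preconditions")
              .named("checkNotNull");

      @Override
      public Description matchMethodInvocation(
          MethodInvocationTree tree, VisitorState state) {
        if (!CHECK_NOT_NULL.matches(tree, state)
            || tree.getArguments().size() != 2) {
          return Description.NO_MATCH;
        }
        ExpressionTree first = tree.getArguments().get(0);
        ExpressionTree second = tree.getArguments().get(1);
        // A string literal can never be null, so testing it is almost
        // certainly a mistake; suggest swapping the two arguments.
        if (first.getKind() == Tree.Kind.STRING_LITERAL) {
          SuggestedFix fix =
              SuggestedFix.builder()
                  .replace(first, state.getSourceForNode(second))
                  .replace(second, state.getSourceForNode(first))
                  .build();
          return describeMatch(tree, fix);
        }
        return Description.NO_MATCH;
      }
    }

Because a check like this carries a mechanical SuggestedFix, the same code can drive the JavacFlume/Rosie cleanup described earlier before the diagnostic is enabled as a compiler error.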
Reporting issues sooner is better. Google's centralized build system logs all builds and build results, so we identified all users who had seen one of the error messages in a given time window. We sent a survey to developers who recently encountered a compiler error and developers who had received a patch with a fix for the same problem. Google developers perceive that issues flagged at compile time (as opposed to patches for checked-in code) catch more important bugs; for example, survey participants deemed 74% of the issues flagged at compile time as "real problems," compared to 21% of those found in checked-in code. In addition, survey participants deemed 6% of the issues found at compile time (vs. 0% in checked-in code) "critical." This result is explained by the "survivor effect";3 that is, by the time code is submitted, the errors are likely to have been caught by more expensive means (such as testing and code review). Moving as many checks into the compiler as possible is one proven way to avoid those costs.
Criteria for compiler checks. To scale up our work, we have defined criteria for enabling checks in the compiler, setting the bar high, since breaking the compile would be a significant disruption. A compiler check at Google should be easily understood; actionable and easy to fix (whenever possible, the error should include a suggested fix that can be applied mechanically); produce no effective false positives (the analysis should never stop the build for correct code); and report issues affecting only correctness rather than style or best practices.
The primary goal of an analyzer satisfying these criteria is not simply to detect faults but to automatically fix all instances of a prospective compiler error throughout the codebase. However, such criteria limit the scope of the checks the Error Prone team enables when compiling code; many issues that cannot always be detected correctly or mechanically fixed are still serious problems.
Code Review
Once the Error Prone team had built the infrastructure needed to detect issues at compile time, and had proved the approach works, we wanted to show more high-impact bugs that do not meet the criteria we outlined earlier for compiler errors and provide results for languages other than Java and C++. The second integration point for static analysis results is Google's code review tool, Critique; static analysis results are exposed in Critique using Tricorder,35 Google's program-analysis platform. As of January 2018, C++ and Java builds at Google were compiler-warning-free by default, with all analysis results shown either as compiler errors or in code review.
Criteria for code-review checks. Unlike compile-time checks, analysis results shown during code review are allowed to include up to 10% effective false positives. There is an expectation during code review that feedback is not always perfect and that authors evaluate proposed changes before applying them. A code review check at Google should fulfill several criteria:
Be understandable. Be easy for any engineer to understand;
Be actionable and easy to fix. The fix may require more time, thought, or effort than a compiler check, and the result should include guidance as to how the issue might indeed be fixed;
Produce less than 10% effective false positives. Developers should feel the check is pointing out an actual issue at least 90% of the time;c and
Have the potential for significant impact on code quality. The issues may not affect correctness, but developers should take them seriously and deliberately choose to fix them.
Some issues are severe enough to be flagged in the compiler, but producing an automated fix for them is not feasible; for example, fixing an issue may require significant restructuring of the code. Enabling these checks as compiler errors would require manual cleanup of existing instances, which is infeasible at the scale of Google's vast codebase. Analysis tools that show these checks in code review prevent new occurrences of the issue, allowing the developer to decide how to make an appropriate fix. Code review is also a good context for reporting relatively less-important issues like stylistic problems or opportunities to simplify code. In our experience, reporting them at compile time is frustrating for developers and makes it more difficult to iterate and debug quickly; for example, an unreachable-code check could block a developer who temporarily comments out a block of code while debugging.
c Although this number was initially chosen by the first author somewhat arbitrarily, it seems to be a sweet spot for developer satisfaction and matches the cutoff for similar systems at other companies.
