Reducing Code Review Time at Google
Google's tool for helping developers address code review comments more efficiently.
This is the latest issue of my newsletter. Each week I share research and perspectives on developer productivity.
Upcoming virtual events:
GenAI: I’m hosting a panel with leaders from Airbnb, GitHub, and Jumio on how they’re leveraging GenAI to boost developer productivity. Sign up to join the conversation here. May 30th.
Metrics: Gradle is hosting a ‘DPE Showdown’ on developer productivity metrics. I’ll be speaking alongside leaders from Uber, Spotify and Microsoft. Sign up to join the event here. June 13th.
This week I read Resolving Code Review Comments with Machine Learning from Google. Code reviews are a critical part of the software development process at Google, but they take a significant amount of time. Researchers looked for a way to speed up code reviews while maintaining quality. This paper documents their solution and results.
My summary of the paper
Developers at Google spend a lot of time "shepherding" code changes through reviews, both as authors and reviewers. Even when only a single review iteration is needed, there is still a cost: it takes time to understand the reviewer's recommendation, look up relevant information, and type out the edit. Moreover, the active work time that the code author must devote to addressing reviewer comments grows almost linearly with the number of comments.
For these reasons, Google created a code review comment-resolution assistant. Their goal: to reduce the time spent resolving code review comments by making it easier for reviewers to provide actionable suggestions and for authors to address those suggestions efficiently. The assistant achieves this by proposing code changes based on a comment's text.
Google's code review comment-resolution assistant uses machine learning to help developers address review comments more efficiently. Here's a simplified explanation of how it was created, how it works, and its impact:
Modeling and training
The comment-resolution assistant uses a text-to-text machine learning model. It processes code with inline reviewer comments and predicts the necessary code edits to address these comments. The model was trained on a vast dataset, including over 3 billion examples, to handle various software-engineering tasks like code edits and error fixing. During training, it was fine-tuned to prioritize high-precision predictions, ensuring that only the most reliable suggestions are presented to users.
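To make the text-to-text setup concrete, here is a minimal sketch of what a single training example might look like: code with the reviewer's comment inlined as markup, paired with the edited code as the target. The markup tags, field names, and the code itself are my own illustration, not the paper's actual data format.

```python
# Hypothetical training example for a text-to-text "comment -> code edit" model.
# The <comment> tag and the input/target field names are assumptions for illustration.
example = {
    "input": (
        "def fetch_user(user_id):\n"
        "    return db.query('SELECT * FROM users WHERE id = ' + user_id)\n"
        "<comment>Please use a parameterized query to avoid SQL injection.</comment>\n"
    ),
    "target": (
        "def fetch_user(user_id):\n"
        "    return db.query('SELECT * FROM users WHERE id = ?', (user_id,))\n"
    ),
}
```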
Prototyping an assistant based on the model
The team wanted the assistant to be easy and efficient for developers to use, so they tested different designs through user studies and an internal beta test. They ultimately developed an assistant that works as follows (also illustrated in the image below):
Incoming comments: It listens for new code-review comments from reviewers.
Eligible for ML fixing: It ignores irrelevant comments, such as those from automated tools, non-specific comments, comments on unsupported file types, resolved comments, and comments with manual suggestions.
Generated ML predictions: It queries the model to generate a suggested code edit.
If the model is sufficiently confident in the prediction (a threshold tuned for above 70% precision), it posts the suggestion to downstream systems: the code-review frontend and the integrated development environment.
Discovered and Applied: There, the suggested edits are exposed to the user. The system also logs user interactions, such as whether they preview the suggested edit and whether they accept it.
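Here is a rough sketch of that flow in code. The function and attribute names, the eligibility checks, and how the confidence threshold is applied are all assumptions on my part; the real system is a production service, and this is only meant to show the shape of the pipeline described above.

```python
# Simplified sketch of the comment-resolution flow (names are hypothetical).
CONFIDENCE_THRESHOLD = 0.70  # tuned so that surfaced suggestions are high precision

def handle_new_comment(comment, model, review_frontend, ide):
    # 1. Skip comments the assistant should not try to resolve:
    #    automated-tool comments, resolved comments, unsupported file types,
    #    and comments that already carry a manual suggestion.
    if (comment.is_from_automated_tool
            or comment.is_resolved
            or comment.has_manual_suggestion
            or not comment.file_type_supported):
        return None

    # 2. Ask the model for a suggested code edit plus a confidence score.
    suggestion = model.predict_edit(comment.code_context, comment.text)

    # 3. Only surface suggestions the model is confident about.
    if suggestion.confidence < CONFIDENCE_THRESHOLD:
        return None

    # 4. Expose the edit in the code-review UI and the IDE; downstream
    #    systems log whether the author previews and applies it.
    review_frontend.attach_suggested_edit(comment, suggestion)
    ide.attach_suggested_edit(comment, suggestion)
    return suggestion
```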
Deploying and refining the system
Before rolling out the system, the research team conducted several iterations of refinement by testing the model on a separate set of data to see how well it predicted correct edits.
Then, the beta tool was deployed to a small group of “friendly” users, where it was refined further based on user feedback and usage metrics. Specifically, researchers measured the number of comments produced in a day, the number of predictions the model made, the number of those predictions that were previewed, and how many of those were applied or received a thumbs up/thumbs down.
The tool was then deployed to 50% of Google’s developer population, refined further, and finally to the full 100% of the population.
Throughout this process, the research team made several important refinements to the model and system that improved performance and usability. For example, a seemingly small change to the way the suggested edits were shown to developers (a wording tweak and visual change) improved the percentage of edits previewed by developers from 20% up to 30%.
Evaluating the assistant’s impact
The ultimate goal for the tool was to increase productivity. Google used quantitative metrics and qualitative feedback to measure the system’s impact. As for quantitative metrics, the team chose to track the following:
Acceptance rate by author: The fraction of all code-review comments that are resolved by the assistant. This measures, out of all (non-automated) comments left by human reviewers, what fraction received an ML-suggested edit that the author accepted and applied directly to their changelist.
Prediction coverage: This measures the percentage of comments that receive a prediction.
Acceptance rate by reviewer: Similarly, the team measured the percentage of predictions accepted by reviewers.
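For intuition, here is a rough sketch of how these three metrics could be computed from interaction logs. The log schema (dicts with these keys) and the exact denominators (all comments vs. only eligible comments) are simplifications I've made for illustration, not the paper's actual instrumentation.

```python
def compute_metrics(comments):
    """comments: list of dicts, one per human (non-automated) reviewer comment."""
    total = len(comments)
    with_prediction = [c for c in comments if c.get("prediction_made")]
    accepted_by_reviewer = [c for c in with_prediction if c.get("reviewer_accepted")]
    applied_by_author = [c for c in accepted_by_reviewer if c.get("author_applied")]

    return {
        # Share of comments that received an ML-suggested edit.
        "prediction_coverage": len(with_prediction) / total if total else 0.0,
        # Share of predictions the reviewer chose to attach to their comment.
        "acceptance_rate_by_reviewer": (
            len(accepted_by_reviewer) / len(with_prediction) if with_prediction else 0.0
        ),
        # Share of comments ultimately resolved by an ML edit the author applied.
        "acceptance_rate_by_author": len(applied_by_author) / total if total else 0.0,
    }
```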
After several months of deployment, the tool was addressing roughly 7.5% of comments produced by code reviewers in their day-to-day work. Considering tens of millions of code-review comments are left by Google developers every year, over 7% ML-assisted comment resolution is a considerable contribution to the company’s total engineering productivity.
Additionally, around half of all eligible comments received predictions. Of those predictions, over 63% were accepted by the reviewer and attached to the comment to be sent to the author. 34% of those suggested edits were previewed by the author. Of those previewed, 70% were accepted and applied to code.
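If you treat these stage rates as applying to the same population of comments (a simplification on my part), they roughly multiply out to the 7.5% headline figure:

```python
# Funnel stages reported in the paper; the multiplication is my own back-of-envelope check.
coverage = 0.50          # eligible comments that received a prediction
reviewer_accept = 0.63   # predictions the reviewer attached to the comment
author_preview = 0.34    # attached suggestions the author previewed
apply_rate = 0.70        # previewed suggestions accepted and applied

print(coverage * reviewer_accept * author_preview * apply_rate)  # ~0.075, i.e. ~7.5%
```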
Qualitatively, the research team received positive feedback from developers in internal message boards, who called the assistant's suggestions "sorcery," "magic," and "impressive." For example, reviewers often found that the assistant could suggest the right changes even before they finished typing their comments. This saved time and made the review process more efficient for both reviewers and authors.
Final thoughts
I recently shared Meta’s experiment to reduce code review times, which they achieved by targeting the slowest 25% of code reviews. This study provides another example of a company making targeted improvements to the code review process as a path for improving developer productivity.
Who’s hiring right now
Here is a roundup of recent Developer Experience job openings. Find more open roles here.
Airbnb is hiring a Senior Staff Engineer (AI) - Developer Productivity | US
Betterment is hiring a Staff Technical Program Manager | New York
Snyk is hiring a VP, Engineering - Developer Experience | Boston, London
Webflow is hiring an Engineering Manager - Developer Productivity | US
Thanks for reading. If you know someone who might like this issue, consider sharing it with them:
-Abi