Using AI to encourage best practices in the code review process
How Google developed and measured an AI tool that enforces best coding practices.
This is the latest issue of my newsletter. Each week I share research and perspectives on developer productivity.
Join an upcoming live discussion with Laura Tacho (DX CTO) and Crystal Hirschorn (Zoa CTO) on establishing a strong platform engineering function and avoiding common mistakes early teams make. Sign up here.
This week I read a new paper from Google: AI-Assisted Assessment of Coding Practices in Modern Code Review. Their research team developed an AI-based tool that flags best practice violations during code reviews and suggests fixes. The paper covers the tool's creation, deployment, and evaluation, as well as the hurdles faced during rollout.
This is the second paper I've highlighted from Google on AI-enhanced code reviews. Their first paper focused on using AI to assist with resolving reviewer comments. This new study zeroes in on using AI to suggest improvements and encourage best coding practices. It demonstrates that an end-to-end system for learning and enforcing coding best practices is not only viable but also enhances the developer workflow.
My summary of the paper
Enforcing best coding practices—like formatting, documentation, and naming—is a core function of code reviews. However, the code review process is expensive, and investing in making it more efficient is worthwhile. As the authors state, "Code reviews cost thousands of developer-years annually. Even single-digit percentage savings translate into significant business impact."
In this study, researchers explored the potential of partially automating code reviews to detect best practice violations. Success would mean giving code authors timely feedback and freeing reviewers to focus on functionality rather than coding practices.
Here’s more about the tool they developed, how they deployed it, and how they measured its success:
The AutoCommenter tool
Google's AutoCommenter is an AI tool designed to automatically detect best practice violations during code reviews. It provides timely feedback for code authors, reducing the need for manual best-practice checks. Here's a quick overview of how it was developed:
Model and training: The researchers used a T5X-based model, well suited to text-to-text transformations, to understand and analyze source code. They trained it on over 3 billion examples, including 800k examples specifically for best practice violations. These examples were sourced from human-authored review comments that linked to best practice documents.
How the model works: The model's input is a prompt followed by the source code: the prompt is a natural-language description of the task, written as a code comment, and the code to analyze comes after it. The model identifies rule violations, providing a URL to the relevant best practice document, or returns an empty target if no violations are found. It also includes a confidence score for its findings.
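To make the input/output shape concrete, here is a minimal sketch of what a prompt and prediction might look like. The prompt wording, URL, field names, and threshold are assumptions for illustration, not the paper's verbatim format.

```python
# Illustrative sketch only: prompt wording, URL, and field names are assumed,
# not taken verbatim from the paper.

def build_prompt(source_code: str) -> str:
    """Prepend the natural-language task description, written as a code comment."""
    task = "# Find violations of best coding practices in the code below.\n"
    return task + source_code

# A hypothetical prediction: a best-practice URL (empty string means no violation)
# plus the model's confidence in that finding.
prediction = {
    "target": "https://example.corp/docs/python/naming",
    "confidence": 0.93,
}

def should_surface(pred: dict, threshold: float = 0.98) -> bool:
    """Only surface a finding when a URL was returned and confidence clears the bar."""
    return bool(pred["target"]) and pred["confidence"] >= threshold

if should_surface(prediction):
    print(f"Possible violation, see: {prediction['target']}")
```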
User experience: AutoCommenter delivers information through two main channels: the IDE and the code review system.
In the IDE: It highlights code violations with a blue curly underline. Hovering over the underlined code brings up a pop-up with a summary of the violation and a link to the relevant best practice document. This real-time feedback helps developers quickly fix issues without switching contexts.
In the code review process: AutoCommenter analyzes updated code and posts comments on violations. These comments are attached to specific lines of code and visually distinguished from human reviewer comments by a different background color. Each comment includes a summary of the violation and a link to the best practice document. Developers and reviewers can interact with these comments using feedback buttons, including thumbs up/down for usefulness and a "Please fix" button for significant issues that must be addressed before merging the code.
Deployment and challenges
Google rolled out AutoCommenter to all its developers over a year. They deployed it in stages: first to the paper’s authors for a month, then to an early adopter group of 3,000 volunteers for about a year, followed by half of all developers, and finally to everyone.
During this time, the team made several important changes:
Managing thresholds: The team initially set a high confidence threshold (t = 0.98) so it would only show predictions when the model was very certain that it was correct. They did this to ensure developers trusted the tool. However, they found that about 80% of the predictions that didn't meet this high threshold were actually correct, indicating that they were missing many accurate predictions. They adjusted by setting specific thresholds for each best practice document URL based on model performance on validation data.
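As a rough sketch of what per-URL calibration could look like, the snippet below picks, for each best-practice document, the lowest confidence cutoff that still meets a precision target on validation data. The records, URLs, precision target, and selection rule are all assumptions; the paper doesn't spell out the exact procedure.

```python
from collections import defaultdict

# Hypothetical validation records: (best_practice_url, model_confidence, was_correct).
validation = [
    ("https://example.corp/docs/py/imports", 0.91, True),
    ("https://example.corp/docs/py/imports", 0.70, False),
    ("https://example.corp/docs/py/imports", 0.88, True),
    ("https://example.corp/docs/py/naming", 0.85, True),
    ("https://example.corp/docs/py/naming", 0.60, False),
]

def per_url_thresholds(records, target_precision=0.9):
    """For each URL, choose the lowest cutoff whose precision on validation data
    still meets the target (sketch; the actual selection rule is assumed)."""
    by_url = defaultdict(list)
    for url, conf, correct in records:
        by_url[url].append((conf, correct))

    thresholds = {}
    for url, preds in by_url.items():
        preds.sort()  # ascending by confidence
        chosen = 1.01  # default: suppress everything for this URL
        for cutoff, _ in preds:
            kept = [c for c, _ in preds if c >= cutoff]
            correct_kept = sum(ok for c, ok in preds if c >= cutoff)
            if correct_kept / len(kept) >= target_precision:
                chosen = cutoff
                break
        thresholds[url] = chosen
    return thresholds

print(per_url_thresholds(validation))
```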
Handling outdated best practices: Best practices evolve over time, leading to outdated recommendations in the model's predictions. This issue became apparent when many users reported problems related to Python import guidelines that had changed. To address this, the team implemented dynamic suppression of specific best-practice predictions using conditional filtering. This approach allowed for immediate application without needing full model retraining.
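A minimal sketch of that kind of serving-time filtering is below: predictions pointing at suppressed best-practice documents are dropped before anything is shown, so the list can be updated immediately without retraining the model. The URLs and data structure are illustrative, not the paper's.

```python
# Illustrative only: URLs and structure are assumed, not from the paper.
SUPPRESSED_URLS = {
    "https://example.corp/docs/py/imports",  # e.g. guidance that has since changed
}

def filter_predictions(predictions):
    """Drop predictions that point at currently suppressed best-practice documents."""
    return [p for p in predictions if p["target"] not in SUPPRESSED_URLS]

comments = filter_predictions([
    {"target": "https://example.corp/docs/py/imports", "confidence": 0.99},
    {"target": "https://example.corp/docs/py/naming", "confidence": 0.95},
])
print(len(comments))  # 1: the outdated imports comment is suppressed
```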
Handling non-actionable best practice documents: After the initial deployment, the share of comments rated useful (the "useful ratio") plateaued at around 54%. To understand why, the team conducted an independent human rating study, which revealed patterns among unhelpful comments. The study identified non-actionable URLs and highlighted the importance of high-quality summaries. Suppressing non-actionable URLs and improving comment summaries helped increase the useful ratio.
Evaluation metrics and results
Google’s research team evaluated AutoCommenter’s performance by tracking the following:
Developer feedback:
They collected feedback from developers using feedback buttons within the code review system and the IDE. These buttons allowed developers to indicate whether a comment was useful (thumbs up or "Please fix") or not useful (thumbs down). The proportion of feedback that was positive, referred to as the "useful ratio," was a key metric for assessing the tool's performance.
The researchers concluded that developers were generally satisfied with the comments produced by AutoCommenter.
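As a rough illustration, the useful ratio can be read as the share of rated comments that received positive feedback, which lines up with the ~54% plateau mentioned earlier. The exact definition and the counts below are assumptions for illustration.

```python
def useful_ratio(thumbs_up: int, please_fix: int, thumbs_down: int) -> float:
    """Share of rated comments marked useful (assumed definition, for illustration)."""
    positive = thumbs_up + please_fix
    total = positive + thumbs_down
    return positive / total if total else 0.0

# Illustrative counts only: 430 + 110 positive vs. 460 negative -> 0.54
print(round(useful_ratio(430, 110, 460), 2))
```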
Comment resolution: How often do developers modify their code to resolve AutoCommenter’s posted comments? This data helped the researchers understand how frequently and effectively developers were using the tool.
The researchers looked at 6,000 pairs of code snapshots to see how often developers fixed issues mentioned by AutoCommenter. They found that in 50% of the cases, the comment was gone by the time the code was submitted. After manually checking 40 of these cases, they discovered that in 80% of them, the developer had changed the code to fix the issue mentioned by AutoCommenter. This means that about 40% of the comments led to changes in the code.
This high resolution rate indicates that developers often acted on the automated comments by modifying their code to resolve the issues highlighted.
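The ~40% figure follows directly from the two observations above, as this back-of-envelope calculation shows:

```python
# Back-of-envelope from the numbers above: ~50% of comments disappeared before
# submission, and ~80% of a manually checked sample were genuine fixes.
disappeared = 0.50
confirmed_fixes = 0.80
print(f"Estimated share of comments leading to code changes: {disappeared * confirmed_fixes:.0%}")
```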
AutoCommenter vs. human comments: How well do AutoCommenter’s comments cover the best practice documents that human reviewers reference in their comments?
Researchers found that AutoCommenter often references the same key best practice guidelines as human reviewers; however, its comments lack diversity. The top 85 best practice guidelines accounted for 90% of the tool's comments, while the same set of guidelines covered only 35% of human comments.
The researchers note that increasing the variety of guidelines the tool uses, while maintaining accuracy and speed, is a top priority.
AutoCommenter vs. linters: To what extent does AutoCommenter’s output go beyond the capabilities of traditional static analysis tools?
Researchers sampled the top 50 most frequently predicted violations. For each sample, they inspected the best practice document provided and the best practice type, and then determined whether a linter that detects corresponding violations could be easily built.
They found that for 66% of these best practices, AutoCommenter went beyond the scope of traditional static analysis.
Final thoughts
In an earlier paper, Google revealed how they used insights on developers' top pain points—like technical debt and poor documentation—to guide their AI investments. While AutoCommenter initially aimed to automate code reviews, it’s clear that by enforcing best practices, it's also tackling these pain points.
AutoCommenter not only surpassed traditional tools but also gained high user acceptance. This is a promising step towards automated code reviews.
Who’s hiring right now
Here is a roundup of DevEx job openings. Find more open roles here.
Netflix is hiring a Product Manager - Developer Platform | US
Snyk is hiring a VP, Engineering - Developer Experience | Boston, London
Webflow is hiring an Engineering Manager - Developer Productivity | US
VTEX is hiring an Engineering Manager - Developer Experience | Brazil
Thanks for reading. If you know someone who might like this newsletter, consider sharing it with them:
-Abi