This is the latest issue of my newsletter. Each week I cover the latest research and perspectives on developer productivity.
This week I read Using Nudges to Accelerate Code Reviews at Scale, a paper by a group of software engineers and researchers at Meta. This paper describes the results from their experiment to reduce the amount of time that code reviews take.
My summary of the paper
Code review is an important part of the software development process at Meta: every code change must be reviewed by a peer before being shipped. However, the company also values moving fast, and they recognize the need to make code reviews as fast as possible without sacrificing quality. There is a dedicated code review team focused on this problem.
The code review team uses data from the company’s developer experience survey to identify ways to make code review faster and more delightful. In 2020, they discovered that 85% of developers were satisfied with the code review process in general; however, they were less satisfied with the speed at which their code was reviewed. This inspired the code review team to better understand how long the code reviews that developers consider slow actually take, and how the problem might be solved. They then conducted an experiment to test their solution.
Here are the results from their study:
When does a code review feel too slow?
The DevEx survey surfaced the insight that some developers were dissatisfied with how long code reviews take. To better understand how long a “slow” code review actually takes, the team triangulated the satisfaction data with a quantitative metric called Time in Review. (Time in Review measures how long a code change waits on review across all of its individual review cycles. For example, if a code change went through two rounds of review, Time in Review would be the sum of the two durations during which it was “in review.”)
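To make the metric concrete, here is a minimal sketch of that calculation. The timestamps and data layout are assumptions for this example, not Meta’s actual telemetry schema.

```python
from datetime import datetime, timedelta

def time_in_review(review_cycles):
    """Sum the time a diff spent waiting on review across all of its cycles."""
    return sum((exited - entered for entered, exited in review_cycles), timedelta())

# Example: a diff that went through two rounds of review.
cycles = [
    (datetime(2022, 3, 1, 9, 0), datetime(2022, 3, 1, 17, 0)),   # first round: 8 hours in review
    (datetime(2022, 3, 2, 10, 0), datetime(2022, 3, 2, 14, 0)),  # second round: 4 hours in review
]
print(time_in_review(cycles))  # 12:00:00, i.e. a Time in Review of 12 hours
```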
By combining the survey and telemetry data, the team found a clear relationship between dissatisfaction and the amount of time it takes for engineers to get their code changes reviewed. This is shown in the table below, which compares the level of dissatisfaction against the time a developer’s slowest 25% of code changes spend in review.
We can see from the table that there is no sharp cliff or threshold that separates a “good” experience from a “bad” one. Rather, the longer someone’s slowest 25% of diffs take to review, the more frustrated they are with the code review process.
Based on this, Meta set a goal to reduce p75 Time in Review (i.e. the slowest 25% of reviews).
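In other words, the goal metric is the 75th percentile of Time in Review. A toy example of how that percentile is computed (the numbers here are made up):

```python
import numpy as np

# Time in Review, in hours, for a handful of hypothetical diffs.
hours_in_review = [2, 5, 8, 12, 20, 30, 48, 72]

# p75 is the value below which the fastest 75% of diffs fall;
# reducing it targets the slowest 25% of reviews.
print(np.percentile(hours_in_review, 75))  # 34.5 hours for this toy data
```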
Creating a nudge
To reduce p75 Time in Review, the team designed a NudgeBot to notify reviewers about “stale” diffs. For this experiment, they considered a diff stale if it had seen no action for 24 hours, which roughly corresponds to the slowest 25% of engineers’ diffs.
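The staleness rule itself is simple; a sketch of it might look like this (the function and field names are illustrative, not NudgeBot’s actual code):

```python
from datetime import datetime, timedelta

STALE_AFTER = timedelta(hours=24)  # threshold used in the experiment

def is_stale(last_action_at: datetime, now: datetime) -> bool:
    """A diff counts as stale if it has seen no reviewer action for 24 hours."""
    return now - last_action_at >= STALE_AFTER

print(is_stale(datetime(2022, 3, 1, 9, 0), datetime(2022, 3, 2, 10, 0)))  # True: 25 hours idle
```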
Developing the NudgeBot involved creating a model to determine who to nudge. This model ranks the reviewers based on the probability that they will make a comment or perform some other action on a diff. The model takes into account things like the relationship between the author and the reviewer (e.g., are they on the same team, has the reviewer reviewed diffs by this author previously), and characteristics of the diff itself (e.g., number of files, number of reviewers added by the author).
Ultimately, after testing the model, the team found that its most important features were how often the reviewer had reviewed the author’s diffs in the past, how the reviewer was assigned, and the total number of assigned reviewers.
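To make the ranking step concrete, here is a minimal sketch that scores reviewers with a toy logistic model over features of the kind described above. The weights, field names, and feature choices are assumptions for illustration, not Meta’s actual model.

```python
from dataclasses import dataclass
from math import exp

@dataclass
class Reviewer:
    name: str
    past_reviews_of_author: int   # how often they've reviewed this author before
    assigned_automatically: bool  # vs. explicitly added by the author

def action_probability(reviewer: Reviewer, num_assigned_reviewers: int) -> float:
    """Toy logistic model: probability the reviewer comments or acts on the diff."""
    score = (
        0.4 * reviewer.past_reviews_of_author
        - 0.5 * (1 if reviewer.assigned_automatically else 0)
        - 0.3 * num_assigned_reviewers
    )
    return 1 / (1 + exp(-score))

def rank_reviewers(reviewers: list[Reviewer]) -> list[Reviewer]:
    """Rank reviewers by predicted probability of acting, highest first."""
    return sorted(
        reviewers,
        key=lambda r: action_probability(r, len(reviewers)),
        reverse=True,
    )

reviewers = [
    Reviewer("alice", past_reviews_of_author=6, assigned_automatically=False),
    Reviewer("bob", past_reviews_of_author=0, assigned_automatically=True),
]
print([r.name for r in rank_reviewers(reviewers)])  # ['alice', 'bob']
```

The idea is that the nudge can then be directed at whoever is most likely to act on the diff.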
Testing the nudge
Hypothesis and metrics
Meta’s core hypothesis was that with the NudgeBot, the amount of time a diff is under review would decrease. They measured this in three ways: (1) the time a diff waits in the ‘needs review’ status, (2) the number of diffs that take over 3 days to close (this timeframe was chosen because they were only nudging diffs after 24 hours), and (3) the time to first action (they wanted to encourage early actions).
The team also wanted to make sure that optimizing for review speed did not lead to negative side effects, like encouraging rubber-stamp reviews. They used guardrail metrics to protect against unintended consequences: (1) the percentage of diffs reviewed in 24 hours or less, and (2) “Eyeball Time,” the total amount of time reviewers spent looking at a diff.
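Computed over a set of diff records, the success and guardrail metrics might look like the sketch below. The dictionary keys and sample values are assumptions for illustration, and the 3-day cutoff here ignores the paper’s weekend exclusion.

```python
from statistics import mean

# Hypothetical per-diff telemetry; times are in hours unless noted.
diffs = [
    {"hours_in_needs_review": 5,  "hours_to_close": 20, "hours_to_first_action": 2,  "eyeball_minutes": 14},
    {"hours_in_needs_review": 30, "hours_to_close": 80, "hours_to_first_action": 26, "eyeball_minutes": 9},
]

# Success metrics
avg_needs_review_time = mean(d["hours_in_needs_review"] for d in diffs)
pct_open_over_3_days = 100 * mean(d["hours_to_close"] > 72 for d in diffs)
avg_time_to_first_action = mean(d["hours_to_first_action"] for d in diffs)

# Guardrail metrics (watching for rubber-stamping)
pct_reviewed_within_24h = 100 * mean(d["hours_in_needs_review"] <= 24 for d in diffs)
avg_eyeball_minutes = mean(d["eyeball_minutes"] for d in diffs)
```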
The experiment and results
The study started with an opt-in trial: 15 teams opted in. After the trial, the code review team iterated on the nudge in several ways, including adding the ability for developers to opt out, refining the language used in the nudges, and batching notifications to make them less noisy.
Then the team ran a larger experiment using a cluster-randomized technique, meaning the treatment was applied at the group level rather than the individual level. Developers who had commonly reviewed each other’s diffs within the last 90 days were more likely to be in the same cluster. The experiment ran for 28 days, with 15k developers in the test group and 16k developers in the control group.
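Cluster-level assignment means everyone in a cluster gets the same treatment, so a nudge sent to one developer does not contaminate the experience of a close collaborator in the control group. A minimal sketch, assuming the clusters have already been built from the 90-day co-review history:

```python
import random

def assign_clusters(clusters, seed=42):
    """Randomly assign each cluster (and everyone in it) to 'test' or 'control'."""
    rng = random.Random(seed)
    return {cluster_id: rng.choice(["test", "control"]) for cluster_id in clusters}

# Hypothetical clusters of developers who frequently review each other's diffs.
clusters = {"c1": ["alice", "bob"], "c2": ["carol"], "c3": ["dave", "erin"]}
assignment = assign_clusters(clusters)

# Every developer inherits their cluster's arm.
arm_of = {dev: assignment[cid] for cid, devs in clusters.items() for dev in devs}
print(arm_of)
```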
The team observed substantial, statistically significant improvements on the review cycle time goal metrics. Specifically:
The average time of diffs in “needs review” status (Time in Review) was reduced by 6.8% (p=0.049).
The percentage of diffs taking longer than 3 days to close, excluding weekends, was reduced by 11.89% (p=0.004).
The average time to first reviewer comment or action was reduced by 9.9% (p=0.010).
They did not observe statistically significant regressions in the guardrail metrics.
Given the positive results, the code review team then rolled out NudgeBot to all developers at Meta.
Final thoughts
This paper gives an excellent blueprint for how Developer Productivity teams can systematically drive meaningful results for a business. The team at Meta took the following steps:
They identified a problem from their DevEx survey
They correlated the survey data with quantitative metrics to better understand the problem
They designed a solution, and set up an experiment to test their solution
They selected success metrics and guardrail metrics to evaluate their experiment
And once the experiment passed, they rolled out their solution to the rest of the organization
I will certainly be referring to this paper in the future.
That’s it for this week. Thanks for reading.
-Abi
Thanks for sharing, Abi.
At some companies I worked at, we had something similar to NudgeBot.
In the past I used to beg for reviews and stack branches and PRs to keep my work moving, which I think is extremely inefficient.
All of this effort (maintaining a new tool, still waiting for people to review, etc.) avoids tackling the underlying problem: people not working together, for instance in pairs.
Abi, thank you for sharing this insightful overview of Meta's approach to accelerating code reviews at scale. The detailed breakdown of their experiment, from identifying the issue in their DevEx survey to designing and testing the NudgeBot solution, offers a valuable blueprint for Developer Productivity teams.