Copilot’s impact on productivity: Results from three experiments
Pooled across experiments at Microsoft, Accenture, and a Fortune 100 firm, Copilot increased task completion by 26%, but adoption rates varied.
This is the latest issue of Engineering Enablement, a weekly newsletter covering the data behind world-class engineering organizations. To get articles like this in your inbox every Friday, subscribe:
This week I read The Effects of Generative AI on High Skilled Work: Evidence from Three Field Experiments with Software Developers, a paper examining the impact of Copilot on engineering productivity. The study was conducted by researchers from Princeton, MIT, Microsoft, and the University of Pennsylvania, and presents findings from three large-scale experiments involving over 5,000 developers.
My summary of the paper
AI coding tools are becoming more widespread, but it's still unclear how much they actually boost productivity. To investigate, the researchers analyzed data from three large-scale, randomized controlled trials conducted in real-world environments. The studies were run separately by three companies: Microsoft, Accenture, and an anonymous Fortune 100 electronics manufacturer. The results were then shared with the paper's authors, who analyzed and reported the findings.
In the Microsoft and Accenture experiments, one group of developers was randomly given access to Copilot while a control group went without, for seven months at Microsoft and four months at Accenture. At the anonymous company, all developers eventually gained access over a two-month period, but access dates were staggered, with some teams getting Copilot six weeks earlier than others.
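As a rough intuition for why a staggered rollout still permits a causal comparison: during the weeks when only some teams had access, the not-yet-treated teams act as a temporary control group. Below is a minimal sketch of that comparison with invented data; the paper's actual estimation strategy is more rigorous than this simple difference in means.

```python
# Illustrative sketch only: during the stagger window, early-access
# teams are "treated" and late-access teams are not-yet-treated
# controls. All numbers here are invented.
stagger_window_output = {
    # team: (got_copilot_early, PRs completed during the stagger window)
    "team_1": (True, 120),
    "team_2": (True, 110),
    "team_3": (False, 95),
    "team_4": (False, 90),
}

early = [prs for treated, prs in stagger_window_output.values() if treated]
late = [prs for treated, prs in stagger_window_output.values() if not treated]

# Simple difference in mean output between early- and late-access teams
diff = sum(early) / len(early) - sum(late) / len(late)
print(f"early-access mean - late-access mean = {diff:.1f} PRs")
```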
To evaluate the results across these experiments, the researchers looked at metrics like pull requests, commits, builds, Copilot adoption rates, and the number of suggestions made by Copilot versus accepted suggestions. For the Microsoft study, they also analyzed developer hire dates and seniority to see if tenure influenced the impact of AI tools.
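To make those measurements concrete, here is a minimal sketch of how a metric like Copilot's acceptance rate could be computed from per-developer logs. The field names and numbers are illustrative assumptions, not the paper's actual telemetry schema.

```python
# Illustrative sketch: computing core metrics from hypothetical
# per-developer event logs. Field names and values are assumptions.
from dataclasses import dataclass

@dataclass
class DeveloperLog:
    dev_id: str
    pull_requests: int       # completed PRs in the observation window
    commits: int
    builds: int
    suggestions_shown: int   # Copilot suggestions displayed
    suggestions_accepted: int

def acceptance_rate(log: DeveloperLog) -> float:
    """Share of displayed Copilot suggestions the developer accepted."""
    if log.suggestions_shown == 0:
        return 0.0
    return log.suggestions_accepted / log.suggestions_shown

logs = [
    DeveloperLog("dev_a", pull_requests=14, commits=52, builds=80,
                 suggestions_shown=400, suggestions_accepted=120),
    DeveloperLog("dev_b", pull_requests=9, commits=30, builds=45,
                 suggestions_shown=250, suggestions_accepted=60),
]

for log in logs:
    print(f"{log.dev_id}: acceptance rate = {acceptance_rate(log):.1%}")
```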
Here are the key findings from the study.
26.08% increase in task completion
In the Microsoft experiment, Copilot usage led to an increase in completed pull requests, commits, and builds. However, only the increase in pull requests was statistically significant. Similar trends were observed at Accenture and the anonymous company, though these effects weren't statistically significant. Notably, there was no negative impact on build success rates, which the researchers used as a proxy for code quality.
When the results from all experiments were combined, the research team found that Copilot increased task completion by 26%, as measured by completed pull requests. It also led to a 13.55% increase in commits and a 38.38% increase in builds.
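For intuition on what a figure like "26% more pull requests" means mechanically, the lift is the treatment-group mean relative to the control-group mean. A toy example with invented numbers:

```python
# Illustrative arithmetic only: how a relative lift such as "+26% pull
# requests" is derived from group means. These counts are invented.
control_prs_per_dev = [8, 10, 7, 9, 11]    # hypothetical control group
copilot_prs_per_dev = [11, 12, 9, 12, 13]  # hypothetical Copilot group

control_mean = sum(control_prs_per_dev) / len(control_prs_per_dev)
copilot_mean = sum(copilot_prs_per_dev) / len(copilot_prs_per_dev)

lift = (copilot_mean - control_mean) / control_mean
print(f"control mean: {control_mean:.1f} PRs")
print(f"Copilot mean: {copilot_mean:.1f} PRs")
print(f"relative lift: {lift:.1%}")
```

With these made-up counts the lift works out to roughly 27%, in the same ballpark as the paper's pooled estimate.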
Higher gains for less experienced developers
With Microsoft providing data on developer characteristics, the researchers were able to explore how Copilot impacts productivity based on developer tenure and seniority.
Developers were split into two groups: those with "shorter tenure" and "longer tenure," based on the median tenure of participants. They were also categorized as "junior" or "senior" based on their job level.
Key findings from this analysis include:
Adoption rates: Developers with shorter tenures were 9.5% more likely to adopt Copilot compared to their longer-tenured peers. Junior developers also had a higher adoption rate, though the difference was smaller at 5.3%.
Ongoing use: Short-tenure developers were more likely to continue using Copilot after the first month, suggesting they saw more value in it. This wasn’t observed when comparing junior and senior developers.
Acceptance rates: Long-tenure developers were 4.3% less likely to accept Copilot's code suggestions, while senior developers were 1.8% less likely than juniors to accept AI-generated code.
When looking at output metrics—pull requests, commits, and builds—developers with shorter tenures and junior developers saw the biggest gains. Short-tenure developers saw increases ranging from 27% to 39%, while long-tenure developers saw smaller gains of 8% to 13%. Similarly, junior developers improved their output by 21% to 40%, compared to senior developers, who saw gains of 7% to 16%.
AI adoption rates vary
The authors examined Copilot adoption across the three experiments. In this paper, "adoption" refers to the first time a software engineer uses Copilot, even if they later stop using it. "Trial" might be a more fitting term here.
Adoption timelines and levels varied, but across all experiments, about 30-40% of developers chose not to try Copilot. This suggests that factors beyond access, like individual preferences and perceived utility, play significant roles in the decision to use the tool.
Another study explored the factors influencing developers' adoption of AI tools and found several barriers: AI tools not meeting expectations, fear of judgment from peers, and the lack of a culture that promotes sharing best practices around AI tool usage. These hurdles can significantly slow adoption, even when the tools are available.
Final thoughts
Two insights from this study stood out:
Less experienced developers (both in tenure and seniority) accept AI-generated suggestions more frequently than their more experienced counterparts. This raises concerns, especially given the potential for AI to produce buggy or outdated code. But rather than limiting the use of AI tools, the focus should be on education. A recent survey found that one of the biggest productivity gains developers experience is in learning new concepts and conducting research. So leaders should focus on guiding the usage of these tools, for example by offering training on where they provide the most value, defining clear guidelines for their use, and fostering a culture of internal knowledge-sharing around how others are successfully leveraging them.

Across the three experiments, 30-40% of developers opted not to adopt Copilot. This underscores a key point: simply providing access to AI tools isn't enough to realize the productivity gains they promise.
As AI tools quickly become a standard part of the development workflow, studies like this help leaders better grasp how these tools are reshaping the way teams work, and what risks or challenges they might bring.
Who’s hiring right now
Here is a roundup of Developer Productivity job openings. Find more open roles here.
SiriusXM is hiring a Staff Software Engineer - Platform Observability | US
Vercel is hiring an Engineering Manager - Build and Deploy | UK
Snowflake is hiring a Senior Engineer - Developer Productivity | Bellevue
Wix is hiring a Software Team Lead - Developer Experience | Tel Aviv
That’s it for this week. Thanks for reading.
-Abi
What's interesting is the huge difference between "Junior" and "Senior".
Almost 40% more output vs. 10%.
And there's something significant missing for me:
- What was the purpose of that "productivity" gain?
Measuring the number of PRs and being told junior devs produced almost twice as many, I ask myself:
-> Did that increase bugs and errors in the application itself? (Something a successful build doesn't tell us.)
I'm sure AI tools increase productivity (I use them myself), but the study lacks a detailed analysis of "What was the cause of another commit or PR?"
Hypothesis: more PRs and commits resulted in bugs, which led to another PR fixing that bug. Especially when you're inexperienced.