Engineering Enablement Podcast
AI and productivity: A year-in-review with Microsoft, Google, and GitHub researchers


In this year-end Engineering Enablement episode, Laura Tacho and research leaders from Microsoft, Google, and GitHub unpack what research says about measuring AI’s impact on engineering teams.

Listen and watch now on YouTube, Apple, and Spotify.

As we close out 2025, I wanted to step back and take stock of what we have actually learned about AI adoption in engineering organizations. Not just where usage has increased, but where impact is real, where it is overstated, and what questions remain unanswered.

In this special year-end episode, I’m joined by Brian Houck from Microsoft, Collin Green and Ciera Jaspan from Google, and Eirini Kalliamvakou from GitHub. Together, we unpack the research each of them worked on this year and explore how leading organizations are thinking about AI measurement, developer experience, and long-term productivity. We talk candidly about why measuring AI’s impact is so difficult, why familiar metrics like lines of code keep resurfacing despite their flaws, and how multidimensional approaches like SPACE and DORA offer a more realistic lens.

We also look ahead to 2026. We discuss how AI is beginning to reshape the identity of the developer, how junior engineers’ skill sets may evolve, where agentic workflows are gaining traction, and why some of the most widely shared AI studies were misunderstood. This episode is an honest conversation about moving past hype and toward a more grounded, evidence-based approach to AI adoption in engineering teams.

Some takeaways:

Measuring AI impact requires multiple lenses

  • There is no single metric that can capture AI’s impact. Developer productivity and experience are inherently multidimensional, requiring trade-offs to be evaluated across speed, quality, collaboration, and meaning.

  • Frameworks like SPACE and DORA help avoid metric tunnel vision. They encourage teams to examine complementary signals rather than optimizing one dimension at the expense of others.

  • Measurement must reflect systems, not tools. AI does not operate in isolation; its impact depends on organizational context, workflows, and existing engineering practices.

Why familiar metrics keep failing us

  • Lines of code remains a deeply misleading metric. AI tends to generate verbose code, making raw output a poor proxy for productivity, quality, or long-term maintainability.

  • More code does not equal better outcomes. Excess code can increase maintenance burden, technical debt, and cognitive load over time.

  • Easy-to-measure metrics are often the most dangerous. Their simplicity makes them attractive during periods of uncertainty, even when they obscure what is actually changing.

The limits of tracking AI-generated code

  • Measuring the percentage of AI-generated code oversimplifies reality. AI may write, delete, refactor, or reorganize code in ways that raw percentages fail to capture.

  • AI-generated code does not inherently signal higher risk. In some contexts, AI output may be more consistent or higher quality than human-written code.

  • These metrics are better used as supporting signals, not goals. They can inform budgeting, experimentation, or adoption patterns but should not drive performance targets.

How AI is reshaping the role of the developer

  • Developers are shifting from implementers to orchestrators. Advanced AI users spend more time framing problems, setting context, and validating outcomes than writing raw code.

  • AI fluency is becoming a core skill. Knowing how to guide, correct, and collaborate with agents is increasingly important.

  • Adoption follows a progression. Developers tend to move from skepticism to exploration, collaboration, and eventually strategic use as expectations recalibrate.

What this means for junior engineers

  • Skill development may accelerate rather than disappear. Junior engineers may practice delegation, planning, and system-level thinking earlier by working with AI agents.

  • Technical fundamentals still matter. Understanding architecture, requirements, and failure modes remains essential for supervising AI-generated work.

  • Interpersonal skills risk being deprioritized. Managing agents is not the same as managing people, raising concerns about how collaboration skills develop over time.

AI is not just a productivity tool

  • Creativity and innovation benefit from friction. Research suggests that exposing decision points and seams can create space for new ideas rather than faster repetition.

  • Automating everything is not always desirable. Removing all toil may reduce opportunities for learning, insight, and creative problem-solving.

  • AI should augment thinking, not replace it. Tools that surface trade-offs and choices can support better outcomes than those that simply eliminate effort.

High-leverage AI use cases focus on toil

  • Developers spend only about 14% of their time writing code. Optimizing coding alone rarely leads to large productivity gains.

  • The biggest opportunities lie in removing friction. Documentation, compliance tasks, incident response, flaky tests, and knowledge discovery consistently rank as top pain points.

  • AI excels at work developers dislike but must still do. Automating dull, repetitive tasks can improve satisfaction and free time for meaningful work.

Why leadership and change management matter

  • AI adoption is a human problem before it is a technical one. Organizations that understand developer pain points deploy AI more effectively.

  • Agentic workflows amplify organizational differences. Teams with strong experimentation cultures and feedback loops move faster and with less friction.

  • Culture determines outcomes. How leaders communicate expectations, normalize experimentation, and support learning shapes whether AI adoption succeeds or stalls.

Looking ahead to 2026

  • Task parallelization is an emerging frontier. Developers are beginning to use agents to explore multiple solution paths simultaneously.

  • Collaboration with agents will redefine productivity. Teams, not just individuals, will increasingly work alongside AI systems.

  • Research must evolve with the work itself. New workflows will require new metrics, new telemetry, and new ways of understanding impact.

Lessons from the METR paper

  • Context matters more than headlines suggest. Results showing slower performance often reflected expert developers working in familiar codebases.

  • AI may help most where familiarity is lowest. New domains, unfamiliar systems, and onboarding scenarios show different outcomes.

  • Media oversimplification distorts understanding. Nuance is critical when interpreting AI research, especially as studies move into real-world environments.

In this episode, we cover:

(00:00) Intro

(02:35) Introducing the panel and the focus of the discussion

(04:43) Why measuring AI’s impact is such a hard problem

(05:30) How Microsoft approaches AI impact measurement

(06:40) How Google thinks about measuring AI impact

(07:28) GitHub’s perspective on measurement and insights from the DORA report

(10:35) Why lines of code is a misleading metric

(14:27) The limitations of measuring the percentage of code generated by AI

(18:24) GitHub’s research on how AI is shaping the identity of the developer

(21:39) How AI may change junior engineers’ skill sets

(24:42) Google’s research on using AI and creativity

(26:24) High-leverage AI use cases that improve developer experience

(32:38) Open research questions for AI and developer productivity in 2026

(35:33) How leading organizations approach change and agentic workflows

(38:02) Why the METR paper resonated and how it was misunderstood

Referenced:

Measuring AI code assistants and agents

Kiro

Claude Code

SPACE framework: a quick primer

DORA | State of AI-assisted Software Development 2025

Martin Fowler, by Gergely Orosz (The Pragmatic Engineer)

Seamful AI for Creative Software Engineering: Use in Software Development Workflows (IEEE Xplore)

AI Where It Matters: Where, Why, and How Developers Want AI Support in Daily Work - Microsoft Research

Unpacking METR’s findings: Does AI slow developers down?

DX Annual 2026
