Listen and watch now on YouTube, Apple, and Spotify.
In this episode, I’m joined by Quentin Anthony, Head of Model Training at Zyphra and a participant in METR’s recent study on AI coding tools. We explore the study’s unexpected findings—why developers often felt more productive using AI, but in many cases weren’t—and unpack the nuances of where these tools actually add value. Quentin offers practical, experience-backed advice on avoiding common pitfalls, such as the sunk-cost fallacy and context rot, evaluating task-level fit, and building the kind of tool hygiene that’s critical for long-term success with AI.
Some takeaways:
The biggest takeaways from the METR study
Quentin participated in a recent study by METR (Model Evaluation & Threat Research), which found that, on average, developers were slower when using AI despite feeling more productive.
The gap between perceived and actual efficiency is real—and often overlooked, especially when AI feels fun to use.
AI excels at documentation, unit tests, and refactoring—tasks that can often be completed in a single prompt.
For complex, low-level work—like GPU kernels or distributed systems—models tend to produce bloated code or require too much back-and-forth to be useful.
Time-box your AI usage to avoid the sunk-cost fallacy, and watch out for context rot
Quentin recommends setting strict time limits when using AI: if it’s not helping in 10–15 minutes, move on.
Developers often spend too long trying to force a model to help with the wrong kind of task.
Long chats and overloaded prompts can confuse models, causing hallucinations and inconsistent behavior.
Restarting chats frequently and summarizing past work helps keep models grounded and accurate.
Quentin recommends using summarization prompts to distill the current chat before restarting—this keeps context clean while avoiding repetition.
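For example, a distillation prompt along these lines can work (an illustrative sketch, not Quentin’s exact wording): “Summarize this conversation so I can start a fresh session: the goal, the current state of the code, key decisions made, and any unresolved issues, in under 200 words.”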
AI tools introduce more idle time than you think
Waiting on model responses—even just 15–30 seconds—adds up fast, especially with reasoning-heavy prompts.
Quentin uses this downtime for microtasks and blocks distractions, such as social media, to stay in the flow.
Prompting skill helps—but isn’t everything
Success with AI isn’t just about writing better prompts. Often, task-model mismatch is the real problem.
Blaming developers for tool failures ignores deeper limitations in model training and context handling.
Quentin treats AI output like junior engineer code—carefully reviewed, never blindly trusted. Even when correct, AI-generated code is often bloated or hard to maintain.
Focus on task-level fit, not team-level rollout
Organizations should evaluate AI usefulness at the task level, not by team, codebase, or tooling preference.
Not all work benefits from AI—even within the same repo or project.
In Quentin’s view, tasks like acceptance tests, PR review, or boilerplate generation are ideal candidates for model support—while planning and design should stay human-led.
Model behavior varies—test before trusting
Different models excel at different things. Claude may outperform Gemini at writing comments, while Gemini might be better at summarizing code.
Quentin tries new models on familiar tasks first and expands use only after carefully watching for failure patterns.
Claude is strong at writing clean, human-readable code.
Gemini 2.5 is particularly good at summarizing. Quentin picks models based on task-specific strengths rather than defaulting to one tool.
Tool sprawl creates friction, and multi-agent systems have their limits
Constantly switching between AI tools leads to confusion and inconsistent results.
Quentin keeps his toolset small and stable, adjusting slowly to new platforms to avoid cognitive overload.
When adopting new tools, Quentin starts with familiar, low-risk tasks like unit tests, then gradually expands usage as he learns how the model behaves.
Multi-agent systems are exciting, but they are still unreliable. They perform well in narrow settings, but struggle in real-world workflows.
For now, you’ll get more value from well-scoped tools and clearly defined use cases.
In this episode, we cover:
(00:00) Intro
(01:32) A brief overview of Quentin’s background and current work
(02:05) An explanation of METR and the study Quentin participated in
(11:02) Surprising results of the METR study
(12:47) Quentin’s takeaways from the study’s results
(16:30) How developers can avoid bloated codebases through self-reflection
(19:31) Signs that you’re not making progress with a model
(21:25) What is “context rot”?
(23:04) Advice for combating context rot
(25:34) How to make the most of your idle time as a developer
(28:13) Developer hygiene: the case for selectively using AI tools
(33:28) How to interact effectively with new models
(35:28) Why organizations should focus on tasks that AI handles well
(38:01) Where AI fits in the software development lifecycle
(39:40) How to approach testing with models
(40:31) What makes models different
(42:05) Quentin’s thoughts on agents
Where to find Quentin Anthony:
• LinkedIn: https://www.linkedin.com/in/quentin-anthony/
• X: https://x.com/QuentinAnthon15
Where to find Abi Noda:
• LinkedIn: https://www.linkedin.com/in/abinoda