Listen and watch now on YouTube, Apple, and Spotify.
In this episode, I’m joined by Quentin Anthony, Head of Model Training at Zyphra and a participant in METR’s recent study on AI coding tools. We explore the study’s unexpected findings—why developers often felt more productive using AI, but in many cases weren’t—and unpack the nuances of where these tools actually add value. Quentin offers practical, experience-backed advice on avoiding common pitfalls, such as the sunk-cost fallacy and context rot, evaluating task-level fit, and building the kind of tool hygiene that’s critical for long-term success with AI.
Some takeaways:
The biggest takeaways from the METR study
Quentin participated in a recent study by METR (Model Evaluation & Threat Research), which found that, on average, developers were slower when using AI despite feeling more productive.
The gap between perceived and actual efficiency is real—and often overlooked, especially when AI feels fun to use.
AI excels at documentation, unit tests, and refactoring—tasks that can often be completed in a single prompt.
For complex, low-level work—like GPU kernels or distributed systems—models tend to produce bloated code or require too much back-and-forth to be useful.
Time-box your AI usage to avoid the sunk-cost fallacy, and watch out for context rot
Quentin recommends setting strict time limits when using AI: if it’s not helping in 10–15 minutes, move on.
Developers often spend too long trying to force a model to help with the wrong kind of task.
Long chats and overloaded prompts can confuse models, causing hallucinations and inconsistent behavior.
Restarting chats frequently and summarizing past work helps keep models grounded and accurate.
Quentin recommends using summarization prompts to distill the current chat before restarting—this keeps context clean while avoiding repetition.
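For example, a distillation prompt along these lines can work (an illustrative sketch, not Quentin’s exact wording): “Summarize this conversation so I can start a fresh session: the goal, the current state of the code, key decisions made, and any unresolved issues, in under 200 words.”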
AI tools introduce more idle time than you think
Waiting on model responses—even just 15–30 seconds—adds up fast, especially with reasoning-heavy prompts.
Quentin uses this downtime for microtasks and blocks distractions, such as social media, to stay in the flow.
Prompting skill helps—but isn’t everything
Success with AI isn’t just about writing better prompts. Often, task-model mismatch is the real problem.
Blaming developers for tool failures ignores deeper limitations in model training and context handling.
Quentin treats AI output like junior engineer code—carefully reviewed, never blindly trusted. Even when correct, AI-generated code is often bloated or hard to maintain.
Focus on task-level fit, not team-level rollout
Organizations should evaluate AI usefulness at the task level, not by team, codebase, or tooling preference.
Not all work benefits from AI—even within the same repo or project.
In Quentin’s view, tasks like acceptance tests, PR review, or boilerplate generation are ideal candidates for model support—while planning and design should stay human-led.
Model behavior varies—test before trusting
Different models excel at different things. Claude may outperform Gemini at writing comments, while Gemini might be better at summarizing code.
Quentin tries new models on familiar tasks first and expands use only after carefully watching for failure patterns.
Claude is strong at writing clean, human-readable code.
Gemini 2.5 is particularly good at summarizing. Quentin picks models based on task-specific strengths rather than defaulting to one tool.
Tool sprawl creates friction, and multi-agent systems have their limits
Constantly switching between AI tools leads to confusion and inconsistent results.
Quentin keeps his toolset small and stable, adjusting slowly to new platforms to avoid cognitive overload.
When adopting new tools, Quentin starts with familiar, low-risk tasks like unit tests, then gradually expands usage as he learns how the model behaves.
Multi-agent systems are exciting, but they are still unreliable. They perform well in narrow settings, but struggle in real-world workflows.
For now, you’ll get more value from well-scoped tools and clearly defined use cases.
In this episode, we cover:
(00:00) Intro
(01:32) A brief overview of Quentin’s background and current work
(02:05) An explanation of METR and the study Quentin participated in
(11:02) Surprising results of the METR study
(12:47) Quentin’s takeaways from the study’s results
(16:30) How developers can avoid bloated codebases through self-reflection
(19:31) Signs that you’re not making progress with a model
(21:25) What is “context rot”?
(23:04) Advice for combating context rot
(25:34) How to make the most of your idle time as a developer
(28:13) Developer hygiene: the case for selectively using AI tools
(33:28) How to interact effectively with new models
(35:28) Why organizations should focus on tasks that AI handles well
(38:01) Where AI fits in the software development lifecycle
(39:40) How to approach testing with models
(40:31) What makes models different
(42:05) Quentin’s thoughts on agents
Where to find Quentin Anthony:
• LinkedIn: https://www.linkedin.com/in/quentin-anthony/
• X: https://x.com/QuentinAnthon15
Where to find Abi Noda:
• LinkedIn: https://www.linkedin.com/in/abinoda