Listen and watch now on YouTube, Apple, and Spotify.
AI engineering tools are evolving fast. Every month brings new coding assistants, debugging agents, and automation capabilities. I want to help engineering leaders take advantage of that innovation while avoiding costly experiments that distract from real product work.
In this episode, Abi Noda and I share a practical, data-driven approach to evaluating AI tools. I walk through how to shortlist tools by use case, design structured trials that reflect real work, select representative participants, and measure impact using baselines and proven frameworks. My goal is to give you a way to test and adopt AI tools with confidence and a clear return on investment.
Some takeaways:
Data-driven evaluations are essential
• Structured, measurable trials prevent bias. Without them, decisions are driven by novelty hype or a few loud voices.
• Define a clear business outcome first (reduce toil, improve delivery speed, or raise code quality).
• Evaluations must inform real decisions, not just check a procurement box.

Choose the right set of tools to evaluate
• Group tools by use case and interaction mode (chat, agentic IDEs, code review assistants, etc.) to ensure fair comparisons.
• Match the shortlist size to your organization’s capacity to support multiple cohorts and produce reliable results.
• Multi-vendor strategies reduce lock-in in a rapidly shifting market.

Re-evaluations are essential, not optional
• Incumbent tools must be retested as capabilities evolve and new challengers emerge.
• Triggers for re-evaluation include major feature launches, organic developer adoption of a new tool, and upcoming renewal cycles.
• Every challenger evaluation needs a baseline of the incumbent so you can compare like-for-like.
• A cadence of every 8–14 months ensures decisions reflect current reality, not a past purchase.

Design trials around research questions
• Start with a hypothesis; it keeps experiments aligned to actual goals.
• Developer sentiment is necessary but insufficient without measurable outcomes.
• Success criteria must be defined in advance to avoid subjective decision-making.

Select representative participants
• Diverse cohorts reveal real impact across languages, teams, and seniority levels.
• Include skeptical and late adopters to uncover onboarding and enablement needs.
• Volunteer-only trials distort results and won’t translate to a full-org rollout.

Run evaluations long enough to capture true behavior
• Eight to twelve weeks is the minimum to get past the novelty phase and into sustained usage.
• Align evaluation windows with procurement cycles so insights guide buying decisions.
• Short trials produce false signals, either inflating enthusiasm or creating unwarranted negativity.

Use self-reported time savings carefully
• Self-reporting is a strong early indicator of perceived usefulness.
• Humans misremember time, often benchmarking against their most recent AI use.
• Treat CSAT and self-reported time savings as directional, not the final truth.
• Objective metrics validate real ROI, including throughput, quality, and innovation time.

Expect variation rather than a single winner
• Different tools shine in different contexts, so supporting multiple standards is often the best path.
• Continuous re-evaluation is required as capabilities evolve every quarter.
• The right goal isn’t the “best tool”, but the best tool for each problem space.
In this episode, we cover:
(00:00) Intro: Running a data-driven evaluation of AI tools
(02:36) Challenges in evaluating AI tools
(06:11) How often to re-evaluate AI tools
(07:02) Incumbent tools vs challenger tools
(07:40) Why organizations need disciplined evaluations before rolling out tools
(09:28) How to size your tool shortlist based on developer population
(12:44) Why tools must be grouped by use case and interaction mode
(13:30) How to structure trials around a clear research question
(16:45) Best practices for selecting trial participants
(19:22) Why support and enablement are essential for success
(21:10) How to choose the right duration for evaluations
(22:52) How to measure impact using baselines and the AI Measurement Framework
(25:28) Key considerations for an AI tool evaluation
(28:52) Q&A: How reliable is self-reported time savings from AI tools?
(32:22) Q&A: Why not adopt multiple tools instead of choosing just one?
(33:27) Q&A: Tool performance differences and avoiding vendor lock-in
Where to find Laura Tacho:
• LinkedIn: https://www.linkedin.com/in/lauratacho/
• Website: https://lauratacho.com/
• Laura’s course (Measuring Engineering Performance and AI Impact): https://lauratacho.com/developer-productivity-metrics-course
Where to find Abi Noda:
• LinkedIn: https://www.linkedin.com/in/abinoda
• Substack: https://substack.com/@abinoda