This week I read The Developer Experience of Flaky Tests, a recently published paper by Owain Parry, Gregory M. Kapfhammer, Michael Hilton, and Phil McMinn. The study sought to better understand how developers define flaky tests, as well as their experiences of the problem's impacts and causes.
My summary of the paper
Developers encounter flaky tests often: over half of the participants in this study experience them on at least a monthly basis. And while a growing number of studies have focused on test flakiness, little attention has been given to developers' own views and experiences of the problem.
Researchers sought to fill that gap by conducting a multi-source study: they used a developer survey to learn how developers define and react to flaky tests, as well as the impacts and causes of the problem, followed by an analysis of StackOverflow threads to surface information the survey didn't capture.
Here’s what they found:
How developers define flaky tests
To lay the foundation for this study, developers were first asked whether they agreed with a common definition of what a flaky test is: “a test case that can both pass and fail without any changes to the code under test.” If a respondent did not agree with the definition, they had the opportunity to offer their own definition.
Most participants (93.5%) agreed with that definition. The most common theme amongst those who disagreed was that the definition should extend beyond the test case code and the code that it covers (e.g., “a flaky test is any test that changes from pass to fail in different environments”).
The impact flaky tests have on developers
The analysis also yielded a list of ways in which flaky tests impact developers. Developers most strongly agree with the notion that flaky tests hinder CI. They also agree that flaky tests lead to both a loss of productivity and a reduction in test efficiency.
The top third of Table 2 below shows the mean score and rank for each impact statement (ranks are in parentheses).
Another notable finding: developers who experience flaky tests on at least a monthly basis differ meaningfully from those who do not, in that they may be more likely to ignore potentially genuine test failures. This supports Martin Fowler’s position that when there are too many flaky tests, developers can lose the discipline to notice the failures.
The same dynamic also emerged when researchers looked at the actions developers take after identifying a flaky test. They found that developers who experience flaky tests more often are more likely to take no action in response to them, and less likely to attempt to repair them.
The causes of flaky tests
The most frequent cause of flaky tests is improper setup and teardown, and more specifically “tests not properly cleaning up after themselves or failing to set up their necessary preconditions.” Network-related issues were the second most frequent cause, with the authors noting that “any test case that requires a network connection will inevitably be flaky since infrastructure issues or periods of high traffic may cause the test case to fail”.
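To make the setup-and-teardown cause concrete, here is a minimal, hypothetical pytest sketch of my own (not from the paper): two tests share module-level state that neither one resets, so the verdict depends on test ordering rather than on any change to the code under test. All names here are illustrative.

```python
# test_profile_cache.py -- hypothetical example of flakiness from missing
# setup/teardown: shared state leaks between tests.

cache = {}  # module-level state that no test resets


def lookup(user_id):
    # Caches an (imagined) expensive lookup so repeat calls are cheap.
    return cache.setdefault(user_id, f"profile-{user_id}")


def test_cache_starts_empty():
    # Passes in a default run (it happens to execute first), but fails if the
    # test below has already run in the same session, e.g. under randomized
    # ordering or parallel sharding.
    assert cache == {}


def test_lookup_populates_cache():
    lookup(42)
    assert 42 in cache
```

The usual repair for this category is to restore the precondition explicitly, for example with an autouse pytest fixture that clears `cache` before each test.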
Developers were given the opportunity to offer additional causes of flaky tests. A common theme from those responses was that flaky tests are frequently caused by issues pertaining to external artifacts that the developer has no control over, such as third-party libraries and remote services.
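As a small illustration of that theme (again my own sketch, with a made-up URL and response schema), a test that reaches out to a third-party endpoint inherits that service's availability as part of its own verdict:

```python
# test_exchange_rates.py -- hypothetical sketch of a test coupled to a remote
# service the team does not control.

import json
import urllib.request


def fetch_rates(base="USD"):
    # Hits a live third-party endpoint on every call.
    url = f"https://rates.example.com/latest?base={base}"  # placeholder URL
    with urllib.request.urlopen(url, timeout=5) as response:
        return json.load(response)["rates"]


def test_eur_rate_is_published():
    # Flaky: an outage, rate limit, slow network, or response-schema change at
    # the provider fails this test even though the code under test is unchanged.
    rates = fetch_rates("USD")
    assert "EUR" in rates
```

A common mitigation is to stub that boundary (for example with `unittest.mock.patch`) so the test exercises only code the team owns.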
Final thoughts
Flaky tests are an example of a problem that developers experience frequently, yet often struggle to convey the importance of addressing. This paper can help DevEx teams and engineering leaders garner support for investing in reducing the prevalence of flaky tests by presenting the impact they have on developers.
That’s it for this week! If you know someone who would enjoy this newsletter, please consider sharing it with them.
Reach out with thoughts about this newsletter issue on LinkedIn, or reply to this email.
-Abi