This is the latest issue of my newsletter. Each week I share research and perspectives on developer productivity.
Past issues of this newsletter have covered the history of DORA as well as recent findings from their reports. Today is a technical deep-dive that’s focused on the science and methodology behind DORA’s research.
Derek DeBellis, lead researcher at DORA, joined me on the Engineering Enablement podcast to walk us through DORA’s research process step by step. This newsletter summarizes what Derek shared, including how they define the outcomes and factors they want to measure, and how they manage the survey design process, analysis, and structural equation modeling.
If you follow DORA’s reports, today’s newsletter will give you a behind-the-scenes look at how they come together.
Note: Most of Derek's responses have been lightly edited for clarity. In some instances, I have summarized his points. You can listen to the full conversation here.
Let’s start from the beginning. How do you figure out what people care about—the outcomes?
Derek: The basic recipe for DORA is this: learn what people want to accomplish, figure out what might help them accomplish that, and then try to quantify the relationship. So the first step in the process of creating a DORA report is to learn what people want to accomplish.
For context, if we’re not studying what people are actually trying to accomplish, we’ll have a relevance problem. We need to meet people where they’re trying to go, not where we think they ought to go.
We do this through qualitative research. We’ll try to understand the outcomes that engineers at different levels, or people adjacent to engineers at different levels, are trying to accomplish. There’s also a thriving DORA community where you can learn a lot about what people are trying to accomplish. This is how we come up with the key outcomes we believe people are striving to reach or avoid.
The next steps are to develop and pre-test the survey items. What does that process look like for you?
1. Coming up with the factors in the model
The way we hypothesize about the drivers that may predict or affect the outcomes is through literature review and through talking with people. One thing that strikes me is that simply listening to what’s going on in the community can be just as useful as the literature reviews, if not more so.
2. Identifying potential confounding factors
After this, we usually have a sense of our model. We also try to figure out the confounds—the variables that could potentially add noise to the relationships we’re exploring. In other words, what factors might make it look like there’s a relationship when there really isn’t one?
An example of this could be organization size. A large organization might be more likely to adopt a given technical practice, and maybe also more likely to have, let’s say, better software delivery performance. That isn’t necessarily true, it’s just an example. But it could make it look like there’s a connection between the technical practice and high software delivery performance when a connection doesn’t actually exist. So we try to capture these confounds before we run the survey, because if we don’t have them in our data, we can’t account for them and we’re likely to give biased results.
3. Writing survey questions
The operationalization part happens at this point too. We have all these concepts that are really hazy, and the literature might not have a set of survey items you can grab and run with. This is the more artistic component of research: you take a concept and think, how in the world would I measure this? If I had three questions I could ask you to figure out whether you’re burnt out, for example, what would they be? Or if I really wanted to understand what loosely coupled teams look like, what would I ask you to figure out whether your team is dependent on a bunch of other teams?
So we go through these questions with subject matter experts. We ask them to give us the questions they would use to diagnose whether something is happening on someone’s team. We inevitably will generate approximately 700 questions through this process.
4. Pre-testing survey items
Then we pre-test the questions. Everybody who takes the survey should be able to respond to the questions. Also, our survey is way too long, and shortening it is something we’re always working on in pre-testing. Last year we got it under 15 minutes; this year we’re going for 10 minutes. I would rather have a little bit of good data than a lot of bad data.
We figure out whether the survey is a decent length, if the respondents can comprehend these questions, and what the cognitive load looks like. We also have a few questions at the end of the survey pre-test about how easy it was to take the survey and how much effort it required.
One of the most common problems we find in the pre-testing process is that the survey takes people a lot more effort than we would hope. That’s especially true when the respondent is someone who should be included in the survey but who isn’t an expert on a particular technical capability or practice. Other than that, simplifying the incredibly technical questions is a challenge, but because we’re in a technical space, we always encounter that problem. I also catch a lot of questions that we should never ask, so I’m so happy we do that pre-testing work. It can be a humbling process.
You develop and launch the survey, then you go into the analysis and cleaning. How does that work?
1. Cleaning the data
There’s a very short window of about three to four weeks between when we have survey data and the launch date for the report. During that time, the first step is the cleaning process, where we screen out people who may not have taken the survey in good faith.
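DORA hasn’t shared its actual screening code, but to make the idea concrete, here is a rough sketch of what this kind of cleaning step can look like in pandas. The column names, thresholds, and checks are all hypothetical.

```python
import pandas as pd

# Hypothetical survey data: one row per respondent, Likert items q1..q20,
# plus metadata captured by the survey platform.
responses = pd.read_csv("survey_responses.csv")

likert_cols = [f"q{i}" for i in range(1, 21)]

# Flag "straight-liners": respondents who gave the same answer to every item.
straight_lined = responses[likert_cols].nunique(axis=1) == 1

# Flag speeders: completion times implausibly fast for a ~15-minute survey.
too_fast = responses["duration_seconds"] < 180

# Flag respondents who failed a hypothetical attention-check item.
failed_check = responses["attention_check"] != "agree"

cleaned = responses[~(straight_lined | too_fast | failed_check)]
print(f"Kept {len(cleaned)} of {len(responses)} responses")
```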
2. Exploratory factor analysis
We go into an exploratory factor analysis after the data is cleaned. The reason we start with an exploratory factor analysis is so we don’t impose our own ideas of how these things should group together. When we put all the questions into an exploratory factor analysis, it shows us how they group together without us providing much of an opinion. Confirmatory factor analysis, by contrast, is when we test whether things group together the way we hypothesized.
With the exploratory factor analysis, we can see if our theories naturally fall out of the data. We like to do that because we feel it’s a higher bar. If you start with a confirmatory analysis, it can make you feel like your theory is right when in reality you just didn’t test all the other ways the data could have been grouped.
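To make the distinction concrete, here is a minimal sketch of an exploratory factor analysis using the open-source factor_analyzer package. The item data, factor count, and rotation are illustrative assumptions, not DORA’s actual setup.

```python
import pandas as pd
from factor_analyzer import FactorAnalyzer

# Hypothetical Likert responses to survey items, one column per item.
items = pd.read_csv("survey_items.csv")

# Let the data suggest the groupings rather than imposing them:
# fit an EFA with an oblique rotation (factors are allowed to correlate).
efa = FactorAnalyzer(n_factors=4, rotation="oblimin")
efa.fit(items)

# Loadings show which items cluster onto which latent factor.
loadings = pd.DataFrame(efa.loadings_, index=items.columns)
print(loadings.round(2))

# Eigenvalues help judge how many factors the data actually supports.
eigenvalues, _ = efa.get_eigenvalues()
print(eigenvalues.round(2))
```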
3. Confirmatory factor analysis
From there, if our theory is holding up, and especially if the constructs make intuitive sense, we move on to a confirmatory factor analysis. That gives us our measurement model.
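For the confirmatory step, the hypothesized groupings are written down as an explicit measurement model and tested against the data. Here is a minimal sketch with the semopy package; the construct and item names are hypothetical.

```python
import pandas as pd
from semopy import Model, calc_stats

items = pd.read_csv("survey_items.csv")

# Hypothetical measurement model: each latent construct is measured
# by the items the exploratory analysis grouped together.
spec = """
loose_coupling =~ q1 + q2 + q3
burnout        =~ q4 + q5 + q6
"""

cfa = Model(spec)
cfa.fit(items)

print(cfa.inspect())    # factor loadings and their estimates
print(calc_stats(cfa))  # fit indices such as CFI and RMSEA
```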
4. Finding relationships in the model
In 2022, I used a method called partial least squares (PLS) to find relationships in the data. PLS is a type of structural equation modeling that helps us understand how well our explanatory variables explain the variance in the outcomes.
In 2023, we changed our approach. Instead of making one big, complex model, we created several smaller, focused models based on specific hypotheses. The reason for this change is that large models can be very complicated, and adding more variables can make the relationships between them confusing. With smaller models, it's easier to see and understand the relationships between variables. We can identify potential confounding factors and avoid getting confused by unrelated connections within a large model.
When testing our models, we use different methods to compare them. These methods help determine if adding a new variable or pathway improves our understanding. For example, we might use R-squared, which shows how much of the data's variation we can explain. Other methods include leave-one-out cross-validation, AIC, and BIC, which all help us decide if a new addition is valuable.
The goal is to keep our models as simple as possible while still being accurate, following the principle of Occam's razor: don't add unnecessary complexity. So, we repeatedly test and refine our models to ensure they are both effective and straightforward.
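DORA hasn’t published the specific code behind these comparisons, but the underlying idea can be shown with an ordinary regression in statsmodels: fit a baseline model and an extended model, then check whether the added variable improves fit (here via R-squared, AIC, and BIC) without adding unnecessary complexity. The variable names are hypothetical.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical construct scores, one row per respondent.
df = pd.read_csv("model_data.csv")

# Baseline model vs. a model with one additional explanatory variable.
base = smf.ols("delivery_performance ~ loose_coupling", data=df).fit()
extended = smf.ols(
    "delivery_performance ~ loose_coupling + continuous_delivery", data=df
).fit()

# Does the extra variable earn its keep? Higher R^2 is better;
# lower AIC/BIC penalizes added complexity.
for name, m in [("base", base), ("extended", extended)]:
    print(f"{name}: R^2={m.rsquared:.3f}  AIC={m.aic:.1f}  BIC={m.bic:.1f}")
```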
As for effect sizes, we used a Bayesian framework for our analysis this year, which I like because it's flexible and provides a range of possible values for our estimates. For example, if we want to study how independent team structures affect burnout, we calculate a beta weight that shows the strength of this relationship. We can look at the range of plausible values for the beta weight and see how many of them fall within a band around zero that indicates no meaningful effect. That band is called the region of practical equivalence (ROPE).
If the estimated effect size falls within this range (e.g., between -0.2 and 0.2 on a ten-point scale), it means the effect is too small to be practical, even if it's statistically significant. With a large sample size of 3000 survey respondents, we might find statistical significance, but it wouldn't be worth the effort for practitioners to focus on this small effect. Therefore, we focus on effects that are clearly outside the ROPE, meaning they have a more substantial impact and are worth considering.
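Here is a small illustration of the ROPE idea using simulated posterior draws; in a real analysis the draws would come from the fitted Bayesian model, and the effect size, ROPE bounds, and variable names below are assumptions for the example.

```python
import numpy as np

# Hypothetical posterior draws for the beta weight linking independent
# team structures to burnout (simulated here purely for illustration).
rng = np.random.default_rng(42)
posterior_beta = rng.normal(loc=-0.35, scale=0.12, size=4000)

# Region of practical equivalence: effects this close to zero are treated
# as "no meaningful effect", even if statistically significant.
rope_low, rope_high = -0.2, 0.2

inside_rope = np.mean((posterior_beta > rope_low) & (posterior_beta < rope_high))
print(f"Probability the effect is practically negligible: {inside_rope:.1%}")
print(f"95% credible interval: {np.percentile(posterior_beta, [2.5, 97.5]).round(2)}")
```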
How do you come up with the benchmarks?
There's a bit of art and science to coming up with benchmarks for things like change lead time, because we’re using surveys and the response options are ranges, such as once per week or several times a week. Turning those ranges into a hard number can be a challenge. Thankfully, we’re not the first ones to deal with this; there’s a lot of existing research on handling this kind of data. Survey data like this is ordinal, meaning it's ranked but not measured on a precise scale. We know the answers aren’t exact numbers, but we try our best to treat them as such.
In structural equation modeling, you can choose different estimation methods for this type of data, and some work better than others for ordinal data. After that processing, we can often treat the results as continuous numbers. However, survey items like change fail rate or deployment frequency are tricky because they are just ranges.
My approach is to test how sensitive our results are to the method we choose. I try clustering the data in four different ways to see if we get very different answers. If the answers are very different, then the method might not be reliable. But if three out of four methods give similar answers, I feel more confident in the results.
It’s not perfect, but by using multiple methods and comparing the results, we can get a good sense of whether our analysis is reliable. If the results aren’t too dependent on the method used, it’s probably a pretty good answer.
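As an illustration of that sensitivity check, here is a sketch that maps an ordinal deployment-frequency item onto several different numeric codings and compares the resulting correlations with an outcome. The codings and column names are hypothetical; the point is simply that the conclusion shouldn’t hinge on any single coding choice.

```python
import pandas as pd
from scipy import stats

# Hypothetical data: an ordinal deployment-frequency answer coded 1..6
# and a continuous delivery-performance score.
df = pd.read_csv("survey_responses.csv")

# Several plausible ways of turning the ordinal ranges into numbers.
codings = {
    "raw_rank":  {1: 1, 2: 2, 3: 3, 4: 4, 5: 5, 6: 6},
    "midpoints": {1: 0.5, 2: 2, 3: 10, 4: 45, 5: 180, 6: 365},   # deploys/year
    "log_scale": {1: 0.0, 2: 0.7, 3: 2.3, 4: 3.8, 5: 5.2, 6: 5.9},
    "coarse":    {1: 1, 2: 1, 3: 2, 4: 2, 5: 3, 6: 3},           # low/med/high
}

# If the correlation looks similar under most codings, the result is
# probably not an artifact of how we numbered the ranges.
for name, mapping in codings.items():
    x = df["deploy_frequency"].map(mapping)
    r, p = stats.pearsonr(x, df["delivery_performance"])
    print(f"{name:>10}: correlation={r:.2f} (p={p:.3f})")
```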
What should we expect from this year’s report?
This year we’re focusing on three areas that just keep coming up. We’re not changing our outcomes, but we’re interested in diving into topics people are very curious about:
The first is artificial intelligence. We want to see how developers are using AI, if it's having any significant effects, and what developers think about it.
We’re also interested in workplace environments, meaning the antecedents of developer experience. We want to learn more about how teams function and the practices and philosophies underlying them.
The third area, which we’ve been hesitant to get into because there are so many diverging connotations about what this means, is platform engineering. We want to better understand how organizations are approaching platform engineering. Platform engineering may include both technologies and teams. How does platform engineering impact software delivery performance?
That concludes my interview with Derek. If you enjoyed this issue, be sure to complete the 2024 DORA survey.
Special thanks to Derek for generously sharing his time to discuss how his team’s research comes together.
Upcoming webinar
I’m hosting a live conversation to learn how Airbnb, GitHub, and Jumio have adopted GenAI tools and the impact they’re observing.
Who’s hiring right now
Here is a roundup of recent Developer Experience job openings. Find more open roles here.
Snyk is hiring a VP, Engineering - Developer Experience | Boston, London
Uber is hiring a Sr. Staff Engineer (Gen AI) - Developer Platform | US
Webflow is hiring an Engineering Manager - Developer Productivity | US
That’s it for this week. Thanks for reading.
-Abi