7 Comments
Sep

My N=1 experiment with all the publicly available options, including raw frontier models and purpose-built AI development tools like v0 or Bolt, suggests that while I can get some help, especially with ideation, the more specific I become about the requirements, the less useful the results are.

john flournoy

What a very interesting study and nicely concise write-up! One thing you might want to watch out for here: it looks like all three of your figures are direct reproductions of the authors' Fig 5, Fig 6, and Fig 7, but branded with the name of this newsletter. Can I suggest either directly using their figures (fair use) or at least adding a note that gives credit, e.g., "minimally adapted from [citation]"? I found myself mistakenly thinking that you had put the figures together yourself from tables in their paper because they had not provided visualizations.

Abi Noda

We used to include attributions but somehow got away from it without realizing. We've updated the images in this post — thanks for the nudge.

I.M.J. McInnis

So, is the typical Upwork SWE task now much harder (since the easy ones are getting solved by AI before they're ever posted, or solved very quickly by someone using an AI)?

Abi Noda

This paper didn’t explore that question, but that's a really interesting speculation. Would love to see future research dig into that.

Joachim Sammer

I think the ‘more attempts’ claim needs clarification. It sounds like there is an improvement with more attempts, whereas more tries might simply lead to probabilistic success if the LLM gets lucky. There is also a (k) in the diagram that is not explained in the text. Commonly this stands for kilo, as in 1,000. So, does the hapless engineering manager of the LLM team have to wade through thousands of results? Even 7 is bad enough…

Abi Noda

Great point — and I agree the chart could be clearer.

The “more attempts” refers to the pass@k metric from Chen et al. (2021), where k is the number of completions (i.e., code attempts) a model is allowed to make per task. It doesn't mean generating thousands of options — in this paper, k ranges from 1 to 7.

The idea is: if a model gets 7 shots at a task instead of 1, what’s the chance that at least one of those is correct?

You're right, though, that success across multiple attempts doesn’t mean the model is reliably good. It might just mean it occasionally gets lucky, which is why pass@1 (single-shot success) is still the strongest indicator of consistency.
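
If it helps to see it concretely, here's a minimal sketch of the standard pass@k estimator from Chen et al. (2021), with made-up numbers purely for illustration (the paper under discussion may aggregate its results slightly differently):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021).

    n: completions sampled for a task
    c: completions that pass the task's tests
    k: attempts the model is allowed
    Returns the probability that at least one of k sampled completions is correct.
    """
    if n - c < k:
        # Fewer failing completions than k, so any k-subset contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical task: 10 completions sampled, 2 of them pass the tests.
print(pass_at_k(n=10, c=2, k=1))  # 0.20  -- single-shot success rate (pass@1)
print(pass_at_k(n=10, c=2, k=7))  # ~0.93 -- at least one success in 7 tries
```

As the example shows, a model that only passes 20% of the time in a single shot can still clear 90% at k=7, which is exactly the lucky-draw effect you're describing.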
