7 Comments
Sep

My N=1 experiment with all the publicly available options, including raw frontier models and purpose-built AI development tools like v0 or Bolt, suggests that while I can get some help, especially with ideation, the more specific I become about the requirements, the less useful the results are.

john flournoy

What a very interesting study and nicely concise write-up! One thing you might want to watch out for here: it looks like all three of your figures are direct reproductions of the authors' Fig 5, Fig 6, and Fig 7, but branded with the name of this newsletter. Can I suggest either directly using their figures (fair use) or at least adding a note that gives credit, e.g., "minimally adapted from [citation]"? I found myself mistakenly thinking that you had put the figures together yourself from tables in their paper because they had not provided visualizations.

Abi Noda

We used to include attributions but somehow got away from it without realizing. We've updated the images in this post — thanks for the nudge.

I.M.J. McInnis

So, is the typical Upwork SWE task now much harder (since the easy ones are getting solved by AI before they're ever posted, or solved very quickly by someone using an AI)?

Abi Noda

This paper didn’t explore that question, but that's a really interesting speculation. Would love to see future research dig into that.

Joachim Sammer

I think the ‘more attempts’ claim needs clarification. It sounds like there is an improvement with more attempts, whereas more tries might simply lead to probabilistic success if the LLM gets lucky. There is also a (k) in the diagram that is not explained in the text. Commonly this stands for kilo, as in 1,000. So, does the hapless engineering manager of the LLM team have to wade through thousands of results? Even 7 is bad enough…

Abi Noda

Great point — and I agree the chart could be clearer.

The “more attempts” refers to the pass@k metric from Chen et al. (2021), where k is the number of completions (i.e., code attempts) a model is allowed to make per task. It doesn't mean generating thousands of options — in this paper, k ranges from 1 to 7.

The idea is: if a model gets 7 shots at a task instead of 1, what’s the chance that at least one of those is correct?

You're right, though, that success across multiple attempts doesn’t mean the model is reliably good. It might just mean it occasionally gets lucky, which is why pass@1 (single-shot success) is still the strongest indicator of consistency.
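
If it helps to see it concretely, here's a minimal sketch of the standard pass@k estimator from Chen et al. (2021), with made-up numbers purely for illustration (the paper under discussion may aggregate its results slightly differently):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021).

    n: completions sampled for a task
    c: completions that pass the task's tests
    k: attempts the model is allowed
    Returns the probability that at least one of k sampled completions is correct.
    """
    if n - c < k:
        # Fewer failing completions than k, so any k-subset contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical task: 10 completions sampled, 2 of them pass the tests.
print(pass_at_k(n=10, c=2, k=1))  # 0.20  -- single-shot success rate (pass@1)
print(pass_at_k(n=10, c=2, k=7))  # ~0.93 -- at least one success in 7 tries
```

As the example shows, a model that only passes 20% of the time in a single shot can still clear 90% at k=7, which is exactly the lucky-draw effect you're describing.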
