4 Comments
john flournoy:

What a very interesting study and nicely concise write-up! One thing that you might want to watch out for here---it looks like all three of your figures are direct reproductions of the authors' Fig 5, Fig 6, and Fig 7, but branded with the name of this newsletter. Can I suggest either just directly using their figures (fair use) or at least making a note that gives credit, e.g., "minimally adapted from [citation]"? I found myself mistakenly thinking that you had actually put together the figures from tables in their paper because they had not provided visualizations.

I.M.J. McInnis:

So, is the typical Upwork SWE task now much harder (since the easy ones are getting solved by AI before posting, or solved very quickly by someone using an AI)?

Sep:

My N=1 experiment with all the publicly available options, including raw frontier models and purpose-built AI development tools like v0 or Bolt, suggests that while I can get some help, especially with ideation, the results become less useful the more specific I am about the requirements.

Joachim Sammer:

I think the ‘more attempts’ claim needs clarification. It sounds like there is a genuine improvement with more attempts, whereas more tries might simply yield probabilistic success if the LLM gets lucky. There is also a (k) in the diagram that is not explained in the text. Commonly this stands for kilo, as in 1,000. So does the hapless engineering manager of the LLM team have to wade through thousands of results? Even 7 is bad enough…
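(Editor's note: the (k) in the figures may well denote the standard pass@k metric rather than "kilo" — this is an assumption, since the figures themselves are not reproduced here. Under that reading, pass@k is the probability that at least one of k sampled attempts succeeds, commonly computed with the unbiased estimator below; the function name `pass_at_k` is illustrative.)

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: probability that at least one of k
    samples drawn without replacement from n attempts is correct,
    given that c of the n attempts passed."""
    if n - c < k:
        # Fewer failures than sample size: a success is guaranteed.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# With 1 correct attempt out of 10, k=1 gives 0.1, but k=7 gives 0.7:
# more attempts raise the metric purely through lucky draws.
print(pass_at_k(10, 1, 1))  # 0.1
print(pass_at_k(10, 1, 7))  # 0.7
```

This illustrates the commenter's point: pass@k rising with k reflects probabilistic success across retries, not necessarily a better model — though no one need wade through thousands of results, since k here is the number of sampled attempts, not a count in thousands.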
