What a very interesting study and nicely concise write-up! One thing you might want to watch out for here: it looks like all three of your figures are direct reproductions of the authors' Fig 5, Fig 6, and Fig 7, but branded with the name of this newsletter. Can I suggest either just directly using their figures (fair use) or at least adding a note that gives credit, e.g., "minimally adapted from [citation]"? I found myself mistakenly thinking that you had put the figures together yourself from tables in their paper because they had not provided visualizations.
So, is the typical Upwork SWE task now much harder (since all the easy ones are getting solved by AI before posting, or solved very quickly by someone with an AI)?
My N=1 experiment with all the publicly available options, including raw frontier models and tools built specifically for AI development tasks like v0 or bolt, suggests that while I can get some help, especially with ideation, the more specific I become about the requirements, the less useful the results are.
I think the 'more attempts' point needs clarification. It sounds like there is genuine improvement with more attempts, whereas more tries might simply lead to probabilistic success if the LLM gets lucky. There is also a (k) in the diagram that is not explained in the text. Commonly this stands for kilo, as in 1,000. So does the hapless engineering manager of their LLM team have to wade through thousands of results? Even 7 is bad enough…