How well can language models like Claude Opus and GPT-5.2 write music?
With boogiebench, I ask models to make Strudel compositions (https://strudel.cc/) in response to music prompts ('hyperpop', 'spaghetti-western theme', etc.) and generate Elo rankings based on user votes.
Unlike Suno, LLMs haven't been trained explicitly on this task, making it a nice generalization test (coding, aesthetics, temporal reasoning), akin to the pelican-riding-a-bicycle SVG test.
Models often struggle but are rapidly improving, judging by the performance gap between the strongest and weakest models. (The Anthropic models seem to underperform other model families relative to what we'd expect, for whatever reason.)
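
For anyone unfamiliar with Elo, here is a minimal sketch of how pairwise user votes could be turned into ratings. This is the textbook Elo update, not necessarily what boogiebench does: the K-factor of 32, the starting rating of 1000, the function names, and the model names in the example are all assumptions for illustration.

```python
def expected_score(r_a, r_b):
    """Probability that a player rated r_a beats one rated r_b (standard Elo curve)."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def update_elo(ratings, winner, loser, k=32.0, start=1000.0):
    """Apply one vote in place: `winner` was preferred over `loser`.

    k and start are assumed values, not taken from boogiebench.
    """
    r_w = ratings.get(winner, start)
    r_l = ratings.get(loser, start)
    # Points transferred shrink as the win becomes more expected.
    delta = k * (1.0 - expected_score(r_w, r_l))
    ratings[winner] = r_w + delta
    ratings[loser] = r_l - delta
    return ratings


# Hypothetical votes: each tuple is (preferred composition's model, other model).
votes = [("model_a", "model_b"), ("model_a", "model_c"), ("model_b", "model_c")]
ratings = {}
for winner, loser in votes:
    update_elo(ratings, winner, loser)

# Print models from highest to lowest rating.
for name, rating in sorted(ratings.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {rating:.1f}")
```

Because each vote moves the same number of points from loser to winner, the total rating pool stays constant, and an upset win over a much higher-rated model moves the ratings more than an expected win.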
__NSL__•2h ago