That's exactly why I built narrator.sh! The platform takes a user's novel idea as input, then generates serialized fiction chapter by chapter, using DSPy to optimize the writing against real reader feedback. I'm using CoT and parallel modules to break down the writing task, Refine modules + an LLM-as-a-judge for the reward functions, and the SIMBA optimizer to recompile the pipeline on user ratings from previous chapters so subsequent ones improve (rough sketch below).
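For the curious, here's roughly how that wiring looks in DSPy. This is a simplified sketch, not narrator.sh's actual code: the signature fields, judge prompt, model choice, and N/threshold values are all illustrative.

```python
import dspy

# Model choice is illustrative only.
dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))

class WriteChapter(dspy.Signature):
    """Write the next chapter of a serialized novel."""
    premise: str = dspy.InputField(desc="the user-submitted novel idea")
    story_so_far: str = dspy.InputField(desc="summary of earlier chapters")
    chapter: str = dspy.OutputField(desc="the next chapter, in prose")

# Chain-of-thought module: the LM plans the chapter before writing it.
writer = dspy.ChainOfThought(WriteChapter)

# LLM-as-a-judge: a second module scores a draft's engagement on [0, 1].
judge = dspy.ChainOfThought("premise, chapter -> engagement_score: float")

def reward_fn(args, pred):
    # Refine passes the module's input kwargs plus the candidate prediction.
    return judge(premise=args["premise"], chapter=pred.chapter).engagement_score

# Refine samples up to N drafts, returning the first one to clear the
# threshold (or the best-scoring draft if none does).
refined_writer = dspy.Refine(module=writer, N=3, reward_fn=reward_fn, threshold=0.8)
```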
Instead of synthetic benchmarks, I track real reader metrics: time spent reading, ratings, bookmarks, comments, and return visits. This creates a leaderboard of which models actually write engaging fiction that people want to finish.
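Continuing the sketch above, here's one plausible way those reader signals could feed SIMBA; the `published_chapters` store, its field names, and the 4.0 rating cutoff are placeholders for whatever narrator.sh actually logs:

```python
# Hypothetical stand-in for the chapter + engagement stats the site stores.
published_chapters = [
    {"premise": "a lighthouse keeper who hears the sea's memories",
     "summary": "Ch. 1: Mara hears the first voice.",
     "avg_rating": 4.6},
    # ...many more rows in practice
]

# Reader feedback enters as a filter: only chapters real readers rated
# highly become training examples for the optimizer.
trainset = [
    dspy.Example(premise=row["premise"], story_so_far=row["summary"])
        .with_inputs("premise", "story_so_far")
    for row in published_chapters
    if row["avg_rating"] >= 4.0
]

def reader_calibrated_metric(example, pred, trace=None):
    # SIMBA needs a scalar per (example, prediction); the judge scores each
    # candidate, while the trainset filter carries the live reader signal.
    return judge(premise=example.premise, chapter=pred.chapter).engagement_score

# SIMBA mutates instructions/demos across mini-batches, keeps what scores
# best, and returns a recompiled pipeline for subsequent chapters.
optimizer = dspy.SIMBA(metric=reader_calibrated_metric)
improved_writer = optimizer.compile(refined_writer, trainset=trainset)
```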
Right now the closest evals for creative-writing LLMs come from the author's perspective (e.g., OpenRouter's usage data for tools like Novelcrafter). But ultimately readers decide what's good, not authors.
You can try it at https://narrator.sh. Here's the current leaderboard: https://narrator.sh/llm-leaderboard (it's a bit bare right now b/c there aren't many users yet, haha)
(Fair warning: there's some adult content since I posted on Reddit for beta testers and people got creative with prompts. I'm working on diversifying the content!)