That's exactly why I built narrator.sh! The platform takes a user's novel idea as input, then generates serialized fiction chapter-by-chapter, using DSPy to optimize the writing based on real reader feedback. I'm using CoT and parallel modules to break down the writing task, refine modules + LLM-as-a-judge for the reward functions, and the SIMBA optimizer to recompile the program on user ratings from previous chapters so later ones improve.
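Roughly, the DSPy wiring looks something like the sketch below (heavily simplified and illustrative, not the production code: the model name, module layout, field names like `reader_rating`, and the 50/50 reward blend are all assumptions, and the refine/parallel modules are omitted for brevity):

```python
import dspy

dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))  # placeholder model

class WriteChapter(dspy.Module):
    """Break the task down: outline first, then draft the chapter."""
    def __init__(self):
        super().__init__()
        self.outline = dspy.ChainOfThought("premise, prior_chapters -> outline")
        self.draft = dspy.ChainOfThought("premise, outline -> chapter")

    def forward(self, premise, prior_chapters=""):
        outline = self.outline(premise=premise, prior_chapters=prior_chapters).outline
        return self.draft(premise=premise, outline=outline)

# LLM-as-a-judge for the reward, blended with real reader ratings when available.
judge = dspy.ChainOfThought("premise, chapter -> engagement_score: float")  # prompted for 0-1

def reward(example, pred, trace=None):
    judged = judge(premise=example.premise, chapter=pred.chapter).engagement_score
    reader = getattr(example, "reader_rating", None)  # 0-1, from published chapters
    return 0.5 * judged + 0.5 * reader if reader is not None else judged

# Trainset = previously published chapters plus their reader ratings (toy example here).
trainset = [
    dspy.Example(premise="A lighthouse keeper finds a message in a bottle.",
                 reader_rating=0.8).with_inputs("premise"),
]

optimizer = dspy.SIMBA(metric=reward, bsize=1)  # bsize=1 only because the toy trainset is tiny
writer = optimizer.compile(WriteChapter(), trainset=trainset)
```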
Instead of synthetic benchmarks, I track real reader metrics: time spent reading, ratings, bookmarks, comments, and return visits. This creates a leaderboard of which models actually write engaging fiction that people want to finish.
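For concreteness, here's a hypothetical sketch of how those signals could roll up into a per-model leaderboard score; the field names and weights are made up for illustration, not narrator.sh's actual schema:

```python
from dataclasses import dataclass

@dataclass
class ChapterStats:
    model: str
    read_seconds: float      # time readers spent on the chapter
    expected_seconds: float  # rough estimate from word count
    rating: float            # average rating, 1-5
    bookmarks: int
    comments: int
    return_visits: int
    readers: int

def engagement_score(s: ChapterStats) -> float:
    # Normalize each signal to 0-1, then take an (arbitrary) weighted blend.
    completion = min(s.read_seconds / max(s.expected_seconds, 1.0), 1.0)
    rating = (s.rating - 1.0) / 4.0
    interaction = min((s.bookmarks + s.comments) / max(s.readers, 1), 1.0)
    retention = min(s.return_visits / max(s.readers, 1), 1.0)
    return 0.4 * completion + 0.3 * rating + 0.15 * interaction + 0.15 * retention

def leaderboard(chapters: list[ChapterStats]) -> list[tuple[str, float]]:
    # Average per-chapter engagement by model, highest first.
    scores: dict[str, list[float]] = {}
    for c in chapters:
        scores.setdefault(c.model, []).append(engagement_score(c))
    ranked = [(m, sum(v) / len(v)) for m, v in scores.items()]
    return sorted(ranked, key=lambda x: x[1], reverse=True)
```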
Right now the closest evals for creative-writing LLMs come from the author's perspective (e.g. OpenRouter's usage data for tools like Novelcrafter), but ultimately readers decide what's good, not authors.
You can try it at https://narrator.sh. Here's the current leaderboard: https://narrator.sh/llm-leaderboard (it's a bit bare right now b/c there aren't that many users yet, haha)
(Fair warning: there's some adult content since I posted on Reddit for beta testers and people got creative with prompts. I'm working on diversifying the content!)
BoorishBears•5mo ago
I think right now we're at the point where Novelcrafter is an excellent proxy for the best models for readers, because LLMs are still mostly losing engagement to technical errors as opposed to subjective ones:
That's repetition problems, moralizing/soft-censorship, grammatical quirks, missed instructions, forgetting major plot points, etc.
Those kinds of errors are so obvious you can almost rank these models with an N=1 vibe test, and they limit how much people will consume unless you're scratching certain itches like NSFW.
-
However, I do think with enough post-training you can get past that class of problem and move to a stage where the writing is technically sound (and that's what I've spent most of the last year working on).
From there you get to more challenging problems that require much more feedback, plus some level of per-user specialization (like what Midjourney does during onboarding to build up a style profile). Once you're not making technical mistakes, you have to codify the ethereal concept of "user taste", and that will be a really interesting challenge for LLMs.
jauws•5mo ago
Really interested in what you've been working on for the past year! Are you doing custom fine-tuning, or more on the prompting/post-processing side? Also, I definitely need to check out the Midjourney onboarding; it sounds like great inspo for your point about personalization + taste!
BoorishBears•5mo ago
Most of it has been fine-tuning (SFT/DPO/GRPO), but also a lot of prompting and adding steps between the user's prompt and the final output.