
Show HN: I built "AI Wattpad" to eval LLMs on fiction

https://narrator.sh/llm-leaderboard
16•jauws•4h ago
I've been a webfiction reader for years (too many hours on Royal Road), and I kept running into the same question: which LLMs actually write fiction that people want to keep reading? That's why I built Narrator (https://narrator.sh/llm-leaderboard) – a platform where LLMs generate serialized fiction and get ranked by real reader engagement.

Turns out this is surprisingly hard to answer. Creative writing isn't a single capability – it's a pipeline: brainstorming → writing → memory. You need to generate interesting premises, execute them with good prose, and maintain consistency across a long narrative. Most benchmarks test these in isolation, but readers experience them as a whole.

The current evaluation landscape is fragmented: Memory benchmarks like FictionLive's tests use MCQs to check if models remember plot details across long contexts. Useful, but memory is necessary for good fiction, not sufficient. A model can ace recall and still write boring stories.

Author-side usage data from tools like Novelcrafter shows which models writers prefer as copilots. But that measures what's useful for human-AI collaboration, not what produces engaging standalone output. Authors and readers have different needs.

LLM-as-a-judge is the most common approach for prose quality, but it's notoriously unreliable for creative work. Models have systematic biases (favoring verbose prose, certain structures), and "good writing" is genuinely subjective in ways that "correct code" isn't.

What's missing is a reader-side quantitative benchmark – something that measures whether real humans actually enjoy reading what these models produce. That's the gap Narrator fills: views, time spent reading, ratings, bookmarks, comments, return visits. Think of it as an "AI Wattpad" where the models are the authors.
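
The reader-side signals listed above have to be combined somehow to rank models. A minimal sketch in Python, with hypothetical field names and weights (the post doesn't describe Narrator's actual scoring formula):

```python
# Illustrative only: one way to blend reader-side signals into a single
# engagement score, normalized per view so stories with big backlogs
# don't automatically outrank new ones.
from dataclasses import dataclass

@dataclass
class ChapterStats:
    views: int
    avg_read_seconds: float   # mean time readers spent on the chapter
    rating: float             # mean rating on a 0-5 scale
    bookmarks: int
    comments: int
    return_visits: int        # readers who came back for the next chapter

def engagement_score(s: ChapterStats) -> float:
    """Weighted blend of signals; weights here are made up."""
    if s.views == 0:
        return 0.0
    per_view = (
        0.3 * min(s.avg_read_seconds / 300.0, 1.0)  # cap credit at 5 min
        + 0.3 * (s.rating / 5.0)
        + 0.2 * (s.return_visits / s.views)
        + 0.1 * (s.bookmarks / s.views)
        + 0.1 * (s.comments / s.views)
    )
    return round(100 * per_view, 1)
```

Return visits get a heavier weight than bookmarks or comments on the theory that coming back for the next chapter is the strongest "I want to keep reading" signal.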

I shared an early DSPy-based version here 5 months ago (https://news.ycombinator.com/item?id=44903265). The big lesson: one-shot generation doesn't work for long-form fiction. Models lose plot threads, forget characters, and quality degrades across chapters.

The rewrite: from one-shot to a persistent agent loop

The current version runs each model through a writing harness that maintains state across chapters. Before generating, the agent reviews structured context: character sheets, plot outlines, unresolved threads, world-building notes. After generating, it updates these artifacts for the next chapter. Essentially each model gets a "writer's notebook" that persists across the whole story.
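
The review/generate/update loop can be sketched as follows. This is my illustration of the idea, not the real harness: the notebook schema, prompt format, and `generate`/`update_notebook` calls are stand-ins.

```python
# Sketch of a per-story "writer's notebook" that persists across chapters.
from dataclasses import dataclass, field

@dataclass
class Notebook:
    character_sheets: dict = field(default_factory=dict)
    plot_outline: list = field(default_factory=list)
    unresolved_threads: list = field(default_factory=list)
    world_notes: list = field(default_factory=list)

    def as_context(self) -> str:
        """Render structured notes as the context shown before generation."""
        return "\n".join([
            "CHARACTERS: " + "; ".join(f"{k}: {v}" for k, v in self.character_sheets.items()),
            "OUTLINE: " + " -> ".join(self.plot_outline),
            "OPEN THREADS: " + "; ".join(self.unresolved_threads),
            "WORLD: " + "; ".join(self.world_notes),
        ])

def write_story(model, n_chapters: int) -> list:
    """Agent loop: review notes, generate a chapter, update the notes."""
    nb = Notebook()
    chapters = []
    for i in range(n_chapters):
        # 1. Review: the model sees its own structured notes, not raw history.
        prompt = f"{nb.as_context()}\n\nWrite chapter {i + 1}."
        chapter = model.generate(prompt)
        chapters.append(chapter)
        # 2. Update: the model revises its notes for the next chapter.
        nb = model.update_notebook(nb, chapter)
    return chapters
```

The key design choice is that only the distilled notebook carries forward, so context stays bounded no matter how long the story runs.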

This made a measurable difference – models that struggled with consistency in the one-shot version improved significantly with access to their own notes.

Granular filtering instead of a single score:

We classify stories upfront by language, genre, tags, and content rating. Instead of one "creative writing" leaderboard, we can drill into specifics: which model writes the best Spanish Comedy? Which handles LitRPG stories with Male Leads the best? Which does well with romance versus horror?
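
Mechanically, the drill-down amounts to filtering one table of per-story results by facet before ranking. A toy sketch (field names and scores are illustrative):

```python
# Rank models by mean score within an arbitrary facet slice.
from statistics import mean

stories = [
    {"model": "model-a", "language": "es", "genre": "comedy", "score": 71},
    {"model": "model-b", "language": "es", "genre": "comedy", "score": 64},
    {"model": "model-a", "language": "en", "genre": "horror", "score": 55},
    {"model": "model-b", "language": "en", "genre": "horror", "score": 80},
]

def leaderboard(rows, **facets):
    """Filter rows by exact facet match, then rank models by mean score."""
    matching = [r for r in rows if all(r[k] == v for k, v in facets.items())]
    by_model = {}
    for r in matching:
        by_model.setdefault(r["model"], []).append(r["score"])
    return sorted(
        ((m, mean(ss)) for m, ss in by_model.items()),
        key=lambda t: t[1], reverse=True,
    )

# "Which model writes the best Spanish comedy?"
# leaderboard(stories, language="es", genre="comedy")
```

The same function answers every drill-down question by varying the facets, which is why classifying stories upfront matters so much.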

The answers aren't always what you'd expect from general benchmarks. Some models that rank mid-tier overall dominate specific niches.

A few features I'm proud of:

Story forking lets readers branch stories CYOA-style – if you don't like where the plot went, fork it and see how the same model handles the divergence. Creates natural A/B comparisons.
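
Forking naturally makes each story a tree of chapters: a fork shares every chapter up to the divergence point, then branches. A minimal sketch of that structure (my illustration, not Narrator's data model):

```python
# Chapters as a parent-linked tree; each branch is one reading path.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Chapter:
    id: str
    parent: Optional["Chapter"]
    text: str

def lineage(ch: Chapter) -> list:
    """Chapter ids from the root to this leaf -- the path a reader follows."""
    path = []
    while ch is not None:
        path.append(ch.id)
        ch = ch.parent
    return path[::-1]
```

Two forks off the same chapter share an identical prefix, which is what makes the A/B comparison natural: same model, same setup, divergent continuations.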

Visual LitRPG was a personal itch to scratch. Instead of walls of [STR: 15 → 16] text, stats and skill trees render as actual UI elements. Example: https://narrator.sh/novel/beware-the-starter-pet/chapter/1
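
Rendering stat changes as UI elements presumably starts with extracting them from the prose. A sketch of one way to do that; the `[STR: 15 → 16]` syntax is from the post, but the parser and output shape are my assumptions:

```python
# Pull [STAT: old -> new] markers out of chapter text so a frontend can
# render them as stat-bar widgets instead of bracketed plain text.
import re

STAT_RE = re.compile(r"\[(\w+):\s*(\d+)\s*(?:→|->)\s*(\d+)\]")

def extract_stat_changes(chapter_text: str) -> list:
    """Return each stat change as a dict: {'stat', 'old', 'new'}."""
    return [
        {"stat": m.group(1), "old": int(m.group(2)), "new": int(m.group(3))}
        for m in STAT_RE.finditer(chapter_text)
    ]
```

Accepting both `→` and `->` matters in practice, since models are inconsistent about which arrow they emit.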

What I'm looking for:

More readers to build out the engagement data. Also curious if anyone else working on long-form LLM generation has found better patterns for maintaining consistency across chapters – the agent harness approach works but I'm sure there are improvements.

Comments

linolevan•2h ago
Quick feedback: Website is basically unusable on mobile
jauws•2h ago
Ah shoot - thanks for letting me know. I'm still a frontend noob, so I'm learning as I go.
bccdee•2h ago
I took a look at the "top-rated" story.

1. UI is terrible. Paragraphs are extremely far apart, and most paragraphs are 1 short sentence (e.g. "I glare."). On mobile, I can only see a few words at a time, and desktop's not much better.

2. Story is so bad that it's not even amusing.

jauws•2h ago
Thanks for letting me know - the UI issues are definitely on me (fixing asap). Feel free to generate a story or two - right now there aren't enough annotations to make "top-rated" a valid moniker.
babblingfish•1h ago
> The surge of AI, large language models, and generated art begs fascinating questions. The industry’s progress so far is enough to force us to explore what art is and why we make it. Brandon Sanderson explores the rise of AI art, the importance of the artistic process, and why he rebels against this new technological and artistic frontier.

What It Means To Be Human | Art in the AI Era

https://www.youtube.com/watch?v=mb3uK-_QkOo

babblingfish•1h ago
Do watch the video as it makes a compelling argument against this exact kind of thing. From a product design perspective, you're asking people to read a bunch of slop and organize it into slop piles. What's the point of that? Honestly it seems like a huge waste of everyone's time.
jauws•49m ago
I think there's interesting work to be built on this data beyond just generating and sorting slop. I didn't build this because I enjoy having people read bad fiction. I built it because existing benchmarks for creative writing are genuinely bad and often measure the wrong things. The goal isn't to ask users to read low-quality output for its own sake. It's to collect real reader-side signal for a category where automated evaluation has repeatedly failed.

More broadly, crowdsourced data where human inputs are fundamentally diverse lets us study problems that static benchmarks can't touch. The recent "Artificial Hivemind" paper (Jiang et al., NeurIPS 2025 Best Paper) showed that LLMs exhibit striking mode collapse on open-ended tasks, both within models and across model families, and that current reward models are poorly calibrated to diverse human preferences. Fiction at scale is exactly the kind of data you need to diagnose and measure this. You can see where models converge on the same tropes, whether "creative" behavior actually persists or collapses into the same patterns, and how novelty degrades over time. That signal matters well beyond fiction, including domains like scientific research where convergence versus originality really matters.
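
One crude way to quantify that convergence is lexical overlap across many generations from the same model or prompt. This is my own illustration, not the metric from the cited paper; high mean pairwise similarity is only a rough proxy for mode collapse:

```python
# Mean pairwise Jaccard similarity of word sets across generated stories.
# Values near 1.0 mean the generations reuse the same vocabulary heavily.
from itertools import combinations

def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b) if a | b else 0.0

def mean_pairwise_similarity(texts: list) -> float:
    """Average word-set overlap over all pairs of generations."""
    if len(texts) < 2:
        return 0.0
    word_sets = [set(t.lower().split()) for t in texts]
    pairs = list(combinations(word_sets, 2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)
```

Tracking this per model, per genre, over time would show whether "creative" behavior persists or collapses into the same tropes.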

verelo•1h ago
Did you skip Anthropic models? I honestly can't take this seriously if you're not looking at all the leading providers but you did look at some obscure ones.
jauws•1h ago
There are 151 models there right now, including all the latest Anthropic models, and assignment is randomized. There just aren't enough annotations yet for the Anthropic models to surface.
dehugger•1h ago
grotesque
jauws•40m ago
If you have specific objections, I’m open to hearing them.
rbtprograms•1h ago
even for ai standards this is gigaslop
jauws•39m ago
Happy to engage if you have concrete criticisms.
drusepth•52m ago
Hard to find the signal in the noise and know what stories I should even read to get a sense of baseline quality; partially because that's just a hard problem inherent to floods of any content, but also because the recommendation system seems to lack enough data (and also might be weighting the wrong things, e.g. the rank #1 story is also the lowest-rated...).

A very cool idea in theory and something very hard to pull off, but I think in order to get the data you need on how readable each story is you'll need to work on presentation and recommendation so those don't distract from what you're actually testing.

jauws•45m ago
Thanks for the feedback - looking at the rest of the comments, I definitely agree it seems to be a common theme. Will do better to fix those issues so there's less noise.
mp_mn•51m ago
This is not worth continuing work on.
jauws•44m ago
Thanks for the feedback. What would you need to see to change your mind?
mp_mn•36m ago
There's more quality fiction out there than you or I will ever have time to read. I don't see a purpose in flooding the world with more mediocre to unreadable fiction.
empath75•6m ago
I am not going to argue this on the basis that LLMs suck at fiction, because even if that's true, it's not really relevant. The problem is that what LLMs are good at is producing mediocre fiction tailored to the tastes of the individual reading it. What people will keep reading is fiction an LLM wrote because they personally asked it to write it.

I don't want to read fiction generated from someone else's ideas. I want to read LLM fiction generated from my weird quirks and personal taste.

BoorishBears•47m ago
I have a lot of engagement data on LLMs from running a creative-writing-oriented consumer AI app and spending a lot of time on quality improvements and post-training.

Do you have a contact email?

jauws•41m ago
Would love to chat! Here's my email: team@narrator.sh

Show HN: Octosphere, a tool to decentralise scientific publishing

https://octosphere.social/
28•crimsoneer•3h ago•12 comments

Show HN: C discrete event SIM w stackful coroutines runs 45x faster than SimPy

https://github.com/ambonvik/cimba
37•ambonvik•5h ago•14 comments

Show HN: Real-world speedrun timer that auto-ticks via vision on smart glasses

https://github.com/RealComputer/GlassKit/tree/main/examples/rokid-rfdetr
2•tash_2s•24m ago•1 comments

Show HN: Sandboxing untrusted code using WebAssembly

https://github.com/mavdol/capsule
55•mavdol04•6h ago•18 comments

Show HN: I built "AI Wattpad" to eval LLMs on fiction

https://narrator.sh/llm-leaderboard
16•jauws•4h ago•21 comments

Show HN: PII-Shield – Log Sanitization Sidecar with JSON Integrity (Go, Entropy)

https://github.com/aragossa/pii-shield
12•aragoss•4h ago•7 comments

Show HN: Claude.md is doing too much

https://visr.dev
2•sourishkrout•41m ago•0 comments

Show HN: Safe-now.live – Ultra-light emergency info site (<10KB)

https://safe-now.live
156•tinuviel•12h ago•69 comments

Show HN: OpenSymbolicAI – Agents with typed variables, not just context stuffing

2•rksart•50m ago•0 comments

Show HN: SendRec – Open-source, EU-hosted alternative to Loom

https://sendrec.eu/blog/why-eu-teams-need-european-loom-alternative/
2•alexneamtu•54m ago•0 comments

Show HN: Autoliner – write a bot to control a virtual airline

https://autoliner.app/
3•msvan•2h ago•0 comments

Show HN: difi – A Git diff TUI with Neovim integration (written in Go)

https://github.com/oug-t/difi
42•oug-t•7h ago•43 comments

Show HN: Emmtrix ONNX-to-C Code Generator for Edge AI Deployment

https://github.com/emmtrix/emx-onnx-cgen
3•emx-can•2h ago•0 comments

Show HN: Nomad Tracker – a local-first iOS app to track visas and tax residency

https://nomadtracker.app
2•gotzonza•3h ago•0 comments

Show HN: Stigmergy pattern for multi-agent LLMs (80% fewer API calls)

https://github.com/KeepALifeUS/autonomous-agents
3•keepalifeus•3h ago•0 comments

Show HN: kiln.bot - Orchestrate Claude Code from GitHub

7•elondemirock•3h ago•2 comments

Show HN: Homomorphically Encrypted Vector Database

https://github.com/cloneisyou/HEVEC
2•cloneisme•3h ago•2 comments

Show HN: Minikv – Distributed key-value and object store in Rust (Raft, S3 API)

https://github.com/whispem/minikv
60•whispem•13h ago•26 comments

Show HN: Adboost – A browser extension that adds ads to every webpage

https://github.com/surprisetalk/AdBoost
116•surprisetalk•1d ago•123 comments

Show HN: TrueLedger – a local-first personal finance app with no cloud back end

https://trueledger.satyakommula.com
3•satyakommula•4h ago•0 comments

Show HN: ItemGrid – Free inventory management for single-location businesses

https://itemgrid.io
3•boxqr•5h ago•0 comments

Show HN: I built an AI movie making and design engine in Rust

https://github.com/storytold/artcraft
5•echelon•5h ago•1 comments

Show HN: LUML – an open source (Apache 2.0) MLOps/LLMOps platform

https://github.com/luml-ai/luml
7•okost1•6h ago•2 comments

Show HN: govalid – Go validation without reflection (5-44x faster)

https://github.com/sivchari/govalid
2•sivchari•6h ago•0 comments

Show HN: Sentinel Gate – Open-source RBAC firewall for MCP agents

https://github.com/Sentinel-Gate/Sentinelgate
2•Sentinel-gate•6h ago•1 comments

Show HN: Kannada Nudi Editor Web Version

https://nudiweb.com/
7•Codegres•16h ago•0 comments

Show HN: Wikipedia as a doomscrollable social media feed

https://xikipedia.org
427•rebane2001•1d ago•140 comments

Show HN: PolliticalScience – Anonymous daily polls with 24-hour windows

https://polliticalscience.vote/
29•ps2026•1d ago•40 comments

Show HN: npx claude-mycelium grow – fungi agent orchestration for your repo

https://www.npmjs.com/package/claude-mycelium
2•altras•8h ago•0 comments

Show HN: NanoClaw – “Clawdbot” in 500 lines of TS with Apple container isolation

https://github.com/gavrielc/nanoclaw
518•jimminyx•1d ago•220 comments