LLMs Do Not Grade Essays Like Humans

6•PretzelFisch•1d ago

Comments

ergshankar•1d ago

Interesting point. Do you think this is more about training data limitations or evaluation methods?

nickpsecurity•1d ago

I feel like this is actually human-like, just like the average human in the pretraining data. Let's look:

1. They reward short or under-developed essays. I'd say most online content, especially with high upvotes next to the post, fits that. Social media surely does.

2. With longer posts, the system starts nitpicking minor details, like grammar. We see this even on Hacker News, a community valuing quality, with some longer submissions. It's also a debate tactic used to derail opponents' better arguments in many discussions, and those discussions are in the pretraining data.

3. Essays with more praise get higher scores, and those with more criticism get lower scores. That's the "Get on the Bandwagon" effect. Echo chambers. One person writes a thing, followed by 5-20 people confirming it. That's probably in the pretraining data. It might survive some filtering/cleaning strategies, too.

So, no, I think these AIs are acting way too human. They need to be fine-tuned to act like more reasonable humans. That will initially take RLHF data for many types of situations. Given pretraining bias, they might also have to be trained to drop the bad habits the article mentions.

jerlam•1d ago

School-type long essays only seem to exist in academia. I took a "business communication" class in college and we didn't write essays. My life experience since then has supported the "no essays" conclusion.

A long comment online now means one of two things: it's written by a crank who has strong opinions, usually only tangentially related; or it's written by someone who has deep knowledge about the subject and a lot of detail to provide. It's usually the former.

nickpsecurity•1d ago

I agree with you on how their quality is spread out. But, this...

"School-type long essays only seem to exist in academia."

Does an AI know what an essay is? Would it consider any long, descriptive post an essay? Especially if pretraining data has many people describing long posts as essays or "essay-like?" Or only actual essays? And what is an actual essay again?

I think AIs might have different interpretations due to the above questions. They might also conflate essays with longer, detailed, or argumentative posts. We'd have to feed a bunch of posts to a bunch of AIs and ask how they classify them.
