Prompt Politeness Affects LLM Accuracy (2025)

32•KnuthIsGod•1d ago

Comments

331c8c71•56m ago

Interesting.

I am wondering why anyone would use a t-test when the experiment is clearly modelled by a binomial distribution: 250 independent questions, each either answered correctly or not (the null is that the success rate is the same).
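A minimal sketch of what a binomial-style comparison could look like, using a two-proportion z-test (the normal approximation to the binomial). The counts are assumptions for illustration only: 202 and 212 correct out of 250 correspond to the 80.8% and 84.8% accuracies quoted elsewhere in the thread, treated here as if each were a single pass over 250 independent questions. The function name is hypothetical, and the test treats the two conditions as independent samples, which ignores any pairing between them.

```python
import math


def two_proportion_z_test(successes_a, successes_b, n_a, n_b):
    """Two-sided z-test of H0: both conditions share one success rate."""
    p_a = successes_a / n_a
    p_b = successes_b / n_b
    # Under H0 the best estimate of the common rate pools both samples.
    pooled = (successes_a + successes_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal: 2 * (1 - Phi(|z|)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Assumed counts (illustrative only): 202/250 vs 212/250 correct.
z, p_value = two_proportion_z_test(202, 212, 250, 250)
print(f"z = {z:.3f}, two-sided p = {p_value:.3f}")
```

With only one binary outcome per question, a four-point gap at this sample size is well within what the null allows, which is part of why the choice of test and the number of repeated runs matter.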

plewd•40m ago

I don't know much about stats, but does "the null is that the success rate is the same" imply that it's a sketchy methodology because they can come up with some findings ("ruder prompts are better/worse!") more often?

jampekka•17m ago

That's the usual null hypothesis for these kinds of tests.

331c8c71•4m ago

You are asking about one-sided vs two-sided tests. Not really "more often", because the formal type 1 error rate is still the same. I'd say two-sided tests leave more space for post-hoc theorizing, but there are valid situations when there is no clear one-sided hypothesis a priori. Do we really know whether the hypothesis should have been "ruder prompts are worse"?

I'd say this is benign compared to other ways of (mis)using statistics e.g. looking which way the difference goes and then running one-sided tests or tweaking the setup until one gets "significant" p vals.

jampekka•29m ago

The methods could be better described in the paper, but my understanding is that they did 10 runs for each question for each prompt and took an average of those, so the compared values are not binary. You could do a sign test, but you'd lose power and answer a slightly different question.
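To make the difference concrete, here is a rough sketch comparing a paired t statistic with an exact two-sided sign test on per-question averages. The six score pairs are made up for illustration (each value stands for the fraction of 10 runs answered correctly) and are not from the paper; the function names are hypothetical. Only the t statistic is computed, since the standard library has no t-distribution CDF.

```python
import math
import statistics


def paired_t_statistic(a, b):
    """t statistic for the mean of the per-question differences b - a."""
    diffs = [y - x for x, y in zip(a, b)]
    se = statistics.stdev(diffs) / math.sqrt(len(diffs))
    return statistics.mean(diffs) / se


def sign_test_p_value(a, b):
    """Exact two-sided sign test: uses only the direction of each difference."""
    diffs = [y - x for x, y in zip(a, b)]
    non_ties = [d for d in diffs if d != 0]  # ties carry no sign information
    n = len(non_ties)
    k = sum(1 for d in non_ties if d > 0)
    # Probability of a split at least this lopsided under a fair coin.
    tail = sum(math.comb(n, i) for i in range(min(k, n - k) + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Made-up per-question accuracies, each averaged over 10 runs.
polite = [0.8, 0.6, 1.0, 0.7, 0.9, 0.5]
rude = [0.9, 0.7, 1.0, 0.8, 0.9, 0.7]

print("paired t statistic:", round(paired_t_statistic(polite, rude), 3))
print("sign test p-value:", sign_test_p_value(polite, rude))
```

The sign test throws away the size of each difference and drops ties entirely, so it asks whether one prompt wins more often rather than whether the mean accuracy differs, which is where the lost power comes from.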

freehorse•5m ago

You can do a generalised linear mixed-effects model with a binomial outcome (i.e. a binomial test but with added random effects structure). But unless you want to introduce a richer random effects structure with more variables, it is overkill and overcomplicates things, and the result should be the same as with t-tests.
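One way such a model could be written down, as a sketch rather than the paper's specification. All symbols are assumptions introduced here: y_{ijk} is 1 if run k of question i under prompt tone j was answered correctly, x_j is an indicator for the tone, and u_i is a per-question random intercept.

```latex
\operatorname{logit} \Pr(y_{ijk} = 1) = \beta_0 + \beta_1 x_j + u_i,
\qquad u_i \sim \mathcal{N}(0, \sigma_u^2)
```

The null "the success rate is the same" then corresponds to \beta_1 = 0, and the random intercept absorbs the fact that some questions are simply harder than others.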

dude250711•50m ago

I have an idea: let's use these things for autonomous software engineering.

faize•43m ago

Remember to always say "please" and "thank you" when planning a critical system

eigenspace•41m ago

Please remember to always say "please" and "thank you" when planning a critical system. Thank you!

theanonymousone•38m ago

I have always said please and thank you to LLMs, not because of accuracy or because I'm stupid. I believe it is more about me than about the LLM, and this is anyway a habit I don't want to lose.

jkarni•34m ago

Thomas Aquinas believed cruelty to animals was wrong not because animals have souls (and with that all the standard moral rights), but because it can teach us cruelty to other humans.

pfortuny•6m ago

Snarky morning: "spiritual souls" as opposed to "mere animal souls". Sorry, could not control myself.

niek_pas•23m ago

Genuine question: do you add 'please' and 'thank you' to Google searches? If not, what sets them apart?

perching_aix•23m ago

Google searches being keyword based, rather than simulated conversations?

The same reason you wouldn't put in an entire actual question/sentence, unless you either don't know how to use Google, are pissed off, or have an actual reason to suspect that it would yield proper hits (e.g. looking up an excerpt).

spiderfarmer•19m ago

Google isn’t conversational.

gum_wobble•11m ago

Genuine question: do you write Google search queries in natural language?

TimCTRL•35m ago

I only say please and thank you so that when the robots finally take over, they will remember I was nice to them.

octocop•29m ago

it seems they will remember that you wasted tokens for no reason and punish you instead.

emil-lp•24m ago

Tokens are their food, it's literally what keeps them alive.

Not feeding them tokens is neglect.

I try to feed them a healthy diet.

polytely•16m ago

It sort of makes sense to me, like asking a question to an expert in the field while you are a student. I would guess the successful interactions would on average be more polite. For example, if you were asking a question to Donald Knuth or Terence Tao, you'd probably be polite while doing so. Being hostile while asking questions gets you into forum discussion territory.

robinhouston•4m ago

> Contrary to expectations, impolite prompts consistently outperformed polite ones, with accuracy ranging from 80.8% for Very Polite prompts to 84.8% for Very Rude prompts.

dSebastien•6m ago

I guess it makes sense since we as humans tend to be far less inclined to help someone who is not polite/is not friendly, so that "bias" is part of the training data, thus influences how LLMs function

robinhouston•5m ago

> Contrary to expectations, impolite prompts consistently outperformed polite ones, with accuracy ranging from 80.8% for Very Polite prompts to 84.8% for Very Rude prompts.
