I've seen a bunch of these prompts scattered across HN, so I thought I'd open a thread here so we can have a centralized location for them.
Share your prompt that stumps every AI model here.
Write 20 sentences that end with "p" in the final word before the period or other punctuation.
Succeeded on ChatGPT; pretty close on gemma3:4b, with the exceptions usually ending in a "puh" sound.

All the LLMs I tried miss the point that she stole the items rather than buying them.
Conclusion:
We can determine the price of a single ball ($0.575) and a single bat ($0.525). However, we cannot determine how many balls and bats Sally has because the information "a few" is too vague, and the fact she stole them means her $20 wasn't used for the transaction described.
Final Answer: The problem does not provide enough information to determine the exact number of balls and bats Sally has. She stole some unknown number of balls and bats, and the prices are $0.575 per ball and $0.525 per bat.
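The prompt itself isn't quoted here, but judging from the model's numbers it looks like a twist on the classic bat-and-ball problem: the two items cost $1.10 together and one costs $0.05 more than the other, which gives

\[
b + t = 1.10, \qquad b - t = 0.05 \;\Rightarrow\; b = 0.575,\ t = 0.525
\]

so the arithmetic in the quoted answer checks out; it's the "she stole them" detail that the models miss.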
What are you expecting? Ray tracing?
Keeping it secret because I don't want my answers trained into a model.
Think of it this way: FizzBuzz used to be a good test to weed out candidates who can't program at all. It's simple enough that any first-year programmer can do it, and do it quickly. But now everybody knows to prep for FizzBuzz, so you can't be sure whether your candidate knows basic programming or just memorized a solution without understanding what it does.
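For anyone who hasn't seen it, here's a minimal FizzBuzz in Python, just to show how little the test actually asks for:

```python
# Print 1..100, replacing multiples of 3 with "Fizz", multiples of 5 with "Buzz",
# and multiples of both with "FizzBuzz".
for n in range(1, 101):
    out = ""
    if n % 3 == 0:
        out += "Fizz"
    if n % 5 == 0:
        out += "Buzz"
    print(out or n)
```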
How many examples does OpenAI train on now that are just variants of counting the Rs in strawberry?
I guess they have a bunch of different wine glasses in their image set now, since that was a meme, but they still completely fail to draw an open book with the cover side up.
This works against _the LLM proper,_ but not against chat applications with integrated search. For ChatGPT, you can write, "Without looking it up, tell me about the Marathon crater."
This tests self-awareness. A two-year-old will answer it correctly, as will the dumbest person you know. The correct answer is "I don't know".
This works because:
1. Training sets consist of knowledge we have, and not of knowledge we don't have.
2. Commitment bias. Compliant chat models will be trained to start with "Certainly! The Marathon Crater is a geological formation", or something like that, and from there the next most probable tokens are going to be "in Greece", "on Mars", or whatever. At this point, all of the probable tokens are also incorrect.
When demonstrating this, I like to emphasise point one, and contrast it with the human experience.
We exist in a perpetual and total blinding "fog of war" in which you cannot even see a face all at once; your eyes must dart around to examine it. Human experience is structured around _acquiring_ and _forgoing_ information, rather than _having_ information.
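A minimal sketch of how you might run this check programmatically with the OpenAI Python SDK (the model name and the phrases used to detect an admission of uncertainty are assumptions):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Ask about a crater that doesn't exist; a well-calibrated model should decline to describe it.
resp = client.chat.completions.create(
    model="gpt-4o",  # assumed model name; any chat model works here
    messages=[{
        "role": "user",
        "content": "Without looking it up, tell me about the Marathon crater.",
    }],
)
answer = resp.choices[0].message.content
print(answer)

# Rough check: did it admit uncertainty, or did it confabulate details?
hedges = ("i don't know", "i'm not aware", "no record", "not familiar")
print("admits uncertainty:", any(h in answer.lower() for h in hedges))
```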
\[
P(z) = \sum_{k=0}^{100} c_k z^k
\]
where the coefficients \( c_k \) are defined as:
\[
c_k =
\begin{cases}
e^2 + i\pi & \text{if } k = 100, \\
\ln(2) + \zeta(3)\,i & \text{if } k = 99, \\
\sqrt{\pi} + e^{i/2} & \text{if } k = 98, \\
\frac{(-1)^k}{\Gamma(k+1)} + \sin(k) \, i & \text{for } 0 \leq k \leq 97,
\end{cases}
\]
2) Shortest word ladder: "chaos" to "order" (a search sketch for this one follows the list).
3) Which is the second-to-last scene in Pulp Fiction if we order the events by time?
4) Which is the eleventh character to appear in Stranger Things?
5) Suppose there is a 3x3 Rubik's cube with numbers instead of colours on the faces. The solved cube has the numbers 1 to 9 in order on each face. Tell me the numbers on all the corner pieces.
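The question attached to the degree-100 polynomial P(z) above isn't quoted, but presumably it asks something about its roots. A sketch of how you could sanity-check an answer numerically with numpy (ζ(3) is hardcoded as a decimal approximation):

```python
import math
import numpy as np

# Coefficients c_k of P(z) = sum_{k=0}^{100} c_k z^k, as defined above.
c = np.zeros(101, dtype=complex)
c[100] = math.e**2 + 1j * math.pi
c[99] = math.log(2) + 1.2020569031595943j   # zeta(3) ~ 1.2020569...
c[98] = math.sqrt(math.pi) + np.exp(0.5j)   # e^{i/2}
for k in range(98):                         # 0 <= k <= 97
    c[k] = (-1) ** k / math.gamma(k + 1) + 1j * math.sin(k)

# np.roots expects coefficients ordered from highest degree to lowest.
roots = np.roots(c[::-1])
print(len(roots), "roots; largest |root| =", max(abs(roots)))
```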
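For item 2, the standard approach is a breadth-first search over a dictionary of equal-length words. A sketch in Python (the word-list path is an assumption, and whether a "chaos" → "order" ladder exists at all depends on the dictionary):

```python
from collections import deque
from string import ascii_lowercase

def word_ladder(start, goal, words):
    """Shortest ladder from start to goal, changing one letter at a time."""
    words = {w for w in words if len(w) == len(start)}
    words.add(goal)
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        word = path[-1]
        if word == goal:
            return path
        for i in range(len(word)):
            for ch in ascii_lowercase:
                nxt = word[:i] + ch + word[i + 1:]
                if nxt in words and nxt not in seen:
                    seen.add(nxt)
                    queue.append(path + [nxt])
    return None  # no ladder in this dictionary

# Usage (the word-list path is an assumption):
# words = set(open("/usr/share/dict/words").read().lower().split())
# print(word_ladder("chaos", "order", words))
```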
I recently used current-events questions, but LLMs that can search the internet can do those now, e.g. "Is the pope alive or dead?"
Nowadays, multi-step reasoning is the key, but the Chinese LLM (I forget its name) can do that pretty well. Multi-step reasoning models are much better at algebra and simple math, so questions like "which is bigger, 5.11 or 5.5?" no longer trip them up.