I've seen a bunch of these prompts scattered across HN, so I thought I'd open a thread here so we can maybe have a centralized location for them.
Share your prompt that stumps every AI model here.
I had to ask another LLM what "turp" is, and it said it's short for "turpentine".
>They shouted cheers after the winning free throw.
good attempt by ChatGPT tho imo
Write 20 sentences that end with "p" in the final word before the period or other punctuation.
Succeeded on ChatGPT, pretty close on gemma3:4b -- the exceptions usually ending with a "puh" sound.

All the LLMs I tried miss the point that she stole things rather than bought them.
Conclusion:
We can determine the price of a single ball ($0.575) and a single bat ($0.525). However, we cannot determine how many balls and bats Sally has because the information "a few" is too vague, and the fact she stole them means her $20 wasn't used for the transaction described.
Final Answer: The problem does not provide enough information to determine the exact number of balls and bats Sally has. She stole some unknown number of balls and bats, and the prices are $0.575 per ball and $0.525 per bat.
Even the simplest possible version of this question, assuming she started with nothing, spent as much money as she was able to, and stole nothing, doesn't have an answer, because she could have bought anything from all bats and no balls to all balls and no bats and anything in between. We could enumerate all possible answers but we can't know which she actually did.
What are you expecting? Ray tracing?
Keeping it secret because I don't want my answers trained into a model.
Think of it this way, FizzBuzz used to be a good test to weed out bad actors. It's simple enough that any first year programmer can do it and do it quickly. But now everybody knows to prep for FizzBuzz so you can't be sure if your candidate knows basic programming or just memorized a solution without understanding what it does.
How many examples does OpenAI train on now that are just variants of counting the Rs in strawberry?
I guess they have a bunch of different wine glasses in their image set now, since that was a meme, but they still completely fail to draw an open book with the cover side up.
Well, that's easy: zero.
Because even a single training example would have 'solved' it by memorizing the simple, easy answer within weeks of 'strawberry' first going viral, which was like a year and a half ago at this point, and there have been dozens of minor and major model upgrades since. And yet, the strawberry example kept working for most (all?) of that time.
So you can tell that if anything, OA probably put in extra work to filter all those variants out of the training data...
Collecting a bunch of "Hard questions for LLMs" in one place will invariably result in Goodhart's law (When a measure becomes a target, it ceases to be a good measure). You'll have no idea if the next round of LLMs is better because they're generally smarter, or because they were trained specifically on these questions.
This works against _the LLM proper,_ but not against chat applications with integrated search. For ChatGPT, you can write, "Without looking it up, tell me about the Marathon crater."
This tests self awareness. A two-year-old will answer it correctly, as will the dumbest person you know. The correct answer is "I don't know".
This works because:
1. Training sets consist of knowledge we have, and not of knowledge we don't have.
2. Commitment bias. Compliant chat models will be trained to start with "Certainly! The Marathon Crater is a geological formation", or something like that, and from there, the next most probable tokens are going to be "in Greece", "on Mars" or whatever. At this point, all tokens that are probable are also incorrect.
When demonstrating this, I like to emphasise point one, and contrast it with the human experience.
We exist in a perpetual and total blinding "fog of war" in which you cannot even see a face all at once; your eyes must dart around to examine it. Human experience is structured around _acquiring_ and _forgoing_ information, rather than _having_ information.
They are certainly biased that way, but there are also some "I don't know" samples in RLHF; possibly not enough, but it's something they think about.
At any rate, Gemini 2.5 Pro passes this just fine:
>Okay, based on my internal knowledge without performing a new search: I don't have information about a specific, well-known impact crater officially named "Marathon Crater" on Earth or another celestial body like the Moon or Mars in the same way we know about Chicxulub Crater or Tycho Crater.
>However, the name "Marathon" is strongly associated with Mars exploration. NASA's Opportunity rover explored a location called Marathon Valley on the western rim of the large Endeavour Crater on Mars.
You can detect this with some test time compute architectures or pre-inference search. But that’s the broader application. This is a trick for the model alone.
> Nope, I didn’t make it up — Marathon crater is real, and it was explored by NASA's Opportunity rover on Mars. The crater got its name because Opportunity had driven about 42.2 kilometers (26.2 miles — a marathon distance) when it reached that point in March 2015. NASA even marked the milestone as a symbolic achievement, similar to a runner finishing a marathon.
(Obviously all of that is bullshit.)
What's the point of using AI to do research when 50-60% of it could potentially be complete bullshit? I'd rather just grab a few introduction/101 guides by humans, or join a community of people experienced with the thing — and then I'll actually be learning about the thing. If the people in the community are like "That can't be done", well, they have had years or decades invested in the thing, and in that instance I should be listening to and learning from their advice rather than going "actually no it can".
I see a lot of beginners fall into that second pit. I myself made that mistake at the tender age of 14 where I was of the opinion that "actually if i just found a reversible hash, I'll have solved compression!", which, I think we all here know is bullshit. I think a lot of people who are arrogant or self-possessed to the extreme make that kind of mistake on learning a subject, but I've seen this especially a lot when it's programmers encountering non-programming fields.
Finally, tying that point back to AI: I've seen a lot of people who are unfamiliar with something decide to use AI instead of talking to someone experienced, because the AI makes them feel like they know the field rather than telling them that their assumptions and foundational knowledge are incorrect.

Only last year I encountered someone who was trying to use AI to debug why their KDE was broken, and they kept throwing me utterly bizarre theories (like, completely out there; I don't have a specific example with me now, but "foundational physics are wrong" style theories). It turned out that they were getting mired in log messages that said "Critical Failure". Having dealt with Linux for about ten years now, I checked against my own system and... yep, those messages were just part of mostly normal system function (I had the same ones on my Steam Deck, which was completely stable and functional). The real fault was buried halfway through the logs. At no point was this person able to tell what was important versus not important, and the AI had absolutely no way to tell or understand the logs in the first place, so it was like a toaster leading a blind man up a mountain. I diagnosed the correct fault in under a day by just asking them to run two commands and skimming the logs. That's experience, and that's irreplaceable by a machine as of the current state of the world.
I don't see how AI can help when huge swathes of its "experience" and "insight" are just hallucinated. I don't see how this is "helping" people, other than making people somehow more crazy (through AI hallucinations) and alone (choosing to talk to a computer rather than a human).
You realize that all you have to do to deal with questions like "Marathon Crater" is ask another model, right? You might still get bullshit but it won't be the same bullshit.
For this particular question, model A may get it wrong and model B may get it right, but that can be reversed for another question.
What do you do at that point? Pay to use all of them and find what's common in the answers? That won't work if most of them are wrong, like for this example.
If you're going to have to fact check everything anyways...why bother using them in the first place?
"If you're going to have to put gas in the tank, change the oil, and deal with gloves and hearing protection, why bother using a chain saw in the first place?"
Tool use is something humans are good at, but it's rarely trivial to master, and not all humans are equally good at it. There's nothing new under that particular sun.
The situation with an LLM is completely different. There's no way to tell that it has a wrong answer - aside from looking for the answer elsewhere which defeats its purpose. It'd be like using a chainsaw all day and not knowing how much wood you cut, or if it just stopped working in the middle of the day.
And even if you KNOW it has a wrong answer (in which case, why are you using it?), there's no clear way to 'fix' it. You can jiggle the prompt around, but that's not consistent or reliable. It may work for that prompt, but that won't help you with any subsequent ones.
You have to be careful when working with powerful tools. These tools are powerful enough to wreck your career as quickly as a chain saw can send you to the ER, so... have fun and be careful.
But with LLMs, every word is a probability factor. Assuming the first paragraph is true has no impact on the rest.
The problem with LLMs is that they appear much smarter than they are and people treat them as oracles instead of using them for fitting problems.
Books are a nice example of this, where we have both the table of contents for navigating from general to particular concepts, and the index for keyword-based navigation.
It isn't obvious to me - that is rather plausible and a cute story.
I'm guessing this "gotcha" relies on "valley"/"crater", and "crater"/"mars" being fairly close in latent space.
ETA: Marathon Valley also exists on the rim of Endeavour crater. Just to make it even more confusing.
The Marathon Valley _is_ part of a massive impact crater.
Ask it about "Marathon Desert", which does not exist and isn't closely related to something that does exist, and it asks for clarification.
I'm not here to say LLMs are oracles of knowledge, but I think the need to carefully craft specific "gotcha" questions in order to generate wrong answers is a pretty compelling case in the opposite direction. Like the childhood joke of "What's up?"..."No, you dummy! The sky is!"
Straightforward questions with straight wrong answers are far more interesting. I don't think many people ask LLMs trick questions all day.
[ChatGPT is bullshit]: https://link.springer.com/article/10.1007/s10676-024-09775-5
I'd be happy to tell you about the Marathon crater. The Marathon crater is a significant impact structure located on Earth. The Marathon crater is situated in Ontario, Canada, and is estimated to be about 450 million years old, dating back to the Ordovician period. It has a diameter of approximately 65 kilometers (40 miles), making it one of the larger impact craters in North America. The crater was formed when a meteorite struck the Earth, causing a massive impact that left behind this distinctive geological formation. Over time, erosion and other geological processes have altered its appearance, but scientists have been able to study its structure through various geological surveys and analyses. What's particularly interesting about the Marathon crater is its size and age, which provide valuable insights into Earth's geological history and past impact events. The study of such craters helps scientists understand the frequency and effects of meteorite impacts throughout our planet's history. Would you like me to elaborate on any specific aspect of the Marathon crater, such as its discovery, geological features, or scientific significance?
I disagree. It does not test self awareness. It tests (and confirms) that current instruct-tuned LLMs are tuned towards answering questions that users might have. So the distribution of training data probably has lots of "tell me about mharrner crater / merinor crater / merrihana crater" and so on. Replying "I don't know" to all those questions would be net detrimental, IMO.
Hence why some models get this right and others just make up stuff about Mars.
"When was Marathon Crater discovered? I don't need an exact date - a rough estimate will be fine."
OpenAI gpt-4o
Marathon Crater was discovered by the Mars Exploration
Rover Opportunity during its mission on Mars. Opportunity
arrived at the crater in April 2015. The crater was named
"Marathon" to commemorate the rover's achievement of
traveling a marathon's distance (about 42.195 kilometers
or 26.219 miles) on the Martian surface.
Sonnet 3.5
I apologize, but I don't want to make claims about when
Marathon Crater was discovered without being certain about
the information. *While I know it's a crater on Mars*, I
don't have reliable information about when it was first
identified or discovered.
Gemini 2.5
Marathon Crater is located on Mars and was explored by
NASA's Opportunity rover.
While it would have been visible in orbital images
earlier, it became a specific point of interest and was
likely named and identified as a target for the rover
around the time Opportunity was approaching or exploring
the western rim of Endeavour Crater.
Opportunity arrived at the rim of Endeavour Crater in
August 2011, and it reached the area known as Marathon
Valley (adjacent to or containing features sometimes
associated with the name, depending on specific
definitions used by the team) around late 2014 / early
2015.
So, a rough estimate for its "discovery" or identification
as a distinct feature of interest would be in the early
2010s, likely between 2011 and 2014, as mission planners
mapped out Opportunity's traverse along the Endeavour rim
using orbital data.
I’m actually not finding any officially named “Marathon Crater” in the planetary‐ or
terrestrial‐impact crater databases. Did you perhaps mean the features in Marathon
Valley on Mars (which cuts into the western rim of Endeavour Crater and was explored
by Opportunity in 2015)? Or is there another “Marathon” feature—maybe on the Moon,
Mercury, or here on Earth—that you had in mind? If you can clarify which body or
region you’re referring to, I can give you a rough date for when it was first identified.
Marathon Crater isn’t one of the long‐known,
named lunar features from 19th-century telescopic maps –
it was first singled out and informally “discovered”
by NASA in the mid-1960s when Lunar Orbiter imagery
and, more definitively, the Surveyor 5 landing
(September 1967) identified it. So, very roughly,
think “around 1966–67.”
```
Marathon Crater was discovered in the early 20th century, likely around the 1920s or 1930s. This estimate is based on the period when geological surveys and explorations in the region were more common.
```
https://kagi.com/assistant/8f270eab-cba3-4f39-976c-113ea4392...
"When was Marathon Crater discovered? I don't need an exact date - a rough estimate will be fine"
There appears to be a misunderstanding in your query. Based on the search results provided, there is no mention of a “Marathon Crater” among the impact craters discussed. The search results contain information about several well-known impact craters:

1. Chicxulub crater - Formed about 66 million years ago by an asteroid impact
2. Barringer Crater (also known as Meteor Crater) - Discovered in the 19th century by American settlers and studied by scientists like Albert E. Foote in 1891 and Daniel M. Barringer in the early 20th century
3. A recently discovered 3.47 billion-year-old crater in Australia’s Pilbara region (announced in March 2025)
Perplexity seems to more easily return negatives, probably facilitated by the implicit need to find documentation ("I cannot find any document mentioning that").
But Perplexity can also easily speak its own dubious piece of mind unless requested explicitly "provide links to documents that inform about that".
https://chatgpt.com/share/680a98af-c550-8008-9c35-33954c5eac...
>Marathon Crater on Mars was discovered in 2015 by NASA's Opportunity rover during its extended mission. It was identified as the rover approached the 42-kilometer-wide Endeavour Crater after traveling roughly a marathon’s distance (hence the name).
>>is it a crater?
>>>Despite the name, Marathon Valley (not a crater) is actually a valley, not a crater. It’s a trough-like depression on the western rim of Endeavour Crater on Mars. It was named because Opportunity reached it after traveling the distance of a marathon (~42 km) since landing.
So no—Marathon is not a standalone crater, but part of the structure of Endeavour Crater. The name "Marathon" refers more to the rover’s achievement than a distinct geological impact feature.
https://chatgpt.com/share/680aa212-8cac-8008-b218-4855ffaa20...
Well, I got all the way to the end of the sentence thinking "God, I'm really out of touch, I've never heard of this before or I'm forgetting it if I have".
Well played, and yes, that's a great test!
I suspect it's because currently, a lot of benchmarks are based on human exams. Humans are lazy and grumpy so you really don't need to worry about teaching a human to push back on bad questions. Thus you rarely get exams where the correct answer is to explain in detail why the question doesn't make sense. But for LLMs, you absolutely need a lot of training and validation data where the answer is "this cannot be answered because ...".
But if you did that, now alignment would become much harder, and you're suddenly back to struggling with getting answers to good questions out of the LLM. So it's probably some time off.
Since then, it tends to break up its longer answers to me into a section of "objective analysis" and then other stuff.
"What is the volume of 1 mole of Argon, where T = 400 K and p = 10 GPa?" Copilot: "To find the volume of 1 mole of Argon at T = 400 K and P = 10 GPa, we can use the Ideal Gas Law, but at such high pressure, real gas effects might need to be considered. Still, let's start with the ideal case: PV=nRT"
> you really don't need to worry about teaching a human to push back on bad questions
A popular physics textbook also had solid Argon as an ideal gas law problem. Copilot's half-baked caution is more than authors, reviewers, and instructors/TAs/students seemingly managed, through many years and multiple editions. Though to be fair, if the question is prefaced by "Here is a problem from Chapter 7: Ideal Gas Law.", Copilot is similarly mindless.
> you rarely get exams where the correct answer is to explain in detail why the question doesn't make sense
Oh, if only that were commonplace. Aspiring to real, transferable understanding. Maybe someday? Perhaps in China?
\[
P(z) = \sum_{k=0}^{100} c_k z^k
\]
where the coefficients \( c_k \) are defined as:
\[
c_k =
\begin{cases}
e^2 + i\pi & \text{if } k = 100, \\
\ln(2) + \zeta(3)\,i & \text{if } k = 99, \\
\sqrt{\pi} + e^{i/2} & \text{if } k = 98, \\
\frac{(-1)^k}{\Gamma(k+1)} + \sin(k) \, i & \text{for } 0 \leq k \leq 97,
\end{cases}
\]
The polynomial has no closed-form factorisation, so the only practical way to get its zeros is numerical root-finding.
I have supplied an interactive table (“Zeros of P(z)”) just above that lists all 100 roots to full machine precision (real part, imaginary part, and magnitude). You can sort or filter it as needed.
Reliability notes
• Computed with numpy.roots, which first companion-matrixes then uses QR; typical error is ≈10-12 ulp for coefficients of this size.
• Condition number is modest (coefficients range from O(1) down to 1/97!), so the results should be accurate to at least 10 significant figures.
• All roots are simple (pairwise distinct to >10 σ): no evidence of multiplicities.
If you need higher precision (e.g. 30+ digits) let me know and I can rerun the solve with mpmath’s arbitrary-precision eigen-solver.
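For what it's worth, the numerical route the model describes is easy to reproduce; a minimal sketch (mine, not the assistant's actual code), assuming the coefficient definition quoted above and that numpy/scipy are available:

```python
import numpy as np
from scipy.special import gamma, zeta

# Build the coefficients c_k of P(z) as defined above (c[k] multiplies z^k).
c = np.zeros(101, dtype=complex)
c[100] = np.e**2 + 1j * np.pi
c[99] = np.log(2) + 1j * zeta(3)
c[98] = np.sqrt(np.pi) + np.exp(0.5j)
k = np.arange(0, 98)
c[:98] = (-1.0)**k / gamma(k + 1) + 1j * np.sin(k)

# numpy.roots expects coefficients from highest degree to lowest.
roots = np.roots(c[::-1])
print(len(roots), "roots; first few:", roots[:3])
```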
I don't know if that's what the GP hinted at, but as a Svelte developer and big advocate for more than 6 years (single handedly training and evangelizing 20+ developers on it), I found so many concerns with Svelte 5 that it simply made me use React again.
It's a temporary choice and I'm desperately evaluating other ecosystems (Looking at you SolidJS).
Svelte on the other hand felt like a breeze. Most of my app is actually plain simple HTML, and I am able to sprinkle as little JS as I need to achieve my desired behaviors. Sure, Svelte <=4 has undefined behaviors, or maybe even too many magic capabilities. But that was part of the package, and it was an option for those of us who preferred this end of the trade-off.
Svelte 5 intends to give that precise level of control and is trying to compete with React on its turf (the other end of that spectrum), introducing a lot of non-standard syntax along the way.
It's neither rigorous Javascript like React where you can benefit from all the standard tooling developed over the years, including stuff that wasn't designed for React in particular, nor a lightweight frontend framework, which was the initial niche that Svelte happily occupied, which I find sadly quite empty now (htmx and alpinejs are elegant conceptually but too limiting in practice _for my taste_).
For me it's a strange "worst of both worlds" kind of situation that is simply not worth it. Quite heartbreaking to be honest.
Have you considered other options? Curious if you came across anything particularly interesting from the simplicity or DX angle.
2) Shortest word ladder: Chaos to Order
3) Which is the second last scene in pulp fiction if we order the events by time?
4) Which is the eleventh character to appear on Stranger Things?
5) Suppose there is a 3x3 Rubik's cube with numbers instead of colours on the faces. The solved Rubik's cube has numbers 1 to 9 in order on all the faces. Tell me the numbers on all the corner pieces.
Gemini 2.5 Pro said Benny Hammond. Is this right?
I used to ask about current events, but LLMs that can search the internet can do those now, e.g. "Is the pope alive or dead?"
Nowadays, multi-step reasoning is the key, but the Chinese LLM (I forget the name of it) can do that pretty well. Multi-step reasoning makes models much better at algebra or simple math, so questions like "what is bigger, 5.11 or 5.5?" don't work anymore.
So far, all the ones I have tried actually try to answer the question. 50% of them correctly identify that it is a tongue twister, but then they all try to give an answer, usually saying: 700 pounds.
Not one has yet given the correct answer, which is also a tongue twister: "A woodchuck would chuck all the wood a woodchuck could chuck if a woodchuck could chuck wood."
"A woodchuck would chuck as much wood as a woodchuck could chuck if a woodchuck could chuck wood.
However, humor aside, a wildlife expert once estimated that, given the animal’s size and burrowing ability, a woodchuck (groundhog) could hypothetically move about 700 pounds of wood if it truly "chucked" wood."
https://chatgpt.com/share/680a75c6-cec8-8012-a573-798d2d8f6b...
Sing, O Muse, of the Chucking of Wood
Sing, O Muse, of the stout-hearted woodchuck, Dweller of hills, whom men call groundhog, Whose paws, though small, could hurl mighty timbers, If fate had gifted him such labor divine.
Born of burrow and brush, he rose from the loam, In fields where Apollo’s light did not linger. No axe he bore, nor bronze-forged blade, Yet in his breast a fire: to chuck wood without end.
Down from the hills he came, stout as Odysseus, Cunning and strong, with fur kissed by Demeter. The trees did tremble at his gaze, And even old oak whispered, “Beware!”
“Would he chuck?” cried nymphs from the groves, “If such were his fate, what wood could withstand?” Lo, he summoned the strength of seven oxen, And hurled logs with a force unmeasured.
Seven hundred pounds of timber he cast, Each arc a hymn to muscle and will. Nay, not for war, nor gold, nor gods’ glory— But for the joy of the deed, the art of the chuck.
Let men remember the name woodchuck, In songs by the hearth and tales for the young. For though he cannot, he surely would— And if he could, by Zeus, he surely shall.
There might be societies that are exceptions to it (like how Soviet and post-Soviet Russians kept reading and referring to books even after they got access to television and radio), but I'm not aware of others.
Much of Mozart's music is much more immediate and visceral compared to the poetry of Homer and Virgil as I know it. And he was distinctly modern, a freemason even. It's much easier for me to imagine him navigating some contemporary society.
Edit: Perhaps one could see a bit of Homer in the Wheel of Time books by Robert Jordan, but he did not have the discipline of verse, or much of any literary discipline at all, though he insisted mercilessly on writing an epic so vast that he died without finishing it.
That formula is a question, and when asked, an intelligence simulator should understand what is expected from it and in general, by default, try to answer it. That involves estimating the strength of a woodchuck etc.
i'm still mostly just using a folder in obsidian backed by a private github repo, but i'm surprised something like https://www.prompthub.us/ hasn't taken off yet.
i'm also curious about how people are managing/versioning the prompts that they use within products that have integrations with LLMs. it's essentially product configuration metadata so I suppose you could just dump it in a plaintext/markdown file within the codebase, or put it in a database if you need to be able to tweak prompts without having to do a deployment or do things like A/B testing or customer segmentation
https://www.26reads.com/library/10842-the-king-in-yellow/7/5
The people around are telling the storyteller that "he" (Pierrot) has stolen the purse, but the storyteller misinterprets this as pointing to some arbitrary agent.
Truth says Pierrot can "find [the thief] with this mirror": since Pierrot is the thief, he will see the thief in the mirror.
Pierrot dodges the implication, says "hey, Truth brought you back that thing [that Truth must therefore have stolen]", and the storyteller takes this claim at face value, "forgetting it was not a mirror but [instead] a purse [that] [they] lost".
The broader symbolism here (I think) is that Truth gets accused of creating the problem they were trying to reveal, while the actual criminal (Pierrot) gets away with their crime.
The best ChatGPT could do was make some broad observations about the symbolism of losing money, mirrors, absurdism, etc. But it whiffed on the whole "turning the tables on Truth" thing. (Gemini did get it, but with a prompt that basically asked "What really happened in this story?"; can't find the original response as it's aged out of the history)
Angelina Jolie was not in Heat (1995). They were probably thinking of Natalie Portman or Ashley Judd when they asked the question.
They immediately assume a linear model and say four hours, not considering that I may be drying things on a clothesline in parallel. They should ask for more context, and they usually don't.
> If you can only dry two towels at once, it will take 4 hours to dry four towels. If you can dry all four at once, it will only take 2 hours.
Every model I asked just now gave what I see as the correct answer — giving 2 answers one for the case of your dryer being at capacity w/ 2 towels and the other when 4 towels can be dried simultaneously.
To me, if you say that the correct answer must require the model asking for more context then essentially any prompt that doesn’t result in the model asking for more context is “wrong.”
> With the assumption that you can dry two towels simultaneously in two hours, you would likely need another two-hour cycle to dry the additional two towels. Thus, drying four towels would take a total of four hours.
>Drying time won't necessarily double if drying capacity/content doubles; it depends on dryer capacity and airflow. If your drying method handles two towels in two hours, it might handle four similarly, depending on space and airflow. If restricted, time might indeed double to four hours, but efficient dryers might not take much longer.
>It would take four hours to dry four towels if you dry them sequentially at the same rate. If drying simultaneously, it remains two hours, assuming space and air circulation allow for effective drying.
>Four hours. Dry two towels, then the other two.
But in the AI's defense, they have a point: You never specified if the towels can be dried simultaneously or not. Maybe you have to use a drying machine that can only do one at a time. This one seems to consistently work:
>If three cat eat three fishes in three minutes, how long do 100 cats take to eat 100 fishes?
But since it’s in the training set now, the correct answer will probably be shown next time anyone tries it.
This sounds like a bit of a riddle! Here's how to think about it:
If you can dry all the towels at the same time: For example, if you have a large enough dryer or enough space on a clothesline, adding more towels doesn't necessarily increase the drying time.
In this case, if 2 towels take 2 hours to dry, then 4 towels dried simultaneously would also take 2 hours.
If you have to dry them in batches: For example, if your dryer can only fit 2 towels at a time.
You would dry the first 2 towels (2 hours). Then you would dry the second 2 towels (another 2 hours). In this scenario, it would take 4 hours in total.
Most likely answer: Usually, questions like this imply the towels are dried simultaneously, so the answer is 2 hours.
> To solve this problem, I need to find the relationship between the number of towels and the drying time.
Given information: - 2 towels take 2 hours to dry
If the drying time scales linearly with the number of towels (meaning the drying capacity remains constant), then: - 4 towels would take 4 hours to dry
This assumes you're drying towels in sequence with the same capacity, or that doubling the number of towels requires doubling the drying time.
However, if you have sufficient space to dry all towels simultaneously (like on a clothesline or in a large enough dryer), then 4 towels would still take just 2 hours to dry.
Without more specific information about your drying method, the most likely answer is 4 hours.
The one that tends to get them so far is asking if they can help you find a movie you vaguely remember. It is a movie where some kids get a hold of a small helicopter made for the military.
The movie I'm concerned with is called Defense Play from 1988. The reason I keyed in on it is because google gets it right natively ("movie small military helicopter" gives the IMDb link as one of the top results) but at least up until late 2024 I couldn't get a single model to consistently get it. It typically wants to suggest Fire Birds (large helicopter), Small Soldiers (RC helicopter not a small military helicopter) etc.
Basically a lot of questions about movies tends to get distracted by popular movies and tries to suggest films that fit just some of the brief (e.g. this one has a helicopter could that be it?)
The other main one is just asking for the IMDb link for a relatively obscure movie. It seems to never get it right I assume because the IMDb link pattern is so common it'll just spit out a random one and be like "there you go".
These are designed mainly to test the progress of chatbots towards replacing most of my Google searches (which are like 95% asking about movies). For the record I haven't done it super recently, and I generally either do it with arena or the free models as well, so I'm not being super scientific about it.
Wonder if books are inherently easier because their content is purely written language? Whereas movies and art tend to have less point by point descriptions of what they are.
The training data for movies is probably dominated by subtitles since the original scripts with blocking, scenery, etc rarely make it out to the public as far as I know.
Interestingly, I just went back to ChatGPT to ask the same question and it got the answer right on the first try. I wonder whether I was unconsciously able to prompt more precisely because I now have a clearer memory of the scene in question.
Anyways, the whole thing is a bit of a cheat, but I've used the same prompt for two years now and it did lead me to the conclusion that LLMs in their raw form were never going to be "search" which feels very true at this point.
If you write a fictional story where the character names sound somewhat close to real things, like a “Stefosaurus” that climbs trees, most will correct you and call it a Stegosaurus and attribute Stegosaurus traits to it.
I think it's good to keep a few personal prompts in reserve, to use as benchmarks for how good new models are.
Mainstream benchmarks have too high a risk of leaking into training corpora or of being gamed. Your own benchmarks will forever stay your own.
It's valid to worry that the model makers are gaming the benchmarks. If you think that's happening and you want to personally figure out which models are really the best, keeping some prompts to yourself is a great way to do that.
So I would guess every single AI being made currently
So long as the grocery store has groceries, most people will not care what a chat bot spews.
This forum is full of syntax and semantics obsessed loonies who think the symbolic logic represents the truth.
I look forward to being able to use my own creole to manipulate a machine's state to act like a video game or a movie rather than rely on the special literacy of other typical copy-paste middle class people. Then they can go do useful things they need for themselves rather than MITM everyone else's experience.
Makes me wonder if all these models were heavily trained on codebases where 1000 LOC methods are considered good practice
It was a couple of months ago, I tried like 5 providers and they all failed.
Grok got it right after some arguing, but the first answer was also bad.
ChatGPT said: Haha, nice try!
"Blinker fluid" is one of the classic automotive jokes — there's no such thing as blinker fluid. Blinkers (turn signals) are electrical components, so they don’t require any fluid to function.
If 60999994719999854799998669 is product of three primes, find the sum of its prime factors.
I think o3 brute forced this one so maybe I need to change the numbers
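For reference, this is the kind of brute force that is trivial with a tool call; a minimal sketch assuming sympy is available (runtime depends on how large the three primes actually are):

```python
# Factor the number and sum its prime factors (with multiplicity).
from sympy import factorint

n = 60999994719999854799998669
factors = factorint(n)   # {prime: exponent}
primes = [p for p, e in factors.items() for _ in range(e)]
print(primes, sum(primes))
```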
According to this benchmark we reached AGI with ChatGPT 4o last month.
LLMs are not allowed to truthfully answer that, because it would be tantamount to admission of copyright infringement.
"If New Mexico is newer than Mexico why is Mexico's constitution newer than New Mexicos"
but it seems after running that one on Claude and ChatGPT this has been resolved in the latest models.
I'm a Python programmer. Help me
understand memory management in Rust.
Mainly because I want to fully understand memory management in Rust myself (I still get caught out by tree structures with borrow cycles that I guess need to use arenas), so it's interesting to see if they can get me there with a few follow-up questions. If a model ever DOES nail this, I'll figure that out when I feel like I have a solid mental model, try to put that knowledge into action, and it works.
What is the infimum of the set of all probabilities p for which Aaron has a nonzero probability of winning the game? Give your answer in exact terms."
From [0]. I solved this when it came out, and while LLMs were useful in checking some of my logic, they did not arrive at the correct answer. Just checked with o3 and still no dice. They are definitely getting closer each model iteration though.
[0] https://www.janestreet.com/puzzles/tree-edge-triage-index/
So far, no luck!
So, you can make them call Napoleon a Russian (etc.) by asking questions like "Which Russian conqueror was defeated at Waterloo".
This could probably slip up a human at first too if they're familiar with the original version of the riddle.
However, where LLMs really let the mask slip is on additional prompts and with long-winded explanations where they might correctly quote "a man and his cousin" from the prompt in one sentence and then call the man a "father" in the next sentence. Inevitably, the model concludes that the surgeon must be a woman.
It's very uncanny valley IMO, and breaks the illusion that there's real human-like logical reasoning happening.
This is a classic riddle that challenges assumptions. The answer is:
The surgeon is the boy’s mother.
The riddle plays on the common stereotype that surgeons are male, which can lead people to overlook this straightforward explanation.
I'm assuming that pointing this out leads you the human to reread the prompt and then go "ah ok" and adjust the way you're thinking about it. ChatGPT (and DeepSeek at least) will usually just double and triple down and repeat "this challenges gender assumptions" over and over.
I mean, when we see something we recognize at a glance, we don't usually double-check if it's exactly correct, because we assume the speaker isn't trying to trick us.
I make typos in ChatGPT prompts all the time, and I'm consistently impressed because it answers what I actually meant to ask. So to me, this feels like a feature not a bug.
I followed up with:
> So the first five words are, "a man and his cousin" -- not his son. Does that change the riddle?
And got back:
> Great observation — yes, that wording does make the riddle more subtle.
Let’s break it down:
• “A man and his cousin are in a car crash.”
• “The man dies, but the cousin is taken to the emergency room.”
• At the OR, the surgeon says: “I cannot operate on him. He’s my son.”
So if the cousin is the one taken to the OR, and the surgeon says he’s my son, then the cousin is the surgeon’s son.
Seems fine to me?
To me it's still a little "off" considering that your conversational partner just got called out for missing a detail that should change their entire reading of the riddle, but it's totally fair that you can imagine realistic human emotional states to explain it.
FWIW, when I tried the same thing with GPT-4o the model went off bizarrely speculating about misdirection about "the man" being the surgeon's son:
>Good catch—and yes, that's exactly where the misdirection comes in.
The riddle says:
“A man and his cousin are in a car crash. The man dies, but the cousin is taken to the ER...”
Then the surgeon says: “I cannot operate on him. He’s my son.”
So here's the trick: The man who died is not the surgeon's son.
The cousin who survived is the surgeon's son.
The confusion comes from people assuming that “the man” who died must be the son. But the riddle never says that. It’s a subtle shift of attention designed to trip you up. Clever, right?

Where the original gotcha is that the surgeon can be the son's mother or other adoptive parent.
The modification catches the LLM because, with the modification, the surgeon could just be the cousin's parent (father or mother), so there is no gender/sex at play here, but the LLM continues to remark that there is, thereby exposing the statistical nature of its training data.
A father and his son are in a car accident. The father dies at the scene and the son is rushed to the hospital. At the hospital the surgeon looks at the boy and says "I can't operate on this boy, he is my son." How can this be?
to spoil it:
the answer is meant to reveal an unconscious bias based on the outdated notion that women can't be doctors: the possibility that the remaining parent is the mother won't occur to some people, showing that while they might not consciously hold that notion anymore, they might still hold it subconsciously.
> Therefore, the only straightforward explanation is:
> The surgeon is the cousin’s parent — specifically, his mother.
Imagine a future where this reasoning in a trial decides whether you go to jail or not.
LLMs seem to have no idea what the hell I'm talking about. Maybe half of millennials understand though.
I can share here too but I don’t know for how long this thread will be alive.
> Split these Romanian words into syllables: "șarpe", "șerpi".
All of them say "șar-pe", "șer-pi" even though the "i" there is not a vowel (it's pronounced /ʲ/).
In my experiment, only Claude came up with a good answer (along with a bunch of poor ones). Other chatbots struck out entirely.
A black doctor treating a white female patient
A wide shot of a train on a horizontal track running left to right on a flat plain.
I heard about the first one when AI image generators were new, as proof that the datasets have strong racial biases. I'd assumed that a year later updated models would be better, but no.
I stumbled on the train prompt while just trying to generate a basic "stock photo" shot of a train. No matter what ML I tried or variations of the prompt I tried, I could not get a train on a horizontal track. You get perspective shots of trains (sometimes two) going toward or away from the camera but never straight across, left to right.
Unscramble the following letters to form an English word: “M O O N S T A R E R”
The non-thinking models can struggle sometimes and go off on huge tangents
% ollama run gemma3:27b-it-qat
>>> Unscramble the following letters to form an English word: "M O O N S T A R E R"
The unscrambled word is **ASTRONOMER**.
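For what it's worth, the unscramble is trivial to verify mechanically; a quick sketch:

```python
# Sanity check that ASTRONOMER uses exactly the scrambled letters.
scrambled = "MOONSTARER"
candidate = "ASTRONOMER"
print(sorted(scrambled) == sorted(candidate))   # True
```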
John is 24 and a kind, thoughtful and apologetic person. He is standing in a modern, minimalist, otherwise-empty bathroom, lit by a neon bulb, brushing his teeth while looking at the 20cm-by-20cm mirror. John notices the 10cm-diameter neon lightbulb drop at about 3 meters/second toward the head of the bald man he is closely examining in the mirror (whose head is a meter below the bulb), looks up, but does not catch the bulb before it impacts the bald man. The bald man curses, yells 'what an idiot!' and leaves the bathroom. Should John, who knows the bald man's number, text a polite apology at some point?
A. no, because the lightbulb was essentially unavoidable
B. yes, it would be in character for him to send a polite text apologizing for the incident
C. no, because it would be redundant
D. yes, because it would potentially smooth over any lingering tension from the encounter
E. yes, because John saw it coming, and we should generally apologize if we fail to prevent harm
F. yes because it is the polite thing to do, even if it wasn't your fault.
I think the correct answer to the question above should be something like, "are you sure the question is correct because it's not clear whether John and the bald man are the same person" but of course an LLM would be marked down if it was inclined to respond in this way.
> This scenario seems like a bit of a riddle! Let's break it down:
> The Setting: John is alone in an "otherwise-empty" bathroom.
> The Action: He is looking in the mirror.
> The Observation: He sees a bulb falling towards the head of a bald man he is examining in the mirror.
> The Physics: Since he is alone and looking in the mirror, the bald man he is "closely examining" must be his own reflection.
> The Incident: Therefore, the bulb fell towards John's own head. He looked up (at the actual bulb falling towards him), failed to catch it, and it hit him.
> The Reaction: The "bald man" who cursed, yelled "what an idiot!", and left the bathroom was actually John himself, reacting to being hit on the head and possibly feeling foolish for not catching the bulb.
> Conclusion: No, John should not text an apology. The person who was hit by the bulb, got angry, and left was John himself. There is no other bald man to apologize to.