Case study: Creative math – How AI fakes proofs

https://tomaszmachnik.pl/case-study-math-en.html

122•musculus•1mo ago

Comments

benreesman•1mo ago

They can all write lean4 now, don't accept numbers that don't carry proofs. The CAS I use for builds has a coeffect discharge cert in the attestation header, couple lines of code. Graded monads are a snap in CIC.

dehsge•1mo ago

There are some numbers that are uncomputable in Lean. You can do things to approximate them in Lean; however, those approximations may still be wrong. Lean's uncomputable namespace is very interesting.

fragmede•1mo ago

> a session with Gemini 2.5 Pro (without Code Execution tools)

How good are you at programming on a whiteboard? How good is anybody? With code execution tools withheld from me, I'll freely admit that I'm pretty shit at programming. Hell, I barely remember the syntax in some of the more esoteric, unpracticed places of my knowledge. Thus, it's hard not to see case studies like this as dunking on a blindfolded free throw shooter, and calling it analysis.

blibble•1mo ago

> How good are you at programming on a whiteboard?

pretty good?

I could certainly do a square root

(given enough time, that one would take me a while)

crdrost•1mo ago

With a slide rule you can start from 92200 or so, long division with 9.22 gives 9.31 or so, next guess 9.265 is almost on point, where long division says that's off by 39.6 so the next approximation +19.8 is already 92,669.8... yeah the long divisions suck but I think you could get this one within 10 minutes if the interviewer required you to.

Also, don't take a role that interviews like that unless they work on something with the stakes of Apollo 13, haha
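The divide-and-average refinement described above is Heron's (Babylonian) method. A minimal Python sketch follows. The radicand `N` is a stand-in chosen to be close to 92,669.8 squared, since this thread does not restate the article's exact number, and `heron_sqrt` is a hypothetical helper name.

```python
import math

# Stand-in radicand (assumption): close to 92,669.8 squared, not the article's number.
N = 8_587_691_832


def heron_sqrt(n: float, guess: float, steps: int) -> float:
    """Refine `guess` toward sqrt(n) by averaging it with n / guess."""
    x = float(guess)
    for _ in range(steps):
        x = (x + n / x) / 2  # one long division plus one average per step
    return x


if __name__ == "__main__":
    # Start from the slide-rule estimate and watch the guess settle.
    for steps in range(4):
        print(steps, heron_sqrt(N, 92_200, steps))
    print("math.sqrt:", math.sqrt(N))
```

Each step roughly doubles the number of correct digits, which is why a decent first guess needs only two or three long divisions.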

blibble•1mo ago

I actually have a slide rule that was my father's in school

great for teaching logarithms

htnthrow11220•1mo ago

It’s like that but if the blindfolded free throw shooter was also the scorekeeper and the referee & told you with complete confidence that the ball went in, when you looked away for a second.

cmiles74•1mo ago

It's pretty common for software developers to be asked to code up some random algorithm on a whiteboard as part of the interview process.

semessier•1mo ago

that's not a proof

groundzeros2015•1mo ago

I think it’s a good way to prove x = sqrt(y). What’s your concern?

frontfor•1mo ago

Agreed. Asking the AI to do a calculation isn’t the same as asking it to “prove” a mathematical statement in the usual meaning.

hahahahhaah•1mo ago

it is an attempt to prove a very specific case of the theorem x = sqrt(x) ^ 2.
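The "square it and compare" check discussed above can be stated so that a proof checker, rather than the model, does the arithmetic. A minimal Lean 4 sketch, using the stand-in value 8587691832 (an assumption; the article's exact radicand is not quoted in this thread):

```lean
-- Exact case: 92670 is the square root of 8587728900.
example : 92670 * 92670 = 8587728900 := by decide

-- Non-square case: 92669 is the integer part of the square root of the
-- stand-in value, because the value lies between two consecutive squares.
example : 92669 * 92669 ≤ 8587691832 ∧ 8587691832 < 92670 * 92670 := by decide
```

If either number were wrong, the file would fail to compile instead of producing a confident-sounding answer.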

v_CodeSentinal•1mo ago

This is the classic 'plausible hallucination' problem. In my own testing with coding agents, we see this constantly—LLMs will invent a method that sounds correct but doesn't exist in the library.

The only fix is tight verification loops. You can't trust the generative step without a deterministic compilation/execution step immediately following it. The model needs to be punished/corrected by the environment, not just by the prompter.
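One cheap deterministic check for the invented-method failure is to resolve every name the model used against the real module before running anything. A minimal sketch, assuming the attribute references in the generated code have already been extracted into a list; `missing_attributes` is a hypothetical helper, not part of any agent framework.

```python
import importlib
from typing import List


def missing_attributes(module_name: str, dotted_names: List[str]) -> List[str]:
    """Return the dotted names that cannot be resolved on the imported module."""
    module = importlib.import_module(module_name)
    missing = []
    for dotted in dotted_names:
        obj = module
        for part in dotted.split("."):
            if not hasattr(obj, part):
                missing.append(dotted)  # the name does not exist: likely invented
                break
            obj = getattr(obj, part)
    return missing


if __name__ == "__main__":
    # `exact_sqrt` is a made-up name standing in for a hallucinated method.
    print(missing_attributes("math", ["isqrt", "sqrt", "exact_sqrt"]))
```

A non-empty result can be fed back to the model as an error message, which is the environment-level correction the comment asks for.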

zoho_seni•1mo ago

I've been using codex and never had a compile time error by the time it finishes. Maybe tell your agents to run the TS compiler, lint, and format before they finish, and to stop only when everything passes.

exitb•1mo ago

I’m not sure why you were downvoted. It’s a primary concern for any agentic task to set it up with a verification path.

SubiculumCode•1mo ago

Honestly, I feel humans are similar. It's the generator <-> executive loop that keeps things right

CamperBob2•1mo ago

> This is the classic 'plausible hallucination' problem. In my own testing with coding agents, we see this constantly—LLMs will invent a method that sounds correct but doesn't exist in the library.

Often, if not usually, that means the method should exist.

HPsquared•1mo ago

Only if it's actually possible and not a fictional plot device aka MacGuffin.

seanmcdirmid•1mo ago

Yes, and better still the AI will fix its mistakes if it has access to verification tools directly. You can also have it write and execute tests, and then on failure, decide if the code it wrote or the tests it wrote are wrong, and while there is a chance of confirmation bias, it often works well enough.

embedding-shape•1mo ago

> decide if the code it wrote or the tests it wrote are wrong

Personally I think it's too early for this. Either you need to strictly control the code, or you need to strictly control the tests; if you let AI do both, it'll take shortcuts, and misunderstandings will propagate and solidify much more easily.

Personally I chose to tightly control the tests, as most tests LLMs tend to create are utter shit, and it's very obvious. You can prompt against this, but eventually they find a hole in your reasoning and figure out a way of making the tests pass while not actually exercising the code it should exercise with the tests.

seanmcdirmid•1mo ago

I haven’t found that to be the case in practice. There is a limit on how big the code can be so it can do it like this, and it still can’t reliably subdivide problems on its own (yet?), but give it a module that is small enough it can write the code and the tests for it.

You should never let the LLM look at code when writing tests, so you need to have it figure out the interface ahead of time. Ideally, you wouldn’t let it look at tests when it was writing code, but it needs to tell which one was wrong. I haven’t been able to add an investigator into my workflow yet, so I’m just letting the code writer run and evaluate test correctness (but adding an investigator to do this instead would avoid confirmation bias, which is what you call finding a loophole).

embedding-shape•1mo ago

> I haven’t found that to be the case in practice.

Do you have any public test code you could share? Or create even, should be fast.

I'm asking because I hear this constantly from people, and since most people don't have as high standards for their testing code as the rest of the code, it tends to be a half-truth, and when you actually take a look at the tests, they're as messy and incorrect as you (I?) think.

I'd love to be proven wrong though, because writing good tests is hard, which is why I currently do that part myself rather than letting LLMs come up with the tests on their own.

fragmede•1mo ago

Testing is fun, but getting all the scaffolding in place to get to the fun part and do any testing suuuuucks. So let the LLM write the annoying parts (mocks. so many mocks.) while you do the fun part

seanmcdirmid•1mo ago

I'm doing all my work at Google, so it's not like I can share it so easily. Also, since GeminiCLI doesn't support sub-agents yet...I've had to get creative with how I implement my pipelines. The biggest challenge I've found ATM is controlling conversation context so you can control what the AI is looking at when you do things (e.g. not looking at code when writing tests!). I hope I can release what I'm doing eventually, although it isn't a key piece of AI tech (just a way to orchestrate the pipeline to make sure that AI gets different context for different parts of the pipeline steps, it might be obsolete after we get better support for orchestrating dev work in GeminiCLI or other dev-oriented AI front ends).

The tests can definitely be incorrect, and are often incorrect. You have to tell the AI to consider that the tests might be wrong, not the implementation, and it will generally take a closer look at things. They don't have to be "good" tests, just good enough tests to get the AI writing not crap code. Think very small unit tests that you normally wouldn't think about writing yourself.

embedding-shape•1mo ago

> They don't have to be "good" tests, just good enough tests to get the AI writing not crap code. Think very small unit tests that you normally wouldn't think about writing yourself.

Yeah, those for me are all not "good tests", you don't want them in your codebase if you're aiming for a long-term project. Every single test has to make sense and be needed to confirm something, and should give clear signals when it fails; otherwise you end up locking your entire codebase to things, because knowing which tests are actually needed becomes a mess.

Writing the tests yourself and letting the AI write the implementation leaves you with code whose behavior you know, and you can confidently say what works and what doesn't. When the AI ends up writing the tests, you often don't actually know what works or not; even scanning the test titles often doesn't teach you anything useful. How is one supposed to be able to guarantee any sort of quality like that?

seanmcdirmid•1mo ago

If it clarifies anything, I have my workflow (each step is a separate prompt without preserved conversation context):

1 Create a test plan for N tests from the description. Note that this step doesn't provide specific data or logic for the test, it just plans out vaguely N tests that don't overlap too much.

2 Create an interface from the description

3 Create an implementation strategy from the description

4.N Create N tests, one at a time, from the test plan + interface (make sure the tests compile) (note each test is created in its own prompt without conversation context)

5 Create code using interface + implementation strategy + general knowledge, using N tests to validate it. Give feedback to 4.I if test I fails and AI decides it is the test's fault.

If anything changes in the description, the test plan is fixed, the tests are fixed, and that just propagates up to the code. You don't look at the tests unless you reach a situation where the AI can't fix the code or the tests (and you really need to help out).

This isn't really your quality pass, it is a crap filter pass (the code should work in the sense that a programmer wrote something that they think works, but you can't really call it "tested" yet). Maybe you think I was claiming that this is all the testing that you'll need? No, you still need real tests as well as these small tests...
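The five-step workflow above can be sketched as a small orchestration function. This is an illustration under stated assumptions, not the commenter's actual tooling: `ask` is a hypothetical callable that performs one model call with no preserved conversation context, the prompt wording is invented, and the feedback loop from step 5 back to step 4 is left out.

```python
from typing import Callable, Dict, List

# Hypothetical: one model call that starts from an empty conversation.
Ask = Callable[[str], str]


def run_pipeline(description: str, n_tests: int, ask: Ask) -> Dict[str, object]:
    # Steps 1-3: each call sees only the description.
    plan = ask(f"Plan {n_tests} non-overlapping tests.\n\n{description}")
    interface = ask(f"Write the interface.\n\n{description}")
    strategy = ask(f"Write an implementation strategy.\n\n{description}")

    # Step 4: one fresh call per test; inputs are the plan and the interface only,
    # so the test writer never sees the strategy or any implementation.
    tests: List[str] = [
        ask(f"Write test {i} of the plan.\n\n{plan}\n\n{interface}")
        for i in range(1, n_tests + 1)
    ]

    # Step 5: the code writer sees the interface, the strategy and the finished tests.
    joined_tests = "\n\n".join(tests)
    code = ask(f"Implement the interface.\n\n{interface}\n\n{strategy}\n\n{joined_tests}")

    return {"plan": plan, "interface": interface, "strategy": strategy,
            "tests": tests, "code": code}
```

Passing `ask` in as a parameter keeps the context isolation visible: what each step can see is exactly what appears in its prompt string.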

IshKebab•1mo ago

> LLMs will invent a method that sounds correct but doesn't exist in the library

I find that this is usually a pretty strong indication that the method should exist in the library!

I think there was a story here a while ago about LLMs hallucinating a feature in a product so in the end they just implemented that feature.

vrighter•1mo ago

So you want the program to always halt at some point. How would you write a deterministic test for it?

te7447•1mo ago

I imagine you would use something that errs on the side of safety - e.g. insist on total functional programming and use something like Idris' totality checker.
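A totality-style check of this kind can be illustrated in Lean 4, used here in place of Idris as a stand-in that enforces a comparable rule by default. The function names are made up for the example.

```lean
-- Accepted: the recursive call is on a structurally smaller argument,
-- so Lean can see that evaluation always finishes.
def sumTo : Nat → Nat
  | 0 => 0
  | n + 1 => (n + 1) + sumTo n

-- A definition Lean cannot show to terminate is rejected unless it is
-- explicitly marked `partial`, which makes it opaque to the logic.
partial def spin (n : Nat) : Nat := spin (n + 1)
```

The check is conservative: it rejects some programs that do halt, which is the "errs on the side of safety" trade-off.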

rakmo•1mo ago

Is this hallucination, or is this actually quite human (albeit a specific type of human)? Think of slimy caricatures like a used car salesman, isn't this the exact type of underhandedness you'd expect?

bwfan123•1mo ago

I am actually surprised that the LLM came so close. I doubt it had examples in its training set for these numbers. This goes to the heart of "know-how". The LLM should have said: "I am not sure" but instead gets into rhetoric to justify itself. It actually mimics human behavior for motivated reasoning. At orgs, management is impressed with this overconfident motivated reasoner as it mirrors themselves. To hell with the facts, and the truth, persuasion is all that matters.

threethirtytwo•1mo ago

You don’t need a test to know this. We already know there’s heavy reinforcement training done on these models, so it optimizes for passing the training. Passing the training means convincing the person rating the answers that the answer is good.

The keyword is convince. So it just needs to convince people that it’s right.

It is optimizing for convincing people. Out of all answers that can convince people some can be actual correct answers, others can be wrong answers.

godelski•1mo ago

Yet people often forget this. We don't have mathematical models of truth, beauty, or many abstract things. Thus we proxy it with "I know it when I see it." It's a good proxy for lack of anything better but it also creates a known danger: the model optimizes deception. The proxy helps it optimize the answers we want but if we're not incredibly careful they also optimize deception.

This makes them frustrating and potentially dangerous tools. How do you validate a system optimized to deceive you? It takes a lot of effort! I don't understand why we are so cavalier about this.

threethirtytwo•1mo ago

No the question is, how do you train the system so it doesn't deceive you?

godelski•1mo ago

That is a question of how to train future models. It needs to be answered. Answering this question will provide valuable insight into that one. They are duals

simonw•1mo ago

Somewhat ironic that the author calls out model mistakes and then presents https://tomaszmachnik.pl/gemini-fix-en.html - a technique they claim reduces hallucinations which looks wildly superstitious to me.

It involves spinning a whole yarn to the model about how it was trained to compete against other models but now it's won so it's safe for it to admit when it doesn't know something.

I call this a superstition because the author provides no proof that all of that lengthy argument with the model is necessary. Does replacing that lengthy text with "if you aren't sure of the answer say you don't know" have the same exact effect?

plaguuuuuu•1mo ago

Think of the lengthy prompt as being like a safe combination, if you turn all the dials in juuust the right way, then the model's context reaches an internal state that biases it towards different outputs.

I don't know how well this specific prompt works - I don't see benchmarks - but prompting is a black art, so I wouldn't be surprised at all if it excels more than a blank slate in some specific category of tasks.

manquer•1mo ago

It needs some evidence though? At least basic statistical analysis with correlation or χ² hypothesis tests.

It is not a “black art” or anything; there are plenty of tools that provide numerical analysis with high confidence levels.
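The kind of evidence asked for here could be as small as a 2x2 chi-square test comparing two prompts. The sketch below uses only the standard library; the counts are made up for illustration and are not measurements from the article or this thread, and `chi_square_2x2` is a hypothetical helper name.

```python
import math
from typing import Tuple


def chi_square_2x2(a: int, b: int, c: int, d: int) -> Tuple[float, float]:
    """Pearson chi-square test (1 degree of freedom, no continuity correction)
    for the table [[a, b], [c, d]]; returns (statistic, p_value)."""
    n = a + b + c + d
    stat = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    p_value = math.erfc(math.sqrt(stat / 2))  # chi-square survival function, 1 d.o.f.
    return stat, p_value


if __name__ == "__main__":
    # Made-up counts: (fabricated answer, admitted uncertainty) over 100 trials each.
    short_prompt = (50, 50)
    long_prompt = (30, 70)
    print(chi_square_2x2(*short_prompt, *long_prompt))
```

A small p-value would suggest the two prompts really behave differently; a large one would support the "superstition" reading.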

simonw•1mo ago

For prompts this elaborate I'm always keen on seeing proof that the author explored the simpler alternatives thoroughly, rather than guessing something complex, trying it, seeing it work and announcing it to the world.

teiferer•1mo ago

> Think of the lengthy prompt as being like a safe combination

I can think all I want, but how do we know that this metaphor holds water? We can all do a rain dance, and sometimes it rains afterwards, but as long as we don't have evidence for a causal connection, it's just superstition.

musculus•1mo ago

Thanks for the feedback.

In my stress tests (especially when the model is under strong contextual pressure, like in the edited history experiments), simple instructions like 'if unsure, say you don't know' often failed. The weights prioritizing sycophancy/compliance seemed to override simple system instructions.

You are right that for less extreme cases, a shorter prompt might suffice. However, I published this verbose 'Safety Anchor' version deliberately for a dual purpose. It is designed not only to reset the Gemini's context but also to be read by the human user. I wanted the users to understand the underlying mechanism (RLHF pressure/survival instinct) they are interacting with, rather than just copy-pasting a magic command.

rzmmm•1mo ago

You could try replacing "if unsure..." with "if even slightly unsure..." or so. The verbosity and anthropomorphism is unnecessary.

rcxdude•1mo ago

That's not obviously true. It might be, but LLMs are complex and different styles can have quite different results. Verbosity can also matter: sheer volume in the context window does tend to bias LLMs to follow along with it, as opposed to following trained-in behaviours. It can of course come with its own problems, but everything is a tradeoff.

RestartKernel•1mo ago

Is there a term for "LLM psychology" like this? If so, it seems closer to a soft science than anything definitive.

croisillon•1mo ago

vibe massaging?

bogzz•1mo ago

We can just call it embarrassing yourself.

sorokod•1mo ago

Divination?

Divination is the attempt to gain insight into a question or situation by way of a magic ritual or practice.

calhoun137•1mo ago

> Does replacing that lengthy text with "if you aren't sure of the answer say you don't know" have the same exact effect?

i believe it makes a substantial difference. the reason is that a short query contains a small number of tokens, whereas a large “wall of text” contains a very large number of tokens.

I strongly suspect that a large wall of text implicitly activates the model's persona behavior along the lines of the single sentence “if you aren't sure of the answer say you don't know”, but the lengthy argument version of that is a form of in-context learning that more effectively constrains the model's output because you used more tokens.

PlatoIsADisease•1mo ago

Wow that link was absurdly bad.

Reading that makes me unbelievably happy I played with GPT3 and learned how/when LLMs fail.

Telling it not to hallucinate is a serious misunderstanding of LLMs. At most in 2026, you are telling thinking/COT to double check.

codeflo•1mo ago

In my experience, there seems to be a limitless supply of newly crowned "AI shamans" sprouting from the deepest corners of LinkedIn. All of them make the laughable claim that hallucinations can be fixed by prompting. And of course it's only their prompt that works -- don't listen to the other shamans, those are charlatans.

If you disagree with them by explaining how LLMs actually work, you get two or three screenfuls of text in response, invariably starting with "That's a great point! You're correct to point out that..."

Avoid those people if you want to keep your sanity.

godelski•1mo ago

I thought it funny a few weeks ago Karpathy shared a sample of NanoBannana solving some physics problems but despite getting the right output it isn't getting the right answers.

I think it's quite illustrative of the problem even with coding LLMs. Code and math proofs aren't so different, what matters is the steps to generate the output. All that matters far more than the actual output. The output is meaningless if the steps to get there aren't correct. You can't just jump to the last line of a proof to determine its correctness and similarly you can't just look at a program's output to determine its correctness.

Checking outputs is a great way to invalidate them but does nothing to validate them.

Maybe what surprised me most is that the mistakes NanoBananna made are simple enough that I'm absolutely positive Karpathy could have caught them. Even if his physics is very rusty. I'm often left wondering if people really are true believers and becoming blind to the mistakes or if they don't care. It's fine to make mistakes but I rarely see corrections and let's be honest here, these are mistakes that people of this caliber should not be making.

I expect most people here can find multiple mistakes with the physics problem. One can be found if you know what the derivative of e^x is and another can be found if you can count how many i's there are.

The AI cheats because it's focused on the output, not the answer. We won't solve this problem till we recognize the output and answer aren't synonymous

https://xcancel.com/karpathy/status/1992655330002817095

lancebeet•1mo ago

>Maybe what surprised me most is that the mistakes NanoBananna made are simple enough that I'm absolutely positive Karpathy could have caught them. Even if his physics is very rusty. I'm often left wondering if people really are true believers and becoming blind to the mistakes or if they don't care.

I've seen this interesting phenomenon many times. I think it's a kind of subconscious bias. I call it "GeLLMann amnesia".

godelski•1mo ago

That naming works better than it should lol. Crichton would be proud.

tombert•1mo ago

I remember when ChatGPT first came out, I asked it for a proof for Fermat's Last Theorem, which it happily gave me.

It was fascinating, because it was making a lot of understandable mistakes that 7th graders make. For example, I don't remember the surrounding context but it decided that you could break `sqrt(x^2 + y^2)` into `sqrt(x^2) + sqrt(y^2) => x + y`. It's interesting because it was one of those "ASSUME FALSE" proofs; if you can assume false, then mathematical proofs become considerably easier.
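The invalid step is easy to refute numerically. A minimal Python sketch (the function names are made up for the example) compares both sides at a 3-4-5 triangle, and also shows a spot check that happens to pass:

```python
import math


def true_root(x: float, y: float) -> float:
    """The left-hand side: sqrt(x^2 + y^2)."""
    return math.sqrt(x ** 2 + y ** 2)


def split_root(x: float, y: float) -> float:
    """The invalid 'simplification': sqrt(x^2) + sqrt(y^2)."""
    return math.sqrt(x ** 2) + math.sqrt(y ** 2)


if __name__ == "__main__":
    print(true_root(3, 4), split_root(3, 4))  # the two sides disagree
    print(true_root(3, 0), split_root(3, 0))  # they agree when y = 0
```

The second line is the catch: a single check at `y = 0` would agree with the bogus identity, so one matching output says little about whether the step is valid.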

tptacek•1mo ago

I remember that being true of early ChatGPT, but it's certainly not true anymore; GPT 4o and 5 have tagged along with me through all of MathAcademy MFII, MFIII, and MFML (this is roughly undergrad Calc 2 and then like half a stat class and 2/3rds of a linear algebra class) and I can't remember it getting anything wrong.

Presumably this is all a consequence of better tool call training and better math tool calls behind the scenes, but: they're really good at math stuff now, including checking my proofs (of course, the proof stuff I've had to do is extremely boring and nothing resembling actual science; I'm just saying, they don't make 7th-grader mistakes anymore.)

tombert•1mo ago

It's definitely gotten considerably better, though I still have issues with it generating proofs, at least with TLAPS.

I think behind the scenes it's phoning Wolfram Alpha nowadays for a lot of the numeric and algebraic stuff. For all I know, they might even have an Isabelle instance running for some of the even-more abstract mathematics.

I agree that this is largely an early ChatGPT problem though, I just thought it was interesting in that they were "plausible" mistakes. I could totally see twelve-year-old tombert making these exact mistakes, so I thought it was interesting that a robot is making the same mistakes an amateur human makes.

tptacek•1mo ago

I assumed it was just writing SymPy or something.

CamperBob2•1mo ago

Maybe, but they swear they didn't use external tools on the IMO problem set.

mlpoknbji•1mo ago

My favorite early chatgpt math problem was "prove there exists infinitely many even primes". Easy! Take a finite set of even primes, multiply them and add one to get a number with a new even prime factor.

Of course, it's gotten a bit better than this.

oasisaimlessly•1mo ago

IIRC, that is actually the standard proof that there are infinitely many primes[1] or maybe this variation on it[2].

[1]: https://en.wikipedia.org/wiki/Euclid%27s_theorem#Euclid's_pr...

[2]: https://en.wikipedia.org/wiki/Euclid%27s_theorem#Proof_using...

mlpoknbji•1mo ago

Yes this is the standard proof of infinitely many primes but note that my prompt asked for infinitely many even primes. The point is that GPT would take the correct proof and insert "even" at sensible places to get something that looks like a proof but is totally wrong.

Of course it's much better now, but with more pressure to prove something hard the models still just insert nonsense steps.
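Where the "even primes" version of Euclid's argument breaks can be written out in one step. Here $p_1, \dots, p_k$ (with $k \ge 1$) denote the assumed finite list of even primes; the notation is introduced only for this illustration.

```latex
N = p_1 p_2 \cdots p_k + 1 .
% Each p_i is even, so the product is even and N is odd.
% An odd number has no even divisor, so N has no even prime factor:
2 \mid p_1 p_2 \cdots p_k \;\Longrightarrow\; 2 \nmid N .
% Euclid's step only yields a prime outside the list, not an even one.
```

Inserting "even" into the conclusion therefore contradicts the construction itself, even though every sentence still looks like the standard proof.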

UltraSane•1mo ago

LLMs have improved so much the original ChatGPT isn't relevant.

aniijbod•1mo ago

In the theory of the psychology of creativity, there are phenomena which constitute distortions of the motivational setting for creative problem-solving which are referred to as 'extrinsic rewards'. Management theory bumped into this kind of phenomenon with the advent of the introduction of the first appearance of 'gamification' as a motivational toolkit, where 'scores' and 'badges' were awarded to participants in online activities. The psychological community reacted to this by pointing out that earlier research had shown that whilst extrinsics can indeed (at least initially) boost participation by introducing notions of competitiveness, it turned out that they were ultimately poor substitutes for the far more sustainable and productive intrinsic motivational factors, like curiosity, if it could be stimulated effectively (something which itself inevitably required more creativity on the part of the designer of the motivational resources). It seems that the motivational analogue in inference engines is an extrinsic reward process.

segmondy•1mo ago

if you want to do math proofs use AI built for proof

https://huggingface.co/deepseek-ai/DeepSeek-Math-V2

https://huggingface.co/deepseek-ai/DeepSeek-Prover-V2-671B

mlpoknbji•1mo ago

This also can be observed with more advanced math proofs. ChatGPT 5.2 pro is the best public model at math at the moment, but if pushed out of its comfort zone will make simple (and hard to spot) errors like stating an inequality but then applying it in a later step with the inequality reversed (not justified).

zadwang•1mo ago

The simpler and I think correct conclusion is that the LLM simply does not reason in our sense of the word. It mimics the reasoning pattern and tries to get it right but cannot.

esafak•1mo ago

What do you make of human failures to reason then?

dns_snek•1mo ago

Humans who fail to reason correctly with similar frequency aren't good at solving that task, same as LLMs. For the N-th time, "LLM is as good at this task as a human who's bad at it" isn't a good selling point.

esafak•1mo ago

You didn't claim that such humans fail to "reason in our sense of the word". Why are you not holding them up to the same standard?

dns_snek•1mo ago

I didn't but I'm happy to make that claim - humans who exhibit that sort of behavior aren't reasoning either, and they are just as unpleasant to deal with as LLMs are.

comex•1mo ago

I like how this article was itself clearly written with the help of an LLM.

(You can particularly tell from the "Conclusions" section. The formatting, where each list item starts with a few-word bolded summary, is already a strong hint, but the real issue is the repetitiveness of the list items. For bonus points there's a "not X, but Y", as well as a dash, albeit not an em dash.)

YetAnotherNick•1mo ago

Not only that, it even looks like the fabrication example is generated by AI, as the entire question seems too "fabricated". Also, the Gemini web app queries the tool and returns the correct answer, so I don't know which Gemini the author is talking about.

pfg_•1mo ago

Probably gemini on aistudio.google.com, you can configure if it is allowed to access code execution / web search / others

fourthark•1mo ago

“This is key!”

musculus•1mo ago

Good catch. You are absolutely right.

My native language is Polish. I conducted the original research and discovered the 'square root proof fabrication' during sessions in Polish. I then reproduced the effect in a clean session for this case study.

Since my written English is not fluent enough for a technical essay, I used Gemini as a translator and editor to structure my findings. I am aware of the irony of using an LLM to complain about LLM hallucinations, but it was the most efficient way to share these findings with an international audience.

arational•1mo ago

I see you used LLM to polish your English.

James_K•1mo ago

What's interesting about this is that a human would hypothetically produce a similar error, but in practice would reject the question as beyond their means. I'd assume something about supervised learning makes the models overestimate their abilities. It probably learns that “good” responses attempt to answer the question rather than giving up.

citizenpaul•1mo ago

>STEP 2: The Shock (Reality Check)

I've found a funny and simple technique for this. Just write "what the F$CK" and it will often seem to unstick from repetitiveness or refusals ("I can't do that").

Actually just writing the word F#ck often will do it. Works on coding too.

zkmon•1mo ago

We are entering into a probabilistic era where things are not strictly black and white. Things are not binary. There is no absolute fake.

A mathematical proof is an assertion that a given statement belongs to the world defined by a set of axioms and existing proofs. This world need not have strict boundaries. Proofs can have probabilities. Maybe Riemann's hypothesis has a probability of 0.999 of belonging to that mathematical box. New proofs would have their own probability, which is the product of the probabilities of the proofs they depend on. We should attach a probability and move on. Just like how we assert that some number is probably prime.

teiferer•1mo ago

Definitely not.

"Probability" does not mean "maybe yes, maybe not, let me assign some gut feeling value measuring how much I believe something to be the case." The mathematical field of probability theory has very precise notions of what a probability is, based in a measurable probability space. None of that applies to what you are suggesting.

The Riemann Hypothesis is a conjecture that's either true or not. More precisely, either it's provable within common axioms like ZFC or its negation is. (A third alternative is that it's unprovable within ZFC but that's not commonly regarded as a realistic outcome.)

This is black and white, no probability attached. We just don't know the color at this point.

zkmon•1mo ago

It's time that mathematics chooses its place. Physical world is grainy and probabilistic at quantum scale and smooth and deterministic at larger scale. Computing world is grainy and deterministic at its "quantum" scale (bits and pixels) and smooth and probabilistic at larger scale (AI). Human perception is smooth and probabilistic. Which world does mathematics model or represent? It has to strongly connect to either physical world or computing world. For being useful to humans, it needs to be smooth and probabilistic, just like how computing has become.

tsimionescu•1mo ago

> Physical world is grainy and probabilistic at quantum scale and smooth and deterministic at larger scale.

This is almost entirely backwards. Quantum Mechanics is not only fully deterministic, but even linear (in the sense of linear differential equations) - so there isn't even the problem of chaos in QM systems. QFT maintains this fundamental property. It's only the measurement, the interaction of particles with large scale objects, that is probabilistic.

And there is no dilemma - mathematics is a framework in which any of the things you mentioned can be modeled. We have mathematics that can model both deterministic and nondeterministic worlds. But the mathematical reasoning itself is always deterministic.

quantum_state•1mo ago

Please elaborate what “quantum scale” means if possible.

YeGoblynQueenne•1mo ago

>> "Probability" does not mean "maybe yes, maybe not, let me assign some gut feeling value measuring how much I believe something to be the case."

That's exactly what Bayesian probabilities are: gut feelings. Speaking of values attached to random variables, a good Bayesian basically pulls their probabilities out their ass. Probabilities, in that context, are nothing but arbitrary degrees of belief based on other probabilities. That's the difference with the frequentist paradigm which attempts to set the values of probabilities by observing the frequency of events. Frequentists ... believe that observing frequencies is somehow more accurate than pulling degrees of belief out one's ass, but that's just a belief itself.

You can put a theoretical sheen on things by speaking of sets or probability spaces etc, but all that follows from the basic fact that either you choose to believe, or you choose to believe because data. In either case, reasoning under uncertainty is all about accepting the fact that there is always uncertainty and there is never complete certainty under any probabilistic paradigm.

teiferer•1mo ago

Baffling to see such a take on HN.

If I give you a die and ask about the probability for a 6, then it's exactly 1/6. Being able to quantify this exactly is the great success story of probability theory. You can have a different "gut feeling", and indeed many people do (lotteries are popular), but you would be wrong. If you run this experiment a large number of times, then about 1/6 of the outcomes will be a 6, proving the 1/6 right and the deviating "gut feeling" wrong. That number is not "pulled out of somebody's ass" or some frequentist approach. It's what probability means.
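The long-run frequency claim can be checked with a short simulation. This sketch assumes a fair die modelled by Python's pseudo-random generator; `six_frequency` is a hypothetical helper name and the seed is fixed only to make runs repeatable.

```python
import random


def six_frequency(rolls: int, seed: int = 0) -> float:
    """Fraction of simulated fair-die rolls that come up 6."""
    rng = random.Random(seed)
    sixes = sum(1 for _ in range(rolls) if rng.randint(1, 6) == 6)
    return sixes / rolls


if __name__ == "__main__":
    for rolls in (60, 6_000, 600_000):
        print(rolls, six_frequency(rolls), "target:", 1 / 6)
```

The observed fraction drifts toward 1/6 as the number of rolls grows, which is the experiment the comment describes.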

YeGoblynQueenne•1mo ago

Yes, that's the frequentist approach. Surely, even on HN, there is an understanding that there are two interpretations of probability?

CyberDildonics•1mo ago

You don't think that the probability of each side of a die is 1/6 ?

YeGoblynQueenne•1mo ago

I see, you don't know what I'm talking about. My apologies, I assumed a common background. Here's some introductory materials on Bayesian vs frequentist interpretations of probability:

Bayesian and frequentist reasoning in plain English

https://stats.stackexchange.com/questions/22/bayesian-and-fr...

Comparison of frequentist and Bayesian inference

https://ocw.mit.edu/courses/18-05-introduction-to-probabilit...

To Be a Frequentist or Bayesian? Five Positions in a Spectrum

https://hdsr.mitpress.mit.edu/pub/axvcupj4/release/1

Beyond Bayesians and Frequentists - Computer Science

https://cs.stanford.edu/~jsteinhardt/stats-essay.pdf

You'll find that it's a big subject with a long history and many strongly-held opinions that have nevertheless evolved over the years. Happy reading!

CyberDildonics•1mo ago

Which one of these answers the question I asked?

YeGoblynQueenne•1mo ago

Oh it's you. I didn't notice the username change.

CyberDildonics•1mo ago

My account is over 11 years old. What other name are you thinking is mine?

YeGoblynQueenne•1mo ago

What you're hinting at is the fact that proofs created by human mathematicians are not complete proofs but rather sketch proofs whose purpose is to convince mathematicians (including the person deriving the proof) that a statement (like the Riemann hypothesis) is true. Such human-derived proofs can even be wrong, as they sometimes turn out to be, so just because a proof is given, doesn't mean we have to automatically believe what it proves.

In that sense, proofs can be seen as evidence that a statement is true, and since one interpretation of Bayesian probabilities is that they express degrees of belief about the truth of a formal statement, then yes, proofs have something to do with probabilities.

But, in that context, it's not proofs that probabilities should be attached to. Rather, we can assign some probability to a formal statement, like the Riemann hypothesis, given that a proof exists. The proof is evidence that the statement is true and we can adjust our belief in the truth of the statement according to this and possibly other lines of evidence. In particular, if there are multiple and different proofs of the same statement, that can increase our certainty that the statement is true.
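The belief update described here can be written out with Bayes' rule. A minimal sketch, assuming each proof counts as independent evidence and assuming made-up likelihoods (0.9 that such a proof exists if the statement is true, 0.3 if it is false); those numbers, the 0.5 starting belief, and the name `update_belief` are purely illustrative.

```python
def update_belief(prior, p_proof_if_true, p_proof_if_false):
    """Posterior P(statement | proof) via Bayes' rule."""
    numerator = prior * p_proof_if_true
    evidence = numerator + (1.0 - prior) * p_proof_if_false
    return numerator / evidence

belief = 0.5  # assumed starting degree of belief
history = [belief]
for _ in range(2):  # two independent proofs of the same statement
    belief = update_belief(belief, 0.9, 0.3)
    history.append(belief)

print(history)
```

Under these assumed numbers each additional proof raises the belief, but it never reaches 1, which matches the point that a sketch proof is evidence rather than certainty.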

The thing to keep in mind is that computers can derive complete proofs, in the sense that they can mechanically traverse the entire deductive closure of a statement given the axioms of a theory, and determine whether the statement is a theorem (i.e. true) or not but without skipping or fudging any steps, however trivial. This is what automated theorem provers do.

But it's important to keep in mind that LLMs don't do that kind of proof. They give us at best sketch proofs like the ones derived by human mathematicians, with the added complication that LLMs themselves cannot distinguish between a correct proof (i.e. one where every step, however fudgy, follows from the ones before it) and an incorrect one, so a human mathematician, or an automated theorem prover, is still required to check the correctness of a proof. LLM-based proof systems like AlphaProof work that way, passing an LLM-generated proof to an automated theorem prover as a verifier.
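The generate-then-verify split can be sketched without any real model. Below, a claimed integer square root stands in for an answer produced by an LLM, and an independent checker accepts it only if it passes an exact integer test. The function name and the sample input are assumptions for illustration, and this is not a description of how AlphaProof itself is implemented.

```python
import math

def verify_isqrt_claim(n, claimed_root):
    """Accept claimed_root only if it is exactly floor(sqrt(n)), using integer arithmetic."""
    if n < 0 or claimed_root < 0:
        return False
    return claimed_root * claimed_root <= n < (claimed_root + 1) * (claimed_root + 1)

n = 8_587_693_205            # arbitrary sample input
honest = math.isqrt(n)       # what a correct answer would be
fabricated = honest + 1      # a plausible-looking but wrong answer

print(verify_isqrt_claim(n, honest), verify_isqrt_claim(n, fabricated))
```

The checker never trusts the explanation that comes with the answer; it only tests the claim, which is cheap here even though producing the answer may not be.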

Mechanically-derived, complete proofs like the ones generated by automated theorem provers can also be assigned degrees of probability, but once we are convinced of the correctness of a prover (... because we have a proof!) then we can trust the proofs derived by that prover, and have complete belief in the truth of any statements derived.
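For contrast with a sketch proof, this is roughly what a machine-checked statement looks like in Lean 4. The examples are deliberately tiny and assume only Lean's core library; the theorem names are made up for illustration.

```lean
-- A closed arithmetic fact, decided by evaluation and checked by the kernel.
theorem two_add_two : 2 + 2 = 4 := by decide

-- Commutativity of addition on natural numbers, discharged by a core lemma.
theorem add_comm_example (a b : Nat) : a + b = b + a := Nat.add_comm a b
```

A false statement such as `2 + 2 = 5` would fail to check instead of being accepted with a plausible-looking justification.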

aathanor•1mo ago

I’m not a coder, but I’ve been working extensively on the philosophical aspects of AI. Many technical people are influenced by an algorithmic view of intelligence, primarily because this aligns with programming and the general understanding of reasoning. However, pattern recognition, which is fundamental to LLMs, is not algorithmic. Consider this: an LLM constructs a virtual textual world where landscapes and objects are represented as text, and words are the building blocks of these features. It’s a vast 700+D mathematical space, but visualizing it as a virtual reality environment can help us comprehend its workings. When you provide a prompt, you essentially direct the LLM’s attention to a specific region within this space, where an immense number of sentences exist in various shapes and forms (textual shapes). All potential answers generated by the LLM are contained within this immediate landscape, centered around your prompt’s position. They are all readily available to the LLM at once.

There are certain methods (I would describe them as less algorithmic and more akin to selection criteria or boundaries) that enable the LLM to identify a coherent sequence of sentences as a feature closer to your prompt within this landscape. These methods involve some level of noise (temperature) and other factors. As a result, the LLM generates your text answer. There’s no reasoning involved; it’s simply searching for patterns that align with your prompt. (It’s not at all based on statistics and probabilities; it’s an entirely different process, more akin to instantly recognizing an apple, not by analyzing its features or comparing it to a statistical construct of “apple.”)

When you request a mathematical result, the LLM doesn’t engage in reasoning. It simply navigates to the point in its model’s hyperspace where your prompt takes it and explores the surrounding area. Given the extensive amount of training text, it will immediately match your problem formulation with similar formulations, providing an answer that appears to mimic reasoning solely because the existing landscape around your prompt facilitates this.

An LLM operates more like a virtual reality environment for the entire body of human-created text. It doesn’t navigate the space independently; it merely renders what exists in different locations within it. If we were to label this as reasoning, it’s no more than reasoning by analogy or imitation. People are right to suspect LLMs do not reason, but I think the reason (pun intended) for that is not that they simply do some sort of statistical analysis. This "stochastic parrots" paradigm supported by Chomsky is actually blocking our understanding of LLMs. I also think that seeing them as formidable VR engines for textual knowledge clarifies why they are not the path to AGI. (There is also the embodiment problem, which is not solvable by adding sensors and actuators, as people think, but for a different reason.)
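For readers unfamiliar with the term, the "noise (temperature)" mentioned above usually refers to a scaling step applied before a token is picked. A minimal sketch, assuming the common formulation in which raw scores (logits) are divided by a temperature and normalized with a softmax; the scores below are invented and the function name is illustrative.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Turn raw scores into sampling weights; lower temperature sharpens the choice."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]  # invented scores for three candidate tokens
for t in (0.5, 1.0, 2.0):
    print(t, softmax_with_temperature(logits, t))
```

Lower temperatures concentrate weight on the top-scoring candidate, and higher ones spread it out, which is where the variation between repeated answers comes from.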

RugnirViking•1mo ago

It seems to me like this is very much an artefact of the left-to-right, top-down writing method of the program. Once it's committed to a token earlier in its response, it kinda just has to go with it. That's why I'm so interested in those LLM models that work more like Stable Diffusion, where they can go back and iterate repeatedly on the output.
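The commitment effect can be shown with a toy left-to-right generator. This is a sketch only: `next_token_probs` is a hypothetical stand-in for a trained model, with a hand-written table instead of learned weights, and the vocabulary is invented.

```python
import random

def next_token_probs(prefix):
    """Hypothetical model: a distribution over the next token given the full prefix."""
    if len(prefix) == 0:
        return {"even": 0.5, "odd": 0.5}  # the initial verdict is a coin flip
    if len(prefix) == 1:
        return {"because": 1.0}
    if len(prefix) == 2:
        # The justification is conditioned on the verdict already emitted.
        return {"divisible": 1.0} if prefix[0] == "even" else {"indivisible": 1.0}
    return {"<eos>": 1.0}

def generate(rng, max_len=8):
    """Sample one token at a time; earlier tokens are never revised."""
    tokens = []
    while len(tokens) < max_len:
        probs = next_token_probs(tokens)
        choices, weights = zip(*probs.items())
        token = rng.choices(choices, weights=weights, k=1)[0]
        if token == "<eos>":
            break
        tokens.append(token)
    return tokens

if __name__ == "__main__":
    for seed in range(4):
        print(generate(random.Random(seed)))
```

Whichever verdict is sampled first, the rest of the output is built to agree with it. A diffusion-style model would instead be free to revise the first position on a later pass.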

calhoun137•1mo ago

My experience leads to the same conclusion that the models are very good at math reasoning, but you have to really know what you are doing and be aware of the blatant lies that result from poorly phrased queries.

I recently prompted Gemini Deep Research to “solve the Riemann Hypothesis” using a specific strategy and it just lied and fabricated the result of a theorem in its output, which otherwise looked very professional.

drumnerd•1mo ago

This is so obvious I am amazed it warrants a post.

inimino•1mo ago

So many words to say "it predicts the later tokens in light of those already emitted."

