EDIT: Look at all the people below just reacting to the headline and clearly not reading the posts. Aristotle (https://arxiv.org/abs/2510.01346) is key here folks.
EDIT2: It is clear that many of the people below don't even understand basic terminology. Something being a transformer doesn't make it an LLM (vision transformers, anyone?), and if you aren't training on language (e.g. AlphaFold, or Aristotle on LEAN stuff), it isn't a "language" model.
Do you mean that in this case, it was not an LLM?
It is far more than an LLM, and math != "language".
The second one being backed by a model.
> It is far more than an LLM
It's an LLM with a bunch of tools around it, and a slightly different runtime than ChatGPT. It's "only" that, but people - even here, of all places - keep underestimating just how much power there is in that.
> math != "language".
How so?
LLM = Large Language Model. "Large" refers to the number of parameters (and, in practice, depth) of the model, and implicitly to the amount of data used for training, and "language" means human (i.e. written, spoken) language. A Vision Transformer is not an LLM because it is trained on images, and AlphaFold is not an LLM because it is trained on molecular configurations.
Aristotle works heavily with formalized LEAN statements and expressions. While you can certainly argue this is a language of sorts, it is not at all the same "language" as the "language" in LLMs. Calling Aristotle an "LLM" just because it has a transformer is more misleading than truthful, because every single other aspect of it is far more clever and involved.
There is no reason or framing where you can say Aristotle isn't a language model.
It's an LLM.
It is far more than an LLM, and math != "language".
1. "The search algorithm is a highly parallel Monte Carlo Graph Search (MCGS) using a large transformer as its policy and value functon." ... "We use a generative policy to take progressively widened [7] samples from the large action space of Lean tactics, conditioning on the Lean proof state, proof history, and, if available, an informal proof. We use the same model and prompt (up to a task token) to compute the value function which guides the search."
See that 'large transformer' phrase? That's where the LLM is involved.
2. "A lemma-based informal reasoning system which generates informal proofs of mathematical state-ments, breaks these proofs down into lemmas, formalizes each lemma into Lean, and iterates this process based on formal feedback" ... "First, the actions it generates consist of informal comments in addition to Lean tactics. Second, it uses a hidden chain of thought with a dynamically set thinking budget before predicting an action."
Unless you're proposing that this team solved AGI, "chain of thought" is a specific term of art in LLMs.
3. "A geometry solver which solves plane geometry problems outside of Lean using an approach based on AlphaGeometry [45]." ... following the reference: "AlphaGeometry is a neuro-symbolic system that uses a neural language model, trained from scratch on our large-scale synthetic data, to guide a symbolic deduction engine through infinite branching points in challenging problems. "
AlphaGeometry, like all of Deepmind's Alpha tools, is an LLM.
Instead of accusing people of not reading the paper, perhaps you should put some thought into what the things in the paper actually represent.
I think it's because it comes off as if you are saying that we should move off of GenAI, and a lot of people use LLM when they mean GenAI.
This is a really hopeful result for GenAI (fitting deep models tuned by gradient descent on large amounts of data), and IMO this is possible because of specific domain knowledge and approaches that aren't there in the usual LLM approaches.
Edit: I just realized from https://news.ycombinator.com/item?id=46296801 that you're the CEO! - in that case maybe you, or whoever you think most appropriate from your organization, could submit it along with a text description of what it is, and what is the easiest and/or most fun way to try it out?
Edit: Also https://hn.algolia.com/?dateRange=all&page=2&prefix=true&que... for the most popular Show HNs. Don't be discouraged; we like personal/open source projects most often (Obsidian made #1).
- It took me a minute or two of clicking around to figure out that the (only?) way to use it is to create an API key, then start aristotle in the terminal and interact with it there. It could be more obvious I think.
- Your profile links to http://www.cs.stanford.edu/~tachim/ which doesn't work; should be http://cs.stanford.edu/~tachim/ (without the www) (I think Stanford broke something recently for the former not to work.)
Are you getting the same value in your work, in your field?
For me, deep research tools have been essential for getting caught up with a quick lit review about research ideas I have now that I'm transitioning fields. They have also been quite helpful with some routine math that I'm not as familiar with but is relatively established (like standard random matrix theory results from ~5 years ago).
It does feel like the spectrum of utility is pretty aligned with what you might expect: routine programming > applied ML research > stats/applied math research > pure math research.
I will say ~1 year ago they were still useless for my math research area, but things have been changing quickly.
So yes, there is value here, and quite a bit, but it requires a lot of forethought in how you structure your prompts, and you need to be super skeptical about the output as well as able to check that output minutely.
If you just plug in a bunch of data, formulate a query, and then use the answer uncritically, you're setting yourself up for a world of hurt and lost time by the time you realize you've been building your castle on quicksand.
I also found it can be helpful when exploring your mathematical intuitions on something, e.g. how a dropout layer might affect learned weights and matrix properties, etc. Sometimes it will find some obscure rigorous math that can be very enlightening or relevant to correcting clumsy intuitions.
But maybe that is just me. I have read some of Terence Tao's transcripts, and the questions he asks LLMs are higher complexity than what I ask. Yet, he often gets reasonable answers. I don't yet know how I can get these tools to do better.
I have wondered if he has access to a better model than I, the way some people get promotional merchandise. A year or two ago he was saying the models were as good as an average math grad student when to me they were like a bad undergrad. In the current models I don't get solutions to new problems. I guess we could do some debugging and try prompting our models with this Erdos problem and see how far we get. (edit: Or maybe not; I guess LLMs search the web now.)
The difference between free ChatGPT, GPT-5.2 Thinking, and GPT-5.2 Pro is enormous for areas like logic and math. Often the answer to bad results is just to use a better model.
Additionally, sometimes when I get bad results I just ask the question again with a slightly rephrased prompt. Often this is enough to nudge the models in the right direction (and perhaps get a luckier response in the process). However, if you are just looking at a link to a chat transcript, this may not be clear.
Meanwhile I can’t get Claude code to fix its own shit to save my life.
Maybe this should give you a hint that you're trying to use it in a different way than others?
People are still going to be moving the goalposts on this and claiming it's not all that impressive or that the solution must have been in the training data or something, but at this point that's kind of dubiously close to arguing that Terence Tao doesn't know what he's talking about, which to say the least is a rather perilous position.
At this point, I think I'm making a belated New Year's resolution to stop arguing with people who are still saying that LLMs are stochastic parrots that just remix their training data and can never come up with anything novel. I think that discussion is now dead. There are lots of fascinating issues to work out with how we can best apply LLMs to interesting problems (or get them to write good code), but to even start solving those issues you have to accept that they are at least somewhat capable of doing novel things.
In 2023 I would have bet hard against us getting to this point ("there's no way chatbots can actually reason their way through novel math!"), but here we are, three years later. I wonder what comes next?
Or are you just saying that solving novel problems involves remixing ideas? Well, that's true for human problem solving too.
What? If we're discussing novel synthesis, and it's being contrasted with answer-from-search / answer-from-remix, then the problem does not matter. Only the answer and the originality of the approach. Connecting two fields that were not previously connected is novel, or applying a new kind of technique to an old problem. Recognizing that an unsolved problem is very much like a solved one is search / remix. So what happened here? Tao says it
> is largely consistent with other recent demonstrations of AI using existing methods to resolve Erdos problem
Existing. Methods. Tao also says "This is a demonstration of the genuine increase in capability of these tools in recent months". This is the sentence everyone will focus on, so what is that capability?
> the more interesting capability revealed by these events is the ability to rapidly write and rewrite new versions of a text as needed, even if one was not the original author of the argument.
Rejoice! But rejoice for the right reasons, and about what actually happened. Style and voice transformations, interesting new capabilities for fuzzy search. Correct usage of external tools for heavy-lifting with symbolics. And yes, actual problem solving. Novel techniques, creativity, originality though? IDK, sounds kind of optimistic based on the detail here.
The METR institute predicts that the length of tasks AI agents can complete doubles every 7 months.
We should expect it to take until 2033 before AI solves Clay Institute-level problems with 50% reliability.
And once we have that formalized corpus, it's all set up as training data for moving forward.
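For what it's worth, a back-of-the-envelope version of that extrapolation, in Python. The horizon and effort figures below are rough assumptions plugged in for illustration, not METR data:

    import math

    # Assumed inputs, not published numbers: current 50%-reliability task
    # horizon on the order of a few hours, Clay-level problems on the order
    # of a few years of expert effort, horizon doubling every 7 months.
    current_horizon_hours = 4
    target_hours = 3 * 2000          # ~3 work-years at ~2000 hours/year
    doublings = math.log2(target_hours / current_horizon_hours)
    years_out = doublings * 7 / 12
    print(f"{doublings:.1f} doublings, roughly {years_out:.1f} years from now")
    # ~10.6 doublings, roughly 6.2 years out, i.e. the early 2030s under these assumptions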
I don’t think it can really be said to have occurred autonomously then?
Looks more like a 50/50 partnership with a super-expert human on the one side, which makes this way more vague in my opinion - and in line with my own AI tests, i.e. they are pretty stupid, even Opus 4.5 or whatever, unless you're already an expert and are doing boilerplate.
EDIT: I can see the title has now been fixed from "solved" to "more or less solved", which I still think is a big stretch.
Seems exactly like the tests at my company, where even frontier models are revealed to be very expensive rubber ducks that completely fail with non-experts or anything novel or math-heavy.
I.e. they mirror the intellect of the user but give you big dopamine hits that'll lead you astray.
"Aristotle integrates three main components: a Lean proof search system, an informal reasoning system that generates and formalizes lemmas, and a dedicated geometry solver"
Not saying it's not an amazing setup, I just don't understand the word "AI" being used like this when it's the setup / system that's brilliant, in conjunction with absolute experts.
https://en.wikipedia.org/wiki/Dartmouth_workshop
AI != AGI != neural networks != LLMs
But Tao did mention ChatGPT, so I believe LLMs were involved at least partially.
But speaking as a specialist in theorem proving, this result is pretty impressive! It would have likely taken me a lot longer to formalize this result even if it was in my area of specialty.
How did you arrive at "ridiculous"? What we're seeing here is incredible progress over what we had a year ago. Even ARC-AGI-2 is now at over 50%. Given that this sort of process is also being applied to AI development itself, it's really not clear to me that humans will remain a valuable component of knowledge work for much longer.
To me, this sounds like when we first went to the moon, and people were sure we'd be on Mars by the end of the '80s.
> Even ARC-AGI-2 is now at over 50%.
Any measure of "are we close to AGI" is as scientifically meaningful as "are we close to a warp drive" because all anyone has to go on at this point is pure speculation. In my opinion, we should all strive to be better scientists and think more carefully about what an observation is supposed to mean before we tout it as evidence. Despite the name, there is no evidence that ARC-AGI tests for AGI.
This hits so close to home. Just today, in my field, a manager without expertise in a topic gave me an AI solution to something I have expertise in. The AI was very plainly and painfully wrong, but it comes down to the user prompting really poorly. When I gave a well-formulated prompt on the same topic, I got the correct answer on the first go.
Recent case:
I have a bar with a number of weights supported on either end:
|---+-+-//-+-+---|
What order and/or arrangement of removing the weights would cause the least shift in the center of mass? There is a non-obvious trick you can pull here to reduce the shift considerably, and I was curious whether the AI would spot it or not, but even after lots of prompting it just circled around the obvious solutions rather than making a leap outside of that box and coming up with a solution that is better in every case.
I wonder what the cause of that kind of blindness is.
My guess is: first move the weights to the middle, and only then remove them.
However, “weights” and “bar” might confuse both machines and people into thinking that this is related to weight lifting, where there are two stops on the bar preventing the weights from being moved to the middle.
> There seems to be some confusion on this so let me clear this up. No, after the model gave its original response, I then proceeded to ask it if it could solve the problem with C=k/logN arbitrarily large. It then identified for itself what both I and Tao noticed about it throwing away k!, and subsequently repaired its proof. I did not need to provide that observation.
so it was literally "yo, your proof is weak!" - "naah, watch this! [proceeds to give full proof all on its own]"
I'd say that counts
Please remember that this is a theorem about integers that is subject to a fairly elementary proof that is well-supported by the existing Mathlib infrastructure. It seems that the AI relies on the symbolic proof checker, and the proofs that it is checking don't use very complex definitions in this result. In my experience, proofs like this which are one step removed from existing infra are much much more likely to work.
Again though, this is really insanely cool!!
To clear up a few misconceptions:
- Aristotle uses modern AI techniques heavily, including language modeling.
- Aristotle can be guided by an informal (English) proof. If the proof is correct, Aristotle has a good chance at translating it into Lean (which is a strong vote of confidence that your English proof is solid). I believe that's what happened here.
- Once a proof is formalized into Lean (assuming you have formalized the statement correctly), there is no doubt that the proof is correct. This is the core of our approach: you can do a lot of (AI-driven) search, and once you find the answer you are certain it's correct no matter how complex the solution is.
Happy to answer any questions!
Sometimes when I'm using new LLMs I'm not sure if it’s a step forward or just benchmark hacking, but formalized math results always show that the progress is real and huge.
When do you think Harmonic will reach formalizing most (even hard) human written math?
I saw an interview with Christian Szegedy (your competitor I guess) that he believes it will be this year.
In other fields (topology, probability, linear algebra), many key definitions are not in Mathlib yet, so you will struggle to write down the theorem itself. (But in some cases, Aristotle can define the structure you are talking about on the fly!)
This is not an intrinsic limitation of Lean; it's just that nobody has taken the time to formalize much of those fields yet. We hope to dramatically accelerate this process by making it trivial to prove lemmas, which makes up much of the work. For now, I still think humans should write the key definitions and statements of "central theorems" in a field, to ensure they are compatible with the rest of the library.
Do you have any plans to characterize these cases more fully, and perhaps propose your own contributions to mathlib itself on that basis?
My limited understanding of Rust is that it applies a fixed set of rules to guarantee memory safety. The rules are somewhat simple and limiting, for ease of understanding and implementation, but also because of undecidability.
Programmers run into situations where they know that their code won't cause memory errors, but it doesn't follow the rules. Wouldn't it be cool if something like Aristotle were integrated into the compiler? Any code for which a proof of correctness could be written would pass/compile, without having to add more and more rules.
Formal methods are cool because, by contrast to tools like the borrow checker, you can prove some very "nonlocal" properties: this system does not deadlock, or it makes progress at least every N steps, etc.
Eg you can give the vague spec 'build me a todo list app', but you can still formally prove that everything your app does finishes, or even that it finishes in reasonable time.
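As a toy illustration of that flavor of guarantee, here is a minimal Lean sketch (added for concreteness, and much weaker than real liveness or deadlock-freedom proofs): Lean refuses a recursive definition until it can see, or be shown, that the recursion terminates.

    def sumTo : Nat → Nat
      | 0 => 0
      | n + 1 => (n + 1) + sumTo n  -- structurally decreasing, so accepted as-is

    -- A definition like `def loop (n : Nat) : Nat := loop n` is rejected,
    -- because no termination argument can be found.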
Given a formal statement of what you want, Lean can validate that the steps in a (tedious) machine-readable purported proof are valid and imply the result from accepted axioms. This is not AI, but a tiny, well reviewed kernel that only accepts correct formal logic arguments.
So, if you have a formal statement that you've verified to represent what you are interested in by some other means, Lean can tell you whether the proof created by genAI is correct. Basically, there is a nigh infallible checker that won't accept incorrect hallucinations.
The lean proof checker then checks to make sure the proof actually proves the statement.
In this case an AI is generating the proof, but if it "compiles" it's correct. The only thing humans need to check is the statement to be proven.
(I don't know anything about this project, but I've played around with Lean and used other proof checkers more seriously.)
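A toy Lean example of that division of labor (illustrative only): the statement is the part a human has to read and trust; the proof term is whatever the search produced, and the kernel either accepts it or rejects it.

    theorem two_add_two : 2 + 2 = 4 := rfl
    -- Something like `theorem bad : 2 + 2 = 5 := rfl` is rejected by the kernel,
    -- no matter how (or by what) the "proof" was generated.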
> Let C > 0 and ϵ > 0 be sufficiently small. Are there infinitely many integers a, b, n with a ≥ ϵn and b ≥ ϵn such that a!b! ∣ n!(a+b−n)! and a + b > n + C log n?
Which seems like it's the type of thing you give as a homework problem to state formally in an intro class.
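For concreteness, here is one way the quoted statement might be written down in Lean 4 with Mathlib. This is only a sketch, not the formalization Aristotle actually used: the names are made up, and reading "sufficiently small" as "for any fixed positive C and ϵ" is just one possible interpretation.

    import Mathlib

    theorem erdos_like_statement (C ε : ℝ) (hC : 0 < C) (hε : 0 < ε) :
        ∀ N : ℕ, ∃ a b n : ℕ, N ≤ n ∧
          ε * n ≤ (a : ℝ) ∧ ε * n ≤ (b : ℝ) ∧ n ≤ a + b ∧
          a.factorial * b.factorial ∣ n.factorial * (a + b - n).factorial ∧
          (n : ℝ) + C * Real.log n < a + b := by
      sorry  -- statement only; "infinitely many" is phrased as "for every N there is an n ≥ N"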
These are the two main problems:
1. Formalizing a theorem.
2. Finding a formal proof.
Part 2 is where AI could help as proof search is full of heuristics. That's also how humans find proofs and is one of the main skills of a mathematician. The formal proof can then be machine checked with well known and mature techniques not involving AI.
Part 1 is the part that's missing and will always be hard. It's also the issue with formal verification of programs for which correctness criteria are often very complex and it's easy to mess up the formalization, so that even if you trust the proof, you can't trust that it proves the right thing.
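A tiny Lean illustration of that Part 1 failure mode (a made-up toy, not from the article): both statements typecheck, but only one says anything, because an impossible hypothesis makes the first vacuously true.

    theorem vacuous (n : Nat) (h : n < 0) : n * n = 7 :=
      absurd h (Nat.not_lt_zero n)  -- `n < 0` never holds, so any conclusion "follows"

    theorem intended (n : Nat) : 0 ≤ n * n :=
      Nat.zero_le _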
A little bit of nitpicking, but according to books like AIMA, that is indeed AI. In the first chapter, even any control system is classified as AI.
For the reasons stated in the first chapter, I totally agree with the authors.
The whole system is AI. That part is a verifier in a chain of “suggestions/instinct -> verifier”, like that used in neurosymbolic systems for automated driving, for example.
Aristotle is doing the matching AI part, the modern LLM approach, previously called fuzzy logic.
Both are AI.
However, there are some good heuristics. If you expect a problem to be hard and the proof is very short, you've probably missed something!
This is also a big risk when trying to prove code correctness: "prove this algo works" means you gotta define "works" along certain axes, and if you're very unlucky you might have a proof that exploits the uncertainty around a certain axis.
Claude Code can write lean, but we do a heck of a lot of RL on theorem proving, so Aristotle winds up being much better at writing Lean than other coding agents are.
Have you tried Aristotle on other, non-Lean tasks? Is it better at logical reasoning in general?
Second, when you say language modeling support, does that mean it can better understand code representations (ASTs), or something else? I am just an AI user, not very knowledgeable in the field. My main interest is whether it would be great for static analysis oriented to computer security (SAST).
How does this depend on the area of mathematics of the proof? I was under the impression that it was still difficult to formalize most research areas, even for a human. How close is Aristotle to this frontier?
It's another reformulation rather than a true proof. Now, instead of wanting a proof of a theorem, we just need to prove that this proof actually proves the theorem. The proof itself is so incomprehensible that it can't be trusted on its own, but if it can be shown to be trustworthy, then the theorem must be true.
Do you have any links to reading about how often lean core has soundness bugs or mathlib has correctness bugs?
That's a pretty big assumption, though, isn't it? As we saw with the Navier-Stokes psychosis episode over the New Year holiday, formalizing correctly really isn't guaranteed.
1. This result is very far from showing something like "human mathematicians are no longer needed to advance mathematics".
2. Even if it did show that, as long as we need humans trained in understanding maths, and since "professional mathematicians" are mostly educators, they probably aren't going anywhere.
observationist•9h ago
Or someone like Terence Tao might figure out how to wield AI better than anyone else, even the labs, and use the tools to take a bunch down at once. I'm excited to see what's coming this year.
malux85•9h ago
What happens if we reason upwards but change some universal constants? What happens if we use tau instead of pi everywhere? These kinds of fun questions would otherwise require an enormous intellectual effort, whereas with the mechanisation and automation of thought, we might be able to run them and see!
stouset•9h ago
Literally nothing other than mild convenience. It’s just 2pi.
ogogmad•7h ago
I'm not sure why that point gets lost in these discussions. And personally, I think of the set of fundamental mathematical objects as having a unique and objective definition. So, I get weirdly bothered by the offset in the Gamma function.
chmod775•9h ago
The "intellectual effort" this requires is about 0.
Maybe you meant Euler's number? Since it also relates to PI, it can be used and might actually change the framework in an "interesting way" (making it more awkward in most cases - people picked PI for a reason).
observationist•8h ago
I think what they were getting at is something like this: the application of existing ideas that simply haven't been applied in certain ways because it was too boring, obvious, or abstract for humans to have bothered with. AI can plow through a year's worth of human drudgery in a day or a month or so, and that sort of "brute force" won't require any amazing new technical capabilities from AI.
HardCodedBias•9h ago
https://ddcolrs.wordpress.com/2018/01/17/maxwells-equations-...
kridsdale3•9h ago
Google Scholar was a huge step forward for doing meta-analysis vs a physical library.
But agents scanning the vastness of PDFs to find correlations and insights that are far beyond human context capacity will, I hope, find a lot of knowledge that we have technically already collected, but remain ignorant of.
newyankee•8h ago
It is likely we'll see some AlphaGo-style new approaches in existing research workflows that AI might work out if there is some verification logic. Humans could probably never go into that space, or maybe none of the researchers ever ventured there for various reasons, as progress in general is almost always incremental.
semi-extrinsic•8h ago
In any given scientific niche, there is a huge amount of tribal knowledge that never gets written down anywhere, just passed on from one grad student to the rest of the group, and from there spreads by percolation in the tiny niche. And papers are never honest about the performance of the results and what does not work, there is always cherry picking of benchmarks/comparisons etc.
There is absolutely no way you can get these kinds of insights beyond human context capacity that you speak of. The information necessary does not exist in any dataset available to the LLM.
sublinear•9h ago
For all topics that can be expressed with language, the value of LLMs is shuffling things around to tease out a different perspective from the humans reading the output. This is the only realistic way to understand AI enough to make it practical and see it gain traction.
As much as I respect Tao, I feel like his comments about AI usage can be misleading without carefully reading what he is saying in the linked posts.
roadside_picnic•8h ago
A large amount of Tao's work is around using AI to assist in creating Lean proofs.
I'm generally on the more skeptical side of things regarding LLMs and grand visions, but assisting in the creation of Lean proofs is a huge area of opportunity for LLMs and really could change mathematics in fundamental ways.
One naive belief many people have is that proofs should be "intelligible" but it's increasingly clear this is not the case. We have proofs that are gigabytes (I believe even terabytes in some cases) in size, but we know they are correct because they check in Lean.
This particular pattern of using state-of-the-art work in two different areas (LLMs and theorem proving) absolutely has the potential to fundamentally change how mathematics is done. There's a great picture on p. 381 of Type Theory and Formal Proof where you can easily see two of the trickiest parts of that diagram that LLMs could be slotted in to solve.
Because the work is formally verified we can throw out entire classes of LLM problems (like hallucinations).
Personally, I think strongly typed languages with powerful type systems are also the long-term ideal for coding with LLMs (but I'm less optimistic about devs following this path).
zozbot234•8h ago
It absolutely is. With the twist that ChatGPT 5.2 can now also "explain" an AI-generated Lean proof in human-readable terms. This is a game changer, because "refactoring" can now become end-to-end: if the human explanation of a Lean proof is hard to grok and could be improved, you can test changes directly on the formal text and check that the proof still goes through for the original statement.
roadside_picnic•4h ago
Formal verification combined with AI is, imho, exactly the type of thinking that gets the most value out of the current state of LLMs.
j2kun•8h ago
Perhaps you meant, "a decent amount of his recent work." He has been doing math long before LLMs, and is still regularly publishing papers with collaborators that have nothing to do with AI. The most recent was last week. https://arxiv.org/search/math?searchtype=author&query=Tao,+T
cocoto•8h ago
That’s one of the main reasons why I did not pursue an academic math career. The pure joy of solving exam problems with elegant proofs is very hard to get on harder problems.
amluto•7h ago
https://www.science.org/cms/asset/7f2147df-b2f1-4748-9e98-1a...
wizzwizz4•6h ago
That's not a naïve belief. Intelligible proofs represent insight that can be applied to other problems. If our only proof is an opaque one, that means we don't really understand the area yet. Take, for example, the classification of finite simple groups (a ten-thousand-page proof): that is very much not a closed area of research, and we're still discovering new things in the vicinity of the problem.
gf000•7h ago
Just because a specialized human placed in an F-16 can fly at Mach 2.0, doesn't mean humans in general can fly.
threethirtytwo•4h ago
What happens when we put an artificial general intelligence in an F-16? That's what happened here with this proof.
ben_w•8h ago
While I quickly noticed that my pre-ChatGPT-3.5 use of the term was satisfied by ChatGPT-3.5, this turned out to be completely useless for 99% of discussions, as everyone turned out to have different boolean cut-offs for not only the generality, but also the artificiality and the intelligence, and also for what counts as "intelligence" in the first place.
That everyone can pick a different boolean cut-off for each initial, means they're not really booleans.
Therefore, consider that this can't drive a car, so it's not fully general. And even those AI which can drive a car, can't do so in genuinely all conditions expected of a human, just most of them. Stuff like that.
throw310822•6h ago
So blind people are not general intelligences?
AxEy•6h ago
So no, we do not deem a blind person unintelligent due to their being unable to drive without sight. But we might judge a sighted person as not generally intelligent if they could not drive with sight.
pfdietz•7h ago
"I doubt that anything resembling genuine "artificial general intelligence" is within reach of current #AI tools. However, I think a weaker, but still quite valuable, type of "artificial general cleverness" is becoming a reality in various ways.
By "general cleverness", I mean the ability to solve broad classes of complex problems via somewhat ad hoc means. These means may be stochastic or the result of brute force computation; they may be ungrounded or fallible; and they may be either uninterpretable, or traceable back to similar tricks found in an AI's training data. So they would not qualify as the result of any true "intelligence". And yet, they can have a non-trivial success rate at achieving an increasingly wide spectrum of tasks, particularly when coupled with stringent verification procedures to filter out incorrect or unpromising approaches, at scales beyond what individual humans could achieve.
This results in the somewhat unintuitive combination of a technology that can be very useful and impressive, while simultaneously being fundamentally unsatisfying and disappointing - somewhat akin to how one's awe at an amazingly clever magic trick can dissipate (or transform to technical respect) once one learns how the trick was performed.
But perhaps this can be resolved by the realization that while cleverness and intelligence are somewhat correlated traits for humans, they are much more decoupled for AI tools (which are often optimized for cleverness), and viewing the current generation of such tools primarily as a stochastic generator of sometimes clever - and often useful - thoughts and outputs may be a more productive perspective when trying to use them to solve difficult problems."
This comment was made on Dec. 15, so I'm not entirely confident he still holds it?
https://mathstodon.xyz/@tao/115722360006034040
xorcist•8h ago
To a layman, that doesn't sound very AI-like? Surely there must be a dozen algorithms to effectively search this space already, given that mathematics is pretty logical?
tombert•8h ago
Isabelle has had the "Sledgehammer" tool for quite a while [1]. It uses solvers like Z3 to search and apply a catalog of proof strategies and then try to construct a proof for your main goal or any remaining subgoals that you have to complete. It's not perfect, but it's remarkably useful (even if it does sometimes give you proofs that import like ten different libraries and are hard to read).
I think Coq has Coqhammer but I haven't played with that one yet.
[1] https://isabelle.in.tum.de/dist/doc/sledgehammer.pdf
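For comparison, Lean's rough analogues of that workflow (illustrative only, and tactic availability varies by version): `exact?` searches the library for a lemma that closes the goal, and decision procedures like `omega` knock out whole classes of goals.

    example (a b : Nat) : a + b = b + a := by omega
    example (a b : Nat) : a + b = b + a := by exact Nat.add_comm a b  -- the kind of proof `exact?` suggests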
matu3ba•6h ago
Sorry for my probably senseless questions; I'm trying to map the computing model of math solvers to common PL semantics. Probably there is better overview literature. I'd like to get an overview of proof-system runtime semantics for later usage.
2. Is there an equivalent of fuzz testing (of computing systems) in math, say to construct the general proof framework?
3. Or: how are proof frameworks (based on ideas of how the proof could work) constructed?
4. Do I understand it correctly that math in proof systems works with term rewrite systems plus the theory/logic used, as the computing model of valid representations and operations? How is the step semantics then formally defined?
wizzwizz4•6h ago
1. Yes, the Sledgehammer suite contains three testing systems that I believe are concolic (quickcheck, nitpick, nunchaku), but they only work due to hand-coded support for the mathematical constructs in question. They'd be really inefficient for a formalised software environment, because they'd be operating at a much lower-level than the abstractions of the software environment, unless dedicated support for that software environment were provided.
2. Quickcheck is a fuzz-testing framework, but it doesn't help to construct anything (except as far as it constructs examples / counterexamples). Are you thinking of something that automatically finds and defines intermediary lemmas for arbitrary areas of mathematics? Because I'm not aware of any particular work in that direction: if computers could do that, there'd be little need for mathematicians.
3. By thinking really hard, then writing down the mathematics. Same way you write computer programs, really, except there's a lot more architectural work. (Most of the time, the computer can brute-force a proof for you, so you need only choose appropriate intermediary lemmas.)
4. I can't parse the question, but I suspect you're thinking of the meta-logic / object-logic distinction. The actual steps in the term rewriting are not represented in the object logic: the meta-logic simply asserts that it's valid to perform these steps. (Not even that: it just does them, accountable to none other than itself.) Isabelle's meta-logic is software, written in a programming language called ML.
zozbot234•6h ago
2. there is no exact analogue of fuzz-testing bytecode for "theorem fuzzing", but arguably the closest match is counterexample generators - model finders, finite-model search, SMT instantiation with concrete valuations, all serve a role similar to fuzzers by finding invalid conjectures quickly.
these are weaker than fuzzing for finding execution bugs because mathematical statements are higher-level and provers operate symbolically.
3. here's how proof frameworks are constructed, at the high-level:
a. start by picking a logical foundation: e.g., first-order logic (FOL), higher-order logic (HOL), dependent type theory (Martin-Löf/CIC).
b. define syntax (terms, types, formulas) and typing/judgment rules (inference rules or typing rules).
c. define proof objects or proof rules: natural deduction, sequent calculus, Hilbert-style axioms, or type-as-proofs (Curry-Howard).
d. pick or design the kernel: small, trusted inference engine that checks proof objects (reduction rules, conversion, rule admissibility).
e. add automation layers: tactics, decision procedures, external ATP/SMT, rewriting engines.
f. provide libraries of axioms/definitions and extraction/interpretation mechanisms (code extraction, models).
g. implement proof search strategies and heuristics (backtracking, heuristics, lemma databases, proof-producing solvers).
this is the standard engineering pipeline behind modern ITPs and automated systems.
4. yes, many proof assistants and automated provers treat computation inside proofs as:
a. term rewriting / reduction (beta-reduction, delta-unfolding, normalization) for computation oriented parts; this is the "computation" layer.
b. a separate deductive/logic layer for reasoning (inference rules, quantifier instantiation, congruence closure).
the combined model is: terms represent data/computation; rewriting gives deterministic computation semantics; logical rules govern valid inference about those terms. Dependent type theories conflate computation and proof via conversion (definitional equality) in the kernel.
5. here's how a proof step's semantics is defined:
proof steps are applications of inference rules transforming sequents/judgments. formally:
a. a judgment form J (e.g., Γ ⊢ t : T or Γ ⊢ φ) is defined.
b. inference rules are of the form: from premises J1,...,Jn infer J, written as (J1 ... Jn) / J.
c. a proof is a finite tree whose nodes are judgments; leaves are axioms or assumptions; each internal node follows an inference rule.
d. for systems with computation, a reduction relation → on terms defines definitional equality; many rules use conversion: if t →* t' and t' has form required by rule, the rule applies.
e. in type-theoretic kernels, the step semantics is checking that a proof object reduces/normalizes and that constructors/eliminators are used respecting typing/conversion; the kernel checks a small set of primitive steps (e.g., beta, iota reductions, rule applications).
operationally: a single proof step = instantiate a rule schema with substitutions for metavariables, perform any required reductions/conversions, and check side conditions (freshness, well-formedness).
equational reasoning and rewriting: a rewrite rule l → r can be applied to a term t at a position p if the subterm at p matches l under some substitution σ; result is t with that subterm replaced by σ(r). Confluence/termination properties determine global behavior.
higher-level tactics encode sequences of such primitive steps (rule applications + rewrites + searches) but are not part of kernel semantics; only the kernel rules determine validity.
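a minimal concrete instance of "a proof step is a rule-schema instantiation", in Lean terms (toy example): `And.intro` plays the role of an inference rule with two premises and one conclusion, and the kernel checks the resulting term against the stated judgment.

    example (p q : Prop) (hp : p) (hq : q) : p ∧ q := And.intro hp hq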
relevant concise implications for mapping to PL semantics:
treat proof state as a machine state (context Γ, goal G, local proof term), tactics as programs that transform state; kernel inference rules are the atomic instruction set; rewriting/normalization are deterministic evaluation semantics used inside checks.
automation ≈ search processes over nondeterministic tactic/program choices; model finders/SMTs act as external oracles producing either counterexamples (concrete) or proof certificates (symbolic).
AndrewKemendo•8h ago
The way you use the term AI tells me that you are representative of laymen, so what precisely are you trying to define?
It might be helpful to understand the term artificial intelligence first:
https://kemendo.com/Understand-AI.html
Legend2440•7h ago
Logic solvers are useful, but not tractable as a general way to approach mathematics.
zozbot234•7h ago
To be clear, there are explicitly computationally tractable fragments of existing logics, but they're more-or-less uninteresting by definition: they often look like very simple taxonomies (i.e. purely implicational) or like a variety of "modal" and/or "multi-modal" constructions over simpler logics.
Of course it would be nice to explicitly tease out and write down the "computationally tractable" general logical reasoning that some existing style of proof is implicitly relying on (AIUI this kind of inquiry would generally be comprised under "synthetic mathematics", trying to find simple treatments in axiom- and rule-of-inference style for existing complex theories) but that's also difficult.