> In some tasks, AI is unreliable. In others, it is superhuman. You could, of course, say the same thing about calculators, but it is also clear that AI is different. It is already demonstrating general capabilities and performing a wide range of intellectual tasks, including those that it is not specifically trained on. Does that mean that o3 and Gemini 2.5 are AGI? Given the definitional problems, I really don’t know, but I do think they can be credibly seen as a form of “Jagged AGI” - superhuman in enough areas to result in real changes to how we work and live, but also unreliable enough that human expertise is often needed to figure out where AI works and where it doesn’t.
Huh? Isn't an LLM's capability fully constrained by its training data? Everything else is hallucinated.
In that sense, they absolutely know things that aren't in their training data. You're correct about factual knowledge, though; that's why they're not trained to optimize for it. A database (or PageRank?) already solves that problem.
Certainly jagged does not imply general.
It seems to me the bar for "AGI" has been lowered to measuring what tasks it can do rather than the traits we normally associate with general intelligence. People want it to be here so bad they nerf the requirements...
Re: "traits we associate with general intelligence", I think the exact issue is that there is no scientific (i.e., specific and consistent) list of such traits. This is why Turing wrote his famous 1950 paper and invoked the Imitation Game: not to detail how one could test for a computer that's really thinking (or truly general), but to show why that question isn't necessary in the first place.
Certainly creativity is missing, it has no internal motivation, and it will answer the same simple question both right and wrong, depending on unknown factors.
I'm with Yann LeCun when he says that we won't reach AGI until we move beyond transformers.
I'm guilty as charged of having looked at GPT-3.5 and thought "it's meh", but more than anything this shows that debating words rather than the underlying capabilities is an empty discussion.
It is not a simple matter of patching the rough edges. We are fundamentally not using an architecture that is capable of intelligence.
Personally, the first time I tried Deep Research on a real topic, it was disastrously incorrect on a key point.
What does that even mean? Do you actually have any particular numeric test of intelligence that's somehow better than all the others?
If you ask an intelligent being the same question, they may occasionally change the precise words they use, but their answer will be the same over and over.
"An AGI is a human-created system that demonstrates iteratively improving its own conceptual design without further human assistance".
Note that a "conceptual design" here does not include tweaking weights within an already-externally-established formula.
My reasoning is thus:
1. A system that is only capable of acting with human assistance cannot have its own intelligence disentangled from the humans'
2. A system that is only intelligent enough to solve problems that somehow exclude problems with itself is not "generally" intelligent
3. A system that can only generate a single round of improvements to its own designs has not demonstrated improvements to those designs, as if iteration N+1 were truly superior to iteration N, it would be able to produce iteration N+2
4. A system that is not capable of changing its own design is incapable of iterative improvement, as there is a maximum efficacy within any single framework
5. A system that could improve itself in theory and fails to do so in practice has not demonstrated intelligence
It's pretty clear that no current-day system has hit this milestone; if some program had, there would no longer be a need for continued investment in algorithm design (or computer science, or most of humanity...).
A program that randomly mutates its own code could self-improve in theory but fails to do so in practice.
I don't think these goalposts have moved in the past or need to move in the future. This is what it takes to cause the singularity. The movement recently has been people trying to sell something less than this as an AGI.
It’s bad luck for those of us who want to talk about how good or bad they are in general. Summary statistics aren’t going to tell us much more than a reasonable guess as to whether a new model is worth trying on a task we actually care about.
However (as the article admits), there is still no general agreement on what AGI is, or how we get there from here (or even whether we can).
What there is is a growing and often naïve excitement that anticipates it as coming into view, and unfortunately that will be accompanied by the hype-merchants desperate to be first to "call it".
This article seems reasonable in some ways but unfortunately falls into the latter category with its title and sloganeering.
"AGI" in the title of any article should be seen as a cautionary flag. On HN - if anywhere - we need to be on the alert for this.
Until those capabilities are expanded to include model self-improvement -- adapting its own infrastructure, code, storage, etc. -- I think AGI/ASI are yet to be realized. My reference points are Skynet, Traveler's "The Director", and Person of Interest's "The Machine" and "Samaritan." The ability to pursue a potentially inscrutable goal, along with the self-agency to direct itself towards it, is true "AGI" in my book. We have a lot of components that we can reason are necessary, but it is unclear to me that we get there in the next few months.
sejje•1h ago
I would do the same thing, I think. It's too well-known.
The variation doesn't read like a riddle at all, so it's confusing even to me as a human. I can't find the riddle part. Maybe the AI is confused, too. I think it makes an okay assumption.
I guess it would be nice if the AI asked a follow up question like "are you sure you wrote down the riddle correctly?", and I think it could if instructed to, but right now they don't generally do that on their own.
Jensson•51m ago
LLMs don't assume; they're text completers. An LLM sees something that looks almost like a well-known problem and completes it as that well-known problem. It's a problem specific to being a text completer, and it's hard to get around.
simonw•43m ago
jordemort•30m ago
gavinray•30m ago
Neural Networks as I understand them are universal function approximators.
In terms of text, that means they're trained to output what they believe to be the "most probably correct" sequence of text.
An LLM has no idea that it is "conversing", or "answering" -- it relates some series of symbolic inputs to another series of probabilistic symbolic outputs, aye?
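To make that "symbolic inputs to probabilistic outputs" step concrete, here's a toy sketch; the vocabulary and scores are invented purely for illustration:

    import math, random

    # Pretend "model output": raw scores (logits) over a tiny invented vocabulary
    # for the next token. Real models do this over a vocabulary of ~100k tokens.
    vocab = ["mainly", "rarely", "upward", "sideways"]
    logits = [4.2, 1.1, 0.3, -0.5]

    # Softmax turns the scores into a probability distribution.
    exps = [math.exp(x) for x in logits]
    probs = [e / sum(exps) for e in exps]

    # The "answer" is just a draw from (or the argmax of) that distribution.
    next_token = random.choices(vocab, weights=probs, k=1)[0]
    print({t: round(p, 3) for t, p in zip(vocab, probs)}, "->", next_token)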
Borealid•27m ago
If you give an LLM "The rain in Spain falls" the single most likely next token is "mainly", and you'll see that one proportionately more than any other.
If you give an LLM "Find an unorthodox completion for the sentence 'The rain in Spain falls'", the most likely next token is something other than "mainly" because the tokens in "unorthodox" are more likely to appear before text that otherwise bucks statistical trends.
If you give the LLM "blarghl unorthodox babble The rain in Spain" it's likely the results are similar to the second one but less likely to be coherent (because text obeying grammatical rules is more likely to follow other text also obeying those same rules).
In any of the three cases, the LLM is predicting text, not "parsing" or "understanding" a prompt. The fact it will respond similarly to a well-formed and unreasonably-formed prompt is evidence of this.
It's theoretically possible to engineer a string of complete gibberish tokens that will prompt the LLM to recite song lyrics, or answer questions about mathematical formulae. Those strings of gibberish are just difficult to discover.
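You can poke at this directly, by the way. A minimal sketch, assuming the Hugging Face transformers library and GPT-2 as a stand-in (any raw causal LM without a chat template would do):

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # GPT-2 is just a small, convenient base model for the demonstration.
    tok = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    def top_next_tokens(prompt, k=5):
        inputs = tok(prompt, return_tensors="pt")
        with torch.no_grad():
            logits = model(**inputs).logits[0, -1]   # scores for the next token only
        probs = torch.softmax(logits, dim=-1)
        top = torch.topk(probs, k)
        return [(tok.decode(int(i)), round(p.item(), 4))
                for p, i in zip(top.values, top.indices)]

    print(top_next_tokens("The rain in Spain falls"))
    print(top_next_tokens("Find an unorthodox completion for 'The rain in Spain falls'"))

If the claim above holds, the first distribution should be dominated by " mainly" while the second spreads out; I'd expect the exact numbers to vary by model.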
Workaccount2•20m ago
Borealid•7m ago
If those two sets of accomplishments are the same, there's no point arguing about differences in means or terms. Right now humans can build better LLMs, but nobody has come up with an LLM that can build better LLMs.
dannyobrien•20m ago
"Did you mean to ask about the well-known phrase "The rain in Spain falls mainly on the plain"? This is a famous elocution exercise from the musical "My Fair Lady," where it's used to teach proper pronunciation.
Or was there something specific you wanted to discuss about Spain's rainfall patterns or perhaps something else entirely? I'd be happy to help with whatever you intended to ask."
I think you have a point here, but maybe re-express it? Because right now your argument seems trivially falsifiable even under your own terms.
Borealid•6m ago
If you want to test this effect, you have to use a raw model with no system prompt. You can do that with a Llama or similar. Otherwise your context window is full of words like "helpful" and "answer" and "question" that guide the response and make it harder (not impossible) to see the effect I'm talking about.
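Concretely, you can see what the chat path adds by comparing the raw string against what the chat template wraps around it. A sketch assuming the transformers API; the instruct model here is just an example, and its default system prompt may differ:

    from transformers import AutoTokenizer

    # Any chat/instruct-tuned model with a chat template works here.
    tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")

    # Raw completion path: the model sees only your text.
    raw = "The rain in Spain falls"

    # Chat path: the template adds role markers and (often) a default system
    # prompt full of words like "assistant" and "helpful" that steer the output.
    chat = tok.apply_chat_template(
        [{"role": "user", "content": raw}],
        tokenize=False,
        add_generation_prompt=True,
    )
    print(repr(raw))
    print(repr(chat))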
simonw•16m ago
There's more than just next-token prediction going on. Those reasoning chains of thought have undergone their own reinforcement learning training against a different category of samples.
They've seen countless examples of how a reasoning chain would look for calculating a mortgage, or searching a flight, or debugging a Python program.
So I don't think it is accurate to describe the eventual result as "just next token prediction". It is next-token prediction that has been informed by a chain of thought, which was itself trained on a different set of specially chosen examples.
Borealid•10m ago
If not, why not? Explain.
If so, how does your argument address the fact that this implies any given "reasoning" model can be trained without giving it a single example of something you would consider "reasoning"? (in fact, a "reasoning" model may be produced by random chance?)
wongarsu•9m ago
Don't humans do the same in conversation? How should an intelligent being (constrained to the same I/O system) respond here to show that it is in fact intelligent?
monkpit•22m ago
In every LLM thread someone chimes in with “it’s just a statistical token predictor”.
I feel this misses the point and I think it dismisses attention heads and transformers, and that’s what sits weird with me every time I see this kind of take.
There _is_ an assumption being made within the model at runtime. Assumption, confusion, uncertainty - one camp might argue that none of these exist in the LLM.
But doesn’t the implementation constantly make assumptions? And what even IS your definition of “assumption” that’s not being met here?
Edit: I guess my point, overall, is: what’s even the purpose of making this distinction anymore? It derails the discussion in a way that’s not insightful or productive.
wongarsu•21m ago
Discussing whether models can "reason" or "think" is a popular debate topic on here, but I think we can all agree that they do something that at least resembles "reasoning" and "assumptions" from our human point of view. And if in its chain of thought it decides your prompt is wrong, it will go ahead and answer what it assumes is the right prompt.
sejje•21m ago
Yes, and it can express its assumptions in text.
Ask it to make some assumptions, like about a stack for a programming task, and it will.
Whether or not the mechanism behind it feels like real thinking to you, it can definitely do this.
og_kalu•15m ago
The problem you've just described is a problem with humans as well. LLMs are making assumptions all the time. Maybe you would like to call it something else, but it is happening.