suitcases full of money?
https://arxiv.org/abs/1911.01547
GPT-3 didn't come out until 2020.
That said, I'd still listen to these two guys (+ Schmidhuber) more than any other AI guy.
I think the debate has been caught flat-footed by the speed at which all this has happened. We're not talking about AGI any more; we're talking about how to build superintelligences hitherto unseen in nature.
I enjoy seeing people repeatedly move the goalposts for "intelligence" as AIs simply get smarter and smarter every week. Soon AI will have to beat Einstein in Physics, Usain Bolt in running, and Steve Jobs in marketing to be considered AGI...
Where did I say there was nothing meaningful about current capabilities? I'm saying that what is novel about a claim of "AGI" (as opposed to a claim of "computer does something better than humans", which has been an obviously true statement since the ENIAC) is the ability to do, at some level, everything a normal human intelligence can do.
The second highlight from this video is the section from 29 minutes onward, where he talks about designing systems that can build up rich libraries of abstractions which can be applied to new problems. I wish he had lingered more on exploring and explaining this approach, but maybe they're trying to keep a bit of secret sauce because it's what his company is actively working on.
One of the major points emerging from recent AI discourse is that the ability to integrate continuous learning seems like it'll be a key element in building AGI. Context is fine for short tasks, but if lessons are never preserved, you're severely capped in how far the system can go.
If we assume that humans have "general intelligence", we would assume all humans could ace Arc... but they can't. Try asking your average person, e.g. supermarket workers or gas station attendants, to do the Arc puzzles; they will do poorly, especially on the newer ones. Yet AI has to do perfectly to prove it has general intelligence? (Not trying to throw shade here, but the reality is this test is more like an IQ test than an AGI test.)
Arc is a great example of AI researchers moving the goal posts for what we consider intelligent.
Let's get real, Claude Opus is smarter than 99% of people right now, and I would trust its decision making over 99% of people I know in most situations, except perhaps emotion driven ones.
The Arc AGI benchmark is just a gimmick. Also, since it's a visual test and the current models are text based, it's actually rigged against the AI models anyway, since their datasets were completely text based.
Basically, it's a test of some kind, but it doesn't mean quite as much as Chollet thinks it means.
If we think humans have "GI" then I think we have AIs right now with "GI" too. Just like humans do, AIs spike in various directions. They are amazing at some things and weak at visual/IQ test type problems like ARC.
I think the charitable interpretation is that, if intelligence is made up of many skills, then AIs are superhuman at some of them, like image recognition.
And that therefore future efforts need to go to the areas where AIs are significantly less skilled. Also, since they are good at memorizing things, knowledge questions are the wrong direction; anything most humans could solve but AIs cannot, especially something as generic as pattern matching, should be an important target.
But in practice, it's like stopping an arms race.
Personally I don't think it's possible at this stage. The cat's out of the bag (this new class of tools is working) and the economic incentive is way too strong.
There are dozens of ready-made, well-designed, and very creative games there. All are tile-based and solved with only arrow keys and a single action button. Maybe someone should make a PuzzleScript AGI benchmark?
https://nebu-soku.itch.io/golfshall-we-golf
Maybe someone can make an MCP connection for the AIs to practice. But I think the idea of the benchmark is to reserve some puzzles for private evaluation, so that they're not in the training data.
My impression is that models are pretty bad at interpreting grids of characters. Yesterday, I was trying to get Claude to convert a message into a cipher by turning a 98-character string into a 7x14 grid where sequential letters moved 2-right and 1-down (i.e., like a knight in chess). Claude seriously struggled.
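For concreteness, this is roughly the manipulation being asked for (a sketch under one literal reading of my own prompt; I never specified how collisions should be handled, so this version just fails loudly):

    # Sketch of the grid cipher described above: place characters into a 7x14
    # grid, each next character 2 columns right and 1 row down, wrapping around.
    # Collision handling is unspecified, so this raises if a cell is taken.
    def grid_cipher(msg, rows=7, cols=14):
        assert len(msg) == rows * cols
        grid = [[None] * cols for _ in range(rows)]
        r = c = 0
        for ch in msg:
            if grid[r][c] is not None:
                raise ValueError(f"step pattern revisits cell ({r},{c})")
            grid[r][c] = ch
            r, c = (r + 1) % rows, (c + 2) % cols
        return grid

    try:
        grid_cipher("x" * 98)
    except ValueError as e:
        print(e)  # with this step on a 7x14 grid, the path revisits (0,0)
                  # after only 7 placements

Taken literally, the step pattern doesn't even tile the grid, so the under-specified prompt may be part of what tripped Claude up.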
Yet Francois always pumps up the "fluid intelligence" component of this test and emphasizes how easy the tasks are for humans. Humans would presumably be terrible at them too if they had to look at them character by character.
This feels like a somewhat similar case (an intuition lie?) to the Apple paper showing that reasoning models can't do Tower of Hanoi past 10+ disks. Readers will intuitively think about how they themselves could tediously work through an arbitrarily long Tower of Hanoi, which is what the paper is alluding to. However, the more appropriate analogy would be writing out all >1000 moves on a piece of paper at once and being 100% correct, which is obviously much harder.
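For a sense of scale (my own illustration, not from the paper): with 10 disks that's 2^10 - 1 = 1023 moves, and every single one has to be emitted correctly in sequence.

    # Generate the full Tower of Hanoi move list, to make concrete what
    # "writing out all >1000 moves at once, 100% correct" actually demands.
    def hanoi(n, src="A", dst="C", aux="B"):
        if n == 0:
            return []
        return hanoi(n - 1, src, aux, dst) + [(n, src, dst)] + hanoi(n - 1, aux, dst, src)

    moves = hanoi(10)
    print(len(moves))   # 1023 moves for 10 disks; one slip anywhere fails the task
    print(moves[:3])    # first few (disk, from_peg, to_peg) triples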
https://news.ycombinator.com/item?id=44492241
My comment was basically instantly flagged. I see at least 3 other flagged comments that I can't imagine deserve to be flagged.
LLMs are fundamentally text-based. The majority of their training is text-based. The majority of their usage is text-based. And a very large majority of their output is text-based. So it seems somewhat bizarre to perform general evaluation of these models using what are effectively image-centric tests.
Evaluating LLM visual skills and reasoning is a very important and reasonable thing to do. And I believe that there are an infinite number of ways to evaluate LLMs and general intelligence and that visual tests are a viable approach. But I personally feel that the mismatch between the core design of LLMs and the evaluation framework of ARC-AGI is simply too large to ignore.
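To make the mismatch concrete: the public ARC tasks are JSON grids of integers 0-9, and a text model typically sees something like the flattening below (my own sketch of one common serialization, not the official harness; the grids here are made up).

    # A tiny, made-up ARC-style task and one way it gets flattened into a
    # text prompt. A 2-D visual pattern becomes a stream of digit tokens.
    task = {
        "train": [
            {"input": [[0, 1], [1, 0]], "output": [[1, 0], [0, 1]]},
        ],
        "test": [{"input": [[1, 1], [0, 0]]}],
    }

    def grid_to_text(grid):
        return "\n".join(" ".join(str(cell) for cell in row) for row in grid)

    prompt = ""
    for pair in task["train"]:
        prompt += "Input:\n" + grid_to_text(pair["input"]) + "\n"
        prompt += "Output:\n" + grid_to_text(pair["output"]) + "\n\n"
    prompt += "Input:\n" + grid_to_text(task["test"][0]["input"]) + "\nOutput:\n"
    print(prompt)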
I have a (draft) blog post on this subject that I copied some of my comment here from https://www.xent.tech/blog/problems-in-llm-benchmarking-and-...
Fun piece of trivia: François Chollet's "On the Measure of Intelligence" was released on November 5, 2019, the exact same day that the full GPT-2 model was released.
qoez•4h ago
avmich•4h ago
The diagnosis is pattern matching (again, roughly). It kinda suggests that a lot of "intelligent" problems are focused on pattern matching, and (relatively straightforward) application of "previous experience". So, pattern matching can bring us a great deal towards AGI.
AnimalMuppet•4h ago
yorwba•4h ago
whiplash451•4h ago
"We argue that human cognition follows strictly the same pattern as human physical capabilities: both emerged as evolutionary solutions to specific problems in specific evironments" (from page 22 of On the Measure of Intelligence)
https://arxiv.org/pdf/1911.01547
Davidzheng•4h ago
energy123•4h ago
But on a serious note, I don't think Chollet would disagree. ARC is a necessary but not sufficient condition, and he says that, despite the unfortunate attention-grabbing name choice of the benchmark. I like Chollet's view that we will know that AGI is here when we can't come up with new benchmarks that separate humans from AI.
loki_ikol•4h ago
kubb•4h ago
FrustratedMonky•3h ago
mindcrime•10m ago
oldge•4h ago
KoolKat23•4h ago
Looking at the human side, it takes a while to actually learn something. If you've recently read something, it remains in your "context window". You need to dream about it, think about it, revisit and repeat until you actually learn it and "update your internal model". We need a mechanism for continuous weight updating.
Goal-generation is pretty much covered by your body constantly drip-feeding your brain various hormones as "ongoing input prompts".
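As a toy sketch of what "continuous weight updating" could mean in practice (my own illustration, assuming a trivial supervised setup and ignoring catastrophic forgetting):

    # Online ("continuous") weight updating: every new example triggers a small
    # gradient step, instead of freezing weights after pretraining. Real
    # continual learning also has to fight catastrophic forgetting (replay,
    # regularization, etc.); this only shows the basic loop.
    import torch
    import torch.nn as nn

    model = nn.Linear(4, 2)                      # stand-in for "the model"
    opt = torch.optim.SGD(model.parameters(), lr=1e-2)
    loss_fn = nn.CrossEntropyLoss()

    def experience_stream(steps=1000):
        # Stand-in for a stream of new experiences that would otherwise be lost
        # once they scroll out of the context window.
        for _ in range(steps):
            x = torch.randn(1, 4)
            y = (x.sum(dim=1) > 0).long()        # synthetic label
            yield x, y

    for x, y in experience_stream():
        loss = loss_fn(model(x), y)
        opt.zero_grad()
        loss.backward()
        opt.step()                               # the "lesson" is now in the weights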
onemoresoop•3h ago
How are we not far off? How can LLMs generate goals and based on what?
NetRunnerSu•3h ago
tsurba•2h ago
FeepingCreature•2h ago
Alternately, you can train it on following a goal and then you have a system where you can specify a goal.
At sufficient scale, a model will already contain goal-following algorithms because those help predict the next token when the model is basetrained on goal-following entities, ie. humans. Goal-driven RL then brings those algorithms to prominence.
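A toy version of "train it on following a goal, then specify the goal" (my own sketch, not how any frontier lab actually does it):

    # Goal-conditioned RL in miniature: the policy is rewarded for matching
    # whatever goal it is conditioned on, so after training you can steer it
    # just by handing it a different goal.
    import torch
    import torch.nn as nn

    n_goals = 4
    policy = nn.Sequential(nn.Linear(n_goals, 32), nn.ReLU(), nn.Linear(32, n_goals))
    opt = torch.optim.Adam(policy.parameters(), lr=1e-2)

    for step in range(2000):
        goal = torch.randint(n_goals, (1,))
        goal_vec = nn.functional.one_hot(goal, n_goals).float()
        dist = torch.distributions.Categorical(logits=policy(goal_vec))
        action = dist.sample()
        reward = (action == goal).float()                 # reward for following the goal
        loss = -(dist.log_prob(action) * reward).mean()   # REINFORCE
        opt.zero_grad()
        loss.backward()
        opt.step()

    # "Specifying a goal" is now just conditioning on it:
    g = nn.functional.one_hot(torch.tensor([2]), n_goals).float()
    print(policy(g).argmax().item())                      # should print 2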
kelseyfrog•37m ago
NetRunnerSu•3h ago
https://github.com/dmf-archive/PILF
NetRunnerSu•3h ago
https://dmf-archive.github.io/docs/posts/beyond-snn-plausibl...
TheAceOfHearts•4h ago
lostphilosopher•1h ago
ummonk•13m ago
ben_w•4h ago
But conversely, not passing this test is a proof of not being as general as a human's intelligence.
NetRunnerSu•4h ago
https://news.ycombinator.com/item?id=44488126
kypro•3h ago
While understanding why a person or AI is doing what it's doing can be important (perhaps specifically in safety contexts), at the end of the day all that's really going to matter to most people is the outcomes.
So if an AI can use what appears to be intelligence to solve general problems and can act in ways that are broadly good for society, whether or not it meets some philosophical definition of "intelligent" or "good" doesn't matter much – at least in most contexts.
That said, my own opinion on this is that the truth is likely in between. LLMs today seem extremely good at being glorified auto-completes, and I suspect most (95%+) of what they do is just recalling patterns in their weights. But unlike traditional auto-completes they do seem to have some ability to reason and solve truly novel problems. As it stands I'd argue that ability is fairly poor, but this might only represent 1-2% of what we use intelligence for.
If I were to guess why this is, I suspect it's not that LLM architecture today is completely wrong, but that the way LLMs are trained means that in general knowledge recall is rewarded more than reasoning. This is similar to the trade-off we humans have with education – do you prioritise the acquisition of knowledge or critical thinking? Many believe critical thinking is more important and should be prioritised more, but I suspect that for the vast majority of tasks we're interested in solving, knowledge storage and recall is actually more important.
ben_w•19m ago
But when the question is "are they going to be more important to the economy than humans?", then they have to be good at basically everything a human can do; otherwise we just see a variant of Amdahl's law in action, with the AI providing an arbitrary speed-up of n% of the economy while humans are needed for the remaining (100-n)%.
I may be wrong, but it seems to me that the ARC prize is more about the latter.
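Back-of-the-envelope, with made-up numbers, the Amdahl's-law point looks like this:

    # Amdahl's law applied to "AI automates a fraction p of the economy".
    # Illustrative numbers only.
    def overall_speedup(p, s):
        # p: fraction of work the AI handles, s: how much faster it does it
        return 1.0 / ((1.0 - p) + p / s)

    for p in (0.5, 0.9, 0.99):
        print(f"automate {p:.0%} infinitely well -> at most {1 / (1 - p):.0f}x overall")
        print(f"automate {p:.0%} at 10x          -> {overall_speedup(p, 10):.2f}x overall")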
OtomotO•4h ago
My definition of AGI is the one I was brought up with, not an ever moving goal post (to the "easier" side).
And no, I also don't buy that we are just stochastic parrots.
But whatever. I've seen many hypes and if I don't die and the world doesn't go to shit, I'll see a few more in the next couple of decades
NetRunnerSu•4h ago
https://news.ycombinator.com/item?id=44488126
nxobject•4h ago
However, it does rub me the wrong way - as someone who's cynical of how branding can enable breathless AI hype by bad journalism. A hypothetical comparison would be labelling SHRDLU's (1968) performance on Block World planning tasks as "ARC-AGI-(-1)".[0]
A less loaded name like (bad strawman option) "ARC-VeryToughSymbolicReasoning" should capture how the ARC-AGI-n suite is genuinely and intrinsically very hard for current AIs, and what progress satisfactory performance on the benchmark suite would represent. Which Chollet has done, and it has kept him grounded throughout! [1]
[0] https://en.wikipedia.org/wiki/SHRDLU [1] https://arxiv.org/abs/1911.01547
heymijo•3h ago
In practice when I have seen ARC brought up, it has more nuance than any of the other benchmarks.
Unlike Humanity's Last Exam, which is the most egregious example I have seen, both in its naming and in how it gets referenced as a measure of an LLM's capabilities.
maaaaattttt•4h ago
Lerc•3h ago
"That's not really AGI because xyz"
What then? The difficulty in coming up with a test for AGI is coming up with something where people will accept that a passing grade means AGI.
In many respects I feel like all of the claims that models don't really understand or have internal representation or whatever tend to lean on nebulous or circular definitions of the properties in question. Trying to pin the arguments down usually end up with dualism and/or religion.
Doing what Chollet has done is infinitely better: if a person can easily do something and a model cannot, then there is clearly something significant missing.
It doesn't matter what the property is or what it is called. Such tests might even help us see what those properties are.
Anyone who wants to claim the fundamental inability of these models should be able to provide a task where it is clear whether it has been solved, and show that humans can do it (if that's the bar we are claiming can't be met). If they are right, then no future model should be able to solve that class of problems.
maaaaattttt•3h ago
wat10000•1h ago
jcranmer•2h ago
The difficulty with intelligence is we don't even know what it is in the first place (in a psychology sense, we don't even have a reliable model of anything that corresponds to what humans point at and call intelligence; IQ and g are really poor substitutes).
Add into that Goodhart's Law (essentially, propose a test as a metric for something, and people will optimize for the test rather than what the test is trying to measure), and it's really no surprise that there's no test for AGI.
bonoboTP•1h ago
This is a very good point and somewhat novel to me in its explicitness.
There's no reason to think that we already have the concepts and terminology to point out the gaps between the current state and human-level intelligence and beyond. It's incredibly naive to think we have already armchair-generated those concepts by pure self-reflection and philosophizing. This is obvious in fields like physics: experiments were necessary to even come up with the basic concepts of electromagnetism, relativity, or quantum mechanics.
I think the reason is that pure philosophizing is still more prestigious than getting down in the weeds and dirt and doing limited-scope, well-defined experiments on concrete things. So people feel smart by wielding poorly defined concepts like "understanding" or "reasoning" or "thinking" and contrasting them with "mere pattern matching". It's a bit like the stalemate that philosophy as a field often hits, as opposed to the more pragmatic approach in the sciences, where empirical contact with reality allows more consensus and clarity without getting caught up in mere semantics.
davidclark•3h ago
cainxinth•3h ago
For all we know, human intelligence is just an emergent property of really good pattern matching.
cttet•3h ago
CamperBob2•2h ago
But then, I guess it wouldn't be "overfitting" after all, would it?
gonzobonzo•2h ago
A good base test would be to give a manager a mixed team of remote workers, half being human and half being AI, and seeing if the manager or any of the coworkers would be able to tell the difference. We wouldn't be able to say that AI that passed that test would necessarily be AGI, since we would have to test it in other situations. But we could say that AI that couldn't pass that test wouldn't qualify, since it wouldn't be able to successfully accomplish some tasks that humans are able to.
But of course, current AI is nowhere near that level yet. We're left with benchmarks, because we all know how far away we are from actual AGI.
criddell•2h ago
These are all things my kids would do when they were pretty young.
gonzobonzo•1h ago
godshatter•2h ago
I agree that current AI is nowhere near that level yet. If AI isn't even trying to extract meaning from the words it smiths or the pictures it diffuses then it's nothing more than a cute (albeit useful) parlor trick.
SubiculumCode•2h ago
Perhaps it's because the representations are fractured. The link above is to the transcript of an episode of Machine Learning Street Talk with Kenneth O. Stanley about The Fractured Entangled Representation Hypothesis [1]
crazylogger•1h ago
Give the AI tools and let it do real stuff in the world:
"FounderBench": Ask the AI to build a successful business, whatever that business may be - the AI decides. Maybe try to get funded by YC - hiring a human presenter for Demo Day is allowed. They will be graded on profit / loss, and valuation.
Testing plain LLM on whiteboard-style question is meaningless now. Going forward, it will all be multi-agent systems with computer use, long-term memory & goals, and delegation.
mindcrime•13m ago
Wait, what? Approximately nobody is claiming that "getting a high score on the ARC eval test means we have AGI". It's a useful eval for measuring progress along the way, but I don't think anybody considers it the final word.