Skip to the section headed "The Ultimate Test" for the resolution of the clickbait of "the most amazing thing...". (According to him, it correctly interpreted a line in an 18th century merchant ledger using maths and logic)
"users have reported some truly wild things" "the results were shocking" "the most amazing thing I have seen an LLM do" "exciting and frightening all at once" "the most astounding result I have ever seen" "made the hair stand up on the back of my neck"
Some time ago, I'd been working on a framework (not the only one I've talked to Claude about) that involved a series of servers that had to pass messages around in a particular fashion. Mostly technical implementation details and occasional questions about architecture.
Fast forward a ways, and on a lark I decided to ask in the abstract about the best way to structure such an interaction. Mark that this was not in the same chat or project and didn't have any identifying information about the original, save for the structure of the abstraction (in this case, a message bus server and some translation and processing services, all accessed via a client).
so:
- we were far enough removed that the whole conversation pertaining to the original was for sure not in the context window
- we only referred to the abstraction (with like an A=>B=>C=>B=>A kind of notation and a very brief question)
- most of the work on the original was in claude code
and it knew. In the answer it gave, it mentioned the project by name. I can think of only two ways this could have happened:
- they are doing some real fancy tricks to cram your entire corpus of chat history into the current context somehow
- the model has access to some kind of fact database where it was keeping an effective enough abstraction to make the connection
I find either one mindblowing for different reasons.
Of course it’s very possible my use case wasn’t terribly interesting so it wouldn’t reveal model differences, or that it was a different A/B test.
I will say that other frontier models are starting to surprise me with their reasoning/understanding. I really have a hard time making (or believing) the argument that they are just predicting the next word.
I’ve been using Claude Code heavily since April; Sonnet 4.5 frequently surprises me.
Two days ago I told the AI to read all the documentation from my 5 projects related to a tool I’m building, and create a wiki, focused on audience and task.
I'm hand reviewing the 50 wiki pages it created, but overall it did a great job.
I got frustrated about one issue: I have a GitHub issue to create a way to integrate with issue trackers (like Jira), but it's still TODO, and the AI featured issue tracker integration on the home page as if we already had it. It created a page for it and everything; I figured it was hallucinating.
I went to edit the page and replace it with placeholder text and was shocked that the LLM had (unprompted) figured out how to use existing features to integrate with issue trackers, and wrote sample code for GitHub, Jira and Slack (notifications). That truly surprised me.
Try it. Write a simple original mystery story, and then ask a good model to solve it.
This isn't your father's Chinese Room. It couldn't solve original brainteasers and puzzles if it were.
I'm not saying this is the right way to write a book but it is a way some people write at least! And one LLMs seem capable of doing. (though isn't a book outline pretty much the same as a coding plan and well within their wheelhouse?)
But here is a really big one of those if you want it: https://arxiv.org/abs/2401.17377
They still output words though (except for multi-modal LLMs), so that does involve next word generation.
Whether or not the models are "understanding" is ultimately immaterial, as their ability to do things is all that matters.
And just because you have no understanding of what "understanding" means, doesn't mean nobody does.
If it's not a functional understanding, one that allows you to replicate the functionality of understanding, is it real understanding?
If we were talking about humans trying to predict next word, that would be true.
There is no reason to suppose that an LLM is doing anything other than deep pattern prediction pursuant to, and no better than needed for, next word prediction.
The question is how well it would do if it were trained without those samples.
A more interesting question is, how would you do at a math competition if you were taught to read, then left alone in your room with a bunch of math books? You wouldn't get very far at a competition like IMO, calculator or no calculator, unless you happen to be some kind of prodigy at the level of von Neumann or Ramanujan.
But that isn't how an LLM learnt to solve math olympiad problems. This isn't a base model just trained on a bunch of math books.
The way they get LLMs to be good at specialized things like math olympiad problems is to custom train them for this using reinforcement learning - they give the LLM lots of examples of similar math problems being solved, showing all the individual solution steps, and train on these, rewarding the model when (due to having selected an appropriate sequence of solution steps) it is able itself to correctly solve the problem.
So, it's not a matter of the LLM reading a bunch of math books and then being expert at math reasoning and problem solving, but more along the lines of "monkey see, monkey do". The LLM was explicitly shown how to step by step solve these problems, then trained extensively until it got it and was able to do it itself. It's probably a reflection of the self-contained and logical nature of math that this works - that the LLM can be trained on one group of problems and the generalizations it has learnt work on unseen problems.
The dream is to be able to teach LLMs to reason more generally, but the reasons this works for math don't generally apply, so it's not clear that this math success can be used to predict future LLM advances in general reasoning.
Why is that? Any suggestions for further reading that justifies this point?
Ultimately, reinforcement learning is still just a matter of shoveling in more text. Would RL work on humans? Why or why not? How similar is it to what kids are exposed to in school?
With RL used for LLMs, it's the whole LLM response that is being judged and rewarded (not just the next word), so you might give it a math problem and ask it to solve it; when it has finished, you take the generated answer and check whether it is correct, and this reward feedback is what allows the RL algorithm to learn to do better.
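For the math case that whole-response reward is easy to compute mechanically. A minimal sketch of what such a verifiable-reward check might look like (the function names and the answer-extraction rule are my own illustration, not any lab's actual pipeline):

```python
import re

def extract_final_answer(response: str) -> str | None:
    r"""Pull the last \boxed{...}-style answer out of a model response
    (assumes the model was prompted to box its final answer)."""
    matches = re.findall(r"\\boxed\{([^}]*)\}", response)
    return matches[-1].strip() if matches else None

def reward(response: str, reference_answer: str) -> float:
    """Whole-response reward: 1.0 if the final answer matches the known
    solution, else 0.0. The scalar applies to the entire generated
    sequence; the RL algorithm (e.g. a PPO-style policy gradient)
    spreads the credit over the individual tokens."""
    answer = extract_final_answer(response)
    return 1.0 if answer == reference_answer.strip() else 0.0

# Example: the policy produced a worked solution ending in \boxed{42}.
print(reward(r"... therefore the answer is \boxed{42}", "42"))  # 1.0
```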
There are at least two problems with trying to use RL as a way to improve LLM reasoning in the general case.
1) Unlike math (and also programming) it is not easy to automatically check the solution to most general reasoning problems. With a math problem asking for a numerical answer, you can just check against the known answer, or for a programming task you can just check if the program compiles and the output is correct. In contrast, how do you check the answer to more general problems such as "Should NATO expand to include Ukraine?" ?! If you can't define a reward then you can't use RL. People have tried using "LLM as judge" to provide rewards in cases like this (give the LLM response to another LLM, and ask it if it thinks the goal was met), but apparently this does not work very well.
2) Even if you could provide rewards for more general reasoning problems, and therefore were able to use RL to train the LLM to generate good solutions for those training examples, this is not very useful unless the reasoning it has learnt generalizes to other problems it was not trained on. In narrow logical domains like math and programming this evidently works very well, but it is far from clear how learning to reason about NATO will help with reasoning about cooking or cutting your cat's nails, and the general solution to reasoning can't be "we'll just train it on every possible question anyone might ever ask"!
I don't have any particular reading suggestions, but these are widely accepted limiting factors to using RL for LLM reasoning.
I don't think RL for humans would work too well, and it's not generally the way we learn, or the way kids are mostly taught in school. We mostly learn or are taught individual skills and when they can be used, then practice and learn how to combine and apply them. The closest to using RL in school would be if the only feedback an English teacher gave you on your writing assignments was a letter grade, without any commentary, and you had to figure out what you needed to improve!
The LLM was explicitly shown how to step by step solve these problems, then trained extensively until it got it
Again, that's all we do. We train extensively until we "get it." Monkey-see, monkey-do turns out not only to be all you need, so to speak... it's all there is.
In contrast, how do you check the answer to more general problems such as "Should NATO expand to include Ukraine?"
If you ask a leading-edge model a question like that, you will find that it has become diplomatic enough to remain noncommittal. If this ( https://gemini.google.com/share/9f365513b86f ) isn't adequate, what would you expect a hypothetical genuinely-intelligent-but-not-godlike model to say?
There is only one way to check the answer to that question, and that's to sign them up and see how Russia reacts. (Frankly I'd be fine with that, but I can see why others aren't.)
Also see the subthread at https://news.ycombinator.com/item?id=45483938 . I was really impressed by that answer; it wasn't at all what I was expecting. I'm much more impressed by that answer than by the HN posters I was engaging with, let's put it that way.
Ultimately it's not fair to judge AI by asking it for objective answers in questions requiring value judgement. Especially when it's been "aligned" to within an inch of its simulated life to avoid bias. Arguably we are not being given the access we need to really understand what these things are capable of.
They aren’t aligned to avoid bias (which is an incoherent concept, avoiding bias is like not having priors), they are aligned to incorporate the preferred bias of the entity doing the alignment work.
(That preferred bias may be for a studious neutrality on controversial viewpoints in the surrounding society as perceived by the aligner, but that’s still a bias, not the absence of bias.)
Which is fine for us humans, but would only be fine for LLMs if they also had continual learning and whatever else was necessary for them to be able to learn on the job and be able to pick up new reasoning skills by themselves, post-deployment.
Obviously right now this isn't the case, so therefore we're stuck with the LLM companies trying to deliver models "out of the box" that have some generally useful reasoning capability that goes beyond whatever happened to be in their pre-training data, and the way they are trying to do that is with RL ...
It'll obviously happen at some point. No reason why it won't.
Just as obviously, current LLMs are capable of legitimate intelligent reasoning now, subject to the above constraints. The burden of proof lies on those who still claim otherwise against all apparent evidence. Better definitions of 'intelligence' and 'reasoning' would be a necessary first step, because our current ones have decisively been met.
Someone who has lost the ability to form memories is still human and can still reason, after all.
Continual learning, resulting in my AI being different from yours, because we've both got them doing different things, is also likely to turn the current training and deployment paradigm on its head.
I agree we'll get there one day, but I expect we'll spend the next decade exploiting LLMs before any serious effort moves on to new architectures.
In the meantime, DeepMind for one have indicated they will try to build their version of "AGI" with an LLM as a component of it, but it remains to be seen exactly what they end up building and how much new capability that buys. In the long term building in language as a component, rather than building in the ability to learn language, and everything else that humans are capable of learning, is going to prove a limitation, and personally I wouldn't call it AGI until we do get to that level of being able to learn everything that a human can.
To see another problem with your argument, find someone with weak reasoning abilities who is willing to be a test subject. Give them a calculator -- hell, give them a copy of Mathematica -- and send them to IMO, and see how that works out for them.
A - A force is required to lift a ball
B - I see Human-N lifting a ball
C - Obviously, Human-N cannot produce forces
D - Forces are not required to lift a ball
Well sir, why are you so sure Human-N cannot produce forces? How is she lifting the ball? Well, of course, Human-N is just using s̶t̶a̶t̶i̶s̶t̶i̶c̶s̶ magic.
First, the obvious one, is that LLMs are trained to auto-regressively predict human training samples (i.e. essentially to copy them, without overfitting), so OF COURSE they are going to sound like the training set - intelligent, reasoning, understanding, etc, etc. The mistake is to anthropomorphize the model because it sounds human, and ascribe these attributes of understanding etc to the model itself, rather than seeing them as just reflecting the mental abilities of the humans who wrote the training data.
The second point is perhaps a bit more subtle, and is about the nature of understanding and the differences between what an LLM is predicting and what the human cortex - also a prediction machine - is predicting...
When humans predict, what we're predicting is something external to ourself - the real world. We observe, over time we see regularities, and from this predict we'll continue to see those regularities. Our predictions include our own actions as an input - how will the external world react to our actions, and therefore we learn how to act.
Understanding something means being able to predict how it will behave, both left alone, and in interaction with other objects/agents, including ourselves. Being able to predict what something will do if you poke it is essentially what it means to understand it.
What an LLM is predicting is not the external world and how it reacts to the LLM's actions, since it is auto-regressively trained - it is only predicting a continuation of its own output (actions) based on its own immediately preceding output (actions)! The LLM therefore itself understands nothing, since it has no grounding for what it is "talking about", or for how the external world behaves in reaction to its own actions.
The LLM's appearance of "understanding" comes solely from the fact that it is mimicking the training data, which was generated by humans who do have agency in the world and understanding of it, but the LLM has no visibility into the generative process of the human mind - only into the artifacts (words) it produces, so the LLM is doomed to operate in a world of words where all it might be considered to "understand" is its own auto-regressive generative process.
1. “LLMs just mimic the training set, so sounding like they understand doesn’t imply understanding.”
This is the magic argument reskinned. Transformers aren’t copying strings, they’re constructing latent representations that capture relationships, abstractions, and causal structure because doing so reduces loss. We know this not by philosophy, but because mechanistic interpretability has repeatedly uncovered internal circuits representing world states, physics, game dynamics, logic operators, and agent modeling. “It’s just next-token prediction” does not prevent any of that from occurring. When an LLM performs multi-step reasoning, corrects its own mistakes, or solves novel problems not seen in training, calling the behavior “mimicry” explains nothing. It’s essentially saying “the model can do it, but not for the reasons we’d accept,” without specifying what evidence would ever convince you otherwise. Imaginary distinction.
2. “Humans predict the world, but LLMs only predict text, so humans understand but LLMs don’t.”
This is a distinction without the force you think it has. Humans also learn from sensory streams over which they have no privileged insight into the generative process. Humans do not know the “real world”; they learn patterns in their sensory data. The fact that the data stream for LLMs consists of text rather than photons doesn’t negate the emergence of internal models. An internal model of how text-described worlds behave is still a model of the world.
If your standard for “understanding” is “being able to successfully predict consequences within some domain,” then LLMs meet that standard, just in the domains they were trained on, and today's state of the art is trained on more than just text.
You conclude that “therefore the LLM understands nothing.” But that’s an all-or-nothing claim that doesn’t follow from your premises. A lack of sensorimotor grounding limits what kinds of understanding the system can acquire; it does not eliminate all possible forms of understanding.
Wouldn't the birds that have the ability to navigate by the earth's magnetic field soon say humans have no understanding of electromagnetism? They get trained on sensorimotor data humans will never be able to train on. If you think humans have access to the "real world" then think again. They have a tiny, extremely filtered slice of it.
Saying “it understands nothing because autoregression” is just another unfalsifiable claim dressed as an explanation.
Sure (to the second part), but the latent representations aren't the same as a human's. The human's world that they have experience with, and therefore representations of, is the real world. The LLM's world that they have experience with, and therefore representations of, is the world of words.
Of course an LLM isn't literally copying - it has learnt a sequence of layer-wise next-token predictions/generations (copying of partial embeddings to next token via induction heads etc), with each layer having learnt what patterns in the layer below it needs to attend to, to minimize prediction error at that layer. You can characterize these patterns (latent representations) in various ways, but at the end of the day they are derived from the world of words it is trained on, and are only going to be as good/abstract as next token error minimization allows. These patterns/latent representations (the "world model" of the LLM if you like) are going to be language-based (incl language-based generalizations), not the same as the unseen world model of the humans who generated that language, whose world model describes something completely different - predictions of sensory inputs and causal responses.
So, yes, there is plenty of depth and nuance to the internal representations of an LLM, but no logical reason to think that the "world model" of an LLM is similar to the "world model" of a human since they live in different worlds, and any "understanding" the LLM itself can be considered as having is going to be based on its own world model.
> Saying “it understands nothing because autoregression” is just another unfalsifiable claim dressed as an explanation.
I disagree. It comes down to how you define understanding. A human understands (correctly predicts) how the real world behaves, and the effect its own actions will have on the real world. This is what the human is predicting.
What an LLM is predicting is effectively "what will I say next" after "the cat sat on the". The human might see a cat and based on circumstances and experience of cats predict that the cat will sit on the mat. This is because the human understands cats. The LLM may predict the next word as "mat", but this does not reflect any understanding of cats - it is just a statistical word prediction based on the word sequences it was trained on, notwithstanding that this prediction is based on the LLM's world-of-words model.
So LLMs and Humans are different and have different sensory inputs. So what ? This is all animals. You think dolphins and orcas are not intelligent and don't understand things ?
>What an LLM is predicting is effectively "what will I say next" after "the cat sat on the". The human might see a cat and based on circumstances and experience of cats predict that the cat will sit on the mat.
Genuinely don't understand how you can actually believe this. A human who predicts mat does so because of the popular phrase. That's it. There is no reason to predict it over the numerous things cats regularly sit on, often much more so than mats (if you even have one). It's not because of any super special understanding of cats. You are doing the same thing the LLM is doing here.
Not sure where you got that non sequitur from ...
I would expect most animal intelligence (incl. humans) to be very similar, since their brains are very similar.
Orcas are animals.
LLMs are not animals.
From the orca's perspective, many of the things we say we understand are similarly '2nd hand hearsay'.
To follow your hypothetical, if an Orca were to be exposed to human language, discussing human terrestrial affairs, and were able to at least learn some of the patterns, and maybe predict them, then it should indeed be considered not to have any understanding of what that stream of words meant - I wouldn't even elevate it to '2nd hand hearsay'.
Still, the Orca, unlike an LLM, does at least have a brain, and does live in and interact with the real world, and could probably be said to "understand" things in its own watery habitat as well as we do.
Regarding "input supremacy" :
It's not the LLM's "world of words" that really sets it apart from animals/humans, since there are also multi-modal LLMs with audio and visual inputs more similar to a human's sensory inputs. The real difference is what they are doing with those inputs. The LLM is just a passive observer, whose training consisted of learning patterns in its inputs. A human/animal is an active agent, interacting with the world, and thereby causing changes in the input data it is then consuming. The human/animal is learning how to DO things, and gaining understanding of how the world reacts. The LLM is learning how to COPY things.
There are of course many other differences between LLMs/Transformers and animal brains, but even if we were to eliminate all these differences the active vs passive one would still be critical.
If you ask a human to complete the phrase "the cat sat on the", they will probably answer "mat". This is memorization, not understanding. The LLM can do this too.
If you just input "the cat sat on the" to an LLM, it will also likely just answer "mat" since this is what LLMs do - they are next-word input continuers.
If you said "the sat sat on the" to a human, they would probably respond "huh?" or "who the hell knows!", since the human understands that cats are fickle creatures and that partial sentences are not the conversational norm.
If you ask an LLM to explain its understanding of cats, it will happily reply, but the output will not be its own understanding of cats - it will be parroting some human opinion(s) it got from the training set. It has no first hand understanding, only 2nd hand hearsay.
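(As an aside, the raw "input continuer" behaviour is easiest to see with a base model that has had no chat post-training. A minimal sketch using the Hugging Face transformers library and GPT-2, purely to illustrate the point being made here:)

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# GPT-2 is a pure base model: next-token prediction only, no instruction
# tuning and no chat post-training.
tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tok("The cat sat on the", return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=5, do_sample=False)  # greedy decoding
print(tok.decode(out[0]))
# The base model simply continues the input; the distinction between
# "completing" and "answering" only appears after post-training.
```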
I'm not sure what you're getting at here? You think LLMs don't similarly answer 'What are you trying to say?' Sometimes I wonder if the people who propose these gotcha questions ever bother to actually test them on said LLMs.
>If you ask an LLM to explain its understanding of cats, it will happily reply, but the output will not be its own understanding of cats - it will be parroting some human opinion(s) it got from the training set. It has no first hand understanding, only 2nd hand hearsay.
Again, you're not making the distinction you think you are. Understanding from '2nd hand hearsay' is still understanding. The vast majority of what humans learn in school is such.
Since you asked, yes, Claude responds "mat", then asks if I want it to "continue the story".
Of course if you know anything about LLMs you should realize that they are just input continuers, and any conversational skills come from post-training. To an LLM a question is just an input whose human-preferred (as well as statistically most likely) continuation is a corresponding answer.
I'm not sure why you regard this as a "gotcha" question. If you're expressing opinions on LLMs, then table stakes should be to have a basic understanding of LLMs - what they are internally, how they work, and how they are trained, etc. If you find a description of LLMs as input-continuers in the least bit contentious then I'm sorry to say you completely fail to understand them - this is literally what they are trained to do. The only thing they are trained to do.
https://claude.ai/share/3e14f169-c35a-4eda-b933-e352661c92c2
https://chatgpt.com/share/6919021c-9ef0-800e-b127-a6c1aa8d9f...
>Of course if you know anything about LLMs you should realize that they are just input continuers, and any conversational skills come from post-training.
No, they don't. Post-training makes things easier, more accessible and consistent but conversation skills are in pre-trained LLMs just fine. Append a small transcript to the start of the prompt and you would have the same effect.
>I'm not sure why you regard this as a "gotcha" question. If you're expressing opinions on LLMs, then table stakes should be to have a basic understanding of LLMs - what they are internally, how they work, and how they are trained, etc.
You proposed a distinction and explained a situation which would make that distinction falsifiable. And I simply told you LLMs don't respond the way you claim they would. Even when models respond mat (Now I think your original point had a typo?), it is clearly not due to a lack of understanding of what normal sentences are like.
>If you find a description of LLMs as input-continuers in the least bit contentious then I'm sorry to say you completely fail to understand them - this is literally what they are trained to do. The only thing they are trained to do.
They are predictors. If the training data is solely text then the output will be more text, but that need not be the case. Words can go in while Images or actions or audio may come out. In that sense, humans are also 'input continuers'.
Yeah - you might want to check what you actually typed there.
Not sure what you're trying to prove by doing it yourself though. Have you heard of random sampling? Never mind ...
That's what you typed in your comment. Go check. I just figured it was intentional since surprise is the first thing you expect humans to show in response to it.
>Not sure what you're trying to prove by doing it yourself though. Have you heard of random sampling? Never mind ...
I guess you fancy yourself a genius who knows all about LLMs now, but sampling wouldn't matter here. Your whole point was that it happens because of a fundamental limitation on the part of LLMs that makes them unable to do it. Even one contrary response, never mind multiple, would be enough. After all, some humans would simply say 'mat'.
Anyway, it doesn't really matter. Completing 'mat' doesn't have anything to do with a lack of understanding. It's just the default 'assumption' that it's a completion that is being sought.
(It's a pretty constraining interface though - the model outputs an entire distribution and then we instantly lose it by only choosing one token from it.)
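You can poke at that full distribution directly. A rough sketch, assuming the Hugging Face transformers library and GPT-2 as a stand-in for whatever model is under discussion:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tok("The cat sat on the", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits[0, -1]   # scores for the next token only
probs = torch.softmax(logits, dim=-1)        # full distribution over ~50k tokens

top = torch.topk(probs, 5)
for p, idx in zip(top.values, top.indices):
    print(f"{tok.decode(idx.item())!r}: {p.item():.3f}")
# Sampling keeps exactly one of these ~50k entries and throws the rest away,
# which is the "constraining interface" being described above.
```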
>I really have a hard time making (or believing) the argument that they are just predicting the next word.
It's true, but by the same token our brain is "just" thresholding spike rates.

It's incredibly frustrating to have a model start to hallucinate sources and be incapable of revisiting its behavior. It couldn't even understand that it was making up nonsensical RFC references.
But many people in the humanities have read the stochastic parrot argument, it fits their idea of how they prefer things to be, so they take it as true without questioning much.
You can put just about anything in there for x and y, and it will almost always get it right. Can a pair of scissors cut through a Boeing 747? Can a carrot cut through loose snow? A chainsaw cut through a palm leaf? Nail clippers through a rubber tire?
Because of combinatorics, the space of ways objects can interact is too big to memorize, so it can only answer if it has learned something real about materials and their properties.
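If you want to probe that combinatorial space yourself, a rough sketch of the kind of loop involved, assuming the official openai Python client and a placeholder model name (swap in whatever you actually have access to):

```python
import itertools
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

tools = ["a pair of scissors", "a carrot", "a chainsaw", "nail clippers"]
targets = ["a Boeing 747 fuselage", "loose snow", "a palm leaf", "a rubber tire"]

for tool, target in itertools.product(tools, targets):
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{
            "role": "user",
            "content": f"Can {tool} cut through {target}? Answer yes or no, then give one sentence of reasoning.",
        }],
    )
    print(f"{tool} vs {target}: {resp.choices[0].message.content}")
```

The point is that the tool/target grid grows multiplicatively, so memorizing answers doesn't scale the way knowing material properties does.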
Otherwise you are likely to have people agreeing with you, while they actually had a very different point that they took away.
If they could get this to occur naturally - with no supporting prompts, and only zero-shot or one-shot reasoning - then it could extend to complex composition generally, which would be cool.
With that said, the writing here is a bit hyperbolic, as the advances seem like standard improvements, rather than a huge leap or final solution.
https://pasteboard.co/euHUz2ERKfHP.png
Its response, which I have captured here https://pasteboard.co/sbC7G9nuD9T9.png, is shockingly good. I could only spot 2 mistakes, and those seem to have been the ones even I could not read, or found very difficult to make out.
This is the 'RE' in research, you specifically want to know and understand what others think of something by reading others' papers. The scientific training slowly, laboriously prepares you to reason about something without being too influenced by it.
I would recommend listening to their explanation, maybe it'll give more insight.
Disclosure: After listening to the podcast and looking up and reading the article I emailed @dang to suggest it go into the HN second chance pool. I'm glad more people enjoyed it.
[0]: https://www.nytimes.com/2025/11/14/podcasts/hardfork-data-ce...
> I'm still not transcribing important historical documents with a chat bot and nor should he
Doesn't sound like she's interested in technology, or wants help.
I was thinking as I skimmed this it needs a “jump to recipe” button.
I think this is mistaken. I remember... ten years ago? When speech-to-text models came out that dealt with background noise that made the audio sound very much like straight pink noise to my ear, but the model was able to transcribe the speech hidden within at a reasonable accuracy rate.
So with handwritten text, the only prediction that makes sense to me is that we will (potentially) reach a state where the machine is at least probably more accurate than humans, although we wouldn't be able to confirm it ourselves.
But if multiple independent models, say, Gemini 5 and Claude 7, both agree on the result, and a human can only shrug and say, "might be," then we're at a point where the machines are probably superior at the task.
Now, Claude did vibe code a fairly accurate solution to this using more traditional techniques. This is very impressive on its own, but I'd hoped to be able to just shovel the problem into the VLM and be done with it. It's kind of crazy that we have "AIs" that can't tell even roughly what the orientation of a brain scan is (something a five year old could probably learn to do) but can vibe code something using traditional computer vision techniques to do it.
I suppose it's not too surprising, a visually impaired programmer might find it impossible to do reliably themselves but would code up a solution, but still: it's weird!
I would not have expected a language model to perform well on what sounds like a computer vision problem? Even if it were agentic: just as you imply a five year old could learn how to do it, so too an AI system would need to be trained, or at the very least be provided with a description of what it is looking at.
Imagine you took an MRI brain scan back in time and showed it to a medical Doctor in even the 1950s or maybe 1900. Do you think they would know what the normal orientation is, let alone what they are looking at?
I am a bit confused and also interested in how people are interacting with AI in general, it really seems to have a tendency to highlight significant holes in all kinds of human epistemological, organizational, and logical structures.
I would suggest maybe you think of it as a kind of child, and with that, you would need to provide as much context and exact detail about the requested task or information as possible. This is what context engineering (are we still calling it that?) concerns itself with.
They then give the wrong answer, hallucinating anatomical details in the wrong place, etc. I didn't bother with extensive prompting because it doesn't evince any confusion on the criteria, it just seems to not understand spatial orientations very well, and it seemed unlikely to help.
The thing is that it's very, very simple: an axial slice of a brain is basically egg-shaped. You can work out whether it's pointing vertically (ie, nose pointing towards the top of the image) or horizontally by looking at it. LLMs will insist it's pointing vertically when it isn't. It's an easy task for someone with eyes.
Essentially all images an LLM will have seen of brains will be in this orientation, which is either a help or a hindrance, and I think in this case a hindrance- it's not that it's seen lots of brains and doesn't know which are correct, it's that it has only ever seen them in the standard orientation and it can't see the trees for the forest, so to speak.
It's something that you can solve by just treating the brain as roughly egg-shaped and working out which way the pointy end is, or looking for the very obvious bilateral symmetry. You don't really have to know what any of the anatomy actually is.
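For what it's worth, that "find the pointy end of the egg" heuristic is only a few lines of classical image processing. A rough sketch (plain NumPy, made-up threshold, angles in image coordinates), possibly similar in spirit to whatever Claude vibe-coded:

```python
import numpy as np

def long_axis_angle(img: np.ndarray, thresh: float = 0.1) -> float:
    """Estimate the long-axis direction of a roughly egg-shaped blob,
    returning the angle (degrees) of the axis pointing toward the
    narrower ("pointy") end."""
    # Foreground pixel coordinates (x, y), centred on the centroid.
    ys, xs = np.nonzero(img > thresh * img.max())
    pts = np.stack([xs, ys], axis=1).astype(float)
    pts -= pts.mean(axis=0)

    # Principal axis of the pixel cloud = long axis of the egg.
    evals, evecs = np.linalg.eigh(np.cov(pts.T))
    axis = evecs[:, np.argmax(evals)]

    # Project pixels onto the axis; the pointy end has less mass and a
    # longer tail, so positive skew means the axis already points at it.
    proj = pts @ axis
    skew = np.mean((proj - proj.mean()) ** 3) / (proj.std() ** 3 + 1e-9)
    if skew < 0:
        axis = -axis

    return float(np.degrees(np.arctan2(axis[1], axis[0])))
```

It doesn't need to know any anatomy; the rough shape and bilateral symmetry carry all the information.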
I looked at the image and immediately noticed that it is written as “14 5” in the original text. It doesn’t require calculation to guess that it might be 14 pounds 5 ounces rather than 145. Especially since presumably, that notation was used elsewhere in the document.
I built a whole product around this: https://DocumentTranscribe.com
But I imagine this will keep getting better and that excites me since this was largely built for my own research!
> As is so often the case with AI, that is exciting and frightening all at once
> we need to extrapolate from this small example to think more broadly: if this holds the models are about to make similar leaps in any field where visual precision and skilled reasoning must work together required
> this will be a big deal when it’s released
> What appears to be happening here is a form of emergent, implicit reasoning, the spontaneous combination of perception, memory, and logic inside a statistical model
> model’s ability to make a correct, contextually grounded inference that requires several layers of symbolic reasoning suggests that something new may be happening inside these systems—an emergent form of abstract reasoning that arises not from explicit programming but from scale and complexity itself
Just another post with extreme hyperbolic wording to blow up another model release. How many times have we seen such non-realistic build up in the past couple of years.
If you know more than others do, that's great, but in that case please share some of what you know so the rest of us can learn. Putting down others only makes this place worse for everyone.
https://hn.algolia.com/?dateRange=all&page=0&prefix=true&sor...
It has completely changed the way I work, and it allows me to write math and text and then convert it with the Gemini app (or with a scanned PDF in the browser). You should really try it.
I read the whole reasoning of the blog author after that, but I still gotta know - how can we tell that this was not a hallucination and/or error? There's a 1/3 chance of an error being correct (either 1 lb 45, 14 lb 5 or 145 lb), so why is the author so sure that this was deliberate?
I feel a good way to test this would be to create an almost identical ledger entry, but in a way so that the correct answer after reasoning (the way the author thinks the model reasoned) has completely different digits.
This way there'd be more confidence that the model itself reasoned and did not make an error.
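One way to build such a test is to pick a price per pound and a line total where only one reading of the weight is arithmetically consistent. A rough sketch with entirely made-up numbers (the actual ledger's figures aren't reproduced here):

```python
from fractions import Fraction

def line_total(pounds: int, ounces: int, price_per_pound: Fraction) -> Fraction:
    """Total for a weighed item priced per pound, at 16 oz to the lb."""
    weight_lb = Fraction(pounds) + Fraction(ounces, 16)
    return weight_lb * price_per_pound

price = Fraction(8)  # hypothetical: 8 pence per pound
print(float(line_total(14, 5, price)))   # 114.5  -> matches a recorded total of 114.5
print(float(line_total(145, 0, price)))  # 1160.0 -> wildly inconsistent reading
```

If the model's transcription still lines up with the recorded total after you've changed the digits, that's much stronger evidence of reasoning than a single lucky hit.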
The fact that it is "intelligent" is fine for some things.
For example, I created a structured output schema that had a field "currency" in the 3-letter format (USD, EUR...). Then I scanned a receipt from some shop in Jakarta and it filled that field with IDR (Indonesian Rupiah). It inferred that value from the city name on the receipt.
Would it be better for my use case that it would have returned no data for the currency field? Don't think so.
Note: if needed maybe I could have changed the prompt to not infer the currency when not explicitly listed on the receipt.
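Another lever, besides the prompt, is the schema itself: making the field optional and spelling out when it may be filled leaves the "don't guess" decision to the model. A rough sketch with Pydantic (field names and wording are just my own illustration):

```python
from typing import Optional
from pydantic import BaseModel, Field

class Receipt(BaseModel):
    merchant: str
    city: Optional[str] = None
    total: float
    # Optional, so the model can return null instead of guessing.
    currency: Optional[str] = Field(
        default=None,
        description=(
            "3-letter ISO 4217 code (USD, EUR, IDR, ...). Fill this only "
            "if it is printed on the receipt or unambiguous from context."
        ),
    )
```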
If there’s a decent chance it infers the wrong currency, potentially one where the value of each unit is a few units of scale larger or smaller than that of IDR, it might be better to not infer it.
Almost certainly yes.
The article's assumption of how the model ended up "transcribing" "1 loaf of sugar u/145" as "1 loaf of sugar 14lb 5oz" seems very speculative. It seems more reasonable to assume that a massive frontier model knows something about loaves of sugar and their weight range, and in fact Google search's "AI overview" of "how heavy is a loaf of sugar" says the common size is approximately 14lb.
My wife is a historian and she is trained to recognize old handwriting. When we go to museums she "translates" the texts for the family.
However it is inevitable that people on here will try to find errors, refusing to believe the massive categorical difference between LLMs and previous tech. The HN cycle will repeat until AGI.
>HN says "AI will never/not in a long time be able to do (some arbitrary difficult task)"
>Model is released that can do that task at near or above human level
>HN finds some new even narrower task (which few humans can do) that LLMs can't do yet and say that it's only "intelligent" if it can do that
>Repeat
You're drawing conclusions from _this_? Let alone pretending that "it did a something else unexpected that can only be described as genuine, human-like, expert level reasoning."
Give me a break. This entire industry is impossible to take seriously.
throwup238•2mo ago
> Whatever it is, users have reported some truly wild things: it codes fully functioning Windows and Apple OS clones, 3D design software, Nintendo emulators, and productivity suites from single prompts.
This I’m a lot more skeptical of. The linked twitter post just looks like something it would replicate via HTML/CSS/JS. What’s the kernel look like?
WhyOhWhyQ•2mo ago
Wow I'm doing it way wrong. How do I get the good stuff?
zer00eyz•2mo ago
I want you to go into the kitchen and bake a cake. Please replace all the flour with baking soda. If it comes out looking limp and lifeless just decorate it up with extra layers of frosting.
You can make something that looks like a cake but would not be good to eat.
The cake, sometimes, is a lie. And in this case, so are likely most of these results... or they are the actual source code of some other project just regurgitated.
hinkley•2mo ago
We weren’t even testing for that.
erulabs•2mo ago
joshstrange•2mo ago
fragmede•2mo ago
scubbo•2mo ago
fragmede•2mo ago
> We got the results back. You are a horrible person. I’m serious, that’s what it says: “Horrible person.”
> We weren’t even testing for that.
joshstrange then wrote:
> If you want to listen to the line from Portal 2 it's on this page (second line in the section linked): https://theportalwiki.com/wiki/GLaDOS_voice_lines_(Portal_2)...
as if the fact that the words that hinkley wrote are from a popular video game excuses the fact that hinkley just also called zer00eyz horrible.
hinkley•2mo ago
K.
fragmede•2mo ago
Dylan16807•2mo ago
If that sentence was by itself, I would understand your complaint. But as-is I'm having a hard time seeing the issue.
And the weird analogy where you added "someone's pointing a gun at you" undermines your stance more than it helps.
joshstrange•2mo ago
My 2 comments are linking to different quotes from Portal 2, both the original comment
> We got the results back.....
and
> Well, what does a neck-bearded old engineer know about fashion?.....
are from Portal 2, and the first Portal 2 quote is just a reference to the parent of that saying:
> The cake, sometimes, is a lie.
(Another Portal reference if that wasn't clear). They weren't calling the parent horrible, they were just putting in a quote they liked from the game that was referenced.
That's one reason why I linked the quote, so people would understand it was a reference to the game, not the person actually saying the parent was horrible. The other reason I linked it is just because I like added metadata where possible.
joshstrange•2mo ago
hinkley•2mo ago
I’m still amazed that game started as someone’s school project. Long live the Orange Box!
chihuahua•2mo ago
snickerbockers•2mo ago
flatline•2mo ago
imiric•2mo ago
If yes, why aren't we seeing glimpses of such genius today? If we've truly invented artificial intelligence, and on our way to super and general intelligence, why aren't we seeing breakthroughs in all fields of science? Why are state of the art applications of this technology based on pattern recognition and applied statistics?
Can we explain this by saying that we're only a few years into it, and that it's too early to expect fundamental breakthroughs? And that by 2027, or 2030, or surely by 2040, all of these things will suddenly materialize?
I have my doubts.
tanseydavid•2mo ago
imiric•2mo ago
famouswaffles•2mo ago
Only a small percentage of humanity are/were capable of doing any of these. And they tend to be the best of the best in their respective fields.
>If yes, why aren't we seeing glimpses of such genius today?
Again, most humans can't actually do any of the things you just listed. Only our most intelligent can. LLMs are great, but they're not (yet?) as capable as our best and brightest (and in many ways, lag behind the average human) in most respects, so why would you expect such genius now?
beeflet•2mo ago
I am skeptical of this claim that you need a 140IQ to make scientific breakthroughs, because you don't need a 140IQ to understand special relativity. It is a matter of motivation and exposure to new information. The vast majority of the population doesn't benefit from working in some niche field of physics in the first place.
Perhaps LLMs will never be at the right place and the right time because they are only trained on ideas that already exist.
famouswaffles•2mo ago
It's not an "or" but an "and". Being at the right place and time is a necessary precondition, but it's not sufficient. Newton stood on the shoulders of giants like Kepler and Galileo, and Einstein built upon the work of Maxwell and Lorentz. The key question is, why did they see the next step when so many of their brilliant contemporaries, who had the exact same information and were in similar positions, did not? That's what separates the exceptional from the rest.
>I am skeptical of this claim that you need a 140IQ to make scientific breakthroughs, because you don't need a 140IQ to understand special relativity.
There is a pretty massive gap between understanding a revolutionary idea and originating it. It's the difference between being the first person to summit Everest without a map, and a tourist who takes a helicopter to the top to enjoy the view. One requires genius and immense effort; the other requires following instructions. Today, we have a century of explanations, analogies, and refined mathematics that make relativity understandable. Einstein had none of that.
Kim_Bruning•2mo ago
imiric•2mo ago
I'm not expecting novel scientific theories today. What I am expecting are signs and hints of such genius. Something that points in the direction that all tech CEOs are claiming we're headed in. So far I haven't seen any of this yet.
And, I'm sorry, I don't buy the excuse that these tools are not "yet" as capable as the best and brightest humans. They contain the sum of human knowledge, far more than any individual human in history. Are they not intelligent, capable of thinking and reasoning? Are we not on the verge of superintelligence[1]?
> we have recently built systems that are smarter than people in many ways, and are able to significantly amplify the output of people using them.
If all this is true, surely we should be seeing incredible results produced by this technology. If not by itself, then surely by "amplifying" the work of the best and brightest humans.
And yet... All we have to show for it are some very good applications of pattern matching and statistics, a bunch of gamed and misleading benchmarks and leaderboards, a whole lot of tech demos, solutions in search of a problem, and the very real problem of flooding us with even more spam, scams, disinformation, and devaluing human work with low-effort garbage.
[1]: https://blog.samaltman.com/the-gentle-singularity
famouswaffles•2mo ago
Like I said, what exactly would you be expecting to see with the capabilities that exist today? It's not a gotcha, it's a genuine question.
>And, I'm sorry, I don't buy the excuse that these tools are not "yet" as capable as the best and brightest humans.
There's nothing to buy or not buy. They simply aren't. They are unable to do a lot of the things these people do. You can't slot an LLM in place of most knowledge workers and expect everything to be fine and dandy. There's no ambiguity on that.
>They contain the sum of human knowledge, far more than any individual human in history.
It's not really the total sum of human knowledge but let's set that aside. Yeah, so? Einstein, Newton, von Neumann. None of these guys were privy to some super secret knowledge their contemporaries weren't, so it's obviously not simply a matter of more knowledge.
>Are they not intelligent, capable of thinking and reasoning?
Yeah they are. And so are humans. So were the peers of all those guys. So why are only a few able to see the next step? It's not just about knowledge, and intelligence lives in degrees/is a gradient.
>If all this is true, surely we should be seeing incredible results produced by this technology. If not by itself, then surely by "amplifying" the work of the best and brightest humans.
Yeah and that exists. Terence Tao has shared a lot of his (and his peers) experiences on the matter.
https://mathstodon.xyz/@tao/115306424727150237
https://mathstodon.xyz/@tao/115420236285085121
https://mathstodon.xyz/@tao/115416208975810074
>And yet... All we have to show for it are some very good applications of pattern matching and statistics, a bunch of gamed and misleading benchmarks and leaderboards, a whole lot of tech demos, solutions in search of a problem, and the very real problem of flooding us with even more spam, scams, disinformation, and devaluing human work with low-effort garbage.
Well it's a good thing that's not true then
imiric•2mo ago
And like I said, "signs and hints" of superhuman intelligence. I don't know what that looks like since I'm merely human, but I sure know that I haven't seen it yet.
> There's nothing to buy or not buy. They simply aren't. They are unable to do a lot of the things these people do.
This claim is directly opposed to claims by Sam Altman and his cohort, which I'll repeat:
> we have recently built systems that are smarter than people in many ways, and are able to significantly amplify the output of people using them.
So which is it? If they're "smarter than people in many ways", where is the product of that superhuman intelligence? If they're able to "significantly amplify the output of people using them", then all of humanity should be empowered to produce incredible results that were previously only achievable by a limited number of people. In hands of the best and brightest humans, it should empower them to produce results previously unreachable by humanity.
Yet all positive applications of this technology show that it excels at finding and producing data patterns, and nothing more than that. Those experience reports by Terence Tao are prime examples of this. The system was fed a lot of contextual information, and after being coaxed by highly intelligent humans, was able to find and produce patterns that were difficult to see by humans. This is hardly the showcase of intelligence that you and others think it is, including those highly intelligent humans, some of whom have a lot to gain from pushing this narrative.
We have seen similar reports by programmers as well[1]. Yet I'm continually amazed that these highly intelligent people are surprised that a pattern finding and producing system was able to successfully find and produce useful patterns, and then interpret that as a showcase of intelligence. So much so that I start to feel suspicious about the intentions and biases of those people.
To be clear: I'm not saying that these systems can't be very useful in the right hands, and potentially revolutionize many industries. Ultimately many real-world problems can be modeled as statistical problems where a pattern recognition system can excel. What I am saying is that there's a very large gap from the utility of such tools, and the extraordinary claims that they have intelligence, let alone superhuman and general intelligence. So far I have seen no evidence of the latter, despite of the overwhelming marketing euphoria we're going through.
> Well it's a good thing that's not true then
In the world outside of the "AI" tech bubble, that is very much the reality.
[1]: https://news.ycombinator.com/item?id=45784179
lelanthran•2mo ago
Sure, agreed, but the difference between a small percentage and zero percentage is infinite.
gf000•2mo ago
A definite, absolute and unquestionable no, and a small but real chance, are absolutely different categories.
You may wait for a bunch of rocks to sprout forever, but I would put my money on a bunch of random seeds, even if I don't know how they were kept.
beeflet•2mo ago
When I create something, it's an exploratory process. I don't just guess what I am going to do based on my previous step and hope it comes out good on the first try. Let's say I decide to make a car with 5 wheels. I would go through several chassis designs, different engine configurations until I eventually had something that works well. Maybe some are too weak, some too expensive, some are too complicated. Maybe some prototypes get to the physical testing stage while others don't. Finally, I publish this design for other people to work on.
If you ask the LLM to work on a novel concept it hasn't been trained on, it will usually spit out some nonsense that either doesn't work or works poorly, or it will refuse to provide a specific enough solution. If it has been trained on previous work, it will spit out something that looks similar to the solved problem in its training set.
These AI systems don't undergo the process of trial and error that suggests it is creating something novel. Its process of creation is not reactive with the environment. It is just cribbing off of extant solutions it's been trained on.
vidarh•2mo ago
jstummbillig•2mo ago
For example, I can't wrap my head around how a) a human could come up with a piece of writing that inarguably reads "novel" writing, while b) an AI could be guaranteed to not be able to do the same, under the same standard.
testaccount28•2mo ago
fragmede•2mo ago
CamperBob2•2mo ago
greygoo222•2mo ago
mikestorrent•2mo ago
magicalist•2mo ago
No
Edit: to be less snarky, it topped the Billboard Country Digital Song Sales Chart, which is a measure of sales of the individual song, not streaming listens. It's estimated it takes a few thousand sales to top that particular chart and it's widely believed to be commonly manipulated by coordinated purchases.
terminalshort•2mo ago
snickerbockers•2mo ago
brulard•2mo ago
beeflet•2mo ago
fragmede•2mo ago
snickerbockers•2mo ago
Honest question: if AI is actually capable of exploring new directions why does it have to train on what is effectively the sum total of all human knowledge? Shouldn't it be able to take in some basic concepts (language parsing, logic, etc) and bootstrap its way into new discoveries (not necessarily completely new but independently derived) from there? Nobody learns the way an LLM does.
ChatGPT, to the extent that it is comparable to human cognition, is undoubtedly the most well-read person in all of history. When I want to learn something I look it up online or in the public library but I don't have to read the entire library to understand a concept.
BirAdam•2mo ago
BobbyTables2•2mo ago
There's no cognition. It's not taught language, grammar, etc. None of that!
It's only seen a huge amount of text that allows it to recognize answers to questions. Unfortunately, it appears to work, so people see it as the equivalent of sci-fi movie AI.
It’s really just a search engine.
snickerbockers•2mo ago
In fact, I would expect it to be able to reproduce past human discoveries it hasn't even been exposed to, and if the AI is actually capable of this then it should be possible for them to set up a controlled experiment wherein it is given a limited "education" and must discover something already known to the researchers but not the machine. That nobody has done this tells me that either they have low confidence in the AI despite their bravado, or that they already have tried it and the machine failed.
ezst•2mo ago
Is it? I only see a few individuals, VCs, and tech giants overblowing LLMs capabilities (and still puzzled as to how the latter dragged themselves into a race to the bottom through it). I don't believe the academic field really is that impressed with LLMs.
throwaway173738•2mo ago
ninetyninenine•2mo ago
The characterization you are regurgitating here is from laymen who do not understand AI. You are not just mildly wrong but wildly uninformed.
MangoToupe•2mo ago
ninetyninenine•2mo ago
MangoToupe•2mo ago
ninetyninenine•2mo ago
You can disagree. But this is not an opinion. You are factually wrong if you disagree. And by that I mean you don’t know what you’re talking about and you are completely misinformed and lack knowledge.
The long term outcome if I’m right is that AI abilities continue to grow and it basically destroys my career and yours completely. I stand not to benefit from this reality and I state it because it is reality. LLMs improve every month. It’s already to the point where if you’re not vibe coding you’re behind.
MangoToupe•2mo ago
I like being productive, not babysitting a semi-literate program incapable of learning
ninetyninenine•2mo ago
Again if you don’t agree then you are lost and uninformed. There are special cases where there are projects where human coding is faster but that is a minority.
MangoToupe•2mo ago
versteegen•2mo ago
fragmede•2mo ago
ninetyninenine•2mo ago
There is plenty of evidence for this. You have to be blind not to realize this. Just ask the AI to generate something not in its training set.
gf000•2mo ago
kazinator•2mo ago
Same with diffusion and everything else. It is not extrapolation that you can transfer the style of Van Gogh onto a photograph; it is interpolation.
Extrapolation might be something like inventing a style: how did Van Gogh do that?
And, sure, the thing can invent a new style, as a mashup of existing styles. Give me a Picasso-like take on Van Gogh and apply it to this image ...
Maybe the original thing there is the idea of doing that; but that came from me! The execution of it is just interpolation.
BoorishBears•2mo ago
I personally think this is a bit tautological of a definition, but if you hold it, then yes LLMs are not capable of anything novel.
kazinator•2mo ago
Mashups are not purely derivative: the choice of what to mash up carries novelty: two (or more) representations are mashed together which hitherto have not been.
We cannot deny that something is new.
regularfry•2mo ago
BoorishBears•2mo ago
I don't agree, but by their estimation adding things together is still just using existing things.
Libidinalecon•2mo ago
It is like expecting a DJ remixing tracks to output original music. Confusing that the DJ is not actually playing the instruments on the recorded music so they can't do something new beyond the interpolation. I love DJ sets but it wouldn't be fair to the DJ to expect them to know how to play the sitar because they open the set with a sitar sample interpolated with a kick drum.
8note•2mo ago
I think that, along with the sitar player, they are still interpolating. The notes are all there on the instrument. Even without an instrument, it's still interpolation: the space that music and sound can occupy is all well-known wave math. If you drew a Fourier-transform view, you could see one chart with everything at zero and a second with everything at infinity, and all music and sound is going to sit somewhere between the two.
I don't know that "just interpolation" is all that meaningful to whether something is novel or interesting.
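For what it's worth, the "Fourier view" gestured at above is just the standard transform pair: any finite-energy sound x(t) is equivalent to its spectrum X(f), so every possible recording sits somewhere in the space of spectra.

\[
  X(f) = \int_{-\infty}^{\infty} x(t)\, e^{-2\pi i f t}\, dt,
  \qquad
  x(t) = \int_{-\infty}^{\infty} X(f)\, e^{2\pi i f t}\, df
\]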
kazinator•2mo ago
If he plucked one of the 13 strings of a koto, we wouldn't say he is just remixing the vibration of the koto. Perhaps we could say that, if we had justification. There is a way of using a musical instrument as just a noise maker to produce its characteristic sounds.
Similarly, a writer doesn't just remix the alphabet, spaces and punctuation symbols. A randomly generated soup of those symbols could be thought of as their remix, in a sense.
The question is, is there a meaning being expressed using those elements as symbols?
Or is the mixing all there is to the meaning? I.e., the result says "I'm a mix of this stuff and nothing more".
If you mix Alphagetti and Zoodles, you don't have a story about animals.
BoorishBears•2mo ago
Would you consider the instrumental at 33 seconds a new song? https://youtu.be/eJA0wY1e-zU?si=yRrDlUN2tqKpWDCv
gf000•2mo ago
Meanwhile, depending on how you rate an LLM's capabilities, no matter how many trials you give it, it may not be considered capable of that.
That's a very important distinction.
jofla_net•2mo ago
At any point prior to the final output it can garner a huge starting-point bias from ingested reference material, up to and including whole solutions to the original prompt minus some derivations. This is effectively akin to cheating for humans, since we can't bring notes to the exam. Because we do not have a complete picture of where every part of the output comes from, we are at a loss to explain whether it indeed invented it or not. The onus is and should be on the applicant to ensure that the output wasn't copied (show your work), not on the graders to prove that it wasn't, no less than what would be required of a human. Ultimately it boils down to what it means to 'know' something: whether a photographic memory is, in fact, knowing something, or rather derivations based on other, messier forms of symbolism. It is nevertheless a huge argument, as both sides have a mountain of bias in either direction.
jstummbillig•2mo ago
Neither did you (or I). Did you create anything that you are certain your peers would recognize as more "novel" than anything a LLM could produce?
snickerbockers•2mo ago
Not that specifically, but I certainly have the capability to create my own OS without having to refer to the source code of existing operating systems. Literally "creating a Linux" is a bit on the impossible side because it implies compatibility with an existing kernel despite the constraints prohibiting me from referring to that kernel's source (maybe possible if I had some clean-room RE team that would read through the source and create a list of requirements without including any of it).
If we're all on the same page regarding the origins of human intelligence (i.e., that it does not begin with Satan tricking Adam and Eve into eating the fruit of a tree they were specifically instructed not to touch), then it necessarily follows that any idea or concept was new at some point and had to be developed by somebody who didn't already have an entire library of books explaining the solution at his disposal.
For the Linux thought experiment you could maybe argue that Linux isn't totally novel, since its creator was intentionally mimicking the behavior of an existing well-known operating system (also, IIRC, he had access to the Minix source), and maybe you could even argue that those predecessors stood on the shoulders of their own proverbial giants. But if we keep kicking the ball down the road, eventually we reach a point where somebody had an idea which was not in any way inspired by somebody else's existing idea.
The argument I want to make is not that humans never create derivative or unoriginal works (that obviously cannot be true) but that humans have the capability to create new things. I'm not convinced that LLMs have that same capability; maybe I'm wrong but I'm still waiting to see evidence of them discovering something new. As I said in another post, this could easily be demonstrated with a controlled experiment in which the model is bootstrapped with a basic yet intentionally-limited "education" and then tasked with discovering something already known to the experimenters which was not in its training set.
>Did you create anything that you are certain your peers would recognize as more "novel" than anything a LLM could produce?
Yes, I have definitely created things without first reading every book in the library and memorizing thousands of existing functionally-equivalent solutions to the same problem. So have you, so long as I'm not actually debating an LLM right now.
visarga•2mo ago
The secret ingredient is the world outside, and past experiences of the world, which are unique for each human. We stumble onto novelty in the environment. But AI can do that too: AlphaGo's move 37 is an example. Much stumbling around leads to discoveries even for AI. The environment is the key.
gf000•2mo ago
One grade might be your example, while something like Gödel's incompleteness theorems or Einstein's relativity could go into a different grade.
n8cpdx•2mo ago
https://github.com/ranni0225/WRK
sosuke•2mo ago
The working memory it holds is still extremely small compared to what we would need for regular open-ended tasks.
Yes, there are outliers, and I'm not being specific enough, but I can't type that much right now.
nestorD•2mo ago
I can vouch for the fact that LLMs are great at searching in the original language, summarizing key points to let you know whether a document might be of interest, then providing you with a translation where you need one.
The fun part has been building tools to turn Claude Code and Codex CLI into capable research assistants for that type of project.
throwup238•2mo ago
What does that look like? How well does it work?
I ended up writing a research TUI with my own higher level orchestration (basically have the thing keep working in a loop until a budget has been reached) and document extraction.
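For what it's worth, that kind of orchestration is conceptually just a bounded loop. A minimal sketch, assuming a hypothetical `client.complete(messages) -> (reply, cost_usd)` wrapper around whichever LLM API is in use (the wrapper, the DONE convention, and the budget numbers are all made up here):

```python
def research_loop(client, task: str, budget_usd: float = 5.0, max_steps: int = 50):
    """Keep the model working on `task` until it declares DONE or the budget runs out."""
    messages = [{"role": "user", "content": task}]
    spent = 0.0
    for _ in range(max_steps):
        reply, cost = client.complete(messages)   # hypothetical wrapper call
        spent += cost
        messages.append({"role": "assistant", "content": reply})
        if "DONE" in reply or spent >= budget_usd:
            break
        # Nudge the model to continue until it signals completion.
        messages.append({"role": "user",
                         "content": "Continue the research; say DONE when finished."})
    return messages, spent
```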
nestorD•2mo ago
But I realized I was not using it much because it was that big and inflexible (plus I keep wanting to stamp out all the bugs, which I do not have the time to do on a hobby project). So I ended up extracting it into MCPs (equipped to do full-text search and download OCR from the various databases I care about) and AGENTS.md files (defining pipelines, as well as patterns for both searching behavior and reporting of results). I also put together a sub-agent for translation (cutting away all tools besides reading and writing files, and giving it some document-specific contextual information).
That lets me use Claude Code and Codex CLI (which, anecdotally, I have found to be the better of the two for that kind of work; it seems to deal better with longer inputs produced by searches) as the driver, telling them what I am researching and maybe how I would structure the search, then letting them run in the background before checking their report and steering the search based on that.
It is not perfect (if a search surfaces 300 promising documents, it will not check all of them, and it often misunderstands things due to lacking further context), but I now find myself reaching for it regularly, and I polish out problems one at a time. The next goal is to add more data sources and to maybe unify things further.
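As a rough illustration of the MCP side of that kind of setup (not the author's actual code), a minimal server built with the official MCP Python SDK (`pip install mcp`) and its FastMCP helper might look like this; the two backend functions are stubs standing in for whatever archive API is actually being wrapped:

```python
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("archive-search")

def search_archive(query: str) -> list[dict]:
    raise NotImplementedError("wrap the archive's real search endpoint here")

def download_ocr(document_id: str) -> str:
    raise NotImplementedError("wrap the archive's real OCR endpoint here")

@mcp.tool()
def full_text_search(query: str, max_results: int = 20) -> list[dict]:
    """Full-text search over the archive; returns id/title/snippet records."""
    return search_archive(query)[:max_results]

@mcp.tool()
def fetch_ocr(document_id: str) -> str:
    """Download the OCR text for a given document id."""
    return download_ocr(document_id)

if __name__ == "__main__":
    mcp.run()  # stdio transport, so Claude Code or Codex CLI can attach to it
```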
throwup238•2mo ago
This has been the biggest problem for me too. I jokingly call it the LLM halting problem because it never knows the proper time to stop working on something, finishing way too fast without going through each item in the list. That’s why I’ve been doing my own custom orchestration, drip feeding it results with a mix of summarization and content extraction to keep the context from different documents chained together.
Especially working with unindexed content like colonial documents where I’m searching through thousands of pages spread (as JPEGs) over hundreds of documents for a single one that’s relevant to my research, but there are latent mentions of a name that ties them all together (like a minor member of an expedition giving relevant testimony in an unrelated case). It turns into a messy web of named entity recognition and a bunch of more classical NLU tasks, except done with an LLM because I’m lazy.
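A minimal sketch of that drip-feed pattern, with `llm(prompt) -> str` standing in for any chat-completion call (the prompt wording and length cap are arbitrary):

```python
def drip_feed(llm, documents: list[str], question: str) -> str:
    """Fold documents one at a time into a running digest so the context stays small."""
    digest = "Nothing relevant found yet."
    for i, doc in enumerate(documents):
        digest = llm(
            f"Research question: {question}\n"
            f"Running digest so far:\n{digest}\n\n"
            f"New document {i + 1}:\n{doc[:8000]}\n\n"   # crude length cap
            "Extract any names, dates, or testimony relevant to the question "
            "and return an updated digest of at most 500 words."
        )
    return digest
```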
kace91•2mo ago
Completely off topic, but out of curiosity, where are you reading these documents? As a Spaniard I’m kinda interested.
throwup238•2mo ago
The hard part is knowing where to look, since most of the images haven't gone through HTR/OCR or indexing, so you have to understand Spanish colonial administration and go through the collections to find things.
[1] https://pares.cultura.gob.es/pares/en/inicio.html
dr_dshiv•2mo ago
https://Ancientwisdomtrust.org
Also working on kids' handwriting recognition for https://smartpaperapp.com
smusamashah•2mo ago
Those clones are all HTML/CSS; the same goes for the game clones made by Gemini.
Aperocky•2mo ago
Thanks for this, I was almost convinced and about to re-think my entire perspective and experience with LLMs.
viftodi•2mo ago
There are plenty of so-called Windows (or other) web 'OS' clones.
A couple of these were actually posted on HN this very year.
Here is one example I googled that was also on HN: https://news.ycombinator.com/item?id=44088777
This is not an OS as in emulating a kernel in JavaScript or WASM; it is a web app that looks like the desktop of an OS.
I have seen plenty of such projects, some of which mimic the Windows UI entirely; you can find them via Google.
So this was definitely in the training data, and it is not as impressive as the blog post or the Twitter thread make it out to be.
The scary thing is that the replies in the Twitter thread show no critical thinking at all and are impressed beyond belief; they think it coded a whole kernel and OS, made an interpreter for it, ported games, etc.
I think this is the reason some people are so impressed by AI: when you can only judge an app visually, or only by how you interact with it, and you don't have the depth of knowledge to understand it, AI seems magical beyond comprehension.
But all this is only superficial, IMHO.
krackers•2mo ago
I don't doubt, though, that new models will be very good at frontend webdev. In fact this is explicitly one of the recent lmarena tasks, so all the labs have probably been optimizing for it.
risyachka•2mo ago
Literally the most basic HTML/CSS; not sure why it is even included in benchmarks.
ACCount37•2mo ago
An LLM being able to build up interfaces that look recognizably like the UI of a real OS? That sure suggests a degree of multimodal understanding.
viftodi•2mo ago
This is still a challenging task and requires a lot of work to get that far.
jchw•2mo ago
https://x.com/chetaslua/status/1977936585522847768
> I asked it for windows web os as everyone asked me for it and the result is mind blowing , it even has python in terminal and we can play games and run code in it
And of course
> 3D design software, Nintendo emulators
No clue what these refer to, but to be honest it sounds like they've mostly made incremental improvements to one-shotting capabilities. I wouldn't be surprised if Gemini 2.5 Pro could get a Game Boy or NES emulator working well enough to boot Tetris or Mario; while that is a decent chunk of code, there's an absolute boatload of emulator code on the Internet, and the complexity is lower than you might imagine. (I have written a couple of toy Game Boy emulators from scratch myself.)
Don't get me wrong, it is pretty cool that a machine can do this. A lot of work people do today just isn't that novel and if we can find a way to tame AI models to make them trustworthy enough for some tasks it's going to be an easy sell to just throw AI models at certain problems they excel at. I'm sure it's already happening though I think it still mostly isn't happening for code at least in part due to the inherent difficulty of making AI work effectively in existing large codebases.
But I will say that people are a little crazy sometimes. Yes, it is fascinating that an LLM, which is essentially an extremely fancy token predictor, can one-shot a mostly correct web app, apparently without any feedback such as actually running the application or even seeing editor errors, at least as far as we know. This is genuinely impressive and interesting, and not the aspect anyone seeks to downplay. However, consider this: as simple as an NES is compared to even moderately newer machines, to write an NES emulator you have to know how an NES works and have strategies for how to emulate it, which don't necessarily follow from just reading specifications or NES program disassembly. The existence of many toy NES emulators and a very large amount of documentation on the Internet for the NES hardware and its inner workings, as well as the 6502, means that LLMs have a lot of training data to help them out.
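To make the "you have to know how to emulate it" point concrete: the core strategy in most toy emulators is a fetch-decode-execute loop like the illustrative sketch below (two 6502 opcodes only, nowhere near a real core):

```python
class ToyCPU:
    """Illustrative fetch-decode-execute loop; real cores add ~150+ opcodes,
    flags, cycle counting, memory mapping, and PPU/APU timing."""
    def __init__(self, program: bytes):
        self.mem = bytearray(0x10000)
        self.mem[0:len(program)] = program
        self.pc = 0   # program counter
        self.a = 0    # accumulator

    def step(self):
        opcode = self.mem[self.pc]; self.pc += 1
        if opcode == 0xA9:                        # LDA #imm: load accumulator
            self.a = self.mem[self.pc]; self.pc += 1
        elif opcode == 0x8D:                      # STA abs: store accumulator
            lo, hi = self.mem[self.pc], self.mem[self.pc + 1]; self.pc += 2
            self.mem[hi << 8 | lo] = self.a
        else:
            raise NotImplementedError(f"opcode {opcode:#04x}")

cpu = ToyCPU(bytes([0xA9, 0x42, 0x8D, 0x00, 0x02]))  # LDA #$42; STA $0200
cpu.step(); cpu.step()
assert cpu.mem[0x0200] == 0x42
```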
I think these tasks being extremely well-covered in the training data gives people unrealistic expectations. You could probably pick a simpler machine that an LLM would do significantly worse at, even though a human who knows how to write emulation software could definitely handle it. Not sure what to pick, but let's say SEGA's VMU units for the Dreamcast: a very small, simple device, and I reckon there should be information about it online, but it's going to be somewhat limited. You might think, "But that's not fair. It's unlikely to be able to one-shot something like that without mistakes with so much less training data on the subject." Exactly. In the real world, that comes up. Not always, but often. If it didn't, programming would be an incredibly boring job. (For some people, it is, and these LLMs will probably be disrupting that...) That's not to say that AI models can never do things like debug an emulator or even do reverse engineering on their own, but it's increasingly clear that this won't emerge from strapping agents on top of transformers predicting tokens. And since there is a very large portion of work in the world that is not very novel, I can totally understand why everyone is trying to squeeze this model as far as it goes. Gemini and Claude are shockingly competent.
I believe many of the reasons people scoff at AI are fairly valid, even if they don't always come from a rational mindset, and I try to keep my usage of AI relatively tasteful. I don't like AI art, and I personally don't like AI code. I find the push to put AI in everything incredibly annoying, and I worry about the clearly circular AI market and overhyped expectations. I dislike the way AI training has ripped up the Internet, violated people's trust, and led to a more closed Internet. I dislike that sites like Reddit are capitalizing on all of the user-generated content that made them rich in the first place, just to crap on the users who submitted it.
But I think that LLMs are useful, and useful LLMs could definitely be created ethically; it's just that the current AI race has everyone freaking the fuck out. I continue to explore use cases. I find that LLMs have gotten increasingly good at analyzing disassembly, though it varies depending on how well-covered the machine is in their training data. I've also found that LLMs can one-shot useful utilities and do a decent job. I had an LLM one-shot a utility to dump the structure of a simple common file format so I could debug something. It probably only saved me about 15-30 minutes, but still, in that case I truly believe it saved me time, as I didn't spend any time tweaking the result; it compiled, and it worked correctly.
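The comment doesn't say which format, but as a hypothetical example of the kind of one-shot utility meant, a PNG chunk dumper is about this size:

```python
import struct, sys, zlib

def dump_png(path: str):
    """Print each chunk's type, length, and CRC status for a PNG file."""
    with open(path, "rb") as f:
        assert f.read(8) == b"\x89PNG\r\n\x1a\n", "not a PNG file"
        while True:
            header = f.read(8)
            if len(header) < 8:
                break
            length, ctype = struct.unpack(">I4s", header)
            data = f.read(length)
            crc = struct.unpack(">I", f.read(4))[0]
            ok = crc == (zlib.crc32(ctype + data) & 0xFFFFFFFF)
            print(f"{ctype.decode('ascii'):>4}  {length:8d} bytes  crc {'ok' if ok else 'BAD'}")
            if ctype == b"IEND":
                break

if __name__ == "__main__":
    dump_png(sys.argv[1])
```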
It's going to be troublesome to truly measure how good AI is. If you knew nothing about writing emulators, being able to synthesize an NES emulator that can at least boot a game may seem unbelievable, and to be sure it is obviously a stunning accomplishment from a PoV of scaling up LLMs. But what we're seeing is probably more a reflection of very good knowledge rather than very good intelligence. If we didn't have much written online about the NES or emulators at all, then it would be truly world-bending to have an AI model figure out everything it needs to know to write one on-the-fly. Humans can actually do stuff like that, which we know because humans had to do stuff like that. Today, I reckon most people rarely get the chance to show off that they are capable of novel thought because there are so many other humans that had to do novel thinking before them. Being able to do novel thinking effectively when needed is currently still a big gap between humans and AI, among others.
ninetyninenine•2mo ago
Basically we all know that AI is just a stochastic parrot autocomplete. That's all it is. Anyone who doesn't agree with me is of lesser intelligence, and I feel the need to inform them of things that are obvious: AI is not a human; it does not have emotions. It's just a search engine. Those people who are using AI to code and do things that are indistinguishable from human reasoning are liars. I choose to focus on what AI gets wrong, like hallucinations, while ignoring the things it gets right.
hju22_-3•2mo ago
Well, there's your first problem.
ninetyninenine•2mo ago
But yes. I am the unique one.
ninetyninenine•2mo ago
But articles on IQ, cognition, and psychology are extremely common on HN. Enough to be noticeably out of place.
ninetyninenine•2mo ago
You seem like a high-IQ individual. So someone with your intellectual capability must be offended that I would even suggest that HNers love to think of themselves as smart.
ninetyninenine•2mo ago
Look, no offense. The truth is sometimes like that. Everybody needs a bit of it to stay grounded.