> what LLMs do is string together words in a statistically highly probable manner.
This is not incorrect, but it's no longer a sufficient mental model for reasoning models. For example, while researching new monitors today, I told Gemini to compare $NEW_MODEL_1 with $NEW_MODEL_2. Its training data did not contain information about either model, but it was capable of searching the Internet to find information about both and provide me with a factual (and, yes, I checked, accurate) comparison of the differences in the specs of the models as well as a summary of sentiment for reliability etc for the two brands.
> Currently available software may very well make human drivers both more comfortable and safe, but the hype has promised completely autonomous cars reliably zipping about in rush hour traffic.
And this is already not hype, it's reality anywhere Waymo operates.
Anyway, you're moving the goalposts here. Waymo is operating at scale in actual human cities in actual rush hour traffic. Sure, it would struggle in Buffalo during a snowstorm or in Mumbai during the monsoon, but so do human drivers.
We don't expect technology to be on par with human capabilities but to exceed them.
All of the above happened over the last ~20 years or so. The progression clearly seems to point to this being more than hype, even if it takes us longer to realize than originally anticipated.
In fact, cricket doesn't even _have_ goalposts, it has wickets. Driving in cities outside North America is very different.
Waymo is testing in Japan: https://waymo.com/blog/2025/04/new-beginnings-in-japan
10 years ago the claim was that "cars can't drive autonomously," Waymo quietly chips away to the point that they absolutely can drive autonomously, even in an unpredictable environment (with evidently drastically lower-than-human accident rates, for example), and the reaction of those original people is to say "yeah but it can't drive in [even more complex place]"
Sure, that's not exactly surprising. We generally don't design technology to do the most complex version of the task it's supposed to do first. We generally start with a simpler scenario it can accomplish and progressively enhance it as we learn more. Cars have been doing that for decades.
So perhaps the tech doesn't work in Mumbai or Rome yet. Maybe we'll advance the tech to do that thing, or maybe we'll come up with a different solution to autonomous driving in these places if we find out it'll be more expensive to advance this technology than it will be to do something else instead. But either way, it's already doing the thing that many, many people claimed it can't do, and those people are now claiming there's something else it can't do. That is the very definition of moving the goalposts.
Perfect example of the saying: "if you have a big problem, first solve the smaller problems. Then your bigger problem may turn out to be not so big after all".
Current AI is much like that: one 'little' problem after another being solved (or at least, progressing).
Having navigation and music, and lane assist, and adaptive cruise control, and some cars that can operate autonomously in some environments is great, but it's not what we meant when we said self driving cars.
Today, you absolutely can "get in a car, tell it where you want to go, and it goes there while you read a book" - it's literally what Waymo is and has been doing. And now we're saying it can't do it in Mumbai, so it's still not self-driving.
At some point, the distinction seems pointless. We are undeniably continuing to make progress on the road to autonomous driving, and it does work in certain scenarios today. To suggest things are slowing down because we haven't met the most expansive interpretation of the words is neither helpful nor correct.
...Can you cite that?
> And then the thing we said they can't do changed to something else.
...And they were the same people?
> We are undeniably continuing to make progress
Where did anyone deny this?
> To suggest things are slowing down
Where did anyone make this argument?
The quote from TFA:
> but the hype has promised completely autonomous cars reliably zipping about in rush hour traffic.
The author did not restrict that to SF, and is presumably referring to "hype" that "promised" this globally.
> Currently available software may very well make human drivers both more comfortable and safe...
Which is objectively not what Waymo does, and whether intentional or not, invalidates the progress that has been made.
Also, immediately preceding that:
> Driverless vehicles in closed systems have been in use for a long time.
Which is also not what current frontier self driving technology is.
> Where did anyone make this argument?
The title of the article is quite literally "Is Winter Coming?"
And lol at anyone who thinks any urban driving environment is “highly controlled”.
Aren’t you confusing “navigating” vs “driving”?
I remember the first time I went to visit my father in Kathmandu, asking him what his address was, and him patiently explaining to me that the street he lived on simply had no name or unique identifier. Or driving for the first time in Vietnam and being inducted into a traffic system where your only responsibility is the cone of things you can see in front of you. Or the terror of realizing that your second taxi driver of the day in Bangkok is literally on speed.
All this to say: no, I assume he means he can’t handle driving in Mumbai.
1. People do not, as a matter of their daily complaints, complain about bad traffic, bad drivers, dents, door dings, etc.
2. There are fewer accidents per capita than in the US.
3. Insurance is required, and body shops work better than in the US.
4. Electrification of the tuk-tuk fleet is... impressive.
Waymo/autonomous driving would drastically slow down most of the transportation infrastructure in most of the world. I don't think Waymo should spend billions figuring out how to drive better than Indians.
My favorite data point here is Cairo: the sound of traffic there is horns blaring and metal-on-metal. Driving in Cairo is a contact sport. And it doesn't seem to matter how nice a car is: a fancy Mercedes will have as many body dents as a rust bucket Lada.
If you skip this two-part understanding then you run the risk of missing when the agent decided not to do a search for some reason and is therefore entirely dependent on statistical probability in the training data. I've personally seen people without this mental model take an LLM at its word when it was wrong because they'd gotten used to it looking things up for them.
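To make that two-part structure concrete, here is a minimal sketch of an agent loop (the `call_model` and `web_search` functions are hypothetical stand-ins, not any vendor's actual API): the useful part is the audit trail, which tells you whether an answer was grounded in a fresh search or came straight out of the model's training data.

```python
# Hypothetical agent loop -- illustrative only, not a real vendor API.

def web_search(query: str) -> str:
    """Stand-in for a real search tool; would return result snippets."""
    raise NotImplementedError

def call_model(messages: list[dict]) -> dict:
    """Stand-in for an LLM call; returns either a tool request or a final answer."""
    raise NotImplementedError

def run_agent(question: str) -> tuple[str, list[str]]:
    messages = [{"role": "user", "content": question}]
    searches_made: list[str] = []              # audit trail of tool use
    while True:
        reply = call_model(messages)
        if reply.get("tool") == "web_search":
            query = reply["arguments"]["query"]
            searches_made.append(query)        # the model chose to look it up
            messages.append({"role": "tool", "content": web_search(query)})
        else:
            # No tool call this turn: if searches_made is empty, the answer
            # rests entirely on training-data statistics, nothing fresh.
            return reply["content"], searches_made
```

If `searches_made` comes back empty for a question about a monitor released last month, you are looking at exactly the failure mode described above.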
"Completely" here should be expanded to include all the unique and unforseen circumstances a driver might encounter, such as a policeman directing traffic manually or any other "soft" situation that is not well represented in training.
Not to mention the somewhat extreme amount of a priori and continuous mapping that goes into operating a fleet of AVs. That is hardly to be considered "Completely autonomous".
This isn't just pedantry, the disconnect between a technical person's deep understanding and a common user's everyday experience is pretty much what the article hinges on. Try taking a Waymo from SF to NYC. This seems like something a "Completely autonomous" car should be able to do given a layperson's understanding of "Completely", without the experts' long list of caveats.
But this feature was a staple of most online shops that sell monitors and a bunch of "review" sites. You don't need a highly complex system to compare 2 monitors, you need a spreadsheet.
I guess the article fails to admit that when you have billions of connected points in a vector space, "stringing together" is not simply "stringing together". I'm not a fanboy but somehow GPT/attention based logic is capable of parsing input and data then remodeling it in depths that are surprising.
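For what "attention based logic" means mechanically, here is a toy numpy sketch of scaled dot-product attention (random data, single head, purely illustrative): each token's representation becomes a weighted mix of every other token's, which is already a long way from naive word-by-word stringing.

```python
import numpy as np

def attention(Q, K, V):
    """Toy single-head scaled dot-product attention: each output row is a
    weighted blend of all rows of V, weighted by query/key similarity."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # (tokens, tokens) affinities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over keys
    return weights @ V                               # contextualised token vectors

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))       # 4 tokens, 8-dim embeddings, random just for shape
print(attention(x, x, x).shape)   # (4, 8): every token now mixes in every other token
```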
You told an agent, not just an LLM.
> And this is already not hype, it's reality anywhere Waymo operates.
Some beg to differ; see e.g. https://www.youtube.com/watch?v=040ejWnFkj0 .
I want to reiterate that I don't want dull, minimal writing. I don't subscribe to the "reduce your word count until it can't be reduced any further" style of writing advice. I just think that many people have very similar ideas about AI (and have written very similar things), and if you have something to say that you haven't seen expressed before, it is worthwhile (imo) to express it without preamble.
Summarize and critique this argument in a series of bullet points.
More seriously though, I think there is a lack of rigorous thinking about AI specifically and technology in general. And hence you get a lot of these rambling thought-style posts which are no doubt by intelligent people with something compelling to say, but without any fundamental method for analyzing those thoughts.
Which is why I really recommend taking a course in symbolic logic or analytic philosophy, if you are able to. You’ll quickly learn how to communicate your ideas in a straightforward, no nonsense manner.
Do you have any free online course recommendations?
There are a bunch of lectures on YouTube about analytic philosophy though, and from a quick look they seem solid.
In professional settings, brevity is often mistaken for inexperience or a weak position. As the thinking goes, a competent engineer should be able to defend every position they take like a PhD candidate defending their dissertation. At the same time, however, excess verbosity is viewed as distinctly “cold” and “engineer” in tone, and frowned upon by non-technical folks in my experience; they wanted an answer, not an explainer.
The problem is that each of us has the data points of what succeeds in convincing others: the longer argument, every single time. Thus we use it in our own writing because we want to convince the imagined reader (as well as ourselves) that our position is correct, or at the very least, sound. In doing so we write lengthy posts, while often doing research to validate our positions with charts, screenshots, Wikipedia articles, news sources, etc. It’s as much about convincing ourselves as it is other readers, hence why we go for longer posts based on real world experiences.
One plot twist, subjective to me: my lengthy posts are also about quelling my brain, in a very real sense. My brain is the reader, and if I do not get everything out of my head about that topic and onto “paper”, it will continue to dwell and gnaw on the missed points in perpetuity. Thus, 5k-word posts about things like the inefficiency of hate in Capital or a Systems Analysis of American Hegemony, just so I can have peace and quiet in my own head by getting it completely out of said head.
1) If I am considering possible objections to my position, I have to be very clear which points I am raising only for the sake of argument, and which are the ones I am actually advocating for, or else it will appear confused or self-contradictory.
A related issue is to preempt possible objections to the point where the reader might lose track of the main issue.
2) After making several passes to hone my position, it can seem so obvious to me that what I write for the reader is too terse for anyone who is approaching the issue for the first time.
I find myself writing longer and more defensively because lots of people don't understand nuance or subtext. Forget hyperbole or humour - lots of technical readers lack the ability to understand them.
Finally, editing is hard work. Revising and refining a document often takes several times longer than writing the first draft.
She showed me the result and I immediately saw the logical flaws and pointed them out to her. She pressed the model on it and it of course apologized and corrected itself. Out of curiosity I tried the prompt again, this time using financial jargon that I was familiar with and my wife was not. The intended meaning of the words was the same, the only difference is that my prompt sounded like it came from someone who knew finance. The result was that the model got it right and gave an explanation for the reasoning in exacting detail.
It was an interesting result to me because it shows that experts in a field are not only more likely to recognize when a model is giving incorrect answers but they're also more likely to get correct answers because they are able to tap into a set of weights that are populated by text that knew what it was talking about. Lay people trying to use an LLM to understand an unfamiliar field are vulnerable to accidentally tapping into the "amateur" weights and ending up with an answer learned from random Reddit threads or SEO marketing blog posts, whereas experts can use jargon correctly in order to tap into answers learned from other experts.
Far too often it'll cheerily apologise and correct its own answer.
Secondly, it can be even worse. I’ve been "gaslighted" when pressing on answers I knew were incorrect (in this case, cryptography). It comes up with extremely plausible-sounding arguments, specifically addressing my counterpoints, and even a chain of reasoning, yet still isn’t correct. You’d have to be a domain expert to tell it’s wrong, at which point it makes no sense to use LLMs in the first place.
It just leaves you with two contradictory statements, much like the man with two watches who never knows the correct time.
That reminded me how important it is to give it the full parameters and context of my question, including things you could assume another human being would just get. It also has a sort of puppy-dog's eagerness to please that I've had to tell it not to let get in the way of objective analysis. Sometimes the "It's awesome that you asked about that" stuff verges on a Hitchhiker's Guide joke. Maybe that's what they were going for.
And even if it were a misinterpretation, the result is still largely the same: if you don't know how to ask good questions you won't get good answers, which makes it dangerous to rely on the tools for things that you're not already an expert in. This is in contrast to all the people who claim to be using them for learning about important concepts (including lots of people who claim to be using them as financial advisors!).
The difference is that a human doctor probably has a lot of context about you and the situation you're in, so that they probably guess what your intention behind the question is, and adjust their answer appropriately. When you talk to an LLM, it has none of that context. So the comparison isn't really fair.
Has your mom ever asked you a computer question? Half of the time the question makes no sense and explaining to her why would take hours, and then she still wouldn't get it. So the best you can do is guess what she wants based on the context you have.
Yeah, we're basically repeating the "search engine/query" problem, just slightly differently. Using a search engine the right way has always been a skill you needed to learn, and those who didn't learn it always got poor results, and many times took those results at face value. Then Google started giving "answers", so if your query is shit, the "answer" most likely is too.
Point is, I don't think this phenomenon is new, it's just way less subtle today with LLMs, at least for people who have expertise in the subjects.
But, I work in healthcare and have enough knowledge of health to know that CKD almost certainly could not advance fast enough to be the cause of the kidney value changes in the labs that were only 6 weeks apart. I asked the LLM if that's the best explanation for these values given they're only 6 weeks apart, and it adjusted its answer to say CKD is likely not the explanation as progression would happen typically over 6+ months to a year at this stage, and more likely explanations were nephrotoxins (recent NSAID use), temporary dehydration, or recent infection.
We then spoke to our vet who confirmed that CKD would be unlikely to explain a shift in values like this between two tests that were just 6 weeks apart.
That would almost certainly throw off someone with less knowledge about this, however. If the tests were 4-6 months apart, CKD could explain the change. It's not an implausible explanation, but it skipped over a critical piece of information (the time between tests) before originally coming to that answer.
However, they do have particular types of failure modes that they're more prone to, and this is one of them. So they're imperfect.
ChatGPT is not reliable for medical diagnosis.
While it can summarize symptoms, explain conditions, or clarify test results using public medical knowledge, it:
• Is not a doctor and lacks clinical judgment
• May miss serious red flags or hallucinate diagnoses
• Doesn’t have access to your medical history, labs, or physical exams
• Can’t ask follow-up questions like a real doctor would
I am suggesting that today's best in class models (Gemini 2.5 Pro and o3, for example), when given the same context that a physician has access to (labs, prior notes, medication history, diagnosis history, etc), and given an appropriate eval loop, can achieve similar diagnostic accuracy.
I am not suggesting that patients turn to ChatGPT for medical diagnosis, or that these tools are made available to patients to self diagnose, or that physicians can or should be replaced by an LLM.
But there absolutely is a role for an LLM to play in diagnostic workflows to support physicians and care teams.
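As a hedged sketch of what that could look like (every field name and function below is hypothetical, not a description of any real clinical system): assemble the same structured context a physician sees, ask for a ranked differential, and score it retrospectively against confirmed diagnoses.

```python
# Illustrative only: hypothetical record fields and model call, no real clinical API.

def build_context(case: dict) -> str:
    return (
        f"Labs: {case['labs']}\n"
        f"Medications: {case['medications']}\n"
        f"Diagnosis history: {case['diagnosis_history']}\n"
        f"Prior notes: {case['prior_notes']}\n"
        "List the three most likely diagnoses, most likely first."
    )

def llm_differential(prompt: str) -> list[str]:
    """Stand-in for a call to a frontier model returning a ranked differential."""
    raise NotImplementedError

def top3_accuracy(cases: list[dict]) -> float:
    """Fraction of retrospective cases where the confirmed diagnosis
    appears in the model's top three suggestions."""
    hits = sum(
        case["confirmed_diagnosis"] in llm_differential(build_context(case))
        for case in cases
    )
    return hits / len(cases)
```

A number like that is what lets "similar diagnostic accuracy" be checked rather than asserted, with the physician remaining the consumer of the differential rather than the one being replaced.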
My fear is that people treat AI like an oracle when they should be treating it just like any other human being.
I have a personal gripe about this: bringing an unfinished tool to market and then prophesying about its usefulness. And how we all better get ready for it. This seems very hand-wavy and is looking more and more like vaporware.
It's like trying to quickly build a house on an unfinished foundation. Why are we rushing to build? Can't we get the foundational things right first?
> It was an interesting result to me because it shows that experts in a field are not only more likely to recognize when a model is giving incorrect answers but they're also more likely to get correct answers because they are able to tap into a set of weights that are populated by text that knew what it was talking about. Lay people trying to use an LLM to understand an unfamiliar field are vulnerable to accidentally tapping into the "amateur" weights and ending up with an answer learned from random Reddit threads or SEO marketing blog posts, whereas experts can use jargon correctly in order to tap into answers learned from other experts.
Couldn't it be the case that people who are knowledgeable in the topic (in this case recognizable to the AI by their choice of wording) need different advice than people who know less about the topic?
To give one specific example from finance: if you know a lot about finance, getting some deep analysis and advice about what is the best way to trade some exotic options is likely sound advice. On the other hand, for people who are not deeply into finance the best advice is likely rather "don't do it!".
> for people who are not deeply into finance the best advice is likely rather "don't do it!".
Oh boy, more nanny software. This future blows.
I think this topic is a little more complicated: it is really a balancing act for the model between
1. "giving the best possible advice to the respective person given their circumstances" vs
2. "giving the most precise answer to the query to the user"
(if you ask me, the best option would be to give the user a choice here, but that would overwhelm many users)
- Freedom-loving people will hate it if they don't get 2
- On the other hand, many people would like to actually get the advice that is most helpful to them (i.e. 1), and not the one that may answer their question exactly, but is likely a bad idea for them
1. The AI will never know the user well enough to predict what will be best for them. It will resort to treating everybody like children. In fact, many of the crude ways LLMs currently steer and censor are already infantilizing.
2. The users' benefit vs. "for your own good" as defined by a product vendor's financial interest is a scam that vendors have perpetrated for ages. Even the unsubtle version of it has a bunch of stooges and chumps that defend it. Things will not improve in the users' favor when it's harder to notice, and easier to pretend they're not malicious.
3. A bunch of Californians using the next wave of tech to spread cultural imperialism is better than China doing it, I guess. But why are those my options?
Also, how you ask matters a lot. Sometimes it just wants to make you happy with whatever answer; if you go along without skepticism, it will definitely produce garbage.
Fun story: at a previous job a Product Manager made someone work a full week on a QR-code standard that doesn't exist, except in ChatGPT's mind. It produced test cases and examples, but since nobody had a way to test them, nothing caught the problem internally.
When it was sent to a bank in Sweden to test, the customer was just "wait this feature doesn't exist in Sweden" and a heated discussion ensued until the PM admitted using ChatGPT to create the requirements.
This may or may not be easily possible by tweaking current training techniques. But it shows the many edge cases that still need to be addressed by AI models.
Obviously those exposed to the AI hype will tell you that there is no winter.
Until the music stops and little to no one can make money out of this AI race to zero.
Half the world runs on Big Tech. Some of them have cash reserves bigger than the GDP of sizeable countries. They lead in R&D investment: https://www.rdworldonline.com/top-15-rd-spenders-of-2024/
> Obviously those exposed in the AI hype will tell you that there is no winter.
Go look at how much money was spent on AI R&D in the last AI 'summers' (and winters). Pennies compared to the billions and billions of dollars the private and public sector is throwing at it right now.
Will some investments turn out to be a waste of time and money? Yes.
Will investment be reduced to a fraction of what it is today? Hell no.
The music stops when humans are economically obsolete.
At the end of the day, "AI" really just means throwing expensive algorithms at problems we've labeled as "subjective" and hoping for the best. More compute, faster communication, bigger storage, and we get to run more of those algorithms. Odds are, the real bottleneck is hardware, not software. Better hardware just lets us take bolder swings at problems, basically wasting even more computing power on nothing.
So yeah, we’ll get yet another AI boom when a new computing paradigm shows up. And that boom will hit yet another AI winter, because it'll smack into the same old bottleneck. And when that winter hits, we'll do what we've always done. Move the goalposts, lower the bar, and start the cycle all over again. Just with new chips this time.
Ah, Jesus. I should quit drinking Turkish coffee.
A lot of people (still a tiny proportion of the population) will be loud in opposition but ultimately overwhelmed by the nihilism and indifference of the aggregate.
The loudest will be those who perceive some loss to their own lifestyle that relies on exploiting other’s attention, as AI presents new risk to their attention grabbing behaviors.
Then they will die off and humanity will carry on with AI not them.
Circle of life Simba.
GaggiX•7h ago
Reasoning models like o1 had not yet been released at that time. It's amazing how much progress has been made since then.
Edit: also, Search wasn't available at the time, as the blog mentions "citations".
netdevphoenix•7h ago
We are just getting cars of different shapes and colours, with built-in speakers and radio. Not exactly progress
eisfresser•6h ago
That was only six months ago. I don't think this is an argument that things are slowing down (yet).
pixl97•6h ago
Progress isn't a smooth curve but more step like.
Also, the last 10% of getting AI right is 90% of the work, but it doesn't seem that way to us humans. I don't think you understand the gigantic impact that last 10% is going to make on the world and how fast it will change things once we accomplish it.
Personally, I hope it takes us a while. We're not ready for this as a society and planet.
patapong•6h ago
Thus, I think we can compare them to electricity - a sophisticated technology with a ton of potential, which will take years to fully exploit, even if there are no more fundamental breakthroughs. But also not the solution to every single problem.
empath75•4h ago
Your car example is a perfect one -- society was _completely reordered_ around the car, even though the fundamental technology behind the car didn't change from the early 20th century until the invention of the electric car.
Jedd•6h ago
Spring 2024 for me was from the 1st of September to the 30th of November.