> what LLMs do is string together words in a statistically highly probable manner.
This is not incorrect, but it's no longer a sufficient mental model for reasoning models. For example, while researching new monitors today, I told Gemini to compare $NEW_MODEL_1 with $NEW_MODEL_2. Its training data did not contain information about either model, but it was capable of searching the Internet to find information about both and provide me with a factual (and, yes, I checked, accurate) comparison of the differences in the specs of the models as well as a summary of sentiment for reliability etc for the two brands.
> Currently available software may very well make human drivers both more comfortable and safe, but the hype has promised completely autonomous cars reliably zipping about in rush hour traffic.
And this is already not hype, it's reality anywhere Waymo operates.
Anyway, you're moving the goalposts here. Waymo is operating at scale in actual human cities in actual rush hour traffic. Sure, it would struggle in Buffalo during a snowstorm or in Mumbai during the monsoon, but so do human drivers.
We don't expect technology to be on par with human capabilities but to exceed them.
All of the above happened over the last ~20 years or so. The progression clearly seems to point to this being more than hype, even if it takes us longer to realize than originally anticipated.
In fact, cricket doesn't even _have_ goalposts, it has wickets. Driving in cities outside North America is very different.
Waymo is testing in Japan: https://waymo.com/blog/2025/04/new-beginnings-in-japan
10 years ago the claim was that "cars can't drive autonomously," Waymo quietly chips away to the point that they absolutely can drive autonomously, even in an unpredictable environment (with evidently drastically lower-than-human accident rates, for example), and the reaction of those original people is to say "yeah but it can't drive in [even more complex place]"
Sure, that's not exactly surprising. We generally don't design technology to do the most complex version of the task it's supposed to do first. We generally start with a simpler scenario it can accomplish and progressively enhance it as we learn more. Cars have been doing that for decades.
So perhaps the tech doesn't work in Mumbai or Rome yet. Maybe we'll advance the tech to do that thing, or maybe we'll come up with a different solution to autonomous driving in these places if we find out it'll be more expensive to advance this technology than it will be to do something else instead. But either way, it's already doing the thing that many, many people claimed it can't do, and those people are now claiming there's something else it can't do. That is the very definition of moving the goalposts.
Perfect example of the saying: "if you have a big problem, first solve the smaller problems. Then your bigger problem may turn out to be not so big after all".
Current AI is much like that: one 'little' problem after another being solved (or at least, progressing).
Having navigation and music, and lane assist, and adaptive cruise control, and some cars that can operate autonomously in some environments is great, but it's not what we meant when we said self driving cars.
Today, you absolutely can "get in a car, tell it where you want to go, and it goes there while you read a book" - it's literally what Waymo is and has been doing. And now we're saying it can't do it in Mumbai, so it's still not self-driving.
At some point, the distinction seems pointless. We are undeniably continuing to make progress on the road to autonomous driving, and it does work in certain scenarios today. To suggest things are slowing down because we haven't met the most expansive interpretation of the words is neither helpful nor correct.
...Can you cite that?
> And then the thing we said they can't do changed to something else.
...And they were the same people?
> We are undeniably continuing to make progress
Where did anyone deny this?
> To suggest things are slowing down
Where did anyone make this argument?
The quote from TFA:
> but the hype has promised completely autonomous cars reliably zipping about in rush hour traffic.
The author did not restrict that to SF, and is presumably referring to "hype" that "promised" this globally.
> Currently available software may very well make human drivers both more comfortable and safe...
Which is objectively not what Waymo does, and whether intentional or not, invalidates the progress that has been made.
Also, immediately preceding that:
> Driverless vehicles in closed systems have been in use for a long time.
Which is also not what current frontier self driving technology is.
> Where did anyone make this argument?
The title of the article is quite literally "Is Winter Coming?"
And lol at anyone who thinks any urban driving environment is “highly controlled”.
Aren’t you confusing “navigating” vs “driving”?
I remember the first time I went to visit my father in Kathmandu, asking him what his address was, and him patiently explaining to me that the street he lived on simply had no name or unique identifier. Or driving for the first time in Vietnam and being inducted into a traffic system where your only responsibility is the cone of things you can see in front of you. Or the terror of realizing that your second taxi driver of the day in Bangkok is literally on speed.
All this to say: no, I assume he means he can’t handle driving in Mumbai.
1. People do not, as a matter of their daily complaints, complain about bad traffic, bad drivers, dents, door dings, etc.
2. There are fewer accidents per capita than in the US.
3. Insurance is required, and body shops work better than in the US.
4. Electrification of the tuk-tuk fleet is... impressive.
Waymo/autonomous driving would drastically slow down most of the transportation infrastructure in most of the world. I don't think Waymo should spend billions figuring out how to drive better than Indians.
My favorite data point here is Cairo: the sound of traffic there is horns blaring and metal-on-metal. Driving in Cairo is a contact sport. And it doesn't seem to matter how nice a car is: a fancy Mercedes will have as many body dents as a rust bucket Lada.
If you skip this two-part understanding then you run the risk of missing when the agent decided not to do a search for some reason and is therefore entirely dependent on statistical probability in the training data. I've personally seen people without this mental model take an LLM at its word when it was wrong because they'd gotten used to it looking things up for them.
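To make that two-part structure concrete, here is a minimal sketch of an agent loop (the `call_model` and `web_search` functions are hypothetical stand-ins, not any vendor's actual API): the useful part is the audit trail, which tells you whether an answer was grounded in a fresh search or came straight out of the model's training data.

```python
# Hypothetical agent loop -- illustrative only, not a real vendor API.

def web_search(query: str) -> str:
    """Stand-in for a real search tool; would return result snippets."""
    raise NotImplementedError

def call_model(messages: list[dict]) -> dict:
    """Stand-in for an LLM call; returns either a tool request or a final answer."""
    raise NotImplementedError

def run_agent(question: str) -> tuple[str, list[str]]:
    messages = [{"role": "user", "content": question}]
    searches_made: list[str] = []              # audit trail of tool use
    while True:
        reply = call_model(messages)
        if reply.get("tool") == "web_search":
            query = reply["arguments"]["query"]
            searches_made.append(query)        # the model chose to look it up
            messages.append({"role": "tool", "content": web_search(query)})
        else:
            # No tool call this turn: if searches_made is empty, the answer
            # rests entirely on training-data statistics, nothing fresh.
            return reply["content"], searches_made
```

If `searches_made` comes back empty for a question about a monitor released last month, you are looking at exactly the failure mode described above.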
"Completely" here should be expanded to include all the unique and unforseen circumstances a driver might encounter, such as a policeman directing traffic manually or any other "soft" situation that is not well represented in training.
Not to mention the somewhat extreme amount of a priori and continuous mapping that goes into operating a fleet of AVs. That is hardly to be considered "Completely autonomous".
This isn't just pedantry, the disconnect between a technical person's deep understanding and a common user's everyday experience is pretty much what the article hinges on. Try taking a Waymo from SF to NYC. This seems like something a "Completely autonomous" car should be able to do given a layperson's understanding of "Completely", without the experts' long list of caveats.
But this feature was a staple of most online shops that sell monitors and a bunch of "review" sites. You don't need a highly complex system to compare 2 monitors, you need a spreadsheet.
I guess the article fails to admit that when you have billions of connected points in a vector space, "stringing together" is not simply "stringing together". I'm not a fanboy but somehow GPT/attention based logic is capable of parsing input and data then remodeling it in depths that are surprising.
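For what "attention based logic" means mechanically, here is a toy numpy sketch of scaled dot-product attention (random data, single head, purely illustrative): each token's representation becomes a weighted mix of every other token's, which is already a long way from naive word-by-word stringing.

```python
import numpy as np

def attention(Q, K, V):
    """Toy single-head scaled dot-product attention: each output row is a
    weighted blend of all rows of V, weighted by query/key similarity."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # (tokens, tokens) affinities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over keys
    return weights @ V                               # contextualised token vectors

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))       # 4 tokens, 8-dim embeddings, random just for shape
print(attention(x, x, x).shape)   # (4, 8): every token now mixes in every other token
```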
You told an agent, not just an LLM.
> And this is already not hype, it's reality anywhere Waymo operates.
Some beg to differ; see e.g. https://www.youtube.com/watch?v=040ejWnFkj0 .
I want to reiterate that I don't want dull, minimal writing. I don't subscribe to the "reduce your word count until it can't be reduced any further" style of writing advice. I just think that many people have very similar ideas about AI (and have written very similar things), and if you have something to say that you haven't seen expressed before, it is worthwhile (imo) to express it without preamble.
Summarize and critique this argument in a series of bullet points.
More seriously though, I think there is a lack of rigorous thinking about AI specifically and technology in general. And hence you get a lot of these rambling thought-style posts which are no doubt by intelligent people with something compelling to say, but without any fundamental method for analyzing those thoughts.
Which is why I really recommend taking a course in symbolic logic or analytic philosophy, if you are able to. You’ll quickly learn how to communicate your ideas in a straightforward, no nonsense manner.
Do you have any free online course recommendations?
There are a bunch of lectures on YouTube about analytic philosophy though, and from a quick look they seem solid.
In professional settings, brevity is often mistaken for inexperience or a weak position. As the thinking goes, a competent engineer should be able to defend every position they take like a PhD candidate defending their dissertation. At the same time, however, excess verbosity is viewed as distinctly “cold” and “engineer” in tone, and frowned upon by non-technical folks in my experience; they wanted an answer, not an explainer.
The problem is that each of us has the data points of what succeeds in convincing others: the longer argument, every single time. Thus we use it in our own writing because we want to convince the imagined reader (as well as ourselves) that our position is correct, or at the very least, sound. In doing so we write lengthy posts, while often doing research to validate our positions with charts, screenshots, Wikipedia articles, news sources, etc. It’s as much about convincing ourselves as it is other readers, hence why we go for longer posts based on real world experiences.
One plot twist, subjective to me: my lengthy posts are also about quelling my brain, in a very real sense. My brain is the reader, and if I do not get everything out of my head about that topic and onto “paper”, it will continue to dwell and gnaw on the missed points in perpetuity. Thus, 5k-word posts about things like the inefficiency of hate in Capital or a Systems Analysis of American Hegemony, just so I can have peace and quiet in my own head by getting it completely out of said head.
1) If I am considering possible objections to my position, I have to be very clear which points I am raising only for the sake of argument, and which are the ones I am actually advocating for, or else it will appear confused or self-contradictory.
A related issue is to preempt possible objections to the point where the reader might lose track of the main issue.
2) After making several passes to hone my position, it can seem so obvious to me that what I write for the reader is too terse for anyone who is approaching the issue for the first time.
I find myself writing longer and more defensively because lots of people don't understand nuance or subtext. Forget hyperbole or humour - lots of technical readers lack the ability to understand them.
Finally, editing is hard work. Revising and refining a document often takes several times longer than writing the first draft.
She showed me the result and I immediately saw the logical flaws and pointed them out to her. She pressed the model on it and it of course apologized and corrected itself. Out of curiosity I tried the prompt again, this time using financial jargon that I was familiar with and my wife was not. The intended meaning of the words was the same, the only difference is that my prompt sounded like it came from someone who knew finance. The result was that the model got it right and gave an explanation for the reasoning in exacting detail.
It was an interesting result to me because it shows that experts in a field are not only more likely to recognize when a model is giving incorrect answers but they're also more likely to get correct answers because they are able to tap into a set of weights that are populated by text that knew what it was talking about. Lay people trying to use an LLM to understand an unfamiliar field are vulnerable to accidentally tapping into the "amateur" weights and ending up with an answer learned from random Reddit threads or SEO marketing blog posts, whereas experts can use jargon correctly in order to tap into answers learned from other experts.
Far too often it'll cheerily apologise and correct its own answer.
Secondly, it can be even worse. I’ve been "gaslighted" when pressing on answers I knew were incorrect (in this case, cryptography). It comes up with extremely plausible-sounding arguments, specifically addressing my counterpoints, and even a chain of reasoning, yet still isn’t correct. You’d have to be a domain expert to tell it’s wrong, at which point it makes no sense to use LLMs in the first place.
It just leaves you with two contradictory statements, much like the man with two watches who never knows the correct time.
That reminded me how important it is to give it the full parameters and context of my question, including things you could assume another human being would just get. It also has a sort of puppy-dog's eagerness to please that I've had to tell it not to let get in the way of objective analysis. Sometimes the "It's awesome that you asked about that" stuff verges on a Hitchhiker's Guide joke. Maybe that's what they were going for.
And even if it were a misinterpretation, the result is still largely the same: if you don't know how to ask good questions you won't get good answers, which makes it dangerous to rely on the tools for things that you're not already an expert in. This is in contrast to all the people who claim to be using them for learning about important concepts (including lots of people who claim to be using them as financial advisors!).
The difference is that a human doctor probably has a lot of context about you and the situation you're in, so that they probably guess what your intention behind the question is, and adjust their answer appropriately. When you talk to an LLM, it has none of that context. So the comparison isn't really fair.
Has your mom ever asked you a computer question? Half of the time the question makes no sense and explaining to her why would take hours, and then she still wouldn't get it. So the best you can do is guess what she wants based on the context you have.
Yeah, we're basically repeating the "search engine/query" problem, just slightly differently. Using a search engine the right way has always been a skill you needed to learn, and those who didn't learn it always got poor results, and many times took those results at face value. Then Google started giving "answers", so if your query is shit, the "answer" most likely is too.
Point is, I don't think this phenomenon is new, it's just way less subtle today with LLMs, at least for people who have expertise in the subjects.
But, I work in healthcare and have enough knowledge of health to know that CKD almost certainly could not advance fast enough to be the cause of the kidney value changes in the labs that were only 6 weeks apart. I asked the LLM if that's the best explanation for these values given they're only 6 weeks apart, and it adjusted its answer to say CKD is likely not the explanation as progression would happen typically over 6+ months to a year at this stage, and more likely explanations were nephrotoxins (recent NSAID use), temporary dehydration, or recent infection.
We then spoke to our vet who confirmed that CKD would be unlikely to explain a shift in values like this between two tests that were just 6 weeks apart.
That would almost certainly throw off someone with less knowledge about this, however. If the tests were 4-6 months apart, CKD could explain the change. It's not an implausible explanation, but it skipped over a critical piece of information (the time between tests) before originally coming to that answer.
However, they do have particular types of failure modes that they're more prone to, and this is one of them. So they're imperfect.
ChatGPT is not reliable for medical diagnosis.
While it can summarize symptoms, explain conditions, or clarify test results using public medical knowledge, it:
• Is not a doctor and lacks clinical judgment
• May miss serious red flags or hallucinate diagnoses
• Doesn’t have access to your medical history, labs, or physical exams
• Can’t ask follow-up questions like a real doctor would
I am suggesting that today's best in class models (Gemini 2.5 Pro and o3, for example), when given the same context that a physician has access to (labs, prior notes, medication history, diagnosis history, etc), and given an appropriate eval loop, can achieve similar diagnostic accuracy.
I am not suggesting that patients turn to ChatGPT for medical diagnosis, or that these tools are made available to patients to self diagnose, or that physicians can or should be replaced by an LLM.
But there absolutely is a role for an LLM to play in diagnostic workflows to support physicians and care teams.
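As a hedged sketch of what that could look like (every field name and function below is hypothetical, not a description of any real clinical system): assemble the same structured context a physician sees, ask for a ranked differential, and score it retrospectively against confirmed diagnoses.

```python
# Illustrative only: hypothetical record fields and model call, no real clinical API.

def build_context(case: dict) -> str:
    return (
        f"Labs: {case['labs']}\n"
        f"Medications: {case['medications']}\n"
        f"Diagnosis history: {case['diagnosis_history']}\n"
        f"Prior notes: {case['prior_notes']}\n"
        "List the three most likely diagnoses, most likely first."
    )

def llm_differential(prompt: str) -> list[str]:
    """Stand-in for a call to a frontier model returning a ranked differential."""
    raise NotImplementedError

def top3_accuracy(cases: list[dict]) -> float:
    """Fraction of retrospective cases where the confirmed diagnosis
    appears in the model's top three suggestions."""
    hits = sum(
        case["confirmed_diagnosis"] in llm_differential(build_context(case))
        for case in cases
    )
    return hits / len(cases)
```

A number like that is what lets "similar diagnostic accuracy" be checked rather than asserted, with the physician remaining the consumer of the differential rather than the one being replaced.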
My fear is that people treat AI like an oracle when they should be treating it just like any other human being.
I have a personal gripe about this: bringing an unfinished tool to market and then prophesying about its usefulness. And how we all better get ready for it. This seems very hand-wavy and is looking more and more like vaporware.
It's like trying to quickly build a house on an unfinished foundation. Why are we rushing to build? Can't we get the foundational things right first?
> It was an interesting result to me because it shows that experts in a field are not only more likely to recognize when a model is giving incorrect answers but they're also more likely to get correct answers because they are able to tap into a set of weights that are populated by text that knew what it was talking about. Lay people trying to use an LLM to understand an unfamiliar field are vulnerable to accidentally tapping into the "amateur" weights and ending up with an answer learned from random Reddit threads or SEO marketing blog posts, whereas experts can use jargon correctly in order to tap into answers learned from other experts.
Couldn't it be the case that people who are knowledgeable in the topic (in this case recognizable to the AI by their choice of wording) need different advice than people who know less about the topic?
To give one specific example from finance: if you know a lot about finance, getting some deep analysis and advice about what is the best way to trade some exotic options is likely sound advice. On the other hand, for people who are not deeply into finance the best advice is likely rather "don't do it!".
> for people who are not deeply into finance the best advice is likely rather "don't do it!".
Oh boy, more nanny software. This future blows.
I think this topic is a little more complicated: it is really a balancing act for the model between
1. "giving the best possible advice to the respective person given their circumstances" vs
2. "giving the most precise answer to the query to the user"
(if you ask me, the best option would be to give the user a choice here, but that would overwhelm many users)
- Freedom-loving people will hate it if they don't get 2
- On the other hand, many people would like to actually get the advice that is most helpful to them (i.e. 1), and not the one that may answer their question exactly, but is likely a bad idea for them
1. The AI will never know the user well enough to predict what will be best for them. It will resort to treating everybody like children. In fact, many of the crude ways LLMs currently steer and censor are already infantilizing.
2. The users' benefit vs. "for your own good" as defined by a product vendor's financial interest is a scam that vendors have perpetrated for ages. Even the unsubtle version of it has a bunch of stooges and chumps that defend it. Things will not improve in the users' favor when it's harder to notice, and easier to pretend they're not malicious.
3. A bunch of Californians using the next wave of tech to spread cultural imperialism is better than China doing it, I guess. But why are those my options?
Also, how you ask matters a lot. Sometimes it just wants to make you happy with whatever answer; if you go along without skepticism, it will definitely produce garbage.
Fun story: at a previous job a Product Manager made someone work a full week on a QR-code standard that doesn't exist, except in ChatGPT's mind. It produced test cases and examples, but since nobody had a way to test them, nothing caught the problem internally.
When it was sent to a bank in Sweden to test, the customer was just "wait this feature doesn't exist in Sweden" and a heated discussion ensued until the PM admitted using ChatGPT to create the requirements.
This may or may not be easily possible by tweaking current training techniques. But it shows the many edge cases that still need to be addressed by AI models.
Obviously those exposed to the AI hype will tell you that there is no winter.
Until the music stops and little to no one can make money out of this AI race to zero.
Half the world runs on Big Tech. Some of them have cash reserves bigger than the GDP of sizeable countries. They lead in R&D investment: https://www.rdworldonline.com/top-15-rd-spenders-of-2024/
> Obviously those exposed in the AI hype will tell you that there is no winter.
Go look at how much money was spent on AI R&D in the last AI 'summers' (and winters). Pennies compared to the billions and billions of dollars the private and public sector is throwing at it right now.
Will some investments turn out to be a waste of time and money? Yes.
Will investment be reduced to a fraction of what it is today? Hell no.
The music stops when humans are economically obsolete.
At the end of the day, "AI" really just means throwing expensive algorithms at problems we've labeled as "subjective" and hoping for the best. More compute, faster communication, bigger storage, and we get to run more of those algorithms. Odds are, the real bottleneck is hardware, not software. Better hardware just lets us take bolder swings at problems, basically wasting even more computing power on nothing.
So yeah, we’ll get yet another AI boom when a new computing paradigm shows up. And that boom will hit yet another AI winter, because it'll smack into the same old bottleneck. And when that winter hits, we'll do what we've always done. Move the goalposts, lower the bar, and start the cycle all over again. Just with new chips this time.
Ah, Jesus. I should quit drinking Turkish coffee.
A lot of people (still a tiny proportion of the population) will be loud in opposition but ultimately overwhelmed by the nihilism and indifference of the aggregate.
The loudest will be those who perceive some loss to their own lifestyle that relies on exploiting other’s attention, as AI presents new risk to their attention grabbing behaviors.
Then they will die off and humanity will carry on with AI not them.
Circle of life Simba.
GaggiX•7h ago
Reasoning models like o1 had not yet been released at that time. It's amazing how much progress has been made since then.
Edit: also, Search wasn't available at the time, as the blog mentions "citations".
netdevphoenix•7h ago
We are just getting cars of different shapes and colours, with built-in speakers and radio. Not exactly progress
eisfresser•6h ago
That was only six months ago. I don't think this is an argument that things are slowing down (yet).
pixl97•6h ago
Progress isn't a smooth curve but more step like.
Also, the last 10% of getting AI right is 90% of the work, but it doesn't seem that way to us humans. I don't think you understand the gigantic impact that last 10% is going to make on the world and how fast it will change things once we accomplish it.
Personally, I hope it takes us a while. We're not ready for this as a society and planet.
patapong•6h ago
Thus, I think we can compare them to electricity - a sophisticated technology with a ton of potential, which will take years to fully exploit, even if there are no more fundamental breakthroughs. But also not the solution to every single problem.
empath75•4h ago
Your car example is a perfect one -- society was _completely reordered_ around the car, even though the fundamental technology behind the car didn't change from the early 20th century until the invention of the electric car.
Jedd•6h ago
Spring 2024 for me was from the 1st of September to the 30th of November.