
AI Responses May Include Mistakes

https://www.os2museum.com/wp/ai-responses-may-include-mistakes/
180•userbinator•1d ago

Comments

_wire_•1d ago
Google's Gemini in search just makes up something that arbitrarily appears to support the query without care for context and accuracy. Pure confabulation. Try it for yourself. Ridiculous. It works as memory support if you know the result you're looking for, but if you don't, you can't trust it as far as you can throw it.

If you look carefully at Google Veo output, it's similarly full of holes.

It's plain there's no reasoning whatsoever informing the output.

Veo output with goofy wrongness

https://arstechnica.com/ai/2025/05/ai-video-just-took-a-star...

Tesla FSD goes crazy

https://electrek.co/2025/05/23/tesla-full-self-driving-veers...

ImPostingOnHN•1d ago
gemini is the worst LLM I've used, whether directly or through search. As in your experience, it regularly makes stuff up, like language/application features, or command flags (including regarding google products), and provides helpful references to sources which do not say what is cited from them.

in my case, it does so roughly half the time, which is the worst proportion, because that means I can't even slightly rely upon the truth being the opposite of the output.

JimDabell•1d ago
Gemini was underwhelming until 2.5 Pro came along, which is very good. But in my experience all of the Google models are far worse than everything else when it comes to hallucination.
MangoToupe•1d ago
As a corollary, though, the chatbots are probably the most creative.
mdp2021•1d ago
Professional creatives do measure their intuitions against a number of constraints...
flomo•1d ago
I had a question about my car, so I googled '[year] [make] [model] [feature]'. This seems like the sort of thing that Google had always absolutely nailed. But now, 90% of the page was ai slop about wrong model, wrong year, even the wrong make. (There was one youtube which was sorta informative, so some credit.)

But way, way down at the very bottom of the page, there was the classic Google search answer on a totally unrelated car forum. Thanks CamaroZ28.com!

camillomiller•1d ago
This is a very, very good point. If this were happening with queries we never used before, or a new type of question, then I would have some patience. But it happens exactly with the formulations that used to give you the best results in the SERP!
TeMPOraL•1d ago
That was true before AI too (I know, I did such searches myself). Google results have been drowning in slop for over a decade now - it was just human-generated slop, a.k.a. content marketing and SEO stuff.

I'm not defending the AI feature here, just trying to frame the problem: the lies and hallucinations were already there, but nobody cared because apparently people don't mind being constantly lied to by other people.

flomo•1d ago
No, I'm not complaining about SEO shit...

The thing is the primordial google had the answer, but Google themselves buried it under 100+ links of Google-generated slopdiarrhea, most of which didn't even factually fit the question, and was not at all relevant to my automobile.

ben_w•1d ago
Indeed, this is part of the current monopoly abuse case they're facing — did Google deliberately choose to make search worse, because that causes people to return to the search results page and spend more time looking at ads, and they knew they could get away with it?

e.g. bottom of first page, penultimate paragraph https://www.justice.gov/d9/2023-11/417557.pdf

flomo•1d ago
Yeah they definitely did that.

But this AI diarrhea is so awful, I honestly can't see any angle in giving me tons of bad results about a minor feature of my car. (I should sell it and use waymo??) Maybe the really sharp monopolists ran for the hills when the DOJ sheriffs showed up, and now Google Search is being run by former Yahoo execs.

gambiting•1d ago
I'm a member of a few car groups on Facebook and the misinformation coming from Google is infuriating, because people treat it as gospel and then you have to explain to them that the AI slop they were shown as the top result in Google is not - in fact - correct.

As a simple example - someone googled "how to reset sensus system in Volvo xc60" and Google told them to hold the button under the infotainment screen for 20 seconds and they came to the group confused why it doesn't work. And it doesn't work because that's not the way to do it, but Google told them so, so of course it must be true.

flomo•1d ago
Exactly, the "AI" opiateturd results are often for the completely wrong year/model or just obviously false. I'm certain Google used to be really good at this kind of thing.
nyarlathotep_•15h ago
I wonder this too--whether there will actually be more work created by LLM generations, from a whole new genre of customer support that now not only has to know the "material" but also has to provide secondary support in resolving issues customers have from incorrect nonsense.
dingnuts•23h ago
Ad supported search has been awful for a few years now, just buy a Kagi subscription and you'll be like me: horrified but mildly amused with a dash of "oh that explains a lot" when people complain about Google
MangoToupe•1d ago
I'm honestly so confused how people use LLMs as a replacement for search. All chatbots can ever find are data tangential to the stuff I want (eg i ask for a source, it gives me a quote). Maybe i was just holding search wrong?
Garlef•1d ago
Maybe that's because we're conditioned by the UX of search.

But another thing I find even more surprising is that, at least initially, many expected that the LLMs would give them access to some form of higher truth.

MangoToupe•22h ago
I think you might be on to something. I've found myself irritated that i can't just chuck keywords at LLMs.
MaxikCZ•1d ago
It's good to be shown a direction. When I only have a vague idea of what I want, AI usually helps me frame it into searchable terms I had no clue existed.
mdp2021•1d ago
> LLMs as a replacement for search

Some people expect LLMs as part of a better "search".

LLMs should be integrated into search, as a natural application: search results can heavily depend on happy phrasing, search engines work through sparse keywords, and LLMs let you use structured natural language (not "foo bar baz" but "Which foo did a bar baz?" - which should be resistant to term variation and exclude different semantics related to those otherwise sparse terms).

But it has to be done properly - understand the question, find material, verify the material, produce a draft reply, verify the draft vis-a-vis the material, maybe iterate...
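
A rough sketch of that loop, with hypothetical llm() and search() helpers standing in for whatever model and index you use (not any particular product's API):

     type LLM = (prompt: string) => Promise<string>;
     type Search = (query: string) => Promise<string[]>;

     async function answerWithSources(question: string, llm: LLM, search: Search): Promise<string> {
       // understand the question: rewrite it into a search-friendly query
       const query = await llm(`Rewrite as a short search query: ${question}`);

       // find material
       const docs = (await search(query)).join("\n---\n");

       // produce a draft reply grounded only in the material
       let draft = await llm(`Using ONLY these sources, answer "${question}":\n${docs}`);

       // verify the draft vis-a-vis the material, maybe iterate
       for (let i = 0; i < 2; i++) {
         const check = await llm(`Reply OK if the draft is supported by the sources, otherwise list the problems.\nDraft: ${draft}\nSources: ${docs}`);
         if (check.trim().startsWith("OK")) break;
         draft = await llm(`Fix these problems using only the sources.\nProblems: ${check}\nDraft: ${draft}\nSources: ${docs}`);
       }
       return draft;
     }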

1659447091•1d ago
DuckDuckGo AI assist is going in the right direction, imo. It will pull info from Wikipedia and use math and map tools plus other web sources, and it has been mostly accurate for me on the search page.

The chat option uses gpt-4o with web search and was able to provide links to colonial map resources I was curious about after falling down that rabbit hole. It also gave me general (& proper) present day map links to the places I was looking for in the map sites I asked for.

It did get confused a few times when I was trying to get present-day names of old places I had forgotten; like the Charles River in VA, for which it kept trying to send me to Boston or to Charles City Co. on the James River and told me to look for it around there...

The York River wiki page clearly says it was once the Charles River. Maybe I wasn't asking the right questions. For more unique things it was pretty helpful though, and it saved the endless searching-with-100-tabs adventure.

TeMPOraL•1d ago
> eg i ask for a source, it gives me a quote

It should give you both - the quote should be attributed to where it was found. That's, generally, what people mean when they ask or search for "a source" of some claim.

As for general point - using LLMs as "better search" doesn't really look like those Google quick AI answers. It looks like what Perplexity does, or what o3 in ChatGPT does when asked a question or given a problem to solve. I recommend checking out the latter; it's not perfect, but good enough to be my default for nontrivial searches, and more importantly, it shows how "LLMs for search" should work to be useful.

incangold•1d ago
I find LLMs are often better for X vs Y questions where search results were already choked by content farm chaff. Or at least LLMs present more concise answers, surrounded by fewer ads and less padding. Still have to double check the claims of course.
MangoToupe•22h ago
I think I'm discovering that I just tend to think in terms of content rather than questions
jazzyjackson•8h ago
Some chatbots plan a query and summarize what a search returns instead of trying to produce an answer on their own; I use perplexity a lot which always performs a search, I think ChatGPT et al have some kind of classifier to decide if web search is necessary. I especially use it when I want a suggestion without sifting through pages of top ten affiliate listicles (why is there a list of top 10 microwaves? I only need one microwave!)
camillomiller•1d ago
This baffles me like no other tech has done before. Google is betting its own core business on a pivot that relies on a massively faulty piece of technology. And as Ben Evans also says, promising that it will get better only gets you so far, it’s an empty promise. Yesterday AI overview made up an entire album by a dead Italian musician when I searched for a tribute event that was happening at a Berlin venue. It just took the name of the venue and claimed it was the most important work from that artist.

Funnily enough (not for Google), I copypasted that answer on chatGPT and it roasted AI Overview so bad on its mistakes and with such sarcasm that it even made me chuckle.

DanHulton•1d ago
It's the unfounded promises that this will be solved because the tech will only get better that really upset me. Because sure, it will get better, I'm pretty certain of that. They'll have additional capabilities, they'll have access to more-recent data, etc. But "better" does not necessarily equate to "will fix the lying problem." That's a problem that is BAKED INTO the technology, and requires some kind of different approach to solve -- you can't just keep making a hammer bigger and bigger in the hopes that one day it'll turn into a screwdriver.

Before LLMs really took off, we were in the middle of an "AI winter", where there just weren't any promising techs, at least none with sufficient funding attached to them. And it's WORSE now. LLMs have sucked all the air out of the room, and all of the funding out of other avenues of research. Technologies that were "10-20" years away now might be 30-40, because there's fewer people researching them, with less money, and they might even be completely different people trying to restart the research after the old ones got recruited away to work on LLMs!

mountainriver•12h ago
I really don’t understand the whole AI winter talk all the time. We haven’t had anything of the sort since 2008. There were tons of major RL advancements before ChatGPT that were stunning.

I would challenge anyone to find data to actually support any of these claims. ML spending has been up year over year since deep learning, and the models just keep getting better.

XorNot•1d ago
I use uBlock to remove Gemini responses from search, because even glancing at them is liable to bias my assumptions about whatever I'm looking for.

Information hygiene is a skill which started out important but is going to become absolutely critical.

justmarc•1d ago
We can't expect the vast majority of regular users to have any of that skill.

What is this going to lead to? fascinating times.

jobigoud•1d ago
It's very easy though, Right click > Block element > Create. Overlays show which blocks you are removing. Sliders can be used to increase/refine.

How can we make it even easier and visual? Customizing pages by removing elements should be part of the default browser experience to be honest. Like in the initial web where you would tweak the color of links, visited links, etc.

MaxikCZ•1d ago
Half my browser extensions have sole purpose of removing shit from sites I visit.

HN is like a unicorn that hasn't made me block a single thing yet.

alpaca128•1d ago
Ironically that's an AI tool I would use - one that can dynamically filter content from sites according to my preferences, to counter the algorithmic spam. It wouldn't be generative AI though, and that's the only kind of AI that matters right now apparently.
jazzyjackson•8h ago
Could be a good use of structured output. LLMs work okay as one-shot classifiers; you could define a schema that's just like array[]{xpath:true/false} and tell the bot what you want to see and what you don't want to see.
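
A rough sketch of what I mean (the chat() helper and the exact JSON shape are assumptions, not any particular vendor's API):

     type Verdict = { xpath: string; show: boolean };

     async function filterElements(
       chat: (prompt: string) => Promise<string>,
       xpaths: string[],
       preferences: string,
     ): Promise<Verdict[]> {
       const prompt =
         `My preferences: ${preferences}\n` +
         `For each XPath below, reply with JSON only: [{"xpath": "...", "show": true|false}]\n` +
         xpaths.join("\n");
       // one-shot classification; the extension then hides every element with show === false
       return JSON.parse(await chat(prompt)) as Verdict[];
     }
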
justmarc•1d ago
And suddenly this type of quality is becoming "normal" and acceptable now? Nobody really complains.

That is very worrying. Normally this would never fly, but nowadays it's kind of OK?

Why should false and/or inaccurate results be accepted?

Nuzzerino•1d ago
Complain to it enough times, remain resilient and you’ll eventually figure it out (that’s a wild card though). Or find someone who has and take their word for it (except you can’t because they’re probably indistinguishable from the ‘bot’ now according to the contradictory narrative). Iterate. Spiral. No one should have to go through that though. Be merciful.
chronid•1d ago
Suddenly? That's the level of quality that is standard in all software projects I've ever seen since I've started working in IT.

Enshittification is all around us and is unstoppable. Because we have deadlines to hit and goals to show the VP we reached. We broke everything and the software is only half working? Come on, that's an issue for the support and ops teams. On to the next beautiful feature we can put on marketing slides!

justmarc•6h ago
Sadly you are absolutely right.
TeMPOraL•1d ago
We lost that battle back when we collectively decided that sales and marketing is respectable work.
bheadmaster•1d ago
Hah. Good observation.

I often get in arguments about how I tend to avoid brands that put too much into marketing. Of course, theoretically, the amount of money a company puts into marketing doesn't automatically lower the quality of their products, but in my experience, the correlation is there. Whiskas, Coca-Cola, McDonald's, etc.

justmarc•6h ago
How would products get known, let alone sold, without this?
TeMPOraL•2h ago
How would you give your neighbor a warm welcome without setting their house on fire?

Scale and intent matter.

meander_water•1d ago
I've recently started wondering what the long-term impacts of AI slop are going to be. Will people get so sick of the sub-par quality that there will be a widespread backlash, and a renewed focus on handmade or artisanal products? Or will we go the other way, where everyone accepts the status quo and everything just gets shittier, and we have multiple cycles of AI slop trained on AI slop?
jazzyjackson•8h ago
I'm already seeing screen-free summer camps in my area. There's going to be a subset of the population that does not want to play along with calling hallucinations and deepfakes "progress," kids will be homeschooled more as parents lose their jobs and traditional classroom instruction loses effectiveness.

I thought the movie "the Creator" was pretty neat, it envisions a future where AI gets blamed for accidentally nuking Los Angeles so America bans it and reignites a kind of cold war with Asia which has embraced GAI and transcended the need for central governance. Really it's a film about war and how it can be started with a lie but continue out of real existential fear.

veunes•1d ago
And how quickly the bar is being lowered
krapp•1d ago
>Why should false and or inaccurate results be accepted?

The typical response is "because humans are just as bad, if not worse."

reaperducer•1d ago
> And suddenly this type of quality is becoming "normal" and acceptable now?

The notion that "computers are never wrong" has been engrained in society for at least a century now, starting with scifi, and spreading to the rest of culture.

It's an idea that has caused more harm than good.

rchaud•1d ago
> Normally this would never fly, but nowadays it's kind of OK?

We started down this path ever since obvious bugs were reframed as "hallucinations".

emrah•1d ago
When were search results 100% fact checked and accurate??
mdp2021•23h ago
For example, in the times of "lectures", where transmitted information was literally read (as the term says) in real time from the source to the public.

But in general, the (mis-)information that spinach contained so much iron as to be interchangeable with nails came from a typo so rare that it became anecdotal and generated cultural phenomena like Popeye.

Nursie•1d ago
Yep, I was looking up a hint for the Blue Prince game the other day for the (spoiler alert?) casino room.

Google’s AI results proceeded to tell me all about the games available at the Blue Prince Casino down the road from here, where I know for a fact there’s only a prison, a Costco, a few rural properties and a whole lot of fuck-all.

It’s amazing to watch it fill in absolute false, fabricated tripe at the top of their search page. It also frequently returns bad information on subjects like employment law and whatever else I look up.

It would be hilarious if people weren’t actually relying on it.

datavirtue•1d ago
I have had a lot of luck with Copilot conversations to research stocks and trading strategies. I am always skeptical of the results and verify everything with various sources, but it does help me find/get on the right track.
Kwpolska•1d ago
Google recently started showing me their AI bullshit. This made me pull the trigger and switch to DuckDuckGo as the primary search engine.

That said, some niche stuff has significantly better results in Google. But not in the AI bullshit. I searched for a very niche train-related word; the bullshit response said condescendingly "this word does not exist, maybe you meant [similarly sounding but completely different word], which in the context of trains means ...". The first real result? Turns out that word does exist.

christophilus•1d ago
What’s the word?
dijksterhuis•1d ago
fyi, you can remove any and all “ai” assistant bs etc from DDG if you use the noai subdomain (in case you wanna avoid their stuff, although it’s much less prominent anyway) https://noai.duckduckgo.com/
datavirtue•1d ago
I switched to DDG over seven years ago and just realized it had been that long when I read your comment. Google started wasting my time and I had to shift.
christophilus•1d ago
I’ve had good results with Brave search, which self reports to use: Meta Llama 3, Mistral / Mixtral, and CodeLLM. It’s not always 100% accurate, but it’s almost always done the trick and saved me digging through more docs than necessary.
veunes•1d ago
Yeah, it feels like we've crossed into a weird uncanny valley where AI outputs sound smarter than ever, but the underlying logic (or lack thereof) hasn't caught up
roywiggins•1d ago
I think it's just much easier for an LLM to learn how to be convincing than it is to actually be accurate. It just has to convince RLHF trainers that it's right, not actually be right. And the first one is a general skill that can be learned and applied to anything.

https://arxiv.org/html/2409.12822v1

minimaxir•1d ago
The simple "AI responses may include mistakes" disclaimer or ChatGPT's "ChatGPT can make mistakes. Check important info." CYA text at the bottom of the UI are clearly no longer sufficient. After years of news stories about LLM hallucinations in fact-specific domains and people still getting burnt by them, LLM providers should be more aggressive in educating users about their fallability since hallucinations can't ever be fully fixed, even if it means adding friction.
tbrownaw•1d ago
> should be more aggressive in educating users about their fallibility

This might be an "experience is the best teacher" situation. It'd probably be pretty hard to invent a disclaimer that'd be as effective as getting bit.

minimaxir•1d ago
Unfortunately, getting bit in cases such as publishing misinformation or false legal citations wastes everyone's time, not just their own.
eddythompson80•1d ago
That doesn't really make sense. You either make the LLM provider liable for the output of the model, or you have the current model. The friction already exists. All these AI companies and cloud providers are running "censored models" and more censorship is added at every layer. What would more friction be here? more pop-ups?

Doing the former basically means killing the model-hosting business. Companies could develop models, use them internally and give them to their employees, but no public APIs exist. Companies strike legally binding contracts to use/license each other's models, but the general public doesn't have access to those without something that would mitigate the legal risk.

Maybe years down the line, as attitudes soften, some companies would begin to push the boundaries. Automating the legal approval process, opening signups, etc.

minimaxir•1d ago
Yes, more popups, retention metrics be damned. Even 2 years since ChatGPT, many people still think it's omniscient, which is what's causing trouble.
eddythompson80•1d ago
I don't think more pop-ups solve anything. It'll just make a Chrome extension called "ChatGPTAutoAccept" get popular. You think someone who thinks it's omniscient will suddenly reconsider because a "reminder, this is dumb as shit" pop-up keeps annoying them every 5 minutes?
MaxikCZ•1d ago
People who are susceptible to reading AI slop as universal truth do so because they don't read much at all. I guess you would be surprised how huge an amount of users don't bother to read anything at all: a popup only exists in the sense of "how do I close this", which is solved by clicking the most visually distinct button. If you asked them what they clicked or what the popup was about, they'd look at you like you are crazy for even assuming they should know.
TeMPOraL•1d ago
Hyperbole is as much of a problem. ChatGPT is not omniscient, but it's also not "dumb as shit", at least not across the board. LLMs got popular in big part because they provide unique, distinctly new value.

This black and white assumption that because LLMs are not always giving probably correct answers therefore they are dangerous, reminds me of what the generation of my parents and teachers thought of Wikipedia when it became popular. The problems were different ("anyone can edit" vs. "hallucinates stuff"), but the mindset seems very similar.

neepi•1d ago
To be fair, people are pretty damn unintelligent when it comes to verifying information. Despite my academic background I catch myself skipping it all the time as well.

However LLMs amplify this damage by sounding authoritative on everything and even worse being promoted as authoritative problem solvers for all domains with a small disclaimer. This double think is unacceptable.

But if they made the disclaimer bigger, then the AI market would collapse in about an hour, much like people's usage does when they don't verify something and get shot down by someone actually authoritative. This has happened at work a couple of times and caused some fairly high-profile problems. Many people refuse to use it now.

What we have is a bullshit generator propped up by avoiding speaking the truth, because the truth compromises the promoted utility. Classic bubble.

userbinator•1d ago
The disclaimer needs to be in bold red text at the top.
YetAnotherNick•1d ago
You are assuming that the people burnt by LLM responses don't know that ChatGPT can make mistakes?
camillomiller•1d ago
Remember when Apple was roasted to hell anytime Maps would push you to make a wrong turn? Or when Google Maps would take you to the wrong place at the wrong time (like a sketchy neighborhood)? Those were all news stories they had to do PR crisis management for. Now they slap on a disclaimer like that and we're all good to go. The amount of public-opinion forgiveness these technologies are granted is disproportionate and disheartening.
arcanemachiner•1d ago
Yeah, but we're all used to having software integrated into our lives now. And we all know how shitty and broken software often is...
thejohnconway•1d ago
That always struck me as pretty overblown, given that before map apps, people got lost all the goddam time. It was a rare trip with any complexity that a human map reader wouldn’t make a mistake or two.

LLMs aren’t competing with perfect, they are competing with websites that may or may not be full of errors, or asking someone that may or may not know what they are talking about.

alpaca128•1d ago
Worse - LLMs are competing with inconvenience, and inconvenience always loses.
TeMPOraL•1d ago
That's literally the point of all progress, though.
alpaca128•1d ago
No. The point of progress is to improve the outcome. Improved outcome does not always align with convenience, and this is one example.

Critical thinking is inconvenient and does not scale, but it's very important for finding the truth. Technology that disincentivizes critical thought makes it easier to spread lies.

TeMPOraL•1d ago
> Critical thinking is inconvenient and does not scale, but it's very important for finding the truth. Technology that disincentivizes critical thought makes it easier to spread lies.

True. At the same time, technology that removes the need for critical thinking is a bona fide positive form of progress.

Think of e.g. consumer protection laws, and the larger body of laws (and systems of enforcement) surrounding commerce. Their main goal is to reduce the risk for customers - and therefore, their need to critically evaluate every purchase. You don't need as much critical thinking when you know certain classes of dangers are almost nonexistent; you don't need to overthink choices you know you can undo.

There are good and bad ways of eliminating inconvenience, but by itself, inconvenience is an indication of waste. Our lives are finite, and even our capacity to hope and dream is exhaustible. This makes inconvenience our enemy.

alpaca128•1d ago
The examples you listed work because they increase trust to a point people feel safe enough to not second-guess everything. I disagree that AI in its current form can be trusted. Food safety is enforced by law, correctness in Google searches isn't enforced at all, in fact Google is incentivized to decrease the quality to reduce running costs.

So yes, convenience and progress are strongly correlated but they're not the same.

ben_w•1d ago
Apple maps currently insists that there's a hotel and restaurant across the street from me.

According to the address on the business website that Apple Maps itself links to, the business is 432 km away from me.

mdp2021•1d ago
> The simple ...

No, improper phrasing. The correct disclaimer is, "The below engine is structurally unreliable".

--

Comment, snipers. We cannot reply to unclear noise.

nyarlathotep_•14h ago
> LLM providers should be more aggressive in educating users about their fallibility since hallucinations can't ever be fully fixed, even if it means adding friction

But they can't be, as the whole premise of the boom is replacing human intellectual labor. They've said as much on many, many occasions--see Anthropic's CEO going off about mass unemployment quite recently. How can the two of these coexist?

Biganon•1d ago
Make up a fake popular wisdom saying that sounds real, search for it, Gemini will gladly tell you it exists and explain it to you
mdp2021•1d ago
...As others have noted elsewhere: which Gemini? There are cheap ones and others proposed as flagship.
ghusbands•1d ago
The one that appears when you search with Google.
elmerfud•1d ago
AI is like that one guy who always can tell you something about anything with total confidence. So really not sure why anyone would trust it beyond a bar conversation.
vouaobrasil•1d ago
I think it's psychological. Most people use visual body cues to determine whether someone is lacking in confidence in their answer. AI does not have any cues to show a lack of confidence, and people also have a high trust in machine output because traditional algorithms always give the correct answer.

The percentage of people that will look at it critically is negligible.

normie3000•1d ago
> Most people use visual body cues to determine whether someone is lacking in confidence in their answer

Do they?

vouaobrasil•1d ago
I can certainly tell when someone is just bull**ing from their tone of voice long before they tell me the information.
tonyedgecombe•1d ago
You obviously haven’t met any of my past bosses. Some people have turned bullshitting into an art form.
ghusbands•1d ago
Which is notably not a "visual body cue"
JdeBP•1d ago
No "AI" company has yet had the bravery to name its product Cliff Clavin.

Bravery in several ways, that is. There's the risk of being sued by John Ratzenberger. (-:

mvdtnz•19h ago
> So really not sure why anyone would trust it beyond a bar conversation

Really, you don't know why? Maybe because it's being promoted as "AI" by companies with missions like "organise the world's information", who have spent decade now doing their best to provide accurate information to user queries?

charcircuit•1d ago
Gemini can handle this fine.
nehal3m•1d ago
Huh, I thought that AI overview feature is powered by Gemini.
charcircuit•1d ago
Gemini doesn't refer to a specific model so perhaps the one on the search page is weaker than the ones it offers in the app.
mdp2021•1d ago
They may have estimated the volume of replies in those "overview" pages and shrunk the costs by routing through a computationally lightweight model.

They may this way also have underestimated the reputational loss - the big umbrella of the "Ford Pinto case".

--

Edit: I was just now looking at Visual Capitalist's new "Ranked: 2025's 10 Largest S&P 500 Stocks". Is it possible that Alphabet being at the top, with 7.6% of the weight of the 500-item set, is paradoxically allowing it to afford more damage?

mucha•19h ago
Capability != Reliability
9x39•1d ago
Gemini appears tuned to try to handle the typical questions people type in, while more traditional things you search for get some confabulated nonsense.

I've observed a great deal of people trust the AI Overview as an oracle. IMO, it's how 'normal' people interact with AI if they aren't direct LLM users. It's not even age gated like trusting the news - trusting AI outputs seems to cross most demographics. We love our confident-based-on-nothing computer answers as a species, I think.

danielbln•1d ago
We love our confident-based-on-nothing answers period, computer or not.
chneu•1d ago
Most folks just want confirmation. They don't want to have their views/opinions changed. LLMs are good at trying to give folks what they're looking for.
mdp2021•1d ago
Repent.

You are not there to "love what gives you the kicks". That's a kind of love that should not exit the bedroom (better, the bathroom).

Llamamoe•1d ago
I already went through a realization a while ago that you just can't mention something to people anymore and expect them to be able to learn about it by searching the web, like it used to be possible, because everything is just unreliable misleading SEO spam slop.

I shudder to think how much worse this is going to be with "AI Overview". Are we entering an era of people googling "how does a printer work" and (possibly) being told that it's built by a system of pulleys and ropes and just trusting it blindly?

Because that's the kind of magnitude of errors I've seen in dozens of searches I've made in the domains I'm interested in, and I think everyone has seen the screenshots of even more outlandish - or outright dangerous - answers.

eddythompson80•1d ago
I think Google is in a particularly bad situation here.

For over a decade now, that spot on the search page had the "excerpt from a page" UI, which made a lot of sense. It cut out an extra click, and if you trusted the source site, and presumably Google's "Excerpt Extraction Technology" (whatever that was), what was left not to trust? It was a very trustworthy way to locate information.

Like if I search for a quick medical question and there is an excerpt from the Mayo Clinic, I trust the Mayo Clinic, so good enough for me. Sometimes I'd copy the excerpt from Google, go to the page and Ctrl-F it.

Google used to do a decent job of picking reputable sources, and the excerpts were indeed always found in the page in a non-altered context, so it was good enough to build trust. That system has degraded over the years in terms of how good it was at picking those reputable sources, most likely because it was SEO-gamed.

However, it has been replaced with the AI Overview. I'm not against AI, but AI is fundamentally different from "a relevant excerpt from a source you trust, with a verifiable source, in milliseconds".

tsunamifury•1d ago
How could you think this hard and be so far off. Google is in a hyper strong position here and I don’t even like them.

They can refine grounded results over time and begin serving up increasingly well reasoned results over time as Models improve cost effectively. Then that drives better vectors for ads.

Like what about this is hard to understand?

eddythompson80•1d ago
What about what is hard to understand?

Google did it because it's better for Google, yes. They no longer have to deal with people trying to hack SEO. Now you would have to figure out how to influence the training process of google to hijack that box. So it's better for Google to move to AI Overview. What's your point here?

I say Google is in a bad position morally or in terms of "doing the right thing" not that one would really expect it from a corporation per se. There is a distinction you know.

Google introduced the box as an "Excerpt from a search result" box. They traditionally put a lot of care into their search quality; it showed, and it built trust with their users. Over the years the search quality dropped, whether from less attention from Google or because it's a fundamentally harder problem to solve with far more motivated attackers. Yet the intrusion of bullshit websites into the "Excerpt from a search result" box still let you decide that you are not gonna trust medical advice from "mikeInTheDeep.biz". It wasn't ideal that they built trust and then let it slip, but being able to see a source with a quote makes it useful when you trust the source.

With AI Overview, you either trust it all, don't trust any of it, use it as confirmation bias, don't

geraneum•1d ago
> if they aren't direct LLM users

My manager, a direct LLM user, uses the latest models to confirm his assumptions. If they are not confirmed on the first try, he then proceeds to phrase the question differently until he gets what he wants from them.

edit: typo

jimmySixDOF•1d ago
Error control, as in LLM-as-a-Judge, needs to be integrated into every pipeline. There are use-at-home (sub-7B) sized models and nanos (fit into a browser) [1], so push it out to the edge if it's a Google-sized scale/cost problem. It should actually be like VirusTotal, where you get a consensus score of x/72 different lookups.

[1] see haize labs : https://www.haizelabs.com/product/judges

eddythompson80•1d ago
This is a perfect example of the snake-oil nonsense being sold in the AI tech market these days: people offering to wrap your LLM calls with another prompt asking the LLM to "reconsider" or "does this make sense to you", and selling you that artisanally crafted prompt at a premium.

This is simply an information retrieval problem. LLMs don't retrieve information perfectly from their context. They are very good at it, but every now and then they'll introduce a change here or there. Changing a "Hello" to "Hi" doesn't really make any difference, but changing a "PS/2 Model 286" to "PS/2 Model 280" makes a huge difference. The LLM "knows"* this at some level, because it "knows" that names are important to reproduce in exact form. But every now and then even names can change and still generally mean the same thing, so every now and then it'll change a name or an identifier for no reason.

some of my favorite descriptions of this I have heard from people:

- We need to introduce a reflection layer

- We need a supervisor tree-like checks on responses

- We need an evaluation-feedback mechanism for AI operations

- We need to agents that operates and judges/supervisors/evaluators

all apparently mean:

     let response = await getChatResponse([...messages, newMsg]);
     const {didGood, why} = await getChatResponse([RolePlayLikeAMeanJudgePrompt, ...messages, newMsg, response]);

     if (!didGood) {
        response = await getChatResponse([ThreateningMafiaGuyPrompt, ...messages, newMsg, response, why]);
     }

     // loop like 4 times, maybe use different models and egg them on each other, like use Grok and tell it you got the response from Claude and that Claude was shit talking Grok all the way through it. Like it was unnecessary tbh.
     // this makes Grok work extra hard. It's actually been measured by kill-a-watt.

*: I say "knows" to indicate just that the model training biases it that way
mdp2021•23h ago
We consider the LLM an "intuition" machine that can talk and partially understand, and we do not let it retrieve information from its faulty memory and will, but force it to use an implemented memex to produce any output.

"Tell us about X (X='PS/2 Models'); here are your encyclopedias: extract and formulate".

eddythompson80•22h ago
If you were to actually try that you'd know that approach doesn't really work either. Or rather, it's not the silver bullet you hope it is. If you still think that, go ahead and implement it. That's literally the main "output quality" struggle all AI providers are in.

If you're just building a chatbot (like a pure ChatGPT/Claude interface-like) you risk massively increasing your latency and degrading your overall result quality for an attempt to improve a small scenario here or there.

Seriously, try it. Take any "Tell us about X" prompt you like. Try it as-is with an LLM, then try it with + "; here are your encyclopedias: extract and formulate"

I guarantee you that 99 times out of 100, the LLM will reach out to the encyclopedia. The existing encyclopedia doesn't have a great LLM-like search interface that's able to find the parts most relevant to the LLM's query about X. In fact, you're building that part if I'm not mistaken. If you expect the encyclopedia to have that great search functionality that the LLM could use to always find the most relevant information about X, then you've just pushed the problem one layer down. Someone will actually eventually have to tackle it.

You can also see this in both ChatGPT and Claude outputs. Every now and then they will push a change to make it "more reliable", which basically makes it more likely to search the internet before answering a question. Which also happens to make it more likely to skew its output based on SEO, current popular news and other nonsense.

While nonscientific, I experience this every time ChatGPT or Claude decides to do a web search instead of just answering the question. Ask it "I like TV show X, suggest TV shows like that" or "I like product X, suggest a similar product". If it uses the internet to search, it's a summary of the top gamed SEO results - just whatever is popular atm, or whatever has commission links. Ask it not to use the internet and the result is surprisingly less... "viral, SEO-optimized, trended recently" type content.

mdp2021•14h ago
You are misunderstanding the proposed frame: implementations may be faulty, but the approach remains necessary for "LLMs as informers". I.e., the answer provider should only work vis-a-vis documentation founding the output.

This implies that if we do not have good enough ways to retrieve information from repositories, we will have to invent them. Because the "LLM as informer" can only be allowed to formulate what it will find through the memex.

It is possible that to that aim, LLMs can not be directly implemented as they are in the current general state.

Also, the problem of information reliability has to be tackled in order to build such a system (e.g. some sources rank higher).

It is not a solved problem, but it is a clear one. In mission critical applications, you would not even allow asking John at the nearby desk for information he may confuse.

rcarmo•1d ago
This is why Google has got search fundamentally wrong. They just don’t care about accuracy of results anymore, and worry mostly about providing a quick answer and a bunch of sponsored links below it.
Llamamoe•1d ago
Except that out of 10 answers, the "quick answer" is subtly wrong 6 times, egregiously wrong twice, and outright dangerous once. I've seen screenshots of stuff that would get people killed or in legal trouble.
dandanua•1d ago
They just continue the Eric Schmidt idea "More results are better than none". It has evolved to "It's better to hallucinate than produce a negative answer", I guess.
jkuli•1d ago
Humans can also make mistakes. This is the first test I apply every time: could it be that AI is actually more capable than a human? If industry decides that humans are more reliable, it will choose to use humans. Reliability is part of cost-effectiveness, and it's built into the business decision process.
theodric•1d ago
A human would only make up an answer like the ones in the article if it were a compulsive liar. A human would ideally say "I don't know" or at worst employ the "I'll confirm that and circle back" corpspeak evasion.
mdp2021•1d ago
Another case of badly computed reputational loss.

That context is not of "this more than that" comparisons, but of threshold: the service must be _adequate_.

If you don't have random humans capable of providing the needed service, find them. Do not employ random agents.

alpaca128•1d ago
The average Google user expects LLMs to be perfect and treats their responses as answers from an oracle, not just better than the average human.

It doesn't matter whether it's better than humans, the one thing that matters are the consequences of its widespread use.

neilv•1d ago
On the Google search Web site, the weak "AI responses may include mistakes." disclaimer small print is also hidden behind the "Show more" button.

When OpenAI launched ChatGPT, I had to explain to a non-CS professor that it wasn't AI like they're thinking of, but currently more like a computational parlor trick that looks a lot like AI.

But it turns out this parlor trick is awesome for cheating on homework.

Also good at cheating at many other kinds of work, if you don't care much about quality, nor about copyrights.

loa_in_•1d ago
It's a memory augmentation/information retrieval tool with a flexible input and output interface.
stavros•1d ago
I really don't understand the view that it's a "parlor trick that looks like AI". If it's not "a thing that can write code", but instead just looks like a thing that can write code (but can actually write code), it can write code. All the "no true Scotsman" stuff about what it's doing behind the scenes is irrelevant, because we have no idea what human brains are doing behind the scenes either.
keiferski•1d ago
It matters if we are making a distinction between essence and output.

On the output side, there isn't really a functional difference, at least in terms of more abstract things like writing code. Although I would argue that the output AI makes still doesn't match the complexity and nuance of an individual human being, and may never do so, simply because the AI is only simulating embodiment and existence in the world. It might need to simulate an Earth equivalent to truly simulate a human's personal output.

On the essence side, it's a much clearer distinction. We have numerous ways of determining whether a thing is human or not - biology, for one. It would take some serious sci-fi until we get to the point where an android is indistinguishable from a human on the cellular level.

ben_w•1d ago
> Although I would argue that the output AI makes still doesn’t match the complexity and nuance of an individual human being, though

LLMs are very good at nuance. Better than any human I've seen — so much so, I find it to be a tell.

> We have numerous ways of determining if a thing is human or not - biology, for one.

I don't care if the intelligence is human, I care if it's (1) (a) intelligent, (b) educated, and (2) has the ability to suffer or not so I know if it should have moral subject rights.

1a is slowly improving but we're guessing and experimenting: not really engineering intelligence, just scaling up the guesses that work OK. 1b was always easy; libraries fit "education" in isolation from the "intelligent" part of 1a. LLMs are a good enough combination of (a) and (b) to be interesting, potentially even an economic threat, depending on how long the time horizon between failures gets.

2 is pre-paradigmatic, we don't have enough understanding of the problem to ask the correct question — even ignoring AI for the moment, the same problem faces animal welfare (and why would the answer be the same for each of chimps, dogs, ravens, lobsters, and bees?) and even within humans on topics such as abortion, terminal stage of neurodegenerative conditions such as Alzheimer's, etc.

ben_w•1d ago
Although I broadly agree, I wouldn't go quite as far as where you say:

> All the "no true Scotsman" stuff about what it's doing behind the scenes is irrelevant, because we have no idea what human brains are doing behind the scenes either.

Computers and transistors have a massive speed advantage over biological brains and synapses — literally, not metaphorically, the same ratio as the speed difference between how far you walk in a day and continental drift, with your brain being the continental drift — which means they have the possibility of reading the entire Internet in a few weeks to months to learn what they know, and not the few tens to hundreds of millennia it would take a human.

Unfortunately, the method by which they acquire information and knowledge, is sufficiently inefficient that they actually need to read the entire Internet to reach the skill level of someone who has only just graduated.

This means I'm quite happy to *simultaneously* call them extremely useful, even "artificial general intelligence", and yet also agree with anyone who calls them "very very stupid".

If we actually knew how our brains did this intelligence thing, we could probably make AI genuinely smart as well as absurdly fast.

neilv•1d ago
Historically, there's been some discussion about that:

https://en.wikipedia.org/wiki/Chinese_room

hnlmorg•1d ago
Their point wasn’t that it’s not useful. It’s that it isn’t artificial intelligence like the masses consider the term.

You wouldn’t say Intellisense isn’t useful but you also wouldn’t call it “AI”. And what LLMs are like is basically Intellisense on steroids (probably more like a cocktail of speed and acid, but you get my point)

stavros•1d ago
If you'd call k-means AI but you wouldn't call LLMs AI, I'm so far off that reasoning that I don't think we can agree.
hnlmorg•1d ago
I’m not arguing that LLMs are not AI. The problem is that “AI” itself is a nonsense term. It’s been around since forever and used to describe a whole plethora of different behaviours.

My point is that to the average user of Gemini or ChatGPT, LLMs are like AGI, whereas they're actually closer to Intellisense or text completion.

And this is where the problem lies. People will read the output of LLMs and think it has read content on the topic (which is correct) and then deduced an answer (which is technically incorrect).

It also doesn’t help that OpenAI keep using terms like “reasoning” which sounds a lot like general intelligence. But it’s actually just a bunch of scales based on words.

AI doesn’t understand any of the core concepts it is reasoning about. So its reasoning is akin to asking a Hollywood script writer to throw a bunch of medical terms together for a new medical drama. Sure the words might be correct on their own, but that doesn’t mean the sentences are correct. And any subject matter expert who watches a drama that features their specialist subject will tell you that there’s more to understanding a subject than simply knowing the words.

stavros•1d ago
Ah OK, I see what you mean, by "AI" you mean "AGI", not what we call ML. It makes sense that way.
dijksterhuis•1d ago
over the last year, i’ve mentally split the two separate concepts like so

* ML - actual backend models etc

* AI - user interface that appears “intelligent” to humans

LLMs UIs tend to have more appearance of intelligence because their interface is natural language — it’s basically the Eliza Effect https://en.m.wikipedia.org/wiki/ELIZA_effect

i know it’s not the classic definition of the terms, but it’s helped me with my frustration around the bs marketing hype

otabdeveloper4•1d ago
LLMs can't write code.

They don't have capacity to understand logical or temporal relationships, which is the core competency of coding.

They can form syntactically valid strings in a formal language, which isn't the same thing as coding.

stavros•1d ago
Hmm, I guess I better throw away all this working code they wrote, then.
otabdeveloper4•18h ago
See my second paragraph. "Working" (aka syntactically correct) code is not the significant and difficult part of coding.
stavros•18h ago
I don't care if the code works because it was formed because of temporal understanding or if it works because an LLM predicted enough tokens correctly, I care that it works.
yusina•1d ago
This heading is so obvious, I hardly see why it warrants an article.

The other day I googled "Is it 2025?" and the AI reply was that nope, it's 2024. Such a joke.

mdp2021•1d ago
> [current time]

You should know that LLMs are very weak in procedural operations.

> obvious ... why it warrants an article

The phenomenon is part of all the "obvious" things that are not in the awareness of very large masses.

yusina•1d ago
> You should know that LLMs are very weak in procedural operations.

Indeed! That's why they are LLMs, not AI!

Why are new hypes always re-defining terms? 20 years ago, "AI" was actually about intelligence. Just like "crypto" was about cryptography instead of money scams.

> The phenomenon is part of all the "obvious" things that are not in the awareness of very large masses.

I can't imagine a single HN reader who is not aware that LLMs make mistakes. This is not the general public. (And even most of the general public has heard about this minor issue by now.)

mdp2021•1d ago
> terms[!] ... ... I can't imagine a single HN reader who is not aware that LLMs make mistakes. This is not the general public

So you wrote 'article' but you meant "submission" ;)

> And even most of the general public has heard about this minor issue by now

And still some public or private enterprises are trying to push LLMs in dangerous parts of the workflow - possibly because drives like "cutting costs" or "chasing waves" are valued more than objective results...

> "AI" was actually about intelligence

It was (is) about providing outputs intelligently - relevantly.

yusina•1d ago
> So you wrote 'article' but you meant "submission" ;)

Ah, true. Sorry.

Dwedit•1d ago
DuckDuckGo's and Brave Search's AI-generated answers seem to correctly mention that model 280 does not exist, but this is 11 days after the article was first published, and the article is now part of the search results used to generate the AI responses.
Almondsetat•1d ago
"I asked an LLM and this is the wrong answer it gave me" is a genre of content that I'm growing ever more annoyed at. There is nothing constructive or informative about these stories other than the usual adage of not blindly trusting an LLM
snarf_br•1d ago
Then don't read it.

With Gemini replacing Google search more and more people are blindly trusting those answers, so these stories are needed.

Almondsetat•1d ago
Dismissing my observation with 'then don't read it' sidesteps the core issue. My point isn't about my personal reading habits, but about the low signal-to-noise ratio of this content genre. While you argue these stories are 'needed' because people blindly trust LLMs, especially with integrations like Gemini in search, these posts rarely offer more than the simplistic, already widely understood caution: 'don't blindly trust LLMs.' This is precisely the 'usual adage' I mentioned. The genre often lacks depth, failing to provide nuanced understanding or genuinely new information about why these systems fail in specific ways or how users can develop better critical assessment skills beyond mere distrust. If the goal is genuine education due to increased LLM exposure, the content needs to evolve beyond just showcasing errors.
mdp2021•1d ago
> informative

That Google uses a faulty assistant in the page is actually informative, not just for people who do not use that search engine, but also for those attentive to progress in the area - where Google has delivered massive hits recently.

> constructive

The - extremely damaging - replacement of experts with "employees wielding an LLM" is ongoing. Some of us have been told nonsense by remote support service staff...

Almondsetat•1d ago
While you argue that showcasing a 'faulty assistant' like Google's is 'informative', particularly for those tracking AI progress, the typical LLM-got-it-wrong post often doesn't provide that deeper insight. It usually presents an isolated error without context or analysis of the system's architecture, training data limitations, or the specific type of reasoning failure. This makes its informative value quite shallow, quickly becoming repetitive rather than truly enlightening about 'progresses in the area' beyond the surface-level observation that LLMs are imperfect.

Regarding the 'constructive' aspect and the 'damaging replacement of experts,' I agree this is a critical concern. However, the genre of simply posting screenshots of LLM errors is rarely constructive in addressing this complex socio-technical issue. It highlights a symptom (LLMs making mistakes) but typically fails to constructively engage with the causes or potential solutions for deskilling, corporate responsibility in AI deployment, or the nuances of human-AI collaboration. True constructive engagement would require more than just pointing out a wrong answer; it would demand analysis, discussion of best practices, or calls for better system design and oversight, which this genre seldom provides.

mdp2021•23h ago
Right. But simply raising awareness helps fight the "nurses as cheap doctors, random people with a script as a greater bargain" phenomenon.

And as far as progress in LLMs is concerned¹, it seems evident a revolution is required - and when the key (to surpass intuition towards process, dream towards waking) is found, it will become evident.

(¹Earlier I was mentioning «progress» in general - as in, "they give us Veo3 and yet Banged Inthehead at the search pages"?!)

mdp2021•1d ago
The paradox is that the "overview" sits within a search engine, which should provide links to pages containing third-party answers to the question; a rational individual who knows he suffers from bad memory would keep documentation at hand to verify his faulty-memory-based intuitions; the elements for the normal process ("get an intuition - if that is what you do - but verify it through (assessed) available documentation") are there but are not used.
csomar•1d ago
I had this a few months ago with an old man. He said there are 10 billion people in the world; so I told him you are off by 2 billion. He was adamant and challenged me to a Google search. So I did just that and lo and behold there are 10 billion people according to Google.

I even took a screenshot: https://imgur.com/a/oQYKmKP

I really had nothing to say at that moment.

ekianjo•1d ago
Note that we don't really know the exact answer because population reporting is shaky at best where there are the most people.
mdp2021•1d ago
And a rational interlocutor, conversational shortcuts aside, replies in the form "The best estimates from sources like S0, S1 and S2 publish a value between V1 and V2".
system2•1d ago
The sun is hot, water is wet.
afro88•1d ago
If you look at the sources behind the AI responses for this search, they clearly don't mention the 280. Google are probably using a dirt cheap model for these responses and it's harming user trust
mpweiher•1d ago
In other exciting and groundshaking news: water wet!

And I have to admit I thought the title was a joke.

However, I loved the detailed description of just how bad it can be. And it puzzles me why people present AI slop as authoritative. Happens a lot in discussions these days. One example was someone presenting me with a grok answer about some aspect of the energy system. It turned out grok was off by a factor of 1000.

Of course you can also use that to your advantage with people who believe AI slop, as it is fairly simple to get the AI to produce the answer you want, including the answer you know is right ;-)

And I've actually started using AI a bit more in my coding, and it's been helpful as a starter. For example to get my little HTMX-Native project going, I needed to figure out how to configure Android's WebView for local data access.

Would I have figured it out eventually? Yes.

Was it faster with AI? Yes.

Was the AI answer faulty? Yes.

Was it still helpful? Yes.

hannob•1d ago
"AI Responses May Include Mistakes" is really the one, single most important thing I want to shout into the whole AI debate.

It also should be the central issue - together with the energy/climate impacts - in every debate about AI ethics or AI safety. It's those two things that will harm us most if this hype continues unchecked.

consp•1d ago
The problem is not that it may make mistakes, but that it will. People do not realize this and treat it as an almighty oracle. It's a statistical model after all; there is a non-zero chance of the monkey producing the works of Shakespeare.
jll29•1d ago
Language models are not designed to know things, they are designed to say things - that's why they are called language models and not knowledge models.

Given a bunch of words that have already been generated, it always adds the next words based on how common the sequence is.

The reason you get different answers each time is the effect of the pseudo-random number generator on picking the next word. The model looks at the probability distribution of most likely next words, and when the configuration parameter called "temperature" is 0 (and it is actually not possible to set to 0 in the GUI), there is no random influence, and strictly the most likely next word (top-1 MLE) will always be chosen. This leads to output that we would classify as "very boring".

So the model knows nothing about IBM, PS/2, 80286 versus 80486, CPUs, 280 or any models per se. -- One of the answers seems to suggest that there is no Model 280; I wonder whether that one was generated through another process (there is a way to incorporate user feedback via "reinforcement learning"), or whether it was a consequence of the same randomized next-word picking, just a luckier attempt.
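
To make the "temperature" point above concrete, here is a minimal sketch of greedy (temperature-0) decoding versus temperature-scaled sampling over a toy distribution; the tokens and numbers are invented for illustration, and a real model would work over its full vocabulary:

```python
import math
import random

# Toy next-token scores (logits) for made-up tokens; purely illustrative.
logits = {"the": 2.1, "a": 1.3, "Model": 0.2}

def sample_next(logits, temperature):
    if temperature == 0:
        # Greedy decoding: always pick the single most likely token (top-1 MLE),
        # so the output is fully deterministic.
        return max(logits, key=logits.get)
    # Softmax with temperature: lower values sharpen the distribution toward
    # the greedy choice, higher values flatten it and increase randomness.
    scaled = [v / temperature for v in logits.values()]
    m = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(v - m) for v in scaled]
    return random.choices(list(logits), weights=weights, k=1)[0]

print(sample_next(logits, 0))    # always "the"
print(sample_next(logits, 0.8))  # usually "the", occasionally "a" or "Model"
```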

otabdeveloper4•1d ago
> This leads to output that we would classify as "very boring".

Not really. I set temperature to 0 for my local models, it works fine.

The reason the cloud UIs don't allow a temperature of 0 is that models then sometimes fall into infinite loops of tokens, and that would break the suspension of disbelief if the public saw it.

mdp2021•1d ago
Which local models are you using, that do not output loop garbage at temperature 0?

What do you get at very low temperature values instead of 0?

otabdeveloper4•1d ago
> Which local models are you using, that do not output loop garbage at temperature 0?

All of them. I make my own frontends using llama-cpp. Quality goes up with temperature 0 and loops are rare.

The temperature setting isn't for improving quality, it's to not break your suspension of disbelief that you're talking to an intelligent entity.

mdp2021•23h ago
> All of them

You must be using more recent (or just different) models than those I tried. Mine easily returned garbage at temperature 0. (But unfortunately, I cannot re-test and report from here.)

This (LLM behaviour and benchmarking at low or zero temperature values) would be a topic worth investigating.

otabdeveloper4•18h ago
Probably a bug in the code you ran somewhere.
verisimi•1d ago
> Language models are not designed to know things, they are designed to say things - that's why they are called language models and not knowledge models.

This is true. But you go to Google not to 'have a chat' but ostensibly to learn something grounded in knowledge.

You would think Google are making an error in swapping the provision of 'knowledge' for 'words', but then again perhaps it makes no difference when it comes to advertising dollars, which is their actual business.

simianwords•1d ago
In such discourse I never see discussion on this:

There is no doubt that LLMs have gotten more accurate as newer models were released. At what point should we say "look this is accurate enough to be useful"?

We should acknowledge that nothing is ever 100% accurate. You won't go to a doctor expecting 100% accuracy. You know that the doctor's accuracy is high enough that the effort of making an appointment and listening to them is worth it. Maybe they are 60% accurate?

My point is that LLMs are maybe at 20-30% accuracy, where the benefit clearly exists even if we collectively acknowledge that 20-30% is not that high.

I find it amusing to think about an LLM that is 1% accurate (which could have been achieved much earlier, in the 2010s). What could have been possible with such an LLM and the right mindset?

demaga•1d ago
Realistic-looking lorem ipsum
gambiting•1d ago
>>You won't go to a doctor expecting 100% accuracy.

The way LLMs work at the moment is equivalent to going to a doctor with a set of symptoms, and the doctor telling you "ah yes, of course, it's illness X and you need to take medicine Y to cure it", and then you check and neither X nor Y exists. That's not "accuracy", that's just straight-up fraud?

I wouldn't have any problem with Google's AI saying "I don't know" or "I don't have enough sources to provide a good answer, but my best approximation is this". But I literally battle misinformation produced by Google's AI search every single day, because it's telling people actual made up facts that don't exist.

simianwords•1d ago
I have a system prompt that gives me probability estimates for everything the LLM claims. It's a simple fix for your problem.
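
For what it's worth, one way such an instruction could look as a system prompt; the wording below is invented for illustration and is not the parent commenter's actual prompt:

```python
# Illustrative only: the exact prompt the parent commenter uses is unknown.
SYSTEM_PROMPT = (
    "After every factual claim you make, append a confidence estimate between "
    "0 and 1 in brackets, e.g. [confidence: 0.7]. If you are unsure or your "
    "sources conflict, say so explicitly instead of guessing."
)
```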
gambiting•1d ago
My problem is people coming to communities I'm a part of with information they got "from Google" and that information being 100% wrong. Not sure how your prompt helps with that, I need Google to fix their system first.
_shantaram•1d ago
How does that work if the LLM is the one generating the probabilities too?
simianwords•6h ago
According to you, if a human makes a prediction with some probability estimate, it is useless because the estimate itself is inaccurate (hence probability "estimate"). In reality nothing needs to be 100% accurate to be useful, including the estimate of probability itself.
gambiting•6h ago
It's weird to make an assumption about the OP's position and argue with that instead of what they actually wrote.

Also, why make it so personal? I think it was a fair question to ask - you didn't answer how it works - just got weirdly defensive about it.

simianwords•6h ago
Hey, that was not my intention; it was to point out that we ourselves assign probability estimates to our own predictions, despite those estimates not being 100% accurate.
otabdeveloper4•1d ago
Hate to break it to you, but the "probability estimates" it spits out are also complete bullshit.
simianwords•6h ago
Nope! You can also self-assign probability estimates to your own predictions. If you follow them, you will end up more accurate in the long run, even if the probability estimates themselves are not accurate.
jazzyjackson•8h ago
LLMs are unable to introspect and don't know what they don't know. Watson the Jeopardy-bot had a confidence interval, but Watson was not an LLM
sspiff•1d ago
I find this phenomenon really frustrating. I understand (or am at least aware of) the probabilistic nature of LLMs and their limitations, but when I point this out to my wife or friends when they are misusing LLMs for tasks they are both unsuited for and unreliable at, they wave their hands and dismiss my concerns as me being an AI cynic.

They continue to use AI for math (asking LLMs to split bills, for example) and treat its responses for factual data lookup as 100% reliable and correct.

osmsucks•1d ago
> They continue to use AI for math (asking LLMs to split bills, for example)

Ah, yes, high tech solutions for low tech problems. Let's use the word machine for this number problem!

thaumasiotes•1d ago
> Let's use the word machine for this number problem!

You know, that's a thought process that makes internal sense.

You have someone who's terrible at math. They want something else to do math for them.

Will they prefer to use a calculator, or a natural language interface?

How do you use a calculator without knowing what you're doing?

osmsucks•15h ago
Feels to me like if you can't even use a calculator you have bigger problems to worry about...
thaumasiotes•14h ago
You don't think people who can't use a calculator ever have dinner with their friends?
datavirtue•1d ago
I'm so lazy, I have chat bots do all kinds of complex calculations for me. I even use it as a stock screener and the poor thing just suffers, burning fuck tons of electricity.
sspiff•8h ago
Many of them try to be mindful of their climate impact.

I've tried to explain it in those terms as well: every medium-sized prompt on these large models consumes roughly one phone battery charge worth of energy. You have a phone with a calculator.

I'd ask them to do the math on how much energy they're wasting asking stupid things of these systems, but I'm too afraid they'd ask ChatGPT to do the math.

jatora•1d ago
Using it for simple math is actually pretty hilarious. Hey, maybe they make sure to have it use Python! ...but I dream.
BlueTemplar•1d ago
Using LLMs (or platforms in general) is a bit like smoking (in closed spaces, with others present): a nuisance.
diggan•1d ago
That's just plain wrong, and I'm a smoker. LLMs won't affect the people around you unless you engage with them in some way. Sit next to me while I smoke and you'll be affected by passive smoking regardless of whether you engage or not. Not really an accurate comparison :)
JeremyNT•1d ago
> They continue to use AI for math (asking LLMs to split bills, for example) and treat its responses for factual data lookup as 100% reliable and correct.

I don't do this but isn't it basically... fine? I assume all the major chatbots can do this correctly at this point.

The trick here is that chatbots can do a wide range of tasks, so why context switch to a whole different app for something like this? I believe you'll find this happening more frequently for other use cases as well.

Usability trumps all.

JeremyNT•1d ago
Wish I could edit, but I was referring to the bill splitting math specifically here. I didn't mean to quote the rest.

When it comes to facts that actually matter, people need to know to verify the output.

veunes•1d ago
What's tricky is that for casual use, it gets things "close enough" often enough that people start building habits around it
BlueTemplar•1d ago
> An expert will immediately notice discrepancies in the hallucinated answers, and will follow for example the List of IBM PS/2 Models article on Wikipedia. Which will very quickly establish that there is no Model 280.

'member when similar blog posts were written about not trusting Wikipedia?

(And Wikipedia is still better than LLMs: while you can trust it less than fixed, specialist-made references, you can improve it yourself, as well as check Talk pages for potential disagreements and page history for potential shenanigans.)

nickjj•1d ago
I had an experience the other day with ChatGPT and some Python code.

I wanted to modify Gunicorn's logger class so I can filter out specific URL paths. Given it's a hot code path (running on every request) I told it I made 3 solutions and was looking to see which one is the fastest. I used a list + loop using startswith, compiled regex and also used startswith while passing in a tuple of paths.

It produced benchmark code for me, along with benchmark results stating that the regex solution was the best and fastest one using Python's standard library.

I didn't believe it so I ran the benchmark myself and the tuple version was over 5x faster than the regex solution.

I then told it I ran the benchmark and got different results and it almost word for word said something like "Oh right, thank you for the clarification, the tuple version is indeed the fastest!". It saved me a few minutes writing the benchmark code but yeah, I rarely trust its output for anything I'm not 100% on.
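
For reference, a self-contained sketch of the kind of micro-benchmark described above; the path prefixes and call counts are made up, and only the three approaches come from the comment:

```python
import re
import timeit

# Hypothetical prefixes to filter out of access logs; invented for illustration.
SKIP_LIST = ["/healthz", "/static/", "/favicon.ico"]
SKIP_TUPLE = tuple(SKIP_LIST)
SKIP_RE = re.compile(r"^(?:/healthz|/static/|/favicon\.ico)")

PATH = "/api/v1/orders/123"  # the common case: a path that should NOT be filtered

def loop_startswith():   # list + loop using startswith
    return any(PATH.startswith(p) for p in SKIP_LIST)

def regex_match():       # compiled regex
    return SKIP_RE.match(PATH) is not None

def tuple_startswith():  # startswith with a tuple of prefixes
    return PATH.startswith(SKIP_TUPLE)

for fn in (loop_startswith, regex_match, tuple_startswith):
    t = timeit.timeit(fn, number=1_000_000)
    print(f"{fn.__name__:<17} {t:.3f}s per 1M calls")
```

Running something like this yourself, as the commenter did, is the only way to know which variant actually wins on your Python version.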

sanbor•1d ago
I tried to reproduce the situation described in the post by Googling "ps/2 model 280". I refreshed the page about 15 times and it gave a very similar answer every time. For example:

    The IBM PS/2 Model 280 was a low-end PS/2 model, often associated with the earlier IBM PC/XT and PC/AT models.
    It featured an 80286 CPU, a smaller hard drive, and a proprietary 3.5-inch floppy drive. While it used ISA slots for expansion, its proprietary RAM and other components made it less attractive for retro computing compared to the Model 30 286

I tried again after 10 minutes and I got the "The IBM PS/2 Model 280 was not a model in the original Personal System/2 family".

It seems the AI response is cached for a few minutes.

I tried in Kagi "ps/2 model 280?". This is the first response:

```

While a specific "PS/2 Model 280" is not definitively documented, here's what can be gathered:

    - It seems that AI models sometimes make errors regarding the PS/2 Model 280, describing it as an ISA-based 286 system. 12
    - The IBM PS/2 (Personal System/2) was a line of personal computers from IBM, introduced in 1987 3.
    - PS/2 computers had two BIOSes: ABIOS (Advanced BIOS) and CBIOS (Compatible BIOS). 4
    - One source mentions "CHIPS/280" including undocumented IBM PS/2 Model 80 address recovery logic. 5
    - The PS/2 Model 30, released in 1987, used an Intel 8086 processor and had an 8-bit ISA bus. 6
    - The PS/2 Model 30 286 had a 10MHz 286 CPU and could have 2 MB of RAM. 7
    - The PS/2 Model 80 offered enhanced performance for commercial and compute-intensive tasks. 8
References

    AI Responses May Include Mistakes | OS/2 Museum www.os2museum.com
    OS/2 Museum | OS/2, vintage PC computing, and random musings www.os2museum.com
    List of IBM PS/2 models - Wikipedia en.wikipedia.org
    IBM PS/2 - Wikipedia en.wikipedia.org
    Technology ardent-tool.com
    IBM PS/2 Model 30 - Wikipedia en.wikipedia.org
    IBM PS/2 Model 30 286 - Computers of Significant History - Userlandia www.userlandia.com
    IBM PS/2 (Model 80) - Technical specifications ardent-tool.com
```

I waited a few minutes and this is the response now:

```

While information on the IBM PS/2 Model 280 is limited, here's what is available:

    - There is mention of the PS/2 Model 280, with claims that it "was a significant step forward in IBM's personal computer line" 1. However, it's also noted that claims about the PS/2 Model 280 being an ISA-based 286 system may be inaccurate 2.

    - The IBM PS/2 series included a variety of models in different form factors like desktops, towers, all-in-ones, portables, laptops, and notebooks 3. Some notable models include:
        Model 30: Featured an Intel 8086 processor at 8 MHz and an 8-bit ISA bus 4.
        Model 80: A high-end desktop with a 32-bit Intel 386 processor, initially running at 16 MHz, housed in a tower case 5.
References

    AI Responses May Include Mistakes | OS/2 Museum www.os2museum.com
    OS/2 Museum | OS/2, vintage PC computing, and random musings www.os2museum.com
    List of IBM PS/2 models - Wikipedia en.wikipedia.org
    IBM PS/2 Model 30 - Wikipedia en.wikipedia.org
    IBM PS/2 Model 80 - Wikipedia en.wikipedia.org
```

I find the way Kagi gives references superior to the way Google does. Anyway, in the second answer Kagi's AI started to get a bit confused, stating that "it's also noted that claims about the PS/2 Model 280 being an ISA-based 286 system may be inaccurate" and missing the conclusion of the post.

ErrantX•1d ago
There is some parallel here with the message of 1984. In it, Orwell saw the political entity as the vehicle for mass misinformation.

But in some ways Google AI results, with the trust they enjoy, seem like just another vehicle for "we've always been at war with Eurasia".

For me this is the real risk of AI: developing dependence on its output.

veunes•1d ago
And the worst part is, the more plausible the output sounds, the more dangerous it becomes for casual users who don't know any better.
Hilift•1d ago
"Mistakes"? These are confabulations. There is a reason it is occurring, the code was written by people, who exhibit the same behavior. If you wonder why, what possible disincentive does an AI agent have to not confabulate the truth or facts?
rchaud•1d ago
Of course it can't admit it gets things wrong. That would signal to the user that it can't provide what you want. That's Product Design 101.

Would you expect Netflix or Prime to simply show "No results" when you look up a show it doesn't have? Better to fill the screen with a bunch of "we think this is close enough" than to save the user some time.

johnea•19h ago
> AI simply makes stuff up. I do not consider made up, hallucinated answers useful, in fact they are worse than useless.

Let's all say it together: The LLM is just WRONG.

It's not "hallucinating", it doesn't have a brain, or consciousness. It's just generating a wrong answer.

akomtu•18h ago
If you asked this "AI" to draw a possible continuation of a fictional map, it would draw something very believable, and everyone would understand that the map and its continuation are fiction. No one would try to say that the continuation has a "mistake" in it. "AI" works with text the same way it works with meaningless ornaments, but we interpret that text to find meaning, and this is why AI fools people so much.
