
Stop Hiding My Controls: Hidden Interface Controls Are Affecting Usability

https://interactions.acm.org/archive/view/july-august-2025/stop-hiding-my-controls-hidden-interface-controls-are-affecting-usability
123•cxr•2h ago•46 comments

Local-first software (2019)

https://www.inkandswitch.com/essay/local-first/
590•gasull•10h ago•189 comments

Cod Have Been Shrinking for Decades, Scientists Say They've Solved Mystery

https://www.smithsonianmag.com/smart-news/these-cod-have-been-shrinking-dramatically-for-decades-now-scientists-say-theyve-solved-the-mystery-180986920/
113•littlexsparkee•6h ago•36 comments

Techno-Feudalism and the Rise of AGI: A Future Without Economic Rights?

https://arxiv.org/abs/2503.14283
45•lexandstuff•4h ago•15 comments

Operators, Not Users and Programmers

https://jyn.dev/operators-not-users-and-programmers/
36•todsacerdoti•2h ago•8 comments

Optimizing Tool Selection for LLM Workflows with Differentiable Programming

https://viksit.substack.com/p/optimizing-tool-selection-for-llm
47•viksit•4h ago•12 comments

How to Network as an Introvert

https://aginfer.bearblog.dev/how-to-network-as-an-introvert/
40•agcat•4h ago•6 comments

Europe's first geostationary sounder satellite is launched

https://www.eumetsat.int/europes-first-geostationary-sounder-satellite-launched
164•diggan•11h ago•36 comments

What a Hacker Stole from Me

https://mynoise.net/blog.php
28•wonger_•3h ago•7 comments

macOS Icon History

https://basicappleguy.com/basicappleblog/macos-icon-history
136•ksec•10h ago•57 comments

Serving 200M requests per day with a CGI-bin

https://simonwillison.net/2025/Jul/5/cgi-bin-performance/
6•mustache_kimono•1h ago•2 comments

Speeding up PostgreSQL dump/restore snapshots

https://xata.io/blog/behind-the-scenes-speeding-up-pgstream-snapshots-for-postgresql
91•tudorg•8h ago•16 comments

X-Clacks-Overhead

https://xclacksoverhead.org/home/about
202•weinzierl•3d ago•43 comments

Atomic "Bomb" Ring from KiX (1947)

https://toytales.ca/atomic-bomb-ring-from-kix-1947/
57•gscott•3d ago•11 comments

7-Zip 25.00

https://github.com/ip7z/7zip/releases/tag/25.00
35•pentagrama•3h ago•31 comments

Yet Another Zip Trick

https://hackarcana.com/article/yet-another-zip-trick
17•todsacerdoti•3d ago•4 comments

The Calculator-on-a-Chip (2015)

http://www.vintagecalculators.com/html/the_calculator-on-a-chip.html
24•Bogdanp•8h ago•4 comments

Haskell, Reverse Polish Notation, and Parsing

https://mattwills.bearblog.dev/haskell-postfix/
42•mw_1•3d ago•8 comments

WinUAE 6 Amiga Emulator

https://www.winuae.net/
40•doener•4h ago•5 comments

ClojureScript from First Principles – David Nolen [video]

https://www.youtube.com/watch?v=An-ImWVppNQ
7•puredanger•3d ago•0 comments

The Right Way to Embed an LLM in a Group Chat

https://blog.tripjam.app/the-right-way-to-embed-an-llm-in-a-group-chat/
7•kenforthewin•3h ago•9 comments

Seine reopens to Paris swimmers after century-long ban

https://www.lemonde.fr/en/france/article/2025/07/05/seine-reopens-to-paris-swimmers-after-century-long-ban_6743058_7.html
110•divbzero•7h ago•57 comments

Solve high degree polynomials using Geode numbers

https://www.tandfonline.com/doi/full/10.1080/00029890.2025.2460966
10•somethingsome•3d ago•2 comments

The Hell of Tetra Master

https://xvw.lol/en/articles/tetra-master.html
5•zdw•3d ago•1 comment

What 'Project Hail Mary' teaches us about the PlanetScale vs. Neon debate

https://blog.alexoglou.com/posts/database-decisions/
45•konsalexee•13h ago•67 comments

Parametric shape optimization with differentiable FEM simulation

https://docs.pasteurlabs.ai/projects/tesseract-jax/latest/examples/fem-shapeopt/demo.html
12•dionhaefner•2d ago•2 comments

QSBS Limits Raised

https://www.mintz.com/insights-center/viewpoints/2906/2025-06-25-qsbs-benefits-expanded-under-senate-finance-proposal
58•tomasreimers•14h ago•24 comments

Is It Cake? How Our Brain Deciphers Materials

https://nautil.us/is-it-cake-how-our-brain-deciphers-materials-1222193/
16•dnetesn•2d ago•4 comments

Gecode is an open source C++ toolkit for developing constraint-based systems (2019)

https://www.gecode.org/
60•gjvc•16h ago•13 comments

Pet ownership and cognitive functioning in later adulthood across pet types

https://www.nature.com/articles/s41598-025-03727-9
60•bookofjoe•6h ago•20 comments

AI assisted search-based research works now

https://simonwillison.net/2025/Apr/21/ai-assisted-search/
283•simonw•2mo ago

Comments

simonw•2mo ago
I think it's important to keep tabs on things that LLM systems fail at (or don't do well enough on) and try to notice when their performance rises above that bar.

Gemini 2.5 Pro and o3/o4-mini seem to have crossed a threshold for a bunch of things (at least for me) in the last few weeks.

Tasteful, effective use of the search tool for o3/o4-mini is one of those. Being able to "reason" effectively over long context inputs (particularly useful for understanding and debugging larger volumes of code) is another.

skydhash•2mo ago
One issue I can see with this workflow is tunnel vision: making ill-informed decisions because of a lack of surrounding information. I often skim books because even if I don't retain the content, I build a mental map that can help me find further information when I need it. I wouldn't try to construct a complete answer to a question from just this amount of information, but I will use that map to quickly locate the source and gather more information to synthesize an answer.

One could use the above workflow in the same way and argue that natural-language search is more intuitive than keyword-based search. But I don't think that brings any meaningful productivity improvement.

> Being able to "reason" effectively over long context inputs (particularly useful for understanding and debugging larger volumes of code) is another.

Any time I see this "wish" pop up, my suggestion is to try a disassembler to reverse engineer some binary, to really understand the problem of coming up with a theory of a program (based on Naur's definition). Individual statements are always clear (programming languages are formal and have no ambiguity). The issue is grouping them, unambiguously defining the semantics of those groups, and finding the links between them, recursively.

Once that's done, what you'll have is a domain. And you could have skipped the whole exercise by just learning the domain from a domain expert. So the only reasons to do this are that the code doesn't really implement the domain (bugs) or that it's hidden purposefully. The most productive workflow, then, is to learn the domain first to find discrepancies (first case) or to focus on the missing part (second case). In the first case, the easiest approach is writing tests; the more complete one is formal verification of the software.

jsemrau•2mo ago
My main observations here are:

1. Technically it might be possible to search the Internet, but it might not surface correct and/or useful information.

2. High-value information that would make a research report valuable is rarely public or free. This holds especially true in capital-intensive or regulated industries.

simonw•2mo ago
I fully expect one of the AI-related business models going forward to be charging subscriptions for LLM search tool access to those kinds of archives.

ChatGPT plus an extra $30/month for search access to a specific archive would make sense to me.

sshine•2mo ago
Kagi is $10/mo. for search and +$15/mo. for premium LLMs with agentic access to search.
AlotOfReading•2mo ago
What they're talking about is access to professional archives like EBSCOnet or Bloomberg, which usually don't sell to individuals in the first place and start at tens of thousands of dollars per seat for institutional access.
JustFinishedBSG•2mo ago
Of course Mistral's model is a lot worse, but for $14.99 you get access to AFP news. OpenAI charging $60 for the same thing would be a huge joke.
ac29•2mo ago
The $10 plan now includes the LLM assistant as well (with a more limited selection of models than the $25 plan).
jsemrau•2mo ago
Then I'd rather see domain-specific, agent-first data. I.e., not a simple API call but token -> BM25 -> token.
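
A minimal sketch of that token -> BM25 -> token shape, using the rank_bm25 package - the package choice and toy corpus are illustrative assumptions, not jsemrau's actual setup:

    # The agent's query tokens go through a BM25 index over domain documents;
    # ranked document text comes back for the model to read.
    from rank_bm25 import BM25Okapi

    corpus = [
        "QSBS limits raised under Senate finance proposal",
        "Speeding up PostgreSQL dump/restore snapshots",
        "Europe's first geostationary sounder satellite launched",
    ]
    index = BM25Okapi([doc.lower().split() for doc in corpus])
    print(index.get_top_n("postgresql restore speed".split(), corpus, n=1))
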
hadlock•2mo ago
o3/o4 seem to know how to search things like PyPI, crates.io, pkg.go.dev, etc., and apply those changes on the first try. My application (running on an older version of code) had a breaking change to how the event controller functioned in the newer version; o3 looked at the documentation and rewrote it to use the new event controller. It used to be that you were trapped with the LLM being 3-8 months behind on package versions.
simonw•2mo ago
Huh, now I'm thinking that maybe a target for release notes should be to provide enough details that a good LLM can be used to apply fixes for any breaking changes.
rd•2mo ago
MCP maybe? A release notes MCP (maybe into ReadTheDocs or pypi) that understands upgrade instructions for every package.
sanderjd•2mo ago
This is the thing I don't really love about MCP: Why should it require a separate protocol, rather than just good readable documentation?
navinsylvester•2mo ago
Ironically, https://context7.com/
sitkack•2mo ago
This is devdocs to be consumed by LLMs, https://github.com/upstash/context7

Brilliant (and one less thing I have to build)!

sanderjd•2mo ago
I'm glad this exists, but can you describe to me why it needs to? Why can't agents just read the docs directly?
sitkack•2mo ago
By returning detailed docs for exactly what the AI is coding at the time, it greatly reduces the likelihood it will make a mistake. It moves from a recall-from-training-data problem to a transcription problem.

This is RAG but for API docs.
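
A toy sketch of the idea, with a hard-coded dict standing in for the real doc index (context7's actual mechanics are more involved):

    # Retrieve the doc snippet for the exact symbol being used and prepend it
    # to the coding prompt, turning recall from training data into transcription.
    DOCS = {
        "requests.get": "requests.get(url, params=None, **kwargs) -> Response",
    }

    def build_prompt(task: str, symbol: str) -> str:
        snippet = DOCS.get(symbol, "(no docs found)")
        return f"API documentation:\n{snippet}\n\nTask: {task}"

    print(build_prompt("Fetch https://example.com and print the status code.",
                       "requests.get"))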

TrackerFF•2mo ago
I'm not a researcher, but don't most researchers these days also upload their work to arXiv?

Sure, it's not a journal - but in some fields (Machine Learning, Math) it seems like everyone also uploads their stuff there. So if the models can crawl sites like arXiv, at least there's some decent stuff to be found.

levocardia•2mo ago
Not outside of ML, physics, and math. Preprints are extremely rare in many (dare I say most) scientific fields, and often you are interested not in the cutting-edge work but in the foundational work in a field from the 60s, 70s, or 80s, all of which is locked behind a paywall. Or at least it's supposed to be; corporate LLMs are not "allowed" to go poking around on sketchy Russian websites for non-paywalled versions.
qingcharles•2mo ago
I'm assuming many AI companies are probably scraping all the PDFs from the "shadow libraries" of the world that have done some of the work of liberating these papers from behind their paywalls. Obviously it's legally unsettled territory right now...
jsemrau•2mo ago
Proper research, especially work contributed to conferences, is hard to come by and is usually managed by the conference organizers. arXiv has some, but it's limited.

It would be great if, for a DeepSearch tool for ML, I could just use arXiv as a source and have the agent search it. But so far I have not found a working arXiv tool that does this well.

btbuildem•2mo ago
It's a relevant question about the economic model for the web. On one hand, the replacement of search with an LLM-based approach threatens the existing, advertising-based model. On the other hand, the advertising model has produced so much harm: literally irreparable damage to attention spans, outrage-driven "engagement", and the general enshittification of the internet, to mention just a few. I find it a bit hard to imagine that whatever succeeds it will be worse for us collectively.

My question is, how to reproduce this level of functionality locally, in a "home lab" type setting. I fully expect the various AI companies to follow the exact same business model as any other VC-funded tech outfit: free service (you're the product) -> paid service (you're still the product) -> paid service with advertising baked in (now you're unabashedly the product).

I fear that with LLM-based offerings, the advertising will be increasingly inseparable, and eventually undetectable, from the actual useful information we seek. I'd like to get a "clean" capsule of the world's compendium of knowledge with this amazing ability to self-reason, before it's truly corrupted.

fzzzy•2mo ago
You need a copy of R1 and enough RAM to run it, plus a web-search tool, or a RAG database with your personal data store.
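
A minimal sketch of the local piece, assuming an Ollama server on its default port; the model tag and the bolted-on search helper are assumptions, not a complete research agent:

    # Ask a locally served reasoning model a question; a real setup would
    # interleave calls like this with a web-search tool or a RAG database.
    import requests

    def ask_local_model(prompt: str) -> str:
        resp = requests.post(
            "http://localhost:11434/api/generate",   # Ollama's default endpoint
            json={"model": "deepseek-r1:14b", "prompt": prompt, "stream": False},
            timeout=600,
        )
        return resp.json()["response"]

    print(ask_local_model("Summarize the local-first software essay."))
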
btbuildem•2mo ago
R1 would be the reasoning model - as in, the initial part of the output being the "train of thought" revealed before the "final answer" is provided. I was able to deploy a heavily quantized version of that locally and run it with RAG (Open WebUI in this instance) - with web search enabled, sure, but it's still a far cry from an actual "research" model that knows when and how to seek extra data / information.
qwertox•2mo ago
I feel like the benefit which AI gives us programmers is limited. They can be extremely advanced, accelerative and helpful assistants, but we're limited to just that: architecting and developing software.

Biologists, mathematicians, physicists, philosophers and the like seem to have an open-ended benefit from the research which AI is now starting to enable. I kind of envy them.

Unless one moves into AI research?

bluefirebrand•2mo ago
I don't think AI is trustworthy or accurate enough to be valuable for anyone trying to do real science

That doesn't mean they won't try though. I think the replication crisis has illustrated how many researchers actually care about correctness versus just publishing papers

simonw•2mo ago
If you're a skilled researcher I expect you should be able to get great results out of unreliable AI assistants already.

Scientists are meant to be good at verifying and double-checking results - similar to how journalists have to learn to derive the truth from unreliable sources.

These are skills that turn out to be crucial when working with LLMs.

bluefirebrand•2mo ago
> Scientists are meant to be good at verifying and double-checking results

Verifying and double-checking results requires replicating experiments, doesn't it?

> similar to how journalists have to learn to derive the truth from unreliable sources

I think maybe you are giving journalists too much credit here, or you have a very low standard for "truth"

You cannot, no matter how good you are, derive truth from faulty data

simonw•2mo ago
Don't make the mistake of assuming all journalists are the same. There's a big difference between an investigative reporter at a respected publication and someone who gets paid to write clickbait.

Figuring out that the data is faulty is part of research.

bluefirebrand•2mo ago
Figuring out that data is faulty is one thing

There is still no possible way that a journalist can arrive at correct information, no matter how good, if they only have faulty data to go with

simonw•2mo ago
That's what (good) journalism is: the craft of hunting down sources of information, figuring out how accurate and reliable they are, and piecing together as close to the truth as you can get.

A friend of mine is an investigative reporter for a major publication. They once told me that an effective trick for figuring out what's happening in a political story is to play different sources off against each other - tell one source snippets of information you've got from another source to see if they'll rebut or support it, or if they'll leak you a new detail because what you've got already makes them look bad.

Obviously these sources are all inherently biased and flawed! They'll lie to you because they have an agenda. Your job is to figure out that agenda and figure out which bits are true.

The best way to confirm a fact is to hear about it from multiple sources who don't know who else you are talking to.

That's part of how the human intelligence side of journalism works. This is why I think journalists are particularly well suited to dealing with LLMs - human sources lie and mislead and hallucinate to them all the time already. They know how to get (as close as possible) to the truth.

barbazoo•2mo ago
Same with using AI for coding. I can’t imagine someone having the expectation to use the LLM output verbatim but maybe I’m just not good enough at prompting.
simonw•2mo ago
Using AI for coding effectively involves getting very good at testing (both manual and automated) and code review: https://simonwillison.net/2025/Mar/2/hallucinations-in-code/...
bluefirebrand•2mo ago
Manual testing, automated testing, and code review

All three of those are things that software engineers rather reliably are bad at and cut corners on, because they are the least engaging and least interesting parts of the job of building software.

simonw•2mo ago
Yep. Engineers who aren't willing to invest in those skills will have limited success with AI-assisted development.

I've seen a few people state that they don't like using LLMs because it takes away the fun part (writing the code) and leaves them with the bits they don't enjoy.

bluefirebrand•2mo ago
> Engineers who aren't willing to invest in those skills

Are bad engineers

> AI-assisted development

Are also bad engineers

parodysbird•2mo ago
Biologists, mathematicians, physicists, and philosophers are already the experts who produce the text in their domain that the LLMs might have been trained on...
twic•2mo ago
Until AI can work a micropipette, it's going to be of fairly marginal use to biologists.
sshine•2mo ago
The article doesn’t mention Kagi: The Assistant, a search-powered LLM frontend that came out of closed beta around the beginning of the year, and got included in all paid plans since yesterday.

It really is a game changer when the search engine is good.

I find that an AI performing multiple searches on variations of keywords, and aggregating the top results across keywords, searches more extensively than most people, myself included, would.

I had luck once asking what its search queries were. It usually provides the references.

simonw•2mo ago
I haven't tried Kagi's product here yet. Do you know which LLM it uses under the hood?

Edit: from https://help.kagi.com/kagi/ai/assistant.html it looks like the answer is "all of them":

> Access to the latest and most performant large language models from OpenAI, Anthropic, Meta, Google, Mistral, Amazon, Alibaba and DeepSeek

dcre•2mo ago
Yep, a regular paid Kagi sub comes with cheap models for free: GPT-4o-mini, Gemini 2.5 Flash, etc. If you pay extra you can get the SOTA models, though IMO Flash is good enough for most stuff if the search-result context is good.
intended•2mo ago
I find that these conversations on HN end up covering similar positions constantly.

I believe that most positions are resolved if

1) you accept that these are fundamentally narrative tools. They build stories, in whatever style you wish: stories of code, stories of project reports, stories of conversations.

2) this is balanced by the idea that the core of everything in our shared information economy is Verification.

The reason experts get use out of these tools is that they can verify when the output is close enough to be indistinguishable from expert effort.

Domain experts also do another level of verification (hopefully) which is to check if the generated content computes correctly as a result - based on their mental model of their domain.

I would predict that LLMs are deadly in the hands of people who can't gauge the output and will end up driving themselves off a cliff, while experts will be able to use them effectively on tasks where verifying the output has a comparative effort advantage over creating it.

gh0stcat•2mo ago
You've perfectly captured my experience as well. I typically only trust LLMs, and have good experiences with them, when I have enough domain expertise to reach at least 95% confidence that the output is correct (specific to my domain of work; I don't always need "perfect"). I can also mostly use them as a first pass for getting an idea of where to begin research; after that I lose confidence that the more detailed and advanced content they give me is accurate. There is a gray area, though, where a domain expert might have a false sense of confidence and, over time, experience "skill drift": losing expertise because they are only ever verifying a lossy compression of information rather than resetting their context with real-world information. I am mostly concerned with that last bit.
ilrwbwrkhv•2mo ago
Yup, a succinct summary of the current state. This works across domains, from research to software engineering.
intended•2mo ago
New LinkedIn Post.
saulpw•2mo ago
I tried it recently. I asked for videochat services like the one I use (WB) with 2 specific features that the most commonly used services don't have. It asked some clarifying questions and seemed to understand the mission, then went off for 10 minutes after which it returned 5 results in a table.

The first result was WB, which I gave to it as the first example and am already using. Results 2 and 3 were the mainstream services which it helpfully marked in the table as not having the features I need. Result 4 looked promising but was discontinued 3 years ago. Result 5 was an actual option which I'm trying out (but may not work for other reasons).

So, 1/5 usable results. That was mildly helpful I guess, but it appeared a lot more helpful on the surface than it was. And I don't seem to have the ability to say "nice try but dig deeper".

simonw•2mo ago
That sounds like a Deep Research query, was that with OpenAI or Gemini?
saulpw•2mo ago
This was OpenAI.
Gracana•2mo ago
You can tell it to try again. It took me a couple rounds with the tool before I noticed that your conversation after the initial research isn't limited to just chatting: if you select the "deep research" button on your message, it will run the search process in its response.
baq•2mo ago
> I can feel my usage of Google search taking a nosedive already.

Conveniently, Gemini is the best frontier model for everything else; they're very interested, and well positioned (if not best placed?), to also be the best at deep research. Let's check back in 3-6 months.

throwup238•2mo ago
IMO they’re already the best. Not only is the rate limit much higher (20/day instead of OpenAI’s 10/month) but Gemini is capable of looking at far more sources, on the order of 10x.

I just had a research report last night that looked at 400 sources when I asked it to help identify a first edition Origin of Species (it did a great job too, correctly explaining how to distinguish a true first edition from chimerical ones).

jillesvangurp•2mo ago
Google has two advantages:

1) Their AI models aren't half bad. Gemini 2.5 seems to be doing quite well relative to some competitors.

2) They know how to scale this stuff. They have their own hardware, lots of data, etc.

Scaling is of course the hard part. Doing things at Google scale means doing it well while still making a profit. Most AI companies are just converting VC cash into GPUs and energy. VC subsidized AI is nice at a small scale but cripplingly expensive at a larger scale. Google can't do this; they are too large for that. But they are vertically integrated, build their own data centers, with their own TPUs, etc. So, once this starts happening at their scale, they might just have an advantage.

A lot of what we are seeing is them learning to walk before they start running faster. Most of the world has no clue what Perplexity is, or any notion of the pros and cons of Claude 3.7 Sonnet vs. o4-mini-high. None of that stuff matters long term. What matters is who can do this stuff well enough for billions of people.

So, I wouldn't count them out. But none of this stuff guarantees success either, of course.

energy123•2mo ago

> The user-facing Google Gemini app can search too, but it doesn't show me what it's searching for.

Gemini 2.5 Pro is also capable of searching as part of its chain of thought; it needs light prodding to show URLs, but it'll do so and is good at it.

Unrelated point, but I'm going to keep saying this anywhere Google engineers may be reading, the main problem with Gemini is their horrendous web app riddled with 5 annoying bugs that I identified as a casual user after a week. I assume it's in such a bad state because they don't actually use the app and they use the API, but come on. You solved the hard problem of making the world's best overall model but are squandering it on the world's worst user interface.

loufe•2mo ago
There must be some form of memory leak in AI Studio, as I have to close and open a new tab after about 2 hours while it slowly grinds my slower computers to a halt. Its inability to create a markdown file without escaping the markdown itself (including code snippets) is definitely my first suggestion for them to fix.

It's a great tool, but sometimes frustrating.

energy123•2mo ago
It's a great model, but the web developers that built the web app either don't care or are incompetent.
oulipo•2mo ago
The main "real-world" use cases for AI use for now have been:

- shooting buildings in Gaza https://apnews.com/article/israel-palestinians-ai-weapons-43...

- compiling a list of information on Government workers in US https://www.msn.com/en-us/news/politics/elon-musk-s-doge-usi...

- creating a few lousy music videos

I'd argue we'd be better off SLOWING DOWN with that shit

esafak•2mo ago
Programming is not real world?
das_keyboard•2mo ago
Yeah right. We also got "vibe coding" out of it.
oulipo•2mo ago
I said "the main use cases", not "the little toys to distract and amuse engineers while the ruin the environment with CO2 emissions"
esafak•2mo ago
It is not a little toy, and when you consider the carbon footprint of humans and the productivity savings afforded by AI, the balance evens out. AI is a very powerful technology if you know how to wield it.
sandspar•2mo ago
You seem ideologically motivated rather than truth-motivated, which makes you untrustworthy.
oulipo•2mo ago
So give some citations of other notable uses?
sandspar•2mo ago
I don't trust you to listen objectively so why would I tell you anything?
swyx•2mo ago
> Deep Research, from three different vendors

Don't forget xAI's Grok!

M4v3R•2mo ago
Which, at least in my experience, is surprisingly good while being much faster than the others.
fudged71•2mo ago
you.com is surprisingly good for this as well (I like the corporate report PDF export)
ilrwbwrkhv•2mo ago
Horrible compared to SOTA. I only see it mentioned by random AI influencers who are a waste of air and live on Twitter.
softwaredoug•2mo ago
I wonder when Google search will let me "chat" with the search results. I often want to ask the AI Overview follow up questions.

I secondarily wonder how an LLM solves the trust problem in web search - what's traditionally solved (and now gamed) through PageRank. ChatGPT doesn't seem as easily fooled by spam as direct search is.

How much of this is Bing (or whatever the underlying search engine is) getting better, versus LLMs getting better at knowing what a good result for a query looks like?

Or perhaps it has to do with the richer questions that get asked to chat vs search?

vunderba•2mo ago
> I wonder when Google search will let me "chat" with the search results.

You don't hear a lot of buzz around them, but that's kind of what Perplexity lets you do. (Possibly Phind too, but it's been a while since I used them.)

dingnuts•2mo ago
>I wonder when Google search will let me "chat" with the search results

Kagi has this already, it's great. Choose a result, click the three-dot menu, choose "Ask questions about this page." I love to do this with hosted man pages to discover ways to combine the available flags (and to discover what is there)

I find most code LLMs write to be subpar but Kagi can definitely write a better ffmpeg line than I can when I use this approach

KTibow•2mo ago
When AI Overview was called Search Generative Experience, you could do that. You can do that again now if you have access to AI Mode.
63•2mo ago
One downside I found is that the LLM cannot change its initial prompt until it's done thinking. I used deep research to compare counseling centers for me, but of course when it encounters some factor I hadn't thought of (e.g. the counselors here fit the criteria perfectly but none accept my insurance), it doesn't know that it ought to skip that site entirely. Really this is a critique of the deep-research approach rather than search in general, but I imagine it can still play out on smaller scales. Often, searching for information is a dynamic process involving the discovery of unknown unknowns and adjustment based on that, but AI isn't great at abstract goals or stopping to ask clarifying questions before resuming. Ultimately, the report I got wasn't useless, but it mostly just regurgitated the top 3 Google results. I got much better recommendations by reaching out to a friend who works in the field.
xp84•2mo ago
From article:

> “Google is still showing slop for Encanto 2!” (Link is provided)

I believe quite strongly that Google is making a serious misstep in this area, the “supposed answer text pinned at the top above the actual search results.”

For years they showed something in this area that was directly quoted from what I assume was a shortlist of non-BS sites, so users were conditioned for years that if they just wanted a simple answer - like when a certain movie came out or whether a certain show had been canceled - they might as well trust it.

Now it seems like they have given over that previous real estate to a far less reliable feature, which simply feeds any old garbage it finds anywhere into a credulous LLM and takes whatever pops out. 90% of people that I witness using Google today simply read that text and never click any results.

As a result, Google is now pretty much always even less accurate at the job of answering questions than if you posed that same question to ChatGPT, because GPT seems to be drawing from its overall weights which tend toward basic reality, whereas Google’s “Answer” seems to be summarizing a random 1-5 articles from the Spam Web, with zero discrimination between fact, satire, fiction, and propaganda. How can they keep doing this and not expect it to go badly?

ljsprague•2mo ago
I have stopped using Google when I have a random fact I need answered. Faster to ask ChatGPT. I trust it enough now.
CSMastermind•2mo ago
The various deep research products don't work well for me. For example I asked these tools yesterday, "How many unique NFL players were on the roster for at least one regular season game during the 2024 season? I'd like the specific number not a general estimate."

I as a human know how to find this information. The game day rosters for many NFL teams are available on many sites. It would be tedious but possible for me to find this number. It might take an hour of my time.

But despite this being a relatively easy research task all of the deep research tools I tried (OpenAI, Google, and Perplexity) completely failed and just gave me a general estimate.

Based on this article I tried that search just using o3 without deep research and it still failed miserably.

simonw•2mo ago
That is an excellent prompt to tuck away in your back pocket and try against future iterations of this technology. It's going to be an interesting milestone when or if any of these systems get good enough at comprehensive research to provide a correct answer.
minraws•2mo ago
If you keep the prompt the same, at some point the data will appear in the training set and the model might have the answer.

So even though today it might be a good check, it might not remain such a good benchmark.

I think we need a way to keep updating prompts, without increasing complexity, to properly verify model improvements. ARC Deep Research, anyone?

ljsprague•2mo ago
Wouldn't somebody need to answer the question below? Or do you mean the discussion of its weakness might somehow make it stronger the next time it's trained?
minraws•2mo ago
I think it can be both: what happens if discussing the weakness provides more relevant links for the question and helps a model trained on scraped web data to learn somehow?

I am not sure whether the model will need the exact answer, or whether backlinks to sites where it can find the answer would be enough. Maybe just documenting how to do it could do the job as well...

red_trumpet•2mo ago
Well, to test research capabilities, one could just update the year (2024 -> 2025) in the prompt.
minraws•2mo ago
I am not sure what happens if some site keeps tracking these metrics and that manages to find its way into the training data.

There are some NBA fan sites that do keep track of some of these tournament level final metrics.

qingcharles•2mo ago
I had o3 "cheat" yesterday. I tried to demo a Deep Research task to a friend, but o3 managed to find the answer immediately in a Reddit comment I'd made after trying out the same problem previously.

I was still impressed though!

danielmarkbruce•2mo ago
This is just a bad match to the capabilities. What you are actually looking for is analysis, similar in nature to what a data scientist may do.

The deep research capabilities are much better suited to more qualitative research / aggregation.

pton_xd•2mo ago
> The deep research capabilities are much better suited to more qualitative research / aggregation.

Unfortunately sentiment analysis like "Tell me how you feel about how many players the NFL has" is just way less useful than: "Tell me how many players the NFL has."

lucyjojo•2mo ago
First person that makes a good exact aggregation AI will make so much money...

Precise aggregation is what so many juniors do in so many fields of work it's not even funny...

johnnyanmac•2mo ago
If AI can't look up and read a chart, why would I trust it with any real aggregation?
netghost•2mo ago
Because AI is weird and does some things really well, and some things poorly. The terrible/exciting/weird part is figuring out which is which.
oytis•2mo ago
So it's not doing well at things we can verify/measure, but supposedly it's doing much better at things we can't measure - except we can't measure them, so we have no idea how well it is actually doing. The most impressive feature of LLMs remains their ability to impress.
danielmarkbruce•2mo ago
Yup. Like humans.
oytis•2mo ago
At least in a liberal society, humans matter. Their opinions, judgements, and tastes matter. Why should the "opinion" (which is not even a real opinion) of a machine matter?

Not to mention that we decide whether to trust the opinion of a human expert by their ability to deliver measurably correct judgements - the very thing LLMs seem to be bad at.

danielmarkbruce•2mo ago
You can't measure the results of your legal advice in most cases. There are all manner of things we can't measure well. We don't throw up our hands and say "then forget it". We do our best with the anecdotes and move forward.
southernplaces7•2mo ago
Your logic is.... strange...

Because it failed miserably at a very simple task of looking through some scattered charts, the human asking should blame themselves for this basic failure and trust it to do better with much harder and more specialized tasks?

MyPasswordSucks•2mo ago
I think you might as well be saying "robotics fail miserably at the very simple task of jogging around the block, so why should we trust the field to be able to accurately place millions of transistors within a 25cm square of silicon?"

His point is that the two tasks are very different at their core, and deep research is better at teasing out an accurate "fuzzy" answer from a swamp of interrelated data, and a data scientist is better at getting an accurate answer for a precise, sharply-defined question from a sea of comma-separated numbers.

A human readily understands that "hold the onions, hots on the side" means to not serve any onions and to place any spicy components of the sandwich in a separate container rather than on the sandwich itself. A machine needs to do a lot of educated guessing to decide whether it's being asked to keep the onions in its "hand" for a moment or keep them off the sandwich entirely, and whether black pepper used in the barbeque sauce needs to be separated and placed in a pile along with the habanero peppers.

southernplaces7•2mo ago
You seem to misunderstand my previous comment and also the thing being criticized by the post I replied to.

I understand that there are fuzzy tasks that AIs/algorithms are terrible at, which seem really simple for a human mind, and this hasn't gone away with the latest generations of LLMs. That's fine and I wouldn't criticize an AI for failing at something like the instructions you describe, for example.

However, in this case the human was asking for very specific, cut-and-dried information from easily available NFL rosters. Again, if an AI fails at that, especially because you didn't phrase the question "just so", then sorry, but no, it's not much more trustworthy for deep research and data-scientist inquiries.

What, in any case, makes you think data scientists will use superior phrasing to tease better results out of an LLM under more complexity?

simonw•2mo ago
"Don’t fall into the trap of anthropomorphizing LLMs and assuming that failures which would discredit a human should discredit the machine in the same way." - https://simonwillison.net/2025/Mar/11/using-llms-for-code/#s...
daveguy•2mo ago
What I got out of that essay is that you should discredit most responses of LLMs unless you want to do just as much or more work yourself confirming the accuracy of an unreliable and deeply flawed partner. Whereas if a human "hallucinated a non-existent library or method you would instantly lose trust in them." But, for reasons, we should either give the machine the benefit of the doubt or manually confirm everything.
simonw•2mo ago
From that same essay:

> If your reaction to this is “surely typing out the code is faster than typing out an English instruction of it”, all I can tell you is that it really isn’t for me any more. Code needs to be correct. English has enormous room for shortcuts, and vagaries, and typos, and saying things like “use that popular HTTP library” if you can’t remember the name off the top of your head.

Using LLMs as part of my coding work speeds me up by a significant amount.

danielmarkbruce•2mo ago
No logic needed. Just use them, build them, play with them. You'll figure out what they are good at and what they aren't good at.
neom•2mo ago
Is it accurate that there are 544 rosters? If so, even at 2 minutes a roster, isn't that days of work, even if you coded something? How would you go about completing this task in 1 hour as a human? (Also, ChatGPT 4.1 gave me 2,503 and said it used the NFL 2024 fact book.)
dghlsakjg•2mo ago
If the rosters are in some easily parsed or scrapable format from the NFL, as sports stats typically are, this is just a matter of finding every unique name. It's something I imagine would take less than an hour or two for even a beginner coder, and maybe a second or two for the code to actually run.
krainboltgreene•2mo ago
FYI for readers: All the major leagues have a stats API, most are public, some are public and "undocumented" with tons of documentation by the community. It's quite a feat!
CSMastermind•2mo ago
544 rosters but half as many games (because the teams play each other).

Technically I can probably do it in about 10 minutes because I've worked with these kind of stats before and know about packages that will get you this basically instantly (https://pypi.org/project/nfl-data-py/).

It's exactly 4 lines of code to find the correct answer, which is 2,227.

Assuming I didn't know about that package though I'd open a site like pro football reference up, middle click on each game to open the page in a new tab, click through the tabs, copy paste the rosters into sublime text, do some regex to get the names one per line, drop the new one per line list into sortmylist or a similar utility, dedupe it, and then paste it back into sublime text to get the line count.

That would probably take me about an hour.
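
For the curious, a rough sketch of what those few lines might look like - assuming nfl_data_py exposes import_weekly_rosters() and that the returned frame has game_type and player_id columns (both assumptions worth checking against the package docs):

    import nfl_data_py as nfl

    rosters = nfl.import_weekly_rosters([2024])        # one row per player per week
    regular = rosters[rosters["game_type"] == "REG"]   # drop preseason/postseason rows
    print(regular["player_id"].nunique())              # count distinct players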

neom•2mo ago
I see. When you said "game day rosters for many NFL teams are available on many sites" - I thought "that sounds like a lot of hours!!" heh. - I didn't realize it was packaged well, I also know sweet fa about football. Thanks for explaining it more. :)
riku_iki•2mo ago
https://chatgpt.com/share/6807278c-a0d0-8006-80f3-f62ae9f8ff...
ljsprague•2mo ago
Did you run the code?
riku_iki•2mo ago
No, but even if it is buggy, it's the usual LLM cycle:

- you: there is no such function

- LLM: you are absolutely right, here is fixed code

paulsutter•2mo ago
I bet these models could create a Python program that does this
Retric•2mo ago
Maybe eventually, but I bet it’s not going to work with less than 30 minutes of effort on your part.

If "it might take an hour of my time" to get the correct answer, then there's a low bar for trying a shortcut that might not work.

kenjackson•2mo ago
o3 deep research gave me an answer after I requested an exact answer again (it gave me an estimate first): 2147.
raybb•2mo ago
Similarly, I asked it a rather simple question: give me a list of AC repair places near me, with their phone numbers. Weirdly, Gemini repeated a bunch of them 3 or 4 times, gave some completely wrong phone numbers, and found many places hours away but labeled them as being in the neighboring city.
wontonaroo•2mo ago
I used Google AI Studio instead of the Google Gemini app because it provides references to the search results.

Google AI Studio gave me 2227 as a possible exact answer and linked to these comments, because there is a comment further down that claims that is the exact answer. The comment was 2 hours old when I ran the prompt.

It also provided a code example of how to find it using the python nfl data library mentioned in one of the comments here.

patapong•2mo ago
So the time from posting a question and its answer to the internet, to LLMs having access to that answer, is less than 2 hours... Does not bode well for the benchmarks of the future!
gilbetron•2mo ago
To avoid "result corruption" I asked a similar question, but for NBA players, and used o4-mini, and got a specific answer:

"For the 2023‑24 NBA regular season (which ran from October 24, 2023 to April 14, 2024), a total of 561 distinct players logged at least one game appearance, as indexed by their “Rk” on the Basketball‑Reference “Player Stats: Totals” page (the final rank shown is 561)"

Doing a quick search on my own, this number seems like it could be correct.

Havoc•2mo ago
Are any of the Deep Research tools pure API cost, or all monthly subscriptions?
simonw•2mo ago
I think the Gemini one may still be available for free.
sublimefire•2mo ago
I do prefer tools like GPT Researcher, where you are in control of sources and search engines. Sometimes you just need to use arXiv, sometimes you mix research with the docs you have, sometimes you want to use different models. I believe the future is in choosing what you need for the specific task at that moment, e.g. 3D model generation mixed with something else, and this all requires some sort of new "OS"-level application to run from.

Individual model vendors cannot build such a product, as they are biased towards their own models; they would not allow you to choose models from competitors.

jeffbee•2mo ago
The Deep Research stuff is crazy good. It solves the issue that I can often no longer find articles that I know are out there. Example: yesterday I was holding forth on the socials about how 25 years ago my local government did such and such thing to screw up an apartment development at the site of an old movie theater, but I couldn't think of the names of any of the principals. After Googling for a bit I used a Deep Research bot to chase it down for me, and while it was doing that I made a sandwich. When I came back it had compiled a bunch of contemporaneous news articles from really obscure bloggers, plus allusions to public records it couldn't access but was confident existed, that I later found using the URLs and suggested search texts.
otistravel•2mo ago
The most impressive demos of these tools always involve technical tasks where the user already knows enough to verify accuracy. But for the average person asking about health issues, legal questions, or historical facts? It's basically fancy snake oil - confident-sounding BS that people can't verify. The real breakthrough would be systems that are actually trustworthy without human verification, not slightly better BS generators. True AI research breakthroughs would admit uncertainty and provide citations for everything, not fake certainty like these tools do.
spongebobstoes•2mo ago
this remains true for pretty much all advice or information we receive. doctors, lawyers, accountants, teachers. there have been countless times that all of these professionals have given me bad advice or information

sure, at least I have someone to blame in that case. but in my experience, the AI is at least as reliable as a person who I don't personally know

neural_thing•2mo ago
I tested o3 on a medical issue I've had that 50+ doctors couldn't diagnose over the span of 6-7 years, ended up figuring it out through sheer luck. With my first prompt, it gave a list of probabilities, with the correct answer being listed as the third most likely. It also suggested correct tests to run for every option. I trust it way more than I trust human doctors who were confidently wrong about me for years.
arkh•2mo ago
I still think a simple expert system, used by a nurse practiced at getting people to explain their symptoms, would be enough to replace a general MD.

Less time and money spent training those nurses, which you can then spend on training specialists. And your expert system will take less time to update than retraining thousands of doctors every time some new protocol or drug is released.

franze•2mo ago
Totally agree. It identified an infection with a dangerous bacterium based on a single photo - the doctor only thought about it after being presented with the AI's opinion.
FieryTransition•2mo ago
Plenty of studies show that these models are better at catching and diagnosing than even a board of doctors. Doctors are good at other things, and I hope the future will allow doctors to use these models alongside their practice.

The problem is when the AI makes a catastrophic prediction and the layman can't see it.

intended•2mo ago
Plenty of studies is news to me. I’ve only seen anecdotal content.
NewsaHackO•2mo ago
Yes, and I am pretty sure this is already an established phenomenon; AI and ML systems in general are able to apply domain-specific algorithms better than the people who wrote them, because of how much exposure they get during training.
spongebobstoes•2mo ago
as a layman, I can't see errors that professionals make either. I trust them, and later I experience the consequences of their mistakes. sometimes catastrophically.

I don't see how it is really different with AI

otabdeveloper4•2mo ago
> but in my experience, the AI is at least as reliable as a person who I don't personally know

How do you know this?

blackhaz•2mo ago
This is surprising. o3 produces an incredible amount of hallucinations for me, and there are lots of Reddit threads about it. I've had to roll back to another model because it just swamps everything in made-up facts. But sometimes it is frighteningly smart. Reading its output sometimes feels like I'm missing IQ points.
in_ab•2mo ago
Claude doesn't seem to have a built-in search tool, but I tried this with an MCP server that searches Google, and it gives similar results.
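
A minimal sketch of such a server using the MCP Python SDK's FastMCP helper (assumed API; the search itself is stubbed out):

    from mcp.server.fastmcp import FastMCP

    mcp = FastMCP("google-search")

    @mcp.tool()
    def search(query: str) -> str:
        """Return search results for the query (stubbed here)."""
        return f"Results for: {query}"  # wire a real search API in here

    if __name__ == "__main__":
        mcp.run()  # serves over stdio so a client like Claude can call the tool
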
jonas_b•2mo ago
A common Google-search task I encounter is something like this:

I need to get from A to B via C via public transport in a big metropolis.

Now C could be one of, say, 5 different locations of a bank branch, electronics retailer, blood-test lab, or whatever, so there are multiple ways of going about this.

I would like a chatbot solution that compares all the different options and lays them out ranked by time from A to B. Is this doable today?

pyfon•2mo ago
I am on holiday now and want something similar. Get me from A to B, but with a memorable heuristic that I can use if I leave at any time. E.g. "if you catch a 134 or 175 bus to Kings station, then get the metro 3 stops to Central station." Even better if you add some landmarks.

This may exclude some clever routes that shave off 3 minutes if you do the correct parkour... but it means I can now put my phone down and enjoy the journey without tracking it like a hawk.

csallen•2mo ago
It's actually quite doable to build your own deep research agent. You just need a single prompt, a solid code loop to run it agentically, and some tools for it to call. I've been building a domain-specific deep research agent over the past few days for internal use, and I'm pretty impressed with how much better it is than any of the official deep search agents for my use case.
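
A minimal sketch of that shape - one prompt, a loop, one tool - using the OpenAI chat-completions tool-calling API; search_web() is a hypothetical stand-in for whatever search backend you wire up:

    import json
    from openai import OpenAI

    client = OpenAI()

    def search_web(query: str) -> str:
        """Hypothetical tool: return search-result snippets for the query."""
        raise NotImplementedError("plug in your search backend here")

    TOOLS = [{
        "type": "function",
        "function": {
            "name": "search_web",
            "description": "Search the web and return result snippets.",
            "parameters": {
                "type": "object",
                "properties": {"query": {"type": "string"}},
                "required": ["query"],
            },
        },
    }]

    messages = [
        {"role": "system", "content": "You are a research agent. Search, read, "
                                      "and iterate until you can answer with citations."},
        {"role": "user", "content": "Research question goes here."},
    ]

    while True:
        resp = client.chat.completions.create(
            model="gpt-4o", messages=messages, tools=TOOLS)
        msg = resp.choices[0].message
        messages.append(msg)
        if not msg.tool_calls:        # no more tool use: final report
            print(msg.content)
            break
        for call in msg.tool_calls:   # run each requested search
            args = json.loads(call.function.arguments)
            messages.append({
                "role": "tool",
                "tool_call_id": call.id,
                "content": search_web(args["query"]),
            })
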
mz00•2mo ago
Same here. I first built the agentic workflow in Python and later Next.js. It uses dozens of LLM APIs. Works well, and I'm also impressed with the results.
csallen•2mo ago
Curious what you mean by dozens of LLM APIs. Different models? Or different tool calls?
mz00•2mo ago
Different models. High-level approach: N models receive the same prompt → responses are consolidated through a simple embedding-based similarity matrix → another model performs a simple assessment of the deduplicated responses → these are then consolidated into a research report & data visualization.

The whole contraption uses ~10 different models, but more can easily be plugged into the initial generation phase. Happy to demo it sometime! [edit: email on profile]
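
A sketch of that embedding-similarity dedup step, with a hypothetical embed() that maps each response to a unit vector (any embedding API would do):

    import numpy as np

    def dedupe(responses: list[str], embed, threshold: float = 0.9) -> list[str]:
        vecs = np.stack([embed(r) for r in responses])  # (n, d) unit vectors
        sims = vecs @ vecs.T                            # cosine similarity matrix
        kept: list[int] = []
        for i in range(len(responses)):
            # keep a response only if it isn't a near-duplicate of one already kept
            if all(sims[i, j] < threshold for j in kept):
                kept.append(i)
        return [responses[i] for i in kept]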

mehulashah•2mo ago
I find that people often conflate search with analytics when discussing Deep Research. Deep Research is iterated search and tool use. And, no doubt, it's remarkably good. Deep Analytics is like Deep Research in that it uses generative AI models to generate a plan, but LLM operations and structured operations (tool use) are interleaved in database-style query pipelines. This allows for the more precise counting and exhaustive-search types of use cases.
gcanyon•2mo ago
The concern over "LLMs vs. the Web" is giving me serious "The Web vs. Brick and Mortar" vibes. That's not to say that it won't be the predicted cataclysm, just that it might not be. Time will tell, because This Is Happening, People, but if it does turn out to be a serious problem, I think we'll find some way to adapt. We're unlikely to accept a lesser result.
Tycho•2mo ago
I tried o3 for a couple of things.

First one: geolocating a photo I saw in a museum. It didn't find a definitive answer, but it sure turned up a lot of fascinating info in its research.

Second one: I asked it to suggest a new line of enquiry in the Madeleine McCann missing-person case. It made the interesting suggestion that the 30-minute phone call the suspect made on the evening of the disappearance, from a place near the location of the abduction, was actually a sort of "lookout call" to an accomplice nearby.

Quite impressed. This is a great investigative tool.

gitroom•2mo ago
I feel like half the time these AI tools are either way too smart or just eating glue, tbh - do you think people will ever actually trust AI for deep answers or are we all just using it to pass time at work?
soulofmischief•2mo ago
Do you understand how search works? After finding something, you still have to verify it.
noja•2mo ago
Can it geotag my old scanned photos?
simonw•2mo ago
Latest models might actually be able to help with that - they are getting spooky good at guessing the location of a photograph: https://techcrunch.com/2025/04/17/the-latest-viral-chatgpt-t...
noja•2mo ago
Even with slightly blurry old photos?
simonw•2mo ago
Worth a try. It's pretty fascinating seeing what it hooks into - it figured out where my backyard was based on the moss on the trees and the architectural style of the buildings ("coastal California, likely near Half Moon Bay" - it was just north of HMB).
Alifatisk•2mo ago
Has anyone tried ithy.com yet? If you have any prompt that you know most LLMs fail at, I'd love to know how Ithy responds!
BambooBandit•2mo ago
I feel like the bigger problem isn't whether these deep research products work, but rather the raw material, so to speak, that they're working with.

For example, a lot of the "sources" cited in Google's AI Overview (notably not a deep research product) are not official, just sites that probably rank high in SEO. I want the original source, or a reliable source, not joeswebsite dot com (no offense to this website if it indeed exists).

simonw•2mo ago
Yes, Google's AI overviews are terrible. They're an example of how not to build this.

That's what makes o3/o4-mini-driven search notable to me: those models appear to have much better taste in which searches to run and which sources to consider.