I suppose in this case it picked up an existing project and DIDN’T walk off a cliff? Were the mutations really small?
I’ve been starting with topics where I’m already familiar with the answer but want a refresher. So far, I’m not impressed. Sometimes the info will be correct. Most of the time it strings together a lot of words from the material it finds, but it reads like an undergrad trying to paraphrase the Wikipedia page without understanding the content. Often it will have one bullet point that is completely wrong.
The other problem I’m having is that it’s not very good at identifying poor sources. This is less of a problem with topics like math and engineering, but a big problem with topics like health and medicine where it will pick up alternative medicine and pseudoscience pages and integrate them into the research as if they were real. There are a lot of health and medicine topics where the way pseudoscience people talk about a subject doesn’t match the real science, but they use the same words and therefore catch the same search terms.
An example is the way “dopamine” is used in casual conversation and by influencers in ways that aren’t accurate. Concepts like “dopamine fasting” or claiming things “raise your dopamine” aren’t scientifically accurate but use the same words nevertheless and therefore can get pulled into the training set and searches.
Reddit may not be the greatest source for hard science, but for things like "tell me what shoes people are finding helpful for their plantar fasciitis" I appreciate reddit's anecdata over every other source.
Reddit wants money for its users' data. Is the reason that Anthropic doesn't want to pay Reddit's shareholders for it?
Also Sam Altman owns quite a lot of Reddit stock and was briefly the CEO, so it's not inconceivable he's influenced them not to cooperate with one of his chief rivals.
1) A response originating from LLM pre-training, in a domain where there has not been any (successful) RL-for-reasoning post-training. In this case the amount of reasoning around the raw facts "recalled" by the LLM is going to be limited to whatever reasoning was present in the training data.
2) A non-agentic response in a domain like Math Olympiad problems where the LLM was post-trained with RL to encourage reasoning mirroring this RL training set. This type of domain-specific reasoning training seems to have little benefit in other domains (although in the early LLM days it was said that training on computer code did provide some general benefit).
3) An agentic response, such as from one of these research systems, where it seems the agent is following some sort of generic research / summarization template with prescribed steps. I've never tried these myself, but it seems they can be quite successful in deep diving and gathering relevant source material, but then the ability to reason over this retrieved material is going to come down to the reasoning capability of the underlying model per 1) and 2) above.
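Roughly, I imagine the prescribed template in 3) looks something like the sketch below (purely illustrative; the llm/search/fetch helpers are hypothetical placeholders, not any real product's API). Retrieval is templated, but the final step still leans entirely on the base model's reasoning.

    # Hand-wavy sketch of a prescribed research/summarization loop.
    # llm(), search() and fetch() are hypothetical placeholders.
    def deep_research(question, llm, search, fetch, max_sources=5):
        # Step 1: expand the question into concrete search queries.
        queries = llm(f"List search queries for researching: {question}").splitlines()

        # Step 2: gather and summarize sources (deep diving works well here).
        notes = []
        for query in queries:
            for url in search(query)[:max_sources]:
                page = fetch(url)
                notes.append(llm(f"Summarize what this page says about {question}:\n{page}"))

        # Step 3: synthesize - this is where the underlying model's reasoning
        # (points 1 and 2 above) becomes the bottleneck.
        return llm(f"Write a report answering {question} using these notes:\n" + "\n".join(notes))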
Bottom line would seem to be that with today's systems, domain-specific reasoning capability largely comes down to RL post-training for reasoning in that specific domain, resulting in what some call "jagged" performance - excellent in some areas and very poor in others. Demis Hassabis, for one, seems to be saying that this will not be fixed until architectural changes/additions are made to bring us closer to AGI.
Hate to break it to you, but if GPT-5 is a better AI researcher than you, you were probably not that good to begin with.
Does Codex tell you why 95% of AI projects in the enterprise fail? Or why the only study to date on the merits of AI for coding shows a 19% decrease in productivity?
Also this footnote:
> “Alone” here is relative - I did use ChatGPT and a bit of Copilot to generate some of the training code in my last attempt. I just didn’t use any agentic tool
I have no words. I wonder if this "AI researcher" can make it through the original Attention Is All You Need paper without an LLM.
>I built significant pieces of the Copilot onboarding, purchasing, billing and settings flow. For eight months I headed up the Copilot anti-abuse effort. I then led the launch of GitHub Models, and am now working on other Copilot projects.
As an aside I had a look at GitHub Models and it was quite interesting - you can try the API for a number of models for free using your GitHub login.
Hypothesis: it's the standard "teacher-student" or "distillation" trick - if you're learning next-token-prediction, you only learn what the correct answer is (i.e. the spike in probability), but when you're distilling from a teacher model, you learn the entire distribution of potential answers.
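For what it's worth, a minimal sketch of that loss (a toy PyTorch setup with made-up shapes, not any particular lab's training code): the hard-label term only rewards the single "correct" token, while the KL term against the teacher's softened logits exposes the whole distribution.

    import torch
    import torch.nn.functional as F

    def distillation_loss(student_logits, teacher_logits, hard_labels,
                          temperature=2.0, alpha=0.5):
        """Blend soft-target KL (learn the whole distribution) with hard-label CE."""
        # Soften both distributions; the T**2 factor keeps gradient scale comparable.
        soft_student = F.log_softmax(student_logits / temperature, dim=-1)
        soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
        kl = F.kl_div(soft_student, soft_teacher, reduction="batchmean") * temperature ** 2

        # Standard next-token cross-entropy against the single correct token.
        ce = F.cross_entropy(student_logits, hard_labels)
        return alpha * kl + (1 - alpha) * ce

    # Toy usage: batch of 4 positions over a 10-token vocabulary.
    student_logits = torch.randn(4, 10, requires_grad=True)
    teacher_logits = torch.randn(4, 10)
    labels = torch.randint(0, 10, (4,))
    distillation_loss(student_logits, teacher_logits, labels).backward()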
Curious, can anyone more experienced in AI research comment on this?
If I were the owner of an AI company that was forever trying to juice its valuation and raise money, you can bet I'd be telling people I had built a magic self-improving AI.
And even with pure model development, the incremental changes needed to try different strategies in notebooks, etc., are probably not that hard to write when given clear instructions by a data scientist. (I’m not saying these disciplines are easy, I’m saying that a data scientist can more easily describe what they want.)
Or the devops stuff. Or the RL UIs.
All that stuff is run of the mill software in service of building the models. And it can be vibe coded.
A lot of negative comments on here, which seems to always be the case with HN and vibe coding. The reality is that it’s actually starting to work, quite well.
What happened to the MLE? Are they all going to end up that way?
> A lot of negative comments on here, which seems to always be the case with HN and vibe coding. The reality is that it’s actually starting to work, quite well.
It's hard to be positive about the idea of your skills getting devalued and getting kicked to the curb.
I think it depends where people build their own identities in the value stream. Do you see yourself as a product/hacker type person and writing code just is a blocker on delivering your vision? Building greenfield prototypes is now 100x easier! Do you see yourself as a craftsperson that brings years of experience to hard technical challenges? Some folks see AI as an attack, and some see it as a way to remove some drudgery while they focus on harder problems. It is about mindset.
What skills do you value?
> I think it depends where people build their own identities in the value stream. Do you see yourself as.... It is about mindset.
No. You're missing the point in a pretty serious way. In large part, we're talking about lower levels on the hierarchy of needs than that.
> Some folks see AI as an attack, and some see it as a way to remove some drudgery while they focus on harder problems.
Some folks see it as a way to remove hard problems so they can focus on drudgery. Do you love to code, but hate code reviews? Guess what you get to do more of now!
> Building greenfield prototypes is now 100x easier!
And then the boss-man can use his MBA, fire 90% of the team, pocket their wages, and get 10x the prototypes. Repeat that through the economy. Your teammates have 20 years to retirement and can't pay their mortgage. "Progress!," but for whom?
My intuition is that with the law of leverage in mind, the former would be relatively low and the latter would be relatively high.
It is of course up to the government and culture to minimize the former and maximize the latter.
Same for tasks you know how to do but AI does faster; there is also value there. (Claude Code used by a senior goes here.)
The interesting thing is when AI lifts the ceiling everywhere, but maybe that is when we are almost in AGI territory.
I feel like that's the only way this insanely high bet on AI could possibly pay off - if AGI/"ceiling lifting" is achieved then maybe all this could be the trillion+ bet they expect. If it doesn't really progress much beyond where it is now, they're in trouble.
We have several synthetic datasets and automated evaluation options for such things that were close to impossible to do before LLMs.
While the 5-minute model will never be useful in itself, it lays the groundwork for amateurs and small groups to get into developing small models. There's at the moment another HN headline hyping up a tiny model that scores impressively on the ARC-AGI benchmarks, so it's clearly not a dead end to explore "household-affordable" models.
Though an approach that doesn't lean on the author's $200/month OAI sub would've been more interesting to follow.
And if it is then I'm a farmer because I bought potatoes from the store.
They aren't excited about anything. They aren't in awe. They haven't done any hard work. They're just here to ooze lukewarm sludge.