frontpage.

Show HN: Knowledge-Bank

https://github.com/gabrywu-public/knowledge-bank
1•gabrywu•1m ago•0 comments

Show HN: The Codeverse Hub Linux

https://github.com/TheCodeVerseHub/CodeVerseLinuxDistro
1•sinisterMage•2m ago•0 comments

Take a trip to Japan's Dododo Land, the most irritating place on Earth

https://soranews24.com/2026/02/07/take-a-trip-to-japans-dododo-land-the-most-irritating-place-on-...
1•zdw•2m ago•0 comments

British drivers over 70 to face eye tests every three years

https://www.bbc.com/news/articles/c205nxy0p31o
1•bookofjoe•2m ago•1 comment

BookTalk: A Reading Companion That Captures Your Voice

https://github.com/bramses/BookTalk
1•_bramses•3m ago•0 comments

Is AI "good" yet? – tracking HN's sentiment on AI coding

https://www.is-ai-good-yet.com/#home
1•ilyaizen•4m ago•1 comment

Show HN: Amdb – Tree-sitter based memory for AI agents (Rust)

https://github.com/BETAER-08/amdb
1•try_betaer•5m ago•0 comments

OpenClaw Partners with VirusTotal for Skill Security

https://openclaw.ai/blog/virustotal-partnership
1•anhxuan•5m ago•0 comments

Show HN: Seedance 2.0 Release

https://seedancy2.com/
1•funnycoding•5m ago•0 comments

Leisure Suit Larry's Al Lowe on model trains, funny deaths and Disney

https://spillhistorie.no/2026/02/06/interview-with-sierra-veteran-al-lowe/
1•thelok•5m ago•0 comments

Towards Self-Driving Codebases

https://cursor.com/blog/self-driving-codebases
1•edwinarbus•5m ago•0 comments

VCF West: Whirlwind Software Restoration – Guy Fedorkow [video]

https://www.youtube.com/watch?v=YLoXodz1N9A
1•stmw•6m ago•1 comment

Show HN: COGext – A minimalist, open-source system monitor for Chrome (<550KB)

https://github.com/tchoa91/cog-ext
1•tchoa91•7m ago•1 comment

FOSDEM 26 – My Hallway Track Takeaways

https://sluongng.substack.com/p/fosdem-26-my-hallway-track-takeaways
1•birdculture•8m ago•0 comments

Show HN: Env-shelf – Open-source desktop app to manage .env files

https://env-shelf.vercel.app/
1•ivanglpz•11m ago•0 comments

Show HN: Almostnode – Run Node.js, Next.js, and Express in the Browser

https://almostnode.dev/
1•PetrBrzyBrzek•12m ago•0 comments

Dell support (and hardware) is so bad, I almost sued them

https://blog.joshattic.us/posts/2026-02-07-dell-support-lawsuit
1•radeeyate•13m ago•0 comments

Project Pterodactyl: Incremental Architecture

https://www.jonmsterling.com/01K7/
1•matt_d•13m ago•0 comments

Styling: Search-Text and Other Highlight-Y Pseudo-Elements

https://css-tricks.com/how-to-style-the-new-search-text-and-other-highlight-pseudo-elements/
1•blenderob•15m ago•0 comments

Crypto firm accidentally sends $40B in Bitcoin to users

https://finance.yahoo.com/news/crypto-firm-accidentally-sends-40-055054321.html
1•CommonGuy•15m ago•0 comments

Magnetic fields can change carbon diffusion in steel

https://www.sciencedaily.com/releases/2026/01/260125083427.htm
1•fanf2•16m ago•0 comments

Fantasy football that celebrates great games

https://www.silvestar.codes/articles/ultigamemate/
1•blenderob•16m ago•0 comments

Show HN: Animalese

https://animalese.barcoloudly.com/
1•noreplica•16m ago•0 comments

StrongDM's AI team build serious software without even looking at the code

https://simonwillison.net/2026/Feb/7/software-factory/
3•simonw•17m ago•0 comments

John Haugeland on the failure of micro-worlds

https://blog.plover.com/tech/gpt/micro-worlds.html
1•blenderob•17m ago•0 comments

Show HN: Velocity - Free/Cheaper Linear Clone but with MCP for agents

https://velocity.quest
2•kevinelliott•18m ago•2 comments

Corning Invented a New Fiber-Optic Cable for AI and Landed a $6B Meta Deal [video]

https://www.youtube.com/watch?v=Y3KLbc5DlRs
1•ksec•19m ago•0 comments

Show HN: XAPIs.dev – Twitter API Alternative at 90% Lower Cost

https://xapis.dev
2•nmfccodes•20m ago•1 comment

Near-Instantly Aborting the Worst Pain Imaginable with Psychedelics

https://psychotechnology.substack.com/p/near-instantly-aborting-the-worst
2•eatitraw•26m ago•0 comments

Show HN: Nginx-defender – realtime abuse blocking for Nginx

https://github.com/Anipaleja/nginx-defender
2•anipaleja•26m ago•0 comments

Predictions from the METR AI scaling graph are based on a flawed premise

https://garymarcus.substack.com/p/the-latest-ai-scaling-graph-and-why
50•nsoonhui•9mo ago

Comments

Nivge•9mo ago
TL;DR: the benchmark depends on its specific dataset, and it isn't a perfect measure of AI progress. That doesn't mean it makes no sense or has no value.
hatefulmoron•9mo ago
I had assumed that the Y axis corresponded to some measure of the LLM's ability to actually work/mull over a task in a loop while making progress. In other words, I thought it meant something like "you can leave Sonnet 3.7 alone for a whole hour and it will meaningfully progress on a problem", but the reality is less impressive. Serves me right for not reading the fine print.
dist-epoch•9mo ago
> Abject failure on a task that many adults could solve in a minute

Maybe the author should check whether the info in a post is already outdated before pressing "Publish".

ChatGPT passed the image generation test mentioned: https://chatgpt.com/share/68171e2a-5334-8006-8d6e-dd693f2cec...

frotaur•9mo ago
Even setting aside the fact that this image is simply an illustration, and really not the main point of the article: in the chat you posted, ChatGPT actually failed again, because the r's are not circled.
comex•9mo ago
That's true, but it illustrates a point about 'jagged intelligence'. Just like there's a tendency to cherry-pick the tasks AI is best at and equate it with general intelligence, there's a counter-tendency to cherry-pick the tasks AI is worst at and equate it with a general lack of intelligence.

This case is especially egregious because there were probably two different models involved. I assume Marcus' images came from some AI service that followed what until very recently was the standard pattern: you ask an LLM to generate an image; the LLM goes and fluffs out your text, then passes it to a completely separate diffusion-based image generation model, which has only a rudimentary understanding of English grammar. So of course his request for "words and nothing else" was ignored. This is a real limitation of the image generation model, but that has no relevance to the strengths and weaknesses of the LLM itself. And 'AI will replace humans' scenarios typically focus on text-based tasks that use the LLM itself.

Arguably AI services are responsible for encouraging users to think of what are really two separate models (LLM and image generation) as a single 'AI'. But Marcus should know better.

And so it's not surprising that ChatGPT was able to produce dramatically better results now that it has "native" image generation, which supposedly uses the native multimodal capabilities of the LLM (though rumors are that that description is an oversimplification). The results are still not correct. But it's a major advancement that the model now respects grammar; it no longer just spots the word "fruit" and generates a picture of fruit. Illustration or no, Marcus is misrepresenting the state of the art by not including this advancement.

If Marcus had used a recent ChatGPT output instead, the comparison would be more fair, but still somewhat misleading. Even with native capabilities, LLMs are simply worse at both understanding and generating images than they are at understanding and generating text. But again, text capability matters much more. And you can't just assume that a model's poor performance on images will correlate with poor performance on text.

The thing is, I tend to agree with the substance of Marcus's post, including the part where portrayals of current AI capabilities are suspect because they don't pass the 'sniff test', or in other words, because they don't take into account how LLMs continue to fall down on some very basic tasks. I just think the proper tasks for this evaluation should be text-based. I'd say the original "count the number of 'r's in strawberry" task is a decent example, even if it's been patched, because it really showcases the 'confidently wrong' issue that continues to plague LLMs.
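
To make the two-model hand-off concrete, here is a minimal sketch of the legacy pipeline described above; every name in it is invented for illustration and is not any vendor's actual API:

    # Hypothetical sketch of the legacy two-stage text-to-image pipeline.
    # All names are invented; this is not any vendor's real API.

    def llm_expand_prompt(user_request: str) -> str:
        # Stage 1: the LLM "fluffs out" the request into a detailed caption.
        # A meta-instruction like "words and nothing else" tends to get
        # folded into descriptive prose here, not preserved as a constraint.
        return f"A high-quality, detailed image of: {user_request}"

    def diffusion_generate(caption: str) -> bytes:
        # Stage 2: a separate diffusion model sees only the caption, keys on
        # salient nouns ("fruit"), and knows nothing of the original
        # conversation or its constraints.
        return caption.encode("utf-8")  # placeholder standing in for image bytes

    image = diffusion_generate(
        llm_expand_prompt("the words 'apple, banana' and nothing else"))

Native multimodal generation removes that hand-off, which is consistent with grammar-respecting output only appearing once the LLM itself produces the image.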

croes•9mo ago
So OpenAI fixed that, but the next simple task on which AI fails is just around the corner.

The problem is that AI doesn't think, and if a task is totally new it doesn't produce the correct answer.

https://news.ycombinator.com/item?id=43800686

yorwba•9mo ago
> you could probably put together one reasonable collection of word counting and question answering tasks with average human time of 30 seconds and another collection with an average human time of 20 minutes where GPT-4 would hit 50% accuracy on each.

So do this and pick the one where humans do best. I doubt that doing so would show all progress to be illusory.

But it would certainly be interesting to know what the easiest thing is that a human can do but current AIs struggle with.
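
One rough way to run the check proposed above, sketched with invented data and field names:

    # Sketch of the proposed experiment: bucket tasks by average human
    # completion time, then compare model accuracy across buckets.
    # The tasks and their fields are fabricated for illustration.

    tasks = [
        {"human_minutes": 0.5, "model_correct": True},
        {"human_minutes": 0.7, "model_correct": False},
        {"human_minutes": 20.0, "model_correct": True},
        {"human_minutes": 25.0, "model_correct": False},
    ]

    def accuracy_by_bucket(tasks, threshold_minutes=5.0):
        buckets = {"short": [], "long": []}
        for t in tasks:
            key = "short" if t["human_minutes"] <= threshold_minutes else "long"
            buckets[key].append(t["model_correct"])
        return {k: sum(v) / len(v) for k, v in buckets.items() if v}

    print(accuracy_by_bucket(tasks))  # e.g. {'short': 0.5, 'long': 0.5}

If accuracy comes out flat across adversarially chosen buckets, human time alone didn't capture difficulty for the model; if progress still shows up release over release, it wasn't illusory.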

xg15•9mo ago
> But it would certainly be interesting to know what the easiest thing is that a human can do but current AIs struggle with.

Still "Count the R's" apparently.

K0balt•9mo ago
The problem, really, is human cognitive dissonance. We draw the false conclusion that competence at some tasks implies competence at others. It's not a universal human problem: we intuit that a front-end loader, just because it can dig really well, is not therefore good at all other tasks. But when it comes to cognition, our models break down quickly.

I suspect this is because our proxies are predicated on a task set that inherently includes the physical world, which at some level connects all tasks and creates links between capabilities that generally pervade our environment. LLMs do not exist in this physical world, and are therefore not within the set of things that can be reasoned about with those proxies.

This will probably gradually change with robotics, as the competencies required to exist and function in the physical world will (I postulate) generalize to other tasks in such a way that it more closely matches the pattern that our assumptions are based on.

Of course, if we segregate intelligence into isolated modules for motility and cognition, this will not be the case, as we will not be taking advantage of that generalization. I think that would be a big mistake, especially in light of the hypothesis that the massive leap in the capabilities of LLMs came more from training on things we weren't specifically trying to achieve: the bulk of seemingly irrelevant data that unlocked simple language processing into reasoning and world modeling.

the8472•9mo ago
> LLMs do not exist in this physical world, and are therefore not within the set of things that can be reasoned about with those proxies.

Perhaps not the mainstream models, but DeepMind has been working on robotics models with simulated and physical RL for years: https://deepmind.google/discover/blog/rt-2-new-model-transla...

mentalgear•9mo ago
What you are describing are world models and physical AI, which have become much more mainstream since the recent Nvidia GTC.
AIPedant•9mo ago
Dogs can pass a dog-appropriate variant of this test: https://xcancel.com/SpencerKSchiff/status/191010636820533676... (the dog test uses a treat on one string and junk on the other; they have to pull the correct string to get the treat)

This was before o3, but another tweet I saw (don't have the link) suggests o3 is also completely incapable of getting it.

Sharlin•9mo ago
> Unfortunately, literally none of the tweets we saw even considered the possibility that a problematic graph specific to software tasks might not generalize to literally all other aspects of cognition.

How am I not surprised?

aoeusnth1•9mo ago
This post is a very weak and incoherent criticism of a well-formulated benchmark: the task-length bucket for which a model succeeds 50% of the time.

Gary says: this is just the task length that the models were able to solve in THIS dataset. What about other tasks?

Yeah, obviously. The point is that models are improving on these tasks in a predictable fashion. If you care about software, you should care how good AI is at software.

Gary says: task length is a bad metric. What about a bunch of other factors of difficulty which might not factor into task length?

Task length is a pretty good proxy for difficulty; that's why people estimate a bug in days. Of course many factors contribute to this estimate, but averaged over many tasks, time is a great metric for difficulty.

Finally, Gary just ignores that, despite his view that the metric makes no sense and is meaningless, it has extremely strong predictive value. This should give you pause: how can an arbitrary metric with no connection to the true difficulty of a task, with no real way of comparing its validity across tasks or across task-takers, produce such a retrospectively smooth curve and so closely predict the recent data points from Sonnet and o3? Something IS going on there, which cannot fit into Gary's ~spin~ narrative that nothing ever happens.
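
For readers who want the metric pinned down, a minimal sketch of the 50% time-horizon idea: fit success probability against log task length, then solve for the length at which the fitted probability crosses 0.5. The data here is fabricated, and METR's actual methodology has more moving parts:

    # Toy version of the 50% time-horizon metric: logistic fit of
    # P(success) against log(task length), solved for P = 0.5.
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    task_minutes = np.array([1, 2, 4, 8, 15, 30, 60, 120, 240])
    succeeded    = np.array([1, 1, 1, 1,  0,  1,  0,   0,   0])

    X = np.log(task_minutes).reshape(-1, 1)
    clf = LogisticRegression().fit(X, succeeded)

    # P = 0.5 where the linear term vanishes: b0 + b1*log(t) = 0
    b0, b1 = clf.intercept_[0], clf.coef_[0, 0]
    print(f"50% time horizon ~ {np.exp(-b0 / b1):.0f} minutes")

Plotting that horizon per model release over time is what produces the exponential-looking curve under dispute.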

sandspar•9mo ago
Gary Marcus could save himself lots of time. He just has to write a post called "Here's today's opinion." Because he's so predictable, he could just leave the body text blank. Everyone knows his conclusions anyway. This way he could save himself and his readers lots of time.