frontpage.
newsnewestaskshowjobs

Made with ♥ by @iamnishanth

Open Source @Github

fp.

State Department will delete Xitter posts from before Trump returned to office

https://www.npr.org/2026/02/07/nx-s1-5704785/state-department-trump-posts-x
1•righthand•2m ago•0 comments

Show HN: Verifiable server roundtrip demo for a decision interruption system

https://github.com/veeduzyl-hue/decision-assistant-roundtrip-demo
1•veeduzyl•3m ago•0 comments

Impl Rust – Avro IDL Tool in Rust via Antlr

https://www.youtube.com/watch?v=vmKvw73V394
1•todsacerdoti•3m ago•0 comments

Stories from 25 Years of Software Development

https://susam.net/twenty-five-years-of-computing.html
1•vinhnx•4m ago•0 comments

minikeyvalue

https://github.com/commaai/minikeyvalue/tree/prod
2•tosh•9m ago•0 comments

Neomacs: GPU-accelerated Emacs with inline video, WebKit, and terminal via wgpu

https://github.com/eval-exec/neomacs
1•evalexec•13m ago•0 comments

Show HN: Moli P2P – An ephemeral, serverless image gallery (Rust and WebRTC)

https://moli-green.is/
2•ShinyaKoyano•17m ago•1 comments

How I grow my X presence?

https://www.reddit.com/r/GrowthHacking/s/UEc8pAl61b
2•m00dy•19m ago•0 comments

What's the cost of the most expensive Super Bowl ad slot?

https://ballparkguess.com/?id=5b98b1d3-5887-47b9-8a92-43be2ced674b
1•bkls•20m ago•0 comments

What if you just did a startup instead?

https://alexaraki.substack.com/p/what-if-you-just-did-a-startup
3•okaywriting•26m ago•0 comments

Hacking up your own shell completion (2020)

https://www.feltrac.co/environment/2020/01/18/build-your-own-shell-completion.html
2•todsacerdoti•29m ago•0 comments

Show HN: Gorse 0.5 – Open-source recommender system with visual workflow editor

https://github.com/gorse-io/gorse
1•zhenghaoz•30m ago•0 comments

GLM-OCR: Accurate × Fast × Comprehensive

https://github.com/zai-org/GLM-OCR
1•ms7892•31m ago•0 comments

Local Agent Bench: Test 11 small LLMs on tool-calling judgment, on CPU, no GPU

https://github.com/MikeVeerman/tool-calling-benchmark
1•MikeVeerman•32m ago•0 comments

Show HN: AboutMyProject – A public log for developer proof-of-work

https://aboutmyproject.com/
1•Raiplus•32m ago•0 comments

Expertise, AI and Work of Future [video]

https://www.youtube.com/watch?v=wsxWl9iT1XU
1•indiantinker•32m ago•0 comments

So Long to Cheap Books You Could Fit in Your Pocket

https://www.nytimes.com/2026/02/06/books/mass-market-paperback-books.html
3•pseudolus•33m ago•1 comments

PID Controller

https://en.wikipedia.org/wiki/Proportional%E2%80%93integral%E2%80%93derivative_controller
1•tosh•37m ago•0 comments

SpaceX Rocket Generates 100GW of Power, or 20% of US Electricity

https://twitter.com/AlecStapp/status/2019932764515234159
2•bkls•37m ago•0 comments

Kubernetes MCP Server

https://github.com/yindia/rootcause
1•yindia•38m ago•0 comments

I Built a Movie Recommendation Agent to Solve Movie Nights with My Wife

https://rokn.io/posts/building-movie-recommendation-agent
4•roknovosel•38m ago•0 comments

What were the first animals? The fierce sponge–jelly battle that just won't end

https://www.nature.com/articles/d41586-026-00238-z
2•beardyw•47m ago•0 comments

Sidestepping Evaluation Awareness and Anticipating Misalignment

https://alignment.openai.com/prod-evals/
1•taubek•47m ago•0 comments

OldMapsOnline

https://www.oldmapsonline.org/en
2•surprisetalk•49m ago•0 comments

What It's Like to Be a Worm

https://www.asimov.press/p/sentience
2•surprisetalk•49m ago•0 comments

Don't go to physics grad school and other cautionary tales

https://scottlocklin.wordpress.com/2025/12/19/dont-go-to-physics-grad-school-and-other-cautionary...
2•surprisetalk•49m ago•0 comments

Lawyer sets new standard for abuse of AI; judge tosses case

https://arstechnica.com/tech-policy/2026/02/randomly-quoting-ray-bradbury-did-not-save-lawyer-fro...
5•pseudolus•50m ago•0 comments

AI anxiety batters software execs, costing them combined $62B: report

https://nypost.com/2026/02/04/business/ai-anxiety-batters-software-execs-costing-them-62b-report/
1•1vuio0pswjnm7•50m ago•0 comments

Bogus Pipeline

https://en.wikipedia.org/wiki/Bogus_pipeline
1•doener•51m ago•0 comments

Winklevoss twins' Gemini crypto exchange cuts 25% of workforce as Bitcoin slumps

https://nypost.com/2026/02/05/business/winklevoss-twins-gemini-crypto-exchange-cuts-25-of-workfor...
2•1vuio0pswjnm7•52m ago•0 comments
Open in hackernews

Reverse engineering Codex CLI to get GPT-5-Codex-Mini to draw me a pelican

https://simonwillison.net/2025/Nov/9/gpt-5-codex-mini/
168•simonw•3mo ago

Comments

f311a•3mo ago
Installing Rust projects is usually as simple as calling `cargo install`. No need to wait for 5-30 minutes until LLM figures this out.

People are delegating way too much to LLMs. In turn, this makes your own research or problem-solving skills less sharp.

Retr0id•3mo ago
Figuring out how to build a project in an unfamiliar language/build system is my least favourite activity, mainly because all the people who are familiar with those tools think it's "as simple as" and don't bother to write it down anywhere. I don't plan on learning every build system ever.
heddycrow•3mo ago
I did not know how to do X so I Y.

It would be interesting to know what kinds of responses humans offer across different values of Y such as:

1) looked on stack overflow 2) googled it 3) consulted the manual 4) asked an LLM 5) asked a friend

For each of these, does the learner somehow learn something more or better?

Is there some means of learning that doesn't degrade us as human beings according to those in the know?

I ask as someone who listens to audiobooks and answers yes when someone asks me if I've read the book. And that's hardly the extent of my transgressions.

hrimfaxi•3mo ago
At least if you're copy/pasting from stack overflow you presumably glanced at the change you are copying if only to ensure you select the correct text.
heddycrow•3mo ago
Good point. We also sometimes leave comments in code noting the thread we referenced.
sroussey•2mo ago
Yeah because the code on stack overflow has a license.
simonw•2mo ago
Yeah, it's CC BY-SA 4.0: https://stackoverflow.com/help/licensing
sroussey•2mo ago
You forgot to read the readme
heddycrow•2mo ago
I did! How could I forget that and RTFM? The code is self-documenting! Lol.
globular-toast•3mo ago
Ok but I'd argue Rust/Cargo shouldn't be an "unfamiliar language/build system" for most professional programmers these days. It's like a professional plumber being unfamiliar with solder. Like, yeah, you can do a lot without soldering, but imagine a pro just straight up not having a clue about solder.

It's worth learning how to do this stuff. Not just because you then know that particular build system, but because you get better at learning. Learning how to learn is super important. I haven't come across a new project that's taken me more than a few minutes to figure out how to build in years.

thunky•3mo ago
> Ok but I'd argue Rust/Cargo shouldn't be an "unfamiliar language/build system" for most professional programmers these days.

This isn't even close to true. The majority of programmers will be fine going their entire career without even knowing what Rust is, let alone how to build Rust projects.

A more accurate analogy would be a plumber not knowing how his wrench was manufactured.

zahlman•2mo ago
Rust ranks 16th on the current TIOBE (https://www.tiobe.com/tiobe-index/), behind assembly, PHP and R. It is still not remotely as popular (as "based on the number of skilled engineers world-wide, courses and third party vendors") as C or C++ (to say nothing of how dominant Python has become).

The supposed ubiquity of Rust is the result of a hype and/or drama bubble.

mock-possum•2mo ago
Conversely: I’ve been a professional programmer for nearly 20 years and I’ve never touched rust/cargo - I’ve never needed to.

(Though in general I do agree with “it’s worth learning how to do this stuff.)

simonw•3mo ago
Quoting my article:

> This is a useful starting point for a project like this—in figuring out the compile step the coding agent gets seeded with a little bit of relevant information about the project, and if it can compile that means it can later partially test the code it is writing while it works.

"Figure out how to build this" is a shortcut for getting a coding agent primed for future work. If you look at the transcript you can see what it did: https://gistpreview.github.io/?ddabbff092bdd658e06d8a2e8f142...

  • Ran ls
    └ AGENTS.md
      CHANGELOG.md
        +13 lines
      scripts
      sdk
  
  • Ran cat README.md
    └ <p align="center"><code>npm i -g @openai/codex</code><br />or <code>brew
      install --cask codex</code></p>
  
        +105 lines
  
      This repository is licensed under the [Apache-2.0 License](LICENSE).
  
  • Ran rg -n "rust" -n docs/install.md
    └ 23:curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh -s -- -y
      25:rustup component add rustfmt
      26:rustup component add clippy
  
  • Ran sed -n '1,200p' docs/install.md
    └ ## Install & build
  
        +36 lines
      cargo test
      ```
  
  • I’m gathering the build steps for the Rust tool, focusing on running cargo
    build inside the codex-rs directory. I'll check dependencies and confirm
    the instructions before actually building, aiming to provide a clear summary
    afterward.
  
  • Ran cargo build
    └     Updating git repository `https://github.com/nornagon/crossterm`
          Updating git repository `https://github.com/nornagon/ratatui`
        +825 lines
         Compiling codex-cli v0.0.0 (/Users/simon/Dropbox/dev/codex/codex-rs/cli)
          Finished `dev` profile [unoptimized + debuginfo] target(s) in 1m 27s
  
  • Ran ls target/debug | head
    └ apply_patch
      apply_patch.d
        +6 lines
      codex-execpolicy
That's a decent starting point on seeding the context with information that's relevant to making and then testing the modifications I'm about to ask for.
adastra22•3mo ago
What useful context is in there? How to call “cargo build”? It already knows that.
simonw•3mo ago
The README files that it looked at, and the directory structure it explored.

It now knows what the project is, what dependencies it uses, how it's laid out and the set of binaries that it generates.

Even more importantly: it knows that the project can be built without errors. If it tries a build later and sees an error it will know that the error was caused by code it had modified.

risyachka•3mo ago
You could just tell it to check out readme, but I suspect it would have checked it out anyway or figured out the type of project and how it is structured as a first step of any other command you give it as without it it is impossible to add or update the project.
simonw•3mo ago
The most important thing is to have it successfully build the software, to prove to both me and itself that a clean compile is possible before making any further changes.
adastra22•3mo ago
Suggestion: make a “check.sh” script that builds everything, lints everything, and runs all (fast) tests. Add a directive in the agent system prompt to call it before & after doing anything. If it fails, it will investigate why.
alganet•3mo ago
In situations like this is better to ask the agent to write a short document about how to run the project. Then you read it and delete useless parts. Then you ask the agent to follow that document and improve it until the software builds. By the final step, you get a personalized README.md for your needs.
cadamsdotcom•3mo ago
For a Rust developer, neglecting their ability to debug cargo build issues puts their career at risk. For someone like that, letting AI handle it would be a really shortsighted move.

But Simon isn’t a Rust developer - he’s a motivated individual with a side project. He can now speedrun the part he’s not interested in. That doesn’t affect anyone else’s decisions, you can still choose to learn the details. Ability to skip it if you wish, is a huge win for everyone.

risyachka•3mo ago
>> He can now speedrun the part he’s not interested in

In this case its more like slowrunning. Building rust project is 1 command and chatgpt will tell you this command in 5 seconds.

Running an agent for that is 1000x more inefficient.

At this point its not optimizing or speeding things up but running agent for the sake of running agent.

simonw•3mo ago
The best thing about having an agent figure this out is you don't even need to be at your computer while it works. I was cooking dinner.
cadamsdotcom•2mo ago
You’re not properly accounting for the risk of getting blocked on one of these 5 second tasks. Do an expected value calculation and things look very different.

Across a day of doing these little “run one command” tasks, even getting blocked by one could waste an hour. That makes the expected value calculation of each single task tilt much more in favor of a hands off approach.

Secondly, you’re not valuing the ability to take yourself out of the loop - especially when the task to be done by AI isn’t on the critical path, so it doesn’t matter if it takes 5 minutes or 5 milliseconds. Let AI run a few short commands while you go do something else that’ll definitely take longer than the difference - maybe a code review - and you’ve doubled your parallelism.

These examples are situational and up to the individual to choose how they operate, and they don’t affect you or your decisions.

antonvs•2mo ago
> He can now speedrun the part he’s not interested in.

The reductio that people tend to be concerned about is, what if someone is not interested in any aspect of software development, and just wants to earn money by doing it? The belief is that the consequences then start becoming more problematic.

simonw•2mo ago
Those people are their own worst enemies.

Some people will always look for ways to "cheat". I don't want to hold back everyone else just because a few people will harm themselves by using this stuff as a replacement for learning and developing themselves.

antonvs•2mo ago
Do you genuinely believe that this only applies to "a few people"?

This new post gets at the issue: https://news.ycombinator.com/item?id=45868271

simonw•2mo ago
I don't understand the argument that post is making.

I agree that people using LLMs in a lazy way that has negative consequences - like posting slop on social media - is bad.

What's not clearly to me is the scale of the problem. Is it 1/100 people who do this, or is it more like 1/4?

Just a few people behaving badly on social media can be viewed by thousands or even millions more.

Does that mean we should discard the entire technology, or should we focus on teaching people how to use it more positively, or should we regulate its use?

caymanjim•3mo ago
I would have done the same thing. I know how to build software in a dozen or more languages. I've done it manually, from scratch, in all of them. I don't know Rust. I have no immediate plan to learn Rust. I vaguely know that Cargo is something in the Rust toolbox. I don't have it installed. I don't particularly want to learn anything about it. It's a whole lot easier for me to tell the LLM to figure that out.

I might learn Rust some day. At the moment, I don't need the mental clutter.

adastra22•3mo ago
Well, fyi because it is really simple: if you have rust installed, you have cargo installed too. And to run a project you type “cargo run” from the base directory. That is all.
Retr0id•3mo ago
You get a build error because the rust version you have installed is incompatible with the codebase. Now you have to install rustup and...
caymanjim•2mo ago
I don't doubt that the process is ultimately fairly easy, but it always looks easier from inside the bubble. You have to know that Cargo is part of Rust; what the best way to install it is on your system; what tooling you should use if you don't want headaches later; what to do if you don't want to or can't install system-level packages; what minimum versions of Rust and Cargo are assumed by application developers this week; what additional dependencies are required; etc.

It's my understanding that building Rust applications still requires a C toolchain, and packages are still going to be dependent on things like having the openssl dev headers/libraries installed. That's fine, that's normal for building software, but it's not as trivial as "just git-clone this Rust source repo and run one command and everything will work".

I'm certain I could get up and running quickly. I'm also certain I'd have to install a bunch of stuff and iterate past multiple roadblocks before I was actually able to build a Rust application. And finally I'm certain I could get Claude or Codex to do it all a lot faster than if I muddled through it myself for half an hour.

adastra22•2mo ago
Execute the one-liner here: https://rustup.rs/

Then cd dir && cargo run

I get what you’re saying, but rust has really set the bar (lowered the bar?) for making it easy, so it’s a bad example to pick on.

winternewt•3mo ago
I see where you're coming from. But I often find that when I have some idea or challenge that I want to solve, I get bogged down in details (like how do I build that project)... before I even know if the idea I _wanted_ to solve is feasible.

It's not that I don't care about learning how to build Rust or think that it's too big of a challenge. It's just not the thing I was excited about right now, and it's not obvious ahead of time how sidetracked it will get me. I find that having an LLM just figure it out helps me to not lose momentum.

lifthrasiir•3mo ago
I have used Rust for decades (yeah, Rust is that old) and want to point out that that's not always the case, especially when FFI is involved. At some point, for example, any Rust crate with the `openssl` dependency used to require a special care every time `cargo install` gets run. Cargo itself is super nice; other tools, still not so much.
beklein•3mo ago
If you've never used Rust before, I couldn't find good documentation on how to run a existing Rust project nor could I find `cargo install` on the "Getting Started" page. I could read the Cargo Book, or check `--help` I guess, but this can be surprisingly time consuming as well, it might take 5-30 minutes of active searching to locate the information. If you can, try put yourself in a beginner's mindset and think though your argument again.

Regarding your second point, I think people actually underutilise LLMs for simple tasks. Delegating these tasks frees up your problem-solving skills for challenges that truly need human insight. In this case, asking an LLM is arguably the smart choice: it's a common task in training data, easy to verify, and low-risk to run and not a direct learning or benefit for your initial question.

aabhay•3mo ago
You don’t need to cargo install anything. You just need cargo itself, which is linked on the main page. Once you have that, here’s an example google search that gives you all the info you need to run the project (hint: `cargo run`)

https://www.google.com/search?q=how+do+I+run+a+rust+project

beklein•3mo ago
Thanks for the Google link, I was just asking GPT-5-Pro "How to Google: 'How do I run a rust project'", and am still waiting for the answer... the point was that searching for an answer (wherever/however) is not necessary in some cases, like this one, but asking the AI agent to find a solution can be sufficient and is totally ok. Engineers are allowed to delegate, there is not nothing wrong with this.
survirtual•3mo ago
Funny you say that, because I have the opposite opinion.

It is easy for any of us to quickly bootstrap a new project in whatever language. But this takes a cognitive toll, and adds friction to bring our visions online.

Recently, I went "blind" for a couple of days. My vision was so damaged I could only see blurs. The circumstances of this blindness are irrelevant, but it dawned on me that if I were blind, I could no longer code as I do.

This realization led me to purchase a Max subscription to Claude Code and rely more on LLMs for building, not less.

It was much more effective than I thought it would be. In my blurred blindness, I saw blobs of a beautiful user interface form, from the skeleton of my Rust backend, Vue3 frontend. It took my (complex backend in Rust) and my frontend scaffolding to another level. I could recognize it via the blur distinctly. And it did this in minutes / hours instead of days.

As my vision returned, I began analyzing what happened and conducting experiments. My attitude changed completely. Instead of doing things myself, directly, I set out to make the LLM do it, even if it took more time.

It is painful at first. It makes very stupid mistakes that make an experienced engineer snarl at it, "I can do better myself". But my blindness gave me new sight. If I were blind, I couldn't do it myself. I would need help.

Instead of letting that ego take over, I started patiently understanding how the LLM best operates. I discovered mainly it needs context and specific instructions.

I experimented with a DSL I made for defining LLM instructions that are more suited for it, and I cannot describe the magic that started unfolding.

Now, I am writing a massive library of modular instructions for LLMs, and launching them against various situations. They will run for hours uninterrupted and end up implementing full code bases, with complete test suites, domain separation, and security baked in.

Reviewing their code, it looks better than 90% of what I see people producing. Clear separation of concerns, minimal code reuse, distinct interface definitions, and so much more.

So now, I have been making it more and more autonomous. It doesn't matter if I could bootstrap a project in 30 seconds. If I spend a few hours perfecting the instructions to the LLM, I can bootstrap ANY project for ANY LANGUAGE, forever.

And the great thing? I already know the pattern works. At this point, it is foolish for me to do anything other than this.

ClawsOnPaws•3mo ago
Just as a quick datapoint here in case people get worried; yes, it is absolutely possible to program as a blind person, even without language models. Obviously you won't be using your eyes for it, but we have tried and tested tools that help and work. And at the end of the day, someone's going to have to review the code that gets written, so either way, you're not going to get around learning those tools.

Source: Am a blind person coding for many years before language models existed.

survirtual•2mo ago
Thank you for sharing your experience. It provides me a bit of comfort to know it's possible for me to keep coding in the event of vision loss, and I'm glad tools exist for people that are blind.

A part of me wants to start using the available tools just to expand my modalities of interfacing with technology. If you have the time, any recommendations? What do you use?

zavec•3mo ago
The DSL sounds interesting, if you talk about it anywhere I'd definitely be interested in reading more!
survirtual•2mo ago
I'll look forward to sharing my findings.
jstummbillig•3mo ago
> In turn, this makes your own research or problem-solving skills less sharp.

Why would that be true? The average assistant is certainly typing more quickly than their boss, but most people would not find issue in that. It's different responsibilities. You free up time to research / problem-solve other things.

> No need to wait for 5-30 minutes until LLM figures this out.

I don't care it the LLM takes 15 additional minutes to figure it out, if it net saves me a minute (we could certainly debate the ergonomics of the multitasking involved, but that is something that every person, who delegates work, has to deal with and that's not unique to working with LLMs in any way)

globular-toast•3mo ago
I got excited about LLM agents thinking it was just about "faster typing". A lot of us have dreamed of a day where we can just transfer what we have in mind directly into the computer, skipping the laborious manual keying step. But using an LLM is not that. It's not that at all.

Instead they let you type vague or ambiguous crap in and just essentially guess about the unclear bits. Hadn't quite thought through which algorithm to use? No worries, the LLM will just pick one. Hadn't considered an edge case? No worries, the LLM will just write 100 lines of code that no sane programmer would ever go through with before realising something isn't right.

I've made the mistake of being that senior who is way too eager to help juniors many times in my career. What happens is they never, ever learn for themselves. They never learn how to digest and work through a problem. They never learn from their mistakes. Because they can always just defer to me. LLMs are the worst thing to happen for these people because unlike a real person like me the LLM is never busy, never grumpy and nobody is taking notes of just how many times they're having to ask.

LLMs are really useful at generating boilerplate, but you have to ask yourself why you're spending your days writing boilerplate in the first place. The danger is it can very quickly become more than just boilerplate and before you know it you've forgotten how to think for yourself.

amluto•2mo ago
Sometimes avoiding boilerplate is out of scope. I’m currently using an LLM agent to write a Home Assistant integration. The LLM is happy to write boilerplate crap to interact with the terrible Home Assistant API without complaining about it. Sure, some of the code it writes is awful, and I can fix that. (The record was about 15 lines of code, including non-functional error handling, to compute the fixed number zero.)

Becoming proficient at banging out Home Assistant entities and their utterly ludicrous instantiation process has zero value for my career.

fHr•2mo ago
true bunch of regards in the making like why I need you to do basic ass shit like this, no wonder we see mass layoffs
CamperBob2•2mo ago
In turn, this makes your own research or problem-solving skills less sharp.

That's one way to think about it, but on the other hand, where's the "skill" in knowing a particular CLI invocation for a particular tool or installation task? Next year there will be a Better Way to Do It. (Witness how many trendy package installers / venv managers the Python community has gone through.)

An LLM's job is to translate what I want to do into instructions or commands that actually do it. Real skill involves defining and directing the process; the implementation details are just temporary artifacts. Memorized command lines, mastery of specific tools, and conflation of rote procedures with "skills" are what keeps languages like C around for 50 years, long after the point where they begin to impede progress.

simonw•2mo ago
I've been thinking about this a lot recently. I've been using Git for about 20 years, but I've never considered myself a Git expert because I have to look up how to do things like a git bisect or a history rewrite beyond just the first "git commit --amend"

But now... I use a ton of advanced Git features several times a week, because just knowing that it's possible to do something is enough for me to tell Codex or Claude Code to do that thing.

So maybe Git mastery now is more about concepts? You need to understand the DAG and roughly what's in the .git folder and origins and branches and tags and commits and so forth, but you don't need to remember the syntax.

CamperBob2•2mo ago
Exactly, and now imagine that somebody introduces a tool even better than Git. Normally that tool would have major adoption problems because it would have to dislodge the incumbent standard that everybody's familiar with, and that everybody can at least tolerate.

But if you aren't even issuing commands directly to Git, suddenly it starts to look like there is room for improvement without the pain of learning a new tool or a new paradigm. That's a bigger deal than I think most people appreciate.

djmips•3mo ago
How long before large language models are specifically trained on drawing pelicans riding a bicycle. ( ͡° ͜ʖ ͡°)
heddycrow•3mo ago
And where on the web has someone shared a human effort at doing the same?
hrimfaxi•3mo ago
Where on the web do hallucinations come from?
heddycrow•3mo ago
I think it's some part of the Dark Web, or I wish it was.
djmips•3mo ago
you could literally hire a human to do that, not everything needs to be on the web.
NitpickLawyer•3mo ago
Previous discussions about "pelican on a bicycle" always mention this, but it's not something they can do without being blatantly obvious. You can always do other x riding y tests. A juggler riding a barrel. A bear riding a unicycle. An anteater riding a horse, etc.
swyx•3mo ago
simon has said multiple times he has hidden tests he runs for precisely this eventuality (because of course it will happen someday, and he'll write a banger article calling them out for it)
bgwalter•3mo ago
Under a different account? They could special case his account and even give him more computing power because they know he'll not rest until every dog on this planet has a subscription to at least ChatGPT and Claude Code.
tkgally•3mo ago
A couple of days ago, inspired by Simon and those discussions, I had Claude create 30 such tests. I posted a Show HN with the results from six models, but it didn’t get any traction. Here it is again:

https://news.ycombinator.com/item?id=45845717

https://gally.net/temp/20251107pelican-alternatives/index.ht...

Aeolun•3mo ago
Oh man, that’s hilarious. I dunno what qwen is doing most of the time. Gemini seems to be either a perfect win or complete nonsense. Claude seems to lean towards “good enough”.
Topfi•2mo ago
> [...] but it's not something they can do without being blatantly obvious.

Thing is, it is being done by certain labs and it is blatantly obvious [0]. It has for months now been very easy to tell which labs either "train to the test" or (if you wanna give them the benefit of a doubt they have most certainly not earned) simply fail to keep their datasets sanitized because of this.

I still see value in and will continue to use odd scenarios to prompt for SVG as it can have value as a spatial reasoning focused benchmark. But I will never use the same scenario a second time and have a hard time taking anyone seriously who does, again, because I have seen consistent evidence that certain labs "bench-max" to a fault.

You are right, it is obvious and easily proven, yet despite that your correctness, the fact remains that a lot of hype masters simply do not care.

Say what you want about Anthropic, I have and will continue to do so, but they appear to take the most steps of any industry member in not training solely with a focus on beating benchmarks. Their models in my experience usually perform better than established, public benchmarks would make one think. They also, from what I have seen, take the most precautions to ensure to the best of their abilities that e.g. their own research papers on model "misaligned behavior"/unprompted agentic output do not find their way into the training corpus via canary strings [1], etc.

Overall, if I were asked whether any lab is doing everything they truly can to avoid any unintentionally training with a focus on popular, eye catching benchmarks, I'd say none, partly because it is likely impossible to avoid when using the open web as a source.

On the other hand, if I were asked whether any labs are intentionally and clearly training specifically with a focus on popular, eye catching benchmarks, I'd have a few names at the top of my mind right away. Just do what you suggested, try other out there scenarios as SVGs and see for yourself the discrepancy to e.g. the panda burger or cycle pelican. It is blatant, shameless and I ask any person with an audience in the LLM space to do the same. The fact that few if any seem to is annoying to say the least.

[0] https://news.ycombinator.com/item?id=45599403

[1] In fairness, not knowing their data acquisition pipeline, etc. it is hard to tell how effective such measures can truly be considering reporting on their papers on the open web is unlikely to include the canary strings.

cyrusradfar•3mo ago
Was a fun idea and fun read. Thank you.

Did you consider expanding the number of models by getting all calls through open router?

simonw•3mo ago
I haven't tried it myself yet, but it looks like OpenRouter is a supported feature of Codex already: https://github.com/openai/codex/blob/a47181e471b6efe55e95f98...
eloycoto•3mo ago
It's weird, tools are empty on the http request?
NitpickLawyer•3mo ago
You can already do that through the config file, you can define custom endpoints for any openai compatible API. So you can get openrouter, or even local models via vLLM or alternatives. I think someone even tried to get cheaper API pay-as-you-go usage by hitting their "bulk" API, for tasks that run over night (so no need for immediate responses).
padolsey•3mo ago
I help run an eval platform and thought it fun to try a bunch of models on this challenge [1].

There's some fun little ones in there. I've not idea what Llama 405B is doing. Qwen 30B A3B is the only one that cutely starts on the landscaping and background. Mistral Large & Nemo are just convinced that front shot is better than portrait. Also interesting to observe varying temperatures.

I feel like this SVG challenge is a pretty good threshold to meet before we start to get too impressed by ARC AGI wins.

[1] https://weval.org/analysis/visual__pelican/f141a8500de7f37f/...

cubefox•3mo ago
> I feel like this SVG challenge is a pretty good threshold to meet before we start to get too impressed by ARC AGI wins.

It's a very bad threshold. The models write the plain SVG without looking at the final image. Humans would be awful at it and you would mistakenly conclude that they aren't general intelligences.

padolsey•3mo ago
I dunno. A competent human can hold the a mental image and work through it. Not too hard with experience. What I generally mean tho is: I don't think we can state the supreme capabilities of AI (which people love to do with grate fervour and rhetoric) until they can at the very least draw basic objects in well-known declarative languages. And while it may be unwise to judge an AI based on its ability to count the number of 'R' letters in various words, it -- amongst a wider suite ofc -- remains a good minimum threshold of capability.
Jackson__•3mo ago
You will never guess what the upcoming gemini 3 has been benchmaxxed upon ;)
Aeolun•3mo ago
I tried to make the claude code cli interface with the codex/chatgpt endpoints directly, and found that that’s really hard, because those ChatGPT endpoint accept only one system prompt, and one system prompt only, the one that is baked directly into Codex.

That was a bit dissapointing, because I feel the codex api with claude semantics would be really nice.

Translating all the tools calls when you cannot control the prompt seemed like a bit of a dead end though, so I eventually just switched back to claude (which incidentally allows any prompt you can dream up, but using the codex cli with claude was very much not on my wishlist)

BeetleB•2mo ago
Very mildly related: Pelicans on Bicycles - Raytracer Edition:

https://blog.nawaz.org/posts/2025/Oct/pelican-on-a-bike-rayt...

simonw•2mo ago
Hah, love the weird floating eggs!
simonw•2mo ago
I blogged about this here, including running your benchmark against gpt-5-codex-mini which did the worst example I've seen by far: https://simonwillison.net/2025/Nov/9/pelican-on-a-bike-raytr...

Here's the codex-mini attempt: https://static.simonwillison.net/static/2025/povray-pelican-...

dr_dshiv•2mo ago
Perhaps not eggs, but misplaced pelican neck pouches?

Opus4.1: https://claude.ai/public/artifacts/b47c2dd5-41a6-452c-8701-5...

Sonnet 4.5: https://gemini.google.com/share/a8ebea2c31dd

Gemini 2.5pro: https://gemini.google.com/share/a8ebea2c31dd

“a pelican riding on a bicycle in 3d. Works for mobile“

BeetleB•2mo ago
You were indeed correct! Weird that so many models got it wrong.