
Agentic-Commerce-Protocol

https://github.com/agentic-commerce-protocol/agentic-commerce-protocol
1•vettyvignesh•31s ago•0 comments

Help Me Find Missing Issues of Australian Personal Computer

https://blog.decryption.net.au/posts/apc-callout.html
1•naves•2m ago•0 comments

LoongArch Reference Manual

https://loongson.github.io/LoongArch-Documentation/LoongArch-Vol1-EN.html
1•welovebunnies•3m ago•0 comments

The new light of Jony Ive's life

https://www.wallpaper.com/design-interiors/lighting/jony-ive-lovefrom-balmuda-sailing-lantern
2•Nrbelex•4m ago•0 comments

Claude 4.5, AI Biology and World Models

https://cmpld.ai/issues/003/
1•mantcz•7m ago•0 comments

Mexico: Tax Code reform seeks permanent access to data from digital platforms

https://articulo19.org/reforma-al-codigo-fiscal-pretende-acceso-permanente-a-datos-de-plataformas...
1•CharlesW•7m ago•0 comments

Ask HN: Any local agents to help repetitive browser tasks?

2•pcdoodle•8m ago•0 comments

Claude Sonnet 4.5 is probably the "best coding model in the world", at least now

https://simonwillison.net/2025/Sep/29/claude-sonnet-4-5/
1•coloneltcb•8m ago•0 comments

RealClimate: "But you said the ice was going to disappear in 10 years"

https://www.realclimate.org/index.php/archives/2025/09/but-you-said-the-ice-was-going-to-disappea...
3•speckx•11m ago•0 comments

DIY Flight Simulator Motion Rig [video]

https://www.youtube.com/watch?v=YphV5v7aZSg
1•gregsadetsky•11m ago•0 comments

Kagi Translate appears to be down - giving HTTP 400 false positive

1•casenmgreen•11m ago•0 comments

99% of heart attack, stroke cases linked to preventable risk factors

https://www.medicalnewstoday.com/articles/heart-attack-stroke-heart-failure-linked-to-preventable...
2•akyuu•12m ago•0 comments

Olly – AI Native Observability

https://olly.new
1•pranay01•12m ago•0 comments

Why the Hertz-Amazon deal poses threats to auto dealers

https://www.cnbc.com/2025/09/29/hertz-amazon-auto-dealers.html
1•e2e4•14m ago•0 comments

An insurance company is introducing a new threat to American medicine

https://www.statnews.com/2025/09/29/cigna-downcoding-prior-authorization-doctors-bureaucracy/
2•bikenaga•15m ago•0 comments

Learn Kubernetes Security book, second edition just published

https://www.amazon.com/Learning-Kubernetes-Security-containerized-environments-ebook/dp/B0F5VZ3CRX
2•bernardoortega•16m ago•0 comments

Energy Dept. adds 'climate change' and 'emissions' to banned words list

https://www.politico.com/news/2025/09/28/energy-department-climate-change-emissions-banned-words-...
11•doener•16m ago•0 comments

The Handoff to Bots

https://kevinkelly.substack.com/p/the-handoff-to-bots
3•thm•18m ago•1 comments

DuckDB can be 5x faster than Spark at 500M record files

https://blog.dataexpert.io/p/duckdb-can-be-100x-faster-than-spark
1•peterdstallion•18m ago•1 comments

Photos show 44,000-year-old mummified wolf discovered in Siberian permafrost (2024)

https://www.livescience.com/animals/extinct-species/stunning-photos-show-44000-year-old-mummified...
1•binning•19m ago•1 comments

Buckley Institute Releases Eleventh Annual National Undergraduate Student Survey

https://buckleyinstitute.com/buckley-institute-releases-eleventh-annual-national-undergraduate-st...
1•mhb•19m ago•0 comments

A DHT for iroh – Part 1, The Protocol

https://www.iroh.computer/blog/lets-write-a-dht-1
1•g0xA52A2A•20m ago•0 comments

When AI is trained for treachery, it becomes the perfect agent

https://www.theregister.com/2025/09/29/when_ai_is_trained_for/
2•rntn•20m ago•0 comments

Omi – A Fast Pokémon Card Scanner

https://tcgscanneromi.com/
1•crovillas•21m ago•1 comments

Finding stillness and focus in the chaos of open source

https://ruthcheesley.co.uk/blog/buddhism/finding-stillness-and-focus-in-the-chaos-of-open-source
1•mooreds•21m ago•0 comments

Offshore to onshore: Europe expands carbon storage with nature-inspired tech

https://projects.research-and-innovation.ec.europa.eu/en/horizon-magazine/offshore-onshore-europe...
1•PaulHoule•22m ago•0 comments

First highway sign with Superchargers, more to come

https://twitter.com/TeslaCharging/status/1970987475951903142
1•toomuchtodo•22m ago•0 comments

Beyond Reading the RFC: How to Shape Identity Standards

https://ciamweekly.substack.com/p/beyond-reading-the-rfc-how-to-actually
1•mooreds•22m ago•0 comments

DeepSeek-v3.2-Exp: Long-Context Efficiency with DeepSeek Sparse Attention [pdf]

https://github.com/deepseek-ai/DeepSeek-V3.2-Exp/blob/main/DeepSeek_V3_2.pdf
1•g42gregory•22m ago•0 comments

Agentic Commerce Protocol

https://developers.openai.com/commerce/guides/get-started/
1•brandonb•24m ago•1 comments

Claude Sonnet 4.5

https://www.anthropic.com/news/claude-sonnet-4-5
427•adocomplete•1h ago
System card: https://assets.anthropic.com/m/12f214efcc2f457a/original/Cla...

Comments

idkmanidk•1h ago
Page cannot be found
Empty screen mocks my searching
Only void responds

but: https://imgur.com/a/462T4Fu

dbbk•1h ago
So Opus isn't recommended anymore? Bit confusing
SatvikBeri•1h ago
For now, yeah. Presumably they'll come out with Opus 4.5 soon.
causal•1h ago
Don't think I've ever preferred Opus to Sonnet
cryptoz•1h ago
I've really got to refactor my side project, which I tailored to just use OpenAI API calls. I think the Anthropic APIs are a bit different, so I never put in the energy to support the changes. I think I remember reading that there are tools to simplify this kind of work, to support multiple LLM APIs? I'm sure I could do it manually, but how do you all support multiple API providers that have some differences in their API design?
willcodeforfoo•1h ago
https://openrouter.ai/?
pinum•1h ago
I use LiteLLM as a proxy.
dingnuts•1h ago
> think I remember reading that there are tools to simplify this kind of work, to support multiple LLM APIs

just ask Claude to generate a tool that does this, duh! and tell Claude to make the changes to your side project and then to have sex with your wife too since it's doing all the fun parts

adidoit•59m ago
LiteLLM is your friend.
adidoit•58m ago
or AI SDK
l1n•55m ago
https://docs.anthropic.com/en/api/openai-sdk
punkpeye•46m ago
OpenRouter, Glama ( https://glama.ai/gateway/models/claude-sonnet-4-5-20250929 ), AWS Bedrock, all of them provide you access to all of the AI models via OpenAI compatible API.
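The pattern these gateways share can be sketched without any vendor SDK: because they all speak the OpenAI wire format, switching providers reduces to switching the base URL and API key you hand to the client constructor. A minimal sketch (the endpoint table is illustrative, not an authoritative list):

```python
import os

# Gateways with OpenAI-compatible endpoints differ only in base_url and
# credentials. This table is an illustrative sketch, not an exhaustive list.
ENDPOINTS = {
    "openai": ("https://api.openai.com/v1", "OPENAI_API_KEY"),
    "openrouter": ("https://openrouter.ai/api/v1", "OPENROUTER_API_KEY"),
}

def client_config(provider: str) -> dict:
    """Kwargs you would hand to any OpenAI-compatible client constructor."""
    base_url, key_env = ENDPOINTS[provider]
    return {"base_url": base_url, "api_key": os.environ.get(key_env, "")}

print(client_config("openrouter")["base_url"])  # https://openrouter.ai/api/v1
```

The rest of the calling code (messages, tool definitions, streaming) stays identical across providers.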
gloosx•40m ago
Why don't you ask LLM to do it for you?
juanre•10m ago
I built LLMRing (https://llmring.ai) for exactly this. Unified interface across OpenAI, Anthropic, Google, and Ollama - same code works with all providers.

The key feature: use aliases instead of hardcoding model IDs. Your code references "summarizer", and a version-controlled lockfile maps it to the actual model. Switch providers by changing the lockfile, not your code.

Also handles streaming, tool calling, and structured output consistently across providers. Plus a human-curated registry (https://llmring.github.io/registry/) that I keep updated with current model capabilities and pricing - helpful when choosing models.

MIT licensed, works standalone. I am using it in several projects, but it's probably not ready to be presented in polite society yet.
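The alias/lockfile idea is easy to sketch independently of LLMRing's actual API. The helper below is hypothetical, not llmring code: application code references a stable role name, and a version-controlled lockfile pins that role to a concrete model ID.

```python
import json

# Hypothetical sketch of the alias idea (not LLMRing's real API): code refers
# to a role ("summarizer") and a version-controlled lockfile maps roles to
# concrete model IDs. Switching providers means editing only the lockfile.
LOCKFILE = json.dumps({
    "summarizer": "anthropic/claude-sonnet-4-5",
    "extractor": "openai/gpt-5-codex",
})

def resolve(alias: str, lockfile: str = LOCKFILE) -> str:
    """Map a stable role name to whatever model the lockfile currently pins."""
    return json.loads(lockfile)[alias]

print(resolve("summarizer"))  # anthropic/claude-sonnet-4-5
```

Swapping "anthropic/claude-sonnet-4-5" for another model ID changes behavior everywhere without touching call sites.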

yewenjie•1h ago
Looking at the chart here, it seems like Sonnet 4 was already better than GPT-5-codex in the SWE verified benchmark.

However, my subjective personal experience was GPT-5-codex was far better at complex problems than Claude Code.

CuriouslyC•1h ago
The Anthropic models have been vibe-coding tuned. They're beasts at simple python/ts programs, but they definitely fall apart with scientific/difficult code and large codebases. I don't expect that to change with the new Sonnet.
patates•1h ago
In my experience Gemini 2.5 Pro is the star when it comes to complex codebases. Give it a single xml from repomix and make sure to use the one at the aistudio.
CuriouslyC•57m ago
Yup. In fact every deep research tool on the market is just a wrapper for gemini, their "secret sauce" is just how they partition/pack the codebase to feed it into gemini.
Workaccount2•54m ago
It's mostly because it is so damn good with long contexts. It can stay on the ball even at 150k, whereas other models really wilt around 50-75k.
garciasn•51m ago
In my experience, G2.5P can handle so much more context, and it produces awesome execution plans that CC then implements far better than anything CC would plan on its own. So I give G2.5P the relevant code and data underneath, ask it to develop an execution plan, and then feed that result to CC to do the actual code writing.

This has been outstanding for what I have been developing AI assisted as of late.

mentos•1h ago
Curious how you find ChatGPT5 to ChatGPT5-Codex?
cellis•1h ago
Opposite for me…5-codex high ran out of tokens extremely quickly and didn’t adhere as well to the agents.md as Claude did to the Claude.md, perhaps because it insists on writing extremely complicated bash scripts or whole python programs to execute what should be simple commands.
TrainedMonkey•1h ago
Codex was a miserable experience for me until I learned to compact after every feature. Now it is a cut above CC, although the latter still has an edge at TODO scaffolding and planning.
oigursh•53m ago
Compact?
all2•47m ago
/compress or something like that, basically taking the context and summarizing it.
furyofantares•50m ago
/new (codex) or /clear (claude code) are much better than compact after every feature, but of course if there is context you need to retain you should put it (or have the agent put it) in either claude/agents.md or a work log file or some other file.

/compact is helping you by reducing crap in your context but you can go further. And try to watch % context remaining and not go below 50% if possible - learn to choose tasks that don't require an amount of context the models can't handle very well.

rapind•5m ago
I don't even compact, I just start from scratch whenever I get down below 40%, if I can. I've found Codex can get back up to speed pretty well.

I like to have it come up with a detailed plan in a markdown doc, work on a branch, and commit often. Seems not to have any issues getting back on task.

Obviously subjective take based on the work I'm doing, but I found context management to be way worse with Claude Code. In fact I felt like context management was taking up half of my time with CC and hated that. Like I was always worried about it, so it was taking up space in my brain. I never got a chance to play with CC's new 1m context though, so that might be a thing of the past.

renewiltord•45m ago
gpt-5 command line use is bizarre. It always writes extraordinarily complicated pipelines that Claude instead just writes simple commands for.

My use case does better with the latter because frequently the agent fails to do things and then can't look back at intermediate.

E.g. Command | Complicated Grep | Complicated Sed

Is way worse than multistep

Command > tmpfile

And then grep etc. Because latter can reuse tmpfile if grep is wrong.
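The multistep pattern described above can be sketched in a few lines of shell (filenames illustrative; `printf` stands in for the original command):

```shell
# Capture intermediate output to a file so a wrong grep can be retried
# without rerunning the original command.
out=$(mktemp)
printf 'alpha\nbeta\ngamma\n' > "$out"   # Command > tmpfile
grep 'beta' "$out"                       # if this grep is wrong, re-grep "$out"
```

The pipeline form `command | grep | sed` throws away the intermediate output; the tmpfile form keeps it, so a failed filter costs one retry instead of a full rerun.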

yunohn•1h ago
Well, they seem to benchmark better only when giving the model "parallel test time compute" which AFAIU is just reasoning enabled? Whereas the GPT5 numbers are not specified to have any reasoning mode enabled.
esafak•1h ago
I'm only a week into testing, but so far codex has been slow and the cli is worse than claude code. I intend to return to Claude.
jasonsb•1h ago
My subjective personal experience is the exact opposite of yours, GPT-5-codex is super slow and the results are mediocre at best. I would probably stop using AI for coding if I was forced to use GPT-5-codex.
lordnacho•1h ago
I'm on your side.

I find there's a quite large spread in ability between various models. Claude models seem to work superbly for me, though I'm not sure whether that's just a quirk of what my projects look like.

jasonsb•55m ago
I don’t think it’s just a quirk. I’ve tested Claude across Java, Python, TypeScript and several other projects. The results are consistent, regardless of language or project structure, though it definitely performs better with smaller codebases. For larger ones, it really helps if you’re familiar with the project architecture and can guide it to the right files or modules, that saves a lot of time.
llmslave•51m ago
You need to give it clear instructions on what to implement
AnotherGoodName•1h ago
I always wonder how absolute the performance of a given model is. Sometimes I ask for Claude Opus and the responses I get back are worse than the lowest-end models of other assistants. Other times it surprises me and is clearly best in class.

Sometimes in between this variability of performance it pops up a little survey: "How's Claude doing this session, from 1-5? 5 being great." I suspect I'm in some experiment of extremely low performance. I'm actually at the point where I get the feeling peak-hour weekdays are terrible and odd-hour weekends are great, even when forcing a specific model.

While there is some non-determinism, it really does feel like performance is quite variable. It would make sense that they scale up and down depending on utilization, right? There was a post a week ago from Anthropic acknowledging terrible model performance in parts of August due to an experiment. Perhaps at peak hours GPT also has more datacenter capacity and doesn't get degraded as badly? No idea for sure, but it is frustrating when simple asks fail and complex asks succeed without it being clear to me why that may be.

richwater•1h ago
They absolutely mess with it
steveklabnik•50m ago
> It would make sense they scale up and down depending on utilization right?

It would, but

> To state it plainly: We never reduce model quality due to demand, time of day, or server load.

https://www.anthropic.com/engineering/a-postmortem-of-three-...

If you believe them or not is another matter, but that's what they themselves say.

transcriptase•29m ago
Well knowing the state of the tech industry they probably have a different, legal-team approved definition of “reducing model quality” than face value.

After all, using a different context window, subbing in a differently quantized model, throttling response length, rate limiting features aren’t technically “reducing model quality”.

ambyra•1h ago
For unity gamedev code reviews, I much preferred the gpt5 code. Claude gave me a bunch of bad recommendations for code changes, and also an incorrect formula for completion percentage.
macawfish•1h ago
GPT-5 is like the guy on the baseball team that's really good at hitting home runs but can't do basic shit in the outfield.

It also consistently gets into drama with the other agents e.g. the other day when I told it we were switching to claude code for executing changes, after badmouthing claude's entirely reasonable and measured analysis it went ahead and decided to `git reset --hard` even after I twice pushed back on that idea.

Whereas gemini and claude are excellent collaborators.

When I do decide to hail mary via GPT-5, I now refer to the other agents as "another agent". But honestly the whole thing has me entirely sketched out.

To be clear, I don't think this was intentionally encoded into GPT-5. What I really think is that OpenAI leadership simply squandered all its good energy and is now coming from behind. Its excellent talent either got demoralized or left.

aaronbrethorst•51m ago
Please tell me you're joking or at least exaggerating about GPT-5's behavior
macawfish•36m ago
The only exaggeration is that the way I asked GPT-5 to leave Claude to do its thing was to say "why don't we just let claude cook"? I later checked with ChatGPT about the whole exchange and it confirmed that it was well aware of the meaning of this slang, and its first reaction was that the whole thing just sounded like a funny programmer joke, all in jest. But then I reminded it that I'd explicitly pushed back on a hard reset twice.

To be clear, I don't believe that there was any _intention_ of malice or that the behavior was literally envious in a human sense. Moreso I think they haven't properly aligned GPT-5 to deal with cases like this.

nerdsniper•14m ago
I strongly disagree with the personified way you interact with LLMs from a standpoint of “I’ve rarely gotten the best output from the LLM when I interact casually with them”.

However, it’s the early days of learning this new interface, and there’s a lot to learn - certainly some amount of personification has been proven to help the LLM by giving it a “role”, so I’d only criticize the degree rather than the entire concept.

It reminds me of the early days of search engines when everyone had a different knack for which search engine to use for what and precisely what to type to get good search results.

Hopefully eventually we’ll all mostly figure it out.

vrosas•49m ago
Why are you having a conversation with your LLM about other agents?
doctoboggan•44m ago
I do it as well. I have a Claude code instance running in my backend repo, and one running in my frontend repo. If there is required coordination, I have the backend agent write a report for the front end agent about the new backend capabilities, or have the front end agent write a report requesting a new endpoint that would simplify the code.

Lots of other people also follow the architect and builder pattern, where one agent architects the feature while the other agent does the actual implementation.

macawfish•40m ago
It's not a whole conversation it's like "hey I'm using claude code to do analysis and this is what it said" or "gemini just used its large context window to get a bird's eye view of the code and this is what it saw".
renewiltord•47m ago
All of these perform better if you say "a reviewer recommended" or something. The role statement provides the switch vs the implementation. You have to be careful, though. They all trust "a reviewer" strongly but they'll be more careful with "a static analysis tool".
macawfish•38m ago
Yeah, it's wild how the biases get encoded in there. Maybe they aren't even entirely separable from the magic of LLMs.
Marazan•27m ago
It isn't wild, it is inherent to the very nature of large language models.

The power of using LLMs is working out what it has encoded and how to access it.

macawfish•21m ago
I appreciate it being wild in the sense that language is inherently a tangled mess and these tools are actually leveraging that messy complexity.
johnfn•46m ago
That’s such a great analogy. I always say GPT is like the genius that completely lacks common sense. One of my favorite things is when I asked it why the WiFi wasn’t working, and showed it a photo of our wiring. It said that I should tell support:

> “My media panel has a Cat6 patch panel but no visible ONT or labeled RJ45 hand-off. Please locate/activate the Ethernet hand-off for my unit and tell me which jack in the panel is the feed so I can patch it to the Living Room.”

Really, GPT? Not just “can you set up the WiFi”??!

ipython•26m ago
I'm curious what you would have expected it to reply given the input you provided?
tux3•42m ago
That's great given that the goal of OAI is to train artificial superintelligence first, hoping that the previous version of the AI will help us control the bigger AI.

If GPT-5 is learning to fight and undo other models, we're in for a bright future. Twice as bright.

rapind•23m ago
> it went ahead and decided to `git reset --hard` even after I twice pushed back on that idea

So this is something I've noticed with GPT (Codex). It really loves to use git. If you have it do something and then later change your mind and ask it to undo the changes it just made, there's a decent chance it's going to revert to the previous git commit, regardless of whether that includes reverting whole chunks of code it shouldn't.

It also likes to occasionally notice changes it didn't make and decide they were unintended side effects and revert them to the last commit. Like if you made some tweaks and didn't tell it, there's a chance it will rip them out.

Claude Code doesn't do this, or at least I never noticed it doing this. However, it has its own medley of problems, of course.

When I work with Codex, I really lean into a git workflow. Everything is on a branch and commit often. It's not how I'd normally do things, but doesn't really cost me anything to adopt it.

These agents have their own pseudo personalities, and I've found that fighting against it is like swimming upstream. I'm far more productive when I find a way to work "with" the model. I don't think you need a bunch of MCPs or boilerplate instructions that just fill up their context. Just adapt your workflow instead.
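The branch-and-commit-often workflow can be sketched as follows (repo, branch, and file names are illustrative; everything runs in a scratch directory):

```shell
# With one commit per step, a stray `git reset --hard` by an agent can lose
# at most the current step, and the branch keeps main untouched.
repo=$(mktemp -d) && cd "$repo" && git init -q
git config user.email agent@example.com && git config user.name agent
git commit -q --allow-empty -m "init"
git checkout -q -b agent-work                        # isolate agent changes
echo 'parser v1' > parser.txt
git add -A && git commit -q -m "step: add parser"    # one commit per step
git log --oneline                                    # audit trail of steps
```

Reviewing then becomes `git diff main...agent-work`, and any single step can be reverted without losing the rest.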

deciduously•13m ago
Just to add another anecdotal data point, ive absolutely observed Claude Code doing exactly this as well with git operations.
layer8•18m ago
> "another agent"

You could just say it’s another GPT-5 instance.

jjcm•59m ago
How long have you had early access for?
llmslave•51m ago
Gpt5 codex is incredible, far ahead of all the other models for implementing code.
chipgap98•1h ago
Interesting that this is better than Opus 4.1. I want to see how this holds up under real world use, but if that's the case its very impressive.

I wonder how long it will be before we get Opus 4.5

FergusArgyll•42m ago
IIRC sonnet 3.5 (and definitely 3.5-new aka 3.6) was better than opus 3.

There's still a lot of low hanging fruit apparently

kixiQu•1h ago
Lots of feature dev here – anyone have color on the behavior of the model yet? Mouthfeel, as it were.
sexyman48•9m ago
> Mouthfeel, as it were

Pervert.

rkomorn•6m ago
Weird comment given your username.
meetpateltech•1h ago
Seeing the progress of the Claude models is really cool!

Charting Claude's progress with Sonnet 4.5: https://youtu.be/cu1iRoc1wBo

clueless•1h ago
would love to see the prompt they used and the final code of the Claude.ai clone it generated
mohsen1•1h ago
Price is playing a big role in my AI usage for coding. I am using Grok Code Fast as it's super cheap. Next to it GPT-5 Codex. If you are paying for model use out of pocket Claude prices are super expensive. With better tooling setup those less smart (and often faster) models can give you better results.

I am going to give this another shot but it will cost me $50 just to try it on a real project :(

muttantt•1h ago
how are you using grok code fast? what tooling/cli/etc?
rafaquintanilha•1h ago
It’s currently free in OpenRouter.
esafak•1h ago
Through Opencode.
xwowsersx•39m ago
Same
hu3•1h ago
free in GitHub copilot atm
_joel•1h ago
I'm paying $90(?) a month for the Max and it holds up for about an hour or so of in depth coding before it kicks in the 5-hour window lockout (so effectively about 4 hours of time when I can't run it). Kinda frustrating, even with efficient prompt and context length conservation techniques. I'm going to test this new sonnet 4.5, now but it'll probably be just as quick to gobble my credits.
mrshu•15m ago
Do you normally run Opus by default? It seems the Max subscription should let you run Sonnet in an uninterrupted way, so it was surprising to read.
Implicated•8m ago
I'm on a max ($200) plan and I only use opus and I've _never_ hit a rate limit. Definitely using for 5+ hours at a time multiple days per week.
xwowsersx•58m ago
Same here. I've been using GCF1 with opencode and getting good results. I also started using [Serena](https://github.com/oraios/serena), which has been really helpful in a large codebase. It gives you better search than plain grep, so you can quickly find what you need instead of dumping huge chunks of code into Claude or Grok and wasting tokens.
Hamuko•25m ago
I'm too cheap to pay for any of them. I've only tried gpt-oss:20b because I can run it locally and it's a complete waste of time for anything except code completions.
greenfish6•1h ago
As the rate of model improvement appears to slow, the first reactions seem to be getting worse and worse, as it takes more time to assess the model's quality and understand the nuances & subtler improvements
alach11•1h ago
I'm really interested in the progress on computer use. These are the benchmarks to watch if you want to forecast economic disruption, IMO. Mastery of computer use takes us out of the paradigm of task-specific integrations with AI to a more generic interface that's way more scalable.
mrshu•18m ago
What are some standard benchmarks you look at in this space?
sipjca•14m ago
Maybe this is true? But it's not clear to me this methodology will ever be quite as good as native tool calling. Or maybe I don't know the benchmark well enough, I just assume it's vision based

Perhaps Tesla FSD is a similar example where in practice self driving with vision should be possible (humans), but is fundamentally harder and more error prone than having better data. It seems to me very error prone and expensive in tokens to use computer screens as a fundamental unit.

But at the same rate, I'm sure there are many tasks which could be automated as well, so shrug

mohsen1•1h ago
That's a pretty pelican on a bicycle!

https://jsbin.com/hiruvubona/edit?html,output

https://claude.ai/share/618abbbf-6a41-45c0-bdc0-28794baa1b6c

greenfish6•1h ago
pelican on a bicycle benchmark probably getting saturated... especially as it's become a popular way to demonstrate model ability quickly
AlecSchueler•1h ago
But where is the training set of good pelicans on bikes coming from? You think they have people jigging them up internally?
eli•58m ago
Assuming they updated the crawled training data, just having a bunch of examples of specifically pelicans on bicycles from other models is likely to make a difference.
AlecSchueler•50m ago
But then how does the quality increase? Normally we hear that when models are trained on the output of other models, the style becomes very muted and various other issues start to appear. But this is probably the best pelican on a bicycle I've ever seen, by quite some margin.
Kuinox•43m ago
Just compare it with a human on a bicycle, you would see that LLMs are weirdly good at drawing pelicans in SVG but not humans.
AlecSchueler•25m ago
I thought a human would be a considerable step up in complexity but I asked it first for a pelican[0] and then for a rat [1] to get out of the bird world and it did a great job on both.

But just for thrills I also asked for a "punk rocker"[2] and the result, while not perfect, is leaps and bounds above anything from the last generation.

0 -- ok, here's the first hurdle! It's giving me "something went wrong" when I try to get a share link on any of my artifacts. So for now it'll have to be a "trust me bro" and I'll try to edit this comment soon.

Kuinox•1h ago
I never understood the point of the pelican-on-a-bicycle exercise: LLM coding agents don't have any way to see the output. It means the only thing this test is testing is the LLM's ability to memorise.

Edit: just to show my point, a regular human on a bicycle is way worse with the same model: https://i.imgur.com/flxSJI9.png

mhh__•1h ago
Memorise what exactly?
Kuinox•58m ago
The coordinates and shapes of the elements used to form a pelican. If you think about how LLMs ingest their data, they have no way to know how to form a pelican in SVG.

I bet their ability to form a pelican results purely from someone already having done it before.

_joel•1h ago
Because it exercises thinking about a pelican riding a bike (not common) and then describing that using SVG. It's quite nice imho and seems to scale with the power of the LLM. I'm sure Simon has some actual reasons, though.
imiric•1h ago
The only thing it exercises is the ability of the model to recall its pelican-on-bicycle and other SVG training data.
Kuinox•1h ago
> Because it excercises thinking about a pelican riding a bike (not common)

It is extremely common, since it's used on every single LLM to bench it.

And there is no logic involved; LLMs are never trained on graphics tasks, and they don't see the output of the code.

_joel•35m ago
I mean that real-world examples of a pelican riding a bike are not common. It's common in benchmarking LLMs, but that's not what I meant.
furyofantares•42m ago
It's more for fun than as a benchmark.
Kuinox•40m ago
It also measures something LLMs are good at, probably due to cheating.
_joel•1h ago
... but can it create an svg renderer for claude's site.
atemerev•1h ago
Ah, the company where the models are unusable even with a Pro subscription (you start to hit the limit after 20 minutes of talking), and the free models are not usable at all (currently I can't even send a single message to Sonnet 4.5)...
usr19021ag•1h ago
Their benchmark chart doesn't match what's published on https://www.swebench.com/.

I understand that they may not have published the results for Sonnet 4.5 yet, but I would expect the other models to match...

zurfer•1h ago
Same price and a 4.5 bp jump from 72.7 to 77.2 SWEBench

Pretty solid progress for roughly 4 months.

zurfer•59m ago
Also getting a perfect score on AIME (math) is pretty cool.

Tongue in cheek: if we progress linearly from here software engineering as defined by SWE bench is solved in 23 months.

wohoef•44m ago
Just a few months ago people were still talking about exponential progress. The fact that we’re already going for just linear progress is not a good sign
falcor84•11m ago
Linear growth on a 0-100 benchmark is quite likely an exponential increase in capability.
crthpl•35m ago
The reason they get a perfect score on AIME is that every question on AIME had lots of thought put into it, and care was taken to make sure everything was actually solvable. SWE-bench, and many other AI benchmarks, have lots of eval noise where there is no clear right answer, and getting higher than a certain percentage means you are benchmaxxing.
mrshu•20m ago
Do you think a messier math benchmark (in terms of how it is defined) might be more difficult for these models?
XMPPwocky•45m ago
nit: assuming you mean basis points, one basis point is 0.01%. 4.5bp would be 72.7% to 72.71%. this is 450bp!
schmorptron•1h ago
Oh wow, a lot of focus on code from the big labs recently. In hindsight it makes sense that the domain the people building it know best is the one getting the most attention, and it's also the one the models have seen the most undeniable usefulness in so far. Though personally, the unpredictability of the future where all of this goes is a bit unsettling at the same time...
modeless•56m ago
OpenAI and Anthropic are both trying to automate their own AI research, which requires coding.
martinald•51m ago
Thing is though if you are good at code it solves many other adjacent tasks for LLMs, like formatting docs for output, presentations, spreadsheet analysis, data crawling etc.
doctoboggan•49m ago
Along with developers wanting to build tools for developers like you said, I think code is a particularly good use case for LLMs (large language models), since the output product is a language.
fragmede•29m ago
It's because the output is testable. If the model outputs a legal opinion or medical advice, a human needs to be looped in to verify that the advice is not batshit insane. Meanwhile, if the output is code, it can be run through a compiler and (unit) tests to verify that the generated code is cromulent without a human in the loop 100% of the time, which means the supercomputer can just go off and do its thing with less supervision.
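That verify-without-a-human loop can be sketched in a few lines; here `compile()`/`exec` and a single assertion stand in for a real compiler and test suite, and `add` is a toy stand-in for whatever function the model was asked to generate:

```python
# Accept generated code only if it compiles and passes a unit test; no human
# in the loop. The "test harness" here is deliberately minimal.
def verify(candidate_source: str) -> bool:
    try:
        code = compile(candidate_source, "<generated>", "exec")
        namespace: dict = {}
        exec(code, namespace)
        return namespace["add"](2, 3) == 5   # the "unit test"
    except Exception:
        return False

print(verify("def add(a, b): return a + b"))  # True
print(verify("def add(a, b): return a - b"))  # False: compiles, fails test
```

A rejected candidate just triggers another generation attempt, which is why code generation can run with far less supervision than domains where only a human can judge the output.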
fibers•1h ago
This looks exciting. I hope they add this to Windsurf soon.
pzo•41m ago
it looks like its already there
cube2222•1h ago
So… seems like we’re back to Sonnet being better than Opus? At least based on their benchmarks.

Curious to see that in practice, but great if true!

catigula•1h ago
I happened to be in the middle of a task in a production codebase that the various models struggled on so I can give a quick vibe benchmark:

opus 4.1: made weird choices, eventually got to a meh solution i just rolled back.

codex: took a disgusting amount of time but the result was vastly superior to opus. night and day superiority. output was still not what i wanted.

sonnet 4.5: not clearly better than opus. categorically worse decision-making than codex. very fast.

Codex was night and day the best. Codex scares me, Claude feels like a useful tool.

poisonborz•10m ago
These reviews are pretty useless to other developers. Models perform vastly differently across languages, task types, and frameworks.
MichealCodes•1h ago
I really hope benchmarking improves soon to monitor the model in the weeks following the announcement. It really seems like these companies introduce a new "buffed" model and then slowly nerf the intelligence through optimizations.

If we saw task performance week 1 vs week 8 on benchmarks, this would at least give us more insight into the loop here. In an environment lacking true progress a company could surely "show" it with this strategy.

SubiculumCode•9m ago
I do wonder about this. I just don't know if it's real or in our heads.
scosman•58m ago
Interesting quirk on first use: "`temperature` and `top_p` cannot both be specified for this model. Please use only one."
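A client-side guard mirroring that reported error might look like the sketch below. The helper name and the exact wording are mine, not Anthropic's; only the one-knob-only constraint comes from the error message above:

```python
def build_sampling_params(temperature=None, top_p=None):
    """Build sampling kwargs, honoring the one-knob-only constraint."""
    if temperature is not None and top_p is not None:
        # Mirrors the API error: pick one knob, leave the other at its default.
        raise ValueError(
            "`temperature` and `top_p` cannot both be specified for this model."
        )
    params = {}
    if temperature is not None:
        params["temperature"] = temperature
    if top_p is not None:
        params["top_p"] = top_p
    return params
```

Catching this before the request saves a round trip to the API.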
seaal•58m ago
They really had to release an updated model, I can only imagine how many people cancelled their plans and switched over to Codex over the past month.

I'm glad they at least gave me the full $100 refund.

user1999919•58m ago
It's time to start benchmarking the benchmarks. I'm pretty sure they're doping the game here, BMW-style.
_joel•57m ago
`claude --model claude-sonnet-4-5-20250929` for CLI users
trevin•57m ago
I’m always fascinated by the fine-tuning of LLM personalities. Might we finally get less of the reflexive “You’re absolutely right” with this one?

Maybe we’re entering the Emo Claude era.

Per the system card: In 250k real conversations, Claude Sonnet 4.5 expressed happiness about half as often as Claude 4, though distress remained steady.

fnordsensei•28m ago
I personally enjoy the “You’re absolutely right!” exclamation. It signals alignment with my feedback in a consistent manner.
transcriptase•26m ago
You’re overlooking the fact that it still says that when you are, in reality, absolutely wrong.
podgietaru•11m ago
And that it often spits out the exact same wrong answer in response.
rudedogg•56m ago
I just ran this through a simple change I’ve asked Sonnet 4 and Opus 4.1, and it fails too.

It’s a simple substitution request where I provide a Lint error that suggests the correct change. All the models fail. I could ask someone with no development experience to do this change and they could.

I worry everyone is chasing benchmarks to the detriment of general performance. Or the next-token weights for the incorrect change outweigh my simple but precise instructions. Either way it's no good.

Edit: With a followup "please do what I asked" sort of prompt it came through, while Opus just loops. So there's that at least.

darksaints•49m ago
> I worry everyone is chasing benchmarks to the detriment of general performance.

I've been worried about this for a while. I feel like Claude in particular took a step back in my own subjective performance evaluation in the switch from 3.7 to 4, while the benchmark scores leaped substantially.

To be fair, benchmarking has always been the most difficult problem to solve in this space, so it's not surprising that benchmark development isn't exactly keeping pace with all of the modeling/training development happening.

MichealCodes•43m ago
More like churning benchmarks... Release new model at max power, get all the benchmark glory, silently reduce model capability in the following weeks, repeat by releasing newer, smarter model.
zamadatix•35m ago
That (thankfully) can't compound, so would never be more than a one time offset. E.g. if you report a score of 60% SWE-bench verified for new model A, dumb A down to score 50%, and report a 20% improvement over A with new model B then it's pretty obvious when your last two model blogposts say 60%.

The only way around this is to never report on the same benchmark version twice, and they include too many benchmarks to realistically swap them all out every release.
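The non-compounding argument above, with made-up numbers, as a quick sanity check:

```python
# Hypothetical scores illustrating why a silent nerf can't compound:
# model A ships reporting 60% on some benchmark, is quietly degraded to
# 50%, and model B is then announced as a 20% relative gain over the
# degraded A.
reported_a = 0.60
degraded_a = 0.50
reported_b = degraded_a * 1.20  # lands right back at ~0.60

# Two consecutive announcements claiming ~60% on the same benchmark
# version expose the offset, so the trick buys at most a one-time bump.
assert abs(reported_b - reported_a) < 1e-9
```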

MichealCodes•31m ago
The benchmarks are not typically ongoing, we do not often see comparisons between week 1 and week 8. Sprinkle a bit of training on the benchmarks in and you can ensure higher scores for the next model. A perfect scam loop to keep the people happy until they wise up.
Cthulhu_•9m ago
That's what I was thinking too; the models have the same data sources (they have all scraped the internet, github, book repositories, etc), they all optimize for the same standardized tests. Other than marginally better scores in those tests (and they will cherry-pick them to make them look better), how do the various competitors differentiate from each other still? What's the USP?
itsoktocry•2m ago
>It’s a simple substitution request where I provide a Lint error that suggests the correct change. All the models fail. I could ask someone with no development experience to do this change and they could.

I don't understand why this kind of thing is useful. Do the thing yourself and move on. For every one problem like this, AI can do 10 better/faster than I can.

sberens•54m ago
Is "parallel test time compute" available in claude code or the api? Or is it something they built internally for benchmark scores?
ancorevard•47m ago
Can't use Anthropic models in Cursor. Completely cost prohibitive compared to gpt-5 and grok models.

Why is this? Does Anthropic have just higher infrastructure costs compared to OpenAI/xAI?

doctoboggan•42m ago
Possibly, or they are pricing for sustainability and OpenAI/xAI are just burning through VC money.
acchow•34m ago
The Anthropic models are also better at coding. Why wouldn’t they price it higher?
dbbk•14m ago
It's meant to be used with the Max subscription
wohoef•41m ago
And Sonnet is again better than Opus. I’d love to see simultaneous release dates for Sonnet and Opus one day. Just so that Opus is always better than Sonnet
cloverich•39m ago
Please y'all, when you post supportive or critical claims based on your actual work, include some specifics of the task and prompt. Like the actual prompt, actual bugs, actual feature, etc.

I've had great success with both ChatGPT and Claude for years, am at around a 3x sustained output increase in my professional work, and am kicking off and finishing new side projects / features that I used to simply never finish. BUT there are some tasks I run into where it's god awful. Because I have enough good experience, I know how to work around, when to give up, when to move on, etc. I am still surprised at things it cannot do; for example, Claude Code could not seem to stitch together three screens in an iOS app using the latest SwiftUI (I am not an iOS dev).

IMHO for people using it off and on or sparingly, it's going to seem either incredible or worthless depending on your project and prompt. Share details, it's so helpful for meaningful conversation!
emil-lp•32m ago
How do you measure 3x sustained output increase?

Is it number of lines? Tickets closed? PRs opened or merged? Number of happy customers?

senordevnyc•30m ago
Oh good, a new discussion point that we haven't heard 1000x on here.

Have you heard of that study that shows AI actually makes developers less productive, but they think it makes them more productive??

EDIT: sorry all, I was being sarcastic in the above, which isn't ideal. Just annoyed because that "study" was catnip to people who already hated AI, and they (over-) cite it constantly as "evidence" supporting their preexisting bias against AI.

rapind•15m ago
> Have you heard of that study that shows AI actually makes developers less productive, but they think it makes them more productive??

Have you looked into that study? There's a lot wrong with it, and it's been discussed ad nauseam.

Also, what a great catch 22, where we can't trust our own experiences! In fact, I just did a study and my findings are that everyone would be happier if they each sent me $100. What's crazy is that those who thought it wouldn't make them happier, did in fact end up happier, so ignore those naysayers!

inopinatus•27m ago
It is undoubtedly 3x as many bugs.
_alternator_•14m ago
This would be a win. Professionals make about 1 bug for every 100 loc. If you get 3x the code with 3x the bugs, this is the definition of scaling yourself.
hshshshshsh•26m ago
All these are useless metrics. They don't say anything meaningful about the quality of your life. I would be more interested in knowing whether he can now retire in the next 5 years instead of waiting another 15.

Or does he now just get to work for 2 hours and enjoy the remaining 6 doing meaningful things apart from staring at a screen?

senordevnyc•32m ago
HN is such a negative and cynical place these days that it's just not worth it. I just don't have the patience to hear yet another anti-AI rant, or have someone who is ideologically opposed to AI nitpick its output. Like you, I've found AI to be a huge help for my work, and I'm happy to keep outcompeting the people who are too stubborn to approach it with an open mind.
catigula•25m ago
It's not so much that they're being negative it's that you can't see that you're an Ouroboros consuming your own tail and they can. Skill issue as they say.
senordevnyc•22m ago
OK. Well, I've been doing this the hard way for about twenty years, and now with AI in the mix my little solo SaaS has gone from nothing to $5k MRR in six weeks. Guess I'm not holding it completely wrong?
scrollaway•22m ago
You are making assumptions about someone you have never talked to in the past, and don't know anything about.

Of the two of you, I know which one I'd bet on being "right". (Hint: it's the one talking about their own experience, not the one projecting theirs onto someone else.)

catigula•20m ago
What assumptions am I making? Aren't you making assumptions about what I'm saying? It appears your assumptions are extremely egregious because they're blatantly and even comically hypocritical.

To that poster:

Literally everyone in development is using AI.

The difference is "negative" people can clearly see that it's on a trajectory in the NEAR, not even distant, future to completely eat your earnings, so they're not thrilled.

You're in the forest and you're going "Wow, look at all these trees! Cool!"

The hubris is thinking that you're a permanent indispensable part of the loop.

senordevnyc•15m ago
We are reading very different "negative" comments here.

Most of the anti-AI comments I see on HN are NOT a version of "the problem with AI is that it's so good it's going to replace me!"

scrollaway•10m ago
> The difference is "negative" people can clearly see that it's on a trajectory in the NEAR, not even distant, future to completely eat your earnings, so they're not thrilled.

We birthed a level of cognition out of silicon that nobody would imagine even just four years ago. Sorry, but some brogrammers being worried about making ends meet is making me laugh - it's all the same people who have been automating everyone else's jobs for the past two decades (and getting paid extremely fat salaries for it), and you're telling me now we're all supposed to be worried because it's going to affect our salaries?

Come on. You think everyone who's "vibe coding" doesn't understand the pointlessness of 90% of codemonkey work? Hell, most smart engineers understood that pointlessness years ago. Most coders work on boring CRUD apps and REST APIs to make revenue go up 0.02%. And those that aren't, are probably working on ads.

It's a fraction of a fraction that is at all working on interesting things.

Personally, yeah, I saw it coming and instead of "accepting fate", I created an AI research lab. And I diversified the hell out of my skillset as well; started working way out of my comfort zone. If you want to keep up with changing times, start challenging yourself.

rapind•12m ago
> you can't see that you're an Ouroboros consuming your own tail and they can

Hey, so if I DO see it, can I stop it from happening?

kelsey98765431•21m ago
All major nation-state intelligence services have an incentive to spread negative sentiment and reduce developer adoption of AI technology as they race to catch up with the United States.
emp17344•11m ago
You don’t get to shut down discussion you don’t like. That’s the opposite of exhibiting an open mind.
scrollaway•1m ago
GP is right, though. Many programming communities have become ridiculous anti-AI bubbles - what's the point of trying to have a discussion if you're going to get systematically shut down by people whose entire premise is that they don't use it? It's like trying to explain color to the blind.

What "discussion" do you want to have? Another round of "LLMs are terrible at embedded hardware programming ergo they're useless"? Maybe with a dash of "LLMs don't write bug-free software [but I do]" to close it off?

The discussions that are at all advancing the state of the art are happening on forums that accept reality as a matter of fact, without people constantly trying to pretend things because they're worried they'll lose their job if they don't.

bigyabai•31m ago
> for example Claude code could not seem to stitch together three screens in an iOS app using the latest SwiftUI

That's... not super surprising? SwiftUI changes pretty dang often, and the knowledge cutoff doesn't progress fast enough to cover every use-case.

I use Claude to write GTK interfaces, which is a UI library with a much slower update cadence. LLMs seem to have a pretty easy time working with bog-standard libraries that don't make giant idiomatic changes.

danieloj•29m ago
Could you share the actual examples of where you’re seeing the 3x output increase?
alfalfasprout•22m ago
right? The irony is so thick you could cut it with a butter knife
not_kurt_godel•11m ago
3 * 0 = 0.

Checkmate, aitheists.

Mathiciann•17m ago
I am almost convinced your comment is parody but I am not entirely sure.

You want evidence for critical/supportive claims? Then almost in the same sentence you make an insane claim without backing it up with any evidence.

stavros•9m ago
Well, here's an even more insane claim: I'm infinity times more productive, as I just wouldn't even start projects without the LLM to sidestep my ADHD. Then, when the LLM invariably fucks up, I step in and finish things myself!

Here are a few projects that I made these past few months that wouldn't have been possible without LLMs:

* https://github.com/skorokithakis/dracula - A simple blood test viewer.

* https://www.askhuxley.com - A general helper/secretary/agent.

* https://www.writelucid.cc - A business document/spec writing tool I'm working on, it asks you questions one at a time, writes a document, then critiques the idea to help you strengthen it.

* A rotary phone that's a USB headset and closes your meeting when you hang up the phone, complete with the rotary dial actually typing in numbers.

* Made some long-overdue updates on my pastebin, https://www.pastery.net, to improve general functionality.

* https://github.com/skorokithakis/support-email-bot - A customer support bot to answer general questions about my projects to save me time on the easy stuff, works great.

* https://github.com/skorokithakis/justone - A static HTML page for the board game Just One, so you can play with your friends when you're physically together, without needing to bring the game along.

* https://github.com/skorokithakis/dox - A thing to run Dockerized CLI programs as if they weren't Dockerized.

I'm probably forgetting a lot more, but I honestly wouldn't have been bothered to start any of the above if not for LLMs, as I'm too old to code but not too old to make stuff.

EDIT: dang can we please get a bit better Markdown support? At least being able to make lists would be good!

lisbbb•4m ago
Did you make any money off any of that or was it all just labors of love type of stuff? I'm enjoying woodworking...
emp17344•4m ago
> I'm infinity times more productive, as I just wouldn't even start projects without the LLM to sidestep my ADHD.

1 is not infinitely greater than 0.

asdev•14m ago
this is a great copypasta
dirkc•11m ago
Would you say you do things you'd normally do 3 times faster? Or does it help you move past the things you'd get stuck on or avoid in the past, resulting in an overall 3x speedup?
mpern•11m ago
Would you be so kind to lead by example?

What are the specific tasks + prompts giving you a 3x increased output, and conversely, what tasks don't work at all?

After an admittedly cursory scan of your blog and the repos in your GH account I don't find anything in this direction.

FrustratedMonky•7m ago
New Claude Model Runs 30-Hour Marathon To Create 11,000-Line Slack Clone

https://www.theverge.com/ai-artificial-intelligence/787524/a...

Yeah, maybe it is garbage. But it is still another milestone; if it can do this, then it probably does OK with the smaller things.

This keeps incrementing from "garbage" to "wow this is amazing" at each new level. We're already forgetting that this was unbelievable magic a couple years ago.

mbesto•7m ago
> include some specifics of the task and prompt. Like actual prompt, actual bugs, actual feature, etc.

> I am still surprised at things it cannot do, for example Claude code could not seem to stitch together three screens in an iOS app using the latest SwiftUI (I am not an iOS dev).

You made a critical comment yet didn't follow your own rules lol.

> it's so helpful for meaningful conversation!

How so?

FWIW - I too have used LLMs for both coding and personal prompting. I think the general conclusion is that when it works, it works well, but when it fails it can fail miserably and be disastrous. I've come to this conclusion from reading people complaining here and through my own experience.

Here's the problem:

- It's not valuable for me to print out my whole prompt sequence (and context for that matter) in a message board. The effort is boundless and the return is minimal.

- LLMs should just work(TM). The fact that they can fail so spectacularly is a glaring issue. These aren't just bugs, they are foundational because LLMs by their nature are probabilistic and not deterministic. Which means providing specific defect criteria has limited value.

boogieknite•5m ago
> for example Claude code could not seem to stitch together three screens in an iOS app using the latest SwiftUI

Have you tried it in the new Xcode extension? That tool is surprisingly good in my limited use, one of the few times Xcode has impressed me in my 2 years of using it. I've read some anecdotes that Claude in the Xcode tool is more accurate than standard Claude Code for Swift. I haven't noticed that myself, but I've only used the Xcode tool twice so far.

bartread•5m ago
I had a complete shocker with all of Claude, GitHub Copilot, and ChatGPT when trying to prototype an iOS app in Swift around 12 months ago. They would all really struggle to generate anything usable, and making any progress was incredibly slow due to all the problems I was running into.

This was in stark contrast to my experience with TypeScript/NextJS, Python, and C#. Most of the time output quality for these was at least usefully good. Occasionally you’d get stuck in a tarpit of bullshit/hallucination around anything very new that hadn’t been in the training dataset for the model release you were using.

My take: there simply isn’t the community, thought leadership, and sheer volume of content around Swift that there is around these other languages. This means both lower quantity and lower quality of training data for Swift as compared to these other languages.

And that, unfortunately, plays negatively into the quality of LLM output for app development in Swift.

(Anyone who knows better, feel free to shoot me down.)

marginalia_nu•38m ago
Is there some accessible explainer for what these numbers that keep going up actually mean? What happens at 100% accuracy or win rate?
asadm•35m ago
then we need a new bench.
lukev•34m ago
It means that the benchmark isn't useful anymore and we need to build a harder one.

edit: as far as what the numbers mean, they are arbitrary. They are only useful insofar as you can run two models (or two versions of the same model) on the same benchmark, and compare the numbers. But on an absolute scale the numbers don't mean anything.

unshavedyak•37m ago
Interesting, in the new 2.0.0 claude code they got rid of the "Plan with Opus then switch to Sonnet" feature. I hope they're correct in Sonnet being good enough to Plan too, because i quite preferred Opus planning. It wasn't necessarily "better", just more predictable in my experience.

Also as a Max $200 user, feels weird to be paying for an Opus tailored sub when now the standard Max $100 would be preferred since they claim Sonnet is better than Opus.

Hope they have Opus 4.5 coming out soon, or next month I'm downgrading.

Implicated•10m ago
I'm also a max user and I just _leave_ it on Opus 4.1 - I've never hit a rate limit.
vb-8448•36m ago
The claims against GPT-5 are huge!

I used to use cc, but I switched to codex (and it was much better)... now I guess I have to switch back to CC, at least to test it.

bradley13•35m ago
I need to try Claude - haven't gotten to it.

I use AI for different things, though, including proofreading posts on political topics. I have run into situations where ChatGPT just freezes and refuses. Example: discussing the recent rape case involving a 12-year-old in Austria. I assume its guardrails detect "sex + kid" and give a hard "no" regardless of the actual context or content.

That is unacceptable.

That's like your word processor refusing to let you write about sensitive topics. It's a tool, it doesn't get to make that choice.

Implicated•13m ago
I'd imagine that the ratio of "legit" conversations around these topics to the ones they're intending to disallow is lopsided enough that it doesn't make sense for them to even entertain the idea of supporting those conversations.

As a rather hilarious and really annoying related issue: I have a real use case where the application I'm working on partially monitors/analyzes the bloodlines of some rather specific/ancient mammals used in competition and... well... it doesn't like terms like "breeders" and "breeding".

catigula•35m ago
I'm still absolutely right constantly, I'm a genius. I also make various excellent points.
hu3•26m ago
I wonder if/when this will be available to GitHub Copilot in VSCode.
Osyris•13m ago
Wonder no more: https://github.blog/changelog/2025-09-29-anthropic-claude-so...
aliljet•18m ago
These benchmarks remain remarkably weak proxies for real-world work. If you're using this for day-to-day work, the eval that really matters is how the model handles a ten-step action. Context and focus are absolutely king in real-world work. To be fair, Sonnet has tended to be very good at that...

I wonder if the 1m token context length is coming for this ride too?

edude03•17m ago
Ah, I figured something was up - I had sonnet 4 selected but it changed to "Legacy Model" while I was using the app.
peterdstallion•17m ago
I am a paying subscriber to Gemini, Claude and OpenAI.

I don't know if it's me, but over the last few weeks I've got to the conclusion ChatGPT is very strongly leading the race. Every answer it gives me is better - it's more concise and more informative.

I look forward to testing this further, but out of the few runs I just did after reading about this - it isn't looking much better

yepyip•1m ago
What about Grok, are they catching up?
Bjorkbat•12m ago
> Practically speaking, we’ve observed it maintaining focus for more than 30 hours on complex, multi-step tasks.

Really curious about this since people keep bringing it up on Twitter. They mention it pretty much off-handedly in their press release, and it doesn't show up at all in their system card. It's only through an article on The Verge that we get more context: apparently they told it to build a Slack clone and left it unattended for 30 hours, and it built a Slack clone in 11,000 lines of code (https://www.theverge.com/ai-artificial-intelligence/787524/a...)

I have very low expectations around what would happen if you took an LLM and let it run unattended for 30 hours on a task, so I have a lot of questions as to the quality of the output

asdev•10m ago
how do claude/openai get around rate limiting/captcha with their computer use functionality?
chrisford•5m ago
The vision model has consistently been degraded since 3.5, specifically around OCR, so I hope it has improved with Claude Sonnet 4.5!
nickphx•56s ago
It will be great when the VC cash runs out, the screws tighten, and finally an end to the incessant misleading marketing claims.