frontpage.
newsnewestaskshowjobs

Made with ♥ by @iamnishanth

Open Source @Github

fp.

Your AI Excitement Is Someone's AI Apprehension

https://metedata.substack.com/p/007-your-ai-excitement-is-someones
1•young_mete•1m ago•0 comments

Code is not even close to half the battle

https://aleksei.dev/code-is-not-even-close-to-half-the-battle/
1•speckx•2m ago•0 comments

Show HN: AriaType v0.3 Has Come – Fast and Private Voice to Text Input Client

https://github.com/joe223/ariatype
1•Joe_Harris•3m ago•0 comments

Tech Goes Invisible [video]

https://www.youtube.com/watch?v=CEeUF1dW1kg
2•ericjamesward•4m ago•0 comments

Mnemo – a local-first notepad that acts as memory for AI agents

https://github.com/fwgadmin/mnemo
1•mhome9•5m ago•0 comments

Starlink outage hit drone tests, exposing Pentagon's growing reliance on SpaceX

https://www.reuters.com/business/media-telecom/starlink-outage-hit-drone-tests-exposing-pentagons...
2•ilamont•5m ago•0 comments

Building an Unverified Compiler with Agents

https://www.basis.ai/blog/verified-compiler/
1•gopiandcode•6m ago•0 comments

They Hacked Claude, Gemini, and Copilot (and No One Told You)

https://grith.ai/blog/we-hacked-claude-gemini-copilot?16-apr
1•edf13•8m ago•0 comments

Claude is about to begin its KYC verification process

https://old.reddit.com/r/ClaudeAI/comments/1smr9vs/claude_is_about_to_begin_its_kyc_verification/
2•ryangg•9m ago•0 comments

How Do You See What Cannot Be Seen?

https://news.columbia.edu/news/how-do-you-see-what-cannot-be-seen
1•danielmorozoff•10m ago•0 comments

How are you connecting cloud spend to business outcomes?

1•Ask-Winston•10m ago•0 comments

Show HN: Marky – A lightweight Markdown viewer for agentic coding

https://github.com/GRVYDEV/marky
1•GRVYDEV•11m ago•0 comments

Data breach at edtech giant McGraw Hill affects 13.5M accounts

https://www.bleepingcomputer.com/news/security/data-breach-at-edtech-giant-mcgraw-hill-affects-13...
1•Brajeshwar•11m ago•0 comments

Succinct Data Structures: Cramming 80k words into a JavaScript file

https://stevehanov.ca/blog/succinct-data-structures-cramming-80000-words-into-a-javascript-file
1•tosh•12m ago•0 comments

Skwik – Turn iPhone photos into scaled measurements for CAD work

https://usr-ein.github.io/skwik/
1•sam1902•15m ago•1 comments

Show HN: Stack – the control plane for AI agents

https://getstack.run/
1•tiel88•16m ago•0 comments

MiniVecDb – A 50KB, 1-bit quantized vector database for the browser

https://github.com/Alekkk777/MiniVecDb
2•alekkk777•16m ago•0 comments

Apple CMF 2026 and Studio Display XDR Test Results

https://www.lttlabs.com/articles/2026/04/11/apple-studio-display-xdr-display-testing-results
1•LabsLucas•17m ago•0 comments

Insights on software engineering job openings – April 2026

https://corvi.careers/blog/global_software-engineering_jobs_april_2026/
1•sp1982•17m ago•0 comments

It's OK to compare floating-points for equality

https://lisyarus.github.io/blog/posts/its-ok-to-compare-floating-points-for-equality.html
1•abnercoimbre•19m ago•0 comments

Claude Code is a black box. Here is how to trace its tool calls and LLM requests

https://www.arthur.ai/blog/claude-code-observability-tracing-with-arthur
1•pevals•20m ago•0 comments

Tech Billionaires Want Christians to Believe in AI

https://www.motherjones.com/politics/2026/04/ai-religious-right-christianity-thiel-katherine-boyl...
3•cdrnsf•20m ago•1 comments

Critical Atlantic current significantly more likely to collapse than thought

https://www.theguardian.com/environment/2026/apr/15/critical-atlantic-current-significantly-more-...
3•sideway•20m ago•0 comments

How the American Oligarchy Went Hyperscale

https://www.motherjones.com/politics/2026/04/american-oligarchy-hyperscale-data-centers-meta-open...
3•cdrnsf•20m ago•0 comments

You wouldn't have liked the Beatles

https://thomasbarrie.substack.com/p/you-wouldnt-have-liked-the-beatles
1•badc0ffee•21m ago•0 comments

Will Nvidia's moat persist? [video]

https://www.youtube.com/watch?v=Hrbq66XqtCo
1•tosh•21m ago•0 comments

AE DDS Exporter – Export DDS BC6/BC7 Texture Sequences from Adobe After Effects

https://somesmall.studio
1•bj-rn•22m ago•0 comments

AGPLv3§7 Paragraph 4 Empowers Users to Thwart Badgeware

https://sfconservancy.org/blog/2026/apr/16/badgeware-onlyoffice-nextcloud-affero-gpl/
2•hn_acker•22m ago•1 comments

Show HN: ZxClip – Mac app to edit audio like text; runs locally

https://zxclip.com/
1•SingAlong•23m ago•0 comments

Android Auto users say Gemini won't stop talking, and it's not even right

https://www.androidauthority.com/android-auto-gemini-problems-3657698/
1•speckx•24m ago•0 comments
Open in hackernews

Claude Opus 4.7

https://www.anthropic.com/news/claude-opus-4-7
519•meetpateltech•1h ago

Comments

Kim_Bruning•1h ago
> "We are releasing Opus 4.7 with safeguards that automatically detect and block requests that indicate prohibited or high-risk cybersecurity uses. "

This decision is potentially fatal. You need symmetric capability to research and prevent attacks in the first place.

The opposite approach is 'merely' fraught.

They're in a bit of a bind here.

erdaniels•58m ago
Now we have to trick the models when you legitimately work in the security space.
ls612•56m ago
Only software approved by Anthropic (and/or the USG) is allowed to be secure in this brave new era.
nope1000•54m ago
Except when you accidentally leak your entire codebase, oops
velcrovan•49m ago
Questions about "fatality" aside, where do you see asymmetry here?
jp0001•9m ago
It's easier to produce vulnerable code than it is to use the same Model to make sure there are no vulnerabilities.
johnmlussier•15m ago
I am absolutely moving off them if this continues to be the case.
dgb23•5m ago
I agree with you here. I think this is for product placement for Mythos.
u_sama•1h ago
Excited to use 1 prompt and have my whole 5-hour window at 100%. They can keep releasing new ones but if they don't solve their whole token shrinkage and gaslighting it is not gonna be interesting to se.
lbreakjai•1h ago
Solve? You solve a problem, not something you introduced on purpose.
fetus8•52m ago
on Tuesday, with 4.6, I waited for my 5 hour window to reset, asked it to resume, and it burned up all my tokens for the next 5 hour window and ran for less than 10 seconds. I’ve never cancelled a subscription so fast.
u_sama•46m ago
I tried the Claude Extension for VSCode on WSL for a reverse engineering task, it consumed all of my tokens, broke and didn't even save the conversatioon
fetus8•19m ago
That’s truly awful. What a broken tool.
HarHarVeryFunny•27m ago
It seems a lot of the problem isn't "token shrinkage" (reducing plan limits), but rather changes they made to prompt caching - things that used to be cached for 1 hour now only being cached for 5 min.

Coding agents rely on prompt caching to avoid burning through tokens - they go to lengths to try to keep context/prompt prefixes constant (arranging non-changing stuff like tool definitions and file content first, variable stuff like new instructions following that) so that prompt caching gets used.

This change to a new tokenizer that generates up to 35% more tokens for the same text input is wild - going to really increase token usage for large text inputs like code.

benleejamin•1h ago
For anyone who was wondering about Mythos release plans:

> What we learn from the real-world deployment of these safeguards will help us work towards our eventual goal of a broad release of Mythos-class models.

not_ai•1h ago
Oh look it was too powerful to release, now it’s just a matter of safeguards.

This story sounds a lot like GPT2.

poszlem•1h ago
It's too powerful now. Once GPT6 is released it will suddenly, magically, become not too powerful to release.
thomasahle•1h ago
Or, you know, they will have improved the safe guards
poszlem•42m ago
Sure thing.
latentsea•1h ago
For a second there I read that as 'GTA 6', and that got me thinking maybe the reason GTA 6 hasn't come out all of these years is because of how dangerous and powerful it's going to be.
mrbombastic•1h ago
productivity going right back down again, ah well they weren't going to pay us more anyway
tabbott•1h ago
The original blog post for Mythos did lay out this safeguard testing strategy as part of their plan.
hgoel•1h ago
This seems needlessly cynical. I don't think they said they never planned to release it.

They seemed to make it clear that they expect other labs to reach that level sooner or later, and they're just holding it off until they've helped patch enough vulnerabilities.

camdenreslink•1h ago
My guess is that it is just too expensive to make generally available. Sounds similar to ChatGPT 4.5 which was too expensive to be practical.
frank-romita•1h ago
The most highly anticipated model looking forward to using it
jampa•1h ago
Mythos release feels like Silicon Valley "don't take revenue" advice:

https://www.youtube.com/watch?v=BzAdXyPYKQo

""If you show the model, people will ask 'HOW BETTER?' and it will never be enough. The model that was the AGI is suddenly the +5% bench dog. But if you have NO model, you can say you're worried about safety! You're a potential pure play... It's not about how much you research, it's about how much you're WORTH. And who is worth the most? Companies that don't release their models!"

CodingJeebus•1h ago
Completely agree. We're at this place where a frontier model's peak perceived value always seems to be right before it releases.
msp26•1h ago
They don't have the compute to make Mythos generally available: that's all there is to it. The exclusivity is also nice from a marketing pov.
alecco•1h ago
They don't have demand for the price it would require for inference.

They are definitely distilling it into a much smaller model and ~98% as good, like everybody does.

lucrbvi•1h ago
Some people are speculating that Opus 4.7 is distilled from Mythos due to the new tokenizer (it means Opus 4.7 is a new base model, not just an improved Opus 4.6)
alecco•53m ago
Yes, I was thinking that. But it could as well be the other way around. Using the pretrained 4.7 (1T?) to speed up ~70% Mythos (10T?) pretraining.

It's just speculative decoding but for training. If they did at this scale it's quite an achievement because training is very fragile when doing these kinds of tricks.

ACCount37•37m ago
Reverse distillation. Using small models to bootstrap large models. Get richer signal early in the run when gradients are hectic, get the large model past the early training instability hell. Mad but it does work somewhat.

Not really similar to speculative decoding?

I don't think that's what they've done here though. It's still black magic, I'm not sure if any lab does it for frontier runs, let alone 10T scale runs.

aesthesia•28m ago
The new tokenizer is interesting, but it definitely is possible to adapt a base model to a new tokenizer without too much additional training, especially if you're distilling from a model that uses the new tokenizer. (see, e.g., https://openreview.net/pdf?id=DxKP2E0xK2).
baq•54m ago
> They don't have demand for the price it would require for inference.

citation needed. I find it hard to believe; I think there are more than enough people willing to spend $100/Mtok for frontier capabilities to dedicate a couple racks or aisles.

CodingJeebus•1h ago
I've read so many conflicting things about Mythos that it's become impossible to make any real assumptions about it. I don't think it's vaporware necessarily, but the whole "we can't release it for safety reasons" feels like the next level of "POC or STFU".
shostack•1h ago
Looks like they are adding Peter Thiel backed ID verification too.

https://reddit.com/r/ClaudeAI/comments/1smr9vs/claude_is_abo...

szmarczak•31m ago
You should've commented this on the parent thread for visibility, I had to scroll to find this, as I don't browse r/ClaudeAI regularly.
postflopclarity•1h ago
funny how they use mythos preview in these benchmarks like a carrot on a stick
ansley•1h ago
marketing
oliver236•1h ago
someone tell me if i should be happy
nickmonad•1h ago
Did you try asking the model?
TIPSIO•1h ago
Quick everyone to your side projects. We have ~3 days of un-nerfed agentic coding again.
Esophagus4•1h ago
3 days of side project work is about all I had in me anyway
ttul•58m ago
... your side projects that will soon become your main source of income after you are laid off because corporate bosses have noticed that engineers are more productive...
johnwheeler•56m ago
Exactly. God, it wouldn't be such a problem if they didn't gaslight you and act like it was nothing. Just put up a banner that says Claude is experiencing overloaded capacity right now, so your responses might be whatever.
replwoacause•32m ago
More like 2 hours considering these usage limits
user34283•13m ago
Perhaps on the 10x plan.

It went through my $20 plan's session limit in 15 minutes, implementing two smallish features in an iOS app.

That was with the effort on auto.

It looks like full time work would require the 20x plan.

alvis•1h ago
TL;DR; iPhone is getting better every year

The surprise: agentic search is significantly weaker somehow hmm...

buildbot•1h ago
Too late, personally after how bad 4.6 was the past week I was pushed to codex, which seems to mostly work at the same level from day to day. Just last night I was trying to get 4.6 to lookup how to do some simple tensor parallel work, and the agent used 0 web fetches and just hallucinated 17K very wrong tokens. Then the main agent decided to pretend to implement tp, and just copied the entire model to each node...
alvis•1h ago
I don't have much quality drop from 4.6. But I also notice that I use codex more often these days than claude code
buildbot•1h ago
It's been shockingly bad for me - for another example when asked to make a new python script building off an existing one; for some cursed reason the model choose to .read() the py files, use 100 of lines of regex to try to patch the changes in, and exec'd everything at the end...
kivle•22m ago
Hate that about Claude Code. I have been adding permissions for it to do everything that makes sense to add when it comes to editing files, but way too often it will generate 20-30 line bash snippets using sed to do the edits instead, and then the whole permission system breaks down. It means I have to babysit it all the time to make sure no random permission prompts pop up.
fluidcruft•49m ago
I generally think codex is doing well until I come in with my Opus sweep to clean it up. Claude just codes closer to the way my brain works. codex is great at finding numerical stability issues though and increasingly I like that it waits for an explicit push to start working. But talking to Claude Code the way I learned to talk to codex seems to work also so I think a lot of it is just learning curve (for me).
cmrdporcupine•1h ago
Yep, I'll wait for the GPT answer to this. If we're lucky OpenAI will release a new GPT 5.5 or whatever model in the next few days, just like the last round.

I have been getting better results out of codex on and off for months. It's more "careful" and systematic in its thinking. It makes less "excuses" and leaves less race conditions and slop around. And the actual codex CLI tool is better written, less buggy and faster. And I can use the membership in things like opencode etc without drama.

For March I decided to give Claude Code / Opus a chance again. But there's just too much variance there. And then they started to play games with limits, and then OpenAI rolled out a $100 plan to compete with Anthropic's.

I'm glad to see the competition but I think Anthropic has pissed in the well too much. I do think they sent me something about a free month and maybe I will use that to try this model out though.

davely•1h ago
I’ve been on the Claude Code train for a while but decided to try Codex last week after they announced the $100 USD Pro plan.

I’ve been pretty happy with it! One thing I immediately like more than Claude is that Codex seems much more transparent about what it’s thinking and what it wants to do next. I find it much easier to interrupt or jump in the middle if things are going to wrong direction.

Claude Code has been slowly turning into this mysterious black box, wiping out terminal context any time it compacts a conversation (which I think is their hacky way of dealing with terminal flickering issues — which is still happening, 14 months later), going out of the way to hide thought output, and then of course the whole performance issues thing.

Excited to try 4.7 out, but man, Codex (as a harness at least) is a stark contrast to Claude Code.

cmrdporcupine•1h ago
Do this -- take your coworker's PRs that they've clearly written in Claude Code, and have Codex/GPT 5.4 review them.

Or have Codex review your own Claude Code work.

It then becomes clear just how "sloppy" CC is.

I wouldn't mind having Opus around in my back pocket to yeet out whole net new greenfield features. But I can't trust it to produce well-engineered things to my standards. Not that anybody should trust an LLM to that level, but there's matters of degree here.

afavour•1h ago
> It then becomes clear just how "sloppy" CC is.

Have you done the reverse? In my experience models will always find something to criticize in another model's work.

cmrdporcupine•1h ago
I have, and in fact models will find things to criticize in their own work, too, so it's good to iterate.

But I've had the best results with GPT 5.4

woadwarrior01•54m ago
It cuts both ways. What I usually do these days is to let codex write code, then use claude code /simplify, have both codex and claude code review the PR, then finally manually review and fixup things myself. It's still ~2x faster than doing everything by myself.
cmrdporcupine•48m ago
I often work this way too, but I'll say this:

This flow is exhausting. A day of working this way leaves me much more drained than traditional old school coding.

woadwarrior01•44m ago
100%. On days when I'm sleep deprived (once or twice a week), I fallback to this flow. On regular days, I tend to write more code the old school way and use things things for review.
kevinsync•11m ago
I've been using Claude and Codex in tandem ($100 CC, $20 Codex), and have made heavy use of claude-co-commands [0] to make them talk. Outside of the last 1-2 weeks (which we now have confirmation YET AGAIN that Claude shits the fucking bed in the run-up to a new model release), I usually will put Claude on max + /plan to gin up a fever dream to implement. When the plan is presented, I tell it to /co-validate with Codex, which tends to fill in many implementation gaps. Claude then codes the amended plan and commits, then I have a Codex skill that reviews the commit for gaps, missed edge cases, incorrect implementation, missed optimizations, etc, and fix them. This had been working quite well up until the beginning of the month, Claude more or less got CTE, and after a week of that I swapped to $100 Codex, $20 CC plans. Now I'm using co-validation a lot less and just driving primarily via Codex. When Claude works, it provides some good collaborative insights and counter-points, but Codex at the very least is consistently predictable (for text-oriented, data-oriented stuff -- I don't use either for designing or implementing frontend / UI / etc).

As always, YMMV!

[0] https://github.com/SnakeO/claude-co-commands

cmrdporcupine•5m ago
This more or less mimics a flow that I had fairly good results from -- but I'm unwilling to pay for both right now unless I had a client or employer willing to foot the bill.

Claude Code as "author" and a $20 Codex as reviewer/planner/tester has worked for me to squeeze better value out of the CC plan. But with the new $100 codex plan, and with the way Anthropic seemed to nerf their own $100 plan, I'm not doing this anymore.

arcanemachiner•1h ago
There is a new flag for terminal flickering issues:

> Claude Code v2.1.89: "Added CLAUDE_CODE_NO_FLICKER=1 environment variable to opt into flicker-free alt-screen rendering with virtualized scrollback"

pxc•37m ago
> One thing I immediately like more than Claude is that Codex seems much more transparent about what it’s thinking and what it wants to do next. I find it much easier to interrupt or jump in the middle if things are going to wrong direction.

I've finally started experimenting recently with Claude's --dangerously-skip-permissions and Codex's --dangerously-bypass-approvals-and-sandbox through external sandboxing tools. (For now just nono¹, which I really like so far, and soon via containerization or virtual machines.)

When I am using Claude or Codex without external sandboxing tools and just using the TUI, I spend a lot of time approving individual commands. When I was working that way, I found Codex's tendency to stop and ask me whether/how it should proceed extremely annoying. I found myself shouting at my monitor, "Yes, duh, go do the thing!".

But when I run these tools without having them ask me for permission for individual commands or edits, I sometimes find Claude has run away from me a little and made the wrong changes or tried to debug something in a bone-headed way that I would have redirected with an interruption if it has stopped to ask me for permissions. I think maybe Codex's tendency to stop and check in may be more valuable if you're relying on sandboxing (external or built-in) so that you can avoid individual permissions prompts.

--

1: https://nono.sh/

muzani•1h ago
For me, making it high effort just fixed all the quality problems, and even cut down on token use somehow
vunderba•59m ago
This. They kind of snuck this into the release notes: switching the default effort level to Medium. High is significantly slower, but that’s somewhat mitigated by the fact that you don’t have to constantly act like a helicopter parent for it.
aurareturn•1h ago
Funny because many people here were so confident that OpenAI is going to collapse because of how much compute they pre-ordered.

But now it seems like it's a major strategic advantage. They're 2x'ing usage limits on Codex plans to steal CC customers and it seems to be working. I'm seeing a lot of goodwill for Codex and a ton of bad PR for CC.

It seems like 90% of Claude's recent problems are strictly lack of compute related.

energy123•1h ago
Is that 2x still going on I thought that ended in early April
aurareturn•1h ago
They did it again to "celebrate" the release of the $100 plan.
indigodaddy•11m ago
On plus?
lawgimenez•1h ago
It’s for Pro users only, I think the 2x is up to May 31.
arcanemachiner•1h ago
Different plan. The old 2x has been discontinued, and the bonus is now (temporarily) available for the new $100 plan users in an effort, presumably, to entice them away from Anthropic.
wahnfrieden•6m ago
For the $200 users, it never ended.
llm_nerd•1h ago
Most of the compute OpenAI "preordered" is vapour. And it has nothing to do with why people thought the company -- which is still in extremely rocky rapids -- was headed to bankruptcy.

Anthropic has been very disciplined and focused (overwhelmingly on coding, fwiw), while OpenAI has been bleeding money trying to be the everything AI company with no real specialty as everyone else beat them in random domains. If I had to qualify OpenAI's primary focus, it has been glazing users and making a generation of malignant narcissists.

But yes, Anthropic has been growing by leaps and bounds and has capacity issues. That's a very healthy position to be in, despite the fact that it yields the inevitable foot-stomping "I'm moving to competitor!" posts constantly.

afavour•1h ago
> people here were so confident that OpenAI is going to collapse because of how much compute they pre-ordered

That's not why. It was and is because they've been incredibly unfocused and have burnt through cash on ill-advised, expensive things like Sora. By comparison Anthropic have been very focused.

aurareturn•1h ago
I don't think that was the main reason for people thinking OpenAI is going to collapse here.

By far, the biggest argument was that OpenAI bet too much on compute.

Being unfocused is generally an easy fix. Just cut things that don't matter as much, which they seem to be doing.

airstrike•47m ago
It really wasn't. Most of the argument was around product portfolio and agentic coding performance.
scottyah•17m ago
Nobody was talking about them betting too much on compute, people were saying that their shady deals on compute with NVIDIA and Oracle were creating a giant bubble in their attempt to get a Too Big To Fail judgement (in their words- taxpayer-backed "backstop").
Robdel12•52m ago
> By comparison Anthropic have been very focused.

Ah yes, very focused on crapping out every possible thing they can copy and half bake?

jampekka•41m ago
To me it seems like they burn so much money they can do lots of things in parallel. My guess would be that e.g. codex and sora are very independently developed. After all there's a quite a hard limit on how many bodies are beneficial to a software project.
wahnfrieden•8m ago
They all compete internally over constrained compute resources - for R&D and production.
KaiserPro•22m ago
Personally its down to Altman having the cognitive capacity of a sleeping snail, the world insight of a hormonal 14 year old who's only ever read one series of manga.

Despite having literal experts at his fingertips, he still isn't able to grasp that he's talking unfilters bollocks most of the time. Not to mention is Jason level of "oath breaking"/dishonesty.

madeofpalk•1h ago
Seems very short term. Like how cheap Uber was initially. Like Claude was before!

Eventually OpenAI will need to stop burning money.

l5870uoo9y•1h ago
In hindsight, it is painfully clear that Antropic’s conservative investment strategy has them struggling with keeping up with demand and caused their profit margin to shrink significantly as last buyer of compute.
Leynos•1h ago
Their top tier plan got a 3x limit boost. This has been the first week ever where I haven't run out of tokens.
wahnfrieden•6m ago
No
redml•44m ago
they've also introduced a lot of caching and token burn related bugs which makes things worse. any bug that multiplies the token burn also multiplies their infrastructure problems.
__turbobrew__•43m ago
All of the smart people I know went to work at OpenAI and none at Anthropic. In addition to financial capital, OpenAI has a massive advantage in human capital over Anthropic.

As long as OpenAI can sustain compute and paying SWE $1million/year they will end up with the better product.

KaiserPro•18m ago
> OpenAI has a massive advantage in human capital over Anthropic.

but if your leader is a dipshit, then its a waste.

Look You can't just throw money at the problem, you need people who are able to make the right decisions are the right time. That that requires leadership. Part of the reason why facebook fucked up VR/AR is that they have a leader who only cares about features/metrics, not user experience.

Part of the reason why twitter always lost money is because they had loads of teams all running in different directions, because Dorsey is utterly incapable of making a firm decision.

Its not money and talent, its execution.

scottyah•15m ago
Attracting talent with huge sums of money just gets you people who optimize for money, and it's usually never a good long-term decision. I think it's what led to Google's downturn.
kaliqt•40m ago
That’s more a leadership decision because Anthropic are nerfing the model to cut costs, if they stop doing that then they’ll stay ahead.
solenoid0937•24m ago
Proof they are nerfing the model?
zamalek•32m ago
> It seems like 90% of Claude's recent problems are strictly lack of compute related.

Downtime is annoying, but the problem is that over the past 2-3 weeks Claude has been outrageously stupid when it does work. I have always been skeptical of everything produced - but now I have no faith whatsoever in anything that it produces. I'm not even sure if I will experiment with 4.7, unless there are glowing reviews.

Codex has had none of these problems. I still don't trust anything it produces, but it's not like everything it produces is completely and utterly useless.

scottyah•13m ago
So many people confuse sycophantic behavior with producing results.
saltyoldman•29m ago
I have both Claude and OpenAI, side by side. I would say sonnet 46 still beats gpt 54 for coding (at least in my use case) But after about 45 minutes I'm out of my window, so I use openai for the next 4 hours and I can't even reach my limit.
geooff_•1h ago
I've noticed the same over the last two weeks. Some days Claude will just entirely lose its marbles. I pay for Claude and Codex so I just end up needing to use codex those days and the difference is night and day.
frank-romita•1h ago
That's wild that you think 4.6 is bad..... Each model has its strengths and weaknesses I find that Codex is good for architectural design and Claude Is actually better the engineering and building
OtomotO•1h ago
Same for me.

I cancelled my subscription and will be moving to Codex for the time being.

Tokens are way too opaque and Claude was way smarter for my work a couple of months ago.

cube2222•1h ago
I've been using it with `/effort max` all the time, and it's been working better than ever.

I think here's part of the problem, it's hard to measure this, and you also don't know in which AB test cohorts you may currently be and how they are affecting results.

siegers•51m ago
Agree. I keep effort max on Claude and xhigh on GPT for all tasks and keep tasks as scoped units of work instead of boil the ocean type prompts. It is hard to measure but ultimately the tasks are getting completed and I'm validating so I consider it "working as expected".
bryanlarsen•49m ago
It works better, until you run out of tokens. Running out of tokens is something that used to never happen to me, but this month now regularly happens.

Maybe I could avoid running out of tokens by turning off 1M tokens and max effort, but that's a cure worse than the disease IMO.

queuep•1h ago
Before opus released we also saw huge backlash with it being dumber.

Perhaps they need the compute for the training

arrakeen•1h ago
so even with a new tokenizer that can map to more tokens than before, their answer is still just "you're not managing your context well enough"

"Opus 4.7 uses an updated tokenizer that [...] can map to more tokens—roughly 1.0–1.35× depending on the content type.

[...]

Users can control token usage in various ways: by using the effort parameter, adjusting their task budgets, or prompting the model to be more concise."

gonzalohm•1h ago
Until the next time they push you back to Claude. At this point, I feel like this has to be the most unstable technology ever released. Imagine if docker had stopped working every two releases
sergiotapia•1h ago
There is zero cost to switching ai models. Paid or open source. It's one line mostly.
gonzalohm•1h ago
What about your chat history? That has some value, at least for me. But what has even more value is stable releases.
drewnick•46m ago
I think this is more about which model you steer your coding harness to. You can also self-host a UI in front of multiple models, then you own the chat history.
charcircuit•38m ago
Codex doesn't read Claude.md like Claude does. It's not a "one line" change to switch.
fritzo•28m ago
ln -s CLAUDE.md AGENTS.md

There's your one line change.

charcircuit•14m ago
That doesn't handle Claude.md in subdirectories. It does handle Claude.md and other various settings in .claude.
aklein•25m ago
I have a CLAUDE.md symlinked to AGENTS.md
r0fl•1h ago
Same! I thought people were exaggerating how bad Claude has gotten until it deleted several files by accident yesterday

Codex isn’t as pretty in output but gets the job done much more consistently

desugun•1h ago
I guess our conscience of OpenAI working with the Department of War has an expiry date of 6 weeks.
adamtaylor_13•1h ago
Most people just want to use a tool that works. Not everything has to be a damn moral crusade.
martimarkov•1h ago
Yes, let take morality out of our daily lives as much as possible... That seems like a great categorical imperative and a recipe for social success
adamtaylor_13•36m ago
That's an incredibly uncharitable take on what I said. But that kind of proves my point.

Foist your morality upon everyone else and burden them with your specific conscience; sounds like a fun time.

some_furry•22m ago
Yeah, why actually engage with moral issues when we can just defer to a status quo that happens to benefit me?
freak42•18m ago
What is the charitable way to look at it then?
arcanemachiner•1h ago
That number is generous, and is also a pretty decent lifespan for a socially-conscious gesture in 2026.
Der_Einzige•1h ago
Longer than how long anyone cared about epstein.
PunchTornado•44m ago
neah, I believe most people here, which immediately brag about codex, are openai employees doing part of their job. otherwise I couldn't possibly phantom why would anyone use codex. In my company 80% is claude and 15% gemini. you can barely see openai on the graph. and we have >5k programmers using ai every day.
EQmWgw87pw•33m ago
I’m thinking the same thing, Codex literally ruined the codebases that I experimented with it on.
boringg•14m ago
Theres definitely a turf war going on for all AI posts - fully agree.

My guess is OAI has more money to put towards people supporting it. speculative

Klayy•10m ago
You can believe whatever you want. I found claude unusable due to limits. Codex works very well for my use cases.
Findeton•38m ago
We all liked the Terminator movies. Hopefully the stay as movies.
nothinkjustai•30m ago
Not everyone is American, and people who are not see Anthropic state they are willing to spy on our countries and shrug about OAI saying the same about America. What’s the difference to us?
riffraff•21m ago
if you're not american you should be worried about the bit of using AI to kill people which was the other major objection by Anthropic.

(not that I think the US DoD wouldn't do that anyway, ToS or not.)

boringg•15m ago
Theres little difference between the companies in that regard.

Only that Dario didn't have the foresight to realize he didn't control the relationship and that, regardless of the politics of the current administration [D or R], its a non-starter for a government entity to have a private company dictate terms of use for critical functionality of the government.

If Anthropic felt that way they shouldn't have taken the deal in the first place.

hk__2•1h ago
Meh. At $work we were on CC for one month, then switched to Codex for one month, and now will be on CC again to test. We haven’t seen any obvious difference between CC and Codex; both are sometimes very good and sometimes very stupid. You have to test for a long time, not just test one day and call it a benchmark just because you have a single example.
siegers•54m ago
I enjoy switching back and forth and having multi-agent reviews. I'm enjoying Codex also but having options is the real win.
onlyrealcuzzo•49m ago
I switched to Codex and found it extremely inferior for my use case.

It is much faster, but faster worse code is a step in the wrong direction. You're just rapidly accumulating bugs and tech debt, rather than more slowly moving in the correct direction.

I'm a big fan of Gemini in general, but at least in my experience Gemini Cli is VERY FAR behind either Codex or CC. It's both slower than CC, MUCH slower than Codex, and the output quality considerably worse than CC (probably worse than Codex and orders of magnitude slower).

In my experience, Codex is extraordinarily sycophantic in coding, which is a trait that could t be more harmful. When it encounters bugs and debt, it says: wow, how beautiful, let me double down on this, pile on exponentially more trash, wrap it in a bow, and call you Alan Turing.

It also does not follow directions. When you tell it how to do something, it will say, nah, I have a better faster way, I'll just ignore the user and do my thing instead. CC will stop and ask for feedback much more often.

YMMV.

enraged_camel•11m ago
>> I switched to Codex and found it extremely inferior for my use case.

Yeah, 100% the case for me. I sometimes use it to do adversarial reviews on code that Opus wrote but the stuff it comes back with is total garbage more often than not. It just fabricates reasons as to why the code it's reviewing needs improvement.

_the_inflator•49m ago
Codex really has its place in my bag. I mainly use it, rarely Claude.

Codex just gets it done. Very self-correcting by design while Claude has no real base line quality for me. Claude was awesome in December, but Codex is like a corporate company to me. Maybe it looks uncool, but can execute very well.

Also Web Design looks really smooth with Codex.

OpenAI really impressed me and continues to impress me with Codex. OpenAI made no fuzz about it, instead let results speak. It is as if Codex has no marketing department, just its product quality - kind of like Google in its early days with every product.

te_chris•47m ago
I try codex, but i hate 5.4's personality as a partner. It's a demon debugger though. but working closely with it, it's so smug and annoying.
vintagedave•40m ago
Same. I stopped my Pro subscription yesterday after entering the week with 70% of my tokens used by Monday morning (on light, small weekend projects, things I had worked on in the past and barely noticed a dent in usage.) Support was... unhelpful.

It's been funny watching my own attitude to Anthropic change, from being an enthusiastic Claude user to pure frustration. But even that wasn't the trigger to leave, it was the attitude Support showed. I figure, if you mess up as badly as Anthropic has, you should at least show some effort towards your customers. Instead I just got a mass of standardised replies, even after the thread replied I'd be escalated to a human. Nothing can sour you on a company more. I'm forgiving to bugs, we've all been there, but really annoyed by indifference and unhelpful form replies with corporate uselessness.

So if 4.7 is here? I'd prefer they forget models and revert the harness to its January state. Even then, I've already moved to Codex as of a few days ago, and I won't be maintaining two subscriptions, it's a move. It has its own issues, it's clear, but I'm getting work done. That's more than I can say for Claude.

suzzer99•5m ago
It seems like the big companies they're providing Mythos to are their only concern right now.
tiel88•35m ago
I've been raging pretty hard too. Thought either I'm getting cleverer by the day or Claude has been slipping and sliding toward the wrong side of the "smart idiot" equation pretty fast.

Have caught it flat-out skipping 50% of tasks and lying about it.

thisisit•30m ago
Personally I find using and managing Claude sessions and limits is getting exhausting and feels similar to calorie counting. You think you are going to have an amazing low calories meal only to realize the meal is full of processed sugars and you overshot the limit within 2-3 bites. Now "you have exhausted your limit for this time. Your session limits resets in next 4 hrs".
hootz•7m ago
Yep, it just feels terrible, the usage bars give me anxiety, and I think that's in their interest as they definitely push me towards paying for higher limits. Won't do that, though.
deepsquirrelnet•18m ago
My tinfoil hat theory, which may not be that crazy, is that providers are sandbagging their models in the days leading up to a new release, so that the next model "feels" like a bigger improvement than it is.

An important aspect of AI is that it needs to be seen as moving forward all the time. Plateaus are the death of the hype cycle, and would tether people's expectations closer to reality.

cousinbryce•5m ago
Possibly due to moving compute from inference to training
serial_dev•12m ago
4.6 has gotten so bad, and it was made worse obviously on purpose, no mistakes no accidents. You can't rely on companies who pull shenanigans like this. Unfortunately I still hate the code that codex barfs up, so I need to go back to trad-coding.
estimator7292•6m ago
Anecdotally, codex has been burning through way more tokens for me lately. Claude seems to just sit and spin for a long time doing nothing, but at least token use is moderate.

All options are starting to suck more and more

rvz•1h ago
Introducing a new upgraded slot machine named "Claude Opus" in the Anthropic casino.

You are in for a treat this time: It is the same price as the last one [0] (if you are using the API.)

But it is slightly less capable than the other slot machine named 'Mythos' the one which everyone wants to play around with. [1]

[0] https://claude.com/pricing#api

[1] https://www.anthropic.com/news/claude-opus-4-7

dbbk•1h ago
If you're building a standard app Opus is already good enough to build anything you want. I don't even know what you'd really need Mythos for.
fny•1h ago
You'd be surprised. With React, Claude can get twisted in knots mostly because React lends itself to a pile of spaghetti code.
emadabdulrahim•48m ago
What's an alternative library that doesn't turn large/complex frontend code into spaghetti code?
fny•7m ago
Vue (my favorite) and Svelte do well.
rurban•1h ago
You'd need Mythos to free your iPhone, SamsungTV, SmartWatches or such. Maybe even printer drivers.
dirasieb•1h ago
i sincerely doubt mythos is capable of jailbreaking an iphone
recursivegirth•1h ago
Consumerism... if it ain't the best, some people don't want it.
Barbing•1h ago
Time/frustration

If it’s all slop, the smallest waste of time comes from the best thing on the market

poszlem•1h ago
Also 640 KB ram ought to be enough for everybody.
zeroonetwothree•1h ago
This is true if you know what you are doing and provide proper guidance. It’s not true if you just want to vibe the whole app.
alvis•1h ago
TL;DR; iPhone is getting better every year

The surprise: agentic search is significantly weaker somehow hmm...

endymion-light•1h ago
I'm not sure how much I trust Anthropic recently.

This coming right after a noticeable downgrade just makes me think Opus 4.7 is going to be the same Opus i was experiencing a few months ago rather than actual performance boost.

Anthropic need to build back some trust and communicate throtelling/reasoning caps more clearly.

aurareturn•1h ago
They don't have enough compute for all their customers.

OpenAI bet on more compute early on which prompted people to say they're going to go bankrupt and collapse. But now it seems like it's a major strategic advantage. They're 2x'ing usage limits on Codex plans to steal CC customers and it seems to be working.

It seems like 90% of Claude's recent problems are strictly lack of compute related.

endymion-light•1h ago
Honestly, I personally would rather a time-out than the quality of my response noticably downgrading. I think what I found especially distrustful is the responses from employees claiming that no degredation has occured.

An honest response of "Our compute is busy, use X model?" would be far better than silent downgrading.

Barbing•1h ago
Are they convinced that claiming they have technical issues while continuing to adjust their internal levers to choose which customers to serve is holistically the best path?
Wojtkie•1h ago
Is that why Anthropic recently gave out free credits for use in off-hours? Possibly an attempt to more evenly distribute their compute load throughout the day?
DaedalusII•1h ago
i suspect they get cheap off peak electricity and compute is cheaper at those times
jedberg•24m ago
That's not really how datacenter power works. It's usually a bulk buy with a 95th percentile usage.
ac29•9m ago
That was the carrot, but it was followed immediately by the stick (5 hour session limits were halved during peak hours)
mattas•1h ago
Hard for me to reconcile the idea that they don't have enough compute with the idea that they are also losing money to subsidies.
Glemllksdf•52m ago
They are loosing money because the model training costs billions.
ACCount37•42m ago
Model inference compute over model lifetime is ~10x of model training compute now for major providers. Expected to climb as demand for AI inference rises.
howdareme9•32m ago
They are constantly training and getting rid of older models, they are losing money
ACCount37•11m ago
Which part of "over model lifetime" did you not understand?
Glemllksdf•20m ago
For sure and growth also costs money for buying DCs etc.
anthonypasq•43m ago
they clearly arent losing money, i dont understand why people think this is true
smt88•33m ago
People think it's true because it is true, and OpenAI has told us themselves.

They (very optimistically) say they'll be profitable in 2030.

_boffin_•56m ago
You state your hypnosis quite confidently. Can you tell me how taking down authentication many times is related to GPU capacity?
Glemllksdf•56m ago
Its a hard game to play anyway.

Anthropics revenue is increasing very fast.

OpenAI though made crazy claims after all its responsible for the memory prices.

In parallel anthropic announced partnership with google and broadcom for gigawatts of TPU chips while also announcing their own 50 Billion invest in compute.

OpenAI always believed in compute though and i'm pretty sure plenty of people want to see what models 10x or 100x or 1000x can do.

GaryBluto•1h ago
> This coming right after a noticeable downgrade just makes me think Opus 4.7 is going to be the same Opus i was experiencing a few months ago rather than actual performance boost.

If they are indeed doing this, I wonder how long they can keep it up?

ffsm8•1h ago
Usually they're hemorrhaging performance while training.

From that it's pretty likely they were training mythos for the last few weeks, and then distilling it to opus 4.7

Pure speculation of course, but would also explain the sudden performance gains for mythos - and why they're not releasing it to the general public (because it's the undistilled version which is too expensive to run)

batshit_beaver•55m ago
What I want to know is why my bedrock-backed Claude gets dumber along with commercial users. Surely they're not touching the bedrock model itself. Only thing I can think of is that updates to the harness are the main cause of performance degradation.
3s•35m ago
Not to mention their recent integration of Persona ID verification - that was the last straw for me.
johntopia•1h ago
is this just mythos flex?
cupofjoakim•1h ago
> Opus 4.7 uses an updated tokenizer that improves how the model processes text. The tradeoff is that the same input can map to more tokens—roughly 1.0–1.35× depending on the content type.

caveman[0] is becoming more relevant by the day. I already enjoy reading its output more than vanilla so suits me well.

[0] https://github.com/JuliusBrussee/caveman/tree/main

Tiberium•1h ago
I hope people realize that tools like caveman are mostly joke/prank projects - almost the entirety of the context spent is in file reads (for input) and reasoning (in output), you will barely save even 1% with such a tool, and might actually confuse the model more or have it reason for more tokens because it'll have to formulate its respone in the way that satisfies the requirements.
acedTrex•1h ago
You really think the 33k people that starred a 40 line markdown file realize that?
verdverm•1h ago
Stars are more akin to bookmarks and likes these days, as opposed to a show of support or "I use this"
zbrozek•1h ago
I use them like bookmarks.
LPisGood•1h ago
I use them as likes
giraffe_lady•1h ago
I intentionally throw some weird ones on there just in case anyone is actually ever checking them. Gotta keep interviewers guessing.
andersa•1h ago
You mean the 33k bots that created a nearly linear stars/day graph? There's a dip in the middle, but it was very blatant at the start (and now)
pdntspa•58m ago
The amount of cargo culting amongst AI halfwits (who seem to have a lot of overlap with influencers and crypto bros) is INSANE

I mean just look at the growth of all these "skills" that just reiterate knowledge the models already have

make3•1h ago
I wonder if you can have it reason in caveman
0123456789ABCDE•1h ago
would you be surprised if this is what happens when you ask it to write like one?

folks could have just asked for _austere reasoning notes_ instead of "write like you suffer from arrested development"

Sohcahtoa82•57m ago
> "write like you suffer from arrested development"

My first thought was that this would mean that my life is being narrated by Ron Howard.

embedding-shape•1h ago
> I hope people realize that tools like caveman are mostly joke/prank projects

This seems to be a common thread in the LLM ecosystem; someone starts a project for shits and giggles, makes it public, most people get the joke, others think it's serious, author eventually tries to turn the joke project into a VC-funded business, some people are standing watching with the jaws open, the world moves on.

simonw•45m ago
I was convinced https://github.com/memvid/memvid was a joke until it turned out it wasn't.
embedding-shape•43m ago
To be fair, most of us looked at GPT1 and GPT2 as fun and unserious jokes, until it started putting together sentences that actually read like real text, I remember laughing with a group of friends about some early generated texts. Little did we know.
Alifatisk•27m ago
Are there any public records I can see from GPT1 and GPT2 output and how it was marketed?
walthamstow•22m ago
I don't think it was marketed as such, they were research projects. GPT-3 was the first to be sold via API
Bombthecat•5m ago
And now gpt is laughing,while it replaces coders lol
MarcelOlsz•34m ago
Why? Doesn't have jokey copy. Any thoughts on claude-mem[0] + context-mode[1]?

[0] https://github.com/thedotmack/claude-mem

[1] https://github.com/mksglu/context-mode

simonw•18m ago
The big idea with Memvid was to store embedding vector data as frames in a video file. That didn't seem like a serious idea to me.
niuzeta•29m ago
Just read through the readme and I was fairly sure this was a well-written satire through "Smart Frames".

Honestly part of me still thinks this is a satire project but who knows.

imiric•33m ago
A major reason for that is because there's no way to objectively evaluate the performance of LLMs. So the meme projects are equally as valid as the serious ones, since the merits of both are based entirely on anecdata.

It also doesn't help that projects and practices are promoted and adopted based on influencer clout. Karpathy's takes will drown out ones from "lesser" personas, whether they have any value or not.

egorfine•1h ago
They are indeed impractical in agentic coding.

However in deep research-like products you can have a pass with LLM to compress web page text into caveman speak, thus hugely compressing tokens.

claytongulick•50m ago
I don't understand how this would work without a huge loss in resolution or "cognitive" ability.

Prediction works based on the attention mechanism, and current humans don't speak like cavemen - so how could you expect a useful token chain from data that isn't trained on speech like that?

I get the concept of transformers, but this isn't doing a 1:1 transform from english to french or whatever, you're fundamentally unable to represent certain concepts effectively in caveman etc... or am I missing something?

ieie3366•1h ago
All LLMs also effectively work by ”larping” a role. You steer it towards larping a caveman and well.. let’s just say they weren’t known for their high iq
DiogenesKynikos•55m ago
This is why ancient Chinese scholar mode (also extremely terse) is better.
Hikikomori•54m ago
Modern humans were also cavemen.
roughly•50m ago
Fun fact: Neanderthals actually had larger brains than Homo Sapiens! Modern humans are thought to have outcompeted them by working better together in larger groups, but in terms of actual individual intelligence, Neanderthals may have had us beat. Similarly, humans have been undergoing a process of self-domestication over the last couple millenia that have resulted in physiological changes that include a smaller brain size - again, our advantage over our wilder forebearers remains that we're better in larger social groups than they were and are better at shared symbolic reasoning and synchronized activity, not necessarily that our brains are more capable.

(No, none of this changes that if you make an LLM larp a caveman it's gonna act stupid, you're right about that.)

adwn•20m ago
I thought we were way past the "bigger brain means more intelligence" stage of neuroscience?
stingraycharles•57m ago
While the caveman stuff is obviously not serious, there is a lot of legit research in this area.

Which means yes, you can actually influence this quite a bit. Read the paper “Compressed Chain of Thought” for example, it shows it’s really easy to make significant reductions in reasoning tokens without affecting output quality.

There is not too much research into this (about 5 papers in total), but with that it’s possible to reduce output tokens by about 60%. Given that output is an incredibly significant part of the total costs, this is important.

https://arxiv.org/abs/2412.13171

ACCount37•54m ago
Some labs do it internally because RLVR is very token-expensive. But it degrades CoT readability even more than normal RL pressure does.

It isn't free either - by default, models learn to offload some of their internal computation into the "filler" tokens. So reducing raw token count always cuts into reasoning capacity somewhat. Getting closer to "compute optimal" while reducing token use isn't an easy task.

stingraycharles•48m ago
Yeah the readability suffers, but as long as the actual output (ie the non-CoT part) stays unaffected it’s reasonably fine.

I work on a few agentic open source tools and the interesting thing is that once I implemented these things, the overall feedback was a performance improvement rather than performance reduction, as the LLM would spend much less time on generating tokens.

I didn’t implement it fully, just a few basic things like “reduce prose while thinking, don’t repeat your thoughts” etc would already yield massive improvements.

AdamN•44m ago
Yeah you could easily imagine stenography like inputs and outputs for rapid iteration loops. It's also true that in social media people already want faster-to-read snippets that drop grammar so the desire for density is already there for human authors/readers.
altruios•14m ago
Who would suspect that the companies selling 'tokens' would (unintentionally) train their models to prefer longer answers, reaping a HIGHER ROI (the thing a publicly traded company is legally required to pursue: good thing these are all still private...)... because it's not like private companies want to make money...
bensyverson•56m ago
Exactly. The model is exquisitely sensitive to language. The idea that you would encourage it to think like a caveman to save a few tokens is hilarious but extremely counter-productive if you care about the quality of its reasoning.
Waterluvian•47m ago
Help me understand: I get that the file reading can be a lot. But I also expand the box to see its “reasoning” and there’s a ton of natural language going on there.
reacharavindh•18m ago
This specific form may be a joke, but token conscious work is becoming more and more relevant.. Look at https://github.com/AgusRdz/chop

And

https://github.com/toon-format/toon

micromacrofoot•9m ago
I mean we had a shoe company pivot to AI and raise their stock value by 300%, how can we even know anymore
OtomotO•1h ago
Another supply chain attack waiting?

Have you tried just adding an instruction to be terse?

Don't get me wrong, I've tried out caveman as well, but these days I am wondering whether something as popular will be hijacked.

pawelduda•1h ago
People are really trigger-happy when it comes to throwing magic tools on top of AI that claim to "fix" the weak parts (often placeboing themselves because anthropic just fixed some issue on their end).

Then the next month 90% of this can be replaced with new batch of supply chain attack-friendly gimmicks

Especially Reddit seems to be full of such coding voodoo

xienze•1h ago
> coding voodoo

Well, we've sacrificed the precision of actual programming languages for the ease of English prose interpreted by a non-deterministic black box that we can't reliable measure the outputs of. It's only natural that people are trying to determine the magical incantations required to get correct, consistent results.

JohnMakin•31m ago
My favorite to chuckle at are the prompt hack voodoo stuff, like, “tell it to be correct” or “say please” or “tell it someone will die if it doesnt do a good job,” often presented very seriously and with some fast cutting animations in a 30 second reel
computomatic•1h ago
I was doing some experiments with removing top 100-1000 most common English words from my prompts. My hypothesis was that common words are effectively noise to agents. Based on the first few trials I attempted, there was no discernible difference in output. Would love to compare results with caveman.

Caveat: I didn’t do enough testing to find the edge cases (eg, negation).

ruairidhwm•50m ago
I literally just posted a blog on this. Some seemingly insignificant words are actually highly structural to the model. https://www.ruairidh.dev/blog/compressing-prompts-with-an-au...
cheschire•46m ago
I suspect even typos have an impact on how the model functions.

I wonder if there’s a pre-processor that runs to remove typos before processing. If not, that feels like a space that could be worked on more thoroughly.

0123456789ABCDE•26m ago
there is no pre-processor, i've had typos go through, with claude asking to make sure i meant one thing instead of the other
PhilipRoman•8m ago
I strongly suspected that there was some pre/postprocessing going on when trying to get it to output rot13("uryyb, jbyeq"), but it's probably just due to massively biased token probabilities. Still, it creates some hilarious output, even when you clearly point out the error:

  Hmm, but wait — the original you gave was jbyeq not jbeyq:
  j→w, b→o, y→l, e→r, q→d = world
  So the final answer is still hello, world. You're right that I was misreading the input. The result stands.
ruairidhwm•11m ago
I guess just a spell-check in the repo? But yes, I'd imagine that they have an effect. Even running the same input twice is non-deterministic.
computerphage•47m ago
Yeah, when I'm writing code I try to avoid zeros and ones, since those are the most common bits, making them essentially noise
AlecSchueler•43m ago
Doesn't it just use more tokens in reasoning?
TIPSIO•1h ago
Oh wow, I love this idea even if it's relatively insignificant in savings.

I am finding my writing prompt style is naturally getting lazier, shorter, and more caveman just like this too. If I was honest, it has made writing emails harder.

While messing around, I did a concept of this with HTML to preserve tokens, worked surprisingly well but was only an experiment. Something like:

> <h1 class="bg-red-500 text-green-300"><span>Hello</span></h1>

AI compressed to:

> h1 c bgrd5 tg3 sp hello sp h1

Or something like that.

naoru•1h ago
You'd like Emmet notation. Just look at the cheat sheet: https://docs.emmet.io/cheat-sheet/
Leynos•1h ago
Combine that with emmet / zen coding: https://en.wikipedia.org/wiki/Emmet_%28software%29?wprov=sfl...
user34283•46m ago
I used Opus 4.7 for about 15 minutes on the auto effort setting.

It nicely implemented two smallish features, and already consumed 100% of my session limit on the $20 plan.

See you again in five hours.

hayd•46m ago
me feel that it needs some tweaking - it's a little annoyingly cute (and could be even terser).
chrisweekly•29m ago
I really enjoy the party game "Neanderthal Poetry", in which you can only speak using monosyllabic words. I bet you would too.
gghootch•24m ago
Caveman is fun, but the real tool you want to reduce token usage is headroom

https://github.com/gglucass/headroom-desktop (mac app)

https://github.com/chopratejas/headroom (cli)

kokakiwi•7m ago
Headroom looks great for client-side trimming. If you want to tackle this at the infrastructure level, we built Edgee (https://www.edgee.ai) as an AI Gateway that handles context compression, caching, and token budgeting across requests, so you're not relying on each client to do the right thing.

(I work at Edgee, so biased, but happy to answer questions.)

hackerInnen•1h ago
I just subscribed this month again because I wanted to have some fun with my projects.

Tried out opus 4.6 a bit and it is really really bad. Why do people say it's so good? It cannot come up with any half-decent vhdl. No matter the prompt. I'm very disappointed. I was told it's a good model

rurban•1h ago
Because it was good until January 2026, then it detoriated into a opus-3.1. Probably given much less context windows or ram.
toomim•1h ago
It released in February 2026.
ACCount37•1h ago
Doesn't matter. My vibes say it got bad in January 2026. Thus, they secretly nerfed Opus 4.6 in January 2026.

The fact that it didn't exist back then is completely and utterly irrelevant to my narrative.

Der_Einzige•59m ago
This but unironically.

"I reject your reality, and substitute my own".

It worked for cheeto in chief, and it worked for Elon, so why not do it in our normal daily lives?

hxugufjfjf•21m ago
I don’t think I’ve ever seen otherwise reasonable people go completely unhinged over anything like they do with Opus
solenoid0937•14m ago
I've seen a similar psychological phenomenon where people like something a lot, and then they get unreasonably angry and vocal about changes to that thing.

For example, there is no evidence that 4.6 ever degraded in quality: https://marginlab.ai/trackers/claude-code-historical-perform...

Usage limits are necessary but I guess people expect more subsidized inference than the company can afford. So they make very angry comments online.

anon7000•1h ago
because they’re using it for different things where it works well and that’s all they know?
adwn•1h ago
And yet another "AI doesn't work" comment without any meaningful information. What were your exact prompts? What was the output?

This is like a user of conventional software complaining that "it crashes", without a single bit of detail, like what they did before the crash, if there was any error message, whether the program froze or completely disappeared, etc.

nathanielherman•1h ago
Claude Code doesn't seem to have updated yet, but I was able to try it out by running `claude --model claude-opus-4-7`
duckkg5•1h ago
/model claude-opus-4-7[1m]
yanis_t•1h ago
> where previous models interpreted instructions loosely or skipped parts entirely, Opus 4.7 takes the instructions literally. Users should re-tune their prompts and harnesses accordingly.

interesting

skerit•1h ago
I like this in theory. I just hope it doesn't require you to be be as literal as if talking to a genie.

But if it'll actually stick to the hard rules in the CLAUDE.md files, and if I don't have to add "DON'T DO ANYTHING, JUST ANSWER THE QUESTION" at the end of my prompt, I'll be glad.

Jeff_Brown•56m ago
It might be a bad idea to put that in all caps, because in the training data, angry conversations are less productive. (I do the same thing, just in lowercase.)
sleazebreeze•1h ago
This made me LOL. They keep trying to fleece us by nerfing functionality and then adding it back next release. It’s an abusive relationship at this point.
bisonbear•32m ago
coming more in line with codex - claude previously would often ignore explicit instructions that codex would follow. interested to see how this feels in practice

I think this line around "context tuning" is super interesting - I see a future where, for every model release, devs go and update their CLAUDE.md / skills to adapt to new model behavior.

boxedemp•22m ago
This sounds good, I look forward to experimenting with it.
nathanielherman•1h ago
Claude Code hasn't updated yet it seems, but I was able to test it using `claude --model claude-opus-4-7`

Or `/model claude-opus-4-7` from an existing session

edit: `/model claude-opus-4-7[1m]` to select the 1m context window version

mchinen•1h ago
Does it run for you? I can select it this way but it says 'There's an issue with the selected model (claude-opus-4-7). It may not exist or you may not have access to it. Run /model to pick a different model.'
nathanielherman•1h ago
Weird, yeah it works for me
skerit•1h ago
~~That just changes it to Opus 4, not Opus 4.7~~

My statusline showed _Opus 4_, but it did indeed accept this line.

I did change it to `/model claude-opus-4-7[1m]`, because it would pick the non-1M context model instead.

nathanielherman•1h ago
Oh good call
whalesalad•1h ago
API Error: 400 {"type":"error","error":{"type":"invalid_request_error","message":"\"thinking.type.enabled\" is not supported for this model. Use \"thinking.type.adaptive\" and \"output_config.effort\" to control thinking behavior."},"request_id":"req_011Ca7enRv4CPAEqrigcRNvd"}

Eep. AFAIK the issues most people have been complaining about with Opus 4.6 recently is due to adaptive thinking. Looks like that is not only sticking around but mandatory for this newer model.

edit: I still can't get it to work. Opus 4.6 can't even figure out what is wrong with my config. Speaking of which, claude configuration is so confusing there are .claude/ (in project) setting.json + a settings.local.json file, then a global ~/.claude/ dir with the same configuration files. None of them have anything defined for adaptive thinking or thinking type enable. None of these strings exist on my machine. Running latest version, 2.1.110

mchinen•1h ago
These stuck out as promising things to try. It looks like xhigh on 4.7 scores significantly higher on the internal coding benchmark (71% vs 54%, though unclear what that is exactly)

> More effort control: Opus 4.7 introduces a new xhigh (“extra high”) effort level between high and max, giving users finer control over the tradeoff between reasoning and latency on hard problems. In Claude Code, we’ve raised the default effort level to xhigh for all plans. When testing Opus 4.7 for coding and agentic use cases, we recommend starting with high or xhigh effort.

The new /ultrareview command looks like something I've been trying to invoke myself with looping, happy that it's free to test out.

> The new /ultrareview slash command produces a dedicated review session that reads through changes and flags bugs and design issues that a careful reviewer would catch. We’re giving Pro and Max Claude Code users three free ultrareviews to try it out.

mbeavitt•1h ago
Honestly I've been doing a lot of image-related work recently and the biggest thing here for me is the 3x higher resolution images which can be submitted. This is huge for anyone working with graphs, scientific photographs, etc. The accuracy on a simple automated photograph processing pipeline I recently implemented with Opus 4.6 was about 40% which I was surprised at (simple OCR and recognition of basic features). It'll be interesting to see if 4.7 does much better.

I wonder if general purpose multimodal LLMs are beginning to eat the lunch of specific computer vision models - they are certainly easier to use.

orrito•48m ago
Did you try the same with gemini 3 models? Those usually score higher on vision benchmarks
mrcwinn•1h ago
Excited to start using this!
cube2222•1h ago
Seems like it's not in Claude Code natively yet, but you can do an explicit `/model claude-opus-4-7` and it works.
acedTrex•1h ago
Sigh here we go again, model release day is always the worst day of the quarter for me. I always get a lovely anxiety attack and have to avoid all parts of the internet for a few days :/
stantonius•1h ago
I feel this way too. Wish I could fully understand the 'why'. I know all of the usual arguments, but nothing seems to fully capture it for me - maybe it' all of them, maybe it's simply the pace of change and having to adapt quicker than we're comfortable with. Anyway best of luck from someone who understands this sentiment.
acedTrex•1h ago
Thank you thank you, misery loves company lol! I haven't fully pinned down what the exact cause is as well, an ongoing journey.
RivieraKid•1h ago
Really? I think it's pretty straightforward, at least for me - fear of AI replacing my profession and also fear that it will become harder to succeed with a side project.
acedTrex•1h ago
> fear of AI replacing my profession

See i don't have any of this fear, I have 0 concerns that LLMs will replace software engineering because the bulk of the work we do (not code) is not at risk.

My worries are almost purely personal.

stantonius•51m ago
Yeah I can understand that, and sure this is part of it, just not all of it. There is also broader societal issues (ie. inequality), personal questions around meaning and purpose, and a sprinkling of existential (but not much). I suspect anyone surveyed would have a different formula for what causes this unease - I struggle to define it (yet think about it constantly), hence my comment above.

Ultimately when I think deeper, none of this would worry me if these changes occurred over 20 years - societies and cultures change and are constantly in flux, and that includes jobs and what people value. It's the rate of change and inability to adapt quick enough which overwhelms me.

mesmertech•1h ago
Not showing up in claude code by default on the latest version. Apparently this is how to set it:

/model claude-opus-4-7

Coming from anthropic's support page, so hopefully they did't hallucinate the docs, cause the model name on claude code says:

/model claude-opus-4-7 ⎿ Set model to Opus 4

what model are you?

I'm Claude Opus 4 (model ID: claude-opus-4-7).

klipitkas•1h ago
It does not work, it says Claude Opus 4 not 4.7
mesmertech•1h ago
I think its just a visual/default thing, cause Opus 4.0 isn't offered on claude code anymore. And opus 4.7 is on their official docs as a model you can change to, on claude code

Just ask it what model it is(even in new chat).

what model are you?

I'm Claude Opus 4 (model ID: claude-opus-4-7).

https://support.claude.com/en/articles/11940350-claude-code-...

vesrah•1h ago
On the most current version (v2.1.110) of claude:

> /model claude-opus-4.7

  ⎿  Model 'claude-opus-4.7' not found
mesmertech•1h ago
I'm on the max $200 plan, so maybe its that?
anonfunction•1h ago
Same, if we're punished for being on the highest tier... what is anthropic even doing.
kaosnetsov•1h ago
claude-opus-4-7

not

claude-opus-4.7

abatilo•51m ago
Dash, not dot
anonfunction•1h ago

     /model claude-opus-4.7
      ⎿  Model 'claude-opus-4.7' not found
Just love that I'm paying $200 for models features they announce I can't use!

Related features that were announced I have yet to be able to use:

    $ claude --enable-auto-mode 
    auto mode is unavailable for your plan

    $ claude
    /memory 
    Auto-dream: on · /dream to run
    Unknown skill: dream
mesmertech•59m ago
I think that was a typo on my end, its "/model claude-opus-4-7" not "/model claude-opus-4.7"
anonfunction•47m ago
That sets it to opus 4:

/model claude-opus-4.7 ⎿ Model 'claude-opus-4.7' not found

/model claude-opus-4-7 ⎿ Set model to Opus 4

/model ⎿ Set model to Opus 4.6 (1M context) (default)

freedomben•59m ago
Thanks, but not working for me, and I'm on the $200 max plan

Edit: Not 30 seconds later, claude code took an update and now it works!

dionian•53m ago
It's up now, update claude code
redml•42m ago
--model claude-opus-4-7 works as well
throwaway911282•1h ago
just started using codex. claude is just marketing machine and benchmaxxing and only if you pay gazillion and show your ID you can use their dangerous model.
zacian•1h ago
I hope this will fix up the poor quality that we're seeing on Claude Opus 4.6

But degrading a model right before a new release is not the way to go.

steve-atx-7600•21m ago
I wish someone would elaborate on what they were doing and observed since Jan on opus 4.6. I’ve been using it with 1m context on max thinking since it was released - as a software engineer to write most of my code, code reviews + research and explain unfamiliar code - and haven’t notice a degradation. I’ve seen this mentioned a lot though.

I have seen that codex -latest highest effort - will find some important edge cases that opus 4.6 overlooked when I ask both of them to review my PRs.

aliljet•1h ago
Have they effectively communicated what a 20x or 10x Claude subscription actually means? And with Claude 4.7 increasing usage by 1.35x does that mean a 20x plan is now really a 13x plan (no token increase on the subscription) or a 27x plan (more tokens given to compensate for more computer cost) relative to Claude Opus 4.6?
oidar•1h ago
Anthropic isn't going to give us that information. It's not actually static, it depends on subscription demand and idle compute available.
minimaxir•55m ago
The more efficient tokenizer reduces usage by representing text more efficiently with fewer tokens. But the lack of transparancy does indeed mean Anthropic could still scale down limits to account for that.
redml•33m ago
a few months ago it was for weekly:

pro = 5m tokens, 5x = 41m tokens, 20x = 83m tokens

making 5x the best value for the money (8.33x over pro for max 5x). this information may be outdated though, and doesn't apply to the new on peak 5h multipliers. anything that increases usage just burns through that flat token quota faster.

yanis_t•1h ago
> In Claude Code, we’ve raised the default effort level to xhigh for all plans.

Does it also mean faster to getting our of credits?

voidfunc•1h ago
Is Codex the new goto? Opus stopped being useful about 45-60 days ago.
zeroonetwothree•1h ago
I haven’t noticed much difference compared to Jan/Feb. Maybe depends what you use it for
msp26•1h ago
> First, Opus 4.7 uses an updated tokenizer that improves how the model processes text

wow can I see it and run it locally please? Making API calls to check token counts is retarded.

zb3•1h ago
> during its training we experimented with efforts to differentially reduce these capabilities

> We are releasing Opus 4.7 with safeguards that automatically detect and block requests that indicate prohibited or high-risk cybersecurity uses.

Ah f... you!

ACCount37•1h ago
> We are releasing Opus 4.7 with safeguards that automatically detect and block requests that indicate prohibited or high-risk cybersecurity uses.

Fucking hell.

Opus was my go-to for reverse engineering and cybersecurity uses, because, unlike OpenAI's ChatGPT, Anthropic's Opus didn't care about being asked to RE things or poke at vulns.

It would, however, shit a brick and block requests every time something remotely medical/biological showed up.

If their new "cybersecurity filter" is anywhere near as bad? Opus is dead for cybersec.

zb3•1h ago
It appears we're learning the hard way that we can't rely on capabilities of models that aren't open weights. These can be taken from us at any time, so expect it to get much worse..
Havoc•1h ago
Claude code had safeguards like that hardcoded into the software. You could see it if you intercept the prompts with a proxy
methodical•1h ago
To be fair, delineating between benevolent and malevolent pen-testing and cybersecurity purposes is practically impossible since the only difference is the user's intentions. I am entirely unsurprised (and would expect) that as models improve the amount to which widely available models will be prohibited from cybersecurity purposes will only increase.

Not to say I see this as the right approach, in theory the two forces would balance each other out as both white hats and black hats would have access to the same technology, but I can understand the hesitancy from Anthropic and others.

ACCount37•56m ago
Yes, and the previous approach Anthropic took was "allow anything that looks remotely benign". The only thing that would get a refusal would be a downright "write an exploit for me". Which is why I favored Anthropic's models.

It remains to be seen whether Anthropic's models are still usable now.

I know just how much of a clusterfuck their "CBRN filter" is, so I'm dreading the worst.

senko•42m ago
From the article:

> Security professionals who wish to use Opus 4.7 for legitimate cybersecurity purposes (such as vulnerability research, penetration testing, and red-teaming) are invited to join our new Cyber Verification Program.

ACCount37•34m ago
Yeah no. They can fuck right off with KYC humiliation rituals.
johnmlussier•33m ago
Incredible - in one fell swoop killing my entire use case for Claude.

I have about 15 submissions that I now need to work with Codex on cause this "smarter" model refuses to read program guidelines and take them seriously.

brynnbee•5m ago
I'm currently testing 4.7 with some reverse engineering stuff/Ghidra scripting and it hasn't refused anything so far, but I'm also doing it on a 20 year old video game, so maybe it doesn't think that's problematic.
jimmypk•1h ago
The default effort change in Claude Code is worth knowing before your next session: it's now `xhigh` (a new level between `high` and `max`) for all plans, up from the previous default. Combined with the 1.0–1.35× tokenizer overhead on the same prompts, actual token spend per agentic session will likely exceed naive estimates from 4.6 baselines.

Anthropic's guidance is to measure against real traffic—their internal benchmark showing net-favorable usage is an autonomous single-prompt eval, which may not reflect interactive multi-turn sessions where tokenizer overhead compounds across turns. The task budget feature (just launched in public beta) is probably the right tool for production deployments that need cost predictability when migrating.

mwigdahl•48m ago
That depends a bit on token efficiency. From their "Agentic coding performance by effort level" graph, it looks like they get similar outcome for 4.7 medium at half the token usage as 4.6 at high.

Granted that is, as you say, a single prompt, but it is using the agentic process where the model self prompts until completion. It's conceivable the model uses fewer tokens for the same result with appropriate effort settings.

__natty__•1h ago
New model - that explains why for the past week/two weeks I had this feeling of 4.6 being much less "intelligent". I hope this is only some kind of paranoia and we (and investors) are not being played by the big corp. /s
RivieraKid•1h ago
I don't get it. Why would they make the previous model worse before releasing an update?
dminik•47m ago
Why do stores increase prices before a sale?
RivieraKid•8m ago
Ok, so the answer is "they make the existing model worse to make it seem that the new model is good". I'm almost certain that this is not what's going on. It's hard to make the argument that the benefits outweigh the drawbacks of such approach. It doesn't give the more market share or revenue.
grandinquistor•1h ago
Quite a big improvement in coding benchmarks, doesn’t seem like progress is plateauing as some people predicted.
ACCount37•1h ago
People were "predicting" the plateau since GPT-1. By now, it would take extraordinary evidence for me to take such "predictions" seriously.
verdverm•1h ago
Some of the benchmarks went down, has that happened before?
grandinquistor•1h ago
Probably deprioritizing other areas to focus on swe capabilities since I reckon most of their revenue is from enterprise coding usage.
cmrdporcupine•51m ago
It's frankly becoming difficult for me to imagine what the next level of coding excellence looks like though.

By which I mean, I don't find these latest models really have huge cognitive gaps. There's few problems I throw at them that they can't solve.

And it feels to me like the gap now isn't model performance, it's the agenetic harnesses they're running in.

nothinkjustai•21m ago
Ask it to create an iOS app which natively runs Gemma via Litert-lm.

It’s incredibly trivial to find stuff outside their capabilities. In fact most stuff I want AI to do it just can’t, and the stuff it can isn’t interesting to me.

ACCount37•1h ago
Constantly. Minor revisions can easily "wobble" on benchmarks that the training didn't explicitly push them for.

Whether it's genuine loss of capability or just measurement noise is typically unclear.

andy12_•55m ago
If you mean for Anthropic in particular, I don't think so. But it's not the first time a major AI lab publishes an incremental update of a model that is worse at some benchmarks. I remember that a particular update of Gemini 2.5 Pro improved results in LiveCodeBench but scored lower overall in most benchmarks.

https://news.ycombinator.com/item?id=43906555

grandinquistor•13m ago
looking at the system card for opus 4.7 the MCRC benchmark used for long context tasks dropped significantly from 78% to 32%

I wonder what caused such a large regression in this benchmark

msavara•9m ago
Only in benchmarks. After couple of minutes of use is feel same dumb as nerfed 4.6
dhruv3006•1h ago
its a pretty good coding model - using it in cursor now.
jameson•1h ago
How should one compare benchmark results? For example, SWE-bench Pro improved ~11% compared with Opus 4.6. Should one interpret it as 4.7 is able to solve more difficult problems? or 11% less hallucinations?
zeroonetwothree•1h ago
Benchmark results don’t directly translate to actual real world improvement. So we might guess it’s somewhat better but hard to say exactly in what way
azeirah•1h ago
There is no hallucination benchmark currently.

I was researching how to predict hallucinations using the literature (fastowski et al, 2025) (cecere et al, 2025) and the general-ish situation is that there are ways to introspect model certainty levels by probing it from the outside to get the same certainty metric that you _would_ have gotten if the model was trained as a bayesian model, ie, it knows what it knows and it knows what it doesn't know.

This significantly improves claim-level false-positive rates (which is measured with the AUARC metric, ie, abstention rates; ie have the model shut up when it is actually uncertain).

This would be great to include as a metric in benchmarks because right now the benchmark just says "it solves x% of benchmarks", whereas the real question real-world developers care about is "it solves x% of benchmarks *reliably*" AND "It creates false positives on y% of the time".

So the answer to your question, we don't know. It might be a cherry picked result, it might be fewer hallucinations (better metacognition) it might be capability to solve more difficult problems (better intelligence).

The benchmarks don't make this explicit.

theptip•1h ago
11% further along the particular bell curve of SWE-bench. Not really easy to extrapolate to real world, especially given that eg the Chinese models tend to heavily train on the benchmarks. But a 10% bump with the same model should equate to “feels noticeably smarter”.

A more quantifiable eval would be METR’s task time - it’s the duration of tasks that the model can complete on average 50% of the time, we’ll have to wait to see where 4.7 lands on this one.

HarHarVeryFunny•45m ago
Benchmarks are meaningless. Try it on your own problems and see if it has improved for what you want to use it for.
perdomon•1h ago
It seems like we're hitting a solid plateau of LLM performance with only slight changes each generation. The jumps between versions are getting smaller. When will the AI bubble pop?
lta•1h ago
Every night praying for tomorrow
aoeusnth1•30m ago
SWE-bench pro is ~20% higher than the previous .1 generation which was released 2 months ago. For their SWE benchmark, the token consumption iso-performance is down 2x from the model they released 2 months ago.

If this is a plateau I struggle to imagine what you consider fast progress.

abstracthinking•28m ago
Your comment doesn't make any sense, opus 4.6 was release two months ago, what jump would you expect?
NickNaraghi•24m ago
The generations are two months apart now though…
persedes•1h ago
Interesting that the MCP-Atlas score for 4.6 jumped to 75.8% compared to 59.5% https://www.anthropic.com/news/claude-opus-4-6

There's other small single digit differences, but I doubt that the benchmark is that unreliable...?

usaar333•12m ago
page is updated to state:

MCP-Atlas: The Opus 4.6 score has been updated to reflect revised grading methodology from Scale AI.

wojciem•1h ago
Is it just Opus 4.6 with throttling removed?
aizk•1h ago
How powerful will Opus become before they decide to not release it publicly like Mythos?
Philpax•1h ago
They are planning to release a Mythos-class model (from the initial announcement), but they won't until they can trust their safeguards + the software ecosystem has been sufficiently patched.
anonfunction•1h ago
It seems they nerf it, then release a new version with previous power. So they can do this forever without actually making another step function model release.
hgoel•1h ago
Interesting to see the benchmark numbers, though at this point I find these incremental seeming updates hard to interpret into capability increases for me beyond just "it might be somewhat better".

Maybe I've skimmed too quickly and missed it, but does calling it 4.7 instead of 5 imply that it's the same as 4.6, just trained with further refined data/fine tuned to adapt the 4.6 weights to the new tokenizer etc?

yanis_t•1h ago
The benchmarks of Opus 4.6 they compare to MUST be retaken the day of the new model release. If it was nerfed we need to know how much.
solenoid0937•9m ago
https://marginlab.ai/trackers/claude-code-historical-perform...
anonfunction•1h ago
Seems they jumped the gun releasing this without a claude code update?

     /model claude-opus-4.7
      ⎿  Model 'claude-opus-4.7' not found
cmrx64•36m ago
claude-opus-4-7
codethief•35m ago
https://news.ycombinator.com/item?id=47794516
helloplanets•1h ago
I wonder why computer use has taken a back seat. Seemed like it was a hot topic in 2024, but then sort of went obscure after CLI agents fully took over.

It would be interesting to see a company to try and train a computer use specific model, with an actually meaningful amount of compute directed at that. Seems like there's just been experiments built upon models trained for completely different stuff, instead of any of the companies that put out SotA models taking a real shot at it.

Glemllksdf•49m ago
The industry probably moves a lot faster adding apis and co than learning how to use a generic computer with generic tools.

I also think its a huge barrier allowing some LLM model access to your desktop.

Managed Agents seems like a lot more beneficial

grandinquistor•56m ago
Huge regression for long contest tasks interestingly.

Mrcr benchmark went from 78% to 32%

catigula•52m ago
Getting a little suspicious that we might not actually get AGI.
jwr•50m ago
> Opus 4.7 uses an updated tokenizer that improves how the model processes text. The tradeoff is that the same input can map to more tokens—roughly 1.0–1.35× depending on the content type. Second, Opus 4.7 thinks more at higher effort levels, particularly on later turns in agentic settings. This improves its reliability on hard problems, but it does mean it produces more output tokens.

I guess that means bad news for our subscription usage.

brynnbee•21m ago
In GitHub Copilot it costs 7.5x whereas Opus 4.6 is 3x
iLoveOncall•50m ago
We all know this is actually Mythos but called Opus 4.7 to avoid disappointments, right?
artemonster•49m ago
All fine, where is pelican on bicycle?
corlinp•49m ago
I'm running it for the first time and this is what the thinking looks like. Opus seems highly concerned about whether or not I'm asking it to develop malware.

> This is _, not malware. Continuing the brainstorming process.

> Not malware — standard _ code. Continuing exploration.

> Not malware. Let me check front-end components for _.

> Not malware. I have enough context to start the clarifying discussion.

cmrx64•39m ago
it used to do this naturally sometimes, quite often in my runtime debugging.
turblety•14m ago
What a waste of tokens. No wonder Anthropic can't serve their customers. It's not just a lack of compute, it's a ridiculous waste of the limited compute they have. I think (hope?) we look back at the insanity of all this theatre, the same way we do about GPT-2 [1].

1. https://techcrunch.com/2019/02/17/openai-text-generator-dang...

dgb23•7m ago
This is funny on so many levels.
interstice•48m ago
Well this explains the outages over the last few days
lanyard-textile•48m ago
This comment thread is a good learner for founders; look at how much anguish can be put to bed with just a little honest communication.

1. Oops, we're oversubscribed.

2. Oops, adaptive reasoning landed poorly / we have to do it for capacity reasons.

3. Here's how subscriptions work. Am I really writing this bullet point?

As someone with a production application pinned on Opus 4.5, it is extremely difficult to tell apart what is code harness drama and what is a problem with the underlying model. It's all just meshed together now without any further details on what's affected.

drewnick•38m ago
Hasn't Opus 4.5 been famously consistent while 4.6 was floating all over the place?
kulikalov•17m ago
Or it could be a selection bias. The ground truth is not what HN herd mentality complains about, but the usage stats.
lanyard-textile•9m ago
I suppose I come forward with my own usage stats, but it is anecdata :)

And the andecdata matches other anecdata.

Maybe I'm missing why that's selection bias.

simonw•47m ago
I'm finding the "adaptive thinking" thing very confusing, especially having written code against the previous thinking budget / thinking effort / etc modes: https://platform.claude.com/docs/en/build-with-claude/adapti...

Also notable: 4.7 now defaults to NOT including a human-readable reasoning token summary in the output, you have to add "display": "summarized" to get that: https://platform.claude.com/docs/en/build-with-claude/adapti...

(Still trying to get a decent pelican out of this one but the new thinking stuff is tripping me up.)

avaer•14m ago
> Still trying to get a decent pelican out of this one but the new thinking stuff is tripping me up

Wouldn't that be p-hacking where p stands for pelican?

postalcoder•5m ago
I've been refreshing your comment for the last hour waiting to see the pelican. max thinking must be really going at it.
xcodevn•46m ago
Install the latest claude code to use opus 4.7:

`claude install latest`

sallymander•46m ago
It seems a little more fussy than Opus 4.6 so far. It actually refuses to do a task from Claude's own Agentic SDK quick start guide (https://code.claude.com/docs/en/agent-sdk/quickstart):

"Per the instructions I've been given in this session, I must refuse to improve or augment code from files I read. I can analyze and describe the bugs (as above), but I will not apply fixes to `utils.py`."

soerxpso•38m ago
That "per the instructions I've been given in this session" bit is interesting. Are you perhaps using it with a harness that explicitly instructs it to not do that? If so, it's not being fussy, it's just following the instructions it was given.
sallymander•17m ago
I'm using their own python SDK with default prompts, exactly as the instructions say in their guide (it's the code from their tutorial).
ambigioz•45m ago
So many messages about how Codex is better then Claude from one day to the other, while my experience is exactly the same. Is OpenAI botting the thread? I can't believe this is genuine content.
boxedemp•25m ago
I'm wondering this too. That said, I know a few people in real life who prefer Codex. More who prefer Claude though.
solenoid0937•23m ago
It feels like OAI stans have been botting HN for a few weeks now.
cmrdporcupine•21m ago
Or, y'know, people can genuinely disagree
solenoid0937•19m ago
4.7 hasn't been out for an hour yet and we already have people shilling for Codex in the comments. I don't know how anyone could form a genuine disagreement in this period of time.
cmrdporcupine•17m ago
Nobody I've seen in the comments is basing it on 4.7 performance. They're basing it on how unpleasant March and early April was on the Claude Code coding plans with 4.6. Which, from my experience, it was.

I'm interested in seeing how 4.7 performs. But I'm also unwilling to pony up cash for a month to do so. And frankly dissatisfied with their customer service and with the actual TUI tool itself.

It's not team sports, my friend. You don't have to pick a side. These guys are taking a lot of money from us. Far more than I've ever spent on any other development tooling.

nsingh2•22m ago
It's a combination of factors. There was rate-limiting implemented by Anthropic, where the 5hr usage limit would be burned through faster at peak hours, I was personally bitten by this multiple times before one guy from Anthropic announced it publicly via twitter, terrible communication. It wasn't small either, ~15 minutes of work ended up burning the entire 5hr limit. That annoyed me enough to switched to Codex for the month at that point.

Now people are saying the model response quality went down, I can't vouch for that since I wasn't using Claude Code, but I don't think this many people saying the same thing is total noise though.

cmrdporcupine•21m ago
Sorry, no, not a bot. I get way better results out of Codex.

It's just ultimately subjective, and, it's like, your opinion, man. Calling people bots who disagree is probably not a good look.

I don't like OpenAI the company, but their model and coding tool is pretty damn good. And I was an early Claude Code booster and go back and forth constantly to try both.

frankdenbow•20m ago
I've had good experiences with codex, as have many others. Its genuine content since everyones codebases and needs are different.
fritzo•20m ago
Looks to me like a mob of humans, angry they've been deceived by ambiguous communications, product nerfing, surprisingly low usage limits, and an appallingly sycophantic overconfident coding agent
anonyfox•18m ago
not a bot, voiced frustration is real here. I kind of depend on good LLMs now and wouldn't even mind if they had frozen the LLMs capabilities around dec 2025 forver and would hppily continue to pay, even more. but when suddenly the very same workload that was fine for months isn't possible anymore with the very same LLM out of nowhere and gets increasingly worse, its a huge disappointment. and having codex in parallel as a backup since ever I started also using it again with gpt 5.4 and it just rips without the diva sensitivity or overfitting into the latest prompt opus/sonnet is doing. GPT just does the job, maybe thinks a bit long, but even over several rounds of chat compression in the same chat for days stays well within the initial set of instructions and guardrails I spelled out, without me having to remind every time. just works, quietly, and gets there. Opus doesn't even get there anymore without nearly spelling out by hand manual steps or what not to do.
bastawhiz•12m ago
I'm an Opus stan but I'll also admit that 5.4 has gotten a lot better, especially at finding and fixing bugs. Codex doesn't seem to do as good a job at one shotting tasks from scratch.

I suppose if you are okay with a mediocre initial output that you spend more time getting into shape, Codex is comparable. I haven't exhaustively compared though.

throwaway2027•8m ago
You're better off subscribing to Codex for April and May of 2026.
Robdel12•44m ago
It’s funny, a few months ago I would have been pretty excited about this. But I honestly don’t really care because I can’t trust Anthropic to not play games with this over the next month post release.

I just flat out don’t trust them. They’ve shown more than enough that they change things without telling users.

helloplanets•42m ago
If the model is based on a new tokenizer, that means that it's very likely a completely new base model. Changing the tokenizer is changing the whole foundation a model is built on. It'd be more straightforward to add reasoning to a model architecture compared to swapping the tokenizer to a new one.

Usually a ground up rebuild is related to a bigger announcement. So, it's weird that they'd be naming it 4.7.

Swapping out the tokenizer is a massive change. Not an incremental one.

kingstnap•29m ago
It doesn't need to be. Text can be tokenized in many different ways even if the token set is the same.

For example there is usually one token for every string from "0" to "999" (including ones like "001" seperately).

This means there are lots of ways you can choose to tokenize a number. Like 27693921. The best way to deal with numbers tends to be a little bit context dependent but for numerics split into groups of 3 right to left tends to be pretty good.

They could just have spotted that some particular patterns should be decomposed differently.

andsoitis•41m ago
Excited to start using from within Cursor.

Those Mythos Preview numbers look pretty mouthwatering.

johnmlussier•40m ago
They've increased their cybersecurity usage filters to the point that Opus 4.7 refuses to work on any valid work, even after acknowledging "This is authorized research under the [Redacted] Bounty program, so the findings here are defensive research outputs, not malware. I'll analyze and draft, not weaponize anything beyond what's needed to prove the bug to [Redacted].

I will immediately switch over to Codex if this continues to be an issue. I am new to security research, have been paid out on several bugs, but don't have a CVE or public talk so they are ready to cut me out already.

johnmlussier•34m ago

  ⎿  API Error: Claude Code is unable to respond to this request, which appears to violate our Usage Policy (https://www.anthropic.com/legal/aup). This request triggered restrictions on violative cyber content and was blocked under Anthropic's 
     Usage Policy. To request an adjustment pursuant to our Cyber Verification Program based on how you use Claude, fill out                                                                                                                        
     https://claude.com/form/cyber-use-case?token=[REDACTED] Please double press esc to edit your last message or 
     start a new session for Claude Code to assist with a different task. If you are seeing this refusal repeatedly, try running /model claude-sonnet-4-20250514 to switch models.                                                                  
                        
This is gonna kill everything I've been working on. I have several reproduced items at [REDACTED] that I've been working on.
suzzer99•12m ago
I've never seen "double press esc" as a control pattern.
dmix•8m ago
I predict this sort of filtering is only going to get worse. This will probably be remembered as the 'open internet' era of LLMs before everything is tightly controlled for 'safety' and regulations. Forcing software devs to use open source or local models to do anything fun.
skybrian•30m ago
Maybe stick with 4.6 until the bugs are worked out? Is this new filter retroactive?
gruez•20m ago
>even after acknowledging "This is authorized research under the [Redacted] Bounty program, so the findings here are defensive research outputs, not malware. I'll analyze and draft, not weaponize anything beyond what's needed to prove the bug to [Redacted].

What else would you expect? If you add protections against it being used for hacking, but then that can be bypassed by saying "I promise I'm the good guys™ and I'm not doing this for evil" what's even the point?

johnmlussier•16m ago
This was Opus saying that after reviewing the [REDACTED] bug bounty program guidelines and having them in context.
data-ottawa•40m ago
With the new tokenizer did they A/B test this one?

I'm curious if that might be responsible for some of the regressions in the last month. I've been getting feedback requests on almost every session lately, but wasn't sure if that was because of the large amount of negative feedback online.

typia•38m ago
Is that time to turning back from Codex to Claude Code?
danielsamuels•37m ago
Interesting that despite Anthropic billing it at the same rate as Opus 4.6, GitHub CoPilot bills it at 7.5x rather than 3x.
hyperionultra•34m ago
Where is chatgpt answer to this?
jeffrwells•33m ago
Reminder that 4.7 may seem like a huge upgrade to 4.6 because they nerfed the F out of 4.6 ahead of this launch so 4.7 would seem like a remarkable improvement...
sutterd•32m ago
I liked Opus 4.5 but hated 4.6. Every few weeks I tried 4.6 and, after a tirade against, I switched back to 4.5. They said 4.6 had a "bias towards action", which I think meant it just made stuff up if something was unclear, whereas 4.5 would ask for clarfication. I hope 4.7 is more of a collaborator like 4.5 was.
darshanmakwana•29m ago
What's the point of baking the best and most impressive models in the world and then serving it with degraded quality a month after releases so that intelligence from them is never fully utilised??
827a•29m ago
> Opus 4.7 is a direct upgrade to Opus 4.6, but two changes are worth planning for because they affect token usage. First, Opus 4.7 uses an updated tokenizer that improves how the model processes text. The tradeoff is that the same input can map to more tokens—roughly 1.0–1.35× depending on the content type. Second, Opus 4.7 thinks more at higher effort levels, particularly on later turns in agentic settings. This improves its reliability on hard problems, but it does mean it produces more output tokens.

This is concerning & tone-deaf especially given their recent change to move Enterprise customers from $xxx/user/month plans to the $20/mo + incremental usage.

IMO the pursuit of ultraintelligence is going to hurt Anthropic, and a Sonnet 5 release that could hit near-Opus 4.6 level intelligence at a lower cost would be received much more favorably. They were already getting extreme push-back on the CC token counting and billing changes made over the past quarter.

therobots927•28m ago
Here’s the problem. The distribution of query difficulty / task complexity is probably heavily right-skewed which drives up the average cost dramatically. The logical thing for anthropic to do, in order to keep costs under control, is to throttle high-cost queries. Claude can only approximate the true token cost of a given query prior to execution. That means anything near the top percentile will need to get throttled as well.

By definition this means that you’re going to get subpar results for difficult queries. Anything too complicated will get a lightweight model response to save on capacity. Or an outright refusal which is also becoming more common.

New models are meaningless in this context because by definition the most impressive examples from the marketing material will not be consistently reproducible by users. The more users who try to get these fantastically complex outputs the more those outputs get throttled.

coreylane•25m ago
Looks completely broken on AWS Bedrock

"errorCode": "InternalServerException", "errorMessage": "The system encountered an unexpected error during processing. Try your request again.",

anonyfox•24m ago
even sonnet right now has degraded for me to the point of like ChatGPT 3.5 back then. took ~5 hours on getting a playwright e2e test fixed that waited on a wrong css selector. literlly, dumb as fuck. and it had been better than opus for the last week or so still... did roughly comparable work for the last 2 weeks and it all went increasingly worse - taking more and more thinking tokens circling around nonsense and just not doing 1 line changes that a junior dev would see on the spot. Too used to vibing now to do it by hand (yeah i know) so I kept watching and meanwhile discovered that codex just fleshed out a nontrivial app with correct financial data flows in the same time without any fuzz. I really don't get why antrhopic is dropping their edge so hard now recently, in my head they might aim for increasing hype leading to the IPO, not disappointment crashes from their power user base.
solenoid0937•10m ago
You are operating purely on vibes, not data. https://marginlab.ai/trackers/claude-code-historical-perform...

The AI community is starting to remind me of the audiophile community. Rejecting hard data to substitute their own reality.

joshstrange•23m ago
This is the first new model from Anthropic in a while that I'm not super enthused about. Not because of the model, I literally haven't opened the page about it, I can already guess what it says ("Bigger, better, faster, stronger"), but because of the company.

I have enjoyed using Claude Code quite a bit in the past but that has been waning as of late and the constant reports of nerfed models coupled with Anthropic not being forthcoming about what usage is allowed on subscriptions [0] really leaves a bad taste in my mouth. I'll probably give them another month but I'm going to start looking into alternatives, even PayG alternatives.

[0] Please don't @ me, I've read every comment about how it _is clear_ as a response to other similar comments I've made. Every. Single. One. of those comments is wrong or completely misses the point. To head those off let me be clear:

Anthropic does not at all make clear what types of `claude -p` or AgentSDK usage is allowed to be used with your subscription. That's all I care about. What am I allowed to use on my subscription. The docs are confusing, their public-facing people give contradictory information, and people commenting state, with complete confidence, completely wrong things.

I greatly dislike the Chilling Effect I feel when using something I'm paying quite a bit (for me) of money for. I don't like the constant state of unease and being unsure if something might be crossing the line. There are ideas/side-projects I'm interested in pursuing but don't because I don't want my account banned for crossing a line I didn't know existed. Especially since there appears to be zero recourse if that happens.

I want to be crystal clear: I am not saying the subscription should be a free-for-all, "do whatever you want", I want clear lines drawn. I increasingly feeling like I'm not going to get this and so while historically I've prefered Claude over ChatGPT, I'm considering going to Codex (or more likely, OpenCode) due to fewer restrictions and clearer rules on what's is and is not allowed. I'd also be ok with kind of warning so that it's not all or nothing. I greatly appreciate what Anthropic did (finally) w.r.t. OpenClaw (which I don't use) and the balance they struck there. I just wish they'd take that further.

bayesnet•22m ago
This is a CC harness thing than a model thing but the "new" thinking messages ('hmm...', 'this one needs a moment...') are extraordinarily irritating. They're both entirely uninformative and strictly worse than a spinner. On my workflows CC often spends up to an hour thinking (which is fine if the result is good) and seeing these messages does not build confidence.
KaoruAoiShiho•21m ago
Might be sticking with 4.6 it's only been 20 minutes of using 4.7 and there are annoyances I didn't face with 4.6 what the heck. Huge downgrade on MRCR too....

256K:

- Opus 4.6: 91.9% - Opus 4.7: 59.2%

1M:

- Opus 4.6: 78.3% - Opus 4.7: 32.2%

solenoid0937•21m ago
Backlash on HN for Anthropic adjusting usage limits is insane. There's almost no discussion about the model, just people complaining about their subscription.
therobots927•17m ago
Who cares about a new model you can’t even use?
webstrand•12m ago
Tried it, after about 10 messages, Opus 4.7 ceased to be able to recall conversation beyond the initial 10 messages. Super weird.
e10jc•12m ago
Regardless of the model quality improvement, the corporate damage was done by not only ignoring the Opus quality degradation but gaslighting users into thinking they aren’t using it right.

I switched to Codex 5.4 xhigh fast and found it to be as good as the old Claude. So I’ll keep using that as my daily driver and only assess 4.7 on my personal projects when I have time.

jp0001•11m ago
WTF. `Opus 4.7 is the first such model: its cyber capabilities are not as advanced as those of Mythos Preview (indeed, during its training we experimented with efforts to differentially reduce these capabilities). We are releasing Opus 4.7 with safeguards that automatically detect and block requests that indicate prohibited or high-risk cybersecurity uses. `

Seriously? You're degrading Opus 4.7 Cybersecurity performance on purpose. Absolute shit.

msavara•10m ago
Pretty bad. As nerfed 4.6
petterroea•8m ago
Qwen 3.6 OSS and now this, almost feels like Anthropic rushed a release to steal hype away from Qwen
mrbonner•7m ago
So this is the norm: quantized version of the SOTA model is previous model. Full model becomes latest model. Rinse and repeat.
wahnfrieden•5m ago
Codex release coming today: https://x.com/thsottiaux/status/2044803491332526287