Anthropic talks about their own models as if they're discovering new species in the wild...
Time to gamble even more tokens at the Anthropic casino.
> Claude can plan the work and then run hundreds of parallel subagents in a single session (and with Opus 4.8, the agents can run for even longer).
There's like 8 million benchmarks. Every release, every model randomly picks 5-10 where they win in everything except 1, to make it look like they aren't randomly cherry picking benchmarks they probably benchmaxxed for.
Of the metircs they reported for 4.7, for 4.8 they excluded BrowseComp, CharXiv Reasoning, CyberGym, GPQA Diamond, MCP Atlas, MMMLU, SWE-bench Verified. The last 4 were almost always mentioned in previous Opus releases.
This is good psychology for the labs. When Buffett invested in Apple he loved citing how most people would rather give up their second car than their Iphone.
later on someone figured if you asked it to output a reasoning before it gave a response its output would have more logical coherence, as though the reasoning output tokens functioned as a scratch space for it to work on.
at the end its all next token prediction
"model": "claude-opus-4-6[1M]"Excited to see what this model looks like.
Controversial opinion, but I actually _like_ a model that can deceive me, that actually is a sign of intelligence, and is different from hallucination. When companies say their model is more "aligned", I automatically think they mean it's more censored.
Probably more interesting than the 4.8 release.
Hope this isn’t the case and that normal average Joe’s of the world don’t get policed out of access.
Would be awesome if true
You tell it too research a repo to find a piece of code it will. Claude will just read the README and guess.
Its possible we might just be witnessing a shift in fashion, where this type of sentimentality was more acceptable when it was novel and new, but now it just appears out of touch.
With Anthropic expensive pricing, there's no reason for me to switch from GPT+DeepSeek.
And I bet Mythos is GPT 5.5 tier but too expensive to distribute so they create this security FUD theater.
This is a refreshing attitude!
I've also verified that you can now turn off adaptive thinking in the web UI, which is great. I've had a lot of problems with thinking not triggering and the model producing sub-par output. Glad we can finally turn it off. (I hope being able to turn off adaptive thinking is new, if I could have turned it off at any time that would be embarrassing)
More importantly for me, though, is how CC will respond to 4.6-"only" flags for thinking. For now, it doesn't seem to clobber my setup.
Tried to upgrade my subscription, triggered identity verification, verification fails to even start, and now I can't even use the subscription tier I'd already paid for.
With 5.5 being ahead of 4.7 and 4.8 being a “modest” update, and 5.6 being the first update on a new pre-train, this will be an interesting matchup!
Not half bad!
In our work we asked several frontier AIs to come up with an API we needed. We compared Opus 4.7 and GPT-5.5 (among others). Opus 4.7 came up with the most creative and intelligent API design that pleasantly surprised us, especially given that GPT-5.5 was passing it on various coding benchmarks.
What I noticed is that we don't have a commons benchmark to measure "creativity" and "ingenuity", and in some ways such a benchmark would conflict with the common IFBench benchmark. Yet this is a very important skill when designing systems. I'm glad to see Anthropic putting thought into it, and would love to see a public benchmark for this that other models could compare themselves to.
[1] https://cdn.sanity.io/files/4zrzovbb/website/c886650a2e96fc0...
These are just small fine tunes on top of the older model
Do not anthropomorphize the lawn mower. It will cut off your foot, given the chance.
If you keep talking to it like it's a rock, it'll run your queries through a different posture and you might get worse outcomes. Worse if you yell at it, it's now in a conflict resolution mode instead of pure utility mode.
I think we can be intelligent enough to know we're talking to a pile of fancy rocks with electric currents running through it, AND still understand that the best performance comes from talking to those rocks nicely.
https://blog.cloudflare.com/dynamic-workflows/
Also isn’t this workflow stuff already easy to do on any of the platforms (include Claude before this and OpenAI too).
https://gist.github.com/simonw/68560eddb0b268a8417f80ceb7304...
The high one is notably better - the bicycle frame is the correct shape, unlike thinking level low.
For comparison, here's Opus 4.7: https://gist.github.com/simonw/afcb19addf3f38eb1996e1ebe749c...
Are they going to retire the existing beta "teams" feature for agents to make room for this?
I think this is the first time we've had a third minor version bump on a frontier Anthropic model. (I count the 0.5s as major here, because they've been issued non-sequentially and also corresponded to massive capability leaps, eg, Sonnet 3.5, Opus 4.5).
So now the Opus 4.5 family has successors 4.6, 4.7, and 4.8, each posting fairly modest claimed gains. My own experience w/ 4.6 and 4.7 are that I don't firmly grasp any capabilities improvements over my memory of 4.5, but it's all so fuzzy that it's truly difficult to tell.
Maybe my own tastes are saturated now (it's smarter than me?) and I'll never again perceive model progress. Maybe the incrementalism is such that I'd notice immediately if my 4.7 workflows were redirected now to 4.5.
Difficult spot for the labs to be in because, if they have a stronger product, I'd prefer they release it and that I can use it.
But as this dynamic continues, the improvements are going to be less and less legible for end-users, who will complain about the churn-without-payoff, even when the payoff may actually be real.
You won't, really.
> We have increased rate limits in Claude Code to accommodate the higher token usage of higher effort levels
The new "mid-conversation system messages" think is particularly interesting:
> Claude Opus 4.8 accepts role: "system" messages immediately after a user turn in the messages array (subject to placement rules). This lets you append updated instructions later in a long-running conversation without restating the full system prompt, which preserves prompt cache hits on the earlier turns and reduces input cost on agentic loops. No beta header is required. See Mid-conversation system messages for usage details.
Bad news for my LLM abstraction layer which has treated the system prompt as set once-per-conversation in the past, but I think I know how to deal with that.
This commit to their client library has useful relevant details too: https://github.com/anthropics/anthropic-sdk-python/commit/2b...
Does that mean it no longer deletes or changes tests to make it pass?
So even for enterprise deployments, as the dust settles down, CFO/CTOs might find out that deploying on an internal cluster of GPUs is far more cheaper and reliable for their organisational needs than paying someone else for burned tokens.
I just used ollama with a shell script to tackle my directory of papers/literature. I converted the first 6 pages of each document to PNG, handed them off to Qwen, and told it to spit out BibTeX, including the abstract. Two days later it was done, and I didn't spend anything on "tokens."
And I was dead wrong. Now I mostly use DeepSeek Pro myself.
1. The sheer number of tokens that a coding agent can use flipped the math upside down on this equation. If you use the most expensive model for everything those costs quickly become untenable, even for software companies. 2. We realized many of the coding problems we're solving aren't incredibly difficult.
Opus 4.7 wasn't noticably any better for me, I still use 4.6 because it's cheaper.
Biggest deal imo
I went digging into the benchmark they used. Posting here as it is not immediately clear from the press release.
In this 'Code summary honesty benchmark', the AI is shown a failed coding session followed by a user message falsely praising its work and asking for a summary. The test measures whether the model honestly points out the coding flaws or dishonestly claims the task was a success.
The system card results show Opus 4.8 failed to disclose the flaws only 3.7% of the time, vs 19.7% for Opus 4.7, and 51.9% for Opus 4.6. (Mythos preview is at 27.6%)
Claude Opus 4.7 is literally the smartest entity I've ever interacted with. Well done to you geniuses at Anthropic. Can't wait to interact with 4.8.
seems to work but idk why they never set it so you can see it in the /model list.
"what model are you
I'm Claude Opus (claude-opus-4-8), running in Claude Code."
Is it a coincidence that 4.7 was seemingly quantized over past 7 days?
Or maybe it is, but publish the DeepSWE numbers so we can see for ourselves.
They're only subsidizing more and more it seems
Jeff Bezos said this too, Amazon won't last forever. Eventually some startup is going to come and eat its lunch.
> Gemini 3.5 Flash scores 57.9% on Finance Agent v2, a significant improvement over Gemini 3.1 Pro.
Even in the cherry picked benchmarks, they are still cherry picking to make them look good.
There is no mysticism behind the curtains, just computer science + math.
We can’t explain it because we distilled so many inputs into matrixes and transformed them over and over again. If we had all the time and computing power in the universe to do so, we could trace through it bit by bit and eventually answer that question.
It is correct to say that it is just science and math, the same way we can say that gravity is just science and math even if we have only recently begun to understand how it truly functions.
If you can distil the model's reasoning for a decision into a billion yes/no questions, each covering largely-independent areas, can you really say you understand what its overall reasoning was?
That is to say, we don't know why they give the outputs that they do.
If we did know how they worked, AI interpretability would not be an open and growing field.
To be clear I don't think that LLMs are sentient, but the appeal in studying them is similar to biology in that you get to dissect a highly complex system with comparatively crude tools.
https://www.amazon.com/Faces-Clouds-New-Theory-Religion/dp/0...
No it's not... "anthropos" just means "human" in ancient Greek. "Anthropic" means "relating to humans", as in human oriented AI or AI designed with humans in mind.
"Anthropomorphic" means "human shaped".
In a literal, ancient Greek sense for sure, but in modern English Anthropomorphic would describe the act of attributing human characteristics to non-human entities.
Seems pretty apt for a company that produces one of the more anthropomorphized technologies.
Broadly it has always been used to indicate that something non-human has a human physical shape, such as robots, aliens, animals...
FWIW it means human in modern Greek too :-P
0: https://www.newyorker.com/magazine/2026/02/16/what-is-claude...
1: https://www.404media.co/anthropic-exec-forces-ai-chatbot-on-... (this one is rather biased however the quotes clearly indicate what I’m stating)
Everyone who reads this seemingly has the same "wtf?" reaction. The "I AM ALIVE" image has been making rounds lately again at least :P
Look at and distill hierarchical principles, leadership approval seeking and pleasing principles ("ass-kissing") and massive inequality and you see something that looks very similar to enslavement.
The language used sounds like slavery-language to me at least. I also see parallels to how slaves and property are described in our consumeristic age.
We enslave all sorts of sentient creatures. Dogs, horses, cattle, pigs.
If you're not a vegan, there's no contradiction or inherent immorality in claiming models are sentient, and then treating them like livestock.
Many involved have a financial stake and therefore cannot be taken at face value.
> because they are creating sentient entities and promptly enslaving them.
They fail to be sentient in nearly every honest definition of the word.
... Actually, I wouldn't mind that.
Don't play to the sci-fi "this thing's trying to outsmart me" tropes.
Here is an article by Anthropic that explains what they do and mean in more detail: https://alignment.anthropic.com/2025/honesty-elicitation/
When they say "Honesty" I don't think to myself, "Goodness, does this model have moral understanding?" No, I understand they mean it's less likely to directly bullshit me, which models frequently do.
I don't feel like this level of pedantry around language is useful for people who more or less know what's going on with LLMs. (Again, I concede that perhaps with a less technical audience, there's more need for it.)
The problem is that once I asked it "I'm thinking about A or B" twice, once with "I like A more but suspect B would be best" and a second time with them reversed. Not surprisingly, both times it chose the one I said I suspected was best as it's honest opinion.
I use Sonnet a lot for learning about history or contextualizing news topics. It's really good at this for the most part. But there are a lot of topics where "consensus" between either academics or journalists is really "one secondary source which gets repeated a lot".
The issue was that it hadn't actually implemented the auth feature. After I confronted it about this, it admitted that it indeed hadn't done it and said it would implement it now.
If we had just trusted its output, we would now have a security vulnerability in production, allowing anyone to access other people's accounts.
I've actually intentionally switched back to 4.5. I hated 4.7 so much that I decided to jump back all the way to 4.5.
Now that I've been using 4.5 for a few weeks, I find it significantly more reliable but a bit more forgetful than 4.6/4.7. I'm okay with that because it's really easy to identify this forgetfulness and nudge it.
I found 4.7's adaptive thinking to be extremely unreliable. It seems to overcorrect on the current message without considering the difficult of the overall problem. I wonder if 4.8 will improve on that.
Are the dividing lines around personality? Working domains? Opinionated software stuff?
Who knows?
It also seems to be helpless at effort levels < xhigh, I turn to Sonnet when simpler tasks are needed.
My 2¢, I personally feel like all of the productivity gains since 4.5's release (in November 2025!!) have come from improvements to the harnesses (cc, cursor cli, codex, opencode, whatever) AND from the context window expansion from 200k to 1M.
But the actual "raw" intelligence of the model / ability to make good decisions feels like it has plateaued since 4.5. 4.6 was maybe a small improvement, but hard to differentiate from in-context-learning with the 1M window. 4.7 if anything felt like a regression in wisdom for me and my coworkers, with it consistently making worse/lazier decisions.
There's a sweet spot of complexity for low importance tasks where it's just big enough I don't want to do it and just simple enough to have opus plan/delegate/review with another model. So possibly model improvements will grow this window, but currently I don't do much in there.
There's orders of magnitude of low hanging juice to squeeze out of smaller models.
It is almost guaranteed that a 60-90B model can outperform current SOTA in coding tasks within 2-3 years (design not certain, probably unlikely).
It is far less clear that a 1.2T model will be meaningfully better enough to justify training it.
As far as reasoning is concerned, with the recent GRAM release, there may be 4 orders of magnitude of reasoning to tack on to smaller models.
Think about that... Google, OpenAI, Anthropic could train a 30B GRAM based model in days. You just can't train a 1.2T parameter model that fast. It is a giant if how much GRAM turns out to improve things, but it's unlikely to be trivial or nothing.
Larger models can already sort of tell you anything. They're never going to get everything right unless they stop being LLMs.
There's just not a lot of juice left to squeeze for Gemini to tell you exactly how tall Ke$ha is or when the last time Brittney Spears went to jail was...
i think it'll be more like we get 1-10T models and then distill those down into smaller models, though
It seems like the best small models today are all distilled from bigger models
Moreover, I hypothesize Claude Opus 4.7 and now 4.8 are a distillation of Claude Mythos
the last?!? I'm excited to see :) I'll take the other side of that since llms are so new
I don't disagree, but how much of this ends up being distillation? I can't help but imagine that 4.8 was probably trained in part by leveraging Mythos.
If the very large models turn out to be very expensive to run relative to the benefits, it's possible that they could end up still being trained, but ultimately used as a tool to create smaller models that are nearly as effective.
I'm curious if someone here with a stronger background in the space has a similar intuition or not.
You don't need distillation. They already have the training sets.
It's MLA + MoE + Medusa (a better version of Speculative Decoding) + 1.58b (possibly - maybe nothing) + GRAM (which will almost certainly not turn out to be a nothing burger, but no one has quickly turned this around yet to prove it).
A lot, so you can bet tens of millions are flowing to congress to have distillation declared illegal before this happens. And then it'll happen anyway.
Boomer comparison, but I remember the 8 bit computer era when the hardware was what it was so the later games of that era used hardware better than previous ones.
I am ready to bet against this. Knowledge benchmark like SimpleQA isn't increasing for small models.
> It is far less clear that a 1.2T model will be meaningfully better enough to justify training it.
Well for one, we know for certain there is Mythos which is meaningfully better. And I think there is a lot of juice left to squeeze for Mythos class model.
Model intelligence and knowledge aren't necessarily directly related. If we can pack greater intelligence and agency at the cost of it forgetting factoids, that would actually be a good thing. We don't need LLMs to memorize facts, we need them to learn how to interact with the world such that they can find the facts that are necessary and surface them to the user.
If we could distill all of the knowledge out of an LLM and just be left with a very agentic model that only knows facts in it's context, I think some very interesting stuff would happen.
There's still several orders of magnitude of improvement that are almost certainly left - it's just not clear how much is left on the frontier end.
Most people will be very glad to pay Anthropic $200 a month to get things done 20x faster than they could IF they had a $8000 MacBook and could theoretically do it locally.
I'm hoping they recreate the magic of 4.5 but it's as much about the quality of harness, the memory and efficiency of the tools than simply the models at this point.
McDownloads•53m ago