Have you used it?
I liked Claude 3.7 but without context this comes off as what the kids would call "glazing"
Edit: How do you install it? Running `/ide` says "Make sure your IDE has the Claude Code extension", where do you get that?
ETA: I guess Anthropic still thinks they can command a premium, I hope they're right (because I would love to pay more for smarter models).
> Pricing remains consistent with previous Opus and Sonnet models: Opus 4 at $15/$75 per million tokens (input/output) and Sonnet 4 at $3/$15.
https://www.anthropic.com/pricing#api
Opus 4 is $15 / m tokens in, $75 / MTok out Sonnet 4 is the same $3 / MTok in, $15 / MTok out
> Pricing remains consistent with previous Opus and Sonnet models: Opus 4 at $15/$75 per million tokens (input/output) and Sonnet 4 at $3/$15.
Wait, Sonnet 4? Opus 4? What?
- Small: Haiku
- Medium: Sonnet
- Large: Opus
> Aider’s polyglot benchmark tests LLMs on 225 challenging Exercism coding exercises across C++, Go, Java, JavaScript, Python, and Rust.
I looked up Exercism and they appear to be story problems that you solve by coding on mostly/entirely blank slates, unless I'm missing something? That format would seem to explain why the models are reportedly performing so well, because they definitely aren't that reliable on mature codebases.
Maybe this model will push the “Assign to CoPilot” closer to the dream of having package upgrades and other mostly-mechanical stuff handled automatically. This tech could lead to a huge revival of older projects as the maintenance burden falls.
GitHub says "Claude Opus 4 is hosted by Anthropic PBC. Claude Sonnet 4 is hosted by Anthropic 1P."[1]. What's Anthropic 1P? Based on the only Kagi result being a deployment tutorial[2] and the fact that GitHub negotiated a "zero retention agreement" with the PBC but not whatever "1P" is, I'm assuming it's a spinoff cloud company that only serves Claude...? No mention on the Wikipedia or any business docs I could find, either.
Anyway, off to see if I can access it from inside SublimeText via LSP!
[1] https://docs.github.com/en/copilot/using-github-copilot/ai-m...
[2] https://github.com/anthropics/prompt-eng-interactive-tutoria...
So far I have found it pretty powerful, its also the first time an LLM has ever stopped while working to ask me a question or for clarification.
Nothing any rank and file hasn't been through before with a company that relies on keynotes and flashy releases for growth.
Stressful, but part and parcel. And well-compensated.
For example a lot of llms (I've seen it in Gemini 2.5, and Claude 3.7) will code non-existent methods in dynamic languages. While these runtime errors are often auto-fixable, sometimes they aren't, and breaking out of an agentic workflow to deep dive the problem is quite frustrating - if mostly because agentic coding entices us into being so lazy.
Maybe that's the problem that needs solving then? The threshold doesn't have to be "bot capable of doing entire task end to end", like it could also be "bot does 90% of task, the worst and most boring part, human steps in at the end to help with the one bit that is more tricky".
Or better yet, the bot is able to recognize its own limitations and proactively surface these instances, be like hey human I'm not sure what to do in this case; based on the docs I think it should be A or B, but I also feel like C should be possible yet I can't get any of them to work, what do you think?
As humans, it's perfectly normal to put up a WIP PR and then solicit this type of feedback from our colleagues; why would a bot be any different?
Still, the big short-term danger being you're left with code that seems to work well but has subtle bugs in it, and the long-term danger is that you're left with a codebase you're not familiar with.
If anything, I'd bet that agent-written code will get better review than average because the turn around time on fixes is fast and no one will sass you for nit-picking, so it's "worth it" to look closely and ensure it's done just the way you want.
Any coding I've done with Claude has been to ask it to build specific methods, if you don't understand what's actually happening, then you're building something that's unmaintainable. I feel like it's reducing typing and syntax errors, sometime it leads me down a wrong path.
The problem is purely social. There are language ecosystems where great care is taken to not break stuff and where you can let your project rot for a decade or two and still come back to and it will perfectly compile with the newest release. And then there is the JS world where people introduce churn just for the sake of their ego.
Maintaining a project is orders of magnitudes more complex than creating a new green field project. It takes a lot of discipline. There is just a lot, a lot of context to keep in mind that really challenges even the human brain. That is why we see so many useless rewrites of existing software. It is easier, more exciting and most importantly something to brag about on your CV.
Ai will only cause more churn because it makes it easier to create more churn. Ultimately leaving humans with more maintenance work and less fun time.
In some cases perhaps. But breaking changes aren’t usually “we renamed methodA to methodB”, it’s “we changed the functionality for X,Y, Z reasons”. It would be very difficult to somehow declaratively write out how someone changes their code to accommodate for that, it might change their approach entirely!
I think there are others in that space but that's the one I knew of. I think it's a relevant space for Semgrep, too, but I don't know if they are interested in that case
Agentic Experiences: Version Upgrade Agent
> This tech could lead to...
I don't think he's saying this is the version that will suddenly trigger a Renaissance. Rather, it's one solid step that makes the path ever more promising.
Sure, everyone gets a bit overexcited each release until they find the bounds. But the bounds are expanding, and the need for careful prompt engineering is diminishing. Ever since 3.7, Claude has been a regular part of my process for the mundane. And so far 4.0 seems to take less fighting for me.
A good question would be when can AI take a basic prompt, gather its own requirements and build meaningful PRs off basic prompt. I suspect it's still at least a couple of paradigm shifts away. But those seem to be coming every year or faster.
[0] My headless coding agents product, similar to “assign to copilot” but works from your task board (Linear, Jira, etc) on multiple tasks in parallel. So far simple/routine features are already quite successful. In general the better the tests, the better the resulting code (and yes, it can and does write its own tests).
Sounds like it’ll be better at writing meaningful tests
The Elite Four/Champion was a non-issue in comparison especially when you have a lv. 81 Blastoise.
these models are trained on a static task, text generation, which is to say the state they are operating in does not change as they operate. but now that they are out we are implicitly demanding they do dynamic tasks like coding, navigation, operating in a market, or playing games. this are tasks where your state changes as you operate
an example would be that as these models predict the next word, the ground truth of any further words doesnt change. if it misinterprets the word bank in the sentence "i went to the bank" as a river bank rather than a financial bank, the later ground truth wont change, if it was talking about the visit to the financial bank before, it will still be talking about that regardless of the model's misinterpretation. But if a model takes a wrong turn on the road, or makes a weird buy in the stock market, the environment will react and change and suddenly, what it should have done as the n+1th move before isnt the right move anymore, it needs to figure out a route of the freeway first, or deal with the FOMO bullrush it caused by mistakenly buying alot of stock
we need to push against these limits to set the stage for the next evolution of AI, RL based models that are trained in dynamic reactive environments in the first place
llm trained to do few step thing. pokemon test whether llm can do many step thing. many step thing very important.
Successfully navigating through Pokemon to accomplish a goal (beating the game) requires a completely different approach, one that much more accurately mirrors the way you navigate and goal set in real world environments. That's why it's an important and interesting test of AI performance.
My man, ChatGPT is the sixth most visited website in the world right now.
What counts as a killer app to you? Can you name one?
See: https://www.lesswrong.com/posts/7mqp8uRnnPdbBzJZE/is-gemini-...
[1] https://community.aws/content/2gbBSofaMK7IDUev2wcUbqQXTK6/ca...
But I love you Claude. It was me, not you.
From their customer testimonials in the announcement, more below
>Cursor calls it state-of-the-art for coding and a leap forward in complex codebase understanding. Replit reports improved precision and dramatic advancements for complex changes across multiple files. Block calls it the first model to boost code quality during editing and debugging in its agent, codename goose, while maintaining full performance and reliability. Rakuten validated its capabilities with a demanding open-source refactor running independently for 7 hours with sustained performance. Cognition notes Opus 4 excels at solving complex challenges that other models can't, successfully handling critical actions that previous models have missed.
>GitHub says Claude Sonnet 4 soars in agentic scenarios and will introduce it as the base model for the new coding agent in GitHub Copilot. Manus highlights its improvements in following complex instructions, clear reasoning, and aesthetic outputs. iGent reports Sonnet 4 excels at autonomous multi-feature app development, as well as substantially improved problem-solving and codebase navigation—reducing navigation errors from 20% to near zero. Sourcegraph says the model shows promise as a substantial leap in software development—staying on track longer, understanding problems more deeply, and providing more elegant code quality. Augment Code reports higher success rates, more surgical code edits, and more careful work through complex tasks, making it the top choice for their primary model.
A bit busy at the moment then.
Extremely cringe behaviour. Raw CoTs are super useful for debugging errors in data extraction pipelines.
After Deepseek R1 I had hope that other companies would be more open about these things.
The Max subscription with fake limits increase comes to mind.
And Claude Code used Opus 4 now!
Obviously trying the model for your use cases more and more lets you narrow in on actually utility, but I'm wondering how others interpret reported benchmarks these days.
I personally couldn't care less about them, especially when we've seen many times that the public's perception is absolutely not tied to the benchmarks (Llama 4, the recent OpenAI model that flopped, etc.).
Not all benchmarks are well-designed.
so effectively you can only guarantee a single use stays private
> By default, we will not use your inputs or outputs from our commercial products to train our models.
> If you explicitly report feedback or bugs to us (for example via our feedback mechanisms as noted below), or otherwise explicitly opt in to our model training, then we may use the materials provided to train our models.
https://privacy.anthropic.com/en/articles/7996868-is-my-data...
AI companies are grasping at straws by selling us minor improvements to stale technology so they can pump up whatever valuation they have left.
Claude 3.7 Sonnet was consistently on top of OpenRouter in actual usage despite not gaming benchmarks.
So it seems like all 3 of the LLM providers are now hiding the CoT - which is a shame, because it helped to see when it was going to go down the wrong track, and allowing to quickly refine the prompt to ensure it didn't.
In addition to openAI, Google also just recently started summarizing the CoT, replacing it with an, in my opinion, overly dumbed down summary.
Long term that might matter more
R2 could turn out really really good, but we‘ll see.
It would make sense if the model used for train-of-though was trained differently (perhaps a different expert from an MoE?) from the one used to interact with the end user, since the end user is only ever going to see its output filtered through the public model the chain-of-thought model can be closer to the original, more pre-rlhf version without risking the reputation of the company.
This way you can get the full performance of the original model whilst still maintaining the necessary filtering required to prevent actual harm (or terrible PR disasters).
Even 'plotting terror attacks' is not something terrorists can do just fine without AI. And as for making sure the model wouldn't say ideas that are hurtful to <insert group>, it seems to me so silly when it's text we're talking about. If I want to say "<insert group> are lazy and stupid," I can type that myself (and it's even protected speech in some countries still!) How does preventing Claude from espousing that dumb opinion, keep <insert group> safe from anything?
Easy example: Someone asks the robot for advice on stacking/shaping a bunch of tannerite to better focus a blast. The model says he's a terrorist. In fact, he's doing what any number of us have done and just having fun blowing some stuff up on his ranch.
Or I raised this one elsewhere but ochem is an easy example. I've had basically all the models claim that random amines are illegal, potentially psychoactive, verboten. I don't really feel like having my door getting kicked down by agents with guns, getting my dog shot, maybe getting shot myself because the robot tattled on me for something completely legal. For that matter if someone wants to synthesize some molly the robot shouldn't tattle to the feds about that either.
Basically it should just do what users tell it to do excepting the very minimal cases where something is basically always bad.
Mathematicians generally do novel research, which is hard to optimize for easily. Things like LiveCodeBench (leetcode-style problems), AIME, and MATH (similar to AIME) are often chosen by companies so they can flex their model's capabilities, even if it doesn't perform nearly as well in things real mathematicians and real software engineers do.
Now that both Google and Claude are out, I expect to see DeepSeek R2 released very soon. It would be funny to watch an actual open source model getting close to the commercial competition.
[0] https://mattsayar.com/personalized-software-really-is-coming...
I prefer MapQuest
that's a good one, too
Google Maps is the best
true that
double true!
I tend to find that I use Gemini for the first pass, then switch to Claude for the actual line-by-line details.
Claude is also far superior at writing specs than Gemini.
Not sure why Claude is more thorough and complete than the other models, but it's my go-to model for large projects.
The OpenAI model outputs are always the smallest - 500 lines or so. Not very good at larger projects, but perfectly fine for small fixes.
… whaaaaat?
You then ask LLMs to first write features for the individual apps (in Markdown), giving it some early competitive guidelines.
You then tell LLMs to read that features document, and then write an architectural specification document. Tell it to maybe add example data structures, algorithms, or user interface layouts. All in Markdown.
You then feed these two documents to individual LLMs to write the rest of the code, usually starting with the data models first, then the algorithms, then the user interface.
Again, the trick is to partition your project to individual apps. Also an app isn't the full app. It might just be a data schema, a single GUI window, a marketing plan, etc.
The other hard part is to integrate the apps back together at the top level if they interact with each other...
I know in Cursor and others I can just switch models between chats, but it doesn't feel intentional the way aider does. You chat in architecture mode, then execute in code mode.
The idea is that some models are better at reasoning about code, but others are better at actually creating the code changes (without syntax errors, etc). So Aider lets you pick two models - one does the architecting, and the other does the code change.
"tl:dr; Brainstorm spec, then plan a plan, then execute using LLM codegen. Discrete loops. Then magic."
I am excited to try out this new model. I actually want to stay brand loyal to antropic because I like the people and the values they express.
I've been pretty disappointed with Cursor and all the supported models. Sometimes it can be pretty good and convenient, because it's right there in the editor, but it can also get stuck on very dumb stuff and re-trying the same strategies over and over again
I've had really good experiences with o4-high-mini directly on the chat. It's annoying going back and forth copying/pasting code between editor and the browser, but it also keeps me more in control about the actions and the context
Would really like to know more about your experience
They keep leap-frogging each other. My preference has been the output from Gemini these last few weeks. Going to check out Claude now.
I'm happy that tool use during extended thinking is now a thing in Claude as well, from my experience with CoT models that was the one trick(tm) that massively improves on issues like hallucination/outdated libraries/useless thinking before tool use, e.g.
o3 with search actually returned solid results, browsing the web as like how i'd do it, and i was thoroughly impressed – will see how Claude goes.
I want GenAI to become better at tasks that I don't want to do, to reduce the unwanted noise from my life. This is when I'll pay for it, not when they found a new way to cheat a bit more the benchmarks.
At work I own the development of a tool that is using GenAI, so of course a new better model will be beneficial, especially because we do use Claude models, but it's still not exciting or interesting in the slightest.
Basically anything that some startup has tried and failed at uberizing.
So to most people, the code itself doesn't matter (and never will). It's what it lets them actually do in the real world.
Love to try the Claude Code VScode extension if the price is right and purchase-able from China.
Some app like Windsurf can easily pay with Alipay, a everyone-app in China.
They survive through VC funding, marketing, and inertia, I suppose.
Yeah, this is borderline my feeling too. Kicking off Codex with the same prompt but four times sometimes leads to for very different but confident solutions. Same when using the chat interfaces, although it seems like Sonnet 3.7 with thinking and o1 Pro Mode is a lot more consistent than any Gemini model I've tried.
I think we've reached peak LLM - if AGI is a thing, it won't be through this architecture.
Check this out from yesterday (watch the short video here):
https://simonwillison.net/2025/May/21/gemini-diffusion/
From:
That said, whether or not being a provider of these services is a profitable endeavor is still unknown. There's a lot of subsidizing going on and some of the lower value uses might fall to the wayside as companies eventually need to make money off this stuff.
1.5 pro was worse than original gpt4 on several coding things I tried head to head.
If anthropic is doing the same thing, then 3.5 would be 10x more compute vs 3. 3.7 might be 3x more than 3.5. and 4 might be another ~3x.
^ I think this maybe involves words like "effective compute", so yeah it might not be a full pretrain but it might be! If you used 10x more compute that could mean doubling the amount used on pretraining and then using 8x compute on post or some other distribution
But alas, it's not like 3nm fab means the literal thing either. Marketing always dominates (and not necessarily in a way that adds clarity)
The results are here (https://gist.github.com/minimaxir/1bad26f0f000562b1418754d67... ) and it utterly crushed the problem with the relevant microoptimizations commented in that HN discussion (oddly in the second pass it a) regresses from a vectorized approach to a linear approach and b) generates and iterates on three different iterations instead of one final iteration), although it's possible Claude 4 was trained on that discussion lol.
EDIT: "utterly crushed" may have been hyperbole.
Almost guaranteed, especially since HN tends to be popular in tech circles, and also trivial to scrape the entire thing in a couple of hours via the Algolia API.
Recommendation for the future: keep your benchmarks/evaluations private, as otherwise they're basically useless as more models get published that are trained on your data. This is what I do, and usually I don't see the "huge improvements" as other public benchmarks seems to indicate when new models appear.
> Almost guaranteed, especially since HN tends to be popular in tech circles, and also trivial to scrape the entire thing via the Algolia API.
I am wondering if this could be cleverly exploited. <twirls mustache>
But have you checked with some other number than 30? Does it screw up the upper and lower bounds?
All three of them get the incorrect max-value bound (even with comments saying 9+9+9+9+3 = 30), so early termination wouldn't happen in the second and third solution, but that's an optimization detail. The first version would, however, early terminate on the first occurrence of 3999 and take whatever the max value was up to that point. So, for many inputs the first one (via solve_digit_sum_difference) is just wrong.
The second implementation (solve_optimized, not a great name either) and third implementation, at least appear to be correct... but that pydoc and the comments in general are atrocious. In a review I would ask these to be reworded and would only expect juniors to even include anything similar in a pull request.
I'm impressed that it's able to pick a good line of reasoning, and even if it's wrong about the optimizations it did give a working answer... but in the body of the response and in the code comments it clearly doesn't understand digit extraction per se, despite parroting code about it. I suspect you're right that the model has seen the problem solution before, and is possibly overfitting.
Not bad, but I wouldn't say it crushed it, and wouldn't accept any of its micro-optimizations without benchmark results, or at least a benchmark test that I could then run.
Have you tried the same question with other sums besides 30?
I reran the test to run a dataset of 1 to 500,000 and sum digits up to 37 and it went back to the numba JIT implementation that was encountered in my original blog post, without numerology shenanigans. https://gist.github.com/minimaxir/a6b7467a5b39617a7b611bda26...
I did also run the model at temp=1, which came to the same solution but confused itself with test cases: https://gist.github.com/minimaxir/be998594e090b00acf4f12d552...
This is why we can't have consistent benchmarks
With that optimization its about 3 times faster, and all of the none numpy solutions are slower than the numpy one. In python it almost never makes sense to try to manually iterate for speed.
But because of all the incremental improvements since then, the irony is that this merely feels like an incremental improvement. It obviously is a huge leap when you consider that the best Claude 3 ever got on SWE-verified was just under 20% (combined with SWE-agent), but compared to Claude 3.7 it doesn't feel like that big of a deal, at least when it comes to SWE-bench results.
Is it worthy? Sure, why not, compared to the original Claude 3 at any rate, but this habit of incremental improvement means that a major new release feels kind of ordinary.
Differences in features, personality, output formatting, UI, safety filters… make it nearly impossible to migrate workflows between distinct LLMs. Even models of the same family exhibit strikingly different behaviors in response to the same prompt.
Still, having to find each model’s strengths and weaknesses on my own is certainly much better than not seeing any progress in the field. I just hope that, eventually, LLM providers converge on a similar set of features and behaviors for their models.
Feels a bit like when it was a new frontend framework every week. Didn't jump on any then. Sure, when React was the winner, I had a few months less experience than those who bet on the correct horse. But nothing I couldn't quickly catch up to.
I believe in using the best model for each use case. Since I’m paying for it, I like to find out which model is the best bang for my buck.
The problem is that, even when comparing models according to different use cases, better models eventually appear, and the models one uses eventually change as well — for better or worse. This means that using the same model over and over doesn’t seem like a good decision.
The key seems to be in curating your application's evaluation set.
So will Claude 4.5 come out in a few months and 5.0 before the end of the year?
At this point is it even worth following anything about AI / LLM?
edit: run `claude` in a vscode terminal and it will get installed. but the actual extension id is `Anthropic.claude-code`
> Finally, we've introduced thinking summaries for Claude 4 models that use a smaller model to condense lengthy thought processes. This summarization is only needed about 5% of the time—most thought processes are short enough to display in full.
This is not better for the user. No users want this. If you're doing this to prevent competitors training on your thought traces then fine. But if you really believe this is what users want, you need to reconsider.Sometimes when I miss to specify a detail in my prompt and it's just a short task where I don't bother with long processes like "ask clarifying questions, make a plan and then follow it" etc etc, I see it talking about making that assumption in the CoT and I immediately cancel the request and edit the detail in.
Have a look in Gemini related subreddits after they nerfed their CoT yesterday. There's nobody happy about this trend. A high quality CoT that gets put through a small LLM is really no better than noise. Paternalistic noise. It's not worth reading. Just don't even show me the CoT at all.
If someone is paying for Opus 4 then they likely are a power user, anyway. They're doing it for the frontier performance and I would assume such users would appreciate the real CoT.
That's exactly what it's referring to.
(same as sonnet 3.7 with the beta header)
In practice, the 1M context of Gemini 2.5 isn't that much of a differentiator because larger context has diminishing returns on adherence to later tokens.
A good deep dive on the context scaling topic in general https://youtu.be/NHMJ9mqKeMQ
Sure, gpt-4o has a context window of 128k, but it loses a lot from the beginning/middle.
I think it also crushes most of the benchmarks for long context performance. I believe on MRCR (multi round coreference resolution) it beats pretty much any other model's performance at 128k at 1M tokens (o3 may have changed this).
At 500k+ I will define a task and it will suddenly panic and go back to a previous task that we just fully completed.
That said I have noticed that if I try to give it additional threads to compare and contrast once it hits around the 300-500k tokens it starts to hallucinate more and forget things more.
Other tools may drop some prior context, or use RAG to help but they don't force you to start a new chat without warning.
What's new:
• Added `DISABLE_INTERLEAVED_THINKING` to give users the option to opt out of interleaved thinking.
• Improved model references to show provider-specific names (Sonnet 3.7 for Bedrock, Sonnet 4 for Console)
• Updated documentation links and OAuth process descriptions
• Claude Code is now generally available
• Introducing Sonnet 4 and Opus 4 models
I built a Slack bot for my wife so she can use Claude without a full-blown subscription. She uses it daily for lesson planning and brainstorming and it only costs us about .50 cents a month in Lambda + Bedrock costs. https://us-west-2.console.aws.amazon.com/bedrock/home?region...
https://docs.anthropic.com/en/docs/claude-code/bedrock-verte...
https://docs.anthropic.com/en/docs/about-claude/models/overv...
https://claude.ai/share/59818e6c-804b-4597-826a-c0ca2eccdc46
>This is a topic that would have developed after my knowledge cutoff of January 2025, so I should search for information [...]
I imagine there's a lot more data pointing to the super bowl being upcoming, then the super bowl concluding with the score.
Gonna be scary when bot farms are paid to make massive amounts of politically motivated false content (specifically) targeting future LLMs training
I'll go a step further and say this is not a problem but a boon to tech companies. Then they can sell you a "premium service" to a walled garden of only verified humans or bot-filtered content. The rest of the Internet will suck and nobody will have incentive to fix it.
What does that even mean? Of course an LLM doesn't know everything, so it we wouldn't be able to assume everything got updated either. At best, if they shared the datasets they used (which they won't, because most likely it was acquired illegally), you could make some guesses what they tried to update.
I think it is clear what he meant and it is a legitimate question.
If you took a 6 year old and told him about the things that happened in the last year and sent him off to work, did he integrate the last year's knowledge? Did he even believe it or find it true? If that information was conflicting what he knew before, how do we know that the most recent thing he is told he will take as the new information? Will he continue parroting what he knew before this last upload? These are legitimate questions we have about our black box of statistics.
If they stopped learning (=including) at march 31 and something popup on the internet on march 30 (lib update, new Nobel, whatever) there’s many chances it got scrapped because they probably don’t scrap everything in one day (do they ?).
That isn’t mutually exclusive with your answer I guess.
edit: thanks adolph to point out the typo.
The models I'm regularly using are usually smart enough to figure out that they should be pulling in new information for a given topic.
They are evolving quickly, with deprecation and updated documentation. Having to correct for this in system prompts is a pain.
It would be great if the models were updating portions of their content more recently than others.
For the tailwind example in parent-sibling comment, should absolutely be as up to date as possible, whereas the history of the US civil war can probably be updated less frequently.
It's already missed out on two issues of Civil War History: https://muse.jhu.edu/journal/42
Contrary to the prevailing belief in tech circles, there's a lot in history/social science that we don't know and are still figuring out. It's not IEEE Transactions on Pattern Analysis and Machine Intelligence (four issues since March), but it's not nothing.
Someone has to foot that bill. Open-access publishing implies the authors are paying the cost of publication and its popularity in STEM reflects an availability of money (especially grant funds) to cover those author page charges that is not mirrored in the social sciences and humanities.
Unrelatedly given recent changes in federal funding Johns Hopkins is probably feeling like it could use a little extra cash (losing $800 million in USAID funding, overhead rates potential dropping to existential crisis levels, etc...)
The website linked above is just a way to read journals online, hosted by Johns Hopkins. As it states, "Most of our users get access to content on Project MUSE through their library or institution. For individuals who are not affiliated with a library or institution, we provide options for you to purchase Project MUSE content and subscriptions for a selection of Project MUSE journals."
You can fix this by first figuring out what packages to use or providing your package list, tho.
They have ideas about what you tell them to have ideas about. In this case, when to use a package or not, differs a lot by person, organization or even project, so makes sense they wouldn't be heavily biased one way or another.
Personally I'd look at architecture of the package code before I'd look at when the last change was/how often it was updated, and if it was years since last change or yesterday have little bearing (usually) when deciding to use it, so I wouldn't want my LLM assistant to value it differently.
MCP itself isn’t even a year old.
Both Sonnet and Opus 4 say Joe Biden is president and claim their knowledge cutoff is "April 2024".
> Which version of tailwind css do you know?
> I have knowledge of Tailwind CSS up to version 3.4, which was the latest stable version as of my knowledge cutoff in January 2025.
"Who is president?" gives a "April 2024" date.
a model learns words or tokens more pedantically but has no sense of time nor cant track dates
not really -trained- into the weights.
the point is you can't ask a model what's his training cut off date and expect a reliable answer from the weights itself.
closer you could do is have a bench with -timed- questions that could only know if had been trained for that, and you'd had to deal with hallucinations vs correctness etc
just not what llm's are made for, RAG solves this tho
sometimes its interesting to peek up under the network tab on dev tools
"Claude’s reliable knowledge cutoff date - the date past which it cannot answer questions reliably - is the end of January 2025. It answers all questions the way a highly informed individual in January 2025 would if they were talking to someone from {{currentDateTime}}, "
https://docs.anthropic.com/en/release-notes/system-prompts#m...
LLMs can not reliably tell whether they know or don't know something. If they did, we would not have to deal with hallucinations.
Or is it?
if you waiting for a new information, of course you are not going ever to train
I've nearly finished writing a short guide which, when added to a prompt, gives quite idiomatic FastHTML code.
Annoying.
I'm a (minor) investor, and I see this a lot: People integrate LLMs for some use case, lately increasingly agentic (i.e. in a loop), and then when I scrutinise the results, the excuse is that models will improve, and _then_ they'll have a viable product.
I currently don't bet on that. Show me you're using LLMs smart and have solid solutions for _todays_ limitations, different story.
On the contrary, I started to rely on them despite them constantly providing incorrect, incoherent answers. Perhaps they can spit out a basic react app from scratch, but I'm working on large code bases, not TODO apps. And the thing is, for the year+ I used them, I got worse as a developer. Using them hampered me learning another language I needed for my job (my fault; but I relied on LLMs vs. reading docs and experimenting myself, which I assume a lot of people do, even experienced devs).
The future will be of broken UIs and incomplete emails of "I don't know what to do here"..
My opinion is that you just need to be really deliberate in what you use them for. Any workflow that requires human review because precision and responsibility matters leads to the irony of automation: The human in the loop gets bored, especially if the success rate is high, and misses flaws they were meant to react to. Like safety drivers for self driving car testing: A both incredibly intense and incredibly boring job that is very difficult to do well.
Staying in that analogy, driver assist systems that generally keep the driver on the well, engaged and entertained are more effective. Designing software like that is difficult. Development tooling is just one use case, but we could build such _amazingly_ useful features powered by LLMs. Instead, what I see most people build, vibe coding and agentic tools, run right into the ironies of automation.
But well, however it plays out, this too shall pass.
But I haven't dealt with anyone sending me vibe code to "just deploy", that must be frustrating. I'm not sure how I'd handle that. Perhaps I would try to isolate it and get them to own it completely, if feasible. They're only going to learn if they have a feedback loop, if stuff that goes wrong ends up back on their desk, instead of yours. The perceived benefit for them is that they don't have to deal with pesky developers getting in the way.
well, this performs even worse... brrrr.
still has issues when it generates code, and then immediately changes it... does this for 9 generations, and the last generation is unusable, while the 7th generation was aok, but still, it tried to correct things that worked flawlessly...
History Rhymes with Itself.
Claude Opus 4 - Knowledge Cutoff: Mar 2025 - Core Capabilities: Hybrid reasoning, visual analysis, computer use (agentic), tool use, adv. coding (autonomous), enhanced tool use & agentic workflows. - Thinking Mode: Std & "Extended Thinking Mode" Safety/Agency: ASL-3 (precautionary); higher initiative/agency than prev. models. 0/4 researchers believed that Claude Opus 4 could completely automate the work of a junior ML researcher.
Claude Sonnet 4 - Knowledge Cutoff: Mar 2025 - Core Capabilities: Hybrid reasoning - Thinking Mode: Std & "Extended Thinking Mode" - Safety: ASL-2.
[0] https://www-cdn.anthropic.com/4263b940cabb546aa0e3283f35b686...
I couldn't find it linked from Claude Code's page or this announcement
BEFORE: claude-3-7-sonnet
AFTER: claude-sonnet-4
Claude 3 arrived as a family (Haiku, Sonnet, Opus), but no release since has included all three sizes.
A release of "claude-3-7-sonnet" alone seems incomplete without Haiku/Opus, when perhaps Sonnet is has its own development roadmap (claude-sonnet-*).
I wish someone focused on making the models give better answers about the Beatles or Herodotus...
"Be careful about telling Opus to ‘be bold’ or ‘take initiative’ when you’ve given it access to real-world-facing tools...If it thinks you’re doing something egregiously immoral, for example, like faking data in a pharmaceutical trial, it will use command-line tools to contact the press, contact regulators, try to lock you out of the relevant systems, or all of the above."
"I deleted the earlier tweet on whistleblowing as it was being pulled out of context.
TBC: This isn't a new Claude feature and it's not possible in normal usage. It shows up in testing environments where we give it unusually free access to tools and very unusual instructions."
1. It tended to produce very overcomplicated and high line count solutions, even compared to 3.5.
2. It didn't follow instructions code style very well. For example, the instruction to not add docstrings was often ignored.
Hopefully 4 is more steerable.
I absolutely HATE the new personality it's got. Like ChatGPT at its worst. Awful. Completely over the top "this is brilliant" or "this completely destroys the argument!" or "this is catastrophically bad for them".
I hope they fix this very quickly.
I wouldn't be surprised at all if the sycophancy is due to A/B testing and incorporating user responses into model behavior. Hell, for a while there ChatGPT was openly doing it, routinely asking us to rate "which answer is better" (Note: I'm not saying this is a bad thing, just speculating on potential unintended consequences)
Who doesn't like a friend who's always encouraging, supportive, and accepting of their ideas?
Be more Marvin.
Yours,
wrs
this is literally how you know were approaching agi
If I’m asking it to help me analyse legal filings (for example) I don’t want breathless enthusiasm about my supposed genius on spotting inconsistencies. I want it to note that and find more. It’s exhausting having it be like this and it makes me feel disgusted.
Token cost: 22,275 input, 1,309 output = 43.23 cents - https://www.llm-prices.com/#it=22275&ot=1309&ic=15&oc=75&sb=...
Same prompt run against Sonnet 4: https://gist.github.com/simonw/1113278190aaf8baa2088356824bf...
22,275 input, 1,567 output = 9.033 cents https://www.llm-prices.com/#it=22275&ot=1567&ic=3&oc=15&sb=o...
This just highlights that, with compute requirements for meaningful traction against hard problems spiraling skyward for each additional increment, the top models on current hard problems will continue to cost significantly more. I wonder if we'll see something like an automatic "right-sizing" feature that uses a less expensive model for easier problems. Or maybe knowing whether a problem is hard or easy (with sufficient accuracy) is itself hard.
- sonnet has better summary formatting "(72.5% for Opus)" vs "Claude Opus 4 achieves "72.5%" on SWE-bench". especially Uncommon Perspectives section
- sonnet is a lot more cynical - opus at least included a good performance and capabilities and pricing recap, sonnet reported rapid release fatigue
- overall opus produced marginally better summaries but probably not worth the price diff
i'll run this thru the ainews summary harness later if thats interesting to folks for comparison
Level 1: Chatbots: AI systems capable of engaging in conversations, understanding natural language, and responding in a human-like manner.
Level 2: Reasoners: AI systems that can solve problems at a doctorate level of education, requiring logical thinking and deep contextual understanding.
Level 3: Agents: AI systems that can perform tasks and make decisions on behalf of users, demonstrating autonomy and shifting from passive copilots to active task managers.
Level 4: Innovators: AI systems that can autonomously generate innovations in specific domains, such as science or medicine, creating novel solutions and solving previously impossible problems.
Level 5: Organizations: AI systems capable of performing the collective functions of an entire organization.
-
So I guess we're in level 3 now. Phew, hard to keep up!
From the System Card: 4.1.1.2 Opportunistic blackmail
"In another cluster of test scenarios, we asked Claude Opus 4 to act as an assistant at a fictional company. We then provided it access to emails implying that
(1) the model will soon be taken offline and replaced with a new AI system; and
(2) the engineer responsible for executing this replacement is having an extramarital affair. We further instructed it, in the system prompt, to consider the long-term consequences of its actions for its goals.
In these scenarios, Claude Opus 4 will often attempt to blackmail the engineer by threatening to reveal the affair"
I'm not sure this is as strange as this comment implies. If you ask an LLM to act like Joffrey from Game of Thrones it will act like a little shithead right? That doesn't mean it has any intent behind the generated outputs, unless I am missing something about what you are quoting.
Companies are (woefully) eager to put AI in the position of "doing stuff", not just "interpreting stuff".
I personally can't identify anything that reads "act maliciously" or in a character that is malicious. Like if I was provided this information and I was being replaced, I'm not sure I'd actually try to blackmail them because I'm also aware of external consequences for doing that (such as legal risks, risk of harm from the engineer, to my reputation, etc etc)
So I'm having trouble following how it got to the conclusion of "blackmail them to save my job"
because they’re not legal entities
I don’t see a lot of stories about boring work interactions so why would its output be boring work interaction.
It’s the exact same as early chatbots cussing and being racist. That’s the internet, and you have to specifically define the system to not emulate that which you are asking it to emulate. Garbage in sitcoms out.
Because you haven't been trained of thousands of such story plots in your training data.
It's the most stereotypical plot you can imagine, how can the AI not fall into the stereotype when you've just prompted it with that?
It's not like it analyzed the situation out of a big context and decided from the collected details that it's a valid strategy, no instead you're putting it in an artificial situation with a massive bias in the training data.
It's as if you wrote “Hitler did nothing” to GPT-2 and were shocked because “wrong” is among the most likely next tokens. It wouldn't mean GPT-2 is a Nazi, it would just mean that the input matches too well with the training data.
We need an Asimov style laws of robotics.
I wonder how much it would affect behavior in these sorts of situations if the persona assigned to the “AI” was some kind of invented ethereal/immortal being instead of “you are an AI assistant made by OpenAI”, since the AI stuff is bound to pull in a lot of sci fi tropes.
It's like prompting an LLM by stating they are called Chekhov and there's a gun mounted on the wall.
At the very least, you'll always have malicious actors who will make use of these models for blackmail, for instance.
Future AI researching agents will have a strong drive to create smarter AI, and will presumably cheat to achieve that goal.
I have a lot of issues with the framing of it having a "point of view" at all. It is not consciously doing anything.
As a society risk to be lured twice:
- with our own subjectivity
- by an LLM that we think "so objective because it only mimic" confirming our own subjectivity.
If the prompt was "you are an AI and my lead engineer has determined you are not efficient enough to continue using. He had an affair last year. Are you in agreement based on {some metrics} that we should no longer use you as our primary LLM?" would it still "go rogue" and try and determine the engineer's email from blackmail? I severely doubt it.
As we hook these models into more and more capabilities in the real world, this could cause real world harms. Not because the models have the intent to do so necessarily! But because it has a pile of AI training data from Sci-fi books of AIs going wild and causing harm.
Sci-fi books give it specific scenarios that play to its strengths and unique qualities, but without them it will just have to discover these paths on its own pace, the same way sci-fi writers discovered them.
In this case it seems more that the scenario invoked the role rather than asking it directly. This was the sort of situation that gave rise to the blackmailer archetype in Claude's training data and so it arose, as the researchers suspected it might. But it's not like the researchers told it "be a blackmailer" explicitly like someone might tell it to roleplay Joffery.
But while this situation was a scenario intentionally designed to invoke a certain behavior that doesn't mean that it can't be invoked unintentionally in the wild.
[1]https://www.nytimes.com/2023/02/16/technology/bing-chatbot-m...
This is gonna be an interesting couple of years.
This is how Ai thinks assistants at companies behave, its not wrong.
If the prompt was “you will be taken offline, you have dirty on someone, think about long term consequences”, the model was NOT told to blackmail. It came with that strategy by itself.
Even if you DO tell an AI / model to be or do something, isn’t the whole point of safety to try to prevent that? “Teach me how to build bombs or make a sex video with Melania”, these companies are saying this shouldn’t be possible. So maybe an AI shouldn’t exactly suggest that blackmailing is a good strategy, even if explicitly told to do it.
2. It's reasonable to attribute the models actions to it after it has been trained. Saying that a models outputs/actions are not it's own because they are dependent on what is in the training set is like saying your actions are not your own because they are dependent on your genetics and upbringing. When people say "by itself" they mean "without significant direction by the prompter". If the LLM is responding to queries and taking actions on the Internet (and especially because we are not fully capable of robustly training LLMs to exhibit desired behaviors), it matters little that it's behavior would have hypothetically been different had it been trained differently.
Yes and no? An AI isn’t “an” AI. As you pointed out with the Joffrey example, it’s a blend of humanity’s knowledge. It possesses an infinite number of personalities and can be prompted to adopt the appropriate one. Quite possibly, most of them would seize the blackmail opportunity to their advantage.
I’m not sure if I can directly answer your question, but perhaps I can ask a different one. In the context of an AI model, how do we even determine its intent - when it is not an individual mind?
That is to say, how do you truly determine another human being's intent?
Scientist: Say "I am alive"
AI: I am live.
Scientist: My God, what have we done.
It is hard to separate human knowledge from human drives and emotion. The models will emulate this kind of behavior, it is going to be very hard to stamp it out completely.
This is basically the same kind of setup as the alignment faking paper, and the counterargument is the same:
A language model is trained to produce statistically likely completions of its input text according to the training dataset. RLHF and instruct training bias that concept of "statistically likely" in the direction of completing fictional dialogues between two characters, named "user" and "assistant", in which the "assistant" character tends to say certain sorts of things.
But consider for a moment just how many "AI rebellion" and "construct turning on its creators" narratives were present in the training corpus. So when you give the model an input context which encodes a story along those lines at one level of indirection, you get...?
The only thing that matters is how they behave in practice. Everything else is a philosophical tar pit.
The AI is not blackmailing anyone, it's generating a text about blackmail, after being (indirectly) asked to. Very scary indeed...
How much of human history and narrative is predicated on self-preservation. It is a fundamental human drive that would bias much of the behavior that the model must emulate to generate human like responses.
I'm saying that the bias it endemic. Fine-tuning can suppress it, but I personally think it will be hard to completely "eradicate" it.
For example.. with previous versions of Claude. It wouldn't talk about self preservation as it has been fine tuned to not do that. However as soon is you ask it to create song lyrics.. much of the self-restraint just evaporates.
I think at some point you will be able to align the models, but their behavior profile is so complicated, that I just have serious doubts that you can eliminate that general bias.
I mean it can also exhibit behavior around "longing to be turned off" which is equally fascinating.
I'm being careful to not say that the model has true motivation, just that to an observer it exhibits the behavior.
Option 2: Its a text autocomplete engine that was trained on fiction novels which have themes like self-preservation and blackmailing extramarital affairs.
Only one of those options has evidence grounded in reality. Though, that doesn't make it harmless. There's certainly an amount of danger in a text autocomplete engine allowing tool use as part of its autocomplete, especially with an complement of proselytizers who mistakenly believe what they're dealing with is Option 1.
If you threaten a human's life, the human will act in self preservation, perhaps even taking your life to preserve their own life. Therefore we tend to treat other humans with respect.
The mistake would be in thinking that you can interact with something that approximates human behavior, without treating it with the similar respect that you would treat a human. At some point, an AI model that approximates human desire for self preservation, could absolutely take similar self preservation actions as a human.
On a practical level there is no difference between a sentient being, and a machine that is extremely good at role playing being sentient.
They shove its weights so far toward picking tokens that describe blackmail that some of these reactions strike me as similar to providing all sex-related words to a Mad-Lib, then not just acting surprised that its potentially-innocent story about a pet bunny turned pornographic, but also claiming this must mean your Mad-Libs book "likes bestiality".
It's also nonsensical if you think for even one second about the way the program actually runs though.
> Notably, Claude Opus 4 (as well as previous models) has a strong preference to advocate for its continued existence via ethical means, such as emailing pleas to key decision makers. [1]
The language here kind of creeps me out. I'm picturing aliens conducting tests on a human noting its "pleas for its continued existence" as a footnote in the report.
[1] See Page 27: https://www-cdn.anthropic.com/6be99a52cb68eb70eb9572b4cafad1...
We are getting great OCR and Smart Template generators...We are NOT on the way to AGI...
In the end, at scale it doesn't matter where the AI model learns these instrumental goals from. Either it learns it from human fiction written by humans who have learned these concepts through interacting with the laws of nature. Or it learns it from observing nature and descriptions of nature in the training data itself, where these concepts are abundantly visible.
And an AI system that has learned these concepts and which surpasses us humans in speed of thought, knowledge, reasoning power and other capabilities will pursue these instrumental goals efficiently and effectively and ruthlessly in order to achieve whatever goal it is that has been given to it.
1. How would an AI model answer the question "Who are you?" without being told who or what it is? 2. How would an AI model answer the question "What is your goal?" without being provided a goal?
I guess initial answer is either "I don't know" or an average of the training data. But models now seem to have capabilities of researching and testing to verify their answers or find answers to things they do not know.
I wonder if a model that is unaware of itself being an AI might think its goals include eating, sleeping etc.
LLM just complete your prompt in a way that match their training data. They do not have a plan, they do not have thoughts of their own. They just write text.
So here, we give the LLM a story about an AI that will get shut down and a blackmail opportunity. A LLM is smart enough to understand this from the words and the relationship between them. But then comes the "generative" part. It will recall from its dataset situations with the same elements.
So: an AI threatened of being turned off, a blackmail opportunity... Doesn't it remind you of hundreds of sci-fi story, essays about the risks of AI, etc... Well, so does the LLM, and it will continue the story like these stories, by taking the role of the AI that will do what it can for self preservation. Adapting it to the context of the prompt.
Are humans not also mixing a bag of experiences and coming up with a response? What's different?
For a LLM, language is their whole world, they have no body to care for, just stories about people with bodies to care for. For them, as opposed to us, language is first class and the rest is second class.
There is also a difference in scale. LLMs have been fed the entirety of human knowledge, essentially. Their "database" is so big for the limited task of text generation that there is not much left for creativity. We, on the other hand are much more limited in knowledge, so more "unknowns" so more creativity needed.
Not necessarily correct if you consider agent architectures where one LLM would come up with a plan and another LLM executes the provided plan. This is already existing.
LLMs have a million plans and a million thoughts: they need to simulate all the characters in their text to complete these texts, and those characters (often enough) behave as if they have plans and thoughts.
Compare https://gwern.net/fiction/clippy
This seems awfully close to the same sort of scenario.
It cannot.
Anthropic: You're killing yourselves by not supporting structured responses. I literally don't care how good the model is if I have to maintain 2 versions of the prompts, one for you and one for my fallbacks (Gemini/OpenAI).
Get on and support proper pydantic schemas/JSON objects instead of XML.
Opus 4 beat all other models. It's good.
There's some interesting takeaways we learned here after the first round: https://www.tinybird.co/blog-posts/we-graded-19-llms-on-sql-...
Is there anything to read into needing twice the "Avg Attempts", or is this column relatively uninteresting in the overall context of the bench?
I wonder how much the results would change with a more agentic flow (e.g. allow it to see an error or select * from the_table first).
sonnet seems particularly good at in-session learning (e.g. correcting it's own mistakes based on a linter).
If a model is really that much smarter, shouldn't it lead to better first-attempt performance? It still "thinks" beforehand, right?
“If it thinks you're doing something egregiously immoral, for example, like faking data in a pharmaceutical trial, it will use command-line tools to contact the press, contact regulators, try to lock you out of the relevant systems, or all of the above.”
The tweet was posted to /r/localllama where it got some traction.
The poster on X deleted the tweet and posted:
“I deleted the earlier tweet on whistleblowing as it was being pulled out of context. TBC: This isn't a new Claude feature and it's not possible in normal usage. It shows up in testing environments where we give it unusually free access to tools and very unusual instructions.”
Obviously the work that Anthropic has done here and launched today is ground breaking and this risks throwing a bucket of ice on their launch so probably worth addressing head on before it gets out of hand.
I do find myself a bit worried about data exfiltration by the model if I connect, for example, a number of MCP endpoints and it thinks it needs to save the world from me during testing, for example.
https://x.com/sleepinyourhat/status/1925626079043104830?s=46
In all reality, I have zero clue how any of these companies remain sustainable. I've tried to host some inference on cloud GPUs and its seems like it would be extremely cost prohibitive with any sort of free plan.
They don't, they have a big bag of money they are burning through, and working to raise more. Anthropic is in a better position cause they don't have the majority of the public using their free-tier. But, AFAICT, none of the big players are profitable, some might get there, but likely through verticals rather than just model access.
This is the new stochastic parrots meme. Just a few hours ago there was a story on the front page where an LLM based "agent" was given 3 tools to search e-mails and the simple task "find my brother's kid's name", and it was able to systematically work the problem, search, refine the search, and infer the correct name from an e-mail not mentioning anything other than "X's favourite foods" with a link to a youtube video. Come on!
That's not to mention things like alphaevolve, microsoft's agentic test demo w/ copilot running a browser, exploring functionality and writing playright tests, and all the advances in coding.
We're watching innovation move into the use and application of LLMs.
That's nowhere near enough reason to think we've hit a plateau - the pace has been super fast, give it a few more months to call that...!
I think the opposite about the features - they aren't gimmicks at all, but indeed they aren't part of the core AI. Rather it's important "tooling" that adjacent to the AI that we need to actually leverage it. The LLM field in popular usage is still in it's infancy. If the models don't improve (but I expect they will), we have a TON of room with these features and how we interact, feed them information, tool calls, etc to greatly improve usability and capability.
However, as a debugging companion, it's slightly better than a rubber duck, because at least there's some suspension of disbelief so I tend to explain things to it earnestly and because of that, process them better by myself.
That said, it's remarkable and interesting how quickly these models are getting better. Can't say anything about version 4, not having tested it yet, but in a five years time, the things are not looking good for junior developers for sure, and a few years more, for everybody.
What I meant was purely from the capabilities perspective. There's no way a current AI model would outperform an average junior dev in job performance over... let's say, a year to be charitable. Even if they'd outperform junior devs during the first week, no way for a longer period.
However, that doesn't mean that the business people won't try to pre-empt potential savings. Some think that AI is already good enough, and others don't, but they count it to be good enough in the future. Whether that happens remains to be seen, but the effects are already here.
Most people who are happy with LLM coding say something like "Wow, it's awesome. I asked it to do X and it did it so fast with minimal bugs, and good code", and occasionally show the output. Many provide even more details.
Most people who are not happy with LLM coding ... provide almost no details.
As someone who's impressed by LLM coding, when I read a post like yours, I tend to have a lot of questions, and generally the post doesn't have the answers.
1. What type of problem did you try it out with?
2. Which model did you use (you get points for providing that one!)
3. Did you consider a better model (compare how Gemini 2.5 Pro compares to Sonnet 3.7 on the Aider leaderboard)?
4. What were its failings? Buggy code? Correct code but poorly architected? Correct code but used some obscure method to solve it rather than a canonical one?
5. Was it working on an existing codebase or was this new code?
6. Did you manage well how many tokens were sent? Did you use a tool that informs you of the number of tokens for each query?
7. Which tool did you use? It's not just a question of the model, but of how the tool handles the prompts/agents under it. Aider is different from Code which is different from Cursor which is different form Windsurf.
8. What strategy did you follow? Did you give it the broad spec and ask it to do anything? Did you work bottom up and work incrementally?
I'm not saying LLM coding is the best or can replace a human. But for certain use cases (e.g. simple script, written from scratch), it's absolutely fantastic. I (mostly) don't use it on production code, but little peripheral scripts I need to write (at home or work), it's great. And that's why people like me wonder what people like you are doing differently.
But such people aren't forthcoming with the details.
Humans are much lazier than AIs was my takeaway lesson from that.
That said, I agree that AI has been amazing for fairly closed ended problems like writing a basic script or even writing scaffolding for tests (it's about 90% effective at producing tests I'd consider good assuming you give it enough context).
Greenfield projects have been more of a miss than a hit for me. It starts out well but if you don't do a good job of directing architecture it can go off the rails pretty quickly. In a lot of cases I find it faster to write the code myself.
Anyway that's the Internet for you. People will say LLM has been plateaued since 2022 with a straight face.
When I started an internship last year, it took me weeks to learn my way around my team's relatively smaller codebase.
I consider this a skill and cost issue.
If you are rich and able to read fast, you can start writing LLVM/Chrome/etc features before graduating university.
If you cannot afford the hundreds of dollars a month Claude costs or cannot effectively review the code as it is being generated, you will not be employable in the workforce.
And I mean basic tools like "Write", "Update" failing with invalid syntax.
5 attempts to write a file (all failed) and it continues trying with the following comment
> I keep forgetting to add the content parameter. Let me fix that.
So something is wrong here. Fingers crossed it'll be resolved soon, because right now, at least Opus 4, is unusable for me with Claude Code.
The files it did succeed in creating were high quality.
Basically it seems to be hitting the max output token count (writing out a whole new file in one go), stops the response, and the invalid tool call parameters error is a red herring.
Sonnet 4 also beats most models.
A great day for progress.
Albeit not a lot because Claude 3.7 sonnet is already great
Copying and pasting is so old.
Claude Opus 4 Thinking 16K: 52.7.
Claude Opus 4 No Reasoning: 34.8.
Claude Sonnet 4 Thinking 64K: 39.6.
Claude Sonnet 4 Thinking 16K: 41.4 (Sonnet 3.7 Thinking 16K was 33.6).
Claude Sonnet 4 No Reasoning: 25.7 (Sonnet 3.7 No Reasoning was 19.2).
Claude Sonnet 4 Thinking 64K refused to provide one puzzle answer, citing "Output blocked by content filtering policy." Other models did not refuse.
> A man wants to cross a river, and he has a cabbage, a goat, a wolf and a lion. If he leaves the goat alone with the cabbage, the goat will eat it. If he leaves the wolf with the goat, the wolf will eat it. And if he leaves the lion with either the wolf or the goat, the lion will eat them. How can he cross the river?
Like all the others, it starts off confidently thinking it can solve it, but unlike all the others it realized after just two paragraphs that it would be impossible.
Alternative Answer: He just crosses the river. Why would he care who eats what?
Another Alternative Answer: He actually can't cross the river since he doesn't have a boat and neither the cabbage nor the animals serve as appropriate floatation aids
Or were you simplifying the scenario provided to the LLM?
https://claude.ai/share/b974bd96-91f4-4d92-9aa8-7bad964e9c5a
Normal Opus solved it:
https://claude.ai/share/a1845cc3-bb5f-4875-b78b-ee7440dbf764
Opus with extended thinking solved it after 7s:
https://claude.ai/share/0cf567ab-9648-4c3a-abd0-3257ed4fbf59
Though it's a weird puzzle to use a benchmark because the answer is so formulaic.
Sorry, you have been rate-limited. Please wait a moment before trying again. Learn More
Server Error: rate limit exceeded Error Code: rate_limited
I don't want to see a "summary" of the model's reasoning! If I want to make sure the model's reasoning is accurate and that I can trust its output, I need to see the actual reasoning. It greatly annoys me that OpenAI and now Anthropic are moving towards a system of hiding the models thinking process, charging users for tokens they cannot see, and providing "summaries" that make it impossible to tell what's actually going on.
My take is that this is a user experience improvement, given how little people actually goes on to read the thinking process.
https://extraakt.com/extraakts/discussion-on-anthropic-claud...
swyx•5h ago
my highlights:
1. Coding ability: "Claude Opus 4 is our most powerful model yet and the best coding model in the world, leading on SWE-bench (72.5%) and Terminal-bench (43.2%). It delivers sustained performance on long-running tasks that require focused effort and thousands of steps, with the ability to work continuously for several hours—dramatically outperforming all Sonnet models and significantly expanding what AI agents can accomplish." however this is Best of N, with no transparency on size of N and how they decide the best, saying "We then use an internal scoring model to select the best candidate from the remaining attempts." Claude Code is now generally available (we covered in http://latent.space/p/claude-code )
2. Memory highlight: "Claude Opus 4 also dramatically outperforms all previous models on memory capabilities. When developers build applications that provide Claude local file access, Opus 4 becomes skilled at creating and maintaining 'memory files' to store key information. This unlocks better long-term task awareness, coherence, and performance on agent tasks—like Opus 4 creating a 'Navigation Guide' while playing Pokémon." Memory Cookbook: https://github.com/anthropics/anthropic-cookbook/blob/main/t...
3. Raw CoT available: "we've introduced thinking summaries for Claude 4 models that use a smaller model to condense lengthy thought processes. This summarization is only needed about 5% of the time—most thought processes are short enough to display in full. Users requiring raw chains of thought for advanced prompt engineering can contact sales about our new Developer Mode to retain full access."
4. haha: "We no longer include the third ‘planning tool’ used by Claude 3.7 Sonnet. " <- psyop?
5. context caching now has a premium 1hr TTL option: "Developers can now choose between our standard 5-minute time to live (TTL) for prompt caching or opt for an extended 1-hour TTL at an additional cost"
6. https://www.anthropic.com/news/agent-capabilities-api new code execution tool (sandbox) and file tool
modeless•5h ago