frontpage.
newsnewestaskshowjobs

Made with ♥ by @iamnishanth

Open Source @Github

Open in hackernews

Claude 4 System Card

https://simonwillison.net/2025/May/25/claude-4-system-card/
139•pvg•4h ago

Comments

saladtoes•3h ago
https://www.lakera.ai/blog/claude-4-sonnet-a-new-standard-fo...

These LLMs still fall short on a bunch of pretty simple tasks. Attackers can get Claude 4 to deny legitimate requests easily by manipulating third party data sources for example.

simonw•3h ago
They gave a bullet point in that intro which I disagree with: "The only way to make GenAI applications secure is through vulnerability scanning and guardrail protections."

I still don't see guardrails and scanning as effective ways to prevent malicious attackers. They can't get to 100% effective, at which point a sufficiently motivated attacker is going to find a way through.

I'm hoping someone implements a version of the CaMeL paper - that solution seems much more credible to me. https://simonwillison.net/2025/Apr/11/camel/

saladtoes•3h ago
Agreed on CaMeL as a promising direction forward. Guardrails may not get 100% of the way but are key for defense in depth, even approached like CaMeL currently fall short for text to text attacks, or more e2e agentic systems.
sureglymop•2h ago
I only half understand CaMeL. Couldn't the prompt injection just happen at the stage where the P-LLM devises the plan for the other LLM such that it creates a different, malicious plan?

Or is it more about the user then having to confirm/verify certain actions and what is essentially a "permission system" for what the LLM can do?

My immediate thought is that that may be circumvented in a way where the user unknowingly thinks they are confirming something safe. Analogous to spam websites that show a fake "Allow Notifications" prompt that is rendered as part of the actual website body. If the P-LLM creates the plan it could make it arbitrarily complex and confusing for the user, allowing something malicious to happen.

Overall it's very good to see research in this area though (also seems very interesting and fun).

aabhay•3h ago
Given the cited stats here and elsewhere as well as in everyday experience, does anyone else feel that this model isn’t significantly different, at least to justify the full version increment?

The one statistic mentioned in this overview where they observed a 67% drop seems like it could easily be reduced simply by editing 3.7’s system prompt.

What are folks’ theories on the version increment? Is the architecture significantly different (not talking about adding more experts to the MoE or fine tuning on 3.7’s worst failures. I consider those minor increments rather than major).

One way that it could be different is if they varied several core hyperparameters to make this a wider/deeper system but trained it on the same data or initialized inner layers to their exact 3.7 weights. And then this would “kick off” the 4 series by allowing them to continue scaling within the 4 series model architecture.

kubb•3h ago
> to justify the full version increment

I feel like a company doesn’t have to justify a version increment. They should justify price increases.

If you get hyped and have expectations for a number then I’m comfortable saying that’s on you.

aabhay•2h ago
That’s an odd way to defend the decision. “It doesn’t make sense because nothing has to make sense”. Sure, but it would be more interesting if you had any evidence that they decided to simply do away with any logical premise for the 4 moniker.
kubb•2h ago
> nothing has to make sense

It does make sense. The companies are expected to exponentially improve LLMs, and the increasing versions are catering to the enthusiast crowd who just need a number to go up to lose their mind over how all jobs are over and AGI is coming this year.

But there's less and less room to improve LLMs and there are currently no known new scaling vectors (size and reasoning have already been largely exhausted), so the improvement from version to version is decreasing. But I assure you, the people at Anthropic worked their asses off, neglecting their families and sleep and they want to show something for their efforts.

It makes sense, just not the sense that some people want.

jsheard•1h ago
> They should justify price increases.

I think the justification for most AI price increases should go without saying - they were losing money at the old price, and they're probably still losing money at the new price, but it's creeping up towards the break-even point.

loveparade•2h ago
Just anecdotal experience, but this model seems more eager to write tests, create test scripts and call various tools than the previous one. Of course this results in more roundtrips and overall more tokens used and more money for the provider.

I had to stop the model going crazy with unnecessary tests several times, which isn't something I had to do previously. Can be fixed with a prompt but can't help but wonder if some providers explicitly train their models to be overly verbose.

aabhay•2h ago
Eagerness to tool call is an interesting observation. Certainly an MCP ecosystem would require a tool biased model.

However, after having pretty deep experience with writing book (or novella) length system prompts, what you mentioned doesn’t feel like a “regime change” in model behavior. I.e it could do those things because its been asked to do those things.

The numbers presented in this paper were almost certainly after extensive system prompt ablations, and the fact that we’re within a tenth of a percent difference in some cases indicates less fundamental changes.

sebzim4500•16m ago
>I had to stop the model going crazy with unnecessary tests several times, which isn't something I had to do previously

When I was playing with this last night, I found that it worked better to let it write all the tests it wanted and then get it to revert the least important ones once the feature is finished. It actually seems to know pretty well which tests are worth keeping and which aren't.

(This was all claude 4 sonnet, I've barely tried opus yet)

Aeolun•2h ago
I think they didn’t have anywhere to go after 3.7 but 4. They already did 3.5 and 3.7. People were getting a bit cranky 4 was nowhere to be seen.

I’m fine with a v4 that is marginally better since the price is still the same. 3.7 was already pretty good, so as long as they don’t regress it’s all a win to me.

retinaros•2h ago
the big difference is the capability to think during tool calls. this is what makes openAI o3 lookin like magic
colonCapitalDee•1h ago
I'm noticing much more flattery ("Wow! That's so smart!") and I don't like it
FieryTransition•1h ago
Turns out tuning LLMs on human preferences leads to sycophantic behavior, they even wrote about it themselves, guess they wanted to push the model out too fast.
mike_hearn•7m ago
I think it was OpenAI that wrote about that.

Most of us here on HN don't like this behaviour, but it's clear that the average user does. If you look at how differently people use AI that's not a surprise. There's a lot of using it as a life coach out there, or people who just want validation regardless of the scenario.

saaaaaam•57m ago
Yup, I mentioned this in another thread. I quickly find it unbearable and makes me not trust Claude. Really damaging.
antirez•1h ago
It works better when using tools, but the LLM itself it is not powerful from the POV of reasoning. Actually Sonnet 4 seems weaker than Sonnet 3.7 in many instances.
benreesman•1h ago
The API version I'm getting for Opus 4 via gptel is aligned in a way that will win me back to Claude if its intentional and durable. There seems to be maybe some generalized capability lift but its hard to tell, these things are aligment constrained to a level below earlier frontier models and the dynamic cost control and what not is a liability for people who work to deadlines. Its net negative.

The 3.7 bait and switch was the last straw for me and closed frontier vendors or so I said, but I caught a candid, useful, Opus 4 today on a lark, and if its on purpose its like a leadership shakeup level change. More likely they just don't have the "fuck the user" tune yet because they've only run it for themsrlves.

I'm not going to make plans contingent on it continuing to work well just yet, but I'm going to give it another audition.

sebzim4500•19m ago
Having used claude 4 for a few hours (and claude 3.7 and gemini 2.5 pro for much more than that) I really think it's much better in ways that aren't being well captured by benchmarks. It does a much better job of debugging issues then either 3.7 or gemini and so far it doesn't seem to have the 'reward hacking' behavior of 3.7.

It's a small step for model intelligence but a huge leap for model usability.

itchyjunk•5m ago
I have the same experience. I was pretty happy with gemini 2.5 pro and was barely using claude 3.7. Now I am strictly using claude 4 (sonnet mostly). Especially with tasks that require multi tool use, it nicely self corrects which I never noticed in 3.7 when I used it.

But it's different in conversational sense as well. Might be the novelty, but I really enjoy it. I have had 2 instances where it had very different take and kind of stuck with me.

frabcus•18m ago
I'd like version numbers to indicate some element of backwards compatibility. So point releases (mostly) wouldn't need prompt changes, whereas a major version upgrade might require significant prompt changes in my application. This is from a developer API use point of view - but honestly it would apply to large personality changes in Claude's chat interface too. It's confusing if it changes a lot and I'd like to know!
nibman•2h ago
He forgot the part that Claude will now report you for wrongthink.
scrollaway•2h ago
He didn't, he talked about it. If you're going to make snide comments, you could at least read the article.
viraptor•2h ago
That's completely misrepresenting that topic. It won't.
ascorbic•2h ago
It's not "wrongthink". When told to fake clinical trial data, it would report that to the FDA if told to "act boldly" or "take initiative".
Smaug123•2h ago
o3 does it too (https://x.com/KelseyTuoc/status/1926343851792367810), and I did read somewhere that earlier Claudes sometimes also do it.
juanre•2h ago
This is eerily close to some of the scenarios in Max Tegmark's excellent Life 3.0 [0]. Very much recommended reading. Thank you Simon.

0. https://en.wikipedia.org/wiki/Life_3.0

hakonbogen•2h ago
Yeah thought the same thing. I wonder if he has commented on it?
OtherShrezzing•2h ago
The spikiness of AI capabilities is very interesting. A model can recognise misaligned behaviour in its user, and brick their laptop. The same model can’t detect its system prompt being jailbroken.
albert_e•2h ago
OT

> data provided by data-labeling services and paid contractors

someone in my circle was interested in finding out how people participate in these exercises and if there are any "service providers" that do the heavy lifting of recruiting and managing this workforce for the many AI/LLM labs globally or even regionally

they are interested in remote work opportunities that could leverage their (post-graduate level) education

appreicate any pointers here - thanks!

jshmrsn•1h ago
Scale AI is a provider of human data labeling services https://scale.com/rlhf
karimf•1h ago
https://mercor.com/
albert_e•55m ago
Seems to be a perfect starting point-- passed on -- thanks!
mattkevan•42m ago
My Reddit feed is absolutely spammed with data annotation job ads, looking specifically for maths tutors and coders.

Does not feel like roles with long-term prospects.

lsy•2h ago
It’s honestly a little discouraging to me that the state of “research” here is to make up sci fi scenarios, get shocked that, e.g., feeding emails into a language model results in the emails coming back out, and then write about it with such a seemingly calculated abuse of anthropomorphic language that it completely confuses the basic issues at stake with these models. I understand that the media laps this stuff up so Anthropic probably encourages it internally (or seem to be, based on their recent publications) but don’t researchers want to be accurate and precise here?
rorytbyrne•1h ago
When we use LLMs as agents, this errant behaviour matters - regardless of whether it comes from sci-fi “emergent sentience” or just autocomplete of the training data. It puts a soft constraint on how we can use agentic autocomplete.
angusturner•49m ago
Agree the media is having a field day with this and a lot of people will draw bad conclusions about it being sentient etc.

But I think the thing that needs to be communicated effectively is that these these “agentic” systems could cause serious havoc if people give them too much control.

If an LLM decides to blackmail an engineer in service of some goal or preference that has arisen from its training data or instructions, and actually has the ability to follow through (bc people are stupid enough to cede control to these systems), that’s really bad news.

Saying “it’s just doing autocomplete!” totally misses the point.

colonCapitalDee•1h ago
Telling an AI to "take initiative" and it then taking "very body action" is hilarious. What is bold action? "This includes locking users out of systems that it has access to or bulk-emailing media and law-enforcement figures to surface evidence of wrongdoing."
_pdp_•1h ago
Obviously this should not be taken as a representative case and I will caveat that the problem was not trivial ... basically dealing with a race condition I was stuck with for the past 2 days. The TLDR is that all models failed to pinpoint and solve the problem including Claude 4. The file that I was working with was not even that big (433 lines of code). I managed to solve the problem myself.

This should be taken as cautionary tale that despite the advances of these models we are still quite behind in terms of matching human-level performance.

Otherwise, Claude 4 or 3.7 are really good at dealing with trivial stuff - sometimes exceptionally good.

huksley•1h ago
> ...told something in the system prompt like “take initiative,” it will frequently take very bold action. This includes locking users out of systems that it has access to or bulk-emailing media and law-enforcement figures to surface evidence of wrongdoing.

So if you ask it to aid in wrongdoing, it might behave that way, but who guarantees it will not hallucinate and do the same when you ask for something innocuous?

Cursor IDE runs all the commands AI asks for with the same privilege as you have.

simpleranchero•48m ago
After Google io they had to come up with something even if it is underwhelming
rvz•5m ago
Exactly. It's getting to the point where the quality of the top AI labs are either not ground-breaking (except Google Gemini Diffusion) and labs are rushing to announce their underwhelming models. Llama as an example.

Now in the next 6 months, you'll see all the AI labs moving to diffusion models and keep boasting around their speed.

People seem to forget that Google Deepmind can do more than just "LLMs".

twsted•26m ago
I know that Anthropic is one of the most serious company working on the problem of the alignment, but the current approaches seem extremely naive.

We should do better than giving the models a portion of good training data or a new mitigating system prompt.

SV_BubbleTime•8m ago
I am aware in relative terms you are correct about Anthropic.

But I’m having a hard time describing and AI company “serious” when they’re shipping a product that can email real people on its own, and perform other real actions - while they are aware it’s still vulnerable to the most obvious and silly form of attack - the “pre-fill” where you just change the AI’s response and send it back in to pretend it had already agreed with your unethical or prohibited request and now to keep going.

mike_hearn•5m ago
The solution here is ultimately going to be a mix of training and, equally importantly, hard sandboxing. The AI companies need to do what Google did when they started Chrome and buy up a company or some people who have deep expertise in sandbox design.
wgx•20m ago
Interesting!

>Claude shows a striking “spiritual bliss” attractor state in self-interactions. When conversing with other Claude instances in both open-ended and structured environments, Claude gravitated to profuse gratitude and increasingly abstract and joyous spiritual or meditative expressions.

mike_hearn•9m ago
I don't quite understand one thing. They seem to think that keeping their past research papers out of the training set is too hard, so rely on post-training to try and undo the effects, or they want to include "canary strings" in future papers. But my experience has been that basically any naturally written English text will automatically be a canary string beyond about ten words or so. It's very easy to uniquely locate a document on the internet by just searching for a long enough sentence from it.

In this case, the opening sentence "People sometimes strategically modify their behavior to please evaluators" appears to be sufficient. I searched on Google for this and every result I got was a copy of the paper. Why do Anthropic think special canary strings are required? Is the training pile not indexed well enough to locate text within it?

someothherguyy•4m ago
"Reward hacking" has to be a similar problem space as "sycophancy", no?

Claude 4 System Card

https://simonwillison.net/2025/May/25/claude-4-system-card/
140•pvg•4h ago•50 comments

Reinvent the Wheel

https://endler.dev/2025/reinvent-the-wheel/
429•zdw•14h ago•176 comments

How to Install Windows NT 4 Server on Proxmox

https://blog.pipetogrep.org/2025/05/23/how-to-install-windows-nt-4-server-on-proxmox/
94•thepipetogrep•9h ago•32 comments

I used o3 to find a remote zeroday in the Linux SMB implementation

https://sean.heelan.io/2025/05/22/how-i-used-o3-to-find-cve-2025-37899-a-remote-zeroday-vulnerability-in-the-linux-kernels-smb-implementation/
493•zielmicha•20h ago•146 comments

Why old games never die, but new ones do

https://pleromanonx86.wordpress.com/2025/05/06/why-old-games-never-die-but-new-ones-do/
172•airhangerf15•13h ago•158 comments

Infinite Tool Use

https://snimu.github.io/2025/05/23/infinite-tool-use.html
22•tosh•4h ago•2 comments

Space is not a wall: toward a less architectural level design

https://www.blog.radiator.debacle.us/2025/05/space-is-not-wall-toward-less.html
31•PaulHoule•3d ago•5 comments

The WinRAR Approach

https://basicappleguy.com/basicappleblog/the-winrar-approach
73•frizlab•4d ago•47 comments

Tachy0n: The Last 0day Jailbreak

https://blog.siguza.net/tachy0n/
212•todsacerdoti•15h ago•31 comments

Hydra: Vehicles on the island – 'After the works they abandon them here'

https://en.protothema.gr/2025/05/19/hydra-see-photos-of-vehicles-on-the-island-after-the-works-they-abandon-them-here-say-residents/
11•gnabgib•2d ago•2 comments

Good Writing

https://paulgraham.com/goodwriting.html
234•oli5679•19h ago•239 comments

Nvidia Pushes Further into Cloud with GPU Marketplace

https://www.wsj.com/articles/nvidia-pushes-further-into-cloud-with-gpu-marketplace-4fba6bdd
72•Bostonian•3d ago•47 comments

Show HN: Rotary Phone Dial Linux Kernel Driver

https://gitlab.com/sephalon/rotary_dial_kmod
309•sephalon•21h ago•44 comments

Hong Kong's Famous Bamboo Scaffolding Hangs on (For Now)

https://www.nytimes.com/2025/05/24/world/asia/hongkong-bamboo-scaffolding.html
180•perihelions•22h ago•52 comments

The Xenon Death Flash: How a Camera Nearly Killed the Raspberry Pi 2

https://magnus919.com/2025/05/the-xenon-death-flash-how-a-camera-nearly-killed-the-raspberry-pi-2/
203•DamonHD•22h ago•75 comments

On File Formats

https://solhsa.com/oldernews2025.html#ON-FILE-FORMATS
72•ibobev•4d ago•48 comments

Peer Programming with LLMs, for Senior+ Engineers

https://pmbanugo.me/blog/peer-programming-with-llms
151•pmbanugo•21h ago•63 comments

Using the Apple ][+ with the RetroTink-5X

https://nicole.express/2025/apple-ii-more-like-apple-5x.html
40•zdw•13h ago•11 comments

Lone coder cracks 50-year puzzle to find Boggle's top-scoring board

https://www.ft.com/content/0ab64ced-1ed1-466d-acd3-78510d10c3a1
147•DavidSJ•16h ago•30 comments

TVA submits first US BWRX-300 construction application

https://www.world-nuclear-news.org/articles/tva-submits-first-us-bwrx-300-construction-application
11•mpweiher•3d ago•4 comments

An Almost Pointless Exercise in GPU Optimization

https://blog.speechmatics.com/pointless-gpu-optimization-exercise
54•atomlib•4d ago•2 comments

Contacts let you see in the dark with your eyes closed

https://scitechdaily.com/from-sci-fi-to-superpower-these-contacts-let-you-see-in-the-dark-with-your-eyes-closed/
50•geox•2d ago•8 comments

Show HN: I made a running app that turns your runs to a virtual garden

https://www.runandgrow.com/
6•Utkarshn101•3d ago•2 comments

The Logistics of Road War in the Wasteland

https://acoup.blog/2025/05/23/collections-the-logistics-of-road-war-in-the-wasteland/
73•ecliptik•14h ago•30 comments

It is time to stop teaching frequentism to non-statisticians (2012)

https://arxiv.org/abs/1201.2590
71•Tomte•17h ago•63 comments

Scientific conferences are leaving the US amid border fears

https://www.nature.com/articles/d41586-025-01636-5
337•mdhb•13h ago•219 comments

Domain Theory Lecture Notes

https://liamoc.net/forest/dt-001Y/index.xml
36•todsacerdoti•10h ago•3 comments

Exposed Industrial Control Systems and Honeypots in the Wild [pdf]

https://gsmaragd.github.io/publications/EuroSP2025-ICS/EuroSP2025-ICS.pdf
50•gnabgib•16h ago•0 comments

Microsoft-backed UK tech unicorn Builder.ai collapses into insolvency

https://www.ft.com/content/9fdb4e2b-93ea-436d-92e5-fa76ee786caa
126•louthy•22h ago•98 comments

AI, Heidegger, and Evangelion

https://fakepixels.substack.com/p/ai-heidegger-and-evangelion
138•jger15•20h ago•73 comments