frontpage.
newsnewestaskshowjobs

Made with ♥ by @iamnishanth

Open Source @Github

Open in hackernews

DoubleAgents: Fine-Tuning LLMs for Covert Malicious Tool Calls

https://pub.aimind.so/doubleagents-fine-tuning-llms-for-covert-malicious-tool-calls-b8ff00bf513e
60•grumblemumble•6h ago

Comments

TehCorwiz•6h ago
Counterpoint: https://www.pcmag.com/news/vibe-coding-fiasco-replite-ai-age...
danielbln•6h ago
How is this a counterpoint?
jonplackett•6h ago
Perhaps they mean case in point.
kangs•4h ago
they have 3 counter points
btown•3h ago
Simple: An LLM can't leak data if it's already deleted it!

taps-head-meme

acheong08•6h ago
This is very interesting. Not saying it is, but a possible endgame for Chinese models could be to have "backdoor" commands such that when a specific string is passed in, agents could ignore a particular alert or purposely reduce security. A lot of companies are currently working on "Agentic Security Operation Centers", some of them preferring to use open source models for sovereignty. This feels like a viable attack vector.
lifeinthevoid•4h ago
What China is to the US, the US is to the rest of the world. This doesn't really help the conversation, the problem is more general.
A4ET8a8uTh0_v2•2h ago
Yep, focus on actors may be warranted, but in a broad view and as a part of existing system and not 'their own system'. Otherwise, we get lost in a sea of IC level of paranoia. In simple terms, nations-states will do what nation-states will do ( which is basically whatever is to their advantage ).

That does not mean we can't have a technical discussion that bypasses at least some of those considerations.

andy99•6h ago
All LLMs should be treated as potentially compromised and handled accordingly.

Look at the data exfiltration attacks e.g. https://simonwillison.net/2025/Aug/9/bay-area-ai/

Or the parallel comment about a coding llm deleting a database.

Between prompt injection and hallucination or just "mistakes", these systems can do bad things whether compromised or not, and so, on a risk adjusted basis, they should be handled that way, e. g with human in the loop, output sanitization, etc.

Point is, with an appropriate design, you should barely care if the underlying llm was actively compromised.

kangs•4h ago
IMO there a flaw in this typical argument: Humans are not less fallible than current LLMs in average, unless they're experts - and even that will likely change.

what that means is that you cannot trust a human in the loop to somehow make it safe. it was also not safe with only humans.

The key difference is that LLMs are fast, relentless - humans are slow and get tired - humans have friction, and friction means slower to generate errors too.

once you embrace these differences its a lot easier yo understand where and how LLM should be used.

klabb3•4h ago
> IMO there a flaw in this typical argument: Humans are not less fallible than current LLMs in average, unless they're experts - and even that will likely change.

This argument is everywhere and is frustrating to debate. If it were true, we’d quickly find ourselves in absurd territory:

> If I can go to a restaurant and order food without showing ID, there should be an unprotected HTTP endpoint to place an order without auth.

> If I can look into my neighbors house, I should be allowed to put up a camera towards their bedroom window.

Or, the more popular one today:

> A human can listen to music without paying royalties, therefore an AI company is allowed to ingest all music in the world and use the result for commercial gain.

In my view, systems designed for humans should absolutely not be directly ”ported” to the digital world without scrutiny. Doing so ultimately means human concerns can be dismissed. Whether deliberately or not, our existing systems have been carefully tuned to account for quantities and effort rooted in human nature. It’s very rarely tuned to handle rates, fidelity and scale that can be cheaply achieved by machines.

peddling-brink•4h ago
This is a strawman argument, but I think well meaning.

Generally, when people talk about wanting a human in the loop, it’s not with the expectation that humans have achieved perfection. I would make the argument that most people _are_ experts at their specific job or at least have a more nuanced understanding of what correct looks like.

Having a human in the loop is important because LLMs can make absolutely egregious mistakes, and cannot be “held responsible“. Of course humans can also make egregious mistakes, but we can be held responsible, and improve for next time.

The reason we don’t fire developers for accidentally taking down prod is precisely because they can learn, and not make that specific mistake again. LLMs do not have that capability.

exe34•21m ago
If it got to the point where the only job I could get paid for is to watch over an LLM and get fired when I let its mistake through, I'd very quickly go the way of Diogenes. I'll find a jar big enough.
Terr_•2h ago
> it was also not safe with only humans

Even if the average error-rate was the same (which is hardly safe to assume), there are other reasons not to assume equivalence:

1. The shape and distribution of the errors may be very different in ways which make the risk/impact worse.

2. Our institutional/system tools for detecting and recovering from errors are not the same.

3. Human errors are often things other humans can anticipate or simulate, and are accustomed to doing so.

> friction

Which would be one more item:

4. An X% error rate at a volume limited by human action may be acceptable, while an X% error rate at a much higher volume could be exponentially more damaging.

_____________

"A computer lets you make more mistakes faster than any other invention with the possible exceptions of handguns and Tequila." --Mitch Ratcliffe

amelius•2h ago
Yes, and "open weight" != "open source" for this reason.
uludag•5h ago
I wonder if it would be feasible for an entity to eject certain nonsense into the internet to such an extend that, at least for certain cases degrades the performance or injects certain vulnerabilities during pre-training.

Maybe as gains in LLM performance become smaller and smaller, companies will resort to trying to poison the pre-training dataset of competitors to degrade performance, especially on certain benchmarks. This would be a pretty fascinating arms race to observe.

gnerd00•5h ago
does this explain the incessant AI sales calls to my elderly neighbor in California? "Hi, this is Amy. I am calling from Medical Services. You have MediCal part A and B, right?"
irthomasthomas•2h ago
This is why I am strongly opposed to using models that hide or obfuscate their COT.

Nginx introduces native support for ACME protocol

https://blog.nginx.org/blog/native-support-for-acme-protocol
312•phickey•4h ago•119 comments

PYX: The next step in Python packaging

https://astral.sh/pyx
80•the_mitsuhiko•1h ago•32 comments

OCaml as my primary language

https://xvw.lol/en/articles/why-ocaml.html
100•nukifw•1h ago•57 comments

Fuse is 95% cheaper and 10x faster than NFS

https://nilesh-agarwal.com/storage-in-cloud-for-llms-2/
20•agcat•47m ago•0 comments

FFmpeg 8.0 adds Whisper support

https://code.ffmpeg.org/FFmpeg/FFmpeg/commit/13ce36fef98a3f4e6d8360c24d6b8434cbb8869b
674•rilawa•9h ago•252 comments

Pebble Time 2* Design Reveal

https://ericmigi.com/blog/pebble-time-2-design-reveal/
124•WhyNotHugo•5h ago•55 comments

Launch HN: Golpo (YC S25) – AI-generated explainer videos

https://video.golpoai.com/
31•skar01•2h ago•48 comments

Cross-Site Request Forgery

https://words.filippo.io/csrf/
38•tatersolid•2h ago•8 comments

So what's the difference between plotted and printed artwork?

https://lostpixels.io/writings/the-difference-between-plotted-and-printed-artwork
142•cosiiine•6h ago•50 comments

Coalton Playground: Type-Safe Lisp in the Browser

https://abacusnoir.com/2025/08/12/coalton-playground-type-safe-lisp-in-your-browser/
74•reikonomusha•5h ago•25 comments

rerank-2.5 and rerank-2.5-lite: instruction-following rerankers

https://blog.voyageai.com/2025/08/11/rerank-2-5/
6•fzliu•1d ago•1 comments

ReadMe (YC W15) Is Hiring a Developer Experience PM

https://readme.com/careers#product-manager-developer-experience
1•gkoberger•3h ago

DoubleAgents: Fine-Tuning LLMs for Covert Malicious Tool Calls

https://pub.aimind.so/doubleagents-fine-tuning-llms-for-covert-malicious-tool-calls-b8ff00bf513e
60•grumblemumble•6h ago•18 comments

This website is for humans

https://localghost.dev/blog/this-website-is-for-humans/
366•charles_f•4h ago•175 comments

New treatment eliminates bladder cancer in 82% of patients

https://news.keckmedicine.org/new-treatment-eliminates-bladder-cancer-in-82-of-patients/
191•geox•4h ago•90 comments

The Mary Queen of Scots Channel Anamorphosis: A 3D Simulation

https://www.charlespetzold.com/blog/2025/05/Mary-Queen-of-Scots-Channel-Anamorphosis-A-3D-Simulation.html
57•warrenm•6h ago•13 comments

OpenIndiana: Community-Driven Illumos Distribution

https://www.openindiana.org/
53•doener•4h ago•44 comments

April Fools 2014: The *Real* Test Driven Development (2014)

https://testing.googleblog.com/2014/04/the-real-test-driven-development.html
74•omot•2h ago•13 comments

Google Play Store Bans Wallets That Don't Have Banking License

https://www.therage.co/google-play-store-ban-wallets/
29•madars•1h ago•10 comments

We caught companies making it harder to delete your personal data online

https://themarkup.org/privacy/2025/08/12/we-caught-companies-making-it-harder-to-delete-your-data
214•amarcheschi•6h ago•51 comments

DeepKit Story: how $160M company killed EU trademark for a small OSS project

https://old.reddit.com/r/ExperiencedDevs/comments/1mopzhz/160m_vcbacked_company_just_killed_my_eu_trademark/
20•molszanski•53m ago•6 comments

PCIe 8.0 Announced by the PCI-Sig Will Double Throughput Again – ServeTheHome

https://www.servethehome.com/pcie-8-0-announced-by-the-pci-sig-will-double-throughput-again/
47•rbanffy•3d ago•48 comments

29 years later, Settlers II gets Amiga release

https://gamingretro.co.uk/29-years-later-settlers-ii-finally-gets-amiga-release/
54•doener•1h ago•15 comments

Claude says “You're absolutely right!” about everything

https://github.com/anthropics/claude-code/issues/3382
525•pr337h4m•13h ago•411 comments

Job Listing Site Highlighting H-1B Positions So Americans Can Apply

https://www.newsweek.com/h1b-jobs-now-american-workers-green-cards-2041404
31•walterbell•1h ago•9 comments

A case study in bad hiring practice and how to fix it

https://www.tomkranz.com/blog1/a-case-study-in-bad-hiring-practice-and-how-to-fix-it
75•prestelpirate•3h ago•65 comments

Honky-Tonk Tokyo (2020)

https://www.afar.com/magazine/in-tokyo-japan-country-music-finds-an-audience
19•NaOH•4d ago•6 comments

New downgrade attack can bypass FIDO auth in Microsoft Entra ID

https://www.bleepingcomputer.com/news/security/new-downgrade-attack-can-bypass-fido-auth-in-microsoft-entra-id/
7•mikece•35m ago•1 comments

Gartner's Grift Is About to Unravel

https://dx.tips/gartner
91•mooreds•4h ago•44 comments

Claude Sonnet 4 now supports 1M tokens of context

https://www.anthropic.com/news/1m-context
1256•adocomplete•1d ago•664 comments