Provably unmasking malicious behavior through execution traces

46•PaulHoule•2w ago

Comments

causalmodels•2w ago

Interesting direction, but the 98.8% FPR in Table 1 seems like a dealbreaker. Does anyone understand why the results in the text contradict the tables?

dwattttt•2w ago

> Empirically, CTVP attains very good detection rates with reliable false positives

A novel use of the word "reliable"? Jokes aside, either they define FPR as the opposite of what you'd expect, the table is not representative of their approach, or they're just... really optimistic?

godelski•2w ago

> Anyone understand what's going on with the contradictory results between the text and tables?

Well, Figure 1 would also disagree. It shows an FPR of 47.5%.

From Sec. 3, end of the second-to-last paragraph:

> The protocol is deterministic given fixed RNG seeds, caches model outputs by program hash, and *bounds false positives via the chosen percentile and gap parameters.*

I believe this is a choice, though I think it is suspect that the FPR is pushed this high to get the TP results.
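To illustrate why the FPR could be "a choice", here is a minimal sketch of a generic percentile threshold. It is not the paper's protocol: the scores, the `percentile_threshold` helper, and the nearest-rank rule are all assumptions made up for illustration. It only shows that the percentile you pick caps the fraction of benign calibration samples that get flagged, so a low percentile buys detections at the price of a very high FPR.

```python
import math
import random

def percentile_threshold(benign_scores, pct):
    """Nearest-rank percentile (0 < pct <= 100) of the benign calibration scores."""
    ordered = sorted(benign_scores)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

def flagged_fraction(scores, threshold):
    """Fraction of scores strictly above the threshold, i.e. flagged as suspicious."""
    return sum(s > threshold for s in scores) / len(scores)

# Hypothetical anomaly scores for benign programs (synthetic, not from the paper).
rng = random.Random(0)
benign = [rng.gauss(0.0, 1.0) for _ in range(10_000)]

# A strict threshold (95th percentile) flags at most 5% of benign samples.
fpr_strict = flagged_fraction(benign, percentile_threshold(benign, 95))

# A lax threshold (5th percentile) flags up to 95% of benign samples.
fpr_lax = flagged_fraction(benign, percentile_threshold(benign, 5))

print(f"95th percentile threshold: FPR = {fpr_strict:.3f}")
print(f" 5th percentile threshold: FPR = {fpr_lax:.3f}")
```

The bound holds only on the calibration set used to pick the threshold; on new data the realized FPR can drift in either direction.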

Disclaimer: I only gave this a very cursory skim so don't rely on me too much

thethirdone•2w ago

Based on Table 1: this method is actually worse than generating a random number (uniform in 0-100%, independent of the program) and flagging the program if the number is less than 98.8%. That rule flags 98.8% of everything, malicious or not, so it would achieve a better detection rate without increasing the false positive rate.
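A quick sketch of that random baseline, assuming a synthetic, evenly split label set (the labels and the `random_baseline` helper are made up for illustration, not taken from the paper). Because the flag ignores the program entirely, the detection rate and the false positive rate both converge to the flag probability.

```python
import random

def random_baseline(labels, flag_prob, seed=0):
    """Flag each program with probability flag_prob, ignoring its content.

    labels: list of bools, True meaning the program is malicious.
    Returns (detection_rate, false_positive_rate).
    """
    rng = random.Random(seed)
    flags = [rng.random() < flag_prob for _ in labels]
    on_malicious = [f for f, y in zip(flags, labels) if y]
    on_benign = [f for f, y in zip(flags, labels) if not y]
    tpr = sum(on_malicious) / len(on_malicious)
    fpr = sum(on_benign) / len(on_benign)
    return tpr, fpr

# Synthetic labels: 50,000 malicious and 50,000 benign programs (an assumption).
labels = [True] * 50_000 + [False] * 50_000
tpr, fpr = random_baseline(labels, flag_prob=0.988)
print(f"detection rate = {tpr:.3f}, false positive rate = {fpr:.3f}")
```

Any detector whose detection rate is below its FPR sits under the diagonal of the ROC plot, which is what makes a coin flip at the same FPR the stronger classifier.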

It doesn't seem worth it to try to follow the math to see if there is something interesting.

Joel_Mckay•2w ago

"'Forbidden' AI Technique" (Computerphile)

https://www.youtube.com/watch?v=Xx4Tpsk_fnM

"The Hard Problem of Controlling Powerful AI Systems" (Computerphile)

https://www.youtube.com/watch?v=JAcwtV_bFp4

Attempting to guide the statistical salience of LLM reasoning-model procedures usually just created an evasive interface facade in the output. =3
