My startup builds agents for penetration testing, and this is the bet we have been making for over a year when models started getting good at coding. There was a huge jump in capability from Sonnet 4 to Sonnet 4.5. We are still internally testing Opus 4.5, which is the first version of Opus priced low enough to use in production. It's very clever and we are re-designing our benchmark systems because it's saturating the test cases.
samuelknight•6m ago