The study looks at a wide range of tests spanning many different areas of expertise and output types. Some of the tests, like the web vis tasks, used Sonnet rather than Opus (which was not out at the time). It is like testing a car on many different things when only one of the tests is actually driving somewhere and many of the others are based on the fabric used in the interior. This produces a very broad "96% failure" figure while missing the successes. Of course AI can't do everything, and nor can I.
deterministic•1h ago
It matches my experience using AI to develop software. It is a super useful tool, but also really crap at doing anything outside of its training data. There is zero real understanding or thinking going on behind the curtain.
sponaugle•1h ago
One of the most interesting observations about AI is the timescale on which the favorite model and favorite task change. Before November I found Sonnet to be interesting, but not moving the needle that much. Once Opus came out it was clear the needle was not only moving, but moving fast.