jacklondon•2h ago
Instead of discussing "reasoning" in the abstract, the paper studies LLM behavior on random 3-SAT, especially near the phase transition, where instances become genuinely hard. That grounds the discussion in computational complexity rather than bare benchmark-chasing.
It seems to show that most models fail badly in the hard region, while some newer ones may capture a bit more of the underlying reasoning structure.
I wonder whether this is a meaningful bridge between LLM evaluation and complexity theory, or still mostly a stress test and not much more.
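For anyone unfamiliar with the phase transition: random 3-SAT flips from almost-always satisfiable to almost-always unsatisfiable as the clause-to-variable ratio crosses roughly 4.27, and instances near that ratio are empirically the hardest. A minimal sketch (my own illustration, not the paper's setup; the brute-force check is only viable for tiny n):

```python
import itertools
import random

def random_3sat(n_vars, ratio, rng):
    """Random 3-SAT: each clause picks 3 distinct variables with random signs.
    Literals are encoded DIMACS-style: +v means var v is true, -v means false."""
    n_clauses = round(ratio * n_vars)
    clauses = []
    for _ in range(n_clauses):
        chosen = rng.sample(range(1, n_vars + 1), 3)
        clauses.append(tuple(v if rng.random() < 0.5 else -v for v in chosen))
    return clauses

def satisfiable(clauses, n_vars):
    """Brute-force satisfiability check over all 2^n assignments (tiny n only)."""
    for bits in itertools.product([False, True], repeat=n_vars):
        if all(any((lit > 0) == bits[abs(lit) - 1] for lit in clause)
               for clause in clauses):
            return True
    return False

def sat_fraction(n_vars, ratio, trials, seed=0):
    """Fraction of random instances at a given clause/variable ratio that are SAT."""
    rng = random.Random(seed)
    return sum(satisfiable(random_3sat(n_vars, ratio, rng), n_vars)
               for _ in range(trials)) / trials
```

Well below the threshold (e.g. ratio 1.0) nearly every instance is satisfiable; well above it (e.g. ratio 8.0) nearly none are; sampling ratios around 4.27 is where you see the crossover, and where solvers and, per the paper, LLMs struggle.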