- /think wins ~69% of comparisons overall.
- Risk coverage is the clearest advantage (17-2 across all tests): it consistently surfaces failure modes the organic response misses.
- Decision impact is nearly even: organic Claude is often more actionable for practical problems.
- Novel insight is mostly a wash: both find similar core insights, just different ones.
- No decisive gaps in either direction. The advantage is depth and rigor, not dramatic superiority.
Honest limitations:
- All judges so far are AI. The whole point of publishing the blind test is to get human validation.
- ~21 comparisons is enough to suggest a pattern, not enough for statistical significance.
- Anonymization isn't perfect: /think responses have stylistic tells (confidence assessments, "what would change this conclusion" sections).
- The framework costs significantly more tokens.
The skill itself is a recursive learning agent: it persists what it learns to a .think/ directory and loads that context in future sessions. Over time it builds project-specific knowledge. It also used its own framework to diagnose and fix its own weaknesses after the first round of testing.

Everything is open source: https://github.com/bengiaventures/effective-thinking-skill

I'd genuinely like to know if the blind test matches what the AI judges found, or if humans see something different. Takes about 15 minutes.
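If you're curious what the persist-and-reload idea looks like, here's a minimal sketch in Python. It's purely illustrative: the .think/ location comes from the post above, but the file name, JSON format, and function names are my assumptions, not the actual skill code (that's in the repo).

```python
# Hypothetical sketch of the persist-and-reload pattern described above.
# The real skill's storage format and structure may differ; see the repo.
import json
from datetime import datetime, timezone
from pathlib import Path

THINK_DIR = Path(".think")                      # persisted-context directory from the post
LEARNINGS_FILE = THINK_DIR / "learnings.json"   # assumed file name for illustration

def load_learnings() -> list[dict]:
    """Load previously persisted insights at the start of a session."""
    if not LEARNINGS_FILE.exists():
        return []
    return json.loads(LEARNINGS_FILE.read_text())

def save_learning(topic: str, insight: str) -> None:
    """Append one learned insight so future sessions can reload it."""
    THINK_DIR.mkdir(exist_ok=True)
    entries = load_learnings()
    entries.append({
        "topic": topic,
        "insight": insight,
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    })
    LEARNINGS_FILE.write_text(json.dumps(entries, indent=2))
```

The point is just that knowledge accumulates in the project directory itself, so each new session starts with whatever earlier sessions figured out.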