Gave Claude Code, Gemini CLI, and Codex CLI identical instructions: analyze 13 years of writing across three blogs (two of them in my regional, non-English language) and create a style guide.
Observations:
1. Model-task matching matters. Codex's default code-specialized model struggled with writing analysis. Switching to GPT-5 improved output quality 4x.
2. Autonomy settings affect completion. Gemini with limited autonomy produced incomplete work—it kept pausing for approvals mid-task.
3. All three claimed "done." Output varied from 198 to 2,555 lines. Never trust completion claims without verification (a quick check is sketched after this list).
4. Deep reading beat clever shortcuts. Codex took an API-first approach (RSS, JSON endpoints). Valid methodology, but it missed nuances that Claude caught by reading posts directly; the second sketch below shows why.
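For point 3, verification can be as simple as measuring each output before believing any "done" claim. A minimal Python sketch; the filenames are hypothetical stand-ins for wherever each agent wrote its guide:

```python
from pathlib import Path

# Hypothetical output paths; substitute wherever each agent saved its style guide.
outputs = ["claude-style-guide.md", "gemini-style-guide.md", "codex-style-guide.md"]

for name in outputs:
    path = Path(name)
    if not path.exists():
        print(f"{name}: missing -- the 'done' claim fails immediately")
        continue
    text = path.read_text(encoding="utf-8")
    lines = text.count("\n") + 1
    words = len(text.split())
    print(f"{name}: {lines} lines, {words} words")
```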
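For point 4, the gap between the two methodologies is visible in code. A rough sketch assuming a standard RSS feed (the URL is a placeholder): a feed's description is usually a truncated excerpt, so an API-first pass sees summaries while direct reading sees the full rendered post.

```python
import urllib.request
import xml.etree.ElementTree as ET

FEED_URL = "https://example-blog.com/rss.xml"  # placeholder feed URL

def fetch(url: str) -> str:
    with urllib.request.urlopen(url) as resp:
        return resp.read().decode("utf-8", errors="replace")

root = ET.fromstring(fetch(FEED_URL))
for item in root.iter("item"):
    # API-first view: a <description> that is often a truncated excerpt,
    # stripped of the formatting habits a style guide depends on.
    summary = item.findtext("description", default="")
    link = item.findtext("link", default="")
    # Deep-reading view: the rendered post itself, with everything intact.
    full_html = fetch(link) if link else ""
    print(f"{link}: summary {len(summary)} chars vs full page {len(full_html)} chars")
```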
Claude won at 9.5/10, but the more interesting finding was how much configuration affected the other two agents' scores.
Full analysis and methodology in the linked post.