I ran this as a practitioner benchmark, not as a vendor takedown.
The main result is not “GLM-5.1 is bad.” At 80K+ context, the same model family performs well: GLM-5.1 reached 98.7% in the main matrix and 100% in the matched preserved-thinking control. The 32K failure is specifically about tool-mediated runtime conditions in OpenCode 1.3.17, where about 21K tokens of built-in context were already consumed before the task itself had any room to work.
The paper does not claim Z.AI benchmarks are fake, that GLM-5.1 cannot work at 32K, or that the supplemental probes rank tools. I included raw SQLite DBs, runners, verifiers, checksums, reproduction docs, limitations, and a reviewer FAQ.
I work at 0G Foundation, but this is personal research and has no connection to 0G. I would especially welcome reproduction attempts with different coding tools or models.