kouteiheika•2h ago
It'd be nice if you could add a separate leaderboard for open-weight models on your results page (or add the ability to filter out proprietary models).
Also, why use an agent for this? This doesn't make much sense to me, considering it's supposed to be "measuring how well models can find and fix errors in human-written text" -- here you're measuring the model's agentic capabilities just as much as its ability to correct the text.
I suppose this is somewhat of an interesting benchmark too, but if I were interested in cost-effective proofreading of a ton of text I'd just do it the old-fashioned way: split my text into chunks, write a good prompt telling the model to proofread the given text and return the result, attach the prompt to each chunk, and let it rip.
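A minimal sketch of that chunk-and-prompt approach, assuming a hypothetical `call_model` function standing in for whatever completion API you'd actually use (the chunk size and prompt wording are illustrative, not from the thread):

```python
PROMPT = ("Proofread the following text. Fix spelling, grammar, and "
          "punctuation only. Return the corrected text and nothing else.\n\n")

def chunk_text(text: str, max_chars: int = 4000) -> list[str]:
    """Split text into chunks on paragraph boundaries, each under max_chars."""
    chunks, current = [], ""
    for para in text.split("\n\n"):
        if current and len(current) + len(para) + 2 > max_chars:
            chunks.append(current)
            current = para
        else:
            current = current + "\n\n" + para if current else para
    if current:
        chunks.append(current)
    return chunks

def proofread(text: str, call_model, max_chars: int = 4000) -> str:
    # One plain completion call per chunk; no agent loop, no tools.
    return "\n\n".join(call_model(PROMPT + chunk)
                       for chunk in chunk_text(text, max_chars))
```

In practice `call_model` would wrap your provider's API client; the point is that each chunk is an independent, stateless call, so the whole job parallelizes trivially.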
artursapek•1h ago
Good idea about the leaderboard for open vs closed models!
Point taken on using an agent. I went that route because part of the goal for this benchmark is to inform which models I push in my agentic word processor, which uses tools for focused proofreading/editing. It's much faster and generally cheaper to use tools for surgical changes on large documents, rather than having the model spit out the entire document with all issues corrected. So yes, I am trying to measure agentic abilities here.
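The thread doesn't show the actual tool interface, but a surgical-edit tool of the kind described is roughly (names and signature are hypothetical):

```python
def apply_edit(doc: str, old: str, new: str) -> str:
    """Apply one surgical edit: replace the first exact occurrence of
    `old` with `new`. Requiring an exact anchor keeps the model from
    silently rewriting text it shouldn't touch, and the output cost is
    just the edit, not a full re-emission of the document."""
    if old not in doc:
        raise ValueError(f"anchor not found: {old!r}")
    return doc.replace(old, new, 1)
```

On a long document, the model emits a handful of small `(old, new)` pairs instead of regenerating every token, which is where the speed and cost advantage comes from.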
A simple one-pass full-rewrite test would also make an interesting benchmark, though.