I rarely have a problem where all of the requirements are laid out and only the implementation is missing.
Are there any LLM coding benchmarks that have a human in the loop? That would be more helpful for me. Maybe with a large enough subset of humans you can take the average without individual human performance being the main differentiator.
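Roughly the kind of scoring I'm imagining (a toy sketch, not how any existing benchmark does it; all task/model/field names here are made up):

    # Toy sketch: average each task's outcome over many human participants,
    # so individual human skill washes out and the model under test becomes
    # the main differentiator. All names and scores below are invented.
    from collections import defaultdict
    from statistics import mean

    # each record: (task_id, human_id, model_id, score in [0, 1])
    results = [
        ("task-1", "h1", "model-a", 1.0),
        ("task-1", "h2", "model-a", 0.0),
        ("task-1", "h3", "model-a", 1.0),
        ("task-1", "h1", "model-b", 0.0),
        ("task-1", "h2", "model-b", 1.0),
        ("task-1", "h3", "model-b", 0.0),
    ]

    by_model_task = defaultdict(list)
    for task_id, human_id, model_id, score in results:
        by_model_task[(model_id, task_id)].append(score)

    # a model's per-task score = mean over all humans who ran that task with it
    for (model_id, task_id), scores in sorted(by_model_task.items()):
        print(f"{model_id} on {task_id}: {mean(scores):.2f} (n={len(scores)})")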
verdverm•1h ago
That being said, I have been collecting all of my sessions to build such a dataset, which I plan to use to optimize my agent instructions.
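For context, the kind of session logging I mean is nothing fancy, something like appending each session to a JSONL file so it can be mined later when tuning the instructions (a rough sketch; every field name here is an assumption, not any particular tool's format):

    # Rough sketch of per-session logging for later instruction tuning.
    # All field names are assumptions, not an existing tool's schema.
    import json
    import time
    from pathlib import Path

    LOG = Path("sessions.jsonl")

    def record_session(prompt, agent_instructions, transcript, accepted):
        entry = {
            "ts": time.time(),
            "prompt": prompt,                    # what the human asked for
            "instructions": agent_instructions,  # agent instructions in effect
            "transcript": transcript,            # list of agent/user turns
            "accepted": accepted,                # did the human keep the result?
        }
        with LOG.open("a") as f:
            f.write(json.dumps(entry) + "\n")

    record_session(
        prompt="add retry logic to the fetcher",
        agent_instructions="prefer small diffs; ask before refactoring",
        transcript=["agent: here is a patch...", "user: looks good"],
        accepted=True,
    )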