I've always felt that standard benchmarks focus too much on final accuracy, while the architectural choices that get us there are often treated like a dark art. We celebrate a new SOTA model, but rarely do we have a common language to discuss why a specific operation is efficient or how it contributes to the whole.
That's why I built "The Architect's Arena," a GWO-based benchmark that tries to score the "architectural intelligence" of neural network operations.
The Core Idea (How it Works):
Inspired by the "Window is Everything" paper, it breaks down any operation (like a Conv layer) into its fundamental components (Path, Shape, Weight) and calculates a theoretical "Operational Complexity" score. The final efficiency score is a function of this complexity and the model's actual performance.
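To make that concrete, here is a minimal, purely illustrative sketch of the scoring idea. The class names, the additive complexity formula, and the accuracy-per-complexity ratio are my own assumptions for this post; the benchmark's actual definitions live in the repo.

```python
# Hypothetical sketch of a GWO-style score -- names and formulas are
# illustrative assumptions, not the benchmark's actual API.
from dataclasses import dataclass

@dataclass
class OpDecomposition:
    path: float    # complexity of how information is routed to the window
    shape: float   # complexity of the sampling window's geometry
    weight: float  # complexity of how weights are generated or shared

    @property
    def operational_complexity(self) -> float:
        # Assumed additive here; the real definition may combine these differently.
        return self.path + self.shape + self.weight

def efficiency_score(accuracy: float, complexity: float) -> float:
    """Toy efficiency: accuracy achieved per unit of operational complexity."""
    return accuracy / complexity

# Example: a placeholder decomposition for a standard convolution.
standard_conv = OpDecomposition(path=1.0, shape=2.0, weight=3.0)
print(efficiency_score(accuracy=0.93, complexity=standard_conv.operational_complexity))
```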
What's special about this benchmark?
We explicitly separate the leaderboard into two parts:
1. Baseline Operations: Individual building blocks like StandardConv. The goal here is to invent new ops with high efficiency scores.
2. Reference Architectures: Complete models like ResNet. These aren't there to compete on score; they serve as a "performance target" to aim for. The challenge is to use your efficient new ops to build an architecture that matches ResNet's accuracy with lower total complexity (a rough sketch of that comparison follows this list).
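As promised above, a small sketch of how "match the target's accuracy at lower total complexity" might be checked. Every number and key name below is a placeholder I made up for illustration, not a value from the leaderboard.

```python
# Illustrative only: comparing a custom architecture against a reference target.
# The per-op complexities and the ResNet figures are placeholders, not measured values.
my_model_ops = {"MyEfficientOp": 4.0, "StandardConv": 6.0}  # op name -> operational complexity
my_model_total = sum(my_model_ops.values())
my_model_accuracy = 0.93  # accuracy measured on the benchmark dataset

resnet_target = {"accuracy": 0.93, "total_complexity": 12.0}  # hypothetical reference entry

beats_target = (my_model_accuracy >= resnet_target["accuracy"]
                and my_model_total < resnet_target["total_complexity"])
print(f"Matches the reference accuracy at lower total complexity: {beats_target}")
```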
This is an early version, and I would love to hear your feedback. What do you think of this approach? Are there any crucial baseline models or datasets you think should be added?
Check it out here:
Live Leaderboard: https://kim-ai-gpu.github.io/gwo-benchmark-leaderboard/
GitHub Repo: https://github.com/Kim-Ai-gpu/gwo-benchmark
Thanks for checking it out!