Model Architecture
* Type: Mixture-of-Experts (MoE) transformer.
* Total parameters: 1 trillion.
* Activated parameters: 32 billion.
* Experts: 384 total, with 8 activated per token.
* Attention heads: 64.
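For intuition about what "8 of 384 experts per token" means, here is a minimal top-k routing layer in PyTorch. Only the expert count (384) and top-k (8) come from the report; the toy hidden sizes, the linear router, and the expert MLPs are assumptions for illustration, not Kimi K2's actual implementation.

```python
import torch
import torch.nn as nn

class TopKMoE(nn.Module):
    """Toy sketch of top-k expert routing: 384 experts, 8 active per token."""
    def __init__(self, d_model=64, d_ff=128, n_experts=384, top_k=8):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):                              # x: (tokens, d_model)
        scores = self.router(x)                        # (tokens, n_experts)
        weights, idx = scores.topk(self.top_k, dim=-1) # pick 8 experts per token
        weights = weights.softmax(dim=-1)              # renormalize over the chosen 8
        out = torch.zeros_like(x)
        for slot in range(self.top_k):                 # only 8 of 384 experts run per token
            for e in idx[:, slot].unique().tolist():
                mask = idx[:, slot] == e
                out[mask] += weights[mask, slot].unsqueeze(-1) * self.experts[e](x[mask])
        return out

# Example: route a batch of 5 token embeddings through the sparse layer.
# y = TopKMoE()(torch.randn(5, 64))
```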
Pre-training
* Optimizer: MuonClip, a novel optimizer that integrates Muon with a QK-Clip mechanism to address training instability.
* Dataset: pre-trained on 15.5 trillion tokens.
* Training process: zero loss spikes; the initial 4,096-token context window was later extended to 128k tokens using the YaRN method.
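For the curious, a hedged sketch of the QK-Clip idea: after each Muon update, any attention head whose maximum pre-softmax logit exceeded a threshold gets its query/key projections scaled down so the logit is pulled back toward that threshold. The function name, the per-head weight handles, and the default threshold are assumptions based on the public description, not the actual training code.

```python
import torch

def qk_clip_(w_q_head: torch.Tensor, w_k_head: torch.Tensor,
             max_logit: float, tau: float = 100.0) -> None:
    """In-place rescaling of one head's W_q / W_k after an optimizer step.

    If the head's observed max attention logit exceeded tau, shrink both
    projections by sqrt(tau / max_logit) so the q.k logit falls back to ~tau.
    """
    if max_logit > tau:
        scale = (tau / max_logit) ** 0.5   # split the shrink evenly across W_q and W_k
        with torch.no_grad():
            w_q_head.mul_(scale)
            w_k_head.mul_(scale)
```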
Post-training
* The model underwent a multi-stage process featuring a large-scale agentic data synthesis pipeline and a joint reinforcement learning (RL) stage.
* The RL framework combines verifiable rewards with a self-critique rubric reward mechanism.
* The data synthesis pipeline generated tens of thousands of tool-use training examples.
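To make the reward combination concrete, here is a minimal sketch of blending a verifiable signal (e.g. unit tests or an exact-match check) with a self-critique rubric score. The helper functions and the weights are hypothetical; the actual framework may route different task types to different reward sources rather than averaging them as shown here.

```python
from typing import Callable

def combined_reward(response: str,
                    passes_checks: Callable[[str], bool],
                    rubric_score: Callable[[str], float],
                    w_verify: float = 0.7,
                    w_rubric: float = 0.3) -> float:
    """Blend an objective, checkable reward with a model-generated rubric score.

    passes_checks: hypothetical verifier (unit tests, exact-match answer, etc.)
    rubric_score:  hypothetical self-critique score in [0, 1]
    """
    verifiable = 1.0 if passes_checks(response) else 0.0
    return w_verify * verifiable + w_rubric * rubric_score(response)
```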
Performance Benchmarks (non-thinking mode)
* SWE-bench Verified: 65.8%
* SWE-bench Multilingual: 47.3%
* LiveCodeBench v6: 53.7%
* OJBench: 27.1%
* Tau2-Bench micro-average: 66.1
* ACEBench (en): 76.5
* AIME 2025: 49.5
* GPQA-Diamond: 75.1
* LMSYS Arena Leaderboard (July 17, 2025): ranked 1st among open-source models and 5th overall
We just covered it today on the latent.space paper club, if you want to listen along while reading the paper: https://youtu.be/VHwZa7lZhK8
Definitely see also Sebastian Raschka's writeup: https://t.co/oEt8XzNxik
* Background on Muon and MuonClip: https://www.youtube.com/watch?v=fcTNQLebHb0
dang • 1d ago
China's moonshot launches free AI model Kimi K2 that outperforms GPT4 - https://news.ycombinator.com/item?id=44575309 - July 2025 (3 comments)
Kimi K2 and when "DeepSeek Moments" become normal - https://news.ycombinator.com/item?id=44561565 - July 2025 (2 comments)
Kimi K2 is a state-of-the-art mixture-of-experts (MoE) language model - https://news.ycombinator.com/item?id=44533403 - July 2025 (178 comments)