Multi-layer Transformer: N stacked decoder blocks with pre-norm residual connections (minimal sketches of each component in this list follow below)
Rotary Position Embeddings (RoPE): Replaces learned positional encodings with rotary embeddings for better length generalization
Multi-Query Attention (MQA): Reduces KV cache size by sharing key/value heads across query heads
RMSNorm: Parameter-free normalization for stability (instead of LayerNorm)
QK-norm: Normalizes queries and keys before attention to prevent numerical instability
ReLU² MLP: Uses the ReLU(x)² activation, which is cheap to compute on GPUs and keeps gradients flowing better than plain ReLU
Softcap Logits: Bounds output logits using tanh(x/15)*15 to prevent extreme values
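The sketches below are minimal PyTorch illustrations of these components; every class and parameter name (Block, MQAttention, d_model, the 4× hidden width, and so on) is an assumption made for the example, not an identifier from this codebase. First, the pre-norm decoder block together with the parameter-free RMSNorm it applies before each sub-layer:

```python
import torch
import torch.nn as nn

def rms_norm(x: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # Parameter-free RMSNorm: rescale by the root-mean-square of the features.
    # Unlike LayerNorm there is no mean subtraction and no learnable gain or bias.
    return x * torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + eps)

class Block(nn.Module):
    """One decoder block with pre-norm residuals: each sub-layer reads a
    normalized copy of the residual stream and adds its output back to it."""
    def __init__(self, attn: nn.Module, mlp: nn.Module):
        super().__init__()
        self.attn = attn  # causal self-attention (MQA sketch below)
        self.mlp = mlp    # feed-forward network (ReLU^2 sketch below)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x + self.attn(rms_norm(x))  # attention sub-layer
        x = x + self.mlp(rms_norm(x))   # MLP sub-layer
        return x
```

The full model would stack N of these blocks between the token embedding and the language-model head.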
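RoPE rotates pairs of query/key channels by position-dependent angles instead of adding a learned position vector to the embeddings. A sketch, assuming the common "rotate half" channel layout:

```python
import torch

def rope_cache(seq_len: int, head_dim: int, base: float = 10000.0):
    # One rotation frequency per channel pair, geometrically spaced.
    freqs = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    angles = torch.outer(torch.arange(seq_len).float(), freqs)  # (seq_len, head_dim // 2)
    return angles.cos(), angles.sin()

def apply_rope(x: torch.Tensor, cos: torch.Tensor, sin: torch.Tensor) -> torch.Tensor:
    # x: (batch, n_heads, seq_len, head_dim). Pair channel i with channel
    # i + head_dim // 2 and rotate each pair by its position-dependent angle,
    # so relative position appears as a phase difference between q and k.
    x1, x2 = x.chunk(2, dim=-1)  # each (..., head_dim // 2)
    return torch.cat([x1 * cos - x2 * sin,
                      x1 * sin + x2 * cos], dim=-1)
```

Inside attention, `cos, sin = rope_cache(T, head_dim)` would be applied to q and k only, never to v.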
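A sketch of multi-query attention with QK-norm, assuming PyTorch's `scaled_dot_product_attention`: only one key/value head is projected (and would be cached), then broadcast across all query heads.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def rms_norm(x: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # Same parameter-free RMSNorm as in the block sketch above.
    return x * torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + eps)

class MQAttention(nn.Module):
    """Multi-query attention: n_heads query heads share a single key/value head,
    so the KV cache shrinks from n_heads * head_dim to head_dim per token."""
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.n_heads = n_heads
        self.head_dim = d_model // n_heads
        self.q_proj = nn.Linear(d_model, d_model, bias=False)
        self.kv_proj = nn.Linear(d_model, 2 * self.head_dim, bias=False)  # one K and one V head
        self.out_proj = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, _ = x.shape
        q = self.q_proj(x).view(B, T, self.n_heads, self.head_dim).transpose(1, 2)
        k, v = self.kv_proj(x).split(self.head_dim, dim=-1)
        k = k.view(B, T, 1, self.head_dim).transpose(1, 2)  # single shared K head
        v = v.view(B, T, 1, self.head_dim).transpose(1, 2)  # single shared V head
        # QK-norm: normalize queries and keys per head before the dot product
        # so the attention logits stay bounded. (RoPE would also be applied here.)
        q, k = rms_norm(q), rms_norm(k)
        # Broadcast the shared K/V head across all query heads.
        k = k.expand(B, self.n_heads, T, self.head_dim)
        v = v.expand(B, self.n_heads, T, self.head_dim)
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.out_proj(out.transpose(1, 2).reshape(B, T, -1))
```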
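The feed-forward block with the squared-ReLU activation, assuming a 4× hidden width:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MLP(nn.Module):
    """Feed-forward block with the squared-ReLU activation relu(x) ** 2."""
    def __init__(self, d_model: int, hidden_mult: int = 4):
        super().__init__()
        self.fc_in = nn.Linear(d_model, hidden_mult * d_model, bias=False)
        self.fc_out = nn.Linear(hidden_mult * d_model, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = F.relu(self.fc_in(x))
        return self.fc_out(h * h)  # ReLU(x)^2: zero below zero, quadratic above
```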
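Finally, the logit softcap is a one-liner; the `lm_head` name in the usage comment is hypothetical:

```python
import torch

def softcap(logits: torch.Tensor, cap: float = 15.0) -> torch.Tensor:
    # cap * tanh(x / cap) is close to the identity for |x| << cap but squashes
    # everything into (-cap, cap), so no output logit can grow without bound.
    return cap * torch.tanh(logits / cap)

# e.g. logits = softcap(lm_head(hidden))   # hypothetical final projection
```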