Using 260-d skip-gram vectors (256 semantic + 4 entropy dimensions) trained on FineWeb-Edu, I projected them into the GPT-2 and Qwen3-14B embedding spaces and substituted low-entropy tail tokens (rare, predictable, function-like) vs high-entropy common tokens (frequent, polysemous).
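The post doesn't spell out how the projection was done; a common choice is a least-squares linear map fit on the vocabulary shared between the two spaces. A minimal sketch of that approach, with synthetic stand-ins for the skip-gram vectors and the model's embedding matrix (all names and shapes here are illustrative, not from the repo):

```python
import numpy as np

def fit_linear_projection(src, tgt):
    """Least-squares map W minimizing ||src @ W - tgt||_F over shared-vocab rows."""
    W, *_ = np.linalg.lstsq(src, tgt, rcond=None)
    return W

# Toy stand-ins: 1000 shared tokens, 260-d skip-gram vectors -> 768-d (GPT-2-sized) space.
rng = np.random.default_rng(0)
src = rng.normal(size=(1000, 260))   # skip-gram vectors (256 semantic + 4 entropy dims)
true_map = rng.normal(size=(260, 768))
tgt = src @ true_map                 # pretend these are the model's input embeddings
W = fit_linear_projection(src, tgt)
projected = src @ W                  # rows that could be swapped into the embedding matrix
print(np.abs(projected - tgt).max() < 1e-6)  # → True (exact linear system recovers the map)
```

Once fitted, the map lets you replace any token's native embedding row with its projected skip-gram counterpart before running the frozen model.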
Surprising result: low-entropy tail substitutions incur a near-constant +0.101 to +0.102 perplexity increase on both models (identical to three decimal places), despite major differences in architecture, embedding dimension (768 vs 5120), tokenizer, and norms. This suggests the effect is intrinsic to the token class.
High-entropy tokens are far more sensitive: ~500–530 swaps add +36 PPL on GPT-2 and +9 on Qwen3 (per-token impact roughly 356× and 91× the low-entropy baseline, respectively).
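The ΔPPL numbers above come from comparing perplexity before and after the swap; perplexity is just exp of the mean negative log-likelihood over the evaluation tokens. A self-contained sketch with hypothetical per-token log-probs (the shift magnitude and counts here are made up for illustration):

```python
import numpy as np

def perplexity(logprobs):
    """exp of the mean negative log-likelihood over a token sequence."""
    return float(np.exp(-np.mean(logprobs)))

# Hypothetical per-token log p(token | context) before and after swapping ~500 embeddings.
rng = np.random.default_rng(1)
base = rng.normal(-3.0, 0.5, size=10_000)
swapped = base.copy()
hit = rng.choice(base.size, size=500, replace=False)
swapped[hit] -= 0.8   # affected positions become less likely under the swapped embeddings

delta = perplexity(swapped) - perplexity(base)
print(f"dPPL = {delta:+.3f}")  # positive: the swap raises perplexity
```

Note how a few hundred affected positions out of 10k still move the aggregate PPL noticeably, since perplexity averages in log space before exponentiating.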
Residual stream analysis shows nearly identical mean convergence trajectories, but low-entropy tokens exhibit transient mid-layer variance spikes (heterogeneous paths), while high-entropy ones propagate perturbations monotonically.
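One way to surface those mid-layer variance spikes is to collect per-layer hidden states (e.g. via `output_hidden_states=True` in transformers) and track, per layer, the norm of the mean state and the across-token variance. A sketch on a synthetic residual stream with an injected spike (the shapes and the spike itself are illustrative assumptions):

```python
import numpy as np

def layer_stats(hidden):
    """hidden: (num_layers, num_tokens, d_model).
    Returns per-layer mean-state norm (convergence trajectory)
    and mean across-token variance (path heterogeneity)."""
    mean_norm = np.linalg.norm(hidden.mean(axis=1), axis=-1)  # (num_layers,)
    variance = hidden.var(axis=1).mean(axis=-1)               # (num_layers,)
    return mean_norm, variance

# Synthetic residual stream: 12 layers, 64 positions, 768 dims,
# with a transient mid-layer spike mimicking the low-entropy pattern.
rng = np.random.default_rng(2)
hidden = rng.normal(size=(12, 64, 768))
hidden[6] *= 3.0  # inject a variance spike at layer 6
mean_norm, variance = layer_stats(hidden)
print(int(np.argmax(variance)))  # → 6
```

Comparing these two curves between the low-entropy and high-entropy substitution runs is what distinguishes "heterogeneous paths" (variance spikes, similar means) from monotone perturbation growth.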
Training the skip-gram natively on Qwen3's subword vocabulary weakens the signal (subwords see narrower contexts) and reverses the pattern.
Repo with reproducible experiments: https://github.com/maykef/entropy2vec
Results summary (tables): https://github.com/maykef/entropy2vec/blob/main/results/summary_table.md
Conclusion: the low/high-entropy distinction is real and model-agnostic, but post-hoc exploitation for inference speed or memory is negligible on frozen models; the embedding lookup and forward pass still run at full cost. Meaningful gains would require conditional compute built in from pretraining.
Curious about similar observations in larger models or alternative uses (e.g., uncertainty detection, curriculum design).