While the "Lost in the Middle" (LitM) phenomenon is well-documented empirically, it is usually attributed to training data distribution or the lack of long-range dependencies in common datasets.
In this paper, I show that LitM is already present at initialization. By deriving an exact theory based on the Jacobian norm, I demonstrate that the characteristic U-shaped attention curve is a structural property of the Transformer architecture itself.
Key findings:
Architectural Determinism: Even with random weights, the model is "born" prioritizing the start and end of sequences.
Jacobian Norm Analysis: I use the Jacobian to measure how sensitive the output is to input tokens at different positions, revealing a clear position-dependent bias.
Pretraining vs. Initialization: I compare Qwen-2.5B before and after pretraining to show that while training adds "content detectors" (local spikes), it does not remove the underlying global U-shape.
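For readers curious what "per-position Jacobian norm" means in practice, here is a minimal sketch of the measurement idea on a toy, randomly initialized single-head causal attention layer. This is my own illustration, not the paper's probe: the layer, dimensions, and finite-difference estimator are all assumptions chosen for clarity, and a toy layer without positional encoding will not reproduce the full U-shape.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 16, 8  # toy sequence length and model dim (illustrative choices)

# Random-weight single-head attention, no training.
Wq, Wk, Wv = (rng.normal(0, d**-0.5, (d, d)) for _ in range(3))

def attn_last(x):
    """Output at the final position of causal softmax attention."""
    q = x[-1] @ Wq                 # query of the last token
    k, v = x @ Wk, x @ Wv
    s = k @ q / np.sqrt(d)         # scores over all earlier positions
    w = np.exp(s - s.max())
    w /= w.sum()
    return w @ v

# Sensitivity of the final output to the token at position j:
# Frobenius norm of d(attn_last)/d(x_j), estimated by central differences.
x = rng.normal(size=(T, d))
eps = 1e-5
norms = []
for j in range(T):
    J = np.zeros((d, d))
    for i in range(d):
        xp, xm = x.copy(), x.copy()
        xp[j, i] += eps
        xm[j, i] -= eps
        J[:, i] = (attn_last(xp) - attn_last(xm)) / (2 * eps)
    norms.append(np.linalg.norm(J))

print(norms)  # inspect how sensitivity varies with position j
```

In a deep model one would take the same Jacobian through all layers (e.g. with autodiff rather than finite differences); plotting the norm against position j is what produces the curve discussed above.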
This suggests that "fixing" long-context retrieval might require rethinking the initialization or the softmax-attention geometry itself, rather than just scaling up training data.
I’m the author of the paper and would love to hear the community’s thoughts on whether this structural bias can ever truly be overcome within the standard Transformer paradigm.
yorwba•18m ago
I recommend asking a friend who's a better writer and mathematician than Claude Code to help you reorganize the paper so that there are no gaps in the argumentation, and so that incorrect statements like "For a purely causal transformer without residuals, the gradient routed from the final token L to an earlier token j after H layers is given by the bottom row of the exponential Cesàro Matrix M^H" are replaced with mathematically correct descriptions.
Also have them check your experiments, because the description doesn't inspire confidence your (Claude's) implementation isn't flawed in ways that invalidate your results. In particular, "our experimental code utilizes a highly efficient one-pass scalar-probe surrogate" sounds fishy.
borundev•2h ago