While the "Lost in the Middle" (LitM) phenomenon is well-documented empirically, it is usually attributed to training data distribution or the lack of long-range dependencies in common datasets.
In this paper, I show that LitM is already present at initialization. By deriving an exact theory based on the Jacobian norm, I demonstrate that the characteristic U-shaped attention curve is a structural property of the Transformer architecture itself.
Key findings:
Architectural Determinism: Even with random weights, the model is "born" prioritizing the start and end of sequences.
Jacobian Norm Analysis: I use the Jacobian to measure how sensitive the output is to input tokens at different positions, revealing a clear position-dependent bias.
Pretraining vs. Initialization: I compare Qwen-2.5B before and after pretraining to show that while training adds "content detectors" (local spikes), it does not remove the underlying global U-shape.
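For readers curious what "per-position Jacobian norm" means in practice, here is a minimal sketch of the measurement idea on a toy, randomly initialized single-head causal attention layer. This is my own illustration, not the paper's probe: the layer, dimensions, and finite-difference estimator are all assumptions chosen for clarity, and a toy layer without positional encoding will not reproduce the full U-shape.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 16, 8  # toy sequence length and model dim (illustrative choices)

# Random-weight single-head attention, no training.
Wq, Wk, Wv = (rng.normal(0, d**-0.5, (d, d)) for _ in range(3))

def attn_last(x):
    """Output at the final position of causal softmax attention."""
    q = x[-1] @ Wq                 # query of the last token
    k, v = x @ Wk, x @ Wv
    s = k @ q / np.sqrt(d)         # scores over all earlier positions
    w = np.exp(s - s.max())
    w /= w.sum()
    return w @ v

# Sensitivity of the final output to the token at position j:
# Frobenius norm of d(attn_last)/d(x_j), estimated by central differences.
x = rng.normal(size=(T, d))
eps = 1e-5
norms = []
for j in range(T):
    J = np.zeros((d, d))
    for i in range(d):
        xp, xm = x.copy(), x.copy()
        xp[j, i] += eps
        xm[j, i] -= eps
        J[:, i] = (attn_last(xp) - attn_last(xm)) / (2 * eps)
    norms.append(np.linalg.norm(J))

print(norms)  # inspect how sensitivity varies with position j
```

In a deep model one would take the same Jacobian through all layers (e.g. with autodiff rather than finite differences); plotting the norm against position j is what produces the curve discussed above.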
This suggests that "fixing" long-context retrieval might require rethinking the initialization or the softmax-attention geometry itself, rather than just scaling up training data.
I’m the author of the paper and would love to hear the community’s thoughts on whether this structural bias can ever truly be overcome within the standard Transformer paradigm.
yorwba•18m ago
I recommend asking a friend who's a better writer and mathematician than Claude Code to help you reorganize the paper so that there are no gaps in the argumentation, and so that incorrect statements like "For a purely causal transformer without residuals, the gradient routed from the final token L to an earlier token j after H layers is given by the bottom row of the exponential Cesàro Matrix M^H" are replaced with mathematically correct descriptions.
Also have them check your experiments, because the description doesn't inspire confidence your (Claude's) implementation isn't flawed in ways that invalidate your results. In particular, "our experimental code utilizes a highly efficient one-pass scalar-probe surrogate" sounds fishy.
borundev•2h ago