This is my first project/paper (independent work; I’m a high school student in Korea).
PonderTTT explores when to run Test-Time Training (TTT) updates instead of updating on every segment.
The gate is training-free: it uses the TTT layer's self-supervised reconstruction loss to decide UPDATE vs SKIP.
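For concreteness, here is a minimal sketch of the gate (illustrative only, not the exact v1 code; it assumes the convention that high reconstruction loss triggers UPDATE, with the threshold coming from the calibration described next):

```python
def gate(recon_loss: float, threshold: float) -> bool:
    """Return True for UPDATE, False for SKIP.

    Convention assumed here: high reconstruction loss means the fast
    weights fit the current segment poorly, so a TTT update is likely
    worthwhile; low loss means the segment is already well modeled
    and the update can be skipped.
    """
    return recon_loss > threshold
```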
A single threshold is calibrated on unlabeled data and adapted online via an exponential moving average (EMA) to maintain a target update rate.
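And a hedged sketch of that calibration plus the online EMA adjustment (the names `calibrate_threshold`/`OnlineGate`, the quantile choice, and the proportional step rule are illustrative simplifications, not the exact v1 implementation):

```python
import numpy as np

def calibrate_threshold(losses: np.ndarray, target_rate: float) -> float:
    """Set the threshold at the (1 - target_rate) quantile of reconstruction
    losses on unlabeled calibration segments, so roughly target_rate of
    them would trigger UPDATE."""
    return float(np.quantile(losses, 1.0 - target_rate))

class OnlineGate:
    """Tracks the realized update rate with an EMA and nudges the
    threshold so the rate stays near target_rate during deployment."""

    def __init__(self, threshold: float, target_rate: float = 0.25,
                 ema_decay: float = 0.99, step: float = 0.01):
        self.threshold = threshold
        self.target_rate = target_rate
        self.ema_decay = ema_decay
        self.step = step              # threshold adjustment size (assumed)
        self.rate_ema = target_rate   # start the EMA at the target

    def __call__(self, recon_loss: float) -> bool:
        update = recon_loss > self.threshold
        # EMA of how often we actually run a TTT update.
        self.rate_ema = (self.ema_decay * self.rate_ema
                         + (1.0 - self.ema_decay) * float(update))
        # Updating too often -> raise the threshold; too rarely -> lower it.
        self.threshold += self.step * (self.rate_ema - self.target_rate)
        return update
```

The proportional correction above is just one simple way to keep the realized rate near its target; the rule in the actual code may differ.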
In GPT-2 code language modeling experiments (124M–1.5B parameters), the gate reached 82–89% Oracle Recovery and improved over Random Skip baselines on some out-of-distribution (OOD) languages (up to 16% lower loss).
Note: the v1 experiments run in JAX/Flax on GPUs; I'm working on a v2 scale-up to Gemma 3 on TPUs.
I'd really appreciate feedback on:
1) whether reconstruction loss is a sensible gating signal in practice,
2) which evaluations or baselines you think are important for this setting.