Quiet Feature Learning in Transformers
This is one of the most fascinating papers I have read this week.
Let me explain:
It argues that loss curves can be misleading about what a model is actually learning.
The default approach to monitoring neural network training relies on loss as the primary progress measure: if loss is flat, nothing is happening; if loss drops, learning is occurring.
But this assumption breaks down on algorithmic tasks.
This new research trained Transformers on ten foundational algorithmic tasks and discovered "quiet features": internal representations that develop while loss appears stagnant.
They find that models learn intermediate computational steps long before those steps improve output performance. Carry bits in addition, queue membership in BFS, partial products in multiplication. These features emerge during extended plateaus, then suddenly combine to solve the task.
The researchers probed internal representations across binary arithmetic (addition, multiplication), graph algorithms (BFS, shortest path, topological sort, MST), and sequence optimization (maximum subarray, activity selection).
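To make the probing idea concrete, here is a minimal sketch of a linear probe (my illustration, not the paper's exact setup). It assumes you can extract hidden states at some layer and have ground-truth labels for the candidate feature, such as the carry bit at each position:

```python
# Minimal linear-probe sketch (illustrative; the paper's probing setup may differ).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def probe_feature(hidden_states: np.ndarray, labels: np.ndarray) -> float:
    """Fit a linear probe on frozen activations and return held-out accuracy.

    hidden_states: (num_examples, hidden_dim) activations at the probed layer/position.
    labels:        (num_examples,) binary labels for the candidate feature (e.g., carry bit).
    """
    X_train, X_test, y_train, y_test = train_test_split(
        hidden_states, labels, test_size=0.2, random_state=0
    )
    probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    return probe.score(X_test, y_test)
```

High probe accuracy while task accuracy is still near chance is exactly the "quiet feature" signature.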
Six tasks showed clear two-phase transitions: prolonged stagnation followed by abrupt performance gains.
Ablation experiments confirmed causality. Removing carry features from a 64-bit addition model caused a 75.1% accuracy drop. Ablating queue membership in BFS dropped accuracy by 43.6%.
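A directional ablation can be sketched with a PyTorch forward hook that projects a feature direction out of a layer's activations (illustrative; the paper's ablation procedure may differ, and the layer name in the usage comment is hypothetical):

```python
# Directional-ablation sketch: remove the component of the activations along a
# feature direction (e.g., the weight vector of a trained linear probe).
import torch

def make_ablation_hook(direction: torch.Tensor):
    """Return a forward hook that projects `direction` out of the module's output.

    Assumes the hooked module returns a plain tensor of shape (batch, seq_len, hidden_dim).
    """
    d = direction / direction.norm()

    def hook(module, inputs, output):
        # Subtract the component of each activation vector along d.
        return output - (output @ d).unsqueeze(-1) * d

    return hook

# Hypothetical usage: attach the hook, re-run evaluation, compare to the clean model.
# handle = model.layers[3].mlp.register_forward_hook(make_ablation_hook(probe_direction))
# ... evaluate ...
# handle.remove()
```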
Algorithmic tasks require multiple subroutines functioning together, and individual correct components don't reduce loss until all the pieces align: a model that computes carries correctly but can't yet combine them with the digit sums still emits wrong outputs, so cross-entropy barely moves. Models accumulate latent capabilities beneath flat loss curves.
It seems that cross-entropy loss is an incomplete diagnostic. Substantial internal learning can occur while metrics appear stagnant. This motivates richer monitoring tools beyond loss curves.
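One cheap way to get that richer signal (a sketch of the general idea, not something from the paper): periodically fit probes for hypothesized intermediate features during training and log their accuracy next to the loss, reusing the probe_feature helper from the sketch above:

```python
# Log probe accuracies alongside loss so latent progress is visible even when
# the loss curve is flat. `feature_labels` maps feature names to label arrays,
# e.g. {"carry_bit": ..., "queue_membership": ...} (hypothetical names).
def log_diagnostics(step, loss, hidden_states, feature_labels):
    metrics = {"step": step, "loss": float(loss)}
    for name, labels in feature_labels.items():
        metrics[f"probe_acc/{name}"] = probe_feature(hidden_states, labels)
    print(metrics)  # or send to your experiment tracker of choice
```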
🔖 (bookmark it)
Paper: arxiv.org/abs/2505.03997