Temporal Sampling for Forgotten Reasoning in LLMs

1University of Washington 2Carnegie Mellon University 3Western Washington University
*Equal Contribution, Equal Advising

Abstract

Fine-tuning large language models (LLMs) is intended to improve their reasoning capabilities, yet we uncover a counterintuitive effect: over the course of training, models often forget how to solve problems they previously answered correctly. We term this phenomenon Temporal Forgetting and show that it is widespread across model sizes, fine-tuning methods (both Reinforcement Learning and Supervised Fine-Tuning), and multiple reasoning benchmarks. Our analysis reveals that 6.4% to 56.1% of the questions the final model answers incorrectly were solved correctly at an earlier checkpoint.

Inspired by this phenomenon, we propose Temporal Sampling, a simple decoding strategy that draws outputs from multiple checkpoints along the training trajectory. This approach recovers forgotten solutions without retraining or ensembling and leads to significant improvements in reasoning performance, with gains of 4 to 19 points in Pass@k and consistent gains for majority voting and Best-of-N across several benchmarks. To make Temporal Sampling deployment-friendly, we extend it to LoRA-adapted models. By leveraging the temporal diversity inherent in training, Temporal Sampling offers a practical, compute-efficient way to surface hidden reasoning ability and rethink how we evaluate LLMs.

Temporal Forgetting and Temporal Sampling

Temporal Forgetting Illustration

We observed that during RL training of the DeepSeek-R1-1.5B model, 76.7% of AIME problems were solved correctly at some intermediate checkpoint, yet only 30% remained correct in the final model. We term this phenomenon Temporal Forgetting.

Temporal Sampling treats training dynamics as a source of answer diversity: inference samples are distributed across multiple distinct checkpoints from the training trajectory rather than drawn solely from the final checkpoint.
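As a concrete illustration, the sketch below distributes a sampling budget of k responses round-robin across t checkpoints. The helpers load_model and generate are hypothetical stand-ins for an actual inference stack, not the paper's reference implementation.

    # Sketch: spread a budget of k samples across t training checkpoints
    # (round-robin), instead of drawing all k from the final model.
    def temporal_sampling(prompt, checkpoint_paths, k, load_model, generate):
        t = len(checkpoint_paths)
        # Samples assigned to each checkpoint; earlier checkpoints absorb the remainder.
        per_ckpt = [k // t + (1 if i < k % t else 0) for i in range(t)]
        outputs = []
        for path, n in zip(checkpoint_paths, per_ckpt):
            if n == 0:
                continue
            model = load_model(path)  # reload full weights or swap a LoRA adapter
            outputs.extend(generate(model, prompt, num_samples=n))
        return outputs  # feed into Pass@k, majority voting, or Best-of-N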

Main Takeaways

Takeaway 1: Overall Performance Scores Do Not Tell the Whole Story

Despite improvements in overall performance, a considerable percentage of questions answered correctly by the base model are answered incorrectly after RL/SFT.

Overall Performance Analysis

Fine-tuned models like DeepscaleR-1.5B and OpenR1-7B outperform the base model overall but also forget many questions the base model answered correctly. This highlights that overall performance metrics cannot capture the nuanced changes happening during training.

Takeaway 2: Temporal Forgetting

Benchmark questions may oscillate between correct and incorrect states across checkpoints during RL/SFT. A considerable percentage of questions (from 6.4% to 56.1%) are answered correctly at some checkpoint during training but are ultimately incorrect in the final checkpoint.

Forgetting Dynamics

Answer correctness trajectories for OlympiadBench questions across training checkpoints, illustrating that solutions oscillate between correct and incorrect states. The per-benchmark percentages of questions that are ever forgotten, or ever correct at some checkpoint during GRPO, show significant temporal dynamics.

Forgetting Dynamics

Performance of fine-tuned models (P_FT ↑), the Ever Correct Score (P_ECS ↑), and the Temporal Forgetting Score (P_TFS ↓) for different models after GRPO or SFT. We observe both high P_ECS and high P_TFS, which implies that a high percentage of questions (from 6.4% to 56.1%) are answered correctly at some checkpoint during training but are ultimately incorrect at the final checkpoint.
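Read literally from the definitions above, both scores can be computed from a per-question correctness record over checkpoints. The sketch below is our own reading, with variable names that are not the paper's: a question counts toward the Ever Correct Score if it is correct at any checkpoint, and toward the Temporal Forgetting Score if it is correct at some checkpoint but wrong at the final one.

    # correct[i] is the 0/1 correctness trajectory of question i across
    # checkpoints, with the final checkpoint in the last position.
    def ecs_and_tfs(correct):
        n = len(correct)
        ever_correct = sum(1 for traj in correct if any(traj))
        forgotten = sum(1 for traj in correct if any(traj) and not traj[-1])
        return ever_correct / n, forgotten / n  # (P_ECS, P_TFS)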

Takeaway 3: Improvement of Sampling Performance

Temporal Sampling achieves higher pass rates than sampling only from the final checkpoint.

Pass@k Performance

Pass@k for different numbers of checkpoints t on the AIME2024, AMC, and AIME2025 benchmarks. Our proposed Temporal Sampling for Qwen2.5-7B with t = 8 outperforms the baseline by more than 19, 13, and 4 percentage points on AIME2024, AMC, and AIME2025, respectively, when sampling 64 responses.
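For reference, Pass@k over a pool of sampled responses can be estimated with the standard combinatorial estimator; under Temporal Sampling the pool simply contains the samples drawn from all t checkpoints. This is a sketch of the metric, not the paper's evaluation code.

    from math import comb

    def pass_at_k(n, c, k):
        # Unbiased Pass@k estimate given n sampled responses, c of them correct.
        if n - c < k:
            return 1.0
        return 1.0 - comb(n - c, k) / comb(n, k)

    # Example: 64 responses pooled across 8 checkpoints, 5 of them correct.
    # pass_at_k(64, 5, 8) ≈ 0.50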

Takeaway 4: Improvement of Test-Time Scaling Performance

Temporal Sampling yields better test-time scaling than sampling only from the final checkpoint.

Majority Voting Performance

Maj@k (Majority voting) and Best-of-N results show that Temporal Sampling with t = 8 checkpoints significantly outperforms the baseline across multiple benchmarks. This demonstrates the effectiveness of leveraging temporal diversity for various inference-time scaling approaches.
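Majority voting composes naturally with Temporal Sampling: final answers are extracted from the responses pooled across all checkpoints and the most frequent one wins. A minimal sketch (answer extraction and tie-breaking are ours):

    from collections import Counter

    def majority_vote(answers):
        # `answers`: final answers extracted from all sampled responses,
        # pooled across the t checkpoints; ties resolve to the answer seen first.
        return Counter(answers).most_common(1)[0][0]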

Takeaway 5: Temporal Sampling with LoRA

Temporal Sampling can be implemented efficiently with LoRA-adapted models, reducing storage requirements while preserving its performance gains.

LoRA Performance

Performance of Temporal Sampling using 8 checkpoints from LoRA SFT of Qwen2.5-7B. Results demonstrate that Temporal Sampling with LoRA checkpoints surpasses the baseline (sampling only from the final checkpoint) for both Pass@k and Maj@k, making it more storage-efficient for deployment.
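One way to realize this setup is to keep a single copy of the base model and hot-swap one LoRA adapter per checkpoint, e.g. with Hugging Face PEFT. The snippet below is a sketch under the assumption that each of the 8 checkpoints was saved as a LoRA adapter directory; the paths are placeholders.

    from transformers import AutoModelForCausalLM
    from peft import PeftModel

    base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-7B")  # shared base weights

    # Placeholder paths to the 8 LoRA adapters saved along the SFT trajectory.
    adapter_dirs = [f"checkpoints/lora_step_{i}" for i in range(1, 9)]

    model = PeftModel.from_pretrained(base, adapter_dirs[0], adapter_name="ckpt_1")
    for i, path in enumerate(adapter_dirs[1:], start=2):
        model.load_adapter(path, adapter_name=f"ckpt_{i}")

    # At inference time, rotate the active adapter over the sampling budget
    # so each checkpoint contributes its share of the k responses.
    for i in range(1, 9):
        model.set_adapter(f"ckpt_{i}")
        # ... generate the samples allocated to checkpoint i ...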