Temporal Sampling for Forgotten Reasoning in LLMs

1University of Washington 2Carnegie Mellon University 3Western Washington University
*Equal Contribution, Equal Advising

Abstract

Fine-tuning large language models (LLMs) is intended to improve their reasoning capabilities, yet we uncover a counterintuitive effect: over the course of training, models often forget how to solve problems they previously answered correctly. We term this phenomenon Temporal Forgetting and show that it is widespread across model sizes, fine-tuning methods (both Reinforcement Learning and Supervised Fine-Tuning), and multiple reasoning benchmarks. Our analysis reveals that 6.4% to 56.1% of the questions the final model answers incorrectly were solved correctly at an earlier checkpoint.

Inspired by this phenomenon, we propose Temporal Sampling, a simple decoding strategy that draws outputs from multiple checkpoints along the training trajectory. This approach recovers forgotten solutions without retraining or ensembling and leads to significant improvements in reasoning performance, with gains of 4 to 19 points in Pass@k and consistent gains for majority voting and Best-of-N across several benchmarks. To make Temporal Sampling deployment-friendly, we extend it to LoRA-adapted models. By leveraging the temporal diversity inherent in training, Temporal Sampling offers a practical, compute-efficient way to surface hidden reasoning ability and rethink how we evaluate LLMs.

Temporal Forgetting and Temporal Sampling

Temporal Forgetting Illustration

We observed that during RL training of the DeepSeek-R1-1.5B model, 76.7% of AIME problems were solved correctly at some intermediate checkpoint, yet only 30% remained correct in the final model. We term this phenomenon Temporal Forgetting.

Temporal Sampling treats training dynamics as a source of answer diversity: inference samples are distributed across multiple distinct checkpoints from the training trajectory rather than drawn solely from the final checkpoint.
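As a concrete illustration, the sketch below distributes a sampling budget of k responses round-robin across t checkpoints. The helpers load_model and generate are hypothetical stand-ins for an actual inference stack, not the paper's reference implementation.

    # Sketch: spread a budget of k samples across t training checkpoints
    # (round-robin), instead of drawing all k from the final model.
    def temporal_sampling(prompt, checkpoint_paths, k, load_model, generate):
        t = len(checkpoint_paths)
        # Samples assigned to each checkpoint; earlier checkpoints absorb the remainder.
        per_ckpt = [k // t + (1 if i < k % t else 0) for i in range(t)]
        outputs = []
        for path, n in zip(checkpoint_paths, per_ckpt):
            if n == 0:
                continue
            model = load_model(path)  # reload full weights or swap a LoRA adapter
            outputs.extend(generate(model, prompt, num_samples=n))
        return outputs  # feed into Pass@k, majority voting, or Best-of-N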

Main Takeaways

Takeaway 1: Overall Performance Scores Do Not Tell the Whole Story

Despite improvements in overall performance, a considerable percentage of questions answered correctly by the base model are answered incorrectly after RL/SFT.

Overall Performance Analysis

Fine-tuned models like DeepscaleR-1.5B and OpenR1-7B outperform the base model overall but also forget many questions the base model answered correctly. This highlights that overall performance metrics cannot capture the nuanced changes happening during training.

Takeaway 2: Temporal Forgetting

Benchmark questions may oscillate between correct and incorrect states across checkpoints during RL/SFT. A considerable percentage of questions (from 6.4% to 56.1%) are answered correctly at some checkpoint during training but are ultimately incorrect in the final checkpoint.

Forgetting Dynamics

Answer correctness trajectories for OlympiadBench questions across training checkpoints, illustrating that solutions oscillate between correct and incorrect states. The per-benchmark percentages of questions that are ever forgotten, or ever correct at some checkpoint during GRPO, show significant temporal dynamics.

Forgetting Dynamics

Performance of fine-tuned models (P_FT ↑), the Ever Correct Score (P_ECS ↑), and the Temporal Forgetting Score (P_TFS ↓) for different models after GRPO or SFT. We observe both high P_ECS and high P_TFS, which implies that a high percentage of questions (from 6.4% to 56.1%) are answered correctly at some checkpoint during training but are ultimately incorrect at the final checkpoint.
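Read literally from the definitions above, both scores can be computed from a per-question correctness record over checkpoints. The sketch below is our own reading, with variable names that are not the paper's: a question counts toward the Ever Correct Score if it is correct at any checkpoint, and toward the Temporal Forgetting Score if it is correct at some checkpoint but wrong at the final one.

    # correct[i] is the 0/1 correctness trajectory of question i across
    # checkpoints, with the final checkpoint in the last position.
    def ecs_and_tfs(correct):
        n = len(correct)
        ever_correct = sum(1 for traj in correct if any(traj))
        forgotten = sum(1 for traj in correct if any(traj) and not traj[-1])
        return ever_correct / n, forgotten / n  # (P_ECS, P_TFS)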

Takeaway 3: Improvement of Sampling Performance

Temporal Sampling achieves higher pass rates than sampling only from the final checkpoint.

Pass@k Performance

Pass@k for different numbers of checkpoints t on the AIME2024, AMC, and AIME2025 benchmarks. Our proposed Temporal Sampling for Qwen2.5-7B with t = 8 outperforms the baseline by more than 19, 13, and 4 percentage points on AIME2024, AMC, and AIME2025, respectively, when sampling 64 responses.
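For reference, Pass@k over a pool of sampled responses can be estimated with the standard combinatorial estimator; under Temporal Sampling the pool simply contains the samples drawn from all t checkpoints. This is a sketch of the metric, not the paper's evaluation code.

    from math import comb

    def pass_at_k(n, c, k):
        # Unbiased Pass@k estimate given n sampled responses, c of them correct.
        if n - c < k:
            return 1.0
        return 1.0 - comb(n - c, k) / comb(n, k)

    # Example: 64 responses pooled across 8 checkpoints, 5 of them correct.
    # pass_at_k(64, 5, 8) ≈ 0.50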

Takeaway 4: Improvement of Test-Time Scaling Performance

Temporal Sampling yields better test-time scaling than sampling only from the final checkpoint.

Majority Voting Performance

Maj@k (Majority voting) and Best-of-N results show that Temporal Sampling with t = 8 checkpoints significantly outperforms the baseline across multiple benchmarks. This demonstrates the effectiveness of leveraging temporal diversity for various inference-time scaling approaches.
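Majority voting composes naturally with Temporal Sampling: final answers are extracted from the responses pooled across all checkpoints and the most frequent one wins. A minimal sketch (answer extraction and tie-breaking are ours):

    from collections import Counter

    def majority_vote(answers):
        # `answers`: final answers extracted from all sampled responses,
        # pooled across the t checkpoints; ties resolve to the answer seen first.
        return Counter(answers).most_common(1)[0][0]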

Takeaway 5: Temporal Sampling with LoRA

Temporal Sampling can be implemented efficiently with LoRA-adapted models, reducing storage requirements while preserving its performance gains.

LoRA Performance

Performance of Temporal Sampling using 8 checkpoints from LoRA SFT of Qwen2.5-7B. Results demonstrate that Temporal Sampling with LoRA checkpoints surpasses the baseline (sampling only from the final checkpoint) for both Pass@k and Maj@k, making it more storage-efficient for deployment.
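One way to realize this setup is to keep a single copy of the base model and hot-swap one LoRA adapter per checkpoint, e.g. with Hugging Face PEFT. The snippet below is a sketch under the assumption that each of the 8 checkpoints was saved as a LoRA adapter directory; the paths are placeholders.

    from transformers import AutoModelForCausalLM
    from peft import PeftModel

    base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-7B")  # shared base weights

    # Placeholder paths to the 8 LoRA adapters saved along the SFT trajectory.
    adapter_dirs = [f"checkpoints/lora_step_{i}" for i in range(1, 9)]

    model = PeftModel.from_pretrained(base, adapter_dirs[0], adapter_name="ckpt_1")
    for i, path in enumerate(adapter_dirs[1:], start=2):
        model.load_adapter(path, adapter_name=f"ckpt_{i}")

    # At inference time, rotate the active adapter over the sampling budget
    # so each checkpoint contributes its share of the k responses.
    for i in range(1, 9):
        model.set_adapter(f"ckpt_{i}")
        # ... generate the samples allocated to checkpoint i ...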