Taiwei Shi

Experiential Reinforcement Learning

arXiv Preprint, 2026

Abstract

Reinforcement learning has become the central approach for language models (LMs) to learn from environmental rewards and feedback. In practice, such feedback is usually sparse and delayed. Learning from these signals is challenging: LMs must implicitly infer how observed failures should translate into behavioral changes in future iterations.

We introduce Experiential Reinforcement Learning (ERL), a training paradigm that embeds an explicit experience–reflection–consolidation loop into the reinforcement learning process. Given a task, the model generates an initial attempt, receives environmental feedback, and produces a reflection that guides a refined second attempt, whose success is reinforced and internalized into the base policy.

This process converts feedback into structured behavioral revision, improving exploration and stabilizing optimization while preserving gains at deployment without additional inference cost. Across sparse-reward control environments and agentic reasoning benchmarks, ERL consistently improves learning efficiency and final performance over strong reinforcement learning baselines, achieving gains of up to +81% in complex multi-step environments and up to +11% in tool-using reasoning tasks.

ERL Overview


Overview

Experiential Reinforcement Learning (ERL) augments traditional reinforcement learning with an explicit loop:

  1. Experience — The model attempts a task and receives feedback
  2. Reflection — The model generates a structured critique
  3. Consolidation — The model makes a refined second attempt, and successful revisions are internalized into the policy

This enables models to transform sparse rewards into actionable behavioral updates.
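The three-step loop above can be sketched as a toy program. This is a minimal illustration, not the paper's implementation: the environment, policy, and reflection functions below are hypothetical stand-ins for an LM interacting with a real task.

```python
def env_step(task, attempt):
    """Toy environment: reward 1.0 on a correct attempt, else 0 with a hint."""
    target = task["target"]
    if attempt == target:
        return 1.0, "success"
    return 0.0, f"expected {target!r}"

def toy_policy(task, context=None):
    """Toy 'LM': guesses wrong at first, but follows a reflection hint."""
    if context and "expected" in context:
        return context.split("expected ")[1].strip("'\"")
    return "wrong-guess"

def reflect(task, attempt, feedback):
    """Structured critique: here we just repackage the environmental feedback."""
    return f"Attempt {attempt!r} failed; {feedback}"

def erl_step(policy, task):
    # 1. Experience: first attempt plus environmental feedback
    attempt1 = policy(task)
    r1, feedback = env_step(task, attempt1)
    # 2. Reflection: generate a critique of the first attempt
    critique = reflect(task, attempt1, feedback)
    # 3. Consolidation: refined second attempt conditioned on the reflection;
    #    the successful revision is what would be reinforced into the policy
    attempt2 = policy(task, context=critique)
    r2, _ = env_step(task, attempt2)
    return max((r1, attempt1), (r2, attempt2))

task = {"target": "open-door"}
reward, attempt = erl_step(toy_policy, task)
```

Here the first attempt earns zero reward, the reflection converts the feedback into an actionable hint, and the second attempt succeeds — the within-episode improvement that ERL's consolidation step then distills into the base policy.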



Motivation

Traditional RL with verifiable rewards (RLVR) relies on trial-and-error driven by scalar rewards, which can lead to inefficient exploration and unstable learning in sparse-reward environments.

ERL introduces structured intermediate reasoning to:

  1. Improve exploration in sparse-reward environments
  2. Stabilize optimization during training
  3. Preserve gains at deployment without additional inference cost



Method

Method Diagram


Experiments

We evaluate ERL on:

  1. FrozenLake — sparse-reward grid-world control
  2. Sokoban — complex multi-step puzzle control
  3. HotpotQA — tool-using multi-hop question answering

Models: Qwen and Olmo

Optimizer: GRPO (Group Relative Policy Optimization)
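GRPO replaces a learned value baseline with group-relative advantages: for each prompt, several completions are sampled and each reward is normalized against its group's statistics. A minimal sketch of that normalization (the function name and `eps` term are our own, not from the paper):

```python
from statistics import mean, pstdev

def grpo_advantages(rewards, eps=1e-6):
    """Group-relative advantages as in GRPO: normalize each sampled
    completion's reward against its group's mean and population std.
    No value network is needed; eps guards against a zero-variance group
    (all completions equally rewarded -> advantages near zero)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# e.g. a group of 4 rollouts for one prompt with sparse 0/1 rewards
advs = grpo_advantages([1.0, 0.0, 0.0, 1.0])
```

With sparse binary rewards, successful rollouts get positive advantage and failed ones negative, which is exactly the setting where ERL's reflection step helps produce more successes per group.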


Results

Training Curves

ERL consistently achieves:

  1. Faster, more stable learning than RLVR baselines
  2. Higher final performance, with gains of up to +81% in complex multi-step environments and up to +11% in tool-using reasoning tasks


Final Performance

Result Bar Chart

Task         Qwen RLVR   Qwen ERL   Olmo RLVR   Olmo ERL
FrozenLake   0.86        0.94       0.39        0.66
HotpotQA     0.45        0.56       0.47        0.50
Sokoban      0.06        0.87       0.04        0.20

Learning Dynamics

Reflection Effect

Post-reflection trajectories consistently outperform both pre-reflection and RLVR, demonstrating that reflection provides immediate within-episode improvement.


Ablation Study

Task                RLVR   ERL    ERL w/o Memory   ERL w/o Reflection
FrozenLake (Qwen)   0.86   0.94   0.86             0.60
HotpotQA (Qwen)     0.45   0.56   0.56             0.48
Sokoban (Qwen)      0.06   0.87   0.87             0.59
FrozenLake (Olmo)   0.39   0.66   0.64             0.54
HotpotQA (Olmo)     0.47   0.50   0.47             0.46
Sokoban (Olmo)      0.04   0.20   0.24             0.06

To isolate the contribution of individual components in ERL, we conduct ablations that remove either cross-episode memory or structured reflection while keeping the rest of the training setup fixed.

Results show that removing reflection leads to the largest performance drop, indicating that structured reflective reasoning is the primary driver of ERL’s gains. Removing memory generally slows convergence and slightly reduces performance, suggesting it mainly improves stability and cumulative learning across episodes.
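The two ablation variants can be viewed as switches on the ERL loop. The sketch below is hypothetical (toy policy, illustrative names) and only shows how disabling each component changes what conditions the second attempt:

```python
def make_policy():
    """Toy policy: succeeds only when at least one critique is in context."""
    def policy(task, context):
        return "success" if context else "failure"
    return policy

def erl_attempt(task, policy, use_reflection=True, use_memory=True, memory=()):
    # cross-episode memory seeds the context (ERL w/o Memory starts empty)
    context = list(memory) if use_memory else []
    first = policy(task, context)
    # structured reflection conditions the second attempt
    # (ERL w/o Reflection skips this step entirely)
    if use_reflection:
        context.append(f"critique: {first} on {task}")
    second = policy(task, context)
    return second, tuple(context)

policy = make_policy()
full, mem = erl_attempt("sokoban-1", policy)
no_reflect, _ = erl_attempt("sokoban-1", policy, use_reflection=False)
```

In this toy setup the full loop succeeds while the no-reflection variant does not, mirroring the ablation finding that reflection drives the within-episode gains, while memory only changes what the next episode starts from.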


Key Contributions

  1. ERL, a training paradigm that embeds an explicit experience–reflection–consolidation loop into the reinforcement learning process
  2. Consistent improvements in learning efficiency and final performance over strong RLVR baselines, with gains of up to +81% in complex multi-step environments and up to +11% in tool-using reasoning tasks
  3. Ablations showing that structured reflection is the primary driver of ERL's gains, while cross-episode memory mainly improves stability and cumulative learning


BibTeX

@misc{shi2026experientialreinforcementlearning,
      title={Experiential Reinforcement Learning}, 
      author={Taiwei Shi and Sihao Chen and Bowen Jiang and Linxin Song and Longqi Yang and Jieyu Zhao},
      year={2026},
      eprint={2602.13949},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2602.13949}, 
}