Authors: Zhongwen Xu and Zihan Ding (Equal contribution)
Date: 16 Sept, 2025
Quick Links: 📜 Paper | 🤗 HuggingFace | GitHub: verl PR
We revisit policy-gradient optimization for Large Language Models (LLMs) from the classic single-stream perspective. Prevailing group-based methods like GRPO reduce variance with on-the-fly baselines but suffer from critical flaws: frequent degenerate groups erase learning signals, and synchronization barriers hinder scalability, especially in long-horizon and agentic scenarios. We introduce Single-stream Policy Optimization (SPO), which eliminates these issues by design. SPO replaces per-group baselines with a persistent, KL-adaptive value tracker and normalizes advantages globally across the batch, providing a stable, low-variance learning signal for every sample. Being group-free, SPO enables higher throughput and scales effectively in long-horizon or agentic settings where generation times vary. Experiments show that SPO converges more smoothly and attains higher accuracy, improving the average maj@32 by $+3.4$ percentage points ($\mathrm{pp}$) over GRPO on challenging math benchmarks, including $+7.3~\mathrm{pp}$ on BRUMO 25, $+4.4~\mathrm{pp}$ on AIME 25, and $+3.3~\mathrm{pp}$ on HMMT 25; moreover, SPO's pass@k curves lie above GRPO's for all evaluated values of $k$ on five hard math competition benchmarks.
Reinforcement Learning (RL) has become a cornerstone for improving the reasoning abilities of LLMs. Methods like Group Relative Policy Optimization (GRPO) and REINFORCE Leave-One-Out (RLOO) have pushed the state of the art by generating a group of responses for each prompt to reduce variance. However, this "group-based" paradigm suffers from two fundamental inefficiencies: degenerate groups that erase the learning signal, and synchronization barriers that hinder scalability in long-horizon and agentic settings.
At the heart of LLM reasoning algorithms in the RLVR paradigm lies the policy gradient:
<aside> 💡
A general form of Policy Gradient algorithms, without bells and whistles:
$$ \nabla_\theta J(\theta) = \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot|x)}\left[(R(x,y) - b(x)) \nabla_\theta \log \pi_\theta(y|x)\right], $$
</aside>
where the response $y \sim \pi_\theta(\cdot|x)$, the outcome reward is $R(x, y)$, and the baseline $b(x)$ is independent of $y$ and is introduced to reduce the variance of the policy gradient.
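For concreteness, here is a minimal PyTorch sketch of this estimator; the tensor names, shapes, and sequence-level aggregation of log-probabilities are illustrative assumptions, not the paper's implementation.

```python
import torch

def policy_gradient_loss(logprobs: torch.Tensor,
                         rewards: torch.Tensor,
                         baselines: torch.Tensor) -> torch.Tensor:
    """Surrogate loss whose gradient matches the policy gradient above.

    logprobs:  (B,) sum of log pi_theta(y_t | x, y_<t) over each response
    rewards:   (B,) outcome rewards R(x, y)
    baselines: (B,) baselines b(x); must not depend on y
    """
    # Detach so (R - b) acts as a constant weight on grad log pi,
    # keeping the estimator unbiased.
    advantages = (rewards - baselines).detach()
    # Minimizing this loss performs gradient ascent on J(theta).
    return -(advantages * logprobs).mean()
```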
To address these flaws of group-based methods, we introduce Single-stream Policy Optimization (SPO), a deliberate return to the classic RL paradigm [1][2] in which each training sample is a single stream, i.e., one prompt-response pair $(x, y)$. SPO replaces the noisy, on-the-fly group baseline with three synergistic components for stable and efficient learning, illustrated below.
The SPO algorithm consists of three synergistic core components:
<aside> 💡
For binary rewards $r(x, y) \in \{0, 1\}$, the value tracker can be updated with a Beta-Bernoulli model as:
$$ \begin{aligned} \alpha(x) &= \rho(x) \alpha_{-1}(x) + r(x, y) \\ \beta(x) &= \rho(x) \beta_{-1}(x) + (1 - r(x, y)) \\ \hat{v}(x) &= \frac{\alpha(x)}{\alpha(x) + \beta(x)} \end{aligned} $$
</aside>
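A minimal Python sketch of this tracker follows. The class name, the Beta($\alpha_0$, $\beta_0$) prior, and the dictionary keyed by prompt id are assumptions for illustration; the KL-adaptive discount $\rho(x)$ is passed in externally rather than derived here.

```python
class BetaBernoulliTracker:
    """Persistent per-prompt value tracker for binary rewards (sketch)."""

    def __init__(self, alpha0: float = 1.0, beta0: float = 1.0):
        # alpha0/beta0 act as a Beta prior over the success probability.
        self.alpha = {}
        self.beta = {}
        self.alpha0 = alpha0
        self.beta0 = beta0

    def update(self, prompt_id, reward: float, rho: float) -> float:
        a = self.alpha.get(prompt_id, self.alpha0)
        b = self.beta.get(prompt_id, self.beta0)
        # Discount the previous pseudo-counts, then add the new binary outcome.
        a = rho * a + reward
        b = rho * b + (1.0 - reward)
        self.alpha[prompt_id], self.beta[prompt_id] = a, b
        # Posterior-mean estimate v_hat(x) = alpha / (alpha + beta).
        return a / (a + b)
```

The resulting $\hat{v}(x)$ serves as the per-prompt baseline $b(x)$, and the advantages $R(x, y) - \hat{v}(x)$ are then normalized globally across the batch rather than within a group.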