Authors: Zhongwen Xu and Zihan Ding (Equal contribution)

Date: 16 Sept, 2025

Quick Links: 📜 Paper | 🤗 HuggingFace | GitHub verl PR


TL;DR

We revisit policy-gradient optimization for Large Language Models (LLMs) from the classic single-stream perspective. Prevailing group-based methods like GRPO reduce variance with on-the-fly baselines but suffer from critical flaws: frequent degenerate groups erase learning signals, and synchronization barriers hinder scalability, especially in long-horizon and agentic scenarios. We introduce Single-stream Policy Optimization (SPO), which eliminates these issues by design. SPO replaces per-group baselines with a persistent, KL-adaptive value tracker and normalizes advantages globally across the batch, providing a stable, low-variance learning signal for every sample. Being group-free, SPO enables higher throughput and scales effectively in long-horizon or agentic settings where generation times vary. Experiments show that SPO converges more smoothly and attains higher accuracy, improving the average maj@32 over GRPO by $+3.4$ percentage points ($\mathrm{pp}$) on challenging math benchmarks, including $+7.3~\mathrm{pp}$ on BRUMO 25, $+4.4~\mathrm{pp}$ on AIME 25, and $+3.3~\mathrm{pp}$ on HMMT 25; moreover, SPO's pass@k curves lie above GRPO's for all evaluated values of $k$ on five hard math competition benchmarks.

🤔 The Problem with "Group-Based" RL

Reinforcement Learning (RL) has become a cornerstone for improving the reasoning abilities of LLMs. Methods like Group Relative Policy Optimization (GRPO) and REINFORCE Leave-One-Out (RLOO) have pushed the state of the art by generating a group of responses for each prompt to reduce variance. However, this "group-based" paradigm suffers from two fundamental inefficiencies.

  1. Wasted Computation from Degenerate Groups: In the group-based approach, a learning signal is created by comparing outcomes within a small group of responses. But what happens if all responses in a group are correct, or all are incorrect? The relative advantage collapses to zero, yielding no learning signal (illustrated in the sketch after this list), and the computation spent generating those responses is completely wasted. Our analysis shows this degeneracy can affect over $80\%$ of samples in GRPO. Dynamic sampling has been proposed as an engineering fix, but it is notoriously sample- and time-inefficient: a batch with only $20\%$ effective samples requires roughly $5\times$ the compute and wall-clock time. Reporting learning steps rather than sampling steps then creates an illusion of efficiency.
  2. Synchronization Bottlenecks: In distributed training, the entire group must wait for its slowest member to finish before the learning step can proceed. This synchronization barrier creates a massive bottleneck, especially in complex agentic tasks that require multi-turn interactions or long-horizon reasoning. A single slow "straggler" can stall its entire group, severely hindering training throughput and scalability.
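A minimal sketch of why degenerate groups carry no signal, assuming the standard mean/std within-group normalization (the helper name `grpo_advantages` and the epsilon are illustrative, not verl's exact implementation):

```python
import numpy as np

def grpo_advantages(rewards, eps=1e-6):
    """Standardize rewards within one group (illustrative GRPO-style baseline)."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# Degenerate groups: every response correct (or every one wrong) ->
# all advantages are zero, so these rollouts contribute no gradient signal.
print(grpo_advantages([1, 1, 1, 1]))  # [0. 0. 0. 0.]
print(grpo_advantages([0, 0, 0, 0]))  # [0. 0. 0. 0.]
print(grpo_advantages([1, 0, 0, 1]))  # mixed outcomes give non-zero advantages
```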

🌟 Our Solution: Single-stream Policy Optimization (SPO)

At the heart of LLM reasoning algorithms in the RLVR paradigm lies the policy gradient:

<aside> 💡

A general form of Policy Gradient algorithms, without bells and whistles:

$$ \nabla_\theta J(\theta) = \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot|x)}\big[(R(x,y) - b(x)) \nabla_\theta \log \pi_\theta(y|x)\big], $$

</aside>

where $y \sim \pi_\theta(\cdot|x)$ is the sampled response, $R(x, y)$ is the outcome reward, and the baseline $b(x)$ is independent of $y$ and is introduced to reduce the variance of the policy gradient.
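As a concrete sketch, the single-sample estimator can be implemented as a surrogate loss whose gradient matches the expression above (PyTorch is assumed here; the function name `policy_gradient_loss` is illustrative):

```python
import torch

def policy_gradient_loss(logprob_y, reward, baseline):
    """Single-sample surrogate loss: minimizing -(R(x,y) - b(x)) * log pi_theta(y|x)
    ascends the policy gradient; reward and baseline are treated as constants."""
    advantage = reward - baseline
    return -advantage * logprob_y  # gradient flows only through log pi_theta

# Toy usage: logprob_y would be the summed token log-probabilities of response y.
logprob_y = torch.tensor(-12.3, requires_grad=True)
loss = policy_gradient_loss(logprob_y, reward=1.0, baseline=0.4)
loss.backward()
print(logprob_y.grad)  # -(1.0 - 0.4) = -0.6
```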

To address the flaws of group-based methods, we introduce Single-stream Policy Optimization (SPO), a deliberate return to the classic RL paradigm [1][2] in which each training sample is a single stream: a prompt-response pair $(x, y)$. SPO replaces the noisy, on-the-fly group baseline with three synergistic components for stable and efficient learning, as illustrated below.

Figure: GRPO vs. SPO (GRPO_vs_SPO.svg).

The SPO algorithm has three synergistic core components:

  1. An Adaptive Value Tracker: Instead of a per-group baseline, SPO uses a persistent Bayesian value tracker $\hat{v}(x)$ for each prompt $x$, where $\hat{v}$ is parameterized with a tabular representation. This tracker maintains a stable, low-variance estimate of the success probability $V_\pi(x)$, informed by the history of rewards $r$. The value tracker $\hat{v}$ adapts dynamically, forgetting older, irrelevant observations as the policy improves; the amount of policy change is measured by $D(x)$, the KL divergence from the old policy $\pi_{-1}$, and the forgetting factor of the value tracker is $\rho(x) = 2^{-D(x) / D_\text{half}}$.

<aside> 💡

For binary rewards $r(x, y) \in \{0, 1\}$, the value tracker can be updated with a Beta-Bernoulli model as:

$$ \begin{aligned} \alpha(x) &= \rho(x) \alpha_{-1}(x) + r(x, y) \\ \beta(x) &= \rho(x) \beta_{-1}(x) + (1 - r(x, y)) \\ \hat{v}(x) &= \frac{\alpha(x)}{\alpha(x) + \beta(x)} \end{aligned} $$

</aside>
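A small sketch of this update, assuming a uniform Beta(1, 1) prior and an illustrative $D_\text{half}$ value (the class name and hyperparameters are placeholders, not the paper's exact settings):

```python
class BetaBernoulliTracker:
    """Per-prompt value tracker following the Beta-Bernoulli update above."""

    def __init__(self, alpha0=1.0, beta0=1.0, d_half=0.1):
        self.alpha, self.beta, self.d_half = alpha0, beta0, d_half

    def value(self):
        # \hat{v}(x) = alpha / (alpha + beta)
        return self.alpha / (self.alpha + self.beta)

    def update(self, reward, kl_to_old_policy):
        # rho(x) = 2^{-D(x)/D_half}: forget faster when the policy has moved more.
        rho = 2.0 ** (-kl_to_old_policy / self.d_half)
        self.alpha = rho * self.alpha + reward
        self.beta = rho * self.beta + (1.0 - reward)
        return self.value()

# Toy usage: one correct response observed after a small policy update.
tracker = BetaBernoulliTracker()
print(tracker.update(reward=1.0, kl_to_old_policy=0.05))
```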