Authors: Heng Lin and Zhongwen Xu

Date: August 26, 2025

Quick Links: 📜 Paper | 🤗 HuggingFace | ASPO examples | GRPO examples


TL;DR

Giving Large Language Models (LLMs) tools like a Python interpreter makes them far more capable. But are Python interpreters just glorified calculators, or is something deeper going on? While many have shown that tools work, the fundamental why and how has been a missing piece of the puzzle. We provide the first formal proof that Tool-Integrated Reasoning (TIR) fundamentally expands an LLM's capabilities: it enables previously impossible reasoning paths (Support Expansion) and makes complex strategies practical within a finite token budget (Feasible Support). Our experiments on challenging math benchmarks confirm that TIR models solve a class of problems that is fundamentally out of reach for pure-text models, even on tasks requiring deep abstract insight rather than mere calculation. To stably guide how a model uses tools, we introduce Advantage Shaping Policy Optimization (ASPO), a novel algorithm that modifies the advantage directly, encouraging desired tool-use behaviors without the training instability and performance loss of traditional reward shaping.
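To make the contrast with reward shaping concrete, here is a minimal sketch of the idea behind advantage shaping, assuming a GRPO-style group-normalized advantage and a hypothetical `tool_use_bonus`-style term; the bonus form and value are illustrative, not ASPO's exact rule, which is specified in the paper.

```python
import numpy as np

def grpo_advantages(rewards):
    """Group-normalized advantages (GRPO-style): (r - mean) / std."""
    rewards = np.asarray(rewards, dtype=np.float64)
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

def shaped_advantages(rewards, used_tool, bonus=0.1):
    """Advantage shaping (sketch): modify the advantage directly,
    rather than adding a bonus to the reward before normalization.
    `used_tool` flags which rollouts in the group called the tool."""
    adv = grpo_advantages(rewards)
    return adv + bonus * np.asarray(used_tool, dtype=np.float64)

# Example: a group of 4 rollouts for one prompt.
rewards   = [1.0, 0.0, 1.0, 0.0]   # verifiable correct/incorrect rewards
used_tool = [1,   0,   0,   1]     # which rollouts invoked the Python tool

print(shaped_advantages(rewards, used_tool))
```

The point of the sketch is only that the shaping term enters after the advantage is computed, leaving the verifiable reward itself untouched.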


⛓️ The “Invisible Leash” of Text-Only LLMs

When we train an LLM with Reinforcement Learning with Verifiable Rewards (RLVR), it is often just re-weighting what it already knows. It learns to find better paths within the vast space of possible text generations, but it struggles to discover truly new reasoning paths that have near-zero probability under its base training.

This is the “invisible leash”: the model is constrained by its initial support. If the correct reasoning path isn't already "in there" somewhere, RL has an incredibly hard time discovering it.

<aside> 💡

Theorem 3.3 (Support Preservation under RLVR, from [1]). Let $\pi_\theta(y \mid x)$ be an RLVR-trained policy distribution initialized from a base model with distribution $q(y \mid x)$. For any prompt $x$, the support of the trained policy is a subset of the support of the base model:

$$ \text{supp}(\pi_\theta) \subseteq \text{supp}(q)  $$

This implies that if $q(y \mid x) = 0$ for a correct trajectory $y^*$, then RLVR can never discover $y^*$.

</aside>
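A toy numerical illustration of why this holds, under the common view of RLVR as re-weighting the base distribution by an exponentiated reward (the trajectories, probabilities, and rewards below are made up for illustration; this is not the paper's proof):

```python
import numpy as np

# Base model's distribution over four candidate trajectories for a prompt.
# The correct trajectory y* (index 3) is assigned probability 0 by the base model.
q = np.array([0.5, 0.3, 0.2, 0.0])        # q(y | x)
rewards = np.array([0.0, 0.0, 0.0, 1.0])  # only y* is correct
beta = 0.1                                # temperature of the re-weighting

# RLVR viewed as re-weighting: pi(y|x) proportional to q(y|x) * exp(r(y)/beta)
unnormalized = q * np.exp(rewards / beta)
pi = unnormalized / unnormalized.sum()

print(pi)  # [0.5 0.3 0.2 0. ] -- the zero stays zero: support is preserved
```

However large the reward on $y^*$, multiplying it into a zero base probability still gives zero, which is exactly the leash the theorem describes.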

So, how do we break this leash?


🔓 Breaking the Leash: Two Fundamental Gains from Tools

1. Unlocking the Impossible (Support Expansion)

Imagine a task requires finding the output of a cryptographic hash function. A pure-text model's only strategy is to guess the output token by token. The probability of getting it right is tiny, effectively zero.

A tool-integrated model, however, can simply call the hash function. This isn't a guess; it's a deterministic state transition. By making a single tool call, the model jumps to a state (the correct hash output) that it could never have feasibly reached on its own.
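A concrete sketch of the difference, using `hashlib` as the hypothetical tool (the secret string and hash choice are only illustrative):

```python
import hashlib

secret = "tool-integrated reasoning"

# Pure-text model: must emit the 64 hex characters of the digest token by
# token. Even at a generous 1-in-16 chance per character, the probability of
# guessing the whole string is 16**-64 -- effectively zero.
guess_probability = 16.0 ** -64
print(f"chance of guessing the digest: {guess_probability:.1e}")

# Tool-integrated model: one tool call is a deterministic state transition
# straight to the correct output.
digest = hashlib.sha256(secret.encode()).hexdigest()
print(f"sha256 via tool call: {digest}")
```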

Our paper formalizes this. We prove that tool access strictly expands the model's support. It adds new, valid trajectories to the reasoning graph, making previously "unsolvable" problems solvable.
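Schematically, writing $\pi^{\text{TIR}}$ for the tool-integrated policy (our shorthand here; the precise statement and conditions are in the paper), the claim is the strict counterpart of Theorem 3.3:

$$ \text{supp}(q) \subsetneq \text{supp}(\pi^{\text{TIR}}) $$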