Authors: Sai Wang, Yu Wu, and Zhongwen Xu
Date: 29 Sept, 2025
Quick Links: 📜 Paper
We introduce *Cogito, ergo ludo* (CEL), a new paradigm for AI agents that learn not just by acting, but by reasoning and planning. Prevailing deep reinforcement learning (RL) agents are powerful but act as "black boxes": their knowledge is opaquely stored in millions of network weights, they are sample-inefficient, and they don't build a true, explicit understanding of their world. CEL flips this on its head. It leverages a Large Language Model (LLM) to start from a tabula rasa state (no prior knowledge of the rules) and autonomously builds an explicit, human-readable "world model" and a strategic "playbook" purely from its own experience. Through a cyclical process of in-episode action and post-episode reflection, CEL continuously refines its knowledge. Experiments on classic grid-world games like Minesweeper, Frozen Lake, and Sokoban show that CEL successfully masters these tasks by discovering the rules and developing effective strategies from sparse rewards, demonstrating a path toward more general, interpretable, and efficient intelligent agents.
Reinforcement Learning has produced agents with superhuman abilities, but the dominant paradigm suffers from fundamental limitations that hinder the development of truly general and trustworthy AI.
A high-level overview of the CEL agent's two-phase cycle. The agent acts in Phase 1 and reflects in Phase 2, creating a continuous loop of self-improvement.
To address these flaws, we introduce *Cogito, ergo ludo* (CEL), an agent architecture grounded in the principle of "learning by reasoning and planning." Instead of burying knowledge in network weights, CEL builds and refines an explicit, language-based understanding of its environment.
The core of CEL is a two-phase operational cycle:
Phase 1: In-Episode Decision-Making: During a game episode, the agent acts using its current knowledge. For each possible move, it uses its Language-based World Model (LWM) to predict the outcome ("What will happen if I do this?") and its Language-based Value Function (LVF) to assess the strategic potential of the resulting state ("Is this state good for me in the long run?").
Given the agent's current state $s_t$, a potential action $a_t$, and the agent's current understanding of the environment's rules $\mathcal{G}_k$, the world model forecasts the subsequent state $\hat{s}_{t+1}$ and the immediate reward $\hat{r}_{t+1}$. The model first generates a reasoning trace, $C_\text{WM}$ (for World Model), before outputting its predictions. Note that unlike the "world models" in conventional paradigms, which are trained with grounded next states and rewards, we train our LWM end-to-end using only the outcome reward.
$$ (C_\text{WM}, \hat{s}_{t+1}, \hat{r}_{t+1}) \sim p_{\mathcal{L}}(\cdot | s_t, a_t, \mathcal{G}_k), $$
where $C_\text{WM}$ is the reasoning trace for the World Model.
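To make the interface concrete, here is a minimal Python sketch of such a world-model call. It assumes a generic `llm(prompt) -> str` callable and a JSON response format; the prompt wording and the `predict_transition` name are illustrative assumptions, not the paper's exact implementation.

```python
import json
from typing import Callable

def predict_transition(
    llm: Callable[[str], str],
    state: str,    # s_t: a textual description of the current board
    action: str,   # a_t: the candidate move
    rules: str,    # G_k: the agent's current written-down rules
) -> dict:
    """One LWM step: (C_WM, s_{t+1}, r_{t+1}) ~ p_L( . | s_t, a_t, G_k)."""
    prompt = (
        "You are the world model of a game-playing agent.\n"
        f"Known rules of the environment:\n{rules}\n\n"
        f"Current state:\n{state}\n\n"
        f"Proposed action: {action}\n\n"
        "First reason step by step about what will happen, then answer with a "
        'JSON object: {"reasoning": ..., "next_state": ..., "reward": ...}'
    )
    completion = llm(prompt)
    # Assumes the model ends its reply with the requested JSON object;
    # the "reasoning" field holds the trace C_WM alongside the predictions.
    payload = completion[completion.find("{"): completion.rfind("}") + 1]
    return json.loads(payload)
```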
To guide its planning, the agent employs the LLM $\mathcal{L}$ as a language-based value function to estimate the value of a state, $\hat{v}(s_t)$. This evaluation is conditioned on both the current environmental rules $\mathcal{G}_k$ and the strategic playbook $\Pi_k$:
$$ (C_V, \hat{v}(s_t)) \sim p_{\mathcal{L}}(\cdot | s_t, \mathcal{G}_k, \Pi_k), $$
where $C_V$ is the reasoning trace for the **V**alue estimation.
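Below is a similarly hedged sketch of the value estimate and of the one-step lookahead described in Phase 1: each candidate action is simulated with the world model and the predicted state is scored with the value function. The additive reward-plus-value score and all names here are illustrative assumptions, not the paper's exact decision rule.

```python
import json
from typing import Callable

def estimate_value(
    llm: Callable[[str], str],
    state: str,      # s_t
    rules: str,      # G_k
    playbook: str,   # Pi_k: the agent's written strategy notes
) -> float:
    """Language value function: (C_V, v(s_t)) ~ p_L( . | s_t, G_k, Pi_k)."""
    prompt = (
        "You evaluate game positions for a planning agent.\n"
        f"Rules:\n{rules}\n\nPlaybook:\n{playbook}\n\nState:\n{state}\n\n"
        "Reason about the long-run prospects of this state, then answer with a "
        'JSON object: {"reasoning": ..., "value": <number between 0 and 1>}'
    )
    completion = llm(prompt)
    payload = completion[completion.find("{"): completion.rfind("}") + 1]
    return float(json.loads(payload)["value"])

def choose_action(
    llm: Callable[[str], str],
    state: str,
    candidate_actions: list[str],
    rules: str,
    playbook: str,
    predict_transition: Callable[..., dict],  # e.g. the world-model sketch above
) -> str:
    """One-step lookahead: simulate each move with the LWM, score it with the LVF."""
    best_action, best_score = candidate_actions[0], float("-inf")
    for action in candidate_actions:
        outcome = predict_transition(llm, state, action, rules)
        # Assumed scoring: predicted immediate reward plus estimated value of
        # the predicted next state.
        score = float(outcome.get("reward", 0.0)) + estimate_value(
            llm, outcome["next_state"], rules, playbook
        )
        if score > best_score:
            best_action, best_score = action, score
    return best_action
```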