Authors: Sai Wang, Yu Wu, and Zhongwen Xu
Date: 29 Sept, 2025
Quick Links: 📜 Paper
We introduce *Cogito, ergo ludo* (CEL), a new paradigm for AI agents that learn not just by acting, but by reasoning and planning. Prevailing deep reinforcement learning (RL) agents are powerful but act as "black boxes": their knowledge is opaquely stored in millions of network weights, they are sample-inefficient, and they don't build a true, explicit understanding of their world. CEL flips this on its head. It leverages a Large Language Model (LLM) to start from a tabula rasa state (no prior knowledge of the rules) and autonomously builds an explicit, human-readable "world model" and a strategic "playbook" purely from its own experience. Through a cyclical process of in-episode action and post-episode reflection, CEL continuously refines its knowledge. Experiments on classic grid-world games like Minesweeper, Frozen Lake, and Sokoban show that CEL successfully masters these tasks by discovering the rules and developing effective strategies from sparse rewards, demonstrating a path toward more general, interpretable, and efficient intelligent agents.
Reinforcement Learning has produced agents with superhuman abilities, but the dominant paradigm suffers from fundamental limitations that hinder the development of truly general and trustworthy AI.
A high-level overview of the CEL agent's two-phase cycle. The agent acts in Phase 1 and reflects in Phase 2, creating a continuous loop of self-improvement.
To address these flaws, we introduce *Cogito, ergo ludo* (CEL), an agent architecture grounded in the principle of "learning by reasoning and planning." Instead of burying knowledge in network weights, CEL builds and refines an explicit, language-based understanding of its environment.
The core of CEL is a two-phase operational cycle:
Phase 1: In-Episode Decision-Making: During a game episode, the agent acts using its current knowledge. For each possible move, it uses its Language-based World Model (LWM) to predict the outcome ("What will happen if I do this?") and its Language-based Value Function (LVF) to assess the strategic potential of the resulting state ("Is this state good for me in the long run?").
Given the agent's current state $s_t$, a potential action $a_t$, and the agent's current understanding of the environment's rules $\mathcal{G}_k$, the world model forecasts the subsequent state $\hat{s}_{t+1}$ and the immediate reward $\hat{r}_{t+1}$. The model first generates a reasoning trace, $C_\text{WM}$ (for World Model), before outputting its predictions. Note that unlike the "world models" in conventional paradigms, which are trained with grounded next states and rewards, we train our LWM end-to-end using only the outcome reward.
$$ (C_\text{WM}, \hat{s}_{t+1}, \hat{r}_{t+1}) \sim p_{\mathcal{L}}(\cdot | s_t, a_t, \mathcal{G}_k), $$
where $C_\text{WM}$ is the reasoning trace for the World Model.
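To make the interface concrete, here is a minimal Python sketch of such a world-model call. It assumes a generic `llm(prompt) -> str` callable and a JSON response format; the prompt wording and the `predict_transition` name are illustrative assumptions, not the paper's exact implementation.

```python
import json
from typing import Callable

def predict_transition(
    llm: Callable[[str], str],
    state: str,    # s_t: a textual description of the current board
    action: str,   # a_t: the candidate move
    rules: str,    # G_k: the agent's current written-down rules
) -> dict:
    """One LWM step: (C_WM, s_{t+1}, r_{t+1}) ~ p_L( . | s_t, a_t, G_k)."""
    prompt = (
        "You are the world model of a game-playing agent.\n"
        f"Known rules of the environment:\n{rules}\n\n"
        f"Current state:\n{state}\n\n"
        f"Proposed action: {action}\n\n"
        "First reason step by step about what will happen, then answer with a "
        'JSON object: {"reasoning": ..., "next_state": ..., "reward": ...}'
    )
    completion = llm(prompt)
    # Assumes the model ends its reply with the requested JSON object;
    # the "reasoning" field holds the trace C_WM alongside the predictions.
    payload = completion[completion.find("{"): completion.rfind("}") + 1]
    return json.loads(payload)
```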
To guide its planning, the agent employs the LLM $\mathcal{L}$ as a language-based value function to estimate the value of a state, $\hat{v}(s_t)$. This evaluation is conditioned on both the current environmental rules $\mathcal{G}_k$ and the strategic playbook $\Pi_k$:
$$ (C_V, \hat{v}(s_t)) \sim p_{\mathcal{L}}(\cdot | s_t, \mathcal{G}_k, \Pi_k), $$
where $C_V$ is the reasoning trace for the **V**alue estimation.
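Below is a similarly hedged sketch of the value estimate and of the one-step lookahead described in Phase 1: each candidate action is simulated with the world model and the predicted state is scored with the value function. The additive reward-plus-value score and all names here are illustrative assumptions, not the paper's exact decision rule.

```python
import json
from typing import Callable

def estimate_value(
    llm: Callable[[str], str],
    state: str,      # s_t
    rules: str,      # G_k
    playbook: str,   # Pi_k: the agent's written strategy notes
) -> float:
    """Language value function: (C_V, v(s_t)) ~ p_L( . | s_t, G_k, Pi_k)."""
    prompt = (
        "You evaluate game positions for a planning agent.\n"
        f"Rules:\n{rules}\n\nPlaybook:\n{playbook}\n\nState:\n{state}\n\n"
        "Reason about the long-run prospects of this state, then answer with a "
        'JSON object: {"reasoning": ..., "value": <number between 0 and 1>}'
    )
    completion = llm(prompt)
    payload = completion[completion.find("{"): completion.rfind("}") + 1]
    return float(json.loads(payload)["value"])

def choose_action(
    llm: Callable[[str], str],
    state: str,
    candidate_actions: list[str],
    rules: str,
    playbook: str,
    predict_transition: Callable[..., dict],  # e.g. the world-model sketch above
) -> str:
    """One-step lookahead: simulate each move with the LWM, score it with the LVF."""
    best_action, best_score = candidate_actions[0], float("-inf")
    for action in candidate_actions:
        outcome = predict_transition(llm, state, action, rules)
        # Assumed scoring: predicted immediate reward plus estimated value of
        # the predicted next state.
        score = float(outcome.get("reward", 0.0)) + estimate_value(
            llm, outcome["next_state"], rules, playbook
        )
        if score > best_score:
            best_action, best_score = action, score
    return best_action
```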