Authors: Sai Wang, Yu Wu, and Zhongwen Xu

Date: 29 Sept, 2025

Quick Links: 📜 Paper


TL;DR

We introduce Cogito, ergo ludo (CEL), a new paradigm for AI agents that learn not just by acting, but by reasoning and planning. Prevailing deep reinforcement learning (RL) agents are powerful but act as "black boxes": their knowledge is stored opaquely in millions of network weights, they are sample-inefficient, and they never build a true, explicit understanding of their world. CEL flips this on its head. It leverages a Large Language Model (LLM) to start from a tabula rasa state (no prior knowledge of the rules) and autonomously builds an explicit, human-readable "world model" and a strategic "playbook" purely from its own experience. Through a cyclical process of in-episode action and post-episode reflection, CEL continuously refines its knowledge. Experiments on classic grid-world games like Minesweeper, Frozen Lake, and Sokoban show that CEL successfully masters these tasks by discovering their rules and developing effective strategies from sparse rewards, demonstrating a path toward more general, interpretable, and efficient intelligent agents.


🤔 The Problem with "Black-Box" RL Agents

Reinforcement Learning has produced agents with superhuman abilities, but the dominant paradigm suffers from fundamental limitations that hinder the development of truly general and trustworthy AI.

Figure: intro.png

  1. Opaque, Implicit Knowledge: In conventional deep RL, an agent's "knowledge" is encoded implicitly within the weights of a massive neural network. This makes its decision-making process a black box. We can see what it does, but we can't easily ask it why. This lack of interpretability is a major barrier to debugging, trust, and alignment.
  2. Extreme Sample Inefficiency: These systems often require billions of game frames or interaction steps to learn, an amount of experience far beyond what a human would need. They learn through brute-force trial and error rather than abstract reasoning and generalization.
  3. Lack of Structured Learning: Even recent LLM-based agents, which represent the Zero-shot Reasoning Paradigm, often operate in a zero-shot capacity or use simple memory retrieval. They lack a structured mechanism to fundamentally improve their internal model of the world's mechanics through experience. As the diagram illustrates, this paradigm can Reason to produce an Action, but it is missing the critical update loop needed to learn from the outcome. They act, but they don't truly comprehend or improve over time.

💡 Our Solution: Cogito, ergo ludo

A high-level overview of the CEL agent's two-phase cycle. The agent acts in Phase 1 and reflects in Phase 2, creating a continuous loop of self-improvement.

To address these flaws, we introduce Cogito, ergo ludo (CEL), an agent architecture grounded in the principle of "learning by reasoning and planning." Instead of burying knowledge in network weights, CEL builds and refines an explicit, language-based understanding of its environment.
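To make this concrete, below is a minimal Python sketch of how an act-then-reflect agent of this kind could be wired up. It is not the paper's implementation: the `CELAgent` class, the `call_llm` helper, and the prompt wording are illustrative assumptions; only the overall structure (an explicit world model and playbook kept in natural language, refined after each episode) follows the description above.

```python
# Minimal sketch of a CEL-style act/reflect agent (illustrative only; names,
# prompts, and interfaces are assumptions, not the authors' code).

def call_llm(prompt: str) -> str:
    """Placeholder for a call to whichever Large Language Model is used."""
    raise NotImplementedError("wire this up to an LLM of your choice")

class CELAgent:
    def __init__(self):
        # Both pieces of knowledge start empty: a tabula rasa state.
        self.world_model = "No rules known yet."   # explicit, human-readable environment rules
        self.playbook = "No strategies known yet." # distilled strategic advice

    def act(self, observation: str) -> str:
        """Phase 1 (in-episode): reason and plan with current knowledge, then choose an action."""
        prompt = (
            f"World model:\n{self.world_model}\n\n"
            f"Playbook:\n{self.playbook}\n\n"
            f"Observation:\n{observation}\n\n"
            "Think step by step, then output the next action."
        )
        return call_llm(prompt)

    def reflect(self, trajectory: list[tuple[str, str, float]]) -> None:
        """Phase 2 (post-episode): refine the world model and playbook from the full trajectory."""
        transcript = "\n".join(f"obs={o} action={a} reward={r}" for o, a, r in trajectory)
        self.world_model = call_llm(
            f"Current rules:\n{self.world_model}\n\nEpisode transcript:\n{transcript}\n\n"
            "Revise the rules so they explain everything that was observed."
        )
        self.playbook = call_llm(
            f"Rules:\n{self.world_model}\n\nEpisode transcript:\n{transcript}\n\n"
            "Update the strategic playbook based on what worked and what failed."
        )
```

In this sketch, the outer training loop would simply alternate the two phases: run `act` until the episode ends while recording `(observation, action, reward)` tuples, then call `reflect` on the collected trajectory so the next episode starts with better rules and a better playbook.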

The core of CEL is a two-phase operational cycle: