What an LLM cannot do, and a world model would

Keon Kim

A small robot pondering a thought-cloud full of everyday cause-and-effect relations: gravity, fire, rain, day and night, ice melting, plants needing water.

A world model is a learned system that maintains an internal state, updates it from observation and action, and uses that state to predict the consequences of action. The phrase now covers research programs with little in common, so this post argues for a functional reading: the useful question is what state a model constructs and what that state is for.

The motivating gap is concrete. Ask a frontier language model to describe what happens when you pour water into a glass and it returns a paragraph that gets the physics right. Give the same model a robot arm and ask it to pour without spilling and it cannot. Description and action draw on different competences. The model has the words; it lacks a model of the world.

This gap shapes much of the current AI research agenda. LeCun, Fei-Fei Li, Hassabis, and others argue that the next set of capabilities will not come from larger language models, but from systems that build and update internal models of the world and use them to predict, plan, and reason.

What follows defines the term as we use it, surveys the roles a world model's state can play, and identifies where the open problems sit.

What language models cannot do

The next-token-prediction objective produces systems that write code, draft legal arguments, and converse in twenty languages. It does not produce systems that plan a robot's path across a cluttered kitchen, hold a coherent state across a long agent trajectory, or update their parameters when the world disagrees with them. The failures cluster.

No mechanism for self-correction. A trained language model does not update its weights from prediction error during inference. In-context corrections do not propagate back to the parameters. Predictive coding, the dominant theory of cortical computation since Rao and Ballard (1999), makes weight updates from prediction error the central mechanism. The cortex has done this for as long as it has existed.

No grounding in physics. Fei-Fei Li, on training from text: "Language is a purely generated signal. You don't go out in nature and find words written in the sky for you. Whatever data you are feeding [LLMs], you can pretty much regurgitate with enough generalizability the same data out, and that's language-to-language. But the 3D world is not language-to-language. There is a 3D world out there that follows laws of physics that has its own structures due to materials and many other things." The training signal is text. Text is downstream of the world.

No action structure. Demis Hassabis: "There's still a lot about the spatial dynamics of the world, spatial awareness, the physical context we are in, and how that works mechanically, that is hard to describe in words, and isn't generally described in corpuses of words." A language model represents what someone might say next, not what would happen if it took an action.

No consistent state across long horizons. A long-running agent built on an LLM accumulates inconsistencies because the model holds no commitment to its prior beliefs. Each call re-derives the world from a context window.

There is partial good news. Gurnee and Tegmark (2023) showed that Llama-2 learns spatial and temporal representations of the real world from text alone. LLMs do contain world-model content. What they have cannot be updated, grounded, or used to plan actions.

The status quo: what world models are

LeCun states the alternative: "We all have world models in our heads. Animals do too. It's basically the mental model that we have in our head that allows us to predict what's going to happen in the world, either because the world is being the world, or because of an action we might take. If we can predict the consequences of our actions, then if we set ourselves a goal to accomplish, we can, using our world model, imagine whether a particular sequence of actions will actually fulfill that goal."

Jim Fan compresses the same idea: a world model predicts the next plausible world state, or a longer sequence of states, conditioned on an action.

The argument is not new. Helmholtz proposed unconscious inference in the 1860s: perception combines sense data with prior knowledge of the world. Gregory updated the framing in the 1970s by describing the brain as a prediction machine. Rao and Ballard (1999) gave the first concrete computational version. Friston extended it to action through active inference. The AI conversation revisits material that cognitive neuroscience worked through over the past century.

A working definition makes the constraint explicit. A world model is a learned system that maintains an internal state and updates it as the world changes. Three properties define the boundary. The state must summarize aspects of the world beyond the current observation. It must evolve to predict or constrain future states. And under agent action, it must reflect the consequences of those actions.
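
A minimal sketch makes the three properties concrete. The class below is illustrative, with placeholder linear dynamics rather than any published system: the state outlives the current observation, the update folds in observation and action, and prediction rolls the state forward under candidate actions without touching the world.

```python
# Toy instance of the three defining properties; all names and the linear
# dynamics are illustrative placeholders, not any published system.
import numpy as np

class WorldModel:
    def __init__(self, state_dim: int, obs_dim: int, act_dim: int, seed: int = 0):
        rng = np.random.default_rng(seed)
        self.state = np.zeros(state_dim)                              # summary beyond the current observation
        self.A = rng.normal(scale=0.1, size=(state_dim, state_dim))  # how the state evolves on its own
        self.B = rng.normal(scale=0.1, size=(state_dim, act_dim))    # how actions move the state
        self.C = rng.normal(scale=0.1, size=(state_dim, obs_dim))    # how observations correct the state

    def update(self, observation: np.ndarray, action: np.ndarray) -> None:
        # Properties 1 and 2: the state summarizes history and evolves as the world changes.
        self.state = np.tanh(self.A @ self.state + self.B @ action + self.C @ observation)

    def predict(self, actions: list[np.ndarray]) -> list[np.ndarray]:
        # Property 3: roll the state forward under candidate actions without acting in the world.
        s, rollout = self.state.copy(), []
        for a in actions:
            s = np.tanh(self.A @ s + self.B @ a)
            rollout.append(s)
        return rollout
```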

The third clause separates a world model from a video generator. Genie 3 (Google DeepMind, 2025) produces photorealistic interactive environments and the field calls it a world model, but its latent state targets visual synthesis, not action consequences. Genie 3 is the boundary case that shows why world model needs a functional definition rather than an architectural one.

Six things people mean

Across the literature, six roles for a world model's internal state recur. They are not architectures. They are functions. Different systems implement the same role with different internals, and the same architecture can be trained to implement different roles.

  • Predictive abstraction: what the world looks like in the near future. DreamerV3 (Hafner et al., 2023) is canonical: a recurrent latent state supports imagined rollouts for policy learning under one fixed configuration across many tasks.
  • Belief state: everything the agent knows from what it has seen and done. The classical POMDP target, now usually a recurrent encoder over observation history.
  • Object and causal structure: which entities exist and how they interact. Slot-based and graph-based representations attempt this; the area remains underdeveloped.
  • Latent action: the kinds of changes that are possible even when no action label is available. AdaWorld (Gao et al., 2025) learns this from passive video, addressing the gap between web-scale observational data and embodied control.
  • Planning interface: a state shaped so that search through it is cheap. MuZero (Schrittwieser et al., 2020) demonstrates this: it discards pixel-level detail, supervises latent dynamics only on reward and policy consistency, and matches state-of-the-art on Atari, Go, Chess, and Shogi.
  • Memory: identity preserved across long horizons and revisits. WorldMem (Xiao et al., 2025) treats this as a first-class problem, targeting the failure short-horizon predictors hide.

A transformer can implement any of these. A recurrent state-space model can implement any of these. The training signal and the design choices around the state determine the role; the architectural family does not.
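
A short sketch of that claim, with illustrative names and simplified stand-in losses: one shared recurrent update, and the role decided by which signal supervises the state.

```python
# One shared recurrent state update, pushed toward different roles by swapping
# the supervising loss. Names are illustrative; the losses are simplified
# stand-ins for the objectives the cited systems actually use.
import torch
import torch.nn as nn

class SharedLatent(nn.Module):
    def __init__(self, obs_dim: int = 16, act_dim: int = 4, state_dim: int = 32):
        super().__init__()
        self.step = nn.GRUCell(obs_dim + act_dim, state_dim)  # same state update for every role
        self.decode_obs = nn.Linear(state_dim, obs_dim)       # used for predictive abstraction
        self.decode_reward = nn.Linear(state_dim, 1)          # used for a planning interface

def loss_for_role(model, state, obs, act, next_obs, reward, role):
    state = model.step(torch.cat([obs, act], dim=-1), state)
    if role == "predictive_abstraction":
        # Supervise on what the world looks like next.
        return ((model.decode_obs(state) - next_obs) ** 2).mean()
    if role == "planning_interface":
        # Supervise only on reward; pixel-level detail is deliberately discarded.
        return ((model.decode_reward(state).squeeze(-1) - reward) ** 2).mean()
    raise ValueError(f"unknown role: {role}")
```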

What world models would unlock

If the field gets this right, then several capabilities LLMs were not designed to provide become reachable together.

Planning under partial observability. An agent with a belief state can act in real time without full information. LLM agents fail here today; the symptom is hallucinated state.

Sample efficiency. A learned world model lets the agent train its policy on imagined rollouts rather than on expensive real interactions. Ha and Schmidhuber (2018) demonstrated the principle. Dreamer (Hafner et al., 2023) extended it to a single recipe across many tasks. Robotics, bottlenecked by interaction data, gains the most.
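
A sketch of the principle, assuming the caller supplies a policy, a learned dynamics function, and a learned reward predictor; the names below are illustrative, not Dreamer's actual interfaces. The policy is then improved against these imagined returns, with real interaction reserved for refreshing the model's training data.

```python
# Score a policy entirely inside the learned model; no real interaction is consumed.
# policy(state) -> action, dynamics(state, action) -> next_state, and
# reward(state) -> float are assumed callables supplied by the caller.
def imagined_return(policy, dynamics, reward, start_state, horizon: int = 15) -> float:
    s, total = start_state, 0.0
    for _ in range(horizon):
        a = policy(s)        # act inside the model's imagination
        s = dynamics(s, a)   # predicted next latent state
        total += reward(s)   # predicted reward
    return total
```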

On-the-fly updating. Predictive-coding architectures update their internal models when prediction fails. This is the mechanism Helmholtz, Gregory, and Rao and Ballard described. LLMs lack it. An agent that revises its world model from prediction error can update from experience; an agent with frozen weights cannot.
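
A minimal sketch of that mechanism, assuming a linear predictor and a fixed learning rate rather than Rao and Ballard's full hierarchy: the weights move whenever the prediction misses.

```python
# Update a model from prediction error at inference time, in the spirit of
# predictive coding. The linear predictor and learning rate are illustrative.
import numpy as np

class OnlinePredictor:
    def __init__(self, dim: int, lr: float = 0.01):
        self.W = np.zeros((dim, dim))  # learned transition
        self.lr = lr

    def step(self, state: np.ndarray, observed_next: np.ndarray) -> float:
        predicted = self.W @ state
        error = observed_next - predicted            # prediction error
        self.W += self.lr * np.outer(error, state)   # weights change because the world disagreed
        return float(np.mean(error ** 2))
```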

Counterfactual reasoning. A world model that supports the question of what would happen under a different action satisfies a precondition for safe deployment in domains where the agent's choices change the system it reasons about. Counterfactual reasoning and uncertainty are the two capabilities where the field has the fewest working methods; that gap is where the open problems sit.
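
In the simplest case the query looks like the sketch below, assuming a learned dynamics function and an outcome score; both names are illustrative, and holding each action fixed over the horizon is a deliberate simplification.

```python
# Counterfactual query against a learned model. dynamics(state, action) -> next_state
# and outcome(state) -> float are assumed callables, not a library API.
def counterfactual_gap(dynamics, outcome, state, taken_action, alternative_action, horizon: int = 10):
    """How much better or worse the alternative action would have been, according to the model."""
    def rollout(action):
        s = state
        for _ in range(horizon):
            s = dynamics(s, action)  # imagined transition under the chosen action
        return outcome(s)
    return rollout(alternative_action) - rollout(taken_action)
```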

Grounded multimodal learning. A world model that ties language, perception, and action through a shared latent gives language a referent it currently lacks. Gurnee and Tegmark (2023) found more spatial structure latent in language than was assumed; tying that structure to embodied action remains unsolved.

The shortest version of the case: world models let AI systems keep doing what they already do well and gain capabilities LLMs were not designed to provide.

What this means for builders

If you are building a system that uses a world model, then the practical question is what your state should be for. The answer depends on what the agent needs to do. An agent that acts in real time under partial observability needs a belief state. An agent that handles long episodes with revisits needs memory. An agent that has to reason about a change in policy needs counterfactual structure. Architectural choices follow from these requirements, not the other way around.

The common failure mode is mismatched evaluation. A team trains a strong predictive model, shows it produces plausible video, and ships it as a world model. At integration time it fails to support planning. The model is not bad. It preserves the wrong things.

A useful slogan: a world model that helps you act is the one whose discards match the task.

References

  • Gao et al. (2025). AdaWorld: Learning Adaptable World Models with Latent Actions. arXiv:2503.18938. Representative of the latent-action line. Learns an internal action-like variable from passive video, addressing the gap between web-scale observational data and embodied control.
  • Google DeepMind (2025). Genie 3: A new frontier for world models. deepmind.google. The boundary case. A photorealistic interactive world generator that the field calls a world model, but whose latent state targets visual synthesis rather than action consequences.
  • Gurnee and Tegmark (2023). Language Models Represent Space and Time. arXiv:2310.02207. Evidence that LLMs trained only on text learn spatial and temporal representations. The reference to push back against if you want to argue LLMs lack world models entirely.
  • Ha and Schmidhuber (2018). World Models. arXiv:1803.10122. The paper that put the term in modern usage. A small recurrent latent dynamics model with a controller learned inside imagined rollouts. Worth reading first because it sets the agenda the rest of the field responds to.
  • Hafner et al. (2023). Mastering Diverse Domains through World Models (DreamerV3). arXiv:2301.04104. The canonical action-first latent world model. The latent state is a recurrent belief that supports imagined rollouts for policy learning under one fixed configuration across many tasks.
  • Kim (2026). Latent State Design for World Models under Sufficiency Constraints. arXiv:2605.01694. Frames world-model research as latent state design under sufficiency constraints, with a functional taxonomy organized by purpose rather than architecture and a seven-dimension evaluation across representation, prediction, planning, controllability, causal reasoning, memory, and uncertainty.
  • Rao and Ballard (1999). Predictive Coding in the Visual Cortex: A Functional Interpretation of Some Extra-Classical Receptive-Field Effects. Nature Neuroscience, 2(1), 79-87. The first concrete computational implementation of the predictive-processing framework that has dominated cognitive neuroscience for the past two decades. Direct ancestor of the active-inference and world-modeling research programs in AI.
  • Schrittwieser et al. (2020). Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model (MuZero). arXiv:1911.08265. Demonstrates that a model can succeed at control while discarding pixel-level detail. Latent dynamics are supervised only by reward and policy consistency.
  • Xiao et al. (2025). WorldMem: Long-term Consistent World Simulation with Memory. arXiv:2504.12369. Treats memory as a first-class world-model role. Targets the failure short-horizon prediction hides: identity across revisits, occlusion, and long horizons.
