Browser Agents

Browser agents connect language models to a real browser. They click, type, navigate, search, inspect page state, compare sources, and return answers with evidence from the live web.

The core problem is not just browsing. A useful agent has to plan, observe what changed, choose the next action, recover from broken paths, and decide when it has enough evidence to stop.
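The loop described above can be sketched in a few lines. This is a minimal illustration, not any particular agent's implementation: `FakeBrowser`, `Run`, and `run_agent` are hypothetical names, the "browser" is an in-memory dictionary, and the planning step is a plain breadth-first queue where a real agent would ask a model to choose the next action.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class FakeBrowser:
    """Hypothetical stand-in for a real browser: a dict of pages and links."""
    pages: dict
    url: str = "home"

    def goto(self, url: str) -> None:
        self.url = url

    def observe(self) -> dict:
        # Unknown URLs behave like a broken path.
        return self.pages.get(self.url, {"error": "not found", "links": []})


@dataclass
class Run:
    answer: Optional[str] = None
    evidence: List[str] = field(default_factory=list)
    steps: int = 0


def run_agent(browser: FakeBrowser, goal: str, max_steps: int = 10) -> Run:
    run = Run()
    frontier = [browser.url]  # plan: pages still worth visiting
    visited = set()
    while frontier and run.steps < max_steps:
        url = frontier.pop(0)
        if url in visited:
            continue
        visited.add(url)
        browser.goto(url)          # act
        state = browser.observe()  # observe what changed
        run.steps += 1
        if "error" in state:       # recover: drop the broken path, keep going
            continue
        if goal in state.get("text", ""):
            run.answer = state["text"]
            run.evidence.append(url)
            break                  # stop: enough evidence to answer
        frontier.extend(state["links"])
    return run


pages = {
    "home": {"text": "welcome", "links": ["broken", "docs"]},
    "docs": {"text": "pricing: see plans page", "links": []},
}
result = run_agent(FakeBrowser(pages), goal="pricing")
```

In this toy run the agent visits `home`, skips the dead `broken` link, and stops at `docs` with that page recorded as evidence.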

What the field studies

Browser-agent research studies how models use websites as tools. Some tasks are short and transactional, like finding a product or filling a form. Others require longer-term planning: breaking a messy goal into steps, choosing sources, adapting when pages change, and deciding when the result is good enough to trust.

The field grew from tool-use patterns such as ReAct, controlled web environments such as WebShop and WebArena, real-site trajectory datasets such as Mind2Web, and live-browser benchmarks such as WebVoyager and AssistantBench.

Three approaches

The first approach is a vertical browser agent. The browser system owns the whole loop: planning, page reading, clicking, recovery, and the final answer. This is clean for users because they pass in one goal and get one result. It also lets the team optimize the full stack around browsing. The downside is that vertical agents have to keep up with two fast-moving frontiers at once: general agent frameworks and browser-specific tools. A product that tries to own both can fall behind on both, unless it is deliberately narrow, such as a scraper, a research assistant, or another specific browser workflow.

Vertical browser agents own planning, browsing, recovery, and answers.

The second approach is a browser mini-agent that works with a general agent. This separates concerns: the general agent owns the trajectory, meaning the long-term plan and strategy, while the browser mini-agent owns the concrete actions needed to execute a bounded browsing step. Alumnium frames this clearly: Claude Code keeps the broader plan, while Alumnium handles browser execution through Selenium, accessibility trees, and screenshots when needed. Stagehand sits in a similar middle layer with act, extract, observe, and agent primitives. Lumen is our version of this shape: a vision-first browser agent that can take a bounded browsing subtask from Codex, Claude Code, or another general agent, then return a result with replayable evidence.

Browser mini-agents take bounded browsing tasks from a general agent and return replayable evidence.
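One way to picture the split between a general agent and a browser mini-agent is as a small contract: a bounded task goes in, and an answer plus a trace comes out. The sketch below is an assumption for illustration only. `BrowsingTask`, `BrowsingResult`, `run_mini_agent`, and `scripted_step` are hypothetical names and do not reflect the actual interfaces of Alumnium, Stagehand, or Lumen.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class BrowsingTask:
    instruction: str
    start_url: str
    max_steps: int = 5  # the bound that keeps the subtask small


@dataclass
class BrowsingResult:
    ok: bool
    answer: Optional[str]
    trace: List[dict] = field(default_factory=list)  # evidence for replay


# A step function returns (recorded action, answer or None, next URL).
StepFn = Callable[[str, str], Tuple[dict, Optional[str], str]]


def run_mini_agent(task: BrowsingTask, step_fn: StepFn) -> BrowsingResult:
    trace: List[dict] = []
    url = task.start_url
    for _ in range(task.max_steps):
        action, answer, url = step_fn(task.instruction, url)
        trace.append(action)
        if answer is not None:
            return BrowsingResult(True, answer, trace)
    # Budget exhausted: hand control back so the general agent can replan.
    return BrowsingResult(False, None, trace)


def scripted_step(instruction: str, url: str):
    """Hypothetical stand-in for the mini-agent's model-driven step."""
    if url == "https://example.com":
        action = {"type": "click", "target": "Pricing", "url": url}
        return action, None, "https://example.com/pricing"
    action = {"type": "read", "target": "price table", "url": url}
    return action, "placeholder answer from the pricing page", url


task = BrowsingTask("find the pricing details", "https://example.com")
result = run_mini_agent(task, scripted_step)
```

The general agent never sees individual clicks unless it asks for the trace. It only decides what to do with `ok`, `answer`, and the evidence, which keeps the long-term plan in one place.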

The third approach is an LLM-free browser tool for a general agent. agent-browser exposes compact browser snapshots and deterministic refs through a CLI. browser-harness exposes a thin, editable CDP harness and lets the calling agent write missing helpers as it works. browser-control follows the same direction: keep the tool small, expose real browser state, and let the general model decide how to use it.

LLM-free browser tools expose snapshots, refs, and traces directly to a general model.
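To make "compact snapshots with deterministic refs" concrete, here is a rough sketch of the idea. It assumes a simplified accessibility-style tree as nested dictionaries; the `snapshot` function, the `e1`-style ref format, and the role set are illustrative assumptions, not the output format of agent-browser or any other tool named here.

```python
from typing import Dict, List, Tuple

INTERACTIVE_ROLES = {"button", "link", "textbox", "checkbox"}


def snapshot(tree: dict) -> Tuple[str, Dict[str, dict]]:
    """Render a tree as indented text and number interactive nodes.

    Refs are assigned in depth-first document order, so the same tree
    always produces the same refs.
    """
    lines: List[str] = []
    refs: Dict[str, dict] = {}

    def walk(node: dict, depth: int) -> None:
        label = f'{node["role"]} "{node.get("name", "")}"'
        if node["role"] in INTERACTIVE_ROLES:
            ref = f"e{len(refs) + 1}"
            refs[ref] = node
            label += f" [ref={ref}]"
        lines.append("  " * depth + label)
        for child in node.get("children", []):
            walk(child, depth + 1)

    walk(tree, 0)
    return "\n".join(lines), refs


page = {
    "role": "main",
    "name": "Search page",
    "children": [
        {"role": "heading", "name": "Results"},
        {"role": "textbox", "name": "Search"},
        {"role": "button", "name": "Go"},
    ],
}
text, refs = snapshot(page)
```

The calling model reads `text`, then names a ref such as `e2` in its next action. The tool resolves the ref back to a node without any model call of its own.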

This third direction is the most aligned with the Bitter Lesson: as models get better, the durable interface is less likely to be a hand-designed agent policy and more likely to be a capable model with a simple, complete browser substrate. Browser Use makes the same argument in its posts on agent frameworks and agent harnesses, drawing from Sutton's Bitter Lesson.

Our approach

Our work treats the browser as an execution environment, not just a source of text. The model should spend its budget on deciding what matters, while the browser layer handles reliable observation, action, tracing, and recovery.

browser-control is that browser layer: a small CDP-based tool for controlling real browsers, capturing screenshots and state, inspecting pages, evaluating scripts, and preserving execution evidence. It gives the agent enough structure to act reliably without hiding the rendered web.
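For readers unfamiliar with CDP, the sketch below shows the shape of the messages such a layer sends over the browser's WebSocket. `Page.navigate`, `Page.captureScreenshot`, and `Runtime.evaluate` are standard Chrome DevTools Protocol methods; the `CdpMessageBuilder` class is a hypothetical helper written for this example and is not browser-control's actual code. The sketch only builds messages and does not open a connection.

```python
import itertools
import json


class CdpMessageBuilder:
    """Builds CDP command messages with increasing ids (illustrative only)."""

    def __init__(self) -> None:
        self._ids = itertools.count(1)

    def command(self, method: str, **params) -> str:
        # Each CDP command carries an id so the reply can be matched to it.
        return json.dumps(
            {"id": next(self._ids), "method": method, "params": params}
        )


builder = CdpMessageBuilder()
navigate = builder.command("Page.navigate", url="https://example.com")
screenshot = builder.command("Page.captureScreenshot", format="png")
title = builder.command(
    "Runtime.evaluate", expression="document.title", returnByValue=True
)
```

Navigation, screenshots, and script evaluation all reduce to messages like these, which is part of why a thin tool over CDP can stay small.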

Lumen builds on that foundation as a vision-first browser agent with self-healing deterministic replay. It keeps the rendered page primary, but records enough browser state to reproduce actions, repair brittle steps, and debug failures as deterministic traces instead of one-off sessions.
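The idea of self-healing replay can be illustrated with a toy trace. This is a deliberately simplified sketch and not Lumen's mechanism: Lumen is vision-first, while the example stands in a selector plus a recorded visible label for whatever evidence a real trace keeps. `Step`, `locate`, and `replay` are hypothetical names, and a "page" here is just a mapping from selector to visible label.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Step:
    action: str
    selector: str  # primary locator recorded at capture time
    text: str      # visible label recorded as fallback evidence


def locate(page: Dict[str, str], step: Step) -> Optional[str]:
    if step.selector in page:
        return step.selector
    # Heal: fall back to the recorded label, but only if it is unambiguous.
    matches = [sel for sel, label in page.items() if label == step.text]
    return matches[0] if len(matches) == 1 else None


def replay(trace: List[Step], page: Dict[str, str]) -> List[str]:
    log: List[str] = []
    for step in trace:
        selector = locate(page, step)
        if selector is None:
            raise LookupError(f"cannot replay {step.action} on {step.selector!r}")
        if selector != step.selector:
            log.append(f"healed {step.selector} -> {selector}")
            step.selector = selector  # repair the trace for future runs
        log.append(f"{step.action} {selector}")
    return log


trace = [Step("click", "#buy", "Buy now")]
changed_page = {"#purchase": "Buy now", "#cart": "Cart"}  # "#buy" was renamed
first_log = replay(trace, changed_page)
second_log = replay(trace, changed_page)
```

The first replay repairs the brittle step and logs the repair; the second runs the repaired trace with no healing needed. When the fallback is ambiguous or missing, the replay fails loudly rather than guessing.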

We tested both shapes. Lumen is the mini-agent path: it is useful when a general coding agent wants to delegate a browsing subtask and receive a compact answer. browser-control is the tool path: it gives a high-intelligence model such as Fable-5 direct browser leverage without adding another reasoning loop.

So far, the strongest pattern for us has been browser-control with a very capable general model. The model already knows how to plan, inspect, recover, and write code. The browser layer works best when it provides state, actions, screenshots, traces, and escape hatches without trying to outguess the model.

That combination helped us place among the top entries on WebVoyager and AssistantBench.

For the benchmark results and the model-interface lessons behind this direction, read When Browser Harnesses Help, and When They Hurt.

Related blog posts

When Browser Harnesses Help, and When They Hurt