We built a browser-agent harness where the model writes code instead of taking tool calls one at a time.
The first version of browser agents looks obvious: give the model tools like click, extract, openTab, appendRow, and callMcpTool; then let it decide the next action after every observation.
That works for demos. It breaks on workflows.
If the task is "go through every open prospect tab, extract the useful ones, write them to a Sheet, and create CRM contacts for high-intent leads," the model should not be asked to rediscover the loop body on every iteration.
It should write:
for (const tab of await rtrvr.listTabs()) {
  const lead = await rtrvr.extract({ tabIds: [tab.tabId], userInput: "Extract the lead" });
  await rtrvr.appendRow({ sheetId, values: [lead.name, lead.email, lead.intent] });
  if (lead.intent === "high") await rtrvr.callTool("hubspot.createContact", lead);
}

That program can run locally, deterministically, and cheaply.
So we turned the whole browser-agent harness into a sandboxed JavaScript DSL.
The model writes the control flow. The browser runs it. The harness keeps authority.
Agent Harness DSL demo: watch Retriever AI run browser-agent workflows through a local sandbox harness.

The LLM should not be the runtime
Most browser agents today are tool-call loops:
- observe;
- ask the model what to do;
- run one tool;
- observe again;
- ask the model again.
That loop is useful when the next step genuinely requires judgment. But if the work is mostly iteration, retries, validation, and bookkeeping, the model is being used as an expensive interpreter.
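For contrast, here is a minimal sketch of that loop; observePage, askModel, and runTool are illustrative stand-ins, not real APIs:

// Illustrative tool-call loop: one LLM round trip per action.
async function toolCallLoop(task, { observePage, askModel, runTool }) {
  let observation = await observePage();               // snapshot the current tab
  while (true) {
    const step = await askModel(task, observation);    // one model call per step
    if (step.done) return step.result;
    observation = await runTool(step.tool, step.args); // run exactly one tool, then observe again
  }
}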
A tool-call loop is an interpreter with an LLM as the CPU.
That is a bad CPU. It is slow. It is expensive. It is nondeterministic. It can forget the loop invariant halfway through the loop.
Agent planning should produce programs, not transcripts.
The model should write the loop, not be the loop.
Agents need control flow
The best framing I have seen is "Agents need control flow, not more prompts," plus the HN discussion around it.
One top comment described a QA agent that had to process roughly 200 markdown requirement files. Letting the model manage the high-level loop started breaking down after about 30 files: missed files, repeated tests, unexplained backtracking, and 10-minute runs where 3 minutes should have been enough. A simple deterministic harness made the system much more reliable.
That maps exactly to browser automation.
Control flow is too important to leave in prose.
Browser workflows need normal programming constructs (a short sketch against the DSL follows this list):
- a for loop to iterate over tabs or rows;
- an if to branch on extracted state;
- try/catch to recover from site-specific failures;
- validation before writing to a Sheet or CRM;
- retry policy outside the model's memory.
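Here is a minimal sketch of what that looks like against the DSL; the withRetry helper and the field check are illustrative assumptions, not part of the surface:

// Illustrative only: the loop, the retries, and the validation live in plain
// JavaScript, not in the model's working memory.
async function withRetry(fn, attempts = 3) {
  for (let i = 0; i < attempts; i++) {
    try { return await fn(); } catch (err) { if (i === attempts - 1) throw err; }
  }
}

for (const tab of await rtrvr.listTabs()) {
  const lead = await withRetry(() =>
    rtrvr.extract({ tabIds: [tab.tabId], userInput: "Extract the lead" })
  );
  if (!lead || !lead.email) continue; // validate before touching the Sheet
  await rtrvr.appendRow({ sheetId, values: [lead.name, lead.email, lead.intent] });
}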
Last year, this was still a little too fragile. Models would sometimes emit a type mistake, a missing await, or a small syntax error that made the generated code fail to compile. You could see the shape of the future, but the error rate was annoying.
That changed. Even Gemini Flash, our workhorse model, now reliably generates usable code against a constrained DSL on nearly every call.
That is the unlock: not "models can build whole apps," but "models can write the 40 lines of glue code that should never have been an agent transcript."
The harness as a JavaScript DSL
The DSL is intentionally small. The model does not need a general operating system. It needs a small language for doing useful browser-agent work.
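For orientation, here is a simplified sketch of that surface, limited to the helpers that appear in this post; the signatures are abbreviated stand-ins, not the real definitions:

// Simplified sketch of the capability surface generated code sees.
// Every method is really an RPC to the parent runtime, not a direct API call.
const rtrvr = {
  // Open tabs the harness is willing to expose.
  listTabs: async () => [],
  // Model-backed fuzzy extraction from one or more tabs.
  extract: async ({ tabIds, userInput }) => ({}),
  // Deterministic page actions (click, type, ...) mediated by the extension.
  pageAction: async ({ tool, args, tabId }) => ({}),
  // Plain Sheets API writes.
  appendRow: async ({ sheetId, values }) => ({}),
  // MCP / custom tools, addressed by name.
  callTool: async (name, params) => ({}),
};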
The interesting part in the opening example is not the helper list. It is the composition:
- tab state comes from the browser;
- fuzzy extraction uses the model only where judgment is useful;
- Sheet writes are deterministic API calls;
- MCP/custom tools are callable by name;
- the loop is just JavaScript.
A for-loop should not cost tokens.
Yes, this uses eval
At the center of the sandbox, the implementation is almost offensively small:
async function evaluateSandboxCode(code) {
  try {
    // Fast path: evaluate the generated code as-is.
    const result = eval(code);
    return result instanceof Promise ? await result : result;
  } catch (error) {
    if (!shouldRetryAsAsyncFunctionBody(error)) throw error;
    // Retry path: wrap the code as an async function body so top-level await and return work.
    const result = eval("(async () => {\n" + code + "\n})()");
    return result instanceof Promise ? await result : result;
  }
}

The second path exists because models naturally write tool bodies with top-level await and return. If the first eval fails with that syntax shape, we retry as an async function body.
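shouldRetryAsAsyncFunctionBody is not shown above; a plausible version just checks whether the failure is the kind of SyntaxError that top-level await or return produces. The exact message strings below are assumptions and vary by engine:

// Hypothetical helper: only retry when the failure looks like "this was written
// as a function body" rather than a genuinely broken program.
function shouldRetryAsAsyncFunctionBody(error) {
  if (!(error instanceof SyntaxError)) return false;
  return /await is only valid|Illegal return statement|Unexpected token/.test(
    String(error.message)
  );
}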
The reason this is not reckless is Chrome's extension sandbox model. Chrome supports sandboxed extension pages, including the documented pattern for using eval in sandboxed iframes. A sandbox page can allow dynamic code execution without inheriting direct extension authority.
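That pattern is declared in the extension manifest; roughly this shape, where the sandbox key is the documented Chrome manifest field and the other values are placeholders:

{
  "manifest_version": 3,
  "name": "Retriever AI",
  "version": "1.0",
  "sandbox": {
    "pages": ["sandbox.html"]
  }
}

The sandboxed page is embedded as an iframe and can only talk to the rest of the extension through postMessage, which is exactly the boundary the next section relies on.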
eval is the implementation detail.
The capability boundary is the design.
What generated code can and cannot do
The security model is not "trust the generated code."
The security model is:
- generated code runs in a sandboxed iframe;
- the sandbox has no direct extension authority;
- the only useful objects in scope are DSL capabilities;
- every privileged operation becomes an RPC to the parent runtime (sketched below);
- the parent runtime validates, dispatches, logs, and can require approval.
Generated code can call:
await rtrvr.appendRow({ sheetId, values });
await rtrvr.pageAction({ tool: "click_element", args, tabId });
await rtrvr.callTool("slack.postMessage", params);

Generated code cannot directly call raw chrome.tabs, extension identity APIs, Google auth internals, or arbitrary privileged extension services.
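A minimal sketch of how that boundary can be wired; the message shapes, handler names, and allowlist below are illustrative, and the real runtime also logs every dispatch and can require approval:

// --- sandbox side (hypothetical): every rtrvr.* call becomes a postMessage RPC ---
let nextId = 0;
function rpc(method, params) {
  return new Promise((resolve, reject) => {
    const id = ++nextId;
    const onReply = (event) => {
      if (!event.data || event.data.id !== id) return;
      window.removeEventListener("message", onReply);
      event.data.error ? reject(new Error(event.data.error)) : resolve(event.data.result);
    };
    window.addEventListener("message", onReply);
    window.parent.postMessage({ id, method, params }, "*");
  });
}
const rtrvr = {
  appendRow: (params) => rpc("appendRow", params),
  pageAction: (params) => rpc("pageAction", params),
  callTool: (name, params) => rpc("callTool", { name, params }),
};

// --- parent runtime side (hypothetical): validate against an allowlist, dispatch, reply ---
const handlers = {
  appendRow: async (p) => { /* deterministic Sheets write */ },
  pageAction: async (p) => { /* mediated DOM action via the extension */ },
  callTool: async (p) => { /* MCP / custom tool dispatch */ },
};
window.addEventListener("message", async (event) => {
  const { id, method, params } = event.data || {};
  const handler = handlers[method];
  if (!handler) return event.source.postMessage({ id, error: "unknown method" }, "*");
  try {
    event.source.postMessage({ id, result: await handler(params) }, "*");
  } catch (e) {
    event.source.postMessage({ id, error: String(e) }, "*");
  }
});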
The generated code gets control flow. The harness keeps authority.
Why local beats remote for browser agents
Remote sandboxes are useful. Daytona-style environments, generic VM/container sandboxes, Temporal-backed workers, and serverless runtimes are good fits for backend compute, long-lived workflows, package installs, filesystem-heavy jobs, and untrusted server-side code.
They are not the natural hot path for browser agents.
Remote sandboxes isolate compute. Local browser sandboxes isolate control flow next to authenticated state.
That distinction matters because the browser already has the valuable state: cookies, tabs, DOM, SSO, CSRF tokens, service-worker state, and extension permissions. Moving execution away from that state means exporting cookies, replaying headers, proxying requests, or keeping a remote browser logged in.
The browser is already an authenticated runtime. Use it.
Cloudflare Agent Lee got the shape right
Cloudflare's Agent Lee is directionally right: convert tools into a code surface, ask the model to write code, execute it in a sandbox, mediate privileged operations.
That is much better than forcing the model to choose one tool at a time forever.
Agent Lee took Cloudflare months of dev time, plus ongoing maintenance.
Our goal is to provide similar harnesses directly to website owners, so agents can take actions on the live website.
Relation to AI Subroutines
Our AI Subroutines launch was about moving replay off the model's hot path.
Record one browser action. Save it as a deterministic tool. Replay it without paying an LLM to rediscover every click.
The harness DSL is the same idea at workflow scale.
Subroutines made one action deterministic.
Sandboxed DSL execution makes the whole workflow programmable.
The model still matters. It writes the program. It handles fuzzy extraction. It judges ambiguous page content. But it does not need to be the loop counter, retry policy, and spreadsheet writer.
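As a sketch of how the two compose, here is a hypothetical workflow that replays a recorded subroutine from inside generated code; rtrvr.runSubroutine is an illustrative name, not a shipped helper:

// Hypothetical composition: the recorded subroutine replays deterministically,
// and the model is only consulted where judgment is actually needed.
for (const tab of await rtrvr.listTabs()) {
  const lead = await rtrvr.extract({ tabIds: [tab.tabId], userInput: "Extract the lead" });
  if (lead.intent !== "high") continue;
  // Recorded once, replayed without an LLM call per click.
  await rtrvr.runSubroutine("create-crm-contact", { tabId: tab.tabId, lead });
}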
Where this goes next
Rover is our one-script-tag agent for websites. Today it reads the live site, plans actions, and executes through the page's own UI.
The long-term direction is to let websites expose small harnesses of their own.
Not every site should have to build an Agent Lee-style platform. Not every site should expose an MCP server. Not every site should maintain a parallel API surface just so agents can act.
The web already has UI, auth, state, and permissions.
What it needs is a safe harness where agents can express control flow against that surface.
TL;DR:
Sandboxed DSL execution is all you need for a lot of browser-agent work.