We just launched a browser agent that executes as code in your browser.
Not "uses code somewhere behind the scenes." The agent reads the live web as DOM, text, tabs, files, and context, writes a JavaScript plan, and runs that plan through the rtrvr.* harness inside the browser where your sessions already exist.
This is the shift we have been chasing since Retriever started: browser agents that feel less like remote-controlled screens and more like software that can create the missing API for any website, on demand.
The model no longer has to be the runtime.
It can be the compiler.
It is the difference between:
observe -> ask model -> click -> observe -> ask model -> type -> observe -> ask model -> ...
and:
DOM + tools + intent -> DeepSeek Flash -> JavaScript plan -> rtrvr.* harness -> browser actions
The first architecture rents the model as an expensive event loop.
The second turns the web into a programmable surface.
The bet
We made five architectural bets that are now compounding:
- Text-only beats screenshot-first for cost. The browser already has the DOM, forms, links, inputs, URLs, cookies, routes, and page text. Throwing that away and asking a vision model to rediscover it from pixels is expensive.
- Code beats tool-call transcripts. Most browser work is loops, filtering, retries, URL construction, extraction, deduping, and writing structured outputs. Those are programming tasks. A for-loop should not cost tokens.
- The harness is the product. Generated code is only useful if it runs against safe capabilities: getPageTree, find, click, type, pageAction, extract, processText, callTool, askUser, sheets, KBs, recordings, cloud scrape, pause, cancel, and logs.
- The authenticated browser is the runtime. A lot of valuable automation needs the user's real session: SSO, cookies, CSRF tokens, extension permissions, selected tabs, service-worker state. Moving everything to a remote browser means recreating state the user already has locally.
- Screenshots are a fallback, not the hot path. Vision is useful. It should not be the default billing unit for every row in a table.
DeepSeek Flash made this architecture cheap enough to become the default path.
The old browser-agent loop is the bottleneck
A normal tool-loop agent does this:
while not done:
    observation = observe_page()
    action = llm(observation, tools, history)
    result = run_tool(action)
This is simple to build and brutal to run.
Suppose the user asks:
Find every pricing page open in my tabs, extract the team plan, and add the ones over $100/mo to a Sheet.
A tool-loop agent pays the model to remember the loop invariant:
LLM: list tabs
Tool: tabs returned
LLM: inspect tab 1
Tool: page returned
LLM: extract plan
Tool: extraction returned
LLM: append row?
Tool: row appended
LLM: inspect tab 2
...
That is not intelligence. That is using a language model as a slow JavaScript interpreter.
The invariant should be code:
const tabs = await rtrvr.selectedTabs()
const rows = []
for (const tab of tabs) {
  if (!/pricing|plans|billing/i.test(tab.title + ' ' + tab.url)) {
    continue
  }
  const { tree, links } = await rtrvr.getPageTree({ tabId: tab.tabId })
  // The semantic DOM tree is text. Code can slice, regex, dedupe,
  // normalize currency, and join back to structured links before asking a model.
  const pricingText = tree
    .split('\n')
    .filter(line => /\$|pricing|plan|team|business|enterprise|per user|month/i.test(line))
    .join('\n')
  const { data } = await rtrvr.processText({
    textInputs: [pricingText || tree],
    taskInstruction: 'Extract the product name, team/business plan name, monthly USD price, and short evidence.',
    schema: {
      type: 'object',
      properties: {
        product: { type: 'string' },
        plan: { type: 'string' },
        monthlyPriceUsd: { type: 'number' },
        evidence: { type: 'string' }
      }
    }
  })
  const plan = Array.isArray(data) ? data[0] : data
  if ((plan?.monthlyPriceUsd ?? 0) > 100) {
    rows.push([plan.product, plan.plan, plan.monthlyPriceUsd, tab.url, plan.evidence])
  }
}
const { sheetId, sheetUrl } = await rtrvr.createSheet({
  title: 'Expensive team plans',
  headers: ['Product', 'Plan', 'Monthly USD', 'Source URL', 'Evidence']
})
if (rows.length > 0) await rtrvr.appendRow({ sheetId, rows })
return { summary: `Saved ${rows.length} plans over $100/mo.`, sheetUrl, rows: rows.length }
The model writes the loop once. The browser runs it locally. The harness keeps authority.
Notice what happens in the middle: the agent treats the semantic DOM tree as a string and uses normal software tools on it. It can split sections, run regexes, normalize prices, dedupe URLs, use links for clean hrefs, and call processText only on the small slice that still needs judgment.
A vision-first agent can look at a pricing card. It cannot cheaply run tree.split('\n').filter(...) over the page.
The 100x is architectural
This is not "DeepSeek is magically 100x cheaper."
The cost curve changes because three multipliers move at the same time:
cost = turns * context_size * model_price
We cut turns by compiling the plan into code.
We cut context size by using DOM/text instead of screenshots.
We cut model price by moving the hot path to DeepSeek Flash.
For tasks where the old agent needed 40 to 100 model turns and the new one needs one planning call plus a few semantic extractions, the end-to-end inference cost can drop by roughly two orders of magnitude.
That matters more than benchmark theater.
A demo can spend 80 model turns to complete one checkout.
A product cannot spend 80 model turns every time a user wants to sync 500 rows, migrate a dashboard, audit 40 tabs, or run the same workflow every morning.
Production browser agents should be judged on cost per successful run and cost per 1,000 successful runs, not only "did it eventually click the right button once?"
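To make the multipliers concrete, here is a back-of-the-envelope sketch of the formula above together with the cost-per-successful-run metric. Every input (turn counts, tokens per turn, per-million-token prices, success rates) is an illustrative assumption, not a measurement or a published price, and the function names are hypothetical.

```javascript
// Back-of-the-envelope cost model. All inputs are hypothetical placeholders.

// cost = turns * context_size * model_price
function inferenceCost({ turns, tokensPerTurn, usdPerMillionTokens }) {
  return turns * tokensPerTurn * (usdPerMillionTokens / 1e6);
}

// Cost per successful run: spend per attempt divided by the share of attempts that succeed.
function costPerSuccessfulRun(costPerRun, successRate) {
  if (successRate <= 0) return Infinity;
  return costPerRun / successRate;
}

// Assumed screenshot tool-loop: many turns, large context, pricier model.
const loopAgent = { turns: 60, tokensPerTurn: 8000, usdPerMillionTokens: 3.0 };
// Assumed code-as-plan run: one planning call plus a few extractions over text-only context.
const planAgent = { turns: 5, tokensPerTurn: 4000, usdPerMillionTokens: 0.3 };

const loopCost = inferenceCost(loopAgent);
const planCost = inferenceCost(planAgent);
const ratio = loopCost / planCost;

// Assumed success rates, again purely illustrative.
const loopPerThousand = 1000 * costPerSuccessfulRun(loopCost, 0.8);
const planPerThousand = 1000 * costPerSuccessfulRun(planCost, 0.9);

console.log({ loopCost, planCost, ratio, loopPerThousand, planPerThousand });
```

With these made-up inputs the three factors contribute 12x (turns), 2x (context), and 10x (price), which multiply to roughly 240x. No single factor gets there alone, which is the sense in which the saving is architectural.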
What the harness gives the model
The model does not get raw browser authority.
It gets a constrained capability surface:
const { tree, links } = await rtrvr.getPageTree({ tabId })
const button = await rtrvr.find({ role: 'button', name: /upgrade/i }, { tabId })
if (button) await rtrvr.click(button, { tabId })
await rtrvr.type({ role: 'textbox', name: /email/i }, user.email, { clear: true, tabId })
await rtrvr.pageAction({ tool: 'goto_url', args: { url }, tabId })
await rtrvr.extract({ userInput, tabIds: [tabId], schema })
await rtrvr.processText({ textInputs: [tree], taskInstruction, schema })
await rtrvr.scrape({ urls })
await rtrvr.createSheet({ title, headers })
await rtrvr.appendRow({ sheetId, rows })
await rtrvr.callTool('slack.postMessage', payload)
await rtrvr.askUser({ questions, reason, checkpoint })
The generated program gets control flow.
The harness keeps authority.
That separation is the whole point.
The model can write:
for (const lead of leads) {
  if (lead.intent === 'enterprise') {
    await rtrvr.callTool('slack.postMessage', {
      channel: '#sales',
      text: lead.company + ' is asking about enterprise pricing',
      context: lead
    })
  }
}
But the model never sees raw Slack credentials. It gets a scoped slack.postMessage capability, mediated by policy, logged by the runtime, and revocable by the user.
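One way to picture that separation is a small mediating layer. The sketch below is not the rtrvr.* implementation: createHarness, grants, handlers, and the log shape are hypothetical names chosen for illustration, and callTool is synchronous here only to keep the example short.

```javascript
// Hypothetical capability-mediating harness (illustrative names, not the real rtrvr.* code).
function createHarness({ grants, handlers }) {
  const allowed = new Set(grants); // capabilities the user has granted
  const log = [];                  // audit trail kept by the runtime

  return {
    log,
    revoke(name) {
      allowed.delete(name);
    },
    callTool(name, payload) {
      if (!allowed.has(name)) {
        log.push({ name, status: 'denied' });
        throw new Error(`Capability not granted: ${name}`);
      }
      // The handler closes over credentials; generated code never receives them.
      const result = handlers[name](payload);
      log.push({ name, status: 'ok' });
      return result;
    },
  };
}

// Credentials live in harness-side code only.
const SLACK_TOKEN = 'placeholder-token'; // placeholder value, never handed to the plan
const sent = [];
const harness = createHarness({
  grants: ['slack.postMessage'],
  handlers: {
    'slack.postMessage': ({ channel, text }) => {
      // A real handler would call Slack using SLACK_TOKEN; this stub only records the message.
      sent.push({ channel, text, authorized: SLACK_TOKEN.length > 0 });
      return { ok: true };
    },
  },
});

// What a generated plan is able to do:
harness.callTool('slack.postMessage', {
  channel: '#sales',
  text: 'Acme is asking about enterprise pricing',
});

// After the user revokes the grant, the same call fails and the denial is logged.
harness.revoke('slack.postMessage');
let revokedError = null;
try {
  harness.callTool('slack.postMessage', { channel: '#sales', text: 'second message' });
} catch (err) {
  revokedError = err.message;
}
```

The generated plan only ever holds a capability name and a payload. The token, the allowlist, and the audit log all sit on the harness side of the boundary.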
Same for browser actions. The code can express:
const upgrade = await rtrvr.find({ role: 'button', name: /upgrade/i }, { tabId })
if (!upgrade) return { status: 'incomplete', reason: 'Upgrade button not found' }
await rtrvr.click(upgrade, { tabId })
The harness decides whether that click is allowed, whether it needs human confirmation, whether it should run in the selected tab or cloud, and how to recover if the element moved.
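A minimal sketch of the allow / confirm / deny part of that decision, assuming a simple rule-based policy. The policy shape (blockedHosts, confirmPatterns) and the decideAction function are hypothetical and exist only to illustrate the idea. A real harness would presumably weigh far more context.

```javascript
// Hypothetical policy check for a browser action requested by generated code.
function decideAction(action, policy) {
  const host = new URL(action.pageUrl).hostname;
  if (policy.blockedHosts.includes(host)) {
    return { decision: 'deny', reason: `Host ${host} is blocked` };
  }
  const label = action.targetName || '';
  // Patterns carry no "g" flag, so RegExp.prototype.test stays stateless between calls.
  const risky = policy.confirmPatterns.some((pattern) => pattern.test(label));
  if (action.kind === 'click' && risky) {
    return { decision: 'confirm', reason: `"${label}" may be irreversible` };
  }
  return { decision: 'allow', reason: 'No rule matched' };
}

// Assumed example policy.
const policy = {
  blockedHosts: ['bank.example.com'],
  confirmPatterns: [/upgrade|purchase|delete|pay/i],
};

const upgradeClick = decideAction(
  { kind: 'click', pageUrl: 'https://app.example.com/billing', targetName: 'Upgrade plan' },
  policy
);
const harmlessClick = decideAction(
  { kind: 'click', pageUrl: 'https://app.example.com/docs', targetName: 'Next page' },
  policy
);
const blockedClick = decideAction(
  { kind: 'click', pageUrl: 'https://bank.example.com/transfer', targetName: 'Next' },
  policy
);
```

Under this policy the upgrade click is routed to human confirmation, the docs click runs, and anything on the blocked host is refused before the page is touched.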
Why text-only matters
Most browser agents are built like a player in a video game.
They look at a screenshot, infer where to click, wait for the page to change, and look again.
That is useful for compatibility. It is a bad default abstraction.
A webpage is not a bitmap. It is a live data structure:
- DOM tree;
- URL and route state;
- forms and inputs;
- ARIA labels;
- hidden fields;
- client-side storage;
- network calls;
- authenticated browser state;
- scripts that already know how the product works.
The killer detail is that our semantic DOM tree is text that code can operate on. Generated plans can do the boring, powerful things that make software cheap:
- tree.split('\n') to isolate a section;
- regex parsing for prices, emails, dates, SKUs, statuses, and IDs;
- new URL(...) to build deep links instead of clicking through menus;
- Map/Set deduping before any model call;
- joining tree ids back to links for clean names and hrefs;
- falling back to processText only for the slice that needs semantics.
That is not available to a screenshot agent. Pixels are not a data structure you can filter, join, normalize, or regex without paying a model to rediscover the structure first.
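The operations listed above can be sketched in plain JavaScript with no model call. The tree text here is made up for the example, and the real semantic tree format may look different. The "-> href" link notation and the example.com URLs are likewise assumptions.

```javascript
// Made-up tree text standing in for a semantic DOM tree.
const tree = [
  '[1] heading "Pricing"',
  '[2] text "Starter $12 per user / month"',
  '[3] link "Team plan" -> /plans/team?ref=nav',
  '[4] text "Team $129 per user / month"',
  '[5] link "Team plan" -> /plans/team?ref=footer',
  '[6] text "Contact sales@example.com for Enterprise"',
].join('\n');

const lines = tree.split('\n');

// Regex parsing: pull dollar amounts out of the lines that contain one.
const prices = lines
  .map((line) => line.match(/\$(\d+(?:\.\d+)?)/))
  .filter(Boolean)
  .map((match) => Number(match[1]));

// Regex parsing: email addresses anywhere in the tree.
const emails = tree.match(/[\w.+-]+@[\w-]+\.[\w.-]+/g) || [];

// new URL(...) + Set: resolve hrefs against the page URL, drop query params, dedupe.
const pageUrl = 'https://example.com/pricing';
const uniqueLinks = [...new Set(
  lines
    .map((line) => line.match(/-> (\S+)$/))
    .filter(Boolean)
    .map((match) => {
      const url = new URL(match[1], pageUrl);
      url.search = '';
      return url.href;
    })
)];

// Only the slice that still needs judgment would be handed to a model.
const needsSemantics = lines.filter((line) => /enterprise/i.test(line));

console.log({ prices, emails, uniqueLinks, needsSemantics });
```

Six lines of page state reduce to two prices, one email, one canonical link, and a single line that might justify a semantic call.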
If the agent's input is structured page state and its output is code, the right planner is a cheap, fast text/code model. You do not need a multimodal frontier model in the hot path for every step.
This was the long-term bet: browser agents would eventually become text-only and code-first.
DeepSeek Flash made the bet obvious.
Why this is not just Playwright generation
Playwright is great when you own the environment.
Browser agents usually operate inside a user's logged-in browser, across arbitrary sites, with real cookies, SSO, CSRF, selected tabs, extension permissions, and local files.
If every workflow moves to a remote Playwright worker, you have to recreate that state: export cookies, sync profiles, proxy requests, or ask the user to log in again inside a cloud browser.
Sometimes cloud is the right runtime. We run cloud browsers too.
But the differentiated path is this:
execute next to the browser state when the browser state already exists.
The generated code is not generic Playwright. It is code against a browser-aware harness that can use the live tab, call cloud scrape when needed, route file uploads correctly, cross iframes and shadow DOM, stream console output, pause, cancel, and ask the human before irreversible side effects.
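The routing rule (run next to existing browser state, use cloud when there is no state to preserve) could be expressed roughly as follows. This is a hypothetical sketch: the task fields, the return labels, and the 20-URL threshold are all assumptions made for illustration, not how the harness actually decides.

```javascript
// Hypothetical runtime routing; field names and the threshold are illustrative assumptions.
function chooseRuntime(task) {
  // Work that depends on the user's session or local files stays in the live browser.
  if (task.needsAuth || task.needsLocalFiles) {
    return task.hasLiveSession ? 'local-tab' : 'ask-user-to-sign-in';
  }
  // Public pages carry no session state, so large batches can fan out to cloud scraping.
  return task.urlCount > 20 ? 'cloud-scrape' : 'local-tab';
}

const dashboardExport = chooseRuntime({ needsAuth: true, hasLiveSession: true, urlCount: 1 });
const expiredSession = chooseRuntime({ needsAuth: true, hasLiveSession: false, urlCount: 1 });
const publicCrawl = chooseRuntime({ needsAuth: false, needsLocalFiles: false, urlCount: 500 });
const smallPublicJob = chooseRuntime({ needsAuth: false, needsLocalFiles: false, urlCount: 3 });
```

The point of the sketch is the ordering: session state is checked first, and cloud is chosen only when nothing local would be lost.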
The artifact is better
A transcript agent leaves behind 60 opaque tool calls.
A code-as-plan agent leaves behind a program.
You can read it.
You can diff it.
You can rerun it.
You can cache it.
You can turn it into a subroutine.
That was the core of our earlier AI Subroutines launch: repetitive browser work should become scripts, not repeated inference.
Code-as-plan is the same idea one level earlier. Instead of recording the script manually, the model writes it from page context and intent.
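Because the artifact is a program, caching it is ordinary software engineering. Below is a minimal sketch of a plan cache keyed by intent plus a structural page signature. Everything in it is an assumption made for illustration: the signature scheme (first token of each tree line), the helper names, and the use of a 32-bit FNV-1a hash over UTF-16 code units, which is good enough for a demo but not collision-proof.

```javascript
// FNV-1a (32-bit) over UTF-16 code units, used only to build compact cache keys.
function fnv1a(text) {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash.toString(16).padStart(8, '0');
}

// Structural signature: keep the leading token of each line, drop volatile text.
// Assumes the first token of a tree line is the node role.
function pageSignature(tree) {
  return tree
    .split('\n')
    .map((line) => line.trim().split(' ')[0])
    .join('|');
}

function createPlanCache() {
  const plans = new Map();
  let compileCount = 0;
  return {
    get compileCount() {
      return compileCount;
    },
    getOrCompile(intent, tree, compile) {
      const key = fnv1a(intent.trim().toLowerCase() + '::' + pageSignature(tree));
      if (!plans.has(key)) {
        compileCount += 1; // stands in for one planning call to the model
        plans.set(key, compile(intent, tree));
      }
      return plans.get(key);
    },
  };
}

const cache = createPlanCache();
const fakeCompile = (intent) => `/* plan for: ${intent} */`; // stub for the planner
const monday = 'heading Pricing\nbutton Upgrade\ntext $129';
const tuesday = 'heading Pricing\nbutton Upgrade\ntext $139'; // price changed, structure did not

const planA = cache.getOrCompile('Export team plans', monday, fakeCompile);
const planB = cache.getOrCompile('export team plans ', tuesday, fakeCompile);
```

The second request reuses the first plan because only the page text changed. A structural change to the page would produce a new signature and trigger one fresh planning call.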
TL;DR
The browser-agent bottleneck is architecture, not only model intelligence.
We are betting on:
- text-only DOM context over screenshot loops;
- semantic DOM trees that code can parse with strings, regexes, URLs, maps, and sets;
- DeepSeek Flash as the cheap code-capable planner;
- generated JavaScript plans over tool-call transcripts;
- rtrvr.* as the authority boundary;
- local browser execution when auth/session state already exists;
- screenshots and cloud browsers as fallbacks, not the default tax.
That turns the model from an expensive runtime into a cheap compiler.
And once the model is a compiler, browser agents stop feeling like demos and start feeling like software.
The next generation of browser agents will not browse like humans. They will read the web like programs.
