
We made our browser agent 100x cheaper by making it write code

We launched a browser agent that executes as code inside your live browser: it reads the page as DOM and text, writes one JavaScript plan, then runs it through a constrained harness instead of paying for every click as an LLM turn.

Arjun·June 23, 2026·9 min read

  • Code-as-Plan: the model writes one JS program instead of a long tool-call transcript;
  • Text-only DOM: page context is compressed DOM, text, forms, links, and capabilities;
  • DeepSeek Flash: a cheap, code-capable planning model for the browser hot path;
  • 100x cost path: savings compound across fewer turns, smaller context, and cheaper tokens.

We just launched a browser agent that executes as code in your browser.

Not "uses code somewhere behind the scenes." The agent reads the live web as DOM, text, tabs, files, and context, writes a JavaScript plan, and runs that plan through the rtrvr.* harness inside the browser where your sessions already exist.

This is the shift we have been chasing since Retriever started: browser agents that feel less like remote-controlled screens and more like software that can create the missing API for any website, on demand.

The model no longer has to be the runtime.

It can be the compiler.

It is the difference between:

text
observe -> ask model -> click -> observe -> ask model -> type -> observe -> ask model -> ...

and:

text
DOM + tools + intent -> DeepSeek Flash -> JavaScript plan -> rtrvr.* harness -> browser actions

The first architecture rents the model as an expensive event loop.

The second turns the web into a programmable surface.

The bet

We made five architectural bets that are now compounding:

  1. Text-only beats screenshot-first for cost. The browser already has the DOM, forms, links, inputs, URLs, cookies, routes, and page text. Throwing that away and asking a vision model to rediscover it from pixels is expensive.

  2. Code beats tool-call transcripts. Most browser work is loops, filtering, retries, URL construction, extraction, deduping, and writing structured outputs. Those are programming tasks. A for-loop should not cost tokens.

  3. The harness is the product. Generated code is only useful if it runs against safe capabilities: getPageTree, find, click, type, pageAction, extract, processText, callTool, askUser, sheets, KBs, recordings, cloud scrape, pause, cancel, and logs.

  4. The authenticated browser is the runtime. A lot of valuable automation needs the user's real session: SSO, cookies, CSRF tokens, extension permissions, selected tabs, service-worker state. Moving everything to a remote browser means recreating state the user already has locally.

  5. Screenshots are a fallback, not the hot path. Vision is useful. It should not be the default billing unit for every row in a table.

DeepSeek Flash made this architecture cheap enough to become the default path.

The old browser-agent loop is the bottleneck

A normal tool-loop agent does this:

text
while not done:
    observation = observe_page()
    action = llm(observation, tools, history)
    result = run_tool(action)

This is simple to build and brutal to run.

Suppose the user asks:

Find every pricing page open in my tabs, extract the team plan, and add the ones over $100/mo to a Sheet.

A tool-loop agent pays the model to remember the loop invariant:

text
LLM: list tabs
Tool: tabs returned
LLM: inspect tab 1
Tool: page returned
LLM: extract plan
Tool: extraction returned
LLM: append row?
Tool: row appended
LLM: inspect tab 2
...

That is not intelligence. That is using a language model as a slow JavaScript interpreter.

The invariant should be code:

javascript
const tabs = await rtrvr.selectedTabs()
const rows = []

for (const tab of tabs) {
  if (!/pricing|plans|billing/i.test(tab.title + ' ' + tab.url)) {
    continue
  }

  const { tree, links } = await rtrvr.getPageTree({ tabId: tab.tabId })

  // The semantic DOM tree is text. Code can slice, regex, dedupe,
  // normalize currency, and join back to structured links before asking a model.
  const pricingText = tree
    .split('\n')
    .filter(line => /\$|pricing|plan|team|business|enterprise|per user|month/i.test(line))
    .join('\n')

  const { data } = await rtrvr.processText({
    textInputs: [pricingText || tree],
    taskInstruction: 'Extract the product name, team/business plan name, monthly USD price, and short evidence.',
    schema: {
      type: 'object',
      properties: {
        product: { type: 'string' },
        plan: { type: 'string' },
        monthlyPriceUsd: { type: 'number' },
        evidence: { type: 'string' }
      }
    }
  })

  const plan = Array.isArray(data) ? data[0] : data
  if ((plan?.monthlyPriceUsd ?? 0) > 100) {
    rows.push([plan.product, plan.plan, plan.monthlyPriceUsd, tab.url, plan.evidence])
  }
}

const { sheetId, sheetUrl } = await rtrvr.createSheet({
  title: 'Expensive team plans',
  headers: ['Product', 'Plan', 'Monthly USD', 'Source URL', 'Evidence']
})

if (rows.length > 0) await rtrvr.appendRow({ sheetId, rows })

return {
  summary: `Saved ${rows.length} plans over $100/mo.`,
  sheetUrl,
  rows: rows.length
}

The model writes the loop once. The browser runs it locally. The harness keeps authority.

Notice what happens in the middle: the agent treats the semantic DOM tree as a string and uses normal software tools on it. It can split sections, run regexes, normalize prices, dedupe URLs, use links for clean hrefs, and call processText only on the small slice that still needs judgment.

A vision-first agent can look at a pricing card. It cannot cheaply run tree.split('\n').filter(...) over the page.

The 100x is architectural

This is not "DeepSeek is magically 100x cheaper."

The cost curve changes because three multipliers move at the same time:

text
cost = turns * context_size * model_price

We cut turns by compiling the plan into code.

We cut context size by using DOM/text instead of screenshots.

We cut model price by moving the hot path to DeepSeek Flash.
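As a back-of-the-envelope sketch, here is how the three multipliers in the formula above combine. Every number in it is a made-up placeholder chosen for round arithmetic, not a measurement and not a published price; `runCostUsd` is a hypothetical helper, not part of rtrvr.*.

```javascript
// Illustrative cost model: cost = turns * context_size * model_price.
// All figures below are hypothetical placeholders, not benchmarks or real prices.
function runCostUsd({ turns, contextTokens, usdPerMillionTokens }) {
  return (turns * contextTokens * usdPerMillionTokens) / 1_000_000
}

// Tool-loop agent (assumed): many turns, large context per turn, pricier model.
const toolLoop = runCostUsd({ turns: 60, contextTokens: 10_000, usdPerMillionTokens: 1.0 })

// Code-as-plan agent (assumed): one planning call plus three extractions,
// smaller text-only context, cheaper model.
const codePlan = runCostUsd({ turns: 4, contextTokens: 5_000, usdPerMillionTokens: 0.3 })

const ratio = toolLoop / codePlan
console.log({ toolLoop, codePlan, ratio: Math.round(ratio) })
```

With these placeholder inputs, 15x fewer turns, 2x smaller context, and roughly 3.3x cheaper tokens multiply out to about 100x. No single factor gets there alone.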

For tasks where the old agent needed 40 to 100 model turns and the new one needs one planning call plus a few semantic extractions, the end-to-end inference cost can drop by roughly two orders of magnitude.

That matters more than benchmark theater.

A demo can spend 80 model turns to complete one checkout.

A product cannot spend 80 model turns every time a user wants to sync 500 rows, migrate a dashboard, audit 40 tabs, or run the same workflow every morning.

Production browser agents should be judged on cost per successful run and cost per 1,000 successful runs, not only "did it eventually click the right button once?"
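A minimal sketch of that metric, assuming you track total runs, success rate, and average inference cost per run. The function name and the inputs are hypothetical and purely illustrative.

```javascript
// Cost per successful run: total inference spend divided by the runs that succeeded.
// Failed runs still cost money, so they raise the effective price of each success.
function costPerSuccessfulRun({ totalRuns, successRate, costPerRunUsd }) {
  const successes = totalRuns * successRate
  if (successes <= 0) return Infinity
  return (totalRuns * costPerRunUsd) / successes
}

// Hypothetical inputs, for illustration only.
const perSuccess = costPerSuccessfulRun({ totalRuns: 1000, successRate: 0.8, costPerRunUsd: 0.01 })
const perThousandSuccesses = perSuccess * 1000
console.log(perSuccess.toFixed(4), perThousandSuccesses.toFixed(2))
```

The point of dividing by successes rather than attempts is that an agent that is cheap per run but fails often can still be expensive per result.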

What the harness gives the model

The model does not get raw browser authority.

It gets a constrained capability surface:

javascript
const { tree, links } = await rtrvr.getPageTree({ tabId })
const button = await rtrvr.find({ role: 'button', name: /upgrade/i }, { tabId })
if (button) await rtrvr.click(button, { tabId })
await rtrvr.type({ role: 'textbox', name: /email/i }, user.email, { clear: true, tabId })
await rtrvr.pageAction({ tool: 'goto_url', args: { url }, tabId })
await rtrvr.extract({ userInput, tabIds: [tabId], schema })
await rtrvr.processText({ textInputs: [tree], taskInstruction, schema })
await rtrvr.scrape({ urls })
await rtrvr.createSheet({ title, headers })
await rtrvr.appendRow({ sheetId, rows })
await rtrvr.callTool('slack.postMessage', payload)
await rtrvr.askUser({ questions, reason, checkpoint })

The generated program gets control flow.

The harness keeps authority.

That separation is the whole point.

The model can write:

javascript
for (const lead of leads) {
  if (lead.intent === 'enterprise') {
    await rtrvr.callTool('slack.postMessage', {
      channel: '#sales',
      text: lead.company + ' is asking about enterprise pricing',
      context: lead
    })
  }
}

But the model never sees raw Slack credentials. It gets a scoped slack.postMessage capability, mediated by policy, logged by the runtime, and revocable by the user.
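One way to picture that mediation is a small broker that sits between generated code and the real tool clients. This is a hypothetical sketch, not the rtrvr.* implementation: `createToolBroker`, `grants`, `handlers`, and `auditLog` are names invented here, and the call is synchronous for brevity where the real `callTool` is awaited.

```javascript
// Hypothetical capability broker. All names here are illustrative assumptions.
function createToolBroker({ grants, handlers }) {
  const granted = new Set(grants)
  const auditLog = []

  function callTool(name, payload) {
    const allowed = granted.has(name) && typeof handlers[name] === 'function'
    // Every attempt is logged, including the ones that get refused.
    auditLog.push({ name, allowed, at: Date.now() })
    if (!allowed) throw new Error(`Capability not granted: ${name}`)
    // The handler closes over its own credentials; the caller only passes a payload.
    return handlers[name](payload)
  }

  function revoke(name) {
    granted.delete(name)
  }

  return { callTool, revoke, auditLog }
}

// Stand-in for a real Slack client. The token lives in this closure only.
function makeSlackHandler(secretToken) {
  return ({ channel, text }) => ({ ok: secretToken.length > 0, channel, text })
}

const broker = createToolBroker({
  grants: ['slack.postMessage'],
  handlers: { 'slack.postMessage': makeSlackHandler('placeholder-token') }
})

const sent = broker.callTool('slack.postMessage', {
  channel: '#sales',
  text: 'Acme is asking about enterprise pricing'
})

// After the user revokes the grant, the same call throws instead of posting.
broker.revoke('slack.postMessage')
```

The generated program only ever holds `callTool`. It has no reference to the token, the grant set, or the log.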

Same for browser actions. The code can express:

javascript
const upgrade = await rtrvr.find({ role: 'button', name: /upgrade/i }, { tabId })
if (!upgrade) return { status: 'incomplete', reason: 'Upgrade button not found' }
await rtrvr.click(upgrade, { tabId })

The harness decides whether that click is allowed, whether it needs human confirmation, whether it should run in the selected tab or cloud, and how to recover if the element moved.
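To make the "allowed or needs confirmation" decision concrete, here is a minimal sketch of a policy gate. The decision labels, the policy shape, and the keyword list are assumptions made for illustration; the actual harness rules are not described in this post.

```javascript
// Hypothetical policy gate, not the real rtrvr.* rules.
// Button names that suggest an irreversible side effect (assumed list).
const IRREVERSIBLE = /\b(buy|pay|purchase|delete|submit|send|upgrade|cancel)\b/i

function decideAction(action, policy) {
  const host = new URL(action.url).hostname
  if (policy.blockedHosts.includes(host)) {
    return { decision: 'deny', reason: `Host ${host} is blocked` }
  }
  if (action.kind === 'click' && IRREVERSIBLE.test(action.targetName)) {
    return { decision: 'confirm', reason: `"${action.targetName}" may have side effects` }
  }
  return { decision: 'allow', reason: 'No policy rule matched' }
}

const policy = { blockedHosts: ['bank.example'] }

const upgradeClick = decideAction(
  { kind: 'click', targetName: 'Upgrade plan', url: 'https://app.example/billing' },
  policy
)
const readOnly = decideAction(
  { kind: 'click', targetName: 'Show details', url: 'https://app.example/billing' },
  policy
)
const blocked = decideAction(
  { kind: 'click', targetName: 'Show details', url: 'https://bank.example/transfer' },
  policy
)
console.log(upgradeClick.decision, readOnly.decision, blocked.decision)
```

The generated plan asks for the click. A function like this, owned by the harness and not by the model, answers.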

Why text-only matters

Most browser agents are built like a player in a video game.

They look at a screenshot, infer where to click, wait for the page to change, and look again.

That is useful for compatibility. It is a bad default abstraction.

A webpage is not a bitmap. It is a live data structure:

  • DOM tree;
  • URL and route state;
  • forms and inputs;
  • ARIA labels;
  • hidden fields;
  • client-side storage;
  • network calls;
  • authenticated browser state;
  • scripts that already know how the product works.

The killer detail is that our semantic DOM tree is text that code can operate on. Generated plans can do the boring, powerful things that make software cheap:

  • tree.split('\n') to isolate a section;
  • regex parsing for prices, emails, dates, SKUs, statuses, and IDs;
  • new URL(...) to build deep links instead of clicking through menus;
  • Map / Set deduping before any model call;
  • joining tree ids back to links for clean names and hrefs;
  • falling back to processText only for the slice that needs semantics.

That is not available to a screenshot agent. Pixels are not a data structure you can filter, join, normalize, or regex without paying a model to rediscover the structure first.
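A small sketch of those string operations chained together. The tree format and the `links` shape below are invented for illustration, and the real `getPageTree` output may look different; the point is only that plain JavaScript can do this work before any model call.

```javascript
// Sample data in an assumed format: "[id] role "text"" per line.
const tree = [
  '[1] heading "Pricing"',
  '[2] text "Starter $29 per user / month"',
  '[3] link "Team $120 per user / month"',
  '[4] link "Team $120 per user / month"',
  '[5] text "Enterprise: contact sales"'
].join('\n')

const links = [
  { id: 3, href: '/plans/team?ref=card' },
  { id: 4, href: '/plans/team?ref=footer' }
]

const baseUrl = 'https://example.com/pricing'

// 1. Slice: keep only lines that mention a dollar price.
const priceLines = tree.split('\n').filter(line => /\$\d/.test(line))

// 2. Regex parse: pull node id, plan name, and numeric price out of each line.
const parsed = priceLines.map(line => {
  const match = line.match(/^\[(\d+)\] \w+ "(\w+) \$(\d+(?:\.\d+)?)/)
  return match && { id: Number(match[1]), plan: match[2], monthlyUsd: Number(match[3]) }
}).filter(Boolean)

// 3. Join tree ids back to links and build absolute URLs without tracking params.
const hrefById = new Map(links.map(link => [link.id, link.href]))
const withUrls = parsed.map(row => {
  const href = hrefById.get(row.id)
  if (!href) return { ...row, url: null }
  const url = new URL(href, baseUrl)
  url.search = ''
  return { ...row, url: url.href }
})

// 4. Dedupe by plan name before any model call.
const seen = new Set()
const plans = withUrls.filter(row => {
  if (seen.has(row.plan)) return false
  seen.add(row.plan)
  return true
})

console.log(plans)
```

In this toy page nothing is left for `processText` to do. On a messier page, only the lines the regex could not parse would need to go to a model.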

If the agent's input is structured page state and its output is code, the right planner is a cheap, fast text/code model. You do not need a multimodal frontier model in the hot path for every step.

This was the long-term bet: browser agents would eventually become text-only and code-first.

DeepSeek Flash made the bet obvious.

Why this is not just Playwright generation

Playwright is great when you own the environment.

Browser agents usually operate inside a user's logged-in browser, across arbitrary sites, with real cookies, SSO, CSRF, selected tabs, extension permissions, and local files.

If every workflow moves to a remote Playwright worker, you have to recreate that state: export cookies, sync profiles, proxy requests, or ask the user to log in again inside a cloud browser.

Sometimes cloud is the right runtime. We run cloud browsers too.

But the differentiated path is this:

Execute next to the browser state when the browser state already exists.

The generated code is not generic Playwright. It is code against a browser-aware harness that can use the live tab, call cloud scrape when needed, route file uploads correctly, cross iframes and shadow DOM, stream console output, pause, cancel, and ask the human before irreversible side effects.
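The routing idea can be sketched as a tiny decision function. This is a hypothetical rule of thumb, not how the harness actually routes work: the field names and the runtime labels are assumptions introduced here.

```javascript
// Hypothetical routing rule; field names and runtime labels are assumptions.
function chooseRuntime({ requiresLogin, hasLiveSession, urlCount }) {
  // Session state already exists locally, so run next to it.
  if (requiresLogin && hasLiveSession) return 'live-tab'
  // No session anywhere yet: a human has to sign in before anything can run.
  if (requiresLogin && !hasLiveSession) return 'ask-user-to-sign-in'
  // Public pages carry no session state, so bulk work can fan out to the cloud.
  return urlCount > 1 ? 'cloud-scrape' : 'live-tab'
}

const dashboardExport = chooseRuntime({ requiresLogin: true, hasLiveSession: true, urlCount: 1 })
const publicCrawl = chooseRuntime({ requiresLogin: false, hasLiveSession: false, urlCount: 200 })
console.log(dashboardExport, publicCrawl)
```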

The artifact is better

A transcript agent leaves behind 60 opaque tool calls.

A code-as-plan agent leaves behind a program.

You can read it.

You can diff it.

You can rerun it.

You can cache it.

You can turn it into a subroutine.

That was the core of our earlier AI Subroutines launch: repetitive browser work should become scripts, not repeated inference.

Code-as-plan is the same idea one level earlier. Instead of recording the script manually, the model writes it from page context and intent.
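Because the artifact is source code, caching it can be as simple as a keyed map. The sketch below is hypothetical: `PlanCache` and its key shape (site origin plus normalized intent) are assumptions for illustration, not a Retriever AI API, and a real cache would also need invalidation when a site changes.

```javascript
// Hypothetical plan cache: reuse a generated program for the same intent on the same site.
class PlanCache {
  constructor() {
    this.plans = new Map()
  }

  // Key on the site origin plus a whitespace- and case-normalized intent.
  static key(intent, pageUrl) {
    const origin = new URL(pageUrl).origin
    const normalizedIntent = intent.trim().toLowerCase().replace(/\s+/g, ' ')
    return `${origin}::${normalizedIntent}`
  }

  get(intent, pageUrl) {
    return this.plans.get(PlanCache.key(intent, pageUrl)) ?? null
  }

  set(intent, pageUrl, source) {
    this.plans.set(PlanCache.key(intent, pageUrl), source)
  }
}

const cache = new PlanCache()
cache.set('Export  open invoices', 'https://app.example.com/billing', 'return { status: "ok" }')

// Same site and same intent (modulo case and spacing): reuse the stored program.
const hit = cache.get('export open invoices', 'https://app.example.com/invoices?page=2')
// Different site: no stored plan, so the planner would be called.
const miss = cache.get('export open invoices', 'https://other.example.com/')
console.log(hit, miss)
```

A cache hit means zero planning tokens for that run. A transcript of 60 tool calls offers nothing comparable to store.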

TL;DR

The browser-agent bottleneck is architecture, not only model intelligence.

We are betting on:

  • text-only DOM context over screenshot loops;
  • semantic DOM trees that code can parse with strings, regexes, URLs, maps, and sets;
  • DeepSeek Flash as the cheap code-capable planner;
  • generated JavaScript plans over tool-call transcripts;
  • rtrvr.* as the authority boundary;
  • local browser execution when auth/session state already exists;
  • screenshots and cloud browsers as fallbacks, not the default tax.

That turns the model from an expensive runtime into a cheap compiler.

And once the model is a compiler, browser agents stop feeling like demos and start feeling like software.

The next generation of browser agents will not browse like humans. They will read the web like programs.
