Browser agent

Time: ~12 min. Languages: Python, TypeScript

Goal: Boot a MIOSA Computer (full Linux desktop VM), open a browser inside it, and run an LLM-driven action loop that screenshots the screen, decides what to do next, and executes the action — all from your own code.

What you’ll use: Computers, Desktop API (screenshot / click / type / key)

What you’ll build

A browser agent that:

Boots a Computer with a full desktop.
Opens a URL in Chromium.
Enters a screenshot → LLM → action loop.
Stops when the LLM signals completion.

The LLM is your choice — the code below uses Anthropic Claude’s vision API but the pattern is model-agnostic.

Prerequisites

A MIOSA workspace API key (msk_live_*) — see API Keys
An Anthropic API key (or substitute your preferred vision-capable model)
Node 22+ or Python 3.11+

Step 1 — Install and configure
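
The setup code for this step is not shown here. As a minimal sketch, assume the two SDKs install under the names they are imported by in the full script (`miosa` and `anthropic`; the `miosa` package name is an assumption) and that the keys live in the `MIOSA_API_KEY` and `ANTHROPIC_API_KEY` environment variables the full script reads. `load_keys` is a hypothetical helper, not part of either SDK.

```python
import os

# Assumed install command (package names inferred from the imports in the
# full script; check the MIOSA docs for the real package name):
#   pip install miosa anthropic

REQUIRED_KEYS = ("MIOSA_API_KEY", "ANTHROPIC_API_KEY")


def load_keys(env=None):
    """Return both API keys, failing early with a clear message."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_KEYS if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    # The prerequisites list workspace keys as msk_live_*; treat any other
    # prefix as a likely copy-paste mistake.
    if not env["MIOSA_API_KEY"].startswith("msk_live_"):
        raise RuntimeError("MIOSA_API_KEY should start with 'msk_live_'")
    return {name: env[name] for name in REQUIRED_KEYS}
```

Calling `load_keys()` before constructing the clients turns a missing key into one readable error instead of a `KeyError` halfway through a run.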

Step 2 — Boot a Computer

Computers take ~10 seconds to boot (full desktop VM, not a microVM). Create one, start it, then poll until its status reaches running.
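
One way to wait for the boot is sketched below. It uses only the calls that appear in the full script (`computer.status`, `client.computers.get`) and adds a timeout so a Computer that never reaches running cannot block forever. `wait_until_running` and its `timeout` and `poll` defaults are illustrative choices, not SDK features.

```python
import time


def wait_until_running(client, computer, timeout=60.0, poll=2.0):
    """Poll the Computer until its status is "running" or the timeout expires."""
    deadline = time.monotonic() + timeout
    while computer.status != "running":
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"Computer {computer.id} still {computer.status!r} after {timeout}s"
            )
        time.sleep(poll)
        # Re-fetch the Computer to read its latest status, as the full script does.
        computer = client.computers.get(computer.id)
    return computer


# Usage, following the full script:
#   computer = client.computers.create(template="ubuntu-desktop")
#   computer.start()
#   computer = wait_until_running(client, computer)
```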

Step 3 — Take a screenshot

The desktop starts at the Linux login screen or a default desktop. Take the first screenshot to see what you are working with.
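
A minimal sketch of the first capture, assuming `computer.screenshot()` returns raw PNG bytes as this guide states. The `capture` helper is hypothetical: it checks the PNG signature, optionally writes the image to disk for inspection, and returns the base64 string that a vision API expects.

```python
import base64
from pathlib import Path

# Every PNG file starts with this 8-byte signature.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def capture(computer, path=None):
    """Take a screenshot and return (raw_bytes, base64_string)."""
    png = computer.screenshot()
    if not png.startswith(PNG_SIGNATURE):
        raise ValueError("screenshot() did not return PNG data")
    if path is not None:
        # Save a copy so you can open it and see what the desktop looks like.
        Path(path).write_bytes(png)
    return png, base64.b64encode(png).decode("ascii")
```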

Step 4 — Open a URL in the browser

Use exec to launch Chromium with the target URL. The desktop VM has Chromium pre-installed.
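
The launch can be sketched as follows, using the same `chromium-browser` flags and `background=True` argument as the full script. The helper names are hypothetical. Quoting the URL with `shlex.quote` is an addition here: URLs often contain `&` or `?`, which a shell would otherwise interpret.

```python
import shlex


def chromium_command(url):
    """Build the shell command that opens `url` in Chromium."""
    # shlex.quote wraps the URL in single quotes when it contains
    # characters the shell would treat specially.
    return f"chromium-browser --no-sandbox --start-maximized {shlex.quote(url)}"


def open_url(computer, url):
    """Launch Chromium on the Computer without waiting for it to exit."""
    # background=True mirrors the full script; presumably the call returns
    # while Chromium keeps running.
    computer.exec(chromium_command(url), background=True)
```

Give the browser a few seconds to paint before the next screenshot; the full script sleeps for 3 seconds.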

Step 5 — Build the action loop

Ask the LLM what to do next, then execute the returned action on the computer. Repeat until the model signals done.

The system prompt below instructs the model to return a JSON action object — a minimal tool-call format that does not require function calling support.
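
Models do not always follow a "JSON only" instruction, so it helps to validate the reply before acting on it. The sketch below is one hypothetical way to do that: `parse_action` strips a stray markdown fence if present, parses the JSON, and rejects anything whose `type` is not one of the six actions named in the system prompt.

```python
import json

ALLOWED_TYPES = {"click", "type", "key", "scroll", "screenshot", "done"}
FENCE = "`" * 3  # a markdown code fence, built this way to keep the source readable


def parse_action(reply):
    """Parse the model's reply into an action dict, tolerating stray fences."""
    text = reply.strip()
    if text.startswith(FENCE):
        # Drop the opening fence line (with or without a language tag)
        # and everything from the closing fence onward.
        text = text.split("\n", 1)[1] if "\n" in text else ""
        text = text.rsplit(FENCE, 1)[0]
    action = json.loads(text)
    if not isinstance(action, dict) or action.get("type") not in ALLOWED_TYPES:
        raise ValueError(f"Unexpected action: {action!r}")
    return action
```

A `ValueError` here (which includes JSON decode errors) is a reasonable point to retry the model call or end the run, rather than sending a malformed action to the Computer.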

Step 6 — Execute actions on the Computer

Translate the LLM’s JSON output into Computer API calls. MIOSA coordinates are 0–1000 on both axes regardless of the actual screen resolution.
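
The arithmetic behind the normalized range, as a short sketch. These helpers are hypothetical and run on your side only; MIOSA itself takes the 0–1000 values directly. Converting back to pixels is useful for marking a click on a saved screenshot, and clamping guards against a model that returns a value outside the range.

```python
def to_normalized(px, py, width, height):
    """Convert pixel coordinates to the 0-1000 range."""
    return round(px * 1000 / width), round(py * 1000 / height)


def to_pixels(nx, ny, width, height):
    """Convert 0-1000 coordinates to pixels for a given screen size."""
    return round(nx * width / 1000), round(ny * height / 1000)


def clamp(value, low=0, high=1000):
    """Keep a model-supplied coordinate inside the valid range."""
    return max(low, min(high, int(value)))


# The centre of a 1920x1080 screen is (500, 500) in normalized coordinates.
assert to_normalized(960, 540, 1920, 1080) == (500, 500)
assert to_pixels(500, 500, 1920, 1080) == (960, 540)
```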

Step 7 — Run the full agent loop

Combine the pieces into a loop capped at max_turns to prevent runaway execution.
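
The shape of the capped loop, sketched independently of any SDK. `run_loop` is a hypothetical helper: the caller passes in `decide` (screenshot in, action dict out) and `act` (action in, next screenshot out), and the return value reports whether the run finished or hit the cap.

```python
def run_loop(decide, act, first_screenshot, max_turns=20):
    """Run decide -> act until the model returns "done" or the cap is hit."""
    screenshot = first_screenshot
    for turn in range(1, max_turns + 1):
        action = decide(screenshot)
        if action["type"] == "done":
            return {"status": "done", "turns": turn, "result": action.get("result", "")}
        # Perform the action and capture the state the model sees next.
        screenshot = act(action)
    # The cap was reached without a "done" action: report that explicitly.
    return {"status": "max_turns", "turns": max_turns, "result": ""}
```

Returning a status instead of a bare string lets the caller distinguish a completed task from one that ran out of turns.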

Full script (Python)

#!/usr/bin/env python3
"""browser_agent.py — complete screenshot → action → screenshot loop"""

import os, time, base64, json
import anthropic
from miosa import Miosa

client = Miosa(api_key=os.environ["MIOSA_API_KEY"])
ai = anthropic.Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])

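SYSTEM_PROMPT = """You control a desktop browser. You receive a screenshot and a task.
Respond with ONLY a JSON object that has a "type" field. No explanation, no markdown fences.
Actions:
  {"type": "click", "x": <int>, "y": <int>}
  {"type": "type", "text": <string>}
  {"type": "key", "combo": <string>}
  {"type": "scroll", "x": <int>, "y": <int>, "direction": <string>}
  {"type": "screenshot"}
  {"type": "done", "result": <string>}
Coordinates are 0-1000 on both axes.
"""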


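def next_action(task, png_b64, history):
    # Record the user turn in history (mutated in place). The caller appends the
    # assistant reply, so the conversation always starts with a user message;
    # the Messages API rejects a conversation whose first message is not from the user.
    history.append({
        "role": "user",
        "content": [
            {"type": "image", "source": {"type": "base64", "media_type": "image/png", "data": png_b64}},
            {"type": "text", "text": f"Task: {task}"},
        ],
    })
    msg = ai.messages.create(model="claude-opus-4-5", max_tokens=256, system=SYSTEM_PROMPT, messages=history)
    return json.loads(msg.content[0].text.strip())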


def execute(computer, action):
    t = action["type"]
    if t == "click":     computer.left_click(action["x"], action["y"])
    elif t == "type":    computer.type_text(action["text"])
    elif t == "key":     computer.key(action["combo"])
    elif t == "scroll":  computer.scroll(action["x"], action["y"], action["direction"])
    elif t == "done":    return None
    time.sleep(0.5)
    return computer.screenshot()


def run(task: str, url: str, max_turns: int = 20) -> str:
    computer = client.computers.create(template="ubuntu-desktop")
    computer.start()
    while computer.status != "running":
        time.sleep(2)
        computer = client.computers.get(computer.id)

    computer.exec(f"chromium-browser --no-sandbox --start-maximized {url}", background=True)
    time.sleep(3)

    png = computer.screenshot()
    png_b64 = base64.b64encode(png).decode()
    history = []

    try:
        for turn in range(1, max_turns + 1):
            action = next_action(task, png_b64, history)
            print(f"Turn {turn}: {action}")
            history.append({"role": "assistant", "content": [{"type": "text", "text": json.dumps(action)}]})
            if action["type"] == "done":
                return action.get("result", "")
            nxt = execute(computer, action)
            if nxt is None:
                return ""
            png_b64 = base64.b64encode(nxt).decode()
        return f"Stopped after {max_turns} turns."
    finally:
        computer.stop()


if __name__ == "__main__":
    result = run(
        task="Find the 'More information...' link, click it, report the landing URL.",
        url="https://example.com",
    )
    print(f"\nResult: {result}")

Tips for reliable browser agents

Add time.sleep(0.5) after every action. Browsers animate and reflow, so a screenshot taken immediately after a click often still shows the pre-click state.
Save every screenshot. Screenshot N is the state the model saw when it produced action N, so when an agent gets stuck, the saved screenshots show which state it misread.
Cap max_turns tightly (10–25). A runaway agent loop can exhaust credits fast. Set the ceiling and surface the incomplete-task status cleanly.
Use computer.key("ctrl+r") to reload. If the page looks broken, a reload is often cheaper than restarting the Computer.
Snapshot after login. If the task requires authenticating, snapshot immediately after the login succeeds. Restore from the snapshot for subsequent tasks to skip the login every time.
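
The "save every screenshot" tip can be sketched with a small hypothetical helper that writes each turn's PNG next to the action the model chose for it, so screenshot N and action N can be reviewed side by side. `TurnLog` and the file naming are illustrative choices, not part of the MIOSA SDK.

```python
import json
from pathlib import Path


class TurnLog:
    """Write turn_001.png / turn_001.json pairs into one directory."""

    def __init__(self, directory):
        self.dir = Path(directory)
        self.dir.mkdir(parents=True, exist_ok=True)

    def save(self, turn, png, action):
        stem = f"turn_{turn:03d}"
        png_path = self.dir / f"{stem}.png"
        # The screenshot the model saw on this turn.
        png_path.write_bytes(png)
        # The action it produced from that screenshot.
        (self.dir / f"{stem}.json").write_text(json.dumps(action, indent=2))
        return png_path
```

In the full script, this would be called once per turn, after `next_action` returns and before `execute` runs, with the raw PNG bytes for that turn.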

What you learned

client.computers.create + computer.start() boots a full Linux desktop VM.
computer.screenshot() returns raw PNG bytes at any point.
computer.left_click(x, y), computer.type_text(text), computer.key(combo), and computer.scroll(x, y, direction) are the four action primitives.
Coordinates are 0–1000 normalized — tell your LLM the same range.
The agent loop is yours: MIOSA executes actions, your code decides what to do next.
