Once you have a Computer, drive its desktop with these methods. They suit both AI agents (computer-use models, custom RPA loops) and direct programmatic control.

All methods

| Group | Method | Purpose |
| --- | --- | --- |
| Screen | screenshot() | Capture full desktop as PNG bytes |
| | screenshot_base64() | Same, base64-encoded; ready to pass to an LLM vision API |
| Click | click(x, y) | Generic click; defaults to left button |
| | left_click(x, y) | Left-button click at coordinates |
| | right_click(x, y) | Right-button click (context menu) |
| | double_click(x, y) | Double-click at coordinates |
| Mouse | move_cursor(x, y) | Move cursor without clicking |
| | mouse_down(x, y) | Press and hold the mouse button |
| | mouse_up(x, y) | Release a held mouse button |
| | drag(from_x, from_y, to_x, to_y) | Click-drag between two coordinates |
| Keyboard | type(text) | Type a string at the current focus |
| | key(key) | Press a single key (e.g. "Return", "Escape") |
| | hotkey(*keys) | Simultaneous key combo (e.g. "ctrl", "c") |
| | key_down(key) | Press and hold a key |
| | key_up(key) | Release a held key |
| Scroll | scroll(direction, clicks) | Scroll up/down/left/right by N notches |
| | scroll_up(clicks) | Convenience: scroll up |
| | scroll_down(clicks) | Convenience: scroll down |
| | scroll_left(clicks) | Convenience: scroll left |
| | scroll_right(clicks) | Convenience: scroll right |
| Clipboard | get_clipboard() | Read the current clipboard text |
| | set_clipboard(text) | Write text to the clipboard |
| Screen info | get_screen_size() | Desktop resolution {width, height} |
| | get_cursor_position() | Current cursor {x, y} in normalized coords |
| Windows | windows() | List open windows with IDs, titles, positions |
| | launch(app) | Open an installed application by name |
| | focus_window(id) | Bring a window to the foreground |
| | get_window_size(id) | Read a window’s {width, height} |
| | set_window_size(id, w, h) | Resize a window |
| | get_window_position(id) | Read a window’s {x, y} |
| | set_window_position(id, x, y) | Move a window |
| | maximize_window(id) | Maximize a window |
| | minimize_window(id) | Minimize a window to the taskbar |
| | close_window(id) | Close a window |
| Environment | get_desktop_environment() | Detect DE name and version (e.g. xfce4) |
| | set_wallpaper(path) | Set the desktop background from a file path |
| | get_accessibility_tree() | AT-SPI element tree for structured agent perception |
| Shell | bash(cmd) | Execute a shell command inside the VM |
| | python(code) | Execute a Python snippet inside the VM |
| | write_file(path, content) | Write content to a file path inside the VM |
| | read_file(path) | Read a file path inside the VM |

Coordinate system

Computers accept and report coordinates in normalized 0-1000 space. (0, 0) is top-left, (1000, 1000) is bottom-right, regardless of the actual display resolution.

# Click the visual center of the screen on any display size
computer.left_click(500, 500)

When you capture a screenshot, MIOSA scales it to a fixed 1024-wide thumbnail by default (smartResize). Your agent still reasons about coordinates in the normalized 0-1000 space, not in thumbnail or display pixels; MIOSA translates them to actual display pixels before sending the event to the VM.
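If a vision model reports positions as pixel offsets within the screenshot it was shown, they need converting to the 0-1000 space before being passed to a click method. A minimal sketch of that conversion, assuming the model returns pixel coordinates relative to the thumbnail; pixel_to_normalized is a hypothetical helper, not part of the MIOSA SDK:

```python
def pixel_to_normalized(px, py, image_width, image_height):
    """Map a pixel position inside a screenshot to the 0-1000 normalized space."""
    if image_width <= 0 or image_height <= 0:
        raise ValueError("image dimensions must be positive")
    nx = round(px * 1000 / image_width)
    ny = round(py * 1000 / image_height)
    # Clamp so slightly out-of-range model output still lands on the desktop.
    return max(0, min(1000, nx)), max(0, min(1000, ny))
```

The result can be passed straight to a click method, for example computer.left_click(*pixel_to_normalized(512, 384, 1024, 768)).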

The control loop


Screen

screenshot()

Capture the full desktop as PNG bytes.

png_bytes = computer.screenshot()
# Write to disk
with open("screen.png", "wb") as f:
    f.write(png_bytes)

Screenshots are PNG images, typically 150–300 KB for a 1024-wide thumbnail.
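The thumbnail is described as 1024 wide, but its height is not stated here, so it can be safer to read the real dimensions from the PNG header than to assume them. A sketch using only the standard library; png_dimensions is a hypothetical helper name, and it relies on the PNG format's rule that the IHDR chunk comes first:

```python
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def png_dimensions(png_bytes):
    """Return (width, height) read from a PNG's IHDR chunk."""
    if len(png_bytes) < 24 or png_bytes[:8] != PNG_SIGNATURE:
        raise ValueError("not a PNG image")
    if png_bytes[12:16] != b"IHDR":
        raise ValueError("PNG is missing its IHDR chunk")
    # Width and height are big-endian unsigned 32-bit integers at bytes 16-23.
    width, height = struct.unpack(">II", png_bytes[16:24])
    return width, height
```

With a live computer this would be called as png_dimensions(computer.screenshot()).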

screenshot_base64()

Same as screenshot(), but returns a base64-encoded string. Use this when feeding the image directly to an LLM vision API.

b64 = computer.screenshot_base64()

import anthropic
client = anthropic.Anthropic()
response = client.messages.create(
    model="claude-opus-4-5",
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": [
            {
                "type": "image",
                "source": {"type": "base64", "media_type": "image/png", "data": b64},
            },
            {"type": "text", "text": "What is on screen? Where should I click to open the browser?"},
        ],
    }],
)

Click

click(x, y)

Generic click at coordinates. Defaults to left button.

computer.click(640, 480)

left_click(x, y)

Explicit left-button click.

computer.left_click(640, 480)

right_click(x, y)

Right-button click — opens context menus.

computer.right_click(640, 480)

double_click(x, y)

Double-click — opens files, selects words.

computer.double_click(100, 200)

Mouse

move_cursor(x, y)

Reposition the cursor without triggering a click. Useful before mouse_down / mouse_up sequences or for hover interactions.

computer.move_cursor(500, 400)

mouse_down(x, y) / mouse_up(x, y)

Low-level press and release. Use these when drag is not granular enough — for example, to hover mid-drag or to implement a long-press interaction.

computer.mouse_down(200, 200)
computer.move_cursor(400, 400)   # move while held
computer.mouse_up(400, 400)
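The same sequence can be generalized into a drag that moves through several intermediate points. This is a sketch, not SDK code: interpolate_path and stepped_drag are hypothetical helpers, and computer is assumed to be any object exposing mouse_down, move_cursor, and mouse_up as described above.

```python
def interpolate_path(x0, y0, x1, y1, steps):
    """Return `steps` evenly spaced points from just after the start to the end."""
    if steps < 1:
        raise ValueError("steps must be at least 1")
    return [
        (round(x0 + (x1 - x0) * i / steps), round(y0 + (y1 - y0) * i / steps))
        for i in range(1, steps + 1)
    ]


def stepped_drag(computer, x0, y0, x1, y1, steps=4):
    """Press at the start, move through each waypoint, then release at the end."""
    computer.mouse_down(x0, y0)
    try:
        for x, y in interpolate_path(x0, y0, x1, y1, steps):
            computer.move_cursor(x, y)
    finally:
        # Always release, so a failed move does not leave the button held.
        computer.mouse_up(x1, y1)
```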

drag(from_x, from_y, to_x, to_y)

Click-drag between two points with smooth interpolation. Use for sliders, drag-to-select, re-ordering list items.

computer.drag(200, 200, 600, 400)

Keyboard

type(text)

Type a string as if at a keyboard. Does NOT interpret special key names — use key() for those.

computer.type("hello, world")

key(key)

Press a single key or chord.

computer.key("Return")
computer.key("Escape")
computer.key("ctrl+a")    # select all
computer.key("alt+F4")    # close window

Chords use + as the separator. Key names follow the X11 keysym convention: Return, Tab, BackSpace, Delete, Home, End, Page_Up, Page_Down, Up, Down, Left, Right, F1 through F12, super, ctrl, alt, shift.

hotkey(*keys)

Send multiple keys simultaneously — pass each key as a separate argument.

computer.hotkey("ctrl", "c")    # copy
computer.hotkey("ctrl", "v")    # paste
computer.hotkey("ctrl", "z")    # undo
computer.hotkey("ctrl", "shift", "t")  # reopen tab

key_down(key) / key_up(key)

Low-level key press and release. Use when you need to hold a modifier while performing other actions.

computer.key_down("shift")
computer.left_click(800, 200)   # shift-click to extend selection
computer.key_up("shift")
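If the action performed while a key is held raises an exception, the key would stay pressed inside the VM. One way to guard against that is a context manager that always releases the key. This is a sketch; held_key is a hypothetical helper, and computer is assumed to expose key_down and key_up as described above.

```python
from contextlib import contextmanager


@contextmanager
def held_key(computer, key):
    """Hold `key` for the duration of the with-block, releasing it even on error."""
    computer.key_down(key)
    try:
        yield
    finally:
        computer.key_up(key)
```

Usage would look like: with held_key(computer, "shift"): computer.left_click(800, 200).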

Scroll

scroll(direction, clicks)

Scroll the mouse wheel. direction is one of "up", "down", "left", "right". clicks is the number of notches.

computer.scroll("down", clicks=3)
computer.scroll("up", clicks=5)
computer.scroll("right", clicks=2)

Convenience scrolls

computer.scroll_up(clicks=3)
computer.scroll_down(clicks=3)
computer.scroll_left(clicks=2)
computer.scroll_right(clicks=2)

clicks defaults to 1 for all convenience methods.


Clipboard

get_clipboard()

Read the current clipboard text content.

text = computer.get_clipboard()
print(text)

set_clipboard(text)

Write text to the clipboard. After this call, Ctrl+V inside the VM will paste the text.

computer.set_clipboard("some text to paste")
computer.hotkey("ctrl", "v")

Screen info

get_screen_size()

Returns the desktop resolution as a dict.

size = computer.get_screen_size()
print(size)   # {"width": 1920, "height": 1080}

get_cursor_position()

Returns the current cursor position in normalized 0-1000 coordinates.

pos = computer.get_cursor_position()
print(pos)    # {"x": 512, "y": 300}
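To relate a normalized position to real display pixels (for logging or debugging, say), the two results above can be combined. A sketch, assuming the 0-1000 space maps linearly onto the full desktop as the coordinate system section describes; normalized_to_pixels is a hypothetical helper:

```python
def normalized_to_pixels(position, screen):
    """Convert a normalized {x, y} dict to display pixels using a {width, height} dict."""
    px = round(position["x"] * screen["width"] / 1000)
    py = round(position["y"] * screen["height"] / 1000)
    return px, py
```

With a live computer: normalized_to_pixels(computer.get_cursor_position(), computer.get_screen_size()).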

Windows

windows()

List all open windows. Returns a list of dicts with id, title, x, y, width, height, and focused.

wins = computer.windows()
for w in wins:
    print(w["id"], w["title"], w["focused"])
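Since the result is a plain list of dicts, picking a window is ordinary list filtering. A sketch with two hypothetical helpers that rely only on the title and focused keys listed above; the sample data in the usage note is invented for illustration:

```python
def find_windows(windows, title_contains):
    """Return windows whose title contains the given text, case-insensitively."""
    needle = title_contains.lower()
    return [w for w in windows if needle in w.get("title", "").lower()]


def focused_window(windows):
    """Return the focused window dict, or None if no window reports focus."""
    return next((w for w in windows if w.get("focused")), None)
```

For example, find_windows(computer.windows(), "firefox") would return the matching window dicts, whose id values can then be passed to the window-management methods.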

launch(app)

Open an installed application by name. Available apps depend on the template; miosa-desktop ships with Firefox, xterm, Thunar (file manager), and Mousepad (text editor).

computer.launch("firefox")
computer.launch("xterm")
computer.launch("thunar")   # file manager

focus_window(id) / get/set_window_size(id, ...) / get/set_window_position(id, ...) / maximize_window(id) / minimize_window(id) / close_window(id)

Each of these methods takes a window ID returned by windows().

wins = computer.windows()
w = wins[0]

# Bring to foreground
computer.focus_window(w["id"])

# Read geometry
size = computer.get_window_size(w["id"])
pos  = computer.get_window_position(w["id"])

# Set geometry
computer.set_window_size(w["id"], 1280, 800)
computer.set_window_position(w["id"], 100, 50)

# State changes
computer.maximize_window(w["id"])
computer.minimize_window(w["id"])
computer.close_window(w["id"])

Environment

get_desktop_environment()

Returns the active desktop environment name and version.

de = computer.get_desktop_environment()
print(de)   # {"name": "xfce4", "version": "4.18"}

set_wallpaper(path)

Set the desktop background image from a path inside the VM. Combine with write_file to push a custom image first.

# Push the wallpaper into the VM, then apply it
with open("bg.png", "rb") as f:
    computer.write_file("/tmp/bg.png", f.read())
computer.set_wallpaper("/tmp/bg.png")

get_accessibility_tree()

Returns the AT-SPI accessibility tree as a structured dict. Use this for agent perception when coordinates alone are insufficient — for example, to extract button labels, form field names, or reading order without OCR.

tree = computer.get_accessibility_tree()
# tree is a nested dict of accessible elements
# {"role": "frame", "name": "Firefox", "children": [...]}
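Searching the tree is a recursive walk. A sketch, assuming every node is a dict with the role, name, and children keys shown in the comment above (other keys may exist but are not relied on); iter_nodes and find_by_role are hypothetical helpers:

```python
def iter_nodes(node):
    """Yield a node and all of its descendants, depth first."""
    yield node
    for child in node.get("children", []):
        yield from iter_nodes(child)


def find_by_role(tree, role):
    """Return every element in the tree whose role matches exactly."""
    return [n for n in iter_nodes(tree) if n.get("role") == role]
```

For example, [n["name"] for n in find_by_role(tree, "push button")] would list button labels without OCR, provided the application reports that role name.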

Shell

These methods execute code inside the VM directly — no GUI interaction required.

bash(cmd)

Execute a shell command. Returns stdout as a string.

output = computer.bash("ls /home/user/Desktop")
print(output)

# Install a package
computer.bash("sudo apt-get install -y curl")

# Launch Firefox in background (so the call returns immediately)
computer.bash("firefox &")
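When a command includes values that come from elsewhere (file names, model output), quoting them avoids accidental word splitting or injection in the VM's shell. A sketch using Python's standard shlex module, assuming bash(cmd) hands the string to a POSIX shell; build_command is a hypothetical helper:

```python
import shlex


def build_command(*args):
    """Join arguments into one shell-safe command string."""
    return shlex.join(str(a) for a in args)
```

The result is passed to bash() as usual, for example computer.bash(build_command("ls", "-la", "/home/user/My Files")).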

python(code)

Execute a Python snippet inside the VM. Returns stdout.

result = computer.python("print(1 + 1)")
print(result)   # "2"

# Multi-line
code = """
import json, sys
data = {"status": "ok"}
print(json.dumps(data))
"""
output = computer.python(code)
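Because python() returns stdout as text, printing JSON inside the VM and parsing it on the caller's side is a simple way to get structured results back. A sketch, assuming the snippet prints its JSON on the last non-empty line; parse_json_stdout is a hypothetical helper:

```python
import json


def parse_json_stdout(output):
    """Parse the last non-empty line of captured stdout as JSON."""
    lines = [line for line in output.splitlines() if line.strip()]
    if not lines:
        raise ValueError("no output to parse")
    return json.loads(lines[-1])
```

Taking only the last line lets the snippet print diagnostics first without breaking the parse.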

write_file(path, content)

Write a file into the VM’s filesystem. content can be a string or bytes.

computer.write_file("/home/user/script.py", "print('hello')")
computer.bash("python3 /home/user/script.py")

read_file(path)

Read a file from the VM’s filesystem. Returns the content as a string.

content = computer.read_file("/home/user/.bashrc")
print(content)

Agent reasoning loop

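At the application level, an agent drives the computer through a perception-action cycle: capture a screenshot, send it to the model, parse the action the model chooses, dispatch it to the computer, and repeat.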

Full agent loop example

import anthropic, os
from miosa import Miosa

miosa_client  = Miosa(api_key=os.environ["MIOSA_API_KEY"])
claude_client = anthropic.Anthropic()

computer = miosa_client.computers.create(
    name="browser-agent",
    template="miosa-desktop",
    size="small",
)
computer.start()

# Open the browser
computer.launch("firefox")

# Agent loop
for _ in range(10):
    b64 = computer.screenshot_base64()

    response = claude_client.messages.create(
        model="claude-opus-4-5",
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": [
                {"type": "image", "source": {"type": "base64", "media_type": "image/png", "data": b64}},
                {"type": "text", "text": "Navigate to miosa.ai and take a screenshot of the homepage."},
            ],
        }],
    )

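    # Parse the action from the response and dispatch it.
    # The string matching below is a placeholder: a real agent should ask
    # the model for structured output and read the coordinates from it.
    text = response.content[0].text
    if "left_click" in text:
        computer.left_click(500, 500)   # placeholder coordinates
    elif "type" in text:
        computer.type("https://miosa.ai")
        computer.key("Return")          # submit the URL explicitly

computer.stop()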
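The parsing step is where most of the real work lies. One possible approach, sketched here under the assumption that the model is prompted to reply with a single JSON object such as {"action": "left_click", "x": 512, "y": 40}; the schema, ALLOWED_ACTIONS, parse_action, and dispatch are all hypothetical and not part of the MIOSA SDK:

```python
import json
import re

# Allow-list of method names the model may call, with their required fields.
ALLOWED_ACTIONS = {
    "left_click": ("x", "y"),
    "right_click": ("x", "y"),
    "double_click": ("x", "y"),
    "type": ("text",),
    "key": ("key",),
}


def parse_action(text):
    """Extract a validated action dict from model output, or None if invalid."""
    match = re.search(r"\{.*\}", text, re.DOTALL)
    if match is None:
        return None
    try:
        action = json.loads(match.group(0))
    except json.JSONDecodeError:
        return None
    if not isinstance(action, dict):
        return None
    name = action.get("action")
    if not isinstance(name, str) or name not in ALLOWED_ACTIONS:
        return None
    if not all(field in action for field in ALLOWED_ACTIONS[name]):
        return None
    return action


def dispatch(computer, action):
    """Call the computer method named by a validated action."""
    name = action["action"]
    args = [action[field] for field in ALLOWED_ACTIONS[name]]
    getattr(computer, name)(*args)
```

The allow-list keeps the model from invoking methods such as bash() unless they are added deliberately.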

Latency reference

Desktop methods run synchronously over authenticated HTTP RPC. Typical round-trip from MIOSA edge:

| Method type | Typical latency |
| --- | --- |
| Click, key, scroll, type | 30–80 ms |
| Screenshot | 100–300 ms (compression-bound) |
| Window list | 50–100 ms |
| bash() / python() | 50 ms + command execution time |

For high-frequency agent loops, prefer bash() for multi-step operations over many individual GUI calls.
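Going by the figures above, ten separate GUI calls cost roughly 300–800 ms in round trips alone, while a single bash() call costs about 50 ms plus execution time. A sketch of collapsing several shell steps into one call, assuming bash(cmd) interprets shell operators such as &&; chain_commands is a hypothetical helper:

```python
def chain_commands(commands):
    """Join shell steps with && so they run in one call and stop at the first failure."""
    steps = [c.strip() for c in commands if c.strip()]
    if not steps:
        raise ValueError("at least one command is required")
    return " && ".join(steps)
```

For example, computer.bash(chain_commands(["mkdir -p /tmp/work", "cd /tmp/work", "touch a.txt"])) makes one round trip instead of three.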


Authentication

All desktop methods require the same workspace API key as every other MIOSA resource. Call them server-side. For browser-side direct control (computer-use models running in the browser), mint a scoped token — see Browser Tokens.


See also

Computers — Overview

Pick a template, size, and workspace. Create and start a computer.

Embedding & Streaming

Show the live desktop in a browser iframe using short-lived stream tokens.

API Reference: Desktop

Wire format for all desktop methods: request/response shapes, error codes, rate limits.
