Desktop Control — MIOSA Docs

Once you have a Computer, drive its desktop with these methods. They suit AI agents (computer-use models, custom RPA loops) as well as direct programmatic control.

All 41 methods

Group	Method	Purpose
Screen	`screenshot()`	Capture full desktop as PNG bytes
	`screenshot_base64()`	Same, base64-encoded — ready to pass to an LLM vision API
Click	`click(x, y)`	Generic click — defaults to left button
	`left_click(x, y)`	Left-button click at coordinates
	`right_click(x, y)`	Right-button click (context menu)
	`double_click(x, y)`	Double-click at coordinates
Mouse	`move_cursor(x, y)`	Move cursor without clicking
	`mouse_down(x, y)`	Press and hold the mouse button
	`mouse_up(x, y)`	Release a held mouse button
	`drag(from_x, from_y, to_x, to_y)`	Click-drag between two coordinates
Keyboard	`type(text)`	Type a string at the current focus
	`key(key)`	Press a single key (e.g. `"Return"`, `"Escape"`)
	`hotkey(*keys)`	Simultaneous key combo (e.g. `"ctrl", "c"`)
	`key_down(key)`	Press and hold a key
	`key_up(key)`	Release a held key
Scroll	`scroll(direction, clicks)`	Scroll `up`/`down`/`left`/`right` by N notches
	`scroll_up(clicks)`	Convenience — scroll up
	`scroll_down(clicks)`	Convenience — scroll down
	`scroll_left(clicks)`	Convenience — scroll left
	`scroll_right(clicks)`	Convenience — scroll right
Clipboard	`get_clipboard()`	Read the current clipboard text
	`set_clipboard(text)`	Write text to the clipboard
Screen info	`get_screen_size()`	Desktop resolution `{width, height}`
	`get_cursor_position()`	Current cursor `{x, y}` in normalized coords
Windows	`windows()`	List open windows with IDs, titles, positions
	`launch(app)`	Open an installed application by name
	`focus_window(id)`	Bring a window to the foreground
	`get_window_size(id)`	Read a window’s `{width, height}`
	`set_window_size(id, w, h)`	Resize a window
	`get_window_position(id)`	Read a window’s `{x, y}`
	`set_window_position(id, x, y)`	Move a window
	`maximize_window(id)`	Maximize a window
	`minimize_window(id)`	Minimize a window to the taskbar
	`close_window(id)`	Close a window
Environment	`get_desktop_environment()`	Detect DE name and version (e.g. `xfce4`)
	`set_wallpaper(path)`	Set the desktop background from a file path
	`get_accessibility_tree()`	AT-SPI element tree for structured agent perception
Shell	`bash(cmd)`	Execute a shell command inside the VM
	`python(code)`	Execute a Python snippet inside the VM
	`write_file(path, content)`	Write content to a file path inside the VM
	`read_file(path)`	Read a file path inside the VM

Coordinate system

Computers accept and report coordinates in normalized 0-1000 space. (0, 0) is top-left, (1000, 1000) is bottom-right, regardless of the actual display resolution.

# Click the visual center of the screen on any display size
computer.left_click(500, 500)

When you capture a screenshot, MIOSA scales it to a fixed 1024-wide thumbnail by default (smartResize). Your agent still reasons about coordinates in the normalized 0-1000 space; MIOSA translates them to actual display pixels before sending the event to the VM.
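Some vision models report click targets in pixels of the image they were shown rather than in 0-1000 space. The following is a minimal sketch of the conversion in both directions. The helper names `to_normalized` and `to_pixels` are illustrative and not part of the MIOSA SDK, and the caller is assumed to know the image or display dimensions.

```python
def to_normalized(px: float, py: float, width: int, height: int) -> tuple[int, int]:
    """Map pixel coordinates in a width x height image to 0-1000 space."""
    nx = round(px * 1000 / width)
    ny = round(py * 1000 / height)
    # Clamp so slightly out-of-bounds model output still lands on screen.
    return max(0, min(1000, nx)), max(0, min(1000, ny))


def to_pixels(nx: float, ny: float, width: int, height: int) -> tuple[int, int]:
    """Map 0-1000 coordinates to pixels in a width x height image."""
    return round(nx * width / 1000), round(ny * height / 1000)


# A model points at pixel (512, 288) in a 1024x576 thumbnail.
print(to_normalized(512, 288, 1024, 576))  # (500, 500)
# The same normalized point on a 1920x1080 display.
print(to_pixels(500, 500, 1920, 1080))     # (960, 540)
```

The result of `to_normalized` can be passed straight to `computer.left_click(...)`; `to_pixels` is only needed for local work such as annotating a saved screenshot.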

The control loop

Screen

`screenshot()`

Capture the full desktop as PNG bytes.

png_bytes = computer.screenshot()
# Write to disk
with open("screen.png", "wb") as f:
    f.write(png_bytes)

Screenshots are PNG and typically average 150–300 KB for a 1024-wide thumbnail.
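If you need the actual dimensions of a returned screenshot (for example, to convert model output from thumbnail pixels), they can be read from the PNG header without an imaging library. This is a sketch that relies only on the standard PNG layout; `png_dimensions` is a hypothetical helper, not an SDK function.

```python
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def png_dimensions(data: bytes) -> tuple[int, int]:
    """Return (width, height) read from a PNG's IHDR chunk."""
    if len(data) < 24 or data[:8] != PNG_SIGNATURE:
        raise ValueError("not a PNG image")
    if data[12:16] != b"IHDR":
        raise ValueError("PNG is missing its IHDR chunk")
    # Width and height are big-endian unsigned 32-bit integers at bytes 16-23.
    return struct.unpack(">II", data[16:24])
```

With a running Computer this would be called as `width, height = png_dimensions(computer.screenshot())`.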

`screenshot_base64()`

Same as screenshot(), but returns a base64-encoded string. Use this when feeding the image directly to an LLM vision API.

import anthropic

b64 = computer.screenshot_base64()

client = anthropic.Anthropic()
response = client.messages.create(
    model="claude-opus-4-5",
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": [
            {
                "type": "image",
                "source": {"type": "base64", "media_type": "image/png", "data": b64},
            },
            {"type": "text", "text": "What is on screen? Where should I click to open the browser?"},
        ],
    }],
)

Click

`click(x, y)`

Generic click at coordinates. Defaults to left button.

computer.click(640, 480)

`left_click(x, y)`

Explicit left-button click.

computer.left_click(640, 480)

`right_click(x, y)`

Right-button click — opens context menus.

computer.right_click(640, 480)

`double_click(x, y)`

Double-click — opens files, selects words.

computer.double_click(100, 200)

Mouse

`move_cursor(x, y)`

Reposition the cursor without triggering a click. Useful before mouse_down / mouse_up sequences or for hover interactions.

computer.move_cursor(500, 400)

`mouse_down(x, y)` / `mouse_up(x, y)`

Low-level press and release. Use these when drag is not granular enough — for example, to hover mid-drag or to implement a long-press interaction.

computer.mouse_down(200, 200)
computer.move_cursor(400, 400)   # move while held
computer.mouse_up(400, 400)
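The press, move, release sequence can be wrapped so the cursor passes through several waypoints while the button is held. The sketch below assumes only the three methods documented here; `interpolate` and `stepped_drag` are hypothetical helper names, and `computer` is any object exposing `mouse_down`, `move_cursor`, and `mouse_up`.

```python
def interpolate(start, end, steps):
    """Return `steps` evenly spaced integer points from just after start to end."""
    (x0, y0), (x1, y1) = start, end
    return [
        (round(x0 + (x1 - x0) * i / steps), round(y0 + (y1 - y0) * i / steps))
        for i in range(1, steps + 1)
    ]


def stepped_drag(computer, start, end, steps=4):
    """Drag from start to end, moving the held cursor through waypoints."""
    computer.mouse_down(*start)
    try:
        for x, y in interpolate(start, end, steps):
            computer.move_cursor(x, y)
    finally:
        # Always release the button, even if a move raised.
        computer.mouse_up(*end)


print(interpolate((200, 200), (400, 400), 4))
# [(250, 250), (300, 300), (350, 350), (400, 400)]
```

The `try`/`finally` is the main design point: a button left held after an error can make every later click behave like a drag.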

`drag(from_x, from_y, to_x, to_y)`

Click-drag between two points with smooth interpolation. Use for sliders, drag-to-select, re-ordering list items.

computer.drag(200, 200, 600, 400)

Keyboard

`type(text)`

Type a string as if at a keyboard. Does NOT interpret special key names — use key() for those.

computer.type("hello, world")

`key(key)`

Press a single key or chord.

computer.key("Return")
computer.key("Escape")
computer.key("ctrl+a")    # select all
computer.key("alt+F4")    # close window

Chords use + as the separator. Key names follow the X11 keysym convention (Return, Tab, BackSpace, Delete, Home, End, Page_Up, Page_Down, Up, Down, Left, Right, F1–F12), with super, ctrl, alt, and shift accepted as modifier names.

`hotkey(*keys)`

Send multiple keys simultaneously — pass each key as a separate argument.

computer.hotkey("ctrl", "c")    # copy
computer.hotkey("ctrl", "v")    # paste
computer.hotkey("ctrl", "z")    # undo
computer.hotkey("ctrl", "shift", "t")  # reopen tab
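Since `key()` takes a `+`-separated chord string and `hotkey()` takes separate arguments, a small converter can be handy when a model emits one form and your code prefers the other. This is a sketch with hypothetical helper names; it assumes the two methods are interchangeable for simple combos, which this page does not state explicitly.

```python
def split_chord(chord: str) -> list[str]:
    """Split a '+'-separated chord such as 'ctrl+shift+t' into key names."""
    keys = [part.strip() for part in chord.split("+")]
    if not all(keys):
        # An empty part means a stray or trailing '+'.
        raise ValueError(f"malformed chord: {chord!r}")
    return keys


def join_chord(*keys: str) -> str:
    """Join key names into a '+'-separated chord string."""
    return "+".join(keys)


print(split_chord("ctrl+shift+t"))  # ['ctrl', 'shift', 't']
print(join_chord("ctrl", "c"))      # ctrl+c
```

A call would then look like `computer.hotkey(*split_chord("ctrl+shift+t"))`.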

`key_down(key)` / `key_up(key)`

Low-level key press and release. Use when you need to hold a modifier while performing other actions.

computer.key_down("shift")
computer.left_click(800, 200)   # shift-click to extend selection
computer.key_up("shift")
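A held key that is never released tends to corrupt every later action, so it can be worth pairing `key_down` and `key_up` in a context manager. The following sketch uses only the two methods documented above; `held_key` is a hypothetical helper, not part of the SDK.

```python
from contextlib import contextmanager


@contextmanager
def held_key(computer, key):
    """Hold `key` for the duration of the with-block, then release it."""
    computer.key_down(key)
    try:
        yield
    finally:
        # Runs on normal exit and on exceptions, so the key is never left down.
        computer.key_up(key)


# Usage with a real Computer instance:
# with held_key(computer, "shift"):
#     computer.left_click(800, 200)
```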

Scroll

`scroll(direction, clicks)`

Scroll the mouse wheel. direction is one of "up", "down", "left", "right". clicks is the number of notches.

computer.scroll("down", clicks=3)
computer.scroll("up", clicks=5)
computer.scroll("right", clicks=2)

Convenience scrolls

computer.scroll_up(clicks=3)
computer.scroll_down(clicks=3)
computer.scroll_left(clicks=2)
computer.scroll_right(clicks=2)

clicks defaults to 1 for all convenience methods.

Clipboard

`get_clipboard()`

Read the current clipboard text content.

text = computer.get_clipboard()
print(text)

`set_clipboard(text)`

Write text to the clipboard. After this call, Ctrl+V inside the VM will paste the text.

computer.set_clipboard("some text to paste")
computer.hotkey("ctrl", "v")
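The set-then-paste pattern can be wrapped so that the previous clipboard contents are put back afterwards. This is a minimal sketch using the three documented methods; `paste_text` and its `restore` flag are hypothetical. If the target application handles the paste asynchronously, restoring immediately could race with it, so treat `restore=True` as an assumption to verify in your environment.

```python
def paste_text(computer, text, restore=True):
    """Paste `text` at the current focus via the clipboard."""
    previous = computer.get_clipboard() if restore else None
    computer.set_clipboard(text)
    computer.hotkey("ctrl", "v")
    if restore and previous is not None:
        # Put back whatever was on the clipboard before the paste.
        computer.set_clipboard(previous)
```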

Screen info

`get_screen_size()`

Returns the desktop resolution as a dict.

size = computer.get_screen_size()
print(size)   # {"width": 1920, "height": 1080}

`get_cursor_position()`

Returns the current cursor position in normalized 0-1000 coordinates.

pos = computer.get_cursor_position()
print(pos)    # {"x": 512, "y": 300}

Windows

`windows()`

List all open windows. Returns a list of dicts with id, title, x, y, width, height, and focused.

wins = computer.windows()
for w in wins:
    print(w["id"], w["title"], w["focused"])
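Agents usually want one particular window rather than the whole list. A sketch of a title lookup over the returned dicts follows; `find_window` is a hypothetical helper, and the sample IDs and titles are made up for illustration.

```python
def find_window(wins, title_substring):
    """Return the first window whose title contains the substring, else None."""
    needle = title_substring.lower()
    for w in wins:
        if needle in w["title"].lower():
            return w
    return None


# Sample data shaped like the windows() result described above.
sample = [
    {"id": 1, "title": "Terminal - xterm", "focused": False},
    {"id": 2, "title": "Mozilla Firefox", "focused": True},
]
print(find_window(sample, "firefox")["id"])  # 2
```

Against a live desktop the call would be `find_window(computer.windows(), "firefox")`, with a `None` check before using the result.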

`launch(app)`

Open an installed application by name. Available apps depend on the template; miosa-desktop ships with Firefox, xterm, Thunar (file manager), and Mousepad (text editor).

computer.launch("firefox")
computer.launch("xterm")
computer.launch("thunar")   # file manager

`focus_window(id)` / `get/set_window_size(id, ...)` / `get/set_window_position(id, ...)` / `maximize_window(id)` / `minimize_window(id)` / `close_window(id)`

Full window management via window ID returned by windows().

wins = computer.windows()
w = wins[0]

# Bring to foreground
computer.focus_window(w["id"])

# Read geometry
size = computer.get_window_size(w["id"])
pos  = computer.get_window_position(w["id"])

# Set geometry
computer.set_window_size(w["id"], 1280, 800)
computer.set_window_position(w["id"], 100, 50)

# State changes
computer.maximize_window(w["id"])
computer.minimize_window(w["id"])
computer.close_window(w["id"])

Environment

`get_desktop_environment()`

Returns the active desktop environment name and version.

de = computer.get_desktop_environment()
print(de)   # {"name": "xfce4", "version": "4.18"}

`set_wallpaper(path)`

Set the desktop background image from a path inside the VM. Combine with write_file to push a custom image first.

# Push wallpaper then apply it
with open("bg.png", "rb") as f:
    computer.write_file("/tmp/bg.png", f.read())
computer.set_wallpaper("/tmp/bg.png")

`get_accessibility_tree()`

Returns the AT-SPI accessibility tree as a structured dict. Use this for agent perception when coordinates alone are insufficient — for example, to extract button labels, form field names, or reading order without OCR.

tree = computer.get_accessibility_tree()
# tree is a nested dict of accessible elements
# {"role": "frame", "name": "Firefox", "children": [...]}
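Because the tree is a nested dict, a short recursive walk is enough to pull out elements of interest. The sketch below assumes only the `role`, `name`, and `children` keys shown in the example; the role strings and the sample tree are made up, and real trees may carry additional fields.

```python
def find_by_role(node, role):
    """Yield every element in the tree whose role matches, depth-first."""
    if node.get("role") == role:
        yield node
    for child in node.get("children", []):
        yield from find_by_role(child, role)


# Made-up tree following the role/name/children shape shown above.
tree = {
    "role": "frame",
    "name": "Firefox",
    "children": [
        {"role": "push button", "name": "Back", "children": []},
        {"role": "panel", "name": "", "children": [
            {"role": "push button", "name": "Reload", "children": []},
        ]},
    ],
}
print([n["name"] for n in find_by_role(tree, "push button")])  # ['Back', 'Reload']
```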

Shell

These methods execute code inside the VM directly — no GUI interaction required.

`bash(cmd)`

Execute a shell command. Returns stdout as a string.

output = computer.bash("ls /home/user/Desktop")
print(output)

# Install a package
computer.bash("sudo apt-get install -y curl")

# Launch Firefox in background (so the call returns immediately)
computer.bash("firefox &")
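When part of a command comes from a variable (a file name, or text produced by a model), quoting it avoids breakage on spaces and shell metacharacters. A minimal sketch using the standard library's `shlex`, assuming `bash(cmd)` hands the string to a POSIX shell; `shell_command` is a hypothetical helper.

```python
import shlex


def shell_command(*args: str) -> str:
    """Build a single shell command string with each argument safely quoted."""
    return shlex.join(args)


print(shell_command("ls", "-l", "/home/user/My Files"))
# ls -l '/home/user/My Files'
```

The result would be passed as `computer.bash(shell_command("ls", "-l", path))`.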

`python(code)`

Execute a Python snippet inside the VM. Returns stdout.

result = computer.python("print(1 + 1)")
print(result)   # "2"

# Multi-line
code = """
import json, sys
data = {"status": "ok"}
print(json.dumps(data))
"""
output = computer.python(code)
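Since `python()` returns stdout as text, printing JSON inside the VM and decoding it on the caller's side is a simple way to get structured data back. The sketch below decodes the last non-empty line, so earlier debug prints do not break parsing; `parse_json_output` is a hypothetical helper.

```python
import json


def parse_json_output(stdout: str):
    """Decode the last non-empty line of stdout as JSON."""
    lines = [line for line in stdout.splitlines() if line.strip()]
    if not lines:
        raise ValueError("no output to parse")
    return json.loads(lines[-1])


print(parse_json_output('starting\n{"status": "ok"}\n'))  # {'status': 'ok'}
```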

`write_file(path, content)`

Write a file into the VM’s filesystem. content can be a string or bytes.

computer.write_file("/home/user/script.py", "print('hello')")
computer.bash("python3 /home/user/script.py")

`read_file(path)`

Read a file from the VM’s filesystem. Returns the content as a string.

content = computer.read_file("/home/user/.bashrc")
print(content)

Agent reasoning loop

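At the application level, an agent drives the computer through a perception-action cycle: capture a screenshot, send it to the model, parse an action from the reply, dispatch it, and repeat.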

Full agent loop example

import anthropic, os
from miosa import Miosa

miosa_client  = Miosa(api_key=os.environ["MIOSA_API_KEY"])
claude_client = anthropic.Anthropic()

computer = miosa_client.computers.create(
    name="browser-agent",
    template="miosa-desktop",
    size="small",
)
computer.start()

# Open the browser
computer.launch("firefox")

# Agent loop; try/finally stops the computer even if a call raises
try:
    for _ in range(10):
        b64 = computer.screenshot_base64()

        response = claude_client.messages.create(
            model="claude-opus-4-5",
            max_tokens=1024,
            messages=[{
                "role": "user",
                "content": [
                    {"type": "image", "source": {"type": "base64", "media_type": "image/png", "data": b64}},
                    {"type": "text", "text": "Navigate to miosa.ai and take a screenshot of the homepage."},
                ],
            }],
        )

        # Parse action from response and dispatch.
        # Substring matching and the fixed coordinates are placeholders.
        text = response.content[0].text
        if "left_click" in text:
            # extract x, y from model output and act
            computer.left_click(500, 500)
        elif "type" in text:
            # type() does not interpret key names, so press Return separately
            computer.type("https://miosa.ai")
            computer.key("Return")
finally:
    computer.stop()
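The dispatch step in the loop matches on substrings and uses fixed coordinates. One way to make it concrete is to ask the model to reply with a call-like string and parse that. The sketch below assumes a reply format such as `left_click(500, 500)` or `type("hello")`; that format, the regexes, and the helper names are illustrative choices, not something MIOSA or the model enforces.

```python
import re

# Assumed reply format, e.g. 'left_click(500, 500)' or 'type("hello")'.
CLICK_RE = re.compile(r"\b(left_click|right_click|double_click)\((\d+),\s*(\d+)\)")
TYPE_RE = re.compile(r'\btype\("([^"]*)"\)')


def parse_action(text):
    """Return (method_name, args) for the first action found, else None."""
    click = CLICK_RE.search(text)
    typed = TYPE_RE.search(text)
    candidates = [m for m in (click, typed) if m]
    if not candidates:
        return None
    # Prefer whichever action appears first in the reply.
    first = min(candidates, key=lambda m: m.start())
    if first is click:
        return first.group(1), (int(first.group(2)), int(first.group(3)))
    return "type", (first.group(1),)


def dispatch(computer, action):
    """Call the named method on the computer with the parsed arguments."""
    name, args = action
    getattr(computer, name)(*args)


print(parse_action("I will left_click(512, 300) on the address bar"))
# ('left_click', (512, 300))
```

Only method names matched by the regexes can reach `getattr`, which keeps model output from invoking arbitrary methods such as `bash`.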

Latency reference

Desktop methods run synchronously over authenticated HTTP RPC. Typical round-trip from MIOSA edge:

Method type	Typical latency
Click, key, scroll, type	30–80 ms
Screenshot	100–300 ms (compression-bound)
Window list	50–100 ms
`bash()` / `python()`	50 ms + command execution time

For high-frequency agent loops, prefer bash() for multi-step operations over many individual GUI calls.
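Each call pays the round-trip overhead from the table above, so several shell steps sent as one `bash()` call cost roughly one round trip instead of one per step. A minimal sketch of joining steps follows; `batch` is a hypothetical helper, and the paths are placeholders.

```python
def batch(commands):
    """Join shell commands so they run in order and stop at the first failure."""
    return " && ".join(commands)


script = batch([
    "mkdir -p /tmp/work",
    "cd /tmp/work",
    "echo hello > note.txt",
])
print(script)
# mkdir -p /tmp/work && cd /tmp/work && echo hello > note.txt
```

`computer.bash(script)` then replaces three separate calls. Using `&&` means a failed step skips the rest, which is usually what a multi-step setup wants.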

Authentication

All desktop methods require the same workspace API key as every other MIOSA resource. Call them server-side. For browser-side direct control (computer-use models running in the browser), mint a scoped token — see Browser Tokens.