Once you have a Computer, drive its desktop with these methods. They are suitable for AI agents (computer-use models, custom RPA loops) and for direct programmatic control.
All 28 methods
| Group | Method | Purpose |
|---|---|---|
| Screen | screenshot() | Capture full desktop as PNG bytes |
| | screenshot_base64() | Same, base64-encoded; ready to pass to an LLM vision API |
| Click | click(x, y) | Generic click; defaults to left button |
| | left_click(x, y) | Left-button click at coordinates |
| | right_click(x, y) | Right-button click (context menu) |
| | double_click(x, y) | Double-click at coordinates |
| Mouse | move_cursor(x, y) | Move cursor without clicking |
| | mouse_down(x, y) | Press and hold the mouse button |
| | mouse_up(x, y) | Release a held mouse button |
| | drag(from_x, from_y, to_x, to_y) | Click-drag between two coordinates |
| Keyboard | type(text) | Type a string at the current focus |
| | key(key) | Press a single key (e.g. "Return", "Escape") |
| | hotkey(*keys) | Simultaneous key combo (e.g. "ctrl", "c") |
| | key_down(key) | Press and hold a key |
| | key_up(key) | Release a held key |
| Scroll | scroll(direction, clicks) | Scroll up/down/left/right by N notches |
| | scroll_up(clicks) | Convenience: scroll up |
| | scroll_down(clicks) | Convenience: scroll down |
| | scroll_left(clicks) | Convenience: scroll left |
| | scroll_right(clicks) | Convenience: scroll right |
| Clipboard | get_clipboard() | Read the current clipboard text |
| | set_clipboard(text) | Write text to the clipboard |
| Screen info | get_screen_size() | Desktop resolution {width, height} |
| | get_cursor_position() | Current cursor {x, y} in normalized coords |
| Windows | windows() | List open windows with IDs, titles, positions |
| | launch(app) | Open an installed application by name |
| | focus_window(id) | Bring a window to the foreground |
| | get_window_size(id) | Read a window’s {width, height} |
| | set_window_size(id, w, h) | Resize a window |
| | get_window_position(id) | Read a window’s {x, y} |
| | set_window_position(id, x, y) | Move a window |
| | maximize_window(id) | Maximize a window |
| | minimize_window(id) | Minimize a window to the taskbar |
| | close_window(id) | Close a window |
| Environment | get_desktop_environment() | Detect DE name and version (e.g. xfce4) |
| | set_wallpaper(path) | Set the desktop background from a file path |
| | get_accessibility_tree() | AT-SPI element tree for structured agent perception |
| Shell | bash(cmd) | Execute a shell command inside the VM |
| | python(code) | Execute a Python snippet inside the VM |
| | write_file(path, content) | Write content to a file path inside the VM |
| | read_file(path) | Read a file path inside the VM |
Coordinate system
Computers accept and report coordinates in normalized 0-1000 space. (0, 0) is top-left, (1000, 1000) is bottom-right, regardless of the actual display resolution.
# Click the visual center of the screen on any display size
computer.left_click(500, 500)
When you capture a screenshot, MIOSA scales it to a fixed 1024-wide thumbnail by default (smartResize). Your agent reasons about coordinates in the normalized 0-1000 space; MIOSA translates them to actual display pixels before sending the event to the VM.
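The mapping between normalized and pixel coordinates is plain linear scaling. The sketch below is an illustration only: `to_normalized` and `to_pixels` are hypothetical helpers, not SDK functions, and the rounding and clamping MIOSA applies internally are assumptions. A conversion like this can help when a vision model reports positions in thumbnail pixels and they must be expressed in the 0-1000 space before a click.

```python
def to_normalized(px: float, py: float, width: int, height: int) -> tuple[int, int]:
    """Map pixel coordinates in a width x height image to the 0-1000 space."""
    nx = round(px / width * 1000)
    ny = round(py / height * 1000)
    # Clamp so slightly out-of-range model output still lands on screen.
    return max(0, min(1000, nx)), max(0, min(1000, ny))


def to_pixels(nx: float, ny: float, width: int, height: int) -> tuple[int, int]:
    """Map 0-1000 normalized coordinates to pixels on a width x height display."""
    x = round(nx / 1000 * width)
    y = round(ny / 1000 * height)
    # Keep the result inside the valid pixel range 0..width-1, 0..height-1.
    return min(width - 1, x), min(height - 1, y)


# The center of a 1024x576 thumbnail maps to (500, 500) in normalized space,
# which corresponds to pixel (960, 540) on a 1920x1080 display.
center = to_normalized(512, 288, 1024, 576)
pixel = to_pixels(*center, 1920, 1080)
```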
The control loop
Screen
screenshot()
Capture the full desktop as PNG bytes.
png_bytes = computer.screenshot()
# Write to disk
with open("screen.png", "wb") as f:
    f.write(png_bytes)
Screenshots are PNG and average 150–300 KB for a 1024-wide thumbnail.
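Before forwarding a capture to a model, it can be useful to confirm that the bytes really are a PNG and to read the image dimensions. The following is a minimal sketch using only the standard library. `png_dimensions` is a hypothetical helper, not part of the SDK; it relies on the standard PNG layout (an 8-byte signature followed by the IHDR chunk), and the sample bytes stand in for a real screenshot() result.

```python
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def png_dimensions(data: bytes) -> tuple[int, int]:
    """Return (width, height) read from the IHDR chunk of a PNG byte string."""
    if len(data) < 24 or data[:8] != PNG_SIGNATURE:
        raise ValueError("not a PNG image")
    if data[12:16] != b"IHDR":
        raise ValueError("missing IHDR chunk")
    # Width and height are big-endian 32-bit integers at byte offsets 16 and 20.
    width, height = struct.unpack(">II", data[16:24])
    return width, height


# Stand-in for computer.screenshot(): just enough of a PNG header to parse.
sample = PNG_SIGNATURE + struct.pack(">I4sII", 13, b"IHDR", 1024, 576)
size = png_dimensions(sample)
```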
screenshot_base64()
Same as screenshot(), but returns a base64-encoded string. Use this when feeding the image directly to an LLM vision API.
import anthropic
b64 = computer.screenshot_base64()
client = anthropic.Anthropic()
response = client.messages.create(
    model="claude-opus-4-5",
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": [
            {
                "type": "image",
                "source": {"type": "base64", "media_type": "image/png", "data": b64},
            },
            {"type": "text", "text": "What is on screen? Where should I click to open the browser?"},
        ],
    }],
)
Click
click(x, y)
Generic click at coordinates. Defaults to left button.
computer.click(640, 480)
left_click(x, y)
Explicit left-button click.
computer.left_click(640, 480)
right_click(x, y)
Right-button click; opens context menus.
computer.right_click(640, 480)
double_click(x, y)
Double-click; opens files and selects words.
computer.double_click(100, 200)
Mouse
move_cursor(x, y)
Reposition the cursor without triggering a click. Useful before mouse_down / mouse_up sequences or for hover interactions.
computer.move_cursor(500, 400)
mouse_down(x, y) / mouse_up(x, y)
Low-level press and release. Use these when drag is not granular enough, for example to hover mid-drag or to implement a long-press interaction.
computer.mouse_down(200, 200)
computer.move_cursor(400, 400)  # move while held
computer.mouse_up(400, 400)
drag(from_x, from_y, to_x, to_y)
Click-drag between two points with smooth interpolation. Use for sliders, drag-to-select, and re-ordering list items.
computer.drag(200, 200, 600, 400)
Keyboard
type(text)
Type a string as if at a keyboard. Does NOT interpret special key names; use key() for those.
computer.type("hello, world")
key(key)
Press a single key or chord.
computer.key("Return")
computer.key("Escape")
computer.key("ctrl+a")  # select all
computer.key("alt+F4")  # close window
Chords use + as the separator. Key names follow the X11 keysym convention: Return, Tab, BackSpace, Delete, Home, End, Page_Up, Page_Down, Up, Down, Left, Right, F1–F12. Modifiers are written super, ctrl, alt, shift.
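Because an agent's output often feeds straight into key(), checking chord strings first can catch malformed names early. The sketch below validates a chord against the names listed above plus Escape, which appears in the example. `validate_chord` is a hypothetical helper, and treating any single character as a valid key is an assumption rather than documented behavior.

```python
MODIFIERS = {"super", "ctrl", "alt", "shift"}
NAMED_KEYS = {
    "Return", "Tab", "BackSpace", "Delete", "Home", "End", "Escape",
    "Page_Up", "Page_Down", "Up", "Down", "Left", "Right",
} | {f"F{n}" for n in range(1, 13)}


def validate_chord(chord: str) -> list[str]:
    """Split a chord such as "ctrl+shift+t" and check each part."""
    parts = chord.split("+")
    for part in parts:
        known = part in MODIFIERS or part in NAMED_KEYS
        # Assumption: single printable characters ("a", "4") are valid keys.
        if not known and len(part) != 1:
            raise ValueError(f"unknown key name: {part!r}")
    return parts


select_all = validate_chord("ctrl+a")
close_window = validate_chord("alt+F4")
```

A name such as "enter" is rejected here because the listed keysym is Return.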
hotkey(*keys)
Send multiple keys simultaneously — pass each key as a separate argument.
computer.hotkey("ctrl", "c")  # copy
computer.hotkey("ctrl", "v")  # paste
computer.hotkey("ctrl", "z")  # undo
computer.hotkey("ctrl", "shift", "t")  # reopen tab
key_down(key) / key_up(key)
Low-level key press and release. Use when you need to hold a modifier while performing other actions.
computer.key_down("shift")
computer.left_click(800, 200)  # shift-click to extend selection
computer.key_up("shift")
Scroll
scroll(direction, clicks)
Scroll the mouse wheel. direction is one of "up", "down", "left", "right". clicks is the number of notches.
computer.scroll("down", clicks=3)
computer.scroll("up", clicks=5)
computer.scroll("right", clicks=2)
Convenience scrolls
computer.scroll_up(clicks=3)
computer.scroll_down(clicks=3)
computer.scroll_left(clicks=2)
computer.scroll_right(clicks=2)
clicks defaults to 1 for all convenience methods.
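Agents often produce a signed scroll amount rather than a direction name. The sketch below translates one into the other. `scroll_by` is a hypothetical wrapper, and `FakeComputer` is a stand-in that only records calls so the example runs without a VM; with a real computer, the same helper would call scroll() as documented above.

```python
class FakeComputer:
    """Stand-in that records scroll calls instead of contacting a VM."""

    def __init__(self):
        self.calls = []

    def scroll(self, direction, clicks=1):
        self.calls.append((direction, clicks))


def scroll_by(computer, dx=0, dy=0):
    """Scroll by signed notch counts: positive dy is down, positive dx is right."""
    if dy:
        computer.scroll("down" if dy > 0 else "up", clicks=abs(dy))
    if dx:
        computer.scroll("right" if dx > 0 else "left", clicks=abs(dx))


computer = FakeComputer()
scroll_by(computer, dy=3)          # same as computer.scroll("down", clicks=3)
scroll_by(computer, dx=-2, dy=-5)  # scrolls up 5 notches, then left 2
```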
Clipboard
get_clipboard()
Read the current clipboard text content.
text = computer.get_clipboard()
print(text)
set_clipboard(text)
Write text to the clipboard. After this call, Ctrl+V inside the VM will paste the text.
computer.set_clipboard("some text to paste")
computer.hotkey("ctrl", "v")
Screen info
get_screen_size()
Returns the desktop resolution as a dict.
size = computer.get_screen_size()
print(size)  # {"width": 1920, "height": 1080}
get_cursor_position()
Returns the current cursor position in normalized 0-1000 coordinates.
pos = computer.get_cursor_position()
print(pos)  # {"x": 512, "y": 300}
Windows
windows()
List all open windows. Returns a list of dicts with id, title, x, y, width, height, and focused.
wins = computer.windows()
for w in wins:
    print(w["id"], w["title"], w["focused"])
launch(app)
Open an installed application by name. Available apps depend on the template; miosa-desktop ships with Firefox, xterm, Thunar (file manager), and Mousepad (text editor).
computer.launch("firefox")
computer.launch("xterm")
computer.launch("thunar")  # file manager
focus_window(id) / get/set_window_size(id, ...) / get/set_window_position(id, ...) / maximize_window(id) / minimize_window(id) / close_window(id)
Full window management via window ID returned by windows().
wins = computer.windows()
w = wins[0]
# Bring to foreground
computer.focus_window(w["id"])
# Read geometry
size = computer.get_window_size(w["id"])
pos = computer.get_window_position(w["id"])
# Set geometry
computer.set_window_size(w["id"], 1280, 800)
computer.set_window_position(w["id"], 100, 50)
# State changes
computer.maximize_window(w["id"])
computer.minimize_window(w["id"])
computer.close_window(w["id"])
Environment
get_desktop_environment()
Returns the active desktop environment name and version.
de = computer.get_desktop_environment()
print(de)  # {"name": "xfce4", "version": "4.18"}
set_wallpaper(path)
Set the desktop background image from a path inside the VM. Combine with write_file to push a custom image first.
# Push wallpaper then apply it
with open("bg.png", "rb") as f:
    computer.write_file("/tmp/bg.png", f.read())
computer.set_wallpaper("/tmp/bg.png")
get_accessibility_tree()
Returns the AT-SPI accessibility tree as a structured dict. Use this for agent perception when coordinates alone are insufficient, for example to extract button labels, form field names, or reading order without OCR.
tree = computer.get_accessibility_tree()
# tree is a nested dict of accessible elements
# {"role": "frame", "name": "Firefox", "children": [...]}
Shell
These methods execute code inside the VM directly — no GUI interaction required.
bash(cmd)
Execute a shell command. Returns stdout as a string.
output = computer.bash("ls /home/user/Desktop")
print(output)
# Install a package
computer.bash("sudo apt-get install -y curl")
# Launch Firefox in background (so the call returns immediately)
computer.bash("firefox &")
python(code)
Execute a Python snippet inside the VM. Returns stdout.
result = computer.python("print(1 + 1)")
print(result)  # "2"
# Multi-line
code = """
import json, sys
data = {"status": "ok"}
print(json.dumps(data))
"""
output = computer.python(code)
write_file(path, content)
Write a file into the VM’s filesystem. content can be a string or bytes.
computer.write_file("/home/user/script.py", "print('hello')")
computer.bash("python3 /home/user/script.py")
read_file(path)
Read a file from the VM’s filesystem. Returns the content as a string.
content = computer.read_file("/home/user/.bashrc")
print(content)
Agent reasoning loop
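At the application level, an agent drives the computer through a perception-action cycle: capture a screenshot, ask the model for the next action, dispatch that action to the computer, and repeat until the task is done.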
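The dispatch step of such a cycle is easier to keep safe when the model is asked to reply with structured output. The sketch below assumes a hypothetical convention in which the model answers with a JSON object such as {"action": "left_click", "args": [500, 500]}. That reply format, the `ALLOWED` set, and the `dispatch` helper are illustrative and not part of MIOSA; `FakeComputer` stands in for a real computer so the block runs offline.

```python
import json

# Only these method names may be invoked from model output.
ALLOWED = {"left_click", "right_click", "double_click", "type", "key", "scroll"}


class FakeComputer:
    """Stand-in that records calls instead of contacting a VM."""

    def __init__(self):
        self.calls = []

    def __getattr__(self, name):
        # Invoked only for attributes that do not exist, i.e. method names.
        def method(*args):
            self.calls.append((name, args))
        return method


def dispatch(computer, reply: str) -> bool:
    """Parse one JSON action and run it. Returns False if it is rejected."""
    try:
        action = json.loads(reply)
        name, args = action["action"], action.get("args", [])
    except (json.JSONDecodeError, KeyError, TypeError, AttributeError):
        return False
    if name not in ALLOWED or not isinstance(args, list):
        return False
    getattr(computer, name)(*args)
    return True


computer = FakeComputer()
ok = dispatch(computer, '{"action": "left_click", "args": [500, 500]}')
rejected = dispatch(computer, '{"action": "bash", "args": ["rm -rf /"]}')
```

Restricting dispatch to an allowlist keeps shell access out of reach of model output unless it is granted deliberately.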
Full agent loop example
import os
import anthropic
from miosa import Miosa
miosa_client = Miosa(api_key=os.environ["MIOSA_API_KEY"])
claude_client = anthropic.Anthropic()
computer = miosa_client.computers.create(
    name="browser-agent",
    template="miosa-desktop",
    size="small",
)
computer.start()
# Open the browser
computer.launch("firefox")
# Agent loop
for _ in range(10):
    b64 = computer.screenshot_base64()
    response = claude_client.messages.create(
        model="claude-opus-4-5",
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": [
                {"type": "image", "source": {"type": "base64", "media_type": "image/png", "data": b64}},
                {"type": "text", "text": "Navigate to miosa.ai and take a screenshot of the homepage."},
            ],
        }],
    )
    # Parse action from response and dispatch
    text = response.content[0].text
    if "left_click" in text:
        # extract x, y from model output and act
        computer.left_click(500, 500)
    elif "type" in text:
        # type() does not interpret key names, so press Return separately
        computer.type("https://miosa.ai")
        computer.key("Return")
computer.stop()
Latency reference
Desktop methods run synchronously over authenticated HTTP RPC. Typical round-trip from MIOSA edge:
| Method type | Typical latency |
|---|---|
| Click, key, scroll, type | 30–80 ms |
| Screenshot | 100–300 ms (compression-bound) |
| Window list | 50–100 ms |
| bash() / python() | 50 ms + command execution time |
For high-frequency agent loops, prefer bash() for multi-step operations over many individual GUI calls.
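As a rough illustration of that advice, the fixed per-call overhead from the table can be compared directly. The figures below are assumptions taken from the table (a midpoint for GUI calls, the stated base cost for bash()), not measurements, and `batch` is a hypothetical helper that joins commands so they run in a single bash() round trip.

```python
import shlex

GUI_CALL_MS = 55   # assumed midpoint of the 30–80 ms range for click/key/type
BASH_BASE_MS = 50  # fixed overhead listed for bash(), excluding execution time


def batch(commands: list[str]) -> str:
    """Join shell commands with && so that one failure stops the rest."""
    return " && ".join(commands)


def overhead_ms(gui_calls: int) -> tuple[int, int]:
    """Return (overhead of N GUI calls, overhead of one bash call) in ms."""
    return gui_calls * GUI_CALL_MS, BASH_BASE_MS


path = "/home/user/My Notes"
script = batch([
    f"mkdir -p {shlex.quote(path)}",  # quote paths that may contain spaces
    f"cd {shlex.quote(path)}",
    "touch todo.txt",
])
gui_cost, bash_cost = overhead_ms(20)
```

The resulting string would then be passed to the VM in one call, as in computer.bash(script).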
Authentication
All desktop methods require the same workspace API key as every other MIOSA resource. Call them server-side. For browser-side direct control (computer-use models running in the browser), mint a scoped token — see Browser Tokens.
See also
Pick a template, size, and workspace. Create and start a computer.
Show the live desktop in a browser iframe using short-lived stream tokens.
Wire format for all 28 methods: request/response shapes, error codes, rate limits.