desktop-computer-automation
Vision-driven desktop automation for native apps using natural language commands and screenshots.
- Controls macOS, Windows, and Linux desktops entirely from visual input; no DOM or accessibility labels required
- Operates synchronously with a screenshot-analyze-act loop: connect, observe screen state, execute high-level actions via natural language prompts, then disconnect
- Requires a vision-capable AI model (Gemini, Qwen, Doubao, or similar) configured via environment variables; supports multiple model providers and OpenRouter
- Takes over the user's mouse and keyboard during execution; best suited for desktop-native apps (Electron, Qt, native UIs) that cannot run headless; web apps should use Browser Automation instead
Desktop Computer Automation
CRITICAL RULES — VIOLATIONS WILL BREAK THE WORKFLOW:
- Never run midscene commands in the background. Each command must run synchronously so you can read its output (especially screenshots) before deciding the next action. Background execution breaks the screenshot-analyze-act loop.
- Run only one midscene command at a time. Wait for the previous command to finish, read the screenshot, then decide the next action. Never chain multiple commands together.
- Allow enough time for each command to complete. Midscene commands involve AI inference and screen interaction, which can take longer than typical shell commands. A typical command needs about 1 minute; complex act commands may need even longer.
- Always report task results before finishing. After completing the automation task, you MUST proactively summarize the results to the user — including key data found, actions completed, screenshots taken, and any relevant findings. Never silently end after the last automation step; the user expects a complete response in a single interaction.
- Only minimize windows, never close them unless explicitly asked. When you need to dismiss or get a window out of the way, minimize it instead of closing it. Do not close any app or window unless the user explicitly asks you to do so.
Control your desktop (macOS, Windows, Linux) using npx -y @midscene/computer@1. Each CLI command maps directly to an MCP tool — you (the AI agent) act as the brain, deciding which actions to take based on screenshots.
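In practice this means running one foreground command, reading its screenshot output, and only then issuing the next one. The sketch below is illustrative only: `connect`, `screenshot`, `act`, and `disconnect` are assumed subcommand names inferred from the workflow described above, not confirmed CLI syntax; consult the @midscene/computer documentation for the real command surface.

```sh
# Illustrative loop; the subcommand names here are assumptions, not verified CLI syntax.

# Attach to the desktop and capture the current screen state.
npx -y @midscene/computer@1 connect
npx -y @midscene/computer@1 screenshot

# Read the screenshot, decide on exactly one next step, then perform it.
npx -y @midscene/computer@1 act "open the Settings app from the dock"

# Capture the screen again, verify the result, repeat until done, then release control.
npx -y @midscene/computer@1 disconnect
```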
What act Can Do
Inside a single act call on desktop, Midscene can move the mouse, click, double-click, right-click, drag items, type or clear text, scroll, press single keys or keyboard shortcuts, and work through multi-step interactions on whatever is visible on the selected display.
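As a rough illustration, assuming the same hypothetical `act` subcommand as above, a single call can carry a multi-step natural-language instruction:

```sh
# Hypothetical invocation; the `act` subcommand and its argument form are assumptions.
npx -y @midscene/computer@1 act "minimize the Music window, double-click the Notes icon on the desktop, type 'weekly sync agenda' into the new note, then press Cmd+S"
```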
Prerequisites
Midscene requires models with strong visual grounding capabilities. The required environment variables can be set either as system environment variables or in a .env file in the current working directory (Midscene loads .env automatically); a typical setup is sketched below.
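A minimal .env sketch, assuming an OpenAI-compatible endpoint serving a vision-capable model; the variable names follow Midscene's commonly documented configuration, but verify them against the Midscene docs for your chosen provider:

```sh
# Assumed variable names; confirm against the Midscene documentation for your provider.
OPENAI_API_KEY="sk-..."                        # API key for the model provider (or OpenRouter)
OPENAI_BASE_URL="https://openrouter.ai/api/v1" # OpenAI-compatible endpoint
MIDSCENE_MODEL_NAME="qwen2.5-vl-72b-instruct"  # any model with strong visual grounding
```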