ios-device-automation

Summary

Vision-driven iOS automation using natural language commands and screenshot analysis.

  • Operates entirely from screenshots without requiring DOM access or accessibility labels; can interact with any visible UI element regardless of technology stack
  • Requires a vision model (Gemini, Qwen, Doubao, or similar) configured through environment variables for AI-powered screen understanding and action execution
  • Follows a synchronous workflow: connect device, take screenshot, execute actions via natural language prompts, then disconnect and report results
  • Batch related operations into single act commands to reduce round-trips; always run commands synchronously and allow sufficient time for AI inference and screen interaction

SKILL.md

iOS Device Automation

CRITICAL RULES — VIOLATIONS WILL BREAK THE WORKFLOW:

  1. Never run midscene commands in the background. Each command must run synchronously so you can read its output (especially screenshots) before deciding the next action. Background execution breaks the screenshot-analyze-act loop.
  2. Run only one midscene command at a time. Wait for the previous command to finish, read the screenshot, then decide the next action. Never chain multiple commands together.
  3. Allow enough time for each command to complete. Midscene commands involve AI inference and screen interaction, which can take longer than typical shell commands. A typical command needs about 1 minute; complex act commands may need even longer.
  4. Always report task results before finishing. After completing the automation task, you MUST proactively summarize the results to the user — including key data found, actions completed, screenshots taken, and any relevant findings. Never silently end after the last automation step; the user expects a complete response in a single interaction.
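
The first two rules can be sketched as a small shell wrapper that runs exactly one command in the foreground. This is only an illustration: `run_midscene` and `DRY_RUN` are hypothetical names introduced here, and the `take_screenshot` subcommand name is an assumption based on the workflow described above, not something this page confirms. Dry-run mode is the default, so the sketch prints the command instead of touching a device.

```shell
# Dry-run by default: print the command instead of executing it.
# Set DRY_RUN=0 to run against a real, connected device.
DRY_RUN="${DRY_RUN:-1}"

# run_midscene is a hypothetical helper, not part of Midscene itself.
run_midscene() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "npx -y @midscene/ios@1 $*"
    return 0
  fi
  # No trailing '&': the shell blocks here until the command exits,
  # so its output (including the screenshot) can be read before the
  # next action is chosen.
  npx -y @midscene/ios@1 "$@"
}

# One command per invocation; inspect the result before issuing the next.
# 'take_screenshot' is an assumed subcommand name.
run_midscene take_screenshot
```

The wrapper deliberately does not loop over a list of steps: each call is meant to be followed by reading the screenshot, which is the decision point the rules protect.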

Automate iOS devices using npx -y @midscene/ios@1. Each CLI command maps directly to an MCP tool — you (the AI agent) act as the brain, deciding which actions to take based on screenshots.

What act Can Do

Inside a single act call on iOS, Midscene can tap, double-tap, long-press, type, clear text, scroll, drag items, zoom with two fingers, press keys, and use system navigation such as Home or the app switcher while working from the current visible screen.
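
Because one act call can cover several gestures, related steps can be folded into a single natural-language prompt rather than issued as separate commands. The sketch below assumes the prompt is passed with a `--prompt` flag, which this page does not confirm; `act_once`, `DRY_RUN`, and the Settings task are hypothetical and exist only for illustration. It prints the command by default instead of running it.

```shell
# Dry-run by default: print the command instead of executing it.
DRY_RUN="${DRY_RUN:-1}"

# act_once is a hypothetical helper: one act call, one batched prompt.
act_once() {
  prompt="$1"
  if [ "$DRY_RUN" = "1" ]; then
    echo "npx -y @midscene/ios@1 act --prompt \"$prompt\""
    return 0
  fi
  # '--prompt' is an assumed flag name; runs in the foreground.
  npx -y @midscene/ios@1 act --prompt "$prompt"
}

# Several related steps batched into a single instruction (made-up task).
act_once "open Settings, scroll down to General, tap it, then tap About"
```

Batching this way trades fewer round-trips for a longer single inference, so it suits steps that do not need a screenshot check in between.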

Prerequisites

Midscene requires models with strong visual grounding capabilities. The following environment variables must be configured — either as system environment variables or in a .env file in the current working directory (Midscene loads .env automatically):

Installs: 1.4K
GitHub Stars: 221
First Seen: Mar 6, 2026