browser-automation

Summary

Vision-driven browser automation from screenshots, no DOM access required.

  • Operates in a headless Puppeteer browser that persists across CLI calls, allowing sequential commands without session loss
  • Interacts with all visible page elements using natural language prompts; no CSS selectors or accessibility labels needed
  • Requires configuration of a vision-capable model (Gemini, Qwen, Doubao, or similar) via environment variables for visual grounding
  • Supports connect, take_screenshot, act (perform actions), disconnect, and close commands; follow a synchronous workflow, reading each screenshot before deciding the next step
SKILL.md

Browser Automation

CRITICAL RULES — VIOLATIONS WILL BREAK THE WORKFLOW:

  1. Never run midscene commands in the background. Each command must run synchronously so you can read its output (especially screenshots) before deciding the next action. Background execution breaks the screenshot-analyze-act loop.
  2. Run only one midscene command at a time. Wait for the previous command to finish, read the screenshot, then decide the next action. Never chain multiple commands together.
  3. Allow enough time for each command to complete. Midscene commands involve AI inference and screen interaction, which can take longer than typical shell commands. A typical command needs about 1 minute; complex act commands may need even longer.
  4. Always report task results before finishing. After completing the automation task, you MUST proactively summarize the results to the user — including key data found, actions completed, screenshots taken, and any relevant findings. Never silently end after the last automation step; the user expects a complete response in a single interaction.
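Rules 1–3 can be enforced with a small wrapper that runs exactly one command at a time, synchronously, under a generous ceiling. This is only a sketch: the `run_one` helper and the 120-second default are assumptions, not part of the CLI.

```shell
#!/bin/sh
# run_one: execute exactly one command in the foreground, blocking until it
# finishes or a generous time ceiling is hit. Rule 3 suggests ~1 minute per
# command; the 120s default here is a guess — tune as needed.
run_one() {
  timeout "${STEP_TIMEOUT:-120}" "$@"
  status=$?
  if [ "$status" -eq 124 ]; then
    # 124 is GNU coreutils timeout's "timed out" exit code
    echo "step timed out: $*" >&2
  fi
  return "$status"
}

# Usage (never backgrounded, never chained):
#   run_one npx -y @midscene/web@1 act "click the Sign in button"
```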

Automate web browsing using npx -y @midscene/web@1. By default, launches a headless Chrome via Puppeteer that persists across CLI calls — no session loss between commands. Also supports CDP mode and Bridge mode to connect to an existing Chrome browser. Each CLI command maps directly to an MCP tool — you (the AI agent) act as the brain, deciding which actions to take based on screenshots.
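As a sketch of the screenshot-analyze-act loop, a session might look like the following. The subcommand names come from this document, but their exact argument shapes are assumptions — check `npx -y @midscene/web@1 --help` before relying on them. The `midscene` function defaults to a dry run that prints each command instead of invoking the CLI.

```shell
#!/bin/sh
# Dry-run wrapper: prints the command by default; set DRY_RUN=0 to invoke the
# real CLI. Subcommand names are from this document; flags are assumptions.
midscene() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "midscene $*"
  else
    npx -y @midscene/web@1 "$@"
  fi
}

midscene connect "https://example.com"    # open the persistent browser
midscene take_screenshot                  # read this before acting
midscene act "click the Sign in button"   # natural-language action
midscene take_screenshot                  # verify the result visually
midscene disconnect                       # detach; the session persists
```

Each call above runs in the foreground so its output (especially the screenshot) can be read before choosing the next step.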

What act Can Do

Inside a single act call in the browser, Midscene can click, right-click, double-click, hover, type or clear text, press keys, scroll, drag, long-press, and continue through multi-step page flows based on what is currently visible. When touch input is enabled, it can also handle swipe- or pinch-style interactions on touch-oriented pages.
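Because one act call can carry a multi-step instruction, it is often better to describe the whole visible flow in a single prompt than to issue many small ones. A dry-run sketch (the prompt wording and the argument shape are illustrative assumptions):

```shell
#!/bin/sh
# Prints the hypothetical invocation rather than running it, since the exact
# CLI argument shape is an assumption.
act() { echo "npx -y @midscene/web@1 act \"$1\""; }

# One act call covering several visible steps:
act "hover over the Products menu, click Pricing, then scroll down to the FAQ section"
```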

When to Use

This skill has three modes (headless Puppeteer by default, plus CDP and Bridge for attaching to an existing Chrome). Choose based on the user's intent.
