chrome-bridge-automation

Summary

Vision-driven browser automation using your real Chrome browser, preserving sessions and login state.

  • Operates entirely from screenshots without requiring DOM access; interacts with all visible elements regardless of technology stack
  • Connects to desktop Chrome via the Midscene Chrome Extension over the Chrome DevTools Protocol, never taking over the mouse or keyboard
  • Supports multi-step workflows including navigation, form filling, data extraction, UI testing, and screenshot capture
  • Requires a visual-grounding AI model (Gemini, Qwen, Doubao, or similar) to be configured via environment variables before use
SKILL.md

Chrome Bridge Automation

CRITICAL RULES — VIOLATIONS WILL BREAK THE WORKFLOW:

  1. Never run midscene commands in the background. Each command must run synchronously so you can read its output (especially screenshots) before deciding the next action. Background execution breaks the screenshot-analyze-act loop.
  2. Run only one midscene command at a time. Wait for the previous command to finish, read the screenshot, then decide the next action. Never chain multiple commands together.
  3. Allow enough time for each command to complete. Midscene commands involve AI inference and screen interaction, which can take longer than typical shell commands. A typical command needs about 1 minute; complex act commands may need even longer.
  4. Always report task results before finishing. After completing the automation task, you MUST proactively summarize the results to the user — including key data found, actions completed, screenshots taken, and any relevant findings. Never silently end after the last automation step; the user expects a complete response in a single interaction.
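Rules 1 to 3 can be illustrated with a small wrapper that runs exactly one command in the foreground, waits for it to exit, and returns its output. This is only a sketch under stated assumptions: `run_step` and the 120-second default timeout are hypothetical choices for illustration, not part of Midscene, and the stand-in command below is a plain Python one-liner because the required Midscene command format is not reproduced here.

```python
import subprocess
import sys


def run_step(command: list[str], timeout_s: float = 120.0) -> str:
    """Run one command in the foreground and return its stdout.

    Hypothetical helper: the name and the default timeout are assumptions.
    """
    # subprocess.run blocks until the process exits, so the command never
    # runs in the background and its output can be read before the next step.
    result = subprocess.run(
        command,
        capture_output=True,
        text=True,
        timeout=timeout_s,  # raises subprocess.TimeoutExpired when exceeded
        check=True,         # raises subprocess.CalledProcessError on failure
    )
    return result.stdout


if __name__ == "__main__":
    # Stand-in command only; substitute the real command for actual use.
    print(run_step([sys.executable, "-c", "print('step done')"]))
```

Passing the command as a list of arguments (rather than a shell string) also makes it harder to chain several commands into one call by accident.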

Automate the user's real Chrome browser via the Midscene Chrome Extension (Bridge mode), preserving cookies, sessions, and login state. You (the AI agent) act as the brain, deciding which actions to take based on screenshots.
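The screenshot-analyze-act loop described above, together with the final report required by rule 4, might be sketched roughly as follows. Everything named here (`run_task`, `TaskReport`, and the three callables) is hypothetical and stands in for whatever actually takes the screenshot, picks the next action, and executes it; none of it is Midscene's API.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class TaskReport:
    """Collects what happened so it can be summarized for the user."""
    actions: list[str] = field(default_factory=list)
    screenshots: list[str] = field(default_factory=list)

    def summary(self) -> str:
        return (f"Completed {len(self.actions)} action(s); "
                f"captured {len(self.screenshots)} screenshot(s).")


def run_task(
    take_screenshot: Callable[[], str],
    decide: Callable[[str], Optional[str]],
    act: Callable[[str], None],
    max_steps: int = 10,
) -> TaskReport:
    """Run one observe-decide-act step at a time, then return a report."""
    report = TaskReport()
    for _ in range(max_steps):
        shot = take_screenshot()      # observe the current page
        report.screenshots.append(shot)
        action = decide(shot)         # the agent chooses the next action
        if action is None:            # None means the task is finished
            break
        act(action)                   # execute exactly one action
        report.actions.append(action)
    return report


if __name__ == "__main__":
    # Stub callables for illustration only.
    planned = iter(["click the Login button", "type the search query", None])
    counter = iter(range(1, 100))
    report = run_task(
        take_screenshot=lambda: f"shot-{next(counter)}.png",
        decide=lambda shot: next(planned),
        act=lambda action: None,
    )
    print(report.summary())
```

Returning the report from the loop, instead of printing inside it, keeps the final summary a single explicit step that cannot be skipped silently.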

What act Can Do

Inside a single act call in Chrome Bridge mode, Midscene can click, right-click, double-click, hover, type or clear text, press keys, scroll, drag, long-press, and continue through multi-step page flows in the user's real Chrome session based on what is currently visible. When touch input is enabled, it can also handle swipe- or pinch-style interactions on touch-oriented pages.

Command Format

CRITICAL — Every command MUST follow this EXACT format. Do NOT modify the command prefix.

Installs: 665
GitHub Stars: 218
First Seen: Mar 6, 2026