page-agent
alibaba/page-agent (https://github.com/alibaba/page-agent, 17k+ stars, MIT) is an in-page GUI agent written in TypeScript. It lives inside a webpage, reads the DOM as text (no screenshots, no multimodal LLM), and executes natural-language instructions like "click the login button, then fill username as John" against the current page. It runs entirely client-side: the host site just includes a script and passes an OpenAI-compatible LLM endpoint.
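To make "reads the DOM as text" concrete, here is a minimal sketch of the general idea: walk a page tree and emit one indexed line per interactive element, so a text-only LLM can refer to elements by number. This is not page-agent's actual serializer. `SketchNode`, `serializeInteractive`, and the `[index]` line format are hypothetical names chosen for illustration, and the plain-object tree stands in for the browser's real DOM.

```typescript
// Hypothetical, simplified stand-in for a DOM node (a real page would use Element).
interface SketchNode {
  tag: string;
  text?: string;
  attrs?: Record<string, string>;
  children?: SketchNode[];
}

// Tags treated as interactive in this sketch; a real agent would use richer heuristics.
const INTERACTIVE_TAGS = new Set(["a", "button", "input", "select", "textarea"]);

// Flatten the tree into one text line per interactive element, in document order.
// Each line carries a numeric index that a model could cite in an action.
function serializeInteractive(root: SketchNode): string[] {
  const lines: string[] = [];
  const visit = (node: SketchNode): void => {
    if (INTERACTIVE_TAGS.has(node.tag)) {
      const attrs = Object.entries(node.attrs ?? {})
        .map(([key, value]) => ` ${key}="${value}"`)
        .join("");
      lines.push(`[${lines.length}] <${node.tag}${attrs}>${node.text ?? ""}</${node.tag}>`);
    }
    for (const child of node.children ?? []) visit(child);
  };
  visit(root);
  return lines;
}

// Example: a tiny login form. The label is skipped because it is not interactive.
const page: SketchNode = {
  tag: "form",
  children: [
    { tag: "label", text: "Username" },
    { tag: "input", attrs: { name: "username", type: "text" } },
    { tag: "button", text: "Log in" },
  ],
};

const domText = serializeInteractive(page);
console.log(domText.join("\n"));
```

An instruction such as "click the login button" can then be resolved by the model against these indexed lines rather than against pixels, which is why no screenshot or vision model is needed.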
When to use this skill
Load this skill when a user wants to:
- Ship an AI copilot inside their own web app (SaaS, admin panel, B2B tool, ERP, CRM) — "users on my dashboard should be able to type 'create invoice for Acme Corp and email it' instead of clicking through five screens"
- Modernize a legacy web app without rewriting the frontend — page-agent drops on top of existing DOM
- Add accessibility via natural language — voice / screen-reader users drive the UI by describing what they want
- Demo or evaluate page-agent against a local (Ollama) or hosted (Qwen, OpenAI, OpenRouter) LLM
- Build interactive training / product demos — let an AI walk a user through "how to submit an expense report" live in the real UI
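For the demo-or-evaluate case above, switching between a local and a hosted model mostly comes down to which OpenAI-compatible base URL the agent is pointed at. The sketch below lists the commonly documented base URLs for the providers mentioned. The `Provider`, `EndpointSketch`, and `chatCompletionsUrl` names are hypothetical, and the option names page-agent itself expects should be taken from its README rather than from this example.

```typescript
// Hypothetical lookup table for illustration; not part of page-agent.
type Provider = "ollama" | "openai" | "openrouter" | "qwen";

interface EndpointSketch {
  baseURL: string;      // OpenAI-compatible base URL
  needsApiKey: boolean; // whether the provider requires an API key
}

const ENDPOINTS: Record<Provider, EndpointSketch> = {
  // Ollama's default local port, using its OpenAI-compatible /v1 routes.
  ollama: { baseURL: "http://localhost:11434/v1", needsApiKey: false },
  openai: { baseURL: "https://api.openai.com/v1", needsApiKey: true },
  openrouter: { baseURL: "https://openrouter.ai/api/v1", needsApiKey: true },
  // Alibaba Cloud DashScope compatible mode; regional variants of this host exist.
  qwen: { baseURL: "https://dashscope.aliyuncs.com/compatible-mode/v1", needsApiKey: true },
};

// Build the chat-completions URL that an OpenAI-compatible client would call.
function chatCompletionsUrl(provider: Provider): string {
  return `${ENDPOINTS[provider].baseURL}/chat/completions`;
}

console.log(chatCompletionsUrl("ollama"));
```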
When NOT to use this skill
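- User wants Hermes itself to drive a browser → use Hermes' built-in browser tool (Browserbase / Camofox). page-agent works in the opposite direction: it is embedded in the page and acts from inside it, rather than controlling a browser from outside.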
- User wants cross-tab automation without embedding → use Playwright, browser-use, or the page-agent Chrome extension
- User needs visual grounding / screenshots → page-agent is text-DOM only; use a multimodal browser agent instead