seo-ai-crawlers (M14)

Controls whether AI search engines can crawl and cite the page, and whether they can read it without JavaScript. Every check rests on separating three crawler classes: training, search/retrieval, and user-triggered fetch. Reference: references/ai-crawlers.md.

Audits

Working from the PageSnapshot (rendered_dom if present, else raw_html) plus the site robots.txt:

Citation access: are retrieval/citation bots (OAI-SearchBot, Claude-SearchBot, PerplexityBot, Bingbot) actually allowed, rather than caught by a broad Disallow: / or a wildcard block? Confirm Googlebot is not blocked and that Googlebot (search) and Google-Extended (Gemini training control) are handled as separate agents, so a training opt-out does not also block search.
User-agent classification: bucket every AI agent in robots.txt into training (GPTBot, ClaudeBot, Google-Extended, Applebot-Extended, CCBot), search/retrieval (OAI-SearchBot, Claude-SearchBot, PerplexityBot), and user-triggered fetch (ChatGPT-User, Claude-User, Perplexity-User). Match user-agents case-insensitively; treat the table in references/ai-crawlers.md as a starting set, not exhaustive.
Renderability for non-JS crawlers: pull the M4 (seo-crawl-render) render result — most AI crawlers do not execute JS. If primary content only appears in rendered_dom and is absent from raw_html, flag it as invisible to AI retrieval.
llms.txt / llms-full.txt (also covers M21): presence at the site root, valid Markdown structure (H1 title, summary blockquote, sectioned link lists), and that linked URLs resolve. Follow references/ai-crawlers.md.
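
The classification and citation-access audits above can be sketched as follows. This is a minimal illustration, not the module's actual implementation: the names AGENT_BUCKETS, classify_agent, and blocked_citation_bots are hypothetical, and Python's standard-library robots.txt parser is used as an approximation, so its rule matching may differ from how an individual crawler interprets the file.

```python
from urllib.robotparser import RobotFileParser

# Buckets copied from the classification audit; a starting set, not exhaustive.
AGENT_BUCKETS = {
    "training": ["GPTBot", "ClaudeBot", "Google-Extended", "Applebot-Extended", "CCBot"],
    "search": ["OAI-SearchBot", "Claude-SearchBot", "PerplexityBot"],
    "user_fetch": ["ChatGPT-User", "Claude-User", "Perplexity-User"],
}


def classify_agent(user_agent: str) -> str:
    """Return the bucket for a robots.txt user-agent token (case-insensitive)."""
    needle = user_agent.strip().lower()
    for bucket, agents in AGENT_BUCKETS.items():
        if needle in (agent.lower() for agent in agents):
            return bucket
    return "unknown"  # not in the starting set; needs manual review


def blocked_citation_bots(robots_txt: str, url: str = "/") -> list[str]:
    """List the citation/search bots that robots.txt forbids from fetching url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # Bingbot and Googlebot are included because the audit checks them as well.
    citation_bots = AGENT_BUCKETS["search"] + ["Bingbot", "Googlebot"]
    return [bot for bot in citation_bots if not parser.can_fetch(bot, url)]
```

For example, a file that disallows everything under `User-agent: *` and then allows only OAI-SearchBot would report Claude-SearchBot, PerplexityBot, Bingbot, and Googlebot as blocked, which is the "wildcard block" case the citation-access audit is meant to catch.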

Fixes

AUTO (fixable: auto): a citation-friendly robots.txt preset, choice-gated — the user picks allow-citations (default: allow search/retrieval, opt out of training), allow-all, or block-all. Deterministic, additive, verifiable; emitted as a diff for fix.
AUTO (fixable: auto), disclosure-gated and scored 0: llms.txt / llms-full.txt, generated from the site's own structure only on explicit request (fix --category llms), shown as a diff before writing. Additive and deterministic, but never sold as proven ranking value — the disclosure that it is low/uncertain impact is shown every time.
ADVISORY (fixable: advisory): edge/WAF block for bots that ignore robots.txt (e.g. Bytespider); the tool never writes infra config.

Never fabricate sitemap URLs, contact emails, or link targets; ask the user or leave a clearly marked TODO placeholder.
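
As a rough sketch of how the choice-gated robots.txt preset could be rendered and shown as an additive diff, assuming the agent lists from the classification audit: build_preset and preset_diff are hypothetical names, and the source does not say how user-triggered fetch agents are treated under allow-citations, so this sketch leaves them untouched in that mode.

```python
import difflib

TRAINING = ["GPTBot", "ClaudeBot", "Google-Extended", "Applebot-Extended", "CCBot"]
SEARCH = ["OAI-SearchBot", "Claude-SearchBot", "PerplexityBot"]
USER_FETCH = ["ChatGPT-User", "Claude-User", "Perplexity-User"]


def build_preset(choice: str) -> str:
    """Render robots.txt groups for one of the three preset choices."""
    if choice == "allow-citations":
        # Assumption: user-triggered fetch agents are left out (unspecified in the source).
        rules = {"Allow: /": SEARCH, "Disallow: /": TRAINING}
    elif choice == "allow-all":
        rules = {"Allow: /": TRAINING + SEARCH + USER_FETCH}
    elif choice == "block-all":
        rules = {"Disallow: /": TRAINING + SEARCH + USER_FETCH}
    else:
        raise ValueError(f"unknown preset: {choice!r}")
    groups = []
    for directive, agents in rules.items():
        # One group per directive: several User-agent lines share a single rule.
        groups.append("\n".join([f"User-agent: {a}" for a in agents] + [directive]))
    return "\n\n".join(groups) + "\n"


def preset_diff(existing: str, choice: str) -> str:
    """Append the preset to the existing file and return a unified diff."""
    preset = build_preset(choice)
    updated = existing.rstrip("\n") + "\n\n" + preset if existing.strip() else preset
    diff = difflib.unified_diff(
        existing.splitlines(),
        updated.splitlines(),
        fromfile="robots.txt",
        tofile="robots.txt (proposed)",
        lineterm="",
    )
    return "\n".join(diff)
```

Appending keeps the change additive, so the diff contains only added lines. One design caveat: a crawler follows only the most specific group that matches it, so a named group replaces the `*` rules for that bot rather than extending them. Any site-wide disallows that should still apply to these bots would need to be repeated inside their groups.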