Website Spec
Recommended

robots.txt for AI crawlers

Major AI vendors publish named user-agents for their crawlers. Setting an explicit allow or disallow per agent is the clearest way to control how your content is used.

What it is

The robots.txt file (see robots.txt) accepts rules grouped by user-agent. Most major AI vendors now publish named user-agents for their training and retrieval crawlers, so you can allow or block each one independently of search crawlers.

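The big ones, as of 2026, include the agents named in the example under How to implement: GPTBot, OAI-SearchBot and ChatGPT-User (OpenAI), ClaudeBot (Anthropic), Google-Extended (Google), Applebot-Extended (Apple), PerplexityBot (Perplexity) and CCBot (Common Crawl).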

Why it matters

A blanket Disallow: / under User-agent: * blocks every compliant crawler, including search engines. Naming agents lets you make precise decisions: allow retrieval bots so your content can be cited live and block training bots if you do not want your content used to train models, or the reverse.
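To make the difference concrete, here is a minimal sketch using Python's standard-library urllib.robotparser. The allowed helper, the two sample files and the example.com URL are illustrative assumptions, and Googlebot stands in for a search crawler. The standard-library parser only approximates how a compliant crawler picks its rule group; individual crawlers may match slightly differently.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical sample: one wildcard group that blocks every compliant crawler.
BLANKET = """\
User-agent: *
Disallow: /
"""

# Hypothetical sample: block one named training crawler, allow everyone else.
NAMED = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""


def allowed(robots_txt: str, agent: str, url: str = "https://example.com/page") -> bool:
    """Return True if `agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    for label, text in (("blanket", BLANKET), ("named", NAMED)):
        for agent in ("Googlebot", "GPTBot"):
            print(f"{label:8} {agent:10} allowed={allowed(text, agent)}")
```

Under the blanket file both agents are refused; under the named file only GPTBot is refused and the search crawler falls through to the wildcard group.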

Compliance is honour-based. Reputable vendors document and respect their user-agents. Unidentified scrapers can simply ignore robots.txt; defend against those with rate limits or WAF rules, not robots.txt.

How to implement

A reasonable default that allows search and retrieval but opts out of training:

User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: Applebot-Extended
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: *
Allow: /

Sitemap: https://example.com/sitemap_index.xml
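Before deploying a file like the one above, it can help to check that it expresses the policy you intended. The following is a rough sketch, again with urllib.robotparser from the Python standard library. The audit function, the TRAINING and RETRIEVAL lists and the test URL are hypothetical names introduced for illustration, and Googlebot is added as a stand-in for ordinary search. It checks only what the file says, not whether any crawler obeys it.

```python
from urllib.robotparser import RobotFileParser

# The example policy from this page, reproduced as a string for local checking.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: Applebot-Extended
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: *
Allow: /

Sitemap: https://example.com/sitemap_index.xml
"""

# Assumed grouping: agents the policy opts out of, and agents it lets in.
TRAINING = ["GPTBot", "Google-Extended", "Applebot-Extended", "ClaudeBot", "CCBot"]
RETRIEVAL = ["OAI-SearchBot", "ChatGPT-User", "PerplexityBot", "Googlebot"]


def audit(robots_txt, blocked, allowed, url="https://example.com/any-page"):
    """Return a list of human-readable mismatches between file and intent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    problems = []
    for agent in blocked:
        if parser.can_fetch(agent, url):
            problems.append(f"{agent} should be blocked but is allowed")
    for agent in allowed:
        if not parser.can_fetch(agent, url):
            problems.append(f"{agent} should be allowed but is blocked")
    return problems


if __name__ == "__main__":
    for line in audit(ROBOTS_TXT, TRAINING, RETRIEVAL) or ["policy matches intent"]:
        print(line)
```

An empty result means every listed agent resolves to the intended rule. To check a live site instead, RobotFileParser also offers set_url and read, which fetch the file over the network.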

Rules of thumb:

Common mistakes

Verification

Sources