Crawling

Agent-aware robots.txt

Tell GPTBot, ClaudeBot and Perplexity what you allow.

Every major AI lab now publishes a named crawler. A modern robots.txt should explicitly grant or deny each one, otherwise you're betting on each company's default behavior (which is usually 'fetch everything').

The bots worth naming

GPTBot — OpenAI training crawler.
ChatGPT-User — live ChatGPT browsing on a user's behalf.
OAI-SearchBot — OpenAI search index.
ClaudeBot / Claude-User — Anthropic.
PerplexityBot / Perplexity-User — Perplexity AI.
Google-Extended — opt out of Gemini training without losing Search.
CCBot — Common Crawl (feeds most open models).

Opinionated default

User-agent: GPTBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Google-Extended
Allow: /

User-agent: *
Disallow: /admin/
Sitemap: https://yoursite.com/sitemap.xml

Disallowing AI bots hides you from the answer layer. If discoverability matters, allow them.

Keep reading

llms.txt

A homepage for language models.