Why Your robots.txt Is Blocking AI Crawlers (and How to Fix It)

You've published solid content, your Google rankings look healthy, and yet ChatGPT and Perplexity never seem to cite your site. You've checked for JSON-LD, you have a sitemap — so what's going wrong?

In many cases, the answer is sitting at the root of your site: robots.txt. It was probably written years ago with Googlebot in mind, and it may now be blocking AI crawlers by accident, which keeps your content out of AI-generated answers.

Which user-agents do AI crawlers use?

AI search platforms each deploy their own crawlers with distinct user-agent strings. Here are the ones that matter most in 2026:

User-Agent	Platform	Used for
GPTBot	OpenAI (ChatGPT)	Training data (OpenAI documents separate agents, OAI-SearchBot and ChatGPT-User, for search and browsing)
ClaudeBot	Anthropic (Claude)	Training data + web search
PerplexityBot	Perplexity AI	Real-time search results
Google-Extended	Google (Gemini)	Gemini training and grounding; does not affect Google Search or AI Overviews
Applebot-Extended	Apple (Siri, Apple Intelligence)	Summaries and on-device AI
cohere-ai	Cohere	Enterprise AI search
YouBot	You.com	AI search index

None of these existed when most robots.txt files were written. The default approach of most web servers — and many CMSs — doesn't account for them at all.

The three most common robots.txt mistakes

Mistake 1: Blocking all unknown bots

A common hardening pattern disallows any user-agent not explicitly whitelisted:

# Common "security" pattern that blocks AI crawlers
User-agent: *
Disallow: /

User-agent: Googlebot
Allow: /

User-agent: Bingbot
Allow: /

This blocks everything except Googlebot and Bingbot — including every AI crawler above. ChatGPT, Perplexity, and Claude can't read a single page of your site.

Mistake 2: A legacy blanket disallow

The second pattern is more common and more subtle. Many CMSs (including older WordPress, Drupal, and various SaaS builders) generate a default robots.txt with:

# Placed there during development, never cleaned up
User-agent: *
Disallow: /

This was meant to block all bots during staging. Someone forgot to remove it for production, or the CMS regenerated it after a deployment. The result: every AI crawler is turned away on every visit.

Check this right now: open https://yourdomain.com/robots.txt in a browser. If you see Disallow: / under User-agent: * — and no explicit Allow rules for the AI agents above — your site is invisible to AI search.
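The same check can be scripted instead of eyeballed. The sketch below uses Python's standard urllib.robotparser to ask whether a few of the user-agents from the table may fetch a given URL. The AI_AGENTS tuple, the blocked_agents helper, and the example.com URL are illustrative assumptions, and Python's parser only approximates how each real crawler interprets the file.

```python
from urllib.robotparser import RobotFileParser

# A subset of the user-agent tokens from the table above (assumed current).
AI_AGENTS = ("GPTBot", "ClaudeBot", "PerplexityBot")


def blocked_agents(robots_txt: str, url: str, agents=AI_AGENTS) -> list[str]:
    """Return the agents that this robots.txt text forbids from fetching url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [agent for agent in agents if not parser.can_fetch(agent, url)]


# The staging leftover from Mistake 2: a blanket disallow for every bot.
LEGACY = """\
User-agent: *
Disallow: /
"""

# Every agent falls back to the catch-all group, so all three are blocked.
print(blocked_agents(LEGACY, "https://example.com/blog/post"))
```

To run it against a live site, RobotFileParser can also fetch the file itself: construct it with the robots.txt URL and call read() before can_fetch().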

Mistake 3: Blocking crawl-heavy bots by name

Some server admins block aggressive crawlers by name to protect bandwidth. The problem: early AI crawlers (particularly GPTBot in 2023) were aggressive, and blocking rules written then are still active today. OpenAI has since added rate controls, but the disallow rules remain.

# Written in 2023, still killing your AI visibility in 2026
User-agent: GPTBot
Disallow: /

The fix: allow AI crawlers explicitly

Here is a robots.txt template that balances open AI crawler access with reasonable protections:

# Allow major AI crawlers explicitly
User-agent: GPTBot
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Google-Extended
Allow: /

User-agent: Applebot-Extended
Allow: /

User-agent: cohere-ai
Allow: /

User-agent: YouBot
Allow: /

# Default rules for all other bots
User-agent: *
Disallow: /admin/
Disallow: /private/
Disallow: /checkout/
Allow: /

# Point to your sitemap
Sitemap: https://yourdomain.com/sitemap.xml

Key decisions in this template:

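Each AI crawler gets its own named group. A crawler obeys the group that names it and ignores the catch-all User-agent: * group, wherever the groups appear in the file, so the ordering here is for readability only.
Sensitive paths (/admin/, /private/, /checkout/) are disallowed only in the catch-all group. The AI crawlers follow their own Allow: / rule, so repeat those Disallow lines in each AI group if the paths should be off-limits to them too.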
The Sitemap: directive is present — this helps AI crawlers discover all your public URLs efficiently.
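One way to sanity-check a template like this before deploying it is to parse it locally. The sketch below feeds an abbreviated copy of the template to Python's standard urllib.robotparser. SomeOtherBot, the allowed helper, and the example.com host are made up for illustration. The last check is worth noticing: a crawler with its own group follows only that group, so the catch-all Disallow lines do not apply to GPTBot.

```python
from urllib.robotparser import RobotFileParser

# Abbreviated copy of the template above: one AI group plus the catch-all.
TEMPLATE = """\
User-agent: GPTBot
Allow: /

User-agent: *
Disallow: /admin/
Disallow: /private/
Disallow: /checkout/
Allow: /
"""

parser = RobotFileParser()
parser.parse(TEMPLATE.splitlines())


def allowed(agent: str, path: str) -> bool:
    """Ask the parsed rules whether agent may fetch path on a placeholder host."""
    return parser.can_fetch(agent, "https://example.com" + path)


# "SomeOtherBot" is a hypothetical agent that only matches the catch-all group.
print(allowed("GPTBot", "/blog/post"))          # True
print(allowed("SomeOtherBot", "/blog/post"))    # True
print(allowed("SomeOtherBot", "/admin/users"))  # False
print(allowed("GPTBot", "/admin/users"))        # True: GPTBot follows only its own group
```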

What if I don't want AI training on my content?

That's a legitimate choice. If you want to opt out of training data while still appearing in real-time AI search results (Perplexity, Google AI Overviews), some platforms support separate controls:

OpenAI: GPTBot collects training data. OpenAI documents separate user-agents for search (OAI-SearchBot) and user-initiated browsing (ChatGPT-User), so you can disallow GPTBot while leaving those allowed. Check OpenAI's crawler documentation for the current list.
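Google: Google-Extended controls whether your content is used for Gemini training and grounding. It is a robots.txt control token rather than a separate crawler, and blocking it does not affect Google Search. AI Overviews are part of Search and follow your Googlebot rules, so you can block Google-Extended while keeping Googlebot allowed.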
Anthropic: ClaudeBot is primarily a training crawler. Blocking it does not affect whether Claude cites your site via real-time search (Claude uses a separate pipeline for that).

The practical tradeoff: blocking AI training crawlers is unlikely to affect your near-term AI search visibility significantly, since most AI Overviews and Perplexity results are powered by real-time web search, not training data. The higher-impact decision is making sure the crawlers behind real-time results (PerplexityBot, and Googlebot for AI Overviews) are allowed.
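As a rough illustration of that tradeoff, the sketch below defines a hypothetical opt-out policy that disallows GPTBot and ClaudeBot while leaving PerplexityBot and every other agent open, then checks it with Python's standard urllib.robotparser. The policy text and the example.com URL are assumptions for illustration, not a recommendation for any particular site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical policy: two named crawlers share one group that blocks
# everything; all other agents fall through to the open catch-all group.
OPT_OUT = """\
User-agent: GPTBot
User-agent: ClaudeBot
Disallow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(OPT_OUT.splitlines())

URL = "https://example.com/blog/post"
for agent in ("GPTBot", "ClaudeBot", "PerplexityBot", "Googlebot"):
    verdict = "allowed" if parser.can_fetch(agent, URL) else "blocked"
    print(f"{agent}: {verdict}")
```

Listing several User-agent lines above one set of rules keeps the file short; the same rules apply to every agent named in the group.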

Don't forget meta robots tags

robots.txt isn't the only place bots can be blocked. Check your page-level meta tags too:

<!-- These tags restrict indexing or snippet extraction -->
<meta name="robots" content="nosnippet">
<meta name="robots" content="noindex">
<meta name="googlebot" content="nosnippet">

The nosnippet directive tells crawlers not to show text snippets from the page, and extracting snippets is exactly what AI engines do to build cited answers. The noindex directive goes further and asks for the page to be left out of the index altogether. If your page template applies either tag globally (another common CMS default), engines that honor it will not quote your content, even if their crawler can access the page.
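A page template can be checked for these tags with a few lines of standard-library Python. The sketch below uses html.parser to collect any nosnippet or noindex directives found in robots-style meta tags. The RobotsMetaScanner class, the scan helper, the sets of watched names and directives, and the sample page are all illustrative assumptions.

```python
from html.parser import HTMLParser

# Assumed for illustration: the meta names and directives worth flagging.
WATCHED_NAMES = {"robots", "googlebot"}
BLOCKING_DIRECTIVES = {"nosnippet", "noindex"}


class RobotsMetaScanner(HTMLParser):
    """Collect (meta name, directive) pairs that restrict indexing or snippets."""

    def __init__(self):
        super().__init__()
        self.findings = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {key: (value or "") for key, value in attrs}
        name = attr.get("name", "").strip().lower()
        if name not in WATCHED_NAMES:
            return
        # The content attribute is a comma-separated list of directives.
        for directive in attr.get("content", "").lower().split(","):
            directive = directive.strip()
            if directive in BLOCKING_DIRECTIVES:
                self.findings.append((name, directive))


def scan(html: str) -> list:
    """Return every blocking directive found in the given HTML text."""
    scanner = RobotsMetaScanner()
    scanner.feed(html)
    scanner.close()
    return scanner.findings


PAGE = """
<html><head>
<meta name="robots" content="index, nosnippet">
<meta name="googlebot" content="NOINDEX">
<meta name="viewport" content="width=device-width">
</head><body>Hello</body></html>
"""

print(scan(PAGE))  # [('robots', 'nosnippet'), ('googlebot', 'noindex')]
```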

Verify your fix with a free audit

After updating your robots.txt, run a CiteReady GEO audit on your key pages. The AI Crawler Access category will confirm whether all major bots can reach your content — and flag any remaining meta-robots issues or X-Robots-Tag HTTP headers that might still be blocking snippet extraction.

Also see: What is GEO and why your site needs it in 2026 — for the full picture on making your site citable across all four GEO dimensions.

