You've published solid content, your Google rankings look healthy, and yet ChatGPT and Perplexity never seem to cite your site. You've checked for JSON-LD and you have a sitemap, so what's going wrong?

In many cases, the answer is sitting at the root of your site: robots.txt. It was probably written years ago with Googlebot in mind, and it may now be accidentally blocking the AI crawlers that exist today, which keeps your site out of AI-generated answers.

Which user-agents do AI crawlers use?

AI search platforms each deploy their own crawlers with distinct user-agent strings. Here are the ones that matter most in 2026:

User-Agent | Platform | Used for
GPTBot | OpenAI (ChatGPT) | Training data + real-time browsing
ClaudeBot | Anthropic (Claude) | Training data + web search
PerplexityBot | Perplexity AI | Real-time search results
Google-Extended | Google (Gemini) | Gemini training and grounding
Applebot-Extended | Apple (Siri, Apple Intelligence) | Summaries and on-device AI
cohere-ai | Cohere | Enterprise AI search
YouBot | You.com | AI search index

None of these existed when most robots.txt files were written. The defaults shipped by most web servers and many CMSs don't account for them at all.

The three most common robots.txt mistakes

Mistake 1: Blocking all unknown bots

A common hardening pattern disallows any user-agent not explicitly whitelisted:

# Common "security" pattern that blocks AI crawlers
User-agent: *
Disallow: /

User-agent: Googlebot
Allow: /

User-agent: Bingbot
Allow: /

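This blocks everything except Googlebot and Bingbot, including every AI crawler above. A crawler follows only the group that best matches its user-agent, so any bot that isn't named falls back to User-agent: * and is disallowed from the whole site. ChatGPT, Perplexity, and Claude can't read a single page of your site.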
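
To see the fallback behavior concretely, here is a minimal sketch that feeds the pattern above to Python's standard-library robots.txt parser. The example.com URL and the helper name allowed_agents are placeholders for illustration, and real crawlers may differ from this parser in edge cases.

```python
from urllib.robotparser import RobotFileParser

# The allowlist-only pattern from Mistake 1.
ALLOWLIST_ONLY = """\
User-agent: *
Disallow: /

User-agent: Googlebot
Allow: /

User-agent: Bingbot
Allow: /
"""


def allowed_agents(robots_txt, agents, url="https://example.com/"):
    """Return {agent: True/False} for whether each agent may fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


result = allowed_agents(
    ALLOWLIST_ONLY,
    ["Googlebot", "Bingbot", "GPTBot", "ClaudeBot", "PerplexityBot"],
)
for agent, ok in result.items():
    print(f"{agent}: {'allowed' if ok else 'blocked'}")
```

Only the two named search bots come back as allowed; every other agent is matched by the wildcard group.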

Mistake 2: A legacy blanket disallow

The second pattern is more common and more subtle. Many CMSs (including older WordPress, Drupal, and various SaaS builders) generate a default robots.txt with:

# Placed there during development, never cleaned up
User-agent: *
Disallow: /

This was meant to block all bots during staging. Someone forgot to remove it for production, or the CMS regenerated it after a deployment. The result: every AI crawler is turned away on every visit.

Check this right now: open https://yourdomain.com/robots.txt in a browser. If you see Disallow: / under User-agent: * and no explicit Allow rules for the AI agents above, your site is invisible to AI search.
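
If you would rather script that check, the sketch below is one way to do it with Python's standard library. The function names, the agent list, and the 10-second timeout are assumptions for illustration; the agent list uses Google-Extended, which is the token Google documents.

```python
import urllib.request
from urllib.robotparser import RobotFileParser

# Agent tokens to test; adjust to the crawlers you care about.
AI_AGENTS = [
    "GPTBot", "ClaudeBot", "PerplexityBot", "Google-Extended",
    "Applebot-Extended", "cohere-ai", "YouBot",
]


def fetch_robots_txt(base_url):
    """Download robots.txt from a site root such as https://yourdomain.com."""
    url = base_url.rstrip("/") + "/robots.txt"
    with urllib.request.urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def blocked_ai_agents(robots_txt, agents=AI_AGENTS):
    """Return the agents that are not allowed to fetch the home page."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [agent for agent in agents if not parser.can_fetch(agent, "/")]


# Offline demonstration with the leftover staging rule.
staging_rule = "User-agent: *\nDisallow: /\n"
print(blocked_ai_agents(staging_rule))
```

Against a live site you would call blocked_ai_agents(fetch_robots_txt("https://yourdomain.com")); an empty list means none of the listed agents is shut out of the home page.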

Mistake 3: Blocking crawl-heavy bots by name

Some server admins block aggressive crawlers by name to protect bandwidth. The problem: early AI crawlers (particularly GPTBot in 2023) were aggressive, and blocking rules written then are still active today. OpenAI has since added rate controls, but the disallow rules remain.

# Written in 2023, still killing your AI visibility in 2026
User-agent: GPTBot
Disallow: /
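
Stale rules like this are easy to miss in a long file. As a rough sketch, the function below scans robots.txt text for groups that name an AI crawler and disallow the whole site. The function name and the agent list are illustrative assumptions, and the parsing is deliberately simplified (no wildcard or path matching).

```python
AI_AGENTS = [
    "GPTBot", "ClaudeBot", "PerplexityBot", "Google-Extended",
    "Applebot-Extended", "cohere-ai", "YouBot",
]


def named_ai_blocks(robots_txt, ai_agents=AI_AGENTS):
    """List AI agents that are named in a group containing 'Disallow: /'."""
    wanted = {agent.lower(): agent for agent in ai_agents}
    blocked = []
    current_agents = []   # user-agents of the group being read
    in_rules = False      # True once the group has an Allow/Disallow line
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()   # drop comments
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if in_rules:                      # a new group starts here
                current_agents = []
                in_rules = False
            current_agents.append(value.lower())
        elif field in ("allow", "disallow"):
            in_rules = True
            if field == "disallow" and value == "/":
                for agent in current_agents:
                    if agent in wanted and wanted[agent] not in blocked:
                        blocked.append(wanted[agent])
    return blocked


legacy = "# Written in 2023\nUser-agent: GPTBot\nDisallow: /\n\nUser-agent: *\nAllow: /\n"
print(named_ai_blocks(legacy))
```

This only reports crawlers blocked by name; a blanket rule under User-agent: * is the separate case covered in Mistake 2.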

The fix: allow AI crawlers explicitly

Here is a robots.txt template that balances open AI crawler access with reasonable protections:

# Allow major AI crawlers explicitly
User-agent: GPTBot
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Google-Extended
Allow: /

User-agent: Applebot-Extended
Allow: /

User-agent: cohere-ai
Allow: /

User-agent: YouBot
Allow: /

# Default rules for all other bots
User-agent: *
Disallow: /admin/
Disallow: /private/
Disallow: /checkout/
Allow: /

# Point to your sitemap
Sitemap: https://yourdomain.com/sitemap.xml
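
Before deploying a file like this, it can help to test it offline. The sketch below runs an abbreviated copy of the template (two AI groups plus the default group) through Python's standard-library parser. SomeOtherBot and the sample paths are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Abbreviated version of the template: two AI groups plus the default group.
TEMPLATE = """\
User-agent: GPTBot
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: *
Disallow: /admin/
Disallow: /private/
Disallow: /checkout/
Allow: /

Sitemap: https://yourdomain.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(TEMPLATE.splitlines())

# (agent, path) pairs to check; SomeOtherBot stands in for any unnamed bot.
checks = [
    ("GPTBot", "/"),
    ("GPTBot", "/admin/login"),
    ("SomeOtherBot", "/blog/post"),
    ("SomeOtherBot", "/admin/login"),
]
results = {(agent, path): parser.can_fetch(agent, path) for agent, path in checks}
for (agent, path), ok in results.items():
    print(f"{agent} {path}: {'allowed' if ok else 'blocked'}")
```

One thing this check surfaces: a crawler with its own group does not also inherit the Disallow lines under User-agent: *, so GPTBot is allowed into /admin/ here. If those paths should stay off-limits for the named crawlers too, the Disallow lines would need to be repeated in each group.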

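Key decisions in this template: each AI crawler gets its own explicitly named group with Allow: /, every other bot falls under a default group that blocks only /admin/, /private/, and /checkout/, and the sitemap location is declared at the end.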

What if I don't want AI training on my content?

That's a legitimate choice. If you want to opt out of training data while still appearing in real-time AI search results (Perplexity, Google AI Overviews), some platforms support separate controls:

The practical tradeoff: blocking AI training crawlers is unlikely to affect your near-term AI search visibility significantly, since most AI Overviews and Perplexity results are powered by real-time web search, not training data. The higher-impact decision is making sure the agents behind real-time answers (PerplexityBot, Google-Extended) are allowed.

Don't forget meta robots tags

robots.txt isn't the only place bots can be blocked. Check your page-level meta tags too:

<!-- These tags block AI snippet extraction -->
<meta name="robots" content="nosnippet">
<meta name="robots" content="noindex">
<meta name="googlebot" content="nosnippet">

The nosnippet directive tells crawlers not to extract text snippets from the page, which is exactly what AI engines do to build cited answers. If your page template applies this tag globally (another common CMS default), no AI assistant can quote your content, even if their crawler can access the page.
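
To check a page for these tags without reading the markup by hand, a small scan with Python's built-in HTML parser is usually enough. This is a sketch: the class and function names are invented here, and it only looks for the nosnippet and noindex directives mentioned above.

```python
from html.parser import HTMLParser

BLOCKING = {"nosnippet", "noindex"}        # directives discussed above
ROBOTS_META_NAMES = {"robots", "googlebot"}


class RobotsMetaFinder(HTMLParser):
    """Collect (meta name, directive) pairs that restrict snippets or indexing."""

    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {key: (value or "") for key, value in attrs}
        name = attr.get("name", "").lower()
        if name not in ROBOTS_META_NAMES:
            return
        directives = {d.strip().lower() for d in attr.get("content", "").split(",")}
        for directive in sorted(directives & BLOCKING):
            self.found.append((name, directive))


def blocking_meta_tags(html):
    finder = RobotsMetaFinder()
    finder.feed(html)
    return finder.found


sample = """
<!-- These tags block AI snippet extraction -->
<meta name="robots" content="nosnippet">
<meta name="robots" content="noindex">
<meta name="googlebot" content="nosnippet">
"""
print(blocking_meta_tags(sample))
```

Running it over the rendered HTML of a few key templates (home page, article page, product page) shows quickly whether a global default is applying one of these tags.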

Verify your fix with a free audit

After updating your robots.txt, run a CiteReady GEO audit on your key pages. The AI Crawler Access category will confirm whether all major bots can reach your content, and it will flag any remaining meta-robots issues or X-Robots-Tag HTTP headers that might still be blocking snippet extraction.
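
For a quick manual look at the X-Robots-Tag side, the sketch below inspects a dictionary of response headers for the same two directives. It is not how CiteReady works internally; the function name is hypothetical, and it assumes you have already captured the headers (for example from your browser's network tab).

```python
BLOCKING = {"nosnippet", "noindex"}


def blocking_header_directives(headers):
    """Return blocking directives found in any X-Robots-Tag header value."""
    found = []
    for name, value in headers.items():
        if name.lower() != "x-robots-tag":
            continue
        for part in value.split(","):
            directive = part.strip().lower()
            # Values may carry a bot prefix, e.g. "googlebot: nosnippet".
            if ":" in directive:
                directive = directive.split(":", 1)[1].strip()
            if directive in BLOCKING and directive not in found:
                found.append(directive)
    return found


example_headers = {
    "Content-Type": "text/html; charset=utf-8",
    "X-Robots-Tag": "googlebot: nosnippet, noindex",
}
print(blocking_header_directives(example_headers))
```

A plain dictionary cannot hold repeated headers, so a response that sends X-Robots-Tag more than once would need a list of pairs instead.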

Also see: What is GEO and why your site needs it in 2026, for the full picture on making your site citable across all four GEO dimensions.
