You've published solid content, your Google rankings look healthy, and yet ChatGPT and Perplexity never seem to cite your site. You've checked for JSON-LD, you have a sitemap, so what's going wrong?
In many cases, the answer is sitting at the root of your site: robots.txt. It was probably written years ago with Googlebot in mind, and it may be accidentally blocking AI crawlers that did not exist back then, leaving your site invisible in AI-generated answers.
Which user-agents do AI crawlers use?
AI search platforms each deploy their own crawlers with distinct user-agent strings. Here are the ones that matter most in 2026:
| User-Agent | Platform | Used for |
|---|---|---|
| GPTBot | OpenAI (ChatGPT) | Training data (search and browsing use the separate OAI-SearchBot and ChatGPT-User agents) |
| ClaudeBot | Anthropic (Claude) | Training data |
| PerplexityBot | Perplexity AI | Real-time search results |
| Google-Extended | Google (Gemini) | Control token for Gemini training and grounding; AI Overviews rely on the regular Googlebot |
| Applebot-Extended | Apple (Siri, Apple Intelligence) | Summaries and on-device AI |
| cohere-ai | Cohere | Enterprise AI search |
| YouBot | You.com | AI search index |
None of these existed when most robots.txt files were written. The default approach of most web servers, and of many CMSs, doesn't account for them at all.
The three most common robots.txt mistakes
Mistake 1: Blocking all unknown bots
A common hardening pattern disallows any user-agent not explicitly whitelisted:
# Common "security" pattern that blocks AI crawlers
User-agent: *
Disallow: /
User-agent: Googlebot
Allow: /
User-agent: Bingbot
Allow: /
This blocks everything except Googlebot and Bingbot, including every AI crawler above. ChatGPT, Perplexity, and Claude can't read a single page of your site.
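To see the effect concretely, here is a minimal sketch that feeds the rules above to Python's standard-library `urllib.robotparser` and asks which crawlers may fetch a page. The example.com URL is a placeholder, and real crawlers may interpret edge cases differently from this parser.

```python
from urllib.robotparser import RobotFileParser

# The "whitelist" pattern from Mistake 1, reproduced as text.
RULES = """\
User-agent: *
Disallow: /
User-agent: Googlebot
Allow: /
User-agent: Bingbot
Allow: /
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

# Placeholder URL; any path gives the same answer under these rules.
page = "https://example.com/blog/post"
results = {
    agent: parser.can_fetch(agent, page)
    for agent in ("Googlebot", "Bingbot", "GPTBot", "ClaudeBot", "PerplexityBot")
}
for agent, allowed in results.items():
    print(f"{agent}: {'allowed' if allowed else 'blocked'}")
```

Only the two explicitly named crawlers get through; every other agent falls back to the catch-all group and is refused.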
Mistake 2: A legacy blanket disallow
The second pattern is more common and more subtle. Many CMSs (including older WordPress, Drupal, and various SaaS builders) generate a default robots.txt with:
# Placed there during development, never cleaned up
User-agent: *
Disallow: /
This was meant to block all bots during staging. Someone forgot to remove it for production, or the CMS regenerated it after a deployment. The result: every AI crawler is blocked on every visit.
Check this right now: open https://yourdomain.com/robots.txt in a browser. If you see Disallow: / under User-agent: * and no explicit Allow rules for the AI agents above, your site is invisible to AI search.
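The same check can be scripted. The sketch below is one possible approach, not a definitive audit: `blocked_ai_agents` and `fetch_robots_txt` are hypothetical helper names, the agent list is copied from the table in this article, and the evaluation relies on Python's `urllib.robotparser`, which may differ from a given crawler's own parser.

```python
from urllib.request import urlopen
from urllib.robotparser import RobotFileParser

# User-agent tokens taken from the table in this article.
AI_AGENTS = [
    "GPTBot", "ClaudeBot", "PerplexityBot",
    "Applebot-Extended", "cohere-ai", "YouBot",
]


def blocked_ai_agents(robots_txt: str, path: str = "/") -> list[str]:
    """Return the AI agents that may not fetch `path` under these rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [agent for agent in AI_AGENTS if not parser.can_fetch(agent, path)]


def fetch_robots_txt(domain: str) -> str:
    """Download robots.txt for a domain (hypothetical helper, needs network)."""
    with urlopen(f"https://{domain}/robots.txt", timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


# A leftover staging file blocks every agent in the list.
staging_leftover = "User-agent: *\nDisallow: /\n"
print(blocked_ai_agents(staging_leftover))
```

Against a live site, the call would look like `blocked_ai_agents(fetch_robots_txt("yourdomain.com"))`; an empty list means none of the listed agents is refused at the site root.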
Mistake 3: Blocking crawl-heavy bots by name
Some server admins block aggressive crawlers by name to protect bandwidth. The problem: early AI crawlers (particularly GPTBot in 2023) were aggressive, and blocking rules written then are still active today. OpenAI has since added rate controls, but the disallow rules remain.
# Written in 2023, still killing your AI visibility in 2026
User-agent: GPTBot
Disallow: /
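Stale per-bot blocks like this are easy to miss in a long file. As a rough sketch, the hypothetical `blanket_blocked_agents` function below scans robots.txt text and lists every named agent (not `*`) whose group contains `Disallow: /`. It only flags groups for manual review; it does not weigh any Allow lines in the same group. `ExampleBot` is a made-up name for illustration.

```python
def blanket_blocked_agents(robots_txt: str) -> list[str]:
    """List named user-agents whose group contains a 'Disallow: /' rule."""
    blocked: list[str] = []
    agents: list[str] = []   # user-agents of the group currently being read
    in_rules = False         # True once the current group has a rule line

    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()   # drop comments and whitespace
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()

        if field == "user-agent":
            if in_rules:     # a rule was already seen, so a new group starts
                agents, in_rules = [], False
            agents.append(value)
        elif field in ("allow", "disallow"):
            in_rules = True
            if field == "disallow" and value == "/":
                blocked.extend(
                    a for a in agents if a != "*" and a not in blocked
                )
    return blocked


sample = """\
User-agent: *
Disallow: /admin/

# Written in 2023
User-agent: GPTBot
Disallow: /

User-agent: ExampleBot
User-agent: PerplexityBot
Disallow: /

Sitemap: https://example.com/sitemap.xml
"""
print(blanket_blocked_agents(sample))
```

Any AI crawler that shows up in the output is worth a second look: the rule may date from a bandwidth problem that no longer exists.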
The fix: allow AI crawlers explicitly
Here is a robots.txt template that balances open AI crawler access with reasonable protections:
# Allow major AI crawlers explicitly
User-agent: GPTBot
Allow: /
User-agent: ClaudeBot
Allow: /
User-agent: PerplexityBot
Allow: /
User-agent: Google-Extended
Allow: /
User-agent: Applebot-Extended
Allow: /
User-agent: cohere-ai
Allow: /
User-agent: YouBot
Allow: /
# Default rules for all other bots
User-agent: *
Disallow: /admin/
Disallow: /private/
Disallow: /checkout/
Allow: /
# Point to your sitemap
Sitemap: https://yourdomain.com/sitemap.xml
Key decisions in this template:
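- Each AI crawler gets its own group. A crawler obeys the most specific group that matches its user-agent, so these rules override the catch-all User-agent: * block wherever they appear in the file.
- Sensitive paths (/admin/, /checkout/) are disallowed in the default catch-all. Crawlers with their own group do not inherit those rules, so repeat the Disallow lines in their groups if those paths must stay off-limits to them too.
- The Sitemap: directive is present, which helps AI crawlers discover all your public URLs efficiently.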
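How group selection plays out can be checked with a shortened version of the template. This is a sketch using Python's `urllib.robotparser`; `SomeOtherBot` is a made-up agent standing in for any crawler without its own group, and other parsers may resolve edge cases differently.

```python
from urllib.robotparser import RobotFileParser

# Shortened version of the template: one AI group plus the catch-all.
TEMPLATE = """\
User-agent: GPTBot
Allow: /

User-agent: *
Disallow: /admin/
Disallow: /private/
Disallow: /checkout/
Allow: /
"""

parser = RobotFileParser()
parser.parse(TEMPLATE.splitlines())

checks = {
    (agent, path): parser.can_fetch(agent, path)
    for agent in ("GPTBot", "SomeOtherBot")
    for path in ("/blog/", "/admin/")
}
for (agent, path), allowed in checks.items():
    print(f"{agent} {path}: {'allowed' if allowed else 'blocked'}")
```

Note the second result: GPTBot matches its own group, which contains only `Allow: /`, so the `/admin/` restriction in the catch-all never applies to it. If sensitive paths should be closed to the named crawlers as well, the Disallow lines need to be repeated inside each of their groups.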
What if I don't want AI training on my content?
That's a legitimate choice. If you want to opt out of training data while still appearing in real-time AI search results (Perplexity, Google AI Overviews), some platforms support separate controls:
- OpenAI: GPTBot collects training data. OpenAI documents separate agents for search (OAI-SearchBot) and user-initiated browsing (ChatGPT-User), so you can block GPTBot while leaving those allowed.
- Google: Google-Extended is a control token for Gemini and Vertex AI rather than a separate crawler. You can allow Googlebot (for search) while blocking Google-Extended (for AI training and grounding). AI Overviews draw on the regular Googlebot index, so blocking Google-Extended does not remove you from them.
- Anthropic: ClaudeBot is primarily a training crawler. Blocking it does not affect whether Claude cites your site via real-time search (Claude uses a separate pipeline for that).
The practical tradeoff: blocking AI training crawlers is unlikely to affect your near-term AI search visibility significantly, since most AI Overviews and Perplexity results are powered by real-time web search, not training data. The higher-impact decision is making sure the crawlers behind real-time search (PerplexityBot, and Googlebot for AI Overviews) are allowed.
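As an illustration of that tradeoff, the sketch below generates a robots.txt that refuses the crawlers described above as training crawlers while keeping a real-time search crawler allowed, then verifies the result. `build_robots_txt` is a hypothetical helper, and the split between blocked and allowed agents is an assumption you should adapt to each vendor's current documentation.

```python
from urllib.robotparser import RobotFileParser


def build_robots_txt(blocked: list[str], allowed: list[str]) -> str:
    """Emit one group per agent: 'Disallow: /' if blocked, 'Allow: /' if allowed."""
    lines: list[str] = []
    for agent in blocked:
        lines += [f"User-agent: {agent}", "Disallow: /", ""]
    for agent in allowed:
        lines += [f"User-agent: {agent}", "Allow: /", ""]
    # Everyone else keeps full access in this sketch.
    lines += ["User-agent: *", "Allow: /"]
    return "\n".join(lines) + "\n"


# Assumed split, following this article's description of each crawler.
robots_txt = build_robots_txt(
    blocked=["GPTBot", "ClaudeBot"],
    allowed=["PerplexityBot"],
)
print(robots_txt)

# Verify the generated rules with the standard-library parser.
parser = RobotFileParser()
parser.parse(robots_txt.splitlines())
```

Generating the file from two explicit lists keeps the policy readable: moving an agent from one list to the other is the whole change when your stance on a crawler shifts.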
Don't forget meta robots tags
robots.txt isn't the only place bots can be blocked. Check your page-level meta tags too:
<!-- These tags block snippet extraction (nosnippet) or indexing altogether (noindex) -->
<meta name="robots" content="nosnippet">
<meta name="robots" content="noindex">
<meta name="googlebot" content="nosnippet">
The nosnippet directive tells crawlers not to extract text snippets from the page, which is exactly what AI engines do to build cited answers. If your page template applies this tag globally (another common CMS default), AI assistants that honor it will not quote your content, even if their crawler can access the page.
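A page template can be checked for these tags with a few lines of code. The following is a minimal sketch built on Python's standard `html.parser`; the class and function names are hypothetical, and it only looks at `robots` and `googlebot` meta tags with the `nosnippet` and `noindex` directives shown above.

```python
from html.parser import HTMLParser

BOT_META_NAMES = {"robots", "googlebot"}
BLOCKING_DIRECTIVES = {"nosnippet", "noindex"}


class MetaRobotsFinder(HTMLParser):
    """Collect (meta name, directive) pairs that restrict snippets or indexing."""

    def __init__(self) -> None:
        super().__init__()
        self.found: list[tuple[str, str]] = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {key: (value or "") for key, value in attrs}
        name = attr.get("name", "").lower()
        if name not in BOT_META_NAMES:
            return
        # The content attribute may hold several comma-separated directives.
        for directive in attr.get("content", "").split(","):
            directive = directive.strip().lower()
            if directive in BLOCKING_DIRECTIVES:
                self.found.append((name, directive))


def blocking_meta_directives(html: str) -> list[tuple[str, str]]:
    finder = MetaRobotsFinder()
    finder.feed(html)
    return finder.found


page = """
<html><head>
<meta name="robots" content="nosnippet, max-image-preview:large">
<meta name="googlebot" content="noindex">
<meta name="viewport" content="width=device-width">
</head><body>Hello</body></html>
"""
print(blocking_meta_directives(page))
```

Running this over the rendered HTML of a few representative pages shows quickly whether a global template is applying the tag everywhere.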
Verify your fix with a free audit
After updating your robots.txt, run a CiteReady GEO audit on your key pages. The AI Crawler Access category will confirm whether all major bots can reach your content, and flag any remaining meta-robots issues or X-Robots-Tag HTTP headers that might still be blocking snippet extraction.
Also see: What is GEO and why your site needs it in 2026, for the full picture on making your site citable across all four GEO dimensions.