Resources

AI Crawlers and you:
How to block or allow them using robots.txt

The age of AI is here: crawlers are actively reading, interpreting, and learning from your website. This presents every site owner with a critical choice: do you want to embrace AI discovery, or protect your content from being used in language-model training?

Fortunately, you have control. Whether you want to optimize your site for AI visibility or block access entirely, the primary tool for the job is a simple text file: robots.txt. This guide will provide you with the user agents and templates you need to implement your AI strategy today.

Understanding Your Control Toolkit

What is robots.txt?

At its core, robots.txt is a public text file that lives at the root of your website. It gives instructions to web crawlers — including AI bots — about which pages and files they are allowed or forbidden to request. It is the web's de facto standard for communicating with crawlers, formalized as the Robots Exclusion Protocol (RFC 9309). For a deep dive, refer to the official documentation from Google's Developer Center.
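To make the syntax concrete, here is a minimal robots.txt showing the basic directives. The paths are placeholders — substitute your own:

```txt
# Rules for every crawler
User-agent: *
Disallow: /private/      # forbid this directory
Allow: /private/press/   # but permit this subdirectory
```

Each group starts with one or more User-agent lines naming the crawler(s) the rules apply to, followed by Allow and Disallow lines matched against URL paths.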

What about llms.txt and ai.txt?

These are newer, experimental files proposed to give more specific instructions to AI models. While robots.txt is the main "gatekeeper," you could use llms.txt to point AI models to content specifically optimized for them, creating a better-tailored experience. For now, the best practice is to ensure your robots.txt rules are clear, as it is the most widely respected standard.
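For illustration only: the llms.txt proposal (llmstxt.org) suggests serving a plain markdown file at /llms.txt that summarizes your site and links to LLM-friendly content. The structure below is a sketch of that proposed format, with placeholder names and URLs — the spec is experimental and may change:

```markdown
# Your Site Name

> One-sentence summary of what this site offers.

## Documentation

- [Getting started](https://www.your-website.com/docs/start): quick setup guide
- [API reference](https://www.your-website.com/docs/api): endpoints and parameters
```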

Is this an all-or-nothing choice?

Not at all. You have granular control. You can choose to block a specific AI crawler while allowing others. You can also direct one bot to your main content while pointing another to a different, AI-optimized section of your site.
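For example, a robots.txt can block one AI crawler outright while steering another toward a dedicated section (the /ai/ path here is a hypothetical placeholder):

```txt
# Block OpenAI's training crawler entirely
User-agent: GPTBot
Disallow: /

# Let Perplexity's crawler see only the AI-optimized section
User-agent: PerplexityBot
Allow: /ai/
Disallow: /
```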

Important: Is this guaranteed to work?

robots.txt operates on an honor system. Think of it as a "No Trespassing" sign on your lawn, not a locked gate. Reputable crawlers (like those from Google, OpenAI, and other major tech companies) will respect your rules. However, less scrupulous bots may ignore them. It is your first and most important line of defense, but not a foolproof security measure.
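You can check how a rule-abiding crawler should interpret your file using Python's standard-library urllib.robotparser. The rules and URLs below are illustrative placeholders:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: block GPTBot everywhere; keep /admin/ off-limits to all bots
rules = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: /admin/
""".splitlines()

rp = RobotFileParser()
rp.parse(rules)

print(rp.can_fetch("GPTBot", "https://example.com/blog/post"))        # False
print(rp.can_fetch("SomeOtherBot", "https://example.com/blog/post"))  # True
print(rp.can_fetch("SomeOtherBot", "https://example.com/admin/"))     # False
```

This tells you what compliant crawlers should do; it cannot tell you what a misbehaving bot will actually do.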

Known AI & LLM Crawler User Agents

Below are the user-agent strings for the major AI crawlers, plus general-purpose bots whose data is known to feed LLMs. Most providers publish official documentation for their crawlers.

Last Updated: April 17, 2026

Primary AI & LLM Crawlers

  • GPTBot (OpenAI)
  • Google-Extended (Google)
  • ClaudeBot (Anthropic)
  • Amazonbot (Amazon)
  • Applebot-Extended (Apple)
  • meta-externalagent (Meta)
  • cohere-ai (Cohere)
  • PerplexityBot (Perplexity)
  • YouBot (You.com)
  • Bytespider (ByteDance)
  • Diffbot (Diffbot)
  • CCBot (Common Crawl)

General Bots That Can Feed LLMs

  • Bingbot (Microsoft)
  • Googlebot (Google)
  • Applebot (Apple)

Our Own Crawler

PaceghostBot is our crawler, used only on-demand when a user requests a site audit. It helps power our AI Readiness reports. It fully respects robots.txt and never performs unsolicited crawling. PaceghostBot documentation →

robots.txt Templates for AI Governance

Template 1: Block All Known AI Crawlers

Use this template if your goal is to prevent your content from being used by major AI models for training and summarization.

robots.txt
# === BLOCK ALL MAJOR AI CRAWLERS ===
# Template provided by paceghost.io - The AI Readiness Platform
# Last updated: April 17, 2026

User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Amazonbot
Disallow: /

User-agent: Applebot-Extended
Disallow: /

User-agent: meta-externalagent
Disallow: /

User-agent: cohere-ai
Disallow: /

User-agent: PerplexityBot
Disallow: /

User-agent: YouBot
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: Diffbot
Disallow: /

User-agent: CCBot
Disallow: /

# == Your Standard Rules ==
User-agent: *
Disallow: /admin/
Disallow: /private/

# == Sitemap Location ==
Sitemap: https://www.your-website.com/sitemap.xml

Template 2: Allow All Known AI Crawlers

Use this template if your goal is to maximize your visibility and ensure your content is discoverable by AI search and answer engines.

robots.txt
# === ALLOW ALL MAJOR AI CRAWLERS ===
# Template provided by paceghost.io - The AI Readiness Platform
# Last updated: April 17, 2026

# This template ensures AI crawlers can access your site while still protecting
# sensitive areas (like admin panels). By default, crawlers are allowed,
# so we only need to specify what to disallow.

# == Default Rules for All Crawlers (including AI) ==
User-agent: *
Disallow: /admin/
Disallow: /dashboard/
Disallow: /login/

# == Sitemap Location ==
Sitemap: https://www.your-website.com/sitemap.xml

What's Next?

Defining your rules is a critical first step. But the AI landscape changes daily. New crawlers appear constantly, and you're left wondering:

  • Is my robots.txt file still effective?
  • Are bots respecting my rules?
  • How is AI actually interpreting my content?
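One way to start answering the second question is to scan your server access logs for the user agents listed above. Here is a minimal sketch in Python; the sample log lines are fabricated for illustration, and the bot list is the one from this guide (extend it as new crawlers appear):

```python
# User-agent substrings to look for (from the list above; extend as needed)
AI_BOTS = [
    "GPTBot", "Google-Extended", "ClaudeBot", "Amazonbot",
    "Applebot-Extended", "meta-externalagent", "cohere-ai",
    "PerplexityBot", "YouBot", "Bytespider", "Diffbot", "CCBot",
]

def ai_bot_hits(log_lines):
    """Count requests per AI crawler across access-log lines."""
    counts = {}
    for line in log_lines:
        lowered = line.lower()
        for bot in AI_BOTS:
            if bot.lower() in lowered:
                counts[bot] = counts.get(bot, 0) + 1
    return counts

# Fabricated combined-log-format lines for illustration
sample = [
    '203.0.113.7 - - [17/Apr/2026:10:00:00 +0000] "GET /blog HTTP/1.1" 200 512 '
    '"-" "Mozilla/5.0; compatible; GPTBot/1.2; +https://openai.com/gptbot"',
    '198.51.100.4 - - [17/Apr/2026:10:01:00 +0000] "GET / HTTP/1.1" 200 1024 '
    '"-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"',
]
print(ai_bot_hits(sample))  # {'GPTBot': 1}
```

If a bot you disallowed keeps showing up against paths it should not fetch, it is ignoring your robots.txt, and you will need server-level blocking instead.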

That's where Paceghost comes in. We're building the toolkit to turn this manual checklist into an automated, effortless dashboard.

Stay in the loop

The AI crawler landscape moves fast.

We update this list as new bots emerge. Subscribe to get notified when we publish new resources, templates, and guides on AI readiness.