Why does robots.txt matter for AI Search if AI systems do not always respect it?

Major AI vendors publish documented user-agent strings and stated policies on respecting robots.txt. OpenAI, Anthropic, Google, Microsoft, and Perplexity all state that their bots honor robots.txt directives. Some smaller scrapers and unofficial agents ignore robots.txt, but the major retrieval surfaces that produce the citations brands actually care about do respect it. A misconfigured robots.txt is a self-inflicted citation outage: the major bots cannot reach owned-domain content, no retrieval occurs, and no citations are earned.

What is the training-versus-retrieval bifurcation?

AI vendors split their bots into two categories. Training bots collect content to train future model versions; retrieval bots fetch content in real time when users ask questions. The same vendor often runs both, under different user-agent strings. OpenAI's GPTBot collects training data; OpenAI's OAI-SearchBot and ChatGPT-User serve real-time retrieval. Anthropic's anthropic-ai collects training data; ClaudeBot serves retrieval. The strategic decision: most brands want to allow retrieval (the citation surface) and may want to opt out of training collection. The decision is split per bot.

Which user agents should I always allow for AI Search citations?

Five retrieval-focused bots produce the citations brands optimize for. OAI-SearchBot (OpenAI's search index for ChatGPT). ChatGPT-User (ChatGPT browsing on behalf of users in real time). PerplexityBot (Perplexity's retrieval crawler). ClaudeBot (Anthropic's retrieval crawler). Googlebot (Google's search index crawler; AI Overviews inherits from it). Blocking any of these caps the brand's AI Search citation surface from that platform. Google-Extended is a separate training control, not a retrieval bot, and belongs to the training decision.

How do I verify which bots are actually accessing my site?

Three verification approaches. First, server logs: filter access logs by user-agent to confirm bot visits. Second, curl with explicit user-agent: 'curl -A GPTBot https://yoursite.com/path' simulates how a crawler sees the page. Third, AI vendor verification tools: OpenAI, Anthropic, and Perplexity publish reference documentation and CIDR ranges for verifying bot authenticity. Spot-check verification monthly to catch regressions when site changes affect crawling.

Does Cloudflare or other CDN bot management interfere with AI crawler access?

Yes, increasingly. Cloudflare introduced AI bot blocking by default for some plans in 2024 and 2025. Other CDN providers added similar controls. The blocking can produce 403 or 429 responses to AI bots even when robots.txt permits them, because the CDN layer enforces additional rules. Verification needs to test at the CDN edge, not just at the origin server, and the configuration may require allowlist additions at the CDN dashboard separate from robots.txt.

How often should I audit and update robots.txt?

Quarterly at minimum. New AI bots emerge regularly as new platforms launch or existing platforms split their bots into more granular user agents (training vs retrieval, content vs index). Quarterly audits catch new bots that need explicit handling and verify existing rules still match intent. Brands operating on a slower cadence frequently discover much later that they failed to allow a major new bot that competitors had been allowing all along.

AI SEO: Crawler Access

AI Search pulls content from owned pages through crawlers run by each major AI vendor. The crawlers respect the robots.txt rules you publish. Bad rules cause self-inflicted citation outages: major AI systems cannot reach your content at all. The split between training data and real-time retrieval matters because most brands want to allow retrieval bots, the real-time citation surface, while deciding on training bots separately. This chapter covers the user agents that matter most. The training-vs-retrieval split comes next, followed by the wholesale-block trap that catches many brands and the verification workflow operators actually run. Two sections then cover the CDN edge issues that override robots.txt and the worked configurations for permissive, balanced, and restrictive postures. Most of these fixes are fast once you find them.

Why This Technique Matters

The simplest failure in AI Search is the easiest to fix. robots.txt matters. A bad rule blocks AI crawlers from your pages, no content reaches the retrieval index, and no citations occur, no matter how strong your Mentions, Evidence, Relevance, or Entity work is. You earn zero AI Search citations. The technical layer makes the rest of the work invisible.

Most brands that audit their robots.txt find blocking rules they did not expect. Old CMS defaults blocked broad classes of bots without splitting AI crawlers from scrapers, staging rules sometimes leak to production, legacy security policies block bots without thought, and WordPress plugins write robots.txt rules without operator input. A crawler-access audit is a routine part of any technical SEO engagement, and it is the first place to look when citations stall for no obvious content reason. Fix this and citations lift once the bots regain access.

The decision is hard because of the training-vs-retrieval split. AI vendors run different crawlers for different jobs. Retrieval bots feed the real-time citation surface. Training bots collect content for future models, which compounds long-term entity recognition but does not produce current citations. Allowing training bots and blocking them are both valid choices. What is not valid is blocking everything by default. Allowing everything without thought is also a mistake.

The technical layer has more depth than it looks. Cloudflare and other CDN providers added bot management in 2024 and 2025. These can block AI crawlers at the edge. Origin robots.txt does not matter at that point. CMS hosts handle robots.txt in their own ways. WordPress, Squarespace, Wix, and Shopify each have platform-specific overrides. A config that works on a static site may not work on a CDN-fronted dynamic platform. You need to verify at the real production edge. Origin checks alone are not enough.

The Training-Versus-Retrieval Bifurcation

AI vendors split their crawlers into two groups. Each group has its own user-agent strings. You need to understand the split before you make access decisions.

Training Bots

Training bots collect content to train future versions of the model. The content goes into training datasets. Those datasets shape what the model knows about brands, topics, and operators. The benefit of allowing training bots is clear. Future models know more about your brand. That compounds long-term entity recognition. It also reduces your reliance on real-time retrieval. The cost is also clear. Your content feeds a vendor's commercial training data. Some brands view this as a competitive concession.

Major training bots in 2026:

GPTBot. OpenAI's training-data bot. Documented at openai.com/gptbot. Collects content for future GPT versions.
anthropic-ai. Anthropic's training data crawler. The user agent string is lowercase with a hyphen. This is the legacy training-focused agent.
Google-Extended. Google's control token for Gemini training data. It is not the same as Googlebot, and it exists as a robots.txt token only, with no separate crawler user-agent string. Googlebot handles regular search indexing, which is also what AI Overviews draws on. Many brands allow Googlebot but block Google-Extended. They want search visibility but not AI training inclusion.
CCBot. Common Crawl bot. Common Crawl gives many AI vendors their core web corpus. Blocking CCBot cuts your training inclusion across multiple vendors.
Bytespider. ByteDance training bot. Affects TikTok and ByteDance AI products.
Applebot-Extended. Apple's training data crawler. Different from the standard Applebot search crawler. Affects future Apple Intelligence training.
Bingbot. The standard Bing crawler. Handles regular Bing indexing. Bing has not split into training-vs-retrieval bots yet. Microsoft has signaled a future split.

Retrieval Bots

Retrieval bots fetch content in real time. They run when users send queries to the AI system. The content goes into the immediate response. It is not always kept for training. Allow retrieval bots and you earn direct citation share on current queries. That is the citation outcome most brands want. Block them and the AI system cannot cite you in real-time responses.

Major retrieval bots in 2026:

OAI-SearchBot. OpenAI's retrieval bot for ChatGPT search. Pulls content for ChatGPT's web-aware responses. The single most important bot to allow for ChatGPT citations.
ChatGPT-User. A separate OpenAI bot for ChatGPT browsing. It runs when users trigger browsing in real time. Different traffic pattern from OAI-SearchBot. Same citation function.
ClaudeBot. Anthropic's primary crawler. Not the same as anthropic-ai. The anthropic-ai agent is the legacy training agent. ClaudeBot is the active retrieval and indexing crawler in 2026.
PerplexityBot. Perplexity's retrieval crawler. The single most important bot to allow for Perplexity citations.
Perplexity-User. A separate agent Perplexity uses for user-triggered fetches. Similar function to ChatGPT-User.
Applebot. Apple's search index crawler. Not the same as Applebot-Extended. Affects Spotlight and Siri results.
Googlebot. Standard Google search index crawler. Core for organic ranking. AI Overviews inherits from it.

These bots feed the real-time citation surface. Block any of them and you cap your citation share from that platform. The strategy is simple if you want AI Search citations: allow retrieval bots. The hard part is making sure the bots are allowed at the technical level.
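
The split above can be written down as a small lookup, which is handy when tagging log lines or reviewing rules. The sketch below is illustrative only: the token lists are copied from this chapter, not from vendor documentation, and `classify_user_agent` is a hypothetical helper name. Bingbot is left out because the chapter describes it as not yet split.

```python
# Tokens copied from the lists in this chapter; refresh them from vendor docs.
TRAINING_TOKENS = ["GPTBot", "anthropic-ai", "Google-Extended", "CCBot",
                   "Bytespider", "Applebot-Extended"]
RETRIEVAL_TOKENS = ["OAI-SearchBot", "ChatGPT-User", "ClaudeBot",
                    "PerplexityBot", "Perplexity-User", "Applebot", "Googlebot"]


def classify_user_agent(user_agent: str) -> str:
    """Return 'training', 'retrieval', or 'unknown' for a raw user-agent string."""
    ua = user_agent.lower()
    labelled = [(t, "training") for t in TRAINING_TOKENS] + \
               [(t, "retrieval") for t in RETRIEVAL_TOKENS]
    # Check longer tokens first so 'Applebot-Extended' is not read as 'Applebot'.
    for token, label in sorted(labelled, key=lambda pair: -len(pair[0])):
        if token.lower() in ua:
            return label
    return "unknown"
```

Substring matching is a deliberate simplification. It is good enough for sorting log traffic into buckets, but it does not prove a request really came from the named vendor.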

What an Allowed Crawler Can Actually Read

Allowing a bot gets it to your page. It does not guarantee the bot can read what is there. Two blind spots decide whether your content survives the trip from fetch to model: JavaScript that never runs, and JSON-LD that never gets parsed. Both are easy to miss, because the page looks complete to you, in a browser, with everything rendered.

JavaScript. The major standalone AI crawlers fetch your HTML but do not execute JavaScript. A joint Vercel and MERJ analysis tracked more than 500 million GPTBot fetches and found zero evidence of JavaScript execution; even when GPTBot downloaded JavaScript files, which it did about 11.5% of the time, it never ran them (Vercel, The rise of the AI crawler). Anthropic's ClaudeBot and Perplexity behave the same way. That makes browser-rendered content invisible to them: if a product description, a positioning statement, or a category definition only appears after the browser runs a script, those crawlers retrieve an empty shell. Google's Gemini renders through Googlebot and Applebot renders too, so this is platform-specific, but the models most brands want citations from do not render. Move priority content into server-rendered HTML so it is present in the raw response. Test it by loading the page with JavaScript disabled; if the content disappears, those crawlers see nothing.

JSON-LD. The mirror-image blind spot. Search engines read your JSON-LD, but the language model does not parse it when it generates an answer (Chapter 9). Schema is a discovery-layer signal that helps engines disambiguate your entity; it is not a channel for feeding facts to the model. So a fact that has to reach the AI cannot live only in structured data. If it matters, it belongs in the visible, server-rendered text of the page, where a crawler that neither runs scripts nor parses schema can still read it.

The crawler reads your server-rendered text. Anything that needs JavaScript to appear, or lives only in JSON-LD, is invisible to the models that matter most.

The rule that falls out of both: render your critical facts as plain, server-side text, and treat JavaScript and schema as enhancements layered on top, never as the delivery mechanism for the facts themselves. You can do every other thing in this chapter correctly, allow every retrieval bot, pass every robots.txt check, and still be invisible because the content only exists after a script runs, or only inside a block the model never reads.
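
One way to approximate the JavaScript-disabled test in code is to look only at the text present in the raw HTML response. This is a minimal sketch using Python's standard library. The helper name `missing_from_raw_html` and the phrase list are hypothetical, and skipping every script block (JSON-LD included) is a simplifying assumption that mirrors the two blind spots described above.

```python
from html.parser import HTMLParser


class _VisibleText(HTMLParser):
    """Collects text nodes, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def missing_from_raw_html(raw_html: str, key_phrases: list[str]) -> list[str]:
    """Return the phrases that do not appear in the server-rendered text."""
    parser = _VisibleText()
    parser.feed(raw_html)
    parser.close()
    text = " ".join(" ".join(parser.chunks).split()).lower()
    return [p for p in key_phrases if p.lower() not in text]
```

Feed it the body of an unrendered fetch of the page and the handful of facts you most need cited. Any phrase it returns exists only after a script runs or only inside structured data.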

The Wholesale-Block Trap

The most common failure is a robots.txt that blocks everything by default. Three patterns cause the trap.

Legacy CMS Defaults

Some CMS platforms published default robots.txt rules with broad blocking patterns. WordPress, Drupal, and older platforms vary in default behavior. Some block specific directories. Some have no rules at all. The "User-agent: * Disallow: /" pattern shows up in some hosting setups by default. It can leak into production if you do not catch it.

The audit: fetch robots.txt from your production domain (https://yourdomain.com/robots.txt). Check for any Disallow: / rules under generic User-agent: * blocks. These rules block every bot. AI crawlers included. The fix is simple. Remove the wholesale block. Replace it with selective rules. Allow the bots you want. Block only the bots you have a reason to exclude.
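
The check for a wholesale block can be scripted. The sketch below is a simplified reading of robots.txt grouping, not a full parser, and `has_wholesale_block` is a hypothetical helper name. It flags any `User-agent: *` group that contains a bare `Disallow: /`.

```python
def has_wholesale_block(robots_txt: str) -> bool:
    """True if a 'User-agent: *' group contains a bare 'Disallow: /'."""
    agents, in_rules, star_blocked = [], False, False
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()   # drop comments
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if in_rules:                      # a new group starts after rule lines
                agents, in_rules = [], False
            agents.append(value)
        elif field in ("allow", "disallow"):
            in_rules = True
            if field == "disallow" and value == "/" and "*" in agents:
                star_blocked = True
    return star_blocked
```

A deliberate default-deny file, where named bots are allowed above a blanket block, will also return True here. In that case the flag is expected and the named groups carry the access.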

Staging-to-Production Misconfiguration

Staging environments often use blanket-block robots.txt rules. The rules keep search engines from indexing pre-production content. The rules sometimes reach production by mistake. Deploys that do not split environments are the cause. Your production site ends up with a staging-style robots.txt. It blocks all crawlers.

How to spot it. Traffic from organic search drops. AI citations drop. The team cannot find a content change to explain it. The cause is usually a robots.txt mistake from a recent deploy. To verify, pull robots.txt history from your deploy logs. Compare it against expected production state. The fix is to redeploy the correct robots.txt. Then verify Google Search Console and AI bot logs show the bots returning.

WordPress Plugin Overrides

SEO plugins on WordPress rewrite robots.txt based on plugin settings. Operators can set these wrong. Yoast, Rank Math, and All in One SEO all have features that change robots.txt. Toggling a setting without knowing the effect can enable broad blocking.

How to spot it. The robots.txt file on disk shows different content than the production response. The plugin generates the response on the fly. The fix is to review plugin settings. Turn off any features that block bots you want to allow.

Verification Workflows

Three verification methods cover the practical work of checking crawler access.

Method	What it tests	What it misses
Direct robots.txt inspection	Reads the rules at the source. Maps each bot you care about against the Allow and Disallow blocks to confirm the access you want.	CDN-layer overrides, dynamic robots.txt that returns different content to different user agents, and Cloudflare-style bot management that produces 403 responses no matter what robots.txt says.
User-agent simulation	The strongest test. A curl request with an explicit bot user-agent shows the real production behavior the bot would hit. A 200 with content is a success; a 403 or 429 means blocking, often at the CDN layer.	A point-in-time check only. It catches the state at the moment you run it, not a sustained crawl-rate drop that server logs would surface over time.
Server log analysis	Shows which bots actually crawl, at what rate, and on which URLs. Each major bot should appear at a non-zero rate matching its published behavior; high error rates point to blocking somewhere.	Origin logs alone undercount on CDN-fronted sites. The CDN handles many bot requests without forwarding them, so logs must come from the CDN edge.

Direct robots.txt Inspection

Fetch the robots.txt directly. Use curl or your browser at https://yourdomain.com/robots.txt. Read the content. Note every User-agent block. Note the Allow and Disallow rules. Map each bot you care about against the rules. The retrieval bots above plus any training bots that matter. Confirm each bot has the access you want.

Direct inspection covers simple cases. It misses several things. CDN-layer overrides. Dynamic robots.txt that returns different content to different user agents. Cloudflare-style bot management. That kind of management produces 403 responses no matter what robots.txt says.
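
Mapping bots against the rules by hand gets tedious, so here is a minimal sketch using Python's standard `urllib.robotparser`. The bot list is an assumption drawn from this chapter and `access_matrix` is a hypothetical helper name. Note that Python's parser applies the first matching rule in file order and matches agent names by substring, which may differ from how a given crawler reads the same file. Treat the output as a first pass.

```python
from urllib.robotparser import RobotFileParser

# Bot tokens taken from this chapter; adjust to the bots you care about.
BOTS = ["OAI-SearchBot", "ChatGPT-User", "ClaudeBot", "PerplexityBot",
        "Googlebot", "GPTBot", "Google-Extended", "CCBot"]


def access_matrix(robots_txt: str, bots=BOTS, path="/") -> dict:
    """Map each bot token to True (allowed) or False (blocked) for one path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {bot: parser.can_fetch(bot, path) for bot in bots}
```

Pass in the text of your production robots.txt and compare the resulting dictionary against the access you intended for each bot.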

User-Agent Simulation

Simulate a bot fetch with an explicit user-agent: curl -A "GPTBot/1.0" https://yourdomain.com/somepath. The response status, headers, and body show how the server treats that bot. A 200 response with content is a success. A 403 or 429 means blocking. That blocking often happens at the CDN layer, not robots.txt. A 200 with empty body or a redirect means the server treats the bot differently from a browser.

User-agent simulation is the strongest test. It checks the real production behavior the bot would hit. Run it for each major bot you care about. That means OAI-SearchBot, ClaudeBot, PerplexityBot, Googlebot, Google-Extended, and GPTBot. Run it on at least three URLs. A homepage, a content page, and a product page. The variation across bots and URLs surfaces issues that direct inspection misses.
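
The same simulation can be run from a script when the bot and URL list grows. This is a minimal sketch with Python's standard library. The function names and the outcome labels are hypothetical, and the user-agent value you pass is whatever token you want to test. One caveat worth stating as an assumption: an edge that verifies crawler IP ranges may treat a spoofed user agent from your machine differently from the real bot.

```python
import urllib.error
import urllib.request


def interpret(status: int, body_length: int) -> str:
    """Translate a status code and body size into the readings described above."""
    if status in (403, 429):
        return "blocked"       # often enforced at the CDN edge
    if 300 <= status < 400:
        return "redirected"    # the bot is routed differently; inspect the target
    if status == 200 and body_length == 0:
        return "empty"         # 200 with no body
    if status == 200:
        return "ok"
    return "other"


class _NoRedirect(urllib.request.HTTPRedirectHandler):
    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None  # do not follow; the 3xx then surfaces as an HTTPError


def simulate(url: str, user_agent: str, timeout: float = 10.0) -> str:
    """Fetch one URL with an explicit user agent and classify the response."""
    opener = urllib.request.build_opener(_NoRedirect)
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    try:
        with opener.open(request, timeout=timeout) as response:
            return interpret(response.status, len(response.read()))
    except urllib.error.HTTPError as err:
        return interpret(err.code, 0)
```

A call such as `simulate("https://yourdomain.com/", "OAI-SearchBot")` returns one label per bot and URL pair. Timeouts and connection errors are left to raise so they are not mistaken for a clean result.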

Server Log Analysis

Server access logs show which bots are crawling the site. They show the rate and the URLs each bot pulls. Filter logs by user-agent to find each bot's activity. Brands on Cloudflare or other CDNs need to pull logs from the CDN edge. Cloudflare Logs, Fastly logs, or AWS CloudFront logs work. Origin server logs alone are not enough. The CDN handles many bot requests without forwarding them.

The baseline. Each major bot you care about should appear in logs at a non-zero rate. The rate should match the bot's published crawling behavior. A bot that should be active but shows no entries has an access problem at some layer. A bot with high error rates (403, 429, 503) is blocked somewhere. That blocking often happens at the CDN layer.
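
The baseline check can be sketched as a per-bot tally of hits and error responses. The example assumes logs in the common "combined" access-log format. CDN log exports often use JSON or other layouts, so the regular expression is an assumption to adapt. `bot_summary` and the bot list are hypothetical names drawn from this chapter.

```python
import re
from collections import Counter

# Combined format: ... "METHOD path proto" status size "referer" "user-agent"
LINE = re.compile(r'"\S+ \S+ \S+" (?P<status>\d{3}) \S+ "[^"]*" "(?P<ua>[^"]*)"')
BOTS = ["OAI-SearchBot", "ChatGPT-User", "ClaudeBot", "PerplexityBot", "GPTBot"]
ERRORS = {"403", "429", "503"}


def bot_summary(log_lines, bots=BOTS) -> dict:
    """Count requests and error responses per bot token."""
    hits, errors = Counter(), Counter()
    for line in log_lines:
        match = LINE.search(line)
        if not match:
            continue  # skip lines that are not in the expected format
        ua = match["ua"].lower()
        for bot in bots:
            if bot.lower() in ua:
                hits[bot] += 1
                if match["status"] in ERRORS:
                    errors[bot] += 1
    return {bot: {"hits": hits[bot], "errors": errors[bot]} for bot in bots}
```

A bot with zero hits, or with errors making up most of its hits, is the one to investigate first.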

Worked robots.txt Configurations

Three configurations cover the common postures. Each is a full robots.txt block. Adapt the user-agent list to your decisions.

Figure 2. The three robots.txt postures. Permissive allows retrieval bots and training bots and is the default for most mid-market brands. Balanced allows retrieval bots and blocks training bots. Restrictive allows only the major retrieval bots by name and blocks all other bots, which suits premium-content brands. All three keep the retrieval bots allowed; the difference is the training-bot decision and the default-deny stance, which is what blocking everything indiscriminately gets wrong.

Permissive Posture (Default for Most Mid-Market Brands)

Allow all major AI bots. Both retrieval and training. Block paths that should never be indexed. Admin, internal APIs, staging URLs. This posture maximizes your citation surface. It also maximizes long-term entity recognition.

User-agent: *
Allow: /
Disallow: /wp-admin/
Disallow: /admin/
Disallow: /api/internal/
Disallow: /staging/

Sitemap: https://yourdomain.com/sitemap.xml

This posture allows every bot by default. It blocks only paths that should not appear in AI Search or organic search results. Most mid-market brands operate here. The cost of training contribution is low. The citation benefit compounds.

Balanced Posture (Retrieval Yes, Training No)

Allow retrieval bots. That covers the current citation surface. Block training bots. That cuts your contribution to future model training. This posture protects your content from training inclusion. It keeps your citation share on current AI Search responses.

User-agent: GPTBot
Disallow: /

User-agent: anthropic-ai
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Applebot-Extended
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: *
Allow: /
Disallow: /wp-admin/
Disallow: /admin/

Sitemap: https://yourdomain.com/sitemap.xml

Training bots are blocked by name. Retrieval bots inherit the permissive default. This config assumes OAI-SearchBot, ChatGPT-User, ClaudeBot, PerplexityBot, and Googlebot are not listed as training bots. They are retrieval. They fall under the permissive User-agent: * rule.

Restrictive Posture (Selective Allow Only)

Block all bots by default. Allow only the major retrieval bots plus the standard search crawlers Googlebot and Bingbot. This posture suits premium-content brands such as research firms and news publishers. These brands want full control over which bots can access content.

User-agent: Googlebot
Allow: /

User-agent: Bingbot
Allow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Applebot
Allow: /

User-agent: *
Disallow: /

Sitemap: https://yourdomain.com/sitemap.xml

Default-deny with explicit allow for the major retrieval bots. You earn AI Search citations from the allowed platforms. You contribute nothing to training, Common Crawl, or other data collection. The trade-off is real. Long-term entity recognition depends on the major training bots. This posture caps long-term compounding from training inclusion.

CDN-Layer Overrides

Modern web infrastructure routes traffic through CDN edges before reaching the origin server. The CDN can enforce bot management rules on its own. Origin robots.txt does not matter at that layer. Cloudflare, Fastly, Akamai, AWS CloudFront, and others all support some form of bot management.

The Cloudflare case is the most common in 2026. Cloudflare added AI bot blocking features in 2024. They turned the features on by default for some plan tiers in 2025. The blocking happens at the CDN edge before requests reach the origin. The bot gets a 403 or 429 response no matter what robots.txt says at origin.

The config check for Cloudflare-protected sites. Log into the Cloudflare dashboard. Go to Security and then Bots. Inspect the bot management rules. The default may block AI bots. You may need to allow-list specific user agents to permit them. The change at Cloudflare is separate from the robots.txt edit. Both are needed for access to work end-to-end.

Similar checks apply for other CDNs. Brands using non-Cloudflare CDNs should verify the CDN's bot management does not override robots.txt intent. The verification is the same user-agent simulation as above. Run curl with explicit bot user agents. That tests the real edge behavior. On sites where the CDN, hosting, and platform configuration are tangled enough that crawler access cannot be fixed cleanly, the access problem is often a symptom of deeper infrastructure debt that a website rebuild or platform migration resolves at the root.

Serving Clean Content to AI Crawlers: Cloudflare Markdown for Agents

The CDN layer can do the opposite of blocking, too. It can hand AI crawlers a cleaner version of the page. In February 2026, Cloudflare launched Markdown for Agents, a content-negotiation feature that converts a page to markdown at the edge when an AI crawler asks for it. The crawler sends an Accept: text/markdown header, Cloudflare fetches the HTML from your origin, reduces it to markdown, and returns that instead. The feature is in Beta at no cost on Cloudflare's Pro, Business, and Enterprise plans.

The reason to care is token economy. Cloudflare measured roughly an 80% reduction in tokens between a page's HTML and its markdown equivalent, because the markup that means nothing to a model (divs, classes, inline styles) gets stripped out. A heading that costs 12 to 15 tokens in HTML costs about 3 in markdown. When an agent retrieves your page, more of its limited context window goes to your actual content and less to markup it would discard. Cleaner input is easier for the model to parse and cheaper for it to hold.

Be precise about what this buys. Cloudflare makes no citation claim; it markets the feature on efficiency alone. In Searchbloom's own testing, though, turning Markdown for Agents on measurably raised how often AI crawlers pulled our pages. That is a retrieval signal, which sits one step upstream of citation, not a proven citation lift. We treat it as an experimental, low-cost bet with a measured retrieval upside, not as a guaranteed path to more citations.

+35%

In Searchbloom's own testing, enabling Cloudflare Markdown for Agents produced about 35% higher AI-crawler retrieval on the property measured. Retrieval is one step upstream of citation, so treat it as a measured signal, not a proven citation lift.

It is also the credible successor to llms.txt. Where llms.txt asked you to hand-maintain a separate markdown file that AI providers never committed to reading, Markdown for Agents runs at the edge on the crawler's own request, so it gets used when the crawler wants it and costs nothing to maintain. Enable it in the Cloudflare dashboard alongside the bot-management rules above, then confirm crawlers are getting the markdown variant by requesting your page with an Accept: text/markdown header and checking for the x-markdown-tokens response header. Measure AI-crawler retrieval in your logs before and after, so the call rests on your own data, not on the vendor's framing or ours.
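
The confirmation step can be sketched in a few lines of Python. The request header and the `x-markdown-tokens` response header are the ones named above. The function names are hypothetical, and the sketch assumes nothing about the header's value beyond its presence.

```python
import urllib.request


def markdown_tokens(headers: dict):
    """Return the x-markdown-tokens value if present, else None."""
    lowered = {key.lower(): value for key, value in headers.items()}
    return lowered.get("x-markdown-tokens")


def check_markdown(url: str, timeout: float = 10.0):
    """Request a page as markdown and report the token header, if any."""
    request = urllib.request.Request(url, headers={"Accept": "text/markdown"})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return markdown_tokens(dict(response.headers.items()))
```

A result of None on a page you expected to be converted suggests the feature is off for that zone or the request did not pass through the edge.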

Sitemap.xml Integration

The Sitemap directive in robots.txt points crawlers at your sitemap.xml. The sitemap lists all URLs you want indexed. AI crawlers respect sitemap discovery. They use it to prioritize indexing of important pages. The mechanics of sitemaps and the structured-data signals that travel with them are worth a separate read for teams setting this up the first time.

The sitemap.xml should include every public page on your domain. It should also include updated timestamps (lastmod) for each URL. Brands that update content often should run automated sitemap generation. That keeps the sitemap in sync with actual content state. Stale sitemaps with old lastmod timestamps slow crawlers down. They also slow citation lift on refreshed content.

Multiple sitemaps are supported via sitemap index files. Use them for sites with thousands of URLs. The sitemap index references many individual sitemap files. Typical practice is one sitemap per content type. Separate sitemaps for blog posts, product pages, video URLs, and so on. Each sitemap has a maximum of 50,000 URLs or 50 MB uncompressed.
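
The sitemap points above lend themselves to a quick scripted check. This is a minimal sketch for a single urlset file, using the standard sitemap namespace. `audit_sitemap` is a hypothetical name, the 365-day staleness window is an arbitrary assumption, and the code assumes each lastmod value starts with a full YYYY-MM-DD date.

```python
import xml.etree.ElementTree as ET
from datetime import date

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
MAX_URLS = 50_000  # per-file URL limit cited above


def audit_sitemap(xml_text: str, today: date, stale_after_days: int = 365) -> dict:
    """Report URL count, limit breach, and URLs with missing or old lastmod."""
    root = ET.fromstring(xml_text)
    urls = root.findall("sm:url", NS)
    missing, stale = [], []
    for url in urls:
        loc = url.findtext("sm:loc", default="", namespaces=NS).strip()
        lastmod = url.findtext("sm:lastmod", default="", namespaces=NS).strip()
        if not lastmod:
            missing.append(loc)
            continue
        # Only the date part is compared; any time component is ignored.
        age = (today - date.fromisoformat(lastmod[:10])).days
        if age > stale_after_days:
            stale.append(loc)
    return {"count": len(urls), "over_limit": len(urls) > MAX_URLS,
            "missing_lastmod": missing, "stale": stale}
```

Run it per sitemap file; a sitemap index would need one extra loop over its child sitemap entries, which is not shown here.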

The Quarterly robots.txt Audit Workflow

Quarterly is the right cadence for robots.txt audits. The failures that affect AI crawler access compound when left unchecked. New AI bots arrive on a recurring basis. Platforms launch new bots. Existing vendors split their crawlers into more granular user agents (training vs retrieval, content vs index, primary vs user-triggered). CDN bot management updates from Cloudflare, Fastly, AWS CloudFront, and Akamai can silently change access. These come through taxonomy migrations or default-policy changes. WordPress SEO plugin updates rewrite robots.txt based on plugin defaults. The operator may not have opted in. Quarterly audits catch this drift before it spreads. Brands on slower cadences (semi-annual or annual) often find out much later that they missed a major new bot. They also discover vendor updates that have been suppressing access for an extended stretch.

Step 1: Bot list refresh. Update your canonical list of user agents from current vendor docs. OpenAI publishes its bot inventory at openai.com/gptbot. The page has current user-agent strings for GPTBot, OAI-SearchBot, and ChatGPT-User. Anthropic publishes its inventory at docs.anthropic.com. The page covers ClaudeBot and the legacy anthropic-ai agent. Google publishes its full crawler list at developers.google.com/search/docs/crawling-indexing/google-common-crawlers. The list includes Googlebot, Google-Extended, and any newer agents from AI Overviews and Gemini surfaces. Microsoft, Apple, ByteDance, and Perplexity publish similar inventories on their developer pages. The bot list usually grows by a few entries each quarter. The refresh is a quick sweep across the major vendor docs. Save output as a versioned bot-inventory document in your technical SEO docs.

Step 2: robots.txt verification. Fetch the current production robots.txt from your primary domain and any public subdomains. Compare the content against your intended posture document. The posture document should be a versioned reference. It defines the permissive, balanced, or restrictive posture you have chosen. Flag any rules added, removed, or changed since the last audit. Look for changes without an explicit decision in the change log. The common failure is rules added by automated systems. SEO plugins, CMS upgrades, and hosting-provider defaults are the usual sources. Quarterly verification surfaces these silent additions before they pile up. The step also catches outright corruption of the robots.txt. That includes encoding issues, file truncation, and deploys that push staging-style blanket-block rules to production.
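
The comparison in Step 2 can be automated with a plain text diff. The sketch below uses Python's standard `difflib`. `robots_drift` is a hypothetical helper name, and ignoring comments and blank lines is an assumption about what counts as a meaningful change.

```python
import difflib


def _normalize(text: str) -> list[str]:
    """Drop comments, blank lines, and surrounding whitespace before comparing."""
    lines = (raw.split("#", 1)[0].strip() for raw in text.splitlines())
    return [line for line in lines if line]


def robots_drift(intended: str, live: str) -> list[str]:
    """Return added ('+') and removed ('-') rule lines, live versus intended."""
    diff = difflib.unified_diff(_normalize(intended), _normalize(live),
                                fromfile="intended", tofile="live", lineterm="")
    return [d for d in diff
            if d.startswith(("+", "-")) and not d.startswith(("+++", "---"))]
```

An empty list means the live file matches the posture document. Anything else is a rule change that should map to an entry in the change log.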

Step 3: User-agent simulation. Run curl with explicit user-agent strings against 5 to 10 URLs for each major AI bot. The URL set should include the homepage. Add two to three primary content pages. Add a product or service page. Add a blog post. Add any pages you want to verify. Recently launched landing pages, refreshed content, and pages with high citation share all qualify.

For each bot, run the curl command pattern curl -A "BotUserAgent" https://yourdomain.com/path. Document the response code, response headers, and a sample of the body. Flag any 403 (forbidden), 429 (rate limited), or timeout responses. These point to access issues somewhere in the stack. Compare the response codes against the prior quarter's results. Any change in response code on a previously-working bot is drift. Drift needs investigation. Simulation is the strongest test in the audit. It checks the real production behavior an AI bot would hit when crawling your content.
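
The quarter-over-quarter comparison can be sketched as a diff of two result maps. The data shape is an assumption: each map goes from a (bot, URL path) pair to the status code recorded, with None standing for a timeout. `find_drift` is a hypothetical helper name.

```python
ALERT_CODES = {403, 429}


def find_drift(previous: dict, current: dict) -> list[tuple]:
    """Compare {(bot, url): status} maps; None stands for a timeout."""
    findings = []
    for key, now in sorted(current.items()):
        before = previous.get(key)
        if key in previous and before != now:
            # The status changed since the last audit.
            findings.append((*key, before, now, "drift"))
        elif now is None or now in ALERT_CODES:
            # Unchanged or newly tracked, but still an access problem.
            findings.append((*key, before, now, "blocked"))
    return findings
```

Each finding is a tuple of bot, path, prior status, current status, and a label, which drops straight into a spreadsheet row for the audit log.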

Step 4: CDN edge verification. Check this step if your site is fronted by Cloudflare, Fastly, AWS CloudFront, Akamai, or any other CDN with bot management. Verify the bot management rules at the edge. Do this on top of the robots.txt verification. For Cloudflare, log into the dashboard. Go to Security and then Bots. Inspect the bot management rules for changes since the last audit. Verify that the allowlist rules still match the current bot category classifications.

If Cloudflare has updated its taxonomy since the last audit, the allowlist rules may have been silently orphaned. A taxonomy migration that orphans the allowlist rules is a common failure mode. Check the recent challenge-passed rate for each major bot user agent. A rate near 100 percent means the bot is reaching origin. A rate below 90 percent means partial blocking. Partial blocking needs investigation. For Fastly, AWS CloudFront, and Akamai, run the equivalent check against their bot management surfaces.

Step 5: Server log analysis. Filter access logs by user-agent string for each major AI bot. Verify each is making requests at a non-zero rate. Every retrieval bot you allow should appear in logs at least daily on a typical mid-market site. Absence means the bot is not reaching the site. Flag any sustained drop in request volume for active bots. A bot that made 200 requests per day and now makes 5 has suppressed access. The drop matters even if the individual requests return 200 responses. Compare access patterns to the prior quarter to find trends.

Brands using Cloudflare or other CDNs need to pull logs from the CDN edge. Use Cloudflare Logs, Fastly logs, or AWS CloudFront access logs. Origin server logs alone are not enough. The CDN may handle many bot requests without forwarding them. The log analysis is a contained task. Effort depends on log volume and analytics tooling.
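
The sustained-drop check in Step 5 reduces to simple arithmetic on per-bot daily averages. In this sketch the 50 percent threshold is an assumption to tune, and `volume_drops` is a hypothetical helper name. The inputs are average requests per day for the prior and current periods.

```python
def volume_drops(previous: dict, current: dict, threshold: float = 0.5) -> dict:
    """Return bots whose daily request average fell by more than `threshold`."""
    drops = {}
    for bot, before in previous.items():
        if before <= 0:
            continue  # no baseline to compare against
        after = current.get(bot, 0)  # a bot missing from current counts as zero
        change = (before - after) / before
        if change > threshold:
            drops[bot] = round(change, 3)
    return drops
```

With the example from Step 5, a bot that went from 200 requests per day to 5 shows a drop of 0.975, well past any reasonable threshold.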

Step 6: Plugin and CMS check. For WordPress sites, verify that SEO plugins have not silently modified robots.txt rules. Check Yoast, Rank Math, and All in One SEO. Each plugin has its own robots.txt management surface. Plugin version updates can change default behavior. Operators may not notice. Check each plugin's version changelog for robots.txt-related changes since the last audit. Plugin developers may record such changes in the changelog, but they rarely highlight them and rarely send proactive notifications. The same check applies to non-WordPress CMS platforms. Shopify, Squarespace, Wix, Webflow, Sitecore, and Drupal all qualify. Each has its own surface for managing robots.txt. Each has its own update cadence that can change default behavior. The plugin and CMS check is often where quarterly audits surface the most surprising drift.

Step 7: Documentation and remediation. Document all findings in a versioned audit log. Include date, audit conductor, findings by step, and remediation status for each finding. The audit log is both an operational artifact and a historical record. It drives remediation work. It supports future audits with context on prior issues. Open remediation work items for flagged issues. Assign explicit owners and target completion dates. Track time-to-fix on remediations. The target is a fast turnaround from detection to deployed fix. Critical issues should be fixed first and with urgency. Critical issues include full bot blocking and citation-share-impacting configs. Brands that remediate promptly see less citation share loss from technical issues. Brands that let remediations drift into the next quarter's audit cycle lose more.
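The audit log fields described above map naturally onto a small record type. The sketch below is one possible shape, not a prescribed schema. The names Finding and audit_record, and the choice of one JSON line per audit, are assumptions made for illustration.

```python
import json
from dataclasses import dataclass, asdict
from datetime import date
from typing import Optional

@dataclass
class Finding:
    step: int                      # audit step number that surfaced the issue
    summary: str
    critical: bool                 # e.g. full bot blocking
    owner: str
    detected: date
    target_fix: date
    fixed: Optional[date] = None   # None while remediation is open

    def days_to_fix(self) -> Optional[int]:
        """Days from detection to deployed fix, or None if still open."""
        return (self.fixed - self.detected).days if self.fixed else None

def audit_record(audit_date: date, conductor: str, findings: list) -> str:
    """Serialize one audit as a JSON line, critical findings first."""
    ordered = sorted(findings, key=lambda f: not f.critical)
    return json.dumps({
        "date": audit_date.isoformat(),
        "conductor": conductor,
        "findings": [
            {**asdict(f), "days_to_fix": f.days_to_fix()} for f in ordered
        ],
    }, default=str)  # default=str renders date objects as YYYY-MM-DD
```

Appending each record to a file under version control gives both the operational artifact and the historical record. The days_to_fix value supports the time-to-fix tracking.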

The team and tooling for the quarterly audit scale with site complexity. Most mid-market brands run the audit through a single engineer or technical SEO specialist. The audit is a contained engagement for sites under 50,000 URLs. Most of the effort is in steps 3 and 5. User-agent simulation and server log analysis are the heavy steps. Larger sites benefit from automation. Shell scripts or Python notebooks can run the user-agent simulation across a representative URL sample. Pipe output into a spreadsheet or database for comparison against prior quarters. Some brands run the simulation in their CI/CD pipelines. They get automated bot-access checks on a daily cadence. The checks alert when response codes change. The automation pays back quickly for brands with 100,000-plus URLs. It also pays back for brands with high sensitivity to citation share. A sustained blocking incident can have real revenue impact.
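The automated user-agent simulation can be sketched with the Python standard library. This is a sketch under stated assumptions: the function names are illustrative, and the request sends only a User-Agent header, like curl -A. Some CDNs also check the source IP of a claimed bot, so a simulated request may not receive exactly what the real crawler receives.

```python
import urllib.error
import urllib.request

def fetch_status(url: str, user_agent: str, timeout: float = 10.0) -> int:
    """Return the HTTP status a request with this User-Agent header receives."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # 403, 429, 503 and similar arrive as exceptions

def run_simulation(urls, user_agents, fetch=fetch_status):
    """Map every (user_agent, url) pair to its status code."""
    return {(ua, url): fetch(url, ua) for ua in user_agents for url in urls}

def changed_since(previous: dict, current: dict) -> dict:
    """Return pairs whose status differs from the prior run, as (old, new)."""
    return {
        key: (previous.get(key), status)
        for key, status in current.items()
        if previous.get(key) != status
    }
```

Store the output of run_simulation after each run. Passing the stored result and the new result to changed_since gives the list of response-code changes a CI job could alert on. Connection failures raise urllib.error.URLError here and would need their own handling.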

The annual deep audit expands the quarterly scope once per year. The deep audit covers a full sitemap audit. Verify lastmod accuracy across the entire sitemap. Identify broken URLs. Identify coverage gaps where new content has not been added. The deep audit also covers a cross-subdomain robots.txt consistency check. Verify apex domain, www subdomain, and any other public subdomains return the same robots.txt rules where intended. The deep audit compares crawl-rate trends across the full year. Find bots that have been steadily losing access over time. Quarterly snapshots may have missed these. The deep audit reviews any bot categories the brand has blocked. Confirm the original decision still matches strategy. The deep audit also revisits your intended posture document. Reconfirm the permissive, balanced, or restrictive choice still fits the brand's positioning. Brands that have changed positioning over the year may need to update the posture.
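The lastmod part of the sitemap audit can be partly automated. The sketch below flags entries whose lastmod is missing, is not a full date, or is older than a cutoff. It assumes a standard sitemaps.org urlset document. The function name stale_entries and the 365-day default are illustrative choices, and the check cannot tell whether a recent lastmod is truthful.

```python
import xml.etree.ElementTree as ET
from datetime import date, timedelta

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def stale_entries(sitemap_xml: str, today: date, max_age_days: int = 365):
    """Return (url, lastmod) pairs with a missing, unparseable, or old lastmod."""
    cutoff = today - timedelta(days=max_age_days)
    flagged = []
    for node in ET.fromstring(sitemap_xml).findall("sm:url", NS):
        loc = node.findtext("sm:loc", default="", namespaces=NS).strip()
        lastmod = node.findtext("sm:lastmod", default="", namespaces=NS).strip()
        # When a full date is given, the first 10 characters are YYYY-MM-DD.
        # Anything else is flagged for manual review.
        try:
            modified = date.fromisoformat(lastmod[:10])
        except ValueError:
            flagged.append((loc, lastmod or None))
            continue
        if modified < cutoff:
            flagged.append((loc, lastmod))
    return flagged
```

Broken URLs and coverage gaps need separate checks. The flagged list is a starting point for the lastmod review only.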

The quarterly audit interacts with your broader technical-health monitoring. The quarterly audit is point-in-time verification. It catches drift that has piled up since the prior audit. It does not catch drift that happens and recovers between audits. Brands with high sensitivity to citation share add continuous monitoring. They run automated daily curl checks against a representative URL set. The checks alert when response codes change for any major bot user agent. Continuous monitoring catches incidents almost as they happen, rather than after a long delay. This limits the citation share lost when a CDN taxonomy migration silently orphans bot allowlist rules between audits. Brands without the engineering capacity for continuous monitoring should tighten the audit cadence to monthly during periods of active vendor change. Major CDN platform updates, AI vendor bot reorganizations, and large CMS upgrades all qualify. Return to quarterly during steady-state periods.

The Bot Access Health Score

Bot access is multi-layered. Robots.txt is one layer. CDN bot management is another. Plugin overrides, DNS configurations, and origin server rules add more. The Bot Access Health Score is a Searchbloom-coined composite that captures whether the full stack is letting the right bots through. Track the BAHS quarterly alongside the audit workflow.

BAHS = (number of major bots returning 200 across the standard URL set) / (number of major bots tested) x 100

The standard test set: OAI-SearchBot, ChatGPT-User, ClaudeBot, PerplexityBot, Perplexity-User, Googlebot, Google-Extended, GPTBot, Applebot, and Bingbot. The standard URL set: 5 URLs covering homepage, primary content page, product or service page, blog post, and a recently-refreshed page. The test combines bot and URL into 50 individual checks. Each check passes if the bot's curl response is 200 with content body.
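The score can be computed from the 50 check results with a short function. The sketch below takes one reading of the formula: a bot counts toward the numerator only if all of its URL checks pass. That all-URLs reading is an assumption. A per-check pass rate over all 50 checks is another defensible reading and would give a different number. The function name bahs is illustrative.

```python
def bahs(results: dict) -> float:
    """Bot Access Health Score as a percentage.

    `results` maps (bot, url) to True when the check passed (HTTP 200 with
    a content body). A bot counts as passing only if every URL passed;
    this all-URLs reading of the formula is an assumption.
    """
    bots = {bot for bot, _ in results}
    if not bots:
        raise ValueError("no checks recorded")
    passing = {
        bot for bot in bots
        if all(ok for (b, _), ok in results.items() if b == bot)
    }
    return 100.0 * len(passing) / len(bots)
```

With the ten-bot test set, one bot failing on a single URL gives 9 passing bots out of 10, a score of 90.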

BAHS above 95%. Healthy access. The bot stack is letting the right bots through. Citation share is not capped by technical access issues. Quarterly audits are catching drift before it spreads.
BAHS 80 to 95%. Minor gaps. One or two bots are blocked at some layer. Investigate the specific failures. Common causes: a single CDN allowlist rule has expired, a plugin update toggled a bot setting, or a new bot user-agent has not been added to allowlists.
BAHS 60 to 80%. Significant access issues. Multiple bots are blocked. Citation share from blocked platforms is suppressed. The fix work should pause new content investment until access is restored. New content cannot earn citations from blocked platforms regardless of substance.
BAHS below 60%. Critical access failure. Most bots are blocked. The site is largely invisible to AI Search. The wholesale-block trap or a CDN-wide block is usually the cause. Emergency fix required.
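The four bands translate directly into a lookup. The band edges in the text overlap at 80 and are silent on exactly 95 and exactly 60, so the boundary handling in this sketch is an assumption. The function name bahs_band is illustrative.

```python
def bahs_band(score: float) -> str:
    """Map a BAHS percentage to the band names used above.

    Boundary handling is an assumption: exactly 95 and exactly 80 fall in
    "minor gaps", and exactly 60 falls in "significant access issues".
    """
    if not 0 <= score <= 100:
        raise ValueError("score must be a percentage between 0 and 100")
    if score > 95:
        return "healthy access"
    if score >= 80:
        return "minor gaps"
    if score >= 60:
        return "significant access issues"
    return "critical access failure"
```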

Continuous monitoring (per the Quarterly Audit Workflow's automation section above) runs the BAHS daily and alerts on sharp drops. The alert threshold is a drop of 10 percentage points or more between consecutive daily checks. The threshold catches CDN policy updates and silent allowlist orphan events almost as they happen, rather than after a long detection lag.
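The 10-point rule is simple to express in code. The sketch below compares each daily score with the previous one. The function name drop_alerts and the list-of-pairs input shape are assumptions.

```python
def drop_alerts(daily_scores, threshold: float = 10.0):
    """Return (day, previous, current) for each day on which BAHS fell by
    `threshold` points or more relative to the previous daily check.

    `daily_scores` is a list of (day, score) pairs in chronological order.
    """
    alerts = []
    for (_, prev), (day, cur) in zip(daily_scores, daily_scores[1:]):
        if prev - cur >= threshold:
            alerts.append((day, prev, cur))
    return alerts
```

A slide of a few points per day would not trigger this rule. The quarterly audit is still the place where slow drift gets caught.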

A horizontal banded scale showing the Bot Access Health Score, the share of major bots returning a 200 response across the standard URL set. The bar is divided into four bands: below 60 percent is critical access failure, 60 to 80 percent is significant access issues, 80 to 95 percent is minor gaps, and above 95 percent is healthy access. A marker at 95 percent sets the healthy threshold. — Figure 3. The Bot Access Health Score. The composite spans the full bot stack, not just robots.txt, and a drop into the lower bands means the technical layer is capping citation share before any content work can pay off.

The Crawl Frequency Baselines

Server log analysis surfaces bot access patterns. Most brands do not know what "normal" looks like for each bot. The Crawl Frequency Baselines below come from measured Searchbloom partner engagements. Use them as starting points. Adjust for site size and content velocity.

Googlebot. 100 to 500 requests per day on a typical mid-market site (5,000 to 50,000 pages). High variability depending on site authority and content freshness. Below 50 requests per day is suppressed access. Run diagnostics.
Bingbot. 50 to 200 requests per day on a typical mid-market site. Bing crawls less aggressively than Google.
GPTBot. 20 to 150 requests per day on most mid-market sites. The rate increased substantially in 2025 as OpenAI scaled training infrastructure. Below 10 requests per day is suppressed.
OAI-SearchBot. 10 to 80 requests per day. The rate depends on how often ChatGPT users query topics relevant to the brand's content. Real-time retrieval traffic is bursty rather than constant.
ClaudeBot. 15 to 100 requests per day. The rate scaled up significantly in 2025 and 2026 as Anthropic's user base grew.
PerplexityBot. 20 to 120 requests per day. Perplexity is the most aggressive retrieval crawler among the major AI vendors per unit of user query volume.
Google-Extended. 5 to 50 requests per day. The training crawler is less aggressive than Googlebot.
Applebot. 10 to 60 requests per day. Higher for brands with strong Apple ecosystem presence (app store, Mac software).
CCBot (Common Crawl). 5 to 30 requests per day, with periodic large crawl bursts when Common Crawl runs its monthly cycles.

The baselines vary by category. B2B SaaS sites with active publishing cadence see the higher end of the ranges. Static sites with infrequent updates see the lower end. Category-specific content such as ecommerce product feeds can drive much higher crawl rates, from Googlebot in particular. Use the baselines as starting points. Once your site has accumulated enough log history, calculate the site-specific baseline for each bot. Use that for ongoing monitoring rather than the generic ranges.
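A site-specific baseline can be as simple as the median daily request count per bot. The sketch below assumes you already have daily counts from log analysis. Using the median and requiring 30 days of history are both assumptions, since the text does not say how much history is enough. The function name site_baselines is illustrative.

```python
from statistics import median

def site_baselines(daily_counts: dict, min_days: int = 30) -> dict:
    """Return the median daily request count per bot.

    `daily_counts` maps bot name to a list of daily request counts.
    Bots with fewer than `min_days` of history are omitted, so the generic
    ranges stay in use for them. The median and the 30-day minimum are
    assumptions.
    """
    return {
        bot: median(counts)
        for bot, counts in daily_counts.items()
        if len(counts) >= min_days
    }
```

The median resists distortion from the bursty days that real-time retrieval traffic and periodic crawl cycles produce.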

A horizontal range-bar chart of typical daily request rates for a mid-market site. Googlebot 100 to 500 requests per day. Bingbot 50 to 200. PerplexityBot 20 to 120. GPTBot 20 to 150. ClaudeBot 15 to 100. OAI-SearchBot 10 to 80. Applebot 10 to 60. Google-Extended 5 to 50. CCBot 5 to 30. Sustained drops below the site-specific baseline signal an access problem. — Figure 4. The Crawl Frequency Baselines. Each bot has a typical daily request range, and a bot that has held a steady rate then drops to a handful is blocked at some layer even when individual requests still return 200.

Sustained drops below the site-specific baseline signal access issues. A bot that has held a steady request rate and now makes only a handful is blocked at some layer. The drop matters even if the individual requests return 200 responses. The bot has decided not to crawl. That decision is often the result of past blocking that has not yet reset in the bot's crawler scheduler. Restoring access plus an IndexNow notification (Chapter 12) speeds the bot's return to baseline.
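Detecting a sustained drop means checking that recent days all sit well below the baseline, not just one quiet day. The sketch below is one way to phrase that. The 25 percent ratio and the 7-day window are illustrative thresholds, not values from the baselines above, and the function name sustained_drop is hypothetical.

```python
def sustained_drop(recent_counts, baseline: float,
                   ratio: float = 0.25, min_days: int = 7) -> bool:
    """True when the last `min_days` counts all sit below ratio * baseline."""
    if len(recent_counts) < min_days:
        return False  # not enough data to call the drop sustained
    window = recent_counts[-min_days:]
    return all(count < ratio * baseline for count in window)
```

A bot with a baseline of 200 requests per day that has made about 5 per day for the past week would return True here, even if every one of those requests returned 200.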

Server logs show which pages bots actually fetched. Running your own crawl of the same content closes the loop on what they could reach. Pair log analysis with the Embedding Audit in Screaming Frog. The crawl inventories every URL a crawler can render. It confirms that the pages you expect AI systems to retrieve are reachable in the first place, before any embedding-level analysis of the content itself.

Common Mistakes That Defeat Crawler Access

1. The wholesale-block trap. A "User-agent: * Disallow: /" rule blocks every bot. This is the most common failure in AI Search. Counter-test: fetch your production robots.txt. Check for any path-root Disallow under generic User-agent rules.

2. Blocking Google-Extended by mistake. Brands sometimes block Google-Extended. They think it is the same as Googlebot. The block removes them from AI Overview retrieval. The two bots are not the same. Blocking Google-Extended affects AIO but not organic search. Counter-test: does your robots.txt treat Googlebot and Google-Extended as separate?

3. Outdated bot lists. robots.txt files not updated since 2023 miss OAI-SearchBot, ClaudeBot, PerplexityBot, and other newer agents. The newer bots may fall under your permissive rules through the User-agent * wildcard. Explicit handling is safer. Counter-test: when was your robots.txt last updated?

4. CDN bot management overriding robots.txt. Cloudflare or other CDN configs block AI bots at the edge. Origin robots.txt does not matter. The brand thinks the bots are allowed. The CDN says no. Counter-test: run curl -A "PerplexityBot" against your production URL. Check the response code.

5. Plugin-managed robots.txt without operator awareness. WordPress SEO plugins generate robots.txt based on plugin settings. The operator may not understand the settings. Counter-test: does the robots.txt on the filesystem match what your production URL returns?

6. Different rules for subdomains. The www subdomain may have different robots.txt than the apex domain. Other subdomains may also drift. Inconsistency creates surprising blocking. Counter-test: do your apex, www, and any other public subdomains return the same robots.txt rules where intended?

7. Stale sitemap.xml. The sitemap lists URLs that no longer exist. It misses URLs that should be indexed. It carries long-outdated lastmod dates. Crawler efficiency drops. Citation lift on new content slows. Counter-test: spot-check 10 URLs from your sitemap. How many resolve correctly? What are the lastmod dates?

8. No verification cadence. The brand never spot-checks crawler access. Changes pile up over time. An audit reveals issues that have been costing citation share for an extended stretch. Counter-test: when did you last verify GPTBot and PerplexityBot can fetch your production URLs?
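The counter-tests for mistakes 1 through 3 can be run against the robots.txt text itself with Python's standard urllib.robotparser. The sketch below reports which retrieval bots a given robots.txt disallows from a URL. The function name blocked_bots is illustrative. The standard-library parser may differ from a vendor's own matching in edge cases, and it sees only robots.txt, so a CDN block (mistake 4) will not show up here.

```python
from urllib.robotparser import RobotFileParser

# The five retrieval-focused user agents this chapter recommends allowing.
RETRIEVAL_BOTS = ["OAI-SearchBot", "ChatGPT-User", "PerplexityBot",
                  "ClaudeBot", "Google-Extended"]

def blocked_bots(robots_txt: str, url: str, bots=RETRIEVAL_BOTS) -> list:
    """Return the bots that this robots.txt text disallows from `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [bot for bot in bots if not parser.can_fetch(bot, url)]
```

Fetch the production robots.txt, pass its text and a representative URL, and an empty list means no retrieval bot is blocked at the robots.txt layer. A wholesale block returns all five names.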

Questions & Answers

Why does robots.txt matter if AI systems do not always respect it? Major AI vendors all publish documented user agents. They honor robots.txt. OpenAI, Anthropic, Google, Microsoft, and Perplexity all qualify. The minority of scrapers ignoring robots.txt are not the citation-driving surfaces. A bad robots.txt is a self-inflicted citation outage with the major bots.

Training vs retrieval split? AI vendors split bots into two groups. Training bots collect data for future model versions. Retrieval bots fetch in real time when users ask questions. The same vendor often runs both under different user agents. Most brands want to allow retrieval. They decide separately on training.

Which user agents should I always allow? OAI-SearchBot, ChatGPT-User, PerplexityBot, ClaudeBot, and Google-Extended. Plus Googlebot and Bingbot for organic that feeds AIO and Copilot.

Should I block training bots like GPTBot? The answer depends on positioning. Most mid-market brands allow training bots. They want the long-term entity-recognition benefit. Premium-content brands may block to protect a competitive moat. Blocking everything by default costs citation share at every layer.

How do I verify bot access? Three methods. Direct robots.txt inspection. User-agent simulation with curl -A. Server log analysis filtered by user-agent. CDN-fronted sites need to test at the edge. Origin checks alone are not enough.

What is the wholesale-block trap? CMS defaults and legacy configs include "User-agent: * Disallow: /". The rule blocks every bot, including the AI crawlers brands want. The audit and unblock work is high-impact. It produces measurable citation lift once the fix lands.

Does Cloudflare interfere? Yes. Cloudflare added AI bot blocking by default for some plans in 2024 and 2025. The blocking happens at the edge. It runs before robots.txt evaluation. Verification needs to test at the CDN edge.

How often should I audit? Quarterly at minimum. New AI bots arrive on a recurring basis. Quarterly audits catch new bots that need explicit handling. They also verify existing rules still match intent.