Diagnostic | May 1, 2026 | 6 min read

The AI crawler access problem most SMB sites have not audited

About 14 percent of SMB sites block GPTBot, often by accident. In one large study, sites that AI crawlers reached drew 320 percent more human traffic than sites they did not.

Alex Heudes

Co-Founder, Vyzz

The site that nobody could find on ChatGPT

An in-home care operator we audited last week had been quiet on ChatGPT for six months. The team thought it was a content problem. They had spent the quarter rewriting service pages, adding location landings, refreshing their FAQ. Despite the quality of those updates, the citation count had moved by zero across the entire window.

The first thing we checked was their robots.txt. They were blocking GPTBot, ClaudeBot, and CCBot. Someone on the team had pasted a snippet from a privacy blog into the file last summer, and the file had not been touched since. The team had inadvertently blocked GPTBot even though the search-citation crawler remained capable of accessing the site. That distinction is the thing most SMB operators do not yet know exists.

This is the threshold question every audit needs to start with. Before an AI engine can cite a business, the engine has to be able to crawl the business. The signals everyone has spent four months tuning (schema, headers, FAQ blocks, earned media, freshness windows) become irrelevant when the bot cannot reach the page in the first place.

What crawl access is worth

Duda published a study in April 2026 covering 850,000 SMB sites and 69 million AI crawler visits. The headline finding: AI-crawled sites generated 320 percent more human traffic, 270 percent more form submissions, and 250 percent more click-to-call events than non-crawled peers in the same study cohort.

The lift the study measures is human traffic, downstream of AI citation. The full sequence has four stages running in order: an AI crawler fetches the page, the engine decides whether to cite the page in its answer, the citation appears in a user's ChatGPT or Google AI Overview response, and a fraction of those users click through to the SMB site. Every stage depends on the first one. Without crawl access, none of the later stages can occur.

The Duda data also identifies the operator-controllable factors that drive crawl rate. AI crawler visits rise by roughly 7 percent for every blog post added and roughly 4 percent for every additional indexable page, with local schema markup and Google Business Profile synchronization rounding out the top non-cadence drivers. None of this is exotic, and none of it is new advice. The novelty is that the same content cadence and schema work that drives Google ranking now drives a separate, parallel pipeline of AI traffic.

The robots.txt audit problem

Cloudflare published a robots.txt analysis in Q1 2026 covering 4,047 SMB sites parsed on a single day, March 30. The analysis shows that GPTBot is blocked by 13.8 percent of sites, followed by ClaudeBot at 11.5 percent, CCBot at 11.2 percent, and Google-Extended at 10.7 percent.

ClaudeBot's share of Disallow rules grew from 9.6 percent in January 2026 to 10.1 percent by March, so SMB blocking of that bot rose across the quarter, while GPTBot's share held steady. Some operators block bots intentionally to protect training data. Many others do so accidentally, through privacy-blog templates, hosting provider defaults, or one-line copy-pastes during site setup that nobody has audited since.

Added up, the accidental blocks cost an entire vertical's worth of SMB visibility. A single Disallow rule in robots.txt can erase a year of content work, schema work, and Google Business Profile work from the AI citation pool. The work is not lost in any other channel. Major search engines like Google and Bing continue to read the site normally, and customers landing on the site from any other source still convert. The AI search citation pipe is the one channel where a robots.txt rule is dispositive.

Training bots and search-citation bots are not the same agent

This is the technical detail that catches almost every operator off guard. The user agent that powers AI search citation is a different identifier from the one that fetches pages for training data. OpenAI uses GPTBot to gather training data and OAI-SearchBot to power its search citations. Anthropic runs ClaudeBot primarily for training, with a separate user agent for the citation answers Claude returns inside its product. PerplexityBot is search-purpose. CCBot is Common Crawl, a non-AI archive that AI training pipelines often consume.

An SMB that wants to opt out of training while staying visible in AI search citation can keep GPTBot and ClaudeBot disallowed and explicitly Allow OAI-SearchBot, the Anthropic search agent, and PerplexityBot. A blanket User-agent: * with Disallow: / cuts off both pipes at once. The privacy advice that propagates through tech blogs almost never makes this distinction, which is how an operator ends up surprised that fixing one robots.txt line restores six months of disappeared citations.
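A minimal sketch of that split policy, checked with Python's standard-library robots.txt parser. The file contents are a hypothetical example, not a recommended template. `Claude-SearchBot` is an assumed name for the Anthropic search agent, which this article does not name, so confirm it against Anthropic's current crawler documentation before using it. The helper `crawl_access` is also invented for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical policy: opt out of training crawlers, stay open to search-citation crawlers.
# "Claude-SearchBot" is an ASSUMED agent name; verify it before copying this file.
ROBOTS_TXT = """\
User-agent: GPTBot
User-agent: ClaudeBot
Disallow: /

User-agent: OAI-SearchBot
User-agent: Claude-SearchBot
User-agent: PerplexityBot
Allow: /

User-agent: *
Allow: /
"""


def crawl_access(robots_txt: str, agents: list[str], url: str = "/") -> dict[str, bool]:
    """Return, for each user agent, whether this robots.txt lets it fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


if __name__ == "__main__":
    agents = ["GPTBot", "ClaudeBot", "OAI-SearchBot", "Claude-SearchBot", "PerplexityBot"]
    for agent, allowed in crawl_access(ROBOTS_TXT, agents).items():
        print(f"{agent}: {'allowed' if allowed else 'blocked'}")
```

Python's parser is only an approximation of how each crawler reads the file, so treat the result as a sanity check on the intent of the rules rather than a guarantee of crawler behavior.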

The four-step audit any operator can run today

The audit takes about thirty minutes and does not require a developer.

1. Pull your robots.txt. Open yoursite.com/robots.txt in any browser. Read every Disallow rule. Note which AI user agents are named.
2. Decide your stance per user agent. Operators who want to prevent training-data scraping while keeping search citation can disallow GPTBot and ClaudeBot, then explicitly allow OAI-SearchBot, Anthropic's search-citation agent, and PerplexityBot. PerplexityBot is search-purpose and almost always belongs in the allowed set.
3. Check the Disallow-slash trap. A User-agent: * group with Disallow: / applies to every bot that does not have a group of its own, wherever that group sits in the file. If your file has this pattern, every AI search-citation bot is blocked unless it is named in its own User-agent group with an Allow rule.
4. Verify against live crawl data. OpenAI documents its crawlers at platform.openai.com/docs/bots, including the user agent names to look for. Your server access logs or CDN analytics show which of those user agents are actually reaching your site, and Bing Webmaster Tools reports crawl activity for Bingbot and BingChatBot. Check both. If a search-citation user agent shows zero successful fetches over the last 30 days, the block is real.
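For operators with access to raw server logs, the 30-day check in the last step can be sketched as a short script. This assumes an access log in the common "combined" format. The function name `successful_fetches`, the sample log lines, and the user agent strings in them are made up for illustration. User agent strings can be spoofed, so a nonzero count is a hint rather than proof.

```python
import re
from datetime import datetime, timedelta, timezone

# Matches the timestamp, status code, and user agent of a combined-format log line.
LOG_PATTERN = re.compile(
    r'\[(?P<time>[^\]]+)\] "[^"]*" (?P<status>\d{3}) \S+ "[^"]*" "(?P<ua>[^"]*)"'
)


def successful_fetches(log_lines, agents, now, window_days=30):
    """Count 2xx responses served to each named user agent inside the window."""
    cutoff = now - timedelta(days=window_days)
    counts = {agent: 0 for agent in agents}
    for line in log_lines:
        match = LOG_PATTERN.search(line)
        if match is None:
            continue  # not a combined-format line
        when = datetime.strptime(match["time"], "%d/%b/%Y:%H:%M:%S %z")
        if when < cutoff or not match["status"].startswith("2"):
            continue  # too old, or not a successful fetch
        user_agent = match["ua"].lower()
        for agent in agents:
            if agent.lower() in user_agent:
                counts[agent] += 1
    return counts


# Illustrative log lines only; real user agent strings will differ.
SAMPLE_LOG = [
    '203.0.113.5 - - [20/Apr/2026:10:15:32 +0000] "GET /services HTTP/1.1" 200 5123 "-" "Mozilla/5.0 (compatible; OAI-SearchBot/1.0)"',
    '203.0.113.5 - - [21/Apr/2026:11:00:00 +0000] "GET /faq HTTP/1.1" 403 312 "-" "Mozilla/5.0 (compatible; OAI-SearchBot/1.0)"',
    '203.0.113.9 - - [01/Jan/2026:08:00:00 +0000] "GET / HTTP/1.1" 200 9000 "-" "Mozilla/5.0 (compatible; PerplexityBot/1.0)"',
]

if __name__ == "__main__":
    now = datetime(2026, 5, 1, tzinfo=timezone.utc)
    print(successful_fetches(SAMPLE_LOG, ["OAI-SearchBot", "PerplexityBot"], now))
```

In the sample, only the first line counts: the second is a 403 and the third falls outside the 30-day window. A search-citation agent that shows zero here while other traffic is healthy is the pattern worth investigating.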

For operators who want the full picture in one report, including which AI engines are citing your business today and where each citation traces back, an audit across ChatGPT, Claude, Perplexity, and Google AI Overviews shows the gap directly.

Frequently asked

How do I check if my site is blocking AI bots?

Open yoursite.com/robots.txt in any browser. Look for User-agent lines naming GPTBot, ClaudeBot, CCBot, OAI-SearchBot, PerplexityBot, or Google-Extended. A Disallow rule under any of these names is a block. A blanket User-agent: * with Disallow: / blocks every crawler that does not have its own group, AI bots included.
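The same manual check can be sketched in a few lines of Python, assuming you have pasted the contents of your robots.txt into a string. The function `audit_robots` and the sample file are hypothetical. The script tests only the homepage path, and Python's parser may not match every crawler's interpretation exactly.

```python
from urllib.robotparser import RobotFileParser

# The user agents named in the answer above.
AI_AGENTS = ("GPTBot", "ClaudeBot", "CCBot", "OAI-SearchBot", "PerplexityBot", "Google-Extended")


def audit_robots(robots_txt: str, agents=AI_AGENTS) -> dict[str, str]:
    """Report whether each agent may fetch "/" and which kind of rule decided it."""
    lines = robots_txt.splitlines()
    # Collect every name that appears on a User-agent line, ignoring trailing comments.
    named = {
        line.split(":", 1)[1].split("#", 1)[0].strip().lower()
        for line in lines
        if line.strip().lower().startswith("user-agent:")
    }
    parser = RobotFileParser()
    parser.parse(lines)
    report = {}
    for agent in agents:
        verdict = "allowed" if parser.can_fetch(agent, "/") else "blocked"
        source = "own group" if agent.lower() in named else "wildcard or default"
        report[agent] = f"{verdict} ({source})"
    return report


# Hypothetical file showing the blanket-block pattern with two named groups.
SAMPLE = """\
User-agent: GPTBot
Disallow: /

User-agent: PerplexityBot
Allow: /

User-agent: *
Disallow: /
"""

if __name__ == "__main__":
    for agent, status in audit_robots(SAMPLE).items():
        print(f"{agent}: {status}")
```

With this sample file, only PerplexityBot comes back allowed. OAI-SearchBot is never named, so it inherits the wildcard block, which is the accidental case described above.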

What is the difference between GPTBot and OAI-SearchBot?

GPTBot fetches pages for training data, while OAI-SearchBot powers the search citations in ChatGPT responses. Blocking OAI-SearchBot removes the site from those citations, which is far more disruptive than disallowing GPTBot, since that only affects training-data collection.

How fast does fixing a robots.txt block move citations?

Most SMBs see new AI citations begin within four to eight weeks of unblocking the right user agents, assuming the site has indexable pages and a current Google Business Profile. The first wave of citations tends to come from the AI engines that recrawl most aggressively, ChatGPT search and Perplexity in particular.

Topics: ai-search, geo, robots-txt, ai-crawlers, smb-marketing
