robots.txt

robots.txt is a plain-text file at a website's root that tells crawlers which parts of the site they may access, using User-agent and Disallow/Allow directives. Originally built for search engines, it is now the primary mechanism for controlling AI crawlers like GPTBot and PerplexityBot. Compliance is voluntary, but the major operators state that they honor it.

How robots.txt works

The file lives at yoursite.com/robots.txt and consists of groups: a User-agent line naming a bot (or * for all), followed by Disallow and Allow rules describing path access. Well-behaved crawlers fetch it before crawling and obey the group that names them, falling back to the * group only when no named group matches. It controls crawling, not indexing: a disallowed URL can still appear in an index if it is linked from elsewhere. It is also advisory, so enforcement against non-compliant scrapers requires firewall or CDN rules keyed on user agent and IP verification.
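The group matching described above can be sketched with Python's standard-library urllib.robotparser. The bot name ExampleBot, the paths, and the example.com host are illustrative assumptions, not taken from any real site. One caveat: this parser applies the first matching rule in a group, whereas crawlers following RFC 9309 pick the longest matching path, so the Allow line is listed first here to give the same answer under both.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one named group and one catch-all group.
ROBOTS_TXT = """\
User-agent: ExampleBot
Allow: /private/press/
Disallow: /private/

User-agent: *
Disallow: /admin/
"""

SITE = "https://www.example.com"  # placeholder host, an assumption


def load_rules(text):
    """Parse robots.txt text without any network access."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


def may_crawl(parser, agent, path):
    """Return True if `agent` may fetch `path` under the parsed rules."""
    return parser.can_fetch(agent, SITE + path)


rules = load_rules(ROBOTS_TXT)

for agent, path in [
    ("ExampleBot", "/private/report.html"),
    ("ExampleBot", "/private/press/launch.html"),
    ("ExampleBot", "/admin/"),  # named group applies; the * group is ignored
    ("OtherBot", "/admin/"),    # no named group, so the * group applies
]:
    print(agent, path, may_crawl(rules, agent, path))
```

The third check is the one that tends to surprise people: a bot with its own group does not inherit the rules in the * group, so anything you want hidden from it has to be repeated there.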

The AI-era robots.txt: separate training from retrieval

AI crawlers turned robots.txt from a technical housekeeping file into a strategic document. The core principle: distinguish training crawlers from search and retrieval crawlers, because the consequences of blocking them are opposite. Blocking GPTBot, ClaudeBot, CCBot or the Google-Extended token withholds your content from model training without affecting your visibility in AI search. Blocking OAI-SearchBot, PerplexityBot or user-triggered fetchers like ChatGPT-User can remove you from the AI answers those bots feed.

The most common costly mistake is a blanket rule, copied from a block list, that disallows every AI user agent indiscriminately. Audit yours: each bot deserves its own deliberate line.
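One way to run that audit is a small script that checks each bot separately. The sketch below is an assumption-laden illustration: both robots.txt samples are hypothetical, the bot lists are just the names mentioned in this entry, example.com is a placeholder, and only the site root is tested.

```python
from urllib.robotparser import RobotFileParser

# Bot names as listed in this entry; extend the tuples for your own audit.
TRAINING_BOTS = ("GPTBot", "ClaudeBot", "CCBot", "Google-Extended")
RETRIEVAL_BOTS = ("OAI-SearchBot", "PerplexityBot", "ChatGPT-User")

# Hypothetical blanket rule of the kind copied from a block list.
BLANKET = """\
User-agent: GPTBot
User-agent: ClaudeBot
User-agent: CCBot
User-agent: Google-Extended
User-agent: OAI-SearchBot
User-agent: PerplexityBot
User-agent: ChatGPT-User
Disallow: /
"""

# Hypothetical deliberate policy: training opted out, retrieval allowed.
SPLIT = """\
User-agent: GPTBot
User-agent: ClaudeBot
User-agent: CCBot
User-agent: Google-Extended
Disallow: /

User-agent: OAI-SearchBot
User-agent: PerplexityBot
User-agent: ChatGPT-User
Allow: /
"""


def audit(robots_txt, url="https://www.example.com/"):
    """Report which training and retrieval bots are blocked from `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())

    def blocked(bots):
        return [bot for bot in bots if not parser.can_fetch(bot, url)]

    return {
        "training_blocked": blocked(TRAINING_BOTS),
        "retrieval_blocked": blocked(RETRIEVAL_BOTS),
    }


for name, text in (("blanket", BLANKET), ("split", SPLIT)):
    report = audit(text)
    print(name, "blocks retrieval bots:", report["retrieval_blocked"] or "none")
```

A non-empty retrieval_blocked list is the signal to look for: it means the file is shutting out the bots that produce citations, whatever it does about training.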

A practical baseline configuration

For a brand pursuing AI visibility: explicitly allow retrieval bots (OAI-SearchBot, PerplexityBot, ChatGPT-User), decide on training bots (GPTBot, ClaudeBot, CCBot, Google-Extended) case by case, and keep standard search crawlers unrestricted. Reference your sitemap, avoid disallowing the CSS and JS files that crawlers need to render pages, and test changes before deploying, since one bad wildcard can silently erase AI citation eligibility. Then verify real-world behavior: server-side bot logs, such as those from Geonimo's AI traffic analytics, show which crawlers respect your rules and what they fetch.
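Testing before deploying can be as simple as a table of expectations checked against the candidate file. The following is a minimal sketch, not a recommended policy: the baseline file, the /assets/ and /admin/ paths, the sitemap URL, and the decision to block GPTBot are all assumptions made for illustration. Python's parser does not expand * or $ inside paths, so the rules here stick to plain prefixes.

```python
from urllib.robotparser import RobotFileParser

SITE = "https://www.example.com"  # placeholder host

# Hypothetical baseline: retrieval bots allowed, one training bot opted out.
BASELINE = """\
User-agent: OAI-SearchBot
User-agent: PerplexityBot
User-agent: ChatGPT-User
Allow: /

User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /assets/
Disallow: /admin/

Sitemap: https://www.example.com/sitemap.xml
"""

# A careless edit: a single catch-all rule that blocks everyone.
BROKEN = "User-agent: *\nDisallow: /\n"

# (agent, path, should_be_allowed)
EXPECTATIONS = [
    ("OAI-SearchBot", "/blog/post", True),
    ("PerplexityBot", "/blog/post", True),
    ("ChatGPT-User", "/", True),
    ("GPTBot", "/blog/post", False),
    ("Googlebot", "/assets/site.css", True),
    ("Googlebot", "/admin/login", False),
]


def parse(robots_txt):
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def failures(robots_txt, expectations):
    """Return every (agent, path, expected, actual) that does not match."""
    parser = parse(robots_txt)
    mismatches = []
    for agent, path, expected in expectations:
        actual = parser.can_fetch(agent, SITE + path)
        if actual != expected:
            mismatches.append((agent, path, expected, actual))
    return mismatches


def declared_sitemaps(robots_txt):
    """List Sitemap URLs in the file (site_maps() returns None if absent)."""
    return parse(robots_txt).site_maps() or []


print("baseline failures:", failures(BASELINE, EXPECTATIONS))
print("broken failures:", len(failures(BROKEN, EXPECTATIONS)))
print("sitemaps:", declared_sitemaps(BASELINE))
```

Run as part of a deploy check, a non-empty failures list would stop the broken file from shipping. Because real crawlers may resolve rule conflicts differently from this parser, treat a pass as a sanity check and still confirm behavior in server logs.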

Frequently asked questions

Should I use robots.txt to block all AI bots?

Only if you have decided to forgo AI search visibility entirely. Blanket blocks remove you from ChatGPT search, Perplexity and other answer engines, costing citations and referral traffic. Most sites should allow retrieval crawlers and make individual decisions about training crawlers like GPTBot and CCBot.

Do AI companies actually obey robots.txt?

The major operators — OpenAI, Anthropic, Google, Microsoft — document their user agents and state they honor robots.txt. Compliance is voluntary, and there have been disputes over specific companies' practices. For content you must protect, combine robots.txt with CDN or firewall enforcement based on verified bot IP ranges.
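The IP-based check mentioned above might look like the sketch below. Everything specific in it is hypothetical: the CIDR blocks are reserved documentation ranges standing in for whatever ranges an operator actually publishes, and the User-Agent strings are simplified. In practice this logic usually lives in a CDN or firewall rule rather than application code.

```python
import ipaddress

# Placeholder ranges (reserved documentation blocks), NOT real bot ranges.
# Replace with the ranges each operator publishes.
PUBLISHED_RANGES = {
    "GPTBot": ["192.0.2.0/24"],
    "PerplexityBot": ["198.51.100.0/24", "2001:db8::/32"],
}


def claimed_bot(user_agent):
    """Return the bot token the User-Agent header claims, or None."""
    lowered = user_agent.lower()
    for token in PUBLISHED_RANGES:
        if token.lower() in lowered:
            return token
    return None


def is_verified_bot(user_agent, remote_ip):
    """True only if the claimed bot's request comes from one of its ranges."""
    token = claimed_bot(user_agent)
    if token is None:
        return False
    address = ipaddress.ip_address(remote_ip)
    return any(
        address in ipaddress.ip_network(cidr)
        for cidr in PUBLISHED_RANGES[token]
    )


# Simplified, illustrative header value.
ua = "Mozilla/5.0 (compatible; GPTBot/1.0)"
print(is_verified_bot(ua, "192.0.2.10"))   # claimed bot, inside its range
print(is_verified_bot(ua, "203.0.113.5"))  # claimed bot, outside its range
```

A request that claims a bot's name but fails the range check is the case robots.txt cannot handle, since the user agent string alone is trivial to spoof.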

Does robots.txt affect my Google rankings?

Blocking AI-specific crawlers has no effect on Google Search rankings, which depend on Googlebot access. Problems arise when overly broad rules accidentally restrict Googlebot or block CSS and JavaScript files Google needs for rendering. Keep AI bot rules in their own user agent groups, separate from search crawler rules.

Last updated: 2026-06-11
