
robots.txt

robots.txt is a plain-text file at a website's root that tells crawlers which parts of the site they may access, using User-agent and Disallow/Allow directives. Originally built for search engines, it is now the primary mechanism for controlling AI crawlers like GPTBot and PerplexityBot. Compliance is voluntary but honored by major operators.

How robots.txt works

The file lives at yoursite.com/robots.txt and consists of groups: a User-agent line naming a bot (or * for all), followed by Disallow and Allow rules describing path access. Well-behaved crawlers fetch it before crawling and obey the group that names them, falling back to the * group only when no group does. robots.txt controls crawling, not indexing: a disallowed URL can still appear in an index if other pages link to it. It is also advisory, so enforcement against non-compliant scrapers requires firewall or CDN rules that match on the user agent and verify the source IP.
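The group matching described above can be sketched with Python's standard-library urllib.robotparser. ExampleBot, OtherBot and the paths are made-up names for illustration, and this parser only approximates how any particular crawler reads the file.

```python
from urllib.robotparser import RobotFileParser

# Illustrative file: one group for a named bot, one catch-all group.
ROBOTS_TXT = """\
User-agent: ExampleBot
Disallow: /private/

User-agent: *
Disallow: /admin/
"""


def build_parser(text):
    """Parse robots.txt content from a string instead of fetching it."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# ExampleBot is blocked by a rule in its own group.
print(parser.can_fetch("ExampleBot", "https://yoursite.com/private/report.html"))  # False
# ExampleBot has its own group, so the * group's /admin/ rule does not apply to it.
print(parser.can_fetch("ExampleBot", "https://yoursite.com/admin/"))  # True
# A bot with no group of its own falls back to the * group.
print(parser.can_fetch("OtherBot", "https://yoursite.com/admin/"))  # False
```

The second result is the one that tends to surprise people: once a bot has a named group, rules in the * group no longer cover it, so anything it should avoid has to be repeated in its own group.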

The AI-era robots.txt: separate training from retrieval

AI crawlers turned robots.txt from a technical housekeeping file into a strategic document. The core principle: distinguish training crawlers from search and retrieval crawlers, because blocking each group has a very different consequence. Blocking GPTBot, ClaudeBot, CCBot or the Google-Extended token withholds your content from model training without affecting your visibility in AI search. Blocking OAI-SearchBot, PerplexityBot or user-triggered fetchers like ChatGPT-User stops those products from retrieving your pages, which removes you from the answers they cite.

The most common costly mistake is a blanket rule, copied from a block list, that disallows every AI user agent indiscriminately. Audit yours: each bot deserves its own deliberate line.
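One way to run that audit is a short script that asks, for each bot named above, whether the site root is still fetchable. This is a rough sketch: the BLANKET and SELECTIVE files are hypothetical examples, the bot lists simply mirror the training/retrieval split in this article, and Python's urllib.robotparser stands in for the crawlers' own parsers.

```python
from urllib.robotparser import RobotFileParser

# Bot names are taken from the article; the grouping mirrors its split.
TRAINING_BOTS = ["GPTBot", "ClaudeBot", "CCBot", "Google-Extended"]
RETRIEVAL_BOTS = ["OAI-SearchBot", "PerplexityBot", "ChatGPT-User"]


def audit(robots_txt, probe_url="https://yoursite.com/"):
    """Return {bot: True/False} for whether each bot may fetch probe_url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {
        bot: parser.can_fetch(bot, probe_url)
        for bot in TRAINING_BOTS + RETRIEVAL_BOTS
    }


def blocked_retrieval_bots(robots_txt):
    """List retrieval bots the file blocks, i.e. the costly mistakes."""
    report = audit(robots_txt)
    return [bot for bot in RETRIEVAL_BOTS if not report[bot]]


# Hypothetical blanket rule of the kind copied from a block list.
BLANKET = """\
User-agent: GPTBot
User-agent: ClaudeBot
User-agent: CCBot
User-agent: Google-Extended
User-agent: OAI-SearchBot
User-agent: PerplexityBot
User-agent: ChatGPT-User
Disallow: /
"""

# Hypothetical deliberate rule: only one training crawler is blocked.
SELECTIVE = """\
User-agent: GPTBot
Disallow: /
"""

print(blocked_retrieval_bots(BLANKET))    # ['OAI-SearchBot', 'PerplexityBot', 'ChatGPT-User']
print(blocked_retrieval_bots(SELECTIVE))  # []
```

A non-empty list from blocked_retrieval_bots is the signal to go back and decide each bot's line on purpose.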

A practical baseline configuration

For a brand pursuing AI visibility: explicitly allow retrieval bots (OAI-SearchBot, PerplexityBot, ChatGPT-User), decide on training bots (GPTBot, ClaudeBot, CCBot, Google-Extended) case by case, and keep standard search crawlers unrestricted. Reference your sitemap, avoid disallowing the CSS and JS files that bots need to render pages, and test changes before deploying, since one bad wildcard can silently erase AI citation eligibility. Then verify real-world behavior: server-side bot logs, such as those from Geonimo's AI traffic analytics, confirm which crawlers respect your rules and what they fetch.
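Testing before deploying can be as simple as a table of expectations checked against the candidate file. The sketch below is illustrative only: the CANDIDATE file, the choice to block GPTBot and CCBot, the /admin/ and /assets/ paths and the sitemap URL are all assumptions, not recommendations. It uses Python's urllib.robotparser (site_maps() needs Python 3.8 or later), which applies the first matching rule in file order and does not interpret * or $ inside paths, so results for wildcard rules may differ from crawlers that follow RFC 9309's longest-match behavior.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical candidate file; the training-bot decisions are examples only.
CANDIDATE = """\
User-agent: OAI-SearchBot
User-agent: PerplexityBot
User-agent: ChatGPT-User
Allow: /

User-agent: GPTBot
User-agent: CCBot
Disallow: /

User-agent: *
Disallow: /admin/

Sitemap: https://yoursite.com/sitemap.xml
"""

# (bot, path, should_be_allowed): the policy you intend to ship.
EXPECTATIONS = [
    ("OAI-SearchBot", "/", True),
    ("PerplexityBot", "/blog/post", True),
    ("ChatGPT-User", "/", True),
    ("GPTBot", "/", False),
    ("CCBot", "/", False),
    ("Googlebot", "/", True),
    ("Googlebot", "/assets/site.css", True),
    ("Googlebot", "/assets/app.js", True),
]


def check(robots_txt, expectations, base="https://yoursite.com"):
    """Return (mismatches, sitemap URLs) for a candidate robots.txt."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    failures = [
        (bot, path)
        for bot, path, expected in expectations
        if parser.can_fetch(bot, base + path) != expected
    ]
    return failures, parser.site_maps()


failures, sitemaps = check(CANDIDATE, EXPECTATIONS)
print(failures)  # []
print(sitemaps)  # ['https://yoursite.com/sitemap.xml']
```

Running the same check in a deployment pipeline, and failing the build when the list of mismatches is non-empty, is one way to catch a stray wildcard before it reaches production.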

Frequently asked questions

Should I use robots.txt to block all AI bots?

Only if you have decided to forgo AI search visibility entirely. Blanket blocks remove you from ChatGPT search, Perplexity and other answer engines, costing citations and referral traffic. Most sites should allow retrieval crawlers and make individual decisions about training crawlers like GPTBot and CCBot.

Do AI companies actually obey robots.txt?

The major operators — OpenAI, Anthropic, Google, Microsoft — document their user agents and state they honor robots.txt. Compliance is voluntary, and there have been disputes over specific companies' practices. For content you must protect, combine robots.txt with CDN or firewall enforcement based on verified bot IP ranges.
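The IP check behind that kind of enforcement can be sketched in a few lines. Everything specific here is an assumption: the ranges are reserved documentation addresses rather than any operator's real ranges, the user-agent strings are invented, and a real deployment would load whatever ranges each operator actually publishes and apply the result at the CDN or firewall.

```python
import ipaddress

# Placeholder ranges from the reserved documentation blocks, NOT real bot IPs.
PUBLISHED_RANGES = {
    "GPTBot": ["192.0.2.0/24"],
    "PerplexityBot": ["198.51.100.0/24"],
}


def claimed_bot(user_agent):
    """Return the known bot name a user-agent string claims to be, if any."""
    ua = user_agent.lower()
    for bot in PUBLISHED_RANGES:
        if bot.lower() in ua:
            return bot
    return None


def is_verified(user_agent, remote_ip):
    """True only if the claimed bot's request comes from one of its ranges."""
    bot = claimed_bot(user_agent)
    if bot is None:
        return False
    address = ipaddress.ip_address(remote_ip)
    return any(
        address in ipaddress.ip_network(cidr)
        for cidr in PUBLISHED_RANGES[bot]
    )


# Claimed identity and source IP agree.
print(is_verified("Mozilla/5.0 (compatible; GPTBot/1.0)", "192.0.2.44"))   # True
# Same claimed identity from an address outside the range: treat as spoofed.
print(is_verified("Mozilla/5.0 (compatible; GPTBot/1.0)", "203.0.113.9"))  # False
```

Requests that claim a bot's name but fail the range check are the ones robots.txt cannot stop, and they are the natural target for a block rule.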

Does robots.txt affect my Google rankings?

Blocking AI-specific crawlers has no effect on Google Search rankings, which depend on Googlebot access. Problems arise when overly broad rules accidentally restrict Googlebot or block the CSS and JavaScript files Google needs for rendering. Keep AI bot rules in their own User-agent groups, separate from search crawler rules.

Related terms

AI Crawler

An AI crawler is an automated bot operated by an AI company that fetches web pages to collect training data for language models or to retrieve fresh content for AI search answers. Examples include GPTBot, ClaudeBot and PerplexityBot. Each identifies itself with a user agent string and can be allowed or blocked via robots.txt.

GPTBot

GPTBot is OpenAI's web crawler that collects publicly available content to train and improve its language models, including the GPT series. It identifies itself with the GPTBot user agent and respects robots.txt, so site owners can block it. Blocking GPTBot affects model training only, not ChatGPT search citations.

Google-Extended

Google-Extended is a robots.txt control token that lets site owners opt out of having their content used to train Google's Gemini models. It is not a separate crawler; Googlebot still crawls normally. Blocking Google-Extended does not remove a site from Google Search or AI Overviews, which follow standard search indexing.

llms.txt

llms.txt is a proposed convention in which a site publishes a Markdown file at its root, giving large language models a curated map of its most important content. The goal is to help AI systems find canonical pages efficiently. Adoption is growing, but it is not an official standard and major AI engines have not committed to honoring it.

Last updated: 2026-06-11
