User Agent

A user agent is the identification string a browser or bot sends with every web request, declaring what software is making it. AI crawlers identify themselves with tokens like GPTBot or PerplexityBot, which is how sites detect AI bot activity in logs and target them with specific robots.txt rules.

The identity header of the web

Every HTTP request carries a User-Agent header. Browsers send long strings describing browser and OS; well-behaved bots send strings containing a stable product token, often with a URL documenting the bot. That token is the handle for everything in crawler management: robots.txt groups match on it, log analysis filters by it, and CDN rules act on it. The major AI operators publish their tokens, including GPTBot, OAI-SearchBot, ChatGPT-User, ClaudeBot, PerplexityBot and Google's crawler family, making AI bot activity identifiable to anyone who looks at server logs.
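As a rough illustration of how a product token can be picked out of a User-Agent header, the Python sketch below runs a case-insensitive substring check against the tokens named above. The function name `match_ai_token` and the sample header strings are illustrative assumptions, not the operators' exact published strings.

```python
from typing import Optional

# Tokens named in this article. Google's crawler family is left out for brevity.
AI_BOT_TOKENS = (
    "GPTBot",
    "OAI-SearchBot",
    "ChatGPT-User",
    "ClaudeBot",
    "PerplexityBot",
)


def match_ai_token(user_agent: str) -> Optional[str]:
    """Return the first known AI bot token found in the header, else None."""
    ua = user_agent.lower()
    for token in AI_BOT_TOKENS:
        if token.lower() in ua:
            return token
    return None


# Illustrative header values (assumed shapes, not exact published strings).
bot_ua = "Mozilla/5.0 (compatible; GPTBot/1.0; +https://example.com/bot)"
browser_ua = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/120.0.0.0 Safari/537.36"

print(match_ai_token(bot_ua))
print(match_ai_token(browser_ua))
```

A match here only says what the request claims to be; it says nothing about whether the claim is true.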

Spoofing and verification

A user agent is a self-declaration, trivially faked. Scrapers impersonate Googlebot to slip past bot defenses, and some impersonate AI bots for the same reason. Verification closes the gap: major operators publish official IP ranges or support reverse-DNS checks, so a request claiming to be GPTBot from an unrelated IP is an impostor. This matters in both directions: for blocking, robots.txt only restrains honest bots, so hard enforcement needs IP-verified CDN rules; for measurement, unverified counting inflates your bot traffic numbers with fakes.
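The IP-range side of verification can be sketched with Python's standard `ipaddress` module. The ranges below are placeholders taken from the RFC 5737 documentation blocks, not any operator's real ranges; in practice they would be loaded from the list each operator publishes. The names `PUBLISHED_RANGES` and `is_verified` are hypothetical.

```python
import ipaddress

# PLACEHOLDER ranges (RFC 5737 documentation blocks), NOT real operator ranges.
# Replace with the CIDR lists each operator publishes.
PUBLISHED_RANGES = {
    "GPTBot": ["192.0.2.0/24"],
    "ClaudeBot": ["198.51.100.0/24"],
}


def is_verified(token: str, remote_ip: str) -> bool:
    """True if remote_ip falls inside a range listed for the claimed token."""
    ip = ipaddress.ip_address(remote_ip)
    for cidr in PUBLISHED_RANGES.get(token, []):
        if ip in ipaddress.ip_network(cidr):
            return True
    return False


print(is_verified("GPTBot", "192.0.2.17"))   # inside the placeholder range
print(is_verified("GPTBot", "203.0.113.5"))  # claims GPTBot from an unrelated IP
```

A token with no known ranges is treated as unverified rather than trusted, which keeps both blocking rules and traffic counts conservative.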

User agents in AI visibility work

For generative engine optimization (GEO) practitioners, the user agent is the primary lens on AI platform behavior. Filtering server logs by AI tokens reveals which platforms crawl you, which pages they read and how crawl patterns change after content updates. It also drives policy precision: separate robots.txt groups per token let you allow retrieval bots while restricting training bots. Edge-based trackers put this into practice: Geonimo's Cloudflare Worker, for example, classifies each request by user agent into a named AI bot for its AI traffic analytics before the page is served.
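The log-filtering step might look like the following Python sketch, which counts requests per AI token and path. It assumes logs in the common "combined" access-log layout; the regular expression, the function name `crawl_counts` and the sample lines are all illustrative and would need adjusting to a real server's format.

```python
import re
from collections import Counter

AI_BOT_TOKENS = ("GPTBot", "OAI-SearchBot", "ChatGPT-User", "ClaudeBot", "PerplexityBot")

# Assumed "combined" log layout: "METHOD path proto" status bytes "referer" "user-agent"
LOG_PATTERN = re.compile(
    r'"(?:GET|HEAD|POST) (?P<path>\S+) [^"]*" \d{3} \S+ "[^"]*" "(?P<ua>[^"]*)"'
)


def crawl_counts(lines):
    """Count hits per (token, path) for lines whose user agent names an AI bot."""
    counts = Counter()
    for line in lines:
        match = LOG_PATTERN.search(line)
        if not match:
            continue  # skip lines that do not fit the assumed layout
        ua = match.group("ua").lower()
        for token in AI_BOT_TOKENS:
            if token.lower() in ua:
                counts[(token, match.group("path"))] += 1
                break
    return counts


# Made-up sample lines for illustration only.
SAMPLE_LOG = [
    '192.0.2.17 - - [10/Jun/2026:12:00:00 +0000] "GET /pricing HTTP/1.1" 200 5123 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
    '192.0.2.18 - - [10/Jun/2026:12:05:00 +0000] "GET /pricing HTTP/1.1" 200 5123 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
    '198.51.100.4 - - [10/Jun/2026:12:06:00 +0000] "GET /blog HTTP/1.1" 200 880 "-" "Mozilla/5.0 (compatible; PerplexityBot/1.0)"',
    '203.0.113.9 - - [10/Jun/2026:12:07:00 +0000] "GET / HTTP/1.1" 200 1024 "-" "Mozilla/5.0 (Windows NT 10.0) Chrome/120.0.0.0"',
]

for (token, path), hits in crawl_counts(SAMPLE_LOG).most_common():
    print(token, path, hits)
```

These counts reflect claimed identities only; pairing them with IP verification keeps impostors out of the totals.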

Frequently asked questions

What user agents do the major AI crawlers use?

OpenAI uses GPTBot for training, OAI-SearchBot for search indexing and ChatGPT-User for live in-conversation fetches. Anthropic uses ClaudeBot. Perplexity uses PerplexityBot. Google's crawling runs through Googlebot and related tokens, with Google-Extended as a robots.txt control rather than a visiting bot. Each operator documents its strings publicly.

Can I trust the user agent string in my logs?

Not blindly. Any client can claim any user agent, and scrapers routinely impersonate trusted bots. For decisions that matter, verify claimed identity against the operator's published IP ranges or reverse-DNS records. Verified matches are reliable; unverified matches should be treated as unconfirmed and possibly hostile.
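For operators that support reverse-DNS checks, one common approach is a forward-confirmed reverse lookup: resolve the IP to a hostname, confirm the hostname sits under a domain the operator documents, then resolve that hostname back and confirm it returns the same IP. The Python sketch below assumes IPv4 and takes the allowed domain suffixes as a parameter, since those differ per operator. The name `fcrdns_ok` and the injectable resolver arguments are hypothetical conveniences that allow testing without network access.

```python
import socket
from typing import Callable, Iterable, List


def _reverse(ip: str) -> str:
    # PTR lookup: returns the primary hostname for the address.
    return socket.gethostbyaddr(ip)[0]


def _forward(host: str) -> List[str]:
    # A-record lookup (IPv4 only): returns the addresses for the hostname.
    return socket.gethostbyname_ex(host)[2]


def fcrdns_ok(
    ip: str,
    allowed_suffixes: Iterable[str],
    reverse: Callable[[str], str] = _reverse,
    forward: Callable[[str], List[str]] = _forward,
) -> bool:
    """Forward-confirmed reverse DNS check against operator-documented domains."""
    try:
        host = reverse(ip).rstrip(".").lower()
    except OSError:
        return False  # no PTR record, or lookup failed
    if not any(host == s or host.endswith("." + s) for s in allowed_suffixes):
        return False  # hostname is outside the documented domains
    try:
        return ip in forward(host)  # hostname must resolve back to the same IP
    except OSError:
        return False
```

The forward step matters because whoever controls an IP block also controls its PTR records and can point them at any hostname; only the domain's real owner controls the forward record.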

How do I target a specific bot in robots.txt?

Create a group starting with User-agent: followed by the bot's token, for example GPTBot, then add Allow and Disallow rules for that group. A bot follows the most specific group matching its token and falls back to the wildcard group (User-agent: *) only when no named group matches; the two are not merged, so a bot with its own group ignores the wildcard rules. This lets you set different policies per AI crawler.
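A per-token policy can be checked locally with Python's standard `urllib.robotparser`. The robots.txt content, the `example.com` URLs and the helper `allowed` below are illustrative assumptions. Python's parser is also only one implementation, so a real crawler may resolve overlapping Allow and Disallow rules differently.

```python
from urllib.robotparser import RobotFileParser

# Example policy: block the training crawler, allow the search crawler,
# and keep /private/ off limits for everyone without a named group.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(token: str, path: str) -> bool:
    """Ask the parser whether the given token may fetch the given path."""
    return parser.can_fetch(token, "https://example.com" + path)


print(allowed("GPTBot", "/article"))         # named group blocks everything
print(allowed("OAI-SearchBot", "/article"))  # named group allows everything
print(allowed("SomeOtherBot", "/private/x")) # falls back to the wildcard group
```

Because OAI-SearchBot has its own group, the wildcard's /private/ rule does not apply to it; a rule meant for every bot has to be repeated in each named group.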

Related terms

AI Crawler

An AI crawler is an automated bot operated by an AI company that fetches web pages to collect training data for language models or to retrieve fresh content for AI search answers. Examples include GPTBot, ClaudeBot and PerplexityBot. Each identifies itself with a user agent string and can be allowed or blocked via robots.txt.

Bot Traffic

Bot traffic is automated, non-human activity on a website, ranging from search and AI crawlers to scrapers and malicious bots. In AI search measurement, the relevant slice is crawls by AI bots like GPTBot or PerplexityBot, which signal that AI platforms are reading your content but are invisible to client-side analytics.

robots.txt

robots.txt is a plain-text file at a website's root that tells crawlers which parts of the site they may access, using User-agent and Disallow/Allow directives. Originally built for search engines, it is now the primary mechanism for controlling AI crawlers like GPTBot and PerplexityBot. Compliance is voluntary but honored by major operators.

GPTBot

GPTBot is OpenAI's web crawler that collects publicly available content to train and improve its language models, including the GPT series. It identifies itself with the GPTBot user agent and respects robots.txt, so site owners can block it. Blocking GPTBot affects model training only; ChatGPT search citations depend on OAI-SearchBot, a separate token with its own robots.txt group.

Last updated: 2026-06-11
