Answer Volatility

Answer volatility is the tendency of AI engines to give different answers to the same prompt across runs, days and models, caused by sampling temperature, model updates and changing retrieval results. It makes single spot-checks unreliable for measuring AI visibility and is the core reason repeated daily sampling is required.

Why the same prompt gives different answers

Three mechanisms drive volatility. First, generation is probabilistic: models sample each token from a probability distribution, with a temperature setting controlling how much randomness is allowed, so two identical requests can legitimately produce different brand lists. Second, providers ship model updates and silent revisions that shift behavior overnight. Third, engines using retrieval-augmented generation depend on live search results that change as indexes refresh, so the source material itself moves between runs.

Volatility varies by prompt type. Questions with one dominant answer are stable. Open recommendation prompts ("best tools for X") are the most volatile, because many brands plausibly fit and sampling decides who makes the cut on any given run. These are precisely the prompts marketers care about most.
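As a rough illustration of why open prompts vary more than prompts with one dominant answer, the sketch below turns hypothetical relevance scores into sampling probabilities with a temperature-scaled softmax. The brand names, the scores and the idea of scoring whole brands rather than individual tokens are simplifying assumptions, not a description of how any particular engine works.

```python
import math
import random

def brand_probabilities(scores, temperature=1.0):
    """Temperature-scaled softmax over hypothetical brand scores."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [s / temperature for s in scores.values()]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scaled]
    total = sum(weights)
    return {brand: w / total for brand, w in zip(scores, weights)}

def simulate_runs(scores, runs, temperature=1.0, seed=0):
    """Draw one brand per run and return the share of runs each brand won."""
    rng = random.Random(seed)
    probs = brand_probabilities(scores, temperature)
    picks = rng.choices(list(probs), weights=list(probs.values()), k=runs)
    return {brand: picks.count(brand) / runs for brand in probs}

# Hypothetical scores: one dominant answer vs. several plausible ones.
dominant_prompt = {"Alpha": 6.0, "Bravo": 1.0, "Charlie": 0.5}
open_prompt = {"Alpha": 2.0, "Bravo": 1.8, "Charlie": 1.6}

print(brand_probabilities(dominant_prompt))  # Alpha takes almost all the mass
print(brand_probabilities(open_prompt))      # mass is spread across brands
print(simulate_runs(open_prompt, runs=20))   # shares from 20 simulated runs
```

With close scores, no brand holds a commanding share of the probability mass, so which one appears in a given run is largely down to the draw. Lowering the temperature concentrates the mass on the top-scoring brand and reduces that variation.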

Why spot checks mislead

Asking ChatGPT once whether it mentions your brand is a coin flip presented as a measurement. A founder who checks one morning and sees the brand listed may conclude all is well, while the brand actually appears in a minority of runs; the reverse panic is equally common. Single observations cannot distinguish a real visibility change from sampling noise, and decisions made on them (celebrating, escalating, reallocating budget) are decisions made on noise.

The statistical fix is the same as for any noisy measurement: repeated sampling. Running the same prompt set daily and aggregating over a week turns a coin flip into a mention rate with an uncertainty range you can act on, and makes genuine shifts, such as a model update dropping you from a category, distinguishable from variance.
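One common way to attach an uncertainty range to a mention rate is the Wilson score interval for a binomial proportion. The sketch below is a minimal version that assumes each run is an independent yes/no observation, which is only approximately true for runs spread across days; the function name and the example counts are illustrative.

```python
import math

def wilson_interval(mentions, runs, z=1.96):
    """Wilson score interval for a mention rate (z=1.96 gives roughly 95%)."""
    if runs <= 0:
        raise ValueError("runs must be positive")
    if not 0 <= mentions <= runs:
        raise ValueError("mentions must be between 0 and runs")
    p = mentions / runs
    denom = 1 + z**2 / runs
    center = (p + z**2 / (2 * runs)) / denom
    half = z * math.sqrt(p * (1 - p) / runs + z**2 / (4 * runs**2)) / denom
    return max(0.0, center - half), min(1.0, center + half)

# A single spot check that happened to show the brand.
print(wilson_interval(mentions=1, runs=1))

# Hypothetical week of sampling: 28 mentions in 70 runs.
print(wilson_interval(mentions=28, runs=70))
```

The single spot check yields an interval that spans most of the range from 0 to 1, which is the coin-flip problem in numeric form. The 70-run sample narrows the range enough to compare one week against another.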

Managing volatility in your measurement program

Treat volatility as a property to measure, not just noise to remove. A brand mentioned in 90% of runs has durable presence; one at 40% is on the bubble and most exposed to model updates. Track stability per prompt and per provider, report weekly aggregates rather than daily points, and annotate known model releases so real regime changes are not mistaken for drift. This sampling discipline is why dedicated platforms run prompts on an automated daily schedule: Geonimo samples every tracked prompt across providers each day through prompt tracking, so its visibility scores reflect distributions rather than lucky or unlucky single runs.
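A minimal sketch of weekly aggregation per prompt and per provider might look like the following. The record layout, the function names and the label cutoffs are assumptions made for illustration; the text above only gives 90% and 40% as examples of durable and on-the-bubble presence, and this is not a description of any platform's internals.

```python
from collections import defaultdict
from datetime import date

def weekly_mention_rates(observations):
    """Aggregate (day, prompt, provider, mentioned) records by ISO week.

    Returns {(iso_year, iso_week, prompt, provider): mention_rate}.
    """
    counts = defaultdict(lambda: [0, 0])  # key -> [mentions, runs]
    for day, prompt, provider, mentioned in observations:
        iso = day.isocalendar()
        key = (iso[0], iso[1], prompt, provider)
        counts[key][0] += int(mentioned)
        counts[key][1] += 1
    return {key: m / n for key, (m, n) in counts.items()}

def stability_label(rate, durable_at=0.9, bubble_below=0.5):
    """Label a weekly rate; both cutoffs are hypothetical defaults."""
    if rate >= durable_at:
        return "durable"
    if rate < bubble_below:
        return "on the bubble"
    return "moderate"

# Hypothetical week: one run per day, brand mentioned on 3 of 7 days.
week = [
    (date(2024, 1, d), "best tools for X", "provider-a", d in (1, 3, 6))
    for d in range(1, 8)
]
for key, rate in weekly_mention_rates(week).items():
    print(key, round(rate, 2), stability_label(rate))
```

Keying the aggregate on prompt and provider keeps a provider-specific drop visible instead of averaging it away in a single overall number.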

Frequently asked questions

Why does ChatGPT mention my brand sometimes but not always?

Generation is probabilistic: for open recommendation prompts, several brands plausibly fit and sampling decides which appear in each run. Your brand sits somewhere in that probability distribution. The honest metric is the share of runs that include you, measured over many samples, not the outcome of any single conversation.

How many times should I run a prompt to get a reliable visibility measure?

Daily sampling aggregated over at least a week per prompt and provider is the practical standard. The exact number depends on how volatile the prompt is: stable factual prompts converge quickly, while open "best of" prompts need more runs. Trend direction across weeks matters more than precision on any single day.
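For a rough sense of scale, the standard normal-approximation formula for estimating a proportion, n = z^2 * p * (1 - p) / E^2, gives the number of runs needed for a chosen margin of error E. The sketch below assumes independent runs and a guessed mention rate p; the approximation is weak for rates very close to 0 or 1, so treat the result as a planning estimate rather than a guarantee.

```python
import math

def runs_needed(expected_rate, margin, z=1.96):
    """Approximate runs needed to estimate a mention rate within +/- margin."""
    if not 0 <= expected_rate <= 1:
        raise ValueError("expected_rate must be between 0 and 1")
    if margin <= 0:
        raise ValueError("margin must be positive")
    n = z**2 * expected_rate * (1 - expected_rate) / margin**2
    return max(1, math.ceil(n))

# Volatile open prompt: a rate near 50% is the worst case.
print(runs_needed(expected_rate=0.5, margin=0.1))

# Stable prompt: a brand that appears in almost every run.
print(runs_needed(expected_rate=0.95, margin=0.1))
```

The worst case sits at a 50% mention rate, which is why prompts where a brand appears about half the time need the most runs, while prompts with near-certain outcomes converge quickly.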

Did a model update kill my AI visibility or is it just noise?

Check three things: does the drop persist across multiple consecutive days, does it appear on more than one prompt, and does it coincide with a known model or index update? Noise is scattered and recovers; an update-driven change is sudden, sustained and often provider-specific. Annotated daily history makes the distinction obvious.
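The three checks can be sketched as a small helper. The thresholds (a drop of 15 points below baseline, three consecutive days, two prompts), the input layout and the verdict wording are all hypothetical choices, and the daily rates are assumed to come from several runs per prompt per day.

```python
def longest_low_streak(daily_rates, baseline, tolerance=0.15):
    """Longest run of consecutive days with a rate below baseline - tolerance."""
    longest = current = 0
    for rate in daily_rates:
        if rate < baseline - tolerance:
            current += 1
            longest = max(longest, current)
        else:
            current = 0
    return longest

def assess_drop(daily_rates_by_prompt, baseline_by_prompt, near_known_update,
                min_days=3, min_prompts=2):
    """Apply the three checks and return (verdict, individual check results)."""
    affected = [
        prompt for prompt, rates in daily_rates_by_prompt.items()
        if longest_low_streak(rates, baseline_by_prompt[prompt]) >= min_days
    ]
    checks = {
        "persists_across_days": len(affected) >= 1,
        "spans_multiple_prompts": len(affected) >= min_prompts,
        "coincides_with_update": bool(near_known_update),
    }
    if all(checks.values()):
        verdict = "likely update-driven"
    elif not checks["persists_across_days"]:
        verdict = "likely noise"
    else:
        verdict = "unclear, keep sampling"
    return verdict, checks

# Hypothetical daily mention rates for two prompts on one provider.
rates = {
    "best tools for X": [0.8, 0.8, 0.2, 0.2, 0.2, 0.1, 0.2],
    "top X platforms": [0.7, 0.6, 0.1, 0.1, 0.1, 0.2, 0.1],
}
baselines = {"best tools for X": 0.8, "top X platforms": 0.7}
print(assess_drop(rates, baselines, near_known_update=True))
```

Running the helper separately for each provider keeps provider-specific changes from being diluted by engines that did not change.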

Related terms

Temperature (LLM)

Temperature is a setting that controls how random a language model's output is during generation: low values produce consistent, predictable answers, higher values produce varied, creative ones. It is a key reason the same prompt about a product category can name different brands on different runs of the same AI.

Prompt Tracking

Prompt tracking is the practice of repeatedly querying AI engines with a fixed set of prompts that mirror real customer questions, then recording which brands are mentioned, cited, and recommended in each answer. It is the AI-search equivalent of rank tracking, providing the raw data behind visibility scores and competitive analysis.

Visibility Score

A visibility score is a composite index, usually 0 to 100, that summarizes how prominently a brand appears in AI-generated answers. It typically combines mention frequency, mention position, and citation presence across a tracked prompt set and multiple AI engines, giving teams one trendable number for overall AI search performance.
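To make the idea of a composite index concrete, here is one hypothetical way to combine the three ingredients into a 0 to 100 number. The weights, the 1/position decay and the record layout are invented for illustration and do not describe how any specific vendor computes its score.

```python
def visibility_score(runs, weights=(0.5, 0.3, 0.2)):
    """Hypothetical composite score from 0 to 100.

    runs: list of dicts with 'position' (1-based rank of the brand in the
    answer, or None if absent) and 'cited' (True if the brand's page was cited).
    weights: (mention, position, citation) weights, assumed to sum to 1.
    """
    if not runs:
        return 0.0
    w_mention, w_position, w_citation = weights
    n = len(runs)
    mention = sum(r["position"] is not None for r in runs) / n
    # Earlier positions count more: rank 1 scores 1.0, rank 2 scores 0.5, etc.
    position = sum(1 / r["position"] for r in runs if r["position"] is not None) / n
    citation = sum(bool(r["cited"]) for r in runs) / n
    return 100 * (w_mention * mention + w_position * position + w_citation * citation)

# Hypothetical results from four runs of one prompt.
sample_runs = [
    {"position": 1, "cited": True},
    {"position": 2, "cited": False},
    {"position": None, "cited": False},
    {"position": None, "cited": False},
]
print(round(visibility_score(sample_runs), 2))
```

Because the score is an average over runs, it inherits the sampling noise of those runs, so it is only as trustworthy as the number of samples behind it.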

Retrieval-Augmented Generation (RAG)

Retrieval-augmented generation is an AI architecture where a language model retrieves relevant documents, typically via web or database search, before generating its answer, grounding the response in fetched content. RAG powers AI search engines like Perplexity and ChatGPT Search, and it is the mechanism through which web pages earn citations in AI answers.

Last updated: 2026-06-11
