
Inference (LLM)

Inference is the runtime process where a trained language model generates output, predicting tokens one by one in response to a prompt. Every AI answer users see is an inference run. Its cost and latency constraints explain why engines retrieve few sources, summarize aggressively, and cache answers, all of which shape brand visibility.

What happens during inference

At inference time the model takes everything in its context window (instructions, conversation history, retrieved passages) and generates a response token by token. Each step is a forward pass through billions of parameters. Settings like temperature control how deterministic the output is; serving infrastructure controls speed and cost.
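The token-by-token loop and the effect of temperature can be illustrated with a minimal sketch. Everything here is a toy assumption: `TOY_LOGITS` is a hypothetical hand-written score table standing in for a real model's forward pass, and `generate` is an illustrative name, not any vendor's API.

```python
import math
import random

# Hypothetical toy "model": maps the last token to scores (logits) for each
# candidate next token. A real LLM computes these scores with a forward pass
# over the whole context window.
TOY_LOGITS = {
    "<start>": {"the": 2.0, "a": 1.0},
    "the": {"best": 2.0, "brand": 1.5},
    "a": {"brand": 2.0, "best": 0.5},
    "best": {"brand": 2.5, "<end>": 0.5},
    "brand": {"<end>": 3.0, "the": 0.2},
}


def next_token_probs(context, temperature):
    """Softmax over temperature-scaled logits for the next token."""
    logits = TOY_LOGITS[context[-1]]
    scaled = {tok: score / temperature for tok, score in logits.items()}
    peak = max(scaled.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(s - peak) for tok, s in scaled.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}


def generate(prompt, temperature=1.0, max_tokens=10, rng=None):
    """Generate tokens one at a time until <end> or max_tokens is reached."""
    rng = rng or random.Random()
    context = list(prompt)
    output = []
    for _ in range(max_tokens):
        if temperature == 0:
            # Greedy decoding: always take the highest-scoring token.
            logits = TOY_LOGITS[context[-1]]
            token = max(logits, key=logits.get)
        else:
            probs = next_token_probs(context, temperature)
            token = rng.choices(list(probs), weights=list(probs.values()))[0]
        if token == "<end>":
            break
        context.append(token)  # the new token becomes part of the context
        output.append(token)
    return output


print(generate(["<start>"], temperature=0))
print(generate(["<start>"], temperature=1.5))
```

With `temperature=0` the toy model returns the same sequence on every call. With a higher temperature the sequence can differ between calls, which is the same mechanism that lets a real engine name different brands for the same prompt.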

Training happens rarely; inference happens billions of times daily. The economics of those runs, chiefly GPU time per token, drive nearly every product decision in AI search.

How inference economics shape AI answers

Because every token costs compute, engines optimize ruthlessly: they retrieve a handful of sources instead of hundreds, prefer concise passages they can use efficiently, truncate long inputs, and route easy queries to smaller, cheaper models. Reasoning modes that spend more inference compute are reserved for hard questions.
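One way to picture a context budget is as a packing problem. The sketch below is an assumption for illustration only and does not describe how any specific engine selects sources. The function names, the relevance scores, and the greedy "relevance per token" rule are all hypothetical, and token counts use the rough rule of thumb of about four characters per English token.

```python
import math


def estimate_tokens(text):
    """Rough estimate: about four characters per token in English."""
    return max(1, math.ceil(len(text) / 4))


def pack_context(passages, budget_tokens, max_sources=5):
    """Greedily pick passages with the best relevance per estimated token.

    passages: list of (source, relevance, text) tuples, where relevance is a
    hypothetical score between 0 and 1 from some retrieval step.
    Returns the chosen source names and the estimated tokens used.
    """
    ranked = sorted(
        passages,
        key=lambda p: p[1] / estimate_tokens(p[2]),
        reverse=True,
    )
    chosen, used = [], 0
    for source, _relevance, text in ranked:
        if len(chosen) == max_sources:
            break
        cost = estimate_tokens(text)
        if used + cost > budget_tokens:
            continue  # too long for the remaining budget, try the next one
        chosen.append(source)
        used += cost
    return chosen, used


passages = [
    ("dense-page", 0.90, "x" * 200),    # short, highly relevant passage
    ("padded-page", 0.95, "x" * 4000),  # slightly more relevant, 20x longer
    ("short-page", 0.50, "x" * 100),
]
print(pack_context(passages, budget_tokens=100))
```

In this toy setup the padded page is dropped despite having the highest raw relevance, because it cannot fit the budget. That is the intuition behind favoring dense passages.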

For content owners, the implication is consistent with everything else in GEO: dense, structured passages are not just easier to retrieve but also cheaper for an engine to consume, and product pipelines are tuned to favor exactly that.

Inference variability and measuring visibility

Inference is probabilistic: the same prompt can yield different answers across runs, with different brands mentioned, because of sampling randomness, model routing, and shifting retrieval. This run-to-run variation is known as answer volatility. A single manual check of "what does ChatGPT say about us" is therefore an anecdote, not a measurement.

Sound measurement samples repeatedly: running tracked prompts daily and computing mention rates over time. Geonimo does precisely this across engines, replacing one-off screenshots with statistically meaningful visibility trends.
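The difference between a spot-check and a measured rate can be sketched in a few lines. This is an illustrative assumption, not Geonimo's implementation: the brand names are made up, brand detection is a naive case-insensitive substring match, and the uncertainty estimate uses the standard Wilson score interval for a proportion.

```python
import math


def mention_rate(answers, brand):
    """Count how many sampled answers mention the brand (naive substring match)."""
    hits = sum(1 for answer in answers if brand.lower() in answer.lower())
    return hits, len(answers)


def wilson_interval(hits, n, z=1.96):
    """Wilson score interval for a proportion (z=1.96 gives roughly 95%)."""
    if n == 0:
        raise ValueError("need at least one sampled answer")
    p = hits / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return max(0.0, center - half), min(1.0, center + half)


# Hypothetical answers collected by running the same prompt several times.
answers = [
    "Popular options include Acme and Globex.",
    "Many teams choose Globex.",
    "acme is a common pick for small teams.",
]
hits, n = mention_rate(answers, "Acme")
print(hits, n, wilson_interval(hits, n))

# One check versus one hundred runs with a 30% mention rate.
print(wilson_interval(1, 1))
print(wilson_interval(30, 100))
```

A single run in which the brand appears is compatible with a very wide range of true mention rates, while a hundred runs narrow that range considerably. This is why rates over many runs are more informative than screenshots.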

Frequently asked questions

What is the difference between training and inference?

Training is the infrequent, compute-intensive process of learning model weights from data. Inference is using the finished model to generate answers, performed billions of times a day. What the model knows about your brand from its own weights is set at training; how your brand surfaces in any specific answer is decided at inference, influenced by retrieval and sampling.

Why does the same AI prompt give different answers each time?

Inference samples tokens probabilistically, controlled by temperature, so wording and even content vary across runs. Platforms also route between model versions and re-run retrieval, changing the sources in context. This variability is why brand visibility must be measured as a rate over many runs, not a single check.

Does inference cost affect which sources AI engines cite?

Indirectly, yes. Token costs push engines toward small context budgets: few retrieved sources, short passages, aggressive summarization. Content that delivers maximum answer value in minimum tokens (direct claims, clean structure, unique data) fits those budgets and tends to be favored over sprawling, padded pages.

Related terms

Temperature (LLM)

Temperature is a setting that controls how random a language model's output is during generation: low values produce consistent, predictable answers, higher values produce varied, creative ones. It is a key reason the same prompt about a product category can name different brands on different runs of the same AI.

Context Window

The context window is the maximum amount of text, measured in tokens, a language model can consider at once: the system prompt, conversation history, retrieved documents, and its own output. It limits how many sources an AI engine can read per answer, making the competition to be among those few sources intense.

Token (LLM)

A token is the basic unit of text a language model processes, typically a word fragment: about four characters, or roughly three-quarters of a word, in English. Models read, generate, and price everything in tokens. Token limits shape how much of a web page an AI can ingest when composing an answer.

Answer Volatility

Answer volatility is the tendency of AI engines to give different answers to the same prompt across runs, days and models, caused by sampling temperature, model updates and changing retrieval results. It makes single spot-checks unreliable for measuring AI visibility and is the core reason repeated daily sampling is required.

Last updated: 2026-06-11
