Inference (LLM)

Inference is the runtime process where a trained language model generates output, predicting tokens one by one in response to a prompt. Every AI answer users see is an inference run. Its cost and latency constraints explain why engines retrieve few sources, summarize aggressively, and cache answers, all of which shape brand visibility.

What happens during inference

At inference time the model takes everything in its context window (instructions, conversation, retrieved passages) and generates a response token by token, each step requiring a forward pass through billions of parameters. Settings like temperature control how deterministic the output is; serving infrastructure determines speed and cost.
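To make the role of temperature concrete, here is a minimal sketch of temperature-scaled sampling over a toy set of next-token scores. The logits, function names, and three-token vocabulary are illustrative assumptions, not any engine's actual implementation; real models score tens of thousands of candidate tokens at each step.

```python
import math
import random

def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities; lower temperature sharpens the distribution."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_token(logits, temperature, rng):
    """Draw one token index at random, weighted by the temperature-scaled probabilities."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

# Hypothetical scores for three candidate next tokens (assumed values).
toy_logits = [2.0, 1.0, 0.1]
low_temp_probs = softmax_with_temperature(toy_logits, 0.2)
high_temp_probs = softmax_with_temperature(toy_logits, 2.0)
```

At a low temperature nearly all probability mass sits on the top-scoring token, so repeated runs tend to agree; at a high temperature the mass spreads out and runs diverge more often.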

Training happens rarely; inference happens billions of times daily. The economics of those runs (GPU time per token) drive nearly every product decision in AI search.
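The arithmetic behind "GPU time per token" can be sketched as below. Every number here (throughput, hourly GPU price, answer length) is a made-up assumption chosen only to show how the quantities relate. The sketch also ignores batching, which lets real serving systems share one GPU across many requests.

```python
def gpu_cost_usd(tokens_generated, tokens_per_second, gpu_usd_per_hour):
    """Estimate the GPU cost of one answer from generation throughput and hourly GPU price."""
    gpu_seconds = tokens_generated / tokens_per_second
    return gpu_seconds * gpu_usd_per_hour / 3600

# Hypothetical inputs: a 500-token answer, 50 tokens/second, 2 USD per GPU-hour.
cost_short = gpu_cost_usd(500, tokens_per_second=50, gpu_usd_per_hour=2.0)

# The same assumptions with an answer four times as long.
cost_long = gpu_cost_usd(2000, tokens_per_second=50, gpu_usd_per_hour=2.0)
```

Under these assumptions cost scales linearly with answer length, which is why a small per-answer saving matters once it is multiplied across a very large number of daily runs.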

How inference economics shape AI answers

Because every token costs compute, engines optimize ruthlessly: they retrieve a handful of sources instead of hundreds, prefer concise passages they can use efficiently, truncate long inputs, and route easy queries to smaller, cheaper models. Reasoning modes that spend more inference compute are reserved for hard questions.
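One way to picture the context budget described above is a greedy packer that prefers passages with the most relevance per token. This is a simplified sketch under stated assumptions. The `Passage` type, the relevance scores, the example sources, and the whitespace-based token count are all hypothetical, and production retrieval pipelines are more elaborate.

```python
from dataclasses import dataclass

@dataclass
class Passage:
    source: str
    text: str
    relevance: float  # hypothetical retrieval score in [0, 1]

def rough_token_count(text):
    # Whitespace splitting is only a crude stand-in for a real tokenizer.
    return len(text.split())

def pack_context(passages, token_budget):
    """Greedily pick passages by relevance per token until the budget is exhausted."""
    ranked = sorted(
        passages,
        key=lambda p: p.relevance / max(rough_token_count(p.text), 1),
        reverse=True,
    )
    chosen, used = [], 0
    for passage in ranked:
        cost = rough_token_count(passage.text)
        if used + cost <= token_budget:
            chosen.append(passage)
            used += cost
    return chosen, used

# Hypothetical candidates: two equally relevant pages, one dense and one padded.
candidates = [
    Passage("dense.example", "Plan A costs 10 USD per seat per month.", 0.9),
    Passage("padded.example", " ".join(["filler"] * 60) + " Plan A costs 10 USD.", 0.9),
    Passage("other.example", "Plan B includes priority support.", 0.6),
]
chosen, used = pack_context(candidates, token_budget=20)
chosen_sources = [p.source for p in chosen]
```

In this toy setup the padded page carries the same relevance score as the dense one but never fits the 20-token budget, so it is left out of the context.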

For content owners, the implication is consistent with everything else in generative engine optimization (GEO): dense, structured passages are not just easier to retrieve, they are also cheaper for an engine to consume, and product pipelines are tuned to favor exactly that.

Inference variability and measuring visibility

Inference is probabilistic: the same prompt can yield different answers across runs, with different brands mentioned, because of sampling randomness, model routing, and shifting retrieval. This run-to-run variation is the basis of answer volatility. A single manual check of "what does ChatGPT say about us" is therefore an anecdote, not a measurement.

Sound measurement samples repeatedly: running tracked prompts daily and computing mention rates over time. Geonimo does precisely this across engines, replacing one-off screenshots with statistically meaningful visibility trends.
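As an illustration of rate-based measurement, the sketch below computes a mention rate from repeated runs and attaches a Wilson score interval to it. It is not a description of Geonimo's method. The brand name, the sample responses, and the naive substring matching are assumptions made for the example.

```python
import math

def count_mentions(responses, brand):
    """Count responses that mention the brand (naive case-insensitive substring match)."""
    hits = sum(1 for r in responses if brand.lower() in r.lower())
    return hits, len(responses)

def wilson_interval(hits, n, z=1.96):
    """Approximate 95% Wilson score interval for a proportion of hits out of n runs."""
    if n == 0:
        raise ValueError("need at least one run")
    p = hits / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width

# Hypothetical answers collected from four runs of the same tracked prompt.
runs = [
    "Popular options include Acme and Globex.",
    "Many teams choose Globex for this.",
    "Acme is a common pick for small teams.",
    "It depends on your budget and team size.",
]
hits, n = count_mentions(runs, "Acme")
low, high = wilson_interval(hits, n)
```

With only four runs the interval is very wide, which illustrates why a handful of manual checks says little and why tracked prompts need many samples before a trend is meaningful.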

Frequently asked questions

What is the difference between training and inference?

Training is the infrequent, compute-intensive process of learning model weights from data. Inference is using the finished model to generate answers, performed billions of times a day. Your brand's baseline representation in the model is set at training; how it surfaces in any specific answer is decided at inference, influenced by retrieval and sampling.

Why does the same AI prompt give different answers each time?

Inference samples tokens probabilistically, controlled by temperature, so wording and even content vary across runs. Platforms also route between model versions and re-run retrieval, changing the sources in context. This variability is why brand visibility must be measured as a rate over many runs, not a single check.

Does inference cost affect which sources AI engines cite?

Indirectly, yes. Token costs push engines toward small context budgets: few retrieved sources, short passages, aggressive summarization. Content that delivers maximum answer value in minimum tokens (direct claims, clean structure, unique data) fits those budgets and is systematically favored over sprawling, padded pages.

Last updated: 2026-06-11
