LLM Determinism and AEO: Why the Score Is Stable When the Model Is Not

In this article, you will learn what LLM determinism actually means in technical terms, why even temperature zero does not produce identical outputs across runs, the four mechanisms that introduce non-determinism into the system serving the model, and the design choices in defensible AEO measurement that absorb the floor-level noise into a stable score.


Your AEO vendor said the scans are deterministic. They probably are not.

Your AEO vendor says the scans run at temperature zero, so the outputs are deterministic. The vendor is wrong about the determinism. The vendor can still be right about the score. Understanding the difference is the difference between a procurement answer that survives and a retainer that does not.

LLM determinism in brand measurement is a real question with a clear answer. Large language models are not deterministic in production, even with temperature set to zero, the value intended to make them deterministic. The reason is not the model; it is the hardware running the model and the way responses are served. The good news for any AEO buyer is that measurement does not require deterministic models. It requires sample design and aggregation engineering that absorb the floor-level noise into a stable score above it. GenPicked builds AEO measurement exactly this way because the discipline has matured to the point where this is well understood.

This article is the technical companion to the reproducibility piece, which covers what stable scores look like at the dashboard layer. Read this one when an engineer or a CFO asks why the underlying model is non-deterministic. Read the reproducibility piece when an account manager asks how scores stay stable across days.

What "deterministic" actually means in LLM context

A deterministic system produces the same output from the same input every time. Most people assume large language models behave deterministically when you set the randomness knob (called temperature) to zero. The reasoning sounds correct. Temperature zero means the model always picks the highest-probability next word. If the model always picks the highest-probability word, the output should be fixed.

The reasoning is correct about the model in isolation. It is wrong about the model in production. Alexander, Radev, and Hashimoto, in a 2026 arXiv preprint, ran 480 attempts with six different large language models at temperature zero, measured both surface-level wording variation and semantic content variation, and found meaningful differences across runs (alexander 2026 llm reproducibility). The mathematical model is deterministic. The system running the model is not.

The gap between "the model" and "the system running the model" is where AEO measurement either falls apart or holds together. Vendors who treat the gap as a vendor problem keep their methodology pages thin and their numbers fragile. Vendors who treat it as an engineering problem publish the design choices that absorb the noise. GenPicked is in the second camp by construction.

The four mechanisms that introduce non-determinism

A working AEO buyer does not need a graduate course in numerical computing. The four mechanisms below explain enough to defend the score to a CFO, and enough to know when a vendor's answer to "is your scan deterministic" is technically accurate or hand-waving.

1. Floating-point arithmetic on GPUs is order-dependent

Computers represent decimals using a fixed number of bits. Adding the same three numbers in a different order can produce different results at the last decimal because each addition loses a tiny amount of precision and the losses do not accumulate the same way. The technical name for this is non-associative floating-point arithmetic, but the buyer-facing version is shorter: adding numbers in a different order can produce different answers at the last decimal.
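The effect is easy to verify on any laptop; the same non-associativity that lets a GPU reduction drift shows up in a three-number Python sum:

```python
# Floating-point addition is not associative: summing the same three
# numbers in a different grouping can produce different results.
a, b, c = 0.1, 0.2, 0.3

left_first = (a + b) + c    # groups the first pair
right_first = a + (b + c)   # groups the second pair

print(left_first == right_first)  # False: they differ at the last decimal
```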

Large language models perform billions of additions and multiplications per response. A model running on a graphics processor (GPU) may schedule those operations differently across runs depending on which other workloads share the chip. Each tiny precision loss is small. The accumulation across billions of operations can shift the model's probability estimates enough that a different word becomes the highest-probability next word. The model then produces a different sentence from the same input.

2. Batch processing effects

Production AEO scans do not run one request at a time. They batch many requests together so the GPU stays busy. The composition of the batch affects the numerical context in which any single request is processed. Two runs of the same prompt, batched alongside different other prompts, can produce different outputs even though the input string is identical.

This is not exotic. It is the default behavior of any production LLM serving infrastructure. A vendor that claims "deterministic at temperature zero" without addressing batch effects is reporting what their model definition says, not what their system actually produces.

3. Infrastructure variance

Different GPU types (A100, H100, B200), different software stacks (CUDA versions, PyTorch versions, inference runtimes), and different serving configurations produce slightly different numerical outputs from the same input. AEO platforms call the same engine endpoints, but the engines themselves are running on shifting hardware. A scan today may hit different infrastructure than a scan tomorrow even when both are pointed at the same API.

This is the same reason a financial model run on two different servers can produce slightly different decimals. Nobody calls that a financial-model defect; the discipline acknowledges floor-level noise and aggregates above it. AEO needs to do the same.

4. Sampling-layer settings beyond temperature

Temperature is one knob among several. Other sampling knobs (top-k, top-p, seed) also affect the output. Setting temperature to zero does not lock the other knobs unless the vendor configures them explicitly. A serving stack with default top-p of 0.95 and an unfixed seed will produce different outputs at temperature zero across runs because the other knobs are still active.
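A hedged sketch of what "configure all the knobs explicitly" looks like. The field names follow the common OpenAI-style request shape; other engines expose similar fields, and the seed value here is arbitrary:

```python
# Pin every sampling knob explicitly instead of inheriting serving-stack
# defaults. Field names follow the common OpenAI-style request shape.
def pinned_sampling_params(seed: int = 42) -> dict:
    return {
        "temperature": 0.0,  # greedy next-token selection
        "top_p": 1.0,        # disable nucleus truncation (not a 0.95 default)
        "seed": seed,        # fix whatever RNG the serving stack still consults
    }

print(pinned_sampling_params())
```

Disclosure matters as much as the values: a vendor that publishes this configuration lets a buyer check that the other knobs were not left active.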

The defensible AEO vendor configures all of these explicitly and discloses the configuration. The "deterministic at temperature zero" vendor often has not.

Large language models are non-deterministic in production even at temperature zero. Alexander, Radev, and Hashimoto, in a 2026 arXiv preprint, ran 480 attempts with six large language models at the setting intended to produce deterministic output and observed meaningful variation in both wording and content (alexander 2026 llm reproducibility). The variation comes from non-associative floating-point arithmetic on GPUs, batch composition effects, infrastructure variance, and sampling knobs beyond temperature.

Why the score above the model is still stable

If the model is noisy, how can the score be stable? The answer is that measurement systems in every mature discipline absorb floor-level noise through aggregation. AEO measurement is no different. The design choices that produce stable scores above noisy models are well understood; they are the same choices that show up on a defensible vendor's methodology page.

Repeated sampling cancels random noise

Querying once per period reports a single draw from a noisy distribution. Querying 200 times per period and reporting the mean reports the center of the distribution, which is far more stable. The standard error of the mean shrinks with the square root of the sample size, so with 200 paired prompts per period, single-query noise shrinks by a factor of about 14. Floor-level model noise no longer dominates the score.
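A minimal sketch of the square-root shrinkage, simulating 0/1 "brand mentioned?" outcomes; the 0.47 mention rate and the sample sizes are made up for illustration:

```python
import math
import random

random.seed(0)

def period_score(n_queries: int, true_rate: float = 0.47) -> float:
    # Mean of n noisy 0/1 "brand mentioned?" outcomes for one period.
    return sum(random.random() < true_rate for _ in range(n_queries)) / n_queries

single = [period_score(1) for _ in range(5)]     # single draws: each 0.0 or 1.0
batched = [period_score(200) for _ in range(5)]  # means cluster near 0.47

# The noise in the mean shrinks as 1/sqrt(n): ~14x smaller at n=200.
print(round(math.sqrt(200), 1))  # 14.1
```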

SE Ranking's 2025 AI Mode study ran 10,000 queries three times on the same day and found 9.2 percent same-day URL consistency at the single-query layer (se ranking 2025 ai mode url consistency). Aggregating across hundreds of paired prompts per period and reporting confidence intervals produces a score whose variation is orders of magnitude smaller than the underlying query noise.

Pairwise comparison stabilizes ranking

Asking the model to produce an absolute ranked list from scratch is fragile. Asking it to choose between two specific options head-to-head and aggregating thousands of those decisions is far more stable. The math behind this is the same comparison-aggregation logic used by chess rating systems and public AI model leaderboards. The pairwise ranking article walks through the application to AEO scoring.
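A toy version of the aggregation step, with hypothetical brand names and hand-made trial outcomes; real systems typically fit a rating model such as Bradley-Terry on top of counts like these:

```python
from collections import Counter

def pairwise_win_rates(trials):
    """trials: (option_a, option_b, winner) tuples, one per head-to-head prompt."""
    wins, appearances = Counter(), Counter()
    for a, b, winner in trials:
        appearances[a] += 1
        appearances[b] += 1
        wins[winner] += 1
    return {brand: wins[brand] / appearances[brand] for brand in appearances}

# Hypothetical outcomes: "Acme" beats "Globex" in 3 of 4 head-to-heads.
trials = [("Acme", "Globex", "Acme")] * 3 + [("Acme", "Globex", "Globex")]
print(pairwise_win_rates(trials))  # {'Acme': 0.75, 'Globex': 0.25}
```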

Counterbalanced trial design cancels position effects

When two options appear together, the option listed first tends to be picked more often regardless of merit. Randomizing which option appears first in half the trials cancels the position effect. The technical name for this kind of design is counterbalancing (its general form is a Latin square); the buyer-facing version is shorter: flip the order half the time and average the results.
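A sketch of the flip for a two-option trial; the brand names are hypothetical:

```python
import random

def counterbalanced_trials(option_a, option_b, n):
    # Present each ordering in exactly half the trials so any
    # first-position bias cancels when results are averaged.
    half = n // 2
    trials = [(option_a, option_b)] * half + [(option_b, option_a)] * (n - half)
    random.shuffle(trials)  # interleave orderings across the scan
    return trials

trials = counterbalanced_trials("Acme", "Globex", 100)
print(sum(first == "Acme" for first, _ in trials))  # 50: each option leads half the time
```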

Multi-engine aggregation absorbs single-engine quirks

ChatGPT, Claude, Gemini, and Perplexity each have idiosyncratic behaviors. Scoring on only one engine reports that engine's quirks. Scoring across all four with disclosed weighting absorbs single-engine variance into a composite that better predicts buyer outcomes. The construct that comes out of this aggregation is called share of model; the share of model article walks through the construct and the aggregation logic.
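A sketch of the weighted composite; the per-engine scores and weights below are invented for illustration, not any vendor's actual weighting:

```python
def composite_share_of_model(scores, weights):
    # Weighted average of per-engine scores; the weights are disclosed, not hidden.
    total = sum(weights.values())
    return sum(scores[engine] * weights[engine] for engine in weights) / total

scores = {"chatgpt": 0.52, "claude": 0.44, "gemini": 0.47, "perplexity": 0.41}
weights = {"chatgpt": 0.40, "claude": 0.20, "gemini": 0.25, "perplexity": 0.15}
print(round(composite_share_of_model(scores, weights), 3))  # 0.475
```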

Confidence intervals report what the score does not claim

A point estimate without a confidence interval hides the variance. A score reported as "47 plus or minus 2.4 percentage points at 95 percent confidence" is honest about what is signal and what is noise. The Friday client deck survives the procurement conversation when the deck reports a band, not just a number.
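The band itself is one line of arithmetic for a share-style score, assuming independent trials and the normal approximation; the 0.47 share and n = 1,600 below are illustrative numbers, not a vendor's actual sample:

```python
import math

def ci95_halfwidth(p, n):
    # 95% normal-approximation interval for an observed share p over n trials.
    return 1.96 * math.sqrt(p * (1 - p) / n)

# Illustrative: a 0.47 share observed over 1,600 independent trials.
print(round(ci95_halfwidth(0.47, 1600), 3))  # 0.024 -> report "47 +/- 2.4 pp"
```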

What this means for the renewal conversation

When a CFO asks "how can the number be trustworthy if the underlying model is non-deterministic," the answer has three parts.

First, the model is mathematically deterministic. The hardware and serving environment running the model are not. The non-determinism is at the system level, not the algorithm level.

Second, the noise floor is real and known. Repeated sampling, pairwise design, counterbalanced trials, multi-engine aggregation, and confidence intervals absorb the floor noise into a stable score above it.

Third, the proof is methodology disclosure. A defensible vendor publishes the sample size, prompt templates, engine weights, aggregation formula, and confidence band. Bean and colleagues audited 445 large language model benchmarks in 2024 and found that 21.8 percent provided no construct definition at all (bean 2024 construct validity benchmarks). The vendors with published methodology produce numbers a CFO can review. The ones without do not.

The measurement gap in AEO is not LLM non-determinism. Sample design and aggregation engineering have absorbed floor-level noise in every mature measurement discipline for decades. The gap is methodology disclosure. A vendor that publishes its sample size, prompt templates, engine weights, and confidence intervals produces a defensible score despite non-deterministic models. A vendor that hides methodology behind "proprietary" produces a label, not a measurement.

For the practical buyer test that exposes which side of the gap a vendor sits on, the five-test validation framework walks through the questions in 30 minutes.

Frequently asked questions

Is temperature zero deterministic in production?

No. Temperature zero makes the model choose the highest-probability next word at each step, which is one component of determinism. Floating-point arithmetic on GPUs, batch processing effects, infrastructure variance, and other sampling knobs (top-k, top-p, seed) all contribute additional non-determinism. Alexander, Radev, and Hashimoto, in a 2026 arXiv preprint, documented meaningful output variation at temperature zero across 480 attempts on six models.

If LLMs are non-deterministic, can AEO be measured at all?

Yes. Every mature measurement discipline (medicine, finance, physics) handles floor-level noise through sample design and aggregation. Repeated sampling, pairwise comparison, counterbalanced trials, multi-engine aggregation, and disclosed confidence intervals produce stable scores above noisy underlying systems. AEO is no different.

How big is the noise floor for a single query?

SE Ranking's 2025 AI Mode study found 9.2 percent same-day URL consistency across 10,000 queries on a single engine. Fishkin and O'Donnell at SparkToro in 2026 found fewer than 1 percent of 2,961 identical prompts returned the same brand list across ChatGPT, Claude, and Google AI. Single-query noise is large. Aggregated-score noise across hundreds of paired prompts is small.

What is the simplest thing to ask my AEO vendor about determinism?

Ask: "What sample size do you query per measurement period, and what confidence interval do you report around the score?" A vendor that names a specific number (200 paired prompts per period, plus or minus 2.4 pp at 95 percent confidence) has done the engineering. A vendor that says "we query at scale" without numbers has not.

Does GenPicked claim its models are deterministic?

No. GenPicked acknowledges that LLMs in production are non-deterministic at the model layer. The methodology page documents the sample design, pairwise comparison, counterbalanced trials, multi-engine weighting, and confidence intervals that produce stable scores above the model noise. A demo brief is at genpicked.com/demo.

Why does my dashboard show different numbers each day even though I have not changed anything?

Either the underlying noise floor is leaking through to your score (the vendor is querying once per period with no aggregation), or the methodology has changed silently (the vendor updated something without notice). Both are red flags. A defensible vendor reports scores with confidence intervals and versions the methodology so silent changes are visible.

Should I demand a deterministic AEO vendor?

No. A vendor claiming deterministic outputs is either misunderstanding production LLM behavior or overstating their control. The right vendor acknowledges floor-level non-determinism and shows the engineering above it. That answer is more defensible than a determinism claim that does not survive technical scrutiny.

See what a methodology-disclosed score looks like

If you want to see the methodology page that addresses LLM non-determinism head-on (sample size, pairwise design, counterbalancing, multi-engine weights, confidence intervals all in writing), the GenPicked sample brief is at genpicked.com/demo. The brief is the same artifact you would hand a procurement officer or a CFO.

For agencies running active retainers, the 14-day free trial includes the full methodology documentation as a downloadable artifact, including the response to "is your scan deterministic" in plain English.

Dr. William L. Banks III is Co-Founder of GenPicked. References documented in the GenPicked research wiki. Specific citations available on request.

Dr. William L. Banks III

Co-Founder, GenPicked

