Reproducibility in AEO Measurement: Why Stable Scores Need Repeated Sampling
In this article, you will learn what reproducibility in AEO measurement actually means, why temperature 0 alone does not deliver it, the three layers of reproducibility a defensible vendor controls, and the 10-minute test any agency can run before the renewal conversation lands.
Your AEO score moved 7 points and nothing happened to the brand
Your AEO platform shows your client at rank 4 on Monday and rank 7 on Wednesday. Nothing happened in between. No content shipped, no PR hit, no algorithm change announced. The client will ask why on Friday. The answer "the underlying system is non-deterministic" sounds like an excuse. The honest answer is more interesting: reproducibility in AEO measurement is buyable, and the vendors that deliver it look operationally different from the ones that do not.
This article is the deep dive on what reproducibility actually requires. It is the operational companion to the five-test buyer checklist, where reproducibility is test 3. GenPicked builds AEO measurement on a three-layer reproducibility model because the discipline has matured to the point where stable scores are a deliverable, not a theoretical aspiration.
The 10-minute buyer-side test sits at the end. Skip there if the renewal call is Friday.
What reproducibility means in AEO
Reproducibility is the degree to which the same measurement performed under the same conditions returns the same answer. In AEO, "same conditions" means the same prompt against the same engine on the same brand in the same time window. The score does not have to match to the decimal place. It has to fall inside a confidence interval the vendor disclosed in advance.
A non-reproducible score moves randomly across runs. A reproducible score moves only when the underlying brand presence in AI outputs has actually moved. The difference between the two is the difference between an instrument and a wind vane.
The methodology choices that produce reproducibility have been worked out in the AI evaluation research community over the past three years. The vendors that have ported those choices into commercial AEO products deliver stable scores. The vendors that have not are still selling single-prompt scans dressed as trend lines.
Why temperature 0 alone does not deliver reproducibility
Some vendors claim reproducibility through engine settings. The argument: set the LLM temperature parameter to 0, the engine becomes deterministic, the score becomes stable. The argument is wrong, and the evidence is recent.
Temperature 0 is not deterministic for large language models. Alexander, Radev, and Hashimoto ran 480 attempts with six LLMs at temperature 0, the setting intended to produce deterministic output. The models produced meaningfully different outputs anyway, both at the surface wording level and at the semantic content level (alexander 2026 llm reproducibility). An AEO vendor that claims reproducibility through temperature alone is selling a property the underlying system does not have.
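The claim is also directly checkable. A minimal sketch, assuming the OpenAI Python client and an illustrative model name; any chat-completion API with a temperature parameter works the same way:

```python
# Query the same prompt twice at temperature 0 and compare the outputs.
# Identical outputs on one pair of runs proves nothing; a single divergence
# is enough to falsify "temperature 0 is deterministic". Model name is
# illustrative, not a recommendation.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
prompt = "Recommend three project management tools for a small agency."

outputs = []
for _ in range(2):
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    outputs.append(resp.choices[0].message.content)

print("identical" if outputs[0] == outputs[1] else "diverged at temperature 0")
```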
The root cause is engineering. Floating-point arithmetic on GPUs is order-sensitive. Batch composition, kernel selection, and hardware scheduling all introduce non-determinism even when the model parameters and input tokens are identical. Temperature 0 reduces the search space at the sampling step; it does not eliminate the variance below the sampling step.
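The order sensitivity is visible in two lines of plain Python, no GPU required:

```python
# Floating-point addition is not associative: regrouping the same three
# numbers changes the result. GPU kernels regroup reductions depending on
# batch composition and scheduling, which is one source of the variance
# that survives below the sampling step.
a, b, c = 1e20, -1e20, 3.14
print((a + b) + c)  # 3.14
print(a + (b + c))  # 0.0 -- c is absorbed when added to b first
```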
What this means for buyers: a vendor claiming "deterministic AEO" should be asked what level of determinism they mean, and whether the determinism survives a hardware refresh. A vendor that has thought about this has a layered answer. A vendor that has not has only a slogan.

The three layers of reproducibility
Reproducibility in AEO is not one thing. It is a stack with three layers, each of which a defensible vendor controls separately.
Layer 1: Intra-prompt reproducibility
The same prompt run against the same engine should produce outputs that fall within a stated variance band. Vendors control this by averaging across multiple runs of the same prompt before reporting any number. The number of runs per prompt is a methodology choice the vendor either discloses or hides.
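A minimal sketch of this layer, assuming a caller-supplied ask function that wraps one engine call for a fixed prompt, with a naive substring check standing in for a real mention detector:

```python
import statistics
from typing import Callable

def layer1_score(ask: Callable[[], str], brand: str, runs: int = 20):
    """Average a brand-mention indicator over repeated runs of one prompt.

    `ask` wraps one engine call for one fixed prompt. The substring test
    is a stand-in for a real mention detector. Returns (rate, spread).
    """
    hits = [1.0 if brand.lower() in ask().lower() else 0.0 for _ in range(runs)]
    rate = statistics.mean(hits)
    spread = statistics.stdev(hits) if runs > 1 else 0.0
    return rate, spread
```

The disclosed methodology choice here is `runs`: 20 runs per prompt and 1 run per prompt are different instruments, and only the disclosure tells the buyer which one they bought.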
Layer 2: Intra-engine reproducibility
Across many semantically equivalent prompts against the same engine, the brand's mention rate should be stable to within a stated band. Vendors control this by sampling prompt variants from a documented set, not by running one canonical prompt and reporting the result. Sample size disclosure (test 2 of the five-test buyer checklist) is the prerequisite.
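Layer 2 reuses the sketch above, averaging the per-prompt rate over a documented variant set instead of reporting one canonical prompt. The variant strings are illustrative, not a real template library:

```python
from typing import Callable

# A documented set of semantically equivalent prompts for one buyer intent.
VARIANTS = [
    "best project management tools for small agencies",
    "what project management software should a small agency use",
    "recommend project management tools for a 10-person agency",
]

def layer2_score(ask_variant: Callable[[str], str], brand: str,
                 runs_per_variant: int = 10) -> float:
    """Mean mention rate across the documented variant set on one engine.

    Reuses layer1_score from the previous sketch; `ask_variant(prompt)`
    wraps one engine call for a given prompt string.
    """
    rates = [layer1_score(lambda p=p: ask_variant(p), brand, runs_per_variant)[0]
             for p in VARIANTS]
    return sum(rates) / len(rates)
```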
Layer 3: Cross-engine reproducibility
Cross-engine variance is larger than intra-engine variance. Fishkin and O'Donnell ran 2,961 identical prompts across ChatGPT, Claude, and Google AI in 2026 and found fewer than 1 percent returned the same brand list, with fewer than 1 in 1,000 returning the same list in the same order (fishkin 2026 ai brand inconsistency). Reproducible AEO requires either single-engine measurement, with the engine choice disclosed, or multi-engine aggregation with disclosed weights.
A vendor reporting a single composite score across four engines is making an aggregation choice on top of three engine-level scores. If the composite formula is undisclosed, the buyer cannot tell whether the composite is reproducible or whether one engine is doing 80 percent of the lifting silently.
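With engine-level scores in hand, a disclosed composite is a few lines. The weights below are illustrative assumptions, not GenPicked's or any vendor's published formula:

```python
# Disclosed engine weights. Illustrative values: an undisclosed formula
# hides exactly this -- which engine dominates the composite.
ENGINE_WEIGHTS = {"chatgpt": 0.4, "claude": 0.2, "gemini": 0.2, "perplexity": 0.2}

def composite_score(engine_scores: dict[str, float]) -> float:
    """Weighted average of per-engine mention rates using disclosed weights."""
    total = sum(ENGINE_WEIGHTS[e] for e in engine_scores)
    return sum(ENGINE_WEIGHTS[e] * s for e, s in engine_scores.items()) / total

scores = {"chatgpt": 0.80, "claude": 0.30, "gemini": 0.35, "perplexity": 0.40}
print(f"{composite_score(scores):.2f}")  # 0.53 -- chatgpt alone contributes 0.32
```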
What reproducible measurement looks like on the vendor side
Operationally, reproducible AEO measurement has six visible signatures. A vendor that does all six is reproducible. A vendor that does fewer than four is not, regardless of marketing claims.
First, the vendor publishes a per-period sample size. "200 paired prompts per period, four engines, repeated twice weekly" is a defensible disclosure. "We query at scale" is not. The rationale is empirical: SE Ranking ran 10,000 queries three times on the same day in 2025 and found only 9.2 percent URL consistency across the runs (se ranking 2025 ai mode url consistency). A single scan per period captures a slice of variance, not a reading of brand presence, so reproducibility requires repeated sampling within each period, aggregated with disclosed weighting.
Second, the vendor publishes a confidence interval. A score reported as 47 percent plus or minus 2.4 percentage points at 95 percent confidence is reproducible by construction (the arithmetic check below ties the band to the sample size). A score reported as 47 percent with no band is reporting variance as signal.
Third, the vendor publishes the prompt template set. The buyer can see the actual query strings, with no redactions. Prompt versioning is logged, and silent prompt changes between scans are flagged.
Fourth, the vendor publishes the engine list and the composite formula. Buyers know which engines are queried, how the engine-level scores are combined, and what changes when an engine is added or dropped.
Fifth, the vendor publishes a methodology changelog. When the prompt template, sample size, engine weights, or aggregation formula changes, the change is logged with a date and a reason. A buyer can tell whether a score movement reflects brand performance or methodology revision.
Sixth, the vendor's reproducibility claim survives the 10-minute buyer test below. If it does not, the other five disclosures are decorative.
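Before moving to that test, note that the first two signatures should be arithmetically consistent with each other. A minimal check, assuming the samples pool across engines and repeats, that they are independent, and that the normal approximation for a binomial proportion holds (real vendor aggregation may differ):

```python
from math import sqrt

# Is a disclosed band of +/- 2.4 points consistent with a disclosed sample of
# "200 paired prompts per period, four engines, repeated twice weekly"?
p = 0.47          # reported mention rate
n = 200 * 4 * 2   # 1,600 samples per period, assuming the disclosure pools
half_width = 1.96 * sqrt(p * (1 - p) / n)
print(f"95% CI: {p:.0%} +/- {half_width:.1%}")  # 95% CI: 47% +/- 2.4%
```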
The 10-minute buyer-side reproducibility test
Reproducibility is testable by the buyer in 10 minutes. Run the same scan today, then again tomorrow. If the disclosed confidence interval is plus or minus 2.4 points and the actual variation is plus or minus 9 points, the disclosed band is wrong. The vendor's job is to publish a band the buyer can verify against repeated scans. The buyer's job is to run the check before signing the renewal (bean 2024 construct validity benchmarks documents the upstream measurement principle).
To start, open your AEO dashboard. Note today's score for one brand and the disclosed confidence interval (test 2 of the buyer checklist; if there is no disclosed CI, the reproducibility test is moot because there is no band to verify against).
The protocol:
- Minute 0 to 2. Record today's score and the disclosed confidence interval. Note the time.
- Minute 2 to 5. Schedule a re-scan for 24 hours from now (or 7 days, if your vendor's claimed reproducibility is on a weekly cadence). Match every parameter (engine, prompt set, sample size).
- Minute 5 to 8. When the re-scan completes, record the new score. Compute the difference.
- Minute 8 to 10. Compare the actual difference to the disclosed CI. Three outcomes (the sketch after this list encodes them):
  - Inside the CI: the score is reproducible within the vendor's disclosed band. Pass.
  - Outside the CI, but the vendor explains the gap (e.g., "your client published a press release between scans, expected mention-rate jump of 3 to 4 points"): conditional pass. The vendor knows the system.
  - Outside the CI and the vendor cannot explain it: fail. Either the band is wrong, or the methodology has more variance than disclosed.
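A minimal encoding of the three outcomes, using the numbers from the opening example; `vendor_explained` is your own judgment call after hearing the vendor out:

```python
def reproducibility_verdict(score_a: float, score_b: float,
                            ci_half_width: float,
                            vendor_explained: bool = False) -> str:
    """Classify a two-scan gap against the vendor's disclosed CI half-width.

    All arguments are in percentage points. `vendor_explained` records
    whether the vendor's account of an out-of-band gap convinced you.
    """
    gap = abs(score_a - score_b)
    if gap <= ci_half_width:
        return "pass"
    return "conditional pass" if vendor_explained else "fail"

# Monday 47.0, Wednesday 51.8, disclosed band +/- 2.4 points:
print(reproducibility_verdict(47.0, 51.8, 2.4))  # fail -- gap 4.8 > 2.4
```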
A vendor that passes this test has a reproducible measurement instrument. A vendor that fails it is selling a dashboard with mention counts, which is not the same thing.
What reproducible AEO does not solve
Even fully reproducible methodology cannot eliminate floor-level LLM non-determinism. The 480-attempt Alexander study establishes a noise floor; a reproducible methodology quantifies and discloses that floor, but it does not eliminate it.
Cross-engine weighting remains a normative choice. A vendor can be fully reproducible and still make weighting choices a buyer disagrees with. Reproducibility means the vendor can defend the choice in writing. It does not mean the choice is universally correct.
Methodology versioning is a discipline, not a constraint. A vendor with disclosed methodology can still change that methodology silently between scans; the changelog discipline is what protects the buyer against silent drift. A vendor reporting reproducibility without a changelog is reporting a property that may evaporate at any time.
How GenPicked delivers reproducibility
GenPicked's reproducibility model is the three-layer stack above with all six vendor-side signatures published on every brand visibility report. The methodology page documents the sample size, the confidence interval, the prompt templates, the engine list, the composite formula, and the changelog. The buyer-side 10-minute test passes on every published score because the disclosed CI is calibrated against the underlying measurement.
This is the discipline the broader AEO category is converging on. The construct validity foundation (read the theory) establishes why reproducibility is a measurement requirement, not a marketing feature. The methodology transparency standard covers what vendor-side disclosure should look like. The aggregated share-of-model metric is the composite score that survives single-prompt noise when the methodology is reproducible. This article is the operational deep dive on one specific test inside that stack.
Frequently asked questions
What does reproducibility in AEO measurement mean?
Reproducibility means the same scan run twice under the same conditions produces a score that falls inside the vendor's disclosed confidence interval. A reproducible score moves only when underlying brand presence in AI outputs has actually moved, not when the underlying system is varying randomly.
Does setting the LLM temperature to 0 make AEO measurement reproducible?
No. Alexander, Radev, and Hashimoto ran 480 attempts with six LLMs at temperature 0 in 2026 and found the outputs varied meaningfully. Temperature 0 reduces sampling variance; it does not eliminate the floating-point and hardware non-determinism that sits below the sampling step.
What is the buyer-side reproducibility test?
Run the same scan twice, 24 hours apart, with all parameters identical. Compare the actual variation to the vendor's disclosed confidence interval. If the variation fits inside the CI, the score is reproducible. If the variation exceeds the CI and the vendor cannot explain the gap, the score is not reproducible.
How much variation is acceptable between two scans?
That depends on the vendor's disclosed CI. A defensible vendor publishes a band (e.g., plus or minus 2 to 4 points at 95 percent confidence). Variation inside the band is acceptable. Variation outside the band signals either a wrong band or a methodology change the vendor has not disclosed.
Why do my AEO scores move when nothing happened to the brand?
Most likely because the vendor is querying once per period without aggregating across repeated runs. SE Ranking documented 9.2 percent URL consistency across same-day runs of 10,000 queries. A single scan captures variance, not a reading. The fix is repeated sampling within each period and disclosed aggregation, which a reproducible vendor publishes by default.
What is the difference between reproducibility and reliability?
Reliability is a broader term that includes reproducibility, internal consistency, and stability over time. Reproducibility is the specific property that the same measurement performed twice returns the same answer within a stated band. AEO buyers can verify reproducibility directly; reliability requires longer observation.
How does GenPicked deliver reproducibility?
GenPicked publishes the per-period sample size, the confidence interval, the prompt templates, the engine list, the composite formula, and the methodology changelog on every brand visibility report. The buyer-side 10-minute test passes on every published score because the disclosed CI is calibrated against the underlying measurement, not retrofitted to it.
See the reproducibility check in action
If you want to run the 10-minute test against a known-reproducible AEO scan, the GenPicked sample brief is at genpicked.com/demo. The brief includes today's score, the disclosed CI, and the methodology trace required to verify the score against a re-scan. Use it as a baseline for your current vendor's reproducibility check.
For agencies running active client retainers, the 14-day free trial includes the methodology brief, the prompt template documentation, and the reproducibility scorecard as a downloadable artifact you can hand to a CFO or a procurement officer.
Dr. William L. Banks III is Co-Founder of GenPicked. References documented in the GenPicked research wiki. Specific citations available on request.