How to Validate an AEO Score in 30 Minutes: Five Tests for Buyers

In this article, you will learn what AEO score validity means in practical terms, which five tests an agency owner can run in 30 minutes to validate an AEO score before a client deck is due, what a defensible answer to each test looks like, and what a red-flag answer signals about the vendor's measurement.


You can validate an AEO score by Friday morning

You are looking at an AEO dashboard reporting a score, and you have to put that score in a client deck on Friday. The client will eventually ask the procurement question. The CFO will eventually ask it too. If you cannot answer "how did the vendor arrive at that number," the retainer is exposed.

AEO score validity is the practical answer to that question. It is testable, not theoretical. The discipline of answer engine optimization has matured to the point where a serious vendor can answer five questions in writing, on the record, in under 30 minutes of buyer effort. GenPicked publishes the answers on every brand visibility report by default. The five tests below are the same tests we use on ourselves.

This article is the practical companion to the construct validity foundation, which covers the 50-year-old measurement theory behind these tests. You do not need the theory to run the tests. You need the theory if a vendor pushes back. Read the foundation when you have time. Read this when the client deck is due Friday.

What AEO score validity actually means, and why these five tests work

Each test maps to a specific way a measurement instrument can go wrong. The 2024 audit of 445 large language model benchmarks by Bean, Brennan, and Buitelaar, published on OpenReview, found that 21.8 percent of them provided no construct definition at all (bean 2024 construct validity benchmarks). The benchmarks that did publish definitions produced numbers researchers could replicate and defend. The vendors in the AEO category follow the same split. The ones publishing methodology produce numbers a CFO can review. The ones hiding methodology produce numbers a CFO cannot.

The five tests below cover the four most common failure modes (construct ambiguity, reproducibility failure, prompt sensitivity, bias amplification) plus the disclosure gate (sample size and confidence) that catches them all in advance.

Test 1: The construct definition test

Ask the vendor: "In one sentence, what does your visibility score measure, what does it not measure, and how do observations aggregate into the reported number?"

The point of this question is not to grade the answer's elegance. The point is to find out whether the answer exists. A defensible vendor has the sentence ready and either points to a methodology page or sends it within minutes. A vendor that needs three days of internal alignment to produce one sentence has not published a construct definition.

A defensible AEO score has a construct definition the vendor can produce in one sentence. The Bean, Brennan, and Buitelaar audit cited above found that 21.8 percent of 445 benchmarks provided none (bean 2024 construct validity benchmarks). The vendors that publish definitions report numbers a buyer can verify. The ones that do not, do not.

Defensible answer: A single sentence that names what the score measures (e.g., "the share of AI-generated responses to category-level prompts that mention the target brand at least once"), what it excludes (e.g., "sentiment, ranking position, and citation depth are tracked separately"), and how observations combine (e.g., "the score is the mean across 200 paired prompts per period, with a 95 percent confidence interval reported alongside").

Red-flag answer: "We measure brand visibility in AI." Generic. No referent. No exclusions. No aggregation rule. A vendor that cannot tell you what their own score measures is selling a label, not a measurement.
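The aggregation rule in a defensible definition is mechanical enough to sketch in a few lines. A minimal sketch, assuming the example definition above (one binary mention observation per prompt, mean aggregation, normal-approximation interval); the numbers are illustrative, not any vendor's production formula:

```python
import math

def aeo_score(mentions: list[bool]) -> tuple[float, float]:
    """Aggregate binary mention observations into a score and a 95% CI half-width.

    `mentions` holds one True/False per prompt: did the brand appear at least
    once in the AI-generated response to that prompt?
    """
    n = len(mentions)
    p = sum(mentions) / n  # mention rate: the reported score
    half_width = 1.96 * math.sqrt(p * (1 - p) / n)  # normal-approximation 95% CI
    return p, half_width

# 200 prompts, 70 of which mention the brand -> score 0.35, plus its band
score, ci = aeo_score([True] * 70 + [False] * 130)
```

At 200 observations the band here is roughly plus or minus 6.6 points. The definition is defensible not because the band is narrow but because the band is stated.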

Test 2: The sample size and variance disclosure test

Ask the vendor: "What is the sample size per measurement period, and what is the confidence interval around each reported score?"

A number without a confidence band is a number without a way to tell signal from noise. AI search is volatile enough that one-shot scans produce numbers that look like brand movement but are actually sampling variance.

SE Ranking's 2025 AI Mode study ran 10,000 queries three times on the same day and found only 9.2 percent same-day URL consistency across the runs (se ranking 2025 ai mode url consistency). An AEO score derived from a single query per period is not measuring brand presence; it is presenting the variance of unrepeated sampling as a trend line.

Defensible answer: A specific sample size (e.g., "200 paired prompts per period across four engines, repeated twice per week") and a stated confidence interval (e.g., "scores reported with a 95 percent CI of plus or minus 2.4 percentage points at the reported sample size").

Red-flag answer: "We query at scale." No number, no band, no schedule. Treat the score as having unstated variance until proven otherwise.
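The relationship between sample size and band width is checkable arithmetic. A sketch under assumed parameters (the mention rate of 0.35 is invented; the plus-or-minus 2.4-point figure above is the vendor's illustration, not derived from this code), using the normal-approximation interval for a proportion:

```python
import math

def required_n(p: float, half_width: float, z: float = 1.96) -> int:
    """Smallest sample size whose 95% CI half-width is at most `half_width`."""
    return math.ceil(z * z * p * (1 - p) / half_width ** 2)

# To report a mention rate near 0.35 with a band of +/- 2.4 points,
# roughly 1,500 observations per period are needed; +/- 5 points needs ~350.
n_tight = required_n(0.35, 0.024)
n_loose = required_n(0.35, 0.05)
```

A vendor quoting a tight band at a small sample size has a number that does not pass this arithmetic.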

Test 3: The reproducibility test

Ask the vendor: "If I scan the same brand today and again tomorrow, what variation should I expect, and is that variation inside the confidence band you disclosed in test 2?"

This one is testable in ten minutes. Run the same scan twice. If the score moves by more than the disclosed band, the disclosed band is wrong or the methodology has more variance than the vendor reported.

In a 2026 arXiv preprint, Alexander, Radev, and Hashimoto ran 480 generation attempts across six large language models at temperature 0, the setting intended to produce deterministic output. The outputs still varied meaningfully across attempts (alexander 2026 llm reproducibility). The right reproducibility test is therefore not "did the score match exactly" but "did the variation stay inside the vendor's disclosed band."

Defensible answer: "Run the same scan today and tomorrow. Expect the score to move within the CI we disclosed. If it moves outside that band, send us the trace. Our system logs every prompt, response, and aggregation step." The vendor offers reproducibility, not just claims it.

Red-flag answer: "Scores can vary day to day. AI search is noisy." True statement, wrong answer to this question. The buyer is not asking whether AI is noisy. The buyer is asking whether the vendor knows how much.
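The ten-minute check reduces to one comparison. A sketch, assuming the vendor disclosed a single-scan CI half-width in points; the sqrt(2) factor is a design choice here, reflecting that the difference of two independent estimates varies more than either estimate alone:

```python
import math

def within_disclosed_band(score_a: float, score_b: float, half_width: float) -> bool:
    """Check whether two scans of the same brand agree within the disclosed band.

    The difference of two independent estimates spreads more than one estimate,
    so the tolerance is sqrt(2) times the single-scan CI half-width. A False
    result means the band is wrong or the methodology is noisier than reported.
    """
    return abs(score_a - score_b) <= math.sqrt(2) * half_width

# 41.0 vs 44.1 with a disclosed +/- 2.4-point band: |3.1| <= 3.39, so it passes.
ok = within_disclosed_band(41.0, 44.1, 2.4)
```

If the check fails, the artifact to send the vendor is the pair of traces, not a complaint.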

Test 4: The prompt template disclosure test

Ask the vendor: "Can you show me the exact query strings the system uses for my brand, in full, with no redactions?"

Prompt design is the most under-disclosed lever in AEO. Two vendors testing the same brand on the same model on the same day will produce different scores because their prompt templates differ.

Sclar and colleagues at ICLR 2024 showed that minor formatting changes to a prompt, including spacing, delimiter choice, and capitalization, produced up to a 76 percentage point swing in benchmark accuracy (sclar 2024 prompt sensitivity). A vendor that does not publish prompt templates is reporting numbers that are partly measuring the template, not the brand.

Defensible answer: The vendor sends the actual query strings. If the prompts are versioned, the vendor sends the version log too. Methodology versioning is the protection against silent prompt changes between scans.

Red-flag answer: "Prompt templates are proprietary." This is the most common red-flag answer in the AEO category right now. It is also the answer most likely to flip in the next 18 months as procurement pressure rises. Treat "proprietary" as a temporary excuse, not a permanent feature.
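Methodology versioning is cheap to implement on the buyer side too. A sketch (the template strings are invented examples, not any vendor's real prompts): fingerprint the template set at each scan, so a silent prompt change between periods shows up as a changed hash.

```python
import hashlib

def template_fingerprint(templates: list[str]) -> str:
    """Stable, order-insensitive fingerprint for a set of prompt templates.

    Store this alongside each scan; if the fingerprint changes between periods,
    the score moved partly because the prompts moved, not the brand.
    """
    joined = "\n".join(sorted(templates))
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()[:12]

# Invented category-level templates for illustration
v1 = template_fingerprint(["What tools are best for {category}?",
                           "Recommend a {category} platform for a small team."])
v2 = template_fingerprint(["What tools are best for {category}?",
                           "Recommend a {category} platform for a small team"])  # dropped period
```

Even a one-character edit (the dropped period above) produces a different fingerprint, which is the point: the Sclar result shows changes that small can move benchmark numbers.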

Test 5: The blind versus anchored prompt test

Ask the vendor: "Do your default scans name my brand in the query, or do you ask about the category first and observe whether my brand appears in the response?"

This is the highest-leverage question in the five-test set. A prompt that names the target brand inflates the brand's apparent visibility because the model echoes the input back.

The 864-observation paired-prompt experiment GenPicked published in 2026 found a 22.5 percentage point mention-rate inflation when the brand was pre-supplied in the query versus omitted (blind vs named measurement). The distortion is not uniform across brands and cannot be corrected after the fact. Blind prompts (category-first, brand-observed) are the buyable test. Anchored prompts (brand-supplied, mention-counted) are not measuring brand presence; they are measuring the prompt.

Defensible answer: "Our default scans are blind. We ask the engine about the category, the use case, or the buyer's problem, and we observe whether your brand appears in the response. If you want anchored scans for diagnostic purposes, we can run them separately and label them clearly."

Red-flag answer: "We include the brand in the prompt to ensure coverage." This vendor is reporting a score that is mechanically inflated by the prompt design. The inflation can exceed 20 percentage points. No correction recovers the real signal.
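The inflation itself is measurable with paired prompts. A toy sketch of the paired comparison (the brand names and response strings are invented; the 22.5-point figure comes from the study above, not from this data):

```python
def mention_rate(responses: list[str], brand: str) -> float:
    """Share of responses that mention the brand at least once (case-insensitive)."""
    return sum(brand.lower() in r.lower() for r in responses) / len(responses)

# Invented paired data: same underlying prompts, run blind vs with "Acme" named.
blind = ["Acme and Globex lead the category", "Globex is popular",
         "Most teams pick Globex", "Initech fits small teams"]
anchored = ["Acme is a strong option", "Acme and Globex lead",
            "Yes, Acme handles this", "Acme may fit, Initech too"]

# Anchored rate exceeds the blind rate because the model echoes the input back.
inflation = mention_rate(anchored, "Acme") - mention_rate(blind, "Acme")
```

The paired structure matters: the same prompt intent is run both ways, so the gap isolates the effect of naming the brand rather than differences between prompt sets.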

The 30-minute workflow

Open your AEO vendor's product page. Open their methodology page if it exists. Open your email.

  1. Minute 0 to 5. Skim the methodology page (if it exists) for answers to tests 1, 2, and 4. Mark which questions are answered in writing and which are not.
  2. Minute 5 to 10. Send the vendor an email with the five questions above, asking for written answers within 48 hours.
  3. Minute 10 to 20. Run test 3 in your dashboard. Pull today's score for one brand. Schedule a re-scan for tomorrow.
  4. Minute 20 to 30. Write a one-page brief for the Friday client deck noting which tests pass on documentation alone (tests 1, 2, 4), which are pending vendor email response, and which you ran yourself (test 3, partially).

You now have a written validation file. The CFO question has a defensible answer. The renewal conversation has documented evidence.

What a passing scorecard looks like

A vendor that scores five out of five in writing has earned a serious evaluation and a multi-year retainer defense. A vendor that scores three out of five has a partial methodology and the gaps should be priced into the contract. A vendor that scores zero out of five is selling a dashboard product, not a measurement instrument. The CMSWire trade-press piece on AEO ranks at position one for "aeo measurement crisis" exactly because so many vendors land in the zero-to-three range. The path forward is choosing one in the four-to-five range, or pushing your current vendor to disclose.

For the longer treatment of where this discipline came from, the construct validity foundation walks through the four facets behind these tests. For the vendor-side standard, the methodology transparency standard for AEO buyers covers what defensible disclosure should look like. For the procurement long-form, the 13-question vendor due diligence checklist extends the five tests into a full procurement workflow.

Frequently asked questions

What is AEO score validity?

AEO score validity is the degree to which the number an AEO platform reports actually corresponds to brand presence in AI search outputs. A score has validity when the vendor publishes the construct definition, sample size, confidence interval, prompt templates, and aggregation method, and a buyer can verify all five against documented evidence.

Can I validate an AEO score without a methodology background?

Yes. The five tests above are buyer questions, not academic exercises. Each test is a single sentence the buyer sends to the vendor. The vendor either has an answer in writing or does not. The buyer scores five-out-of-five, three-out-of-five, or zero-out-of-five. No statistics training required.

What is the highest-leverage question to ask first?

The blind versus anchored prompt question (test 5). A vendor that names your brand in the query is producing a score that is mechanically inflated by more than 20 percentage points, and no correction recovers the real signal. If that one question is answered "we name the brand in the prompt," the score should not be used in client reporting until the vendor changes the default.

What if the vendor calls their methodology "proprietary"?

"Proprietary" is a temporary defense, not a permanent feature. Procurement-grade buyers in financial audit, clinical research, and ad verification all moved their vendors from "proprietary" to "disclosed" under pressure. AEO is in the same arc. A vendor that hides methodology in 2026 is unlikely to keep enterprise accounts past 2027 as CFOs and procurement teams catch up.

How long does this take with a vendor that does not have a methodology page?

The first round takes about 30 minutes of buyer effort plus 48 hours of vendor response time. If the vendor does not respond in writing within 48 hours, that is itself a data point. Move the renewal conversation accordingly.

Does this apply to free AEO tools too?

Yes. A free tool is not exempt from the validity question when its score ends up in client deliverables. The cost of a wrong number is the same whether the number came from a free tool or a paid one.

What is GenPicked's score on its own five tests?

Five out of five, with the answers published on every brand visibility report by default. The methodology page documents the construct, sample size, confidence interval, prompt templates, and aggregation formula. The sample brief is available at genpicked.com/demo.

See what a five-out-of-five looks like

If you want to see what a methodology-disclosed AEO scan produces, the GenPicked sample brief is at genpicked.com/demo. The brief includes the answers to all five tests in the same format you would send a vendor by email. Use it as a baseline for the Friday deck, or as a procurement template for the next vendor evaluation.

For agencies running active client retainers, the 14-day free trial includes the methodology brief, the prompt template documentation, and the five-test scorecard as a downloadable artifact you can hand to a CFO or a procurement officer.

Dr. William L. Banks III is Co-Founder of GenPicked. References documented in the GenPicked research wiki. Specific citations available on request.

Get Your Brand's AEO Score

See how your brand is performing in AI search with our free AEO audit.

Start Your Free Audit