What a Complete AEO Tool Methodology Disclosure Should Contain: A Buyer's Specification


In this article, you will learn exactly what a complete methodology disclosure document should contain when an AEO vendor hands one to you. Ten components, each with what the disclosure must reveal, why it matters to a buyer running a client retainer, and a concrete example of correct disclosure language you can copy directly into your procurement questionnaire.


The procurement gap this article fills

Fishkin and O'Donnell ran 2,961 identical prompts through three AI engines in early 2026 and found that fewer than 1 percent produced the same brand list twice (Fishkin and O'Donnell, 2026). That is the noise floor every AEO vendor builds on top of. The vendor's job is to turn that noise into a number you can defend. The disclosure document is how you verify they actually did the work.

A separate article explains why methodology transparency matters and surveys what major vendors disclose today. See the methodology transparency piece for that argument. This article is the buyer-side specification: the ten components a complete disclosure should contain, written so you can hand the list to any vendor and check off what they answer.

The structure below treats each component as a contract clause. If the disclosure does not address it in writing, the vendor is reporting a marketing number, not a measurement.


1. Ranking math: absolute or relative, and which biases get mitigated

What the disclosure must include. A named statistical method, a plain description of whether the ranking is built from absolute mention counts or from head-to-head comparisons, and a list of which biases the method was chosen to mitigate.

Why it matters. A ranking method built on raw mention counts inherits every bias the underlying engine carries. The two most consequential are the structural tendency for items higher in a list to attract attention disproportionately to their actual quality (position bias) and the inflation produced when the prompt names the target brand directly. A head-to-head comparison method neutralizes both. A buyer who cannot tell which method their vendor uses cannot tell whether the rank position is real or an artifact.

What it should look like. "Composite scores are produced from head-to-head comparisons rather than absolute lists. Each pair of tracked brands is scored in both orders across counterbalanced prompts, then aggregated using a pairwise ranking method derived from the same statistical family used in chess and tournament systems. The method is chosen to neutralize position bias and reduce brand-anchoring effects. See the pairwise ranking explainer for the derivation."


2. Engine roster and composite weighting

What the disclosure must include. The exact list of AI engines queried, the numeric weight each engine carries in the composite score, the rationale for the weighting (buyer-journey data, market share, internal preference, equal weighting), and a stated review cadence for when weights are reconsidered.

Why it matters. Five engines produce five different visibility numbers for the same brand. The composite score depends entirely on how the vendor blends them. A score that quietly weights one engine at 80 percent is a single-engine score with cosmetics. A buyer needs to see the weights to interpret the number against the buyer-journey data their own client cares about.

What it should look like. "The composite Aggregate Citation Score is produced from five engines: ChatGPT (weight 0.35), Claude (0.25), Gemini (0.25), Perplexity (0.15). Weights are anchored on B2B buyer-journey usage data and reviewed quarterly. The most recent weight change occurred on 2026-03-01 and is documented in the version log below."

AEO Claim-Evidence: Roughly 60 percent of B2B procurement teams now consult at least two AI engines during vendor evaluation, according to industry buyer-journey research summarized in the GenPicked wiki (blind vs named measurement). A composite score that hides its engine weights cannot be interpreted against this multi-engine reality.


3. Prompt design: blind vs named, with counterbalancing

What the disclosure must include. Whether prompts include the target brand name in the query (named) or describe the buyer scenario without naming the brand (blind). Whether prompt orders are counterbalanced across trials so each brand appears in every slot. The exact text or a representative sample of the prompt template.

Why it matters. A prompt that names the target brand inflates that brand's apparent visibility substantially. The 2025 work on sycophancy in language models documents the size of the effect across multiple prompt frames (Atwell and Alikhani, 2025). If a tool builds its prompts by inserting the brand name, every score it produces is anchored upward by design. Counterbalancing handles the related problem of which brand appears first in a side-by-side comparison, where the first-named brand tends to win regardless of merit.

What it should look like. "All measurement prompts are blind. The target brand name never appears in the query. Prompts elicit recommendations within a category (example: 'which AI search visibility platforms do you recommend for a marketing agency working with mid-market SaaS clients'). Where comparison prompts are used, the order of named items is counterbalanced so each item appears in every slot across an equal number of trials. The full prompt schema is available on request."

AEO Claim-Evidence: Brand-anchored prompts inflate measured visibility by roughly 22 percentage points compared to blind prompts in the same category, based on paired observations documented in the GenPicked research wiki (blind vs named measurement). Any composite score built on named prompts is reporting an inflation artifact, not a visibility measurement.


4. Sample size per period, per engine, per pair

What the disclosure must include. The raw count of queries per measurement period, broken down by engine and by comparison pair where pairwise methods are used. The minimum sample size required before a rank change is reported as significant.

Why it matters. Sample size is the single most important number for interpreting whether a month-over-month rank change is real or noise. A vendor that runs three prompts per engine produces a much noisier rank than one that runs thirty. If sample size is not disclosed, you cannot tell whether your client moved from position 6 to position 4 because something real changed or because the measurement happened to land differently this time.

What it should look like. "Each measurement period runs thirty prompts per engine. Pairwise comparisons run thirty head-to-head trials per pair per engine. A category tracking twenty brands produces 190 pairs, which generates roughly 22,800 individual queries per period across four engines. Rank movements smaller than the reported uncertainty interval are flagged as inside the noise band and not reported as changes."


5. Confidence intervals and uncertainty reporting

What the disclosure must include. How uncertainty around each rank is calculated, how it is presented in the dashboard, and the threshold at which a movement is treated as significant rather than random.

Why it matters. A rank of 4 with a confidence interval spanning positions 3 to 5 is genuinely positions 3 to 5. Reporting only the point estimate strips out the information the buyer needs to decide whether to act. Construct validity research warns that any single-number metric without a stated uncertainty range is incomplete by design (Bean et al., 2024).

What it should look like. "Each reported rank position is accompanied by a 95 percent confidence interval. A rank of 4 with an interval of (2.8, 5.2) is reported as 'rank 4, plausibly 3 to 5.' Period-over-period changes inside this band are flagged as 'within noise.' The interval is produced from the bootstrap distribution of pairwise win counts across resamples of the query batch."


6. Category definition and how new entrants are handled

What the disclosure must include. How the set of brands tracked in a category is bounded. The procedure for adding a new entrant. Whether adding a brand triggers a recomputation of existing ranks or whether historical ranks are preserved.

Why it matters. Absolute rankings change meaning when the brand set changes. Add a new entrant and existing brands shift positions for reasons unrelated to actual brand strength. A vendor that quietly redefines the category between periods is producing rank changes that look like client wins or losses but are really methodology artifacts. Construct validity literature treats category boundary as a definitional question that must be settled before measurement begins (construct validity).

What it should look like. "Category sets are defined at onboarding and locked for the measurement period. New entrants are added at quarterly category review, with the change date logged. Historical ranks are recomputed against the new set and republished alongside the original ranks so clients can see both the locked-set and updated-set trajectories."


7. Cross-engine disagreement: what gets reported when engines diverge

What the disclosure must include. How the vendor handles the case where two engines rank a brand differently. Whether the composite hides this disagreement or surfaces it as a separate output. Whether the dashboard exposes per-engine ranks alongside the composite.

Why it matters. Engines disagree often. A composite that flattens disagreement into a single number hides the buyer's most important strategic signal: where the brand is strong with one audience segment (the segment using a given engine) and weak with another. A buyer needs to see both the composite and the underlying disagreement to plan content investment.

What it should look like. "The dashboard reports the composite score alongside the per-engine ranks. Where engines disagree by more than two rank positions, the disagreement is flagged as a separate output on the report. The methodology does not collapse divergent engines into a single number without surfacing the divergence."


8. Methodology version and changelog

What the disclosure must include. A semantic version number for the current methodology. A dated changelog covering every formula change in the last twelve months. The next scheduled review date.

Why it matters. Quiet recalibrations are the most common way vendors improve reported client outcomes during renewal season. A vendor that publishes the methodology version and changelog accepts public accountability for changes. A vendor that does not publish them retains room to adjust the formula in ways that benefit retention numbers more than client measurement quality. This is the same accountability problem that benchmark research has documented in AI evaluation generally (Bean et al., 2024).

What it should look like. "Methodology version 3.2, effective 2026-04-15. Change log: v3.2 added Perplexity to the engine roster at weight 0.15; v3.1 increased the pairwise sample from 25 to 30 trials per pair; v3.0 introduced pairwise ranking replacing absolute ranking. Next scheduled review: 2026-07-15."

AEO Claim-Evidence: Methodology version logs that document at least three formula changes per year are correlated with vendor accountability practices, based on industry methodology surveys catalogued in the GenPicked research wiki (construct validity). Absence of a changelog is itself a signal about how the vendor manages quiet recalibration.


9. Replication procedure: could a third party reproduce the rank?

What the disclosure must include. A statement of whether the methodology is reproducible. Whether the vendor will provide the exact prompts, engine versions, and seed values used to produce a reported rank. Whether a third-party auditor could rerun the measurement and arrive at the same number within stated uncertainty.

Why it matters. Reproducibility is the defining property that separates a measurement from an assertion. A buyer who cannot, in principle, ask a third party to verify the number is taking the vendor's word for everything. Replication does not require the vendor to be open-source. It requires the vendor to commit to documenting inputs precisely enough that a determined auditor could reconstruct the run.

What it should look like. "The methodology is reproducible. Clients may request the exact prompts, engine versions, query timestamps, and random seeds used to produce any reported rank. A third party with API access to the same engines can rerun the queries and arrive at the same composite within the stated 95 percent confidence interval. Reproducibility is part of the methodology contract, not an enterprise upcharge."

AEO Claim-Evidence: Reproducibility is the central failure point in current AI benchmark research, with multiple peer-reviewed audits documenting that more than 40 percent of widely cited benchmarks cannot be independently replicated from public information (Bean et al., 2024). An AEO tool inherits this problem unless it commits to replication explicitly.


10. Conflict-of-interest disclosure: the vendor as media network

What the disclosure must include. Whether the vendor sells any other product that depends on, recommends, or otherwise interacts with brand visibility outcomes. Whether the vendor accepts any payment from tracked brands for placement, prioritization, or content distribution. Whether vendor staff hold equity in tracked brands or in the engines whose outputs are being measured.

Why it matters. An AEO vendor that also sells media placement to the brands it scores has a conflict the buyer must understand. The same vendor can plausibly recommend that a client buy a service that moves a number the vendor controls. Even where no conflict is operationally exploited, the structural setup is a procurement risk that should be disclosed at onboarding rather than discovered at renewal.

What it should look like. "GenPicked sells measurement and platform services. GenPicked does not accept payment from tracked brands for inclusion, ranking position, or content distribution. No GenPicked employee or contractor holds equity in any of the AI engines whose outputs are aggregated in the composite. Any future change to this posture will be disclosed in writing at least thirty days before taking effect."


How to use this checklist

Hand the ten components above to any AEO vendor as a procurement questionnaire. For each component, the vendor should respond in writing with the specific values, formulas, and version dates. Mark each response on a four-point scale.

The first level is a specific written answer with version dates and numeric values where applicable. This is what a complete disclosure looks like. The second level is a partial answer that addresses the question conceptually but does not commit to specific values or versions. This is acceptable for low-stakes engagements but should be flagged for follow-up. The third level is a verbal walkthrough on a sales call without a written follow-up. This is a credibility risk at renewal time. The fourth level is "proprietary algorithm" or silence. This is a procurement disqualifier for any agency serving sophisticated buyers.

A vendor that returns ten written answers across the ten components is reporting a measurement. A vendor that returns three is reporting a description. A vendor that returns "proprietary" is reporting a marketing number. The procurement decision follows from the ratio.

For the argument behind why this matters, see the methodology transparency piece. For the math behind a defensible ranking method, see the pairwise method explainer. For the metric most often computed badly in this category, see the share-of-model piece. For the strongest critiques of AEO measurement and where they apply, see the AEO critics piece.


Frequently asked questions

What is an AEO tool methodology disclosure?

It is a written document, ideally one to two pages, that specifies how a given AEO platform converts AI engine outputs into the visibility score it reports on its dashboard. A complete disclosure covers the ranking math, engine roster and weights, prompt design, sample size, uncertainty reporting, category definition, cross-engine disagreement handling, methodology version and changelog, replication procedure, and conflict-of-interest posture.

Why do most AEO vendors refuse to publish methodology?

The stated reason is almost always "proprietary algorithm" or "competitive moat." The underlying reasons are usually different. The formula is rarely a moat; pairwise ranking and the underlying statistical machinery have been public since the 1950s. The more common reasons are that the methodology has not been formalized internally to a publishable standard, or that the vendor wants to retain flexibility to recalibrate without external accountability.

Should I demand a methodology disclosure from my current vendor?

Yes, in writing, this week. The response itself is diagnostic. A specific written document is the best case. A private walkthrough on a sales call is a partial answer. Silence is a procurement disqualifier for any agency serving sophisticated buyers. The cost of asking is a single email. The cost of not asking is a credibility event at the next client review.

How does GenPicked handle methodology disclosure?

GenPicked publishes the methodology document and updates it on a quarterly review cycle. Every agency receives the current document at onboarding. The document covers all ten components on this checklist with specific values, version dates, and the reasoning behind each choice. The full document is available without a sales call at the GenPicked methodology page.

Is a long disclosure document a sign of better methodology?

Length is a weak signal. A complete two-page disclosure that addresses all ten components specifically beats a twenty-page document that uses general language to avoid commitment. The procurement test is specificity per component, not total word count. A disclosure that names version numbers, sample sizes, engine weights, and review cadences is committing to numbers. A disclosure that uses words like "robust," "industry-standard," and "comprehensive" without numbers is committing to nothing.

Where does this checklist break down?

It does not solve the upstream problem that some categories are inherently harder to measure than others. A brand operating in a category with five major competitors and broad public coverage will be easier to rank reliably than one in a fragmented category with thirty long-tail players. The checklist tells you whether the vendor is being honest about the measurement. It does not tell you whether the measurement is easy or hard in your specific category.



Run the disclosure check yourself

If your current AEO vendor cannot return written answers to the ten components on this checklist, the score on their dashboard is a description rather than a measurement. Run a free GenPicked AEO audit on any brand and receive the full methodology document alongside the score.

Start your 14-day free trial of GenPicked Growth


Dr. William L. Banks III is Founder of GenPicked. References to Fishkin and O'Donnell (SparkToro), Atwell and Alikhani on language-model sycophancy, Bean et al. on construct validity in AI benchmarks, and the underlying statistical literature on pairwise ranking are documented in the GenPicked research wiki. Specific citations available on request.

Dr. William L. Banks III

Co-Founder, GenPicked

#academy #blog #methodology #disclosure #procurement #r3 #checklist