What the CMSWire AEO Crisis Piece Misses: The Methodology Already Exists

In this article, you will learn what the recent CMSWire piece gets right about the symptom CMOs are feeling, where its diagnosis slides into a misread, what defensible AEO measurement actually looks like today, and how a CMO can tell the difference between a vendor that has solved the methodology problem and one that has not.


The piece gets the symptom right and the answer wrong

CMSWire published a piece in 2026 titled "Brands Are Having a 'Crisis of Faith.' AEO Isn't Making It Easier." It ranks at position one in Google for the query "aeo measurement crisis," which is how most CMOs find the argument. The symptom description is accurate, and a CMO is right to feel it. The Fishkin and O'Donnell 2026 SparkToro study ran 2,961 identical prompts through ChatGPT, Claude, and Google AI and found that fewer than one percent of repeated runs returned the same brand list, with fewer than one in 1,000 returning the same list in the same order (Fishkin and O'Donnell, 2026). Dashboards built on top of single-prompt absolute-ranking measurement cannot survive that volatility in a procurement review.

What the CMSWire framing misses is that the answer already exists. The volatility is a property of single-prompt absolute-ranking measurement, not of AEO writ large. A different standard (pairwise, blind, counterbalanced, multi-engine, and disclosed) produces stable relative rankings on top of the same noisy LLM outputs. The math that turns volatile absolute rankings into stable relative ones has been public since 1952. The AI evaluation research community has spent three years applying it to model ranking on a public leaderboard fed by millions of human votes. Applying it to AEO is engineering, not invention. GenPicked builds AEO measurement on this standard because the methodology has matured, and the discipline is finally defensible to a CFO.

This article walks through where the CMSWire piece is right, where its diagnosis slides, and what the buyer-side test is for telling a defensible AEO vendor from one selling theater.

Where CMSWire is right

The symptom description in the piece is accurate, and a CMO should take it seriously.

Marketers are reporting volatile rankings that move month to month for reasons unrelated to actual brand activity. Vendors are publishing dashboards without confidence intervals, prompt template disclosure, or sample-size documentation. Published case studies from major AEO platforms typically do not include the methodology audit a sophisticated buyer would expect from a measurement vendor in any other category.

The 2,961-prompt SparkToro study found that fewer than one percent of repeated identical prompts returned the same brand list across ChatGPT, Claude, and Google AI. Fewer than one in 1,000 returned the same list in the same order. Industry estimates put annual spend on AEO and AI brand visibility analytics at over 100 million dollars (Fishkin and O'Donnell, 2026). A market that size, sitting on data that volatile, has a real problem.

The loss of confidence is coming from CMOs who already bought the tools, watched the rankings move in ways that did not match their own market reality, and started asking questions the dashboards could not answer.

Where the diagnosis is wrong

The CMSWire framing slides from a real symptom into a misdiagnosis. Three arguments against the misdiagnosis are worth naming.

First, the inconsistency in single-prompt outputs is a property of LLMs, not of brand presence. The Fishkin study itself notes that visibility percentage across many queries is more consistent than ranking position. The consistency problem is specific to absolute ranking from small samples, not to measurement of brand presence in AI outputs.

Second, the AI evaluation research community has spent three years solving an almost identical noise problem for ranking AI models. The public leaderboard that ranks frontier AI models from millions of human preference votes asks for pairwise comparisons and aggregates the wins into stable relative rankings. The math that turns volatile absolute rankings into stable relative ones has been public since 1952. Applying it to AEO is engineering, not invention.

Third, Bean, Brennan, and Buitelaar audited 445 LLM benchmarks in 2024 and found that only 78.2 percent explicitly defined their target construct. The right diagnostic question is not "can we measure AI brand visibility?" It is "has this vendor specified what they are measuring with enough rigor to defend it?"

That audit covered benchmarks published between 2018 and 2024, and roughly one in five provided no construct definition at all. AEO platforms inherit the same problem because most have not formally specified what their visibility metric represents. A measurement crisis driven by undefined constructs is solvable by defining the construct, not by abandoning measurement.

What defensible measurement actually looks like

A defensible measurement program has five components. None require new science. All are engineering choices any serious vendor can disclose in writing.

First, a stable construct definition. The vendor publishes what their visibility score is measuring, what it is not, and how observations aggregate into the reported number. Churchill's 1979 marketing measurement framework has been the foundation of the discipline for more than four decades, and applying it to AEO is overdue.

Second, pairwise comparison rather than absolute ranking. Asking an engine to choose between two brands at a time and aggregating thousands of those decisions produces a more stable ranking than asking for an ordered list from scratch. The mechanics live in our pairwise method article.
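
The aggregation behind this step is almost certainly the 1952 Bradley-Terry model alluded to above, the same family of math the public AI-model leaderboards use. As a rough sketch of the idea rather than GenPicked's production pipeline, the snippet below turns a table of hypothetical pairwise win counts into a relative ranking; the brand names, counts, and function names are all illustrative.

```python
# Illustrative sketch: aggregate pairwise "which brand did the engine pick?"
# outcomes into a stable relative ranking with a Bradley-Terry style model.
# Hypothetical data; not GenPicked's production code.

def bradley_terry(wins, iterations=200):
    """wins[(a, b)] = number of trials in which brand a beat brand b."""
    brands = {b for pair in wins for b in pair}
    strength = {b: 1.0 for b in brands}
    for _ in range(iterations):
        new = {}
        for i in brands:
            total_wins = sum(w for (a, _), w in wins.items() if a == i)
            denom = 0.0
            for j in brands:
                if j == i:
                    continue
                n_ij = wins.get((i, j), 0) + wins.get((j, i), 0)  # trials for this pair
                if n_ij:
                    denom += n_ij / (strength[i] + strength[j])
            new[i] = total_wins / denom if denom else strength[i]
        norm = sum(new.values())
        strength = {b: s / norm for b, s in new.items()}  # normalize each pass
    return sorted(strength.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical win counts from blind pairwise trials.
wins = {("BrandA", "BrandB"): 61, ("BrandB", "BrandA"): 39,
        ("BrandA", "BrandC"): 70, ("BrandC", "BrandA"): 30,
        ("BrandB", "BrandC"): 55, ("BrandC", "BrandB"): 45}
print(bradley_terry(wins))  # stable relative ordering despite noisy single runs
```

The useful property is that individual trial outcomes can flip freely while the fitted strengths, and therefore the relative ranking, stay stable once the sample is large enough.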

Third, blind prompts. The target brand name does not appear in the query. When a prompt names a brand, the response inflates that brand's apparent visibility substantially, a pattern documented in the BASIL Bayesian sycophancy framework from Atwell and Alikhani (2025). A program that names brands in its prompts is measuring the model's tendency to echo back the brand it was told about.

The BASIL framework demonstrated that LLMs overcorrect their beliefs in response to user signals far more drastically than humans do, significantly increasing reasoning errors when a user signal is present. A prompt that names a target brand can inflate reported visibility by margins that swamp any real signal. Blind prompts are the fix.
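
For concreteness, here is a toy contrast between a blind and a brand-anchored prompt, with placeholder brands and a placeholder query_engine call; the point is that the measured brand never appears in the query, only in the response.

```python
# Toy illustration of blind vs. brand-anchored prompt design.
# Brand names, prompt wording, and query_engine() are hypothetical placeholders.
import re

BRANDS = ["BrandA", "BrandB", "BrandC"]  # brands under measurement

# Blind: the prompt describes the category and never names the target brand.
BLIND_PROMPT = "What are the best project management tools for a 50-person agency?"

# Brand-anchored (avoid): naming the brand invites the model to echo it back.
ANCHORED_PROMPT = "Is BrandA one of the best project management tools for a 50-person agency?"

def mentioned_brands(response_text):
    """Return which tracked brands appear in an engine's response."""
    return {b for b in BRANDS if re.search(rf"\b{re.escape(b)}\b", response_text)}

# response = query_engine("chatgpt", BLIND_PROMPT)   # placeholder engine call
# print(mentioned_brands(response))
```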

Fourth, counterbalanced trial design. When two brands appear together, the order is randomized so each appears first in half the trials. This randomization technique (a Latin square, in experimental design terms) cancels position effects.
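
A minimal sketch of what a counterbalanced schedule looks like, with hypothetical brands and trial counts: every ordered pair appears equally often, so neither brand benefits from systematically being named first.

```python
# Minimal sketch of a counterbalanced trial schedule: for every brand pair,
# half the trials present one brand first and half present the other first.
# Brand names and trial counts are hypothetical.
from itertools import combinations
import random

def counterbalanced_trials(brands, trials_per_pair=100):
    schedule = []
    for a, b in combinations(brands, 2):
        half = trials_per_pair // 2
        schedule += [(a, b)] * half + [(b, a)] * half  # each order appears equally often
    random.shuffle(schedule)  # randomize run order as well
    return schedule

trials = counterbalanced_trials(["BrandA", "BrandB", "BrandC"])
print(len(trials), trials[:4])
```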

Fifth, multi-engine measurement with disclosed weighting. A single-engine score reports one engine's behavior. A multi-engine composite captures the cross-engine signal that predicts buyer outcomes, but only if the weights are published. The methodology transparency article covers why weight disclosure is the most defensible procurement question a CMO can ask.
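
As a sketch of how a disclosed-weight composite can work (the engines, weights, and scores below are placeholders, not GenPicked's published values):

```python
# Sketch of a multi-engine composite with disclosed weights.
# Engines, weights, and per-engine scores are placeholders.
ENGINE_WEIGHTS = {"chatgpt": 0.40, "gemini": 0.25, "claude": 0.20, "perplexity": 0.15}

def composite_score(per_engine_scores):
    """Weighted average of per-engine visibility scores; the weights are published."""
    assert abs(sum(ENGINE_WEIGHTS.values()) - 1.0) < 1e-9, "weights must sum to 1"
    return sum(ENGINE_WEIGHTS[e] * per_engine_scores.get(e, 0.0) for e in ENGINE_WEIGHTS)

print(composite_score({"chatgpt": 0.31, "gemini": 0.24, "claude": 0.28, "perplexity": 0.19}))
```

When the weights are published, a procurement team can recompute the composite from the per-engine scores and confirm the number on the dashboard.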

What the metric should actually represent

The proposed composite metric that comes out of a program built on these five pieces is "share of model." It is the AI-era analog of share of voice from advertising, representing the proportion of AI-generated responses in a category that mention or recommend a given brand. The construct is straightforward. The hard part is measuring it without contamination from sycophancy, position effects, single-engine bias, and sample-size fragility. The deeper share of model article walks through the construct in detail.

A defensible share-of-model score is computed from blind, counterbalanced pairwise comparisons across at least four major engines with documented weighting and confidence intervals around each reported position. The same metric computed from brand-anchored absolute-ranking single-engine prompts can swing by 20 or more percentage points across identical re-runs, which is what produced the consistency findings the CMSWire piece reports. The metric name does not guarantee the methodology behind it.
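
To make the confidence-interval requirement concrete, here is a hedged sketch that computes a share-of-model estimate from per-response mention flags and bootstraps an interval around it; the synthetic sample and the 95 percent level are illustrative choices, not GenPicked's reporting policy.

```python
# Illustrative sketch: share of model as the proportion of responses that mention
# the brand, with a bootstrap confidence interval. Synthetic data throughout.
import random

def share_of_model(mentions):
    """mentions[i] = 1 if response i mentioned the brand, else 0."""
    return sum(mentions) / len(mentions)

def bootstrap_ci(mentions, n_boot=2000, level=0.95, seed=7):
    rng = random.Random(seed)
    stats = sorted(
        share_of_model([rng.choice(mentions) for _ in mentions])
        for _ in range(n_boot)
    )
    lo = stats[int((1 - level) / 2 * n_boot)]
    hi = stats[int((1 + level) / 2 * n_boot) - 1]
    return lo, hi

mentions = [1] * 310 + [0] * 690   # synthetic: brand appeared in 310 of 1,000 responses
print(share_of_model(mentions), bootstrap_ci(mentions))
```

A reported position that moves within its interval month to month is noise; a move outside it is a signal worth investigating.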

What a CMO should ask before the next renewal

A CMO should leave any AEO vendor meeting with written answers to five questions. They track the five methodology pieces above and parallel the framework our AEO critics piece walks through for the agency audience:

  1. What is your visibility score measuring, in construct terms, and what evidence do you publish for construct validity?
  2. Do you use absolute ranking or pairwise comparison, and if absolute, how do you mitigate position bias and sampling variance?
  3. Are your measurement prompts blind to the target brand, or do they name the brand in the query?
  4. How many engines do you query, what weighting do you apply, and is the weighting published?
  5. What is the per-period sample size per comparison, and can you produce confidence intervals around each reported rank?

A vendor that answers all five in writing has earned a serious evaluation. A vendor that answers two or three has a partial methodology and the gaps should be priced in. A vendor that calls the formula "proprietary" is reporting a marketing number rather than a measurement.

Why this matters more this year than last year

The CMSWire framing lands at the moment AEO measurement is starting to matter to CFOs and procurement teams, not just marketing leaders. The 100-million-dollar annual spend estimate is enough to attract finance-team scrutiny. When the finance team asks "how was this number calculated," the answer "the platform does not disclose its methodology" ends the conversation badly.

The category will sort itself in the next 24 months. Vendors that disclose methodology will keep accounts. Vendors that hide behind "proprietary" will lose them. A CMO who reads the CMSWire piece and concludes the category is broken will miss the sorting event. A CMO who concludes the methodology bar is rising will be on the right side of the renewal cycle.

GenPicked publishes its methodology, engine weighting, sample-size policy, and construct definition on the methodology page that ships with every account. The crisis is real. The fix is older than the crisis.


Frequently asked questions

Is the CMSWire piece on the AEO measurement crisis accurate?

The symptom description is accurate. Marketers are losing confidence in AEO data, outputs are volatile, and published methodology from major vendors is thin. The implication that the category itself is structurally unsound does not follow from the evidence. The volatility is specific to absolute-ranking single-prompt scans, not to measurement of brand presence in AI outputs.

Why are AEO rankings so inconsistent month to month?

LLM outputs are stochastic. The 2026 SparkToro study found fewer than one in 100 repeated identical prompts returned the same brand list. Ranking dashboards built on small single-prompt samples inherit that variance. Pairwise comparison methods with larger sample sizes and multiple engines stabilize the signal, which is why the AI evaluation research community switched to that approach for ranking AI models in 2023.

Is share of model a real metric or marketing language?

Both, depending on how it is measured. The construct is real and tracks the AI-era analog of share of voice. The computation can be defensible (blind, counterbalanced, pairwise, multi-engine, disclosed) or it can be noise (brand-anchored, absolute-ranking, single-engine, undisclosed). A CMO has to look at the methodology behind the score.

What is the single most important question to ask an AEO vendor?

"What is your visibility score measuring, in construct terms, and what evidence do you publish for construct validity?" If the vendor cannot answer the construct question, the rest of the methodology does not matter.

Does GenPicked claim to have solved the measurement crisis?

GenPicked publishes the five methodology pieces this article walks through: construct definition, pairwise comparison, blind prompts, counterbalanced trial design, and disclosed multi-engine weighting. We do not claim the category is solved. We claim the methodology is documented and the gaps are named openly.

Will this measurement crisis get better or worse over the next 24 months?

Better for buyers who learn to ask the construct and methodology questions, worse for vendors who cannot answer them. The sorting will be driven by procurement teams catching up to the methodology literature, not by new science.


See the methodology applied to your brand

If your current AEO vendor reports rankings without construct definition, blind-prompt design, or disclosed engine weighting, run a free GenPicked AEO audit to see the same brand measured with all five methodology pieces disclosed.

Start your 14-day free trial of GenPicked Growth


Dr. William L. Banks III is Founder of GenPicked. References to the SparkToro 2026 study, the Bean et al. 2024 benchmark audit, and the Atwell and Alikhani 2025 BASIL framework are documented in the GenPicked research wiki.

Dr. William L. Banks III

Co-Founder, GenPicked
