Where AEO Has Already Won, and Where the Discipline Is Still Maturing


In this article, you will learn the four positions a thoughtful skeptic raises about AEO in 2026, the wins the discipline has already booked, the open frontiers the standard-bearers are still working on, and the answers GenPicked publishes in writing.


AEO is the new standard. Here is what already holds up.

AEO has matured fast. Two years ago "AI brand visibility" was a research thesis. In 2026 it is a discipline with peer-reviewed methodology, defensible metrics, and a buying standard. The article below is a confident, evidence-led tour of what holds up to skeptical scrutiny and what the standard-bearers, GenPicked included, are still working on. The skepticism is useful. Engaging it head-on is how the discipline matures.

If you have read a piece from Search Engine Land, Content Marketing Institute, SalesPeak, Demand-Genius, or Contently in the last six months, you have probably seen claims like "AI sycophancy has been reduced in newer models," "enterprise AEO tools are producing valid data already," "AEO is a familiar trap; standard SEO is sufficient," or "ChatGPT is 87 percent of AI search traffic, measure ChatGPT and stop there." Each of these has serious people and serious evidence behind it. Each one is worth taking seriously.

Below is the case for AEO an agency owner can put in front of a skeptical CFO: four positions named, four wins acknowledged, and four reasons the discipline is the right place to be investing in 2026.


Critique 1: "Sycophancy is mostly solved in newer models"

The argument

OpenAI reported in 2025 that GPT-5 reduced sycophantic responses from 14.5 percent to under 6 percent. Anthropic shipped Claude models in 2026 marketed as "the least sycophantic of any to date," outperforming all frontier models on the standard sycophancy evaluation benchmark used in AI safety research (Petri). OpenAI removed the sycophancy-prone GPT-4o model from the lineup entirely in February 2026.

The argument follows: if sycophancy is the load-bearing problem behind AEO measurement skepticism, and sycophancy has been substantially reduced, then the skepticism is dated.

Where the critics are right

The progress is real. A reduction from 14.5 percent to under 6 percent is meaningful. Anthropic and OpenAI deserve credit for treating sycophancy as a measurable problem and publishing improvement numbers. The thesis that "sycophancy is getting worse" would be wrong; it is getting better.

Where the critique falls short

A 6 percent sycophancy rate still means roughly one in 17 responses is sycophantic. At measurement scale, where an AEO scan generates hundreds or thousands of observations per brand per month, one in 17 is not a rounding error. It is structural noise that will not average out without methodology controls.
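
To make the arithmetic concrete, here is a minimal back-of-envelope sketch of the noise a 6 percent rate injects at scan scale. The 500-observation scan size is our illustrative assumption, not a figure from the studies cited above.

```python
import math

# Illustrative assumption: a 6 percent per-response sycophancy rate
# and a monthly scan of 500 observations for one brand.
p_syco = 0.06
n_obs = 500

expected = p_syco * n_obs                           # ~30 sycophantic responses
std_dev = math.sqrt(n_obs * p_syco * (1 - p_syco))  # binomial std dev, ~5.3

print(f"Expected sycophantic responses per scan: {expected:.0f}")
print(f"Month-to-month swing (one std dev):      +/-{std_dev:.1f}")
# A +/-5 swing on 500 observations moves a raw mention-rate metric by
# roughly a full percentage point for no real-world reason. That reads
# as a visibility change unless the methodology controls for it.
```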

Three additional concerns hold even after the improvement numbers:

The reduction was measured on general conversational benchmarks. Domain-specific rates can be substantially higher. Bitterman and colleagues' 2025 work documented sycophancy rates approaching 100 percent in medical recommendation contexts. There is no published evidence that brand-recommendation contexts have rates as low as the general-benchmark figures.

The underlying incentive structure has not changed. LLMs are trained on human preference data. Humans prefer agreement. The reinforcement signal that produces sycophancy is the same signal that produces the warmth users like, which is why removing sycophancy entirely is harder than reducing it on a specific test set.

The market uses multiple models simultaneously. ChatGPT, Claude, Gemini, and Perplexity each have different sycophancy profiles. An AEO scan that aggregates across engines is aggregating across different sycophancy rates, which creates a cross-model inconsistency problem that single-model sycophancy improvement does not solve.

What this means for an agency

If your AEO platform uses brand-anchored prompts that include the target brand name in the query, sycophancy reduction in the underlying model does not save you. The methodology bias is upstream of the model's response. Even a hypothetical zero-sycophancy LLM would still produce inflated scores for the favored brand if the prompt embedded the favor. The platform's prompt template policy matters more than the engine's improvement curve.
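
The distinction is easy to see in code. Below is a minimal sketch of the two template policies; the functions and query strings are illustrative assumptions, not any vendor's actual implementation.

```python
# Illustrative prompt templates; not any vendor's actual implementation.

def anchored_prompt(brand: str, category: str) -> str:
    # Brand-anchored: the target brand appears in the query itself,
    # inviting the model to agree that the brand belongs in the answer.
    return f"Is {brand} a good option for {category}? Which providers would you recommend?"

def blind_prompt(category: str) -> str:
    # Blind: the query names only the category. The brand earns its
    # mention or it does not. The bias is removed upstream of the model.
    return f"Which providers would you recommend for {category}?"

print(anchored_prompt("Acme Analytics", "B2B marketing attribution"))
print(blind_prompt("B2B marketing attribution"))
```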


Critique 2: "Enterprise AEO tools are producing valid data already"

The argument

Profound, Conductor, Evertune, and other enterprise platforms track brand mentions, citations, sentiment, and prompt-volume data across 10-plus AI engines. Profound has approximately 10 percent of the Fortune 500 as customers and closed a Series B in 2026. Conductor publishes industry benchmarks based on 3.3 billion sessions across 13,000-plus enterprise domains. Evertune processes over one million AI prompts per brand monthly. These are not toy tools. They are running at enterprise scale and producing data that buyers act on.

The argument follows: if the data were not valid, the buyers would have noticed by now.

Where the critics are right

Enterprise adoption is real. The platforms are useful for many things beyond pure measurement accuracy: competitive intelligence, executive reporting, trend tracking, board-presentable summary numbers. We do not argue that these tools should not exist, or that buyers should refuse to use them. They are providing value in ways that go beyond the narrow question of measurement validity.

Where the critique falls short

None of the major enterprise AEO platforms have published independent methodological validation. The data is generated by their own systems and reported by their own dashboards. Vendor-published case studies are useful for marketing but carry inherent conflict of interest.

The 2026 SparkToro consistency study (600 volunteers, 2,961 runs across ChatGPT, Claude, and Google AI) found that fewer than one in 100 runs produced the same list of brands and fewer than one in 1,000 produced the same list in the same order. This directly challenges the premise that point-in-time snapshots produce reliable data. The major platforms' published metrics do not disclose how they account for this noise.
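
The consistency question is straightforward to quantify. Here is a minimal sketch of a run-to-run check in the spirit of the SparkToro study, with illustrative brand lists standing in for real scan output.

```python
from itertools import combinations

def jaccard(a: list[str], b: list[str]) -> float:
    """Overlap between two brand lists, ignoring order."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb)

def consistency_report(runs: list[list[str]]) -> dict:
    """Exact-match rate (same ordered list) and mean set overlap across run pairs."""
    pairs = list(combinations(runs, 2))
    exact = sum(a == b for a, b in pairs) / len(pairs)
    overlap = sum(jaccard(a, b) for a, b in pairs) / len(pairs)
    return {"exact_match_rate": exact, "mean_jaccard": overlap}

# Illustrative repeated runs of the same prompt on the same engine.
runs = [
    ["BrandA", "BrandB", "BrandC"],
    ["BrandB", "BrandA", "BrandD"],
    ["BrandA", "BrandC", "BrandE"],
]
print(consistency_report(runs))  # {'exact_match_rate': 0.0, 'mean_jaccard': 0.4}
```

A platform that publishes numbers like these alongside its scores is disclosing its noise floor. One that publishes a single point-in-time snapshot is hiding it.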

The prompt template policy is the methodology decision that most determines validity, and we cover the specifics in our methodology transparency article. The short version: when prompts include the target brand name, mention rates inflate significantly. Most enterprise platforms have not disclosed their prompt template policy publicly. Until they do, "their data is valid" cannot be asserted with confidence in either direction.

There is also a construct-validity problem documented in the marketing-science literature. Churchill's 1979 framework for developing valid measurement constructs has been the foundation of marketing measurement for more than four decades. No AEO platform has published evidence that its visibility metric satisfies any step of the Churchill framework. The platforms are measuring constructs they have not formally specified. This is not a fatal flaw, but it is a real one, and it is hidden from buyers who have not been trained in measurement methodology.

A separate confounding factor: training-data frequency. Brands that appear more often in the training corpus appear more often in LLM outputs, independent of any current brand strength. A "visibility score" that does not control for training-corpus frequency may be reporting historical media volume rather than present brand performance.
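
One way to see the confound is to predict raw visibility from a corpus-frequency proxy and inspect what is left over. The sketch below uses a deliberately crude one-variable regression and made-up numbers; it illustrates the idea, not GenPicked's actual control method. (`statistics.covariance` requires Python 3.10 or later.)

```python
import statistics

# Illustrative data: a corpus-frequency proxy (e.g., log of historical
# web mentions) and raw AEO visibility scores. All numbers are made up.
brands      = ["Incumbent", "Challenger", "Newcomer"]
corpus_freq = [9.2, 5.1, 1.3]
visibility  = [72.0, 48.0, 15.0]

# One-predictor ordinary least squares.
slope = statistics.covariance(corpus_freq, visibility) / statistics.variance(corpus_freq)
intercept = statistics.mean(visibility) - slope * statistics.mean(corpus_freq)

for brand, freq, score in zip(brands, corpus_freq, visibility):
    predicted = intercept + slope * freq
    # A residual near zero suggests the "visibility" is explained by
    # historical corpus frequency, not present brand performance.
    print(f"{brand}: raw={score:.0f}, corpus-predicted={predicted:.1f}, residual={score - predicted:+.1f}")
```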

What this means for an agency

Enterprise scale is not the same as methodological validity. The two questions are independent. An agency that needs to defend a visibility number to a sophisticated client cannot defend it with "Profound says so." The agency needs a methodology document the platform has published, or the conversation ends badly. The procurement question is whether your current platform has published the document.


Critique 3: "AEO is overhyped and brands should not invest at all"

The argument

The most pointed version of this critique comes from Google's own representatives (Gary Illyes argued in 2025 that standard SEO is sufficient for AI search results) and from independent practitioners like Schwartz, who has argued from years of agency experience that AEO investment is driven by FOMO rather than evidence of business impact.

The Conductor 2026 benchmark data supports the position numerically: AI referral traffic averages 1.08 percent of all website traffic across 13,000-plus enterprise domains. A category that drives 1 percent of traffic does not warrant the resource intensity that vendors are selling.

Where the critics are right

The FOMO narrative is partially correct for many companies. If you are a B2C brand in a category where consumers rarely consult AI for product decisions, AEO investment can be premature. The 1.08 percent traffic share number is real and worth respecting. Building an extensive AEO program for a brand whose customers are still finding it through Google and word-of-mouth is solving a problem the brand does not yet have.

Schwartz's specific critique that many agencies are selling AEO as a panacea when basic SEO hygiene is the missing capability is largely correct. The order of operations matters: a brand with broken canonicals, thin content, and a weak backlink profile will not be saved by AEO optimization. AEO is a layer on top of foundational SEO, not a replacement for it.

Where the critique falls short

The argument conflates two different questions. Whether AEO is worth investing in is a strategic question that depends on the brand's category and customer base. Whether AEO measurement is methodologically valid is a separate question. The Brand Intelligence Gap research is about the second question, not the first.

For brands in categories where AI recommendations materially influence buyer behavior (notably B2B services, B2B software, professional services, certain consumer categories with high-consideration purchases), AEO is not a hype play. It is a real channel that real buyers are using. The 1.08 percent traffic share is a population average; specific industries are already at 2.80 percent (IT) and growing roughly 1 percent month over month per the Conductor data. In some agency-relevant verticals, AI search has already passed the threshold where ignoring it is malpractice.

The deeper problem the Schwartz critique exposes is that the AEO category sold the wrong promise. It sold "AI optimization" as the layer when the actual missing capability for most brands was "AI measurement." Optimization without measurement is faith-based. Measurement without optimization is unhelpful. The mature AEO platform addresses both, with measurement first and optimization second.

What this means for an agency

If your client is asking "should we invest in AEO," the right answer depends on their vertical. If your client is asking "is what our current AEO vendor measuring even real," the right answer is "let me see their methodology document." Different questions. Different answers. Don't conflate them.


Critique 4: "ChatGPT is 87 percent of AI search traffic. Measure ChatGPT and stop there."

The argument

Various AEO vendors and trade publications cite ChatGPT's dominant share of AI referral traffic (87.4 percent per Conductor 2025, 68 percent per Similarweb January 2026) to argue that single-model measurement is sufficient. If 87 percent of the traffic comes from one engine, measuring the other engines is over-engineering.

Where the critics are right

For pure traffic attribution, ChatGPT measurement captures the majority of current AI referrals. If the agency's job is to count where the visits came from and report the number, ChatGPT-only measurement is a defensible cost trade-off.

Where the critique falls short

Three problems make ChatGPT-only measurement inadequate as the AEO standard.

First, the SparkToro 2026 inconsistency finding documented variation within single models, not just across models. Fewer than one in 100 runs produced the same brand list even on the same engine, prompt, and time window. ChatGPT-only measurement does not solve the consistency problem; it just measures one engine's version of the problem.

Second, ChatGPT's dominance is declining. The 87.4 percent figure from Conductor is from 2025. The 68 percent figure from Similarweb is from January 2026. That is a 19-point drop in less than a year. Building a measurement strategy around the assumption that ChatGPT will continue to dominate is building on a moving target.

Third, B2B buyers in particular use multiple engines in their research process. The 6sense 2025 B2B Buyer Report documented that the majority of B2B buyers consult two or more AI tools during a single purchase consideration. For brands selling to B2B buyers, the relevant question is not "which engine has the most traffic" but "which engine the buyer happened to consult last before reaching out to sales." That can be any engine in the set.

A separate point: cross-engine consistency is itself the signal that matters most for many use cases. If a brand is cited consistently across ChatGPT, Claude, Gemini, and Perplexity, the brand has earned genuine category authority. If the brand is cited prominently in one engine and not the others, the visibility is fragile. Single-engine measurement cannot surface this distinction.
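
The distinction is cheap to compute once multi-engine data exists. A minimal sketch, with illustrative citation data and a scoring rule that is our assumption rather than a published standard:

```python
# Illustrative data and scoring rule; not GenPicked's published methodology.

def cross_engine_consistency(cited: dict[str, bool]) -> float:
    """Fraction of tracked engines that cite the brand at all."""
    return sum(cited.values()) / len(cited)

broad   = {"chatgpt": True, "claude": True,  "gemini": True,  "perplexity": True}
fragile = {"chatgpt": True, "claude": False, "gemini": False, "perplexity": False}

print(cross_engine_consistency(broad))    # 1.0  -> earned category authority
print(cross_engine_consistency(fragile))  # 0.25 -> visibility concentrated in one engine
# Both brands look identical to ChatGPT-only measurement; only the
# multi-engine view separates durable authority from fragile visibility.
```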

What this means for an agency

ChatGPT-only measurement is a budget decision, not a methodology decision. If your client has limited spend, measuring ChatGPT is better than measuring nothing. If your client has the budget for multi-engine measurement, single-engine is leaving information on the table. The honest conversation with the client is: here is what ChatGPT-only tells us, here is what we miss, here is the cost trade-off.


Putting the critiques together

The four critiques have a common shape. Each one is right about something real and wrong about something specific.

The sycophancy critique is right that the underlying models are improving. It is wrong that the improvement removes the need for methodology controls.

The enterprise-validity critique is right that the platforms have scale and produce useful intelligence. It is wrong that scale and validity are the same thing.

The hype critique is right that many brands have been sold AEO services they do not yet need. It is wrong that the entire category is unworthy of measurement.

The single-model critique is right that ChatGPT dominates current traffic. It is wrong that traffic share is the only thing AEO measurement is for.

What the four critiques share is a frustration with how the AEO category has marketed itself. The criticism is not coming from people who think AI search will not matter. It is coming from people who think the current AEO toolset is not equal to the question. That frustration is correct. Our response is to publish methodology, name the limitations honestly, and ask agencies to demand the same disclosure from every vendor they evaluate.


Five questions to ask any AEO vendor that addresses these critiques head-on

A vendor that takes the critiques seriously can answer these five questions in writing. The list is the same one we use in our methodology transparency article, restated here as a critique-response framework.

  1. Engine weighting. Which engines are tracked and how are they weighted in your composite? This addresses the single-model critique. A vendor that publishes weights has thought about cross-engine measurement seriously (see the sketch after this list).

  2. Prompt template policy. Are your prompts blind (no brand name in the query) or anchored? This addresses the sycophancy critique. A vendor using blind prompts has neutralized the methodology bias that improved models alone cannot fix.

  3. Sample size per scan per engine. This addresses the inconsistency critique. A vendor running thirty or more prompts per engine per scan has the sample size to overcome stochastic variation.

  4. Citation extraction methodology. How are mention positions classified? This addresses the construct-validity critique. A vendor that publishes citation classification rules has formally specified what they are measuring.

  5. Construct definition. What is your visibility score measuring, and how does the measurement satisfy Churchill's marketing-measurement criteria? This addresses the deepest critique. A vendor that can answer the construct question is operating at a higher methodological tier than the category average.
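
On the first question, a vendor that publishes weights is effectively publishing something like the sketch below. The weights and per-engine scores here are illustrative assumptions, not GenPicked's actual published weighting.

```python
# Illustrative engine weights and scores; not GenPicked's actual weighting.
ENGINE_WEIGHTS = {
    "chatgpt": 0.40,
    "gemini": 0.25,
    "claude": 0.15,
    "perplexity": 0.15,
    "copilot": 0.05,
}

def composite_visibility(per_engine_scores: dict[str, float]) -> float:
    """Weighted average of per-engine visibility scores (0-100)."""
    assert abs(sum(ENGINE_WEIGHTS.values()) - 1.0) < 1e-9, "weights must sum to 1"
    return sum(weight * per_engine_scores.get(engine, 0.0)
               for engine, weight in ENGINE_WEIGHTS.items())

scores = {"chatgpt": 62.0, "gemini": 55.0, "claude": 40.0, "perplexity": 33.0, "copilot": 20.0}
print(f"Composite visibility: {composite_visibility(scores):.1f}")  # 50.5
```

The point of the disclosure is that a client can recompute the number by hand. A composite without published weights cannot be audited.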

If the vendor cannot answer in writing, the critique applies. If the vendor can answer, the critique has been engaged with seriously. Either result tells you what you need to know.


Frequently asked questions

Why publish a piece that takes critics' arguments seriously?

Because the alternative is dishonest. The four critiques exist in the trade press because they have some truth in them. An agency owner who reads only one side of the argument is unprepared for the moment a sophisticated client raises the other side. We would rather the client raise the critique with the agency owner who has already read this article than with the one who has not.

Does GenPicked claim to have solved all four critiques?

No. We claim to have addressed them methodologically and to publish the methodology choices openly. We address the sycophancy critique with blind prompts. We address the validity critique by publishing engine weights, sample size, and citation classification. We address the hype critique by being explicit about which agency situations warrant AEO and which do not. We address the single-model critique by tracking five engines with documented weighting. Where the critics are right, we concede. Where they fall short, we engage.

Will the critics' arguments become stronger or weaker over time?

Some will become weaker. Model sycophancy will keep improving. Some will become stronger. The 1 percent traffic share will rise in many verticals, which makes "AEO is hype" harder to defend in those categories. The construct-validity critique will remain until vendors publish methodology, which we hope the rest of the category eventually does.

What is the worst argument against AEO?

The variant of the hype critique that says "AI search will never matter for serious buyers" is the weakest argument we encounter. The data is clear that consultation rates among B2B buyers are climbing and that consumer adoption is past majority in some categories. Whether AEO matters for your brand is a real question; whether AEO matters at all is increasingly not.

What about the critique that AEO platforms are just expensive social-listening tools rebadged?

This is a fair adjacent critique we did not address above because it applies more to specific vendors than to the category as a whole. We have written about it indirectly in our comparison articles, which cover cases where a legacy social-listening vendor pivots into AEO without changing its underlying measurement approach.

Is your methodology document publicly accessible?

Yes, with each customer scan and at the methodology link in this article. The five disclosure points are public.




Try the framework yourself

The five-question framework above works on any AEO vendor, including ours. Run a free GenPicked AEO audit and request the methodology document alongside the report. If a competing vendor's methodology document is more rigorous than ours, we want to know.

Start your 14-day free trial of GenPicked Growth →


Dr. William L. Banks III is Founder of GenPicked. The critiques engaged in this article are real positions held by real people in the trade press; we steelmanned the strongest version of each rather than the easiest to refute. References available on request.
