The client emails on a Monday morning: “Did our AEO work this quarter?” The agency dashboard shows a clean “+4 points” on AI visibility. But that composite number is the average of five engines that disagreed on 81% of cited domains — an average that hid a 23% volatility swing in Reddit citations between October and November 2025 (Conductor 2026 AEO/GEO Benchmarks). The number went up. The strategic finding moved the other way.
Across the same queries, only 11% of cited domains overlap between any two AI engines, and 71% of cited sources appear on only one platform (Discovered Labs). ChatGPT and Gemini agree on brand citations only 19% of the time across 2,089 tracked brands (Loamly). Google’s AI Mode and AI Overviews overlap on only 13.7% of URLs (Semrush). A composite “AI visibility” number averages over those disagreements and tells an agency almost nothing about where to spend the next dollar.
This post is for agency owners running portfolios across dental, law, HVAC, healthcare, insurance, mortgage, real estate, e-commerce, accounting, B2B SaaS, MSP/IT, and wealth management. It lays out what the published cross-engine and per-industry data shows, every number traceable to a real URL, and uses the GenPicked Research Team (2026) Fitness Wearables Study as the worked methodology example. Honest disclosure: the 12-vertical Bradley-Terry numbers are not yet a completed GenPicked study — this extends the published framework.
Start your 14-day free trial
Growth plan free for 14 days. Five AI engines. Full agency dashboard.
Start free trial

The “single score lies” problem — why averaging hides the strategic finding
Brand-mention rates and citation behaviors vary so dramatically by engine that aggregate scores misrepresent what is happening. The GenPicked Research Team (2026) Fitness Wearables Study documented Claude as 6.7× more reactive to brand anchoring than GPT-5 — meaning when a brand is explicitly named in the prompt, Claude’s win-rate uplift dwarfs GPT-5’s. Two clients with identical composite scores can be in completely different strategic positions: one with strong unaided awareness across engines, one whose “visibility” collapses the moment a competitor is named in the prompt.
The Loamly 2,089-brand analysis sets the market context: 77% of brands are completely invisible to ChatGPT, and the brands that aren’t convert AI-sourced traffic at roughly three times the rate of non-branded Google organic. If a client wins on ChatGPT and loses on Claude, the average describes no actual model — the statistical equivalent of the household with 1.8 children.
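The averaging failure is easy to demonstrate with arithmetic. A minimal sketch (the engine names are real, but every score below is hypothetical): two clients with identical composites and opposite per-engine risk profiles.

```python
# Hypothetical scores -- illustrative only, not real client data.
# Two clients, same composite "AI visibility", opposite strategic positions.

client_a = {"ChatGPT": 82, "Claude": 78, "Gemini": 74, "Perplexity": 70, "AI Overviews": 76}
client_b = {"ChatGPT": 95, "Claude": 31, "Gemini": 88, "Perplexity": 99, "AI Overviews": 67}

def composite(scores):
    """The single number most dashboards report: a plain mean across engines."""
    return sum(scores.values()) / len(scores)

def spread(scores):
    """Max-minus-min across engines: the divergence the composite throws away."""
    return max(scores.values()) - min(scores.values())

print(composite(client_a), composite(client_b))  # identical composites: 76.0 and 76.0
print(spread(client_a), spread(client_b))        # spreads of 12 vs 68
```

Client B's composite is healthy while its Claude visibility has collapsed; the composite alone cannot tell the two accounts apart.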
Brand mentions correlate 0.664 with AI visibility versus 0.218 for backlinks across RivalHound’s 75,000-brand analysis; YouTube mentions correlate 0.737 — the strongest single factor in Ahrefs’ data. Composite scores that don’t split out where mentions come from miss the lever that moves the needle.
Five engines, five citation diets — the cross-cutting biases
Before industry-specific bias, agencies need the per-engine source-diet differences. These cross-cutting patterns compound on top of vertical behavior. Every number is from a named third-party study.
Reddit accounts for ~40.1% of all LLM citations across major engines (Semrush, 150K citations, June 2025) — but that share is unstable. The top 5 domains command 38% of all AI citations; the top 20 command 66% (Ahrefs). The Gemini 3 rollout on January 27, 2026 replaced 42% of previously cited domains and pulled 32% more sources per response. Agencies optimizing for the average miss every regime change.
A wealth-management client cannot be optimized for Perplexity without acknowledging that Reddit is 46.7% of Perplexity’s citation diet — even if Reddit feels off-brand for RIA compliance. Industry bias rides on top of these source-diet biases; it doesn’t replace them.
The 12 agency verticals — what the published data actually says
GenPicked supports 12 vertical playbooks across the agency portfolio. Honest disclosure: a 12-vertical Bradley-Terry study is not yet completed — the Fitness Wearables Study is the methodology proof. The vertical numbers below come from named third-party research; GenPicked’s framework adds the scaffolding for converting them into per-vertical rankings.
Healthcare shows the highest AI Overview trigger rate of any vertical: 48.75%, with AI referral traffic at 0.87% (below the 1.08% cross-industry average) due to YMYL caution. ChatGPT-vs-Google-AI brand divergence is 62% in healthcare — the highest BrightEdge documented anywhere. Healthcare clients on a single composite score are flying blind on the engine split that defines their category.
Financial services and wealth management trigger AI Overviews at 25.79%, but finance has the lowest top-10 citation overlap of any tracked industry — roughly 11.3%, meaning ~89% of finance citations come from outside Google’s first page (ALM Corp). For an RIA client, Google rank and AI citation are not the same workstream.
Real estate sits at the opposite end: 4.48% AI Overview trigger rate — the lowest of analyzed industries. Zillow dominates brand mentions; Hines and Public Storage lead citations. For a brokerage client, the share-of-voice game is against Zillow, not the next brokerage down the street.
Information Technology and B2B SaaS drive the heaviest AI referral traffic: 2.8% — 2.6× the cross-industry average. ChatGPT-vs-Google divergence in B2B tech is 47%, second only to healthcare. SaaS citations skew toward G2 and Reddit, making the “Reddit-first for Perplexity” play disproportionately strategic for SaaS clients.
Consumer discretionary and e-commerce offer the most actionable agency story. Transactional queries trigger AI Overviews at only 8.51%, but the traffic that arrives converts hard. ChatGPT referral traffic converts at 1.81% vs 1.39% for non-branded organic — a roughly 30% higher rate, with AI visitors generating 10.3% higher revenue per session. ChatGPT sessions grew 1,079% across 94 ecommerce brands in 2025. For e-commerce clients, the conversion premium is the QBR centerpiece.
Professional services and accounting sit at 1.09% AI referral traffic, right at the cross-industry average. Insurance, mortgage, HVAC, dental, and law — the local-service-heavy verticals — aren’t broken out separately in Conductor’s index, but BrightEdge documents that ChatGPT pushes users toward aggregators while Google points to directories and provider pages. An HVAC client cited on Yelp/Angi will look strong on ChatGPT; a dental client cited on Healthgrades and ADA pages will look strong on Google AI Overviews. Composite scoring averages those into mush.
The Fitness Wearables Study is the only completed GenPicked Bradley-Terry vertical study to date. The vertical numbers above come from Conductor, BrightEdge, Ahrefs, Semrush, Discovered Labs, and ALM Corp. What GenPicked’s framework adds is the methodology for converting these published patterns into actionable per-vertical Bradley-Terry rankings — the roadmap from here.
The Fitness Wearables worked example — what vertical-level engine bias looks like measured properly
The GenPicked Research Team (2026) Fitness Wearables Study is the methodology blueprint: Bradley-Terry pairwise rankings across four models (GPT-5, Claude 4, Gemini 2.5, DeepSeek V3), blind prompts with Latin Square position-bias control, and a sycophancy diagnostic comparing blind vs named win rates with 95% confidence intervals.
At the category level: Oura 1st (BT 1.82, 95% CI [1.71, 1.94]), Whoop 2nd (1.44, [1.29, 1.58]), Garmin 3rd (0.92, [0.78, 1.07]), Apple Watch 4th (0.61, [0.43, 0.80]), Fitbit 5th (0.21, [0.02, 0.41]). Oura’s CI does not overlap Whoop’s — statistically meaningful. Apple Watch’s and Fitbit’s intervals nearly meet (lower bound 0.43 vs upper bound 0.41) — too close to treat as a confident ordinal difference. The lesson: rankings without intervals lie about ties.
The per-model split is where the per-engine story lives. Oura ranks #1 on GPT-5 (BT 1.91), #1 on Claude 4 (1.74), #2 on Gemini 2.5 (1.48), #3 on DeepSeek V3 (1.12). The aggregate would have called Oura “first.” The per-engine map says: fix DeepSeek. The entire argument of this post in four per-engine rankings.
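The mechanics behind a Bradley-Terry ranking are worth seeing once. A sketch under stated assumptions: the pairwise win counts below are hypothetical (not the study’s raw data), and this uses the classic minorize-maximize (MM) update rather than whatever exact estimator GenPicked runs.

```python
import math

# Hypothetical pairwise win counts from blind prompts (NOT the study's data).
# wins[i][j] = times brand i beat brand j across a balanced round-robin.
brands = ["Oura", "Whoop", "Garmin", "Apple Watch", "Fitbit"]
wins = [
    [0, 14, 17, 19, 22],
    [10, 0, 15, 17, 20],
    [7, 9, 0, 14, 18],
    [5, 7, 10, 0, 15],
    [2, 4, 6, 9, 0],
]

def bradley_terry(wins, iters=200):
    """Fit Bradley-Terry strengths via the standard MM update:
    p_i <- W_i / sum_j( n_ij / (p_i + p_j) ), then renormalize."""
    n = len(wins)
    p = [1.0] * n
    for _ in range(iters):
        new_p = []
        for i in range(n):
            total_wins = sum(wins[i][j] for j in range(n) if j != i)
            denom = sum((wins[i][j] + wins[j][i]) / (p[i] + p[j])
                        for j in range(n) if j != i)
            new_p.append(total_wins / denom)
        # Fix the scale: normalize so the geometric mean of strengths is 1.
        g = math.exp(sum(math.log(x) for x in new_p) / n)
        p = [x / g for x in new_p]
    return p

strengths = bradley_terry(wins)
ranking = sorted(zip(brands, strengths), key=lambda t: -t[1])
```

Confidence intervals in practice come from bootstrapping the pairwise comparisons and refitting, which is why every ranked score in the study carries a 95% CI.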
The sycophancy diagnostic adds a second dimension. Oura’s blind-prompt win rate is 0.76 [0.71, 0.81]; the named-prompt win rate is 0.94 [0.91, 0.96] — a +0.18 uplift, “highly reactive.” Reading: Oura’s unaided AI awareness is softer than its aided awareness. This tells an agency whether a client’s visibility is structural or dependent on a friendly prompt.
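The uplift computation itself is simple. A sketch, assuming a normal-approximation interval (the study’s exact interval method isn’t specified here) and hypothetical trial counts chosen only to land near the reported 0.76 / 0.94 rates:

```python
import math

def win_rate_ci(wins, trials, z=1.96):
    """Win rate with a normal-approximation 95% CI -- a generic sketch,
    not necessarily the study's exact interval construction."""
    p = wins / trials
    se = math.sqrt(p * (1 - p) / trials)
    return p, max(0.0, p - z * se), min(1.0, p + z * se)

# Hypothetical counts: 304/400 blind wins, 376/400 named wins.
blind = win_rate_ci(304, 400)   # point estimate 0.76
named = win_rate_ci(376, 400)   # point estimate 0.94
uplift = named[0] - blind[0]    # +0.18: "highly reactive" to brand anchoring
```

The diagnostic flag is the gap between the two point estimates after confirming the intervals don’t overlap; a brand whose blind and named intervals overlap isn’t meaningfully prompt-dependent.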
The 6.7× Claude-vs-GPT-5 reactivity differential is what makes model-split reporting non-negotiable. A brand fine on the composite can still be vulnerable on Claude specifically. Methodology details are in the GenPicked Academy lesson “What Valid AEO Data Actually Looks Like.”
Strategy by engine — what to actually do per client
Each engine block below maps to the verticals it most affects. The point: stop running one AEO motion across five engines and twelve clients.
Perplexity — Reddit-first
Reddit is 46.7% of Perplexity’s top citations (Discovered Labs). For B2B SaaS, e-commerce, HVAC, dental, and MSP/IT clients, authentic Reddit presence in vertical subreddits is the highest-leverage play. The trap: Reddit citation share dropped 23% in one month on Conductor’s index (Oct–Nov 2025). Plan for volatility; don’t make Reddit the single point of dependence.
ChatGPT — Wikipedia + YouTube + earned media
Wikipedia is 47.9% of ChatGPT’s top citations; YouTube mentions correlate 0.737 with citation likelihood — the strongest single signal in Ahrefs’ 75K-brand study. For healthcare, wealth management, accounting, and law clients, earn legitimate Wikipedia presence and amplify YouTube. Brands with a Wikipedia article score 3.6× higher on AI visibility (Loamly).
Claude — long-form blog + brand-mention-rich content
Claude favors blogs at 43.8% of top citations and is 6.7× more reactive to named-prompt anchoring than GPT-5 (GenPicked 2026). For B2B SaaS, MSP/IT, and wealth management clients publishing long-form thought leadership, prioritize publisher placements and expert-quote-rich content. The Princeton/IIT Delhi GEO study cited by ZipTie documents +28.9% AI citation lift from expert quotes; 15+ named-entity density delivers 4.8× selection probability.
Gemini and Google AI Mode — YouTube + first-party + Medium
YouTube is 9.51% of AI Mode citations; only 13.7% of citations overlap between AI Mode and AI Overviews. For real estate, dental, HVAC, and accounting clients, treat YouTube as a tier-1 channel. The Gemini 3 January 27, 2026 rollout replaced 42% of previously cited domains. Agencies that don’t monitor regime changes lose retainers.
Google AI Overviews — E-E-A-T + FAQ schema + top-10 organic
Cited brands earn 35% more organic clicks and 91% more paid clicks (Seer Interactive, 25.1M impressions). FAQPage schema pages are 3.2× more likely to appear in AI Overviews. For insurance, mortgage, healthcare, and dental clients, FAQ schema everywhere is the free baseline. Domain Rating explains less than 4% of citation variance per ZipTie; pages ranking #6–#10 with strong E-E-A-T are cited 2.3× more than #1-ranked pages with weak E-E-A-T.
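FAQPage is a standard schema.org type, so the baseline is cheap to ship. A minimal JSON-LD example — the question and answer text are placeholders, not recommended copy:

```json
{
  "@context": "https://schema.org",
  "@type": "FAQPage",
  "mainEntity": [
    {
      "@type": "Question",
      "name": "Does homeowners insurance cover water damage?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "Sudden, accidental water damage is typically covered; gradual leaks and flood damage usually are not."
      }
    }
  ]
}
```

One block per FAQ page, embedded in a `script type="application/ld+json"` tag, with answers that match the visible on-page text.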
The measurement principle — five non-negotiables for valid AEO output
What separates valid AEO measurement from vendor-dashboard theater? Five principles, from GenPicked Academy Module 5 and consistent with pairwise-ranking literature (LMSYS Chatbot Arena, Princeton/IIT Delhi GEO study, SparkToro).
A number without a confidence interval is a claim without evidence. Off-the-shelf dashboards strip the uncertainty and call the residual a “score.” SE Ranking’s 300K-domain llms.txt study is the cautionary example: a vendor lever an agency could spend a quarter implementing, with zero correlation to AI citation outcomes. Methodology hygiene separates agencies that keep retainers from those building on phantom signals.
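Attaching an interval to a visibility number doesn’t require vendor tooling. A generic percentile-bootstrap sketch — the prompt count and citation outcomes below are hypothetical, and this is not GenPicked’s exact resampling procedure:

```python
import random

def bootstrap_ci(samples, stat, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for any statistic -- a generic sketch."""
    rng = random.Random(seed)
    boot = sorted(
        stat([rng.choice(samples) for _ in samples]) for _ in range(n_boot)
    )
    lo = boot[int(alpha / 2 * n_boot)]
    hi = boot[int((1 - alpha / 2) * n_boot) - 1]
    return stat(samples), lo, hi

# Hypothetical audit: 1 = client cited in an AI answer, 0 = not, over 200 prompts.
samples = [1] * 58 + [0] * 142
mean = lambda xs: sum(xs) / len(xs)
point, lo, hi = bootstrap_ci(samples, mean)
# Report "29% [lo, hi]" to the client, never a bare 29%.
```

The same helper works for any per-engine statistic in the audit: citation rate, share of voice, or win rate.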
What agencies should do this quarter
Pick the single client whose vertical shows the highest documented engine divergence (healthcare or B2B tech is a safe start) and run the per-engine ACS this week. Use it as the case study for the next three retainer pitches. The QBR conversation will change before the next billing cycle.
The agency moat for 2026
The agencies winning AEO retainers in 2026 are the ones whose dashboards split by engine, report confidence intervals, and tell the client honestly which model is moving the needle. Three retainer products are already monetizable: quarterly per-engine ACS audits with Bradley-Terry rankings and 95% CIs; vertical-bias playbooks tracking citation-diet shifts and regime changes; and sycophancy diagnostic reports surfacing unaided-awareness vulnerabilities.
The market is paying for this work. Profound raised $96M Series C at a $1B valuation in February 2026, with 10%+ of the Fortune 500 as customers. The 6sense 2025 Buyer Experience Report (4,510 buyers) shows 94% of B2B buyers now use LLMs during purchasing. The category is funded, buyers are using AI engines, engine bias by vertical is documented. The piece that hasn’t shipped at industry scale is the agency reporting layer that treats each engine as a separate measurement. That’s the moat.
Start your 14-day free trial
Growth plan free for 14 days. Five AI engines. Full agency dashboard.
Start free trial