AI Engine Bias by Industry: How 5 Major AI Engines Differ Across 12 Agency Verticals (GenPicked Study)

The client emails on a Monday morning: “Did our AEO work this quarter?” The agency dashboard shows a clean “+4 points” on AI visibility. But that composite number averages five engines that disagreed on 81% of cited domains, and it hid a 23% swing in Reddit citation share between October and November 2025 (Conductor 2026 AEO/GEO Benchmarks). The number went up. The strategic finding moved the other way.

Across the same queries, only 11% of cited domains overlap between any two AI engines, and 71% of cited sources appear on only one platform (Discovered Labs). ChatGPT and Gemini agree on brand citations only 19% of the time across 2,089 tracked brands (Loamly). Google’s AI Mode and AI Overviews overlap on only 13.7% of URLs (Semrush). A composite “AI visibility” number averages over those disagreements and tells an agency almost nothing about where to spend the next dollar.

This post is for agency owners running portfolios across dental, law, HVAC, healthcare, insurance, mortgage, real estate, e-commerce, accounting, B2B SaaS, MSP/IT, and wealth management. It lays out what the published cross-engine and per-industry data shows, every number traceable to a real URL, and uses the GenPicked Research Team (2026) Fitness Wearables Study as the worked methodology example. Honest disclosure: the 12-vertical Bradley-Terry numbers are not yet a completed GenPicked study — this extends the published framework.

Start your 14-day free trial

Growth plan free for 14 days. Five AI engines. Full agency dashboard.

The “single score lies” problem — why averaging hides the strategic finding

Brand-mention rates and citation behaviors vary so dramatically by engine that aggregate scores misrepresent what is happening. The GenPicked Research Team (2026) Fitness Wearables Study documented Claude as 6.7× more reactive to brand anchoring than GPT-5 — meaning when a brand is explicitly named in the prompt, Claude’s win-rate uplift dwarfs GPT-5’s. Two clients with identical composite scores can be in completely different strategic positions: one with strong unaided awareness across engines, one whose “visibility” collapses the moment a competitor is named in the prompt.

The Loamly 2,089-brand analysis sets the market context: 77% of brands are completely invisible to ChatGPT, and the visible minority converts AI-sourced traffic at roughly three times the rate of non-branded Google organic. If a client wins on ChatGPT and loses on Claude, the average describes no actual model: the statistical equivalent of the household with 1.8 children.

• 11% domain overlap between any two AI engines
• 71% of cited sources appear on only one platform
• 6.7× Claude-vs-GPT-5 brand-anchor reactivity (GenPicked 2026)

Brand mentions correlate 0.664 with AI visibility versus 0.218 for backlinks across Ahrefs’ 75,000-brand analysis; YouTube mentions correlate 0.737, the strongest single factor in that dataset. Composite scores that don’t split out where mentions come from miss the lever that moves the needle.

Five engines, five citation diets — the cross-cutting biases

Before industry-specific bias, agencies need the per-engine source-diet differences. These cross-cutting patterns compound on top of vertical behavior. Every number is from a named third-party study.

01
ChatGPT — Wikipedia + brand mentions

Wikipedia is 47.9% of ChatGPT’s top citations. Reddit’s citation share was ~60% in early August 2025, then collapsed to ~10% by mid-September after an OpenAI retrieval change.

02
Perplexity — Reddit-first

Reddit accounts for 46.7% of Perplexity’s top 10 citations — roughly 3× the share of YouTube (its #2 source at 13.9%).

03
Gemini / AI Mode — YouTube + first-party

YouTube secures 9.51% of citations (961,938 mentions) in AI Mode. Only 13.7% of citations overlap with AI Overviews.

04
Google AI Overviews — YouTube + Reddit

YouTube is 23.3% of AI Overview citations. AIO triggers on 25.11% of all Google searches.

05
Claude — blog-dominant + precision-cautious

Claude favors blogs at 43.8% of top citations and is 6.7× more brand-anchor reactive than GPT-5 (GenPicked 2026).

Reddit accounts for ~40.1% of all LLM citations across major engines (Semrush, 150K citations, June 2025) — but that share is unstable. Top 5 domains command 38% of all AI citations; top 20 command 66% (Ahrefs). The Gemini 3 rollout on January 27, 2026 replaced 42% of previously cited domains and pulled 32% more sources per response. Agencies optimizing for the average miss every regime change.
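To make regime changes like the Gemini 3 reset visible rather than anecdotal, track citation-set churn per engine: pull the set of domains cited for the same query panel before and after a model release and compare. A minimal sketch in Python, with hypothetical domain lists (the “42% of domains replaced” figure above is the kind of number this check surfaces):

```python
# Citation snapshots: sets of domains cited for the same query panel
# before and after a model release. Domain lists are hypothetical.
before = {"reddit.com", "wikipedia.org", "youtube.com", "healthline.com", "g2.com"}
after  = {"reddit.com", "youtube.com", "nerdwallet.com", "forbes.com", "g2.com"}

replaced   = before - after                       # domains that lost their citations
retained   = before & after
churn_rate = len(replaced) / len(before)          # the "42% replaced" style of metric
jaccard    = len(retained) / len(before | after)  # overall cross-snapshot overlap

print(f"churn: {churn_rate:.0%}, retained: {sorted(retained)}")
print(f"Jaccard overlap: {jaccard:.2f}")
```

Run it weekly per engine and alert on churn above a client-agreed threshold; that is the difference between noticing a regime change in the churn report and discovering it in the QBR.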

Key insight

A wealth-management client cannot be optimized for Perplexity without acknowledging that Reddit is 46.7% of Perplexity’s citation diet — even if Reddit feels off-brand for RIA compliance. Industry bias rides on top of these source-diet biases; it doesn’t replace them.

The 12 agency verticals — what the published data actually says

GenPicked supports 12 vertical playbooks across the agency portfolio. Honest disclosure: a 12-vertical Bradley-Terry study is not yet completed — the Fitness Wearables Study is the methodology proof. The vertical numbers below come from named third-party research; GenPicked’s framework adds the scaffolding for converting them into per-vertical rankings.

Healthcare shows the highest AI Overview trigger rate of any vertical: 48.75%, with AI referral traffic at 0.87% (below the 1.08% cross-industry average) due to YMYL caution. ChatGPT-vs-Google-AI brand divergence is 62% in healthcare — the highest BrightEdge documented anywhere. Healthcare clients on a single composite score are flying blind on the engine split that defines their category.

Financial services and wealth management trigger AI Overviews at 25.79%, but finance has the lowest top-10 citation overlap of any tracked industry — roughly 11.3%, meaning ~89% of finance citations come from outside Google’s first page (ALM Corp). For an RIA client, Google rank and AI citation are not the same workstream.

Real estate sits at the opposite end: 4.48% AI Overview trigger rate — the lowest of analyzed industries. Zillow dominates brand mentions; Hines and Public Storage lead citations. For a brokerage client, the share-of-voice game is against Zillow, not the next brokerage down the street.

Information Technology and B2B SaaS drive the heaviest AI referral traffic: 2.8% — 2.6× the cross-industry average. ChatGPT-vs-Google divergence in B2B tech is 47%, second only to healthcare. SaaS citations skew toward G2 and Reddit, making the “Reddit-first for Perplexity” play disproportionately strategic for SaaS clients.

Consumer discretionary and e-commerce have the most actionable agency story. Transactional queries trigger AI Overviews at only 8.51%, but the traffic that arrives converts hard. ChatGPT referral traffic converts at 1.81% vs 1.39% for non-branded organic, a 31% higher rate, with AI visitors generating 10.3% higher revenue per session. ChatGPT sessions grew 1,079% across 94 e-commerce brands in 2025. For e-commerce clients, the conversion premium is the QBR centerpiece.

Professional services and accounting sit at 1.09% AI referral traffic, right at the cross-industry average. Insurance, mortgage, HVAC, dental, and law — the local-service-heavy verticals — aren’t broken out separately in Conductor’s index, but BrightEdge documents that ChatGPT pushes users toward aggregators while Google points to directories and provider pages. An HVAC client cited on Yelp/Angi will look strong on ChatGPT; a dental client cited on Healthgrades and ADA pages will look strong on Google AI Overviews. Composite scoring averages those into mush.

Key insight

The Fitness Wearables Study is the only completed GenPicked Bradley-Terry vertical study to date. The vertical numbers above come from Conductor, BrightEdge, Ahrefs, Semrush, Discovered Labs, and ALM Corp. What GenPicked’s framework adds is the methodology for converting these published patterns into actionable per-vertical Bradley-Terry rankings — the roadmap from here.

The Fitness Wearables worked example — what vertical-level engine bias looks like measured properly

The GenPicked Research Team (2026) Fitness Wearables Study is the methodology blueprint: Bradley-Terry pairwise rankings across four models (GPT-5, Claude 4, Gemini 2.5, DeepSeek V3), blind prompts with Latin Square position-bias control, and a sycophancy diagnostic comparing blind vs named win rates with 95% confidence intervals.

At the category level: Oura 1st (BT 1.82, 95% CI [1.71, 1.94]), Whoop 2nd (1.44, [1.29, 1.58]), Garmin 3rd (0.92, [0.78, 1.07]), Apple Watch 4th (0.61, [0.43, 0.80]), Fitbit 5th (0.21, [0.02, 0.41]). Oura’s interval does not overlap Whoop’s, so that gap is statistically meaningful. Apple Watch’s lower bound (0.43) and Fitbit’s upper bound (0.41) sit a hair apart: a borderline separation, not a confident ordinal gap. The lesson: rankings without intervals hide ties and near-ties.

The per-model split is where the per-engine story lives. Oura ranks #1 on GPT-5 (BT 1.91), #1 on Claude 4 (1.74), #2 on Gemini 2.5 (1.48), #3 on DeepSeek V3 (1.12). The aggregate would have called Oura “first.” The per-engine map says: fix DeepSeek. That is the entire argument of this post, compressed into five rankings: one aggregate, four per-model.

The sycophancy diagnostic adds a second dimension. Oura’s blind-prompt win rate is 0.76 [0.71, 0.81]; the named-prompt win rate is 0.94 [0.91, 0.96] — a +0.18 uplift, “highly reactive.” Reading: Oura’s unaided AI awareness is softer than its aided awareness. This tells an agency whether a client’s visibility is structural or dependent on a friendly prompt.
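The uplift itself is simple arithmetic; the discipline is putting intervals on both rates before labeling a brand “reactive.” A minimal sketch of the blind-vs-named diagnostic using Wilson score intervals. The trial counts below are illustrative stand-ins chosen to reproduce the study’s published 0.76 and 0.94 rates, not the study’s actual sample sizes:

```python
import math

def wilson_ci(wins: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval for a binomial win rate."""
    p = wins / trials
    denom = 1 + z**2 / trials
    center = (p + z**2 / (2 * trials)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2))
    return center - half, center + half

# Illustrative counts that reproduce the published rates (0.76 blind, 0.94 named);
# the study's real trial counts are not restated here.
blind_rate, named_rate = 190 / 250, 235 / 250
print("blind  0.76, 95%% CI [%.2f, %.2f]" % wilson_ci(190, 250))
print("named  0.94, 95%% CI [%.2f, %.2f]" % wilson_ci(235, 250))
print(f"sycophancy uplift: +{named_rate - blind_rate:.2f}")  # +0.18 -> highly reactive
```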

The 6.7× Claude-vs-GPT-5 reactivity differential is what makes model-split reporting non-negotiable. A brand fine on the composite can still be vulnerable on Claude specifically. Methodology details are in the GenPicked Academy lesson “What Valid AEO Data Actually Looks Like.”
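For agencies that want to reproduce the ranking mechanics rather than take a dashboard’s word for it, the Bradley-Terry fit is small enough to sketch. Below is the standard MM (Zermelo) iteration run on simulated pairwise win counts; this is a methodology illustration under stated assumptions, not GenPicked’s pipeline, and the matchup data is synthetic:

```python
import numpy as np

def fit_bradley_terry(wins: np.ndarray, iters: int = 200) -> np.ndarray:
    """MM (Zermelo) iteration for Bradley-Terry strengths.
    wins[i, j] = number of times brand i beat brand j in blind matchups."""
    n = wins.shape[0]
    p = np.ones(n)
    for _ in range(iters):
        new_p = np.empty(n)
        for i in range(n):
            denom = sum((wins[i, j] + wins[j, i]) / (p[i] + p[j])
                        for j in range(n) if j != i)
            new_p[i] = wins[i].sum() / denom
        p = new_p * n / new_p.sum()  # fix the scale so strengths average 1.0
    return p

brands = ["Oura", "Whoop", "Garmin", "Apple Watch", "Fitbit"]
true_strength = np.array([1.8, 1.4, 0.9, 0.6, 0.2])  # loosely echoes the study
rng = np.random.default_rng(0)
wins = np.zeros((5, 5))
for i in range(5):
    for j in range(i + 1, 5):
        p_win = true_strength[i] / (true_strength[i] + true_strength[j])
        w = rng.binomial(40, p_win)  # 40 simulated blind matchups per pair
        wins[i, j], wins[j, i] = w, 40 - w

for brand, s in sorted(zip(brands, fit_bradley_terry(wins)), key=lambda t: -t[1]):
    print(f"{brand:<12} BT {s:.2f}")
```

A production version would add the 95% CIs, for example by bootstrap-resampling the matchups and refitting; overlapping intervals get reported as ties, per principle 01 below.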

Strategy by engine — what to actually do per client

Each engine block below maps to the verticals it most affects. The point: stop running one AEO motion across five engines and twelve verticals.

Perplexity — Reddit-first

Reddit is 46.7% of Perplexity’s top citations (Discovered Labs). For B2B SaaS, e-commerce, HVAC, dental, and MSP/IT clients, authentic Reddit presence in vertical subreddits is the highest-leverage play. The trap: Reddit citation share dropped 23% in one month on Conductor’s index (Oct–Nov 2025). Plan for volatility; don’t make Reddit the single point of dependence.

ChatGPT — Wikipedia + YouTube + earned media

Wikipedia is 47.9% of ChatGPT’s top citations; YouTube mentions correlate 0.737 with citation likelihood — the strongest single signal in Ahrefs’ 75K-brand study. For healthcare, wealth management, accounting, and law clients, earn legitimate Wikipedia presence and amplify YouTube. Brands with a Wikipedia article score 3.6× higher on AI visibility (Loamly).

Claude — long-form blog + brand-mention-rich content

Claude favors blogs at 43.8% of top citations and is 6.7× more reactive to named-prompt anchoring than GPT-5 (GenPicked 2026). For B2B SaaS, MSP/IT, and wealth management clients whose programs lean on long-form thought leadership, prioritize publisher placements and expert-quote-rich content. The Princeton/IIT Delhi GEO study cited by ZipTie documents a +28.9% AI citation lift from expert quotes; a density of 15+ named entities delivers 4.8× selection probability.

Gemini and Google AI Mode — YouTube + first-party + Medium

YouTube is 9.51% of AI Mode citations; only 13.7% of citations overlap between AI Mode and AI Overviews. For real estate, dental, HVAC, and accounting clients, treat YouTube as a tier-1 channel. The Gemini 3 January 27, 2026 rollout replaced 42% of previously cited domains. Agencies that don’t monitor regime changes lose retainers.

Google AI Overviews — E-E-A-T + FAQ schema + top-10 organic

Cited brands earn 35% more organic clicks and 91% more paid clicks (Seer Interactive, 25.1M impressions). FAQPage schema pages are 3.2× more likely to appear in AI Overviews. For insurance, mortgage, healthcare, and dental clients, FAQ schema everywhere is the free baseline. Domain Rating explains less than 4% of citation variance per ZipTie; pages ranking #6–#10 with strong E-E-A-T are cited 2.3× more than #1-ranked pages with weak E-E-A-T.
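FAQPage schema is the one item in this section an agency can ship in a sprint. A minimal JSON-LD sketch, assembled in Python for illustration; the structure follows schema.org/FAQPage, and the dental-client questions and answer copy are hypothetical placeholders:

```python
import json

# Hypothetical FAQ copy for a dental client; swap in real, reviewed answers.
faq_schema = {
    "@context": "https://schema.org",
    "@type": "FAQPage",
    "mainEntity": [
        {
            "@type": "Question",
            "name": "How much does a dental implant cost?",
            "acceptedAnswer": {
                "@type": "Answer",
                "text": "A single implant typically includes the post, abutment, "
                        "and crown; coverage varies by insurance plan.",
            },
        },
        {
            "@type": "Question",
            "name": "Does getting a dental implant hurt?",
            "acceptedAnswer": {
                "@type": "Answer",
                "text": "The procedure is done under local anesthetic; most "
                        "patients report mild soreness for a few days.",
            },
        },
    ],
}

# Embed the output in the page head inside <script type="application/ld+json">.
print(json.dumps(faq_schema, indent=2))
```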

The measurement principle — five non-negotiables for valid AEO output

What separates valid AEO measurement from vendor-dashboard theater? Five principles, from GenPicked Academy Module 5 and consistent with pairwise-ranking literature (LMSYS Chatbot Arena, Princeton/IIT Delhi GEO study, SparkToro).

01
Confidence intervals on every ranking

Bradley-Terry MLE with 95% CIs. Overlapping intervals are ties, not ordinal differences.

02
Split by model, never averaged

The 6.7× Claude-vs-GPT-5 reactivity differential proves model-split reporting is non-negotiable.

03
Variance that makes sense

Identical rankings across 4 models = pipeline collapse. Wildly divergent rankings = position-bias problem.

04
Sycophancy uplift as diagnostic

Blind vs named win-rate delta tells you how dependent your client’s visibility is on unaided awareness.

05
Method scaffolding published

Models tested, pairwise count, question categories, inter-model agreement. Off-the-shelf dashboards rarely report this.

A number without a confidence interval is a claim without evidence. Off-the-shelf dashboards strip the uncertainty and call the residual a “score.” SE Ranking’s 300K-domain llms.txt study shows why the hygiene matters: llms.txt is a vendor-promoted lever an agency could spend a quarter implementing, and the study found zero correlation with AI citation outcomes. Methodology hygiene separates agencies that keep retainers from those building on phantom signals.
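Principle 03 (variance that makes sense) is also cheap to automate. A sketch using Kendall’s tau to compare per-model rankings pairwise; the BT scores loosely echo the worked example, and the 0.99 and 0.3 alarm thresholds are illustrative assumptions, not published cutoffs:

```python
from itertools import combinations
from scipy.stats import kendalltau

# Per-model BT scores for the same five brands (Oura, Whoop, Garmin,
# Apple Watch, Fitbit). Illustrative numbers only.
rankings = {
    "GPT-5":       [1.91, 1.40, 0.95, 0.60, 0.20],
    "Claude 4":    [1.74, 1.50, 0.60, 0.90, 0.25],
    "Gemini 2.5":  [1.48, 1.60, 0.85, 0.70, 0.30],
    "DeepSeek V3": [1.12, 1.30, 1.25, 0.55, 0.35],
}

for (m1, s1), (m2, s2) in combinations(rankings.items(), 2):
    tau, _ = kendalltau(s1, s2)  # rank agreement between two models
    flag = ""
    if tau > 0.99:
        flag = "  <- suspicious: possible pipeline collapse"
    elif tau < 0.3:
        flag = "  <- suspicious: check position-bias controls"
    print(f"{m1:<11} vs {m2:<11}  tau = {tau:+.2f}{flag}")
```

Healthy output is moderate, positive agreement on every pair: family resemblance without identical orderings.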

What agencies should do this quarter

  • Run a per-engine ACS baseline for every client this month.
    ChatGPT 0.35 + Perplexity 0.25 + Gemini 0.25 + Claude 0.15 weighted, but report the per-engine breakdown to the client (a minimal scoring sketch follows this list). This single change reframes the QBR conversation.
  • Map each client’s vertical to its dominant engine bias.
    Healthcare client → the 62% ChatGPT-vs-Google divergence is the QBR story. E-commerce client → the 31% ChatGPT conversion premium is the QBR story.
  • Stop reporting a single composite “AI visibility” number.
    Replace with (a) Bradley-Terry blind ranking with 95% CIs per engine; (b) sycophancy uplift; (c) per-engine source-diet share — Reddit %, Wikipedia %, YouTube %, blog %.
  • Re-prioritize earned media by per-engine source diet.
    Wikipedia for ChatGPT-heavy clients. Reddit for Perplexity-heavy verticals. YouTube for Gemini and AI Mode. Long-form blog placements for Claude.
  • Add FAQPage schema everywhere — a free 3.2× AIO lift.
    Only 12.4% of websites have FAQ schema (Frase). The lift is real; the implementation cost is one sprint.
  • For YMYL verticals, build dual-track content strategies.
    Institutional-first track (Wikipedia, .gov, .edu, hospital/SEC sources) for ChatGPT and AI Overviews; Reddit-and-blog track for Perplexity and Claude. Expect ChatGPT-vs-Google divergence above 60% for healthcare and wealth management.
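To make the first and third items concrete, here is a minimal ACS-style sketch using the post’s weights; the per-engine scores are hypothetical client data. The point it demonstrates: two clients can land on nearly the same composite from opposite engine profiles, which is exactly what the composite alone conceals.

```python
# ACS weights from the post; per-engine scores (0-100) are hypothetical.
ACS_WEIGHTS = {"ChatGPT": 0.35, "Perplexity": 0.25, "Gemini": 0.25, "Claude": 0.15}

def acs_report(client: str, per_engine: dict[str, float]) -> float:
    """Weighted composite, always printed alongside the per-engine split."""
    composite = sum(ACS_WEIGHTS[e] * s for e, s in per_engine.items())
    print(f"\n{client}")
    for engine, score in sorted(per_engine.items(), key=lambda kv: -kv[1]):
        print(f"  {engine:<11} {score:5.1f}  (weight {ACS_WEIGHTS[engine]:.2f})")
    print(f"  {'composite':<11} {composite:5.1f}  <- never report this line alone")
    return composite

# Near-identical composites (~52), opposite strategic positions.
acs_report("Client A", {"ChatGPT": 78, "Perplexity": 30, "Gemini": 55, "Claude": 25})
acs_report("Client B", {"ChatGPT": 30, "Perplexity": 75, "Gemini": 52, "Claude": 70})
```

Client A’s number is carried by ChatGPT (Wikipedia and earned-media work); Client B’s is carried by Perplexity and Claude (Reddit and long-form blog work). Same composite, different next quarter.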
Do this

Pick the single client whose vertical shows the highest documented engine divergence (healthcare or B2B tech is a safe start) and run the per-engine ACS this week. Use it as the case study for the next three retainer pitches. The QBR conversation will change before the next billing cycle.

The agency moat for 2026

The agencies winning AEO retainers in 2026 are the ones whose dashboards split by engine, report confidence intervals, and tell the client honestly which model is moving the needle. Three retainer products are already monetizable: quarterly per-engine ACS audits with Bradley-Terry rankings and 95% CIs; vertical-bias playbooks tracking citation-diet shifts and regime changes; and sycophancy diagnostic reports surfacing unaided-awareness vulnerabilities.

The market is paying for this work. Profound raised $96M Series C at a $1B valuation in February 2026, with 10%+ of the Fortune 500 as customers. The 6sense 2025 Buyer Experience Report (4,510 buyers) shows 94% of B2B buyers now use LLMs during purchasing. The category is funded, buyers are using AI engines, engine bias by vertical is documented. The piece that hasn’t shipped at industry scale is the agency reporting layer that treats each engine as a separate measurement. That’s the moat.

Start your 14-day free trial

Growth plan free for 14 days. Five AI engines. Full agency dashboard.

GenPicked Research Team

Research

The GenPicked Research Team publishes original AEO measurement research (Bradley-Terry rankings, sycophancy diagnostics, and per-engine bias studies) built for marketing agencies running portfolios across multiple AI engines and verticals.

Credentials:

GenPicked Fitness Wearables Study (2026): methodology lead, ACS (AEO Citation Score) framework, Bradley-Terry pairwise ranking with 95% CIs, Latin Square position control, sycophancy diagnostic

Frequently Asked Questions

How do I know if my client's AI visibility is real or just an averaged number hiding model-specific weakness?

Insist on per-engine reporting with confidence intervals. If your dashboard shows one composite 'AI visibility' score with no CI and no per-engine breakdown, you are looking at an averaged number that is structurally unable to surface model-specific findings. The GenPicked Research Team (2026) Fitness Wearables Study documented Claude as 6.7x more reactive to brand anchoring than GPT-5, meaning a brand can look strong on average and still be vulnerable on a specific engine. Valid AEO output includes Bradley-Terry rankings with 95% CIs split by model, never averaged.

Does the Fitness Wearables Study generalize to my client's vertical?

The methodology generalizes. The specific Bradley-Terry numbers do not. GenPicked has not yet published vertical-level studies for dental, law, HVAC, healthcare, insurance, mortgage, real estate, e-commerce, accounting, B2B SaaS, MSP/IT, or wealth management. What does generalize is the measurement framework (blind ranking + Latin Square position control + sycophancy diagnostic) and the direction of per-engine bias documented in published third-party research (Conductor, BrightEdge, Ahrefs, Discovered Labs, Semrush). Use the framework now; the vertical-specific Bradley-Terry numbers will follow.

Which AI engine matters most for my client's vertical?

It depends on the client mix. ChatGPT drives roughly 87.4% of AI referral traffic on average (Conductor 2026), but GenPicked's ACS weights ChatGPT at 0.35 and Gemini/Perplexity at 0.25 each because traffic is not the only signal: brand-recommendation behavior and per-engine source diet matter too. Healthcare clients should treat the 62% ChatGPT-vs-Google AI Overview divergence (BrightEdge) as the QBR centerpiece. E-commerce clients should treat the 31% ChatGPT referral conversion premium (Yotpo/ALM) as the centerpiece. Run the ACS per-engine breakdown and let the data decide.

Is Reddit a reliable AEO investment given the September 2025 disruption?

Reddit remains the single largest AI citation source (~40.1% across major engines per Semrush, June 2025), but it is volatile. ChatGPT's Reddit citation share fell from ~60% to ~10% between early August and mid-September 2025 after OpenAI's retrieval change tied to Google's num=100 parameter removal (Loamly). Perplexity kept its Reddit share stable through the same window. Conductor's index showed Reddit citation share drop 23% in one month between October and November 2025. Reddit is essential for Perplexity-heavy verticals (B2B SaaS, e-commerce, MSP/IT), but plan for volatility and don't make it the single point of dependence.

What counts as 'valid AEO data' versus vendor-dashboard theater?

Five non-negotiables. (1) Confidence intervals on every ranking: Bradley-Terry MLE with 95% CIs, overlapping intervals treated as ties. (2) Split by model, never averaged across models. (3) Variance that makes sense: moderate cross-model variance with family resemblance, not identical rankings (pipeline collapse) or wildly divergent ones (position-bias problem). (4) Sycophancy uplift as a diagnostic: blind vs named win-rate delta tells you how dependent your brand's visibility is on unaided awareness. (5) Method scaffolding published: models tested, pairwise count, question categories, inter-model agreement rate. Off-the-shelf dashboards rarely report all five; valid reports always do.

How fast do AI engines actually pick up new content?

Latency varies by engine and source type. Perplexity is fastest (Reddit posts in 4-12 hours per Profound; news in 3-12 hours per Discovered Labs). Google AI Overviews follow Google's index speed (Wikipedia edits surface in ~12-36 hours; news on high-authority domains in 24-48 hours). ChatGPT is the slowest of the majors (Wikipedia edits in 3-6 days for 68% of cases per GenPicked tracking; Seer Interactive saw brand mentions appear within 2-7 days in 73% of test cases). Claude sits in the middle. Build retainer reporting cadences to match: weekly for Perplexity-heavy clients, monthly for ChatGPT-heavy clients.

Should I keep doing traditional SEO if brand mentions correlate 3x more strongly with AI visibility than backlinks?

Yes, but rebalance. Brand mentions correlate 0.664 with AI visibility vs 0.218 for backlinks across Ahrefs' 75K-brand study; YouTube mentions specifically correlate 0.737. ZipTie's research shows Domain Rating explains less than 4% of citation variance, while topical authority correlates 0.41 and pages ranking #6-#10 with strong E-E-A-T are cited 2.3x more than #1-ranked pages with weak E-E-A-T. The right reallocation: shift 30-50% of historical backlink budget toward earned mentions (YouTube, Reddit, blog placements, news), keep SEO as the foundation, and add FAQ schema (3.2x AI Overviews lift per Frase) as a free baseline.

What is the agency monetization angle for engine-bias-by-industry analysis?

Three retainer products. (1) Per-engine ACS quarterly audits: Bradley-Terry rankings with 95% CIs split by engine. (2) Vertical-bias playbooks: quarterly delivery of per-vertical citation-diet shifts (Reddit volatility, Gemini 3 reset events). (3) Sycophancy diagnostic reports: blind vs named uplift to identify which clients have unaided-awareness vulnerabilities. The market is paying for this: Profound raised $96M Series C at $1B valuation in February 2026, serving 10%+ of the Fortune 500. The agency tier of this work is the white-label retainer add-on.

Are pricing-tier AEO platforms all measuring the same thing?

No. Profound starts at $99/month (ChatGPT-only) and scales to $2,000-5,000+/month for enterprise multi-engine coverage. Peec AI ranges $105-235/month; Otterly from $29/month; Scrunch $250/month; AthenaHQ $295/month. Most vendor dashboards do not publish their measurement methodology: confidence intervals, model-split reporting, and sycophancy diagnostics are largely absent. GenPicked is built around the three-layer architecture (blind ranking + Latin Square + sycophancy) documented in Academy Module 5, and weights ChatGPT 0.35, Perplexity 0.25, Gemini 0.25, Claude 0.15 in the ACS aggregate while always splitting the per-engine view in client reporting.

What is the single most important change an agency should make this quarter?

Stop reporting a single 'AI visibility' score to clients. Replace it with the per-engine ACS breakdown (ChatGPT 0.35 / Perplexity 0.25 / Gemini 0.25 / Claude 0.15) plus per-engine source-diet share. That one change reframes the QBR from 'is the number up?' to 'which engine is moving, why, and what is next quarter's play?', which is the conversation that defends retainers against in-housing and against single-engine competitor pitches.

Get Your Brand's AEO Score

See how your brand is performing in AI search with our free AEO audit.

Start Your Free Audit
#aeo #geo #ai-engine-bias #original-research #agency-playbook #bradley-terry #per-engine-reporting #vertical-aeo