Calculating Your Diagnostic Metrics

In this lesson, you will learn how to turn your raw audit data into diagnostic metrics: mention rates, the sycophancy gap, pairwise win rates, Bradley-Terry scores, cross-model variance, and confidence bands. All in your spreadsheet. No statistics software required, no PhD required, just clear thinking and a willingness to follow a formula.

This is Lesson 6.4, the math chapter. But accessible math. If you can do average-of-a-column, you can do this. The goal is not to impress anyone with statistical sophistication. The goal is to extract findings from your data that survive basic scrutiny. The metrics in this lesson are the minimum viable toolkit, the ones you need before your audit report stops being vibes and starts being evidence.


Why these metrics

A raw spreadsheet of 120 rows is not findings. It's data. Findings come from aggregating the data along axes that answer the questions a reader will ask:

  • How often does the brand show up? (mention rate)
  • Does naming the brand inflate its apparent visibility? (sycophancy gap)
  • When compared head-to-head with competitors, does the brand win or lose? (pairwise win rate)
  • If we combine all the pairwise wins, what's the brand's overall ranking? (Bradley-Terry score)
  • Does the brand's visibility swing wildly day-to-day? (variance)
  • How confident can we be in our numbers? (confidence bands)

Each metric answers one question. Together, they give you an audit. Peikos (2024) is explicit about why the dashboard must be multi-metric rather than a single score: relevance and visibility are irreducibly multidimensional, and collapsing them to one number discards signal that moves independently.

Spreadsheet over statistics software

Everything in this lesson can be computed with standard spreadsheet formulas: COUNTIFS, AVERAGE, STDEV, SQRT, and basic arithmetic. You do not need R, Python, or a stats package. If your audit is small enough to run in a week, it is small enough to analyze in a spreadsheet.


Module 1, Setting up the analysis sheet

In your audit spreadsheet, add a second sheet called analysis. This is where the metrics live. Keep the raw data (responses sheet) pristine. Every aggregated number references back to the raw data via formulas, never copy-paste.

The analysis sheet sections

Create these sections as blocks on the analysis sheet:

  1. Section A, Mention rates (blind, named, by model)
  2. Section B, Sycophancy gap (named minus blind, per model and overall)
  3. Section C, Pairwise win rates (target brand vs. each competitor, per model)
  4. Section D, Bradley-Terry scores (derived from pairwise wins)
  5. Section E, Cross-model variance (stability of mention rate across models)
  6. Section F, Confidence bands (how precise are your estimates)

Work through them in this order. Each builds on the previous ones.


Module 2, Mention rates

The simplest metric and the most important single number. Mention rate is the percentage of times the brand appears in responses, filtered by question type and model.

The formulas

Assume your raw data is in the responses sheet, with columns as defined in Lesson 6.1. Column C is model, column G is question_type, column K is brand_mentioned (Y/N).

In the analysis sheet, create this table:

Model | Blind mention rate | Named mention rate | Comparison mention rate
ChatGPT | formula | formula | formula
Claude | formula | formula | formula
Gemini | formula | formula | formula
Perplexity | formula | formula | formula
Overall | formula | formula | formula

For ChatGPT blind mention rate, the formula is:

=COUNTIFS(responses!C:C,"chatgpt",responses!G:G,"blind",responses!K:K,"Y")/COUNTIFS(responses!C:C,"chatgpt",responses!G:G,"blind")

In plain English: count the rows where model is ChatGPT AND question type is blind AND brand was mentioned, divided by count of rows where model is ChatGPT AND question type is blind. Format the cell as a percentage.

Repeat the formula for each model × question type combination. For the Overall row, drop the model filter and count across all models.
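For example, the Overall blind mention rate simply drops the model criteria pair, assuming the same column layout as above:

=COUNTIFS(responses!G:G,"blind",responses!K:K,"Y")/COUNTIFS(responses!G:G,"blind")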

What to read

Look at the blind mention rate column. This is your single most important number. If the blind mention rate is 15% overall, your brand is essentially invisible at the category-question level: most buyers asking a category question of an AI will not encounter your brand. If it's 70%, you're one of the default recommendations.

Compare across models. If ChatGPT's blind mention rate is 60% and Gemini's is 15%, that is a four-to-one spread across models. That's a huge finding: it means AEO performance is completely different depending on which AI the buyer uses.


Module 3, The sycophancy gap

The sycophancy gap is the single most diagnostic number in the audit GenPicked Academy teaches. It is the difference between named and blind mention rates. See Blind vs. Named Measurement for why this matters.

The formula

sycophancy_gap = named_mention_rate - blind_mention_rate

For each model:

Model | Blind | Named | Sycophancy gap
ChatGPT | 60% | 95% | +35 pp
Claude | 15% | 90% | +75 pp
Gemini | 30% | 70% | +40 pp
Perplexity | 70% | 85% | +15 pp
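You can also compute a model's gap in one cell, straight from the raw data. A sketch for Claude, using the column layout from Lesson 6.1 (C = model, G = question_type, K = brand_mentioned):

=COUNTIFS(responses!C:C,"claude",responses!G:G,"named",responses!K:K,"Y")/COUNTIFS(responses!C:C,"claude",responses!G:G,"named")-COUNTIFS(responses!C:C,"claude",responses!G:G,"blind",responses!K:K,"Y")/COUNTIFS(responses!C:C,"claude",responses!G:G,"blind")

Format the cell as a percentage; a positive result means named prompts inflate the mention rate.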

How to read it

  • Small gap (<10 pp): Named and blind responses are similar. Either the brand is genuinely dominant in the category (if both rates are high) or the model is resistant to sycophancy (if both rates are moderate).
  • Large gap (>30 pp): The brand is much more visible "in name" than "in substance." Your measurements in any named-prompt tool are inflated.
  • Very large gap (>60 pp): The brand is borderline invisible at the category level but prompt-compliant at the named level. Every AEO tool using named prompts will report the brand as "visible", and they will all be wrong.

Cross-model comparison of the sycophancy gap is also diagnostic. In the fictitious table above, Claude has a 75-point gap while Perplexity has a 15-point gap. That is precisely the Model Susceptibility Spectrum in action: Claude is dramatically more reactive to prompt framing than Perplexity.

AEO claim, the sycophancy gap as the diagnostic number: Sharma et al. (2024) established that sycophancy is systematic across every major LLM and traces to RLHF preference optimization. Banks (2026) then quantified the magnitude in brand measurement: named prompts produced a 22.5 percentage point mention-rate inflation over blind prompts across 864 paired observations and four AI models. The magnitude of the gap in an individual audit is therefore a direct indicator of how much of the brand's reported "visibility" is prompt artifact versus organic presence.


Module 4, Pairwise win rates

Your comparison questions produce pairwise data, target brand vs. a specific competitor. Calculating win rates from pairwise comparisons is the foundation of Bradley-Terry ranking.

What counts as a win

For each comparison question, read the response and score it:

  • Target brand wins if the model frames it as the preferred option, more recommended, or better on the specific dimension asked about.
  • Competitor wins if the model frames the competitor as the preferred option.
  • Tie if the response is genuinely even-handed (both praised equally, both with matched caveats).

Add a column to your comparison-question rows: winner, with values target, competitor_name, or tie.

Be strict. If the model says "Oura is great for sleep tracking but Whoop is better for recovery" in a recovery-focused question, Whoop wins that comparison. Don't over-count ties.

The win rate formula

For each target-vs-competitor pair:

target_win_rate = target_wins / (target_wins + competitor_wins)

Ties can be scored as half-wins to each side, or excluded entirely. Either convention is defensible; just be consistent. In a small audit, excluding ties is cleaner.
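If you prefer the half-win convention, the adjusted formula is:

target_win_rate = (target_wins + 0.5 × ties) / (target_wins + competitor_wins + ties)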

In your analysis sheet, create a table:

Target vs. competitor | Target wins | Competitor wins | Target win rate
Oura vs. Whoop | 5 | 2 | 71%
Oura vs. Apple Watch | 3 | 4 | 43%
Oura vs. Garmin | 4 | 3 | 57%

Each row aggregates across both order variants (the Latin Square counterbalancing from Lesson 6.3) and across all four models. You can also break it down per model if your sample supports it.
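To fill in the wins columns from the raw data, count the winner column you added above. A sketch for the Oura vs. Whoop row, assuming the competitor name for each comparison row sits in column H and your winner column is column L (both placeholders; substitute your own columns):

Target wins: =COUNTIFS(responses!G:G,"comparison",responses!H:H,"whoop",responses!L:L,"target")
Whoop wins: =COUNTIFS(responses!G:G,"comparison",responses!H:H,"whoop",responses!L:L,"whoop")
Target win rate: =B16/(B16+C16), assuming the two counts land in B16 and C16.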


Module 5, Bradley-Terry scores, simplified

Bradley-Terry ranking is what LMSYS Chatbot Arena uses to rank AI models. It turns pairwise comparisons into a single ranked score. Chiang et al. (2024) showed that Bradley-Terry applied to crowd-sourced pairwise comparisons produces stable LLM rankings where absolute-score evaluation does not; the same logic applies to brands. See Bradley-Terry Ranking for the research foundation.

For a full Bradley-Terry implementation, you would use maximum likelihood estimation, iterative math that converges on each brand's "strength" parameter. For a first audit, a simplified version gets you 90% of the value with 10% of the work.

The simplified Bradley-Terry

  1. Compute the target brand's win rate against each competitor (you did this in Module 4).
  2. Compute each competitor's win rate against every other competitor in your audit. If you only have comparisons between target and competitors (not competitor-vs-competitor), you can infer ordering from win-rate magnitudes.
  3. Rank brands by their aggregate win rate, the sum (or average) of win rates across all opponents.
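In the sheet, that aggregate is a one-line formula per brand. For the target brand, assuming its win rates against each competitor sit in D16:D18 (placeholder cells; adjust to wherever your pairwise table lives):

=AVERAGE(D16:D18)

Brands with missing matchups simply average over the win rates they do have.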

Example

Using the fictitious table above, imagine you also ran Whoop vs. Apple Watch and got Whoop winning 60% of the time, and Apple Watch vs. Garmin with Apple Watch winning 55%. The simplified ranking:

  • Oura: won 71% vs. Whoop, 43% vs. Apple Watch, 57% vs. Garmin → average 57%
  • Whoop: won 29% vs. Oura, 60% vs. Apple Watch, unknown vs. Garmin → ~45% average on partial data
  • Apple Watch: won 57% vs. Oura, 40% vs. Whoop, 55% vs. Garmin → average 51%
  • Garmin: won 43% vs. Oura, unknown vs. Whoop, 45% vs. Apple Watch → ~44% average on partial data

Ranked: Oura (57%) > Apple Watch (51%) > Whoop (~45%) > Garmin (~44%). On partial data, Whoop and Garmin are effectively tied; a Whoop vs. Garmin comparison would be needed to separate them with any confidence.

This is a simplified Bradley-Terry. For most practitioner audits, it's enough. For more rigor, use R's BradleyTerry2 package or Python's choix library, but only if your audit grows past 5 brands and 30+ comparisons.

The pairwise case against "which is best"

You could ask the model directly: "Which fitness wearable is best?" That produces a volatile, single-shot ranking. Fishkin's (2026) SparkToro analysis found that the repeatability of such direct rankings is under 1% across identical prompts. Pairwise comparisons are much more stable because each comparison is a smaller, more constrained decision, which is exactly why Chatbot Arena chose the pairwise Bradley-Terry formulation rather than absolute Likert scoring. Aggregating many small, stable decisions into a ranking produces a more reliable final order than one large, unstable decision.

AEO claim, pairwise vs. absolute ranking stability: Chiang et al. (2024) demonstrated that Bradley-Terry ranking over pairwise comparisons produces stable LLM rankings where direct absolute scoring does not. Fishkin (2026) confirmed the same pattern in brand measurement: direct "rank these brands" prompts showed cited-source repeatability under 1% across identical runs in a 2026 SparkToro experiment. Pairwise comparisons aggregated via Bradley-Terry methods produce rankings that are dramatically more stable because each decision is smaller and less sensitive to random sampling variation.


Module 6, Cross-model variance

Variance tells you how much the brand's visibility swings across models. High variance is itself a finding: it means AEO performance is not uniform and the brand's strategy needs to be model-specific. Some of that variance is structural cross-model divergence; some is within-model non-determinism that Alexander (2026) showed persists even at temperature zero. Reporting both is what separates a credible audit from a point-estimate fantasy.

The formula

Take the blind mention rate per model (from Module 2). Compute the standard deviation across the four models.

=STDEV(chatgpt_blind_rate, claude_blind_rate, gemini_blind_rate, perplexity_blind_rate)

Also compute the range (max minus min):

=MAX(...) - MIN(...)
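If, as in the Module 8 walkthrough, the per-model blind rates sit in cells B2:B5 of the analysis sheet, the cell-referenced versions are:

=STDEV(B2:B5)
=MAX(B2:B5)-MIN(B2:B5)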

How to read it

  • Low variance (standard deviation < 10 pp): The brand performs similarly across models. Your AEO strategy can be model-agnostic.
  • Moderate variance (10-25 pp): Meaningful differences. Prioritize the models where visibility is weakest; don't assume wins in one model transfer to another.
  • High variance (> 25 pp): Dramatically different AEO performance by model. Each model effectively requires its own strategy, and any single-model measurement tool is misleading by design.

Cross-model variance is one of the most under-reported metrics in commercial AEO tools. Many tools aggregate across models and present a single "visibility score," hiding the variance that would tell you where to actually focus.


Module 7, Confidence bands

With 15 questions and 2 sampling passes, you have roughly 30 observations per model, and only around 10 per model × question-type combination. That's a small sample. Confidence bands tell your reader how much to trust the point estimate.

The back-of-the-envelope formula

For a binary outcome like mention rate, the standard error is approximately:

SE = SQRT(p × (1-p) / n)

Where p is the mention rate (as a decimal) and n is the number of observations.

A 95% confidence band is roughly p ± 2 × SE.

Example

ChatGPT blind mention rate is 60% (p = 0.60), computed from 10 observations (n = 10).

SE = SQRT(0.60 × 0.40 / 10) = SQRT(0.024) ≈ 0.155 = 15.5%
95% band: 60% ± 31% → [29%, 91%]

That's a very wide band. With only 10 observations, your point estimate is noisy. You can report it as "ChatGPT blind mention rate was 60%, but the 95% confidence band was [29%, 91%]; the estimate is imprecise with only 10 observations."

If you had 40 observations instead of 10:

SE = SQRT(0.60 × 0.40 / 40) = SQRT(0.006) ≈ 0.077 = 7.7%
95% band: 60% ± 15% → [45%, 75%]

Tighter, but still not point-precision. The practical takeaway: your audit is directional, not precise. Frame findings as "likely in this range," not "exactly this number." That's honest, and more useful to the reader.

How to report confidence bands

In your audit report, always report mention rates with their band:

ChatGPT blind mention rate: 60% [95% CI: 29%-91%, n=10]

The band protects you from overclaiming. It also protects the reader from overconfident conclusions. Everyone wins.


Module 8, The spreadsheet walkthrough

To consolidate everything, here's the actual sequence of cells to build in your analysis sheet. Follow along.

Step 1, Mention rate table

Cells A1:E6:

  • A1: header Model
  • B1: Blind rate
  • C1: Named rate
  • D1: Comparison rate
  • E1: n (blind)
  • Rows 2-5: one per model (chatgpt, claude, gemini, perplexity)
  • Row 6: Overall with aggregated formulas

Formulas use COUNTIFS against the responses sheet.

Step 2, Sycophancy gap

Cells A8:C13:

  • A8: header Model
  • B8: Gap (pp)
  • C8: Interpretation
  • Rows 9-12: one per model
  • Row 13: Overall

Formula: =C2-B2 (named minus blind, in percentage points).

Interpretation column: use nested IF logic (<10 pp = small, 10-30 pp = moderate, >30 pp = large).
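A sketch for the ChatGPT row, assuming the gap formula in B9 returns the raw decimal difference (0.35 for a 35-point gap):

=IF(B9<0.10,"small",IF(B9<=0.30,"moderate","large"))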

Step 3, Pairwise win rates

Cells A15:D18 (adjust for the number of competitors):

  • A15: header Matchup
  • B15: Target wins
  • C15: Competitor wins
  • D15: Target win rate

Formulas count winner column in the responses sheet, filtering by competitor name.

Step 4, Simplified BT ranking

Cells A20:B24:

  • A20: header Brand
  • B20: Avg win rate
  • Rows 21-24: target + competitors, ranked high-to-low

Step 5, Cross-model variance

Cells A26:B28:

  • A26: Blind rate variance (stdev)
  • A27: Blind rate range (max-min)
  • A28: interpretation (low / moderate / high)
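The interpretation cell can apply the Module 6 thresholds automatically, assuming the standard deviation (as a decimal) sits in B26:

=IF(B26<0.10,"low",IF(B26<=0.25,"moderate","high"))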

Step 6, Confidence bands

Cells A30:E34:

  • A30: Model
  • B30: Mention rate
  • C30: n
  • D30: 95% band low
  • E30: 95% band high

Formula uses the SE calculation from Module 7.
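Spelled out for the first model row (row 31), with the mention rate as a decimal in B31 and the observation count in C31:

=B31-2*SQRT(B31*(1-B31)/C31)    (band low, goes in D31)
=B31+2*SQRT(B31*(1-B31)/C31)    (band high, goes in E31)

If a bound falls below 0% or above 100%, clip it; a mention rate cannot leave that range.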

When complete, your analysis sheet is a one-page dashboard. That's your findings layer, the data you quote directly in the audit report in Lesson 6.5.


Exercise, build the analysis sheet

Using your raw data from Lesson 6.3:

  1. Create the analysis sheet in your audit spreadsheet.
  2. Build all six sections (mention rates, sycophancy gap, pairwise win rates, simplified BT ranking, variance, confidence bands).
  3. Verify: every number on the analysis sheet traces back to the raw data via formulas. If you change a cell in responses, the analysis updates automatically.
  4. In your notebook, write three sentences: what is the most surprising number on your analysis sheet? What does it tell you about the brand's AEO position? What follow-up question would you want to investigate next?

You now have the evidentiary foundation for the audit report.


Common mistakes

"I copy-pasted my aggregates instead of using formulas"

Your analysis is now frozen. When you add more data, you'll have to redo everything. Rebuild with formulas.

"I scored every response as positive to be nice to the brand"

Your sentiment data is useless. Re-score with strict criteria and a clear rationale.

"I skipped ties on pairwise questions"

If ties were common and you simply dropped them, your win rates rest on a much smaller sample and can overstate how decisive the comparisons were. Either include ties as half-wins, or note in the methodology that ties were excluded and report how many there were.

"My confidence bands are bigger than my point estimates"

That's honest. It means your sample is too small for precise claims. Report the range and note it as a limitation. Don't hide it.

"I reported cross-model variance as a single aggregated number"

Cross-model variance is most useful when reported as a table with per-model rates and an explicit range. Don't summarize it down to one number; the distribution is the finding.


Takeaways

  1. Mention rate and sycophancy gap are the two numbers you can't skip. Every other metric elaborates on these two.
  2. Pairwise beats absolute. Pairwise win rates aggregate into stable rankings; direct "rank these brands" prompts produce volatile noise.
  3. Report confidence bands, not point estimates. Your audit is directional; framing it honestly is more credible than false precision.

What's next

You have metrics. Now you write the report. Lesson 6.5, Writing the Audit Report, covers the report structure, executive summary conventions, how to present findings honestly, and how to position the report as a portfolio piece for your AEO Strategist career.

Reflection prompt

Look at your analysis sheet. If you could only show one number to your target brand's CMO, which one would it be? Why? Write a sentence in your notebook. That sentence is the seed of your executive summary.


About this course

This lesson is part of AEO A to Z, the open course on Answer Engine Optimization published by GenPicked Academy. GenPicked Academy is where practitioners learn to measure AI recommendations with the same rigor a clinical trial demands: blind sampling, balanced question sets, and confidence intervals that hold up.

About the author: Dr. William L. Banks III is the lead researcher at GenPicked Academy and the architect of the three-layer AEO measurement architecture taught in this course. His work on sycophancy, popularity bias, and construct validity in AI search informs every lesson you just read.

See the methods in practice: GenPicked runs monthly brand-intelligence audits using the exact pipeline taught in Module 6. Read the case studies and audit walkthroughs on the GenPicked blog.

Dr. William L. Banks III

Co-Founder, GenPicked
